
Midscene.js

Midscene.js is an open-source JavaScript SDK and framework for vision-driven UI automation across web, Android, iOS, and other interfaces. It uses visual-language models to localize and interact with UI purely from screenshots, lets you script automation in natural language, JavaScript, or YAML, integrates with Puppeteer/Playwright or device bridges, and provides developer features such as caching, visualized replay for debugging, an MCP server, and a zero-code browser extension.

Introduction

Overview

Midscene.js is an open-source JavaScript-first platform that drives UI automation through vision-based models. Rather than relying on DOM selectors or platform-specific element APIs, Midscene centers on screenshot-level localization powered by visual-language and agent models. This pure-vision approach makes it easier to automate across heterogeneous surfaces including web pages, native mobile apps (Android/iOS), canvas-based UIs, and even custom interfaces.

Key Features
  • Vision-first element localization: locate and interact with UI elements using screenshots and visual-language models instead of depending solely on DOM.
  • Natural-language scripting: describe goals and steps in plain language and let Midscene plan and operate the UI for you; also supports JavaScript SDK and YAML-based scripts.
  • Multi-platform support: integrates with Puppeteer and Playwright for web automation, supports adb-based Android control and WebDriverAgent for iOS; also offers a Bridge Mode for desktop browser control.
  • MCP (Model Context Protocol) server: exposes atomic Midscene agent actions as tools so higher-level agents can inspect and operate UIs through natural language.
  • Performance & efficiency: caching mechanisms to replay scripts faster and reduce costs; option to include DOM only when needed for extraction or page understanding.
  • Developer experience: visualized replay reports, built-in playgrounds for Android/iOS, a Chrome extension for zero-code experiences, and debugging utilities.
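
As one concrete illustration of the YAML scripting style, here is a minimal sketch of an automation flow. The URL, task name, and prompts are illustrative, and the `web`/`tasks`/`flow` schema follows Midscene's YAML documentation, which should be checked against the current release:

```yaml
# search-flow.yaml — hypothetical example flow
web:
  url: https://www.example.com   # illustrative target page

tasks:
  - name: search and verify
    flow:
      # Describe the goal in natural language; Midscene plans the steps.
      - aiAction: type "Midscene" into the search box and press Enter
      # Assert on what the model sees in the screenshot.
      - aiAssert: the results list contains at least one entry
```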
Model Strategy

Midscene embraces visual-language / foundation models (VL/FM) for UI understanding and localization. It supports both hosted and self-hosted models and lists compatibility with models such as Qwen-VL, UI-TARS and other VL models. For data extraction tasks you can optionally include DOM information; for action localization Midscene emphasizes a pure-vision route to reduce token usage and speed up runs.
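
Model selection is typically configured through environment variables. A hedged sketch follows; the variable names reflect Midscene's documentation at time of writing and should be verified, and the endpoint, key, and model id are placeholders:

```shell
# Point Midscene at an OpenAI-compatible endpoint serving a VL model.
export OPENAI_BASE_URL="https://your-endpoint.example/v1"  # placeholder
export OPENAI_API_KEY="sk-..."                             # placeholder
export MIDSCENE_MODEL_NAME="qwen2.5-vl-72b-instruct"       # example model id
export MIDSCENE_USE_QWEN_VL=1   # enable Qwen-VL-specific handling
```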

Supported Platforms & Integrations
  • Web: Puppeteer, Playwright, Bridge Mode (desktop browser control).
  • Android: JS SDK + adb + scrcpy integration for device control and Android playground.
  • iOS: JS SDK + WebDriverAgent for controlling real devices and simulators.
  • Any interface: generic JS SDK to connect custom interfaces.
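
To make the web integration concrete, here is a minimal sketch of the Puppeteer path. The class and method names follow the `@midscene/web` documentation but should be treated as assumptions to verify against the current API; the URL and prompts are illustrative:

```typescript
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

async function main() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://www.example.com'); // illustrative page

  // Wrap the Puppeteer page in a Midscene agent.
  const agent = new PuppeteerAgent(page);

  // Natural-language action: Midscene localizes the target from screenshots.
  await agent.aiAction('click the first link in the main content area');

  // Structured extraction: describe the shape you want back.
  const data = await agent.aiQuery('{title: string}');
  console.log(data);

  // Vision-based assertion; throws if the claim does not hold.
  await agent.aiAssert('the page has finished loading');

  await browser.close();
}

main();
```

The same agent-style API applies to the other platforms (e.g., an Android agent over adb), with only the construction step differing.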
Typical Use Cases
  • End-to-end UI automation and testing across web and mobile where DOM or accessibility hooks are unavailable or unreliable.
  • Robotic Process Automation (RPA) that must work across apps and custom-rendered surfaces (canvas, games, embedded webviews).
  • Agents that need to inspect and act on UIs via natural language (e.g., higher-level assistants that call MCP tools).
  • Prototyping and research involving visual-language models applied to human-computer interaction tasks.
Developer & Community Resources

Midscene provides documentation, API references, sample projects, and community links (Discord, X). It publishes packages on npm (e.g., @midscene/web), provides a Chrome Extension for quick zero-code trials, and maintains showcases demonstrating real-world flows.

Getting Started (high level)
  1. Visit the official site for docs and guides (https://midscenejs.com).
  2. Install the relevant npm package (e.g., @midscene/web) or use the Chrome Extension for quick experiments.
  3. Choose an integration path: Puppeteer/Playwright for web, adb for Android, WebDriverAgent for iOS, or Bridge Mode for desktop browsers.
  4. Write steps in natural language or JavaScript/YAML and run; use the visual replay and caching to iterate.
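
The steps above might look like this in practice. The package name `@midscene/web` is from the source; the CLI package name and the flow file name are assumptions to verify against the docs:

```shell
# Install the web SDK plus a driver for your chosen integration path.
npm install @midscene/web puppeteer

# Configure a model via environment variables, then run a YAML flow:
npx --yes @midscene/cli ./search-flow.yaml   # CLI package name assumed
```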
License & Citation

Midscene.js is MIT-licensed. The repository provides a citation block for academic usage and lists contributors and upstream model projects it builds on.

Notes
  • Midscene emphasizes a pure-vision automation philosophy but still allows optional DOM usage for extraction tasks.
  • The project has published v1.0 and maintains both v1 and v0 branches and docs for compatibility and migration.

Information

  • Website: github.com
  • Authors: web-infra-dev, Xiao Zhou, Tao Yu, YiBing Lin
  • Published date: 2024/07/23
