Overview
Midscene.js is an open-source JavaScript-first platform that drives UI automation through vision-based models. Rather than relying on DOM selectors or platform-specific element APIs, Midscene centers on screenshot-level localization powered by visual-language and agent models. This pure-vision approach makes it easier to automate across heterogeneous surfaces including web pages, native mobile apps (Android/iOS), canvas-based UIs, and even custom interfaces.
Key Features
- Vision-first element localization: locate and interact with UI elements using screenshots and visual-language models instead of depending on DOM selectors.
- Natural-language scripting: describe goals and steps in plain language and let Midscene plan and operate the UI for you; a JavaScript SDK and YAML-based scripts are also supported (see the example after this list).
- Multi-platform support: integrates with Puppeteer and Playwright for web automation, supports adb-based Android control and WebDriverAgent for iOS; also offers a Bridge Mode for desktop browser control.
- MCP (Model Context Protocol): exposes atomic Midscene agent actions as MCP tools so higher-level agents and MCP clients can inspect and operate UIs through natural language.
- Performance & efficiency: caching mechanisms to replay scripts faster and reduce costs; option to include DOM only when needed for extraction or page understanding.
- Developer experience: visualized replay reports, built-in playgrounds for Android/iOS, a Chrome extension for zero-code experiences, and debugging utilities.
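For example, the natural-language flow maps to a short script. The sketch below uses the Puppeteer integration; the class and method names (`PuppeteerAgent`, `aiAction`, `aiQuery`, `aiAssert`) follow the documented @midscene/web SDK, while the target site and prompts are illustrative.

```typescript
// Minimal sketch: natural-language scripting over Puppeteer.
// API names follow @midscene/web's documented SDK; site and prompts are examples.
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://www.ebay.com');

const agent = new PuppeteerAgent(page);

// Describe the goal; Midscene plans and performs the UI steps.
await agent.aiAction('type "Headphones" in the search box, hit Enter');

// Extract structured data by describing its shape in the prompt.
const items = await agent.aiQuery(
  '{itemTitle: string, price: number}[], the items in the result list with their prices',
);
console.log(items);

// Assert on UI state in plain language; throws if the assertion fails.
await agent.aiAssert('There is a category filter on the left');

await browser.close();
```

The same steps can also be written as a YAML script and run without code, which suits collaborators who do not work in JavaScript.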
Model Strategy
Midscene embraces visual-language and foundation models (VL/FM) for UI understanding and localization. It supports both hosted and self-hosted models and lists compatibility with visual-language models such as Qwen-VL and UI-TARS. For data-extraction tasks you can optionally include DOM information; for action localization, Midscene emphasizes a pure-vision route to reduce token usage and speed up runs.
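As a configuration sketch, model selection is typically driven by environment variables. The variable names below follow Midscene's configuration docs at the time of writing and the values are placeholders; verify both against the current documentation.

```typescript
// Sketch: point Midscene at a hosted Qwen-VL endpoint before creating an agent.
// Variable names follow Midscene's configuration docs; all values are placeholders.
process.env.OPENAI_BASE_URL = 'https://your-provider.example.com/v1'; // OpenAI-compatible endpoint
process.env.OPENAI_API_KEY = 'sk-...';                                // your API key
process.env.MIDSCENE_MODEL_NAME = 'qwen-vl-max-latest';               // which VL model to call
process.env.MIDSCENE_USE_QWEN_VL = '1';                               // enable Qwen-VL grounding mode
```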
Supported Platforms & Integrations
- Web: Puppeteer, Playwright, Bridge Mode (desktop browser control).
- Android: JS SDK + adb + scrcpy integration for device control, plus an Android playground (see the sketch after this list).
- iOS: JS SDK + WebDriverAgent for controlling real devices and simulators.
- Any interface: generic JS SDK to connect custom interfaces.
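For instance, the Android path looks roughly like the sketch below. The names (`getConnectedDevices`, `AndroidDevice`, `AndroidAgent`) follow the documented @midscene/android SDK, but treat the exact signatures as assumptions to check against the current reference.

```typescript
// Sketch: driving an adb-connected Android device with @midscene/android.
// API names follow the documented SDK; verify signatures before relying on them.
import { AndroidAgent, AndroidDevice, getConnectedDevices } from '@midscene/android';

const devices = await getConnectedDevices();        // devices visible to adb
const device = new AndroidDevice(devices[0].udid);  // pick the first one
await device.connect();
await device.launch('https://www.bing.com');        // open a URL (or an app)

const agent = new AndroidAgent(device);
await agent.aiAction('type "weather today" in the search box, hit Enter');
```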
Typical Use Cases
- End-to-end UI automation and testing across web and mobile where DOM or accessibility hooks are unavailable or unreliable.
- Robotic Process Automation (RPA) that must work across apps and custom-rendered surfaces (canvas, games, embedded webviews).
- Agents that need to inspect and act on UIs via natural language (e.g., higher-level assistants that call MCP tools).
- Prototyping and research involving visual-language models applied to human-computer interaction tasks.
Developer & Community Resources
Midscene provides documentation, API references, sample projects, and community links (Discord, X). It publishes packages on npm (e.g., @midscene/web), provides a Chrome Extension for quick zero-code trials, and maintains showcases demonstrating real-world flows.
Getting Started (high level)
- Visit the official site for docs and guides (https://midscenejs.com).
- Install the relevant npm package (e.g., @midscene/web) or use the Chrome Extension for quick experiments.
- Choose integration path: Puppeteer/Playwright for web, adb for Android, WebDriverAgent for iOS, or Bridge Mode.
- Write steps in natural language or JavaScript/YAML and run; use visualized replay and caching to iterate quickly (see the sketch below).
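To illustrate the last step, here is a sketch of enabling the cache on a Puppeteer run. `MIDSCENE_CACHE` and the `cacheId` option follow the documented caching mechanism, but confirm both names against the current API reference.

```typescript
// Sketch: reusing cached planning/locating results to speed up repeat runs.
// MIDSCENE_CACHE and cacheId follow Midscene's caching docs; verify the names.
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

process.env.MIDSCENE_CACHE = 'true'; // replay matched steps from cache

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

// cacheId keys this flow's cache entries across runs.
const agent = new PuppeteerAgent(page, { cacheId: 'example-flow' });
await agent.aiAction('click the "More information" link');
await browser.close();
```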
License & Citation
Midscene.js is MIT-licensed. The repository provides a citation block for academic usage and lists contributors and upstream model projects it builds on.
Notes
- Midscene emphasizes a pure-vision automation philosophy but still allows optional DOM usage for extraction tasks (see the sketch after this list).
- The project published v1.0 and maintains both v1 and v0 branches/docs for compatibility and migration.
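As a sketch of that optional DOM route: recent versions document a `domIncluded` option on extraction calls such as `aiQuery`; treat the option name as an assumption to verify against the current API reference.

```typescript
// Sketch: passing DOM information to the model for one extraction call only.
// The domIncluded option is an assumption based on Midscene's docs; verify it.
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com');

const agent = new PuppeteerAgent(page);
const stories = await agent.aiQuery(
  '{title: string, points: number}[], the stories on the front page',
  { domIncluded: true }, // include page DOM for this extraction
);
console.log(stories);
await browser.close();
```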
