Overview
UI-TARS Desktop is a native GUI agent application built as part of the Agent TARS ecosystem. It is designed to let users control desktop applications and browsers via natural language instructions powered by vision-language models (e.g., Seed / UI-TARS series). The project ships a desktop app that can run local operators (directly operate the host machine), remote operators (control remote machines or browsers), and browser-based operators (Midscene integration). It targets productivity automation, GUI testing, and human-like task completion across terminal, desktop and web contexts.
Key Features
- Native GUI Agent: Precise mouse and keyboard control combined with visual grounding from screenshots so the agent can interact with real UI elements.
- Multimodal models: Integrates with vision-language models (UI-TARS / Seed series) to interpret screenshots and natural language together.
- Local & Remote Operators: Supports both fully-local operators and remote computer/browser operators (no heavy configuration required for basic remote usage).
- MCP Integration: Built to work with MCP servers/clients to mount and call real-world tools and services in a structured way.
- Cross-platform & Privacy-focused: Runs on Windows and macOS, with additional browser-based usage options; emphasizes local processing for privacy-sensitive workflows.
- Developer tooling: Part of the Agent TARS stack with CLI and Web UI tooling, SDK for building GUI automation agents, and documentation for deployment and cloud options.
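The "visual grounding" feature above means the model predicts where to click from a screenshot, and the agent must map that prediction onto real screen pixels. As a minimal sketch, assuming the model emits coordinates normalized to a 0–1000 grid (a common convention for GUI grounding models; the actual UI-TARS output space may differ):

```typescript
// Hypothetical illustration: scale a model-predicted point from an assumed
// 0-1000 normalized grid to absolute pixels on the target display.
interface Point {
  x: number;
  y: number;
}

function toScreenPixels(
  pred: Point,
  screenW: number,
  screenH: number,
  grid = 1000, // assumed normalization range, not a documented constant
): Point {
  return {
    x: Math.round((pred.x / grid) * screenW),
    y: Math.round((pred.y / grid) * screenH),
  };
}

// A prediction of (500, 500) on a 1920x1080 display lands at screen center:
const center = toScreenPixels({ x: 500, y: 500 }, 1920, 1080);
// center is { x: 960, y: 540 }
```

The same mapping works regardless of display resolution, which is why grounding models typically emit normalized rather than absolute coordinates.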
Models and Architecture
UI-TARS Desktop is implemented on top of the broader Agent TARS stack. It uses UI-TARS models (and Seed-1.5/1.6 family models referenced in the repository) as the core vision-language components, and interacts with MCP (Model Context Protocol) servers for tool invocation and external integrations. The repository bundles a desktop application front-end with operator plugins (local/remote/browser) and ties into the Agent TARS CLI / Web UI for different execution modes.
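The operator-plugin design described above can be pictured as one agent loop driving interchangeable back-ends. The sketch below is a conceptual illustration, not the actual SDK API; the `Operator` interface, `Action` type, and `RecordingOperator` class are assumptions made up for this example.

```typescript
// Conceptual sketch: a common interface lets the same agent loop drive a
// local machine, a remote machine, or a browser. All names are illustrative.
type Action =
  | { kind: "click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "hotkey"; keys: string[] };

interface Operator {
  // Capture the current screen for the vision-language model.
  screenshot(): Promise<string>; // e.g. base64-encoded PNG
  // Perform one structured action on the target environment.
  execute(action: Action): Promise<void>;
}

// A stub operator that only records actions -- handy for testing agent
// logic without touching a real desktop or browser.
class RecordingOperator implements Operator {
  readonly log: Action[] = [];
  async screenshot(): Promise<string> {
    return "";
  }
  async execute(action: Action): Promise<void> {
    this.log.push(action);
  }
}

const op = new RecordingOperator();
void op.execute({ kind: "click", x: 960, y: 540 });
```

Swapping a local operator for a remote or browser operator then means changing only which `Operator` implementation the agent is given, while the model-facing loop stays the same.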
Usage & Quick Start
Typical usage paths include:
- Running the desktop app to use the local operator for automating tasks on your machine (e.g., change app settings, control VS Code, interact with dialogs).
- Using the remote computer/browser operators to control a remote environment from your local UI-TARS Desktop client.
- Integrating with Agent TARS CLI or MCP servers for complex pipelines and tool chaining.
The project README provides quick-start commands for launching via npx/npm, links to documentation pages, and examples demonstrating multimodal tasks such as enabling autosave in VS Code and automating booking flows.
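Tasks like the VS Code example above bottom out in the agent turning the model's textual output into a structured action. As a minimal sketch, assuming the model emits a call-style string such as `click(start_box='(x, y)')` (an illustrative format; consult the UI-TARS documentation for the real action schema):

```typescript
// Hypothetical illustration: parse a textual "click" action emitted by a
// vision-language model into a structured command. The string format here
// is an assumption for the example, not the documented UI-TARS schema.
function parseClick(output: string): { x: number; y: number } | null {
  const m = /click\(start_box='\((\d+),\s*(\d+)\)'\)/.exec(output);
  return m ? { x: Number(m[1]), y: Number(m[2]) } : null;
}

parseClick("click(start_box='(320, 180)')"); // → { x: 320, y: 180 }
parseClick("scroll(down)"); // → null (not a click action)
```

Parsing into a discriminated structure before execution lets the agent validate actions and route them to whichever operator (local, remote, browser) is active.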
Releases & Community
The repository and its parent Agent TARS project have an active release and announcement cadence: the repo README records an initial desktop release (v0.1.0) and subsequent updates that added remote operators and other features. It links to related resources such as an arXiv paper, Hugging Face model cards, a public website (agent-tars.com), and community channels (Discord).
License & Citation
UI-TARS Desktop is released under the Apache License 2.0. The project README includes a BibTeX citation pointing to an arXiv preprint for UI-TARS and suggests citing the work when used in research.
Links
- Project (GitHub): https://github.com/bytedance/UI-TARS-desktop
- Agent TARS website / docs: https://agent-tars.com
- Paper (arXiv) and model references are linked from the repository README.
