Midscene.js: Vision-Driven UI Automation Without Selectors or XPath

Midscene.js replaces CSS selectors and XPath with vision model inference. Instead of writing page.click('#submit-button'), you write page.click('the blue submit button'). The library captures a screenshot, sends it to GPT-4V or Claude, gets back bounding-box coordinates, and executes the click. This works across web (Playwright, Puppeteer), iOS, and Android with the same API surface.

At the time of writing, the project has 13,341 GitHub stars and has trended at #6 for TypeScript. It represents a shift from DOM-based automation to visual understanding, which matters when the DOM is absent or hard to reach (native mobile apps, packaged Electron apps, legacy desktop software) or when selectors break constantly.

Why Vision Models Instead of Selectors

Traditional automation tools fail when:

The DOM structure changes between releases
Shadow DOM or iframes hide elements
You need to automate native mobile apps (no DOM at all)
Third-party widgets use obfuscated class names
The element is easier to describe the way a human would (“the red error message below the password field”) than to target with a stable selector

Midscene.js solves this by treating the screen as an image. The vision model identifies elements by visual features (color, position, text content, icon shape) rather than HTML structure. This trades selector brittleness for inference latency and API cost.

Architecture: Screenshot to Coordinates

The flow for every UI interaction:

Capture: Take a screenshot of the current viewport (web canvas, iOS screen buffer, Android framebuffer)
Annotate: Optionally overlay element labels or bounding boxes for the model
Infer: Send the image plus natural-language query to the vision model
Parse: Extract bounding-box coordinates from the model response
Execute: Convert coordinates to platform-specific actions (Playwright click, iOS tap via XCTest, Android input via ADB)
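
The parse and execute steps above can be sketched in a few lines. This is a minimal illustration, not Midscene's actual parser: the response format (a JSON object with a bbox array in normalized 0-1000 coordinates) and the names parseBoundingBox and toClickPoint are assumptions made for the example.

```typescript
// Hypothetical response shape: {"bbox": [x1, y1, x2, y2]} with values in 0..1000.
interface BoundingBox {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

interface Point {
  x: number;
  y: number;
}

// Take the text from the first "{" to the last "}" so that prose around the
// JSON in the model reply is ignored, then validate the bbox field.
function parseBoundingBox(reply: string): BoundingBox {
  const match = reply.match(/\{[\s\S]*\}/);
  if (!match) {
    throw new Error('No JSON object found in model reply');
  }
  const parsed = JSON.parse(match[0]) as { bbox?: unknown };
  const bbox = parsed.bbox;
  if (!Array.isArray(bbox) || bbox.length !== 4 || !bbox.every((v) => typeof v === 'number')) {
    throw new Error('Expected bbox to be an array of four numbers');
  }
  const [x1, y1, x2, y2] = bbox as [number, number, number, number];
  return { x1, y1, x2, y2 };
}

// Scale the normalized box to pixel space and return its center,
// which is the point a click or tap would target.
function toClickPoint(box: BoundingBox, width: number, height: number, scale = 1000): Point {
  const centerX = (box.x1 + box.x2) / 2;
  const centerY = (box.y1 + box.y2) / 2;
  return {
    x: Math.round((centerX / scale) * width),
    y: Math.round((centerY / scale) * height),
  };
}

const reply = 'Found it: {"bbox": [100, 200, 300, 400]}';
const point = toClickPoint(parseBoundingBox(reply), 1280, 720);
console.log(point); // { x: 256, y: 216 }
```

Working in normalized coordinates keeps the model output independent of screen resolution; only the final scaling step needs the real viewport size.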

The library supports multiple vision backends:

GPT-4V (OpenAI)
Claude 3.5 Sonnet (Anthropic)
Gemini Pro Vision (Google)
UI-TARS-1.5-7B (ByteDance’s open-weight model for UI understanding)

You configure the model provider once. Midscene handles prompt engineering, retry logic, and coordinate normalization across different screen resolutions.

import { Page } from '@midscene/web';

const page = new Page(playwrightPage, {
  aiProvider: 'openai',
  apiKey: process.env.OPENAI_API_KEY
});

// Natural language element location
await page.click('the login button in the top right corner');
await page.type('email input field', 'user@example.com');

// Visual assertions
const hasError = await page.assertVisible('red error message below password');

Latency and Cost Tradeoffs

Every UI action requires a vision model inference call. This introduces latency and API costs that don’t exist with selector-based automation.

Metric	Selector-Based (Playwright)	Vision-Based (Midscene.js)
Element location latency	10-50ms (DOM query)	800-2000ms (model inference)
Cost per action	$0	$0.01-0.05 (GPT-4V)
Failure mode	Selector not found	Model misidentifies element
Cross-platform support	Web only	Web, iOS, Android
Maintenance when UI changes	High (rewrite selectors)	Low (descriptions still work)

For a test suite with 100 interactions, you’re looking at 80-200 seconds of added latency and $1-5 in API costs per run. This is acceptable for end-to-end tests that run nightly but prohibitive for unit tests or local development loops.
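
The arithmetic behind those figures is simple enough to script when sizing a suite. The sketch below reuses the per-action ranges from the table above; the function and type names are made up for illustration, and the ranges are rough planning numbers rather than measurements.

```typescript
interface NumericRange {
  min: number;
  max: number;
}

// Per-action figures from the comparison table (GPT-4V pricing column).
// Cost is kept in whole cents to avoid floating-point drift.
const LATENCY_MS_PER_ACTION: NumericRange = { min: 800, max: 2000 };
const COST_CENTS_PER_ACTION: NumericRange = { min: 1, max: 5 };

interface RunEstimate {
  addedSeconds: NumericRange;
  costDollars: NumericRange;
}

// Multiply the per-action ranges by the number of vision-driven actions in a run.
function estimateRun(actions: number): RunEstimate {
  return {
    addedSeconds: {
      min: (actions * LATENCY_MS_PER_ACTION.min) / 1000,
      max: (actions * LATENCY_MS_PER_ACTION.max) / 1000,
    },
    costDollars: {
      min: (actions * COST_CENTS_PER_ACTION.min) / 100,
      max: (actions * COST_CENTS_PER_ACTION.max) / 100,
    },
  };
}

const estimate = estimateRun(100);
console.log(estimate.addedSeconds); // { min: 80, max: 200 }
console.log(estimate.costDollars); // { min: 1, max: 5 }
```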

The library includes a caching layer that stores screenshot hashes and model responses. If the screen hasn’t changed, it reuses the previous coordinates. This helps when running the same test multiple times but doesn’t help with dynamic content, because any pixel change (a timestamp, an animation frame) produces a different hash and forces a fresh inference call.
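
A cache of this kind might look like the sketch below. It is not Midscene's implementation: the LocatorCache class, the FNV-1a hash, and the synchronous infer callback are all assumptions chosen to keep the example small (a real model call would be asynchronous).

```typescript
interface Point {
  x: number;
  y: number;
}

// FNV-1a: a small non-cryptographic hash, enough to detect identical screenshots.
function fnv1a(bytes: Uint8Array): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

class LocatorCache {
  private entries = new Map<string, Point>();
  hits = 0;
  misses = 0;

  // Key on both the screenshot hash and the query: the same screen can be
  // asked about different elements.
  private key(screenshot: Uint8Array, query: string): string {
    return `${fnv1a(screenshot)}:${query}`;
  }

  // Return cached coordinates, or call `infer` (the expensive model call) once.
  locate(screenshot: Uint8Array, query: string, infer: () => Point): Point {
    const k = this.key(screenshot, query);
    const cached = this.entries.get(k);
    if (cached !== undefined) {
      this.hits++;
      return cached;
    }
    this.misses++;
    const located = infer();
    this.entries.set(k, located);
    return located;
  }
}

const cache = new LocatorCache();
const frameA = new Uint8Array([1, 2, 3, 4]);
const frameB = new Uint8Array([1, 2, 3, 5]); // one byte differs, e.g. a blinking cursor
const fakeInfer = () => ({ x: 10, y: 20 });

cache.locate(frameA, 'submit button', fakeInfer); // miss
cache.locate(frameA, 'submit button', fakeInfer); // hit
cache.locate(frameB, 'submit button', fakeInfer); // miss: pixels changed
console.log(cache.hits, cache.misses); // 1 2
```

The third lookup shows the limitation: a single changed byte is enough to miss the cache, which is why exact-hash caching gives little benefit on screens with dynamic content.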

State Management and Retry Logic

Vision models sometimes misidentify elements. A “submit” button might be confused with a “cancel” button if they look similar. Midscene.js handles this with:

Confidence scores: The model returns a confidence value (0-1) for each element match. You can set a threshold (default 0.7) and retry if confidence is low.
Retry with context: If the first inference fails, the library retries with additional context (“the blue button, not the gray one”).
Visual diff validation: After clicking, it can capture another screenshot and verify that the expected state change occurred (modal opened, form submitted, page navigated).

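await page.click('submit button', {
  // Reject matches the model scores below 0.8 (the default threshold is 0.7)
  minConfidence: 0.8,
  // Retry the inference up to 3 times before failing
  retries: 3,
  // After the click, confirm the expected state change actually happened
  validateAfter: async () => {
    return await page.assertVisible('success message');
  }
});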

With up to three retries at 0.8-2 seconds of inference each, the retry logic adds roughly 2-6 seconds per failed interaction. For flaky tests, this compounds quickly.
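
The control flow of a confidence-threshold retry can be sketched as follows. This illustrates the idea only and is not the library's code: locateWithRetry, the Match shape, and the synchronous locate callback are hypothetical, and a real inference call would be asynchronous.

```typescript
interface Match {
  x: number;
  y: number;
  confidence: number; // 0..1
}

interface RetryOptions {
  minConfidence: number;
  retries: number; // extra attempts after the first one
}

// `locate` stands in for the model call. It receives the attempt number so a
// caller could add more context to the prompt on later attempts.
function locateWithRetry(
  locate: (attempt: number) => Match | null,
  options: RetryOptions,
): Match {
  let best: Match | null = null;
  for (let attempt = 0; attempt <= options.retries; attempt++) {
    const match = locate(attempt);
    if (match && match.confidence >= options.minConfidence) {
      return match; // good enough, stop early
    }
    if (match && (best === null || match.confidence > best.confidence)) {
      best = match; // remember the best rejected candidate for the error message
    }
  }
  const detail = best ? `best confidence was ${best.confidence}` : 'no candidates returned';
  throw new Error(
    `No match reached ${options.minConfidence} after ${options.retries + 1} attempts (${detail})`,
  );
}

// Simulated model: confidence improves as later attempts add context.
const confidences = [0.55, 0.72, 0.91];
const found = locateWithRetry(
  (attempt) => ({ x: 640, y: 480, confidence: confidences[attempt] }),
  { minConfidence: 0.8, retries: 3 },
);
console.log(found.confidence); // 0.91
```

Throwing after the last attempt, rather than clicking the best low-confidence candidate, is a deliberate choice in this sketch: a wrong click can change application state in ways that are harder to debug than a failed lookup.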

Cross-Platform Automation

Midscene.js abstracts platform differences behind a unified API:

Web (Playwright/Puppeteer):

Uses CDP to capture screenshots
Converts coordinates to Playwright page.mouse.click(x, y)
Supports iframes and shadow DOM (treats them as part of the screenshot)

iOS (XCTest):

Captures the simulator screen via xcrun simctl io booted screenshot
Converts coordinates to XCTest tap events
Requires macOS with Xcode installed

Android (ADB):

Captures screen via adb exec-out screencap -p
Converts coordinates to adb shell input tap x y
Works with emulators and physical devices over USB

The same test script can run against all three platforms by swapping the page initialization:

// Pick one initializer per target platform; the test body stays the same.

// Web
const webPage = new Page(playwrightPage);

// iOS
const iosPage = new Page(iosDevice, { platform: 'ios' });

// Android
const androidPage = new Page(androidDevice, { platform: 'android' });

This is useful for testing cross-platform apps (React Native, Flutter) where the UI looks identical but the underlying implementation differs.
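
One way to picture the abstraction is a small driver interface that each platform implements, with the locate-then-tap logic written once against it. The sketch below is an assumption about the general shape, not Midscene's internals: PlatformDriver, tapByDescription, and RecordingDriver are hypothetical names. Only the adb argument list comes from the Android notes above.

```typescript
interface Point {
  x: number;
  y: number;
}

type PlatformName = 'web' | 'ios' | 'android';

// Hypothetical adapter contract: each platform knows how to capture and tap.
interface PlatformDriver {
  readonly platform: PlatformName;
  captureScreenshot(): Uint8Array;
  tap(point: Point): void;
}

// Stand-in for the vision model call.
type Locator = (screenshot: Uint8Array, description: string) => Point;

// Platform-independent: only talks to the driver interface.
function tapByDescription(driver: PlatformDriver, locateFn: Locator, description: string): Point {
  const screenshot = driver.captureScreenshot();
  const target = locateFn(screenshot, description);
  driver.tap(target);
  return target;
}

// Argument list for `adb shell input tap x y`, the Android action named above.
function adbTapArgs(point: Point): string[] {
  return ['shell', 'input', 'tap', String(point.x), String(point.y)];
}

// A recording driver for illustration; a real one would wrap Playwright, XCTest, or ADB.
class RecordingDriver implements PlatformDriver {
  readonly platform: PlatformName;
  taps: Point[] = [];
  constructor(platform: PlatformName) {
    this.platform = platform;
  }
  captureScreenshot(): Uint8Array {
    return new Uint8Array(0);
  }
  tap(point: Point): void {
    this.taps.push(point);
  }
}

const drivers = [new RecordingDriver('web'), new RecordingDriver('ios'), new RecordingDriver('android')];
const fixedLocator: Locator = () => ({ x: 120, y: 48 });
for (const driver of drivers) {
  tapByDescription(driver, fixedLocator, 'the login button');
  console.log(driver.platform, driver.taps);
}
console.log(adbTapArgs({ x: 120, y: 48 }).join(' ')); // shell input tap 120 48
```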

Integration with Multi-Agent Workflows

Midscene.js integrates with OpenClaw and AGNTCY for multi-agent orchestration. The library exposes a “skills” API that lets agents perform UI actions as part of a larger workflow:

OpenClaw: Midscene provides pre-built skills for common UI tasks (form filling, navigation, data extraction). An agent can call midscene.fillForm(fields) without knowing the underlying platform.
AGNTCY: Midscene acts as a tool in the agent’s toolbox. The agent decides when to use vision-based automation vs. API calls vs. other tools.

The skills layer wraps low-level Midscene actions with error handling, logging, and state recovery. This makes it easier to build reliable agents that interact with UIs.

Observability and Debugging

Every Midscene action generates:

Screenshot artifacts: Saved to disk with bounding boxes overlaid
Model prompts and responses: Logged in JSON format
Execution traces: Timing for each step (capture, infer, execute)

The library includes a visual debugger that replays test runs step-by-step. You can see what the model saw, what it identified, and where it clicked. This is critical for debugging flaky tests where the model occasionally misidentifies elements.

const page = new Page(playwrightPage, {
  debug: true,
  artifactDir: './test-artifacts'
});

Artifacts are organized by test run ID. Each action gets a numbered screenshot (001-capture.png, 001-annotated.png, 001-result.png) and a JSON file with the model response.
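
When post-processing artifacts in CI, it can help to compute the expected file names for an action. The helper below is a sketch: the numbered PNG names follow the pattern described above, while the artifactDir/runId directory layout and the -response.json file name are assumptions, since the exact layout is not specified here.

```typescript
type ArtifactStage = 'capture' | 'annotated' | 'result';

// Zero-pad the action index to three digits: 1 -> "001".
function padIndex(index: number): string {
  return index < 1000 ? ('000' + index).slice(-3) : String(index);
}

// Build the paths for one action: three screenshots plus the model response.
function artifactPaths(artifactDir: string, runId: string, actionIndex: number): string[] {
  const stages: ArtifactStage[] = ['capture', 'annotated', 'result'];
  // Assumed layout: <artifactDir>/<runId>/<NNN>-<stage>.png
  const prefix = `${artifactDir}/${runId}/${padIndex(actionIndex)}`;
  const screenshots = stages.map((stage) => `${prefix}-${stage}.png`);
  // Hypothetical name for the JSON file holding the model response.
  return [...screenshots, `${prefix}-response.json`];
}

console.log(artifactPaths('./test-artifacts', 'run-42', 1));
```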

Deployment Shape

Midscene.js runs in the same process as your test runner (Jest, Vitest, Mocha). It doesn’t require a separate server or daemon. The vision model calls happen over HTTPS to the provider’s API.

For CI/CD:

GitHub Actions: Install Playwright, set API keys in secrets, run tests normally
Docker: Use a Playwright Docker image, mount test artifacts as volumes
Self-hosted: Run UI-TARS-1.5-7B locally to avoid API costs (requires GPU with 16GB VRAM)

The main deployment challenge is managing API rate limits. If you have 50 parallel test jobs, each making 100 vision calls, that is 5,000 requests in a short window, enough to hit OpenAI’s rate limit quickly. The library includes a queue system that serializes requests across workers.

Failure Modes

Where Midscene.js breaks:

Ambiguous descriptions: “Click the button” fails when there are five buttons on screen. You need to be specific (“the blue submit button in the bottom right”).
Dynamic content: If the UI changes between screenshot capture and action execution (loading spinners, animations), coordinates become stale.
Low-contrast UIs: Vision models struggle with light gray text on white backgrounds or subtle UI elements.
Non-English text: GPT-4V handles multiple languages, but UI-TARS-1.5-7B is trained primarily on English UIs.
Cost overruns: A test suite that runs 1000 times per day at $2 per run costs $2,000 per day, or about $60k per month.

The library doesn’t solve these problems. You need to write better descriptions, add wait conditions, choose high-contrast themes, or switch to selector-based automation for critical paths.

Technical Verdict

Use Midscene.js when:

You need cross-platform automation (web + mobile) with a single codebase
Selectors break frequently due to UI churn
You’re automating third-party apps without DOM access
Test maintenance cost exceeds API cost
You’re building agents that need to interact with arbitrary UIs

Avoid it when:

You need sub-100ms action latency (use Playwright directly)
API costs are prohibitive (thousands of tests per day)
Your UI has many visually similar elements
You’re automating a stable internal app with reliable selectors
You need deterministic test results (vision models have inherent variability)

The sweet spot is end-to-end tests for cross-platform apps where selector maintenance is painful and you can tolerate 1-2 second latency per action. For unit tests or performance-critical automation, stick with traditional selectors.