Minicor's Windows RPA Stack: How AI Agents Automate Desktop Apps Without APIs

Source: minicor.com

Most enterprise software still lives in desktop applications with no API surface. When you need an AI agent to interact with legacy Windows apps, you face a hard infrastructure problem: how do you orchestrate hundreds of concurrent desktop sessions, recover from UI changes, and debug failures that only reproduce on specific screen resolutions?

Minicor is a YC-backed platform that runs desktop RPA (Robotic Process Automation) at scale. The interesting part is not the AI wrapper but the plumbing required to make desktop automation reliable enough for production workloads.

The Desktop Automation Problem

Traditional RPA scripts break when:

  • A vendor ships a UI update and button coordinates shift
  • An unexpected dialog appears mid-workflow
  • Screen resolution differs between dev and production VMs
  • Windows updates change accessibility tree structure
  • Network latency causes elements to load out of order

Most RPA tools address this with XPath-style selectors or pixel-perfect image matching. Both are brittle and fail frequently. Minicor’s approach uses a reflection agent that validates every action against the actual screen state before proceeding.

Architecture: From API Call to Desktop Session

Here’s the execution flow when you trigger a desktop automation:

  1. API ingress: Your system calls Minicor’s REST API with workflow ID and parameters
  2. Session allocation: Desktop service assigns a Windows VM from the pool
  3. Agent injection: Desktop agent process starts inside the VM session
  4. Workflow execution: Agent reads workflow definition, executes steps, validates each action
  5. State capture: Full video recording, screenshots, and execution context saved
  6. Cleanup: Session released back to pool or destroyed

The desktop service manages VM lifecycle. It handles:

  • Pre-warming sessions to reduce cold-start latency
  • Health checks to detect stuck or crashed agents
  • Automatic failover when a VM becomes unresponsive
  • License management for Windows Server instances
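As a rough illustration of pre-warming and health checks, the sketch below keeps a fixed number of idle sessions ready and skips any whose heartbeat has gone stale. It is not Minicor’s implementation: `SessionPool`, `Session`, the `create` callback, and the 30-second heartbeat timeout are hypothetical names and values chosen for the example.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Optional


@dataclass
class Session:
    """Hypothetical stand-in for one Windows desktop session."""
    session_id: int
    last_heartbeat: float = field(default_factory=time.monotonic)


class SessionPool:
    """Keeps `target_warm` idle sessions ready and drops unhealthy ones."""

    def __init__(self, create: Callable[[int], Session], target_warm: int,
                 heartbeat_timeout: float = 30.0) -> None:
        self._create = create
        self._target_warm = target_warm
        self._heartbeat_timeout = heartbeat_timeout
        self._idle: Deque[Session] = deque()
        self._next_id = 0
        self.refill()

    def refill(self) -> None:
        # Pre-warm: top up the idle queue so acquire() rarely pays a cold start.
        while len(self._idle) < self._target_warm:
            self._idle.append(self._create(self._next_id))
            self._next_id += 1

    def is_healthy(self, session: Session, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        return (now - session.last_heartbeat) <= self._heartbeat_timeout

    def acquire(self, now: Optional[float] = None) -> Session:
        # Failover: discard sessions whose agent stopped sending heartbeats.
        while self._idle:
            session = self._idle.popleft()
            if self.is_healthy(session, now):
                self.refill()
                return session
        # Pool exhausted: fall back to a cold start.
        session = self._create(self._next_id)
        self._next_id += 1
        self.refill()
        return session

    @property
    def idle_count(self) -> int:
        return len(self._idle)
```

In a real deployment the `create` callback would boot or attach to a VM, which is the slow part that pre-warming hides; here it only constructs an in-memory record.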

Session Isolation Strategies

Running hundreds of concurrent Windows desktop sessions creates three hard problems:

VM sprawl: Each session needs its own Windows instance. Minicor supports both dedicated VMs and Citrix environments. For cloud deployments, they use Windows Server with Remote Desktop Services to run multiple isolated sessions on a single host.

Licensing: Windows Server Datacenter edition allows unlimited Windows Server VMs on a fully licensed host. For cloud, you pay per core. The trade-off is cost vs. isolation strength.

State contamination: Desktop apps often write to registry, temp files, or shared directories. Minicor uses session-scoped user profiles that get wiped between runs. For apps that require persistent state, you can mount shared volumes with read-only base images.

Isolation Strategy          Cost     Startup Time   Failure Blast Radius
Dedicated VM per session    High     30-60s         Single session
RDS multi-session           Medium   5-10s          Single session
Citrix published apps       Low      2-5s           Shared app pool
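The session-scoped profile idea described above can be sketched as a context manager that creates a fresh directory per run and removes it afterwards. This is only an analogy under stated assumptions: `scoped_profile` is a hypothetical helper, and a real Windows user profile also includes registry hives and other state that a directory wipe does not cover.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def scoped_profile(prefix: str = "session-") -> Iterator[Path]:
    """Create a fresh per-run directory and delete it when the run ends."""
    profile_dir = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield profile_dir
    finally:
        # Wipe everything the app wrote, even if the run raised an error.
        shutil.rmtree(profile_dir, ignore_errors=True)


if __name__ == "__main__":
    with scoped_profile() as profile:
        (profile / "app-cache.tmp").write_text("left behind by the app")
        print("profile exists during run:", profile.exists())
    print("profile exists after run:", profile.exists())
```

The `finally` clause is the important part: cleanup runs whether the workflow succeeds or fails, so a crashed run cannot leak state into the next one.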

Element Detection: UI Automation vs. Vision

Minicor uses multiple detection strategies in a fallback chain:

UI Automation API (first choice): Windows exposes an accessibility tree through the UI Automation framework (the IUIAutomation COM interface). You get a structured tree of UI elements with properties such as name, control type, and bounding rectangle. It is fast and reliable when it works, but it fails when apps use custom controls or render to a canvas.

OCR fallback: When UI Automation can’t find an element, Minicor runs OCR on the screen region. Slower (200-500ms per region) but works on any visible text. Accuracy depends on font rendering and contrast.

Computer vision (last resort): For graphical elements with no text, you can use template matching or ML-based object detection. Latency jumps to 1-2 seconds. Brittle across different screen resolutions.

The reflection agent validates after each action. If you click a button and the expected screen state doesn’t appear, the agent retries with a different detection method or raises an error.
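The fallback chain plus post-action validation can be sketched as a loop over detectors. Everything named here is an assumption for illustration: `click_with_fallback`, the `Detector` signature, and the `validate` callback are hypothetical, and the detectors themselves (UI Automation lookup, OCR, template matching) are passed in rather than implemented.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[int, int]
# A detector takes an element name and returns click coordinates, or None.
Detector = Callable[[str], Optional[Point]]


@dataclass
class ActionResult:
    method: str
    point: Point


class ElementNotFound(Exception):
    pass


def click_with_fallback(
    name: str,
    detectors: Sequence[Tuple[str, Detector]],
    click: Callable[[Point], None],
    validate: Callable[[], bool],
) -> ActionResult:
    """Try each detector in order; accept the first one whose click validates."""
    tried = []
    for method, detect in detectors:
        tried.append(method)
        point = detect(name)
        if point is None:
            continue  # this strategy cannot see the element; try the next one
        click(point)
        if validate():  # reflection step: did the expected screen state appear?
            return ActionResult(method=method, point=point)
    raise ElementNotFound(f"{name!r} not confirmed after trying: {', '.join(tried)}")


if __name__ == "__main__":
    fake_detectors = [
        ("ui_automation", lambda n: None),   # pretend the control is custom-drawn
        ("ocr", lambda n: (412, 96)),        # pretend OCR located the label
        ("vision", lambda n: (410, 95)),
    ]
    result = click_with_fallback("Orders", fake_detectors,
                                 click=lambda p: None, validate=lambda: True)
    print(result.method)
```

One design caveat: retrying a click with a second detector is only safe when the action itself is safe to repeat, which is why this pattern pairs naturally with idempotent workflow steps.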

State Recovery and Checkpointing

Desktop automations fail mid-workflow. The question is whether you restart from the beginning or resume from a checkpoint.

Minicor’s approach:

  • Idempotent actions: Workflows should be safe to replay. If you’re filling a form, check if the field is already populated before typing.
  • Explicit checkpoints: Mark steps as checkpoints in the workflow definition. On failure, the agent restarts from the last checkpoint.
  • Session snapshots: For long-running workflows, you can snapshot the VM state. Expensive in storage (5-10 GB per snapshot) but useful for debugging.
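To make the idempotency point concrete, here is a minimal sketch of a replay-safe typing action. `type_if_needed`, `read_field`, and `write_field` are hypothetical names; in practice the two callbacks would wrap whatever detection method located the field, and `write_field` is assumed to replace the field’s contents rather than append to them.

```python
from typing import Callable


def type_if_needed(read_field: Callable[[], str],
                   write_field: Callable[[str], None],
                   value: str) -> bool:
    """Type `value` only when the field does not already hold it.

    Returns True if the field was written, False if the step was a no-op.
    """
    if read_field() == value:
        return False  # already populated by an earlier attempt; safe to replay
    write_field(value)
    return True


if __name__ == "__main__":
    form = {"Patient ID": ""}
    read = lambda: form["Patient ID"]
    write = lambda v: form.__setitem__("Patient ID", v)
    print(type_if_needed(read, write, "A-1001"))  # first run writes the field
    print(type_if_needed(read, write, "A-1001"))  # replay is a no-op
```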

Here’s a workflow snippet showing checkpoint usage:

workflow:
  - step: open_app
    action: launch
    path: "C:\\Program Files\\LegacyApp\\app.exe"
    
  - step: login
    action: type
    element: {name: "Username"}
    value: "{{username}}"
    checkpoint: true
    
  - step: navigate_to_orders
    action: click
    element: {name: "Orders"}
    
  - step: create_order
    action: click
    element: {name: "New Order"}
    checkpoint: true
    
  - step: fill_patient_id
    action: type
    element: {name: "Patient ID"}
    value: "{{patient_id}}"
    validate:
      screen_contains: "Patient: {{patient_name}}"

If fill_patient_id fails, the agent restarts from create_order instead of re-logging in.
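The restart rule can be expressed as a backwards search for the nearest checkpoint. This is a sketch, not the agent’s actual logic: `resume_index` is a hypothetical function, the step dictionaries mirror the workflow fields above, and it assumes that a failed step which is itself a checkpoint restarts from that same step.

```python
from typing import Dict, List


def resume_index(steps: List[Dict], failed_step: str) -> int:
    """Return the index of the step to restart from after `failed_step` fails.

    Walks backwards from the failed step to the nearest step marked
    `checkpoint: true`; falls back to 0 when no checkpoint precedes it.
    """
    names = [s["step"] for s in steps]
    failed_at = names.index(failed_step)  # raises ValueError for unknown steps
    for i in range(failed_at, -1, -1):
        if steps[i].get("checkpoint", False):
            return i
    return 0


if __name__ == "__main__":
    steps = [
        {"step": "open_app"},
        {"step": "login", "checkpoint": True},
        {"step": "navigate_to_orders"},
        {"step": "create_order", "checkpoint": True},
        {"step": "fill_patient_id"},
    ]
    restart = resume_index(steps, "fill_patient_id")
    print(steps[restart]["step"])
```

With the workflow above, a failure in navigate_to_orders would resume at login, and a failure in open_app would restart from the beginning because no checkpoint precedes it.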

Observability Primitives

Debugging desktop automation requires more than logs. Minicor captures:

  • Full video recording: Every session is recorded at 5 fps. Storage cost is ~50 MB per hour.
  • Screenshot on error: When a step fails, the agent captures the screen and UI Automation tree.
  • Execution trace: Timestamped log of every action, element detection attempt, and validation result.
  • Slack alerts: Failures trigger notifications with video replay link and error context.

The video replay is the most useful debugging tool. You can see exactly what the agent saw when it failed, including transient dialogs or loading spinners that don’t appear in logs.
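For capacity planning, the recording figures quoted above (5 fps, roughly 50 MB per session-hour) translate into storage as follows. The function names and the example fleet size are assumptions for illustration, and the result is only as accurate as the approximate 50 MB figure.

```python
MB_PER_SESSION_HOUR = 50   # approximate figure quoted above
FRAMES_PER_SECOND = 5      # recording rate quoted above


def recording_storage_gb(sessions: int, hours_per_day: float,
                         retention_days: int) -> float:
    """Estimated video storage in GB (1 GB = 1000 MB) for a retention window."""
    total_mb = MB_PER_SESSION_HOUR * sessions * hours_per_day * retention_days
    return total_mb / 1000


def bytes_per_frame() -> float:
    """Average encoded size of one frame implied by the two figures above."""
    frames_per_hour = FRAMES_PER_SECOND * 3600
    return MB_PER_SESSION_HOUR * 1_000_000 / frames_per_hour


if __name__ == "__main__":
    # Hypothetical fleet: 100 sessions, 8 hours a day, 30 days of retention.
    print(recording_storage_gb(100, 8, 30))  # 1200.0 GB, i.e. about 1.2 TB
    print(round(bytes_per_frame()))          # 2778 bytes per frame on average
```

An average under 3 KB per frame implies heavy inter-frame compression, which is plausible for mostly static desktop screens.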

Deployment Shapes

Minicor supports three deployment models:

Cloud-hosted: Minicor manages the Windows VMs. You call the API, they handle scaling. Simplest but highest cost and potential data residency issues.

On-premise: You run the desktop service on your own Windows Server infrastructure. Minicor provides the agent binary and orchestration layer. Requires Windows Server licenses and VM management expertise.

Hybrid: Desktop agents run on-premise, but workflow definitions and observability data flow to Minicor’s cloud. Useful when you need to automate apps that can’t leave your network but want managed observability.

Likely Failure Modes

Windows updates: Automatic updates can restart VMs mid-workflow. Solution: Disable automatic updates and schedule maintenance windows, or use VM snapshots to roll back.

Screen resolution mismatches: Workflows developed on 1920x1080 fail on 1280x1024. Solution: Use relative positioning instead of absolute coordinates, or standardize VM screen resolution.
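A minimal sketch of relative positioning: store click targets as fractions of the screen size and convert back to pixels at run time. `to_relative` and `to_absolute` are hypothetical helpers. Note that 1920x1080 and 1280x1024 have different aspect ratios, so apps may reflow their layout; fractions relative to a window or an anchor element are likely more robust than fractions of the full screen.

```python
from typing import Tuple


def to_relative(x: int, y: int, width: int, height: int) -> Tuple[float, float]:
    """Express a pixel position as fractions of the screen size."""
    return x / width, y / height


def to_absolute(rel_x: float, rel_y: float, width: int, height: int) -> Tuple[int, int]:
    """Map fractional coordinates back to pixels on a screen of a given size."""
    return round(rel_x * width), round(rel_y * height)


if __name__ == "__main__":
    # Recorded on the development VM at 1920x1080.
    rel = to_relative(960, 540, 1920, 1080)
    # Replayed on a production VM at 1280x1024.
    print(to_absolute(*rel, 1280, 1024))  # (640, 512)
```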

Antivirus interference: Desktop agents trigger heuristic detection because they simulate mouse and keyboard input. Solution: Whitelist the agent binary and add exclusions for workflow directories.

License exhaustion: Running out of Windows Server CALs or RDS licenses. Solution: Monitor license usage and implement session pooling with aggressive timeouts.

Network latency: Desktop apps that load data from remote servers can time out. Solution: Increase wait times in workflow definitions or use network-aware validation.
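Rather than hard-coding longer sleeps, a common pattern is to poll for the expected state with a deadline. The sketch below shows that pattern in general terms; `wait_until` is a hypothetical helper, not part of any Minicor API, and the `clock` and `sleep` parameters exist only so the function can be tested without real delays.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 10.0,
               interval: float = 0.25,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    attempts = {"n": 0}

    def orders_grid_loaded() -> bool:
        # Stand-in for a real check such as "the Orders grid has rows".
        attempts["n"] += 1
        return attempts["n"] >= 3

    print(wait_until(orders_grid_loaded, timeout=5.0, interval=0.01))
```

Polling returns as soon as the element appears, so fast responses are not penalized by a wait sized for the slowest case.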

Technical Verdict

Use Minicor when:

  • You need to integrate AI agents with legacy Windows desktop apps that have no API
  • You’re automating workflows that run thousands of times per day and need reliability
  • You have budget for Windows Server licensing and VM infrastructure
  • You need full observability and video replay for compliance or debugging

Avoid when:

  • The target application has a usable API (even a bad one is better than desktop automation)
  • You’re automating workflows that run infrequently (manual execution is cheaper)
  • You can’t tolerate 5-10 second cold-start latency for session allocation
  • Your workflows require sub-second response times (desktop automation adds 200-500ms per action)

The hard parts of desktop RPA are session management, state recovery, and observability. Minicor exposes these as managed primitives instead of making you build them. The trade-off is cost and vendor lock-in vs. engineering time saved.