Every major tech company is building coding agents. The frontier is moving from “write code” to “interact with the product.” But the moment your agent needs to tap a button on a real iPhone, you hit an infrastructure problem that has nothing to do with AI.
We know this because we built it before. At Uber, we created DragonCrawl — the first AI-powered mobile testing system to run in production at scale. It executed Uber’s core trip flow across 85 cities with 99%+ stability, zero maintenance, on every Android code change. The models were the easy part. The infrastructure to actually run them against real devices was where we spent most of our time.
Now we’re building Revyl, and the problem is the same. There’s been a lot of progress on getting AI to reason about what to do on a screen. But giving it a real device to act on — with low-latency streaming, clean state, and verified execution — is a different kind of hard. That’s the infrastructure iceberg.
The iceberg
We wanted to get to a single CLI command: tell the device what to do in natural language, and have it happen. Here’s what that actually required.
One command on top. Seven layers of infrastructure underneath, each with its own failure modes, each harder than you’d expect.
Demo
The seven layers
Here’s what we actually had to build. Each of these is its own project, with its own set of surprises.
We run dedicated Apple Silicon Mac Minis. Provisioning them turned out to be one of the hardest parts — there’s no Terraform for Mac hardware. We wrote 13 Ansible roles covering kernel tuning, simulator creation, and observability. A custom LaunchDaemon supervisor handles auto-restarts, version polling, and graceful shutdown. Deploys happen in seconds via GitHub webhooks and AWS SSM.
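The auto-restart behavior comes from launchd itself. A minimal illustrative LaunchDaemon plist (the label and binary path here are hypothetical, not our actual config) looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- hypothetical label and path, for illustration only -->
  <key>Label</key>
  <string>ai.revyl.agent-supervisor</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/revyl-supervisor</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <!-- launchd relaunches the supervisor whenever it exits -->
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>
```

The supervisor itself then handles the application-level concerns: version polling, graceful shutdown, and restarting the processes it owns.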
Hard: Emulators leak resources. Every iOS update can break your simulator matrix. Someone has to be on-call for hardware failures at 2am.
Android emulators cold-boot in 1-5 minutes. iOS simulators need to be erased between runs for hermetic isolation, then booted, then connected to a companion daemon. We added pre-warming to cut first-task latency by 30-60 seconds, but getting it right meant handling all the silent failures.
Hard: Boot processes hang without error. Port conflicts across concurrent sessions produce cryptic failures. Clean-state guarantees require careful lifecycle management that’s easy to get subtly wrong.
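Pre-warming can be sketched as a small pool of already-booted simulators that a session grabs instead of paying the cold-boot cost. A simplified, single-process sketch (names and sizes are illustrative; `boot_fn` stands in for the erase, boot, and connect sequence):

```python
import threading
from collections import deque

class PrewarmPool:
    """Illustrative sketch, not the production implementation: keep a few
    simulators booted ahead of demand so first-task latency stays low."""

    def __init__(self, boot_fn, target_size=2):
        self._boot_fn = boot_fn          # slow: erase + boot + connect daemon
        self._target_size = target_size
        self._ready = deque()
        self._lock = threading.Lock()

    def fill(self):
        # Run by a single background worker: top the pool back up.
        while True:
            with self._lock:
                if len(self._ready) >= self._target_size:
                    return
            # Boot off the request path so sessions never wait on it.
            self._ready.append(self._boot_fn())

    def acquire(self):
        # Fast path: hand out a pre-warmed simulator if one is ready.
        with self._lock:
            if self._ready:
                return self._ready.popleft()
        # Slow path: pool was empty, pay the cold boot.
        return self._boot_fn()
```

The design choice that matters: booting happens on a background worker, so the only time a session pays the cold-boot cost is when demand outruns the pool.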
We stream H.264 video at 24-30fps from real devices via WebRTC, anywhere in the world. A local ring buffer lets the AI “watch” what just happened. The pipelines are self-healing — they auto-restart on codec failures without losing the session. Getting here required a lot of time with GStreamer.
Hard: Codec negotiation, WHIP/WHEP handshakes, resolution mapping between logical and physical pixels. One misconfigured buffer adds multi-second latency. One codec mismatch kills the stream silently.
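The ring buffer itself is conceptually simple: keep the last few seconds of decoded frames so the model can look back at what just happened. A minimal stand-in sketch (capacity and timestamps are illustrative):

```python
import time
from collections import deque

class FrameRingBuffer:
    """Illustrative sketch: hold the most recent N frames with timestamps
    so the agent can replay the last few seconds of screen activity."""

    def __init__(self, capacity=90):           # ~3 seconds at 30fps
        self._frames = deque(maxlen=capacity)  # old frames fall off automatically

    def push(self, frame, ts=None):
        self._frames.append((ts if ts is not None else time.monotonic(), frame))

    def since(self, t0):
        # Everything captured at or after t0, oldest first.
        return [f for ts, f in self._frames if ts >= t0]
```

A bounded `deque` gives the eviction behavior for free: pushing frame N+1 into a full buffer silently drops the oldest, so memory stays constant no matter how long the session runs.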
“Tap the login button” needs to become precise pixel coordinates. We use a reasoning model to understand intent and a vision model to locate the element on screen. When grounding fails, the system retries with expanded search regions. No CSS selectors, no accessibility IDs — just the screenshot and the instruction.
Hard: Vision models hallucinate coordinates. Screen densities vary wildly across devices. Dynamic UIs shift elements between the moment we capture and the moment we execute. Getting to production-grade accuracy took months of iteration.
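The expanding-region retry can be sketched as follows. Here `locate` stands in for the vision-model call and its signature is an assumption; regions are (x, y, w, h) tuples in screen pixels:

```python
def expand(region, factor, screen_w, screen_h):
    """Grow a search region around its center, clamped to the screen bounds."""
    x, y, w, h = region
    cx, cy = x + w / 2, y + h / 2
    nw, nh = w * factor, h * factor
    nx, ny = max(0, cx - nw / 2), max(0, cy - nh / 2)
    return (nx, ny, min(nw, screen_w - nx), min(nh, screen_h - ny))

def ground(locate, instruction, screen, region, max_tries=3):
    """Hypothetical retry loop: locate(instruction, screen, region) returns
    an (x, y) coordinate or None. On failure, widen the region and retry;
    the final attempt searches the full screen."""
    w, h = screen
    for _ in range(max_tries - 1):
        hit = locate(instruction, screen, region)
        if hit is not None:
            return hit
        region = expand(region, 2.0, w, h)   # widen the search and try again
    return locate(instruction, screen, (0, 0, w, h))  # last resort: full screen
```

Starting from a tight region keeps the common case fast and accurate; falling back to the full screen bounds the worst case at one extra model call.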
The loop is: execute, wait, verify. Every action is visually confirmed before we move on. We support the full mobile vocabulary — tap, swipe, type, long press, pinch, drag, scroll — using platform-native APIs (UIAutomator2 for Android, XCTest/IDB for iOS) with tiered retry logic.
Hard: ADB commands hang without warning. XCTest runners timeout silently. Keyboard state is unpredictable across OS versions. Animation timing varies per device, so “wait for the screen to settle” is harder than it sounds.
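The execute-wait-verify loop with tiered retries, reduced to its skeleton (the tier delays are illustrative, and `execute`/`verify` stand in for the platform-native gesture and the visual check):

```python
import time

def act_and_verify(execute, verify, settle=0.0, tiers=(0, 0.5, 2.0)):
    """Illustrative sketch of a tiered execute -> wait -> verify loop.
    `execute` performs the gesture; `verify` screenshots and returns True
    once the expected UI state is visible. Each tier waits longer before
    retrying, since slow animations are the usual cause of a false miss."""
    for backoff in tiers:
        time.sleep(backoff)  # progressively more patient on each retry
        execute()
        time.sleep(settle)   # give the screen a moment to settle
        if verify():
            return True
    return False             # exhausted all tiers: surface the failure
```

The point of the tiers: most retries fail because the UI was still animating, not because the tap missed, so waiting longer before each retry is cheaper than re-grounding.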
We needed to go from 1 session to 10,000+ concurrent sessions. The Mac Mini fleet handles iOS. A workflow orchestrator runs suites in parallel with automatic capacity management. Every session gets a clean, isolated device.
Hard: Devices are stateful — you can’t just spin up a container. We had to build sticky routing, distributed concurrency control, pre-warming to avoid 3-5 minute cold boots, and queue-depth autoscaling. All of it custom.
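Sticky routing reduces to a deterministic hash from session to host, plus a per-host concurrency cap. An in-process sketch (the real version needs a shared store for the counts, and the host names here are made up):

```python
import hashlib
import threading

class StickyRouter:
    """Illustrative sketch: devices are stateful, so a session must keep
    hitting the same host, and each host can only run so many sessions."""

    def __init__(self, hosts, per_host_limit=8):
        self._hosts = sorted(hosts)  # stable order so the hash is stable
        self._slots = {h: threading.BoundedSemaphore(per_host_limit)
                       for h in self._hosts}

    def host_for(self, session_id):
        # Deterministic: the same session always maps to the same host.
        digest = hashlib.sha256(session_id.encode()).hexdigest()
        return self._hosts[int(digest, 16) % len(self._hosts)]

    def acquire(self, session_id):
        # Block until the sticky host has a free slot. A distributed version
        # would use a shared counter (e.g. in a database) instead.
        host = self.host_for(session_id)
        self._slots[host].acquire()
        return host

    def release(self, session_id):
        self._slots[self.host_for(session_id)].release()
```

Note the limitation this sketch shares with any modulo-hash scheme: adding or removing a host remaps most sessions, which is why production systems reach for consistent hashing.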
We instrumented everything with OpenTelemetry: grounding latency, click accuracy, action time, streaming health, provisioning phases. Every session gets a full video recording for post-mortem review.
Hard: When something fails, you’re tracing across LLM, device, streaming, and orchestrator simultaneously. Building the observability layer was as much work as the infrastructure itself.
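A stand-in for the span instrumentation, using only the standard library rather than the OpenTelemetry SDK: time each phase of the loop and keep the durations for later analysis:

```python
import time
from contextlib import contextmanager

TIMINGS = {}  # phase name -> list of durations in milliseconds

@contextmanager
def span(name):
    """Minimal stand-in for an OpenTelemetry span: records how long each
    phase takes (grounding, action, verification, provisioning, ...)."""
    start = time.monotonic()
    try:
        yield
    finally:
        TIMINGS.setdefault(name, []).append((time.monotonic() - start) * 1000)

# Usage: wrap each phase of the session loop.
# with span("grounding"):
#     coords = ground(...)
# with span("action"):
#     tap(coords)
```

The real version exports to an OpenTelemetry collector instead of a dict, but the shape is the same: every phase of every session becomes a named, timed unit you can aggregate and alert on.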
The interface
Seven layers of complexity, one CLI:
The CLI is the primary interface. You say what to do in natural language, and we handle the screenshot, grounding, execution, and verification loop behind it.
We also expose everything as an MCP server — revyl mcp turns the device into a tool that any agent framework can call directly, whether it’s Claude, GPT, or something custom. The live video stream is always available for continuous perception-action loops.
Performance
We optimized for the tight loop an AI agent needs: capture the screen, ground the instruction, execute the action, verify the result. Here’s where we ended up compared to other approaches.
Numbers
Architecture
The full request path:
Your agent talks to our API. We provision the device, stream the screen, ground the instruction, execute, verify, and return the result. You focus on the intelligence.
Try it
If you’re building an agent that needs to interact with real mobile devices, we’d like to help. Reach out at anam@revyl.ai or book a demo.