How Tero's Multiverse loop ships product improvements without a human in the loop

Friction detection, agent-screened code variants, and real-traffic A/B validation — engineered to converge in tens of dollars instead of hundreds.

Zaid Mallik · 2026-05-09 · 11 min

The product analytics dashboard tells you the signup → activation step lost 32% of users this week. You don't know why. You suspect the new copy on the second onboarding screen, but you've been wrong before. You'd run an A/B test, except the team that owns the page is shipping something else, and the variant you'd want to try doesn't exist as code yet — someone has to write it. By the time the experiment is designed, implemented, deployed behind a flag, and run for a week of traffic, the funnel will have moved on.

The Multiverse loop is the part of Tero that closes this gap. It detects the friction from connected analytics, generates code-level variants of the regressing surface, pre-tests them against five user-archetype agents in headless browsers, ships the winning variant as a real PR via our GitHub App, and validates it with live traffic before promoting it. The whole loop runs end to end, with no human approval required if the project's policy allows it.

This post is a deep read of how it actually works: the routing logic, the gates we put on every LLM call, the cost engineering that took a single autonomous task from roughly $5 and 170+ turns of agent thrash down to about $0.40–$1 in 7–15 turns, and the open problems we haven't solved yet.

The naive approach and why it fails

If you sat down for a weekend to build "an agent that detects funnel drop-offs and ships PRs to fix them," the obvious shape is: cron job fetches a PostHog funnel, finds the worst step, throws the friction summary at Claude with a system prompt like "please fix this," takes whatever code comes back, and opens a PR. Maybe wrap the whole thing in a Slack approval button.
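As a rough illustration of the "finds the worst step" part of that naive design, here is a minimal sketch. It assumes the funnel has already been fetched and reduced to per-step user counts; the `FunnelStep` type, the `worst_drop` helper, and the numbers are all hypothetical and are not Tero's code or the PostHog API.

```python
from dataclasses import dataclass


@dataclass
class FunnelStep:
    name: str
    users: int  # users who reached this step (hypothetical data shape)


def worst_drop(steps: list[FunnelStep]) -> tuple[str, float]:
    """Return the transition with the largest fractional drop-off."""
    worst_label, worst_rate = "", 0.0
    for prev, curr in zip(steps, steps[1:]):
        if prev.users == 0:
            continue  # an empty step has nobody left to lose
        rate = 1 - curr.users / prev.users
        if rate > worst_rate:
            worst_label, worst_rate = f"{prev.name} -> {curr.name}", rate
    return worst_label, worst_rate


# Made-up counts, chosen so signup -> activation is the worst transition.
funnel = [
    FunnelStep("visit", 1000),
    FunnelStep("signup", 800),
    FunnelStep("activation", 544),
    FunnelStep("paid", 450),
]
label, rate = worst_drop(funnel)
print(f"{label}: {rate:.0%} drop")
```

Note that this version ranks steps by absolute drop-off only. It has no baseline, so it cannot tell a step that has always lost a third of users from one that regressed this week, which is part of why the naive shape does not survive contact with real data.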

This works exactly once, on a demo. Then four things kill it:

Cost. A naive agent loop on a real codebase reads the same files repeatedly, re-explores the structure on every retry, and burns through the cache prefix every few turns. Our first version was hitting 170+ turns and roughly $5 per task on moderate work. The per-task math looks fine in isolation until you realize the autonomous cron fires it for every project, on every signal, every six hours.
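To put numbers on how that cadence compounds, here is a back-of-the-envelope sketch. The per-task costs and the six-hour interval come from this post; the assumption that every scan produces exactly one actionable signal per project is a hypothetical worst case, not a measured rate, and `monthly_cost` is an illustrative helper.

```python
HOURS_PER_DAY = 24
CRON_INTERVAL_HOURS = 6  # from the post: the cron fires every six hours


def monthly_cost(cost_per_task: float, projects: int,
                 signals_per_scan: int, days: int = 30) -> float:
    """Spend if every scan launches one task per signal per project."""
    scans_per_day = HOURS_PER_DAY // CRON_INTERVAL_HOURS  # 4 scans a day
    return cost_per_task * projects * signals_per_scan * scans_per_day * days


# Assumption: one project, one actionable signal on every scan.
scenarios = [("first version", 5.00), ("tuned, low end", 0.40), ("tuned, high end", 1.00)]
for label, cost in scenarios:
    total = monthly_cost(cost, projects=1, signals_per_scan=1)
    print(f"{label}: ${total:,.0f} per project per month")
```

Under that assumption, $5 per task works out to $5 × 4 × 30 = $600 per project per month, while the $0.40–$1 range works out to $48–$120. Real signal rates will be lower than one per scan, so treat these as upper bounds for a single signal source.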

Reliability. LLM-generated code has a particular failure mode where the diff looks right but deletes unrelated code as collateral damage. A theme-toggle task ends up rewriting the iMessage notification handler because the agent decided to tidy up while it was there. On a customer's repo this is unrecoverable.

Scope creep. A user asks for "make the CTA button blue." The agent makes it blue, refactors the button into a new component, extracts the color into a design token, updates seven other places to use the token, and rewrites the storybook entry. None of that was the task.

False positives on the validation side. Even if the code is clean, you can't tell from a single chi-square on simulated data whether the variant is genuinely better or just lucked out on one cohort. Treating "synthetic agent activates more often" as ground truth ships regressions to real users.
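The single-cohort problem is easy to reproduce with made-up numbers. The sketch below is an illustration, not Tero's validation code: it runs a plain Pearson chi-square (one degree of freedom, no continuity correction) on hypothetical simulated sessions, first pooled and then per archetype. The archetype names, session counts, and the `chi_square_2x2` helper are all assumptions.

```python
import math


def chi_square_2x2(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Pearson chi-square statistic and p-value for two conversion rates."""
    total = n_a + n_b
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]
    if 0 in col_totals:
        return 0.0, 1.0  # everyone or no one converted: no evidence either way
    table = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    row_totals = [n_a, n_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


# Hypothetical data: 20 simulated sessions per archetype per arm,
# stored as (control conversions, variant conversions).
sessions = 20
archetypes = {
    "skeptic": (6, 6),
    "power_user": (6, 20),  # the only cohort that actually changed
    "browser": (6, 6),
    "evaluator": (6, 6),
    "returning": (6, 6),
}

control = sum(c for c, _ in archetypes.values())
variant = sum(v for _, v in archetypes.values())
n = sessions * len(archetypes)
stat, p = chi_square_2x2(control, n, variant, n)
print(f"pooled: chi2={stat:.2f}, p={p:.3f}")

for name, (c, v) in archetypes.items():
    s, pv = chi_square_2x2(c, sessions, v, sessions)
    print(f"{name}: chi2={s:.2f}, p={pv:.3g}")
```

In this toy data the pooled test clears p < 0.05, yet four of the five cohorts are identical between arms. The whole "win" comes from one archetype, which is exactly the case a single pooled test on simulated data cannot distinguish from a genuine improvement.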

Multiverse is what's left after we worked through all four.

System overview

The pipeline is six stages, all triggered either by a user from the UI or autonomously by a cron job that scans connected analytics:

1. Signal detection: multiverse-autonomous-cron (every 6h) and autonomous-loop scan PostHog/Mixpanel/Sentry/Stripe for friction. When a funnel step's drop exceeds the project's threshold, the run kicks off.

2. Archetype discovery: multiverse-discover-archetypes asks Claude Sonnet 4.6 to generate 5 user archetypes from the product description, or pulls real cohorts from PostHog in seeded mode.

3. Variant generation: multiverse-generate-hypotheses produces N code-change hypotheses (typically 4 variants + 1 control).

4. Per-variant code-gen: multiverse-implement-variant fans out, one branch per variant, delegating to the same agent loop that powers our editor (fast-execute → execute-task → agent-loop). Each variant lives at multiverse/run-<id>/variant-<slug>.

5. Stage 1 (agent screening): multiverse-run-sim runs Browserbase + Claude Haiku 4.5 sessions agains…