( field test · aphelion )
We taught an AI to sell tickets to another star.
Aphelion doesn’t exist. We built it (a company selling one-way passage to a new world) so an autonomous agent could rewrite its homepage overnight, with no human approving a single word. Here’s how far it got, and where it stopped.
I went to sleep. My website ran a string of experiments without me, and by morning it converted better than the one I left. No one wrote the variants, read the dashboard, or declared the winner. An AI agent ran the whole loop on its own, overnight.
The block it worked on belongs to a deliberately demanding subject: the homepage of a company selling one-way passage to another star. One job, one block, no second chances at a first impression. The agent could see the analytics and run experiments. It could not see why anything worked. What follows is the run, fully instrumented.

For twenty years, improving a website meant a human at every gate. Someone forms a hypothesis. Someone writes a variant. Someone ships a test, waits a fortnight, squints at a dashboard, picks a winner, and starts again. We told ourselves the hard part was the maths. The hard part was us. People sleep, get busy, and lose interest around variant three.
So we removed the person. All of them. A single agent gets one instruction, “make the highest-traffic block convert better”, and access to two things only: the analytics and the ability to run A/B tests. From there it writes genuinely new variants. A multi-armed bandit quietly routes more traffic toward whatever’s working. A Bayesian model calls a winner only once it’s 95% sure. Then it takes what it learned and goes again.
People keep asking what the clever architecture is. There isn’t one. Agentic optimisation is a good prompt in a loop with the right tools. The trick isn’t the agent. The trick is the judge.
A ground-truth model scores every variant. The agent never sees it. It only observes the loop's analytics. The dotted line runs one way.
So the human is gone. The honest question is how much actually gets done in the hours they’re not there, and the answer is uncomfortable, and a little thrilling. It depends entirely on how brave the variant is. A dramatic swing, a genuinely different idea, separates from the control in a couple of thousand visitors. A subtle tweak can need most of a million. Same eight hours, wildly different number of answers.
A single 8-hour night settles 7.3 dramatic experiments, or less than one subtle one. Throughput isn’t fixed. It’s a function of nerve.
An 8-hour night = 10,951 visitors to the block, at 1M page views a month.
Sample sizes are per arm for a two-proportion test at 95% confidence and 80% power, from a 3% baseline, with two arms per test and page views spread evenly across the month. Sequential and Bayesian tests can stop early when the gap is real, but they don’t shrink the worst case, so these counts are the honest floor.
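The figures above can be reproduced with the textbook fixed-horizon formula for a two-proportion z-test. This is a minimal sketch, not the rig's own code. The function names, the 730.5-hour month, the reading of "dramatic" as a +100% lift, and the +5% "subtle" lift are assumptions chosen for illustration; only the 3% baseline, 95% confidence, 80% power and two arms come from the text.

```python
from statistics import NormalDist

HOURS_PER_MONTH = 365.25 / 12 * 24  # assumption: an average month of 730.5 hours


def visitors_per_arm(baseline, relative_lift, confidence=0.95, power=0.80):
    """Per-arm sample size for a two-sided two-proportion z-test (unpooled variance)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2


def visitors_in(hours, monthly_views):
    """Visitors arriving in a window, with page views spread evenly across the month."""
    return monthly_views * hours / HOURS_PER_MONTH


night = visitors_in(8, 1_000_000)            # about 10,951 visitors
bold = 2 * visitors_per_arm(0.03, 0.50)      # about 5,029 visitors across both arms
dramatic = 2 * visitors_per_arm(0.03, 1.00)  # assumed +100% lift
subtle = 2 * visitors_per_arm(0.03, 0.05)    # hypothetical +5% lift

print(f"visitors per night:        {night:,.0f}")
print(f"dramatic tests per night:  {night / dramatic:.1f}")
print(f"bold tests per night:      {night / bold:.1f}")
print(f"subtle tests per night:    {night / subtle:.2f}")
```

Under these assumptions a +100% lift gives the 7.3 tests per night quoted above, which suggests that is what "dramatic" means here. The sample size grows roughly with the inverse square of the lift, which is why a subtle variant never finishes in one night.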
Which is the quiet lesson the loop teaches itself overnight: caution is expensive. The tests that finish before morning are the ones that dared to be different, so a tireless optimiser, left alone, learns to stop tinkering and start swinging. And the more traffic you give it, the more of those swings land before you’re even awake.
The same loop, fed by a real plan’s monthly page views. Bold variants, run back to back, fully autonomous, counting only the tests that actually conclude.
| Plan | While you sleep · 8h | In a week · 7 × 24h |
|---|---|---|
| Free · 1M views / mo | 2 | 46 |
| Premium · 10M views / mo | 22 | 457 |
| Enterprise · 50M views / mo | 109 | 2,287 |
Concluded A/B tests for a bold variant (+50% lift) from a 3% baseline, about 5,029 visitors each, two arms. Subtler bets take proportionally longer; the loop just keeps running while you don’t.
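The table can be rebuilt from the plan sizes and the 5,029-visitor cost per test. This is a rough sketch under assumptions: a 730.5-hour month, evenly spread traffic, and rounding to the nearest whole test. The published week column matches 168 hours of continuous running rather than seven 8-hour nights. `PLANS` and `concluded_tests` are illustrative names, not part of any product API.

```python
HOURS_PER_MONTH = 365.25 / 12 * 24  # assumption: average month of 730.5 hours
VISITORS_PER_TEST = 5029            # bold variant, both arms, from the text

# Monthly page views per plan, as quoted in the table.
PLANS = {"Free": 1_000_000, "Premium": 10_000_000, "Enterprise": 50_000_000}


def concluded_tests(monthly_views, hours):
    """Back-to-back bold tests that fit into a window of the given length."""
    visitors = monthly_views * hours / HOURS_PER_MONTH
    return round(visitors / VISITORS_PER_TEST)  # assumption: nearest whole test


table = {
    plan: (concluded_tests(views, 8), concluded_tests(views, 168))
    for plan, views in PLANS.items()
}

for plan, (night, week) in table.items():
    print(f"{plan:<10} {night:>4} {week:>6,}")
```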
Here’s the run, end to end. The agent took the busiest block and ran five rounds against it, writing 25 variants in all. Each round it tried five different angles, the bandit pooled traffic toward the strongest, and the winner of each round became the control the next round had to beat. The amber path below is that genealogy. Click any variant to see what the agent saw, and what was secretly true.
It reached for five recognisable angles, the same ones a good copywriter cycles through:
- Control: the incumbent, last round’s winner, carried forward
- Scarcity: urgency, countdowns, “closing soon”
- Sensory / sublime: mood, atmosphere, the feeling of the thing
- Legacy / identity: who you become by choosing it
- Contrarian / wild: the unexpected, unsentimental angle
Over the five rounds the block climbed from 34.4% to 42.3% of the best score the page could theoretically reach. Round four actually went backwards before round five recovered: the running best held while the agent explored and came up short, exactly as it should.
Running best never gives ground. Round 4's best arm dipped to 39.9, but the incumbent held at 42: the loop only moves the floor up.
The full track is 100% of what's possible for this block. This run moved it from 34.4 to 42.3: real, and deliberately a demonstration of where copy alone runs out.
And here’s the honest framing the numbers force on you. This run only worked on copy (it rewrote words and nothing else), so it captured just 12% of the headroom available on that block before it plateaued. That isn’t a failure. It’s the whole point of the next section.
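The 12% figure is the share of the remaining gap that the run closed, not the raw point gain. A quick check of the arithmetic, using only the numbers quoted above (the variable names are illustrative):

```python
start, end, ceiling = 34.4, 42.3, 100.0  # percent of the best possible score

points_gained = end - start                   # about 7.9 points
headroom = ceiling - start                    # about 65.6 points available
captured = points_gained / headroom           # share of the headroom closed

print(f"{points_gained:.1f} of {headroom:.1f} points = {captured:.0%} of the headroom")
```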
Every A/B testing demo you’ve ever seen has the same problem, including the slick ones: real traffic is noisy, winners get lucky, and you have no real way of knowing the winner deserved it. You ship it and hope.
We didn’t want to hope. So we gave the agent an opponent it cannot see: a model that knows the true score of every variant the agent could possibly write. The agent never sees it. It only sees what real visitors do, exactly as you would. But we can see both sides, which means we can finally ask the question the industry steps around: when the agent says it won, did it deserve to?

The run plateaued because words have a ceiling, and the axes of craft compound rather than add. Rewriting copy gets you real gains and then runs out. Add the obvious visual moves (a colour, a button, an image) and you climb again. Only genuine multi-axis craft, where copy and structure and type and a purposeful image and a focused CTA all work at once, approaches the top.
Each step is how high that kind of work can reach, measured as a percent of the hidden optimum. The axes compound: each tier clears the one below.
- Copy only (≈54%): rewrite the words and nothing else. Real gains, but a hard ceiling: most of what’s possible isn’t language.
- Box-tick (≈75%): add a colour, a button, an image. The obvious visual moves. Better, still not the whole story.
- Craft (100%): copy and structure and type and a purposeful image and a focused CTA, all at once. The axes compound.
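One way to picture "compound rather than add" is a score that multiplies two axes instead of summing them. The sketch below is a toy model, not the hidden judge: the `score` function, the two axis names, and the axis values are assumptions, with the design values picked only so the ceilings land on the 54% / 75% / 100% tier bands listed in the reproduction notes.

```python
def score(copy, design):
    """Toy two-axis multiplicative score, as a percent of the optimum (axes in 0..1)."""
    return 100 * copy * design


# Hypothetical design-axis values, chosen to reproduce the quoted tier bands.
DESIGN_UNTOUCHED = 0.54  # copy-only work leaves design where it started
DESIGN_BOX_TICK = 0.75   # a colour, a button, an image
DESIGN_CRAFT = 1.00      # every axis working at once

# Even perfect copy cannot pass the ceiling set by the other axis.
ceilings = {
    "copy only": score(1.0, DESIGN_UNTOUCHED),
    "box-tick": score(1.0, DESIGN_BOX_TICK),
    "craft": score(1.0, DESIGN_CRAFT),
}

# The same copy improvement is worth more when the design axis is higher.
gain_weak_design = score(0.8, DESIGN_UNTOUCHED) - score(0.6, DESIGN_UNTOUCHED)
gain_full_design = score(0.8, DESIGN_CRAFT) - score(0.6, DESIGN_CRAFT)

print(ceilings)
print(f"copy gain at weak design: {gain_weak_design:.1f} points")
print(f"copy gain at full design: {gain_full_design:.1f} points")
```

In an additive model the two gains would be identical. In a multiplicative one, a weak axis caps everything else, which is the plateau the copy-only run hit.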
These ceilings are the rig's own measurement against its hidden model in the lab, not a published study.
None of this is magic, and it’s stronger for showing the parts. Two ideas carry the loop: a smarter way to spend traffic, and a smarter way to call a winner.
Fixed split: half the visitors keep landing on the losing arm for the whole test.
Bandit: allocation shifts toward the winner as evidence accrues.
The bandit stops bleeding conversions on the obvious loser instead of paying full price to confirm what it already suspects.
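The text does not say which bandit policy the loop uses. As an illustration only, here is a minimal sketch assuming Thompson sampling with Beta(1, 1) priors, one common choice. The true rates, the visitor count and the function name are hypothetical.

```python
import random


def thompson_run(true_rates, visitors, seed=0):
    """Simulate Thompson sampling and return how many visitors each arm received."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(visitors):
        # Draw a plausible conversion rate for each arm from its Beta posterior.
        draws = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
        arm = draws.index(max(draws))  # send this visitor to the best-looking arm
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [w + l for w, l in zip(wins, losses)]


# Hypothetical: control converts at 3%, the variant at 6%.
pulls = thompson_run([0.03, 0.06], visitors=20_000)
share_to_better_arm = pulls[1] / sum(pulls)
print(f"share of traffic sent to the better arm: {share_to_better_arm:.0%}")
```

A fixed 50/50 split would hold that share at one half for the whole test. Here it drifts upward as the posteriors separate, so fewer visitors are spent on the weaker arm.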
Two posteriors: what we believe each arm's true rate is. The further the variant's curve (shaded) sits past control's, the higher the probability it genuinely wins.
A Bayesian read you can check continuously without the peeking penalty, which is exactly what an always-on loop needs.
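The "probability it genuinely wins" can be estimated by sampling from the two posteriors and counting how often the variant's draw is higher. This is a sketch under assumptions, not the product's model: Beta(1, 1) priors, Monte Carlo sampling, and hypothetical visitor and conversion counts.

```python
import random


def prob_variant_beats_control(conv_c, n_c, conv_v, n_v, draws=20_000, seed=0):
    """Monte Carlo estimate of P(variant rate > control rate) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    variant_wins = 0
    for _ in range(draws):
        control = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        variant = rng.betavariate(1 + conv_v, 1 + n_v - conv_v)
        variant_wins += variant > control
    return variant_wins / draws


# Hypothetical counts: 75 of 2,500 convert on control, 113 of 2,500 on the variant.
p_win = prob_variant_beats_control(75, 2500, 113, 2500)
call_winner = p_win >= 0.95  # the 95% gate described in the text

# With identical data on both arms the estimate should sit near one half.
p_tie = prob_variant_beats_control(75, 2500, 75, 2500)

print(f"P(variant beats control) = {p_win:.3f}, call a winner: {call_winner}")
print(f"same data on both arms   = {p_tie:.3f}")
```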
Now the part most people would edit out. An autonomous agent doesn’t escape the oldest ghost in experimentation. It inherits every statistical sin we have, just faster. The most useful results we got were the ones where it was wrong, and told us.
A variant crossed 97% confidence, shipped, and went flat. Won ≠ improved.
The winner's curse at machine speed: regression to the mean catches up with anything you crown on a single test.
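The effect is easy to reproduce in a simulation. In the sketch below every arm has the same true rate, so any "winner" is luck, yet the arm that gets crowned reliably looks better than it is. The rate, the arm count, the sample size and the function name are all hypothetical.

```python
import random


def average_crowned_rate(true_rate=0.03, arms=5, visitors_per_arm=1000,
                         rounds=200, seed=0):
    """Average observed rate of the top arm when every arm is secretly identical."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(rounds):
        observed = [
            sum(rng.random() < true_rate for _ in range(visitors_per_arm))
            / visitors_per_arm
            for _ in range(arms)
        ]
        total += max(observed)  # crown whichever arm looked best this round
    return total / rounds


crowned = average_crowned_rate()
print("true rate of every arm:      3.00%")
print(f"average rate of the 'winner': {crowned:.2%}")
```

The gap between the two numbers is what disappears after shipping: the crowned arm's rate falls back toward the true value once it is measured again on fresh traffic.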
In this run, 3 of 5 rounds (1, 2, 4) had no arm clear the 95% gate.
So the "winner" was just the highest-CTR arm, and the agent says so, rather than dressing it up as proof.
This is the reason to trust it. An agent that admits when it didn't really prove anything is the one you'd hand the keys to.
And because none of this means anything if you can’t reproduce it, here’s what produced the numbers: the four repositories and their exact commits, the hidden-judge seed, the toolkit checkout the testers read, and the simulation seed. Same inputs, same run.
Two-axis multiplicative latent; tier bands copy≈54% / box-tick≈75% / craft=100%. Read from accelerate-flux/inc/latent.php.
Here’s why this matters beyond a fun demo. “Autonomous” is doing an enormous amount of lifting in this market. The whole lineup will sell you autonomy, and almost every one of them breaks the loop at the same gate: a human still has to approve the variant before it goes live. That’s autonomy inside a human-gated queue: a fine product, and a different claim.
| Product | Generates content? | CMS-native? | Closed loop shown? | Autonomy gate |
|---|---|---|---|---|
| Coframe | Yes | No | No | Human approves each change |
| Evolv / Sentient | Yes | No | No | Human approves the candidate set |
| Intellimize → Webflow | Yes | Partial | No | Human approves before publish |
| VWO Copilot | Yes | No | No | Human approves suggestions |
| Optimizely Opal | Yes | No | No | Human approves the experiment |
| Fibr | Yes | No | No | Human approves variants |
| Accelerate | Yes | Yes | Yes | No per-step human gate |
So where does this leave the person we just spent a run removing? Not gone. Moved. The job stops being “run the tests” and becomes “set the goal, set the guardrails, and build the judge.” Taste, strategy, knowing what the site is even for: that stays human. The part that disappears is the part that was never really thinking: a person at a gate, clicking approve forty times a week. The optimiser is good. The thing we actually trust is the judge that catches it.
( end of run )
I went to sleep, and my website got better without me. The headline holds, which is rarer than it sounds in this field.
Where this is going
Want to see the loop up close?
The agent, the abilities it calls, and the analytics it reads are real and shipping today inside Accelerate. We’re building the autonomous loop on top of them in the open.

