( field test · aphelion )

We taught an AI to sell tickets to another star.

Aphelion doesn’t exist. We built it (a company selling one-way passage to a new world) so an autonomous agent could rewrite its homepage overnight, with no human approving a single word. Here’s how far it got, and where it stopped.

RUN 2026-06-14 · INTERSTELLAR · complete
Start · 34.4% of optimum
End · 42.3% of optimum
Delta · +7.9 points
Headroom · 12% captured
Variants · 25 across 5 rounds
Tier · COPY · copy-only run
Visualiser feed (particle-flow capture · coming): traffic routing through the homepage into variants, in real time. It only watches; it never writes.

( SUBJECT ) 00 / what's under test

I went to sleep. My website ran a string of experiments without me, and by morning it converted better than the one I left. No one wrote the variants, read the dashboard, or declared the winner. An AI agent ran the whole loop on its own, overnight.

The block it worked on belongs to a deliberately demanding subject: the homepage of a company selling one-way passage to another star. One job, one block, no second chances at a first impression. The agent could see the analytics and run experiments. It could not see why anything worked. What follows is the run, fully instrumented.

Aphelion · the long quiet
( THE LOOP ) 01 / closed-loop

For twenty years, improving a website meant a human at every gate. Someone forms a hypothesis. Someone writes a variant. Someone ships a test, waits a fortnight, squints at a dashboard, picks a winner, and starts again. We told ourselves the hard part was the maths. The hard part was us. People sleep, get busy, and lose interest around variant three.

So we removed the person. All of them. A single agent gets one instruction ("make the highest-traffic block convert better") and access to two things only: the analytics, and the ability to run A/B tests. From there it writes genuinely new variants. A multi-armed bandit quietly routes more traffic toward whatever’s working. A Bayesian model only calls a winner once it’s 95% sure. Then the agent takes what it learned and goes again.

People keep asking what the clever architecture is. There isn’t one. Agentic optimisation is a good prompt in a loop with the right tools. The trick isn’t the agent. The trick is the judge.

The closed loop
01 · Goal · a block to improve
02 · Write variants · agent drafts copy
03 · Bandit allocates · traffic follows evidence
04 · Bayesian winner · P2BB ≥ 95%
05 · Apply · ship the winner live

Hidden judge

A ground-truth model scores every variant. The agent never sees it. It only observes the loop's analytics. The dotted line runs one way.

( WHILE YOU SLEEP ) 02 / the math of a night

So the human is gone. The honest question is how much actually gets done in the hours they’re not there, and the answer is uncomfortable, and a little thrilling. It depends entirely on how brave the variant is. A dramatic swing, a genuinely different idea, separates from the control in a couple of thousand visitors. A subtle tweak can need most of a million. Same eight hours, wildly different number of answers.

How many experiments fit in a night?

A single 8-hour night settles 7.3 dramatic experiments, or less than one subtle one. Throughput isn’t fixed. It’s a function of nerve.

At 1M views / mo, an 8-hour night = 10,951 visitors to the block.

Subtle · 3.0% → 3.1% · a nicer verb · 0.03 per night · ≈ 38 nights for one
Moderate · 3.0% → 3.6% · a sharper promise · 0.39 per night · ≈ 3 nights for one
Bold · 3.0% → 4.5% · a different angle · 2.2 per night · 5,029 visitors each
Dramatic · 3.0% → 6.0% · a new idea entirely · 7.3 per night · 1,491 visitors each

Sample sizes are computed per arm for a two-proportion test at 95% confidence and 80% power, from a 3% baseline, then doubled for the two arms in each test; page views are assumed to be spread evenly across the month. Sequential and Bayesian tests can stop early when the gap is real, but they don’t shrink the worst case, so these are the honest floor, not the catch.
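For anyone who wants to check the arithmetic, here is a minimal sketch of that calculation. It assumes the standard unpooled-variance formula for a two-sided two-proportion z-test and a month of 730.5 hours; the function names are ours, not the rig's. Under those assumptions it reproduces the Moderate, Bold and Dramatic rows. The Subtle row is sensitive to how the displayed 3.1% target was rounded, so it is left out.

```python
from statistics import NormalDist

HOURS_PER_MONTH = 365.25 * 24 / 12  # 730.5, assumes views spread evenly


def visitors_per_test(p_control, p_variant, alpha=0.05, power=0.80, arms=2):
    """Total visitors for a two-sided two-proportion z-test (unpooled variance)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    per_arm = (z_alpha + z_beta) ** 2 * variance / (p_variant - p_control) ** 2
    return arms * per_arm


def visitors_per_night(monthly_views, hours=8):
    """Visitors reaching the block during one night of the given length."""
    return monthly_views * hours / HOURS_PER_MONTH


night = visitors_per_night(1_000_000)
for label, target in [("Moderate", 0.036), ("Bold", 0.045), ("Dramatic", 0.060)]:
    total = visitors_per_test(0.03, target)
    print(f"{label}: {total:,.0f} visitors, {night / total:.2f} tests per night")
```

The lift enters the formula squared, so doubling the gap cuts the required sample by roughly four. That is why throughput swings so hard with nerve.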

Which is the quiet lesson the loop teaches itself overnight: caution is expensive. The tests that finish before morning are the ones that dared to be different, so a tireless optimiser, left alone, learns to stop tinkering and start swinging. And the more traffic you give it, the more of those swings land before you’re even awake.

Now scale it to your traffic

The same loop, fed by a real plan’s monthly page views. Bold variants, run back to back, fully autonomous, counting only the tests that actually conclude.

Plan         Monthly views     While you sleep · 8h    In a week · 7 nights
Free         1M views / mo     2                       46
Premium      10M views / mo    22                      457
Enterprise   50M views / mo    109                     2,287

Concluded A/B tests for a bold variant (+50% lift) from a 3% baseline, about 5,029 visitors each, two arms. Subtler bets take proportionally longer; the loop just keeps running while you don’t.
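The plan figures follow from the same arithmetic. The sketch below is our own reconstruction, not the page's calculator. It assumes evenly spread traffic, a 730.5-hour month, and 5,029 visitors per concluded bold test. One inference to flag: the weekly column reproduces only when the week is counted as 168 continuous hours of traffic rather than seven 8-hour nights.

```python
HOURS_PER_MONTH = 365.25 * 24 / 12  # 730.5, assumes views spread evenly
VISITORS_PER_BOLD_TEST = 5029       # figure quoted in the note above


def concluded_tests(monthly_views, hours):
    """Bold tests that can conclude in a window of the given length."""
    return monthly_views * hours / HOURS_PER_MONTH / VISITORS_PER_BOLD_TEST


plans = {"Free": 1_000_000, "Premium": 10_000_000, "Enterprise": 50_000_000}
for plan, views in plans.items():
    night = concluded_tests(views, hours=8)
    week = concluded_tests(views, hours=7 * 24)  # assumed: a full 168-hour week
    print(f"{plan}: {night:.0f} per night, {week:,.0f} per week")
```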

( LINEAGE / 754 ) 03 / genealogy

Here’s the run, end to end. The agent took the busiest block and ran five rounds against it, writing 25 variants in all. Each round it tried five different angles, the bandit pooled traffic toward the strongest, and the winner of each round became the control the next round had to beat. The amber path below is that genealogy. Click any variant to see what the agent saw, and what was secretly true.

Five rounds, 25 variants the agent wrote and shipped. The amber path is the genealogy: each round’s winner becomes the next round’s control.

It reached for five recognisable angles, the same ones a good copywriter cycles through:

  • Control: the incumbent, last round’s winner, carried forward
  • Scarcity: urgency, countdowns, “closing soon”
  • Sensory / sublime: mood, atmosphere, the feeling of the thing
  • Legacy / identity: who you become by choosing it
  • Contrarian / wild: the unexpected, unsentimental angle

( CLIMB ) 04 / running-best

Over the five rounds the block climbed from 34.4% to 42.3% of the best score the page could theoretically reach. Round four actually went backwards before round five recovered: its best arm scored below the incumbent, so the running best held while the agent explored and came up short, exactly as it should.

The climb, round by round

Running best never gives ground. Round 4's best arm dipped to 39.9, but the incumbent held at 42: the loop only moves the floor up.

[Chart: running best and round best by round, R1 to R5; y-axis is % of optimum, 30 to 45. Round winners: R1 Contrarian-wild, R2 Countdown, R3 Legacy-identity, R4 Manifesto, R5 Legacy-identity.]
Against the theoretical optimum

The full track is 100% of what's possible for this block. This run moved it from 34.4 to 42.3: real, and deliberately a demonstration of where copy alone runs out.

Baseline · 34.4%
Final · 42.3%
Optimum · 100%
+7.9 pts captured · 12% of the available headroom · copy-tier plateau

And here’s the honest framing the numbers force on you. This run only worked on copy (it rewrote words and nothing else), so it captured just 12% of the headroom available on that block before it plateaued. That isn’t a failure. It’s the whole point of the next section.
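Both headline numbers are easy to recompute. The sketch below takes the per-round best scores reported in the run's event log, derives the running best as a cumulative maximum, and works out the share of headroom captured. The variable names are ours.

```python
from itertools import accumulate

# Best arm per round, as % of optimum (rounds 1 to 5).
round_best = [36.5, 37.9, 42.0, 39.9, 42.3]

# The running best is a cumulative max: a weaker round cannot lower it.
running_best = list(accumulate(round_best, max))

baseline, final, optimum = 34.4, 42.3, 100.0
headroom_captured = (final - baseline) / (optimum - baseline)

print(running_best)                # round 4 holds at 42.0
print(f"{headroom_captured:.0%}")  # 7.9 points out of 65.6 available
```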

( GROUND TRUTH ) 05 / hidden judge

Every A/B testing demo you’ve ever seen has the same problem, including the slick ones: real traffic is noisy, winners get lucky, and you have no real way of knowing the winner deserved it. You ship it and hope.

We didn’t want to hope. So we gave the agent an opponent it cannot see: a model that knows the true score of every variant the agent could possibly write. The agent never sees it. It only sees what real visitors do, exactly as you would. But we can see both sides, which means we can finally ask the question the industry steps around: when the agent says it won, did it deserve to?

Ground truth · hidden from the agent

The run plateaued because words have a ceiling, and the axes of craft compound rather than add. Rewriting copy gets you real gains and then runs out. Add the obvious visual moves (a colour, a button, an image) and you climb again. Only genuine multi-axis craft, where copy and structure and type and a purposeful image and a focused CTA all work at once, approaches the top.

Where the ceiling sits

Each step is how high that kind of work can reach, measured as a percent of the hidden optimum. The axes compound: each tier clears the one below.

Copy only · 54%

Rewrite the words and nothing else. Real gains, but a hard ceiling: most of what’s possible isn’t language.

Copy + surface craft · 75%

Add a colour, a button, an image. The obvious visual moves. Better, still not the whole story.

Multi-axis craft · 100%

Copy and structure and type and a purposeful image and a focused CTA, all at once. The axes compound.

These ceilings are our own measurement against the rig's hidden model: the lab’s reading, not a published study.

( MECHANISM ) 06 / bandit + P2BB

None of this is magic, and it’s stronger for showing the parts. Two ideas carry the loop: a smarter way to spend traffic, and a smarter way to call a winner.

Where the traffic goes
Fixed 50/50 split · Winner 50% · Loser 50%

Half the visitors keep landing on the losing arm for the whole test.

Multi-armed bandit · Winner 86% · Loser 14%

Allocation shifts toward the winner as evidence accrues.

The bandit stops bleeding conversions on the obvious loser instead of paying full price to confirm what it already suspects.
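The write-up doesn't say which bandit algorithm the loop runs. As an illustration only, here is a minimal sketch that assumes Thompson sampling with Beta posteriors, one common choice. The function name, the simulated rates and the seed are all hypothetical.

```python
import random


def thompson_allocation(true_rates, visitors, seed=42):
    """Route simulated visitors with Thompson sampling; return traffic shares."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(visitors):
        # Draw one plausible conversion rate per arm from its Beta posterior
        # (uniform prior), then send the visitor to the highest draw.
        draws = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
        arm = draws.index(max(draws))
        # The true rate only simulates the visitor; routing never reads it.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [(w + l) / visitors for w, l in zip(wins, losses)]


# Made-up example: a 3% control against a 6% variant.
shares = thompson_allocation([0.03, 0.06], visitors=20_000)
print([f"{share:.0%}" for share in shares])
```

Early on the draws overlap and traffic splits roughly evenly. As conversions accumulate, the weaker arm's posterior rarely produces the top draw, so its share shrinks without anyone setting a schedule.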

Probability to beat baseline

Two posteriors: what we believe each arm's true rate is. The further the variant's curve (shaded) sits past control's, the higher the probability it genuinely wins.

Control · Variant
95% · the gate: once the shaded probability clears it, the loop ships the winner.

A Bayesian read you can check continuously without the peeking penalty, which is exactly what an always-on loop needs.
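How the rig computes its probability to beat baseline isn't stated. A common approach, assumed here, is to give each arm a Beta posterior over its conversion rate and estimate the probability by Monte Carlo. The function name, the uniform prior and the example counts are ours.

```python
import random

GATE = 0.95


def p2bb(control, variant, draws=20_000, seed=42):
    """Monte Carlo estimate of P(variant rate > control rate).

    Each arm is a (conversions, visitors) pair with a uniform Beta(1, 1) prior.
    """
    rng = random.Random(seed)
    (c_conv, c_n), (v_conv, v_n) = control, variant
    beats = 0
    for _ in range(draws):
        c = rng.betavariate(c_conv + 1, c_n - c_conv + 1)
        v = rng.betavariate(v_conv + 1, v_n - v_conv + 1)
        beats += v > c
    return beats / draws


# Made-up counts: a clear gap clears the gate, a one-conversion gap does not.
print(p2bb((30, 1000), (60, 1000)) >= GATE)
print(p2bb((30, 1000), (31, 1000)) >= GATE)
```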

( INCIDENTS ) 07 / no-gate rounds

Now the part most people would edit out. An autonomous agent doesn’t escape the oldest ghost in experimentation. It inherits every statistical sin we have, just faster. The most useful results we got were the ones where it was wrong, and told us.

Event log · 5 rounds · 3 cautions
R1 · establish-the-fact · best 36.5% · run 36.5% · Contrarian-wild: the unsentimental sell
R2 · make-the-distance-felt · best 37.9% · run 37.9% · Countdown: the window
R3 · the-weight-of-the-crossing · best 42.0% · run 42.0% · Legacy-identity: the time cost
R4 · the-decision-itself · best 39.9% · run 42.0% · Manifesto: the second home
R5 · the-sublime-close · best 42.3% · run 42.3% · Legacy-identity: the arrival name
! no_gate · Round 1: no non-control arm reached p2bb>=0.95; winner = highest-ctr arm 4 (gated:false).
! no_gate · Round 2: no non-control arm reached p2bb>=0.95; winner = highest-ctr arm 2 (gated:false).
! no_gate · Round 4: no non-control arm reached p2bb>=0.95; winner = highest-ctr arm 1 (gated:false).

The honest texture

A variant crossed 97% confidence, shipped, and went flat. Won ≠ improved.

The winner's curse at machine speed: regression to the mean catches up with anything you crown on a single test.

In this run, 3 of 5 rounds (1, 2, 4) had no arm clear the 95% gate.

So the "winner" was just the highest-CTR arm, and the agent says so, rather than dressing it up as proof.

This is the reason to trust it. An agent that admits when it didn't really prove anything is the one you'd hand the keys to.
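The behaviour in the event log can be sketched as a gate with an explicit fallback. This is our reading of the log lines, not the rig's code. The field names, the arm indexing, the tie-breaking, and whether the control competes in the fallback are all assumptions.

```python
GATE = 0.95


def pick_winner(arms):
    """Pick a round's winner and report whether it cleared the gate.

    arms: list of dicts with 'ctr' and 'p2bb'; arm 0 is assumed to be control.
    Returns (winner_index, gated).
    """
    cleared = [i for i in range(1, len(arms)) if arms[i]["p2bb"] >= GATE]
    if cleared:
        return max(cleared, key=lambda i: arms[i]["p2bb"]), True
    # No challenger cleared the gate: fall back to raw CTR and flag it,
    # rather than presenting the pick as proof.
    best = max(range(len(arms)), key=lambda i: arms[i]["ctr"])
    return best, False


# Made-up numbers for illustration.
no_gate_round = [
    {"ctr": 0.030, "p2bb": 0.50},
    {"ctr": 0.034, "p2bb": 0.81},
    {"ctr": 0.032, "p2bb": 0.66},
]
gated_round = [
    {"ctr": 0.030, "p2bb": 0.50},
    {"ctr": 0.045, "p2bb": 0.97},
]
print(pick_winner(no_gate_round))  # fallback pick, gated is False
print(pick_winner(gated_round))    # gate cleared, gated is True
```

Returning the flag alongside the winner is the design choice that matters: anything downstream can tell a proven win from a best guess.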

( PROVENANCE ) 08 / build manifest

And because none of this means anything if you can’t reproduce it, here’s what produced the numbers: the four repositories and their exact commits, the hidden-judge seed, the toolkit checkout the testers read, and the simulation seed. Same inputs, same run.

Repositories · 4
altis-accelerate · fix/bandit-live-results · f5f339f
accelerate-ai-toolkit · release/1.5.0 · afb83e1
accelerate-flux · main · 414af56
accelerate-flux-ai-viz · ops/doctor-stack-recovery-macmini · f72f8fa

Ground truth · hidden judge
weight_seed · flux-latent-v2.2

Two-axis multiplicative latent; tier bands copy≈54% / box-tick≈75% / craft=100%. Read from accelerate-flux/inc/latent.php.

Toolkit skill · concept-driven testing
checkout · release/1.5.0 · afb83e1

Reproducibility
simulated · true · seed 42

( FIELD ) 09 / where the loop breaks

Here’s why this matters beyond a fun demo. “Autonomous” is doing an enormous amount of lifting in this market. The whole lineup will sell you autonomy, and almost every one of them breaks the loop at the same gate: a human still has to approve the variant before it goes live. That’s autonomy inside a human-gated queue: a fine product, and a different claim.

Product                 Generates   CMS-native   Closed loop   Human gate
Coframe                 Yes         No           No            Human approves each change
Evolv / Sentient        Yes         No           No            Human approves the candidate set
Intellimize → Webflow   Yes         Partial      No            Human approves before publish
VWO Copilot             Yes         No           No            Human approves suggestions
Optimizely Opal         Yes         No           No            Human approves the experiment
Fibr                    Yes         No           No            Human approves variants
Accelerate              Yes         Yes          Yes           No per-step human gate

So where does this leave the person we just spent a run removing? Not gone. Moved. The job stops being “run the tests” and becomes “set the goal, set the guardrails, and build the judge.” Taste, strategy, knowing what the site is even for: that stays human. The part that disappears is the part that was never really thinking: a person at a gate, clicking approve forty times a week. The optimiser is good. The thing we actually trust is the judge that catches it.

( end of run )

I went to sleep, and my website got better without me. The headline holds, which is rarer than it sounds in this field.

Where this is going

Want to see the loop up close?

The agent, the abilities it calls, and the analytics it reads are real and shipping today inside Accelerate. We’re building the autonomous loop on top of them in the open.
