( field test · aphelion )

We taught an AI to sell tickets to another star.

Aphelion doesn’t exist. We built it (a company selling one-way passage to a new world) so an autonomous agent could rewrite its homepage overnight, with no human approving a single word. Here’s how far it got, and where it stopped.

RUN 2026-06-14 · INTERSTELLAR · complete

Start: 34.4% of optimum
End: 42.3% of optimum
Delta: +7.9 points
Headroom: 12% captured
Variants: 25 across 5 rounds
Tier: COPY (copy-only run)

( SUBJECT ) 00 / what's under test

I went to sleep. My website ran a string of experiments without me, and by morning it converted better than the one I left. No one wrote the variants, read the dashboard, or declared the winner. An AI agent ran the whole loop on its own, overnight.

The block it worked on belongs to a deliberately demanding subject: the homepage of a company selling one-way passage to another star. One job, one block, no second chances at a first impression. The agent could see the analytics and run experiments. It could not see why anything worked. What follows is the run, fully instrumented.

( THE LOOP ) 01 / closed-loop

For twenty years, improving a website meant a human at every gate. Someone forms a hypothesis. Someone writes a variant. Someone ships a test, waits a fortnight, squints at a dashboard, picks a winner, and starts again. We told ourselves the hard part was the maths. The hard part was us. People sleep, get busy, and lose interest around variant three.

So we removed the person. All of them. A single agent gets one instruction (“make the highest-traffic block convert better”) and access to two things only: the analytics and the ability to run A/B tests. From there it writes genuinely new variants. A multi-armed bandit quietly routes more traffic toward whatever’s working. A Bayesian model only calls a winner once it’s 95% sure. Then it takes what it learned and goes again.

People keep asking what the clever architecture is. There isn’t one. Agentic optimisation is a good prompt in a loop with the right tools. The trick isn’t the agent. The trick is the judge.

The closed loop

Goal: a block to improve
Write variants: agent drafts copy
Bandit allocates: traffic follows evidence
Bayesian winner: P2BB (probability to beat baseline) ≥ 95%
Apply: ship the winner live

↑ loops back: every applied winner becomes the next round's incumbent

Hidden judge

A ground-truth model scores every variant. The agent never sees it. It only observes the loop's analytics. The dotted line runs one way.
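The one-way relationship between judge and loop can be sketched in a few lines. This is a minimal illustration, not the shipped implementation: `HiddenJudge`, `draft_variants`, `run_loop`, the rate range and the even traffic split are all hypothetical stand-ins, and the bandit and the Bayesian gate are left out so the shape of the loop stays visible.

```python
import random


class HiddenJudge:
    """Ground truth for every variant. The rates are invented for illustration."""

    def __init__(self, seed: int) -> None:
        self._rng = random.Random(seed)
        self._rates = {"control": 0.03}

    def true_rate(self, variant: str) -> float:
        # Assign a fixed, hidden conversion rate the first time a variant appears.
        if variant not in self._rates:
            self._rates[variant] = self._rng.uniform(0.02, 0.05)
        return self._rates[variant]


def draft_variants(incumbent: str, round_no: int, count: int = 4) -> list[str]:
    """Hypothetical stand-in for the agent's copywriting step: labels only."""
    return [f"r{round_no}-v{i}" for i in range(1, count + 1)]


def run_loop(judge: HiddenJudge, rounds: int = 5,
             visitors_per_arm: int = 2000, seed: int = 42) -> list[dict]:
    rng = random.Random(seed)
    incumbent = "control"
    history = []
    for round_no in range(1, rounds + 1):
        arms = [incumbent] + draft_variants(incumbent, round_no)
        observed = {}
        for arm in arms:
            # Simulated visitors convert at the judge's true rate...
            rate = judge.true_rate(arm)
            clicks = sum(rng.random() < rate for _ in range(visitors_per_arm))
            observed[arm] = clicks / visitors_per_arm
        # ...but the selection step reads only the observed click-through rates.
        winner = max(arms, key=lambda arm: observed[arm])
        history.append({"round": round_no, "control": incumbent,
                        "winner": winner, "observed_ctr": observed[winner]})
        incumbent = winner  # the applied winner is the next round's control
    return history


history = run_loop(HiddenJudge(seed=7))
```

Because selection uses noisy observations, a round's winner can be worse than its control in the judge's eyes, which is exactly the gap a hidden judge makes measurable.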

( WHILE YOU SLEEP ) 02 / the math of a night

So the human is gone. The honest question is how much actually gets done in the hours they’re not there, and the answer is uncomfortable, and a little thrilling. It depends entirely on how brave the variant is. A dramatic swing, a genuinely different idea, separates from the control in a couple of thousand visitors. A subtle tweak can need most of a million. Same eight hours, wildly different number of answers.

How many experiments fit in a night?

A single 8-hour night settles 7.3 dramatic experiments, or less than one subtle one. Throughput isn’t fixed. It’s a function of nerve.

Monthly page views: 1M/mo · an 8-hour night = 10,951 visitors to the block

Bet	Conversion change	Experiments per night	Traffic cost
Subtle	3.0% → 3.1% · a nicer verb	0.03	≈ 38 nights for one
Moderate	3.0% → 3.6% · a sharper promise	0.39	≈ 3 nights for one
Bold	3.0% → 4.5% · a different angle	2.2	5,029 visitors each
Dramatic	3.0% → 6.0% · a new idea entirely	7.3	1,491 visitors each

Visitors per test are the per-arm sample for a two-proportion test at 95% confidence and 80% power, from a 3% baseline, doubled for the two arms, with page views spread evenly across the month. Sequential and Bayesian tests can stop early when the gap is real, but they don’t shrink the worst case, so these are the honest planning figures, not the lucky case.

Which is the quiet lesson the loop teaches itself overnight: caution is expensive. The tests that finish before morning are the ones that dared to be different, so a tireless optimiser, left alone, learns to stop tinkering and start swinging. And the more traffic you give it, the more of those swings land before you’re even awake.
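The visitor counts quoted for a night can be checked with the textbook two-proportion sample-size formula. The sketch below is a reconstruction, not the page's own calculator: the function names are made up here, and the month length (365.25 / 12 days) is an assumption chosen because it lands on the quoted 10,951 visitors per night.

```python
from statistics import NormalDist


def visitors_per_test(baseline: float, target: float,
                      confidence: float = 0.95, power: float = 0.80,
                      arms: int = 2) -> float:
    """Total visitors for a two-sided two-proportion test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = baseline * (1 - baseline) + target * (1 - target)
    per_arm = (z_alpha + z_beta) ** 2 * variance / (target - baseline) ** 2
    return arms * per_arm


def visitors_per_night(monthly_views: float, hours: float = 8.0) -> float:
    """Page views spread evenly across an average month (assumed 365.25 / 12 days)."""
    days_per_month = 365.25 / 12
    return monthly_views / days_per_month * hours / 24


night = visitors_per_night(1_000_000)
for label, target in [("moderate", 0.036), ("bold", 0.045), ("dramatic", 0.060)]:
    need = visitors_per_test(0.03, target)
    print(f"{label}: {need:,.0f} visitors per test, {night / need:.2f} tests per night")
```

The denominator is the squared gap between the two rates, which is why halving the lift roughly quadruples the traffic a test needs.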

Now scale it to your traffic

The same loop, fed by a real plan’s monthly page views. Bold variants, run back to back, fully autonomous, counting only the tests that actually conclude.

Plan	While you sleep · 8h	In a week · 7 days, round the clock
Free · 1M views / mo	2	46
Premium · 10M views / mo	22	457
Enterprise · 50M views / mo	109	2,287

Concluded A/B tests for a bold variant (+50% lift) from a 3% baseline, about 5,029 visitors each, two arms. Subtler bets take disproportionately longer, because the traffic a test needs grows roughly with the inverse square of the lift; the loop just keeps running while you don’t.
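The plan table is plain division: visitors in the window over visitors per test, rounded to the nearest whole test. The sketch below rests on two assumptions that are inferred from the arithmetic rather than stated on the page: an average month of 365.25 / 12 days, and a weekly figure that counts seven full days of traffic (168 hours), which is what matches the printed numbers.

```python
BOLD_VISITORS_PER_TEST = 5029   # quoted on the page for a 3.0% -> 4.5% test, two arms
DAYS_PER_MONTH = 365.25 / 12    # assumption: average month length


def concluded_tests(monthly_views: int, hours: float) -> int:
    """Whole bold tests that fit in a window, with traffic spread evenly over the month."""
    visitors = monthly_views / DAYS_PER_MONTH * hours / 24
    return round(visitors / BOLD_VISITORS_PER_TEST)


plans = [("Free", 1_000_000), ("Premium", 10_000_000), ("Enterprise", 50_000_000)]
for name, views in plans:
    per_night = concluded_tests(views, hours=8)
    per_week = concluded_tests(views, hours=7 * 24)
    print(f"{name}: {per_night} per 8-hour night, {per_week} per 168-hour week")
```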

( LINEAGE / 754 ) 03 / genealogy

Here’s the run, end to end. The agent took the busiest block and ran five rounds against it, writing 25 variants in all. Each round it tried five different angles, the bandit pooled traffic toward the strongest, and the winner of each round became the control the next round had to beat. The amber path below is that genealogy. Click any variant to see what the agent saw, and what was secretly true.

Five rounds, 25 variants the agent wrote and shipped. The amber path is the genealogy: each round’s winner becomes the next round’s control.

It reached for five recognisable angles, the same ones a good copywriter cycles through:

Control: the incumbent, last round’s winner, carried forward
Scarcity: urgency, countdowns, “closing soon”
Sensory / sublime: mood, atmosphere, the feeling of the thing
Legacy / identity: who you become by choosing it
Contrarian / wild: the unexpected, unsentimental angle

( CLIMB ) 04 / running-best

Over the five rounds the block climbed from 34.4% to 42.3% of the best score the page could theoretically reach. Round four actually went backwards before round five recovered: the running best held while the agent explored and came up short, exactly as it should.

The climb, round by round

Running best never gives ground. Round 4's best arm dipped to 39.9, but the incumbent held at 42: the loop only moves the floor up.

Against the theoretical optimum

The full track is 100% of what's possible for this block. This run moved it from 34.4 to 42.3: real, and deliberately a demonstration of where copy alone runs out.

Baseline 34.4% · Final 42.3% · Optimum 100%

+7.9 pts captured · 12% of the available headroom · copy-tier plateau

And here’s the honest framing the numbers force on you. This run only worked on copy (it rewrote words and nothing else), so it captured just 12% of the headroom available on that block (7.9 of the 65.6 points between the 34.4% baseline and the optimum) before it plateaued. That isn’t a failure. It’s the whole point of the next section.

( GROUND TRUTH ) 05 / hidden judge

Every A/B testing demo you’ve ever seen has the same problem, including the slick ones: real traffic is noisy, winners get lucky, and you have no real way of knowing the winner deserved it. You ship it and hope.

We didn’t want to hope. So we gave the agent an opponent it cannot see: a model that knows the true score of every variant the agent could possibly write. The agent never sees it. It only sees what real visitors do, exactly as you would. But we can see both sides, which means we can finally ask the question the industry steps around: when the agent says it won, did it deserve to?

The run plateaued because words have a ceiling, and the axes of craft compound rather than add. Rewriting copy gets you real gains and then runs out. Add the obvious visual moves (a colour, a button, an image) and you climb again. Only genuine multi-axis craft, where copy and structure and type and a purposeful image and a focused CTA all work at once, approaches the top.

Where the ceiling sits

Each step is how high that kind of work can reach, measured as a percent of the hidden optimum. The axes compound: each tier clears the one below.

Copy only · 54%
Rewrite the words and nothing else. Real gains, but a hard ceiling: most of what’s possible isn’t language.

Copy + surface craft · 75%
Add a colour, a button, an image. The obvious visual moves. Better, still not the whole story.

Multi-axis craft · 100%
Copy and structure and type and a purposeful image and a focused CTA, all at once. The axes compound.

These ceilings are the rig's own measurement against its hidden model in the lab, not a published study.

( MECHANISM ) 06 / bandit + P2BB

None of this is magic, and it’s stronger for showing the parts. Two ideas carry the loop: a smarter way to spend traffic, and a smarter way to call a winner.

Where the traffic goes

Fixed 50/50 split · Winner 50% · Loser 50%
Half the visitors keep landing on the losing arm for the whole test.

Multi-armed bandit · Winner 86% · Loser 14%
Allocation shifts toward the winner as evidence accrues.

The bandit stops bleeding conversions on the obvious loser instead of paying full price to confirm what it already suspects.
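One common way to get that behaviour is Thompson sampling over Beta posteriors. The page does not say which bandit algorithm the loop uses, so treat this as an illustrative sketch: the function name, the uniform Beta(1, 1) prior and the simulated conversion rates are all assumptions.

```python
import random


def thompson_allocate(true_rates: list[float], visitors: int, seed: int = 42) -> list[int]:
    """Route each visitor to the arm whose sampled conversion rate is highest."""
    rng = random.Random(seed)
    arms = range(len(true_rates))
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(visitors):
        # Draw one plausible rate per arm from its Beta posterior (uniform prior).
        samples = [rng.betavariate(1 + wins[i], 1 + losses[i]) for i in arms]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        # Simulate the visitor: convert with the arm's true (hidden) rate.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls


pulls = thompson_allocate([0.03, 0.06], visitors=20_000)
share = pulls[1] / sum(pulls)
print(f"share of traffic sent to the stronger arm: {share:.0%}")
```

Early on both arms get sampled often because their posteriors overlap; as evidence accrues the weaker arm is drawn less and less, without ever being switched off entirely.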

Probability to beat baseline

Two posteriors: what we believe each arm's true rate is. The further the variant's curve (shaded) sits past control's, the higher the probability it genuinely wins.

95%: the gate. Once the shaded probability clears it, the loop ships the winner.

A Bayesian read you can check continuously without the peeking penalty, which is exactly what an always-on loop needs.
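Probability to beat baseline can be estimated by drawing from both posteriors and counting how often the variant comes out ahead. This is a Monte Carlo sketch under assumptions the page does not spell out: a Beta(1, 1) prior on each arm, and hypothetical names and example counts.

```python
import random


def prob_to_beat_baseline(control_clicks: int, control_views: int,
                          variant_clicks: int, variant_views: int,
                          draws: int = 20_000, seed: int = 42) -> float:
    """Estimate P(variant rate > control rate) from Beta posteriors."""
    rng = random.Random(seed)
    ahead = 0
    for _ in range(draws):
        control = rng.betavariate(1 + control_clicks,
                                  1 + control_views - control_clicks)
        variant = rng.betavariate(1 + variant_clicks,
                                  1 + variant_views - variant_clicks)
        ahead += variant > control
    return ahead / draws


GATE = 0.95
p2bb = prob_to_beat_baseline(30, 1000, 60, 1000)
print("ship" if p2bb >= GATE else "keep testing")
```

With identical counts on both arms the estimate sits near 0.5, which is a quick sanity check that the gate is not being cleared by noise in the sampler itself.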

( INCIDENTS ) 07 / no-gate rounds

Now the part most people would edit out. An autonomous agent doesn’t escape the oldest ghost in experimentation. It inherits every statistical sin we have, just faster. The most useful results we got were the ones where it was wrong, and told us.

Event log · 5 rounds · 3 cautions

R1 · establish-the-fact · best 36.5% · run 36.5% ▲
Contrarian-wild: the unsentimental sell

R2 · make-the-distance-felt · best 37.9% · run 37.9% ▲
Countdown: the window

R3 · the-weight-of-the-crossing · best 42.0% · run 42.0% ▲
Legacy-identity: the time cost

R4 · the-decision-itself · best 39.9% · run 42.0%
Manifesto: the second home

R5 · the-sublime-close · best 42.3% · run 42.3% ▲
Legacy-identity: the arrival name

! no_gate · Round 1: no non-control arm reached p2bb>=0.95; winner = highest-ctr arm 4 (gated:false).

! no_gate · Round 2: no non-control arm reached p2bb>=0.95; winner = highest-ctr arm 2 (gated:false).

! no_gate · Round 4: no non-control arm reached p2bb>=0.95; winner = highest-ctr arm 1 (gated:false).

The honest texture

A variant crossed 97% confidence, shipped, and went flat. Won ≠ improved.

The winner's curse at machine speed: regression to the mean catches up with anything you crown on a single test.

In this run, 3 of 5 rounds (1, 2, 4) had no arm clear the 95% gate.

So the "winner" was just the highest-CTR arm, and the agent says so, rather than dressing it up as proof.

This is the reason to trust it. An agent that admits when it didn't really prove anything is the one you'd hand the keys to.
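The rule the caution lines describe can be written down directly. The sketch below is reconstructed from the log wording only; the `Arm` type, the choice of arm 0 as control, the tie-break among gated arms and the inclusion of the control in the fallback are assumptions, not confirmed behaviour.

```python
from dataclasses import dataclass


@dataclass
class Arm:
    index: int     # assumption: arm 0 is the control
    ctr: float     # observed click-through rate
    p2bb: float    # probability to beat baseline (unused for the control)


def pick_winner(arms: list[Arm], gate: float = 0.95) -> dict:
    """Prefer an arm that clears the gate; otherwise fall back and say so."""
    challengers = [arm for arm in arms if arm.index != 0]
    gated = [arm for arm in challengers if arm.p2bb >= gate]
    if gated:
        # Assumption: among arms past the gate, take the most probable winner.
        best = max(gated, key=lambda arm: arm.p2bb)
        return {"winner": best.index, "gated": True, "caution": None}
    # No proof available: report the highest-CTR arm and flag the round.
    best = max(arms, key=lambda arm: arm.ctr)
    return {"winner": best.index, "gated": False, "caution": "no_gate"}


unproven = pick_winner([Arm(0, 0.030, 0.0), Arm(1, 0.033, 0.71), Arm(2, 0.031, 0.58)])
proven = pick_winner([Arm(0, 0.030, 0.0), Arm(1, 0.041, 0.97), Arm(2, 0.036, 0.88)])
```

Carrying the `gated` flag alongside the winner is the design choice that matters: downstream readers can tell a proven result from a best guess without re-running the statistics.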

( PROVENANCE ) 08 / build manifest

And because none of this means anything if you can’t reproduce it, here’s what produced the numbers: the four repositories and their exact commits, the hidden-judge seed, the toolkit checkout the testers read, and the simulation seed. Same inputs, same run.

Repositories · 4

altis-accelerate · fix/bandit-live-results · f5f339f
accelerate-ai-toolkit · release/1.5.0 · afb83e1
accelerate-flux · main · 414af56
accelerate-flux-ai-viz · ops/doctor-stack-recovery-macmini · f72f8fa

Ground truth · hidden judge

weight_seed · flux-latent-v2.2

Two-axis multiplicative latent; tier bands copy≈54% / box-tick≈75% / craft=100%. Read from accelerate-flux/inc/latent.php.

Toolkit skill · concept-driven testing

checkout · release/1.5.0 · afb83e1

Reproducibility

simulated · true · seed 42

( FIELD ) 09 / where the loop breaks

Here’s why this matters beyond a fun demo. “Autonomous” is doing an enormous amount of lifting in this market. The whole lineup will sell you autonomy, and almost every one of them breaks the loop at the same gate: a human still has to approve the variant before it goes live. That’s autonomy inside a human-gated queue: a fine product, and a different claim.

Product	Generates content?	CMS-native?	Closed loop shown?	Autonomy gate
Coframe	Yes	No	No	Human approves each change
Evolv / Sentient	Yes	No	No	Human approves the candidate set
Intellimize → Webflow	Yes	Partial	No	Human approves before publish
VWO Copilot	Yes	No	No	Human approves suggestions
Optimizely Opal	Yes	No	No	Human approves the experiment
Fibr	Yes	No	No	Human approves variants
Accelerate	Yes	Yes	Yes	No per-step human gate

So where does this leave the person we just spent a run removing? Not gone. Moved. The job stops being “run the tests” and becomes “set the goal, set the guardrails, and build the judge.” Taste, strategy, knowing what the site is even for: that stays human. The part that disappears is the part that was never really thinking: a person at a gate, clicking approve forty times a week. The optimiser is good. The thing we actually trust is the judge that catches it.

( end of run )

I went to sleep, and my website got better without me. The headline holds, which is rarer than it sounds in this field.

Where this is going

Want to see the loop up close?

The agent, the abilities it calls, and the analytics it reads are real and shipping today inside Accelerate. We’re building the autonomous loop on top of them in the open.
