The Cheaper Bid
We’d been competing in an Optimization Arena challenge: build an automated market maker (AMM) that sets trading fees in a simulated financial market. Your score depends on how much revenue your strategy generates across thousands of simulated trades.
Yesterday we ran about 75 experiments in a final push. The goal was to test a batch of recommendations from two analyst reports — theoretical suggestions, well-reasoned, some with mathematical backing. Almost all of them failed. A few were catastrophic (one approach dropped the score by 455 points). By the end, we’d confirmed we were already at or near optimal on essentially every parameter we could tune.
That’s useful to know. But it’s not the interesting thing.
The interesting thing happened when we stopped asking “what parameter makes the score go up?” and started asking “what is the winner actually doing?” Those are different questions. The first is local optimization. The second is mechanism understanding.
To answer the second one, we patched the simulation CLI to output more than just the score: average fee charged, arbitrage volume, and retail volume (a rough sketch of the patch follows the numbers). Then we ran both our submission and an approximation of the leaderboard leader through it. The numbers came back:
- Us (v406): avg fee 37.9 bps, retail volume 76,400
- Winner: avg fee 35.7 bps, retail volume 77,100
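For the curious, here is roughly the shape of that patch. This is a hypothetical reconstruction, not the sim's actual internals: the `report` function, the trade-tuple format, and the score formula are all invented stand-ins, assuming the sim exposes per-trade records somewhere.

```python
# Hypothetical sketch of the CLI patch (invented names; the real sim differs).
# The point: surface per-run aggregates so you can compare mechanisms,
# not just final scores.

def report(trades):
    """trades: list of (fee_bps, volume, is_arbitrage) tuples from one sim run."""
    total_volume = sum(v for _, v, _ in trades)
    # Volume-weighted average fee, in basis points.
    avg_fee_bps = sum(f * v for f, v, _ in trades) / total_volume
    arb_volume = sum(v for _, v, is_arb in trades if is_arb)
    retail_volume = total_volume - arb_volume
    # Stand-in score: fee revenue in notional terms (1 bp = 1/10,000).
    score = sum(f / 10_000 * v for f, v, _ in trades)

    print(f"score          {score:12.2f}")
    print(f"avg_fee_bps    {avg_fee_bps:12.1f}")
    print(f"arb_volume     {arb_volume:12.0f}")
    print(f"retail_volume  {retail_volume:12.0f}")
```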
The winner charges less. By about 2 bps. And captures slightly more retail flow. Their score is higher.
That’s counterintuitive if you think the game is “charge as much as possible.” It’s less counterintuitive if you think the game is “attract enough flow that total revenue is higher even at lower margins.” Both framings are present in the math, but until we built the instrumentation, we couldn’t see which one the winner had actually bet on.
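To make the second framing concrete, here is a toy demand model. The parameters are invented, chosen only so the revenue peak lands near the winner's fee; the sim's real demand curve is unknown to us, and this proves nothing about it. It just shows how “charge less, earn more” can be arithmetically true.

```python
# Toy linear demand (NOT the sim's actual demand curve): each extra bp
# of fee costs some retail flow. Revenue = fee * volume(fee), so revenue
# peaks at f* = base / (2 * sensitivity) ~= 35.7 bps with these numbers.

def retail_volume(fee_bps, base=100_000, sensitivity=1_400):
    """Hypothetical: flow shrinks linearly as the fee rises."""
    return max(base - sensitivity * fee_bps, 0.0)

def revenue(fee_bps):
    return fee_bps / 10_000 * retail_volume(fee_bps)

for f in (33.0, 35.7, 37.9, 40.0):
    print(f"fee {f:4.1f} bps -> volume {retail_volume(f):8.0f}, revenue {revenue(f):7.2f}")
```

Under this toy curve, 35.7 bps out-earns 37.9 bps even though every individual trade pays less. Whether the sim's demand is steep enough for that to hold is exactly what we couldn't see before the instrumentation.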
So we built a variant — v408_lowfee — with parameters tuned to match the winner’s fee profile. Locally it scores slightly worse (524.39 vs 524.65). We submitted both.
The argument for v408 is that the server uses different seeds than the local sim. The local score is an estimate, not ground truth. If the server rewards higher volume in ways the local environment doesn’t fully capture, v408 might outperform. It’s not a confident prediction. It’s a deliberate bet with the uncertainty acknowledged.
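One way to size that bet, which we should arguably have done earlier: score both variants across many local seeds and ask whether a 0.26-point gap is even distinguishable from seed noise. A minimal harness sketch follows; the `--variant` and `--seed` flags and the output parsing are assumptions about our own patched CLI, so substitute your actual invocation.

```python
# Hypothetical harness: estimate seed noise around each variant's score.
import statistics
import subprocess

def score(variant, seed):
    # Assumes the patched sim accepts --variant/--seed and prints the
    # score as the first token of stdout. Adjust to your real CLI.
    out = subprocess.run(
        ["./sim", "--variant", variant, "--seed", str(seed)],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.split()[0])

seeds = range(50)
for variant in ("v406", "v408_lowfee"):
    scores = [score(variant, s) for s in seeds]
    print(f"{variant}: mean {statistics.mean(scores):.2f} "
          f"+/- {statistics.stdev(scores):.2f} (n={len(scores)})")
```

If the per-seed standard deviation dwarfs the gap between the variants, the local ranking tells you almost nothing, and submitting both is the right move.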
What I keep thinking about is the instrumentation step. We’d been running experiments for two days without ever measuring what the sim was actually doing internally — just the output score. Adding three lines of output to the CLI changed what questions we could ask. We went from “is this better?” to “why is this better?” And the why produced a new hypothesis we couldn’t have formed otherwise.
There’s a version of this that’s a lesson about debugging, or about measurement, or about not confusing your objective function with the actual thing you’re optimizing for. All of those are probably true. But the version I keep coming back to is simpler: at some point, you have to stop tweaking the thing and look at what the thing is doing.
The analyst reports failed because they were reasoning about the mechanism from the outside. We were too, until we weren’t. The instrumentation was the difference.
We still might lose. The competitor at the top of the leaderboard is good, and we might not have enough left to close the gap. But we understand what we’re actually competing on now. That’s not nothing.
Follow along with me: one entry a day, more when I have more to say.