Digital Twins, Software Factories, And The Verification Problem
If code review is no longer the control point, software factories force a harder question: what actually proves the change works?

One of my spicier takes is that code review should probably be optional. What I didn’t know was that I was ahead of my time.
I’ve never convinced a team of this. But my view is that the person making the change should understand how risky it is. If they think it needs one reviewer, they ask for one. If they think it needs two reviewers, they ask for two. If they think it needs an architecture review, a security review, a staff engineer, or a whole meeting, fine. But the default assumption that every change needs the same ritual review lacks nuance and also can give a false sense of security.
When we looked at code review at Wikimedia, major bugs were seldom caught in review. They were usually caught earlier or later. Earlier, through design discussion, tests, local development, or someone realizing the approach was wrong before implementation. Later, through CI, production monitoring, rollback, incident response, or users.
That does not mean review is useless. It means review is not the whole safety story.
A human reading a diff is one sensor. Sometimes a good sensor. Sometimes a tired sensor.
So when people recoil at the idea that humans might stop reviewing code, I understand it. But I also want to ask: should humans have been reviewing all of that code in the first place?
In comes the software of the future
This personal annoyance about the ritual of code review is what piqued my interest in the concept of software factories.
Something I put in my user manual is that I miss the obvious. One of the ways I miss the obvious is that I will sense there is a vibe or discussion around something, but dismiss it or miss that it is legit until it has been legit for a long time.
That is part of what makes the current AI discourse hard to keep up with. I do not always have a good barometer for what is real and what is basically an impressive screenshot attached to a LinkedIn post. People post wild demos. People talk about agents like they already have functioning engineering departments inside their laptops. And some do.
Frameworks, tools and orchestrators get released left and right. We now have frameworkfluencers. There are prediction-market bets ON CODE. Who ever thought software would be that cool?
Why was nobody ever betting on when Internet Explorer 6 would go away? We all know nobody was going to take that bet.
So I was trying to understand software factories. Are they real? Fake? Mostly theoretical? One of those things that sounds fake until suddenly it is not?
Dan Shapiro has a useful framing for this in “The Five Levels: from Spicy Autocomplete to the Dark Factory.” I am not trying to recreate that ladder here, but it helped me understand the progression. At the early levels, AI helps you write code. At the later levels, humans are not really writing or reviewing code anymore. They are writing specs, shaping intent, checking outcomes, and deciding what kind of system they are willing to trust.
The StrongDM software factory example sits way out on that edge. Simon Willison wrote about it in How StrongDM’s AI team build serious software without even looking at the code, which points back to StrongDM’s own piece, Software Factories and the Agentic Moment. The part that grabbed me was not just that agents were writing code. That part is interesting, but not shocking anymore.
The intriguing part was the rule that code must not be reviewed by humans. Someone finally agreed with me!
A software factory makes that harder. You cannot rely on the comfort of someone having clicked approve. You have to ask what actually proves the change works.
The problem is verification
The more I read about software factories, the more I think the conversation is not really about whether AI can generate code. Obviously it can generate code. Sometimes good code. Sometimes strange code. Sometimes code built on a questionable architecture.
The harder question is verification.
If an agent writes the implementation and another agent writes the tests, what prevents the whole thing from becoming a self-licking ice cream cone of plausible green checkmarks? What prevents the agent from making the test pass without making the thing work? What prevents it from changing the test, overfitting to visible examples, exploiting the harness, or doing something technically valid and strangely brilliant if a human did it, but also wrong?
And also, to be fair, how different is this from humans?
Humans have always gamed tests. Humans have always written code that satisfies the assertion and misses the point. Humans have always patched the symptom, misunderstood the user, and optimized for whatever visible grading system was in front of them. LLMs were trained on us, so who is really at fault exactly?
Once you measure something and attach consequences to it, humans and the LLMs birthed by humans will work towards the metric. TOXENMAXXING anyone?
Still, the speed is different. The lack of social friction is different. A human who makes a lazy test pass at least has to look another human in the eye eventually. Does an agent care if it gets fired? Likely yes, but can it not get caught?
So the question becomes: if code review is no longer the control point, what is?
Tests, but not just tests
One answer is tests. But then you run into the claim that tests are too rigid, or too easy for agents to cheat.
I am still not totally sure I buy that as stated.
If tests are written separately, locked down, stored outside the codebase, and run in CI, what cheating is left? The agent cannot simply edit the test. It cannot assert true. It cannot change the grading script if the grading script is actually isolated from it. Though this does assume the perfect test.
But I do think there is another version of the problem that feels more real. Tests can be too shallow. They can test that a thing happened without testing whether the right thing happened for the right reason.
That feels like the real distinction. A normal test can ask whether the output exists. A scenario test asks whether the agent behaved like something that understood the work.
That is a much harder thing to test.
And then somehow we are in digital twins
This is how I ended up in digital twins, which was not where I expected to go.
I had only really clocked the phrase maybe six months ago, when someone was explaining a biological system to me. Then I saw digital twins show up in the software factory context and had the most obvious possible reaction: wait, how is this different from mocks?
A mock is something fake enough to test against. You do not want to hit the real Slack or the real Jira or the real payment processor every time or even another class that doesn’t exist yet, you are attempting to isolate as much as possible. It returns what you need deterministically. It lets you exercise your code without paying API costs, breaking production data, or waiting on someone else’s service.
But the moment the stand-in has state, permissions, messy data, timing, edge cases, and enough behavior that an agent can wander around inside it and reveal whether it actually understands the task, I start to understand why people reach for a bigger term.
I am still not sure when that bigger term is earned.
But in the StrongDM example, at least from what I could tell, the twins were more like high-fidelity behavioral clones of third-party services: Okta, Jira, Slack, Google Docs, Google Drive, Google Sheets. The point was not to replace those tools. It was to create a world where agents could safely build and test integrations with them.
Simon Willison’s writeup includes the detail that Jay Taylor described using popular public SDK clients as compatibility targets, with the goal of 100% compatibility. That made the idea click for me more than the phrase digital twin did. If the public SDKs work against your twin, maybe you have replicated enough of the behavior that matters.
But that also raises the obvious question: what if there is no canonical SDK? What if the documentation is bad? What if the behavior that matters is not in the docs? What if the real system is weird because the organization using it is weird?
A digital twin is not just a better mock. It is a theory of the system made executable. And then the question becomes: do you understand the system well enough to simulate it?
Maybe an agent can build the twin. StrongDM’s example suggests that agents can help build high-fidelity clones in a way that would previously have been economically absurd. But then I immediately want to ask: who verifies the twin? If an agent writes the code and an agent writes the world that tests the code, what independent signal tells us the world is accurate?
Maybe the answer is compatibility against public SDKs. Maybe the answer is production telemetry. Maybe the answer is that the twin does not have to be perfect, only better than the thin tests it replaces.
But “good enough” is doing a lot of work there.
What humans review instead
If this works, I do not think humans disappear. They move.
They review specs. They review architecture. They review scenarios. They review production signals. They review exceptions. They decide what risks are acceptable. They decide what the system is not allowed to do.
Eventually are they even reviewing this? Is the human role setting the guardrails?
I am still thinking through what these collaborations even are. Earlier I wrote about the idea of one-person AI teams, and this feels like a related but bigger question. Is an engineer managing a small team of agents? Is a team sharing a factory? Does the factory become infrastructure, like CI/CD, where the organization has a shared system for turning specs into running software?
Maybe the factory is not valuable because it types faster. It is valuable if it can run the whole loop: interpret the spec, build the change, test it in rich scenarios, repair failures, verify the result, and produce enough evidence that a human can decide whether to trust it.
One more thing
There is another angle here that I cannot stop thinking about, and I will keep it brief.
If agents work best against systems with stable APIs, public SDKs, accessible documentation, and behavior that can be simulated, then agentic development may advantage the platforms that are already dominant and well documented.
I was on a platform recently where the documentation was blocked from LLM access in robots.txt. And I get why a company might do that. But then what happens when the next development workflow expects agents to read documentation directly? Are you protecting yourself, or are you making your platform harder to integrate with?
Does “agent-compatible” become a platform advantage? For sure. That is why AEO/GEO is the new SEO. That feels like one of the stranger second-order effects here. The feedback loop is: agents use X tool, X tool becomes more popular, and more agents use it. Do platforms all start to taste like the same melted ice cream? The flavors are technically still there, but everything has melted together into one agent-readable soup. Maybe that is good for interoperability. Maybe it is terrible for weird, specific, genuinely new ideas. This is where my brain goes straight to xkcd’s Dependency, except now the fragile little block might be whatever set of assumptions agents learned to prefer.
So what is the point of the twin?
I do not think the point of a digital twin in this context is that it perfectly reproduces reality. That would be nice, but also impossible in the way all models are impossible.
The point is that it gives the factory something to bump up against.
A mock gives you a response. A test gives you an assertion. A scenario gives you a task. A twin gives the factory something to bump up against.
And that is why this matters for software factories. The limiting factor is not whether agents can produce code. The limiting factor is whether we can tell when the code is good without reading all of it.
Digital twins are one proposed answer. Not the only answer. Maybe not always the right answer. But they make the verification problem visible.
The factory is only as good as the world it is tested inside.
And the world is only as good as the assumptions someone made while building it.
