Why AI can't just rewrite Windows
Every few months, someone says it. Usually in a comment thread, sometimes from a non-developer friend. The logic sounds reasonable on the surface: AI can build APIs, write services, generate entire applications. So why not Windows?
Because Windows isn't a coding problem. It's a complexity problem. And those are very different things.
The scale is hard to actually picture
4,000 engineers. 1,760 daily builds across 440 branches. A Git repository so large that Microsoft had to build its own virtual file system (GVFS, now VFS for Git) just to manage it, because standard Git couldn't handle it.
50 million lines of code. 3.5 million files. Close to 300 GB. Built, broken, patched, and shipped for 41 years without stopping, since Windows 1.0 in 1985.
I work on things that feel complex to me. AegisMesh, the IAM platform I've been building, has RBAC, a full DevSecOps pipeline, observability. It still feels manageable because I can hold most of it in my head on a decent day.
Nobody holds Windows in their head. I don't think any single team does either.
The problem isn't the code. It's everything the code is touching.
Complexity in software doesn't scale the way people expect.
5 components means 10 possible pairwise interactions. 100 components means nearly 5,000. Windows, with millions of components, has more interactions than anyone can fully account for in practice.
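The arithmetic behind those figures is the standard count of distinct pairs, n(n-1)/2. Here is a minimal sketch of it. The one-million component count is only a stand-in for "millions of components", not a measured number, and the function name is made up for illustration.

```python
from math import comb


def pairwise_interactions(components: int) -> int:
    """Count distinct pairs among `components` items: n * (n - 1) / 2."""
    return comb(components, 2)


# 1_000_000 is an assumed stand-in for "millions of components".
for n in (5, 100, 1_000_000):
    print(f"{n:>9,} components -> {pairwise_interactions(n):,} possible pairs")
```

For 5 and 100 components this gives 10 and 4,950. It also counts only pairs. Interactions among three or more components would grow faster still.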
Fred Brooks put it directly in No Silver Bullet (1986): "The elements interact with each other in some nonlinear fashion, and the complexity of the whole increases much more than linearly."
And you can't test your way out of it either. Some bugs only appear at production scale, with real users doing things nobody anticipated. No test suite can replicate 41 years of that.
AI can't even read Windows, let alone rewrite it
Models work within a context window. Working memory, basically. Code averages about 18 tokens per line. 50 million lines means roughly 900 million tokens.
Gemini 3 Pro, which currently has the largest available context window, handles around 10 million tokens. Claude Opus 4.6 handles around 1 million. That makes the Windows codebase 90x to 900x larger than what any AI can process at once.
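The back-of-the-envelope math can be checked in a few lines. This sketch uses only the figures quoted above (18 tokens per line, 50 million lines, and the two context window sizes) and treats all of them as rough estimates. The constant and function names are illustrative.

```python
# Rough figures quoted in the text; all are estimates, not measurements.
TOKENS_PER_LINE = 18
LINES_OF_CODE = 50_000_000
CONTEXT_WINDOWS = {
    "10M-token window": 10_000_000,
    "1M-token window": 1_000_000,
}


def total_tokens(lines: int, tokens_per_line: int) -> int:
    """Estimate the token count of a codebase from its line count."""
    return lines * tokens_per_line


def size_ratio(codebase_tokens: int, context_window: int) -> float:
    """How many times larger the codebase is than one context window."""
    return codebase_tokens / context_window


codebase = total_tokens(LINES_OF_CODE, TOKENS_PER_LINE)
print(f"Estimated codebase size: {codebase:,} tokens")
for name, window in CONTEXT_WINDOWS.items():
    print(f"{name}: codebase is {size_ratio(codebase, window):.0f}x larger")
```

Even if the tokens-per-line estimate were off by a factor of two, the gap would still be one to two orders of magnitude.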
Any AI working on Windows is always working blind to most of it. It sees one street, not the city. It can write something that looks completely correct locally and silently break something three subsystems away.
I hit a smaller version of this while building DeployLens. An AI tool suggested a change that made total sense in isolation. It broke a GitHub Actions integration because the AI had no idea that dependency existed. On a 900 million token codebase, that problem doesn't just scale, it multiplies in ways that are genuinely hard to reason about.
AI already struggles with large codebases
The numbers on this aren't great. According to a Veracode study, 45% of AI-generated code has security vulnerabilities. The XSS failure rate in AI-generated code sits at 86%. In large, mature codebases, fewer than 44% of AI suggestions even get accepted. A METR study found that experienced developers are actually 19% slower on real tasks in large codebases when using AI tools.
AI can't reliably understand dependency graphs, build systems, or architectural conventions. It moves a file and breaks every import pointing to the old location. Now imagine that at the scale of 50 million lines.
Backward compatibility is the thing nobody wants to deal with
Say AI wrote perfect code. Every function, every module, exactly right.
Still not done.
Control Panel still exists in Windows because drivers from 2012 call its functions. Windows has to run software from decades ago, drivers from manufacturers that don't exist anymore, enterprise apps embedded in hospital systems, bank infrastructure, and government workflows that nobody fully understands anymore.
When Microsoft ships an update, they're also quietly guaranteeing that every DLL, every registry interaction, every weird edge case from 1998 still behaves the same way. Some behaviors exist specifically because an app from 2003 depended on a bug. Fixing the bug breaks the app. So the bug stays. That's not a failure, that's the deal made with the ecosystem over 40 years.
An AI rewriting from scratch can't know any of that unless someone wrote it down. Most of it wasn't.
"Just run thousands of agents" doesn't fix it
Multi-agent failure rates range from 41% to 87% in practice. Coordination breakdowns alone account for 37% of all failures. Anthropic found that putting Claude in a multi-agent setup dropped performance by 35%. And errors compound: at around 95% success per step, a 10-step task succeeds about 60% of the time, and a 20-step task about 36%.
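The step-count figures are what you get when every step has to succeed for the whole task to succeed. This minimal sketch assumes each step succeeds independently with the same 95% probability. That is a simplification, not a claim about how any particular agent system behaves, and the function name is made up for illustration.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes steps are independent and share the same success rate.
    """
    return per_step ** steps


for steps in (1, 10, 20):
    print(f"{steps:>2} steps: {chain_success(0.95, steps):.0%} chance of overall success")
```

Under that assumption, a step that looks reliable on its own becomes a coin flip somewhere around 13 or 14 steps.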
This is Brooks's Law reborn. Adding more people to a late software project makes it later. Adding more agents to a complex codebase makes it worse. At some point your job becomes building and managing the infrastructure that coordinates the agents, not writing the actual code.
Microsoft already tried
Galen Hunt, a Microsoft Distinguished Engineer, said publicly: "My goal is to eliminate every line of C and C++ from Microsoft by 2030. One engineer, one month, one million lines of code."
After community pushback, Microsoft clarified this was a research project, not a product roadmap. They explicitly denied plans to rewrite Windows 11 using AI.
Their actual progress on the .NET Runtime, a much simpler codebase than the Windows kernel: 878 Copilot PRs over 10 months, 535 merged, 126,000 lines touched. Windows has 50 million. At that rate, roughly 126,000 lines every 10 months, it would take over 330 years to rewrite Windows. And again, this was the easy part.
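That estimate is a straight-line extrapolation, sketched below using only the numbers quoted above. It assumes a constant pace and that "lines touched" translates one-to-one into lines rewritten, both of which are generous. The names are illustrative.

```python
# Figures quoted in the text for the .NET Runtime work.
LINES_TOUCHED = 126_000
PERIOD_MONTHS = 10
PRS_OPENED = 878
PRS_MERGED = 535
WINDOWS_LINES = 50_000_000


def years_to_cover(total_lines: int, lines_per_period: int, period_months: int) -> float:
    """Linear extrapolation: years needed to touch every line once."""
    periods = total_lines / lines_per_period
    return periods * period_months / 12


years = years_to_cover(WINDOWS_LINES, LINES_TOUCHED, PERIOD_MONTHS)
merge_rate = PRS_MERGED / PRS_OPENED
print(f"Estimated time at this pace: {years:.1f} years")
print(f"Share of Copilot PRs merged: {merge_rate:.0%}")
```

The estimate ignores parallelism, which would shorten it, and coordination and compatibility costs, which would lengthen it. Treat it as an order-of-magnitude figure.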
Where this is going
AI tools make me faster at the boring parts, which frees me up for the parts that actually need thinking. That's the real value: augmentation, not replacement.
But "just rewrite it" gets code generation and software engineering mixed up. Windows isn't slow to fix because the code is badly written. It's slow because the problem itself is irreducibly complex. 50 million lines of code. 41 years of decisions. Billions of devices. Millions of third-party applications. An entire global economy running on it.
AI is powerful. But this isn't a problem you can brute-force with a bigger model.