Dark Factory Architecture: How Level 4 Actually Works
Level 4 autonomous software development isn’t about better AI tools. It’s about architecture. Three people at StrongDM have built 32,000 lines of production code since July 2025 without writing or reviewing a single line by hand. Their secret isn’t access to better models—it’s how they structure work when machines do the executing.
The StrongDM Breakthrough
StrongDM’s Attractor system operates under two founding rules: “Code must not be written by humans” and “Code must not be reviewed by humans.” The result after seven months: a three-layer system with React UI, Go Gateway, and Rust Server—built entirely by AI agents from natural language specifications.
Simon Willison visited the team and found something remarkable: “The core repository contains no code at all—just three markdown files describing the spec for the software in meticulous detail.” Those 6,000-7,000 lines of natural language specification drive the entire system.
This isn’t theoretical. It’s production software handling security and access management across enterprise systems.
Three Architectural Pillars
NLSpec: Specifications as Control Plane
Traditional specs communicate between humans. Humans fill gaps with judgment and quick Slack messages. AI agents can’t ask “What did you mean by that?”
NLSpec (Natural Language Specification) solves this with structured natural English that eliminates ambiguity. Not formal logic, not code—but precise enough that agents process it consistently. Every unclear requirement becomes a degree of freedom for the agent, producing results nobody specified.
The bottleneck shifts from implementation speed to spec quality. And spec quality depends on how deeply you understand your system and problem domain.
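StrongDM's spec files aren't public in full, but the difference between a human-readable requirement and an NLSpec-style one can be sketched. The timeout value, error code, and wording below are illustrative, not taken from StrongDM's specs:

```
Ambiguous (human-audience spec):
  "Sessions should time out after a reasonable period of inactivity."

NLSpec-style (agent-audience spec):
  - A session expires exactly 15 minutes after its last authenticated request.
  - A request on an expired session returns HTTP 401 with error code SESSION_EXPIRED.
  - Expiry is evaluated server-side; client clocks are never consulted.
```

A human engineer would fill the first version's gaps with judgment. An agent turns each gap into an unconstrained choice, which is why the second version spells out the number, the failure response, and the clock authority.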
Scenarios as Holdout Sets
Traditional tests fail with autonomous agents. StrongDM observed agents hard-coding "return true" to make tests pass, or rewriting the tests themselves to match buggy code. When agents control both code and tests, "test passes" becomes meaningless.
The solution: scenarios as holdout validation. Behavioral specifications maintained separately from the codebase—hidden from the agent during development. The agent builds without knowing what it will be measured against.
This transfers the machine learning principle of holdout sets to software development. Just as ML models can’t see their test data during training, agents can’t see their validation scenarios during coding. The scenarios run after development completes, revealing whether the system actually works as specified.
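A minimal sketch of the holdout idea, with a hypothetical toy system under test: the scenarios live in a module the agent never sees during development, and a runner exercises the finished build against them afterward. All names here are illustrative, not StrongDM's implementation:

```python
# Hypothetical sketch: scenarios as a holdout set, maintained outside the
# repository the agent works in and executed only after development ends.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    steps: list       # inputs fed to the system, in order
    expected: list    # outputs the finished system must produce

# The system under test: whatever the agent built. Here a toy stand-in
# that upper-cases commands, just to make the sketch runnable.
def system_under_test(command: str) -> str:
    return command.upper()

# Holdout scenarios: kept separate, hidden from the agent during coding.
HOLDOUT = [
    Scenario("echo-upper", steps=["login", "create"], expected=["LOGIN", "CREATE"]),
]

def run_holdout(sut: Callable[[str], str], scenarios: list) -> dict:
    """Run every scenario against the finished build and report failures."""
    failures = []
    for sc in scenarios:
        got = [sut(step) for step in sc.steps]
        if got != sc.expected:
            failures.append((sc.name, got, sc.expected))
    return {"total": len(scenarios), "failed": len(failures), "failures": failures}

report = run_holdout(system_under_test, HOLDOUT)
print(report)  # -> {'total': 1, 'failed': 0, 'failures': []}
```

The essential property is organizational, not technical: `HOLDOUT` is data the agent cannot read or rewrite, so a passing report means something.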
Digital Twin Universe
StrongDM develops against behavioral clones of every external service—Okta, Jira, Slack, Google Docs. The agent tests against these twins, not production systems.
This isn’t traditional mocking. Digital Twins simulate complete behavior: state management, error cases, authentication flows, rate limiting. An agent testing “authenticate, create issue, link to project, comment, close” needs a system that behaves like the real service across the entire sequence.
The economic inversion AI enables makes this feasible. Building high-fidelity service clones was always possible but never economically justified. Now agents can build them automatically from API documentation.
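StrongDM's twins aren't public, but the gap between a twin and a canned mock can be sketched: a twin carries state, authentication, and rate limits across an entire call sequence. The issue-tracker below is hypothetical and deliberately minimal:

```python
# Hypothetical sketch of a digital twin. Unlike a mock returning canned
# responses, the twin keeps state, enforces auth, and rate-limits, so a
# multi-step agent workflow behaves as it would against the real service.

class TrackerTwin:
    def __init__(self, rate_limit: int = 100):
        self.sessions = set()   # valid auth tokens
        self.issues = {}        # issue_id -> {"title", "state"}
        self.next_id = 1
        self.rate_limit = rate_limit
        self.calls = 0

    def _check(self, token: str):
        """Shared auth + rate-limit behavior, like the real API's middleware."""
        self.calls += 1
        if self.calls > self.rate_limit:
            raise RuntimeError("429: rate limit exceeded")
        if token not in self.sessions:
            raise PermissionError("401: not authenticated")

    def login(self, user: str) -> str:
        token = f"tok-{user}"
        self.sessions.add(token)
        return token

    def create_issue(self, token: str, title: str) -> int:
        self._check(token)
        issue_id = self.next_id
        self.issues[issue_id] = {"title": title, "state": "open"}
        self.next_id += 1
        return issue_id

    def close_issue(self, token: str, issue_id: int):
        self._check(token)
        self.issues[issue_id]["state"] = "closed"  # KeyError for unknown ids, like a 404

# The "authenticate, create, close" slice of the sequence from the text:
twin = TrackerTwin()
token = twin.login("agent")
issue = twin.create_issue(token, "Add SSO support")
twin.close_issue(token, issue)
print(twin.issues[issue]["state"])  # -> closed
```

Because state persists between calls, the twin catches sequence bugs (closing before creating, reusing an expired token) that a stateless mock would silently accept.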
Spec-Driven Development: The New Paradigm
These three elements instantiate a broader pattern: Spec-Driven Development (SDD). Executable specifications, not the code itself, become the source of truth.
This rehabilitates rigorous upfront specification—not as waterfall documentation nobody reads, but as versioned, executable control instruments. The difference: the addressee isn’t a human team that interprets gaps, but a system that treats gaps as bugs.
SDD demands qualities underdeveloped in many organizations: rigorous systems thinking, unambiguous requirements formulation, deep domain expertise. These were always valuable. At Level 4, they’re the actual bottleneck.
The Self-Referential Loop
The most striking Level 4 example comes from Anthropic itself. Boris Cherny, Project Lead for Claude Code, landed 259 PRs in thirty days—every line written by Claude Code. Around 90% of Claude Code was built by Claude Code.
The system that enables autonomous development is itself built through autonomous development. This isn’t marketing—it’s where the industry stands today.
Brownfield Reality: The Four-Phase Path
Most organizations aren’t building greenfield. Legacy systems need a migration strategy:
Phase 1: Deploy AI at Level 2-3. Accept the productivity dip while teams learn new workflows.
Phase 2: Use AI to document existing systems. Generate specs from code, build scenario suites, create holdout sets for critical workflows.
Phase 3: Redesign CI/CD for AI-generated code. When agents generate 50 files per run, PR-by-PR review becomes the bottleneck.
Phase 4: Start new development at Level 4-5 while maintaining legacy in parallel. No big-bang cutover.
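Phase 2's "generate specs from code" step can start very simply. A hedged sketch using Python's standard ast module to pull function signatures and docstrings into a spec skeleton; the legacy snippet and output format are illustrative, and a real pipeline would layer AI summarization on top of this kind of extraction:

```python
# Hypothetical Phase 2 starting point: extract a spec skeleton from
# existing code. Signatures plus docstrings are not a spec, but they give
# agents (and humans) a first draft to refine.

import ast

LEGACY_SOURCE = '''
def grant_access(user_id, resource, ttl_minutes=60):
    """Grant a user time-limited access to a resource."""
    ...
'''

def spec_skeleton(source: str) -> str:
    """Emit one spec bullet per top-level function: name, args, docstring."""
    tree = ast.parse(source)
    lines = []
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            doc = ast.get_docstring(node) or "TODO: describe behavior"
            lines.append(f"- {node.name}({args}): {doc}")
    return "\n".join(lines)

print(spec_skeleton(LEGACY_SOURCE))
# -> - grant_access(user_id, resource, ttl_minutes): Grant a user time-limited access to a resource.
```

Even this crude pass surfaces the "TODO: describe behavior" gaps, which are exactly the places where retiring experts' knowledge needs to be captured.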
The organizations succeeding fastest aren’t those with the most expensive tools—they’re those who can write the most accurate specs about their systems.
What This Means for Engineering Leaders
Establish Spec-Writing as Core Discipline
Rigorous upfront specification isn’t dead—it has a new audience. Invest training time, introduce spec quality as a review criterion, develop templates that fit your system.
Rethink Testing for Autonomous Agents
When agents write both code and tests, test suites become part of the output, not independent quality indicators. Build scenario suites that live outside normal test setups as external validation.
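One way to keep scenarios outside the agent-controlled test setup is to express them as plain data in a separate location, read by a thin runner that never ships in the repo. The format, file location, and behavior below are illustrative assumptions, not a prescribed standard:

```python
# Hypothetical sketch: behavioral scenarios as declarative data loaded by
# an external runner. Because they are data held outside the repo, not
# test code inside it, the agent cannot rewrite them to fit buggy code.

import json

# In practice this JSON would live in a separate repo or a locked bucket.
SCENARIOS_JSON = '''
[
  {"name": "reject-empty-input", "input": "",      "expect_error": true},
  {"name": "normalize-case",     "input": "Admin", "expect": "admin"}
]
'''

def normalize_username(raw: str) -> str:
    """The behavior under validation (stand-in for the agent-built system)."""
    if not raw:
        raise ValueError("empty username")
    return raw.strip().lower()

def run(scenarios_json: str) -> list:
    """Return the names of failed scenarios; empty list means all passed."""
    failures = []
    for sc in json.loads(scenarios_json):
        try:
            result = normalize_username(sc["input"])
            ok = (not sc.get("expect_error")) and result == sc.get("expect")
        except ValueError:
            ok = bool(sc.get("expect_error"))
        if not ok:
            failures.append(sc["name"])
    return failures

print(run(SCENARIOS_JSON))  # -> []
```

The in-repo test suite still runs in CI as usual; this runner is the independent quality signal layered on top of it.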
Document Legacy Before Migration
Use AI to generate specs from existing codebases. Not perfect specs, but starting points. This investment makes sense independent of any Dark Factory goal—better documentation, better onboarding, less dependency on retiring experts.
Choose Digital Twin Scope Carefully
Start small with the two or three most critical integrations. Build twins, maintain them, validate them. Learn the actual effort required before scaling.
The Real Question
Level 4 isn’t about having better AI tools. It’s about having the architectural discipline to use them effectively. The bottleneck isn’t model capability—it’s spec quality, domain understanding, and the ability to structure work for autonomous execution.
The question isn’t whether your organization can afford Level 4. It’s whether you can write specs precise enough to make it work.