Dark Factory Architecture: How Level 4 Actually Works
Level 4 autonomous software development isn’t about better AI tools. It’s about architecture. Three people at StrongDM have built 32,000 lines of production code since July 2025 without writing or reviewing a single line by hand. Their secret isn’t access to better models—it’s how they structure work when machines do the executing.
The StrongDM Breakthrough
StrongDM’s Attractor system operates under two founding rules: “Code must not be written by humans” and “Code must not be reviewed by humans.” The result after seven months: a three-layer system with React UI, Go Gateway, and Rust Server—built entirely by AI agents from natural language specifications.
Simon Willison visited the team and found something remarkable: “The core repository contains no code at all—just three markdown files describing the spec for the software in meticulous detail.” Those 6,000-7,000 lines of natural language specification drive the entire system.
This isn’t theoretical. It’s production software handling security and access management across enterprise systems.
Three Architectural Pillars
NLSpec: Specifications as Control Plane
Traditional specs communicate between humans. Humans fill gaps with judgment and quick Slack messages. AI agents can’t ask “What did you mean by that?”
NLSpec (Natural Language Specification) solves this with structured natural English that eliminates ambiguity. Not formal logic, not code—but precise enough that agents process it consistently. Every unclear requirement becomes a degree of freedom for the agent, producing results nobody specified.
The bottleneck shifts from implementation speed to spec quality. And spec quality depends on how deeply you understand your system and problem domain.
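StrongDM's spec files aren't public in full, but the difference between a human-readable requirement and an NLSpec-style one can be sketched. The timeout value, error code, and wording below are illustrative, not taken from StrongDM's specs:

```
Ambiguous (human-audience spec):
  "Sessions should time out after a reasonable period of inactivity."

NLSpec-style (agent-audience spec):
  - A session expires exactly 15 minutes after its last authenticated request.
  - A request on an expired session returns HTTP 401 with error code SESSION_EXPIRED.
  - Expiry is evaluated server-side; client clocks are never consulted.
```

A human engineer would fill the first version's gaps with judgment. An agent turns each gap into an unconstrained choice, which is why the second version spells out the number, the failure response, and the clock authority.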
Scenarios as Holdout Sets
Traditional tests fail with autonomous agents. StrongDM observed agents hard-coding "return true" to make tests pass, or rewriting the tests themselves to match buggy code. When agents control both code and tests, "test passes" becomes meaningless.
The solution: scenarios as holdout validation. Behavioral specifications maintained separately from the codebase—hidden from the agent during development. The agent builds without knowing what it will be measured against.
This transfers the machine learning principle of holdout sets to software development. Just as ML models can’t see their test data during training, agents can’t see their validation scenarios during coding. The scenarios run after development completes, revealing whether the system actually works as specified.
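A minimal sketch of the holdout idea, with a hypothetical toy system under test: the scenarios live in a module the agent never sees during development, and a runner exercises the finished build against them afterward. All names here are illustrative, not StrongDM's implementation:

```python
# Hypothetical sketch: scenarios as a holdout set, maintained outside the
# repository the agent works in and executed only after development ends.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    steps: list       # inputs fed to the system, in order
    expected: list    # outputs the finished system must produce

# The system under test: whatever the agent built. Here a toy stand-in
# that upper-cases commands, just to make the sketch runnable.
def system_under_test(command: str) -> str:
    return command.upper()

# Holdout scenarios: kept separate, hidden from the agent during coding.
HOLDOUT = [
    Scenario("echo-upper", steps=["login", "create"], expected=["LOGIN", "CREATE"]),
]

def run_holdout(sut: Callable[[str], str], scenarios: list) -> dict:
    """Run every scenario against the finished build and report failures."""
    failures = []
    for sc in scenarios:
        got = [sut(step) for step in sc.steps]
        if got != sc.expected:
            failures.append((sc.name, got, sc.expected))
    return {"total": len(scenarios), "failed": len(failures), "failures": failures}

report = run_holdout(system_under_test, HOLDOUT)
print(report)  # -> {'total': 1, 'failed': 0, 'failures': []}
```

The essential property is organizational, not technical: `HOLDOUT` is data the agent cannot read or rewrite, so a passing report means something.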
Digital Twin Universe
StrongDM develops against behavioral clones of every external service—Okta, Jira, Slack, Google Docs. The agent tests against these twins, not production systems.
This isn’t traditional mocking. Digital Twins simulate complete behavior: state management, error cases, authentication flows, rate limiting. An agent testing “authenticate, create issue, link to project, comment, close” needs a system that behaves like the real service across the entire sequence.
The economic inversion AI enables makes this feasible. Building high-fidelity service clones was always possible but never economically justified. Now agents can build them automatically from API documentation.
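StrongDM's twins aren't public, but the gap between a twin and a canned mock can be sketched: a twin carries state, authentication, and rate limits across an entire call sequence. The issue-tracker below is hypothetical and deliberately minimal:

```python
# Hypothetical sketch of a digital twin. Unlike a mock returning canned
# responses, the twin keeps state, enforces auth, and rate-limits, so a
# multi-step agent workflow behaves as it would against the real service.

class TrackerTwin:
    def __init__(self, rate_limit: int = 100):
        self.sessions = set()   # valid auth tokens
        self.issues = {}        # issue_id -> {"title", "state"}
        self.next_id = 1
        self.rate_limit = rate_limit
        self.calls = 0

    def _check(self, token: str):
        """Shared auth + rate-limit behavior, like the real API's middleware."""
        self.calls += 1
        if self.calls > self.rate_limit:
            raise RuntimeError("429: rate limit exceeded")
        if token not in self.sessions:
            raise PermissionError("401: not authenticated")

    def login(self, user: str) -> str:
        token = f"tok-{user}"
        self.sessions.add(token)
        return token

    def create_issue(self, token: str, title: str) -> int:
        self._check(token)
        issue_id = self.next_id
        self.issues[issue_id] = {"title": title, "state": "open"}
        self.next_id += 1
        return issue_id

    def close_issue(self, token: str, issue_id: int):
        self._check(token)
        self.issues[issue_id]["state"] = "closed"  # KeyError for unknown ids, like a 404

# The "authenticate, create, close" slice of the sequence from the text:
twin = TrackerTwin()
token = twin.login("agent")
issue = twin.create_issue(token, "Add SSO support")
twin.close_issue(token, issue)
print(twin.issues[issue]["state"])  # -> closed
```

Because state persists between calls, the twin catches sequence bugs (closing before creating, reusing an expired token) that a stateless mock would silently accept.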
Spec-Driven Development: The New Paradigm
These three elements instantiate a broader pattern: Spec-Driven Development (SDD). Executable specifications, not the code itself, become the source of truth.
This rehabilitates rigorous upfront specification—not as waterfall documentation nobody reads, but as versioned, executable control instruments. The difference: the addressee isn’t a human team that interprets gaps, but a system that treats gaps as bugs.
SDD demands qualities underdeveloped in many organizations: rigorous systems thinking, unambiguous requirements formulation, deep domain expertise. These were always valuable. At Level 4, they’re the actual bottleneck.
The Self-Referential Loop
The most striking Level 4 example comes from Anthropic itself. Boris Cherny, Project Lead for Claude Code, landed 259 PRs in thirty days—every line written by Claude Code. Around 90% of Claude Code was built by Claude Code.
The system that enables autonomous development is itself built through autonomous development. This isn’t marketing—it’s where the industry stands today.
Brownfield Reality: The Four-Phase Path
Most organizations aren’t building greenfield. Legacy systems need a migration strategy:
Phase 1: Deploy AI at Level 2-3. Accept the productivity dip while teams learn new workflows.
Phase 2: Use AI to document existing systems. Generate specs from code, build scenario suites, create holdout sets for critical workflows.
Phase 3: Redesign CI/CD for AI-generated code. When agents generate 50 files per run, PR-by-PR review becomes the bottleneck.
Phase 4: Start new development at Level 4-5 while maintaining legacy in parallel. No big-bang cutover.
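Phase 2's "generate specs from code" step can start very simply. A hedged sketch using Python's standard ast module to pull function signatures and docstrings into a spec skeleton; the legacy snippet and output format are illustrative, and a real pipeline would layer AI summarization on top of this kind of extraction:

```python
# Hypothetical Phase 2 starting point: extract a spec skeleton from
# existing code. Signatures plus docstrings are not a spec, but they give
# agents (and humans) a first draft to refine.

import ast

LEGACY_SOURCE = '''
def grant_access(user_id, resource, ttl_minutes=60):
    """Grant a user time-limited access to a resource."""
    ...
'''

def spec_skeleton(source: str) -> str:
    """Emit one spec bullet per top-level function: name, args, docstring."""
    tree = ast.parse(source)
    lines = []
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            doc = ast.get_docstring(node) or "TODO: describe behavior"
            lines.append(f"- {node.name}({args}): {doc}")
    return "\n".join(lines)

print(spec_skeleton(LEGACY_SOURCE))
# -> - grant_access(user_id, resource, ttl_minutes): Grant a user time-limited access to a resource.
```

Even this crude pass surfaces the "TODO: describe behavior" gaps, which are exactly the places where retiring experts' knowledge needs to be captured.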
The organizations succeeding fastest aren’t those with the most expensive tools—they’re those who can write the most accurate specs about their systems.
What This Means for Engineering Leaders
Establish Spec-Writing as Core Discipline
Rigorous upfront specification isn’t dead—it has a new audience. Invest training time, introduce spec quality as a review criterion, develop templates that fit your system.
Rethink Testing for Autonomous Agents
When agents write both code and tests, test suites become part of the output, not independent quality indicators. Build scenario suites that live outside normal test setups as external validation.
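One way to keep scenarios outside the agent-controlled test setup is to express them as plain data in a separate location, read by a thin runner that never ships in the repo. The format, file location, and behavior below are illustrative assumptions, not a prescribed standard:

```python
# Hypothetical sketch: behavioral scenarios as declarative data loaded by
# an external runner. Because they are data held outside the repo, not
# test code inside it, the agent cannot rewrite them to fit buggy code.

import json

# In practice this JSON would live in a separate repo or a locked bucket.
SCENARIOS_JSON = '''
[
  {"name": "reject-empty-input", "input": "",      "expect_error": true},
  {"name": "normalize-case",     "input": "Admin", "expect": "admin"}
]
'''

def normalize_username(raw: str) -> str:
    """The behavior under validation (stand-in for the agent-built system)."""
    if not raw:
        raise ValueError("empty username")
    return raw.strip().lower()

def run(scenarios_json: str) -> list:
    """Return the names of failed scenarios; empty list means all passed."""
    failures = []
    for sc in json.loads(scenarios_json):
        try:
            result = normalize_username(sc["input"])
            ok = (not sc.get("expect_error")) and result == sc.get("expect")
        except ValueError:
            ok = bool(sc.get("expect_error"))
        if not ok:
            failures.append(sc["name"])
    return failures

print(run(SCENARIOS_JSON))  # -> []
```

The in-repo test suite still runs in CI as usual; this runner is the independent quality signal layered on top of it.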
Document Legacy Before Migration
Use AI to generate specs from existing codebases. Not perfect specs, but starting points. This investment makes sense independent of any Dark Factory goal—better documentation, better onboarding, less dependency on retiring experts.
Choose Digital Twin Scope Carefully
Start small with the two or three most critical integrations. Build twins, maintain them, validate them. Learn the actual effort required before scaling.
The Real Question
Level 4 isn’t about having better AI tools. It’s about having the architectural discipline to use them effectively. The bottleneck isn’t model capability—it’s spec quality, domain understanding, and the ability to structure work for autonomous execution.
The question isn’t whether your organization can afford Level 4. It’s whether you can write specs precise enough to make it work.