cortex-x
Discipline, not the model. How a framework that catches its own mistakes came together solo in seven weeks — and what that reveals about how AI research and implementation actually work.
Discipline, not the model. The bottleneck in AI-assisted engineering is not model capability — it is the operator's consistent operational discipline.
I built cortex-x solo over seven weeks, from nine files (2026-04-17) to its current state. The work produced one conclusion: the model is not the limit. The limit is whether the operator holds the same procedure at 3 a.m., at the sixtieth commit, and when the thing “almost works.” So cortex-x is not the subject of this case study — it is the evidence of a transferable method. And because a model's training data is frozen, the model doesn't even know its own current version: every claim about present state — framework versions, pricing, APIs, a11y standards — must be verified, not guessed. Even research is subject to the discipline.
- 01 Training data is frozen, so every claim that depends on current state gets verified by web research before I write it as fact.
- 02 Code does not review itself; any non-trivial diff passes an adversarial review pipeline before it merges.
- 03 Discipline held in the head does not survive the night; it has to be externalized into a mechanism that does not forget.
- 04 Documentation rots fastest precisely in AI projects, because state changes faster than the prose describing it.
- 05 Context dies with every session, so whatever must survive has to be explicitly persisted to disk.
- 06 A flat list of standards fails under pressure; under pressure it is tiered precedence that decides, not good intentions.
- 07 Real validation of a method is not one pretty project, but that the same patterns transfer across domains.
- 08 Repeatability beats virtuosity: a procedure anyone can run a second time beats a one-off brilliant move.
The rest of this case study is the proof chain — for each of the eight beliefs there is a concrete file, commit, and sprint that exists precisely because the belief exists.
- Context: five concurrent production agents
- Surface area: 5 CLAUDE.md · 5 memory layers · 5 postures
- Cost of context loss: ~10 hrs / month re-orientation
In April 2026 the operator was already shipping production AI agents — RELO, a back-office agent for Czech real estate (27 tools, 1,700+ tests, three-layer memory with autoDream consolidation), plus a multi-tenant chatbot platform serving production clients. Both ran in production, both passed audits, both were built solo.
By the second month a pattern was visible across them. The operator wasn't writing product code anymore. The work had drifted: safe-tool wrappers, three-layer memory scaffolds, cost guards, multi-agent review pipelines, OWASP Agentic Top 10 mappings, session-start hooks, recommendation queues. The scaffolding was the product. Every new project needed the same scaffolding rebuilt from memory.
“The scaffolding was the product. cortex-x is what fell out of three months of shipping with discipline before that discipline had a name.”
Stop rebuilding the scaffolding — externalize the discipline.
Five concurrent projects. Five CLAUDE.md files. Five distinct memory layers. Five separately-evolving security postures. One operator. One head.
A senior engineer reloading project context after a two-week gap loses 30–60 minutes per session — not coding, just re-orientation. Five projects × monthly drop-in cadence = ~10 hours of pure re-orientation per month, plus the silent tax of decisions made twice because last quarter's reasoning was no longer accessible.
An institutional-wisdom layer for the operator's Claude Code installation — cortex-x.
CLAUDE.md holds current state that changes weekly — tech stack, sprint status, env.
cortex-x holds standards, lessons, decisions, and the agent runtime that doesn't change at all between projects. Zero overlap. The split is enforced by hook contract — not by discipline.
Philosophy → architecture: eight pairs
Each of the eight beliefs below has a concrete artifact in the repository that exists precisely because that belief exists. Consistent philosophy produces consistent architecture — these are not arbitrary tools but conviction made into code you can open and read.
Research-before-assert → R1 mandatory dispatch
Training data is frozen; about the current state of the world — framework and model versions, pricing, a11y standards — I know nothing reliable. So any answer or implementation that depends on external state must be verified before I state it as fact. AI is a tool, not an oracle.
R1 anchors verify-first as a mandatory step: the standard describes the protocol, cortex-goal.md embeds it into Phase 3 (Research) of the plan, and the sprint skill enforces it inside the pipeline. Findings are cited with URLs and cached, so research is reusable across sessions rather than one-shot.
Review-before-merge → R2 six-agent pipeline + Pass-2 skeptic
A diff that wrote itself cannot judge itself. A non-trivial change passes six independent reviewers before the operator is allowed to merge — and a second round, the Pass-2 skeptic, is tasked with refuting the first round's findings.
r2-review.js is a dynamic workflow orchestrating six parallel agents; pre-commit-review-gate.cjs blocks the commit until a verdict arrives. Consensus HIGH findings are applied in-commit, not deferred to a backlog.
Arc 1: 23 HIGH bugs caught, zero refuted by the Pass-2 skeptic, all fixed in-commit before push — nothing escaped to main. In Sprint 2.46 six independent reviewers converged on a single bug at confidence 99/98/96/96/95/92.
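The review-then-refute flow can be sketched in a few lines. This is a minimal illustration, not the contents of r2-review.js: the reviewer names are the six listed elsewhere in this case study, while the finding shape ({ id, severity, confidence }), the quorum parameter, and the runReviewer callback are assumptions made for the example.

```javascript
// Reviewer names as listed in this case study; everything else is hypothetical.
const REVIEWERS = [
  "acceptance-auditor", "blind-hunter", "correctness-auditor",
  "security-auditor", "ssot-enforcer", "edge-case-hunter",
];

// Run every reviewer concurrently. `runReviewer` is a caller-supplied async
// function (reviewerName, diff) => finding[] that stands in for a real agent call.
async function dispatchReviewers(diff, runReviewer) {
  return Promise.all(
    REVIEWERS.map(async (name) => ({ name, findings: await runReviewer(name, diff) }))
  );
}

// Group HIGH findings by id and keep those reported by at least `quorum`
// distinct reviewers (the quorum rule is an assumption of this sketch).
function consensusHigh(results, quorum) {
  const byId = new Map();
  for (const { name, findings } of results) {
    for (const f of findings) {
      if (f.severity !== "HIGH") continue;
      if (!byId.has(f.id)) byId.set(f.id, { reviewers: new Set(), confidences: [] });
      const entry = byId.get(f.id);
      if (!entry.reviewers.has(name)) {
        entry.reviewers.add(name);
        entry.confidences.push(f.confidence);
      }
    }
  }
  return [...byId.entries()]
    .filter(([, e]) => e.reviewers.size >= quorum)
    .map(([id, e]) => ({ id, reviewers: e.reviewers.size, confidences: e.confidences }));
}

// Pass 2: drop any consensus finding the skeptic managed to refute.
function applySkeptic(consensus, refutedIds) {
  const refuted = new Set(refutedIds);
  return consensus.filter((f) => !refuted.has(f.id));
}
```

Whatever survives applySkeptic is what would be fixed in-commit; an empty refutedIds list corresponds to the Arc 1 outcome where the skeptic refuted nothing.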
Externalize discipline → Signed verdict v2
Discipline that relies on human memory fails exactly when fatigue is highest. A rule you can bypass by forgetting is not a rule — it belongs outside the head, in a cryptographic artifact.
Signed verdict v2 carries an HMAC-SHA256 or Ed25519 signature over a payload containing commit_sha (cross-checked against HEAD), staged_tree (the index contents), workflow_run_id (a single-use nonce in a journal), and secret_tier (env > persisted random > host-derived). A replayed verdict is rejected, a stale verdict is rejected, and a host-derived secret is rejected under STRICT_SECRET=1. The signed verdict complements [skip-review] as a second unblock path beside the session marker — the session marker remains the highest-priority allow path.
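The checks described above can be sketched with Node's built-in crypto module. This covers the HMAC-SHA256 variant only and is an illustration under stated assumptions, not the project's gate: the payload field names come from the text, while the canonical JSON encoding, the in-memory nonce set standing in for the journal, the tier label "host-derived", and the ctx shape are hypothetical.

```javascript
const crypto = require("node:crypto");

// Serialize with a fixed key order so signer and verifier hash identical bytes.
// (The canonical form is an assumption of this sketch.)
function canonical(payload) {
  return JSON.stringify({
    commit_sha: payload.commit_sha,
    staged_tree: payload.staged_tree,
    workflow_run_id: payload.workflow_run_id,
    secret_tier: payload.secret_tier,
  });
}

function signVerdict(payload, secret) {
  const signature = crypto.createHmac("sha256", secret).update(canonical(payload)).digest("hex");
  return { payload, signature };
}

// ctx: { headSha, stagedTree, strictSecret, usedRunIds: Set } (hypothetical shape).
function verifyVerdict(verdict, secret, ctx) {
  const { payload, signature } = verdict;
  const expected = crypto.createHmac("sha256", secret).update(canonical(payload)).digest();
  const given = Buffer.from(String(signature), "hex");
  // timingSafeEqual throws on unequal lengths, so check the length first.
  if (given.length !== expected.length || !crypto.timingSafeEqual(given, expected)) {
    return { ok: false, reason: "bad_signature" };
  }
  // Stale: the verdict was issued for a different commit or index state.
  if (payload.commit_sha !== ctx.headSha || payload.staged_tree !== ctx.stagedTree) {
    return { ok: false, reason: "stale" };
  }
  // Under strict mode the weakest secret tier is not accepted.
  if (ctx.strictSecret && payload.secret_tier === "host-derived") {
    return { ok: false, reason: "weak_secret" };
  }
  // Replay: each workflow_run_id may unblock exactly one commit.
  if (ctx.usedRunIds.has(payload.workflow_run_id)) {
    return { ok: false, reason: "replayed" };
  }
  ctx.usedRunIds.add(payload.workflow_run_id);
  return { ok: true };
}
```

The nonce is recorded only after every other check passes, so a rejected verdict does not burn its run id.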
Docs rot fastest in AI projects → Doc-currency lint
When an agent writes code faster than a human reads, documentation diverges from reality within a single sprint. Stale docs are worse than none — they impersonate a truth they no longer hold.
cortex-doc-currency.cjs lints documentation against the measured repository state and flags stale numbers and claims; cortex-doc-regen.cjs regenerates derived passages from the source of truth. The documentation.md standard defines what is derived — and thus regenerable — versus hand-written.
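The idea of linting prose against measured state fits in a few lines. The sketch below is not cortex-doc-currency.cjs; the rule table, the measured-state object, and the function name are hypothetical and only illustrate comparing a number claimed in the docs with a number counted from the repository.

```javascript
// Hypothetical rule table: each rule ties a phrase in the docs to a measured key.
const RULES = [
  { key: "standards", pattern: /(\d+)\s+standards\b/g },
  { key: "action_kinds", pattern: /(\d+)\s+action_kinds\b/g },
];

// Returns one entry per numeric claim that disagrees with the measured value.
function lintDocCurrency(docText, measured) {
  const stale = [];
  docText.split("\n").forEach((line, i) => {
    for (const { key, pattern } of RULES) {
      // matchAll works on a copy of the regex, so the shared /g pattern keeps no state.
      for (const m of line.matchAll(pattern)) {
        const claimed = Number(m[1]);
        if (claimed !== measured[key]) {
          stale.push({ line: i + 1, key, claimed, actual: measured[key] });
        }
      }
    }
  });
  return stale;
}
```

A non-empty result is what a lint run would report as drift; a regeneration step could then rewrite the derived numbers from the same measured object.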
Context dies each session → Wisdom-layer split (SSOT)
Every new session starts with empty memory. What changes in weeks — current state — must not live in the same place as what is stable for years — institutional wisdom — otherwise it duplicates and drifts apart.
cortex-load.md is a mental cheat-sheet that brings a new session up to context; memory-decay.cjs governs memory aging so stale entries do not outweigh fresh ones; lessons.cjs manages the lessons-learned layer. One source of truth for each piece of information — never duplicate.
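One way to keep stale entries from outweighing fresh ones is to age each entry's weight. The text does not say how memory-decay.cjs does this, so the exponential half-life model, the entry shape, and both function names below are assumptions chosen for illustration only.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Assumed model: an entry's weight halves every `halfLifeDays` since its last update.
function decayedWeight(entry, nowMs, halfLifeDays) {
  const ageDays = Math.max(0, (nowMs - entry.updatedAtMs) / DAY_MS);
  return entry.baseWeight * Math.pow(0.5, ageDays / halfLifeDays);
}

// Rank entries so that a fresh, modest entry can outrank an old, once-important one.
function rankEntries(entries, nowMs, halfLifeDays) {
  return entries
    .map((e) => ({ ...e, weight: decayedWeight(e, nowMs, halfLifeDays) }))
    .sort((a, b) => b.weight - a.weight);
}
```

With a 30-day half-life, an entry of base weight 1 that is 60 days old scores 0.25 and ranks below a fresh entry of base weight 0.5.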
A flat standards list fails under pressure → Tier rules 0/1/1.5/2/3
When there are 35 standards and all carry equal weight, under time pressure none gets applied. Rules need an explicit priority order so you can decide what yields when two budgets conflict.
Tier 0 (Ship-Ready) → 1 (SSOT/modularity) → 1.5 (coding behavior) → 2 (security/testing/correctness) → 3 (process) gives the standards a lexicographic order. RULE-1.md codifies the SSOT layer; action-kinds.cjs maps the 21 action_kinds to rules; code-review.md applies that same order during review.
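Tiered precedence is easy to make mechanical. The sketch below assumes each rule carries a tier from the 0 / 1 / 1.5 / 2 / 3 list above and a string id; the tie-break by id and the function names are hypothetical, added only so the order is total.

```javascript
// Tier order from the text: 0 → 1 → 1.5 → 2 → 3, earlier tiers win.
const TIER_ORDER = [0, 1, 1.5, 2, 3];

function tierRank(tier) {
  const rank = TIER_ORDER.indexOf(tier);
  if (rank === -1) throw new Error(`unknown tier: ${tier}`);
  return rank;
}

// Compare by tier first; within a tier fall back to the rule id so the
// result is deterministic (the tie-break is an assumption of this sketch).
function compareRules(a, b) {
  return tierRank(a.tier) - tierRank(b.tier) || a.id.localeCompare(b.id);
}

// When several rules conflict, the one that sorts first decides.
function resolveConflict(rules) {
  return [...rules].sort(compareRules)[0];
}
```

Because the comparison is a plain sort key, the same function can order review findings and decide which budget yields when two rules collide.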
Real validation = patterns transfer cross-domain → Centralized ~/.claude/shared/
A pattern that works in one project alone may be coincidence. Only when the same standard improves both RELO and the multi-tenant chatbot platform is it a demonstrably transferable method, not a local trick.
Everything shared lives in ~/.claude/shared/, so standards, hooks, and skills load into every project identically — README.md is the index of 35 standards and auto-orchestrate.cjs nudges parallel agents regardless of which project is running. Centralization is the mechanism of transferability.
Repeatability beats virtuosity → The /cortex-sprint Skill
One brilliant sprint proves nothing; a method you can run again and again at the same quality proves everything. The bottleneck is not model capability but the operator's consistent operational discipline.
/cortex-sprint wraps the whole cycle — plan → R1 → implementation → R2 → verdict → capture — into a single repeatable skill; sprint-pipeline.md defines the phases and gates. A sprint plan like sprint-2-44-plan.md is a documented instance of that same process, not a unique performance.
A four-tier trajectory, two already shipped.
SSOT · Modular · Scalable
Three architectural invariants non-negotiable across every scaffolded project. The structural floor — break one of these and the rest stops compounding.
Security · Testing · Observability · Correctness
Four critical standards block PRs in any scaffolded repo. Failures here surface as red CI lanes, not silent drift. The correctness floor.
Thirty-plus further standards (warnings, not blockers)
Code style, doc hygiene, dependency hygiene, naming. Surfaced as warnings so the operator can sequence work; not gated because they degrade gracefully.
Thirty-five standards across five rule tiers, ordered by a single mental model: structure first, then correctness, then polish.
Tier 2 — compound learners — is essentially closed (~80%): daily and weekly Dreaming consolidation crons run on the cortex-x repo since 2026-05-09, alongside the AlphaEvolve A/B harness and FTS5-backed lesson retrieval. Tier 1 has shipped and holds, with the 2.3b runner and Stryker mutation testing still pending. Tier 3 and Tier 4 are stated commitments, not deliveries.
// Four-tier trajectory · two already shipped
Tier 0 · Foundation ─────────── shipped
  Scaffolds new projects · 11 stack profiles · 35 standards · 9 review agents · 8 hooks · install in ~3 minutes
Tier 1 · Verification + multi-agent ── shipped
  7-criterion spec verifier · Phoenix OTLP · autoresearch · senior-tester review · 6-agent parallel review pipeline · multi-window cost safety
Tier 2 · Compound learners ──────── ~80% done
  AlphaEvolve A/B harness v0 · self-extending capabilities · FTS5 lessons · daily + weekly Dreaming consolidation
Tier 3 · Productization ──────────── planned
  Capability marketplace · WaaS template · voice → recommendation pipeline
Tier 4 · Persistent entity ───────── 2027+
  Self-hosted home server · soul abstraction · Obsidian SSOT · multi-source life ingest
// Seven criterion kinds: sprint 1.9.0 + 2.18 + 2.3.1
// Verifier sits between applyAction and runNpmTest.
{
  kind: "shell",           // exit code + stdout match
  kind: "file_predicate",  // exists · mtime · content hash
  kind: "regex",           // pattern match in named files
  kind: "ears_text",       // EARS-shape NL clause
  kind: "llm_judge",       // Sonnet-grade boolean verdict
  kind: "read_set",        // proof the LLM read claimed files
  kind: "mutation_score",  // Stryker survival threshold (2.3.1)
}
// Verifier fails closed: any criterion that does
// not pass aborts the action, triggers atomic rollback,
// writes the failure to the journal, produces no PR.
The design choice worth defending, and the day green tests weren't enough.
“One incident class equals one defense layer plus one regression test. The rule that closes the gap between green tests and actual safety.”
Spec-driven verification — seven criterion kinds
A verifier sits between applyAction and runNpmTest and gates every action against per-kind acceptance criteria. Each of the 21 action_kinds in the Steward registry (autoresearch · evolve_daily · senior_tester_review · secret_history_sweep · workflow_hardener · wiki_consolidate · release_notes_drafter and fourteen others) declares its own criteria list. Verification runs as a layer independent of the model: it checks the output against executable criteria instead of relying on the model having been right. The mainstream agent runtime trusts the LLM's self-report; cortex-x writes the proof in code.
Cost as a first-class verifier output
… STEWARD_HALT) is the difference between an experimental autopilot and an autopilot the operator forgets to monitor. The gap analysis was real: a steady $5/day burn for 30 days ($150 in total) would clear a daily cap unflagged but blow an $80 monthly ceiling. Multi-window plus velocity catches the pattern, not the snapshot.
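The $5/day example can be worked through in code. In this sketch only the $5/day burn and the $80 monthly ceiling come from the text; the $10 daily cap, the window table, the spend record shape, and both function names are assumptions made for illustration.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Illustrative caps: the daily value is hypothetical, the monthly one is from the text.
const WINDOWS = [
  { name: "daily", days: 1, capUsd: 10 },
  { name: "monthly", days: 30, capUsd: 80 },
];

// Sum of spend records ({ atMs, usd }) that fall inside the trailing window.
function spentInWindow(spends, nowMs, days) {
  const since = nowMs - days * DAY_MS;
  return spends
    .filter((s) => s.atMs > since && s.atMs <= nowMs)
    .reduce((sum, s) => sum + s.usd, 0);
}

// Multi-window check: names of every window whose cap is already exceeded.
function breachedWindows(spends, nowMs, windows = WINDOWS) {
  return windows
    .filter((w) => spentInWindow(spends, nowMs, w.days) > w.capUsd)
    .map((w) => w.name);
}

// Velocity check: extrapolate the recent daily average over a longer horizon.
function projectedUsd(spends, nowMs, lookbackDays, horizonDays) {
  return (spentInWindow(spends, nowMs, lookbackDays) / lookbackDays) * horizonDays;
}
```

At day 30 the daily window sees only $5 and stays quiet while the monthly window reports $150 against $80. The velocity projection reaches the same $150 figure after the first week, which is the point of watching the pattern rather than the snapshot.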
Signed verdict v2 — cryptographic, not remembered
The verdict carries an HMAC-SHA256 or Ed25519 signature over commit_sha, staged_tree, a workflow_run_id nonce, and a secret tier. A replayed, stale, or host-derived signature (under STRICT_SECRET=1) is rejected. It does not replace [skip-review]; it complements it as a second unblock path, and the session marker stays the highest-priority allow path. Discipline moved out of human memory and into a cryptographic artifact: the gate no longer trusts whoever says it passed, only what can be verified.
The Hermes rebrand
Sprint 4.7 (2026-05-08) was a hard pivot: every reference renamed to Steward, ten shim modules deleted in the same drop instead of carried as backward-compatibility debt, 115 test failures repaired same-day. Commit 8064b34. Lesson: when a public-facing name collides with a project in the same problem space, fix it before the public tag, not after. Search the namespace before committing the brand.
Sprint 1.6.18: when green tests weren't enough
… git push away from public-preview. Operator discipline ran a 6-agent parallel review pipeline against the diff anyway: acceptance-auditor · blind-hunter · correctness-auditor · security-auditor · ssot-enforcer · edge-case-hunter, each with differentiated context scope.
Eight ship-blockers came back the same day: tightened path-traversal needed NUL-byte and flag-injection guards plus realpath containment; the editPlan shape needed an explicit shape gate; a data === null guard was missing; default model alignment had drifted from SSOT; CLI help text was stale; MIGRATIONS.md had not been backfilled. All eight were fixed the same day. Tests prove behavior, not architecture.
When a sprint catches its own bugs before they reach main
Verdict-gate flow: R2 dispatch → 6 agents → Pass-2 skeptic → signed verdict → pre-commit gate.
Arc 1 ran four sprints in a chain: 2.46 (signed verdict gate), 2.46.1 (Ed25519 v2 + nonce journal), 2.46.2 (doctor integration + qualified-prose tolerance), and 2.3.1 (mutation_score as the seventh criterion kind). The Arc 1 ledger is plain: +145 tests (3,290 → 3,435), 23 HIGH bugs caught and fixed before push, and zero of them on main. The Pass-2 skeptic refuted none of them.
Sprint 2.46 was the meta-recursive moment. The workflow whose job was to ship the signed-verdict gate produced structural defects in its own deliverables: a fictional gate-behavior table in standards/sprint-pipeline.md, an over-promised commitSha binding, and path drift across six reviewers. All six independent reviewers flagged the same bug (confidence 99/98/96/96/95/92). R2 caught them, the parent agent fixed them in-commit, the signed verdict regenerated, and the commit landed. None reached main.
This is the structural difference between “we have review” and “review is load-bearing.” When the pipeline finds a defect in an artifact the gate itself was meant to build, the discipline no longer lives in my head; it is externalized into code. pre-commit-review-gate.cjs does not wait for me to remember; the session marker stays the highest-priority allow path, but when I do not consciously choose a skip, the gate holds on its own.
This was the moment the project stopped being a set of tools and became a mechanism: the system corrected itself at a level I had not caught myself.
9 files → a framework in 7 weeks. Solo.
The founding sprint began on 2026-04-17 from nine files and ran to a first public preview. Then came Arc 1 — hardening through 2026-06-03. Since 2026-05-09 the same engine opens real draft PRs straight on this repo: 17 active cron workflows, draft-PR only, atomic rollback on test failure.
Cross-platform install (Bash + PowerShell 5.1 + 7). Three foundational hooks: session-start, block-destructive, pre-compact. Projects library, cortex-thinker agent, insights, journal, coding-behavior tier, ship-ready gate. The scaffolding floor.
Agentic Security section (lethal trifecta, 7 MUSTs) plus runtime SLOs and circuit breakers. Deterministic profile + stage classifiers under 100ms in detectors/. agentskills.io spec, browser-agent profile, Tirith scanner integration.
Zero-deps preserved via native fetch. 8 distinct error codes. Pluggable engine seam (mock / openrouter / claude-cli). First real LLM call validated end-to-end. Atomic rollback on test failure. Journal cost capture. gh pr create --draft Phase 11 integration.
Sprint 1.9 spec-driven verification (5 criterion kinds initially, +1 in Sprint 2.18). Multi-window cost safety + cross-session loop detector. Sprint 2.0 Phoenix OTLP observability. Sprint 2.1 autoresearch. Sprint 2.3 Stryker mutation baseline. 2026-05-14 autonomous burst: 11 sprints + 4 R2 rounds shipped. The first nightly cron workflows go live.
Four sprints in a chain — 2.46 (signed verdict gate), 2.46.1 (Ed25519 v2 + nonce journal), 2.46.2 (doctor integration + qualified-prose tolerance), 2.3.1 (mutation_score as the 7th criterion kind). The meta-recursive moment: the workflow meant to ship the signed-verdict gate itself produced structural defects in its own deliverables. All six independent reviewers flagged the same bug (confidence 99/98/96/96/95/92). R2 caught it, the parent agent fixed it in-commit, the signature regenerated, the commit landed. Nothing reached main.
Test count progression: 207 (Apr 17) → 600 (May 7) → 3,290 (Arc 1 start) → 3,435 (Jun 3). Five-lane CI matrix — Ubuntu bash, macOS bash, Windows Git Bash, Windows PowerShell 7, Windows PowerShell 5.1 — green throughout. The 17 active workflows include steward-harvest · steward-evolve-daily · steward-evolve-weekly · steward-flaky-test-repair · steward-secret-history-sweep · steward-doc-drift · steward-pr-review-responder · steward-senior-tester-review · steward-workflow-hardener and others.
CJS · native fetch
Zero runtime npm or pip deps. Steward must be auditable, vendorable into client infrastructure, and runnable on hardened CI without supply-chain surface. The spec-verifier now ships seven criterion kinds including mutation_score, and a signed r2-verdict gates commits into the pipeline.
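The fail-closed behavior of the spec verifier can be sketched without any dependencies. This is an illustration, not Steward's verifier: it implements only two of the seven criterion kinds against an in-memory file map, and the action shape, the journal array, and all function names are assumptions.

```javascript
// Two criterion kinds over an in-memory tree ({ path: content }); the other
// five kinds are deliberately absent so the fail-closed path is visible.
const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);
const CHECKERS = new Map([
  ["file_predicate", (c, files) => has(files, c.path)],
  ["regex", (c, files) => has(files, c.path) && new RegExp(c.pattern, "m").test(files[c.path])],
]);

// Fail closed: an unknown kind, a thrown error, or a non-true result all count as failure.
function verify(criteria, files) {
  const failures = [];
  for (const c of criteria) {
    const check = CHECKERS.get(c.kind);
    let passed = false;
    try {
      passed = check ? check(c, files) === true : false;
    } catch {
      passed = false;
    }
    if (!passed) failures.push(c);
  }
  return { ok: failures.length === 0, failures };
}

// Apply an action to a copy of the tree; keep the result only if every criterion passes.
function applyWithVerification(action, files, journal) {
  const next = action.apply({ ...files });
  const result = verify(action.criteria, next);
  if (!result.ok) {
    journal.push({ action: action.kind, status: "rolled_back", failed: result.failures.map((c) => c.kind) });
    return { files, committed: false }; // original tree returned untouched
  }
  journal.push({ action: action.kind, status: "verified" });
  return { files: next, committed: true };
}
```

Working on a copy makes the rollback trivial here; a real working tree would need an actual restore step, which this sketch does not attempt.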
Numbers (snapshot 2026-06-03).
Hand-verified against the live repo on 2026-06-03 — and the doc-currency lint guarantees these don't rot silently.
Results
- R/01 A maintenance autopilot that actually runs unattended. Since 2026-05-09, 17 cron workflows have been opening draft PRs on the cortex-x repo nightly without manual intervention. Each PR carries a journal trailer with cost, phase timings, and rollback receipts. Real validation, not a screenshot.
- R/02 Seven independent enforcement layers, each able to stop a commit on its own. block-destructive intercepts a destructive shell command; a policy denylist refuses forbidden operations; multi-window USD caps cut off spend across several time windows; a loop detector catches runaway cycles; a circuit breaker halts a cascade of failures; atomic rollback returns the tree to a clean state; the signed R2 verdict is the last gate before a commit lands. Compromising one does not bypass the others: seven mutually independent locks, a structural property, not a config you can switch off by accident.
- R/03 The XDG separation. Personal data (project library entries, journal traces, research cache, insights) lives under $CORTEX_DATA_HOME (default ~/.cortex/). Framework code stays under ~/.claude/shared/. cortex-uninstall --purge requires a second confirmation step. The framework can be wiped clean without touching months of accumulated work.
- R/04 Public Apache-2.0 release with stranger-reproducible install. One-line install on five platforms. 600-line installer audited line-by-line. cortex-doctor validates the install end-to-end with drift detection and auto-fix prompts. The framework leaves the operator's laptop on terms a stranger could verify.
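The data/code separation in R/03 comes down to two path lookups and one guard. The sketch below is an assumption-laden illustration, not the cortex-uninstall implementation: the function names and the way confirmations are counted are hypothetical, and only the $CORTEX_DATA_HOME variable, the ~/.cortex/ default, and the ~/.claude/shared/ location come from the text.

```javascript
const os = require("node:os");
const path = require("node:path");

// Personal data: $CORTEX_DATA_HOME if set and non-empty, otherwise ~/.cortex.
function dataHome(env = process.env) {
  const configured = (env.CORTEX_DATA_HOME || "").trim();
  return configured ? path.resolve(configured) : path.join(os.homedir(), ".cortex");
}

// Framework code lives apart from personal data.
function frameworkHome() {
  return path.join(os.homedir(), ".claude", "shared");
}

// A plain uninstall touches only framework code. Personal data is added to the
// removal list only with --purge plus two confirmations (counting is assumed).
function uninstallTargets({ purge = false, confirmations = 0 } = {}, env = process.env) {
  const targets = [frameworkHome()];
  if (purge && confirmations >= 2) targets.push(dataHome(env));
  return targets;
}
```

Keeping the removal list as data makes it easy to print exactly what would be deleted before anything is touched.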
Learnings
- L/01 Build the product before the framework.
RELO came first — an AI back-office agent in production. The framework is the distilled discipline that produced RELO, not the precondition for producing it. The order matters: extract the pattern from a working result, then formalize. The opposite order produces frameworks nobody uses.
- L/02 Tests prove behavior; multi-agent review proves architecture.
Arc 1 showed it meta-recursively. Sprint 2.46 was meant to deliver the signed-verdict gate; yet the very workflow that was supposed to build the gate delivered structural defects in its own deliverables — a fictional gate-behavior table in standards/sprint-pipeline.md, an over-promised commitSha binding, path drift across six reviewers. All six independent reviewers flagged the same bug at confidence 99/98/96/96/95/92. R2 caught it, the parent agent fixed it in-commit, the signed verdict regenerated, the commit landed. None of it reached main. That is the difference between having a review process and review being load-bearing.
- L/03 Security mechanics are a structural gate, not documentation.
The seven independent layers do not exist because they look good in a README. They exist because relying on operator discipline under fatigue and time pressure fails — not occasionally, but reliably, at the worst moment. An architectural gate is more expensive to write once, but cheaper to operate long-term than human vigilance. Discipline I have to remember, I will forget one night at three a.m. A gate wired into the code never forgets.
- L/04 Discipline externalized in code beats discipline held in memory.
The signed verdict, the hooks, the acceptance_criteria on every action_kind — all of it survives model generations. When Claude 5 arrives, I don't have to teach the new model what done means; the definition of done is written into 35 standards and into the code, not into the model's head or mine. A rule that lives only in memory disappears with the context window. A rule written to a file stays.
- L/05 Repeatability beats virtuosity.
Sprint 2.45 shipped the /cortex-sprint Skill encoding the canonical 8-step pipeline. On 2026-06-03 I forbade freestyle sprint dispatch — every sprint must go through that Skill. The result is an identical plan / R2-summary / verdict structure across sprints; not because I'm equally sharp every time, but because the pipeline replaces the need for that. And because ~/.claude/shared/ is globally available, the discipline transfers across projects. Three documented events confirm it: the RELO RLS-first multi-tenant pattern carried into lasertgame-funos (2026-05-01); the portfolio retrofit where the review pipeline autonomously caught 3 HIGH findings with no explicit invoke (2026-04-21); and news-bot, where agentic-workflow discipline applied in a project with no cortex sprint at all (2026-06-03).
What cortex-x doesn't claim — yet.
The deferred backlog: a multi-agent git-worktree spawner, an Anthropic Memory Tool wire-up, and graduating the Pass-2 skeptic from opt-in to default-on. The Tier 1 2.3b runner plus Stryker is underway, not finished. What is no longer deferred: mutation_score shipped on 2026-06-03 in Sprint 2.3.1 as the seventh spec-verifier criterion kind, so I stop promising it and start using it.
The framework is in public preview because the engine is real, not because everything on the roadmap is done.
A persistent agent on operator-owned infrastructure.
Tier 2 is essentially closed at roughly 80%. Tier 3 — productization — is next. Tier 4 is a persistent entity: a self-hosted home server on which the factory runs independently of the hardware under me. It is not tied to one machine, nor to one provider.
cortex-x is not for sale. It is open source under Apache 2.0. The work behind it is the work I want to do next.
Open source on GitHub
What working on this looks like
If you're reading this as a hiring manager or a client, this is what my work looks like — and this repository is the proof of it, not the claim.
- 95% confidence baseline. Before I write the first line, I ask clarifying questions until I'm roughly 95% sure of both scope and acceptance criteria. One round of questions saves three to four rounds of corrections.
- Autonomous overnight runs with checkpoint discipline. When the model's smart zone starts to degrade, I checkpoint and clear context — I don't push on blind. Quality over count: fewer commits that hold beats a pile that has to be reverted.
- Cross-project transfer — measured, not declared. Three documented carries: the RLS-first multi-tenant pattern from RELO into lasertgame-funos; a review pipeline that caught 3 HIGH findings on the portfolio with no explicit invoke; and agentic-workflow discipline applied in news-bot with no cortex sprint at all.
- Externalized discipline. What I do by hand today, I codify into a hook, a skill, or a verdict tomorrow. Discipline that lives only in your head is technical debt — it belongs in code.
On April 17 I started from nine files. By June 3 the repo holds 35 standards, 21 action_kinds, and 17 nightly workflows — and across all of Arc 1 not a single R2 HIGH finding shipped to main, with 23 caught before push. When the pipeline ships defects in its own deliverables and catches them before push, that's proof the discipline is externalized into code — not just in my head. That's what working with me looks like.