How osshp Was Really Built: An Orchestrated AI Team, and the Guardrails That Made It Trustworthy

I've written about the architecture decisions behind osshp and about the path I'd recommend if you want to build something like it. This post is about something different: how the work actually got done, day to day. I haven't said this plainly yet, so let me say it plainly now. osshp was built by an orchestrated team of AI agents, working under my direction. I made the product calls and the judgment calls, and I gave final approval on everything that shipped. The agents did the execution, inside guardrails I'm about to describe.

I'm telling this story because the guardrails are the interesting part, not the fact that AI agents wrote code. Anyone can point a model at a codebase and get output. What made that output trustworthy enough to run my real site on is a specific set of process decisions, and those decisions would transfer to a team of humans, a team of agents, or a mix of both. That's the part worth writing down.

The team, roughly

The structure is a small org chart, not a single model doing everything. There's an orchestrator whose entire job is to receive every request and route it to the right specialist rather than doing the work itself. Underneath that: engineering specialists who build features, an operations specialist who handles deployment, a design specialist, a documentation specialist, an independent QA specialist, and an adversarial security reviewer. Each one has a defined role and its own persistent memory, so a specialist doesn't relearn the same lesson every session.

I want to be specific about where the human line sits, because it's the part people usually get wrong when they imagine this kind of setup. I decided what osshp should be, what tradeoffs were acceptable, and whether a given piece of work was actually done. The agents didn't decide the product. They executed against direction I gave them, inside a process designed to catch the mistakes any builder makes, human or not.

The guardrail that does the most work: the author never QAs their own change

If I had to name one rule that mattered more than the others, it's this one: whoever writes a change does not verify it. A different, independent agent checks every change at runtime, meaning a real browser making real requests against the actual running application, not a read of the diff and a nod. That check covers both functional correctness and WCAG 2.1 AA accessibility, because "it works" and "it works for someone using a screen reader or keyboard-only navigation" are two different claims, and only one of them gets checked if you don't ask both questions on purpose.

Anything touching a security-sensitive surface, meaning auth, cryptography, or anything that opens a new attack surface, gets a second, separate pass on top of that: an adversarial security review by an agent whose entire job is to try to break it. Not to confirm it looks fine. To find the way it fails.

Problem-shaped delegation

The second guardrail is about how work gets handed off. A specialist is given the problem, the evidence for the problem, and the acceptance criteria that define done. Not the solution. Not which files to touch. If I hand someone a ready-made fix, I've made my first guess the ceiling on the outcome, because nobody pushes back on a solution that's already been decided. Handing over the problem instead means a specialist is expected to disagree with my instinct and propose something better when it's warranted, and more than once, they were right to.

Four bugs this actually caught

I could describe these guardrails abstractly, but the honest proof is in what they caught before it reached a visitor. Four are worth naming specifically, because they're the kind of defect that a quick self-check would have waved through.

A content-import path could publish a photo gallery with missing alt text, silently bypassing the accessibility gate the normal admin UI enforced. The gate worked exactly as designed everywhere I'd built it into the UI. It just didn't cover a second way content could enter the system, and the independent QA pass is what noticed the second door.
Early analytics code recorded every 404 hit and stored the attacker-controlled string that produced it, with no bound on length. Nothing about that looked wrong on a casual read of the code; it took the security gate specifically asking "what happens if this string is hostile and unbounded" to flag it.
The SSRF filter protecting the external-image fetcher correctly blocked private IP addresses written in ordinary dotted notation, but missed the equivalent hexadecimal form of an IPv4-mapped IPv6 address, including the form that reaches cloud metadata endpoints. Twenty-seven tests passed. All twenty-seven tested the dotted form. The security reviewer caught the gap reading the code itself, not by running a test suite that had never been asked the right question.
After moving to Cloudflare Tunnel for hosting, a proxy-hop miscount made every visitor to the live site resolve to the same IP address. That quietly broke both rate-limiting and the unique-visitor count, and it only showed up because the site was actually live and actually being visited, not because a test caught it in advance.
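
To make the SSRF gap concrete, here is a minimal sketch in Python of the kind of check that closes it. This is not osshp's actual filter, and the function name is hypothetical. It assumes the fetcher already has an IP literal in hand, and it relies only on the standard library's ipaddress module. The point it illustrates is that the private-range rules should run on the unwrapped IPv4 address, so the dotted form and the hexadecimal IPv4-mapped IPv6 form get the same answer.

```python
import ipaddress


def is_blocked_address(literal: str) -> bool:
    """Return True if an IP literal points somewhere a fetcher should not reach.

    Hypothetical helper for illustration; not osshp's actual filter.
    Raises ValueError if `literal` is not an IP address at all.
    """
    # Accept the bracketed form used in URLs, e.g. "[::1]".
    addr = ipaddress.ip_address(literal.strip("[]"))

    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d, also written ::ffff:hhhh:hhhh)
    # so the checks below see the IPv4 address that would actually be contacted.
    if addr.version == 6 and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped

    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


if __name__ == "__main__":
    # The same metadata-style address written three ways, plus a public one.
    for candidate in (
        "169.254.169.254",
        "::ffff:169.254.169.254",
        "::ffff:a9fe:a9fe",
        "8.8.8.8",
    ):
        print(candidate, is_blocked_address(candidate))
```

A check like this only helps if it runs on the address the fetcher will really connect to, which usually means the result of DNS resolution rather than the hostname in the URL. A test suite for it would also want one case per notation, not twenty-seven cases of the same notation.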

None of these are exotic bugs. They're the ordinary kind that slip through when the person who wrote the code is also the last person to look at it.
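
The proxy-hop miscount is easy to reproduce in a few lines. The sketch below is an assumption about the general shape of the problem, not osshp's code: it supposes the application reads the visitor's address from an X-Forwarded-For header and is configured with a count of trusted proxies in front of it. The function and parameter names are hypothetical.

```python
def client_ip(remote_addr: str, x_forwarded_for: str, trusted_hops: int) -> str:
    """Pick the visitor's address from a proxy chain.

    Hypothetical helper for illustration. `remote_addr` is the peer that
    connected to the application, `x_forwarded_for` is the raw header value,
    and `trusted_hops` is how many proxies in front of the app are trusted.
    """
    forwarded = [part.strip() for part in x_forwarded_for.split(",") if part.strip()]

    # Full chain, oldest hop first: the visitor, then each proxy, then the peer.
    chain = forwarded + [remote_addr]

    # Skip one entry per trusted proxy, counting back from the connecting peer.
    index = len(chain) - 1 - trusted_hops
    return chain[max(index, 0)]


if __name__ == "__main__":
    # One proxy in front of the app, connecting from localhost.
    print(client_ip("127.0.0.1", "203.0.113.7", trusted_hops=1))  # the visitor
    print(client_ip("127.0.0.1", "203.0.113.7", trusted_hops=0))  # the proxy
```

With the hop count one too low, every request resolves to the proxy's own address, so a rate limiter sees a single very busy visitor and a unique-visitor counter never gets past one. Setting the count too high is the opposite failure: it trusts an entry the visitor could have written into the header themselves.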

Journaling, so the same mistake doesn't happen twice

The guardrails above catch mistakes at the moment they're made. A separate practice tries to stop the same mistake from being made a second time: durable lessons get written down as they're learned, not just fixed and forgotten. Two examples from this build: a code scrub that a later, unrelated merge silently reverted, and a pre-public scrub pass that cleared sensitive content from most of the codebase but missed an entire file type. Both got written into the record specifically so the next pass through similar territory checks for exactly that failure mode instead of rediscovering it the hard way.

Dogfooding is what actually surfaced the bugs that mattered

Every one of the four bugs above, and most of the other real defects found along the way, showed up because osshp runs steili.com, my actual site, not a staging environment nobody depends on. A gate can check what you think to ask it. Running the software as your real, publicly-visited site asks it questions you didn't think of. That's not unique to an AI-built project; it's true of any software. It just happened to be the thing that turned "passed the tests" into "actually holds up."

The honest takeaway

What made this trustworthy wasn't the models. In fact, most of this site was coded by Sonnet. It was the process wrapped around them: independent QA that never trusts the author's own word, adversarial security review on anything with real stakes, delegation shaped as a problem instead of a pre-decided fix, verification that happens against the running system rather than the diff, and dogfooding on real use before calling anything done. Swap the agents for a team of junior engineers (Sonnet) and senior engineers (Opus, Fable, Codex 5.5) and the same five things would still be the reason the output held up. That's the actual lesson here, and it's the one I'd want a reader to walk away with, whether the team building their next project is AI, human, or both.