
TL;DR
Ever since LLMs hit the mainstream, attention has focused on the model: GPT-5, Claude, Gemini, Qwen. But in production, the real lever for reliability is no longer the model — it's everything around it. Context, tools, workflow, validation, guardrails, observability. We call that the harness, and designing it is a discipline of its own: Harness Engineering.
At VOID, we built our first harness as a 100% on-premise stack to automate security updates, tested on open-source code and our own internal projects. Stack: Autogit (open source), Ollama serving Qwen3.5-27B, all running on an Nvidia RTX PRO 6000. This hands-on experience reminded us of a simple truth: without an automated feedback loop, an AI agent has no business in production.
1. The "we plugged in GPT, it'll work" trap
Everyone has seen this scenario. A company wires an LLM into their information system. Early demos are thrilling. Ship to production. Suddenly, the agent starts hallucinating references, ignoring critical instructions ("never touch the prod database"), doing the exact opposite of what was asked.
Usual diagnosis: "the model isn't good enough, let's wait for the next generation." Wrong. 90% of the time, it wasn't a model problem. It was a harness problem.
The most common mistake
Believing that an LLM, shipped alone, can honor commitments in production. A raw model has no persistent memory, no guardrails, no vetted tools, no recovery loop. It improvises. In production, improvising is not an option.
2. What is a harness in AI?
The word harness — literally a riding or climbing harness — refers in AI engineering to everything around the model that makes it useful and reliable. The LLM is the engine. The harness is the chassis, the steering, the brakes, the dashboard.
Components of a harness
Several consumer products we use every day are actually harnesses wrapped around one or more LLMs:
- Cursor / Claude Code: a harness for coding
- Devin (Cognition): a harness for autonomous development
- Perplexity: a harness for web search
- Harvey: a harness for the legal industry
- GitHub Copilot: a harness for IDE autocomplete
So the question we ask in Harness Engineering isn't "which model should we pick?", but "what environment should we design around the model so it handles the task reliably, traceably, and within governance?"
3. Clarifying the vocabulary
The field moves fast and several terms circulate, often conflated. To frame the discussion, here are the useful distinctions:
| Term | What it refers to | Popularized by |
|---|---|---|
| Harness / AI Harness | The software infrastructure around the LLM (tools, memory, validation, workflow) | Anthropic, Cursor, Devin |
| Context Engineering | The discipline of designing what we feed the agent (context, docs, specs) | Andrej Karpathy (2024) |
| Agent / Agentic Engineering | Designing the full workflow of the agent (plan, exec, verify) | Open-source community |
| Prompt Engineering | (older) writing the right prompts — still useful, but insufficient | 2023 |
| Steering | Guiding / constraining the model to follow directives, even when it tends to forget them | Anthropic / OpenAI research |
| Scaffolding | The "rails" placed around the agent to channel its behavior | Academic papers |
| Skills / Tool use | The capabilities given to the agent (exposed functions, MCP, tools) | OpenAI, Anthropic, MCP |
Harness Engineering glossary
- What it refers to
- The software infrastructure around the LLM (tools, memory, validation, workflow)
- Popularized by
- Anthropic, Cursor, Devin
- What it refers to
- The discipline of designing what we feed the agent (context, docs, specs)
- Popularized by
- Andrej Karpathy (2024)
- What it refers to
- Designing the full workflow of the agent (plan, exec, verify)
- Popularized by
- Open-source community
- What it refers to
- (older) writing the right prompts — still useful, but insufficient
- Popularized by
- 2023
- What it refers to
- Guiding / constraining the model to follow directives, even when it tends to forget them
- Popularized by
- Anthropic / OpenAI research
- What it refers to
- The "rails" placed around the agent to channel its behavior
- Popularized by
- Academic papers
- What it refers to
- The capabilities given to the agent (exposed functions, MCP, tools)
- Popularized by
- OpenAI, Anthropic, MCP
In what follows, we use "Harness Engineering" as the umbrella term. All others are sub-disciplines or techniques that attach to it.
4. The 6 pillars of a good harness
Building a harness isn't about stacking tools. It's about addressing six distinct concerns, and none of them are optional in production.
Context Engineering — giving the agent what it needs
Business documentation, product specs, decision history, properly indexed RAG, internal knowledge bases. An agent without relevant context produces generic output — at best useless, at worst wrong.
Principle: "garbage in, garbage out" — to the power of 10.
Skills & Tools — giving the right tools
Exposed functions, internal APIs, CLI commands, scoped read/write access to a specific perimeter. The Model Context Protocol (MCP) is emerging as the standard for connecting agents to tools.
Principle: the agent can only do what we explicitly allow it to do.
Workflow — orchestrating the steps
Planning → execution → verification → recovery. Patterns that work: ReAct, Chain-of-Thought, Tree-of-Thoughts, agentic loops. For complex tasks, we decompose into multiple agents (planner, executor, critic).
Principle: an agent without a workflow drifts.
Validation loops — verify before commit
Self-critique, automated evals, human-in-the-loop on critical actions. This is where automated tests (unit, integration, E2E Playwright) truly earn their keep.
Principle: an agent that checks its own work beats a bigger agent.
Steering & Guardrails — holding the line
This is the answer to the frustrating question: "Why is the agent doing the opposite of what I asked?" Techniques: reinforced system prompts, constitutional AI, strict output parsing, business rules outside the LLM.
Anti-pattern: putting everything in the prompt and hoping it holds. Critical constraints must live in deterministic code.
Observability — knowing what the agent is doing
Traces (LangSmith, Langfuse, Arize), tool-call logs, metrics: success rate, hallucination rate, cost per task, execution time. Without this, there's no debugging, and no improvement.
Principle: you can't manage what you don't measure.
Context Engineering — giving the agent what it needs
Business documentation, product specs, decision history, properly indexed RAG, internal knowledge bases. An agent without relevant context produces generic output — at best useless, at worst wrong.
Principle: "garbage in, garbage out" — to the power of 10.
Skills & Tools — giving the right tools
Exposed functions, internal APIs, CLI commands, scoped read/write access to a specific perimeter. The Model Context Protocol (MCP) is emerging as the standard for connecting agents to tools.
Principle: the agent can only do what we explicitly allow it to do.
Workflow — orchestrating the steps
Planning → execution → verification → recovery. Patterns that work: ReAct, Chain-of-Thought, Tree-of-Thoughts, agentic loops. For complex tasks, we decompose into multiple agents (planner, executor, critic).
Principle: an agent without a workflow drifts.
Validation loops — verify before commit
Self-critique, automated evals, human-in-the-loop on critical actions. This is where automated tests (unit, integration, E2E Playwright) truly earn their keep.
Principle: an agent that checks its own work beats a bigger agent.
Steering & Guardrails — holding the line
This is the answer to the frustrating question: "Why is the agent doing the opposite of what I asked?" Techniques: reinforced system prompts, constitutional AI, strict output parsing, business rules outside the LLM.
Anti-pattern: putting everything in the prompt and hoping it holds. Critical constraints must live in deterministic code.
Observability — knowing what the agent is doing
Traces (LangSmith, Langfuse, Arize), tool-call logs, metrics: success rate, hallucination rate, cost per task, execution time. Without this, there's no debugging, and no improvement.
Principle: you can't manage what you don't measure.
Deterministic vs non-deterministic
The line between what we delegate to the LLM and what must stay in classical code.
Always the same output
Same input, same result. Reproducible, auditable, testable.
Business code, SQL rules, classical APIs, unit tests, guardrails.
Variable output
Same input, result may vary. Creative, great for language, but unpredictable.
LLMs (GPT, Claude, Qwen), image generation, summaries, reasoning.
Always the same output
Same input, same result. Reproducible, auditable, testable.
Business code, SQL rules, classical APIs, unit tests, guardrails.
Variable output
Same input, result may vary. Creative, great for language, but unpredictable.
LLMs (GPT, Claude, Qwen), image generation, summaries, reasoning.
Example: a banking refund rule
Rule: "No refund > MAD 5,000 without human escalation."
Rule in the prompt:
"Never approve > MAD 5,000 without human escalation."
→ One day, the LLM approves MAD 8,000. Rare, but it will happen. And 1% of cases = disaster.
Rule in the code:
if (amount > 5000) {
escalateToHuman();
}
→ Impossible to bypass. Code never drifts.
Rule in the prompt:
"Never approve > MAD 5,000 without human escalation."
→ One day, the LLM approves MAD 8,000. Rare, but it will happen. And 1% of cases = disaster.
Rule in the code:
if (amount > 5000) {
escalateToHuman();
}
→ Impossible to bypass. Code never drifts.
The golden rule of the Harness
Critical → deterministic. Creative → non-deterministic. The harness draws the line.
5. The 3 traps that kill AI agent projects
"The model is smart enough, no workflow needed"
Wrong. Even GPT-5 or Claude Opus drift without scaffolding. The more sensitive the task, the more critical the workflow.
"We'll put all the rules in the prompt"
The prompt is non-deterministic: it drifts. Critical constraints (security, amounts, permissions) must live outside the LLM, in deterministic code that frames it. That's the golden rule.
"We'll evaluate it in production"
Too late. You need an evals suite before production, like unit tests — with nominal, edge, and adversarial cases.
6. VOID's case study: a 100% sovereign security-update agent
The lesson we want everyone to remember
An AI agent that fixes code but can't test its own work is worse than no agent. Before shipping an AI agent to a project, the priority isn't the model, isn't the orchestration framework, isn't the prompts.
The priority is the feedback loop: unit tests, integration tests, E2E Playwright tests. At VOID, we flipped our approach — we frame and level up the automated test harness before talking about an AI agent at all.
The context
On our DevOps and AWS managed-services engagements, we keep seeing the same reality: security debt piles up faster than it gets paid down. CVEs on NPM / Python / Java dependencies, OS patches, container vulnerabilities. Dozens of alerts every week. Product teams push features, and the debt grows.
Many regulated environments — typically banks, insurers, government — enforce a non-negotiable constraint: code must never leave the infrastructure. That rules out SaaS offerings like Copilot, Cursor or Dependabot Premium. You need a sovereign AI agent, running entirely on-premise.
So we set out to validate the concept end-to-end before any rollout. We ran the pilot on open-source code and on our own internal VOID projects, never on client code. Goal: prove feasibility and measure the limits of a sovereign AI agent that detects vulnerabilities, fixes the code, runs the tests, opens a PR — all without a single byte leaving the infra.
What follows is the raw case study from this pilot: the technical choices, the results, and — most importantly — the lessons we now apply to every project.

The 100% on-premise stack
After several iterations, we converged on a fully self-hosted architecture:
| Component | Choice | Role |
|---|---|---|
| Orchestration | Autogit (open source, self-hosted) | CVE detection, workflow steering, Git interaction |
| Inference server | Ollama | Serves the model locally |
| Model | Qwen3.5-27B | Code reasoning, fix generation |
| Hardware | Nvidia RTX PRO 6000 (Blackwell) | Pro-grade card, 96 GB VRAM, ideal for a 27B model locally |
| Vulnerability scan | Trivy / Snyk CLI | CVE detection |
| E2E tests | Playwright + existing unit/integration suite | Feedback loop — no PR without green tests |
| Git platform | GitHub / GitLab API | PR creation |
100% on-premise technical stack
- Role
- CVE detection, workflow steering, Git interaction
- Choice
- Ollama
- Role
- Serves the model locally
- Choice
- Qwen3.5-27B
- Role
- Code reasoning, fix generation
- Choice
- Nvidia RTX PRO 6000 (Blackwell)
- Role
- Pro-grade card, 96 GB VRAM, ideal for a 27B model locally
- Choice
- Trivy / Snyk CLI
- Role
- CVE detection
- Choice
- Playwright + existing unit/integration suite
- Role
- Feedback loop — no PR without green tests
- Choice
- GitHub / GitLab API
- Role
- PR creation
The key benefit: total sovereignty
No outbound calls. No third-party LLM API. No data exfiltration. Code never leaves its home environment. End-to-end auditable by a CISO, ready to be replicated inside regulated infrastructures.
The harness architecture
| Pillar | Implementation |
|---|---|
| Context | Full repo + PR history + internal docs |
| Tools | Git, Trivy/Snyk, test runner, GitHub/GitLab API |
| Workflow | Detect CVE → localize code → draft fix → run tests → create PR |
| Validation | Unit + integration + E2E Playwright tests. No PR if tests are red |
| Guardrails | Dedicated branch. Never auto-merge. Mandatory human review |
| Observability | Logs on every run, PR acceptance rate, average time-to-fix |
Harness architecture mapped to the 6 pillars
- Implementation
- Full repo + PR history + internal docs
- Implementation
- Git, Trivy/Snyk, test runner, GitHub/GitLab API
- Implementation
- Detect CVE → localize code → draft fix → run tests → create PR
- Implementation
- Unit + integration + E2E Playwright tests. No PR if tests are red
- Implementation
- Dedicated branch. Never auto-merge. Mandatory human review
- Implementation
- Logs on every run, PR acceptance rate, average time-to-fix

What worked
- ✓Trivial fixes (patch version bumps, minor updates, well-documented CVEs): excellent results. Qwen3.5-27B running locally handles them perfectly.
- ✓Non-regression tests generated by the agent on the functions touched by the update. Not perfect, but often better than "nothing".
- ✓Clear PR documentation (CVE referenced, scope of the fix, tests added). Massive time savings for the reviewer.
- ✓Sovereignty: zero data sent to a third party. Stack fully auditable by a CISO.
What didn't work as well
- ⚠Breaking changes between majors: in pure autonomy, the agent tries to blindly fix a v2 → v3 and breaks the build — it doesn't grasp the business impact. Our answer wasn't to ban majors, but to specialize the harness: dedicated Autogit actions with enriched prompts (official migration guide scraped, changelog, API diff, impact checklist) steer the agent through those cases. Slower workflow, mandatory human review. "Light autonomous" mode stays reserved for minors and patches. That's exactly the article's message: we don't change the model, we adapt the harness to the risk level.
- ⚠Complex transitive dependencies (deeply nested NPM peer dependencies): the agent gets lost, can't trace the chain when an update breaks another dependency indirectly.
- ⚠Autonomous generation of Playwright E2E tests: nope. The agent can run the existing Playwright suite and use it as a validation signal, but it can't write a relevant E2E scenario on its own (stable selectors, data-testid, business assertions). Playwright tests stay written in co-pilot mode (developer + agent, Cursor / Copilot style), not autonomously. Exactly why we audit and level up the test harness before shipping the autonomous agent.
- ⚠Qwen3.5-27B vs a frontier model: on pure reasoning, you feel the gap with GPT-5 or Claude Opus. But for this specific use case (repetitive patterns, localized code, tests as a signal), Qwen3.5-27B is more than enough. And that's the price of sovereignty.
The 6 lessons we learned
No automated feedback loop, no production
After our early tests, a brutal realization: an agent that fixes code but can't verify it works is an agent that plants time bombs. The only variable that makes the difference between a cute pilot and a real deployment is the quality of the feedback loop — unit, integration, and above all E2E automation (Playwright, Cypress). Our rule now: before even talking about an AI agent, we audit and level up the test harness.
Scope is everything
We initially wanted the agent to cover "all updates". Failure. By narrowing to minor / patch and documented CVEs, we hit a high acceptance rate and a trust relationship with reviewers.
The test is the real value
An agent that proposes a fix without testing it is Dependabot. An agent that runs the tests before pushing the PR is a real teammate. The validation loop is what separates a tool from an agent.
Never auto-merge
Technically, we could auto-merge PRs whose tests are green. We don't. A human always validates. It's a governance choice, not a technical limit.
Sovereignty = architecture, not marketing
When a regulatory context says "data cannot leave", you don't solve that with an NDA. You solve it with architecture. Ollama + Qwen + Autogit open-source on a local GPU = a concrete, auditable, defensible answer.
Observability is not optional
Every week we measure: PRs proposed, PRs merged, PRs rejected, rejection reasons. Without that, there's no way to tell whether the agent is improving or whether the harness is drifting.
7. What does a harness stack look like in 2026?
The tooling landscape moves fast. Here's what we recommend today, based on the need:
| Need | Tools |
|---|---|
| Agent orchestration | LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Autogit |
| Tool protocol | MCP (Model Context Protocol) — the emerging standard |
| Local model server | Ollama, vLLM, TGI |
| Open-source models | Qwen3.5, Llama 3.3, Mistral, DeepSeek |
| Feedback loop (tests) | Playwright, Cypress, Vitest, Jest, Pytest, JUnit |
| Evals | Braintrust, Langfuse, Promptfoo |
| Observability | LangSmith, Arize, Helicone, Langfuse |
| Guardrails | Guardrails.ai, NeMo Guardrails |
| RAG | LlamaIndex, Haystack, Vectara |
Recommended harness stack in 2026
- Tools
- LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Autogit
- Tools
- MCP (Model Context Protocol) — the emerging standard
- Tools
- Ollama, vLLM, TGI
- Tools
- Qwen3.5, Llama 3.3, Mistral, DeepSeek
- Tools
- Playwright, Cypress, Vitest, Jest, Pytest, JUnit
- Tools
- Braintrust, Langfuse, Promptfoo
- Tools
- LangSmith, Arize, Helicone, Langfuse
- Tools
- Guardrails.ai, NeMo Guardrails
- Tools
- LlamaIndex, Haystack, Vectara
8. Key takeaways
- →The model is no longer the bottleneck. The harness is.
- →Building an AI agent in production = 20% prompt, 80% engineering around it.
- →Failed AI projects are almost always harness problems, not model problems.
- →Without an automated feedback loop (unit tests + E2E Playwright), an AI agent has no business in production.
- →Sovereignty isn't a marketing option. It's an architecture — and it's attainable with an open-source stack.
- →Golden rule: what's critical must be deterministic, what's creative can be non-deterministic. The harness draws the line.
Want to explore an AI agent in your delivery?
At VOID, we always start with the feedback loop. Three entry points, depending on your maturity:
Testability audit
Assessment of coverage and quality of your automated tests.
Test harness uplift
Installing the Playwright / unit / integration suite that will enable the AI agent.
AI agent framing
Once tests are in place, we design the agentic scaffolding — sovereign if needed.
Found this article useful?
Share it with your network — especially teams about to ship an AI agent to production.