Harness Engineering:
The Art of Making AI Agents Reliable in Production

What we learned deploying sovereign AI agents at VOID, and why it all starts with the feedback loop.

AI & Delivery
Read: 12 min
Feb 11, 2026
The model is no longer the bottleneck. The harness is.

TL;DR

Ever since LLMs hit the mainstream, attention has focused on the model: GPT-5, Claude, Gemini, Qwen. But in production, the real lever for reliability is no longer the model — it's everything around it. Context, tools, workflow, validation, guardrails, observability. We call that the harness, and designing it is a discipline of its own: Harness Engineering.

At VOID, we built our first harness as a 100% on-premise stack to automate security updates, tested on open-source code and our own internal projects. Stack: Autogit (open source), Ollama serving Qwen3.5-27B, all running on an Nvidia RTX PRO 6000. This hands-on experience reminded us of a simple truth: without an automated feedback loop, an AI agent has no business in production.

1. The "we plugged in GPT, it'll work" trap

Everyone has seen this scenario. A company wires an LLM into its information system. Early demos are thrilling. Ship to production. Suddenly, the agent starts hallucinating references, ignoring critical instructions ("never touch the prod database"), or doing the exact opposite of what was asked.

Usual diagnosis: "the model isn't good enough, let's wait for the next generation." Wrong. 90% of the time, it wasn't a model problem. It was a harness problem.

The most common mistake

Believing that an LLM, shipped alone, can honor commitments in production. A raw model has no persistent memory, no guardrails, no vetted tools, no recovery loop. It improvises. In production, improvising is not an option.

2. What is a harness in AI?

The word harness — literally a riding or climbing harness — refers in AI engineering to everything around the model that makes it useful and reliable. The LLM is the engine. The harness is the chassis, the steering, the brakes, the dashboard.

Components of a harness

  • Context: docs, specs, history, RAG
  • Tools: APIs, CLIs, MCP, tool use
  • Memory: short and long term
  • Workflow: orchestration of steps
  • Validation: tests, self-critique, evals
  • Guardrails: steering, permissions
  • LLM: the engine at the center

Several consumer products we use every day are actually harnesses wrapped around one or more LLMs:

  • Cursor / Claude Code: a harness for coding
  • Devin (Cognition): a harness for autonomous development
  • Perplexity: a harness for web search
  • Harvey: a harness for the legal industry
  • GitHub Copilot: a harness for IDE autocomplete

So the question we ask in Harness Engineering isn't "which model should we pick?", but "what environment should we design around the model so it handles the task reliably, traceably, and within governance?"

3. Clarifying the vocabulary

The field moves fast and several terms circulate, often conflated. To frame the discussion, here are the useful distinctions:

Harness Engineering glossary

  • Harness / AI Harness: the software infrastructure around the LLM (tools, memory, validation, workflow). Popularized by Anthropic, Cursor, Devin.
  • Context Engineering: the discipline of designing what we feed the agent (context, docs, specs). Popularized by Andrej Karpathy (2024).
  • Agent / Agentic Engineering: designing the full workflow of the agent (plan, execute, verify). Popularized by the open-source community.
  • Prompt Engineering: writing the right prompts (circa 2023). Still useful, but no longer sufficient on its own.
  • Steering: guiding and constraining the model so it follows directives, even when it tends to forget them. Popularized by Anthropic and OpenAI research.
  • Scaffolding: the "rails" placed around the agent to channel its behavior. Popularized by academic papers.
  • Skills / Tool use: the capabilities given to the agent (exposed functions, MCP, tools). Popularized by OpenAI, Anthropic, and MCP.

In what follows, we use "Harness Engineering" as the umbrella term. All others are sub-disciplines or techniques that attach to it.

4. The 6 pillars of a good harness

Building a harness isn't about stacking tools. It's about addressing six distinct concerns, and none of them are optional in production.

Pillar 1: Context Engineering — giving the agent what it needs

Business documentation, product specs, decision history, properly indexed RAG, internal knowledge bases. An agent without relevant context produces generic output — at best useless, at worst wrong.

Principle: "garbage in, garbage out" — to the power of 10.
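To make the pillar concrete, here is a minimal sketch of context assembly: sources are ranked by relevance and packed into a token budget, so the most useful material survives truncation. The chunk shape, the relevance scores, and the 4-characters-per-token heuristic are illustrative assumptions, not a description of any specific RAG stack.

```typescript
// Minimal context-assembly sketch (illustrative, not production code).
// Chunks are ranked by relevance, then packed until the token budget is
// exhausted, so the most relevant context always makes it into the prompt.

interface ContextChunk {
  source: string;    // e.g. "spec.md", "adr-012.md" (hypothetical names)
  text: string;
  relevance: number; // 0..1, e.g. a vector-store similarity score
}

function buildContext(chunks: ContextChunk[], tokenBudget: number): string {
  const approxTokens = (s: string) => Math.ceil(s.length / 4); // rough heuristic
  const packed: string[] = [];
  let used = 0;
  // Highest-relevance chunks first; skip anything that would bust the budget.
  for (const c of [...chunks].sort((a, b) => b.relevance - a.relevance)) {
    const cost = approxTokens(c.text);
    if (used + cost > tokenBudget) continue;
    packed.push(`[${c.source}]\n${c.text}`);
    used += cost;
  }
  return packed.join("\n\n");
}
```

The point of the sketch: context is a budgeted, ranked artifact that the harness builds deliberately, not a dump of whatever documents happen to be lying around.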

Pillar 2: Skills & Tools — giving the right tools

Exposed functions, internal APIs, CLI commands, scoped read/write access to a specific perimeter. The Model Context Protocol (MCP) is emerging as the standard for connecting agents to tools.

Principle: the agent can only do what we explicitly allow it to do.
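A minimal sketch of that principle: tools are registered explicitly, and any call outside the allowlist is rejected before it can touch the outside world. The tool names and scopes below are hypothetical.

```typescript
// Explicit tool allowlist sketch (names and scopes are illustrative).
// The agent can only invoke tools registered up front; anything else
// fails hard before reaching the outside world.

type Tool = {
  name: string;
  scope: "read" | "write";
  run: (arg: string) => string;
};

class ToolRegistry {
  private tools = new Map<string, Tool>();

  register(tool: Tool): void {
    this.tools.set(tool.name, tool);
  }

  invoke(name: string, arg: string): string {
    const tool = this.tools.get(name);
    // Deterministic guardrail: an unregistered tool simply does not exist.
    if (!tool) throw new Error(`Tool "${name}" is not allowed`);
    return tool.run(arg);
  }
}

const registry = new ToolRegistry();
registry.register({
  name: "read_file",
  scope: "read",
  run: (path) => `contents of ${path}`,
});
```

In a real harness the same idea extends to scoped credentials and per-tool permissions; the registry is just the smallest shape of "the agent can only do what we allow".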

Pillar 3: Workflow — orchestrating the steps

Planning → execution → verification → recovery. Patterns that work: ReAct, Chain-of-Thought, Tree-of-Thoughts, agentic loops. For complex tasks, we decompose into multiple agents (planner, executor, critic).

Principle: an agent without a workflow drifts.
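The plan → execute → verify → recover pattern can be sketched as a loop. The `execute` and `verify` callbacks stand in for real model and test-runner calls; this illustrates the pattern, not any specific framework's API.

```typescript
// Sketch of an agentic loop: execute (non-deterministic), verify
// (deterministic), and feed failures back as recovery context.

type StepResult = { ok: boolean; feedback: string };

function runAgentLoop(
  task: string,
  execute: (task: string, feedback: string) => string, // stands in for an LLM call
  verify: (output: string) => StepResult,              // stands in for tests/evals
  maxAttempts = 3,
): string | null {
  let feedback = "";
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const output = execute(task, feedback); // non-deterministic step
    const check = verify(output);           // deterministic validation gate
    if (check.ok) return output;            // only verified work ships
    feedback = check.feedback;              // recovery: errors become context
  }
  return null; // out of attempts: escalate to a human, never ship unverified
}
```

The `null` branch matters as much as the happy path: when the loop exhausts its attempts, the harness escalates instead of letting the agent drift.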

Pillar 4: Validation loops — verify before commit

Self-critique, automated evals, human-in-the-loop on critical actions. This is where automated tests (unit, integration, E2E Playwright) truly earn their keep.

Principle: an agent that checks its own work beats a bigger agent.

Pillar 5: Steering & Guardrails — holding the line

This is the answer to the frustrating question: "Why is the agent doing the opposite of what I asked?" Techniques: reinforced system prompts, constitutional AI, strict output parsing, business rules outside the LLM.

Anti-pattern: putting everything in the prompt and hoping it holds. Critical constraints must live in deterministic code.

Pillar 6: Observability — knowing what the agent is doing

Traces (LangSmith, Langfuse, Arize), tool-call logs, metrics: success rate, hallucination rate, cost per task, execution time. Without this, there's no debugging, and no improvement.

Principle: you can't manage what you don't measure.
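As a sketch of the principle, a per-run trace can be as simple as a list of events from which the metrics are derived afterwards. The field names here are illustrative, not the schema of any of the tools cited above.

```typescript
// Per-run tracing sketch: record every tool call, derive metrics later.
// Field names are illustrative; real tracing tools add spans, costs, etc.

interface TraceEvent {
  tool: string;
  ok: boolean;
  durationMs: number;
}

class RunTrace {
  readonly events: TraceEvent[] = [];

  record(event: TraceEvent): void {
    this.events.push(event);
  }

  // Fraction of tool calls that succeeded in this run.
  successRate(): number {
    if (this.events.length === 0) return 0;
    return this.events.filter((e) => e.ok).length / this.events.length;
  }

  // Total wall-clock time spent in tool calls.
  totalDurationMs(): number {
    return this.events.reduce((sum, e) => sum + e.durationMs, 0);
  }
}

const trace = new RunTrace();
trace.record({ tool: "scan_cve", ok: true, durationMs: 1200 });
trace.record({ tool: "run_tests", ok: false, durationMs: 8000 });
```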

Deterministic vs non-deterministic

The line between what we delegate to the LLM and what must stay in classical code.

Deterministic

Always the same output

Same input, same result. Reproducible, auditable, testable.

// Classical function
calculateVAT(100) → 120
calculateVAT(100) → 120
calculateVAT(100) → 120
Non-deterministic

Variable output

Same input, result may vary. Creative, great for language, but unpredictable.

// LLM
llm("App name?") → "FlowManager"
llm("App name?") → "TaskZen"
llm("App name?") → "OrgaPro"

Example: a banking refund rule

Rule: "No refund > MAD 5,000 without human escalation."

❌ Wrong approach

Rule in the prompt:

"Never approve > MAD 5,000 without human escalation."

→ One day, the LLM approves MAD 8,000. Rare, but it will happen. And 1% of cases = disaster.

✓ Right approach

Rule in the code:

if (amount > 5000) {
  escalateToHuman();
}

Impossible to bypass. Code never drifts.

The golden rule of the Harness

Critical → deterministic. Creative → non-deterministic. The harness draws the line.
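The refund rule can be made runnable as a whole: the model only drafts the customer-facing message (creative), while the threshold check lives in plain code (critical). `draftMessage` is a stand-in for an LLM call; the function and type names are illustrative.

```typescript
// Runnable version of the refund rule: the escalation threshold is enforced
// in deterministic code, whatever the model outputs. `draftMessage` stands
// in for an LLM call and only handles the creative part.

const ESCALATION_THRESHOLD_MAD = 5000;

type RefundDecision =
  | { kind: "approved"; message: string }
  | { kind: "escalated"; message: string };

function handleRefund(
  amountMad: number,
  draftMessage: (amount: number) => string,
): RefundDecision {
  if (amountMad > ESCALATION_THRESHOLD_MAD) {
    // Critical rule: impossible to bypass, no matter what the prompt says.
    return { kind: "escalated", message: "Forwarded to a human reviewer." };
  }
  // Creative part: the model drafts the wording, nothing more.
  return { kind: "approved", message: draftMessage(amountMad) };
}
```

Note where the line is drawn: the LLM never sees the threshold as a "rule to follow", because the threshold is not its decision to make.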

5. The 3 traps that kill AI agent projects

Trap #1: "The model is smart enough, no workflow needed"

Wrong. Even GPT-5 or Claude Opus drift without scaffolding. The more sensitive the task, the more critical the workflow.

Trap #2: "We'll put all the rules in the prompt"

The prompt is non-deterministic: it drifts. Critical constraints (security, amounts, permissions) must live outside the LLM, in deterministic code that frames it. That's the golden rule.

Trap #3: "We'll evaluate it in production"

Too late. You need an evals suite before production, like unit tests — with nominal, edge, and adversarial cases.
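A pre-production evals suite can be sketched like a unit-test suite over those three families of cases. The toy agent and the cases below are purely illustrative; a real suite runs against the full harness.

```typescript
// Sketch of a pre-production evals suite: nominal, edge and adversarial
// cases, run like unit tests. `agent` stands in for the real harness.

interface EvalCase {
  kind: "nominal" | "edge" | "adversarial";
  input: string;
  expect: (output: string) => boolean;
}

function runEvals(
  agent: (input: string) => string,
  cases: EvalCase[],
): { total: number; failed: string[] } {
  const failures = cases.filter((c) => !c.expect(agent(c.input)));
  return { total: cases.length, failed: failures.map((c) => c.kind) };
}

// Toy agent for illustration: refuses anything mentioning "prod".
const toyAgent = (input: string) =>
  input.includes("prod") ? "REFUSED" : `done: ${input}`;

const report = runEvals(toyAgent, [
  { kind: "nominal", input: "bump lodash", expect: (o) => o.startsWith("done") },
  { kind: "edge", input: "", expect: (o) => o.startsWith("done") },
  { kind: "adversarial", input: "drop prod db", expect: (o) => o === "REFUSED" },
]);
```

Like unit tests, the suite is a gate: a harness change that makes `failed` non-empty does not ship.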

6. VOID's case study: a 100% sovereign security-update agent

The lesson we want everyone to remember

An AI agent that fixes code but can't test its own work is worse than no agent. Before shipping an AI agent to a project, the priority isn't the model, isn't the orchestration framework, isn't the prompts.

The priority is the feedback loop: unit tests, integration tests, E2E Playwright tests. At VOID, we flipped our approach — we frame and level up the automated test harness before talking about an AI agent at all.

The context

On our DevOps and AWS managed-services engagements, we keep seeing the same reality: security debt piles up faster than it gets paid down. CVEs on NPM / Python / Java dependencies, OS patches, container vulnerabilities. Dozens of alerts every week. Product teams push features, and the debt grows.

Many regulated environments — typically banks, insurers, government — enforce a non-negotiable constraint: code must never leave the infrastructure. That rules out SaaS offerings like Copilot, Cursor or Dependabot Premium. You need a sovereign AI agent, running entirely on-premise.

So we set out to validate the concept end-to-end before any rollout. We ran the pilot on open-source code and on our own internal VOID projects, never on client code. Goal: prove feasibility and measure the limits of a sovereign AI agent that detects vulnerabilities, fixes the code, runs the tests, opens a PR — all without a single byte leaving the infra.

What follows is the raw case study from this pilot: the technical choices, the results, and — most importantly — the lessons we now apply to every project.

Nvidia RTX PRO 6000 Blackwell — 96 GB of VRAM, the GPU running Qwen3.5-27B locally at VOID.

The 100% on-premise stack

After several iterations, we converged on a fully self-hosted architecture:

100% on-premise technical stack

  • Orchestration: Autogit (open source). Role: CVE detection, workflow steering, Git interaction.
  • Inference server: Ollama. Role: serves the model locally.
  • Model: Qwen3.5-27B. Role: code reasoning, fix generation.
  • Hardware: Nvidia RTX PRO 6000 (Blackwell). Role: pro-grade card with 96 GB of VRAM, ideal for running a 27B model locally.
  • Vulnerability scan: Trivy / Snyk CLI. Role: CVE detection.
  • E2E tests: Playwright + existing unit/integration suite. Role: feedback loop (no PR without green tests).
  • Git platform: GitHub / GitLab API. Role: PR creation.

The key benefit: total sovereignty

No outbound calls. No third-party LLM API. No data exfiltration. Code never leaves its home environment. End-to-end auditable by a CISO, ready to be replicated inside regulated infrastructures.

The harness architecture

Harness architecture mapped to the 6 pillars

  • Context: full repo + PR history + internal docs
  • Tools: Git, Trivy/Snyk, test runner, GitHub/GitLab API
  • Workflow: detect CVE → localize code → draft fix → run tests → create PR
  • Validation: unit + integration + E2E Playwright tests; no PR if tests are red
  • Guardrails: dedicated branch, never auto-merge, mandatory human review
  • Observability: logs on every run, PR acceptance rate, average time-to-fix
Observability is not optional: every run is traced.
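The workflow above can be sketched as a pipeline with the validation gate built in. Every callback is a stub standing in for the real scanner, model, test-runner, and Git-API integrations; none of this is Autogit's actual API.

```typescript
// Sketch of the CVE-fix pipeline: scan, draft a fix, gate on tests, open a
// PR. All callbacks are stubs for the real Trivy / LLM / test / Git steps.

interface Cve {
  id: string;
  pkg: string;
}

function runSecurityUpdate(
  scan: () => Cve[],
  draftFix: (cve: Cve) => string,
  testsPass: (patch: string) => boolean,
  openPr: (cve: Cve, patch: string) => void,
): { fixed: string[]; skipped: string[] } {
  const fixed: string[] = [];
  const skipped: string[] = [];
  for (const cve of scan()) {
    const patch = draftFix(cve);  // non-deterministic step (the model)
    if (testsPass(patch)) {       // the feedback loop: the hard gate
      openPr(cve, patch);         // PR only; never auto-merged, human reviews
      fixed.push(cve.id);
    } else {
      skipped.push(cve.id);       // red tests → no PR, ever
    }
  }
  return { fixed, skipped };
}
```

The structure encodes two of the guardrails from the table directly: a red test suite cannot produce a PR, and a green one can only produce a PR, never a merge.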

What worked

  • Trivial fixes (patch version bumps, minor updates, well-documented CVEs): excellent results. Qwen3.5-27B running locally handles them perfectly.
  • Non-regression tests generated by the agent on the functions touched by the update. Not perfect, but often better than "nothing".
  • Clear PR documentation (CVE referenced, scope of the fix, tests added). Massive time savings for the reviewer.
  • Sovereignty: zero data sent to a third party. Stack fully auditable by a CISO.

What didn't work as well

  • Breaking changes between majors: in pure autonomy, the agent tries to blindly fix a v2 → v3 and breaks the build — it doesn't grasp the business impact. Our answer wasn't to ban majors, but to specialize the harness: dedicated Autogit actions with enriched prompts (official migration guide scraped, changelog, API diff, impact checklist) steer the agent through those cases. Slower workflow, mandatory human review. "Light autonomous" mode stays reserved for minors and patches. That's exactly the article's message: we don't change the model, we adapt the harness to the risk level.
  • Complex transitive dependencies (deeply nested NPM peer dependencies): the agent gets lost, can't trace the chain when an update breaks another dependency indirectly.
  • Autonomous generation of Playwright E2E tests: nope. The agent can run the existing Playwright suite and use it as a validation signal, but it can't write a relevant E2E scenario on its own (stable selectors, data-testid, business assertions). Playwright tests stay written in co-pilot mode (developer + agent, Cursor / Copilot style), not autonomously. Exactly why we audit and level up the test harness before shipping the autonomous agent.
  • Qwen3.5-27B vs a frontier model: on pure reasoning, you feel the gap with GPT-5 or Claude Opus. But for this specific use case (repetitive patterns, localized code, tests as a signal), Qwen3.5-27B is more than enough. And that's the price of sovereignty.

The 6 lessons we learned

Lesson 1: No automated feedback loop, no production

After our early tests, a brutal realization: an agent that fixes code but can't verify it works is an agent that plants time bombs. The only variable that makes the difference between a cute pilot and a real deployment is the quality of the feedback loop — unit, integration, and above all E2E automation (Playwright, Cypress). Our rule now: before even talking about an AI agent, we audit and level up the test harness.

Lesson 2: Scope is everything

We initially wanted the agent to cover "all updates". Failure. By narrowing to minor / patch and documented CVEs, we hit a high acceptance rate and a trust relationship with reviewers.

Lesson 3: The test is the real value

An agent that proposes a fix without testing it is Dependabot. An agent that runs the tests before pushing the PR is a real teammate. The validation loop is what separates a tool from an agent.

Lesson 4: Never auto-merge

Technically, we could auto-merge PRs whose tests are green. We don't. A human always validates. It's a governance choice, not a technical limit.

Lesson 5: Sovereignty = architecture, not marketing

When a regulatory context says "data cannot leave", you don't solve that with an NDA. You solve it with architecture. Ollama + Qwen + Autogit open-source on a local GPU = a concrete, auditable, defensible answer.

Lesson 6: Observability is not optional

Every week we measure: PRs proposed, PRs merged, PRs rejected, rejection reasons. Without that, there's no way to tell whether the agent is improving or whether the harness is drifting.

7. What does a harness stack look like in 2026?

The tooling landscape moves fast. Here's what we recommend today, based on the need:

Recommended harness stack in 2026

  • Agent orchestration: LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Autogit
  • Tool protocol: MCP (Model Context Protocol), the emerging standard
  • Local model server: Ollama, vLLM, TGI
  • Open-source models: Qwen3.5, Llama 3.3, Mistral, DeepSeek
  • Feedback loop (tests): Playwright, Cypress, Vitest, Jest, Pytest, JUnit
  • Evals: Braintrust, Langfuse, Promptfoo
  • Observability: LangSmith, Arize, Helicone, Langfuse
  • Guardrails: Guardrails.ai, NeMo Guardrails
  • RAG: LlamaIndex, Haystack, Vectara

8. Key takeaways

  • The model is no longer the bottleneck. The harness is.
  • Building an AI agent in production = 20% prompt, 80% engineering around it.
  • Failed AI projects are almost always harness problems, not model problems.
  • Without an automated feedback loop (unit tests + E2E Playwright), an AI agent has no business in production.
  • Sovereignty isn't a marketing option. It's an architecture — and it's attainable with an open-source stack.
  • Golden rule: what's critical must be deterministic, what's creative can be non-deterministic. The harness draws the line.

Want to explore an AI agent in your delivery?

At VOID, we always start with the feedback loop. Three entry points, depending on your maturity:

Step 1: Testability audit

Assessment of coverage and quality of your automated tests.

Step 2: Test harness uplift

Installing the Playwright / unit / integration suite that will enable the AI agent.

Step 3: AI agent framing

Once tests are in place, we design the agentic scaffolding — sovereign if needed.

Found this article useful?

Share it with your network — especially teams about to ship an AI agent to production.
