Large Language Models (LLMs), however powerful, have intrinsic constraints that prevent their direct use in complex, long-running tasks:
Limited context window and lack of persistent memory. Each session starts from scratch: an LLM does not remember what happened in the previous session. For process automation tasks requiring hours or days of work, this kind of amnesia is fatal.
Inability to interact with the surrounding virtual or physical environment. LLMs produce exclusively text. They cannot execute code, query databases, browse the web, or interact with external APIs without an intermediate layer that translates their “textual intentions” into real actions.
Lack of planning and self-verification. Without external structure, LLMs tend to attempt solving the assigned task in a single pass (“one-shotting”), losing track halfway through complex tasks. Moreover, they tend to declare the task completed without any real verification.
Slowness and variability. Unlike traditional software that operates in milliseconds, LLMs take several seconds to generate individual responses, producing output of variable quality.
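The second limitation above, the need for an intermediate layer that turns "textual intentions" into real actions, can be sketched in a few lines. This is an illustrative toy only: the JSON format, the `dispatch` function, and the `query_database` stub are invented for the example, not taken from any specific product.

```python
import json

def query_database(sql: str) -> str:
    # Stub standing in for a real database client.
    return f"3 rows returned for: {sql}"

# Registry mapping tool names the model may emit to real functions.
TOOLS = {"query_database": query_database}

def dispatch(llm_output: str) -> str:
    """Translate the model's textual tool call into an executed action."""
    call = json.loads(llm_output)          # the LLM only produced text...
    fn = TOOLS[call["tool"]]               # ...the harness maps it to code
    return fn(**call["args"])              # and performs the real action

result = dispatch('{"tool": "query_database", "args": {"sql": "SELECT 1"}}')
print(result)  # -> 3 rows returned for: SELECT 1
```

Everything outside the model call itself, parsing, routing, execution, and feeding results back, is the harness's job, which is the subject of the next section.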
The enormous investments made to scale model sizes and training datasets have not solved these problems, and a growing number of researchers now argue that the technology underlying LLMs has structural, intrinsic limitations.
Building tools and solutions that hold up in the real production environments where businesses operate requires an architecturally different approach.
Products such as Claude Cowork and OpenClaw are built on this new approach, whose effectiveness has been evident enough to move stock markets, alarmed by the potential obsolescence of products based on the traditional SaaS paradigm.
The term harness is used to describe this new architectural model.
What is an agent harness
An agent harness is the software infrastructure that wraps an LLM and manages everything that is not the model itself.
Philipp Schmid (Google DeepMind) proposed an effective analogy: if the LLM is the CPU of our system, then the context window is the available RAM, the agent is the application, and the harness is the operating system: the software that manages resources, provides standard interfaces to drivers and peripherals, and controls the processing lifecycle.
Just as the real power of a PC or server does not depend on CPU speed alone but on the overall architecture, including the quality of the operating system, so the capability and effectiveness of an agentic tool depends not only on the LLM but also, and above all, on the quality of the harness.
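Schmid's operating-system analogy can be made concrete with a toy control loop. This is a minimal sketch under stated assumptions: `call_model` is a hypothetical stand-in for any LLM API, and the limits are arbitrary; the point is only to show which responsibilities sit outside the model.

```python
from collections import deque

MAX_CONTEXT = 4  # toy "RAM" limit, expressed in messages

def call_model(messages: list) -> dict:
    # Hypothetical model call; a real harness would hit an LLM API here.
    return {"role": "assistant", "content": "DONE"}

def run(task: str, max_steps: int = 10) -> str:
    # RAM management: a bounded context that evicts the oldest messages.
    context = deque(maxlen=MAX_CONTEXT)
    context.append({"role": "user", "content": task})
    # Lifecycle control: the harness, not the model, decides when to stop.
    for _ in range(max_steps):
        reply = call_model(list(context))
        context.append(reply)
        if "DONE" in reply["content"]:   # termination criterion
            return reply["content"]
    return "stopped: step budget exhausted"
```

A production harness would add tool dispatch, persistence, and verification on top of this skeleton, but the division of labor (model computes, harness governs) is the same.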
Ethan Mollick, author of the book “Co-Intelligence: Living and Working with AI”, in his guide updated in February 2026, introduced the distinction between model, app, and harness as the three fundamental dimensions for evaluating an AI system: “the same model can behave very differently depending on the harness in which it operates”.
Our own conviction, drawn from field experience in AI-driven automation of our clients' business processes, is that of the three dimensions Mollick cites, the harness is the most important and decisive.
Evidence in favor: the harness matters more than the model
Several lines of empirical evidence and industry observation converge on this conclusion.
Anthropic, for example, documented how Claude Opus 4.5, despite being a frontier model, systematically failed to build complex web applications without an adequate harness.
With a structured harness (initializer agent, progress files, incremental work constraints, end-to-end verification via browser automation), the same model became capable of maintaining focus and software coherence across development processes composed of dozens of sessions, often run in parallel, each within its own context window.
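The "progress file" pattern mentioned above can be sketched as follows. File names, the JSON format, and the `work`/`verify` callables are assumptions invented for this illustration, not Anthropic's actual implementation: each session reads prior state, does one bounded increment of work, verifies it, and writes state back, so a later session with a fresh context window can continue coherently.

```python
import json
import pathlib

PROGRESS = pathlib.Path("progress.json")  # shared state across sessions

def load_progress() -> dict:
    if PROGRESS.exists():
        return json.loads(PROGRESS.read_text())
    # First session: the initializer step seeds the plan.
    return {"done": [], "next": "initialize project"}

def run_session(work, verify) -> dict:
    state = load_progress()
    task = state["next"]
    artifact = work(task)        # one bounded increment of work
    if verify(artifact):         # end-to-end check before recording progress
        state["done"].append(task)
        state["next"] = f"continue after: {task}"
    PROGRESS.write_text(json.dumps(state))  # persist for the next session
    return state
```

Because progress lives on disk rather than in the context window, the model's per-session amnesia stops being fatal: each session only needs to hold its own increment in "RAM".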
As a further example, the paper “General Modular Harness for LLM Agents in Multi-Turn Gaming Environments” (ICML 2025) demonstrated that a single LLM (GPT-4 class), equipped with a modular harness composed of perception, memory, and reasoning modules, improved the win rate across all tested games compared to the same model without a harness.
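The module names below follow the paper's description (perception, memory, reasoning), but the implementation is a toy invented for illustration: perception summarizes the raw environment state, memory accumulates history across turns, and reasoning combines both into the prompt that would be sent to the model.

```python
class Perception:
    def observe(self, raw_state: dict) -> str:
        # Compress the raw game state into a textual observation.
        return f"board={raw_state['board']} turn={raw_state['turn']}"

class Memory:
    def __init__(self):
        self.events: list[str] = []
    def remember(self, event: str) -> None:
        self.events.append(event)
    def recall(self, k: int = 3) -> str:
        # Surface only the most recent events to fit the context window.
        return " | ".join(self.events[-k:])

class Reasoning:
    def plan(self, observation: str, history: str) -> str:
        # In a real harness this prompt would go to the LLM.
        return f"Given {observation} and past [{history}], choose a move."

def step(harness, raw_state: dict) -> str:
    perception, memory, reasoning = harness
    obs = perception.observe(raw_state)
    prompt = reasoning.plan(obs, memory.recall())
    memory.remember(obs)     # this turn becomes history for the next one
    return prompt
```

The same model sees richer, better-structured prompts on every turn, which is plausibly why the paper measured higher win rates with the harness than without it.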
Because the model was kept exactly the same in both tests, it is evident that the difference between an AI solution with production-ready performance and a prototype that is little more than a proof of concept hinges largely on the ability to design and implement good harnesses.
Harness engineering has replaced prompt engineering in best practices for implementing enterprise-level AI solutions.
Model choice remains relevant for advanced tasks
However, as Ethan Mollick himself emphasized, for complex work it is still essential to select the appropriate model: many open-source LLMs are optimized for chat speed, computer vision, or machine translation rather than for agentic tasks, which require advanced reasoning capabilities and tool-use planning.
Co-optimization is the winning strategy
Case studies such as Cognition's experience with its SWE-1.5 LLM show that maximum performance is achieved by co-optimizing model and harness simultaneously: the model is trained specifically for the harness, and the harness is refined around the model's weaknesses. The point is not to choose one or the other, but to design them as an integrated system.
Looking at recent market releases, this is clearly the emerging trend.
Harness engineering as an emerging discipline
Harness engineering is establishing itself as an autonomous discipline, distinct from both prompt engineering and the broader context engineering.
Bassel Haidar, Vice President leading AI initiatives for federal agencies at Booz Allen Hamilton, predicted that "by the end of 2026, agent reliability engineering will become a standard discipline, just as DevOps did after the cloud".
The harness is the control plane of artificial cognition: it does not merely “run the agent”, but governs the conditions under which cognition occurs and the criteria by which it is validated.