Key Findings

Based on our experience, three critical themes define the current state of AI agents:

1. Production Is Now the Norm, Not the Exception

The majority of organizations have moved beyond experimentation. Agents are running in production environments across industries, with large enterprises leading adoption. Smaller companies are catching up quickly, though enterprises benefit from established platform teams and infrastructure investments that accelerate the transition from pilot to production.

2. Quality and Speed Are the Real Bottlenecks, Not Cost

While early agent development was often constrained by cost concerns, that dynamic has changed. Falling model prices and improved efficiency have shifted focus to two harder problems: quality, meaning agents that behave reliably and produce correct results, and speed, meaning latency low enough to fit real user workflows.

For larger enterprises, security also emerges as a critical concern, particularly around data handling and access controls.

3. Observability Is Essential; Evaluation Is Catching Up

The ability to trace agent reasoning and tool calls has become non-negotiable. Nearly all production teams have implemented observability. Without it, debugging failures and building stakeholder trust are nearly impossible.
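To make the idea concrete, here is a minimal tracing sketch, assuming a simple in-house agent loop rather than any particular observability vendor or framework. It records each tool call with its arguments, result, and latency so a failed run can be reconstructed step by step; the search_docs tool is purely hypothetical.

```python
# Minimal tool-call tracing sketch (no specific observability vendor assumed).
# Each tool invocation is recorded with arguments, result, and latency so a
# failed agent run can be replayed step by step.
import json
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Trace:
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list[dict] = field(default_factory=list)

    def record(self, name: str, args: dict, result: Any, seconds: float) -> None:
        self.spans.append({
            "tool": name,
            "args": args,
            "result": str(result)[:500],  # truncate large outputs
            "latency_s": round(seconds, 3),
        })

def traced(trace: Trace, name: str, fn: Callable) -> Callable:
    """Wrap a tool function so every call lands in the trace."""
    def wrapper(**kwargs):
        start = time.perf_counter()
        result = fn(**kwargs)
        trace.record(name, kwargs, result, time.perf_counter() - start)
        return result
    return wrapper

# Example: a hypothetical search tool the agent might call.
def search_docs(query: str) -> list[str]:
    return [f"doc about {query}"]

trace = Trace()
search = traced(trace, "search_docs", search_docs)
search(query="refund policy")
print(json.dumps(trace.spans, indent=2))
```

In practice teams forward these spans to whatever tracing backend they already run; the point is that every reasoning step and tool call leaves an inspectable record.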

Evaluation practices are maturing but remain less universal. Most teams start with offline testing on curated datasets, then layer in real-time monitoring once agents face actual users. The most effective teams combine automated assessment (using LLMs to judge outputs at scale) with human review for nuanced or high-stakes decisions.
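The sketch below illustrates the offline, LLM-as-judge portion of that pattern under stated assumptions: call_model is a placeholder for whatever model client a team already uses, and the rubric and passing threshold are illustrative choices, not a standard.

```python
# Sketch of offline evaluation with an LLM judge (illustrative only).
# `call_model` stands in for the team's existing model client; the rubric
# and threshold below are assumptions, not an established benchmark.
from dataclasses import dataclass

@dataclass
class TestCase:
    question: str
    expected: str  # reference answer from the curated dataset

JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Reference answer: {expected}
Agent answer: {actual}
Reply with a single integer score from 1 (wrong) to 5 (fully correct)."""

def call_model(prompt: str) -> str:
    # Placeholder for an actual model call (OpenAI, Claude, a local model, ...).
    raise NotImplementedError

def judge(case: TestCase, actual: str) -> int:
    reply = call_model(JUDGE_PROMPT.format(
        question=case.question, expected=case.expected, actual=actual))
    return int(reply.strip())

def run_offline_eval(agent, cases: list[TestCase], pass_threshold: int = 4) -> float:
    """Return the fraction of curated cases the agent passes."""
    passed = sum(judge(c, agent(c.question)) >= pass_threshold for c in cases)
    return passed / len(cases)
```

High-stakes or ambiguous cases would then be routed to human reviewers rather than trusted to the automated score alone.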

Where Agents Are Making an Impact

Customer-Facing Applications

Customer service has emerged as a leading use case, signaling a shift toward putting agents directly in front of end users rather than limiting them to internal tools.

Knowledge and Research Work

Agents excel at synthesizing large volumes of information, reasoning across sources, and accelerating research-intensive tasks. This makes them valuable for data analysis, document summarization, and domain exploration.

Internal Productivity

Many organizations deploy agents to automate workflows and boost employee efficiency by handling repetitive tasks, surfacing information, and streamlining operations.

The Model Landscape

Multi-Model Strategies Dominate

Few teams rely on a single provider. While OpenAI's GPT models see the widest adoption, most organizations use multiple models, routing tasks based on complexity, cost, and latency requirements. Claude, Gemini, and open source alternatives all see significant use.
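A routing layer for this can be quite small. The following is a minimal sketch; the model tiers, names, and routing rules are assumptions for illustration, not a recommendation of any provider.

```python
# Illustrative multi-model router: pick a model per task based on rough
# complexity, cost, and latency needs. Tier names are placeholders for
# whatever providers a team actually uses.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    needs_deep_reasoning: bool = False
    latency_sensitive: bool = False

FAST_CHEAP = "small-fast-model"   # hypothetical low-cost, low-latency tier
BALANCED = "mid-tier-model"       # hypothetical default tier
FRONTIER = "frontier-model"       # hypothetical high-capability tier

def route(task: Task) -> str:
    if task.needs_deep_reasoning:
        return FRONTIER        # accept higher cost and latency for hard tasks
    if task.latency_sensitive:
        return FAST_CHEAP      # user-facing paths prioritize response time
    return BALANCED            # default cost/quality trade-off

print(route(Task(prompt="Summarize this ticket", latency_sensitive=True)))
```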

Self-Hosted Models Are Growing

A meaningful portion of organizations invest in running models themselves, driven by cost optimization at scale, data residency requirements, or regulatory constraints.

Fine-Tuning Remains Specialized

Most teams rely on base models combined with prompt engineering and retrieval-augmented generation (RAG). Fine-tuning demands significant investment in data, infrastructure, and maintenance, so it's typically reserved for high-impact use cases where the ROI justifies the effort.
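For readers unfamiliar with the base-model-plus-RAG pattern, here is a minimal sketch under simplifying assumptions: a naive keyword retriever stands in for a real vector store, and call_model is a placeholder for the team's model client.

```python
# Minimal RAG sketch: retrieve relevant snippets, then prompt a base model
# with them. The keyword retriever stands in for a real vector store.
def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: len(terms & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def call_model(prompt: str) -> str:
    raise NotImplementedError  # placeholder for an actual model call

def answer(query: str, corpus: list[str]) -> str:
    context = "\n".join(retrieve(query, corpus))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_model(prompt)
```

Because the base model stays untouched, this approach avoids the data, infrastructure, and maintenance costs that make fine-tuning a specialized investment.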

What's Actually Being Used Day-to-Day

Coding Assistants Lead the Pack

Tools like Claude Code, Cursor, GitHub Copilot, and similar assistants have become integral to development workflows. They're used for code generation, debugging, testing, and navigating complex codebases.

Research Agents Follow Closely

Deep research tools powered by ChatGPT, Claude, Gemini, and Perplexity help teams explore new domains, summarize documents, and synthesize information. These often complement coding assistants in the same workflow.

Custom-Built Agents Fill Specific Gaps

Many teams build tailored agents for internal needs: QA automation, knowledge-base search, SQL generation, customer support, and workflow orchestration.

Looking Ahead

The agent landscape is maturing rapidly. Production deployment is becoming standard, observability is table stakes, and teams are learning to balance quality against speed. As the field evolves, success increasingly depends on thoughtful engineering practices, not just model capabilities.

The organizations moving fastest are those treating agent development as a discipline: iterating continuously, investing in evaluation infrastructure, and designing for reliability from the start.