The development of intelligent agents based on large language models (LLMs) of the so-called System 2 generation, combined with recent advances in text-to-speech (TTS) and speech-to-text (STT) systems capable of handling emotions and regional accents, is opening up significant opportunities in the automated voice assistant sector.
These systems represent a significant evolution from traditional chatbots, offering fluid and natural conversations applicable to intake services for automated switchboards and multi-level customer care functions.
The global conversational artificial intelligence market exceeded $20 billion in 2025 and is expected to reach $70 billion by 2030-2032, with a compound annual growth rate (CAGR) of approximately 25%.
The voice AI agents segment specifically shows even faster growth, with projections indicating a jump from $2.4 billion in 2024 to $47.5 billion by 2034 (a CAGR of 34.8%).
Enterprise adoption is accelerating rapidly; Gartner predicts that by 2026, over 30% of companies will automate more than half of their operations using AI and LLMs.
This scenario clearly presents attractive opportunities for many businesses.
Those who are currently launching programs to experiment, implement, and scale these solutions will define service standards for the next decade and could gain a significant competitive advantage.
The implementation of voice AI agents generates significant savings compared to traditional models.
A traditional contact center has an average annual cost that can exceed €500K, with costs per interaction of €1-3.
Meanwhile, a voice AI deployment for small businesses can cost from €20,000 to €100,000 annually with costs per interaction of €0.05-0.50.
For enterprise-sized companies, the cost per interaction drops further to €0.01-0.10.
According to a study conducted by one of the leading vendors in the sector, companies that have already implemented voice AI solutions report average ROIs of 8x, with monthly revenue increases of €40,000-85,000 thanks to approximately 27% greater lead capture.
Even accounting for the escalation penalty of calls that require transfer to a human operator (typically estimated at around 20% of calls), the blended cost per call drops from approximately €5.70 (human only) to approximately €2.24 (mixed AI/human handling), a net reduction of about 60%.
The Other Side of the Coin
Despite these prospects, implementing truly effective solutions requires overcoming significant technical challenges, both around system integration (compatibility with legacy systems, cross-platform information consistency, data silos) and around the functionality of the systems themselves (recognition of accents and dialects, latency in the audio stream processing pipeline, background noise interference, overall response latency, and query complexity).
The choice of technologies and their providers and the design of software architectures represent crucial factors for project success.
Fastal’s Positioning in This Sector
The constant research and development activity in the field of Artificial Intelligence and in the creation of Agentic Applications powered by next-generation LLMs, combined with deep knowledge of our Clients’ core business processes and the sectors in which they operate, places us in a privileged position to analyze emerging scenarios and launch innovation projects supported by architectural and implementation choices that minimize risk factors.
In this article, we will first analyze the technological landscape, both as it stands today and in terms of its likely evolution over the short and medium term, and then present some of our ideas, which we have already tested successfully in two recent, particularly significant projects.
Technological Foundations: The Evolution of LLMs with System 2 Reasoning
Reasoning Models and Advanced Capabilities
The concept of “System 2” reasoning in LLMs represents a paradigm shift from traditional models. While conventional language models like GPT-4 operated primarily through next-word prediction (fast and intuitive reasoning, or “System 1”), new-generation models, starting with OpenAI’s o1/o3 and Anthropic’s Claude 3.7 Sonnet, integrate deliberate multi-step reasoning capabilities.
These systems use reinforcement learning to develop internal chains of thought, enabling them to recognize errors, break down complex problems into simpler steps, and try alternative approaches when a strategy doesn’t work.
Claude 3.7 Sonnet, for example, was the first hybrid reasoning model released to market, capable of producing both near-instant responses and extended reasoning visible to the user.
System 2 model performance improves both with increased compute effort during training and compute time dedicated to reasoning during inference, creating a new scalability paradigm.
This is still a hybrid solution but one that tends toward new neurosymbolic paradigms advocated by researchers who have always been skeptical about the possibility of achieving true Artificial General Intelligence (AGI) by simply scaling models and computing power to extremes.
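To make the hybrid pattern concrete, the following is a minimal sketch, assuming the Anthropic Python SDK and its extended thinking parameter, of how an application can request deliberate System 2 style reasoning on demand; the model name and token budget are illustrative, not a recommendation.

```python
# Minimal sketch: requesting extended ("System 2") reasoning from a hybrid
# reasoning model via the Anthropic Python SDK. Model name and token budget
# are illustrative; check the provider's current documentation.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=2048,
    # Omit "thinking" for near-instant answers; enable it for deliberate,
    # multi-step reasoning with an explicit compute budget.
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=[{"role": "user", "content": "Plan the steps to reschedule a missed medical appointment."}],
)

# The response interleaves "thinking" blocks (the model's chain of thought)
# with the final "text" blocks that are returned to the caller.
for block in response.content:
    print(block.type)
```

Without the thinking parameter, the same request falls back to the model's fast, System 1 style response.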
While it is increasingly evident that the intrinsic quality of LLMs has reached an evolutionary plateau, new models have appeared on the market capable of powering truly effective agents that add value in everyday business use. OpenAI's recent GPT-5.2 and Anthropic's Claude Opus 4.5 are valid examples of this trend.
LLM Agent Architecture
A modern LLM agent is structured around key components that determine its effectiveness. The Large Language Model constitutes the brain of the system, responsible for understanding and generating natural language.
Memory is divided into short-term memory (recent conversations) and long-term memory (accumulated knowledge), enabling contextualization of interactions.
External tools allow agents to interact with APIs, databases, and business systems, transforming them from simple responders into true task executors.
A planning module coordinates multi-step activities, while the reasoning system uses frameworks like ReAct (Reasoning and Acting) to alternate between thought and action, improving decision reliability.
This architecture enables agentic capabilities that go beyond simple question answering, allowing agents to set goals, make decisions, and complete complex activities with minimal human intervention.
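As an illustration of how these components fit together, here is a deliberately simplified ReAct-style loop; the tool registry and the llm() helper are hypothetical placeholders, not part of any specific framework.

```python
from typing import Callable

# External tools the agent can invoke (APIs, databases, business systems).
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lambda order_id: f"Order {order_id}: shipped, ETA 2 days",
}

def llm(prompt: str) -> str:
    """Placeholder for the call to the reasoning model (any provider SDK)."""
    raise NotImplementedError

def run_agent(question: str, max_steps: int = 5) -> str:
    history = [f"Question: {question}"]        # short-term memory: the current episode
    for _ in range(max_steps):
        reply = llm("\n".join(history))        # model emits Thought / Action / Final Answer
        history.append(reply)
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        if "Action:" in reply:                 # e.g. "Action: lookup_order[12345]"
            name, _, arg = reply.split("Action:", 1)[1].strip().partition("[")
            observation = TOOLS[name.strip()](arg.rstrip("]"))
            history.append(f"Observation: {observation}")
    return "Escalating to a human operator."   # fallback once the step budget is exhausted
```

Production frameworks such as LangChain implement the same pattern with structured tool schemas, persistent memory stores, and more robust error handling.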
However, despite the brilliant results often obtained in the prototype phase, building agentic systems reliable enough for a real production environment is a difficult goal, requiring specific technical and design capabilities that often differ from those needed for traditional systems.
A new specialized field called Agent Engineering is clearly emerging, in which Fastal has been significantly investing in recent years.
Next-Generation Voice Technologies: Emotional Synthesis and Accent Management
Voice Synthesis with Emotional Intelligence
Advances in text-to-speech voice synthesis have reached unprecedented levels of naturalness and expressiveness.
ElevenLabs is a leader in this field: its v3 model, still in alpha at the time of writing, supports over 70 languages and introduces, for the first time, audio tags such as [giggles] or [whispering] for direct control of emotional expression.
The system automatically interprets emotional context from text and generates voices that reflect human tone, rhythm, and inflection.
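As a hedged sketch, assuming the ElevenLabs Python SDK and the v3 model id "eleven_v3", a request with inline audio tags might look like this (the voice_id is a placeholder):

```python
# Sketch of emotional TTS with inline audio tags; the SDK surface and model id
# are assumptions based on ElevenLabs documentation at the time of writing.
import os

from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

audio = client.text_to_speech.convert(
    voice_id="YOUR_VOICE_ID",
    model_id="eleven_v3",
    # Audio tags steer emotional delivery directly from the text.
    text="[whispering] I checked your booking... [giggles] good news, it is confirmed!",
)

with open("reply.mp3", "wb") as f:
    for chunk in audio:  # the SDK streams the synthesized audio back as byte chunks
        f.write(chunk)
```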
Hume AI, with its Octave TTS engine, takes an even more sophisticated approach: the model is trained simultaneously on text, voice, and emotion tokens, allowing emotional intelligence to be integrated into the architecture rather than added as a post-processing layer.
Hume’s EVI (Empathic Voice Interface) system can respond to the user’s emotional tone, creating empathic interactions that reduce escalations to human operators by 25%.
Regional Accent Management and Code-Switching
The ability to handle regional accents and fluid language switching (code-switching) represents a crucial element for global-scale adoption.
Recent research demonstrates 23.7% improvements in accent accuracy (Word Error Rate reduction from 15.4% to 11.8%) and 85.3% accuracy in emotional recognition by native listeners.
Modern platforms implement automatic language detection within 2-3 seconds, follow users who mix languages mid-sentence, and distinguish between regional variants like British vs. American English.
This capability is fundamental for multilingual markets and bilingual communities, where code-switching is the norm in everyday conversations, and is particularly interesting for a context like the Italian one, characterized by a great lexical variety expressed at regional and local levels. Code-switching, in fact, concerns not only the recognition and reproduction of accents and cadences but also the correct interpretation of idiomatic phrases and regional sayings.
Real-Time Platforms and APIs: The Technological Ecosystem
OpenAI Realtime API and Speech-to-Speech Models
OpenAI Realtime API, announced as generally available in August 2025, represents a unified architecture for production voice agents. Unlike traditional pipelines that concatenate separate models for STT, language processing, and TTS (introducing 200ms or more of latency per hop), the Realtime API processes and generates audio directly through a single model and WebSocket connection.
The new gpt-realtime model shows significant improvements in following complex instructions, calling tools precisely, and producing natural and expressive voice. The API now supports remote MCP servers, image input, and phone calls via Session Initiation Protocol (SIP), making voice agents more capable through access to additional tools and context.
The system includes automatic voice activity detection (VAD) and allows injecting custom responses for integration with RAG (Retrieval-Augmented Generation) systems.
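A minimal sketch of a session, assuming the OpenAI Python SDK's beta realtime client; the model name, event types, and method names reflect the SDK at the time of writing and may change as the API evolves:

```python
# Text-only example for brevity; the same connection also carries audio
# in and out when the session modalities include "audio".
import asyncio

from openai import AsyncOpenAI

async def main() -> None:
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    # A single WebSocket connection carries the whole conversation.
    async with client.beta.realtime.connect(model="gpt-realtime") as conn:
        await conn.session.update(session={"modalities": ["text"]})
        await conn.conversation.item.create(
            item={
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": "What are your opening hours?"}],
            }
        )
        await conn.response.create()
        async for event in conn:
            if event.type == "response.text.delta":
                print(event.delta, end="", flush=True)
            elif event.type == "response.done":
                break

asyncio.run(main())
```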
At the time of writing, it probably represents the most advanced solution in the Voice AI field, although, being a closed proprietary solution, the risk of vendor lock-in should be carefully evaluated.
ElevenLabs and Low-Latency Orchestration
ElevenLabs has developed a complete ecosystem for conversational AI that optimizes every component of the voice pipeline.
The Flash TTS engine achieves a model generation latency of 75ms and an end-to-end audio time-to-first-byte of 135ms, which, at the time of publication of this article, are the best figures in the industry.
The streaming architecture allows audio playback to start as soon as the first text tokens arrive, reducing perceived latency below 100ms.
For speech recognition, the system integrates streaming STT that processes audio incrementally while the user is speaking, eliminating 100-300ms per conversational turn.
Latency optimization is critical: the 300ms rule states that voice conversations must maintain response times below this threshold to seem natural rather than robotic.
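To see how quickly that budget is consumed, consider an illustrative back-of-the-envelope calculation for a cascaded pipeline; the figures below are assumptions for the sake of the example, not vendor benchmarks.

```python
# Illustrative latency budget for a cascaded STT -> LLM -> TTS voice pipeline.
budget_ms = {
    "network first hop": 15,            # e.g. a nearby WebRTC point of presence
    "streaming STT (partial result)": 80,
    "LLM time-to-first-token": 120,
    "TTS time-to-first-byte": 135,      # e.g. a Flash-class low-latency voice
}

total = sum(budget_ms.values())
print(f"Estimated time to first audible audio: {total} ms")  # 350 ms, already over the ~300 ms target
```

This is why streaming at every stage, and speech-to-speech architectures that remove entire hops, matter so much in practice.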
Mitigating Vendor Lock-in Risks and the Value of System Integration
Our historical positioning as a System Integrator specialized in technological innovation projects, and the experience of our management team, gained over more than 30 years of software design in mission-critical sectors, lead us to be naturally wary of vertical single-vendor solutions. This is not because we doubt that a single vendor can offer quality across the entire technological and functional stack, but because, over the years, we have seen companies and products of absolute value decline and disappear into oblivion, steamrolled by solutions of lesser intrinsic quality that were stronger at integrating into the complex scenarios of large groups, whose information systems remain heavily dependent on legacy systems.
The author has directed complex projects that, 25 years ago, constituted important technological innovations that enabled the replacement of systems built years earlier with 1980s technologies.
Today, despite more than a quarter century having passed, those systems are still in production and difficult to replace, having supported, over the years, the exponential growth and total digitalization of market sectors that are even more turbulent and competitive than they were 25 years ago.
Any innovative solution, in any sector, must deal with these types of scenarios. In a world permeated by IT, replacing a system that can now be defined as legacy is even more difficult than it was at the beginning of the millennium.
The versatility of a solution and the ability to create resilient architectures that can be integrated into complex scenarios are the main factors that can guarantee the success of innovation projects, and the Voice AI sector is no exception.
For this reason, our interest has been directed toward some emerging platforms and realities that, not coincidentally, have had rapid and successful market development.
Daily.co: WebRTC Infrastructure Platform
Daily.co was founded in 2016 as a San Francisco-based startup by a well-known serial founder with a technical background in real-time video.
The company started as a bet on the future of video and audio communications over the Internet, focusing from launch on the development of WebRTC, the open standard that enables next-generation video and audio experiences. Daily is an active member of the W3C WebRTC Working Group and contributes to several open source projects, including Mediasoup and GStreamer.
Daily operates a Global Mesh Network that represents one of the platform’s distinctive strengths. The infrastructure includes over 75 points of presence (PoPs) distributed across 10 global geographic regions. This global network enables median first-hop latencies of 13ms and connection times 2x faster than traditional solutions.
Daily’s infrastructure, designed to offer massive scalability and ultra-low latencies, is hosted on AWS in SOC 1, SOC 2, and ISO 27001 certified data centers, with 24/7 operations and enterprise-grade security. The platform offers proven multi-cloud architecture and on-premises and VPC deployment options.
Daily competes with platforms like Twilio, Agora, Vonage (OpenTok), 100ms, Dyte, LiveKit, and ZEGOCLOUD in the audio/video API market. Daily’s distinctive positioning includes ease of implementation, enterprise reliability, high-level developer support, and focus on WebRTC-native architecture.
Pipecat Framework: Open Source for AI Conversational Agents
Pipecat is an open source Python framework for building real-time voice and multimodal conversational agents. Developed and maintained by the Daily.co team and the Pipecat community, the framework is completely vendor-neutral: it is not tightly coupled to Daily's infrastructure, although it supports it natively.
Pipecat’s vision is to simplify the construction of conversational AI applications that can see, hear, and speak in real time, managing the complex orchestration of AI services, network transport, audio processing, and multimodal interactions. The framework enables developers to focus on creating engaging experiences rather than managing infrastructure complexity.
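A hedged sketch of what a Pipecat voice agent looks like, assuming recent releases of the framework: import paths, service names, and parameters may differ slightly between versions, and the room URL, API keys, and voice id are placeholders.

```python
import asyncio

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport

async def main() -> None:
    # WebRTC transport: the agent joins a Daily room like any other participant.
    transport = DailyTransport(
        room_url="https://YOUR_DOMAIN.daily.co/YOUR_ROOM",
        token=None,
        bot_name="Voice Assistant",
        params=DailyParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),  # voice activity detection on the input stream
        ),
    )

    stt = DeepgramSTTService(api_key="DEEPGRAM_API_KEY")
    llm = OpenAILLMService(api_key="OPENAI_API_KEY", model="gpt-4o")
    tts = ElevenLabsTTSService(api_key="ELEVENLABS_API_KEY", voice_id="VOICE_ID")

    context = OpenAILLMContext(messages=[{"role": "system", "content": "You are a helpful phone assistant."}])
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline([
        transport.input(),               # audio frames from the caller
        stt,                             # speech -> text
        context_aggregator.user(),       # accumulate the user turn into the LLM context
        llm,                             # text -> response tokens
        tts,                             # response tokens -> audio
        transport.output(),              # audio back to the caller
        context_aggregator.assistant(),  # store the assistant turn in the context
    ])

    # allow_interruptions enables barge-in: if the caller starts speaking,
    # the current bot response is cancelled and the agent listens again.
    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
    await PipelineRunner().run(task)

asyncio.run(main())
```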
Our Recent Experience
Adding Voice functionality represents the natural evolution of any product that falls into the AI assistant category.
When we prepare to deploy a new Assistant/Agent to production, whatever business process it targets, not just customer care, we are now prepared for the inevitable request: a “talking” version of the same agent.
Why connect via web to the chat interface to interact with the agent when I could simply call it?
It seems like a trivial add-on, but practical implementation involves a challenge that is anything but trivial.
The agent architecture, if not initially designed to support real-time voice interaction, must be adapted, and refactoring is often not trivial.
But the real challenge is represented by the infrastructure.
Putting into production a system that ensures fluid, interactive conversations that perfectly simulate human interaction, while preserving the agent's functional performance, is a problem of great complexity.
Issues to manage include:
- low latency across all components of the processing pipeline;
- quality of the VAD (Voice Activity Detection) algorithms;
- impeccable multi-turn recognition management;
- RTVI (Real-Time Voice Interaction) functionality;
- correct barge-in handling (literally, intrusion into the conversation: if the user starts talking while the assistant is still speaking, it must stop and listen);
- coexistence of the WebRTC protocol with any NAT and firewalls;
- management of the STT-LLM-TTS pipeline while also guaranteeing the interpretation and reproduction of emotional voice nuances.
In this scenario, it is evident that the ability to manage complex multi-vendor integrations and varied technology stacks becomes the factor that makes the difference between a successful project and a poor prototype.
In recent months, we have achieved very encouraging results by integrating our framework, built on an Astro.js, Vite, React, and TypeScript stack for the front end and Python FastAPI, LangChain, Redis, and Postgres for the back end, with the Pipecat framework and Daily's WebRTC transport.
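As a hedged illustration of how the two halves meet, the back end can expose a simple endpoint that provisions a Daily room via Daily's REST API and hands its URL to the front end and to the Pipecat bot; the endpoint shape and room properties below are illustrative, not our production code.

```python
import os
import time

import httpx
from fastapi import FastAPI, HTTPException

app = FastAPI()
DAILY_API_KEY = os.environ["DAILY_API_KEY"]

@app.post("/voice-sessions")
async def create_voice_session() -> dict:
    """Create a short-lived Daily room and return its URL to the caller."""
    async with httpx.AsyncClient() as http:
        resp = await http.post(
            "https://api.daily.co/v1/rooms",
            headers={"Authorization": f"Bearer {DAILY_API_KEY}"},
            json={"properties": {"exp": int(time.time()) + 3600}},  # room expires in one hour
        )
    if resp.status_code != 200:
        raise HTTPException(status_code=502, detail="Could not create Daily room")
    room = resp.json()
    # In our deployments, a Pipecat bot process is started here and told to
    # join room["url"]; that orchestration is omitted from this sketch.
    return {"room_url": room["url"]}
```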
In our judgment, for the Italian language, with management of regional varieties and emotional expressions, at this moment, ElevenLabs models are a cut above the competition.
The first two projects, one in the healthcare sector and the second in the customer care sector, are already in the release phase.