Unified Conversation Agent (UCA)
WORK IN PROGRESS
This is an exploration project to build an AI-based Unified Conversation Agent (UCA) to make the lives of end users better and deliver useful services. UCA will leverage AI technologies to support OpenG2P use cases for social benefit delivery across programs and departments. This intelligent agent will engage directly with callers via voice, providing real-time updates on program statuses and disbursements, informing them about eligibility for additional programs, and enabling seamless program application entirely through phone or voice interactions.
1. Speech-to-Text Implementation Using Vosk
Overview
This documentation covers the implementation of a real-time speech-to-text system using the Vosk speech recognition toolkit. The system captures audio input from the microphone and converts it to text in real-time.
Model Selection
After evaluating different speech recognition models, we selected Vosk for its offline capabilities and ease of implementation. Two models were tested:
vosk-model-small-en-us-0.15 (smaller model)
vosk-model-en-us-0.22 (larger model)
Based on hands-on testing, the larger model (en-us-0.22) produced noticeably more accurate and reliable transcriptions than the smaller model, although no formal accuracy metrics were collected.
Implementation Details
Dependencies
vosk: Speech recognition engine
sounddevice: Audio input handling
json: Processing recognition results
queue: Managing audio data stream
Key Components
Model Initialization
The system initializes with a Vosk model and sets the audio sampling rate to 16kHz, which is the standard for speech recognition.
Audio Capture
The implementation uses a queue-based system to handle audio input:
This callback function captures audio data in real-time and places it in a queue for processing.
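A minimal sketch of the initialization and callback, assuming the model is stored under a models/ directory (names are illustrative):

```python
import queue

import sounddevice as sd
from vosk import Model, KaldiRecognizer

SAMPLE_RATE = 16000

model = Model("models/vosk-model-en-us-0.22")  # assumed model path
recognizer = KaldiRecognizer(model, SAMPLE_RATE)
audio_queue = queue.Queue()

def audio_callback(indata, frames, time, status):
    # Push raw audio bytes onto the queue; the recognition loop consumes them
    audio_queue.put(bytes(indata))
```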
Recognition Loop
The main recognition loop:
Continuously processes audio data from the queue
Converts speech to text in real-time
Outputs recognized text when confidence is sufficient
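A sketch of the loop under the same assumptions, using the 8000-sample block size and mono 16-bit input noted under Technical Notes below:

```python
import json

try:
    with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=8000,
                           dtype="int16", channels=1,
                           callback=audio_callback):
        while True:
            data = audio_queue.get()
            if recognizer.AcceptWaveform(data):
                # A complete utterance was recognized; Result() returns JSON
                result = json.loads(recognizer.Result())
                if result.get("text"):
                    print(result["text"])
except KeyboardInterrupt:
    print("Stopped.")
```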
Usage
Ensure the appropriate Vosk model is downloaded and placed in the models directory
Run the script
Speak into the microphone
Press Ctrl+C to stop the recognition
Performance Considerations
The larger model (en-us-0.22) requires more computational resources but provides better accuracy
The system processes audio in real-time with minimal latency
Queue-based implementation ensures smooth audio capture without data loss
Future Improvements
Implement formal accuracy metrics for model comparison
Add support for multiple languages
Optimize memory usage for long-running sessions
Technical Notes
Audio is captured at 16kHz with 16-bit depth
Processing occurs in blocks of 8000 samples
Single channel (mono) audio input is used for optimal recognition
2. Text-to-Speech Using Different Models
Text-to-Speech (TTS) Implementation
Model Evaluation and Selection
We evaluated three different TTS solutions:
Coqui TTS (Jenny Model)
GitHub: https://github.com/coqui-ai/TTS
Model used:
tts_models/en/jenny/jenny
Voice quality was not satisfactory; it produced unexpected voice modulation
Resource-intensive and required significant setup
Coqui Tacotron2-DDC
Model used:
tts_models/en/ljspeech/tacotron2-DDC
Produced good voice quality
Drawbacks:
Long loading times
Lower accuracy compared to alternatives
Resource-intensive
pyttsx3
GitHub: https://github.com/nateshmbhat/pyttsx3
Selected as final implementation
Advantages:
Fast response time
Simple implementation
Reliable performance
Minimal resource usage
Good voice quality
The implementation sets the speech rate to 150 WPM
Final Implementation Details
The system uses pyttsx3 with the following key components:
Engine Initialization
Main TTS Loop
Continuous text input processing
Clean exit functionality
Simple user interface
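A minimal sketch of this setup; the 150 WPM rate matches the value noted above, the rest is illustrative:

```python
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 150)  # speech rate used in this project

try:
    while True:
        text = input("Enter text (or 'quit' to exit): ")
        if text.strip().lower() == "quit":
            break
        engine.say(text)
        engine.runAndWait()  # blocks until playback finishes
except KeyboardInterrupt:
    pass
```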
Usage
Initialize the TTS engine
Enter text when prompted
System converts text to speech in real-time
Type 'quit' to exit
Supports keyboard interrupt (Ctrl+C)
Alternative Implementations (For Reference)
Coqui TTS Implementation
Tacotron Implementation
Performance Considerations
pyttsx3 provides immediate response with minimal latency
No internet connection required
Lower resource usage compared to neural network-based solutions
Suitable for continuous operation
3. Integrated Speech System Documentation
System Overview
The system integrates speech-to-text (STT) and text-to-speech (TTS) capabilities with an API service, creating a complete voice interaction system. Key features include loopback prevention and thread-based conversation management.
Core Components
1. Audio Processing
Uses Vosk for speech recognition (model: vosk-model-en-us-0.22)
Implements pyttsx3 for text-to-speech
Manages audio through sounddevice with 16kHz sampling rate
2. API Integration
Implements REST API communication
Supports conversation threading
Includes timeout handling (10 seconds)
Response cleaning functionality
3. Loopback Prevention System
The system implements multiple mechanisms to prevent audio loopback:
Global Processing Flag
Tracks when system is outputting speech
Prevents audio capture during TTS playback
Audio Callback Control
Only processes input when not outputting speech
Uses global flag to control audio capture
Silence Detection
Implements 2-second silence threshold
Prevents rapid-fire speech processing
Queue Management
Clears audio queue before processing new input
Prevents backlog of audio data
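A sketch of how these mechanisms can fit together; the processing_output flag name comes from the Troubleshooting section below, the rest is illustrative:

```python
processing_output = False  # True while TTS audio is playing

def audio_callback(indata, frames, time, status):
    # Audio Callback Control: ignore the microphone while the system speaks
    if not processing_output:
        audio_queue.put(bytes(indata))

def speak(text):
    global processing_output
    processing_output = True  # Global Processing Flag: block capture
    try:
        engine.say(text)
        engine.runAndWait()
    finally:
        processing_output = False
        # Queue Management: discard anything captured during playback
        while not audio_queue.empty():
            audio_queue.get_nowait()
```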
Error Handling
API Communication
Timeout handling for API requests
Response validation
Error message feedback through TTS
Audio Processing
Exception handling in main loop
Graceful shutdown on interruption
Recovery from processing errors
Thread Management
Unique thread IDs for conversation tracking
Format: 'user01_XX' where XX is the session number
Maintains conversation context across interactions
Response Processing
Clean Response Function
Removes formatting characters
Extracts relevant message content
Maintains original response if no cleaning needed
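An illustrative version of such a cleaning function (the project's exact rules may differ):

```python
import re

def clean_response(text: str) -> str:
    # Remove common formatting characters (markdown emphasis, backticks, headers)
    cleaned = re.sub(r"[*_`#]", "", text).strip()
    # Maintain the original response if cleaning removed everything
    return cleaned if cleaned else text
```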
Usage Flow
System initialization
Load speech recognition model
Initialize TTS engine
Configure audio settings
Continuous operation loop
Listen for speech input
Convert speech to text
Send to API
Process response
Convert response to speech
Reset for next interaction
Technical Requirements
Python 3.x
vosk
pyttsx3
sounddevice
requests
Performance Considerations
Audio processing runs at 16kHz with 16-bit depth
8000 sample blocksize for audio processing
2-second silence threshold for speech segmentation
150 WPM speech rate for TTS
Future Improvements
Dynamic silence threshold adjustment
Multiple language support
Enhanced error recovery
Voice activity detection
Configurable audio parameters
Troubleshooting
Audio Loopback Issues
Verify speakers aren't feeding into microphone
Check processing_output flag status
Confirm silence threshold appropriateness
API Communication
Check network connectivity
Verify thread_id format
Monitor API response times
Validate API endpoint status
4. Data Preparation and Embedding Creation
Overview
The first step extracts data from a SQL database, generates embeddings, and stores them in a FAISS vector store, creating a searchable index for efficient similarity searches.
Components Used
LangChain HuggingFace Embeddings
FAISS Vector Store
SQLite Database
all-MiniLM-L6-v2 embedding model
Implementation Details
1. Database Connection and Data Retrieval
Connects to SQLite database
Retrieves specific fields: pid, mneumonic, description
Returns data as tuples
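A sketch of the retrieval step; the database filename and table name are assumptions, while the field names match the list above:

```python
import sqlite3

def fetch_programs(db_path: str = "programs.db"):  # assumed filename
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(
            "SELECT pid, mneumonic, description FROM programs"  # assumed table name
        )
        return cursor.fetchall()  # list of (pid, mneumonic, description) tuples
    finally:
        conn.close()
```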
2. Document Creation
Key features:
Combines mneumonic and description for context
Preserves metadata (pid and mneumonic)
Creates LangChain Document objects
Handles cases where description might be missing
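A sketch of the document-building step, following the content and metadata structures described under Technical Considerations below:

```python
from langchain_core.documents import Document

def build_documents(rows):
    docs = []
    for pid, mneumonic, description in rows:
        # Combine mneumonic and description; fall back if description is missing
        content = f"{mneumonic}: {description}" if description else str(mneumonic)
        docs.append(Document(
            page_content=content,
            metadata={"pid": pid, "mneumonic": mneumonic},
        ))
    return docs
```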
3. Embedding Creation and Storage
Important aspects:
Uses all-MiniLM-L6-v2 for embedding generation
Creates FAISS vector store
Saves index locally for future use
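A sketch of the embedding and indexing step; import paths vary across LangChain versions, and the index directory name is an assumption:

```python
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vector_store = FAISS.from_documents(build_documents(fetch_programs()), embeddings)
vector_store.save_local("faiss_index")  # saved locally for future reuse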
Data Flow
SQL Data → Python Objects
Python Objects → LangChain Documents
Documents → Vector Embeddings
Embeddings → FAISS Index
Technical Considerations
1. Data Structure
Content structure:
"{mneumonic}: {description}"
Metadata structure:
{"pid": <pid>, "mneumonic": <mneumonic>}
Error Handling
Database Errors
Embedding Creation Errors
5. Integrated AI Agent System
Architecture Overview
The CombinedProgramAgent creates an AI system that integrates vector search (FAISS), structured database queries (SQL), and language model reasoning through the ReAct architecture.
Core Components
1. Agent Initialization
This initialization sets up three primary components:
Language Model (LLM) configuration
Tool initialization (FAISS and SQL)
ReAct agent setup with system prompt
2. LLM Configuration
The LLM configuration:
Uses Ollama for local model deployment
Sets temperature to 0 for consistent, deterministic responses
Enables multi-threading for improved performance
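A sketch of this configuration; the model tag and thread count are assumptions based on the Ollama setup discussed later in this document:

```python
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.2",   # local model served by Ollama (assumed tag)
    temperature=0,      # deterministic, consistent responses
    num_thread=8,       # multi-threading for improved performance (assumed count)
)
```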
3. Tool Integration
SQL Database Toolkit
The SQLDatabaseToolkit provides:
Query generation from natural language
Direct SQL execution
Result summarization
Schema inspection capabilities
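A sketch of the toolkit setup, reusing the llm above and the assumed database path:

```python
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import SQLDatabaseToolkit

db = SQLDatabase.from_uri("sqlite:///programs.db")  # assumed path
sql_toolkit = SQLDatabaseToolkit(db=db, llm=llm)
sql_tools = sql_toolkit.get_tools()  # query, checker, schema, and list-tables tools
```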
FAISS Vector Search
The FAISS integration enables:
Semantic similarity search
Efficient retrieval of relevant program information
Configurable number of similar results (k=3)
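A sketch of exposing the FAISS index as an agent tool with k=3; the tool name program_info appears later in this document, the description is illustrative:

```python
from langchain.tools.retriever import create_retriever_tool

# vector_store can be rebuilt or loaded from the index saved in the previous step
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
program_info_tool = create_retriever_tool(
    retriever,
    name="program_info",
    description="Semantic search over social benefit program descriptions.",
)
```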
ReAct Agent Architecture
Understanding ReAct
ReAct (Reasoning and Acting) is an agent architecture that combines:
Reasoning: Thinking about what to do next
Action: Executing tools based on reasoning
Observation: Processing tool outputs
Reflection: Using results to plan next steps
System Prompt Design
The system prompt structures the agent's behavior by:
Defining clear steps for processing queries
Establishing tool usage priorities
Setting response formatting guidelines
Implementing error checking protocols
Memory Management
The MemorySaver enables:
Conversation state tracking
Thread-based memory management
Consistent context maintenance
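A sketch of wiring memory into the ReAct agent; the thread ID format matches the one used by the speech system:

```python
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent

memory = MemorySaver()
agent = create_react_agent(
    llm,
    tools=[program_info_tool] + sql_tools,
    checkpointer=memory,  # thread-based conversation state
)

# Each conversation passes its own thread_id so context persists per caller
config = {"configurable": {"thread_id": "user01_01"}}
```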
Query Processing Flow
Query Reception
Receives user query and thread ID
Prepares configuration for processing
Tool Selection
Agent decides between FAISS and SQL tools
FAISS for semantic search
SQL for specific criteria verification
Response Generation
Combines tool outputs
Formats according to system prompt
Returns structured response
Understanding SQLDatabaseToolkit
The SQLDatabaseToolkit provides several tools:
Query Generator
Converts natural language to SQL
Handles complex query construction
Manages table relationships
SQL Executor
Runs generated queries
Handles error cases
Returns formatted results
Schema Inspector
Analyzes database structure
Provides table information
Helps in query construction
Common Challenges and Solutions
1. Library Dependency Conflicts
Solution approaches:
Use virtual environments
Pin specific package versions (requirements.txt)
Document working configurations
Date: February 21, 2025
Improving AI Agent Accuracy and Reliability
Initial Implementation and Challenges
Original Approach
The initial implementation used a combined agent system with:
FAISS vector store for semantic search
SQL database for detailed program information
Basic system prompt for agent guidance
Prompt Used:
Key Challenges Encountered
Data Quality Issues
Limited program descriptions in FAISS
Abstract information leading to ambiguous matches
Insufficient context for accurate recommendations
LLM Hallucination
Agent making assumptions beyond available data
Mixing up eligibility criteria
Providing inaccurate program recommendations
Response Accuracy
Inconsistent response structure
Unclear distinction between found and inferred information
Missing verification steps
Evolution of Solutions
Attempt 1: Enhanced Prompt Engineering
Detailed Structured Prompt
Improvements Attempted:
Strict step-by-step instructions
Explicit search sequence
Mandatory tool usage order
Structured response format
Results:
Some improvement in response structure
Still faced hallucination issues
Didn't fully solve accuracy problems
Attempt 2: Data-Centric Approach
1. Data Quality Enhancement
Replaced abstract descriptions with detailed program information
Improved FAISS embeddings quality
Better context preservation
2. Simplified Yet Strict Prompt
Key Features:
Clear hallucination prohibition
Explicit tool usage instructions
Strong emphasis on retrieved data only
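The exact prompt text is not reproduced in this document; an illustrative prompt with these features might look like:

```python
SYSTEM_PROMPT = """You are a program eligibility advisor.
RULES:
1. NEVER invent program names, criteria, or details; answer ONLY from retrieved data.
2. ALWAYS call program_info first to find candidate programs.
3. THEN verify details with the SQL tools using the returned program IDs.
4. If the tools return nothing relevant, say so instead of guessing.
"""
```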
3. Improved Data Flow
FAISS returns the program ID and mneumonic
SQL lookup using returned IDs
Comprehensive information retrieval
Speech System API Integration
FastAPI Service Implementation:
The system implements a FastAPI-based service that integrates the CombinedProgramAgent with speech capabilities, enabling HTTP-based communication for the speech interface.
Components
1. API Configuration
2. Agent Initialization
3. Request Model
API Endpoints
Health Check
Chat Endpoint
Server Configuration
Listens on all network interfaces
Uses port 8000
Enables remote access
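The service code itself is not reproduced here; a minimal sketch with assumed endpoint paths and an assumed process_query method name on the agent:

```python
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="UCA Speech Agent API")
agent = CombinedProgramAgent()  # agent from the previous section

class ChatRequest(BaseModel):
    message: str
    thread_id: str  # e.g. "user01_01"

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/chat")
def chat(request: ChatRequest):
    # process_query is an assumed method name on the agent
    reply = agent.process_query(request.message, request.thread_id)
    return {"response": reply}

if __name__ == "__main__":
    # Listen on all interfaces, port 8000, to allow remote access
    uvicorn.run(app, host="0.0.0.0", port=8000)
```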
TTS Challenges: pyttsx3
Platform-Specific Speech Engines
Windows Environment
Uses SAPI5 (Microsoft Speech API)
Advantages:
High-quality voice synthesis
Natural-sounding output
Multiple voice options
Good control over speech parameters
Implementation:
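A sketch of selecting the SAPI5 driver and a voice on Windows (the voice choice is illustrative):

```python
import pyttsx3

# pyttsx3 picks SAPI5 automatically on Windows; it can also be forced explicitly
engine = pyttsx3.init(driverName="sapi5")

# Inspect the installed SAPI5 voices and select one
voices = engine.getProperty("voices")
for voice in voices:
    print(voice.id, voice.name)
engine.setProperty("voice", voices[0].id)  # illustrative choice
```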
Linux Environment
Uses eSpeak by default
Limitations:
Robotic voice quality
Limited voice options
Less natural pronunciation
Reduced control over voice parameters
Ollama Installation and CUDA Permission Issues
Error Overview
When combined_agent.py was run while attempting to use Ollama with CUDA acceleration, the following error was encountered:
This error indicates a permission issue with the CUDA libraries that Ollama needs to access.
Root Causes
Permission Problems: The Ollama service user doesn't have proper permissions to access CUDA libraries
Ownership Issues: CUDA library files have incorrect ownership
Installation Conflicts: Mismatched CUDA versions between system drivers and Ollama requirements
Resolution Steps
The issue was resolved through a complete reinstallation of Ollama and proper permission configuration:
Fix Immediate Permissions
Perform Clean Reinstallation
Verify CUDA Compatibility
Update NVIDIA Drivers (if needed)
Restart and Verify Service
Transitioning from Llama3.2 to DeepSeek
Limitations of Llama3.2
When implementing the combined agent system with Llama3.2, we encountered several significant performance issues:
Inconsistent Tool Utilization
The model frequently failed to call the appropriate tools
Sometimes ignored the FAISS vector search tool (program_info)
Other times skipped the SQL database tools
Resulted in incomplete information gathering
Poor Intent Recognition
Failed to properly identify user intents
Confused casual conversation with program inquiries
Responded inappropriately to queries
Prompt Adherence Issues
Did not consistently follow the structured approach defined in prompts
Skipped critical verification steps
Provided responses without gathering necessary information
Reasoning Limitations
Struggled with complex multi-step reasoning
Failed to integrate information from multiple sources
Made conclusions without proper verification
Motivation for DeepSeek Implementation
Due to these limitations, we explored the DeepSeek model (deepseek-r1:8b) for the following reasons:
Advanced Capabilities
Larger parameter count (8B, versus the smaller Llama3.2 model)
Better reported performance on reasoning tasks
Improved instruction-following capabilities
Enhanced context understanding
Quality Improvements
More consistent reasoning patterns
Better adherence to structured prompts
Improved multi-step planning
Higher accuracy in understanding complex queries
Integration Potential
Compatible with Ollama deployment
Designed for assistant-like applications
Support for complex reasoning chains
DeepSeek Model Compatibility Issues
Error Overview
When attempting to use the DeepSeek model with tools in the CombinedProgramAgent, the following error occurred:
This error indicates that the DeepSeek model, as implemented in Ollama, doesn't support the function calling/tools API that LangGraph and LangChain require for agent implementation.
Technical Background
Tool-Using Capability: Modern LLMs require specific capabilities to utilize tools/function calling:
Standardized input/output formats
Support for specific JSON schema interpretation
Built-in capability to generate structured tool-use requests
DeepSeek Limitations: The current DeepSeek implementation in Ollama:
Lacks the necessary function-calling API
Cannot parse or generate the required JSON structure
Is not fine-tuned for tool-using applications
Enhanced Agent Architecture and Tool Control
This documentation analyzes the evolution of the CombinedProgramAgent system, focusing on architectural improvements that resolved critical limitations in the original implementation. The agent serves as a program eligibility advisor that utilizes vector search (FAISS) and structured database queries (SQL) to provide accurate program recommendations based on user inquiries.
Original Implementation Analysis
Architecture Overview
The original implementation featured:
A standard LangChain ReAct agent architecture
Direct integration of SQL and FAISS tools
A basic system prompt guiding agent behavior
Critical Limitations
1. Tool Sequencing Problems
The original implementation allowed the agent to use tools in any order, giving equal priority to all of them. This allowed the agent to:
Execute SQL queries without first identifying relevant programs through FAISS
Misunderstand the dependent relationship between tools
Produce incomplete or erroneous information
2. Hallucination Issues
The original system permitted hallucination through:
Lack of strict data validation
No explicit response verification
Basic prompt structure without enforced boundaries:
Despite these instructions, the agent would often invent program details, combine real and fabricated information, or provide erroneous eligibility assessments.
Enhanced Implementation Analysis
The updated implementation represents a significant architectural advancement with several sophisticated mechanisms:
1. Enforced Tool Sequencing
This implementation enforces a strict tool hierarchy through:
Clear "MUST USE FIRST" directive in the FAISS tool description
SQL tools explicitly requiring input from the FAISS tool
Order-dependent tool list structure
2. SQL Tool Wrapper Mechanism
This wrapper operates through:
Function Closure: Creates a new function that encapsulates the original tool
Input Validation: Checks for the presence of "program_id:" in the query
Error Redirection: Returns an explicit error message rather than executing the tool when validation fails
Transparent Execution: Passes valid requests to the original tool with all necessary context
The wrapper establishes a dependency chain that ensures:
FAISS search must be used first to get program IDs
SQL tools can only operate on previously identified programs
The agent receives immediate feedback when attempting to bypass the workflow
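A sketch of the wrapper pattern and the order-dependent tool list, reusing the tool objects from the sketches above; the "program_id:" check and the "MUST USE FIRST" directive come from this document, the rest is illustrative:

```python
from langchain_core.tools import Tool

# The FAISS tool description carries the "MUST USE FIRST" directive
program_info_tool.description = (
    "MUST USE FIRST: semantic search returning candidate program IDs."
)

def wrap_sql_tool(tool):
    # Function closure over the original tool
    def guarded(query: str) -> str:
        # Input validation: require a program_id from the FAISS step
        if "program_id:" not in query:
            return ("Error: call program_info first to obtain a program ID, "
                    "then retry with 'program_id:<id>' in the query.")
        # Transparent execution: pass valid requests to the original tool
        return tool.invoke(query)

    return Tool(name=tool.name, description=tool.description, func=guarded)

# Order-dependent tool list: FAISS first, then the guarded SQL tools
tools = [program_info_tool] + [wrap_sql_tool(t) for t in sql_tools]
```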
3. Response Validation System
The enhanced implementation introduces a sophisticated response validation mechanism:
This validation system:
Builds a repository of known program names from the database
Applies different validation rules based on response content
Allows conversational responses without program references to pass unchanged
Verifies that program-related responses only mention known programs
Provides a fallback response for potential hallucinations
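A rough sketch of such a validator; the heuristics here are illustrative, not the project's exact rules:

```python
def validate_response(response: str, known_programs: set[str]) -> str:
    lower = response.lower()
    # Conversational responses without program references pass unchanged
    if "program" not in lower:
        return response
    # Program-related responses must mention at least one known program
    if any(name.lower() in lower for name in known_programs):
        return response
    # Fallback for potential hallucinations
    return ("I couldn't verify that against the program database. "
            "Could you rephrase your question?")
```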
4. Improved System Prompt
The updated system prompt incorporates several advanced features:
Key improvements include:
Explicit greeting identification with examples
Clear prohibition on tool usage for greetings
Mandatory response format for standardization
Specific prohibition clauses