Unified Conversation Agent (UCA)
WORK IN PROGRESS
This is an exploratory project to build an AI-based Unified Conversation Agent (UCA) that improves the lives of end users by delivering useful services. UCA will leverage AI technologies to support OpenG2P use cases for social benefit delivery across programs and departments. This intelligent agent will engage directly with callers via voice, providing real-time updates on program statuses and disbursements, informing them about eligibility for additional programs, and enabling program applications entirely through phone or voice interactions.
1. Speech-to-Text Implementation Using Vosk
Overview
This documentation covers the implementation of a real-time speech-to-text system using the Vosk speech recognition toolkit. The system captures audio input from the microphone and converts it to text in real-time.
Model Selection
After evaluating different speech recognition models, we selected Vosk for its offline capabilities and ease of implementation. Two models were tested:
vosk-model-small-en-us-0.15 (smaller model)
vosk-model-en-us-0.22 (larger model)
Based on empirical testing, the larger model (en-us-0.22) demonstrated better accuracy in speech recognition compared to the smaller model. While no formal metrics were used for evaluation, hands-on experience showed more reliable transcription results with the larger model.
Implementation Details
Dependencies
vosk: Speech recognition engine
sounddevice: Audio input handling
json: Processing recognition results
queue: Managing audio data stream
Key Components
Model Initialization
The system initializes with a Vosk model and sets the audio sampling rate to 16kHz, which is the standard for speech recognition.
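A minimal initialization sketch (the model directory path is an assumption; point it at the extracted model):

```python
from vosk import Model, KaldiRecognizer

SAMPLE_RATE = 16000  # 16 kHz, standard for speech recognition models

model = Model("models/vosk-model-en-us-0.22")
recognizer = KaldiRecognizer(model, SAMPLE_RATE)
```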
Audio Capture
The implementation uses a queue-based system to handle audio input:
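A sketch following the standard sounddevice callback pattern (names are illustrative):

```python
import queue

audio_queue = queue.Queue()

def audio_callback(indata, frames, time, status):
    """Receive a block of raw audio from sounddevice and enqueue it."""
    if status:
        print(status, flush=True)
    audio_queue.put(bytes(indata))
```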
This callback function captures audio data in real-time and places it in a queue for processing.
Recognition Loop
The main recognition loop (see the sketch after this list):
Continuously processes audio data from the queue
Converts speech to text in real-time
Outputs recognized text when confidence is sufficient
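A sketch of the loop, reusing the recognizer and callback from the sketches above; the parameters match the Technical Notes below (16 kHz, 16-bit, mono, 8000-sample blocks):

```python
import json
import sounddevice as sd

# Assumes `recognizer`, `audio_queue`, and `audio_callback` from the
# sketches above.
with sd.RawInputStream(samplerate=16000, blocksize=8000, dtype="int16",
                       channels=1, callback=audio_callback):
    print("Listening... press Ctrl+C to stop")
    try:
        while True:
            data = audio_queue.get()
            if recognizer.AcceptWaveform(data):
                # The recognizer has finalized an utterance
                text = json.loads(recognizer.Result()).get("text", "")
                if text:
                    print(text)
    except KeyboardInterrupt:
        print("\nStopped.")
```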
Usage
Ensure the appropriate Vosk model is downloaded and placed in the models directory
Run the script
Speak into the microphone
Press Ctrl+C to stop the recognition
Performance Considerations
The larger model (en-us-0.22) requires more computational resources but provides better accuracy
The system processes audio in real-time with minimal latency
Queue-based implementation ensures smooth audio capture without data loss
Future Improvements
Implement formal accuracy metrics for model comparison
Add support for multiple languages
Optimize memory usage for long-running sessions
Technical Notes
Audio is captured at 16kHz with 16-bit depth
Processing occurs in blocks of 8000 samples
Single channel (mono) audio input is used for optimal recognition
2. Text-to-Speech Using Different Models
Text-to-Speech (TTS) Implementation
Model Evaluation and Selection
We evaluated three different TTS solutions:
Coqui TTS (Jenny Model)
GitHub: https://github.com/coqui-ai/TTS
Implementation used tts_models/en/jenny/jenny
Voice quality was not satisfactory; it produced unexpected voice modulation
Resource-intensive and required significant setup
Coqui Tacotron2-DDC
Using tts_models/en/ljspeech/tacotron2-DDC
Produced good voice quality
Drawbacks:
Long loading times
Lower accuracy compared to alternatives
Resource-intensive
pyttsx3
GitHub: https://github.com/nateshmbhat/pyttsx3
Selected as final implementation
Advantages:
Fast response time
Simple implementation
Reliable performance
Minimal resource usage
Good voice quality
Implementation uses a speech rate of 150 words per minute
Final Implementation Details
The system uses pyttsx3 with the following key components:
Engine Initialization
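A minimal sketch of the initialization:

```python
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 150)  # 150 words per minute, as noted above
```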
Main TTS Loop
Continuous text input processing
Clean exit functionality
Simple user interface
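A sketch of such a loop (the prompt text is illustrative):

```python
# Assumes `engine` from the initialization sketch above.
while True:
    try:
        text = input("Enter text (or 'quit' to exit): ").strip()
        if text.lower() == "quit":
            break
        if text:
            engine.say(text)
            engine.runAndWait()  # blocks until playback completes
    except KeyboardInterrupt:
        break
print("Exiting.")
```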
Usage
Initialize the TTS engine
Enter text when prompted
System converts text to speech in real-time
Type 'quit' to exit
Supports keyboard interrupt (Ctrl+C)
Alternative Implementations (For Reference)
Coqui TTS Implementation
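The original snippet isn't reproduced here; a minimal equivalent using the Coqui TTS Python API would look like this (the output path is an assumption):

```python
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/jenny/jenny")  # downloads on first use
tts.tts_to_file(text="Hello from the Jenny model.", file_path="jenny_out.wav")
```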
Tacotron Implementation
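Likewise, a minimal sketch for the Tacotron2-DDC model:

```python
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello from Tacotron2-DDC.", file_path="tacotron_out.wav")
```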
Performance Considerations
pyttsx3 provides immediate response with minimal latency
No internet connection required
Lower resource usage compared to neural network-based solutions
Suitable for continuous operation
3. Integrated Speech System Documentation
System Overview
The system integrates speech-to-text (STT) and text-to-speech (TTS) capabilities with an API service, creating a complete voice interaction system. Key features include loopback prevention and thread-based conversation management.
Core Components
1. Audio Processing
Uses Vosk for speech recognition (model: vosk-model-en-us-0.22)
Implements pyttsx3 for text-to-speech
Manages audio through sounddevice with 16kHz sampling rate
2. API Integration
Implements REST API communication
Supports conversation threading
Includes timeout handling (10 seconds)
Response cleaning functionality
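A sketch of the API call with the documented 10-second timeout (the URL, payload shape, and function name are assumptions):

```python
import requests

def query_api(text, thread_id, url="http://localhost:8000/chat"):
    """Send recognized speech to the conversation API."""
    try:
        response = requests.post(
            url,
            json={"message": text, "thread_id": thread_id},
            timeout=10,  # the documented 10-second timeout
        )
        response.raise_for_status()
        return response.json()
    except requests.RequestException as exc:
        return {"error": str(exc)}  # surfaced to the user via TTS
```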
3. Loopback Prevention System
The system implements multiple mechanisms to prevent audio loopback (see the sketch after this list):
Global Processing Flag
Tracks when system is outputting speech
Prevents audio capture during TTS playback
Audio Callback Control
Only processes input when not outputting speech
Uses global flag to control audio capture
Silence Detection
Implements 2-second silence threshold
Prevents rapid-fire speech processing
Queue Management
Clears audio queue before processing new input
Prevents backlog of audio data
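A condensed sketch of how the four mechanisms can fit together (the processing_output flag name comes from the Troubleshooting section below; other names are illustrative):

```python
import queue
import time

audio_queue = queue.Queue()
processing_output = False   # global flag: True while the system is speaking
SILENCE_THRESHOLD = 2.0     # seconds; see Performance Considerations
last_utterance_at = 0.0

def audio_callback(indata, frames, time_info, status):
    # Audio Callback Control: only capture input when not outputting speech
    if not processing_output:
        audio_queue.put(bytes(indata))

def speak(engine, text):
    # Global Processing Flag: block capture during TTS playback
    global processing_output
    processing_output = True
    engine.say(text)
    engine.runAndWait()
    # Queue Management: clear anything captured around playback
    while not audio_queue.empty():
        audio_queue.get_nowait()
    processing_output = False

def silence_elapsed():
    # Silence Detection: require a 2-second gap before the next utterance
    return time.time() - last_utterance_at >= SILENCE_THRESHOLD
```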
Error Handling
API Communication
Timeout handling for API requests
Response validation
Error message feedback through TTS
Audio Processing
Exception handling in main loop
Graceful shutdown on interruption
Recovery from processing errors
Thread Management
Unique thread IDs for conversation tracking
Format: 'user01_XX' where XX is the session number
Maintains conversation context across interactions
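For example, a session counter can be formatted into the documented pattern:

```python
session_number = 7
thread_id = f"user01_{session_number:02d}"  # -> 'user01_07'
```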
Response Processing
Clean Response Function
Removes formatting characters
Extracts relevant message content
Maintains original response if no cleaning needed
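A sketch of what such a function might look like (the actual cleaning rules depend on the API's response format, which isn't shown here):

```python
def clean_response(raw_text):
    """Remove formatting characters and extract the message content."""
    cleaned = raw_text.strip()
    for char in ("*", "_", "`", "#"):
        cleaned = cleaned.replace(char, "")
    # Maintain the original response if cleaning stripped everything
    return cleaned if cleaned else raw_text
```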
Usage Flow
System initialization
Load speech recognition model
Initialize TTS engine
Configure audio settings
Continuous operation loop
Listen for speech input
Convert speech to text
Send to API
Process response
Convert response to speech
Reset for next interaction
Technical Requirements
Python 3.x
vosk
pyttsx3
sounddevice
requests
Performance Considerations
Audio processing runs at 16kHz with 16-bit depth
8000 sample blocksize for audio processing
2-second silence threshold for speech segmentation
150 WPM speech rate for TTS
Future Improvements
Dynamic silence threshold adjustment
Multiple language support
Enhanced error recovery
Voice activity detection
Configurable audio parameters
Troubleshooting
Audio Loopback Issues
Verify speakers aren't feeding into microphone
Check processing_output flag status
Confirm silence threshold appropriateness
API Communication
Check network connectivity
Verify thread_id format
Monitor API response times
Validate API endpoint status
4. Data Preparation and Embedding Creation
Overview
The first step involves extracting data from a SQL database and creating embeddings using FAISS. This process creates a searchable vector store for efficient similarity searches.
Components Used
LangChain HuggingFace Embeddings
FAISS Vector Store
SQLite Database
all-MiniLM-L6-v2 embedding model
Implementation Details
1. Database Connection and Data Retrieval
Connects to SQLite database
Retrieves specific fields: pid, mneumonic, description
Returns data as tuples
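A sketch of the retrieval step (the database path and table name are assumptions; the field names come from the text above):

```python
import sqlite3

def fetch_programs(db_path="programs.db"):
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute("SELECT pid, mneumonic, description FROM programs")
        return cursor.fetchall()  # list of (pid, mneumonic, description) tuples
    finally:
        conn.close()
```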
2. Document Creation
Key features:
Combines mneumonic and description for context
Preserves metadata (pid and mneumonic)
Creates LangChain Document objects
Handles cases where description might be missing
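A sketch of the document-building step (import paths are for recent LangChain releases; older versions expose Document under langchain.docstore.document):

```python
from langchain_core.documents import Document

def build_documents(rows):
    docs = []
    for pid, mneumonic, description in rows:
        # Combine mneumonic and description; tolerate a missing description
        content = f"{mneumonic}: {description}" if description else str(mneumonic)
        docs.append(Document(
            page_content=content,
            metadata={"pid": pid, "mneumonic": mneumonic},
        ))
    return docs
```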
3. Embedding Creation and Storage
Important aspects:
Uses all-MiniLM-L6-v2 for embedding generation
Creates FAISS vector store
Saves index locally for future use
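A sketch of embedding generation and local persistence (the index path is an assumption; older LangChain versions expose HuggingFaceEmbeddings under langchain_community instead):

```python
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Reuses fetch_programs() and build_documents() from the sketches above.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vector_store = FAISS.from_documents(build_documents(fetch_programs()), embeddings)
vector_store.save_local("faiss_index")  # reloadable for later searches
```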
Data Flow
SQL Data → Python Objects
Python Objects → LangChain Documents
Documents → Vector Embeddings
Embeddings → FAISS Index
Technical Considerations
1. Data Structure
Content structure:
"{mneumonic}: {description}"
Metadata structure:
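{"pid": pid, "mneumonic": mneumonic}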
Error Handling
Database Errors
Embedding Creation Errors
5. Integrated AI Agent System
Architecture Overview
The CombinedProgramAgent creates an AI system that integrates vector search (FAISS), structured database queries (SQL), and language model reasoning through the ReAct architecture.
Core Components
1. Agent Initialization
This initialization sets up three primary components:
Language Model (LLM) configuration
Tool initialization (FAISS and SQL)
ReAct agent setup with system prompt
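A skeletal sketch of how such an initializer might be organized (the class internals are assumptions; only the three components come from the list above, and each stub is filled in by a later sketch):

```python
class CombinedProgramAgent:
    """Skeleton showing the three initialization steps described above."""

    def __init__(self, db_uri: str, index_path: str):
        self.llm = self._init_llm()                        # LLM configuration
        self.tools = self._init_tools(db_uri, index_path)  # FAISS + SQL tools
        self.agent = self._init_agent()                    # ReAct agent + prompt

    def _init_llm(self):
        ...  # see the LLM configuration sketch below

    def _init_tools(self, db_uri, index_path):
        ...  # see the tool integration sketches below

    def _init_agent(self):
        ...  # see the ReAct agent and memory sketches below
```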
2. LLM Configuration
The LLM configuration:
Uses Ollama for local model deployment
Sets temperature to 0 for consistent, deterministic responses
Enables multi-threading for improved performance
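A sketch of such a configuration (the model name and thread count are assumptions; num_thread maps to Ollama's threading option):

```python
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3",   # any locally pulled Ollama model
    temperature=0,    # deterministic, consistent responses
    num_thread=4,     # multi-threaded inference
)
```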
3. Tool Integration
SQL Database Toolkit
The SQLDatabaseToolkit provides:
Query generation from natural language
Direct SQL execution
Result summarization
Schema inspection capabilities
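A sketch of the toolkit setup (the database URI is an assumption):

```python
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import SQLDatabaseToolkit

db = SQLDatabase.from_uri("sqlite:///programs.db")
toolkit = SQLDatabaseToolkit(db=db, llm=llm)  # llm from the sketch above
sql_tools = toolkit.get_tools()  # query, checker, schema, list-tables tools
```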
FAISS Vector Search
The FAISS integration enables:
Semantic similarity search
Efficient retrieval of relevant program information
Configurable number of similar results (k=3)
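A sketch of exposing the saved FAISS index as an agent tool (the tool name and description are illustrative; allow_dangerous_deserialization is required by recent LangChain versions when loading a local pickle-backed index):

```python
from langchain.tools.retriever import create_retriever_tool
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vector_store = FAISS.load_local(
    "faiss_index", embeddings, allow_dangerous_deserialization=True
)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})  # top 3 matches
faiss_tool = create_retriever_tool(
    retriever,
    "program_search",
    "Semantic search over social benefit program descriptions.",
)
```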
ReAct Agent Architecture
Understanding ReAct
ReAct (Reasoning and Acting) is an agent architecture that combines:
Reasoning: Thinking about what to do next
Action: Executing tools based on reasoning
Observation: Processing tool outputs
Reflection: Using results to plan next steps
System Prompt Design
The system prompt structures the agent's behavior by:
Defining clear steps for processing queries
Establishing tool usage priorities
Setting response formatting guidelines
Implementing error checking protocols
Memory Management
The MemorySaver enables:
Conversation state tracking
Thread-based memory management
Consistent context maintenance
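A minimal sketch of wiring memory into the agent with LangGraph's prebuilt ReAct constructor (the keyword for attaching the system prompt varies across langgraph versions, so it is omitted here):

```python
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent

memory = MemorySaver()  # in-memory checkpointer keyed by thread_id
agent = create_react_agent(
    llm,                        # from the LLM configuration sketch
    sql_tools + [faiss_tool],   # from the tool integration sketches
    checkpointer=memory,
)
```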
Query Processing Flow
Query Reception
Receives user query and thread ID
Prepares configuration for processing
Tool Selection
Agent decides between FAISS and SQL tools
FAISS for semantic search
SQL for specific criteria verification
Response Generation
Combines tool outputs
Formats according to system prompt
Returns structured response
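A sketch of a single pass through this flow (the query text is illustrative):

```python
# Assumes `agent` from the memory management sketch above.
config = {"configurable": {"thread_id": "user01_01"}}
result = agent.invoke(
    {"messages": [("user", "Am I eligible for any additional programs?")]},
    config,
)
print(result["messages"][-1].content)
```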
Understanding SQLDatabaseToolkit
The SQLDatabaseToolkit provides several tools:
Query Generator
Converts natural language to SQL
Handles complex query construction
Manages table relationships
SQL Executor
Runs generated queries
Handles error cases
Returns formatted results
Schema Inspector
Analyzes database structure
Provides table information
Helps in query construction
Common Challenges and Solutions
1. Library Dependency Conflicts
Solution approaches:
Use virtual environments
Pin specific package versions (requirements.txt)
Document working configurations