How to Build a Voice Assistant with Whisper and Llama 3.3
Practical tutorial: Build a voice assistant with Whisper + Llama 3.3
Building a production-grade voice assistant that transcribes speech with OpenAI [8] Whisper and generates intelligent responses with Meta's Llama 3.3 is a complex but rewarding engineering challenge. In this tutorial, you'll construct a complete, locally-runnable voice assistant pipeline that handles real-time audio capture, transcription, natural language understanding, and text-to-speech output. We'll focus on latency optimization, memory management, and error handling—critical concerns for any deployment scenario.
Real-World Use Case and Architecture
Voice assistants are no longer novelty toys. Enterprises deploy them for customer service automation, accessibility tools, hands-free documentation in medical and industrial settings, and smart home interfaces. The combination of Whisper's robust multilingual transcription (supporting 99 languages as of the latest release) and Llama 3.3's instruction-following capabilities creates a system that understands diverse accents, background noise, and complex queries without cloud dependency.
Our architecture follows a four-stage pipeline:
- Audio Capture: Record microphone input with configurable silence detection
- Transcription: Process audio through Whisper (base or small model for low latency)
- Inference: Feed transcribed text to Llama 3.3 with a system prompt for assistant behavior
- Synthesis: Convert Llama's response to speech using a local TTS engine
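Each stage is an independent component with a narrow interface (audio in, text out, and so on). The loop built in this tutorial calls the stages one after another; running each stage as its own worker and connecting them through queues is the natural next step for minimizing blocking. Either way, this design allows you to swap models (e.g., replace Whisper with a smaller distilled variant) or add preprocessing steps without rewriting the entire system.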
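A queue-connected version of the pipeline might be wired up roughly as follows. This is only a sketch of the plumbing: `fake_transcribe` and `fake_respond` are hypothetical stand-ins for the Whisper and Llama stages built later, and the `STOP` sentinel is an assumed shutdown convention, not something the rest of the tutorial defines.

```python
import queue
import threading

STOP = object()  # sentinel that tells a worker to shut down


def stage_worker(work_fn, inbox, outbox):
    """Pull items from inbox, apply work_fn, and push results to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            break
        outbox.put(work_fn(item))


# Hypothetical stand-ins for the real transcription and inference stages.
def fake_transcribe(audio):
    return f"text({audio})"


def fake_respond(text):
    return f"reply({text})"


def run_pipeline(audio_items):
    """Push audio items through two queue-connected stages and collect replies."""
    audio_q, text_q, reply_q = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=stage_worker, args=(fake_transcribe, audio_q, text_q)),
        threading.Thread(target=stage_worker, args=(fake_respond, text_q, reply_q)),
    ]
    for worker in workers:
        worker.start()
    for item in audio_items:
        audio_q.put(item)
    audio_q.put(STOP)

    replies = []
    while True:
        reply = reply_q.get()
        if reply is STOP:
            break
        replies.append(reply)
    for worker in workers:
        worker.join()
    return replies
```

With one worker per stage, items leave the pipeline in the order they entered, and a slow stage delays only the items behind it rather than blocking capture.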
Prerequisites and Environment Setup
Before writing code, ensure your environment meets these requirements:
- Hardware: NVIDIA GPU with at least 8GB VRAM (tested on RTX 3080 and A10G). CPU-only inference is possible but expect roughly 3-5x higher latency.
- Python: 3.10 or later (3.11 recommended for performance improvements)
- System Libraries: PortAudio (for sounddevice and PyAudio), FFmpeg (for audio codec support)
Install system dependencies first:
# Ubuntu/Debian
sudo apt-get update && sudo apt-get install -y portaudio19-dev ffmpeg
# macOS (Homebrew)
brew install portaudio ffmpeg
Create a virtual environment and install the required packages:
python3.11 -m venv voice_assistant_env
source voice_assistant_env/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install openai-whisper transformers accelerate bitsandbytes sounddevice soundfile numpy scipy pyaudio pyttsx3
Package Rationale:
- openai-whisper: Official Whisper implementation for transcription
- transformers + accelerate: Load and run Llama 3.3 with optimized inference
- bitsandbytes: 4-bit quantization to fit Llama 3.3 8B in ~6GB VRAM
- sounddevice + soundfile: Low-latency audio capture and playback
- pyttsx3: Offline text-to-speech (uses eSpeak or SAPI5 on Windows)
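Before moving on, it can help to confirm that the packages are importable in the active virtual environment. The check below is a small sketch using only the standard library; the pip-name to import-name mapping is an assumption based on the packages listed above (for example, `openai-whisper` is imported as `whisper`).

```python
import importlib.util
import sys

# Assumed mapping from pip package name to import name.
REQUIRED_MODULES = {
    "torch": "torch",
    "openai-whisper": "whisper",
    "transformers": "transformers",
    "accelerate": "accelerate",
    "bitsandbytes": "bitsandbytes",
    "sounddevice": "sounddevice",
    "soundfile": "soundfile",
    "numpy": "numpy",
    "scipy": "scipy",
    "pyttsx3": "pyttsx3",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


def python_ok(minimum=(3, 10)):
    """True when the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Missing packages:", missing_packages() or "none")
```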
Core Implementation: Building the Voice Assistant Pipeline
Step 1: Audio Capture with Silence Detection
The audio capture module must detect when the user stops speaking to trigger transcription. We'll implement a simple energy-based Voice Activity Detection (VAD) with a configurable threshold.
import queue
import threading
import time

import numpy as np
import sounddevice as sd


class AudioCapture:
    def __init__(self, sample_rate=16000, chunk_duration=0.5, silence_threshold=0.01, silence_duration=1.5):
        """
        Initialize audio capture with VAD parameters.
        Args:
            sample_rate: Whisper expects 16kHz audio
            chunk_duration: Seconds per audio chunk for processing
            silence_threshold: RMS energy below this is considered silence
            silence_duration: Seconds of continuous silence before stopping
        """
        self.sample_rate = sample_rate
        self.chunk_size = int(sample_rate * chunk_duration)
        self.silence_threshold = silence_threshold
        self.silence_duration = silence_duration
        self.audio_queue = queue.Queue()
        self.is_recording = False
        self.audio_buffer = []

    def _audio_callback(self, indata, frames, time_info, status):
        """Callback for sounddevice InputStream. Runs in a separate thread."""
        if status:
            print(f"Audio callback status: {status}")
        self.audio_queue.put(indata.copy())

    def record_until_silence(self):
        """
        Record audio until silence_duration seconds of silence detected.
        Returns: numpy array of recorded audio (mono, float32)
        """
        self.is_recording = True
        self.audio_buffer = []
        # Drop chunks left over from a previous recording
        while not self.audio_queue.empty():
            self.audio_queue.get_nowait()
        silence_chunks = 0
        required_silence_chunks = int(self.silence_duration / (self.chunk_size / self.sample_rate))

        def recording_thread():
            with sd.InputStream(samplerate=self.sample_rate,
                                channels=1,
                                blocksize=self.chunk_size,
                                callback=self._audio_callback):
                while self.is_recording:
                    time.sleep(0.1)  # Keep the stream open while the callback fills the queue

        thread = threading.Thread(target=recording_thread, daemon=True)
        thread.start()
        print("Recording... (speak now)")
        while True:
            try:
                chunk = self.audio_queue.get(timeout=1.0)
                self.audio_buffer.append(chunk)
                # Calculate RMS energy for VAD
                rms = np.sqrt(np.mean(chunk**2))
                if rms < self.silence_threshold:
                    silence_chunks += 1
                else:
                    silence_chunks = 0
                if silence_chunks >= required_silence_chunks and len(self.audio_buffer) > 10:
                    # Ensure we have at least some audio before stopping
                    self.is_recording = False
                    break
            except queue.Empty:
                # Timeout with audio already buffered: stop recording
                if len(self.audio_buffer) > 0:
                    self.is_recording = False
                    break
        thread.join()  # Wait for the input stream to close
        # Concatenate all chunks into single array
        full_audio = np.concatenate(self.audio_buffer, axis=0).flatten()
        return full_audio
Edge Case Handling:
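- Minimum length: The stop condition only fires once more than 10 chunks (about 5 seconds at 0.5 s per chunk) are buffered, so a brief pause early on does not end the recording. The trade-off is that every recording lasts at least about 5 seconds
- Queue timeout: If no audio arrives for 1 second and something has already been recorded, we assume the stream has stalled or the microphone was disconnected and stop gracefully. With an empty buffer, the loop keeps waiting
- Callback status: We log any status messages from sounddevice (e.g., input overflow) for debugging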
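The stop condition is easier to reason about when separated from the audio hardware. The sketch below replays the same decision logic over a list of precomputed chunk energies; `stop_index` and its parameter names are hypothetical helpers for illustration, with defaults chosen to match the `AudioCapture` defaults (1.5 s of silence at 0.5 s per chunk gives 3 chunks).

```python
import math


def rms(samples):
    """Root-mean-square energy of a sequence of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def stop_index(chunk_energies, silence_threshold=0.01,
               required_silence_chunks=3, min_chunks=10):
    """
    Return the index of the chunk at which recording would stop, or None.

    A chunk at or above the threshold resets the silence counter, and the
    stop condition also requires more than min_chunks chunks to be buffered.
    """
    silence_chunks = 0
    for i, energy in enumerate(chunk_energies):
        if energy < silence_threshold:
            silence_chunks += 1
        else:
            silence_chunks = 0
        buffered = i + 1
        if silence_chunks >= required_silence_chunks and buffered > min_chunks:
            return i
    return None


# Nine loud chunks followed by silence: recording stops on the third quiet chunk.
print(stop_index([0.2] * 9 + [0.001] * 3))
# A short utterance: silence is reached early, but the minimum length is not.
print(stop_index([0.2] * 2 + [0.001] * 5))
```

Replaying recorded energy traces through a function like this is a cheap way to tune `silence_threshold` for a particular microphone before touching the live capture loop.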
Step 2: Whisper Transcription with Model Caching
Whisper models are large (base: 142MB, small: 461MB). We'll implement lazy loading and reuse the model across calls to avoid repeated memory allocation.
import numpy as np
import torch
import whisper


class WhisperTranscriber:
    def __init__(self, model_name="base", device=None):
        """
        Initialize Whisper transcriber.
        Args:
            model_name: "tiny", "base", "small", "medium", "large"
            device: "cuda" or "cpu". Auto-detected if None.
        """
        self.model_name = model_name
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        # Half precision only applies on CUDA devices (also matches "cuda:0")
        self.use_fp16 = str(self.device).startswith("cuda")
        self.model = None

    def load_model(self):
        """Lazy-load the Whisper model. Caches after first call."""
        if self.model is None:
            print(f"Loading Whisper {self.model_name} model on {self.device}...")
            self.model = whisper.load_model(self.model_name, device=self.device)
            # Warm up with one second of silence so first-call overhead is paid at load time
            dummy_audio = np.zeros(16000, dtype=np.float32)
            self.model.transcribe(dummy_audio, fp16=self.use_fp16)
            print("Whisper model loaded and warmed up.")

    def transcribe(self, audio_array):
        """
        Transcribe audio array to text.
        Args:
            audio_array: numpy array of audio samples (mono, 16kHz)
        Returns:
            dict with 'text', 'segments', 'language' keys
        """
        self.load_model()
        # Normalize audio to [-1, 1] range if needed
        if audio_array.dtype == np.int16:
            audio_array = audio_array.astype(np.float32) / 32768.0
        elif audio_array.dtype == np.int32:
            audio_array = audio_array.astype(np.float32) / 2147483648.0
        # Ensure mono
        if len(audio_array.shape) > 1:
            audio_array = audio_array.mean(axis=1)
        # Whisper expects float32 numpy array
        audio_array = audio_array.astype(np.float32)
        result = self.model.transcribe(
            audio_array,
            fp16=self.use_fp16,
            language=None,  # Auto-detect language
            task="transcribe",
            verbose=False
        )
        return result
Performance Considerations:
- FP16 inference: On CUDA devices, we enable half-precision, which is typically close to twice as fast as FP32 with minimal accuracy loss
- Warm-up call: The first inference after model load pays one-time CUDA initialization and kernel-loading costs, which can take 5-10 seconds. We run a dummy inference during loading to shift this overhead to initialization time
- Audio normalization: Whisper expects float32 values in [-1, 1]. We handle the int16 and int32 formats commonly produced by audio libraries
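The dictionary returned by `transcribe()` carries per-segment metadata alongside the text, which can be used to discard segments that are probably noise. The sketch below assumes the `no_speech_prob` and `avg_logprob` segment fields produced by openai-whisper; the thresholds are illustrative starting points, and `sample_result` is made-up data shaped like a real result.

```python
def confident_text(result, max_no_speech_prob=0.6, min_avg_logprob=-1.0):
    """Join the text of segments that look like real speech."""
    kept = []
    for segment in result.get("segments", []):
        if segment.get("no_speech_prob", 0.0) > max_no_speech_prob:
            continue  # the model itself suspects this segment is not speech
        if segment.get("avg_logprob", 0.0) < min_avg_logprob:
            continue  # low average token probability: likely a hallucination
        kept.append(segment["text"].strip())
    return " ".join(kept)


# Made-up example shaped like a Whisper result.
sample_result = {
    "text": " Turn on the lights. Thank you. mm",
    "language": "en",
    "segments": [
        {"text": " Turn on the lights.", "no_speech_prob": 0.02, "avg_logprob": -0.21},
        {"text": " Thank you.", "no_speech_prob": 0.91, "avg_logprob": -0.35},
        {"text": " mm", "no_speech_prob": 0.10, "avg_logprob": -1.70},
    ],
}

print(confident_text(sample_result))
```

Filtering this way trades a little recall for fewer phantom phrases being sent to the language model when the microphone picks up only background noise.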
Step 3: Llama 3.3 Inference with 4-bit Quantization
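An 8B-parameter Llama model requires ~16GB in FP16 (8 billion parameters at 2 bytes each). We'll use 4-bit quantization via bitsandbytes to fit in 6GB VRAM, making it accessible on consumer GPUs. Before running, verify that the model ID used below exists on the Hugging Face Hub and that your account has been granted access; meta-llama/Llama-3.1-8B-Instruct is an 8B instruct checkpoint that loads with the same code.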
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


class LlamaAssistant:
    # Verify the model ID on the Hugging Face Hub before running; if it is not
    # available to you, meta-llama/Llama-3.1-8B-Instruct is an 8B alternative.
    def __init__(self, model_id="meta-llama/Llama-3.3-8B-Instruct", device_map="auto"):
        """
        Initialize the Llama model with 4-bit quantization.
        Args:
            model_id: HuggingFace model identifier
            device_map: "auto" for multi-GPU, "cuda:0" for single GPU
        """
        self.model_id = model_id
        self.device_map = device_map
        self.tokenizer = None
        self.model = None
        self.system_prompt = (
            "You are a helpful voice assistant. Respond concisely and naturally, "
            "as if speaking aloud. Keep responses under 100 words unless the user "
            "asks for detailed information. Do not use markdown formatting or "
            "bullet points. Speak in complete sentences."
        )

    def load_model(self):
        """Load quantized model and tokenizer."""
        if self.model is None:
            print(f"Loading {self.model_id} with 4-bit quantization...")
            quantization_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.float16,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_quant_type="nf4"
            )
            self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)
            self.tokenizer.pad_token = self.tokenizer.eos_token
            # Llama is supported natively by transformers, so trust_remote_code is not needed
            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_id,
                quantization_config=quantization_config,
                device_map=self.device_map,
                torch_dtype=torch.float16
            )
            print("Llama model loaded successfully.")

    def generate_response(self, user_input, max_new_tokens=256, temperature=0.7):
        """
        Generate a response to user input.
        Args:
            user_input: Transcribed text from Whisper
            max_new_tokens: Maximum response length
            temperature: Creativity (0.0 = deterministic, 1.0 = creative)
        Returns:
            Generated response string
        """
        self.load_model()
        # Format with chat template
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": user_input}
        ]
        prompt = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        # The chat template already inserts the BOS token, so do not add it again
        inputs = self.tokenizer(
            prompt, return_tensors="pt", add_special_tokens=False
        ).to(self.model.device)
        generation_kwargs = dict(
            max_new_tokens=max_new_tokens,
            do_sample=(temperature > 0),
            pad_token_id=self.tokenizer.eos_token_id,
            repetition_penalty=1.1
        )
        if temperature > 0:
            # Only pass temperature when sampling; it is ignored for greedy decoding
            generation_kwargs["temperature"] = temperature
        with torch.no_grad():
            outputs = self.model.generate(**inputs, **generation_kwargs)
        response = self.tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
        return response.strip()
Critical Implementation Details:
- Double quantization: bnb_4bit_use_double_quant=True applies a second quantization to the scaling factors, saving ~0.4 bits per parameter with negligible accuracy loss
- nf4 quantization type: The NormalFloat4 data type is optimized for normally distributed weights, which is a good fit for Llama's weight distribution
- Repetition penalty: Set to 1.1 to prevent the model from repeating phrases, a common issue in voice responses
- Temperature handling: When temperature is 0, we disable sampling entirely for deterministic outputs
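The memory figures quoted in this step follow from simple arithmetic on parameter count and bits per parameter. The helper below is a back-of-the-envelope sketch that counts weights only; the gap between the roughly 4GB it reports for 4-bit weights and the ~6GB quoted above is presumably the KV cache, activations, and runtime overhead, which it does not model. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # an 8B-parameter model

fp16_gb = weight_memory_gb(N_PARAMS, 16)                # 16 bits per weight
nf4_gb = weight_memory_gb(N_PARAMS, 4)                  # 4 bits per weight
double_quant_saving_gb = weight_memory_gb(N_PARAMS, 0.4)  # ~0.4 bits saved per weight

print(f"FP16 weights:         {fp16_gb:.1f} GB")
print(f"4-bit weights:        {nf4_gb:.1f} GB")
print(f"Double-quant saving: ~{double_quant_saving_gb:.1f} GB")
```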
Step 4: Text-to-Speech with pyttsx3
For offline TTS, pyttsx3 provides a simple interface. We'll add rate control and keep the default voice when the requested voice is not installed.
import threading

import pyttsx3


class TextToSpeech:
    def __init__(self, rate=180, voice_id=None):
        """
        Initialize TTS engine.
        Args:
            rate: Words per minute (typical range: 150-200)
            voice_id: Specific voice to use. None for default.
        """
        self.engine = pyttsx3.init()
        self.engine.setProperty('rate', rate)
        self._lock = threading.Lock()  # Only one utterance may drive the engine at a time
        # Set voice if specified; the default voice is kept when no match is found
        if voice_id:
            voices = self.engine.getProperty('voices')
            for voice in voices:
                if voice_id in voice.id:
                    self.engine.setProperty('voice', voice.id)
                    break

    def speak(self, text):
        """
        Speak text in a background thread to avoid blocking the caller.
        Args:
            text: Text to synthesize
        Returns:
            The worker thread; call join() on it to wait for playback to finish.
        """
        def _speak():
            with self._lock:
                self.engine.say(text)
                self.engine.runAndWait()

        thread = threading.Thread(target=_speak, daemon=True)
        thread.start()
        return thread
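Spawning one thread per utterance can become fragile if two calls to speak overlap, since a single pyttsx3 engine is not designed to be driven from several threads at once. One alternative is a single worker thread that speaks queued text in order. The sketch below shows that pattern with an injected `speak_fn`, a hypothetical callable that would wrap `engine.say()` plus `engine.runAndWait()` in a real assistant.

```python
import queue
import threading


class SpeechQueue:
    """Serialize speak requests through one worker thread."""

    def __init__(self, speak_fn):
        self._speak_fn = speak_fn  # hypothetical: wraps the real TTS call
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            text = self._queue.get()
            try:
                if text is None:  # shutdown signal
                    return
                self._speak_fn(text)
            except Exception as exc:
                # Keep the worker alive if one utterance fails
                print(f"TTS error: {exc}")
            finally:
                self._queue.task_done()

    def say(self, text):
        """Queue text for speaking and return immediately."""
        self._queue.put(text)

    def wait(self):
        """Block until everything queued so far has been spoken."""
        self._queue.join()

    def close(self):
        """Stop the worker after it finishes the queued utterances."""
        self._queue.put(None)
        self._worker.join()
```

A caller would use `say()` for fire-and-forget speech and `wait()` before reopening the microphone, so the assistant does not transcribe its own voice.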
Step 5: Orchestrating the Pipeline
Now we combine all components into a cohesive assistant loop.
import time

import torch


class VoiceAssistant:
    def __init__(self):
        self.capture = AudioCapture()
        self.transcriber = WhisperTranscriber(model_name="base")
        self.assistant = LlamaAssistant()
        self.tts = TextToSpeech()

    def run(self):
        """Main assistant loop."""
        print("Voice Assistant initialized. Press Ctrl+C to exit.")
        print("Say 'exit' or 'quit' to stop the program.")
        try:
            while True:
                # Step 1: Capture audio
                print("\nListening...")
                audio = self.capture.record_until_silence()
                if len(audio) < 1600:  # Less than 0.1 seconds
                    print("No speech detected. Listening again...")
                    continue
                # Step 2: Transcribe
                print("Transcribing...")
                start_time = time.time()
                result = self.transcriber.transcribe(audio)
                transcription_time = time.time() - start_time
                user_text = result['text'].strip()
                print(f"You said: {user_text} (transcribed in {transcription_time:.2f}s)")
                # Check for exit command; Whisper often adds punctuation (e.g. "Exit.")
                if user_text.lower().strip(" .!?,") in ['exit', 'quit', 'stop']:
                    print("Goodbye!")
                    self.tts.speak("Goodbye!").join()  # Finish speaking before exiting
                    break
                if not user_text:
                    print("Could not understand audio. Please try again.")
                    continue
                # Step 3: Generate response
                print("Thinking...")
                start_time = time.time()
                response = self.assistant.generate_response(user_text)
                inference_time = time.time() - start_time
                print(f"Assistant: {response} (generated in {inference_time:.2f}s)")
                # Step 4: Speak response and wait for playback to finish,
                # so the next recording does not capture the assistant's own voice
                self.tts.speak(response).join()
        except KeyboardInterrupt:
            print("\nShutting down gracefully...")
        finally:
            # Cleanup: drop model references, then release cached GPU memory
            self.transcriber.model = None
            self.assistant.model = None
            torch.cuda.empty_cache()


if __name__ == "__main__":
    assistant = VoiceAssistant()
    assistant.run()
Edge Cases and Production Considerations
Memory Management
- GPU Memory Leaks: Whisper and Llama models allocate CUDA memory that may not be freed on del. Always call torch.cuda.empty_cache() after model deletion
- CPU Memory: Audio buffers can grow large with long recordings. Implement a maximum recording duration (e.g., 30 seconds) to prevent OOM
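A recording cap can be expressed as a maximum number of chunks. The sketch below works through the arithmetic for the capture defaults used earlier (16 kHz, 0.5 s chunks of float32 samples); `max_chunks`, `buffer_bytes`, and `append_chunk` are hypothetical helper names, not part of the `AudioCapture` class.

```python
import math


def max_chunks(max_seconds=30.0, sample_rate=16000, chunk_size=8000):
    """Number of whole chunks that fit in max_seconds of audio."""
    return math.floor(max_seconds * sample_rate / chunk_size)


def buffer_bytes(n_chunks, chunk_size=8000, bytes_per_sample=4):
    """Memory held by n_chunks of mono float32 audio."""
    return n_chunks * chunk_size * bytes_per_sample


def append_chunk(buffer, chunk, limit):
    """Append chunk unless the buffer is full; False means stop recording."""
    if len(buffer) >= limit:
        return False
    buffer.append(chunk)
    return True


limit = max_chunks()  # chunks allowed in a 30-second recording
print(limit, "chunks,", buffer_bytes(limit) / 1e6, "MB")
```

In the capture loop, a `False` return from a check like `append_chunk` would be treated the same way as the silence condition: stop recording and hand the buffer to transcription.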
Latency Optimization
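- Model Warm-up: Both Whisper and Llama have significant first-inference latency due to one-time CUDA initialization. Pre-warm both models during initialization
- Batch Processing: If handling multiple users, batch transcription requests where you can. Note that transcribe() in openai-whisper processes one clip per call, so batching requires the lower-level decoding API or an alternative Whisper implementation with batched inference
- Streaming Transcription: transcribe() with verbose=True prints each segment to the console as it is decoded, but it still returns only after the whole clip is processed. For real-time partial results, transcribe short windows of the incoming audio as they arrive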
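Deciding which of these optimizations matters first requires per-stage measurements. The timer below is a small standard-library sketch for collecting them; `StageTimer` is a hypothetical helper, and the stage names are whatever labels you pass in.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collect wall-clock durations per named pipeline stage."""

    def __init__(self):
        self.timings = {}

    @contextmanager
    def measure(self, stage):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised an exception
            elapsed = time.perf_counter() - start
            self.timings.setdefault(stage, []).append(elapsed)

    def summary(self):
        """Mean duration in seconds for each stage."""
        return {stage: sum(values) / len(values)
                for stage, values in self.timings.items()}


timer = StageTimer()
with timer.measure("transcribe"):
    time.sleep(0.01)  # stand-in for a real transcription call
print(timer.summary())
```

Wrapping the transcription, generation, and speech calls in `measure()` over a handful of turns shows whether warm-up, model size, or TTS dominates end-to-end latency.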
Error Handling
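- Microphone Disconnection: Wrap audio capture in a try-except for sounddevice.PortAudioError
- Model Loading Failures: If bitsandbytes quantization fails (e.g., on a machine without a CUDA GPU), fall back to FP16 or to a smaller model. 8-bit loading also relies on bitsandbytes, so it helps only when the 4-bit path specifically is the problem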
- Empty Transcription: Whisper may return empty text for very short utterances. Implement a minimum audio duration check
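The fallback idea for model loading can be written as an ordered list of loaders that are tried in turn. The sketch below is generic: `load_with_fallback` is a hypothetical helper, and the two demo loaders are stand-ins that simulate a failing 4-bit load and a succeeding FP16 load rather than touching any real model.

```python
def load_with_fallback(loaders):
    """
    Try each (name, loader) pair in order and return (name, result) for the
    first loader that succeeds. Raise RuntimeError if every loader fails.
    """
    errors = []
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All loaders failed: " + "; ".join(errors))


# Stand-in loaders for illustration only.
def demo_load_4bit():
    raise ImportError("bitsandbytes unavailable")


def demo_load_fp16():
    return "fp16-model"


chosen, model = load_with_fallback([
    ("4-bit", demo_load_4bit),
    ("fp16", demo_load_fp16),
])
print(f"Loaded with strategy: {chosen}")
```

In the assistant, each loader would be a small function that calls `from_pretrained` with a different configuration, and the chosen strategy name is worth logging so latency numbers can be interpreted later.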
Security
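- Input Sanitization: Llama 3.3 can be prompted to generate harmful content. Implement a content filter on both the transcribed input and the generated response; generate() has no built-in safety switch, and options such as bad_words_ids only block specific token sequences
- Model Access: If using gated models from HuggingFace, authenticate with huggingface-cli login before running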
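Basic hygiene on the transcript before it reaches the prompt might look like the sketch below. It only strips control characters, collapses whitespace, and caps the length; it is not a content filter and does nothing to prevent harmful responses. `sanitize_transcript` and the 1000-character cap are assumptions for illustration.

```python
import unicodedata

MAX_PROMPT_CHARS = 1000  # assumed cap; tune for your use case


def sanitize_transcript(text, max_chars=MAX_PROMPT_CHARS):
    """Drop control characters, collapse whitespace, and cap the length."""
    # Unicode categories starting with "C" are control, format, and unassigned
    cleaned = "".join(
        ch if unicodedata.category(ch)[0] != "C" else " " for ch in text
    )
    collapsed = " ".join(cleaned.split())
    return collapsed[:max_chars]


print(sanitize_transcript("hi\x00 there\n\tfriend"))
```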
What's Next
This tutorial provides a foundation for a local voice assistant. To extend it for production:
- Add wake word detection: Integrate Porcupine or a custom ONNX model for hands-free activation
- Implement conversation memory: Use a sliding window of recent exchanges to maintain context
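- Optimize with ONNX Runtime: Convert Whisper and Llama to ONNX format, which can reduce inference latency. Benchmark on your own hardware, since the gain depends on the model, precision, and execution provider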
- Add streaming TTS: Replace pyttsx3 with Coqui TTS or Piper for more natural voice synthesis
- Deploy as a microservice: Wrap the assistant in a FastAPI endpoint for remote access
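As an example of the conversation-memory extension, a sliding window can be kept in a bounded deque and expanded into the same messages format that `generate_response` builds. This is a sketch under assumptions: `ConversationMemory` is a hypothetical class, and the window size of four exchanges is arbitrary.

```python
from collections import deque


class ConversationMemory:
    """Keep the most recent exchanges and build a chat-template message list."""

    def __init__(self, system_prompt, max_exchanges=4):
        self.system_prompt = system_prompt
        # Each item is a (user_text, assistant_text) pair; old pairs fall off the left
        self._turns = deque(maxlen=max_exchanges)

    def add_exchange(self, user_text, assistant_text):
        self._turns.append((user_text, assistant_text))

    def build_messages(self, new_user_text):
        messages = [{"role": "system", "content": self.system_prompt}]
        for user_text, assistant_text in self._turns:
            messages.append({"role": "user", "content": user_text})
            messages.append({"role": "assistant", "content": assistant_text})
        messages.append({"role": "user", "content": new_user_text})
        return messages


memory = ConversationMemory("You are a helpful voice assistant.", max_exchanges=2)
memory.add_exchange("What's the capital of France?", "Paris.")
memory.add_exchange("And of Italy?", "Rome.")
memory.add_exchange("And of Spain?", "Madrid.")
for message in memory.build_messages("Which of those is furthest north?"):
    print(message["role"], "->", message["content"])
```

Bounding the window by exchanges keeps prompt length, and therefore inference latency, roughly constant as the conversation grows.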
For further reading, explore our guides on optimizing transformer inference and building multimodal AI systems.
The complete code for this tutorial is available on GitHub. Remember to respect model licenses—Llama 3.3 requires acceptance of Meta's community license, and Whisper is MIT-licensed. With these components, you now have a fully functional, privacy-preserving voice assistant that runs entirely on your hardware.