How to Build a Voice Assistant with Whisper and Llama 3.3

Building a production-grade voice assistant that transcribes speech with OpenAI [8] Whisper and generates intelligent responses with Meta's Llama 3.3 is a complex but rewarding engineering challenge. In this tutorial, you'll construct a complete, locally-runnable voice assistant pipeline that handles real-time audio capture, transcription, natural language understanding, and text-to-speech output. We'll focus on latency optimization, memory management, and error handling—critical concerns for any deployment scenario.

Real-World Use Case and Architecture

Voice assistants are no longer novelty toys. Enterprises deploy them for customer service automation, accessibility tools, hands-free documentation in medical and industrial settings, and smart home interfaces. The combination of Whisper's robust multilingual transcription (supporting 99 languages as of the latest release) and Llama 3.3's instruction-following capabilities creates a system that understands diverse accents, background noise, and complex queries without cloud dependency.

Our architecture follows a four-stage pipeline:

Audio Capture: Record microphone input with configurable silence detection
Transcription: Process audio through Whisper (base or small model for low latency)
Inference: Feed transcribed text to Llama 3.3 with a system prompt for assistant behavior
Synthesis: Convert Llama's response to speech using a local TTS engine

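Each stage is an independent component with a narrow interface. The reference loop in Step 5 calls the stages one after another for simplicity, but because each stage only consumes and produces plain data (audio arrays, text), the stages can be moved behind queues and run concurrently to minimize blocking. This design also allows you to swap models (e.g., replace Whisper with a smaller distilled variant) or add preprocessing steps without rewriting the entire system.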
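The queue-based hand-off between stages can be sketched as follows. This is a minimal illustration using only the standard library, not the tutorial's implementation: the run_stage and build_pipeline helpers, the STOP sentinel, and the stand-in stage functions are all hypothetical names introduced here.

```python
import queue
import threading

STOP = object()  # Hypothetical sentinel that tells a stage to shut down

def run_stage(work_fn, inbox, outbox):
    """Pull items from inbox, apply work_fn, and push results to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # Propagate shutdown to the next stage
            break
        outbox.put(work_fn(item))

def build_pipeline(stage_fns):
    """Start one thread per stage; return the first queue, last queue, and threads."""
    queues = [queue.Queue() for _ in range(len(stage_fns) + 1)]
    threads = []
    for fn, inbox, outbox in zip(stage_fns, queues[:-1], queues[1:]):
        thread = threading.Thread(target=run_stage, args=(fn, inbox, outbox), daemon=True)
        thread.start()
        threads.append(thread)
    return queues[0], queues[-1], threads

# Stand-ins for transcription, inference, and synthesis
def fake_transcribe(audio):
    return f"text({audio})"

def fake_generate(text):
    return f"reply({text})"

def fake_synthesize(reply):
    return f"speech({reply})"

if __name__ == "__main__":
    first, last, _ = build_pipeline([fake_transcribe, fake_generate, fake_synthesize])
    first.put("audio-1")
    first.put(STOP)
    print(last.get())  # speech(reply(text(audio-1)))
```

Because every stage only sees its inbox and outbox, replacing a stand-in with a real model call does not affect the other stages.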

Prerequisites and Environment Setup

Before writing code, ensure your environment meets these requirements:

Hardware: NVIDIA GPU with at least 8GB VRAM (tested on RTX 3080 and A10G). CPU-only inference is possible, but expect roughly 3-5x higher latency.
Python: 3.10 or later (3.11 recommended for performance improvements)
System Libraries: PortAudio (required by sounddevice and PyAudio), FFmpeg (for audio codec support)

Install system dependencies first:

# Ubuntu/Debian
sudo apt-get update && sudo apt-get install -y portaudio19-dev ffmpeg

# macOS (Homebrew)
brew install portaudio ffmpeg

Create a virtual environment and install the required packages:

python3.11 -m venv voice_assistant_env
source voice_assistant_env/bin/activate

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install openai-whisper transformers accelerate bitsandbytes sounddevice soundfile numpy scipy pyaudio pyttsx3

Package Rationale:

openai-whisper: Official Whisper implementation for transcription
transformers + accelerate: Load and run Llama 3.3 with optimized inference
bitsandbytes: 4-bit quantization to fit an 8B-parameter Llama model in ~6GB VRAM
sounddevice + soundfile: Low-latency audio capture and playback
pyttsx3: Offline text-to-speech (uses eSpeak on Linux, SAPI5 on Windows, and the built-in speech synthesizer on macOS)
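Before moving on, it can help to confirm that the packages above are importable in the active environment. The following is a small sketch using only the standard library; the find_missing helper and the REQUIRED_MODULES list are hypothetical names introduced here, and the list uses import names, which differ from some pip package names (openai-whisper is imported as whisper).

```python
import importlib.util

# Import names, not pip package names (e.g. openai-whisper -> whisper)
REQUIRED_MODULES = [
    "torch", "whisper", "transformers", "accelerate", "bitsandbytes",
    "sounddevice", "soundfile", "numpy", "scipy", "pyttsx3",
]

def find_missing(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

This only checks that each module can be located; it does not verify versions or GPU support.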

Core Implementation: Building the Voice Assistant Pipeline

Step 1: Audio Capture with Silence Detection

The audio capture module must detect when the user stops speaking to trigger transcription. We'll implement a simple energy-based Voice Activity Detection (VAD) with a configurable threshold.

import sounddevice as sd
import numpy as np
from scipy.io.wavfile import write
import threading
import queue
import time

class AudioCapture:
    def __init__(self, sample_rate=16000, chunk_duration=0.5, silence_threshold=0.01, silence_duration=1.5):
        """
        Initialize audio capture with VAD parameters.

        Args:
            sample_rate: Whisper expects 16kHz audio
            chunk_duration: Seconds per audio chunk for processing
            silence_threshold: RMS energy below this is considered silence
            silence_duration: Seconds of continuous silence before stopping
        """
        self.sample_rate = sample_rate
        self.chunk_size = int(sample_rate * chunk_duration)
        self.silence_threshold = silence_threshold
        self.silence_duration = silence_duration
        self.audio_queue = queue.Queue()
        self.is_recording = False
        self.audio_buffer = []

    def _audio_callback(self, indata, frames, time_info, status):
        """Callback for sounddevice InputStream. Runs in a separate thread."""
        if status:
            print(f"Audio callback status: {status}")
        self.audio_queue.put(indata.copy())

    def record_until_silence(self):
        """
        Record audio until silence_duration seconds of silence detected.
        Returns: numpy array of recorded audio (mono, float32)
        """
        self.is_recording = True
        self.audio_buffer = []
        silence_chunks = 0
        required_silence_chunks = int(self.silence_duration / (self.chunk_size / self.sample_rate))

        def recording_thread():
            with sd.InputStream(samplerate=self.sample_rate,
                              channels=1,
                              blocksize=self.chunk_size,
                              callback=self._audio_callback):
                while self.is_recording:
                    time.sleep(0.1)  # Yield to callback thread

        thread = threading.Thread(target=recording_thread, daemon=True)
        thread.start()

        print("Recording.. (speak now)")

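        empty_reads = 0

        while True:
            try:
                chunk = self.audio_queue.get(timeout=1.0)
            except queue.Empty:
                empty_reads += 1
                # Stop if audio stalls mid-recording, or if nothing arrives for ~5 seconds
                if self.audio_buffer or empty_reads >= 5:
                    break
                continue

            self.audio_buffer.append(chunk)

            # Calculate RMS energy for VAD
            rms = np.sqrt(np.mean(chunk**2))

            if rms < self.silence_threshold:
                silence_chunks += 1
            else:
                silence_chunks = 0

            # Stop on sustained silence, but only after more than 10 chunks (~5 seconds)
            if silence_chunks >= required_silence_chunks and len(self.audio_buffer) > 10:
                break

        # Stop the stream, wait for it to close, then drop any late chunks
        # so they do not leak into the next recording
        self.is_recording = False
        thread.join()
        while not self.audio_queue.empty():
            self.audio_queue.get_nowait()

        # Return an empty array instead of failing when nothing was captured
        if not self.audio_buffer:
            return np.zeros(0, dtype=np.float32)

        # Concatenate all chunks into a single 1-D array
        return np.concatenate(self.audio_buffer, axis=0).flatten()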

Edge Case Handling:

Minimum length: Silence only ends a recording once more than 10 chunks (~5 seconds at the default 0.5-second chunk size) have been captured, so a brief pause at the start does not cut the recording short
Queue timeout: If no audio arrives for 1 second after recording has begun, we assume the microphone has stalled or been disconnected and stop gracefully
Callback status: We log any status messages from sounddevice (e.g., input overflow) for debugging
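The stopping rule is easier to reason about in isolation. The sketch below replays it over a list of per-chunk RMS values in pure Python; the function names are hypothetical, and the defaults mirror the tutorial's settings (0.5-second chunks, so 1.5 seconds of silence is 3 chunks, and the minimum is more than 10 chunks).

```python
import math

def rms(samples):
    """Root-mean-square energy of a non-empty sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def chunks_until_stop(chunk_energies, threshold=0.01, required_silence=3, min_chunks=10):
    """Return how many chunks are consumed before recording stops, or None if it never stops."""
    silence = 0
    for count, energy in enumerate(chunk_energies, start=1):
        # Count consecutive quiet chunks; any loud chunk resets the counter
        silence = silence + 1 if energy < threshold else 0
        if silence >= required_silence and count > min_chunks:
            return count
    return None

if __name__ == "__main__":
    # One second of speech followed by silence: stops at chunk 11 (5.5 seconds)
    print(chunks_until_stop([0.2, 0.2] + [0.001] * 20))  # 11
    # Six seconds of speech, then silence: stops after 3 quiet chunks
    print(chunks_until_stop([0.2] * 12 + [0.001] * 3))   # 15
```

The first case shows the cost of the minimum-length rule: a short utterance still holds the microphone open for about 5.5 seconds, so lowering the minimum may be worthwhile for quick commands.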

Step 2: Whisper Transcription with Model Caching

Whisper models are large (base: 142MB, small: 461MB). We'll implement lazy loading and reuse the model across calls to avoid repeated memory allocation.

import numpy as np
import whisper
import torch

class WhisperTranscriber:
    def __init__(self, model_name="base", device=None):
        """
        Initialize Whisper transcriber.

        Args:
            model_name: "tiny", "base", "small", "medium", "large"
            device: "cuda" or "cpu". Auto-detected if None.
        """
        self.model_name = model_name
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.model = None

    def load_model(self):
        """Lazy-load the Whisper model. Caches after first call."""
        if self.model is None:
            print(f"Loading Whisper {self.model_name} model on {self.device}..")
            self.model = whisper.load_model(self.model_name, device=self.device)
            # Warm up with a short silent audio to trigger CUDA kernel compilation
            dummy_audio = torch.zeros(16000, device=self.device)
            self.model.transcribe(dummy_audio.cpu().numpy(), fp16=(self.device == "cuda"))
            print("Whisper model loaded and warmed up.")

    def transcribe(self, audio_array):
        """
        Transcribe audio array to text.

        Args:
            audio_array: numpy array of audio samples (mono, 16kHz)

        Returns:
            dict with 'text', 'segments', 'language' keys
        """
        self.load_model()

        # Normalize audio to [-1, 1] range if needed
        if audio_array.dtype == np.int16:
            audio_array = audio_array.astype(np.float32) / 32768.0
        elif audio_array.dtype == np.int32:
            audio_array = audio_array.astype(np.float32) / 2147483648.0

        # Ensure mono
        if len(audio_array.shape) > 1:
            audio_array = audio_array.mean(axis=1)

        # Whisper expects float32 numpy array
        audio_array = audio_array.astype(np.float32)

        result = self.model.transcribe(
            audio_array,
            fp16=(self.device == "cuda"),
            language=None,  # Auto-detect language
            task="transcribe",
            verbose=False
        )

        return result

Performance Considerations:

FP16 inference: On CUDA devices, we enable half-precision for up to roughly 2x speedup with minimal accuracy loss
Warm-up call: The first inference after model load pays one-time CUDA initialization costs, which can take 5-10 seconds. We run a dummy inference during loading to shift this overhead to initialization time
Audio normalization: Whisper expects float32 values in [-1, 1]. We handle int16 and int32 formats commonly produced by audio libraries
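The dictionary returned by transcribe() can also be inspected segment by segment, which is useful when checking where time is spent in longer recordings. The sketch below assumes each entry in 'segments' carries 'start', 'end', and 'text' fields, as openai-whisper's output does; format_segments is a hypothetical helper, and the sample dictionary is hand-written rather than real model output.

```python
def format_segments(result):
    """Render each segment of a Whisper-style result as '[start -> end] text'."""
    lines = []
    for segment in result.get("segments", []):
        text = segment["text"].strip()
        lines.append(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {text}")
    return "\n".join(lines)

# Hand-written sample shaped like a transcription result (not real model output)
sample_result = {
    "text": " Hello there. How are you?",
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 1.2, "text": " Hello there."},
        {"start": 1.2, "end": 2.5, "text": " How are you?"},
    ],
}

if __name__ == "__main__":
    print(format_segments(sample_result))
    # [0.00s -> 1.20s] Hello there.
    # [1.20s -> 2.50s] How are you?
```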

Step 3: Llama 3.3 Inference with 4-bit Quantization

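Llama 3.3 is published only as a 70B instruct model, so the 8GB GPU target in this tutorial calls for an 8B-class Llama instruct model (for example, meta-llama/Llama-3.1-8B-Instruct). An 8B model needs ~16GB for its weights in FP16. We'll use 4-bit quantization via bitsandbytes to fit it in roughly 6GB of VRAM, making it accessible on consumer GPUs.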

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

class LlamaAssistant:
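    def __init__(self, model_id="meta-llama/Llama-3.1-8B-Instruct", device_map="auto"):
        """
        Initialize a Llama instruct model with 4-bit quantization.

        Llama 3.3 itself is published as meta-llama/Llama-3.3-70B-Instruct;
        the default here is the 8B instruct checkpoint from the Llama 3.1
        release, which fits the 8GB GPU target of this tutorial.

        Args:
            model_id: Hugging Face model identifier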
            device_map: "auto" for multi-GPU, "cuda:0" for single GPU
        """
        self.model_id = model_id
        self.device_map = device_map
        self.tokenizer = None
        self.model = None
        self.system_prompt = """You are a helpful voice assistant. Respond concisely and naturally, 
        as if speaking aloud. Keep responses under 100 words unless the user asks for detailed information. 
        Do not use markdown formatting or bullet points. Speak in complete sentences."""

    def load_model(self):
        """Load quantized model and tokenizer."""
        if self.model is None:
            print(f"Loading {self.model_id} with 4-bit quantization..")

            quantization_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.float16,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_quant_type="nf4"
            )

            self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)
            self.tokenizer.pad_token = self.tokenizer.eos_token

            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_id,
                quantization_config=quantization_config,
                device_map=self.device_map,
                torch_dtype=torch.float16
            )

            print(f"{self.model_id} loaded successfully.")

    def generate_response(self, user_input, max_new_tokens=256, temperature=0.7):
        """
        Generate a response to user input.

        Args:
            user_input: Transcribed text from Whisper
            max_new_tokens: Maximum response length
            temperature: Creativity (0.0 = deterministic, 1.0 = creative)

        Returns:
            Generated response string
        """
        self.load_model()

        # Format with chat template
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": user_input}
        ]

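        prompt = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        # The chat template already inserts the BOS token, so do not add it again
        inputs = self.tokenizer(
            prompt, return_tensors="pt", add_special_tokens=False
        ).to(self.model.device)

        generation_kwargs = dict(
            max_new_tokens=max_new_tokens,
            pad_token_id=self.tokenizer.eos_token_id,
            repetition_penalty=1.1,
        )
        if temperature > 0:
            # Pass sampling parameters only when sampling is enabled
            generation_kwargs.update(do_sample=True, temperature=temperature)
        else:
            # Greedy decoding for deterministic output
            generation_kwargs["do_sample"] = False

        with torch.no_grad():
            outputs = self.model.generate(**inputs, **generation_kwargs)

        # Decode only the newly generated tokens, not the prompt
        response = self.tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
        return response.strip()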

Critical Implementation Details:

Double quantization: bnb_4bit_use_double_quant=True applies a second quantization to the scaling factors, saving ~0.4 bits per parameter with negligible accuracy loss
nf4 quantization type: The NormalFloat4 data type is optimized for normally distributed weights, which matches Llama's distribution
Repetition penalty: Set to 1.1 to prevent the model from repeating phrases, a common issue in voice responses
Temperature handling: When temperature is 0, we disable sampling entirely for deterministic outputs
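The memory figures in this step follow from simple arithmetic, sketched below. The helper name is hypothetical. The double-quantization numbers assume the block sizes described for NF4 in the QLoRA work (one 32-bit scaling constant per 64 weights, re-quantized to 8 bits with a second-level 32-bit constant per 256 blocks); treat them as an illustration of where the ~0.4 bits comes from, not as a measurement.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

PARAMS = 8e9  # An 8B-parameter model

fp16_gb = weight_memory_gb(PARAMS, 16)
nf4_gb = weight_memory_gb(PARAMS, 4)

# Per-parameter overhead of the quantization constants (assumed block sizes)
overhead_plain = 32 / 64                    # 32-bit constant shared by 64 weights
overhead_double = 8 / 64 + 32 / (64 * 256)  # 8-bit constants plus second-level constants
bits_saved = overhead_plain - overhead_double

if __name__ == "__main__":
    print(f"FP16 weights: {fp16_gb:.1f} GB")
    print(f"4-bit weights: {nf4_gb:.1f} GB")
    print(f"Double quantization saves about {bits_saved:.3f} bits per parameter")
```

The 4-bit weights alone come to about 4GB; the rest of the ~6GB figure is presumably runtime overhead such as quantization constants, activations, the KV cache, and the CUDA context.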

Step 4: Text-to-Speech with pyttsx3

For offline TTS, pyttsx3 provides a simple interface. We'll add rate control and error handling for unsupported voices.

import pyttsx3
import threading

class TextToSpeech:
    def __init__(self, rate=180, voice_id=None):
        """
        Initialize TTS engine.

        Args:
            rate: Words per minute (typical range: 150-200)
            voice_id: Specific voice to use. None for default.
        """
        self.engine = pyttsx3.init()
        self.engine.setProperty('rate', rate)

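        # Set voice if specified; keep the default voice if no match is found
        if voice_id:
            voices = self.engine.getProperty('voices')
            for voice in voices:
                if voice_id in voice.id:
                    self.engine.setProperty('voice', voice.id)
                    break
            else:
                print(f"Voice '{voice_id}' not found. Using the default voice.")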

    def speak(self, text):
        """
        Speak text asynchronously to avoid blocking the main thread.

        Args:
            text: Text to synthesize
        """
        def _speak():
            self.engine.say(text)
            self.engine.runAndWait()

        thread = threading.Thread(target=_speak, daemon=True)
        thread.start()
        return thread
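Starting a new thread per utterance can cause trouble if a second speak() call arrives while the first is still playing, since a TTS engine is generally not meant to be driven from several threads at once. One alternative, sketched here under that assumption, is a single worker thread that speaks queued requests in order. SpeechWorker is a hypothetical class, and speak_fn stands for any blocking function that speaks one string (for pyttsx3, a function that calls engine.say followed by engine.runAndWait).

```python
import queue
import threading

class SpeechWorker:
    """Run all speech requests on one background thread, in order."""

    def __init__(self, speak_fn):
        self._speak_fn = speak_fn  # Blocking callable that speaks one string
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            text = self._requests.get()
            try:
                if text is None:  # None is the shutdown signal
                    return
                self._speak_fn(text)
            except Exception as exc:
                # Keep the worker alive if one utterance fails
                print(f"TTS error: {exc}")
            finally:
                self._requests.task_done()

    def say(self, text):
        """Queue text to be spoken; returns immediately."""
        self._requests.put(text)

    def wait(self):
        """Block until every queued utterance has been spoken."""
        self._requests.join()

    def close(self):
        """Stop the worker after it finishes the queued utterances."""
        self._requests.put(None)
        self._thread.join()

if __name__ == "__main__":
    worker = SpeechWorker(lambda text: print(f"(speaking) {text}"))
    worker.say("Hello.")
    worker.say("How can I help?")
    worker.wait()
    worker.close()
```

Calling wait() before listening again also keeps the microphone from picking up the assistant's own voice.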

Step 5: Orchestrating the Pipeline

Now we combine all components into a cohesive assistant loop.

import time
import sys
import torch

class VoiceAssistant:
    def __init__(self):
        self.capture = AudioCapture()
        self.transcriber = WhisperTranscriber(model_name="base")
        self.assistant = LlamaAssistant()
        self.tts = TextToSpeech()

    def run(self):
        """Main assistant loop."""
        print("Voice Assistant initialized. Press Ctrl+C to exit.")
        print("Say 'exit' or 'quit' to stop the program.")

        try:
            while True:
                # Step 1: Capture audio
                print("\nListening..")
                audio = self.capture.record_until_silence()

                if len(audio) < 1600:  # Less than 0.1 seconds
                    print("No speech detected. Listening again..")
                    continue

                # Step 2: Transcribe
                print("Transcribing..")
                start_time = time.time()
                result = self.transcriber.transcribe(audio)
                transcription_time = time.time() - start_time

                user_text = result['text'].strip()
                print(f"You said: {user_text} (transcribed in {transcription_time:.2f}s)")

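                # Check for exit command (Whisper often adds punctuation, e.g. "Exit.")
                if user_text.lower().strip(" .,!?") in ['exit', 'quit', 'stop']:
                    print("Goodbye!")
                    # Wait for playback; the daemon TTS thread would otherwise be killed on exit
                    self.tts.speak("Goodbye!").join()
                    break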

                if not user_text:
                    print("Could not understand audio. Please try again.")
                    continue

                # Step 3: Generate response
                print("Thinking..")
                start_time = time.time()
                response = self.assistant.generate_response(user_text)
                inference_time = time.time() - start_time
                print(f"Assistant: {response} (generated in {inference_time:.2f}s)")

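                # Step 4: Speak response and wait for playback to finish,
                # so the microphone does not pick up the assistant's own voice
                self.tts.speak(response).join()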

        except KeyboardInterrupt:
            print("\nShutting down gracefully..")
        finally:
            # Cleanup
            if hasattr(self.transcriber, 'model') and self.transcriber.model is not None:
                del self.transcriber.model
            if hasattr(self.assistant, 'model') and self.assistant.model is not None:
                del self.assistant.model
            torch.cuda.empty_cache()

if __name__ == "__main__":
    assistant = VoiceAssistant()
    assistant.run()

Edge Cases and Production Considerations

Memory Management

GPU Memory Leaks: Whisper and Llama models allocate CUDA memory that may not be freed on del. Always call torch.cuda.empty_cache() after model deletion
CPU Memory: Audio buffers can grow large with long recordings. Implement a maximum recording duration (e.g., 30 seconds) to prevent OOM
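A cap on recording length is simple to derive from the capture settings. The sketch below uses hypothetical helper names and assumes the tutorial's defaults (16 kHz mono float32 audio in 0.5-second chunks of 8000 samples).

```python
def max_chunks(max_seconds, sample_rate=16000, chunk_size=8000):
    """Number of whole chunks that fit in max_seconds of audio."""
    return int(max_seconds * sample_rate / chunk_size)

def buffer_bytes(seconds, sample_rate=16000, bytes_per_sample=4):
    """Memory used by a mono buffer of the given length (float32 = 4 bytes per sample)."""
    return int(seconds * sample_rate * bytes_per_sample)

def reached_limit(num_chunks, max_seconds=30):
    """True once the buffer holds max_seconds of audio and recording should stop."""
    return num_chunks >= max_chunks(max_seconds)

if __name__ == "__main__":
    print(max_chunks(30))    # 60
    print(buffer_bytes(30))  # 1920000
```

A check such as reached_limit(len(self.audio_buffer)) inside the capture loop would enforce the cap. At these settings 30 seconds is under 2MB, so the limit matters mostly for recordings that never reach the silence condition.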

Latency Optimization

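Model Warm-up: Both Whisper and Llama have significant first-inference latency from one-time CUDA initialization. Pre-warm both models during initialization
Batch Processing: If handling multiple users, note that openai-whisper's transcribe() processes one recording per call; batching requests requires the lower-level decoding API or a different Whisper implementation
Streaming Transcription: Setting verbose=True only prints segments to the console as they are decoded; transcribe() still returns after the whole recording is processed. For real-time feedback, transcribe short chunks as they arrive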
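Before optimizing, it helps to know which stage dominates. The following sketch records per-stage wall-clock time with a small context manager; timed is a hypothetical helper, and the sleep calls stand in for the real transcription and generation calls.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, timings):
    """Store the elapsed wall-clock seconds of the enclosed block under timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

if __name__ == "__main__":
    timings = {}
    with timed("transcribe", timings):
        time.sleep(0.05)  # Stand-in for self.transcriber.transcribe(audio)
    with timed("generate", timings):
        time.sleep(0.10)  # Stand-in for self.assistant.generate_response(text)
    for label, seconds in timings.items():
        print(f"{label}: {seconds:.2f}s")
```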

Error Handling

Microphone Disconnection: Wrap audio capture in a try-except for sounddevice.PortAudioError
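Model Loading Failures: If bitsandbytes quantization fails (e.g., on a machine without a CUDA GPU), fall back to FP16 or to a smaller model. 8-bit loading also relies on bitsandbytes, so it is unlikely to help when the library itself is the problem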
Empty Transcription: Whisper may return empty text for very short utterances. Implement a minimum audio duration check
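The fallback idea can be expressed as an ordered list of loading strategies. This is a generic sketch, not the tutorial's code: load_with_fallback is a hypothetical helper, and the two stub loaders stand in for real calls such as from_pretrained with and without a quantization config.

```python
def load_with_fallback(loaders):
    """Try each (name, loader) pair in order and return the first that succeeds."""
    errors = []
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as exc:
            # Remember why this strategy failed, then try the next one
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All loaders failed:\n" + "\n".join(errors))

# Stub loaders for illustration only
def load_4bit():
    raise ImportError("quantization backend unavailable")

def load_fp16():
    return "fp16-model"

if __name__ == "__main__":
    name, model = load_with_fallback([("4-bit", load_4bit), ("fp16", load_fp16)])
    print(f"Loaded with strategy: {name}")  # Loaded with strategy: fp16
```

Collecting the individual error messages makes it clear why each strategy was skipped when every option fails.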

Security

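Input Sanitization: Llama models can be prompted to generate harmful content. Screen both the transcribed input and the generated output with a content filter (for example, a safety classifier such as Llama Guard) before speaking the response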
Model Access: If using gated models from HuggingFace, authenticate with huggingface-cli login before running

What's Next

This tutorial provides a foundation for a local voice assistant. To extend it for production:

Add wake word detection: Integrate Porcupine or a custom ONNX model for hands-free activation
Implement conversation memory: Use a sliding window of recent exchanges to maintain context
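Optimize with ONNX Runtime: Convert Whisper and Llama to ONNX format for faster inference; the often-quoted 2-3x speedup depends on the model, hardware, and execution provider, so benchmark on your own setup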
Add streaming TTS: Replace pyttsx3 with Coqui TTS or Piper for more natural voice synthesis
Deploy as a microservice: Wrap the assistant in a FastAPI endpoint for remote access
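The conversation memory item can be sketched with a bounded deque. ConversationMemory is a hypothetical class; it produces the same message-list format that generate_response builds for the chat template, with the most recent exchanges inserted between the system prompt and the new user message.

```python
from collections import deque

class ConversationMemory:
    """Keep the last max_turns exchanges and build a chat message list from them."""

    def __init__(self, system_prompt, max_turns=4):
        self.system_prompt = system_prompt
        # Each item is a (user_text, assistant_text) pair; old turns drop off automatically
        self._turns = deque(maxlen=max_turns)

    def add_turn(self, user_text, assistant_text):
        self._turns.append((user_text, assistant_text))

    def build_messages(self, new_user_text):
        messages = [{"role": "system", "content": self.system_prompt}]
        for user_text, assistant_text in self._turns:
            messages.append({"role": "user", "content": user_text})
            messages.append({"role": "assistant", "content": assistant_text})
        messages.append({"role": "user", "content": new_user_text})
        return messages

if __name__ == "__main__":
    memory = ConversationMemory("You are a helpful voice assistant.", max_turns=2)
    memory.add_turn("What is the capital of France?", "Paris.")
    memory.add_turn("And of Italy?", "Rome.")
    memory.add_turn("And of Spain?", "Madrid.")
    for message in memory.build_messages("Which of those is largest?"):
        print(message["role"], "-", message["content"])
```

With max_turns=2, the first exchange has already been dropped by the time the fourth question is asked, which keeps the prompt length bounded.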

For further reading, explore our guides on optimizing transformer inference and building multimodal AI systems.

The complete code for this tutorial is available on GitHub. Remember to respect model licenses—Llama 3.3 requires acceptance of Meta's community license, and Whisper is MIT-licensed. With these components, you now have a fully functional, privacy-preserving voice assistant that runs entirely on your hardware.

References

1. Wikipedia - Hugging Face. Wikipedia.

2. Wikipedia - OpenAI. Wikipedia.

3. Wikipedia - PyTorch. Wikipedia.

4. GitHub - huggingface/transformers. GitHub.

5. GitHub - openai/openai-python. GitHub.

6. GitHub - pytorch/pytorch. GitHub.

7. GitHub - meta-llama/llama. GitHub.

8. OpenAI Pricing. OpenAI.
