Build an LLM-Powered Voice Agent in Python

In this article, we’re building an AI voice agent from scratch using Python.

Briefly, the voice agent can serve as an AI tutor, acting as either an English language learning assistant or a data science coach. We'll combine three powerful technologies: AssemblyAI for real-time speech recognition (so it can hear you), Ollama with the Gemma 3 model for intelligent responses (so it can think), and the Minimax text-to-speech service via Replicate for natural voice synthesis (so it can talk back). The beauty of this architecture is its modularity. For example, you could easily swap Replicate for another TTS provider like ElevenLabs without changing the core logic, or replace Gemma 3 with a different LLM such as Llama 4.

By the end of this tutorial, you'll have a fully functional voice agent running on your own machine that can maintain context, provide helpful responses, and engage in real conversations. The best part? It's surprisingly straightforward to build!

You can also follow along with the accompanying video:

1. Getting Started

Before we dive into the code, let's get our environment set up. You'll need to install a few prerequisites and set up your API keys.

1.1. Install Prerequisites and Python Libraries

First, if you're on macOS, you'll need PortAudio. This is a low-level audio library that handles the nitty-gritty of capturing sound from your microphone and playing audio through your speakers. You can install it using Homebrew:

brew install portaudio

If you encounter compilation errors, you might need to install Xcode command line tools first:

xcode-select --install

Now, it's highly recommended to create a new Conda environment to keep your dependencies clean.

conda create -n aai python=3.12
conda activate aai

With the environment activated, you can install the necessary Python libraries with a single command:

pip install "assemblyai[extras]" ollama replicate soundfile sounddevice

Here's what each library does:

  • assemblyai[extras]: The "ears" of our operation. This library handles real-time speech recognition, and the [extras] option pulls in the additional dependencies needed for microphone streaming.

  • ollama: The "brain" of our agent. This runs a language model locally, processing your questions and generating intelligent responses.

  • replicate: The "voice" of our agent. This provides cloud-based text-to-speech synthesis.

  • soundfile and sounddevice: These libraries process and manage real-time audio input and output.

2. Download the Ollama Model

Our voice agent needs a brain, and for this project, we're giving it the Gemma 3 1B model. This is an excellent choice for a real-time conversational agent because it strikes a great balance between response quality and speed. You don't want your voice agent pausing for 10 seconds to think of a response, as that would kill the conversational flow. To download the model, open your terminal and run:

ollama pull gemma3:1b
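
Once the download completes, you can confirm the model is available locally by listing your installed models:

ollama list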

3. Set Up Your API Keys

Our project requires API keys for AssemblyAI and Replicate. Both services offer generous free tiers to get you started.

Once you have your keys, the recommended approach is to set them as environment variables so they persist across terminal sessions. You can do this by editing your shell profile file (~/.bash_profile for bash or ~/.zshrc for zsh):

vi ~/.bash_profile

Then add these lines, replacing the placeholder values with your actual keys:

export REPLICATE_API_TOKEN="YOUR_REPLICATE_KEY_HERE"
export ASSEMBLYAI_API_KEY="YOUR_ASSEMBLYAI_KEY_HERE"

Save the file, then activate the changes:

source ~/.bash_profile
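
To confirm that the keys are available in your current session, you can print them:

echo $ASSEMBLYAI_API_KEY
echo $REPLICATE_API_TOKEN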

4. Code Walkthrough

Now that our environment is set up, let's break down the Python script of the voice agent to see how it works.

4.1. Import libraries

The script begins by importing all necessary libraries, including assemblyai, ollama, and replicate for the core AI functionalities, soundfile and sounddevice for audio handling, and standard Python libraries like os and sys.

import assemblyai as aai
import ollama
import os
import soundfile as sf
import sounddevice as sd
import replicate
import requests
import io
import sys
from typing import Optional

4.2. Set API keys

Next, the script retrieves the API keys from environment variables and performs basic error handling. The aai.settings.api_key is then set using the retrieved key.

# Set API Keys
ASSEMBLYAI_API_KEY = os.environ.get("ASSEMBLYAI_API_KEY")
REPLICATE_API_TOKEN = os.environ.get("REPLICATE_API_TOKEN")

if not ASSEMBLYAI_API_KEY:
    raise ValueError("ASSEMBLYAI_API_KEY not set.")
if not REPLICATE_API_TOKEN:
    raise ValueError("REPLICATE_API_TOKEN not set.")

aai.settings.api_key = ASSEMBLYAI_API_KEY

We’ll also define the exit phrase that lets us terminate the voice agent on demand.

EXIT_PHRASE = "Power off."  # Define the phrase to trigger exit

4.3. Building the Voice Agent

The core logic of the application is encapsulated within the AIVoiceAgent class.

4.3.1. Initializing the Voice Agent

The __init__ method initializes the class instance. It stores the Replicate token and, most importantly, defines the self.transcript list. This list is a conversation history in a structured format ({"role": "user/assistant", "content": "..."}) that both the LLM and the script can use. The initial system prompt sets the tone and instructions for the LLM model, acting as its personality.

class AIVoiceAgent:
    # Initializes the AI Voice Agent with necessary attributes.
    def __init__(self):
        self.replicate_token = REPLICATE_API_TOKEN
        self.transcriber = None
        self.transcript = [{"role": "system", "content": """
        You are an interviewer for a role in data science.
        Can you be proactive in asking questions to see if candidate is a good fit for the role.

        Please keep your answers concise, ideally under 300 characters.
        Please generate only text and no emojis.
        Please start by asking a welcoming question.
        Please ask only one question at a time.
        Instead of * please use numbered lists and use numbered list if there are 2 bullet points.
        """}]

4.3.2. Real-Time Transcription

The transcription process is handled by a series of methods that interact with AssemblyAI's real-time streaming API.

Briefly, the _start_transcription() method starts the real-time audio transcription session with AssemblyAI by connecting to their service, streaming audio from the computer's microphone, and assigning callback functions to handle incoming transcripts and errors.

    # Starts the real-time audio transcription
    def _start_transcription(self):
        print("\n 🎙️ Listening...")
        self.transcriber = aai.RealtimeTranscriber(
            sample_rate=16000,
            on_data=self._on_data,
            on_error=self._on_error,
            on_open=self._on_open,
            on_close=self._on_close,
        )
        self.transcriber.connect()
        try:
            self.transcriber.stream(
              aai.extras.MicrophoneStream(sample_rate=16000)
            )
        except Exception as e:
            print(f"Mic error: {e}")
            self._close_transcriber()

_on_data(self, transcript: aai.RealtimeTranscript) is the most important callback. It's triggered whenever AssemblyAI sends back a new chunk of transcription. It checks if the transcript is a RealtimeFinalTranscript (meaning the user has finished speaking). If so, it checks if the text is the EXIT_PHRASE to trigger a graceful shutdown. Otherwise, it calls _generate_response() to process the user's input. For partial transcripts, it simply prints the text to the console, giving the user real-time feedback.
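
The full callback isn't reproduced above, but here's a minimal sketch of what _on_data() could look like based on that description (the exact console formatting and exit handling are assumptions):

    # Handles incoming transcripts from AssemblyAI (a minimal sketch)
    def _on_data(self, transcript: aai.RealtimeTranscript):
        if not transcript.text:
            return
        if isinstance(transcript, aai.RealtimeFinalTranscript):
            print(f"\n👤 You: {transcript.text}")
            # Trigger a shutdown if the user says the exit phrase
            if EXIT_PHRASE.lower().strip(".") in transcript.text.lower():
                print("\n 🚪 Exit phrase detected. Shutting down...")
                self.stop_transcription()
                os._exit(0)  # end the whole process from this callback thread
            else:
                self._generate_response(transcript)
        else:
            # Partial transcript: print it for real-time feedback
            print(transcript.text, end="\r", flush=True)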

The methods stop_transcription(), _close_transcriber(), _on_open(), _on_error(), and _on_close() handle the other aspects of the transcription service lifecycle, from stopping a session to handling errors and closing the connection gracefully.
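
These helpers aren't shown in full either, but a rough sketch of them might look like the following (the exact error handling and messages are assumptions):

    # Stops listening by closing the active transcriber, if any
    def stop_transcription(self):
        if self.transcriber:
            self._close_transcriber()

    # Closes the AssemblyAI connection and clears the transcriber reference
    def _close_transcriber(self):
        try:
            self.transcriber.close()
        except Exception as e:
            print(f"Error closing transcriber: {e}")
        finally:
            self.transcriber = None

    # Connection lifecycle callbacks
    def _on_open(self, session_opened: aai.RealtimeSessionOpened):
        pass  # Session established with AssemblyAI

    def _on_error(self, error: aai.RealtimeError):
        print(f"Transcription error: {error}")

    def _on_close(self):
        pass  # Connection closed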

4.3.3. Processing and Responding

After a final transcript is received, the _generate_response() method takes over. This is the central hub of the conversational flow. It first calls self.stop_transcription() to prevent the agent from listening while it's processing and speaking. It appends the user's transcript to self.transcript, maintaining the conversation history. It then calls ollama.chat() with the full message history to generate a response. By setting stream=True, it gets the response in small chunks as they are generated.

The core of this method is the streaming loop that buffers the incoming text and, when a sentence-ending punctuation mark (., ?, !, \n) appears or the buffer exceeds 300 characters, passes that complete sentence to _play_speech() to be spoken. Speaking each sentence as soon as it's ready, rather than waiting for the full response, is what makes the conversation feel fast and fluid. After the streaming is complete, any remaining text in the buffer is also spoken. Finally, the full AI response is appended to the self.transcript history, and _start_transcription() is called to begin listening for the next user input.

    # Stops the transcription, adds the user's transcript to the history, generates the LLM response, and plays it
    def _generate_response(self, transcript):
        self.stop_transcription()
        self.transcript.append({"role": "user", "content": transcript.text})
        try:
            ollama_stream = ollama.chat(
                model="gemma3:1b",
                messages=self.transcript,
                stream=True,
            )
        except Exception as e:
            print(f"Ollama Error: {e}")
            self._start_transcription()
            return

        print("\n🤖 AI:", end=" ", flush=True)
        buffer = ""
        full_response = ""

        # Iterates through the streamed response chunks from Ollama
        for chunk in ollama_stream:
            content = chunk['message']['content']
            buffer += content
            print(content, end="", flush=True)

            # Generate speech once the buffer contains a sentence-ending punctuation mark or exceeds 300 characters
            buffer = buffer.replace("**", "")
            if any(p in buffer for p in ['.', '?', '!', '\n']) or len(buffer) > 300:
                sentence = ""
                processed = False
                for p in reversed(['.', '?', '!', '\n']):
                    if p in buffer:
                        parts = buffer.split(p, 1)
                        sentence = parts[0] + p
                        buffer = parts[1] if len(parts) > 1 else ""
                        processed = True
                        break
                if not processed and len(buffer) > 300:
                    sentence = buffer
                    buffer = ""
                current_sentence = sentence.strip()
                if current_sentence:
                    full_response += current_sentence + " "
                    self._play_speech(current_sentence)

        # Processes any remaining text in the buffer and generates speech for it
        remaining = buffer.strip()
        if remaining:
            print(remaining, end="\n", flush=True)
            full_response += remaining + " "
            self._play_speech(remaining)

        # Finalizes the AI response, adds it to chat history, and begins listening again
        final_response = full_response.strip()
        if final_response:
            self.transcript.append({"role": "assistant", "content": final_response})
        print("\n------------------------------------")
        self._start_transcription()

4.3.4. Generating Speech from Text

The _play_speech() method is responsible for converting the AI's text response into audible speech.

    # Generates speech from the given text using Replicate's TTS model and plays it
    def _play_speech(self, text: str):
        if not text.strip():
            return
        audio_url = "N/A"
        try:
            resp = replicate.run(
                "minimax/speech-02-turbo",
                input={"text": text, "pitch": 0, "speed": 1, "volume": 1,
                       "bitrate": 32000, "channel": "mono", "emotion": "happy", # "auto", "neutral", "happy", "sad", "angry", "fearful", "disgusted", "surprised"
                       "voice_id": "English_Graceful_Lady", "sample_rate": 32000,
                       # English_WiseScholar, English_Graceful_Lady
                       "language_boost": "English", "english_normalization": True}
            )
            audio_url = str(resp)
            if not audio_url or not audio_url.startswith("http"):
                print(f"\n   TTS Error (invalid URL): {audio_url} for '{text}'")
                return
            audio_data = requests.get(audio_url, timeout=20)
            audio_data.raise_for_status()
            data, sr = sf.read(io.BytesIO(audio_data.content))
            sd.play(data, sr)
            sd.wait()

        # Error handling during the text-to-speech process
        except replicate.exceptions.ReplicateError as e:
            print(f"\n   TTS Replicate Error for '{text}': {e}")
        except requests.exceptions.RequestException as e:
            print(f"\n   TTS Download Error ({audio_url}) for '{text}': {e}")
        except sf.SoundFileError as e:
            print(f"\n   TTS Audio Read Error ({audio_url}) for '{text}': {e} (Ensure ffmpeg for MP3)")
        except Exception as e:
            print(f"\n   TTS Unexpected Error for '{text}': {e}")

4.3.5. Main Execution Loop

Finally, the start() method is the main entry point for the application. It prints a welcome message, starts the transcription, and then enters an infinite loop using sd.sleep(10) to keep the script running and listening. The script exits when the EXIT_PHRASE is detected by the _on_data callback or when a KeyboardInterrupt (e.g., pressing Ctrl+C) is caught.

    # Starts the main loop of the AI Voice Agent
    def start(self):
        print(f"⚡ Starting AI Voice Agent... Say '{EXIT_PHRASE}' to exit.")
        self._start_transcription()
        try:
            while True:
                sd.sleep(10)
        except KeyboardInterrupt:
            print("\n 🚪 Exiting...")
            self.stop_transcription()
            print("Exited.")

4.3.6. Running the Voice Agent

Lastly, let’s set up the voice agent to run directly from the command line with the if __name__ == "__main__": block. When the script is executed, it creates an instance of the AIVoiceAgent class and calls its start() method to begin the voice agent's operation. Error handling blocks catch potential configuration or unexpected errors during startup.

# Starts the AI voice agent
if __name__ == "__main__":
    try:
        AIVoiceAgent().start()
    except ValueError as e:
        print(f"Config Error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")

Once you run the script with python voiceagent.py, you'll see "🎙️ Listening..."—that's your cue to start talking! The system processes your speech in real-time, so you'll see your words appear on screen as you speak. Then the AI thinks about your input and responds with both text and speech. The conversation flows naturally, with the AI remembering what you've discussed. When you're done experimenting, just say "Power off" and the agent will gracefully shut down.

5. Tips for Voice Agent Customization

The provided script is a great starting point, but you can customize it in many ways to make it your own.

  • Change the AI's Persona: Modify the system prompt to turn the voice agent into a helpful coach, a live tutor, or an English language assistant (see the example prompt after this list).

  • Swap the LLM: Ollama offers a wide variety of models you can download and use. Try pulling a different model, like Llama 4 or Mistral, and see how the AI's responses and personality change.

  • Adjust Text-to-Speech Parameters: In the _play_speech method, experiment with different emotion parameter values (sad, angry, etc.) and voice_id to change how the AI sounds.

  • Customize the Conversation Flow: The _generate_response method controls when the AI starts speaking. You can adjust the character limit or the punctuation checks to change how long the AI waits before generating the next part of its speech.
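
As an example of the first tip, here's a hypothetical replacement for the system prompt in __init__ that turns the agent into an English conversation tutor (the exact wording is up to you):

        self.transcript = [{"role": "system", "content": """
        You are a friendly English language tutor.
        Hold a casual conversation with the learner and gently point out
        grammar or vocabulary mistakes in their previous message.

        Please keep your answers concise, ideally under 300 characters.
        Please generate only text and no emojis.
        Please ask only one question at a time.
        """}]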

Conclusion

You've just built something that would have seemed like science fiction a few years ago—a voice-enabled AI that can engage in natural conversation, remember context, and provide educational support across multiple domains. This is a powerful yet accessible platform for voice AI applications. The possibilities for customization are endless.

The code for this project can be found in this GitHub repo.

Feel free to explore other projects and videos using AssemblyAI in this AssemblyAI playlist.