Building Sam: How We Taught an AI Meeting Assistant to Recognise Every Voice in the Room

How we built voice identification and in-meeting enrolment for Sam, our open-source AI meeting assistant. Speaker embeddings, Resemblyzer vs pyannote, and the flow that lets Sam enrol new attendees without interrupting the meeting.

Photo by Dylan Gillis / Unsplash

📖 Origin story — Building Sam
This post explains what Sam is, why it was built openly using a spec-driven approach, and how the nine-part technical series is structured. Read this first, then follow the series from Part 1.

The Problem: Meetings Are Where Knowledge Goes to Die

Every organisation above a certain size runs on meetings. Decisions are made in them, context is built in them, commitments are spoken aloud in them. And then the calendar invite closes, people return to their desks, and almost all of that knowledge evaporates.

What survives is whatever one person was able to type fast enough, filtered through their own attention and bias. Who proposed what, who pushed back and why, what was actually agreed versus what was assumed — gone.

The tools that exist today don’t fully close this gap. They fall short in specific, predictable ways:

  • Otter.ai assigns speaker labels based on calendar invites and phone numbers. If someone joins late, dials in from an unregistered number, or isn’t in the calendar, attribution breaks. It cannot distinguish voices in a room — only separate audio tracks from known devices.
  • Fireflies.ai produces clean transcripts but treats speaker attribution as a post-processing problem. After the meeting, you still spend time manually assigning who said what before the transcript becomes useful.
  • Grain excels at clipping and sharing highlights but is shallow on understanding. It finds moments worth sharing — it doesn’t reason about the meeting as a document.

The gap we wanted to close: a system that knows who is speaking by their voice, attributes every utterance correctly in real time, and reasons about the content of the meeting rather than just transcribing it. It should also be a system where every decision made under the hood is documented, so that whoever builds and runs it understands it.

What AI Agents Actually Are

If you haven’t worked with AI agents before, the term gets used loosely. Here is the clearest way to think about it.

A language model — the technology behind ChatGPT, Claude, and similar systems — takes text in and produces text out. It is very good at understanding context and generating coherent responses. But on its own, it is passive. You give it something, it gives you something back. It cannot take actions in the world.

An AI agent is what you get when you give a language model the ability to act. Instead of just responding, it can read from a database, write to a file, call an external API, decide what to do next based on intermediate results, and trigger other processes. It works through a task step by step — using tools, making decisions, producing results — without a human directing each step.

Aspect | Language Model | AI Agent
Input | Prompt | Prompt
What it does | Generates a response | Reasons, decides, acts
Tools | None | Database, APIs, files, other agents
Autonomy | Passive: responds when asked | Active: works through a task step by step
Output | Text reply | Structured results, side effects in the world
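
To make the difference concrete, here is a minimal sketch of the loop that turns a model into an agent. Everything in it is hypothetical: `fake_model` stands in for a real language model, and `read_transcript` and `save_note` are invented tools, not part of Sam's codebase. The point is only the shape: the model picks a step, the loop runs the matching tool, and the result feeds the next decision.

```python
from typing import Callable

# Hypothetical tools: plain functions the agent is allowed to call.
def read_transcript(meeting_id: str) -> str:
    return "Alice: I will send the budget by Friday."

def save_note(text: str) -> str:
    return f"saved: {text}"

TOOLS: dict[str, Callable[[str], str]] = {
    "read_transcript": read_transcript,
    "save_note": save_note,
}

def fake_model(history: list[str]) -> tuple[str, str]:
    """Stand-in for a language model: chooses the next step from the history so far."""
    if len(history) == 1:
        return ("read_transcript", "m-1")
    if len(history) == 2:
        return ("save_note", history[-1])
    return ("finish", history[-1])

def run_agent(task: str, max_steps: int = 5) -> str:
    """Loop: ask the model for an action, run the tool, record the result."""
    history = [task]
    for _ in range(max_steps):
        action, argument = fake_model(history)
        if action == "finish":
            return argument
        history.append(TOOLS[action](argument))
    raise RuntimeError("agent did not finish within max_steps")

print(run_agent("Capture commitments from meeting m-1"))
```

A plain language model call would be the single `fake_model(...)` line. The loop, the tool table, and the step limit are what make it an agent.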

In Sam, agents do the work that happens after a meeting ends. Four agents — built with a framework called LangGraph — divide the task of understanding the meeting transcript:

  • A moderator agent coordinates the others and decides what runs in parallel and what must wait
  • A transcription agent reviews the attributed transcript for completeness
  • A summary agent writes a narrative of what was discussed and decided
  • An action items agent extracts commitments with named owners and due dates

None of this requires a human in the loop. The agents read the transcript, run their analysis, and produce structured outputs automatically. This is what makes the system genuinely different from a transcription tool. Transcription captures what was said. Agents understand it.
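
The coordination described above can be sketched without any framework. The example below is not LangGraph and not Sam's implementation. It uses Python's standard library to show one possible ordering, and that ordering is an assumption: the transcript review runs first, then the summary and action-item steps run in parallel because neither depends on the other. The agent bodies are placeholder heuristics where a real agent would call a language model, and due dates are left out for brevity.

```python
from concurrent.futures import ThreadPoolExecutor

def transcription_agent(transcript: list[dict]) -> list[dict]:
    """Completeness check (placeholder rule): drop empty utterances."""
    return [u for u in transcript if u["text"].strip()]

def summary_agent(transcript: list[dict]) -> str:
    speakers = sorted({u["speaker"] for u in transcript})
    return f"{len(transcript)} utterances from {', '.join(speakers)}"

def action_items_agent(transcript: list[dict]) -> list[dict]:
    # Placeholder heuristic: treat "I will ..." as a commitment.
    return [
        {"owner": u["speaker"], "task": u["text"]}
        for u in transcript
        if "i will" in u["text"].lower()
    ]

def moderator(transcript: list[dict]) -> dict:
    """Decide what must wait and what can run side by side."""
    reviewed = transcription_agent(transcript)       # must finish first
    with ThreadPoolExecutor(max_workers=2) as pool:  # independent steps
        summary = pool.submit(summary_agent, reviewed)
        actions = pool.submit(action_items_agent, reviewed)
        return {"summary": summary.result(), "action_items": actions.result()}

result = moderator([
    {"speaker": "Alice", "text": "I will send the budget by Friday."},
    {"speaker": "Bob", "text": "  "},
    {"speaker": "Bob", "text": "Sounds good."},
])
print(result)
```

A graph framework adds explicit nodes, edges, and shared state on top of this idea, but the dependency structure is the same.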

What Sam Is

Sam is an AI meeting assistant that joins every meeting as a participant. Before the meeting starts, attendees enrol their voice — three short recordings, about four seconds each. Sam learns what each person sounds like and stores a voice fingerprint in its database. When the meeting begins, Sam listens to every audio track, matches each voice to an enrolled attendee in real time, and builds an attributed transcript: not just what was said, but who said it.

  • Before Meeting: Each attendee records 3 voice samples. Sam stores a fingerprint: done once, remembered forever.
  • During Meeting: Sam joins the LiveKit room and listens. It matches each audio track to an enrolled attendee in real time, building an attributed transcript: “Alice said… Bob said…”
  • After Meeting: When the meeting ends, 4 LangGraph agents read the transcript and produce a summary, action items, and moderation notes.

When the meeting ends, the four-agent LangGraph pipeline reads the full attributed transcript and produces three outputs:

  • A meeting summary — a concise narrative of what was discussed and decided, written for someone who wasn’t in the room
  • Action items — structured commitments extracted from the conversation, each with a named owner and a due date
  • Moderation notes — a review of whether the meeting stayed on agenda, and any moments that may need follow-up

All of this happens without anyone taking notes, without any manual cleanup, and without any post-meeting data entry.

🎤 The detail that matters: Sam doesn’t ask attendees to introduce themselves when they speak. It recognises them by voice. Enrolment happens once — before the first meeting — and carries forward to every subsequent meeting that person attends. The first time you enrol, Sam knows you forever.

Privacy and Where Your Data Lives

Any system that records meeting audio raises legitimate questions. Here is exactly what Sam does and does not do.

Audio is processed on your infrastructure. Sam runs on three virtual machines in your environment. Audio is captured by LiveKit — an open-source WebRTC server you deploy and control — processed locally for voice identification, and not persisted. No audio is sent to a third-party cloud service.

What is stored: Voice fingerprints, meeting transcripts, and the agent outputs (summary, action items, moderation notes). All of this lives in a PostgreSQL database on a VM you control.

Voice fingerprints are not audio. A voice fingerprint is a 256-dimensional numerical vector — a set of numbers representing the acoustic pattern of a voice. You cannot reconstruct audio from it. If the database were compromised, an attacker would gain a list of numbers, not recordings.
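
As an illustration of what "a set of numbers" means here, the sketch below combines several per-sample embeddings into one stored fingerprint by averaging them and scaling the result to unit length. This is an assumption about how enrolment samples could be combined, not a description of Sam's code, and the embeddings are taken as given (in practice a speaker encoder produces them from audio). The function names are hypothetical.

```python
import math

EMBEDDING_DIM = 256  # dimensionality stated in this post

def l2_normalise(vector: list[float]) -> list[float]:
    """Scale a vector so its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]

def build_fingerprint(sample_embeddings: list[list[float]]) -> list[float]:
    """Average the per-sample embeddings, then normalise to unit length."""
    if not sample_embeddings:
        raise ValueError("need at least one enrolment sample")
    if any(len(e) != EMBEDDING_DIM for e in sample_embeddings):
        raise ValueError(f"every embedding must have {EMBEDDING_DIM} values")
    count = len(sample_embeddings)
    mean = [sum(values) / count for values in zip(*sample_embeddings)]
    return l2_normalise(mean)

# Three toy "recordings", already encoded as 256 numbers each.
samples = [
    [1.0, 0.0] + [0.0] * 254,
    [0.0, 1.0] + [0.0] * 254,
    [1.0, 1.0] + [0.0] * 254,
]
fingerprint = build_fingerprint(samples)
print(len(fingerprint), fingerprint[:2])
```

Only the final vector needs to be stored. The recordings themselves can be discarded once it has been computed.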

On-premises deployment: Because every component is open-source and self-hosted, Sam can run entirely within a private network with no internet access after initial setup. Part 9 covers the zero trust security configuration in full.

What Sam Cannot Do Yet

Honest accounting matters. Voice identification is not perfect, and there are conditions under which Sam degrades gracefully rather than guessing incorrectly:

  • Significant crosstalk: When two or more people speak simultaneously, Sam cannot reliably separate and attribute overlapping voices. It gives the segment a generic label rather than misattributing it.
  • Low-quality audio: Participants who dial in via compressed phone codecs or use low-bitrate connections produce audio that differs enough from their enrolled sample to reduce match confidence.
  • Very similar voices: Close siblings, identical twins, or people with highly similar vocal patterns may produce ambiguous matches near the 0.75 cosine similarity threshold.
  • Voice change since enrolment: Significant illness, sustained background noise during enrolment, or substantial changes in speaking register can cause mismatches.

In all of these cases, Sam falls back to a generic label rather than assigning the wrong name. The technical series covers the threshold tuning and fallback logic in Part 4.
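
The fallback rule can be sketched in a few lines. The 0.75 threshold comes from this post. Everything else is an assumption made for illustration: the function names, the text of the generic label, and the choice to count a score exactly at the threshold as a match. Two-dimensional toy vectors stand in for 256-dimensional fingerprints.

```python
import math

MATCH_THRESHOLD = 0.75             # threshold quoted in this post
UNKNOWN_LABEL = "Unknown speaker"  # hypothetical text for the generic label

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def identify_speaker(
    embedding: list[float],
    enrolled: dict[str, list[float]],
    threshold: float = MATCH_THRESHOLD,
) -> tuple[str, float]:
    """Return the best-matching name, or the generic label if below threshold."""
    best_name, best_score = UNKNOWN_LABEL, -1.0
    for name, fingerprint in enrolled.items():
        score = cosine_similarity(embedding, fingerprint)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return UNKNOWN_LABEL, best_score
    return best_name, best_score

enrolled = {"Alice": [1.0, 0.0], "Bob": [0.0, 1.0]}
print(identify_speaker([0.9, 0.1], enrolled))  # close to Alice's fingerprint
print(identify_speaker([1.0, 1.0], enrolled))  # equally far from both
```

The second call lands halfway between the two fingerprints, scores about 0.71 against each, and therefore gets the generic label instead of a name.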

How the Build Was Designed: Spec-Driven Development

Most software projects accumulate decisions as they go. Requirements shift, scope expands, and the reasoning behind individual choices gets lost in Slack threads and closed Jira tickets. Six months later, nobody can explain why a particular technology was chosen, or whether it was a deliberate decision or a default.

Sam was built using a different approach: spec-driven development (SDD).

The principle is simple: before writing a single line of code, write a specification — a precise description of what the system must do, how it must behave, and what constraints it must satisfy. The specification becomes the source of truth. Each stage of development is a requirement from the specification turned into working, tested code. Nothing is built that isn’t in the spec. Nothing in the spec is skipped.

The specification for Sam defined nine discrete requirements, and those requirements became the nine stages of the build:

Each of the 9 stages in this series maps directly to a requirement in the original specification. Every stage was implemented, tested, and committed to GitHub before the next began — and every stage has a dedicated blog post explaining the decisions.

This approach — formalised in open frameworks like OpenSpec — treats specifications as first-class artifacts that live alongside code in version control. Not planning documents written once and forgotten. Living documentation that explains not just what was built, but the decisions that shaped it and the alternatives that were considered and rejected.

The result is that Sam’s entire build is reproducible and explainable. Every stage has a specification requirement, a working implementation, a passing test suite, and a blog post that explains the decisions. If you follow the series from Part 1 to Part 9, you will understand every layer of the system — not just how to run it, but why it was built the way it was.

How the Blog and GitHub Work Together

The project runs on two rails simultaneously, and both are necessary.

The blog — the WHY

Every stage of the build gets a dedicated post here on momentums.com.au. Each post doesn’t just describe what was built — it explains the decisions. Why LiveKit instead of building directly on WebRTC? Why PostgreSQL with pgvector instead of a dedicated vector database? Why LangGraph instead of a simple sequential chain? Why Resemblyzer instead of a cloud speaker diarization API?

These are the questions that matter when you are evaluating whether to use the same approach in your own project. The posts answer them with the context of someone who has already built it, knows where the trade-offs land, and can tell you what they would do differently.

The GitHub repository — the HOW

The repository at jdizon/sam-meeting-ai is the running implementation. Every stage documented on the blog is committed to the repository with working tests. If a post says Sam can identify a speaker with cosine similarity at a 0.75 threshold, the repository has a unit test that verifies exactly that claim. The code is the proof.

📌 Open spec principle: A blog post without code is unverifiable. Code without a blog post is unreadable. Together they form an open specification — a complete, honest account of how a system was built and why it was built that way. This is what spec-driven development produces when done openly: not just a working system, but a transferable understanding of how to build one.

The Stack at a Glance

Sam is built on nine components, each chosen for a specific reason that the corresponding post explains in detail:

  • Proxmox + Ubuntu 24.04 — three VMs with separated concerns: API, voice pipeline, database
  • FastAPI + SQLAlchemy async — Python async API with dependency injection that makes unit testing trivial
  • LiveKit — open-source WebRTC media server; gives Sam a Python participant that can join any room and listen to audio tracks without a browser
  • Resemblyzer — open-source speaker encoder; converts voice samples to 256-dimensional fingerprints, entirely on-premises, no cloud API required
  • PostgreSQL 16 + pgvector — stores both relational meeting data and voice embeddings in the same database, eliminating the need for a separate vector store
  • Redis — LiveKit’s signalling bus; coordinates the voice pipeline across the server and agent worker
  • LangGraph — graph-based multi-agent orchestration; four agents run in parallel where the graph allows, sequentially where outputs depend on each other
  • Next.js 14 (App Router) — frontend with server components for the summary view and client components for the live meeting room and voice enrolment UI
  • Zero trust security — JWT auth on every API endpoint, UFW firewall rules on every VM, service identity tokens for inter-service calls, TLS in transit
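
To illustrate the PostgreSQL + pgvector entry in the list above, here is a sketch of what a nearest-fingerprint lookup could look like when embeddings live next to relational data. The table name `attendees` and its columns are hypothetical, and the `%s` placeholders assume a psycopg-style driver. The `<=>` operator is pgvector's cosine distance, so `1 - distance` gives cosine similarity. The code only builds the query and its parameters and does not open a database connection.

```python
# Hypothetical schema: one row per enrolled attendee.
CREATE_TABLE_SQL = """
CREATE TABLE attendees (
    id          bigserial PRIMARY KEY,
    name        text NOT NULL,
    fingerprint vector(256) NOT NULL
);
"""

# Closest enrolled fingerprint by cosine distance (pgvector's <=> operator).
NEAREST_SPEAKER_SQL = """
SELECT name, 1 - (fingerprint <=> %s::vector) AS similarity
FROM attendees
ORDER BY fingerprint <=> %s::vector
LIMIT 1;
"""

def to_vector_literal(values: list[float]) -> str:
    """Format floats in pgvector's text form, for example '[0.5,1.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

def nearest_speaker_query(embedding: list[float]) -> tuple[str, tuple[str, str]]:
    """Return the SQL text and the parameters to pass to the driver."""
    literal = to_vector_literal(embedding)
    return NEAREST_SPEAKER_SQL, (literal, literal)

sql, params = nearest_speaker_query([0.5, 1, -2.25])
print(params[0])
```

The caller would still compare the returned similarity against the 0.75 threshold before trusting the name. The query only finds the closest candidate.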

Who This Is For

If you are a developer or technical decision-maker evaluating how to add meeting intelligence to your product or workflow — and especially if you want to understand how AI agents actually work, not just use a product that claims to use them — this series is written for you.

It is not a beginner tutorial. It assumes you are comfortable with Python, have deployed a Linux server, and understand roughly what a REST API is. It does not assume you have worked with AI agents, WebRTC, speaker diarization, LangGraph, or any of the specific technologies involved. Those are explained from first principles in the posts where they appear.

If you want to run Sam yourself, clone the repository and follow the README. If you want to understand it before you run it, start with Part 1 and read forward.


The Full Series — Build It Stage by Stage

Each post covers exactly one stage of the specification: what was built, why that technology was chosen over the alternatives, and what it enables for the stages that follow. Read in order for the full picture, or jump to the layer you care about.

  1. Part 1: Infrastructure — Three VMs, One Goal
  2. Part 2: The Database Layer — PostgreSQL, pgvector, and Redis
  3. Part 3: Real-Time Audio — LiveKit, WebRTC, and the Agent Worker
  4. Part 4: Voice Fingerprinting — Speaker Enrollment with Resemblyzer
  5. Part 5: The API Brain — FastAPI, SQLAlchemy, and Async Everything
  6. Part 6: Thinking in Graphs — LangGraph Multi-Agent Orchestration
  7. Part 7: The Face of Sam — Next.js 14 Frontend and the Meeting Room UI
  8. Part 8: Shipping with Confidence — 33 Tests, GitHub Actions, and CI
  9. Part 9: Zero Trust Security — JWT, UFW, and Service Identity

Stages 10 (PDF/DOCX export) and 11 (observability and monitoring) are in progress. New posts are published on momentums.com.au and pushed to GitHub as each stage completes.