# Plan: Audio/Video Transcription

Status: Future
Date: 2026-02-27
Related app: Transcript Summarizer
## Goal
Add audio/video upload support to generate transcripts automatically, feeding into the existing transcript synthesis pipeline.
## API Options

### OpenAI Whisper API (Recommended)
- Already have the OpenAI SDK configured in `lib/ai.ts`
- Endpoint: `POST /v1/audio/transcriptions`
- Model: `whisper-1`
- Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
- Features: timestamps, language detection, SRT/VTT output
- Max file size: 25 MB per request (longer files need chunking)
- Cost: $0.006/minute
```ts
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // uses OPENAI_API_KEY from the environment
const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream("recording.mp3"), // filename illustrative
  model: "whisper-1",
});
```
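The API can emit SRT/VTT directly, but once long recordings are chunked (see the 25 MB limit above), per-chunk timestamps will need to be re-offset and re-serialized on our side. A small formatting helper could look like the sketch below; this is illustrative, not part of the current codebase:

```typescript
// Format a second count as an SRT timestamp (HH:MM:SS,mmm), e.g. when
// stitching timestamped segments from several chunks into one SRT file.
function srtTimestamp(totalSec: number): string {
  const ms = Math.floor((totalSec % 1) * 1000);
  const s = Math.floor(totalSec) % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}
```

To merge chunks, each segment's start/end would be shifted by the chunk's offset before formatting.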
### Alternatives
| Service | Strengths | Notes |
|---------|-----------|-------|
| Deepgram | Fast, real-time, good speaker diarization | Separate SDK |
| AssemblyAI | Speaker labels, sentiment, built-in summarization | Separate SDK |
| Google Cloud Speech-to-Text | Enterprise-grade, many languages | Heavy SDK |
| AWS Transcribe | Enterprise option | Heavy SDK |
| Whisper (open source) | No API costs, local | Needs GPU for speed |
## Integration Sketch
- Add audio/video upload to the Transcript Summarizer UI (alongside the existing PDF/text upload)
- Send the file to a new API route (`/api/transcripts/[sessionId]/transcribe`)
- Transcribe via the Whisper API (chunk if >25 MB)
- Optionally request timestamps + speaker diarization (verbose JSON format)
- Store the resulting text as a `Transcript` record
- Feed it into the existing synthesis pipeline (map-reduce summarization)
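The "chunk if >25 MB" step above could be planned with a pure helper before any splitting happens. A sketch, assuming fixed-interval splitting and a roughly constant bitrate (all names here are illustrative):

```typescript
// Whisper's per-request cap is 25 MB.
const MAX_CHUNK_BYTES = 25 * 1024 * 1024;

interface Chunk {
  startSec: number;
  endSec: number;
}

// Plan fixed-duration chunks so each stays under the API limit,
// assuming bitrate is roughly constant across the recording.
function planChunks(totalBytes: number, durationSec: number): Chunk[] {
  if (totalBytes <= MAX_CHUNK_BYTES) {
    return [{ startSec: 0, endSec: durationSec }];
  }
  const bytesPerSec = totalBytes / durationSec;
  // Leave ~10% headroom under the cap for container overhead.
  const chunkSec = Math.floor((MAX_CHUNK_BYTES * 0.9) / bytesPerSec);
  const chunks: Chunk[] = [];
  for (let start = 0; start < durationSec; start += chunkSec) {
    chunks.push({ startSec: start, endSec: Math.min(start + chunkSec, durationSec) });
  }
  return chunks;
}
```

The actual splitting would happen server-side (e.g. with ffmpeg) using these boundaries; splitting on silence instead of fixed intervals would avoid cutting mid-word but needs audio analysis.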
## Open Questions
- Do we need speaker diarization (who said what)? Whisper's `verbose_json` format provides segment-level timestamps but not true speaker labels; dedicated services like Deepgram/AssemblyAI are better at this.
- Should we support real-time/streaming transcription, or batch-only?
- File size limits: Whisper caps requests at 25 MB. Longer recordings need a chunking strategy (split on silence, fixed intervals, etc.).
- Should transcription happen client-side (whisper.cpp/WASM) to save API costs?