# Plan: Audio/Video Transcription

Status: Future
Date: 2026-02-27
Related app: Transcript Summarizer
## Goal
Add audio/video upload support to generate transcripts automatically, feeding into the existing transcript synthesis pipeline.
## API Options

### OpenAI Whisper API (Recommended)
- Already have the OpenAI SDK configured in `lib/ai.ts`
- Endpoint: `POST /v1/audio/transcriptions`
- Model: `whisper-1`
- Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
- Features: timestamps, language detection, SRT/VTT output
- Max file size: 25 MB per request (longer files need chunking)
- Cost: $0.006/minute
```ts
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // uses OPENAI_API_KEY from the environment
const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream("recording.mp3"), // filename illustrative
  model: "whisper-1",
});
```
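The API can emit SRT/VTT directly, but once long recordings are chunked (see the 25 MB limit above), per-chunk timestamps will need to be re-offset and re-serialized on our side. A small formatting helper could look like the sketch below; this is illustrative, not part of the current codebase:

```typescript
// Format a second count as an SRT timestamp (HH:MM:SS,mmm), e.g. when
// stitching timestamped segments from several chunks into one SRT file.
function srtTimestamp(totalSec: number): string {
  const ms = Math.floor((totalSec % 1) * 1000);
  const s = Math.floor(totalSec) % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}
```

To merge chunks, each segment's start/end would be shifted by the chunk's offset before formatting.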
### Alternatives
| Service | Strengths | Notes |
|---------|-----------|-------|
| Deepgram | Fast, real-time, good speaker diarization | Separate SDK |
| AssemblyAI | Speaker labels, sentiment, built-in summarization | Separate SDK |
| Google Cloud Speech-to-Text | Enterprise-grade, many languages | Heavy SDK |
| AWS Transcribe | Enterprise option | Heavy SDK |
| Whisper (open source) | No API costs, local | Needs GPU for speed |
## Integration Sketch
- Add audio/video upload to the Transcript Summarizer UI (alongside the existing PDF/text upload)
- Send the file to a new API route (`/api/transcripts/[sessionId]/transcribe`)
- Transcribe via the Whisper API (chunk if >25 MB)
- Optionally request timestamps + speaker diarization (verbose JSON format)
- Store the resulting text as a `Transcript` record
- Feed it into the existing synthesis pipeline (map-reduce summarization)
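The "chunk if >25 MB" step above could be planned with a pure helper before any splitting happens. A sketch, assuming fixed-interval splitting and a roughly constant bitrate (all names here are illustrative):

```typescript
// Whisper's per-request cap is 25 MB.
const MAX_CHUNK_BYTES = 25 * 1024 * 1024;

interface Chunk {
  startSec: number;
  endSec: number;
}

// Plan fixed-duration chunks so each stays under the API limit,
// assuming bitrate is roughly constant across the recording.
function planChunks(totalBytes: number, durationSec: number): Chunk[] {
  if (totalBytes <= MAX_CHUNK_BYTES) {
    return [{ startSec: 0, endSec: durationSec }];
  }
  const bytesPerSec = totalBytes / durationSec;
  // Leave ~10% headroom under the cap for container overhead.
  const chunkSec = Math.floor((MAX_CHUNK_BYTES * 0.9) / bytesPerSec);
  const chunks: Chunk[] = [];
  for (let start = 0; start < durationSec; start += chunkSec) {
    chunks.push({ startSec: start, endSec: Math.min(start + chunkSec, durationSec) });
  }
  return chunks;
}
```

The actual splitting would happen server-side (e.g. with ffmpeg) using these boundaries; splitting on silence instead of fixed intervals would avoid cutting mid-word but needs audio analysis.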
## Open Questions
- Do we need speaker diarization (who said what)? Whisper's `verbose_json` format provides segment-level timestamps but not true speaker labels; dedicated services like Deepgram/AssemblyAI are better at this.
- Should we support real-time/streaming transcription, or batch-only?
- File size limits: Whisper caps requests at 25 MB. Longer recordings need a chunking strategy (split on silence, fixed intervals, etc.).
- Should transcription happen client-side (whisper.cpp/WASM) to save API costs?