Plan: Audio/Video Transcription

Status: Future
Date: 2026-02-27
Related app: Transcript Summarizer

Goal

Add audio/video upload support to generate transcripts automatically, feeding into the existing transcript synthesis pipeline.

API Options

OpenAI Whisper API (Recommended)

  • Already have OpenAI SDK configured in lib/ai.ts
  • Endpoint: POST /v1/audio/transcriptions
  • Model: whisper-1
  • Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
  • Features: timestamps, language detection, SRT/VTT output
  • Max file size: 25MB (longer files need chunking)
  • Cost: $0.006/minute
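
// `openai` is the client configured in lib/ai.ts; `audioFile` is the uploaded file.
const transcription = await openai.audio.transcriptions.create({
  file: audioFile,
  model: "whisper-1",
});
// With the default response format, the transcript is in transcription.text.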
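
The limits listed above (supported formats, 25MB cap, $0.006/minute) could be checked before any API call is made. A minimal sketch, assuming the caller already knows the file name, size, and duration (for example from upload metadata); `preflight`, `PreflightResult`, and the constant names are hypothetical, and "25MB" is assumed to mean 25 * 1024 * 1024 bytes:

```typescript
// Values copied from the bullets above; all names here are hypothetical.
const SUPPORTED_EXTENSIONS = new Set<string>([
  "mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm",
]);
const MAX_UPLOAD_BYTES = 25 * 1024 * 1024; // assumption: "25MB" = 25 MiB
const USD_PER_MINUTE = 0.006;

interface PreflightResult {
  ok: boolean;              // false if the file should be rejected outright
  needsChunking: boolean;   // true if the file exceeds the upload cap
  estimatedCostUsd: number; // duration in minutes times the per-minute rate
  reason?: string;          // set only when ok is false
}

function preflight(
  fileName: string,
  sizeBytes: number,
  durationSeconds: number,
): PreflightResult {
  // Take the text after the last dot as the extension, case-insensitively.
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";

  if (!SUPPORTED_EXTENSIONS.has(ext)) {
    return {
      ok: false,
      needsChunking: false,
      estimatedCostUsd: 0,
      reason: `Unsupported format: "${ext}"`,
    };
  }

  return {
    ok: true,
    needsChunking: sizeBytes > MAX_UPLOAD_BYTES,
    estimatedCostUsd: (durationSeconds / 60) * USD_PER_MINUTE,
  };
}
```

Checking by extension only is a simplification; the upload route may prefer to inspect the MIME type or the file contents instead.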

Alternatives

| Service | Strengths | Notes |
|---------|-----------|-------|
| Deepgram | Fast, real-time, good speaker diarization | Separate SDK |
| AssemblyAI | Speaker labels, sentiment, built-in summarization | Separate SDK |
| Google Cloud Speech-to-Text | Enterprise-grade, many languages | Heavy SDK |
| AWS Transcribe | Enterprise option | Heavy SDK |
| Whisper (open source) | No API costs, local | Needs GPU for speed |

Integration Sketch

  1. Add audio/video upload to Transcript Summarizer UI (alongside existing PDF/text upload)
  2. Send file to new API route (/api/transcripts/[sessionId]/transcribe)
  3. Transcribe via Whisper API (chunk if >25MB)
  4. Optionally request segment timestamps (verbose JSON format); whisper-1 does not return speaker labels, so diarization would need one of the alternative services
  5. Store resulting text as a Transcript record
  6. Feed into existing synthesis pipeline (map-reduce summarization)
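
Steps 2 to 5 above could be sketched as a single function that the new API route calls. This is a rough outline under stated assumptions, not the actual route: `TranscriptRecord`, `TranscribeDeps`, and `transcribeUpload` are hypothetical names, the real Transcript model likely has more fields, and the splitting, Whisper call, and persistence are injected so each can be swapped or faked in tests.

```typescript
// Hypothetical shape; the real Transcript record may differ.
interface TranscriptRecord {
  sessionId: string;
  text: string;
  source: "audio";
}

// Injected dependencies (all hypothetical).
interface TranscribeDeps {
  // Returns the upload as one or more parts, each under the size cap.
  splitIfNeeded(file: Uint8Array): Promise<Uint8Array[]>;
  // Wraps the transcription call, e.g. openai.audio.transcriptions.create(...).
  transcribe(part: Uint8Array): Promise<string>;
  // Persists the resulting Transcript record.
  saveTranscript(record: TranscriptRecord): Promise<void>;
}

async function transcribeUpload(
  sessionId: string,
  file: Uint8Array,
  deps: TranscribeDeps,
): Promise<TranscriptRecord> {
  const parts = await deps.splitIfNeeded(file);

  // Transcribe parts one at a time so the text stays in recording order.
  const texts: string[] = [];
  for (const part of parts) {
    const text = (await deps.transcribe(part)).trim();
    if (text.length > 0) texts.push(text);
  }

  const record: TranscriptRecord = {
    sessionId,
    text: texts.join(" "),
    source: "audio",
  };
  await deps.saveTranscript(record);
  return record; // the caller can hand this to the synthesis pipeline
}
```

Transcribing parts sequentially is the simplest choice; running them in parallel would be faster but needs care with ordering and rate limits.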

Open Questions

  • Do we need speaker diarization (who said what)? Whisper's verbose_json format adds segment timestamps but no speaker labels, so this would mean using a dedicated service such as Deepgram or AssemblyAI.
  • Should we support real-time/streaming transcription, or batch-only?
  • File size limits: Whisper caps uploads at 25MB, so longer recordings need a chunking strategy (split on silence, fixed intervals, etc.).
  • Should transcription happen client-side (Whisper.cpp/WASM) to save API costs?
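
For the chunking question, the fixed-interval option is the easiest to reason about. The sketch below only plans chunk boundaries; actually cutting the audio (for example with a tool like ffmpeg) is out of scope here. `planChunks`, `maxChunkSeconds`, and `ChunkPlan` are hypothetical names, the size estimate assumes a roughly constant bitrate, and "25MB" is again assumed to mean 25 * 1024 * 1024 bytes.

```typescript
interface ChunkPlan {
  index: number;
  startSec: number;
  endSec: number;
}

// Longest chunk (in whole seconds) that stays under maxBytes,
// assuming a roughly constant bitrate in kilobits per second.
function maxChunkSeconds(
  bitrateKbps: number,
  maxBytes: number = 25 * 1024 * 1024,
): number {
  return Math.floor((maxBytes * 8) / (bitrateKbps * 1000));
}

// Plans fixed-length chunks. A small overlap means a word cut at one
// boundary still appears whole in the neighbouring chunk.
function planChunks(
  durationSec: number,
  chunkSec: number,
  overlapSec: number = 0,
): ChunkPlan[] {
  if (durationSec <= 0) return [];
  if (chunkSec <= 0 || overlapSec < 0 || overlapSec >= chunkSec) {
    throw new RangeError("chunkSec must be positive and larger than overlapSec");
  }

  const plans: ChunkPlan[] = [];
  const step = chunkSec - overlapSec;
  for (let start = 0, index = 0; start < durationSec; start += step, index++) {
    const end = Math.min(start + chunkSec, durationSec);
    plans.push({ index, startSec: start, endSec: end });
    if (end === durationSec) break; // last chunk reached the end of the recording
  }
  return plans;
}

// Example: a 25-minute recording in 10-minute chunks with 10 s of overlap.
const examplePlan = planChunks(1500, 600, 10);
console.log(examplePlan.map((c) => `${c.startSec}-${c.endSec}`).join(", "));
```

With overlap, the joined transcript will contain some duplicated words at each boundary, so the merge step would need to deduplicate them. Splitting on silence avoids that problem at the cost of an audio-analysis step.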