Setting Up OpenAI Whisper on Ubuntu for Telegram Voice Transcription
Created: Mon Mar 23
Updated: Mon Mar 23
# Setting Up OpenAI Whisper on Ubuntu for Telegram Voice Transcription
This documents the setup process for running OpenAI Whisper on an Ubuntu server (cc-web) to transcribe Telegram voice messages received by Jarvis.
## Context
Jarvis runs on a Hetzner Ubuntu server (cc-web) via Claude Code. The Telegram plugin delivers voice messages as `.oga` files to `~/.claude/channels/telegram/inbox/`. Whisper runs locally on that server to transcribe them before Jarvis responds.
## Problem: No pip on Ubuntu 24.04
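Ubuntu 24.04 (Noble) ships with Python 3.12 but without pip. In addition, PEP 668 marks the system Python as externally managed, so pip refuses to install packages outside a virtual environment unless `--break-system-packages` is passed.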
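To confirm which situation a given machine is in, a quick check along these lines may help. This is a sketch, not part of the original setup: the function name `pip_status` is made up for illustration, and it relies on the PEP 668 convention that the marker file `EXTERNALLY-MANAGED` sits in the interpreter's standard-library directory.

```bash
# Hypothetical helper (not from the original setup): report whether pip is
# importable and whether this Python is marked as externally managed (PEP 668).
pip_status() {
    if ! command -v python3 >/dev/null 2>&1; then
        echo "python3: missing"
        return 0
    fi
    if python3 -m pip --version >/dev/null 2>&1; then
        echo "pip: present"
    else
        echo "pip: missing"
    fi
    # PEP 668 places the marker file in the standard-library directory.
    stdlib_dir="$(python3 -c 'import sysconfig; print(sysconfig.get_path("stdlib"))')"
    if [ -f "$stdlib_dir/EXTERNALLY-MANAGED" ]; then
        echo "externally managed: yes"
    else
        echo "externally managed: no"
    fi
}
```

Running `pip_status` before and after Step 1 shows whether the bootstrap changed anything.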
## Solution
### Step 1: Install pip via bootstrap script
```bash
curl -sS https://bootstrap.pypa.io/get-pip.py -o /tmp/get-pip.py
python3 /tmp/get-pip.py --user --break-system-packages
```
This installs pip to `~/.local/bin/pip`.
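The later steps call pip and whisper by full path, which works even when `~/.local/bin` is not on `PATH` (Ubuntu's default `~/.profile` only adds it at login if the directory already exists). As an alternative, the directory can be prepended for the current shell. The helper name `add_local_bin_to_path` below is hypothetical and only meant as a sketch.

```bash
# Hypothetical helper: prepend ~/.local/bin to PATH for the current shell,
# unless it is already listed.
add_local_bin_to_path() {
    case ":$PATH:" in
        *":$HOME/.local/bin:"*) ;;   # already present, nothing to do
        *) PATH="$HOME/.local/bin:$PATH"; export PATH ;;
    esac
}

add_local_bin_to_path
```

After this, `pip` and (once installed) `whisper` should resolve without the `~/.local/bin/` prefix in that shell session.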
### Step 2: Install Whisper
```bash
~/.local/bin/pip install openai-whisper --break-system-packages
```
This also installs PyTorch and related CUDA packages (~2GB total).
### Step 3: Install ffmpeg (required for audio decoding)
```bash
sudo apt install ffmpeg -y
```
Whisper shells out to ffmpeg to load audio, so without it `.oga` and `.ogg` files (or any other input) cannot be decoded.
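Since a missing ffmpeg only surfaces when Whisper tries to load a file, it can be worth checking for it up front. A minimal sketch, using a hypothetical helper named `require_cmd`:

```bash
# Hypothetical helper: succeed only if the named command is on PATH,
# otherwise print a message to stderr and return non-zero.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing required command: $1" >&2
    return 1
}
```

For example, `require_cmd ffmpeg && ffmpeg -version | head -n 1` prints the installed ffmpeg version line, or an error if the package is absent.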
### Step 4: Transcribe a voice message
```bash
~/.local/bin/whisper <path-to-file.oga> \
--model small \
--language en \
--output_format txt \
--output_dir /tmp
```
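Output is written to `/tmp/<filename>.txt`, where `<filename>` is the input file's base name without its extension (for example, `voice.oga` becomes `/tmp/voice.txt`).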
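The same command can be wrapped in a loop over the Telegram inbox. The sketch below is an illustration rather than how Jarvis actually invokes Whisper: the variable names `WHISPER_BIN`, `INBOX_DIR`, and `OUT_DIR` and both function names are assumptions, and it treats an existing `.txt` file in the output directory as a sign that a message was already transcribed.

```bash
# Sketch only: WHISPER_BIN, INBOX_DIR and OUT_DIR are illustrative variable
# names; the defaults mirror the paths used in this document.
WHISPER_BIN="${WHISPER_BIN:-$HOME/.local/bin/whisper}"
INBOX_DIR="${INBOX_DIR:-$HOME/.claude/channels/telegram/inbox}"
OUT_DIR="${OUT_DIR:-/tmp}"

# Map an audio path to the transcript path Whisper writes:
# <output_dir>/<input base name without extension>.txt
transcript_path_for() {
    name="$(basename "$1")"
    echo "$2/${name%.*}.txt"
}

# Transcribe every .oga file in the inbox that has no transcript yet.
transcribe_inbox() {
    for audio in "$INBOX_DIR"/*.oga; do
        # If the glob matched nothing, the literal pattern is left; skip it.
        if [ ! -e "$audio" ]; then
            continue
        fi
        out="$(transcript_path_for "$audio" "$OUT_DIR")"
        # Skip messages that already have a transcript.
        if [ -f "$out" ]; then
            continue
        fi
        "$WHISPER_BIN" "$audio" \
            --model small \
            --language en \
            --output_format txt \
            --output_dir "$OUT_DIR"
    done
}
```

Calling `transcribe_inbox` processes only new messages, so it can be run repeatedly. Since `/tmp` is typically cleared on reboot, a persistent `OUT_DIR` may be preferable if the skip logic matters.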
## Notes
- CPU-only mode is used (no GPU on cc-web), so Whisper automatically falls back from FP16 to FP32
- The `small` model is a good balance of speed and accuracy (~461MB, see Model Options)
- Voice messages from Telegram arrive as `.oga` (Ogg Opus format)
- Whisper binary is at `~/.local/bin/whisper`
- Transcription takes ~5–15 seconds per short message on CPU
## Model Options
| Model | Size | Speed (CPU) | Accuracy |
|-------|------|-------------|----------|
| tiny | 75MB | Very fast | Lower |
| base | 142MB | Fast | Decent |
| small | 461MB | Moderate | Good |
| medium | 1.5GB | Slow | Better |
| large | 2.9GB | Very slow | Best |
`small` is recommended for voice messages on a CPU server.
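If the model is ever made configurable, a small guard can keep the default at `small` and catch typos before Whisper starts a download. This is a sketch with a hypothetical function name, `pick_model`, and it only accepts the five names from the table above. Whisper itself recognizes additional model names that are not listed here.

```bash
# Hypothetical helper: echo the model name to use, defaulting to "small",
# and reject names that are not in the table above.
pick_model() {
    model="${1:-small}"
    case "$model" in
        tiny|base|small|medium|large) echo "$model" ;;
        *) echo "unknown Whisper model: $model" >&2; return 1 ;;
    esac
}
```

It could then be used as `--model "$(pick_model "$WHISPER_MODEL")"`, where `WHISPER_MODEL` is an assumed, optional environment variable.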