Status: active | Workspace: jarvis | Created: Mon Mar 23 | Updated: Mon Mar 23

# Setting Up OpenAI Whisper on Ubuntu for Telegram Voice Transcription

This documents the setup process for running OpenAI Whisper on an Ubuntu server (cc-web) to transcribe Telegram voice messages received by Jarvis.

## Context

Jarvis runs on a Hetzner Ubuntu server (cc-web) via Claude Code. The Telegram plugin delivers voice messages as `.oga` files to `~/.claude/channels/telegram/inbox/`. Whisper runs locally on that server to transcribe them before Jarvis responds.

## Problem: No pip on Ubuntu 24.04

Ubuntu 24.04 (Noble) ships with Python 3.12 but no pip. In addition, PEP 668 marks the system Python as externally managed, so pip refuses system-level installs unless `--break-system-packages` is passed.

## Solution

### Step 1: Install pip via bootstrap script

```bash
curl -sS https://bootstrap.pypa.io/get-pip.py -o /tmp/get-pip.py
python3 /tmp/get-pip.py --user --break-system-packages
```

This installs pip to `~/.local/bin/pip`.

### Step 2: Install Whisper

```bash
~/.local/bin/pip install openai-whisper --break-system-packages
```

This also installs PyTorch and related CUDA packages (~2GB total).

### Step 3: Install ffmpeg (required for audio decoding)

```bash
sudo apt install ffmpeg -y
```

Whisper relies on ffmpeg to load audio, so it cannot decode `.oga` or `.ogg` files without it.

### Step 4: Transcribe a voice message

```bash
~/.local/bin/whisper <path-to-file.oga> \
  --model small \
  --language en \
  --output_format txt \
  --output_dir /tmp
```

Output is written to `/tmp/<filename>.txt`, where `<filename>` is the input file name without its extension.

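
Step 4 can be wrapped in a small helper so the transcript text comes back on stdout. This is a minimal sketch, not part of Whisper or the Telegram plugin: the function names `transcript_path_for` and `transcribe_voice` and the `WHISPER_BIN` override are hypothetical, and it assumes Whisper names its output after the input file with the extension replaced by `.txt`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative only.

# Assumed install location from Step 2; override via the environment if needed.
WHISPER_BIN="${WHISPER_BIN:-$HOME/.local/bin/whisper}"

# Print the path where the transcript for a given input file is expected.
# Usage: transcript_path_for <input-file> [output-dir]
transcript_path_for() {
  local input="$1" out_dir="${2:-/tmp}"
  local name
  name="$(basename "$input")"
  # Strip the last extension: voice.oga -> voice
  printf '%s/%s.txt\n' "$out_dir" "${name%.*}"
}

# Transcribe one file with the flags from Step 4 and print the transcript text.
# Usage: transcribe_voice <input-file> [output-dir]
transcribe_voice() {
  local input="$1" out_dir="${2:-/tmp}"
  [ -f "$input" ] || { echo "no such file: $input" >&2; return 1; }
  "$WHISPER_BIN" "$input" \
    --model small \
    --language en \
    --output_format txt \
    --output_dir "$out_dir" >/dev/null || return 1
  cat "$(transcript_path_for "$input" "$out_dir")"
}
```

After sourcing the file, `transcribe_voice <path-to-file.oga>` would print the transcript and leave the `.txt` file in `/tmp`.
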
## Notes

- CPU-only mode is used (no GPU on cc-web), so FP16 is automatically downgraded to FP32
- The `small` model is a good balance of speed and accuracy (~461MB)
- Voice messages from Telegram arrive as `.oga` (Ogg Opus format)
- Whisper binary is at `~/.local/bin/whisper`
- Transcription takes ~5–15 seconds per short message on CPU

## Model Options

| Model | Size | Speed (CPU) | Accuracy |
|-------|------|-------------|----------|
| tiny | 75MB | Very fast | Lower |
| base | 142MB | Fast | Decent |
| small | 461MB | Moderate | Good |
| medium | 1.5GB | Slow | Better |
| large | 2.9GB | Very slow | Best |

`small` is recommended for voice messages on a CPU server.
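
For more than one message, the inbox can be processed in a loop. The sketch below is one possible approach, not how Jarvis currently does it: the function names and the `INBOX`, `OUT_DIR`, `WHISPER_BIN`, and `WHISPER_MODEL` variables are hypothetical, and it assumes a transcript named `<filename>.txt` in the output directory means the file was already handled. Passing `--fp16 False` requests FP32 up front, which should avoid the CPU fallback warning.

```bash
#!/usr/bin/env bash
# Hypothetical batch sketch; variable and function names are illustrative only.

INBOX="${INBOX:-$HOME/.claude/channels/telegram/inbox}"
OUT_DIR="${OUT_DIR:-/tmp}"
WHISPER_BIN="${WHISPER_BIN:-$HOME/.local/bin/whisper}"
WHISPER_MODEL="${WHISPER_MODEL:-small}"   # any model name from the table above

# List inbox .oga files that have no matching transcript in OUT_DIR yet.
pending_voice_files() {
  local f name
  for f in "$INBOX"/*.oga; do
    [ -e "$f" ] || continue               # glob matched nothing
    name="$(basename "$f" .oga)"
    [ -f "$OUT_DIR/$name.txt" ] || printf '%s\n' "$f"
  done
}

# Run Whisper once per pending file, writing transcripts to OUT_DIR.
transcribe_pending() {
  pending_voice_files | while IFS= read -r f; do
    # stdin is redirected so the child process cannot consume the file list
    "$WHISPER_BIN" "$f" \
      --model "$WHISPER_MODEL" \
      --language en \
      --fp16 False \
      --output_format txt \
      --output_dir "$OUT_DIR" </dev/null >/dev/null
  done
}
```

Setting `WHISPER_MODEL=base` before calling `transcribe_pending` would trade some accuracy for speed without editing the script.
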