Implements high-accuracy speech-to-text transcription, translation, and speaker diarization using OpenAI's audio models.
This skill provides Claude Code with standardized patterns and best practices for integrating OpenAI's audio APIs into applications. It enables developers to implement robust transcription and translation features, handling complex tasks like speaker diarization, word-level timestamps, and subtitle generation (SRT/VTT). The skill also includes critical logic for managing large audio files through automated chunking and context preservation, ensuring high-quality, reliable outputs across various audio formats and file sizes.
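As a minimal sketch, the two endpoints can be called via the official `openai` Python SDK (v1 client). The `transcribe` helper and the 25 MB guard below are illustrative conveniences, not part of this skill; verify model names against the current API docs:

```python
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # per-request upload limit documented by OpenAI

def needs_chunking(size_bytes: int) -> bool:
    """True when a file exceeds the upload limit and must be split first."""
    return size_bytes > MAX_UPLOAD_BYTES

def transcribe(path: str, translate: bool = False) -> str:
    """Transcribe an audio file; with translate=True, output English text."""
    import os
    from openai import OpenAI  # reads OPENAI_API_KEY from the environment

    if needs_chunking(os.path.getsize(path)):
        raise ValueError("file over 25 MB: split it into chunks first")
    client = OpenAI()
    with open(path, "rb") as audio:
        if translate:
            # The translations endpoint always produces English output.
            return client.audio.translations.create(model="whisper-1", file=audio).text
        return client.audio.transcriptions.create(model="whisper-1", file=audio).text
```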
Key Features
- Advanced model selection for accuracy vs. cost optimization
- Support for SRT and VTT subtitle format generation
- Multi-speaker diarization and identification patterns
- Word-level and segment-level timestamping
- Automated audio chunking for files exceeding 25 MB
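The chunking feature above can be sketched as a boundary planner: windows overlap slightly so speech at a cut point is not lost mid-word, preserving context between chunks. The function name and default durations are hypothetical choices for illustration:

```python
def plan_chunks(duration_s: float, chunk_s: float = 600.0, overlap_s: float = 5.0):
    """Return (start, end) windows covering the audio, each at most chunk_s
    long, with overlap_s seconds of overlap to preserve boundary context."""
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break
        # Back up so the next chunk re-covers the tail of this one.
        start = end - overlap_s
    return bounds
```

Each window is then exported (e.g. with ffmpeg) and transcribed separately; the overlapping seconds let downstream merging deduplicate words cut at a boundary.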
Use Cases
- Translating non-English audio recordings directly into English text
- Creating accessibility-compliant subtitles for video content and editing workflows
- Generating searchable transcripts for meetings, interviews, and podcasts
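For the subtitle use case, segment-level timestamps map directly onto SRT blocks. A small formatter, assuming segments shaped like the API's `verbose_json` output (`start`, `end`, `text`); the helper names are illustrative:

```python
def srt_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timecode, e.g. 3661.5 -> '01:01:01,500'."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(segments) -> str:
    """Build an SRT document from dicts with 'start', 'end', and 'text'."""
    blocks = []
    for i, seg in enumerate(segments, 1):
        cue = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{i}\n{cue}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"
```

VTT output differs mainly in its `WEBVTT` header and the use of `.` instead of `,` in timecodes, so the same segment data serves both formats.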