Why use native speech-to-speech instead of a pipeline?

Native speech-to-speech models (such as Grok Voice Agent or OpenAI Realtime) avoid the added latency and information loss of converting audio to text and back again, which makes responses faster and more natural.

Can I transcribe long audio files with this skill?

Yes. It includes patterns for Gemini 2.5 Pro, which can handle up to 9.5 hours of audio natively with built-in speaker diarization.
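As a rough sketch of how the 9.5-hour figure might be used in practice, the helper below checks whether a recording fits in a single request and, if not, how many parts it would need. The limit is the one quoted above; the function names `fits_in_one_request` and `parts_needed` are hypothetical and are not part of the skill or of any Gemini SDK.

```python
import math

# Limit quoted in the answer above: up to 9.5 hours of audio per request.
MAX_AUDIO_SECONDS = 9.5 * 3600  # 34,200 seconds


def fits_in_one_request(duration_seconds: float) -> bool:
    """Return True if a recording is within the quoted 9.5-hour limit."""
    return 0 < duration_seconds <= MAX_AUDIO_SECONDS


def parts_needed(duration_seconds: float) -> int:
    """Return how many parts a recording must be split into so each part fits."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return math.ceil(duration_seconds / MAX_AUDIO_SECONDS)
```

For example, a 12-hour recording exceeds the limit, so `parts_needed(12 * 3600)` reports that it would have to be split before transcription.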

Does this skill support speaker diarization?

Yes, it provides implementation guides for both AssemblyAI and Gemini, allowing you to distinguish between multiple speakers in a recording.
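To illustrate what diarized output can look like once it reaches your application, here is a minimal sketch that turns speaker-tagged utterances into a labeled transcript. The `Utterance` shape and the `format_transcript` helper are assumptions made for this example; actual AssemblyAI or Gemini responses use their own field names and would need to be mapped into this form first.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    # Hypothetical shape for illustration; real provider responses differ.
    speaker: str   # diarization label, e.g. "A" or "B"
    start_ms: int  # start time of the utterance in milliseconds
    text: str


def format_transcript(utterances: list[Utterance]) -> list[str]:
    """Merge consecutive turns by the same speaker and label each line."""
    lines: list[str] = []
    last_speaker = None
    for u in sorted(utterances, key=lambda item: item.start_ms):
        if u.speaker == last_speaker:
            # Same speaker is still talking: extend the current line.
            lines[-1] += " " + u.text
            continue
        minutes, seconds = divmod(u.start_ms // 1000, 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] Speaker {u.speaker}: {u.text}")
        last_speaker = u.speaker
    return lines
```

Each returned line starts with the timestamp of the speaker's first utterance in that turn, which is usually enough for a readable meeting transcript.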

What is the best model for real-time voice assistants?

Grok Voice Agent is currently the fastest, with a time-to-first-audio (TTFA) under 1 second, which makes it a good fit for low-latency assistants. Gemini Live is the stronger choice when emotional awareness matters more.

Audio Language Models

Name: Audio Language Models
Author: yonatangross


Data Science and ML

Implements real-time voice agents, high-accuracy transcription, and expressive text-to-speech using native speech-to-speech models.

This skill provides production-ready implementation patterns for integrating industry-leading audio AI models into your applications. It enables developers to build ultra-low latency voice assistants using Grok Voice Agent, emotional conversational interfaces via Gemini Live, and advanced transcription systems with AssemblyAI or GPT-4o. By leveraging native speech-to-speech capabilities rather than traditional STT-LLM-TTS pipelines, it ensures natural interaction speeds, multilingual support, and sophisticated features like speaker diarization, sentiment analysis, and auditory cues for lifelike AI interactions.

Key Features

01 Expressive text-to-speech (TTS) with style prompts and pacing control

02 Advanced speaker diarization and sentiment analysis using AssemblyAI

03 69 GitHub stars

04 Long-form transcription support for up to 9.5 hours of audio with Gemini 2.5 Pro

05 Native speech-to-speech patterns for ultra-low latency (<1s) voice agents

06 Real-time emotional awareness and barge-in support for natural conversations

Use Cases

01 Generating automated meeting transcripts with speaker labels and highlights

02 Developing multilingual voice-to-voice translation services

03 Building real-time AI phone support agents and conversational assistants


Install with 🐟 Skill.Fish

npx skillfish add yonatangross/orchestkit audio-language-models
