# Audio Processing
Audio is a critical component of video analysis, providing context, dialogue, and cues that are not available visually. The Video Analyzer uses a robust pipeline to extract and transcribe audio.
## 1. Audio Extraction
The primary tool for audio extraction is FFmpeg. The `AudioProcessor` class uses FFmpeg to perform several key operations in a single command:

- It strips the video track (`-vn`).
- It converts the audio into a standard, uncompressed format (`-acodec pcm_s16le`).
- It resamples the audio to 16 kHz (`-ar 16000`), the optimal rate for Whisper models.
- It converts the audio to mono (`-ac 1`) to simplify processing.
- The final output is saved as a temporary `audio.wav` file.
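The command described above can be sketched as follows. The function names and the output path are illustrative, not the actual `AudioProcessor` API:

```python
import subprocess

def build_ffmpeg_cmd(video_path: str, out_path: str = "audio.wav") -> list[str]:
    """Assemble the FFmpeg invocation described above."""
    return [
        "ffmpeg",
        "-i", video_path,        # input video file
        "-vn",                   # strip the video track
        "-acodec", "pcm_s16le",  # uncompressed 16-bit PCM
        "-ar", "16000",          # resample to 16 kHz for Whisper
        "-ac", "1",              # downmix to mono
        out_path,
    ]

def extract_audio(video_path: str, out_path: str = "audio.wav") -> str:
    """Run FFmpeg and return the path of the extracted WAV file."""
    subprocess.run(build_ffmpeg_cmd(video_path, out_path), check=True)
    return out_path
```

Doing everything in one FFmpeg pass avoids intermediate files and keeps the temporary footprint to a single small WAV.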
If FFmpeg is not installed or fails, the system falls back to the `pydub` library, which can perform similar conversions, though FFmpeg is recommended for performance and reliability.
If the video file contains no audio streams, this stage is skipped gracefully.
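The fallback decision might look like the sketch below. The helper names are hypothetical, not the project's actual API; only `shutil.which` and the pydub calls are real library functions:

```python
import shutil

def pick_extractor() -> str:
    """Prefer FFmpeg on the PATH; otherwise try the pydub fallback."""
    if shutil.which("ffmpeg"):
        return "ffmpeg"
    try:
        from pydub import AudioSegment  # noqa: F401 -- probe the fallback
        return "pydub"
    except ImportError:
        return "none"  # neither tool available

def extract_with_pydub(video_path: str, out_path: str = "audio.wav") -> str:
    """Mirror the FFmpeg settings with pydub: 16 kHz, mono, 16-bit WAV."""
    from pydub import AudioSegment
    audio = AudioSegment.from_file(video_path)
    (audio.set_frame_rate(16000)
          .set_channels(1)
          .set_sample_width(2)  # 2 bytes = 16-bit samples
          .export(out_path, format="wav"))
    return out_path
```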
## 2. Audio Transcription
Transcription is handled by `faster-whisper`, an optimized implementation of OpenAI's Whisper model. This provides high-quality transcriptions that can run efficiently on a CPU or be accelerated on a GPU.
Key features of the transcription process include:
- **Model Selection:** The user can choose the Whisper model size (from `tiny` to `large`) to balance speed and accuracy. The default is `medium`.
- **Language Detection:** By default, Whisper automatically detects the language of the audio. Users can also specify a language code (e.g., `en`, `es`) to improve accuracy if the language is known.
- **Word Timestamps:** The transcription process generates timestamps for each word, allowing for potential future features that sync text with video events.
- **Voice Activity Detection (VAD):** The system uses VAD to filter out long periods of silence, making the transcription process more efficient and the final transcript cleaner.
- **Output:** The process generates an `AudioTranscript` object containing the full text, a list of timed segments, and the detected language.
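A minimal sketch of the output structure is shown below. The `AudioTranscript` field names are assumptions rather than the project's exact definitions, and the commented `WhisperModel` call indicates where faster-whisper would supply the segments:

```python
from dataclasses import dataclass

@dataclass
class TranscriptSegment:
    start: float  # seconds
    end: float
    text: str

@dataclass
class AudioTranscript:
    text: str                          # full transcript
    segments: list[TranscriptSegment]  # timed segments
    language: str                      # detected or user-specified

def build_transcript(raw_segments, language: str) -> AudioTranscript:
    """Assemble an AudioTranscript from (start, end, text) tuples.

    With faster-whisper, the tuples would come from something like:
        model = WhisperModel("medium")
        segments, info = model.transcribe(
            "audio.wav", vad_filter=True, word_timestamps=True)
    """
    segs = [TranscriptSegment(s, e, t.strip()) for s, e, t in raw_segments]
    return AudioTranscript(" ".join(seg.text for seg in segs), segs, language)
```

Keeping the assembled transcript as a plain dataclass makes it easy to hand off to the reconstruction stage without carrying the model objects along.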
This transcript is then passed to the final Video Reconstruction stage, where the LLM integrates the spoken content with the visual analysis to create a complete summary.