Architecture Overview

The Video Analyzer is designed as a multi-stage pipeline that processes a video's visual and audio components to generate a cohesive summary. Each stage builds upon the output of the previous one, culminating in a detailed narrative description.

Design Diagram

Pipeline flow: Frame Extraction & Audio Processing → Frame Analysis → Video Reconstruction.

Core Workflow

The system operates in three primary stages:

1. Frame Extraction & Audio Processing

This initial stage is responsible for deconstructing the video into its fundamental components: key visual moments and spoken audio.

  • Frame Extraction: The tool uses OpenCV to read the video frame by frame. Rather than processing every frame, it identifies and extracts keyframes—frames that represent significant changes in the visual scene—which reduces redundancy and focuses the analysis on moments of action or transition. One possible selection heuristic is sketched after this list; see the Frame Extraction page for details.
  • Audio Processing: In parallel, the audio track is extracted from the video with FFmpeg and fed into a local Whisper model, which produces a timestamped text transcript. The system includes confidence checks to handle poor-quality audio gracefully (see the second sketch below and the Audio Processing page for details).
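
As a rough illustration of keyframe selection, here is a minimal sketch based on the mean absolute pixel difference between frames. The differencing heuristic and the `diff_threshold` value are illustrative assumptions, not the project's actual algorithm:

```python
import cv2
import numpy as np

def extract_keyframes(video_path: str, diff_threshold: float = 30.0):
    """Yield (frame_index, frame) pairs that differ noticeably from the
    previously kept frame (a stand-in for the real selection logic)."""
    cap = cv2.VideoCapture(video_path)
    prev_gray = None
    index = 0
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Keep the first frame, then any frame whose mean absolute
            # difference from the last keyframe exceeds the threshold.
            if prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > diff_threshold:
                prev_gray = gray
                yield index, frame
            index += 1
    finally:
        cap.release()
```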

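For the audio side, a comparable sketch uses FFmpeg via `subprocess` and the open-source `whisper` package. The `no_speech_prob` filter stands in for the confidence checks mentioned above; the threshold is an illustrative assumption:

```python
import subprocess
import whisper  # the openai-whisper package

def transcribe_audio(video_path: str, model_name: str = "base") -> list[dict]:
    """Extract the audio track with FFmpeg, then transcribe it with a
    local Whisper model, returning timestamped segments."""
    audio_path = "audio.wav"
    # Whisper expects 16 kHz mono audio; -vn drops the video stream.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
         "-ar", "16000", audio_path],
        check=True,
    )
    result = whisper.load_model(model_name).transcribe(audio_path)
    # Each segment carries start/end timestamps; no_speech_prob offers a
    # crude confidence signal for skipping likely-silent segments.
    return [
        {"start": s["start"], "end": s["end"], "text": s["text"].strip()}
        for s in result["segments"]
        if s.get("no_speech_prob", 0.0) < 0.6
    ]
```
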
2. Frame Analysis

Once the keyframes are extracted, each image is sent to a vision-capable large language model (LLM) for individual analysis. This analysis is context-aware.

  • Contextual Analysis: Each frame is analyzed together with the descriptions of the frames that came before it. This lets the LLM understand the progression of events, track objects and people, and build a chronological narrative rather than describing each image in isolation (one way to thread this context is sketched after this list).
  • Prompt-Driven: A specialized prompt template (frame_analysis.txt) guides the LLM to focus on specific details like setting, actions, new information, and continuity points, ensuring a structured and consistent analysis for every frame.
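
To make the context threading concrete, here is a minimal sketch. `analyze_image` is a placeholder for whatever vision-model client the tool actually uses, and the `previous_analyses` field is an assumption about how the frame_analysis.txt template is parameterized:

```python
from pathlib import Path

def analyze_frames(keyframes, analyze_image,
                   prompt_path: str = "prompts/frame_analysis.txt"):
    """Describe each keyframe with the descriptions of earlier frames as
    context, so the model narrates a progression rather than isolated
    images. `analyze_image(frame, prompt)` is a stand-in for the real
    vision-LLM call."""
    template = Path(prompt_path).read_text()
    analyses = []  # list of (frame_index, description)
    for index, frame in keyframes:
        # Feed everything said so far back in as running context.
        context = "\n".join(f"Frame {i}: {desc}" for i, desc in analyses)
        prompt = template.format(previous_analyses=context or "None yet.")
        analyses.append((index, analyze_image(frame, prompt)))
    return analyses
```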

3. Video Reconstruction

The final stage synthesizes all the collected information into a single, comprehensive summary of the video.

  • Data Aggregation: The system gathers all the individual frame analyses and the complete audio transcript.
  • Final Synthesis: This combined data is sent to the LLM one last time with a final prompt (describe.txt). The model's task is to weave the chronological frame descriptions and the transcript into a coherent, human-readable summary that describes the video from beginning to end, as in the sketch below.
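
A minimal sketch of this last step, under the same caveats: `complete_text` is a placeholder for the actual LLM call, and the `frame_analyses`/`transcript` fields are assumptions about how the describe.txt template is parameterized:

```python
from pathlib import Path

def synthesize_summary(frame_analyses, segments, complete_text,
                       prompt_path: str = "prompts/describe.txt"):
    """Merge the per-frame descriptions and the timestamped transcript
    into one prompt and return the model's narrative summary.
    `complete_text(prompt)` is a stand-in for the real LLM call."""
    notes = "\n".join(f"Frame {i}: {desc}" for i, desc in frame_analyses)
    transcript = "\n".join(
        f"[{seg['start']:.1f}s-{seg['end']:.1f}s] {seg['text']}"
        for seg in segments
    )
    template = Path(prompt_path).read_text()
    prompt = template.format(frame_analyses=notes, transcript=transcript)
    return complete_text(prompt)
```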