Output Format Explained

After a successful analysis, the Video Analyzer generates a single JSON file named analysis.json in the specified output directory. This file contains a structured representation of the entire analysis process.

Top-Level Structure

The JSON object has four main keys:

metadata: Information about the analysis job itself.
transcript: The full audio transcript, if available.
frame_analyses: A list of detailed descriptions for each extracted keyframe.
video_description: The final, synthesized summary of the video.
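As a minimal sketch of how a consumer might read the file, the Python snippet below loads analysis.json and checks that the four top-level keys are present. The names `load_analysis` and `missing_keys` are hypothetical helpers written for this illustration; they are not part of the Video Analyzer itself.

```python
import json
from pathlib import Path

# The four top-level keys described above.
EXPECTED_KEYS = ("metadata", "transcript", "frame_analyses", "video_description")


def missing_keys(data):
    """Return the expected top-level keys that are absent from the parsed JSON."""
    return [key for key in EXPECTED_KEYS if key not in data]


def load_analysis(path):
    """Parse analysis.json and return (data, list_of_missing_keys)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data, missing_keys(data)
```

A key whose value is null (for example, transcript) still counts as present here; only a key that is absent from the object is reported as missing.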

Key Breakdown

`metadata`

This object contains settings and statistics for the analysis run.

client: The LLM client used (ollama or openai_api).
model: The specific vision model used for analysis.
whisper_model: The size of the Whisper model used for transcription.
frames_per_minute: The target number of frames to extract per minute.
duration_processed: The duration of the video processed, in seconds. null if the full video was processed.
frames_extracted: The total number of keyframes extracted from the video.
frames_processed: The number of frames that were actually sent to the LLM for analysis.
start_stage: The processing stage the analysis started from.
audio_language: The language code detected or used for transcription (e.g., en).
transcription_successful: A boolean indicating if a transcript was generated.
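The metadata fields can be condensed into a one-line run summary. The sketch below is one possible way to do that; `describe_run` is a hypothetical helper, and it assumes only the field names and the null convention for duration_processed described above.

```python
def describe_run(metadata):
    """Build a short, human-readable summary of an analysis run."""
    duration = metadata.get("duration_processed")
    # null (None in Python) means the full video was processed.
    scope = "full video" if duration is None else f"{duration} s processed"
    frames = f"{metadata['frames_processed']}/{metadata['frames_extracted']} frames analyzed"
    has_transcript = "yes" if metadata.get("transcription_successful") else "no"
    return " | ".join([
        f"{metadata['client']}/{metadata['model']}",
        scope,
        frames,
        f"transcript: {has_transcript}",
    ])
```

Comparing frames_processed with frames_extracted is a quick way to spot runs in which not every extracted keyframe reached the LLM.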

`transcript`

This object contains the results of the audio transcription. It will be null if the video has no audio or if transcription failed.

text: The full, concatenated text of the entire transcript.
segments: A list of audio segments, each containing:
- text: The transcribed text for that segment.
- start: The start time of the segment in seconds.
- end: The end time of the segment in seconds.
- words: A list of individual words, each an object with the fields word, start, end (both in seconds), and probability.
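Because transcript can be null, code that reads it should check for that case first. The following sketch shows one way to print timestamped segments and to flag low-probability words. Both function names and the 0.5 threshold are assumptions chosen for illustration.

```python
def format_segments(transcript):
    """Return one '[start-end] text' line per segment; empty list if no transcript."""
    if transcript is None:
        return []
    return [
        f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text'].strip()}"
        for seg in transcript.get("segments", [])
    ]


def low_confidence_words(transcript, threshold=0.5):
    """Collect words whose probability falls below the (arbitrary) threshold."""
    if transcript is None:
        return []
    return [
        word["word"].strip()
        for seg in transcript.get("segments", [])
        for word in seg.get("words", [])
        if word["probability"] < threshold
    ]
```

The calls to strip() remove the leading space that appears before each word and segment in the sample output.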

`frame_analyses`

This is a list of JSON objects, one for each processed keyframe. Each object is the raw response from the LLM client.

model: The model that generated the response.
created_at: Timestamp of the analysis.
response: The text description generated by the LLM for that specific frame.
Any other keys in this object (such as done and total_duration) are specific to the LLM client (e.g., Ollama) and provide additional metadata about the generation process.
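The sketch below pulls the description text out of each frame entry and, where the client-specific counters are present, estimates generation speed. It assumes the Ollama convention that eval_duration is reported in nanoseconds; other clients may omit these keys, in which case the function returns None. Both helper names are hypothetical.

```python
def frame_responses(frame_analyses):
    """Return the description text for each processed keyframe, in order."""
    return [frame.get("response", "") for frame in frame_analyses]


def tokens_per_second(frame):
    """Estimate generation speed from Ollama-style counters, or None if absent."""
    count = frame.get("eval_count")
    duration_ns = frame.get("eval_duration")  # assumed to be nanoseconds (Ollama)
    if not count or not duration_ns:
        return None
    return count / (duration_ns / 1e9)
```

Treating the client-specific keys as optional keeps the same code usable when the output comes from a different client.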

`video_description`

This object contains the final summary of the video, generated by synthesizing the frame_analyses and the transcript.

model: The model that generated the final summary.
response: The full text of the video description.
Like frame_analyses, this may contain other client-specific metadata keys.
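Most consumers only need the summary text itself. A minimal sketch, assuming the parsed file is available as a Python dict and guarding against a missing or null video_description (a defensive assumption, not documented behavior):

```python
def summary_text(analysis):
    """Return the final video description, or an empty string if unavailable."""
    description = analysis.get("video_description") or {}
    return description.get("response", "").strip()
```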

Sample `analysis.json` File

Below is an example of the output file. Note that frame_analyses shows only one entry here, although the metadata reports five processed frames.

{
  "metadata": {
    "client": "ollama",
    "model": "llama3.2-vision",
    "whisper_model": "medium",
    "frames_per_minute": 60,
    "duration_processed": null,
    "frames_extracted": 5,
    "frames_processed": 5,
    "start_stage": 1,
    "audio_language": "en",
    "transcription_successful": true
  },
  "transcript": {
    "text": " I'm scared!",
    "segments": [
      {
        "text": " I'm scared!",
        "start": 1.78,
        "end": 2.24,
        "words": [
          {
            "word": " I'm",
            "start": 1.78,
            "end": 2.04,
            "probability": 0.4382356107234955
          },
          {
            "word": " scared!",
            "start": 2.04,
            "end": 2.24,
            "probability": 0.9464112520217896
          }
        ]
      }
    ]
  },
  "frame_analyses": [
    {
      "model": "llama3.2-vision",
      "created_at": "2024-12-18T23:14:35.871404545Z",
      "response": "Frame 0\n\nSetting/Scene: A person with long blonde hair, wearing a pink t-shirt and yellow shorts, stands in front of a black plastic tub or container on wheels. The ground appears to be covered in wood chips.\n\nAction/Movement: The person is facing away from the camera, looking down at something inside the tub. Their left hand is resting on their hip, while their right arm hangs loosely by their side.\n\nNew Information: There are no new objects or people visible in this frame. However, there appears to be some greenery and possibly fruit scattered around the ground behind the person.\n\nContinuity Points:\n\n* The person's pink t-shirt matches the color of the shirt worn by the person in the background of Frame 1.\n* The black plastic tub on wheels is also present in Frame 1.\n* The wood chips covering the ground are consistent with those seen in Frame 1.\n\nKey Continuation Point: Watch for the person to pick up an object from the tub and examine it more closely.",
      "done": true,
      "done_reason": "stop",
      "total_duration": 7952576674,
      "load_duration": 2623794964,
      "prompt_eval_count": 349,
      "prompt_eval_duration": 1787000000,
      "eval_count": 207,
      "eval_duration": 3317000000
    }
  ],
  "video_description": {
    "model": "llama3.2-vision",
    "created_at": "2024-12-18T23:15:06.166299111Z",
    "response": "**Video Summary**\n\nDuration: 5 minutes and 67 seconds\n\nThe video begins with a person with long blonde hair, wearing a pink t-shirt and yellow shorts, standing in front of a black plastic tub or container on wheels. The ground appears to be covered in wood chips.\n\nAs the video progresses, the person remains facing away from the camera, looking down at something inside the tub. Their left hand is resting on their hip, while their right arm hangs loosely by their side. There are no new objects or people visible in this frame, but there appears to be some greenery and possibly fruit scattered around the ground behind the person.\n\nThe black plastic tub on wheels is present throughout the video, and the wood chips covering the ground remain consistent with those seen in Frame 0. The person's pink t-shirt matches the color of the shirt worn by the person in Frame 0.\n\nAs the video continues, the person remains stationary, looking down at something inside the tub. There are no significant changes or developments in this frame.\n\nThe key continuation point is to watch for the person to pick up an object from the tub and examine it more closely.\n\n**Key Continuation Points:**\n\n*   The person's pink t-shirt matches the color of the shirt worn by the person in Frame 0.\n*   The black plastic tub on wheels is also present in Frame 0.\n*   The wood chips covering the ground are consistent with those seen in Frame 0.",
    "done": true,
    "done_reason": "stop",
    "total_duration": 3877694027,
    "load_duration": 10604604,
    "prompt_eval_count": 1705,
    "prompt_eval_duration": 558000000,
    "eval_count": 297,
    "eval_duration": 3308000000
  }
}
