AI-Powered-Video-Analysis-with-Object-Detection-and-Detailed-Scene-Narratives

An AI-driven video analysis system that extracts and transcribes audio with Whisper, detects objects with YOLOv8, and generates scene descriptions with GPT-2. The project combines the transcription and the object detections to produce a detailed, context-aware narrative of the video.

Video processing & object detection using YOLOv8

The video is read frame by frame, and one frame is sampled for each second of video (1 frame per second).
Each sampled frame is passed through the YOLOv8 model.
Objects detected with a confidence greater than or equal to 0.7 are stored in a list.
The detections (bounding boxes, class names, and confidence scores) are saved as JSON data for each sampled frame.
The result is a detections.json file, which stores the objects detected at 1-second intervals.
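
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names (frames_to_sample, filter_detections, detect_video), the yolov8n.pt weights file, and the JSON key names (frame_N, class_name, confidence, bbox) are assumptions made for the example. OpenCV and Ultralytics are imported inside detect_video so the two small helpers can be used without them.

```python
import json

CONF_THRESHOLD = 0.7  # confidence cut-off stated in this README


def frames_to_sample(total_frames: int, fps: float) -> list[int]:
    """Return the frame indices that fall on each whole second of video."""
    if fps <= 0:
        return []
    step = max(1, round(fps))  # number of frames per second of video
    return list(range(0, total_frames, step))


def filter_detections(raw: list[dict], threshold: float = CONF_THRESHOLD) -> list[dict]:
    """Keep only detections whose confidence is >= threshold."""
    return [d for d in raw if d["confidence"] >= threshold]


def detect_video(video_path: str, weights: str = "yolov8n.pt",
                 out_path: str = "detections.json") -> dict:
    """Run YOLOv8 on one frame per second and save the detections as JSON.

    The weights file and the output schema are assumptions for this sketch.
    """
    import cv2                    # pip install opencv-python
    from ultralytics import YOLO  # pip install ultralytics

    model = YOLO(weights)
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise FileNotFoundError(f"Could not open video: {video_path}")

    fps = cap.get(cv2.CAP_PROP_FPS)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    results = {}
    for second, idx in enumerate(frames_to_sample(total, fps)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # jump to the sampled frame
        ok, frame = cap.read()
        if not ok:
            break
        result = model(frame, verbose=False)[0]
        raw = [
            {
                "class_name": result.names[int(box.cls[0])],
                "confidence": float(box.conf[0]),
                "bbox": [float(v) for v in box.xyxy[0].tolist()],  # x1, y1, x2, y2
            }
            for box in result.boxes
        ]
        results[f"frame_{second}"] = filter_detections(raw)
    cap.release()

    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
    return results
```

Seeking with CAP_PROP_POS_FRAMES avoids decoding every frame, at the cost of slightly slower random access on some codecs; reading sequentially and skipping frames would be an equally valid choice.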

Audio transcription with timestamps

Model Transcription: The line result = model.transcribe(audio_path, language="en", temperature=0.6, verbose=True) performs the transcription using the Whisper model:
audio_path: Path to the audio file.
language="en": Specifies that the audio is in English.
temperature=0.6: Controls the randomness of decoding. A higher value introduces more randomness, and a lower value makes the output more deterministic. Here it is set to 0.6 as a compromise between the two.
verbose=True: Prints the decoded text to the console during transcription.
Whisper returns start and end times for each spoken segment, and the transcription is saved with these timestamps to a text file.
Each line of that file has the format: [start_time - end_time] transcribed_text.
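
A minimal sketch of this step is shown below. The transcribe call uses the same arguments as the line quoted above; everything else is an assumption for illustration: the helper names format_segments and transcribe_to_file, the "base" model size, and the two-decimal timestamp formatting (times in seconds). The output file name follows the one mentioned later in this README.

```python
def format_segments(segments: list[dict]) -> list[str]:
    """Format Whisper segments as '[start_time - end_time] transcribed_text' lines."""
    return [
        f"[{seg['start']:.2f} - {seg['end']:.2f}] {seg['text'].strip()}"
        for seg in segments
    ]


def transcribe_to_file(audio_path: str,
                       out_path: str = "transcription_with_timestamps.txt",
                       model_name: str = "base") -> list[str]:
    """Transcribe an audio file with Whisper and save timestamped lines.

    The model size ("base") is an assumption; the README does not state it.
    """
    import whisper  # pip install openai-whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, language="en", temperature=0.6, verbose=True)

    # result["segments"] is a list of dicts with "start", "end", and "text" keys.
    lines = format_segments(result["segments"])
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return lines
```

For example, a segment running from 0 to 2.5 seconds with the text "Hello there." would be written as [0.00 - 2.50] Hello there.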

GPT-2 Model and Tokenizer Initialization:

The GPT-2 model and tokenizer are loaded from the transformers library to generate coherent text descriptions based on the given inputs.
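
Loading the model might look like the sketch below. The function name load_gpt2 and the "gpt2" checkpoint (the smallest public GPT-2) are assumptions; the README does not say which GPT-2 variant the notebook uses.

```python
DEFAULT_MODEL_NAME = "gpt2"  # assumed checkpoint; larger variants such as "gpt2-medium" also exist


def load_gpt2(model_name: str = DEFAULT_MODEL_NAME):
    """Load a GPT-2 language model and its tokenizer from the transformers library."""
    from transformers import GPT2LMHeadModel, GPT2Tokenizer  # pip install transformers torch

    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)

    # GPT-2 has no padding token by default; reusing the end-of-sequence
    # token is a common workaround that avoids warnings during generation.
    tokenizer.pad_token = tokenizer.eos_token

    model.eval()  # inference only, so disable dropout
    return model, tokenizer
```

The first call downloads the weights from the Hugging Face Hub, so it needs network access unless the checkpoint is already cached.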

Cleaning Input Text:

The function clean_input(input_text) cleans the detections summary and the transcribed text by removing unwanted or special characters, so that the input passed to GPT-2 is ASCII-only.
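
One possible implementation of such a function is sketched below. The README only states that the result is ASCII-only with special characters removed, so the exact set of characters kept here (letters, digits, whitespace, and basic punctuation) and the whitespace collapsing are assumptions.

```python
import re


def clean_input(input_text: str) -> str:
    """Return an ASCII-only version of the text with special characters removed."""
    # Drop every character that cannot be represented in ASCII.
    ascii_only = input_text.encode("ascii", errors="ignore").decode("ascii")

    # Keep letters, digits, whitespace, and basic punctuation (assumed set).
    cleaned = re.sub(r"[^A-Za-z0-9\s.,:;!?'()\[\]-]", "", ascii_only)

    # Collapse the runs of whitespace left behind by removed characters.
    return re.sub(r"\s+", " ", cleaned).strip()
```

Square brackets are kept so that timestamps and bounding boxes in the input survive the cleaning step.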

Combining Detections and Transcriptions:

In the generate_description function, detections from the video and the transcribed text from the audio are combined into a formatted string (description_input), which serves as input to GPT-2 to generate the final description.
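
A rough sketch of this step follows. The prompt wording, the helper build_description_input, the generate_description signature (which here takes the model and tokenizer as arguments), and all sampling settings are assumptions for illustration; only the name description_input and the overall idea come from the description above.

```python
GPT2_CONTEXT = 1024  # maximum number of tokens GPT-2 can attend to


def build_description_input(detections_summary: str, transcribed_text: str) -> str:
    """Combine detections and transcription into a single prompt string."""
    return (
        "Objects detected in the video:\n"
        f"{detections_summary}\n"
        "Transcribed audio:\n"
        f"{transcribed_text}\n"
        "Description of the video:"
    )


def generate_description(detections_summary: str, transcribed_text: str,
                         model, tokenizer, max_new_tokens: int = 150) -> str:
    """Generate a description with a loaded GPT-2 model and tokenizer."""
    description_input = build_description_input(detections_summary, transcribed_text)

    # Truncate the prompt so prompt + generated tokens fit in the context window.
    inputs = tokenizer(
        description_input,
        return_tensors="pt",
        truncation=True,
        max_length=GPT2_CONTEXT - max_new_tokens,
    )
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,            # sampling settings are illustrative
        top_p=0.9,
        temperature=0.7,
        no_repeat_ngram_size=2,    # discourages GPT-2's tendency to loop
        pad_token_id=tokenizer.eos_token_id,
    )

    # Decode only the newly generated tokens, not the prompt itself.
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True).strip()
```

Because GPT-2 is limited to 1024 tokens, a long video with many detections may have its prompt truncated; summarizing detections per frame before building the prompt helps keep it short.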

Processing Detections File & Transcribed Text:

The function process_detections_file(detections_file_path, transcribed_audio_file_path) loads the object detections (from detections.json) and the transcribed text (from transcription_with_timestamps.txt), sorts the detections by frame, and formats the detected objects along with their bounding box positions.
It then calls generate_description to generate a coherent and detailed description of the video, based on both the visual and auditory data.
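
The loading, sorting, and formatting part of this function could be sketched as below. The helper names (frame_number, summarize_detections, load_inputs) and the JSON layout (keys such as frame_2 mapping to lists of objects with class_name and bbox fields) are assumptions; the actual structure of detections.json may differ.

```python
import json
import re


def frame_number(frame_key: str) -> int:
    """Extract the numeric part of a key such as 'frame_12' for sorting."""
    match = re.search(r"\d+", frame_key)
    return int(match.group()) if match else 0


def summarize_detections(detections: dict) -> str:
    """Format detections frame by frame, in numeric frame order."""
    lines = []
    for key in sorted(detections, key=frame_number):
        objects = detections[key]
        if not objects:
            continue  # skip frames with no confident detections
        parts = []
        for obj in objects:
            bbox = [round(v) for v in obj["bbox"]]  # whole pixels keep the prompt short
            parts.append(f"{obj['class_name']} at {bbox}")
        lines.append(f"{key}: " + ", ".join(parts))
    return "\n".join(lines)


def load_inputs(detections_file_path: str, transcribed_audio_file_path: str):
    """Load both files and return (detections_summary, transcribed_text)."""
    with open(detections_file_path) as f:
        detections = json.load(f)
    with open(transcribed_audio_file_path, encoding="utf-8") as f:
        transcribed_text = f.read().strip()
    return summarize_detections(detections), transcribed_text
```

Sorting on the extracted number rather than the raw key matters because a plain string sort would place frame_10 before frame_2.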

Output:

The final result is a natural-language description of the video, created by GPT-2, that integrates both visual elements from object detection and the context from the transcribed audio.

Limitations and Future Improvements:

This project currently uses the GPT-2 model to generate detailed descriptions based on object detections and transcribed audio from videos. While GPT-2 is a versatile model, it has limitations in generating coherent and contextually rich descriptions, particularly for complex tasks like video analysis. The results may not fully capture the nuances of the content due to the model’s smaller size and reduced capacity for long-range contextual understanding. To improve the quality of the generated descriptions, a more powerful model, such as GPT-3.5 or GPT-4, could be integrated. These models are better suited for handling complex tasks and generating richer, more coherent outputs. Additionally, fine-tuning on a relevant dataset or incorporating other natural language generation techniques could further enhance the results.
