AyeshaRafiq229/AI-Powered-Video-Analysis-with-Object-Detection-and-Detailed-Scene-Narratives

AI-Powered-Video-Analysis-with-Object-Detection-and-Detailed-Scene-Narratives

AI-driven video analysis system that extracts and transcribes audio with Whisper, detects objects using YOLO, and generates comprehensive scene descriptions with GPT-2. The project combines transcriptions and object detections to produce detailed, context-aware video narratives.

Video processing & object detection using YOLOv8

The video is read frame by frame, but only one frame per second is sampled.
Each sampled frame is passed through the YOLOv8 model.
Objects detected in the frame with a confidence greater than or equal to 0.7 are stored in a list.
The detections (bounding boxes, class names, and confidence scores) are saved as JSON data for each frame.
The result is a detections.json file, which stores the objects detected in each frame at 1-second intervals.
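The sampling and filtering steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the helper names (sample_frame_indices, keep_confident, build_record, to_json) and the JSON keys (frame, time_sec, class_name, confidence, bbox) are assumptions, and the raw detections are ordinary dictionaries standing in for YOLOv8 output.

```python
import json

CONFIDENCE_THRESHOLD = 0.7  # threshold stated in this README


def sample_frame_indices(total_frames, fps):
    """Return the index of the first frame of every whole second."""
    step = max(1, round(fps))  # number of frames that make up one second
    return list(range(0, total_frames, step))


def keep_confident(raw_detections, threshold=CONFIDENCE_THRESHOLD):
    """Keep only detections whose confidence is >= threshold."""
    return [d for d in raw_detections if d["confidence"] >= threshold]


def build_record(frame_index, fps, raw_detections):
    """Build one JSON-serializable record for a sampled frame.

    The key names are hypothetical; the README only says that bounding
    boxes, class names, and confidence scores are stored per frame.
    """
    return {
        "frame": frame_index,
        "time_sec": frame_index / fps,
        "detections": keep_confident(raw_detections),
    }


def to_json(records):
    """Serialize all frame records, e.g. for writing to detections.json."""
    return json.dumps(records, indent=2)
```

In a real run, each sampled frame would come from a video reader and each raw detection from the model's prediction for that frame; only the filtering and record layout are shown here.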

Audio transcription with timestamps

Model Transcription: The line result = model.transcribe(audio_path, language="en", temperature=0.6, verbose=True) performs the transcription using the Whisper model:
audio_path: Path to the audio file.
language="en": Specifies that the audio is in English.
temperature=0.6: Sets the sampling temperature used while decoding. A higher value makes the output more random, and a lower value makes it more deterministic. Here it is set to 0.6 as a compromise between the two.
verbose=True: Enables detailed output during transcription.
Whisper returns start and end times for each spoken segment, and the transcription is saved with these timestamps to a text file.
The transcriptions are saved in the format: [start_time - end_time] transcribed_text.
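The output format above can be produced from the segments list that Whisper returns, where each segment carries start, end, and text fields. The sketch below is an assumption about how the file might be written: the helper names and the two-decimal time formatting are illustrative, not taken from the project.

```python
def format_segments(segments):
    """Turn Whisper-style segments into '[start - end] text' lines.

    Each segment is expected to be a mapping with 'start', 'end'
    (seconds, as floats) and 'text' keys.
    """
    lines = []
    for seg in segments:
        text = seg["text"].strip()  # segment text often starts with a space
        lines.append(f"[{seg['start']:.2f} - {seg['end']:.2f}] {text}")
    return lines


def save_transcription(segments, output_path):
    """Write one formatted line per segment to a text file."""
    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n".join(format_segments(segments)) + "\n")
```

With a transcription result in hand, a call such as save_transcription(result["segments"], "transcription_with_timestamps.txt") would write the file that the later steps read.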

GPT-2 Model and Tokenizer Initialization:

The GPT-2 model and tokenizer are loaded from the transformers library to generate coherent text descriptions based on the given inputs.

Cleaning Input Text:

The function clean_input(input_text) is used to clean the detections summary and transcribed text, removing any unwanted or special characters, ensuring the input is ASCII-only.
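One way to get ASCII-only text is to encode to ASCII and drop anything that cannot be represented. The sketch below reuses the name clean_input for readability, but it is an assumption about the behavior, not the project's implementation; the whitespace collapsing in particular is an added convenience.

```python
import re


def clean_input(input_text):
    """Return an ASCII-only version of input_text with tidy whitespace."""
    # Drop every character that has no ASCII representation.
    ascii_only = input_text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse runs of whitespace (assumed extra step, not from the README).
    return re.sub(r"\s+", " ", ascii_only).strip()
```

Note that this approach removes accented letters outright rather than transliterating them, so "café" becomes "caf".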

Combining Detections and Transcriptions:

In the generate_description function, detections from the video and the transcribed text from the audio are combined into a formatted string (description_input), which serves as input to GPT-2 to generate the final description.
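The combining step can be illustrated with a small string-building helper. The function name build_description_input and the prompt wording are hypothetical; the README only states that detections and transcribed text are merged into a formatted string called description_input.

```python
def build_description_input(detections_summary, transcribed_text):
    """Merge visual and audio information into one prompt string.

    The section labels below are an assumed template, chosen so the
    language model sees the detections first, then the transcript,
    then a cue to start describing the video.
    """
    return (
        "Objects detected in the video:\n"
        f"{detections_summary.strip()}\n\n"
        "Transcribed audio:\n"
        f"{transcribed_text.strip()}\n\n"
        "Description of the video:"
    )
```

The resulting string would then be tokenized and passed to the model. GPT-2 has a context window of 1024 tokens, so long detection summaries or transcripts would likely need to be truncated before generation.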

Processing Detections File & Transcribed Text:

The function process_detections_file(detections_file_path, transcribed_audio_file_path) loads the object detections (from detections.json) and transcribed text (from transcription_with_timestamps.txt), sorts them by frame, and formats the detected objects along with their bounding box positions.
It then calls generate_description to generate a coherent and detailed description of the video, based on both the visual and auditory data.
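The loading, sorting, and formatting described above might look like the following sketch. It assumes that detections.json holds a list of records shaped like {"frame": ..., "detections": [{"class_name": ..., "confidence": ..., "bbox": [x1, y1, x2, y2]}]}; that layout and the helper names load_inputs and summarize_detections are assumptions, since the README does not show the file's schema.

```python
import json


def load_inputs(detections_file_path, transcribed_audio_file_path):
    """Read the detections JSON and the timestamped transcript."""
    with open(detections_file_path, encoding="utf-8") as f:
        records = json.load(f)
    with open(transcribed_audio_file_path, encoding="utf-8") as f:
        transcript = f.read()
    return records, transcript


def summarize_detections(records):
    """Format detections frame by frame, in ascending frame order."""
    lines = []
    for record in sorted(records, key=lambda r: r["frame"]):
        for det in record["detections"]:
            x1, y1, x2, y2 = det["bbox"]  # assumed corner format
            lines.append(
                f"Frame {record['frame']}: {det['class_name']} "
                f"at ({x1}, {y1}, {x2}, {y2})"
            )
    return "\n".join(lines)
```

The summary string and the transcript are the two inputs that the description step needs.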

Output:

The final result is a natural-language description of the video, created by GPT-2, that integrates both visual elements from object detection and the context from the transcribed audio.

Limitations and Future Improvements:

This project currently uses the GPT-2 model to generate detailed descriptions based on object detections and transcribed audio from videos. While GPT-2 is a versatile model, it has limitations in generating coherent and contextually rich descriptions, particularly for complex tasks like video analysis. The results may not fully capture the nuances of the content due to the model’s smaller size and reduced capacity for long-range contextual understanding. To improve the quality of the generated descriptions, a more powerful model, such as GPT-3.5 or GPT-4, could be integrated. These models are better suited for handling complex tasks and generating richer, more coherent outputs. Additionally, fine-tuning on a relevant dataset or incorporating other natural language generation techniques could further enhance the results.
