Add chapters to video transcripts#200
nicolevanderhoeven wants to merge 29 commits into `open-telemetry:main`
Conversation
- Add create_chapters() function to generate YouTube-style timestamps using OpenAI
- Integrate chapters section into markdown output between summary and transcript
- Add comprehensive rate limiting to avoid YouTube API quota issues
- Implement get_video_transcript_with_retry() with exponential backoff
- Add robust error handling for quota exceeded and API failures
- Improve transcript validation and filtering ([Music], [Applause], etc.)
- Fix Japanese language code from 'jp' to 'ja'
- Increase batch size from 10 to 50 for better efficiency
- Add progress indicators and better logging throughout
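The validation/filtering item above boils down to stripping non-speech markers from transcript snippets; a minimal sketch (the function name and the exact marker list are my own assumptions, not the PR's code):

```python
import re

# Bracketed non-speech markers that YouTube auto-captions insert;
# the set of markers covered here is an assumption for illustration.
NON_SPEECH = re.compile(r"\[(?:Music|Applause|Laughter)\]", re.IGNORECASE)

def clean_snippets(snippets):
    """Drop non-speech markers and discard snippets left empty."""
    cleaned = []
    for text in snippets:
        text = NON_SPEECH.sub("", text).strip()
        if text:
            cleaned.append(text)
    return cleaned
```

Filtering before summarization keeps the cleaned transcript (and anything derived from it, like chapters) free of caption noise.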
- Document new chapters/timestamps generation feature
- Add comprehensive usage examples with command-line options
- Document output format and file structure
- Add Features & Reliability section covering rate limiting and error handling
- Clarify OpenAI API key requirements and AI-enhanced features
- Document multi-language transcript support
Update pydantic_core from 2.39.0 to 2.33.2 to match the version required by pydantic 2.11.9, resolving pip installation error.
Major changes:
- Upgrade youtube-transcript-api from 0.6.3 to 1.2.3 (fixes empty response issue)
- Update code to use new 1.x API (YouTubeTranscriptApi().fetch())
- Remove deprecated fallback to old static methods
- Enhance error handling for 429 rate limits with 60-120s delays
- Detect XML parse errors as potential rate limiting
- Increase max retries from 3 to 5 attempts

Root cause: YouTube changed their API, and version 0.6.3 was returning empty responses (not rate limiting). The library can now successfully fetch transcripts and generate chapters.
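The retry strategy described (up to 5 attempts, with long waits for 429 rate limits) can be sketched as exponential backoff with jitter. Everything below (names, delay bounds, the injectable `sleep` for testability) is illustrative, not the PR's exact implementation:

```python
import random
import time

def fetch_with_retry(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `fetch` until it succeeds or retries are exhausted.

    Delays grow exponentially (1x, 2x, 4x, ...) with random jitter so
    concurrent clients don't retry in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) * random.uniform(1.0, 2.0)
            sleep(delay)
```

Passing `sleep` as a parameter lets unit tests record the computed delays instead of actually waiting.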
Reduced unnecessary delays now that transcript API is fixed:
- Remove initial 5-15s startup delay
- Reduce pagination delays from 3-8s/5-15s to 1-3s
- Reduce inter-video delays from 10-30s to 2-5s
- Reduce API call separation from 10-20s to 2-4s

Keep essential protections:
- YouTube Data API quota error handling (60s retry)
- 429 rate limit detection and handling (60-120s retry)
- XML parse error detection
- Small delays to avoid API hammering

Result: ~3-5x faster processing while maintaining API safety.
- Detect IP block errors specifically (vs rate limiting)
- Stop retries immediately when IP is blocked (no point retrying)
- Add comprehensive troubleshooting section to README
- Provide clear workaround options for users
- Import TooManyRequests exception for better error handling

IP blocks are different from rate limits and require different solutions, like waiting 24-48 hours, switching networks, or using cookie auth.
@avillela, @danielgblanco, and @reese-lee, would you take a look at this PR when you get a chance? Thank you!
reese-lee
left a comment
Thank you for doing this. It's interesting to see the differences between some of the old AI-generated transcripts and the new one.
I think it works in general for most videos, but I did notice that with the Humans of OTel interviews, it summarizes everyone's thoughts instead of leaving them individual, whereas the whole point of a video like that IS to showcase the individuals.
Thanks @reese-lee, yeah, I think human-generated summaries/chapters are still the best... but in my experience they often don't get done. I think the AI-generated ones are a good starting point at least! Some videos will probably still be exceptions, and it's hard to include that in a general prompt for all videos.
Hi @nicolevanderhoeven, we reviewed some of the summarized transcripts, and have a couple questions:
- Add clean_ai_preamble() function to remove conversational preamble lines
- Remove phrases like 'Sure! Here are the key moments...' from chapter output
- Preserve only actual timestamp lines in generated transcripts
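A preamble cleaner of this kind can be as simple as keeping only timestamp-prefixed lines. A hedged sketch (the function name follows the commit message, but the regex and behavior are assumptions, not the PR's code):

```python
import re

# A line counts as a chapter entry if it starts with MM:SS or HH:MM:SS
# followed by a title; everything else is treated as AI chatter.
TIMESTAMP_LINE = re.compile(r"^\d{1,2}:\d{2}(?::\d{2})?\s+\S")

def clean_ai_preamble(chapter_text):
    """Drop conversational lines, keeping only timestamped chapter lines."""
    lines = chapter_text.splitlines()
    return "\n".join(line for line in lines if TIMESTAMP_LINE.match(line.strip()))
```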
- Insert chapter timestamps inline in cleaned transcripts at matching text positions
- Add video duration constraint to prevent AI from generating timestamps beyond video length
- Improve transcript cleanup to preserve exact wording and speaker names
- Add post-processing validation to verify chapter timestamps match content
- Limit timestamp corrections to ±60 second window around original time
- Use window-based matching with chapter titles for accurate timestamp placement
- Lower AI temperature for more accurate and deterministic results
- Move chapter limit to CRITICAL INSTRUCTIONS section
- Use stronger mandatory language (MUST, NO MORE than 10)
- Add reminder at end of prompt to reinforce limit
- Prevents AI from generating 19-24 chapters per video
- Replace segment-based windowing with simpler 10-second intervals
- Remove complex window_key logic and seen_times tracking
- Create windows directly at 10-second intervals within search range
- Makes timestamp finding more predictable and easier to debug
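The simplified windowing can be sketched as fixed 10-second buckets laid over the search range (names and the snapping behavior are illustrative, not the PR's actual code):

```python
def make_windows(search_start, search_end, interval=10):
    """Return (start, end) pairs at fixed 10-second intervals.

    The range start is snapped down to a multiple of the interval so
    windows line up predictably regardless of where the search begins.
    """
    start = (search_start // interval) * interval
    return [(t, t + interval) for t in range(start, search_end, interval)]
```

Because every window boundary is a multiple of the interval, debugging a mismatched timestamp means inspecting one predictable bucket rather than tracing dynamic segment keys.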
- Test timeline building with various intervals and edge cases
- Test filtering of [Music] and [Applause] markers
- Test timestamp parsing and formatting utilities
- Test roundtrip conversions and precision handling
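The parsing/formatting utilities under test presumably resemble the following roundtrip pair (a sketch under that assumption, not the repo's actual helpers):

```python
def seconds_to_timestamp(seconds):
    """Format whole seconds as HH:MM:SS (the style YouTube chapters use)."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

def timestamp_to_seconds(ts):
    """Parse HH:MM:SS or MM:SS back into an integer number of seconds."""
    parts = [int(p) for p in ts.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours (and minutes) with zero
    h, m, s = parts
    return h * 3600 + m * 60 + s
```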
- Add venv/, .venv/, env/, ENV/ directories
- Add .env file (for API keys/secrets)
- Prevents accidentally committing large virtual environments
@reese-lee Thanks for reviewing! I've just done another pass to address your comments.
Sure is! I've just added it here.
Oops, I just put in some logic to add the timestamps of the chapters within the transcript itself, but now I'm rereading this and thinking that you meant the descriptions, not just the timestamps. Before I change this, I just wanted to double-check what you're asking for here. Currently (here's an example of a generated transcript), there is a chapter at the beginning. Would you prefer for that line to look like:

### Guest introduction: Diana

**Reese:** Diana, welcome. And thank you guys.

instead? Do you want the timestamp there at that point too, or just the heading for the chapter description?
@nicolevanderhoeven I think the example you had there would be great. Having the chapters in the transcripts as different sections would be great :) So, instead of Something like this?
```python
videos = []
next_page_token = None
page_count = 0
max_pages = 1  # Limit to 1 page (50 videos max) to avoid pagination issues
```
If this is only 1, and we want to limit this scrape to 50 videos for the whole channel, then do we need the while loop? If getting all videos for a channel is challenging, I'd opt for removing this method (and associated docs) and thinking about how we can add it safely.
```python
if i < len(videos) - 1:  # Don't sleep after the last video
    delay = random.uniform(2, 5)  # Random delay between 2-5 seconds
    print(f"Waiting {delay:.1f} seconds before next video...")
    time.sleep(delay)
```
Why do we need to wait? Is it related to limits?
Yep, this is a proactive delay to avoid hitting the YouTube rate limits. It might not be such a big deal if people are just running the script to fetch a few new videos, but I ran into rate limits a lot when regenerating the transcripts for 44 videos. I recommend keeping the delay even if it does slow down generation.
There is quite a lot of code related to text transformation contributed in this PR, considering we already have

I don't have access to ChatGPT to test this, but I've created a Gemini Gem with the same system prompt we're currently using here, and added the following to the prompt:

With that, I got the following result with the latest OTel Night in Berlin:

I'm not against having any necessary code here, but considering we already use ChatGPT for the transcript cleanup, I think we can use it to create these chapters and link back to the seconds?
- Add timeline skeleton to chapter generation prompt showing actual timestamps every 30s
- Remove skip logic for 00:00:00 chapter heading in transcript insertion
- Reuse existing time-to-text mapping logic for consistency
- Fixes issue where chapter timestamps could be off by several minutes
- Pass video summary to chapter generation to guide topic selection
- Add explicit guidance to skip small talk not mentioned in summary
- Fix timeline sampling to cover entire video duration dynamically
  * Was limited to 40 samples (20 minutes) regardless of video length
  * Now uses ~55 samples distributed across full video duration
  * Sample interval adjusts based on video length (min 15s)
- Update prompt to emphasize reviewing ENTIRE video before selecting chapters
- Add better logging showing sample count, interval, and video duration
- Fixes issues with:
  * Chapters for unimportant small talk (e.g., weather discussion)
  * Missing chapters in second half of longer videos
  * AI only seeing beginning of video
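The dynamic sampling fix can be sketched as deriving the interval from the video length and flooring it at 15 seconds, so samples always span the whole video (illustrative names, not the PR's code):

```python
def sample_times(duration_seconds, target_samples=55, min_interval=15):
    """Return ~target_samples timestamps spread across the full video.

    Short videos fall back to the 15-second floor; long videos widen the
    interval so the samples still cover the entire duration instead of
    only the first 20 minutes.
    """
    interval = max(min_interval, duration_seconds // target_samples)
    return list(range(0, duration_seconds, interval))
```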
- Remove unused TooManyRequests import and defensive fallback
- Add TestYouTubeAPIErrorHandling class to validate exception API contract
- Tests ensure YouTube API exceptions are importable and catchable
- Will catch breaking changes if youtube-transcript-api is upgraded
…ards

- Add max_pages parameter (default 100) to prevent infinite loops
- Add max_retries (3) for quota exceeded errors
- Track page_count and retry_count for better control flow
- Make loop condition explicit: while page_count < max_pages
- Reset retry_count on successful requests
- Raise RuntimeError when max retries exceeded

This addresses reviewer feedback about the risks of using while True loops.
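The bounded loop described in this commit can be sketched as follows; `fetch_page` and the caught error type stand in for the real YouTube Data API call and its quota exception:

```python
def get_all_pages(fetch_page, max_pages=100, max_retries=3):
    """Paginate with explicit bounds instead of `while True`.

    `fetch_page(token)` returns (items, next_token) and may raise on
    quota errors; next_token of None means the last page was reached.
    """
    items, token, page_count, retry_count = [], None, 0, 0
    while page_count < max_pages:
        try:
            page_items, token = fetch_page(token)
        except RuntimeError:
            retry_count += 1
            if retry_count > max_retries:
                raise RuntimeError("max retries exceeded")
            continue  # retry the same page, not the next one
        retry_count = 0  # reset on each successful request
        items.extend(page_items)
        page_count += 1
        if token is None:
            break
    return items
```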
- Extract video ID fetching logic into _fetch_playlist_video_ids
- Extract video details fetching logic into _fetch_video_details_batch
- Simplify main get_playlist_videos function to coordinate the two helpers
- Each function now has a single, clear responsibility
- Improve code readability and maintainability
- Removed 200ms delay in _fetch_video_details_batch
- Removed 1-3 second delays in _fetch_playlist_video_ids and get_channel_videos
- Removed 2-4 second delay between playlist and channel fetches in main
- Fixed retry bug in _fetch_video_details_batch where continue would skip to the next batch instead of retrying the failed one
- Added proper retry loop with max_retries tracking per batch
- 403 quota handlers now properly catch and retry after 60 second wait
- Add explanatory comment in transcripts.py explaining that the 2-5s delay prevents YouTube rate limiting (429 errors)
- Update README to accurately reflect actual delays used (2-5s, not 10-30s)
- Clarify distinction between proactive delays and reactive retry waits
- Condense multiline IP blocking error messages to 2 lines
- Shorten rate limiting error messages
- Simplify XML parse error messages
- All changes reference README for detailed troubleshooting
- Instruct AI to output only the chapter list without preamble text
- Prevent conversational phrases like 'Here are the chapters' or 'Sure, I'll help'
- Addresses reviewer feedback to handle this in the prompt rather than post-processing
- Extract _build_time_to_text_mapping() for building time-to-text mapping
- Extract _get_window_texts() for getting text from time windows
- Extract _extract_key_words() for keyword extraction with filtering
- Extract _calculate_line_score() for scoring line matches
- Extract _find_best_insertion_line() for finding best insertion points
- Extract _build_transcript_with_chapters() for final transcript assembly
- Main function now acts as orchestrator (36 lines, down from 113)
- Fix syntax error on line 318 (invalid character in return statement)

Benefits: improved testability, readability, and maintainability
Agree with @danielgblanco. Don't think we need this atm.
@nicolevanderhoeven before we continue reviewing this PR, I would like to understand the core problem to solve. From experimentation it looks like the transcript cleanup process via ChatGPT can potentially solve this problem with a modification to the system prompt. Do you think the code contributed can benefit us in the long term, as opposed to solely relying on LLMs to handle the transcript cleanup?
@avillela and @danielgblanco: Are your objections to the idea of having chapters in video transcripts in general, or to the text transformation code vs. just adding to the AI prompt?

If it's the first: the intention of this PR is to automatically generate chapters in a format such that they can be copied and pasted into the YouTube description of a video. When this is done, YouTube adds sections to the video that:
Here's a video explanation on whether chapters are useful. I apologize if this is information you already know, but I thought it would be good to get it out here and make sure we're on the same page. :) YouTube also expects the chapters in a specific format, otherwise they don't get parsed correctly. So the "Table of Contents" example posted by Daniel might be useful when people are reading through the transcript, but not for the YouTube video itself.

If your objections are instead about the second (the text transformation code vs. a pure AI prompt approach): I definitely considered doing it entirely via AI. What I found was that the AI was inconsistently bad. At this point I've generated and regenerated all 44 of the transcripts for the OTel channel multiple times, and when it was purely an AI prompt, with no checks or text transformations, both the format and the accuracy suffered. It would often pick chapter descriptions that were not at all useful despite my "only choose important moments" prompt, or the chapter description did not match the timestamp. So I'd see things like:

Those sorts of errors really erode the usefulness of chapters, so gradually I added more and more checks and tried to catch some things in code when I could. My thinking was that formatting and tests are things that can be done by code, whereas determining the key moments and chapters to highlight is something that AI still does well and would be difficult to do with code alone. I only tested with OpenAI, so I'm not sure if Gemini does a better job.

Personally I'd be more comfortable keeping the checks and text transformation code in, just for consistency. I do this for my own videos and find it extremely useful not to have to manually check the chapters, so I really value the script getting it right the first time, and I'm okay with having a bit more code to do so. Let me know how you'd like to proceed.
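For reference, the format constraints YouTube applies when parsing chapters out of a description (first chapter at 0:00, timestamps in ascending order, at least three chapters) can be checked mechanically. A minimal validator sketch, not exhaustive, with illustrative names:

```python
import re

# A chapter line: optional hours, minutes:seconds, then a title.
CHAPTER = re.compile(r"^(\d{1,2}:)?\d{1,2}:\d{2}\s+.+")

def valid_youtube_chapters(lines):
    """Check the main rules YouTube uses to recognize chapter lists:
    at least three entries, first at 0:00, strictly ascending times."""
    if len(lines) < 3 or not all(CHAPTER.match(l) for l in lines):
        return False

    def secs(line):
        parts = [int(p) for p in line.split()[0].split(":")]
        return sum(p * 60 ** i for i, p in enumerate(reversed(parts)))

    times = [secs(l) for l in lines]
    return times[0] == 0 and times == sorted(set(times))
```

A check like this could run as a final gate on generated chapters before they're pasted into a description, catching formatting drift regardless of which model produced them.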
Thank you so much for all the context @nicolevanderhoeven. We (@open-telemetry/sig-end-user-maintainers) have discussed this. Our main concern was not about adding chapters; we think that's very valuable and a great idea. However, I must personally say I had glossed over the fact that this can also help add descriptions back to YouTube, so it needs to be in a specific format. So, this is then doubly useful!

Our concern is about having a substantial amount of code added here (larger than the current script) for the purpose of adding chapters to summaries that have already been generated via ChatGPT. The main reason we're hesitant is that we need to ensure the code that's in this repo solves issues that cannot be solved in other off-the-shelf ways. We'd be much in favour of extending the system prompt we give ChatGPT in the cleanup, when calling

It's clear that the non-determinism of LLMs can be an issue, but I'd argue that it's an issue in both the summaries and the chapter identification. If we can make it more deterministic by tuning parameters, I think we can benefit in both summaries and chapter identification. We don't foresee having to re-do all transcripts every time we change this code in the future, so we also think there's an element of reviewing new video transcripts (and chapters) as they're generated in new PRs.
Thanks for the clarification, @danielgblanco! I understand your concern more clearly now. Let me take another pass at it (tweaking the temperature is something I didn't try) with that in mind! :)
This PR adds chapter generation to the video transcript Python script originally added here, adding a new function that uses OpenAI to create chapters (timestamped sections of the video). When added to the YouTube description of a video, these chapters allow viewers to skip to relevant portions of the video and also give everyone a better idea of what the video entails.
In addition, I made the following improvements to video transcription:
- Added an `env.example` to show what the `.env` file should look like.