Merge master changes #140
Conversation
Fix vqa text-only pass
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the VQA benchmarking framework by enabling dynamic measurement of input token counts and generating token-matched dummy text for text-only baseline comparisons. These changes improve the precision and flexibility of VQA performance evaluations across various AI providers, allowing for more robust analysis of vision encoder latency. Additionally, provider configurations were updated to support newer vision models and improve client stability.
Code Review
This pull request introduces a new methodology for benchmarking Visual Question Answering (VQA) models by measuring vision encoder latency, comparing time-to-first-token (TTFT) of multimodal and text-only requests. It includes helper methods for dummy text generation and input token extraction. Please note that the security review could not be completed as the provided model response was malformed and did not adhere to the specified JSON schema. Furthermore, a critical bug in the dummy text generation has been identified that could invalidate benchmark results, and there are opportunities for improving robustness and code quality.
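To make the comparison concrete, here is a minimal sketch of the TTFT-based measurement the review describes. The helper name measure_ttft and the payload variables are placeholders for illustration, not code from this PR.

import time

def measure_ttft(stream_request, payload):
    """Hypothetical helper: seconds from sending a streaming request until the first chunk arrives."""
    start = time.perf_counter()
    for _first_chunk in stream_request(payload):
        return time.perf_counter() - start
    return None

# Vision-encoder latency is estimated as the TTFT gap between a multimodal request
# and a text-only request whose dummy prompt is padded to the same input-token count:
# vision_encoder_latency = measure_ttft(send, multimodal_payload) - measure_ttft(send, text_only_payload)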
if not self._vqa_corpus_words:
    url = "https://www.gutenberg.org/cache/epub/2701/pg2701.txt"
    try:
        text = requests.get(url, timeout=10).text
        all_ids = tokenizer.encode(text, add_special_tokens=False)
        special_ids = set(tokenizer.all_special_ids)  # Filter out special tokens
        self._vqa_corpus_words = [tid for tid in all_ids if tid not in special_ids]
    except Exception as e:
        print(f"Fallback to dummy tokens: {e}")
        self._vqa_corpus_words = tokenizer.encode("The quick brown fox jumps over the lazy dog", add_special_tokens=False)

# Build dummy text
sampled_ids = random.choices(self._vqa_corpus_words, k=total_tokens)
There is a critical bug here. The _vqa_corpus_words attribute is shared across all models, but it stores token IDs that are specific to a particular tokenizer. When get_vqa_dummy_text is called for a second model with a different tokenizer, it will incorrectly reuse the token IDs from the first model's tokenizer. This will lead to incorrect dummy text generation and invalidate the benchmark results.
To fix this, _vqa_corpus_words should be a dictionary keyed by the normalized model_id, similar to _vqa_tokenizers. You will also need to update __init__ to initialize self._vqa_corpus_words = {}.
Suggested change:

# Load corpus for the specific tokenizer if not already loaded
if not self._vqa_corpus_words.get(model_id):
    url = "https://www.gutenberg.org/cache/epub/2701/pg2701.txt"
    try:
        text = requests.get(url, timeout=10).text
        all_ids = tokenizer.encode(text, add_special_tokens=False)
        special_ids = set(tokenizer.all_special_ids)  # Filter out special tokens
        self._vqa_corpus_words[model_id] = [tid for tid in all_ids if tid not in special_ids]
    except Exception as e:
        print(f"Fallback to dummy tokens: {e}")
        self._vqa_corpus_words[model_id] = tokenizer.encode("The quick brown fox jumps over the lazy dog", add_special_tokens=False)

# Build dummy text
sampled_ids = random.choices(self._vqa_corpus_words[model_id], k=total_tokens)
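For completeness, here is a minimal sketch of the __init__ change the comment above calls for. The class name VQABenchmark is a placeholder; only the attribute names come from the diff.

class VQABenchmark:  # placeholder name; the real class lives in this PR's benchmark module
    def __init__(self):
        # Per-model caches, keyed by normalized model_id, so each tokenizer
        # gets its own tokenizer instance and its own corpus of token IDs.
        self._vqa_tokenizers = {}
        self._vqa_corpus_words = {}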
last_chunk = response[-1]
usage = last_chunk.get('usage_metadata', {})
return usage.get('prompt_token_count', 0)
This implementation assumes that usage_metadata is always present in the last chunk of the streaming response. If for some reason it's in an earlier chunk (e.g., followed by empty chunks), this will fail to find it and return 0. To make this more robust, it's better to iterate backwards through the response to find the last chunk containing usage data, similar to the implementation for the AWS and Azure providers.
# The usage metadata is usually in the last chunk, but we search backwards for robustness.
for chunk in reversed(response):
    if 'usage_metadata' in chunk:
        usage = chunk.get('usage_metadata', {})
        return usage.get('prompt_token_count', 0)
return 0
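To see how the backwards scan behaves when the usage data is followed by trailing chunks, here is a small self-contained example; the chunk dictionaries are made up for illustration and only mirror the shape used in the snippet above.

# Hypothetical streaming response where usage data sits before an empty trailing chunk.
response = [
    {"text": "Hello"},
    {"text": " world", "usage_metadata": {"prompt_token_count": 42}},
    {"text": ""},
]

prompt_tokens = 0
for chunk in reversed(response):
    if 'usage_metadata' in chunk:
        prompt_tokens = chunk.get('usage_metadata', {}).get('prompt_token_count', 0)
        break

print(prompt_tokens)  # 42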
elif 'gemini-3' in lowered_id:
    model_id = 'gemini-3'
else:
    raise Exception('Invalid model for VQA input type.')
Using a more specific exception type like ValueError instead of the generic Exception is a good practice as it allows for more granular error handling. Including the invalid model_id in the error message would also be helpful for debugging.
Suggested change:

raise ValueError(f'Invalid model for VQA input type: {model_id}')
final_ids = tokenizer.encode(dummy_text, add_special_tokens=False)
if len(final_ids) < total_tokens:
    dummy_text += "." * (total_tokens - len(final_ids))
The current padding logic is not guaranteed to produce text with the exact total_tokens count. The added . characters can be merged with the preceding token by the tokenizer, resulting in a lower-than-expected token count. This could affect the accuracy of the benchmark. A more robust approach is to iteratively pad and re-tokenize until the target count is met, and then truncate if you've overshot.
final_ids = tokenizer.encode(dummy_text, add_special_tokens=False)
# Iteratively adjust to ensure the token count is precise.
while len(final_ids) < total_tokens:
    dummy_text += " ."  # Adding a space helps prevent merging.
    final_ids = tokenizer.encode(dummy_text, add_special_tokens=False)
# If we overshot, truncate the token list and decode back to text.
if len(final_ids) > total_tokens:
    dummy_text = tokenizer.decode(final_ids[:total_tokens], clean_up_tokenization_spaces=False)
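As a quick sanity check of the pad-then-truncate approach, the following standalone snippet runs it against a public tokenizer; the transformers dependency, the "gpt2" checkpoint, and the sample text are assumptions made purely for this example, not part of the PR.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer with encode/decode works

dummy_text = "The quick brown fox"
total_tokens = 10

final_ids = tokenizer.encode(dummy_text, add_special_tokens=False)
while len(final_ids) < total_tokens:
    dummy_text += " ."  # separated padding token, so it won't merge with the previous one
    final_ids = tokenizer.encode(dummy_text, add_special_tokens=False)
if len(final_ids) > total_tokens:
    dummy_text = tokenizer.decode(final_ids[:total_tokens], clean_up_tokenization_spaces=False)

# Re-encoding should land on (or very close to) the target; some tokenizers may
# drift by a token after a decode/encode round trip.
print(len(tokenizer.encode(dummy_text, add_special_tokens=False)))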