Merge master changes #140
Conversation
Fix vqa text-only pass
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the VQA benchmarking framework by enabling dynamic measurement of input token counts and generating token-matched dummy text for text-only baseline comparisons. These changes improve the precision and flexibility of VQA performance evaluations across various AI providers, allowing for more robust analysis of vision encoder latency. Additionally, provider configurations were updated to support newer vision models and improve client stability.
Code Review
This pull request introduces a new methodology for benchmarking Visual Question Answering (VQA) models by measuring vision encoder latency, comparing time-to-first-token (TTFT) of multimodal and text-only requests. It includes helper methods for dummy text generation and input token extraction. Please note that the security review could not be completed as the provided model response was malformed and did not adhere to the specified JSON schema. Furthermore, a critical bug in the dummy text generation has been identified that could invalidate benchmark results, and there are opportunities for improving robustness and code quality.
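To make the comparison concrete, here is a minimal sketch of the TTFT-based measurement the review describes. The helper name measure_ttft and the payload variables are placeholders for illustration, not code from this PR.

import time

def measure_ttft(stream_request, payload):
    """Hypothetical helper: seconds from sending a streaming request until the first chunk arrives."""
    start = time.perf_counter()
    for _first_chunk in stream_request(payload):
        return time.perf_counter() - start
    return None

# Vision-encoder latency is estimated as the TTFT gap between a multimodal request
# and a text-only request whose dummy prompt is padded to the same input-token count:
# vision_encoder_latency = measure_ttft(send, multimodal_payload) - measure_ttft(send, text_only_payload)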
if not self._vqa_corpus_words:
    url = "https://www.gutenberg.org/cache/epub/2701/pg2701.txt"
    try:
        text = requests.get(url, timeout=10).text
        all_ids = tokenizer.encode(text, add_special_tokens=False)
        special_ids = set(tokenizer.all_special_ids)  # Filter out special tokens
        self._vqa_corpus_words = [tid for tid in all_ids if tid not in special_ids]
    except Exception as e:
        print(f"Fallback to dummy tokens: {e}")
        self._vqa_corpus_words = tokenizer.encode("The quick brown fox jumps over the lazy dog", add_special_tokens=False)

# Build dummy text
sampled_ids = random.choices(self._vqa_corpus_words, k=total_tokens)
There is a critical bug here. The _vqa_corpus_words attribute is shared across all models, but it stores token IDs that are specific to a particular tokenizer. When get_vqa_dummy_text is called for a second model with a different tokenizer, it will incorrectly reuse the token IDs from the first model's tokenizer. This will lead to incorrect dummy text generation and invalidate the benchmark results.
To fix this, _vqa_corpus_words should be a dictionary keyed by the normalized model_id, similar to _vqa_tokenizers. You will also need to update __init__ to initialize self._vqa_corpus_words = {}.
Suggested change:

# Load corpus for the specific tokenizer if not already loaded
if not self._vqa_corpus_words.get(model_id):
    url = "https://www.gutenberg.org/cache/epub/2701/pg2701.txt"
    try:
        text = requests.get(url, timeout=10).text
        all_ids = tokenizer.encode(text, add_special_tokens=False)
        special_ids = set(tokenizer.all_special_ids)  # Filter out special tokens
        self._vqa_corpus_words[model_id] = [tid for tid in all_ids if tid not in special_ids]
    except Exception as e:
        print(f"Fallback to dummy tokens: {e}")
        self._vqa_corpus_words[model_id] = tokenizer.encode("The quick brown fox jumps over the lazy dog", add_special_tokens=False)

# Build dummy text
sampled_ids = random.choices(self._vqa_corpus_words[model_id], k=total_tokens)
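For completeness, here is a minimal sketch of the __init__ change the comment above calls for. The class name VQABenchmark is a placeholder; only the attribute names come from the diff.

class VQABenchmark:  # placeholder name; the real class lives in this PR's benchmark module
    def __init__(self):
        # Per-model caches, keyed by normalized model_id, so each tokenizer
        # gets its own tokenizer instance and its own corpus of token IDs.
        self._vqa_tokenizers = {}
        self._vqa_corpus_words = {}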
last_chunk = response[-1]
usage = last_chunk.get('usage_metadata', {})
return usage.get('prompt_token_count', 0)
This implementation assumes that usage_metadata is always present in the last chunk of the streaming response. If for some reason it's in an earlier chunk (e.g., followed by empty chunks), this will fail to find it and return 0. To make this more robust, it's better to iterate backwards through the response to find the last chunk containing usage data, similar to the implementation for the AWS and Azure providers.
# The usage metadata is usually in the last chunk, but we search backwards for robustness.
for chunk in reversed(response):
    if 'usage_metadata' in chunk:
        usage = chunk.get('usage_metadata', {})
        return usage.get('prompt_token_count', 0)
return 0
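To see how the backwards scan behaves when the usage data is followed by trailing chunks, here is a small self-contained example; the chunk dictionaries are made up for illustration and only mirror the shape used in the snippet above.

# Hypothetical streaming response where usage data sits before an empty trailing chunk.
response = [
    {"text": "Hello"},
    {"text": " world", "usage_metadata": {"prompt_token_count": 42}},
    {"text": ""},
]

prompt_tokens = 0
for chunk in reversed(response):
    if 'usage_metadata' in chunk:
        prompt_tokens = chunk.get('usage_metadata', {}).get('prompt_token_count', 0)
        break

print(prompt_tokens)  # 42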
elif 'gemini-3' in lowered_id:
    model_id = 'gemini-3'
else:
    raise Exception('Invalid model for VQA input type.')
Using a more specific exception type like ValueError instead of the generic Exception is a good practice as it allows for more granular error handling. Including the invalid model_id in the error message would also be helpful for debugging.
Suggested change:

raise ValueError(f'Invalid model for VQA input type: {model_id}')
final_ids = tokenizer.encode(dummy_text, add_special_tokens=False)
if len(final_ids) < total_tokens:
    dummy_text += "." * (total_tokens - len(final_ids))
The current padding logic is not guaranteed to produce text with the exact total_tokens count. The added . characters can be merged with the preceding token by the tokenizer, resulting in a lower-than-expected token count. This could affect the accuracy of the benchmark. A more robust approach is to iteratively pad and re-tokenize until the target count is met, and then truncate if you've overshot.
final_ids = tokenizer.encode(dummy_text, add_special_tokens=False)
# Iteratively adjust to ensure the token count is precise.
while len(final_ids) < total_tokens:
    dummy_text += " ."  # Adding a space helps prevent merging.
    final_ids = tokenizer.encode(dummy_text, add_special_tokens=False)
# If we overshot, truncate the token list and decode back to text.
if len(final_ids) > total_tokens:
    dummy_text = tokenizer.decode(final_ids[:total_tokens], clean_up_tokenization_spaces=False)
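As a quick sanity check of the pad-then-truncate approach, the following standalone snippet runs it against a public tokenizer; the transformers dependency, the "gpt2" checkpoint, and the sample text are assumptions made purely for this example, not part of the PR.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer with encode/decode works

dummy_text = "The quick brown fox"
total_tokens = 10

final_ids = tokenizer.encode(dummy_text, add_special_tokens=False)
while len(final_ids) < total_tokens:
    dummy_text += " ."  # separated padding token, so it won't merge with the previous one
    final_ids = tokenizer.encode(dummy_text, add_special_tokens=False)
if len(final_ids) > total_tokens:
    dummy_text = tokenizer.decode(final_ids[:total_tokens], clean_up_tokenization_spaces=False)

# Re-encoding should land on (or very close to) the target; some tokenizers may
# drift by a token after a decode/encode round trip.
print(len(tokenizer.encode(dummy_text, add_special_tokens=False)))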