fix(providers/common-ai): LlamaIndexEmbeddingOperator always returns …#68434
fix(providers/common-ai): LlamaIndexEmbeddingOperator always returns …#68434bramhanandlingala wants to merge 1 commit into
Conversation
| # ``embed_nodes()`` inside VectorStoreIndex skips nodes whose | ||
| # ``.embedding`` is already set, so pre-embedding causes no duplicate | ||
| # API calls. | ||
| texts = [node.get_content() for node in nodes] |
There was a problem hiding this comment.
get_content() with no argument is MetadataMode.NONE, but llama-index's own embed_nodes() embeds node.get_content(metadata_mode=MetadataMode.EMBED), which includes node metadata and respects excluded_embed_metadata_keys. Checked on llama-index-core 0.14.22: for a node with metadata={"src": "a"} the library embeds 'src: a\n\nhello world' while this embeds just 'hello world'. Since the operator attaches user metadata to every Document and the pre-set embeddings are reused for the persisted index, metadata stops contributing to the vectors entirely. Suggest metadata_mode=MetadataMode.EMBED here so the pre-embed step keeps the same embedding semantics llama-index applies itself.
| # ``.embedding`` is already set, so pre-embedding causes no duplicate | ||
| # API calls. | ||
| texts = [node.get_content() for node in nodes] | ||
| vectors = embed_model.get_text_embedding_batch(texts, show_progress=False) |
There was a problem hiding this comment.
This breaks test_byo_embed_model_bypasses_hook: _byo_embedding() builds its mock with spec=["get_text_embedding", "_get_query_embedding"], so calling .get_text_embedding_batch on it raises AttributeError. The duck-type check in _resolve_embed_model has the same gap, it accepts anything with those two methods but execute now needs a third. Adding get_text_embedding_batch to both the spec and the check would cover it, along with the tests ashb asked for on the new pre-embed path.
Summary:
LlamaIndexEmbeddingOperator.execute() always returned "vector": None for every chunk in the output. The root cause is that VectorStoreIndex._get_node_with_embedding() calls node.model_copy() and assigns the embedding on the copy — the original node objects are never mutated. This has been the behavior across all tested llama-index-core versions (v0.10–v0.14).
Fix:
Pre-embed all nodes by calling embed_model.get_text_embedding_batch() and assigning the resulting vectors onto the original node objects before passing them to VectorStoreIndex. Since embed_nodes() inside VectorStoreIndex skips nodes whose .embedding is already populated, this produces no duplicate embedding API calls.
Testing:
Manually reproduced the bug with a mock stub mirroring llama-index's model_copy() behavior
Verified the fix produces correct non-None vectors for all chunks
No changes to public API or operator interface
Related: Fixes #68416