fix(providers/common-ai): LlamaIndexEmbeddingOperator always returns … #68434
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -125,9 +125,19 @@ def execute(self, context: Context) -> dict[str, Any]: |
| | | nodes = splitter.get_nodes_from_documents(llama_docs) |
| | | self.log.info("Split %d documents into %d chunks", len(llama_docs), len(nodes)) |
| | | - # ``VectorStoreIndex(...)`` populates each node's ``.embedding`` as a |
| | | - # side effect of building the index; capture the index so the |
| | | - # variable isn't discarded. |
| | | + # Pre-embed nodes so that ``.embedding`` is set on the original node |
| | | + # objects before they are passed to VectorStoreIndex. VectorStoreIndex |
| | | + # calls ``_get_node_with_embedding()`` which does ``node.model_copy()`` |
| | | + # and attaches the embedding to the *copy*, never the original. Reading |
| | | + # ``node.embedding`` after index construction therefore always returns |
| | | + # ``None`` (confirmed across llama-index-core v0.10–v0.14). |
| | | + # ``embed_nodes()`` inside VectorStoreIndex skips nodes whose |
| | | + # ``.embedding`` is already set, so pre-embedding causes no duplicate |
| | | + # API calls. |
| | | + texts = [node.get_content() for node in nodes] |
| | | + vectors = embed_model.get_text_embedding_batch(texts, show_progress=False) |
Member: This breaks …
| | | + for node, vector in zip(nodes, vectors): |
| | | + node.embedding = vector |
| | | index = VectorStoreIndex(nodes, embed_model=embed_model, show_progress=False) |
| | | if self.persist_dir: |
| | | @@ -136,8 +146,8 @@ def execute(self, context: Context) -> dict[str, Any]: |
| | | # ``SentenceSplitter`` always returns ``TextNode`` instances, but the |
| | | # base ``get_nodes_from_documents`` signature is typed as |
| | | # ``list[BaseNode]`` (which has no ``.text``). Cast so mypy doesn't |
| | | - # flag the ``.text`` access; ``node.embedding`` is populated by |
| | | - # ``VectorStoreIndex`` for every node above. |
| | | + # flag the ``.text`` access; ``node.embedding`` is populated by the |
| | | + # pre-embed step above for every node. |
| | | text_nodes = cast("list[TextNode]", nodes) |
| | | chunks = [ |
| | | { |
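The code comment in the first hunk makes two claims: index construction attaches embeddings to a copy of each node, and nodes that already carry an embedding are not embedded again. The sketch below illustrates both with a toy model in plain Python. `ToyNode`, `CountingEmbedder`, and `build_index` are hypothetical stand-ins written for this illustration; they are not llama-index classes and only mimic the behaviour the comment describes.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node; not the real llama-index class."""
    text: str
    embedding: Optional[list] = None


class CountingEmbedder:
    """Hypothetical embedder that counts how many texts it embeds."""

    def __init__(self) -> None:
        self.calls = 0

    def embed_batch(self, texts):
        self.calls += len(texts)
        # Fake one-dimensional "embedding": the text length.
        return [[float(len(t))] for t in texts]


def build_index(nodes, embedder):
    """Mimic the described behaviour: embed only nodes that lack an
    embedding, and attach the result to a copy, never the original."""
    missing = [n for n in nodes if n.embedding is None]
    vectors = iter(embedder.embed_batch([n.text for n in missing]) if missing else [])
    indexed = []
    for node in nodes:
        copy = replace(node)  # shallow copy, analogous to a model copy
        if copy.embedding is None:
            copy.embedding = next(vectors)
        indexed.append(copy)
    return indexed


# Without pre-embedding: the originals are left untouched.
embedder_a = CountingEmbedder()
nodes_a = [ToyNode("hello"), ToyNode("hello world")]
build_index(nodes_a, embedder_a)
originals_without_pre_embed = [n.embedding for n in nodes_a]

# With pre-embedding: the originals carry vectors and nothing is embedded twice.
embedder_b = CountingEmbedder()
nodes_b = [ToyNode("hello"), ToyNode("hello world")]
for node, vector in zip(nodes_b, embedder_b.embed_batch([n.text for n in nodes_b])):
    node.embedding = vector
build_index(nodes_b, embedder_b)
originals_with_pre_embed = [n.embedding for n in nodes_b]
```

In this toy model the first run leaves every original embedding as `None`, while the second run fills them in with the same number of embedder calls, which is the trade-off the PR relies on.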
`get_content()` with no argument is `MetadataMode.NONE`, but llama-index's own `embed_nodes()` embeds `node.get_content(metadata_mode=MetadataMode.EMBED)`, which includes node metadata and respects `excluded_embed_metadata_keys`. Checked on llama-index-core 0.14.22: for a node with `metadata={"src": "a"}` the library embeds `'src: a\n\nhello world'` while this embeds just `'hello world'`. Since the operator attaches user metadata to every Document and the pre-set embeddings are reused for the persisted index, metadata stops contributing to the vectors entirely. Suggest `metadata_mode=MetadataMode.EMBED` here so the pre-embed step keeps the same embedding semantics llama-index applies itself.
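To make the difference between the two modes concrete, here is a minimal sketch in plain Python. `ToyMetadataMode` and `toy_get_content` are hypothetical helpers written for this illustration, not llama-index APIs. The `key: value` layout followed by a blank line is taken from the string quoted in the comment above; the real library's templates may be configured differently.

```python
from enum import Enum


class ToyMetadataMode(Enum):
    """Hypothetical two-value stand-in for the library's metadata modes."""
    NONE = "none"
    EMBED = "embed"


def toy_get_content(text, metadata, excluded_embed_keys=(), mode=ToyMetadataMode.NONE):
    """Return the string that would be sent to the embedding model.

    Assumption: metadata is rendered as "key: value" lines, then a blank
    line, then the text, matching the example quoted in the review comment.
    """
    if mode is ToyMetadataMode.NONE:
        return text
    lines = [f"{k}: {v}" for k, v in metadata.items() if k not in excluded_embed_keys]
    if not lines:
        return text
    return "\n".join(lines) + "\n\n" + text


plain = toy_get_content("hello world", {"src": "a"})
with_metadata = toy_get_content("hello world", {"src": "a"}, mode=ToyMetadataMode.EMBED)
excluded = toy_get_content(
    "hello world", {"src": "a"}, excluded_embed_keys=("src",), mode=ToyMetadataMode.EMBED
)
```

Under these assumptions `plain` and `with_metadata` are different strings, so a pre-embed step that uses the default mode would produce different vectors from the ones the library computes itself whenever a node has embeddable metadata.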