Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -125,9 +125,19 @@ def execute(self, context: Context) -> dict[str, Any]:
nodes = splitter.get_nodes_from_documents(llama_docs)
self.log.info("Split %d documents into %d chunks", len(llama_docs), len(nodes))

# ``VectorStoreIndex(...)`` populates each node's ``.embedding`` as a
# side effect of building the index; capture the index so the
# variable isn't discarded.
# Pre-embed nodes so that ``.embedding`` is set on the original node
# objects before they are passed to VectorStoreIndex. VectorStoreIndex
# calls ``_get_node_with_embedding()`` which does ``node.model_copy()``
# and attaches the embedding to the *copy*, never the original. Reading
# ``node.embedding`` after index construction therefore always returns
# ``None`` (confirmed across llama-index-core v0.10–v0.14).
# ``embed_nodes()`` inside VectorStoreIndex skips nodes whose
# ``.embedding`` is already set, so pre-embedding causes no duplicate
# API calls.
texts = [node.get_content() for node in nodes]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_content() with no argument is MetadataMode.NONE, but llama-index's own embed_nodes() embeds node.get_content(metadata_mode=MetadataMode.EMBED), which includes node metadata and respects excluded_embed_metadata_keys. Checked on llama-index-core 0.14.22: for a node with metadata={"src": "a"} the library embeds 'src: a\n\nhello world' while this embeds just 'hello world'. Since the operator attaches user metadata to every Document and the pre-set embeddings are reused for the persisted index, metadata stops contributing to the vectors entirely. Suggest metadata_mode=MetadataMode.EMBED here so the pre-embed step keeps the same embedding semantics llama-index applies itself.

vectors = embed_model.get_text_embedding_batch(texts, show_progress=False)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks test_byo_embed_model_bypasses_hook: _byo_embedding() builds its mock with spec=["get_text_embedding", "_get_query_embedding"], so calling .get_text_embedding_batch on it raises AttributeError. The duck-type check in _resolve_embed_model has the same gap, it accepts anything with those two methods but execute now needs a third. Adding get_text_embedding_batch to both the spec and the check would cover it, along with the tests ashb asked for on the new pre-embed path.

for node, vector in zip(nodes, vectors):
node.embedding = vector
index = VectorStoreIndex(nodes, embed_model=embed_model, show_progress=False)

if self.persist_dir:
Expand All @@ -136,8 +146,8 @@ def execute(self, context: Context) -> dict[str, Any]:
# ``SentenceSplitter`` always returns ``TextNode`` instances, but the
# base ``get_nodes_from_documents`` signature is typed as
# ``list[BaseNode]`` (which has no ``.text``). Cast so mypy doesn't
# flag the ``.text`` access; ``node.embedding`` is populated by
# ``VectorStoreIndex`` for every node above.
# flag the ``.text`` access; ``node.embedding`` is populated by the
# pre-embed step above for every node.
text_nodes = cast("list[TextNode]", nodes)
chunks = [
{
Expand Down
Loading