Bug Description
For a very long time, it seems we've been using id fields from docs as if they are guaranteed to be the same as the document's Elasticsearch _id. However, this isn't documented or enforced, so customers who use ingest pipelines to modify id fields are encountering unexpected side effects.
We have no need to produce duplicate fields here. However, we probably can't just stop producing id fields, since some customers may be using them. Instead, we should just stop relying on them.
Reliance on id seems to have been introduced here and, after a number of refactorings, today lives here.
To Reproduce
Steps to reproduce the behavior:
- Set up a connector that indexes a document with _id: Foo
- Set up an ingest pipeline for that connector that always sets id: bar
- Run a full sync, so that you get a doc like {"_id": "Foo", "_source": {"id": "bar"}}
- In the 3rd party, delete document Foo, and add document Bar
- Run a full sync
- Both Foo and Bar documents are now in your index, because id: bar is removed from "existing ids", which prevents Foo from being cleaned up.
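The steps above can be modelled with a small, self-contained sketch. This is not the connector framework's actual code: the in-memory index, the function names (apply_pipeline, full_sync, run_scenario) and the shape of the sync loop are all assumptions made for illustration. It only shows why keying "existing ids" on the _source id field leaves Foo behind, while keying on _id does not.

```python
# Simplified, hypothetical model of a full sync. The index is a dict that
# maps an Elasticsearch _id to the document's _source.

def apply_pipeline(source):
    """Hypothetical ingest pipeline that always overwrites the id field."""
    modified = dict(source)
    modified["id"] = "bar"
    return modified

def id_from_source(es_id, source):
    """Problematic behaviour: trust the id field stored in _source."""
    return source["id"]

def id_from_metadata(es_id, source):
    """Expected behaviour: trust only the document's _id."""
    return es_id

def full_sync(index, remote_ids, id_of):
    """Index every remote document, then delete whatever was not seen."""
    existing = {id_of(es_id, source) for es_id, source in index.items()}
    for remote_id in remote_ids:
        existing.discard(remote_id)
        index[remote_id] = apply_pipeline({"id": remote_id})
    for stale_id in existing:
        # Deleting an _id that does not exist is a no-op.
        index.pop(stale_id, None)
    return index

def run_scenario(id_of):
    index = {}
    full_sync(index, ["Foo"], id_of)  # first full sync indexes Foo
    full_sync(index, ["Bar"], id_of)  # Foo was deleted remotely, Bar added
    return index

print(sorted(run_scenario(id_from_source)))
print(sorted(run_scenario(id_from_metadata)))
```

In this model, the first scenario ends with both Foo and Bar in the index, because the only "existing id" collected is bar, which matches no real _id. The second scenario ends with only Bar.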
Expected behavior
We should not rely on any particular field in the document that is not prefixed with a _ or otherwise communicated as a protected or internal field.
Environment
all versions <= 8.16.0-SNAPSHOT
Additional context
Related to: #2775