Connectors interchangeably use id and _id #2776

@seanstory

Bug Description

For a very long time, it seems we've been using id fields from docs as if they are guaranteed to be the same as the document's Elasticsearch _id. However, this isn't documented or enforced, so customers who use ingest pipelines to modify id fields are encountering unexpected side effects.

We have no need to produce duplicate fields here. However, we probably can't just stop producing id fields, since some customers may be using them. Instead, we should just stop relying on them.

Reliance on id seems to have been introduced here, and through a number of refactorings, today lives here

To Reproduce

Steps to reproduce the behavior:

  1. Set up a connector that indexes a document with _id: Foo
  2. Set up an ingest pipeline for that connector that always sets id: bar
  3. Run a full sync, so that you get a doc like {"_id": "Foo", "_source": {"id": "bar"}}
  4. In the 3rd party, delete document Foo, and add document Bar
  5. Run a full sync
  6. Both Foo and Bar documents are now in your index, because id: bar is removed from "existing ids", which prevents Foo from being cleaned up.
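
The steps above can be modelled with a small in-memory simulation. This is only a sketch of the failure mode, not the connector framework's actual sync code: `ingest_pipeline`, `full_sync`, and `run_repro` are hypothetical names, and the index is a plain dict mapping `_id` to `_source`. The point it illustrates is that when the "existing ids" are read from `_source.id`, the `_id` Foo never becomes a deletion candidate.

```python
def ingest_pipeline(source):
    """Hypothetical pipeline from step 2: always overwrite `id` with "bar"."""
    return {**source, "id": "bar"}


def full_sync(index, remote_docs):
    """Toy full sync over an in-memory index (dict of _id -> _source).

    Assumption: existing documents are identified by `_source["id"]`,
    which is the behaviour this issue describes.
    """
    existing_ids = {source["id"] for source in index.values()}
    incoming_ids = {doc["_id"] for doc in remote_docs}

    # Documents no longer present in the 3rd party are deleted by _id.
    # Here the stale key is "bar", which matches no _id, so nothing is removed.
    for stale in existing_ids - incoming_ids:
        index.pop(stale, None)

    for doc in remote_docs:
        index[doc["_id"]] = ingest_pipeline({"id": doc["_id"]})
    return index


def run_repro():
    index = {}
    full_sync(index, [{"_id": "Foo"}])  # step 3: first full sync
    full_sync(index, [{"_id": "Bar"}])  # steps 4-5: Foo deleted, Bar added
    return sorted(index)


print(run_repro())  # ['Bar', 'Foo']
```

Both documents survive the second sync, which matches step 6.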

Expected behavior

We should not rely on any particular field in the document that is not prefixed with a _ or otherwise communicated as a protected or internal field.
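
As a rough sketch of what this could look like, the set of already-indexed documents can be built from each search hit's `_id` metadata instead of anything inside `_source`. The helper names `indexed_ids` and `stale_ids` are hypothetical, and the hits are assumed to be plain dicts in the usual Elasticsearch search-hit shape (`_id` alongside `_source`).

```python
def indexed_ids(hits):
    """Collect document identities from hit metadata only.

    `_source` is deliberately ignored, so an ingest pipeline that rewrites
    a user-visible field such as `id` cannot affect the result.
    """
    return {hit["_id"] for hit in hits}


def stale_ids(hits, remote_ids):
    """Return the _ids that are indexed but no longer exist in the 3rd party."""
    return indexed_ids(hits) - set(remote_ids)


# The document from step 3 of the reproduction, after Foo was deleted
# in the 3rd party and Bar was added.
hits = [{"_id": "Foo", "_source": {"id": "bar"}}]
print(stale_ids(hits, ["Bar"]))  # {'Foo'}
```

With this keying, Foo is correctly identified for cleanup even though its `id` field was overwritten.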

Environment

all versions <= 8.16.0-SNAPSHOT

Additional context

Related to: #2775
