Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #14655 +/- ##
=======================================
Coverage 32.37% 32.37%
=======================================
Files 3108 3108
Lines 211664 211692 +28
Branches 38383 38383
=======================================
+ Hits 68516 68537 +21
- Misses 143148 143155 +7
698cd42 to 2ea4f68
3635ce9 to c4b70ac
Can you please create a public issue in the OpenCTI repo and link it? https://github.com/OpenCTI-Platform/opencti/issues
1f89b8f to c97d2f2
Done. I'll make sure the correct issue number is in the commit message when squashing too 👍
c97d2f2 to f8aaca1
ea8f73f to bf5d902
Pull request overview
This PR addresses strict_dynamic_mapping_exception errors during file indexing by constraining which fields the Elasticsearch/OpenSearch attachment ingest processor is allowed to extract, aligning ingestion with the existing strict index mappings.
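To make the failure mode concrete, here is a simplified model of what a strict mapping does with an attachment sub-document. It is only a sketch: `rejectedByStrictMapping` and the field names are hypothetical, and the real check happens inside Elasticsearch/OpenSearch, not in application code.

```typescript
// Simplified model of `dynamic: strict`: any field that is not declared in the
// mapping causes the whole document to be rejected.
// Hypothetical helper: returns the fields that would trigger the rejection.
const rejectedByStrictMapping = (
  mappedFields: ReadonlySet<string>,
  attachment: Record<string, unknown>,
): string[] => Object.keys(attachment).filter((field) => !mappedFields.has(field));

// Assumed mapping: only two attachment fields are declared.
const mapped = new Set(['content', 'title']);

// With default extraction, the processor may emit extra metadata fields.
const extractedByDefault = { content: 'some text', title: 'Report', publisher: 'ACME' };
console.log(rejectedByStrictMapping(mapped, extractedByDefault)); // ['publisher']

// With extraction limited to mapped fields, nothing is rejected.
const extractedWithAllowlist = { content: 'some text', title: 'Report' };
console.log(rejectedByStrictMapping(mapped, extractedWithAllowlist)); // []
```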
Changes:
- Configure the attachment ingest pipeline (`properties`) to only extract fields that are mapped (separately for Elasticsearch vs OpenSearch).
- Add an integration test indexing a PDF containing metadata that would previously trigger strict mapping failures.
- Add an OpenSearch dev Docker image build (with `ingest-attachment` plugin) and update dependent test expectations.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| opencti-platform/opencti-graphql/tests/03-integration/04-manager/retentionManager-test.ts | Updates expected file counts due to the new indexed test file. |
| opencti-platform/opencti-graphql/tests/03-integration/01-database/index-file-test.js | Adds an integration test to validate indexing succeeds with “unhandled” PDF metadata. |
| opencti-platform/opencti-graphql/src/utils/type-utils.ts | Adds TS utility types/helpers used for compile-time type assertions. |
| opencti-platform/opencti-graphql/src/modules/internal/document/document.ts | Adds a compile-time check to keep attachment mappings aligned with extracted props. |
| opencti-platform/opencti-graphql/src/database/engine.ts | Restricts ingest-attachment extracted properties for ES/OS pipelines. |
| opencti-platform/opencti-graphql/src/database/attachment-processor-props.ts | Defines the explicit extracted-property allowlists (ES vs OpenSearch) + shared union type. |
| opencti-platform/opencti-dev/opensearch/Dockerfile | Builds an OpenSearch image with the ingest-attachment plugin installed. |
| opencti-platform/opencti-dev/docker-compose.yml | Switches OpenSearch service to build: the new Dockerfile and updates usage hint. |
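The compile-time check mentioned for `document.ts` and `type-utils.ts` is not shown in this thread. One way such a check might be written is sketched below; the allowlists, the mapping object, and the `IsEqual`/`Assert` helper names are all assumptions for illustration, not the PR's actual code.

```typescript
// Hypothetical allowlists; the real ones live in attachment-processor-props.ts.
const ES_PROPS = ['content', 'title', 'content_type'] as const;
const OS_PROPS = ['content', 'title'] as const;
type ExtractedProp = (typeof ES_PROPS)[number] | (typeof OS_PROPS)[number];

// Hypothetical mapping definition for the attachment sub-document.
const attachmentMapping = {
  content: { type: 'text' },
  title: { type: 'text' },
  content_type: { type: 'keyword' },
} as const;

// Type-level equality: true only if A and B are mutually assignable.
type IsEqual<A, B> = [A] extends [B] ? ([B] extends [A] ? true : false) : false;
// Only accepts the literal type `true`; anything else is a compile error.
type Assert<T extends true> = T;

// Fails to compile if a property is extracted without a mapping, or mapped
// without being extracted.
type MappingMatchesProps = Assert<IsEqual<keyof typeof attachmentMapping, ExtractedProp>>;
const mappingIsAligned: MappingMatchesProps = true;
console.log(mappingIsAligned);
```

Under these assumptions, adding a new entry to `ES_PROPS` without a matching key in `attachmentMapping` turns `IsEqual` into `false`, and the `Assert` line stops compiling.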
bf5d902 to 23afbf3
…ileIndexManager (#89) The issue arose because of missing index mappings in the attachment sub-document: we use an Elasticsearch pipeline processor for attachments that extracts fields for us. By default this processor extracts all the fields it can: https://www.elastic.co/guide/en/elasticsearch/reference/8.19/attachment.html#attachment-fields. This commit specifies which fields to extract: for those enforce an index mapping def
23afbf3 to 6498c22
Context
The issue arose because of missing index mappings in the attachment sub-document: we use an Elasticsearch pipeline processor for attachments that extracts fields for us. By default this processor extracts all the fields it can: https://www.elastic.co/guide/en/elasticsearch/reference/8.19/attachment.html#attachment-fields.
The problem is that we've created index mappings for only a subset of those fields (see `document.ts`). Combined with the fact that we enforce `dynamic: strict` behavior on indices, meaning we don't let unknown fields be pushed to an index, this resulted in a few exceptions.
Proposed changes
This PR specifies which fields to extract when configuring the `attachment` pipeline in ES/OS: those for which we already have a mapping.
We could consider ingesting the other pieces of data, but I'm not sure it's useful and the volume is ultra low for now.
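As a rough illustration of the approach, the sketch below builds an ingest pipeline body whose `attachment` processor is limited to an allowlist through the processor's `properties` option. The helper name `buildAttachmentPipeline`, the source field `file_data`, and the chosen properties are assumptions; the actual lists and field names in `engine.ts` may differ.

```typescript
// Property names accepted by the attachment processor's `properties` option
// (a subset, picked here for illustration).
type AttachmentProp =
  | 'content'
  | 'title'
  | 'author'
  | 'date'
  | 'content_type'
  | 'content_length'
  | 'language';

interface AttachmentPipelineBody {
  description: string;
  processors: Array<{
    attachment: {
      field: string; // source field holding the base64-encoded file
      target_field: string; // where the extracted sub-document is written
      properties: AttachmentProp[]; // only these fields are extracted
    };
  }>;
}

// Hypothetical helper: builds the pipeline body for a given allowlist.
const buildAttachmentPipeline = (properties: AttachmentProp[]): AttachmentPipelineBody => ({
  description: 'Extract only the attachment fields that have an index mapping',
  processors: [
    {
      attachment: {
        field: 'file_data', // assumed field name
        target_field: 'attachment',
        properties,
      },
    },
  ],
});

// Example: restrict extraction to three mapped fields.
const pipeline = buildAttachmentPipeline(['content', 'title', 'content_type']);
console.log(JSON.stringify(pipeline, null, 2));
```

Keeping the allowlist as a typed parameter means ES and OS can each pass their own list while sharing the same pipeline shape.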
I added an integration test making sure that a PDF with metadata (`dc:publisher` or `dc:rating`) that could be extracted by the processor, but no longer is because we now tell it not to, doesn't fail the indexing.
I ran the test with ES and OS locally. Running with OS required tweaking the dev setup to install the `ingest-attachment` plugin before starting the OS process, which requires a custom Dockerfile (https://docs.opensearch.org/latest/install-and-configure/install-opensearch/docker/#working-with-plugins).
Related issues
Checklist
Further comments
I had to track down how to name the metadata by looking at the ES code (https://github.com/elastic/elasticsearch/blob/main/modules/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java#L200) and the library it uses (Apache Tika). Without the `dc:` prefix it wouldn't be seen by the processor.
I used https://www.embedpdf.com/tools/pdf-metadata-editor to add metadata to the test file and Tika to read them like the ES dependency.