Skip to content

feat: Add MongoDB Atlas document store#6471

Merged
julian-risch merged 14 commits into
deepset-ai:v1.xfrom
NoahStapp:v1.x
Dec 14, 2023
Merged

feat: Add MongoDB Atlas document store#6471
julian-risch merged 14 commits into
deepset-ai:v1.xfrom
NoahStapp:v1.x

Conversation

@NoahStapp
Copy link
Copy Markdown

Proposed Changes:

MongoAtlasDocumentStore is a document store for use with Haystack. It functions in a similar manner as the existing Pinecone, Weaviate, and Qdrant document stores.

How did you test it?

Unit and integration tests. The integration tests that test the actual document store functionality require a MongoDB Atlas cluster, so they are not included in this PR.

@NoahStapp NoahStapp requested a review from a team as a code owner December 1, 2023 22:59
@NoahStapp NoahStapp requested review from julian-risch and removed request for a team December 1, 2023 22:59
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Dec 1, 2023

CLA assistant check
All committers have signed the CLA.

@masci masci added this to the 1.23.0 milestone Dec 2, 2023
Copy link
Copy Markdown
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NoahStapp Thank you so much for opening this PR! There are a couple of smaller changes needed to make this PR ready to be merged. Please have a look at my comments below and see whether they make sense to you. 🙂 In addition to these comments, we need to take care of two other things. There are no tests added with this PR but we should add tests similar to the ones we have for other document stores. I assume you tested this PR locally with python 3.10 and MacOS?
Last but not least, we need to add a small release note to this PR: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#release-notes
Feel free to reach out if you have any questions or need any help.

Comment thread haystack/document_stores/mongodb_atlas.py Outdated
Comment thread haystack/document_stores/mongodb_atlas.py Outdated
@masci masci added the 1.x label Dec 4, 2023
@NoahStapp
Copy link
Copy Markdown
Author

@NoahStapp Thank you so much for opening this PR! There are a couple of smaller changes needed to make this PR ready to be merged. Please have a look at my comments below and see whether they make sense to you. 🙂 In addition to these comments, we need to take care of two other things. There are no tests added with this PR but we should add tests similar to the ones we have for other document stores. I assume you tested this PR locally with python 3.10 and MacOS? Last but not least, we need to add a small release note to this PR: main/CONTRIBUTING.md#release-notes Feel free to reach out if you have any questions or need any help.

Comments make sense, I just have a question regarding tests. We have an existing test suite that requires a connection to a MongoDB Atlas cluster, but no tests that use mocks to avoid this requirement. Is that sufficient for this PR, or do we need to also add a mocked test suite of unit tests?

@julian-risch
Copy link
Copy Markdown
Member

Haystack 1.x won't change much now that we focus on 2.0 in our development. Mocked unit tests should thus be enough here. For adding integration tests, we would need adjust the workflow and running the tests would take even longer. I'd prefer mocked unit tests for that reason.

Copy link
Copy Markdown
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @NoahStapp in addition to my other comments, I see that the CI steps are still failing because of an issue here: https://github.com/deepset-ai/haystack/actions/runs/7146021601/job/19464075024?pr=6471 and in the code here. That's the match that is not supported before python3.10. And the other failing step is the missing release note. Just let me know if you need any support from my side to fix this so that we can merge your PR soon.

@NoahStapp NoahStapp requested a review from a team as a code owner December 11, 2023 19:31
@NoahStapp NoahStapp requested review from dfokina and removed request for a team December 11, 2023 19:31
@NoahStapp
Copy link
Copy Markdown
Author

Hi @NoahStapp in addition to my other comments, I see that the CI steps are still failing because of an issue here: deepset-ai/haystack/actions/runs/7146021601/job/19464075024?pr=6471 and in the code here. That's the match that is not supported before python3.10. And the other failing step is the missing release note. Just let me know if you need any support from my side to fix this so that we can merge your PR soon.

Whoops, forgot about the second match statement, thanks Julian! Should be all fixed now and with the release note.

@prakul
Copy link
Copy Markdown

prakul commented Dec 12, 2023

@julian-risch @masci Is the CLA mandatory to be signed by the PR author, or can it be done separately? Usually, do other companies engage their legal team for review?

@masci
Copy link
Copy Markdown
Contributor

masci commented Dec 12, 2023

@prakul the CLA is usually signed by the single committer, if that works it's the fastest way. The alternative would be having an agreement between Mongo and deepset where you discuss and accept the CLA as an organization, than I can configure the bot to allow-list any contributor for that org.

@julian-risch
Copy link
Copy Markdown
Member

Whoops, forgot about the second match statement, thanks Julian! Should be all fixed now and with the release note.

@NoahStapp Nice! We're making good progress here. Our goal is to have this in the upcoming Haystack v1.23 release that we are preparing for tomorrow. I fixed the mypy and pylint issues. Could you please have a look at the failing unit tests? For example: https://github.com/deepset-ai/haystack/actions/runs/7180932661/job/19554287643?pr=6471

@NoahStapp
Copy link
Copy Markdown
Author

Whoops, forgot about the second match statement, thanks Julian! Should be all fixed now and with the release note.

@NoahStapp Nice! We're making good progress here. Our goal is to have this in the upcoming Haystack v1.23 release that we are preparing for tomorrow. I fixed the mypy and pylint issues. Could you please have a look at the failing unit tests? For example: deepset-ai/haystack/actions/runs/7180932661/job/19554287643?pr=6471

Are the CI checks/environment identical to the local environment setup by following CONTRIBUTING.md? I don't see that test failure locally, but I've added from __future__ import annotations to remove the possibility of a lazy import forward reference.

@prakul
Copy link
Copy Markdown

prakul commented Dec 13, 2023

@julian-risch @masci Thanks for all the quick reviews and guidance. we would love to see this as part of the Haystack v1.23 release, if possible.

@ZanSara
Copy link
Copy Markdown
Contributor

ZanSara commented Dec 13, 2023

Hey @NoahStapp, this is not documented and it doesn't show as a CI failure, but using from __future__ import annotations currently breaks our pipeline schema generation, so we need to avoid it. Can you find another way to address the forward references?

@julian-risch
Copy link
Copy Markdown
Member

julian-risch commented Dec 13, 2023

With MONGO_ATLAS_USERNAME, MONGO_ATLAS_PASSWORD and MONGO_ATLAS_HOST defined locally, most of the tests pass for me now.
The model "sentence-transformers/all-mpnet-base-v2" could be replaced with "sentence-transformers/all-MiniLM-L6-v2" to increase the speed of the tests. However, that might slightly change the retrieval results and embeddings dimensions are 384 instead of 768 then. So the tests would need to be adjusted to work with this faster model.

The tests that are still failing for me are the following:

  • test_query_by_embedding_default_topk
  • test_query_by_embedding_default_topk_4
  • test_query_by_embedding_filtered
  • test_query_by_embedding_include_embedding
    Although the document store contains >100 documents, the returned list of documents is empty when running:
    results = document_store.query_by_embedding(query_emb=embedding, return_embedding=True) in these tests. @NoahStapp could you please double check these tests?

@julian-risch
Copy link
Copy Markdown
Member

@NoahStapp I took care of the from __future__ import that ZanSara mentioned. What remains to be done by you before we can merge your PR and release Haystack v1.23 with it is to check why the following tests fail:

  • test_query_by_embedding_default_topk
  • test_query_by_embedding_default_topk_4
  • test_query_by_embedding_filtered
  • test_query_by_embedding_include_embedding

If we can make them pass and are confident that everything works well, I would merge only the implementation but not the tests and then do the release tomorrow morning.

@NoahStapp
Copy link
Copy Markdown
Author

@NoahStapp I took care of the from __future__ import that ZanSara mentioned. What remains to be done by you before we can merge your PR and release Haystack v1.23 with it is to check why the following tests fail:

  • test_query_by_embedding_default_topk
  • test_query_by_embedding_default_topk_4
  • test_query_by_embedding_filtered
  • test_query_by_embedding_include_embedding

If we can make them pass and are confident that everything works well, I would merge only the implementation but not the tests and then do the release tomorrow morning.

@julian-risch Those tests require creating an Atlas Vector Search index named test_80_days like so:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding": {
        "dimensions": 768,
        "similarity": "cosine",
        "type": "knnVector"
      }
    }
  }
}

Once that index is created, those tests also pass.

Copy link
Copy Markdown
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now! 👍 It's ready to be merged. We can merge the tests later after cleaning them up. For now it was important to test it at least locally, which worked.

@julian-risch julian-risch merged commit 6501e3f into deepset-ai:v1.x Dec 14, 2023
@prakul
Copy link
Copy Markdown

prakul commented Dec 14, 2023

Thanks @julian-risch!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants