Skip to content

feat: Weights and score normalization for DocumentJoiner with reciprocal rank fusion#6735

Merged
vblagoje merged 5 commits into
mainfrom
join_rrf_v2
Jan 24, 2024
Merged

feat: Weights and score normalization for DocumentJoiner with reciprocal rank fusion#6735
vblagoje merged 5 commits into
mainfrom
join_rrf_v2

Conversation

@robpasternak
Copy link
Copy Markdown
Contributor

@robpasternak robpasternak commented Jan 12, 2024

Related Issues

Proposed Changes:

  • Add weighting for DocumentJoiner node with reciprocal rank fusion.
  • Add score normalization for DocumentJoiner node with reciprocal rank fusion.

How did you test it?

  • Testing pending.

Checklist

@github-actions github-actions Bot added the 2.x label Jan 12, 2024
@coveralls
Copy link
Copy Markdown
Collaborator

coveralls commented Jan 12, 2024

Pull Request Test Coverage Report for Build 7641580805

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.006%) to 88.88%

Totals Coverage Status
Change from base Build 7641576479: 0.006%
Covered Lines: 4524
Relevant Lines: 5090

💛 - Coveralls

@robpasternak
Copy link
Copy Markdown
Contributor Author

I just did some simple tests by hand, and it seems to work! Test code and output below. Will now mark it ready for review.

(Note: I didn't see a reasonable spot to include this as a test in the actual repo itself--all the DocumentJoiner tests seem to deal with much more fundamental stuff--which is why I did the tests by hand. If you'd like me to somehow incorporate tests in the repo let me know.)

Test code

from haystack.components.joiners import DocumentJoiner
from haystack import Document

def make_docs(
    num_docs: int = 6,
):
    out_docs = []
    for i in range(num_docs):
        out_docs.append(Document(content=f"doc{i}"))
    return out_docs

if __name__ == '__main__':
    docs = make_docs(6)
    docs_2 = [docs[0], docs[4], docs[2], docs[5], docs[1]]
    document_lists = [docs, docs_2]

    joiner_1 = DocumentJoiner(
        join_mode = "reciprocal_rank_fusion",
        weights = [0.5,0.5],
    )

    joiner_2 = DocumentJoiner(
        join_mode = "reciprocal_rank_fusion",
        weights = [7,7],
    )

    joiner_3 = DocumentJoiner(
        join_mode = "reciprocal_rank_fusion",
        weights = [0.7,0.3],
    )

    joiner_4 = DocumentJoiner(
        join_mode = "reciprocal_rank_fusion",
        weights = [0.6,0.4],
    )

    joiner_5 = DocumentJoiner(
        join_mode = "reciprocal_rank_fusion",
        weights = [1,0],
    )

    joiners = [joiner_1, joiner_2, joiner_3, joiner_4, joiner_5]
    
    for index, joiner in enumerate(joiners):
        print(f'JOINER {index+1}:')
        join_results = joiner.run(documents = document_lists)
        for doc in join_results['documents']:
            print(doc.content, doc.score)
        print('')

Output

JOINER 1:
doc0 1.0
doc2 0.9682539682539681
doc1 0.961166253101737
doc4 0.961166253101737
doc5 0.938683712121212
doc3 0.4765625

JOINER 2:
doc0 1.0
doc2 0.9682539682539681
doc1 0.961166253101737
doc4 0.961166253101737
doc5 0.938683712121212
doc3 0.4765625

JOINER 3:
doc0 0.9999999999999998
doc1 0.9702481389578161
doc2 0.9682539682539681
doc4 0.9520843672456575
doc5 0.9329071969696969
doc3 0.6671874999999999

JOINER 4:
doc0 0.9999999999999998
doc2 0.9682539682539681
doc1 0.9657071960297765
doc4 0.9566253101736971
doc5 0.9357954545454545
doc3 0.5718749999999999

JOINER 5:
doc0 1.0
doc1 0.9838709677419354
doc2 0.9682539682539681
doc3 0.953125
doc4 0.9384615384615385
doc5 0.9242424242424242

@robpasternak
Copy link
Copy Markdown
Contributor Author

Oh and another comment: we've previously talked about making the score normalization optional, rather than obligatory. This PR keeps the score normalization as obligatory, like in the v1.x JoinDocuments node. I did this because I thought the decision about how to implement any optionality would be best left to Core Engineering folks. But I'm happy to offer hot takes and/or implement changes if you'd like 🙂

@robpasternak robpasternak marked this pull request as ready for review January 15, 2024 10:42
@robpasternak robpasternak requested review from a team as code owners January 15, 2024 10:42
@robpasternak robpasternak requested review from dfokina and vblagoje and removed request for a team January 15, 2024 10:42
@vblagoje
Copy link
Copy Markdown
Member

Hey @robpasternak @anakin87 seems ok. The. only slight concern I have is document_lists being zero len blowing up pipeline run with ZeroDivisionError. No other code path in this component addresses this, so we can leave it for more consistent updates across all components, right @anakin87

So in conclusion, a slightly longer release note blurb explaining this to users would be appreciated. I'll share our guidelines internally.

Copy link
Copy Markdown
Member

@vblagoje vblagoje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's write a bit longer release note please

@vblagoje
Copy link
Copy Markdown
Member

Also @robpasternak would you also please just add the test you wrote to test_document_joiner.py?

@vblagoje vblagoje self-requested a review January 24, 2024 14:25
Copy link
Copy Markdown
Member

@vblagoje vblagoje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@vblagoje vblagoje merged commit 7358b91 into main Jan 24, 2024
@vblagoje vblagoje deleted the join_rrf_v2 branch January 24, 2024 14:45
@robpasternak
Copy link
Copy Markdown
Contributor Author

Oh thanks for jumping on the changes @vblagoje! Apologies for not doing them sooner myself, was gonna do them today but failed to communicate that. Anyway, thanks again!

@vblagoje
Copy link
Copy Markdown
Member

No worries @robpasternak it took me literally a few minutes to update, all credit is yours 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Weights and score normalization for DocumentJoiner with reciprocal rank fusion - 2.x

3 participants