SumEstimation is a framework for estimating the sum of scores across large-scale embedding datasets using a variety of sampling strategies. It is designed to support both approximate and hybrid estimation techniques across multiple similarity functions (KDE, Softmax) and dataset types (image or text embeddings).
Computing exact sums over large embedding datasets (millions of vectors) is computationally expensive. This repo explores sampling-based estimators that can reliably approximate the total sum with fewer computations, enabling scalable deployment in ranking, evaluation, and retrieval workflows.
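As a point of reference, the exact quantity being approximated can be sketched as a brute-force sum over every vector in the dataset. The score definitions below are assumptions made for illustration (an unnormalized softmax score `exp(q · x)` and a Gaussian KDE score with a hypothetical `bandwidth` parameter); the function names are not taken from this repository.

```python
import numpy as np

def softmax_scores(query, data):
    # Assumed score: exp of the inner product between the query and each row.
    return np.exp(data @ query)

def kde_scores(query, data, bandwidth=1.0):
    # Assumed score: Gaussian kernel on squared Euclidean distance.
    sq_dist = np.sum((data - query) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def exact_sum(query, data, score_fn):
    # Brute-force baseline: scores all N vectors, so cost grows linearly with N.
    return float(np.sum(score_fn(query, data)))
```

Every call to `exact_sum` touches all rows of `data` for each query, which is the cost that sampling-based estimators aim to avoid.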
- TopK: Uses nearest neighbors by similarity.
- Random: Uniform random sampling of dataset points.
- OurAlgorithm: An adaptive sampler that selects a budgeted number of items per query.
- Combined: Combines TopK and Random sampling (hybrid).
Each method supports per-query evaluation and produces score estimates under a variety of hyperparameter and estimator configurations.
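The TopK, Random, and Combined strategies above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, and a precomputed `scores` array stands in for the per-item scores of one query (in practice only the selected items would be scored, and the top-k items would typically come from a nearest-neighbor index). OurAlgorithm is omitted because its selection rule is not described here.

```python
import numpy as np

def random_estimate(scores, m, rng):
    # Uniform sample of m items without replacement, scaled up by n / m.
    n = scores.shape[0]
    idx = rng.choice(n, size=m, replace=False)
    return n / m * scores[idx].sum()

def topk_estimate(scores, k):
    # Sum of the k largest scores; a lower bound when scores are non-negative.
    return np.partition(scores, -k)[-k:].sum()

def combined_estimate(scores, k, m, rng):
    # Hybrid sketch: exact sum over the top k, plus a scaled uniform sample
    # of m items drawn from the remaining n - k. Requires 1 <= k < n and
    # 1 <= m <= n - k.
    n = scores.shape[0]
    order = np.argpartition(scores, -k)
    top_idx, rest_idx = order[-k:], order[:-k]
    sample = rng.choice(rest_idx, size=m, replace=False)
    tail = (n - k) / m * scores[sample].sum()
    return scores[top_idx].sum() + tail
```

In this sketch the hybrid handles the few largest scores exactly and leaves only the remaining items to sampling, which is one plausible reason to combine the two strategies when a handful of items dominate the sum.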
Clone the repository and install the required dependencies:
git clone https://github.com/your-org/sum-estimation-public.git
cd sum-estimation-public
pip install -r requirements.txt
If you use this repository in your work, please cite the paper or contact us via the repository.
For questions, feedback, or contributions:
- Steve Mussmann: mussmann@gatech.edu
- Mehul Smriti Raje: mehul@coactive.ai, mehul.raje@gmail.com