[PROBLEM] #2549

@abst0603

Description


Identify the file to be fixed
[openai-cookbook/examples/Embedding_long_inputs.ipynb](https://developers.openai.com/cookbook/examples/embedding_long_inputs)

Describe the problem
The OpenAI embedding API returns normalized embeddings. Normalization keeps the direction of an embedding but discards its magnitude, so the average of the normalized embeddings does not, in general, have the same direction (or magnitude) as the mean of the original embeddings. The result is only an approximation of the true embedding of the full text.
In this notebook, the author suggests weighting each chunk embedding by its input token count. But token count is not necessarily the magnitude of the original (un-normalized) embedding and can be entirely independent of it.
In my experiments, a classifier trained on truncated embeddings performs better than one trained on the normalized mean of the (unweighted) normalized chunk embeddings. To be precise: I take the OpenAI embeddings (already normalized), average them over all chunks belonging to a single text, and normalize the result again.
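The direction mismatch can be checked numerically. A minimal sketch with NumPy, using synthetic random vectors as stand-ins for the un-normalized model outputs (the API never exposes these, which is exactly the problem):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw (un-normalized) embeddings for three chunks of one text.
raw = rng.normal(size=(3, 8))
raw_norms = np.linalg.norm(raw, axis=1, keepdims=True)

# What the API returns: each chunk embedding scaled to unit length.
unit = raw / raw_norms

# "True" direction: mean of the raw embeddings, then normalized.
true_dir = raw.mean(axis=0)
true_dir /= np.linalg.norm(true_dir)

# What the notebook computes: mean of the unit embeddings, then normalized.
approx_dir = unit.mean(axis=0)
approx_dir /= np.linalg.norm(approx_dir)

# The two directions differ whenever the chunk magnitudes differ.
cos = float(true_dir @ approx_dir)
print(cos)  # strictly below 1.0 for these random chunks

# The weights that would recover the true direction are the discarded
# norms themselves -- not the token counts the notebook uses.
recovered = (raw_norms * unit).mean(axis=0)
recovered /= np.linalg.norm(recovered)
print(np.allclose(recovered, true_dir))
```

The last two lines make the point of this issue concrete: re-weighting by the lost magnitudes recovers the mean exactly, but those magnitudes are unavailable, and token length is not a substitute for them.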

Describe a solution
I don't think the normalized mean of the chunk embeddings is a good representation of a text. In my experiments, simple truncation works well.
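For reference, a minimal sketch of the truncation alternative: embed only the first `max_tokens` tokens once, instead of chunking and averaging. Whitespace splitting here is a stand-in for a real tokenizer such as tiktoken (the notebook truncates at the token-ID level); the token limit shown is the one for `text-embedding-ada-002`-style models.

```python
def simple_tokenize(text: str) -> list[str]:
    # Stand-in tokenizer; a real implementation would use tiktoken.
    return text.split()

def truncate(text: str, max_tokens: int = 8191) -> str:
    # Keep only the first max_tokens tokens, then rejoin for embedding.
    tokens = simple_tokenize(text)
    return " ".join(tokens[:max_tokens])

doc = "one two three four five"
print(truncate(doc, max_tokens=3))  # one two three
```

The truncated string is then passed to the embedding API as a single input, so the returned vector is an actual model embedding rather than an average of several.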

