[optional format]
Identify the file to be fixed
[openai-cookbook/examples/Embedding_long_inputs.ipynb](https://developers.openai.com/cookbook/examples/embedding_long_inputs)
Describe the problem
The OpenAI embedding API returns normalized embeddings. Normalization preserves the direction of an embedding but discards its magnitude. The average of normalized embeddings does NOT have the same direction or magnitude as the mean of the original (unnormalized) embeddings, so the result is only an approximation of the true embedding.
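To illustrate the direction drift with toy vectors (hypothetical data, not actual API output): the normalized mean of two already-normalized vectors generally points in a different direction than the normalized mean of the original unnormalized vectors, because normalization erases the magnitude information that should weight the average.

```python
import numpy as np

# Two hypothetical unnormalized embeddings with very different magnitudes
a = np.array([3.0, 0.0])
b = np.array([0.0, 1.0])

def normalize(v):
    return v / np.linalg.norm(v)

# Mean of the original vectors, then normalized: leans toward the
# larger-magnitude vector a
true_direction = normalize((a + b) / 2)

# Mean of the already-normalized vectors, then normalized again:
# a 45-degree direction, since magnitude information is lost
approx_direction = normalize((normalize(a) + normalize(b)) / 2)

cos_sim = float(true_direction @ approx_direction)
print(cos_sim)  # ≈ 0.894, i.e. the two directions disagree
```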
In this notebook, the author suggests weighting each chunk's embedding by its input token length. But token length is not necessarily proportional to the magnitude of the original (unnormalized) embedding and can be entirely independent of it.
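The notebook's length-weighted average, as I understand it, looks roughly like the sketch below (function name and toy inputs are mine, not from the notebook); the weights are token counts, which need not track the original embedding magnitudes:

```python
import numpy as np

def weighted_average(chunk_embeddings, token_counts):
    # Weight each (already-normalized) chunk embedding by its token
    # count, then renormalize the result to unit length
    w = np.asarray(token_counts, dtype=float)
    avg = np.average(np.asarray(chunk_embeddings), axis=0, weights=w)
    return avg / np.linalg.norm(avg)

# Toy example: the 3-token chunk dominates the 1-token chunk
doc_vec = weighted_average([np.array([1.0, 0.0]), np.array([0.0, 1.0])], [3, 1])
```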
In my experiment, a classifier using truncated embeddings performs better than one using the normalized mean of the (unweighted) normalized embeddings. To be clear: I take the OpenAI embeddings (already normalized), compute their mean over all chunks belonging to a single text, and normalize the result again.
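The aggregation described above can be sketched as follows (toy unit vectors stand in for real chunk embeddings, which would come from the embedding API):

```python
import numpy as np

def aggregate_chunks(chunk_embeddings):
    """Unweighted mean of already-normalized chunk embeddings, renormalized.

    This is the aggregation compared against truncation above; each row
    of chunk_embeddings is assumed to already be unit-normalized.
    """
    mean = np.mean(np.asarray(chunk_embeddings), axis=0)
    return mean / np.linalg.norm(mean)

# Toy example with two hypothetical unit vectors
chunks = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
doc_vec = aggregate_chunks(chunks)
print(np.linalg.norm(doc_vec))  # 1.0 after renormalization
```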
Describe a solution
I just don't think the normalized mean of the chunk embeddings of a single text is a good representation of that text. In my experiments, truncation works well.
Screenshots
If applicable, add screenshots to help explain your problem.
Additional context
Add any other context about the problem here.