[optional format]
Identify the file to be fixed
[openai-cookbook/examples/Embedding_long_inputs.ipynb](https://developers.openai.com/cookbook/examples/embedding_long_inputs)
Describe the problem
The OpenAI embedding API returns normalized embeddings. Normalization preserves the direction of an embedding but discards its magnitude. The average of normalized embeddings does NOT have the same direction or magnitude as the mean of the original (unnormalized) embeddings, so the result is only an approximation of the true embedding.
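To illustrate the direction drift with toy vectors (hypothetical data, not actual API output): the normalized mean of two already-normalized vectors generally points in a different direction than the normalized mean of the original unnormalized vectors, because normalization erases the magnitude information that should weight the average.

```python
import numpy as np

# Two hypothetical unnormalized embeddings with very different magnitudes
a = np.array([3.0, 0.0])
b = np.array([0.0, 1.0])

def normalize(v):
    return v / np.linalg.norm(v)

# Mean of the original vectors, then normalized: leans toward the
# larger-magnitude vector a
true_direction = normalize((a + b) / 2)

# Mean of the already-normalized vectors, then normalized again:
# a 45-degree direction, since magnitude information is lost
approx_direction = normalize((normalize(a) + normalize(b)) / 2)

cos_sim = float(true_direction @ approx_direction)
print(cos_sim)  # ≈ 0.894, i.e. the two directions disagree
```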
In this notebook, the author suggests weighting each chunk's embedding by its input token length. But token length is not necessarily proportional to the magnitude of the original (unnormalized) embedding and can be entirely independent of it.
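The notebook's length-weighted average, as I understand it, looks roughly like the sketch below (function name and toy inputs are mine, not from the notebook); the weights are token counts, which need not track the original embedding magnitudes:

```python
import numpy as np

def weighted_average(chunk_embeddings, token_counts):
    # Weight each (already-normalized) chunk embedding by its token
    # count, then renormalize the result to unit length
    w = np.asarray(token_counts, dtype=float)
    avg = np.average(np.asarray(chunk_embeddings), axis=0, weights=w)
    return avg / np.linalg.norm(avg)

# Toy example: the 3-token chunk dominates the 1-token chunk
doc_vec = weighted_average([np.array([1.0, 0.0]), np.array([0.0, 1.0])], [3, 1])
```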
In my experiment, a classifier using truncated embeddings performs better than one using the normalized mean of the (unweighted) normalized embeddings. To be clear: I take the OpenAI embeddings (already normalized), compute their mean over all chunks belonging to a single text, and normalize the result again.
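The aggregation described above can be sketched as follows (toy unit vectors stand in for real chunk embeddings, which would come from the embedding API):

```python
import numpy as np

def aggregate_chunks(chunk_embeddings):
    """Unweighted mean of already-normalized chunk embeddings, renormalized.

    This is the aggregation compared against truncation above; each row
    of chunk_embeddings is assumed to already be unit-normalized.
    """
    mean = np.mean(np.asarray(chunk_embeddings), axis=0)
    return mean / np.linalg.norm(mean)

# Toy example with two hypothetical unit vectors
chunks = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
doc_vec = aggregate_chunks(chunks)
print(np.linalg.norm(doc_vec))  # 1.0 after renormalization
```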
Describe a solution
I just don't think the normalized mean of the chunk embeddings of a single text is a good representation of that text. In my experiments, truncation works well.
Screenshots
If applicable, add screenshots to help explain your problem.
Additional context
Add any other context about the problem here.