Model/Pipeline/Scheduler description
Copied from #1858:
UnCLIP / Karlo (https://huggingface.co/spaces/kakaobrain/karlo) gives very nice and precise image generation results and can strongly outperform Stable Diffusion in some cases, see:
https://www.reddit.com/r/StableDiffusion/comments/zshufz/karlo_the_first_large_scale_open_source_dalle_2/
Another extremely interesting aspect of Dalle 2 is its ability to interpolate between text and/or image embeddings. See e.g. Section 3 of the Dalle 2 paper: https://cdn.openai.com/papers/dall-e-2.pdf. This PR now makes it possible to pass text embeddings and image embeddings directly, which should enable those tasks!
I think we could create a super cool community pipeline. One pipeline could automatically create interpolations between two text prompts, and similarly we could create one that interpolates between two images.
In terms of design, to stay as efficient as possible, the following would make sense:
- The user passes two text prompts and a num_interpolations input.
- The pipeline then embeds those two text prompts into the text embeddings x_0 and x_N, and num_interpolations embeddings x_1, x_2, ... x_N-1 are created using the slerp function.
- Then we have num_interpolations + 2 text embeddings that should be passed in a batch through the model to create a nice interpolation of images.
- It'd be important to make use of enable_cpu_offload() to save memory.
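To make the interpolation step concrete, here is a minimal sketch of how the num_interpolations + 2 embeddings could be built with slerp. It uses plain Python lists so that it runs without dependencies; in an actual pipeline the embeddings would be tensors. The helper names slerp and interpolation_batch, the eps threshold, and the even spacing of the interpolation steps are assumptions made for illustration, not existing library code.

```python
import math
from typing import List, Sequence


def slerp(t: float, v0: Sequence[float], v1: Sequence[float], eps: float = 1e-7) -> List[float]:
    """Spherical linear interpolation between two non-zero vectors (hypothetical helper)."""
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    # Cosine of the angle between the vectors, clamped against rounding error.
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    dot = max(-1.0, min(1.0, dot))
    theta = math.acos(dot)
    sin_theta = math.sin(theta)
    if sin_theta < eps:
        # Vectors are (anti-)parallel: fall back to plain linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1.0 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


def interpolation_batch(
    x_0: Sequence[float], x_N: Sequence[float], num_interpolations: int
) -> List[List[float]]:
    """Return x_0, the num_interpolations intermediate embeddings, and x_N."""
    steps = num_interpolations + 1  # number of intervals between x_0 and x_N
    return [slerp(i / steps, x_0, x_N) for i in range(steps + 1)]


if __name__ == "__main__":
    batch = interpolation_batch([1.0, 0.0], [0.0, 1.0], num_interpolations=3)
    print(len(batch))  # num_interpolations + 2 embeddings
```

The resulting list would then be stacked into one batch and passed through the model in a single call, as described in the steps above.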
It's probably easier to start with the UnCLIPImageInterpolationPipeline, since image embeddings are just a single 1-d vector, whereas for text embeddings two latent vectors are used.
Would be more than happy to help if someone is interested in giving this a try. I think it'll make for some super cool demos.
Open source status
Provide useful links for the implementation
No response