A demo app showcasing Pinecone Full-Text Search combined with multimodal vector search using Gemini Embedding 2, over a corpus of ~2,079 North American bird Wikipedia articles, with one document per bird.
- Python 3.10+
- Pinecone account and API key
- Google AI Studio API key (for Gemini Embedding 2)
- Install dependencies:
    pip install -r requirements.txt
- Copy .env.example to .env and fill in your API keys:
    PINECONE_API_KEY=...
    GOOGLE_API_KEY=...
- Build the index. Start with a small sample to verify everything works:
    python build_index.py --sample 50
  Then ingest the full corpus (~2,079 birds):
    python build_index.py --sample 0
- Run the app:
    streamlit run app.py
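Before building the index, it can help to confirm that both keys are visible to Python. The check below is a minimal sketch, not part of the repo: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and it assumes the variables have already been exported or loaded from `.env` into the process environment.

```python
import os

# The two variable names come from .env.example; the helper itself is illustrative.
REQUIRED_KEYS = ("PINECONE_API_KEY", "GOOGLE_API_KEY")


def missing_keys(env=None):
    """Return the required key names that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("Both API keys are set.")
```

Passing a plain dict instead of `os.environ` makes the helper easy to try without touching real credentials.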
The bird dataset lives at parsed_birds/ (~58 MB, committed to the repo). It contains ~2,079 North American bird articles scraped from Wikipedia, structured as:
- parsing_metadata.json: index of all birds with image metadata
- text/<slug>.txt: full article text per bird
- images/<slug>/<slug>_1.jpg: primary photo per bird
Each bird is stored in Pinecone as a single document with three text fields (bird_name, intro, body) and one dense vector field (image_embedding). Set BIRD_DATA_DIR in your environment to override the default data path.
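The layout and document shape described above can be sketched roughly as follows. This is an illustration under assumptions rather than the code in build_index.py: the helper names, the `id` key, and the rule that the first paragraph becomes `intro` are all hypothetical. Only the paths, the `BIRD_DATA_DIR` variable, and the four field names come from this README.

```python
import os
from pathlib import Path


def data_dir(env=None):
    """Resolve the dataset root, honoring BIRD_DATA_DIR when it is set."""
    env = os.environ if env is None else env
    return Path(env.get("BIRD_DATA_DIR", "parsed_birds"))


def bird_paths(slug, root):
    """Return (article text path, primary photo path) for one bird slug."""
    return root / "text" / f"{slug}.txt", root / "images" / slug / f"{slug}_1.jpg"


def make_document(slug, bird_name, article_text, image_embedding):
    """Assemble one document: three text fields plus one dense vector field.

    Assumption: the first paragraph is treated as the intro and the rest
    as the body. The real split used at index time may differ.
    """
    intro, _, body = article_text.strip().partition("\n\n")
    return {
        "id": slug,  # hypothetical key name
        "bird_name": bird_name,
        "intro": intro.strip(),
        "body": body.strip(),
        "image_embedding": list(image_embedding),
    }
```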
Gemini Embedding 2 is a multimodal model that embeds both text and images into the same vector space. This makes cross-modal search possible: a text description like "tall pink wading bird" produces a vector that is directly comparable to the vector computed from a bird's photo at index time, so no separate image captioning or two-stage pipeline is needed. All image embeddings are precomputed by build_index.py and stored in Pinecone at 768 dimensions with cosine similarity.
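To make "directly comparable" concrete, here is a small sketch of cosine similarity ranking. It does not call Gemini or Pinecone: the 3-dimensional vectors are made-up stand-ins for the real 768-dimensional embeddings, and `cosine` and `rank_by_image` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_image(query_vec, image_vecs):
    """Return bird names ordered by similarity of their image vector to the query vector."""
    return sorted(image_vecs, key=lambda name: cosine(query_vec, image_vecs[name]), reverse=True)


# Toy vectors standing in for a text-query embedding and stored image embeddings.
query = [0.9, 0.1, 0.0]
images = {
    "American crow": [0.0, 0.2, 0.9],
    "American flamingo": [0.8, 0.2, 0.1],
}
ranking = rank_by_image(query, images)
```

Because text and images share one vector space, the same comparison works whether the query vector came from a sentence or from another photo.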
BM25 keyword scoring against body, intro, or bird_name. The multi mode searches all three fields at once and lets you write a different query per field (e.g. bird_name=swallow + body=in mountains). Toggle Phrase to require exact word adjacency via Lucene query_string.
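For readers unfamiliar with BM25, the sketch below scores a query against a handful of strings using the common Okapi formulation with a Lucene-style IDF. It is a local illustration only: the tokenizer is a plain whitespace split, and the `k1` and `b` defaults are conventional values, not settings confirmed for Pinecone.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for a whitespace-tokenized query."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    # Average document length; fall back to 1 to avoid dividing by zero.
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs or 1.0
    q_terms = query.lower().split()
    # Document frequency: how many documents contain each query term.
    df = {term: sum(1 for tokens in tokenized if term in tokens) for term in set(q_terms)}

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in q_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


bodies = [
    "barn swallow nests in barns",
    "golden eagle hunts in mountains",
    "cliff swallow colonies on cliffs",
]
swallow_scores = bm25_scores("swallow", bodies)
```

Rare terms earn a higher IDF, and repeated terms give diminishing returns, which is why a keyword that appears in few articles dominates the ranking.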
Type a description of what a bird looks like. The query is embedded via Gemini Embedding 2 and scored against each bird's stored image vector. This finds birds by appearance even when the article never uses your exact words.
A $match_all filter on body (every required keyword must appear in the article) combined with dense-vector visual reranking, all in a single Pinecone round trip. Use it when you need both a hard text gate and visual ranking.
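The gate-then-rank behavior can be imitated locally as a rough mental model. The sketch below is not the Pinecone request: `match_all` and `filter_then_rank` are hypothetical helpers, matching is a naive whitespace split, and the 2-dimensional vectors are toy values assumed to be unit length so a dot product equals cosine similarity.

```python
def match_all(body, keywords):
    """True only if every keyword appears as a whole word in the body text."""
    words = set(body.lower().split())
    return all(keyword.lower() in words for keyword in keywords)


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def filter_then_rank(birds, keywords, query_vec):
    """Keep birds whose body passes the keyword gate, then order them by visual similarity."""
    candidates = [bird for bird in birds if match_all(bird["body"], keywords)]
    return sorted(candidates, key=lambda bird: dot(query_vec, bird["image_embedding"]), reverse=True)


birds = [
    {"bird_name": "American flamingo", "body": "a tall pink wading bird of coastal lagoons",
     "image_embedding": [1.0, 0.0]},
    {"bird_name": "Roseate spoonbill", "body": "a pink wading bird with a spoon-shaped bill",
     "image_embedding": [0.6, 0.8]},
    {"bird_name": "American crow", "body": "a black bird common in cities",
     "image_embedding": [0.0, 1.0]},
]
results = filter_then_rank(birds, ["pink", "wading"], [1.0, 0.0])
names = [bird["bird_name"] for bird in results]
```

The crow never reaches the ranking step because it fails the text gate, no matter how its image vector scores.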
Raw Lucene query_string for advanced queries: boolean operators (+required -excluded), term boosts (eagle^3), phrase slop ("northern cardinal"~3), phrase prefixes, and cross-field clauses.
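The operators listed above compose as plain strings. The helpers below are a hypothetical convenience layer, not part of the app or of any Pinecone client. They only assemble standard Lucene query syntax, and whether every construct is accepted depends on the search backend.

```python
def required(clause):
    """Mark a clause as mandatory: +clause."""
    return f"+{clause}"


def excluded(clause):
    """Mark a clause as forbidden: -clause."""
    return f"-{clause}"


def boost(term, weight):
    """Raise a term's contribution to the score: term^weight."""
    return f"{term}^{weight}"


def phrase(text, slop=0):
    """Quote a phrase; a positive slop lets the words sit up to `slop` positions apart."""
    escaped = text.replace('"', '\\"')
    quoted = f'"{escaped}"'
    return f"{quoted}~{slop}" if slop else quoted


def field(name, clause):
    """Restrict a clause to one field: name:clause."""
    return f"{name}:{clause}"


query = " ".join([
    required(boost("eagle", 3)),
    excluded("golden"),
    field("body", phrase("northern cardinal", slop=3)),
])
```

Building the string from small pieces keeps quoting and escaping in one place, which helps when a query mixes boosts, phrases, and per-field clauses.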
Each tab shows the exact documents.search(...) call beneath the results so you can see what was sent to Pinecone.
MIT