A semantic recommendation system that uses Retrieval-Augmented Generation (RAG) to deliver intelligent, context-aware product suggestions. Users describe what they're looking for in natural language, and the system retrieves the most relevant items using vector similarity search, metadata filtering, and emotion-based re-ranking.
This engine combines vector embeddings, zero-shot classification, and emotion analysis to go beyond keyword matching and understand the intent behind a user's query.
┌─────────────────────────────────────────────────────────────────────┐
│ DATA PREPARATION PIPELINE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Raw Product Data │
│ │ │
│ ▼ │
│ ┌─────────────────┐ ┌──────────────────┐ ┌──────────────┐ │
│ │ Data Cleaning │───▶│ Classification │───▶│ Emotion │ │
│ │ & Preprocessing │ │ (Zero-Shot NLI) │ │ Analysis │ │
│ └─────────────────┘ └──────────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Cleaned Metadata Category Labels Emotion Scores │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Vector Embedding & Indexing │ │
│ │ (Sentence Transformers + ChromaDB) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ RUNTIME QUERY FLOW │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ User Query: "a story about redemption and second chances" │
│ │ │
│ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌───────────────────┐ │
│ │ Embed │────▶│ Vector Search │────▶│ Retrieve Top 200 │ │
│ │ Query │ │ (ChromaDB) │ │ Candidates │ │
│ └──────────────┘ └──────────────┘ └───────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Category Filter │ │
│ │ (Fiction/Nonfic) │ │
│ └──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Emotion Re-Rank │ │
│ │ (Joy/Fear/Sad..) │ │
│ └──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Top 16 Results │ │
│ │ with Thumbnails │ │
│ └──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
- Semantic Search — Understands natural language queries, not just keywords. "A story about forgiveness" matches books about redemption, atonement, and reconciliation.
- Category Filtering — Zero-shot classification using BART-large-MNLI automatically categorizes items into Fiction, Nonfiction, and Children's categories.
- Emotion-Based Re-Ranking — Each item is scored across 7 emotions (joy, anger, disgust, fear, sadness, surprise, neutral) using a fine-tuned DistilRoBERTa model. Users can filter by emotional tone.
- Vector Persistence — ChromaDB stores embeddings on disk for instant startup after initial indexing.
- Interactive UI — Clean Gradio web interface with category dropdowns, tone selectors, and a visual gallery of results.
| Layer | Technology | Purpose |
|---|---|---|
| Embeddings | all-MiniLM-L6-v2 |
384-dim sentence transformers for semantic vector representation |
| Vector Store | ChromaDB | Persistent vector similarity search with HNSW indexing |
| Classification | facebook/bart-large-mnli |
Zero-shot category assignment via natural language inference |
| Emotion Detection | j-hartmann/emotion-english-distilroberta-base |
7-class emotion classification on product descriptions |
| Orchestration | LangChain | Document loading, text splitting, and embedding pipeline |
| Frontend | Gradio (Glass theme) | Interactive web dashboard |
| Data Processing | Pandas, NumPy | Feature engineering and data manipulation |
.
├── Gardio-Dashboard.py # Main application — Gradio web UI
├── Data-explore-Books.ipynb # Step 1: Data cleaning & exploration
├── Text-classification.ipynb # Step 2: Zero-shot category classification
├── sentiment-analysis.ipynb # Step 3: Emotion scoring pipeline
├── Vector-search.ipynb # Step 4: Embedding generation & vector store setup
├── Platform-products.ipynb # E-commerce product analysis (extension)
│
├── books_cleaned.csv # Cleaned dataset (5,197 items)
├── books_final_categorized.csv # Items with category labels
├── books_with_emotions.csv # Items with 7 emotion scores
├── tagged_description.txt # Indexed descriptions for vector store
├── chroma_db/ # Persisted ChromaDB vector store
├── cover-not-found.jpg # Fallback thumbnail
│
├── .env # Environment variables
└── .gitignore
- Python 3.11+
- pip or conda
# Clone the repository
git clone https://github.com/your-username/semantic-recommendation-engine.git
cd semantic-recommendation-engine
# Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install langchain langchain-community langchain-text-splitters langchain-chroma \
langchain-huggingface transformers sentence-transformers \
pandas numpy gradio python-dotenv kagglehubExecute the notebooks in order to prepare the data and vector store:
jupyter notebook- Data-explore-Books.ipynb — Cleans raw data, filters short descriptions, outputs
books_cleaned.csv - Text-classification.ipynb — Classifies items into categories, outputs
books_final_categorized.csv - sentiment-analysis.ipynb — Generates emotion scores, outputs
books_with_emotions.csv - Vector-search.ipynb — Creates embeddings and builds the ChromaDB vector store
python Gardio-Dashboard.pyThe Gradio dashboard will open at http://localhost:7860.
Product descriptions are prepended with their unique ID, split into chunks, and embedded into 384-dimensional vectors using all-MiniLM-L6-v2. These are stored in ChromaDB with an HNSW index for fast approximate nearest-neighbor search.
When a user submits a query, it's embedded using the same model. ChromaDB returns the top 200 most similar items by cosine distance.
Results are filtered by category (Fiction, Nonfiction, etc.) and narrowed to the top 16 candidates.
Pre-computed emotion scores (joy, surprise, anger, fear, sadness) are used to sort results by the user's desired emotional tone. This surfaces items that are not just semantically relevant but emotionally aligned.
This system is designed to be adaptable to any product domain:
| Component | What to Change |
|---|---|
| Dataset | Replace books_with_emotions.csv with your product catalog |
| Categories | Update the zero-shot labels in the classification notebook |
| Emotions | Adjust the emotion model or scoring logic in the sentiment notebook |
| UI | Modify Gardio-Dashboard.py to change layout, filters, or display format |
| Embedding Model | Swap all-MiniLM-L6-v2 for a domain-specific model in the embedding config |
- LangChain for the RAG orchestration framework
- HuggingFace for pre-trained transformer models
- ChromaDB for the vector database
- Gradio for the interactive web interface
Built with RAG, powered by semantic understanding.