AI Job Search System

An AI-powered job discovery engine that uses natural language queries and iterative refinement to search through 100,000 job postings.

Quick Start

Prerequisites

Python 3.9+
OpenAI API key

Installation

Clone repository

Create virtual environment and install dependencies:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Download dataset:
- Download jobs.jsonl from: https://drive.google.com/file/d/1RRVWYAvfb4hUus1hUDY1nPQUJGqpiBiq/view
- Place it in the data/ directory

Create .env file:
```
OPENAI_API_KEY=your_key_here
```

Run Demo

python demo.py

This runs 6 single-turn searches and 2 multi-turn refinement flows (3 turns each) against 100K job postings, then generates a token usage report.

Interactive Mode

python interactive.py

A REPL-style search bar where you can type queries and refine results conversationally. Commands: /new to reset, /tokens to see usage, /quit to exit.

Architecture

Project Structure

src/
  data_loader.py    - Loads jobs.jsonl, extracts structured fields, builds embedding matrices
  search_engine.py  - Hybrid search: embeddings + filters + optional LLM re-ranking
  refinement_engine.py - Multi-turn state management with rule-based and LLM intent parsing
  utils.py          - Vector normalization, safe nested dict access
demo.py             - Automated demo with 12 queries and token report
interactive.py      - Interactive search REPL

Data Representation

The system uses pre-computed embeddings (already included in the dataset) for three aspects of each job:

Explicit embedding - Job title, listed skills, requirements
Inferred embedding - Related skills, relevant experience
Company embedding - Company characteristics, industry, culture

All embeddings are 1536-dimensional vectors from OpenAI text-embedding-3-small. They are normalized and stacked into NumPy matrices at load time, so similarity against a query reduces to a single batch dot product.
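A minimal sketch of that normalize-then-dot-product idea, using a toy 4-dimensional matrix in place of the real 1536-dimensional data. The function name `l2_normalize` and the sample vectors are illustrative assumptions, not the code in utils.py or data_loader.py.

```python
import numpy as np

def l2_normalize(matrix: np.ndarray) -> np.ndarray:
    """Scale each row to unit length; all-zero rows are left as zeros."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.where(norms == 0.0, 1.0, norms)

# Toy stand-in for one embedding matrix (3 jobs, 4 dims instead of 1536).
job_matrix = l2_normalize(np.array([
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]))
query = l2_normalize(np.array([[2.0, 0.0, 0.0, 0.0]]))[0]

# With unit-length rows, one matrix-vector product gives the cosine
# similarity between the query and every job at once.
scores = job_matrix @ query

# Job indices ordered from most to least similar.
ranking = np.argsort(-scores)
```

Normalizing once at load time means no per-query division is needed; each search costs one matrix-vector product per embedding type.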

Structured fields are indexed for boolean filtering:

Remote status, location, seniority
Salary range, employment type
Company type (startup, nonprofit, enterprise)
Funding stage, employee count

Search Pipeline

Hybrid approach combining vector similarity + structured filters:

User Query
    |
[1] Embed query (text-embedding-3-small, ~5 tokens)
    |
[2] Extract structured filters (gpt-4o-mini, ~150 tokens)
    |
[3] Determine embedding weights (0 tokens - rule-based heuristic)
    |
[4] Compute weighted similarity scores (0 tokens - numpy dot product)
    |
[4b] Title keyword boost (0 tokens - string matching)
    |
[5] Apply filters progressively (0 tokens - boolean masking with graceful degradation)
    |
[6] Deduplicate results by (title, company)
    |
[7] Optional re-ranking (gpt-4o-mini, ~800 tokens - only for ambiguous queries)
    |
Top 10 Results

Key optimization: Most queries skip LLM re-ranking entirely. Only queries with ambiguous/subjective keywords (e.g., "culture", "mission-driven", "innovative") trigger re-ranking.
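The re-ranking trigger could look roughly like the sketch below. The three terms are the ones quoted above; the actual keyword list in search_engine.py is presumably longer, and the function name `needs_reranking` is an assumption made for illustration.

```python
import re

# Terms quoted in the README; the real list is likely longer (assumption).
AMBIGUOUS_TERMS = {"culture", "mission-driven", "innovative"}

def needs_reranking(query: str) -> bool:
    """Return True when the query contains a subjective keyword."""
    # Keep hyphenated words such as "mission-driven" as single tokens.
    tokens = set(re.findall(r"[a-z]+(?:-[a-z]+)*", query.lower()))
    return not AMBIGUOUS_TERMS.isdisjoint(tokens)
```

Because the check is plain string matching, deciding whether to spend the ~800 re-ranking tokens costs no tokens itself.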

Relevance Ranking

Dynamic embedding weights based on query intent (rule-based, 0 tokens):

Job-focused queries (e.g., "senior ML engineer")
- 70% explicit, 20% inferred, 10% company
Company-focused queries (e.g., "mission-driven startup")
- 20% explicit, 20% inferred, 60% company
Balanced queries (e.g., "data science at nonprofits")
- 50% explicit, 30% inferred, 20% company
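A hedged sketch of how the weight selection might work. The three weight triples come from the list above; the keyword sets and the names `classify_query` and `combined_score` are hypothetical, since the README does not spell out the actual heuristic.

```python
# (explicit, inferred, company) weights from the README.
WEIGHTS = {
    "job": (0.7, 0.2, 0.1),
    "company": (0.2, 0.2, 0.6),
    "balanced": (0.5, 0.3, 0.2),
}

# Hypothetical keyword sets; the real heuristic may use different signals.
COMPANY_TERMS = {"startup", "startups", "nonprofit", "nonprofits",
                 "mission-driven", "culture"}
ROLE_TERMS = {"engineer", "scientist", "manager", "designer", "analyst",
              "data", "science", "ml"}

def classify_query(query: str) -> str:
    """Label a query as job-focused, company-focused, or balanced."""
    tokens = set(query.lower().split())
    has_company = bool(tokens & COMPANY_TERMS)
    has_role = bool(tokens & ROLE_TERMS)
    if has_company and has_role:
        return "balanced"
    if has_company:
        return "company"
    return "job"

def combined_score(intent: str, explicit: float, inferred: float,
                   company: float) -> float:
    """Blend the three per-aspect similarities with the intent's weights."""
    w_explicit, w_inferred, w_company = WEIGHTS[intent]
    return w_explicit * explicit + w_inferred * inferred + w_company * company
```

For example, a job-focused query with per-aspect similarities of 0.8, 0.5, and 0.1 would score 0.7 * 0.8 + 0.2 * 0.5 + 0.1 * 0.1 = 0.67.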

A title keyword boost adds a small score bonus (+0.05 per match) for jobs whose title contains query keywords, helping maintain role focus when filters narrow the candidate pool. A stopword list prevents generic terms like "entry", "level", "roles" from inflating irrelevant matches.
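The boost can be sketched as follows. The +0.05 value and the three example stopwords come from the paragraph above; the remaining stopwords, the tokenization, and the name `title_boost` are assumptions.

```python
import re

BOOST_PER_MATCH = 0.05

# "entry", "level", "roles" are from the README; the rest are guesses.
STOPWORDS = {"entry", "level", "roles", "jobs", "at", "in", "the", "for"}

def title_boost(query: str, title: str) -> float:
    """Return the score bonus for query keywords that appear in the title."""
    keywords = {
        token
        for token in re.findall(r"[a-z0-9]+", query.lower())
        if token not in STOPWORDS
    }
    title_tokens = set(re.findall(r"[a-z0-9]+", title.lower()))
    return BOOST_PER_MATCH * len(keywords & title_tokens)
```

With the stopword list in place, "entry level design roles" boosts "Design Professional" by one match but gives nothing to a title like "Entry Level Sales Roles".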

Filter System

Filters are applied progressively: each filter is only kept if it doesn't reduce the candidate pool below 5 results. This prevents over-filtering on niche query combinations while preserving as many constraints as possible.
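A minimal sketch of progressive filtering under the 5-result floor. It uses plain lists and predicates for readability, whereas the pipeline above describes boolean masking; the function name and the job dictionary fields are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Job = Dict[str, object]
NamedFilter = Tuple[str, Callable[[Job], bool]]

MIN_RESULTS = 5

def apply_progressively(
    jobs: List[Job],
    filters: List[NamedFilter],
    min_results: int = MIN_RESULTS,
) -> Tuple[List[Job], List[str]]:
    """Apply filters in order, skipping any that would shrink the pool too far."""
    pool = list(jobs)
    kept: List[str] = []
    for name, predicate in filters:
        narrowed = [job for job in pool if predicate(job)]
        if len(narrowed) >= min_results:
            pool = narrowed
            kept.append(name)
        # Otherwise the filter is dropped and the pool stays unchanged.
    return pool, kept

# Ten toy jobs: even ids are remote, ids 0-2 are senior.
jobs = [{"id": i, "remote": i % 2 == 0, "senior": i < 3} for i in range(10)]
pool, kept = apply_progressively(
    jobs,
    [("remote", lambda j: bool(j["remote"])),
     ("senior", lambda j: bool(j["senior"]))],
)
```

Here the remote filter leaves five jobs and is kept, while the senior filter would leave only two and is dropped, so the result is the five remote jobs.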

Special handling:

Startup detection expands beyond the org_type label to include companies with early-stage funding (seed, series A/B, pre-seed, angel) or <= 50 employees, with a +0.10 score bonus for explicitly labeled startups
Seniority fallback matches title-based indicators (e.g., "Sr.", "Lead", "Principal", "Staff") when structured seniority data is missing
Remote + location contradiction automatically drops the location filter when remote is specified
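The startup and seniority rules above might be expressed like this. The funding stages, the 50-employee threshold, the +0.10 bonus, and the title indicators are taken from the list; the field names (`org_type`, `funding_stage`, `employee_count`, `seniority`, `title`) and function names are assumptions about the record layout.

```python
import re
from typing import Dict

EARLY_STAGES = {"seed", "series a", "series b", "pre-seed", "angel"}
MAX_STARTUP_EMPLOYEES = 50
STARTUP_LABEL_BONUS = 0.10

# Title indicators named in the README; matches "Sr." as well as "Sr".
SENIOR_TITLE = re.compile(r"\b(?:sr|senior|lead|principal|staff)\b", re.IGNORECASE)

def is_startup(job: Dict[str, object]) -> bool:
    """Treat labeled startups, early-stage funding, or small headcount as startups."""
    if job.get("org_type") == "startup":
        return True
    stage = str(job.get("funding_stage") or "").lower()
    if stage in EARLY_STAGES:
        return True
    employees = job.get("employee_count")
    return isinstance(employees, int) and employees <= MAX_STARTUP_EMPLOYEES

def startup_bonus(job: Dict[str, object]) -> float:
    """Only explicitly labeled startups receive the score bonus."""
    return STARTUP_LABEL_BONUS if job.get("org_type") == "startup" else 0.0

def is_senior(job: Dict[str, object]) -> bool:
    """Use structured seniority when present, else fall back to the title."""
    seniority = job.get("seniority")
    if seniority:
        return str(seniority).lower() == "senior"
    return bool(SENIOR_TITLE.search(str(job.get("title") or "")))
```

The word-boundary pattern keeps a title such as "Staffing Coordinator" from being mistaken for a staff-level role.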

Refinement Engine

Multi-turn search with stateful conversation tracking:

Rule-based fast path (0 tokens) - Short filter-only inputs like "make it remote", "at startups", "senior level only" are parsed with pattern matching and applied instantly
LLM intent parsing (~150 tokens) - For complex follow-ups, GPT-4o-mini classifies the intent (refinement, pivot, or clarification) and extracts new filters
Filter merging - New filters are merged with existing ones (new values override old; remote overrides location)
Re-search - The original query embedding is reused with updated filters, preserving role focus
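A rough sketch of the fast path and the merge step described above. The patterns, the word-count cutoff, the filter keys, and the function names are all hypothetical; only the example phrases and the override rules come from the list.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns covering the three example phrases in the README.
FAST_PATH_PATTERNS = [
    (re.compile(r"\bremote\b"), {"remote": True}),
    (re.compile(r"\bstartups?\b"), {"company_type": "startup"}),
    (re.compile(r"\bsenior\b"), {"seniority": "senior"}),
]
MAX_FAST_PATH_WORDS = 5  # assumed cutoff for "short" inputs

def parse_fast_path(text: str) -> Optional[Dict[str, object]]:
    """Return filters for short filter-only inputs, or None to defer to the LLM."""
    lowered = text.lower()
    if len(lowered.split()) > MAX_FAST_PATH_WORDS:
        return None
    filters: Dict[str, object] = {}
    for pattern, update in FAST_PATH_PATTERNS:
        if pattern.search(lowered):
            filters.update(update)
    return filters or None

def merge_filters(existing: Dict[str, object],
                  new: Dict[str, object]) -> Dict[str, object]:
    """New values override old ones; remote overrides any location filter."""
    merged = {**existing, **new}
    if merged.get("remote"):
        merged.pop("location", None)
    return merged
```

In this sketch, "make it remote" resolves without an LLM call, while a longer follow-up such as "at companies or non-profits that care about social good" returns None and would go to intent parsing.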

Demo Results

Results from the latest demo run (12 queries total):

Single-Turn Searches

| # | Query | Top Result | Quality |
| --- | --- | --- | --- |
| 1 | "senior software engineer remote" | Senior Full Stack SW Engineer - Zealogics (Remote) | All 5 senior + remote + SW engineer |
| 2 | "data science internships" | Data Science Intern - XPO, Inc. (Boston) | Top 4 are actual DS/data interns |
| 3 | "product manager roles at early stage startups" | Senior PM - Jerry.ai (Remote, Tech Startup) | Top 3 are PM roles at labeled startups |
| 4 | "backend engineer jobs in New York paying over 150k" | Backend SW Engineer - Medal ($180K-$250K, NYC) | All backend, >$150K, NY-area |
| 5 | "entry level design roles" | Design Professional - McMillan Pazdan Smith | Associates, interns, entry-level designers |
| 6 | "mission-driven nonprofit data roles" | Donor Data Specialist - Springs Rescue Mission | All data roles at nonprofits |

Multi-Turn Refinement Flows

Flow 1: Data Science -> Social Good -> Remote

| Turn | Query | Active Filters | Top Result |
| --- | --- | --- | --- |
| 1 | "data science jobs" | none | Senior Manager, Data Science - American Express |
| 2 | "at companies or non-profits that care about social good" | nonprofit | Senior Data Science Analyst - Mayo Clinic |
| 3 | "make it remote" | remote, nonprofit | Senior Data Science Analyst - Mayo Clinic (Remote) |

Flow 2: ML Engineer -> Startups -> Senior

| Turn | Query | Active Filters | Top Result |
| --- | --- | --- | --- |
| 1 | "machine learning engineer" | none | ML Engineer - CompassMSP ($120K-$200K) |
| 2 | "at early stage startups" | startup | Product Engineer - Advocate (Remote, Tech Startup) |
| 3 | "senior level only" | senior, startup | Senior SW Engineer - Overland AI ($200K-$240K, Startup) |

Trade-offs

Optimized For

Token efficiency - Average 402 tokens/search, 259 tokens/refinement
Result quality - 12/12 demo queries return clearly relevant top results
Fast iteration - No database, no indexing server, pure in-memory NumPy
100K scale - All jobs loaded and searchable in ~60 seconds

Deprioritized

Sub-second latency (embedding + LLM calls take 2-5s per query)
Exact phrase matching (embeddings are semantic, not lexical)
Learning from user feedback (stateless between sessions)
Production-grade error handling

Query Performance

Works Well

Clear job titles: "senior software engineer", "data analyst"
Location + remote: "remote jobs in tech", "NYC product manager"
Company types: "startup ML roles", "nonprofit data science"
Salary filters: "engineering jobs over 150k"
Multi-turn refinement: "data science" -> "at nonprofits" -> "remote"

Challenging

Very vague queries: "interesting roles", "good culture" (triggers re-ranking to help)
Hyper-specific tech stacks: "Python + Kafka + Airflow + dbt"
Company names not in dataset
Queries mixing many constraints (graceful degradation drops any filter that would leave fewer than 5 results)

Token Usage

Demo totals (12 queries):

Single-turn searches (6 queries): 2,415 tokens
Multi-turn refinements (6 turns): 1,556 tokens
Total: 3,971 tokens (~$0.0006 with GPT-4o-mini)

Per-query averages:

Single-turn search: ~402 tokens
Refinement turn: ~259 tokens

Breakdown by operation:

| Operation | Tokens | Purpose |
| --- | --- | --- |
| Filter extraction | 1,798 | GPT-4o-mini extracts structured filters from queries |
| Intent parsing | 1,104 | GPT-4o-mini classifies refinement intent |
| Re-ranking | 1,015 | GPT-4o-mini re-ranks ambiguous queries (1 of 6 searches) |
| Embedding | 54 | text-embedding-3-small for query vectors |
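The figures in this section can be cross-checked with a few lines of arithmetic. All numbers below are copied from the report above; only the variable names are introduced here.

```python
# Demo totals from the report.
single_turn_tokens, single_turn_queries = 2415, 6
refinement_tokens, refinement_turns = 1556, 6

total_tokens = single_turn_tokens + refinement_tokens

# Per-query averages (reported above as ~402 and ~259).
avg_single_turn = single_turn_tokens / single_turn_queries
avg_refinement = refinement_tokens / refinement_turns

# Breakdown by operation; it should add up to the same total.
breakdown = {
    "filter_extraction": 1798,
    "intent_parsing": 1104,
    "re_ranking": 1015,
    "embedding": 54,
}
breakdown_total = sum(breakdown.values())
```

Both routes give 3,971 tokens, and the averages work out to 402.5 and about 259.3 tokens.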

Future Improvements

Embedding cache - Cache query embeddings for similar/repeated searches, saving the embedding call (~5 tokens plus a network round trip) per repeated query
BM25 hybrid retrieval - Add TF-IDF/BM25 scoring alongside embeddings for exact keyword matching
Smarter re-ranking triggers - ML model to predict when re-ranking helps vs. wastes tokens
User feedback loop - Track clicks, fine-tune weights, personalization
Approximate nearest neighbor - FAISS/Annoy for scaling to 1M+ jobs
Explanation generation - Show why each job matched to build user trust
