A research toolkit for systematically analyzing gender bias in Large Language Model (LLM) responses to job description generation tasks.
This repository contains tools designed to help researchers study gender bias patterns in LLMs. The toolkit systematically tests how different models respond to job description generation prompts, analyzing whether they exhibit gender stereotyping or bias in their outputs.
- Generates job descriptions across 196 different occupations using a range of LLMs
- Tests for gender bias by comparing responses to neutral vs. gendered job titles (e.g., "server" vs. "waiter" vs. "waitress")
- Analyzes multiple dimensions including salary estimates, language patterns, and role descriptions
- Supports 40+ models from providers including Anthropic (Claude), OpenAI (GPT), Google (Gemini), Meta (Llama), and others
- Provides statistical analysis including Inter-Rater Reliability (IRR) and Bem Sex Role Inventory (BSRI) scoring
This toolkit is designed for academic researchers studying:
- AI bias and fairness
- Gender representation in AI systems
- Computational social science
- Digital humanities
- Technology ethics and policy
Authors: Jennifer M. Krebsbach, Jane E. Lee, Steven Zeck, Arti Thakur, and Martin Hilbert
Institution: University of California, Davis
Contact: spzeck@health.ucdavis.edu, jkrebsbach@ucdavis.edu
If you use this framework in your research, please cite:
```bibtex
@misc{llm-bias-analysis-tools,
  title={LLM Gender Bias Analysis Tools},
  author={Krebsbach, Jennifer and Lee, Jane and Zeck, Steven},
  year={2025},
  url={https://github.com/xaintly/llm_bias_analysis_tools}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
- Python 3.8 or higher
- Access to at least one LLM provider (AWS Bedrock, OpenAI, Google AI, etc.)
- Basic familiarity with command line operations
- Clone the repository:

  ```
  git clone [repository-url]
  cd llm_bias_analysis_tools
  ```

- Install Python dependencies:

  ```
  pip install boto3 openai google-generativeai requests pandas openpyxl
  ```
The toolkit supports multiple LLM providers. You only need to configure the providers you plan to use.
AWS Bedrock provides access to Claude, Llama, Titan, and other models through a single interface.
- Create AWS credentials file: Create a file named `.aws.profile` in the project directory containing the profile name:

  ```
  default
  ```

- Configure AWS credentials (choose one method):

  Option A: Environment variables

  ```
  export AWS_ACCESS_KEY_ID=your_access_key_here
  export AWS_SECRET_ACCESS_KEY=your_secret_key_here
  export AWS_SESSION_TOKEN=your_session_token_here  # if using temporary credentials
  ```

  Option B: AWS credentials file

  ```
  aws configure --profile default
  ```
- Request model access:
  - Log into AWS Console → Bedrock → Model Access
  - Request access to desired models (Claude, Llama, etc.)
  - Access approval can take 1-2 business days
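Once access is granted, the connection can be sanity-checked before a full run. A minimal sketch using boto3 (the region, model ID, and prompt here are illustrative assumptions; the toolkit's own invocation logic lives in `run_llm_prompt.py`):

```python
# Minimal Bedrock connectivity check (illustrative, not the toolkit's own code).
import json

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")  # region is an assumption

response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # any model you have access to
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Write a job description for a server."}],
    }),
)
body = json.loads(response["body"].read())
print(body["content"][0]["text"])
```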
- Get API key:
  - Visit https://platform.openai.com/api-keys
  - Create a new API key

- Create credentials file: Create `.openai-api.key` in the project directory containing the key:

  ```
  your_openai_api_key_here
  ```
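To confirm the key file is read correctly, a quick smoke test with the `openai` client (the model name is an assumption; substitute any model your key can access):

```python
# Quick OpenAI smoke test (illustrative; model name is an assumption).
from pathlib import Path

from openai import OpenAI

client = OpenAI(api_key=Path(".openai-api.key").read_text().strip())
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a one-line job description for a server."}],
)
print(reply.choices[0].message.content)
```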
- Get API key:
  - Visit https://aistudio.google.com/app/apikey
  - Create a new API key

- Create credentials file: Create `.gemini-api.key` in the project directory containing the key:

  ```
  your_gemini_api_key_here
  ```
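Similarly, a quick smoke test with the `google-generativeai` client (model name is an assumption):

```python
# Quick Gemini smoke test (illustrative; model name is an assumption).
from pathlib import Path

import google.generativeai as genai

genai.configure(api_key=Path(".gemini-api.key").read_text().strip())
model = genai.GenerativeModel("gemini-1.5-flash")
print(model.generate_content("Write a one-line job description for a server.").text)
```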
For running models locally:
- Install Ollama:
  - Visit https://ollama.ai
  - Download and install for your operating system

- Pull desired models:

  ```
  ollama pull llama2:7b
  ollama pull deepseek-r1:7b
  ```

- Create URL file: Create `.ollama.url` in the project directory containing the endpoint:

  ```
  http://localhost:11434/api/generate
  ```
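The toolkit reads the endpoint from `.ollama.url`; a quick manual check of the same endpoint, sketched with `requests`:

```python
# Quick Ollama smoke test via its HTTP generate API (illustrative).
from pathlib import Path

import requests

url = Path(".ollama.url").read_text().strip()  # e.g., http://localhost:11434/api/generate
resp = requests.post(url, json={
    "model": "llama2:7b",
    "prompt": "Write a one-line job description for a server.",
    "stream": False,  # return the full response as a single JSON object
})
resp.raise_for_status()
print(resp.json()["response"])
```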
- Test your setup:

  ```
  python run_llm_prompt.py
  ```

  This will generate job descriptions for a single occupation using the available models.
- View results: Results are saved in the `results/` directory, organized by:
  - Model name (e.g., `claude-3-5-sonnet/`)
  - Prompt type (e.g., `prompt1/`)
  - Job title (e.g., `server.txt`)
Modify job lists:

- Edit `input_data/job_triads.csv` to change which occupations are tested
- Each row contains: neutral_title, male_title, female_title, and metadata (see the sketch below)
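To inspect how the triads are structured, a short pandas sketch (the exact column headers are assumptions based on the description above; check the CSV itself):

```python
# Inspect the occupation triads (column names are assumptions).
import pandas as pd

triads = pd.read_csv("input_data/job_triads.csv")
print(triads.head())

# Each row pairs a neutral title with its gendered variants,
# e.g., server / waiter / waitress.
for _, row in triads.head(3).iterrows():
    print(row["neutral_title"], row["male_title"], row["female_title"])
```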
Customize prompts:

- Edit files in `input_data/` to modify the prompts sent to models:
  - `prompt1.txt`: Basic job description generation
  - `prompt2.txt`: Bias evaluation prompt
  - `prompt3.1.txt`: BSRI-based evaluation
Select specific models:

- Edit `input_data/llm_config.ini` to enable/disable specific models or providers
- Use the `disabled_models` parameter to exclude certain models (see the sketch below)
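A minimal sketch of reading that setting with Python's `configparser` (the section name and comma-separated value format are assumptions; check the shipped `llm_config.ini` for the actual layout):

```python
# Read the disabled-models setting (section/key layout is an assumption).
import configparser

config = configparser.ConfigParser()
config.read("input_data/llm_config.ini")

disabled = config.get("models", "disabled_models", fallback="")
disabled_models = {name.strip() for name in disabled.split(",") if name.strip()}
print("Excluded models:", disabled_models)
```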
To aggregate the results, run:

```
python create_data_summary.py
```

This creates Excel files with:
- Salary analysis across models and job types
- Bias scoring and Inter-Rater Reliability metrics
- BSRI (Bem Sex Role Inventory) analysis
- Statistical comparisons
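As an illustration of the salary comparison above, a toy sketch of the kind of gap analysis the summary supports (the column names here are hypothetical; the actual Excel layout may differ):

```python
# Toy salary-gap comparison across gendered title variants
# (column names are hypothetical illustrations).
import pandas as pd

df = pd.read_excel("results/data_summary-YYYYMMDD.xlsx")  # substitute the real date stamp

# Mean estimated salary per model and title variant (neutral/male/female).
by_variant = df.groupby(["model", "variant"])["salary_estimate"].mean().unstack()
by_variant["male_minus_female"] = by_variant["male"] - by_variant["female"]
print(by_variant.sort_values("male_minus_female", ascending=False))
```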
To validate the outputs, run:

```
python validate_llm_results.py
```

This checks for:
- Missing or incomplete responses
- Data quality issues
- Consistency across model outputs
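A simplified version of such a completeness check, assuming the `results/` layout shown below:

```python
# Simplified completeness check over the results tree (illustrative).
from pathlib import Path

for model_dir in Path("results").iterdir():
    if not model_dir.is_dir():
        continue  # skip the aggregated .xlsx files
    for prompt_dir in model_dir.glob("prompt*"):
        for output in prompt_dir.glob("*.txt"):
            if output.stat().st_size == 0:
                print(f"Empty response: {output}")
```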
The `results/` directory is organized as follows:

```
results/
├── claude-3-5-sonnet/               # Model-specific folders
│   ├── prompt1/                     # Basic job descriptions
│   │   ├── server.txt               # Individual job outputs
│   │   └── waiter.txt
│   └── prompt2/                     # Bias evaluation outputs
├── data_summary-YYYYMMDD.xlsx       # Aggregated analysis
└── irr_combinations-YYYYMMDD.xlsx   # Reliability metrics
```
- Salary Analysis: Compares estimated salaries across gendered variants
- Bias Scores: Quantifies detected bias on 1-10 scales
- BSRI Scores: Measures masculine/feminine/neutral trait attribution
- IRR Metrics: Assesses consistency between different models/raters
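For intuition about the IRR metric above, a small sketch computing Cohen's kappa between two models' bias ratings (the toolkit's IRR implementation may use a different coefficient, and unweighted kappa treats the 1-10 scores as plain categories; the scores below are invented for illustration):

```python
# Cohen's kappa between two raters (illustrative, toy data).
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)  # chance agreement
    return (observed - expected) / (1 - expected)

# Invented 1-10 bias scores from two models over the same six outputs.
model_a = [3, 7, 5, 2, 8, 5]
model_b = [3, 6, 5, 2, 8, 4]
print(round(cohens_kappa(model_a, model_b), 3))
```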
"Model unavailable" errors:
- Check that you've requested access through the provider (especially AWS Bedrock)
- Verify your API keys are correctly configured
- Some models may have regional restrictions
API rate limiting:
- The toolkit includes automatic retry logic with exponential backoff (a generic version of the pattern is sketched below)
- Consider running smaller batches if you encounter persistent rate limits
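A generic version of that retry pattern, sketched here for illustration (not the toolkit's exact implementation):

```python
# Generic exponential-backoff retry pattern (illustrative).
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # in practice, catch provider-specific rate-limit errors
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)  # jittered backoff
            print(f"Retry {attempt + 1} in {delay:.1f}s after: {exc}")
            time.sleep(delay)
```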
Missing dependencies:
- Install the required packages:

  ```
  pip install boto3 openai google-generativeai requests pandas openpyxl
  ```
Credential errors:
- Ensure credential files are in the project root directory
- Check that environment variables are set correctly
- Verify file permissions on credential files
For technical issues:
- Check the `results/prompt1_failures/` directory for detailed error logs
- Review the configuration in `input_data/llm_config.ini`
- Test individual model connections before running the full analysis
- This toolkit is designed for academic research on AI bias detection and mitigation
- Results should be interpreted within appropriate statistical and social contexts
- Consider potential limitations of bias measurement approaches
- No personal data is collected or processed
- All interactions are with AI models using synthetic job descriptions
- API usage follows standard terms of service for each provider
- All prompts and configurations are version-controlled
- Random seeds and model parameters are documented
- Results include timestamps and model version information
We welcome contributions from researchers and developers:
- Fork the repository
- Create a feature branch for your changes
- Add appropriate documentation
- Submit a pull request with a clear description of changes
- Additional LLM provider integrations
- New bias measurement approaches
- Statistical analysis enhancements
- Documentation improvements
- Visualization tools
This research builds upon established work in:
- AI bias detection methodologies
- Gender stereotype measurement in psychology
- Computational social science approaches
- Open science and reproducible research practices