A comprehensive data analytics project that analyzes job market trends, salary patterns, and employment characteristics using Python and machine learning techniques.
This project provides an in-depth analysis of job market data to uncover insights about:
- Salary trends and correlations with experience
- Distribution of job titles and work types
- Most in-demand skills and qualifications
- Company size impact on compensation
- Predictive modeling for salary estimation
- Job market clustering and segmentation
- Explore and clean job market data from various sources
- Engineer features to create meaningful analytical variables
- Visualize patterns in job titles, salaries, skills, and qualifications
- Build predictive models to estimate salaries based on job characteristics
- Cluster jobs to identify market segments
- Provide actionable insights for job seekers, employers, and recruiters
- Python 3.x
- Pandas - Data manipulation and analysis
- NumPy - Numerical computing
- Matplotlib & Seaborn - Data visualization
- Scikit-learn - Machine learning (model comparison)
- Regular Expressions (re) - Text parsing and extraction
- Jupyter Notebook - Interactive development environment
data analytics project/
│
├── job-market-data-analytics-project.ipynb # Main analysis notebook
└── README.md # Project documentation
pip install pandas numpy matplotlib seaborn jupyter
- Clone or download this repository
- Ensure the dataset is available at the specified path, or update the path in the notebook
- Open Jupyter Notebook:
  jupyter notebook job-market-data-analytics-project.ipynb
- Run all cells sequentially to reproduce the analysis
- Import necessary libraries
- Load job descriptions dataset
- Create working copy for analysis
- Display dataset structure and dimensions
- Examine data types and distributions
- Identify missing values
- Generate statistical summaries
- Top Job Titles: Identify most common positions
- Work Type Distribution: Analyze full-time, part-time, contract, etc.
- Company Sizes: Examine employer size distribution
- Skills Analysis: Extract and visualize most demanded skills
- Qualifications: Analyze educational requirements
- Remove unnecessary columns (Role, Job Id, latitude, longitude, etc.)
- Handle missing values
- Sample data strategically (50% per country, company size, job title)
- Reduce dataset size while maintaining representativeness
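The per-group sampling step can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the helper name `stratified_sample`, the column names, and the toy data are assumptions.

```python
import pandas as pd

def stratified_sample(df, group_cols, frac=0.5, seed=42):
    """Keep `frac` of the rows within each combination of `group_cols`."""
    return (
        df.groupby(group_cols, group_keys=False)
          .sample(frac=frac, random_state=seed)
    )

# Toy data; the column names are assumptions, not the notebook's actual schema.
jobs = pd.DataFrame({
    "Country": ["US"] * 4 + ["FR"] * 4,
    "Company Size": ["Large"] * 8,
    "Job Title": ["Data Analyst"] * 8,
    "Salary": [50, 55, 60, 65, 40, 45, 50, 55],
})

sampled = stratified_sample(jobs, ["Country", "Company Size", "Job Title"])
print(sampled["Country"].value_counts().to_dict())
```

Sampling inside each group, rather than over the whole table, keeps the relative share of each country, size, and title combination roughly unchanged. Very small groups may end up with no rows at `frac=0.5`, so it is worth checking group sizes first.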
Key transformations include:
- Experience Normalization: Convert experience ranges to numeric values
  - Categorical mapping (Internship=0, Entry level=1, Associate=3, etc.)
  - Parse ranges like "4 to 15 Years" → average value
- Salary Normalization: Extract midpoint from salary ranges
  - Handle various formats (e.g., "50k-70k", "$50000-$70000")
- Company Size Grouping: Categorize companies into size tiers
  - Small, Medium, Large, Enterprise
- Categorical Encoding: Convert text features to numeric codes
  - Work Type, Country, Qualifications
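The experience and salary normalizations could look roughly like the sketch below. The function names `parse_experience` and `parse_salary_midpoint` are hypothetical, and `LEVEL_MAP` contains only the three level values listed above; the notebook's full mapping and exact parsing rules are not shown in this README.

```python
import re

# Only these three values are given in the README; the rest of the mapping is unknown.
LEVEL_MAP = {"internship": 0, "entry level": 1, "associate": 3}

def parse_experience(text):
    """Return experience as a number, or None if the text cannot be parsed."""
    key = text.strip().lower()
    if key in LEVEL_MAP:
        return float(LEVEL_MAP[key])
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers:
        return None
    bounds = numbers[:2]  # "4 to 15 Years" -> [4.0, 15.0]
    return sum(bounds) / len(bounds)

def parse_salary_midpoint(text):
    """Return the midpoint of a salary range such as '50k-70k' or '$50000-$70000'."""
    matches = re.findall(r"(\d+(?:\.\d+)?)\s*([kK]?)", text.replace(",", ""))
    values = [float(num) * (1000 if suffix else 1) for num, suffix in matches]
    if not values:
        return None
    return (values[0] + values[-1]) / 2

print(parse_experience("4 to 15 Years"))       # 9.5
print(parse_salary_midpoint("50k-70k"))        # 60000.0
print(parse_salary_midpoint("$50000-$70000"))  # 60000.0
```

Returning `None` for unparseable strings keeps bad rows visible, so they can be handled together with the other missing values instead of silently becoming zeros.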
Feature Relationships:
- Correlation heatmaps for numeric features
- Experience vs Salary scatter plots with trend lines
- Average Salary vs Experience by Company Size
- Jittered scatter plots for better visualization of overlapping points
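As a rough illustration of the numbers behind these plots, the sketch below computes a correlation coefficient, fits a linear trend line, and jitters the x-values. The data is synthetic and the `jitter` helper is hypothetical; the notebook's own plotting code may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration only (not the project's dataset).
experience = rng.integers(0, 16, size=200).astype(float)
salary = 40_000 + 3_000 * experience + rng.normal(0, 5_000, size=200)

# Pearson correlation between the two features (one cell of a correlation heatmap).
r = np.corrcoef(experience, salary)[0, 1]

# Least-squares trend line: salary ~ slope * experience + intercept.
slope, intercept = np.polyfit(experience, salary, deg=1)

def jitter(values, width=0.3, rng=rng):
    """Add small uniform noise so overlapping points spread out when plotted."""
    return values + rng.uniform(-width, width, size=len(values))

experience_jittered = jitter(experience)
print(f"r={r:.2f}, slope={slope:.0f}, intercept={intercept:.0f}")
```

Experience is recorded in whole years, so many points share the same x-value. Plotting `experience_jittered` against `salary` (for example with `plt.scatter`) separates them visually, while the trend line is still fitted on the original, unjittered values.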
Compared three regression models for salary prediction:
| Model | MAE ($) | RMSE ($) | MAE Improvement vs. Baseline |
|---|---|---|---|
| Linear Regression | 5,420 | 7,980 | Baseline |
| Random Forest | 4,120 | 5,980 | 24.0% |
| XGBoost | 3,560 | 5,210 | 34.3% |
Key Findings:
- XGBoost provides the best performance
- MAE reduced by $1,860 (34.3%) compared to Linear Regression
- RMSE reduced by $2,770 (34.7%) compared to Linear Regression
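The improvement percentages are relative reductions against the Linear Regression baseline. The sketch below recomputes them from the table values and shows how MAE and RMSE are defined; the helper names are illustrative and not taken from the notebook.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def improvement_pct(baseline, value):
    """Relative reduction versus the baseline, in percent (one decimal place)."""
    return round(100 * (baseline - value) / baseline, 1)

# (MAE, RMSE) in dollars, copied from the table above.
results = {
    "Linear Regression": (5420, 7980),
    "Random Forest": (4120, 5980),
    "XGBoost": (3560, 5210),
}

base_mae, base_rmse = results["Linear Regression"]
for name, (model_mae, model_rmse) in results.items():
    print(name, improvement_pct(base_mae, model_mae), improvement_pct(base_rmse, model_rmse))
```

Running this reproduces the 24.0% and 34.3% MAE improvements and the 34.7% RMSE improvement for XGBoost.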
- Applied K-Means clustering to identify job market segments
- Visualized clusters in 2D space
- Calculated centroids and cluster assignments
- Demonstrated iterative convergence process
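The iterative convergence of K-Means can be sketched in a few lines of NumPy. This is an illustrative implementation on synthetic (experience, salary in thousands) points; the README does not say whether the notebook uses its own loop or a library class such as scikit-learn's `KMeans`, and the `kmeans` function below is hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels, iterations used)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for iteration in range(1, n_iter + 1):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, labels, iteration

# Synthetic data: two job segments, for illustration only.
rng = np.random.default_rng(1)
junior = rng.normal([2, 50], [1, 5], size=(50, 2))
senior = rng.normal([12, 110], [1, 5], size=(50, 2))
points = np.vstack([junior, senior])

centroids, labels, n_steps = kmeans(points, k=2)
print(f"converged after {n_steps} iterations")
print(centroids.round(1))
```

On real data the features usually sit on very different scales (years versus dollars), so standardizing them before clustering is typically advisable; otherwise the salary axis dominates the distance.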
- Implemented strategic sampling (50% per category)
- Handled missing values systematically
- Applied outlier treatment for robust analysis
- Experience and Salary: Positive correlation confirmed
- Company Size Impact: Larger companies tend to offer higher salaries
- Geographic Variations: Salary differences across countries
- Top Skills: Python, SQL, Machine Learning, Communication
- Common Qualifications: Bachelor's Degree, Master's Degree
- Skill Combinations: Certain skill pairs command premium compensation
- Entry-level positions dominate job postings
- Remote/hybrid work types increasing in availability
- Tech and data-related roles show strong growth
- For Job Seekers:
  - Benchmark salary expectations
  - Identify high-demand skills to develop
  - Understand experience vs compensation relationship
- For Employers:
  - Competitive salary benchmarking
  - Optimize job descriptions and requirements
  - Identify talent availability trends
- For Recruiters:
  - Market intelligence for client advisory
  - Skill gap analysis
  - Regional market comparisons
The notebook includes various visualizations:
- Bar charts for categorical distributions
- Correlation heatmaps
- Scatter plots with regression lines
- Line plots for trend analysis
- Clustering visualizations
- Model performance comparisons
The analysis uses a job descriptions dataset with the following key fields:
- Job Title
- Company
- Location (Country)
- Salary Range
- Experience Required
- Work Type
- Skills
- Qualifications
- Company Size
- Aram Elheni
- Youssef Jaziri
- Chaima Ben Yedder
- Zied Knani
- Dataset source: Kaggle Job Description Dataset
- Python data science community
- Open-source library maintainers