AI-Powered Tennis Match Prediction Platform
Sapiens-bet is an advanced tennis match prediction platform that leverages historical data and machine learning to forecast match outcomes. By analyzing thousands of past matches, player statistics, head-to-head records, and surface-specific performance metrics, the platform provides data-driven predictions for upcoming tennis matches.
The system combines powerful web scraping, sophisticated statistical analysis, and state-of-the-art machine learning models to deliver accurate predictions accessible through a modern, intuitive web interface.
- 🎯 Data-Driven Predictions: Based on comprehensive historical match data from ATP and WTA tours
- 📊 Advanced Analytics: Deep statistical insights including ELO ratings, surface-specific performance, and head-to-head records
- 🤖 Machine Learning: XGBoost classifier trained on years of match data with rigorous evaluation metrics
- 💰 Betting Strategy Simulation: Kelly Criterion and constant fraction strategies to evaluate betting performance
- 🌐 Modern Web Platform: Beautiful, responsive interface built with React and Mantine UI
- 📈 Real-time Updates: Automated daily predictions for upcoming matches
- Automated Web Scraping: Collects tournament, match, and player data from specialized tennis websites
- Historical Data: Access to matches dating back to 1990 for both ATP and WTA tours
- Live Rankings: Regular updates of official ATP and WTA rankings
- Detailed Match Statistics: Score, surface, odds, player rankings, and more
- Machine Learning Models: XGBoost classifier optimized for binary match outcome prediction
- Feature Engineering: More than 100 computed features per match, including:
- ELO ratings (overall and surface-specific)
- Recent form metrics (win rates, games won/conceded)
- Head-to-head statistics
- Player age, ranking, and ranking points
- Match activity patterns (matches played per period)
- Model Evaluation: Comprehensive metrics (ROC-AUC, log loss, Brier score, calibration curves)
- MLflow Integration: Experiment tracking and model versioning
- Model Performance Metrics: Real-time visualization of prediction accuracy
- ROC Curves & Calibration Plots: Deep dive into model reliability
- Betting Simulation: Historical performance analysis with various betting strategies
- Player Statistics: Detailed career stats and performance trends
- Head-to-Head Records: Historical matchup analysis between players
- Player Browser: Searchable database of tennis players with detailed profiles
- Tournament Tracking: Browse tournaments by year and surface
- Predictions Page: View upcoming match predictions with confidence scores and odds comparison
- Rankings: Current ATP and WTA rankings with historical tracking
- Multilingual Support: Available in English, French, Spanish, and Portuguese
- Responsive Design: Optimized for desktop and mobile devices
The platform starts by scraping historical tennis data from specialized websites:
Tournaments → Matches → Players → Rankings → Historical Database
- Asynchronous scraping for efficient data collection
- Smart caching to avoid redundant requests
- Error handling and retry logic for robust data acquisition
- Database storage with PostgreSQL for reliable persistence
For each match, the system computes comprehensive statistics:
- Player Form Metrics: Win rates, games won/conceded over multiple time windows (last 5, 10, 20, 50 matches)
- ELO Rating System: Dynamic ratings that update after each match, including surface-specific ELO
- Opponent Quality: Average ELO and ranking of opponents faced in recent matches
- Surface Performance: Specialized stats for Clay, Hard, Grass, and Carpet courts
- Head-to-Head History: Direct matchup records between players
- Activity Patterns: Match frequency over different time periods
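As an illustration of the ELO feature, here is a minimal sketch of a standard ELO update tracked both overall and per surface. The K-factor of 32, the starting rating of 1500, and the function names are assumptions made for the example; the project's actual parameters may differ.

```python
from collections import defaultdict

K = 32          # assumed update step, not taken from the project
START = 1500.0  # assumed initial rating for unseen players


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the standard ELO formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner: str, loser: str, k: float = K) -> None:
    """Move rating points from the loser to the winner, in place."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


overall = defaultdict(lambda: START)
by_surface = defaultdict(lambda: defaultdict(lambda: START))


def record_match(winner: str, loser: str, surface: str) -> None:
    """Update the overall rating and the rating for the match surface."""
    update_elo(overall, winner, loser)
    update_elo(by_surface[surface], winner, loser)


record_match("Player A", "Player B", "Clay")
print(overall["Player A"], by_surface["Clay"]["Player B"])  # 1516.0 1484.0
```

To avoid leaking the result into the features, the ratings fed to the model for a given match should be the values from before that match is recorded.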
The prediction engine uses XGBoost, a gradient boosting algorithm known for its:
- High Accuracy: Strong results on tabular data (the current production model reaches roughly 68% accuracy)
- Feature Importance: Identifies which factors most influence match outcomes
- Probability Calibration: Provides well-calibrated win probabilities, not just binary predictions
- Scalability: Handles hundreds of features efficiently
The model is trained on:
- Historical matches with complete statistics
- Split-date validation to prevent data leakage
- Multiple evaluation metrics to ensure robustness
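Split-date validation means training only on matches played before a cutoff and evaluating on matches played on or after it, so the model never sees the future. A minimal sketch, assuming each match is a dictionary with a `date` field (the record layout and the `split_by_date` helper are hypothetical):

```python
from datetime import date


def split_by_date(matches, split_date):
    """Split chronologically: train strictly before split_date, validate on or after it."""
    train = [m for m in matches if m["date"] < split_date]
    valid = [m for m in matches if m["date"] >= split_date]
    return train, valid


# Hypothetical records; the real schema lives in the project's database models.
matches = [
    {"date": date(2024, 5, 30), "winner": "A", "loser": "B"},
    {"date": date(2024, 6, 1), "winner": "C", "loser": "A"},
    {"date": date(2024, 6, 15), "winner": "B", "loser": "C"},
]
train, valid = split_by_date(matches, date(2024, 6, 1))
print(len(train), len(valid))  # 1 2
```

A random shuffle would mix later matches into the training set, which inflates validation scores because features such as ELO and recent form depend on match order.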
Daily automated workflow:
Fetch Upcoming Matches → Compute Features → Generate Predictions → Store in Database → Display on Web
- Scheduled jobs run automatically to generate predictions
- Real-time odds integrated from betting markets
- Expected value calculations for betting strategy analysis
- Confidence scores to assess prediction reliability
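The expected value of a bet can be sketched as below, assuming decimal odds and a unit stake. The function names are illustrative and not taken from the codebase.

```python
def implied_probability(decimal_odds: float) -> float:
    """Win probability implied by the odds, ignoring the bookmaker's margin."""
    return 1.0 / decimal_odds


def expected_value(win_probability: float, decimal_odds: float) -> float:
    """Expected profit per unit staked.

    A win returns (decimal_odds - 1) with probability p and a loss costs 1
    with probability (1 - p), which simplifies to p * decimal_odds - 1.
    """
    return win_probability * decimal_odds - 1.0


# A bet has positive expected value only when the model's probability
# exceeds the probability implied by the odds.
p_model, odds = 0.60, 2.0
print(p_model > implied_probability(odds), expected_value(p_model, odds) > 0)  # True True
```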
Users interact with predictions through a modern React application:
- FastAPI Backend: High-performance REST API with automatic OpenAPI documentation
- React Frontend: Built with TypeScript and Mantine UI
- Data Fetching: TanStack Query (React Query) for efficient data fetching and caching
- Authentication: SuperTokens integration for user management
- Responsive Charts: Chart.js visualizations for model metrics and betting simulations
- Language: Python 3.13+
- Framework: FastAPI for REST API
- Database: PostgreSQL with SQLModel ORM
- ML Framework: XGBoost, scikit-learn
- Data Processing: Pandas, NumPy
- Web Scraping: aiohttp, BeautifulSoup, lxml
- Task Scheduling: APScheduler for automated jobs
- Experiment Tracking: MLflow
- Migrations: Alembic
- Framework: React 19 with TypeScript
- UI Library: Mantine v8 (modern component library)
- Routing: React Router v7
- State Management: TanStack Query (React Query)
- Charts: Chart.js with react-chartjs-2
- Authentication: SuperTokens
- Internationalization: i18next
- Build Tool: Vite
- Package Management: uv (ultra-fast Python package manager)
- Containerization: Docker with multi-stage builds
- Reverse Proxy: Nginx
- Documentation: Docusaurus (multilingual docs)
┌─────────────────────────────────────────────────────────────┐
│                    Web Interface (React)                    │
│   Players | Tournaments | Rankings | Predictions | Stats    │
└──────────────────────────┬──────────────────────────────────┘
                           │ REST API
┌──────────────────────────▼──────────────────────────────────┐
│                       FastAPI Backend                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   Routers    │  │     Jobs     │  │     Auth     │       │
│  │ (Endpoints)  │  │ (Scheduler)  │  │ (SuperTokens)│       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└──────────────┬───────────────────┬──────────────────────────┘
               │                   │
┌──────────────▼──────┐   ┌────────▼────────────────────────┐
│   PostgreSQL DB     │   │     ML Pipeline & Scraping      │
│                     │   │  ┌───────────────────────────┐  │
│  • Matches          │   │  │       Data Scraping       │  │
│  • Players          │   │  │ (aiohttp + BeautifulSoup) │  │
│  • Tournaments      │   │  └─────────────┬─────────────┘  │
│  • Rankings         │   │                │                │
│  • Predictions      │   │  ┌─────────────▼─────────────┐  │
│  • ELO Ratings      │   │  │    Feature Engineering    │  │
│                     │   │  │    (Stats Computation)    │  │
└─────────────────────┘   │  └─────────────┬─────────────┘  │
                          │                │                │
                          │  ┌─────────────▼─────────────┐  │
                          │  │       XGBoost Model       │  │
                          │  │  (Training & Prediction)  │  │
                          │  └─────────────┬─────────────┘  │
                          │                │                │
                          │  ┌─────────────▼─────────────┐  │
                          │  │          MLflow           │  │
                          │  │   (Experiment Tracking)   │  │
                          │  └───────────────────────────┘  │
                          └─────────────────────────────────┘
.
├── backend/ # Python backend
│ ├── tennis_scrapper/
│ │ ├── api/ # FastAPI application
│ │ │ ├── routers/ # API endpoints
│ │ │ └── dto/ # Data transfer objects
│ │ ├── cli/ # Command-line tools
│ │ ├── scrap/ # Web scraping modules
│ │ ├── db/ # Database models & utilities
│ │ ├── ml/ # Machine learning
│ │ │ ├── models/ # ML model wrappers
│ │ │ ├── metrics.py # Evaluation metrics
│ │ │ ├── preprocess_data.py
│ │ │ └── plot.py
│ │ ├── stats/ # Statistics computation
│ │ ├── data/ # Data processing scripts
│ │ ├── train.py # Model training
│ │ ├── predict.py # Prediction generation
│ │ └── workflows.py # Automated workflows
│ ├── conf/ # Configuration files
│ ├── migrations/ # Alembic database migrations
│ └── tests/ # Unit tests
├── frontend/
│ ├── app/ # React application
│ │ ├── src/
│ │ │ ├── components/ # UI components
│ │ │ ├── routes/ # Page components
│ │ │ ├── contexts/ # React contexts
│ │ │ ├── hooks/ # Custom hooks
│ │ │ ├── lib/ # Utilities
│ │ │ └── locales/ # Translations
│ │ └── public/ # Static assets
│ └── client/ # Auto-generated API client
├── documentation/ # Docusaurus documentation
├── docker/ # Docker compose files
└── assets/ # Screenshots & images
- Python 3.13+
- Node.js 18+ and npm
- PostgreSQL 14+
- uv (Python package manager)
- Clone the repository:
    git clone <repository-url>
    cd Zpser
- Install backend dependencies:
    cd backend
    uv sync
- Configure the environment. Create a .env file at the project root with your database credentials:
    DATABASE_URL=postgresql://user:password@localhost:5432/tennis_db
    MLFLOW_TRACKING_URI=http://localhost:5000
- Run database migrations:
    uv run alembic upgrade head
- Scrape initial data (optional; the full historical data can take several hours):
    uv run tennis_scrapper scrap-tournaments --from 2020 --to 2024
    uv run tennis_scrapper scrap-matches --from 2020 --to 2024
- Start the API server:
    cd backend
    fastapi run
  The API will be available at http://localhost:8000
- Install frontend dependencies:
    cd frontend/app
    npm install
- Start the development server:
    npm run dev
  The web application will be available at http://localhost:5173
For production deployment with Docker:
    docker-compose -f docker/docker-compose-dev.yml up -d
This will start:
- PostgreSQL database
- FastAPI backend
- React frontend
- Nginx reverse proxy
- MLflow tracking server
Comprehensive model evaluation metrics with ROC curves, calibration plots, and betting simulation results
For detailed documentation on specific features and CLI commands, see:
- CLI Usage Guide: Complete reference for command-line tools
- Backend README: Backend-specific documentation
- API Documentation: Available at http://localhost:8000/docs when running the server
The platform includes scheduled jobs for:
- Daily Predictions: Generate predictions for upcoming matches
- Rankings Update: Fetch latest ATP/WTA rankings
- Odds Update: Refresh betting odds for active predictions
- Stats Computation: Calculate player statistics for new matches
To train a new prediction model:
    cd backend
    uv run python -m tennis_scrapper.train --base-dir output --split-date 2024-06-01
This will:
- Process historical match data
- Compute features for all matches
- Train an XGBoost model
- Evaluate on validation set
- Log metrics and artifacts to MLflow
- Generate performance plots
To generate predictions for upcoming matches:
    uv run python -m tennis_scrapper.predict
This will:
- Scrape upcoming matches for the next 48 hours
- Compute features based on historical data
- Load the latest model from MLflow
- Generate predictions
- Store results in the database
Run the test suite:
    cd backend
    uv run pytest
Tests cover:
- Web scraping functionality
- Statistics computation
- Database operations
- API endpoints
The current production model achieves:
- ROC-AUC: ~0.72
- Log Loss: ~0.62
- Accuracy: ~68%
- Brier Score: ~0.24
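For reference, the log loss and Brier score can be computed as in the sketch below. It is a plain-Python illustration of the standard definitions, not the project's `metrics.py`; in practice scikit-learn's `log_loss` and `brier_score_loss` would normally be used.

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the outcomes under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Baseline: a model that always predicts 0.5 carries no information.
outcomes = [1, 0, 1, 0]
coin_flip = [0.5] * len(outcomes)
print(brier_score(outcomes, coin_flip))  # 0.25
print(log_loss(outcomes, coin_flip) == math.log(2))  # True
```

A constant 0.5 prediction gives a Brier score of 0.25 and a log loss of ln 2 (about 0.693), which is the baseline against which the figures above should be read. Lower is better for both metrics.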
Performance varies by:
- Tournament level (Grand Slams vs. lower tier events)
- Surface type (some surfaces are more predictable)
- Player ranking (top players are more consistent)
The betting simulation module shows positive expected value when applying Kelly Criterion with appropriate thresholds, though past performance doesn't guarantee future results.
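A minimal sketch of Kelly staking with an edge threshold follows. The formula is the standard Kelly Criterion for decimal odds; the `stake` helper, its `min_edge` threshold, and the `kelly_multiplier` parameter are hypothetical and only illustrate what "appropriate thresholds" could look like.

```python
def kelly_fraction(win_probability: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake: f = (p * b - q) / b, with b = odds - 1 and q = 1 - p."""
    b = decimal_odds - 1.0
    if b <= 0:
        return 0.0
    f = (win_probability * b - (1.0 - win_probability)) / b
    return max(f, 0.0)  # never bet when the edge is negative


def stake(bankroll, win_probability, decimal_odds, min_edge=0.0, kelly_multiplier=1.0):
    """Stake size, skipping bets whose expected value per unit is at or below min_edge."""
    edge = win_probability * decimal_odds - 1.0
    if edge <= min_edge:
        return 0.0
    return bankroll * kelly_multiplier * kelly_fraction(win_probability, decimal_odds)


# Half-Kelly on a 60% prediction at even money (decimal odds 2.0):
# full Kelly suggests 20% of the bankroll, half-Kelly about 10%.
print(round(stake(1000.0, 0.60, 2.0, kelly_multiplier=0.5), 2))  # 100.0
```

Scaling the Kelly fraction down is a common way to reduce variance when the model's probabilities are only approximately calibrated. A constant fraction strategy would instead stake the same share of the bankroll on every qualifying bet.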
- FastAPI: High-performance Python web framework with automatic API documentation
- XGBoost: Widely used gradient boosting library that performs well on tabular data
- React + Mantine: Modern, accessible UI components with a good developer experience
- PostgreSQL: Robust, mature relational database
- uv: Fast Python package management (10-100x faster than pip, according to its maintainers)
- MLflow: Widely used tool for ML experiment tracking
- Docker: Consistent deployment across environments
- Type Safety: TypeScript for frontend, type hints in Python backend
- Linting: ESLint for TypeScript, Ruff for Python
- Testing: pytest for backend, comprehensive test coverage
- Documentation: Inline documentation and Docusaurus for user guides
- Git Hooks: Pre-commit checks for code quality
Contributions are welcome! Please feel free to submit issues or pull requests.
MIT License - see LICENSE file for details
Pierre Carceller Meunier
- Tennis data sourced from publicly available tennis statistics websites
- Built with open-source technologies and libraries
- Inspired by the sports analytics and data science communities
Made with ❤️ and lots of ☕
⭐ Star this repo if you find it useful!


