Sapiens-bet

AI-Powered Tennis Match Prediction Platform

Overview • Features • How It Works • Architecture • Getting Started • Screenshots

🎾 Overview

Sapiens-bet is an advanced tennis match prediction platform that leverages historical data and machine learning to forecast match outcomes. By analyzing thousands of past matches, player statistics, head-to-head records, and surface-specific performance metrics, the platform provides data-driven predictions for upcoming tennis matches.

The system combines asynchronous web scraping, statistical feature engineering, and an XGBoost model, and serves the resulting predictions through a React web interface.

What Makes It Special?

🎯 Data-Driven Predictions: Based on comprehensive historical match data from ATP and WTA tours
📊 Advanced Analytics: Deep statistical insights including ELO ratings, surface-specific performance, and head-to-head records
🤖 Machine Learning: XGBoost classifier trained on years of match data with rigorous evaluation metrics
💰 Betting Strategy Simulation: Kelly Criterion and constant fraction strategies to evaluate betting performance
🌐 Modern Web Platform: Beautiful, responsive interface built with React and Mantine UI
📈 Daily Updates: Automated daily predictions for upcoming matches

✨ Key Features

🔍 Comprehensive Data Collection

Automated Web Scraping: Collects tournament, match, and player data from specialized tennis websites
Historical Data: Access to matches dating back to 1990 for both ATP and WTA tours
Live Rankings: Regular updates of official ATP and WTA rankings
Detailed Match Statistics: Score, surface, odds, player rankings, and more

🧠 Intelligent Predictions

Machine Learning Models: XGBoost classifier optimized for binary match outcome prediction
Feature Engineering: More than 100 computed features per match, including:
- ELO ratings (overall and surface-specific)
- Recent form metrics (win rates, games won/conceded)
- Head-to-head statistics
- Player age, ranking, and ranking points
- Match activity patterns (matches played per period)
Model Evaluation: Comprehensive metrics (ROC-AUC, log loss, Brier score, calibration curves)
MLflow Integration: Experiment tracking and model versioning
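
As an illustration of the "recent form" features listed above, here is a minimal sketch of a rolling win rate. The function name, the feature names, and the input format (a chronological list of win/loss booleans) are assumptions made for this example, not the project's actual code.

```python
from collections import deque


def rolling_win_rates(results, windows=(5, 10, 20, 50)):
    """Win rate over the previous N matches, for each match in `results`.

    `results` is a chronological list of booleans (True = win). Only matches
    played before the current one are used, so a feature never includes the
    outcome it is meant to help predict.
    """
    history = deque(maxlen=max(windows))
    features = []
    for won in results:
        past = list(history)
        row = {}
        for n in windows:
            recent = past[-n:]
            # None marks "no history yet" (for example a player's first match).
            row[f"win_rate_last_{n}"] = sum(recent) / len(recent) if recent else None
        features.append(row)
        history.append(won)
    return features


rows = rolling_win_rates([True, True, False, True], windows=(2, 3))
print(rows[3])  # form going into the fourth match
```

The same pattern extends to games won or conceded by replacing the booleans with per-match counts.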

📊 Advanced Analytics Dashboard

Model Performance Metrics: Real-time visualization of prediction accuracy
ROC Curves & Calibration Plots: Deep dive into model reliability
Betting Simulation: Historical performance analysis with various betting strategies
Player Statistics: Detailed career stats and performance trends
Head-to-Head Records: Historical matchup analysis between players

🌐 Modern Web Platform

Player Browser: Searchable database of tennis players with detailed profiles
Tournament Tracking: Browse tournaments by year and surface
Predictions Page: View upcoming match predictions with confidence scores and odds comparison
Rankings: Current ATP and WTA rankings with historical tracking
Multilingual Support: Available in English, French, Spanish, and Portuguese
Responsive Design: Optimized for desktop and mobile devices

🔬 How It Works

1. Data Collection Pipeline

The platform starts by scraping historical tennis data from specialized websites:

Tournaments → Matches → Players → Rankings → Historical Database

Asynchronous scraping for efficient data collection
Smart caching to avoid redundant requests
Error handling and retry logic for robust data acquisition
Database storage with PostgreSQL for reliable persistence
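
The caching, retry, and concurrency ideas above can be sketched with the standard library alone. This is not the project's scraper: the real code uses aiohttp, whereas here the HTTP call is an injected coroutine (`fetch`) so the example runs without network access. The function names, the in-memory cache, and the exponential backoff policy are all assumptions.

```python
import asyncio


async def fetch_with_retry(url, fetch, cache, retries=3, base_delay=0.5):
    """Fetch `url` through the injected coroutine `fetch`, with cache and retries."""
    if url in cache:  # cache hit: no request is made
        return cache[url]
    for attempt in range(retries):
        try:
            cache[url] = await fetch(url)
            return cache[url]
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** attempt)


async def scrape_all(urls, fetch, max_concurrency=5, **retry_options):
    """Fetch all URLs concurrently, with at most `max_concurrency` in flight."""
    cache = {}
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await fetch_with_retry(url, fetch, cache, **retry_options)

    return await asyncio.gather(*(bounded(url) for url in urls))


# Demo with a fake fetcher whose first call fails.
calls = {"n": 0}


async def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated network failure")
    return f"<html>{url}</html>"


pages = asyncio.run(scrape_all(["a", "b"], flaky_fetch, base_delay=0))
print(pages)
```

Injecting `fetch` keeps the retry logic testable; with aiohttp, the exception caught would need to match that library's error types instead of `ConnectionError`.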

2. Feature Engineering & Statistics

For each match, the system computes comprehensive statistics:

Player Form Metrics: Win rates, games won/conceded over multiple time windows (last 5, 10, 20, 50 matches)
ELO Rating System: Dynamic ratings that update after each match, including surface-specific ELO
Opponent Quality: Average ELO and ranking of opponents faced in recent matches
Surface Performance: Specialized stats for Clay, Hard, Grass, and Carpet courts
Head-to-Head History: Direct matchup records between players
Activity Patterns: Match frequency over different time periods
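
To make the ELO item concrete, here is a minimal sketch of the standard Elo update with a separate surface-specific table. The K-factor of 32, the starting rating of 1500, and every name below are assumptions for illustration; the project's actual parameters are not stated in this README.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(winner_rating, loser_rating, k=32.0):
    """Return the (winner, loser) ratings after one match."""
    delta = k * (1.0 - expected_score(winner_rating, loser_rating))
    return winner_rating + delta, loser_rating - delta


# Hypothetical rating tables: one overall, one keyed by (player, surface).
overall = defaultdict(lambda: 1500.0)
by_surface = defaultdict(lambda: 1500.0)


def record_match(winner, loser, surface):
    """Update both the overall and the surface-specific ratings."""
    overall[winner], overall[loser] = update_elo(overall[winner], overall[loser])
    w_key, l_key = (winner, surface), (loser, surface)
    by_surface[w_key], by_surface[l_key] = update_elo(by_surface[w_key], by_surface[l_key])


record_match("Player A", "Player B", "Clay")
print(overall["Player A"], overall["Player B"])  # 1516.0 1484.0
```

Ratings used as features for a match should be the values from before that match is recorded, for the same leakage reason that applies to form metrics.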

3. Machine Learning Model

The prediction engine uses XGBoost, a gradient boosting algorithm known for its:

High Accuracy: Strong performance on tabular data (the current model reaches about 68% accuracy; see Model Performance)
Feature Importance: Identifies which factors most influence match outcomes
Probability Outputs: Returns win probabilities rather than just binary predictions; their calibration is checked with calibration curves
Scalability: Handles hundreds of features efficiently

The model is trained on:

Historical matches with complete statistics
Split-date validation to prevent data leakage
Multiple evaluation metrics to ensure robustness
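
Split-date validation means every training match is older than every validation match. A random split would let the model train on matches played after the ones it is evaluated on. A minimal sketch follows, assuming each match is a dict with a `date` field; the record layout and function name are hypothetical.

```python
from datetime import date


def split_by_date(matches, split_date):
    """Train on matches strictly before `split_date`, validate on the rest."""
    train = [m for m in matches if m["date"] < split_date]
    valid = [m for m in matches if m["date"] >= split_date]
    return train, valid


matches = [
    {"id": 1, "date": date(2024, 5, 30)},
    {"id": 2, "date": date(2024, 6, 1)},
    {"id": 3, "date": date(2024, 6, 15)},
]
train, valid = split_by_date(matches, date(2024, 6, 1))
print([m["id"] for m in train], [m["id"] for m in valid])  # [1] [2, 3]
```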

4. Prediction Generation

Daily automated workflow:

Fetch Upcoming Matches → Compute Features → Generate Predictions → Store in Database → Display on Web

Scheduled jobs run automatically to generate predictions
Real-time odds integrated from betting markets
Expected value calculations for betting strategy analysis
Confidence scores to assess prediction reliability
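
The expected value calculation mentioned above can be sketched as follows, assuming decimal (European) odds and a stake of one unit. The function names are illustrative and not taken from the codebase.

```python
def implied_probability(decimal_odds):
    """Win probability implied by the odds (ignores the bookmaker margin)."""
    return 1.0 / decimal_odds


def expected_value(win_prob, decimal_odds):
    """Expected profit per unit staked.

    A win returns `decimal_odds - 1` in profit and a loss costs the stake:
    EV = p * (odds - 1) - (1 - p) = p * odds - 1.
    """
    return win_prob * decimal_odds - 1.0


# The model gives 60% to a player priced at 2.0 (implied probability 50%).
ev = expected_value(0.60, 2.0)
print(f"EV per unit staked: {ev:+.2f}")
```

A bet has positive expected value only when the model's probability exceeds the implied probability, which is why model calibration matters for this analysis.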

5. Web Interface

Users interact with predictions through a modern React application:

FastAPI Backend: High-performance REST API with automatic OpenAPI documentation
React Frontend: Built with TypeScript and Mantine UI
Data Fetching: TanStack Query (React Query) for efficient data fetching and caching
Authentication: SuperTokens integration for user management
Responsive Charts: Chart.js visualizations for model metrics and betting simulations

🏗️ Architecture

Technology Stack

Backend

Language: Python 3.13+
Framework: FastAPI for REST API
Database: PostgreSQL with SQLModel ORM
ML Framework: XGBoost, scikit-learn
Data Processing: Pandas, NumPy
Web Scraping: aiohttp, BeautifulSoup, lxml
Task Scheduling: APScheduler for automated jobs
Experiment Tracking: MLflow
Migrations: Alembic

Frontend

Framework: React 19 with TypeScript
UI Library: Mantine v8 (modern component library)
Routing: React Router v7
State Management: TanStack Query (React Query)
Charts: Chart.js with react-chartjs-2
Authentication: SuperTokens
Internationalization: i18next
Build Tool: Vite

Infrastructure

Package Management: uv (ultra-fast Python package manager)
Containerization: Docker with multi-stage builds
Reverse Proxy: Nginx
Documentation: Docusaurus (multilingual docs)

System Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Web Interface (React)                    │
│    Players | Tournaments | Rankings | Predictions | Stats    │
└──────────────────────────┬──────────────────────────────────┘
                           │ REST API
┌──────────────────────────▼──────────────────────────────────┐
│                    FastAPI Backend                           │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   Routers    │  │     Jobs     │  │    Auth      │      │
│  │  (Endpoints) │  │  (Scheduler) │  │ (SuperTokens)│      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└──────────────┬───────────────────┬───────────────────────────┘
               │                   │
┌──────────────▼──────┐   ┌────────▼────────────────────────┐
│   PostgreSQL DB     │   │    ML Pipeline & Scraping       │
│                     │   │  ┌─────────────────────────┐    │
│  • Matches          │   │  │  Data Scraping          │    │
│  • Players          │   │  │  (aiohttp + BeautifulSoup)│  │
│  • Tournaments      │   │  └───────────┬─────────────┘    │
│  • Rankings         │   │              │                   │
│  • Predictions      │   │  ┌───────────▼─────────────┐    │
│  • ELO Ratings      │   │  │  Feature Engineering    │    │
│                     │   │  │  (Stats Computation)    │    │
└─────────────────────┘   │  └───────────┬─────────────┘    │
                          │              │                   │
                          │  ┌───────────▼─────────────┐    │
                          │  │  XGBoost Model          │    │
                          │  │  (Training & Prediction)│    │
                          │  └───────────┬─────────────┘    │
                          │              │                   │
                          │  ┌───────────▼─────────────┐    │
                          │  │  MLflow                 │    │
                          │  │  (Experiment Tracking)  │    │
                          │  └─────────────────────────┘    │
                          └──────────────────────────────────┘

Project Structure

.
├── backend/                      # Python backend
│   ├── tennis_scrapper/
│   │   ├── api/                 # FastAPI application
│   │   │   ├── routers/         # API endpoints
│   │   │   └── dto/             # Data transfer objects
│   │   ├── cli/                 # Command-line tools
│   │   ├── scrap/               # Web scraping modules
│   │   ├── db/                  # Database models & utilities
│   │   ├── ml/                  # Machine learning
│   │   │   ├── models/          # ML model wrappers
│   │   │   ├── metrics.py       # Evaluation metrics
│   │   │   ├── preprocess_data.py
│   │   │   └── plot.py
│   │   ├── stats/               # Statistics computation
│   │   ├── data/                # Data processing scripts
│   │   ├── train.py             # Model training
│   │   ├── predict.py           # Prediction generation
│   │   └── workflows.py         # Automated workflows
│   ├── conf/                    # Configuration files
│   ├── migrations/              # Alembic database migrations
│   └── tests/                   # Unit tests
├── frontend/
│   ├── app/                     # React application
│   │   ├── src/
│   │   │   ├── components/      # UI components
│   │   │   ├── routes/          # Page components
│   │   │   ├── contexts/        # React contexts
│   │   │   ├── hooks/           # Custom hooks
│   │   │   ├── lib/             # Utilities
│   │   │   └── locales/         # Translations
│   │   └── public/              # Static assets
│   └── client/                  # Auto-generated API client
├── documentation/               # Docusaurus documentation
├── docker/                      # Docker compose files
└── assets/                      # Screenshots & images

🚀 Getting Started

Prerequisites

Python 3.13+
Node.js 18+ and npm
PostgreSQL 14+
uv (Python package manager)

Backend Setup

Clone the repository
```
git clone <repository-url>
cd Sapiens-bet
```
Install backend dependencies
```
cd backend
uv sync
```

Configure environment

Create a .env file at the project root with your database credentials:

DATABASE_URL=postgresql://user:password@localhost:5432/tennis_db
MLFLOW_TRACKING_URI=http://localhost:5000

Run database migrations
```
uv run alembic upgrade head
```

Scrape initial data (optional, can take several hours for full historical data)

uv run tennis_scrapper scrap-tournaments --from 2020 --to 2024
uv run tennis_scrapper scrap-matches --from 2020 --to 2024

Start the API server
```
cd backend
fastapi run
```
The API will be available at http://localhost:8000

Frontend Setup

Install frontend dependencies
```
cd frontend/app
npm install
```
Start the development server
```
npm run dev
```
The web application will be available at http://localhost:5173

Docker Deployment

To run the full stack with Docker (the command below uses the development compose file):

docker-compose -f docker/docker-compose-dev.yml up -d

This will start:

PostgreSQL database
FastAPI backend
React frontend
Nginx reverse proxy
MLflow tracking server

📸 Screenshots

Player Statistics

Detailed player statistics with career performance metrics and recent form

Match Predictions

Browse through the comprehensive player database with search and filters

Head-to-Head Analysis

Historical head-to-head matchup analysis between players

Model Performance Metrics

Comprehensive model evaluation metrics with ROC curves, calibration plots, and betting simulation results

📚 Documentation

For detailed documentation on specific features and CLI commands, see:

CLI Usage Guide: Complete reference for command-line tools
Backend README: Backend-specific documentation
API Documentation: Available at http://localhost:8000/docs when running the server

🔄 Development Workflow

Daily Automated Tasks

The platform includes scheduled jobs for:

Daily Predictions: Generate predictions for upcoming matches
Rankings Update: Fetch latest ATP/WTA rankings
Odds Update: Refresh betting odds for active predictions
Stats Computation: Calculate player statistics for new matches

Training New Models

To train a new prediction model:

cd backend
uv run python -m tennis_scrapper.train --base-dir output --split-date 2024-06-01

This will:

Process historical match data
Compute features for all matches
Train an XGBoost model
Evaluate on validation set
Log metrics and artifacts to MLflow
Generate performance plots

Generating Predictions

To generate predictions for upcoming matches:

uv run python -m tennis_scrapper.predict

This will:

Scrape upcoming matches for the next 48 hours
Compute features based on historical data
Load the latest model from MLflow
Generate predictions
Store results in the database

🧪 Testing

Run the test suite:

cd backend
uv run pytest

Tests cover:

Web scraping functionality
Statistics computation
Database operations
API endpoints

🎯 Model Performance

The current production model achieves:

ROC-AUC: ~0.72
Log Loss: ~0.62
Accuracy: ~68%
Brier Score: ~0.24
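
For readers unfamiliar with these metrics, here is a dependency-free sketch of how they are defined. The project lists scikit-learn in its stack, so it presumably uses library implementations; the functions below are written only for illustration.

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def roc_auc(y_true, y_prob):
    """Chance that a random positive is ranked above a random negative.

    Ties count as half. This pairwise form is O(n^2) and only for illustration.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Baseline: a model that always predicts 50%.
outcomes = [1, 0, 1, 0]
coin_flip = [0.5, 0.5, 0.5, 0.5]
print(brier_score(outcomes, coin_flip), log_loss(outcomes, coin_flip), roc_auc(outcomes, coin_flip))
```

A constant 50% prediction scores a Brier score of 0.25, a log loss of ln 2 (about 0.693), and a ROC-AUC of 0.5, which gives a reference point for the figures above.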

Performance varies by:

Tournament level (Grand Slams vs. lower tier events)
Surface type (some surfaces are more predictable)
Player ranking (top players are more consistent)

The betting simulation module shows positive expected value when applying Kelly Criterion with appropriate thresholds, though past performance doesn't guarantee future results.
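
For reference, the Kelly Criterion and a constant-fraction strategy can be compared with a short simulation. This is a sketch under assumptions: decimal odds, bets placed only when the model sees positive expected value, and hypothetical function names. It is not the project's simulation module.

```python
def kelly_fraction(win_prob, decimal_odds):
    """Fraction of bankroll to stake: f = (p * odds - 1) / (odds - 1), floored at 0."""
    edge = win_prob * decimal_odds - 1.0
    if edge <= 0:
        return 0.0  # no positive expected value: do not bet
    return edge / (decimal_odds - 1.0)


def simulate(bets, bankroll=100.0, flat_fraction=None):
    """Replay `bets`, a list of (win_prob, decimal_odds, won) tuples.

    With `flat_fraction` set, stake that constant share of the bankroll on
    every positive-EV bet; otherwise stake the full Kelly fraction.
    """
    for win_prob, odds, won in bets:
        fraction = kelly_fraction(win_prob, odds)
        if flat_fraction is not None and fraction > 0:
            fraction = flat_fraction
        stake = bankroll * fraction
        bankroll += stake * (odds - 1.0) if won else -stake
    return bankroll


history = [(0.6, 2.0, True), (0.6, 2.0, False), (0.4, 2.0, True)]
print(simulate(history), simulate(history, flat_fraction=0.1))
```

Full Kelly staking is sensitive to errors in the estimated probabilities, so simulations often scale the fraction down or apply a minimum edge threshold before betting.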

🛠️ Tech Highlights

Why These Technologies?

FastAPI: High-performance Python web framework with automatic API documentation
XGBoost: Industry-standard for tabular data with excellent performance
React + Mantine: Modern, accessible UI components with excellent developer experience
PostgreSQL: Robust relational database with excellent performance
uv: Lightning-fast Python package management (10-100x faster than pip)
MLflow: Industry-standard for ML experiment tracking
Docker: Consistent deployment across environments

Code Quality

Type Safety: TypeScript for frontend, type hints in Python backend
Linting: ESLint for TypeScript, Ruff for Python
Testing: pytest for backend, comprehensive test coverage
Documentation: Inline documentation and Docusaurus for user guides
Git Hooks: Pre-commit checks for code quality

🤝 Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

📄 License

MIT License - see LICENSE file for details

👨‍💻 Author

Pierre Carceller Meunier

🙏 Acknowledgments

Tennis data sourced from publicly available tennis statistics websites
Built with open-source technologies and libraries
Inspired by the sports analytics and data science communities

Made with ❤️ and lots of ☕

⭐ Star this repo if you find it useful!
