Soum-Soum/Sapiens-bet

Sapiens-bet

AI-Powered Tennis Match Prediction Platform

Overview | Features | How It Works | Architecture | Getting Started | Screenshots


🎾 Overview

Sapiens-bet is an advanced tennis match prediction platform that leverages historical data and machine learning to forecast match outcomes. By analyzing thousands of past matches, player statistics, head-to-head records, and surface-specific performance metrics, the platform provides data-driven predictions for upcoming tennis matches.

The system combines powerful web scraping, sophisticated statistical analysis, and state-of-the-art machine learning models to deliver accurate predictions accessible through a modern, intuitive web interface.

What Makes It Special?

  • 🎯 Data-Driven Predictions: Based on comprehensive historical match data from ATP and WTA tours
  • 📊 Advanced Analytics: Deep statistical insights including ELO ratings, surface-specific performance, and head-to-head records
  • 🤖 Machine Learning: XGBoost classifier trained on years of match data with rigorous evaluation metrics
  • 💰 Betting Strategy Simulation: Kelly Criterion and constant fraction strategies to evaluate betting performance
  • 🌐 Modern Web Platform: Beautiful, responsive interface built with React and Mantine UI
  • 📈 Real-time Updates: Automated daily predictions for upcoming matches

✨ Key Features

🔍 Comprehensive Data Collection

  • Automated Web Scraping: Collects tournament, match, and player data from specialized tennis websites
  • Historical Data: Access to matches dating back to 1990 for both ATP and WTA tours
  • Live Rankings: Regular updates of official ATP and WTA rankings
  • Detailed Match Statistics: Score, surface, odds, player rankings, and more

🧠 Intelligent Predictions

  • Machine Learning Models: XGBoost classifier optimized for binary match outcome prediction
  • Feature Engineering: More than 100 computed features per match, including:
    • ELO ratings (overall and surface-specific)
    • Recent form metrics (win rates, games won/conceded)
    • Head-to-head statistics
    • Player age, ranking, and ranking points
    • Match activity patterns (matches played per period)
  • Model Evaluation: Comprehensive metrics (ROC-AUC, log loss, Brier score, calibration curves)
  • MLflow Integration: Experiment tracking and model versioning

📊 Advanced Analytics Dashboard

  • Model Performance Metrics: Real-time visualization of prediction accuracy
  • ROC Curves & Calibration Plots: Deep dive into model reliability
  • Betting Simulation: Historical performance analysis with various betting strategies
  • Player Statistics: Detailed career stats and performance trends
  • Head-to-Head Records: Historical matchup analysis between players

🌐 Modern Web Platform

  • Player Browser: Searchable database of tennis players with detailed profiles
  • Tournament Tracking: Browse tournaments by year and surface
  • Predictions Page: View upcoming match predictions with confidence scores and odds comparison
  • Rankings: Current ATP and WTA rankings with historical tracking
  • Multilingual Support: Available in English, French, Spanish, and Portuguese
  • Responsive Design: Optimized for desktop and mobile devices

🔬 How It Works

1. Data Collection Pipeline

The platform starts by scraping historical tennis data from specialized websites:

Tournaments → Matches → Players → Rankings → Historical Database
  • Asynchronous scraping for efficient data collection
  • Smart caching to avoid redundant requests
  • Error handling and retry logic for robust data acquisition
  • Database storage with PostgreSQL for reliable persistence
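
The caching and retry behaviour listed above can be sketched as follows. This is a minimal illustration, not the project's scraper: `fetch_with_retry`, `scrape_all`, and their parameters are hypothetical names, and `fetch` stands in for any coroutine that downloads a page (for example a thin wrapper around an aiohttp session).

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_with_retry(
    fetch: Callable[[str], Awaitable[str]],
    url: str,
    cache: dict[str, str],
    retries: int = 3,
    base_delay: float = 0.5,
) -> str:
    """Return the body for `url`, using the cache and retrying on failure."""
    if url in cache:  # caching: skip requests that were already made
        return cache[url]
    for attempt in range(retries):
        try:
            body = await fetch(url)
            cache[url] = body
            return body
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            # exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise ValueError("retries must be at least 1")


async def scrape_all(fetch, urls, cache, concurrency: int = 5):
    """Fetch many URLs concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(url: str) -> str:
        async with semaphore:
            return await fetch_with_retry(fetch, url, cache)

    # gather returns results in the same order as `urls`
    return await asyncio.gather(*(guarded(u) for u in urls))
```

Passing `fetch` in as an argument keeps the retry logic independent of the HTTP library, which also makes it easy to exercise with a fake downloader in tests.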

2. Feature Engineering & Statistics

For each match, the system computes comprehensive statistics:

  • Player Form Metrics: Win rates, games won/conceded over multiple time windows (last 5, 10, 20, 50 matches)
  • ELO Rating System: Dynamic ratings that update after each match, including surface-specific ELO
  • Opponent Quality: Average ELO and ranking of opponents faced in recent matches
  • Surface Performance: Specialized stats for Clay, Hard, Grass, and Carpet courts
  • Head-to-Head History: Direct matchup records between players
  • Activity Patterns: Match frequency over different time periods
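
As an illustration of the ELO rating bullet, the sketch below applies the standard ELO update to both an overall rating and a surface-specific rating. The K-factor, the starting rating, and the `EloTracker` class are assumptions made for this example; the project's actual values and code structure are not documented here.

```python
from collections import defaultdict

K_FACTOR = 32.0         # assumed update step, not taken from the project
START_RATING = 1500.0   # assumed rating for a player with no history


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard ELO probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


class EloTracker:
    """Keeps one overall rating and one rating per surface for each player."""

    def __init__(self) -> None:
        self.overall = defaultdict(lambda: START_RATING)
        self.by_surface = defaultdict(lambda: START_RATING)  # key: (player, surface)

    def update(self, winner: str, loser: str, surface: str) -> None:
        """Apply one match result to the overall and surface-specific tables."""
        self._apply(self.overall, winner, loser)
        self._apply(self.by_surface, (winner, surface), (loser, surface))

    @staticmethod
    def _apply(table, winner_key, loser_key) -> None:
        # The winner gains what the loser drops, scaled by how surprising the win was.
        p_win = expected_score(table[winner_key], table[loser_key])
        delta = K_FACTOR * (1.0 - p_win)
        table[winner_key] += delta
        table[loser_key] -= delta


elo = EloTracker()
elo.update("Player A", "Player B", "Clay")
print(elo.overall["Player A"], elo.by_surface[("Player A", "Clay")])
```

When ratings are used as model features, they should be read before `update` is called for that match, so the feature never contains the result it is meant to predict.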

3. Machine Learning Model

The prediction engine uses XGBoost, a gradient boosting algorithm known for its:

  • High Accuracy: Strong predictive accuracy on tabular data (the Model Performance section lists the current figures)
  • Feature Importance: Identifies which factors most influence match outcomes
  • Probability Calibration: Provides well-calibrated win probabilities, not just binary predictions
  • Scalability: Handles hundreds of features efficiently
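
To make "well-calibrated" concrete: among matches predicted at roughly 70%, about 70% should actually be won. The sketch below bins predictions and compares the mean predicted probability with the observed win rate in each bin. `calibration_bins` is a hypothetical helper written for this example, not the project's plotting code.

```python
def calibration_bins(probs, outcomes, n_bins=5):
    """Return (mean predicted prob, observed win rate, count) per non-empty bin.

    probs    : predicted win probabilities in [0, 1]
    outcomes : 1 if the predicted player won, else 0
    """
    sums = [0.0] * n_bins
    wins = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums[idx] += p
        wins[idx] += y
        counts[idx] += 1
    return [
        (sums[i] / counts[i], wins[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]


for mean_pred, observed, n in calibration_bins(
    [0.15, 0.35, 0.55, 0.75, 0.95], [0, 0, 1, 1, 1]
):
    print(f"predicted={mean_pred:.2f} observed={observed:.2f} n={n}")
```

A calibration plot is these pairs drawn against the diagonal; points far from the diagonal indicate over- or under-confident probabilities.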

The model is trained on:

  • Historical matches with complete statistics
  • Split-date validation to prevent data leakage
  • Multiple evaluation metrics to ensure robustness
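
Split-date validation means every training match is older than every validation match, so the model is never evaluated on matches that precede its training data. A minimal sketch, assuming each match is a dict with a `date` field and that the split date itself belongs to the validation side (both are assumptions of this example):

```python
from datetime import date


def split_by_date(matches, split_date):
    """Train on matches strictly before split_date, validate on the rest."""
    train = [m for m in matches if m["date"] < split_date]
    valid = [m for m in matches if m["date"] >= split_date]
    return train, valid


matches = [
    {"date": date(2024, 5, 30), "winner": "Player A"},
    {"date": date(2024, 6, 1), "winner": "Player B"},
    {"date": date(2024, 6, 15), "winner": "Player A"},
]
train, valid = split_by_date(matches, date(2024, 6, 1))
print(len(train), len(valid))
```

A random split would instead mix future matches into the training set, and rolling features such as ELO or recent form would then leak information about the validation period.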

4. Prediction Generation

Daily automated workflow:

Fetch Upcoming Matches → Compute Features → Generate Predictions → Store in Database → Display on Web
  • Scheduled jobs run automatically to generate predictions
  • Real-time odds integrated from betting markets
  • Expected value calculations for betting strategy analysis
  • Confidence scores to assess prediction reliability
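
The expected value calculation can be illustrated with decimal odds: a one-unit stake returns `odds` units on a win and nothing on a loss. The function names below are hypothetical, and the sketch assumes decimal (European) odds; the odds format the project stores is not stated here.

```python
def implied_probability(decimal_odds: float) -> float:
    """Win probability implied by the bookmaker's decimal odds."""
    return 1.0 / decimal_odds


def expected_value(p_win: float, decimal_odds: float) -> float:
    """Expected profit per unit staked, given the model's win probability."""
    return p_win * decimal_odds - 1.0


model_p = 0.60
odds = 2.00
print(f"implied: {implied_probability(odds):.2f}")
print(f"EV per unit: {expected_value(model_p, odds):+.2f}")
```

A bet has positive expected value only when the model's probability exceeds the implied probability. Because bookmakers build in a margin, the implied probabilities of the two players usually sum to more than 1.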

5. Web Interface

Users interact with predictions through a modern React application:

  • FastAPI Backend: High-performance REST API with automatic OpenAPI documentation
  • React Frontend: Built with TypeScript and Mantine UI
  • Real-time Updates: TanStack Query for efficient data fetching and caching
  • Authentication: SuperTokens integration for user management
  • Responsive Charts: Chart.js visualizations for model metrics and betting simulations

🏗️ Architecture

Technology Stack

Backend

  • Language: Python 3.13+
  • Framework: FastAPI for REST API
  • Database: PostgreSQL with SQLModel ORM
  • ML Framework: XGBoost, scikit-learn
  • Data Processing: Pandas, NumPy
  • Web Scraping: aiohttp, BeautifulSoup, lxml
  • Task Scheduling: APScheduler for automated jobs
  • Experiment Tracking: MLflow
  • Migrations: Alembic

Frontend

  • Framework: React 19 with TypeScript
  • UI Library: Mantine v8 (modern component library)
  • Routing: React Router v7
  • State Management: TanStack Query (React Query)
  • Charts: Chart.js with react-chartjs-2
  • Authentication: SuperTokens
  • Internationalization: i18next
  • Build Tool: Vite

Infrastructure

  • Package Management: uv (ultra-fast Python package manager)
  • Containerization: Docker with multi-stage builds
  • Reverse Proxy: Nginx
  • Documentation: Docusaurus (multilingual docs)

System Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Web Interface (React)                    │
│    Players | Tournaments | Rankings | Predictions | Stats    │
└──────────────────────────┬──────────────────────────────────┘
                           │ REST API
┌──────────────────────────▼──────────────────────────────────┐
│                    FastAPI Backend                           │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   Routers    │  │     Jobs     │  │    Auth      │      │
│  │  (Endpoints) │  │  (Scheduler) │  │ (SuperTokens)│      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└──────────────┬───────────────────┬───────────────────────────┘
               │                   │
┌──────────────▼──────┐   ┌────────▼────────────────────────┐
│   PostgreSQL DB     │   │    ML Pipeline & Scraping       │
│                     │   │  ┌─────────────────────────┐    │
│  • Matches          │   │  │  Data Scraping          │    │
│  • Players          │   │  │  (aiohttp + BeautifulSoup)│  │
│  • Tournaments      │   │  └───────────┬─────────────┘    │
│  • Rankings         │   │              │                   │
│  • Predictions      │   │  ┌───────────▼─────────────┐    │
│  • ELO Ratings      │   │  │  Feature Engineering    │    │
│                     │   │  │  (Stats Computation)    │    │
└─────────────────────┘   │  └───────────┬─────────────┘    │
                          │              │                   │
                          │  ┌───────────▼─────────────┐    │
                          │  │  XGBoost Model          │    │
                          │  │  (Training & Prediction)│    │
                          │  └───────────┬─────────────┘    │
                          │              │                   │
                          │  ┌───────────▼─────────────┐    │
                          │  │  MLflow                 │    │
                          │  │  (Experiment Tracking)  │    │
                          │  └─────────────────────────┘    │
                          └──────────────────────────────────┘

Project Structure

.
├── backend/                      # Python backend
│   ├── tennis_scrapper/
│   │   ├── api/                 # FastAPI application
│   │   │   ├── routers/         # API endpoints
│   │   │   └── dto/             # Data transfer objects
│   │   ├── cli/                 # Command-line tools
│   │   ├── scrap/               # Web scraping modules
│   │   ├── db/                  # Database models & utilities
│   │   ├── ml/                  # Machine learning
│   │   │   ├── models/          # ML model wrappers
│   │   │   ├── metrics.py       # Evaluation metrics
│   │   │   ├── preprocess_data.py
│   │   │   └── plot.py
│   │   ├── stats/               # Statistics computation
│   │   ├── data/                # Data processing scripts
│   │   ├── train.py             # Model training
│   │   ├── predict.py           # Prediction generation
│   │   └── workflows.py         # Automated workflows
│   ├── conf/                    # Configuration files
│   ├── migrations/              # Alembic database migrations
│   └── tests/                   # Unit tests
├── frontend/
│   ├── app/                     # React application
│   │   ├── src/
│   │   │   ├── components/      # UI components
│   │   │   ├── routes/          # Page components
│   │   │   ├── contexts/        # React contexts
│   │   │   ├── hooks/           # Custom hooks
│   │   │   ├── lib/             # Utilities
│   │   │   └── locales/         # Translations
│   │   └── public/              # Static assets
│   └── client/                  # Auto-generated API client
├── documentation/               # Docusaurus documentation
├── docker/                      # Docker compose files
└── assets/                      # Screenshots & images

🚀 Getting Started

Prerequisites

  • Python 3.13+
  • Node.js 18+ and npm
  • PostgreSQL 14+
  • uv (Python package manager)

Backend Setup

  1. Clone the repository

    git clone <repository-url>
    cd Sapiens-bet
  2. Install backend dependencies

    cd backend
    uv sync
  3. Configure environment

    Create a .env file at the project root with your database credentials:

    DATABASE_URL=postgresql://user:password@localhost:5432/tennis_db
    MLFLOW_TRACKING_URI=http://localhost:5000
  4. Run database migrations

    uv run alembic upgrade head
  5. Scrape initial data (optional, can take several hours for full historical data)

    uv run tennis_scrapper scrap-tournaments --from 2020 --to 2024
    uv run tennis_scrapper scrap-matches --from 2020 --to 2024
  6. Start the API server

    cd backend
    fastapi run

    The API will be available at http://localhost:8000

Frontend Setup

  1. Install frontend dependencies

    cd frontend/app
    npm install
  2. Start the development server

    npm run dev

    The web application will be available at http://localhost:5173

Docker Deployment

To run the full stack with Docker (note that the compose file referenced below is the development configuration):

docker-compose -f docker/docker-compose-dev.yml up -d

This will start:

  • PostgreSQL database
  • FastAPI backend
  • React frontend
  • Nginx reverse proxy
  • MLflow tracking server

📸 Screenshots

Player Statistics

Player Statistics

Detailed player statistics with career performance metrics and recent form

Player List

Player List

Browse through the comprehensive player database with search and filters

Head-to-Head Analysis

Head-to-Head

Historical head-to-head matchup analysis between players

Model Performance Metrics

Model Statistics

Comprehensive model evaluation metrics with ROC curves, calibration plots, and betting simulation results


📚 Documentation

For detailed documentation on specific features and CLI commands, see:

  • CLI Usage Guide: Complete reference for command-line tools
  • Backend README: Backend-specific documentation
  • API Documentation: Available at http://localhost:8000/docs when running the server

🔄 Development Workflow

Daily Automated Tasks

The platform includes scheduled jobs for:

  1. Daily Predictions: Generate predictions for upcoming matches
  2. Rankings Update: Fetch latest ATP/WTA rankings
  3. Odds Update: Refresh betting odds for active predictions
  4. Stats Computation: Calculate player statistics for new matches

Training New Models

To train a new prediction model:

cd backend
uv run python -m tennis_scrapper.train --base-dir output --split-date 2024-06-01

This will:

  • Process historical match data
  • Compute features for all matches
  • Train an XGBoost model
  • Evaluate on validation set
  • Log metrics and artifacts to MLflow
  • Generate performance plots

Generating Predictions

To generate predictions for upcoming matches:

uv run python -m tennis_scrapper.predict

This will:

  • Scrape upcoming matches for the next 48 hours
  • Compute features based on historical data
  • Load the latest model from MLflow
  • Generate predictions
  • Store results in the database

🧪 Testing

Run the test suite:

cd backend
uv run pytest

Tests cover:

  • Web scraping functionality
  • Statistics computation
  • Database operations
  • API endpoints

🎯 Model Performance

The current production model achieves:

  • ROC-AUC: ~0.72
  • Log Loss: ~0.62
  • Accuracy: ~68%
  • Brier Score: ~0.24
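
For readers unfamiliar with these metrics, the sketch below computes them from scratch on a list of predicted probabilities and binary outcomes. It is a plain-Python illustration written for this README, not the project's `metrics.py`.

```python
import math


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood of the observed outcomes."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)


def brier_score(probs, outcomes):
    """Mean squared difference between probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def accuracy(probs, outcomes, threshold=0.5):
    """Share of matches where the thresholded prediction matches the outcome."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(probs)


def roc_auc(probs, outcomes):
    """Probability that a random win is scored above a random loss (ties count half)."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


probs = [0.9, 0.8, 0.3, 0.2]
outcomes = [1, 0, 1, 0]
print(log_loss(probs, outcomes), brier_score(probs, outcomes),
      accuracy(probs, outcomes), roc_auc(probs, outcomes))
```

As a point of comparison, a model that always predicts 0.5 scores a log loss of about 0.693 (ln 2), a Brier score of 0.25, and a ROC-AUC of 0.5, so lower log loss and Brier values and a higher ROC-AUC indicate an improvement over guessing.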

Performance varies by:

  • Tournament level (Grand Slams vs. lower tier events)
  • Surface type (some surfaces are more predictable)
  • Player ranking (top players are more consistent)

The betting simulation module shows positive expected value when applying Kelly Criterion with appropriate thresholds, though past performance doesn't guarantee future results.
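
The two strategies mentioned in this README, Kelly Criterion and constant fraction, can be compared with a small bankroll simulation. The sketch below assumes decimal odds and uses hypothetical names (`kelly_fraction`, `simulate`, `min_edge`); it does not reproduce the project's simulation module or its thresholds.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Kelly stake as a fraction of bankroll; 0 when the bet has no edge."""
    b = decimal_odds - 1.0  # net profit per unit staked on a win
    if b <= 0:
        return 0.0
    return max(0.0, (p_win * b - (1.0 - p_win)) / b)


def simulate(bets, bankroll=100.0, strategy="kelly", fraction=0.02, min_edge=0.0):
    """Replay bets in order and return the final bankroll.

    bets     : iterable of (p_win, decimal_odds, won) tuples
    strategy : "kelly" or "constant" (fixed `fraction` of current bankroll)
    min_edge : skip bets whose expected value per unit is not above this threshold
    """
    for p_win, odds, won in bets:
        edge = p_win * odds - 1.0
        if edge <= min_edge:
            continue  # no bet when the model sees too little value
        share = kelly_fraction(p_win, odds) if strategy == "kelly" else fraction
        stake = bankroll * share
        bankroll += stake * (odds - 1.0) if won else -stake
    return bankroll


history = [(0.60, 2.00, True), (0.55, 2.10, False), (0.40, 2.00, True)]
print(simulate(history, strategy="kelly"))
print(simulate(history, strategy="constant", fraction=0.02))
```

Full Kelly staking is sensitive to errors in the predicted probabilities, which is one reason a threshold such as `min_edge` or a reduced fraction of the Kelly stake is commonly applied.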


🛠️ Tech Highlights

Why These Technologies?

  • FastAPI: High-performance Python web framework with automatic API documentation
  • XGBoost: Widely used for tabular data, with strong predictive performance
  • React + Mantine: Modern, accessible UI components with excellent developer experience
  • PostgreSQL: Robust relational database with excellent performance
  • uv: Lightning-fast Python package management (10-100x faster than pip)
  • MLflow: Widely used for ML experiment tracking
  • Docker: Consistent deployment across environments

Code Quality

  • Type Safety: TypeScript for frontend, type hints in Python backend
  • Linting: ESLint for TypeScript, Ruff for Python
  • Testing: pytest for backend, comprehensive test coverage
  • Documentation: Inline documentation and Docusaurus for user guides
  • Git Hooks: Pre-commit checks for code quality

🤝 Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.


📄 License

MIT License - see LICENSE file for details


👨‍💻 Author

Pierre Carceller Meunier


🙏 Acknowledgments

  • Tennis data sourced from publicly available tennis statistics websites
  • Built with open-source technologies and libraries
  • Inspired by the sports analytics and data science communities

Made with ❤️ and lots of ☕

⭐ Star this repo if you find it useful!
