ClusterFlow is an intelligent web platform for clustering analysis that automates the entire machine learning pipeline for data segmentation. Designed for data scientists, analysts, and ML professionals, this application offers a complete flow from data loading to results export, all through an intuitive visual interface.
ClusterFlow simplifies the clustering analysis process through:
- Automated pipeline: Step-by-step guide from raw data to final clusters
- Intelligent variable selection: Automatic algorithm that identifies optimal features
- Multiple clustering algorithms: K-Means, DBSCAN, and Agglomerative with automatic best selection
- Complete exploratory analysis: Statistics, distributions, outliers, and correlations with improved visualizations
- Automatic optimization: Determination of optimal number of clusters using 4 different metrics
- PCA visualization: Automatic high-dimensional to 2D projection to visualize clusters
- Professional export: Download of labeled data, cluster profiles, and quality metrics
- Business analysts who need to segment customers or products
- Data scientists looking to automate repetitive clustering tasks
- Researchers requiring fast visual exploratory analysis
- Teams needing a collaborative and reproducible tool
# Easiest method: Double-click on run.bat
# Or from terminal:
.\run.bat

# Run application
streamlit run app/main.py
# The application will open at http://localhost:8501

# Build and run
docker-compose up --build
# Access the application
# http://localhost:8501

Upload a CSV file with your data
Configure and execute cleaning:
- Remove duplicates
- Impute null values
- Remove outliers
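
The three cleaning steps above could look roughly like the following pandas sketch. The helper name `clean_frame`, the median imputation, and the 1.5 × IQR outlier rule are assumptions made for illustration; ClusterFlow's own cleaning options may differ.

```python
import numpy as np
import pandas as pd


def clean_frame(df: pd.DataFrame, iqr_factor: float = 1.5) -> pd.DataFrame:
    """Hypothetical helper: deduplicate, impute, then drop IQR outliers."""
    out = df.drop_duplicates().copy()
    numeric_cols = out.select_dtypes(include="number").columns

    # Impute missing numeric values with the column median (an assumed strategy).
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())

    # Keep only rows whose numeric values all fall inside the IQR fences.
    q1 = out[numeric_cols].quantile(0.25)
    q3 = out[numeric_cols].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    inside = ((out[numeric_cols] >= lower) & (out[numeric_cols] <= upper)).all(axis=1)
    return out[inside].reset_index(drop=True)


# Toy data: one duplicated row, one missing value, one extreme income.
raw = pd.DataFrame(
    {
        "income": [30.0, 32.0, 31.0, 32.0, np.nan, 29.0, 33.0, 500.0],
        "age": [25, 40, 31, 40, 38, 29, 45, 33],
    }
)
cleaned = clean_frame(raw)
```

Removing outliers after imputation keeps the imputed values from being judged against fences computed on incomplete columns; the opposite order is equally defensible depending on the data.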
- Descriptive statistics
- Distributions
- Outlier detection
- Correlations
- Bivariate analysis
- Variable selection
- New feature creation
- Multicollinearity analysis
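
One common way to approach the multicollinearity step is to drop one feature from each highly correlated pair. The sketch below assumes a Pearson correlation threshold of 0.9 and a hypothetical helper named `drop_collinear`; it is not necessarily the selection algorithm ClusterFlow uses.

```python
import numpy as np
import pandas as pd


def drop_collinear(df: pd.DataFrame, threshold: float = 0.9) -> list[str]:
    """Hypothetical helper: return the columns kept after correlation filtering."""
    corr = df.corr().abs()
    # Look only at the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return [col for col in df.columns if col not in to_drop]


rng = np.random.default_rng(0)
base = rng.normal(size=200)
features = pd.DataFrame(
    {
        "spend": base,
        "spend_scaled": base * 3.0 + 1.0,  # linear copy of "spend"
        "visits": rng.normal(size=200),    # independent feature
    }
)
kept = drop_collinear(features)
```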
Choose scaling method:
- StandardScaler (Z-score)
- MinMaxScaler (0-1)
- RobustScaler (outlier-resistant)
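
To see how the three scalers differ, the following sketch applies each scikit-learn scaler to a small column containing one extreme value. The data is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with an obvious outlier (100.0).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scalers = {
    "standard": StandardScaler(),  # zero mean, unit variance
    "minmax": MinMaxScaler(),      # rescales to the [0, 1] range
    "robust": RobustScaler(),      # centers on the median, scales by the IQR
}
scaled = {name: scaler.fit_transform(X) for name, scaler in scalers.items()}

for name, values in scaled.items():
    print(name, np.round(values.ravel(), 2))
```

With MinMaxScaler the outlier compresses the other four values close to 0, while RobustScaler keeps them spread around the median, which is why it is the outlier-resistant option.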
- Automatic optimal K determination
- Multiple algorithms (KMeans, Hierarchical)
- Automatic comparison
- Cluster visualization
- Cluster profiles
- Results export
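
As a rough illustration of optimal K determination and algorithm comparison, the sketch below picks K by silhouette score and then compares KMeans with Ward hierarchical clustering. The helper `pick_k`, the synthetic data, and the choice of silhouette as the deciding metric are assumptions; the application may combine its metrics differently.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data with three well-separated groups (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]],
    cluster_std=0.6,
    random_state=42,
)


def pick_k(X, k_values=range(2, 7)):
    """Hypothetical helper: choose K by the highest silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores


best_k, k_scores = pick_k(X)

# Compare two algorithms at the chosen K using three quality metrics.
models = {
    "kmeans": KMeans(n_clusters=best_k, n_init=10, random_state=0),
    "ward": AgglomerativeClustering(n_clusters=best_k, linkage="ward"),
}
report = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    report[name] = {
        "silhouette": silhouette_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    }
winner = max(report, key=lambda name: report[name]["silhouette"])
```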
✅ Modularized: Code organized in specialized modules
✅ Complete modular architecture: 7 independent pages
✅ Centralized configuration: Easy maintenance
✅ Complete analysis: Exhaustive EDA
✅ Multiple algorithms: KMeans, Hierarchical (Ward, Complete, Average)
✅ Advanced metrics: Silhouette, Davies-Bouldin, Calinski-Harabasz
✅ Feature Engineering: Variable creation and selection
✅ Visualizations: Interactive and informative charts
✅ Export: Download results in CSV
✅ Docker: Containerized deployment
✅ Tests: 75 tests with 99.2% coverage
ClusterFlow implements a 3-layer modular architecture designed for scalability and maintainability:
1. Presentation Layer (pages/)
- 7 independent modules with Streamlit interface
- State management with st.session_state
- Flow validation between pages
2. Business Logic Layer (core/)
- ML algorithms: clustering, scaling, cleaning
- Intelligent feature selection
- Automatic hyperparameter optimization
3. Configuration Layer (config/, utils/, styles/)
- Centralized and reusable configuration
- Independent auxiliary functions
- Consistent visual styles
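
Flow validation between pages can be pictured as checking that every earlier stage has stored its output before a later page runs. The sketch below is only an assumption about how that might work: the stage keys and the helper `missing_prerequisites` are hypothetical, and a plain dict stands in for st.session_state, which supports the same key lookup and membership checks.

```python
from collections.abc import Mapping

# Hypothetical stage keys, one per pipeline step; the real pages may use other names.
PIPELINE_KEYS = ["raw_data", "clean_data", "selected_features", "scaled_data", "labels"]


def missing_prerequisites(state: Mapping, page_key: str) -> list[str]:
    """Return the earlier stage keys that are not yet present in the state."""
    required = PIPELINE_KEYS[: PIPELINE_KEYS.index(page_key)]
    return [key for key in required if key not in state]


# Only the upload step has run, so the scaling page should be blocked.
state = {"raw_data": "placeholder for the uploaded DataFrame"}
blocked_by = missing_prerequisites(state, "scaled_data")
print(blocked_by)
```

Inside a Streamlit page, a non-empty result could be used to show a warning and stop rendering until the user completes the earlier steps.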
CSV → Load     → Clean    → EDA       → Feature Eng. → Scale     → Clustering → Results
      ↓          ↓          ↓           ↓              ↓           ↓            ↓
      Validate   Impute     Visualize   Selection      Normalize   Optimize     PCA + Export
                 Outliers               Intelligent                Optimal K
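
The final stages of this flow (scale, cluster, project with PCA, export) can be sketched end to end with scikit-learn and pandas. The data, column names, and K = 2 below are invented for illustration and do not reflect ClusterFlow's internal modules.

```python
import io

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic five-feature data with two groups (illustrative only).
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(120, 5)), columns=[f"f{i}" for i in range(5)])
data.iloc[:60] += 4.0  # shift half the rows to create a second group

scaled = StandardScaler().fit_transform(data)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

# Project the scaled features to 2D so the clusters can be plotted.
coords = PCA(n_components=2).fit_transform(scaled)

# Labeled data and per-cluster mean profiles, ready for export.
labeled = data.assign(cluster=labels, pc1=coords[:, 0], pc2=coords[:, 1])
profiles = labeled.groupby("cluster")[list(data.columns)].mean()

# Write the labeled data to an in-memory CSV, as a download step might.
buffer = io.StringIO()
labeled.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```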
- Framework: Streamlit
- ML: scikit-learn (clustering, PCA, metrics)
- Analysis: pandas, numpy
- Visualization: matplotlib, seaborn
- Testing: pytest
- Containerization: Docker + Docker Compose
Contributions are welcome! Here's how you can collaborate with ClusterFlow:
- Report Bugs 🐛
  - Use the issue format to describe the problem
  - Include steps to reproduce the error
  - Attach screenshots if possible
- Propose New Features 💡
  - Open an issue explaining the functionality
  - Describe use cases and benefits
  - Wait for feedback before implementing
- Improve Documentation 📝
  - Fix errors or improve clarity
  - Add usage examples
  - Translate content to other languages
- Contribute Code 💻
  - Fork the repository
  - Create a branch for your feature: git checkout -b feature/new-feature
  - Follow project code conventions
  - Add tests for your code (we maintain >99% coverage)
  - Run pytest to verify all tests pass
  - Commit with descriptive messages
  - Open a Pull Request explaining the changes
This project is licensed under the MIT License - see the LICENSE file for details.
