Clustering and Predict Entrepreneur
- Based on hierarchical clustering and decision tree (`Hierarchical Founder Analysis.py` in the `Founder_Analysis_after_resampling` directory)
- Based on association rules mining (`main.py`)
This program performs a hierarchical analysis on entrepreneur data to discover and understand patterns that lead to success. It first identifies main clusters, then within each main cluster, uses a decision tree to find and characterize subclusters. Unlike black-box models, this approach is inherently interpretable: the decision trees produce human-readable rules that clearly explain why certain entrepreneurs fall into particular subgroups. By examining these rules and associated metrics, you can understand the key features driving cluster formation and subcluster differentiation.
- Main Cluster Extraction: Automatically clusters the dataset into main clusters.
- Decision Tree Training: For each main cluster, trains a decision tree to identify subclusters.
- Threshold Value Restoration: If features were scaled, the program restores decision tree thresholds to their original feature scale for easier interpretation.
- Leaf Node Analysis: Examines each leaf node (subcluster) to compute success rates, normalized success rates, and identify features significantly different from the overall population.
- Output:
  - `decision_tree_cluster_X.png`: A decision tree visualization for each main cluster (X is the cluster number).
  - `decision_rules_cluster_X.txt`: A text file containing decision rules for each main cluster.
  - `founder_clusters_analysis.xlsx`: An Excel file summarizing main clusters, subclusters, and combined clusters.
  - Results in memory: Arrays/lists of main clusters, subclusters, and labels returned by the `fit_transform` method.
- Classifying a New Entrepreneur: After fitting, you can apply the model to a new entrepreneur’s feature data to determine which main cluster and subcluster they would belong to, leveraging the explicit, interpretable rules from the decision trees.
This method is highly interpretable:
- Each subcluster is defined by transparent, human-readable decision rules derived from decision trees.
- Visualized decision trees and exported rule files help you understand the logical path from features to cluster membership.
- Identified significant features help explain why certain groups differ from the overall population.
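The two-stage idea described above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the code in `Hierarchical Founder Analysis.py`: the function name `two_stage_sketch`, the choice of `AgglomerativeClustering`, and the synthetic data are all hypothetical, and the real `TwoStageFounderAnalysis` class may cluster and split differently.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.tree import DecisionTreeClassifier, export_text


def two_stage_sketch(X, y, feature_names, n_main_clusters=2, min_subcluster_size=5):
    """Cluster X, then fit one decision tree per main cluster.

    Returns the main-cluster labels and a dict that maps each cluster id
    to the human-readable rules of its tree.
    """
    main_labels = AgglomerativeClustering(n_clusters=n_main_clusters).fit_predict(X)
    rules = {}
    for cluster_id in np.unique(main_labels):
        mask = main_labels == cluster_id
        # Skip clusters that are too small or whose target never varies.
        if mask.sum() < min_subcluster_size or len(np.unique(y[mask])) < 2:
            continue
        tree = DecisionTreeClassifier(
            min_samples_leaf=min_subcluster_size, random_state=0
        )
        tree.fit(X[mask], y[mask])
        rules[int(cluster_id)] = export_text(tree, feature_names=feature_names)
    return main_labels, rules


# Synthetic data for illustration only: two well-separated groups, with the
# target depending on the second feature inside each group.
rng = np.random.default_rng(0)
group_a = rng.normal(0, 1, (40, 2))
group_b = rng.normal(6, 1, (40, 2))
X = np.vstack([group_a, group_b])
y = np.concatenate([group_a[:, 1] > 0, group_b[:, 1] > 6]).astype(int)

main_labels, rules = two_stage_sketch(X, y, ["feature1", "feature2"])
for cluster_id, text in rules.items():
    print(f"Main cluster {cluster_id}:\n{text}")
```

Each leaf of a per-cluster tree plays the role of a subcluster, and `min_samples_leaf` is one way to enforce a minimum subcluster size.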
- Python 3.x
- Required packages: `numpy`, `pandas`, `scikit-learn`, `matplotlib`

To install dependencies:

```bash
pip install numpy pandas scikit-learn matplotlib
```

- Dataset `df`: A `pandas.DataFrame` containing all relevant feature columns and a binary target column indicated by `success_column`.
- Parameters:
  - `success_column`: The name of the binary column indicating success/failure.
  - `n_main_clusters`: The number of main clusters to identify.
  - `min_subcluster_size`: The minimum number of samples required in each subcluster (leaf node).
  - `real_world_success_rate`: A rate used to normalize the success rate in the results.
- Prepare Your Data:
  - Ensure `df` contains all required features and a binary success/failure column.
- Instantiate and Configure:
  - Create an instance of `TwoStageFounderAnalysis`.
  - Set `success_column`, `n_main_clusters`, `min_subcluster_size`, and `real_world_success_rate`.
- Run the Program:
  - Call the `fit_transform(df)` method on your analyzer instance.
  - The program will:
    - Identify main clusters.
    - Train decision trees for each main cluster.
    - Generate visualization images, decision rule text files, and the Excel summary.
    - Return `main_clusters`, `subclusters`, and `labels`.
- Check the Results:
  - After execution, you will find files like `decision_tree_cluster_1.png`, `decision_rules_cluster_1.txt`, and `founder_clusters_analysis.xlsx` in the current directory.
  - The returned `main_clusters`, `subclusters`, and `labels` provide information on the hierarchical structure discovered.
- Classify a New Entrepreneur:
  - After fitting, you can use the trained model (stored within the analyzer) to predict which cluster a new entrepreneur would belong to by passing their feature data through the decision trees.
- If a main cluster has fewer samples than `min_subcluster_size`, it is skipped.
- If the target column has no variability within a main cluster (e.g., all successes or all failures), no decision tree splitting is performed for that cluster.
- If scaling was applied, ensure that the scaler is fitted on the training data before running the analysis so that threshold values can be correctly inverted.
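To make the threshold inversion concrete, here is a minimal sketch that assumes the features were standardized with scikit-learn's `StandardScaler`. The helper name `restore_thresholds` is hypothetical and the program's own restoration code may differ. For a standardized feature, a scaled threshold `t` corresponds to `t * scale + mean` in original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def restore_thresholds(tree, scaler):
    """Return each node's split threshold in original units (NaN for leaves)."""
    features = tree.tree_.feature
    thresholds = tree.tree_.threshold
    restored = np.full(thresholds.shape, np.nan)
    is_split = features >= 0  # leaf nodes carry a negative feature index
    idx = features[is_split]
    # Invert standardization: x_original = x_scaled * scale + mean.
    restored[is_split] = thresholds[is_split] * scaler.scale_[idx] + scaler.mean_[idx]
    return restored


# Synthetic data for illustration only: success depends on one feature
# crossing 50 in original units.
rng = np.random.default_rng(0)
X_original = rng.uniform(0, 100, size=(200, 1))
y = (X_original[:, 0] > 50).astype(int)

scaler = StandardScaler().fit(X_original)
tree = DecisionTreeClassifier(max_depth=1, random_state=0)
tree.fit(scaler.transform(X_original), y)

restored = restore_thresholds(tree, scaler)
print(f"Scaled threshold:   {tree.tree_.threshold[0]:.3f}")
print(f"Restored threshold: {restored[0]:.3f}")
```

The restored root threshold lands close to 50, which is the value a reader would expect to see in an interpretable rule.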
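```python
import importlib.util

import pandas as pd

# The module file name contains spaces, so a plain `import` statement cannot
# load it; load it from its file path instead (adjust the path as needed).
spec = importlib.util.spec_from_file_location(
    "hierarchical_founder_analysis",
    "Founder_Analysis_after_resampling/Hierarchical Founder Analysis.py",
)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
TwoStageFounderAnalysis = module.TwoStageFounderAnalysis

df = pd.read_csv("Data.csv")
analyzer = TwoStageFounderAnalysis(
    success_column="is_success",
    n_main_clusters=5,
    min_subcluster_size=15,
    real_world_success_rate=0.019,
)
main_clusters, subclusters, labels = analyzer.fit_transform(df)
```

To classify a new entrepreneur after fitting:

```python
# Feature names and values are placeholders; use your dataset's columns.
new_entrepreneur = pd.DataFrame([{
    "feature1": 0.5,
    "feature2": 2.0,
    "feature3": -1.0
}])
results = analyzer.classify_new_founder(new_entrepreneur)
```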
### Example Output
When you run the classifier, you'll get detailed results like this:
```
Classification Results:
Main Cluster: 2
Decision Path: previous_startup_funding_experience_as_ceo <= 3.50 AND education_institution <= 3.50 AND nasdaq_leadership <= 0.50 AND personal_branding > -0.49
Leaf Node Statistics:
- Success rate in leaf: 56.0%
```

This output tells you:
- The founder belongs to Main Cluster 2
- The specific decision path taken through the tree
- The success rate for similar founders in this leaf node (56.0%)
The decision path can be interpreted as:
- Raised less than 50M USD as CEO
- Not from a top-20 ranked university
- No leadership role in NASDAQ companies
- Above average personal branding
Using these results, you can understand both the classification and the reasoning behind it, making it valuable for both prediction and insight generation.
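For readers curious how a decision path and a leaf success rate like those in the example output can be derived from a fitted scikit-learn tree, here is a minimal sketch. The helper `explain_sample`, the feature names, and the synthetic data are hypothetical; it is not the implementation behind `classify_new_founder`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def explain_sample(tree, x_row, feature_names, X_train, y_train):
    """Return the decision path for x_row and the success rate in its leaf."""
    node_ids = tree.decision_path(x_row.reshape(1, -1)).indices
    conditions = []
    for node in node_ids:
        feature = tree.tree_.feature[node]
        if feature < 0:  # a negative index marks a leaf node
            continue
        threshold = tree.tree_.threshold[node]
        op = "<=" if x_row[feature] <= threshold else ">"
        conditions.append(f"{feature_names[feature]} {op} {threshold:.2f}")
    # Success rate among training samples that land in the same leaf.
    leaf = tree.apply(x_row.reshape(1, -1))[0]
    in_leaf = tree.apply(X_train) == leaf
    return " AND ".join(conditions), float(y_train[in_leaf].mean())


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] > 0).astype(int)
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=15, random_state=0)
tree.fit(X_train, y_train)

path, leaf_rate = explain_sample(
    tree, np.array([1.0, 0.0]), ["feature1", "feature2"], X_train, y_train
)
print(f"Decision Path: {path}")
print(f"Success rate in leaf: {leaf_rate:.1%}")
```

Because every condition is read directly from the tree's nodes, the printed path is exactly the rule that assigned the sample to its leaf.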
This project is designed to analyze founder characteristics and predict success probabilities using frequent pattern mining and clustering techniques. It implements the Apriori algorithm for finding frequent itemsets in founder data and uses statistical analysis to identify patterns associated with founder success.
- Frequent pattern mining using Apriori algorithm
- Cluster analysis of founder characteristics
- Success probability prediction
- Real-world probability scaling
- Confidence interval calculations
- Detailed clustering analysis with visualization
- Evaluation of prediction accuracy
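As a rough illustration of the frequent pattern mining step, the sketch below implements a small Apriori search in plain Python. It is an assumption-laden teaching example, not the project's pipeline (which lists `mlxtend` among its dependencies); the function name, the founder attributes, and the `max_len` parameter are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support, max_len=3):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    size = 1
    while candidates and size <= max_len:
        # Keep the candidates that appear in enough transactions.
        level = {}
        for itemset in candidates:
            support = sum(itemset <= t for t in transactions) / n
            if support >= min_support:
                level[itemset] = support
        frequent.update(level)
        # Join frequent itemsets into candidates one item larger, keeping only
        # those whose every subset of the current size is itself frequent.
        size += 1
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


# Hypothetical founder attributes, one set per founder.
transactions = [
    {"phd", "prior_exit"},
    {"phd", "prior_exit", "big_tech"},
    {"phd"},
    {"big_tech"},
]
frequent = apriori(transactions, min_support=0.5)
for itemset, support in sorted(frequent.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), support)
```

The pruning step is what makes Apriori practical: an itemset can only be frequent if all of its smaller subsets are frequent too, so most combinations are never counted.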
```
.
├── config.py              # Configuration parameters
├── founder_clustering.py  # Clustering analysis implementation
├── main.py                # Main analysis pipeline
└── requirements.txt       # Project dependencies
```
- Clone the repository:

```bash
git clone <repository-url>
cd founder-analysis
```

- Install required packages:

```bash
pip install -r requirements.txt
```

Required packages:
- pandas
- numpy
- mlxtend
- scikit-learn
- tqdm
- tabulate
```python
from config import AnalysisConfig
from main import FounderAnalyzer

# Configure analysis parameters
config = AnalysisConfig(
    base_feature=None,        # Filter by specific feature
    feature_value=None,       # Value of the base feature
    exclude_features=None,    # Features to exclude
    persona=None,             # Filter by persona
    feature_combination=1,    # Max number of features in combinations
    min_sample=30,            # Minimum sample size
    sample_size=8800,         # Total sample size
    decreasing_prob=True,     # Sort by decreasing probability
    include_negative=False,   # Include negative indicators
    cluster_weights=[5, 3, 0, 0, 0, 0]  # Weights for different clusters
)

# Initialize analyzer and run analysis
analyzer = FounderAnalyzer(config)
analyzer.analyze("your_data.csv")

# Save clustering results
analyzer.save_clusters('cluster_results.json')

# Load existing clustering results
analyzer.load_clusters('cluster_results.json')

# Example founder features
founder_features = {
    'feature1': 'value1',
    'feature2': 'value2',
    # ... more features
}
analyzer.predict_new_founder(founder_features)

# Evaluate predictions over a range of rows
results = analyzer.evaluate_predictions("your_data.csv", start_idx=0, end_idx=8800)
```

Configuration parameters:
- `base_feature`: Filter analysis by a specific feature
- `feature_value`: Value of the base feature to filter by
- `exclude_features`: List of features to exclude from analysis
- `persona`: Filter founders by specific persona
- `feature_combination`: Maximum number of features to combine (1-3 recommended)
- `min_sample`: Minimum sample size for pattern consideration
- `sample_size`: Total sample size to analyze
- `num_results`: Number of top results to display
- `decreasing_prob`: Sort by decreasing probability if True
- `confidence_level`: Confidence level for intervals (default: 0.95)
- `real_world_scaling`: Scaling factor for real-world probabilities
- `include_negative`: Include negative indicators in analysis
- `cluster_weights`: Weights for different success clusters [extremely_high, very_high, high, low, very_low, extremely_low]
The input CSV file should contain founder data with the following columns:
- `founder_uuid`: Unique identifier for each founder
- `name`: Founder name
- `org_name`: Organization name
- `success`: Binary indicator of success (0 or 1)
- Additional feature columns containing founder characteristics
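Before running the analysis, it can help to check that a dataset matches this layout. The following is a small sketch using pandas; `validate_founder_data` is a hypothetical helper that is not part of the project, and the sample rows are made up for illustration.

```python
import pandas as pd

REQUIRED_COLUMNS = ["founder_uuid", "name", "org_name", "success"]


def validate_founder_data(df):
    """Check the required columns and return the remaining feature columns."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    if not df["success"].isin([0, 1]).all():
        raise ValueError("Column 'success' must contain only 0 or 1")
    return [col for col in df.columns if col not in REQUIRED_COLUMNS]


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "founder_uuid": ["a1", "b2"],
    "name": ["Founder A", "Founder B"],
    "org_name": ["Org A", "Org B"],
    "success": [1, 0],
    "phd": [1, 0],
})
feature_columns = validate_founder_data(sample)
print(feature_columns)
```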
The analysis provides:
- Frequent patterns in founder characteristics
- Success probabilities with confidence intervals
- Cluster analysis results
- Real-world scaled probabilities
- Detailed cluster statistics and visualization
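The README does not say which interval method the project uses, so the following is only one plausible way to attach a confidence interval to a success probability: the Wilson score interval, written with the standard library. The function name and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence_level=0.95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value, e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Example: 12 successful founders out of 40 matching a pattern.
low, high = wilson_interval(12, 40)
print(f"Success probability: {12 / 40:.3f}, interval: [{low:.3f}, {high:.3f}]")
```

The Wilson interval behaves better than the simple normal approximation when a pattern matches few founders, which is why a minimum sample size such as `min_sample` still matters.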
- Adjust `min_sample` and `sample_size` based on your dataset size
- The `cluster_weights` parameter can be tuned to adjust the importance of different success clusters in predictions
- Use `include_negative=True` if you want to consider negative indicators in the analysis
- The real-world scaling factor can be adjusted based on your domain knowledge