Feature_Engineering

Please use the program in Founder_Analysis_after_resampling!!!!!

Clustering and Predict Entrepreneur

Based on hierarchical clustering and decision tree (Hierarchical Founder Analysis.py in the Founder_Analysis_after_resampling dir.)
Based on association rules mining (main.py)

README (for Hierarchical Founder Analysis)

Overview

This program performs a hierarchical analysis on entrepreneur data to discover and understand patterns that lead to success. It first identifies main clusters, then within each main cluster, uses a decision tree to find and characterize subclusters. Unlike black-box models, this approach is inherently interpretable: the decision trees produce human-readable rules that clearly explain why certain entrepreneurs fall into particular subgroups. By examining these rules and associated metrics, you can understand the key features driving cluster formation and subcluster differentiation.

Features

Main Cluster Extraction: Automatically clusters the dataset into main clusters.
Decision Tree Training: For each main cluster, trains a decision tree to identify subclusters.
Threshold Value Restoration: If features were scaled, the program restores decision tree thresholds to their original feature scale for easier interpretation.
Leaf Node Analysis: Examines each leaf node (subcluster) to compute success rates, normalized success rates, and identify features significantly different from the overall population.
Output:
- decision_tree_cluster_X.png: A decision tree visualization for each main cluster (X is the cluster number).
- decision_rules_cluster_X.txt: A text file containing decision rules for each main cluster.
- founder_clusters_analysis.xlsx: An Excel file summarizing main clusters, subclusters, and combined clusters.
- Results in memory: Arrays/lists of main clusters, subclusters, and labels returned by the fit_transform method.
Classifying a New Entrepreneur: After fitting, you can apply the model to a new entrepreneur’s feature data to determine which main cluster and subcluster they would belong to, leveraging the explicit, interpretable rules from the decision trees.

Interpretability

This method is highly interpretable:

Each subcluster is defined by transparent, human-readable decision rules derived from decision trees.
Visualized decision trees and exported rule files help you understand the logical path from features to cluster membership.
Identified significant features help explain why certain groups differ from the overall population.

Requirements

Python 3.x
Required packages:
- numpy
- pandas
- scikit-learn
- matplotlib

To install dependencies:

pip install numpy pandas scikit-learn matplotlib

Input Data Requirements

Dataset df: A pandas.DataFrame containing all relevant feature columns and a binary target column indicated by success_column.
Parameters:
- success_column: The name of the binary column indicating success/failure.
- n_main_clusters: The number of main clusters to identify.
- min_subcluster_size: The minimum number of samples required in each subcluster (leaf node).
- real_world_success_rate: A rate used to normalize the success rate in the results.

Usage Steps

Prepare Your Data:
- Ensure df contains all required features and a binary success/failure column.
Instantiate and Configure:
- Create an instance of TwoStageFounderAnalysis.
- Set success_column, n_main_clusters, min_subcluster_size, and real_world_success_rate.
Run the Program:
- Call the fit_transform(df) method on your analyzer instance.
- The program will:
  - Identify main clusters.
  - Train decision trees for each main cluster.
  - Generate visualization images, decision rule text files, and the Excel summary.
  - Return main_clusters, subclusters, and labels.
Check the Results:
- After execution, you will find files like decision_tree_cluster_1.png, decision_rules_cluster_1.txt, and founder_clusters_analysis.xlsx in the current directory.
- The returned main_clusters, subclusters, and labels provide information on the hierarchical structure discovered.
Classify a New Entrepreneur:
- After fitting, you can use the trained model (stored within the analyzer) to predict which cluster a new entrepreneur would belong to by passing their feature data through the decision trees.

Notes

If a main cluster has fewer samples than min_subcluster_size, it is skipped.
If the target column has no variability within a main cluster (e.g., all successes or all failures), no decision tree splitting is performed for that cluster.
If scaling was applied, ensure that the scaler is fitted on the training data before running the analysis so that threshold values can be correctly inverted.

Example

import pandas as pd
from Hierarchical Founder Analysis import TwoStageFounderAnalysis

df = pd.read_csv("Data.csv")
analyzer = TwoStageFounderAnalysis(
    success_column="is_success",
    n_main_clusters=5,
    min_subcluster_size=15,
    real_world_success_rate=0.019
)

main_clusters, subclusters, labels = analyzer.fit_transform(df)

Example:

 ```python
 new_entrepreneur = pd.DataFrame([{
     "feature1": 0.5,
     "feature2": 2.0,
     "feature3": -1.0
 }])
 results = analyzer.classify_new_founder(new_founder)
 ```

 ### Example Output

When you run the classifier, you'll get detailed results like this:

Classification Results:
Main Cluster: 2
Decision Path: previous_startup_funding_experience_as_ceo <= 3.50 AND education_institution <= 3.50 AND nasdaq_leadership <= 0.50 AND personal_branding > -0.49
Leaf Node Statistics:
- Success rate in leaf: 56.0%

This output tells you:

The founder belongs to Main Cluster 2
The specific decision path taken through the tree
The success rate for similar founders in this leaf node (56.0%)

The decision path can be interpreted as:

Raised less than 50M USD as CEO
Not from a top-20 ranked university
No leadership role in NASDAQ companies
Above average personal branding

Using these results, you can understand both the classification and the reasoning behind it, making it valuable for both prediction and insight generation.

README (For Association Rules Mining)

This project is designed to analyze founder characteristics and predict success probabilities using frequent pattern mining and clustering techniques. It implements the Apriori algorithm for finding frequent itemsets in founder data and uses statistical analysis to identify patterns associated with founder success.

Features

Frequent pattern mining using Apriori algorithm
Cluster analysis of founder characteristics
Success probability prediction
Real-world probability scaling
Confidence interval calculations
Detailed clustering analysis with visualization
Evaluation of prediction accuracy

Project Structure

.
├── config.py               # Configuration parameters
├── founder_clustering.py   # Clustering analysis implementation
├── main.py                # Main analysis pipeline
└── requirements.txt       # Project dependencies

Installation

Clone the repository:

git clone <repository-url>
cd founder-analysis

Install required packages:

pip install -r requirements.txt

Required packages:

pandas
numpy
mlxtend
scikit-learn
tqdm
tabulate

Usage

Basic Analysis

from config import AnalysisConfig
from main import FounderAnalyzer

# Configure analysis parameters
config = AnalysisConfig(
    base_feature=None,           # Filter by specific feature
    feature_value=None,          # Value of the base feature
    exclude_features=None,       # Features to exclude
    persona=None,               # Filter by persona
    feature_combination=1,       # Max number of features in combinations
    min_sample=30,              # Minimum sample size
    sample_size=8800,           # Total sample size
    decreasing_prob=True,       # Sort by decreasing probability
    include_negative=False,      # Include negative indicators
    cluster_weights=[5, 3, 0, 0, 0, 0]  # Weights for different clusters
)

# Initialize analyzer and run analysis
analyzer = FounderAnalyzer(config)
analyzer.analyze("your_data.csv")

Saving and Loading Clusters

# Save clustering results
analyzer.save_clusters('cluster_results.json')

# Load existing clustering results
analyzer.load_clusters('cluster_results.json')

Predicting New Founder Success

# Example founder features
founder_features = {
    'feature1': 'value1',
    'feature2': 'value2',
    # ... more features
}

analyzer.predict_new_founder(founder_features)

Evaluating Predictions

results = analyzer.evaluate_predictions("your_data.csv", start_idx=0, end_idx=8800)

Configuration Options

AnalysisConfig Parameters

base_feature: Filter analysis by a specific feature
feature_value: Value of the base feature to filter by
exclude_features: List of features to exclude from analysis
persona: Filter founders by specific persona
feature_combination: Maximum number of features to combine (1-3 recommended)
min_sample: Minimum sample size for pattern consideration
sample_size: Total sample size to analyze
num_results: Number of top results to display
decreasing_prob: Sort by decreasing probability if True
confidence_level: Confidence level for intervals (default: 0.95)
real_world_scaling: Scaling factor for real-world probabilities
include_negative: Include negative indicators in analysis
cluster_weights: Weights for different success clusters [extremely_high, very_high, high, low, very_low, extremely_low]

Data Format

The input CSV file should contain founder data with the following columns:

founder_uuid: Unique identifier for each founder
name: Founder name
org_name: Organization name
success: Binary indicator of success (0 or 1)
Additional feature columns containing founder characteristics

Output

The analysis provides:

Frequent patterns in founder characteristics
Success probabilities with confidence intervals
Cluster analysis results
Real-world scaled probabilities
Detailed cluster statistics and visualization

Notes

Adjust min_sample and sample_size based on your dataset size
The cluster_weights parameter can be tuned to adjust the importance of different success clusters in predictions
Use include_negative=True if you want to consider negative indicators in the analysis
The real-world scaling factor can be adjusted based on your domain knowledge

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Feature_Engineering

Please use the program in Founder_Analysis_after_resampling!!!!!

README (for Hierarchical Founder Analysis)

Overview

Features

Interpretability

Requirements

Input Data Requirements

Usage Steps

Notes

Example

Example:

README (For Association Rules Mining)

Features

Project Structure

Installation

Usage

Basic Analysis

Saving and Loading Clusters

Predicting New Founder Success

Evaluating Predictions

Configuration Options

AnalysisConfig Parameters

Data Format

Output

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
Founder Anlysis_with_path		Founder Anlysis_with_path
Founder_Analysis_after_resampling		Founder_Analysis_after_resampling
README.md		README.md
cluster_results.json		cluster_results.json
config.py		config.py
founder_clustering.py		founder_clustering.py
main.py		main.py

Folders and files

Latest commit

History

Repository files navigation

Feature_Engineering

Please use the program in Founder_Analysis_after_resampling!!!!!

README (for Hierarchical Founder Analysis)

Overview

Features

Interpretability

Requirements

Input Data Requirements

Usage Steps

Notes

Example

Example:

README (For Association Rules Mining)

Features

Project Structure

Installation

Usage

Basic Analysis

Saving and Loading Clusters

Predicting New Founder Success

Evaluating Predictions

Configuration Options

AnalysisConfig Parameters

Data Format

Output

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages