The script starts by importing necessary libraries (pandas, numpy, seaborn, matplotlib.pyplot) and reading a CSV file into a DataFrame (df).
The dataset is explored with head() and describe(), and missing values are counted with isnull().sum().
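The exploration step can be sketched as below. The column names here are hypothetical stand-ins for the actual stroke dataset, which is not shown in the source.

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for the CSV the script reads; real columns may differ.
df = pd.DataFrame({
    "age": [67, 54, np.nan, 45],
    "avg_glucose_level": [228.69, 105.92, 171.23, np.nan],
    "smoking_status": ["formerly smoked", "never smoked", "smokes", None],
    "stroke": [1, 0, 1, 0],
})

print(df.head())          # first rows of the data
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # missing-value count per column
```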
One-hot encoding is performed on categorical variables using pd.get_dummies().
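A minimal sketch of the encoding step; the `gender` column is an assumed example, and `drop_first` is one common choice rather than necessarily what the script uses:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "age": [67, 54, 45],
})

# One-hot encode the categorical column; drop_first avoids a redundant dummy.
encoded = pd.get_dummies(df, columns=["gender"], drop_first=True)
```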
Missing values are imputed using the k-nearest neighbors algorithm (KNNImputer from sklearn.impute).
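KNN imputation fills each missing value from the corresponding values of the k most similar rows. A small sketch with toy data (the script's actual `n_neighbors` setting is not stated):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, 6.0],
])

# Replace each NaN with the mean of that feature over the k nearest rows,
# where distance is computed on the features that are present.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
```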
Features are scaled using MinMaxScaler, and the dataset is split into training and testing sets.
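A sketch of scaling and splitting with synthetic data. Note the scaler is fitted on the training split only, to avoid leaking test-set statistics; whether the original script does this is not stated.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # stand-in feature matrix
y = rng.integers(0, 2, size=100)    # stand-in binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit min-max scaling on the training data, then apply it to both splits.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```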
Several classification models are chosen (KNeighborsClassifier, GaussianNB, DecisionTreeClassifier, and RandomForestClassifier) for initial testing.
Classification reports are generated for each model to evaluate their performance on the imbalanced dataset.
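The train-and-report loop over the four models might look like the following; `make_classification` stands in for the imbalanced stroke data, and the hyperparameters shown are defaults, not necessarily the script's:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Synthetic imbalanced dataset (~90% negative) standing in for the real data.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "KNN": KNeighborsClassifier(),
    "NaiveBayes": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
}

reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # zero_division=0 silences precision warnings when a class is never predicted.
    reports[name] = classification_report(
        y_test, model.predict(X_test), zero_division=0
    )
```

On imbalanced data, per-class recall in these reports is usually more informative than overall accuracy, which is why the script compares full reports rather than a single score.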
The script uses the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the minority class.
The same models are re-trained and evaluated on the oversampled dataset.
Random under-sampling is performed to balance the class distribution.
The models are re-trained and evaluated on the undersampled dataset.
The SMOTEENN technique, which combines SMOTE and Edited Nearest Neighbours (ENN), is applied.
The models are re-trained and evaluated on the combined dataset.
The script provides classification reports for each model after each resampling technique.
It highlights that the resampling techniques, particularly SMOTEENN, improve the models' ability to identify stroke-positive cases.




