vbucaj/DataAnalysisToolbox

DataAnalysisToolbox

The methods in this Python module can be used to perform the following tasks:

Data Cleaning:

Specifically:

(1) Drop and Impute

This method drops subjects and features that have too many missing values, according to thresholds you pass in, and imputes the remaining missing values.

How to use:

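    from minitoolboxVB import DropImpute

    # dataframe holds the data to clean
    drop = DropImpute(dataframe)
    drop.drop_subjects(3, inplace=True)   # drop subjects missing three or more values
    drop.impute_values(5, inplace=True)   # drop features missing 5% or more, impute the rest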
           

Applied in this order, these methods first drop subjects that are missing three or more values, then drop features that are missing 5% or more of their values, and finally impute the remaining missing values. To impute, an empirical distribution is built for each feature, and each missing value is replaced by a value sampled at random from that feature's empirical distribution.
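
The imputation step can be illustrated with a short, self-contained sketch. This is not the code inside minitoolboxVB; the function name `impute_from_empirical` and the `seed` parameter are hypothetical, and the sketch assumes the data is a pandas DataFrame and that "sampling from the empirical distribution" means drawing with replacement from a feature's observed values.

```python
import numpy as np
import pandas as pd


def impute_from_empirical(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Fill each column's missing values by sampling from its observed values.

    Hypothetical helper for illustration only; not part of minitoolboxVB.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        missing = out[col].isna()
        observed = out.loc[~missing, col].to_numpy()
        if missing.any() and observed.size > 0:
            # Sampling with replacement from the observed values is the same
            # as drawing from the column's empirical distribution.
            out.loc[missing, col] = rng.choice(observed, size=int(missing.sum()))
    return out


df = pd.DataFrame({
    "age": [23.0, np.nan, 31.0, 40.0, np.nan],
    "score": [0.5, 0.7, np.nan, 0.9, 0.6],
})
filled = impute_from_empirical(df)
```

Every imputed value is one that already occurs in the same column, so the imputed feature keeps roughly the same distribution as the observed data.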

(2) Remove Outliers

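This method removes all values that lie below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where Q1 and Q3 are the first and third quartiles of a feature and IQR = Q3 - Q1 is its interquartile range.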

How to use:

     
    from minitoolboxVB import Outliers

    outliers = Outliers(dataframe)
    outliers.remove_outliers()

Feature Selection

These methods perform feature selection for both regression and classification models. First, the data is randomly split into a training set and a test set. Then the model is trained on the training set. Next, the values of each feature in the test set are shuffled, and one of the prespecified metrics is measured to observe how much the model's predictive performance decreases.

How to use:

   
    from sklearn.linear_model import LogisticRegression
    from minitoolboxVB import FeatureSelection

    # df holds the data; the classifier is the model whose features are ranked
    feat_sel = FeatureSelection(df, classifier=LogisticRegression(solver='liblinear', penalty='l2'))

    out = feat_sel.feature_selection(classModel=True, regModel=False,
                                     roc=True,
                                     boxPlot=False,
                                     split=False,
                                     iterations=1,
                                     length=14,
                                     height=8,
                                     title='Feature Importance: ROC Curves After Shuffling',
                                     title_fontsize=22,
                                     x_fontsize=16,
                                     y_fontsize=16)
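
The shuffle-and-measure idea described above can be sketched independently of the toolbox. The code below is a minimal illustration under assumptions, not the internals of `feature_selection`: the helper name `shuffle_importance`, the synthetic data from `make_classification`, the 70/30 split, and the choice of ROC AUC as the metric are all illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def shuffle_importance(model, X_test, y_test, seed=0):
    """Drop in test ROC AUC after shuffling each feature column in turn.

    Hypothetical helper for illustration only; not part of minitoolboxVB.
    """
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    drops = []
    for j in range(X_test.shape[1]):
        X_shuffled = X_test.copy()
        # Shuffling one column breaks its link to the target while keeping
        # its marginal distribution unchanged.
        X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
        score = roc_auc_score(y_test, model.predict_proba(X_shuffled)[:, 1])
        drops.append(baseline - score)
    return baseline, np.array(drops)


# Synthetic data: with shuffle=False the informative features come first.
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(solver="liblinear", penalty="l2").fit(X_train, y_train)
baseline, drops = shuffle_importance(model, X_test, y_test)
```

A larger entry in `drops` means the model relied more on that feature. Features whose shuffling barely changes the score are candidates for removal.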

About

Data Cleaning and Feature Selection Toolbox
