GitHub - epolak01/ARCS-Process-Experiment: Recognizing users from process data in Kent 2016 dataset

Final Results
README
acc_comp_graphs.py
create_proc_dataset.slrm
filter_process_file.py
final_process_analysis.csv
proc_and_session.slrm
proc_based_user_sessions.py
proc_name.slrm
proc_only.slrm
run_all_features.py
run_process_features.py
run_process_name_features.py


####################################
# Author: Emil Polakiewicz
# Date: December 2021
# Purpose: README File for ARCS Process Experiment
####################################

Files:
    - filter_process_file.py: Takes the raw proc.txt file from the Kent 2016 dataset and filters it down to a manageable size
    - proc_based_user_sessions.py: Takes the filtered process data and converts it into process features and
                                   session features (not one-hot encoded)
    - dataset_stats.py: Calculates basic stats about the data using some output files from proc_based_user_sessions.py
    - run_all_features.py: One-hot encodes the processed data and runs linear regression and random forest classifiers
                           using both process and session features
    - run_process_features.py: One-hot encodes the processed data and runs linear regression and random forest classifiers
                               using only process features
    - run_process_name_features.py: One-hot encodes the processed data and runs linear regression and random forest classifiers
                                    using only the process name as a feature
    - final_process_analysis.csv: List of the 300 most popular process names in the dataset, used for one-hot encoding
    - acc_comp_graphs.py: Creates graphs for visualizing the results of running the classifiers on the processed data
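
The one-hot encoding mentioned above can be sketched as follows. This is a minimal
illustration and not the code from the run_*.py scripts: the helper names
top_process_names and one_hot, the sample process names, and the handling of names
outside the vocabulary (all-zero vector) are assumptions made for the example.

```python
from collections import Counter


def top_process_names(names, k=300):
    """Return the k most frequent process names, most common first."""
    return [name for name, _ in Counter(names).most_common(k)]


def one_hot(name, vocabulary):
    """Encode one process name as a 0/1 vector over the vocabulary.

    Assumption: names outside the vocabulary map to the all-zero vector.
    """
    return [1 if name == v else 0 for v in vocabulary]


# Hypothetical process names; the real names come from the filtered proc file.
observed = ["P16", "P25", "P16", "P7", "P16", "P25"]
vocab = top_process_names(observed, k=2)

print(vocab)
print(one_hot("P25", vocab))
print(one_hot("P7", vocab))
```

Restricting the vocabulary to a fixed list of popular names keeps the encoded vectors
at a fixed width, no matter how many distinct process names appear in the raw data.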


How to Run:
    1. Run filter_process_file.py to obtain a reasonably sized proc file (note: proc.txt is not included in the
       repository because it is too large)
    2. Run proc_based_user_sessions.py on the output file from filter_process_file.py to get the processed data
       (use create_proc_dataset.slrm if running on an HPC cluster)
    3. To get extra statistics about the dataset, run dataset_stats.py on the specified output files from
       proc_based_user_sessions.py
    4. Run run_all_features.py, run_process_features.py, and run_process_name_features.py on the dataset produced
       by step 2 (use proc_and_session.slrm, proc_only.slrm, and proc_name.slrm respectively if running on an
       HPC cluster)
    5. To obtain graphs, run acc_comp_graphs.py on the outputs from the classifiers
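
Because proc.txt is too large to handle comfortably, a filter like the one in step 1
would typically read it one line at a time. The sketch below shows that pattern only.
The rule actually used by filter_process_file.py is not described in this README, so
the filter criterion (keeping a chosen set of users), the function name filter_events,
the assumed five-field comma-separated layout (time, user, computer, process name,
start/end), and the sample rows are all assumptions.

```python
import csv
import io


def filter_events(lines, keep_users):
    """Yield rows whose user field is in keep_users, reading one line at a time."""
    for row in csv.reader(lines):
        if len(row) != 5:
            continue  # skip lines that do not match the assumed five-field layout
        time, user, computer, process, event = row
        if user in keep_users:
            yield row


# Hypothetical sample standing in for proc.txt; a real run would pass an open file.
sample = io.StringIO(
    "1,U1@DOM1,C1,P16,Start\n"
    "2,U2@DOM1,C2,P25,Start\n"
    "3,U1@DOM1,C1,P16,End\n"
)
kept = list(filter_events(sample, keep_users={"U1@DOM1"}))

for row in kept:
    print(",".join(row))
```

Since filter_events is a generator, only one line is held in memory at a time, and the
same function works unchanged on a file object opened with open().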