CS5228-Data-Mining

Course project for CS5228 Data Mining, Academic Year 2021-2022 Semester 2.

Group members: Yutong Xia, Ziji Shi, Ronghua Zhu.

Overview

In this project, we build a predictive model for the resale price of condominiums in Singapore. Through exploratory data analysis, we investigate the relationships between features and identify data-quality problems, including missing values and outliers. We then clean and transform the dataset, and augment it with landmark information (distances to nearby facilities), which improves model performance. We explore three types of models to predict the resale price: gradient boosted trees, random forests, and ridge regression. Random forest outperforms the others in our setting. To cope with the large hyperparameter search space, we also explore Bayesian optimization. We believe this study can provide insight into the factors that affect property prices and help inform future policy-making.

You can visualize the property price through the website here.

Our report is available as CS5228_Report.pdf in the repository root.

Project Structure

├── CS5228_Report.pdf
├── EDA
├── Preprocessing
├── GeoDataAugment (code to calculate nearest facilities to augment dataset)
├── README.md
├── Regression (model and experiment scripts)
├── dataset
│   ├── README.md
│   ├── auxiliary-data
│   └── distance
├── distance_attributes (data for landmark info)
│   ├── test_distance_attributes.csv
│   └── train_distance_attributes.csv
├── model (serialized model weights)
├── plots (plots for the report)
├── prediction (submissions to Kaggle)
└── requirements.txt

Reproducing the results

Installation

pip install -r requirements.txt

Exploratory Data Analysis

Follow EDA/data_visualisation.ipynb.

Data Preprocessing

Duplicate removal and missing-value imputation: Preprocessing/missing_value_part1.ipynb
Outlier removal: Preprocessing/Dataset_Preparation.ipynb
Augment dataset with distance to public facilities: GeoDataAugment/geovisualization_and_distance.ipynb
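
The augmentation step above adds, for each property, its distance to nearby public facilities. The notebook's exact method is not reproduced here; the following is a minimal sketch that assumes great-circle (haversine) distance on latitude/longitude pairs. The function names and the coordinates are hypothetical and only illustrate the idea.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_facility_km(property_location, facilities):
    """Distance from one property to the closest facility in a list of (lat, lon) pairs."""
    lat, lon = property_location
    return min(haversine_km(lat, lon, f_lat, f_lon) for f_lat, f_lon in facilities)


# Hypothetical coordinates, not taken from the dataset.
property_location = (1.3000, 103.8000)
stations = [(1.3000, 103.8100), (1.3500, 103.9000)]
dist = nearest_facility_km(property_location, stations)
print(f"nearest station: {dist:.2f} km")
```

A brute-force minimum like this is fine for a few thousand properties and facilities; for larger inputs a spatial index would be the usual replacement.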

Model and Evaluation

XGBoost: Regression/xgboost_v2.ipynb
Ridge Regression: Regression/rf_and_ridge.ipynb
Random Forest: Regression/rf_and_ridge.ipynb
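
For readers who want a quick feel for the comparison before opening the notebooks, here is a rough sketch of evaluating the three model families with cross-validation. It is not the notebooks' code: it assumes scikit-learn is installed, uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost, runs on synthetic data rather than the project dataset, and picks RMSE as the metric. The helper name `compare_models` and all hyperparameter values are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def compare_models(X, y, cv=5, seed=0):
    """Return the mean cross-validated RMSE for each model family."""
    models = {
        # Stand-in for XGBoost; the project itself uses Regression/xgboost_v2.ipynb.
        "gradient_boosting": GradientBoostingRegressor(n_estimators=50, random_state=seed),
        "random_forest": RandomForestRegressor(n_estimators=50, random_state=seed),
        "ridge": Ridge(alpha=1.0),
    }
    scores = {}
    for name, model in models.items():
        # scikit-learn reports negated RMSE so that larger is better; flip the sign.
        neg_rmse = cross_val_score(
            model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
        )
        scores[name] = float(-neg_rmse.mean())
    return scores


# Synthetic regression data, used only so the sketch runs on its own.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
rmse_by_model = compare_models(X, y)
for name, rmse in sorted(rmse_by_model.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {rmse:.2f}")
```

The ranking on synthetic data says nothing about the ranking on the condominium dataset; the notebooks above are the reference for the reported results.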
