- This is a data analysis project that investigates anime scores and the variables that may affect them. It analyses a dataset of anime shows that includes the title, genre, episode count, release year, source material, licensors, producers, and user scores.
- The project is implemented in Python using the data analysis libraries pandas and Matplotlib. The dataset is cleaned and preprocessed to handle missing values, incorrect data types, and inconsistencies. Various data analysis techniques are then applied to gain insights into the relationship between anime scores and variables such as themes and genres, demographics, and release year.
We want to predict how much viewers will like a new anime based on factors such as the producing company, genres, age rating, and number of episodes.
- With the wide variety of anime available, viewers may feel overwhelmed when choosing what to watch. Our analysis aims to help predict and improve viewer satisfaction.
- In addition, an average 13-episode anime can cost up to US$2 million to produce. Stakeholders in the anime industry may therefore want a data-driven approach to decisions about the factors that affect viewership, so that they can market anime to specific audiences, improve viewership ratings, and maximise profit while reducing the risk of financial losses.
https://www.kaggle.com/datasets/harits/anime-database-2022
- Understand the distribution of anime scores and identify any trends or patterns
- Explore the relationship between anime scores and different variables
- Visualize the findings using bar charts and other graphical representations
Data Cleaning: This section includes the cleaning and preprocessing of the raw dataset, handling missing values, data types, and inconsistencies.
Data Analysis: This section includes the data analysis techniques applied to gain insights into the relationship between anime scores and different variables, such as genre, episode count, release year, and licensors. This includes the results of the data analysis and the visualizations created to present the findings in an easy-to-understand manner.
Machine Learning: This section includes the machine learning models we used on our dataset: Multivariate Regression, Random Forest Classifier, and K-Means Clustering.
In Exploratory Data Analysis, we analyse the following variables and how they may affect the Score of anime:
- Type of anime
- Source of anime
- Themes_Genres of anime
- Studios of anime
- Demographics of anime
- Producers of anime
- Licensors of anime
- Rating of anime
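As a rough illustration of this kind of per-variable comparison, the sketch below groups a small DataFrame by one variable and summarises the Score for each category. The column names `Type`, `Source`, and `Score` follow the variable names listed above; the rows are toy values invented for the example and are not taken from the real dataset.

```python
import pandas as pd

# Toy rows invented for illustration; not taken from the real dataset.
df = pd.DataFrame({
    "Type": ["TV", "TV", "Movie", "Movie", "OVA"],
    "Source": ["Manga", "Original", "Manga", "Novel", "Manga"],
    "Score": [8.0, 7.0, 8.5, 7.5, 6.0],
})

# Mean Score and number of shows for each category of one variable,
# sorted so the highest-scoring category comes first.
score_by_type = (
    df.groupby("Type")["Score"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(score_by_type)
```

The same pattern can be repeated with any other categorical column in place of `Type`, and the resulting table can be passed to a bar chart.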
- Multivariate Linear Regression
- Random Forest Classifier
- K-Means Clustering
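A minimal sketch of how these three models can be fitted with scikit-learn is shown below. It runs on synthetic data, not the anime dataset, and the way the score is binned into a high/low class for the classifier is an assumption made for the example; it is not necessarily how the project defined its target.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for two numeric features of each show.
X = rng.normal(size=(200, 2))
# Synthetic score: linear in the features plus a little noise.
score = 7.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Multivariate linear regression: predict the numeric score.
reg = LinearRegression().fit(X, score)

# Random forest classifier: predict a binned score.
# Assumption for this sketch: class 1 means "above the median score".
high_score = (score > np.median(score)).astype(int)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, high_score)

# K-means clustering: group shows by their features without using the score.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("regression coefficients:", reg.coef_)
print("classifier training accuracy:", clf.score(X, high_score))
print("first cluster labels:", km.labels_[:10])
```

Regression suits a continuous target such as the score itself, classification suits a score that has been grouped into classes, and clustering needs no target at all.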
Different Machine Learning (ML) techniques
- Random Forest Classifier
- Multivariate Linear Regression
Encoders
- One Hot Encoder
- Label Encoder
- Ordinal Encoder
- Nominal Encoder
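The sketch below shows how three of these encoders behave in scikit-learn on small toy inputs. The category values (`Manga`, `G`, `low`, and so on) and the ordering given to `OrdinalEncoder` are assumptions chosen for illustration, not values confirmed from the dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# One-hot encoding: one binary column per category, with no implied order.
# Categories are sorted alphabetically: Manga, Novel, Original.
source = np.array([["Manga"], ["Original"], ["Novel"], ["Manga"]])
onehot = OneHotEncoder().fit_transform(source).toarray()

# Ordinal encoding: integers that follow an order we supply explicitly.
rating = np.array([["G"], ["R"], ["PG-13"], ["G"]])
ordinal = OrdinalEncoder(categories=[["G", "PG-13", "R"]]).fit_transform(rating)

# Label encoding: integer codes for a 1-D target, assigned in sorted order
# (high=0, low=1, mid=2), so the codes carry no ranking.
labels = LabelEncoder().fit_transform(["low", "high", "mid", "high"])

print(onehot)
print(ordinal.ravel())
print(labels)
```

One-hot encoding avoids implying an order between categories at the cost of more columns, while ordinal encoding keeps one column and is appropriate when the categories have a natural order.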
Data Imputation
- most-frequent (for categorical)
- mean (for numerical)
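These two strategies can be sketched with scikit-learn's `SimpleImputer`, as below. The arrays are toy values invented for the example, and using `SimpleImputer` here is an assumption; the same result could be obtained in other ways, such as with pandas `fillna`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Categorical column with one missing value (toy data).
source = np.array([["Manga"], [np.nan], ["Manga"], ["Novel"]], dtype=object)
# Numerical column with one missing value (toy data).
episodes = np.array([[12.0], [np.nan], [24.0]])

# Replace missing categories with the most frequent category.
filled_source = SimpleImputer(strategy="most_frequent").fit_transform(source)

# Replace missing numbers with the column mean, here (12 + 24) / 2.
filled_episodes = SimpleImputer(strategy="mean").fit_transform(episodes)

print(filled_source.ravel())
print(filled_episodes.ravel())
```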
- Data Cleaning and Preparation: Andrea
- EDA and Visualisation: Andrea, Hain Eu
- Multivariate Regression: Chi, Hain Eu
- Random Forest Classification: Andrea
- K-Means Clustering: Chi, Hain Eu
- Presentation Slides Deck: Hain Eu, Chi, Andrea
- Presentation Script: Hain Eu, Chi, Andrea
- Presentation Voiceover: Hain Eu
- GitHub README: Andrea
- https://www.kaggle.com/datasets/harits/anime-database-2022
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- https://www.kaggle.com/code/dima806/anime-scores-explain#notebook-container
- https://www.kaggle.com/code/infinator/welcome-to-the-world-of-anime
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
- https://towardsdatascience.com/catboost-regression-in-6-minutes-3487f3e5b329
- https://www.webucator.com/article/python-color-constants-module/