Data Analysis of a Movie Database

The Movie Database (TMDB) dataset contains information on about 10,000 movies and was obtained from Kaggle. It has 21 features, including id, imdb_id, popularity, budget, revenue, original_title, cast, homepage, director, tagline, overview, runtime, genres, production_companies, release_date, vote_count, vote_average, release_year, budget_adj, and revenue_adj. After inspecting the data, the following research questions were formulated around four of these features.

Questions:

Popularity
Director
Profit
Release year

Conclusions

The dataset lacked consistency. Data wrangling consisted of dropping null values and renaming and reordering the features (columns). Null values were dropped because they were few relative to the overall sample. Zero values, by contrast, were replaced with nulls rather than dropped: they were ambiguous, and dropping them would have removed more than half of the sample, so replacing them protected data integrity. The features were also renamed and reordered for clarity.
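The wrangling steps described above can be sketched in pandas roughly as follows. This is a minimal illustration and not the notebook's actual code: the tiny sample frame, the choice of `director` as the column checked for nulls, and the new column name `title` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for tmdb-movies.csv (values are illustrative only).
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "director": ["Dir 1", None, "Dir 2", "Dir 3"],
    "budget": [100, 0, 50, 0],
    "revenue": [300, 0, 0, 20],
})

cleaned = (
    df
    # Drop the few rows with a genuine null (assumed here: missing director).
    .dropna(subset=["director"])
    # Treat 0 in the money columns as "not recorded" instead of deleting rows.
    .replace({"budget": 0, "revenue": 0}, np.nan)
    # Rename for clarity ("title" is a hypothetical new name).
    .rename(columns={"original_title": "title"})
)

print(cleaned)
```

Replacing zeros with `NaN` keeps the rows available for questions that do not need budget or revenue, while pandas aggregations such as `mean` skip the missing values by default.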

A profit feature was computed by subtracting budget from revenue so that the research questions could be evaluated properly. The 75th percentile of profit was used as the threshold for high profitability, and the 25th percentile as the threshold for low profitability. The rating feature was chosen as the dependent variable because its distribution is relatively normal compared with the other features.
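As a rough sketch of the profit calculation and the percentile thresholds, assuming a DataFrame with `budget` and `revenue` columns (the numbers below are invented, and whether the thresholds are inclusive is an assumption):

```python
import pandas as pd

# Invented figures, only to show the mechanics.
df = pd.DataFrame({
    "budget": [10.0, 20.0, 30.0, 40.0, 50.0],
    "revenue": [5.0, 40.0, 90.0, 45.0, 250.0],
})

# Profit is revenue minus budget.
df["profit"] = df["revenue"] - df["budget"]

# 25th and 75th percentiles serve as the low and high profitability thresholds.
low_cut = df["profit"].quantile(0.25)
high_cut = df["profit"].quantile(0.75)

# Assumption: a movie exactly on a threshold counts as inside the group.
low_profit = df[df["profit"] <= low_cut]
high_profit = df[df["profit"] >= high_cut]

print(low_cut, high_cut)
```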

To simplify the analysis, the research questions were organized by feature. The first concerned popularity. From the distribution, about 90% of the movies were not popular. Among the top 200 most popular movies, about 49% were profitable, which indicates little correlation between popularity and profit. The scatter plot of popularity against rating likewise showed no relationship.
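One way to compute the share of profitable movies among the most popular ones is sketched below. It assumes "profitable" means profit above zero (the analysis may instead have used the high-profitability threshold), and it uses a top 3 on invented data in place of the top 200.

```python
import pandas as pd

# Invented sample; the real analysis used the top 200 movies.
df = pd.DataFrame({
    "popularity": [9.0, 1.0, 7.5, 0.3, 8.2, 0.5],
    "profit": [120.0, -3.0, -10.0, 4.0, 60.0, 1.0],
})

top_n = 3  # stands in for 200
top = df.nlargest(top_n, "popularity")

# Fraction of the most popular movies with a positive profit.
share_profitable = (top["profit"] > 0).mean()

# Pearson correlation between popularity and profit over all movies.
corr = df["popularity"].corr(df["profit"])

print(f"{share_profitable:.0%} of the top {top_n} are profitable")
```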

For the director feature, Colin Trevorrow was the most popular director, on the strength of only two movies. Woody Allen directed the most movies: 46, of which only one was highly profitable.
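A per-director summary of this kind can be built with `groupby`. The sketch below assumes popularity per director is measured as the mean popularity of their movies, which the source does not state; the director names and values are placeholders.

```python
import pandas as pd

# Placeholder directors and popularity scores.
df = pd.DataFrame({
    "director": ["Dir A", "Dir A", "Dir A", "Dir B", "Dir B", "Dir C"],
    "popularity": [1.0, 2.0, 3.0, 10.0, 20.0, 4.0],
})

# Count of movies and mean popularity for each director.
by_director = df.groupby("director")["popularity"].agg(
    movies="count",
    mean_popularity="mean",
)

most_prolific = by_director["movies"].idxmax()
most_popular = by_director["mean_popularity"].idxmax()

print(by_director)
```

A director with very few movies can top the popularity ranking this way, which matches the pattern reported above.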

For the profit feature, most movies made a profit between 0 and 100 million. The movie with the highest profit was Avatar. Profitability showed no correlation with rating.
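The two profit findings above correspond to a range check and an `idxmax` lookup. The following is a minimal sketch on invented titles and figures, assuming a `title` column and profit expressed in the same currency units as budget and revenue.

```python
import pandas as pd

# Invented titles and profits.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "profit": [50e6, 2.5e9, -10e6, 80e6],
})

# Share of movies whose profit falls between 0 and 100 million (inclusive).
share_0_100m = df["profit"].between(0, 100e6).mean()

# Title of the movie with the highest profit.
top_movie = df.loc[df["profit"].idxmax(), "title"]

print(top_movie, share_0_100m)
```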

Finally, for the release_year feature, most of the movies were released between 2010 and 2015. The year with the most releases was 2014, with a total of 680 movies.
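Counting releases per year is a direct use of `value_counts` on the `release_year` column. The sample below is made up and only shows the shape of the calculation.

```python
import pandas as pd

# Made-up release years.
df = pd.DataFrame({"release_year": [2013, 2014, 2014, 2015, 2014, 2009]})

# Number of movies released in each year.
per_year = df["release_year"].value_counts()

busiest_year = per_year.idxmax()
busiest_count = per_year.max()

# Share of movies released between 2010 and 2015 (inclusive).
share_2010_2015 = df["release_year"].between(2010, 2015).mean()

print(busiest_year, busiest_count)
```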

Limitations

The Movie Database (TMDB) offers one of the best collections of movie data. However, most features (columns) had null values and many outliers. The outliers in most features were extreme, which made comparison between features very difficult. The features also had thousands of zero values: for example, budget and revenue had 5,578 and 5,888 zeros respectively. These zero values were not dropped but replaced with nulls to protect data integrity. This lack of consistency in the dataset caused some setbacks in the analysis.

Reference

The following resources were used in carrying out the analysis of the movie database (TMDB):

The most influential factor of IMDB movie rating

What makes a movie hit a jackpot

The TMDB 5000 movie dataset

TMDB movies analysis

Pandas DataFrame groupby function for grouping features

How to plot a pandas dataframe with matplotlib

The query function in pandas dataframe

Dropping null values in pandas dataframe

Exploding elements in pandas dataframe
