Movie Box Office
2023-04-30
Chapter 1 Proposal
1.1 Data
The data comes from Kaggle: https://www.kaggle.com/datasets/rajugc/imdb-movies-dataset-based-on-genre. And the original source is IMDB:https://www.imdb.com/interfaces/. For simplicity, I do not intend to use all the data, so I just used data from Kaggle.
1.2 Modeling Goal
The main goal would be predicting the box office of a new movie based on its IMDB rating, certificate, genre, run time, year, casts, and directors. Some concerns about the practicability is that casts and directors contain many names, to simplify it, I decide to only include the top three actors and only one main director. Then I will convert them to dummy variables, i.e., each actor or director as a feature. Also note that, what we want to predict is the ultimate gross box office of the movie, so we can collect the new movie’s rating after it is released for certain time.
I am also interested in finding whether people prefer old movies or new released movies. For example, I can calculate the mean ratings of movies in each year and see how the trend looks like.
1.3 Models
The first model that I intend to use will be the ordinary linear regression model. An issue would be that the underlying relationship is not linear. So I may implement generalized linear model with a Gamma distribution (or I can do some tests to see what kind of exponential distributions that box office is closet to) as the second, decision tree as the third, and random forest as the fourth model.