Neural Network Predicting Movie Box Office Performance

Alex Larson

ECE 539

Fall 2013
Abstract

        The movie industry is a large part of modern-day culture. With the rise of websites like Netflix, where people are able to watch hundreds of movies at any time, film's reach is larger than ever. Movie studios are always trying to come up with the next big thing to make the largest profit. Studios have been adapting books, plays, and comic books to cash in on already popular intellectual properties, and they have been remaking older films in the hope that the remakes will match the success of their predecessors. Making a movie is an expensive endeavor, and people want to know whether a remake, an adaptation, or an entirely new idea will be successful. Some current prediction efforts use data from sites like Google and Wikipedia: studies have used the number of Google searches a movie receives, or the number of hits its Wikipedia page gets, to predict its box office success.

        The above methods have been shown to work well, but I believe the success of a movie can also be predicted from many of its features. These features may include genre, budget, release date, which studio is making the movie, whether or not the movie is a new intellectual property, the actors involved, the MPAA rating (PG, PG-13, etc.), and many more. Using these features, one should be able to make a prediction of a movie's potential box office success. I propose to use artificial neural network methods to classify and predict a movie's potential box office success. Using some of the features described above, I would like to create a data set of movies from the past few years. After a good set of features and classes has been established, I will apply artificial neural network algorithms and experiment with various pattern recognition classifiers, such as the multi-layer perceptron (MLP) and the k-nearest neighbor classifier, to predict the potential box office success of a movie.

Introduction and Motivation

        The movie industry is a large part of modern-day culture. Many companies look to profit off the success of a movie. The distributor of the movie gains the profit from ticket sales, while many other companies advertise and promote their products by featuring them in movies or by associating the movie with their own products to boost revenue. One major motivation behind this project is to help investors choose which movies could have the highest possible return. Movies are very expensive to make, and investors want to know whether the payoff will be worth their investment. Movies are also something I enjoy very much; like many people, I think they are a wonderful form of entertainment. It was my hope that this project would be a fun and interesting way to look deeper into movies and their box office performance.

Related Work

There have been a few recent projects that have dealt with predicting movie box office performance. One study was based on the hits of a movie's Wikipedia page. The researchers analyzed the activity of editors on the online encyclopedia Wikipedia and, based on this data, built a minimalistic predictive model of a movie's box office success [1]. Google also performed research on movie box office success. Google used trailer-related searches for a particular movie, along with the franchise status of the movie and the season, to predict the opening weekend of a movie with 94% accuracy [2].

Problem Statement

The goal of this project is to predict the potential box office success of a given movie based only on the characteristics known at its release.

Data

The data for the project was acquired from the-numbers.com [4]. This website tabulates many movie characteristics and statistics. Movie data from the years 2008, 2009, 2010, 2011, and 2012 was obtained, along with an incomplete set for 2013, since this project was performed late in 2013. While incomplete, the 2013 data was still a good representation of movies released earlier in that year. The features extracted from the data were as follows: the movie's release month, distributor, genre, MPAA rating, and whether or not the movie was a sequel. Numeric values were assigned to the distributor, genre, and MPAA rating categories. For each year, a subset of movies was selected at random from the top-performing movies for that year. Based on each movie's yearly gross, I chose to divide the data into 3 classes: movies grossing less than $49 million, between $49 million and $91 million, and more than $91 million. This data was then translated into machine-readable text files that were used by the MATLAB programs that ran the experiments for this project.
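
As a minimal sketch of the class-assignment step, the labels could be derived from the yearly gross as follows (the function name and the numeric labels 1-3 are illustrative assumptions, not taken from the project's actual code):

    % Assign one of three class labels from a vector of yearly grosses (in dollars).
    % Thresholds follow the split described above: <$49M, $49M-$91M, >$91M.
    function labels = assignClasses(gross)
        labels = zeros(size(gross));
        labels(gross < 49e6)                  = 1;   % lowest-grossing class
        labels(gross >= 49e6 & gross <= 91e6) = 2;   % middle class
        labels(gross > 91e6)                  = 3;   % highest-grossing class
    end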

Experiments

Using the MATLAB programs from the ECE 539 website, various experiments were done with the k-nearest neighbor classifier, the maximum likelihood classifier, and the multi-layer perceptron. The initial results of experimentation were not promising: each classifier was achieving, on average, around a 30% classification rate. This value is unacceptable because it is essentially the same as random guessing when there are 3 possible classification labels. From here the data was reevaluated. I plotted histograms of each feature for each class label and found that there were many outliers in the distributor and genre features. Some smaller distribution studios would have a successful movie in one of the years where data was collected but not in the others. Similarly, some genres, such as westerns and musicals, are simply not represented enough in the data. These outliers were then removed from the data. The values assigned to the features were also reorganized: distributors with more successful movies were given higher values, and the same was done for genre and MPAA rating.
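
A rough sketch of this re-mapping for the distributor feature, assuming the raw codes are stored in a vector dist with the corresponding grosses in gross (both variable names are hypothetical), might look like:

    % Re-map distributor codes so that distributors with higher mean gross
    % receive larger numeric values.
    codes = unique(dist);
    avgGross = zeros(size(codes));
    for i = 1:length(codes)
        avgGross(i) = mean(gross(dist == codes(i)));   % mean gross per distributor
    end
    [~, order] = sort(avgGross);                       % ascending: least successful first
    newDist = zeros(size(dist));
    for i = 1:length(codes)
        newDist(dist == codes(order(i))) = i;          % rank becomes the new code
    end

The same procedure would apply to the genre and MPAA rating features.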

Results

For all classifiers, cross validation was used. I would leave one year out of the training data and train the classifier with the data from the remaining years. After training had completed, I would test the trained classifier using the data from the year that was held out of the training data.
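
A simplified sketch of this leave-one-year-out scheme, assuming a feature matrix X (one row per movie), a label vector y, a vector yr holding each movie's release year, and a placeholder handle classifyFcn standing in for whichever classifier is being evaluated:

    % Leave-one-year-out cross validation over the six years of data.
    years = 2008:2013;
    rate  = zeros(size(years));
    for k = 1:length(years)
        test  = (yr == years(k));          % hold out one year for testing
        train = ~test;                     % train on all remaining years
        pred  = classifyFcn(X(train,:), y(train), X(test,:));
        rate(k) = mean(pred == y(test));   % classification rate for that year
    end
    fprintf('Average classification rate: %.1f%%\n', 100 * mean(rate));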

The k-nearest neighbor classifier was the fastest of the 3 classifiers used. For the kNN classifier I tested many different values of k. The best results were achieved with 14 nearest neighbors, which gave an average classification rate of around 48%, an improvement over the first implementation.

KNN Classifier

Testing Data   2008   2009   2010   2011   2012   2013   Average

C Rate (%)       48     64     52     56     32     36        48

Confusion Matrix

31   12    7
24   14   12
15    8   27
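
The course programs themselves are not reproduced here, but a minimal kNN classifier consistent with the setup above (Euclidean distance, majority vote, k = 14) could be written with core MATLAB operations as follows; the function and variable names are illustrative:

    % Minimal k-nearest-neighbor classifier.
    % Xtr/ytr: training features and labels, Xte: test features, k: neighbor count.
    function pred = knnClassify(Xtr, ytr, Xte, k)
        pred = zeros(size(Xte,1), 1);
        for i = 1:size(Xte,1)
            d = sum((Xtr - repmat(Xte(i,:), size(Xtr,1), 1)).^2, 2);  % squared distances
            [~, idx] = sort(d);              % nearest training samples first
            pred(i) = mode(ytr(idx(1:k)));   % majority label among the k neighbors
        end
    end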

I then performed classification of the data using the maximum likelihood classifier. This classifier also computes its results very quickly, and its results do not change between runs, so it only had to be run once. On average it performed about as well as the kNN model.

Maximum Likelihood Classifier

Testing Data   2008   2009   2010   2011   2012   2013   Average

C Rate (%)       48     56     52     56     48     24      47.3

Confusion Matrix

34   10    6
25   10   15
11   12   27
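
The report does not restate the underlying model, but maximum likelihood classifiers of this kind typically fit one multivariate Gaussian per class; under that assumption (and assuming the Statistics Toolbox is available for mvnpdf), a minimal sketch is:

    % Maximum likelihood classifier: fit a Gaussian to each class and assign
    % each test sample to the class whose Gaussian gives the highest likelihood.
    function pred = mlClassify(Xtr, ytr, Xte)
        classes = unique(ytr);
        logLik = zeros(size(Xte,1), length(classes));
        for c = 1:length(classes)
            Xc    = Xtr(ytr == classes(c), :);
            mu    = mean(Xc, 1);
            Sigma = cov(Xc) + 1e-6 * eye(size(Xc,2));   % small ridge keeps Sigma invertible
            logLik(:,c) = log(mvnpdf(Xte, mu, Sigma));
        end
        [~, idx] = max(logLik, [], 2);
        pred = classes(idx);
    end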

Finally, classification was done using the multi-layer perceptron. Various network configurations were experimented with. This program took the longest of the three classifiers to run, and it was run over multiple trials because the results change from trial to trial. The MLP training showed promise, classifying around 60% of samples correctly during training, but on the actual testing data it performed similarly to the kNN and maximum likelihood classifiers, with an average classification rate around 47.3%.
MLP Back Propagation

Testing Data   2008   2009   2010   2011   2012   2013   Average

C Rate (%)       52     48     48     52     40     44      47.3

Confusion Matrix

23   14   13
15   18   17
13    8   29
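
The back-propagation program from the course website is not reproduced here; as a rough illustration, an equivalent network can be built with MATLAB's Neural Network Toolbox (the hidden-layer size of 10 is an arbitrary choice for the sketch, not the configuration used in the experiments):

    % MLP trained with back propagation via the Neural Network Toolbox.
    % Xtr/ytr: training data and labels (1, 2, or 3), Xte: test data.
    Ttr = full(ind2vec(ytr'));        % one-hot target matrix, one column per sample
    net = patternnet(10);             % one hidden layer with 10 neurons (illustrative)
    net = train(net, Xtr', Ttr);      % backpropagation-based training (default: scaled conjugate gradient)
    scores = net(Xte');               % class scores for the test set
    [~, pred] = max(scores, [], 1);   % predicted class = index of the largest score
    pred = pred';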

Discussion

The results of these experiments were not superb, but they were an improvement over my preliminary classification runs. Some interesting predictions from the MLP model for 2013: Iron Man 3, Hunger Games: Catching Fire, and Oblivion were all correctly predicted into the most successful class. Some interesting misclassifications: Gravity, which belonged in the most successful category, was classified into the worst, while After Earth and The Internship both did poorly but were predicted to do well.

All three classifiers tended to do better classifying movies in either the low class or the high class; for the middle class they seldom chose correctly. There may not be enough of a correlation between this set of feature vectors and the chosen class labels. Movie performance can be erratic, as shown in the preliminary testing: every so often an outlier comes out of nowhere from a lesser-known studio and does extremely well, while on the other hand there are huge flops from studios that normally put out great movies.

In the end, this classifier did not perform as well as the Google or Wikipedia predictors [1][2]. One improvement to the data set would be to increase the sample size of the movies, which may lessen the effect that outliers had on classification. Adding more features to the feature vectors could also improve performance; other characteristics, such as a movie's budget, leading actor, or director, could also have an effect on the classification.

References:

[1] Mestyán M., Yasseri T., Kertész J. (2013) Early Prediction of Movie Box Office Success Based on Wikipedia Activity Big Data. PLoS ONE 8(8): e71226. doi:10.1371/journal.pone.0071226

[2] Chen, Andrea; Panaligan, Reggie (2013) Quantifying Movie Magic with Google Search.

[3] http://boxofficemojo.com

[4] http://www.the-numbers.com