HW6: Movie Reviews (20 Points)

Due Monday 4/20/2020

Overview / Logistics

The purpose of this assignment is to give you practice with Python dictionaries and with machine learning for natural language processing. By the end of this assignment, you will have a system for automatically determining whether a movie review is positive or negative. Click here to download the starter code for this assignment.

What to submit: When you are finished, you should submit a file MovieReviews.py to Canvas, along with answers to the following as a comment on Canvas:

  • Which two movie reviews did you use in part 3? What scores did you get?
  • Did you work with a buddy on this assignment? If so, who?
  • Are you using up any grace points to buy lateness days? If so, how many?
  • Approximately how many hours did it take you to finish this assignment? (I will not judge you for this at all; I am simply using it to gauge whether the assignments are too easy or too hard.)
  • Your overall impression of the assignment. Did you love it, hate it, or were you neutral? One word answers are fine, but if you have any suggestions for the future let me know.
  • Any other concerns that you have. For instance, if you have a bug that you were unable to solve but you made progress, write that here. The more clearly you articulate the problem, the more partial credit you will receive. (It's fine to leave this blank.)

The Problem

In class, we have seen that if we can "vectorize" our data by giving it coordinates, we can measure distances between data points in a meaningful way. We used this both to visualize data and to do supervised learning, or teaching the computer how to categorize data based on examples.

In this assignment, you will do supervised learning using a vectorized representation of text. Given a collection of 1000 positive and 1000 negative movie reviews from the early 2000s (citation), you will train a model to tell the difference between negative and positive reviews. You will examine this model to see which words are telltale signs of positive and negative reviews, and you will then find a movie review of your own that you believe is positive or negative and score it with the model.

Background: Vectorizing Text with Binary Bag of Words (BBOW)

When we worked with the images of digits, vectorizing was straightforward: every image had the same number of pixels, and we treated each pixel as a dimension. By contrast, it is not immediately obvious how to turn a text document into a vector in the same manner, since the documents in a collection don't all have the same number of words. In this assignment, we will explore a simple "binary bag of words" (BBOW) approach, in which we completely disregard the order of the words and simply keep track of which words occur in each document. For example, the phrase "first go left, then go right, then go left, then go right and right again" would simply have the words ["go", "again", "left", "then", "right", "first", "and"], in no particular order. Even though this representation loses all information about word order, it works surprisingly well in many natural language processing tasks.

As usual, we will set up a matrix in which every row is a data point and every column is a dimension. In a BBOW representation, each row corresponds to a document, and each column corresponds to a word. We call the set of all words across all columns the vocabulary of our representation. For a particular document, we put a 1 in a column if the corresponding word occurs in that document, and a 0 otherwise. To demonstrate a data matrix in a BBOW representation, we show below a limerick by Kaitlyn Guenther in which every "document" is simply a single line of text. The data matrix then looks like this (the columns can be in an arbitrary order, as long as the order is consistent across documents):

| Document | there | once | was | a | wonderful | star | who | thought | she | would | go | very | far | until | fell | down | and | looked | like | clown | knew | never |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| There once was a wonderful star | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Who thought she would go very far | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Until she fell down | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| And looked like a clown | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |
| She knew she would never go far | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |

Code To Write

Part 1: BBOW Representation (10 Points)

In the first part of this assignment, you will create a bag of words representation from the set of all positive and negative reviews. You should loop through every word in every document (much like you looped through every word in every tweet in the last homework) to determine the vocabulary. You should then build a data matrix with a row for every document and a column for every word in the vocabulary (see the extra credit below for a memory-efficient sparse version). You should put a 1 in row i, column j if document i contains word j.

Extra Credit (+2)

If you've done this properly, you will end up with a vocabulary of over 50,000 words. However, each review only uses about 340 of these words on average, which means that each row has a ton of zeros. It is wasteful to store all of those zeros in memory, and it also slows down computation. It's much better to use a data structure known as a sparse matrix, which has a mechanism to only store the values that are nonzero. Thankfully, the scipy library in Python makes this quite easy: you can simply initialize a sparse matrix with code along the lines of the sketch below, where N is the number of rows and d is the number of columns. None of the rest of your code needs to change.
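A minimal sketch (the example sizes for N and d are placeholders; use your actual counts):

```python
# A lil_matrix only stores the nonzero entries, but it still supports
# X[i, j] = 1 style assignment, so code written for a dense array
# carries over unchanged.
from scipy import sparse

N = 2000    # number of documents (rows): 1000 positive + 1000 negative
d = 50000   # vocabulary size (columns)

X = sparse.lil_matrix((N, d))
X[0, 42] = 1  # set entries exactly as you would with a dense array
```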

Tips

  • To figure out the words in the vocabulary, you can use a dictionary whose key is a lowercase word and whose value is simply True. You won't ever need the value; you just need a vocabulary that covers all words across all documents, so the keys of this dictionary are sufficient.
  • After you create a list of the words in the vocabulary, you need to choose which column in your data matrix corresponds to which word. The columns can be in any order, but it's probably easiest to order them the same way they are ordered in the list. To make it easy to determine each word's index in the list, create a dictionary whose key is a word and whose value is the index of that word in the list (see the sketch after this list).
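Putting both tips together, here is a minimal sketch on a toy collection (the names documents, vocab, words, and word_index are placeholders, not part of the starter code):

```python
# Toy stand-in for the real collection: each "document" is a list of
# lowercase words (in the assignment these come from the review files).
documents = [["there", "once", "was", "a", "wonderful", "star"],
             ["she", "knew", "she", "would", "never", "go", "far"]]

# Tip 1: collect the vocabulary as the keys of a dictionary; the True
# values are never used.
vocab = {}
for doc in documents:
    for word in doc:
        vocab[word] = True

# Tip 2: fix a column order, then map each word to its column index so
# it is fast to look up which column a word belongs to.
words = list(vocab.keys())
word_index = {word: j for j, word in enumerate(words)}
```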

Part 2: Classification Model And Telltale Words (5 Points)

Once you have the data matrix set up, you can build a model to predict whether a review is positive or negative. As it turns out, most of the words in the vocabulary don't tell us much about whether a review is positive or negative, so most of the dimensions in our vectorization don't matter. This means that an approach like k-nearest neighbors wouldn't perform very well in this context, since the distance would be swamped by unimportant dimensions. Instead, we're going to use a regression to learn which dimensions are important.

Recall that in class we used ridge regression to predict the chance of getting accepted to graduate school based on some independent variables (e.g., GRE score, GPA). Here, we need to predict a binary variable, which is 1 (positive) or 0 (negative), given the independent variables, which in our case are words. There is a related type of regression called logistic regression which is set up for exactly this binary scenario, and you will use it in this part of the assignment. Assuming you have set up a matrix X as your data matrix, and that the first 1000 rows of X correspond to positive reviews and the last 1000 rows to negative reviews, code along the following lines will accomplish this.
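A minimal sketch, assuming X is the data matrix from Part 1 and scikit-learn is installed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Labels: 1 for the first 1000 rows (positive reviews),
# 0 for the last 1000 rows (negative reviews).
y = np.zeros(2000)
y[0:1000] = 1

clf = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: returns one accuracy score per fold.
scores = cross_val_score(clf, X, y, cv=5)
print("Mean accuracy over the 5 folds:", np.mean(scores))

# Fit on all of the data so that clf.coef_ is available for the
# telltale-words analysis below.
clf.fit(X, y)
```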

The code uses "5-fold cross-validation" to evaluate the model; that is, it splits the data up into 5 random subgroups, referred to as "folds." For each fold, it trains a model on the other 80% of the data and tests it on the 20% held out in that fold. The score at the end is a number between 0 and 1 indicating the fraction of predictions that were correct across all folds. Amazingly, if you have done this properly, you should get a score close to 1, which indicates that this model is nearly perfect on this dataset.

Once you are at the point where you're getting a score close to 1 for your classification, you can then analyze the model. If you have d words in your vocabulary (and hence d columns in your data matrix X), then clf.coef_.flatten() contains an array of d weights, one for each column. If the weight in a particular column is positive, then the corresponding word contributes towards making a review positive. Conversely, if the weight in a particular column is negative, the corresponding word contributes towards making a review negative. To complete this task, print out the 15 words with the most positive coefficients and the 15 words with the most negative coefficients.
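One way to pull out those words, assuming clf has been fit as above and words is the list of vocabulary words in column order (a sketch, not the only way):

```python
import numpy as np

w = clf.coef_.flatten()   # one weight per column, i.e., per word
order = np.argsort(w)     # column indices, most negative weight first

print("15 most negative words:")
for j in order[:15]:
    print("  ", words[j], w[j])

print("15 most positive words:")
for j in order[-15:][::-1]:
    print("  ", words[j], w[j])
```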

Part 3: Reviews of Your Choice (5 Points)

Now that you have a model, you should apply it to data beyond the training set to see how well it does. Go out on the internet and find a review that you think is very positive, and find a review that you think is very negative. Then load each review in and vectorize it according to your vocabulary, as in the sketch below. If there's a word in your review that isn't in the vocabulary of the training set, simply ignore it.
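A minimal sketch of the vectorization step, assuming word_index is the word-to-column dictionary from Part 1 (the function name vectorize is a placeholder):

```python
import numpy as np

def vectorize(review_words, word_index):
    """Turn a list of lowercase words into a BBOW vector over the
    training vocabulary, ignoring out-of-vocabulary words."""
    x = np.zeros(len(word_index))
    for word in review_words:
        if word in word_index:   # skip words not seen in training
            x[word_index[word]] = 1
    return x
```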

Once you have a vectorization, you can see how positive or negative the model says it is. If your vector is x, you can simply multiply x entrywise by the model's weights and sum the result, as in the sketch below; this sums up every weight in the model associated with words that the review uses. If the review is positive, you should get a positive number back, and if the review is negative, you should get a negative number back.
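For example, continuing the sketches above:

```python
# Sum the model weights of the words the review uses; a positive total
# means the model reads the review as positive, a negative total as negative.
score = np.sum(clf.coef_.flatten() * x)
print(score)
```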