Programming Assignment 2: Sentiment Analysis

Introduction

Sentiment Analysis is the problem of determining the general attitude expressed by some text. For instance,
we would like to have a program that could look at the text “The film was a breath of fresh air” and realize that
it was a positive statement while “It made me want to poke out my eyeballs” is negative.

One algorithm that we can use for this problem is to assign a numeric value to each word based on how
positive or negative that word is and then score the text as a whole based on the average sentiment value
of the individual words. The challenge here is in finding a way to assign positive or negative values to
individual words.

For the purposes of this project we will assign values to words by analyzing a large collection of movie
reviews collected from the Rotten Tomatoes website:

movie_reviews.txt

This file contains 8529 movie reviews along with their scores. The text of each movie review is
accompanied by a human-generated evaluation of whether the review is positive or negative overall.

The first few lines of the file look like this:

1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .  
4 This quiet , introspective and entertaining independent is worth seeking .    
1 Even fans of Ismail Merchant 's work , I suspect , would have a hard time sitting through this one .  
3 A positively thrilling combination of ethnography and all the intrigue , betrayal , deceit and murder of a Shakespearean tragedy or a juicy soap opera .  
1 Aggressive self-glorification and a manipulative whitewash .  
4 A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis .  
1 Narratively , Trouble Every Day is a plodding mess .

Note that each review starts with a number 0 through 4 with the following meaning:

0 : negative
1 : somewhat negative
2 : neutral
3 : somewhat positive
4 : positive

Individual words will be scored by computing the average rating of all of the reviews that contain that
word. For example, if we were only working with the reviews included above, the word “and” would be
assigned the score $(4 + 3 + 1) / 3 = 2.\overline{6}$: it appears in the 2nd, 4th, and 5th reviews, which have scores of
4, 3, and 1 respectively.

Honor Code

This is an individual assignment. All of the work that you submit must be written by you, based on
your own understanding of the material. Representing someone else’s work as your own, in any form,
constitutes an honor code violation.

Functionality

The finished application must read a file of movie reviews provided as a command-line argument,
then allow the user to enter text for sentiment analysis. The following interaction illustrates the
expected behavior:

$ python sentiment.py movie_reviews.txt
Enter your text (blank to exit): This movie is rotten!
The sentiment score for this text is: 1.58
This text is NEGATIVE.

Enter your text (blank to exit): Best. Movie. Ever. (BEST)
The sentiment score for this text is: 2.48
This text is POSITIVE.

Enter your text (blank to exit):

In this example, the text following each prompt was typed by the user; everything else was
produced by the program. The displayed numeric score must be rounded to two decimal places. The last line of the
output must be based on the numeric score: If the score is below 1.95 the output should be
“This text is NEGATIVE.” If the score is greater than or equal to 1.95 and less than 2.05 the output
should be “This text is NEUTRAL.” If the score is greater than or equal to 2.05 the output should
be “This text is POSITIVE.”
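
The thresholds above can be sketched as a small helper. This is only an illustration: the function names `classify` and `format_result` are hypothetical and not part of the required interface, and the sketch assumes the label is chosen from the unrounded score while only the displayed value is rounded.

```python
def classify(score):
    """Return the label for a sentiment score (hypothetical helper)."""
    if score < 1.95:
        return "NEGATIVE"
    elif score < 2.05:
        return "NEUTRAL"
    return "POSITIVE"


def format_result(score):
    """Build the two output lines for a score (hypothetical helper)."""
    # The .2f format specifier rounds the displayed value to two decimal places.
    return (f"The sentiment score for this text is: {score:.2f}\n"
            f"This text is {classify(score)}.")


print(format_result(1.58))
```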

Implementation

You will need to submit the following files:

Part A - Due Monday 4/25
- review.py - Code for the Review class. See the UML diagram below.
- test_review.py - Unit tests for the Review class. For full credit, these tests
  must provide 100% statement coverage of review.py.
Part B - Due Monday 5/2
- sentiment_analyzer.py - Code for the SentimentAnalyzer class. See the UML
  diagram below.
- sentiment.py - Command-line sentiment analysis application. This application
  must provide the functionality described above.

(Note that it isn’t standard practice in Python to create a separate file for each class. We are
doing it here to facilitate code submission and grading.)

UML

The Review and SentimentAnalyzer classes must match the following UML diagram:

[Figure: UML diagram of the Review and SentimentAnalyzer classes]

Review Class

Constructor: Note that there are two different constructors listed in the UML above.
The Python language doesn’t support this directly: there can be only one __init__ method.
We can provide the effect of multiple constructors by using the *args parameter to
implement arbitrary argument lists. This is described in Section 11.2 of our textbook.
As an example, imagine we want to create a two-dimensional Point class that can be
initialized either by explicitly providing x and y values or by copying the x and y values
from an existing Point object. The following code would work:
```
class Point:

    def __init__(self, *args):
        if len(args) == 1:      # Point object
            self.x = args[0].x
            self.y = args[0].y
        elif len(args) == 2:    # x and y values
            self.x = args[0]
            self.y = args[1]
        else:
            print("Error!  Incorrect number of arguments.")
```
The caller is now free to think of this as a class with two different constructors:
```
p1 = Point(2.0, 3.0)  # Construct from x, y
p2 = Point(p1)        # Construct from an existing point
```
- Two-argument Review constructor: The two-argument constructor must assign the
  provided rating and text to the appropriate instance variables. It must also initialize
  the words instance variable to be a set of words containing lower-case copies of each
  distinct word that appears in the review text. For example, if the text of the review
  is “This is a very very very bad movie!” then the words set would contain the
  elements "this", "is", "a", "very", "bad" and "movie". The set must not include
  any of the punctuation from the original review. The following code snippet shows
  an example of how we can remove punctuation from a string in Python:
```
import string
text = "I'm not 'pleased'... she said!"
text = text.translate(str.maketrans('', '', string.punctuation))
print(text)  # Prints Im not pleased she said
```
- Single-argument constructor: The single-argument constructor must take a line
  of text in the same format as the lines in the movie_reviews.txt file and extract
  the necessary information to initialize the three required instance variables.
  For example:
```
review = Review("1 Narratively , Trouble Every Day is a plodding mess .")
```
  The first character here corresponds to the rating, the next space is ignored, and
  the rest of the string should be stored in text. The logic for generating the words
  set will be the same regardless of which constructor is used.
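
As a rough sketch of the two steps described above (building the word set and splitting a review line), the helpers below combine the punctuation-removal snippet with lower-casing and splitting on whitespace. The names `extract_words` and `split_review_line` are hypothetical; the UML diagram does not require them. The sketch assumes a single-digit rating followed by exactly one space, and it leaves open whether trailing whitespace should be stripped from the text.

```python
import string


def extract_words(text):
    """Return the distinct lower-case words in text, without punctuation."""
    cleaned = text.translate(str.maketrans('', '', string.punctuation))
    return set(cleaned.lower().split())


def split_review_line(line):
    """Split a line such as '1 Some review .' into (rating, text)."""
    rating = int(line[0])  # the first character is the rating
    text = line[2:]        # skip the separating space; keep the rest as-is
    return rating, text


rating, text = split_review_line(
    "1 Narratively , Trouble Every Day is a plodding mess .")
print(rating, sorted(extract_words(text)))
```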
__str__ - This method must return a string representation of the review that matches
the format in movie_reviews.txt, except that the overall length must be capped
at 70 characters, with '...' appended in cases where the review has been truncated.
For example, reviews created from the corresponding examples above would generate
the following strings:
```
"1 Narratively , Trouble Every Day is a plodding mess ."
"4 This quiet , introspective and entertaining independent is worth ..."
```
Note that the second string is exactly 70 characters long, including the ellipsis.
__eq__ - Two reviews must be considered equal if they have the same rating and same text.
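
The 70-character cap described for __str__ can be illustrated with a standalone function. This is a sketch under assumptions: `format_review` is a hypothetical name, and it assumes that a line of exactly 70 characters is left untouched while longer lines are cut to 67 characters before '...' is appended.

```python
def format_review(rating, text, limit=70):
    """Return 'rating text', truncated with '...' to at most limit characters."""
    line = f"{rating} {text}"
    if len(line) > limit:
        # Reserve three characters for the trailing '...'.
        line = line[:limit - 3] + "..."
    return line


short = format_review(1, "Narratively , Trouble Every Day is a plodding mess .")
long = format_review(
    4, "This quiet , introspective and entertaining independent is worth seeking .")
print(short)
print(long)
print(len(long))
```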

SentimentAnalyzer Instance Variables

review_counts - This dictionary must contain a mapping from individual (lower-case)
words to a count of the total number of reviews that contain those words. For example,
given the sample reviews above, "and" would be mapped to 3, because it appears in
exactly three reviews.
word_rating_totals - This dictionary must contain a mapping from individual (lower-case)
words to the sum of the ratings of all reviews that contain that word. For example,
given the sample reviews above, "and" would be mapped to 8, because that is the sum
of the ratings of the reviews containing "and".
Along with review_counts, this instance variable will allow us to calculate the sentiment
value associated with any particular word: It will be the rating total divided by the count.
reviews - This must be a list containing all Review objects that have been added
through calls to the add_review method or the load_reviews method.
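
To make the relationship between the two dictionaries concrete, the sketch below accumulates counts and rating totals from a few (rating, word set) pairs. The word sets are abbreviated, hand-written stand-ins for the sample reviews, not the full data, and the loop is only one possible way to perform the update.

```python
# Abbreviated stand-ins for four of the sample reviews: (rating, words).
ratings_and_words = [
    (4, {"quiet", "and", "entertaining"}),
    (3, {"thrilling", "and", "murder"}),
    (1, {"aggressive", "and", "whitewash"}),
    (1, {"plodding", "mess"}),
]

review_counts = {}
word_rating_totals = {}
for rating, words in ratings_and_words:
    for word in words:
        # Each word is counted once per review because words is a set.
        review_counts[word] = review_counts.get(word, 0) + 1
        word_rating_totals[word] = word_rating_totals.get(word, 0) + rating

print(review_counts["and"])       # 3
print(word_rating_totals["and"])  # 8
print(word_rating_totals["and"] / review_counts["and"])
```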

SentimentAnalyzer Methods

Constructor - The constructor must initialize the three instance variables to contain
empty collection objects of the appropriate types.
add_review - This method must add the provided review to the reviews list and
must update review_counts and word_rating_totals with the information
from the provided review.
load_reviews - This method must load all of the reviews from the indicated text file.
You may assume that the formatting will match the formatting of movie_reviews.txt.
save_reviews - This method must save the current collection of reviews to a file with
the indicated name. The format must match the format of movie_reviews.txt.
word_sentiment - This method must return the word sentiment associated with the
provided word. Case must be ignored. If the word does not appear in any reviews,
then this method must return 2.0.
analyze - This method will take an arbitrary string of text and return the average
sentiment value of the individual words. Each word must be included in the average,
including repeats. Capitalization and punctuation must be ignored.
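
The averaging performed by analyze can be sketched as a standalone function. Here `analyze_text` is a hypothetical name, `word_scores` is a plain dictionary of made-up values standing in for the per-word sentiment lookup, and returning 2.0 for text with no words is an assumption, since the specification does not cover that case.

```python
import string


def analyze_text(text, word_scores, default=2.0):
    """Average the scores of all words in text, counting repeats."""
    cleaned = text.translate(str.maketrans('', '', string.punctuation))
    words = cleaned.lower().split()
    if not words:
        return default  # assumption: the spec does not say what to do here
    # Unknown words contribute the neutral default value.
    return sum(word_scores.get(word, default) for word in words) / len(words)


# Made-up scores for illustration only.
scores = {"best": 3.0, "movie": 2.0}
print(analyze_text("Best. Movie. Ever. (BEST)", scores))
```

With these made-up scores the four words contribute 3.0, 2.0, 2.0 (unknown word) and 3.0, so "best" is counted twice in the average.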

Grading

Submission A:
- Full test coverage from test_review.py and review.py passes all
  student tests: 20%
- review.py passes instructor Gradescope submission tests: 20%
Note that it will not be possible to receive any credit for passing the instructor
unit tests if your submission doesn’t pass your own tests with full coverage.
Submission B:
- sentiment_analyzer.py passes instructor tests: 20%
- sentiment.py passes instructor tests: 20%
Instructor Style Points: 20%
Instructor style points will be based on issues like:
- Appropriate variable names
- Meaningful and informative docstrings
- Clean, readable code that is not overly complex and doesn’t include
  do-nothing statements

Acknowledgments

The idea for this assignment was presented by Eric Manley and Timothy Urness at
the 2016 SIGCSE nifty assignment session. This project uses their data files and borrows some
text from their write-up. The Rotten Tomatoes data was originally collected for the
Stanford sentiment analysis project.

Alvin Chao
