PA2 - Final
Programming Assignment 2: Sentiment Analysis
Introduction
Sentiment Analysis is the problem of determining the general attitude expressed by some text. For instance,
we would like to have a program that could look at the text “The film was a breath of fresh air” and realize that
it was a positive statement while “It made me want to poke out my eyeballs” is negative.
One algorithm that we can use for this problem is to assign a numeric value to each word based on how
positive or negative that word is and then score the text as a whole based on the average sentiment value
of the individual words. The challenge here is in finding a way to assign positive or negative values to
individual words.
For the purposes of this project we will assign values to words by analyzing a large collection of movie
reviews collected from the Rotten Tomatoes website:
This file contains 8529 movie reviews along with their scores. The text of each movie review is
accompanied by a human-generated evaluation of whether the review is positive or negative overall.
The first few lines of the file look like this:
1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . 4 This quiet , introspective and entertaining independent is worth seeking . 1 Even fans of Ismail Merchant 's work , I suspect , would have a hard time sitting through this one . 3 A positively thrilling combination of ethnography and all the intrigue , betrayal , deceit and murder of a Shakespearean tragedy or a juicy soap opera . 1 Aggressive self-glorification and a manipulative whitewash . 4 A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis . 1 Narratively , Trouble Every Day is a plodding mess .
Note that each review starts with a number 0 through 4 with the following meaning:
- 0 : negative
- 1 : somewhat negative
- 2 : neutral
- 3 : somewhat positive
- 4 : positive
Individual words will be scored by computing the average rating of all of the reviews that contain that
word. For example, if we were only working with the reviews included above, the word “and” would be
assigned the score
4, 3, and 1 respectively.
Honor Code
This is an individual assignment. All of the work that you submit must be written by you, based on
your own understanding of the material. Representing someone else’s work as your own, in any form,
constitutes an honor code violation.
Functionality
The finished application must read a file of movie reviews provided as a command-line argument,
then allow the user to enter text for sentiment analysis. The following interaction illustrates the
expected behavior:
Enter your text (blank to exit): This movie is rotten!
The sentiment score for this text is: 1.58
This text is NEGATIVE.
Enter your text (blank to exit): Best. Movie. Ever. (BEST)
The sentiment score for this text is: 2.48
This text is POSITIVE
Enter your text (blank to exit):
In this example the normal text was produced by the program and the bold text represents
user input. The displayed numeric score must be rounded to two decimal places. The last line of the
output must be based on the numeric score: If the score is below 1.95 the output should be
“This text is NEGATIVE.” If the score greater than or equal to 1.95 and less 2.05 the output
should be “This text is NEUTRAL.” If the score is greater than or equal to 2.05 the output should
be “This text is POSITIVE.”
Implementation
You will need to submit the following files:
- Part A - Due Monday 4/25
review.py
- Code for theReview
class. See the UML diagram below.test_review.py
- Unit tests for theReview
class. For full credit, these tests
must provide 100% statement coverage ofreview.py
.
- Part B - Due Monday 5/2
sentiment_analyzer.py
- Code for theSentimentAnalyzer
class. See the UML
diagram below.sentiment.py
- Command-line sentiment analysis application. This application
must provide the functionality described above.
(Note that it isn’t standard practice in Python to create a separate file for each class. We are
doing it here to facilitate code submission and grading.)
UML
The Review
and SentimentAnalyzer
classes must match the following UML diagram:
Review Class
-
Constructor: Note that there are two different constructors listed in the UML above.
The Python language doesn’t support this directly: there can be only one__init__
method.
We can provide the effect of multiple constructors by using the*args
parameter to
implement arbitrary argument lists. This is described in Section 11.2 of our textbook.
As an example, imagine we want to create a two-dimensionalPoint
class that can be
initialized either by explicitly providing x and y values or by copyting the x and y values
from an existingPoint
object. The following code would work:class Point: def __init__(self, *args): if len(args) == 1: # Point object self.x = args[0].x self.y = args[0].y elif len(args) == 2: # x and y values self.x = args[0] self.y = args[1] else: print("Error! Incorrect number of arguments.")
The caller is now free to think of this as a class with two different constructors:
-
Two-argument
Review
constructor: The two-argument constructor must assign the
provided rating and text to the appropriate instance variables. It must also initialize
thewords
instance variable to be a set of words containing lower-case copies of each
distinct word that appears in the review text. For example, if the text of the review
is “This is a very very very bad movie!” then thewords
set would contain the
elements"this"
,"is"
,"a"
,"very"
,"bad"
and"movie"
. The set must not include
any of the punctuation from the original review. The following code snippet shows
an example of how we can remove punctuation from a string in Python: -
Single-argument constructor: The single-argument constructor must take a line
of text in the same format as the lines in themovie_reviews.txt
file and extract
the necessary information to initialize the three required instance variables.
For example:The first character here corresponds to the rating, the next space is ignored, and
the rest of the string should be stored intext
. The logic for generating thewords
set will be the same regardless of which constructor is used.
-
-
Note that the second string is exactly 70 characters long, including the ellipses.__str__
- This method must return a string representation of the review that matches
the format inmovie_reviews.txt
, except that the overall length must be capped
at 70 characters, with'...'
appended in cases where the review has been truncated.
For example, reviews created from the corresponding examples above would generate
the following strings: -
__eq__
- Two reviews must be considered equal if they have the same rating and same text.
SentimentAnalyzer Instance Variables
review_counts
- This dictionary must contain a mapping from individual (lower-case)
words to a count of the total number of reviews that contain those words. For example,
given the sample reviews above,"and"
would be mapped to3
, because it appears in
exactly three reviews.-
Along withword_rating_totals
- This dictionary must contain a mapping from individual (lower-case)
words to the sum of the ratings of all reviews that contain that word. For example,
given the sample reviews above,"and"
would be mapped to8
, because that is the sum
of the ratings of the reviews containing"and"
.review_counts
, this instance variable will allow us to calculate the sentiment
value associated with any particular word: It will be the rating total divided by the count. -
reviews
- This must be a list containing allReview
objects that have been added
through calls to theadd_review
method or theload_reviews
method.
SentimentAnalyzer Methods
- Constructor - The constructor must initialize the three instance variables to contain
empty collection objects of the appropriate types. add_review
- This method must add the provided review to thereviews
list and
must updatereview_counts
andword_rating_totals
with the information
from the provided review.load_reviews
- This method must load all of the reviews from the indicated text file.
You may assume that the formatting will match the formatting ofmovie_reviews.txt
.save_reviews
- This method must save the current collection of reviews to a file with
the indicated name. The format must match the format ofmovie_reviews.txt
.word_sentiment
- This method must return the word sentiment associated with the
provided word. Case must be ignored. If the word does not appear in any reviews,
then this method must return 2.0.analyze
- This method will take an arbitrary string of text and return the average
sentiment value of the individual words. Each word must be included in the average,
including repeats. Capitalization and punctuation must be ignored.
Grading
- Submission A:
- Full test coverage from
test_review.py
andreview.py
passes all
student tests: 20% review.py
passes instructor Gradescope submission tests: 20%
unit tests if your submission doesn’t pass your own tests with full coverage. - Full test coverage from
- Submission B:
sentiment_analyzer.py
passes instructor tests: 20%sentiment.py
passes instructor tests: 20%
-
Instructor Style Points: 20%
Instructor style points will be based on issues like:- Appropriate variable names
- Meaningful and informative docstrings
- Clean, readable code that is not overly complex and doesn’t include
do-nothing statements
Acknowledgments
The idea for this assignment was presented by Eric Manley and Timothy Urness at
the 2016 SIGCSE nifty assignment session. This project uses their data files and borrows some
text from their write-up. The Rotten Tomatoes data was originally collected for the
Stanford sentiment analysis project.