Making Sense of Millions of Amazon Reviews Using SQL, Spark and Python

by | Dec 18, 2018 | General Information, Technology | 0 comments

Right after the final presentation — look at the happy faces!

This is a story of three simple students (Srivatsan Ramesh, Arpita Shah and yours truly) from a non-CS background who worked on Amazon reviews to make better sense of them. 

We worked on this as part of our Big Data Analytics course where we had to work on a project involving, as you can guess, ‘big data’. Tasked with choosing a topic, we wanted to work with a dataset that was clean (thus, circumventing the atrocious hours we would otherwise spend cleaning), understandable and had a lot of potential for novelty. 

We picked the Amazon Product Review Dataset created by Julian McAuley from UCSD. Now, a good question to ask would be, why this topic? 

Motivation

Here’s why: We noticed that when you purchase a product, the only variables that you can look at are rating and reviews. However, this is how a typical review looks like:

Sure, you can probably read one or two of these — but thousands? There is a line where you cross from human involvement to automation, and this was that line. 

We wanted to build an algorithm that brought out the features of a product by gleaning all the reviews and gave it a more appropriate and semantically intelligent rating.

We will divide the rest of the article into parts:

1. Loading the Dataset

2. Performing Analytics to Prove Motivation

3. Obtaining the Semantic Orientation and Overall Rating

4. Extracting Features (Pros and Cons) from Reviews

1. Loading the Dataset

To work with big data is a boon and a bane — yes, you get to have a corpus of data to work with. But, working with a corpus of data requires fast machines. Fortunately, we got credits to use the Google Cloud Platform virtual machine (Ubuntu 16.04 LTS with 50 GB disk) and Hadoop (and in turn Spark, Python, Hive and HBase).

We picked the Cell Phone and Accessories category inside the dataset as it was ‘big’ enough but manageable with 3,447,249 rows and 9 columns. We also picked it as it sounded like a category where people would be prone to describing product features in the reviews.

We tried various methods to load and work with the dataset, and finally settled on loading it as a JSON using SQL Context as shown below.

2. Performing Analytics to Prove Motivation

Our reason for performing analytics was to drive home the point that the verbosity of the reviews and unclear pattern between the reviews and ratings lead to a situation where people could not glean the information they needed. To do the same, we tried understanding the distribution of reviews, their length and number of people who found it helpful.

Here below are some of the charts that we created for the same:

A few charts on rating, length of review and helpful count

As you can observe, more than 75% of reviews have not been found helpful by people, and among those who were found helpful, the review text has an average length of more than 150 words. While most of the reviews were given 4 or 5 stars, there is no clear pattern between the rating and the length of the reviews. 

So, longer reviews are more helpful, but it does not mean people liked the product more? Well, we need a better way to untangle this mess.

3. Obtaining the Semantic Orientation and Overall Rating

Below is a flow diagram of the process we followed to get to the output.

The process followed to get to the out[ut

As you can see, after the ETL (Extraction, Transformation and Loading), we cleaned the review text to remove all ‘noise’ (in this case, that translates to punctuation, junk values and upper case letters). After this, we removed the objective part of the sentence using Textblob package

Why do this? Think about it. Would you rather read ‘This iPhone has 64 gb of storage’ (which you already know) or ‘This iPhone has high storage for low price’? Subjectivity is what we wanted from the sentence. Textblob does a good job of telling you how subjective a sentence is, using which you can eliminate those that fall below a certain threshold (we picked 0.3). 

Now, coming to the part that required a grammar refresher, we need to perform Parts-of-Speech tagging (done via NLTK library) to the review and then only retain the words that followed a certain pattern. Why do this? It is said that when people describe an object’s feature, it is mostly a combination of ‘adjective-noun’, ‘adverb-adjective-noun’ and ‘adverb-verb-noun’ (there are more combinations as well). 

Hence, we used the following table as a guide and extracted the combinations.

The combinations that have highest probability of having product features

And finally, the words were fed into a semantic orientation analyzer to find out the sentiment behind it. We used the VADER algorithm created by MIT. This gives a score ranging from -1 to 1 which we normalized and then multiplied with the original rating. The final score was summed up over all the reviews of a product and then a mean was taken. 

Now this score gives a more holistic representation of what the person feels. 

4. Extracting Features (Pros and Cons) from Reviews

The objective of feature extraction is to extract the features mentioned by the customers in reviews and understand its polarity with respect to the context to identify it as a pro/con for the product. 

Since we are trying to understand the features, they are classified as nouns in parts of speech. We need to extract such nouns from the reviews and also search for the closest adjectives to these nouns. This can be done with N-gram modelling where we partition a given sentence into groups of 2 or 3. In our scenario, we had implemented a tri-gram model because as window-size of N gram models increase, the context meaning is diminished.

Let’s take a look at the code to understand this.

From the POS tagging and subjective-objective classification, we get the frequency of nouns for a particular product to understand the distribution of features that were mentioned in the reviews. We also derived a list of common words that were mostly related to gadgets and eliminated nouns that weren’t related to the common words. We performed a tri-gram modelling on this which is described in the function below.

After the tri-gram model, we took a list of adjectives that were mapped to a given noun. This was done using the function above where we take the tri-grams containing the given noun.

Using the function, we compute the polarity of all the tri-grams containing a given product feature and take an average of these scores to get the overall polarity associated with a particular feature to understand the performance of that product feature. If it’s close to 1, it’s a ‘pro’ — a feature that most people liked. If it’s close to -1, vice versa. By computing these scores for various product features across multiple reviews, we can get an overall opinion of these features using statistical modelling rather than reading millions of reviews.

*****************************************************************

The end result is a website — developed by Srivatsan using Flask and SQL — where we created a dashboard with the product details and overall rating, and a pro/con list of the features. This concluded the project, but we hope to work more on this in the future. Hope you found this useful and enjoyable (?), and we would be sharing our Github link soon!

*****************************************************************

If you found this to be useful, do Follow me for more articles. I love hearing your thoughts via comments! 💓I write about social issues, products, the technology sector and my graduate school experience in the US. Here is my Medium Blog. If you’re a curious soul looking to learn everyday, here’s a Slack Group that I created for you to join.

Wanna get in touch: The best way is via Instagram or Facebook. I share some interesting content there. To know more about my professional life, check out my LinkedIn. Happy learning!