Report/report.tex: 1 addition & 1 deletion
@@ -201,7 +201,7 @@ \section{Exploratory Analysis of Yelp Dataset}
Since restaurants account for so many of the reviews, we decided to focus on the subset of restaurant reviews (obtained in the pre-processing step) for further analysis.
Next we observed the variation of the average star rating against the \textit{length of the reviews (in characters)}. The average rating varies considerably for longer reviews, indicating that the polarity of reviews fluctuates much more from one length to another. There is less fluctuation in the average rating of reviews between $\sim$50 and $\sim$500 characters in length. Hence we chose to subset the reviews based on minimum and maximum length thresholds. Figure \ref{average_length} shows this trend. A further motive for reducing the subset of reviews comes from Figure \ref{length_count}, which shows that the number of reviews longer than 700 characters is very small compared to the overall size of the review corpus. Combining the analysis from both figures, we subsetted the reviews by length.\\
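The length-based subsetting described above can be sketched as follows. This is a minimal illustration, not the code used for the report: it assumes each review is a dictionary with \texttt{text} and \texttt{stars} fields, the function names are hypothetical, and the default thresholds of 50 and 500 characters are taken only as illustrative values from the range mentioned above.

```python
from collections import defaultdict

# Illustrative thresholds (assumption): the text mentions roughly 50 to 500 characters.
MIN_LEN = 50
MAX_LEN = 500


def filter_by_length(reviews, min_len=MIN_LEN, max_len=MAX_LEN):
    """Keep reviews whose text length in characters lies in [min_len, max_len]."""
    return [r for r in reviews if min_len <= len(r["text"]) <= max_len]


def average_stars_by_length(reviews, bucket_size=50):
    """Return {bucket_start: mean star rating}, grouping reviews by text length."""
    totals = defaultdict(lambda: [0, 0])  # bucket_start -> [sum of stars, count]
    for r in reviews:
        bucket = (len(r["text"]) // bucket_size) * bucket_size
        totals[bucket][0] += r["stars"]
        totals[bucket][1] += 1
    return {b: s / n for b, (s, n) in sorted(totals.items())}
```

Bucketing the lengths (here in steps of 50 characters) is a design choice made for this sketch; it smooths the per-length averages, which would otherwise be noisy for lengths with few reviews.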
-Next we observed the distribution of reviews across star ratings, i.e. how many reviews received each star rating. From Figure \ref{star_distribution}, we can see that the distribution of reviews across star ratings is skewed. A majority of reviews have a 5 star rating, while the count for 2 star is the lowest. This later forms the basis for creating training datasets sampled from each star rating, in order to ensure even representation in the training corpus.
+Next we observed the distribution of reviews across star ratings, i.e. how many reviews received each star rating. From Figure \ref{star_distribution}, we can see that the distribution of reviews across star ratings is skewed. A majority of reviews have a 5 star or a 4 star rating, while the count for 1 star is the lowest. This later forms the basis for creating training datasets sampled from each star rating, in order to ensure even representation in the training corpus.
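One way to draw an evenly represented training set from a skewed star distribution is to sample the same number of reviews from each rating. The sketch below is an assumption about how this could be done, not the procedure used in the report; the function name \texttt{balanced\_sample}, the \texttt{per\_star} parameter, and the \texttt{stars} field are hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(reviews, per_star, seed=0):
    """Sample up to per_star reviews from each star rating, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_star = defaultdict(list)
    for r in reviews:
        by_star[r["stars"]].append(r)

    sample = []
    for star in sorted(by_star):
        group = by_star[star]
        # A rating with fewer than per_star reviews contributes all of them.
        k = min(per_star, len(group))
        sample.extend(rng.sample(group, k))
    return sample
```

Setting \texttt{per\_star} to the size of the smallest rating group yields a fully balanced sample, at the cost of discarding reviews from the more frequent ratings.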
\subsection{Word Cloud}
Given the distribution of reviews, we decided to capture the common sentiment for each star rating separately. We did this by plotting a word cloud of the review texts grouped by star rating. To do this, we preprocessed the review texts by removing common stop words and tokenizing them, and then plotted the resulting words. Figures \ref{wc1}, \ref{wc2}, \ref{wc3}, \ref{wc4} and \ref{wc5} show the results.
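The preprocessing behind such word clouds can be sketched as below. This is a minimal illustration under stated assumptions: the stop-word list is a tiny hypothetical stand-in for a full list, the regular-expression tokenizer is a simplification, and the \texttt{text} and \texttt{stars} field names are assumed. The resulting per-rating word counts are the kind of input a word cloud renderer would take.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (assumption); a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "was", "it",
              "to", "of", "in", "i", "we", "this", "for"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies_by_star(reviews, stop_words=STOP_WORDS):
    """Return {star rating: Counter of non-stop-word tokens} over all reviews."""
    freqs = defaultdict(Counter)
    for r in reviews:
        tokens = [t for t in tokenize(r["text"]) if t not in stop_words]
        freqs[r["stars"]].update(tokens)
    return dict(freqs)
```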