piyushghai
diff --git a/‎Report/report.log‎
Lines changed: 5 additions & 23 deletions b/‎Report/report.log‎
diff --git a/‎Report/report.pdf‎
554 Bytes b/‎Report/report.pdf‎
diff --git a/‎Report/report.synctex.gz‎
66 Bytes b/‎Report/report.synctex.gz‎
diff --git a/‎Report/report.tex‎
Lines changed: 1 addition & 1 deletion b/‎Report/report.tex‎
diff --git a/‎Report/star_review_count.png‎
1.35 KB b/‎Report/star_review_count.png‎
@@ -1,4 +1,4 @@
-This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) (preloaded format=pdflatex 2016.5.22)  30 NOV 2016 18:19
+This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) (preloaded format=pdflatex 2016.5.22)  30 NOV 2016 18:33
 entering extended mode
  restricted \write18 enabled.
  %&-line parsing enabled.
@@ -494,21 +494,6 @@ Underfull \hbox (badness 10000) in paragraph at lines 201--203
 
  []
 
-
-LaTeX Warning: Reference `wc1' on page 3 undefined on input line 207.
-
-
-LaTeX Warning: Reference `wc2' on page 3 undefined on input line 207.
-
-
-LaTeX Warning: Reference `wc3' on page 3 undefined on input line 207.
-
-
-LaTeX Warning: Reference `wc4' on page 3 undefined on input line 207.
-
-
-LaTeX Warning: Reference `wc5' on page 3 undefined on input line 207.
-
 LaTeX Font Info:    Try loading font information for U+msa on input line 213.
 (/usr/local/texlive/2016/texmf-dist/tex/latex/amsfonts/umsa.fd
 File: umsa.fd 2013/01/14 v3.01 AMS symbols A
@@ -536,12 +521,12 @@ File: len_count.png Graphic file (type png)
 Package pdftex.def Info: len_count.png used on input line 266.
 (pdftex.def)             Requested size: 250.9383pt x 200.75136pt.
 
-<star_review_count.png, id=24, 521.0667pt x 359.1819pt>
+<star_review_count.png, id=24, 721.2546pt x 506.6127pt>
 File: star_review_count.png Graphic file (type png)
 
 <use star_review_count.png>
 Package pdftex.def Info: star_review_count.png used on input line 273.
-(pdftex.def)             Requested size: 250.93605pt x 200.75136pt.
+(pdftex.def)             Requested size: 250.93513pt x 200.74756pt.
 
 <1_star_wordcloud_500k.png, id=25, 963.6pt x 963.6pt>
 File: 1_star_wordcloud_500k.png Graphic file (type png)
@@ -634,16 +619,13 @@ ne 595.
 
 [15] (./report.aux)
 
-LaTeX Warning: There were undefined references.
-
-
 LaTeX Warning: There were multiply-defined labels.
 
  ) 
 Here is how much of TeX's memory you used:
  7708 strings out of 493014
  116629 string characters out of 6133351
- 444739 words of memory out of 5000000
+ 444795 words of memory out of 5000000
  11100 multiletter control sequences out of 15000+600000
  22029 words of font info for 51 fonts, out of 8000000 for 9000
  1141 hyphenation exceptions out of 8191
@@ -661,7 +643,7 @@ live/2016/texmf-dist/fonts/type1/public/cm-super/sfrm2074.pfb></usr/local/texli
 ve/2016/texmf-dist/fonts/type1/public/cm-super/sfti1095.pfb></usr/local/texlive
 /2016/texmf-dist/fonts/type1/public/cm-super/sfti1440.pfb></usr/local/texlive/2
 016/texmf-dist/fonts/type1/public/cm-super/sftt1095.pfb>
-Output written on report.pdf (15 pages, 1338214 bytes).
+Output written on report.pdf (15 pages, 1338768 bytes).
 PDF statistics:
  125 PDF objects out of 1000 (max. 8388607)
  73 compressed objects within 1 object stream
 
@@ -201,7 +201,7 @@ \section{Exploratory Analysis of Yelp Dataset}
 Since there are so many reviews for restaurants, we decided to focus on the subset of restaurant reviews (obtained in the pre-processing step) for further analysis.
 Next we observed how the average star rating varies with the \textit{length of the reviews (in characters)}. The average rating fluctuates considerably for longer reviews, indicating that the polarity of reviews changes much more from one length to another at the high end. There is less fluctuation in the average rating for reviews of roughly 50 to 500 characters. Hence we chose to subset the reviews based on minimum and maximum length thresholds. Figure \ref{average_length} shows this trend. A further motive for restricting the subset comes from Figure \ref{length_count}, which shows that the number of reviews longer than 700 characters is very small compared to the overall size of the review corpus. Combining the analysis from these two figures, we subsetted the reviews by length.\\
 
-Next we observed the distribution of reviews with the star rating, i.e. what review was given which star by a user. From Figure \ref{star_distribution}, we can see that there is a skewed distribution of reviews in terms of star ratings they have received. A majority of reviews have a 5 star rating, while the count for 2 star is the lowest. This will later on form a basis for us to create training datasets based on data sampled from each star rating, in order to ensure even representation in the training corpus.
+Next we observed the distribution of reviews across star ratings, i.e. how many reviews received each star rating. From Figure \ref{star_distribution}, we can see that the distribution is skewed. A majority of reviews have a 5-star or a 4-star rating, while the count for 1-star reviews is the lowest. This later forms the basis for creating training datasets from data sampled from each star rating, in order to ensure even representation in the training corpus.
 
 \subsection{Word Cloud}
 Given the distribution of reviews, we decided to capture the common sentiment of the reviews for each star rating. We did this by plotting a word cloud of the review texts grouped by their star rating. To do this, we preprocessed the review texts to remove common stop words, tokenized them, and plotted the remaining words. Figures \ref{wc1}, \ref{wc2}, \ref{wc3}, \ref{wc4} and \ref{wc5} show the resulting word clouds.
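The preprocessing described in this hunk (length-based subsetting, stop-word removal, tokenization, grouping by star rating) could be sketched roughly as follows. This is a minimal illustration and not the report's actual code: the function names, the tiny stop-word list, the sample reviews, and the exact thresholds of 50 and 500 characters are all assumptions made for the example. The resulting per-rating word frequencies are the kind of input a word cloud plotting tool would typically consume.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "it", "to", "of", "i", "we", "in", "for"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters, optional apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_counts_by_star(reviews, min_len=50, max_len=500):
    """Count non-stop-word tokens per star rating.

    `reviews` is an iterable of (stars, text) pairs. Reviews whose character
    length falls outside [min_len, max_len] are skipped; the default
    thresholds are assumed values, not taken from the report.
    """
    counts = defaultdict(Counter)
    for stars, text in reviews:
        if not (min_len <= len(text) <= max_len):
            continue  # drop reviews outside the length window
        tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
        counts[stars].update(tokens)
    return counts


# Made-up sample data, for illustration only.
sample = [
    (5, "The pasta was amazing and the staff was friendly. Amazing dessert too, we loved it."),
    (1, "Terrible service and cold food. We waited an hour for cold soup, terrible experience."),
    (5, "ok"),  # shorter than min_len, so it is filtered out
]
counts = word_counts_by_star(sample)
print(counts[5].most_common(3))
print(counts[1].most_common(3))
```

Keeping one `Counter` per star rating mirrors the five separate word clouds referenced in the text, since each cloud is drawn from the frequencies of a single rating group.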