piyushghai
diff --git a/‎Report/report.aux‎
Lines changed: 3 additions & 0 deletions b/‎Report/report.aux‎
diff --git a/‎Report/report.log‎
Lines changed: 7 additions & 7 deletions b/‎Report/report.log‎
diff --git a/‎Report/report.pdf‎
4.29 KB b/‎Report/report.pdf‎
diff --git a/‎Report/report.synctex.gz‎
3.03 KB b/‎Report/report.synctex.gz‎
diff --git a/‎Report/report.tex‎
Lines changed: 6 additions & 1 deletion b/‎Report/report.tex‎
@@ -14,9 +14,12 @@
 \@writefile{toc}{\contentsline {section}{\numberline {2}Coding Contribution}{3}}
 \@writefile{toc}{\contentsline {subsection}{\numberline {2.1}Data Transformation \ Pre-Processing}{3}}
 \@writefile{toc}{\contentsline {subsection}{\numberline {2.2}Cleaning}{3}}
+\citation{lda}
 \@writefile{toc}{\contentsline {subsection}{\numberline {2.3}Creating Training and Testing Corpus}{4}}
 \@writefile{lot}{\contentsline {table}{\numberline {2.1}{\ignorespaces Star rating distribution in the training corpora}}{4}}
 \newlabel{corpus_size}{{2.1}{4}}
+\@writefile{toc}{\contentsline {subsection}{\numberline {2.4}LDA Model Development}{4}}
 \bibcite{yelp}{1}
 \bibcite{yelp_dataset_challenge}{2}
 \bibcite{nltk}{3}
+\bibcite{lda}{4}
@@ -1,4 +1,4 @@
-This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) (preloaded format=pdflatex 2016.5.22)  30 NOV 2016 00:41
+This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) (preloaded format=pdflatex 2016.5.22)  30 NOV 2016 01:09
 entering extended mode
  restricted \write18 enabled.
  %&-line parsing enabled.
@@ -500,14 +500,14 @@ Class scrartcl Warning: incompatible usage of \@ssect detected.
 (scrartcl)              from within a non compatible caller, that does not
 (scrartcl)              \scr@s@ct@@nn@m@ locally.
 (scrartcl)              This could result in several error messages on input li
-ne 237.
+ne 239.
 
 [5] (./report.aux) ) 
 Here is how much of TeX's memory you used:
- 7590 strings out of 493014
- 114203 string characters out of 6133351
- 444396 words of memory out of 5000000
- 11001 multiletter control sequences out of 15000+600000
+ 7591 strings out of 493014
+ 114208 string characters out of 6133351
+ 444400 words of memory out of 5000000
+ 11002 multiletter control sequences out of 15000+600000
  22029 words of font info for 51 fonts, out of 8000000 for 9000
  1141 hyphenation exceptions out of 8191
  47i,12n,51p,8485b,1620s stack positions out of 5000i,500n,10000p,200000b,80000s
@@ -523,7 +523,7 @@ pfb></usr/local/texlive/2016/texmf-dist/fonts/type1/public/cm-super/sfrm2074.pf
 b></usr/local/texlive/2016/texmf-dist/fonts/type1/public/cm-super/sfti1095.pfb>
 </usr/local/texlive/2016/texmf-dist/fonts/type1/public/cm-super/sfti1440.pfb></
 usr/local/texlive/2016/texmf-dist/fonts/type1/public/cm-super/sftt1095.pfb>
-Output written on report.pdf (5 pages, 216122 bytes).
+Output written on report.pdf (5 pages, 220516 bytes).
 PDF statistics:
  65 PDF objects out of 1000 (max. 8388607)
  46 compressed objects within 1 object stream
 
@@ -232,6 +232,8 @@ \subsection{Creating Training and Testing Corpus}
     \noalign{\smallskip}\hline
   \end{tabular} 
 \end{table}  
+\subsection{LDA Model Development}
+The LDA\cite{lda} model is available in the \textbf{gensim} package for Python. The library method is not entirely straightforward to use: it requires a \textbf{vectorized bag-of-words} corpus as input, along with a dictionary built from the available training corpora. The parameters that can be tuned while developing the model are the corpus size and the number of topics to extract. We chose \textbf{7 topics}. In an ordinary vectorized corpus, the dimensionality equals the size of the dictionary, which is very large. Fixing the number of topics reduces the dimensionality of our training corpora to just 7 topic dimensions. The resulting per-document topic probability distributions were used as features to build a new training corpus, which was then used to train off-the-shelf classifiers such as \textit{MultinomialNaiveBayes, LogisticRegression, RandomForestClassifier, AdaBoostClassifier}. The performance of these models is discussed in a separate section on Model Evaluation.
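
The pipeline described in this subsection (dictionary, bag-of-words corpus, 7-topic LDA, topic distributions as features) could look roughly like the sketch below. It is a minimal illustration, not the code used for the report: the function names `dense_topic_vector` and `lda_topic_features`, the `passes` and `seed` values, and the assumption that reviews arrive as already cleaned token lists are all hypothetical. The gensim calls used are `corpora.Dictionary`, `Dictionary.doc2bow`, `models.LdaModel`, and `LdaModel.get_document_topics`.

```python
from typing import List, Sequence, Tuple

NUM_TOPICS = 7  # the topic count chosen in the report


def dense_topic_vector(sparse: Sequence[Tuple[int, float]],
                       num_topics: int = NUM_TOPICS) -> List[float]:
    """Expand sparse (topic_id, probability) pairs into a fixed-length vector.

    Topics missing from the sparse list are filled with 0.0, so every
    document ends up with exactly `num_topics` features.
    """
    vec = [0.0] * num_topics
    for topic_id, prob in sparse:
        vec[topic_id] = float(prob)
    return vec


def lda_topic_features(tokenized_docs: Sequence[Sequence[str]],
                       num_topics: int = NUM_TOPICS,
                       passes: int = 5,   # assumed value, not from the report
                       seed: int = 0) -> List[List[float]]:
    """Fit LDA on tokenized documents and return one topic vector per document."""
    # Imported here so the helper above can be used without gensim installed.
    from gensim import corpora, models

    # Map each distinct token to an integer id.
    dictionary = corpora.Dictionary(tokenized_docs)
    # Vectorized bag of words: a list of (token_id, count) pairs per document.
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

    lda = models.LdaModel(bow_corpus, num_topics=num_topics,
                          id2word=dictionary, passes=passes,
                          random_state=seed)

    # Each document shrinks from dictionary-sized to num_topics dimensions.
    return [
        dense_topic_vector(
            lda.get_document_topics(bow, minimum_probability=0.0), num_topics)
        for bow in bow_corpus
    ]
```

The returned list of 7-element vectors can be stacked into a feature matrix and passed, together with the star ratings as labels, to a classifier such as those named above.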
 
 \newpage
 \begin{thebibliography}{}
@@ -241,7 +243,10 @@ \subsection{Creating Training and Testing Corpus}
  \bibitem{yelp_dataset_challenge}
  The ggplot library is used to plot most of the graphs in this assignment \url{http://ggplot2.org/}.
 
-  \bibitem{nltk}
+ \bibitem{nltk}
+ The ggplot library is used to plot most of the graphs in this assignment \url{http://ggplot2.org/}.
+ 
+ \bibitem{lda}
  The ggplot library is used to plot most of the graphs in this assignment \url{http://ggplot2.org/}.
 
 \end{thebibliography}