Commit 52ba866

Added LDA Model description
1 parent 52b5d2a commit 52ba866

5 files changed

+16
-8
lines changed

Report/report.aux

Lines changed: 3 additions & 0 deletions
@@ -14,9 +14,12 @@
 \@writefile{toc}{\contentsline {section}{\numberline {2}Coding Contribution}{3}}
 \@writefile{toc}{\contentsline {subsection}{\numberline {2.1}Data Transformation \ Pre-Processing}{3}}
 \@writefile{toc}{\contentsline {subsection}{\numberline {2.2}Cleaning}{3}}
+\citation{lda}
 \@writefile{toc}{\contentsline {subsection}{\numberline {2.3}Creating Training and Testing Corpus}{4}}
 \@writefile{lot}{\contentsline {table}{\numberline {2.1}{\ignorespaces Star rating distribution in the training corpora}}{4}}
 \newlabel{corpus_size}{{2.1}{4}}
+\@writefile{toc}{\contentsline {subsection}{\numberline {2.4}LDA Model Development}{4}}
 \bibcite{yelp}{1}
 \bibcite{yelp_dataset_challenge}{2}
 \bibcite{nltk}{3}
+\bibcite{lda}{4}

Report/report.log

Lines changed: 7 additions & 7 deletions
@@ -1,4 +1,4 @@
-This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) (preloaded format=pdflatex 2016.5.22) 30 NOV 2016 00:41
+This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) (preloaded format=pdflatex 2016.5.22) 30 NOV 2016 01:09
 entering extended mode
 restricted \write18 enabled.
 %&-line parsing enabled.
@@ -500,14 +500,14 @@ Class scrartcl Warning: incompatible usage of \@ssect detected.
 (scrartcl) from within a non compatible caller, that does not
 (scrartcl) \scr@s@ct@@nn@m@ locally.
 (scrartcl) This could result in several error messages on input li
-ne 237.
+ne 239.
 
 [5] (./report.aux) )
 Here is how much of TeX's memory you used:
-7590 strings out of 493014
-114203 string characters out of 6133351
-444396 words of memory out of 5000000
-11001 multiletter control sequences out of 15000+600000
+7591 strings out of 493014
+114208 string characters out of 6133351
+444400 words of memory out of 5000000
+11002 multiletter control sequences out of 15000+600000
 22029 words of font info for 51 fonts, out of 8000000 for 9000
 1141 hyphenation exceptions out of 8191
 47i,12n,51p,8485b,1620s stack positions out of 5000i,500n,10000p,200000b,80000s
@@ -523,7 +523,7 @@ pfb></usr/local/texlive/2016/texmf-dist/fonts/type1/public/cm-super/sfrm2074.pf
 b></usr/local/texlive/2016/texmf-dist/fonts/type1/public/cm-super/sfti1095.pfb>
 </usr/local/texlive/2016/texmf-dist/fonts/type1/public/cm-super/sfti1440.pfb></
 usr/local/texlive/2016/texmf-dist/fonts/type1/public/cm-super/sftt1095.pfb>
-Output written on report.pdf (5 pages, 216122 bytes).
+Output written on report.pdf (5 pages, 220516 bytes).
 PDF statistics:
 65 PDF objects out of 1000 (max. 8388607)
 46 compressed objects within 1 object stream

Report/report.pdf

4.29 KB
Binary file not shown.

Report/report.synctex.gz

3.03 KB
Binary file not shown.

Report/report.tex

Lines changed: 6 additions & 1 deletion
@@ -232,6 +232,8 @@ \subsection{Creating Training and Testing Corpus}
 \noalign{\smallskip}\hline
 \end{tabular}
 \end{table}
+\subsection{LDA Model Development}
+The LDA~\cite{lda} model is available in the \textbf{gensim} package for Python. The library method is not entirely straightforward to use: it requires a \textbf{vectorized bag-of-words} corpus as input, along with a dictionary built from the available training corpora. The parameters that could be tuned while developing the model were the corpus size and the number of topics to extract; we chose \textbf{7 topics}. In a plain vectorized corpus, the dimensionality equals the size of the dictionary, which is very large. Fixing the number of topics reduces the dimensionality of our training corpora to just 7 topic features. The per-document topic probability distribution was then used as the feature set for a new training corpus, on which we trained off-the-shelf classifiers such as \textit{MultinomialNaiveBayes, LogisticRegression, RandomForestClassifier, AdaBoostClassifier}. The performance of these models is discussed in a separate section on Model Evaluation.
 
 \newpage
 \begin{thebibliography}{}
@@ -241,7 +243,10 @@ \subsection{Creating Training and Testing Corpus}
 \bibitem{yelp_dataset_challenge}
 ggplot Library is used in this assignment to plot most of the graphs in this assignment \url{http://ggplot2.org/}.
 
-\bibitem{nltk}
+\bibitem{nltk}
+ggplot Library is used in this assignment to plot most of the graphs in this assignment \url{http://ggplot2.org/}.
+
+\bibitem{lda}
 ggplot Library is used in this assignment to plot most of the graphs in this assignment \url{http://ggplot2.org/}.
 
 \end{thebibliography}
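
The "LDA Model Development" paragraph added to report.tex mentions a dictionary, a vectorized bag-of-words corpus, and a 7-value topic distribution used as classifier features. The commit does not include the code itself, so the following is only a minimal standard-library sketch of those data shapes, not the author's implementation. The function names and toy reviews are hypothetical. In gensim, the corresponding steps would presumably be `corpora.Dictionary`, `Dictionary.doc2bow`, and `models.LdaModel(..., num_topics=7)`, but that mapping is an assumption about what the report used.

```python
from collections import Counter

NUM_TOPICS = 7  # the report fixes the number of topics at 7


def build_dictionary(tokenized_docs):
    """Map each distinct token to an integer id, in first-seen order."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Return a sparse bag-of-words: sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


def dense_topic_features(sparse_topics, num_topics=NUM_TOPICS):
    """Expand sparse (topic_id, probability) pairs into a fixed-length vector.

    A topic model typically reports only the topics with non-negligible
    probability; classifiers need one column per topic, so the missing
    topics are filled with 0.0.
    """
    features = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        features[topic_id] = prob
    return features


# Hypothetical toy reviews, already tokenized and cleaned.
docs = [["great", "food", "great", "service"], ["slow", "service"]]
token2id = build_dictionary(docs)
bow_corpus = [doc_to_bow(d, token2id) for d in docs]

# A made-up topic distribution for one review, as a trained model might return.
features = dense_topic_features([(1, 0.25), (4, 0.75)])
```

Each review ends up as a vector of length 7 regardless of vocabulary size, which is the dimensionality reduction the paragraph describes. Those vectors can then be stacked into a feature matrix for the off-the-shelf classifiers.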
