This repository was archived by the owner on Jan 4, 2024. It is now read-only.
chapters/chapter01.tex
Lines changed: 26 additions & 25 deletions
@@ -25,22 +25,22 @@
25
25
\section{Motivation}
26
26
\label{s:introduction-motivation}
27
27
28
-
Many studies have been published which are trying to predict the stock market movement \citep[see][]{Bollen2011a,Mittal2012a,Nguyen2015a,Pagolu2016a,Zhang2011a}.
29
-
As the\ac{EMH} states that financial market movements depend on news, current events and product releases and all these factors will have significant impact on a company's stock value
28
+
Many studies have been published that try to predict stock market movements \citep[see][]{Bollen2011a,Mittal2012a,Nguyen2015a,Pagolu2016a,Zhang2011a}.
29
+
The \ac{EMH} states that financial market movements depend on news, current events and product releases, and that all these factors have a significant impact on a company's stock value
30
30
\citep{fama1965behavior}.
31
-
Due the fact that news and current events are unpredictable stock market prices are following a random walk pattern and cannot predicted with more than \SI{50}{\percent} accuracy
31
+
Due to the fact that news and current events are unpredictable, stock market prices follow a random walk pattern and cannot be predicted with more than \SI{50}{\percent} accuracy
32
32
\citep{Pagolu2016a}.
33
33
34
-
\citet{Malkiel2003} noted that with the beginning of the new millennium financial economists believed that stock prices are at least partly predictable.
34
+
\citet{Malkiel2003} noted that with the beginning of the new millennium, financial economists believed that stock prices are at least partly predictable.
35
35
They emphasized the behavioral and psychological elements of stock price determination.
36
36
37
-
Many internet users are microblogging nowadays.
38
-
Millions of messages are published daily on popular websites which provides microblogging services, such as Twitter, Tumblr and Facebook.
39
-
These published messages describing the personal life, opinions or current issues.
40
-
The more users are post about products and services they use the more microblogging websites become a valuable source of peoples opinions and sentiments.
37
+
Nowadays many internet users are microblogging.
38
+
Millions of messages are published daily on popular websites which provide microblogging services, such as Twitter, Tumblr and Facebook.
39
+
These published messages describe personal lives, opinions or current issues.
40
+
The more users post about products and services they use, the more microblogging websites become a valuable source of people's opinions and sentiments.
41
41
Therefore, this data can be used for marketing, social studies and as a measure of public opinion
42
42
\citep{Patodkar2016a, Pagolu2016a}.
43
-
As most Twitter messages have a maximum length of 140 characters and speaks public opinion on a topic precisely
43
+
Most Twitter messages have a maximum length of 140 characters and represent the public opinion on a precise topic
44
44
\citep{Pagolu2016a}.
45
45
46
46
Combining these two research fields (namely \ac{EMH} and Twitter) should enable us to investigate whether stock prices can be predicted via public opinions on Twitter.
@@ -51,7 +51,7 @@ \section{Research Goals}
51
51
According to the factors presented in \cref{s:introduction-motivation} the central research question can be formulated:
52
52
\emph{To what extent can stock market movements be explained by the public opinion extracted from Twitter?}
53
53
54
-
The goal of this research to analyze the correlation between sentiment of tweets and share movement of automotive companies.
54
+
The goal of this research is to analyze the correlation between the sentiment of tweets and the share movement of automotive companies.
55
55
This goal will be met by achieving the following objectives:
56
56
57
57
\begin{itemize}
@@ -60,23 +60,23 @@ \section{Research Goals}
60
60
\item\textbf{G3} - Comparing sentiment time series with share prices
61
61
\end{itemize}
62
62
63
-
From definitions of goals and having the central question in mind the following sub tasks are defined in form of questions in order to fulfill the goals:
63
+
Based on the definitions of goals and having the central question in mind, the following sub tasks are defined in the form of questions in order to fulfill the goals:
64
64
65
65
\begin{itemize}
66
66
\item\textbf{G1-Q1} - Which companies should be analyzed?
67
67
\item\textbf{G1-Q2} - Which keywords should be used to find corresponding tweets?
68
68
\item\textbf{G1-Q3} - Which company uses which stock symbol in order to retrieve share prices?
69
-
\item\textbf{G2-Q4} - Why Twitter and not anything else?
70
-
\item\textbf{G2-Q5} - In which way tweets can be collected?
71
-
\item\textbf{G2-Q6} - In which way sentiments can be determined?
69
+
\item\textbf{G2-Q4} - Why Twitter and not any other social media platform?
70
+
\item\textbf{G2-Q5} - In which way can tweets be collected?
71
+
\item\textbf{G2-Q6} - In which way can sentiments be determined?
72
72
\item\textbf{G2-Q7} - Which sentiments are present for various companies?
73
73
\item\textbf{G3-Q8} - Can the time series of sentiments explain the share prices?
74
74
\end{itemize}
75
75
76
76
\section{Research Methodology}
77
77
\label{s:introduction-researchmethodology}
78
78
79
-
The research follows a structure deducted from ``evaluation techniques for systems analysis and design modelling methods'' by \citet{Siau2011} in which the authors try to show up the benefits and the shortcomings of different methods.
79
+
The research follows a structure derived from ``evaluation techniques for systems analysis and design modelling methods'' by \citet{Siau2011}, in which the authors try to show the benefits and the shortcomings of different methods.
80
80
In the following the three main categories and their mapping to this thesis are shown:
is used to compare the results of the case study with share prices of the automotive companies.
94
94
\end{description}
95
95
96
-
As this thesis covers sentiments of people in a global context which is then compared to share prices in an economic context it can be classified as social science \citep{Recker2013}.
97
-
In the following the research actions, which have been undertaken to answer the questions and fulfill the goals, are explained.
96
+
As this thesis covers sentiments of people in a global context, which are then compared to share prices in an economic context, it can be classified as social science \citep{Recker2013}.
97
+
In the following the research actions which will be undertaken to answer the questions and fulfill the goals are explained.
98
98
99
99
\begin{itemize}
100
-
\item To find answers to the questions \textbf{Q1} to \textbf{Q5} literature research has been conducted.
101
-
A keyword search has been performed on the literature search-engine \emph{Google Scholar} as well as library search.
102
-
The retrieved literature is reviewed and based on the references new literature is obtained.
100
+
\item To find answers to the questions \textbf{Q1} to \textbf{Q5} literature research will be conducted.
101
+
A keyword search will be performed on the literature search engine \emph{Google Scholar}.
102
+
Furthermore, the library will be searched as well.
103
+
The retrieved literature is reviewed and, based on the references, new literature is obtained.
103
104
104
105
\item With the theoretical background which has been obtained in answering the questions \textbf{Q1} to \textbf{Q6} a tweet collection system has been set up in order to answer the question \textbf{Q7}.
105
-
This is done by setting up a open source tweet capturing system \ac{DMITCAT} system and evaluating the sentiment of the captured tweets.
106
+
This is done by setting up an open source tweet capturing system (\ac{DMITCAT}) and evaluating the sentiment of the captured tweets.
106
107
107
-
\item Question \textbf{Q8} is answered through both literature research, which has been collected for the questions \textbf{Q1} to \textbf{Q6} and evaluated sentiments of the collected tweets for question \textbf{Q7}.
108
+
\item Question \textbf{Q8} is answered through both literature research, which has been collected for the questions \textbf{Q1} to \textbf{Q6}, and evaluated sentiments of the collected tweets for question \textbf{Q7}.
108
109
\end{itemize}
109
110
110
111
\section{Structure of this Thesis}
111
112
\label{s:introduction-structureofthisthesis}
112
113
113
114
This section is followed by the background \cref{c:background}, where the necessary theoretical background will be explained.
114
-
In \cref{c:casestudy}, the setup of tweet collection is explained and the execution documented.
115
-
Afterwards, in \cref{c:analysis}, sentiments of collected tweets are determined and converted into a time series which is then compared to the time series of share prices.
116
-
Finally, in \cref{c:conclusion} the results of this work are summed up, and limitations and further points of interest are pointed out.
115
+
In \cref{c:casestudy}, the setup of tweet collection will be explained and the execution documented.
116
+
Afterwards, in \cref{c:analysis}, sentiments of collected tweets will be determined and converted into a time series which will then be compared to the time series of share prices.
117
+
Finally, in \cref{c:conclusion} the results of this work will be summed up, and limitations and further points of interest will be pointed out.
These effects can also be applied to the stock markets: not only news influences the stock market, but also public opinion and mood.
42
42
Previously large surveys have been conducted to gather the public mood of a representative sample.
43
-
This was very timeconsuming and expensive.
43
+
This was very time-consuming and expensive.
44
44
However, in the last ten years significant progress has been made in sentiment tracking techniques.
45
45
Therefore the sentiments can be extracted from news and blogs
46
46
\citep{Bollen2011a}.
@@ -80,7 +80,7 @@ \section{Option Mining}
80
80
\end{enumerate}
81
81
82
82
This study will focus on short documents that contain given keywords.
83
-
Therefore, we assume that the documents describing our targeted topic (see \cref{s:background-socialnetworks} on page \pageref{s:background-socialnetworks} for the background).
83
+
Therefore, it is assumed that the documents describe the targeted topic (see \cref{s:background-socialnetworks} on page \pageref{s:background-socialnetworks} for the background).
84
84
As a result the study will focus on sentiment classification.
85
85
86
86
Sentiment classification has some similarities with topic-based text classification, which classifies the topic of documents into predefined topic classes, for example sports, science or politics.
chapters/chapter03.tex
Lines changed: 6 additions & 6 deletions
@@ -30,7 +30,7 @@ \section{Determine Companies, Keywords and Stock Symbols to Analyze}
30
30
These companies must be traded on a stock exchange to perform the comparison with tweet sentiments.
31
31
As a single company may own several car brands a list of all brands has been set up.
32
32
The result of the analysis is depicted in \cref{tab:casestudy-brands}.
33
-
Both brands which aren't customerfacing passenger car brands and brands which do not longer exist have been omitted.
33
+
Both brands which aren't customer-facing passenger car brands and brands which no longer exist have been omitted.
34
34
Furthermore, the brands have been grouped by their owning company.
35
35
36
36
\begin{longtable}[c]{!l ^l}
@@ -139,7 +139,7 @@ \subsection{Gather Tweets}
139
139
140
140
A large set of tweets is needed to perform the analysis within a time frame of at least one month.
141
141
There were several approaches to get these tweets: download tweets directly or capture tweets within the given time frame.
142
-
As we are tracking five companies using 23 keywords (brands) there will be a quite big amount of data.
142
+
As the tracking includes five companies using 23 keywords (brands), there will be a fairly large amount of data.
143
143
144
144
Several approaches have been tried to get as many tweets as possible for the given keywords, including:
145
145
@@ -148,7 +148,7 @@ \subsection{Gather Tweets}
148
148
was the first attempt.
149
149
However, serious limitations of the official \ac{API} made this otherwise straightforward approach infeasible.
150
150
First, the standard search \ac{API} supports just a maximum count of 100 tweets;
151
-
second, it supports a history of only seven days;
151
+
secondly, it supports a history of only seven days;
152
152
and lastly, the rate limits were too tight to gather all available tweets of the seven-day period \citep{TwitterInc.2018}.
153
153
154
154
\item [Twitter search on website]
@@ -190,7 +190,7 @@ \subsection{Gather Tweets}
190
190
The storage was full after approximately 14 days of data collection.
191
191
As the problem was not detected right away, it took several days to identify and fix the issue.
192
192
193
-
\item The rate limits of the \ac{API} have been hit now and then in case too many tweets were published.
193
+
\item The rate limits of the \ac{API} were hit occasionally when too many tweets were published.
194
194
\ac{DMITCAT} continued to collect tweets automatically after the corresponding time window.
195
195
196
196
\item New releases of \ac{DMITCAT} have been published from time to time, which also required a database upgrade.
@@ -399,7 +399,7 @@ \section{Determine Sentiment of Tweets}
399
399
\citep{buitinck2013api}.
400
400
401
401
% Cross validation with GridSearchCV
402
-
Furthermore \emph{scikit-learn} provides some helpers to find the best hyper-parameters for the given problem.
402
+
Furthermore, \emph{scikit-learn} provides some helpers to find the best hyper-parameters for the given problem.
403
403
The user can define which values the various hyper-parameters can attain; the helper then performs test runs for various combinations, calculates their scores and keeps the best performing model.
404
404
Therefore, this type of search is called \emph{model selection}.
405
405
\emph{Scikit-learn} provides two different model selection helpers: \emph{GridSearchCV} and \emph{RandomizedSearchCV}.
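The model selection described above can be illustrated with a minimal sketch of \emph{GridSearchCV}. The toy tweets, the pipeline steps (\texttt{CountVectorizer} and \texttt{LogisticRegression}) and the parameter values are assumptions chosen for illustration only; they are not taken from the thesis and do not reflect the classifier actually used there.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = positive sentiment, -1 = negative sentiment.
texts = [
    "love this car", "great engine", "really nice design", "happy with the service",
    "terrible brakes", "hate the dealer", "awful recall again", "bad mileage",
]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

# Assumed pipeline: bag-of-words features followed by a linear classifier.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Values each hyper-parameter may attain; keys follow the "<step>__<parameter>" convention.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# Every combination (2 x 3 = 6) is scored with 2-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

best_params = search.best_params_  # best performing combination
best_score = search.best_score_    # its mean cross-validated accuracy
print(best_params, best_score)
```

\emph{RandomizedSearchCV} is used in much the same way, but samples a fixed number of combinations instead of trying all of them, which may be preferable when the grid is large.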
@@ -542,7 +542,7 @@ \section{Determine Sentiment of Tweets}
542
542
The stock prices are already in a time series format on a daily basis, except for weekends and holidays, but the problem of missing entries is tackled later.
543
543
First, the sentiment analysis result dataset must be condensed to form a time series for comparison.
544
544
Therefore the results per tweet are grouped per day and summed up.
545
-
As negative sentiments have the value \texttt{'-1'} and positive sentiments have the value \texttt{'1'} we receive a number which is positive in case more positive than negative tweets have been published on that given day and vice versa.
545
+
As negative sentiments have the value \texttt{'-1'} and positive sentiments have the value \texttt{'1'}, the resulting number is positive if more positive than negative tweets have been published on the given day, and vice versa.
546
546
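The per-day aggregation described above can be sketched in a few lines of Python. The function name \texttt{daily\_sentiment}, the tuple layout and the sample tweets are hypothetical and only serve to illustrate grouping per day and summing the values $-1$ and $1$.

```python
from collections import defaultdict
from datetime import datetime


def daily_sentiment(tweets):
    """Sum sentiment values (-1 or 1) per calendar day.

    `tweets` is an iterable of (created_at, sentiment) pairs, where
    `created_at` is a datetime and `sentiment` is -1 or 1.
    """
    totals = defaultdict(int)
    for created_at, sentiment in tweets:
        totals[created_at.date()] += sentiment
    # Sort by day so the result can be read as a time series.
    return dict(sorted(totals.items()))


# Hypothetical sample: two positive and one negative tweet on the first day,
# two negative tweets on the second day.
tweets = [
    (datetime(2018, 3, 1, 9, 15), 1),
    (datetime(2018, 3, 1, 12, 40), 1),
    (datetime(2018, 3, 1, 18, 5), -1),
    (datetime(2018, 3, 2, 8, 30), -1),
    (datetime(2018, 3, 2, 21, 10), -1),
]

series = daily_sentiment(tweets)
for day, value in series.items():
    print(day, value)
```

A day with more positive than negative tweets ends up with a positive sum, a day with more negative tweets with a negative one. Days without any tweets do not appear in the result and would need separate handling before a comparison with share prices.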
547
547
Missing stock prices have been calculated iteratively and the gaps have been filled by using the following procedure:
548
548
Let $x$ be a stock price value and $y$ the next present value, with one or more values missing in between.