Stock Movement Prediction from Tweets and Historical Prices

                                    Yumo Xu and Shay B. Cohen

                         School of Informatics, University of Edinburgh
                            10 Crichton Street, Edinburgh EH8 9AB
                       yumo.xu@ed.ac.uk, scohen@inf.ed.ac.uk

Abstract

Stock movement prediction is a challenging problem: the market is highly stochastic, and we make temporally-dependent predictions from chaotic data. We treat these three complexities and present a novel deep generative model jointly exploiting text and price signals for this task. Unlike the case with discriminative or topic modeling, our model introduces recurrent, continuous latent variables for a better treatment of stochasticity, and uses neural variational inference to address the intractable posterior inference. We also provide a hybrid objective with temporal auxiliary to flexibly capture predictive dependencies. We demonstrate the state-of-the-art performance of our proposed model on a new stock movement prediction dataset which we collected.1

[1] https://github.com/yumoxu/stocknet-dataset

1 Introduction

Stock movement prediction has long attracted both investors and researchers (Frankel, 1995; Edwards et al., 2007; Bollen et al., 2011; Hu et al., 2018). We present a model to predict stock price movement from tweets and historical stock prices.

In natural language processing (NLP), public news and social media are two primary content resources for stock market prediction, and the models that use these sources are often discriminative. Among them, classic research relies heavily on feature engineering (Schumaker and Chen, 2009; Oliveira et al., 2013). With the prevalence of deep neural networks (Le and Mikolov, 2014), event-driven approaches were studied with structured event representations (Ding et al., 2014, 2015). More recently, Hu et al. (2018) propose to mine news sequences directly from text with hierarchical attention mechanisms for stock trend prediction.

However, stock movement prediction is widely considered difficult due to the high stochasticity of the market: stock prices are largely driven by new information, resulting in a random-walk pattern (Malkiel, 1999). Instead of using only deterministic features, generative topic models were extended to jointly learn topics and sentiments for the task (Si et al., 2013; Nguyen and Shirai, 2015). Compared to discriminative models, generative models have the natural advantage in depicting the generative process from market information to stock signals and introducing randomness. However, these models underrepresent chaotic social texts with bag-of-words and employ simple discrete latent variables.

In essence, stock movement prediction is a time series problem. The significance of the temporal dependency between movement predictions is not addressed in existing NLP research. For instance, when a company suffers from a major scandal on a trading day d1, generally, its stock price will have a downtrend in the coming trading days until day d2, i.e. [d1, d2].2 If a stock predictor can recognize this decline pattern, it is likely to benefit all the predictions of the movements during [d1, d2]. Otherwise, the accuracy in this interval might be harmed. This predictive dependency is a result of the fact that public information, e.g. a company scandal, needs time to be absorbed into movements (Luss and d'Aspremont, 2015), and thus is largely shared across temporally-close predictions.

[2] We use the notation [a, b] to denote the interval of integer numbers between a and b.

Aiming to tackle the above-mentioned outstanding research gaps in terms of modeling high market stochasticity, chaotic market information and temporally-dependent prediction, we propose
StockNet, a deep generative model for stock movement prediction.

To better incorporate stochastic factors, we generate stock movements from latent driven factors modeled with recurrent, continuous latent variables. Motivated by Variational Auto-Encoders (VAEs; Kingma and Welling, 2013; Rezende et al., 2014), we propose a novel decoder with a variational architecture and derive a recurrent variational lower bound for end-to-end training (Section 5.2). To the best of our knowledge, StockNet is the first deep generative model for stock movement prediction.

To fully exploit market information, StockNet directly learns from data without pre-extracting structured events. We build market sources by referring to both fundamental information, e.g. tweets, and technical features, e.g. historical stock prices (Section 5.1).3 To accurately depict predictive dependencies, we assume that the movement prediction for a stock can benefit from learning to predict its historical movements in a lag window. We propose trading-day alignment as the framework basis (Section 4), and further provide a novel multi-task learning objective (Section 5.3).

[3] To a fundamentalist, stocks have their intrinsic values that can be derived from the behavior and performance of their company. On the contrary, technical analysis considers only the trends and patterns of the stock price.

We evaluate StockNet on a stock movement prediction task with a new dataset that we collected. Compared with strong baselines, our experiments show that StockNet achieves state-of-the-art performance by incorporating both data from Twitter and historical stock price listings.

2 Problem Formulation

We aim at predicting the movement of a target stock s in a pre-selected stock collection S on a target trading day d. Formally, we use the market information comprising relevant social media corpora M, i.e. tweets, and historical prices, in the lag [d − ∆d, d − 1] where ∆d is a fixed lag size. We estimate the binary movement where 1 denotes rise and 0 denotes fall,

    y = 1(p^c_d > p^c_{d−1})    (1)

where p^c_d denotes the adjusted closing price, adjusted for corporate actions affecting stock prices, e.g. dividends and splits.4 The adjusted closing price is widely used for predicting stock price movement (Xie et al., 2013) or financial volatility (Rekabsaz et al., 2017).

[4] Technically, d − 1 may not be an eligible trading day and thus has no available price information. In the rest of this paper, the problem is solved by keeping the notational consistency with our recurrent model and using its time step t to index trading days. Details will be provided in Section 4. We use d here to make the formulation easier to follow.

3 Data Collection

In finance, stocks are categorized into 9 industries: Basic Materials, Consumer Goods, Healthcare, Services, Utilities, Conglomerates, Financial, Industrial Goods and Technology.5 Since high-trade-volume stocks tend to be discussed more on Twitter, we select the two-year price movements from 01/01/2014 to 01/01/2016 of 88 stocks to target, coming from all the 8 stocks in Conglomerates and the top 10 stocks in capital size in each of the other 8 industries (see supplementary material).

[5] https://finance.yahoo.com/industries

We observe that there are a number of targets with exceptionally minor movement ratios. In a three-way stock trend prediction task, a common practice is to categorize these movements to another "preserve" class by setting upper and lower thresholds on the stock price change (Hu et al., 2018). Since we aim at the binary classification of stock changes identifiable from social media, we set two particular thresholds, -0.5% and 0.55%, and simply remove 38.72% of the selected targets with movement percents between the two thresholds. Samples with movement percents ≤ -0.5% and > 0.55% are labeled with 0 and 1, respectively. The two thresholds are selected to balance the two classes, resulting in 26,614 prediction targets in the whole dataset with 49.78% and 50.22% of them in the two classes. We split them temporally: 20,339 movements between 01/01/2014 and 01/08/2015 are for training, 2,555 movements from 01/08/2015 to 01/10/2015 are for development, and 3,720 movements from 01/10/2015 to 01/01/2016 are for test.
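As a concrete illustration of Eq. (1) combined with the thresholds above, the following is a minimal sketch of the labeling step; it assumes a pandas series of adjusted closing prices per stock, and the function name and threshold constants are ours rather than from the released code.

```python
import pandas as pd

FALL_THRESHOLD = -0.005   # movement <= -0.5%  -> label 0 (fall)
RISE_THRESHOLD = 0.0055   # movement >  0.55%  -> label 1 (rise)

def label_movements(adj_close: pd.Series) -> pd.Series:
    """Binary movement labels from adjusted closing prices (Eq. 1), dropping
    targets whose movement percent lies between the two thresholds."""
    movement = adj_close.pct_change()            # (p_d - p_{d-1}) / p_{d-1}
    labels = pd.Series(index=adj_close.index, dtype="float64")
    labels[movement > RISE_THRESHOLD] = 1.0
    labels[movement <= FALL_THRESHOLD] = 0.0
    return labels.dropna()                       # also drops the first day (no p_{d-1})
```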
There are two main components in our dataset:6 a Twitter dataset and a historical price dataset. We access Twitter data under the official license of Twitter, then retrieve stock-specific tweets by querying regexes made up of NASDAQ ticker symbols, e.g. "\$GOOG\b" for Google Inc.
We preprocess tweet texts using the NLTK package (Bird et al., 2009) with the particular Twitter mode, including for tokenization and treatment of hyperlinks, hashtags and the "@" identifier. To alleviate sparsity, we further filter samples by ensuring there is at least one tweet for each corpus in the lag. We extract historical prices for the 88 selected stocks to build the historical price dataset from Yahoo Finance.7

[6] Our dataset is available at https://github.com/yumoxu/stocknet-dataset.
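For illustration, a minimal sketch of the cashtag-based retrieval described above; the paper only quotes the pattern "\$GOOG\b", so the ticker list and helper function below are hypothetical.

```python
import re

TICKERS = ["GOOG", "AAPL", "XOM"]  # hypothetical subset of the 88 selected stocks

# One regex per stock, following the quoted "\$GOOG\b" pattern.
TICKER_PATTERNS = {t: re.compile(r"\$" + t + r"\b") for t in TICKERS}

def stocks_mentioned(tweet_text: str) -> list:
    """Return the tickers whose cashtag (e.g. "$GOOG") appears in a tweet."""
    return [t for t, pattern in TICKER_PATTERNS.items() if pattern.search(tweet_text)]
```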
4 Model Overview

[Figure 1: graphical model linking observed market information X, latent driven factor Z and movement y, with parameters θ, over a plate of |D| samples.]

Figure 1: Illustration of the generative process from observed market information to stock movements. We use solid lines to denote the generation process and dashed lines to denote the variational approximation to the intractable posterior.

We provide an overview of data alignment, model factorization and model components.

As explained in Section 1, we assume that predicting the movement on trading day d can benefit from predicting the movements on its former trading days. However, due to the general principle of sample independence, building connections directly across samples with temporally-close target dates is problematic for model training.

As an alternative, we notice that within a sample with a target trading day d there are likely to [...] calendar day used in existing research, as the basic unit for building samples. To this end, we first find all the T eligible trading days referred in a sample, in other words, existing in the time interval [d − ∆d + 1, d]. For clarity, in the scope of one sample, we index these trading days with t ∈ [1, T],8 and each of them maps to an actual (absolute) trading day dt. We then propose trading-day alignment: we reorganize our inputs, including the tweet corpora and historical prices, by aligning them to these T trading days. Specifically, on the tth trading day, we recognize market signals from the corpus Mt in [dt−1, dt) and the historical prices pt on dt−1, for predicting the movement yt on dt. We provide an aligned sample for illustration in Figure 2. As a result, every single unit in a sample is a trading day, and we can predict a sequence of movements y = [y1, . . . , yT]. The main target is yT while the remainder y∗ = [y1, . . . , yT−1] serves as the temporal auxiliary target. We use these in addition to the main target to improve prediction accuracy (Section 5.3).
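The following is a rough sketch of trading-day alignment as we read it from the description above, assuming calendar-keyed dictionaries of tweets and price features; the data structures and function name are ours, and handling of the first trading day in the window (which needs the preceding trading day d_0) is omitted for brevity.

```python
from datetime import date, timedelta

def align_sample(target_day: date, trading_days: set, tweets_by_day: dict,
                 prices_by_day: dict, lag: int = 5) -> list:
    """Keep the eligible trading days in [d - lag + 1, d]; pair the corpus in
    [d_{t-1}, d_t) and the prices of d_{t-1} with the movement target on d_t."""
    window = [target_day - timedelta(days=offset) for offset in range(lag - 1, -1, -1)]
    days = [day for day in window if day in trading_days]   # d_1, ..., d_T
    aligned = []
    for prev_day, day in zip(days[:-1], days[1:]):
        corpus = [tweet for tweet_day, tweets in tweets_by_day.items()
                  if prev_day <= tweet_day < day for tweet in tweets]
        aligned.append({"corpus": corpus,                  # M_t
                        "price": prices_by_day[prev_day],  # p_t
                        "target_day": day})                # y_t is the movement on d_t
    return aligned
```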
We model the generative process shown in Figure 1. We encode observed market information as a random variable X = [x1; . . . ; xT], from which we generate the latent driven factor Z = [z1; . . . ; zT] for our prediction task. For the aforementioned multi-task learning purpose, we aim at modeling the conditional probability distribution pθ(y|X) = ∫_Z pθ(y, Z|X) instead of pθ(yT|X). We write the following factorization for generation,

    pθ(y, Z|X) = pθ(yT | X, Z) pθ(zT | z...
[Figure 2: StockNet architecture. Panels: (a) Variational Movement Decoder (VMD), (b) Market Information Encoder (MIE), (c) Attentive Temporal Auxiliary (ATA), (d) VAEs. Message corpora pass through a Message Embedding Layer with Bi-GRUs and attention; together with historical prices they feed a variational encoder/decoder (z, µ, log σ², ε ∼ N(0, I), with a KL term DKL[N(µ, σ²) ‖ N(0, I)]), and a temporal attention over the per-day outputs produces the training objective.]
Figure 2: The architecture of StockNet. We use the main target of 07/08/2012 and the lag size of 5 for
illustration. Since 04/08/2012 and 05/08/2012 are not trading days (a weekend), trading-day alignment
helps StockNet to organize message corpora and historical prices for the other three trading days in the
lag. We use dashed lines to denote auxiliary components. Red points denoting temporal objectives are
integrated with a temporal attention mechanism to acquire the final training objective.

[...] comprises three primary components following a bottom-up fashion,

1. Market Information Encoder (MIE) that encodes tweets and prices to X;

2. Variational Movement Decoder (VMD) that infers Z with X, y and decodes stock movements y from X, Z;

3. Attentive Temporal Auxiliary (ATA) that integrates temporal loss through an attention mechanism for model training.

5 Model Components

We detail next the components of our model (MIE, VMD, ATA) and the way we estimate our model parameters.

5.1 Market Information Encoder

MIE encodes information from social media and stock prices to enhance market information quality, and outputs the market information input X for VMD. Each temporal input is defined as

    x_t = [c_t, p_t]    (3)

where c_t and p_t are the corpus embedding and the historical price vector, respectively.

The basic strategy of acquiring c_t is to first feed messages into the Message Embedding Layer for their low-dimensional representations, then selectively gather them according to their quality. To handle the circumstance that multiple stocks are discussed in one single message, in addition to text information, we incorporate the position information of stock symbols mentioned in messages as well. Specifically, the layer consists of a forward GRU and a backward GRU for the preceding and following contexts of a stock symbol, s, respectively. Formally, in the message corpus of the tth trading day, we denote the word sequence of the kth message, k ∈ [1, K], as W where W_{ℓ*} = s, ℓ* ∈ [1, L], and its word embedding matrix as E = [e1; e2; . . . ; eL]. We run the two GRUs as follows,

    →h_f = GRU(e_f, →h_{f−1})    (4)
    ←h_b = GRU(e_b, ←h_{b+1})    (5)
    m = (→h_{ℓ*} + ←h_{ℓ*}) / 2    (6)

where f ∈ [1, . . . , ℓ*], b ∈ [ℓ*, . . . , L]. The stock symbol is regarded as the last unit in both the preceding and the following contexts, where the hidden values →h_{ℓ*} and ←h_{ℓ*} are averaged to acquire the message embedding m.
Gathering all message embeddings for the tth trading day, we have a message embedding matrix M_t ∈ R^{dm×K}. In practice, the layer takes as inputs a five-rank tensor for a mini-batch, and yields all M_t in the batch with shared parameters.
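To make Eqs. (4)-(6) concrete, here is a small sketch of the symbol-centred message embedding; it is written with PyTorch GRU cells rather than the authors' TensorFlow implementation, and the class name is ours.

```python
import torch
import torch.nn as nn

class MessageEmbedder(nn.Module):
    """Sketch of Eqs. (4)-(6): a forward GRU over the context up to the stock
    symbol and a backward GRU over the context from the symbol onward; the two
    hidden states at the symbol position are averaged into the message embedding."""
    def __init__(self, word_dim: int, hidden_dim: int):
        super().__init__()
        self.fwd = nn.GRUCell(word_dim, hidden_dim)
        self.bwd = nn.GRUCell(word_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, emb: torch.Tensor, sym_pos: int) -> torch.Tensor:
        # emb: (L, word_dim) word embeddings e_1 .. e_L of one message
        h = torch.zeros(1, self.hidden_dim)
        for f in range(sym_pos + 1):                        # f in [1, l*]
            h = self.fwd(emb[f].unsqueeze(0), h)            # Eq. (4)
        h_fwd = h
        h = torch.zeros(1, self.hidden_dim)
        for b in range(emb.size(0) - 1, sym_pos - 1, -1):   # b in [l*, L]
            h = self.bwd(emb[b].unsqueeze(0), h)            # Eq. (5)
        h_bwd = h
        return (h_fwd + h_bwd).squeeze(0) / 2               # Eq. (6): message embedding m
```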
Tweet quality varies drastically. Inspired by the news-level attention (Hu et al., 2018), we weight messages with their respective salience in collective intelligence measurement. Specifically, we first project M_t non-linearly to u_t, the normalized attention weight over the corpus,

    u_t = ζ(w_u^⊤ tanh(W_{m,u} M_t))    (7)

where ζ(·) is the softmax function and W_{m,u} ∈ R^{dm×dm}, w_u ∈ R^{dm×1} are model parameters. Then we compose messages accordingly to acquire the corpus embedding,

    c_t = M_t u_t^⊤.    (8)
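A minimal sketch of the message attention in Eqs. (7)-(8), again written with PyTorch for illustration; the module name is ours.

```python
import torch
import torch.nn as nn

class CorpusEmbedder(nn.Module):
    """Sketch of Eqs. (7)-(8): score each message embedding, softmax-normalize
    the scores over the day's corpus, and take the weighted sum as c_t."""
    def __init__(self, msg_dim: int):
        super().__init__()
        self.proj = nn.Linear(msg_dim, msg_dim, bias=False)   # W_{m,u}
        self.score = nn.Linear(msg_dim, 1, bias=False)        # w_u

    def forward(self, messages: torch.Tensor) -> torch.Tensor:
        # messages: (K, msg_dim), the columns of M_t stacked as rows
        scores = self.score(torch.tanh(self.proj(messages))).squeeze(-1)
        u = torch.softmax(scores, dim=0)                       # Eq. (7)
        return u @ messages                                    # Eq. (8): c_t = M_t u^T
```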
[...] Neural approximation aims at minimizing the Kullback-Leibler divergence between qφ(Z|X, y) and pθ(Z|X, y). Instead of optimizing it directly, we observe that the following equation naturally holds,

    log pθ(y|X) = DKL[qφ(Z|X, y) ‖ pθ(Z|X, y)]
                + E_{qφ(Z|X,y)}[log pθ(y|X, Z)]
                − DKL[qφ(Z|X, y) ‖ pθ(Z|X)]    (10)

where DKL[q ‖ p] is the Kullback-Leibler divergence between the distributions q and p. Therefore, we equivalently maximize the following variational recurrent lower bound by plugging Eq. (2, 9) into Eq. (10),

    L(θ, φ; X, y)    (11)
        = Σ_{t=1}^{T} E_{qφ(z_t | z...
[...] and the shared hidden representation h^z_t as

    h^z_t = tanh(W^φ_z [z_{t−1}, x_t, h^s_t, y_t] + b^φ_z)    (16)

where W^φ_{z,µ}, W^φ_{z,δ}, W^φ_z are weight matrices and b^φ_µ, b^φ_δ, b^φ_z are biases.

Since the Gaussian distribution belongs to the "location-scale" distribution family, we can further reparameterize z_t as

    z_t = µ_t + δ_t ⊙ ε    (17)

where ⊙ denotes an element-wise product. The noise term ε ∼ N(0, I) naturally involves stochastic signals in our model.
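The reparameterization in Eq. (17) can be sketched as follows (the function name is ours); sampling stays differentiable with respect to µ_t and δ_t because the randomness is isolated in ε.

```python
import torch

def reparameterize(mu: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """Eq. (17): z_t = mu_t + delta_t * eps, with eps ~ N(0, I)."""
    eps = torch.randn_like(delta)    # the stochastic signal
    return mu + delta * eps
```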
Similarly, we let the prior pθ(z_t | z...

[Figure 3: the temporal attention combines an information score and a dependency score over g_1, . . . , g_T to weight the temporal objectives into the final training objective for ỹ_T.]

Figure 3: The temporal attention in our model. Squares are the non-linear projections of g_t and ...
[...] where we adopt the KL term annealing trick (Bowman et al., 2016; Semeniuta et al., 2017) and add a linearly-increasing KL term weight λ ∈ (0, 1] to gradually release the KL regularization effect in the training procedure. Then we reuse v∗ to build the final temporal weight vector v ∈ R^{1×T},

    v = [αv∗, 1]    (28)

where 1 is for the main prediction and we adopt the auxiliary weight α ∈ [0, 1] to control the overall auxiliary effects on the model training. α is tuned on the development set and its effects will be discussed at length in Section 6.5. Finally, we write the training objective F by recomposition,

    F(θ, φ; X, y) = (1/N) Σ_{n}^{N} v^(n) f^(n)    (29)

where our model can learn to generalize with the selective attendance of temporal auxiliary. We take the derivative of F with respect to all the model parameters {θ, φ} through backpropagation for the update.
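As a sketch of how Eqs. (28)-(29) combine the per-day objectives, assuming f holds the per-sample, per-trading-day objective terms and v* the attention weights of the auxiliary targets (the tensor shapes and function name are our assumptions):

```python
import torch

def training_objective(f: torch.Tensor, v_star: torch.Tensor, alpha: float) -> torch.Tensor:
    """Eqs. (28)-(29): v = [alpha * v*, 1], F = (1/N) sum_n v^(n) . f^(n)."""
    # f: (N, T) objective terms; v_star: (N, T-1) temporal attention weights
    ones = torch.ones(f.size(0), 1)                 # weight 1 for the main prediction
    v = torch.cat([alpha * v_star, ones], dim=1)    # Eq. (28)
    return (v * f).sum(dim=1).mean()                # Eq. (29)
```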
6 Experiments

In this section, we detail our experimental setup and results.

6.1 Training Setup

We use a 5-day lag window for sample construction and 32 shuffled samples in a batch.9 The maximal token number contained in a message and the maximal message number on a trading day are empirically set to 30 and 40, respectively, with the excess clipped. Since all tweets in the batched samples are simultaneously fed into the model, we set the word embedding size to 50 instead of larger sizes to control memory costs and make model training feasible on one single GPU (11GB memory). We set the hidden size of the Message Embedding Layer to 100 and that of VMD to 150. All weight matrices in the model are initialized with the fan-in trick and biases are initialized with zero. We train the model with an Adam optimizer (Kingma and Ba, 2014) with the initial learning rate of 0.001. Following Bowman et al. (2016), we use the input dropout rate of 0.3 to regularize latent variables. Tensorflow (Abadi et al., 2016) is used to construct the computational graph of StockNet and hyper-parameters are tweaked on the development set.

[9] Typically the lag size is set between 3 and 10. As introduced in Section 4, trading days are treated as basic units in StockNet and 3 calendar days are thus too short to guarantee the existence of more than one trading day in a lag, e.g. the prediction for the movement of Monday. We also experiment with 7 and 10 but they do not yield better results than 5.
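For reference, the hyper-parameters stated in this section can be collected as follows; the dictionary layout is ours, not the configuration format of the released code.

```python
# Training setup from Section 6.1 (values as reported in the text).
STOCKNET_SETUP = {
    "lag_window_days": 5,
    "batch_size": 32,
    "max_tokens_per_message": 30,
    "max_messages_per_day": 40,
    "word_embedding_size": 50,
    "message_embedding_hidden_size": 100,
    "vmd_hidden_size": 150,
    "weight_init": "fan-in",
    "bias_init": 0.0,
    "optimizer": "Adam",
    "initial_learning_rate": 0.001,
    "input_dropout": 0.3,
}
```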
memory). We set the hidden size of Message Em-                       To make a detailed analysis of all the primary
bedding Layer to 100 and that of VMD to 150.                      components in StockNet, in addition to H EDGE -
All weight matrices in the model are initialized                  F UNDA NALYST, the fully-equipped StockNet, we
with the fan-in trick and biases are initialized with             also construct the following four variations,
zero. We train the model with an Adam optimizer                      • T ECHNICAL A NALYST: the generative StockNet
(Kingma and Ba, 2014) with the initial learning                        using only historical prices.
rate of 0.001. Following Bowman et al. (2016), we                    • F UNDAMENTAL A NALYST: the generative Stock-
                                                                       Net using only tweet information.
    9
      Typically the lag size is set between 3 and 10. As intro-      • I NDEPENDENTA NALYST: the generative Stock-
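For concreteness, Eq. (30) can be computed as below; the zero-denominator convention is ours, since the paper does not specify it.

```python
import math

def matthews_corrcoef(tp: int, fp: int, tn: int, fn: int) -> float:
    """Eq. (30): MCC from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0   # convention when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denominator
```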
6.3 Baselines and Proposed Models

We construct the following five baselines in different genres,10

• RAND: a naive predictor making random guesses of up or down.
• ARIMA: Autoregressive Integrated Moving Average, an advanced technical analysis method using only price signals (Brown, 2004).
• RANDFOREST: a discriminative Random Forest classifier using Word2vec text representations (Pagolu et al., 2016).
• TSLDA: a generative topic model jointly learning topics and sentiments (Nguyen and Shirai, 2015).
• HAN: a state-of-the-art discriminative deep neural network with hierarchical attention (Hu et al., 2018).

[10] We do not treat event-driven models as comparable methods since our model uses no event pre-extraction tool.
To make a detailed analysis of all the primary components in StockNet, in addition to HEDGEFUNDANALYST, the fully-equipped StockNet, we also construct the following four variations,

• TECHNICALANALYST: the generative StockNet using only historical prices.
• FUNDAMENTALANALYST: the generative StockNet using only tweet information.
• INDEPENDENTANALYST: the generative StockNet without temporal auxiliary targets.
• DISCRIMINATIVEANALYST: the discriminative StockNet directly optimizing the likelihood objective. Following Zhang et al. (2016), we set z_t = µ′_t to take out the effects of the KL term.

Baseline models                      Acc.     MCC
RAND                                 50.89    -0.002266
ARIMA (Brown, 2004)                  51.39    -0.020588
RANDFOREST (Pagolu et al., 2016)     53.08     0.012929
TSLDA (Nguyen and Shirai, 2015)      54.07     0.065382
HAN (Hu et al., 2018)                57.64     0.051800

StockNet variations                  Acc.     MCC
TECHNICALANALYST                     54.96     0.016456
FUNDAMENTALANALYST                   58.23     0.071704
INDEPENDENTANALYST                   57.54     0.036610
DISCRIMINATIVEANALYST                56.15     0.056493
HEDGEFUNDANALYST                     58.23     0.080796

Table 1: Performance of baselines and StockNet variations in accuracy and MCC.

Figure 4: (a) Performance of HEDGEFUNDANALYST with varied α, see Eq. (28). (b) Performance of DISCRIMINATIVEANALYST with varied α.

6.4 Results

Since stock prediction is a challenging task and a minor improvement usually leads to large potential profits, an accuracy of 56% is generally reported as a satisfying result for binary stock movement prediction (Nguyen and Shirai, 2015). We show the performance of the baselines and our proposed models in Table 1. TSLDA is the best baseline in MCC while HAN is the best baseline in accuracy. Our model, HEDGEFUNDANALYST, achieves the best performance of 58.23 in accuracy and 0.080796 in MCC, outperforming TSLDA and HAN by 4.16, 0.59 in accuracy, and 0.015414, 0.028996 in MCC, respectively.

Though slightly better than random guessing, classic technical analysis, e.g. ARIMA, does not yield satisfying results. While similar in using only historical prices, TECHNICALANALYST shows an obvious advantage in this task compared with ARIMA. We believe there are two major reasons: (1) TECHNICALANALYST learns from training data and incorporates more flexible non-linearity; (2) our test set contains a large number of stocks while ARIMA is more sensitive to peculiar sequence stationarity. It is worth noting that FUNDAMENTALANALYST gains exceptionally competitive results with only 0.009092 less in MCC than HEDGEFUNDANALYST. The performance of FUNDAMENTALANALYST and TECHNICALANALYST confirms the positive effects from tweets and historical prices in stock movement prediction, respectively. As an effective ensemble of the two sources of market information, HEDGEFUNDANALYST gains even better performance.

Compared with DISCRIMINATIVEANALYST, the performance improvements of HEDGEFUNDANALYST are not from enlarging the networks, demonstrating that modeling underlying market status explicitly with latent driven factors indeed benefits stock movement prediction. The comparison with INDEPENDENTANALYST also shows the effectiveness of capturing temporal dependencies between predictions with the temporal auxiliary. However, the effects of the temporal auxiliary are more complex and will be analyzed further in the next section.

6.5 Effects of Temporal Auxiliary

We provide a detailed discussion of how the temporal auxiliary affects model performance. As introduced in Eq. (28), the temporal auxiliary weight α controls the overall effects of the objective-level temporal auxiliary on our model. Figure 4 presents how the performance of HEDGEFUNDANALYST and DISCRIMINATIVEANALYST fluctuates with α.

As shown in Figure 4, enhanced by the temporal auxiliary, HEDGEFUNDANALYST approaches the best performance at 0.5, and DISCRIMINATIVEANALYST
achieves its maximum at 0.7. In fact, the objective-level auxiliary can be regarded as a denoising regularizer: for a sample with a specific movement as the main target, the market source in the lag can be heterogeneous, e.g. affected by bad news, tweets on earlier days may be negative but turn positive due to timely crisis management. Without temporal auxiliary tasks, the model tries to identify positive signals on earlier days only for the main target of rise movement, which is likely to result in pure noise. In such cases, temporal auxiliary tasks help to filter market sources in the lag as per their respective aligned auxiliary movements. Besides, from the perspective of training variational models, the temporal auxiliary helps HEDGEFUNDANALYST to encode more useful information into the latent driven factor Z, which is consistent with recent research in VAEs (Semeniuta et al., 2017). Compared with HEDGEFUNDANALYST, which contains a KL term performing dynamic regularization, DISCRIMINATIVEANALYST requires stronger regularization effects coming with a bigger α to achieve its best performance.

Since y∗ is also involved in generating yT through the temporal attention, tweaking α acts as a trade-off between focusing on the main target and generalizing by denoising. Therefore, as shown in Figure 4, our models do not linearly benefit from incorporating the temporal auxiliary. In fact, the two models follow a similar pattern in terms of performance change: the curves first drop down with the increase of α, except the MCC curve for DISCRIMINATIVEANALYST rising up temporarily at 0.3. After that, the curves ascend abruptly to their maximums, then keep descending till α = 1. Though the start phase of increasing α even leads to worse performance, when auxiliary effects are properly introduced, the two models finally gain better results than those with no involvement of auxiliary effects, e.g. INDEPENDENTANALYST.

7 Conclusion

We demonstrated the effectiveness of deep generative approaches for stock movement prediction from social media data by introducing StockNet, a neural network architecture for this task. We tested our model on a new comprehensive dataset and showed it performs better than strong baselines, including implementations of previous work. Our comprehensive dataset is publicly available at https://github.com/yumoxu/stocknet-dataset.

Acknowledgments

The authors would like to thank the three anonymous reviewers and Miles Osborne for their helpful comments. This research was supported by a grant from Bloomberg and by the H2020 project SUMMA, under grant agreement 688139.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.

Johan Bollen, Huina Mao, and Xiaojun Zeng. 2011. Twitter mood predicts the stock market. Journal of Computational Science 2(1):1–8.

Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning. Berlin, Germany, pages 10–21.

Robert Goodell Brown. 2004. Smoothing, forecasting and prediction of discrete time series. Courier Corporation.

Rich Caruana. 1998. Multitask learning. In Learning to Learn, Springer, pages 95–133.

Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2014. Using structured events to predict stock price movement: An empirical investigation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar, pages 1415–1425.

Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2015. Deep learning for event-driven stock prediction. In Proceedings of the 24th International Conference on Artificial Intelligence. Buenos Aires, Argentina, pages 2327–2333.

Robert D Edwards, WHC Bassetti, and John Magee. 2007. Technical analysis of stock trends. CRC Press.

Jeffrey A Frankel. 1995. Financial markets and monetary policy. MIT Press.
Ziniu Hu, Weiqing Liu, Jiang Bian, Xuanzhe Liu, and Tie-Yan Liu. 2018. Listening to chaotic whispers: A deep learning framework for news-oriented stock trend prediction. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, Los Angeles, California, USA, pages 261–269.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.

Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning. JMLR.org, Beijing, China, pages 1188–1196.

Piji Li, Wai Lam, Lidong Bing, and Zihao Wang. 2017. Deep recurrent generative decoder for abstractive text summarization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark, pages 2081–2090.

Ronny Luss and Alexandre d'Aspremont. 2015. Predicting abnormal returns from news using text classification. Quantitative Finance 15(6):999–1012.

Burton Gordon Malkiel. 1999. A random walk down Wall Street: including a life-cycle guide to personal investing. WW Norton & Company.

Thien Hai Nguyen and Kiyoaki Shirai. 2015. Topic modeling based sentiment analysis on social media for stock market prediction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Beijing, China, volume 1, pages 1354–1364.

Nuno Oliveira, Paulo Cortez, and Nelson Areal. 2013. Some experiments on modeling stock market behavior using investor sentiment analysis and posting volume from Twitter. In Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics. ACM, Madrid, Spain, page 31.

Venkata Sasank Pagolu, Kamal Nayan Reddy, Ganapati Panda, and Babita Majhi. 2016. Sentiment analysis of twitter data for predicting stock market movements. In Proceedings of the 2016 International Conference on Signal Processing, Communication, Power and Embedded System. IEEE, Rajaseetapuram, India, pages 1345–1350.

Navid Rekabsaz, Mihai Lupu, Artem Baklanov, Alexander Dür, Linda Andersson, and Allan Hanbury. 2017. Volatility prediction using financial disclosures sentiments with word embedding-based IR models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Vancouver, Canada, volume 1, pages 1712–1721.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning. Beijing, China, pages 1278–1286.

Robert P Schumaker and Hsinchun Chen. 2009. Textual analysis of stock market prediction using breaking financial news: The AZFin Text system. ACM Transactions on Information Systems 27(2):12.

Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. 2017. A hybrid convolutional variational autoencoder for text generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark, pages 627–637.

Jianfeng Si, Arjun Mukherjee, Bing Liu, Qing Li, Huayi Li, and Xiaotie Deng. 2013. Exploiting topic based twitter sentiment for stock prediction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Sofia, Bulgaria, volume 2, pages 24–29.

Boyi Xie, Rebecca J Passonneau, Leon Wu, and Germán G Creamer. 2013. Semantic frames to predict stock price movement. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Sofia, Bulgaria, volume 1, pages 873–883.

Biao Zhang, Deyi Xiong, Hong Duan, Min Zhang, et al. 2016. Variational neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, USA, pages 521–530.