Urban Mobility Prediction Using Twitter

Page created by Sergio Higgins
 
CONTINUE READING
Urban Mobility Prediction Using Twitter
2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing,
           Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress

                  Urban Mobility Prediction Using Twitter
                      Saeed Khan                                Ash Rahimi                               Neil Bergmann
                    School of ITEE                           School of ITEE                               School of ITEE
                University of Queensland                 University of Queensland                    University of Queensland
                  Brisbane, Australia                      Brisbane, Australia                          Brisbane, Australia
                   s.khan@uq.edu.au                        a.rahimi@uq.edu.au                       n.bergmann@itee.uq.edu.au

      Abstract—The characteristics and dynamics of human mobility               and predictability [5]. Various entropy and predictability mea-
   have vital implications in areas such as disaster management,                sures are used to determine whether future movements of users
   transportation planning and infrastructure management. While                 have a strong correlation with past locations or they depend
   aggregate mobility modeling is useful for getting a broader
   overview of the system, the prediction of future movements                   only on the current location. Such analysis helps provide
   of people in urban areas is also of significance. This work                   an upper bound for the prediction accuracy a state-of-the-
   investigates the individual-level mobility of Twitter users in three         art prediction algorithm [6] can achieve. Then, the prediction
   Australian cities using the concepts of entropy and predictability.          problem is discussed to ascertain how predictable the users
   Twitter users are distinguished on the basis of their movement               are in general, and whether all cities are equally predictable
   patterns and two distinct groups are identified. The randomness
   and regularity in their movements are calculated via multiple                in terms of their respective user movements.
   metrics, and prediction for the most active users in these cities
   is also performed. The top 10% of Brisbane users have 76.6%                                      II. R ELATED W ORK
   prediction accuracy, much higher than the other cities, suggesting              A number of studies have attempted to calculate the
   heterogeneity among various cities.                                          theoretical limits of predictability. In [5], Song et al.
     Index Terms—mobility, twitter, entropy, prediction, Markov                 investigated the limits of predictability of user trajectories
   chain                                                                        by calculating their entropies using cell phone records and
                                                                                observed 93% theoretical predictability in their mobility
                          I. I NTRODUCTION                                      behaviour. Based on the results, they argue there is a huge
      Urban mobility modeling aims to understand the mobility                   potential to explore the regularities of human mobility by
   dynamics of people with applications in numerous domains                     using cell phones. In [6], Lu et al. calculated the movement
   such as disaster management [1], transportation network                      uncertainties of a large number of people among cell phone
   planning [2] and infrastructure management [3]. With this                    subscribers in Ivory Coast by using the frequencies as well as
   goal, we looked at the system-wide mobility of Twitter users                 the temporal correlations of their trajectories. Their analysis
   in our previous work [4], whereas in this paper we explore                   revealed a high theoretical predictability of up to 88%. In
   the individual-based mobility patterns of users in Australia.                another study, using a cell phone data of 500 users, Qin et al
   A desirable goal of any mobility study is the ability to                     [7] demonstrated that movement patterns and entropy relate to
   predict future movements of people. Such prediction can                      the degree of activities and locations with 78% predictability.
   be done at individual or aggregate level for area under
   study. The ability to predict the next location of people                       In [8], entropy and predictability measures are used to
   can have benefits such as improving various services being                    observe the behaviour and mobility of a large set of players
   used at a location. Previous work [4] has shown that most                    in the virtual world of an online game. In the game, players
   Twitter users are active only for a limited period of time,                  do not make any physical movements, but rather navigate a
   and only few users are active on a regular basis. These                      virtual avatar. It is observed that the movements in virtual
   relatively active users can be monitored to learn long term                  human lives follow the same high levels of predictability as
   predictable patterns, and conversely, the patterns learned can               the real-world mobility.
   help to predict the next location of visit for the users. Thus, it
   is imperative to distinguish such users from the irregular ones.                Different data sources capture different usage patterns
                                                                                which may cause uncertainty in subsequent prediction. In
     Consequently, this paper discusses how to identify and                     [9], the authors investigate the effect of using various types
   separate regular from irregular users based on their activity                of data sources on the predictability of check-in patterns of
   patterns, and investigates the prediction of their movements.                users. They observe that using multiple data sources does
   The user groups are chosen on the basis of their move-                       not necessarily raise the predictability limits, and may even
   ment/tweeting patterns on Twitter from three Australian cities:              decrease it in some cases. The volume of data, however,
   Sydney, Melbourne and Brisbane. The regularity and random-                   can affect the predictability in case of a single data source
   ness of each group are examined using the concepts of entropy                being used. In addition, the differences in user behaviour

978-1-7281-6609-4/20/$31.00 ©2020 IEEE                                    435
DOI 10.1109/DASC-PICom-CBDCom-CyberSciTech49142.2020.00082
Urban Mobility Prediction Using Twitter
patterns across various social networks can also affect how                Random Entropy: This takes into account the number of
their check-in patterns are acquired. They, however, do not              unique locations N visited by each user i and is given as:
measure the accuracy of next-place-of-visit prediction using
distance metrics such as Euclidean distance. Moreover, the                                        SiRand = log2 (N )
types of venues at which users may check-in are not taken                where N > 0 since all users visit at least one location and
into account.                                                            also because log2 (0) is undefined.

   In [10], the authors explore human mobility predictability               Shannon Entropy: The probability of visiting every unique
by analysing a set of tweets generated by Swedish users.                 location j by each user i, summed across all locations visited
They conduct the analysis in terms of temporal history of                at least once:
mobility range showing how users disperse in space, and also
                                                                                                        
                                                                                                        N
investigate entropy and its corresponding predictability. The                              SiShan = −         pi,j log2 (pi,j )
results uncover a set of users routinely visiting some locations                                        j=1
most of the time, and occasionally exploring new places as
well. By measuring the entropy of each user’s individual                 where pi,j is the number of visits to location j by user i
trajectory, a 70% theoretical predictability is achieved. They,          divided by total visits for all locations visited by user i.
however, do not perform the actual prediction for users.
                                                                            Conditional Entropy: Checks for correlation between a
                                                                         previously visited location xt−1 and a subsequent one xt in
   In [11], the Shannon and Real predictability of a particular
                                                                         time series:
set of users are analysed and correlated with corresponding                               
entropy. Using Twitter data, users with at least 100 tweets are             SiCond = −                pi (xt−1 , xt ) log2 pi (xt |xt−1 )
chosen and analysis is performed based on a certain number                                xt ∈Xi xt−1 ∈Xi
of distinct locations visited by these users. The median
                                                                         where Xi is the set of all locations visited by user
value for both entropies follows a linearly increasing trend
                                                                         i, pi (xt−1 , xt ) denotes the probability of visiting the
as the number of locations increases, however, the Shannon
                                                                         ordered pair of visited locations xt−1 and xt by user i,
entropy increases at a faster rate. Similarly, the decrease in
                                                                         and pi (xt |xt−1 ) = pi (xt−1 , xt ) /pi (xt−1 ) represents the
predictability for higher number of locations is slower for
                                                                         probability of visiting the location xt at time-ordered t given
Real predictability compared to Shannon’s. Overall, this work
                                                                         a previously visited location xt−1 by user i. Only pairs that
suggests the importance of spatio-temporal correlations in
                                                                         appear in a user history are used to ensure pi (xt |xt−1 ) > 0.
visitation patterns in predicting future tweet locations. The
actual prediction for users, however, is not analysed here.
                                                                            Real Entropy: This considers the complete spatio-temporal
                                                                         information of a visit, takes into account the frequency,
   For prediction of user movements, a number of techniques
                                                                         visitation sequence and time spent.
have been used by researchers in the literature. In [12], an
                                                                                                     n         −1
                                                                                                         m=2 lm
Auto-Regressive Integrated Moving Average (ARIMA) model
is proposed to predict the spatio-temporal variations of taxi                              SiReal =
                                                                                                      n log2 (n)
passenger numbers in China. The method uses GPS traces
from 4000 taxis and achieves a prediction performance with               where lm is the length of the shortest sequence of locations
5.8% error. In [13], the authors use Mixed Markov Model                  starting at position m which does not show in the part of
(MMM) to predict pedestrian movement in Osaka, Japan                     sequences up to position m − 1.
and claim to achieve a prediction accuracy of 74.4%. Other               B. Predictability
methods such as Gradient Boosting regression tree method
                                                                            In the mobility scenario, predictability is a measure of the
(GBM) [14] and Support Vector Machine (SVM) [15] have
                                                                         future whereabouts of a user and is inversely proportional to
also been used with varying degrees of accuracy.
                                                                         entropy. The predictability Πi of user i is subject to Fano’s
                                                                         inequality [17] and can be related to the user entropy Si as:
             III. T HEORETICAL BACKGROUND
  This section presents the theoretical background about the                        Si = H(Πi ) + (1 − Πi ) log2 (Ni − 1)          (1)
concepts and methods used in this paper for data analysis.               where  H(Πi ) is the binary entropy function which is defined
                                                                         as the entropy of a Bernoulli process with the probability of
A. Entropy                                                               success Πi that can take only two values: 0 and 1.
   Entropy is a measure of the degree of randomness in a                       H(Πi ) = −Πi log2 Πi − (1 − Πi ) log2 (1 − Πi )   (2)
process. Since movement itself is a process and users usually
move between different locations, we are trying to find any               where  represents a placeholder for a particular kind of
patterns or predictable movements between the suburbs. There             entropy, and Ni is the total possible locations visited by user
are four types of entropy as mentioned in [5], [16] and [8].             i based on their history. This means that given the entropy

                                                                   436
Urban Mobility Prediction Using Twitter
S , we can find the predictability Π by solving Equation 1
numerically. For each type of entropy, we have corresponding
type of predictability such as Random predictability, Shannon
predictability, Conditional predictability, and the Real
predictability.

   It should be noted that the term “process” refers to timeline
of each user, and hence, all of the above measures are
calculated for every user in the system. Moreover, location
means the suburb where a tweet is posted, and sequence means
a time-ordered set of suburbs.
C. Markov Model
   As mentioned earlier, our goal is to predict the next location         Fig. 1: The number of tweets in the top 20 suburbs, and the
visited by users based upon the observation of their previous             second top 20 suburbs active on Twitter in the data set.
location(s) over a period of time. The potential applications of
such analysis can include the development of location-based
services anticipating the next movement of a user or to                   the percentage of users making a certain minimum number
predict the spread of a disease from one place to another.                of tweets throughout the data collection period. A very
This can also help in emergency situation, or understanding               large share of users (79.8%) post a small number of tweets
short/long term migration patterns. For this purpose we will              (between 1 and 24), while the users making a large number
use the Markov predictor. This type of predictor represents               of tweets comprise of relatively much smaller percentage
the mobility behaviour of a user as a Markov model and                    (4.9%). This suggests that, broadly, users can be partitioned
predicts the next state based on the previous state(s). In                into two groups: casual and frequent, and hence, for the
our case, the state refers to a suburb, hence, the transition             analysis carried out in this paper, we separate the users who
between two states corresponds to the probability of moving               have posted at least 25 tweets in the data set, and call them
from one suburb to another. The mobility behaviour is treated             ‘frequent’ users, and term all other users as ‘casual’. We
as a discrete stochastic process.                                         perform the entropy and predictability analysis for both these
                                                                          groups.
          IV. DATA S ET & E XPERIMENTAL S ETUP
   The data set has been collected using Twitter Streaming API            TABLE I: The percentage of users posting a minimum number
[18], geographically covering Australia and spanning a period             of tweets in the data set. For example, 79.8% of users posted
from January 2015 till July 2016. It consists of 9462345                  between 1 and 24 tweets.
tweets from 245796 unique users. For Sydney we have 947964
geo-tagged tweets and 40281 unique users, for Melbourne we                                  No. of Tweets     % of Users
                                                                                          Between 1 and 24      79.8
have 854821 geo-tagged tweets and 35556 unique users and                                  Between 25 and 49      9.6
for Brisbane we have 276394 geo-tagged tweets and 14555                                   Between 50 and 99      5.7
unique users in the data set. For experiments, these three                                  100 and above        4.9
cities are chosen and only their top-20 suburbs are utilized
for the analysis. These suburbs are chosen w.r.t highest
                                                                            We also do the prediction analysis for top users of each city
twitter activity since they account for the largest amount of
                                                                          and discuss this in detail in §VI.
data compared to the remaining suburbs as shown in Figure 1.
                                                                              V. P RELIMINARY E NTROPY AND P REDICTABILITY
   As shown in Figure 1 the top-20 suburbs of each city
capture up to four times the twitter activity than the next 20               We start with analysing the entropy and predictability of
suburbs and as we go further down, the twitter activity drops             all users among top-20 suburbs of Sydney, Melbourne and
significantly. Moreover, the top-20 suburbs cover key spatial              Brisbane. This is done to get an idea about how random or
areas of a city where most of the movements usually take                  predictable the user movements are in general. Figure 2 (a-f)
place. The movements that users make between suburbs are                  shows the entropy and predictability measures for the three
referred to as fluxes.                                                     cities. Since the maximum value of entropy is about 4 for
                                                                          random entropy, we divide entropy range 0-4 into 8 bins with
  It is also important to give consideration to the number                width 0.5, and assign each user to one of these bins. For
of tweets posted by each user in the data set. The total                  predictability measure, we assigned users to bins with width
number of unique users posting only geo-tagged tweets in                  0.05, similarly. The number of users in each bin is shown in
our data set is 105641 and Table I provides an overview of                Figure 2.

                                                                    437
Urban Mobility Prediction Using Twitter
(a)                                                             (b)

                                  (c)                                                             (d)

                                  (e)                                                             (f)
                     Fig. 2: The number of users in each entropy and predictability bin in the three cities.

   We observe that entropy distributions do not seem to follow            of each other depends upon how the binning is performed for
the standard order of S Real ≤ S Cond ≤ S Shan ≤ S Rand                   plotting the data points.
and the corresponding predictability distributions also exhibit
coarse trend. This suggests that twitter users have wide ranging          A. Frequent vs Casual User Groups
behaviour reflecting some degree of randomness as well as                     We compute various entropies and predictabilities for fre-
some regularity. If users with similar mobility behaviour, either         quent and casual users belonging to Sydney as shown in
with high randomness or high regularity are separated into                Figure 3.
different groups, then it is possible that highly predictable                We can see that the behaviour of frequent users is different
users may be identified more easily. Visually, the extent to               from casual users whereas the casual users still show coarse
which entropy and predictability measures are mirror image                distributions like the case of all users as shown in Figure

                                                                    438
Urban Mobility Prediction Using Twitter
(a)                                                            (b)

                                     (c)                                                            (d)
                                              Fig. 3: Sydney’s frequent vs. casual users

2 (a, b). Table II summarises the percentages, averages and              than 50% of overall movements between its suburbs.
upper bound threshold for each group of all three cities’ users.
                                                                            The entropy measure can also be used to infer the
                                                                         significance of Markov chain transition probabilities i.e.
          TABLE II: User behaviour for each city
                                                                         whether the next location is highly predictable based on
          Users     No. of Fluxes per User   % of     % of
                                                                         current location. On the other hand, predictability measure
                                             Users    Fluxes             can be perceived as a theoretical upper bound of prediction
                    Max     Avg     StDev                                that can possibly be achieved using a suitable prediction
   Syd   Frequent    198     46.5    27.9    3.06%    29.17%             algorithm. In Figure 4, we show the plots for remaining two
          Casual      24     3.57     4      96.94%   70.83%
   Mel   Frequent   1218    48.43   69.67    3.00%    29.67%             cities.
          Casual      24     3.56    3.98    97.00%   70.33%
   Bri   Frequent   6044     160    809.4    2.08%    53.72%
          Casual      24     2.94    3.39    97.92%   46.28%
                                                                            In Figure 4 (c, g), we see that entropies of casual users
                                                                         do not follow the standard ordering rule and such users are
                                                                         difficult to analyse. The large proportion of very low entropy
   It is expected that frequent users will have different                values corresponds to a large number of users having a small
prediction characteristics than casual users and we analyse              number of fluxes, and prediction accuracy for these users is
the most predictable users in section VI. We also observe                unlikely to be high [19]. In contrast, the entropy distributions
that average value for frequent users is much higher than                of frequent users look better, exhibiting normal distribution as
that of casual users, and although the percentage of number              shown in Figure 4 (a, e). This also suggests that in general,
of frequent users is much smaller than the casual users,                 we have reasonably differentiated and grouped the users
the frequent users still account for a proportionately good              together from entropy point of view. It is important to note
percentage of overall movements. Brisbane shows a different              that estimation of various types of entropies and their standard
trend whereby its average value is more than three times that            order becomes exact only for infinitely long sequences where
of Sydney and Melbourne and its frequent users record more               all locations and transition probabilities between them can

                                                                   439
Urban Mobility Prediction Using Twitter
(a) Melb. frequent entropy                             (b) Melb. frequent predictability

(c) Melb. casual entropy                                (d) Melb. casual predictability

(e) Bris. frequent entropy                              (f) Bris. frequent predictability

 (g) Bris. casual entropy                                (h) Bris. casual predictability
            Fig. 4: Frequent vs Casual users entropy and predictability

                                        440
Urban Mobility Prediction Using Twitter
accurately be calculated [19].                                                      TABLE IV: Average predictability for cities

   Table III shows the average values for all types of entropies                    City                  Avg. Predictability
                                                                                              Random     Shannon     Conditional   Real
for all three cities and we utilise these figures to get further                   Sydney       0.17        0.59          0.75      0.73
insights.                                                                        Melbourne     0.16        0.57          0.75      0.71
                                                                                 Brisbane      0.32        0.73          0.82      0.79

TABLE III: Average entropies for cities and their user groups

   City/User Group     RandEnt   ShanEnt    CondEnt    RealEnt             history of locations. This means prediction problem can be
   Sydney/Frequent       2.7       1.95       1.4       1.49               presented where the actual predictability can be represented
  Melbourne/Frequent     2.71      2.01       1.4       1.54               by the conditional predictability whereby considering just
  Brisbane/Frequent      1.9       1.16       0.92      1.06               the last location returns almost the same predictability as if
                                                                           considering the entire previous history of movements.
   For Sydney frequent users, the averages for random,
                                                                              In literature, studies have been conducted about predictabil-
Shannon, conditional and real entropies are 2.7, 1.95, 1.4
                                                                           ity analysis using mobile phones. Song et al. [5], Lu et al. [6]
and 1.49 respectively. It means that the next location for a
                                                                           and Qin et al. [7] found 93%, 88% and 78% predictability for
user could randomly be out of possible 2rand ≈ 22.7 ≈ 7
                                                                           users respectively. Here, the predictability for Brisbane users
locations where each location is unique one. If we count the
                                                                           is close to the calculation of Qin et al. [7]. However, both Song
movement frequency, then this uncertainty will be shown
                                                                           et al. [5] and Qin et al. [7] did not further pursue their work to
by Shannon entropy, hence, 2Shan ≈ 21.95 ≈ 4 locations.
                                                                           do actual prediction to show whether their high predictability
So Shannon entropy suggests fewer highly probable next
                                                                           results can be achieved in practice. On the other hand, Lu
locations than random entropy. If we take into account the
                                                                           et al. [6] used a Markov Chain model to perform prediction
sequence of locations visited, then the conditional entropy
                                                                           and achieved an accuracy at par with their predictability level
will be 2cond ≈ 21.4 ≈ 3 locations. Finally, considering the
                                                                           using the first order Markov model. In the following section,
whole history of movements, the real entropy average of 1.49
                                                                           we explore the prediction problem to get further insights into
also yields 21.49 ≈ 3 locations. For Melbourne, the number
                                                                           the movement patterns of Twitter users.
of possible locations based on each type of entropy are
7(random), 4(Shannon), 3(conditional) and 3(real) whereas                             VI. P REDICTION OF N EXT L OCATION
for Brisbane these numbers are 4(random), 2(Shannon),
                                                                              Prediction estimates the next location based on the transition
2(conditional) and 2(real).
                                                                           probability matrix. Given a current location, the next location
                                                                           with the highest probability from the transition matrix
   It is observed that real entropy is close to conditional                assembled from the training set is used as the estimated next
entropy, suggesting the fact that entropy is strongly determined           location. We take the entire data set for users and investigate
by location history with most of the information in last visited           the previous location for a record to determine the chance of
location.                                                                  correctly predicting the next location. Specifically, we choose
                                                                           the top most predictable users for each city as per their
   Next we look at predictability as shown in Figure 4                     theoretical real predictability values and do the prediction for
(b,d,f,h). Examining the frequent users, we find that average               them to see what their prediction accuracy is like. We use the
real predictability, ΠReal , for Sydney is 0.73, for Melbourne             first order Markov Chain predictor to conduct this analysis.
it is 0.71, whereas for Brisbane it is 0.79. The predictability            For each user, the origin-destination (OD) transition matrix
values suggest there is a possibility that about 73% of                    and their probabilities are calculated based on one previous
Sydney, 71% of Melbourne and 79% of Brisbane users’                        state. For all the days that a user is active in the whole data
whereabouts can possibly be predicted with the help of a                   set, we use the first 80% of the days as training period, and
good prediction algorithm, whereas the remaining 27%, 29%                  the last 20% as the test period. The predictor checks in which
and 21% users of these cities are difficult to predict. Hence,              suburbs the users have been mostly active in the training set
the predictability measure provides a theoretical upper bound              and calculates prediction for the corresponding suburbs in the
of the prediction algorithm performance [19]. In other words,              test set. The accuracy measure is calculated as the number of
for actual prediction accuracy, this is the target that could              correct predictions divided by the total number of predictions
possibly be achieved by a good algorithm [6]. Table IV shows               (correct plus incorrect).
the average predictability values for these cities.
                                                                             After doing our prediction analysis for the top predictable
   It can be seen that real predictability is close to conditional         users of each city, we report the findings in Table V.
predictability and this clearly suggests that most of the                    It shows the prediction accuracies for top 1%, 5% and 10%
information about the probable next location is held in the                of users (in terms of mobility) for each city along with their
current location, implying a weak dependence on the previous               corresponding theoretical predictability ranges. Among all

                                                                     441
Urban Mobility Prediction Using Twitter
TABLE V: Prediction accuracy & Real predictability                   Future work can involve tracking the movements of frequent
                                                                         users itself over a period of time with a broader goal to
    City      Top 1%   Users    Top 5%   Users    Top 10% Users          detect change in their routine activity patterns and exploit that
              Accu.    Pred.    Accu.    Pred.    Accu.   Pred.
                                                                         information for some useful purpose.
  Brisbane    100%     All 1    100%     All 1    76.6%  0.94-1
  Melbourne   63.3%    0.88-1   47.1%    0.81-1   49.2%  0.79-1                                        R EFERENCES
   Sydney     100%     All 1     53%     0.83-1   54.8%  0.80-1
                                                                          [1] X. Song, Q. Zhang, Y. Sekimoto, R. Shibasaki, N. J. Yuan, and
                                                                              X. Xie, “Prediction and simulation of human mobility following natural
                                                                              disasters,” ACM Transactions on Intelligent Systems and Technology,
                                                                              vol. 8, no. 2, pp. 29:1–29:23, Nov. 2016.
cities, Brisbane users record highest prediction accuracy and             [2] X. Song, H. Kanasugi, and R. Shibasaki, “Deeptransport: Prediction and
can be best related to their theoretical predictability values.               simulation of human mobility and transportation mode at a citywide
Up to first 5% of its users can completely be predicted and                    level,” in 25th International Joint Conference on Artificial Intelligence,
                                                                              ser. IJCAI’16. AAAI Press, 2016, pp. 2618–2624.
after that the prediction accuracy decreases. In general, we can          [3] Q. Wang and J. E. Taylor, “Process map for urban-human mobility
say that the results correspond with theoretical calculations                 and civil infrastructure data collection using geosocial networking plat-
for average real predictability in Table IV which is highest                  forms,” Journal of Computing in Civil Engineering, vol. 30, no. 2, p.
                                                                              04015004, 2016.
for Brisbane than the rest of the cities.                                 [4] S. F. Khan, N. Bergmann, R. Jurdak, B. Kusy, and M. Cameron,
                                                                              “Mobility in cities: Comparative analysis of mobility models using geo-
   For Melbourne, we observe lower prediction accuracy than                   tagged tweets in australia,” in 2017 IEEE 2nd International Conference
                                                                              on Big Data Analysis (ICBDA) Beijing, pp. 816–822.
Brisbane. The maximum accuracy that we could achieve here                 [5] C. Song, Z. Qu, N. Blumm, and A.-L. Barabási, “Limits of predictability
is 63.3% and this is for top 1% of users only. Subsequent                     in human mobility,” Science, vol. 327, no. 5968, pp. 1018–1021, 2010.
sets of users have even lower accuracy and we infer that                  [6] X. Lu, E. Wetter, N. Bharti, A. J. Tatem, and L. Bengtsson, “Approaching
                                                                              the limit of predictability in human mobility,” Scientific reports, vol. 3,
users in Melbourne are not as predictable as those in Brisbane.               p. 2923, 2013.
                                                                          [7] S.-M. Qin, H. Verkasalo, M. Mohtaschemi, T. Hartonen, and M. Alava,
   For Sydney, we note that top 1% of users have 100%                         “Patterns, entropy, and predictability of human mobility and life,” PLOS
                                                                              One, vol. 7, no. 12, p. e51353, 2012.
accuracy but the top 5% users have accuracy dropping down                 [8] R. Sinatra and M. Szell, “Entropy and the predictability of online life,”
to 53%. This suggests that except for a few users, most of the                Entropy, vol. 16, no. 1, pp. 543–556, 2014.
users have comparatively larger variation in their movements.             [9] D. Teixeira, M. Alvim, and J. Almeida, “On the predictability of a
                                                                              user’s next check-in using data from different social networks,” in
Overall, this points to a possible different style of movements               Proceedings of the 2nd ACM SIGSPATIAL Workshop on Prediction of
in these two cities and users in general are harder to predict                Human Mobility, ser. PredictGIS 2018, New York, NY, USA, 2019, pp.
for their next movements.                                                     8–14.
                                                                         [10] Y. Liao and S. Yeh, “Predictability in human mobility based on
                                                                              geographical-boundary-free and long-time social media data,” in 21st
   The prediction for users based on their movement patterns                  International Conference on Intelligent Transportation Systems (ITSC),
can help identify the differences between them, and the com-                  Nov 2018, pp. 2068–2073.
                                                                         [11] R. Jurdak, K. Zhao, J. Liu, M. AbouJaoude, M. Cameron, and D. Newth,
parison of various cities can also reveal the mobility dynamics               “Understanding human mobility from Twitter,” PLOS One, vol. 10, no. 7,
in a broader perspective. Despite using an identical criteria                 p. e0131469, 2015.
for choosing users, we observed heterogeneity among cities               [12] X. Li, G. Pan, Z. Wu, G. Qi, S. Li, D. Zhang, W. Zhang, and Z. Wang,
                                                                              “Prediction of urban human mobility using large-scale taxi traces and its
in terms of predicting the next location of its users.                        applications,” Frontiers of Computer Science, vol. 6, no. 1, pp. 111–121,
                                                                              2012.
                     VII. C ONCLUSIONS                                   [13] A. Asahara, K. Maruyama, A. Sato, and K. Seto, “Pedestrian-movement
   This work explored the movement patterns of Twitter users                  prediction based on mixed Markov-chain model,” in 19th ACM SIGSPA-
                                                                              TIAL international conference on advances in geographic information
in three Australian cities and it is observed that users can                  systems, 2011, pp. 25–33.
generally be split into two groups: frequent and casual based            [14] Y. Zhang and A. Haghani, “A gradient boosting method to improve
on their activity characteristics. The randomness and regularity              travel time prediction,” Transportation Research Part C: Emerging
                                                                              Technologies, vol. 58, pp. 308–324, 2015.
of these user groups are investigated using the concepts of              [15] A. Lopez Aguirre, I. Semanjski, and S. Gautama, “Forecasting travel
entropy and predictability. Brisbane users have highest average               behaviour from crowdsourced data with machine learning based model,”
Real predictability of 79% compared to other cities. In terms                 in Fifth International Conference on Data Analytics, vol. 5.          The
                                                                              International Academy, Research and Industry Association, 2016, pp.
of actual prediction too, the top users in Brisbane have higher               93–99.
prediction accuracy of 100% for top 5% users, and 76.6%                  [16] R. Gallotti, A. Bazzani, M. Degli Esposti, and S. Rambaldi, “Entropic
for top 10% users. In comparison, the Sydney and Melbourne                    measures of individual mobility patterns,” Journal of Statistical Mechan-
                                                                              ics: Theory and Experiment, vol. 2013, no. 10, p. P10022, 2013.
users have much lower prediction accuracy, and overall, this             [17] T. M. Cover and J. A. Thomas, Elements of information theory. John
suggests heterogeneity among the cities. The ability to predict               Wiley & Sons, 2012.
the next location of users can have applications such as antic-          [18] Twitter.           (2015,          Jun.)         Streaming          APIs.
                                                                              https://dev.twitter.com/streaming/overview.
ipating the use and management of resources being offered                [19] I. B. I. Purnama, N. Bergmann, R. Jurdak, and K. Zhao, “Characterising
at a certain place. It is therefore important to identify the                 and predicting urban mobility dynamics by mining bike sharing system
movements which are most predictable and such movements                       data,” in IEEE 12th International Conference on Ubiquitous Intelligence
                                                                              and Computing (UIC), 2015, pp. 159–167.
in turn are expected from frequent users. These users can
be apprised of any update about their probable next location.

                                                                   442
Urban Mobility Prediction Using Twitter Urban Mobility Prediction Using Twitter
You can also read