Deliverable 5.1 REPORT ON THE DEVELOPED INDICATORS FOR NOWCASTING MIGRATION STOCKS BY TWITTER

This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No 870661

 Deliverable 5.1
REPORT ON THE DEVELOPED INDICATORS FOR
NOWCASTING MIGRATION STOCKS BY TWITTER

 Alina Sirbu, Laura Pollacci, Jisu Kim & Giulio Rossetti

 July 2021

 http://www.hummingbird-H2020.eu
Abstract
 Given the gaps and limitations of traditional data for migration research, social big data have been
 proposed to complement existing analyses. Amongst various types of social media data, user-generated
 content from Twitter is considered a valuable resource in migration studies. As recent works have shown,
 Twitter can indeed be used to study various migration-related problems such as estimating migration flows
 or studying language mobility. However, it is still unclear to what extent these data can be useful.
 Here we investigate three different approaches to understanding migrant stocks based on Twitter. We show
 that, although the amount of usable data can be small, some information can be extracted, and further
 analyses beyond stocks can be performed. At the same time, we show that in some situations Twitter data
 are limited and cannot provide much information. In these cases complementary data are required.

This report constitutes Deliverable 5.1, for Work Package 5 of the HumMingBird project.

July 2021

© 2021 – HumMingBird, Enhanced migration measures from a multidimensional perspective, – project
number 870661.

General contact: HumMingBird@kuleuven.be
p.a. HumMingBird
 HIVA - Research Institute for Work and Society
 Parkstraat 47 box 5300, 3000 LEUVEN, Belgium

For more information sandra.volders@kuleuven.be

Please refer to this publication as follows:
Sirbu, A., Pollacci, L., Kim, J., & Rossetti, G. (2021). Report on the developed indicators for nowcasting stock
migration by Twitter data (Deliverable 5.1). Leuven: HumMingBird project 870661 – H2020.

Information may be quoted provided the source is stated accurately and clearly.
This publication is also available via http://www.hummingbird-H2020.eu
This publication is part of the HumMingBird project, which has received funding from the European
Union’s Horizon 2020 research and innovation programme under Grant Agreement No 870661.

Copyright © HumMingBird Consortium, 2019
All rights reserved. No part of the report may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means, electronic, mechanical, photocopying, recording, and may not be quoted or
cited, without prior permission in writing from the project coordinator.
The views expressed during the execution of the HumMingBird project, in whatever form and or by
whatever medium, are the sole responsibility of the authors. The European Union is not liable for any use
that may be made of the information contained therein.
VERSION CONTROL SHEET

 Deliverable number + title: D5.1 Report on the developed indicators for nowcasting stock migration by Twitter data
 Prepared by: Alina Sirbu & Laura Pollacci, 12 - UNIPI, alina.sirbu@unipi.it
 Work package number: 5
 Work package leader: 12 - UNIPI
 Dissemination level (PU, CO): PU
 Delivery date: 29/07/2021
 Submission date to EC: 30/07/2021
 Main authors: Alina Sirbu, Laura Pollacci, Jisu Kim, Giulio Rossetti
 Reviewers: Carlos Arcila Calderon, Haodong Qi

REVISION HISTORY

 Version  Date        Initials  Summary of changes
 1        23/06/2021  LP        Top-down nowcasting of stocks
 2        25/07/2021  AS        Bottom-up nowcasting, Border Rush analysis, Introduction, Discussion and Conclusions, Abstract
 3        29/07/2021  AS        Implemented suggestions from reviewers
Contents

1. Introduction

2. Top-down approach: Nowcasting using language and sentiment on Twitter content
2.1 Data
2.2 Superdiversity Index
2.3 Nowcasting immigration rates

3. Bottom-up approach: from Twitter users to the study of migration
3.1 Data
3.2 Labelling strategy
 3.2.1 Assigning residence
 3.2.2 Assigning nationality
3.3 Migrant integration
 3.3.1 Overall distribution of DA and HA values
 3.3.2 Language as a key factor for integration
 3.3.3 Country specific results
3.4 Analysis of migrant and native communities
 3.4.1 Language and country location of Tweets
 3.4.2 Homophily of social links

4. Case study: border rush at the Turkish border
4.1 Data
4.2 Analysis

5. Discussion and conclusion

References

1. Introduction

Measuring migration stocks and flows over time in various countries is crucial but challenging.
Migration-related information has important implications for effective policy design and for
understanding broader population trends. Researchers and policymakers mostly rely on official statistics
and administrative data. However, these data typically have numerous drawbacks, e.g., low temporal and
spatial resolution (i.e. the measurements are infrequent and are usually aggregated at high-level
geographical regions), inconsistency between countries (different reporting standards,
different definitions, varying data quality and collection methods) and delays. For this reason, the
availability of data from social media, like Twitter, has offered new opportunities to
obtain more up-to-date information and estimates and to improve and integrate traditional data
sources (Sirbu et al., 2020). Social media datasets contain several types of user information, cover large
population groups, even across multiple nations, and are often available cheaply and in a timely manner.
Both traditional and novel data are currently employed to study different aspects of migration, such as the
economic and cultural effects connected with migrants, monitoring flows, and estimating stocks.
 Among the various types of social big data, user-generated content from Twitter can be a valuable
resource in migration studies. This has been shown by recent works using Twitter data to study
various migration-related problems (Zagheni et al., 2014; Mazzoli et al., 2020; Lenormand et al., 2015;
Moise et al., 2016; Valle et al., 2017). However, data collection, pre-processing and analysis are far from
straightforward and can result in biased data that might influence the final results. Bias comes from
various sources. One is the sampling bias introduced by the way individuals use Twitter: there
is a selection bias in the general Twitter population, and a further one when restricting the analysis to
certain subsets of data, such as geolocalised tweets. Moreover, the data, being user-generated content, may be
very noisy, impeding their use in certain areas of research, or resulting in limited knowledge after
cleaning. They may also contain misleading or fake information (e.g. users using a nickname, or declaring
the wrong profile location). Furthermore, ethical and privacy issues need to be considered carefully,
as biased results and publication of sensitive information might harm migrants.
 One research question of interest is whether Twitter data can be useful to understand migrant
stocks. The hypothesis is that this type of data can provide better temporal and spatial resolution, and
more timely information, compared to official statistics. Here we investigate these aspects
through various types of analyses. A first analysis, which we define as top-down, employs features
extracted from the text of tweets generated by a community to estimate migration stocks, using
machine learning techniques. The second approach, bottom-up, labels individual users with a nationality
and residence, using a data-driven model, and estimates stocks from the resulting labels. The two
approaches are complementary and will be discussed in detail in the following sections. A different
type of analysis based on Twitter is the study of a restricted period of time and geographical area, to
analyse a specific event. We will show preliminary results for the analysis of the border rush at the Turkish
border in March 2021. Here we employ language as the main feature that determines the nationality of
users.
 The report will end with a discussion of challenges, limitations and advantages that arose during
our analyses, and what we believe are the benefits of employing this type of data for migration studies.

2. Top-down approach: Nowcasting using
 language and sentiment on Twitter content

In this section we introduce a nowcasting model for immigration rates. Our method is based
on applying machine learning to features extracted from Twitter data at different geographical
resolutions. The main feature that we employ is our recent index that measures cultural diversity, i.e. the
Superdiversity Index - SI (Pollacci, 2019). In the following we describe the data employed in the
analysis, we define and validate the SI, and we present the nowcasting method and results. The analysis
concentrates on seven European countries: France, Germany, Ireland, Italy, Spain, the United Kingdom
and the Netherlands.

2.1 Data
To perform our work, we leverage two different kinds of resources, i.e., a geolocalised Twitter dataset
and the Data Challenge on Integration of Migrants in Cities (D4I) dataset.
 The Twitter dataset is a subset of Coletto et al. (2017) composed of more than 73,175,500 geolocalised
tweets gathered for 3 months, from 1 August to 31 October 2015. From these, we select only
tweets originating in our target countries by leveraging the metadata field ‘coordinates’. Thus, we
extract only tweets containing geographical coordinates, excluding those containing only the field
‘place’, since this provides no reliable information on the origin of a tweet. For each target country, we
select only tweets in the national language, e.g., French for France. Afterwards, we assign tweets to
a city. This is required to match the D4I data to our results. For each tweet we identify the place of
origin by using its coordinates and the Geopy Python library. If the origin city is attested in D4I we
assign the related NUTS codes to the tweet (we look at NUTS1, NUTS2 and NUTS3 levels). Otherwise,
we apply a dedicated rule-based pipeline described in Pollacci (2019). This step is often
required since in D4I locations are at the city level, while several tweet origins are at the district or
town level. Finally, tweets are processed following both a rule-based and a general-purpose Natural
Language Processing (NLP) pipeline (Pollacci, 2019). To retain only the words that are significant from the
point of view of sentiment and meaning, we use part-of-speech tagging and select nouns, adjectives, and
verbs. Following this procedure, we obtain a clean geolocated dataset made up of 4,295,287 tweets
from the 7 countries.
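
The selection step described above can be sketched as follows. This is a minimal illustration, not the pipeline of Pollacci (2019): the function name `select_tweets`, the `country_of` reverse-geocoding callback and the `NATIONAL_LANGUAGE` mapping (including the choice of English for Ireland) are assumptions made for the example. Tweets are assumed to be dictionaries in the Twitter API format, where the `coordinates` field holds a GeoJSON point in longitude, latitude order.

```python
# Hypothetical mapping from target country code to national-language code.
NATIONAL_LANGUAGE = {"FR": "fr", "DE": "de", "IE": "en", "IT": "it",
                     "ES": "es", "GB": "en", "NL": "nl"}


def select_tweets(tweets, country_of):
    """Keep tweets that carry exact coordinates, originate in a target
    country and are written in that country's national language.

    country_of(lat, lon) is a caller-supplied reverse geocoder returning a
    country code (an assumption of this sketch).
    """
    kept = []
    for tweet in tweets:
        coords = tweet.get("coordinates")
        if not coords:
            # Tweets with only a 'place' field are excluded.
            continue
        lon, lat = coords["coordinates"]  # GeoJSON order: longitude, latitude
        country = country_of(lat, lon)
        if country in NATIONAL_LANGUAGE and NATIONAL_LANGUAGE[country] == tweet.get("lang"):
            kept.append({**tweet, "country": country})
    return kept
```

In practice `country_of` would wrap a reverse-geocoding service; here it is left abstract so that the filter itself stays self-contained.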
 The D4I dataset (see footnote 1) contains the concentration of migrants in regions of several major European
countries, i.e., Italy, Portugal, Germany, Spain, France, the Netherlands, and Ireland, plus the United
Kingdom. Migrants are counted based on three different levels of aggregation: by country, by continent
and by EU versus non-EU origin. Since all migrant types contribute to diversity, we select both EU and non-EU
immigrant counts to obtain total immigration levels.

1 https://knowledge4policy.ec.europa.eu/migration-demography/data-integration-d4i_en.

2.2 Superdiversity Index
To measure the cultural diversity of a certain community we have recently introduced the Superdiversity
Index (Pollacci, 2019). This is based on the sentiment of words, and measures the distance
between the sentiment valences of a set of words used by a community and the standard sentiment
valences. The SI is based on the hypothesis that groups of people with different cultural backgrounds
associate emotional values to words differently. Therefore, a multi-cultural community will display a
use of the local language that is different in its emotional content compared to a standard expected
use.
 We use the algorithm of Pollacci et al. (2017) to calculate emotional valences for the words used in
tweets in the national language, grouped by country at different spatial resolutions (from
NUTS1 to NUTS3). We calculate the SI using the average Pearson correlation coefficient (r), over
ten runs, between the emotional valences from a standard tagged lexicon, i.e., ANEW (Bradley &
Lang, 1999), and the calculated valences. We then correlate the SI with the immigration
rate at different spatial resolutions.
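
The averaging step can be illustrated with a short sketch. It only computes the mean Pearson correlation between lexicon valences and the valences obtained in each run; how this quantity is turned into the final SI is defined in Pollacci (2019) and is not reproduced here. The function names and the representation of valences as word-to-value dictionaries are assumptions of the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def mean_valence_correlation(lexicon, runs):
    """Average Pearson r between lexicon valences and the valences computed
    in each run, restricted to the words present in both."""
    coefficients = []
    for computed in runs:
        words = sorted(set(lexicon) & set(computed))
        coefficients.append(pearson([lexicon[w] for w in words],
                                    [computed[w] for w in words]))
    return sum(coefficients) / len(coefficients)
```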
 Figure 1 shows the scatter plots of SI and immigration rates for all the selected European countries.
Ireland has very few NUTS regions, hence the number of points in its plot is small. We nevertheless
show the results, noting that the statistical measures computed on them do not have a
large enough sample size. All plots show that, at all NUTS levels (triangles, squares, and circles,
respectively), the points tend to align with the main diagonal, despite the presence of some scattered points.
Tables in the next section also include Pearson correlation values for SI versus immigration rates,
showing that correlations are generally high. This indicates that the higher the immigration rates,
the higher our SI values. Also, we observe that correlations are in general larger than for other possible
indices of cultural diversity extracted from the same data.

Figure 1. Scatter plot of SI and immigration by country*

 * The SI was generated using data from 2015 (August-October). The immigration rates come from the D4I
 dataset, based on the 2011 census.

2.3 Nowcasting immigration rates
The results above indicate that there is a direct relation between the SI and immigration rates in
European countries and regions. Therefore we hypothesised that the SI could form a good basis for
a predictive model for immigration rates. To improve prediction performance, we extracted from the
same Twitter data a set of five other indices of diversity: the total number of tweets in the region
(nTweet), the number of tweets per capita (twPc), the number of distinct languages in the dataset
(nLang), the entropy of the use of languages (langS, measuring how flat the distribution of the
number of tweets across languages is: a low entropy means most tweets are in one language
only, a high entropy means there are many languages and tweets are evenly distributed across
them), and the Type/Token Ratio (TTR, Templin et al., 1957), which is a measure of
language richness. Tables 1 (UK) and 2 (Italy) show the correlation between immigration rates and
all six indices (including SI), at different spatial resolutions. The tables also show results for the null
model for SI, obtained by reshuffling tweets across geographical regions while keeping the
number of tweets in each region fixed. At the county level (NUTS3) we selected the top 40/20 regions
based on the number of tweets available in the dataset.
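
Two of these indices, langS and TTR, are simple enough to sketch directly. The snippet below is illustrative only: the function names are hypothetical, and the logarithm base (2) is an assumption, since the text does not state which base was used.

```python
from collections import Counter
from math import log2


def language_entropy(tweet_languages):
    """Shannon entropy (in bits) of the distribution of tweets over languages.

    tweet_languages holds one language code per tweet. The result is 0 when
    all tweets share a language and grows as usage spreads evenly.
    """
    counts = Counter(tweet_languages)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def type_token_ratio(tokens):
    """Number of distinct words (types) divided by the number of words (tokens)."""
    return len(set(tokens)) / len(tokens)
```

For example, a region whose tweets are split evenly between two languages has an entropy of 1 bit, while a region tweeting in a single language has an entropy of 0.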
 The two tables show that the SI has the highest correlation with immigration rates in Italy and
the UK, although other indices also provide some information. The null model SI does not correlate
with immigration rates, thus validating the SI.
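
The reshuffling behind the null model can be sketched as a permutation of region labels, which by construction keeps the number of tweets in each region fixed. This is an illustration under the assumption that each tweet is represented by the label of its region; the function name and the fixed seed are choices made for the example.

```python
import random


def reshuffle_regions(region_of_tweet, seed=0):
    """Randomly reassign tweets to regions while preserving region sizes.

    region_of_tweet lists the region label of each tweet, in tweet order.
    Permuting the labels changes which tweets fall in which region but
    leaves the number of tweets per region unchanged.
    """
    labels = list(region_of_tweet)
    random.Random(seed).shuffle(labels)
    return labels
```

The SI would then be recomputed on the reshuffled assignment and compared with the immigration rates, as done for the null model rows of the tables.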

Table 1. Correlation between different measures of diversity extracted from Twitter and the
 immigration rates, at various geographical levels in the UK, excluding London and
 Northeast England (which were considered outliers)

 Source: Pollacci (2019)

Table 2. Correlation between different measures of diversity extracted from Twitter and the
 immigration rates, at various geographical levels in Italy

On this basis, we extend our approach to other European countries, i.e., Germany, Spain, France,
the Netherlands and Ireland. Table 3 shows the correlations between SI, different measures of diversity,
and the immigration rates by country, over all NUTS levels together. Again, the Superdiversity Index
tends to reach the highest correlations, although in some cases, i.e., Germany and the Netherlands,
other indices may also correlate with immigration rates.

Table 3. Correlations between different measures of diversity and the immigration rates by
 country

Our results suggest that the combined use of various measures of diversity, including the Superdiversity
Index, could be a potential predictor for immigration stocks.
 Thus, the SI and other measures are fed into a machine learning pipeline to model the immigration
rates. To obtain the best predictions, we test various regression algorithms, i.e., Linear Regression,
Lasso, Ridge, Decision Tree Regression, Gaussian Support Vector Regression (SVR), and polynomial
SVR. Moreover, to maximise prediction performance we study the impact of features and finally optimise the
parameters through Grid Search. We apply the regression algorithms by country and at the European
level, i.e., grouping all the tweets posted in all countries in the dataset at all NUTS levels.
 Each regression model uses the six measures of diversity as features, both in the original range and
normalised in the range [0,1]. The regression evaluation is carried out through Leave One Out Cross
Validation (LOOCV), and by considering the Pearson correlation coefficient r between actual and
predicted immigration rates and the root-mean-square error (RMSE).
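
The evaluation protocol can be sketched as follows. To keep the example self-contained, the regressor is swapped for a one-feature least-squares line rather than the algorithms listed above; only the LOOCV loop and the two scores (r and RMSE) are meant to mirror the text. All function names are hypothetical.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mean_x) ** 2 for a in x)
                      * sum((b - mean_y) ** 2 for b in y))


def loocv(xs, ys):
    """Leave one region out, fit on the rest, predict the held-out region.

    Returns the held-out predictions, Pearson r and RMSE against ys.
    """
    predictions = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(slope * xs[i] + intercept)
    rmse = sqrt(sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(ys))
    return predictions, pearson(predictions, ys), rmse
```

With one observation per region, LOOCV uses every region once as test data, which suits the small number of NUTS regions available per country.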
 We observe that different algorithms achieve different results depending on the country
and on whether features are normalised. We finally select the Polynomial SVR on normalised features as
the nowcasting model, since it produces the best results.
 Since our goal is to obtain both a European model and optimised models for individual countries, we
further analyse how features affect the performance of the algorithms. We train models
that only employ subsets of features, exploring all possible subsets that include the SI.
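
Enumerating the candidate feature sets is straightforward: with six indices and the SI always included, there are 2^5 = 32 subsets to evaluate. The sketch below uses the index names introduced earlier; the function name is hypothetical.

```python
from itertools import combinations

FEATURES = ["SI", "nTweet", "twPc", "nLang", "langS", "TTR"]


def subsets_with_si(features=FEATURES):
    """Yield every feature subset that contains the SI."""
    others = [name for name in features if name != "SI"]
    for size in range(len(others) + 1):
        for combo in combinations(others, size):
            yield ("SI",) + combo
```

Each subset would then be scored with the same cross-validation procedure, keeping the one with the best prediction.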

Table 4. Feature selection for individual countries and Europe

As summarised in Table 4, the results seem to confirm our hypothesis that features have a different impact
across countries. For Europe and Italy, the model benefits from the combination of almost all indices.
Conversely, all the other countries achieve the best performance with different combinations of three
indices, except Spain, which uses only two features (the Superdiversity Index and the number of tweets
per capita). It is important to mention that the impact here is only based on the accuracy of the model
on test data, i.e. on an empirical analysis typical of the machine learning domain. We select the
feature combination that gives the best prediction. However, this approach does not evaluate the
strength of the effect, nor the correlations among the features themselves.
 After the feature selection phase, we optimise the parameters using Grid Search on
the regularisation parameter (C), epsilon (ε), and the polynomial degree. The optimisation, performed using
feature selection and grid search, yields slight improvements in performance, which are
summarised in Table 5.
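
A grid search of this kind can be sketched generically as an exhaustive loop over parameter combinations. The `evaluate` callback is a hypothetical stand-in for training the model with the given parameters and returning an error to minimise (for instance the LOOCV RMSE); the candidate values in the usage example are illustrative and are not those used in the report.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every parameter combination and return the one with the lowest error.

    grid maps each parameter name to its list of candidate values;
    evaluate(params) returns the error obtained with those parameters.
    """
    names = list(grid)
    best_params, best_error = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        error = evaluate(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


# Illustrative usage with a toy error function.
example_grid = {"C": [1, 10, 100], "epsilon": [0.01, 0.1], "degree": [2, 3]}
example_best, example_error = grid_search(
    lambda p: abs(p["C"] - 10) + abs(p["epsilon"] - 0.1) + abs(p["degree"] - 2),
    example_grid,
)
```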

Table 5. Results obtained after the optimisation phase for individual countries and Europe

The European model, using all data available, obtains an r=0.66 and RMSE=5.52. National models
achieve better performance, with r-values between 0.70 and 0.89 and RMSE between 1.47 and 6.28
depending on the country. Figure 2 shows predicted versus real immigration rates for Europe and
European countries (triangles correspond to NUTS1 regions, squares to NUTS2, and circles to
NUTS3).
 Despite some sparse points, predicted immigration rates tend to be coherent with actual ones for
both Europe and individual countries. For countries, we note that predictions are more accurate
when immigration is under 10-15%, while for the European model the threshold is 20%.
 All in all, we can conclude that the SI and other features do provide a good estimate of the
immigration rate. With respect to traditional data, this approach could be useful to capture migration in a
more timely manner, could include undocumented migration, and could be applied at various spatial
resolutions, as we have shown above. In terms of limitations, the model does not distinguish stocks
by nationality, it only attempts to estimate overall migration. Additionally, the estimates could be
biased, due to the Twitter user selection bias, which does not cover the different demographic groups
uniformly. In future work we will investigate means to estimate stocks by nationality, employing
additional features extracted from the data.

Figure 2. Scatter plot of actual and predicted immigration rates for Europe and selected European countries*

 * The SI was generated using data from 2015 (August-October). The Immigration rates come from the D4I
 dataset, based on the 2011 census.

3. Bottom-up approach: from Twitter users to the
 study of migration

The bottom-up approach is based on labelling Twitter users with a nationality and residence, and
then studying the stocks for specific countries. This approach also allows us to analyse further aspects
of migration, such as migrant integration, and to compare the social network activity of migrants
with that of natives.

3.1 Data
The starting point of our data collection strategy is a Twitter dataset collected by the SoBigData.eu
Laboratory (Coletto et al., 2017). We extracted from this dataset all the geo-located tweets posted from
Italy from August to October 2015. This allowed us to obtain a set of 34,160 individual users who
were in Italy in that period, which we call the first layer users. For these users, we downloaded the
friends, resulting in 258,455 users that we call second layer users. For all of these users,
we also gathered their 200 most recent tweets. We further extended the dataset to obtain a larger
number of migrants by also extracting the friends of the second layer users (i.e. the third layer), and
their 200 most recent tweets. After this process, the total number of users grew to 59,476,205. Our
dataset, therefore, consists of three layers: the core first layer users, their friends (second layer users)
and the friends of the friends (third layer users). Our analysis concentrates on the subset of these users
for which we have information about their friends, resulting in a total of 200,354 users. These are
users from the first and second layers (some overlap was present between the two layers).

3.2 Labelling strategy
A migrant is a person whose country of residence differs from their nationality. We thus consider our core
200,354 Twitter users and assign a residence and nationality based on the information included in
our dataset. The difference between the two labels allows us to detect individuals who have
migrated and are currently living in a place different from their home country. The methodology we
propose is based on two hypotheses: a person who has moved away from their home country
stays in contact with their friends back in the home country, and may keep using their mother tongue.

3.2.1 Assigning residence
For a place to be called a residence, a person has to spend a considerable amount of time at
the location. Our definition of residence is based on the amount of time in which a Twitter user is
observed in a country in a given calendar year. More precisely, the residence of each user is the country
with the longest length of stay, which is calculated by taking into account both the number of days on
which a user tweets from a country and the period between consecutive tweets in the same
country. In this work we compute residences based on 2018 data.
 To compute the residence, we first compute, for each user, the number of days on which we see tweets from
each country. If the top location is not ‘undefined’, then that location is chosen as the residence.
Otherwise, we check whether any tweet sent from an ‘undefined’ country was sent on the same day
as tweets sent from the second top country. If at least one date matches between the two locations,
we take the second country as the user's place of residence. On average, 5 dates matched.
This is done under the assumption that a user cannot tweet from two different countries on the same day.
Although this does not always hold, e.g. if a user travels or uses a VPN that alters the location, it
should be true on most days of the year.
 If the top country remains ‘undefined’, we compute the length of stay in days by adding
together the durations between consecutive tweets in the same country, and select the country with
the largest length of stay. If the top country is still ‘undefined’, we check whether the ‘undefined’
locations fall in between segments of the second top country, in which case the second country is
chosen.
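
The first steps of this procedure can be sketched as follows. This is a simplified reading of the description above, not the authors' implementation: tweets are assumed to be (date, country) pairs, the function names are hypothetical, and "consecutive tweets in the same country" is interpreted as chronologically adjacent tweets that share a country. The final check on ‘undefined’ segments is omitted.

```python
from collections import defaultdict
from datetime import date

UNDEFINED = "undefined"


def days_per_country(tweets):
    """Map each country to the set of distinct days with a tweet from it."""
    days = defaultdict(set)
    for day, country in tweets:
        days[country].add(day)
    return days


def residence_by_days(tweets):
    """Pick the country with most tweet days, replacing 'undefined' by the
    second country when both were observed on at least one common day."""
    days = days_per_country(tweets)
    ranked = sorted(days, key=lambda c: len(days[c]), reverse=True)
    if not ranked:
        return None
    if ranked[0] != UNDEFINED:
        return ranked[0]
    if len(ranked) > 1 and days[UNDEFINED] & days[ranked[1]]:
        return ranked[1]
    return UNDEFINED  # unresolved: fall back to the length of stay


def length_of_stay(tweets):
    """Sum, per country, the days elapsed between chronologically adjacent
    tweets that were both sent from that country."""
    stay = defaultdict(int)
    ordered = sorted(tweets)
    for (day1, country1), (day2, country2) in zip(ordered, ordered[1:]):
        if country1 == country2:
            stay[country1] += (day2 - day1).days
    return dict(stay)
```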

3.2.2 Assigning nationality
In order to estimate nationalities for Twitter users, we took into account two types of information
included in our Twitter data. The first type relates to the users themselves and includes the countries
from which tweets are sent and the languages in which users tweet. For each user u we define two
dictionaries, $loc_u$ and $lang_u$, which store, for each country and language, the proportion of the user's
tweets in that country or language.
 The second type of information used is related to the user's friends. Again, we look at the languages
spoken by friends, and the locations from which friends tweet. Specifically, starting from the loc and lang
dictionaries of all friends of a user, we define two further dictionaries, $floc_u$ and $flang_u$. The first
stores all countries from which friends tweet, together with the average fraction of tweets in each country C,
computed over all friends (F(u) denotes the set of friends of u):

$$floc_u[C] = \frac{1}{|F(u)|} \sum_{f \in F(u)} loc_f[C]$$

Similarly, the flang dictionary stores all languages spoken by friends, with the average fraction of tweets
in each language l:

$$flang_u[l] = \frac{1}{|F(u)|} \sum_{f \in F(u)} lang_f[l]$$

The four dictionaries defined above are then used to assign a nationality score $N_C$ to each country C for
each user u:

$$N_C = w_{loc}\, loc_u[C] + w_{lang} \sum_{l \in languages(C)} lang_u[l] + w_{floc}\, floc_u[C] + w_{flang} \sum_{l \in languages(C)} flang_u[l]$$

where languages(C) is the set of languages spoken in country C, while $w_{loc}$, $w_{lang}$, $w_{floc}$ and
$w_{flang}$ are parameters of our model which need to be estimated from the data (one global value estimated
for all users). Each of the w values gives a weight to the corresponding user attribute in the calculation of the
nationality. To select the nationality of each user we simply select the country C with maximum $N_C$.
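
A minimal sketch of this scoring step is given below, assuming the score is a weighted sum of the four dictionaries, with language fractions summed over the languages spoken in each country. The function names, the `languages` mapping and the weight values in any usage are hypothetical; details of the actual model are in Kim et al. (2020).

```python
def average_dict(dicts):
    """Average a list of {key: fraction} dictionaries, treating missing keys as 0."""
    if not dicts:
        return {}
    keys = set().union(*dicts)
    return {k: sum(d.get(k, 0.0) for d in dicts) / len(dicts) for k in keys}


def nationality(loc, lang, friend_locs, friend_langs, languages, weights):
    """Return the country with the highest nationality score, and all scores.

    loc / lang: the user's fractions of tweets per country / language.
    friend_locs / friend_langs: one such dictionary per friend.
    languages: country -> iterable of languages spoken there (assumed given).
    weights: dictionary with keys 'loc', 'lang', 'floc', 'flang'.
    """
    floc = average_dict(friend_locs)
    flang = average_dict(friend_langs)
    scores = {}
    for country in set(loc) | set(floc) | set(languages):
        spoken = languages.get(country, ())
        scores[country] = (
            weights["loc"] * loc.get(country, 0.0)
            + weights["lang"] * sum(lang.get(l, 0.0) for l in spoken)
            + weights["floc"] * floc.get(country, 0.0)
            + weights["flang"] * sum(flang.get(l, 0.0) for l in spoken)
        )
    return max(scores, key=scores.get), scores
```

Setting a weight to zero removes the corresponding attribute from the score, which makes it easy to compare variants of the model.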

In our experiments, maximum accuracy was obtained with the language weights set to zero, meaning
that language was not considered relevant. This was based on an internal validation dataset that
considers the profile language to be an indication of nationality. More details can be found in Kim et al.
(2020).
 With the approach described above, out of the total 200,354 users, we were able to identify the
nationalities of 197,464 users and the residences of 57,299 users. In
total, we identified both the residence and the nationality of 51,888 users. The rest of the users
were ignored, as not enough geotagged data was available to compute a residence or nationality.
Among these 51,888 users, we identified 4,940 as migrants.
In Figure 3 we display the main migration links in our dataset: the number of migrants
for countries that have at least 10 migrants, showing a total of 21 countries. Overall, however, we
have 128 countries of nationality and 163 countries of residence. From the plot, we see that in terms
of nationality, the most represented countries are the United States of America, Italy, Great Britain and
Spain. This is due to the fact that our first layer users were selected among those geo-localised in Italy.
In terms of migration patterns, we note that Italy has mostly out-going links whereas countries like
the USA and GB have a significant number of both in-coming and out-going links. France and Germany, on
the other hand, have mostly in-coming links.
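
Once both labels are available, counting migration links reduces to comparing them. The sketch below assumes labels are stored as (nationality, residence) pairs per user, with `None` for a label that could not be assigned; the function name is hypothetical.

```python
from collections import Counter


def migrant_links(labels):
    """Count users per (nationality, residence) pair, keeping only migrants.

    labels maps a user id to a (nationality, residence) tuple. Users with a
    missing label, or with identical labels, are not counted.
    """
    return Counter(
        (nationality, residence)
        for nationality, residence in labels.values()
        if nationality is not None
        and residence is not None
        and nationality != residence
    )
```

Summing the counts over residences gives emigrant stocks per nationality, and summing over nationalities gives immigrant stocks per country of residence.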

Figure 3. Twitter migrant stocks after the identification procedure

We chose to employ this methodology because it adopts a definition of a migrant that is close to the
official definition: the Recommendations on Statistics of International Migration, Revision 1 (United
Nations, 1998, p. 113) defines a migrant as ‘a person who moves to a country other than that of his
or her usual residence for a period of at least a year’. It also allows us to identify both immigrants and
emigrants simply by comparing the nationality and residence labels. It is important to mention that
the migration patterns we see here are specific to our dataset, and are not meant to represent a global
view of the world's migration. However, we do observe some correlation with official data when looking
at individual countries. In Figure 4, we show Spearman correlation coefficients between our predicted
data and ground truth data for Italian emigrants from AIRE (Anagrafe degli italiani residenti all'estero, the
Italian register data) and Eurostat. We employ the Spearman coefficient because we want to
understand whether the ranking of the countries with the most Italian emigrants is consistent across
the various data sources. For European countries, the correlation is 0.831 with the AIRE data and
0.762 with the Eurostat data. For non-European countries, the correlation is 0.56. We note that
for Germany and Switzerland the prediction seems to underestimate the migrant stocks, while for
other EU countries the prediction overestimates migrants. As for non-EU countries, Brazil, Argentina and
Venezuela (countries with large Italian expat communities) are underestimated.
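
For reference, the Spearman coefficient is the Pearson correlation computed on ranks, which is why it captures agreement in the ordering of destination countries rather than in the exact shares. A minimal sketch follows, with tied values given their average rank; the function names are hypothetical and the numbers in the usage note are made up for illustration.

```python
from math import sqrt


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mean_x) ** 2 for a in x)
                      * sum((b - mean_y) ** 2 for b in y))


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

For instance, with made-up predicted shares [0.30, 0.20, 0.10, 0.05] and official shares [0.25, 0.28, 0.12, 0.02], only the top two destinations swap places and the coefficient is 0.8.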

Figure 4. Percentage of Italian emigrants in various destination countries based on AIRE and
 Eurostat: predicted versus ground truth data*

 * The first two figures show the results for EU countries and the last figure shows the result for non-EU
 countries. Predictions and official datasets correspond to the data for 2018.

3.3 Migrant integration
The migrant dataset obtained through our labelling strategy can be employed to study several aspects
of migration. One interesting question is how useful these data can be to study cultural integration.
By integration, we mean an umbrella term for the process that migrants go through when moving to
a different country.
 Cultural integration has been long studied by various research communities. These include interna-
tional economic organisations, which have built indicators for integration at different levels, consid-
ering socio-economic features such as labour market participation, living conditions, civic engage-
ment and social integration (Eurostat, 2011; OECD, 2018; Huddleston et al., 2013). On the other
hand, studies of integration have been mainly done by sociologists, by employing survey data such as
World Values Survey, Eurobarometer, and European Social Survey. The main elements used in the
studies are often inter-marriage, religion and language (Esser, 2006; Lochmann et al., 2019; Sirbu et
al., 2020; Vigdor 2008). However, studying integration is very complex, as one is ‘not only attracted
to the culture of host society but is also held back from his culture of origin’ (Park, 1928; Safi, 2008).
The four-fold model reflects this complexity by dividing acculturation into four different classes:
assimilation, integration, marginalisation and separation (Berry, 1997; Constant & Zimmermann,
2008; Penninx, 2003). Integration takes place when a migrant's and receiving society's characteristics
mutually accommodate. Assimilation on the other hand takes place when a migrant perfectly absorbs
the characteristics of the receiving society, losing the connection to the home country. Marginalisation
refers to a situation where migrants remain distinguishable from both the receiving and home society,
whereas separation refers to complete rejection of the host culture. These theories typically consider
two dimensions: preservation of links to the home country and its cultural traits, which we call here home
attachment (HA), and formation of new links and adoption of cultural traits from the country of
migration, which we define as destination attachment (DA). Based on these two concepts, we can summarise the
four integration patterns from the literature, as displayed in Table 6.

Table 3. Theories of integration and their relation to HA and DA

 Low HA High HA
 Low DA Marginalisation Separation
 High DA Assimilation Integration

Through our work we provided a novel method to compute HA and DA from Twitter data, to
answer the following questions: How much do migrants absorb the culture of their destination society? Do they
lose connection with their home country? This is based on the topics that migrants and natives discuss on
Twitter, through the analysis of hashtags. The HA index is defined as the fraction of tweets of a
migrant that discuss topics related to their home country. Similarly, DA is the fraction of tweets
discussing topics related to the destination country. We determine country-specific topics by assign-
ing nationalities to hashtags. These definitions are based on the idea that the topics discussed provide
indications on various aspects of attachment: the amount of information that a person holds about a
specific country, the social links to people living in a certain country, the interest in political and
public issues of a country, adoption of customs and ideas, all related to integration as a wider concept.
 In the rest of this section, we first analyse the relationship between the two indices and compare them
to a null model obtained by shuffling the hashtags in our dataset. Second, we study different
country-specific cases, i.e., immigrants in the United States and the United Kingdom, and emigrants from
Italy. More details can be found in our submitted article (Kim et al., 2021a).
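
The HA and DA definitions above can be illustrated with a minimal sketch. It assumes that each tweet is represented by its set of hashtags, that a tweet counts towards a country if it contains at least one hashtag assigned to that country, and that the denominator is the total number of tweets of the user. The hashtag-to-country assignment and the example user are hypothetical and serve only as illustration; the actual procedure is described in Kim et al.

```python
from typing import Dict, List, Set


def attachment_index(tweets: List[Set[str]], country_hashtags: Set[str]) -> float:
    """Fraction of a user's tweets that contain at least one hashtag assigned to the country."""
    if not tweets:
        return 0.0
    hits = sum(1 for tags in tweets if tags & country_hashtags)
    return hits / len(tweets)


# Hypothetical hashtag-to-nationality assignment (illustrative only).
hashtags_by_country: Dict[str, Set[str]] = {
    "IT": {"#sanremo", "#seriea"},
    "US": {"#superbowl", "#nba"},
}

# One set of hashtags per tweet, for a hypothetical Italian migrant residing in the US.
user_tweets = [{"#seriea"}, {"#nba", "#travel"}, {"#coffee"}, set()]

ha = attachment_index(user_tweets, hashtags_by_country["IT"])  # home attachment
da = attachment_index(user_tweets, hashtags_by_country["US"])  # destination attachment
```

In this toy case one tweet out of four refers to each country, so both indices equal 0.25.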

3.3.1 Overall distribution of DA and HA values
The distributions of the home and destination attachment indices are shown in Figure 5. Across all
migrants in our dataset, regardless of nationality or place of residence, the HA index is 0.051 on
average and the DA index is 0.034. Some users have relatively high values for the two indices;
however, the majority are below 0.2 in both cases.
 In the same figure, we compare these values with a null model analysis where the hashtags of indi-
vidual users were randomly re-distributed five times. The null model tells us what the DA and HA
values would be if users chose their topics of discussion randomly, i.e. there was no influence from
the country of residence or nationality. We observe that in general the null model DA0 and HA0 are
smaller than the actual index values, with lower means for the null model distributions.
 To statistically validate the difference between the actual indices and the null model, we also computed
two non-parametric tests: the Wilcoxon and Kolmogorov-Smirnov (KS) tests. The Wilcoxon test shows
that the distributions of DA and HA are significantly different from those of DA0 and HA0, with
p-values of 5.16e-07 and 0.014, respectively. We obtained similar results from the KS tests, with
p-values of 1.18e-51 for DA and 2.98e-56 for HA. Although not reported here, the KS tests for
sub-populations split by country of residence and country of origin likewise show that the null model
and the actual index values have different distributions.
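
As an illustration of the null model, the sketch below redistributes hashtags at random. It assumes one plausible reading of the procedure: all hashtags are pooled, shuffled, and dealt back so that every tweet keeps its original number of hashtags. The function name and the toy data are hypothetical; the indices DA0 and HA0 would then be recomputed on each shuffled copy.

```python
import random
from typing import List

Users = List[List[List[str]]]  # users -> tweets -> hashtags


def shuffle_hashtags(users: Users, rng: random.Random) -> Users:
    """Redistribute all hashtags at random, keeping each tweet's hashtag count."""
    pool = [tag for tweets in users for tweet in tweets for tag in tweet]
    rng.shuffle(pool)
    it = iter(pool)
    # Deal the shuffled hashtags back in the original user/tweet layout.
    return [[[next(it) for _ in tweet] for tweet in tweets] for tweets in users]


# Toy data: two users, two tweets each (illustrative only).
users: Users = [
    [["#seriea"], ["#nba", "#travel"]],
    [["#superbowl"], []],
]

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
null_runs = [shuffle_hashtags(users, rng) for _ in range(5)]  # five re-distributions
```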

Figure 5. Distribution of HA and DA values, and comparison to null model DA0 and HA0

To understand the relationship between DA and HA, we computed the Pearson correlation
between them. Figure 6 displays the HA versus DA values for all users. A weak negative relation is
found with r=-0.13 and p-value=6.937e-14, indicating that in general the more a migrant is attached
to their country of origin, the less they are attached to the host country and vice versa. However,
we can observe various different patterns for individual users, leading to different acculturation types
as mentioned in Table 6. In the same figure, the red curve provides an approximate indication of
users' acculturation type. We underline the fact that we do not aim to provide a specific categorisation
of acculturation types in this paper. Instead, we aim to provide a broad picture where the angle of
each individual from the x/y-axis gives us an indication of the acculturation type. Thus, a migrant
close to the x-axis is most probably going through an assimilation process, a migrant close to the y-
axis is undergoing separation, while those in between are undergoing integration or marginalisation.
The distinction between integration and marginalisation depends on the distance of the data
point from the origin. In other words, marginalisation is when the data point is close to 0 and
integration is when the data point is further away from 0. The data point circled in green would be
a good example of an integrated migrant, who keeps good links with both home and destination
country.
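
The reading of Figure 6 described above can be sketched as a small function. It assumes DA on the x-axis and HA on the y-axis, as in the figure. The two thresholds (an angular margin around each axis and a minimum distance from the origin) are hypothetical values chosen only for illustration, since we do not fix a specific categorisation.

```python
import math


def acculturation_type(da: float, ha: float,
                       angle_margin_deg: float = 30.0,
                       radius_threshold: float = 0.05) -> str:
    """Approximate acculturation type from the angle and distance of the point (DA, HA).

    Both thresholds are hypothetical and only illustrate the idea.
    """
    radius = math.hypot(da, ha)
    if radius == 0.0:
        return "marginalisation"  # no attachment signal for either country
    angle = math.degrees(math.atan2(ha, da))  # 0 = on the DA axis, 90 = on the HA axis
    if angle < angle_margin_deg:
        return "assimilation"  # close to the x-axis: mostly destination attachment
    if angle > 90.0 - angle_margin_deg:
        return "separation"  # close to the y-axis: mostly home attachment
    # In between, the distance from the origin separates the two remaining types.
    return "integration" if radius >= radius_threshold else "marginalisation"
```

For example, a user with DA = 0.3 and HA = 0.3 would fall in the integration area, while a user with DA = 0.01 and HA = 0.01 would fall in the marginalisation area.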

Figure 6. HA versus DA for migrant Twitter users

3.3.2 Language as a key factor for integration
One possible candidate factor to explain the DA and HA values observed is language. As previous
studies have shown, language is considered to be a key factor in integration, and our indices reflect
this importance as well.
 In Figure 7 we display the distribution of the DA and HA for two user groups: a group that speaks
the language of the host country (i.e. over 90% of their tweets are in that language) and a group that
very rarely speaks the language of the host country (under 10% of their tweets are in that language).
Here, we are looking at all the migrants we have in the dataset regardless of the country of origin or
the country of residence. We observe that the group that speaks the language of the destination
country shows in general higher DA compared to the non-speaking group, confirming the signifi-
cance of the language for integration in the host country. In addition, we observe that users who do
not speak the language of the destination country tend to be more attached to their home country
compared to those speaking the destination language. Hence, interestingly, destination language
proficiency seems to correlate with both destination and home attachment levels. When comparing DA
and HA within groups, those who speak the destination language have comparable values for the two
indices, while for those who do not speak it, HA is much larger than DA, indicating a pattern of
separation. However, we do not mean to generalise: what we observe are population-level patterns. At
the individual level, we do observe all four acculturation types discussed in Table 6.
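
The two language groups can be formed as in the following minimal sketch, using the 90% and 10% thresholds stated above. The function name and the language codes are hypothetical, and we assume that a language label is available for every tweet.

```python
from typing import List, Optional


def language_group(tweet_langs: List[str], host_lang: str) -> Optional[str]:
    """Assign a user to the 'speaker' or 'non-speaker' group from the languages of their tweets."""
    if not tweet_langs:
        return None
    share = sum(1 for lang in tweet_langs if lang == host_lang) / len(tweet_langs)
    if share > 0.9:
        return "speaker"  # over 90% of tweets in the host-country language
    if share < 0.1:
        return "non-speaker"  # under 10% of tweets in the host-country language
    return None  # intermediate users are left out of the comparison
```

Users between the two thresholds belong to neither group, which keeps the two compared populations clearly separated.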

Figure 7. Box plots showing the HA and DA distributions for a group of migrants who speak the
 language of the host country on the left and a group of migrants who do not speak
 the language of the host country on the right*

 * The plot includes only the migrants identified through our method.

3.3.3 Country specific results
In this section, we provide country-specific results. One of the advantages of our methodology
is that we can look at different countries simply by changing the labels. Hence, here we look at
different country cases to understand how immigrants in a specific country behave and how
emigrants from a certain country of origin behave in different countries. We selected the three
cases with the largest number of users in our data: immigrants in the US and UK, and emigrants
from Italy. Here we consider only the migrant groups with at least 10 users. The square brackets in
the figures below show the number of users we have for each country of origin.

3.3.3.1 Immigrants in the US
In Figure 8 on the left, we observe different destination and home attachment indices of 17 groups
of immigrants from different countries of origin. Overall, we observe that for many groups of immi-
grants in the United States DA is larger than HA. Immigrants from Canada have the highest DA
followed by Colombian and English immigrants. On the other hand, immigrants from Turkey have
the highest HA followed by Brazilian and Italian immigrants. In the right figure, we observe data
points individually on a scatter plot of HA vs. DA. It tells us that immigrants in the US are integrated
and assimilated in general.

Figure 8. Left: box plots for the DA and HA index of immigrants in the United States. Right:
 scatter plot of HA vs. DA indicating approximate integration types for immigrants in
 the US

3.3.3.2 Immigrants in the UK
Figure 9 shows the indices for the immigrants residing in the United Kingdom. Only four groups are
shown, corresponding to those that have at least 10 migrants. Overall, UK immigrants in our data
are more attached to home than to the destination country. On average, the DA is 0.04 and the HA
is 0.063. From the figure on the left, it is clear that immigrants from Italy have the highest HA index.
On the other hand, we observe that immigrants from Australia, a country that shares long historical
ties with the UK, have the highest DA index. Looking at the figure on the right, we can observe that
immigrants are mostly in the area of marginalisation/integration.

Figure 9. Left: Box plots for the DA and HA index of immigrants in the United Kingdom. Right:
 Scatter plot of HA vs. DA indicating approximate integration types for immigrants in
 the UK

3.3.3.3 Emigrants from Italy
Figure 10 displays the DA and HA indices for Italian emigrants across different countries of resi-
dence. In general, we observe that Italians are more attached to their home country than to their
destination country. Switzerland, Belgium and the Netherlands are the three countries where Italian
emigrants are most attached to home. On the other hand, Italians tend to show higher DA levels in
English-speaking countries: the US and the UK. Among the higher DA levels we also observe
Spain, probably due to the language similarity. In the figure on the right, we also observe that Italian
emigrants have higher HA levels compared to DA levels. These data points indicate that they are in
general close to the separation type of cultural integration.

Figure 10. Left: Box plots for the DA and HA index of emigrants from Italy. Right: Scatter plot of
 HA vs. DA indicating approximate integration types for emigrants from Italy

3.4 Analysis of migrant and native communities
A different type of analysis that our data allows is to study various characteristics of the migrant and
native communities. In the following we compare the travelling and language usage patterns as
extracted from social media, and the level of homophily of social links for each community. More
detailed results can be found in our submitted article (Kim et al., 2021b).

3.4.1 Language and country location of Tweets
Tweets provide useful information about user behaviour. We are interested in the locations (country
level) and languages a user employs on Twitter. Hence, we look at the number of languages and
locations that appear in each user's 200 most recent tweets, and compute the KS statistic to
compare the distributions of migrants and natives. As shown in Figure 11,
migrants tweet in a wider variety of languages and locations. The two distributions for
migrants and natives differ from each other, as the KS tests show low p-values: 2.36e-194 for
location and 1.412e-38 for language.
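
For reference, the two-sample KS statistic used in this comparison is the largest gap between the two empirical cumulative distribution functions. The sketch below computes it with the standard library only; the sample values are hypothetical. In practice a library routine such as `scipy.stats.ks_2samp` would typically be used, since it also returns the p-value.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    points = sorted(set(sa) | set(sb))
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in points
    )


# Hypothetical numbers of distinct tweet languages per user (illustrative only).
natives = [1, 1, 2, 2]
migrants = [1, 2, 3, 4]
d = ks_statistic(natives, migrants)
```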

Figure 11. Distribution of tweet locations and languages

Since we possess network information, we also studied the tweet language and location information
for a user's friends. In Figure 12, the two distributions show smaller differences among natives and
migrants, compared to Figure 11. However, the KS tests tell us that the distributions
are indeed different from one another: the p-values for the location and language distributions of
migrants and natives are 3.246e-05 and 0.005, respectively. Although the differences are small, we
observe that the friends of migrants tweet from more locations than those of natives, with
an average of 29.6 for migrants and 27.4 for natives. For the number of languages of friends,
although the KS p-value indicates that the two distributions differ, the actual difference
between the average values is very small.

Figure 12. Distribution of tweet locations and languages of friends

These numbers indicate that migrants have travelled to a wider variety of places, and hence write in
more diverse languages, than natives. The friends of migrants also tend to have travelled more.
However, no large differences were observed between migrants and natives in the number of languages
that their friends write in.

3.4.2 Homophily of social links
We measure homophily of social links by computing the assortativity of nodes by different attributes
of individuals, i.e., migrants or natives, country of residence and country of nationality. We study the
networks of 51,888 users, out of which 4,940 are migrants. Assortativity tells us whether the network
connections correlate in any way with the given node attributes (Newman, 2002). In other words, it
tells us whether the nodes in the network tend to connect with other similar nodes (e.g. migrants with
other migrants, natives with other natives). It typically ranges between -1 and 1. A value of 1 means
nodes always connect with nodes with the same attributes, i.e. full homophily, while -1 means nodes
tend to connect with nodes with different attributes. In our case this analysis allows us to infer
whether and in what measure the network topology follows the nationality or residence of the users,
or whether the migrant/native status is relevant when building online social links.
 We begin with global assortativity measures, which give one assortativity score for the entire
network. When looking at the coefficient by the country of residence, the score of 0.54 shows a
high level of homophily. The score improves slightly when we consider the country of nationality
instead (0.6). These values tell us that nodes tend to follow other nodes that share the same
country of residence and country of nationality, with a stronger effect for the latter.
However, when looking at the coefficient by the migrant/native label, we observe no particular cor-
relation (0.033).
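
To make the global measure concrete, the sketch below computes Newman's assortativity coefficient for a categorical node attribute on a directed edge list, r = (sum_i e_ii - sum_i a_i b_i) / (1 - sum_i a_i b_i), where e_ij is the fraction of edges going from label i to label j, and a_i and b_i are the fractions of edges starting and ending at label i. The function name and the toy network are hypothetical, and the sketch assumes at least two labels occur on the edges (otherwise the denominator is zero).

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def attribute_assortativity(edges: Iterable[Tuple[Hashable, Hashable]],
                            label: Dict[Hashable, str]) -> float:
    """Newman's assortativity coefficient for a categorical node attribute."""
    pairs = Counter((label[u], label[v]) for u, v in edges)
    total = sum(pairs.values())
    labels = {lab for pair in pairs for lab in pair}
    # a_i: share of edges leaving label i; b_i: share of edges arriving at label i.
    out_share = {lab: sum(c for (s, _), c in pairs.items() if s == lab) / total for lab in labels}
    in_share = {lab: sum(c for (_, t), c in pairs.items() if t == lab) / total for lab in labels}
    same = sum(c for (s, t), c in pairs.items() if s == t) / total
    expected = sum(out_share[lab] * in_share[lab] for lab in labels)
    return (same - expected) / (1.0 - expected)


# Toy follower network with a hypothetical nationality label per node.
nationality = {1: "IT", 2: "IT", 3: "US", 4: "US"}
r_homophily = attribute_assortativity([(1, 2), (2, 1), (3, 4), (4, 3)], nationality)
r_heterophily = attribute_assortativity([(1, 3), (3, 1), (2, 4), (4, 2)], nationality)
```

In the first toy network all links stay within the same nationality (r = 1), while in the second all links cross nationalities (r = -1).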

Figure 13. Node level conformity values for natives and migrants. The histograms are stacked

The global assortativity scores can be influenced by the size of the data and the imbalance
in labels, which is the case here, especially for the migrant/native labels. Therefore we continue by
examining the assortativity at local level, allowing us to overcome the possible issues at global level. We
thus compute the scores based on an extension of Newman's assortativity introduced by Rossetti et
al. (2021) called conformity. In Figure 13 we show the distribution of node-level conformity of migrants
and natives, for the three attributes (nationality, residence and migrant/native label). We observe
different behaviour patterns for migrants and natives. Specifically, we see that migrants tend to dis-
play lower homophily compared to natives, when looking at the conformity of nodes by country of
residence. This tells us that migrant users tend to consider less the country of residence when follow-
ing other users. Instead, most natives tend to connect with users residing in the same country. When
looking at nationality, this effect is less pronounced. While natives continue to display generally high
homophily, with a small proportion of users with low values, migrants show a flatter distribution
than for residence. Again, a large part of migrants show low homophily; however, a sizeable
fraction of migrant users show higher nationality homophily, as opposed to what we saw for
the residence. This confirms what we observed at global level: there is a stronger tendency to follow
nationality labels when creating social links.
 As for the conformity of nodes by migrant/native labels, we observe that migrants and natives
clearly have distinctive behaviours. While natives tend to form connections with other natives,
migrants tend to connect with natives as well, resulting in negative conformity values for migrant
users. The observed values could also be due to the fact that migrants are only about 10% of our
users so naturally many friends will be natives (from either residence, nationality or other country).
This result is different from what we observed at global level and confirms that the global assortativity
score was influenced by the size of the data and the imbalance in labels.

4. Case study: border rush at the Turkish border

On 27 February 2020, after the death of 33 Turkish troops in airstrikes in Northern Syria, a
Turkish government representative gave a briefing and stated that refugees would no longer be
stopped at the Turkish borders to Europe, and that the doors were open. 2 This was immediately followed
by movements of thousands of people towards the regional hub of Edirne in northwest Turkey, from
where both the Bulgarian and Greek border crossings are easily accessible. In the following days
several declarations were made by political leaders and media stating that over 100,000 refugees
had left the border town of Edirne. Despite these and other official declarations, 3 there is no clear picture
of the movement patterns across borders in that period. In this section, we investigate the use of
Twitter to gain knowledge on which migrant groups attempted to cross the border and where they
went. This analysis was not initially planned in the project, but was undertaken as additional research
due to the events in Turkey. The analysis will be integrated with mobile phone data for a joint publi-
cation of work packages 5 and 6 of the project.

4.1 Data
We have collected Twitter data from 25 February to 25 March 2020, using the Academic Track
Twitter API. Specifically, we searched for geolocated tweets at the European Turkish border, using
a set of bounding boxes shown in Figure 14. This resulted in a total of 11,187 tweets, posted by
1,375 users. For these users, we went on to download all their 2020 tweets, resulting in an additional
350,969 tweets.
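
The spatial selection step can be illustrated with a simple bounding-box test on tweet coordinates. The box below is hypothetical and only roughly covers the area around Edirne; the boxes actually used are those shown in Figure 14. The sample coordinates are approximate and illustrative.

```python
from typing import Iterable, NamedTuple


class Box(NamedTuple):
    west: float   # minimum longitude
    south: float  # minimum latitude
    east: float   # maximum longitude
    north: float  # maximum latitude


def in_any_box(lon: float, lat: float, boxes: Iterable[Box]) -> bool:
    """True if the point falls inside at least one bounding box."""
    return any(b.west <= lon <= b.east and b.south <= lat <= b.north for b in boxes)


# Hypothetical box (not the one used for the data collection).
boxes = [Box(west=26.0, south=40.5, east=27.0, north=42.0)]

near_edirne = in_any_box(26.56, 41.68, boxes)    # approximate coordinates of Edirne
near_istanbul = in_any_box(28.98, 41.01, boxes)  # approximate coordinates of Istanbul
```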

Figure 14. Area of collection of tweets in the period 25 February-25 March 2020

2 https://www.bloomberg.com/news/articles/2020-02-27/erdogan-chairs-emergency-security-meeting-amid-syria-clashe.
3 https://www.theguardian.com/world/2020/feb/29/erdogan-says-border-will-stay-open-as-greece-tries-to-repel-influx.

4.2 Analysis
We first study the language of the Tweets published in the area of interest in the period 25 February-
25 March 2020. We see a total of 35 languages, demonstrating the cultural richness of the region.
Figure 15 shows the number of Tweets in languages that have more than 10 tweets. We observe that
Turkish, English and Greek are the most common, followed by Indonesian, Arabic and Tagalog
(Philippines). The number of tweets in these last three languages is, however, rather small (under 100).

Figure 15. Number of tweets in languages that are most present at the Turkish border

Figure 16. Number of users tweeting in each language present at the border*

 * Only languages with at least 10 users are shown.

Secondly, we look at the number of users for each language, in Figure 16. We observe that the ranking
remains almost unchanged, except for Arabic, which has fewer than 10 users present.
This indicates that, even if a large number of Arabic-speaking migrants are present in Turkey, and
were also mentioned by the media, their presence is not visible on Twitter. This shows that, for this
particular case study, Twitter may not be representative enough for a thorough analysis. We do, how-
ever, manage to see these migrant groups at the border by using CDR data, which in this case is
richer.
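
The difference between counting tweets and counting users per language can be sketched as follows. The data are hypothetical and constructed so that one language passes the tweet threshold of Figure 15 (more than 10 tweets) but not the user threshold of Figure 16 (at least 10 users), which is the situation observed for Arabic.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def language_stats(tweets: Iterable[Tuple[str, str]]) -> Tuple[Counter, Dict[str, int]]:
    """Tweets per language and distinct users per language, from (user_id, lang) pairs."""
    tweet_counts: Counter = Counter()
    users = defaultdict(set)
    for user_id, lang in tweets:
        tweet_counts[lang] += 1
        users[lang].add(user_id)
    return tweet_counts, {lang: len(ids) for lang, ids in users.items()}


# Hypothetical data: 50 Turkish tweets by 50 users, 40 Arabic tweets by only 3 users.
tweets = [(f"tr{i}", "tr") for i in range(50)] + [(f"ar{i % 3}", "ar") for i in range(40)]
tweet_counts, user_counts = language_stats(tweets)

shown_by_tweets = {lang for lang, n in tweet_counts.items() if n > 10}  # more than 10 tweets
shown_by_users = {lang for lang, n in user_counts.items() if n >= 10}   # at least 10 users
```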

Another question of interest is the destination of the persons observed at the border. In principle,
Twitter could allow for a fine-grained analysis of movements of different groups. However, for pri-
vacy concerns, and given the small number of users present in our data, we simply look at the location
of all users and only at a very coarse-grained spatial resolution. We thus consider all the users seen
on the Turkish side of the border in the initial period (25/02/2020-25/03/2020), resulting in
1,314 users. We then compute the set of users who were seen in Turkey between May and December
2020. The intersection of the two sets counts 972 users, telling us that the majority of users remained
in Turkey. The remaining 342 users were either seen at different locations (34 users in 31 different
countries) or are not present in the data after May. Again, the analysis indicates that the data are not
sufficient for this type of analysis. The number of users present is too small to observe general pat-
terns, and privacy concerns reduce the resolution at which we can study the data.
 All in all, we conclude that the Twitter data, for this case, did not provide significant information. The
number of users is very small, and the migrant groups that were mentioned by official statements
are weakly represented in the data. Here, CDR data provides an advantage, since some specific
nationalities are better represented, as will be described in deliverables pertaining to WP6.
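
The user counts in the destination analysis follow from simple set operations, sketched below with synthetic user identifiers that reproduce the reported sizes (1,314 users at the border, 972 later seen in Turkey, 34 later seen elsewhere). The identifiers and the function name are hypothetical.

```python
from typing import Set, Tuple


def split_users(at_border: Set[str],
                later_in_turkey: Set[str],
                later_elsewhere: Set[str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split border users into those who stayed, those seen elsewhere, and those not seen again."""
    stayed = at_border & later_in_turkey
    moved = (at_border - stayed) & later_elsewhere
    unseen = at_border - stayed - moved
    return stayed, moved, unseen


# Synthetic identifiers matching the reported counts (illustrative only).
at_border = {f"u{i}" for i in range(1314)}
later_in_turkey = {f"u{i}" for i in range(972)}
later_elsewhere = {f"u{i}" for i in range(972, 1006)}  # 34 users

stayed, moved, unseen = split_users(at_border, later_in_turkey, later_elsewhere)
```

With these numbers, 1,314 - 972 = 342 users did not remain in the Turkish data, of whom 34 were seen elsewhere and the remaining 308 do not appear in the data after May.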

5. Discussion and conclusion

We have presented two different approaches and one case study to estimate migration stocks through
Twitter data. The first is a top-down approach, where overall immigration levels are estimated
through machine learning applied to features extracted from the entire set of Tweets of a community.
The second is a bottom-up approach that identifies migrant users and then studies the distribution
of migrants of different nationalities in different countries. This type of analysis allows for further
investigation on the communities of migrants and natives, such as integration or social network usage.
The case study is an attempt to investigate the border rush in Turkey using Twitter. The analysis
uncovered one important challenge of this type of data: lack of representativity for some migrant groups,
probably due to selection bias. In fact, our data includes very limited information on refugee groups
present in Turkey.
 The approaches presented here complement the ones in the literature. In general, existing works
have looked at national and international flows, studying either the residence of users (approximated
through geolocation, similarly to our approach; Zagheni et al., 2014) or the language used (Moise et
al., 2016). Our top-down approach is the first, to our knowledge, that attempts to use machine
learning for nowcasting, employing various features extracted from the data. The bottom-up approach,
on the other hand, is the only one that looks at residence and nationality at the same time. This allows
us to differentiate stocks by nationality, and therefore to perform analyses on the different migrant groups.
It is important to note that what we call nationality may be a concept also related to country of birth,
or even to the country to which a person feels they most belong. What our data measures is the social
connection to a specific location; therefore a foreign national fully integrated in a certain country
would be indistinguishable from a native by our method. Furthermore, the method will not identify those
people with foreign nationality who lived all their lives in a different country (e.g. individuals born
in the US, hence having US nationality, but having lived all their lives in a different country). However,
our intention is not to identify these cases, since they are already integrated, but rather migrants who
actually spent a significant part of their life in a country and then moved to another one. For these, we
expect that our method will identify the correct origin and destination country most of the time.
 The work we performed underlined a few challenges regarding these types of novel data sources.
First of all, it became clear to us that the data collection step is not straightforward. While using the
streaming API to download Tweets is simpler, we had to make a considerable effort to download the network
of users at a large scale. The difficulties came from the large number of requests to be made to the
API, and the large amounts of data to be stored. Therefore we found that the data collection step (in the
sense of downloading the Twitter data for further analysis) and the initial pre-processing still require
specific computer science expertise, and a long time to complete.
 A different challenge that we came across, in the context of studying migration stocks or migrant
behaviour, is that even if we are in the field of big data, the relevant data remaining after the
pre-processing step can still be small, due to filtering out of irrelevant or incomplete data in
consecutive processing stages. For instance, the fact that we only work with geolocalised tweets reduced
the data a lot. The proportion of geo-tagged tweets is typically low: 3.17% in the Streaming API
and 1.45% in the Firehose (Morstatter & Liu, 2017). This is due to the fact that not all users enable
geolocation. In our case, this was visible in the bottom-up approach, where, even if we started with
over 200,000 users in our core network, we were able to assign residence and nationality only to
approximately 50,000 users, which is a quarter of the data.

One of the most notable challenges for Twitter is that the user population does not represent the
offline population (Zagheni & Weber, 2015), i.e. there is a so called ‘selection bias’. This is partly due
to different Internet and Twitter penetration and regulations in different countries. Also, according
to a survey conducted by Pew Research, the Twitter population has characteristics that are different from
those of other social media platforms (Wojcik & Hughes, 2019). In the United States, Twitter users tend to
have more education and higher income than the general public. Genders are equally
represented on Twitter. When it comes to ethnicity, white users are prominently present on Twitter.
Although there are studies on demographics of general Twitter users using Twitter data (Mislove et
al., 2011; Longley et al., 2015), studies that focus on migrants are missing in the literature. Therefore,
it is difficult to understand socio-economic and demographic composition of migrants. We can how-
ever understand from these works that young population, age 15 and above, and white ethnicity are
over-represented on Twitter in the United States and in London. Furthermore, it is not clear how the
demographics of those who use geolocation are different from the rest of the Twitter and offline
population. We are therefore facing at least a double selection bias: one into using Twitter, and the
second into enabling geolocation. In our second study, among 50,000 total users, only about 4,900
were labelled as migrants, which is, in the end, a rather small sample. We might be undercounting
some migrant groups and overcounting others. An even larger effect was seen in the Turkish
border analysis. Here, the main migrant and refugee groups were barely visible, suggesting that
those groups make little use of Twitter. In those cases, other data types, such as CDR, may provide more
information.
 The top-down approach is somewhat less affected by these issues: all tweets of a community are
included, from both migrants and natives, which greatly increases the amount of data on which we can
base our analyses. We expect that to be true also for other types of top-down studies, e.g. evaluation
of sentiment towards migrants, where what we are most interested in is the sentiment of natives. Even
so, the native population under analysis still suffers from selection bias.
 A different aspect where there are a lot of open issues is that of privacy and ethics. The terms and
conditions of the public APIs can be at times vague, and may not cover all ethics issues involved.
There have been some attempts to produce ethics and privacy guidelines for using social media data
in research (Townsend & Wallace, 2016; Williams et al., 2017), however they are general and do not
consider the special case of migration research. In this field we are facing a very difficult problem.
On the one hand, sensitive personal information such as geolocation and demographics is key to
developing suitable analyses and advancing the state of the art. On the other hand, we are studying
possibly very sensitive groups, hence these pieces of personal information should be protected, for
the sake of both individual and group privacy. In these conditions, special guidelines are necessary,
including issues on privacy, ethics and legality.
 Currently, most migration research using Twitter data does not publish the data it is based on,
thus impeding replicability, which is one of the pillars of good quality research. We are also unable to
publish our raw data, and we are considering new research directions to investigate anonymisation
and privacy-preserving methods for Twitter. Some works exist describing de-anonymisation and
re-identification for social graphs (Narayanan & Shmatikov, 2009), social network content
(Henriksen-Bulmer & Jeary, 2016) and geolocations (Yoshiura, 2019; Monreale et al., 2010; Cecaj et
al., 2016). However, a set of widely accepted guidelines and solutions for publishing such data does not
exist. Privacy risk assessment is an ongoing topic (Pellungrini et al., 2017), and further results,
especially related to migration, are needed.
 Overall, the gaps and biases that exist in Twitter data make it difficult for researchers to generalise
their findings to the general population. This also applies to the case of the bottom-up approach.
However, we believe that our methodology could still be useful when limiting the scope to a certain
type of migrants. It enables a rich analysis of the characteristics of the migrant population on Twitter,
as we have exemplified with the home and destination attachment study. This is due to the richness of
Twitter data, which includes topics of interest, language and location information, etc. We have
