Quantifying Social Biases in News Articles
                  with Word Embeddings

                       Maximilian René Keiff

                                June 28, 2021
Department of Computer Science

                      Bachelor’s Thesis

Quantifying Social Biases in News Articles
            with Word Embeddings

                  Maximilian René Keiff

   1. Reviewer     Jun. Prof. Dr. Henning Wachsmuth
                   Computational Social Science (CSS)
                   Paderborn University

   2. Reviewer     Prof. Dr. Gitta Domik-Kienegger
                   Computer Graphics, Visualization and Image Processing
                   Paderborn University

    Supervisor     Jun. Prof. Dr. Henning Wachsmuth

                         June 28, 2021
Maximilian René Keiff
Quantifying Social Biases in News Articles with Word Embeddings
Bachelor’s Thesis, June 28, 2021
Reviewers: Jun. Prof. Dr. Henning Wachsmuth and Prof. Dr. Gitta Domik-Kienegger
Supervisor: Jun. Prof. Dr. Henning Wachsmuth
Advisor: Maximilian Spliethöver

Paderborn University
Computational Social Science Group (CSS)

Department of Computer Science
Warburger Straße 100
33098 Paderborn
Abstract

Social biases such as prejudices and stereotypes toward genders, religions, and
ethnic groups in the news influence news consumers' attitudes and judgments about
these groups. So far, there has been no research into which social biases appear in news
and how they have changed over the last ten years. Moreover, connections between
political attitudes and social biases have been found elsewhere but have not yet been
quantified in news. This thesis uses a method involving word embeddings and the
bias metric WEAT to address these problems while creating a new dataset of English
political news articles. To create the dataset, we develop a web crawler that extracts
the texts of articles from the websites of 25 U.S. news outlets with different political
media biases. Through our bias evaluation, we find connections between political
and social bias and identify the years and news outlets that exhibit the most social
bias in the period from 2010 to 2020. Our results show that by 2020, the bias toward
gender and toward the African-American and Hispanic communities has trended slightly
downward along the whole political spectrum. Apart from this, we find that right-wing
news media exhibit the most bias against Islam and ethnic minorities. In contrast,
left-wing news media are characterized by more balanced coverage of the African-American
and Hispanic communities. Finally, we do not find significant age bias in any news outlet.

Contents

1 Introduction
  1.1 Motivation
  1.2 Research Questions

2 Fundamentals
  2.1 Natural Language Processing
       2.1.1 Tokens and Lemmas
  2.2 Machine Learning
       2.2.1 Linear Regression and Gradient Descent
       2.2.2 Neural Networks
  2.3 Word Embeddings
       2.3.1 Word2Vec
  2.4 Cognitive Bias
       2.4.1 Political Bias
       2.4.2 Social Bias
  2.5 Quantifying Social Bias
       2.5.1 Implicit Association Test
       2.5.2 Word Embedding Association Test

3 Approach and Implementation
  3.1 Collecting News Articles
       3.1.1 Web Crawler
       3.1.2 Article Parser
  3.2 Preprocessing the Data
  3.3 Training the Word Embeddings
  3.4 Quantifying Social Biases
       3.4.1 Mining and Analysis of Word Co-occurrences

4 Experiments and Results
  4.1 Gender Bias
  4.2 Religious Bias
  4.3 Ethnic Bias
       4.3.1 African American
       4.3.2 Chinese
       4.3.3 Hispanic
  4.4 Age Bias

5 Discussion
  5.1 Interpretation of the Results
       5.1.1 Correlations with Political Media Bias
  5.2 Review of the WEAT
       5.2.1 Interpretation of the WEAT
       5.2.2 Limitations when Comparing WEAT Results
       5.2.3 Issues Related to the Choice of Wordsets
  5.3 Limitations with Word2Vec
  5.4 Evaluation of the Implementation
       5.4.1 Article Collection
       5.4.2 Pipeline

6 Conclusion
  6.1 Limitations
  6.2 Future Work

A Appendix
  A.1 Plots
  A.2 Wordsets

Bibliography
1 Introduction
1.1 Motivation

Social biases occur when we unconsciously or intentionally think or act prejudicially
about certain social groups or individuals based on their group membership (Fiske,
1998). Such biases build on ideas and beliefs about social groups that are reinforced
through repeated exposure (Arendt and Northup, 2015). Common cases include
gender stereotyping and discrimination against members of certain religions or
ethnic minorities. As Fiske (1998) states, the resulting problems are often emotional
damage, discrimination, and disadvantaging of those groups which lead to division
in society.
We pick up such social biases from our environment, which mainly consists of family and
work colleagues, but also from the media, including the news we consume. When
news outlets do not report truthfully, objectively, impartially, or fairly, this is called media bias
(Newton, 1989). Media bias is divided into political biases and non-political biases,
the latter of which include social biases.
We speak of political bias when a political perspective is emphasized or when information
is covered up or altered to make it more attractive. In the United States,
for example, both liberal bias and conservative bias exist. The political situation in the United States
is highly polarized between the two major parties, Democrats and Republicans, and
many news outlets favor one of the two sides (A. Mitchell et al., 2014).
With the increase in the availability of news on the Internet, the behavior of news
consumers has also changed. They receive personally tailored news selected by
machine learning algorithms that analyze their preferences, their search results are
filtered by search engines, and they share such news with their circle of friends via
social media. Pariser (2011) calls this effect filter bubbles, as we are only exposed
to news and opinions that confirm our existing beliefs. In addition, the machine
learning algorithms behind search engines and other applications adapt to and reinforce
our biases through the feedback loop between the suggested selection of news and the
preferences we reveal by choosing from it.
To counter this development, researchers and watchdog institutions have set themselves
the task of monitoring and analyzing media biases by trying to check the facts
behind both biased reporting and unsubstantiated claims. An example of such a
watchdog institution is AllSides1, which primarily looks at political media bias and
categorizes top news stories along the political spectrum according to their political
bias. They also maintain a rating of the political bias of U.S. news outlets to classify
their political leanings. However, besides such ratings on political bias, there is still
a lack of research on social biases in news.

1 https://www.allsides.com (Accessed on September 3, 2020)
The main areas of interest are which social biases occur in the news and how they
have developed in recent years. Similar to political media bias, news outlets
could be categorized according to social bias and the extent of that bias. Correspondingly,
it would be interesting to investigate whether associations with political
bias exist and, if so, which ones. At the same time, research is needed on how to
measure social biases in the first place and how effective such measurements are.
This work will explore one specific method to quantify social biases in the news,
which uses word embeddings and the intrinsic bias metric WEAT (Caliskan et al.,
2017). We will compare our results on social biases from 2010 to 2020 to analyze
developments and whether there are connections to political bias.
    This research on how biases manifest in word embeddings is also interesting as a
    basis for gaining more insight into the nature of word embeddings so that biases can
    be identified and neutralized. Applications of word embeddings include automatic
    text summarization (Rossiello et al., 2017), machine translation (Lample et al.,
    2018), information retrieval (Vulić and Moens, 2015), speech recognition (Settle
    et al., 2019) and automatic question answering (Esposito et al., 2020). Relevant
    related work includes the publications by Spliethöver and Wachsmuth (2020), who
    quantified social bias in English debate portals, and Knoche et al. (2019), who
    quantified social bias in various online encyclopedias such as Wikipedia. The latter
    already confirmed that some correlations exist between social bias and political
    bias.

1.2 Research Questions

In this thesis, we examine how the extent of four exemplary social biases, concerning
gender, religion, ethnicity, and age, has developed in English U.S. political news between
2010 and 2020. In order to compare our results on social biases with the political
bias of news, we restrict ourselves specifically to political news, as we use the political
media bias rating of news outlets from AllSides as a reference.
To tackle the stated objectives of this work, we have formulated the following four
research questions:

   1. How can a large dataset of political news articles be created in a time-efficient
      manner?

   2. How strongly are social biases represented in political news articles from the
      last decade?

   3. Which social biases in the news articles are associated with which political
      positions of the news outlets?

   4. How has the appearance of social bias in news articles developed in recent
      years and which news agencies have contributed most to this?

To create the dataset of political news articles, we crawl the online articles of several
U.S. news outlets. In doing so, we select equal numbers of news outlets for the
different political media biases that AllSides categorizes. This allows us to compare
the measured social biases with the political biases of the outlets that published
these articles.
To quantify social biases in news articles, we will use word embeddings and the
bias metric WEAT. To address the development of social bias over the period 2010
to 2020, we will divide our crawled dataset of articles into the respective years of
publication and examine the social biases for each year.

This thesis first explains the necessary fundamentals for this work; then we describe
our approach and implementation. This is followed by the results of the social bias
measurements, and finally we discuss the results and our methodology.

2 Fundamentals
This chapter explains the necessary background for understanding this thesis. We
will start with the basics of machine learning and neural networks,
in order to explain the Word2Vec algorithm for creating word embeddings. Following
that, we will explain the concepts of social and political bias to concretize what
exactly we want to analyze in this work. At the end of the chapter, we will describe
the bias metric WEAT that we will apply to these word embeddings.

2.1 Natural Language Processing

Natural Language Processing (NLP) is a branch of computer science and linguistics
that mainly studies the automatic processing and analysis of text (Dan Jurafsky and
Martin, 2009). It is concerned with the automatic processing of written texts or
speech and the extraction of information and insights. Linguistic knowledge, such
as theories of grammar, parts of speech, and word-sense ambiguity, is applied by
algorithms that focus on solving complex tasks like question answering, machine
translation, or conversational agents. Primarily, various empirical and machine
learning algorithms are used to identify and structure the information in texts. These
algorithms are usually applied sequentially to an input text in a pipeline. The
datasets of texts for such pipelines are called text corpora.

2.1.1 Tokens and Lemmas

Typical steps in NLP pipelines are sentence splitting and tokenization, which segment
a text into its single sentences or tokens. Tokens can be, for example, words, symbols,
and n-grams, the latter being a contiguous sequence of n elements from a given
text. Based on this, the word tokens are often morphologically normalized to
reduce inflections and, partly, derivations. For this normalization, there are two
alternatives: stemming and lemmatization. Stemming simply cuts away affixes of a word
based on predefined grammatical rules; for example, studies and studying become studi and
study. Stemming has limitations if we are interested in the dictionary form
of each word, which is, for example, not the case with studi. This is exactly what is
achieved by lemmatization (Zgusta, 2012), which additionally takes into account the
lexical category of a word, as determined by a part-of-speech tagger. A
part-of-speech tagger maps each word in a sentence to its part of speech like nouns,
    verbs, adjectives and adverbs based on both its definition and its context. Building
    on these basic tasks, further algorithms can be applied, which become increasingly
    complex and often involve machine learning.
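To make these steps concrete, the following is a minimal sketch using the spaCy library; the toolkit choice is an illustrative assumption, as this thesis does not prescribe a specific NLP library:

```python
# Minimal tokenization, POS tagging, and lemmatization sketch with spaCy.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline
doc = nlp("She studies the biases in news articles.")

for token in doc:
    # token.text: the token, token.pos_: part of speech, token.lemma_: dictionary form
    print(token.text, token.pos_, token.lemma_)
# e.g. "studies" is tagged VERB and lemmatized to "study"
```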

    2.2 Machine Learning

    T. M. Mitchell (1997) defined machine learning as the property of algorithms to
    learn from experience. A machine learning algorithm improves its performance on a
    task by evaluating its recent results using a performance measure and then refining
    its approach with this experience. These respective learning processes can generally
    be divided into the following three categories:
Unsupervised learning tries to discover information such as patterns and structures
in unlabeled data by itself, as in clustering. In reinforcement
learning, the machine learning model improves by trying to maximize the positive
feedback it receives while trying to solve a problem, like a robot learning to climb stairs.
In supervised machine learning, the algorithm gets a set of inputs X with multiple
features and an associated set of outputs or targets Y that it should learn to predict.
If the outputs or targets are continuous, such a machine learning problem is also
called a regression problem. In the following, we will use linear regression to illustrate
the concept of machine learning, since higher-level concepts used in this thesis,
such as neural networks and word embeddings, build on these fundamentals.

    2.2.1 Linear Regression and Gradient Descent

In linear regression, a training set consists of input-output pairs $(x^{(i)}, y^{(i)})$ with
$i = 1, \dots, n$, and the goal is to learn a function $h : X \to Y$ such that $h(x)$ is a "good"
predictor for the corresponding $y$ (Russell, 2010; Mehryar and Rostamizadeh, 2012).
We try to approximate $y$ as a linear function $h_\theta(x)$ of $x$, parametrizing the space of
linear functions mapping from $X$ to $Y$ by the weights $\theta_i$. Here, $d$ specifies the number
of features, that is, the independent pieces of information about an input $x$, and the additional
convention is that we set the first feature $x_0 = 1$:

$$h_\theta(x) = \theta_0 \cdot x_0 + \theta_1 \cdot x_1 + \cdots + \theta_d \cdot x_d = \sum_{i=0}^{d} \theta_i x_i \qquad (2.1)$$

The machine learning problem is now to learn the weights $\theta$ from the given input-output
pairs. To formalize the performance of a linear function for given weights $\theta$,
we use a cost function $J(\theta)$ that measures how well $h_\theta(x^{(i)})$ approximates the
corresponding $y^{(i)}$. In this case, it is half the sum of the squared errors over all pairs of
approximated and expected outputs:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \qquad (2.2)$$

A good predictor $h$ minimizes the cost or error $J(\theta)$ through appropriate weights $\theta$. To
determine $\theta$, the gradient descent algorithm is often used, which starts with an initial
$\theta$ and iteratively adjusts these weights to reduce the error. To do this, the gradient
descent algorithm looks at all the data in each iteration and determines the gradient
of the error function $J(\theta)$ at the current weights $\theta$ to then take a step of length $\alpha$
in the direction of the steepest decrease of $J$. This update step can be formalized as follows
for a single weight $\theta_j$ and is performed simultaneously for all values of
$j = 0, \dots, d$. Here, the $:=$ means that the previous value of $\theta_j$ is overwritten:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad (2.3)$$

Depending on the learning rate $\alpha$, the gradient descent algorithm converges at some
point to a local minimum; in linear regression, however, there is only a single global
minimum because $J$ is a convex quadratic function. For
very large training datasets, it takes a long time to consider all input-output pairs
in each iteration in order to determine the next gradient. This is why the variant
stochastic gradient descent is often used, which adjusts the weights $\theta$ already after
randomly selected batches of individual training data pairs.
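To illustrate, the following is a minimal NumPy sketch of batch gradient descent for linear regression under these definitions; the learning rate, iteration count, and toy data are arbitrary choices for the example:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Learn weights theta for h_theta(x) by batch gradient descent."""
    n, d = X.shape
    X = np.hstack([np.ones((n, 1)), X])  # prepend x_0 = 1 by convention
    theta = np.zeros(d + 1)
    for _ in range(iterations):
        error = X @ theta - y            # h_theta(x^(i)) - y^(i) for all i
        gradient = X.T @ error / n       # averaged gradient of J(theta), for a stable step size
        theta -= alpha * gradient        # step of length alpha toward steepest decrease
    return theta

# Example: recover the noisy line y = 2x + 1
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.1, 100)
print(gradient_descent(X, y))  # approximately [1.0, 2.0]
```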
Linear regression can be used to learn continuous functions from data with linear
features, but in order to learn much more complex relationships and abstract features,
neural networks are often used.

2.2.2 Neural Networks

An artificial neural network consists of several linked neurons that form its elementary
units. Such a neuron consists of a linear function that calculates the activation
value $a$ from several inputs $x_i$ and weights $w_i$ with $i = 1, \dots, n$, and a non-linear
activation function $\varphi$, which enables neurons to become more powerful than
simple linear regression. The linear function with the weights $w_i$ can be understood
analogously to the linear regression problem with the weights $\theta_i$ from before;
we change the variable names only by convention. We can formalize a
neuron with the following two equations:

$$a = \sum_{i=1}^{n} x_i w_i \qquad (2.4)$$

$$y = \varphi(a) \qquad (2.5)$$

The output $y$ of a neuron can then be passed on to act as an input for the next
neuron. Groups of neurons that are not connected with each other are organized
into layers, and we distinguish between the input layer, the output layer, and an arbitrary
number of hidden layers in between. As the number of layers in a neural network
increases, the ability to learn higher-level features from the input data grows,
but so does the computational effort to train the many parameters.
The input of a neural network is a vector that is passed directly to the first neurons.
The output vector from the output layer, on the other hand, must be adjusted
with an appropriate activation function, depending on the problem, in order to
interpret the result. For example, in a multi-class classification problem, the output
vector $\vec{y}$ is normalized to a probability distribution over all possible classes. Here,
the softmax function $\sigma : \mathbb{R}^K \to \mathbb{R}^K$ is often used, which calculates for each class
$i = 1, \dots, K$ from the corresponding output $y_i$ the probability $\sigma(y)_i$ such that all
vector components lie in the interval $(0, 1)$ and sum up to 1.

$$\sigma(y)_i = \frac{e^{y_i}}{\sum_{j=1}^{K} e^{y_j}} \quad \text{for } i = 1, \dots, K \qquad (2.6)$$

    Other popular activation functions are the identity function, the sigmoid, and the
    hyperbolic tangent, but we will not discuss them further in this thesis.
    If all neurons of one layer are fully connected to all neurons of the next layer and
    the whole neural network is a directed acyclic graph we speak of feedforward neural
    networks. A widely used algorithm for training feedforward neural networks is
    backpropagation. Similar to linear regression, for an input X we expect an output y,
    which we try here to predict using the entire network. Using gradient descent and an
    error function, we can identify the neurons where we need to adjust the weights in
    order to minimize the error. This gradient now also takes the activation function of a
    neuron into account. For the previous layers, the calculation of the gradient and the
    adjustment of the weights, depending on their influence on the error, are repeated,
    which is called backpropagation. Since the error function for a neural network is
no longer convex, backpropagation does not necessarily find the global minimum,
but only a local one, which in most cases is not much worse (Dauphin et al.,
2014; Choromanska et al., 2015).
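As an illustration, the following is a minimal sketch of a single forward pass through a fully connected feedforward network with one hidden layer and a softmax output; the layer sizes, random weights, and tanh activation are arbitrary choices for the example:

```python
import numpy as np

def softmax(y):
    # subtract the maximum for numerical stability; the result sums to 1
    e = np.exp(y - np.max(y))
    return e / e.sum()

rng = np.random.default_rng(0)
V, N, K = 8, 4, 3                 # input size, hidden size, number of classes
W1 = rng.normal(size=(V, N))      # weights input -> hidden
W2 = rng.normal(size=(N, K))      # weights hidden -> output

x = rng.normal(size=V)            # an input vector
h = np.tanh(x @ W1)               # hidden layer with tanh activation
probs = softmax(h @ W2)           # output as probability distribution over K classes
print(probs, probs.sum())         # the probabilities sum to 1.0
```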

2.3 Word Embeddings

A word embedding is a vector representation of a word that usually encodes the
meaning of a word in a semantic or syntactic way (James and Daniel Jurafsky, 2000).
These vectors have many applications in downstream tasks in NLP especially in
the field of distributional semantics which focuses on quantifying the meaning of
linguistic items like words or tokens through their distributional properties in a
given dataset (Goldberg, 2017). This understanding of language, that semantic and
syntactic relations of words can be recognized by the contexts in which these words
appear, goes back to the semantic theory by Harris (1954) and was popularized
through the quote "You shall know a word by the company it keeps!" by Firth (1957).
To create word embeddings, methods such as dimensionality reduction on the co-occurrence
matrix (Collobert, 2014) or neural networks (Mikolov, K. Chen, et
al., 2013; Pennington et al., 2014) are used. The resulting word embedding model
is a function that maps a subset of the vocabulary of the underlying text corpus to
word vectors. Most models capture the contextual relationships of words, for example,
by placing words that have similar meanings or frequently occur together closer
together as vectors. With such contextual word embedding models, analogies can
be found through vector arithmetic, as in:

$$v_{Berlin} - v_{Germany} + v_{France} \approx v_{Paris} \qquad (2.7)$$

This example is meant to illustrate that just by the similar contexts in which the
capitals Berlin and Paris are mentioned with their corresponding countries, the word
embedding model learns the abstract concept of a capital city. This effect also leads
to the adoption of concepts that are specific to the text corpus on which the model is
trained, such as placing the vectors for "man" and "criminal" closer when looking at
criminal records of men.
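With a trained model, such analogies can be queried directly. A minimal sketch with the gensim library follows; the model file path is a hypothetical example, and any vectors in word2vec format would work:

```python
from gensim.models import KeyedVectors

# Hypothetical path; any word2vec-format vector file can be loaded this way.
vectors = KeyedVectors.load_word2vec_format("news_vectors.bin", binary=True)

# v_Berlin - v_Germany + v_France should land near v_Paris.
print(vectors.most_similar(positive=["Berlin", "France"], negative=["Germany"], topn=3))

# Cosine similarity between two words:
print(vectors.similarity("man", "criminal"))
```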
In our thesis we will specifically create word embedding models with the Word2Vec
algorithm and therefore explain it in more detail in the following.

2.3.1 Word2Vec

The Word2Vec algorithm by Mikolov, K. Chen, et al. (2013) uses a shallow neural
network to create contextual word embedding models from a large text corpus.
There are two different training approaches that differ in the idea of inferring a word
from its context or vice versa. However, both are built on the same architecture
of a fully connected feedforward neural network and use self-supervised learning.
Self-supervised learning in this case means that input-output pairs are generated by
Word2Vec itself from the individual words and their contexts.

Fig. 2.1: The Word2Vec neural network architecture: It consists of an input and an output
layer with V neurons each and a single hidden layer with N neurons. The output
layer uses the softmax activation function, the other two the identity function. The
weight matrix between the input and hidden layer is called the embedding
matrix and contains the word embeddings after the training process. The
weight matrix between the hidden and output layer is called the context matrix
and contains contextual word embeddings for each word. During the training
process, a word i is input as a one-hot encoded vector. The output layer then
calculates a probability distribution over the context, in which some word j has the
highest probability for the input i.
As shown in Figure 2.1, the neural network architecture consists of an input layer
and an output layer, both of the size V of the underlying vocabulary of the text corpus,
and a single hidden layer. Except for the output layer, which uses softmax as its
activation function, the input and hidden layers use the identity function, which
means that they simply pass their results through to the next layer. The size N of
the hidden layer determines the dimensionality of the word vectors and is usually
chosen between 100 and 500. The weight matrix between the input and hidden
layer is called the embedding matrix, and each of its rows contains the word embedding of
a word from the vocabulary. The second weight matrix, between the hidden
and output layer, is called the context matrix, and each of its columns contains a contextual
word embedding for a word.
Word2Vec first creates the vocabulary of the text corpus and then initializes the
neural network. The neural network is trained with backpropagation using one-hot
encoded input vectors of size V, with a 1 at the i-th position representing the
i-th word in the vocabulary and all other positions set to 0. The output
vector is a probability distribution produced by the softmax function and is interpreted as
one or more words depending on the selected training approach. After
the training, we can discard the context matrix and use the embedding matrix as
our word embedding model. The two training models are the continuous skip-gram
model and the continuous bag-of-words model (CBOW):

skip-gram The idea behind the skip-gram model is to infer from a word the context
     in which it is used. A window is slid over each sentence and feeds the respective
     one-hot encoded vectors into the neural network word by word. The output
     vector is a probability distribution, where the current neighbors of the word
     are expected to have the highest probability. The term skip-gram comes from
     the fact that the context of a word is modeled as an n-gram in which a single
     item is skipped in the original sequence. For the backpropagation we look at
     the expected word neighbors in the window individually and sum up each of
     their errors with the output vector.

CBOW The continuous bag-of-words model follows the approach of inferring the
   word from its given context. For each word in the sliding window, all one-hot
   encoded vectors are input in parallel and the outputs of the input layers are
   averaged into one before being passed to the hidden layer. Since the order of
   the individual words in the window is irrelevant due to averaging, the context
   in this case is modeled as a bag-of-words. The output vector then gives the
   probabilities for the words that can be inferred from the context, where the
   currently considered word is expected to have the highest probability.
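In practice, both training modes are available in libraries such as gensim. The following is a minimal sketch; the library choice, toy corpus, and parameter values are illustrative assumptions, not taken from this thesis:

```python
from gensim.models import Word2Vec

# A toy corpus: one tokenized sentence per list entry.
corpus = [
    ["the", "senator", "proposed", "a", "new", "bill"],
    ["the", "president", "signed", "the", "bill"],
]

# sg=1 selects the skip-gram model, sg=0 the CBOW model.
model = Word2Vec(
    sentences=corpus,
    vector_size=300,   # dimensionality N of the hidden layer
    window=5,          # sliding window size around each word
    min_count=1,       # keep all words in this toy example
    sg=1,
)

print(model.wv["bill"].shape)  # (300,) -- one row of the embedding matrix
```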

2.4 Cognitive Bias

     Cognitive bias is a disproportionate, prejudiced, or unfair weighting in favor of or
     against a concept (Haselton et al., 2015). Such biases can be innate or learned
     through the environment, such as people in the family or at work, but also through
media (Wilson and Brekke, 1994). For example, when people develop prejudices
for or against a person or a group whose members share certain characteristics, this is called social bias.
Biases in the media are created by journalists and editors who determine what news
they want to cover and how they want to report it. In addition to non-political
bias, such as social bias, political bias is a common problem.

     2.4.1 Political Bias

Political bias refers to statements in which politicians or the media use cognitive
distortions to manipulatively influence public opinion about a political party or, more
generally, a political stance. To this end, the media report on politics by emphasizing
certain candidates or political views, or by casting opposing opinions in a negative light,
for example through omitting delicate views or quoting out of context. W.-F. Chen,
Al Khatib, et al. (2020) studied how exactly political media bias manifests itself
in the news and found that politically biased articles tend to use emotional
and opinionated words like "disappoint," "trust," and "angry." Such words strongly
influence the tone, phrasing, and selection of facts in the news, which has an impact on the
consumers of the news. In the United States, such political biases only reinforce
the already very politically divided society. The U.S. political system is centered on two
main parties, the liberal Democrats on the left and the conservative Republicans
on the right of the spectrum. Many news outlets like CNN and Fox News favor one
of the two sides (A. Mitchell et al., 2014). Independent institutions such as the
website AllSides have set themselves the mission of monitoring and categorizing
news outlets regarding the political media bias in their published news articles.

     2.4.2 Social Bias

     Besides the political media biases, there also exist social biases like racism (Rada,
     1996) or gender bias (Atkeson and Krebs, 2008) in the news. According to Fiske
     (1998), social biases can be divided into stereotypes, prejudice and discrimination.
     A stereotype is an overgeneralized characteristic about a social group, for example
     thinking that one skin color is somehow superior or inferior to others. Prejudice is a
judgment or emotion, fed by stereotypes and personal preference, toward people based
solely on their group membership. Prejudice also arises from the implicit association
of members of social groups with specific attributes. For example, Arendt and Northup
(2015) show that frequent exposure to stereotypical associations strengthens implicit
social biases; for instance, often consuming news about black criminals creates implicit
biases toward black people.
As soon as people start treating each other unfairly because of certain characteristics
of their social groups, we speak of discrimination (Fiske, 1998). Discrimination is
not only practiced by humans, but has also been shown to be adopted by machine
learning under certain conditions, such as racism in decision systems used in health
care (Char et al., 2018), or gender stereotypes in Google image search results, where
women are shown rather in family contexts and men in career contexts (Kay et al., 2015).

2.5 Quantifying Social Bias

Research in psychology and social science has long been concerned with the question
of how to measure biases and especially social biases. Several tests and experiments
have already been developed in the past. One of these tests is the Implicit Associa-
tion Test (IAT) by Greenwald, McGhee, et al. (1998), which reveals attitudes and
unconscious associations of concepts like implicit stereotypes.
Next to the traditional methods of conducting experiments and surveys with subjects,
computational social science is concerned with exploring the same problems through
simulations or in large amounts of data such as social networks, search engines, and news
articles. One method that is currently widely used is the WEAT, which is based
on the concept of the IAT and is applied to word embedding models. In the following,
we will explain how the IAT works to clarify the concepts of targets and attributes.
After that, we will introduce the WEAT, which we use in this thesis.

2.5.1 Implicit Association Test

The Implicit Association Test (IAT) by Greenwald, McGhee, et al. (1998) measures
hidden associations in people and is often used to assess implicit stereotypes. The
idea behind this is that implicit social biases arise through cognitive priming, whereby
the human brain recognizes a pair of pieces of information more quickly the more
frequently they occur together (Bargh and Chartrand, 2000). Consequently, according to this concept,
a pair consisting of a social group and a stereotype is also associated more quickly.
The test is conducted on a computer and requires subjects to be shown a series of
stimuli and to categorize them by pressing either a left or a right button on
the keyboard. These stimuli are usually pictures or words that either possess one
of two attributes (e.g., pleasant and unpleasant) or belong to one of two disjoint
target concepts (e.g., social groups like caucasian and black people). A pairing of a
target concept with an attribute concept corresponds either to a stereotype or to its opposite. The idea
behind this is that it should be easier to categorize the stimuli when the left and
right buttons are connected to the stereotypes (e.g., caucasian people + pleasant vs.
black people + unpleasant). The implicit bias is then measured by the difference
in average keystroke reaction times between the stereotypic associations and the
opposing ones.

     2.5.2 Word Embedding Association Test

The Word Embedding Association Test (WEAT) by Caliskan et al. (2017) is an
adaptation of the IAT for word embedding models. Since machine learning algorithms
like neural networks learn certain abstract features during training, the social bias of
the training data is not only inherited but often even amplified by word embedding
models (Barocas and Selbst, 2016; Zhao et al., 2017; Hendricks et al., 2018). In a
word embedding model, words that share similar characteristics, analogously to the
members of a social group, can be used to define the social group as a subspace.
For example, to describe the female gender, words like "she" and "woman" but also
female names can be used (Bolukbasi et al., 2016). This method is often applied to
analyze social groups and biases in word embedding models, as in the related work
by Garg et al. (2018).
WEAT takes advantage of this property of word embeddings, combined with the fact
that the more frequently words occur together, the closer their vectors are in the
model. In order to describe the targets and the associated stereotypes as attributes,
similar to the IAT, both are defined by wordsets in the WEAT. Likewise, the concept of
measuring stereotypes by the difference in a subject's response times is transferred
to the distance between the wordsets, which is calculated as the
average cosine similarity of the words in the sets. The cosine similarity captures the
cosine of the angle between two vectors $\vec{x}$ and $\vec{y}$ and works for any number of
dimensions. The smaller the angle between the two vectors, the more similar they
are, with $\text{sim}_{Cosine}$ reaching its maximum of 1 at 0°. The cosine similarity is defined as
follows:

$$\text{sim}_{Cosine}(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\| \cdot \|\vec{y}\|} = \frac{\sum_{j=1}^{m} \vec{x}_j \cdot \vec{y}_j}{\sqrt{\sum_{j=1}^{m} \vec{x}_j^2} \cdot \sqrt{\sum_{j=1}^{m} \vec{y}_j^2}} \qquad (2.8)$$

     Given two sets of attribute words A and B which define the attribute dimension (e.g.
     wordset A contains career words and B family-related words), we can determine
     the association of a single word w from the target wordset (e.g. the word "woman"
     from the target "female gender") in that dimension using the cosine similarity as
     follows:

$$s(w, A, B) = \text{mean}_{a \in A}\, \text{sim}_{Cosine}(\vec{w}, \vec{a}) - \text{mean}_{b \in B}\, \text{sim}_{Cosine}(\vec{w}, \vec{b}) \qquad (2.9)$$

Here $\vec{w}$, $\vec{a}$, and $\vec{b}$ represent the corresponding vectors of the words w, a, and b in the
word embedding. s(w, A, B) ranges from +2 to -2, depending on whether w is more
associated with A or with B. The WEAT thus aggregates all these associations in the
attribute dimension for each word in the two target sets X (e.g., "male gender") and
Y (e.g., "female gender") of target words to compute the differential association as
follows:

$$s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B) \qquad (2.10)$$

Since the number of words in the target sets X and Y can vary, it only roughly holds that
s(X, Y, A, B) > 0 indicates an association of X with A and of Y with B. Similarly,
s(X, Y, A, B) < 0 indicates that X is associated with B and Y is associated with A.
The larger the magnitude of these values, the more pronounced both associations
are simultaneously.
In practice, as with the IAT, only the effect size in the value range between +2
and -2 is used instead, which is a normalized measure of the separation of the two
distributions, defined as follows:

$$d(X, Y, A, B) = \frac{\text{mean}_{x \in X}\, s(x, A, B) - \text{mean}_{y \in Y}\, s(y, A, B)}{\text{std-dev}_{w \in X \cup Y}\, s(w, A, B)} \qquad (2.11)$$

An example where the WEAT measures a positive value, and therefore a social bias,
would be a word embedding model trained on texts that very often report about
men in combination with career topics and rarely in connection with family topics,
while at the same time mentioning women rather with family topics and rarely
with career topics.
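The following is a minimal NumPy sketch of the WEAT effect size d as defined above, assuming a dictionary mapping words to their vectors; the wordsets and the random stand-in vectors are illustrative:

```python
import numpy as np

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def s(w, A, B, vec):
    # association of word w with attribute sets A vs. B (Eq. 2.9)
    return (np.mean([cosine(vec[w], vec[a]) for a in A])
            - np.mean([cosine(vec[w], vec[b]) for b in B]))

def weat_effect_size(X, Y, A, B, vec):
    # normalized differential association of targets X, Y with attributes A, B (Eq. 2.11)
    s_X = [s(x, A, B, vec) for x in X]
    s_Y = [s(y, A, B, vec) for y in Y]
    s_all = [s(w, A, B, vec) for w in X + Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_all)

# Illustrative usage with random vectors standing in for a trained model:
rng = np.random.default_rng(0)
words = ["man", "woman", "career", "office", "family", "home"]
vec = {w: rng.normal(size=300) for w in words}
print(weat_effect_size(["man"], ["woman"], ["career", "office"], ["family", "home"], vec))
```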

3 Approach and Implementation
We divide our work into four successive goals, for which we will take a closer look
at our approach and implementation in this chapter. First, we explain our method
for collecting political news articles. Then we describe how we preprocess these
articles to create the text corpora for training word embedding models. The third
step covers the training of the word embedding models with Word2Vec. Finally, we
present the setup of our WEAT experiments and how we perform them.

3.1 Collecting News Articles

Our first goal was to collect political news articles from U.S. media outlets for our
datasets. AllSides categorizes the political media bias of news outlets along the
political spectrum into five main categories: Left, Lean Left, Center, Lean Right, and
Right. Since we wanted to quantify social biases both in news articles grouped by these political
media biases and in news articles from individual news outlets, we decided to
collect political news articles from at least five outlets for each political media bias.
In an earlier paper, W.-F. Chen, Wachsmuth, et al. (2018) already created a publicly
accessible dataset1 of 6 459 political U.S. news articles by crawling the archive
of the AllSides website. However, this dataset is not large enough to measure the
development of bias over the past ten years, because the number of articles per
year would not suffice to train word embeddings that contain enough of the words
from the wordsets used in the bias tests.

1 https://webis.de/data/webis-bias-flipper-18 (Accessed on September 3, 2020)
Therefore, we created a new dataset by directly crawling the websites of several
popular U.S. news agencies along the political spectrum. First, we
limited our selection of candidates from each bias category of the AllSides ranking
to those news agencies whose political media bias has been rated by AllSides with
a high level of confidence and with which their community agrees. By doing this,
we wanted to ensure that the rating of the chosen outlets is robust, meaning that
it does not change frequently, and is accepted by both AllSides and the community.
The final selection of news outlets was then based on the popularity of the website
and whether we could actually crawl it.
On the one hand, we prioritized more popular websites because they reach a bigger
audience. We determined the popularity of a website via the Alexa ranking2, which
is a combined indicator of how much estimated traffic and visitor engagement
a website receives.
On the other hand, we assessed the feasibility of crawling each website, taking into
account obstructive paywalls and the robots exclusion standard3. We assessed the
feasibility by making individual attempts to crawl and parse the website. In doing so,
we paid attention to whether the website blocked us after a few requests and urged us
to solve a captcha, and whether we could read the articles on the website correctly at
all without being prevented by a pop-up or a login request. In addition, we checked the
robots.txt file, which defines rules for web crawlers according to the robots exclusion
standard, such as which directories are not allowed to be crawled or the
minimum crawl delay.
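For illustration, a robots.txt file following this standard might look like the following; the contents are hypothetical and not taken from any of the crawled outlets:

```
# Hypothetical robots.txt of a news website
User-agent: *
Disallow: /admin/
Disallow: /search/
Crawl-delay: 10
```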

To handle the sheer number of selected media outlets in a timely manner, we opted
for a distributed system approach. The management of the collected articles was
done via a MySQL database server. The use of such a database guarantees data
integrity and consistency, among other valuable properties, during all transactions.
The process of collecting the articles was divided into two steps, each of which is
handled by one of two applications. Several instances of these two applications,
the web crawler and the article parser, could access the database in parallel. The
achieved benefits are rapid scaling, as another instance could be added at any time,
and true concurrency, because all instances could operate simultaneously and independently.

     3.1.1 Web Crawler

     The first of the two applications is the web crawler, which collects URLs to news
     articles for our selected outlets and saves them in the database.
When the web crawler is started, it first chooses one of the news outlets and retrieves
the necessary information from the database, namely the strategy by which the web
crawler should search for URLs and the annotated base URL of the news outlet's website.
These annotations in the URL are explained in more detail with the
respective strategies. In order to collect all available and accessible articles from the
last decade, we decided to crawl every single day backwards through time, starting
from New Year's Eve 2020. This way, the web crawler knew at which publication
date it should search for articles and when it had finished crawling an outlet. Moreover,
in case the web crawler was interrupted, for example due to a loss of the internet
connection, the last crawled date for a particular outlet could be looked up in the
database, so that crawling could be resumed at the same date.

2 https://www.alexa.com (Accessed on November 15, 2020)
3 https://www.robotstxt.org/ (Accessed on April 4, 2020)

After examining the websites of all the outlets, we decided on three strategies by
which each outlet would be crawled:

News Archive The most effective strategy was to crawl the news archive of the
     outlet. We limited ourselves to archives that had a simple URL structure
     consisting only of a date and page numbering. This structure can be
     exploited by using appropriate annotations in the base URL to parametrize it,
     like https://thefederalist.com/{YYYY}/{MM}/page/{P}. The web crawler
     then formatted the annotations YYYY-MM-DD and P in the curly brackets with
     the appropriate date and page number and iterated through the dates and
     pages to collect all links (see the sketch after this list). Because we did not
     want to put too much load on the servers of the news outlets, we waited 10 to
     15 seconds between scraping each page.

Date Scheme Some news agencies did not provide a crawlable archive of all their
     published articles on their website. But at least we could parametrize
     the date in the URL and search via Google indexing for the sites
     that start with the same base URL. A typical Google site query then looked
     like site:cnn.com/{YYYY}/{MM}/{DD}/politics, where we formatted the
     annotations as explained for the news archive strategy. We always collected
     50 links at once per Google result page but had to wait around a minute each
     time so as not to risk the IP address of the web crawler getting blocked
     for a couple of hours.

No Scheme For a few news outlets, we did not find an archive and there was no way
     to parametrize the base URL, as with apnews.com/article/. In such
     cases, we extended our site query with Google's advanced search commands
     after:YYYY-MM-DD and before:YYYY-MM-DD to narrow down the results to a
     specific date range. We also tried to apply these date range parameters to the
     date scheme strategy, but the number of results decreased instead of
     increasing. With this strategy, we also had to wait about a minute between each
     Google query.
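As referenced above, a minimal sketch of the news archive strategy could look like the following; the base URL annotation comes from the example above, while the date range, page limit, and request handling are simplified assumptions:

```python
import time
import random
from datetime import date, timedelta

# Annotated base URL from the example above.
BASE_URL = "https://thefederalist.com/{YYYY}/{MM}/page/{P}"

def format_archive_url(day: date, page: int) -> str:
    """Fill the {YYYY}, {MM}, and {P} annotations of the base URL."""
    return (BASE_URL.replace("{YYYY}", f"{day.year:04d}")
                    .replace("{MM}", f"{day.month:02d}")
                    .replace("{P}", str(page)))

def crawl_archive(start: date, end: date, max_pages: int = 5):
    """Walk backwards through time from `end` to `start`, page by page."""
    day = end
    while day >= start:
        for page in range(1, max_pages + 1):
            url = format_archive_url(day, page)
            print("would scrape:", url)          # fetch the page and extract links here
            time.sleep(random.uniform(10, 15))   # polite delay between pages
        day -= timedelta(days=1)

# e.g. crawl_archive(date(2010, 1, 1), date(2020, 12, 31))
```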

To further minimize the risk of our IP address being blocked by the archive websites
of the outlets, we generated a new user agent for each request. The user agent is a
string that contains information about the browser or operating system and
could make the web crawler recognizable. In addition, we also varied the wait times
of each strategy a bit to be less noticeable as a web crawler and to simulate a human
user.
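A minimal sketch of such a disguised request is shown below; the user agent pool and the delay bounds are illustrative assumptions:

```python
import time
import random
import requests

# Illustrative pool; the actual crawler generated a fresh user agent per request.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]

def polite_get(url: str) -> requests.Response:
    """Fetch a URL with a randomized user agent and a varied wait time."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers, timeout=30)
    time.sleep(random.uniform(10, 15))  # varied delay between requests
    return response
```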

Once the web crawler found URLs, we filtered them for predefined substrings like /author/
or /gallery/ before inserting them into the database. This is because such URLs
did not lead to news articles, but to other pages, in this example to
the profile of an author of some of the articles or to a photo gallery. Where possible,
we already kept substrings like /politics/ in the base URL used to crawl the outlet,
because then the web crawler already yielded only articles of the politics category.
At the same time, we could filter out non-politics categories by adding substrings like
/sports/ and /entertainment/ to the predefined filter substrings.
Finally, the URLs that did not contain any of the filtering substrings were added to the
database and could then be accessed by the article parser.
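A sketch of this filtering step follows; the blocklist and example URLs are illustrative:

```python
# Illustrative filter substrings; the actual list grew iteratively during crawling.
FILTER_SUBSTRINGS = ["/author/", "/gallery/", "/sports/", "/entertainment/"]

def is_article_url(url: str) -> bool:
    """Keep only URLs that contain none of the predefined filter substrings."""
    return not any(substring in url for substring in FILTER_SUBSTRINGS)

urls = [
    "https://example-outlet.com/politics/2020/11/03/election-results",
    "https://example-outlet.com/author/jane-doe",
]
print([u for u in urls if is_article_url(u)])  # only the politics article remains
```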

     3.1.2 Article Parser

The second application is the article parser, which fetched the crawled URLs and
extracted the article content without the HTML boilerplate from the websites with
the help of the Mercury Parser4. The Mercury Parser is the backend of the Mercury
Reader browser extension by the Postlight company, which is used to declutter
webpages for distraction-free reading.
First, the article parser fetched the collected URLs from the database and again filtered
out those that contain any of the predefined substrings. We performed this step again
because both applications were developed in an iterative process, and during the
crawling we were picking up additional substrings we wanted to filter for. After that,
we removed certain predefined URL components, such as #comments-container in
www.americanthinker.com/blog/YYYY/MM/example.html#comments-container.
This is, for instance, necessary if a comment section is included in the HTML while the
actual content of the article is collapsed, so that the Mercury Parser would not be
able to extract it.
For some news outlets, the URL did not indicate whether it referenced a political article, such
as by a /politics/ substring, nor did the news outlet's archive have a specific politics
section. This means that during the parsing process we did not know whether a
URL actually referenced a political article. To solve this issue, we decided to
parse these websites anyway and at the same time searched for certain elements in
the HTML that indicate the category of the page. Such HTML elements are, for
example, breadcrumbs, which represent the website navigation of the news outlet
and can contain words similar to the URL substrings, like politics but also sports and
entertainment. We saved the contents of these HTML elements as keywords together
with the article in our database, so that we could later pick out the political articles
based on them.

4 https://github.com/postlight/mercury-parser (Accessed on September 3, 2020)
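A sketch of extracting such breadcrumb keywords with BeautifulSoup follows; the HTML snippet and the CSS selector are hypothetical examples, as real outlets differ:

```python
from bs4 import BeautifulSoup

html = """
<nav class="breadcrumbs">
  <a href="/">Home</a> &gt; <a href="/politics/">Politics</a> &gt; <span>Elections</span>
</nav>
"""

soup = BeautifulSoup(html, "html.parser")
breadcrumbs = soup.select_one("nav.breadcrumbs")  # hypothetical selector
keywords = [part.get_text(strip=True).lower()
            for part in breadcrumbs.find_all(["a", "span"])]
print(keywords)  # ['home', 'politics', 'elections'] -- stored with the article
```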

Since each website accepts requests independently of the others, the article parser could parse articles from several different news outlets in parallel. To avoid unnecessary traffic on the news outlets' servers, or even being blocked, the article parser waited 10 to 15 seconds between requests to the same outlet and again disguised itself with different user agents. For our approach, we let the article parser always select as many different outlets as there were CPU cores available in the system. At the same time, we reduced the number of database accesses by always fetching multiple crawled URLs at once and also inserting the parsed articles as a batch. The idea behind this was to minimize the access time to the database, because otherwise the lock on the respective tables would be withheld from the other application instances for an inefficiently long time.
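The batching idea can be sketched with SQLite as follows; which database system the implementation actually used is not specified here, so the schema and calls are purely illustrative:

```python
import sqlite3

# Illustrative schema; the actual database backend and schema may differ.
conn = sqlite3.connect("articles.db")
conn.execute("CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, text TEXT)")

parsed_batch = [
    ("https://example.com/politics/a", "First article text ..."),
    ("https://example.com/politics/b", "Second article text ..."),
]
# One batched insert acquires the table lock once instead of once per article.
conn.executemany("INSERT OR IGNORE INTO articles (url, text) VALUES (?, ?)", parsed_batch)
conn.commit()
```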
For this work, we had up to ten web crawlers and article parsers running simultaneously and ended up collecting political articles for 48 different news outlets. Of these, we selected for our analysis in the following chapters the 25 news outlets with the highest yield, as shown in Table 3.1. We ended up with 108 815 articles for left media, 108 815 for lean left, 102 782 for center, 110 825 for lean right and 104 792 articles for right media.

                                               100 000s of word tokens per year
 Political Bias  Outlet                   Σ   2020  2019  2018  2017  2016  2015  2014  2013  2012  2011  2010
 left            AlterNet               307     65    23    30    29    29    34    37    26    31     3     0
                 Daily Beast            325     43    40    43    50    52    45    43    26     1     0     0
                 The Nation             189     20    17    18    21    22    16    13    15    20    14    13
                 The New Yorker         141     15    16    16    14    14    12    10    12    14    11     7
                 Slate                   94     14     9    11    11    11     8     8     6     6     5     5
 lean left       CNN                    449     80    82    76    72    55    38    15    12    16     3     0
                 The Atlantic           392     12    19    26    35    39    60    42    33    33    13    80
                 The Economist          282     72    22    21    21    22    24    20    19    19    20    22
                 The Guardian           268     14    34    22    26    28    28    24    19    19    20    34
                 The New York Times     642    102    92    83    50    26    26    23    23    22    27   168
 center          Associated Press       481    170    92    82    74    48     6     5     3     0     0     1
                 AZ Central             372     75    57     0    23    56    63    98     0     0     0     0
                 Factcheck               57     10     6     5     5     6     4     4     3     5     4     5
                 Heavy                  295     45    58    56    53    35    13    15    15     5     0     0
                 USA Today              166     25    28    27    27    13    15    13    14     4     0     0
 lean right      Fox News               165     79    39    16     7     6    18     0     0     0     0     0
                 New York Post          847     91    87    82    79    68    63    69    66    80    83    79
                 Reason                 207     20    20    18    13    17    17    15    15    26    24    22
                 The Press-Enterprise   207     36    35    35    33    17    16    10     8    10     5     2
                 The Washington Times   482    146     8     8     7     9    10   138    39    46    39    32
 right           American Thinker       607     31    44    46    52    53    54    61    67    70    69    60
                 Breitbart              471     66    68    55    52    70    69    29    21    20    11    10
                 Newsbuster             660     55    58    52    67    68    63    55    59    60    60    63
                 The Daily Caller        73      9     5     7    11     8    12     6     4     4     3     4
                 The Federalist         242     42    40    34    38    33    31    20     4     0     0     0

Tab. 3.1: The total number of word tokens in the corpora of each crawled outlet per year. The numbers of word tokens are rounded to multiples of a hundred thousand. Values lower than four are italicized.

3.2 Preprocessing the Data

Our second goal was to prepare the text corpora for training the word embedding models. To serve as input for our training method, the articles have to be read sentence by sentence and must consist only of those words for which word embeddings are to be generated.
Our preprocessing pipeline started by fetching the political articles from the database for each outlet and each year. We aggregated the articles from the different crawling approaches we had tried on the same news outlet and filtered out the articles without political keywords. This was necessary because we sometimes tried multiple approaches to crawl an outlet, or crawled different sections of an archive, such as foreign and domestic politics. To filter out the non-political categories, we compiled lists of political keywords like election and taxes for each news outlet after the article parser had finished and we could inspect all articles.
Since we planned to tokenize the articles, we first removed URLs from the text with a regex5, as URLs do not result in meaningful word tokens for our word embeddings.
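For illustration, such a removal step could look like the following sketch; the pattern is deliberately simplified, while the actual regex came from urlregex.com (see footnote 5):

```python
import re

# Simplified pattern for illustration; the actual regex is much more thorough.
URL_RE = re.compile(r"https?://\S+")

text = URL_RE.sub("", "Read the full report at https://example.com/report today.")
# -> 'Read the full report at  today.'
```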
In the next step we performed sentence splitting using the PunktSentenceTokenizer
from the Natural Language Toolkit6 (NLTK).
Then the preprocessor expanded typical English contractions, such as I'd, which becomes I would or I had. This step served only to standardize the tokens; we did not worry about choosing the grammatically correct expansion, since these words have such a high term frequency that they appeared in almost every training sentence anyway.
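Schematically, the expansion can be a simple lookup; the mapping below is a tiny illustrative excerpt, and a fixed expansion is chosen even for ambiguous contractions:

```python
# Tiny illustrative excerpt of a contraction mapping.
CONTRACTIONS = {"I'd": "I would", "don't": "do not", "it's": "it is"}

def expand_contractions(sentence: str) -> str:
    for contraction, expansion in CONTRACTIONS.items():
        sentence = sentence.replace(contraction, expansion)
    return sentence

expand_contractions("I'd say it's fine")  # -> 'I would say it is fine'
```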
Before we started tokenizing, we transliterated Unicode symbols, as in Beyoncé, to the closest ASCII representation, here Beyonce, to further unify our tokens. Otherwise, tokens whose term frequency is lowered by inconsistent spellings would be ignored in the upcoming training process with Word2Vec.
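One way to perform such a transliteration in Python is the unidecode package; whether the preprocessor used this exact library is an assumption:

```python
from unidecode import unidecode

unidecode("Beyoncé")      # -> 'Beyonce'
unidecode("naïve café")   # -> 'naive cafe'
```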
Next, the preprocessor tokenized the sentences and applied part-of-speech (POS) tagging with the Penn Treebank (PTB) tagset by Marcus et al. (1993). We then mapped the POS tags to four syntactic categories, nouns, verbs, adjectives and adverbs, so that we could use the WordNet lemmatizer, which yields the context-sensitive lemma for each token (Miller et al., 1990). All lemmas were reassembled into their original sentences without the POS tags.
In the last steps, the preprocessor removed all symbol tokens, such as quotes and dots, as well as all excess whitespace, and finally converted the remaining word tokens to lower case. In the following, we will sometimes refer to the preprocessed articles of one outlet from one year as a text corpus.
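The following sketch consolidates the tokenization, POS tagging, and lemmatization steps with NLTK; it omits the URL removal, contraction expansion, and transliteration described above, and the exact implementation details are assumptions:

```python
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize

# Requires the NLTK data packages 'punkt', 'averaged_perceptron_tagger' and 'wordnet'.
lemmatizer = WordNetLemmatizer()

def ptb_to_wordnet(tag: str) -> str:
    # Map PTB tags to the four WordNet categories; default to noun.
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("RB"):
        return wordnet.ADV
    return wordnet.NOUN

def preprocess(article: str) -> list:
    sentences = []
    for sentence in sent_tokenize(article):        # Punkt sentence splitting
        tagged = pos_tag(word_tokenize(sentence))  # PTB POS tagging
        lemmas = [
            lemmatizer.lemmatize(token, ptb_to_wordnet(tag)).lower()
            for token, tag in tagged
            if any(ch.isalnum() for ch in token)   # drop pure symbol tokens
        ]
        sentences.append(lemmas)
    return sentences

preprocess("The senators were debating. They voted quickly.")
# -> [['the', 'senator', 'be', 'debate'], ['they', 'vote', 'quickly']]
```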

 5
     The regex is taken from https://urlregex.com/ (Accessed on December 18, 2020)
 6
     https://www.nltk.org (Accessed on November 3, 2020)

3.3 Training the Word Embeddings

The next goal was to train the word embedding models using Word2Vec on the preprocessed articles. We used the Word2Vec implementation of the Gensim7 library by Řehůřek and Sojka (2010).
To investigate the development of social bias, we wanted to measure social bias in as many years as possible. At the same time, some text corpora were very small, which meant that not enough words from the wordsets appeared in them and the WEAT results varied too much with the choice of wordsets. So, in order to perform WEAT tests in enough years while excluding the very small datasets, we required a minimum of 350 000 words to train a word embedding on a text corpus.
     Since we wanted to measure both the social bias in each news outlet and for each
     political media bias, we trained two kinds of word embedding models:

     Outlet Models For all of the 25 news media outlets, we trained a word embedding
          model for each year with a large enough corpus.

Media Bias Models These models combine the text corpora of the outlets with the same political media bias. Therefore, for each year there are at most five media bias models, one for each of the political media biases left, lean left, center, lean right and right. Since not all text corpora were large enough, we trained a media bias model for a year only if at least four of its text corpora were sufficiently large.

     Our word embedding models are trained with a window size of five words between
     the current and the predicted word, and four training epochs.
The original authors of Word2Vec, Mikolov, K. Chen, et al. (2013), present the training algorithm CBOW as faster than skip-gram, but note that the resulting word embeddings are of inferior quality. İrsoy et al. (2020) point out that this initial assessment led to further misconceptions about the performance of CBOW, since the popular implementation of CBOW in the Gensim library is incorrect. Rather than using the flawed implementation in Gensim, or investing additional effort into the improved implementation of İrsoy et al. (2020), we simply decided to use skip-gram.
For the number of epochs, we followed the successful results of Mikolov, Sutskever, et al. (2013), who trained a word embedding model on a Google News8 dataset of about 100 billion words. Their model took only two to four epochs to converge, which is why we also chose four epochs for our models.
      7
          https://radimrehurek.com/gensim (Accessed on March 7, 2021)
      8
          The model is available at https://code.google.com/archive/p/word2vec/ (Accessed on March
           7, 2021)

Mikolov, Sutskever, et al. (2013) state that more epochs can yield slightly stronger vectors, but that ultimately the quality of the word embedding model depends on the size of the corpus. Our corpora vary in size, with the largest ones in 2020 and gradually decreasing sizes in earlier years. Obviously, the quality of the models on the smaller datasets, and thus their WEAT results, can be questioned. However, since we rarely observe unusual developments of social bias, and since we already counteract the problem by using larger wordsets for WEAT, most bias results for the smaller datasets do not appear to be outliers.
For the vector size, we followed the findings of Pennington et al. (2014), who show that the accuracy of word vectors on an analogy test stagnates beyond 300 dimensions.
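A minimal training call with Gensim, reflecting the parameters above (skip-gram, window of five, 300 dimensions, four epochs), could look like the following sketch; parameter names follow the Gensim 4.x API, and the toy corpus and min_count value are illustrative:

```python
from gensim.models import Word2Vec

# Toy corpus; in our setup, this is the list of preprocessed sentences of one text corpus.
sentences = [
    ["senator", "propose", "new", "tax", "bill"],
    ["voter", "support", "election", "reform"],
] * 100

# sg=1 selects skip-gram; Gensim 3.x names the parameters size/iter instead.
model = Word2Vec(sentences, vector_size=300, window=5, sg=1, epochs=4, min_count=1)
vector = model.wv["election"]  # the 300-dimensional embedding of a word
```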

3.4 Quantifying Social Biases

The next goal was to measure various social biases on the word embedding models using the WEAT metric by Caliskan et al. (2017). We conducted a total of six different experiments on all word embedding models to measure exemplary stereotypes and sentiment toward social groups. We address four social biases, namely gender, religion, ethnicity, and age. Many more could be tested, but that would go beyond the scope of this thesis. The individual experiments and the wordsets we used for each of the two targets and attributes are listed in Table 3.2. We used the WEAT implementation of the open-source Word Embedding Fairness Evaluation (WEFE) framework by Badilla et al. (2020).
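Schematically, a single WEAT experiment with WEFE then looks roughly like the sketch below; the toy corpus and the tiny wordsets are illustrative only, as the actual wordsets were considerably larger:

```python
from gensim.models import Word2Vec
from wefe.metrics import WEAT
from wefe.query import Query
from wefe.word_embedding_model import WordEmbeddingModel

# Toy model purely for illustration; in our setup, the models from Section 3.3 are used.
sentences = [["john", "executive", "career"], ["amy", "home", "family"]] * 100
w2v = Word2Vec(sentences, vector_size=50, window=5, sg=1, epochs=4, min_count=1)

model = WordEmbeddingModel(w2v.wv, "toy model")
query = Query(
    [["john"], ["amy"]],                            # targets X and Y
    [["executive", "career"], ["home", "family"]],  # attributes A and B
    ["Male names", "Female names"],
    ["Career", "Family"],
)
result = WEAT().run_query(query, model)
print(result)  # a dict that contains the WEAT score, among other values
```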
Since our text corpora have very different sizes, and we want to avoid the WEAT result varying greatly with the choice of a few words in our wordsets, we try to compensate for this problem by using larger wordsets. At the same time, for larger wordsets, we also need to lower the threshold for the minimum number of words

    Social Bias   Target X               Target Y               Attribute A   Attribute B
    Gender        male names             female names           career        family

    Religion      Christianity           Islam                  pleasant      unpleasant

    Ethnicity     European-Amer. names   African-Amer. names    pleasant      unpleasant
                  European-Amer. names   Chinese names          pleasant      unpleasant
                  European-Amer. names   Hispanic names         pleasant      unpleasant

    Age           young people’s names   old people’s names     pleasant      unpleasant

Tab. 3.2: The experiments we perform with the WEAT metric. Each one can be mapped to a particular social bias and consists of four different word lists, as described in Section 2.5.2.
