INVESTIGATING GENDER STEREOTYPES IN THE MEDIA: DIVA

Linköping University | Department of Management and Engineering
                                               Master’s thesis, 30 credits| Master’s programme
                                                      Spring 2021| LIU-IEI-FIL-A--21/03695--SE

Investigating gender
stereotypes in the media:
A Natural Language Processing approach to understanding
gender disparities in the reporting of football.

Isabel Pereira Fernandez

Supervisor: Miriam Hurtado Bodell
Examiner: Erik Rosenqvist

                                                                          Linköping University
                                                                 SE-581 83 Linköping, Sweden
                                                                 +46 013 28 10 00, www.liu.se
Contents

1 Introduction

2 Literature Review
  2.1   Gender Stereotypes
  2.2   Media’s role in gender stereotypes
        2.2.1   Semantic Differences
        2.2.2   Syntactical Differences
  2.3   History of Women’s Football in the UK

3 Data & Methods
  3.1   Data
  3.2   Seeded Topic Modelling
        3.2.1   Jensen–Shannon Divergence
        3.2.2   Two Sample Kolmogorov–Smirnov Test
  3.3   POS Tagging
        3.3.1   Mann–Whitney U test
        3.3.2   Top Word Analysis
  3.4   Bonferroni Correction

4 Results
  4.1   Semantic Gender Differences
  4.2   Syntactic Gender Differences

5 Discussion

6 Conclusion

A List of Words Used for Categorization

B List of Seed Words Used

C Seeded Topic Models Robustness Test

D Full list of topics and top words

E Keyword Robustness Test
  E.1 POS Tag Results
  E.2 Seeded Topic Model Results

References

List of Figures

  1   Number of Articles in the Dataset Over the Years
  2   Intuition behind LDA model from Blei (2012)
  3   How to Calculate the Two Sample Kolmogorov–Smirnov Test Statistic
  4   Jensen–Shannon Divergence For Each Topic For Different Sized Bins
  5   Average Jensen–Shannon Divergence For Each Topic Over Time
  6   Adjusted P-values for the Two Sample Kolmogorov–Smirnov Test per Topic Over Time
  7   Overview of results for the POS tag analysis on nouns and pronouns
  8   Overview of results for the POS tag analysis on adjectives and adverbs
  9   Overview of results for the POS tag analysis on verbs

List of Tables

  1   Most Common Topics For Men’s Football Articles
  2   Most Common Topics For Women’s Football Articles
  3   Results for the Two Sample Kolmogorov–Smirnov Test
  4   Results for Mann–Whitney U test on POS tag ratios
  5   Most Common Unique Words in Category For Each Tag
  6   Percentage of Words in the Top 500 That Are Different For Each Tag

Abstract

    Sports can be an important factor in defining gender identity. However,
sports are generally perceived as a masculine activity, especially when they
are highly physical. In turn, this negatively impacts women who want to partake
in such activities. The most widely watched sport that is perceived to be
masculine is football, which reaches billions of people across the world. Since the
media is the main source of information for thousands of people who follow
football, it is important to understand what part the media plays in reproducing
gender stereotypes. The aim of this research is to investigate this phenomenon
by answering the following research question: In what ways does the
media reproduce gender stereotypes when reporting on football? To do that,
all articles from the Football section of the British newspaper The Guardian
published between 2002 and 2020 were collected. The analysis is divided into
two parts: semantic and syntactical differences. First, a seeded topic model is
used to investigate whether the media focuses on different aspects of the sport
depending on which gender it reports on. Second, a POS tag analysis is
conducted to examine whether the media employs different syntax in its coverage of
men’s and women’s football. This is the first large-scale longitudinal study to
examine gender differences in media reporting on sports, as well as one of
the few to use machine learning to analyse gender stereotypes. Findings indicate
that both semantic and syntactical differences are prevalent in the reporting.
More specifically, results demonstrate that there is a greater focus on female
footballers’ personal life, whereas for male football players the spotlight is
on their performances and accomplishments on the pitch. Furthermore, the
syntactical analysis indicates that the media uses gendered language more of-
ten when reporting on women’s football, and utilizes action-packed language
when covering men’s football. In both semantic and syntactic aspects, the
longitudinal analysis demonstrates that the differences are diminishing over
time.
1     Introduction

Sports are an important factor in defining gender identities and stereotypes (Thorpe,
2010). From an early age, sports are seen as a male activity, especially when
they require masculine attributes such as physical strength, violence and risk-taking
(Musto, Cooky, & Messner, 2017; Thorpe, 2010). Therefore, women have a tendency
to participate in physical activities that are classified as woman-like and avoid those
that are perceived as more masculine. In England, for example, 67% of adult men
practice sports regularly, compared to 55% of women (World Health Organization,
2015). This suggests that men are indeed more inclined to practice sports.

    Football makes for an ideal candidate to study gender stereotypes in sports because
of its vast popularity and because it is perceived as a masculine activity. Football is a
contact sport which requires strength and aggression (Harris, 2005), and is there-
fore regarded as an activity more suited for men, stigmatizing women’s football. In
terms of the number of players, in England, the gap is even larger than that of ac-
tive adults mentioned above: only 24% of registered footballers are women, with the
remaining 76% being men (The Football Association, 2015). In general, women’s
football is less popular than men’s. However, over the last decades, women’s football
has received increased attention and gained popularity. The Women’s World Cup,
for example, saw viewership grow by 65% from 2015 to 2019, reaching over 1.2
billion viewers (FIFA.com, 2019). Despite the increase in popularity, it is still only
a fraction of the viewership of the men’s events. In 2018, the men’s World Cup final
alone amounted to the same number of views as the whole women’s tournament in
2019. In fact, the men’s version of the tournament in its entirety received over 3.5
billion views, matching the Olympic Games in terms of viewership (FIFA, 2018).
This is especially impressive since the Olympic Games are the most important sporting
event to take place, as they include more than 30 different sports with participants
from over 200 countries (Whannel, 1992). Put together, these characteristics make
for an interesting case study, as football is a male-typed activity with a large following
which has undergone a change in the gender composition within the sport in recent
years.

    Media outlets have an important role in influencing gender stereotypes and gen-
der roles. This is because they can frame a story as they wish, and since they
have a large reach, they can influence those who receive the story through their
lens (Altheide & Snow, 1992; Rogers, 2004). In other words, the way in which a
piece portrays a person impacts how the readers of said piece will perceive them.
It has been shown that articles covering male and female athletes tend to use dif-
ferent language and focus on different aspects of their performance (Eastman &
Billings, 2000; Messner, Duncan, & Jensen, 1993; Musto et al., 2017; Wensing &
Bruce, 2003). This can lead people to perceive men and women in sports differently.
Therefore, it is important to understand how the media portrays male and female
athletes. Based on that, the following research question arises:

    In what ways does the media reproduce gender stereotypes when reporting on
football?

   As mentioned above, previous literature has found that gender stereotypes are
reproduced by the media both through the topics covered and the language used.
In that sense, both semantics and syntax play a role in the differential coverage
of men’s and women’s football. As a result, two sub-questions are formulated:

   What are the semantic differences in the reporting on male and female footballers,
and do they change over time?

  To what extent does the syntax used by the media lead to different portrayals of
men and women in football, and how does this evolve over time?

   To answer these questions, data from the British newspaper The Guardian is
used. The dataset consists of all articles about football published by the newspaper
from 2002 to 2020. These articles are categorized by the gender they refer to using
keywords. Once the data has been appropriately categorized, two machine learning
techniques are used to obtain results. First, a seeded topic model is adopted to
examine the data semantically. Second, part-of-speech (POS) tagging is applied to
study the syntactical content of the data.

    The present study expands on previous literature in two ways. First, it examines
a period of 19 years of coverage without any interruption, something that has not yet
been done. This enables a longitudinal investigation of how gender stereotypes
in football evolve over the years. Second, it makes use of machine learning, which
previously has not been used to study gender differences in sports. This approach
also allows for a larger volume of data to be analyzed.

   This paper continues as follows: first, a literature review outlines previous
studies in the field and, based on that, derives six hypotheses to be investigated.
Then, the data and methods are described in more detail. After this, results are
presented and interpreted, followed by a discussion of the findings and the potential
shortcomings of this analysis.

2     Literature Review

2.1    Gender Stereotypes

Stereotypes were first defined by Lippmann (1922), who described them as illogical
and erroneous generalizations about social groups. Since
then, studies have demonstrated that this is not necessarily the case: stereotypes
can, in fact, be accurate (Judd & Park, 1993; Jussim, 1991). Therefore, stereotypes
are defined as generalizations of characteristics, attributes, or behaviours of
members of a given group (Hilton & Von Hippel, 1996). Stereotypes can be explicit
or implicit. Implicit stereotypes are beliefs about a group that are unconsciously
activated, whereas explicit stereotypes are a set of beliefs that a person consciously
associates with a given group (Greenwald & Banaji, 1995).

    Gender is one of the most prominent features of a person when it comes to
categorization and perception (Ito & Urland, 2003; Ellemers, 2018). To put it
differently, when a person is surrounded by strangers, gender is one of the main
attributes that will be used to categorize these people and obtain a first impression.
The inferences that are made when this categorization happens stem from gender
stereotypes. In the case of gender, these assumptions are made based on how men
and women are expected to behave. Gender creates such prominent stereotypes that
children at the age of 6 already display awareness of their existence (Bian, Leslie, &
Cimpian, 2017).

    Although some believe that gender differences stem from biological factors, this
has been proven to not be the case. Joel et al. (2015) find that there are no clear
differences between female and male brains that create such a distinguishable dif-
ferentiation as with gender expression. Moreover, a body of research demonstrates
that there are no, or only very small, gender differences when it comes to mathematics
and verbal skills (Lindberg, Hyde, Petersen, & Linn, 2010; Else-Quest, Hyde, &
Linn, 2010; Hyde & Linn, 1988; Leaper & Robnett, 2011), as well as leadership
ability (Eagly & Carli, 2003) and the display of emotions (Chaplin & Aldao, 2013;
Else-Quest, Higgins, Allison, & Morton, 2012). Instead, research shows that gender
and gender stereotypes are constructed over time, through socialization processes
(West & Zimmerman, 1987; Lorber, 1994; Bem, 1981).

    Research has also demonstrated that gender stereotypes have negative effects
both on men and women. Bian et al. (2017) find that young girls are less likely
than young boys to think other children of the same gender are smart, causing them
to avoid activities they perceive as requiring intelligence. This has significant effects on their
life outcomes: women are less likely to follow a career in science fields and those
who do are perceived as less talented by both men and women (Leslie, Cimpian,
Meyer, & Freeland, 2015). In fact, gender stereotypes negatively affect women’s
careers in general: Heilman (2012) shows that gender stereotypes lead to penalties
for women in the workplace, as their characteristics and skills are perceived as a poor
match for leadership positions. On top of that, these gender differences that stem
from stereotypes create a gender wage gap (Angelov, Johansson, & Lindahl, 2016;
Bertrand, Goldin, & Katz, 2010; Blau & Kahn, 2017). On the other hand, men
experience more pressure to conform to gender roles and prove their masculinity by
being strong, self-reliant and not displaying weakness (Prentice & Carranza, 2002;
Vandello & Bosson, 2013). As a consequence, men are more likely than women to
take risks (e.g.: smoking, horse-riding, rock climbing), in order to affirm their gender
identity (Byrnes, Miller, & Schafer, 1999).

2.2     Media’s role in gender stereotypes

There are two ways in which the media reproduces gender stereotypes. First, the
choice of topics to be covered by the media, the emphasis that is placed on them and
how they are framed can reinforce existing stereotypes. Second, the choice of words
used to report on an event, apart from the topical context, also has the potential
to replicate stereotypes.

2.2.1   Semantic Differences

Mass media has the ability to tell a narrative to millions of people, and therefore
to affect how they perceive a certain story, subject or individual. For this reason, the
way in which it chooses to frame a given situation can impact how people view
it. Framing can be defined as the “conceptual tools which media and individuals
rely on to convey, interpret and evaluate information” (Neuman, Just, & Crigler,
1992, p. 60). To put it differently, framing refers to how individuals or groups give
meaning to an issue or situation (Entman, 1993; Goffman, 1974; Rogers, 2004). In
the case of the media, framing has the power to change the way in which readers
perceive said issue or situation. It has been shown that, in politics, reporters use
different framing based on the gender of the candidate they are reporting on, where
similar activities led to slightly different portrayals for male and female candidates
(Bystrom, Robertson, & Banwart, 2001; Carlin & Winfrey, 2009; Devitt, 2002;
Gidengil & Everitt, 2003; Kittilson & Fridkin, 2008). Moreover, when it comes to
sports, it has been demonstrated that the framing of events shown on television
influences how viewers perceive them (Altheide & Snow, 1992). It is therefore clear
that attention should be paid to how the media chooses to frame female and male
athletes, both because this likely has an impact on how society perceives them and
because gender has been shown to be relevant to framing in other
areas.

    An extensive body of literature has demonstrated that there are large differences
in the content of the reporting on male and female athletes. There is a greater
focus on the physical appearance of women than that of men, meaning articles
disproportionately comment on the physique and image of women when compared
to men (Kim & Sagas, 2014; Messner, Duncan, & Cooky, 2003). It has also been
found that news sources make greater use of pictures when they report on women’s
sports (Eastman & Billings, 2000; Sainz-de Baranda, Adá-Lameiras, & Blanco-Ruiz,
2020). This suggests that they use photos to get the readers’ attention, shifting the
spotlight away from female athletes’ sporting ability and achievements, and placing
it instead on their appearance. This is detrimental to female athletes’ image, as it
shifts attention from their athletic ability to their body image, undermining
their status as athletes. Research shows that sexualized images of
female athletes generate a larger focus on their physical appearance and lead to
them being perceived in a similar way to models, stripping them of their sporting
abilities (Daniels, 2009; Daniels & Wartena, 2011).

    Another difference is the depiction of female athletes as women first, athletes
second. More specifically, when referring to women, the media has a greater focus on
non-sport related aspects of the athlete’s life than when reporting on men. Non-sport
related aspects range from family and dating life to fashion preferences and travelling
(Eastman & Billings, 2000; Messner et al., 1993, 2003; Wensing & Bruce, 2003).
Similar to sexualization, this behaviour draws the focus away from the performance
of women in sports, concentrating instead on their personal lives. This could make
readers give less importance to the ability of female athletes as sportswomen and
instead become interested in them as people, which is detrimental to their careers.

    Differential framing by the media has also been recognized when describing fail-
ure. When female athletes fail, there are two main reasons that are used to explain
their failure: their lack of commitment, with athletes described as not wanting to
win or not trying hard enough to do so (Angelini, MacArthur, & Billings, 2012;
Eastman & Billings, 2000; Messner, Duncan, & Wachs, 1996); and emotional diffi-
culties, such as lack of confidence (Duncan & Hasbrook, 1988; Messner et al., 1996).
On the other hand, when it comes to reporting on male athletes’ failures, the focus is
different. Research finds that in this case the media attributes losses and mistakes to
the lack of athletic skills of a player (Eastman & Billings, 2000), the elevated quality
of their opponent (Angelini et al., 2012; Messner et al., 1996) or to external factors
such as bad luck (Duncan & Hasbrook, 1988). Overall, despite some variations in
the findings, there are clear differences in the reporting of the media when it comes
to why athletes fail.

    Put together, these discrepancies suggest that the media tends to focus on male
athletes and their achievements within their sports, whereas with women there is a
greater focus on their personal life, their physical appearance and their emotional
state. This leads to the first hypothesis:

   H1: Topic distributions of male and female articles will be different

2.2.2   Syntactical Differences

Another way gender stereotypes are reproduced is through language. Lewis and
Lupyan (2020) demonstrate that gender stereotypes arise, in part, from gendered
language. Gendered language entails gender-specific titles (e.g.: ‘Mr.’, ‘Miss’), pronouns,
job titles (e.g.: ‘host’, ‘hostess’; ‘actor’, ‘actress’) and other words and phrases
that identify the gender of a subject. Moreover, Garg et al. (2018) investigate word
embeddings over time and how they compare to demographic data from the US
over the same time period. Their results demonstrate that occupational biases
and adjective associations highly correlate with the demographic data. Similarly,
Charlesworth et al. (2021) use word embeddings on a variety of corpora, totalling
over 65 million words, and find that stereotypical personality traits and occupations
can be identified across all their corpora. Put together, these results suggest that
syntactical structure has the ability to convey gender stereotypes in an implicit way.

    When it comes to sport, research has indicated that there are gender differences
in the language used for the reporting. First of all, the media makes use of gender
marking only when reporting on women’s sports (Higgs, Weiller, & Martin, 2003;
Messner et al., 1993). In other words, the media specifies the gender of the athletes
only when they are women. This establishes men as the standard and suggests that
female athletes and women’s sports are of inferior quality (Messner et al., 1993). A
clear example of this is the official name given by the International Federation of
Association Football (FIFA) to the main competition they organize, the World Cup.
The men’s tournament is simply referred to as the World Cup, whereas the women’s
version is officially called the Women’s World Cup (FIFA, 2021b, 2021a). Moreover,
it has also been found that the media often refers to female athletes by their first
name, whereas the norm in sports, and the practice with their male counterparts, is
to address athletes by their last name (Fink, 2015; Messner et al., 1993). When female
athletes are not called by their name they are often referred to in general terms
such as ‘ladies’ and ‘little girls’, but that is not the case for the men, who are given
nicknames (Eastman & Billings, 2000; Wensing & Bruce, 2003). This is the case
both for individuals (e.g.: Ronaldo Nazário is dubbed The Phenomenon; Messi is
called ‘La Pulga’) and teams (e.g.: the Dutch national team is known as the Flying
Dutchmen, Liverpool F.C. is often referred to as The Reds). Put together, research
suggests that the language used by the media when reporting on women’s football
is more gendered than when reporting on men’s football. Therefore, it is expected
that more nouns and pronouns are used when reporting on the women’s game, as
that would reflect gendered language. This leads to hypothesis 2a:

    H2a: Articles on women’s football will have a higher ratio of nouns and pronouns
to other POS tags.

    Beyond the way in which sportspeople are referred to, the media also uses different
syntax when reporting on male and female athletes. Eastman and Billings
(2000) have found that less neutral and factual language, such as ‘dribbling past’
and ‘taking a shot’, is used when reporting on men’s sports. This gives way instead
to adjectival descriptors, for example ‘impressive dribbling’ or ‘powerful shot’. More
specifically, the achievements of men are reported in a descriptive manner, whereas
those of women are recounted in a factual manner. In other words, the accomplishments
of women tend to be depicted objectively: their actions are stated
as facts, without descriptive language, such as adjectives and adverbs, to embellish
them. That is not the case for men, whose performances are illustrated in more detail,
with descriptors that better situate and communicate their accomplishments. One
example of this can be observed in the play-by-play reporting of the most
recent World Cup finals, both of which featured a penalty at a moment when the match was
tied. In the men’s World Cup of 2018, the moment was reported as follows:

   ‘GOAL! France 2-1 Croatia (Griezmann 38pen) Antoine Griezmann
nonchalantly rolls the ball past Danijel Subasic into the bottom left-hand corner
and France re-take the lead. It’s the World Cup final and it’s 2-1 to the French.’
(Guardian, 2018)

    In 2019, during the Women’s World Cup, a very similar moment was reported
in the following manner:

   ‘GOAL: USA 1-0 Netherlands (Rapinoe 61) Struck well as always.’
(Guardian, 2019b)

These updates report on essentially the same situation; however, it is clear that
Griezmann’s penalty kick was described in more detail. Rapinoe’s shot had only one
descriptor, showing that the reporting was more factual: a goal was scored, but no
further information on it is provided. In terms of syntax, this suggests that articles
about male footballers have more adjectives and adverbs, as those are what make
language descriptive. This is expressed in the following hypothesis:

    H2b: Articles on men’s football will have a higher ratio of adjectives and adverbs
to other POS tags.

    Similarly, a study by Musto et al. (2017), which analyses four sport shows that
are broadcast in Los Angeles over a period of nine weeks, finds that there is greater
use of action verbs when referring to men than to women. Examples of action verbs
are ‘nailed’, ‘smoked’ and ‘exploded’. Moreover, the language used to describe
male athletes and their achievements is dominant and agentic, suggesting they are
in control of their own destiny, but that is not the case when it comes to female
athletes. For instance, while men are described ‘to be in complete control’ and to
put their opponents ‘in the spin cycle’, women are depicted as ‘having made 27
saves’ or to ‘have put in the work’ (Musto et al., 2017). This can also be seen
in the example above, with the description of a penalty taken in a World Cup final.
Griezmann is depicted as being in control of the situation, rolling
the ball past the goalkeeper and retaking the lead. Rapinoe, on the other hand, is
described as simply striking the ball, with non-agentic language used to explain how
her goal came about. Given the above-mentioned findings, it is expected that the
language used by the media to describe male footballers will have more verbs, as it
is action-packed and agentic. Thus, hypothesis 2c is defined as:

   H2c: Articles on men’s football will have a higher ratio of verbs to other POS
tags.
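    To make the quantity behind hypotheses 2a to 2c concrete, the following is a
minimal Python sketch of how such ratios could be computed for a single article.
It is an illustration only, not the procedure used in this study: the tag names
follow the Universal POS convention, the grouping of tags (including counting
proper nouns together with nouns) is an assumption, and the ratio is taken to be
the number of tokens in a group divided by the number of tokens carrying any
other tag. The tokens are tagged by hand, so no tagger is required to run it.

```python
from collections import Counter

# Hypothetical tag groups (Universal POS labels). The tagset and grouping
# actually used in the study are not specified in this section.
TAG_GROUPS = {
    "nouns_pronouns": {"NOUN", "PROPN", "PRON"},
    "adjectives_adverbs": {"ADJ", "ADV"},
    "verbs": {"VERB"},
}


def pos_ratios(tagged_tokens):
    """For each group, return count(tags in group) / count(all other tags)."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    ratios = {}
    for name, tags in TAG_GROUPS.items():
        in_group = sum(counts[tag] for tag in tags)
        others = total - in_group
        # Guard against division by zero when every token is in the group.
        ratios[name] = in_group / others if others else float("nan")
    return ratios


# Hand-tagged fragment of the 2018 example: "Griezmann nonchalantly rolls the ball"
example = [
    ("Griezmann", "PROPN"),
    ("nonchalantly", "ADV"),
    ("rolls", "VERB"),
    ("the", "DET"),
    ("ball", "NOUN"),
]
ratios = pos_ratios(example)
```

Under these assumptions, one ratio per group is obtained for every article, and
the resulting distributions for men’s and women’s football articles can then be
compared.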

2.3    History of Women’s Football in the UK

When studying the media’s reproduction of gender stereotypes in football, it is nec-
essary to consider the history of women’s football. In the UK, there are records of
women playing organized football as early as 1895 (Dunn, 1961; Williams & Wood-
house, 1991). Women’s football became particularly popular during the First World
War, as women got together at work and created teams, and reached its pinnacle
in December 1920, when over 50,000 fans gathered to watch a match (Pfister et al.,
2002). Due to its growing popularity, the Football Association (FA) became wor-
ried about the decreased focus on the male teams (Williams & Woodhouse, 1991;
Williams, 2003), and decided to ban women from playing football under questionable
pretenses about how teams were handling money (Pfister et al., 2002; Theivam &
Kassouf, 2019). The ban was imposed in 1921 and was only lifted in 1971, but women’s
football was not incorporated into the FA until 1993. These factors rendered women
in football invisible for over 50 years, and created a stigma around women who chose
to play, making the women’s game very unpopular (Black & Fielding-Lloyd, 2019;
Meân, 2001).

    After being incorporated into the FA, women’s football grew and by 2002 it had
become the most practiced sport amongst girls and women in the UK. Despite an
increase in women playing football, it was only in 2008 that a semi-professional
league was created, with 8 teams competing. The league was very successful and
was therefore expanded over the years; by 2014 it had developed into a two-division
competition with 24 teams in total (The Football Association, 2021). It is clear
that women’s football has come a long way since its ban was lifted in 1971 and it is
expected that the reporting on it has also changed.

   On top of changes in participation, changes in reporting have also been observed
over the years. Musto et al. (2017) have shown that over time, there was a consid-
erable decrease in explicit sexism observed in the media when reporting on sports.
Similarly, Bruce (2015) demonstrated that there is less use of gender marking in
recent years, and that there has been a shift in the frames used by the media to
report on women’s sports. This suggests that the way in which the media reports
might have changed over time.

    Due to increased coverage and participation over the years as well as traces of
change in the reporting of women’s football, it is expected that the media will be
less biased as time goes by. Thus, hypotheses 3 and 4 are defined as follows:

   H3: Topic distributions of male and female articles will converge over time

   H4: Differences in syntactical aspects between genders will reduce over time.

3         Data & Methods

3.1        Data

Data from the British newspaper The Guardian will be used to carry out the investigation.
The Guardian is one of the most read papers in the UK, with a
monthly readership of over 24 million. This figure accounts for consumers of the print
version, which comes out every day, as well as those who read their articles online
(PAMCo, 2020). On their website, The Guardian claims to not follow any political
ideology (Guardian, 2015). However, they are known by the public to be left-leaning
(Smith, 2017), and even their editors have admitted that the newspaper is
centre-left (Guardian, 2004). Regardless of their political inclination, the newspaper
is considered the most trustworthy in the country (PAMCo, 2020). Furthermore,
The Guardian also has an API which allows their articles to be collected easily
and free of charge (Guardian, 2019a).

    The Guardian’s API allows for the retrieval of data from their database through
keywords and sections. In this case, searching with keywords could limit the scope
of articles retrieved, as not every article about football will include the word football
in it. Therefore, the decision was made to collect articles by section rather than by
searching keywords. After close inspection of the sections on The Guardian, it was noted
that there was a general sports section, Sports, which did not include articles about
football. Due to the large volume of coverage, there is a separate, standalone section
dedicated to the coverage of football. This section also included the few articles that
referred to football as soccer and excluded articles on American football.

    Articles from the Football section of The Guardian published between 2002
and 2020 are collected. This is done using the R package guardianapi (Bastos
& Puschmann, 2019), which facilitates the usage of the API. The period covers 10
World Cups (5 of each gender), and 8 European Championships (4 of each gender¹).
The choice of dates is made based on the availability of data, as it is only
available starting from 1999. The decision is made to exclude data from 1999 to 2001
because including it would lead to an unbalanced coverage of men’s and women’s
events, since the men’s World Cup of 1998 would not be part of the dataset. In
total, 127,728 articles are collected, totalling 50,575,050 words, 471,676 of which
are unique.

  Once the articles are collected, they are categorized between men’s football and
women’s football. The categorization is done using keywords that were deemed
relevant for women’s football (e.g., famous female footballers, the names of women’s
competitions; for a full list of keywords see Appendix A). Articles that contained
one or more of the keywords were categorized as women’s football articles, while
those which contained none of the keywords were classified as articles about men’s
football. An overview of the documents over the years can be found in Figure 1.

    ¹ The 2020 Men’s EUROS was postponed due to the COVID-19 pandemic.

           Figure 1: Number of Articles in the Dataset Over the Years

    Using keywords to categorize articles comes with limitations. Articles that cover
a topic, in this case women’s football, but do not contain any of the keywords in
the list will not be categorized as such. Conversely, an article might have one of the
keywords but not actually cover women’s football. An example could be an article
about the French Football Federation that mentioned that the 2019 Women’s World
Cup took place in France. Due to the mention of the Women’s World Cup the article
would be categorized as women’s football, but it is not in fact about that. Both
types of miscategorization are impossible to avoid; however, they can be minimized.
In order to avoid missing articles about women’s football, the keyword list was
expanded as much as possible. However, to ensure that this would not cause many
unrelated articles to be improperly categorized, the keywords were kept as specific
as they could be. A robustness test is performed to ensure that results are not fully
reliant on the set of keywords used for categorization. Results demonstrate that the
findings hold using five different sets of keywords; detailed information is available
in Appendix E.
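
    The keyword-based categorization described above can be sketched as follows. This
is a minimal illustration, not the procedure actually used in the thesis: the three
keywords are hypothetical stand-ins for the list in Appendix A, and the function names
are assumptions made for the example. The third toy article reproduces the kind of
false positive discussed above.

```python
import re

# Hypothetical keywords; the actual list used in the thesis is in its Appendix A.
WOMEN_KEYWORDS = ["women's world cup", "women's super league", "lionesses"]


def build_pattern(keywords):
    """Compile one case-insensitive pattern that matches any keyword as a whole phrase."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def categorize(text, pattern):
    """Label an article "women" if it contains at least one keyword, otherwise "men"."""
    return "women" if pattern.search(text) else "men"


PATTERN = build_pattern(WOMEN_KEYWORDS)

articles = [
    "The Lionesses face Germany at Wembley on Saturday.",
    "The striker signed a new four-year contract yesterday.",
    # Mentions a keyword in passing, so it is labelled "women" even though
    # the article is about a federation: the false positive described above.
    "The federation recalled hosting the Women's World Cup in 2019.",
]
labels = [categorize(a, PATTERN) for a in articles]
```

Matching whole phrases rather than single words such as "women" is one way to keep
keywords specific, at the cost of missing articles that use none of the listed phrases.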

3.2    Seeded Topic Modelling

The first part of this research centers around semantic differences in the reporting
of football. As hypotheses 1 and 3 suggest, this can be investigated through the
use of topic models. The most commonly used topic model is Latent Dirichlet
Allocation (LDA) (Blei, Ng, & Jordan, 2003). LDA is built on the assumption that
a document is made up of a collection of topics. A topic is represented by a group
of words and their weight within that topic, which is based on co-occurrences of the
word throughout the documents. A word can be part of multiple topics, and have
different weights for each of those topics. The number of topics to be found in the
corpus is pre-determined. This process is illustrated with an example in Figure 2.
As can be seen, in this case there are four topics, which are made up of a distribution
of words. Each document is assumed to have a topic distribution as visualized in
the histogram on the right. Based on that, each word is assigned a topic, as seen
by the arrows pointing to the highlighted words.

              Figure 2: Intuition behind LDA model from Blei (2012)

   Formally, this is done through the generative process:

  1. For each document i, choose a topic distribution $\theta_i \sim \mathrm{Dir}(\alpha)$

  2. For each topic k, choose a word distribution $\phi_k \sim \mathrm{Dir}(\beta)$

  3. For each word position j in document i

       (a) Choose a topic indicator $z_{i,j} \sim \mathrm{Multinomial}(\theta_i)$
       (b) Choose a word $w_{i,j} \sim \mathrm{Multinomial}(\phi_{z_{i,j}})$

Where $\theta_i$ is the topic distribution for document i, $\phi_k$ is the word distribution for
topic k, $w_{i,j}$ is the j-th word in document i and $z_{i,j}$ is the topic for $w_{i,j}$.
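
    The generative process above can be sketched in a few lines of Python. This is a
minimal illustration rather than the implementation used in the thesis: the function
names, the toy vocabulary and the hyperparameter values are assumptions made for the
example, and Dirichlet draws are obtained by normalising Gamma draws.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_docs, doc_len, vocab, n_topics, alpha, beta, seed=0):
    """Generate toy documents following the LDA generative process."""
    rng = random.Random(seed)
    # Step 2: one word distribution phi_k per topic.
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # Step 1: topic distribution theta_i for this document.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            # Step 3(a): draw the topic indicator z_ij from theta_i.
            z = rng.choices(range(n_topics), weights=theta)[0]
            # Step 3(b): draw the word w_ij from the word distribution of topic z_ij.
            words.append(rng.choices(vocab, weights=phi[z])[0])
        docs.append(words)
    return docs, phi


# Toy vocabulary and settings, chosen only for illustration.
VOCAB = ["goal", "penalty", "contract", "transfer", "family", "injury"]
docs, phi = generate_corpus(n_docs=3, doc_len=8, vocab=VOCAB,
                            n_topics=2, alpha=0.5, beta=0.5, seed=1)
```

Fitting a topic model runs this process in reverse: the words are observed, and the
topic and word distributions are inferred from them.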

    LDA is an unsupervised model, meaning that it does not require any prior information
(e.g., tagged data) to infer topics. This causes two main issues. First, it
has been shown that resulting topics in LDA are not always easily interpretable for
humans (Chang et al., 2009). This is especially important in qualitative research,
as it requires interpretable topics. Second, there is no control over what the result-
ing topics are. Therefore, in the case of this study, there would be no assurances
that the stereotypes explored in the literature would appear in the results. For this
reason, seeded topic models are used instead.

    Seeded topic models are a semi-supervised version of LDA, which can have cer-
tain topics be defined ahead of time (Andrzejewski & Zhu, 2009; Jagarlamudi,
Daumé III, & Udupa, 2012). More specifically, seed words can be defined for certain
topics to pre-define what the topic should cover. When a topic is given a seed word,
that word is assumed to belong only to that topic (i.e., the model assumes there is
zero probability of it belonging to another topic). However, a seed word can be used
in more than one topic, in which case it will have equal probability of belonging
to each topic. In terms of the generative process of seeded topic models, only the
second step is different from that of LDA described above. This happens because in
seeded topic models, the word distribution over topics is restricted. That is, certain
words are predetermined to belong to certain topics, and this must be taken into
account when generating the word distribution over the topics. Formally, the
second step of the generative process is adjusted to accommodate a restricted ϕ,
which contains the predetermined seed words for selected topics. In this thesis, an
implementation of the model in Java is used (Magnusson & Jonsson, 2021).
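
    To make the idea of a restricted ϕ concrete, the sketch below samples word
distributions in which a seed word has zero probability in every topic it was not
seeded for. It is an illustration under stated assumptions and does not describe how
the Java implementation works: the input format (a mapping from topic index to seed
words), the function name and the toy vocabulary are all hypothetical, and the sketch
only enforces the zero-probability restriction.

```python
import random


def restricted_phi(vocab, n_topics, seeds, beta=0.5, seed=0):
    """Sample one word distribution per topic, zeroing seed words outside their topics.

    `seeds` maps a topic index to its list of seed words (hypothetical format).
    """
    rng = random.Random(seed)
    # For each seed word, the set of topics it is allowed to appear in.
    allowed = {}
    for topic, words in seeds.items():
        for word in words:
            allowed.setdefault(word, set()).add(topic)

    phi = []
    for k in range(n_topics):
        # Unseeded words are allowed in every topic; seed words only in their own.
        weights = [
            rng.gammavariate(beta, 1.0) if k in allowed.get(word, {k}) else 0.0
            for word in vocab
        ]
        total = sum(weights)
        phi.append([w / total for w in weights])
    return phi


# Toy vocabulary and seed words, chosen only for illustration.
VOCAB = ["goal", "penalty", "wife", "mother", "club"]
SEEDS = {0: ["wife", "mother"], 1: ["goal"]}
phi = restricted_phi(VOCAB, n_topics=3, seeds=SEEDS)
```

Normalising Gamma draws over the permitted words only is equivalent to drawing from a
Dirichlet distribution defined on that reduced vocabulary.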

     The topics to be defined ahead of time and their seed words were chosen based on
previous literature about stereotypes used by the media when reporting on women’s
sports. Based on qualitative studies, five topics were identified: sexualization, family,
emotions, on-pitch coverage and off-pitch coverage (the seed words used for each
of them can be found in Appendix B).

    An important step in this method is deciding on the number of topics to generate.
To do so, it is necessary to evaluate models, so that it is possible to compare how
changing the number of topics affects the results of the model. The most commonly
used measure to assess topic models, which is that suggested by Blei et al. (2003) in
the original LDA paper, is the log-likelihood. For this study, seeded topic models
with 100, 200 and 300 topics were run, and the log-likelihood was computed every 100
iterations of the model. Each model had a similar average log-likelihood, of around
−4.89. As it was not possible to determine
a clear difference between the performance of the models, the decision was made
to use the results of the model with 100 topics. A robustness check was performed
using the models with 200 and 300 topics to ensure that the substantive findings do
not change depending on the number of topics used in the model. The outcome of
the robustness check, which can be found in Appendix C, showed that the conclusions
from the analysis hold for all three models.

    Once the topics are generated, the topic distribution can be deduced for women’s
and men’s articles. These distributions must be evaluated in order to understand
whether or not they differ, as hypothesis 1 suggests. To do that, two different
methods are used: the Jensen–Shannon divergence, which measures how different
topic distributions are; and the Kolmogorov–Smirnov test, which checks whether
this difference is significant.

3.2.1   Jensen–Shannon Divergence

The Jensen–Shannon divergence (JSD) measures the similarity between two
probability distributions (Lin, 1991), and has been used before in the evaluation of
topic models (Aletras & Stevenson, 2014; Blair, Bi, & Mulvenna, 2020). The JSD is based on
the Kullback–Leibler divergence (KLD), which measures how a given probability
distribution differs from a reference distribution (Kullback & Leibler, 1951). The
KLD is calculated using the following equation:

$$D_{KL}(P \,\|\, Q) = \sum_{x \in X} P(x) \log_2\left(\frac{P(x)}{Q(x)}\right) \tag{1}$$

    In Equation 1, P(x) and Q(x) are discrete probability distributions. One issue
with the KLD is that it is asymmetric, meaning $D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$. When
comparing the distribution of the topic models this is not ideal, since a single
similarity score is desired. For this reason, the JSD is used instead, as it is a symmetric
divergence score based on the KLD. It is calculated using the equation:
$$D_{JS}(P \,\|\, Q) = \frac{1}{2} D_{KL}(P \,\|\, M) + \frac{1}{2} D_{KL}(Q \,\|\, M) \tag{2}$$

   In Equation 2, $M = \frac{1}{2}(P + Q)$, that is, the average of the two distributions.
With the base-2 logarithm used in Equation 1, JSD values range from 0 to 1, where a
score of 0 demonstrates two distributions are identical and a score of 1 that they are
maximally different.
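
    Equations 1 and 2 translate directly into code. The sketch below is a minimal
illustration using only the standard library; the function names are assumptions made
for the example, and terms with P(x) = 0 are taken to contribute nothing to the sum,
which is the usual convention.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in bits; terms where P(x) = 0 contribute 0 by convention."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (Equation 2)."""
    # M is the element-wise average of P and Q, so M(x) > 0 wherever P(x) or Q(x) is.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy distributions, chosen only for illustration.
p = [0.9, 0.1]
q = [0.5, 0.5]
asymmetric_gap = abs(kl_divergence(p, q) - kl_divergence(q, p))  # KLD depends on order
symmetric_gap = abs(js_divergence(p, q) - js_divergence(q, p))   # JSD does not
```

Because M is strictly positive wherever P or Q is, the JSD stays finite even when the
two distributions do not overlap, which is not true of the KLD.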

  Before applying the equation, an additional step is needed to prepare the data.
The resulting distribution from the seeded topic model is a continuous distribution.
However, to calculate the JSD between articles about women and men, distributions
must be discrete. Therefore, the distribution is discretized through binning. More
specifically, the data is made discrete by calculating the probability that a value falls
within an interval, where the intervals are defined by the bins. For example, in the seeded
topic model result a given topic can take any probability of occurring in a document,
making it continuous. By binning, the probabilities are divided into intervals and
grouped together (e.g., from 0 to 0.1 probability; from 0.1 to 0.2 probability, etc.).
Based on that, the probability of a topic occurring in a document is now part of
one of the bins that were created, and the data becomes discrete. It is important to
note that the number and size of the bins can impact the resulting JSD, as wider
intervals will generate groups with higher probabilities and vice versa. Thus, a
robustness check must be done to investigate how the JSD changes as the size of bins
increases or decreases.
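
    The binning step and the accompanying robustness check can be sketched as
follows. This is an illustration under stated assumptions: equal-width bins on [0, 1]
are assumed, the per-document topic probabilities are invented toy values, and the
function names are hypothetical.

```python
import math


def bin_probabilities(values, n_bins):
    """Share of values that fall in each of n_bins equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        # min() places a value of exactly 1.0 in the last bin instead of overflowing.
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return [c / len(values) for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Invented per-document probabilities of one topic in each group of articles.
women = [0.05, 0.12, 0.31, 0.33, 0.58]
men = [0.02, 0.04, 0.07, 0.15, 0.41]

# Robustness check: recompute the JSD for several bin counts.
jsd_by_bins = {
    n_bins: js_divergence(bin_probabilities(women, n_bins),
                          bin_probabilities(men, n_bins))
    for n_bins in (1, 5, 10, 20)
}
```

With a single bin both groups collapse to the same distribution and the JSD is 0, which
illustrates why the choice of bin size has to be checked rather than assumed harmless.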

3.2.2   Two Sample Kolmogorov–Smirnov Test

The Two Sample Kolmogorov-Smirnov test (KS-2 test) was created by Smirnov
(1939). It allows for the comparison of two empirical distributions and tests whether
or not they are sampled from the same distribution. The KS-2 test is non-parametric,
meaning that it does not assume the sample is from a given distribution and does
not require information about any parameters. This is very important in this case,
since there is no knowledge of the true distribution of the topics over the documents.
The KS-2 statistic can be calculated using the following formula:
$$D_{m,n} = \max_{x} |F(x) - G(x)| \tag{3}$$

   In Equation 3, m is the size of the first sample, n is the size of the second
sample, and F(x) and G(x) are the empirical cumulative distribution functions of the
first and second sample respectively. An illustration of how the KS-2 statistic is
calculated can be seen in Figure 3. The red line represents F(x) and the blue line G(x),
two cumulative distribution functions. The black arrow represents the biggest difference
between the two cumulative distributions, which is in fact the KS-2 statistic.

   The null hypothesis, $H_0$, is that the samples come from populations that
have the same distribution. The null hypothesis is rejected if:
$$D_{m,n} > c(\alpha) \sqrt{\frac{m+n}{m \cdot n}} \tag{4}$$

   In Equation 4, α is the significance level. Based on the critical value table for
the KS-2 test (Smirnov, 1939), when α is 0.05, the critical value c(α) is 1.358.

Figure 3: How to Calculate the Two Sample Kolmogorov–Smirnov Test Statistic

Therefore, Equation 4 becomes:

$$D_{m,n} > 1.358 \sqrt{\frac{m+n}{m \cdot n}} \tag{5}$$

    Based on Equation 5, if $D_{m,n}$ is larger than $1.358 \sqrt{\frac{m+n}{m \cdot n}}$, the null hypothesis is
rejected, meaning that the topic distributions of men’s and women’s articles are not
sampled from the same population.
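
    The statistic in Equation 3 and the decision rule in Equation 5 can be sketched
with the standard library alone. The function names and the toy samples are assumptions
made for the example; library routines such as SciPy's `ks_2samp` compute the same
statistic.

```python
import math
from bisect import bisect_right


def ks_2samp_statistic(sample_1, sample_2):
    """Largest absolute gap between the two empirical CDFs (Equation 3)."""
    xs, ys = sorted(sample_1), sorted(sample_2)
    m, n = len(xs), len(ys)
    # Both empirical CDFs are step functions that only change at observed values,
    # so it is enough to evaluate the gap at those values.
    return max(
        abs(bisect_right(xs, v) / m - bisect_right(ys, v) / n)
        for v in xs + ys
    )


def ks_critical_value(m, n, c_alpha=1.358):
    """Right-hand side of Equation 5; c_alpha = 1.358 corresponds to alpha = 0.05."""
    return c_alpha * math.sqrt((m + n) / (m * n))


def reject_null(sample_1, sample_2, c_alpha=1.358):
    """True if the KS-2 statistic exceeds the critical value."""
    d = ks_2samp_statistic(sample_1, sample_2)
    return d > ks_critical_value(len(sample_1), len(sample_2), c_alpha)


# Invented topic probabilities for two small groups of articles.
women = [0.05, 0.12, 0.31, 0.33, 0.58]
men = [0.02, 0.04, 0.07, 0.15, 0.41]
d_statistic = ks_2samp_statistic(women, men)
```

Because the critical value shrinks as the samples grow, even small gaps between the
cumulative distributions become significant for samples of very large size.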

3.3    POS Tagging

Part-of-speech (POS) tagging is the process through which each word in a corpus
is given a syntactic tag. In other words, by POS tagging a given corpus each
token is labelled with its syntactic property. For example, the word ‘kicked’ would
be tagged as a verb, and the word ‘sunlight’ would be tagged as a noun. This
method can be used for the investigation of hypotheses 2a, 2b, 2c and 4. By POS
tagging the documents in the corpus, it is possible to inspect how often nouns,
pronouns, adverbs, adjectives and verbs are used and check whether hypotheses 2a-c
are supported by the data.

    POS tagging is an exercise that can be performed by hand; however, there are
also automated implementations of POS taggers which can tag data faster than
humans and as a result be applied to larger corpora. Although automated methods
are faster, it is important to note that, like any other model, they are not always
correct. Thus, when using automated POS taggers, it is important to investigate
their performance. This can be done using annotated corpora, that is, documents
that have been tagged by a human who is knowledgeable in the subject. Said
corpus is tagged by an automated tagger and then the results can be compared to
the annotated documents.

   In the case of this study, it would be almost impossible to tag all 127,728
documents by hand due to the time required. Therefore, the Python package spaCy
(Honnibal, Montani, Van Landeghem, & Boyd, 2020) is used to POS tag the dataset.
spaCy’s POS tagger is trained on thousands of annotated documents and uses a
combination of statistical models to predict the syntactic label of a given word.
In order to test the performance of the spaCy tagger, the Brown corpus is used
(Francis & Kučera, 1979). This corpus was curated and tagged by Francis and
Kučera (1979), who selected 500 documents which were divided between 15 different
categories. For this test, only one category was used, ‘news’. This is because
the current study investigates news articles and thus, the POS tagger must perform
well in labelling this type of text. If the entire Brown corpus were used, documents from
categories such as ‘religion’ or ‘US government’ might affect the measured performance
of the tagger. The ‘news’ category contains 44 documents, totalling over 100,000 words,
90.4% of which were correctly labelled by the spaCy tagger.
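
    The comparison between automated and human tags boils down to a token-level
accuracy. The sketch below shows only that final step, under the assumption that both
tag sequences share the same tokenisation and the same tag set; the tags are invented
toy values, and the function name is hypothetical.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Share of tokens whose predicted tag matches the human-annotated tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must align token by token")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Invented tags for the tokens of "The striker scored a brilliant goal".
gold = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"]
predicted = ["DET", "NOUN", "VERB", "DET", "NOUN", "NOUN"]
accuracy = tagging_accuracy(gold, predicted)  # 5 of the 6 tags agree
```

In practice the annotated corpus and the tagger may use different tag sets, in which
case the tags have to be mapped to a common scheme before they can be compared.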

3.3.1   Mann–Whitney U test

The Mann–Whitney U (MWU) test is a non-parametric test that compares two
distributions and determines whether both are drawn from the same population
(Mann & Whitney, 1947). The MWU test is similar to the Student’s t-test, but
does not assume that the distributions are sampled from a normal distribution. The
MWU test is applied using the following method:

  1. Put observations from each distribution in one set and order them from
     smallest to largest value

  2. Assign each observation a rank, based on its position in the ordered set, such
     that the smallest value has rank 1, and the highest value has rank $n_1 + n_2$

       (a) If two or more values in the ordered set are the same, assign a rank equal
           to the midpoint of unadjusted rankings, such that they are all given the
           same rank.
              i. For instance the ordered set (0, 3, 4, 4, 6) will be assigned ranks (1,
                 2, 3.5, 3.5, 5) instead of (1, 2, 3, 4, 5).

  3. Calculate the U-statistic from the rankings

   The U-statistic is calculated using the formulae:

$$U = \min(U_1, U_2) \tag{6}$$

Where:

$$U_1 = R_1 - \frac{n_1(n_1 + 1)}{2} \tag{7}$$

$$U_2 = R_2 - \frac{n_2(n_2 + 1)}{2} \tag{8}$$

    In Equations 7 and 8, $n_1$ and $n_2$ are the number of observations in samples 1 and 2
respectively; $R_1$ and $R_2$ are the sums of ranks in each sample. Since hypotheses 2a, 2b
and 2c compare whether ratios for one category are bigger or smaller than the other,
a one-sided test is used. For hypothesis 2a, this means that the null hypothesis,
$H_0$, is that articles about men’s football will have a significantly smaller ratio than
women’s articles. As for hypotheses 2b and 2c, the null hypothesis is that articles about
men’s football will have a significantly larger ratio than women’s articles. The null
hypothesis is rejected if:

$$U > c(\alpha, n_1, n_2) \tag{9}$$

    In Equation 9, $c(\alpha, n_1, n_2)$ is the critical value calculated based on the significance
level α, and the sample sizes $n_1$ and $n_2$. For the purposes of this study, the critical
value is calculated using the Python package SciPy (Jones et al., 2001).
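
    The ranking procedure and the U-statistic can be sketched as follows. This is a
minimal standard-library illustration, not the code used in the thesis: the function
names are hypothetical, and it uses the standard rank-sum form
U_i = R_i − n_i(n_i + 1)/2. It covers only the ranking and the statistic, not the
critical value.

```python
def midranks(values):
    """Rank values from 1 to n, giving tied values the midpoint of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` over the run of values tied with the one at `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Midpoint of the 1-based positions start+1 .. end+1.
        shared_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def mann_whitney_u(sample_1, sample_2):
    """Return (U, U1, U2) computed from the rank sums of the pooled samples."""
    n1, n2 = len(sample_1), len(sample_2)
    ranks = midranks(list(sample_1) + list(sample_2))
    r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = r2 - n2 * (n2 + 1) / 2
    return min(u1, u2), u1, u2


# The tie example from the list above.
example_ranks = midranks([0, 3, 4, 4, 6])
```

A useful sanity check is that U1 and U2 always add up to n1 · n2, the number of
pairwise comparisons between the two samples.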

3.3.2    Top Word Analysis

Regardless of how often the media makes use of certain types of words, such as
adjectives and verbs, when reporting on men’s and women’s football, there is a
possibility that the word choice is different. For example, even if the media uses
adjectives equally often when covering male and female football players, there is
a possibility that the exact adjectives used differ between them. Therefore, it is
important to investigate which words are commonly used in each word type
category, and whether they differ between reporting on men and women. This
will give an understanding of the qualitative differences in the reporting for each
gender.

    To do so, the 500 most commonly used words for each tag will be gathered
for both men and women. That is, for each word type category (e.g.: adjectives,
adverbs, nouns, verbs) the 500 words that appear most frequently in the men’s
articles are collected, and the same is done with the women’s articles. Then, the sets
are compared with each other, to obtain the most common words in each category
that are different for men’s and women’s articles. More specifically, the 25 most
common words that were unique to a category and tag were selected (e.g.: top 25
verbs that appear in the top 500 verbs of women’s articles but not men’s).
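
    The comparison of top words described above can be sketched with a frequency
counter. This is an illustration under stated assumptions: the input is taken to be a
list of (word, tag) pairs, words are lower-cased before counting, and the function
names and toy data are hypothetical.

```python
from collections import Counter


def top_words(tagged_tokens, tag, n):
    """The n most frequent lower-cased words carrying `tag`, most frequent first."""
    counts = Counter(word.lower() for word, t in tagged_tokens if t == tag)
    return [word for word, _ in counts.most_common(n)]


def distinctive_words(tokens_a, tokens_b, tag, pool_size=500, keep=25):
    """Top `keep` words from A's top `pool_size` for `tag` that are absent from B's."""
    pool_b = set(top_words(tokens_b, tag, pool_size))
    candidates = top_words(tokens_a, tag, pool_size)
    return [word for word in candidates if word not in pool_b][:keep]


# Invented (word, tag) pairs standing in for the tagged articles.
women_tokens = [("brilliant", "ADJ"), ("Brilliant", "ADJ"), ("young", "ADJ"),
                ("good", "ADJ"), ("goal", "NOUN")]
men_tokens = [("good", "ADJ"), ("clinical", "ADJ"), ("goal", "NOUN")]

women_only_adjectives = distinctive_words(women_tokens, men_tokens, "ADJ")
men_only_adjectives = distinctive_words(men_tokens, women_tokens, "ADJ")
```

The default values of `pool_size` and `keep` mirror the 500 and 25 used in the text;
words shared by both pools, such as "good" in the toy data, are dropped.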

3.4    Bonferroni Correction

When performing hypothesis testing, it is important to consider the number of tests
that are performed on a sample. When testing hypotheses 3 and 4, which relate to how
the media reporting changes over time, many different tests will be conducted. In
the case of the syntactical investigation, for example, 19 tests are run (one
per year in the dataset) for each of the three comparisons made, totalling 57 tests.
Therefore, the probability that at least one of the tests is significant by chance
alone will be almost 1.

$$
\begin{aligned}
P(\geq 1 \text{ significant result}) &= 1 - P(\text{no significant results}) \\
&= 1 - (1 - \alpha)^{\text{n. of tests}} \\
&= 1 - (1 - 0.05)^{57} \\
&= 0.946
\end{aligned}
$$

    This means that even if there is no true difference in any of the tests, there is
a 94.6% chance of observing at least one spurious significant result. In order to
rectify that, the Bonferroni correction is applied. This method was introduced by
Dunn (1961), and takes into account the number of tests that are performed in the
calculation of the significance level. More specifically, the Bonferroni correction
replaces the significance level with the following adjusted level:

$$\alpha_{new} = \frac{\alpha_{old}}{n} \tag{10}$$

    In Equation 10, $\alpha_{new}$ is the significance level against which p-values are
compared after the correction; $\alpha_{old}$ is the desired significance level for the test,
regardless of correction; and n is the total number of hypothesis tests to be performed.

   In the case of the syntactical differences analysis, for example, there are 57 tests
to be performed, at a 0.05 significance level. Therefore, using Equation 10, $\alpha_{new}$
equates to:

$$\alpha_{new} = \frac{0.05}{57} \approx 0.00088 \tag{11}$$

   The Bonferroni correction will be used in two instances in this thesis. First, when
comparing the topic distributions over time, using the KS-2 test. Second, when
analysing how the syntactical differences evolve over time, as mentioned above.
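
    The arithmetic behind the correction can be checked with a short sketch. The
function names are assumptions made for the example, and the family-wise error rate
formula assumes that the tests are independent, as in the derivation above.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after the Bonferroni correction (Equation 10)."""
    return alpha / n_tests


n_tests = 19 * 3  # one test per year for each of the three comparisons
fwer_uncorrected = familywise_error_rate(0.05, n_tests)       # roughly 0.946
alpha_new = bonferroni_alpha(0.05, n_tests)                   # roughly 0.00088
fwer_corrected = familywise_error_rate(alpha_new, n_tests)    # stays below 0.05
```

Testing each hypothesis at the corrected level keeps the probability of at least one
spurious significant result across all 57 tests below the original 0.05.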

4     Results

4.1    Semantic Gender Differences

In order to get a general overview of the results from the seeded topic model, the
top five topics for men’s and women’s articles are shown in Tables 1 and 2. The
specific words that make up each topic can be found in Appendix D. Firstly, it
can be observed that no topic appears in the top five of both men’s and women’s
football articles. Moreover, it can be seen that three of the main topics about
men’s football regard, to some extent, on-pitch action. ‘Match results’ and
‘on-pitch’ both cover actions of the game, whereas ‘injuries’ indirectly does the same.
This is because an interest in players’ injuries demonstrates an interest in their
availability to participate in matches. The other two topics cover information about
transfers of players between clubs and their contracts. This suggests that there is an
interest in male footballers individually, not just in terms of their performance in
a given club. On the other hand, a large focus exists on female footballers’ private
lives, more specifically on their relationships with their families. Other main topics
about women’s football are very general, as is the case of ‘press’, which contains
words such as ‘newspaper’, ‘film’ and ‘star’; and ‘football entities’, a topic with the
words ‘uefa’, ‘fa’ and ‘association’. Finally, the last topic covers the Hillsborough
disaster, an accident that took place in 1989 during a football match at Hillsborough
stadium. This topic is the 7th most common topic in men’s football, and the only
topic to repeat between the two categories in the top 20 of each of them. Therefore,
although this event does not relate to women’s football, the topic appears to be very
important in the corpus, and it likely appears in the top five due to the lack of
topics specific to women’s football. Overall, it is seen that the main topics for each
category are very different, providing some support to hypothesis 1.

            Table 1: Most Common Topics For Men’s Football Articles

Ranking  Topic                      Top Words
1        Match Results              win 1-0 minutes 2-0 2-1 lead goals 1-1 draw side victory
                                    3-0 minute 0-0 3-1 late put 2-2 defeat beat
2        Transfers (Contracts)      club contract deal offer future week move talks agreed
                                    leave yesterday stay confirmed terms understood interest
                                    signed agreement agent expected
3        Transfers (Rumours/Moves)  transfer move deal loan summer sign window signing
                                    striker midfielder fee january bid player offer interest
                                    free interested defender leave
4        On-pitch                   goal corner scored penalty chance header performance
                                    scoring free-kick chances updated save ability scores
                                    passing defending passes saved created subs
5        Injuries                   injury foot injured knee feet form season injuries suffered
                                    leading suspended ankle hamstring scorer subs broke
                                    medical broken doubtful discipline

            Table 2: Most Common Topics For Women’s Football Articles

Ranking  Topic                      Top Words
1        Women’s Football           england team game women world cup womens players
                                    related womens football usa teams time white australia
                                    girls mens female play tournament top
2        Family                     big family football day letters son sign relationship email
                                    father quote send wife website boss brother children free
                                    partner
3        Press                      footballers film story footballer sun celebrity paper daily
                                    press love newspaper star man page tabloid mirror
                                    woman newspapers column pr
4        Football Entities          clubs football fa association chief rules government
                                    executive professional sport uefa scudamore review issues
                                    national grassroots organisation board health issue
5        Hillsborough               police court case death evidence investigation statement
                                    died association found disaster ban appeal report legal
                                    told hillsborough alleged mr

     The Jensen–Shannon divergence values for each topic are shown in Figure 4.
As can be seen, variations in the size of bins used to transform the data have
little impact on the results. Moreover, it is clear that the family topic displays
the biggest difference in distribution between men and women. Two other topics,
off-pitch and on-pitch, also show differences in their distribution across women’s
and men’s articles, although these are small. However, there are also topics, namely
sexualization and emotions, that have a JSD close to zero, suggesting the topic
distributions between articles on women and men are almost identical for these topics.
Figure 4 suggests that there are differences in the topics covered when reporting on
men’s football as compared to women’s football. This provides support for hypothesis 1.
However, from the JSD alone it is not possible to understand whether these differences
are significant. To do that, the Kolmogorov–Smirnov test is needed.

  Figure 4: Jensen–Shannon Divergence For Each Topic For Different Sized Bins

        Table 3: Results for the Two Sample Kolmogorov–Smirnov Test

       Topic        KS Statistic    Critical Value     Significantly Different
    Sexualization       0.098             0.015                   True
       Family           0.236             0.015                   True
      Emotions          0.134             0.015                   True
      On-pitch          0.122             0.015                   True
      Off-pitch         0.148             0.015                   True

    Table 3 displays results for the Two Sample Kolmogorov–Smirnov Test
performed on the distributions. As can be seen, according to the KS-2 test, all
differences observed between the topic distributions of men’s and women’s articles
are significant. This demonstrates that the media reporting is significantly different
between men’s and women’s football when it comes to the five topics explored in
this investigation.

    Put together, the measures show that there are differences between the topic
distributions of articles about men’s and women’s football. This supports hypothesis
1 and indicates that the media does make use of certain stereotypes when reporting
on women’s football. However, these differences are small, especially for certain
topics, such as sexualization and emotions.

    These differences might be small because there are changes over time, which do
not appear when studying the data in its entirety. Therefore, the above-mentioned
methods are then applied to each year of the dataset to understand how the topic
distributions of men’s and women’s football articles evolve over time. Results
are presented in Figures 5 and 6.

    Figure 5 displays how the average JSD per topic changes over the years. It can
be seen that, in general, the differences between women’s and men’s football articles
decrease over time. The turning point appears to be 2013, after which the
differences start decreasing. As a whole, Figure 5 shows evidence that supports
hypothesis 3, indicating that the difference between the topic distributions of men’s
and women’s football articles decreases over time.

   As for the significance of these differences, Figure 6 provides information about
the p-values of the KS-2 tests for each topic for each year in the dataset, adjusted
using Bonferroni correction. Results demonstrate that despite the decrease in differ-
ences between the topic distributions, they remain significant for the topics family
