Building Representative Corpora from Illiterate Communities: A Review of Challenges and Mitigation Strategies for Developing Countries
Stephanie Hirmer¹, Alycia Leonard¹, Josephine Tumwesige², Costanza Conforti²,³
¹ Energy and Power Group, University of Oxford
² Rural Senses Ltd.
³ Language Technology Lab, University of Cambridge
stephanie.hirmer@eng.ox.ac.uk
Abstract

Most well-established data collection methods currently adopted in NLP depend on the assumption of speaker literacy. Consequently, the collected corpora largely fail to represent swathes of the global population, which tend to be some of the most vulnerable and marginalised people in society, and often live in rural developing areas. Such underrepresented groups are thus not only ignored when making modeling and system design decisions, but also prevented from benefiting from development outcomes achieved through data-driven NLP. This paper aims to address the under-representation of illiterate communities in NLP corpora: we identify potential biases and ethical issues that might arise when collecting data from rural communities with high illiteracy rates in Low-Income Countries, and propose a set of practical mitigation strategies to help future work.

1 Introduction

The exponentially increasing popularity of supervised Machine Learning (ML) in the past decade has made the availability of data crucial to the development of the Natural Language Processing (NLP) field. As a result, much NLP research has focused on developing rigorous processes for collecting large corpora suitable for training ML systems. We observe, however, that many best practices for quality data collection make two implicit assumptions: that speakers have internet access and that they are literate (i.e. able to read and often write text effortlessly¹). Such assumptions might be reasonable in the context of most High-Income Countries (HICs) (UNESCO, 2018). However, in Low-Income Countries (LICs), and especially in sub-Saharan Africa (SSA), such assumptions may not hold, particularly in rural developing areas where the bulk of the population lives (Roser and Ortiz-Ospina (2016), Figure 1). As a consequence, common data collection techniques – designed for use in HICs – fail to capture data from a vast portion of the population when applied to LICs. Such techniques include, for example, crowdsourcing (Packham, 2016), scraping social media (Le et al., 2016) or other websites (Roy et al., 2020), collecting articles from local newspapers (Marivate et al., 2020), or interviewing experts from international organizations (Friedman et al., 2017). While these techniques are important to easily build large corpora, they implicitly rely on the above-mentioned assumptions (i.e. internet access and literacy), and might result in demographic misrepresentation (Hovy and Spruit, 2016).

In this paper, we make a first step towards addressing how to build representative corpora in LICs from illiterate speakers. We believe that this is a currently unaddressed topic within NLP research. It aligns with previous work investigating sources of bias resulting from the under-representation of specific demographic groups in NLP corpora (such as women (Hovy, 2015), youth (Hovy and Søgaard, 2015), or ethnic minorities (Groenwold et al., 2020)). In this paper, we make the following contributions: (i) we introduce the challenges of collecting data from illiterate speakers in §2; (ii) we define various possible sources of biases and ethical issues which can contribute to low data quality in §3; finally, (iii) drawing on years of experience in data collection in LICs, we outline practical countermeasures to address these issues in §4.

¹ For example, input from speakers is often taken in writing, in response to a written stimulus which must be read.
Figure 1: Literacy, urban population, and internet usage in African countries: (a) adult literacy (% ages 15+, UNESCO (2018)); (b) urban population (% of total, UNDESA (2018)); (c) internet usage (% of total, ITU (2019)). Note that countries with more rural populations tend to have lower literacy rates and fewer internet users. These countries are likely to be under-represented in corpora generated using common data collection methods that assume literacy and internet access (grey: no data).
2 Listening to the Illiterate: What Makes it Challenging?

In recent years, developing corpora that encompass as many human languages as possible has been recognised as important in the NLP community. In this context, widely translated texts (such as the Bible (Mueller et al., 2020) or the Human Rights declaration (King, 2015)) are often used as a source of data. However, these texts tend to be quite short and domain-specific. Moreover, while the Internet constitutes a powerful data collection tool which is more representative of real language use than the previously-mentioned texts, it excludes illiterate communities, as well as speakers who lack reliable internet access (as is often the case in rural developing settings, Figure 1).

Given the obstacles to using these common language data collection methods in LIC contexts, the NLP community can learn from methodologies adopted in other fields. Researchers from fields such as sustainable development (SD, Gleitsmann et al. (2007)), African studies (Adams, 2014), and ethnology (Skinner et al., 2013) tend to rely heavily on qualitative data from oral interviews, transcribed verbatim. Collecting such data in rural developing areas is considerably more difficult than in developed or urban contexts. In addition to high illiteracy levels, researchers face challenges such as seasonal roads and low population densities. To our knowledge, there are very few NLP works which explicitly focus on building corpora from rural and illiterate communities: of those works that exist, some present clear priming effect issues (Abraham et al., 2020), while others focus on application (Conforti et al., 2020). A detailed description of best practices for data collection remains a notable research gap.

3 Definitions and Challenges

Guided by research in medicine (Pannucci and Wilkins, 2010), sociology (Berk, 1983), and psychology (Gilovich et al., 2002), NLP has experienced increasing interest in ethics and bias mitigation to minimise unintentional demographic misrepresentation and harm (Hovy and Spruit, 2016). While there are many stages where bias may enter the NLP pipeline (Shah et al., 2019), we focus on those pertinent to data collection from rural illiterate communities in LICs, leaving the study of biases in model development for future work².

3.1 Data Collection Biases

Biases in data collection are inevitable (Marshall, 1996) but can be minimised when known to the researcher (Trembley, 1957). We identify various biases that can emerge when collecting language data in rural developing contexts, which fall under three broad categories: sampling, observer, and response bias. Sampling determines who is studied, the interviewer (or observer) determines what information is sought and how it is interpreted, and the interviewee (or respondent) determines which information is revealed (Woodhouse, 1998). These categories span the entire data collection process and can affect the quality and quantity of language data obtained.

² Note, this paper does not focus on a particular NLP application, as once the data has been collected from illiterate communities it can be annotated for virtually any specific task.
3.2 Sampling or selection bias

Sampling bias occurs when observations are drawn from an unrepresentative subset of the population being studied (Marshall, 1996) and applied more widely. In our context, this might arise when selecting communities from which to collect language data, or specific individuals within each community. When sampling communities, bias can be introduced if convenience is prioritized. Communities which are easier to access may not produce language data representative of a larger area or group. This can be illustrated through Uganda's refugee response, which consists of 13 settlements (including the 2nd largest in the world) hosted in 12 districts (UNHCR, 2020). Data collection may be easier in one of the older, established settlements; however, such data cannot be generalised over the entire refugee response due to the different cultural backgrounds, lengths of stay of refugees in different areas, and the varied stages along the humanitarian chain – emergency, recovery or development – found therein (Winter, 1983; OECD, 2019). Prioritizing convenience in this case may result in corpora which over-represent the cultural and economic contexts of more established, longer-term refugees. When sampling interviewees, bias can be introduced when certain sub-sets of a community have more data collected than others (Bryman, 2012). This is seen when data is collected only from men in a community due to cultural norms (Nadal, 2017), or only from wealthier people in cell-phone-based surveys (Labrique et al., 2017).

3.2.1 Observer bias

Observer bias occurs when there are systematic errors in how data is recorded, which may stem from observer viewpoints and predispositions (Gonsamo and D'Odorico, 2014). We identify three key observer biases relevant to our context.

Firstly, confirmation bias, which refers to the tendency to look for information which confirms one's preconceptions or hypotheses (Nickerson, 1998). Researchers collecting data in LICs may expect interviewees to express needs or hardships based on their preconceptions. As Kumar (1987) points out, "often they hear what they want to hear and ignore what they do not want to hear". A team conducting a needs assessment for a rural electrification project, for instance, may expect a need for electricity, and thus consciously or subconsciously seek data which confirms this, interpret potentially unrelated data as electricity-motivated (Hirmer and Guthrie, 2017), or omit data which contradicts their hypothesis (Peters, 2020). Using such data to train NLP models may introduce unintentional bias towards the original expectations of the researchers instead of accurately representing the community.

Secondly, misunderstanding: the interviewer's understanding and interpretation of the speaker's utterances might be influenced by their class, culture and language. Note that, particularly in countries without strong language standardisation policies, consistent semantic shifts can happen even between varieties spoken in neighboring regions (Gordon, 2019), which may result in systematic misunderstanding (Sayer, 2013). For example, in the neighboring Ugandan tribes of Toro and Bunyoro, the same word omunyoro means, respectively, husband and a member of the tribe. Language data collected in such contexts, if not properly handled, may contain inaccuracies which lead to NLP models that misrepresent these tribes. Rich information communicated through gesture, expression, and tone (i.e. nonverbal data, Oliver et al. (2005)) may also be systematically lost during verbatim transcription, causing inadvertent inconsistencies in the corpora.

Thirdly, interviewer bias, which refers to the subjectivity unconsciously introduced into data gathering by the worldview of the interviewer (Frey, 2018). For instance, a deeply religious interviewer may unintentionally frame questions through religious language (e.g. it is God's will, thank God, etc.), or may perceive certain emotions (e.g. thankfulness) as inherently religious, and record language data including this perception. The researcher's attitude and behaviour may also influence responses (Silverman, 2013); for instance, when interviewers take longer to deliver questions, interviewees tend to provide longer responses (Matarazzo et al., 1963). Unlike in internet-based language data collection, where all speakers are exposed to uniform, text-based interfaces, collecting data from illiterate communities necessitates the presence of an interviewer, who cannot always be the same person due to scalability constraints, introducing this inevitable variability and subsequent data bias.

3.2.2 Response bias

Response bias occurs when speakers provide inaccurate or false responses to questions.
This is particularly important when working in rural settings, where the majority of data collection is currently related to SD projects. The majority of existing data is biased by the projects for which it has been collected, and any newly collected data for NLP uses is also likely to be used in decision making for SD. This inherent link of data collection to material development outcomes inevitably affects what is communicated. There are five key response biases relevant to our context.

Firstly, recall bias, where speakers recall only certain events or omit details (Coughlin, 1990). This is often a result of external influences, such as the presence of a data collector who is new to the community. Recall can also be affected by the distortion or amplification of traumatic memories (Strange and Takarangi, 2015); if data is collected around a topic a speaker may find traumatic, recall bias may be unintentionally introduced.

Secondly, social desirability bias, which refers to the tendency of interviewees to provide socially desirable/acceptable responses rather than honest responses, particularly in certain interview contexts (Bergen and Labonté, 2020). In tight-knit rural communities, it may be difficult to deviate from traditional social norms, leading to biased data. As an illustrative example, researchers in Nepal found that interviewer gender affected the detail in responses to some sensitive questions (e.g. sex and contraception): participants provided less detail to male interviewers (Axinn, 1991). Social desirability bias can produce corpora which misrepresent community social dynamics or under-represent sensitive topics.

Thirdly, recency effect or serial-position effect, which is the tendency of a person to recall the first and last items in a series best, and the middle items worst (Troyer, 2011). This can greatly impact the content of language data. For instance, in the context of data collection to guide development work, it is important to understand current needs and values (Hirmer and Guthrie, 2016); however, if only the most recent needs are discussed, long-term needs may be overlooked. To illustrate, while a community which has just experienced a poor agricultural season may tend to express the importance of improving agricultural output, other needs which are less top-of-mind (e.g. healthcare, education) may be equally important despite being expressed less frequently. If data containing recency bias is used to develop NLP models, particularly for sustainable development applications (such as for Automatic UPV Classification, Conforti et al. (2020)), these may amplify current needs and under-represent long-term needs.

Fourthly, acquiescence bias, also known as "yea" saying (Laajaj and Macours, 2017), which can occur in rural developing contexts when interviewees perceive that certain (possibly false) responses will please a data collector and bring benefits to their community. For example, a group collecting data with a stated desire to build a school may be more likely to hear about how much education is valued.

Finally, priming effect, or the ability of a presented stimulus to influence one's response to a subsequent stimulus (Lavrakas, 2008). Priming is problematic in data collection to inform SD projects; it can be difficult to collect data on the relative importance of simultaneous (or conflicting) needs if the community is primed to focus on one (Veltkamp et al., 2011). An example is shown in Figure 2a; respondents may be drawn to speak more about the most dominant prompts presented in the chart. This is typical of a broader failure in SD to uncover beneficiary priorities without introducing project bias (Watkins et al., 2012). Needs assessments, like the one referenced above linked to a rural electrification project, tend to focus explicitly on project-related needs instead of more broadly identifying what may be most important to communities (Masangwi, 2015; USAID, 2006). As speakers will usually know why data is being collected in such cases, they may be biased towards stating the project aim as a need, thereby skewing the corpora to over-represent this aim.

3.3 Ethical Considerations

Certain ethical codes of conduct must be followed when collecting data from illiterate speakers in rural communities in LICs (Musoke et al., 2020). Unethical data collection may harm communities, treat them without dignity, disrupt their lives, damage intra-community or external relationships, and disregard community norms (Thorley and Henrion, 2019). This is particularly critical in rural developing regions, as these areas are home to some of the world's poorest and most vulnerable to exploitation (Christiaensen and Subbarao, 2005; de Cenival M., 2008). Unethical data collection can replicate extractive colonial relationships whereby data is extracted from communities with no mutual benefit or ownership (Dunbar and Scrimgeour, 2006). It can lead to a lack of trust between data collector and interviewees and an unwillingness to participate in future research (Clark, 2008).
These phenomena can bias data or reduce data availability. Ethical data collection practices in rural developing regions with high illiteracy include: obtaining consent (McAdam, 2004), accounting for cultural differences (Silverman, 2013), ensuring anonymity and confidentiality (Bryman, 2012), respecting existing community or leadership structures (Harding et al., 2012), and making the community the owner of the data. While the latter is not often currently practiced, it is an important consideration for community empowerment, with indigenous data sovereignty efforts (Rainie et al., 2019) already setting precedent.

4 Countermeasures

Drawing on existing literature and years of field experience collecting spoken data in LICs, below we outline a number of practical data collection strategies to minimise the previously-outlined challenges (§3), enabling the collection of high-quality, minimally-biased data from illiterate speakers in LICs suitable for use in NLP models. While these measures have primarily been applied in SSA, we have also successfully tested them in projects focusing on refugees in the Middle East and rural communities in South Asia.

4.1 Preparation

Here, we outline practical preparation steps for careful planning, which can minimise error and reduce fieldwork duration (Tukey, 1980).

Local Context. A thorough understanding of local context is key to successful data collection (Hentschel, 1999; Bukenya et al., 2012; Launiala and Kulmala, 2006). Local context is broadly defined as the facts, concepts, beliefs, values, and perceptions used by local people to interpret the world around them, and is shaped by their surroundings (i.e. their worldview, Vasconcellos and Vasconcellos Sobrinho (2014)). It is important to consider local context when preparing to collect data in rural developing areas, as common data collection methods may be inappropriate due to contextual linguistic differences and deep-rooted social and cultural norms (Walker and Hamilton, 2011; Mafuta et al., 2016; Nikulina et al., 2019; Wang et al.). Selecting a contextually-appropriate data collection method is critical in mitigating social desirability bias in the collected data, among other challenges. Researchers should review socio-economic surveys and/or consult local stakeholders who can offer valuable insights on practices and social norms. These stakeholders can also highlight current or historical matters of concern to the area, which may be unfamiliar to researchers, and reveal local, traditional, and indigenous knowledge which may impact the data being collected (Wu, 2014) and result in recency effect. It is good practice to identify local conflicts and segmentation within a community, especially in a rural context, where the population is vulnerable and systematically unheard (Dudwick et al., 2006; Mallick et al., 2011).

Case sampling. In qualitative research, sample cases are often strategically selected based on the research question (i.e. systematic or purposive sampling, Bryman (2012)), and on characteristics or circumstances relevant to the topic of study (Yach, 1992). If data collected in such research is used beyond its original scope, sampling bias may result. So, while data collected in previous research should be re-used to expand NLP corpora where possible, it is important to be cognizant of the purposive sampling underlying existing data. A comprehensive dataset characterisation (Bender and Friedman, 2018; Gebru et al., 2018) can help researchers understand whether an existing dataset is appropriate to use in new or different research, such as in training new NLP models, and can highlight the potential ethical concerns of data re-use.
Participant sampling. Interviewees should be selected to represent the diverse interests of a community or sampling group (e.g. occupation, age, gender, religion, ethnicity, or male/female household heads (Bryman, 2012)) to reduce sampling bias (Kitzinger, 1994). To ensure representativity in collected data, sampling should be random, i.e. every subject has an equal probability of being included (Etikan et al., 2016). There may be certain societal subsets that are concealed from view (e.g. as a result of embarrassment from disabilities or physical differences) based on cultural norms in less inclusive societies (Vesper, 2019); particular care should be exercised to ensure such subsets are represented.
Sampling biases
- Community: an unrepresentative sample set is generalised over the entire case being studied. Countermeasure: select representative communities & only apply data within the same scope (i.e. consult data statements).
- Participant: certain sub-sets of a community have more data collected from them than others. Countermeasure: select representative participants, only apply data within the same scope & avoid tempting rewards.

Observer biases
- Confirmation: looking for information that confirms one's preconceptions or hypotheses about a topic/research/sector. Countermeasure: employ interviewers who are impartial to the topic/research/sector investigated.
- Misunderstanding: data is incorrectly transcribed or categorized as a result of class, cultural, or linguistic differences. Countermeasure: employ local people & minimise the number of people involved in both data collection & transcription.
- Interviewer: unconscious subjectivity introduced into data gathering by interviewers' worldview. Countermeasure: undertake training to minimise influence exerted from questions, technology, & attitudes.

Response biases
- Recall: tendency of speakers to recall only certain events or omit details. Countermeasure: collect support data (e.g. from socio-economic data or local stakeholders) to compare with interviews.
- Social desirability: tendency of participants to provide socially desirable/acceptable responses rather than to respond honestly. Countermeasure: select interviewers & design interview processes to account for known norms which might skew responses.
- Recency effect: tendency to recall the first or last items in a series best, & the middle items worst. Countermeasure: minimise external influence on participants throughout data gathering (e.g. technologies, people, perceptions).
- Acquiescence: respondents perceive certain, perhaps false, answers may please data collectors, bringing community benefits. Countermeasure: gather non-sectoral holistic insights (e.g. from socio-economic data or local stakeholders).
- Priming effect: ability of a presented stimulus to influence one's response to a subsequent stimulus. Countermeasure: use appropriate visual prompts (graphically similar), language and technology.

Table 1: Sources of potential bias in data collection when operating in rural and illiterate settings in developing countries, and key countermeasures that can help mitigate them.
Group composition. Participant sampling best practices vary by data collection method, with particular care being necessary in group settings. In traditional societies where strong power dynamics exist, attention should be paid to group composition and interaction to prevent some voices from being silenced or over-represented (Stewart et al., 2007). For example, in Uganda, female interviewees may be less likely to voice opinions in the presence of male interviewees (FIDH, 2012; Axinn, 1991), introducing a form of social desirability bias in resulting corpora. To minimise this risk of data bias, relations and power dynamics must be considered during data collection planning (Hirmer, 2018). It may be necessary to exclude, for instance, close relatives, governmental officials, and village leaders from group discussions where data is being collected, and instead engage such stakeholders in separate activities to ensure that their voices are included in the corpora without biasing the data collected from others.

Interviewer selection. The interviewer has a significant opportunity to introduce observer and response biases in collected data (Salazar, 1990). Interviewers familiar with the local language, including community-specific dialects, should be selected wherever possible. Moreover, to reduce misunderstanding and recall biases in collected data, it is useful to have the same person who conducts the interviews also transcribe them. This minimizes the layers of linguistic interpretation affecting the final dataset and can increase accuracy through familiarity with the interview content. If the interviewer is unavailable, the transcriber must be properly trained and briefed on the interviews, and made aware of the level of detail needed during transcription (Parcell and Rafferty, 2017).

Study design. In rural LIC communities, qualitative data like natural language is usually collected by observation, interview, and/or focus group discussion (or a combination, known as mixed methods), which are transcribed verbatim (Moser and Korstjens, 2018). Prompts are often used to spark discussion. Whether visual prompts (Hirmer, 2018) or verbalised question prompts are used during data collection, these should be designed to: (i) accommodate illiteracy, (ii) account for disabilities (e.g. visual impairment; both could cause sampling bias), and (iii) minimise bias towards a topic or sector (e.g. minimising acquiescence bias and confirmation bias). For instance, visual prompts should be graphically similar and contain only visuals familiar to the respondents. This is analogous to the uniform interface with which speakers interact during text-based online data collection, where the platform used is graphically the same for all users inputting data. Using varied graphical styles or unfamiliar images may result in priming (Figure 2a). To minimise recall bias or recency effect in collected data, socio-economic data can be integrated in the data analysis to better understand, for example, whether the assertions made in collected data reference recent events. These data should be non-sector specific, to gain holistic insights and to minimise acquiescence bias and confirmation bias.

4.2 Engagement

Here, we outline practical steps for successful community engagement to achieve ethical and high-quality data collection.

Defining community. Defining a community in an open and participatory manner is critical to meaningful engagement (Dyer et al., 2014). By understanding the community the way they understand themselves, misunderstandings and tensions that affect data quality can be minimized. The definition of the community (MacQueen et al., 2001), coupled with the requirements and use-cases for the collected data, determines the data collection methodology and style which will be most appropriate (e.g. interview-based community consultation vs. collaborative co-design for mutual learning).

Follow formal structures. Researchers entering a community where they have no background to collect data should endeavour to know the community prior to commencing any work (Diallo et al., 2005). This could entail visiting the community and mapping its hierarchies of authority and decision-making pathways, which can guide the research team on how to interact respectfully with the community (Tindana et al., 2011). This process should also illuminate whether knowledgeable community members should facilitate entry by performing introductions and assisting the external data collection team. Following formal community structures is vital, especially in developing communities, where traditional rules and social conventions are strongly held yet often not articulated explicitly or documented. Approaching community leaders in the traditional way can help to build a positive long-term relationship, removing suspicion about the nature and motivation of the researchers' activities, explaining their presence in the community, and most importantly building trust as they are granted permission to engage the community by its leadership (Tindana et al., 2007).

Verbalising consent. Data ethics is paramount for research involving human participants (Accenture, 2016; Tindana et al., 2007), including any collection of personal and identifiable data, such as natural language. Genuine (i.e. voluntary and informed) consent must be obtained from interviewees to prevent use of data which is illegal, coercive, or for a purpose other than that which has been agreed (McAdam, 2004). The Nuffield Council on Bioethics (2002) cautions that in LICs, misunderstandings may occur due to cultural differences, lower socio-economic status, and illiteracy (McMillan et al., 2004), which can call into question the legitimacy of consent obtained. Researchers must understand that methods such as long information forms and consent forms which must be signed may be inappropriate for the cultural context of LICs and can be more likely to confuse than to inform (Tekola et al., 2009). The authors advise that consent forms should be verbal instead of written, with wording familiar to the interviewees and appropriate to their level of comprehension (Tekola et al., 2009). For example, to speak of data storage on a password-protected computer while obtaining consent in a rural community without access to electricity or information technology is unfitting. Innovative ways to record consent can be employed in such contexts (e.g. video taping or recording), as signing an official document may be "viewed with suspicion or even outright hostility" (Upjohn and Wells, 2016), or seen as "committing ... to something other than answering questions". Researchers new to qualitative data collection should seek advice from experienced researchers and approval from their ethics committee before implementing consent processes.

Approaching participants. Despite having gained permission from community authorities and obtained consent to collect data, researchers must be cautious when approaching participants (Irabor and Omonzejele, 2009; Diallo et al., 2005) to ensure they do not violate cultural norms. For example, in some cultures a senior family member must be present for another household member to be interviewed, or a female must be accompanied by a male counterpart during data collection. Insensitivity to such norms may compromise the data collection process; so, they should be carefully noted when researching local context (§4.1), and interviews should be designed to accommodate them where possible. Furthermore, researchers should investigate the motivations of the participants to identify when inducements become inappropriate and may lead to either harm or data bias (McAdam, 2004).
Minimise external influence. Researchers must be aware of how external influences can affect data collection (Ramakrishnan et al., 2012). We find three main levels of external influence: (i) technologies unfamiliar to a rural developing country context may induce social desirability bias or priming (e.g. if a researcher arrives to a community in an expensive vehicle or uses a tablet for data collection); (ii) intergroup context, which according to Abrams (2010) refers to how "people in different social groups view members of other groups" and may feel prejudiced or threatened by these differences. This can occur, for instance, when a newcomer arrives and speaks loudly relative to the indigenous community, which may be perceived as overpowering; (iii) the risk of a researcher over-incentivizing the data collection process, using leading questions and judgemental framing (interviewer bias or confirmation bias). To overcome these influences, researchers must be cognizant of their influence and minimise it by hiring local mediators where possible, alongside employing appropriate technology, mannerisms, and language.

4.3 Undertaking Interviews

Here, we detail practical steps to minimise challenges during the actual data collection.

Interview settings. People have personal values and drivers that may change in specific settings. For example, in the Ugandan Buganda and Busoga tribes, it is culturally appropriate for the male head, if present, to speak on behalf of his wife and children. This could lead to corpora where input from the husband is over-represented compared to the rest of the family. To account for this, it is important to collect data in multiple interview settings (e.g. individual, group male/female/mixed; Figures 2b, 2c). Additionally, the inputs of individuals in group settings should be considered independently to ensure all participants have an equal say, regardless of their position within the group (Barry et al., 2008; Gallagher et al., 1993). This helps to avoid social desirability bias in the data and is particularly important in various developing contexts where stereotypical gender roles are prominent (Hirmer, 2018). During interviews, verbal information can be supplemented through the observation of tone, cadence, gestures, and facial expressions (Narayanasamy, 2009; Hess et al., 2009), which could enrich the collected data with an additional layer of annotation.

Working with multiple interviewers. Arguably, one of the biggest challenges in data collection is ensuring consistency when working with multiple interviewers. Some may report word-for-word what is being said, while others may summarise or misreport, resulting in systematic misunderstanding. Despite these risks, employing multiple interviewers is often unavoidable when collecting data in rural areas of developing countries, where languages often exhibit a high number of regional, non-mutually intelligible varieties. This is particularly prominent across SSA. For example, 41 languages are spoken in Uganda (Nakayiza, 2016); English, the official language, is fluently spoken by only ∼5% of the population, despite being widely used among researchers and NGOs (Katushemererwe and Nerbonne, 2015). To minimise data inconsistency, researchers should: (i) undertake interviewer training workshops to communicate data requirements and practice data collection processes through mock field interviews; (ii) pilot the data collection process and seek feedback to spot early deviation from data requirements; (iii) regularly spot-check interview notes, as sketched below; (iv) support written notes with audio recordings³; and (v) offer quality-based incentives to data collectors.
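Point (iii) can be partially systematised once audio recordings exist (point (iv)): written notes can be compared against an independent re-transcription of the same audio, and large divergences flagged for manual review. The following sketch is a crude illustration only; the similarity measure and threshold are arbitrary and unvalidated, and flagged items call for human judgement rather than automatic rejection:

from difflib import SequenceMatcher

def flag_for_review(notes, retranscription, threshold=0.7):
    # Compare an interviewer's written notes with an independent
    # re-transcription of the audio; a low word-overlap ratio may
    # indicate summarising or misreporting, but it will also fire
    # on legitimate paraphrase, so it only triages for review.
    ratio = SequenceMatcher(None, notes.lower().split(),
                            retranscription.lower().split()).ratio()
    return ratio < threshold

retranscription = "we need water for the garden and for the animals"
verbatim_notes = "we need water for the garden and for the animals"
summary_notes = "respondent wants water"
print(flag_for_review(verbatim_notes, retranscription))  # False
print(flag_for_review(summary_notes, retranscription))   # True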
Participant remuneration. While it is common to offer interviewees some form of remuneration for their time, the decision surrounding payment is ethically-charged and widely contested (Hammett and Sporton, 2012). Rewards may tempt people to participate in data collection against their judgement. They can introduce sampling bias or create power dynamics resulting in acquiescence bias (Largent and Lynch, 2017). Barbour (2013) offers three practical solutions: (i) not advertising payment; (ii) omitting the amount being offered; or (iii) offering non-financial incentives (e.g. products that are desirable but difficult to get in an area). The decision whether or not to remunerate should not be based upon the researcher's own ethical beliefs and resources, but instead made by considering the specific context⁴, interviewee expectations, precedents set by previous researchers, and local norms (Hammett and Sporton, 2012). Representatives from local organisations (such as NGOs or governmental authorities) may be able to offer advice.

³ Relying only on audio recording may be risky: equipment can fail or run out of battery (which is not easily remedied in rural off-grid regions), and seasonal factors (such as noise from rain on corrugated iron sheets, commonly used for roofing in SSA) can make recordings inaudible (Hirmer, 2018).

⁴ In rural Uganda, for example, politicians commonly engage in vote buying by distributing gifts (Blattman et al., 2019) such as soap or alcohol. Such gifts are therefore considered an unruly form of remuneration, which can only be avoided where this precedent is known.
Figure 2: Collecting oral data in rural Uganda. 2a shows a priming effect (note the word "Energy" in the poster's title and the visual differences between prompt items). On the contrary, 2b and 2c show minimal priming; note also that different demographics are interviewed separately (a women's group, single men) to avoid social desirability bias. While participants' permission to photograph was granted, photos were pixelised to protect identity.

4.4 Post-interviewing

Here, we discuss practical strategies to mitigate ethical issues surrounding the management and stewardship of collected data.

Anonymisation. To protect the participants' identity and data privacy, locations, proper names, and culturally explicit aspects (such as tribe names) in collected data should be made anonymous (Sweeney, 2000; Kirilova and Karcher, 2017). This is particularly important in countries with security issues and low levels of democracy.
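As a minimal illustration, anonymisation can be implemented as placeholder substitution over a list of sensitive terms. The gazetteer below is invented; in practice it would need to be compiled with local stakeholders, since off-the-shelf named entity recognisers rarely cover low-resource language varieties:

import re

# Hypothetical gazetteer of sensitive terms, compiled locally.
SENSITIVE = {
    "PERSON":   ["Nakato", "Okello"],
    "LOCATION": ["Kyangwali", "Bidi Bidi"],
    "TRIBE":    ["Toro", "Bunyoro"],
}

def pseudonymise(text, gazetteer=SENSITIVE):
    # Replace sensitive mentions with numbered placeholders and
    # return the redacted text together with the placeholder key.
    mapping, counters = {}, {}
    for category, terms in gazetteer.items():
        for term in terms:
            pattern = re.compile(rf"\b{re.escape(term)}\b")
            if pattern.search(text):
                counters[category] = counters.get(category, 0) + 1
                placeholder = f"[{category}-{counters[category]}]"
                mapping[placeholder] = term
                text = pattern.sub(placeholder, text)
    return text, mapping

redacted, key = pseudonymise("Nakato moved from Kyangwali last year.")
print(redacted)  # [PERSON-1] moved from [LOCATION-1] last year.

Note that the returned mapping can re-identify participants, so it is itself sensitive and must be stored and handled under the data management plan discussed next.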
Safeguarding data. A primary responsibility of the researcher is to safeguard participants' data (Kirilova and Karcher, 2017). In addition to anonymizing data, mechanisms for data management include in-place handling and storage of data (UKRI, 2020a). Whatever data management plan is adopted, it must be clearly articulated to participants before the start of the interview (i.e. as part of the consent process (Silverman, 2013)), as was discussed in §4.2 (Verbalising consent).

Withdrawing consent. Participants should have the ability to withdraw from research within a specified time frame. This is known as withdrawal of consent and is commonly done by phone or email (UKRI, 2020b). As people in rural illiterate communities have limited means and technology access, a local phone number and the contact details of a responsible person in the area should be provided to facilitate consent withdrawal.

Communication and research fatigue. While researchers frequently extract knowledge and data from communities, only rarely are findings fed back to communities in a way that can be useful to them. Whatever the research outcomes, researchers should share the results with participating communities in an appropriate manner. In illiterate communities, for instance, murals (Jimenez, 2020), artwork, speeches, or song could be used to communicate findings. Not communicating findings may result in research fatigue, as people in over-studied communities are no longer willing to participate in data collection. This is common "where repeated engagements do not lead to any experience of change [...]" (Clark, 2008). Patel et al. (2020) offer practical guidance to minimise research fatigue by: (i) increasing transparency about the research purpose at the beginning of the research, and (ii) engaging with gatekeepers or oversight bodies to minimise the number of engagements per participant. Failure to restrict the number of times that people are asked to participate in studies risks poor future participation (Patel et al., 2020), which can also lead to sampling bias.

5 Conclusion

In this paper, we provided a first step towards defining best practices in data collection in rural and illiterate communities in Low-Income Countries to create globally representative corpora. We proposed a comprehensive classification of sources of bias and unethical practices that might arise in the data collection process, and discussed practical steps to minimise their negative effects. We hope that this work will motivate NLP practitioners to include input from rural illiterate communities in their research, and facilitate smooth and respectful interaction with communities during data collection. Importantly, despite the challenges that working in such contexts might bring, the effort to build substantial and high-quality corpora which represent this subset of the population can result in considerable SD outcomes.
Acknowledgments

We thank the anonymous reviewers for their constructive feedback. We are also grateful to Claire McAlpine, as well as Malcolm McCulloch and other members of the Energy and Power Group (University of Oxford), for providing valuable feedback on early versions of this paper. This research was carried out as part of the Oxford Martin Programme on Integrating Renewable Energy. Finally, we are grateful to the Rural Senses team for sharing experiences on data collection.

References

Basil Abraham, Danish Goel, Divya Siddarth, Kalika Bali, Manu Chopra, Monojit Choudhury, Pratik Joshi, Preethi Jyoti, Sunayana Sitaram, and Vivek Seshadri. 2020. Crowdsourcing speech data for low-resource languages from low-income workers. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2819–2826, Marseille, France. European Language Resources Association.

Dominic Abrams. 2010. Processes of prejudices: Theory, evidence and intervention. Equalities and Human Rights Commission.

Accenture. 2016. Building digital trust: The role of data ethics in the digital age. Accenture Labs.

Glenn Adams. 2014. Decolonizing methods: African studies and qualitative research. Journal of Social and Personal Relationships, 31(4):467–474.

William G Axinn. 1991. The influence of interviewer sex on responses to sensitive questions in Nepal. Social Science Research, 20(3):303–318.

Rosaline Barbour. 2013. Introducing Qualitative Research: A Student's Guide. Sage.

Marie-Louise Barry, Herman Steyn, and Alan Brent. 2008. Determining the most important factors for sustainable energy technology selection in Africa: Application of the focus group technique. In PICMET '08 – 2008 Portland International Conference on Management of Engineering & Technology, pages 181–187. IEEE.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Nicole Bergen and Ronald Labonté. 2020. "Everything is perfect, and we have no problems": Detecting and limiting social desirability bias in qualitative research. Qualitative Health Research, 30(5):783–792.

Richard A Berk. 1983. An introduction to sample selection bias in sociological data. American Sociological Review, pages 386–398.

The Nuffield Council on Bioethics. 2002. The ethics of research related to healthcare in developing countries. The Nuffield Council on Bioethics is funded jointly by the Medical Research Council, the Nuffield Foundation and the Wellcome Trust.

Christopher Blattman, Horacio Larreguy, Benjamin Marx, and Otis R Reid. 2019. Eat widely, vote wisely? Lessons from a campaign against vote buying in Uganda. Technical report, National Bureau of Economic Research.

Alan Bryman. 2012. Mixed methods research: combining quantitative and qualitative research. In Alan Bryman, editor, Social Research Methods, fourth edition, chapter 27, pages 628–652. Oxford University Press, New York.

Badru Bukenya, Sam Hickey, and Sophie King. 2012. Understanding the role of context in shaping social accountability interventions: towards an evidence-based approach. Manchester: Institute for Development Policy and Management, University of Manchester.

de Cenival M. 2008. Ethics of research: the freedom to withdraw. Bulletin de la Société de Pathologie Exotique, 101(2):98–101.

Luc J. Christiaensen and Kalanidhi Subbarao. 2005. Towards an understanding of household vulnerability in rural Kenya. Journal of African Economies, 14(4):520–558.

Tom Clark. 2008. 'We're over-researched here!' Exploring accounts of research fatigue within qualitative research engagements. Sociology, 42(5):953–970.

Costanza Conforti, Stephanie Hirmer, Dai Morgan, Marco Basaldella, and Yau Ben Or. 2020. Natural language processing for achieving sustainable development: the case of neural labelling to enhance community profiling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8427–8444, Online. Association for Computational Linguistics.

Steven S Coughlin. 1990. Recall bias in epidemiologic studies. Journal of Clinical Epidemiology, 43(1):87–91.

D. A. Diallo, O. K. Doumbo, C. V. Plowe, T. E. Wellems, E. J. Emanuel, and S. A. Hurst. 2005. Community permission for medical research in developing countries. Infectious Diseases Society of America, 41(2):255–259.

Nora Dudwick, Kathleen Kuehnast, Veronica N. Jones, and Michael Woolcock. 2006. Analyzing social capital in context: A guide to using qualitative methods and data.
Terry Dunbar and Margaret Scrimgeour. 2006. Ethics in indigenous research – connecting with community. Journal of Bioethical Inquiry, 3(3):179–185.

J Dyer, L.C Stringer, A.J Dougill, J Leventon, M Nshimbi, F Chama, A Kafwifwi, J.I Muledi, J.-M.K Kaumbu, M Falcao, S Muhorro, F Munyemba, G.M Kalaba, and S Syampungani. 2014. Assessing participatory practices in community-based natural resource management: Experiences in community engagement from southern Africa. Journal of Environmental Management, 137:137–145.

Ilker Etikan, Sulaiman Abubakar Musa, and Rukayya Sunusi Alkassim. 2016. Comparison of convenience sampling and purposive sampling. American Journal of Theoretical and Applied Statistics, 5(1):1–4.

FIDH. 2012. Women's rights in Uganda: gaps between policy and practice. Technical report, International Federation for Human Rights, Paris.

Bruce B Frey. 2018. The SAGE Encyclopedia of Educational Research, Measurement, and Evaluation. Sage Publications.

Batya Friedman, Lisa P. Nathan, and Daisy Yoo. 2017. Multi-lifespan information system design in support of transitional justice: Evolving situated design principles for the long(er) term. Interacting with Computers, 29(1):80–96.

Morris Gallagher, Tim Hares, John Spencer, Colin Bradshaw, and Ian Webb. 1993. The nominal group technique: a research tool for general practice? Family Practice, 10(1):76–81.

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for datasets. arXiv preprint arXiv:1803.09010.

Thomas Gilovich, Dale Griffin, and Daniel Kahneman. 2002. Heuristics and Biases: The Psychology of Intuitive Judgment. Cambridge University Press.

Brett A Gleitsmann, Margaret M Kroma, and Tammo Steenhuis. 2007. Analysis of a rural water supply project in three communities in Mali: Participation and sustainability. In Natural Resources Forum, volume 31, pages 142–150. Wiley Online Library.

Alemu Gonsamo and Petra D'Odorico. 2014. Citizen science: best practices to remove observer bias in trend analysis. International Journal of Biometeorology, 58(10):2159–2163.

Matthew J Gordon. 2019. Language variation and change in rural communities. Annual Review of Linguistics, 5:435–453.

Sophie Groenwold, Lily Ou, Aesha Parekh, Samhita Honnavalli, Sharon Levy, Diba Mirza, and William Yang Wang. 2020. Investigating African-American Vernacular English in transformer-based text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5877–5883, Online. Association for Computational Linguistics.

Daniel Hammett and Deborah Sporton. 2012. Paying for interviews? Negotiating ethics, power and expectation. Area, 44(4):496–502.

Anna Harding, Barbara Harper, Dave Stone, Catherine O'Neill, Patricia Berger, Stuart Harris, and Jamie Donatuto. 2012. Conducting research with tribal communities: Sovereignty, ethics, and data-sharing issues. Environmental Health Perspectives, 120(1):6–10.

J. Hentschel. 1999. Contextuality and data collection methods: A framework and application to health service utilisation. Journal of Development Studies, 35(4):64–94.

Ursula Hess, RB Jr Adams, and Robert E Kleck. 2009. Intergroup misunderstandings in emotion communication. Intergroup Misunderstandings: Impact of Divergent Social Realities, pages 85–100.

Stephanie Hirmer. 2018. Improving the Sustainability of Rural Electrification Schemes: Capturing Value for Rural Communities in Uganda. Ph.D. thesis, University of Cambridge.

Stephanie Hirmer and Peter Guthrie. 2016. Identifying the needs of communities in rural Uganda: A method for determining the 'user-perceived value' of rural electrification initiatives. Renewable and Sustainable Energy Reviews, 66:476–486.

Stephanie Hirmer and Peter Guthrie. 2017. The benefits of energy appliances in the off-grid energy sector based on seven off-grid initiatives in rural Uganda. Renewable and Sustainable Energy Reviews, 79:924–934.

Dirk Hovy. 2015. Demographic factors improve classification performance. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 752–762, Beijing, China. Association for Computational Linguistics.

Dirk Hovy and Anders Søgaard. 2015. Tagging performance correlates with author age. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 483–488, Beijing, China. Association for Computational Linguistics.

Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598, Berlin, Germany. Association for Computational Linguistics.
D. O. Irabor and P. Omonzejele. 2009. Local attitudes, moral obligation, customary obedience and other cultural practices: their influence on the process of gaining informed consent for surgery in a tertiary institution in a developing country. Developing World Bioethics, 9(1):34–42.

ITU. 2019. International Telecommunication Union World Telecommunication/ICT Indicators Database. World Bank Open Data.

Stephany Jimenez. 2020. Creatively Communicating through Visual and Verbal Art – Poetry and Murals. Yale National Initiative.

Fridah Katushemererwe and John Nerbonne. 2015. Computer-assisted language learning (CALL) in support of (re)-learning native languages: the case of Runyakitara. Computer Assisted Language Learning, 28(2):112–129.

Benjamin Philip King. 2015. Practical Natural Language Processing for Low-Resource Languages. Ph.D. thesis, University of Michigan.

Dessi Kirilova and Sebastian Karcher. 2017. Rethinking data sharing and human participant protection in social science research: Applications from the qualitative realm. Data Science Journal, 16.

Jenny Kitzinger. 1994. The methodology of Focus Groups: the importance of interaction between research participants. Sociology of Health and Illness, 16(1):103–121.

Krishna Kumar. 1987. Conducting group interviews in developing countries. US Agency for International Development, Washington, DC.

Rachid Laajaj and Karen Macours. 2017. Measuring skills in developing countries. The World Bank.

Alain Labrique, Emily Blynn, Saifuddin Ahmed, Dustin Gibson, George Pariyo, and Adnan A Hyder. 2017. Health surveys using mobile phones in developing countries: automated active strata monitoring and other statistical considerations for improving precision and reducing biases. Journal of Medical Internet Research, 19(5):e121.

Emily A Largent and Holly Fernandez Lynch. 2017. Paying research participants: regulatory uncertainty, conceptual confusion, and a path forward. Yale Journal of Health Policy, Law, and Ethics, 17(1):61.

A. Launiala and T. Kulmala. 2006. The importance of understanding the local context: Women's perceptions and knowledge concerning malaria in pregnancy in rural Malawi. Acta Tropica, 98(2):111–117.

Paul J Lavrakas. 2008. Encyclopedia of Survey Research Methods. Sage Publications.

Tuan Anh Le, David Moeljadi, Yasuhide Miura, and Tomoko Ohkuma. 2016. Sentiment analysis for low resource languages: A study on informal Indonesian tweets. In Proceedings of the 12th Workshop on Asian Language Resources (ALR12), pages 123–131.

KM MacQueen, E McLellan, DS Metzger, S Kegeles, RP Strauss, and et al. 2001. What is community? An evidence-based definition for participatory public health. American Journal of Public Health, 91:1929–1938.

Eric M. Mafuta, Lisanne Hogema, Thérèse N.M. Mambu, Pontien B Kiyimbi, Berthys P. Indebe, Patrick K. Kayembe, Tjard De Cock Buning, and Marjolein A. Dieleman. 2016. Understanding the local context and its possible influences on shaping, implementing and running social accountability initiatives for maternal health services in rural Democratic Republic of the Congo: a contextual factor analysis. Science Advances, 16(1):1–13.

B. Mallick, K. Rubayet Rahaman, and J. Vogt. 2011. Social vulnerability analysis for sustainable disaster mitigation planning in coastal Bangladesh. Disaster Prevention and Management, 20(3):220–237.

Vukosi Marivate, Tshephisho Sefara, Vongani Chabalala, Keamogetswe Makhaya, Tumisho Mokgonyane, Rethabile Mokoena, and Abiodun Modupe. 2020. Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi. arXiv preprint arXiv:2003.04986.

M N Marshall. 1996. Sampling for qualitative research. Family Practice, 13(6):522–5.

Salule J. Masangwi. 2015. Methodology for Solar PV Needs Assessment in Chikwawa, Southern Malawi. Technical report, Malawi Renewable Energy Acceleration Programme (MREAP): Renewable Energy Capacity Building Programme (RECBP).

Joseph D Matarazzo, Morris Weitman, George Saslow, and Arthur N Wiens. 1963. Interviewer influence on durations of interviewee speech. Journal of Verbal Learning and Verbal Behavior, 1(6):451–458.

Keith McAdam. 2004. The ethics of research related to healthcare in developing countries. Acta Bioethica, 10(1):49–55.

J. R. McMillan, C. Conlon, and Nuffield Council on Bioethics. 2004. The ethics of research related to healthcare in developing countries. Journal of Medical Ethics, 30:204–206.

Albine Moser and Irene Korstjens. 2018. Series: Practical guidance to qualitative research. Part 3: Sampling, data collection and analysis. European Journal of General Practice, 24(1):9–18.
Aaron Mueller, Garrett Nicolai, Arya D McCarthy, Dylan Lewis, Winston Wu, and David Yarowsky. 2020. An analysis of massively multilingual neural machine translation for low-resource languages. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3710–3718.

David Musoke, Charles Ssemugabo, Rawlance Ndejjo, Sassy Molyneux, and Elizabeth Ekirapa-Kiracho. 2020. Ethical practice in my work: community health workers' perspectives using photovoice in Wakiso district, Uganda. BMC Medical Ethics, 21(1):1–10.

Kevin L. Nadal. 2017. Sampling Bias and Gender. In The SAGE Encyclopedia of Psychology and Gender.

Judith Nakayiza. 2016. The sociolinguistic situation of English in Uganda. Ugandan English. Amsterdam/Philadelphia: John Benjamins, pages 75–94.

N. Narayanasamy. 2009. Participatory Rural Appraisal: Principles, Methods and Application, first edition. SAGE Publications India Pvt Ltd, New Delhi.

Raymond S Nickerson. 1998. Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2(2):175–220.

V. Nikulina, J. Larson Lindal, H. Baumann, D. Simon, and H Ny. 2019. Lost in translation: A framework for analysing complexity of co-production settings in relation to epistemic communities, linguistic diversities and culture. Futures, 113(102442):1–13.

OECD. 2019. Survey of refugees and humanitarian staff in Uganda. Joint effort by Ground Truth Solutions (GTS) and the Organisation for Economic Co-operation and Development (OECD) Secretariat with financial support from the United Kingdom's Department for International Development (DFID).

Daniel G Oliver, Julianne M Serovich, and Tina L Mason. 2005. Constraints and opportunities with interview transcription: Towards reflection in qualitative research. Social Forces, 84(2):1273–1289.

Sean Packham. 2016. Crowdsourcing a text corpus for a low resource language. Ph.D. thesis, University of Cape Town.

Christopher J Pannucci and Edwin G Wilkins. 2010. Identifying and avoiding bias in research. Plastic and Reconstructive Surgery, 126(2):619.

Erin S. Parcell and Katherine A. Rafferty. 2017. Interviews, recording and transcribing. In Mike Allen, editor, The SAGE Encyclopedia of Communication Research Methods, pages 800–803. SAGE Publications, Thousand Oaks.

Sonny S Patel, Rebecca K Webster, Neil Greenberg, Dale Weston, and Samantha K Brooks. 2020. Research fatigue in COVID-19 pandemic and post-disaster research: causes, consequences and recommendations. Disaster Prevention and Management: An International Journal.

Uwe Peters. 2020. What is the function of confirmation bias? Erkenntnis, pages 1–26.

Stephanie Carroll Rainie, Tahu Kukutai, Maggie Walter, Oscar Luis Figueroa-Rodríguez, Jennifer Walker, and Per Axelsson. 2019. Indigenous data sovereignty.

Thiagarajan Ramakrishnan, Mary C. Jones, and Anna Sidorova. 2012. Factors influencing business intelligence (BI) data collection strategies: An empirical investigation. Decision Support Systems, 52:486–496.

Max Roser and Esteban Ortiz-Ospina. 2016. Literacy. Our World in Data.

Dwaipayan Roy, Sumit Bhatia, and Prateek Jain. 2020. A topic-aligned multilingual corpus of Wikipedia articles for studying information asymmetry in low resource languages. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2373–2380, Marseille, France. European Language Resources Association.

Mary Kathryn Salazar. 1990. Interviewer bias: How it affects survey research. AAOHN Journal, 38(12):567–572.

Inaad Mutlib Sayer. 2013. Misunderstanding and language comprehension. Procedia – Social and Behavioral Sciences, 70:738–748.

Deven Shah, H. Andrew Schwartz, and Dirk Hovy. 2019. Predictive biases in natural language processing models: A conceptual framework and overview.

David Silverman. 2013. Doing Qualitative Research, fourth edition. SAGE Publications Inc.

Jonathan Skinner et al. 2013. The Interview: An Ethnographic Approach, volume 49. A&C Black.

David Stewart, Prem Shamdasani, and Dennis Rook. 2007. Group dynamics and focus group research. In David Stewart, Prem Shamdasani, and Dennis Rook, editors, Focus Groups: Theory and Practice, chapter 2, pages 19–36. SAGE Publications, Thousand Oaks.

Deryn Strange and Melanie KT Takarangi. 2015. Memory distortion for traumatic events: the role of mental imagery. Frontiers in Psychiatry, 6:27.

Latanya Sweeney. 2000. Simple demographics often identify people uniquely. Health (San Francisco), 671(2000):1–34.

Fasil Tekola, Susan J. Bull, Bobbie Farsides, Melanie J. Newport, Adebowale Adeyemo, Charles N. Rotimi, and Gail Davey. 2009. Tailoring consent