Newcomer Candidate: Characterizing Contributions of a Novice Developer to GitHub



                                         IFraz Rehman · Dong Wang · Raula
                                         Gaikovina Kula · Takashi Ishio · Kenichi
                                         Matsumoto

arXiv:2101.08903v1 [cs.SE] 22 Jan 2021

                                         Abstract The ability for an Open Source Software (OSS) project to attract,
                                         onboard, and retain any newcomer is vital to its livelihood. Evidence suggests
                                         that more new users are joining GitHub; however, the extent to which they
                                         contribute to OSS projects is unknown. In this study, we coin the term ‘newcomer
                                         candidate’ to describe a novice developer that is a new user to the GitHub
                                         platform, with the intention to later onboard an OSS project. Our objective is
                                         to track and characterize their initial contributions using a mixed-method
                                         approach. Our results show that 68% of newcomer candidates practice
                                         non-social coding, 86% tend to work on forward-engineering activities
                                         in their first commits, and 53% show interest in targeting non-software
                                         repositories. Our quantitative analysis matched only 3% of newcomer
                                         candidates' contributions to established OSS repositories, yet 70% of newcomer
                                         candidates claim to have already onboarded an OSS project. This study opens up new
                                         avenues for future work, especially in terms of targeting potential contribu-
                                         tors to onboard an existing OSS project. More practical applications would
                                         be tool support to (i) recommend practical examples that OSS project teams
                                         can use to lower their barriers for a newcomer candidate to successfully make
                                         a contribution and (ii) recommend suitable repositories for newcomer candi-
                                         dates based on their preference. Researchers can explore strategies to sustain
                                         newcomer candidate activities until they are ready to onboard an OSS project.
                                         Keywords Newcomer, Open Source Projects, GitHub

                                         1 Introduction

                                         The success of Open Source Software (OSS) has always depended on the continuous
                                         influx of newcomers and their active involvement (Park and Jensen, 2009).
                                         IFraz Rehman, Dong Wang, Raula Gaikovina Kula, Takashi Ishio, Kenichi Matsumoto
                                         Nara Institute of Science and Technology, Japan
                                         E-mail: {rehman.ifraz.qy4,wang.dong.vt8,raula-k,ishio,matumoto}@is.naist.jp

Recent studies have shown evidence that many contemporary projects are at
risk of failure, one reason being their inability to attract and retain
newcomers (Fang and Neufeld, 2009; Valiev et al, 2018). For example, Coelho
and Valente (2017) proposed two strategies that involve newcomers: transferring
the project to new maintainers and accepting new core developers.
In another study, Steinmacher et al (2014a) presented a model that analyzed
the forces that draw newcomers toward a project or push them away from it.
    Most existing work revolves around newcomers onboarding OSS projects.
Newcomers can be novice developers who are starting their career, or experi-
enced developers from an industry who are new to OSS projects, or developers
who migrated from other OSS projects. The term newcomer has usually been
used loosely in the literature (Steinmacher et al, 2014b). Inspired by
incubation projects of OSS, we coin the term newcomer candidate as “a novice
developer that is a new user to the GitHub platform, with the intention to
later onboard an OSS project”.
    Interestingly, GitHub reported around 10 million new users in 2019.[1] With
this upsurge in newcomer candidate activity, the extent to which these
contributions assist OSS projects is unknown. In addition, GitHub,[2] as a social
coding platform, allows over 40 million developers to showcase their skills to the
world’s largest community (44 million repositories). Although a substantial
body of work has studied the barriers and struggles of newcomers,
none has explored the contribution types of newcomer candidates.
    To fill this gap, our paper executes the research protocols of a registered
report (Rehman et al, 2020) to investigate the contributions of newcomer
candidates. We study 177 newcomer candidates who were verified as not having
any experience of contributing to OSS projects. We formulate four research
questions along with their motivations to guide our study:
    – (RQ1) To what extent does a newcomer candidate practice social
      coding? Scacchi (2002) showed that newcomers are more likely to learn on
      their own. Our motivation for the first research question is to understand
      whether or not a newcomer candidate tends to collaborate with other users.
      Since GitHub is a social platform, it is unclear whether newcomer
      candidates practice social coding or learn on their own. Thus, we raise the
      following hypothesis to confirm our assumption: (H1) A newcomer candidate
      is more likely to practice social coding to GitHub.
    – (RQ2) What are the kinds of initial contributions that come from
      a newcomer candidate? We would like to investigate the typical
      activities that newcomer candidates engage in. Answering this research question
      will allow us to understand the nature of their initial contributions. Our
      hypothesis is (H2) A contribution to GitHub repository for a newcomer
      candidate is more likely to add new content.
    – (RQ3) What kinds of repositories does a newcomer candidate tar-
      get? Kalliamvakou et al (2014) showed that most repositories on GitHub
    1   Statistics from https://octoverse.github.com accessed January 2020
    2   https://github.com

   are non-software related and are for personal use. Thus, the motivation is
   to understand the kinds of projects that attract the interest of a newcomer
   candidate. Our hypothesis is (H3) A newcomer candidate is more likely to
   target software repositories.
 – (RQ4) What proportion of newcomer candidates eventually on-
   board an OSS project? In this exploratory research question, we inves-
   tigate the proportion of newcomer candidates that eventually onboard an
   OSS project. Additionally, we validate what kinds of barriers newcomer
   candidates face when onboarding.
    Key results of each RQ are as follows: For RQ1, we show that 68% of
newcomer candidates do not practice social coding after joining GitHub. These
results indicate that newcomer candidates are less likely to collaborate
with other developers in their initial contributions. For RQ2, we identified
that 86% of newcomer candidates’ contributions add new features and
requirements (i.e., forward-engineering activities). For RQ3, results show that
53% of newcomer candidates are likely to target non-software based
repositories, with documentation (21%) and experimental (24%) being the most
frequently targeted repository kinds (fork and PR, clone and push workflows). For
RQ4, although our quantitative analysis matched only 3% of newcomer
candidates to established OSS repositories, in the survey, 70% of newcomer
candidates claimed that they had already started to contribute to OSS repositories.
Furthermore, newcomer candidates strongly agree that they face the barrier
of finding a way to start, while social interaction received the most mixed
responses as a barrier.
    This study has the following implications and recommendations. We
recommend that newcomer candidates read the social coding related guidelines and
become familiar with the environment. More practical applications would be
tool support to (i) recommend practical examples that OSS project teams
can use to lower their barriers for a newcomer candidate to successfully make
a contribution and (ii) recommend suitable repositories for newcomer candi-
dates based on their preference. Researchers can explore strategies to sustain
newcomer candidate activities until they are ready to onboard an OSS project.
    The remainder of this paper is organized as follows: Section 2 introduces
the concept of making a contribution to GitHub. Section 3 describes the data
preparation, which includes preliminary survey verification and mining newcomer
candidate repositories. Section 4 and Section 5 report the approaches
and results of our empirical study, while Section 6 discusses the implications of
our findings. Section 7 discloses the threats to validity, and Section 8 presents
related work. Finally, we conclude the paper in Section 9.

2 Making a Contribution to GitHub

To contribute to a GitHub repository, we first need to understand the workflow
of contributions. This section describes two contribution workflows and then
further defines how we characterize a GitHub contribution.

Fig. 1: Two workflows for GitHub contributions: i) Fork and PR and ii) Clone
and Push. The diagram shows the workflow for an author AUT, where R denotes a
repository, C a set of commit changes, and PR a pull request.

Fork and Pull Request (PR) Workflow. Figure 1 shows a basic conceptual
diagram of the fork and PR workflow for an author AUT. Let R
denote a repository, C a set of commit changes, and PR a pull
request. We now detail each step:

1. Forking a repository. In order to make changes, an author has to create an
   online copy of the repository to which they intend to contribute. As
   shown in the figure, AUTA makes a fork of repository R. We now call this
   repository R′.
2. Cloning a forked repository. Once a fork is made, the author downloads
   a copy of the forked repository, thus creating a local copy on their
   computer that can be synced with the fork. As shown in the figure, AUTA clones R′
   onto their local computer, which becomes the clone repository R′C.
3. Committing changes to a forked repository. Once the local copy is cloned,
   the author can change the local git repository, which involves individual
   changes such as adding, deleting, or modifying files. This set of changes
   is known as commit changes. As shown in the figure, we make a set of
   commit changes C1 to repository R′C.
4. Submitting changes as a Pull Request (PR). Finally, in order to commit
   changes to the original repository, an author needs to submit a PR. The
   PR allows the author to inform others about changes they have pushed to
   a branch in a repository hosted on GitHub. The owner of the original
   repository then decides whether or not to accept the PR. As shown in the

Fig. 2: An example of social coding, where more than one author contributes to
the git.gemspec file. The example is available at
https://github.com/ruby-git/ruby-git/blame/master/git.gemspec.

     figure, the pull request PR contains the set of commit changes C that will be
     submitted to the original repository R, thus completing the workflow.

Clone and Push Workflow. We now detail each step of the clone and push
workflow:
1. Cloning a repository. Similar to step two of the fork and PR workflow, the
   author downloads a local copy of the repository. As shown in the first step
   in the figure, AUTB directly downloads a local copy of repository R.
2. Committing changes to a repository. Similar to step three of the fork and
   PR workflow, an author can make changes to the local git repository. In
   this example, we push a set of commit changes C2 to repository R.
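
The local git steps of the two workflows can be sketched as command sequences. This is a minimal illustration and not part of the study's tooling: the function names, the use of a topic branch, and the placeholder URLs are assumptions. Forking and opening a pull request happen on GitHub itself, so they appear only as comments.

```python
from typing import List


def fork_and_pr_commands(fork_url: str, branch: str, message: str) -> List[List[str]]:
    """Local git commands for the fork and PR workflow (hypothetical helper).

    Step 1 (forking R) and step 4 (opening the PR) are performed on GitHub,
    not through git, so they are not part of the returned list.
    Commands after the clone are meant to run inside the cloned directory.
    """
    return [
        ["git", "clone", fork_url],          # step 2: clone the fork to a local copy
        ["git", "checkout", "-b", branch],   # assumption: work on a topic branch
        ["git", "add", "--all"],             # stage added, deleted, or modified files
        ["git", "commit", "-m", message],    # step 3: record the commit changes
        ["git", "push", "origin", branch],   # publish the branch to the fork
    ]


def clone_and_push_commands(repo_url: str, message: str) -> List[List[str]]:
    """Local git commands for the clone and push workflow (hypothetical helper)."""
    return [
        ["git", "clone", repo_url],          # step 1: clone repository R directly
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],    # step 2: record the commit changes
        ["git", "push"],                     # push straight to R (requires write access)
    ]


if __name__ == "__main__":
    for cmd in fork_and_pr_commands("https://example.invalid/fork.git", "topic", "Add file"):
        print(" ".join(cmd))
```

The only structural difference is where the push lands: in the fork, followed by a review step, or directly in the original repository.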

2.1 Characterizing GitHub Contributions

To characterize each newcomer candidate contribution, we measure
characteristics along three dimensions: social coding, kinds of repositories (i.e., software
and non-software), and kinds of contributions.

Social Coding. Figure 2 illustrates an example of how we measure whether or
not a contribution to a GitHub repository is social. As shown in the figure,
there are two authors (i.e., author A for lines 1-3 and author B for line 4)
that contribute to a single file (i.e., git.gemspec) in a GitHub
repository (i.e., ruby-git). Since more than one author has modified the file, we
conclude that both authors make a social contribution. To extract the
author of each line of a file, we use the git-blame command.[3]

Kinds of Contributions. Purushothaman and Perry (2005) used Swanson's
classification of maintenance activities to analyze very small changes, while Hindle
et al (2008) performed a similar study for large commits. In this study, we
adopt the kinds of contribution proposed by Hattori and Lanza (2008):
(a) Forward Engineering, (b) Re-engineering, (c) Corrective Engineering, and
(d) Management.
 3   https://www.atlassian.com/git/tutorials/inspecting-a-repository/git-blame

               Table 1: Survey Questions sent to potential respondents

           Survey Questions for newcomer candidates
           Q1) What is your motivation to make a contribution to GitHub?
           (a) Learning to Code.
           (b) Assignment or Experiment Project.
           (c) Intend to contribute to an Open Source.
           (d) Use to showcase my programming skills.
           (e) Others.
           Q2) Did you have prior experience contributing to an OSS before GitHub?
           (Yes/No)
           Q3) List your programming knowledge/interests? (short answer)

Software vs. Non-software. Following Munaiah et al (2016) we first distinguish
between software projects (i.e., an engineered software project with documen-
tation, testing, and project management) and non-software repositories. Con-
cretely, we first classify software repositories based on the Borges et al (2016)
classifications: (a) Application Software, (b) System Software, (c) Web-based-
application, libraries, and frameworks, (d) Non-web libraries and frameworks,
(e) Software tools, (f) Documentation. We use the Kalliamvakou et al (2014)
classifications for non-software repositories, (a) Experimental, (b) Storage, (c)
Academic, (d) Web, (e) No longer accessible, and (f) Empty.

3 Data Preparation

To ensure that respondents are newcomer candidates, we conducted a preliminary survey to
explicitly verify each respondent's experience with OSS repositories.

3.1 Preliminary Survey: newcomer candidate verification

Survey Design. Table 1 shows the three survey questions. Apart from the
explicit verification of the requirements for being a newcomer candidate,
respondents were asked about their motivations and interests, and to rank their perception
of their programming skill.
    To find potential respondents, with the consent of the repository owners, we
mined the community of the first-contributions repository.[4] From the
repository, we were able to collect 10,000 emails. The survey was sent out by email
over a four-week period, and the anonymous responses were collected.[5] In the
end, we received 219 responses.
    Table 2b details the results of respondents, showing that 85% of respon-
dents (i.e., 187 responses) do not have any experience, while only 15% (i.e.,
32 responses) have experience contributing to an OSS. From the results, we
    4   https://github.com/firstcontributions/first-contributions
    5   Our questionnaire is available at https://tinyurl.com/r7acxvn

Table 2: Table 2a shows that most respondents were motivated by the intent to
contribute to an OSS project, and Table 2b shows evidence that most respondents
do not have any prior experience contributing to an OSS before GitHub.

      What is the motivation to contribute?                  Percent
      (a) Learning to Code.                                     58%
      (b) Assignment or Experiment Project.                     21%
      (c) Intend to contribute to an Open Source.               82%
      (d) Use to showcase my programming skills.                42%
      (e) Others                                                 5%

                           (a) Answers to Q1 of the survey
       Have you had any prior OSS experience?           Percent
       No                                                     85%
       Yes                                                    15%

                           (b) Answers to Q2 of the survey

find that 187 respondents are recognized as newcomer candidates by our defi-
nition, i.e., a newcomer candidate is a novice developer that is a new user to
the GitHub platform. Furthermore, 82% of respondents were motivated by
the intent to contribute to an OSS project (see Table 2a).

3.2 Mining Newcomer Candidate Repositories

To construct our dataset, we map our verified newcomer candidate information
to their GitHub repository contributions. To do so, we use the GitHub
REST API (GitHub, 2020) to retrieve newcomer candidate related information
(i.e., contributed repositories, submitted commits) using the GitHub
accounts that they provided in the survey. In the end, we successfully matched
177 newcomer candidates with their 2,437 contributed repositories. Note that
these 2,437 repositories are unique.
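
As an illustration of this retrieval step, the sketch below builds a request for one endpoint of the GitHub REST API, which lists the public repositories owned by a user. The text does not state which endpoints the study used, so the endpoint choice, the helper names, and the example account name are assumptions; repositories that a user contributed to but does not own would need additional queries.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, List

API_ROOT = "https://api.github.com"


def user_repos_url(username: str, page: int = 1, per_page: int = 100) -> str:
    """Build the URL that lists public repositories owned by `username`."""
    query = urllib.parse.urlencode({"per_page": per_page, "page": page})
    return f"{API_ROOT}/users/{urllib.parse.quote(username)}/repos?{query}"


def fetch_json(url: str) -> List[Any]:
    """Fetch one page of results (unauthenticated, so rate limits are low)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # "octocat" is only a placeholder account name.
    print(user_repos_url("octocat"))
```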

[Figure 3 diagram: the newcomer candidate datasets feed two steps. "Filter non-contribution
newcomer candidates" yields the first commit dataset; "Distinguish Clone and Push,
Fork and PR repositories" yields the representative repository dataset.]

Fig. 3: An overview of sub-dataset preparation. Two sub-datasets are
constructed based on the newcomer candidate dataset: the first commit dataset and the
representative repository dataset.

          Table 3: Dataset summary. 177 newcomer candidates are studied.

          Dataset                             Measure                         Count
          Newcomer Candidate datasets         # Newcomer candidates             177
                                              # Contributed repositories      2,437
          First Commit Dataset                # Commits                         174
          Representative Repository Dataset   # Fork and PR repositories        274
                                              # Clone and Push repositories     305

    Figure 3 shows an overview of our sub-dataset preparation as discussed
below, while Table 3 shows the details of our newcomer candidate datasets:
    First commit dataset. First, we construct a dataset consisting of the first
commits that newcomer candidates contributed. To do so, we first cloned the
earliest GitHub repository of each of the 177 newcomer candidates. We then extracted
the first commit id (i.e., sha) from each repository’s commit log as their
first-ever contribution. We then applied a filter to remove the newcomer candidates
who did not make any contribution after joining GitHub, and found
that three newcomer candidates had only forked a repository. Finally, we obtained
a total of 174 newcomer candidates who made commits to their GitHub
repositories, i.e., 174 first commits, as shown in Table 3.
    Representative repository dataset. We construct another dataset for a
qualitative analysis of the repositories, drawn from 955 fork and PR workflow and 1,482
clone and push workflow repositories. To do so, from the 2,437 repositories of the 177
newcomer candidates, we draw a statistically representative sample
(i.e., a confidence level of 95% and a confidence interval of 5).[6] The
calculation of statistically significant sample sizes based on population size,
confidence interval, and confidence level is well established (Krejcie and Morgan,
1970). We randomly sampled 274 fork and PR repositories and 305 clone and
push repositories to obtain a representative repository dataset that consists of
579 sample repositories, as shown in Table 3.
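
The sample sizes above can be reproduced with the standard finite-population sample size formula. The sketch below is an illustration rather than the calculator used in the study; it assumes z = 1.96 for the 95% confidence level, a worst-case proportion p = 0.5, a margin of 5 percentage points, and rounding to the nearest integer.

```python
def sample_size(population: int, margin: float = 0.05,
                z: float = 1.96, p: float = 0.5) -> int:
    """Sample size for a finite population (assumed parameters, see lead-in)."""
    # Sample size for an infinite population.
    n0 = (z ** 2) * p * (1.0 - p) / (margin ** 2)
    # Finite population correction.
    n = n0 / (1.0 + (n0 - 1.0) / population)
    return round(n)


if __name__ == "__main__":
    fork_and_pr = sample_size(955)       # 274
    clone_and_push = sample_size(1482)   # 305
    print(fork_and_pr, clone_and_push, fork_and_pr + clone_and_push)
```

Under these assumptions the two strata give 274 and 305 repositories, matching the 579 sampled repositories in Table 3.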

4 Approach

In this section, we follow the protocols highlighted in our registered report
(Rehman et al, 2020) to answer the research questions.

4.1 Answering RQ1

To answer RQ1, we use a quantitative method to identify whether newcomer
candidates practice social coding. To do so, we adopt the first commit dataset
(see Section 3.2), which includes the first commits of all 174 newcomer
candidates. Then we identify social coding using Algorithm 1.
   Algorithm 1 details our procedure to identify social coding. We first extract
the files contained in the first commit of a newcomer candidate (line 1). Second,
    6   https://www.surveysystem.com/sscalc.html

     Input : first_commit performed by an author au
     Output : Contribution type of the first commit: social or non-social
 1   F ← a set of files modified by first_commit;
 2   Type(F) = non-social;
 3   for f ∈ F do
 4       D ← extract_authors(git-blame(f));
 5       if au ∈ D & |D| > 1 then
 6            Type(F) = social;
 7       end
 8   end
 9   return Type(F);
 Algorithm 1: Our algorithm to classify a first commit as social or
 non-social

by default, we label the type of all first commits as non-social (line 2). Then,
we apply the git-blame command to each file contained in the commit to
check whether the file received changes from more than one unique author
(lines 3-4). Next, we classify the type of the first commit as social if the newcomer
candidate changed a file edited by other authors (lines 5-9). Otherwise,
the type of contribution remains non-social.
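
Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the study's implementation: authors are compared by the name reported by git blame, the blame is taken at a revision chosen by the caller (HEAD by default), and the helper names are hypothetical.

```python
import subprocess
from typing import Callable, Iterable, Set


def blame_authors(repo_dir: str, path: str, rev: str = "HEAD") -> Set[str]:
    """Return the author names that git blame attributes lines of `path` to."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "blame", "--line-porcelain", rev, "--", path],
        capture_output=True, text=True, check=True,
    ).stdout
    # In porcelain output, header lines such as "author Jane Doe" precede each
    # content line; content lines start with a tab and so never match here.
    return {
        line[len("author "):]
        for line in out.splitlines()
        if line.startswith("author ")
    }


def classify_first_commit(files: Iterable[str], author: str,
                          authors_of: Callable[[str], Set[str]]) -> str:
    """Label a first commit as "social" or "non-social" (Algorithm 1)."""
    commit_type = "non-social"                      # line 2: default label
    for f in files:                                 # line 3
        d = authors_of(f)                           # line 4
        if author in d and len(d) > 1:              # line 5
            commit_type = "social"                  # line 6
    return commit_type                              # line 9


if __name__ == "__main__":
    # Hypothetical blame results, standing in for blame_authors on a real clone.
    fake_blame = {"a.py": {"alice"}, "b.py": {"alice", "bob"}}
    print(classify_first_commit(["a.py", "b.py"], "alice", fake_blame.__getitem__))
```

Passing the blame lookup as a function keeps the classification logic testable without a git repository; on a real clone, `authors_of` would wrap `blame_authors`.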
    To validate our hypothesis (H1) A newcomer candidate is more likely to
practice social coding to GitHub, we use the one proportion Z-test (Paternoster
et al, 1998). The one proportion Z-test compares an observed proportion to a
theoretical one when the categories are binary.
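
For illustration, the test statistic can be computed as below. The theoretical proportion of 0.5 and the two-sided alternative are assumptions, since the text does not state them. The count of 118 is only the approximate number implied by 68% of 174 first commits, not a figure reported in the study.

```python
import math
from typing import Tuple


def one_proportion_z_test(successes: int, n: int, p0: float = 0.5) -> Tuple[float, float]:
    """Two-sided one proportion Z-test of an observed proportion against p0."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Illustrative input: roughly 68% of 174 first commits labelled non-social.
    z, p = one_proportion_z_test(successes=118, n=174)
    print(f"z = {z:.2f}, p = {p:.2g}")
```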

4.2 Answering RQ2

To answer RQ2, we use a semi-automatic approach to identify the different
kinds of first contributions made by newcomer candidates, as described in Section
2.1. To do so, we use the first commit dataset, the same as for RQ1, which includes
174 first commits of all newcomer candidates from their first projects. Our
approach consists of two rounds. In the first round, we applied the keyword
list from Hattori and Lanza (2008) to automatically classify commits into a
particular category, successfully matching 158 commits based on the keyword
list. In the second round, we performed an additional manual check of the
remaining 16 commits not covered by the keyword list. Our new predefined
keyword list includes (Forward Engineering: first), (Corrective Engineering:
solution, break), (Re-engineering: revisi, reforma, chang, simpl), (Management:
note).
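
A keyword-based classification of this kind can be sketched as follows. The sketch contains only the additional keywords listed above; the Hattori and Lanza (2008) list would be merged into the same table. Matching on the commit message, case-insensitive substring matching, and the category precedence are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Additional keywords from the text; stems such as "chang" match "change",
# "changed", and "changing".
NEW_KEYWORDS: Dict[str, List[str]] = {
    "Forward Engineering": ["first"],
    "Corrective Engineering": ["solution", "break"],
    "Re-engineering": ["revisi", "reforma", "chang", "simpl"],
    "Management": ["note"],
}


def classify_commit_message(message: str) -> Optional[str]:
    """Return the first category whose keyword occurs in the message, else None."""
    text = message.lower()
    for kind, stems in NEW_KEYWORDS.items():
        if any(stem in text for stem in stems):
            return kind
    return None  # left for the manual check


if __name__ == "__main__":
    for msg in ["My first commit", "Simplify layout", "Add release note", "Update"]:
        print(msg, "->", classify_commit_message(msg))
```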
    To validate our hypothesis (H2) A contribution to GitHub repository for
a newcomer candidate is more likely to add new content, similar to RQ1, we
use the one proportion Z-test (Paternoster et al, 1998). Note that Corrective
Engineering, Re-Engineering, and Management are merged into Non-Forward
Engineering in our significance test.

4.3 Answering RQ3

To answer RQ3, we use a qualitative method to identify the different kinds of
repositories described in Section 2.1. We use the representative repository dataset,
as described in Section 3.2. For the manual classification, three authors of
this paper first classified 30 samples. We then measured the
inter-rater agreement using Cohen’s Kappa. The Kappa agreement score for
classifying fork and PR workflow projects is 0.91, which is interpreted as “almost
perfect”, while the Kappa agreement score for classifying clone and push
workflow projects is 0.76, which is interpreted as “substantial agreement” (Viera et al,
2005). After the validation, two authors completed the manual coding for
the remaining repositories in the representative sample.
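
As a reminder of how the agreement score is computed, the sketch below implements Cohen's Kappa for two raters. The labels in the example are hypothetical and are not data from the study.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's Kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical labels for four repositories.
    a = ["software", "software", "non-software", "non-software"]
    b = ["software", "non-software", "non-software", "non-software"]
    print(cohens_kappa(a, b))  # 0.5
```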
    To validate our hypothesis (H3) A newcomer candidate is more likely to
target software repositories, similar to RQ1, we use the one proportion Z-
test (Paternoster et al, 1998).

4.4 Answering RQ4

To answer RQ4, we perform both a quantitative and a qualitative analysis. In the
quantitative analysis, we use a total of 2,437 projects (see Section 3.2) of 177
newcomer candidates. Using a curated dataset of engineered software
repositories provided by Munaiah et al (2016), we determine whether or not
a newcomer candidate has onboarded an engineered software project.
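
The matching step can be pictured as a set intersection between each candidate's contributed repositories and the curated list. The sketch below assumes repositories are identified by a comparable full name and computes a per-candidate rate; the exact matching rule used in the study is not given in the text, and all names in the example are hypothetical.

```python
from typing import Dict, Set


def onboarded_candidates(contributed: Dict[str, Set[str]],
                         engineered: Set[str]) -> Dict[str, Set[str]]:
    """Map each candidate to the engineered repositories they contributed to."""
    return {
        candidate: repos & engineered
        for candidate, repos in contributed.items()
        if repos & engineered
    }


def onboarding_rate(contributed: Dict[str, Set[str]], engineered: Set[str]) -> float:
    """Share of candidates with at least one engineered repository."""
    if not contributed:
        return 0.0
    return len(onboarded_candidates(contributed, engineered)) / len(contributed)


if __name__ == "__main__":
    # Hypothetical data for two candidates.
    contributed = {
        "alice": {"org/library", "alice/notes"},
        "bob": {"bob/homework"},
    }
    engineered = {"org/library"}
    print(onboarded_candidates(contributed, engineered))
    print(onboarding_rate(contributed, engineered))
```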
    To complement this quantitative analysis, we conducted a survey as a
qualitative analysis to acquire the perceptions of newcomer candidates. The
survey is split into two questions. The first question asks whether
newcomer candidates have onboarded an OSS project or not. The second question, inspired
by the previous work of Steinmacher et al (2014b), validates
the barriers faced by newcomer candidates when placing their initial
contributions to OSS projects. We focused on the same five popular barriers as Steinmacher
et al (2014b): (a) Social Interaction, (b) Newcomer Previous Knowledge, (c)
Finding a Way to Start, (d) Technical Hurdles, and (e) Documentation. In
terms of the answer options, we set levels of agreement on a five-point Likert
scale (from "strongly disagree" to "strongly agree"). Our survey details are
available at https://forms.gle/JQiVamovUXdJiy8z5.

5 Results

In this section, we present the results for each of our research questions.

5.1 (RQ1) To what extent does a newcomer candidate practice social coding?

Social Coding. The majority of the newcomer candidates do not practice so-
cial coding after joining GitHub. Table 4 presents the frequency of social and

Table 4: Frequency of newcomer candidates’ social and non-social
contributions. 68% of newcomer candidates make non-social initial contributions
after joining GitHub.

               Coding Category             Percent (%)
               Non-Social                            68
               Social                                32

non-social contributions done by newcomer candidates. We find that 68% of
newcomer candidates make non-social-based initial contributions after joining
GitHub, while 32% of newcomer candidates make social-based initial contri-
butions. The results suggest that newcomer candidates are less likely to col-
laborate with other developers when placing their first GitHub contributions.
   Our statistical test reveals that a significant difference exists between the
proportion of social and non-social based contributions, with a p-value < 0.001.
Newcomer candidates are more likely to practice non-social coding. The result
indicates that our proposed hypothesis, i.e., (H1) A newcomer candidate is
more likely to practice social coding to GitHub, is not established.

   RQ1 Summary: Our results show that 68% of the newcomer candi-
   dates do not practice social coding (i.e., newcomer candidates are less
   likely to collaborate with other developers with their initial contribu-
   tions) after joining GitHub. It indicates that our proposed hypothesis
   that a newcomer candidate is more likely to practice social coding to
   GitHub is not established.

5.2 (RQ2) What are the kinds of initial contributions that come from a
newcomer candidate?

Frequency of initial contribution kinds. 86% of newcomer candidates typically
engage in a forward-engineering activity. Table 5 depicts the distribution of
kinds of initial contributions that come from newcomer candidates. The
table reveals that newcomer candidates are most likely to engage in development
activities related to incorporating new features and implementing new
requirements. The next most frequent activity among newcomer candidates
is the maintenance activity related to refactoring and redesign, i.e., 8%. On
the other hand, we observe that only 1% of newcomer candidates contribute to
each of corrective-engineering and management. The results indicate that newcomer
candidates are less likely to engage in maintenance activities related to
handling defects, formatting code, cleaning up, and updating documentation.
In addition, 5% of initial contributions are classified as Others. Through our
manual check, we find that these initial contributions are either inaccessible
(i.e., 404 errors in first commit links) or cannot be classified into any category
based on our keyword list.

Table 5: Frequency for initial contribution kinds from newcomer candidates.
86% of newcomer candidates typically engage in forward-engineering activity.

          Initial contribution kinds        Percent (%)
          Forward-Engineering                          86
          Re-Engineering                                8
          Management                                    1
          Corrective-Engineering                        1
          Others                                        5

    Our statistical test confirms a significant difference between the proportion
of forward-engineering and non-forward-engineering contributions, with a p-
value < 0.001. A newcomer candidate is more likely to add new content (i.e.,
forward-engineering) in their first contributions. Such a result indicates that
our raised hypothesis, i.e., (H2) A contribution to GitHub repository for a
newcomer candidate is more likely to add new content, is established.

     RQ2 Summary: We find that 86% of newcomer candidates’ contri-
     butions are new features and requirements (i.e., forward-engineering
     activities), statistically confirming our hypothesis that a contribution
     to the GitHub repository for a newcomer candidate is more likely to
     add new content.

5.3 (RQ3) What kinds of repositories does a newcomer candidate target?

Frequency of targeted repository kinds. Around 53% of newcomer candidates
target repositories that are non-software based. Table 6 shows the
proportion of software and non-software based repositories that newcomer
candidates target. We find that newcomer candidates are less likely to target
software-based repositories that leverage sound software engineering practices
in each of their dimensions, accounting for 47%. Upon closer inspection of the two
workflows (i.e., fork and PR, clone and push), we observe that the dominant
workflow for software-based repositories is clone and push, i.e., 56%. In contrast, for
non-software based repositories, we do not find a dominant workflow, i.e.,
50% each for clone and push and for fork and PR.
    We now further examine what kinds of repositories newcomer candidates
target within each of the two workflows (i.e., clone and push, fork and PR).
Based on manual coding of a statistically representative sample, Figure 4 shows
that Documentation (21%), Experimental (15%), and Web-based-application,
libraries, and frameworks (15%) are the most frequently targeted repository
kinds in the clone and push workflow. The other kinds of repositories that
newcomer candidates frequently target are Academic (12%), Web (10%), and
Application Software (9%).

[Figure 4: horizontal bar charts, one panel per workflow (Clone and Push; Fork and PR). Bars: repository kind (Web; Storage; No longer accessible; Experimental; Empty; Academic; Web-based-application, etc; System Software; Software tools; Non-web libraries and frameworks; Documentation; Application Software). Horizontal axis: Percent (%), 0 to 25. Legend: Non-Software, Software.]

Fig. 4: Frequency of contributed repository kinds within the Clone and Push
and the Fork and PR workflows. Documentation (21%) is the most frequently
targeted repository kind in the clone and push workflow, and Experimental
(24%) in the fork and PR workflow.

    Specifically, we do not find any repositories related to System Software.
On the other hand, in the fork and PR workflow, we find that Experimen-
tal (24%) and Web-based-application, libraries, and frameworks (16%) are
the most commonly targeted repository kinds. The other kinds of repositories
commonly targeted are Documentation (13%) and Academic (12%).
    Our statistical test finds no significant difference between the proportions
of software and non-software based repositories that newcomer candidates
target, with a p-value > 0.05. The result indicates that our proposed
hypothesis, i.e., (H3) A newcomer candidate is more likely to target software
repositories, is not supported.
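
To see why a 53% to 47% split can fail to reach significance, consider a worked bound under the assumption of a two-sided one-sample z-test against an even split. This section does not restate which test was used, and the sample size n is left symbolic here:

```latex
\[
  z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}
    = \frac{0.53 - 0.50}{\sqrt{0.25 / n}}
    = 0.06 \sqrt{n}
\]
\[
  |z| < 1.96
  \iff \sqrt{n} < \frac{1.96}{0.06} \approx 32.67
  \iff n < 1067.1
\]
```

Under these assumptions, a 53% share cannot be distinguished from an even split at the 0.05 level unless the sample contains more than roughly 1,067 repositories.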

   RQ3 Summary: Results show that 53% of newcomer candidates tar-
   geted non-software based repositories. Statistically, we cannot deter-
   mine whether newcomer candidates are likely to choose software repos-
   itories over non-software or vice-versa.

Table 6: The proportion of software and non-software repositories targeted
by newcomer candidates. Around 53% of newcomer candidates targeted Non-
Software repositories.

          Category        Percent (%)    Contribution Workflow (%)
          Software             47         Clone and Push (56), Fork and PR (44)
          Non-Software         53         Clone and Push (50), Fork and PR (50)

Table 7: Frequency of newcomer candidates onboarding OSS projects from
quantitative and qualitative analysis.

           (a) Newcomer candidates onboard OSS from quantitative analysis

      Onboarded by Munaiah et al (2016)                Percent
      Onboard                                                3%
      Not-Onboard                                           97%
           (b) Newcomer candidates onboard OSS from qualitative analysis.

         Onboarded by survey response                Percent
         Onboard                                          70%
         Not-Onboard                                      30%

5.4 (RQ4) What proportion of newcomer candidates eventually onboard an
OSS project?

Onboard OSS. We now discuss whether newcomer candidates onboard OSS
projects. Table 7a presents the distribution of newcomer candidates that
onboard OSS projects according to the quantitative analysis. The quantitative
results show that only 3% of newcomer candidates onboard OSS projects,
while 97% do not. One explanation for such a low matching rate is that the
curated engineered OSS projects are a smaller and outdated subset of OSS
projects. Our qualitative analysis is in line with this explanation: 70% of
newcomer candidates claim that they have successfully contributed to OSS
projects since joining GitHub. Table 7b shows the distribution of newcomer
candidates that onboard OSS projects according to the qualitative analysis.

    Barriers faced by newcomer candidates. We now examine the barriers
faced by the 27 surveyed newcomer candidates. Figure 5 shows the results of
our Likert-scale question related to barriers. Finding a way to start is the
most crucial barrier, with 22 positive responses (i.e., 12 agree responses and
10 strongly agree responses). Technical hurdles rank second, with 18 positive
responses (i.e., 15 agree responses and 3 strongly agree responses). Newcomer
previous knowledge ranks third, with 16 positive responses (i.e., 10 agree
responses and 6 strongly agree responses). On the other hand, respondents
are more likely to disagree that social interaction and documentation are
barriers to onboarding OSS projects (i.e., 7 negative responses for each
barrier).
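
The ranking above can be reproduced from the counts quoted in the text. The sketch below uses only those counts (agree and strongly agree responses for the three top barriers, out of 27 respondents); the dictionary layout and the function name are illustrative assumptions, and the barriers whose agree counts are not given in the text are left out.

```python
# Counts of (agree, strongly agree) responses as quoted in the text above.
# The remaining barriers are omitted because the text reports only their
# negative responses (7 each), not their agree counts.
TOTAL_RESPONDENTS = 27
agree_counts = {
    "Finding a Way to Start": (12, 10),
    "Technical Hurdles": (15, 3),
    "Newcomer Previous Knowledge": (10, 6),
}


def rank_barriers(counts):
    """Return (barrier, positive_total) pairs, most agreed-upon first."""
    totals = {name: agree + strongly for name, (agree, strongly) in counts.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_barriers(agree_counts)
for name, positive in ranking:
    share = 100 * positive / TOTAL_RESPONDENTS
    print(f"{name}: {positive}/{TOTAL_RESPONDENTS} positive ({share:.0f}%)")
```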

[Figure 5: bar chart of Likert-scale responses per barrier (Social Interaction; Newcomer Previous Knowledge; Finding a Way to Start; Technical Hurdles; Documentation). Horizontal axis: Count. Legend: Strongly Disagree, Partially Disagree, Neutral, Partially Agree, Strongly Agree.]

Fig. 5: Barriers faced by newcomer candidates. Most newcomer candidates
(i.e., 22 out of 27 responses) agree or strongly agree that finding a way to
start is a barrier.

   RQ4 Summary: Although our quantitative analysis matched only 3%
   of newcomer candidates to established OSS repositories, 70% of
   newcomer candidates claimed that they had already started to contribute
   to OSS repositories. Furthermore, most newcomer candidates agree
   that they face the barrier of finding a way to start, while social
   interaction received the most mixed responses as a barrier.

6 Implications

We now discuss the implications of our results and provide suggestions for
newcomer candidates, OSS projects, and researchers:
    Suggestions for Newcomer Candidates. RQ1 shows that most new-
comer candidates are not practicing social coding while making their initial
contribution, with Table 4 showing that 68% of newcomer candidates’ initial
contributions are non-social based. These results indicate that newcomer can-
didates tend to stick to their solo projects and personal activities even after
joining the GitHub platform. Although recent studies have shown evidence
that social coding improves collaboration among developers (Thung et al,
2013), our results show that most newcomer candidates do not yet practice
it. Our practical suggestion is for newcomer candidates to actively read
documentation such as contributing guidelines and to engage in discussions
and threads. Doing so may increase their confidence and the likelihood of
engaging in social coding interactions on GitHub.
Initiatives such as Hacktoberfest (footnote 7) also encourage contributions,
especially from newcomers.
    Our qualitative analysis for RQ2 and RQ3 helps to understand the
contribution and repository kinds. This analysis gives newcomer candidates
insight into choosing suitable repositories that match their prior
contributions. The complementary results of RQ2 and RQ3 reveal that, after
joining GitHub, newcomer candidates prefer to add new content to
non-software experimental repositories. The results suggest that these
repositories serve an essential purpose in engaging newcomer candidates and
could be crucial for keeping them motivated before they move to a real OSS
project.
    The newcomer candidate responses in RQ4 reveal which barriers may explain
why some newcomers never end up contributing to an OSS project. As the
responses show, finding a way to start is one of the most challenging
barriers. To this end, newcomer candidates could follow the suggestions of
Subramanian et al (2020) and start with first-timer-friendly tasks, such as
minor feature additions (a change of around 36 lines of code), minor
documentation changes, and selected bug fixes, which could reduce this
problem. Furthermore, online resources (footnote 8) help newcomer candidates
find easy issues or opportunities to make a contribution.
    Suggestions for OSS Projects. Our findings provide practical implications
to assist with the onboarding process. The results for RQ2 and RQ3 show
that repository and contribution kinds give newcomer candidates insight into
selecting projects to contribute to, which plays a role in attracting
potential contributors. Therefore, OSS projects that want to attract newcomer
candidates can use our results to identify the most prominent contribution
and repository kinds. However, many practical problems and difficulties
remain. Thus, OSS projects may benefit from offering the right contributions
to target a specific type of newcomer candidate (e.g., documentation
opportunities or a particular type of forward engineering). Given the barriers
highlighted in RQ4, OSS project teams should identify practical examples of
how to lower them so that a newcomer candidate can contribute. Tan et al
(2020) showed that OSS projects now highlight specific issues as potentially
good first issues that newcomer candidates can target. We propose that similar
strategies be adopted, especially targeting non-software components such as
documentation.
    Suggestions for Researchers. We envision researchers building on
our results and open research questions to widen our understanding and
develop strategies that encourage newcomer candidates’ onboarding. For
example, the manual classification in RQ3, which found 53% non-software
repositories and 47% software repositories, gives us an idea of these
newcomer candidates’ advertised skill levels. At this stage, our
classifications are rather generic. We envision that future work could include concrete

 7   https://hacktoberfest.digitalocean.com/
 8   https://www.firsttimersonly.com/
examples of newcomer candidate source code patches and examine the minimal
skill levels required for a newcomer candidate to onboard into the OSS
world. Interestingly, we find that the perception of OSS projects may be
different from what the research community regards as an OSS project. Hence,
further research is needed to understand what constitutes an OSS project,
as this definition may be changing over time. Future research could also
provide tool support that matches skill levels with OSS repositories that seek
those skills. Other interesting avenues would be to explore the different
motivations of GitHub users (e.g., advertising their skills for a job,
practicing skills, or learning and educational purposes), and the minimal
skills to teach these newcomer candidates to help them become successful
contributing members of the different OSS projects.

7 Threats to Validity

We now discuss threats to the validity of our empirical study.

External Validity. We perform an empirical study on newcomer candidates
relying on the GitHub platform. Our key limitation is that our newcomer
candidates are restricted to the GitHub platform and were collected through
our preliminary survey. Newcomer candidates may also exist on platforms other
than GitHub, but our approach picks up only a newcomer candidate’s first
GitHub contribution.

Construct Validity. We summarize two threats regarding construct validity.
First, in our qualitative analysis, especially for projects targeted by newcomer
candidates (RQ3), categories may be miscoded due to the subjective nature of
our coding approach. To mitigate this threat, we took a systematic approach:
three separate individuals first coded 30 samples to test our comprehension,
measured with Kappa agreement scores. Only after the Kappa score exceeded
0.91 for fork and PR workflow projects and 0.76 for clone and push workflow
projects did we complete the rest of the sample dataset.
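
As an illustration of the agreement check, the following minimal sketch computes Cohen's kappa for one pair of raters. It is an assumption that pairwise Cohen's kappa is the statistic in question (the text says only "Kappa agreement scores" among three individuals); the labels below are invented for the example and are not study data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical codings of ten repositories by two raters (not study data).
rater_1 = ["Doc", "Exp", "Web", "Doc", "Acad", "Exp", "Doc", "Web", "Exp", "Acad"]
rater_2 = ["Doc", "Exp", "Web", "Doc", "Acad", "Exp", "Web", "Web", "Exp", "Acad"]
kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")
```

With three raters, one option is to report this value for each pair of raters; a multi-rater statistic such as Fleiss' kappa is another.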
    The second possible threat lies in our quantitative analysis of RQ4, which
examines what proportion of newcomer candidates onboard OSS projects. We
matched newcomer candidates’ projects with the curated dataset of engineered
software projects provided by Munaiah et al (2016), which was last updated in
2017. We might obtain different results regarding the proportion of newcomer
candidates that onboard OSS projects if the curated dataset were updated.
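
The matching step can be pictured as a set lookup. This is a minimal sketch assuming both sides are available as "owner/name" repository identifiers; the function name, the normalization, and the sample identifiers are all hypothetical, and the actual format of the curated dataset is not described in this section.

```python
def onboarded_share(candidate_repos, curated_repos):
    """Fraction of candidates with at least one repository in the curated set.

    candidate_repos: dict mapping a candidate ID to the repositories they
    contributed to, each written as "owner/name" (assumed format).
    curated_repos: iterable of "owner/name" strings from the curated dataset.
    """
    def normalize(slug):
        return slug.strip().lower()

    curated = {normalize(slug) for slug in curated_repos}
    matched = {
        candidate
        for candidate, repos in candidate_repos.items()
        if any(normalize(repo) in curated for repo in repos)
    }
    share = len(matched) / len(candidate_repos) if candidate_repos else 0.0
    return matched, share


# Hypothetical identifiers, for illustration only.
candidates = {
    "user_a": ["user_a/notes", "Example-Org/Toolkit"],
    "user_b": ["user_b/homework"],
}
curated = ["example-org/toolkit", "another-org/library"]
matched, share = onboarded_share(candidates, curated)
print(sorted(matched), f"{share:.0%}")
```

Because the lookup depends entirely on the curated set, a stale or narrow list lowers the matched share, which is the threat described above.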

Internal Validity. Newcomer candidates have full control over the repositories
listed in the owned repositories section, so if they decide to remove
their first contribution or first project from the page, we cannot pick up their
actual first project or contribution. However, we do not know why a newcomer
candidate would do so.
    Another internal threat to validity relates to the results of the
quantitative analysis of RQ1. According to these results, 32% of newcomer
candidates’ initial contributions are social. Using the git-blame command, we
count the number of developers on the committed files in an initial
contribution and regard that contribution as social if we find changes made by
more than one author. However, we observed that in some initial contributions
the same newcomer candidate used different IDs, which makes the first
contribution appear social. Thus, future in-depth qualitative analysis or
experimental studies are needed to better understand the reasons for this
behavior.
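
A minimal sketch of the author-counting step, assuming the `--line-porcelain` output format of `git blame`, in which every blamed line carries an `author-mail` header. The function names and the rule of comparing e-mail addresses are assumptions for illustration, not the authors' tooling; one person using several identities would still be counted as several authors, which is exactly the threat discussed here.

```python
import subprocess


def distinct_authors(porcelain_output):
    """Collect distinct author e-mails from `git blame --line-porcelain` text."""
    authors = set()
    for line in porcelain_output.splitlines():
        # Header lines look like: author-mail <someone@example.com>
        if line.startswith("author-mail "):
            authors.add(line[len("author-mail "):].strip("<>").lower())
    return authors


def is_social(repo_dir, commit, paths):
    """Treat a contribution as social if its files have more than one author."""
    authors = set()
    for path in paths:
        result = subprocess.run(
            ["git", "-C", repo_dir, "blame", "--line-porcelain", commit, "--", path],
            capture_output=True, text=True, check=True,
        )
        authors |= distinct_authors(result.stdout)
    return len(authors) > 1


# Hand-written sample in the porcelain layout (illustrative, not real data).
sample = (
    "author Alice\nauthor-mail <alice@example.com>\n\tprint('hi')\n"
    "author Bob\nauthor-mail <bob@example.com>\n\tprint('bye')\n"
)
print(len(distinct_authors(sample)))
```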

8 Related Work

In this section, we present significant findings from related work on
newcomers.

Motivation for Newcomers and OSS Projects. Motivation and a project’s
attractiveness play a vital part in drawing outsiders into the joining process
of a project. A substantial body of work has explored developers’ motivation
and projects’ attractiveness (Meirelles et al, 2010; Santos et al, 2013; Shah,
2006; Ye and Kishida, 2003). Other studies investigate how newcomers join
projects in order to become core project members (Ducheneaut, 2005; Fang and
Neufeld, 2009; Krogh et al, 2003; Marlow et al, 2013; Nakakoji et al, 2003).
From a more positive angle, Choi et al (2010) found that a welcome message,
technical assistance, and constructive criticism delayed the natural decline
of newcomer editing. Other parts of the literature focus on the forces of
motivation and attractiveness that drive newcomers toward projects. For
example, Lakhani and Wolf (2003) found that external benefits (e.g., better
work, career advancement) primarily motivate new contributors, along with fun,
code-based challenges, and improved programming skills.

 Onboarding OSS Projects. Onboarding OSS projects has been extensively
studied (Krogh et al, 2003; Nakakoji et al, 2003). Fagerholm et al (2013)
report preliminary results of a study that deals directly with the process
of onboarding OSS projects. Commercial software development settings are also
affected by newcomer onboarding, as described by Begel and Simon (2008) and
Dagenais et al (2010). Considering the perspective of individual developers,
Ducheneaut (2005) approached onboarding from a sociological point of view.
    To support the onboarding of newcomers to OSS, mentorship is recognized
as an important activity. Swap et al (2001) describe mentoring as a basic
knowledge transfer mechanism in the enterprise. Sim et al (1998) report that
mentoring patterns occur when new developers are integrated into software
projects. Krogh et al (2003) proposed a joining script for developers who want
to participate in a project. Nakakoji et al (2003) also studied OSS projects
and proposed eight possible joining roles arranged in concentric layers called
"the onion patch". Zhou and Mockus (2015) found that an individual’s
willingness and a project’s climate were associated with the odds that the
individual would become a long-term contributor.

Barriers for Newcomers. Newcomers are important to the survival, long-term
success, and continuity of OSS projects (Kula and Robles, 2019). However,
newcomers face many difficulties when making their first contribution
to a project. OSS project newcomers are usually expected to learn about the
project on their own (Scacchi, 2002). As a result, some newcomers send
contributions that are not incorporated into the source code and give up
trying (Steinmacher et al, 2015; Steinmacher et al, 2015). As discussed by
Zhou and Mockus (2010), the transfer of entire projects, the renewal of core
developers, and participation in OSS projects present similar challenges of
rapidly increasing newcomer competence in software projects.
    Several research efforts have previously addressed reducing the barriers
for newcomers. Steinmacher et al (2014a) proposed a developer joining model
that represents the stages that are common to newcomers and the forces that
draw them toward or push them away from a project. Steinmacher et al (2016)
created a portal called FLOSScoach, based on a conceptual model of barriers,
to support newcomers. Their evaluation shows that FLOSScoach played an
important role in guiding newcomers and in lowering barriers related to the
orientation and contribution process. In terms of barriers, our research
complements Steinmacher et al (2014b) by highlighting the most crucial
barrier, i.e., finding a way to start, which makes it difficult for newcomer
candidates to contribute to OSS projects.
    Compared to other work, our study takes a first look at these candidates to
better understand their social interaction, initial contribution kinds, targeted
repositories, and onboarding together with its barriers. Other work has
extensively investigated the nature of newcomers, but none focuses on newcomer
candidates, i.e., novice developers with the intention of later onboarding OSS
projects.

9 Conclusion

This paper analyzes a new category of potential contributors to OSS projects
(i.e., newcomer candidates). Our results show that these newcomer candidates
are more likely to practice non-social coding (i.e., 68%), and they tend to work
on forward-engineering activities (i.e., 86%) in their first commits. Neverthe-
less, we cannot determine whether newcomer candidates are likely to choose
software repositories over non-software or vice-versa. Regarding onboarding,
although very few (i.e., 3%) newcomer candidates onboard established OSS
engineered repositories, 70% of newcomer candidates claim they already
contribute to an OSS project, citing finding a way to contribute as a key
barrier to onboarding.
    As GitHub continues to grow, so does the potential for the newcomer can-
didate. This study opens up new avenues for future work, especially targeting
potential contributors to onboard existing OSS projects. Researchers can also
analyze how to sustain their newcomer candidates’ needs until they are ready
to successfully onboard. More practical applications would be tool support to
(i) recommend suitable repositories for newcomer candidates and (ii) identify
practical examples OSS project teams can use to lower their barriers for a
newcomer candidate to contribute.

Acknowledgement

This work is supported by the Japan Society for the Promotion of Science
(JSPS) KAKENHI Grant Numbers 18H04094, 20K19774, and 20H05706.

References

Begel A, Simon B (2008) Novice software developers, all over again. ICER’08 -
  Proceedings of the ACM Workshop on International Computing Education
  Research
Borges H, Hora A, Valente MT (2016) Understanding the factors that impact
  the popularity of GitHub repositories. In: ICSME
Choi B, Alexander K, Kraut RE, Levine JM (2010) Socialization tactics in
  wikipedia and their effects. In: Proceedings of the 2010 ACM conference on
  Computer supported cooperative work, pp 107–116
Coelho J, Valente MT (2017) Why modern open source projects fail. In: FSE
Dagenais B, Ossher H, Bellamy RKE, Robillard MP, de Vries JP (2010) Mov-
  ing into a New Software Project Landscape, Association for Computing
  Machinery, p 275–284
Ducheneaut N (2005) Socialization in an open source software community: A
  socio-technical analysis. Computer Supported Cooperative Work (CSCW)
  14:323–368
Fagerholm F, Johnson P, Guinea A, Borenstein J, Münch J (2013) Onboarding
  in open source software projects: A preliminary analysis. In: 2013 IEEE 8th
  International Conference on Global Software Engineering Workshops
Fang Y, Neufeld D (2009) Understanding sustained participation in open
  source software projects. J Manage Inf Syst
GitHub (2020) URL https://developer.github.com/v3/
Hattori LP, Lanza M (2008) On the nature of commits. In: ASE
Hindle A, German DM, Holt R (2008) What do large commits tell us? a tax-
  onomical study of large commits. In: Proceedings of the 2008 international
  working conference on Mining software repositories, pp 99–108
Kalliamvakou E, Gousios G, Blincoe K, Singer L, German DM, Damian D
  (2014) The promises and perils of mining GitHub. In: MSR
Krejcie RV, Morgan DW (1970) Determining sample size for research activities.
  Educational and Psychological Measurement 30(3):607–610
Krogh G, Spaeth S, Lakhani K (2003) Community, joining, and specialization
  in open source software innovation: A case study. Research Policy 32:1217–
  1241
Kula RG, Robles G (2019) The Life and Death of Software Ecosystems,
  Springer, pp 97–105
Lakhani K, Wolf R (2003) Why hackers do what they do: Understanding
  motivation and effort in free/open source software projects. Perspectives on
  Free and Open Source Software
Marlow J, Dabbish L, Herbsleb J (2013) Impression formation in online peer
  production: Activity traces and personal profiles in github. In: Proceedings
  of the 2013 conference on Computer supported cooperative work, Associa-
  tion for Computing Machinery, New York, NY, USA, CSCW ’13, p 117–128
Meirelles P, Santos Jr C, Miranda J, Kon F, Terceiro A, Chavez C (2010)
  A study of the relationships between source code metrics and attractive-
  ness in free software projects. In: 2010 Brazilian Symposium on Software
  Engineering, pp 11 – 20
Munaiah N, Kroh S, Cabrey C, Nagappan M (2016) Curating github for en-
  gineered software projects. EMSE
Nakakoji K, Yamamoto Y, Nishinaka Y, Kishida K, Ye Y (2003) Evolution
  patterns of open-source software systems and communities. International
  Workshop on Principles of Software Evolution (IWPSE)
Park Y, Jensen C (2009) Beyond pretty pictures: Examining the benefits of
  code visualization for open source newcomers. In: VISSOFT
Paternoster R, Brame R, Mazerolle P, Piquero A (1998) Using the correct sta-
  tistical test for the equality of regression coefficients. Criminology 36(4):859–
  866
Purushothaman R, Perry DE (2005) Toward understanding the rhetoric of
  small source code changes. IEEE Transactions on Software Engineering
  31(6):511–526
Rehman I, Wang D, Kula RG, Ishio T, Matsumoto K (2020) Newcomer candi-
  date: Characterizing contributions of a novice developer to github. In: 2020
  IEEE International Conference on Software Maintenance and Evolution (IC-
  SME), pp 855–855
Santos C, Kuk G, Kon F, Pearson J (2013) The attraction of contributors in
  free and open source software projects. J Strateg Inf Syst 22(1):26–45
Scacchi W (2002) Understanding the requirements for developing open source
  software systems. IEE Proc Soft
Shah S (2006) Motivation, governance, and the viability of hybrid forms in
  open source software development. Management Science 52:1000–1014
Sim S, Richard S, Holt C (1998) The ramp-up problem in software projects: A
  case study of how software immigrants naturalize. Proceedings of the 20th
  international conference on Software engineering pp 361–370
Steinmacher I, Gerosa MA, Redmiles D (2014a) Attracting, onboarding, and
  retaining newcomer developers in open source software projects. In: CSCW
Steinmacher I, Graciotto Silva MA, Gerosa MA, Redmiles D (2014b) A sys-
  tematic literature review on the barriers faced by newcomers to open source
  software projects. IST
Steinmacher I, Conte T, Gerosa MA, Redmiles DF (2015) Social barriers
  faced by newcomers placing their first contribution in open source software
  projects. In: CSCW
Steinmacher I, Conte TU, Gerosa MA (2015) Understanding and supporting
  the choice of an appropriate task to start with in open source software
  communities. In: HICSS
Steinmacher I, Conte TU, Treude C, Gerosa MA (2016) Overcoming open
  source project entry barriers with a portal for newcomers. In: ICSE
Subramanian VN, Rehman I, Nagappan M, Kula RG (2020) Analyzing first
  contributions on github: What do newcomers do. IEEE Software pp 0–0
Swap W, Leonard D, Shields M, Abrams L (2001) Using mentoring and story-
  telling to transfer knowledge in the workplace. J of Management Information
  Systems 18:95–114
Tan X, Zhou M, Sun Z (2020) A First Look at Good First Issues on GitHub,
  Association for Computing Machinery, New York, NY, USA, p 398–409
Thung F, Bissyande TF, Lo D, Jiang L (2013) Network structure of social cod-
  ing in github. In: 2013 17th European conference on software maintenance
  and reengineering, IEEE, pp 323–326
Valiev M, Vasilescu B, Herbsleb J (2018) Ecosystem-level determinants of sus-
  tained activity in open-source projects: A case study of the PyPI ecosystem.
  In: FSE
Viera AJ, Garrett JM, et al (2005) Understanding Interobserver Agreement:
  The Kappa Statistic. Family Medicine 37(5):360–363
Ye Y, Kishida K (2003) Toward an understanding of the motivation open
  source software developers. In: Proceedings of the 25th International Con-
  ference on Software Engineering, IEEE Computer Society, USA, ICSE ’03,
  p 419–429
Zhou M, Mockus A (2010) Growth of newcomer competence: Challenges of
  globalization. In: FoSER
Zhou M, Mockus A (2015) Who will stay in the floss community? modeling
  participant’s initial behavior. TSE