Styler: Learning Formatting Conventions to Repair Checkstyle Errors

Benjamin Loriot, Fernanda Madeiral, Martin Monperrus

arXiv:1904.01754v3 [cs.SE] 10 Aug 2020

Abstract—Ensuring code formatting conventions is an essential aspect of modern software quality assurance, because it helps in code readability. In this paper, we present STYLER, a tool dedicated to fixing formatting errors raised by Checkstyle, a highly configurable format checker for Java. To fix formatting errors in a given project, STYLER 1) learns fixes for self-generated errors according to the project-specific Checkstyle ruleset, based on token sequences fed into an LSTM neural network, and then 2) predicts fixes. In an empirical evaluation, we find that STYLER repairs 38% of 11,220 real Checkstyle errors mined from 70 GitHub projects. Moreover, we compare STYLER with the IntelliJ plugin CHECKSTYLE-IDEA and the machine learning-based code formatters NATURALIZE and CODEBUFF. We find that STYLER fixes errors from a more diverse set of Checkstyle rules (24 rules, compared to CHECKSTYLE-IDEA: 19; NATURALIZE: 20; CODEBUFF: 17), and it uniquely repairs errors for two rules. Finally, STYLER generates small repairs, and once trained, it predicts repairs in seconds. The promising results suggest that STYLER can be used in IDEs and in Continuous Integration environments to repair Checkstyle errors.

I. INTRODUCTION

Code readability is the first requirement for program comprehension: one cannot comprehend what one cannot easily read. To improve code readability, most developers agree on using coding conventions, so the code is clear and uniformly consistent across a given code base or organization [23], [16].

A major challenge of using coding conventions is to keep all source code files consistent with the agreed conventions. The first step towards that is the detection of coding convention violations (or errors). This can be automatically performed using linters, which are static analysis tools that warn software developers about possible violations of coding conventions [36]. The usage of linters also brings challenges because the developers need to create a configuration according to their adopted conventions so that the linter detects the right violations (not more and not less), and then eventually to repair the violations. In this paper, we focus on the latter task, automatically repairing linter violations, which is a little researched problem, and we focus on formatting errors¹.

¹ In this paper, we refer to linters specialized in formatting as format checkers.

To repair a formatting error detected by a format checker, developers can either perform the fix manually or use a code formatter. Neither alternative is satisfactory. Manually fixing formatting errors is a waste of valuable developer time. With code formatters, the key problem is that they do not take into account the project-specific convention rules, those that are configured by the developers for the used format checker.

Inspired by the problem statement of program repair [24], we state in this paper the problem of automatically repairing formatting errors: given a program, its format checker rules, and one rule violation, the goal is to modify the source code formatting so that no violation is raised by the format checker.

In this paper, we explore this problem in the context of Checkstyle [8], a popular format checker for the Java language. We present STYLER, a repair tool dedicated to fixing Checkstyle formatting errors in Java source code. The uniqueness of STYLER is to be applicable to any formatting coding convention, because its approach is not based on rules to repair specific Checkstyle errors. The key idea of STYLER is the usage of machine learning to learn the coding conventions that are used in a software project. Once trained, STYLER predicts changes on formatting characters (e.g. whitespaces, new lines, indentation) to fix a formatting convention violation happening in the wild. Technically, STYLER uses a sequence-to-sequence machine learning model based on a long short-term memory neural network (LSTM).

We conduct a large scale experiment to evaluate STYLER using a curated dataset of 11,220 real Checkstyle errors mined from 70 GitHub projects. Based on our research questions, we find that STYLER repairs many errors (38%), and repairs errors from more different Checkstyle formatting rules compared to the state-of-the-art of machine learning formatters [3], [26] and the tailored, human engineered IntelliJ plugin CHECKSTYLE-IDEA [9]. Moreover, STYLER produces small repairs and its performance is fast enough for developers.

To sum up, our contributions are:
   • A novel approach to fix violations of code formatting conventions, based on machine learning. The approach is able to learn project-specific formatting rules without manual setup;
   • A tool, called STYLER, which implements our approach in the context of Java and Checkstyle, to repair Checkstyle formatting violations. The tool is made publicly available [21];
   • A curated dataset of real-world formatting Checkstyle errors, which contains 11,220 errors mined from 70 GitHub repositories. To our knowledge, this is the largest dataset of this kind, made publicly available for future research;
   • A comparative experiment of the performance of STYLER against the state-of-the-art of automatic code formatting [9], [3], [26], showing that STYLER outperforms it.

The remainder of this paper is organized as follows. Section II and Section III present the background of this work. Section IV presents our tool, STYLER. Section V presents the
design of our experiment for evaluating STYLER and comparing it with three code formatters: the experimental results are presented in Section VI. Section VII presents discussions, and Section VIII presents the related works. Finally, Section IX presents the final remarks.

II. BACKGROUND

A. Coding Conventions

Coding conventions (also known as coding style or coding standards) are rules that developers agree on for writing code. The usage of coding conventions improves code readability but it does not change the program behavior.

There are several coding convention classes: e.g. naming, control flow style, and formatting. In this paper, we focus on the last one: formatting coding conventions. Formatting here refers to the appearance or the presentation of the source code. One can change the formatting by using non-printable characters such as spaces, tabulations, and line breaks. In free-format languages such as Java and C++, the formatting does not change the abstract syntax tree. In non-free-format languages such as Haskell or Python, formatting is even related to behavior: correcting formatting issues can fix a bug [7].

For instance, a well-known formatting coding convention is about the placement of braces in code blocks. Figure 1 shows two ways that developers may follow when writing conditional blocks: one developer might place the left brace on a new line, while another one might place it at the end of the conditional line. Agreeing on coding conventions avoids edit wars and endless debates: all developers in a team decide on how to format code once and for all.

   (a) Left curly on new line.

       if (condition)
       {
           // do something
       }

   (b) Left curly on end of line.

       if (condition) {
           // do something
       }

   Fig. 1: Two conventions for placing a left curly brace.

B. Coding Convention Checkers

A challenge faced by developers is to keep their code compliant with the agreed coding conventions. Basically, every new change, every new commit must satisfy the convention rules. Manually checking if code changes do not violate the coding conventions is not an option because it would be too time-consuming and error-prone.

To overcome this problem, a mechanism to automatically check if code follows the coding convention rules is required. Such a tool is known as a linter, or coding convention enforcer [2]. A linter is a static analysis tool that warns software developers about possible code errors or violations of coding conventions [36]. Note that linters may go beyond coding conventions and also perform some basic static analysis on the program behavior.

Linters can usually be integrated in IDEs and build tools. When integrated in IDEs, the developer manually runs the linter before she commits her changes. If she does not do it, she might face a lot of errors raised by the linter after the end of the building step for a release or for shipping the program. On the other hand, when a linter is integrated in build tools, it is automatically executed in Continuous Integration (CI) environments. The important coding conventions might be configured to make CI builds break when they are violated. This way, developers are forced to repair coding convention violations early in the software development process.

Several linters have been developed depending on the programming language: e.g. ESLint [13] for JavaScript, Pylint [29] for Python, StyleCop [34] for C#, and RuboCop [31] for Ruby. For Java, which is our target language in this paper, the most commonly used linter is Checkstyle [8]. Checkstyle supports predefined well-known coding conventions, such as the Google Java Style Guide [16] and the Sun Code Conventions [35]. It also allows developers to configure a specific ruleset to match their own preferences. Checkstyle is a flexible linter that can be integrated in both an IDE (e.g. IntelliJ, Eclipse, and NetBeans) and in a build tool (e.g. Maven and Gradle). In the Java ecosystem, Checkstyle is often executed in Continuous Integration environments such as Travis and Circle CI.

III. STUDY OF CHECKSTYLE USAGE IN THE WILD

Static analysis tools have been the subject of investigation in recent research [39], [38], [22]. However, there is little empirical knowledge of the extent to which Checkstyle, one popular static analyzer, is used in the wild. To ground our work with a solid empirical basis, we then investigate the usage of Checkstyle and its rules in open source projects.

Checkstyle can be executed on a project in different ways. The straightforward ways are 1) by directly invoking Checkstyle on the command line, 2) by a build tool, or 3) by a continuous integration service. Independently of the way Checkstyle is executed, there must exist a configuration file with the Checkstyle rules defined by the developers: we refer to this file as Checkstyle ruleset. In this section, we report on our large-scale study on the usage of Checkstyle on GitHub.

A. Checkstyle Usage in Practice

Method. To measure the usage of Checkstyle on GitHub, we queried GitHub² to only retrieve Java projects with at least five stars, because stars have been shown meaningful to sample projects from GitHub [6]: we found 148,127 Java projects. Then, we searched each of them for a Checkstyle ruleset file. A Checkstyle ruleset file can have any name, but we followed a conservative approach towards identifying true positives: we used a set of commonly used names³. For simplicity, in the rest of this paper we refer to a Checkstyle ruleset file as checkstyle.xml.

² On June 9, 2020.
³ Checkstyle ruleset file commonly used names: [‘checkstyle.xml’, ‘.checkstyle.xml’, ‘checkstyle_rules.xml’, ‘checkstyle_config.xml’, ‘checkstyle_configuration.xml’, ‘checkstyle_checker.xml’, ‘checkstyle_checks.xml’, ‘google_checks.xml’, ‘sun_checks.xml’]. Variants by replacing ‘_’ by ‘-’ are also used.

Results. We found 3,793 Java projects containing a checkstyle.xml file, which is 2.56% of all Java projects
with at least five stars on GitHub. Table I shows the proportion of those projects with their build tools and CI services if any. We note that build tools are widely used among projects using Checkstyle: 98% of the projects use at least one build tool. Moreover, 55% of the projects use a continuous integration service, which shows the software engineering maturity of the sampled projects.

    TABLE I: Usage of build tools and CI services by 3,793 projects that use Checkstyle.

                        Tool        Projects
    Build tool usage    Maven       54%
                        Gradle      47%
                        Ant         10%
    CI usage            TravisCI    51%
                        CircleCI     4%

B. Popularity of Checkstyle Rules

Method. To check the usage of Checkstyle rules⁴, we analyzed the previously-found checkstyle.xml files from the 3,793 projects using Checkstyle. Our goal is to investigate the most used rules and check if formatting-related rules, which are the target of this work, are widely used.

⁴ The set of Checkstyle rules we considered in our study is from Checkstyle version 8.33 (released on May 31, 2020).

Results. We found at least one usage for the 174 Checkstyle rules. Figure 2 shows the top-10 most used rules. The bars in dark red represent formatting-related rules, and the bars in gray represent the other rules. In the top-10 most used rules, there are four rules related to formatting. Notably, the top-3 most used rules are formatting-related ones. Therefore, we conclude that formatting-related rules are very important for developers, which validates the relevance of our work.

    Rule                # Projects on GitHub   % of 3,793 projects
    RightCurly          3,719                  98.05%
    RegexpSingleline    3,162                  83.37%
    LeftCurly           3,083                  81.28%
    PackageName         3,047                  80.33%
    UpperEll            3,033                  79.96%
    TypeName            3,018                  79.57%
    ParameterName       2,996                  78.99%
    MemberName          2,966                  78.2%
    FileTabCharacter    2,955                  77.91%
    MethodName          2,947                  77.7%

    Fig. 2: The top-10 most popular Checkstyle rules (bar chart in the original; bar color distinguishes formatting-related from non-formatting-related rules).

IV. STYLER

STYLER is a tool to fix Checkstyle formatting errors in Java source code, in order to help developers in different software development workflows. For instance, STYLER could be used locally as a pre-commit hook when developers are about to release projects. Also, it could be configured to run in Continuous Integration, where pull requests are automatically opened with suggested formatting fixes. In this section, we present the workflow and the technical principles of STYLER.

A. Targeted Error Types

STYLER is about learning how to repair errors related to formatting coding conventions (see Section II-A). For instance, consider a developer who specified that the left curly token “{” in a conditional block must always be placed on a new line (as shown in Figure 1a). If this rule is not satisfied (e.g. as in Figure 1b), Checkstyle triggers a formatting-related error (see Figure 4a). In order to fix this violation, a new line break should be inserted in the program before the token “{”.

In Checkstyle, there are different classes of checks: e.g. formatting, naming, and lightweight linting checks. In STYLER, we exclusively focus on formatting checks, such as indentation and whitespace before and after punctuation. We ignore Checkstyle checks that are not related to formatting, e.g. unused imports and method name.

B. STYLER Workflow

Figure 3 shows the STYLER workflow. It is composed of two main components: ‘STYLER training’ for learning how to fix formatting errors and ‘STYLER prediction’ for actually repairing a concrete Checkstyle error. STYLER receives as input a software project, including its source code and its Checkstyle ruleset.

    Input: Project with source code and Checkstyle ruleset

    Styler Training (learning):
      A. Training data generation -> B. Error-encoding (tokenization) -> C. Training LSTM models

    Styler Prediction (repairing):
      D. Checkstyle-error localization -> Checkstyle error (Figure 4a)
      E. Error-encoding (tokenization) -> Java code (Figure 4b) tokenized (Figure 4c)
      F. Predicting repair (LSTM models) -> Repaired Java code tokenized (Figure 4e)
      G. Repair-decoding (de-tokenization) -> Repaired Java code de-tokenized (Figure 4f)
      H. Repair verification -> I. Repair selection -> Repaired Java code

    Fig. 3: STYLER workflow.

The component ‘STYLER training’ is responsible for learning how to repair Checkstyle errors on the given project according to its project-specific Checkstyle ruleset. It creates the training data by injecting Checkstyle formatting errors on source code files in the project (step A). Then, it translates the training data into abstract token sequences (step B) in order to train LSTM neural networks (step C). The learned LSTM models are eventually used to predict repairs.

The component ‘STYLER prediction’ is responsible for predicting fixes for real Checkstyle errors. It first localizes Checkstyle errors by running Checkstyle on the project (step D). Then, STYLER encodes the error line into an abstract token sequence (step E), which is given as input to the LSTM models (step F) previously learned. The models predict fixes for the
given Checkstyle error: these fixes are in the format of abstract token sequences, so they must be translated back to Java code (step G). STYLER then runs Checkstyle on the new Java code containing the predicted fixes (step H). Finally, among the predicted fixes where no Checkstyle error is raised, STYLER selects one formatting repair to give as output (step I). As STYLER only impacts the formatting of the code, its repairs do not change the behavior of the program under consideration.

C. STYLER in Action

Consider the Checkstyle error presented in Figure 4a. This error is raised by a violation of the Checkstyle LeftCurly rule: the left curly should be on a new line. Checkstyle provides, for a given error, the location (line and column) where the Checkstyle rule is violated. The Java source code that caused such an error is presented in Figure 4b.

STYLER encodes the incorrectly formatted lines (Figure 4b) into the abstract token sequence shown in Figure 4c. Then, this abstract token sequence is given as input to LSTM models, which predict the formatting token sequence shown in Figure 4d. This predicted formatting token sequence is then used to modify the formatting tokens from the buggy abstract token sequence. It results in a predicted abstract token sequence, as shown in Figure 4e, that may fix the current Checkstyle error. The diff between Figure 4c and Figure 4e shows that the predicted repair is the replacement of the formatting token 4_SP by 1_NL. This predicted repair means that the four whitespaces before the token “{” should be replaced by a new line.

Then, the predicted abstract token sequence (Figure 4e) is translated back to Java code (Figure 4f). Finally, when running Checkstyle on the new Java code, no Checkstyle error is raised, meaning that STYLER successfully repaired the error.

    [ERROR] .../NodeRelationshipCache.java:812:82: '{' at column 82 should be on a new line. [LeftCurly]

    (a) Checkstyle LeftCurly rule violation.

    812     public void visitChangedNodes( NodeChangeVisitor visitor, int nodeTypes )    {
    813         long denseMask = changeMask( true );

    (b) Source code snippet of the error.

    before-context Identifier 0_SP , 1_SP int 1_SP Identifier 1_SP ) 4_SP { 1_NL_4_ID long 1_SP Identifier 1_SP = 1_SP Identifier 0_SP ( 1_SP after-context

    (c) Buggy abstract token sequence.

    0_SP 1_SP 1_SP 1_SP 1_NL 1_NL_4_ID 1_SP 1_SP 1_SP 0_SP 1_SP

    (d) Formatting token sequence generated by an LSTM model.

    before-context Identifier 0_SP , 1_SP int 1_SP Identifier 1_SP ) 1_NL { 1_NL_4_ID long 1_SP Identifier 1_SP = 1_SP Identifier 0_SP ( 1_SP after-context

    (e) Predicted abstract token sequence.

    812     public void visitChangedNodes( NodeChangeVisitor visitor, int nodeTypes )
    813     {
    814         long denseMask = changeMask( true );

    (f) Source code snippet with repaired formatting.

    Fig. 4: STYLER: from the Checkstyle-formatting error to a fix.

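The formatting tokens that appear in Figure 4 (4_SP, 1_NL, 1_NL_4_ID) correspond to raw whitespace between two consecutive Java tokens. The following Python sketch shows one way to map whitespace to such a token and back. It assumes space-only indentation and ignores tabulations; `formatting_token` and `whitespace_for` are hypothetical helper names introduced here for illustration.

```python
def formatting_token(gap, previous_indent):
    """Encode the whitespace `gap` between two Java tokens.

    `previous_indent` is the indentation (in spaces) of the line that
    contains the token before the gap.
    """
    if set(gap) - {" ", "\n"}:
        raise ValueError("this sketch only handles spaces and new lines")
    new_lines = gap.count("\n")
    if new_lines == 0:
        return f"{len(gap)}_SP"
    # Indentation of the next token's line: characters after the last new line.
    next_indent = len(gap) - gap.rfind("\n") - 1
    delta = next_indent - previous_indent
    if delta > 0:
        return f"{new_lines}_NL_{delta}_ID"
    if delta < 0:
        return f"{new_lines}_NL_{-delta}_DD"
    return f"{new_lines}_NL"  # a zero delta is not represented


def whitespace_for(token, previous_indent):
    """Decode a formatting token back to whitespace (the inverse direction)."""
    parts = token.split("_")
    count, kind = int(parts[0]), parts[1]
    if kind == "SP":
        return " " * count
    delta = 0
    if len(parts) == 4:
        delta = int(parts[2]) if parts[3] == "ID" else -int(parts[2])
    return "\n" * count + " " * (previous_indent + delta)


# Figure 4b: line 812 is indented by 4 spaces in this sketch, line 813 by 8.
before_brace = formatting_token("    ", previous_indent=4)
line_break = formatting_token("\n        ", previous_indent=4)
print(before_brace, line_break)
```

Decoding the predicted token 1_NL with the indentation of line 812 yields a new line followed by the same indentation, which is the placement of “{” shown in Figure 4f.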
D. Java Source Code Encoding
   STYLER encodes the Java source code into an abstract token sequence that
is required to predict formatting changes. First, STYLER translates each
Java token to an abstract token by keeping the value of the Java keywords,
separators, and operators (e.g. + → +), and by replacing the other token
kinds such as literals, comments, and identifiers by their types (e.g.
x → Identifier). Second, for each pair of subsequent Java tokens, STYLER
creates an abstract formatting token that depends on the presence of a new
line. If there is no new line, STYLER counts the number of whitespaces, and
then represents it as n_SP, where n is the number of whitespaces (e.g. a
single space → 1_SP). If there is no whitespace between two Java tokens
(e.g. x=), STYLER adds 0_SP between the tokens. The same process is applied
for tabulations.
   If there are new lines between two Java tokens, STYLER first counts the
number of new lines, and represents it as n_NL, where n is the number of
new lines. Then, STYLER calculates the indentation delta (∆) between the
line containing the previous token and the line containing the next token:
the delta is the difference of the indentation between the two lines (the
indentation is composed of whitespace or tabulation characters,
exclusively, depending on the project). Positive indentation deltas are
represented by ∆_ID (indent), negative ones are represented by ∆_DD
(dedent), and deltas equal to zero (there is no indentation change between
two lines) are ignored: they are not represented by an abstract token. The
complete representation after the calculation of the number of new lines
and the indentation delta is n_NL_∆_(ID|DD): for instance, in Figure 4b,
the new line between lines 812 and 813 is represented by 1_NL_4_ID, i.e.
one new line and indentation delta +4.

E. Training Data Generation
   STYLER does not use predefined templates for repairing formatting
errors. STYLER uses machine learning for inferring a model to repair
formatting errors and, consequently, it needs training data. One option is
to mine past commits from the project under consideration to collect
training data. However, there might not exist enough data in the history of
the project to cover all Checkstyle formatting rules.
   So in order to have enough data for training, our key insight is to
generate the training data. The idea is to modify error-free Java source
code files in the project in order to trigger Checkstyle formatting rule
violations. Then, one obtains a pair of files (α_orig, α_err): α_orig is
the file without the formatting

error, and α_err is the file with the formatting error. α_orig is a
repaired version of α_err, and we can use supervised machine learning to
predict α_orig given α_err. We experiment with that idea in two different
ways (called protocols in this paper) to generate training data: we name
them Styler_random and Styler_3grams, which we present as follows.
    The Styler_random protocol for injecting Checkstyle errors in a project
consists of automated insertion or deletion of a single formatting
character (space, tabulation, or new line) in Java source files. These
modifications require a careful procedure so that 1) the project still
compiles and 2) its behavior is not changed. For this, we specify the
locations in the source code files that are suitable to perform the
modifications. For insertions, the suitable locations are before or after
any token. For deletions, the suitable locations are 1) before or after any
punctuation (“.”, “,”, “(”, “)”, “[”, “]”, “{”, “}”, and “;”), 2) before or
after any operator (e.g. “+”, “-”, “*”, “=”, “+=”), and 3) in any token
sequence longer than one indentation character.
    The Styler_3grams protocol is meant to produce likely errors. It
performs modifications at the abstract token level instead of directly
changing the Java source code as Styler_random does. The idea is to replace
formatting tokens by the ones used by developers in a similar context (i.e.
the same surrounding Java tokens). For that, we use 3-grams, where
3gram = {Java_token, formatting_token, Java_token}.

Algorithm 1 Batch injection of Checkstyle errors in Java files.
Input: ruleset – Checkstyle configuration of the project under consideration
Input: files – corpus of error-free Java files taken from the project
Input: numberOfErrors – number of errored files to be generated
Input: protocol in [Styler_random, Styler_3grams]
Output: dataset with Checkstyle errors
  const BATCH_SIZE ← 500
  var dataset ← {}
  while dataset.length < numberOfErrors do
      var modifiedFiles ← {}
      for i ← 0; i < BATCH_SIZE; i++ do
          file ← selectRandom(files)
          file′ ← changeFormatting(file, protocol)
          modifiedFiles.append(file′)
      end for
      checkstyleResult ← runCheckstyle(modifiedFiles, ruleset)
      erroredFiles ← selectErroredFiles(checkstyleResult)
      dataset.append(erroredFiles)
  end while
  return dataset
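
The batch loop of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than STYLER's code: `inject_errors` and its two callback parameters are hypothetical names, and the two toy callbacks stand in for the injection protocol and for Checkstyle so that the example runs without any external tool.

```python
import random


def inject_errors(files, number_of_errors, change_formatting, run_checkstyle,
                  batch_size=500, rng=None):
    """Collect modified sources that trigger exactly one checker error.

    change_formatting(source, rng) -> modified source   (hypothetical hook)
    run_checkstyle(sources) -> list of error counts     (hypothetical hook)
    """
    rng = rng or random.Random()
    dataset = []
    while len(dataset) < number_of_errors:
        # Build one batch of randomly chosen, randomly modified files.
        batch = [change_formatting(rng.choice(files), rng)
                 for _ in range(batch_size)]
        # The checker runs once per batch, as in Algorithm 1.
        error_counts = run_checkstyle(batch)
        dataset.extend(source for source, count in zip(batch, error_counts)
                       if count == 1)
    # Trimming to the requested size is a choice made for this sketch.
    return dataset[:number_of_errors]


def insert_random_space(source, rng):
    """Toy injection: insert one space at a random offset.

    Unlike the protocol described above, it does not restrict the
    insertion to locations that keep the program compilable.
    """
    position = rng.randrange(len(source) + 1)
    return source[:position] + " " + source[position:]


def count_double_spaces(sources):
    """Toy checker: one 'error' per occurrence of two consecutive spaces."""
    return [source.count("  ") for source in sources]


corpus = ["int x = 1;", "return x + y;"]
dataset = inject_errors(corpus, 20, insert_random_space, count_double_spaces,
                        batch_size=50, rng=random.Random(0))
print(len(dataset))
```

Passing the protocol and the checker as callables keeps the loop itself independent of how errors are injected, which mirrors the `protocol` input of Algorithm 1.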
So given an error-free Java file, the task of Styler_3grams is the
following. First, the Java file is tokenized (see Section IV-D), and a
random formatting token is picked and used to form a 3-gram, which is
3gram_orig. Then, given a corpus of 3-grams previously mined from a
project, Styler_3grams finds a 3gram_i-corpus that matches the surrounding
Java tokens of 3gram_orig. Several matches can be found, but the selection
of a 3gram_i-corpus is random according to its frequency in the corpus.
Then, 3gram_orig is replaced by 3gram_i-corpus: since the Java tokens
match, only the formatting token is actually replaced. Finally,
Styler_3grams performs a de-tokenization so that an error version of the
original error-free Java file is created.
    Algorithm 1 presents the algorithm that STYLER uses to generate one
training dataset per protocol (Styler_random and Styler_3grams). The input
of the algorithm is the Checkstyle ruleset of the project, a corpus of
error-free Java files taken from the project, the number of errored files
to be generated, and the injection protocol to be used. Then, in each batch
iteration, a random file is selected from the corpus of error-free Java
files, and the specified injection protocol is applied to it. Once a batch
is completed, Checkstyle is executed so that the algorithm selects the
modified files that contain a single error. The algorithm ends when the
desired number of errored files is reached.

F. Error Encoding
   In order to repair formatting errors, the Java source code encoding
using an abstract token sequence (see Section IV-D) must capture both the
error in the code and the context surrounding the error. Therefore, STYLER
considers a token window of k lines before and after the error.
   Once the context surrounding a formatting error is tokenized, STYLER
places two tags around the error, so that its location and its violation
type can be further identified. The tags consist of the name of the
Checkstyle rule that was violated and raised the error. For instance, the
error presented in Figure 4a is about the Checkstyle LeftCurly rule, so the
tags around the error are <LeftCurly> and </LeftCurly>, as shown in
Figure 4c.
   To insert the tags concerning the error type in the abstract token
sequence, STYLER needs to find a place so that the tags surround the tokens
related to the origin of the error, and at the same time to minimize the
number of tokens between the two tags to have precise information about the
location. STYLER places the tags according to the location information
given by Checkstyle (line and column). When Checkstyle provides the line
and the column, STYLER places the opening tag n tokens before the error and
the closing tag n tokens after. When Checkstyle provides the line but not
the column (e.g. when the error is about the LineLength rule), STYLER
places the opening tag i tokens before the line and the closing tag j
tokens after the end of the line. The values of k, n, i, and j are
explained in Section IV-I.

G. Machine Learning Model
Learning (Figure 3–step C). STYLER aims to translate a buggy token sequence
(input sequence) to a new token sequence with no Checkstyle errors (output
sequence). STYLER uses a sequence-to-sequence translation based on a
recurrent neural network LSTM (Long Short-Term Memory), similar to what is
used for natural language translation. Thanks to the token abstraction
employed by STYLER to encode Java source code

(see Section IV-D and Section IV-F), the input and output vocabularies are
small (respectively ∼150 and ∼50), hence are well handled by LSTM models.
We use LSTM with bidirectional encoding, which means that the embedding is
able to catch information around the formatting error in the two
directions: for instance, an error triggered by the Checkstyle
WhitespaceAround rule, which checks that a token is surrounded by
whitespaces, requires the contexts before and after the token.

Predicting/Repairing (Figure 3–step F). Once the LSTM models are trained
(one per training protocol, see Section IV-E), STYLER can be used for
predicting fixes for an erroneous sequence I as in Figure 4c. For an input
sequence I, an LSTM model predicts x alternative formatting token sequences
using a technique called beam search, which we use off-the-shelf. These
alternatives are all potential repairs for the formatting error (e.g.
Figure 4d).
   Note that the LSTM models predict formatting token sequences (e.g.
Figure 4d), but the goal is to have token sequences containing Java and
formatting tokens (e.g. Figure 4e), so they can further be translated back
to Java code. Then, STYLER generates a new abstract token sequence (O_i)
for each formatting token sequence (F_i), based on the original input I,
such as in Figure 5a. Recall that I is composed of pairs of Java tokens and
formatting tokens (see Section IV-D), therefore its number of formatting
tokens is L_I = length(I)/2. However, an LSTM model does not enforce the
output size, thus we cannot guarantee that the length of a predicted
formatting token sequence (L_Fi = length(F_i)) is equal to L_I. If
L_Fi > L_I, STYLER uses the first L_I formatting tokens from F_i and
ignores the remaining ones to generate O_i, such as in Figure 5b. If
L_Fi < L_I, STYLER uses all formatting tokens from F_i, and copies the
L_Fi + 1, L_Fi + 2, ..., L_I original formatting tokens from I, such as in
Figure 5c. Finally, after creating x abstract token sequences O_i, STYLER
continues its workflow (Figure 3–step G).

I   = ( 0_SP Identifier 0_SP , 1_SP Identifier 1_SP
F_i =   0_SP            1_SP   1_NL            1_SP
O_i = ( 0_SP Identifier 1_SP , 1_NL Identifier 1_SP
                   (a) length(F_i) = length(I)/2.

I   = ( 0_SP Identifier 0_SP , 1_SP Identifier 1_SP
F_i =   0_SP            1_SP   1_SP            2_SP 1_NL_4_DD
O_i = ( 0_SP Identifier 1_SP , 1_SP Identifier 2_SP
                   (b) length(F_i) > length(I)/2.

I   = ( 0_SP Identifier 0_SP , 1_SP Identifier 1_SP
F_i =   0_SP            1_SP
O_i = ( 0_SP Identifier 1_SP , 1_SP Identifier 1_SP
                   (c) length(F_i) < length(I)/2.
Fig. 5: Generation of the sequence O_i based on the predicted formatting
tokens F_i and the input I.

H. Repair Verification and Selection
   STYLER performs x predictions per training data generation protocol
(i.e. Styler_random and Styler_3grams), so in the end STYLER generates
x × 2 predictions to repair a single error. After the translation of those
predictions back to Java source code (Figure 3–step G), STYLER performs a
verification (Figure 3–step H), where Checkstyle is executed on the
resulting Java source code files. From the correctly repaired files (i.e.
the ones that do not result in Checkstyle errors), STYLER selects the best
one to give as output, where the best prediction is the one that has the
smallest source code diff (Figure 3–step I).

I. Implementation
   STYLER is implemented in Python. We use javalang [18] for parsing and
OpenNMT-py [25] for the machine learning part. The code is publicly
available [21].
   For optimally training the LSTM models, we performed an exploratory
study by training models with different configurations. The configurations
combine values for key parameters, which are the model attention type
(general or mlp), the number of layers (1, 2, or 3) and the number of units
(256 or 512) for the model encoder/decoder, and the model word embedding
size (256 or 512). For each configuration, the training was performed for a
maximum of 20k iterations, with a batch size of 32, and a model was saved
at iterations 10k and 20k. This means that, in the end, we obtained 48
models (2 model attention types × 3 numbers of layers × 2 numbers of units
× 2 embedding sizes × 2 numbers of training iterations) per training data
generation protocol (i.e. Styler_random and Styler_3grams).
   Those models were created for one open-source project⁵, randomly
selected from the top-5 projects with most diversity in terms of number of
formatting rules (see Section V-B). The project was given as input to
STYLER, which produced training data by injecting Checkstyle errors in
error-free files in the project (see Section IV-E). For each protocol, 10k
errors were injected. This data was used to train the LSTM models, where 9k
errors were used for training and 1k for validation. When the 48 models per
protocol were created, we ran each of them on real errors from the project
so that we could test the models and choose the configuration of the best
ones. We picked the configuration of the models, one per protocol, that
repaired the most real errors. The best Styler_random-based model was with
general model attention type, 2 layers, 512 units, embedding size of 512,
and 20k training iterations, and the best Styler_3grams-based model was
with general model attention type, 1 layer, 512 units, embedding size of
256, and 20k training iterations. Those are the configurations we used for
training the models for our experiments described in Section VI.

  ⁵ https://github.com/inovexcorp/mobi

   For prediction, the beam search creates x = 5 potential repairs per
model. Finally, about the error encoding, we set k = 5, n = 10, i = 2, and
j = 13. Recall that those parameters are about the token window before and
after the error (i.e. the context surrounding the error) and the placement
of tags for

the location and violation type identification once the error is encoded.
These parameters are made big enough to contain important information and,
at the same time, small enough to still allow for learning and prediction,
and were set based on meta-optimization.

                      V. EVALUATION DESIGN

   We conduct an evaluation of STYLER on real Checkstyle errors mined from
GitHub repositories, and compare STYLER against three state-of-the-art code
formatting systems. In this section, we present the design of our
evaluation.

A. Research Questions
   We aim to answer the following five research questions.

RQ #1 [Accuracy]: To what extent does STYLER repair real-world Checkstyle
errors, compared to other systems?
Overall accuracy is an important metric to measure the value of tools. We
investigate the accuracy of STYLER on real Checkstyle errors, which allows
us to understand to what extent STYLER repairs formatting errors that have
occurred in practice. Moreover, we compare the accuracy of STYLER to the
accuracy of three code formatters, by using the same dataset of errors, to
investigate if, and to what extent, STYLER outperforms the competing
systems.

RQ #2 [Error type]: To what extent does STYLER repair different error
types, compared to other systems?
Checkstyle has different formatting rules, so it raises different error
types. In this research question, we investigate if, and to what extent,
STYLER repairs different error types compared to the other systems. This
analysis is also important to find if the systems are complementary to each
other.

RQ #3 [Quality]: What is the size of the repairs generated by STYLER,
compared to other systems?
There may be several alternative repairs that fix a given Checkstyle error,
including ones that change other lines in the program and not only the
ill-formatted line. In this research question, we compare the size of the
repairs produced by STYLER against the repairs from the other systems.

RQ #4 [Performance]: How fast is STYLER for learning and for predicting
formatting repairs?
To investigate if STYLER is applicable in practice, we measure its
performance for fixing Checkstyle errors. This is valuable information for
those interested in using STYLER as a pre-commit hook in IDEs or in
continuous integration.

RQ #5 [Technical analysis]: How do the two training data generation
techniques of STYLER contribute to its accuracy?
Finally, we perform a technical analysis on the two protocols for training
data generation contained in STYLER (see Section IV-E), to investigate if
one of them contributes more to the accuracy of STYLER. This is an
important investigation from the research viewpoint so that other
researchers can further choose a random or a 3-gram approach in related
research.

B. Data Collection
   To answer our research questions, we create a dataset of real Checkstyle
formatting errors by mining open source projects. For that, we first build
a list of projects to collect errors from by filtering projects from our
study presented in Section III. We select the projects that 1) use
Checkstyle, 2) have only one Checkstyle ruleset file, 3) contain at least
one Checkstyle formatting rule in the Checkstyle ruleset, and 4) use Maven.
This results in 1,791 projects.
   For each project, we try to reproduce Checkstyle errors with the
following procedure. We first clone the remote repository from GitHub⁶.
Then, we search in the history of the project for the last commit (c_n)
that contains modifications in the checkstyle.xml file: this commit is used
as a starting point for the reproduction of real errors.
   We then perform a sanity check in the checkstyle.xml file from the
commit c_n: if it contains unresolved variables, we discard the project.
Otherwise, we submit all files of c_n to a process of finding a version of
Checkstyle that is compatible with the checkstyle.xml of the project. This
is necessary because new versions of Checkstyle sometimes break backward
compatibility⁷, and they might fail to parse a checkstyle.xml used with
previous versions of Checkstyle. The process consists of executing multiple
Checkstyle versions on the project, from a newer version to an older one,
until finding one version that does not fail or until the available options
end⁸.
   If a compatible Checkstyle version is found, we gather all commits since
c_n, inclusive: this process ensures that all commits are based on the same
version of the Checkstyle ruleset. For each selected commit, we check it
out, and we check if the pom.xml file overrides any Checkstyle
configuration option: if it does, we discard that commit because we cannot
untangle the Maven+Checkstyle configuration with high accuracy. Otherwise,
we run Checkstyle on the commit source tree. If at least one Checkstyle
error is raised, we save the errored Java files and also the metadata
information about the errors (the Checkstyle error types and their
location).
   We remove duplicate Java files according to the file content among all
commits if any. Then, we select the files containing a single Checkstyle
error related to formatting. We perform this selection to accurately
evaluate repairs predicted by STYLER. Finally, we keep projects where all
criteria yield at least 20 Checkstyle formatting errors. By applying this
systematic reproduction and selection process, we obtained a dataset
containing 11,220 Checkstyle errors spread over 70 projects. Additionally,
Table II shows the stats per Checkstyle formatting rule.

  ⁶ All repositories were cloned on June 24, 2020.
  ⁷ Checkstyle release notes:
    https://checkstyle.sourceforge.io/releasenotes.html
  ⁸ Our current implementation supports 35 Checkstyle versions, from 8.0 to
    8.33.

C. Systems Under Comparison
   We selected three systems to be compared with STYLER: one is an
IDE-based code formatter plugin for Checkstyle,

and the other two are the state-of-the-art of machine learning formatters
that aim to assist developers to fix code formatting-related issues without
any prior or ad-hoc formatting rules.
   1) CHECKSTYLE-IDEA: CHECKSTYLE-IDEA [9], also referred to as CS-IDEA in
this paper, is a plugin for the IntelliJ IDE. It provides IDE integrated
feedback against a given Checkstyle ruleset and suggests fixes for
Checkstyle errors.
   2) NATURALIZE: NATURALIZE [3] is a tool dedicated to assisting
developers in fixing coding conventions related to naming and formatting in
Java programs. It learns coding conventions from a codebase and suggests
fixes to developers such as formatting modifications, based on the n-gram
model.
   3) CODEBUFF: CODEBUFF [26] is a code formatter applicable to any
programming language with an ANTLR grammar. Instead of formatting the code
according to ad-hoc rules for a language, CODEBUFF aims to infer the
formatting rules given a grammar for the language and a set of files
following the same formatting rules. For each token, a KNN model makes the
decision to indent it or to align it with another token based on the AST of
the source file.

D. Set-up
   1) CHECKSTYLE-IDEA: To use CS-IDEA, for each project in our dataset, we
first create a project in IntelliJ containing the checkstyle.xml file and
the errored files. Then, we import the Checkstyle ruleset (Settings >
Editor > Code Style > Import schema > Checkstyle configuration). To run the
CHECKSTYLE-IDEA plugin we simply call the function “Refactor code” from the
IDE.
   2) NATURALIZE and CODEBUFF adaptation: To use NATURALIZE, we have to
slightly modify it: i) NATURALIZE recommends multiple fixes, so we take the
first one for a given error as being the repair; and ii) we changed
NATURALIZE to only work for indentation, excluding fixes regarding variable
naming conventions (which are out of the scope of this paper). To run
CODEBUFF, we give it the required configuration, including the number of
spaces for indentation. This number is based on the most common indentation
used in the considered projects (usually two or four spaces).
   3) Training tools: We trained STYLER for each project in our real error
dataset. The training process includes a step for creating the training
data (see Figure 3–step A), where we create 10,000 errors per project. To
conduct a fair evaluation, we ensure that STYLER learns repairs based on
the same Checkstyle ruleset that is used for the real errors in the
evaluation. Therefore, for each project from the real error dataset, we
select as training seeds all error-free Java files from the last commit
that modified the checkstyle.xml file used to collect the real errors. We
take special care of consistency in the observed results: all three machine
learning-based systems, STYLER, NATURALIZE, and CODEBUFF, are trained using
the same Java files.
   4) Testing tools: Finally, we run all the four tools to repair the
11,220 errors from the real error dataset.

  ⁹ STYLER also targets the following rules that are not contained in our
    dataset: AnnotationLocation, AnnotationOnSameLine,
    EmptyForInitializerPad, SingleSpaceSeparator, and TypecastParenPad.

               VI. EVALUATION RESULTS AND DISCUSSION

   We present and discuss the results for our five research questions in
this section.

A. Accuracy of STYLER (RQ #1)
   To measure the accuracy of STYLER and the accuracy of the other three
systems on the 11,220 real errors, we categorize the repair attempts per
status. Table III shows the results per tool and per status of the repair
attempts: repaired/no error refers to errors that were successfully
repaired, i.e. no error is raised after the repair attempt; repaired/new
errors refers to errors that were fixed, but new errors were introduced in
the source code; not repaired/same error refers to errors that were not
repaired, i.e. the same error is still in the source code; not
repaired/same+new refers to errors that were not repaired and new errors
were introduced in the source code; and broken refers to cases containing
files that cannot be parsed by javalang after the repair attempts.
   STYLER repairs 38% of the errors while CS-IDEA repairs 63%, which is the
greatest overall accuracy among the four considered tools. NATURALIZE and
CODEBUFF repair fewer errors (13% and 15%, respectively). To check if there
is a significant difference between STYLER and the other tools, we used the
McNemar test and we considered α = 0.05: we found p-value=0.000 for all
three tests. This means that STYLER and any other tool have different
proportions of repaired errors.
   We note that STYLER and CS-IDEA are the most reliable tools in the sense
of delivering to an end-user either a repaired source code or, in the worst
case scenario, the code with the same error. This is not the case for
NATURALIZE and CODEBUFF, which have higher rates of delivering source code
that has new errors or is broken. They were, however, designed for a
different goal, and do not take into account the Checkstyle ruleset of the
project like STYLER and CS-IDEA do. Yet, they are relevant for our
experiment since they are the state-of-the-art of machine learning-based
code formatters. Our results show the need for specialized, focused tools
to repair Checkstyle errors.

  RQ #1: To what extent does STYLER repair real-world Checkstyle errors,
  compared to other systems?
  STYLER repairs 38% (4,231/11,220) of the Checkstyle errors we found in
  the wild, and it outperforms the two state-of-the-art machine learning
  systems, NATURALIZE and CODEBUFF. CS-IDEA is able to repair 63% of the
  errors; however, we note that CS-IDEA is heavily engineered, whereas
  STYLER’s approach to repair formatting errors is fully automated and
  hence more appropriate for easily handling new and configurable rules.

B. Error Type Analysis (RQ #2)
   To answer RQ #2, we investigate if STYLER is effective in fixing
different Checkstyle error types (one error type is related to one
Checkstyle rule). Figure 6 shows the repaired Checkstyle errors per error
type and per tool in a heatmap. The colour scale is from dark to light
colours, where the darkest

                                      TABLE II: Real error dataset stats per formatting rule9 .
  Checkstyle rule (25)                                   Projects (70)                               Errors (11,220)
  CommentsIndentation                                        10   ( 14%)                                32     (

  Checkstyle rule (errors)       Styler Checkstyle-IDEA    Naturalize    CodeBuff        All
  CommentsIndentation (32)        9.4 %          40.6 %        25.0 %       0.0 %     40.6 %
  EmptyForIteratorPad (10)      100.0 %           0.0 %        40.0 %      40.0 %    100.0 %
  EmptyLineSeparator (2729)       2.3 %          91.9 %        20.4 %       1.0 %     94.5 %
  FileTabCharacter (595)          9.4 %          93.1 %         6.4 %      31.8 %     98.8 %
  GenericWhitespace (6)         100.0 %         100.0 %        16.7 %      33.3 %    100.0 %
  Indentation (755)              84.2 %          92.2 %         3.8 %      74.8 %     94.4 %
  LeftCurly (197)                95.9 %          92.9 %        35.5 %      34.5 %     95.9 %
  LineLength (2774)              31.1 %          48.7 %         0.0 %       1.3 %     51.5 %
  MethodParamPad (62)            51.6 %          80.6 %        11.3 %      12.9 %     87.1 %
  NewlineAtEndOfFile (321)       61.4 %           0.0 %         0.0 %       0.0 %     61.4 %
  NoLineWrap (11)               100.0 %           0.0 %         0.0 %       0.0 %    100.0 %
  NoWhitespaceAfter (44)         18.2 %          22.7 %         2.3 %      15.9 %     22.7 %
  NoWhitespaceBefore (141)       78.0 %          71.6 %        34.8 %      46.1 %     94.3 %
  OneStatementPerLine (4)        25.0 %          25.0 %         0.0 %       0.0 %     25.0 %
  OperatorWrap (231)             55.8 %           0.0 %        15.2 %       4.3 %     57.6 %
  ParenPad (120)                100.0 %          36.7 %        35.0 %      26.7 %    100.0 %
  Regexp (374)                    2.9 %           2.9 %         8.6 %      11.0 %     14.2 %
  RegexpMultiline (8)             0.0 %           0.0 %         0.0 %       0.0 %      0.0 %
  RegexpSingleline (474)         10.8 %          37.8 %         2.1 %       5.3 %     38.0 %
  RegexpSinglelineJava (203)      2.5 %           0.5 %         9.4 %       0.0 %     11.3 %
  RightCurly (372)               54.3 %          76.3 %         3.5 %      26.6 %     78.2 %
  SeparatorWrap (16)             37.5 %           6.2 %        25.0 %       0.0 %     37.5 %
  TrailingComment (370)          88.1 %           0.0 %        31.4 %       0.0 %     88.9 %
  WhitespaceAfter (563)          78.5 %          86.7 %        17.1 %      56.7 %     90.6 %
  WhitespaceAround (807)         93.4 %          79.2 %        40.1 %      29.4 %     96.5 %

                                    Fig. 6: Types of Checkstyle error repaired per tool (RQ #2).
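
Each cell of Fig. 6 is a percentage of the errors of one rule, so an
approximate repair count can be recovered by multiplying the cell by the
per-rule total given in parentheses. The Python sketch below does this for
the Styler column; the names FIG6_STYLER, estimated_repairs, and
total_estimated_repairs are illustrative and not part of STYLER. Because
the published percentages are rounded to one decimal, the counts are
estimates only; their sum should land close to the 4,231 repaired errors
reported for RQ #1.

```python
# Rows copied from Fig. 6: (rule, number of real errors, % repaired by Styler).
FIG6_STYLER = [
    ("CommentsIndentation", 32, 9.4),
    ("EmptyForIteratorPad", 10, 100.0),
    ("EmptyLineSeparator", 2729, 2.3),
    ("FileTabCharacter", 595, 9.4),
    ("GenericWhitespace", 6, 100.0),
    ("Indentation", 755, 84.2),
    ("LeftCurly", 197, 95.9),
    ("LineLength", 2774, 31.1),
    ("MethodParamPad", 62, 51.6),
    ("NewlineAtEndOfFile", 321, 61.4),
    ("NoLineWrap", 11, 100.0),
    ("NoWhitespaceAfter", 44, 18.2),
    ("NoWhitespaceBefore", 141, 78.0),
    ("OneStatementPerLine", 4, 25.0),
    ("OperatorWrap", 231, 55.8),
    ("ParenPad", 120, 100.0),
    ("Regexp", 374, 2.9),
    ("RegexpMultiline", 8, 0.0),
    ("RegexpSingleline", 474, 10.8),
    ("RegexpSinglelineJava", 203, 2.5),
    ("RightCurly", 372, 54.3),
    ("SeparatorWrap", 16, 37.5),
    ("TrailingComment", 370, 88.1),
    ("WhitespaceAfter", 563, 78.5),
    ("WhitespaceAround", 807, 93.4),
]


def estimated_repairs(total_errors: int, percent_repaired: float) -> int:
    """Turn one heatmap cell back into an approximate number of repaired errors."""
    return round(total_errors * percent_repaired / 100)


def total_estimated_repairs(rows) -> int:
    """Sum the per-rule estimates over all rows of one tool's column."""
    return sum(estimated_repairs(n, pct) for _, n, pct in rows)


if __name__ == "__main__":
    for rule, n, pct in FIG6_STYLER:
        print(f"{rule:<22} {estimated_repairs(n, pct):>5} of {n}")
    print("estimated total:", total_estimated_repairs(FIG6_STYLER))
```

The same helper can be applied to any other column of the figure by
swapping in that tool's percentages.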

error and the repaired source code. Among all repairs that pass
all Checkstyle rules, the diff should be as small as possible so
that it is the least disruptive for developers. In the context of
a pull request on GitHub, a smaller diff is usually considered
easier to review and merge [12].
   We calculate the size in lines of the diff from the errors that
STYLER, CS-IDEA, NATURALIZE, and CODEBUFF repaired.
Figure 7 shows the results: the x axis presents the size
distribution of the diffs, and each boxplot represents one tool.
   STYLER (in green) and NATURALIZE (in yellow) have
the smallest medians of diff size, which are both equal to
five changed lines. Yet, they suffer from fewer bad cases
(the right-hand part of the distribution). CS-IDEA (in pink)
and CODEBUFF (in blue) produce larger diff sizes, and have
medians equal to 7 and 55, respectively. In the worst cases,
they produce the largest diffs: the 95th percentile exceeds 200
changed lines, compared to 7 lines for STYLER.
   We performed a Wilcoxon rank sum test to verify if the
distributions of the diff sizes obtained by STYLER and the
other tools are systematically different from one another. We
found p-value=0.000 when testing STYLER with CS-IDEA
and CODEBUFF, and p-value=0.0000000039 when testing
STYLER with NATURALIZE. Considering α = 0.05, we reject
the null hypothesis, which means that the distribution of
STYLER is significantly different from the other ones.

 RQ #3: What is the size of the repairs generated by
 STYLER, compared to other systems?
 STYLER has a median repair size of five changed lines,
 the same as NATURALIZE. Yet, NATURALIZE produces
 small formatting repairs with less reliable predictability
 than STYLER. CS-IDEA and CODEBUFF clearly
 produce bigger formatting repairs. The ability to produce
 small diffs is an important property for code reviews and
 pull-request-based development, hence our results show that
 STYLER can be realistically used in a modern software
 development context.

D. Performance (RQ #4)
   To investigate if STYLER can be used in practice, we
measure the execution time spent when running STYLER on
the real error dataset. Table IV shows the minimum, median,
average, and maximum time spent on projects, split over the
different steps of the STYLER workflow. For training data
generation, STYLER took at least 16 minutes and up to six
and a half hours. To tokenize the training data, it took up to
13 minutes, and for training the models, it took mostly about
one hour. Therefore, the training of STYLER (data generation
+ tokenization + model training) took around two hours and
a half on average. This can be considered acceptable, since the
training is meant to happen only when the coding conventions
change (i.e., when the Checkstyle ruleset file changes), which
happens rarely (a given version of coding conventions usually
lasts for months). After STYLER is trained for a given project,
it takes on average two seconds to predict a repair, which is fast
enough to be used in IDEs or in continuous integration
environments.

 RQ #4: How fast is STYLER for learning and for
 predicting formatting repairs?
 On average, STYLER needs about two hours and a half
 for training, and two seconds for predicting a repair. The
 training time is not an issue since it only happens when the
 Checkstyle ruleset file changes. The prediction time relates
 to usability: our results show that STYLER can be used in
 the IDE or in CI, in a practical setting.

[Figure 7: one boxplot per tool (Styler, Checkstyle-IDEA, Naturalize,
CodeBuff); x axis: diff size, from 0 to 200.]
Fig. 7: Size of the repairs per tool. The two boxplot whiskers
represent the 5th and the 95th percentiles (RQ #3).

TABLE IV: Statistics on the performance of STYLER (RQ #4).

                            Training                         Prediction
           Data generation   Tokenization     Models        Average time
  Step a:         A               B              C               E→I
  Min          00:16:18        00:00:51       00:31:54       1.608 s/err
  Med          00:45:10        00:09:09       00:59:14       2.215 s/err
  Avg          01:15:38        00:08:18       00:58:38       2.277 s/err
  Max          06:30:44        00:13:51       01:22:27       3.407 s/err

   a The steps were executed on a computer with an Intel(R) Core(TM)
     i9-10980XE CPU @ 3.00GHz and 125GiB of system memory. For
     training the models, we used GeForce RTX 2080 Ti GPUs.

E. Technical Analysis on STYLER (RQ #5)
   At prediction time, STYLER uses two trained LSTM models,
each one based on a different training data generation protocol:
Styler_random and Styler_3grams. We investigate how
the two protocols contribute to the final output of STYLER.
We found that STYLER fixed 852 Checkstyle errors with the
Styler_random-based model exclusively, while it fixed 1,008
errors with the Styler_3grams-based model; 2,374 errors
were fixed with both models. This shows that the model based
on Styler_3grams is more effective. Moreover, when selecting
one repair to give as output (Figure 3–step I), STYLER selected
the repair from the Styler_3grams-based model in most cases
because it generates smaller diffs.

 RQ #5: How do the two training data generation
 techniques of STYLER contribute to its accuracy?
 For most errors, STYLER selects a repair predicted by the
 LSTM model based on the Styler_3grams protocol because
 it produces a smaller diff, which is desirable for developers.
 Yet, the model based on Styler_random exclusively
 contributes to the overall accuracy of STYLER with 20% of
 the fixes.

                         VII. DISCUSSION
A. Machine learning versus rule-based approaches
   STYLER employs a machine-learning-based approach for
repairing formatting convention violations. An alternative
approach would be a rule-based one. For instance, there would
be one transformation to be applied in the code per Checkstyle
rule. As said, this approach requires the engineering of a
transformation for every single linter rule, which is
time-consuming. Beyond being costly, this might even be impractical
for highly configurable linters such as Checkstyle: the
rule-based repair system would need to have different
transformations for the same linter rule due to the configurable
properties. On the contrary, a machine learning approach does not
require costly human engineering. It is able to infer
transformations for a diverse set of linter rules. Our experiments
have validated this property in the context of formatting errors
raised by Checkstyle.

B. Threats to Validity
   STYLER generates training data for repairing errors based
on the Checkstyle configuration file contained in a given
project. This means that STYLER assumes that all formatting
rules contained in the Checkstyle configuration file are valid.
In practice, however, developers might ignore the violations of
certain rules. Our experiment does not take this scenario into
account, thus we do not claim that 100% of the fixes produced
by STYLER are necessarily relevant for developers.
   The real error dataset contains Checkstyle errors mined
from GitHub repositories. It is to be noted that it does not
cover all existing Checkstyle formatting rules. It is worth
mentioning that we are still collecting real errors, and those can
potentially cover new rules. Moreover, the dataset might not be
representative of the real distribution of the 19 rules in the real
world. Consequently, future research is needed to strengthen
the validity of our study.
   When selecting real errors, we chose only files containing a
single real Checkstyle error (see Section V-B). We performed
this selection so that we could accurately check if the error
was correctly repaired by the tools. For files containing more
than one error, it is hard to check the correctness of repairs:
once an error is repaired, the location of the other ones in the
file would change. Therefore, our results are based on
single-error files, and future investigations on multiple-error
files are needed.
   Finally, to compare the quality of the repairs produced
by STYLER with the repairs produced by the other three
tools, we measured the size in lines of the diff between the
buggy and repaired program versions. However, the diff size
is only one dimension for comparing the tools, which only
approximates the developer’s perception on formatting repairs.
User studies, such as proposing formatting repairs to
developers, are interesting future experiments to further
investigate the practical value of this research.

                    VIII. RELATED WORK
A. The use of static analysis tools
   Static analysis tools have been the subject of investigation in
recent research. [39] investigated their usage in 20 popular
Java open source projects hosted on GitHub and using Travis
CI to support CI activities. They first found out that the
projects use seven static analysis tools ([8], [14], [28], [20],
[4], [11], and [19]), with Checkstyle being the most used one.
About the integration of static analysis tools in CI pipelines,
they found out that build breakages due to those tools are mainly
related to adherence to coding standards, while breakages
related to likely bugs or vulnerabilities occur less frequently.
[39] discuss that some tools are sometimes configured to not
break the build but just to produce warnings, possibly because
of the high number of false positives.
   [38] investigated the usage of static analysis tools from the
perspective of the development context in which these tools are
used. For that, they surveyed 42 developers and interviewed
11 industrial experts that integrate static analysis tools in their
workflow. They found out that static analysis tools are used in
three main development contexts, which are local environment,
code review, and continuous integration. Moreover, they also
found out that developers consider warning types differently
depending on the context, e.g., when performing code reviews
they mainly look at style conventions and code redundancies.
   [22] focused on one specific static analysis tool: [33].
Through an online survey with 18 developers from different
organizations, they found out that most respondents agree that
the issues reported by static analysis tools are relevant for
improving the design and implementation of software.

B. Learning for repairing compiler errors and behavioral
bugs
Learning for repairing compiler errors. There are related
works in the area of automatic repair of compiler errors.
In this case, the compiler syntax rules are the equivalent of
the formatting rules. There, recurrent neural networks and
token abstraction have been used to fix syntactic errors [7].
In DeepFix, [17] use a language model for repairing syntactic
compilation errors in C programs. Out of 6,971 erroneous C
programs, DeepFix was able to completely repair 27% and
partially repair 19% of the programs. Later, [1] proposed
TRACER, which outperformed DeepFix, repairing 44% of
the programs. [32] confirmed the efficiency of LSTM over
n-grams and of token abstraction for single token compiling
errors. These approaches do not target formatting errors, which
is the target of STYLER.
Learning for repairing behavioral bugs. As for repairing
compiler errors, there are also learning systems for repairing
behavioral bugs, those that, for instance, break test cases. [37]
investigated the feasibility of using Neural Machine Translation
techniques for learning bug-fixing patches for real defects.
They mined millions of buggy and patched program versions
from the history of GitHub repositories, and abstracted them
to train an Encoder-Decoder model. The model was able to fix
hundreds of unique buggy methods in the wild. [10] proposed
SequenceR, an end-to-end program repair approach focused on
one-line fixes. In an experiment with Defects4J, SequenceR
was shown to be able to learn to repair behavioral bugs by
generating patches that pass all tests.

C. Linter-error repair and formatting
Linter-error repair. There are some tools to fix errors raised
by specific linters. For instance, ESLint [13] is a linter for
JavaScript, but it also includes automated solutions to repair
errors raised by it. For Python, there exists the autopep8 tool
[5], which formats Python code to conform to the PEP 8
Style Guide for Python Code [27]. For Java, there exists the
CHECKSTYLE-IDEA [9] plugin for IntelliJ, which we
compared to STYLER. CHECKSTYLE-IDEA is able to
highlight the error and also to suggest fixes in some cases.
However, it is very limited in repairing errors from several
different rules as we have shown in RQ #2.
Code formatters. A way to enforce formatting conventions lies
in code formatters. In Section V-C, we described NATURALIZE
[3] and CODEBUFF [26]: NATURALIZE recommends fixes
for coding conventions related to naming and formatting in
Java programs, and CODEBUFF infers formatting rules for
any language given a grammar. Similar to the idea behind
CODEBUFF, [30] had previously experimented with different
learning algorithms and feature set variations to learn the style
of a given corpus so that it could be applied to arbitrary code.
Beyond those academic systems, there are code formatters
such as google-java-format [15], which reformats source code
according to the Google Java Style Guide [16], and as such
fixes violations of the Google Style. However, these formatters
are usually not configurable or require manual tweaking, which
is a tedious process for developers. This is a problem because
not all developers are ready to follow a single convention
style. STYLER, on the other hand, is generic and automatically
captures the conventions used in a project to fix formatting
violations.

                         IX. CONCLUSION
   In this paper, we presented STYLER, which implements a
novel approach to repair formatting errors raised by Checkstyle,
the popular linter for Java programs. STYLER creates
a corpus of Checkstyle errors, learns from it, and predicts
fixes for new errors, using machine learning. Our experimental
results on 11,220 real Checkstyle errors showed that STYLER
repairs real errors from a more diverse set of Checkstyle rules
than the systems CHECKSTYLE-IDEA, NATURALIZE, and
CODEBUFF. Moreover, STYLER produces smaller repairs than
the compared systems, and its prediction time is low so it can
be used in IDEs or in Continuous Integration environments.
   There are interesting areas for future work. First,
improvements on the error injection protocols for creating training
data can be done so as to improve the representativeness of seeded
formatting errors. This might increase the performance of