The Effect of Article Characteristics on Citation Number in a Diachronic Dataset of the Biomedical Literature on Chronic Inflammation: An Analysis ...

Page created by Lauren Sherman
 
CONTINUE READING
The Effect of Article Characteristics on Citation Number in a Diachronic Dataset of the Biomedical Literature on Chronic Inflammation: An Analysis ...
publications

Article
The Effect of Article Characteristics on Citation Number in a
Diachronic Dataset of the Biomedical Literature on Chronic
Inflammation: An Analysis by Ensemble Machines
Carlo Galli *           and Stefano Guizzardi

                                          Department of Medicine and Surgery, Histology and Embryology Lab, University of Parma, Via Volturno 39,
                                          43126 Parma, Italy; stefano.guizzardi@unipr.it
                                          * Correspondence: carlo.galli@unipr.it; Tel.: +39-0521-906740

                                          Abstract: Citations are core metrics to gauge the relevance of scientific literature. Identifying features
                                          that can predict a high citation count is therefore of primary importance. For the present study,
                                          we generated a dataset of 121,640 publications on chronic inflammation from the Scopus database,
                                          containing data such as titles, authors, journal, publication date, type of document, type of access
                                          and citation count, ranging from 1951 to 2021. Hence we further computed title length, author count,
                                          title sentiment score, number of colons, semicolons and question marks in the title and we used these
                                          data as predictors in Gradient boosting, Bagging and Random Forest regressors and classifiers. Based
                                          on these data, we were able to train these machines, and Gradient Boosting achieved an F1 score of
                                          0.552 on classification. These models agreed that document type, access type and number of authors
                                          were the best predicting factors, followed by title length.
         
                                   Keywords: citations; title; machine learning
Citation: Galli, C.; Guizzardi, S.
The Effect of Article Characteristics
on Citation Number in a Diachronic
Dataset of the Biomedical Literature      1. Introduction
on Chronic Inflammation: An                     In a world where the sheer amount of available information is growing exponentially,
Analysis by Ensemble Machines.            ranking knowledge according to its reliability has become crucial, and the scientific com-
Publications 2021, 9, 15.                 munity has therefore attributed increasing importance to scientometrics, i.e., the ability
https://doi.org/10.3390/                  to analyze the impact of science [1]. Scientometrics, with all its limits [2], is the compass
publications9020015
                                          that is commonly used to drive strategic decisions in public health, in grant policies, to
                                          allocate funding, and at a lower level, to determine hiring and promotions [3]. Within
Received: 27 December 2020
                                          scientometrics, citations are the single most distinctive parameter to estimate the visibility
Accepted: 30 March 2021
                                          of a scientific publication [4]. The number of citations is also used to compute further
Published: 6 April 2021
                                          indices of performance, such as the Hirsch’s score, which are used to evaluate the scientific
                                          productivity of a scholar [5].
Publisher’s Note: MDPI stays neutral
                                                It may be—not without a reason—assumed that good science will eventually be
with regard to jurisdictional claims in
                                          cited more often, but in a situation of information overload in fields such as life sciences
published maps and institutional affil-
iations.
                                          and medicine, where “mountains of unloved and unread publications [exist]” [6] researchers
                                          have long strived to investigate what factors may affect the number of citations a study
                                          can accrue [7–10]. Several publications have pointed out that shorter titles can facilitate
                                          citations, possibly by making an article stand out during literature searches; its meaning
                                          more easily apprehended, and therefore remembered [11]. Besides cognitive mechanisms,
Copyright: © 2021 by the authors.
                                          other factors can possibly affect how often an article is searched, viewed and cited. Open
Licensee MDPI, Basel, Switzerland.
                                          access articles, for instance, are more easily accessible for scholars in low-budget contexts,
This article is an open access article
                                          who may not rely on institutional subscriptions [12]. The genre of the article could also
distributed under the terms and
                                          make a study more searched. Review articles, at least narrative reviews, generally describe
conditions of the Creative Commons
Attribution (CC BY) license (https://
                                          the state of the art in a given field and are often beloved reading for scholars looking for an
creativecommons.org/licenses/by/
                                          overview of a topic [13]. And even the number of authors (and prestigious authorship of
4.0/).                                    course) could affect citations [3], as each author can be part of a network of acquaintances

Publications 2021, 9, 15. https://doi.org/10.3390/publications9020015                                 https://www.mdpi.com/journal/publications
The Effect of Article Characteristics on Citation Number in a Diachronic Dataset of the Biomedical Literature on Chronic Inflammation: An Analysis ...
Publications 2021, 9, 15                                                                                            2 of 11

                           who may be more easily attracted and read, and therefore cite, their work, not to mention
                           the relevance of specific journals with high impact factor that may be a preferred reading
                           and citation source for scholars [14].
                                The goal of this study was to examine the association between some selected article
                           features, including textual features of the title and number of citations, in a diachronic
                           dataset of studies on chronic inflammation retrieved from Scopus using a set of specific
                           machine learning techniques, namely ensemble classifiers. This specific research field was
                           chosen for consistency reasons, because our group has been focusing and analyzing it in
                           the last few years. Although differences with publications in other medical areas may exist,
                           they are difficult to anticipate, and the conclusions from our survey are therefore applicable
                           mostly to the topic of inflammation.

                           2. Materials and Methods
                                 The analysis was conducted on a literature dataset created through Scopus. The
                           Scopus database was searched using the ‘Chronic Inflammation’ keywords, without filters.
                           All the information was downloaded as several.csv files (due to inherent limitations in
                           Scopus, which does not allow to download more than 20,000 records at a time), which
                           were then imported in a Jupyter notebook running Python 3.6, using the Pandas library.
                           The dataframes were concatenated into one single dataframe without duplicates. For data
                           analysis, the python Numpy, Pandas and Scipy libraries were used. Matplotlib and Seaborn
                           libraries were used for data visualization.
                                 The information provided by Scopus included title of the article, authors, journal,
                           publication date, type of document, type of access, publication stage and citation count,
                           besides Scopus ID codes, DOI and a link to the article. We proceeded to create further
                           features for each publication. We used the VADER implementation [15] in the Nltk library
                           to perform a sentiment analysis of the titles, which yielded a sentiment score. We used the
                           Re library to search for colons, semicolons and question marks within titles through regular
                           expressions. We also used the implementations of Linear Regression, Gradient Boosting,
                           Bagging, Random Forest regressor and classifiers and Logistic Regression algorithms in
                           the Scikit-learn package. To train the models the data were split into a training and a
                           testing set through Scikit-learn. The models were tuned using Randomized search CV in
                           the Scikit-learn package. The performance of the models was then evaluated on accuracy,
                           precision, recall and F1 score (the harmonic mean of precision and recall) calculated on
                           test data.

                           3. Results
                           3.1. Overview of the Dataset
                                 Our Scopus search retrieved 159,461 titles, published over the course of 194 years from
                           1827 to 2021 in 12,308 academic journals. Unfortunately, 25,227 articles lacked a citation
                           count, which could be attributable at least in part to their recent publication date (data not
                           shown). Though we could not rule out that some of those titles were articles that were
                           actually never cited, we decided to keep only those uncontroversial articles where a citation
                           count ≥1 was provided.
                                 Several kinds of publications were initially present in our dataset (data not shown),
                           e.g., letters, surveys, book chapters and even conference papers. Most publications, how-
                           ever, fell within the ‘Article’ or ‘Review’ categories, and the dataset was restricted to these
                           two classes only.
                                 We furthermore decided to keep only those papers that were published after 1951,
                           as only very few papers in our dataset were published before that date and their scarce
                           numbers might affect our analysis. We also decided to exclude papers with unusually high
                           numbers of authors, which, in a few very peculiar cases, could amount to hundreds. Based
                           on the distribution of author number, we removed papers with more authors than 1.5 times
                           their interquartile range (i.e., 15 authors). The remaining dataset, therefore, included
                           121,640 manuscripts.
The Effect of Article Characteristics on Citation Number in a Diachronic Dataset of the Biomedical Literature on Chronic Inflammation: An Analysis ...
Publications 2021, 9, x FOR PEER REVIEW                                                                                                       3 of 11

Publications 2021, 9, 15                                                                                                                      3 of 11
                                  times their interquartile range (i.e., 15 authors). The remaining dataset, therefore, included
                                  121,640 manuscripts.

                                  3.2. Features Distribution
                                        The temporal distribution
                                                       distribution of articles
                                                                        articles by document
                                                                                    document type
                                                                                              type can
                                                                                                   can be seen in Figure 1, which
                                  shows, beside the well-known steady increment in publications over the years, the robust
                                  surge of
                                         of Reviews,
                                             Reviews,which
                                                        whichhave
                                                                havebecome
                                                                      becomemoremoreprominent
                                                                                      prominent since thethe
                                                                                                  since    mid ‘80s‘80s
                                                                                                             mid     andand
                                                                                                                         which  ac-
                                                                                                                             which
                                  count
                                  accountforfor
                                              about
                                                about30%
                                                       30%ofofall
                                                               allthe
                                                                   thepublications
                                                                       publicationsininour
                                                                                        our dataset.
                                                                                            dataset. Beside
                                                                                                     Beside Document
                                                                                                             Document Type, the the
                                  Scopus database contains information about the accessibility of full texts and      and indicates
                                                                                                                          indicates
                                  whether a specific entry belongs to an Open Access typology or a Paid subscription type.
                                  Open Access is a novel business model that grants readers free access to the content of a
                                  report, made possible by the digitalization of publications. Understandably, Open Access
                                  journals have gained in popularity and many journals are now offered as open access, or
                                  with an open access option. Many older articles are also offered as open access to readers,
                                  even if their latest issues may not be.

                 Above: Distribution
      Figure 1. Above:   DistributionofofArticles
                                          Articles(orange)
                                                    (orange)and
                                                             andReviews
                                                                 Reviews  (blue)
                                                                        (blue)   included
                                                                               included in in
                                                                                           thethe present
                                                                                               present    corpus,
                                                                                                       corpus,    expressed
                                                                                                               expressed on aonlog
                                                                                                                                 a
      scale.
      log scale.

                                        To gain deeper insights into the characteristics of the papers in the dataset, we anno-
                                  tated the presence and number of colons and semicolons in the title, as a sign of a more
                                  structured and
                                  structured    andthus
                                                      thuscomplex
                                                             complextitle,
                                                                        title,ororquestion
                                                                                    questionmarks
                                                                                               marks (a (a
                                                                                                        sign  of aofdifferent
                                                                                                           sign               illocutive
                                                                                                                     a different            attitude
                                                                                                                                   illocutive    atti-
                                  in the  sentence),    and   we  also  performed       a sentiment     analysis    using
                                  tude in the sentence), and we also performed a sentiment analysis using VADER, a com-    VADER,       a common
                                  package
                                  mon packagefrom from
                                                    the Nltk
                                                           the library,  whichwhich
                                                               Nltk library,       assigns  a positive
                                                                                         assigns         or negative
                                                                                                   a positive           sentiment
                                                                                                                or negative           to sentences.
                                                                                                                               sentiment      to sen-
                                  This  last
                                  tences.     feature
                                           This         shouldshould
                                                  last feature   be taken bewith
                                                                               takena with
                                                                                       caveat.   VADER
                                                                                             a caveat.     is a human-validated
                                                                                                         VADER      is a human-validated sentiment
                                                                                                                                                 sen-
                                  analysis   method,     supported    by   a  robust   corpus    of literature   [15–17]
                                  timent analysis method, supported by a robust corpus of literature [15–17] but not reallybut  not  really   tested
                                  with
                                  testedbiomedical     texts yet,
                                          with biomedical          although
                                                                texts           our study
                                                                      yet, although      our(Figure    A1) confirms
                                                                                               study (Figure             that most
                                                                                                                A1) confirms      thatpublications
                                                                                                                                         most pub-
                                  display   a  positive   sentiment,   which     is  consistent  with   a positive   publication
                                  lications display a positive sentiment, which is consistent with a positive publication           bias   [18]. bias
                                  [18]. Figure   A1  in  the Appendix     A   summarizes      feature  distribution    in our  dataset,    stratified
                                  by publication
                                        Figure A1type.in theThe   majorityAofsummarizes
                                                              Appendix           publicationsfeature
                                                                                                 in our dataset    are original
                                                                                                          distribution    in our articles,
                                                                                                                                   dataset,almost
                                                                                                                                               strati-
                                  three  times   as many    as  reviews.   Seventy-five     %   of articles are  also  accessible
                                  fied by publication type. The majority of publications in our dataset are original articles,      via   a payable
                                  subscription.
                                  almost            Most as
                                           three times     articles
                                                              manyinasour    datasetSeventy-five
                                                                         reviews.      are simple, without      colons,
                                                                                                       % of articles   areoralso
                                                                                                                             with   two colons
                                                                                                                                 accessible     viaata
                                  most.   Titles  with   more   than  two   colons,    and   up  to six,  are a small   fraction
                                  payable subscription. Most articles in our dataset are simple, without colons, or with two       of  the  dataset.
                                  Most articles
                                  colons   at most.have   no semicolons
                                                      Titles  with more thanor question    marksand
                                                                                    two colons,     (Figure
                                                                                                        up toA1).
                                                                                                               six, About    a thousand
                                                                                                                     are a small   fractionarticles
                                                                                                                                               of the
                                  have a semicolon or a question mark. Though original articles are much more numerous
                                  dataset. Most articles have no semicolons or question marks (Figure A1). About a thou-
                                  than reviews, they are similarly distributed across features.
                                  sand articles have a semicolon or a question mark. Though original articles are much more
                                        The distribution of citations was quite skewed (data not shown). Though few manuscripts
                                  numerous than reviews, they are similarly distributed across features.
                                  were cited a very high number of times—up to 17,996—the overall median number of citations
                                        The distribution of citations was quite skewed (data not shown). Though few manu-
                                  for the included manuscripts was 15, with a 75th percentile of 39. As clearly visible by the
                                  scripts were cited a very high number of times—up to 17,996—the overall median number
                                  distribution of citations over the years, a peak of citations can be observed around the year
                                  of citations for the included manuscripts was 15, with a 75th percentile of 39. As clearly
                                  2000, with citations then declining in more recent years (Figure 2). This reflects the nature of
                                  visible by the distribution of citations over the years, a peak of citations can be observed
                                  citations, which accumulate with time, so manuscripts that were published up to 20 years ago
                                  had obviously more time to collect a higher number of citations. On the other hand, older
                                  articles did not get as many citations, possibly because their cultural parable was already
The Effect of Article Characteristics on Citation Number in a Diachronic Dataset of the Biomedical Literature on Chronic Inflammation: An Analysis ...
ations 2021, 9, x FOR PEER REVIEW                                                                                                          4 of 11

                               around the year 2000, with citations then declining in more recent years (Figure 2). This
          Publications 2021, 9, 15                                                                                                                4 of 11
                               reflects the nature of citations, which accumulate with time, so manuscripts that were
                               published up to 20 years ago had obviously more time to collect a higher number of cita-
                               tions. On the other hand, older articles did not get as many citations, possibly because
                               their cultural declining
                                              parable wasas citations
                                                               alreadystarted  to beas
                                                                         declining   tracked,  or simply
                                                                                       citations   startedbecause  at the time
                                                                                                            to be tracked,   or fewer
                                                                                                                                 simplyjournals existed
                               because at theand
                                               timefewer
                                                     fewerarticles  were
                                                              journals     published
                                                                        existed        and thus
                                                                                and fewer         the chances
                                                                                             articles           to be cited
                                                                                                       were published     andwere
                                                                                                                               thusless
                                                                                                                                     thenumerous. To
                               chances to be account    for the
                                               cited were    lesseffect of timeTo
                                                                  numerous.     onaccount
                                                                                    the citation
                                                                                             for count   of each
                                                                                                  the effect      article
                                                                                                              of time  onwethenormalized
                                                                                                                                citation the citation
                                              number    by  the  average   expected  citation for  the year  of publication,
                               count of each article we normalized the citation number by the average expected citation       and   used that as target
                                              for our  subsequent    analysis.
                               for the year of publication, and used that as target for our subsequent analysis.

                            Figure 2. Distribution of citations for the manuscripts in the present corpus. Values are shown as
                Figure 2. Distribution of citations for the manuscripts in the present corpus. Values are shown as mean +/− stan-
                            mean +/− standard deviation.
                dard deviation.
                                     It has been claimed
                                                       It has beenthat the   length
                                                                         claimed       of the
                                                                                    that  the length
                                                                                                title of of
                                                                                                          a manuscript
                                                                                                             the title of aismanuscript
                                                                                                                                associated is to associated
                                                                                                                                                 its          to its
                               citation count,citation
                                                  though       the results
                                                           count,    though are     conflicting.
                                                                               the results           Further evidence
                                                                                             are conflicting.                has beenhas
                                                                                                                  Further evidence        reported
                                                                                                                                             been reported about
                               about a possible     association
                                               a possible             between
                                                               association        authors
                                                                               between      and citations.
                                                                                          authors               To test
                                                                                                     and citations.     Tothese   hypothesis,
                                                                                                                           test these            we we analyzed
                                                                                                                                       hypothesis,
                               analyzed title title
                                                length    (as   number      of words)    and  number      of  authors    for
                                                      length (as number of words) and number of authors for the manuscripts  the  manuscripts     in in our corpus.
                               our corpus. Title     length     did   not   appear    to change     significantly     over  the  years;  as
                                               Title length did not appear to change significantly over the years; as previously reported,   previ-
                               ously reported, thethe   number
                                                     number       ofofauthors
                                                                        authorshashassteadily
                                                                                       steadilyincreased
                                                                                                  increasedover overtime,
                                                                                                                       time,possibly
                                                                                                                             possibly asas studies
                                                                                                                                            studies have become
                               have become more morecomplex
                                                          complexand    andprogressively
                                                                              progressively      required
                                                                                             required     moremore     competences
                                                                                                                  competences      and and
                                                                                                                                        skillsskills
                                                                                                                                                (data not shown).
                               (data not shown). Title length, however, was not homogeneously distributed across different document
                                     Title length,  however,
                                               types.    Original   was   not homogeneously
                                                                       articles tend to have longer distributed    across
                                                                                                           titles than      different
                                                                                                                         reviews   do, document
                                                                                                                                        while having also more
                               types. Originalauthors
                                                  articles(Figure
                                                              tend toA2).have longer titles than reviews do, while having also more
                               authors (Figure A2).    A cumulative plot of author number and title length can be seen in Appendix B, where
                                     A cumulative       plot of author
                                               these features                 number and
                                                                     are represented           title length
                                                                                          as boxplots           cansignificantly
                                                                                                          and are      be seen in Appendix        B,
                                                                                                                                     different between     reviews
                               where these features
                                               and originalare represented
                                                                    articles with as pboxplots
                                                                                        < 0.001,and by are
                                                                                                        Mann significantly
                                                                                                                 Whitney U    different   between
                                                                                                                                 test (Figure    A3). These data
                               reviews and original
                                               suggestarticles
                                                            that part  with   p < association
                                                                          of the  0.001, by Mann        Whitney
                                                                                                  between            U test (Figure
                                                                                                              title length   or author A3).  These and citations
                                                                                                                                           number
                               data suggest that     part    of  the   association    between     title length
                                               might actually be mediated by differences in article genre.        or  author   number    and   cita-
                               tions might actually       be   mediated     by  differences    in  article  genre.
                                                       To better understand this phenomenon, we preliminarily investigated the relation
                                     To betterbetween
                                                 understand         this phenomenon,
                                                             the citations     of a study andwe its
                                                                                                  preliminarily
                                                                                                      features (Figureinvestigated
                                                                                                                            3). Takenthe   relationreview articles
                                                                                                                                       together,
                               between the citations
                                               collectedofmore a study    and itsthan
                                                                      citations    features   (Figure
                                                                                        original         3). Taken
                                                                                                   articles  (p < 0.001together,
                                                                                                                           by Mannreview    articles
                                                                                                                                      Whitney     U test), possibly
                               collected morebecause
                                                 citations  ofthan
                                                                theiroriginal    articles (p
                                                                        comprehensive      and< 0.001   by Mann
                                                                                                  broader    scope.Whitney       U test),no
                                                                                                                       Cumulatively,      possibly
                                                                                                                                             differences could be
                               because of their    comprehensive
                                               observed                   and broader
                                                               when articles              scope. Cumulatively,
                                                                                 were stratified                        no differences
                                                                                                      by access or sentiment,       while could
                                                                                                                                            fewer becitations were
                               observed when   observed
                                                   articles with
                                                               weretitles    with by
                                                                       stratified   multiple
                                                                                       access colons,     thoughwhile
                                                                                                or sentiment,        thesefewer
                                                                                                                            were citations
                                                                                                                                   much lesswerenumerous, as we
                               observed withhave titlesshown       (Figure 3).
                                                         with multiple        colons, though these were much less numerous, as we
                               have shown (Figure      We 3).next considered the association of citations with title length or author number.
                                               Although it can be easily observed that studies with exceedingly long titles, over 30 or
                                               40 words, are not cited as much as some studies with shorter titles, the bulk of the studies
                                               in the dataset, with title length below 30 words, have a broad range of citations. We have
                                               plotted the distribution of citations separately for articles and reviews and the pattern
                                               appears quite similar (Figure 4), although research articles show a wider range in title
                                               length than Reviews.
The Effect of Article Characteristics on Citation Number in a Diachronic Dataset of the Biomedical Literature on Chronic Inflammation: An Analysis ...
ons 2021, 9,Publications
             x FOR PEER  2021, 9, 15
                           REVIEW                                                                                                                5 of 11             5 of 11

                                                      Figure 3. Boxplots representing the number of (normalized) citations in review and original arti-
                                                      cles (above, left-hand side), subscription-based and open access publications (above, middle), pub-
                                                      lications with negative or positive title sentiment score (above, right-hand side) and citation count
                                                      based on the number of colons in the title (below, left-hand side), semicolons (below, middle) or
                                                      question marks (below, right-hand side). For all the plots, review papers are indicated in blue and
                                                      original articles in orange.

                                                         We next considered the association of citations with title length or author number.
                                                   Although it can be easily observed that studies with exceedingly long titles, over 30 or 40
                             Figure 3.
                 Figure 3. Boxplots   representing words,   are the
                                        Boxplots representing
                                                    the number   not   cited asofmuch
                                                                      number
                                                                  of (normalized)        as some
                                                                                  (normalized)
                                                                                    citations         studies
                                                                                                     citations
                                                                                                 in review       with
                                                                                                                in
                                                                                                              and       shorter
                                                                                                                   review
                                                                                                                   original and   titles,
                                                                                                                                 original
                                                                                                                             articles      the  bulk
                                                                                                                                            arti-
                                                                                                                                      (above,          of the
                                                                                                                                                 left-hand    studies in
                                                                                                                                                           side),
                             cles (above,
                 subscription-based   and left-hand  side),
                                                   the
                                           open access      subscription-based
                                                        dataset,   with
                                                        publications   (above,     and open
                                                                          title middle),
                                                                                length   below   access  publications
                                                                                                    30 words,
                                                                                          publications             have(above,
                                                                                                            with negative a or
                                                                                                                             broad middle),
                                                                                                                                          titlepub-
                                                                                                                                      range
                                                                                                                               positive         of  citations.
                                                                                                                                                sentiment      We have
                                                                                                                                                           score
                             licationsside)
                 (above, right-hand    with and
                                             negative  or positive
                                                   plotted
                                                 citation count     title sentiment
                                                             the based
                                                                  distribution    ofscore
                                                                          on the number     (above,
                                                                                       citations
                                                                                            of          right-hand
                                                                                                     separately
                                                                                                colons                side)
                                                                                                                    for
                                                                                                          in the title      andleft-hand
                                                                                                                         articles
                                                                                                                       (below,    citation  count
                                                                                                                                    and reviews        and the pattern
                                                                                                                                             side), semicolons
                             basedor onquestion
                                        the number   of colons  inright-hand
                                                                   the title (below,
                                                                                side).left-hand     side),  semicolons   (below,are
                 (below, middle)                   appears
                                                 marks        quite
                                                         (below,      similar (Figure   For4),allalthough
                                                                                                  the   plots, research   articlesmiddle)
                                                                                                                review papers         show     or in blue and
                                                                                                                                               a wider range in title
                                                                                                                                       indicated
                             question marks (below, right-hand side). For all the plots, review papers are indicated in blue and
                 original articles in orange.      length than Reviews.
                              original articles in orange.

                                   We next considered the association of citations with title length or author number.
                              Although it can be easily observed that studies with exceedingly long titles, over 30 or 40
                              words, are not cited as much as some studies with shorter titles, the bulk of the studies in
                              the dataset, with title length below 30 words, have a broad range of citations. We have
                              plotted the distribution of citations separately for articles and reviews and the pattern
                              appears quite similar (Figure 4), although research articles show a wider range in title
                              length than Reviews.

                     Figure
                 Figure        4. Above,
                            4. Above,   left:left: Distribution
                                               Distribution      of normalized
                                                             of normalized         citations
                                                                               citations   by by
                                                                                               titletitle length
                                                                                                      length       in reviews
                                                                                                               in reviews        (blue)
                                                                                                                              (blue)  andand  original
                                                                                                                                           original     articles
                                                                                                                                                    articles     (orange);
                                                                                                                                                              (orange);
                     right:
                 right:       boxplot
                          boxplot      representing
                                   representing      thethe  normalized
                                                         normalized        citation
                                                                       citation      count
                                                                                 count   by by   quartiles
                                                                                             quartiles         of title
                                                                                                           of title     length
                                                                                                                    length      in reviews
                                                                                                                             in reviews      or original
                                                                                                                                          or original     articles.
                                                                                                                                                      articles.     Below,
                                                                                                                                                                 Below,
                 left:left: Distribution
                        Distribution        of normalized
                                        of normalized         citations
                                                          citations  by by   author
                                                                         author       count
                                                                                  count       in reviews
                                                                                          in reviews          (blue)
                                                                                                          (blue)   andand    original
                                                                                                                         original      articles
                                                                                                                                   articles     (orange);
                                                                                                                                            (orange);       right:
                                                                                                                                                        right:     boxplot
                                                                                                                                                               boxplot
                     representing
                 representing     the the  normalized
                                       normalized         citation
                                                      citation      count
                                                                count      by quartiles
                                                                      by quartiles         of number
                                                                                      of number            of authors
                                                                                                      of authors         in reviews
                                                                                                                    in reviews        or original
                                                                                                                                  or original     articles.
                                                                                                                                              articles.

                                                     It appears
                                                 It appears    thatthat  verbose
                                                                      verbose        titles
                                                                                 titles     tend
                                                                                         tend     to accrue
                                                                                               to accrue     fewer
                                                                                                          fewer      citations,
                                                                                                                 citations,    butbut
                                                                                                                                    thethe same
                                                                                                                                        same      relationship
                                                                                                                                               relationship
                                          doesdoes
                                                 notnot   hold
                                                       hold  for for   shorter
                                                                  shorter        titles,
                                                                            titles, where where   the whole
                                                                                             the whole   rangerange   of citations
                                                                                                                of citations          is observed.
                                                                                                                                is observed.         This rela-
                                                                                                                                               This relation
                                              tionreally
                                          is not    is notapparent
                                                            really apparent
                                                                        when we    when    we quartilized
                                                                                      quartilized   the datathe
                                                                                                              fordata
                                                                                                                  title for  title(Figure
                                                                                                                         length    length (Figure
                                                                                                                                           4).      4).
                                                     It was
                                                 It was   moremore     difficult
                                                                   difficult       to observe
                                                                               to observe        a relation
                                                                                              a relation     between
                                                                                                          between    thethe    number
                                                                                                                           number         of authors
                                                                                                                                      of authors   andand thethe
gure 4. Above, left: Distribution of normalized
                                          number citations
                                              number     ofby   title length
                                                             citations,
                                                      of citations,    though in areviews
                                                                          though    slight  (blue)
                                                                                      a slight     and
                                                                                                positive
                                                                                            positive    original articles
                                                                                                          association
                                                                                                      association          (orange);
                                                                                                                        could
                                                                                                                    could        be observed
                                                                                                                             be observed        (Figure
                                                                                                                                            (Figure   4). 4).
ght: boxplot representing the normalized citation count by quartiles of title length in reviews or original articles. Below,
ft: Distribution of normalized citations by 3.3.
                                            author count in
                                                 Ensemble   reviews (blue) and original articles (orange); right: boxplot
                                                          Machines
presenting the normalized citation count by quartiles of number of authors in reviews or original articles.
                                                     We decided to evaluate the relative importance of the textual factors on normalized
                                               citations by adopting
                                   It appears that verbose             ensemble
                                                             titles tend to accruepredictive machines,
                                                                                   fewer citations,      namely
                                                                                                    but the sameGradient  Boosting, Bagging
                                                                                                                 relationship
                                               and Random Forest.
                              does not hold for shorter titles, where the whole range of citations is observed. This rela-
                                                     These algorithms are based on decision trees, i.e., algorithms that split the data based
                              tion is not really apparent when we quartilized the data for title length (Figure 4).
                                               on the features provided in order to get homogenous groups, in this case distinguishing
                                   It was more difficult to observe a relation between the number of authors and the
                                               articles with higher and lower citation rates. We first attempted a regression algorithm, i.e.,
                              number of citations, though a slight positive association could be observed (Figure 4).
                                               an algorithm aiming to predict the number of citations based on the collected features. The
                                               following predictors were considered: sentiment analysis score for the title, presence of a
                                               semicolon, a colon, a question mark, access options, document type, number of authors and
The Effect of Article Characteristics on Citation Number in a Diachronic Dataset of the Biomedical Literature on Chronic Inflammation: An Analysis ...
citations by adopting ensemble predictive machines, namely Gradient Boosting, Bagging
                                      and Random Forest.
                                             These algorithms are based on decision trees, i.e., algorithms that split the data based
                                      on the features provided in order to get homogenous groups, in this case distinguishing
Publications 2021, 9, 15
                                      articles with higher and lower citation rates. We first attempted a regression algorithm,         6 of 11
                                      i.e., an algorithm aiming to predict the number of citations based on the collected features.
                                      The following predictors were considered: sentiment analysis score for the title, presence
                                      of a semicolon, a colon, a question mark, access options, document type, number of au-
                                    title
                                      thorslength.  Even
                                              and title    usingEven
                                                        length.   parameter
                                                                       using tuning   techniques,
                                                                              parameter             i.e., RandomCV
                                                                                           tuning techniques,            search to optimize
                                                                                                                  i.e., RandomCV     search to
                                    the   hyperparameters
                                      optimize                 for the algorithm,
                                                   the hyperparameters      for the the degree of
                                                                                    algorithm,  thefitting
                                                                                                      degreeof the   regression
                                                                                                               of fitting        algorithms
                                                                                                                            of the regression
                                    was    very lowwas
                                      algorithms            R2 ranged
                                                      andvery   low andfrom    2% forfrom
                                                                          R2 ranged    a simple  linear
                                                                                            2% for        regression
                                                                                                    a simple            to 3.3% of Random
                                                                                                               linear regression    to 3.3% of
                                    Forest    regressor.  Interestingly, Gradient   Boosting  and  Random
                                      Random Forest regressor. Interestingly, Gradient Boosting and Random    Forest    algorithms   can also
                                                                                                                            Forest algorithms
                                    provide     a score of importance    for the features,  which  are  shown    in  Figure
                                      can also provide a score of importance for the features, which are shown in Figure 5.  5.

   A                                                                        B

                       Figure5.5.Feature
                      Figure      Featureimportance
                                          importancefor
                                                     for(A)
                                                         (A)GradientBoost
                                                             GradientBoostregressor
                                                                           regressorand
                                                                                     and(B)
                                                                                         (B)Random
                                                                                             RandomForest
                                                                                                    Forestregressor.
                                                                                                            regressor.

                                           As
                                            Asa acomparison,
                                                   comparison,   wewe also  decided
                                                                         also  decidedto use  ensemble
                                                                                         to use  ensembleclassifiers, withwith
                                                                                                             classifiers,   the same    predictors:
                                                                                                                                  the same  predic-
                                    totors:
                                        thisto
                                             purpose,     and  to  simplify   the model,    normalized    citations  were   binarily
                                                this purpose, and to simplify the model, normalized citations were binarily cate-     categorized
                                    asgorized
                                         either as
                                                 low   (below
                                                    either   lowthe   median)
                                                                  (below          or highor
                                                                            the median)     (above   the median).
                                                                                               high (above            Table Table
                                                                                                             the median).     1 summarizes     the
                                                                                                                                     1 summarizes
                                    performance       of   these  algorithms,     and    Gradient   Boosting    appears    to
                                      the performance of these algorithms, and Gradient Boosting appears to have the highest   have   the  highest
                                    F1F1score.
                                          score.
                                           AAconfusion
                                               confusion matrix        for the
                                                              matrix for    theclassification
                                                                                 classificationbyby    Gradient
                                                                                                   Gradient       Boosting
                                                                                                               Boosting    cancan    be found
                                                                                                                                 be found  in in
                                    Figure
                                      Figure A4. The relative weight of the considered factors can be seen in Figure 6. TheThe
                                             A4.    The   relative   weight    of  the  considered    factors  can  be  seen    in Figure  6.  doc-
                                    document
                                      ument type  type   (Reviews
                                                     (Reviews      vs.vs. articles)
                                                                       articles) is,is, maybe
                                                                                     maybe       unsurprisingly,the
                                                                                              unsurprisingly,      themost
                                                                                                                       mostrelevant
                                                                                                                               relevantpredictor
                                                                                                                                         predictorof
                                    ofcitations,
                                        citations,followed
                                                    followedby  byitsitsaccess
                                                                         access options
                                                                                 options (whether
                                                                                           (whether open
                                                                                                       open access
                                                                                                             access oror via
                                                                                                                         viapaid
                                                                                                                              paidsubscription),
                                                                                                                                     subscription),
                                    for the Random Forest classifier. The number of authors (which may show some collinearity
                                      for the Random Forest classifier. The number of authors (which may show some colline-
                                    with the publication year, Pearson’s r = 0.17) precedes the title length, which, though it
                                      arity with the publication year, Pearson’s r = 0.17) precedes the title length, which, though
                                    does show some association, does not seem to be playing a major role. The presence of a
                                      it does show some association, does not seem to be playing a major role. The presence of
                                    colon in the title (and even more so of a semi-colon) or the sentiment of the title may also
                                      a colon in the title (and even more so of a semi-colon) or the sentiment of the title may also
                                    have some relevance as predictors.
                                      have some relevance as predictors.
                                 Table 1. Summary of the performance of the algorithms for the classification of the citation rate in
         Table 1. Summary of the performance of the algorithms for the classification of the citation rate in the analyzed corpus.
                                 the analyzed corpus.
                 Algorithm                 Accuracy                   Precision                      Recall                     F1 Score
                                            Algorithm                    Accuracy            Precision            Recall           F1 Score
            Gradient Boosting                 0.599                     0.598                        0.514                       0.552
                   Bagging              Gradient  Boosting
                                              0.570                        0.599
                                                                        0.560                  0.598 0.483         0.514            0.552
                                                                                                                                 0.519
                                             Bagging                       0.570               0.560               0.483            0.519
              Random Forest                   0.597                     0.601                        0.479                       0.533
 Publications 2021, 9, x FOR PEER REVIEW Random Forest                     0.597               0.601               0.479            0.533 7 of 11
           Logistic Regression                0.590
                                        Logistic Regression             0.583
                                                                           0.590               0.583 0.517         0.517         0.548
                                                                                                                                    0.548

    A                                                                       B

                      Figure6.6.Overview
                     Figure     Overviewof
                                         offeature
                                            featureimportance
                                                    importancefor
                                                               for(A)
                                                                   (A)GradientBoost
                                                                       GradientBoostand
                                                                                     and(B)
                                                                                         (B)Random
                                                                                             RandomForest
                                                                                                    Forestclassifier.
                                                                                                           classifier.

                                     4. Discussion
                                          Citations are a fundamental factor for the scientific community, because they are the
                                     basic metrics that are used to rank research and productivity, and therefore regulate its
The Effect of Article Characteristics on Citation Number in a Diachronic Dataset of the Biomedical Literature on Chronic Inflammation: An Analysis ...
Publications 2021, 9, 15                                                                                              7 of 11

                           4. Discussion
                                 Citations are a fundamental factor for the scientific community, because they are the
                           basic metrics that are used to rank research and productivity, and therefore regulate its
                           policies and funding allocation. Within such an information overload setting as the one life
                           sciences are experiencing [19], publishing an article is not enough to expect the article to be
                           seen and cited. On the other hand, some articles may reach impressive levels of visibility. It
                           appears, therefore, important to investigate the factors that, beside scientific soundness and
                           content relevance, may make the article stand out in a literature search and be downloaded,
                           read and possibly cited, because visibility is key to the spreading of ideas and it may affect
                           novel directions and trends of research.
                                 A lot of research has focused on the effects of title length on citations [20], because
                           titles are what a reader would scan first while searching the literature. The results, however,
                           have been, so far, quite conflicting. An investigation conducted in 22 medical journals
                           (comprising more than 9000 articles) concluded that publications with longer titles are cited
                           more than articles with shorter titles, especially for high impact factor journals [21]. This is
                           in agreement with a research by Jacques et al., which showed a positive correlation between
                           title length and citation number in a small set of most- and least-cited articles in three
                           general medicine journals [22]. Other studies, however, found no association [23], different
                           associations according to the field [24], or, more recently, a negative association, when the
                           20,000 more cited papers for each year in the 2007–2013 period were considered [11].
                                 The present commentary followed a slightly different approach. We focused on articles
                           published in the biomedical field on the specific topic of chronic inflammation. This topic
                           was chosen because of the authors’ domain knowledge, but also because it is broad enough
                           to encompass a vast range of publications across thousands of journals in the biomedical
                           field, and was known and researched since the 19th century. This allowed us to collect a
                           large sample of more than 150,000 articles, or, after excluding those for which no citation
                           was available, with an unusually high number of authors and after limiting our analysis to
                           articles published after 1951, 121,640 articles. Our approach has some limits: it admittedly
                           focuses only on one scientific area, and one topic. By considering articles from such a vast
                           range of years, it may suffer from increased heterogeneity on some predictors. For this
                           reason, the initial dataset, which comprised publications from 1827 to 2021, was reduced to
                           papers appearing in the 1951–2021 interval.
                                 Our initial survey showed a robust association between the genre of the article and the
                           number of citations: unsurprisingly, and consistently with what has already been reported,
                           review articles are usually cited more than research articles [25]. What is more relevant,
                           however, is that reviews in our dataset have significantly shorter titles than original articles.
                           This too is somewhat unsurprising, or at least easily accounted for by the nature of reviews.
                           As they usually encompass broader topics, at least in their narrative version, they often
                           do not require investigation of the details of a single finding, as original articles do, and
                           thus, their titles may often need fewer words. This may in part justify the observation that
                           most-cited articles have shorter titles [11], because many of them happen to be reviews.
                           However, even stratifying by document type, there seems to be some kind of trend for
                           articles with longer titles to have fewer citations, so it may be that that shorter titles are
                           cognitively favored to be easily recognized and read and thus maybe cited.
                                 We then decided to investigate the predictive power of some features associated with
                           the text and its title and relied on ensemble machines Gradient Boosting [26], Bagging [27]
                           or Random Forest [28], both as regressors and as classifiers. These algorithms are based
                           on decision trees and reduce overfitting on training data by running parallel sub-optimal
                           decision trees. In other words, a decision tree could easily find the best splits to divide a
                           dataset in order to eventually obtain homogeneous groups, especially given enough splits,
                           but the solution would be difficult to generalize on different datasets. Thus, sub-optimal
                           trees are used in these algorithms: while each decision tree in these machines is a weak
                           predictor that uses only part of the data, the performance and the bias of the algorithm
                           can be improved by running several of such weak predictors in parallel and by averaging
Publications 2021, 9, 15                                                                                               8 of 11

                           their results. When predicting the visibility of a manuscript we decided to avoid using the
                           citation count, as obtained from the repository, because citations are obviously affected by
                           the publication year. The same article would have different citation counts if assessed at
                           different time points after publication and thus time-since-publication is not an intrinsic
                           quality of a manuscript, but a rather incidental one. Thus, we preferred to use a normalized
                           citation count, which we obtained by normalizing the number of citations for an article
                           by the expected average for that year. We could have adjusted for other factors, e.g.,
                           publication or access type, as is often done, but this would have ended up removing these
                           factors from the features the ensemble algorithms consider when analyzing the data. Our
                           goal was to provide these ensemble machines with a series of predictors and let them
                           compute whether they can explain the observed variability in the distribution of the target
                           variable and assess the weight each factor has in affecting such a target.
                                 It can be pointed out that the accuracy of these models was quite low, both for the
                           classifiers and even more so for the regressor algorithms. This means that only a fraction
                           of the variability in the citation count can be explained by these factors which is not
                           surprising. Our database did not include any indicator about the novelty of the content
                           of an article, the prestige of the authors or even the prestige of the journal, scientific and
                           even just sociological factors that would be normally associated to citing a study. Within
                           the factors that we did include, however, the type of document has a heavy impact on
                           citation count and so does the access to the article. The author number, in agreement
                           with Vieira et al. [3] appears high on the list too, although its distribution does not show
                           much association with the number of citations, but the result can be accounted for by the
                           change in author number over time. Title length has some impact but it is overall quite
                           limited. The presence of colons or sentiment analysis appears less relevant, while the effect
                           of semicolons or question marks is quite negligible. It must, however, be highlighted that
                           sentiment analysis was performed with a pre-set sentiment analyzer, albeit a quite popular
                           one, whose reliability in this context might be limited.
                                 Considering how limited the features we used are, the answers the models provided
                           are quite interesting. Further textual characteristics could and should be investigated, such
                           as an analysis of embedding semantics for titles and abstracts, but also abstract length,
                           article length and of course impact factor. Further efforts should also be made to explore
                           the role of the originality and relevance of the content, although it is difficult to define and
                           identify correct metrics for these features.

                           5. Conclusions
                                 In conclusion, considering the dataset of all the publications since 1951 in the field
                           of chronic inflammation, indexed in Scopus, several factors appear to be associated with
                           citation count; although, obviously, no single factor can decide the fate of a publication.
                           Besides the time passed since its publication, which is necessary for an article to gather
                           citations, the document type, the access type and the number of authors affect citation
                           count. Very wordy titles do appear associated to fewer citations, but title length seems to
                           play a small role in computational models trying to predict citation count.

                           Author Contributions: Conceptualization, C.G. and S.G.; methodology, C.G.; software, C.G.; formal
                           analysis, C.G.; investigation, S.G.; resources, S.G.; data curation, C.G.; writing—original draft
                           preparation, C.G.; writing—review and editing, C.G.; supervision, S.G.; Both authors have read and
                           agreed to the published version of the manuscript.
                           Funding: This research received no external funding.
                           Institutional Review Board Statement: Not applicable.
                           Informed Consent Statement: Not applicable.
                           Data Availability Statement: Data available on request due to restrictions.
                           Conflicts of Interest: The authors declare no conflict of interest.
agreed to the published version of the manuscript.
                                      Funding: This research received no external funding.
                                      Funding: This research received no external funding.
                                      Institutional Review Board Statement: Not applicable.
                                      Institutional Review Board Statement: Not applicable.
                                      Informed Consent Statement: Not applicable.
Publications 2021, 9, 15              Informed Consent Statement: Not applicable.                                                                   9 of 11
                                      Data Availability Statement: Data available on request due to restrictions.
                                      Data Availability Statement: Data available on request due to restrictions.
                                      Conflicts of Interest: The authors declare no conflict of interest.
                                      Conflicts of Interest: The authors declare no conflict of interest.
                                      Appendix A
                                      Appendix A
                                      Appendix A

      Figure A1. Composition of our dataset. Above: number of reviews and original articles (left), number of open access or
      Figure A1. Composition
                  Composition     of our dataset. Above: number           of
                                                                          of reviews
                                                                             reviews and   original  articles (left),  number    of
                                                                                                                                 of open  access or
      subscription  papers (middle), number ofAbove:            number
                                                      publications      with negative  and originaltitle
                                                                                       or positive   articles (left), score
                                                                                                         sentiment     number       open
                                                                                                                             (right). Below: num-
      subscription
      subscription  papers
                    papers by(middle),
                             (middle),   number   of
                                        numberinofthe publications
                                                     publications      with  negative  or positive title sentiment    score  (right). Below:  num-
      ber of publications       colon number             title (left),with negative or
                                                                       by semicolon     positiveor
                                                                                      (middle)   title
                                                                                                    bysentiment
                                                                                                        number ofscore    (right).
                                                                                                                      question      Below:
                                                                                                                                 marks      number
                                                                                                                                        (right). All
      ber
      of  of publications   by colon   number   in the  title  (left), by semicolon   (middle)  or  by number    of  question   marks   (right).
                                                                                                                                             All All
      thepublications
           publicationsbyare
                           colon  number
                              stratified    in the title
                                         according    to (left),
                                                          documentby semicolon    (middle)
                                                                        type, where         or by number
                                                                                      blue indicates         of question
                                                                                                       reviews   and orange marks   (right).original
                                                                                                                                 indicates       the
      the publications
      publications       are stratified
                    are stratified       according
                                   according         to document
                                               to document              type, where
                                                                 type, where          blue indicates
                                                                               blue indicates reviewsreviews     and orange
                                                                                                         and orange             indicates
                                                                                                                        indicates  original original
                                                                                                                                            articles.
      articles.
      articles.

 Publications
      Figure2021,
              A2. 9,
              A2.    x FOR distribution
                  Above:   PEER REVIEW  of title lengths in our dataset for review and articles. Below: distribution
                                                                                                        distribution of
                                                                                                                     of author count.10 of 11
                                                                                                                        author count.
      Figure A2. Above: distribution of title lengths in our dataset for review and articles. Below: distribution of author count.
      Blue bars
      Blue  bars indicate
                  indicatereviews
                            reviewsand andorange
                                           orangebars
                                                   barsindicate
                                                         indicate articles.
                                                                articles.   The
                                                                          The   length
                                                                              length    of the
                                                                                     of the barbar indicates
                                                                                                indicates the the exact
                                                                                                              exact     number
                                                                                                                    number     of publica-
                                                                                                                           of publications
      Blue bars indicate reviews and orange bars indicate articles. The length of the bar indicates the exact number of publica-
      tions
      for   forspecific
          that  that specific
                        title   title length
                              length   or    or author
                                          author count  count
                                                        in ourin our dataset.
                                                               dataset.
      tions for that specific title length or author count in our dataset.
                                      Appendix B
                                      Appendix B

      Figure   A3.Boxplots
       FigureA3.   Boxplotsofofauthor
                                authornumber
                                        number(left)
                                                  (left)and
                                                         andtitle
                                                               titlelength
                                                                     length(right)
                                                                             (right)by
                                                                                     bydocument
                                                                                        documenttype.
                                                                                                    type.The
                                                                                                          Theplots
                                                                                                              plotsshow
                                                                                                                    showthat
                                                                                                                           thatreviews
                                                                                                                                reviewshave
                                                                                                                                        have
      significantly
       significantlyfewer
                     fewerauthors
                            authorsand
                                     andsignificantly
                                          significantlyshorter
                                                         shortertitles.
                                                                    titles.The
                                                                            Thewhiskers
                                                                                whiskersofofthe
                                                                                             theplots
                                                                                                 plotscorresponds
                                                                                                       correspondstoto1.5 × the
                                                                                                                        1.5× theinterquartile
                                                                                                                                 interquartile
      range.
       range.Values
               Valuesfalling
                      fallingoutside
                              outsideofofthis
                                          thisintervals
                                               intervalsare
                                                          arerepresented
                                                              representedasasblack
                                                                                 blackdots.
                                                                                       dots.
Figure
Publications   A3.9,Boxplots
             2021,  15       of author number (left) and title length (right) by document type. The plots show that reviews have
                                                                                                                               10 of 11
       significantly fewer authors and significantly shorter titles. The whiskers of the plots corresponds to 1.5× the interquartile
       range. Values falling outside of this intervals are represented as black dots.

                                 Figure
        Figure A4. Confusion matrix      A4. Confusion
                                     for Gradient Boosting.matrix for Gradient
                                                              The matrix  breaksBoosting. The that
                                                                                down articles  matrix breaks
                                                                                                   were      down
                                                                                                        predicted  toarticles that or
                                                                                                                      have a low    were
                                 predicted
        high citation count and what        to have
                                     they were      a low
                                                actually   or (‘Observed’).
                                                         like high citation count and what they were actually like (‘Observed’).

References
References
1.
1.  Bai,
    Bai, X.;
         X.; Liu,
              Liu, H.;
                    H.; Zhang,
                         Zhang, F.; F.; Ning,
                                        Ning, Z.;
                                               Z.; Kong,
                                                   Kong, X.;X.; Lee,
                                                                Lee, I.;
                                                                      I.; Xia,
                                                                          Xia, F.
                                                                                F. An
                                                                                   An overview
                                                                                        overview on  on evaluating
                                                                                                          evaluating and
                                                                                                                       and predicting
                                                                                                                            predicting scholarly
                                                                                                                                          scholarly article    impact.
                                                                                                                                                      article impact.
    Information   2017,    8, 73.  [CrossRef]
    Information 2017, 8, 73, doi:10.3390/info8030073.
2.
2. Bai,
    Bai, X.;
         X.; Zhang,
              Zhang, F.; F.;Lee,
                            Lee,I.I.Predicting
                                      Predictingthethecitations
                                                        citationsof ofscholarly      paper.J.J.Informetr.
                                                                        scholarlypaper.                      2019,13,
                                                                                                Informetr.2019,     13,407–418.
                                                                                                                        407–418,[CrossRef]
                                                                                                                                    doi:10.1016/j.joi.2019.01.010.
3.
3.  Bentéjac,
    Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A comparative analysis of gradient boostingalgorithms.
                C.; Csörgő,     A.; Martínez-Muñoz,       G.   A comparative        analysis  of gradient    boosting    algorithms. Artif.
                                                                                                                                         Artif. Intell.
                                                                                                                                                Intell. Rev.  2021, 54,
                                                                                                                                                        Rev. 2021,   54,
    1937–1967.
    1937–1967, [CrossRef]
                   doi:10.1007/s10462-020-09896-5.
4.
4. Blümel,
    Blümel, C.;C.; Schniedermann,
                    Schniedermann, A.      A. Studying
                                               Studying review
                                                            review articles
                                                                     articles in in scientometrics
                                                                                     scientometrics and and beyond:
                                                                                                              beyond: A  A research    agenda. Science
                                                                                                                            research agenda.                2020, 124,
                                                                                                                                                  Science 2020,    124,
    711–728.   [CrossRef]
    711–728, doi:10.1007/s11192-020-03431-7.
5.
5. Borg,
    Borg, A.;
           A.; Boldt,
                 Boldt, M.M. Using
                              Using VADER
                                       VADERsentiment
                                                 sentimentand  andSVMSVMfor  for predicting
                                                                                  predictingcustomer
                                                                                               customer response        sentiment. Expert
                                                                                                            responsesentiment.        Expert Syst.
                                                                                                                                              Syst. Appl.   2020, 162,
                                                                                                                                                     Appl. 2020,   162,
    113746.   [CrossRef]
    113746, doi:10.1016/j.eswa.2020.113746.
6. Bornmann,
6.  Bornmann, L.;   L.; Mutz,
                        Mutz, R. R. Growth
                                     Growth rates
                                               rates of
                                                     of modern
                                                         modern science:
                                                                    science: A  A bibliometric
                                                                                   bibliometric analysis
                                                                                                  analysis based
                                                                                                              based onon the
                                                                                                                         the number
                                                                                                                               number of of publications
                                                                                                                                            publications and and cited
                                                                                                                                                                  cited
    references.   J. Assoc.   Inf.  Sci. Technol.  2015,  66,  2215–2222.      [CrossRef]
    references. J. Assoc. Inf. Sci. Technol. 2015, 66, 2215–2222, doi:10.1002/asi.23329.
7.
7. Breiman,
    Breiman, L.  L. Random       forests. Mach.
                    Random forests.        Mach. Learn.    2001, 45,
                                                   Learn. 2001,    45, 5–32.
                                                                       5–32, [CrossRef]
                                                                                doi:10.1023/a:1010933404324.
8.
8. Burrell,
    Burrell, Q.L.
               Q.L. Predicting
                     Predicting future
                                     futurecitation   behavior.J.J.Am.
                                             citationbehavior.         Am.Soc.Soc. Informetr.
                                                                                    Informetr. Sci.
                                                                                                Sci. Technol.   2003,54,
                                                                                                     Technol.2003,     54,372–378.
                                                                                                                           372–378, [CrossRef]
                                                                                                                                       doi:10.1002/asi.10207.
9.
9. De
    De Rijcke,
         Rijcke, S.;
                  S.; Rushforth,
                       Rushforth, A.   A. To
                                           To intervene
                                               intervene or  or not
                                                                not to
                                                                     to intervene;
                                                                          intervene; isis that
                                                                                           that the
                                                                                                 the question?
                                                                                                      question? On  On the
                                                                                                                        the role
                                                                                                                              role of
                                                                                                                                   of scientometrics
                                                                                                                                       scientometrics in  in research
                                                                                                                                                              research
    evaluation. J.J. Assoc.
    evaluation.       Assoc. Inf.
                               Inf. Sci.  Technol. 2015,
                                     Sci. Technol.  2015, 66,
                                                           66, 1954–1958.       [CrossRef]
                                                                1954–1958, doi:10.1002/asi.23382.
10.
10. Duyx,
    Duyx, B.;B.; Urlings,
                 Urlings, M.J.;
                             M.J.; Swaen,
                                     Swaen, G.M.;
                                              G.M.; Bouter,
                                                     Bouter, L.M.;
                                                                L.M.; Zeegers,
                                                                        Zeegers, M.P.M.P.Scientific
                                                                                           Scientificcitations
                                                                                                        citations favor
                                                                                                                   favor positive
                                                                                                                          positive results:
                                                                                                                                     results: A
                                                                                                                                              A systematic
                                                                                                                                                 systematic review
                                                                                                                                                                review
    and  meta-analysis.       J. Clin.  Epidemiol.  2017,   88,  92–101.    [CrossRef]
    and meta-analysis. J. Clin. Epidemiol. 2017, 88, 92–101, doi:10.1016/j.jclinepi.2017.06.002.
11.
11. Egghe,
    Egghe, L.L.  TheTheHirsch    index and
                              Hirsch          related
                                          index    andimpact
                                                           related     impactAnnu.
                                                                  measures.              Rev. Inf.Annu.
                                                                                    measures.       Sci. Technol.    Inf. 44,
                                                                                                              Rev. 2010,    Sci.65–114.  [CrossRef]
                                                                                                                                   Technol.    2010, 44, 65–114,
12. Eysenbach,     G.   Citation    advantage
    doi:10.1002/aris.2010.1440440109.            of open    access  articles.    PLoS   Biol. 2006,  4,  e157. [CrossRef]    [PubMed]
13. Génova, G.; Astudillo, H.; Fraga, A. The scientometric bubble considered harmful. Sci. Eng. Ethic. 2016, 22, 227–235. [CrossRef]
    [PubMed]
14. Habibzadeh, F.; Yadollahie, M. Are shorter article titles more attractive for citations? Cross-sectional study of 22 scientific journals.
    Croat. Med. J. 2010, 51, 165–170. [CrossRef]
15. Ho, Y.-S.; Kahn, M. A bibliometric study of highly cited reviews in the Science Citation Index expanded™. J. Assoc. Informetr. Sci.
    Technol. 2014, 65, 372–385. [CrossRef]
16. Hutto, C.J.; Gilbert, E.E. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of
    the International AAAI Conference on Web and Social Media, Ann Arbor, MI, USA, 1–4 June 2014; Volume 8.
17. Jacques, T.S.; Sebire, N.J. The impact of article titles on citation hits: An analysis of general and specialist medical journals. JRSM
    Short Rep. 2010, 1, 1–5. [CrossRef]
18. Letchford, A.; Moat, H.S.; Preis, T. The advantage of short paper titles. R. Soc. Open Sci. 2015, 2, 150266. [CrossRef]
19. Lovaglia, M.J. Predicting citations to journal articles: The ideal number of references. Am. Sociol. 1991, 22, 49–64. [CrossRef]
20. Mingers, J.; Leydesdorff, L. A review of theory and practice in scientometrics. Eur. J. Oper. Res. 2015, 246, 1–19. [CrossRef]
21. Newman, H.; Joyner, D. Artificial Intelligence in Education; Penstein Rosé, C., Ed.; Springer: Cham, Switzerland, 2018; pp. 246–250.
    [CrossRef]
22. Rostami, F.; Mohammadpoorasl, A.; Hajizadeh, M. The effect of characteristics of title on citation rates of articles. Scientometrics
    2014, 98, 2007–2010. [CrossRef]
23. Siebelt, M.; Siebelt, T.; Pilot, P.; Bloem, R.M.; Bhandari, M.; Poolman, R.W. Citation analysis of orthopaedic literature; 18 major
    orthopaedic journals compared for Impact Factor and SCImago. BMC Musculoskelet. Disord. 2010, 11, 4. [CrossRef] [PubMed]
Publications 2021, 9, 15                                                                                                               11 of 11

24.   Skurichina, M.; Duin, R.P. Bagging for linear classifiers. Pattern Recognit. 1998, 31, 909–930. [CrossRef]
25.   Smith, D.R. Impact factors, scientometrics and the history of citation-based research. Scientometrics 2012, 92, 419–427. [CrossRef]
26.   Vieira, E.; Gomes, J. Citations to scientific articles: Its distribution and dependence on the article features. J. Informetr. 2010,
      4, 1–13. [CrossRef]
27.   Yitzhaki, M. Relation of the title length of a journal article to the length of the article. Scientometrics 2002, 54, 435–447. [CrossRef]
28.   Yogatama, D.; Heilman, M.; O’Connor, B.; Dyer, C.; Routledge, B.R.; Smith, N.A. Predicting a scientific community’s response to
      an article. In Proceedings of the EMNLP 2011—Conference on Empirical Methods in Natural Language Processing, Edinburgh,
      UK, 27–31 July 2011; pp. 594–604.
You can also read