The Effect of Article Characteristics on Citation Number in a Diachronic Dataset of the Biomedical Literature on Chronic Inflammation: An Analysis ...
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
publications
Article
The Effect of Article Characteristics on Citation Number in a
Diachronic Dataset of the Biomedical Literature on Chronic
Inflammation: An Analysis by Ensemble Machines
Carlo Galli * and Stefano Guizzardi
Department of Medicine and Surgery, Histology and Embryology Lab, University of Parma, Via Volturno 39,
43126 Parma, Italy; stefano.guizzardi@unipr.it
* Correspondence: carlo.galli@unipr.it; Tel.: +39-0521-906740
Abstract: Citations are core metrics to gauge the relevance of scientific literature. Identifying features
that can predict a high citation count is therefore of primary importance. For the present study,
we generated a dataset of 121,640 publications on chronic inflammation from the Scopus database,
containing data such as titles, authors, journal, publication date, type of document, type of access
and citation count, ranging from 1951 to 2021. Hence we further computed title length, author count,
title sentiment score, number of colons, semicolons and question marks in the title and we used these
data as predictors in Gradient boosting, Bagging and Random Forest regressors and classifiers. Based
on these data, we were able to train these machines, and Gradient Boosting achieved an F1 score of
0.552 on classification. These models agreed that document type, access type and number of authors
were the best predicting factors, followed by title length.
Keywords: citations; title; machine learning
Citation: Galli, C.; Guizzardi, S.
The Effect of Article Characteristics
on Citation Number in a Diachronic
Dataset of the Biomedical Literature 1. Introduction
on Chronic Inflammation: An In a world where the sheer amount of available information is growing exponentially,
Analysis by Ensemble Machines. ranking knowledge according to its reliability has become crucial, and the scientific com-
Publications 2021, 9, 15. munity has therefore attributed increasing importance to scientometrics, i.e., the ability
https://doi.org/10.3390/ to analyze the impact of science [1]. Scientometrics, with all its limits [2], is the compass
publications9020015
that is commonly used to drive strategic decisions in public health, in grant policies, to
allocate funding, and at a lower level, to determine hiring and promotions [3]. Within
Received: 27 December 2020
scientometrics, citations are the single most distinctive parameter to estimate the visibility
Accepted: 30 March 2021
of a scientific publication [4]. The number of citations is also used to compute further
Published: 6 April 2021
indices of performance, such as the Hirsch’s score, which are used to evaluate the scientific
productivity of a scholar [5].
Publisher’s Note: MDPI stays neutral
It may be—not without a reason—assumed that good science will eventually be
with regard to jurisdictional claims in
cited more often, but in a situation of information overload in fields such as life sciences
published maps and institutional affil-
iations.
and medicine, where “mountains of unloved and unread publications [exist]” [6] researchers
have long strived to investigate what factors may affect the number of citations a study
can accrue [7–10]. Several publications have pointed out that shorter titles can facilitate
citations, possibly by making an article stand out during literature searches; its meaning
more easily apprehended, and therefore remembered [11]. Besides cognitive mechanisms,
Copyright: © 2021 by the authors.
other factors can possibly affect how often an article is searched, viewed and cited. Open
Licensee MDPI, Basel, Switzerland.
access articles, for instance, are more easily accessible for scholars in low-budget contexts,
This article is an open access article
who may not rely on institutional subscriptions [12]. The genre of the article could also
distributed under the terms and
make a study more searched. Review articles, at least narrative reviews, generally describe
conditions of the Creative Commons
Attribution (CC BY) license (https://
the state of the art in a given field and are often beloved reading for scholars looking for an
creativecommons.org/licenses/by/
overview of a topic [13]. And even the number of authors (and prestigious authorship of
4.0/). course) could affect citations [3], as each author can be part of a network of acquaintances
Publications 2021, 9, 15. https://doi.org/10.3390/publications9020015 https://www.mdpi.com/journal/publicationsPublications 2021, 9, 15 2 of 11
who may be more easily attracted and read, and therefore cite, their work, not to mention
the relevance of specific journals with high impact factor that may be a preferred reading
and citation source for scholars [14].
The goal of this study was to examine the association between some selected article
features, including textual features of the title and number of citations, in a diachronic
dataset of studies on chronic inflammation retrieved from Scopus using a set of specific
machine learning techniques, namely ensemble classifiers. This specific research field was
chosen for consistency reasons, because our group has been focusing and analyzing it in
the last few years. Although differences with publications in other medical areas may exist,
they are difficult to anticipate, and the conclusions from our survey are therefore applicable
mostly to the topic of inflammation.
2. Materials and Methods
The analysis was conducted on a literature dataset created through Scopus. The
Scopus database was searched using the ‘Chronic Inflammation’ keywords, without filters.
All the information was downloaded as several.csv files (due to inherent limitations in
Scopus, which does not allow to download more than 20,000 records at a time), which
were then imported in a Jupyter notebook running Python 3.6, using the Pandas library.
The dataframes were concatenated into one single dataframe without duplicates. For data
analysis, the python Numpy, Pandas and Scipy libraries were used. Matplotlib and Seaborn
libraries were used for data visualization.
The information provided by Scopus included title of the article, authors, journal,
publication date, type of document, type of access, publication stage and citation count,
besides Scopus ID codes, DOI and a link to the article. We proceeded to create further
features for each publication. We used the VADER implementation [15] in the Nltk library
to perform a sentiment analysis of the titles, which yielded a sentiment score. We used the
Re library to search for colons, semicolons and question marks within titles through regular
expressions. We also used the implementations of Linear Regression, Gradient Boosting,
Bagging, Random Forest regressor and classifiers and Logistic Regression algorithms in
the Scikit-learn package. To train the models the data were split into a training and a
testing set through Scikit-learn. The models were tuned using Randomized search CV in
the Scikit-learn package. The performance of the models was then evaluated on accuracy,
precision, recall and F1 score (the harmonic mean of precision and recall) calculated on
test data.
3. Results
3.1. Overview of the Dataset
Our Scopus search retrieved 159,461 titles, published over the course of 194 years from
1827 to 2021 in 12,308 academic journals. Unfortunately, 25,227 articles lacked a citation
count, which could be attributable at least in part to their recent publication date (data not
shown). Though we could not rule out that some of those titles were articles that were
actually never cited, we decided to keep only those uncontroversial articles where a citation
count ≥1 was provided.
Several kinds of publications were initially present in our dataset (data not shown),
e.g., letters, surveys, book chapters and even conference papers. Most publications, how-
ever, fell within the ‘Article’ or ‘Review’ categories, and the dataset was restricted to these
two classes only.
We furthermore decided to keep only those papers that were published after 1951,
as only very few papers in our dataset were published before that date and their scarce
numbers might affect our analysis. We also decided to exclude papers with unusually high
numbers of authors, which, in a few very peculiar cases, could amount to hundreds. Based
on the distribution of author number, we removed papers with more authors than 1.5 times
their interquartile range (i.e., 15 authors). The remaining dataset, therefore, included
121,640 manuscripts.Publications 2021, 9, x FOR PEER REVIEW 3 of 11
Publications 2021, 9, 15 3 of 11
times their interquartile range (i.e., 15 authors). The remaining dataset, therefore, included
121,640 manuscripts.
3.2. Features Distribution
The temporal distribution
distribution of articles
articles by document
document type
type can
can be seen in Figure 1, which
shows, beside the well-known steady increment in publications over the years, the robust
surge of
of Reviews,
Reviews,which
whichhave
havebecome
becomemoremoreprominent
prominent since thethe
since mid ‘80s‘80s
mid andand
which ac-
which
count
accountforfor
about
about30%
30%ofofall
allthe
thepublications
publicationsininour
our dataset.
dataset. Beside
Beside Document
Document Type, the the
Scopus database contains information about the accessibility of full texts and and indicates
indicates
whether a specific entry belongs to an Open Access typology or a Paid subscription type.
Open Access is a novel business model that grants readers free access to the content of a
report, made possible by the digitalization of publications. Understandably, Open Access
journals have gained in popularity and many journals are now offered as open access, or
with an open access option. Many older articles are also offered as open access to readers,
even if their latest issues may not be.
Above: Distribution
Figure 1. Above: DistributionofofArticles
Articles(orange)
(orange)and
andReviews
Reviews (blue)
(blue) included
included in in
thethe present
present corpus,
corpus, expressed
expressed on aonlog
a
scale.
log scale.
To gain deeper insights into the characteristics of the papers in the dataset, we anno-
tated the presence and number of colons and semicolons in the title, as a sign of a more
structured and
structured andthus
thuscomplex
complextitle,
title,ororquestion
questionmarks
marks (a (a
sign of aofdifferent
sign illocutive
a different attitude
illocutive atti-
in the sentence), and we also performed a sentiment analysis using
tude in the sentence), and we also performed a sentiment analysis using VADER, a com- VADER, a common
package
mon packagefrom from
the Nltk
the library, whichwhich
Nltk library, assigns a positive
assigns or negative
a positive sentiment
or negative to sentences.
sentiment to sen-
This last
tences. feature
This shouldshould
last feature be taken bewith
takena with
caveat. VADER
a caveat. is a human-validated
VADER is a human-validated sentiment
sen-
analysis method, supported by a robust corpus of literature [15–17]
timent analysis method, supported by a robust corpus of literature [15–17] but not reallybut not really tested
with
testedbiomedical texts yet,
with biomedical although
texts our study
yet, although our(Figure A1) confirms
study (Figure that most
A1) confirms thatpublications
most pub-
display a positive sentiment, which is consistent with a positive publication
lications display a positive sentiment, which is consistent with a positive publication bias [18]. bias
[18]. Figure A1 in the Appendix A summarizes feature distribution in our dataset, stratified
by publication
Figure A1type.in theThe majorityAofsummarizes
Appendix publicationsfeature
in our dataset are original
distribution in our articles,
dataset,almost
strati-
three times as many as reviews. Seventy-five % of articles are also accessible
fied by publication type. The majority of publications in our dataset are original articles, via a payable
subscription.
almost Most as
three times articles
manyinasour datasetSeventy-five
reviews. are simple, without colons,
% of articles areoralso
with two colons
accessible viaata
most. Titles with more than two colons, and up to six, are a small fraction
payable subscription. Most articles in our dataset are simple, without colons, or with two of the dataset.
Most articles
colons at most.have no semicolons
Titles with more thanor question marksand
two colons, (Figure
up toA1).
six, About a thousand
are a small fractionarticles
of the
have a semicolon or a question mark. Though original articles are much more numerous
dataset. Most articles have no semicolons or question marks (Figure A1). About a thou-
than reviews, they are similarly distributed across features.
sand articles have a semicolon or a question mark. Though original articles are much more
The distribution of citations was quite skewed (data not shown). Though few manuscripts
numerous than reviews, they are similarly distributed across features.
were cited a very high number of times—up to 17,996—the overall median number of citations
The distribution of citations was quite skewed (data not shown). Though few manu-
for the included manuscripts was 15, with a 75th percentile of 39. As clearly visible by the
scripts were cited a very high number of times—up to 17,996—the overall median number
distribution of citations over the years, a peak of citations can be observed around the year
of citations for the included manuscripts was 15, with a 75th percentile of 39. As clearly
2000, with citations then declining in more recent years (Figure 2). This reflects the nature of
visible by the distribution of citations over the years, a peak of citations can be observed
citations, which accumulate with time, so manuscripts that were published up to 20 years ago
had obviously more time to collect a higher number of citations. On the other hand, older
articles did not get as many citations, possibly because their cultural parable was alreadyations 2021, 9, x FOR PEER REVIEW 4 of 11
around the year 2000, with citations then declining in more recent years (Figure 2). This
Publications 2021, 9, 15 4 of 11
reflects the nature of citations, which accumulate with time, so manuscripts that were
published up to 20 years ago had obviously more time to collect a higher number of cita-
tions. On the other hand, older articles did not get as many citations, possibly because
their cultural declining
parable wasas citations
alreadystarted to beas
declining tracked, or simply
citations startedbecause at the time
to be tracked, or fewer
simplyjournals existed
because at theand
timefewer
fewerarticles were
journals published
existed and thus
and fewer the chances
articles to be cited
were published andwere
thusless
thenumerous. To
chances to be account for the
cited were lesseffect of timeTo
numerous. onaccount
the citation
for count of each
the effect article
of time onwethenormalized
citation the citation
number by the average expected citation for the year of publication,
count of each article we normalized the citation number by the average expected citation and used that as target
for our subsequent analysis.
for the year of publication, and used that as target for our subsequent analysis.
Figure 2. Distribution of citations for the manuscripts in the present corpus. Values are shown as
Figure 2. Distribution of citations for the manuscripts in the present corpus. Values are shown as mean +/− stan-
mean +/− standard deviation.
dard deviation.
It has been claimed
It has beenthat the length
claimed of the
that the length
title of of
a manuscript
the title of aismanuscript
associated is to associated
its to its
citation count,citation
though the results
count, though are conflicting.
the results Further evidence
are conflicting. has beenhas
Further evidence reported
been reported about
about a possible association
a possible between
association authors
between and citations.
authors To test
and citations. Tothese hypothesis,
test these we we analyzed
hypothesis,
analyzed title title
length (as number of words) and number of authors for
length (as number of words) and number of authors for the manuscripts the manuscripts in in our corpus.
our corpus. Title length did not appear to change significantly over the years; as
Title length did not appear to change significantly over the years; as previously reported, previ-
ously reported, thethe number
number ofofauthors
authorshashassteadily
steadilyincreased
increasedover overtime,
time,possibly
possibly asas studies
studies have become
have become more morecomplex
complexand andprogressively
progressively required
required moremore competences
competences and and
skillsskills
(data not shown).
(data not shown). Title length, however, was not homogeneously distributed across different document
Title length, however,
types. Original was not homogeneously
articles tend to have longer distributed across
titles than different
reviews do, document
while having also more
types. Originalauthors
articles(Figure
tend toA2).have longer titles than reviews do, while having also more
authors (Figure A2). A cumulative plot of author number and title length can be seen in Appendix B, where
A cumulative plot of author
these features number and
are represented title length
as boxplots cansignificantly
and are be seen in Appendix B,
different between reviews
where these features
and originalare represented
articles with as pboxplots
< 0.001,and by are
Mann significantly
Whitney U different between
test (Figure A3). These data
reviews and original
suggestarticles
that part with p < association
of the 0.001, by Mann Whitney
between U test (Figure
title length or author A3). These and citations
number
data suggest that part of the association between title length
might actually be mediated by differences in article genre. or author number and cita-
tions might actually be mediated by differences in article genre.
To better understand this phenomenon, we preliminarily investigated the relation
To betterbetween
understand this phenomenon,
the citations of a study andwe its
preliminarily
features (Figureinvestigated
3). Takenthe relationreview articles
together,
between the citations
collectedofmore a study and itsthan
citations features (Figure
original 3). Taken
articles (p < 0.001together,
by Mannreview articles
Whitney U test), possibly
collected morebecause
citations ofthan
theiroriginal articles (p
comprehensive and< 0.001 by Mann
broader scope.Whitney U test),no
Cumulatively, possibly
differences could be
because of their comprehensive
observed and broader
when articles scope. Cumulatively,
were stratified no differences
by access or sentiment, while could
fewer becitations were
observed when observed
articles with
weretitles with by
stratified multiple
access colons, thoughwhile
or sentiment, thesefewer
were citations
much lesswerenumerous, as we
observed withhave titlesshown (Figure 3).
with multiple colons, though these were much less numerous, as we
have shown (Figure We 3).next considered the association of citations with title length or author number.
Although it can be easily observed that studies with exceedingly long titles, over 30 or
40 words, are not cited as much as some studies with shorter titles, the bulk of the studies
in the dataset, with title length below 30 words, have a broad range of citations. We have
plotted the distribution of citations separately for articles and reviews and the pattern
appears quite similar (Figure 4), although research articles show a wider range in title
length than Reviews.ons 2021, 9,Publications
x FOR PEER 2021, 9, 15
REVIEW 5 of 11 5 of 11
Figure 3. Boxplots representing the number of (normalized) citations in review and original arti-
cles (above, left-hand side), subscription-based and open access publications (above, middle), pub-
lications with negative or positive title sentiment score (above, right-hand side) and citation count
based on the number of colons in the title (below, left-hand side), semicolons (below, middle) or
question marks (below, right-hand side). For all the plots, review papers are indicated in blue and
original articles in orange.
We next considered the association of citations with title length or author number.
Although it can be easily observed that studies with exceedingly long titles, over 30 or 40
Figure 3.
Figure 3. Boxplots representing words, are the
Boxplots representing
the number not cited asofmuch
number
of (normalized) as some
(normalized)
citations studies
citations
in review with
in
and shorter
review
original and titles,
original
articles the bulk
arti-
(above, of the
left-hand studies in
side),
cles (above,
subscription-based and left-hand side),
the
open access subscription-based
dataset, with
publications (above, and open
title middle),
length below access publications
30 words,
publications have(above,
with negative a or
broad middle),
titlepub-
range
positive of citations.
sentiment We have
score
licationsside)
(above, right-hand with and
negative or positive
plotted
citation count title sentiment
the based
distribution ofscore
on the number (above,
citations
of right-hand
separately
colons side)
for
in the title andleft-hand
articles
(below, citation count
and reviews and the pattern
side), semicolons
basedor onquestion
the number of colons inright-hand
the title (below,
side).left-hand side), semicolons (below,are
(below, middle) appears
marks quite
(below, similar (Figure For4),allalthough
the plots, research articlesmiddle)
review papers show or in blue and
a wider range in title
indicated
question marks (below, right-hand side). For all the plots, review papers are indicated in blue and
original articles in orange. length than Reviews.
original articles in orange.
We next considered the association of citations with title length or author number.
Although it can be easily observed that studies with exceedingly long titles, over 30 or 40
words, are not cited as much as some studies with shorter titles, the bulk of the studies in
the dataset, with title length below 30 words, have a broad range of citations. We have
plotted the distribution of citations separately for articles and reviews and the pattern
appears quite similar (Figure 4), although research articles show a wider range in title
length than Reviews.
Figure
Figure 4. Above,
4. Above, left:left: Distribution
Distribution of normalized
of normalized citations
citations by by
titletitle length
length in reviews
in reviews (blue)
(blue) andand original
original articles
articles (orange);
(orange);
right:
right: boxplot
boxplot representing
representing thethe normalized
normalized citation
citation count
count by by quartiles
quartiles of title
of title length
length in reviews
in reviews or original
or original articles.
articles. Below,
Below,
left:left: Distribution
Distribution of normalized
of normalized citations
citations by by author
author count
count in reviews
in reviews (blue)
(blue) andand original
original articles
articles (orange);
(orange); right:
right: boxplot
boxplot
representing
representing the the normalized
normalized citation
citation count
count by quartiles
by quartiles of number
of number of authors
of authors in reviews
in reviews or original
or original articles.
articles.
It appears
It appears thatthat verbose
verbose titles
titles tend
tend to accrue
to accrue fewer
fewer citations,
citations, butbut
thethe same
same relationship
relationship
doesdoes
notnot hold
hold for for shorter
shorter titles,
titles, where where the whole
the whole rangerange of citations
of citations is observed.
is observed. This rela-
This relation
tionreally
is not is notapparent
really apparent
when we when we quartilized
quartilized the datathe
fordata
title for title(Figure
length length (Figure
4). 4).
It was
It was moremore difficult
difficult to observe
to observe a relation
a relation between
between thethe number
number of authors
of authors andand thethe
gure 4. Above, left: Distribution of normalized
number citations
number ofby title length
citations,
of citations, though in areviews
though slight (blue)
a slight and
positive
positive original articles
association
association (orange);
could
could be observed
be observed (Figure
(Figure 4). 4).
ght: boxplot representing the normalized citation count by quartiles of title length in reviews or original articles. Below,
ft: Distribution of normalized citations by 3.3.
author count in
Ensemble reviews (blue) and original articles (orange); right: boxplot
Machines
presenting the normalized citation count by quartiles of number of authors in reviews or original articles.
We decided to evaluate the relative importance of the textual factors on normalized
citations by adopting
It appears that verbose ensemble
titles tend to accruepredictive machines,
fewer citations, namely
but the sameGradient Boosting, Bagging
relationship
and Random Forest.
does not hold for shorter titles, where the whole range of citations is observed. This rela-
These algorithms are based on decision trees, i.e., algorithms that split the data based
tion is not really apparent when we quartilized the data for title length (Figure 4).
on the features provided in order to get homogenous groups, in this case distinguishing
It was more difficult to observe a relation between the number of authors and the
articles with higher and lower citation rates. We first attempted a regression algorithm, i.e.,
number of citations, though a slight positive association could be observed (Figure 4).
an algorithm aiming to predict the number of citations based on the collected features. The
following predictors were considered: sentiment analysis score for the title, presence of a
semicolon, a colon, a question mark, access options, document type, number of authors andcitations by adopting ensemble predictive machines, namely Gradient Boosting, Bagging
and Random Forest.
These algorithms are based on decision trees, i.e., algorithms that split the data based
on the features provided in order to get homogenous groups, in this case distinguishing
Publications 2021, 9, 15
articles with higher and lower citation rates. We first attempted a regression algorithm, 6 of 11
i.e., an algorithm aiming to predict the number of citations based on the collected features.
The following predictors were considered: sentiment analysis score for the title, presence
of a semicolon, a colon, a question mark, access options, document type, number of au-
title
thorslength. Even
and title usingEven
length. parameter
using tuning techniques,
parameter i.e., RandomCV
tuning techniques, search to optimize
i.e., RandomCV search to
the hyperparameters
optimize for the algorithm,
the hyperparameters for the the degree of
algorithm, thefitting
degreeof the regression
of fitting algorithms
of the regression
was very lowwas
algorithms R2 ranged
andvery low andfrom 2% forfrom
R2 ranged a simple linear
2% for regression
a simple to 3.3% of Random
linear regression to 3.3% of
Forest regressor. Interestingly, Gradient Boosting and Random
Random Forest regressor. Interestingly, Gradient Boosting and Random Forest algorithms can also
Forest algorithms
provide a score of importance for the features, which are shown in Figure
can also provide a score of importance for the features, which are shown in Figure 5. 5.
A B
Figure5.5.Feature
Figure Featureimportance
importancefor
for(A)
(A)GradientBoost
GradientBoostregressor
regressorand
and(B)
(B)Random
RandomForest
Forestregressor.
regressor.
As
Asa acomparison,
comparison, wewe also decided
also decidedto use ensemble
to use ensembleclassifiers, withwith
classifiers, the same predictors:
the same predic-
totors:
thisto
purpose, and to simplify the model, normalized citations were binarily
this purpose, and to simplify the model, normalized citations were binarily cate- categorized
asgorized
either as
low (below
either lowthe median)
(below or highor
the median) (above the median).
high (above Table Table
the median). 1 summarizes the
1 summarizes
performance of these algorithms, and Gradient Boosting appears to
the performance of these algorithms, and Gradient Boosting appears to have the highest have the highest
F1F1score.
score.
AAconfusion
confusion matrix for the
matrix for theclassification
classificationbyby Gradient
Gradient Boosting
Boosting cancan be found
be found in in
Figure
Figure A4. The relative weight of the considered factors can be seen in Figure 6. TheThe
A4. The relative weight of the considered factors can be seen in Figure 6. doc-
document
ument type type (Reviews
(Reviews vs.vs. articles)
articles) is,is, maybe
maybe unsurprisingly,the
unsurprisingly, themost
mostrelevant
relevantpredictor
predictorof
ofcitations,
citations,followed
followedby byitsitsaccess
access options
options (whether
(whether open
open access
access oror via
viapaid
paidsubscription),
subscription),
for the Random Forest classifier. The number of authors (which may show some collinearity
for the Random Forest classifier. The number of authors (which may show some colline-
with the publication year, Pearson’s r = 0.17) precedes the title length, which, though it
arity with the publication year, Pearson’s r = 0.17) precedes the title length, which, though
does show some association, does not seem to be playing a major role. The presence of a
it does show some association, does not seem to be playing a major role. The presence of
colon in the title (and even more so of a semi-colon) or the sentiment of the title may also
a colon in the title (and even more so of a semi-colon) or the sentiment of the title may also
have some relevance as predictors.
have some relevance as predictors.
Table 1. Summary of the performance of the algorithms for the classification of the citation rate in
Table 1. Summary of the performance of the algorithms for the classification of the citation rate in the analyzed corpus.
the analyzed corpus.
Algorithm Accuracy Precision Recall F1 Score
Algorithm Accuracy Precision Recall F1 Score
Gradient Boosting 0.599 0.598 0.514 0.552
Bagging Gradient Boosting
0.570 0.599
0.560 0.598 0.483 0.514 0.552
0.519
Bagging 0.570 0.560 0.483 0.519
Random Forest 0.597 0.601 0.479 0.533
Publications 2021, 9, x FOR PEER REVIEW Random Forest 0.597 0.601 0.479 0.533 7 of 11
Logistic Regression 0.590
Logistic Regression 0.583
0.590 0.583 0.517 0.517 0.548
0.548
A B
Figure6.6.Overview
Figure Overviewof
offeature
featureimportance
importancefor
for(A)
(A)GradientBoost
GradientBoostand
and(B)
(B)Random
RandomForest
Forestclassifier.
classifier.
4. Discussion
Citations are a fundamental factor for the scientific community, because they are the
basic metrics that are used to rank research and productivity, and therefore regulate itsPublications 2021, 9, 15 7 of 11
4. Discussion
Citations are a fundamental factor for the scientific community, because they are the
basic metrics that are used to rank research and productivity, and therefore regulate its
policies and funding allocation. Within such an information overload setting as the one life
sciences are experiencing [19], publishing an article is not enough to expect the article to be
seen and cited. On the other hand, some articles may reach impressive levels of visibility. It
appears, therefore, important to investigate the factors that, beside scientific soundness and
content relevance, may make the article stand out in a literature search and be downloaded,
read and possibly cited, because visibility is key to the spreading of ideas and it may affect
novel directions and trends of research.
A lot of research has focused on the effects of title length on citations [20], because
titles are what a reader would scan first while searching the literature. The results, however,
have been, so far, quite conflicting. An investigation conducted in 22 medical journals
(comprising more than 9000 articles) concluded that publications with longer titles are cited
more than articles with shorter titles, especially for high impact factor journals [21]. This is
in agreement with a research by Jacques et al., which showed a positive correlation between
title length and citation number in a small set of most- and least-cited articles in three
general medicine journals [22]. Other studies, however, found no association [23], different
associations according to the field [24], or, more recently, a negative association, when the
20,000 more cited papers for each year in the 2007–2013 period were considered [11].
The present commentary followed a slightly different approach. We focused on articles
published in the biomedical field on the specific topic of chronic inflammation. This topic
was chosen because of the authors’ domain knowledge, but also because it is broad enough
to encompass a vast range of publications across thousands of journals in the biomedical
field, and was known and researched since the 19th century. This allowed us to collect a
large sample of more than 150,000 articles, or, after excluding those for which no citation
was available, with an unusually high number of authors and after limiting our analysis to
articles published after 1951, 121,640 articles. Our approach has some limits: it admittedly
focuses only on one scientific area, and one topic. By considering articles from such a vast
range of years, it may suffer from increased heterogeneity on some predictors. For this
reason, the initial dataset, which comprised publications from 1827 to 2021, was reduced to
papers appearing in the 1951–2021 interval.
Our initial survey showed a robust association between the genre of the article and the
number of citations: unsurprisingly, and consistently with what has already been reported,
review articles are usually cited more than research articles [25]. What is more relevant,
however, is that reviews in our dataset have significantly shorter titles than original articles.
This too is somewhat unsurprising, or at least easily accounted for by the nature of reviews.
As they usually encompass broader topics, at least in their narrative version, they often
do not require investigation of the details of a single finding, as original articles do, and
thus, their titles may often need fewer words. This may in part justify the observation that
most-cited articles have shorter titles [11], because many of them happen to be reviews.
However, even stratifying by document type, there seems to be some kind of trend for
articles with longer titles to have fewer citations, so it may be that that shorter titles are
cognitively favored to be easily recognized and read and thus maybe cited.
We then decided to investigate the predictive power of some features associated with
the text and its title and relied on ensemble machines Gradient Boosting [26], Bagging [27]
or Random Forest [28], both as regressors and as classifiers. These algorithms are based
on decision trees and reduce overfitting on training data by running parallel sub-optimal
decision trees. In other words, a decision tree could easily find the best splits to divide a
dataset in order to eventually obtain homogeneous groups, especially given enough splits,
but the solution would be difficult to generalize on different datasets. Thus, sub-optimal
trees are used in these algorithms: while each decision tree in these machines is a weak
predictor that uses only part of the data, the performance and the bias of the algorithm
can be improved by running several of such weak predictors in parallel and by averagingPublications 2021, 9, 15 8 of 11
their results. When predicting the visibility of a manuscript we decided to avoid using the
citation count, as obtained from the repository, because citations are obviously affected by
the publication year. The same article would have different citation counts if assessed at
different time points after publication and thus time-since-publication is not an intrinsic
quality of a manuscript, but a rather incidental one. Thus, we preferred to use a normalized
citation count, which we obtained by normalizing the number of citations for an article
by the expected average for that year. We could have adjusted for other factors, e.g.,
publication or access type, as is often done, but this would have ended up removing these
factors from the features the ensemble algorithms consider when analyzing the data. Our
goal was to provide these ensemble machines with a series of predictors and let them
compute whether they can explain the observed variability in the distribution of the target
variable and assess the weight each factor has in affecting such a target.
It can be pointed out that the accuracy of these models was quite low, both for the
classifiers and even more so for the regressor algorithms. This means that only a fraction
of the variability in the citation count can be explained by these factors which is not
surprising. Our database did not include any indicator about the novelty of the content
of an article, the prestige of the authors or even the prestige of the journal, scientific and
even just sociological factors that would be normally associated to citing a study. Within
the factors that we did include, however, the type of document has a heavy impact on
citation count and so does the access to the article. The author number, in agreement
with Vieira et al. [3] appears high on the list too, although its distribution does not show
much association with the number of citations, but the result can be accounted for by the
change in author number over time. Title length has some impact but it is overall quite
limited. The presence of colons or sentiment analysis appears less relevant, while the effect
of semicolons or question marks is quite negligible. It must, however, be highlighted that
sentiment analysis was performed with a pre-set sentiment analyzer, albeit a quite popular
one, whose reliability in this context might be limited.
Considering how limited the features we used are, the answers the models provided
are quite interesting. Further textual characteristics could and should be investigated, such
as an analysis of embedding semantics for titles and abstracts, but also abstract length,
article length and of course impact factor. Further efforts should also be made to explore
the role of the originality and relevance of the content, although it is difficult to define and
identify correct metrics for these features.
5. Conclusions
In conclusion, considering the dataset of all the publications since 1951 in the field
of chronic inflammation, indexed in Scopus, several factors appear to be associated with
citation count; although, obviously, no single factor can decide the fate of a publication.
Besides the time passed since its publication, which is necessary for an article to gather
citations, the document type, the access type and the number of authors affect citation
count. Very wordy titles do appear associated to fewer citations, but title length seems to
play a small role in computational models trying to predict citation count.
Author Contributions: Conceptualization, C.G. and S.G.; methodology, C.G.; software, C.G.; formal
analysis, C.G.; investigation, S.G.; resources, S.G.; data curation, C.G.; writing—original draft
preparation, C.G.; writing—review and editing, C.G.; supervision, S.G.; Both authors have read and
agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data available on request due to restrictions.
Conflicts of Interest: The authors declare no conflict of interest.agreed to the published version of the manuscript.
Funding: This research received no external funding.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Publications 2021, 9, 15 Informed Consent Statement: Not applicable. 9 of 11
Data Availability Statement: Data available on request due to restrictions.
Data Availability Statement: Data available on request due to restrictions.
Conflicts of Interest: The authors declare no conflict of interest.
Conflicts of Interest: The authors declare no conflict of interest.
Appendix A
Appendix A
Appendix A
Figure A1. Composition of our dataset. Above: number of reviews and original articles (left), number of open access or
Figure A1. Composition
Composition of our dataset. Above: number of
of reviews
reviews and original articles (left), number of
of open access or
subscription papers (middle), number ofAbove: number
publications with negative and originaltitle
or positive articles (left), score
sentiment number open
(right). Below: num-
subscription
subscription papers
papers by(middle),
(middle), number of
numberinofthe publications
publications with negative or positive title sentiment score (right). Below: num-
ber of publications colon number title (left),with negative or
by semicolon positiveor
(middle) title
bysentiment
number ofscore (right).
question Below:
marks number
(right). All
ber
of of publications by colon number in the title (left), by semicolon (middle) or by number of question marks (right).
All All
thepublications
publicationsbyare
colon number
stratified in the title
according to (left),
documentby semicolon (middle)
type, where or by number
blue indicates of question
reviews and orange marks (right).original
indicates the
the publications
publications are stratified
are stratified according
according to document
to document type, where
type, where blue indicates
blue indicates reviewsreviews and orange
and orange indicates
indicates original original
articles.
articles.
articles.
Publications
Figure2021,
A2. 9,
A2. x FOR distribution
Above: PEER REVIEW of title lengths in our dataset for review and articles. Below: distribution
distribution of
of author count.10 of 11
author count.
Figure A2. Above: distribution of title lengths in our dataset for review and articles. Below: distribution of author count.
Blue bars
Blue bars indicate
indicatereviews
reviewsand andorange
orangebars
barsindicate
indicate articles.
articles. The
The length
length of the
of the barbar indicates
indicates the the exact
exact number
number of publica-
of publications
Blue bars indicate reviews and orange bars indicate articles. The length of the bar indicates the exact number of publica-
tions
for forspecific
that that specific
title title length
length or or author
author count count
in ourin our dataset.
dataset.
tions for that specific title length or author count in our dataset.
Appendix B
Appendix B
Figure A3.Boxplots
FigureA3. Boxplotsofofauthor
authornumber
number(left)
(left)and
andtitle
titlelength
length(right)
(right)by
bydocument
documenttype.
type.The
Theplots
plotsshow
showthat
thatreviews
reviewshave
have
significantly
significantlyfewer
fewerauthors
authorsand
andsignificantly
significantlyshorter
shortertitles.
titles.The
Thewhiskers
whiskersofofthe
theplots
plotscorresponds
correspondstoto1.5 × the
1.5× theinterquartile
interquartile
range.
range.Values
Valuesfalling
fallingoutside
outsideofofthis
thisintervals
intervalsare
arerepresented
representedasasblack
blackdots.
dots.Figure
Publications A3.9,Boxplots
2021, 15 of author number (left) and title length (right) by document type. The plots show that reviews have
10 of 11
significantly fewer authors and significantly shorter titles. The whiskers of the plots corresponds to 1.5× the interquartile
range. Values falling outside of this intervals are represented as black dots.
Figure
Figure A4. Confusion matrix A4. Confusion
for Gradient Boosting.matrix for Gradient
The matrix breaksBoosting. The that
down articles matrix breaks
were down
predicted toarticles that or
have a low were
predicted
high citation count and what to have
they were a low
actually or (‘Observed’).
like high citation count and what they were actually like (‘Observed’).
References
References
1.
1. Bai,
Bai, X.;
X.; Liu,
Liu, H.;
H.; Zhang,
Zhang, F.; F.; Ning,
Ning, Z.;
Z.; Kong,
Kong, X.;X.; Lee,
Lee, I.;
I.; Xia,
Xia, F.
F. An
An overview
overview on on evaluating
evaluating and
and predicting
predicting scholarly
scholarly article impact.
article impact.
Information 2017, 8, 73. [CrossRef]
Information 2017, 8, 73, doi:10.3390/info8030073.
2.
2. Bai,
Bai, X.;
X.; Zhang,
Zhang, F.; F.;Lee,
Lee,I.I.Predicting
Predictingthethecitations
citationsof ofscholarly paper.J.J.Informetr.
scholarlypaper. 2019,13,
Informetr.2019, 13,407–418.
407–418,[CrossRef]
doi:10.1016/j.joi.2019.01.010.
3.
3. Bentéjac,
Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A comparative analysis of gradient boostingalgorithms.
C.; Csörgő, A.; Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif.
Artif. Intell.
Intell. Rev. 2021, 54,
Rev. 2021, 54,
1937–1967.
1937–1967, [CrossRef]
doi:10.1007/s10462-020-09896-5.
4.
4. Blümel,
Blümel, C.;C.; Schniedermann,
Schniedermann, A. A. Studying
Studying review
review articles
articles in in scientometrics
scientometrics and and beyond:
beyond: A A research agenda. Science
research agenda. 2020, 124,
Science 2020, 124,
711–728. [CrossRef]
711–728, doi:10.1007/s11192-020-03431-7.
5.
5. Borg,
Borg, A.;
A.; Boldt,
Boldt, M.M. Using
Using VADER
VADERsentiment
sentimentand andSVMSVMfor for predicting
predictingcustomer
customer response sentiment. Expert
responsesentiment. Expert Syst.
Syst. Appl. 2020, 162,
Appl. 2020, 162,
113746. [CrossRef]
113746, doi:10.1016/j.eswa.2020.113746.
6. Bornmann,
6. Bornmann, L.; L.; Mutz,
Mutz, R. R. Growth
Growth rates
rates of
of modern
modern science:
science: A A bibliometric
bibliometric analysis
analysis based
based onon the
the number
number of of publications
publications and and cited
cited
references. J. Assoc. Inf. Sci. Technol. 2015, 66, 2215–2222. [CrossRef]
references. J. Assoc. Inf. Sci. Technol. 2015, 66, 2215–2222, doi:10.1002/asi.23329.
7.
7. Breiman,
Breiman, L. L. Random forests. Mach.
Random forests. Mach. Learn. 2001, 45,
Learn. 2001, 45, 5–32.
5–32, [CrossRef]
doi:10.1023/a:1010933404324.
8.
8. Burrell,
Burrell, Q.L.
Q.L. Predicting
Predicting future
futurecitation behavior.J.J.Am.
citationbehavior. Am.Soc.Soc. Informetr.
Informetr. Sci.
Sci. Technol. 2003,54,
Technol.2003, 54,372–378.
372–378, [CrossRef]
doi:10.1002/asi.10207.
9.
9. De
De Rijcke,
Rijcke, S.;
S.; Rushforth,
Rushforth, A. A. To
To intervene
intervene or or not
not to
to intervene;
intervene; isis that
that the
the question?
question? On On the
the role
role of
of scientometrics
scientometrics in in research
research
evaluation. J.J. Assoc.
evaluation. Assoc. Inf.
Inf. Sci. Technol. 2015,
Sci. Technol. 2015, 66,
66, 1954–1958. [CrossRef]
1954–1958, doi:10.1002/asi.23382.
10.
10. Duyx,
Duyx, B.;B.; Urlings,
Urlings, M.J.;
M.J.; Swaen,
Swaen, G.M.;
G.M.; Bouter,
Bouter, L.M.;
L.M.; Zeegers,
Zeegers, M.P.M.P.Scientific
Scientificcitations
citations favor
favor positive
positive results:
results: A
A systematic
systematic review
review
and meta-analysis. J. Clin. Epidemiol. 2017, 88, 92–101. [CrossRef]
and meta-analysis. J. Clin. Epidemiol. 2017, 88, 92–101, doi:10.1016/j.jclinepi.2017.06.002.
11.
11. Egghe,
Egghe, L.L. TheTheHirsch index and
Hirsch related
index andimpact
related impactAnnu.
measures. Rev. Inf.Annu.
measures. Sci. Technol. Inf. 44,
Rev. 2010, Sci.65–114. [CrossRef]
Technol. 2010, 44, 65–114,
12. Eysenbach, G. Citation advantage
doi:10.1002/aris.2010.1440440109. of open access articles. PLoS Biol. 2006, 4, e157. [CrossRef] [PubMed]
13. Génova, G.; Astudillo, H.; Fraga, A. The scientometric bubble considered harmful. Sci. Eng. Ethic. 2016, 22, 227–235. [CrossRef]
[PubMed]
14. Habibzadeh, F.; Yadollahie, M. Are shorter article titles more attractive for citations? Cross-sectional study of 22 scientific journals.
Croat. Med. J. 2010, 51, 165–170. [CrossRef]
15. Ho, Y.-S.; Kahn, M. A bibliometric study of highly cited reviews in the Science Citation Index expanded™. J. Assoc. Informetr. Sci.
Technol. 2014, 65, 372–385. [CrossRef]
16. Hutto, C.J.; Gilbert, E.E. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of
the International AAAI Conference on Web and Social Media, Ann Arbor, MI, USA, 1–4 June 2014; Volume 8.
17. Jacques, T.S.; Sebire, N.J. The impact of article titles on citation hits: An analysis of general and specialist medical journals. JRSM
Short Rep. 2010, 1, 1–5. [CrossRef]
18. Letchford, A.; Moat, H.S.; Preis, T. The advantage of short paper titles. R. Soc. Open Sci. 2015, 2, 150266. [CrossRef]
19. Lovaglia, M.J. Predicting citations to journal articles: The ideal number of references. Am. Sociol. 1991, 22, 49–64. [CrossRef]
20. Mingers, J.; Leydesdorff, L. A review of theory and practice in scientometrics. Eur. J. Oper. Res. 2015, 246, 1–19. [CrossRef]
21. Newman, H.; Joyner, D. Artificial Intelligence in Education; Penstein Rosé, C., Ed.; Springer: Cham, Switzerland, 2018; pp. 246–250.
[CrossRef]
22. Rostami, F.; Mohammadpoorasl, A.; Hajizadeh, M. The effect of characteristics of title on citation rates of articles. Scientometrics
2014, 98, 2007–2010. [CrossRef]
23. Siebelt, M.; Siebelt, T.; Pilot, P.; Bloem, R.M.; Bhandari, M.; Poolman, R.W. Citation analysis of orthopaedic literature; 18 major
orthopaedic journals compared for Impact Factor and SCImago. BMC Musculoskelet. Disord. 2010, 11, 4. [CrossRef] [PubMed]Publications 2021, 9, 15 11 of 11
24. Skurichina, M.; Duin, R.P. Bagging for linear classifiers. Pattern Recognit. 1998, 31, 909–930. [CrossRef]
25. Smith, D.R. Impact factors, scientometrics and the history of citation-based research. Scientometrics 2012, 92, 419–427. [CrossRef]
26. Vieira, E.; Gomes, J. Citations to scientific articles: Its distribution and dependence on the article features. J. Informetr. 2010,
4, 1–13. [CrossRef]
27. Yitzhaki, M. Relation of the title length of a journal article to the length of the article. Scientometrics 2002, 54, 435–447. [CrossRef]
28. Yogatama, D.; Heilman, M.; O’Connor, B.; Dyer, C.; Routledge, B.R.; Smith, N.A. Predicting a scientific community’s response to
an article. In Proceedings of the EMNLP 2011—Conference on Empirical Methods in Natural Language Processing, Edinburgh,
UK, 27–31 July 2011; pp. 594–604.You can also read