The Messenger No. 177 - Quarter 3 | 2019 - European Southern Observatory

Distributed Peer Review
M87 Event Horizon Telescope Results
The PHANGS Surveys
Total Solar Eclipse Over La Silla
                                                                   The Messenger
                                      No. 177 – Quarter 3 | 2019
Contents

Telescopes and Instrumentation
Patat F. et al. – The Distributed Peer Review Experiment
Coccato L. et al. – On the Telluric Correction of KMOS Spectra
Gonté F. et al. – Bringing the New Adaptive Optics Module for Interferometry (NAOMI) into Operation

Astronomical Science
Goddi C. et al. – First M87 Event Horizon Telescope Results and the Role of ALMA
Schinnerer E. et al. – The Physics at High Angular resolution in Nearby GalaxieS (PHANGS) Surveys

Astronomical News
Ventura L. et al. – Total Solar Eclipse Over La Silla
Christensen L. L. et al. – Science & Outreach at La Silla During the Total Solar Eclipse
Dennefeld M. et al. – Pointing the NTT at the Sun: Studying the Solar Corona During the Total Eclipse
Sani E. et al. – Report on the ESO Workshop “KMOS@5: Star and Galaxy Formation in 3D – Challenges in KMOS 5th Year”
Liske J., Mainieri V. – Report on the ESO Workshop “Preparing for 4MOST – A Community Workshop Introducing ESO’s Next-Generation Spectroscopic Survey Facility”
Mroczkowski T. et al. – Report on the ESO Workshop “ALMA Development Workshop”
Mérand A., Leibundgut B. – Report on the ESO Workshop “The VLT in 2030”
Yang C. – Fellows at ESO
Jethwa P., Oikonomou F. – External Fellows at ESO
Hofstadt D. – Lodewijk Woltjer (1930–2019)
Personnel Movements

ESO, the European Southern Observatory, is the foremost intergovernmental astronomy organisation in Europe. It is supported by 16 Member States: Austria, Belgium, the Czech Republic, Denmark, France, Finland, Germany, Ireland, Italy, the Netherlands, Poland, Portugal, Spain, Sweden, Switzerland and the United Kingdom, along with the host country of Chile and with Australia as a Strategic Partner. ESO’s programme is focussed on the design, construction and operation of powerful ground-based observing facilities. ESO operates three observatories in Chile: at La Silla, at Paranal, site of the Very Large Telescope, and at Llano de Chajnantor. ESO is the European partner in the Atacama Large Millimeter/submillimeter Array (ALMA). Currently ESO is engaged in the construction of the Extremely Large Telescope.

The Messenger is published, in hardcopy and electronic form, four times a year. ESO produces and distributes a wide variety of media connected to its activities. For further information, including postal subscription to The Messenger, contact the ESO Department of Communication at:

ESO Headquarters
Karl-Schwarzschild-Straße 2
85748 Garching bei München, Germany
Phone +498932006-0
information@eso.org

The Messenger
Editor: Gaitee A. J. Hussain
Layout, Typesetting, Graphics: Jutta Boxheimer, Mafalda Martins, Lorenzo Benassi
Design, Production: Jutta Boxheimer
Proofreading: Peter Grimley, Caroline Reid
www.eso.org/messenger/

Printed by FIBO Druck- und Verlags GmbH
Fichtenstraße 8, 82061 Neuried, Germany

Unless otherwise indicated, all images in
The Messenger are courtesy of ESO,
except authored contributions which are
courtesy of the respective authors.

© ESO 2019
ISSN 0722-6691

Front cover: A series of exposures showing the trajectory of the Sun over roughly two and a half hours. The total solar eclipse resulted in almost two minutes of totality at 20:39 UT. Credit: ESO/P. Horálek

Telescopes and Instrumentation                                                                                  DOI: 10.18727/0722-6691/5147

The Distributed Peer Review Experiment

Ferdinando Patat¹
Wolfgang Kerzendorf²,³,⁴
Dominic Bordelon¹
Glen Van de Ven⁵
Tyler Pritchard²

¹ ESO
² Center for Cosmology and Particle Physics, New York University, USA
³ Department of Physics and Astronomy, Michigan State University, USA
⁴ Department of Computational Mathematics, Science and Engineering, Michigan State University, USA
⁵ Department of Astrophysics, University of Vienna, Austria

All large, ground- and space-based astronomical facilities serving wide communities face a similar problem: in many cases the number of applications they receive in response to each call exceeds 1000. This poses a serious challenge to running an effective selection process under the classic peer-review paradigm, in which the proposals are assigned to pre-allocated panels with fixed compositions. Although, in principle, one could increase the size of the time allocation committee, this creates logistic and financial problems which place a practical limit on its maximum size, making this solution unviable beyond a certain volume of applications. For this reason, alternative solutions must be sought. One of these is the so-called Distributed Peer Review (DPR) in which, by submitting a proposal, the Principal Investigators (PIs) agree both to act as reviewers and to have their proposal reviewed by their peers. In this article we report the results of a DPR experiment run by ESO in Period 103, in parallel with the regular review by the Observing Programmes Committee (OPC).

Introduction

Following the start of VLT operations in 1998, the number of applications to use ESO telescopes has been steadily growing, exceeding 1100 proposals in Period 84. After this peak, the number of submissions per semester stabilised at around 900 (Patat et al., 2017). Despite the significant growth of the user community, which has made ESO one of the largest astronomical facilities in the world, the way telescope time applications are reviewed has remained substantially the same since 1993. Barring the necessary increase in the number of reviewers, the procedure has changed in the details, but not in its substance. Following steady growth in the numbers of submissions, the current review load is about 70 proposals per panel member and up to 100 for OPC-proper members (the latter serve on a second panel which reviews the recommendations across all science categories). These numbers have reached critical levels, requiring a re-evaluation of the procedures and an examination of the effectiveness of peer review.

The pressure on the peer review process has been the subject of a study by the ESO OPC Working Group (Brinks et al., 2012) and the Time Allocation Working Group (TAWG; Patat, 2018a). Both studies identified the excessive number of proposals per referee as the most urgent problem that ESO needs to tackle. Not only does the workload severely affect the referees (also increasing the rejection rate during the recruitment phase), but it can also have an impact on the quality of the reviews and the feedback provided to the applicants, with potentially serious consequences. The feedback has been repeatedly and consistently identified as a major problem by the OPC and the Users Committee, and via direct communications from numerous individual users. Problems with the peer review could ultimately affect the scientific productivity and impact of the Organisation itself. A number of recommendations have been proposed by the working groups, some of which are interdependent.

As a first step, since Period 102 ESO has decreased the number of referees (from six to three) who review a proposal ahead of the OPC meeting. Triage is then applied using the three pre-OPC meeting grades, with about the lowest 30% of proposals being rejected. At the meeting all non-conflicted panel members are then asked to discuss and grade only the surviving proposals. While this measure has successfully reduced the workload of the panel members, it has become cumbersome to manage in practice. For example, late dropouts during the review process can reduce the number of pre-meeting reviews per proposal, making the triage procedure less robust. While this change was relatively easy to implement, experience gained during Periods 102 and 103 suggests that the negative consequences outweigh the benefits. It is clear that further and more drastic and structured actions need to be taken; these include a move to an annual cycle and the deployment of a fast track channel (FTC; see Patat, 2018a).

By construction, the FTC requires a short duty cycle during which referees are continuously on duty. The most suitable mechanism for reviewing the proposals is a Distributed Peer Review (DPR), one of the most innovative schemes through which the load on referees can be alleviated (Merrifield & Saari, 2009). This concept has been successfully applied to the Fast Turnaround channel deployed at the Gemini Telescope, which has processed over 1000 proposals in this way since 2015. The Gemini Observatory has published a report (Andersen et al., 2019) and updates are continuously provided on its webpages¹.

Depending on the fraction of total telescope time that is allocated via the FTC, this channel may also serve to decrease the load on the OPC, which would then focus only on proposals with larger time requests. ESO has conducted a systematic study aimed at better evaluating the application of DPR to its programmes. In Period 103, in parallel with the regular OPC cycle, a DPR experiment was run involving a subset of submitted proposals. This article presents a brief description of the experiment setup and summarises an analysis of several statistical indicators. More details can be found in Kerzendorf et al. (2019).

Distributed Peer Review and the DPR Experiment

Different measures to alleviate the load on the reviewers have been and are being considered by various facilities. These include drastic solutions, like the one deployed by the National Science Foundation (NSF, USA) to limit the number of applications (Mervis, 2014a). The
Distributed Peer Review (DPR) concept is simple; in submitting a proposal the PI agrees to review n proposals submitted by peers, and to have her/his proposal/s reviewed by n peers. Also, if s/he submits m proposals, s/he accepts to review n × m proposals, hence essentially limiting the number of submissions through a self-regulating mechanism. Following this idea, the Gemini Observatory deployed the DPR for its Fast Turnaround channel (Andersen et al., 2019), which is capped at 10% of the total time. The NSF also explored this possibility with a pilot study in 2013, in which each PI was asked to review seven proposals submitted by peers (Ardabili & Liu, 2013; Mervis, 2014b). The NSF pilot was based on 131 applications submitted by volunteers within the Civil, Mechanical and Manufacturing Innovation Division, but the outcome is unknown as no report on the study was published. Interestingly, a similar pilot experiment was carried out in 2016 by the National Institute of Food and Agriculture²; in this case too the results were not published. Despite the general acceptance that followed the deployment of this channel at the Gemini Observatory, to the best of our knowledge the Fast Turnaround channel is the only example of DPR being employed by a large-scale astronomical facility.

In the specific case of ESO, the TAWG tasked to address these issues has produced a set of recommendations. The core aim is to reduce the number of applications per reviewer, which has been identified as an urgent action that ESO needs to take (Patat, 2018a). The deployment of DPR falls within the recommendations. As a first step, and after consulting the advisory bodies, ESO decided to run a test during the ESO Period 103 in parallel to the regular OPC review. The experiment was designed in line with the implementation at Gemini, enhancing the process by means of Natural Language Processing (NLP) and Machine Learning (a different method of using NLP for proposal reviews can be found in Strolger et al., 2017).

The DPR experiment was announced in the Call for Proposals for Period 103, released on 30 August 2018. A total of 172 PIs, representing 23% of all distinct PIs in that semester, volunteered to participate in the experiment. This implied that each would review eight proposals submitted by peers and have their proposal refereed by the same number of peers. The participants were given two weeks to complete their reviews and were informed that the outcome of the DPR would have no effect on the fate of their proposals. By the deadline (22 October 2018) 167 (97.1%) had completed their task. In a real implementation the five PIs who did not meet the deadline would have had their proposals automatically rejected. In this experiment, however, their proposals were kept in the sample, but the PIs did not receive the final feedback. Additionally, the participating PIs were asked to fill in a web-based questionnaire covering various aspects of the experiment. A total of 140 (83.8% of the DPR sample, 19% of the total PI sample of P103) returned the completed form.

The proposal distribution was performed using two channels, which we will call OPC Emulate (OE) and DeepThought (DT). In both cases the reviewers were assigned eight proposals each. For the OE channel, 60 volunteers were selected at random and assigned, on the basis of the category of the proposal each submitted, to the four scientific categories: A (Cosmology), B (Galaxy Structure and Evolution), C (Planets, Star Formation and Interstellar Medium) and D (Stellar Evolution). The underlying (and reasonable) assumption is that a scientist submitting a proposal for a given category is an expert in that same area. This emulates the case of the real OPC, in which a person only receives proposals within her/his area of expertise.

For the remaining 112 volunteers selected for the DT channel, the process was as follows. For each scientist, a knowledge vector was built based on their publications, which were downloaded from the public SAO/NASA Astrophysics Data System database (ADS) and processed by a machine learning algorithm (Kerzendorf, 2017). The same approach was used for the proposals and applied to their scientific rationale. The match between the referee expertise and the area covered by the proposal was then quantified through the “cosine distance”, which is directly related to the angle formed by the two hyper-vectors; a null cosine signals a complete mismatch (orthogonal knowledge vectors), while a unit cosine indicates a case of perfect match (parallel knowledge vectors). For the purposes of the statistical analysis, each DT referee received four proposals with the largest similarity, two proposals with median similarity, and two proposals with the lowest similarity.

The participants were not aware of the distribution mechanism just described. They were just provided with a simple web-based interface giving them access to the eight assigned proposals and allowing them to review, grade and comment on the applications. Before accessing the proposals, the referees were asked to sign a non-disclosure agreement, very similar to that signed by the OPC and Panel members.

During the review phase, the participants were also asked to declare any scientific/personal conflicts, while institutional conflicts were automatically taken into account by the distribution software, based on the affiliations recorded in the User Portal database. For each proposal, the referees had to fill in a comment (with a minimum length of 80 characters), and also provide a self-evaluation of their expertise level (high/medium/low) for each proposal assigned to them.

Once the review process was completed, the grades of the various referees were combined using a simple average (similar to the regular OPC process), and a final ranking list was compiled. The PIs were then provided with the quartile rank and the individual, unedited anonymous comments. Finally, they were asked to provide feedback on the experiment via a web-based form; this included a request to express the usefulness of each comment they received on their proposal.

General statistics and demographics

Although, in principle, each proposal should have been reviewed by eight scientists and each scientist should have reviewed eight proposals, because of the scientific/personal conflicts declared during the refereeing process (and to a much smaller extent because five participants did not complete the process), both these numbers were on average smaller than eight. The number of reviewers N_r ranged from 4 to 8, with an average of 7.3; in 95% of the cases the number was N_r ≥ 6. The number of proposals N_p varied from 5 to 8, with an average of 7.6, and N_p ≥ 6 in 98% of cases. The DPR produced a total of 4055 distinct grade pairs, to be compared with the maximum number of pairs 172 × 8 × 7/2 = 4816 (see below for more details) one would obtain in the case of no conflicts and no dropouts.

Figure 1. Scientific seniority distribution of the DPR sample (blue) and the OPC sample (orange). From Patat (2016). [Bar chart: fraction versus seniority class (No PhD yet, Less than 4 years, Between 4–12 years, More than 12 years); legend: Seniority (this work), Seniority (Patat 16).]
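The maximum pair count quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it assumes that a "grade pair" means two grades given to the same proposal by two different referees, which is our reading of the factor 8 × 7/2 and is not spelled out in this section. The function name `max_grade_pairs` is introduced here for illustration.

```python
from math import comb

def max_grade_pairs(n_proposals: int, reviews_per_proposal: int) -> int:
    """Upper limit on distinct grade pairs.

    Assumption: each proposal contributes one pair for every unordered
    couple of referees who graded it, i.e. C(reviews_per_proposal, 2).
    """
    return n_proposals * comb(reviews_per_proposal, 2)

maximum = max_grade_pairs(172, 8)   # 172 * (8 * 7 / 2) = 4816
observed = 4055                     # value quoted in the text
completeness = observed / maximum   # fraction of the ideal pair count reached

print(maximum)
print(f"{100 * completeness:.1f}% of the maximum number of pairs")
```

Under this reading, conflicts and dropouts removed roughly one pair in six from the ideal total.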

The F/M gender distribution of the DPR participants (32/68) and the scientific seniority distribution derived from the DPR questionnaire (see Figure 1) reflect the underlying PI population of ESO users (Patat, 2016). Since participation in the experiment was on a completely voluntary basis, we cannot exclude the presence of self-selection biases. For instance, one could argue that researchers who already had a positive opinion of the DPR concept would be more willing to participate than opponents, hence introducing systematics into the final analysis. On the other hand, if the community were strongly against the paradigm, one would expect a similar effect. In general, although we cannot guarantee that there are no specific attributes that lead the participants to self-selection, the demographics indicate that, if they exist, they are well hidden.

An important aspect regarding the demographics of the experiment concerns the fraction of junior scientists. Since, as a rule, the regular panel members serving on the OPC are required to have a minimum seniority level (typically starting with scientists at their second postdoc onward), this establishes a significant difference between the two pools of reviewers. In the case of the OPC, the distribution is heavily skewed towards senior members (88%), with a small fraction of postdocs (12%) and no students (Patat, 2016), while the postdoc and student reviewers reach about 18% in the case of the DPR sample (Figure 1).

Most DPR participants were relatively experienced in submitting proposals (Figure 2), although almost 60% of them had never served on a time allocation committee before (Figure 3). Although there are published studies that indicate reviewers who self-report higher levels of expertise tend to be less generous in assigning the top grades (Gallo et al., 2016), the differences seen between the grade distributions of senior and junior DPR participants are not statistically significant.

Figure 2. Distribution of the number of proposals submitted to ESO by the DPR participants. [Bar chart: fraction versus number of proposals (Fewer than 3 proposals, Between 3 and 10 proposals, More than 10 proposals).]

Referee-Proposal matching

In the regular OPC process, the panel members are recruited to cover the widest possible range of astrophysical areas. Each of the selected reviewers is asked to declare her/his expertise by providing sub-categories from the same list used by the applicants to categorise their proposal. While the PI is allowed to indicate one single proposal sub-category (within a given scientific category), the panel members are requested to identify three sub-categories, ranking them in order of expertise. This information is then used to compose review panels in such a way that the expertise coverage within each of them is as broad as possible. This is required by any schema in which physical panels exist, which is in turn a constraint stemming from the fact that the panels have to meet face-to-face and discuss the same set of proposals. This introduces a certain rigidity, which is also related to the relatively small number of available reviewers.

Since DPR has the advantage of involving a much larger number of reviewers, it allows a significantly more flexible and
more objective approach in which, for each proposal, an ad hoc, optimised panel can be formed. A key ingredient in this approach is the proposal-referee matching, which should work without the need for human supervision, especially when the turnaround has to be fast.

Figure 3. Distribution of expertise in serving on Time Allocation Committees (TAC) for the DPR participants. [Bar chart: fraction versus TAC experience (Never served on TAC, Served once on TAC, Served multiple times on TAC).]
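As an illustration of how such unsupervised matching might work, the sketch below computes the cosine between a referee knowledge vector and each proposal vector, then picks four best-matching, two median and two worst-matching proposals, following the 4/2/2 split described for the DT channel. It is not the DeepThought code: the vectors are random stand-ins, the names `cosine_similarity` and `select_proposals` are hypothetical, and conflict handling is left out.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0 = orthogonal, 1 = parallel).

    Assumes neither vector is all zeros.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_proposals(referee_vec, proposal_vecs, n_top=4, n_mid=2, n_low=2):
    """Return indices of the best, median and worst matches for one referee."""
    sims = np.array([cosine_similarity(referee_vec, p) for p in proposal_vecs])
    order = np.argsort(sims)[::-1]              # most similar first
    mid_start = (len(order) - n_mid) // 2       # centre of the ranked list
    if mid_start < n_top or mid_start + n_mid > len(order) - n_low:
        raise ValueError("too few proposals to form three disjoint groups")
    top = order[:n_top].tolist()
    mid = order[mid_start:mid_start + n_mid].tolist()
    low = order[len(order) - n_low:].tolist()
    return top, mid, low, sims

# Hypothetical data: 20 proposals and one referee in a 6-dimensional topic space.
rng = np.random.default_rng(0)
referee = rng.random(6)
proposals = rng.random((20, 6))

top, mid, low, sims = select_proposals(referee, proposals)
print("best matches:  ", top)
print("median matches:", mid)
print("worst matches: ", low)
```

A production system would also need to exclude conflicted referee-proposal pairs and to balance the number of reviews per proposal, neither of which is attempted here.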

For this purpose, the DT algorithm used in the DPR experiment was designed to predict what we call domain expertise, which in this context can be considered to be the objective ability of a given scientist to review a given proposal. Before we discuss its reliability, we examine how referees assessed their own ability to review each proposal assigned to them. As anticipated in the introduction, during the refereeing process each participant was asked to express their self-perceived expertise level for each of the assigned proposals, resulting in about 1200 evaluations. The distribution of participants’ self-evaluated ability to review the assigned proposals is presented in Figure 4, where we have used different colours for the different classes of scientific seniority. As expected, junior scientists tend to perceive themselves as experts less often than senior scientists do. Also, they often indicate that they have limited knowledge of a given field. We take this as an indication that the self-evaluated ability of a referee to review the assigned proposals is a useful proxy of the more objective (albeit more abstract) concept of domain knowledge.

Figure 4. Distribution of self-reported domain knowledge for the different scientific seniority of the DPR participants. [Bar chart: fraction versus self-reported level (Expert, General knowledge, No knowledge); legend: Negative, no PhD yet; Less than 4 years; Between 4 and 12 years; More than 12 years.]

The data collected in the DPR experiment enable an additional analysis of a possible gender dependence of the above self-evaluation. Such a dependence has been reported, for instance, by Huang (2013), who concluded that females tend to under-predict their performance in certain STEM fields. Our data suggest that, at least for postgraduates in the domain of astrophysics, there is no statistically significant gender difference.

Since the DT is designed to predict the expertise of a referee with respect to a given proposal, the first question one should ask is how reliable the algorithm is. Obviously, there is no absolute reference; the DT is one possible objective estimate of this quality. Therefore, as a first exploratory test, one can check the DT results against the self-evaluation of expertise, which can be considered as a reasonable first approximation to the underlying domain knowledge. From a statistical point of view, this is equivalent to computing the Bayesian conditional probability P (self-reported | DT) of having a certain self-reported expertise level, given the DT-inferred level. In simpler words, one checks how the self-reported and DT-inferred levels correlate. The result is presented in Figure 5, which shows an encouragingly high correlation. For instance, the probability that the DT rates as the worst a match which the referee believes is the best is less than 1%. At the other extreme, it is very likely (78%) that if the DT estimates the match is poor, the referee is of the same opinion. The agreement on the best matches is at the level of 50%, while for 81% of the best DT matches, the referees perceive them to be in the top or intermediate classes. As shown in Figure 5, the correlation in the intermediate cases becomes fuzzier. With the available data it is impossible to tell which of the two estimators is responsible for the observed noise. If on the one hand we can argue that the DT approach has obvious limitations (which is certainly true), on the other hand the self-reported levels are affected by a significant level of uncertainty, as they are related to subjective perceptions rather than to objective criteria.

Another aspect is the importance of proper proposal-referee matching. Our direct experience, accumulated over many years of managing the review process at ESO, shows that, in addition to the obvious problem related to excessively large numbers of proposals, panel members report a general uneasiness when dealing with proposals in areas in which they feel they are not experts. For a more quantitative assessment, DPR participants were asked to express their level of confidence, using a four-point scale, when asked to evaluate those cases; the corresponding distribution is presented in Figure 6. In about 60% of the cases, the reviewers were not comfortable with this situation. This implies that better matching of expertise gives the reviewers a better experience, an aspect which should not be underestimated.

Figure 5. Conditional probability for the various combinations of self-reported and DT-inferred knowledge level.
[Panel title: P (helpful comment | DeepThought). Rows: DeepThought inferred knowledge. Columns: Review evaluation, from 1 (Not helpful) to 4 (Very helpful).]

DT-inferred    Review evaluation
knowledge      1       2       3       4
best           0.14    0.24    0.39    0.24
median         0.29    0.28    0.27    0.16
worst          0.28    0.31    0.27    0.13
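
As a rough illustration of the kind of check presented in Figure 5, a conditional probability matrix can be estimated by counting the paired levels and normalising each row. The sketch below is not the code used for the DPR analysis: the function name `conditional_matrix`, the integer coding of the levels and the toy data are assumptions introduced here purely for illustration.

```python
import numpy as np


def conditional_matrix(given, outcome, n_given, n_outcome):
    """Estimate P(outcome = j | given = i) from paired integer codes.

    `given` and `outcome` are equal-length sequences of category indices,
    e.g. the DT-inferred level and the self-reported level of each
    referee-proposal match (hypothetical coding: 0 = worst ... 2 = best).
    """
    given = np.asarray(given)
    outcome = np.asarray(outcome)
    counts = np.zeros((n_given, n_outcome))
    # Accumulate one count per (given, outcome) pair, repeats included.
    np.add.at(counts, (given, outcome), 1)
    totals = counts.sum(axis=1, keepdims=True)
    # Normalise each row; rows with no data are left at zero.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)


# Toy data, invented for this example (not DPR measurements).
dt_level = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
self_level = [0, 0, 0, 1, 1, 2, 2, 2, 2, 1]
p_self_given_dt = conditional_matrix(dt_level, self_level, 3, 3)
print(p_self_given_dt)
```

Each populated row of the result sums to 1, since it is the distribution of one quantity given a fixed level of the other. The same row normalisation would apply to a rectangular case, for instance three DT-inferred levels against a four-point helpfulness scale (`n_given=3`, `n_outcome=4`).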
Feedback quality
In the classical review concept, the feedback provided by the panel to the PI is supposed to reflect the consensus opinion. This paradigm has at least two obvious limitations: (a) proposals that are triaged out (i.e., the bottom ~30%) are not discussed, and the feedback is based on the opinion of the primary referee; (b) for proposals that are discussed during the face-to-face meeting the primary referee tries to capture the main points of the discussion and produces a single comment. There is simply not enough time for the panel members to review all the feedback and to make sure it reflects all the aspects of the discussion. In the current implementation at ESO, the comments are formally supervised by panel chairs, who are responsible for the integrity of the feedback (particularly as it relates to the language used). The net effect, possibly coupled with a sub-optimal matching between proposal and referee, is a high level of dissatisfaction in the community, which is consistently reported by the Users Committee; the dissatisfaction reported is about 30% for all of ESO and exceeds 50% for ALMA³.

Figure 6. Distribution of the answers to the question: "How satisfactorily were you able to evaluate the proposals for which you were not an expert?".
[Plot: y-axis "Fraction"; x-axis categories "Not satisfactory; I might have unintentionally provided an unfair evaluation", "Somewhat; I struggled and might not always have been able to evaluate", "Mostly; I sometimes missed the expertise but was still able to evaluate", "Fully; I could evaluate well and fairly as a non-expert".]

Since the TAWG recommended the use of DPR for a FTC, no attempt was made to produce consensus feedback and/or to edit/check individual comments, which were distributed to the PIs in their original form. The purpose of this implementation was two-fold: (a) to get feedback on the concept itself, and (b) to detect possible problems (for example, inappropriate language) generated by the unedited/unfiltered text.

The participants were asked to rate each of the comments they received for their proposal, based on its helpfulness. It is important to stress that they were not asked whether the comments were good or bad, or whether they liked them or not, but whether they were useful for improving the quality of their proposal. The general response was very satisfactory, as shown in Figure 7, with more than 60% of the comments judged as being useful, and about 5% not useful. One of the questions also concerned the comparison with the edited OPC comments received by the PIs in previous semesters (99% of the sub-sample that responded). In about 40% of the cases the DPR was reported to have provided better comments, while the fraction of comments with quality similar to, or better than the OPC reaches about 85%.

The analysis of comment helpfulness as a function of the reviewer's expertise (either self-reported or DT-inferred) shows that the dependence is mild in the central regions; the experts very rarely gave unhelpful comments and, conversely, non-experts rarely gave very helpful comments. A similar analysis as a function of the reviewer's scientific seniority reveals a flat distribution (within the noise), with one remarkable exception: graduate students seem to be unable to provide very useful comments. This may signal a training issue, which can probably be addressed by exposing the students to schemes like the DPR. Finally, no statistically significant difference is seen between the helpfulness of comments written by female and male referees.

Figure 7. Distribution of the "helpfulness" ratings of the referee comments for the entire DPR sample.
[Plot: y-axis "Fraction"; x-axis categories "Not useful; the comments will not help me to improve my proposed project", "Somewhat; some comments might help me to strengthen my proposed project", "Mostly; several comments will help me to strengthen my proposed project", "Fully; overall the comments will allow me to improve my proposed project".]

A brief primer on subjectivity

Before we proceed with the comparison between the final OPC and DPR outcomes, a digression on the subjectivity inherent in the process is necessary. Although it is common knowledge that two different panels reviewing the same set of proposals would provide different rankings (and this is often used to compare time allocation committees to roulette), quantitative statements are very rare. This matter is addressed in great detail in an extensive study based on about 15 000 ESO proposals (Patat, 2018b; hereafter P18). The interested reader is referred to the paper for a thorough discussion, while here we will focus only on the concepts relevant to the present discussion.

One way of quantitatively describing the reproducibility of a review process is the correlation between the grades attributed to the same set of applications by two distinct bodies. These bodies can be composed of a single individual or of several members. We will be talking about referee–referee (r–r) and panel–panel (p–p) correlations. In the first instance, one simply considers all the distinct grade pairs attributed by referee #1 and referee #2 to the same set of proposals, placing them in a diagram in which the grades are used as coordinates, so that each single grade pair is represented by a point. One can then repeat the process for all possible referee pairs, plotting all the corresponding points on the r–r plane. Since the same proposal is graded by many reviewers, each single proposal is represented on the r–r plane by a cloud of points.

Figure 8. Pre-meeting OPC referee–referee correlation. In this density diagram each point represents a pair of grades attributed to the same proposal by two distinct referees. The data are from the P18 sample.
[Plot: x-axis "Referee grade #1" and y-axis "Referee grade #2", both from 1.0 to 4.0; colour bar from 60 to 480; annotations "N(data) = 196153" and "Corr. coeff. = 0.21".]

Under the simplifying assumption that each proposal is seen by N_r referees, the number of distinct grade pairs n_p for each proposal is n_p = N_r (N_r – 1)/2. For instance, in the case of the DPR experiment, with typically N_r = 7, the above combinatorics formula yields 21 distinct pairs per proposal. Of course, the same operation can be repeated for all N_p proposals in the sample, which will populate the diagram with N_p = 172 clouds of points. In the case of the DPR experiment, this would yield 172 × 21 = 3612 points. In an ideal situation, all the clouds would be very small in size (meaning that all referees would provide very similar grades for the given proposal), and so the points would
be distributed very close to the straight-         is 1, while a null value would signal a        Comparing the OPC and DPR
line y = x on the r–r plane.                       complete disagreement. The average             outcomes
                                                   agreement is expected to be 0.25 in case
To illustrate what one is to expect in real        a fully stochastic process, i.e., when there   The first test we apply to the DPR data
life, we have constructed the r–r plane for        is no correlation between the two bodies.      concerns the subjectivity level character-
the pre-meeting OPC P18 sample, from               The concept can be extended to all             ising the typical participant. For this pur-
which we derived almost 200 000 grade              ­quartiles, including cross-quartile values,   pose, we have computed the average
pairs accumulated over 16 ESO cycles.               and the quartile agreement matrix (QAM)       r–r QAM that we introduced in the previ-
The resulting diagram is presented in Fig-          can be constructed. In statistical terms,     ous section. Because of the DPR setup,
ure 8. It is important to note that for a           the generic element Mij of the QAM is the     the ranking list for each referee includes
perfectly stochastic process, the points           conditional probability that a proposal        at most eight proposals, so each quartile
would be distributed within a circular area,       ranked in the i-th quartile by referee #1 is   contains no more than two proposals.
with some radial, typically Gaussian, dis-         ranked in the j-th quartile by referee #2.     Also, at variance with the classical panel
tribution. The fact that the real d­ istribution                                                  scheme, the number of proposals in
is elongated along the diagonal direction          The application of this concept to the         common between two reviewers is typi-
signals that the process is not aleatory.          P18 pre-meeting sample shows that, on          cally very small. As a direct comparison
This qualitative conclusion can be made            average, the ranking lists produced by         between ranks is not possible, we use
more quantitative by computing the Pear-           two distinct referees have about 33% of        a bootstrap approach. Very briefly, for
son linear correlation coefficient, which          the proposals in common in their first and     each of the 172 proposals we randomly
ranges from –1 (complete anti-correla-             last quartiles. In the central quartiles the   extract one grade pair and form two
tion) to 1 (complete correlation) and is null      intersection is compatible with a purely       ranking lists, which are used to compute
for complete uncor­relation. The value             random selection (25%). This extends to        the quartile agreement fractions. The
derived for the sample is 0.21. Given the          the mixed cases (i ≠ j ), with the exception   ­process is repeated a large number of
very large number of points, this is a very        of the extreme quartiles; the fraction          times and the average values and stand-
robust estimate which can be reliably              of proposals ranked in the first quartile by    ard deviations are derived for each of the
taken as a low correlation. For the same           referee #1 and in the fourth quartile           QAM elements. The result is presented
reason, however, this value reveals that           by referee #2 is ∼ 17%, which deviates in       in Table 1. A direct comparison with
there is a statistically significant signal        a statistically significant way from the        the values derived from the P18 sample
indicating that the process is not com-            ­random value. As in the case of the r–r        reveals that the two results are statisti-
pletely aleatory. If on the one hand this           correlation introduced above, the r–r          cally indistinguishable. No meaningful
may sound discouraging, it helps to put things in the correct context, as it characterises the subjectivity of the process in a more quantitative and objective way, as opposed to the common statements which are normally based on pure anecdotal evidence.

A different way of measuring the repeatability of the process, which we will use extensively in the next section, is the quartile agreement fraction (P18). The concept is as follows. When the same set of proposals is reviewed by two different bodies #1 and #2, one can compile the rankings for the two distinct reviews based on their distinct grades. The rankings are then used to derive a merit classification within the classical quartile scheme. For instance, the top 25% of proposals are ranked in the first quartile of the distribution of grades.

Once this is done, one can compute the fraction of applications ranked in the first quartile by review #1 which are also graded in the same quartile by review #2. For a complete agreement the fraction

agreement fraction gives a quantitative estimate of the high level of subjectivity that characterises the process, providing a precise indication of what one should expect.

The reason why the applications are usually evaluated by more than one reviewer is to reduce the inherent “noise” which, as we have just seen, is quite substantial. For this purpose, the grades attributed by different referees to the same proposal (typically grouped in panels) are aggregated to form one single figure of merit. In the ESO implementation (and this is a common recipe), this is achieved simply by taking the average, with no weights and no outlier rejection. The effect of increasing the number of reviews is discussed at length in P18; here it suffices to say that for Nr = 3 the agreement fraction grows to 43% and 30% in the first and second quartiles, respectively.

Armed with these terms of reference we can now discuss the results of the DPR experiment.

difference is seen in the QAMs computed for the OE and DT sub-samples.

In a further test, we have investigated the possible dependence on the scientific seniority level introduced above. Of the 167 reviewers, 136 provided this information, which we used to sub-divide the reviewers into two classes: junior (groups 0 and 1) and senior (groups 2 and 3). These classes roughly correspond to PhD students plus junior postdocs (37), and advanced postdocs plus senior scientists (99), respectively. We then computed the r–r QAM for the two classes; the first quartile terms are 0.22 and 0.32, respectively. At face value this indicates a larger agreement between senior reviewers. However, the small size of the

Table 1. Bootstrapped r–r Quartile Agreement Matrix for the DPR experiment.

Referee #1    Referee #2 quartile
quartile      1       2       3       4
1             0.33    0.26    0.24    0.18
2             0.26    0.26    0.25    0.23
3             0.24    0.25    0.25    0.26
4             0.18    0.23    0.26    0.34
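A quartile agreement matrix such as the one in Table 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the convention that a lower grade is better, the tie-breaking by input order, and the synthetic grades are all assumptions made here. Each bootstrap iteration draws two non-overlapping sets of `n_r` grades per proposal, averages each set, ranks the proposals twice and tabulates how the quartiles of the two rankings overlap.

```python
import random
from statistics import mean, pstdev


def quartiles(scores):
    """Quartile index (0 = top) of each score. Lower score = better (assumption)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # ties keep input order
    q = [0] * n
    for rank, idx in enumerate(order):
        q[idx] = 4 * rank // n
    return q


def qam(scores_a, scores_b):
    """Element [i][j]: fraction of proposals in quartile i of review A
    that fall in quartile j of review B."""
    counts = [[0] * 4 for _ in range(4)]
    for a, b in zip(quartiles(scores_a), quartiles(scores_b)):
        counts[a][b] += 1
    return [[c / max(1, sum(row)) for c in row] for row in counts]


def bootstrap_qam(grades, n_r=1, n_iter=500, seed=0):
    """grades: one list of grades per proposal (hypothetical input layout).
    Returns the element-wise mean and standard deviation of the QAM over
    n_iter random draws of two disjoint subsets of n_r grades per proposal."""
    rng = random.Random(seed)
    usable = [g for g in grades if len(g) >= 2 * n_r]  # need 2 * n_r grades
    runs = []
    for _ in range(n_iter):
        a, b = [], []
        for g in usable:
            pick = rng.sample(g, 2 * n_r)   # sampling without replacement
            a.append(mean(pick[:n_r]))      # first subset -> review A
            b.append(mean(pick[n_r:]))      # second subset -> review B
        runs.append(qam(a, b))
    avg = [[mean(r[i][j] for r in runs) for j in range(4)] for i in range(4)]
    std = [[pstdev([r[i][j] for r in runs]) for j in range(4)] for i in range(4)]
    return avg, std


# Synthetic demonstration: 40 made-up proposals with 7 noisy grades each.
demo_rng = random.Random(1)
true_merit = [demo_rng.uniform(1.0, 5.0) for _ in range(40)]
grades = [[demo_rng.gauss(m, 0.7) for _ in range(7)] for m in true_merit]
avg_rr, std_rr = bootstrap_qam(grades, n_r=1, n_iter=200)
for row in avg_rr:
    print(" ".join(f"{x:.2f}" for x in row))
```

With `n_r=1` the sketch compares single referees (the r–r case). Larger values of `n_r` average several grades before ranking, and proposals with fewer than `2 * n_r` grades are skipped. The numbers printed for the synthetic data are illustrative only and are not expected to reproduce Table 1.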


junior class produces a significant scatter, so the difference may not be significant.

One can extend the above bootstrapping procedure to subsets with a number of referees Nr > 1. The case of Nr = 3 is particularly interesting as this is directly comparable to the results presented in P18. The procedure is as follows: we first make a selection of the proposals having at least 6 reviews (164); for each of these we randomly select two distinct (i.e., non-intersecting) subsets of Nr = 3 grades each, from which two average grades are derived; the subsequent steps are identical to the r–r procedure, and lead to what we will call the p–p QAM.

The first-quartile agreement turns out to be 41%, while for the second and third quartiles this is 30%. The top-bottom quartile agreement is 10%. These values are very similar to those presented in P18 for the OPC process for Nr = 3 sub-panels. As for the r–r case, the OE and DT sub-samples yield statistically indistinguishable values. The conclusion is that, in terms of self-consistency, the DPR review behaves in the same way as the pre-meeting OPC process.

We now come to what is perhaps one of the most interesting aspects. As anticipated, the proposals used in the DPR experiment were also subject to the regular OPC review. This enables the comparison between the outcomes of the two selections, with the caveats outlined above about their inherent differences.

For a first test we used a bootstrap procedure in which, for each proposal included in the DPR, we randomly extracted one evaluation from the DPR (typically one out of 7) and one from the OPC (one out of 3), forming two ranking lists from which an r–r QAM was computed. The operation was repeated a large number of times and the average and standard deviation matrices were constructed. This approach provides a direct indication of the DPR–OPC agreement at the r–r level and overcomes the problem that the two reviews have a different number of evaluations per proposal (see below). The result is presented in Table 2. The typical standard deviation of single realisations from the average is 0.06.

Table 2. Average DPR–OPC (pre-meeting) r–r Quartile Agreement Matrix.

DPR referee   OPC referee quartile
quartile      1       2       3       4
1             0.31    0.26    0.24    0.18
2             0.24    0.27    0.25    0.24
3             0.24    0.23    0.26    0.26
4             0.20    0.23    0.25    0.31

This matrix is very similar to that derived within the DPR reviews (see Table 1), possibly indicating a DPR–OPC r–r agreement slightly lower than the corresponding DPR–DPR. A check performed on the two sub-samples for the junior and senior DPR reviewers (according to the classification described above) has given statistically indistinguishable results.

As explained in the introduction, the proposals were reviewed by Nr = 3 OPC referees in the pre-meeting phase. This constitutes a significant difference, in that the DPR ranking is typically based on ~ 7 grades, whereas the pre-meeting OPC ranking rests on 3 grades only. With this caveat in mind, one can nevertheless compute the QAM for the two overall ranking lists. The result is presented in Table 3. At face value, about 37% of the proposals ranked in the 1st quartile by the DPR were ranked in the same quartile by the OPC, with a similar fraction for the bottom quartile. When looking at these values, one needs to consider that this is only one realisation, which is affected by large scatter, as can be deduced from the comparatively large fluctuations in the QAM. These are evident when compared to, for instance, the average values obtained from the bootstrapping procedures described above. The numerical simulations show that the standard deviation of a single realisation is ~ 0.1.

Table 3. DPR–OPC (pre-meeting) p–p Quartile Agreement Matrix.

DPR           OPC (pre-meeting) quartile
quartile      1       2       3       4
1             0.37    0.26    0.28    0.09
2             0.28    0.16    0.28    0.28
3             0.16    0.40    0.19    0.26
4             0.19    0.19    0.26    0.37

Using the model presented in P18, one can predict that, on average, the top and bottom quartile agreement between the DPR and the pre-meeting OPC should be around 0.5 (see Kerzendorf, 2019 for more detail). The observed value (0.37) differs at the 1.3-σ level from the average value. For the central quartiles the difference is at the ~ 1.5-σ level. Therefore, although lower than expected on average, the observed DPR–OPC agreement is statistically consistent with that expected from the statistical description of the pre-meeting OPC process (P18). Note that, given the large noise inherent in the process, a much larger data set (or more realisations of the experiment) would be required to reach a sufficiently high statistical significance and to make robust claims about possible systematic deviations.

The fact that in the real OPC process there is a face-to-face meeting constitutes the most pronounced difference between the two review schemes. In the meeting, the opinions of single reviewers are changed during the discussion, so that grades assigned by individual referees are not completely independent from each other (as opposed to the pre-meeting phase, in which any significant correlation should depend only on the intrinsic merits of the proposal). The effects of the meeting can be quantified in terms of the quartile agreement fractions between the pre- and post-meeting outcomes, as outlined in Patat (in preparation; hereafter called P19). Based on the P18 sample, P19 concludes that the change is significant; on average, only 75% of the proposals ranked in the top quartile before the meeting remain in the top quartile after the discussion (about 20% are demoted to the second quartile, and 5% to the third quartile). P19 characterises this effect by introducing the Quartile Migration Matrix (QMM). For the specific case of Period 103, the QMM is reported in Table 4 for the subset of the DPR experiment. Of the initial 172 proposals included in the DPR sample, 36 were triaged out in the OPC process and are therefore not considered.

As anticipated, the effect is very marked; the meeting does have a strong effect on the final outcome. In light of these facts, we can finally inspect the QAM between the DPR and the final outcome of the OPC process. This is presented in Table 5. With the only possible exception of M4,4, which indicates a relatively

marked agreement for the proposals in the bottom quartile, the two reviews appear to be almost completely uncorrelated. By means of simple Monte-Carlo calculations one can show that for two fully aleatory panels, the standard deviation of a single realisation around the average value (0.25) is 0.10. We conclude that the majority of the Mi,j elements in Table 5 are consistent with a stochastic process at the 1-σ level.

Table 4. OPC Quartile Migration Matrix for the DPR sub-sample (N = 136).

OPC pre-meeting   OPC post-meeting quartile
quartile          1       2       3       4
1                 0.56    0.32    0.12    0.00
2                 0.32    0.32    0.29    0.06
3                 0.12    0.26    0.38    0.24
4                 0.00    0.09    0.21    0.71

Table 5. DPR–OPC (post-meeting) Quartile Agreement Fraction.

DPR               OPC post-meeting quartile
quartile          1       2       3       4
1                 0.26    0.38    0.24    0.12
2                 0.24    0.35    0.24    0.18
3                 0.32    0.12    0.29    0.26
4                 0.19    0.15    0.24    0.44

The main conclusion of this analysis is that, while the pre-meeting agreement is consistent, with the DPR and OPC reviewers behaving in a very similar way (in terms of r–r and p–p agreements), the face-to-face meeting has the effect of significantly increasing the discrepancy between the two processes. However, we caution that the sample is relatively small, and therefore the results are significantly affected by noise.

That the DPR–OPC agreement is smaller than the internal DPR–DPR agreement is not unexpected, as there are intrinsic differences between the two setups, the largest one being the absence of a face-to-face meeting, which is potentially the weakest aspect of the DPR. However, it remains unclear whether panel discussions lead to the selection of better science. In this respect, it is important to note that several studies have shown that panel meetings can increase the differences between two panels with respect to the pre-meeting agreement. In other words, while the meeting increases the internal consensus by polarising different opinions within the panels, it does not lead to a better panel-panel agreement (see Obrecht et al., 2007 and references therein). One would expect the discussions to bring judgment closer to identifying the best science; however, these studies indicate that a face-to-face meeting does not necessarily make the process better.

Conclusions and outlook

Gemini has already implemented a variant of this mechanism successfully over the past few years for their Fast Turnaround (Andersen et al., 2019). The approach presented here enhances this process, using better review-proposal matching based on natural language processing. The next logical step is to expand this experiment and distribute a fraction of observing time using DPR at more facilities. More than 95% of the participants suggest an implementation of such a scheme for some part of the ESO proposal types, with 75% support for the short programmes (time requests < 20 hours). Fewer than 5% of the responses were against implementing DPR for any of the programme types. In particular, about 70% of the responses are in favour of deploying DPR for the Fast Track Channel, while only about 15% are against it (the remaining 15% are indifferent). We take this as a clear indication of support.

Figure 9. Distribution of the answers to the question: “For which types of proposals do you think distributed peer review would be beneficial?” in the DPR survey. [Bar chart; vertical axis: Fraction; answer options: Short; Regular; Large; Short, regular; Regular, large; Short, regular, large; None.]

One of the objections that is typically made to the DPR concept is that, by distributing the proposals to a larger number of unselected scientists, it increases the chances of information leakage and plagiarism. In the specific case of the DPR experiment, the proposals were distributed to 172 reviewers, while in the OPC process the applications were seen by 78 individuals. However, while in the OPC implementation each reviewer has access to all proposals assigned within her/his panel (typically 70–80), the DPR reviewer sees a factor of ~ 10 fewer proposals. Therefore, under the reasonable hypothesis that the fraction of “malevolent” scientists is the same in both review bodies (which are selected from the same community), one would actually expect that the DPR is less prone to confidentiality issues on average. To get a direct opinion from DPR participants, the questionnaire contained an explicit question about this aspect. The distribution of the responses is shown in Figure 10. Excluding the “no strong opinion” cases, 66% of the users declared themselves to be equally or more confident in the DPR process, resulting in about a third of the users placing more trust in the classical scheme.

Another concern that is often heard when discussing DPR is the possible presence of biases. Again, the specific question put to the participants regarding this point does not support this concern; 74% of the respondents believe DPR is equally or more robust against biases (Figure 11).
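The confidentiality argument made above, that DPR involves more reviewers but each sees far fewer proposals, can be made concrete with a rough count of reviewer-proposal pairs. This is only an illustrative sketch: the metric itself, the midpoint of 75 proposals per OPC panel member, the value of 7.5 proposals per DPR reviewer (75 divided by the quoted factor of ~ 10) and the 1% malevolent fraction are assumptions introduced here, not figures from the experiment.

```python
def proposal_views(n_reviewers, proposals_per_reviewer):
    """Number of (reviewer, proposal) pairs, i.e. how many times a proposal is read."""
    return n_reviewers * proposals_per_reviewer


def expected_exposures(views, malevolent_fraction):
    """Expected number of readings made by a malevolent reviewer,
    assuming the same fraction applies to every reviewer pool."""
    return views * malevolent_fraction


# Reviewer counts are quoted in the text; per-reviewer loads are assumed values.
opc_views = proposal_views(78, 75)     # 78 individuals, midpoint of 70-80 proposals
dpr_views = proposal_views(172, 7.5)   # 172 reviewers, ~10 times fewer proposals each

fraction = 0.01                        # hypothetical, identical in both schemes
print("OPC expected exposures:", expected_exposures(opc_views, fraction))
print("DPR expected exposures:", expected_exposures(dpr_views, fraction))
print("OPC / DPR ratio:", opc_views / dpr_views)
```

Because the malevolent fraction is taken to be the same in both pools, it cancels in the ratio, so the comparison depends only on the number of reviewers and the number of proposals each of them reads.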


Figure 10. Distribution of answers to a question about how secure the participants felt about confidentiality issues. [Bar chart; vertical axis: Fraction; answer options: “I am less concerned about confidentiality issues in the DPR process than in the OPC process”; “I am neither more nor less concerned about confidentiality issues in the DPR process than in the OPC process”; “I am more concerned about confidentiality issues in the DPR process than in the OPC process”; “I have no strong opinion on this point”.]

Figure 11. Distribution of answers to a question about the robustness of the process against biases. [Bar chart; vertical axis: Fraction; answer options: “I think that DPR review process is more robust against biases than the OPC review process”; “I think that DPR review process is as robust against biases as the OPC review process”; “I think that DPR review process is less robust against biases than the OPC review process”; “I have no strong opinion on this point”.]

gives an objective criterion to assign a particular expertise, eliminating biases in self-reporting. DPR implicitly removes the concept of panel, which adds rigidity to the process. For instance, it maximises the overlap in evaluations, which is a typical issue in pre-allocated panels. The lack of a face-to-face meeting prevents strong personal opinions from having a pivotal influence on the process. Also, DPR involves a larger part of the community, increasing its democratic breadth and exposing all applicants to the typical quality of the proposals. This allows them to better understand if their request is not allocated time by placing it in a wider context, which will help to improve their proposal-writing skills, training the members of the community without additional effort.

We acknowledge that the lack of a meeting does not allow the exchange of opinions and the possibility of asking and answering questions to/from the peers. Despite the fact that its effectiveness remains to be demonstrated and quantified (see above), it is clear that the social, educational and networking aspects of the face-to-face meeting should not be undervalued. In this respect, we note that the resources freed by the DPR approach can be used by the organisations for education and community networking (training on proposal writing, fostering collaborations, etc.).

In April and May 2019, results of the DPR experiment were presented to the ESO governing bodies most closely concerned with the Peer Review process
 The main conclusions drawn from the                                                                    To these aspects, which come directly                               (i.e., the Scientific Technical Committee,
 DPR experiment can be summarised as                                                                    from the data, other positive facts can                             the Users Committee and the Observing
 follows:                                                                                               be added. DPR allows a much larger sta-                             Programmes Committee). The ensuing
 – The DeepThought-enhanced DPR                                                                        tistical basis enabling robust outlier rejec-                       discussions have resulted in a wealth of
    experiment was very well received by                                                                tion (the number of proposals per referee                           useful feedback that is being discussed
    the participants.                                                                                   can be easily brought to 10–12) and it                              internally. We would like to conclude
 – The mechanism allows an optimal                                                                     removes possible biases generated by                                by pointing out that these kinds of stud-
    referee-proposal matching.                                                                          panel member nominations. The larger                                ies are crucial if we are to progress from
 – The DPR process is as subjective as                                                                 pool of scientists allows much better                               a situation in which the classical peer
    the OPC process.                                                                                    ­coverage in terms of proposal expertise                            review process is adopted notwithstand-
 – The participants do not see the confi-                                                               matching, and the smaller number of                                ing its limitations simply due to the lack of
    dentiality and bias issues as being more                                                             ­proposals per reviewer allows more care-                          better alternatives. As scientists, we firmly
    severe than in the classical scheme.                                                                  ful work and more useful feedback.                                believe in experiments, including those
 – ESO should consider deploying DPR for                                                                                                                                   that address the selection of the experi-
    regular proposals below a certain                                                                   Another aspect of the DeepThought                                   ments themselves.
    time request, while leaving the classical                                                           approach to proposal-referee matching is
    review for larger time requests.                                                                    that it can be semi-automated; it also
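The robust outlier rejection mentioned among the additional benefits of DPR can be illustrated with a minimal sketch. The article does not state which rejection scheme is meant, so the median and median absolute deviation (MAD) clipping used here, the function name `robust_mean`, the 3-sigma-equivalent threshold and the example grades are all assumptions made purely for illustration.

```python
import numpy as np

def robust_mean(grades, n_sigma=3.0):
    """Mean grade after rejecting values far from the median (MAD-based clipping).

    Hypothetical helper: the clipping rule and threshold are assumptions,
    not the scheme used in the DPR experiment.
    """
    grades = np.asarray(grades, dtype=float)
    median = np.median(grades)
    mad = np.median(np.abs(grades - median))
    scale = 1.4826 * mad  # MAD scaled to match a Gaussian standard deviation
    if scale == 0:
        # No usable spread estimate: keep every grade rather than guess.
        keep = np.ones(grades.size, dtype=bool)
    else:
        keep = np.abs(grades - median) <= n_sigma * scale
    return grades[keep].mean(), keep

# Ten hypothetical grades for one proposal; the seventh is discrepant.
grades = [2.1, 2.4, 1.9, 2.2, 2.0, 2.3, 4.8, 2.1, 2.2, 1.8]
mean_grade, kept = robust_mean(grades)
print(f"plain mean = {np.mean(grades):.2f}, robust mean = {mean_grade:.2f}")
print("rejected grades:", [g for g, k in zip(grades, kept) if not k])
```

With only three or four grades per proposal the MAD is a poor estimate of the spread, which is why a larger number of grades per proposal makes this kind of clipping practical.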

Acknowledgements

The authors wish to express their gratitude to the 167 volunteers who participated in the DPR experiment, for their work and enthusiasm. The authors are also grateful to Markus Kissler-Patig for passionately promoting the DPR experiment following his experience at Gemini; to ESO’s Director General Xavier Barcons and ESO’s Director for Science Rob Ivison for their support; and to Hinrich Schütze for several suggestions on the NLP process.

Links

¹ Gemini Observatory Fast Turnaround Observing Mode webpage: http://www.gemini.edu/sciops/observing-gemini/proposal-routes-and-observing-modes/fast-turnaround
² Distributed Peer Review Pilot in Foundational Program: https://nifa.usda.gov/resource/distributed-peer-review-pilot-foundational-program
³ Report from ESO Users Committee No. 42 (2018): https://www.eso.org/public/about-eso/committees/uc/uc-42nd/UCreport2018.pdf

References

Andersen, M. et al. 2019, AAS, 233, 455.03
Ardabili, P. N. & Liu, M. 2013, CoRR, arxiv:1307.6528
Brinks, E. et al. 2012, The Messenger, 150, 20
Gallo, S. A., Sullivan, J. H. & Glisson, S. R. 2016, PLoS ONE, 11, e0165147
Huang, C. 2013, European Journal of Psychology of Education, 28, 1
Kerzendorf, W. E. 2017, Journal of Astrophysics and Astronomy, arxiv:1705.05840
Kerzendorf, W. E. et al. 2019, submitted to Nature Astronomy
Merrifield, M. R. & Saari, D. G. 2009, Astronomy and Geophysics, 50, 4.16
Mervis, J. 2014a, Science, 344, 1328
Mervis, J. 2014b, Science, 345, 248
Obrecht, M., Tibelius, K. & D’Aloisio, G. 2007, Research Evaluation, 16 (2), 79
Patat, F. 2016, The Messenger, 165, 2
Patat, F. et al. 2017, The Messenger, 169, 5
Patat, F. 2018a, The Messenger, 173, 7
Patat, F. 2018b, PASP, 130, 084501
Strolger, L.-G. et al. 2017, AJ, 153, 181
ESO/G. Hüdepohl (atacamaphoto.com)

Snowfall at Paranal is a rare phenomenon that serves to utterly transform the surroundings of the VLT/I into an otherworldly landscape.

Telescopes and Instrumentation
DOI: 10.18727/0722-6691/5148

On the Telluric Correction of KMOS Spectra

Lodovico Coccato¹
Wolfram Freudling¹
Alain Smette¹
Eleonora Sani¹
Jose A. Escartin¹,²
Yves Jung¹
Gurvan Bazin¹

¹ ESO
² Max-Planck-Institut für extraterrestrische Physik, Garching, Germany

The presence of strong absorption lines in the atmospheric transmission spectrum affects spectroscopic observations, in particular those in the near- and mid-infrared. Scientific observations therefore need to be corrected for this effect, a process known as telluric correction. The use of a detailed model of the atmospheric transmission spectrum brings several advantages over the method of empirically deriving corrections using observations of a telluric standard star. In this paper, we discuss and compare the two methods applied to K-band Multi-Object Spectrograph (KMOS) observations and show the improvements in the quality of the final products obtained by implementing the modelling technique offered by the ESO molecfit sky tool.

Correction for atmospheric transmission in spectroscopic data

Ground-based spectroscopic observations are strongly affected by the Earth’s atmosphere. In particular, spectra of objects taken in the near- and mid-infrared wavelength ranges are characterised by a forest of absorption lines, called telluric absorptions. These features are caused by molecules (mainly water and OH) present in the atmosphere that absorb the light from astrophysical sources. The standard way to correct for this effect is to acquire a spectrum of a bright and featureless star close in time and airmass to the scientific target, and compare it either with its model or, if available, with a spectrum taken from space. This empirical strategy, however, has some drawbacks. First, it requires additional (expensive) telescope time. Second, it can be complicated to separate atmospheric and instrumental effects (for example, the instrument response) if a large wavelength range of stellar continuum is absorbed by blended absorption lines. Last but not least, the noise and imperfections in the data reduction of these stars are inevitably propagated to scientific spectra.

Alternatively, one can model the atmosphere, generate its transmission spectrum and apply it to observations. The model itself can be obtained by fitting well-defined telluric lines to the spectrum of either a standard star or a sufficiently bright science target. In general, a model depends on four components: (a) a radiative transfer model; (b) a set of parameters that determines the absorption and transmission properties of individual molecules; (c) atmospheric profiles of temperature, humidity, and volume mixing ratio for the molecules involved; and (d) instrumental parameters such as spectral resolution. This model-dependent approach has several advantages over the empirical method. First, no additional noise or sources of error coming from the standard star observations and reduction are propagated to the science spectra. Second, it allows additional components to be taken into account, such as the amount of precipitable water vapour from external sources and inaccurate wavelength calibrations, and differences between the observations of the standard star and the science target (for example, airmass and spectral resolution). On the other hand, using a model of the atmosphere for the telluric correction risks the introduction of systematics because of limitations in the modelling. In practice, the artefacts caused by such systematics are outweighed by the improvements made in the corrections.

The model approach has been developed in a software package named molecfit (Kausch et al., 2013; Smette et al., 2015). Molecfit uses (a) the Line-by-line Radiative Transfer Model¹ (LBLRTM) algorithm (Clough, Iacono & Moncet, 2005) to compute the radiative transfer model, (b) the high-resolution transmission molecular absorption (HITRAN) database² for the molecular parameters, (c) Global Data Assimilation System³ (GDAS) and ESA Michelson Interferometer for Passive Atmospheric Sounding⁴ (MIPAS) atmospheric profiles for temperature, humidity, water vapour and other molecules, and (d) analytic functions or user-provided files for the instrumental spectral resolution. The fit to the telluric absorption lines in the observed spectra provides the integrated column density of individual molecules. Future versions will further improve the quality of the model by including real-time measurement of precipitable water vapour and other molecules along the line of sight of the exposures.

In the following, we describe the improvements in the quality of KMOS (Sharples et al., 2013) spectra obtained with the model approach using molecfit with respect to the empirical method. Data were reduced using the KMOS pipeline (Davies et al., 2013). In the model approach, the atmospheric model was obtained by fitting a number of pre-defined telluric lines on a standard star spectrum observed close in time to the scientific data (i.e., the same standard star that was used in the empirical method). The telluric correction over the full wavelength range was then computed accounting for the differences in airmass and spectral resolution between the scientific spectrum to be corrected and the standard star. As a test-bench for comparison, we processed one month of KMOS data and compared the results obtained with these two different telluric correction strategies.

Benefits of the molecfit strategy for KMOS observations

As described previously, because the molecfit correction is based on a model, it does not add noise to the final products or defects such as uncorrected cosmic rays that are embedded in the standard star spectrum. Figure 1 shows a comparison between the mean signal-to-noise per pixel of the datacubes obtained by correcting the telluric absorption directly with a standard star (i.e., the empirical method) and by modelling the atmospheric absorptions with molecfit. The signal-to-noise is measured in a wavelength region that is free of sky or telluric lines, and therefore is an indication of the noise added by the telluric correction. As expected, the data corrected with

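The noise argument made on this page for the two telluric correction strategies can be illustrated with a toy simulation. The sketch below is not molecfit and does not use the KMOS pipeline: the single Gaussian absorption line, the wavelength window, the airmasses, the noise levels and the function names `transmission` and `snr` are all assumptions made for illustration. The rescaling between airmasses uses the simple relation T(X2) = T(X1)^(X2/X1), a simplification that stands in for a full recomputation of the radiative transfer model.

```python
import numpy as np

rng = np.random.default_rng(42)

def transmission(wave, airmass, depth=0.3, centre=2.05, width=0.002):
    """Toy telluric transmission: one Gaussian line in optical depth, scaled by airmass."""
    tau_zenith = depth * np.exp(-0.5 * ((wave - centre) / width) ** 2)
    return np.exp(-tau_zenith * airmass)

def snr(spectrum, mask):
    """Signal-to-noise estimated as mean over standard deviation in a line-free window."""
    return spectrum[mask].mean() / spectrum[mask].std()

wave = np.linspace(2.0, 2.1, 2000)       # micron, hypothetical K-band window
airmass_sci, airmass_std = 1.4, 1.1      # assumed airmasses of target and standard star
sigma_sci, sigma_std = 0.02, 0.02        # assumed per-pixel noise for a flat continuum of 1

# Both objects have a flat intrinsic continuum, so the observed spectra are
# the transmission at their airmass plus independent Gaussian noise.
science = transmission(wave, airmass_sci) + rng.normal(0, sigma_sci, wave.size)
standard = transmission(wave, airmass_std) + rng.normal(0, sigma_std, wave.size)

# Empirical method: divide by the observed standard star spectrum,
# so the noise of the standard star enters the corrected spectrum.
empirical = science / standard

# Model method: noise-free transmission at the standard-star airmass,
# rescaled to the science airmass before the division.
model_std = transmission(wave, airmass_std)
model_sci = model_std ** (airmass_sci / airmass_std)
modelled = science / model_sci

# Measure the signal-to-noise away from the absorption line.
line_free = np.abs(wave - 2.05) > 0.02
snr_empirical = snr(empirical, line_free)
snr_model = snr(modelled, line_free)
print(f"S/N empirical: {snr_empirical:.1f}, S/N model: {snr_model:.1f}")
```

With equal noise in the science and standard star spectra, the empirical division lowers the signal-to-noise by roughly a factor of √2 in this toy case, while the model correction leaves it unchanged. Real data also carry the systematics of an imperfect model, which this sketch does not attempt to represent.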