Graphical Abstract - A mechanistic model for the negative binomial distribution of single-cell mRNA counts

bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                             It is made available under a CC-BY-NC-ND 4.0 International license.




  Graphical Abstract

  A mechanistic model for the negative binomial distribution of single-cell mRNA counts

  Lisa Amrhein, Kumar Harsha, Christiane Fuchs




           Biological Process                                                                                   Data




                                                                                                               Frequency
                                                              Measurement                                                        #
                                                                                                               Frequency




                                                                                                                                #
                                                                                                                               Estimation &
             Simplification             Knowledge Transfer                                                    Prediction
                                                                                                                               Parameter Selection

               Mechanistic Model                                                                             Steady State Distribution
                                                                                                             Poisson distribution
                                                                                                              #            ~
                                                                     Link via Ornstein-Uhlenbeck processes




                                                                                                             Poisson-beta distribution


                                                                                                              #            ~




                                                                                                             Negative binomial distribution


                                                                                                              #            ~
bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                             It is made available under a CC-BY-NC-ND 4.0 International license.




      A mechanistic model for the negative binomial distribution of single-cell
                                  mRNA counts

                                 Lisa Amrheina,b , Kumar Harshaa,b , Christiane Fuchsa,b,c,d,∗
                   a Institute
                             of Computational Biology, Helmholtz Zentrum Munich, 85764 Neuherberg, Germany
                     b Department of Mathematics, Technical University of Munich, 85747 Garching, Germany
               c Faculty of Business Administration and Economics, Bielefeld University, 33615 Bielefeld, Germany
                                                         d Lead Contact




  Summary

  Several tools analyze the outcome of single-cell RNA-seq experiments, and they often assume a probability
  distribution for the observed sequencing counts. It is an open question of which is the most appropriate
  discrete distribution, not only in terms of model estimation, but also regarding interpretability, complexity
  and biological plausibility of inherent assumptions. To address the question of interpretability, we inves-
  tigate mechanistic transcription and degradation models underlying commonly used discrete probability
  distributions. Known bottom-up approaches infer steady-state probability distributions such as Poisson
  or Poisson-beta distributions from different underlying transcription-degradation models. By turning this
  procedure upside down, we show how to infer a corresponding biological model from a given probability
  distribution, here the negative binomial distribution. Realistic mechanistic models underlying this distribu-
  tional assumption are unknown so far. Our results indicate that the negative binomial distribution arises as
  steady-state distribution from a mechanistic model that produces mRNA molecules in bursts. We empirically
  show that it provides a convenient trade-off between computational complexity and biological simplicity.
  Keywords: gene expression, negative binomial distribution, Poisson-beta distribution, single-cell RNA
  sequencing, switching process, bursting process, stochastic differential equation, Ornstein-Uhlenbeck
  process


  Introduction                                                            the 23 listed tools, around 60% use a negative bi-
                                                                          nomial (NB) distribution, 40% a zero-inflated dis-
  When analyzing the outcomes of single-cell RNA                          tribution (these two cases can overlap) and about
  sequencing (scRNA-seq) experiments, it is essen-                        7% a Poisson-beta (PB) distribution.
  tial to appropriately take properties of the result-                    Count data is most appropriately described by dis-
  ing data into account. Many methods assume a                            crete distributions unless count numbers are with-
  parametric distribution for the sequencing counts                       out exception very high. A commonly chosen dis-
  due to its larger power than non-parametric ap-                         tribution is the Poisson distribution, which can
  proaches. To that end, a family of parametric dis-                      be derived from a simple birth-death model of
  tributions which adequately models the data needs                       mRNA transcription and degradation. However,
  to be chosen. In Supplementary Table S1, we pro-                        due to widespread overdispersed data, it is sel-
  vide an overview of computational tools for scRNA-                      dom suitable. Another typical choice is a three-
  seq analyses and their distribution choices. Among                      parameter PB distribution (Delmans and Hemberg,
                                                                          2016, Vu et al., 2016) which can be derived from
     ∗ Correspondence
                                                                          a DNA switching model (also called random tele-
      Email address:
                                                                          graph model, see Dattani and Barahona, 2017, or
  christiane.fuchs@helmholtz-muenchen.de (Christiane                      basic model of gene activation and inactivation, see
  Fuchs)                                                                  Raj et al., 2006). Parameters of the PB distribu-
bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                             It is made available under a CC-BY-NC-ND 4.0 International license.




  tion can be estimated from scRNA-seq data (Kim
                                                                             A   Generalized model
  and Marioni, 2013), as well as experimentally mea-
  sured and inferred (Suter et al., 2011). This distri-
  bution provides good estimates of scRNA-seq data;
  however, it entails the estimation of three param-
  eters which introduces a high computational cost
  (Kim and Marioni, 2013). A frequent third choice
  is the NB distribution, used by several tools that
                                                                                   B   Basic model
  analyze single-cell gene expression measurements
  such as SCDE (Kharchenko et al., 2014), Mono-
  cle 2 (Qiu et al., 2017) and many more (see Sup-
  plementary Table S1). This distribution is chosen
  due to computational convenience and good empir-
                                                                                   C   Switching model
  ical fits. Mathematically, it can be considered as
  asymptotic steady-state distribution of the switch-
  ing model (see Raj et al., 2006). However, this will
  entail biologically unrealistic assumptions. So far,
  no mechanistic model is known that directly leads                                                                      DNA
  to a NB distribution in steady state.                                                                                  mRNA
                                                                                                                         Polymerase
  To close this gap, we look again at the already
  known mechanistic processes and their inferred
  parametric steady-state distributions: Poisson and                               D   Bursting model
  PB. Integrating these in the general framework
  of Ornstein-Uhlenbeck (OU) processes (Barndorff-
  Nielsen and Shephard, 2001), we aim to transfer
  a general method of connecting mechanistic pro-
  cesses via stochastic differential equations (SDEs)
  and their theoretical steady-state distributions to
  this research problem. Hence, we show how to con-
  nect a desired steady-state distribution of the in-
                                                                          Figure 1: Transcription and degradation models: (A) Gen-
  tensity process with the corresponding SDE by us-
                                                                          eralized model with time-dependent stochastic transcription
  ing OU processes and their properties. We use this                      rate Rt and constant deterministic degradation rate rdeg .
  method to calculate the corresponding SDE from                          (B) Basic model with constant deterministic transcription
  the NB distribution as given steady-state distribu-                     and degradation rates. (C) Switching model of gene acti-
                                                                          vation and inactivation, transcription and degradation. (D)
  tion; from this, we can read a corresponding mech-                      Bursting model, where bursts occur at rate rburst and burst
  anistic model. In a (Case Study), we use our R                          sizes have mean sburst . This model differs from (A) in
  package scModels to estimate three count distribu-                      that transcription events can produce more than one mRNA
  tion models (Poisson, PB and NB) on simulated                           molecule.
  perfect-world data, and perform model selection as
  well as goodness-of-fit tests. A comparison with ex-                    Results
  isting implementations of the PB distribution, de-
  tailed derivations and definitions of the employed                      It has previously been shown how to derive an
  probability distributions can be found in the Ap-                       mRNA count distribution from a simple birth-
  pendix. Lastly, we repeat this comparison on real-                      death model for mRNA transcription and degra-
  world data and extend the models to more realistic                      dation (Dattani and Barahona, 2017, Peccoud and
  ones by including zero inflation and heterogeneity.                     Ycart, 1995). Alterations in the transcription and
                                                                          degradation model lead to alterations in the re-
  By inferring a mechanistic model for stochastic gene                    sulting mRNA count distribution. We will sketch
  expression, our work validates the NB distribution                      the derivation of several such models and distribu-
  as a steady-state distribution for mRNA content in                      tions. Our models describe the number of mRNA
  single cells.                                                           molecules in a cell for either one gene or for a group
                                                                      3
bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                             It is made available under a CC-BY-NC-ND 4.0 International license.




  of genes for which we can assume identical kinetic                      times between switches are assumed to be exponen-
  parameters.                                                             tially distributed. As shown in the Appendix, these
  In the general context, we consider a transcription-                    assumptions lead to It following a four-parameter
  degradation model with stochastic time-varying                          Beta (ract /rdeg , rdeact /rdeg , roff /rdeg , ron /rdeg ) dis-
  transcription rate Rt and deterministic constant                        tribution, and therefore the mRNA content in
  degradation rate rdeg (Figure 1A). Here, the num-                       steady state is described by a Poisson-beta (PB)
  ber of mRNA molecules at time t is Poisson dis-                         distribution.       Hence, the probability of hav-
  tributed with intensity It following the random dif-                    ing n mRNA molecules at time t is time-
  ferential equation                                                      independent. For roff = 0 (i. e. no transcription
                                                                          possible during inactive DNA state), it can be sim-
                    dIt = −rdeg It dt + Rt dt                   (1)
                                                                          plified to
  for t ≥ 0 and fixed I0 = i0 > 0. Depending on the
  transcription process, described by Rt , this RDE                                                          n              
  has different solutions which will be shown in the                                    Γ     + rrdeact
                                                                                                ract
                                                                                                rdeg
                                                                                                   deg
                                                                                                           ron
                                                                                                           rdeg
                                                                                                                    Γ   ract
                                                                                                                        rdeg
                                                                                                                             +  n
  following (for detailed calculations see Appendix).                      P(n, t) =                                           
                                                                                        ract                   ract   rdeact
                                                                                     Γ rdeg Γ(n + 1)Γ rdeg + rdeg + n
  Basic model: constant transcription and                                            
                                                                                       ract       rdeact ract               ron
                                                                                                                                
  degradation.       In the basic model, transcrip-                           × 1 F1         +n,         +          +n, −          , (4)
                                                                                       rdeg        rdeg      rdeg          rdeg
  tion and degradation occur at constant rates rtran
  and rdeg (Figure 1B). The RDE (1) simplifies to the                     where Γ denotes theR gamma function and
  ordinary differential equation (ODE)                                                        Γ(b)    1 zu a−1
                                                                                                               (1 − u)b−a−1 du
                                                                          1 F1 (a, b, z) = Γ(a)Γ(b−a) 0 e u
                   dIt = −rdeg It dt + rtran dt                 (2)       is the confluent hypergeometric function of first
  with time-independent non-stochastic steady state                       order, also called Kummer function. The density
  It = rtran /rdeg . Hence, if the cell is in steady                      function of this PB distribution converges to the
  state, mRNA counts in this model follow a Poisson                       density function of a negative binomial (NB)
  distribution with constant intensity rtran /rdeg (see                   distribution under specific conditions (Appendix).
  Appendix).                                                              For ron = rtran , ract → ∞ and rdeact = 0, the
                                                                          switching model reduces to the basic model, and
  Switching model: gene activation and deacti-                            the above PB distribution collapses to a Poisson
  vation. In the well-known switching model, a gene                       distribution with intensity parameter rtran /rdeg , in
  switches between an inactive state where transcrip-                     consistency with the above-derived results.
  tion is impossible, and an active state where tran-
  scription occurs. This can be explained by poly-                        Connecting SDEs with steady-state distribu-
  merases binding and unbinding to the specific gene                      tions. Taken together, both models described the
  (Figures 1C and S1). The RDE (1) becomes                                intensity process of a Poisson distribution (Equa-
                                                                          tions 2 and 3). These intensity processes govern the
                dIt = −rdeg It dt + rswitch (t)dt               (3)       transcription and degradation within the mechanis-
                                                                          tic models. They determine the steady-state dis-
  with                                                                    tribution of the intensity parameter, and thus the
                   (
                    ron        if DNA active at time t                    overall distribution of the mRNA content. Impor-
     rswitch (t) =                                                        tantly, changes in the intensity process lead to dif-
                    roff       if DNA inactive at time t,
                                                                          ferent steady-state distributions. We generalize this
  where roff < ron . The transcription rate is modeled                    framework by using Ornstein-Uhlenbeck (OU) pro-
  by a continuous-time Markov process (rswitch (t))t≥0                    cesses and their properties (Barndorff-Nielsen and
  that switches between two discrete states ron                           Shephard, 2001).
  and roff with activation and deactivation rates ract                    The general definition of an OU SDE (adjusted to
  and rdeact , respectively. One usually sets roff = 0.                   the above notation) is given by
  This corresponds to a system where a gene’s en-
  hancer sites can be bound by different transcrip-                                             dIt = −rdeg It dt + dLt ,            (5)
  tion factors or co-factors. Once bound, transcrip-
  tion occurs at constant rate ron , and mRNA con-                        where Lt with L0 = 0 (almost surely) is a Lévy pro-
  tinuously happens at constant rate rdeg . Waiting                       cess, i. e. a stochastic process with independent and
                                                                      4
bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                             It is made available under a CC-BY-NC-ND 4.0 International license.




  stationary increments. In addition, we need Lt to                       for α, β > 0 as derived in the Appendix.
  be a subordinator, that is a Lévy process with pos-                    In analogy to the derivation of the PB distribution
  itive increments (Definition 7 in Appendix). A spe-                     from the switching model, we now seek to describe
  cial property of OU processes is that, under certain                    the mRNA content by a Poisson distribution with
  conditions (see Definition 9), for a chosen distribu-                   intensity parameter It , which in steady state fol-
  tion D there is an OU process that in steady state                      lows a gamma distribution instead of a beta dis-
  leads to this distribution D. The other direction,                      tribution. Thus, we aim to specify an OU process
  i. e. the existence of a steady-state distribution D                    with the gamma distribution as steady-state dis-
  for a chosen OU process (in terms of its subordi-                       tribution. In terms of mechanistic modeling, this
  nator), holds as well. For a given Lévy subordina-                     means that we need to describe a suitable tran-
  tor Lt , the characteristic function of D, and thus D                   scription process. Mathematically, we need to spec-
  itself, can be derived as described in the following                    ify the Lévy subordinator Lt accordingly. From fi-
  (adjusted to the notation of Equation 5):                               nancial mathematics it is known that a stationary
                                                                          gamma distribution is obtained if Lt is chosen to
    1. Find the characteristic function µ̂Lt (z) of the
                                                                          be a compound Poisson process (CPP, see Defini-
       Lévy subordinator Lt .
                                                                          tion 8) with exponentially distributed jump sizes
    2. Calculate µ̂L1 (z) and write the result in the                     (Barndorff-Nielsen et al., 2001). This will be our
       form exp(φ(z)) for some function φ(z).                             choice of subordinator; however, the parameters of
                                                                          this process still need to be specified. In the follow-
    3. Calculate the characteristic function C(z) of
                                                                          ing, we will show that the Lévy subordinator of the
       the stationary distribution D of It by setting
                     −1 z
                        R         −1                                      OU process (5) whose one-dimensional stationary
       C(z) = exp(rdeg      φ(ω)ω    dω). C(z) leads
                          0                                               distribution is Gamma(α, β), is a CPP with inten-
       to D.
                                                                          sity parameter α · rdeg and mean jump size β −1 .
  More details and examples are shown in the                              To obtain this result, we follow the three-step pro-
  Appendix.       Despite this apparently clear line                      cedure described above in reverse order. We start
  of action, finding a corresponding law D and                            with D =  ˆ Gamma(α, β)n and transform its ocharac-
  process Lt is challenging without prior knowledge,                                                  −1 z
                                                                                                             φ(ω)ω −1 dω , using
                                                                                                         R
                                                                          teristic function to exp rdeg    0
  e. g. if D is not well-known or Lt is only specified
                                                                          the characteristic function of D as given in the Ap-
  through the characteristic function of L1 . In the
                                                                          pendix, Definition 1:
  next section, we cast the NB distribution as an
  alternative distribution for which a subordinator                                         −α                        
  can be derived.                                                                         iz                           iz
                                                                          C(z) =       1−        = exp −α log 1 −
                                                                                          β                            β
  Negative binomial distribution: Deriving an                                            Z z
                                                                                                  −1
                                                                                                          
  explanatory bursting process. A widely con-                                    =   exp α             dω
                                                                                             0 iβ + ω
  sidered model for scRNA-seq counts is the NB dis-                                      Z z                 
  tribution. Like the above-employed PB distribu-                                                   iω
                                                                                 =   exp α                 dω
  tion, it accounts for overdispersion by modeling the                                         (β − iω)ω
                                                                                         0Z z                              
  variance independently of the mean of the data.                                          −1                β
                                                                                 =   exp rdeg     α rdeg          − 1 ω −1 dω
  Having one parameter less than PB, NB is an ap-                                              0           β − iω
  pealing choice. However, mechanistic models un-                                            Z z             
                                                                                           −1
  derlying the NB distributional assumption are un-                              =   exp rdeg     φ(ω)ω −1 dω
                                                                                                   0
  known. We aim to derive such a mechanistic model
  of transcription and degradation by reversing the
                                                                                                       
                                                                                                β
                                                                          with φ(ω) = α rdeg β−iω   − 1 and i the imaginary
  steps that led from the switching model to the PB
  distribution. For that purpose, an important fact                       number. Next, we use µ̂L1 (z) = exp(φ(z)) to obtain
  is that an NB distribution can be expressed as a                                                   
                                                                                                          β
                                                                                                                  
  Poisson-gamma (PG) distribution, i. e. as a condi-                            µ̂L1 (z) = exp α rdeg          −1     .   (7)
                                                                                                        β − iz
  tional Poisson distribution with gamma distributed
  intensity parameter I. One has                                          We aim to bring this into agreement with µ̂Lt (z),
                                                                          the time-dependent characteristic function of a gen-
                         ˆ NB α, (β + 1)−1
                                                     
                PG(α, β) =                                      (6)       eral CPP Lt with intensity parameter λ. This is
                                                                      5
bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                             It is made available under a CC-BY-NC-ND 4.0 International license.




  given by                                                                the transcription strength becomes infinitesimally
                                                                          large, the trajectory of Lswitch
                                                                                                         t    turns into a step
                µ̂Lt (z) = exp(t λ(µ̂Y (z) − 1)),                         function as depicted in Figure 2 on the right. This
                                                                          limit is obtained if rdeact → ∞ and ron → ∞ in
  where Y is a random variable following the distribu-
                                                                          a way that needs to be specified. For that reason,
  tion of the jump sizes of the CPP, and µ̂Y is its char-
                                                                          we in the following seek to describe a mechanistic
  acteristic function (see Appendix, Definition 8). A
                                                                          model for the transition phase (Figure 2, middle)
  CPP with intensity λ = α · rdeg and i.i.d. expo-
                                                                          leading to the bursting model.
  nentially distributed increments Y ∼ Exp(β) with
                                                                          In the switching model, as soon as DNA becomes
  characteristic function µ̂Y (z) = β/(β − iz) yields
                                                                          active, a competition starts between the events
  the overall characteristic function
                                                                          transcription and deactivation. In addition, degra-
                                          
                                    β                                     dation may happen, which will affect the intensity
         µ̂Lt (z) = exp tαrdeg           −1       .                       process It and the number of mRNA molecules,
                                  β − iz
                                                                          but not the transcription process. If a transcrip-
  This is in accordance with µ̂L1 (z) as derived in                       tion event occurs, the competition between tran-
  Equation (7), and hence, a mathematically appro-                        scription and deactivation continues at the same
  priate subordinator is a CPP with intensity param-                      probability rates as before; the only affected event
  eter α · rdeg and mean jump size β −1 .                                 probability is the one for degradation because this
  As a consequence, transcription is expressed via a                      probability depends on the current mRNA count.
  stochastic process Lt , namely the CPP, which expe-                     We now consider the following approximation of the
  riences jumps after exponentially distributed wait-                     switching model and call it the transition model :
  ing times. In contrast to the Lévy subordinators of                    When DNA becomes active, we allow the events
  the basic model, Lbasic   = ron t, and of the switching                 transcription and deactivation to happen, but not
            switch
                      Rtt
  model, Lt        = 0 rswitch (s)ds, it possesses point-                 degradation. To correct for the missing degrada-
  wise discontinuous sample paths (Figure 2, right).                      tion events, we introduce a waiting time W after
  Intervals without any transcription activity seem to                    DNA deactivation in which only degradation can
  be disrupted by sudden explosions of mRNA num-                          occur, but no DNA activation. For appropriately
  bers. This burstiness led us to call the mechanism                      chosen rdeact → ∞ and ron → ∞, the approxima-
  behind the NB distribution the bursting model. We                       tion error will tend to zero.
  denote its subordinator by Lburst t  and argue the bi-                  The number of transcription events S during one
  ological justification of the model in the Discussion                   active DNA phase is geometrically distributed with
  and Conclusion.                                                         success probability parameter rdeact /(rdeact + ron ).
  We aim to derive the mechanistic transcription pro-                     In the interpretation of the geometric distribution,
  cess of the bursting model in more detail. Specif-                      transcription events are considered as failures, de-
  ically, we tackle the distribution of burst sizes of                    activation as success. The waiting time W needs to
  mRNA counts. For this we look at a heuristic tran-                      accumulate the times before S transcriptions and
  sition from Lswitch
                 t      to Lburst
                            t     .                                       one deactivation. Thus, W = T1 + · · · + TS + D,
  First, we dismantle the shape of the trajectories                       where Ti ∼ Exp(ron ), i = 1, . . . , S, are the sin-
  of Lswitch
        t    . As depicted in Figure 2 on the left, such                  gle waiting times for each transcription event and
  a trajectory consists of alternating piecewise con-                     D ∼ Exp(rdeact ) is the waiting time till the next
  stant and piecewise linear parts. The constant parts                    DNA deactivation.
  appear during time intervals without transcription,                     Taken together, the bursting process can be con-
  i. e. where the DNA is inactive. The length of such                     sidered as the limiting process of the approxima-
  a time interval depends only on the rate ract of the                    tion of the switching process as ron → ∞ and
  switching model. Once the DNA switches into the                         rdeact → ∞ under the condition that the success
  active mRNA transcribing state, the time interval                       probability parameter of the geometric distribution,
  with transcription depends only on the rate rdeact .                    rdeact /(rdeact + ron ) stays constant. As the link be-
  The slope of the trajectory during this active DNA                      tween the switching model and PB distribution is
  state represents the transcription strength and de-                     known, and since PB converges towards NB under
  pends only on the parameter ron .                                       certain conditions (see Appendix), we can connect
  In case the length of the time interval of active DNA                   the parameters of the bursting model with those of
  becomes infinitesimally small, and at the same time                     the NB distribution and CPP.
                                                                      6
bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                             It is made available under a CC-BY-NC-ND 4.0 International license.




  Lévy Subordinators
     Switching model                                           Transition model                            Bursting model




                                  transcription
                                    strength
                                     slope:




                                                                                                                                   intensity
                                                  Approxi-                                       Limit




                                                                                                                                     burst
                                                  mation




       DNA     DNA DNA                                                                                    waiting for
        off     on  off                                                                                   next burst

  Figure 2: The Lévy subordinator of the switching model is shown on the left by means of an exemplary trajectory. For
  small duration of the DNA being active and large transcription strength, its behavior can be approximately described by a
  step function as depicted in the middle (transition model). The limit of this approximation, with infinitesimally small DNA
  activation time interval and infinitesimally large transcription strength, leads to a trajectory of the subordinator of the bursting
  model which is shown on the right.



  That is, the bursting model can mechanistically be                      be added. The final number of balls in the bucket
  described as follows: After Exp(rburst )-distributed                    represents the number of mRNA molecules in a cell
  waiting times, a Geo((1 + sburst )−1 )-distributed                      at steady state.
  number of mRNAs are produced at once, where                             The above top-down derivation from the steady-
  sburst is the mean burst size (see also Golding                         state distribution to the mechanistic process has
  et al., 2005). As in the basic and switching                            to be motivated heuristically in parts. In the
  models, degradation events occur with Exp(rdeg )-                       Appendix we prove bottom-up that the above
  distributed waiting times. The just described mech-                     described mechanistic bursting model indeed leads
  anistic bursting model is shown in Figure 1D. It can                    to the steady-state NB distribution by directly
  equivalently be described by the OU process (5)                         calculating the master equation (see also Supple-
  with Lt being a CPP with Exp(rburst )-distributed                       mentary Figure S2).
  waiting times and Exp(sburst )-distributed jump
  sizes. Thus, in steady state, mRNA counts follow a
                                                                          Heterogeneity and dropout. The transcription
  PG(rburst /rdeg , sburst ) distribution
                                       or, equivalently,                 and degradation models considered so far describe
  a NB rburst /rdeg , (1 + sburst )−1 distribution if the
                                                                          the number of mRNA molecules for homogeneously
  bursting model is assumed.
                                                                          expressed genes that are actually present in a cell.
  The NB rburst /rdeg , (1 + sburst )−1 model, again,
                                                                         Real-world data is usually more complex: First,
  can be interpreted as follows (see also Appendix,                       cell populations may be heterogeneous. Second,
  Definition 3): Assume you have an empty bucket                          scRNA-seq measurements will be subject to mea-
  into which you put balls according to the follow-                       surement error. For example, they often contain a
  ing stochastic procedure. You perform a number                          large number of zeros. If a zero is due to techni-
  of independent Bernoulli trials, each with success                      cal error, it is called dropout. Regardless of what
  probability (1 + sburst )−1 . If there is a failure, you                causes this phenomenon, a data model should take
  add one ball to the bucket. If there is a success,                      this property into account. We describe two model
  you do not do anything but count the success event.                     extensions here.
  You continue until there have been rburst /rdeg suc-                    Data that originates from different cell populations
  cesses. (For interpretation purposes, we here as-                       (in terms of different transcriptomic properties) can
  sume rburst /rdeg to be a whole-valued number.)                         be modeled mathmatically. If a population consists
  The larger sburst , the smaller the success proba-                      of e. g. two subpopulations, each of them is mod-
  bility, i. e. by expectation you will put more balls                    eled by one single distribution, D1 or D2 , parame-
  in the bucket for large sburst . Similarly, the larger                  terized via θ1 and θ2 , respectively. One assumes
  the ratio of rburst to rdeg , the more success events                   the mRNA count to be distributed according to
  will be waited for, thus the more balls will tend to                    pD1 (θ1 ) + (1 − p)D2 (θ2 ) with p ∈ [0, 1], that means:
                                                                      7
bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                             It is made available under a CC-BY-NC-ND 4.0 International license.




  With probability p, the count distribution of that                                without additional zero inflation. In total, we in-
  cell is D1 , otherwise D2 . The corresponding mix-                                vestigate twelve different models as shown in Fig-
  ture density is given by                                                          ure 3. The numbers of parameters in these models
                                                                                    are listed in Supplementary Table S3.
    f2mix (x; θ1 , θ2 , p) = p f1 (x; θ1 ) + (1 − p)f2 (x; θ2 ),
                                                                                    Selected distributions after GOF
  where f1 and f2 are the densities of D1 and D2 ,
  and x is the observed mRNA count. For k > 2                                        A    Nestorowa et al.:
  subpopulations, the density can easily be general-                                                             0         100       10,000
  ized to a mixture of k distributions D1 , . . . , Dk with
                                                                                                     Pois       NB         PB
  probabilities p1 , . . . , pk :
                                                                                       1-pop          0       1,251        43       1,294

                                                                                         ZI-1-pop     0       1,043        2        1,045
     fkmix (x; θ1 , · · · , θk , p1 , · · · , pk−1 ) =                    (8)
                                                          !                              2-pop        1       7,248       159       7,408
                                               k−1
                                               X
            p1 f1 (x; θ1 ) + . . . +     1−          pi        fk (x; θk ).              ZI-2-pop     1        427         0         428
                                               i=1
                                                                                                      2       9,969       204       10,175
  The distributions Di can be any (ideally discrete
  count) distribution, possibly from different distri-
  bution families.                                                                   B    mm10:10x:
  An appropriate model for the occurrence of the                                                                 0          100 1,000
  above-mentioned dropout is a zero-inflated distri-                                                 Pois       NB         PB
  bution (Kharchenko et al., 2014). For one homo-                                      1-pop                  3,374                 3,454
                                                                                                     50                    30
  geneous population, the mRNA count will be dis-
  tributed according to p1{0} +(1−p)D(θ), with 1{0}                                      ZI-1-pop    193       266         3         462
  being the indicator function with point mass at                                        2-pop       111        98         58        267
  zero, and the corresponding density function reads
                                                                                         ZI-2-pop    12          8         0          20
          fzi (x; θ, p) = p 1{0} (x) + (1 − p)f (x; θ),
                                                                                                     366      3,746        91       4,203
  where f is the density function of D. Analogously,
                                                                                    Figure 3: Frequencies of chosen distributions via BIC-
  zero inflation can be added to a mixture of several                               after-GOF applied to real-world datasets: (A) Nestorowa
  distributions, see (8). mRNA counts are then dis-                                 et al. (2016) (B) mm10:10x, Official 10x Genomics Support
  tributed according to                                                             (2017).


                                                  k
                                                               !                    We estimate these twelve models on two real-world
                                                                                    datasets: The first one stems from Nestorowa
                                                  X
    p1 1{0} + p2 D1 (θ1 ) + . . . +         1−            pi       Dk (θk ).
                                                  i=1
                                                                                    et al. (2016), contains 1,656 mouse HSPCs and
                                                                                    was generated using the Smart-Seq2 (Picelli et al.,
  The corresponding density function is given by                                    2014) protocol, and thus did not employ unique
              fzi-kmix (x; θ1 , · · · , θk , p1 , · · · , pk )                      molecular identifiers (UMIs). The second dataset
                                                                                    contains 3,356 homogeneous NIH3T3 mouse cells
            = p1 1{0} (x) + p2 f1 (x; θ1 ) + . . .
                                !                                                   and has been generated using the 10x Chromium
                         k
                        X                                                           technique (Zheng et al., 2017), thus incorporat-
              + 1−          pi fk (x; θk ).                                         ing UMIs. It is part of the publicly available
                             i=1
                                                                                    10x dataset “6k 1:1 Mixture of Fresh Frozen
  Data application. We perform a comprehensive                                      Human (HEK293T) and Mouse (NIH3T3) Cells”
  comparison of the considered mRNA count distri-                                   (Official 10x Genomics Support, 2017). Here, we
  butions, that is the Poisson, NB and PB distribu-                                 refer to this dataset as mm10:10x.
  tion, when applied to real-world data. Within each                                We applied a gene filter (see Appendix), estimated
  of the three distributions we further consider mix-                               the model parameters of the twelve considered
  tures of two populations (from identical distribu-                                models via maximum likelihood, and performed
  tion types but with different parameters) with and                                model selection as described in the Case Study via
                                                                                8
bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                             It is made available under a CC-BY-NC-ND 4.0 International license.




  the Bayesian information criterion (BIC). Figure 3                      profiles. The simulation study showed that an NB
  summarizes the frequencies of the chosen models                         distribution may be best suited even if the in silico
  across genes. We only display those choices where                       data had been generated from the switching model.
  the chosen distribution with estimated parameters                       Also in the real-data application, the NB distribu-
  was not rejected by a goodness-of-fit (GOF) test                        tion was chosen in most cases. In line with our
  (χ2 -test) at 5% significance level with multiple test-                 expectations, gene profiles of the non-UMI-based
  ing correction.                                                         dataset by Nestorowa et al. (2016) showed strong
  In the data from Nestorowa et al. (2016), 16,364                        preference for a two-population mixture or zero-
  genes remained after filtering, of which 10,175 were                    inflated variant of the NB distribution. In con-
  not rejected by the GOF test. Figure 3A shows that                      trast, the mm10:10x dataset consists by construc-
  some variant of the NB distribution was chosen for                      tion of homogeneous cells, and 10x Chromium is not
  98% of these genes. Among these, mRNA count                             known for large amounts of unexpected zeros in the
  numbers for many genes were best described by the                       measurements. Accordingly, the single-population
  zero-inflated NB distribution. However, an even                         NB distribution was sufficient for most gene profiles
  higher preference could be observed for the mixture                     here. For 9% of the considered genes in the m10:10x
  of two NB distributions. This can be explained by                       dataset, mRNA counts were most appropriately de-
  taking a closer look at the gene expression counts                      scribed by some form of the Poisson distribution.
  of the affected genes (see also Supplementary Fig-                      We have examined these 366 genes for functional
  ure S6): Most of those genes not only show many                         similarities; while estimated parameters show some
  zeros, but also many low non-zero counts, i. e. many                    apparent pattern (Supplementary Figure S5), we
  ones, twos etc., next to higher counts. Such ex-                        did not find any defining biological characteristics
  pression profiles are not covered by a simple zero-                     (Supplementary Figure S7).
  inflated model but prefer a mix of two distributions,                   Similar to us, Vieth et al. (2017) performed model
  one of them mapping to low expression values (see                       selection among Poisson, NB and PB distributions
  Supplementary Table S3).                                                by BIC and GOF on several publicly available
  In the mm10:10x data, 4,203 genes remained after                        datasets. Although they used the method of Vu
  filtering and the GOF test. Figure 3B shows that                        et al. (2016) for which computation of the GOF
  for 89% of these genes, an NB distribution vari-                        statistics is impossible for the PB distribution, they
  ant was chosen as most appropriate model. How-                          still observed a tendency towards the NB distribu-
  ever, other than for the dataset from Nestorowa                         tion as preferred models. In our study, we represent
  et al. (2016), the standard NB distribution (for one                    the PB density in terms of the Kummer function,
  population, without zero inflation) was sufficient in                   which allows us to compute the GOF statistics ac-
  the majority of cases. We looked for commonali-                         cordingly.
  ties between the gene profiles that led to the same                     Different sequencing protocols might lead to
  distribution choice. Supplementary Figure S5 sug-                       differences in distributions and also might generate
  gests an interdependence between the chosen one-                        data of different magnitudes. Ziegenhain et al.
  population distributions and the range of the pa-                       (2017) applied various sequencing methods to cells
  rameter estimates.                                                      of the same kind to understand the impact of the
  Taken together, the NB distribution was chosen for                      experimental technique on the data. Based on the
  most gene profiles, either as a single distribution, a                  so-generated data, Chen et al. (2018) investigated
  mixture of two NB distributions or with additional                      differences in gene expression profiles between
  zero inflation.                                                         read-based and UMI-based sequencing technolo-
                                                                          gies. They concluded that, other than for read
  NB distribution as commonly chosen count                                counts, the NB distribution adequately models
  model. While the mechanistic models and their                           UMI counts. Townes et al. (2019) suggest to
  steady-state distributions describe actual mRNA                         describe UMI counts by multinomial distributions
  contents in single cells, real-world data underlies                     to reflect the nature of the sequencing procedure;
  technical variation in addition to biological com-                      for computational reasons, they propose to ap-
  plexity. We investigated in a simulation study                          proximate the multinomial density again by an
  (Box 1 and Figure 4) and on real-world data (Fig-                       NB density. Overall, the NB distribution appears
  ure 3) which distributions were most appropriate                        sufficiently flexible to hold independently of the
  among those considered to describe gene expression                      specific sequencing approach.
                                                                      9
bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                             It is made available under a CC-BY-NC-ND 4.0 International license.




                                                                          Poisson-beta (PB) and the negative binomial (NB)
  R package scModels.         We provide the R pack-                      distribution, on all generated datasets via maxi-
  age scModels which contains all functions needed                        mum likelihood estimation. To investigate which
  for maximum likelihood estimation of the consid-                        distribution explains the data best, we compute the
  ered distribution models. Three applications of the                     Bayesian information criterion
  Gillespie algorithm (Gillespie, 1976) allow synthetic
  data simulation (as used in the Case Study) via the                                   BIC = −2`(θ̂) + log(n)dim(θ̂),
  basic, switching and bursting model, respectively.
  Implementations of the likelihood functions for the                      where ` represents the corresponding log-likelihood
  one-population case and two-population mixtures,                        function, θ the possibly multivariable parameter
  with and without zero inflation, allow inference of                     vector of the distribution, θ̂ its maximum likelihood
  the Poisson, NB and PB distributions. We pro-                           estimate, dim(θ̂) its dimension, i. e. the number of
  vide a new implementation of the PB density, based                      unknown scalar parameters, and n the sample size.
  on our novel implementation of the Kummer func-                         The distribution with lowest BIC value is consid-
  tion, also known as the generalized hypergeomet-                        ered most appropriate among all considered mod-
  ric series of Kummer. This became necessary, be-                        els. Afterwards, we apply a χ2 -test to assess the
  cause the existing R function (kummerM() con-                           goodness-of-fit (GOF) and neglect those datasets
  tained in package fAsianOptions) was only valid for                     for which the distribution fits are rejected at the
  specific parameter values, and hence, was not suited                    5% significance level (without multiple testing cor-
  for optimization in continuous unconstrained space                      rection). This reduces the total number of 1,000
  (more information in Appendix). Existing packages                       simulated datasets per model to the amounts dis-
  such as D3E (Delmans and Hemberg, 2016), imple-                         played in Figure 4A.
  mented in Python, and BPSC (Vu et al., 2016), im-                       We investigate whether the selected distributions
  plemented in R, use the PB distribution for scRNA-                      correspond to the distributions that arise from the
  seq data analysis but based on a different approxi-                     respective mechanistic models: For the datasets
  mation scheme (see Appendix). With our new im-                          generated from the basic model, model selection
  plementation of the PB density we did not overcome                      via BIC-after-GOF indeed prefers the Poisson dis-
  the problem of time-consuming calculation, but we                       tribution in most cases, independently of the used
  for the first time provided an implementation of the                    distribution parameter λ (Figure 4A, first bar, and
  Kummer function in R valid for all values required                      Figure 4B). In contrast, for datasets generated by
  inside the PB density. For a more detailed descrip-                     the switching model, BIC-after-GOF in big parts
  tion of scModels and a package comparison to D3E                        chooses either the NB or the PB distribution (Fig-
  and BPSC, see the Appendix.                                             ure 4A, middle bar). The choice seems to depend on
                                                                          the employed rate parameters: Figure 4C indicates
                                                                          a tendency towards the PB distribution for low val-
  Case Study                                                              ues of β; otherwise, the NB distribution often seems
  In a simulation study, we generate in silico data                       to model the data generated by the switching model
  from the three considered mechanistic models: the                       sufficiently well. For datasets generated by the
  basic model (Figure 1B), the switching model (Fig-                      bursting model, BIC-after-GOF picks the NB dis-
  ure 1C), and the bursting model (Figure 1D), using                      tribution for the majority of the time without any
  the Gillespie algorithm implemented in scModels.                        obvious bias (Figure 4A/D). The study shows that,
  In order to choose realistic values for the rate pa-                    apparently, the NB distribution is complex enough
  rameters, we orient ourselves on experimental stud-                     to describe the data generated from the switching
  ies which aim to determine rates of the switching                       model. The BIC decides in many cases that a po-
  process in specific cases. For example, Suter et al.                    tentially better fit is not worth the extra effort for
  (2011) identify rates for so-called short-lived genes                   estimating an additional parameter in the PB dis-
  where mRNA and protein pulses are directly con-                         tribution.
  nected to one single on-and-off-switch of a gene. We
  provide an overview of the employed rate combina-                       Discussion and Conclusion
  tions in Supplementary Table S4.
  As a proof of concept, we estimate the three                            In this work, we derived a mechanistic model for
  corresponding distributions, i. e. the Poisson, the                     stochastic gene expression that results in the NB
                                                                     10
bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                             It is made available under a CC-BY-NC-ND 4.0 International license.




  A                                              Best model fits (Best BIC after GOF)                                                                                C                        Data generated by switching model
                                                                                                                                                                                                                                            Best BIC
                                                                                                                                                                                    600                                                     after GOF:
                                                                                                                                                                                                                                             ● Pois




                                                                                                                                                                     of PB
                 basic                                                                                             963                              2
                 model                                                                                                                                                              400                                                         PB
                                                                                                                                                              Pois                                                                              NB
     switching                                                                                                                                                                      200
                                                                                                      673                                   303               PB




                                                                                                                                                                     True
        model
                                                                                                                                                              NB
          bursting                                                                                                                                                                    0
            model                                                                                                 939                               8
                                                                                                                                                                                          0    10     20           30           40    50       60
                                                                                                                                                                                                            True        of PB
                                  0                                             200                          400             600           800      1000
                                                                                                             Frequencies
                                                                                                                                                                                    600
  B                                                               Data generated by basic model




                                                                                                                                                                     of PB
                                                                                                                                                                                    400
                       ●
                       ●●
                       ●●
                       ●●
                       ●
                       ●●
                       ●●
                        ●●
                        ●●
                        ●●●
                          ●●
                          ●
                          ●●
                          ●
                          ●●
                           ●●
                            ●●
                            ●
                            ●●
                            ●●●
                              ●●
                              ●●
                               ●●
                               ●
                               ●●●
                                 ●●
                                 ●
                                 ●●●
                                 ●●●
                                   ●●●●●●●●●●●●●●●●●●●●
                                                      ●●●●●●●●●● ● ●●●●●●●●●●● ●● ●● ●   ●● ● ●   ●    ● ●    ●    ●     ●         ●   ●                ●      ●

                                                                                                                                                                                    200




                                                                                                                                                                     True
                                                                                                                                                                                      0
                      0                                                    500                               1000            1500            2000             2500
                                                                                                              True       of Pois                                                          0     500        1000        1500          2000      2500
                                                                                                                                                                                                             True c of PB
  D                                                       Data generated by bursting model
                                                                                                                                                                               2500
                 60
  True r of NB




                                                                                                                                                                     True c of PB




                 40                                                                                                                                                            1500

                 20
                                                                                                                                                                                    500
                  0                                                                                                                                                                   0
                      0.0                                               0.1                             0.2        0.3                     0.4          0.5                               0    10     20           30           40    50       60
                                                                                                         True p of NB                                                                                       True        of PB

  Figure 4: Model selection on in silico data: (A) Frequencies of chosen distributions (Poisson, PB, NB) via BIC-after-GOF
  based on datasets generated by the three different transcription models (basic, bursting, switching). (B-D) Employed parameter
  values (indicated by horizontal/vertical position) and chosen distributions (indicated by color/symbol) for basic model (B),
  switching model (C) and bursting model (D). The names of the parameters correspond to those in Definitions 2, 3 and 5 in
  the Appendix.



  distribution as steady-state distribution for mRNA                                                                                                                                tions (Figures 1B,C). Here, we provide a possible
  content in single cells. According to the so-obtained                                                                                                                             explanation.
  bursting model, transcription happens in chunks,                                                                                                                                  Second, we demonstrated how to generally link a
  rather than in a one-by-one production as com-                                                                                                                                    probability distribution to an Ornstein-Uhlenbeck
  monly assumed in mechanistic modeling (Dattani                                                                                                                                    (OU) process and derive a mechanistic model. This
  and Barahona, 2017). We discuss the biological                                                                                                                                    brings a new field of mathematics to single-cell biol-
  plausibility of bursty transcription further below.                                                                                                                               ogy. The procedure can be used to deduce possible
  The consideration of the bursting model and its                                                                                                                                   mechanistic processes leading to different steady-
  derivation is interesting from both practical and                                                                                                                                 state distributions, exploiting the rich literature on
  theoretical points of view:                                                                                                                                                       OU processes from financial mathematics.
  First of all, the NB distribution is defined through                                                                                                                              Third, although we focused on the resulting
  two parameters whereas the PB distribution typ-                                                                                                                                   steady-state distributions of the mechanistic mod-
  ically requires three parameters to be specified in                                                                                                                               els here, our mathematical framework also provides
  the current context. Therefore, the NB distribu-                                                                                                                                  model descriptions in terms of stochastic processes.
  tion is computationally less elaborate to estimate,                                                                                                                               Nowadays, sequencing counts are commonly avail-
  given some data, than the PB distribution. Several                                                                                                                                able as snapshot data. However, time-resolved
  tools employ the NB distribution to parameterize                                                                                                                                  measurements may become standard (Golding
  mRNA read counts (see Supplementary Table S1).                                                                                                                                    et al., 2005), and in that case our models open up
  However, there has been no mechanistic biological                                                                                                                                 the statistical toolbox of stochastic processes to
  model known so far leading to this distribution,                                                                                                                                  extract information from interdependencies within
  other than for the Poisson and the PB distribu-                                                                                                                                   single-cell time series.
                                                                                                                                                                   11
bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                             It is made available under a CC-BY-NC-ND 4.0 International license.




                                                                          This criticism is underpinned by the work of
                                                                          Suter et al. (2011) who derived ranges of the rates
  Limiting cases of the switching model that
                                                                          of the switching model experimentally and by
  give rise to the NB distribution are biologi-
                                                                          calculations. Here, only so-called short-lived genes
  cally unrealistic. The NB and PB distributions
                                                                          were taken into account. Thus, observed mRNA
  have been linked before. Among others, Raj et al.
                                                                          half-lives were on a smaller scale, mainly between
  (2006) and Grün et al. (2014) have shown that
                                                                          30 and 140 min, resulting in mRNA degradation
  the NB distribution is an asymptotic result of the
                                                                          rates between 0.005 min−1 and 0.023 min−1 . At
  switching model and the corresponding PB distri-
                                                                          the same time, deactivation rates were found
  bution (see Appendix). However, this result holds
                                                                          in the range between 0.1 min−1 and 0.6 min−1 .
  only under biologically unrealistic assumptions as
                                                                          Hence, their quotient is at maximum around
  we elaborate in the following. Our derivation of
                                                                          120 and thus nowhere close to infinity. Another
  the NB steady-state distribution, in contrast, is
                                                                          mathematical assumption for deriving the NB limit
  based on a thoroughly realistic mechanism of bursty
                                                                          distribution was that the transcription rate needed
  transcription. The approach by Raj et al. (2006)
                                                                          to be smaller than the deactivation rate. This is
  and Grün et al. (2014) requires rdeact /rdeg → ∞
                                                                          not confirmed by Suter et al. (2011) for most genes.
  and ron /rdeact < 1. That means, the deactiva-
  tion rate has to be substantially larger than the
                                                                          Biological plausibility of bursting model.
  mRNA degradation rate and, simultaneously, the
                                                                          Burst-like transcription has been discussed, e. g.
  transcription rate needs to be smaller than the gene
                                                                          Golding et al. (2005), Schwanhäusser et al. (2011)
  deactivation rate. Here, we discuss the plausibility
                                                                          and Suter et al. (2011). We take a look at the
  of these presumptions:
                                                                          inherent assumptions of the bursting model: The
  Schwanhäusser et al. (2011) showed that mRNA                           bursting rate rburst represents the waiting time
  half-life is in median around t1/2 = 9 h (range:                        until the DNA turns open for transcription in
  1.61 h to 40.47 h), which results in a degra-                           addition to the time which the polymerase needs
  dation rate rdeg = log(2)/t1/2 of 0.077 h−1 =                           to transcribe. The model assumes that several
  0.00128 min−1     (range:        0.00718 min−1   to                     polymerases attach simultaneously to the DNA
               −1
  0.00029 min ).       For rdeact /rdeg → ∞, the                          and terminate transcription at the same time. By
  mRNA degradation rate needs to become much                              simplifying this part of the transcription process
  smaller than the gene deactivation rate. Visual                         model, the problem of persisting DNA activation
  comparison shows that density curves of the PB                          during the whole transcription process in the
  and according NB distributions start to look                            switching model is avoided.
  similar for rdeact /rdeg ≈ 20, 000. Assuming a
  20,000-fold larger gene deactivation rate results                       Practical relevance. There is no unambigu-
  in rdeact = 29.67 min−1 (range: 143.51 min−1 to                         ous answer to the question of the most appropri-
  5.71 min−1 ). This means that on average the                            ate probability distribution for mRNA count data.
  gene switches approximately 30 times per minute                         Pragmatic reasons will often lead to NB distribu-
  into the off-state, i. e. on average the gene is                        tion as already employed by many tools (see Sup-
  in its active state for only two seconds. RNA                           plementary Table S1). However, the choice may
  polymerases proceed at 30 nt/sec (without pausing                       depend on experimental techniques, the statisti-
  at approximately 70 nt/sec) (Darzacq et al., 2007).                     cal analysis to be performed, and also differ be-
  Genes have a length of hundreds to thousands of                         tween genes within the same dataset. For large read
  nucleotides. Thus, according to the above numbers,                      counts, even continuous distributions may be most
  genes cannot be transcribed in such short phases.                       suitable.
  The DNA needs to stay active during the whole                           While statistics quantifies which model is the most
  transcription process of one (or more) mRNAs;                           plausible one from the data point of view, mathe-
  as soon as the DNA turns inactive, all currently                        matical modelling points out which biological as-
  running transcriptions are stopped. In other words,                     sumptions may implicitly be made when a par-
  although the NB distribution can mathematically                         ticular distribution is used. Importantly, while
  be derived as a limiting steady-state distribution                      the mechanistic model leads to a unique steady-
  of the switching model, this entails biologically                       state distribution, the reverse conclusion is not true.
  implausible assumptions.                                                In general, the basic model and the correspond-
                                                                     12
bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                             It is made available under a CC-BY-NC-ND 4.0 International license.




  ing Poisson distribution may appear too simple in                                      ∗ OU PROCESS DERIVATION FOR
  most cases (both with respect to biological plausi-                                      BASIC MODEL
  bility and the ability to describe measured sequenc-                              – MASTER   EQUATION                      OF    THE
  ing data). The switching and bursting models are                                    BURSTING MODEL
  harder to distinguish. From the mathematical point
  of view, their densities are of similar shape, such                               – R PACKAGE scModels
  that the less complex NB model will often be pre-                                      ∗   BPSC
  ferred. Answering the question from the biological                                     ∗   D3E
  perspective may require measuring mRNA gener-
                                                                                         ∗   scModels
  ation at a sufficiently small time resolution (e. g.
  Golding et al., 2005) to see whether several mRNA                                      ∗   COMPARISON OF scModels WITH
  molecules are generated at once (bursting model) or                                        D3E AND BPSC
  in short successional intervals (switching model).                                – DATA APPLICATION
  Taken together, we have identified mechanistic                                         ∗ GENE FILTERING
  models for mRNA transcription and degradation
                                                                                         ∗ ESTIMATION     OF    ONE-
  with good interpretability, and established a link
                                                                                           POPULATION MODELS
  to mathematical representations by stochastic pro-
  cesses and steady-state count distributions. Specif-                                   ∗ BLOOD     DIFFERENTIATION
  ically, the commonly used NB model is supplied                                           MARKER GENES
  with a proper mechanistic model of the underlying                                      ∗ GO TERMS
  biological process. The R package scModels over-                                  – OVERVIEW OF SINGLE-CELL ANAL-
  comes a previous shortcoming in the implementa-                                     YSIS TOOLS
  tion of the PB density. It provides a full toolbox for
  data simulation and parameter estimation, equip-                           • DATA AND SOFTWARE AVAILABILITY
  ping users with the freedom to choose their mod-
  els based on content-related, design-based or purely                              – Case Study: Simulated data
  pragmatic motives.                                                                – Scripts
                                                                                    – Software
  Appendix

  Detailed methods are provided and include the fol-                      Supplemental Information
  lowing:
                                                                          Supplemental Information includes seven figures
     • OVERVIEW TOOLS TABLE                                               and four tables which can be found at the end of
                                                                          this paper.
     • METHOD DETAILS

           – DEFINITIONS AND IDENTITIES                                   Author Contributions
           – NEGATIVE   BINOMIAL   CORRE-                                 The study was designed by LA and CF. LA devel-
             SPONDS TO POISSON-GAMMA                                      oped and performed the mathematical analysis and
           – POISSON-BETA CONVERGES                           TO-         software development with help of KH and CF. LA
             WARDS NEGATIVE BINOMIAL                                      and CF wrote the paper.
           – MASTER EQUATION OF THE GEN-
             ERALIZED MODEL                                               Acknowledgments
                 ∗ DETERMINISTIC CONTINUOUS
                   TRANSCRIPTION MODEL                                    Our research was supported by the German Re-
                                                                          search Foundation within the SFB 1243, Subproject
                 ∗ BASIC MODEL
                                                                          A17, by the German Federal Ministry of Education
                 ∗ SWITCHING MODEL                                        and Research under grant number 01DH17024, and
           – OU PROCESSES LINK SDES TO                                    by the National Institutes of Health under grant
             STEADY-STATE DISTRIBUTIONS                                   number U01-CA215794.
                                                                     13
bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                             It is made available under a CC-BY-NC-ND 4.0 International license.




  References                                                              Huang, M., Wang, J., Torre, E., Dueck, H., Shaffer, S., Bona-
                                                                             sio, R., Murray, J. I., Raj, A., Li, M. and Zhang, N. R.
  Adan, I. and Resing, J. (2002). Queueing theory. Eindhoven                 (2018). SAVER: gene expression recovery for single-cell
    University of Technology Eindhoven.                                      RNA sequencing. Nature Methods 15, 539–542.
  Andrews, T. S. and Hemberg, M. (2018). M3Drop: dropout-                 Intosalmi, J., Mannerstrom, H., Hiltunen, S. and Lahdes-
    based feature selection for scRNASeq. Bioinformatics                     maki, H. (2018). SCHiRM: Single Cell Hierarchical Re-
    bty1044.                                                                 gression Model to detect dependencies in read count data.
  Barndorff-Nielsen, O. E., Resnick, S. I. and Mikosch, T., eds              BioRxiv preprint.
    (2001). Lévy Processes. Birkhäuser Boston, Boston, MA.              Karlis, D. and Xekalaki, E. (2005). Mixed poisson distribu-
    DOI: 10.1007/978-1-4612-0197-7.                                          tions. International Statistical Review 73, 35–58.
  Barndorff-Nielsen, O. E. and Shephard, N. (2001). Non-                  Kharchenko, P. V., Silberstein, L. and Scadden, D. T. (2014).
    Gaussian Ornstein-Uhlenbeck-based models and some of                     Bayesian approach to single-cell differential expression
    their uses in financial economics. Journal of the Royal                  analysis. Nature Methods 11, 740–742.
    Statistical Society: Series B (Statistical Methodology) 63,           Kim, J. K. and Marioni, J. C. (2013). Inferring the kinet-
    167–241.                                                                 ics of stochastic gene expression from single-cell RNA-
  Brent, R. P. (2010). Unrestricted algorithms for elementary                sequencing data. Genome biology 14, R7.
    and special functions. arXiv preprint.                                Li, W. V. and Li, J. J. (2018). An accurate and robust im-
  Chen, W., Li, Y., Easton, J., Finkelstein, D., Wu, G. and                  putation method scImpute for single-cell RNA-seq data.
    Chen, X. (2018). UMI-count modeling and differential ex-                 Nature Communications 9.
    pression analysis for single-cell RNA sequencing. Genome              Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. and Yosef,
    Biology 19.                                                              N. (2018). Deep generative modeling for single-cell tran-
  Darzacq, X., Shav-Tal, Y., de Turris, V., Brody, Y., Shenoy,               scriptomics. Nature Methods 15, 1053–1058.
    S. M., Phair, R. D. and Singer, R. H. (2007). In vivo                 Muller, K. E. (2001). Computing the confluent hypergeo-
    dynamics of RNA polymerase II transcription. Nature                      metric function, M ( a,b,x ). Numerische Mathematik
    Structural & Molecular Biology 14, 796–806.                              90, 179–196.
  Dattani, J. and Barahona, M. (2017). Stochastic models of               Nestorowa, S., Hamey, F. K., Pijuan Sala, B., Diamanti, E.,
    gene transcription with upstream drives: exact solution                  Shepherd, M., Laurenti, E., Wilson, N. K., Kent, D. G.
    and sample path characterization. Journal of The Royal                   and Gottgens, B. (2016). A single-cell resolution map of
    Society Interface 14, 20160833.                                          mouse hematopoietic stem and progenitor cell differenti-
  Delmans, M. and Hemberg, M. (2016). Discrete distribu-                     ation. Blood 128, e20–e31.
    tional differential expression (D3E) - a tool for gene ex-            Olver, F. W. J., Olde Daalhuis, A. B., Lozier, D. W., Schnei-
    pression analysis of single-cell RNA-seq data. BMC Bioin-                der, B. I., Boisvert, F., Clark, C. W., Miller, B. R. and
    formatics 17.                                                            Saunders, B. V. (2019). NIST Digital Library of Mathe-
  Dormann, C. F. (2013). Parametrische Statistik. Springer                   matical Functions. Release 1.0.22 of 2019-03-15.
    Berlin Heidelberg, Berlin, Heidelberg.                                Paul, F., Arkin, Y., Giladi, A., Jaitin, D. A., Kenigsberg,
  Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. and                  E., Keren-Shaul, H., Winter, D., Lara-Astiaso, D., Gury,
    Theis, F. J. (2019). Single-cell RNA-seq denoising using                 M., Weiner, A., David, E., Cohen, N., Lauridsen, F. K. B.,
                                                                             Haas, S., Schlitzer, A., Mildner, A., Ginhoux, F., Jung, S.,
    a deep count autoencoder. Nature Communications 10.
  Finak, G., McDavid, A., Yajima, M., Deng, J., Gersuk, V.,                  Trumpp, A., Porse, B. T., Tanay, A. and Amit, I. (2015).
                                                                             Transcriptional Heterogeneity and Lineage Commitment
    Shalek, A. K., Slichter, C. K., Miller, H. W., McElrath,
    M. J., Prlic, M., Linsley, P. S. and Gottardo, R. (2015).                in Myeloid Progenitors. Cell 163, 1663–1677.
                                                                          Peccoud, J. and Ycart, B. (1995). Markovian Modeling of
    MAST: a flexible statistical framework for assessing tran-
    scriptional changes and characterizing heterogeneity in                  Gene-Product Synthesis. Theoretical Population Biology
                                                                             48, 222–234.
    single-cell RNA sequencing data. Genome Biology 16.
  Gillespie, D. T. (1976). A general method for numerically               Picelli, S., Faridani, O. R., Björklund, s. K., Winberg, G.,
                                                                             Sagasser, S. and Sandberg, R. (2014). Full-length RNA-
    simulating the stochastic time evolution of coupled chemi-
    cal reactions. Journal of Computational Physics 22, 403–                 seq from single cells using Smart-seq2. Nature Protocols
    434.                                                                     9, 171–181.
  Golding, I., Paulsson, J., Zawilski, S. M. and Cox, E. C.               Pierson, E. and Yau, C. (2015). ZIFA: Dimensionality reduc-
    (2005). Real-Time Kinetics of Gene Activity in Individual                tion for zero-inflated single-cell gene expression analysis.
    Bacteria. Cell 123, 1025–1036.                                           Genome Biology 16.
  Graham, R. L., Knuth, D. E. and Patashnik, O. (2017). Con-              Qiu, X., Hill, A., Packer, J., Lin, D., Ma, Y.-A. and Trapnell,
    crete mathematics: a foundation for computer science.                    C. (2017). Single-cell mRNA quantification and differen-
    2. ed., 31. print edition, Addison-Wesley, Upper Saddle                  tial analysis with Census. Nature Methods 14, 309–315.
    River, NJ. OCLC: 993616132.                                           Raj, A., Peskin, C. S., Tranchina, D., Vargas, D. Y. and
  Grün, D., Kester, L. and van Oudenaarden, A. (2014). Vali-                Tyagi, S. (2006). Stochastic mRNA Synthesis in Mam-
    dation of noise models for single-cell transcriptomics. Na-              malian Cells. PLoS Biology 4, e309.
    ture Methods 11, 637–640.                                             Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. and Vert,
  Hafemeister, C. and Satija, R. (2019). Normalization and                   J.-P. (2018). A general and flexible method for signal ex-
    variance stabilization of single-cell RNA-seq data us-                   traction from single-cell RNA-seq data. Nature Commu-
    ing regularized negative binomial regression. bioRxiv                    nications 9.
    preprint.                                                             Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi,
  Haghverdi, L., Büttner, M., Wolf, F. A., Buettner, F. and                 W. and Smyth, G. K. (2015). limma powers differential
    Theis, F. J. (2016). Diffusion pseudotime robustly recon-                expression analyses for RNA-sequencing and microarray
    structs lineage branching. Nature Methods 13, 845–848.                   studies. Nucleic Acids Research 43, e47–e47.

                                                                     14
You can also read
Next part ... Cancel