# Graphical Abstract - A mechanistic model for the negative binomial distribution of single-cell mRNA counts

**Page content transcription**( If your browser does not render page correctly, please read the page content below )

bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. Graphical Abstract A mechanistic model for the negative binomial distribution of single-cell mRNA counts Lisa Amrhein, Kumar Harsha, Christiane Fuchs Biological Process Data Frequency Measurement # Frequency # Estimation & Simplification Knowledge Transfer Prediction Parameter Selection Mechanistic Model Steady State Distribution Poisson distribution # ~ Link via Ornstein-Uhlenbeck processes Poisson-beta distribution # ~ Negative binomial distribution # ~

bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. A mechanistic model for the negative binomial distribution of single-cell mRNA counts Lisa Amrheina,b , Kumar Harshaa,b , Christiane Fuchsa,b,c,d,∗ a Institute of Computational Biology, Helmholtz Zentrum Munich, 85764 Neuherberg, Germany b Department of Mathematics, Technical University of Munich, 85747 Garching, Germany c Faculty of Business Administration and Economics, Bielefeld University, 33615 Bielefeld, Germany d Lead Contact Summary Several tools analyze the outcome of single-cell RNA-seq experiments, and they often assume a probability distribution for the observed sequencing counts. It is an open question of which is the most appropriate discrete distribution, not only in terms of model estimation, but also regarding interpretability, complexity and biological plausibility of inherent assumptions. To address the question of interpretability, we inves- tigate mechanistic transcription and degradation models underlying commonly used discrete probability distributions. Known bottom-up approaches infer steady-state probability distributions such as Poisson or Poisson-beta distributions from different underlying transcription-degradation models. By turning this procedure upside down, we show how to infer a corresponding biological model from a given probability distribution, here the negative binomial distribution. Realistic mechanistic models underlying this distribu- tional assumption are unknown so far. Our results indicate that the negative binomial distribution arises as steady-state distribution from a mechanistic model that produces mRNA molecules in bursts. We empirically show that it provides a convenient trade-off between computational complexity and biological simplicity. Keywords: gene expression, negative binomial distribution, Poisson-beta distribution, single-cell RNA sequencing, switching process, bursting process, stochastic differential equation, Ornstein-Uhlenbeck process Introduction the 23 listed tools, around 60% use a negative bi- nomial (NB) distribution, 40% a zero-inflated dis- When analyzing the outcomes of single-cell RNA tribution (these two cases can overlap) and about sequencing (scRNA-seq) experiments, it is essen- 7% a Poisson-beta (PB) distribution. tial to appropriately take properties of the result- Count data is most appropriately described by dis- ing data into account. Many methods assume a crete distributions unless count numbers are with- parametric distribution for the sequencing counts out exception very high. A commonly chosen dis- due to its larger power than non-parametric ap- tribution is the Poisson distribution, which can proaches. To that end, a family of parametric dis- be derived from a simple birth-death model of tributions which adequately models the data needs mRNA transcription and degradation. However, to be chosen. In Supplementary Table S1, we pro- due to widespread overdispersed data, it is sel- vide an overview of computational tools for scRNA- dom suitable. Another typical choice is a three- seq analyses and their distribution choices. Among parameter PB distribution (Delmans and Hemberg, 2016, Vu et al., 2016) which can be derived from ∗ Correspondence a DNA switching model (also called random tele- Email address: graph model, see Dattani and Barahona, 2017, or christiane.fuchs@helmholtz-muenchen.de (Christiane basic model of gene activation and inactivation, see Fuchs) Raj et al., 2006). Parameters of the PB distribu-

bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. tion can be estimated from scRNA-seq data (Kim A Generalized model and Marioni, 2013), as well as experimentally mea- sured and inferred (Suter et al., 2011). This distri- bution provides good estimates of scRNA-seq data; however, it entails the estimation of three param- eters which introduces a high computational cost (Kim and Marioni, 2013). A frequent third choice is the NB distribution, used by several tools that B Basic model analyze single-cell gene expression measurements such as SCDE (Kharchenko et al., 2014), Mono- cle 2 (Qiu et al., 2017) and many more (see Sup- plementary Table S1). This distribution is chosen due to computational convenience and good empir- C Switching model ical fits. Mathematically, it can be considered as asymptotic steady-state distribution of the switch- ing model (see Raj et al., 2006). However, this will entail biologically unrealistic assumptions. So far, no mechanistic model is known that directly leads DNA to a NB distribution in steady state. mRNA Polymerase To close this gap, we look again at the already known mechanistic processes and their inferred parametric steady-state distributions: Poisson and D Bursting model PB. Integrating these in the general framework of Ornstein-Uhlenbeck (OU) processes (Barndorff- Nielsen and Shephard, 2001), we aim to transfer a general method of connecting mechanistic pro- cesses via stochastic differential equations (SDEs) and their theoretical steady-state distributions to this research problem. Hence, we show how to con- nect a desired steady-state distribution of the in- Figure 1: Transcription and degradation models: (A) Gen- tensity process with the corresponding SDE by us- eralized model with time-dependent stochastic transcription ing OU processes and their properties. We use this rate Rt and constant deterministic degradation rate rdeg . method to calculate the corresponding SDE from (B) Basic model with constant deterministic transcription the NB distribution as given steady-state distribu- and degradation rates. (C) Switching model of gene acti- vation and inactivation, transcription and degradation. (D) tion; from this, we can read a corresponding mech- Bursting model, where bursts occur at rate rburst and burst anistic model. In a (Case Study), we use our R sizes have mean sburst . This model differs from (A) in package scModels to estimate three count distribu- that transcription events can produce more than one mRNA tion models (Poisson, PB and NB) on simulated molecule. perfect-world data, and perform model selection as well as goodness-of-fit tests. A comparison with ex- Results isting implementations of the PB distribution, de- tailed derivations and definitions of the employed It has previously been shown how to derive an probability distributions can be found in the Ap- mRNA count distribution from a simple birth- pendix. Lastly, we repeat this comparison on real- death model for mRNA transcription and degra- world data and extend the models to more realistic dation (Dattani and Barahona, 2017, Peccoud and ones by including zero inflation and heterogeneity. Ycart, 1995). Alterations in the transcription and degradation model lead to alterations in the re- By inferring a mechanistic model for stochastic gene sulting mRNA count distribution. We will sketch expression, our work validates the NB distribution the derivation of several such models and distribu- as a steady-state distribution for mRNA content in tions. Our models describe the number of mRNA single cells. molecules in a cell for either one gene or for a group 3

bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. of genes for which we can assume identical kinetic times between switches are assumed to be exponen- parameters. tially distributed. As shown in the Appendix, these In the general context, we consider a transcription- assumptions lead to It following a four-parameter degradation model with stochastic time-varying Beta (ract /rdeg , rdeact /rdeg , roff /rdeg , ron /rdeg ) dis- transcription rate Rt and deterministic constant tribution, and therefore the mRNA content in degradation rate rdeg (Figure 1A). Here, the num- steady state is described by a Poisson-beta (PB) ber of mRNA molecules at time t is Poisson dis- distribution. Hence, the probability of hav- tributed with intensity It following the random dif- ing n mRNA molecules at time t is time- ferential equation independent. For roff = 0 (i. e. no transcription possible during inactive DNA state), it can be sim- dIt = −rdeg It dt + Rt dt (1) plified to for t ≥ 0 and fixed I0 = i0 > 0. Depending on the transcription process, described by Rt , this RDE n has different solutions which will be shown in the Γ + rrdeact ract rdeg deg ron rdeg Γ ract rdeg + n following (for detailed calculations see Appendix). P(n, t) = ract ract rdeact Γ rdeg Γ(n + 1)Γ rdeg + rdeg + n Basic model: constant transcription and ract rdeact ract ron degradation. In the basic model, transcrip- × 1 F1 +n, + +n, − , (4) rdeg rdeg rdeg rdeg tion and degradation occur at constant rates rtran and rdeg (Figure 1B). The RDE (1) simplifies to the where Γ denotes theR gamma function and ordinary differential equation (ODE) Γ(b) 1 zu a−1 (1 − u)b−a−1 du 1 F1 (a, b, z) = Γ(a)Γ(b−a) 0 e u dIt = −rdeg It dt + rtran dt (2) is the confluent hypergeometric function of first with time-independent non-stochastic steady state order, also called Kummer function. The density It = rtran /rdeg . Hence, if the cell is in steady function of this PB distribution converges to the state, mRNA counts in this model follow a Poisson density function of a negative binomial (NB) distribution with constant intensity rtran /rdeg (see distribution under specific conditions (Appendix). Appendix). For ron = rtran , ract → ∞ and rdeact = 0, the switching model reduces to the basic model, and Switching model: gene activation and deacti- the above PB distribution collapses to a Poisson vation. In the well-known switching model, a gene distribution with intensity parameter rtran /rdeg , in switches between an inactive state where transcrip- consistency with the above-derived results. tion is impossible, and an active state where tran- scription occurs. This can be explained by poly- Connecting SDEs with steady-state distribu- merases binding and unbinding to the specific gene tions. Taken together, both models described the (Figures 1C and S1). The RDE (1) becomes intensity process of a Poisson distribution (Equa- tions 2 and 3). These intensity processes govern the dIt = −rdeg It dt + rswitch (t)dt (3) transcription and degradation within the mechanis- tic models. They determine the steady-state dis- with tribution of the intensity parameter, and thus the ( ron if DNA active at time t overall distribution of the mRNA content. Impor- rswitch (t) = tantly, changes in the intensity process lead to dif- roff if DNA inactive at time t, ferent steady-state distributions. We generalize this where roff < ron . The transcription rate is modeled framework by using Ornstein-Uhlenbeck (OU) pro- by a continuous-time Markov process (rswitch (t))t≥0 cesses and their properties (Barndorff-Nielsen and that switches between two discrete states ron Shephard, 2001). and roff with activation and deactivation rates ract The general definition of an OU SDE (adjusted to and rdeact , respectively. One usually sets roff = 0. the above notation) is given by This corresponds to a system where a gene’s en- hancer sites can be bound by different transcrip- dIt = −rdeg It dt + dLt , (5) tion factors or co-factors. Once bound, transcrip- tion occurs at constant rate ron , and mRNA con- where Lt with L0 = 0 (almost surely) is a Lévy pro- tinuously happens at constant rate rdeg . Waiting cess, i. e. a stochastic process with independent and 4

bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. stationary increments. In addition, we need Lt to for α, β > 0 as derived in the Appendix. be a subordinator, that is a Lévy process with pos- In analogy to the derivation of the PB distribution itive increments (Definition 7 in Appendix). A spe- from the switching model, we now seek to describe cial property of OU processes is that, under certain the mRNA content by a Poisson distribution with conditions (see Definition 9), for a chosen distribu- intensity parameter It , which in steady state fol- tion D there is an OU process that in steady state lows a gamma distribution instead of a beta dis- leads to this distribution D. The other direction, tribution. Thus, we aim to specify an OU process i. e. the existence of a steady-state distribution D with the gamma distribution as steady-state dis- for a chosen OU process (in terms of its subordi- tribution. In terms of mechanistic modeling, this nator), holds as well. For a given Lévy subordina- means that we need to describe a suitable tran- tor Lt , the characteristic function of D, and thus D scription process. Mathematically, we need to spec- itself, can be derived as described in the following ify the Lévy subordinator Lt accordingly. From fi- (adjusted to the notation of Equation 5): nancial mathematics it is known that a stationary gamma distribution is obtained if Lt is chosen to 1. Find the characteristic function µ̂Lt (z) of the be a compound Poisson process (CPP, see Defini- Lévy subordinator Lt . tion 8) with exponentially distributed jump sizes 2. Calculate µ̂L1 (z) and write the result in the (Barndorff-Nielsen et al., 2001). This will be our form exp(φ(z)) for some function φ(z). choice of subordinator; however, the parameters of this process still need to be specified. In the follow- 3. Calculate the characteristic function C(z) of ing, we will show that the Lévy subordinator of the the stationary distribution D of It by setting −1 z R −1 OU process (5) whose one-dimensional stationary C(z) = exp(rdeg φ(ω)ω dω). C(z) leads 0 distribution is Gamma(α, β), is a CPP with inten- to D. sity parameter α · rdeg and mean jump size β −1 . More details and examples are shown in the To obtain this result, we follow the three-step pro- Appendix. Despite this apparently clear line cedure described above in reverse order. We start of action, finding a corresponding law D and with D = ˆ Gamma(α, β)n and transform its ocharac- process Lt is challenging without prior knowledge, −1 z φ(ω)ω −1 dω , using R teristic function to exp rdeg 0 e. g. if D is not well-known or Lt is only specified the characteristic function of D as given in the Ap- through the characteristic function of L1 . In the pendix, Definition 1: next section, we cast the NB distribution as an alternative distribution for which a subordinator −α can be derived. iz iz C(z) = 1− = exp −α log 1 − β β Negative binomial distribution: Deriving an Z z −1 explanatory bursting process. A widely con- = exp α dω 0 iβ + ω sidered model for scRNA-seq counts is the NB dis- Z z tribution. Like the above-employed PB distribu- iω = exp α dω tion, it accounts for overdispersion by modeling the (β − iω)ω 0Z z variance independently of the mean of the data. −1 β = exp rdeg α rdeg − 1 ω −1 dω Having one parameter less than PB, NB is an ap- 0 β − iω pealing choice. However, mechanistic models un- Z z −1 derlying the NB distributional assumption are un- = exp rdeg φ(ω)ω −1 dω 0 known. We aim to derive such a mechanistic model of transcription and degradation by reversing the β with φ(ω) = α rdeg β−iω − 1 and i the imaginary steps that led from the switching model to the PB distribution. For that purpose, an important fact number. Next, we use µ̂L1 (z) = exp(φ(z)) to obtain is that an NB distribution can be expressed as a β Poisson-gamma (PG) distribution, i. e. as a condi- µ̂L1 (z) = exp α rdeg −1 . (7) β − iz tional Poisson distribution with gamma distributed intensity parameter I. One has We aim to bring this into agreement with µ̂Lt (z), the time-dependent characteristic function of a gen- ˆ NB α, (β + 1)−1 PG(α, β) = (6) eral CPP Lt with intensity parameter λ. This is 5

bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. given by the transcription strength becomes infinitesimally large, the trajectory of Lswitch t turns into a step µ̂Lt (z) = exp(t λ(µ̂Y (z) − 1)), function as depicted in Figure 2 on the right. This limit is obtained if rdeact → ∞ and ron → ∞ in where Y is a random variable following the distribu- a way that needs to be specified. For that reason, tion of the jump sizes of the CPP, and µ̂Y is its char- we in the following seek to describe a mechanistic acteristic function (see Appendix, Definition 8). A model for the transition phase (Figure 2, middle) CPP with intensity λ = α · rdeg and i.i.d. expo- leading to the bursting model. nentially distributed increments Y ∼ Exp(β) with In the switching model, as soon as DNA becomes characteristic function µ̂Y (z) = β/(β − iz) yields active, a competition starts between the events the overall characteristic function transcription and deactivation. In addition, degra- β dation may happen, which will affect the intensity µ̂Lt (z) = exp tαrdeg −1 . process It and the number of mRNA molecules, β − iz but not the transcription process. If a transcrip- This is in accordance with µ̂L1 (z) as derived in tion event occurs, the competition between tran- Equation (7), and hence, a mathematically appro- scription and deactivation continues at the same priate subordinator is a CPP with intensity param- probability rates as before; the only affected event eter α · rdeg and mean jump size β −1 . probability is the one for degradation because this As a consequence, transcription is expressed via a probability depends on the current mRNA count. stochastic process Lt , namely the CPP, which expe- We now consider the following approximation of the riences jumps after exponentially distributed wait- switching model and call it the transition model : ing times. In contrast to the Lévy subordinators of When DNA becomes active, we allow the events the basic model, Lbasic = ron t, and of the switching transcription and deactivation to happen, but not switch Rtt model, Lt = 0 rswitch (s)ds, it possesses point- degradation. To correct for the missing degrada- wise discontinuous sample paths (Figure 2, right). tion events, we introduce a waiting time W after Intervals without any transcription activity seem to DNA deactivation in which only degradation can be disrupted by sudden explosions of mRNA num- occur, but no DNA activation. For appropriately bers. This burstiness led us to call the mechanism chosen rdeact → ∞ and ron → ∞, the approxima- behind the NB distribution the bursting model. We tion error will tend to zero. denote its subordinator by Lburst t and argue the bi- The number of transcription events S during one ological justification of the model in the Discussion active DNA phase is geometrically distributed with and Conclusion. success probability parameter rdeact /(rdeact + ron ). We aim to derive the mechanistic transcription pro- In the interpretation of the geometric distribution, cess of the bursting model in more detail. Specif- transcription events are considered as failures, de- ically, we tackle the distribution of burst sizes of activation as success. The waiting time W needs to mRNA counts. For this we look at a heuristic tran- accumulate the times before S transcriptions and sition from Lswitch t to Lburst t . one deactivation. Thus, W = T1 + · · · + TS + D, First, we dismantle the shape of the trajectories where Ti ∼ Exp(ron ), i = 1, . . . , S, are the sin- of Lswitch t . As depicted in Figure 2 on the left, such gle waiting times for each transcription event and a trajectory consists of alternating piecewise con- D ∼ Exp(rdeact ) is the waiting time till the next stant and piecewise linear parts. The constant parts DNA deactivation. appear during time intervals without transcription, Taken together, the bursting process can be con- i. e. where the DNA is inactive. The length of such sidered as the limiting process of the approxima- a time interval depends only on the rate ract of the tion of the switching process as ron → ∞ and switching model. Once the DNA switches into the rdeact → ∞ under the condition that the success active mRNA transcribing state, the time interval probability parameter of the geometric distribution, with transcription depends only on the rate rdeact . rdeact /(rdeact + ron ) stays constant. As the link be- The slope of the trajectory during this active DNA tween the switching model and PB distribution is state represents the transcription strength and de- known, and since PB converges towards NB under pends only on the parameter ron . certain conditions (see Appendix), we can connect In case the length of the time interval of active DNA the parameters of the bursting model with those of becomes infinitesimally small, and at the same time the NB distribution and CPP. 6

bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. Lévy Subordinators Switching model Transition model Bursting model transcription strength slope: intensity Approxi- Limit burst mation DNA DNA DNA waiting for off on off next burst Figure 2: The Lévy subordinator of the switching model is shown on the left by means of an exemplary trajectory. For small duration of the DNA being active and large transcription strength, its behavior can be approximately described by a step function as depicted in the middle (transition model). The limit of this approximation, with infinitesimally small DNA activation time interval and infinitesimally large transcription strength, leads to a trajectory of the subordinator of the bursting model which is shown on the right. That is, the bursting model can mechanistically be be added. The final number of balls in the bucket described as follows: After Exp(rburst )-distributed represents the number of mRNA molecules in a cell waiting times, a Geo((1 + sburst )−1 )-distributed at steady state. number of mRNAs are produced at once, where The above top-down derivation from the steady- sburst is the mean burst size (see also Golding state distribution to the mechanistic process has et al., 2005). As in the basic and switching to be motivated heuristically in parts. In the models, degradation events occur with Exp(rdeg )- Appendix we prove bottom-up that the above distributed waiting times. The just described mech- described mechanistic bursting model indeed leads anistic bursting model is shown in Figure 1D. It can to the steady-state NB distribution by directly equivalently be described by the OU process (5) calculating the master equation (see also Supple- with Lt being a CPP with Exp(rburst )-distributed mentary Figure S2). waiting times and Exp(sburst )-distributed jump sizes. Thus, in steady state, mRNA counts follow a Heterogeneity and dropout. The transcription PG(rburst /rdeg , sburst ) distribution or, equivalently, and degradation models considered so far describe a NB rburst /rdeg , (1 + sburst )−1 distribution if the the number of mRNA molecules for homogeneously bursting model is assumed. expressed genes that are actually present in a cell. The NB rburst /rdeg , (1 + sburst )−1 model, again, Real-world data is usually more complex: First, can be interpreted as follows (see also Appendix, cell populations may be heterogeneous. Second, Definition 3): Assume you have an empty bucket scRNA-seq measurements will be subject to mea- into which you put balls according to the follow- surement error. For example, they often contain a ing stochastic procedure. You perform a number large number of zeros. If a zero is due to techni- of independent Bernoulli trials, each with success cal error, it is called dropout. Regardless of what probability (1 + sburst )−1 . If there is a failure, you causes this phenomenon, a data model should take add one ball to the bucket. If there is a success, this property into account. We describe two model you do not do anything but count the success event. extensions here. You continue until there have been rburst /rdeg suc- Data that originates from different cell populations cesses. (For interpretation purposes, we here as- (in terms of different transcriptomic properties) can sume rburst /rdeg to be a whole-valued number.) be modeled mathmatically. If a population consists The larger sburst , the smaller the success proba- of e. g. two subpopulations, each of them is mod- bility, i. e. by expectation you will put more balls eled by one single distribution, D1 or D2 , parame- in the bucket for large sburst . Similarly, the larger terized via θ1 and θ2 , respectively. One assumes the ratio of rburst to rdeg , the more success events the mRNA count to be distributed according to will be waited for, thus the more balls will tend to pD1 (θ1 ) + (1 − p)D2 (θ2 ) with p ∈ [0, 1], that means: 7

bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. With probability p, the count distribution of that without additional zero inflation. In total, we in- cell is D1 , otherwise D2 . The corresponding mix- vestigate twelve different models as shown in Fig- ture density is given by ure 3. The numbers of parameters in these models are listed in Supplementary Table S3. f2mix (x; θ1 , θ2 , p) = p f1 (x; θ1 ) + (1 − p)f2 (x; θ2 ), Selected distributions after GOF where f1 and f2 are the densities of D1 and D2 , and x is the observed mRNA count. For k > 2 A Nestorowa et al.: subpopulations, the density can easily be general- 0 100 10,000 ized to a mixture of k distributions D1 , . . . , Dk with Pois NB PB probabilities p1 , . . . , pk : 1-pop 0 1,251 43 1,294 ZI-1-pop 0 1,043 2 1,045 fkmix (x; θ1 , · · · , θk , p1 , · · · , pk−1 ) = (8) ! 2-pop 1 7,248 159 7,408 k−1 X p1 f1 (x; θ1 ) + . . . + 1− pi fk (x; θk ). ZI-2-pop 1 427 0 428 i=1 2 9,969 204 10,175 The distributions Di can be any (ideally discrete count) distribution, possibly from different distri- bution families. B mm10:10x: An appropriate model for the occurrence of the 0 100 1,000 above-mentioned dropout is a zero-inflated distri- Pois NB PB bution (Kharchenko et al., 2014). For one homo- 1-pop 3,374 3,454 50 30 geneous population, the mRNA count will be dis- tributed according to p1{0} +(1−p)D(θ), with 1{0} ZI-1-pop 193 266 3 462 being the indicator function with point mass at 2-pop 111 98 58 267 zero, and the corresponding density function reads ZI-2-pop 12 8 0 20 fzi (x; θ, p) = p 1{0} (x) + (1 − p)f (x; θ), 366 3,746 91 4,203 where f is the density function of D. Analogously, Figure 3: Frequencies of chosen distributions via BIC- zero inflation can be added to a mixture of several after-GOF applied to real-world datasets: (A) Nestorowa distributions, see (8). mRNA counts are then dis- et al. (2016) (B) mm10:10x, Official 10x Genomics Support tributed according to (2017). k ! We estimate these twelve models on two real-world datasets: The first one stems from Nestorowa X p1 1{0} + p2 D1 (θ1 ) + . . . + 1− pi Dk (θk ). i=1 et al. (2016), contains 1,656 mouse HSPCs and was generated using the Smart-Seq2 (Picelli et al., The corresponding density function is given by 2014) protocol, and thus did not employ unique fzi-kmix (x; θ1 , · · · , θk , p1 , · · · , pk ) molecular identifiers (UMIs). The second dataset contains 3,356 homogeneous NIH3T3 mouse cells = p1 1{0} (x) + p2 f1 (x; θ1 ) + . . . ! and has been generated using the 10x Chromium k X technique (Zheng et al., 2017), thus incorporat- + 1− pi fk (x; θk ). ing UMIs. It is part of the publicly available i=1 10x dataset “6k 1:1 Mixture of Fresh Frozen Data application. We perform a comprehensive Human (HEK293T) and Mouse (NIH3T3) Cells” comparison of the considered mRNA count distri- (Official 10x Genomics Support, 2017). Here, we butions, that is the Poisson, NB and PB distribu- refer to this dataset as mm10:10x. tion, when applied to real-world data. Within each We applied a gene filter (see Appendix), estimated of the three distributions we further consider mix- the model parameters of the twelve considered tures of two populations (from identical distribu- models via maximum likelihood, and performed tion types but with different parameters) with and model selection as described in the Case Study via 8

bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. the Bayesian information criterion (BIC). Figure 3 profiles. The simulation study showed that an NB summarizes the frequencies of the chosen models distribution may be best suited even if the in silico across genes. We only display those choices where data had been generated from the switching model. the chosen distribution with estimated parameters Also in the real-data application, the NB distribu- was not rejected by a goodness-of-fit (GOF) test tion was chosen in most cases. In line with our (χ2 -test) at 5% significance level with multiple test- expectations, gene profiles of the non-UMI-based ing correction. dataset by Nestorowa et al. (2016) showed strong In the data from Nestorowa et al. (2016), 16,364 preference for a two-population mixture or zero- genes remained after filtering, of which 10,175 were inflated variant of the NB distribution. In con- not rejected by the GOF test. Figure 3A shows that trast, the mm10:10x dataset consists by construc- some variant of the NB distribution was chosen for tion of homogeneous cells, and 10x Chromium is not 98% of these genes. Among these, mRNA count known for large amounts of unexpected zeros in the numbers for many genes were best described by the measurements. Accordingly, the single-population zero-inflated NB distribution. However, an even NB distribution was sufficient for most gene profiles higher preference could be observed for the mixture here. For 9% of the considered genes in the m10:10x of two NB distributions. This can be explained by dataset, mRNA counts were most appropriately de- taking a closer look at the gene expression counts scribed by some form of the Poisson distribution. of the affected genes (see also Supplementary Fig- We have examined these 366 genes for functional ure S6): Most of those genes not only show many similarities; while estimated parameters show some zeros, but also many low non-zero counts, i. e. many apparent pattern (Supplementary Figure S5), we ones, twos etc., next to higher counts. Such ex- did not find any defining biological characteristics pression profiles are not covered by a simple zero- (Supplementary Figure S7). inflated model but prefer a mix of two distributions, Similar to us, Vieth et al. (2017) performed model one of them mapping to low expression values (see selection among Poisson, NB and PB distributions Supplementary Table S3). by BIC and GOF on several publicly available In the mm10:10x data, 4,203 genes remained after datasets. Although they used the method of Vu filtering and the GOF test. Figure 3B shows that et al. (2016) for which computation of the GOF for 89% of these genes, an NB distribution vari- statistics is impossible for the PB distribution, they ant was chosen as most appropriate model. How- still observed a tendency towards the NB distribu- ever, other than for the dataset from Nestorowa tion as preferred models. In our study, we represent et al. (2016), the standard NB distribution (for one the PB density in terms of the Kummer function, population, without zero inflation) was sufficient in which allows us to compute the GOF statistics ac- the majority of cases. We looked for commonali- cordingly. ties between the gene profiles that led to the same Different sequencing protocols might lead to distribution choice. Supplementary Figure S5 sug- differences in distributions and also might generate gests an interdependence between the chosen one- data of different magnitudes. Ziegenhain et al. population distributions and the range of the pa- (2017) applied various sequencing methods to cells rameter estimates. of the same kind to understand the impact of the Taken together, the NB distribution was chosen for experimental technique on the data. Based on the most gene profiles, either as a single distribution, a so-generated data, Chen et al. (2018) investigated mixture of two NB distributions or with additional differences in gene expression profiles between zero inflation. read-based and UMI-based sequencing technolo- gies. They concluded that, other than for read NB distribution as commonly chosen count counts, the NB distribution adequately models model. While the mechanistic models and their UMI counts. Townes et al. (2019) suggest to steady-state distributions describe actual mRNA describe UMI counts by multinomial distributions contents in single cells, real-world data underlies to reflect the nature of the sequencing procedure; technical variation in addition to biological com- for computational reasons, they propose to ap- plexity. We investigated in a simulation study proximate the multinomial density again by an (Box 1 and Figure 4) and on real-world data (Fig- NB density. Overall, the NB distribution appears ure 3) which distributions were most appropriate sufficiently flexible to hold independently of the among those considered to describe gene expression specific sequencing approach. 9

bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. Poisson-beta (PB) and the negative binomial (NB) R package scModels. We provide the R pack- distribution, on all generated datasets via maxi- age scModels which contains all functions needed mum likelihood estimation. To investigate which for maximum likelihood estimation of the consid- distribution explains the data best, we compute the ered distribution models. Three applications of the Bayesian information criterion Gillespie algorithm (Gillespie, 1976) allow synthetic data simulation (as used in the Case Study) via the BIC = −2`(θ̂) + log(n)dim(θ̂), basic, switching and bursting model, respectively. Implementations of the likelihood functions for the where ` represents the corresponding log-likelihood one-population case and two-population mixtures, function, θ the possibly multivariable parameter with and without zero inflation, allow inference of vector of the distribution, θ̂ its maximum likelihood the Poisson, NB and PB distributions. We pro- estimate, dim(θ̂) its dimension, i. e. the number of vide a new implementation of the PB density, based unknown scalar parameters, and n the sample size. on our novel implementation of the Kummer func- The distribution with lowest BIC value is consid- tion, also known as the generalized hypergeomet- ered most appropriate among all considered mod- ric series of Kummer. This became necessary, be- els. Afterwards, we apply a χ2 -test to assess the cause the existing R function (kummerM() con- goodness-of-fit (GOF) and neglect those datasets tained in package fAsianOptions) was only valid for for which the distribution fits are rejected at the specific parameter values, and hence, was not suited 5% significance level (without multiple testing cor- for optimization in continuous unconstrained space rection). This reduces the total number of 1,000 (more information in Appendix). Existing packages simulated datasets per model to the amounts dis- such as D3E (Delmans and Hemberg, 2016), imple- played in Figure 4A. mented in Python, and BPSC (Vu et al., 2016), im- We investigate whether the selected distributions plemented in R, use the PB distribution for scRNA- correspond to the distributions that arise from the seq data analysis but based on a different approxi- respective mechanistic models: For the datasets mation scheme (see Appendix). With our new im- generated from the basic model, model selection plementation of the PB density we did not overcome via BIC-after-GOF indeed prefers the Poisson dis- the problem of time-consuming calculation, but we tribution in most cases, independently of the used for the first time provided an implementation of the distribution parameter λ (Figure 4A, first bar, and Kummer function in R valid for all values required Figure 4B). In contrast, for datasets generated by inside the PB density. For a more detailed descrip- the switching model, BIC-after-GOF in big parts tion of scModels and a package comparison to D3E chooses either the NB or the PB distribution (Fig- and BPSC, see the Appendix. ure 4A, middle bar). The choice seems to depend on the employed rate parameters: Figure 4C indicates a tendency towards the PB distribution for low val- Case Study ues of β; otherwise, the NB distribution often seems In a simulation study, we generate in silico data to model the data generated by the switching model from the three considered mechanistic models: the sufficiently well. For datasets generated by the basic model (Figure 1B), the switching model (Fig- bursting model, BIC-after-GOF picks the NB dis- ure 1C), and the bursting model (Figure 1D), using tribution for the majority of the time without any the Gillespie algorithm implemented in scModels. obvious bias (Figure 4A/D). The study shows that, In order to choose realistic values for the rate pa- apparently, the NB distribution is complex enough rameters, we orient ourselves on experimental stud- to describe the data generated from the switching ies which aim to determine rates of the switching model. The BIC decides in many cases that a po- process in specific cases. For example, Suter et al. tentially better fit is not worth the extra effort for (2011) identify rates for so-called short-lived genes estimating an additional parameter in the PB dis- where mRNA and protein pulses are directly con- tribution. nected to one single on-and-off-switch of a gene. We provide an overview of the employed rate combina- Discussion and Conclusion tions in Supplementary Table S4. As a proof of concept, we estimate the three In this work, we derived a mechanistic model for corresponding distributions, i. e. the Poisson, the stochastic gene expression that results in the NB 10

bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. A Best model fits (Best BIC after GOF) C Data generated by switching model Best BIC 600 after GOF: ● Pois of PB basic 963 2 model 400 PB Pois NB switching 200 673 303 PB True model NB bursting 0 model 939 8 0 10 20 30 40 50 60 True of PB 0 200 400 600 800 1000 Frequencies 600 B Data generated by basic model of PB 400 ● ●● ●● ●● ● ●● ●● ●● ●● ●●● ●● ● ●● ● ●● ●● ●● ● ●● ●●● ●● ●● ●● ● ●●● ●● ● ●●● ●●● ●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●● ● ●●●●●●●●●●● ●● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● 200 True 0 0 500 1000 1500 2000 2500 True of Pois 0 500 1000 1500 2000 2500 True c of PB D Data generated by bursting model 2500 60 True r of NB True c of PB 40 1500 20 500 0 0 0.0 0.1 0.2 0.3 0.4 0.5 0 10 20 30 40 50 60 True p of NB True of PB Figure 4: Model selection on in silico data: (A) Frequencies of chosen distributions (Poisson, PB, NB) via BIC-after-GOF based on datasets generated by the three different transcription models (basic, bursting, switching). (B-D) Employed parameter values (indicated by horizontal/vertical position) and chosen distributions (indicated by color/symbol) for basic model (B), switching model (C) and bursting model (D). The names of the parameters correspond to those in Definitions 2, 3 and 5 in the Appendix. distribution as steady-state distribution for mRNA tions (Figures 1B,C). Here, we provide a possible content in single cells. According to the so-obtained explanation. bursting model, transcription happens in chunks, Second, we demonstrated how to generally link a rather than in a one-by-one production as com- probability distribution to an Ornstein-Uhlenbeck monly assumed in mechanistic modeling (Dattani (OU) process and derive a mechanistic model. This and Barahona, 2017). We discuss the biological brings a new field of mathematics to single-cell biol- plausibility of bursty transcription further below. ogy. The procedure can be used to deduce possible The consideration of the bursting model and its mechanistic processes leading to different steady- derivation is interesting from both practical and state distributions, exploiting the rich literature on theoretical points of view: OU processes from financial mathematics. First of all, the NB distribution is defined through Third, although we focused on the resulting two parameters whereas the PB distribution typ- steady-state distributions of the mechanistic mod- ically requires three parameters to be specified in els here, our mathematical framework also provides the current context. Therefore, the NB distribu- model descriptions in terms of stochastic processes. tion is computationally less elaborate to estimate, Nowadays, sequencing counts are commonly avail- given some data, than the PB distribution. Several able as snapshot data. However, time-resolved tools employ the NB distribution to parameterize measurements may become standard (Golding mRNA read counts (see Supplementary Table S1). et al., 2005), and in that case our models open up However, there has been no mechanistic biological the statistical toolbox of stochastic processes to model known so far leading to this distribution, extract information from interdependencies within other than for the Poisson and the PB distribu- single-cell time series. 11

bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. This criticism is underpinned by the work of Suter et al. (2011) who derived ranges of the rates Limiting cases of the switching model that of the switching model experimentally and by give rise to the NB distribution are biologi- calculations. Here, only so-called short-lived genes cally unrealistic. The NB and PB distributions were taken into account. Thus, observed mRNA have been linked before. Among others, Raj et al. half-lives were on a smaller scale, mainly between (2006) and Grün et al. (2014) have shown that 30 and 140 min, resulting in mRNA degradation the NB distribution is an asymptotic result of the rates between 0.005 min−1 and 0.023 min−1 . At switching model and the corresponding PB distri- the same time, deactivation rates were found bution (see Appendix). However, this result holds in the range between 0.1 min−1 and 0.6 min−1 . only under biologically unrealistic assumptions as Hence, their quotient is at maximum around we elaborate in the following. Our derivation of 120 and thus nowhere close to infinity. Another the NB steady-state distribution, in contrast, is mathematical assumption for deriving the NB limit based on a thoroughly realistic mechanism of bursty distribution was that the transcription rate needed transcription. The approach by Raj et al. (2006) to be smaller than the deactivation rate. This is and Grün et al. (2014) requires rdeact /rdeg → ∞ not confirmed by Suter et al. (2011) for most genes. and ron /rdeact < 1. That means, the deactiva- tion rate has to be substantially larger than the Biological plausibility of bursting model. mRNA degradation rate and, simultaneously, the Burst-like transcription has been discussed, e. g. transcription rate needs to be smaller than the gene Golding et al. (2005), Schwanhäusser et al. (2011) deactivation rate. Here, we discuss the plausibility and Suter et al. (2011). We take a look at the of these presumptions: inherent assumptions of the bursting model: The Schwanhäusser et al. (2011) showed that mRNA bursting rate rburst represents the waiting time half-life is in median around t1/2 = 9 h (range: until the DNA turns open for transcription in 1.61 h to 40.47 h), which results in a degra- addition to the time which the polymerase needs dation rate rdeg = log(2)/t1/2 of 0.077 h−1 = to transcribe. The model assumes that several 0.00128 min−1 (range: 0.00718 min−1 to polymerases attach simultaneously to the DNA −1 0.00029 min ). For rdeact /rdeg → ∞, the and terminate transcription at the same time. By mRNA degradation rate needs to become much simplifying this part of the transcription process smaller than the gene deactivation rate. Visual model, the problem of persisting DNA activation comparison shows that density curves of the PB during the whole transcription process in the and according NB distributions start to look switching model is avoided. similar for rdeact /rdeg ≈ 20, 000. Assuming a 20,000-fold larger gene deactivation rate results Practical relevance. There is no unambigu- in rdeact = 29.67 min−1 (range: 143.51 min−1 to ous answer to the question of the most appropri- 5.71 min−1 ). This means that on average the ate probability distribution for mRNA count data. gene switches approximately 30 times per minute Pragmatic reasons will often lead to NB distribu- into the off-state, i. e. on average the gene is tion as already employed by many tools (see Sup- in its active state for only two seconds. RNA plementary Table S1). However, the choice may polymerases proceed at 30 nt/sec (without pausing depend on experimental techniques, the statisti- at approximately 70 nt/sec) (Darzacq et al., 2007). cal analysis to be performed, and also differ be- Genes have a length of hundreds to thousands of tween genes within the same dataset. For large read nucleotides. Thus, according to the above numbers, counts, even continuous distributions may be most genes cannot be transcribed in such short phases. suitable. The DNA needs to stay active during the whole While statistics quantifies which model is the most transcription process of one (or more) mRNAs; plausible one from the data point of view, mathe- as soon as the DNA turns inactive, all currently matical modelling points out which biological as- running transcriptions are stopped. In other words, sumptions may implicitly be made when a par- although the NB distribution can mathematically ticular distribution is used. Importantly, while be derived as a limiting steady-state distribution the mechanistic model leads to a unique steady- of the switching model, this entails biologically state distribution, the reverse conclusion is not true. implausible assumptions. In general, the basic model and the correspond- 12

bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. ing Poisson distribution may appear too simple in ∗ OU PROCESS DERIVATION FOR most cases (both with respect to biological plausi- BASIC MODEL bility and the ability to describe measured sequenc- – MASTER EQUATION OF THE ing data). The switching and bursting models are BURSTING MODEL harder to distinguish. From the mathematical point of view, their densities are of similar shape, such – R PACKAGE scModels that the less complex NB model will often be pre- ∗ BPSC ferred. Answering the question from the biological ∗ D3E perspective may require measuring mRNA gener- ∗ scModels ation at a sufficiently small time resolution (e. g. Golding et al., 2005) to see whether several mRNA ∗ COMPARISON OF scModels WITH molecules are generated at once (bursting model) or D3E AND BPSC in short successional intervals (switching model). – DATA APPLICATION Taken together, we have identified mechanistic ∗ GENE FILTERING models for mRNA transcription and degradation ∗ ESTIMATION OF ONE- with good interpretability, and established a link POPULATION MODELS to mathematical representations by stochastic pro- cesses and steady-state count distributions. Specif- ∗ BLOOD DIFFERENTIATION ically, the commonly used NB model is supplied MARKER GENES with a proper mechanistic model of the underlying ∗ GO TERMS biological process. The R package scModels over- – OVERVIEW OF SINGLE-CELL ANAL- comes a previous shortcoming in the implementa- YSIS TOOLS tion of the PB density. It provides a full toolbox for data simulation and parameter estimation, equip- • DATA AND SOFTWARE AVAILABILITY ping users with the freedom to choose their mod- els based on content-related, design-based or purely – Case Study: Simulated data pragmatic motives. – Scripts – Software Appendix Detailed methods are provided and include the fol- Supplemental Information lowing: Supplemental Information includes seven figures • OVERVIEW TOOLS TABLE and four tables which can be found at the end of this paper. • METHOD DETAILS – DEFINITIONS AND IDENTITIES Author Contributions – NEGATIVE BINOMIAL CORRE- The study was designed by LA and CF. LA devel- SPONDS TO POISSON-GAMMA oped and performed the mathematical analysis and – POISSON-BETA CONVERGES TO- software development with help of KH and CF. LA WARDS NEGATIVE BINOMIAL and CF wrote the paper. – MASTER EQUATION OF THE GEN- ERALIZED MODEL Acknowledgments ∗ DETERMINISTIC CONTINUOUS TRANSCRIPTION MODEL Our research was supported by the German Re- search Foundation within the SFB 1243, Subproject ∗ BASIC MODEL A17, by the German Federal Ministry of Education ∗ SWITCHING MODEL and Research under grant number 01DH17024, and – OU PROCESSES LINK SDES TO by the National Institutes of Health under grant STEADY-STATE DISTRIBUTIONS number U01-CA215794. 13

bioRxiv preprint first posted online Jun. 3, 2019; doi: http://dx.doi.org/10.1101/657619. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. References Huang, M., Wang, J., Torre, E., Dueck, H., Shaffer, S., Bona- sio, R., Murray, J. I., Raj, A., Li, M. and Zhang, N. R. Adan, I. and Resing, J. (2002). Queueing theory. Eindhoven (2018). SAVER: gene expression recovery for single-cell University of Technology Eindhoven. RNA sequencing. Nature Methods 15, 539–542. Andrews, T. S. and Hemberg, M. (2018). M3Drop: dropout- Intosalmi, J., Mannerstrom, H., Hiltunen, S. and Lahdes- based feature selection for scRNASeq. Bioinformatics maki, H. (2018). SCHiRM: Single Cell Hierarchical Re- bty1044. gression Model to detect dependencies in read count data. Barndorff-Nielsen, O. E., Resnick, S. I. and Mikosch, T., eds BioRxiv preprint. (2001). Lévy Processes. Birkhäuser Boston, Boston, MA. Karlis, D. and Xekalaki, E. (2005). Mixed poisson distribu- DOI: 10.1007/978-1-4612-0197-7. tions. International Statistical Review 73, 35–58. Barndorff-Nielsen, O. E. and Shephard, N. (2001). Non- Kharchenko, P. V., Silberstein, L. and Scadden, D. T. (2014). Gaussian Ornstein-Uhlenbeck-based models and some of Bayesian approach to single-cell differential expression their uses in financial economics. Journal of the Royal analysis. Nature Methods 11, 740–742. Statistical Society: Series B (Statistical Methodology) 63, Kim, J. K. and Marioni, J. C. (2013). Inferring the kinet- 167–241. ics of stochastic gene expression from single-cell RNA- Brent, R. P. (2010). Unrestricted algorithms for elementary sequencing data. Genome biology 14, R7. and special functions. arXiv preprint. Li, W. V. and Li, J. J. (2018). An accurate and robust im- Chen, W., Li, Y., Easton, J., Finkelstein, D., Wu, G. and putation method scImpute for single-cell RNA-seq data. Chen, X. (2018). UMI-count modeling and differential ex- Nature Communications 9. pression analysis for single-cell RNA sequencing. Genome Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. and Yosef, Biology 19. N. (2018). Deep generative modeling for single-cell tran- Darzacq, X., Shav-Tal, Y., de Turris, V., Brody, Y., Shenoy, scriptomics. Nature Methods 15, 1053–1058. S. M., Phair, R. D. and Singer, R. H. (2007). In vivo Muller, K. E. (2001). Computing the confluent hypergeo- dynamics of RNA polymerase II transcription. Nature metric function, M ( a,b,x ). Numerische Mathematik Structural & Molecular Biology 14, 796–806. 90, 179–196. Dattani, J. and Barahona, M. (2017). Stochastic models of Nestorowa, S., Hamey, F. K., Pijuan Sala, B., Diamanti, E., gene transcription with upstream drives: exact solution Shepherd, M., Laurenti, E., Wilson, N. K., Kent, D. G. and sample path characterization. Journal of The Royal and Gottgens, B. (2016). A single-cell resolution map of Society Interface 14, 20160833. mouse hematopoietic stem and progenitor cell differenti- Delmans, M. and Hemberg, M. (2016). Discrete distribu- ation. Blood 128, e20–e31. tional differential expression (D3E) - a tool for gene ex- Olver, F. W. J., Olde Daalhuis, A. B., Lozier, D. W., Schnei- pression analysis of single-cell RNA-seq data. BMC Bioin- der, B. I., Boisvert, F., Clark, C. W., Miller, B. R. and formatics 17. Saunders, B. V. (2019). NIST Digital Library of Mathe- Dormann, C. F. (2013). Parametrische Statistik. Springer matical Functions. Release 1.0.22 of 2019-03-15. Berlin Heidelberg, Berlin, Heidelberg. Paul, F., Arkin, Y., Giladi, A., Jaitin, D. A., Kenigsberg, Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. and E., Keren-Shaul, H., Winter, D., Lara-Astiaso, D., Gury, Theis, F. J. (2019). Single-cell RNA-seq denoising using M., Weiner, A., David, E., Cohen, N., Lauridsen, F. K. B., Haas, S., Schlitzer, A., Mildner, A., Ginhoux, F., Jung, S., a deep count autoencoder. Nature Communications 10. Finak, G., McDavid, A., Yajima, M., Deng, J., Gersuk, V., Trumpp, A., Porse, B. T., Tanay, A. and Amit, I. (2015). Transcriptional Heterogeneity and Lineage Commitment Shalek, A. K., Slichter, C. K., Miller, H. W., McElrath, M. J., Prlic, M., Linsley, P. S. and Gottardo, R. (2015). in Myeloid Progenitors. Cell 163, 1663–1677. Peccoud, J. and Ycart, B. (1995). Markovian Modeling of MAST: a flexible statistical framework for assessing tran- scriptional changes and characterizing heterogeneity in Gene-Product Synthesis. Theoretical Population Biology 48, 222–234. single-cell RNA sequencing data. Genome Biology 16. Gillespie, D. T. (1976). A general method for numerically Picelli, S., Faridani, O. R., Björklund, s. K., Winberg, G., Sagasser, S. and Sandberg, R. (2014). Full-length RNA- simulating the stochastic time evolution of coupled chemi- cal reactions. Journal of Computational Physics 22, 403– seq from single cells using Smart-seq2. Nature Protocols 434. 9, 171–181. Golding, I., Paulsson, J., Zawilski, S. M. and Cox, E. C. Pierson, E. and Yau, C. (2015). ZIFA: Dimensionality reduc- (2005). Real-Time Kinetics of Gene Activity in Individual tion for zero-inflated single-cell gene expression analysis. Bacteria. Cell 123, 1025–1036. Genome Biology 16. Graham, R. L., Knuth, D. E. and Patashnik, O. (2017). Con- Qiu, X., Hill, A., Packer, J., Lin, D., Ma, Y.-A. and Trapnell, crete mathematics: a foundation for computer science. C. (2017). Single-cell mRNA quantification and differen- 2. ed., 31. print edition, Addison-Wesley, Upper Saddle tial analysis with Census. Nature Methods 14, 309–315. River, NJ. OCLC: 993616132. Raj, A., Peskin, C. S., Tranchina, D., Vargas, D. Y. and Grün, D., Kester, L. and van Oudenaarden, A. (2014). Vali- Tyagi, S. (2006). Stochastic mRNA Synthesis in Mam- dation of noise models for single-cell transcriptomics. Na- malian Cells. PLoS Biology 4, e309. ture Methods 11, 637–640. Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. and Vert, Hafemeister, C. and Satija, R. (2019). Normalization and J.-P. (2018). A general and flexible method for signal ex- variance stabilization of single-cell RNA-seq data us- traction from single-cell RNA-seq data. Nature Commu- ing regularized negative binomial regression. bioRxiv nications 9. preprint. Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, Haghverdi, L., Büttner, M., Wolf, F. A., Buettner, F. and W. and Smyth, G. K. (2015). limma powers differential Theis, F. J. (2016). Diffusion pseudotime robustly recon- expression analyses for RNA-sequencing and microarray structs lineage branching. Nature Methods 13, 845–848. studies. Nucleic Acids Research 43, e47–e47. 14

Pages you may also like

You can also read