A single-cell expression simulator guided by gene regulatory networks - bioRxiv

bioRxiv preprint first posted online Jul. 28, 2019; doi: http://dx.doi.org/10.1101/716811. The copyright holder for this preprint (which
       was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                                       All rights reserved. No reuse allowed without permission.

      A single-cell expression simulator guided by gene regulatory
      networks

      Payam Dibaeinia1, Saurabh Sinha1,2,3*

      1 Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
      2 Carl R. Woese Institute of Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
      3 Cancer Center at Illinois, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA

      * To whom correspondence should be addressed. Tel: 217-333-3233; Email: sinhas@illinois.edu

      Abstract
      A common approach to benchmarking of single-cell transcriptomics tools is to generate
      synthetic data sets that resemble experimental data in their statistical properties.
      However, existing single-cell simulators do not incorporate known principles of
      transcription factor-gene regulatory interactions that underlie expression dynamics.
      Here we present SERGIO, a simulator of single-cell gene expression data that models
      the stochastic nature of transcription as well as linear and non-linear influences of
      multiple transcription factors on genes according to a user-provided gene regulatory
      network. SERGIO is capable of simulating any number of cell types in steady-state or
      cells differentiating to multiple fates according to a provided trajectory, reporting both
      unspliced and spliced transcript counts in single-cells. We show that data sets
      generated by SERGIO are comparable with experimental data in terms of multiple
      statistical measures. We also illustrate the use of SERGIO to benchmark several
      popular single-cell analysis tools, including GRN inference methods.

      Introduction
      Single-cell transcriptomics technologies are revolutionizing biology today1–4, and have
      led to the rapid development of computational tools for analyzing the resulting data
      sets5–8. These tools, developed for a wide array of tasks such as clustering9–11,
      trajectory inference12,13 and gene regulatory network (GRN) reconstruction9,14,15, as well
      as pre-processing operations such as imputation16–18, adopt complementary strategies
      whose relative merits and weaknesses are not clear a priori. In some cases, single-cell
      data sets annotated using domain knowledge19,20 allow objective evaluations of different
      strategies, but this is not a scalable approach to systematic benchmarking. A promising
      alternative approach is to synthesize single-cell expression data sets that mimic real
      data in their statistical properties and for which underlying biological relationships are
      known by construction.

      Simulation tools (“simulators”) for single-cell expression data have been reported in
      various forms. Several studies offering novel analysis tools use in-house simulators to
      benchmark those tools8,21–26, while other studies specifically develop simulators for use
      by the community27–32. Most of these simulators are geared towards capturing the noise
      characteristics of technologies such as single-cell RNA-seq (scRNA-seq), by first
      estimating statistical quantities describing real data sets and then sampling single-cell
      expression profiles from probability distributions that mirror those quantities. A crucial
      aspect of biology missing in current simulators is the gene regulatory network (GRN):
      the set of transcription factor (TF)-gene relationships that underlies the dynamics and
      steady states of gene expression in each cell. We believe it is imperative that a single-
      cell expression simulator be guided by an underlying GRN, not only because of the
      biological realism that it represents, but also because this is the only direct way to
      benchmark tools specifically designed for GRN reconstruction. Some existing tools do
      attempt to induce gene-gene relationships in synthetic data using multi-gene statistical
      models for sampling purposes28,33, but these attempts do not incorporate the special
      properties of gene regulatory processes that have been reported in the literature34–37,
      including non-linear response to TFs, intrinsic fluctuations in expression and
      propagation of such “biological noise” along the GRN.

      In the realm of “bulk” transcriptomics, GRN-driven simulations are already the norm, as
      exemplified by the simulation tool called GeneNetWeaver (GNW)38, which was used in a
      community-wide effort to benchmark numerous GRN reconstruction tools39–42. GNW is
      not meant to simulate scRNA-seq data, and though some studies have employed
      workarounds to use it for this purpose14,43, it is believed that such synthetic data do not
      exhibit the statistical characteristics of contemporary single-cell data sets43.
      Furthermore, such workarounds do not offer key features necessary for a single cell
      expression simulator, such as simulation of multiple cell types and cells differentiating
      from one cell type to another.

      In this work, we develop a simulator tool that (1) uses a principled mathematical
      description of transcriptional regulatory processes to synthesize single-cell expression
      data associated with a specified GRN, (2) includes stochasticity of gene expression as
      an integral part of the process, thus capturing biological noise expected to manifest in
      cell-to-cell variability, and (3) incorporates various types of measurement errors
      (“technical noise”) that are typical of single-cell technologies. The new tool, called
      SERGIO (Single-cell ExpRession of Genes In silico), is freely available as a stand-alone
      software package. It borrows some of its modeling assumptions from the widely used
      GNW simulator, but relinquishes the more complex features of GNW, such as a
      thermodynamics-based model of regulation and explicit modeling of translation
      processes, which would have necessitated use of poorly-understood parameters during
      simulation and slowed down simulations of large GRNs.

      SERGIO uses a stochastic differential equation (SDE) called the chemical Langevin
      equation44 to simulate a gene’s expression dynamics as a function of the changing (or
      fluctuating) levels of its regulators (TFs), as prescribed by a fixed GRN. It performs such
      simulations for any pre-specified number of genes in parallel, and generates single-cell
      expression “profiles” (expression values of all genes) by sampling from these temporal
      simulations in steady-state, thus mimicking established cell types. It allows users to
      specify the number of cell types to be simulated, via steady-state levels of a few
      “master” regulators in the GRN. SERGIO also allows users to simulate single-cell
      expression data from a specified differentiation program, for which it samples cells from
      transient portions of temporal simulations. In this simulation mode, SERGIO explicitly
      models the splicing step with an additional SDE, resulting in simulations of unspliced
      and spliced transcript levels. SERGIO subjects the synthesized expression data to a
      multi-step transformation where technical noise is incorporated in a manner reflecting
      real scRNA-seq data. To our knowledge, SERGIO is the first stand-alone simulator tool
      for single-cell transcriptomics that offers all of the above-mentioned features while
      basing its simulations on a given GRN. Here, we outline key aspects of its model and
      implementation and show that it may be used to generate realistic data sets that
      resemble an experimental scRNA-seq data set by several statistical measures. We then
      showcase its use to benchmark a number of popular single-cell analysis tools. We find
      that while modern tools are able to accurately identify cell types and differentiation
      trajectories from suitable data sets, their ability to reconstruct gene regulatory
      relationships remains severely limited.

      Results
      We developed SERGIO to simulate how expression values of a specified number of
      genes vary from cell to cell under the control of a given GRN, and how such information
      is captured in modern single-cell RNA-seq data sets. We first simulate “clean” gene
      expression data based on the GRN and mathematical models of transcriptional
      processes, including stochasticity of such processes (“biological noise”). We then add
      “technical noise” to the clean data, mimicking the nature of measurement errors
      attributed to scRNA-seq technology45.

      Simulation of “clean” data
      We generate expression profiles of single cells by sampling them from the steady state
      of a dynamical process that involves genes expressing at rates influenced by other
      genes (transcription factors) (Figure 1). A select few of the genes are pre-designated as
      master regulators (MRs); these have no regulatory inputs in the GRN and their
      expression evolves over time under constant production and decay rates (see
      Methods). Expression of every other gene (non-MR) evolves under a production rate
      determined by adding contributions from its GRN-specified regulators (equation 5 in
      Methods) and a constant decay rate. Each regulator’s contribution to a gene depends
      on the former’s current concentration and an interaction parameter (strength of
      activation or repression) specific to the regulator and regulated gene. This dependence
      is described by a Hill function46, thus allowing for non-linear effects.
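
      The additive, Hill-function-based production rate described above can be sketched as
      follows. This is a minimal illustration, not SERGIO's implementation: the function names,
      the sign convention (positive strength for activation, negative for repression) and the
      half-response and Hill-coefficient parameters are assumptions made for the example.

```python
def hill(x, half_response, n=2.0):
    """Hill function: fraction of maximal response at regulator level x."""
    if x <= 0:
        return 0.0
    return x**n / (half_response**n + x**n)


def production_rate(regulator_levels, interactions):
    """Sum the contributions of all regulators of one target gene.

    interactions maps regulator name -> (strength, half_response, n).
    Assumed convention: strength > 0 is activation, strength < 0 is repression.
    """
    rate = 0.0
    for name, (strength, half_response, n) in interactions.items():
        h = hill(regulator_levels[name], half_response, n)
        if strength >= 0:
            rate += strength * h                # activator: rises with regulator level
        else:
            rate += abs(strength) * (1.0 - h)   # repressor: falls with regulator level
    return rate


# Hypothetical target gene with one activator and one repressor.
levels = {"tf_a": 4.0, "tf_b": 1.0}
grn_inputs = {"tf_a": (3.0, 4.0, 2.0), "tf_b": (-2.0, 1.0, 2.0)}
rate = production_rate(levels, grn_inputs)
print(rate)
```

      Because each contribution saturates as the regulator level grows, doubling a regulator's
      concentration does not in general double its effect on the target.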

      Each gene’s time course is simulated while incorporating biological noise, using the
      chemical Langevin equation44, as adopted in the GeneNetWeaver (GNW) simulator38.
      Once the system of evolving expression profiles reaches steady state, we sample
      profiles from randomly selected time points. Variation in expression profiles across cells
      of the same type is assumed to mimic variation across time points in the steady state
      (the “ergodic assumption”47), hence the temporally sampled cells are used as the
      collection of cells in the synthetic data.
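
      A minimal sketch of this procedure for a single master regulator is given below. It
      integrates an assumed form of the chemical Langevin equation with the Euler-Maruyama
      scheme and then samples "cells" from random time points in the steady-state portion of
      the trajectory. The noise amplitude q, the step size and the other parameter values are
      illustrative assumptions, not SERGIO defaults.

```python
import math
import random


def simulate_cle(production, decay, q=1.0, dt=0.01, n_steps=20000, x0=0.0, seed=0):
    """Euler-Maruyama integration of an assumed chemical Langevin equation:

        dx = (P - decay*x) dt + q*sqrt(P) dW1 + q*sqrt(decay*x) dW2
    """
    rng = random.Random(seed)
    x = x0
    trace = []
    for _ in range(n_steps):
        drift = production - decay * x
        noise = q * (math.sqrt(production) * rng.gauss(0.0, 1.0)
                     + math.sqrt(decay * x) * rng.gauss(0.0, 1.0)) * math.sqrt(dt)
        x = max(x + drift * dt + noise, 0.0)  # expression cannot be negative
        trace.append(x)
    return trace


trace = simulate_cle(production=20.0, decay=1.0)
steady = trace[len(trace) // 2:]            # discard the initial transient
cells = random.Random(1).sample(steady, 300)  # ergodic assumption: time points stand in for cells
mean_level = sum(cells) / len(cells)
print(round(mean_level, 2))                 # should lie near production / decay
```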

      Specifying the fixed production rates of MRs determines the average steady state
      expression profile of the sampled cells, and is used to generate data for a single cell
      type. In order to synthesize a data set with multiple cell types, the above simulation is
      repeated using different settings of MR production rates. The aggregate of expression
      profiles sampled across all simulations forms the “clean” synthetic data set.

      Incorporation of technical noise
      We then use the clean data to simulate integer-valued “count” data, as are produced in
      current scRNA-seq technologies, by sampling from a Poisson distribution whose mean
      is the real-valued expression level. However, prior to this conversion, the real-valued
      expression data matrix (genes x cells) is operated upon by modules that incorporate
      three different types of technical noise. The statistical details of these modules are
      borrowed from the Splatter simulation tool32 and re-implemented in SERGIO (see
      Methods).
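
      The order of operations (noise modules first, Poisson sampling last) can be sketched as
      below. The three modules themselves are specified in Methods and are not reproduced
      here; for illustration the sketch assumes only two hypothetical ones, a log-normal
      per-cell scaling that perturbs library size and a logistic, expression-dependent
      dropout. All parameter names and values are assumptions.

```python
import math
import random


def poisson(lam, rng):
    """Knuth's Poisson sampler; adequate for the small means used here."""
    if lam <= 0:
        return 0
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def add_technical_noise(clean, rng, libsize_sigma=0.4, dropout_mid=1.0, dropout_shape=2.0):
    """clean: list of genes, each a list of real-valued levels (one per cell)."""
    n_cells = len(clean[0])
    # Hypothetical per-cell scaling factor, mimicking library-size variation.
    scale = [rng.lognormvariate(0.0, libsize_sigma) for _ in range(n_cells)]
    counts = []
    for gene in clean:
        row = []
        for level, s in zip(gene, scale):
            scaled = level * s
            # Logistic dropout: lowly expressed values are zeroed with higher probability.
            z = dropout_shape * (math.log(scaled + 1e-9) - math.log(dropout_mid))
            p_drop = 1.0 / (1.0 + math.exp(z))
            # Final step: convert the real-valued level to an integer count.
            row.append(0 if rng.random() < p_drop else poisson(scaled, rng))
        counts.append(row)
    return counts


clean = [[0.2] * 500, [8.0] * 500]          # one lowly and one highly expressed gene
counts = add_technical_noise(clean, random.Random(0))
zeros_low, zeros_high = counts[0].count(0), counts[1].count(0)
print(zeros_low, zeros_high)
```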

      SERGIO simulates realistic data sets
      We used SERGIO to generate synthetic data sets under three different settings of the
      underlying GRN, referred to as “data set 1” (DS1), “data set 2” (DS2) and “data set 3”
      (DS3). These three settings use GRNs with 100, 400 and 1200 genes respectively, that
      were sampled from real regulatory networks in E. coli or S. cerevisiae (Table 1); all
      simulations included 300 cells for each of 9 cell types, for a total of 2700 single cells.
      Each data set was synthesized in 15 “replicates” by re-executing SERGIO with identical
      parameters multiple times. We sought to compare statistical properties of these
      synthetic data sets to a published data set from mouse brain comprising expression
      profiles of cells that are categorized into nine cell types with high confidence48,
      henceforth called the “real data set”. We thus configured SERGIO to introduce technical
      noise in the simulated expression profiles, to an extent that matches the real data set.
      This was done through manual iteration of the technical noise parameters (see
      Methods). For each simulation setting we sampled a comparison data set from the real
      data to have the same number of genes, repeating this 50 times to obtain 50 replicates
      of the (sampled) real data set, each of which was compared to the 15 replicates of the
      corresponding synthetic data set. We performed our comparisons using synthetic data
      with and without technical noise, referred to as the “noisy” and “clean” forms of the data
      set.

      We compared several commonly used summary statistics between each synthetic data
      set and a matching real data set (Figure 2). These include two cell-level statistics –
      “library size” and “zero count per cell” (number of genes with zero recorded expression
      in a cell) – and three gene-level statistics – “zero count per gene” (number of cells in
      which a gene has zero recorded expression), “mean count” and “variance count” (mean
      and variance of expression of genes). As shown in Figure 2, there is strong qualitative
      agreement between real and synthetic (noisy) data sets in terms of each of these five
      statistics. As expected, the clean form of each synthetic data set has substantially
      different statistical properties from real data. (For a more intuitive interpretation of the
      “total variation” metric used to compare distributions, see Supplementary Figure S2.)
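
      The five summary statistics, and a histogram-based total variation distance for
      comparing their distributions between two data sets, can be computed as in the sketch
      below. The binning scheme is an assumption made for illustration; the exact procedure
      used for Figure 2 may differ.

```python
def summary_stats(counts):
    """counts: list of genes, each a list of integer counts (one per cell)."""
    n_cells = len(counts[0])
    cells = list(zip(*counts))  # transpose to cells x genes
    means = [sum(g) / n_cells for g in counts]
    return {
        "library_size": [sum(c) for c in cells],
        "zeros_per_cell": [sum(1 for v in c if v == 0) for c in cells],
        "zeros_per_gene": [sum(1 for v in g if v == 0) for g in counts],
        "mean_count": means,
        "variance_count": [sum((v - m) ** 2 for v in g) / n_cells
                           for g, m in zip(counts, means)],
    }


def total_variation(sample_a, sample_b, n_bins=20):
    """Half the L1 distance between two histograms built on shared bins."""
    lo = min(min(sample_a), min(sample_b))
    hi = max(max(sample_a), max(sample_b))
    width = (hi - lo) / n_bins or 1.0  # guard against identical constant samples

    def hist(sample):
        h = [0] * n_bins
        for v in sample:
            h[min(int((v - lo) / width), n_bins - 1)] += 1
        return [c / len(sample) for c in h]

    return 0.5 * sum(abs(p - q) for p, q in zip(hist(sample_a), hist(sample_b)))


toy = [[0, 2, 1], [3, 0, 0]]  # 2 genes x 3 cells
stats = summary_stats(toy)
print(stats["library_size"], stats["zeros_per_gene"])
```

      A distance of 0 means the two histograms coincide and a distance of 1 means they do not
      overlap at all.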

      An empirical observation about scRNA-seq data reported in the literature is that there is
      an inverse relationship between the number of zeros in the recorded expression of a
      gene and its mean expression level across cells49,50. This inverse relationship is clearly
      seen in our (noisy) synthetic data sets and their corresponding real data sets (Figure
      2k,l), and arises not only because genes with lower expression levels are more likely to
      result in sampled zero counts, but also because the simulator creates “dropouts” (a form
      of technical noise) with higher probability for such genes. Similarly, an inverse
      relationship between the coefficient of variation (CV) – a common measure of
      expression noise – and mean expression of a gene has been extensively discussed in
      the literature51–53. Figure 2m shows the existence of this relationship in a representative
      synthetic data set as well as in a corresponding real data set. It is not the result of
      adding technical noise, and is present in the clean synthetic data sets as well (Figure
      2n). It arises naturally from the gene regulatory model implemented in SERGIO, in
      contrast to other single cell simulators that explicitly add such a relationship to their
      statistical sampling procedures32. In other words, the synthetic data sets generated by
      SERGIO not only exhibit realistic distributions of key summary statistics (Figure 2a-j),
      they also exhibit second-order relationships between pairs of variables that are
      characteristic of real data sets (Figure 2k-n).
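
      One way to see why the CV-mean relationship can emerge without being imposed is a
      short linear-noise calculation for the simplest case, a gene with constant production
      and decay. The form of the noise terms and the amplitude q below are assumptions made
      for this illustration and are not taken from the Methods.

```latex
% Assumed chemical Langevin equation for a gene with constant production
% rate P and decay rate \lambda (q is an illustrative noise amplitude):
dx = (P - \lambda x)\,dt + q\sqrt{P}\,dW_1 + q\sqrt{\lambda x}\,dW_2

% Steady-state mean:
\mu = \mathbb{E}[x] = \frac{P}{\lambda}

% Linearising the noise around \mu gives an Ornstein-Uhlenbeck process with
% total noise intensity q^2 (P + \lambda\mu) = 2 q^2 P, so that
\mathrm{Var}[x] \approx \frac{2 q^2 P}{2\lambda} = q^2 \mu,
\qquad
\mathrm{CV} = \frac{\sqrt{\mathrm{Var}[x]}}{\mu} \approx \frac{q}{\sqrt{\mu}}
```

      Under these assumptions the CV falls as the inverse square root of the mean, so more
      highly expressed genes are relatively less noisy.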

      Simulated data exhibit cell heterogeneity similar to real data
      Motivated by the growing use of single cell RNA-seq data to characterize cellular
      heterogeneity in biological samples, we next asked if the synthetic data sets from
      SERGIO exhibit heterogeneity similar to real ones. We first used Principal Components
      Analysis (PCA) to reduce each cell’s representation to 10 dimensions and then used the
      popular tSNE algorithm to plot cells in two dimensions. Figures 3a and 3b show such
      tSNE plots for a representative synthetic data set (in the DS3 setting) in their clean and
      noisy forms respectively. It is clear that in the absence of technical noise the nine cell
      types (as specified during simulation) are highly distinguishable, and that the noisy data
      sets smear this visual separability significantly.

      However, cell type detection in practice does not rely only on visual separation, and
      specialized high-dimensional clustering algorithms are being developed for the purpose.
      One such algorithm is SC3 (ref. 11), which has been shown to have high accuracy for the task.
      It was used by Aibar et al.9 to cluster mouse cortex cells in the “real data set” of our
      study48 and the clusters were found to be very similar to the true cell types present in
      the sample (Adjusted Rand Index, ARI, of ~0.8). If our synthetic data sets exhibit similar
      levels of cellular heterogeneity as the real set, then we expect SC3-reported clusters to
      have similar levels of concordance with “true” cell types as known to the simulator.
      Figure 3c shows the composition of nine clusters found by SC3 on the (noisy) synthetic
      data set visualized in Figure 3b, in terms of the true cell types present in each cluster.
      We note that seven of the nine reported clusters predominantly comprise cells of one
      (distinct) type, and only two of the clusters are of mixed composition, thus suggesting a
      high accuracy of clustering. To make this observation more formal, we computed the
      Adjusted Rand Index (ARI) between SC3-reported clusters and true cell types for each
      of the 15 replicates of the DS3 data set, noting an average ARI of 0.78. We repeated
      this for each of the 50 sampled subsets of the real data set corresponding to DS3
      settings, and found the average ARI to be 0.80, very close to that seen in synthetic
      data. This exercise demonstrates that synthetic data sets generated by SERGIO exhibit
      realistic levels of cellular heterogeneity, and also illustrates the use of SERGIO to
      benchmark clustering methods.
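
      For reference, the Adjusted Rand Index can be computed from the contingency table of
      true and predicted labels as sketched below. This is a plain standard-library version
      written for illustration; the labels in the example are made up.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from pair counts: 1 for identical partitions, about 0 for random ones."""
    n = len(labels_true)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:                     # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Cluster names are arbitrary, so a relabelled perfect clustering still scores 1.
perfect = adjusted_rand_index([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(perfect, crossed)
```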

      Benchmarking GRN reconstruction methods
      We next illustrate how the data synthesized by SERGIO can serve to benchmark GRN
      reconstruction tools. In our first tests we worked with clean data sets generated by
      SERGIO, reasoning that these should provide an upper bound for performance on noisy
      realistic data sets. We evaluated the popular GRN inference algorithm called GENIE3 (ref. 54),
      which was originally developed for analyzing bulk RNA-seq data but has since been
      used successfully on single cell data as well. We applied GENIE3 on the (clean) data
      sets DS1 (100 genes) and DS3 (1200 genes) and evaluated the predicted TF-gene
      pairs based on the underlying GRNs in these data sets, using the common metrics Area
      Under Receiver Operating Characteristics (AUROC) and Area Under Precision-Recall
      Curve (AUPRC). Recall that these data sets were synthesized to include 300 cells for
      each of nine cell types. To assess the impact of data set size, we created smaller sets
      by sampling 200, 100 or 10 cells per cell type from the original simulated data (for each
      replicate of DS1 and DS3), and repeated the GRN reconstruction assessments for
      these. We also sought to assess the advantage of having single cell resolution in the
      data, and thus synthesized “bulk” expression data sets by averaging the expression of
      each gene in all cells of the same type, mimicking a situation where each cell type has
      been sorted separately and subjected to traditional expression profiling. (The resulting
      synthetic data sets included nine conditions with “bulk” expression values of each of 100
      or 1200 genes, depending on the original data set.)
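
      The two derived data sets described above, per-type subsamples and per-type "bulk"
      averages, can be built as in the following sketch. The matrix layout (genes x cells,
      with a parallel list of cell-type labels) and the function names are assumptions made
      for illustration.

```python
import random


def subsample_per_type(cell_types, n_per_type, rng):
    """Return indices of n_per_type randomly chosen cells from each cell type."""
    by_type = {}
    for idx, t in enumerate(cell_types):
        by_type.setdefault(t, []).append(idx)
    chosen = []
    for t in sorted(by_type):
        chosen.extend(rng.sample(by_type[t], n_per_type))
    return chosen


def pseudo_bulk(expr, cell_types):
    """Average each gene over all cells of the same type.

    expr is genes x cells; returns (sorted type labels, genes x types matrix).
    """
    types = sorted(set(cell_types))
    bulk = []
    for gene in expr:
        row = []
        for t in types:
            values = [v for v, c in zip(gene, cell_types) if c == t]
            row.append(sum(values) / len(values))
        bulk.append(row)
    return types, bulk


expr = [[1.0, 3.0, 10.0, 20.0], [0.0, 2.0, 4.0, 4.0]]   # 2 genes x 4 cells
labels = ["a", "a", "b", "b"]
types, bulk = pseudo_bulk(expr, labels)
picked = subsample_per_type(labels, 1, random.Random(0))
print(types, bulk, picked)
```

      Averaging collapses each cell type to a single column, which is why the bulk version of
      a nine-type data set has only nine conditions.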

      Figures 4a and 4b show the ROC and PRC respectively for a representative replicate of
      the DS3 data set, in its original setting (300 cells per type) as well as its sampled
      smaller versions and their respective “bulk” data set versions. A more comprehensive
      view, spanning all replicates of DS1 and DS3, is shown in Figure 4c-f. Several points
      are apparent from these figures. First, in nearly all versions of the data sets, GENIE3
      performs significantly better than random, as is evident from AUROC values well above
      the 0.5 value expected from a random predictor. Second, we note that while
      performance is significantly better on larger data sets than on the smallest data set (10
      cells per type), there is not a clear difference among the data sets with 100 cells per
      type or more. This suggests that, at least in the absence of technical noise, the benefits
      of greater cell count for GRN reconstruction accuracy saturate at commonly seen levels.
      Third, the “bulk” data sets consistently yielded lower accuracy than the single-cell data
      sets, regardless of the numbers of cells, confirming the value of the latter for regulatory
      inference. Finally, we noted that although the DS1 and DS3 data sets had similar
      AUROC values, the AUPRC values revealed significantly worse predictions in the larger
      (DS3, 1200 genes) data sets. This is expected, in part because the random baseline is
      lower for DS3 (random AUPRC of 0.002) than for DS1 (random AUPRC of 0.026), but
      also because high levels of gene co-expression confound methods such as GENIE3
      more for larger data sets.
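
      The two metrics and the random AUPRC baseline can be illustrated with the sketch
      below. It assumes each candidate TF-gene pair has a confidence score and a 0/1 label
      indicating whether the pair is an edge of the true GRN, and it uses average precision
      as the estimate of AUPRC, which is one common choice. The toy scores are made up.

```python
def auroc(scores, labels):
    """Probability that a random true edge outranks a random non-edge (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auprc(scores, labels):
    """Average precision: mean of the precision at the rank of each true edge."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits


def random_auprc(labels):
    """A random ranking achieves a precision equal to the fraction of true edges."""
    return sum(labels) / len(labels)


scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(auroc(scores, labels), auprc(scores, labels), random_auprc(labels))
```

      Since the random AUPRC equals the density of true edges among candidate pairs, a
      sparser network lowers the baseline, whereas the random AUROC stays at 0.5.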

      We next examined the impact of cellular heterogeneity on GRN reconstruction
      accuracy, using our clean synthetic data sets. For this, we sampled from each replicate
      of DS1 and DS3 (at their original setting of 300 cells per type) smaller data sets
      comprising 6, 3 or 1 cell type(s) rather than all 9 simulated cell types. As shown via
      AUROC and AUPRC measures in Figures 4i-l (with representative ROC and PRC
      curves in Figures 4g,h), we found data sets with greater heterogeneity to consistently
      improve GENIE3 performance, which remained clearly above the random baseline
      (AUROC of 0.5 and AUPRC of 0.026 and 0.002 for DS1 and DS3 respectively) for all
      but the “1 cell type” setting. This is expected, since the latter setting includes gene
      expression variation resulting only from biological noise, and even though extrinsic
      noise (fluctuations in TF levels reflected in target gene levels55) may be exploited to
      infer TF-gene relationships, such correlations are diluted by the presence of intrinsic
      gene expression noise in the simulations (see Methods). On the other hand, in settings
      with 3 – 9 different cell types, the dominant form of expression variation arises from
      differences in the steady state profiles of the cell types, making regulatory inferences
      more effective.

      We next examined the effect of technical noise on GRN reconstruction. For this, we
      compared GENIE3 performance on clean and noisy versions of each replicate of DS3
      (1200 genes), in the original setting of 300 cells per type as well as a sampled version
      thereof with 100 cells per type. The complete results are shown in Figures 4o,p, with
      representative ROC and PRC curves shown in Figures 4m,n. Both performance metrics
      (AUROC and AUPRC) deteriorate to levels expected from random prediction when
      analyzing noisy synthetic data, in contrast to the very high levels seen prior to
      introducing technical noise. Such nearly-random performance of GENIE3 on noisy
      single-cell expression data has been reported in previous studies conducted based on
      real as well as synthetic single-cell expression data sets43,56. Notably, increasing the
      number of cells (from 100 per type to 300) does not change our conclusion.

      In light of the above finding, we considered the possibility of using imputation tools
      specialized for single cell RNA-seq data as a means to improve the signal necessary for
      GRN reconstruction. We thus utilized the popular imputation tool called MAGIC (ref. 17) to
      pre-process the noisy synthetic data sets prior to analyzing them with GENIE3, and
      compared the performance metrics to those obtained above. Results were only
      modestly improved from those without imputation, with AUROC values ~ 0.65 in the 300
      cell/type setting and ~ 0.52 in the 100 cell/type setting (Figures 4m-p). Closer
      examination revealed that the default settings of MAGIC made the data overly
      structured, resulting in unrealistically large gene-gene correlations (Supplementary
      Figures S3 and S4), similar to previous reports57–59. In order to address this issue, we
      employed two smaller values of the ‘t’ parameter in MAGIC (t = 2 or 7), in separate runs,
      prior to GRN reconstruction. Both of these settings resulted in improved performance
      over the default setting of MAGIC, and substantially better than that seen in noisy data
      sets without imputation (Figures 4m-p). For instance, AUROC values for the 300
      cell/type setting were at ~0.70 (t = 7), squarely in the middle of those without imputation
      (~0.46) and those on clean data sets (~0.86). AUPRC values (~0.08) were also
      significantly above random expectation (~0.002), though far from the high values ~0.4
      observed on clean data sets. Although we noted above that GRN reconstruction
      accuracy on clean data sets did not improve when increasing the cell counts (300
      versus 100 cells per type), we do notice a significant and consistent effect of cell counts
      in performance on imputed data (Figures 4o,p). Presumably, greater cell counts are
      beneficial for the imputation step, which in turn results in higher performance of
      GENIE3. Our overall conclusion from the above tests (Figure 4) is that a state-of-the-art
      GRN reconstruction method such as GENIE3 (ref. 54) can perform accurately on single cell
      expression data in the hypothetical scenario where technical noise is absent, but falls to
      near-random performance in the face of realistic levels of technical noise. The accuracy
      does improve above random baseline if the data are imputed with specialized tools but
      falls far short of the upper bound observed in clean data, making technical noise a
      major factor for future GRN reconstruction methods to address.

      Benchmarking differentiation trajectory inference tools
      Our analysis so far involved using SERGIO to synthesize steady-state expression
      profiles representing different cell types. The simulator is additionally capable of
      synthesizing dynamic expression data on a set of genes controlled by a given regulatory
      network in single cells differentiating along a given trajectory (Figure 5). In this mode the
      simulator is provided with a differentiation graph whose nodes represent stable cell
      types in a differentiation program and whose edges represent differentiation from the
      parent cell type to child cell type. The simulator samples expression profiles from the
      steady state represented by the parent cell type, and then simulates a dynamical
      process (identical to that described above) that begins with one of these expression
      profiles and evolves into the steady state represented by the child cell type. It then
      samples expression profiles from the temporal duration when the cells are transitioning
      from the initial to final cell type. The entire “clean” data set is synthesized by repeating
      this simulation process for each edge in the differentiation graph. Technical noise is
      then added in a manner identical to the steady state simulation mode.

      An emerging approach to describe the dynamics of differentiation programs through
      single-cell expression profiling involves examination of spliced as well as unspliced
      transcript levels in the data and inferring “RNA velocity” of each gene [60]. To allow
      synthesizing data sets amenable to such analysis, the differentiation simulation mode
      uses a variation on the underlying model described above. In particular, it invokes two
      chemical Langevin equations (CLE) similar to equation 1 to generate unspliced and
      spliced transcript levels (see Equations 8 and 9 in Methods). It reports the simulated
      expression values as levels of unspliced as well as spliced transcripts, whose sum may
      be considered the total expression of a gene.

      To illustrate these features of the simulator, we generated four synthetic differentiation
      data sets (DS4 – DS7), each containing 100 genes controlled by the same GRN, but
      obeying different differentiation graphs – linear (DS4), bifurcation (DS5), trifurcation
      (DS6) and tree (DS7) (Figure 6, top). Figure 6a shows the two-dimensional PCA plot of
      the clean total transcriptome (without technical noise added) for the four types of
      differentiation graphs. It is visually evident that these two-dimensional representations of
      cells based on their gene expression profiles match their corresponding graphs used in
      the simulations. We note that the dispersion of cells of each type (end points of each
      branch of a graph) as well as the width of the differentiation path from one type to
      another can be controlled by user-specified parameters in SERGIO (Supplementary
      Figure S5).

      Differentiation data sets synthesized by SERGIO can be used to benchmark trajectory
      inference algorithms since the underlying differentiation trajectory (graph) is known for
      these data. To illustrate this, we applied the Slingshot [13] tool on the above data sets, still
      in their clean form without technical noise. Slingshot is a tool specifically developed for
      trajectory inference, with published reports of high accuracy. Consistent with these
      reports, we noted that Slingshot infers the correct trajectory in three of the four data
      sets; however, it failed to reconstruct the more complex, tree trajectory (Figure 6b) of
      DS7.

      We then analyzed the above synthetic data sets with the Velocyto [60] tool, which infers an
      “RNA velocity” field in a low dimensional representation of single cells that indicates the
      direction in which each cell’s expression profile appears to be changing. The velocity
      field also provides an intuitive visualization of differentiation trajectories. Figures 6c,d
      depict the inferred velocity fields for DS6 and DS7, demonstrating how Velocyto
      correctly captures these differentiation trajectories, including the tree of DS7 (Figure 6d)
      that Slingshot was unable to recover (Figure 6b, right). (Velocyto output for DS4 and
      DS5 may be found in Supplementary Figure S6.) Thus, we find that use of an additional
      layer of information – spliced versus unspliced mRNA counts – can improve trajectory
      inference from single cell transcriptomic data. This is not limited to data sets with
      complex underlying trajectories – Figure 6e shows an example data set (DS8)
      generated using a simple bifurcation graph for which Slingshot infers a linear trajectory
      while Velocyto reports a velocity field clearly indicative of the true bifurcation trajectory.
      It is worth noting here that Slingshot can utilize prior knowledge of stable cell
      types; we did not provide such information, although doing so may resolve the errors
      noted above. To summarize, synthetic data sets generated by SERGIO show that, at
      least in the absence of prior information on stable cell types, RNA velocity-based
      approaches may have an advantage in terms of trajectory inference on single cell data.

      Benchmarking GRN reconstruction on differentiation data
      Single-cell transcriptomic profiles of differentiation processes offer unique opportunities
      for GRN reconstruction, where cells are ordered by “pseudotime” (a temporal partial
      ordering obtained by mapping them to inferred differentiation paths) and the resulting
      pseudotime labels are exploited to infer causal relationships between TFs and target
      genes. Several methods have been recently proposed that specifically channel this
      opportunity, including SCODE [56], SINCERITIES [61] and SINGE [62]. We used the dynamic
      data simulated by SERGIO to benchmark these specialized GRN-reconstruction
      algorithms, using Slingshot for pseudotime inference. We used one simulated replicate
      of DS4, DS5 and DS6, for which we verified above that Slingshot infers trajectories
      accurately. For each data set, we evaluated and compared the three above-mentioned
      GRN reconstruction methods on single cells associated with a single branch of the
      inferred differentiation trajectory (see Methods). We also used GENIE3 as a baseline
      method to infer TF-gene relationships without utilizing pseudotime information.
      Interestingly, GENIE3 clearly outperforms the three specialized algorithms in all six
      evaluations (Figure 6f,g). In other words, the use of temporal ordering of single cells
      does not help GRN reconstruction, at least in the absence of technical noise.

      Discussion
      The main distinguishing quality of SERGIO is its ability to simulate single-cell
      expression data based on a specified GRN. Its implementation strikes a balance
      between a biologically realistic model of transcriptional processes and simplifying
      assumptions that facilitate fast simulation, capable of scaling to thousands of genes and
      regulatory interactions. SERGIO employs an intuitive definition of cell types as steady
      states of GRN dynamics [63], and can simulate any number of user-defined cell types. It
      can also simulate collections of cells differentiating from one cell type to another, an
      important feature not available in GNW [38] even after modifications to simulate single-cell
      data. Additionally, by including separate simulation of unspliced and spliced transcripts
      in differentiating cells, SERGIO allows assessment of tools based on the emerging
      approach of RNA velocity.

      The unique features of SERGIO make it a powerful tool for benchmarking a wide variety
      of single-cell analysis tools. We have presented several examples of such
      benchmarking efforts, which yielded useful insights about the evaluated tools. For
      instance, we showed a simple example (Figure 6e) of a differentiation data set where
      RNA velocity-based inference outperforms alternative trajectory inference algorithms.
      Our assessment of a leading GRN inference tool found that it is rendered largely
      inaccurate (close to random performance) by technical noise typical of
      contemporary data sets, even though it is capable of far greater accuracy in the
      absence of measurement errors. In the same context, we noted that imputation
      algorithms such as MAGIC [17] can alleviate this problem to an extent, leading to modestly
      improved accuracy, especially if data sets have larger numbers of cells.

      We also evaluated GRN inference methods designed specifically for time-ordered
      single-cell expression data [56,61,62], and were surprised to find that these specialized
      methods are less effective than a more general-purpose method – GENIE3 [54] – even for
      differentiation data sets. However, the performance of these specialized tools depends
      on the type of differentiation trajectories, number of single cells and other factors. For
      example, SINGE [62], one of the evaluated methods, is designed to be used with an
      ensemble of parameter settings, and in our evaluations we used this tool with only two
      sets of parameters; its performance might have been significantly better if a larger
      ensemble of parameters were to be used.

      It should be noted that the GRN benchmarking in this study considered methods based
      on expression only, while better accuracy can result from existing tools that use
      additional information such as TF-DNA binding data [9]. Future work can combine
      SERGIO simulations of single-cell expression with existing ideas on benchmarking GRN
      inference from bulk data and prior information [64]. Expression data from TF knockout
      experiments can also be exploited by GRN inference algorithms [65], and knockout of
      master regulators (MR) can be easily simulated in SERGIO to assess such algorithms.

      In conclusion, we believe that SERGIO will prove useful to a number of researchers
      developing tools for the rapidly developing field of single-cell transcriptomics. It will be
      especially useful for testing GRN reconstruction methods, which according to our
      assessments is the analytical task most in need of future improvements. But its
      usefulness will extend to future tools for other popular tasks as well, since synthetic data
      sets that capture real data more closely naturally provide more reliable assessments of
      those tools. Moreover, the “clean” simulated data sets (without technical noise)
      generated by SERGIO should be useful in their own right, since they also capture
      realistic expression variation due to biological noise and can provide upper bounds on
      accuracy in the idealized scenario where measurement noise has been eliminated.

      Methods
      Steady-State Simulations
      We model the dynamics of the mRNA concentration of genes using systems of stochastic
      differential equations (SDE) that have been previously employed in GeneNetWeaver
      (GNW) [38,40] and which are derived from the chemical Langevin equation (CLE) [44]. The
      time-course of mRNA concentration of gene i is modeled by:

      $$\frac{dx_i}{dt} = P_i(t) - \lambda_i x_i(t) + q_i \left( \sqrt{P_i(t)}\,\alpha + \sqrt{\lambda_i x_i(t)}\,\beta \right) \qquad (1)$$

      where $x_i$ is the expression of gene i, $P_i$ is its production rate, which reflects the influence
      of its regulators as identified by the given GRN (details below), $\lambda_i$ is the decay rate, and
      $q_i$ is the noise amplitude in the transcription of gene i. $\alpha$ and $\beta$ are two independent
      Gaussian white noise processes. In order to obtain the mRNA concentrations as a
      function of time, we integrate the above stochastic differential equation for all genes:

      $$x_i(t) = x_i(t_0) + \int_{t_0}^{t} \left( P_i(\tau) - \lambda_i x_i(\tau) \right) d\tau + \int_{t_0}^{t} q_i \sqrt{P_i(\tau)}\, dW_\alpha + \int_{t_0}^{t} q_i \sqrt{\lambda_i x_i(\tau)}\, dW_\beta \qquad (2)$$

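      where $W_\alpha$ and $W_\beta$ are two independent stochastic Wiener processes. We integrate this
      equation in pre-defined time steps of duration $\Delta t$, according to the Euler–Maruyama
      method [66] using the Itô scheme:

      $$x_i(t + \Delta t) = x_i(t) + \left( P_i(t) - \lambda_i x_i(t) \right) \Delta t + q_i \sqrt{P_i(t)}\, \Delta W_\alpha + q_i \sqrt{\lambda_i x_i(t)}\, \Delta W_\beta \qquad (3)$$

      $$\Delta W_\alpha \sim \sqrt{\Delta t}\; \mathcal{N}(0, 1), \quad \Delta W_\beta \sim \sqrt{\Delta t}\; \mathcal{N}(0, 1) \qquad (4)$$

      Each iteration yields the mRNA concentrations of all genes at time step $t + \Delta t$ using
      each gene’s own concentration and all of its regulators’ concentrations at time step $t$.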
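
      The update rule of equations 3-4 can be sketched for a single gene as follows. This is
      a minimal illustration, not SERGIO's implementation: the function name, the constant
      production rate (as for a master regulator), the clamping of negative values at zero and
      the use of the Python standard library are all assumptions of this sketch.

```python
import math
import random


def simulate_gene(production, decay, q, x0, dt, n_steps, noise="dpd", rng=None):
    """Integrate one gene's chemical Langevin equation with Euler-Maruyama.

    `production` is held constant here (hypothetical simplification).
    `noise` selects which noise terms are active: "dpd", "sp" or "sd".
    """
    rng = rng or random.Random()
    x = x0
    trajectory = [x]
    sqrt_dt = math.sqrt(dt)
    for _ in range(n_steps):
        # Wiener increments: sqrt(dt) * N(0, 1), one per noise source.
        dw_alpha = sqrt_dt * rng.gauss(0.0, 1.0)
        dw_beta = sqrt_dt * rng.gauss(0.0, 1.0)
        if noise == "sp":      # production noise only
            dw_beta = 0.0
        elif noise == "sd":    # decay noise only
            dw_alpha = 0.0
        drift = (production - decay * x) * dt
        diffusion = q * (math.sqrt(production) * dw_alpha
                         + math.sqrt(decay * x) * dw_beta)
        # Clamp at zero so the square root stays defined (assumption of this sketch).
        x = max(x + drift + diffusion, 0.0)
        trajectory.append(x)
    return trajectory
```

      With the noise amplitude set to zero the trajectory relaxes to production / decay, which
      gives a quick sanity check of the drift term before noise is switched on.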

      We model each gene’s production rate, $P_i$, as the sum of contributions from each of its
      regulators (as prescribed by the GRN):

      $$P_i(t) = \sum_{j \in R_i} p_{ij}(t) + b_i \qquad (5)$$

      where $R_i$ is the set of all regulators of gene i, $b_i$ is the basal production rate of gene i,
      and $p_{ij}$ is the regulatory effect of gene (TF) j on gene i. The latter is modeled as a
      non-linear saturating Hill function of the mRNA concentration of the TF [46]:

      $$p_{ij} = K_{ij} \frac{x_j^{n_{ij}}}{h_{ij}^{n_{ij}} + x_j^{n_{ij}}}; \quad \text{if regulator j is an activator of gene i} \qquad (6)$$

      $$p_{ij} = K_{ij} \left( 1 - \frac{x_j^{n_{ij}}}{h_{ij}^{n_{ij}} + x_j^{n_{ij}}} \right); \quad \text{if regulator j is a repressor of gene i} \qquad (7)$$

      where $K_{ij}$ denotes the maximum contribution of regulator j to target gene i, $n_{ij}$ is the Hill
      coefficient that introduces non-linearity to the model and $h_{ij}$ is the regulator
      concentration that produces half-maximal regulatory effect (half-response). If gene i is a
      user-designated “master regulator” (MR), i.e., no gene regulates it, then its production
      rate $P_i$ is entirely determined by the basal production rate $b_i$, which is a user-defined
      parameter. For simplicity, we set $b_i = 0$ for genes other than master regulators. $K_{ij}$ and
      $n_{ij}$ are user-defined parameters, and the type of each interaction (activation or
      repression) is also user-specified. The $h_{ij}$ parameter is set to be the average of the
      regulator’s expression among the cell types to be simulated. The terms $\alpha$ and $\beta$ in
      equation 1 characterize the intrinsic noise associated with the production and decay
      processes of the mRNA transcript of gene i. Moreover, the intrinsic noise in the
      transcription of regulators propagates along the GRN and thus influences the production
      rate $P_i$ to become an extrinsic noise source in the transcription of gene i. We support
      three forms of noise:
           1. Dual Production Decay (“dpd”): the form of stochastic noise that is shown in
                 equation 1.
           2. Single Production (“sp”): including only the noise term associated with the
                 production process (equivalently, set $\beta = 0$).
           3. Single Decay (“sd”): including only the noise term associated with the decay
                 process (equivalently, set $\alpha = 0$).

      We note that the current version of SERGIO is not capable of simulating GRNs containing
      auto-regulatory edges or cycles.
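
      The production rate described above (equations 5-7) can be illustrated with a short
      sketch. The function names and the tuple layout used to describe each regulator are
      hypothetical choices made for this example, not part of SERGIO's interface.

```python
def hill(x, h, n):
    """Saturating Hill term x^n / (h^n + x^n)."""
    return x ** n / (h ** n + x ** n)


def production_rate(regulators, basal=0.0):
    """Sum the contributions of all regulators of one gene, plus the basal rate.

    `regulators` is a list of (x_j, K, h, n, is_activator) tuples, where x_j is
    the regulator's current concentration (hypothetical layout for this sketch).
    """
    total = basal
    for x_j, k, h, n, is_activator in regulators:
        if is_activator:
            total += k * hill(x_j, h, n)          # activation term
        else:
            total += k * (1.0 - hill(x_j, h, n))  # repression term
    return total
```

      When a regulator sits exactly at its half-response concentration, both the activating and
      the repressing form contribute half of the maximum contribution, which is a convenient
      check on parameter choices.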

      Sampling Single Cells
      We use the above system of equations to simulate the time-course of each gene’s
      expression in a cell, starting with a given initial value, and record expression values of
      all genes at randomly selected time points after the simulation has reached steady
      state. Invoking the ergodic assumption [47], we treat the expression profiles at these time
      points to represent single-cell profiles. In order to speed up the simulation, we estimate
      the steady-state concentrations of all genes given the input parameters (see
      Supplementary Notes 1) and initialize the time-course simulation with those values.
      Also, we ensure that a sufficient number of time steps, which is controlled by a user-
      defined parameter, are simulated in the steady state prior to sampling cells.

      Cell Types
      The above simulation is performed for each “cell type” separately. We define a cell type
      (or cell state) by the average concentration of master regulators. A cell type differs from
      another cell type by the average concentration of one or more of the master regulators
      among the population of cells belonging to each cell type. This can be controlled by the
      basal production rate b for master regulators (see Supplementary Notes 1). SERGIO
      takes as input the basal production rate of all master regulators in each of the cell types
      to be simulated.
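
      One way to see how the basal production rate sets the average concentration of a master
      regulator is the following short derivation. It is a sketch based only on equation 1 and
      may differ from the treatment in Supplementary Notes 1, which is not reproduced here.
      It assumes that the production rate of a master regulator equals its basal rate b and
      that the noise terms have zero mean under the Itô interpretation; E[x] denotes the
      average expression and \lambda the decay rate.

```latex
\frac{d\,\mathbb{E}[x]}{dt} = b - \lambda\,\mathbb{E}[x]
\quad\Longrightarrow\quad
\mathbb{E}[x]_{\text{steady state}} = \frac{b}{\lambda}
```

      For example, with the decay rate of 0.8 used in this study, a hypothetical basal rate of
      4 would correspond to an average steady-state concentration of 5.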

      Simulation of differentiation trajectories
      In addition to simulating one or more “cell types” in steady state, SERGIO may be used to
      simulate cells on the differentiation trajectory from one cell type to another, i.e., between
      two steady states. More generally, given a “differentiation graph” where nodes represent
      cell types and directed edges indicate differentiation from one cell type to the other,
      SERGIO can simulate expression profiles of cells spanning different stages of
      differentiation specified by the graph. Such cells are either in one of the steady states
      represented by nodes or have departed away from the steady-state of their “parent” cell
      type of an edge and are migrating toward the steady-state of the corresponding “child”
      cell type. The differentiation is presumed to commence when one or more master
      regulators change their expression from that in the steady state of the parent cell type,
      e.g., due to a signaling event [67] or due to a noise-driven switch [68]. Thus, given a
      differentiation graph and average expression levels of master regulators for each cell
      type (nodes), we simulate each differentiation trajectory (edge) as follows: 1) Cells
      representing the parent cell type are sampled from the corresponding steady state. 2)
      Production rates ($P_i$) of master regulators are changed from those specified for the
      parent cell type to those of the child cell type, and time-course simulations are
      performed following equations 3-4 as explained above. As these simulations proceed,
      all genes ultimately converge to their steady-state concentrations in the child cell type.
      3) Cells (expression profiles) are sampled at random from the entire simulation,
      including cells in the parent and child cell types (steady states) as well as cells on the
      differentiation trajectory (transient states). Multiple such time-course simulations are
      performed and the sampled cells are randomly chosen from the entire collection of such
      simulations. Also, after each simulation reaches the steady-state of the child cell type, it
      may be continued for a user-defined number of additional steps. This controls the ratio
      of the cells in the steady states of the differentiation graph to the number of cells in
      differentiating (transient) states.
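
      The three steps above can be sketched for a single master regulator along one edge of
      the differentiation graph. This is a simplified illustration under stated assumptions:
      the function names are hypothetical, the simulation starts from the parent's mean
      steady state rather than from a sampled parent cell, and `n_extra` plays the role of the
      user-defined number of additional steps.

```python
import math
import random


def simulate_edge(b_parent, b_child, decay, q, dt, n_transition, n_extra, rng):
    """Time course of one master regulator along one differentiation edge."""
    x = b_parent / decay          # start at the parent's mean steady state
    course = [x]
    sqrt_dt = math.sqrt(dt)
    for _ in range(n_transition + n_extra):
        dw_a = sqrt_dt * rng.gauss(0.0, 1.0)
        dw_b = sqrt_dt * rng.gauss(0.0, 1.0)
        # The production rate has already been switched to the child's value.
        step = (b_child - decay * x) * dt + q * (
            math.sqrt(b_child) * dw_a + math.sqrt(decay * x) * dw_b)
        x = max(x + step, 0.0)
        course.append(x)
    return course


def sample_cells(courses, n_cells, rng):
    """Draw cells uniformly at random from all recorded time points."""
    pool = [x for course in courses for x in course]
    return rng.sample(pool, n_cells)
```

      Increasing `n_extra` lengthens the portion of each time course spent near the child
      steady state, so a uniform sample contains proportionally more steady-state cells and
      fewer transient ones.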

      Simulations of differentiation trajectories in SERGIO generate not only the total mRNA
      concentration of each gene (in a time-course), but the changing levels of spliced and
      unspliced mRNA transcripts separately. To this end, we express the rate of change in
      the concentration of unspliced and spliced RNA using ordinary differential equations
      (ODEs), following prior work [60,69]. Furthermore, we introduce noise terms to these ODEs
      in a manner similar to steady-state simulations (equation 1). Thus, the time-course of
      the spliced (s) and unspliced (u) transcript level of gene i is modeled as:

      $$\frac{du_i}{dt} = P_i(t) - (\lambda_i + \mu_i)\, u_i(t) + q_i^u \left( \sqrt{P_i(t)}\,\alpha + \sqrt{(\lambda_i + \mu_i)\, u_i(t)}\,\beta \right) \qquad (8)$$

      $$\frac{ds_i}{dt} = \mu_i u_i(t) - \gamma_i s_i(t) + q_i^s \left( \sqrt{\mu_i u_i(t)}\,\phi + \sqrt{\gamma_i s_i(t)}\,\omega \right) \qquad (9)$$

      where $P_i$ is the production rate of pre-mRNA (unspliced transcript) that includes
      regulatory interactions, $\lambda_i$ and $\mu_i$ are the degradation and splicing rate respectively of
      pre-mRNA and $q_i^u$ is the noise amplitude associated with the transcription of pre-mRNA.
      For simplicity, we assume the degradation rate $\lambda_i$ of pre-mRNA is zero and all of its
      decay is due to splicing (user-defined parameter $\mu_i$). Also, $\gamma_i$ is the degradation rate of
      spliced mRNA and $q_i^s$ is the noise amplitude associated with the transcription of spliced
      mRNA. $\alpha$, $\beta$, $\phi$ and $\omega$ are independent Gaussian white noise processes. All three
      forms of stochastic noise (“dpd”, “sp”, “sd”) described for steady-state simulations are
      also supported in dynamics simulation. Moreover, the production rate $P_i$ is modeled as in
      steady-state simulations (equations 5-7 above). Both of the SDEs in equations 8-9 are
      integrated according to the Euler–Maruyama scheme to obtain time-courses of unspliced
      and spliced mRNA concentrations.
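
      A minimal sketch of integrating the coupled unspliced and spliced equations for one
      gene is given below. It assumes a constant production rate, zero pre-mRNA degradation
      (so that splicing is the only loss of unspliced transcript), and clamping at zero; the
      function and argument names are hypothetical and this is not SERGIO's own code.

```python
import math
import random


def simulate_splicing(production, mu, gamma, q_u, q_s, dt, n_steps, rng):
    """Euler-Maruyama for unspliced (u) and spliced (s) levels of one gene.

    mu is the splicing rate, gamma the degradation rate of spliced mRNA,
    q_u and q_s the noise amplitudes of the two equations.
    """
    u, s = 0.0, 0.0
    sqrt_dt = math.sqrt(dt)
    for _ in range(n_steps):
        # Four independent Wiener increments, one per noise term.
        a, b, phi, omega = (sqrt_dt * rng.gauss(0.0, 1.0) for _ in range(4))
        du = (production - mu * u) * dt + q_u * (
            math.sqrt(production) * a + math.sqrt(mu * u) * b)
        ds = (mu * u - gamma * s) * dt + q_s * (
            math.sqrt(mu * u) * phi + math.sqrt(gamma * s) * omega)
        # Clamp at zero so the square roots stay defined (assumption of this sketch).
        u, s = max(u + du, 0.0), max(s + ds, 0.0)
    return u, s
```

      Running the sketch with both noise amplitudes set to zero returns the deterministic
      steady state, which is useful for checking the chosen rates before adding noise.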

      Technical Noise
      SERGIO adopts methods similar to Splatter [32] for adding technical noise to the simulated
      single-cell expression data. One module introduces the phenomenon of “outlier genes”,
      which refers to the empirical observation that a small set of genes appear to have
      unusually high expression measurements across cells in typical scRNA-seq data sets. A
      second module incorporates the noted phenomenon of different cells having different
      total counts (library size), that follows a log-normal distribution. A third module
      introduces “dropouts”, which refers to the observation that a high percentage of genes
      are recorded at zero expression in any given cell, indicating an experimental failure to
      record their expression rather than true non-expression. These three modules may be
      invoked optionally and in any combination and order specified by the user. We
      elaborate on each of these modules below. We focus on details pertinent to simulation
      of steady state data; corresponding details for differentiation trajectory data are provided
      in Supplementary Notes 3. Each of the modules outlined below adds a single type of
      technical noise to the data set provided to it.

      Outlier genes: Each gene is designated as an outlier with a user-defined probability. If
      so, its expression (in every cell) is multiplied by a factor sampled from a log-normal
      distribution; otherwise the expression is left unchanged:

      $$\forall i \in \{1 \dots G\}: \quad O_i \sim \mathrm{Bernoulli}(\pi^O), \quad F_i^O \sim LN(\mu^O, \sigma^O)$$

      $$\forall i \in \{1 \dots G\},\ \forall c \in \{1 \dots C\}: \quad x_i^c \leftarrow O_i F_i^O x_i^c + (1 - O_i)\, x_i^c$$

      where G and C denote the total number of simulated genes and cells respectively, and
      $x_i^c$ denotes the simulated expression of gene i in cell c. $O_i$ is a binary variable indicating
      if gene i is an outlier, and is sampled from a Bernoulli distribution with parameter $\pi^O$.
      Also, $\mu^O$ and $\sigma^O$ are the user-defined mean and standard deviation of the lognormal
      distribution from which the outlier scaling factor $F_i^O$ is sampled.
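
      The outlier module described above can be sketched as follows, with the expression
      matrix stored as a list of per-gene rows. The function name and the argument names
      are hypothetical; the parameters of the log-normal distribution are passed straight to
      the standard library sampler.

```python
import random


def add_outliers(expr, prob, mu, sigma, rng):
    """Scale every cell of a randomly chosen gene by one log-normal factor.

    `expr[i][c]` is the expression of gene i in cell c.
    """
    noisy = []
    for gene_row in expr:
        is_outlier = rng.random() < prob          # Bernoulli draw for this gene
        factor = rng.lognormvariate(mu, sigma)    # one scaling factor per gene
        scale = factor if is_outlier else 1.0
        noisy.append([x * scale for x in gene_row])
    return noisy
```

      Because a single factor is applied to all cells of an outlier gene, the relative
      expression of that gene across cells is preserved.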

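      Library size: For every cell (library) a library size parameter is sampled from a lognormal
      distribution, and expression values of all genes in the cell are scaled by a constant
      factor such that the resulting total cell depth matches the sampled library size:

      $$\forall c \in \{1 \dots C\}: \quad L_c \sim LN(\mu^L, \sigma^L)$$

      $$\forall i \in \{1 \dots G\},\ \forall c \in \{1 \dots C\}: \quad x_i^c \leftarrow x_i^c \, \frac{L_c}{\sum_{g=1}^{G} x_g^c}$$

      where $\mu^L$ and $\sigma^L$ are the mean and standard deviation of the lognormal distribution
      from which the library size $L_c$ of cell c is sampled.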
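
      The library size scaling described above can be sketched as follows. The function name
      is hypothetical, and leaving cells with zero total expression untouched is an
      assumption of this sketch rather than documented behavior.

```python
import random


def scale_library_sizes(expr, mu, sigma, rng):
    """Rescale each cell so its total expression equals a sampled library size.

    `expr[i][c]` is the expression of gene i in cell c.
    """
    n_genes, n_cells = len(expr), len(expr[0])
    scaled = [row[:] for row in expr]
    sizes = []
    for c in range(n_cells):
        size = rng.lognormvariate(mu, sigma)       # sampled library size of cell c
        depth = sum(expr[i][c] for i in range(n_genes))
        if depth > 0:                              # leave all-zero cells untouched
            for i in range(n_genes):
                scaled[i][c] = expr[i][c] * size / depth
        sizes.append(size)
    return scaled, sizes
```

      Scaling every gene in a cell by the same factor changes the cell's depth but not the
      proportions between genes within that cell.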

      Dropout: To introduce dropouts to the simulated data, we first assign a probability to the
      expression of each gene in each of the simulated cells being a dropout. This probability
      is modeled as a logistic function of the expression of the gene in that cell, so that a high
      expression value is less likely to be zeroed out. This probability is then used as the
      parameter of a Bernoulli distribution from which a binary variable is sampled to indicate
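      whether the gene is a dropout in the cell:

      [Dropout equations not recoverable from the source text. The surviving fragments show a
      percentile of the expression values and a logistic expression with numerator 1, defined
      for all genes i in {1 … G} and all cells c in {1 … C}.]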
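
      The exact dropout equations are not legible in this text, so the following is only a
      sketch of the procedure described above. The logistic form on log(x + 1), the `shape`
      steepness parameter and the `percentile` used as the midpoint are assumptions of this
      sketch and should not be read as SERGIO's actual parameterization.

```python
import math
import random


def add_dropouts(expr, shape, percentile, rng):
    """Zero out entries with a probability that falls as expression rises."""
    logs = sorted(math.log(x + 1.0) for row in expr for x in row)
    # Midpoint of the logistic curve: a crude nearest-rank percentile of
    # log-expression (assumption of this sketch).
    midpoint = logs[min(int(len(logs) * percentile / 100.0), len(logs) - 1)]
    noisy = []
    for row in expr:
        new_row = []
        for x in row:
            # Dropout probability: logistic, decreasing in log(x + 1).
            p_drop = 1.0 / (1.0 + math.exp(shape * (math.log(x + 1.0) - midpoint)))
            # Bernoulli draw decides whether this entry is zeroed out.
            new_row.append(0.0 if rng.random() < p_drop else x)
        noisy.append(new_row)
    return noisy
```

      Every entry of the result is either the original value or zero, so the module can be
      combined with the other noise modules in any order.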

      Conversion to UMI counts: We generate UMI counts ($UMC_i^c$) by sampling from a Poisson
      distribution whose mean is the simulated expression level of the gene in the cell:

      $$\forall i \in \{1 \dots G\},\ \forall c \in \{1 \dots C\}: \quad UMC_i^c \sim \mathrm{Poisson}(x_i^c)$$
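
      The conversion to counts can be sketched with the standard library alone. The Poisson
      sampler below uses Knuth's multiplication method, a choice made for this example; it
      is only suitable for small means, and an array library's Poisson sampler would be the
      natural choice at scale. The function names are hypothetical.

```python
import math
import random


def poisson(mean, rng):
    """Knuth's multiplication method; suitable for small means only."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def to_umi_counts(expr, rng):
    """Replace each expression value by a Poisson draw with that mean."""
    return [[poisson(x, rng) for x in row] for row in expr]
```

      Entries with zero expression, including those zeroed by the dropout module, always map
      to a count of zero.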

      Data Set Generation
      We now describe how we set simulation parameters to generate the data sets analyzed
      in this study.

      We sampled four gene regulatory networks (GRNs) from the known regulatory networks
      in S. cerevisiae and E. coli using GNW [38], with the “random seed” argument to select
      genes and the “random among top 20%” setting for neighbor selection. Two of the
      networks consist of 100 genes and were separately sampled from E. coli, a third network
      containing 1200 genes was sampled from E. coli, and the fourth network comprising
      400 genes was sampled from S. cerevisiae. We also used GNW to designate each
      TF-gene edge as either an activating or a repressive interaction. Auto-regulatory edges
      were removed from the sampled networks and cycles were broken at a randomly
      selected edge, since SERGIO does not support these two graph properties. These four
      networks were used to simulate 8 data sets, each with 9 cell types and 300 cells per cell
      type (Table 1). Fifteen “replicates” of each data set were created that had identical
      simulation parameters and differed due to the stochastic noise and random sampling.
      For all data sets, interaction strengths $K_{ij}$ (equations 6-7) were uniformly sampled from
      the range 1 to 5. Each cell type to be simulated was specified by the expression state
      (high or low) of each master regulator (MR); the basal production rate ($b_i$ in equation 5)
      of each MR was sampled from a pre-defined range that depends on the expression
      state and varies among different data sets (see Supplementary Table S3). We used a
      Hill coefficient of 2 for all interactions in all data sets. We used the same noise amplitude
      parameter $q_i = 1$ and the same decay parameter $\lambda_i = 0.8$ for all genes in all steady-state
      data sets. In dynamics simulations, we used an unspliced noise parameter $q_i^u = 0.3$ and
      a spliced noise parameter $q_i^s = 0.07$ for all genes. Also, we used an unspliced transcript
      decay rate of 0.8 and a spliced transcript decay rate of 0.2, which maintains a ratio
      of spliced to unspliced expression of a gene at ~4 (see Supplementary Notes 2). We
      used the “dpd” setting of intrinsic noise and an integration time step of 0.01 for both
      steady-state and dynamics simulations.
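
      The parameter choices described above can be sketched in a few lines of Python. This
      is a minimal illustration and not SERGIO's own code: the function names, the toy
      network, and the basal-rate ranges in BASAL_RATE_RANGES are hypothetical placeholders
      (the ranges actually used are listed in Supplementary Table S3). The last line checks
      the spliced-to-unspliced ratio under the assumption that, at steady state, this ratio
      equals the unspliced decay rate divided by the spliced decay rate.

```python
import random

# Values stated in the text.
INTERACTION_STRENGTH_RANGE = (1.0, 5.0)
HILL_COEFFICIENT = 2
UNSPLICED_DECAY = 0.8
SPLICED_DECAY = 0.2

# Hypothetical placeholder ranges; the real ones are in Supplementary Table S3.
BASAL_RATE_RANGES = {"low": (0.2, 0.5), "high": (0.7, 1.0)}


def sample_interaction_strengths(edges, rng):
    """Draw one strength per TF-gene edge, uniformly from the stated range."""
    low, high = INTERACTION_STRENGTH_RANGE
    return {edge: rng.uniform(low, high) for edge in edges}


def sample_basal_rates(cell_type_states, rng):
    """Draw a basal production rate for each master regulator in each cell type.

    cell_type_states maps cell type -> {master regulator: "high" or "low"}.
    """
    rates = {}
    for cell_type, states in cell_type_states.items():
        rates[cell_type] = {
            mr: rng.uniform(*BASAL_RATE_RANGES[state])
            for mr, state in states.items()
        }
    return rates


rng = random.Random(0)  # fixed seed so the toy example is reproducible

# Toy network and cell types (illustrative only).
edges = [("TF1", "geneA"), ("TF1", "geneB"), ("TF2", "geneB")]
cell_type_states = {
    "type_1": {"TF1": "high", "TF2": "low"},
    "type_2": {"TF1": "low", "TF2": "high"},
}

strengths = sample_interaction_strengths(edges, rng)
basal_rates = sample_basal_rates(cell_type_states, rng)

# Assumed steady-state relation: spliced / unspliced = unspliced decay / spliced decay.
steady_state_ratio = UNSPLICED_DECAY / SPLICED_DECAY
```

      Using a fresh seed for each replicate while keeping the sampled parameters fixed would
      mirror the replicate design described above, in which replicates share simulation
      parameters and differ through stochastic noise and random sampling.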

      We compared the simulated expression matrix (genes x cells) to a single-cell RNA-seq
      data set from the mouse cerebral cortex48, referred to as the “real data set”, to
      demonstrate that the simulated and real data sets have similar statistical properties. The
      real data set includes 3005 cells from nine cell types and our simulations therefore used
      nine cell types. However, the real data set has variable numbers of cells per cell type
      while we sought to keep this number fixed, or at least comparable, across cell types for
      ease of downstream interpretations. Hence we simulated 300 cells for each cell type
      (total of 2700 cells) and sampled the real data set by drawing cells of each type at
      random: for cell types with fewer than 300 cells, we retained all the cells, while for the
      other cell types we randomly sampled 300 cells, yielding a total of 2500 single cells.
      Our simulations generated expression values for 100, 400 or 1200 genes depending on
      the data set; hence we randomly sampled from the real data set the same number of
      genes as present in the synthetic data.
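
      The subsampling of the real data set described above might look as follows. This is a
      sketch under stated assumptions, not the procedure's actual implementation: the
      function names, the cell-label dictionary, and the toy data are hypothetical, and a cap
      of 3 cells per type stands in for the cap of 300 used in the study.

```python
import random


def subsample_cells(cell_types, max_per_type, rng):
    """Keep every cell of small cell types; randomly keep max_per_type of larger ones.

    cell_types maps each cell ID to its cell-type label.
    """
    by_type = {}
    for cell_id, label in cell_types.items():
        by_type.setdefault(label, []).append(cell_id)

    kept = []
    for label in sorted(by_type):
        cells = sorted(by_type[label])
        if len(cells) > max_per_type:
            cells = rng.sample(cells, max_per_type)
        kept.extend(cells)
    return kept


def subsample_genes(gene_ids, n_genes, rng):
    """Randomly pick as many genes as the synthetic data set contains."""
    return rng.sample(sorted(gene_ids), n_genes)


rng = random.Random(0)

# Toy labels: five cells of type "a" and two cells of type "b".
toy_labels = {f"cell{i}": ("a" if i < 5 else "b") for i in range(7)}

kept_cells = subsample_cells(toy_labels, max_per_type=3, rng=rng)
kept_genes = subsample_genes(["g1", "g2", "g3", "g4"], n_genes=2, rng=rng)
```

      In this toy run, type "a" is reduced to three cells while both cells of type "b" are
      retained, which is the same rule applied to the nine cortical cell types.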

      To add technical noise we used the above-mentioned modules for outlier genes, library
      size effect and dropouts in that order, and finally converted the expression levels to UMI
      counts. For each data set, we manually tuned the input parameters (see Supplementary
      Table S1) to each of the technical noise modules to obtain a match between the
      synthetic and real data. Furthermore, we removed cells from the synthetic data that have
      a total UMI count (sum over all genes) below 5. In this study, we only added technical
      noise to the steady-state synthetic data sets and the dynamics simulations only utilized
      “clean data sets” without technical noise.
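
      The cell-filtering step can be illustrated with a short sketch. The function name, the
      list-of-rows matrix layout, and the toy counts are assumptions made for this example;
      only the threshold of 5 and the genes x cells orientation come from the text.

```python
def filter_low_count_cells(umi_counts, min_total=5):
    """Drop cells whose total UMI count (summed over all genes) is below min_total.

    umi_counts is a genes x cells matrix given as a list of rows, one row per gene.
    Returns the filtered matrix and the indices of the retained cells (columns).
    """
    n_cells = len(umi_counts[0]) if umi_counts else 0
    totals = [sum(row[j] for row in umi_counts) for j in range(n_cells)]
    kept = [j for j, total in enumerate(totals) if total >= min_total]
    filtered = [[row[j] for j in kept] for row in umi_counts]
    return filtered, kept


# Toy matrix: 3 genes x 3 cells, with column totals 7, 3 and 5.
toy_counts = [
    [3, 0, 1],
    [4, 1, 2],
    [0, 2, 2],
]

filtered_counts, kept_columns = filter_low_count_cells(toy_counts)
```

      Here the second cell (total of 3) is removed, while the third cell (total of exactly 5)
      is kept because only totals below 5 are filtered.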

      Settings of single-cell analysis tools
      In this study we applied several tools to the real or synthetic data sets to mimic real-
      world analysis of such data and to benchmark these tools. We did not normalize the