Machine Learning to Decipher the Astrophysical Processes at Cosmic Dawn

 
MNRAS 000, 1–17 (2021) Preprint 21 January 2022 Compiled using MNRAS LATEX style file v3.0

 Sudipta Sikder,1★ Rennan Barkana,1,2 Itamar Reis1 and Anastasia Fialkov3
 1 School of Physics and Astronomy, Tel-Aviv University, Tel-Aviv, 69978, Israel
 2 Institute for Advanced Study, 1 Einstein Drive, Princeton, New Jersey 08540, USA
 3 Institute of Astronomy, University of Cambridge, Madingley Road, Cambridge, CB3 0HA, UK
arXiv:2201.08205v1 [astro-ph.CO] 20 Jan 2022

 Accepted XXX. Received YYY; in original form ZZZ

 ABSTRACT

 The cosmic 21-cm line of hydrogen is expected to be measured in detail by the next generation of radio telescopes. The
 enormous dataset from future 21-cm surveys will revolutionize our understanding of early cosmic times. We present a machine
 learning approach that uses emulation in order to uncover the astrophysics in the epoch of reionization and cosmic dawn. Using
 a seven-parameter astrophysical model that covers a very wide range of possible 21-cm signals, over the redshift range 6 to 30
 and wavenumber range 0.05 Mpc$^{-1}$ to 1 Mpc$^{-1}$, we emulate the 21-cm power spectrum with a typical accuracy of 10–20%. As
 a realistic example, we train an emulator using the 21-cm power spectrum with an optimistic model for observational noise as
 expected for the Square Kilometre Array (SKA). Fitting to mock SKA data results in a typical measurement accuracy of 5% in
 the optical depth to the CMB, 30% in the star-formation efficiency of galactic halos, and a factor of 3.5 in the X-ray efficiency
 of galactic halos; the latter two parameters are currently uncertain by orders of magnitude. In addition to standard astrophysical
 models, we also consider two exotic possibilities of strong excess radio backgrounds at high redshifts. We use a neural network
 to identify the type of radio background present in the 21-cm power spectrum, with an accuracy of 87% for mock SKA data.
 Key words: methods: numerical – methods: statistical – dark ages, reionization, first stars – cosmology: theory

★ E-mail: sudiptas@mail.tau.ac.il

© 2021 The Authors

1 INTRODUCTION

The redshifted 21-cm signal from neutral hydrogen is the most promising probe of the Epoch of Reionization (EoR) and cosmic dawn. This 21-cm emission or absorption originates from the hyperfine splitting of the hydrogen atom. As this signal depends on both cosmological and astrophysical parameters, it should be possible to decipher abundant information about the early universe from the signal once it is observed. The Low Frequency Array (LOFAR, Gehlot et al. 2019), the Precision Array to Probe the Epoch of Reionization (PAPER, Kolopanis et al. 2019), the Murchison Wide-field Array (MWA, Trott et al. 2020), the Owens Valley Radio Observatory Long Wavelength Array (OVRO-LWA, Eastwood et al. 2019), the Large-aperture Experiment to detect the Dark Age (LEDA, Price et al. 2018; Garsden et al. 2021) and the Hydrogen Epoch of Reionization Array (HERA, DeBoer et al. 2017) are experiments that have analyzed data in an attempt to detect the power spectrum from the epoch of reionization. Although the existing upper limits are weak, they already provide interesting constraints on some of the exotic scenarios (e.g. with extra radio background as considered here) (Mondal et al. 2020; The HERA Collaboration et al. 2021). HERA along with the New Extension in Nançay Upgrading LOFAR (NenuFAR, Zarka et al. 2012) and the Square Kilometre Array (SKA, Koopmans et al. 2015) will aim to measure the power spectrum over a wide range of redshifts including cosmic dawn. Thus, we expect a great deal of data from observations in the upcoming decade.

The question arises as to what are the possible ways to infer the astrophysical parameters from the observed 21-cm power spectrum data. Since the characteristic astrophysical parameters at high redshifts are currently almost entirely unconstrained, the 21-cm signal must be calculated for a large number of parameter sets that cover a wide range of possibilities. Given the complexity of the 21-cm signal (see Barkana 2018a; Mesinger 2019) and its highly non-linear dependence on the astrophysical parameters, artificial neural networks (ANNs) are a useful method for emulation and fitting. Shimabukuro & Semelin (2017) used an ANN to estimate the astrophysical parameters from 21-cm observations. They trained the ANN using 70 data sets where each set consists of the 21-cm power spectrum obtained using 21cmfast (Mesinger et al. 2011) as input, with three EoR parameters used in the simulation as output. They applied the trained ANN to 54 data sets to evaluate how the algorithm performs. Kern et al. (2017) used a machine learning algorithm to emulate the 21-cm power spectrum and perform Bayesian analysis for parameter constraints over eleven parameters which included six parameters of the EoR and X-ray heating and five additional cosmological parameters. Schmit & Pritchard (2018) built an emulator using a neural network to emulate the 21-cm power spectrum where they generated the training and test data sets using the 21cmfast simulation and compared their results with 21CMMC. Cohen et al. (2019) introduced the first global 21-cm signal emulator using an ANN. Recently, Hellum Bye
et al. (2021); Bevins et al. (2021) proposed two different approaches for emulating the sky-averaged (global) 21-cm signal. In this paper, we use an emulation method to constrain the 21-cm power spectrum for the seven-parameter astrophysical model. We construct the emulator using a large dataset of models that cover a very wide range of the astrophysical parameter space. Given the seven-parameter astrophysical model, the emulator is able to predict the 21-cm power spectrum over a wide redshift range ($z = 6$ to 30). We also explore a more realistic case of the observational measurements expected for the SKA, as well as extended models that also include an excess early radio background.

This paper is organised as follows: We present in section 2 a description of the theory and methods used to generate the datasets (2.1 – 2.4) and build the ANN (2.5). Section 3 presents our results, for standard astrophysical models (3.1 – 3.4) and ones with an early radio background (3.5 – 3.7). Finally, we summarize our results and discuss our conclusions in section 4.

2 THEORY AND METHODS

2.1 21-cm signals

2.1.1 Astrophysical parameters

We use seven key parameters to parameterize the high redshift astrophysics: the star formation efficiency ($f_*$), the minimum circular velocity of star-forming halos ($V_c$), the X-ray radiation efficiency ($f_X$), the power law slope ($\alpha$) and the low energy cutoff ($E_{\rm min}$) of the X-ray spectral energy distribution (SED), the optical depth ($\tau$) of the cosmic microwave background (CMB) and the mean free path ($R_{\rm mfp}$) of ionizing photons. Here we briefly discuss these astrophysical parameters.

• The star formation efficiency, $f_*$, quantifies the fractional amount of gas in star-forming dark matter halos that is converted into stars (Tegmark et al. 1997). The value of $f_*$ depends on the details of star formation that are unknown at high redshift, so we treat it as a free parameter. We assume a constant star formation efficiency in halos heavier than the atomic cooling mass and a logarithmic cutoff in the efficiency in lower mass halos (Fialkov et al. 2013). We cover a wide range of $f_*$ values, from 0.0001 to 0.5.

• The circular velocity, $V_c$, is another parameter that encodes the information about star formation. Star formation takes place in dark matter halos that are massive enough to radiatively cool the infalling gas (Tegmark et al. 1997). This is the main element in setting the minimum mass of star-forming halos, $M_{\rm min}$. We equivalently use the minimum circular velocity as one of our free parameters. Since the cooling and the internal feedback depend on the depth of the potential and the potential is directly related to $V_c$, it is more physical to use a fixed $V_c$ versus redshift rather than a fixed $M_{\rm min}$. Since complex feedback (e.g., Schauer et al. 2015) of various types can suppress star formation in low-mass halos, we treat $V_c$ as a free parameter. In practice the actual threshold is not spatially homogeneous in our simulation since individual pixels are affected by feedback processes including Lyman-Werner feedback on small halos, photoheating feedback during the EoR and the streaming velocity between dark matter and baryons. The relation between the circular velocity ($V_c$) and the minimum mass of the dark matter halo ($M_{\rm min}$) is given (in the Einstein de-Sitter limit which is valid at high redshift) by

$$V_c = 16.9 \left(\frac{M_{\rm min}}{10^8\,{\rm M}_\odot}\right)^{1/3} \left(\frac{1+z}{10}\right)^{1/2} \left(\frac{\Omega_m}{0.0316}\right)^{1/6} {\rm km\,s^{-1}} . \tag{1}$$

• The X-ray radiation efficiency, $f_X$, is defined by the standard expression of the ratio of the X-ray luminosity to the star formation rate ($L_X$–SFR relation) [see Fialkov et al. (2014), Cohen et al. (2017) for more details]

$$\frac{L_X}{\rm SFR} = 3 \times 10^{40} f_X\ {\rm erg\,s^{-1}\,M_\odot^{-1}\,yr} . \tag{2}$$

In the above expression $L_X$ is the bolometric luminosity and $f_X$ is the X-ray efficiency of the source. The normalization is such that $f_X = 1$ corresponds to the typical observed value for low-metallicity galaxies. Given the almost total absence of observational constraints at the relevant redshifts, we vary $f_X$ from 0.0001 to 1000.

• The power law slope $\alpha$ and the low energy cutoff $E_{\rm min}$ determine the shape of the spectral energy distribution (SED). We parameterize the X-ray SED by the power law slope $\alpha$ (where $d\log(E_X)/d\log(\nu) = -\alpha$) and the low energy cutoff ($E_{\rm min}$). These two parameters have significant degeneracy, so we vary $\alpha$ in the narrow range 1–1.5 and $E_{\rm min}$ in the broad range of 0.1–3.0 keV. The SEDs of the early X-ray sources strongly affect the 21-cm signal from both the EoR and cosmic dawn (Fialkov et al. 2014; Fialkov & Barkana 2014). Soft X-ray sources (emitting mostly below 1 keV) produce strong fluctuations on relatively small scales (up to a few tens of Mpc) whereas hard X-ray sources produce milder fluctuations on larger scales. X-ray binaries (XRBs) (Mirabel et al. 2011; Fragos et al. 2013) are major sources that are expected to have a hard X-ray spectral energy distribution.

• The optical depth of the CMB, $\tau$, is one of two parameters that describe the epoch of reionization. For given values of the other parameters, the CMB optical depth has a one to one relation with the ionizing efficiency $\zeta$, which is defined by

$$\zeta = f_* f_{\rm esc} N_{\rm ion} \frac{1}{1 + \bar{n}_{\rm rec}} , \tag{3}$$

where $f_*$ is the star formation efficiency, $f_{\rm esc}$ is the fraction of ionizing photons that escape from their host galaxy, $N_{\rm ion}$ is the number of ionizing photons produced per stellar baryon in star-forming halos, and $\bar{n}_{\rm rec}$ is the mean number of recombinations per ionized hydrogen atom. We choose to include the CMB optical depth ($\tau$) in our seven-parameter astrophysical model instead of the ionizing efficiency ($\zeta$) because $\tau$ is directly constrained by CMB observations (Planck Collaboration et al. 2018).

• The mean free path of ionizing photons, $R_{\rm mfp}$, is the other EoR parameter (Alvarez & Abel 2012). $R_{\rm mfp}$ sets the maximum distance travelled by ionizing photons. Due to the process of structure formation, dense regions of neutral hydrogen (Lyman-limit systems) effectively absorb all the ionizing radiation and thus limit the sphere of influence of each ionizing source. The mean free path parameter approximately accounts for the effect of these dense neutral hydrogen pockets during reionization. In our simulations, we vary $R_{\rm mfp}$ from 10 to 70 comoving Mpc.

2.1.2 Power spectrum

It is possible in principle to map the distribution of neutral hydrogen three dimensionally in the early universe by observing the brightness temperature contrast of the 21-cm line. In order to infer the information about the astrophysical processes in the epoch of reionization and cosmic dawn, there are a variety of approaches one can follow to characterize the 21-cm signal. Other than the global signal, the most straightforward approach is to use the statistical description of the 21-cm fluctuations, i.e., the 21-cm power spectrum.

The 21-cm power spectrum encodes a great deal of information
about the underlying physical processes related to reionization and cosmic dawn. We define the power spectrum $P(k)$ of fluctuations of the 21-cm brightness temperature (relative to the radio background, which is the CMB in standard models) by

$$\langle \tilde{\delta}_{T_b}(\mathbf{k})\, \tilde{\delta}^*_{T_b}(\mathbf{k}') \rangle = (2\pi)^3 \delta_D(\mathbf{k} - \mathbf{k}')\, P(k) , \tag{4}$$

where $\mathbf{k}$ is the comoving wave vector, $\delta_D$ is the Dirac delta function, and the angular brackets denote the ensemble average. $\tilde{\delta}_{T_b}(\mathbf{k})$ is the Fourier transform of $\delta_{T_b}(\mathbf{x})$, which is defined by $\delta_{T_b}(\mathbf{x}) = (T_b(\mathbf{x}) - \bar{T}_b)/\bar{T}_b$. Finally we express the power spectrum in terms of the variance, in mK$^2$ units:

$$\Delta^2 = \langle T_b \rangle^2 \frac{k^3 P(k)}{2\pi^2} , \tag{5}$$

where the expression $k^3 P(k)/2\pi^2$ is dimensionless. The 21-cm signal is significantly non-Gaussian because of both large-scale and small-scale processes during reionization and cosmic dawn. Thus, the power spectrum does not reveal all the statistical information that is available. Nevertheless, a wealth of astrophysical information can be extracted from the 21-cm power spectrum and it can be measured relatively easily from observations.

2.2 The excess radio background

The first observational signature of the HI 21-cm line from cosmic dawn was tentatively detected by the EDGES collaboration (Bowman et al. 2018). The shape and magnitude of this signal are not consistent with the standard astrophysical expectation. The reported 21-cm signal is centered at $\nu = 78.2$ MHz with an absorption trough of $-500^{+200}_{-500}$ mK (Bowman et al. 2018). The amplitude of absorption is more than a factor of two larger than that predicted from standard astrophysics based on the ΛCDM cosmology and hierarchical structure formation. The SARAS 3 experiment recently reported an upper limit on the global signal that is inconsistent with the EDGES signal at 95% confidence (Singh et al. 2021), so it will be some time before we can be confident that the global 21-cm signal has been reliably measured.

If EDGES is confirmed, one possible explanation of this observed signal is that there is an additional cooling mechanism that makes the neutral hydrogen gas colder than expected; a novel dark matter interaction with the cosmic gas (Barkana 2018b) is a viable option, but it likely requires a somewhat elaborate dark matter model (Berlin et al. 2018; Barkana et al. 2018; Muñoz & Loeb 2018; Liu et al. 2019). Another possibility, which we consider in detail in this paper, is an excess radio background above the CMB (Bowman et al. 2018; Feng & Holder 2018; Ewall-Wice et al. 2018; Fialkov & Barkana 2019; Mirocha & Furlanetto 2019; Ewall-Wice et al. 2020; Reis et al. 2020b). This excess radio background increases the contrast between the spin temperature and the background radiation temperature. In this case the basic equation for the observed 21-cm brightness temperature from redshift $z$ relative to the background is

$$T_b = \frac{T_{\rm S} - T_{\rm rad}}{1 + z} \left(1 - e^{-\tau_{21}}\right) , \tag{6}$$

where $T_{\rm S}$ is the spin temperature, $\tau_{21}$ is the 21-cm optical depth, and $T_{\rm rad} = T_{\rm CMB} + T_{\rm radio}$, with $T_{\rm radio}$ being the brightness temperature of the excess radio background and $T_{\rm CMB} = 2.725(1 + z)$ K. We consider two distinct types of extra radio models, which we have considered in previous publications. The external radio model assumes a homogeneous background that is not directly related to astrophysical sources, i.e., may be generated by exotic processes (such as dark matter decay) in the early universe. In this model, we assume that the brightness temperature of the excess radio background at the 21-cm rest frame frequency at redshift $z$ is given by (Fialkov & Barkana 2019)

$$T_{\rm radio} = A_r \left[\frac{1420}{78(1 + z)}\right]^{\beta} \times 2.725(1 + z)\ {\rm K} , \tag{7}$$

where the spectral index $\beta = -2.6$ (set to match the slope of the extragalactic radio background observed by ARCADE2 (Fixsen et al. 2011; Seiffert et al. 2011) and confirmed by LWA1 (Dowell & Taylor 2018)) and $A_r$ is the amplitude of the radio background. Here 1420 MHz$/(1 + z)$ is the observed frequency corresponding to redshift $z$, and $A_r$ measures the amplitude (relative to the CMB) at the central frequency of the EDGES feature (78 MHz). Thus, the external radio model has eight free parameters: $f_*$, $V_c$, $f_X$, $\alpha$, $E_{\rm min}$, $\tau$, $R_{\rm mfp}$ and $A_r$.

In contrast to this external radio background, astrophysical sources such as supermassive black holes or supernovae could in principle produce such an extra radio background due to synchrotron radiation. In such a case, the radio emission would originate from within high redshift radio galaxies and would thus result in a spatially varying radio background, as computed accurately on large scales within our semi-numerical simulations (Reis et al. 2020b). The galaxy radio luminosity can be written as

$$L_{\rm radio}(\nu, z) = f_R \times 10^{22} \left(\frac{\nu}{150\ {\rm MHz}}\right)^{-\alpha_{\rm radio}} \left(\frac{\rm SFR}{{\rm M}_\odot\ {\rm yr}^{-1}}\right) {\rm W\,Hz^{-1}} , \tag{8}$$

where $\alpha_{\rm radio}$ is the spectral index in the radio band, SFR is the star formation rate and $f_R$ is the normalization of the radio emissivity. Based on observations of low-redshift galaxies, we set $\alpha_{\rm radio} = 0.7$ and note that $f_R = 1$ roughly corresponds to the expected value (Gürkan et al. 2018; Mirocha & Furlanetto 2019). Since extrapolating low-redshift observations to cosmic dawn may be wildly inaccurate, in our analysis we allow $f_R$ to vary over a wide range. Thus, the galactic radio model is also based on eight parameters: $f_*$, $V_c$, $f_X$, $\alpha$, $E_{\rm min}$, $\tau$, $R_{\rm mfp}$, and $f_R$.

Both types of radio background, if they exist, can affect the 21-cm power spectrum, leading to a strong amplification of the 21-cm signal during cosmic dawn and the EoR in models in which the radio background is significantly brighter than the CMB. However, there are some major differences between the two models. The external radio background is spatially uniform, is present at early cosmic times (prior to the formation of the first stars), and increases with redshift (i.e., it is very strong at cosmic dawn and weakens during the EoR). On the other hand, the galactic radio background is non-uniform, and its intensity generally rises with time as it follows the formation of galaxies (as long as $f_R$ is assumed to be constant with redshift).

2.3 Mock SKA data

To consider a more realistic case study, we create mock SKA data by including several expected observational effects in the 21-cm power spectrum, which we refer to as the case "with SKA noise". To incorporate the SKA noise case within the data, (i) we include the effect of the SKA angular resolution, (ii) we add a pure Gaussian noise smoothed over the SKA resolution as a realization of the SKA thermal noise (following Banet et al. (2021), see also Koopmans et al. (2015)) and (iii) we also add residuals from foreground avoidance, by assuming that part of $k$-space (the "foreground wedge") is removed since it is dominated by foregrounds (following Reis et al. (2020a), see also Datta et al. (2010); Dillon et al. (2014); Pober et al. (2014);
Pober (2015); Jensen et al. (2015)). Each of the three effects is included along with its expected redshift dependence. Regarding the foreground residuals, we note that we assume that the high-resolution maps of the SKA will enable a first step of reasonably accurate foreground subtraction, so that the remaining wedge-like region for avoidance will be limited (corresponding to the "optimistic model" of Pober et al. (2014)). In order to gain some understanding of the separate SKA effects, we also consider a case that we label "with thermal noise". In this case, we add the effect of SKA resolution and thermal noise, i.e., the same as "SKA noise" except without foreground avoidance.

Given the lower accuracy, for cases with mock SKA effects we use coarser binning, namely eight redshift bins and five $k$ bins. The five $k$-bins are spaced evenly in log scale between $k = 0.05$ Mpc$^{-1}$ and $k = 1.0$ Mpc$^{-1}$; we average the 21-cm power spectrum at each redshift over the range of $k$ values within each bin. To fix the redshift bins, we imagine placing our simulation box multiple times along the line of sight, so that our comoving box size fixes the redshift range of each bin. For example, we start with $z = 27.4$, which corresponds to 50 MHz (the limit of the SKA), as the far side of the highest-redshift bin. Then the center of the box is 192 comoving Mpc (half of our 384 Mpc box length) closer to us. The redshift corresponding to the center is taken as the central redshift of the first bin. The next redshift bin is 384 Mpc closer and so on. As the total comoving distance between $z = 27.4$ and $z = 6$ is around 3000 Mpc, we obtain 8 redshift bins that naturally correspond to a line of sight filled with simulation boxes. We then average the 21-cm power spectrum over the redshift range spanned by each box along the line of sight, by using the simulation outputs which we have at finer resolution in redshift. This averaging is part of the effect of observing a light cone; while there is also an associated anisotropy (Barkana & Loeb 2006; Datta et al. 2012), in this paper we only consider the spherically-averaged 21-cm power spectrum.

2.4 Method to generate the dataset

We use our own semi-numerical simulation (Visbal et al. 2012; Fialkov & Barkana 2014) to predict the 21-cm signal for each possible model. The simulation generates realizations of the universe in a large cosmological volume ($384^3$ comoving Mpc$^3$) with a resolution of 3 comoving Mpc over a wide range of redshifts (6 to 50). The simulation follows the hierarchical structure formation and the evolution of the Ly$\alpha$, X-ray, Lyman-Werner (LW), and ionizing ultra-violet radiation. The extended Press-Schechter formalism is used to compute the star formation rate in each cell at each redshift (Barkana & Loeb 2004). The 21-cm brightness temperature cubes are output by the simulation and we use them to calculate the 21-cm power spectrum at each redshift. While this semi-numerical simulation was inspired by 21cmFAST (Mesinger et al. 2011), it is an entirely independent implementation with various differences such as more accurate X-ray heating (including the effect of local reionization on the X-ray absorption) and Ly$\alpha$ fluctuations (including the effect of multiple scattering and Ly$\alpha$ heating). Inhomogeneous processes such as the streaming velocity, LW feedback, and photo-heating feedback are also included in the code. We created a mock 21-cm signal using the code for a large number of astrophysical models and calculated the 21-cm power spectrum for each parameter combination. Considering first standard astrophysical models (without an excess radio background), we generated the 21-cm power spectrum for 3195 models that cover a wide range of possible values of the seven astrophysical parameters. The ranges of the parameters were $f_* = 0.0001$–0.50, $V_c = 4.2$–100 km s$^{-1}$, $f_X = 0.0001$–1000, $\alpha = 1.0$–1.5, $E_{\rm min} = 0.1$–3.0 keV, $\tau = 0.01$–0.12, and $R_{\rm mfp} = 10.0$–70.0 Mpc. As explained above, our analysis involved two more datasets (3195 models each) of 21-cm power spectra, with either full SKA noise or SKA thermal noise only, in order to analyze a more realistic situation. In order to investigate the two scenarios of the excess radio background (where the number of free parameters is increased by one), we use two new datasets of models: 10158 models with the galactic radio background and 5077 models with the external radio background.

2.5 Artificial Neural Network

Artificial neural networks (ANNs; often simply called neural networks or NNs) are computing systems that mimic in some ways the biological neural networks that constitute the human brain. We briefly summarize their properties. An ANN consists of a collection of artificial neurons. Each artificial neuron has inputs and produces a single output which can be the input of multiple other neurons. In our analysis, we use a Multi-layer Perceptron (MLP), which is a supervised machine learning algorithm based on an artificial neural network. To define the neural network architecture we need to specify the number of hidden layers, number of nodes in each layer, the activation function, the solver, and the maximum number of iterations.

A Multi-layer Perceptron (MLP) is a supervised learning algorithm that learns to fit a mapping $f : \mathbb{R}^m \to \mathbb{R}^o$ using a training dataset, where $m$ is the input dimension and $o$ is the output dimension. When we apply unknown data as a set of input features $X = x_1, x_2, x_3, ..., x_m$, the neural network uses the mapping to infer the target output ($y$). This Multi-layer Perceptron can be used for both classification and regression problems. The advantage of a Multi-layer Perceptron is that it can learn highly non-linear models.

Every neural network has three different types of layers each consisting of a set of nodes or neurons. They are the input layer, hidden layer and output layer. The input layer consists of a set of neurons that represent the input features $X = x_1, x_2, x_3, ..., x_m$. Each neuron in the input layer is connected to all the neurons in the first hidden layer with some weights and each node in the first hidden layer is connected to all the nodes in the next hidden layer and so on. The output layer receives the values from the last hidden layer and transforms them to the output target value. A specific weight ($w$) and a bias ($b$) are applied to every input or feature. Both the weight and the bias are initially chosen randomly. For a particular neuron in the $i$'th hidden layer, if $x_i$ is the input and $x_{i+1}$ is the output of that neuron, then $x_{i+1} = g(w_i x_i + b_i)$, where $g$ is called the activation function. Using linear activation functions would make the entire network linear in the inputs, and thus equivalent to a single-layer network. Thus, non-linear activation functions are typically used in order to provide the ability to handle complex, non-linear data. In its simplest form, the activation function activates a neuron, i.e., this function takes its input and compares it with a threshold value. If the input is greater than the threshold, it is forwarded to the next layer and if it is less than the threshold, it is turned to zero. Commonly used non-linear activation functions include the logistic sigmoid function, hyperbolic tangent function and the rectified linear unit function. A backpropagation algorithm is usually used to train an artificial neural network. The training procedure for a network involves several steps:

• Initialization: Randomly chosen initial weights and biases are applied to all the nodes or neurons in each layer.
• Forward propagation: The output is computed using the neural network based on the initial choices of the weights and biases given the input from the training dataset. Since the calculation progresses
from the input to the output layer (through the hidden layers), this is known as forward propagation.
• Error estimation: An error function (often called a loss function) is used to compute the difference between the predicted and the true (known) output of the model, given the current weights. MLP uses different loss functions based on the problem type. For regression, a common choice is the mean square error.
• Backpropagation and updating of the weights: A backpropagation algorithm minimizes the error function and finds the optimal weight values, typically by using the gradient descent technique. The outermost weights get updated first and then the updates propagate towards the input layer, hence the term backpropagation.
• Repetition until convergence: In each iteration, the weights get updated by a small amount, so to train a neural network several iterations are required. The number of iterations until convergence depends on the learning rate and the optimization method used in the network.

Once the network has been trained using the training dataset, the trained network can make predictions for arbitrary input data that were not a part of the training set.

2.5.1 Astrophysical parameter predictions

For the purpose of predicting the astrophysical parameters, we used a two layer MLP with 150 neurons in the first hidden layer and 50 neurons in the second hidden layer. The network was expected to be somewhat complex as we want a mapping between the seven astrophysical parameters of the model and, on the other side, the 21-cm power spectrum for 32 values of the wavenumber in the range 0.05 Mpc$^{-1}$ <

... sigmoid function as the activation function for the hidden layers and the stochastic gradient-based optimizer for the weight optimization. We use 3095 models to train the neural network, and we then apply the trained ANN to a test dataset consisting of 100 models. Throughout this paper, for simplicity we choose test cases that have non-zero power spectra from intergalactic hydrogen, i.e., that have not fully reionized by redshift 6.

2.5.2 Emulation of the 21-cm power spectrum

If the statistical description of the 21-cm signal (here the 21-cm power spectrum) is our main focus, then we hope to avoid the need to run a semi-numerical simulation for each parameter combination. We can instead construct an emulator that provides rapidly-computed output statistics that capture the important information in the signal given a set of astrophysical parameters.

We train the neural network to predict the 21-cm power spectrum based on the seven-parameter astrophysical model specified above. As in the case of the ANN to predict the parameters, here also we standardize the features as part of data pre-processing. To reduce the dimension of the power spectrum data, we apply a PCA transformation to the data; after experimentation we found that here 20 PCA components suffice. As before, we again use a log scale for both the dataset of the parameters and the 21-cm power spectra. Next we need to find the appropriate neural network architecture to construct the emulator. For this, we choose some specified hyperparameters for our multi-layer perceptron estimator and search among all possible combinations to find the best one to use in our MLP regressor. To emulate the 21-cm power spectrum, we use a three layer MLP with 134 neurons in each layer. We use the logistic sigmoid
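As a rough illustration of the emulator set-up described in Section 2.5.2 (log-scaled inputs, standardized features, PCA compression of the power spectra, and an MLP with sigmoid activations), the following NumPy sketch trains a toy emulator. It is a minimal sketch under stated assumptions, not the pipeline used in this work: the training library is a synthetic stand-in for the simulation outputs, and for brevity the network has a single hidden layer of 32 neurons trained by plain full-batch gradient descent, rather than the three layers of 134 neurons quoted above. All variable and function names (theta, log_ps, emulate, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the simulation library (assumption: NOT the real 21-cm model).
# Rows are models; columns are 7 log-scaled parameters or 64 log power-spectrum bins.
n_models, n_params, n_bins = 400, 7, 64
theta = rng.uniform(-1.0, 1.0, size=(n_models, n_params))
mixing = rng.normal(size=(n_params, n_bins))
log_ps = np.tanh(theta @ mixing / np.sqrt(n_params))

# Pre-processing: standardize the inputs, compress the outputs with PCA (via SVD).
x_mean, x_std = theta.mean(axis=0), theta.std(axis=0)
y_mean = log_ps.mean(axis=0)
n_pca = 20  # number of components quoted in the text
_, _, vt = np.linalg.svd(log_ps - y_mean, full_matrices=False)
components = vt[:n_pca]                       # shape (n_pca, n_bins)

def standardize(params):
    return (params - x_mean) / x_std

X = standardize(theta)
Y = (log_ps - y_mean) @ components.T          # PCA scores, shape (n_models, n_pca)

# Train on the first 350 models; the last 50 are held out as a test set.
X_train, Y_train = X[:350], Y[:350]

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Initialization: random weights, zero biases (one hidden layer for brevity).
n_hidden = 32
W1 = rng.normal(scale=0.5, size=(n_params, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, n_pca))
b2 = np.zeros(n_pca)

learning_rate = 0.5
losses = []
for step in range(2000):
    # Forward propagation.
    hidden = sigmoid(X_train @ W1 + b1)
    pred = hidden @ W2 + b2
    # Error estimation: mean square error.
    err = pred - Y_train
    losses.append(np.mean(err ** 2))
    # Backpropagation: output-layer gradients first, then the hidden layer.
    g_pred = 2.0 * err / err.size
    g_W2 = hidden.T @ g_pred
    g_b2 = g_pred.sum(axis=0)
    g_act = (g_pred @ W2.T) * hidden * (1.0 - hidden)
    g_W1 = X_train.T @ g_act
    g_b1 = g_act.sum(axis=0)
    # Gradient-descent update of all weights and biases.
    W2 -= learning_rate * g_W2
    b2 -= learning_rate * g_b2
    W1 -= learning_rate * g_W1
    b1 -= learning_rate * g_b1

def emulate(params):
    """Map log-scaled parameters to emulated log power spectra."""
    hidden = sigmoid(standardize(params) @ W1 + b1)
    scores = hidden @ W2 + b2
    return y_mean + scores @ components

emulated = emulate(theta[350:])
test_rms = np.sqrt(np.mean((emulated - log_ps[350:]) ** 2))
print(f"training loss: {losses[0]:.3f} -> {losses[-1]:.3f}; test rms: {test_rms:.3f}")
```

Once trained, a call such as emulate(theta[:1]) returns one emulated log power spectrum almost instantly, which is what makes an emulator practical inside a sampling loop. Emulating the PCA scores rather than every bin keeps the output layer small.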
we follow a Bayesian analysis for finding the posterior probability distribution of the parameters. We use MCMC methods for sampling the probability distribution functions or probability density functions (pdfs).

The posterior pdf for the parameters $\theta$ given the data $D$, $p(\theta \mid D)$, is, in general, the likelihood $p(D \mid \theta)$ (i.e., the pdf for the data $D$ given the parameters $\theta$) times the prior pdf $p(\theta)$ for the parameters, divided by the probability of the data $p(D)$:

$$ p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} , \tag{9} $$

where the denominator $p(D)$ can be thought of as a normalization factor that makes the posterior distribution function integrate to unity. If we assume that the noise is independent between data points, then the likelihood function is the product of the conditional probabilities

$$ \mathcal{L} = \prod_{n=1}^{N} p(D_n \mid \theta) . \tag{10} $$

Taking the logarithm,

$$ \ln \mathcal{L} = -\frac{1}{2} \sum_{n=1}^{N} \left[ \frac{\left[ D_n - m_{n,{\rm model}}(\theta) \right]^2}{s_n^2} + \ln\left( 2\pi s_n^2 \right) \right] , \tag{11} $$

where we set $s_n^2 = \sigma_n^2 + f^2 m_{n,{\rm model}}^2$. The likelihood function here is assumed to be a Gaussian, where the variance is modelled, as is common for the MCMC procedure, as a sum of a constant plus a multiple of the predicted data (i.e., as a combination of an absolute error and a relative error). While we might in the future try to directly include estimated observational errors and covariances in the data, here we instead adopt a black-box approach where we allow the NN and MCMC procedures to estimate on their own the total effective uncertainties and correlations, including also the effect of the uncertainty in the emulation. In particular, $f$ is a free parameter that gives the MCMC procedure the flexibility to do this, so we include it effectively as an additional model parameter. We apply the procedure to obtain the posterior distribution for all the parameters (seven astrophysical parameters and $f$) and then marginalize over the extra parameter ($f$) to obtain the properly marginalized posterior distribution for the seven astrophysical parameters (Hogg et al. 2010). Here the index $n$ denotes various $k$-bins and $z$-bins, where the data $D_n$ is the mock observation of the 21-cm power spectrum and $m_{n,{\rm model}}$ is the predicted 21-cm power spectrum from the emulator. In this work we adopt an effective constant error of:

$$ \sigma_n = 0.15\ {\rm mK}^2 . \tag{12} $$

This ensures that the algorithm does not try to achieve a low relative error when the fluctuation itself is low (below ∼ 0.4 mK) and likely more susceptible to systematic errors in realistic data. What we have described here is a typical setup for MCMC. The final uncertainties are insensitive to the detailed assumptions since in the end the errors are found numerically by the MCMC procedure; furthermore, we have test models that allow us to independently test the reliability of the uncertainty estimates, as described further in the results section below.

We use the emcee sampler (Foreman-Mackey et al. 2013) which is the affine-invariant ensemble sampler for MCMC (Goodman & Weare 2010). The MCMC sampler only computes the likelihood when the parameters are within the prior bounds. We set the prior bounds for the parameters according to Table 1 and we use flat priors for the parameter values (in log except for $\alpha$ and $E_{\rm min}$).

 Parameters              Lower bound   Upper bound
 $f_*$                   0.0001        0.50
 $V_c$ [km/s]            4.2           100
 $f_X$                   0.0001        1000
 $\alpha$                0.9           1.6
 $E_{\rm min}$ [keV]     0.09          3.1
 $\tau$                  0.01          0.14
 $R_{\rm mfp}$ [Mpc]     9             74

Table 1. The prior bounds for the astrophysical parameters.

In practice we show below that the MCMC uncertainties significantly underestimate the true errors. Thus, in order to find more accurate error bounds, we use ensemble learning. Instead of using one emulator, we use an ensemble of emulators with the same neural network architecture. Here we use 20 emulators, each of which we train with a randomly drawn subset consisting of 90% of the training dataset (3095 models). We apply each of the trained emulators to the same test dataset and carry out the Bayesian analysis employing the MCMC sampler. Then, for a particular parameter, we take as our best predicted value the mean of the predicted values from the MCMC sampler using all the emulators. For the uncertainties, we take the mean of the distances to the upper and lower uncertainty bounds of the emulators. We label the resulting average the MCMC uncertainty; this is an ensemble-averaged estimate of the internal error of the MCMC procedure using a single emulator. To find the external error of each parameter, we calculate the standard deviation of the predicted best-fit values given by these 20 different emulators, and this we label the Bootstrap uncertainty because it originates from the random sampling of the training dataset.

3 RESULTS

3.1 Performance analysis of the emulators

We show the performance of the emulator of the 21-cm power spectrum in Fig. 1. We compare the emulated power spectrum and the true power spectrum from the semi-numerical simulation for two particular $k$ values. The left panel shows a few random examples of the emulated power spectrum (solid lines) and the true power spectrum (dashed lines). The different colors denote different models. In this figure, we see that the accuracy of the emulator is generally good and tends to improve with the height of the power spectrum, although there is some random variation among different models. A more representative, statistical analysis of the accuracy is shown further below.

The right panel of Fig. 1 shows a few random examples of the comparison between the power spectrum emulated by the emulator with SKA noise and the true power spectrum with the SKA noise. The different colors denote different astrophysical models. Again, the emulation is seen to be reasonably accurate, although in some cases the emulated 21-cm power spectrum significantly deviates from the actual one at low redshift and/or when the power spectrum is low. The variations intrinsic to the different models in the power spectra (left panels in Fig. 1) are heavily suppressed once we include the expected observational effects of the SKA experiment into the power spectra (right panels in Fig. 1). In particular, the thermal noise dominates at high redshift. However, as we find from the results below, when we fit the power spectrum with SKA noise there is still significant information in the data that allows the fitting procedure to reconstruct
the input parameters. An advantage of machine learning is that the algorithm learns directly how to best deal with noisy data, and there is no need to try to explicitly model or fit the observational effects.

[Figure 1: panels "Without SKA noise" (left) and "With SKA noise" (right), at k = 0.11 Mpc−1 (top) and k = 1.09 Mpc−1 (left) or 1.0 Mpc−1 (right) (bottom); axes: Δ² [mK²] versus 1 + z; curves: emulated Δ² and true Δ².]

Figure 1. A few random examples of the emulated power spectrum without SKA noise (left panel) and with SKA noise (right panel) at $k = 0.11$ Mpc$^{-1}$ (upper panel) and $k \approx 1.0$ Mpc$^{-1}$ (lower panel); note that the $k$-bin values and widths are different in the SKA case, as explained in the text. The dashed line is the true power spectrum from the simulation and the solid line is the emulated power spectrum (for combinations of astrophysical parameters that were not included in the training set). Different colors show different models.

To test statistically the performance of the emulator in predicting the 21-cm power spectrum, we use a test dataset of 100 randomly-chosen models that were not part of the training set. We quantify the performance in detail below, but here, as an optimistic overall estimate, we quantify whether any parts of the power spectrum (in $k$ and $z$ space) are well measured. To this end, we calculate the error in the predicted power spectrum $\Delta^2_{\rm predicted}$ compared to the power spectrum generated from the simulation for the same parameter set, $\Delta^2_{\rm true}$, and normalize relative to the maximum value of the power spectrum. Specifically, we find the r.m.s. value of the difference between $\Delta^2_{\rm predicted}$ and $\Delta^2_{\rm true}$, and divide by the maximum value of the true power spectrum over all $k$ and $z$:

$$ {\rm Error} = \frac{\sqrt{{\rm Mean}\left[ \left( \Delta^2_{\rm predicted} - \Delta^2_{\rm true} \right)^2 \right]}}{{\rm Max}\left[ \Delta^2_{\rm true} \right]} . \tag{13} $$

Here we take the mean over all $k$ and $z$. When we calculate this for the dataset with SKA noise, we normalize the error using the maximum value of the power spectrum without SKA noise, binned over the SKA $k$- and $z$-bins. For the case of the emulator trained using the 21-cm power spectrum without SKA noise, the median of this relative error over the test dataset is 0.009 whereas for the emulator trained using the 21-cm power spectrum with the SKA noise, the median is 0.002. The SKA binning and the SKA smoothing effects (namely the angular resolution and foreground avoidance) reduce the differences between different inherent power spectra, and this results in a lower median of the relative error compared to the case without SKA noise. This first performance analysis uses an optimistic measure of error as it is normalized to the maximum value of the power spectrum, but it clearly indicates that some portions of $k$ and $z$ space can be emulated accurately, including for the case where the emulator only has access to data with SKA noise. Below we consider more detailed assessments of the performance of the emulator.

3.2 Dependence of the emulation error on the redshift and wavenumber

For a more detailed assessment of the emulator, we calculate how the error varies with redshift and wavenumber. For this we use test datasets of 100 models for each of the cases: without SKA noise, with SKA noise, and with SKA thermal noise only. We first directly test the emulator by comparing the predicted power spectrum (feeding into the emulator the known true parameters) to the true simulated power spectrum (as in the previous subsection, but here divided separately into $k$ and $z$ bins). In addition, we test the complete framework by finding the best-fit astrophysical parameters to mock data using the MCMC sampler; feeding the best-fit parameters to the emulator of the power spectrum; and finding the error of this best-fit predicted power spectrum compared to the true simulated power spectrum. In cases with SKA noise, we are not interested in finding the error in the predicted power spectrum with SKA noise (as the power spectrum is often dominated by noise, especially at high redshifts); instead, we make the more challenging comparison of the best-fit predicted power spectrum to the true power spectrum, both in their "clean" versions (i.e., without SKA noise). To be clear, this means taking
the reconstructed best-fit astrophysical parameters (which were reconstructed from the mock data with SKA noise, based on the NN trained using power spectra with SKA noise), and using them as input to the NN trained using power spectra without SKA noise. Here we use the following definition to quantify the error as a function of redshift and wavenumber:

$$ {\rm Error}(k, z) = {\rm Median}\left[ \frac{\left| \Delta^2_{\rm predicted\_clean} - \Delta^2_{\rm true\_clean} \right|}{\Delta^2_{\rm true\_clean} + 0.15\ {\rm mK}^2} \right] , \tag{14} $$

where we take the median over the test models; in this paper we often take the median in order to measure the typical error and reduce the sensitivity to outliers. This definition of the error measures the absolute value of the relative error, except that the denominator includes a constant in order not to demand a low relative error when the fluctuation itself is low (in agreement with eq. 12). Note that here the errors are much larger than before because they are not normalized to the maximum value of the power spectrum but are measured separately at each bin, including when the power spectrum is low.

[Figure 2: four columns of panels showing Error(k, z); upper row: error versus k [Mpc−1] at z = 6, 11, 17, 22, 26 and 30; lower row: error versus z at k = 0.05, 0.09, 0.16, 0.3, 0.54 and 0.99.]

Figure 2. Redshift and wavenumber dependence of the relative error in emulating the best-fit power spectrum. The upper panels show the dependence on wavenumber (for fixed redshift) and the lower panels depict the redshift dependence (for fixed wavenumber). For the left-most panels, we emulate the power spectrum using the true parameters from the test dataset. For the panels in the second column from the left, we emulate the power spectrum using the best-fit parameters derived from the network without SKA noise. For the panels in the third column from the left, we use the best-fit parameters derived from the network with SKA noise, but for the error we measure the prediction of the real power spectrum, i.e., we apply the emulator trained without SKA noise. For the panels in the right-most column, we use the best-fit parameters derived from the network with thermal noise and otherwise do the same as for the third column. Note that the plots in this figure show all 25 $z$ values and 32 $k$ values.

In Fig. 2, we show how the error varies with wavenumber (top panels) and redshift (bottom panels), for both the without and with SKA cases. For the direct emulation case (left-most panels, where we emulate the power spectrum using the true parameters from the test dataset), the relative error decreases with wavenumber up to $k \sim 0.1$–$0.2$ Mpc$^{-1}$, then plateaus, and again increases above $k \sim 0.6$ Mpc$^{-1}$. The redshift dependence shows a less regular pattern, except that the errors tend to increase both at the low-redshift and high-redshift end. Overall, the typical emulation error of the power spectrum in each bin is 10–20% over a broad range of $k$ and $z$, but it rises above 20% at the lowest and highest $k$ values (for most redshifts), and at the lowest redshift for all $k$ values (i.e., at $z = 6$, near the end of reionization, when the power spectrum is highly variable and is sensitive to small changes in the parameters). For some perspective, we note that a 20% error is typically adopted to represent the systematic theoretical modeling error in the 21-cm power spectrum (e.g., Kern et al. 2017). In the panels in the second column from the left, we use the best-fit parameters derived from the network without SKA noise to emulate the power spectrum. From the comparison to the left-most panels, we see that the fitting of the astrophysical parameters (in this case without SKA noise) is nearly perfect, in that the error that it adds is small compared to the error of the emulator itself. In the panels in the third column, the best-fit parameters are derived from the network with SKA noise, but as noted above, the errors are calculated for the ability to predict the real power spectrum, i.e., by comparing the true power spectrum to the prediction of the emulator that was trained using the power spectrum without SKA noise. SKA noise reduces the accuracy of the reconstruction of the astrophysical parameters but not by too much, increasing the typical errors by a fairly uniform factor of ∼ 1.5, to 15–30% for most values of $k$ and $z$. For the panels in the last column, we use the best-fit parameters derived from the network trained using the power spectrum with SKA thermal noise only. The errors are nearly identical to the full SKA noise panels, showing that the foreground effects do not add substantial error beyond the angular resolution plus thermal noise, at least for the optimistic foreground avoidance model that we have assumed.

In order to get a better understanding of the span of the models over $k$ and $z$, we show in Fig. 3 characteristic quantities that enter into the above calculation of the relative errors. In the left column, we show the median of the clean power spectrum (without any noise) as a function of the wavenumber (upper panel) and redshift (lower panel). In the other columns, the median of the absolute difference between the true and predicted clean power spectra is shown as a function of wavenumber (upper panels) and redshift (lower panels). For the panels in the middle column, the best-fit parameters are derived from the network without noise, whereas for the panels in the right column we use the best-fit parameters derived from the network with
SKA noise to emulate the clean power spectrum (without noise). This figure shows that the 21-cm power spectrum varies greatly as a function of $k$ and $z$, even when we take out the model-to-model variation by showing the median of the 100 random test cases. The variation is by three and a half orders of magnitude; even if we ignore the parameter space in which the power spectrum is lower than 0.4 mK (see eq. 12), we are left with a range of more than two orders of magnitude. For the considered ranges, the overall variation with redshift at a given wavenumber is much greater than the variation with wavenumber at a given redshift. Over this large range, the relative error in each case (with or without SKA noise) remains relatively constant; this is seen by the panels in Fig. 3 that show the relative error, which overall follows a similar pattern (with $k$ and $z$) as the power spectrum except with a compressed range of values.

3.3 Errors in the fitted astrophysical parameters

Up to now, we have examined the errors in emulating or reconstructing the 21-cm power spectrum. Of greater interest is, of course, the ability to extract astrophysical information from a given power spectrum. In addition to the unavoidable effect on the fitting of the emulation uncertainty, there are also the SKA observational effects. We show results for a random example model in Tables 2 and 3, without or with SKA noise. In this example, for several parameters the true parameter values lie well outside the $1\sigma$ uncertainty bounds calculated from MCMC only, e.g., $\alpha$ and $E_{\rm min}$ are off by nearly $5\sigma$ in the case without SKA noise, while $V_c$ is off by $2.4\sigma$ in the case with SKA noise. Thus, as explained in section 2.6, we also calculate a Bootstrap uncertainty using random sampling of the training dataset. Tables 2 and 3 show that the Bootstrap uncertainty is often significantly larger, especially for the parameters where the MCMC uncertainty severely underestimates the actual error. In the case without SKA noise, the bootstrap uncertainty is larger than the MCMC uncertainty by a factor that ranges from 1.4 to 4.9 in this example. After adding SKA noise, the MCMC uncertainty is significantly larger than in the case without SKA noise, for all the parameters, while for the bootstrap uncertainty this is the case only for a few of the parameters.

Since the MCMC uncertainty (as given by a single emulator) is unreliable, in what follows we do not focus on the MCMC contours. We show them in the Appendix for one other example (i.e., a different astrophysical model than the one used for Tables 2 and 3), in figures A1 and A2. More generally, while the results in this section are interesting, we have only shown a couple of random example models here. In order to understand the general trends, we consider below the overall statistics of the fitting as calculated for a large number of models.

3.4 Statistical analysis of the astrophysical parameter errors

In order to test the overall performance in predicting each of the parameters, we use our test dataset of 100 models. We calculate in each case the MCMC uncertainty $\sigma_1$ and the bootstrap uncertainty $\sigma_2$ as in the previous sub-section, and define a total uncertainty $\sigma_0 \equiv \sqrt{\sigma_1^2 + \sigma_2^2}$. In order to test whether this total uncertainty is a realistic error estimate, we calculate the normalized error in predicting a parameter ($P_{\rm predicted}$) compared to the true value ($P_{\rm true}$) as:

$$ {\rm Error}\,(\sigma) = \frac{P_{\rm true} - P_{\rm predicted}}{\sigma_0} . \tag{15} $$

If $\sigma_0$ is an accurate estimate then the actual values of this normalized error (for the test dataset) should have a standard deviation of unity. All these quantities, namely $P_{\rm true}$, $P_{\rm predicted}$ and $\sigma_0$, are measured in log space ($\log_{10}$) for all the parameters except for $\alpha$ and $E_{\rm min}$. We show the histogram of the errors in predicting each of the parameters in Figs. 4 (for the three most important parameters of high-redshift galaxies) and B1 (for the four other parameters, shown in the Appendix). In these figures, the left panels are for the case without SKA noise, the middle panels are for the case with SKA noise, and the right panels are for the case with SKA thermal noise. The black solid line in each panel shows the best-fit Gaussian of the histogram, also listing its mean ($\mu$) and standard deviation ($\sigma$) within the panel. The two grey dashed lines in each panel represent the $3\sigma$ boundary of the respective Gaussian. The standard deviations ($\sigma$) for most of the seven parameters are close to unity (within ∼ 20%), which implies that our procedure generates a reasonable estimate of the uncertainties. Also, the mean (which measures the bias in the prediction) is in every parameter (without SKA noise) less than 0.3 in size. The datasets with SKA noise or with thermal noise give similar results, consistent with the similar comparison in Fig. 2. With the noisy data, the mean values are as large as ∼ 1 for some of the parameters ($f_X$ and $E_{\rm min}$) which means that adding SKA noise in the dataset increases the bias in the predicted results. We also find that most of the distributions are fairly Gaussian, with only a small fraction of the 100 models yielding best-fit parameter values that fall outside the $3\sigma$ boundary of the respective Gaussian fit.

In Figs. 5 and B2 (the latter in the Appendix), we show the histogram of the total uncertainty ($\sigma_0$) for each of the parameters. In the left panels we compare the histogram of the total uncertainty for the cases: without SKA noise and with SKA noise, whereas in the right panels we compare the total uncertainty for the cases: without SKA noise and with thermal noise. Again the total uncertainties are measured in log scale ($\log_{10}$) for all the parameters except for $\alpha$ and $E_{\rm min}$. Table 4 shows the corresponding median of the total uncertainty ($\sigma_0$) for the cases: without SKA noise, with SKA noise and with thermal noise. In the theoretical case of no observational limitations ("Without SKA"), the emulation errors still allow the parameters $V_c$ and $\tau$ to be reconstructed with a typical accuracy of 3%, and $f_*$ to within 8%. The ionizing mean free path ($R_{\rm mfp}$) is typically uncertain by a factor of 1.5, and $f_X$ by a factor of 3.1. For the linear parameters, the uncertainty is typically ±0.2 in $\alpha$ and ±0.65 in $E_{\rm min}$. The uncertainties with full SKA noise and with only SKA thermal noise are basically the same (except for some random scatter). The uncertainties in $f_X$, $\alpha$, $E_{\rm min}$, and $R_{\rm mfp}$ are only marginally affected by adding mock SKA effects, indicating that the emulation error dominates for these parameters. However, SKA noise substantially increases the errors in the other parameters, to 15% in $V_c$, 5% in $\tau$, and 32% in $f_*$. Of course, currently our knowledge of most of these parameters is uncertain by large factors (orders of magnitude in some cases), so these types of constraints would represent a remarkable advance.

3.5 Classification of the radio backgrounds

As noted in the introduction, the possible observation of the absorption profile of the 21-cm line centered at 78 MHz with an amplitude of −500 mK by the EDGES collaboration is incompatible with the standard astrophysical prediction. One of the possible explanations for this unexpected signal is that an excess radio background above the CMB enhances the contrast between the spin temperature and the background radiation temperature. Fialkov & Barkana (2019) considered a uniform external radio background (not related to the
astrophysical sources directly), with a synchrotron spectrum of spectral index $\beta = -2.6$ and amplitude parameter $A_r$ measured relative to the CMB at the reference frequency of 78 MHz. Another potential model for the excess radio background is that it comes from the high redshift radio galaxies. The effect of the inhomogeneous galactic radio background on the 21-cm signal has been explored by Reis et al. (2020b). They used the galactic radio background model to explain the unexpected EDGES low band signal. In our work, we use both the external and galactic radio models and train a neural network to try to infer the type of the radio background given the 21-cm power spectrum. For this purpose, we create a training dataset of 9500 models (where there are ∼ 5000 models with a galactic radio background and ∼ 4500 models with an external radio background), with the astrophysical parameters varying over the following ranges: $f_* = 0.01$–$0.5$, $V_c = 4.2$–$60$ km s$^{-1}$, $f_X = 0.0001$–$1000$, $\alpha = 1.0$–$1.5$, $E_{\rm min} = 0.1$–$3.0$, $\tau = 0.033$–$0.089$, $R_{\rm mfp} = 10.0$–$70.0$.

[Figure 3: three columns of panels; left column: Median(Δ²_true_clean) [mK²]; middle and right columns: Median|Δ²_true_clean − Δ²_predicted_clean| [mK²]; upper row: versus k [Mpc−1] at z = 6, 11, 17, 22, 26 and 30; lower row: versus z at k = 0.05, 0.09, 0.16, 0.3, 0.54 and 0.99.]

Figure 3. Left column: Median of the true (clean) power spectrum (without SKA noise), $\Delta^2_{\rm true\_clean}$, as a function of wavenumber (upper panel) and redshift (lower panel). Other columns: The median of the absolute value of the difference between the true and predicted clean power spectrum. For the panels in the middle column, we emulate the power spectrum using the best-fit parameters derived from the network without SKA noise. For the panels in the right column, the best-fit parameters are derived from the network with SKA noise, but the error is measured by emulating the clean power spectrum. As in Fig. 2, the plots in this figure show all 25 $z$ values and 32 $k$ values.

 Parameters             True value   Predicted value   MCMC Uncertainty   Bootstrap Uncertainty
 $f_*$                  -1.108       -1.101            ±0.007             ±0.015
 $V_c$ [km/s]            1.108        1.110            ±0.007             ±0.010
 $f_X$                  -0.942       -0.901            ±0.032             ±0.124
 $\alpha$                1.5          1.274            ±0.048             ±0.236
 $E_{\rm min}$ [keV]     0.7          0.648            ±0.011             ±0.026
 $\tau$                 -1.112       -1.112            ±0.002             ±0.004
 $R_{\rm mfp}$ [Mpc]     1.447        1.449            ±0.019             ±0.054

Table 2. Predicted parameter values and their respective uncertainties for a model with relatively low-mass halos (the value of the parameter $V_c$ is 12.8 km/s). Here all the parameter values are in $\log_{10}$ except $\alpha$ and $E_{\rm min}$. Here we use the 21-cm power spectrum without SKA noise.

 Parameters             True value   Predicted value   MCMC Uncertainty   Bootstrap Uncertainty
 $f_*$                  -1.108       -1.038            ±0.042             ±0.057
 $V_c$ [km/s]            1.108        1.170            ±0.026             ±0.042
 $f_X$                  -0.942       -0.851            ±0.110             ±0.103
 $\alpha$                1.5          1.187            ±0.186             ±0.089
 $E_{\rm min}$ [keV]     0.7          0.668            ±0.027             ±0.028
 $\tau$                 -1.112       -1.116            ±0.007             ±0.004
 $R_{\rm mfp}$ [Mpc]     1.447        1.475            ±0.132             ±0.079

Table 3. Same as table 2. Here we use the 21-cm power spectrum with SKA noise.

[Figure 4: histograms of Counts versus Error for the parameters f∗ (top row), V_c (middle row) and f_X (bottom row); columns: without SKA noise, with SKA noise, with thermal noise. Best-fit Gaussian values listed in the panels, in column order: f∗: σ = 0.81, µ = 0.01; σ = 1.10, µ = 0.10; σ = 1.04, µ = 0.16. V_c: σ = 0.99, µ = −0.18; σ = 0.80, µ = 0.04; σ = 0.93, µ = 0.13. f_X: σ = 1.04, µ = 0.29; σ = 0.98, µ = 0.99; σ = 1.05, µ = 1.17.]

Figure 4. Histogram of the errors in the predicted parameters: $f_*$, $V_c$ and $f_X$ as defined in Eq. 15.
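The normalized error of Eq. 15 that is histogrammed in Fig. 4 can be illustrated for a single model. The sketch below assumes the example values of Table 3 (with SKA noise) and uses hypothetical helper names (`total_uncertainty`, `normalized_error`); it adds the MCMC and Bootstrap uncertainties in quadrature to form the total uncertainty and divides the offset between the true and predicted values by it. It illustrates the definition only and is not the pipeline used for the 100-model test dataset.

```python
import math

# (true, predicted, mcmc_sigma, bootstrap_sigma), copied from Table 3 (with SKA noise).
# All values are log10 except alpha and E_min, as stated in the table caption.
TABLE_3 = {
    "f_star": (-1.108, -1.038, 0.042, 0.057),
    "V_c":    (1.108, 1.170, 0.026, 0.042),
    "f_X":    (-0.942, -0.851, 0.110, 0.103),
    "alpha":  (1.5, 1.187, 0.186, 0.089),
    "E_min":  (0.7, 0.668, 0.027, 0.028),
    "tau":    (-1.112, -1.116, 0.007, 0.004),
    "R_mfp":  (1.447, 1.475, 0.132, 0.079),
}


def total_uncertainty(s_mcmc, s_boot):
    """Total uncertainty: MCMC and Bootstrap uncertainties added in quadrature."""
    return math.hypot(s_mcmc, s_boot)


def normalized_error(true, pred, s_mcmc, s_boot):
    """Eq. 15: (P_true - P_predicted) divided by the total uncertainty."""
    return (true - pred) / total_uncertainty(s_mcmc, s_boot)


errors = {name: normalized_error(*row) for name, row in TABLE_3.items()}

for name, err in errors.items():
    print(f"{name:7s} normalized error = {err:+.2f}")
```

For this one example, every parameter lies within roughly 1.5 total standard deviations of its true value, whereas with the MCMC uncertainty alone $V_c$ would be about $2.4\sigma$ off.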

For the models with a galactic radio background, the normalization of the radio emissivity (measured relative to low-redshift galaxies), $f_R$, varies over the range $f_R = 0.01$–$10^7$, and the range for the amplitude of the radio background, $A_r$, for the external radio models is 0.0001–0.5.

 Parameters             Without SKA   With SKA   With thermal
 $f_*$                  0.032         0.120      0.140
 $V_c$ [km/s]           0.013         0.062      0.062
 $f_X$                  0.486         0.539      0.527
 $\alpha$               0.206         0.216      0.220
 $E_{\rm min}$ [keV]    0.647         0.657      0.664
 $\tau$                 0.013         0.021      0.021
 $R_{\rm mfp}$ [Mpc]    0.164         0.164      0.174

Table 4. The median (over 100 test models) of the total uncertainty ($\sigma_0$) for each parameter. As before, all the parameter values are in $\log_{10}$ except $\alpha$ and $E_{\rm min}$. The columns show the cases without SKA noise, with SKA noise, and with SKA thermal noise.

We apply an EDGES-compatible test dataset to the two trained networks. The models that we refer to as EDGES-compatible satisfy the criteria adopted by Fialkov & Barkana (2019) as representing a rough compatibility with the 99% limits of the detected signal in the EDGES low band experiment, in terms of the overall decline and rise without regard to the precise shape of the absorption (which is much more uncertain). The enhanced radio emission must strictly be a high redshift phenomenon, in order to not over-produce the observed radio background (Fialkov & Barkana 2019), so we assume a cut-off redshift, $z_{\rm cutoff} = 15$ (Reis et al. 2020b) below which $f_R = 1$ as for present-day radio sources. So we only consider here redshifts from 15 to 30 (or the highest SKA redshift in the case with SKA noise). In our training dataset, we treat the radio background parameters $f_R$ or $A_r$ on an equal footing and add an extra column of a binary parameter that specifies the type of radio background: 0 for the external radio background and 1 for the galactic radio background. In our EDGES compatible test dataset, we have 530 models and 308 models with an external and a galactic radio background, respectively. We apply this test dataset to the trained NN. In the predicted parameters, we round off the binary parameter either to zero (when it is ≤ 0.5) which is the label for the external radio background, or to unity (when it is > 0.5) which is the label of the galactic radio background. The confusion matrix shown in Fig. 6 indicates the performance of our classification method for identifying the type of radio background. In the case