Recovering Quantitative Models of Human Information Processing
with Differentiable Architecture Search
Sebastian Musslick (musslick@princeton.edu)
Princeton Neuroscience Institute, Princeton University
Princeton, NJ 08544, USA
Abstract

The integration of behavioral phenomena into mechanistic models of cognitive function is a fundamental staple of cognitive science. Yet, researchers are beginning to accumulate increasing amounts of data without having the temporal or monetary resources to integrate these data into scientific theories. We seek to overcome these limitations by incorporating existing machine learning techniques into an open-source pipeline for the automated construction of quantitative models. This pipeline leverages the use of neural architecture search to automate the discovery of interpretable model architectures, and automatic differentiation to automate the fitting of model parameters to data. We evaluate the utility of these methods based on their ability to recover quantitative models of human information processing from synthetic data. We find that these methods are capable of recovering basic quantitative motifs from models of psychophysics, learning and decision making. We also highlight weaknesses of this framework, and discuss future directions for their mitigation.

Keywords: autonomous empirical research; computation graph; continuous relaxation; NAS; DARTS; AutoML

Introduction

The process of developing a mechanistic model of cognition incurs two challenges: (1) identifying the architecture of the model, i.e. the composition of functions and parameters, and (2) tuning the parameters of the model to fit experimental data. While there are various methods for automating the fitting of parameters, cognitive scientists typically leverage their own expertise and intuitions to identify the architecture of a model—a process that requires substantial human effort. In machine learning, interest has grown in automating the construction and parameterization of neural networks to solve machine learning problems more efficiently (He, Zhao, & Chu, 2021). This involves the use of neural architecture search (NAS) to automate the discovery of model architectures (Elsken, Metzen, Hutter, et al., 2019), and the use of automatic differentiation to automate parameter fitting (Paszke et al., 2017). This combination of methods has led to breakthroughs in the automated construction of neural networks that are capable of outperforming networks designed by human researchers (e.g. in computer vision: Mendoza, Klein, Feurer, Springenberg, & Hutter, 2016). In this study, we explore the utility of these methods for constructing quantitative models of human information processing.

To ease the application of NAS to the discovery of a quantitative model, it is useful to treat quantitative models as neural networks or, more generally, as computation graphs. In this article, we introduce the notion of a computation graph and describe ways of expressing the architecture of a quantitative model in terms of such a graph. We then review the use of differentiable architecture search (DARTS; Liu, Simonyan, & Yang, 2018) for searching the space of candidate computation graphs, and introduce an adaptation of this method for discovering quantitative models of human information processing. We evaluate this method based on its ability to recover three different models of human cognition from synthetic data, covering behavioral phenomena in psychophysics, learning and perceptual decision making. Our results indicate that this method is capable of recovering computational motifs found in these models. However, we also discuss further developments that are needed to expand the scope of models amenable to DARTS.

Quantitative Models as Computation Graphs

A broad class of complex mathematical functions—including the functions expressed by a quantitative model of human information processing—can be formulated as a computation graph. A computation graph is a collection of nodes that are connected by directed edges. Each node denotes an expression of a variable, and each outgoing edge corresponds to a function applied to this variable (cf. Figure 1D). The value of a node is typically computed by integrating over the result of every function (edge) feeding into that node. Akin to quantitative models of cognitive function, a computation graph can take experiment parameters as input (e.g. the brightness of two visual stimuli), and can transform this input through a combination of functions (edges) and latent variables (intermediate nodes) to produce observable dependent measures as output nodes (e.g. the probability that a participant is able to detect the difference in brightness between two stimuli).

The expression of a formula as a computation graph can be illustrated with Weber's law (Fechner, 1860)—a quantitative hypothesis that relates the difference between the intensities of two stimuli to the probability that a participant can detect this difference. It states that the just noticeable difference (JND; the difference in intensity that a participant is capable of detecting in 50% of the trials) amounts to

∆I = k · I_0   (1)

where ∆I is the JND, I_0 corresponds to the intensity of the baseline stimulus and k is a constant. The probability of detecting the difference between two stimuli, I_0 and I_1, can then be formulated as a function of the two stimuli (with I_0 < I_1),

P(detected) = σ_logistic((I_1 − I_0) − ∆I)   (2)

where σ_logistic is a logistic function.
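As a concrete illustration, Equations (1) and (2) can be written as a short function. The following minimal sketch is not taken from the paper's codebase; the value of k and the example stimulus intensities are arbitrary placeholders.

import numpy as np

def p_detected(I0, I1, k=1.0):
    """Detection probability under Weber's law (Equations (1)-(2)).
    The value of k is a placeholder; the paper fixes k to a specific value."""
    delta_I = k * I0  # just noticeable difference (Equation (1))
    return 1.0 / (1.0 + np.exp(-((I1 - I0) - delta_I)))  # logistic link (Equation (2))

# The probability grows as I1 exceeds I0 by more than k * I0.
print(p_detected(I0=1.0, I1=1.5), p_detected(I0=1.0, I1=3.0))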
Figure 1D depicts the argument of σ_logistic as a computation graph for a fixed value of k. The graph encompasses two input nodes, one representing I_0 and the other I_1. One intermediate node expresses ∆I, which results from multiplying I_0 with the parameter k. The addition and subtraction of I_1 and I_0, respectively, result in their difference (I_1 − I_0), which is represented by a second intermediate node. The linear combination of the two intermediate nodes in the output node r resembles the argument to σ_logistic.

The automated construction of a mathematical hypothesis, like Weber's law, can be formulated as a search over the space of all possible computation graphs. Machine learning researchers leverage the notion of computation graphs to represent the composition of functions performed by a complex artificial neural network (i.e. its architecture), and deploy NAS to search a space of computation graphs. Although some level of specification of the graph remains with the researcher, NAS relieves the researcher from searching through these possibilities.
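The graph described above can be made explicit as a small data structure in which every edge carries a function and every non-input node sums its incoming edges. This is a minimal sketch under assumed node names and an assumed value of k, not the implementation used in the paper.

import numpy as np

k = 1.0  # assumed constant
edges = {
    # (source, target): operation applied to the source's value
    ("I0", "dI"):   lambda v: k * v,  # Delta-I = k * I0
    ("I0", "diff"): lambda v: -v,     # subtract the baseline intensity
    ("I1", "diff"): lambda v: +v,     # add the comparison intensity
    ("dI", "r"):    lambda v: -v,     # output node combines both latents
    ("diff", "r"):  lambda v: +v,
}

def evaluate(inputs):
    values = dict(inputs)
    for node in ("dI", "diff", "r"):  # topological order of non-input nodes
        values[node] = sum(op(values[src])
                           for (src, tgt), op in edges.items() if tgt == node)
    return 1.0 / (1.0 + np.exp(-values["r"]))  # logistic output node

print(evaluate({"I0": 1.0, "I1": 3.0}))  # matches the closed-form expression above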
Identifying Computation Graphs with Neural Architecture Search

NAS refers to a family of methods for automating the discovery of useful neural network architectures. There are a number of methods to guide this search, such as evolutionary algorithms, reinforcement learning or Bayesian optimization (for a recent survey of NAS search strategies, see Elsken et al., 2019). However, most of these methods are computationally demanding due to the nature of the optimization problem: the search space of candidate computation graphs is high-dimensional and discrete. To address this problem, Liu et al. (2018) proposed DARTS, which relaxes the search space to become continuous, making architecture search amenable to gradient descent. The authors demonstrate that DARTS can yield useful network architectures for image classification and language modeling that are on par with architectures designed by human researchers. In this work, we assess whether DARTS can be adopted to automate the discovery of interpretable quantitative models to explain human cognition.

Differentiable Architecture Search

DARTS treats the architecture of a neural network as a directed acyclic computation graph (DAG), containing N nodes in sequential order (Figure 1). Each node x_i corresponds to a latent representation of the input space. Each directed edge e_{i,j} is associated with some operation o_{i,j} that transforms the representation of the preceding node i and feeds it to node j. Each intermediate node is computed by integrating over its transformed predecessors:

x_j = Σ_{i<j} o_{i,j}(x_i).   (3)

Every output node is computed by linearly combining all intermediate nodes projecting to it. The goal of DARTS is to identify all operations o_{i,j} of the DAG. Following Liu et al. (2018), we define O = {o^1_{i,j}, o^2_{i,j}, ..., o^M_{i,j}} to be the set of M candidate operations associated with edge e_{i,j}, where every operation o^m_{i,j}(x_i) corresponds to some function applied to x_i (e.g. linear, exponential or logistic). DARTS relaxes the problem of searching over candidate operations by formulating the transformation associated with an edge as a mixture of all possible operations in O (cf. Figure 1A-B):

ō_{i,j}(x) = Σ_{o∈O} [exp(α^o_{i,j}) / Σ_{o'∈O} exp(α^{o'}_{i,j})] · o(x)   (4)

where each operation is weighted by the softmax transformation of its architectural weight α^o_{i,j}. Every edge e_{i,j} is assigned a weight vector α_{i,j} of dimension M, containing the weights of all possible candidate operations for that edge. The set of all architecture weight vectors α = {α_{i,j}} determines the architecture of the model. Thus, searching the architecture amounts to identifying α. The key contribution of DARTS is that searching α becomes amenable to gradient descent after relaxing the search space to become continuous (Equation (4)). However, minimizing the loss function of the model L(w, α) requires finding both α* and w*—the parameters of the computation graph.¹ Liu et al. (2018) propose to learn α and w simultaneously using bi-level optimization:

min_α L_val(w*(α), α)   s.t.   w*(α) = argmin_w L_train(w, α).   (5)

¹ This includes the parameters of each candidate operation o^m_{i,j}.

That is, one can obtain α* through gradient descent, by iterating through the following steps:

1. Obtain the optimal set of weights w* for the current architecture α by minimizing the training loss L_train(w, α).

2. Update the architecture α (cf. Figure 1C) by following the gradient of the validation loss ∇L_val(w*, α).

Once α* is found, one can obtain the final architecture by replacing ō_{i,j} with the operation that has the highest architectural weight, i.e. o_{i,j} ← argmax_o α*^o_{i,j}, or by sampling each operation with probability P(o_{i,j} | α_{i,j}) = σ_softmax(α*_{i,j})_o.
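The continuous relaxation in Equation (4) can be written down directly for a single edge. The sketch below is illustrative only; it uses a small subset of candidate operations and is not the paper's implementation.

import torch

ops = [
    lambda x: torch.zeros_like(x),  # zero (no connection)
    lambda x: x,                    # addition
    lambda x: -x,                   # subtraction
    torch.sigmoid,                  # logistic function
]
alpha = torch.zeros(len(ops), requires_grad=True)  # architecture weights for this edge

def mixed_op(x):
    """Softmax-weighted mixture of all candidate operations (Equation (4))."""
    weights = torch.softmax(alpha, dim=0)
    return sum(w * op(x) for w, op in zip(weights, ops))

x = torch.tensor([0.5])
print(mixed_op(x))          # differentiable with respect to alpha
print(int(alpha.argmax()))  # discretization step: index of the strongest operation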
Adapting DARTS for Autonomous Cognitive Modeling

We adopt the framework from Liu et al. (2018) by representing quantitative models of information processing as DAGs, and seek to automate the discovery of model architectures by differentiating through the space of operations in the underlying computation graph. To map computation graphs onto quantitative models of cognitive function, we separate the nodes of the computation graph into input nodes, intermediate nodes and output nodes. Every input node corresponds to a different independent variable (e.g. the brightness of a stimulus) and every output node corresponds to a different dependent variable (e.g. the probability of detecting the stimulus). Intermediate nodes represent latent variables of the model and are computed according to Equation (3), by applying an operation to every predecessor of the node and by integrating over all transformed predecessors.² For the simulation experiments reported below, we consider eight candidate operations, which are summarized in Table 1, including a "zero" operation to indicate the lack of a connection between nodes. Similar to Liu et al. (2018), we compute every output node r_j by linearly combining all intermediate nodes:

r_j = Σ_{i=S+1}^{K+S} v_{i,j} x_i   (6)

where v_{i,j} ∈ w is a trainable weight projecting from intermediate node x_i to the output node r_j, S corresponds to the number of input nodes and K to the number of intermediate nodes. To warrant interpretability of the model, we constrain all nodes to be scalar variables, i.e. x_i, r_j ∈ R.

² Predecessors may include both input and intermediate nodes.

Our goal is to identify a computation graph that can predict each dependent variable from all independent variables. Thus, we seek to minimize, for every dependent variable j, the discrepancy between every output of the model r_j and the observed data t_j. This discrepancy can be formulated as a mean squared error (MSE), L_MSE(r_j, t_j | w, α), that is contingent on both the architecture α and the parameters in w. In addition, we seek to minimize the complexity of the model,

L_complexity = γ Σ_i Σ_j Σ_m p(o^m_{i,j})   (7)

where p(o^m_{i,j}) corresponds to the complexity of a candidate operation, amounting to one plus the number of trainable parameters (see Table 1), and γ scales the degree to which complexity is penalized. Following the objective in Equation (5), we seek to minimize the total loss, L_total(w, α) = L_MSE(r_j, t_j | w, α) + L_complexity, by simultaneously finding α* and w*, using gradient descent.

Figure 1: Learning computation graphs with DARTS. The nodes and edges in a computation graph correspond to variables and functions (operations) performed on those variables, respectively. (A) Edges represent different candidate operations. (B) DARTS relaxes the search space of operations to be continuous. Each intermediate node (blue) is computed as a weighted mixture of operations. The (architectural) weight of a candidate operation in an edge represents the contribution of that operation to the mixture computation. Output nodes (green) are computed by linearly combining all intermediate nodes. (C) Architectural weights are trained using bi-level optimization, and used to sample the final architecture of the computation graph, as shown in (D).

Table 1: Search space of candidate operations o(x) ∈ O and their complexity p(o). Note that parameters a, b ∈ w are fitted separately for every o^m_{i,j}.

Description                  o(x)              p(o)
zero                         0                 0
addition                     +x                1
subtraction                  −x                1
multiplication               a · x             2
linear function              a · x + b         3
exponential function         exp(a · x + b)    3
rectified linear function    ReLU(x)           1
logistic function            σ_logistic(x)     1
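Table 1 can be mirrored as a lookup from operation names to a function and its complexity p(o), which also makes the penalty of Equation (7) easy to compute for a discretized architecture. This is a simplified sketch only: the search itself applies the penalty during the continuous relaxation, and the parameters a, b and the value of γ below are placeholders rather than fitted or published values.

import math

def candidate_ops(a=1.0, b=0.0):
    return {
        "zero":           (lambda x: 0.0,                        0),
        "addition":       (lambda x: +x,                         1),
        "subtraction":    (lambda x: -x,                         1),
        "multiplication": (lambda x: a * x,                      2),
        "linear":         (lambda x: a * x + b,                  3),
        "exponential":    (lambda x: math.exp(a * x + b),        3),
        "relu":           (lambda x: max(0.0, x),                1),
        "logistic":       (lambda x: 1.0 / (1.0 + math.exp(-x)), 1),
    }

def complexity_penalty(selected_ops, gamma=0.1):
    """Complexity term in the spirit of Equation (7) for a set of chosen operations."""
    table = candidate_ops()
    return gamma * sum(table[name][1] for name in selected_ops)

print(complexity_penalty(["subtraction", "multiplication", "logistic"]))  # 0.4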
Experiments and Results

Identifying the architecture of a quantitative model is an ambitious task. Consider the challenge of constructing a DAG to explain the relationship between three independent variables and one dependent variable, with only two latent variables. Assuming a set of eight candidate operations per edge, for a total number of seven edges, there are 8^7 (more than two million) possible architectures to explore and endless ways to parameterize the chosen architecture. Adding one more latent variable to the model would expand the search space to 8^12 possible architectures. DARTS offers one way of automating this search. However, before applying DARTS to explain human data, it is worth assessing whether this method is capable of recovering computational motifs from a known ground truth. Therefore, we seek to evaluate whether DARTS can recover quantitative models of human information processing from synthetic data.

As detailed below, we assess the performance of DARTS in recovering three distinct computational motifs in cognitive psychology (see Test Cases). For each test case, we vary the number of intermediate nodes k (three values) and the complexity penalty γ (five values) across architecture searches, and initialize each search with ten different seeds. When evaluating instantiations of NAS, it is important to compare their performance against baselines (Lindauer & Hutter, 2020). In many cases, random sampling can yield results that are comparable to more sophisticated NAS methods (Li & Talwalkar, 2020; Xie, Kirillov, Girshick, & He, 2019). Thus, we seek to compare the performance of each search condition against a randomly sampled architecture.

Training and Evaluation Procedure

For each test case and each search condition, we optimized the architecture according to Equation (5), using stochastic gradient descent (SGD). To identify w*, we optimized w on the training set using a cosine annealing schedule for the learning rate, along with momentum and weight decay. Following Liu et al. (2018), we initialize architecture variables to be zero. For a given w*, we optimized α on the validation set using Adam (Kingma & Ba, 2014), with its own initial learning rate, momentum parameters and weight decay. After training w and α, we sampled the final architecture by selecting operations with the highest architectural weights. Finally, we trained 5 random initializations of each sampled architecture on the training set for 1000 epochs using SGD with a cosine annealing schedule. All parameters were selected based on recoveries of out-of-sample test cases.
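The alternating scheme described above can be sketched as a first-order training loop in which the operation parameters w are updated on the training loss and the architecture weights α on the validation loss. The toy problem and all hyperparameters below are arbitrary; this is not the paper's training code.

import torch

ops = [lambda x, w: w * x,             # scaled identity (trainable parameter w)
       lambda x, w: w * torch.exp(x)]  # scaled exponential (trainable parameter w)
w = torch.randn(len(ops), requires_grad=True)      # operation parameters
alpha = torch.zeros(len(ops), requires_grad=True)  # architecture weights

def model(x):
    mix = torch.softmax(alpha, dim=0)  # continuous relaxation (Equation (4))
    return sum(m * op(x, w[i]) for i, (m, op) in enumerate(zip(mix, ops)))

x_train = torch.linspace(-1, 1, 50); y_train = -x_train  # toy ground truth: y = -x
x_val = torch.linspace(-1, 1, 23); y_val = -x_val

opt_w = torch.optim.SGD([w], lr=0.05)
opt_alpha = torch.optim.Adam([alpha], lr=0.05)

for step in range(300):
    # Step 1: update w by minimizing the training loss L_train(w, alpha).
    opt_w.zero_grad()
    torch.mean((model(x_train) - y_train) ** 2).backward()
    opt_w.step()
    # Step 2: update alpha by following the gradient of the validation loss.
    opt_alpha.zero_grad()
    torch.mean((model(x_val) - y_val) ** 2).backward()
    opt_alpha.step()

print(torch.softmax(alpha, dim=0))  # the identity-like operation should receive the larger weight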
Test Cases

All test cases are summarized in Table 2. Here, we report the results for three different psychological models as test cases for DARTS. While these models appear fairly simple, they are based on common computational motifs in cognitive psychology, and serve as a proof of concept for uncovering potential weaknesses of DARTS. Below, we describe each computational model in greater detail. For each of these models, we used 80% of the generated data set to compute the training loss, and 20% to compute the validation loss, to optimize the objective stated in Equation (5). Experiment sequences were generated with SweetPea—a programming language for automating experimental design (Musslick et al., 2020).

Table 2: Summary of test cases, stating the reference to the respective equation (Eqn.), the number of independent variables (IVs), the number of dependent variables (DVs), the number of free parameters (|Θ|), as well as distinct operations (o*).

Test Case       Eqn.   IVs   DVs   |Θ|   o*
Weber's Law     (2)    2     1     1     subtraction
Exp. Learning   (8)    3     1     1     exponential
LCA             (10)   3     1     3     rectified linear

Case 1: Weber's Law. Weber's law is a quantitative hypothesis from psychophysics relating the difference in intensity of two stimuli (e.g. their brightness) to the probability that a participant can detect the difference. Here, we adopt the formal description of Weber's law from Equation (2) with a fixed value of k (see Quantitative Models as Computation Graphs for a detailed description). We consider the two stimulus intensities, I_0 and I_1, as the independent variables of the model, and P(detected) as the dependent variable. The generated data set (for computing L_val and L_train) is synthesized based on 20 evenly spaced samples of the intensity I, and by computing P(detected) for all valid crossings between I_0 and I_1, with I_0 ≤ I_1. Since we seek to explain a single probability, we apply a sigmoid function to the output of each generated computation graph.

Case 2: Exponential Learning. The exponential learning equation is one of the standard equations to explain the improvement on a task with practice (Thurstone, 1919; Heathcote, Brown, & Mewhort, 2000). It explains the performance on a task P_n as follows:

P_n = P_∞ − (P_∞ − P_0) · e^(−α·t)   (8)

where t corresponds to the number of practice trials, α is a learning rate, P_0 corresponds to the initial performance on a task, and P_∞ to the final performance for t → ∞. We treat t, P_0 and P_∞ as independent variables of the model and P_n as a real-valued dependent variable. To avoid numerical instabilities based on large inputs, we constrain the ranges of t, P_0 and P_∞ and fix α. We generate the synthesized data set by drawing eight evenly-spaced samples for each independent variable, generating a full crossing between these samples, and by computing P_n for each condition. The purpose of this test case is to highlight a potential weakness of DARTS: intermediate nodes cannot represent non-linear interactions between input variables—as is the case in Equation (8)—due to the additive integration of their inputs. Thus, DARTS must identify alternative expressions to approximate Equation (8).

Case 3: Leaky Competing Accumulator. To model the dynamics of perceptual decision making, Usher and McClelland (2001) introduced the leaky, competing accumulator (LCA). Every unit x_i of the model represents a different choice in a decision making task. The activity of these units is used to determine the selected choice of an agent. The activity dynamics are determined by the non-linear equation (without consideration of noise):

dx_i = [ρ_i − λ · x_i + α · f(x_i) − β · Σ_{j≠i} f(x_j)] · dt/τ   (9)

where ρ_i is an external input provided to unit x_i, λ is the decay rate of x_i, α is the recurrent excitation weight of x_i, β is the inhibition weight between units, τ is a rate constant and f(x_i) is a rectified linear activation function. Here, we seek to recover the dynamics of an LCA with three units, using a typical parameterization of λ, α, β and τ. In addition, we assume that all units receive no external input, ρ_i = 0. This results in the simplified equation:

dx_i = [−λ · x_i + α · f(x_i) − β · Σ_{j≠i} f(x_j)] · dt/τ   (10)

We treat the three units as independent variables within a bounded range, and the change dx of one unit as the dependent variable for a given time step dt. We generate data from the model by drawing eight evenly-spaced samples for each x_i, generating the full crossing between these, and computing dx for each condition.
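The data-generating recipe shared by these test cases (evenly spaced samples per independent variable, fully crossed) can be sketched for the exponential learning case as follows. The ranges and the learning rate below are illustrative assumptions, since the exact constraints are not reproduced here.

import numpy as np
from itertools import product

def exp_learning(t, p0, p_inf, alpha=0.1):
    """Exponential learning curve (Equation (8)); alpha is an assumed value."""
    return p_inf - (p_inf - p0) * np.exp(-alpha * t)

t_vals = np.linspace(1, 100, 8)        # eight evenly spaced samples per IV
p0_vals = np.linspace(0.0, 0.5, 8)     # assumed range for initial performance
p_inf_vals = np.linspace(0.5, 1.0, 8)  # assumed range for asymptotic performance

rows = [(t, p0, p_inf, exp_learning(t, p0, p_inf))
        for t, p0, p_inf in product(t_vals, p0_vals, p_inf_vals)]
data = np.array(rows)  # full crossing: 8 * 8 * 8 = 512 conditions
print(data.shape)      # (512, 4): three independent variables plus P_n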
Results

Figures 2, 3 and 4 summarize the search results for each test case. In most cases, DARTS outperforms random sampling in recovering architectures from synthesized data. The best performance for DARTS can on average be observed for particular settings of k and γ, although the best-fitting architecture may result from different parameters.³ Below, we examine the best-fitting architectures for each test case—determined by the lowest validation loss—which are generally capable of recovering distinct operations used by the data-generating model.

³ Note that we expect no relationship between γ and the validation loss for random sampling, as random sampling is unaffected by γ.

Figure 2: Architecture search results for Weber's law. (A, B) The mean validation loss as a function of the number of intermediate nodes (k) and penalty on model complexity (γ) for architectures obtained through DARTS and random sampling. Vertical bars indicate the standard error of the mean (SEM) across seeds. The star designates the validation loss of the best-fitting architecture depicted in (C). (C) Recovered computation graph. (D) Psychometric function for different baseline intensities, generated by the original model and the recovered architecture shown in (C).

Figure 3: Architecture search results for exponential learning. (A, B) The mean validation loss as a function of the number of intermediate nodes (k) and penalty on model complexity (γ) for architectures obtained through DARTS and random sampling. Vertical bars indicate the SEM across seeds. The star designates the loss of the best-fitting architecture shown in (C). (C) Recovered computation graph. (D) The learning curves generated by the original model and the recovered architecture in (C).

Case 1: Weber's Law. The best-fitting architecture for Weber's law (Figure 2C) can be summarized as follows, with fitted coefficients c_1, c_0 and bias c_b:

P(detected) = σ_logistic(c_1 · I_1 − c_0 · I_0 − c_b)   (11)

and resembles a simplification of the ground truth model in Equation (2), which can be rewritten as σ_logistic(I_1 − (1 + k) · I_0), recovering the computational motif of the difference between the two input variables, as well as the role of I_0 as a bias term. The architecture can also reproduce psychometric functions generated by the original model (Figure 2D). However, the recovery of Weber's law should be merely considered a sanity check, given that the data-generating model could be recovered with much simpler methods, such as logistic regression. This is reflected in the decent performance of the random sampling method.

Case 2: Exponential Learning. One of the core features of this test case is the exponential relationship between task performance and the number of trials. Note that we do not expect DARTS to fully recover Equation (8) as it is—by design—incapable of representing the non-linear interaction of (P_∞ − P_0) and e^(−α·t). Nevertheless, DARTS recovers the exponential relationship between the number of trials t and performance P_n (Figure 3C). However, the best-fitting architecture relies on a number of other transformations to compute P_n based on its independent variables, and fails to fully recover the learning curves of the original model (Figure 3D). In the General Discussion, we examine ways of mitigating this issue.

Case 3: Leaky Competing Accumulator. The best-fitting architecture bears remarkable resemblance to the original model (cf. Equation (10)),

dx_i = [c_0 − c_1 · x_i − c_2 · Σ_{j≠i} ReLU(x_j)] · dt   (12)

in that it recovers the rectified linear activation function imposed on the two units competing with x_i, as well as an inhibitory weight c_2 that approximates β. Yet, the recovered model fails to apply this function to unit x_i itself. The generated dynamics are nevertheless capable of approximating the behavior of the original model (Figure 4D).

Figure 4: Architecture search results for LCA. (A, B) The mean validation loss as a function of the number of intermediate nodes (k) and penalty on model complexity (γ) for architectures obtained through DARTS and random sampling. Vertical bars indicate the SEM across seeds. The star designates the loss of the best-fitting architecture depicted in (C). (C) Recovered computation graph. (D) Dynamics of each decision unit simulated with the original model and the best architecture shown in (C).
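The comparison in Figure 4D amounts to simulating the unit activities under Equation (9) forward in time. A minimal simulation sketch is given below; the parameter values and initial conditions are assumptions for illustration, not the settings used in the paper.

import numpy as np

def lca_step(x, lam=0.2, alpha=0.2, beta=0.2, tau=0.1, rho=0.0, dt=0.01):
    """One Euler step of the noise-free LCA (Equation (9)) with assumed parameters."""
    f = np.maximum(x, 0.0)             # rectified linear activation
    inhibition = beta * (f.sum() - f)  # sum over competing units j != i
    return x + (rho - lam * x + alpha * f - inhibition) * dt / tau

x = np.array([0.1, 0.2, 0.3])          # three decision units
trajectory = [x]
for _ in range(200):
    trajectory.append(lca_step(trajectory[-1]))
trajectory = np.stack(trajectory)      # unit activities over time, shape (201, 3)
print(trajectory[-1])                  # only the initially strongest unit stays positive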
General Discussion and Conclusion

Empirical scientists are challenged with integrating an increasingly large number of experimental phenomena into quantitative models of cognitive function. In this article, we introduced and evaluated a method for recovering quantitative models of cognition using DARTS. The proposed method treats quantitative models as DAGs, and leverages continuous relaxation of the architectural search space to identify candidate models using gradient descent. We evaluated the performance of this method based on its ability to recover three different quantitative models of human cognition from synthetic data. Our results show that this method has an advantage over random sampling, and is capable of recovering computational motifs from quantitative models of human information processing, such as the difference operation in Weber's law or the rectified linear activation function in the LCA. While the initial results reported here seem promising, there are a number of limitations worth addressing in future work.

All limitations of DARTS pertain to its assumptions, most of which limit the scope of discoverable models. First, not all quantitative models can be represented as a DAG, such as ones that require independent variables to be combined in a multiplicative fashion (see Test Case 2). Solving this problem may require expanding the search space to include different integration functions performed on every node.⁴ Second, some operations may have an unfair advantage over others when trained via gradient descent, e.g. if their gradients are larger. Due to the competitive nature of the softmaxed weighting of operations (Equation (4)), this can lead to a suppression of operations that take more time to learn but are, in principle, closer to the ground truth. Chu, Zhou, Zhang, and Li (2020) address this problem by replacing Equation (4) with a sigmoid function. This modification, referred to as fair DARTS, allows operations to contribute in a cooperative manner. However, the space of solutions is still limited to the number and types of functions considered (Lindauer & Hutter, 2020). Finally, the performance of DARTS is contingent on a number of training and evaluation parameters, as is the case for other NAS methods. Future work is needed to evaluate DARTS for a larger space of parameters, in addition to the number of intermediate nodes and the penalty on model complexity explored in this study. However, despite all these limitations, DARTS may provide a first step toward automating the construction of complex quantitative models based on interpretable linear and non-linear expressions.

⁴ Another solution would be to linearize the data or to operate in logarithmic space. However, the former might hamper interpretability for models relying on simple non-linear functions, and the latter may be inconvenient if the ground truth cannot be easily represented in logarithmic space.

In this study, we consider a small number of test cases to evaluate the performance of DARTS. While these test cases present useful proofs of concept, we encourage the rigorous evaluation of this method based on other quantitative models of cognitive function. To enable such explorations, we provide open access to a documented implementation of the evaluation pipeline described in this article.⁵ This pipeline is part of a Python toolbox for autonomous empirical research, and allows for the user-friendly integration and evaluation of other search methods and test cases. As such, the repository includes additional test cases (e.g. models of controlled processing) that we could not include in this article due to space constraints. We invite interested researchers to evaluate DARTS based on other computational models, and to utilize this method for the automated discovery of quantitative models of human information processing.

⁵ We will include a link to the source code and documentation of this pipeline in the final submission.

References

Chu, X., Zhou, T., Zhang, B., & Li, J. (2020). Fair DARTS: Eliminating unfair advantages in differentiable architecture search. In European Conference on Computer Vision (pp. 465–480).
Elsken, T., Metzen, J. H., Hutter, F., et al. (2019). Neural architecture search: A survey. JMLR, 20(55), 1–21.
Fechner, G. T. (1860). Elemente der Psychophysik (Vol. 2). Breitkopf u. Härtel.
He, X., Zhao, K., & Chu, X. (2021). AutoML: A survey of the state-of-the-art. Knowledge-Based Systems, 212, 106622.
Heathcote, A., Brown, S., & Mewhort, D. J. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7(2), 185–207.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Li, L., & Talwalkar, A. (2020). Random search and reproducibility for neural architecture search. In Uncertainty in Artificial Intelligence (pp. 367–377).
Lindauer, M., & Hutter, F. (2020). Best practices for scientific research on neural architecture search. JMLR, 21(243), 1–18.
Liu, H., Simonyan, K., & Yang, Y. (2018). DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055.
Mendoza, H., Klein, A., Feurer, M., Springenberg, J. T., & Hutter, F. (2016). Towards automatically-tuned neural networks. In Workshop on AutoML (pp. 58–65).
Musslick, S., Cherkaev, A., Draut, B., Butt, A., Srikumar, V., Flatt, M., & Cohen, J. D. (2020). SweetPea: A standard language for factorial experimental design. PsyArXiv, doi:10.31234/osf.io/mdwqh.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., ... Lerer, A. (2017). Automatic differentiation in PyTorch.
Thurstone, L. L. (1919). The learning curve equation. Psychological Monographs, 26(3), i.
Usher, M., & McClelland, J. L. (2001). The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review, 108(3), 550.
Xie, S., Kirillov, A., Girshick, R., & He, K. (2019). Exploring randomly wired neural networks for image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1284–1293).