Recovering Quantitative Models of Human Information Processing with Differentiable Architecture Search

Sebastian Musslick (musslick@princeton.edu)
Princeton Neuroscience Institute, Princeton University
Princeton, NJ 08544, USA

arXiv:2103.13939v1 [cs.LG] 25 Mar 2021

Abstract

The integration of behavioral phenomena into mechanistic models of cognitive function is a fundamental staple of cognitive science. Yet, researchers are beginning to accumulate increasing amounts of data without having the temporal or monetary resources to integrate these data into scientific theories. We seek to overcome these limitations by incorporating existing machine learning techniques into an open-source pipeline for the automated construction of quantitative models. This pipeline leverages the use of neural architecture search to automate the discovery of interpretable model architectures, and automatic differentiation to automate the fitting of model parameters to data. We evaluate the utility of these methods based on their ability to recover quantitative models of human information processing from synthetic data. We find that these methods are capable of recovering basic quantitative motifs from models of psychophysics, learning and decision making. We also highlight weaknesses of this framework, and discuss future directions for their mitigation.

Keywords: autonomous empirical research; computation graph; continuous relaxation; NAS; DARTS; AutoML
Introduction

The process of developing a mechanistic model of cognition incurs two challenges: (1) identifying the architecture of the model, i.e. the composition of functions and parameters, and (2) tuning parameters of the model to fit experimental data. While there are various methods for automating the fitting of parameters, cognitive scientists typically leverage their own expertise and intuitions to identify the architecture of a model—a process that requires substantial human effort. In machine learning, interest has grown in automating the construction and parameterization of neural networks to solve machine learning problems more efficiently (He, Zhao, & Chu, 2021). This involves the use of neural architecture search (NAS) for automating the discovery of model architectures (Elsken, Metzen, Hutter, et al., 2019), and the use of automatic differentiation to automate parameter fitting (Paszke et al., 2017). This combination of methods has led to breakthroughs in the automated construction of neural networks that are capable of outperforming networks designed by human researchers (e.g. in computer vision: Mendoza, Klein, Feurer, Springenberg, & Hutter, 2016). In this study, we explore the utility of these methods for constructing quantitative models of human information processing.

To ease the application of NAS to the discovery of a quantitative model, it is useful to treat quantitative models as neural networks or, more generally, as computation graphs. In this article, we introduce the notion of a computation graph and describe ways of expressing the architecture of a quantitative model in terms of such a graph. We then review the use of differentiable architecture search (DARTS; Liu, Simonyan, & Yang, 2018) for searching the space of candidate computation graphs, and introduce an adaptation of this method for discovering quantitative models of human information processing. We evaluate this method based on its ability to recover three different models of human cognition from synthetic data, to explain behavioral phenomena in psychophysics, learning and perceptual decision making. Our results indicate that this method is capable of recovering computational motifs found in these models. However, we also discuss further developments that are needed to expand the scope of models amenable to DARTS.
Quantitative Models as Computation Graphs

A broad class of complex mathematical functions—including the functions expressed by a quantitative model of human information processing—can be formulated as a computation graph. A computation graph is a collection of nodes that are connected by directed edges. Each node denotes an expression of a variable, and each outgoing edge corresponds to a function applied to this variable (cf. Figure 1D). The value of a node is typically computed by integrating over the result of every function (edge) feeding to that node. Akin to quantitative models of cognitive function, a computation graph can take experiment parameters as input (e.g. the brightness of two visual stimuli), and can transform this input through a combination of functions (edges) and latent variables (intermediate nodes) to produce observable dependent measures as output nodes (e.g. the probability that a participant is able to detect the difference in brightness between two stimuli).

The expression of a formula as a computation graph can be illustrated with Weber's law (Fechner, 1860)—a quantitative hypothesis that relates the difference between the intensities of two stimuli to the probability that a participant can detect this difference. It states that the just noticeable difference (JND; the difference in intensity that a participant is capable of detecting in 50% of the trials) amounts to

    ΔI = k · I                                                          (1)

where ΔI is the JND, I corresponds to the intensity of the baseline stimulus and k is a constant. The probability of detecting the difference between two stimuli, I_0 and I_1, can then be formulated as a function of the two stimuli (with I_0 < I_1),

    P(detected) = σ_logistic((I_1 − I_0) − ΔI)                          (2)

where σ_logistic is a logistic function. Figure 1D depicts the argument of σ_logistic as a computation graph for a fixed value of k. The graph encompasses two input nodes, one representing x_0 = I_0 and the other x_1 = I_1. The intermediate node x_2 expresses ΔI, which results from multiplying I_0 with the parameter k. The addition and subtraction of I_1 and I_0, respectively, result in their difference (I_1 − I_0), represented by the intermediate node x_3. The linear combination of x_2 and x_3 in the output node r resembles the argument to σ_logistic.

The automated construction of a mathematical hypothesis, like Weber's law, can be formulated as a search over the space of all possible computation graphs. Machine learning researchers leverage the notion of computation graphs to represent the composition of functions performed by a complex artificial neural network (i.e. its architecture), and deploy NAS to search a space of computation graphs. Although some level of specification of the graph remains with the researcher, NAS relieves the researcher from searching through these possibilities.
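To make this concrete, the argument of σ_logistic in Equation (2) can be decomposed into primitive operations in a few lines of Python. This is a minimal sketch: the function names and the example value of k are illustrative choices, not part of the model definition.

    import math

    # Primitive operations (edges) of the computation graph sketched here.
    def multiply(k):           # edge: x -> k * x (one trainable parameter, k)
        return lambda x: k * x

    def negate(x):             # edge: x -> -x
        return -x

    def identity(x):           # edge: x -> +x
        return x

    def logistic(x):           # output nonlinearity
        return 1.0 / (1.0 + math.exp(-x))

    def weber_graph(i0, i1, k=0.5):            # k = 0.5 is an arbitrary example value
        x0, x1 = i0, i1                        # input nodes: x_0 = I_0, x_1 = I_1
        x2 = multiply(k)(x0)                   # intermediate node x_2: Delta_I = k * I_0 (Equation 1)
        x3 = identity(x1) + negate(x0)         # intermediate node x_3: I_1 - I_0 (sum over incoming edges)
        r = x3 - x2                            # output node r: linear combination of intermediate nodes
        return logistic(r)                     # P(detected), Equation (2)

    print(weber_graph(i0=0.4, i1=0.7))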
Identifying Computation Graphs with Neural Architecture Search

NAS refers to a family of methods for automating the discovery of useful neural network architectures. There are a number of methods to guide this search, such as evolutionary algorithms, reinforcement learning or Bayesian optimization (for a recent survey of NAS search strategies, see Elsken et al., 2019). However, most of these methods are computationally demanding due to the nature of the optimization problem: the search space of candidate computation graphs is high-dimensional and discrete. To address this problem, Liu et al. (2018) proposed DARTS, which relaxes the search space to become continuous, making architecture search amenable to gradient descent. The authors demonstrate that DARTS can yield useful network architectures for image classification and language modeling that are on par with architectures designed by human researchers. In this work, we assess whether DARTS can be adopted to automate the discovery of interpretable quantitative models to explain human cognition.
Differentiable Architecture Search

DARTS treats the architecture of a neural network as a directed acyclic computation graph (DAG), containing N nodes in sequential order (Figure 1). Each node x_i corresponds to a latent representation of the input space. Each directed edge e_{i,j} is associated with some operation o_{i,j} that transforms the representation of the preceding node i, and feeds it to node j. Each intermediate node is computed by integrating over its transformed predecessors:

    x_j = Σ_{i<j} o_{i,j}(x_i)                                          (3)

Every output node is computed by linearly combining all intermediate nodes projecting to it. The goal of DARTS is to identify all operations o_{i,j} of the DAG. Following Liu et al. (2018), we define O = {o^1_{i,j}, o^2_{i,j}, ..., o^M_{i,j}} to be the set of M candidate operations associated with edge e_{i,j}, where every operation o^m_{i,j}(x_i) corresponds to some function applied to x_i (e.g. linear, exponential or logistic). DARTS relaxes the problem of searching over candidate operations by formulating the transformation associated with an edge as a mixture of all possible operations in O (cf. Figure 1A-B):

    ō_{i,j}(x) = Σ_{o∈O} [ exp(α^o_{i,j}) / Σ_{o'∈O} exp(α^{o'}_{i,j}) ] · o(x)        (4)

where each operation is weighted by the softmax transformation of its architectural weight α^o_{i,j}. Every edge e_{i,j} is assigned a weight vector α_{i,j} of dimension M, containing the weights of all possible candidate operations for that edge. The set of all architecture weight vectors α = {α_{i,j}} determines the architecture of the model. Thus, searching the architecture amounts to identifying α. The key contribution of DARTS is that searching α becomes amenable to gradient descent after relaxing the search space to become continuous (Equation (4)). However, minimizing the loss function of the model L(w, α) requires finding both α* and w*—the parameters of the computation graph.¹ Liu et al. (2018) propose to learn α and w simultaneously using bi-level optimization:

    min_α   L_val(w*(α), α)
    s.t.    w*(α) = argmin_w L_train(w, α)                              (5)

That is, one can obtain α* through gradient descent, by iterating through the following steps:

1. Obtain the optimal set of weights w* for the current architecture α by minimizing the training loss L_train(w, α).

2. Update the architecture α (cf. Figure 1C) by following the gradient of the validation loss ∇L_val(w*, α).

Once α* is found, one can obtain the final architecture by replacing ō_{i,j} with the operation that has the highest architectural weight, i.e. o_{i,j} ← argmax_{o∈O} α^{*o}_{i,j}, or by sampling each operation with probability P(o_{i,j} | α_{i,j}) = σ_softmax(α*_{i,j})_o.

[Figure 1 — A: Candidate Operations; B: Continuous Relaxation; C: Training of Architecture Weights; D: Architecture Sampling]

Figure 1: Learning computation graphs with DARTS. The nodes and edges in a computation graph correspond to variables and functions (operations) performed on those variables, respectively. (A) Edges represent different candidate operations. (B) DARTS relaxes the search space of operations to be continuous. Each intermediate node (blue) is computed as a weighted mixture of operations. The (architectural) weight of a candidate operation in an edge represents the contribution of that operation to the mixture computation. Output nodes (green) are computed by linearly combining all intermediate nodes. (C) Architectural weights are trained using bi-level optimization, and used to sample the final architecture of the computation graph, as shown in (D).

¹ This includes the parameters of each candidate operation o^m_{i,j}.
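A minimal PyTorch sketch of this relaxation, and of one first-order iteration of the alternating scheme in Equation (5), is given below. The candidate operations, tensor shapes and learning rates are illustrative assumptions, not the settings used in this work.

    import torch
    import torch.nn.functional as F

    # Candidate operations for a single edge (a subset of the operations in Table 1).
    candidate_ops = [
        lambda x, a, b: torch.zeros_like(x),   # zero (no connection)
        lambda x, a, b: x,                     # addition (+x)
        lambda x, a, b: -x,                    # subtraction (-x)
        lambda x, a, b: a * x + b,             # linear function
        lambda x, a, b: torch.exp(a * x + b),  # exponential function
    ]
    M = len(candidate_ops)

    # Architecture weights alpha_{i,j} for this edge, and operation parameters (part of w).
    alpha = torch.zeros(M, requires_grad=True)
    a = torch.ones(M, requires_grad=True)
    b = torch.zeros(M, requires_grad=True)

    def mixed_op(x):
        # Continuous relaxation of the edge (Equation 4): softmax-weighted mixture.
        weights = F.softmax(alpha, dim=0)
        return sum(weights[m] * candidate_ops[m](x, a[m], b[m]) for m in range(M))

    # Illustrative data and placeholder learning rates.
    x_train, t_train = torch.rand(32, 1), torch.rand(32, 1)
    x_val, t_val = torch.rand(32, 1), torch.rand(32, 1)
    w_optimizer = torch.optim.SGD([a, b], lr=1e-2)
    alpha_optimizer = torch.optim.Adam([alpha], lr=3e-4)

    # Step 1: fit the parameters w on the training loss for the current architecture.
    w_optimizer.zero_grad(); alpha_optimizer.zero_grad()
    F.mse_loss(mixed_op(x_train), t_train).backward()
    w_optimizer.step()

    # Step 2: update the architecture weights alpha on the validation loss.
    w_optimizer.zero_grad(); alpha_optimizer.zero_grad()
    F.mse_loss(mixed_op(x_val), t_val).backward()
    alpha_optimizer.step()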
Adapting DARTS for Autonomous Cognitive Modeling

We adopt the framework from Liu et al. (2018) by representing quantitative models of information processing as DAGs, and seek to automate the discovery of model architectures by differentiating through the space of operations in the underlying computation graph. To map computation graphs onto quantitative models of cognitive function, we separate the nodes of the computation graph into input nodes, intermediate nodes and output nodes. Every input node corresponds to a different independent variable (e.g. the brightness of a stimulus) and every output node corresponds to a different dependent variable (e.g. the probability of detecting the stimulus). Intermediate nodes represent latent variables of the model and are computed according to Equation (3), by applying an operation to every predecessor of the node and by integrating over all transformed predecessors.² For the simulation experiments reported below, we consider eight candidate operations, which are summarized in Table 1, including a "zero" operation to indicate the lack of a connection between nodes.

Table 1: Search space of candidate operations o(x) ∈ O and their complexity p(o). Note that parameters a, b ∈ w are fitted separately for every o^m_{i,j}.

    Description                  o(x)                p(o)
    zero                         0                   0
    addition                     +x                  1
    subtraction                  −x                  1
    multiplication               a · x               2
    linear function              a · x + b           3
    exponential function         exp(a · x + b)      3
    rectified linear function    ReLU(x)             1
    logistic function            σ_logistic(x)       1

Similar to Liu et al. (2018), we compute every output node r_j by linearly combining all intermediate nodes:

    r_j = Σ_{i=S+1}^{K+S} v_{i,j} · x_i                                 (6)

where v_{i,j} ∈ w is a trainable weight projecting from intermediate node x_i to the output node r_j, S corresponds to the number of input nodes and K to the number of intermediate nodes. To warrant interpretability of the model, we constrain all nodes to be scalar variables, i.e. x_i, r_j ∈ ℝ.

Our goal is to identify a computation graph that can predict each dependent variable from all independent variables. Thus, we seek to minimize, for every dependent variable j, the discrepancy between every output of the model r_j and the observed data t_j. This discrepancy can be formulated as a mean squared error (MSE), L_MSE(r_j, t_j | w, α), that is contingent on both the architecture α and the parameters in w. In addition, we seek to minimize the complexity of the model,

    L_complexity = γ · Σ_i Σ_j Σ_m σ_softmax(α_{i,j})_m · p(o^m_{i,j})  (7)

where p(o^m_{i,j}) corresponds to the complexity of a candidate operation, amounting to one plus the number of trainable parameters (see Table 1), and γ scales the degree to which complexity is penalized. Following the objective in Equation (5), we seek to minimize the total loss, L_total(w, α) = L_MSE(r_j, t_j | w, α) + L_complexity, by simultaneously finding α* and w*, using gradient descent.

² Predecessors may include both input and intermediate nodes.
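A sketch of such a penalty is shown below, under the assumption (as in Equation (7)) that each operation's complexity is weighted by its softmax-transformed architecture weight; the value of γ, the number of edges and the prediction data are arbitrary.

    import torch
    import torch.nn.functional as F

    # Complexity of each candidate operation, p(o), ordered as in Table 1.
    p = torch.tensor([0., 1., 1., 2., 3., 3., 1., 1.])

    def complexity_loss(alphas, gamma):
        # Softmax-weighted complexity penalty summed over all edges (cf. Equation 7).
        return gamma * sum((F.softmax(alpha_edge, dim=0) * p).sum() for alpha_edge in alphas)

    # Example: two edges with eight candidate operations each, and a placeholder gamma.
    alphas = [torch.zeros(8, requires_grad=True) for _ in range(2)]
    prediction, target = torch.rand(16, 1), torch.rand(16, 1)
    total_loss = F.mse_loss(prediction, target) + complexity_loss(alphas, gamma=0.1)
    print(total_loss)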
Experiments and Results

Identifying the architecture of a quantitative model is an ambitious task. Consider the challenge of constructing a DAG to explain the relationship between three independent variables and one dependent variable, with only two latent variables. Assuming a set of eight candidate operations per edge for a total number of seven edges, there are 8^7 possible architectures to explore and endless ways to parameterize the chosen architecture. Adding one more latent variable to the model would expand the search space to 8^12 possible architectures. DARTS offers one way of automating this search. However, before applying DARTS to explain human data, it is worth assessing whether this method is capable of recovering computational motifs from a known ground truth. Therefore, we seek to evaluate whether DARTS can recover quantitative models of human information processing from synthetic data.
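These counts can be checked in a few lines of Python; the edge counts follow from the DAG construction described above, in which every intermediate node receives one edge from each input node and from each preceding intermediate node.

    # Size of the discrete search space: each edge selects one of 8 candidate operations.
    ops_per_edge = 8
    edges_two_latents = 3 + 4          # 3 inputs -> 1st latent; 3 inputs + 1 latent -> 2nd latent
    edges_three_latents = 3 + 4 + 5    # one additional latent variable adds 5 more edges

    print(ops_per_edge ** edges_two_latents)     # 2097152 candidate architectures
    print(ops_per_edge ** edges_three_latents)   # 68719476736 candidate architectures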
As detailed below, we assess the performance of DARTS in recovering three distinct computational motifs in cognitive psychology (see Test Cases). For each test case, we vary the number of intermediate nodes k (across three settings) and the complexity penalty γ (across five settings) between architecture searches, and initialize each search with ten different seeds.

When evaluating instantiations of NAS, it is important to compare their performance against baselines (Lindauer & Hutter, 2020). In many cases, random sampling can yield results that are comparable to more sophisticated NAS (Li & Talwalkar, 2020; Xie, Kirillov, Girshick, & He, 2019). Thus, we seek to compare the performance of each search condition against a randomly sampled architecture.
Training and Evaluation Procedure

For each test case and each search condition, we optimized the architecture according to Equation (5), using stochastic gradient descent (SGD). To identify w*, we optimized w on the selected training set over a fixed number of epochs, using SGD with a cosine annealing learning rate schedule, momentum and weight decay. Following Liu et al. (2018), we initialize architecture variables to be zero. For a given w*, we optimized α on the validation set over a fixed number of epochs using Adam (Kingma & Ba, 2014) with momentum parameters β and weight decay.

After training w and α, we sampled the final architecture by selecting operations with the highest architectural weights. Finally, we trained 5 random initializations of each sampled architecture on the training set for 1000 epochs using SGD with a cosine annealing learning rate schedule. All parameters were selected based on recoveries of out-of-sample test cases.
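A sketch of how the two optimizers described above could be configured in PyTorch is shown below. All hyperparameter values are placeholders rather than the settings used in this study, and the parameter collections stand in for the model defined earlier.

    import torch

    # Placeholder parameter collections: in the full model, w would contain the
    # operation parameters and output weights, and alpha the architecture weights.
    w = [torch.randn(5, requires_grad=True)]
    alpha = [torch.zeros(8, requires_grad=True)]

    # Step 1 (inner problem of Equation 5): SGD with momentum, weight decay and a
    # cosine annealing learning rate schedule for the parameters w.
    w_optimizer = torch.optim.SGD(w, lr=0.025, momentum=0.9, weight_decay=3e-4)
    w_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(w_optimizer, T_max=100, eta_min=1e-4)

    # Step 2 (outer problem of Equation 5): Adam for the architecture weights alpha.
    alpha_optimizer = torch.optim.Adam(alpha, lr=3e-4, betas=(0.5, 0.999), weight_decay=1e-3)

    # In each epoch, one would minimize L_train with w_optimizer, advance w_scheduler,
    # and then minimize L_val with alpha_optimizer.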
Test Cases

All test cases are summarized in Table 2. Here, we report the results for three different psychological models as test cases for DARTS. While these models appear fairly simple, they are based on common computational motifs in cognitive psychology, and serve as a proof of concept for uncovering potential weaknesses of DARTS. Below, we describe each computational model in greater detail. For each of these models, we used 80% of the generated data set to compute the training loss, and 20% to compute the validation loss, to optimize the objective stated in Equation (5). Experiment sequences were generated with SweetPea—a programming language for automating experimental design (Musslick et al., 2020).

Table 2: Summary of test cases, stating the reference to the respective equation (Eqn.), the number of independent variables (IVs), the number of dependent variables (DVs), the number of free parameters (|Θ|), as well as distinct operations (o*).

    Test Case        Eqn.    IVs    DVs    |Θ|    o*
    Weber's Law      (2)     2      1      1      subtraction
    Exp. Learning    (8)     3      1      1      exponential
    LCA              (10)    3      1      3      rectified linear
Case 1: Weber's Law. Weber's law is a quantitative hypothesis from psychophysics relating the difference in intensity of two stimuli (e.g. their brightness) to the probability that a participant can detect the difference. Here, we adopt the formal description of Weber's law from Equation (2), with a fixed k (see Quantitative Models as Computation Graphs for a detailed description). We consider the two stimulus intensities, I_0 and I_1, as the independent variables of the model, and P(detected) as the dependent variable. The generated data set (for computing L_val and L_train) is synthesized based on 20 evenly spaced samples of I_0 from a fixed intensity interval, and by computing P(detected) for all valid crossings between I_0 and I_1, with I_0 ≤ I_1. Since we seek to explain a single probability, we apply a sigmoid function to the output of each generated computation graph.
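A sketch of how such a data set could be synthesized is shown below; the intensity interval and the value of k are illustrative assumptions.

    import numpy as np

    def weber_dataset(k=1.0, n_samples=20, lo=0.0, hi=5.0):
        # Synthesize P(detected) for all valid crossings of I_0 and I_1 (I_0 <= I_1).
        intensities = np.linspace(lo, hi, n_samples)
        rows = []
        for i0 in intensities:
            for i1 in intensities:
                if i0 <= i1:                                 # keep only valid crossings
                    arg = (i1 - i0) - k * i0                 # argument of Equation (2)
                    rows.append((i0, i1, 1.0 / (1.0 + np.exp(-arg))))
        return np.array(rows)                                # columns: I_0, I_1, P(detected)

    data = weber_dataset()
    print(data.shape)                                        # (210, 3)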
Case 2: Exponential Learning. The exponential learning equation is one of the standard equations to explain the improvement on a task with practice (Thurstone, 1919; Heathcote, Brown, & Mewhort, 2000). It explains the performance on a task P_n as follows:

    P_n = P_∞ − (P_∞ − P_0) · e^(−α·t)                                  (8)

where t corresponds to the number of practice trials, α is a learning rate, P_0 corresponds to the initial performance on a task, and P_∞ to the final performance for t → ∞. We treat t, P_0 and P_∞ as independent variables of the model and P_n as a real-valued dependent variable. To avoid numerical instabilities based on large inputs, we constrain the ranges of t, P_0 and P_∞, and fix the learning rate α. We generate the synthesized data set by drawing eight evenly-spaced samples for each independent variable, generating a full crossing between these samples, and by computing P_n for each condition. The purpose of this test case is to highlight a potential weakness of DARTS: intermediate nodes cannot represent non-linear interactions between input variables—as is the case in Equation (8)—due to the additive integration of their inputs. Thus, DARTS must identify alternative expressions to approximate Equation (8).
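A sketch of the corresponding data generation is given below; the variable ranges and the value of α are assumptions for illustration, not the constraints used in this study.

    import numpy as np
    from itertools import product

    def exponential_learning_dataset(alpha=0.1, n=8):
        # Full crossing of eight evenly spaced samples per independent variable (Equation 8).
        t = np.linspace(1, 100, n)        # number of practice trials (illustrative range)
        p0 = np.linspace(0.0, 0.5, n)     # initial performance (illustrative range)
        p_inf = np.linspace(0.5, 1.0, n)  # final performance (illustrative range)
        rows = [(ti, p0i, pinfi, pinfi - (pinfi - p0i) * np.exp(-alpha * ti))
                for ti, p0i, pinfi in product(t, p0, p_inf)]
        return np.array(rows)             # columns: t, P_0, P_inf, P_n

    print(exponential_learning_dataset().shape)   # (512, 4)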
Case 3: Leaky Competing Accumulator. To model the dynamics of perceptual decision making, Usher and McClelland (2001) introduced the leaky, competing accumulator (LCA). Every unit x_i of the model represents a different choice in a decision making task. The activity of these units is used to determine the selected choice of an agent. The activity dynamics are determined by the non-linear equation (without consideration of noise):

    dx_i = [ρ_i − λ·x_i + α·f(x_i) − β·Σ_{j≠i} f(x_j)] · dt/τ           (9)

where ρ_i is an external input provided to unit x_i, λ is the decay rate of x_i, α is the recurrent excitation weight of x_i, β is the inhibition weight between units, τ is a rate constant and f(x_i) is a rectified linear activation function. Here, we seek to recover the dynamics of an LCA with three units, using a typical parameterization of λ, α, β and τ. In addition, we assume that all units receive no external input, ρ_i = 0. This results in the simplified equation:

    dx_i = [−λ·x_i + α·f(x_i) − β·Σ_{j≠i} f(x_j)] · dt                  (10)

We treat the units x_1, x_2, x_3 as independent variables and dx_1 as the dependent variable for a given time step dt. We generate data from the model by drawing eight evenly-spaced samples for each x_i, generating the full crossing between these, and computing dx_1 for each condition.
Results

Figures 2, 3 and 4 summarize the search results for each test case. In most cases, DARTS outperforms random sampling in recovering architectures from synthesized data. The best performance for DARTS can, on average, be observed for particular settings of k and γ, although the best-fitting architecture may result from different parameters.³ Below, we examine the best-fitting architectures for each test case—determined by the lowest validation loss—which are generally capable of recovering distinct operations used by the data generating model.

³ Note that we expect no relationship between γ and the validation loss for random sampling, as random sampling is unaffected by γ.

Figure 2: Architecture search results for Weber's law. (A, B) The mean validation loss as a function of the number of intermediate nodes (k) and penalty on model complexity (γ) for architectures obtained through DARTS and random sampling. Vertical bars indicate the standard error of the mean (SEM) across seeds. The star designates the validation loss of the best-fitting architecture depicted in (C). (D) Psychometric function for different baseline intensities, generated by the original model and the recovered architecture shown in (C).

Case 1: Weber's Law. The best-fitting architecture for Weber's law (Figure 2C) can be summarized as follows:

    P(detected) = σ_logistic(c_1·I_1 − c_2·I_0 − c_0)                   (11)

where c_1, c_2 and c_0 are fitted coefficients. It resembles a simplification of the ground truth model in Equation (2), σ_logistic(I_1 − (1 + k)·I_0), recovering the computational motif of the difference between the two input variables, as well as the role of I_0 as a bias term. The architecture can also reproduce psychometric functions generated by the original model (Figure 2D). However, the recovery of Weber's law should be merely considered a sanity check, given that the data generating model could be recovered with much simpler methods, such as logistic regression. This is reflected in the decent performance of the random sampling method.
Figure 3: Architecture search results for exponential learning. (A, B) The mean validation loss as a function of the number of intermediate nodes (k) and penalty on model complexity (γ) for architectures obtained through DARTS and random sampling. Vertical bars indicate the SEM across seeds. The star designates the loss of the best-fitting architecture shown in (C). (D) The learning curves generated by the original model and the recovered architecture in (C).

Case 2: Exponential Learning. One of the core features of this test case is the exponential relationship between task performance and the number of trials. Note that we do not expect DARTS to fully recover Equation (8), as it is—by design—incapable of representing the non-linear interaction of (P_∞ − P_0) and e^(−α·t). Nevertheless, DARTS recovers the exponential relationship between the number of trials t and performance P_n (Figure 3C). However, the best-fitting architecture relies on a number of other transformations to compute P_n based on its independent variables, and fails to fully recover the learning curves of the original model (Figure 3D). In the General Discussion, we examine ways of mitigating this issue.

Case 3: Leaky Competing Accumulator. The best-fitting architecture, shown in Figure 4C, bears remarkable resemblance to the original model (cf. Equation (10)),

    dx_i = [c_0 − c_1·x_i − c_2·Σ_{j≠i} ReLU(x_j)] · dt                 (12)

where c_0, c_1 and c_2 are fitted parameters: it recovers the rectified linear activation function imposed on the two units competing with x_i, as well as the corresponding inhibitory weight c_2 ≈ β. Yet, the recovered model fails to apply this function to unit x_i itself. The generated dynamics are nevertheless capable of approximating the behavior of the original model (Figure 4D).

Figure 4: Architecture search results for LCA. (A, B) The mean validation loss as a function of the number of intermediate nodes (k) and penalty on model complexity (γ) for architectures obtained through DARTS and random sampling. Vertical bars indicate the SEM across seeds. The star designates the loss of the best-fitting architecture depicted in (C). (D) Dynamics of each decision unit simulated with the original model and the best architecture shown in (C).
General Discussion and Conclusion

Empirical scientists are challenged with integrating an increasingly large number of experimental phenomena into quantitative models of cognitive function. In this article, we introduced and evaluated a method for recovering quantitative models of cognition using DARTS. The proposed method treats quantitative models as DAGs, and leverages continuous relaxation of the architectural search space to identify candidate models using gradient descent. We evaluated the performance of this method based on its ability to recover three different quantitative models of human cognition from synthetic data. Our results show that this method has an advantage over random sampling, and is capable of recovering computational motifs from quantitative models of human information processing, such as the difference operation in Weber's law or the rectified linear activation function in the LCA. While the initial results reported here seem promising, there are a number of limitations worth addressing in future work.

All limitations of DARTS pertain to its assumptions, most of which limit the scope of discoverable models. First, not all quantitative models can be represented as a DAG, such as ones that require independent variables to be combined in a multiplicative fashion (see Test Case 2). Solving this problem may require expanding the search space to include different integration functions performed on every node.⁴ Second, some operations may have an unfair advantage over others when trained via gradient descent, e.g. if their gradients are larger. Due to the competitive nature of the softmaxed weighting of operations (Equation (4)), this can lead to a suppression of operations that take more time to learn but are, in principle, closer to the ground truth. Chu, Zhou, Zhang, and Li (2020) address this problem by replacing the softmax in Equation (4) with a sigmoid function. This modification, referred to as fair DARTS, allows operations to contribute in a cooperative manner. However, the space of solutions is still limited to the number and types of functions considered (Lindauer & Hutter, 2020). Finally, the performance of DARTS is contingent on a number of training and evaluation parameters, as is the case for other NAS methods. Future work is needed to evaluate DARTS for a larger space of parameters, in addition to the number of intermediate nodes and the penalty on model complexity explored in this study. However, despite all these limitations, DARTS may provide a first step toward automating the construction of complex quantitative models based on interpretable linear and non-linear expressions.
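The difference between the two weighting schemes discussed above can be sketched in a few lines; the architecture weights shown are arbitrary example values.

    import torch
    import torch.nn.functional as F

    alpha_edge = torch.tensor([1.0, -0.5, 0.3])     # architecture weights of one edge (illustrative)

    softmax_weights = F.softmax(alpha_edge, dim=0)  # DARTS: operations compete, weights sum to one
    sigmoid_weights = torch.sigmoid(alpha_edge)     # fair DARTS: operations are weighted independently

    print(softmax_weights, softmax_weights.sum())   # sums to 1.0
    print(sigmoid_weights, sigmoid_weights.sum())   # need not sum to 1.0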
In this study, we consider a small number of test cases to evaluate the performance of DARTS. While these test cases present useful proofs of concept, we encourage the rigorous evaluation of this method based on other quantitative models of cognitive function. To enable such explorations, we provide open access to a documented implementation of the evaluation pipeline described in this article.⁵ This pipeline is part of a Python toolbox for autonomous empirical research, and allows for the user-friendly integration and evaluation of other search methods and test cases. As such, the repository includes additional test cases (e.g. models of controlled processing) that we could not include in this article due to space constraints. We invite interested researchers to evaluate DARTS based on other computational models, and to utilize this method for the automated discovery of quantitative models of human information processing.

⁴ Another solution would be to linearize the data or to operate in logarithmic space. However, the former might hamper interpretability for models relying on simple non-linear functions, and the latter may be inconvenient if the ground truth cannot be easily represented in logarithmic space.

⁵ We will include a link to the source code and documentation of this pipeline in the final submission.

References

Chu, X., Zhou, T., Zhang, B., & Li, J. (2020). Fair DARTS: Eliminating unfair advantages in differentiable architecture search. In European Conference on Computer Vision (pp. 465–480).

Elsken, T., Metzen, J. H., Hutter, F., et al. (2019). Neural architecture search: A survey. JMLR, 20(55), 1–21.

Fechner, G. T. (1860). Elemente der Psychophysik (Vol. 2). Breitkopf u. Härtel.

He, X., Zhao, K., & Chu, X. (2021). AutoML: A survey of the state-of-the-art. Knowledge-Based Systems, 212, 106622.

Heathcote, A., Brown, S., & Mewhort, D. J. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7(2), 185–207.

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Li, L., & Talwalkar, A. (2020). Random search and reproducibility for neural architecture search. In Uncertainty in Artificial Intelligence (pp. 367–377).

Lindauer, M., & Hutter, F. (2020). Best practices for scientific research on neural architecture search. JMLR, 21(243), 1–18.

Liu, H., Simonyan, K., & Yang, Y. (2018). DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055.

Mendoza, H., Klein, A., Feurer, M., Springenberg, J. T., & Hutter, F. (2016). Towards automatically-tuned neural networks. In Workshop on AutoML (pp. 58–65).

Musslick, S., Cherkaev, A., Draut, B., Butt, A., Srikumar, V., Flatt, M., & Cohen, J. D. (2020). SweetPea: A standard language for factorial experimental design. PsyArXiv, doi:10.31234/osf.io/mdwqh.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., . . . Lerer, A. (2017). Automatic differentiation in PyTorch.

Thurstone, L. L. (1919). The learning curve equation. Psychological Monographs, 26(3), i.

Usher, M., & McClelland, J. L. (2001). The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review, 108(3), 550.

Xie, S., Kirillov, A., Girshick, R., & He, K. (2019). Exploring randomly wired neural networks for image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1284–1293).