Experimentally realized in situ backpropagation for deep learning in nanophotonic neural networks

Sunil Pai,1,∗ Zhanghao Sun,1 Tyler W. Hughes,1,2 Taewon Park,1 Ben Bartlett,3 Ian A. D. Williamson,1,4 Momchil Minkov,1,2 Maziyar Milanizadeh,5 Nathnael Abebe,1 Francesco Morichetti,5 Andrea Melloni,5 Shanhui Fan,1 Olav Solgaard,1 and David A.B. Miller1

1 Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
2 now at Flexcompute Inc., Belmont, MA, USA
3 Department of Applied Physics, Stanford University, Stanford, CA 94305, USA
4 now at X Development LLC, Mountain View, CA, USA
5 Politecnico di Milano, Milan, Italy
∗ sunilpai@stanford.edu
arXiv:2205.08501v1 [cs.ET] 17 May 2022

Neural networks are widely deployed models across many scientific disciplines and commercial endeavors ranging from edge computing and sensing to large-scale signal processing in data centers. The most efficient and well-entrenched method to train such networks is backpropagation, or reverse-mode automatic differentiation. To counter an exponentially increasing energy budget in the artificial intelligence computing sector, there has been recent interest in analog implementations of neural networks, specifically nanophotonic optical neural networks for which no analog backpropagation demonstration exists. We design mass-manufacturable silicon photonic neural networks that alternately cascade our custom designed “photonic mesh” accelerator with digital nonlinearities to output the result of arbitrary matrix multiplication of the input signal. These photonic meshes are parametrized by reconfigurable physical voltages that tune the interference of optically encoded input data propagating through integrated Mach-Zehnder interferometer networks. Here, using our packaged photonic chip, we demonstrate in situ backpropagation for the first time to solve classification tasks and evaluate a new protocol to keep the entire gradient measurement and update of physical device voltages in the analog domain, improving on past theoretical proposals. This in situ method is made possible by introducing three changes to typical photonic meshes: (1) measurements at optical “grating tap” monitors, (2) bidirectional optical signal propagation automated by fiber switch, and (3) universal generation and readout of optical amplitude and phase. After training, our classification achieves accuracies similar to digital equivalents even in the presence of systematic error. Our findings suggest a new training paradigm for photonics-accelerated artificial intelligence based entirely on a physical analog of the popular backpropagation technique.

Neural networks (NNs) are intelligent computational graph-based models that are ubiquitous in scientific data analysis and commercial artificial intelligence (AI) applications such as self-driving cars and speech recognition software. Through deep learning, NN models are dynamically “trained” on input image, audio or language data to automatically make decisions (“inference”) for complex signal processing powering much of today’s modern technology. Due to increasing demand, these models require an ever-increasing computational energy budget, which has recently been estimated to double every 3 to 4 months, according to OpenAI [1]. An increasingly large reservoir of available data and the adoption of AI in modern technology necessitate an energy-efficient solution for training NNs.

In this paper, we experimentally demonstrate the first (to our knowledge) optical implementation of backpropagation, the most widely used and accepted method of training NNs [2, 3], on a scalable foundry-manufactured device. (A minimal bulk optical demonstration has been previously explored [4].) Specifically, backpropagation consists of backward propagating model errors computed for training data through the NN graph to determine updating gradients on each element in the graph. Importantly, this can be physically implemented in linear optical devices by simply sending light-encoded errors backwards through photonic devices and performing optical measurements [5], which is a faster and more efficient calculation than a digital implementation. Our demonstration of this new physics-based backpropagation algorithm, and our analysis of systematic error in the gradient calculations used in backpropagation, could ultimately help offer new, possibly energy-efficient strategies to teach modern AI to make intelligent decisions on mass-manufacturable silicon photonics hardware using efficient model-free training [5, 6].

As a physical platform for our backpropagation demonstration, we explore programmable nanophotonic devices called “photonic meshes” for accelerating matrix multiplication [7, 8]. Photonic meshes, shown in Fig. 1, are silicon-based low-cost, commercially scalable N × N port photonic integrated circuits (PICs) consisting of Mach-Zehnder interferometers (MZIs) and programmable optical phase shifts. These PICs are capable of representing matrix-vector multiplication (MVM) through only the propagation of guided monochromatic light (1560 nm in our demonstration) through N silicon waveguide “wires” clad by silicon oxide [7–9]. Each waveguide can support a single optical mode which has two degrees of freedom: amplitude and phase, yielding a complex N-dimensional vector x at the input of the system.
Programmable phase shift settings physically modulate the propagation speed (and relative phases) of the wave over segments of silicon wire to affect how the N propagating modes constructively or destructively interfere in each interferometer. The energy efficiency of these devices has been estimated to be up to two orders of magnitude higher than current state-of-the-art electronic application-specific integrated circuits (ASICs) in AI [10].

Assuming no light is lost in the ideal photonic circuit, the mesh can be programmed to transform optical inputs using an arbitrary programmable unitary MVM y = U x [8, 11, 12]. The matrix U is parametrized by the programmable phase shifters on the device and transforms inputs x propagating through the device to output modes y. The programmed phase shifts on the device define the matrix U, and the N output mode amplitude and phase measurements y represent the solution to this optical computation. This fundamental mathematical operation enables meshes to be widely employed in various analog signal processing applications [13] such as telecommunications [9], quantum computing [14], sensing, and machine learning [7], the last of which we explore experimentally in this work via our backpropagation demonstration.
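To make the interference picture concrete, the 2 × 2 transfer matrix of a single lossless MZI can be written down and checked for unitarity (and hence power conservation) in a few lines of Python. The convention below (internal phase θ, external phase φ) is one common choice and is meant only as an illustrative sketch, not necessarily the exact convention of our chip:

    import numpy as np

    def mzi(theta, phi):
        # 2x2 unitary of a lossless MZI: theta sets the power splitting,
        # phi sets the relative phase between the two input arms.
        return np.exp(1j * theta / 2) * np.array(
            [[np.exp(1j * phi) * np.sin(theta / 2), np.cos(theta / 2)],
             [np.exp(1j * phi) * np.cos(theta / 2), -np.sin(theta / 2)]])

    T = mzi(theta=0.7, phi=1.2)
    x = np.array([1.0, 0.5j])                  # two modes: amplitude and phase
    y = T @ x                                  # interference of the two modes
    assert np.allclose(T.conj().T @ T, np.eye(2))            # unitary
    assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))  # lossless

An N × N mesh cascades such 2 × 2 blocks, with the (θ, φ) pairs of every block serving as the programmable parameters.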
To form what we call a “hybrid” digital-photonic NN (PNN), we alternately cascade photonic meshes and digital nonlinear functions [13, 15], which ultimately forms a composite function and model capable of complex decision making. While performing inference or backpropagation, the hybrid PNN performs time- and energy-efficient MVM, converts photonic mesh output signals to the digital domain, applies nonlinearities, and then converts the data back to the optical domain for MVM in the next layer. Hybrid PNNs offer more versatility than fully analog PNNs in the near term due to flexible manipulation of signals in the digital domain, easing implementations for recurrent and convolutional neural networks. As a result, hybrid PNNs have been demonstrated to provide a reliable low-latency and energy-efficient analog optical solution for inference, recently in circuit sizes of up to 64 × 64 in commercial settings [16]. Despite this success in PNN-based inference, on-device backpropagation training of PNNs has not been demonstrated, due to significantly higher experimental complexity compared to the inference procedure.
In this paper, we address this gap by experimentally demonstrating in situ backpropagation in a hybrid PNN architecture. First, we propose a novel and energy-efficient implementation of the technique for measuring phase shifter updates entirely in the optoelectronic (analog) domain. Second, we experimentally validate training a backpropagation-enabled, foundry-manufactured photonic circuit on a multilayer neural network using a custom optical rig. Our demonstration solves machine learning tasks on this photonic hardware using optically-accelerated backpropagation with accuracy similar to a conventional digital implementation, adding new capabilities beyond existing inference or in silico learning demonstrations [7, 17, 18]. Our findings ultimately pave the way for a new class of approaches for energy-efficient analog training of neural networks and optical devices more broadly.

PHOTONIC NEURAL NETWORKS

At a high level, a neural network is able to transform data into useful decisions or interpretations, an example of which is shown in Fig. 1(a) and (d), where we label points in 2D based on their location within or outside of a ring. This problem (and in principle many more complex problems like audio signal processing [7]) can be solved in the optical domain using photonic neural networks (PNNs) as shown in Fig. 1(a)-(d). To solve the problem, we design and evaluate a hybrid digital-optical deep PNN architecture parameterized by trainable programmable phase shifts η ∈ [0, 2π)^D, where D represents the total number of phase shifting elements across all layers in the overall PNN.

Using a combination of photonic hardware and software implementations, our hybrid PNN can solve nontrivial tasks using alternating sequences of analog linear optical MVM operations U^(ℓ)(η^(ℓ)) and digital nonlinear transformations f^(ℓ), where ℓ denotes the neural network layer and we assume a total of L layers. For example, after L = 3 neural network layers for N = 4, our photonic neural network is capable of transforming the boundary function used to separate the labelled points in Fig. 1(d). To realize this implementation in a mathematical model, the following sequence of functions transforms the data, proceeding in a “feedforward” manner through the layers of the network:

    y^(ℓ) = U^(ℓ) x^(ℓ),
    x^(ℓ+1) = f^(ℓ)(y^(ℓ)).        (1)

The inputs x = x^(1) to the overall system are forward-propagated to the final layer (layer L), outputting ẑ := x^(L+1). This forward propagation and resulting output measurement of data sent through this network is called “inference” and is depicted in Fig. 1(a, b, d). The model cost or error function is represented by L(x, z) = c(ẑ(x), z) for a given set of ground truth labels z, where c is any cost function representing the error between ẑ and z. We refer to the input, label pair of training data (x, z) as a “training example.” Backpropagation and other gradient-based training approaches seek to update the parameters η based on the vector gradient ∂L/∂η ∈ R^D evaluated for a given training example (or averaged over a batch of training examples). We now explain how we implement both the inference and backpropagation training calculations directly on our core photonic neural network.
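As a purely digital sketch of Eq. (1), with stand-in random unitaries in place of the phase-shift-parameterized meshes and the absolute-value nonlinearity used later in this work:

    import jax.numpy as jnp
    from jax import random

    def rand_unitary(key, n):
        # Stand-in for a programmed mesh: QR of a complex Gaussian is unitary.
        kr, ki = random.split(key)
        a = random.normal(kr, (n, n)) + 1j * random.normal(ki, (n, n))
        return jnp.linalg.qr(a)[0]

    def pnn_forward(us, x):
        # Eq. (1): optical MVM y = U x, then digital nonlinearity f = |.|
        for u in us:
            x = jnp.abs(u @ x)
        return x                               # z_hat := x^(L+1)

    n, n_layers = 4, 3
    us = [rand_unitary(k, n) for k in random.split(random.PRNGKey(0), n_layers)]
    z_hat = pnn_forward(us, jnp.ones(n, dtype=jnp.complex64))

On chip, each U is not random but is set by the programmed phases η, and the MVM is performed by light propagation rather than by this matrix product.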

[Fig. 1 appears here: (a) unlabelled 2D inputs; (b, c) in situ backpropagation training schematics (forward and backward directions); (d) labelled outputs; (e) the three-step in situ backpropagation optical measurement protocol with Generator and Analyzer circuits and a tapped-MZI inset.]
FIG. 1. (a) Example machine learning problem: an unlabelled 2D set of inputs that are formatted to be input into a photonic
neural network. (b, c) In situ backpropagation training of an L-layer photonic neural network for (b) the forward direction and (c) the
backward direction showing the procedure for calculating gradient updates for phase shifts. (d) An inference task implemented
on the actual chip results in good agreement between the chip-labelled points and the ideal implemented ring classification
boundary (resulting from the ideal model) and a 90% classification accuracy. (e) We depict our proposed architecture and
the three steps of in situ (analog) backpropagation, consisting of a 6 × 6 mesh implementing coherent 4 × 4 forward and
inverse unitary matrix-vector products using a reference arm. We depict the (1) forward (2) backward (3) sum steps of in
situ photonic backpropagation. Arbitrary input setting and complete amplitude and phase output measurement are enabled in
both directions using the reciprocity and symmetries of our architecture. All powers throughout the mesh are monitored using
the tapped MZI shown in the inset for each step, allowing for digital subtraction to compute the gradient [5]. These power
measurements performed at phase shifts are indicated by green horizontal bars.

BACKPROPAGATION DEMONSTRATION

For practical demonstration purposes, our multilayer PNN is completely controlled by a single photonic mesh (note that in practice, energy-efficient photonic NNs are controlled by separate photonic meshes of MZIs for each linear layer). Each MZI unit is controlled by an electronic control unit that applies voltages to set various phase shifts on the device, packaged on a thermally controlled assembly as shown in Fig. 2(a, b). These phase shifts are placed at the input external arm of the MZI (φ, controlled by voltage vφ) and in the internal arm of the MZI (θ, controlled by voltage vθ); this ultimately controls the propagation pattern of the light through the chip, enabling arbitrary unitary matrix multiplication. In our chip specifically, we embed an arbitrary 4 × 4 unitary matrix multiply in a 6 × 6 triangular network of MZIs. This configuration incorporates two 1 × 5 photonic meshes on either end of the 4 × 4 “Matrix unit” (shown in green) capable of sending any input vector x and measuring any output vector y from Eq. 1. These calibrated optical I/O circuits, referred to as “Generator” and “Analyzer” circuits, are shown in red and blue respectively in Figs. 1(e) and 2(b). A more complete discussion of how an MVM operation is achieved using our architecture is provided in the Methods, and similar approaches have been attempted for complex-valued photonic neural network architectures [19]. Note that for the input generation and output measurements, we need to calibrate the voltage mappings θ(vθ), φ(vφ) (equivalently, for the output measurement, vθ(θ), vφ(φ)), which is discussed in detail in Ref. 20. This is a standard calibration protocol [7, 21, 22], discussed at length in the Appendix and required for accurate operation of the chip.
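As intuition for what such a calibration produces, here is a minimal sketch assuming a thermo-optic phase shifter whose phase grows linearly with heater power (θ ≈ θ0 + a·v²); the data below are synthetic stand-ins for fringe-extracted phases, and the full fringe-fitting protocol is in the Appendix:

    import numpy as np

    # Synthetic stand-in for phases extracted from interference fringes.
    v = np.linspace(0.0, 2.0, 101)                    # heater voltages
    theta_meas = 0.3 + 1.8 * v**2 + 0.01 * np.random.randn(v.size)

    # Fit the model theta(v) = theta0 + a * v^2, then invert it for control.
    a, theta0 = np.polyfit(v**2, theta_meas, 1)

    def v_of_theta(theta):
        # Voltage that programs a desired phase theta (mod 2*pi).
        return np.sqrt(np.mod(theta - theta0, 2 * np.pi) / a)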
Our core contribution in this paper, shown in Fig. 1(e), is to devise and test a photonic mesh matrix accelerator architecture that experimentally implements backpropagation as proposed in Ref. 5 within a hybrid digital-analog model implementing the most expensive operations using universal linear optics.

[Fig. 2 appears here: (a) experimental chip close-up with grating tap monitor and thermal TiN phase shifter; (b) control and measurement protocol; (c) analog gradient detection circuit; (d) gradient measurement and (e) gradient histogram at σθ,φ = 1; (f) gradient error near convergence.]

FIG. 2. (a) Image of the packaged photonic chip wirebonded to a custom PCB with fiber array for laser input and a camera
overhead for imaging the chip. Zooming in reveals the core control-and-measurement unit of the chip, enabling power mea-
surement using 3% grating tap monitors and a thermal phase shifter nearby. (b) The DAC control is used to set up inputs
and perform coherent detection of the outputs, and the IR camera over the chip can be used to image all waveguide segments
throughout the chip via grating tap monitors, which we use for backpropagation. (c) Analog gradient detection could be used
to measure the gradient by introducing a summing interference circuit (not implemented on the chip in (b)) between the input
and adjoint fields. (d) The adjoint phase ζ can be swept from 0 to 2π to perform an all-analog gradient measurement, and,
ultimately in principle, update phase shifters with an optoelectronic scheme. (e) Gradients measured using our analog scheme
yield approximately correct gradients when the implemented mesh is perturbed from the optimal (target) unitary U = DFT(4)
with phase error σθ,φ = 1. (f) The average normalized gradient error (averaged over 20 instances of random implemented Û) decreases with distance of the device implementation Û(η) from the optimal U = DFT(4). This “distance” is represented in terms of the fidelity error 1 − |tr(Û†U)|².

Our backpropagation-enabled architecture differs from previously-proposed photonic mesh architectures in three ways:

1. We enable “bidirectional light propagation,” the ability to send and measure light propagating left-to-right or right-to-left through the circuit (as depicted in Fig. 1(e)).

2. We implement “global monitoring,” the ability to measure optical power at any waveguide segment in the circuit using 3% grating taps (shown in the inset of Fig. 1(e) and Fig. 2(a, b)). In our proof-of-concept setup, we use an IR camera mounted on an automated stage to image these taps throughout the chip.

3. We implement both amplitude and phase detection (improving on past approaches [19]) using a self-configuring programmable Matrix unit layer [20, 23] on both the red and blue Generator and Analyzer subcircuits of Fig. 1(e) and Fig. 2(b), which by symmetry works for sending and measuring light that propagates forward or backward through the mesh.

These improvements on an already versatile photonic hardware architecture enable backpropagation-based machine learning implemented entirely using optical measurement to optimize programmable phase shifters in PNNs. As shown in Fig. 1(e), all three steps of backpropagation [5] require monitoring of optical powers at each phase shifter and measurement of complex field outputs at the left and right sides of the mesh. Furthermore, the bidirectionality of the monitoring and optical I/O is required to switch between forward and backward propagation of signals for in situ backpropagation to be experimentally realized.
Equipped with these additional elements, our protocol can be implemented on any feedforward photonic circuit [6] with the requisite Analyzer and Generator circuitry, though we use a triangular mesh in this work to enable the chip to be used in other applications [24].

Here we give a quick summary of the procedure (fully described in the Appendix). For each layer ℓ and training example pair, a “forward inference” signal x^(ℓ) is sent forward and a corresponding “backward adjoint” signal x_aj^(ℓ) is sent backward through a mesh implementing U^(ℓ). The backward pass is in a sense a mirror image of the forward pass (the error signal is sent from the final layer to the input layer), which algorithmically computes an efficient “reverse mode” chain rule calculation. The final step sends what we call a “sum” vector x^(ℓ) − i(x_aj^(ℓ))*. Previously [5], it was shown that global monitoring in all three steps enables us to calculate the gradient by subtracting the backward and forward measurements from the sum measurements in the digital domain, in what we call an “optical vector-Jacobian product (VJP)” (Appendix).
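The identity behind this digital subtraction can be checked numerically at a single monitored phase shifter; a sketch with stand-in complex fields x_η and x_{η,aj} (cf. Eq. (2) in the next section):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal() + 1j * rng.normal()        # forward field x_eta
    x_aj = rng.normal() + 1j * rng.normal()     # adjoint field x_eta,aj

    # Powers read at the grating tap in the three passes.
    p_fwd = abs(x) ** 2                         # (1) forward pass
    p_adj = abs(x_aj) ** 2                      # (2) backward (adjoint) pass
    p_sum = abs(x - 1j * np.conj(x_aj)) ** 2    # (3) "sum" pass

    # Three intensity measurements recover the interference (gradient) term.
    grad = (p_sum - p_fwd - p_adj) / 2
    assert np.isclose(grad, -np.imag(x * x_aj))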
ANALOG UPDATE

Going beyond an experimental implementation of the theoretical proposal of Ref. 5, we additionally explore a more energy-efficient, fully analog gradient measurement update for the final step that avoids the digital subtraction update. The key difference is in the final “sum” step, where we instead sweep the adjoint phase ζ (giving x^(ℓ) − i(x_aj^(ℓ))* e^{iζ}) from 0 to 2π repeatedly (e.g., using a sawtooth signal). During the sweep, we record d_η(ζ), the AC component of the measured power p_{η,sum}(ζ) monitored through phase shifter θ. It is straightforward to show that the gradient is proportional to d_η(0), the AC component evaluated when no adjoint phase is applied (Appendix). To achieve the ζ sweep physically, we can employ the summing architecture in Fig. 2(c), which sums x^(ℓ) and i(x_aj^(ℓ))* interferometrically with a constant loss factor of 1/2 in power (1/√2 in amplitude). Then, using a boxcar gated integrator and high-pass filter, we can physically compute d_η(ζ) and update the phase shift voltage entirely in the analog domain (Appendix). Ultimately, this approach potentially avoids a costly analog-digital conversion and the additional memory complexity required to program N² elements. Since the PNN has L layers, the gradient calculation step requires local feedback circuits at each phase shifter η that update the parameters using the measured gradient:

    ∂L/∂η = −Im(x_η x_{η,aj})
          = (|x_η − i x_{η,aj}*|² − |x_η|² − |x_{η,aj}|²)/2        (2)
          = (p_{η,sum} − p_η − p_{η,aj})/2 = d_η(0)/2,

where the last equation indicates the equivalence of the “digital subtraction” scheme shown in Fig. 1 and our proposed “analog update” scheme d_η(0)/2 in Fig. 2(c, d) (Appendix). Pseudocode and the complete enumerated backpropagation protocol are discussed in the Appendix. Note that the digital and analog gradient updates can both be implemented in parallel across all photonic layers of the network.
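A numerical sketch of this sweep: the DC component is removed (standing in for the high-pass filter) and the gradient is read off at ζ = 0 (standing in for the boxcar readout), matching Eq. (2):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal() + 1j * rng.normal()          # forward field at the shifter
    x_aj = rng.normal() + 1j * rng.normal()       # adjoint field at the shifter

    # Sweep the adjoint phase zeta over one full period (sawtooth in hardware).
    zeta = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
    p_sum = np.abs(x - 1j * np.conj(x_aj) * np.exp(1j * zeta)) ** 2

    d_eta = p_sum - p_sum.mean()                  # high-pass: keep AC component
    grad = d_eta[0] / 2                           # d_eta(0)/2, cf. Eq. (2)
    assert np.isclose(grad, -np.imag(x * x_aj))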
Now that we have defined the analog in situ backpropagation update, we experimentally evaluate the accuracy of the analog gradient measurement for a matrix optimization problem in Fig. 2(b, d). Since our circuit does not have an explicit backprop unit architecture, we experimentally simulate the “backprop unit” of Fig. 2(c) by programming a sequence of summing vectors in the Generator unit of our chip and recording d_η(ζ) to compute the gradient with respect to η. We implement backpropagation in a single photonic mesh layer optimizing a linear cost function L_m = 1 − |û_m^T u_m*|², where u_m is row m of U, a target matrix that we choose to be the four-point discrete Fourier transform (DFT), and û_m is row m of Û, the implemented matrix on the device. Each phase shifter in the photonic network implements a phase shift θ + δθ, where θ is the optimal phase shift for U and δθ is some random phase error with standard deviation σ_{θ,φ}, which can serve as a measure of “distance to convergence” during training of the device. For our gradient measurement step, we send in the derivative y_aj = ∂L_m/∂y = −2(û_m^T u_m*)* e_m to achieve an adjoint field x_aj, where e_m is the mth standard basis vector (1 at position m, 0 everywhere else). We find in Fig. 2(f) that the analog gradient measurement is increasingly less accurate when calculated near convergence, likely due to uncorrected photonic circuit error (e.g. due to loss and/or thermal crosstalk) resulting in large gradient measurement errors.
PHOTONIC NEURAL NET TRAINING

To test overall training within our photonic mesh chip, we assess the accuracy of in situ backpropagation in Fig. 3 by training L-layer photonic neural networks to solve multiple 2D classification tasks using the digital subtraction protocol in Ref. 5. The classification problem assigns points in 2D space to a 0 or 1 label (red or blue coloring) based on whether the point is in a region of space, and the neural network implements the nonlinear boundary (for instance circle-, moon- or ring-shaped) separating points of different labels, standardized using the Python package Sklearn and specified in our code [26]. The points are randomly synthetically generated and are “noisy,” meaning some training example points have a small probability of being assigned a label despite being on the wrong side of the ideal boundary.
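Such datasets can be generated along the following lines (a sketch; the exact generation settings are specified in our code [26], and the noise level here is illustrative):

    from sklearn.datasets import make_circles, make_moons
    from sklearn.model_selection import train_test_split

    # 250 noisy 2D points with 0/1 labels; an 80/20 split gives
    # 200 train and 50 test points, as used below.
    x, z = make_circles(n_samples=250, noise=0.1, factor=0.5, random_state=0)
    # x, z = make_moons(n_samples=250, noise=0.1, random_state=0)  # moons task
    x_train, x_test, z_train, z_test = train_test_split(
        x, z, test_size=0.2, random_state=0)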
To solve this task, we use a three-layer PNN where each linear layer uses 4 optical ports (4 × 4 MVM), i.e. L = 3 with N = 4 inputs and outputs.
[Fig. 3 appears here: (a) chip/computer schematic of the three-layer hybrid PNN with softmax decision layer; (b) digital subtraction of the sum, backward, and forward monitor powers; (c)-(e) circle task: in situ training curve, classification, and gradient error histogram with optical monitor measurements; (f)-(h) moons task: training curves, classification, and gradient error histograms.]

FIG. 3. We perform in situ backpropagation training based on stochastic gradient descent and Adam update [25] (learning
rate 0.01) for two classification tasks solvable by (a) a three layer hybrid photonic neural network consisting of absolute value
nonlinearities and a softmax (effectively sigmoid) decision layer. For the circle dataset, (b) the training curve for Adam update
for stochastic gradient descent shows excellent agreement between test and train for both digital and in situ backpropagation
updates resulting in (c) a classification plot showing the true labels and the classification curve based on the learned parameters
resulting in 96% model test accuracy and 93% model train accuracy. (d) For the moons dataset, we compare the test cost
curves for the digital and the in situ updates, where we find that our phase measurements are sufficiently inaccurate to impact
training leading to a lower model train accuracy of 87%. If we use ground truth phase measurements (red) instead of measured
phases (blue), we ultimately arrive at (e) a sufficiently high model test accuracy of 98% (model train accuracy is lower at
95%). When using ground truth phases instead of measured phases, (f) the gradient error reduces considerably by roughly
an order of magnitude. (g) While training the circle classification was successful, the measured gradient error is similarly
large as that measured in the moons training experiment. This suggests that the importance of accurate gradients can be
problem-dependent.

The inference operation of our photonic neural network consists of programming the inputs in the red Generator circuit and measuring outputs on the blue Analyzer circuit, reprogramming the unitary for each Matrix unit layer on the same chip, and square-rooting the output power measurement to achieve absolute value nonlinearities of the form |y| (see Appendix for a more detailed description). This unitary layer reprogramming is only intended for a proof-of-concept; the ultimate implementation would dedicate a separate optical device to each linear layer.

The neural network inference model outputs the probability of a 0 or 1 (red or blue) assignment of each point based on the following model:

    ẑ(x) = softmax2(|U^(3)|U^(2)|U^(1)x|||),        (3)

where we define softmax2 : C^4 → [0, 1]^2 as the two-element vector representing the probability of the 0 or 1 label, softmax2(y) = (e^(|y_1|²+|y_2|²), e^(|y_3|²+|y_4|²))/(e^(|y_1|²+|y_2|²) + e^(|y_3|²+|y_4|²)). We then apply a softmax cross entropy (SCE) cost function L(x) = SCE(ẑ(x), z) = −(z_0 log ẑ_0 + z_1 log ẑ_1). The ultimate goal is to apply automatic differentiation and in situ analog gradient measurement on our photonic device to optimize L.
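A sketch of this decision layer and cost in JAX (the overall sign makes L a quantity to minimize):

    import jax.numpy as jnp

    def softmax2(y):
        # Two-class probabilities from the four complex output amplitudes.
        p = jnp.abs(y) ** 2
        logits = jnp.array([p[0] + p[1], p[2] + p[3]])
        return jnp.exp(logits) / jnp.sum(jnp.exp(logits))

    def sce(z_hat, z):
        # Softmax cross entropy against the one-hot true label z.
        return -(z[0] * jnp.log(z_hat[0]) + z[1] * jnp.log(z_hat[1]))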
Input data to our device are formatted into the form (x_1, x_2, p, p), where x_1, x_2 correspond to the location in 2D space and p is a power ensuring that all inputs are normalized to the same power P, i.e. x_1² + x_2² + 2p² = P; this convention follows the simple example in Ref. 5.
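For example, with P = 1 (a hypothetical helper, assuming the 2D points are pre-scaled so that x_1² + x_2² ≤ P):

    import jax.numpy as jnp

    def encode(x1, x2, total_power=1.0):
        # Pad a 2D point to 4 modes of equal total optical power P.
        p = jnp.sqrt((total_power - x1**2 - x2**2) / 2)
        return jnp.array([x1, x2, p, p], dtype=jnp.complex64)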
We perform an 80/20% train-test split (200 train points, 50 test points), holding out test data from training to ensure no "overfitting" takes place, though this is unlikely for our simple model. We generally find higher test than train accuracy in our results since there are fewer "noisy examples" in our randomly generated test sets.

Our single photonic chip is used to perform the data input, data output, and matrix operations for all three layers of our photonic neural network shown in Fig. 3(a). After each pass through the photonic chip, we measure the output power and digitally perform a square-root operation to effectively implement absolute value nonlinearities on the computer. Throughout the process (forward, backward, and sum steps), we also perform digital simulations so we can compare the experimental and simulated performance at each step as shown in Fig. 1(b). Minimizing the cost L ultimately leads to maximum accuracy in classifying points to the appropriate labels.

When performing training, the most critical information is in the gradient direction, so we compute the gradient direction error 1 − g · ĝ between the normalized measured and predicted gradients, where each is normalized as g = ∂L/∂η · ‖∂L/∂η‖⁻¹. An important distinction for metric reporting is the difference between "model" and "device" cost function and accuracy. In Fig. 3, we report "model metrics" by evaluating device parameters learned on our chip on the true model. Thus, the actual training of the physical parameters is performed on the device itself, and the result of training is evaluated on the computer.
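As a concrete reference for this metric, a minimal sketch follows (the function name is ours):

    import jax.numpy as jnp

    def gradient_direction_error(measured, predicted):
        # 1 - g_meas . g_pred for unit-normalized gradient vectors; the
        # metric is 0 when measured and predicted directions agree exactly.
        g_meas = measured / jnp.linalg.norm(measured)
        g_pred = predicted / jnp.linalg.norm(predicted)
        return 1.0 - jnp.dot(g_meas, g_pred)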
Our first task is to verify that inference works on our platform, which we show for a randomly generated "ring" dataset to have 90% device test set accuracy on our physical platform, shown previously in Fig. 1(c). Once we confirm that the inference performance is acceptable, we then perform training of 2D classification problems using our digital subtraction approach on our randomly generated datasets. We use the standard Adam stochastic gradient descent update [25] with a learning rate of 0.01, with all nonlinear automatic differentiation performed off the chip via the Python libraries JAX and Haiku [27, 28].
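A sketch of the corresponding off-chip update step is below. We use optax's Adam as a stand-in for the optimizer (one common choice in the JAX ecosystem; the paper's actual training harness lives in the Phox framework), and measure_device_gradients is a hypothetical placeholder for the in situ gradient measurement exposed by the chip interface.

    import optax

    optimizer = optax.adam(learning_rate=0.01)  # Adam update [25]
    # opt_state = optimizer.init(params) before the first step

    def train_step(params, opt_state, x, z):
        # In situ measured gradients dL/d(eta) replace digital
        # backpropagation through the unitary layers; only O(N) work
        # happens on the computer.
        grads = measure_device_gradients(params, x, z)  # hypothetical chip call
        updates, opt_state = optimizer.update(grads, opt_state, params)
        return optax.apply_updates(params, updates), opt_state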
We first report our training update model metrics for the circle dataset for the photonic neural network shown in Fig. 3(a). In Fig. 3(b), we show the grating tap-to-camera measurements of normalized field magnitudes in the final layer for all three passes required for digital subtraction across all layers of our device at iteration 930 (near the optimum), which show excellent agreement between predicted and measured fields. The training curves in Fig. 3(c) indicate that stochastic gradient descent is a highly noisy training process due to the noisy synthetic dataset about the boundary; this phenomenon can be observed for both the digital and analog approaches. Due to these outliers, we only observe convergence in the time (iteration)-averaged curves because even at convergence, updates based on outliers or some incorrectly labelled points can result in large swings in the cost function. These large swings appear roughly correlated between the simulated and measured training curves. Despite these swings and the phase shift gradient errors shown in Fig. 2(e), our results of 96% model test accuracy and 93% model train accuracy indicate successful training as shown in Fig. 2(d).

We then train on a moons dataset, where we apply the same procedure to achieve a model train accuracy of 87% and model test accuracy of 94%, which suggests that training occurs but there is room for improvement, as shown in green in Fig. 3(f). Upon further investigation, we find that if we use the ground truth phase for the phase measurement but keep the amplitude measurements, we reduce the phase shift gradient error by roughly an order of magnitude on average, as shown in Fig. 3(h). This results in the successfully trained classification of Fig. 3(g) and the red curve in Fig. 3(f), which shows excellent correlation with the black digital training curve. When using the corrected ground truth phase measurement, we achieve a model train accuracy of 95% and model test accuracy of 97% for the full backpropagation demonstration based on measured phase (stopped early at 1000 iterations), an improvement that underscores the importance of accurate phase measurement for improved training efficiency.

DISCUSSION AND OUTLOOK

In this paper, we have laid the foundation for the analysis of our new in situ backpropagation proposal and its tolerance to gradient errors for the design of practically useful photonic mesh accelerators. Our proof-of-principle experiments suggest that even in the presence of such gradient error, measuring gradients and training photonic neural networks with analog backpropagation updates is efficient and feasible.

Although there exist many approaches for training photonic neural networks, our demonstration and energy calculations (Appendix) suggest that in situ backpropagation is the most practical and efficient approach for training deep multilayer hybrid photonic neural networks. Our hybrid approach to training optically accelerates the most computationally intensive operations (both in energy and in time complexity), specifically O(N²) matrix-vector products and matrix gradient computations (backward pass). On the other hand, all other O(N) computations such as nonlinearities and their derivatives are implemented on the computer directly, which is reasonable because O(N) time is needed to modulate and measure optical inputs and outputs anyway. Other techniques such as population-based methods [29], direct feedback alignment [30, 31], and perturbative approaches require fewer components to implement but are ultimately less efficient for training deep neural networks compared to backpropagation.

Our main finding is that gradient accuracy plays an important role in reaching optimal results during training. As we find in Fig. 3, more accurate gradients result in training convergence speeds and oscillations comparable to digital calculations of the gradients updated over the same training example sequence. This accuracy is vital for in situ backpropagation to be a viable competitor to existing purely digital training schemes; in particular, even if individual updates are faster to compute, high error would result in longer training times that mitigate that benefit. In the Appendix, we frame this error scaling in terms of a larger-scale PNN simulation on the MNIST dataset, as originally explored in Ref. [32], where we consider errors in gradient measurement due to optical I/O errors and photodetector noise at the global monitoring taps.

Our findings ultimately have wide-ranging implications because backpropagation is the most efficient and widely used neural network training algorithm in conventional machine learning hardware used today. Our analog approach for machine learning thus opens up a vast opportunity for energy-efficient artificial intelligence applications using photonic hardware. We additionally provide seamless integration into current machine learning training protocols (e.g. autodifferentiation frameworks such as JAX [27] and TensorFlow [33]). A particularly impactful opportunity is in data center machine learning, where optical signals already store data that can be fed into PNNs for inference and training tasks. In such settings, our demonstration presents a key new opportunity for both inference and training of hybrid PNNs to dramatically reduce carbon footprint and counter the exponentially increasing costs of AI computation.
ACKNOWLEDGEMENTS

We would like to acknowledge Advanced Micro Foundry (AMF) in Singapore for their help in fabricating and characterizing the photonic circuit for our demonstration and Silitronics for their help in packaging our chip for our demonstration. We would also like to acknowledge funding from Air Force Office of Scientific Research (AFOSR) grants FA9550-17-1-0002, in collaboration with UT Austin, and FA9550-18-1-0186, through which we share a close collaboration with UC Davis under Dr. Ben Yoo. Thanks also to Payton Broaddus for helping with wafer dicing, Simon Lorenzo for help in fiber splicing the fiber switch for bidirectional operation, Nagaraja Pai for advice on electrical and thermal control packaging, and finally Carsten Langrock and Karel Urbanek for their help in building our movable optical breadboard.

DATA AND SOFTWARE

All software and data for running the simulations and experiments are available through Zenodo [34] and Github through the Phox framework, including our experimental code via Phox [26], simulation code via Simphox [35], and circuit design code via Dphox [36].

CONTRIBUTIONS

SP taped out the photonic integrated circuit and ran all experiments with input from ZS, TH, TP, BB, NA, MM, OS, SF, DM. SP and ZS wrote code to control the experimental device. TP designed the custom PCB with input from SP. SP wrote the manuscript with input from all coauthors. All coauthors contributed to discussions of the protocol and results.

CONFLICTS OF INTEREST

SP, ZS, TH, IW, MM, SF, OS, DM have filed a patent for the analog backpropagation update protocol discussed in this work with Prov. Appl. No. 63/323743. The authors declare no other conflicts of interest.
METHODS

A. Circuit design and packaging

Our photonic integrated circuit is a 6 × 6 triangular photonic mesh consisting of a total of 15 MZIs fabricated at Advanced Micro Foundry (AMF) in Singapore and designed using DPhox [36], our custom automated photonic design library in Python. Each of the MZIs in the mesh is controlled using programmable phase shifters in the form of 80 µm × 2 µm titanium nitride heaters with 10.5 ohm/sq sheet resistance, surrounded by deep trenches that are 80 µm × 10 µm and a total of 7 µm away from the waveguide; these use resistive heating to control the interference of light propagating in the chip. The MZIs consist of two 50/50 directional couplers, with S-bends consisting of 30 µm radius arc turns and 40 µm long interaction lengths with a 300 nm gap. Next to each of the phase shifters is a bidirectional grating tap monitor, a directional coupler tap that couples 3% of the light propagating either forward or backward through the attached waveguide and feeds that light to a grating to be imaged on a camera focused on the grating. Traces for one terminal of each phase shifter are routed to separate individual pads on the edge of the chip, and the ground connections across all phase shifters in a column of MZIs are shared and connected to a single ground pad. The trace widths need to be thick enough to handle high thermal currents, so we use 15 µm wide traces, and 15·N_wire µm wide traces when multiple connections are connected to a shared ground contact.

The photonic chip is attached using silver paint to a 1.5 mm thick copper shim and a custom Advanced Circuits PCB designed in KiCad consisting of ENIG-coated metal traces to interface the phase shifters with an NI PCIe-6739 controller for setting programmable phase shifts throughout the device. Our PCB is wirebonded to the chip using two-tier wirebonding by Silitronics Solutions, made possible by fanout to NI SCB-68 connectors that interface directly to our PCIe-6739 system. The input optical source is an Agilent 81606A tunable laser with a tunable range of 1460 nm to 1580 nm. The laser light is coupled into a single-mode fiber and optically interfaced to the chip using W2 Optronics 127 micron pitch fiber array interposers at the left and right sides of the mesh, with a mirror facet designed to couple optical signals at 10 degrees from the normal, as we only need to couple into a single grating coupler for each fiber array coupler. Optical stray reflections from light not coupled into the chip generally interfere with grating tap signals, forming extra streaks on the camera; these stray reflections are blocked using pieces of paper carefully placed above the fiber arrays that act as lightweight removable stray light blockers.

For thermal stability, this chip-PCB assembly is thermally connected to a thermoelectric cooler (TEC). This thermal connection is made possible by metal vias connecting rectangular ENIG-coated copper patches on the top of the PCB to the bottom of the PCB, with thermal paste between an aluminum heat sink mount and the bottom rectangular metal patch. For feedback control, a thermistor placed near the chip and the TEC under the chip are attached to a TEC controller unit, allowing stable chip temperature (kept at 30 °C) for training.
B. Optical rig design

Our optical rig consists of an Ethernet-connected Xenics Bobcat 640 IR camera and microscope assembly mounted on an XY stage, along with six-axis stages for free-space fiber alignment. The IR camera and microscope image individual grating taps throughout the photonic integrated circuit (PIC) and are responsible for all measurement on the chip (both optical I/O and optical gradient monitoring).

The microscope uses an ∞-corrected Mitutoyo IR 10x objective and a 40 cm tube lens leading to a dichroic connected to visible and IR optical paths for simultaneous visible and infrared imaging. The optical rig is also outfitted with additional paths for LEDs to illuminate the actual chip features. This allows us to find the optimal focus for the grating spots, an image of which is shown in Fig. 2(a). In order to measure intensities directly using the IR camera, the Bobcat camera "Raw" mode is turned on and autogain features are turned off. The integration time is set to 1 millisecond, and the input laser power is set to 3 mW; note that higher integration times are required for lower input laser powers. We take an initial reference image to get a baseline; then, to measure the spot intensities or powers, we sum up the pixel values that "fill" the appropriate grating taps throughout the device. The triangular mesh circuit is constructed such that the grating taps lie along columns of devices, which means the optical rig images a 6 × 19 array of spots. The infrared path has roughly a 700 × 600 µm field of view, allowing simultaneous measurement of 6 × 3 grating spots on the chip (MZIs are 625 µm long in total given roughly 165 µm long directional couplers), which necessitates an XY translation stage to image the various groups of spots on the chip.

The speed of backpropagation is limited by the mechanics of the XY stage required to image spots throughout the chip, so our demonstration training experiments took up to 31 hours of real time to run, limited primarily by the wait time for the stage to settle on various groups of spots on the chip. Assuming T iterations, the stage needs to move a total of 15T times (5 for each of the three in situ backpropagation steps to be able to image all of the spots). For 1000 iterations, the stage needs to move a total of 15000 times, which necessitates automation of the stage for our proof-of-concept demonstration. In a final commercial implementation, the grating taps would be replaced by integrated photodetectors; there would in principle be no separate optical rig system in a fully packaged hybrid digital-analog photonic circuit.

C. Forward inference operation

Forward inference proceeds as follows for layer ℓ (see Fig. 1(a) in the main text), where each step is O(N); a schematic code rendering of these steps follows below:

1. Compute the sets of phase shifter settings θ_X^(ℓ), φ_X^(ℓ) for the Generator to give the desired vector x^(ℓ) of complex input amplitudes for the Matrix unit in layer ℓ.

2. Set these as the actual phase shifts in the Generator phase shifters using calibration curves for v_θ(θ), v_φ(φ), and shine light into the Generator circuit to create the corresponding actual vector of optical input amplitudes for the Matrix unit.

3. After the propagation of light through the Matrix unit, the system has optically evaluated the vector of complex optical output amplitudes y^(ℓ) = U^(ℓ)x^(ℓ). Now self-configure [37] the output Analyzer circuit to give all the output power in the "top" output waveguide, and note the corresponding sets of voltages v_φ and v_θ now applied to each phase shifter in the Analyzer circuit.

4. Deduce the phase shifts θ_Y^(ℓ), φ_Y^(ℓ) in the Analyzer circuit using calibration curves for θ(v_θ), φ(v_φ), and hence compute the corresponding measured output amplitudes y^(ℓ).

5. Compute x^(ℓ+1) = f^(ℓ)(y^(ℓ)) on the computer.

The first four steps are also used in cases where light is sent backwards (see Fig. 1(g, h)), switching the roles of the input and output vector units from Generator to Analyzer and vice versa. Pseudocode for the forward operation of the PNN is provided in the Appendix, and code for the actual implementation is provided in our photonic simulation and control framework Phox [26].
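As referenced above, a schematic rendering of steps 1-5 in Python follows; every function name here is a placeholder for the corresponding calibration, self-configuration, or measurement routine rather than the actual Phox interface.

    def forward_layer(x_l, layer):
        # Step 1: phase shifter settings that generate the desired input x^(l).
        theta_x, phi_x = generator_settings(x_l)
        # Step 2: apply them via the calibration curves v_theta(.), v_phi(.).
        apply_phase_shifts(layer.generator, theta_x, phi_x)
        # Step 3: light propagates through the Matrix unit (y = U x optically);
        # self-configure the Analyzer to route all power to the top waveguide.
        v_theta, v_phi = self_configure(layer.analyzer)
        # Step 4: invert the calibration curves to deduce the measured y^(l).
        y_l = amplitudes_from_phases(theta_from_v(v_theta), phi_from_v(v_phi))
        # Step 5: apply the nonlinearity digitally to obtain x^(l+1).
        return nonlinearity(y_l)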
FIG. 4. Microscope images of the photonic mesh used in this paper. (a) Grating monitor closeup showing the bidirectional grating tap we use to perform the backpropagation protocol. (b) Metal trace, via, and TiN (titanium nitride) phase shifter colocated with the grating monitor, used to programmatically control interference by changing the optical phase in the mesh. Deep trenches are used for thermal isolation. Here, we show an overlay of the phase shifter focal plane on the top metal trace and via used to connect each phase shifter to the pads. (c) A large-scale view of a section of the chip. (d) Fiber array inputs to the photonic mesh are spaced 127 µm apart and are used for interfacing fiber arrays.

D. Backpropagation protocol

For each training example (x, z), we calculate gradient updates to phase shifts η using a "backward pass" corresponding to the inference "forward pass" for that data. More formally, we define a "vector-Jacobian product" or VJP for each function U^(ℓ), f^(ℓ) to algorithmically compute the gradient of our cost function L. As shown in Fig. 1(c), each transformation from the forward step is mapped to a VJP in the corresponding backward step (defined in decreasing order from layer L to 1), which depends on intermediate function evaluations in both forward and backward passes. The in situ backpropagation step implements the costly intermediate VJP evaluations (i.e. matrix multiplications) directly in the analog optical domain. We define the VJP for the nonlinearity f^(ℓ)(y^(ℓ)) as f_vjp^(ℓ)(y^(ℓ), x_aj^(ℓ+1)):

    y_aj^(ℓ) = f_vjp^(ℓ)(y^(ℓ), x_aj^(ℓ+1))
    x_aj^(ℓ) = (U^(ℓ))ᵀ y_aj^(ℓ)                              (4)

Finally, we synthesize Eqs. 1 and 4 and the results of Ref. 5 to get the backpropagation update based on
applying the chain rule, evaluating the cost function at a random training example (x_t, z_t) at iteration t:

    ∂L/∂η^(ℓ) = (∂y^(ℓ)/∂η^(ℓ)) · (∂x^(ℓ+1)/∂y^(ℓ)) ⋯ (∂ẑ/∂y^(L)) (∂L/∂ẑ)|_(x_t, z_t)
              = (∂y^(ℓ)/∂η^(ℓ)) · x_aj^(ℓ)
              = (x^(ℓ))ᵀ (∂U^(ℓ)/∂η^(ℓ)) x_aj^(ℓ) = −ℑ(x_η^(ℓ) x_η,aj^(ℓ)),       (5)

    η_t := η_(t−1) − α (∂L/∂η)|_(x_t, z_t),

where the chained Jacobian product in the first line evaluates to the N × 1 adjoint vector x_aj^(ℓ), with the last two factors (∂ẑ/∂y^(L))(∂L/∂ẑ) giving y_aj^(L); ∂y^(ℓ)/∂η^(ℓ) is the D_ℓ × N Jacobian whose action is the "optical VJP"; and the final expression is the D_ℓ × 1 in situ gradient. Here, x_η represents a vector of intermediate fields at the input of the phase shifters η^(ℓ) in layer ℓ at iteration t, D_ℓ is the number of phase shifts parametrizing the device at layer ℓ, ℑ refers to the imaginary part, and α is the learning rate. The main idea is that if enough training examples are supplied (i.e., after T updates), the device will automatically discover or "learn" a function that performs the task we desire.
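The last line of Eq. 5 reduces each phase-shift gradient to an elementwise product of monitored forward and adjoint fields. A minimal sketch, with x_eta and x_eta_adj as stand-ins for those measured complex field vectors:

    import jax.numpy as jnp

    def phase_shift_gradients(x_eta, x_eta_adj):
        # Eq. 5: dL/d(eta) = -Im(x_eta * x_eta_adj), one entry per phase shifter.
        return -jnp.imag(x_eta * x_eta_adj)

    def sgd_step(eta, x_eta, x_eta_adj, alpha=0.01):
        # Plain stochastic gradient descent; Adam [25] replaces this in practice.
        return eta - alpha * phase_shift_gradients(x_eta, x_eta_adj)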
Based on Eq. 4, the steps of our optical VJP, as depicted in Fig. 1(c), are as follows, in order from layer ℓ = L to 1 of the photonic neural network (code sketches for the nonlinearity VJP and the digital subtraction measurement follow at the end of this subsection):

1. Compute the "adjoint" vector y_aj^(ℓ) = f_vjp^(ℓ)(y^(ℓ), x_aj^(ℓ+1)). For the last layer, set y_aj to be the error signal y_aj^(L) = (∂L/∂x^(L+1))*.

2. Perform the backward "adjoint" pass x_aj^(ℓ) = Uᵀ y_aj^(ℓ) by sending light backwards through layer ℓ of the mesh and measuring the resulting vector of amplitudes x_aj^(ℓ) emerging backwards from the mesh.

3. Send the vector of optical amplitudes x^(ℓ) − i(x_aj^(ℓ))* forward into layer ℓ of the mesh.

4. Measure the gradient ∂L/∂θ for any phase shifter θ:

   (a) If using the digital subtraction measurement [5], measure the sum power p_θ,sum and subtract p_θ and p_θ,aj (the monitored powers from the forward and backward steps) to get the gradient.

   (b) If using the analog gradient measurement, sweep the adjoint global phase ζ (giving x^(ℓ) − i(x_aj^(ℓ))* e^(iζ)) from 0 to 2π repeatedly (e.g., using a sawtooth signal). Measure d_θ(ζ), the AC component of the measured power p_θ,sum(ζ) through phase shifter θ. The gradient is d_θ(0)/2.

5. Update η using the measured gradients ∂L/∂η.

Note that Step 1 can be simplified to y_aj^(ℓ) = (f^(ℓ))′(y^(ℓ)) x_aj^(ℓ+1) in the case that f^(ℓ) is holomorphic, or complex-differentiable. In this paper, for the neural network parametrized by Eq. 3, we specifically care about the nonlinearity f^(ℓ)(y) = |y|, which has the associated VJP:

    f_vjp^(ℓ)(y, x_aj) = (y/|y|) · ℜ(x_aj).                    (6)

The other VJP, required to calculate ∂L/∂y^(L) from the final softmax cross entropy and power measurement at the end of the network, is handled by our automatic differentiation framework JAX [27, 28].

Steps 3 and 4 can be parallelized over all layers (i.e., parameters of the network) for both the digital and analog update schemes. Pseudocode for the overall protocol (using digital subtraction), along with an energy-efficient proposal for analog gradient computation, is discussed in the Appendix. The final step can be achieved using "stochastic gradient descent" (which independently updates the loss function based on randomly chosen training examples) or adaptive learning, where the update vector depends both on past updates and the new gradient. A successful and commonly used implementation of the latter, which we use in this paper, is the Adam update [25].
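Two of the pieces above are short enough to state directly: the |y| VJP of Eq. 6 and the digital subtraction measurement of step 4(a). The power arguments are stand-ins for the monitored grating-tap readings.

    import jax.numpy as jnp

    def abs_vjp(y, x_adj):
        # Eq. 6: VJP of f(y) = |y| evaluated at adjoint input x_adj.
        return (y / jnp.abs(y)) * jnp.real(x_adj)

    def digital_subtraction_gradient(p_sum, p_fwd, p_adj):
        # Step 4(a): the interference (gradient) term is the sum-field power
        # minus the separately monitored forward and adjoint powers.
        return p_sum - p_fwd - p_adj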
Appendix A: Energy and latency analysis

In this section, we justify why the analog in situ update discussed in the main text may be chosen over the digital update proposed in Ref. [5] and used in our main backpropagation training demonstration.

In our hybrid scheme, most of the computation is concentrated in sending in N input modes using modulators (each taking energy E_inp) and digital-analog converters, and measuring the N output mode powers and amplitudes using photodetectors and analog-digital converters (each taking energy E_meas). Therefore, the various approaches for a given matrix-vector product cost roughly N · (E_inp + E_meas), equivalent to the cost of setting up the input/output behavior for the photonic mesh. A digital electronic computer, on the other hand, requires N² sequential operations (i.e., multiply-and-accumulate operations that are not parallel) to compute any given matrix-vector product.

Beyond inference tasks, the additional backward and sum steps required for in situ backpropagation add further energy and latency contributions. The analog update explored in Fig. 2 requires N² optoelectronic units for energy-efficient operation, each of which is outfitted with a photodetector, a lock-in amplifier, and a high-pass filter consuming energy E_grad to measure d_θ(0), for a total energy consumption of N²E_grad + N · (3E_inp + 2E_meas) for all three steps of the full backpropagation measurement. The 2E_meas comes from the output measurements in the first two steps, and the E_grad comes from the analog gradient measurement in the final step.

In comparison, the digital subtraction described in Fig. 1 can be useful for adaptive updates that require storing information about previous gradients (such as Adam [25], which we exploit for training), but there are a couple of drawbacks. First, a digital update is less memory efficient, since N² elements need to be stored using analog memory to be able to run the "digital subtraction" computation in backpropagation. Additionally, the total energy consumption becomes 3N²E_grad,digital + N · (3E_inp + 2E_meas), with E_grad,digital ≫ E_grad,analog due to the large number of analog-digital conversions required to implement the digital subtraction calculations. Analog-digital conversions are among the most energy- and time-consuming operations in a hybrid photonic device; when operating at GHz speeds, the best individual comparators generally require up to 40 fJ [38, 39] (versus around 1 fJ/bit for input modulators [40]) and should therefore ideally be reserved for optical input/output in the photonic meshes.
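For reference, the two energy tallies above can be written as a small calculator; the per-operation energies are symbolic inputs, not measured values.

    def analog_update_energy(N, E_grad, E_inp, E_meas):
        # N^2 lock-in gradient readouts plus optical I/O for all three passes.
        return N**2 * E_grad + N * (3 * E_inp + 2 * E_meas)

    def digital_subtraction_energy(N, E_grad_digital, E_inp, E_meas):
        # Three power maps per update, with E_grad_digital >> E_grad due to
        # the analog-digital conversions behind the digital subtraction.
        return 3 * N**2 * E_grad_digital + N * (3 * E_inp + 2 * E_meas)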
A final energy consideration is the phase shift modulation. These voltage-controlled modulators may be controlled by thermal actuation [41], microelectromechanical (MEMS) actuation [42], or phase-change materials such as barium titanate (BTO) [43]. Of these options, MEMS actuation is among the most promising because, unlike thermally actuated phase shifters, MEMS phase shifters cost no energy to maintain a given programmed state ("static energy"), dramatically improving the energy efficiency of operation compared to thermal phase shifters, which constantly dissipate large amounts of heat. Additionally, unlike phase-change materials, MEMS phase shifters use CMOS materials such as silicon or silicon nitride. Furthermore, such devices can be designed to operate linearly with voltage [44, 45], which ensures that the gradient update applied to the voltage is the same as that of the phase shift, without a calibration curve. This helps with gradient accuracy, as we discuss now.
Appendix B: Gradient accuracy

As shown in Fig. 3(f, h) and in Fig. 6(h), gradient accuracy can affect the optimization and can decrease as the optimization approaches convergence. As we find in the main text, accurate phase measurement plays an important role in measuring accurate gradients. This is true even when the nonlinearity (as in our case with absolute value) removes the need to measure phases in the inference step. Since in the main text we evaluate the model accuracy (device-trained parameters evaluated on a theoretical computer model), we also show some evidence that the device and model classifications match quite well in Fig. 6(i, j).

One popular type of update is based on "minibatch gradient descent," a machine learning technique that calculates gradients based on multiple training examples. This would dramatically smooth out the noisy training curves shown in Fig. 6(a-f), as the resulting averaged gradients would actually be much smaller in magnitude. However, as shown in Fig. 6(g), we find that the normalized error of a minibatch gradient is generally significantly higher than that of the gradient for a single training example, which can have negative implications for training. This is because the variance of the gradient error remains the same when averaged over many examples, but the contribution of the gradient error is much larger relative to the averaged gradient of a batch. This phenomenon might be problem-dependent; if the average gradient for the minibatch is not closer to zero than the gradient for individual training examples, this error may not be an issue. This underscores the importance of accurate gradient measurement, which can be improved using more accurate output phase measurements; our output phase measurement alone results in an order-of-magnitude increase in gradient error. A toy simulation illustrating this effect appears at the end of this appendix.

Finally, a linear relationship between phase and voltage can help to improve gradient update accuracy without requiring nontrivial scaling complexity in the hardware. In other words, we ensure ∂L/∂v_θ = (∂θ/∂v_θ) · (∂L/∂θ) with constant ∂θ/∂v_θ, which simplifies the required analog circuitry. The ∂θ/∂v_θ term is calculated using calibration curves, and this assumption is more or less valid in our case, as we operate the phase shifters in the linear regime as shown in Fig. 8(e).
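The minibatch effect referenced above can be reproduced in a toy simulation: if the measurement error is systematic (it does not average away) while the true per-example gradients partially cancel, the normalized error of the averaged gradient grows. The distributions below are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(0)
    D, B = 15, 64                              # phase shifters, batch size
    true_grads = rng.normal(0.0, 1.0, (B, D))  # near-cancelling per-example gradients
    bias = rng.normal(0.0, 0.1, D)             # systematic measurement error

    def normalized_error(measured, true):
        return np.linalg.norm(measured - true) / np.linalg.norm(true)

    print(normalized_error(true_grads[0] + bias, true_grads[0]))            # small
    print(normalized_error(true_grads.mean(0) + bias, true_grads.mean(0)))  # larger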
Appendix C: Usage in machine learning software

Backpropagation is also known as reverse-mode automatic differentiation (AD) because any program that uses backpropagation registers a "backward" gradient function for any forward function; this is how AD Python engines such as JAX [27], TensorFlow 2 [33], and PyTorch operate. We demonstrate that our protocol can be easily coupled with an existing automatic differentiation framework (JAX and Haiku [27, 28]), which can register the backward step and adaptive update based on Adam [25] for all unitary matrix operations as an analog in situ backpropagation gradient calculation rather than an expensive digital operation. In this way, the digital side of our hybrid PNN never needs to store or have any knowledge of the parameters in the photonic mesh architecture. However, in cases where adaptive gradient updates such as Adam are used, aggregated knowledge based on past gradient updates needs to be stored; non-volatile memory may be required to energy-efficiently store these additional parameters.
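A minimal sketch of such a registration with jax.custom_vjp is below; photonic_matvec and photonic_adjoint_pass are hypothetical stand-ins for the chip interface, and no digital copy of the mesh parameters is ever materialized.

    import jax

    @jax.custom_vjp
    def mesh_layer(x):
        return photonic_matvec(x)        # forward pass measured on the chip

    def mesh_layer_fwd(x):
        return photonic_matvec(x), None  # nothing digital stored for backward

    def mesh_layer_bwd(_, y_adj):
        # Backward "adjoint" pass: light is sent backwards through the mesh.
        return (photonic_adjoint_pass(y_adj),)

    mesh_layer.defvjp(mesh_layer_fwd, mesh_layer_bwd)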
Appendix D: Comparison with other training algorithms

Backpropagation is the most widely used and efficient known algorithm for training multilayer neural network models, though it is far from the only method for calculating gradient-based updates.