Content Authentication for Neural Imaging Pipelines: End-to-end Optimization of Photo Provenance in Complex Distribution Channels

Pawel Korus (New York University, and AGH University of Science and Technology; http://kt.agh.edu.pl/~korus)
Nasir Memon (New York University)

arXiv:1812.01516v1 [cs.CV] 4 Dec 2018

Abstract

Forensic analysis of digital photo provenance relies on intrinsic traces left in the photograph at the time of its acquisition. Such analysis becomes unreliable after heavy post-processing, such as down-sampling and re-compression applied upon distribution in the Web. This paper explores end-to-end optimization of the entire image acquisition and distribution workflow to facilitate reliable forensic analysis at the end of the distribution channel. We demonstrate that neural imaging pipelines can be trained to replace the internals of digital cameras, and jointly optimized for high-fidelity photo development and reliable provenance analysis. In our experiments, the proposed approach increased image manipulation detection accuracy from 55% to nearly 98%. The findings encourage further research towards building more reliable imaging pipelines with explicit provenance-guaranteeing properties.

1. Introduction

Ensuring the integrity of digital images is one of the most challenging and important problems in multimedia communications. Photographs and videos are commonly used for documentation of important events, and as such, require efficient and reliable authentication protocols. Our current media acquisition and distribution workflows are built with entertainment in mind, and not only fail to provide explicit security features, but actually work against them. Image compression standards exploit the heavy redundancy of visual signals to reduce communication payload, but optimize for human perception alone. Security extensions of popular standards lack adoption [19].

Two general approaches to assurance and verification of digital image integrity include [23, 29, 31]: (1) pro-active protection methods based on digital signatures or watermarking; (2) passive forensic analysis, which exploits inherent statistical properties resulting from the photo acquisition pipeline. While the former provides superior performance and allows for advanced protection features (like precise tampering localization [37], or reconstruction of tampered content [24]), it failed to gain widespread adoption due to the necessity to generate protected versions of the photographs, and the lack of incentives for camera vendors to modify camera designs to integrate such features [3].

Passive forensics, on the other hand, relies on our knowledge of the photo acquisition pipeline, and the statistical artifacts introduced by its successive steps. While this approach is well suited for potentially analyzing any digital photograph, it often falls short due to the complex nature of image post-processing and distribution channels. Digital images are not only heavily compressed, but also enhanced or even manipulated before, during or after dissemination. Popular images have many online incarnations, and tracing their distribution and evolution has spawned a new field of image phylogeny [12, 13], which relies on visual differences between multiple images to infer their relationships and editing history. However, phylogeny does not provide any tools to reason about the authenticity or history of individual images. Hence, reliable authentication of real-world online images remains intractable [38].

At the moment, forensic analysis often yields useful results in near-acquisition scenarios. Analysis of native images straight from the camera is more reliable, and even seemingly benign implementation details - like the rounding operators used in the camera's image signal processor [1] - can provide useful clues. Most forensic traces quickly become unreliable as the image undergoes further post-processing. One of the most reliable tools at our disposal involves analysis of the imaging sensor's artifacts (the photo response non-uniformity pattern), which can be used both for source attribution and content authentication problems [9].

[Figure 1: block diagram of the modeled channel. Acquisition: RAW measurements → NIP → developed image, trained against the desired output image with an L2 loss. Post-processing / editing: post-processing branches A … Z. Distribution: MUX → downscaling → JPEG compression. Analysis: FAN trained with a cross-entropy loss over the classification labels (pristine, post-proc. A - Z).]
Figure 1. Optimization of the image acquisition and distribution channel to facilitate photo provenance analysis. The neural imaging pipeline (NIP) is trained to develop images that resemble the desired target images, but also retain meaningful forensic clues at the end of complex distribution channels.

In the near future, the rapid progress in computational imaging will challenge digital image forensics even in near-acquisition authentication. In the pursuit of better image quality and convenience, digital cameras and smartphones employ sophisticated post-processing directly in the camera, and soon few photographs will resemble the original images captured by the sensor(s). Adoption of machine learning has recently challenged many long-standing limitations of digital photography, including: (1) high-quality low-light photography [8]; (2) single-shot HDR with overexposed content recovery [14]; (3) practical high-quality digital zoom from multiple shots [36]; (4) quality enhancement of smartphone-captured images with weak supervision from DSLR photos [17].

These remarkable results demonstrate tangible benefits of replacing the entire acquisition pipeline with neural networks. As a result, it will be necessary to investigate the impact of the emerging neural imaging pipelines on existing forensic protocols. While important, such evaluation can be seen rather as damage assessment and control, and not as a solution for the future. We believe it is imperative to consider novel possibilities for the security-oriented design of our cameras and multimedia dissemination channels.

In this paper, we propose to optimize neural imaging pipelines to improve photo provenance in complex distribution channels. We exploit end-to-end optimization of the entire photo acquisition and distribution channel to ensure that reliable forensic decisions can be made even after complex post-processing, where classical forensics fails (Fig. 1). We believe that the imminent revolution in camera design creates a unique opportunity to address some of the long-standing limitations of the current technology. While the adoption of digital watermarking in image authentication was limited by the necessity to modify camera hardware, our approach exploits the flexibility of neural networks to learn relevant integrity-preserving features within the expressive power of the model. We believe that with a solid security-oriented understanding of neural imaging pipelines, and with the rare opportunity of replacing the well-established and security-oblivious pipeline, we can significantly improve digital image authentication capabilities.

We aim to inspire discussion about novel camera designs that could improve photo provenance analysis capabilities. We demonstrate that it is possible to optimize an imaging pipeline to significantly improve detection of photo manipulation at the end of a complex real-world distribution channel, where state-of-the-art deep-learning techniques fail. The main contributions of our work include:

1. The first end-to-end optimization of the imaging pipeline with explicit photo provenance objectives;
2. The first security-oriented discussion of neural imaging pipelines and the inherent trade-offs;
3. Significant improvement of forensic analysis performance in challenging, heavily post-processed conditions;
4. A neural model of the entire photo acquisition and distribution channel with a fully differentiable approximation of the JPEG codec.

To facilitate further research in this direction, and to enable reproduction of our results, we will publish our neural imaging toolbox at https://github.com/ upon paper acceptance.
2. Related Work

Trends in Pipeline Design. Learning individual steps of the imaging pipeline (e.g., demosaicing) has a long history [20] but regained momentum in recent years thanks to the adoption of novel deep learning techniques. Naturally, the research focused on the most difficult operations, i.e., demosaicing [16, 22, 33, 34] and denoising [5, 25, 39]. The newly developed techniques delivered not only improved performance, but also additional features. Gharbi et al. proposed a convolutional neural network (CNN) trained to perform both demosaicing and denoising [16]. A recent work by Syu et al. proposes to exploit CNNs for joint optimization of the color filter array and a corresponding demosaicing filter [33].

Optimization of digital camera design can go even further. The recently proposed L3 model by Jiang et al. replaces the entire imaging pipeline with a large collection of local linear filters [18]. The L3 model reproduces the entire photo development process, and aims to facilitate research and development efforts for non-standard camera designs. In the original paper, the model was used for learning imaging pipelines for RGBW (red-green-blue-white) and RGB-NIR (red-green-blue-near-infra-red) color filter arrays.

Replacing the entire imaging pipeline with a modern CNN can also overcome long-standing limitations of digital photography. Chen et al. trained a UNet model [30] to develop high-quality photographs in low-light conditions [8] by exposing it to paired examples of images taken with short and long exposure. The network learned to develop high-quality, well-exposed color photographs from underexposed raw input, and yielded better performance than traditional image post-processing based on brightness adjustment and denoising. Eilertsen et al. also trained a UNet model to develop high-dynamic-range images from a single shot [14]. The network not only learned to correctly perform tone mapping, but was also able to recover overexposed highlights. This significantly simplifies HDR photography by eliminating the need for bracketing and dealing with ghosting artifacts.

Trends in Forensics. Current research in forensic image analysis focuses on two main directions: (1) learning deep features relevant to low-level forensic analysis for problems like manipulation detection [2, 40], camera model identification [10], or detection of artificially generated content [26]; (2) adoption of high-level vision to automate manual analysis that exposes physical inconsistencies, such as reflections [32, 35], or shadows [21]. To the best of our knowledge, there are currently no efforts to either assess the consequences of the emerging neural imaging pipelines, or to exploit this opportunity to improve photo reliability.

3. End-to-end Optimization of Photo Provenance Analysis

Digital image forensics relies on intrinsic statistical artifacts introduced to photographs at the time of their acquisition. Such traces are later used for reasoning about the source, authenticity and processing history of individual photographs. The main problem is that contemporary media distribution channels employ heavy compression and post-processing which destroy the traces and inhibit forensic analysis.

The core of the proposed approach is to model the entire acquisition and distribution channel, and optimize the neural imaging pipeline (NIP) to facilitate photo provenance analysis after content distribution (Fig. 1). The analysis is performed by a forensic analysis network (FAN) which makes a decision about the authenticity / processing history of the analyzed photograph. In the presented example, the model is trained to perform manipulation detection, i.e., it aims to classify input images as either coming straight from the camera, or as being affected by a certain class of post-processing. The distribution channel in the middle mimics the behavior of modern photo sharing services and social networks, which habitually down-sample and re-compress the photographs. As will be demonstrated later, forensic analysis in such conditions is severely inhibited.

The parameters of the NIP are updated to guarantee both a faithful representation of a desired color photograph (L2 loss), and accurate decisions of forensic analysis at the end of the distribution channel (cross-entropy loss). Hence, the parameters of the NIP and FAN models are chosen as:

$\theta_{\mathrm{nip}}^{*} = \operatorname*{argmin}_{\theta_{\mathrm{nip}}} \sum_{n} \Big( \big\lVert y_n - \mathrm{nip}(x_n \mid \theta_{\mathrm{nip}}) \big\rVert_2 \;-\; \sum_{c} \log \mathrm{fan}_c\big( d_c(\mathrm{nip}(x_n \mid \theta_{\mathrm{nip}})) \mid \theta_{\mathrm{fan}} \big) \Big) \qquad (1)$

$\theta_{\mathrm{fan}}^{*} = \operatorname*{argmin}_{\theta_{\mathrm{fan}}} \; -\sum_{n} \sum_{c} \log \mathrm{fan}_c\big( d_c(\mathrm{nip}(x_n \mid \theta_{\mathrm{nip}})) \mid \theta_{\mathrm{fan}} \big)$

where: $\theta_{\mathrm{nip/fan}}$ are the parameters of the NIP and FAN networks, respectively; $x_n$ are the raw sensor measurements for the n-th example patch; $\mathrm{nip}(x_n)$ is the color RGB image developed by the NIP from $x_n$; $d_c()$ denotes a color image patch processed by manipulation c; and $\mathrm{fan}_c()$ is the probability that an image belongs to the c-th manipulation class, as estimated by the FAN model.
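For illustration, Eq. (1) can be rendered as a short TensorFlow sketch. This is a simplified rendering, not the released training code; the nip, fan, and channel callables and the manipulations list are hypothetical stand-ins for the components described in the following subsections:

```python
import tensorflow as tf

def joint_losses(raw, targets, nip, fan, channel, manipulations):
    """Sketch of the two objectives in Eq. (1) for one batch.

    raw:           raw sensor measurements x_n
    targets:       desired developed images y_n
    manipulations: list of callables d_c (identity first = 'pristine')
    channel:       differentiable distribution channel (downscale + JPEG)
    """
    developed = nip(raw)                                        # nip(x_n | theta_nip)
    fidelity = tf.reduce_mean(tf.square(developed - targets))   # L2 term

    cross_entropy = 0.0
    for label, d_c in enumerate(manipulations):
        # The FAN sees manipulated images only after the distribution channel.
        logits = fan(channel(d_c(developed)))
        labels = tf.fill([tf.shape(raw)[0]], label)             # true class = applied d_c
        cross_entropy += tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=labels, logits=logits))
    return fidelity, cross_entropy
```

The NIP is driven by the sum of both terms, while the FAN minimizes the cross-entropy alone.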
[Figure 2: (top) the standard imaging pipeline: RAW (h, w, 1) → LUT mapping → normalization & calibration → white balancing → demosaicing → denoising → color space conversion (sRGB) → gamma correction → RGB (h, w, 3); (bottom) the NIP model: packed RAW (h/2, w/2, 4) → NIP → RGB (h, w, 3).]
Figure 2. Adoption of a neural imaging pipeline to develop raw sensor measurements into color RGB images: (top) the
standard imaging pipeline; (bottom) adoption of the NIP model.

3.1. The Neural Imaging Pipeline

We replace the entire imaging pipeline with a CNN model which develops raw sensor measurements into color RGB images (Fig. 2). Before feeding the images to the network, we pre-process them by reversing the nonlinear value mapping according to the camera's LUT, subtracting black levels from the edge of the sensor, normalizing by sensor saturation values, and applying white balancing according to the shot settings. We also standardize the inputs by reshaping the tensors to have feature maps with successive measured color channels. This ensures a well-formed input of shape (h/2, w/2, 4) with values normalized to [0, 1]. All of the remaining steps of the pipeline are replaced by the NIP.
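As a concrete illustration of this packing step, a minimal numpy sketch follows. It assumes an RGGB mosaic; black level and saturation would come from camera metadata, and the LUT reversal and white balancing described above are omitted for brevity:

```python
import numpy as np

def pack_raw(bayer, black_level, saturation):
    """Pack a Bayer mosaic (h, w) into an (h/2, w/2, 4) tensor in [0, 1].

    Assumes an RGGB color filter pattern.
    """
    x = (bayer.astype(np.float32) - black_level) / (saturation - black_level)
    x = np.clip(x, 0.0, 1.0)
    return np.stack([x[0::2, 0::2],   # R
                     x[0::2, 1::2],   # G1
                     x[1::2, 0::2],   # G2
                     x[1::2, 1::2]],  # B
                    axis=-1)
```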
See Section 4.1 for details on the considered pipelines.

3.2. Approximation of JPEG Compression

To enable end-to-end optimization of the entire acquisition and distribution channel, we need to ensure that every processing step remains differentiable. In the considered scenario, the main problem is JPEG compression. We designed a JPEGNet model which approximates the operation of a standard JPEG codec, and expresses successive steps of the codec as matrix multiplications or convolution layers that can be implemented in TensorFlow (see supplementary materials for a detailed network definition):

• RGB to/from YCbCr color-space conversions are implemented as 1 × 1 convolutions.
• Isolation of 8 × 8 blocks for independent processing is implemented by a combination of space-to-depth and reshaping operations.
• Forward/backward 2D discrete cosine transforms are implemented by matrix multiplication according to $D x D^{T}$, where x denotes an 8 × 8 input, and D denotes the transformation matrix (see the sketch after this list).
• Division/multiplication of DCT coefficients by the corresponding quantization steps are implemented as element-wise operations with properly tiled and concatenated quantization matrices (for both the luminance and chrominance channels).
• The actual quantization is approximated by a continuous function ρ(x) (see details below).
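The DCT step referenced in the list above can be sketched in numpy; this builds the orthonormal DCT-II matrix, for which the inverse transform is simply the transpose (an illustration, not the exact JPEGNet code):

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II transform matrix D, so that Y = D @ X @ D.T."""
    k = np.arange(n)
    D = np.sqrt(2.0 / n) * np.cos(
        np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    D[0, :] = np.sqrt(1.0 / n)   # DC row uses the smaller normalization
    return D

D = dct_matrix()
block = np.random.rand(8, 8)
coeffs = D @ block @ D.T         # forward 2D DCT
restored = D.T @ coeffs @ D      # inverse transform (D is orthonormal)
assert np.allclose(block, restored)
```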
The key problem in making JPEG fully differentiable lies in the rounding of DCT coefficients. We considered two possible approximations. Firstly, we used a Taylor series expansion, which can be made arbitrarily accurate by considering more terms. Finally, we decided to use a smoother, and simpler, sinusoidal approximation obtained by matching the phase with the sawtooth function:

$\rho(x) = x - \frac{\sin(2\pi x)}{2\pi} \qquad (2)$

Both approximations are shown in Fig. 3a.

We validated our JPEGNet by comparing the produced images with a reference codec from libJPEG. The results are shown in Fig. 3b-d for the standard rounding operation, and the two approximations, respectively. We used 5 terms for the harmonic rounding. The developed module produces equivalent compression results with standard rounding, and a good approximation for its differentiable variants. Fig. 3e-h show a visual comparison of an example image patch, and its libJPEG- and JPEGNet-compressed counterparts. In our distribution channel, we used quality level 50.
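The sinusoidal approximation of Eq. (2), and the soft quantization built on top of it, fit in a few lines of TensorFlow (a minimal sketch of the idea):

```python
import numpy as np
import tensorflow as tf

def soft_round(x):
    """Differentiable rounding from Eq. (2): rho(x) = x - sin(2*pi*x) / (2*pi)."""
    two_pi = tf.constant(2.0 * np.pi, dtype=x.dtype)
    return x - tf.sin(two_pi * x) / two_pi

def soft_quantize(dct_coefficients, q_table):
    """Quantize/dequantize DCT coefficients with the continuous rounding stand-in."""
    return soft_round(dct_coefficients / q_table) * q_table
```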
[Figure 3: (a) plots of the rounding approximations (round, harmonic, sinus) on [-2, 2]; (b)-(d) scatter plots of PSNR for JPEGNet vs. PSNR for libJPEG [dB] across quality levels Q10-Q100; (e) uncompressed patch (96×96 px); (f) libJPEG(50), PSNR = 34.4 dB; (g) JPEGNet(50), PSNR = 35.5 dB; (h) JPEGNet(50), PSNR = 37.9 dB.]
Figure 3. Implementation of JPEG compression as a fully differentiable JPEGNet module: (a) continuous approximations
of the rounding function; (b)-(d) validation of the JPEGNet module against the standard libJPEG library with standard
rounding, and the harmonic and sinusoidal approximations; (e) an example image patch; (f) standard JPEG compression
with quality 50; (g)-(h) JPEGNet-compressed patches with the harmonic and sinusoidal approximations.

3.3. The Forensic Analysis Network

The forensic analysis network (FAN) is implemented as a CNN following the most recent recommendations on the construction of neural networks for forensic analysis [2]. Bayar and Stamm have proposed a new layer type which constrains the learned filters to be valid residual filters [2]. Adoption of the layer helps ignore visual content and facilitates the extraction of forensically relevant low-level features. In summary, our network operates on 128 × 128 × 3 patches in the RGB color space and includes (see supplementary materials for the full network definition):

• A constrained convolutional layer learning 5 × 5 residual filters, with no activation function.
• Four 5 × 5 convolutional layers with a doubling number of output feature maps (starting from 32). The layers use leaky ReLU activation and are followed by 2 × 2 max pooling.
• A 1 × 1 convolutional layer mapping 256 features into 256 features.
• A global average pooling layer reducing the number of features to 256.
• Two fully connected layers with 512 and 128 nodes, activated by leaky ReLU.
• A fully connected layer with N = 5 output nodes and softmax activation.

In total, the network has 1,341,990 parameters.
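The layer list above translates into a Keras skeleton along the following lines. This is a sketch: the number of constrained filters is illustrative, and the Bayar-Stamm residual constraint, typically enforced by re-projecting the first layer's kernels during training, is omitted:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_fan(n_classes=5):
    """FAN skeleton following the layer list above (constraint projection omitted)."""
    inputs = tf.keras.Input(shape=(128, 128, 3))
    # Constrained residual filters: 5x5 convolutions with no activation.
    x = layers.Conv2D(3, 5, padding='same', activation=None,
                      name='constrained_conv')(inputs)
    filters = 32
    for _ in range(4):                       # four 5x5 blocks, doubling width
        x = layers.Conv2D(filters, 5, padding='same')(x)
        x = layers.LeakyReLU()(x)
        x = layers.MaxPooling2D(2)(x)
        filters *= 2
    x = layers.Conv2D(256, 1)(x)             # 1x1 mapping of 256 -> 256 features
    x = layers.GlobalAveragePooling2D()(x)   # reduce to a 256-d feature vector
    x = layers.LeakyReLU()(layers.Dense(512)(x))
    x = layers.LeakyReLU()(layers.Dense(128)(x))
    outputs = layers.Dense(n_classes, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)
```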
4. Experimental Evaluation

We started our evaluation by using several NIP models to reproduce the output of a standard imaging pipeline (Sec. 4.1). Then, we used the FAN model to detect popular image manipulations (Sec. 4.2). Initially, we validated that the models work correctly by using them without a distribution channel (Sec. 4.3). Finally, we performed extensive evaluation of the entire acquisition and distribution network (Sec. 4.4).

We collected a data-set with RAW images from 8 cameras (Table 1). The photographs come from two public data-sets (Raise [11] and MIT-5k [6]) and from one private data-set. For each camera, we randomly selected 150 images with landscape orientation. These images will later be divided into separate training/validation sets.

Table 1. Digital cameras used in our experiments

 Camera         Res. [Mpx]   #Images    Source    Bayer
 Canon EOS 5D   12           864 dng    MIT-5k    RGGB
 Canon EOS 40D  10           313 dng    MIT-5k    RGGB
 Nikon D5100    16           288 nef    Private   RGGB
 Nikon D700     12           590 dng    MIT-5k    RGGB
 Nikon D7000    16           >1k nef    Raise     RGGB
 Nikon D750     24           312 nef    Private   RGGB
 Nikon D810     36           205 nef    Private   RGGB
 Nikon D90      12           >1k nef    Raise     GBRG

4.1. Neural Imaging Pipelines

We considered three NIP models with various complexity and design principles (Table 2): INet - a simple convolutional network with layers corresponding to successive steps of the standard pipeline; UNet - the well-known UNet architecture [30] adapted from [8]; DNet - an adaptation of the model originally used for joint demosaicing and denoising [16]. A detailed specification of the networks' architectures is included in the supplementary materials.

We trained a separate model for each camera in our dataset. For training, we used 120 diverse full-resolution images. In each iteration, we extracted randomly located 128 × 128 patches and formed a batch with 20 examples (one patch per image). For validation, we used a fixed set of 512 × 512 px patches extracted from the remaining 30 images. The models were trained to reproduce color RGB images developed by a standard imaging pipeline. We used our own implementation based on Python and rawkit [7] wrappers over libRAW [27]. Demosaicing was performed using an adaptive algorithm by Menon et al. [28].
[Figure 4: image patches comparing the optimization target, bilinear demosaicing, and the converged pipeline: (a) improved demosaicing example (Canon EOS 40D); (b) improved denoising example (Nikon D5100).]

Figure 4. Examples of serendipitous image quality improvements obtained by neural imaging pipelines: (a) better demosaicing performance; (b) better denoising.

[Figure 5: an example patch shown as a native patch and with sharpening, Gaussian filtering, JPEGNet compression, and resampling: (a) only manipulation; (b) after distribution.]

Figure 5. An example image patch with all of the considered manipulation variants: (a) just after the manipulation; (b) after the distribution channel (down-sampling and JPEG compression).

Table 2. Considered neural imaging pipelines

                       INet        UNet        DNet
 # Parameters          321         7,760,268   493,976
 PSNR [dB]             42.8        44.3        46.2
 SSIM                  0.989       0.990       0.995
 Train. speed [it/s]   8.80        1.75        0.66
 Train. time           17-26 min   2-4 h       12-22 h
                                                                               4.2. Image Manipulation
                                                                                  Our experiment mirrors the standard setup for im-
                                                                               age manipulation detection [2, 4, 15]. We consider four
extracted from the remaining 30 images. The models
                                                                               mild post-processing operations: sharpening - imple-
were trained to reproduce color RGB images developed
                                                                               mented as an unsharp mask operator with the following
by a standard imaging pipeline. We used our own im-
                                                                               kernel:
plementation based on Python and rawkit [7] wrappers                                             "             #
over libRAW [27]. Demosaicing was performed using                                              1 −1 −4 −1
an adaptive algorithm by Menon et al. [28].                                                        −4 26 −4                        (3)
                                                                                               6 −1 −4 −1
   All NIPs successfully reproduced target images with
high fidelity. The resulting color photographs are vi-                         applied to the luminance channel in the HSV color
sually indistinguishable from the targets. Objective fi-                       space; resampling - implemented as successive 1:2
delity measurements for the validation set are collected                       down-sampling and 2:1 up-sampling using bilinear in-
in Table 2 (average for all 8 cameras). Interestingly, the                     terpolation; Gaussian filtering - implemented using a
trained models often revealed better denoising and de-                         convolutional layer with a 5 × 5 filter and standard de-
mosaicing performance, despite the lack of a denoising                         viation equal 4; JPG compression - implemented using
step in the simulated pipeline, and the lack of explicit                       the JPEGNet module with sinusoidal rounding approx-
optimization objectives (see Fig. 4).                                          imation and quality level 80. Fig. 5 shows the post-
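Two of these operations can be sketched with numpy/scipy; scipy.ndimage is our stand-in here (the paper's own implementation may differ), with order=1 zooming reproducing the bilinear behavior described above:

```python
import numpy as np
from scipy import ndimage

# Unsharp mask kernel from Eq. (3), applied to the luminance channel.
UNSHARP_KERNEL = np.array([[-1, -4, -1],
                           [-4, 26, -4],
                           [-1, -4, -1]], dtype=np.float32) / 6.0

def sharpen(luma):
    """Filter the 2D luminance channel with the unsharp kernel."""
    return ndimage.convolve(luma, UNSHARP_KERNEL, mode='nearest')

def resample(img):
    """1:2 down-sampling followed by 2:1 up-sampling (bilinear, order=1)."""
    down = ndimage.zoom(img, 0.5, order=1)
    return ndimage.zoom(down, 2.0, order=1)
```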
4.3. FAN Model Validation

To validate our FAN model, we initially implemented a simple experiment, where analysis is performed just after image manipulation (no distribution channel distortion, as in [2]). We used the UNet model to develop larger image patches to guarantee the same size of the inputs for the FAN model (128 × 128 × 3 RGB images). In such conditions, the model works just as expected, and yields a classification accuracy of 99% [2].
[Figure 6: line plots of PSNR [dB] (33-43) and FAN accuracy (0.2-1.0) vs. training progress [sampling steps], comparing FAN-only training with joint FAN+NIP optimization.]

Figure 6. Typical progression of validation metrics (Nikon D90) for standalone FAN training (F) and joint optimization of FAN and NIP models (F+N).

Table 3. Typical confusion matrices (Nikon D90). Entries ≈ 0 are not shown; entries ≲ 3% are marked with (*).

(a) standalone FAN optimization (UNet) → 55.2%

 True \ Predicted   nat.   sha.   gau.   jpg   res.
 native              73     24     *      *
 sharpen              7     88     *      *     *
 gaussian             *      *    45     35    18
 jpg                  *     12    22     54    10
 resampled            *      *    37     44    16

(b) joint FAN+NIP optimization (INet) → 67.4%

 True \ Predicted   nat.   sha.   gau.   jpg   res.
 native              77     15     *      *     4
 sharpen              5     82     5      5     *
 gaussian             *      *    62     22    13
 jpg                  *      8    42     37    11
 resampled            *           18      2    79

(c) joint FAN+NIP optimization (UNet) → 97.8%

 True \ Predicted   nat.   sha.   gau.   jpg   res.
 native              97      *
 sharpen                    100
 gaussian                     *   96      *
 jpg                          *    *     96
 resampled                                    100

Table 4. The fidelity-accuracy trade-off for joint optimization of the FAN and NIP models

 UNet model
 Camera     PSNR [dB]      SSIM             Accuracy
 EOS 40D    42.9 → 36.5    0.991 → 0.969    0.51 → 0.94
 EOS 5D     43.0 → 36.0    0.991 → 0.966    0.57 → 0.97
 D7000      43.7 → 35.4    0.992 → 0.963    0.55 → 0.97
 D90        43.3 → 35.8    0.991 → 0.965    0.56 → 0.98

 INet model
 Camera     PSNR [dB]      SSIM             Accuracy
 EOS 40D    42.2 → 40.9    0.986 → 0.983    0.50 → 0.56
 EOS 5D     40.1 → 40.6    0.986 → 0.985    0.55 → 0.56
 D7000      41.5 → 39.7    0.987 → 0.981    0.54 → 0.65
 D90        43.8 → 40.8    0.992 → 0.986    0.56 → 0.65
4.4. Imaging Pipeline Optimization

In this experiment, we perform forensic analysis at the end of the distribution channel. We consider two optimization modes: (F) only the FAN network is optimized given a fixed NIP model; (F+N) both the FAN and NIP models are optimized jointly. In both cases, the NIPs are pre-initialized with previously trained models (Section 4.1). The training was implemented with two separate Adam optimizers, where the first one updates the FAN (and in the F+N mode also the NIP), and the second one updates the NIP based on the image fidelity objective.
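The two-optimizer scheme can be sketched as follows; this is a simplified TensorFlow rendering with hypothetical nip, fan, and channel objects, omitting the manipulation branching, the learning-rate schedule, and patch sampling:

```python
import tensorflow as tf

fan_opt = tf.keras.optimizers.Adam(1e-4)   # updates FAN (and NIP in F+N mode)
nip_opt = tf.keras.optimizers.Adam(1e-4)   # updates NIP on the fidelity loss

def train_step(raw, targets, labels, nip, fan, channel, joint=True):
    """One optimization step; joint=True corresponds to the F+N mode."""
    with tf.GradientTape(persistent=True) as tape:
        developed = nip(raw)
        fidelity = tf.reduce_mean(tf.square(developed - targets))
        logits = fan(channel(developed))
        xent = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=logits))
    fan_vars = fan.trainable_variables
    if joint:
        fan_vars = fan_vars + nip.trainable_variables     # F+N: NIP follows xent too
    fan_opt.apply_gradients(zip(tape.gradient(xent, fan_vars), fan_vars))
    nip_opt.apply_gradients(zip(tape.gradient(fidelity, nip.trainable_variables),
                                nip.trainable_variables))
    del tape
    return fidelity, xent
```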
Similarly to previous experiments, we used 120 full-resolution images for training, and the remaining 30 images for validation. From the training images, in each iteration we randomly extract new patches. The validation set is fixed at the beginning and includes 100 random patches per image (3,000 patches in total) for classification accuracy assessment. To speed up image fidelity evaluation, we used 2 patches per image (60 patches in total). In order to prevent over-representation of empty patches, we bias the selection by outright rejection of patches with pixel variance < 0.01, and by a 50% chance of keeping patches with variance < 0.02. More diverse patches are always accepted; the rule is illustrated by the sketch below.
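The selection rule reads directly as a few lines of Python (a sketch; patch values are assumed to lie in [0, 1]):

```python
import numpy as np

rng = np.random.default_rng()

def keep_patch(patch):
    """Variance-biased patch selection described above."""
    v = float(np.var(patch))
    if v < 0.01:
        return False                  # outright rejection of flat patches
    if v < 0.02:
        return rng.random() < 0.5     # keep low-variance patches half the time
    return True                       # diverse patches are always accepted
```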
Due to computational constraints, we performed the experiment for 4 cameras (Canon EOS 40D and EOS 5D, and Nikon D7000 and D90, see Table 1) and for the INet and UNet models only. (Based on preliminary experiments, we excluded the DNet model, which rapidly lost and could not regain image representation fidelity.) We ran the optimization for 1,000 epochs, starting with a learning rate of 10⁻⁴ and systematically decreasing it by 15% every 100 epochs. Each run was repeated 10 times. The typical progression of validation metrics for the UNet model (classification accuracy and distortion PSNR) is shown in Fig. 6 (sampled every 50 epochs). The distribution channel significantly disrupts forensic analysis, and the classification accuracy drops to ≈ 55% for standalone FAN optimization (F). In particular, the FAN struggles to identify the three low-pass filtering operations (Gaussian filtering, JPEG compression, and re-sampling; see the confusion matrix in Tab. 3a), which bear strong visual resemblance at the end of the distribution channel (Fig. 5b). Optimization of the imaging pipeline significantly increases classification accuracy (up to 98%) and makes all of the manipulation paths easily distinguishable (Tab. 3c).
[Figure 7: horizontal bar plots of classification accuracy (0.5-1.0) for (a) the INet model and (b) the UNet model; one bar per camera (Canon EOS 40D, Canon EOS 5D, Nikon D7000, Nikon D90) and mode (F, F+N).]

Figure 7. Validation accuracy for image manipulation detection after the distribution channel: (F) denotes standalone FAN
training given a fixed NIP; (F+N) denotes joint optimization of the FAN and NIP models.

                                                                                              The observed improvement in forensic classification
(F) UNet reference

                                                                                           accuracy comes at the cost of image distortion, and
                                                                                           leads to artifacts in the developed photographs. In
                                                                                           our experiments, photo development fidelity dropped
                                                                                           to 36 dB (PSNR) / 0.96 (SSIM) for the UNet model
                                                                                           and 40 dB/0.97 for the INet model. The complete re-
(F+N) UNet - worst

                                                                                           sults are collected in Tab. 4. Qualitative illustration
                                                                                           of the observed artifacts is shown in Fig. 8. The fig-
                                                                                           ure shows diverse image patches developed by several
                                                                                           variants of the UNet/INet models (different training
                                                                                           runs). The artifacts vary in severity from disturbing
(F+N) UNet - best

                                                                                           to imperceptible, and tend to be well masked by image
                                                                                           content. INet’s limited flexibility resulted in virtually
                                                                                           imperceptible distortions, which compensates for mod-
                                                                                           est improvements in FAN’s classification accuracy.
(F) INet reference

[Figure 8: image patches developed by different model variants; rows labeled (F) UNet reference, (F+N) UNet - worst, (F+N) UNet - best, (F) INet reference, and (F+N) INet - typical.]

Figure 8. Example image patches developed with joint NIP and FAN optimization (F+N). The UNet examples show the best and worst patches observed at the end of 1,000 training epochs. The INet examples show a typical output.

5. Discussion and Future Work

We replaced the photo acquisition pipeline with a neural network, and developed a fully-differentiable model of the entire photo acquisition and distribution channel. Such an approach allows for joint optimization of the photo analysis and acquisition models. Our results show that it is possible to optimize the imaging pipeline for better provenance analysis capabilities at the end of complex distribution channels. We observed a significant improvement in manipulation detection accuracy w.r.t. state-of-the-art classical forensics [2]: accuracy increased from ≈ 55% to nearly 98%.

Competing optimization objectives lead to imaging artifacts which tend to be well masked in textured areas, but are currently too visible in empty and flat regions. The limited artifacts and classification improvement observed for the simpler pipeline model indicate that better configurations should be achievable by learning to control the fidelity-accuracy trade-off. Further improvements may also be possible by exploiting explicit modeling of the human visual system (HVS). In the future, we would like to consider joint optimization of additional components of the acquisition-distribution channel, e.g., Bayer filters in the camera, or lossy compression during content distribution.
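The last item hinges on making lossy compression differentiable; one common workaround (suggested here as an assumption, not something the paper implements) is a straight-through approximation of the rounding step at the heart of quantization:

```python
import torch

def ste_round(x: torch.Tensor) -> torch.Tensor:
    """Round in the forward pass; treat the op as identity in the backward pass.

    The detach() trick keeps the forward value equal to torch.round(x) while
    letting gradients flow through unchanged, so a quantization step (e.g.,
    of JPEG DCT coefficients) no longer blocks end-to-end optimization.
    """
    return x + (torch.round(x) - x).detach()

x = torch.tensor([0.2, 1.7, 2.6], requires_grad=True)
ste_round(x).sum().backward()
print(x.grad)  # tensor([1., 1., 1.]) - gradient of the identity surrogate
```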
References

[1] S. Agarwal and H. Farid. Photo forensics from JPEG dimples. In Information Forensics and Security (WIFS), 2017 IEEE Workshop on, pages 1–6. IEEE, 2017.
[2] B. Bayar and M. Stamm. Constrained convolutional neural networks: A new approach towards general purpose image manipulation detection. IEEE Transactions on Information Forensics and Security, 13(11):2691–2706, 2018.
[3] P. Blythe and J. Fridrich. Secure digital camera. In Digital Forensic Research Workshop, pages 11–13, 2004.
[4] M. Boroumand and J. Fridrich. Deep learning for detecting processing history of images. Electronic Imaging, 2018(7):1–9, 2018.
[5] H. Burger, C. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In Int. Conf. Computer Vision and Pattern Recognition (CVPR), pages 2392–2399. IEEE, 2012.
[6] V. Bychkovsky, S. Paris, E. Chan, and F. Durand. Learning photographic global tonal adjustment with a database of input/output image pairs. In IEEE Conf. on Computer Vision and Pattern Recognition, 2011.
[7] P. Cameron and S. Whited. Rawkit. https://rawkit.readthedocs.io/en/latest/. Accessed 5 Nov 2018.
[8] C. Chen, Q. Chen, J. Xu, and V. Koltun. Learning to see in the dark. arXiv, 2018. arXiv:1805.01934.
[9] M. Chen, J. Fridrich, M. Goljan, and J. Lukas. Determining image origin and integrity using sensor noise. IEEE Trans. Inf. Forensics Security, 3(1):74–90, 2008.
[10] D. Cozzolino and L. Verdoliva. Noiseprint: a CNN-based camera model fingerprint. arXiv preprint arXiv:1808.08396, 2018.
[11] D. Dang-Nguyen, C. Pasquini, V. Conotter, and G. Boato. RAISE – a raw images dataset for digital image forensics. In Proc. of ACM Multimedia Systems, 2015.
[12] Z. Dias, S. Goldenstein, and A. Rocha. Large-scale image phylogeny: Tracing image ancestral relationships. IEEE Multimedia, 20(3):58–70, 2013.
[13] Z. Dias, S. Goldenstein, and A. Rocha. Toward image phylogeny forests: Automatically recovering semantically similar image relationships. Forensic Science International, 231(1):178–189, 2013.
[14] G. Eilertsen, J. Kronander, G. Denes, R. Mantiuk, and J. Unger. HDR image reconstruction from a single exposure using deep CNNs. ACM Transactions on Graphics, 36(6):1–15, Nov 2017.
[15] W. Fan, K. Wang, and F. Cayre. General-purpose image forensics using patch likelihood under image statistical models. In Proc. of IEEE Int. Workshop on Inf. Forensics and Security, 2015.
[16] M. Gharbi, G. Chaurasia, S. Paris, and F. Durand. Deep joint demosaicking and denoising. ACM Transactions on Graphics, 35(6):1–12, Nov 2016.
[17] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool. WESPE: Weakly supervised photo enhancer for digital cameras. arXiv preprint arXiv:1709.01118, 2017.
[18] H. Jiang, Q. Tian, J. Farrell, and B. A. Wandell. Learning the image processing pipeline. IEEE Transactions on Image Processing, 26(10):5032–5042, Oct 2017.
[19] Joint Photographic Experts Group. JPEG Privacy & Security. https://jpeg.org/items/20150910_privacy_security_summary.html. Accessed 24 Oct 2018.
[20] O. Kapah and H. Z. Hel-Or. Demosaicking using artificial neural networks. In Applications of Artificial Neural Networks in Image Processing V, volume 3962, pages 112–121. International Society for Optics and Photonics, 2000.
[21] E. Kee, J. O'Brien, and H. Farid. Exposing photo manipulation from shading and shadows. ACM Trans. Graph., 33(5), 2014. Article 165.
[22] F. Kokkinos and S. Lefkimmiatis. Deep image demosaicking using a cascade of convolutional residual denoising networks. arXiv, 2018. arXiv:1803.05215v4.
[23] P. Korus. Digital image integrity – a survey of protection and verification techniques. Digital Signal Processing, 71:1–26, 2017.
[24] P. Korus, J. Bialas, and A. Dziech. Towards practical self-embedding for JPEG-compressed digital images. IEEE Trans. Multimedia, 17(2):157–170, Feb 2015.
[25] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila. Noise2Noise: Learning image restoration without clean data. arXiv preprint arXiv:1803.04189, 2018.
[26] H. Li, B. Li, S. Tan, and J. Huang. Detection of deep network generated images using disparities in color components. arXiv preprint arXiv:1808.07276, 2018.
[27] LibRaw LLC. libRAW. https://www.libraw.org/. Accessed 5 Nov 2018.
[28] D. Menon, S. Andriani, and G. Calvagno. Demosaicing with directional filtering and a posteriori decision. IEEE Transactions on Image Processing, 16(1):132–141, 2007.
[29] A. Piva. An overview on image forensics. ISRN Signal Processing, 2013, 2013. Article ID 496701.
[30] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Int. Conf. on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[31] M. Stamm, M. Wu, and K. Liu. Information forensics: An overview of the first decade. IEEE Access, 1:167–200, 2013.
[32] Z. H. Sun and A. Hoogs. Object insertion and removal in images with mirror reflection. In Information Forensics and Security (WIFS), 2017 IEEE Workshop on, pages 1–6. IEEE, 2017.
[33] N.-S. Syu, Y.-S. Chen, and Y.-Y. Chuang. Learning deep convolutional networks for demosaicing. arXiv, 2018. arXiv:1802.03769.
[34] R. Tan, K. Zhang, W. Zuo, and L. Zhang. Color image demosaicking via deep residual learning. In IEEE Int. Conf. Multimedia and Expo (ICME), 2017.
[35] E. Wengrowski, Z. Sun, and A. Hoogs. Reflection correspondence for exposing photograph manipulation. In Image Processing (ICIP), 2017 IEEE International Conference on, pages 4317–4321. IEEE, 2017.
[36] B. Wronski and P. Milanfar. See better and further with Super Res Zoom on the Pixel 3. https://ai.googleblog.com/2018/10/see-better-and-further-with-super-res.html. Accessed 22 Oct 2018.
[37] C. P. Yan and C. M. Pun. Multi-scale difference map fusion for tamper localization using binary ranking hashing. IEEE Transactions on Information Forensics and Security, 12(9):2144–2158, 2017.
[38] M. Zampoglou, S. Papadopoulos, and Y. Kompatsiaris. Large-scale evaluation of splicing localization algorithms for web images. Multimedia Tools and Applications, 76(4):4801–4834, 2017.
[39] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
[40] P. Zhou, X. Han, V. Morariu, and L. S. Davis. Learning rich features for image manipulation detection. In IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), pages 1053–1061, 2018.