Example-based Learning for Single-Image Super-resolution


Kwang In Kim¹ and Younghee Kwon²

¹ Max-Planck-Institut für biologische Kybernetik, Spemannstr. 38, D-72076 Tübingen, Germany
² Korea Advanced Institute of Science and Technology, 373-1 Kusong-dong, Yusong-Ku, Taejon, Korea

      Abstract. This paper proposes a regression-based method for single-
      image super-resolution. Kernel ridge regression (KRR) is used to esti-
      mate the high-frequency details of the underlying high-resolution image.
      A sparse solution of KRR is found by combining the ideas of kernel
      matching pursuit and gradient descent, which allows time-complexity to
      be kept to a moderate level. To resolve the problem of ringing artifacts
      occurring due to the regularization effect, the regression results are post-
      processed using a prior model of a generic image class. Experimental
      results demonstrate the effectiveness of the proposed method.

1   Introduction

Single-image super-resolution refers to the task of constructing a high-resolution
enlargement of a given low-resolution image. This problem is inherently ill-posed
as there are generally multiple high-resolution images that can produce the same
low-resolution image. Accordingly, prior information is required to approach this
problem. Often, this prior information is available either in the explicit form of
an energy functional defined on the image class [9, 10], or in the implicit form of
example images leading to example-based super-resolution [1–3, 5].
    Previous example-based super-resolution algorithms can be characterized
as nearest neighbor (NN)-based estimations [1–3]: during the training phase,
pairs of low-resolution and the corresponding high-resolution image patches
(sub-windows of images) are collected. Then, in the super-resolution phase,
each patch of the given low-resolution image is compared to the stored low-
resolution patches, and the high-resolution patch corresponding to the nearest
low-resolution patch is selected as the output. For instance, Freeman et al. [2]
posed image super-resolution as the problem of estimating the missing high-frequency
details after interpolating the input low-resolution image to the desired scale
(which results in a blurred image). Then, the super-resolution was performed by
NN-based estimation of high-frequency patches based on the corresponding patches
of the input low-frequency image.
    Although this method (and also other NN-based methods) has already shown
an impressive performance, there is still room for improvement if one views the
image super-resolution as a regression problem, i.e., finding a map f from the
space of low-resolution image patches X to the space of target high-resolution
patches Y. It is well known in the machine learning community that NN-based
estimation suffers from overfitting where one obtains a function which explains
the training data perfectly yet cannot be generalized to unknown data. In the
super-resolution, this can result in noisy reconstructions at complex image re-
gions (cf. Sect. 3). Accordingly, it is reasonable to expect that NN-based methods
can be improved by adopting learning algorithms with regularization capability
to avoid overfitting.
    Based on the framework of Freeman et al. [2], Kim et al. posed the problem
of estimating the high-frequency details as a regression problem which is then
resolved by support vector regression (SVR) [6]. Meanwhile, Ni and Nguyen uti-
lized SVR in the frequency domain and posed the super-resolution as a kernel
learning problem [7]. While SVR produced a significant improvement over ex-
isting example-based methods, it has several drawbacks in building a practical
system: 1. As a regularization framework, SVR tends to smooth the sharp edges
and produce an oscillation along the major edges. This might lead to low recon-
struction error on average, but is visually implausible; 2. SVR results in a dense
solution, i.e., the regression function is expanded in the whole set of training
data points and accordingly is computationally demanding both in training and
in testing.³
    The current work extends the framework of Kim et al. [6]. A kernel ridge
regression (KRR) is utilized for the regression. Due to the observed optimality
of ε at (nearly) 0 for SVR in our previous study, the only difference between
SVR and KRR in the proposed setting is their loss functions (L1- and L2-loss,
respectively). The L2-loss adopted by KRR is differentiable and facilitates
gradient-based optimization. To reduce the time complexity of KRR, a sparse
basis is found by combining the idea of the kernel matching pursuit (KMP) [11]
and gradient descent such that the time complexity and the quality of super-
resolution can be traded. As the regularizer of KRR is the same as that of SVR,
the problem of oscillation along the major edges still remains. This is resolved
by exploiting a prior over image structure proposed by Tappen et al. [9].

2     Regression-based Image Super-resolution
Base System. Adopting the framework of Freeman et al. [2], for the super-reso-
lution of a given image, we estimate the corresponding missing high-frequency
details based on its interpolation into the desired scale, which in this work is
obtained by the bicubic interpolation. Furthermore, based on the conditional
independence assumption of high- and low-frequency components given mid-
frequency components of an image [2], the estimation of high-frequency compo-
nents (Y ) is performed based on the Laplacian of the bicubic interpolation (X).
The Y is then added to the bicubic to produce the super-resolved image Z.
³ In our simulation, the optimum value of ε for the ε-insensitive loss function of SVR
  was close to zero.
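For illustration, the base system described above (interpolate the low-resolution input, take its Laplacian as the input features X, and add the estimated high-frequency layer Y back) can be sketched in a few lines of array code. This is only a minimal sketch, not the authors' implementation: `scipy.ndimage.zoom` with cubic splines stands in for bicubic interpolation, and all function and variable names are illustrative.

```python
import numpy as np
from scipy import ndimage

def build_base_images(low_res, factor=2):
    """Upscale and compute the Laplacian 'mid-frequency' features (sketch)."""
    # Cubic-spline zoom as a stand-in for bicubic interpolation.
    bicubic = ndimage.zoom(low_res.astype(np.float64), factor, order=3)
    # Laplacian of the interpolated image: the regressor's input image X.
    laplacian = ndimage.laplace(bicubic)
    return bicubic, laplacian

def compose_result(bicubic, predicted_high_freq):
    """Add the estimated high-frequency layer Y back to the interpolation."""
    return bicubic + predicted_high_freq
```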

     To retain the complexity of the resulting regression problem at a moderate
level, a patch-based approach is taken where the estimation of the values of Y
at specific locations, N_N(Y(x, y)), is performed based only on the values of X at
the corresponding locations, N_M(X(x, y)), where N_G(S(x, y)) represents a G-sized
square window (patch) centered at location (x, y) of the image S.
     Then, during the super-resolution, X is scanned with a small window (of size
M ) to produce a patch-valued regression result (of size N ) for each pixel. This
results in a set of candidate pixels for each location of Z (as the patches are
overlapping with their neighbors), which are then combined to make the final
estimation (details will be provided later). The training images for the regressor
are obtained by blurring and subsampling (by bicubic resampling) a set of high-
resolution images to constitute a set of low- and high-resolution image pairs.
The training image patch pairs are randomly sampled therein. To increase the
efficiency of the training set, the data are contrast-normalized ([2]): during the
construction of the training set both the input image patch and corresponding
desired patches are normalized by dividing them by the L1 -norm of the input
patch. For an unseen image patch, the input is again normalized before the
regression and the corresponding output is inverse normalized.
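As a concrete illustration of the patch sampling and contrast normalization described above, the sketch below extracts an aligned input/target patch pair and divides both by the L1-norm of the input patch. The patch sizes, the small constant guarding against division by zero, and the function name are illustrative assumptions, and the patch center is assumed to lie away from the image border.

```python
import numpy as np

def extract_pair(X, Y, x, y, m=7, n=5, eps=1e-8):
    """Return one contrast-normalized training pair centered at (x, y) (sketch)."""
    rm, rn = m // 2, n // 2
    x_patch = X[x - rm:x + rm + 1, y - rm:y + rm + 1].ravel()   # M = m*m input values
    y_patch = Y[x - rn:x + rn + 1, y - rn:y + rn + 1].ravel()   # N = n*n target values
    norm = np.abs(x_patch).sum() + eps                          # L1-norm of the input patch
    return x_patch / norm, y_patch / norm, norm                 # keep norm for inverse normalization
```

At test time, the stored norm of the input patch is used to rescale the regression output back (the inverse normalization mentioned above).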
     For a given set of training data points {(x_1, y_1), …, (x_l, y_l)} ⊂ ℝ^M × ℝ^N,
we minimize the following regularized cost functional

    O(\{f^1,\ldots,f^N\}) = \sum_{i=1,\ldots,N} \Big( \tfrac{1}{2} \sum_{j=1,\ldots,l} \big(f^i(x_j) - y_j^i\big)^2 + \tfrac{\lambda}{2} \|f^i\|_{\mathcal{H}}^2 \Big),    (1)

where y_j = [y_j^1, …, y_j^N] and H is a reproducing kernel Hilbert space (RKHS).
Due to the reproducing property, the minimizer of the above functional is expanded
in kernel functions:

    f^i(\cdot) = \sum_{j=1,\ldots,l} a_j^i\, k(x_j, \cdot), \quad \text{for } i = 1, \ldots, N,    (2)

where k is the generating kernel for H, which we choose to be a Gaussian kernel
(k(x, y) = exp(−‖x − y‖²/σ_k)). Equation (1) is the sum of individual convex
cost functionals for each scalar-valued regressor and can be minimized separately.
However, by tying the regularization parameter λ and the kernel k we can reduce
the time complexity of training and testing down to the case of scalar-valued
regression, as in this case the kernel matrix can be shared: plugging (2) into (1)
and noting the convexity of (1) yields

    A = (K + \lambda I)^{-1} Y,    (3)

where Y = [y_1^⊤, …, y_l^⊤]^⊤ and the i-th column of A constitutes the coefficient
vector a^i = [a_1^i, …, a_l^i]^⊤ for the i-th regressor.
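Because the kernel matrix is shared across the N outputs, the closed-form solution (3) is computed once for all regressors. The following is a minimal NumPy sketch of (1)–(3), assuming the Gaussian kernel defined above; the function names are illustrative, not the authors' code.

```python
import numpy as np

def gaussian_kernel(A, B, sigma_k):
    """k(x, y) = exp(-||x - y||^2 / sigma_k) for all pairs of rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma_k)

def fit_krr(X, Y, sigma_k, lam):
    """Full KRR, eq. (3): A = (K + lambda*I)^{-1} Y, one column per output dimension."""
    K = gaussian_kernel(X, X, sigma_k)                      # l x l shared kernel matrix
    return np.linalg.solve(K + lam * np.eye(len(X)), Y)

def predict_krr(A, X_train, X_test, sigma_k):
    """f^i(x) = sum_j A[j, i] * k(x_j, x), eq. (2), for all test points at once."""
    return gaussian_kernel(X_test, X_train, sigma_k) @ A
```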

Sparse Solution. As evident from (2) and (3), the training and testing time
of KRR is O(l³) and O(M × l), respectively, which becomes prohibitive even
for a relatively small number of training data points (e.g., l > 10,000). One
way of reducing the time complexity is to trade it off against the optimality of
the solution by finding the minimizer of (1) only within the span of a basis set
{k(b_1, ·), …, k(b_{l_b}, ·)} (l_b ≪ l):

    f^i(\cdot) = \sum_{j=1,\ldots,l_b} a_j^i\, k(b_j, \cdot), \quad \text{for } i = 1, \ldots, N.    (4)

In this case, the solution is obtained by

    A = \big(K_{bx} K_{bx}^\top + \lambda K_{bb}\big)^{-1} K_{bx} Y,    (5)

where [K_{bx}]_{(i,j)} = k(b_i, x_j) is of size l_b × l and [K_{bb}]_{(i,j)} = k(b_i, b_j) is of
size l_b × l_b, and accordingly the time complexity reduces to O(M × l_b) for testing.
For a given fixed set of basis points B = {b_1, …, b_{l_b}}, the time complexity of
computing the coefficient matrix A is O(l_b³ + l × l_b × M). In general, the total
training time depends on the method of finding B.
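For a fixed basis set B, equation (5) is again a single linear solve, and prediction via (4) touches only the l_b basis points. A minimal NumPy sketch with illustrative names (not the authors' code):

```python
import numpy as np

def _gauss(A, B, sigma_k):
    """Gaussian kernel matrix between the rows of A and B."""
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / sigma_k)

def fit_sparse_krr(X, Y, B, sigma_k, lam):
    """Reduced-basis KRR, eq. (5): A = (K_bx K_bx^T + lambda K_bb)^{-1} K_bx Y."""
    K_bx, K_bb = _gauss(B, X, sigma_k), _gauss(B, B, sigma_k)   # l_b x l and l_b x l_b
    return np.linalg.solve(K_bx @ K_bx.T + lam * K_bb, K_bx @ Y)

def predict_sparse_krr(A, B, X_test, sigma_k):
    """f^i(x) = sum_j A[j, i] * k(b_j, x), eq. (4); O(M * l_b) per test patch."""
    return _gauss(X_test, B, sigma_k) @ A
```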
    In KMP [11, 4], the basis points are selected from the training data points in
an incremental way: for given n − 1 basis points, the n-th basis is chosen such
that the cost functional (1) is minimized when the A is optimized accordingly.
The exact implementation of KMP costs O(l²)-time for each step. Another pos-
sibility is to note the differentiability of the cost functional (4) which leads to
gradient-based optimization to construct B. Assuming that the evaluation of
the derivative of k with respect to a basis vector takes O(M )-time, the evalua-
tion of the derivative of (1) with respect to B and the corresponding coefficient matrix
A takes O(M × l × l_b + l × l_b²)-time. Because of the increased flexibility, in
general, gradient-based methods can lead to a better optimization of the cost
functional (1) than selection methods as already demonstrated in the context of
sparse Gaussian process (GP) regression [8]. However, due to the non-convexity
of (1) with respect to B, it is susceptible to local minima and accordingly a good
heuristic is required to initialize the solution.
    In this paper, we use a combination of KMP and gradient descent. The basic
idea is to assume that, at the n-th step of KMP, the chosen basis point b_n plus
the accumulation of basis points obtained up to the (n − 1)-th step (B_{n−1}) is a
good initial point. Then, at each step of KMP, B_n can be subsequently optimized
by gradient descent. A naive implementation of this idea is still very expensive. To
reduce the complexity further, the following simplifications are adopted: 1. In the
KMP step, instead of evaluating the whole training set for choosing b_n, only l_c
(l_c ≪ l) points are considered; 2. Gradient descent of B_n and the corresponding
A_{(1:n,:)}⁴ is performed only at every r-th KMP step. Instead, for each KMP
step, only b_n and A_{(n,:)} are optimized. In this case, the gradient can be evaluated
at O(M × l) cost.⁵
⁴ With a slight abuse of the Matlab notation, A_{(m:n,:)} stands for the submatrix of A
  obtained by extracting the rows of A from m to n.
⁵ Similarly to [4], A_{(n,:)} can be analytically calculated at O(M × l) cost:

    A_{(n,:)} = \frac{K_{bx(n,:)} \big(Y - K_{bx(1:n-1,:)}^\top A_{(1:n-1,:)}\big) - \lambda K_{nb} A_{(1:n-1,:)}}{K_{bx(n,:)} K_{bx(n,:)}^\top + \lambda}.    (6)
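In code, the cheap update (6) for a newly added basis point amounts to a few matrix-vector products over the current fit residual. A sketch with illustrative names (k_nx holds k(b_n, x_j) for all training points, k_nb holds k(b_n, b_m) for the previously chosen bases); this is an assumed reading of (6), not the authors' implementation:

```python
import numpy as np

def new_basis_coefficients(k_nx, k_nb, A_prev, K_bx_prev, Y, lam):
    """Coefficient row A_(n,:) for a newly added basis point, following eq. (6).

    k_nx: (l,) kernel values k(b_n, x_j); k_nb: (n-1,) values k(b_n, b_m);
    K_bx_prev: (n-1, l) kernel matrix of the previous bases; A_prev: (n-1, N).
    """
    residual = Y - K_bx_prev.T @ A_prev             # (l, N) residual of the first n-1 bases
    numerator = k_nx @ residual - lam * (k_nb @ A_prev)
    denominator = k_nx @ k_nx + lam                 # scalar
    return numerator / denominator                  # (N,) coefficients for basis b_n
```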

    At the n-th step, the l_c candidate basis points for KMP are selected based
on a rather cheap criterion: we use the difference between the function output
obtained at the (n−1)-th step and the estimated desired response of the full KRR
for each training data point, which is in turn approximated by a localized KRR: for a
training data point x_i, its NNs are collected from the training set and a full KRR
is trained based on only these NNs. The output of this localized KRR for x_i gives
the estimate of the desired response for x_i. It should be noted that these local
KRRs cannot be directly applied for regression as they might interpolate poorly
on non-training data points. Once computed at the beginning, the estimated
desired responses are fixed throughout the whole optimization process.
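The localized-KRR estimate of the desired responses can be sketched as follows: for each training point, a small full KRR is fitted on its nearest neighbours in the training set and evaluated at the point itself. The neighbourhood size and the function name below are assumptions for illustration only.

```python
import numpy as np

def localized_targets(X, Y, sigma_k, lam, n_neighbors=50):
    """Estimate the desired full-KRR response for every training point (sketch)."""
    est = np.zeros(Y.shape)
    for i, x in enumerate(X):
        # Nearest neighbours of x_i in the training set (x_i itself included).
        idx = np.argsort(((X - x) ** 2).sum(1))[:n_neighbors]
        Xi, Yi = X[idx], Y[idx]
        K = np.exp(-((Xi[:, None, :] - Xi[None, :, :]) ** 2).sum(-1) / sigma_k)
        A = np.linalg.solve(K + lam * np.eye(len(idx)), Yi)    # local full KRR
        k_x = np.exp(-((Xi - x) ** 2).sum(1) / sigma_k)
        est[i] = k_x @ A                                       # evaluated at x_i
    return est
```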
    To gain insight into the performance of the different sparse solution methods,
a set of preliminary experiments was performed with KMP, gradient descent
(with the basis initialized by the k-means algorithm), and the proposed combination of
KMP and gradient descent, using 10,000 training data points. Figure 1 summarizes
the results. Both gradient descent methods outperform KMP, while the
combination with KMP provides the better performance of the two. This could be attributed
to the better initialization of the solution for the subsequent gradient descent
step.

[Figure 1: cost functional (1) (y-axis, roughly 38–58) plotted against the number of basis points (0–300) for KMP, gradient descent, and KMP+gradient descent.]

Fig. 1. Performance of the different sparse solution methods evaluated in terms of the
cost functional (1). A fixed set of hyper-parameters was used such that the comparison
can be made directly in terms of (1).

Combining Candidates. It is possible to construct a super-resolved image based
only on the scalar-valued regression (i.e., N = 1). However, we propose to predict
a patch-valued output such that for each pixel, N different candidates are generated.
These candidates constitute a 3-D image Z whose third dimension corresponds
to the candidates. This setting is motivated by the observations that
1. by sharing the hyper-parameters, the computational complexity of the resulting
patch-valued learning reduces to that of the scalar-valued learning; 2. the candidates
contain information from different input image locations, which is actually diverse
enough that combining them can boost the performance: in our preliminary
experiments, constructing an image by choosing the best and the worst
(in terms of the distance to the ground truth) candidates at each 2-D location
of Z resulted in an average signal-to-noise ratio (SNR) difference of 8.24dB.
Certainly, the ground truth is not available at the actual super-resolution stage, and
accordingly a way of constructing a single pixel out of the N candidates is required.
    One straightforward way is to construct the final estimation as a convex com-
bination of candidates based on a certain confidence measure. For instance, by
noting that the (sparse) KRR corresponds to the maximum a posteriori esti-
mation with the (sparse) GP prior [8], one could utilize the predictive variance
as a basis for the selection. In the preliminary experiments this resulted in an
improvement over the scalar-valued regression. However, a better prediction was
obtained when the confidence estimate was based not only on the input patches
but also on the context of neighboring reconstructions. For this, a set
of linear regressors is trained such that for each location (x, y), they receive a
patch of the output image Z_{(N_L(x,y),:)} and produce estimates of the differences
({d_1(x, y), …, d_N(x, y)}) between the unknown desired output and each candidate.
The final estimate of the pixel value at an image location (x, y) is then
obtained as a convex combination of the candidates given in the form of a softmax:

    Y(x, y) = \sum_{i=1,\ldots,N} w_i(x, y)\, Z(x, y, i),    (7)

where w_i(x, y) = \exp\!\big(-\tfrac{|d_i(x, y)|}{\sigma_C}\big) \,\big/\, \sum_{j=1,\ldots,N} \exp\!\big(-\tfrac{|d_j(x, y)|}{\sigma_C}\big).
     For the experiments in this paper, we set M = 49 (7 × 7), N = 25 (5 × 5),
L = 49 (7 × 7), σ_k = 0.025, σ_C = 0.03, and λ = 0.5 · 10⁻⁷. The values were obtained
based on a set of separate validation images. The number of basis points for KRR
(l_b) was set to 300 as a trade-off between accuracy and time complexity.
In the super-resolution experiments, the combination of candidates
based on these parameters resulted in an average SNR increase of 0.43dB over
the scalar-valued regression.

Post-processing Based on Image Prior. As demonstrated in Fig. 2.b, the result
of the proposed regression-based method is significantly better than the bicubic
interpolation. However, detailed visual inspection along the major edges (edges
showing rapid and strong changes of pixel values) reveals ringing artifacts (oscillations
occurring along the edges). In general, regularization methods (depending
on the specific class of regularizer), including KRR and SVR, tend to fit the
data with a smooth function. Accordingly, at sharp changes of the function
(edges in the case of images), oscillations occur to compensate for the resulting loss
of smoothness. While this problem can be resolved indirectly by imposing less
regularization in the vicinity of edges, a more direct approach is to rely on the
prior knowledge of discontinuity of images. In this work, we use a modification
of the natural image prior (NIP) framework proposed by Tappen et al. [9]:
    P(\{x\} \mid \{y\}) = \frac{1}{C} \prod_{(i,\, j \in N_S(i))} \exp\left[-\left(\frac{|\hat{x}_i - \hat{x}_j|}{\sigma_N}\right)^{\alpha}\right] \cdot \prod_i \exp\left[-\left(\frac{\hat{x}_i - y_i}{\sigma_R}\right)^{2}\right],    (8)

where {y} represents the observed variables corresponding to the pixel values
of Y, {x} represents the latent variables, and N_S(i) stands for the 8-connected
neighbors of the pixel location i. While the second product term has the role of
preventing the final solution from drifting far away from the input regression result Y,
the first product term tends to smooth the image based on the costs |x̂_i − x̂_j|.
The role of α (< 1) is to re-weight the costs such that the largest differences are
stressed relatively less than the others, so that large changes of pixel values are
penalized relatively less. Furthermore, the cost term |x̂_i − x̂_j|^α becomes piece-wise
concave with extreme points at N_S(i), such that if the second term is removed,
the maximum probability for a pixel i is achieved by assigning it the value
of one of its neighbors, rather than a weighted average of neighbors, which would
have been the case for α > 1. Accordingly, this distribution prefers a single strong
edge over a set of small edges and can be used to resolve the problem of
smoothing around major edges. The optimization of (8) is performed by belief
propagation (BP), similarly to [9]. To facilitate the optimization, we reuse the
candidate set generated in the regression step such that the best candidates
are chosen by BP.
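For intuition, the negative log of (8) splits into a neighborhood smoothness term and a data-fidelity term. The sketch below only evaluates this energy for a candidate image x̂ given the regression result Y (using the parameter values quoted below in the text); the actual optimization in the paper is carried out by belief propagation over the candidate set, which is not reproduced here.

```python
import numpy as np

def nip_energy(x_hat, y, alpha=0.85, sigma_n=200.0, sigma_r=1.0):
    """Negative log of eq. (8), up to the constant log C (sketch)."""
    # Four offsets cover every unordered 8-connected neighbour pair exactly once.
    diffs = [np.abs(x_hat[:, 1:] - x_hat[:, :-1]),     # horizontal neighbours
             np.abs(x_hat[1:, :] - x_hat[:-1, :]),     # vertical neighbours
             np.abs(x_hat[1:, 1:] - x_hat[:-1, :-1]),  # diagonal neighbours
             np.abs(x_hat[1:, :-1] - x_hat[:-1, 1:])]  # anti-diagonal neighbours
    smoothness = sum(((d / sigma_n) ** alpha).sum() for d in diffs)
    fidelity = (((x_hat - y) / sigma_r) ** 2).sum()
    return smoothness + fidelity
```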

Fig. 2. Example of super-resolution: a. bicubic, b. regression result, c. post-processed
result of b based on NIP, d. Laplacian of the bicubic interpolation with major edges displayed
as green pixels, and e–f. enlarged portions of a–c from left to right.

    Optimizing (8) throughout the whole image region can lead to degraded results
as it tends to flatten textured areas, especially when the contrast is
low such that the contribution of the second term is small.⁶ This problem is resolved
by applying the (modification of the) NIP only in the vicinity of major edges.
Based on the observation that the input images are blurred and accordingly very
high spatial-frequency components are removed, the major edges are found by
thresholding each pixel of the Laplacian of the input image using the L2 and L∞ norms
of the local patches encompassing it. It should be noted that a major edge is
in general different from an object contour. For instance, in Fig. 2.d, the boundary
between the chest of the duck and the water is not detected as a major edge, as
the intensity variations across the boundary are not significant. In this case, no
visible oscillation of pixel values is observed in the original regression result.

⁶ In the original work of Tappen et al. [9], this problem does not occur as the candidates
  are 2 × 2-size image patches rather than individual pixels.
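One plausible reading of the major-edge test described above is sketched below: a pixel is flagged when both the L2 and the L∞ norm of the local Laplacian patch around it are large. The patch size and the thresholds t_l2, t_inf are assumptions (the paper does not specify them), so this is illustrative only and would need tuning on validation data.

```python
import numpy as np
from scipy import ndimage

def major_edge_mask(laplacian, t_l2, t_inf, patch=7):
    """Flag pixels whose local Laplacian patch has large L2 and L-infinity norms (sketch)."""
    # Local L2 norm: square root of the windowed sum of squares.
    local_l2 = np.sqrt(ndimage.uniform_filter(laplacian ** 2, size=patch) * patch ** 2)
    # Local L-infinity norm: windowed maximum of the absolute Laplacian.
    local_inf = ndimage.maximum_filter(np.abs(laplacian), size=patch)
    return (local_l2 > t_l2) & (local_inf > t_inf)
```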
    The parameters α, σ_N, and σ_R are set to 0.85, 200, and 1, respectively.
While the improvement in terms of SNR is less significant (on average 0.04dB
over the combined regression result), the improved visual quality at major edges
demonstrates the effectiveness of the NIP (Fig. 2).

3     Experiments
The proposed method was evaluated on a set of high- and low-resolution
image pairs (Fig. 3) which is disjoint from the training images. The desired
resolution is twice that of the input image along each dimension. The number of training
data points is 200,000, and it took around a day to train the sparse KRR on
a 2.5GHz PC. For comparison, several different example-based image super-resolution
methods were evaluated, including Freeman et al.'s NN-based
method [2], Tappen et al.'s NIP [9],⁷ and Kim et al.'s SVR-based method [6]
(trained on only 10,000 data points).

Fig. 3. Thumbnails of test images: the images are indexed by numbers arranged in the
raster order.

    Figure 4 shows examples of super-resolution results. All the example-based
super-resolution methods outperform the bicubic interpolation in terms of visual
plausibility. The NN-based method and the original NIP produced sharper
images at the expense of introducing noise which, even with the improved visual
quality, leads to lower SNR values than the bicubic interpolation. The SVR
produced less noisy images; however, it generated smoothed edges and perceptually
distracting ringing artifacts, which disappear with the proposed method.
Disregarding the post-processing stage, we measured on average a 0.69dB improvement
in SNR for the proposed method over the SVR. This could be attributed
to the sparsity of the solution, which enabled training on a large data set, and the
effectiveness of the candidate combination scheme.
⁷ The original NIP algorithm was developed for super-resolving NN-subsampled
  images (not the bicubic resampling which is used in the experiments with all the other
  methods). Accordingly, for the experiments with NIP, the low-resolution images were
  generated by NN subsampling. The visual quality of the super-resolution results is not
  significantly different from that obtained with bicubic resampling. However,
  the quantitative results should not be directly compared with the other methods.

    Moreover, in comparison to the SVR, the proposed method requires much less
processing time: super-resolving a 256 × 256 image to 512 × 512 takes around
25 seconds for the proposed method and 20 minutes for the SVR-based method.
For a quantitative comparison, the SNRs of the different algorithms are plotted
in Fig. 5.


Fig. 4. Results of different super-resolution algorithms on two images from Fig. 3: a-b.
original, c-d. bicubic, e-f. SVR [6], g-h. NN-based method [2], i-j. NIP [9], and k-l.
proposed method.

4    Conclusion
This paper approached the problem of image super-resolution from a nonlinear
regression viewpoint. A combination of KMP and gradient descent is adopted to
obtain a sparse KRR solution which enabled a realistic application of regression-
based super-resolution. To resolve the problem of smoothing artifacts that occur
due to the regularization, the NIP was adopted to post-process the regression
result such that the edges are sharpened while the artifacts are suppressed. Com-
parison with the existing example-based image super-resolution methods demon-
strated the effectiveness of the proposed method. Future work should include
comparison and combination of various non-example-based approaches.

[Figure 5: increase of SNR from bicubic (in dB) plotted against the image index (1–12) for bicubic, SVR, NN, NIP, and the proposed method.]

Fig. 5. Performance of the different super-resolution algorithms.

Acknowledgment. The contents of this paper have greatly benefited from discussions
with G. Bakır and C. Walder, and from comments from anonymous reviewers.
The idea of using localized KRR originated from C. Walder.

References
 1. Baker, S., Kanade, T.: Limits on super-resolution and how to break them. IEEE
    Trans. Pattern Analysis and Machine Intelligence 24(9), 1167–1183 (2002)
 2. Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-based super-resolution. IEEE
    Computer Graphics and Applications 22(2), 56–65 (2002)
 3. Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.H.: Image analo-
    gies. In: Computer Graphics (Proc. Siggraph 2001), pp. 327–340. ACM Press, NY
    (2001)
 4. Keerthi, S.S., Chu, W.: A matching pursuit approach to sparse Gaussian process
    regression. In: Advances in Neural Information Processing Systems. MIT Press,
    Cambridge, MA (2005)
 5. Kim, K.I., Franz, M.O., Schölkopf, B.: Iterative kernel principal component analysis
    for image modeling. IEEE Trans. Pattern Analysis and Machine Intelligence 27(9),
    1351–1366 (2005)
 6. Kim, K.I., Kim, D.H., Kim, J.H.: Example-based learning for image super-
    resolution. In: Proc. the third Tsinghua-KAIST Joint Workshop on Pattern Recog-
    nition, pp. 140–148 (2004)
 7. Ni, K., Nguyen, T.Q.: Image superresolution using support vector regression. IEEE
    Trans. Image Processing 16(6), 1596–1610 (2007)
 8. Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. In:
    Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA
    (2006)
 9. Tappen, M.F., Russell, B.C., Freeman, W.T.: Exploiting the sparse derivative prior
    for super-resolution and image demosaicing. In: Proc. IEEE Workshop on Statis-
    tical and Computational Theories of Vision (2003)
10. Tschumperlé, D., Deriche, R.: Vector-valued image regularization with PDEs: a
    common framework for different applications. IEEE Trans. Pattern Analysis and
    Machine Intelligence 27(4), 506–517 (2005)
11. Vincent, P., Bengio, Y.: Kernel matching pursuit. Machine Learning 48, 165–187
    (2002)