Similarity Evaluation of Graphic Design Based on Deep Visual Saliency Features

Zhuohua Liu
 Guangdong Mechanical & Electrical Polytechnic
Jingrui An
 Eindhoven University of Technology
Caijuan Huang
 Guangdong Mechanical & Electrical Polytechnic
Bin Yang (  b.yang@tue.nl )
 Eindhoven University of Technology

Research Article

Keywords: Similarity Evaluation, Deep Visual Saliency, Graphic Design, Plagiarism detection

Posted Date: February 7th, 2023

DOI: https://doi.org/10.21203/rs.3.rs-2537865/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract
The creativity of an excellent design work generally comes from the inspiration and innovation of its
main visual features. The similarity between the main visual elements is the most important indicator for
detecting plagiarism of design concepts, which is important to protect cultural heritage and copyright.
The purpose of this paper is to develop an efficient similarity evaluation scheme for graphic design. A
novel deep visual saliency feature extraction generative adversarial network is proposed to address the
lack of training examples. It consists of two networks: one predicts a visual saliency feature
map from an input image; the other takes the output of the first to distinguish whether a visual saliency
feature map is a predicted one or ground truth. Different from traditional saliency generative adversarial
networks, a residual refinement module is connected after the encoding and decoding network. Design
importance maps generated by professional designers are used to guide the network training. A saliency-
based segmentation method is developed to not only locate the optimal layout regions but also notice
insignificant regions. Priorities are then assigned to different visual elements. Experimental results show
that the proposed model obtains state-of-the-art performance among various similarity measurement
methods.

1. Introduction
Human beings perceive the external world mainly through information obtained via the auditory, visual,
olfactory, gustatory, and tactile sensory pathways. Among all the information-processing subsystems of the
brain, the visual processing system occupies the most important position, because more than 70% of external
information comes from visual perception. Images have the advantages of intuitive and clear content, easy
acquisition, convenient dissemination, and rich information, making them the most important carrier of
visual information in daily human activities. With the rapid development of society, breakthroughs in
science and technology, and the increasing popularity of the Internet, the means for people to obtain images
have become increasingly convenient and flexible, and the amount of image data obtained has also grown
rapidly. Thanks to the visual attention mechanism for complex scenes, we can process such a huge amount of
information in real time: people can quickly locate the salient or interesting content in a visual scene and
process it further, ignoring inconspicuous or uninteresting content.

Cognitive psychologists and neurophysiologists explore the psychological and biophysical essence of the
attention principle from human psychological activities and neuroanatomy. Since the 1990s, more and
more computer vision studies have focused on the visual attention mechanism. In the cognitive
theory of visual attention [1], salience is usually defined as:

    Certain parts of the visual scene are intuitively salient relative to their surrounding parts, which may
    be certain objects or certain regions.

The purpose of Visual Saliency Detection (VSD) [2] is to find salient regions in the visual scene and
estimate their saliency. Guided by visual saliency, the computing resources needed for image analysis and
processing can be allocated preferentially and the interference of redundant information can be eliminated,
which improves both the speed and the accuracy of computer vision algorithms. VSD has a wide range of
applications in many fields, such as image and video compression [3], content-aware image scaling [4],
image rendering [5], image retrieval [6], image segmentation [7], target detection and recognition [8],
behavior recognition [9], and target tracking [10].

However, the wide sharing and rapid dissemination of design artworks have brought about serious
problems of design homogeneity. Because the main visual features of a design can well reflect the
designer's ideas and creativity [11], the similarity evaluation of these features is not only conducive to
Content-Based Image Retrieval (CBIR) but also helps to detect plagiarism of design concepts, which is of
great significance for cultural and copyright protection. The widespread and unprecedented distribution of
digital artworks (e.g., posters, illustrations, advertisements) puts them at a higher risk of plagiarism
[12]. Design plagiarism is often based on other people's ideas (such as layout, form, or creative concept)
and is often carried out by hand drawing. This makes it difficult to describe the similarity between
plagiarized designs and to generate quantitative indicators [13]. Figure 1 presents two graphic posters
with a similar design concept and layout structure. Although they are completely different at the pixel
level, they would still be considered suspected plagiarism.

In this paper, a novel Visual Saliency Features (VSF) extraction network, named VSFGAN, is proposed. It
consists of two networks: one predicts VSF maps from an input image; the other one takes the output of
the first one to discriminate whether a VSF map is a predicted one or ground truth. Different from
traditional saliency GANs, we proposed a specific loss function for the VSF of design. The VSF map is
segmented based on saliency features guided by the aesthetic rule. We apply the diffusion equation to
compute the probability maps for non-dominant visual regions. Finally, a multi-weight similarity measure
method is developed based on SSIM [14]. The highlights of this paper can be summarized as follows:

     A similarity evaluation scheme is proposed for graphic design, which can be used for plagiarism
     detection.
     A novel VSF extraction model based on GAN is developed. A residual refinement module is
     connected after the encoding and decoding network. Design importance maps generated by
     professional designers are used to guide the network training.
     According to aesthetic rules, minor visual element regions should be placed in the non-salient areas of
     an image, yet they still need to be considered in the similarity calculation of design works. We
     propose an algorithm to calculate the minor visual element probability map.

The rest of this paper is organized as follows. In Section 2, related works are presented and discussed.
The proposed method is presented in Section 3. Experiments and results are shown in Section 4. Finally,
the conclusion and future work are presented in Section 5.
2. Related Works
2.1 Visual Saliency Detection
In 1998, a classic saliency calculation model based on the neurophysiological mechanism of visual
attention and cognitive psychology was proposed by Itti and Koch[15], which laid the foundation for
saliency research in computer vision. Since then, the field of visual saliency detection has begun to
flourish, bringing computer vision closer to human vision.

From the perspective of the information processing mechanism, VSD methods can be roughly divided
into two categories [16]: task-driven Top-Down (TD) models [17] and data-driven Bottom-Up (BU) models
[18]. In TD models, saliency mapping is mainly guided by task-specific priors or knowledge learned from
training scenarios [2]. In contrast, BU models are unconscious, guided by underlying visual features
present in the visual field such as color, orientation, texture, and intensity, without any specific task
guidance. The main difference between the two categories is whether indicators from volitional tasks or
learned priors are considered in the feature integration computation. TD methods generally need a large
amount of ground-truth-labeled data for training, or high-level information to guide saliency detection
under a specific task. Compared with BU methods, TD methods therefore have greater limitations in
application.

Previous VSD methods mainly utilize low-level feature (color, orientation, intensity, etc.) contrasts and
calculate saliency through linear or nonlinear combinations. With the in-depth research on visual saliency
detection, some new salient features have been used for the detection, such as uniqueness, distribution,
focus, objectness, etc. At the same time, more and more frameworks are also introduced, such as
saliency detection based on cellular automata.

Itti and Koch [15] attempted to model the bottom-up processing performed by early vision systems to
detect salient regions and thus estimate visual fixation locations. The model detects salient regions by
using central-peripheral differences in color, brightness, and orientation, and computes a saliency map by
linearly combining the resulting feature maps. The three primary features are Gaussian filtered to obtain a
multi-scale feature pyramid, and the central-peripheral operation calculates the difference between scales
in each feature dimension. The final saliency map is a grayscale image in which points with high pixel
values have high saliency. Harel
et al. [19] proposed a Graph-Based Visual Saliency (GBVS) model to improve the model proposed in [15].
Similar to [15], GBVS simulates the visual principle in the feature extraction stage. But it introduces the
Markov chain in the process of generating the saliency map to improve the accuracy of saliency
detection. The FES proposed by Tavakoli et al. [20] can be considered as a model for simulating visual
processing because it also designs a central-peripheral mechanism. FES applies the Bayesian framework
to multi-scale central-peripheral analysis, and the required distribution in the Bayesian formula is
obtained through sparse sampling and kernel density estimation. Borji [21] combined low-level features
such as orientation, color, intensity, and saliency maps of previous best bottom-up models with top-down

cognitive visual features (e.g., faces, humans, and cars), and learned a direct mapping from those
features to eye fixations using regression, SVM, and AdaBoost classifiers.

Like many computer vision applications, recent studies have entered the era of using deep learning for
feature extraction, and these solutions have greatly improved the performance of VSD. Here, we mainly
review VSD schemes based on deep learning techniques.

Liu et al. [22] assumed the saliency of image elements could be derived from the relevance of the
saliency seeds (i.e., the most representative salient elements). In this view, they developed a normal linear
elliptic system with a Dirichlet boundary to match the diffusion from seeds to other relevant points. Li
and Yu [23] found that the model of salience can be derived from multi-scale features obtained using
deep convolutional neural networks. They used fully-connected layers on the top of a CNN, responsible
for the extraction of features at different levels. Building on the substantial improvement that CNNs
brought to human attention prediction, Wang and Shen [24] further improved CNN-based attention models by
efficiently leveraging multi-scale features. Hierarchical saliency information is captured by the visual
attention network, from deep coarse layers with global saliency information to shallow fine layers with
local saliency responses, and supervision is directly fed into multiple levels of layers. Cornia et al.
[25] used a convolutional Long Short-Term Memory (LSTM) network [26] to iteratively attend to the most
salient area of the input and refine the predicted saliency map. Moreover, a set of prior maps generated
with Gaussian functions is learned to tackle the center bias typical of human eye fixations.

With the rapid development of GAN models [27–29], more and more GAN-based VSD methods have been
proposed. Pan et al. [30] proposed a deep CNN for visual saliency prediction, named SalGAN. In the
generator, weights are learned by back-propagation from a binary cross-entropy loss computed over
downsampled versions of the saliency maps. To distinguish the saliency maps produced by the generation
stage from the ground-truth maps, the generated predictions are processed by a trained discriminator
network.

Most previous studies aimed at improving the detection accuracy of salient regions. To obtain clear
salient object detection boundaries, Qin et al. [31] proposed a hybrid training loss to better preserve the
structure of the original image. The architecture consists of a densely supervised encoder-decoder
network and a residual refinement module, which are responsible for saliency prediction and saliency
map refinement, respectively. To evaluate the performance of various saliency detection methods and a
model's ability to predict where humans look in images, many image databases and evaluation metrics have
been published and widely adopted by researchers; the proposal of different saliency detection databases
and evaluation metrics has in turn promoted the development of visual saliency detection. Bylinskii et al.
[32] provided an analysis of 8 different evaluation metrics and their properties. With the help of
systematic experiments and visualization of metric computations, interpretability and transparency were
added to the evaluation of saliency models. They made recommendations for metric selection under certain
assumptions and specific applications based on the differences in metric properties and behavior.

2.2 Similarity Evaluation of Graphic Designs
The evaluation of the similarity of graphic designs has always been an unavoidable problem in the visual
design domain, especially for copyright protection. Most plagiarized works are based on other people's
creative ideas and show a certain, but not complete, visual similarity (e.g., in layout, form, or color
matching), and are often produced by hand. This makes it difficult to describe the similarity between
plagiarized designs and to generate quantitative metrics. Because the high-level similarity of graphic
designs is generally not the result of direct copying from the original image, most traditional image
forensics methods also struggle to detect such plagiarism. As shown in Fig. 2, the left and right posters
are similar in terms of composition, spatial organization, and object properties in each space, but they
are not similar in terms of pixels or image features.

Most of the plagiarism identification of graphic designs is based on human eye observation and
comparison, which leads to a high degree of subjectivity in similarity judgments. With the continuous
improvement of computer vision technology, researchers try to use computers to calculate the similarity
of graphic designs. Garrett and Robinson developed iTrace [33] to explore the possibility of detecting
plagiarism in visual works based on image similarity. Bozkr and Sezer [34] tried to evaluate the layout
similarity of web pages. Spatial pyramid matching was used to classify web page elements. Finally, the
histogram intersection mode was used to capture and measure the visual similarity of partial and entire
page layouts. Morphological analysis [35] is a method based on morphological theory for analyzing target
objects: a problem is decomposed into individual small elements, each of which is processed and studied
independently, and these elements are then arranged and combined in a network diagram to produce a
systematic solution. Artistic style can also be used for image classification [36]: the curvature of lines
describes the fluidity of the lines in an image, color contrast describes its color style, and similarity
rules of artistic style are generated to classify images accordingly.

Lang et al. [37] studied the plagiarized clothing retrieval problem. They proposed a novel network called
Plagiarized-Search-Net (PS-Net) based on region representations, in which landmarks are used to guide the
learning of the region representations, and suspected fashion items are then compared region by region. In
addition, they proposed a plagiarized fashion database for plagiarized clothes retrieval, which provides a
meaningful addition to the existing field of fashion retrieval. Cui et al. [38] elaborated on 8 elements
that form unique posters and 6 judgment criteria for plagiarism through an exploratory study with
designers. They proposed models leveraging the combination of primary elements and plagiarism criteria to
find suspect instances in a retrieval process. The models are trained in the context of modern artwork and
evaluated on the poster plagiarism dataset. Their experiments showed that the proposed method outperforms
the baseline with excellent Top-K accuracy (33%) and retrieval performance (42%).
Although in recent years many scholars have moved toward higher-dimensional image similarity calculations
(i.e., cognitive dimensions), similarity studies on designs and artworks remain rare due to their highly
abstract and aesthetic features. In this paper, we propose to analyze the similarity of graphic designs in
the cognitive dimension through visual saliency features.

3. Method
Visual saliency detection refers to simulating the human visual attention mechanism through computer
vision algorithms, calculating the importance of information in images, and extracting salient regions
(regions of interest) [39]. In this paper, we aim to simulate the visual attention area of people when
viewing art and design works through deep learning techniques. Our VSFGAN consists of two networks:
one predicts VSF maps from an input image; the other one takes the output of the first one to
discriminate whether a VSF map is a predicted one or ground truth. Different from traditional saliency
GANs, we proposed a specific loss function for the VSF of design. Visual elements in the image will be
assigned different priorities, and secondary visual elements will be greatly suppressed.

The proposed scheme is shown in Fig. 3.

VSD is the calculation of the visual importance of different elements in natural images or graphic
designs. Most previous traditional methods aim at visual saliency detection of natural images rather than
graphic designs. Even when the input background is a natural image, the elements in the layout are still
graphic design elements; therefore, each element of the background image also needs to be treated as a
design element in the VSD stage.

3.1 Design Importance Map from Human Vision (DIM-HV)
When creating a design, controlling the perceived importance of various elements is crucial, and
designers often arrange elements to convey their importance. Color, size, and position all affect the
position of an element in a design, but it is difficult to quantify this position through mathematical
formulas. There is a clear relative difference in the importance of elements in a design: a large graphic in
the center will be much more important than a small text in the corner [40]. However, how important is
typography to the similarity judgment of the same type of graphic designs? How does the significance of
other elements depend on it? The importance of the image is correlated with its salience.

Inspired by Judd et al. [41], the DIM-HV is generated using a data-driven approach (see Fig. 4). First,
500 graphic designs were downloaded from the dataset in [40] and departmental repositories. Second, 8
professional designers were asked to mark the regions of each graphic design that would be decisive in
identifying plagiarism. Third, the responses over all users were averaged, and the DIM-HV of each design
work was obtained by normalizing the averaged responses. Finally, the DIM-HV of each design work is used
as the ground-truth mask for our collected experimental images.
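
As an illustration, the minimal NumPy sketch below shows how such a ground-truth map could be assembled from per-designer annotations; the function name build_dim_hv and the assumption that annotations are stored as per-designer masks in [0, 1] are ours, not part of the original annotation pipeline.

```python
import numpy as np

def build_dim_hv(annotation_masks):
    """Average per-designer importance annotations into a single DIM-HV.

    annotation_masks: list of 2-D arrays (one per designer), each marking the
    regions the designer flagged as decisive for plagiarism (binary masks or
    soft strokes in [0, 1]).
    """
    stacked = np.stack([m.astype(np.float32) for m in annotation_masks])
    mean_response = stacked.mean(axis=0)          # average over the 8 designers
    # Normalize to [0, 1] so the map can serve as a ground-truth saliency mask.
    lo, hi = mean_response.min(), mean_response.max()
    if hi > lo:
        mean_response = (mean_response - lo) / (hi - lo)
    return mean_response
```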

3.2 Visual Saliency Features GAN
Many VSD schemes train the network with a specially designed loss function to achieve satisfactory
performance. However, it is difficult to measure the saliency effect in a unified way because design ideas
are elusive. Inspired by SalGAN [30], we introduce the idea of generative adversarial training: instead of
focusing on complex loss functions, we expect the generator to produce saliency maps close to the real
ones through adversarial learning. A GAN is an unsupervised learning method in which two neural networks
play a game against each other; it consists of a generative network and a discriminative network. In the
classic setting, the generator samples from a latent space, and its output must mimic the real samples in
the training set as closely as possible. The discriminative network receives either a real sample or the
output of the generative network, and its purpose is to distinguish between the two as well as possible,
while the generative network tries to deceive it. The two networks compete and constantly adjust their
parameters; the ultimate goal is to make the discriminative network unable to judge whether the output of
the generative network is real. Since a GAN does not require a large number of training samples, it is
suitable for scenarios where plagiarism samples of designs are scarce. To this end, we propose a novel
visual saliency feature extraction network. It consists of two networks: one predicts VSF maps from an
input image; the other takes the output of the first to discriminate whether a VSF map is a predicted one
or ground truth. The overall model framework is based on the Encoder-Decoder architecture.
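
The PyTorch sketch below illustrates this adversarial training scheme under the description above; the names (generator, discriminator, train_step) and the exact form of the updates are our assumptions, and the content term is a simplified stand-in for the hybrid loss defined later in Eq. (1).

```python
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, g_opt, d_opt, image, dim_hv, alpha=0.005):
    """One adversarial update: the generator predicts a VSF map from a design
    image; the discriminator sees (image, map) pairs and judges whether the
    map is predicted or ground truth (the DIM-HV)."""
    # --- discriminator update ---
    with torch.no_grad():
        fake_map = generator(image)
    real_logit = discriminator(torch.cat([image, dim_hv], dim=1))
    fake_logit = discriminator(torch.cat([image, fake_map], dim=1))
    d_loss = F.binary_cross_entropy(real_logit, torch.ones_like(real_logit)) + \
             F.binary_cross_entropy(fake_logit, torch.zeros_like(fake_logit))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- generator update: adversarial term plus a content term vs. DIM-HV ---
    fake_map = generator(image)
    fake_logit = discriminator(torch.cat([image, fake_map], dim=1))
    adv = F.binary_cross_entropy(fake_logit, torch.ones_like(fake_logit))
    content = F.binary_cross_entropy(fake_map, dim_hv)   # simplified stand-in for Eq. (1)
    g_loss = adv + alpha * content
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```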

VSD requires the output map to match the input image pixel-by-pixel in size. Therefore, we adopt a scheme
of down-sampling first and then up-sampling to restore the output to the same size as the input image. The
Encoder-Decoder architecture is based on pre-trained VGG-16. To enable the proposed network to
simultaneously capture the global high-level semantic information and low-level detail information of
graphic designs, a Residual Refinement Module (RRM) [42] is connected after the encoding and decoding
network. The RRM is a residual block with spatial attention and is adopted to refine the features
effectively; each of its convolutional layers consists of 64 3×3 convolution kernels. Downsampling uses
max pooling, upsampling uses bilinear interpolation, and the RRM learns the residual between the predicted
saliency map and the real saliency map, which is used to further refine the prediction. Adding this
residual to the initial visual saliency map yields the final visual saliency map. The architecture of the
proposed VSFGAN is presented in Fig. 5.
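
For concreteness, a condensed PyTorch sketch of such a generator is given below. It is not the exact layer stack of Tables 1 and 2; the truncated VGG-16 encoder, the three-layer decoder, and the simplified residual refinement block are illustrative assumptions that only mirror the overall encode-decode-refine structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class ResidualRefinement(nn.Module):
    """Small residual block: predicts a correction that is added to the coarse
    saliency map (a simplified version of the refinement idea described above)."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, coarse):
        return torch.sigmoid(coarse + self.body(coarse))

class SaliencyGenerator(nn.Module):
    """Encoder-decoder generator: VGG-16 features down, bilinear upsampling up,
    followed by residual refinement of the coarse map."""
    def __init__(self):
        super().__init__()
        self.encoder = vgg16(weights="IMAGENET1K_V1").features[:23]  # up to conv4_3
        self.decoder = nn.Sequential(
            nn.Conv2d(512, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),
        )
        self.refine = ResidualRefinement()

    def forward(self, x):
        h, w = x.shape[-2:]
        coarse = self.decoder(self.encoder(x))
        coarse = F.interpolate(coarse, size=(h, w), mode="bilinear", align_corners=False)
        return self.refine(coarse)
```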

Table 1 and Table 2 list the implementation details of the proposed generator and discriminator,
respectively.

                                  Table 1
            The architectural details of the proposed generator.
layer         depth    kernel    stride   pad     activation
conv 1_1      64       1*1       1        1       ReLU
conv 1_2      64       3*3       1        1       ReLU
pool1         -        2*2       2        0       -
conv 2_1      128      3*3       1        1       ReLU
conv 2_2      128      3*3       1        1       ReLU
pool2         -        2*2       2        0       -
conv 3_1      256      3*3       1        1       ReLU
conv 3_2      256      3*3       1        1       ReLU
conv 3_3      256      3*3       1        1       ReLU
pool3         -        2*2       2        0       -
conv 4_1      512      3*3       1        1       ReLU
conv 4_2      512      3*3       1        1       ReLU
conv 4_3      512      3*3       1        1       ReLU
pool4         -        2*2       2        0       -
conv 5_1      512      3*3       1        1       ReLU
conv 5_2      512      3*3       1        1       ReLU
conv 5_3      512      3*3       1        1       ReLU
conv 6_1      512      3*3       1        1       ReLU
conv 6_2      512      3*3       1        1       ReLU
conv 6_3      512      3*3       1        1       ReLU
upsample6     -        2*2       2        0       -
conv 7_1      512      3*3       1        1       ReLU
conv 7_2      512      3*3       1        1       ReLU
conv 7_3      512      3*3       1        1       ReLU
upsample7     -        2*2       2        0       -
conv 8_1      256      3*3       1        1       ReLU
conv 8_2      256      3*3       1        1       ReLU
conv 8_3      256      3*3       1        1       ReLU
upsample8     -        2*2       2        0       -
conv 9_1      128      3*3       1        1       ReLU
conv 9_2      128      3*3       1        1       ReLU
upsample9     -        2*2       2        0       -
conv 10_1     64       3*3       1        1       ReLU
conv 10_2     64       3*3       1        1       ReLU
output        1        1*1       1        0       sigmoid

                                  Table 2
          The architectural details of the proposed discriminator.
layer         depth    kernel    stride   pad     activation
conv 1_1      3        1*1       1        1       ReLU
conv 1_2      32       3*3       1        1       ReLU
pool1         -        2*2       2        0       -
conv 2_1      64       3*3       1        1       ReLU
conv 2_2      64       3*3       1        1       ReLU
pool2         -        2*2       2        0       -
conv 3_1      64       3*3       1        1       ReLU
conv 3_2      64       3*3       1        1       ReLU
pool3         -        2*2       2        0       -
fc4           100      -         -        -       tanh
fc5           2        -         -        -       tanh
fc6           1        -         -        -       sigmoid

Different from BASNet [31], which focuses on detecting and segmenting salient objects, the goal of the
proposed VSD model is to evaluate a pixel-level visual saliency map, that is, the saliency value of each
pixel lies in the real range [0, 1]. There are several differences between VSFGAN and previous GAN models:

1. The goal is to generate actual saliency values instead of producing a realistic image from random
   noise; in this case, the input to the generator is no longer random noise but a design image.
2. The generator must not only generate a saliency map indistinguishable from the real one, but also make
   both correspond to the same input; therefore, both the design image and the corresponding design
   importance map (DIM-HV) are used as the input of the discriminator.
3. When updating the parameters of the generator, using a loss function that combines the discriminator
   error with the cross-entropy relative to the ground truth improves the stability and convergence speed
   of training.

We use a hybrid loss function for VSFGAN:

                        L(Θ) = αL_BCE(Θ) + L_SSIM(Θ)                 (1)

where L_BCE is the content loss function. A user may notice more than just a single pixel when looking at a
design, so it makes more sense to treat each predicted pixel as independent of the others. Thus, the Binary
Cross Entropy (BCE) is calculated by averaging individual BCEs over all pixels:

        L_BCE(Θ) = −(1/N) Σ_{j=1}^{N} [S_j log(Ŝ_j) + (1 − S_j) log(1 − Ŝ_j)]                 (2)

where S_j and Ŝ_j are the ground-truth normalized VSF and the predicted normalized VSF of the input,
respectively.

L_SSIM(Θ) is the SSIM loss function [14], and Θ represents the parameters of the visual saliency detection
network. The SSIM loss captures the structural information of each element in the image. It is a
region-level measure: it gives higher weight to element boundaries when the model predicts the same
saliency value for neighboring pixels, which helps to obtain clear element boundaries in VSF maps. Suppose
x = {x_n | n = 1, 2, ..., N} and y = {y_m | m = 1, 2, ..., M} denote two image patches extracted from the
VSF maps S_j and Ŝ_j, respectively. The SSIM loss function is defined as:

    L_SSIM(Θ) = 1 − [(2μ_x μ_y + C_1)(2σ_xy + C_2)] / [(μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2)]        (3)

where μ_x, μ_y and σ_x², σ_y² are the means and variances of x and y respectively, and σ_xy is their
covariance. We set C_1 = 0.01² and C_2 = 0.03² based on experimental experience. Experiments show that the
model performs best when the hyperparameter α in Eq. (1) is set to 0.005.
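
A minimal PyTorch sketch of this hybrid loss is shown below, assuming saliency maps normalized to [0, 1]; the SSIM statistics are computed globally per map rather than over local windows, which is a simplification of Eq. (3).

```python
import torch

def ssim_loss(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM-based loss (Eq. 3), computed per image over the whole map."""
    mu_x, mu_y = pred.mean(dim=(-2, -1)), target.mean(dim=(-2, -1))
    var_x = pred.var(dim=(-2, -1), unbiased=False)
    var_y = target.var(dim=(-2, -1), unbiased=False)
    cov = ((pred - mu_x[..., None, None]) *
           (target - mu_y[..., None, None])).mean(dim=(-2, -1))
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return (1.0 - ssim).mean()

def hybrid_loss(pred, target, alpha=0.005):
    """L(Θ) = α·L_BCE + L_SSIM, with BCE averaged over all pixels (Eqs. 1-2)."""
    bce = torch.nn.functional.binary_cross_entropy(pred, target)
    return alpha * bce + ssim_loss(pred, target)
```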

3.3 Saliency-based Segmentation

After obtaining the saliency map of an image using the visual saliency detection network, the input image
should be segmented to evaluate the similarity of different regions. Designers often use grids or
rectangular areas to organize elements, and observers perceive this structure, associating alignment,
grouping, and symmetry with these regions. Global position features can be used to guide the segmentation,
including the distances to the thirds lines, the power points (intersections of the thirds lines), the
image center, the boundaries, and the diagonals. To compare design similarities, we estimate layout
structures based on the saliency features and assign weights for visual importance.

According to aesthetic rules, the Minor Visual Element (MVE) regions should be placed in the non-salient
areas of the image. To calculate the optimal layout area, an intuitive method is to exhaustively enumerate
the possible positions and sizes of all MVE regions and use the visual saliency value of each candidate
area as its layout score. However, this method has three shortcomings. First, since there are many
inconspicuous background areas whose saliency values are all small and close, it is difficult to directly
determine the most suitable MVE layout from the saliency values alone. Second, since the saliency value at
the edge of the image is usually small, considering only the visual saliency value would place the MVE
near the image border, which violates aesthetic rules and leads to poor visual presentation. Third,
similar saliency values in the background region also lead to a huge search space and increase the amount
of computation.

We propose an algorithm to address these shortcomings. First, a diffusion equation is used to calculate
the MVE probability map, which represents the probability of an MVE layout at each position; a candidate
region generation algorithm is then used to obtain the design layout.

We apply the diffusion equation to compute the probability maps for MVE regions, defined as follows:

        PD_{M+1} = PD_M + θ(dX + dY),   dX = c_X ∇_X(PD_M),   dY = c_Y ∇_Y(PD_M)        (4)

where ∇_X and ∇_Y are the gradients in the horizontal and vertical directions of the pixel, respectively,
and c_X and c_Y are the diffusion coefficients in the two directions. The goal of this diffusion equation
is to compute the probability maps of the MVE regions. The diffusion coefficients c_X and c_Y are set to 1
and 0.6 according to aesthetic rules.

In the probability map of the initial stage, there are many regions with the same probability value.
Because the diffusion equation considers the visual saliency distribution of image elements around each
possible region, the number of suitable MVE regions is continuously reduced during the iterative process.
The iteration stops when the difference between the MVE probability map and the initial saliency map
exceeds a threshold.
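
A minimal NumPy sketch of this iteration is given below; the initialization from the inverted saliency map, the step size theta, and the stopping threshold are illustrative assumptions, while the update rule of Eq. (4), the coefficients c_X = 1 and c_Y = 0.6, and the stopping criterion follow the description above.

```python
import numpy as np

def mve_probability_map(saliency, theta=0.1, cx=1.0, cy=0.6,
                        stop_diff=0.3, max_iter=500):
    """Iterate the diffusion of Eq. (4), accumulating probability mass for
    minor visual elements in low-saliency regions."""
    prob = 1.0 - saliency.astype(np.float32)     # assumed start: non-salient regions
    initial = saliency.astype(np.float32)
    for _ in range(max_iter):
        # Forward differences as discrete gradients in x and y.
        dx = cx * (np.roll(prob, -1, axis=1) - prob)
        dy = cy * (np.roll(prob, -1, axis=0) - prob)
        prob = np.clip(prob + theta * (dx + dy), 0.0, 1.0)
        # Stop once the map has moved far enough from the initial saliency map.
        if np.mean(np.abs(prob - initial)) > stop_diff:
            break
    return prob
```
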
In object detection tasks, many methods for generating bounding boxes have been developed. However, most
of them do not consider the local relationship between the main visual region and the surrounding image
elements, so the generated candidate boxes are hard to apply to design layout segmentation. Here, we use
the hierarchical segmentation algorithm in [40] to segment the design image. Different from [40], we use
the probability maps of the MVE regions computed by Eq. (4) as the segmentation input. Furthermore, the
main visual region is fixed and set to the highest energy term. The algorithm takes as input a layout, a
binary mask for each element, and the element class (graphic or text); the output is a hierarchical
segmentation of the design into non-overlapping rectangular regions.

Given a rectangular region R, a cut c is defined at a point (x, y) in R that divides the region into two
rectangular subregions r1 and r2. The internal energy term penalizes cuts based on the distance to each
element's bounding box: cuts closer to the center are weighted more, and cuts closer to the region border
are given less weight.

                F_int(c) = (1/n) Σ_{p∈c} max_i (I_i^p · dis_i^c(p))²        (5)

where p ∈ c are the pixels p along the cut c, I_i^p is an indicator variable denoting whether element i
overlaps pixel p, and dis_i^c(p) is the distance of pixel p to the bounding box of element i; this distance
depends on the cut type c. An energy function F_elm(c) counts the number of elements of the same class
(text or graphics) in the two subregions r1 and r2:

                F_elm(c) = −(N(r1) + N(r2))        (6)

where N(r) is the number of same-class elements in region r (0 means the region contains no element).

The algorithm tends to divide regions evenly about the region center. Thus, we normalize the distance of
the cut to the center:

                F_cen(c) = −|c − r_c| / r_l        (7)

where r_c is the location of the region center and r_l is the length of the region. A cut c is then
evaluated by:

                F(c) = v_int F_int(c) + v_elm F_elm(c) + v_cen F_cen(c)        (8)

We set v_int = 50, v_elm = 100, and v_cen = 1 to obtain the best experimental results.
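
The sketch below scores a single candidate cut with Eqs. (5)-(8); the bounding-box distance, the counting of un-split elements, and the data layout (element dictionaries with 'bbox' and 'cls') are simplified assumptions rather than the full hierarchical segmentation of [40].

```python
import numpy as np

def _inside_distance(point, bbox):
    """Distance from a cut pixel to the border of an element's bounding box;
    non-zero only when the pixel lies inside the box (the indicator I_i^p)."""
    x, y = point
    x0, y0, x1, y1 = bbox
    if x0 <= x <= x1 and y0 <= y <= y1:
        return min(x - x0, x1 - x, y - y0, y1 - y)
    return 0.0

def cut_energy(cut_pos, region, elements, axis=0,
               v_int=50.0, v_elm=100.0, v_cen=1.0):
    """Score one cut: F(c) = v_int*F_int + v_elm*F_elm + v_cen*F_cen (Eq. 8).

    region:   (x0, y0, x1, y1) bounds of the region being split.
    elements: list of dicts with 'bbox' = (x0, y0, x1, y1) and 'cls' in {'text', 'graphic'}.
    axis:     0 for a vertical cut at x = cut_pos, 1 for a horizontal cut at y = cut_pos.
    """
    x0, y0, x1, y1 = region
    lo, hi = (y0, y1) if axis == 0 else (x0, x1)

    # F_int (Eq. 5): average squared penalty for cutting through elements.
    penalties = []
    for p in range(int(lo), int(hi)):
        point = (cut_pos, p) if axis == 0 else (p, cut_pos)
        dists = [_inside_distance(point, e['bbox']) for e in elements]
        penalties.append(max(dists) ** 2 if dists else 0.0)
    f_int = float(np.mean(penalties)) if penalties else 0.0

    # F_elm (Eq. 6): count elements lying entirely in r1 or r2 (elements split
    # by the cut count for neither; a simplified reading of the per-class count).
    side_lo = sum(1 for e in elements
                  if (e['bbox'][2] if axis == 0 else e['bbox'][3]) <= cut_pos)
    side_hi = sum(1 for e in elements
                  if (e['bbox'][0] if axis == 0 else e['bbox'][1]) >= cut_pos)
    f_elm = -(side_lo + side_hi)

    # F_cen (Eq. 7): normalized distance of the cut from the region center.
    r_c = (x0 + x1) / 2.0 if axis == 0 else (y0 + y1) / 2.0
    r_l = (x1 - x0) if axis == 0 else (y1 - y0)
    f_cen = -abs(cut_pos - r_c) / r_l

    return v_int * f_int + v_elm * f_elm + v_cen * f_cen
```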

3.4 Similarity Evaluation
Similarity evaluation calculates a similarity distance between feature vectors through a measurement
algorithm. Commonly used measures include Euclidean distance, cosine distance, Hash distance, and mutual
information, and many image similarity calculation methods based on deep learning models also exist.
Because the layout similarity of graphic designs alone is not enough to judge the similarity or plagiarism
of the works, it is also necessary to evaluate the similarity of the conceptual design of the main visual
region. Since the segmented image already has a relatively obvious element relationship structure, we
propose a multi-weight similarity measure based on the Structural SIMilarity (SSIM) index [14]:

                        S = (1/C) Σ_{i=1}^{C} WV(R_i) × SSIM(R_i)        (9)

where R_i is the i-th segmented region of the image, SSIM is computed on each segmented region, and C is
the number of segmented image regions. The weighted VSF of region R_x is calculated by the function WV(x):

                        WV(x) = (w_x × h_x × VSF_x) / (W × H)        (10)

where VSF_x is the normalized VSF of R_x, w_x and h_x are the width and height of the segmented region R_x
(each segmented region is a rectangle), and W and H are the width and height of the image.
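
A minimal sketch of Eqs. (9)-(10) using scikit-image's SSIM is given below; it assumes the two designs have been segmented into the same ordered set of rectangular regions, resized to matching shapes of at least 7×7 pixels with values in [0, 1].

```python
import numpy as np
from skimage.metrics import structural_similarity

def weighted_similarity(regions_a, regions_b, vsf_values, image_size):
    """Multi-weight similarity of Eqs. (9)-(10).

    regions_a / regions_b: equally ordered lists of region crops (2-D float arrays)
    from the two designs; vsf_values: normalized VSF value of each region;
    image_size: (W, H) of the full design image.
    """
    W, H = image_size
    scores = []
    for ra, rb, vsf in zip(regions_a, regions_b, vsf_values):
        h, w = ra.shape
        wv = (w * h * vsf) / (W * H)                            # Eq. (10)
        scores.append(wv * structural_similarity(ra, rb, data_range=1.0))
    return float(np.mean(scores))                               # Eq. (9)
```
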
4. Evaluation and Discussion
4.1 Experimental Setup
Two datasets are used to train and test our proposed scheme. 1) The Plagiarized Poster dataset [38], which
contains 22,624 images with 224 query images; each poster has an average of 4.92 plagiarized designs. It
is used to train and test the similarity evaluation ability of our scheme with respect to visual saliency
in poster design. 2) The Graphic Design Importance dataset by O'Donovan et al. [43], which comes with
importance annotations for 1,078 graphic designs from Flickr. It is used to test the similarity evaluation
ability of our scheme with respect to layout in poster design. Some samples are shown in Fig. 6.

Experiments were run on a PC server with two Nvidia GeForce GTX TitanX GPUs. The proposed models are
trained with a learning rate of 0.0002. 70% of the samples were selected as the training set, and the rest
were used for testing. As few studies have addressed the plagiarism issue for graphic designs, we compared
our scheme to the method proposed in [38], which focuses on retrieving plagiarized posters. To make the
test results clearer and more comparable, test metrics similar to those in [38] are chosen:
    Top-k Accuracy. The number of correctly retrieved plagiarized samples among the top K ranked images;
    K is set to 10 and 20.
    Normalized Discounted Cumulative Gain (NDCG) [44]. It measures ranking quality by discounting relevant
    items that appear lower in the retrieved list (a sketch of the computation follows this list).
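
For reference, a minimal sketch of the NDCG computation is shown below; the binary relevance encoding (1 for a true plagiarized sample, 0 otherwise) is an illustrative assumption.

```python
import numpy as np

def ndcg_at_k(relevances, k):
    """NDCG for a single query; `relevances` lists the graded relevance of the
    retrieved items in ranked order (e.g. 1 = true plagiarized sample, 0 = not)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    if rel.size == 0:
        return 0.0
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))   # 1 / log2(rank + 1)
    dcg = float(np.sum(rel * discounts))
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * discounts))
    return dcg / idcg if idcg > 0 else 0.0
```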

4.2 Evaluation Results
We implemented six related image similarity measurement methods for comparison. Two of them [45, 46]
focus on copy-move forgery detection, since plagiarism detection can be considered a kind of clone
forensics. The other methods [13, 34, 37, 38] were developed for artwork plagiarism detection. The
experimental results on the Plagiarized Poster dataset [38] are shown in Table 3.

                                                   Table 3
                                            Experimental results.
 Method                          Focus on         Top-10 Accuracy%    Top-20 Accuracy%    NDCG

 SIFT-based [45]                 Image clone      12.52               8.45                0.55

 Dense Inception Net [46]        Image clone      14.26               10.32               0.60

 VAE-WGAN [13]                   Logo design      40.34               31.60               0.73

 Spatial pyramid matching [34]   Website design   0                   0                   0

 Plagiarized-Search-Net [37]     Clothes design   36.33               28.47               0.67

 Conceptual filtering [38]       Poster design    67.45               48.14               0.92

 Ours                            Graphic design   78.36               63.38               0.94

Our proposed method achieves the best performance in all metrics (i.e., Top-10 Accuracy of 78.36%, Top-20
Accuracy of 63.38%, and NDCG of 0.94). This benefit comes from the use of VSFGAN and the saliency-based
segmentation algorithm. Since a GAN does not require a large number of training samples, it is suitable
for scenarios where plagiarism samples of designs are scarce; as long as an image database and a loss
function for training are available, the applicability of this adversarial approach is greatly improved.
That is also why the VAE-WGAN-based method [13] can still handle some types of plagiarism in artworks,
although it was developed for computing the cognitive similarity of graphic logos. The two copy-move
forgery detection methods [45, 46] can barely expose plagiarism in graphic designs, which means that
traditional clone forensics cannot be directly applied to perceptual similarity measurement. Notice that
the spatial pyramid matching method [34] developed for website design fails entirely in this application,
because it only focuses on the similarity of the website layout structure; layout similarity measures
alone are not sufficient to detect plagiarism in graphic designs.

To show the effect of the RRM module and the MVE region in the proposed scheme, we evaluated four
different strategies (i.e., non-RRM and non-MVE, RRM without MVE, MVE without RRM, and RRM with MVE). The
experimental results in Table 4 demonstrate the importance of using both the RRM module and the MVE region
when evaluating the similarity of graphic designs.

                                                 Table 4
                            Experiment results of different scheme strategies.
  Strategy                           Top-10 Accuracy%              Top-20 Accuracy%             NDCG

  non-RRM and non-MVE                63.77                         47.63                        0.74

  RRM without MVE                    70.91                         56.44                        0.80

  MVE without RRM                    75.45                         60.72                        0.88

  RRM with MVE                       78.36                         63.38                        0.94

Note that using MVE is more important than using RRM. This is because the MVE regions enable the
similarity evaluation algorithm to notice insignificant regions, which is essential for higher-dimensional
similarity calculations according to the aesthetic rules of designs. Insignificant regions may contain
important elements that affect the similarity of graphic designs, such as painting style, texture, and
composition.

5. Conclusions
Similarity studies of designs and artworks are rare due to their highly abstract and aesthetic features.
We propose to analyze the similarity of graphic designs in the cognitive dimension through visual saliency
features. A novel visual saliency feature extraction network based on the GAN model is developed. The RRM
module enables the proposed network to simultaneously capture the global high-level semantic information
and low-level detail information of graphic designs. Finally, since the segmented image already has a
relatively obvious element relationship structure, a multi-weight similarity measure based on SSIM is
developed. Our scheme currently has some limitations. The optimization and learning process is not
efficient enough for real-time interaction, and predicting element importance is a time-consuming
operation; investigating simpler importance models is possible future work. In addition, the performance
of similarity evaluation still has considerable room for improvement.

Declarations
Acknowledgements

This work was supported in part by the National Social Science Foundation of China (21BG131).

Conflict of interest

Not applicable.

Ethical approval

Not applicable.

Competing interests

The authors declare that they have no conflicts of interest related to this work.

Authors' contributions

All authors have contributed equally to this work.

Funding

This work was supported in part by the National Social Science Foundation of China (21BG131).

Availability of data and materials

All data used to support the findings of this study are included within the article (Data sets used can be
accessed from [38] and [43]).

References
  1. A. Borji and L. Itti, "State-of-the-Art in Visual Attention Modeling," IEEE Transactions on Pattern
     Analysis and Machine Intelligence, vol. 35, no. 1, pp. 185-207, 2013, doi: 10.1109/TPAMI.2012.89.
  2. Z. Niu, G. Zhong, and H. Yu, "A review on the attention mechanism of deep learning,"
    Neurocomputing, vol. 452, pp. 48-62, 2021.
  3. J. Ross, R. Simpson, and B. Tomlinson, "Media richness, interactivity and retargeting to mobile
    devices: a survey," International Journal of Arts and Technology, vol. 4, no. 4, pp. 442-459, 2011.
  4. A. Garg, A. Negi, and P. Jindal, "Structure preservation of image using an efficient content-aware
    image retargeting technique," Signal, Image and Video Processing, vol. 15, no. 1, pp. 185-193, 2021.
  5. R. Nasiripour, H. Farsi, and S. Mohamadzadeh, "Visual saliency object detection using sparse
    learning," IET Image Processing, vol. 13, no. 13, pp. 2436-2447, 2019.
  6. L. Shamir, "What makes a Pollock Pollock: a machine vision approach," International Journal of Arts
    and Technology, vol. 8, no. 1, pp. 1-10, 2015.
  7. Y. Liu, D. Zhang, Q. Zhang, and J. Han, "Part-object relational visual saliency," IEEE Transactions on
     Pattern Analysis and Machine Intelligence, 2021.
  8. Y. Yang, Y. Zhang, S. Huang, Y. Zuo, and J. Sun, "Infrared and visible image fusion using visual
     saliency sparse representation and detail injection model," IEEE Transactions on Instrumentation and
    Measurement, vol. 70, pp. 1-15, 2020.

9. Y. Zhu, G. Zhai, Y. Yang, H. Duan, X. Min, and X. Yang, "Viewing behavior supported visual saliency
    predictor for 360 degree videos," IEEE Transactions on Circuits and Systems for Video Technology,
   vol. 32, no. 7, pp. 4188-4201, 2021.
10. C. Zhang, Y. He, Q. Tang, Z. Chen, and T. Mu, "Infrared Small Target Detection via Interpatch
   Correlation Enhancement and Joint Local Visual Saliency Prior," IEEE Transactions on Geoscience
   and Remote Sensing, vol. 60, pp. 1-14, 2021.
11. B. Yang, L. Wei, and Z. Pu, "Measuring and Improving User Experience Through Artificial Intelligence-
   Aided Design," (in English), Frontiers in Psychology, vol. 11, no. 3, 2020, doi:
   10.3389/fpsyg.2020.595374.
12. N. Farhan, M. Abdulmunem, and M. a. Abid-Ali, Image Plagiarism System for Forgery Detection in
   Maps Design. 2019, pp. 51-56.
13. B. Yang, "Perceptual similarity measurement based on generative adversarial neural networks in
   graphics design," Applied Soft Computing, vol. 110, p. 107548, 2021/10/01/ 2021, doi:
   https://doi.org/10.1016/j.asoc.2021.107548.
14. Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality
    assessment," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003,
   vol. 2: IEEE, pp. 1398-1402.
15. L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis,"
    IEEE Transactions on pattern analysis and machine intelligence, vol. 20, no. 11, pp. 1254-1259, 1998.
16. J. K. Tsotsos, S. M. Culhane, W. Y. Kei Wai, Y. Lai, N. Davis, and F. Nuflo, "Modeling visual attention
    via selective tuning," Artificial Intelligence, vol. 78, no. 1, pp. 507-545, 1995/10/01/ 1995, doi:
   https://doi.org/10.1016/0004-3702(95)00025-9.
17. L. Marchesotti, C. Cifarelli, and G. Csurka, "A framework for visual saliency detection with
   applications to image thumbnailing," in 2009 IEEE 12th International Conference on Computer Vision,
   2009: IEEE, pp. 2232-2239.
18. C. Xia, F. Qi, and G. Shi, "Bottom–up visual saliency estimation with deep autoencoder-based sparse
   reconstruction," IEEE transactions on neural networks and learning systems, vol. 27, no. 6, pp. 1227-
   1240, 2016.
19. J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," Advances in neural information
    processing systems, vol. 19, 2006.
20. H. Rezazadegan Tavakoli, E. Rahtu, and J. Heikkilä, "Fast and efficient saliency detection using
   sparse sampling and kernel density estimation," in Scandinavian conference on image analysis,
   2011: Springer, pp. 666-675.
21. A. Borji, "Boosting bottom-up and top-down visual features for saliency estimation," in 2012 ieee
    conference on computer vision and pattern recognition, 2012: IEEE, pp. 438-445.
22. R. Liu, J. Cao, Z. Lin, and S. Shan, "Adaptive partial differential equation learning for visual saliency
    detection," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014,
   pp. 3866-3873.
23. G. Li and Y. Yu, "Visual saliency detection based on multiscale deep CNN features," IEEE transactions
    on image processing, vol. 25, no. 11, pp. 5012-5024, 2016.
24. W. Wang and J. Shen, "Deep visual attention prediction," IEEE Transactions on Image Processing, vol.
   27, no. 5, pp. 2368-2378, 2017.
25. M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, "Predicting human eye fixations via an lstm-based
   saliency attentive model," IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 5142-5154,
   2018.
26. H. T. H. Phan, A. Kumar, D. Feng, M. Fulham, and J. Kim, "Unsupervised Two-Path Neural Network for
    Cell Event Detection and Classification Using Spatiotemporal Patterns," IEEE Transactions on
   Medical Imaging, vol. 38, no. 6, pp. 1477-1487, 2019, doi: 10.1109/tmi.2018.2885572.
27. O. Sbai, M. Elhoseiny, A. Bordes, Y. Lecun, and C. Couprie, "DeSIGN: Design Inspiration from
   Generative Networks," 04/03 2018.
28. A. Elgammal, B. Liu, M. Elhoseiny, and M. Mazzone, "CAN: Creative Adversarial Networks, Generating
    "Art" by Learning About Styles and Deviating from Style Norms," in the eighth International
   Conference on Computational Creativity (ICCC), held in Atlanta, GA, June 20th-June 22nd 2017.
   [Online]. Available: https://arxiv.org/abs/1706.07068. [Online]. Available:
   https://arxiv.org/abs/1706.07068
29. M. Andries, A. Dehban, and J. Santos-Victor, "Automatic Generation of Object Shapes With Desired
    Affordances Using Voxelgrid Representation," Frontiers in Neurorobotics, vol. 14, 05/14 2020, doi:
   10.3389/fnbot.2020.00022.
30. J. Pan et al., "Salgan: Visual saliency prediction with generative adversarial networks," arXiv preprint
   arXiv:1701.01081, 2017.
31. X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand, "Basnet: Boundary-aware salient
   object detection," in Proceedings of the IEEE/CVF conference on computer vision and pattern
   recognition, 2019, pp. 7479-7489.
32. Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand, "What do different evaluation metrics tell us
   about saliency models?," IEEE transactions on pattern analysis and machine intelligence, vol. 41, no.
   3, pp. 740-757, 2018.
33. L. Garrett and A. Robinson, "Spot the Difference! Plagiarism identification in the visual arts," 2012.
34. A. S. Bozkr and E. A. Sezer, "SimiLay: A Developing Web Page Layout Based Visual Similarity Search
   Engine," in 10th International Conference on Machine Learning and Data Mining MLDM 2014, 2014.
35. A. Álvarez and T. Ritchey, "Applications of general morphological analysis," Acta Morphologica
    Generalis, vol. 4, no. 1, 2015.
36. E. Cetinic, T. Lipic, and S. Grgic, "Fine-tuning convolutional neural networks for fine art classification,"
    Expert Systems with Applications, vol. 114, pp. 107-118, 2018.
37. Y. Lang, Y. He, F. Yang, J. Dong, and H. Xue, "Which is plagiarism: Fashion image retrieval based on
    regional representation for design protection," in Proceedings of the IEEE/CVF Conference on
   Computer Vision and Pattern Recognition, 2020, pp. 2595-2604.
38. S. Cui, F. Liu, T. Zhou, and M. Zhang, "Understanding and Identifying Artwork Plagiarism with the
    Wisdom of Designers: A Case Study on Poster Artworks," in Proceedings of the 30th ACM
   International Conference on Multimedia, 2022, pp. 1117-1127.
39. C. Huo, Z. Zhou, K. Ding, and C. Pan, "Online Target Recognition for Time-Sensitive Space
    Information Networks," IEEE Transactions on Computational Imaging, vol. 3, no. 2, pp. 254-263, 2017,
   doi: 10.1109/TCI.2017.2655448.
40. P. O’Donovan, A. Agarwala, and A. Hertzmann, "Learning Layouts for Single-Page Graphic Designs,"
   IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 8, pp. 1200-1213, 2014, doi:
   10.1109/TVCG.2014.48.
41. T. Judd, K. Ehinger, F. Durand, and A. Torralba, "Learning to predict where humans look," in 2009 IEEE
   12th international conference on computer vision, 2009: IEEE, pp. 2106-2113.
42. Y. Zhu, C. Chen, G. Yan, Y. Guo, and Y. Dong, "AR-Net: Adaptive attention and residual refinement
   network for copy-move forgery detection," IEEE Transactions on Industrial Informatics, vol. 16, no. 10,
   pp. 6714-6723, 2020.
43. Z. Bylinskii et al., "Learning visual importance for graphic designs and data visualizations," in
    Proceedings of the 30th Annual ACM symposium on user interface software and technology, 2017,
   pp. 57-69.
44. C. Distinguishability, "A Theoretical Analysis of Normalized Discounted Cumulative Gain (NDCG)
    Ranking Measures," 2013.
45. B. Yang, X. Sun, H. Guo, Z. Xia, and X. Chen, "A copy-move forgery detection method based on CMFD-
    SIFT," Multimedia Tools and Applications, journal article vol. 77, no. 1, pp. 837-855, 2018, doi:
   10.1007/s11042-016-4289-y.
46. J.-L. Zhong and C.-M. Pun, "An End-to-End Dense-InceptionNet for Image Copy-Move Forgery
   Detection," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 2134-2146, 2020,
   doi: 10.1109/TIFS.2019.2957693.

Figures

Figure 1

Two graphic posters with similar design concepts and layout structures. Available:
https://ent.sina.cn/tv/jp_kr/2022-09-05/detail-imizmscv9236093.d.html

Figure 2

Posters suspected of plagiarism. Available: https://www.shrx.org/plus/view-60913-1.html

Figure 3

The proposed scheme.

Figure 4

DIM-HV generation process. The final DIM-HV (bottom right image) is the normalization of the 8
responses (black background images).

Figure 5

The architecture of the proposed VSFGAN.

Figure 6

Poster samples in the Plagiarized Poster dataset and layout design samples in the Graphic Design
Importance dataset. (a) Plagiarized Poster dataset, (b) Graphic Design Importance dataset.
