Similarity Evaluation of Graphic Design Based on Deep Visual Saliency Features
Zhuohua Liu, Guangdong Mechanical & Electrical Polytechnic
Jingrui An, Eindhoven University of Technology
Caijuan Huang, Guangdong Mechanical & Electrical Polytechnic
Bin Yang (b.yang@tue.nl), Eindhoven University of Technology

Research Article
Keywords: Similarity Evaluation, Deep Visual Saliency, Graphic Design, Plagiarism Detection
Posted Date: February 7th, 2023
DOI: https://doi.org/10.21203/rs.3.rs-2537865/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract

The creativity of an excellent design work generally comes from the inspiration and innovation of its main visual features. The similarity between the main visual elements is the most important indicator for detecting plagiarism of design concepts, which is important for protecting cultural heritage and copyright. The purpose of this paper is to develop an efficient similarity evaluation scheme for graphic design. A novel deep visual saliency feature extraction generative adversarial network is proposed to deal with the lack of training examples. It consists of two networks: one predicts a visual saliency feature map from an input image; the other takes the output of the first to distinguish whether a visual saliency feature map is a predicted one or ground truth. Different from traditional saliency generative adversarial networks, a residual refinement module is connected after the encoding and decoding network. Design importance maps generated by professional designers are used to guide the network training. A saliency-based segmentation method is developed to not only locate the optimal layout regions but also notice insignificant regions. Priorities are then assigned to different visual elements. Experimental results show that the proposed model obtains state-of-the-art performance among various similarity measurement methods.

1. Introduction

Human beings perceive the external world mainly through information obtained via the auditory, visual, olfactory, gustatory, and tactile sensory pathways. Among all the information-processing subsystems of the brain, the visual processing system occupies the most important position, because more than 70% of external information comes from visual perception. Images have the advantages of intuitive and clear content, easy acquisition, convenient dissemination, and rich information; they are the most important carrier of visual information in human daily activities. With the rapid development of society, breakthroughs in science and technology, and the increasing popularity of the Internet, the means for people to obtain images have become increasingly convenient and flexible, and the amount of image data obtained has also increased rapidly. Thanks to the visual attention mechanism for complex scenes, we can process such a huge amount of information in real time. People can quickly locate the salient or interesting content in a visual scene and process it further, ignoring other inconspicuous or uninteresting content. Cognitive psychologists and neurophysiologists explore the psychological and biophysical essence of the attention principle through human psychological activities and neuroanatomy. Since the 1990s, more and more computer vision studies have focused on the visual attention mechanism. In the cognitive theory of visual attention [1], salience is usually defined as follows: certain parts of the visual scene are intuitively salient relative to their surrounding parts, and these may be certain objects or certain regions.
The purpose of Visual Saliency Detection (VSD) [2] is to find salient regions in the visual scene and estimate their saliency. Since the computing resources needed for image analysis and processing can be allocated preferentially under the guidance of visual saliency, the interference of redundant information can be eliminated, which improves both the speed and the accuracy of computer vision algorithms. VSD has a wide range of applications in many fields, such as image or video compression [3], content-aware image scaling [4], image rendering [5], image retrieval [6], image segmentation [7], target detection and recognition [8], behavior recognition [9], and target tracking [10]. However, the wide sharing and rapid dissemination of design artworks have brought about serious problems of design homogeneity. Since the main visual features of a design reflect the designer's ideas and creativity [11], the similarity evaluation of the main visual features is not only conducive to Content-Based Image Retrieval (CBIR) but also helps to detect plagiarism of design concepts, which is of great significance for cultural protection and copyright protection. The widespread and unprecedented distribution of digital artworks (e.g., posters, illustrations, advertisements) puts them at a higher risk of plagiarism [12]. The plagiarism of designs is often based on other people's ideas (such as layout, form, creative concept, etc.), and it is often carried out by hand drawing. This makes it difficult to describe the similarity between plagiarized designs and to generate quantitative indicators [13]. Figure 1 presents two graphic posters with a similar design concept and layout structure. Although they are completely different at the pixel level, they will still be considered suspected of plagiarism.

In this paper, a novel Visual Saliency Features (VSF) extraction network, named VSFGAN, is proposed. It consists of two networks: one predicts VSF maps from an input image; the other takes the output of the first to discriminate whether a VSF map is a predicted one or ground truth. Different from traditional saliency GANs, we propose a specific loss function for the VSF of designs. The VSF map is segmented based on saliency features guided by aesthetic rules. We apply a diffusion equation to compute the probability maps for non-dominant visual regions. Finally, a multi-weight similarity measure is developed based on SSIM [14]. The highlights of this paper can be summarized as follows:

- A similarity evaluation scheme is proposed for graphic design, which can be used for plagiarism detection.
- A novel VSF extraction model based on GAN is developed. A residual refinement module is connected after the encoding and decoding network, and design importance maps generated by professional designers are used to guide the network training.
- According to aesthetic rules, minor visual element regions should be placed in the non-salient areas of an image, yet they still need to be considered in the similarity calculation of design works. We propose an algorithm to calculate the minor visual element probability map.

The rest of this paper is organized as follows. In Section 2, related works are presented and discussed. The proposed method is presented in Section 3. Experiments and results are shown in Section 4. Finally, the conclusion and future work are presented in Section 5.
2. Related Works

2.1 Visual Saliency Detection

In 1998, a classic saliency computation model based on the neurophysiological mechanism of visual attention and cognitive psychology was proposed by Itti and Koch [15], which laid the foundation for saliency research in computer vision. Since then, the field of visual saliency detection has flourished, bringing computer vision closer to human vision. From the perspective of the information-processing mechanism, VSD methods can be roughly divided into two categories [16]: task-driven Top-Down (TD) models [17] and data-driven Bottom-Up (BU) models [18]. In TD models, saliency mapping is mainly guided by task-specific information or prior knowledge learned from training scenarios [2]. In contrast, BU models are unconscious, guided by underlying visual features present in the visual field such as color, orientation, texture, and intensity, without any specific task guidance. The main difference between the two types of models is whether indicators from volitional tasks or learned priors are considered in the feature integration computation. TD methods generally need a large amount of data containing ground truth for training, or high-level information to guide saliency detection under a specific task. Compared with BU methods, TD methods therefore have greater limitations in application. Previous VSD methods mainly utilize low-level feature contrasts (color, orientation, intensity, etc.) and calculate saliency through linear or nonlinear combinations. With further research on visual saliency detection, new salient features have been used for detection, such as uniqueness, distribution, focus, and objectness. At the same time, more and more frameworks have been introduced, such as saliency detection based on cellular automata.

Itti and Koch [15] attempted to model the bottom-up processing performed by early vision systems to detect salient regions and thus estimate visual fixation locations. The model detects salient regions by using central-peripheral differences in color, brightness, and orientation, and computes a saliency map by linearly combining the resulting feature maps. The three primary features are Gaussian filtered to obtain a multi-scale feature pyramid, and the central-peripheral operation calculates the difference between different scales in each feature dimension. The final saliency map is a grayscale image in which points with high pixel values have high saliency. Harel et al. [19] proposed a Graph-Based Visual Saliency (GBVS) model to improve the model proposed in [15]. Similar to [15], GBVS simulates the visual principle in the feature extraction stage, but it introduces a Markov chain in the process of generating the saliency map to improve the accuracy of saliency detection. The FES model proposed by Tavakoli et al. [20] can be considered a model for simulating visual processing because it also uses a central-peripheral mechanism. FES applies the Bayesian framework to multi-scale central-peripheral analysis, and the required distributions in the Bayesian formula are obtained through sparse sampling and kernel density estimation. Borji [21] combined low-level features such as orientation, color, and intensity, together with saliency maps of previous best bottom-up models, with top-down
cognitive visual features (e.g., faces, humans, cars), and a direct mapping from those features to eye fixations was learned using regression, SVM, and AdaBoost classifiers.

Like many computer vision applications, recent studies have entered the era of using deep learning for feature extraction, and these solutions have greatly improved the performance of VSD. In this paper, we mainly introduce the VSD schemes based on deep learning techniques. Liu et al. [22] assumed the saliency of image elements could be derived from the relevance of the saliency seeds (i.e., the most representative salient elements). In this view, they developed a normal linear elliptic system with a Dirichlet boundary to model the diffusion from seeds to other relevant points. Li and Yu [23] found that a saliency model can be derived from multi-scale features obtained using deep convolutional neural networks. They used fully-connected layers on top of a CNN responsible for extracting features at different levels. Although CNNs have brought substantial improvement in human attention prediction, Wang and Shen [24] further improved CNN-based attention models by efficiently leveraging multi-scale features. Hierarchical saliency information is captured by their visual attention network, from deep coarse layers with global saliency information to shallow fine layers with local saliency responses. In this model, supervision is directly fed into multi-level layers. Cornia et al. [25] used a convolutional Long Short-Term Memory (LSTM) network [26] to iteratively attend to the most salient area of the input and refine the predicted saliency map. Moreover, a set of prior maps generated with Gaussian functions is learned to tackle the center bias typical of human eye fixations. With the rapid development of GAN models [27-29], more and more GAN-based VSD methods have been proposed. Pan et al. [30] proposed a deep CNN for visual saliency prediction, named SalGAN. In the generator, weights are learned by back-propagation from a binary cross-entropy loss over downsampled versions of the saliency maps. The generated predictions are then processed by a trained discriminator network that resolves whether a saliency map was produced by the generator or is the ground truth. Most previous studies aimed at improving the detection accuracy of the salient region. To obtain clear salient object detection boundaries, Qin et al. [31] proposed a hybrid training loss to better preserve the structure of the original image. The architecture consists of a densely supervised encoder-decoder network and a residual refinement module, which are responsible for saliency prediction and saliency map refinement, respectively.

With the development of visual saliency detection research, many image databases and evaluation metrics have been published to evaluate a saliency model's ability to predict where humans look in images, and they have been widely used by researchers. The proposal of different saliency detection databases and evaluation metrics also promotes the development of visual saliency detection. Bylinskii et al. [32] provided an analysis of 8 different evaluation metrics and their properties. With the help of systematic experiments and visualization of metric computations, interpretability of saliency scores and transparency were added to the evaluation
of saliency models. They made recommendations for metric selection under certain assumptions and for specific applications based on the differences in metric properties and behavior.

2.2 Similarity Evaluation of Graphic Designs

The evaluation of the similarity of graphic designs has always been an unavoidable problem in the visual design domain, especially for copyright protection. Most plagiarized works are based on other people's creative ideas and show a certain but not complete visual similarity (such as layout, form, color matching, etc.), and they are often produced by hand. This makes it difficult to describe the similarity between plagiarized designs and to generate quantitative metrics. Since the higher-level similarity of graphic designs generally does not come from directly copying the original image, most traditional image forensics methods also have difficulty detecting such plagiarism. As shown in Fig. 2, the left and right posters are similar in terms of composition, spatial organization, and object properties in each space, but they are not similar in terms of pixels and image features. Most plagiarism identification for graphic designs is based on human observation and comparison, which leads to a high degree of subjectivity in similarity judgments.

With the continuous improvement of computer vision technology, researchers have tried to use computers to calculate the similarity of graphic designs. Garrett and Robinson developed iTrace [33] to explore the possibility of detecting plagiarism in visual works based on image similarity. Bozkır and Sezer [34] evaluated the layout similarity of web pages: spatial pyramid matching was used to classify web page elements, and a histogram intersection model was used to capture and measure the visual similarity of partial and entire page layouts. Morphological analysis [35] is a method based on morphological theory for analyzing target objects. Its principle is to decompose a problem into individual small elements, process and study these independent elements separately, and then arrange and combine them in a network diagram to produce a systematic solution to the problem. Artistic style can be used for image classification [36]: the curvature of lines is used to describe the fluidity of the lines in an image, and color contrast is used to describe the characteristics of the image's color style; similarity rules of artistic style were generated to classify images accordingly. Lang et al. [37] studied the plagiarized clothing retrieval problem. They proposed a novel network called Plagiarized-Search-Net (PS-Net) based on region representations, in which landmarks are utilized to guide the learning of the region representations; the suspected fashion items are then compared region by region. In addition, they proposed a plagiarized fashion database for plagiarized clothes retrieval, which provides a meaningful addition to the existing field of fashion retrieval. Cui et al. [38] elaborated on 8 elements that form unique posters and 6 judgment criteria for plagiarism through an exploratory study with designers. They proposed models leveraging the combination of primary elements and plagiarism criteria to find suspect instances in a retrieval process. The models were trained in the context of modern artwork and evaluated on a poster plagiarism dataset.
Finally, they showed through experiments that the proposed method outperforms the baseline with excellent Top-K accuracy (33%) and retrieval performance (42%).
Although in recent years many scholars have begun to explore higher-dimensional (i.e., cognitive) image similarity calculations, similarity studies on designs and artworks remain rare due to their highly abstract and aesthetic features. In this paper, we propose to analyze the similarity of graphic designs in the cognitive dimension through visual saliency features.

3. Method

Visual saliency detection refers to simulating the human visual attention mechanism through computer vision algorithms, calculating the importance of information in images, and extracting salient regions (regions of interest) [39]. In this paper, we aim to simulate, through deep learning techniques, the visual attention area of people viewing art and design works. Our VSFGAN consists of two networks: one predicts VSF maps from an input image; the other takes the output of the first to discriminate whether a VSF map is a predicted one or ground truth. Different from traditional saliency GANs, we propose a specific loss function for the VSF of designs. Visual elements in the image are assigned different priorities, and secondary visual elements are greatly suppressed. The proposed scheme is shown in Fig. 3.

VSD is the calculation of the visual importance of different elements in natural images or graphic designs. Most previous traditional methods are aimed at visual saliency detection in natural images rather than graphic designs. Although some of the input backgrounds are natural images, the elements in the layout are still considered graphic designs. Therefore, each element of the background image needs to be treated as a graphic design element in the VSD stage.

3.1 Design Importance Map from Human Vision (DIM-HV)

When creating a design, controlling the perceived importance of various elements is crucial, and designers often arrange elements to convey their importance. Color, size, and position all affect the perceived importance of an element in a design, but it is difficult to quantify this importance through mathematical formulas. There is a clear relative difference in the importance of elements in a design: a large graphic in the center will be much more important than a small text in the corner [40]. However, how important is typography to the similarity judgment of the same type of graphic designs? How does the significance of other elements depend on it? The importance of an image region is correlated with its salience. Inspired by Judd et al. [41], the DIM-HV is generated using a data-driven approach (see Fig. 4). First, 500 graphic designs were downloaded from the dataset in [40] and departmental repositories. Second, 8 professional designers were asked to mark the important regions of the graphic designs that could be identified as plagiarism. Third, the responses over all annotators were averaged, and the DIM-HV of each design work was obtained by normalizing the averaged responses. Finally, the DIM-HV of each design work is used as the ground truth mask for our collected experimental images.
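The averaging and normalization in the last two steps are straightforward. The sketch below shows one way to aggregate the designer annotations, assuming each annotator's marks for a design are stored as a grayscale mask image; the directory layout, file format, and function name are illustrative assumptions rather than details from the paper.

```python
import numpy as np
from pathlib import Path
from PIL import Image

def build_dim_hv(annotation_dir):
    """Average the 8 designers' importance masks for one design and
    normalize the result to [0, 1] to obtain its DIM-HV."""
    mask_paths = sorted(Path(annotation_dir).glob("*.png"))
    masks = [np.asarray(Image.open(p).convert("L"), dtype=np.float32) / 255.0
             for p in mask_paths]
    mean_map = np.mean(masks, axis=0)              # average over annotators
    lo, hi = mean_map.min(), mean_map.max()
    return (mean_map - lo) / (hi - lo) if hi > lo else mean_map
```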
3.2 Visual Saliency Features GAN

Many VSD schemes need to train the network with a specifically designed loss function to achieve satisfactory performance. However, it is difficult to measure the saliency effect in a unified way because design ideas are elusive. Inspired by SalGAN [30], we introduce the idea of adversarial generation: instead of focusing on complex loss functions, we expect the generator to produce saliency maps close to the real ones through adversarial training. GAN is an unsupervised learning method in which two neural networks learn by playing a game against each other. It consists of a generative network and a discriminative network. The generator network takes random samples from the latent space as input, and its output needs to mimic the real samples in the training set as closely as possible. The input of the discriminative network is either a real sample or the output of the generative network, and its purpose is to distinguish the output of the generative network from the real samples as well as possible. The generative network, in turn, should deceive the discriminative network as much as possible. The two networks compete against each other and constantly adjust their parameters; the ultimate goal is to make the discriminative network unable to judge whether the output of the generative network is real. Since GAN does not require a large number of training samples, it is suitable for scenarios where plagiarism samples of designs are lacking.

To this end, we propose a novel visual saliency feature extraction network. It consists of two networks: one predicts VSF maps from an input image; the other takes the output of the first to discriminate whether a VSF map is a predicted one or ground truth. The overall model framework is based on the Encoder-Decoder architecture. VSD requires the input and output images to be pixel-aligned and of the same size; therefore, we adopt the scheme of down-sampling first and then up-sampling to restore the output to the same size as the input image. The Encoder-Decoder architecture is based on a pre-trained VGG-16. To enable the proposed network to simultaneously capture the global high-level semantic information and the low-level detail information of graphic designs, a Residual Refinement Module (RRM) [42] is connected after the encoding and decoding network. The RRM is a residual block with spatial attention and is adopted to refine the features effectively; each of its convolution layers consists of 64 3×3 convolution kernels. Downsampling uses max pooling, upsampling uses bilinear interpolation, and the RRM learns the residual between the predicted saliency map and the real saliency map, which is used to further refine the prediction: after adding the residual to the initial visual saliency map, the output is the final visual saliency map. The architecture of the proposed VSFGAN is presented in Fig. 5. Table 1 and Table 2 list the implementation details of the proposed generator and discriminator, respectively.
Table 1 The architectural details of the proposed generator.

layer       depth  kernel  stride  pad  activation
conv 1_1    64     1×1     1       1    ReLU
conv 1_2    64     3×3     1       1    ReLU
pool1       -      2×2     2       0    -
conv 2_1    128    3×3     1       1    ReLU
conv 2_2    128    3×3     1       1    ReLU
pool2       -      2×2     2       0    -
conv 3_1    256    3×3     1       1    ReLU
conv 3_2    256    3×3     1       1    ReLU
conv 3_3    256    3×3     1       1    ReLU
pool3       -      2×2     2       0    -
conv 4_1    512    3×3     1       1    ReLU
conv 4_2    512    3×3     1       1    ReLU
conv 4_3    512    3×3     1       1    ReLU
pool4       -      2×2     2       0    -
conv 5_1    512    3×3     1       1    ReLU
conv 5_2    512    3×3     1       1    ReLU
conv 5_3    512    3×3     1       1    ReLU
conv 6_1    512    3×3     1       1    ReLU
conv 6_2    512    3×3     1       1    ReLU
conv 6_3    512    3×3     1       1    ReLU
upsample6   -      2×2     2       0    -
conv 7_1    512    3×3     1       1    ReLU
conv 7_2    512    3×3     1       1    ReLU
conv 7_3    512    3×3     1       1    ReLU
upsample7   -      2×2     2       0    -
conv 8_1    256    3×3     1       1    ReLU
conv 8_2    256    3×3     1       1    ReLU
conv 8_3    256    3×3     1       1    ReLU
upsample8   -      2×2     2       0    -
conv 9_1    128    3×3     1       1    ReLU
conv 9_2    128    3×3     1       1    ReLU
upsample9   -      2×2     2       0    -
conv 10_1   64     3×3     1       1    ReLU
conv 10_2   64     3×3     1       1    ReLU
output      1      1×1     1       0    sigmoid

Table 2 The architectural details of the proposed discriminator.

layer       depth  kernel  stride  pad  activation
conv 1_1    3      1×1     1       1    ReLU
conv 1_2    32     3×3     1       1    ReLU
pool1       -      2×2     2       0    -
conv 2_1    64     3×3     1       1    ReLU
conv 2_2    64     3×3     1       1    ReLU
pool2       -      2×2     2       0    -
conv 3_1    64     3×3     1       1    ReLU
conv 3_2    64     3×3     1       1    ReLU
pool3       -      2×2     2       0    -
fc4         100    -       -       -    tanh
fc5         2      -       -       -    tanh
fc6         1      -       -       -    sigmoid
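For concreteness, a PyTorch sketch of the discriminator in Table 2 is given below. The 4-channel input (the RGB design image concatenated with a saliency map, as described in the list of differences that follows), the zero padding on the 1×1 convolution, and the lazily sized first fully connected layer are our own assumptions rather than the authors' verified implementation.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Discriminator following Table 2; the input is assumed to be the RGB
    design image concatenated with a (predicted or ground-truth) saliency map."""
    def __init__(self, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 3, kernel_size=1), nn.ReLU(inplace=True),    # conv 1_1
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),  # conv 1_2
            nn.MaxPool2d(2),                                                    # pool1
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True), # conv 2_1
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True), # conv 2_2
            nn.MaxPool2d(2),                                                    # pool2
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True), # conv 3_1
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True), # conv 3_2
            nn.MaxPool2d(2),                                                    # pool3
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(100), nn.Tanh(),   # fc4 (input size inferred at first forward pass)
            nn.Linear(100, 2), nn.Tanh(),    # fc5
            nn.Linear(2, 1), nn.Sigmoid(),   # fc6 -> probability that the map is ground truth
        )

    def forward(self, image, saliency_map):
        x = torch.cat([image, saliency_map], dim=1)   # (B, 4, H, W)
        return self.classifier(self.features(x))
```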
Different from BASNet [31], which focuses on detecting and segmenting salient objects, the goal of the proposed VSD model is to estimate a pixel-level visual saliency map, that is, the saliency value of each pixel lies in the real range [0, 1]. There are several differences between VSFGAN and previous GAN losses:

1. The goal is to generate the actual saliency values instead of producing a realistic image from random noise; in this case, the input to the generator is no longer random noise but a design image.

2. The generator must not only generate a saliency map indistinguishable from the real one, but also make both correspond to the same input; therefore, both the design image and the corresponding design importance map (DIM-HV) are used as the input of the discriminator.

3. When updating the parameters of the generator, a loss function that combines the discriminator error with the cross-entropy relative to the ground truth improves the stability and convergence speed of training.

We use a hybrid loss function for VSFGAN:

L(\Theta) = \alpha L_{BCE}(\Theta) + L_{SSIM}(\Theta)    (1)

where L_{BCE} is the content loss function. A user may notice more than just a single pixel when looking at a design, so it makes more sense to treat each prediction as independent of the others. Thus, the Binary Cross Entropy (BCE) is calculated by averaging the individual BCEs over all pixels:

L_{BCE} = -\frac{1}{N} \sum_{j=1}^{N} \left[ S_j \log(\hat{S}_j) + (1 - S_j) \log(1 - \hat{S}_j) \right]    (2)

where S_j and \hat{S}_j are the ground-truth normalized VSF and the predicted normalized VSF of the input, respectively. L_{SSIM}(\Theta) is the SSIM loss [14], and \Theta represents the parameters of the visual saliency detection network. The SSIM loss captures the structural information of each element in the image. It is a region-level measurement that gives higher weight to element boundaries when the model predicts the same saliency value between pixels, which helps to obtain clear element boundaries in VSF maps. Suppose x = \{x_n \mid n = 1, 2, \dots, N\} and y = \{y_m \mid m = 1, 2, \dots, M\} denote two patches extracted from the VSF maps S_j and \hat{S}_j, respectively. The SSIM loss function is defined as:

L_{SSIM}(\Theta) = 1 - \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}    (3)

where \mu_x, \mu_y and \sigma_x^2, \sigma_y^2 represent the means and variances of x and y, respectively, and \sigma_{xy} is their covariance. We set C_1 = 0.01^2 and C_2 = 0.03^2 based on experimental experience. Experiments show that the model performs best when the hyperparameter \alpha in Eq. (1) is set to 0.005.
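A minimal PyTorch sketch of this hybrid objective is shown below, assuming both the predicted and ground-truth maps are normalized to [0, 1]. The SSIM term is computed globally over the whole map for brevity; the paper's region-level formulation would apply the same expression per local window.

```python
import torch
import torch.nn.functional as F

def ssim_loss(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    # Eq. (3), computed over the whole map instead of local windows.
    mu_x, mu_y = pred.mean(), target.mean()
    var_x, var_y = pred.var(), target.var()
    cov_xy = ((pred - mu_x) * (target - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim

def hybrid_loss(pred, target, alpha=0.005):
    # Eq. (1): weighted pixel-wise BCE (Eq. 2) plus the SSIM loss (Eq. 3).
    bce = F.binary_cross_entropy(pred, target)
    return alpha * bce + ssim_loss(pred, target)
```

During training, the generator would minimize this content loss together with the adversarial term supplied by the discriminator.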
3.3 Saliency-Based Segmentation

After obtaining the saliency map of an image with the visual saliency detection network, the input image should be segmented so that the similarity of different regions can be evaluated. Designers often use grids or rectangular areas to organize elements; observers perceive this structure and associate alignment, grouping, and symmetry with these regions. Global position features can be used to guide the segmentation, including the distance to the 'third lines', the power points (intersections of the third lines), the image center, the boundaries, and the diagonals. To compare design similarities, we estimate layout structures based on the VSF and assign weights according to visual importance.

The Minor Visual Element (MVE) regions should be placed in the non-salient areas of the image according to aesthetic rules. To calculate the optimal layout area, an intuitive method is to exhaustively enumerate the possible positions and sizes of all MVE regions and use the visual saliency values of these areas as the layout score. This method has three shortcomings. First, since there are many inconspicuous background areas whose saliency values are all small and close to each other, it is difficult to determine the most suitable MVE layout area directly from the saliency values. Second, since the saliency value at the edge of the image is usually small, considering only the visual saliency value would place the MVE near the image edge, which violates aesthetic rules and leads to a poor visual presentation. Third, similar saliency values in the background region lead to a huge search space and increase the amount of computation. We propose an algorithm to address these shortcomings. First, a diffusion equation is used to calculate the MVE probability map, which represents the probability of an MVE being laid out at the corresponding position; then a candidate-region generation algorithm is used to obtain the design layout. The diffusion equation used to compute the probability maps for the MVE regions is defined as follows:

PD_{M+1} = PD_M + \theta (dX + dY), \quad dX = c_X \nabla_X(PD_M), \quad dY = c_Y \nabla_Y(PD_M)    (4)

where \nabla_X and \nabla_Y represent the gradients in the horizontal and vertical directions of the pixel, respectively, and c_X and c_Y are the diffusion coefficients in the two directions. The goal of this diffusion equation is to calculate the probability maps of the MVE regions. The diffusion coefficients c_X and c_Y are set to 1 and 0.6 according to aesthetic rules. In the initial probability map, there are many regions with the same probability value; since the diffusion equation considers the visual saliency distribution of the image elements around each possible region, the number of suitable MVE regions is continuously reduced during the iterative process. The iteration stops when the difference between the MVE probability map and the initial saliency map is greater than a threshold.
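A rough NumPy sketch of this diffusion step is shown below. The initialization from the inverted saliency map, the step size theta, and the mean-difference stopping rule are assumptions made only to give a runnable illustration of Eq. (4); the paper does not specify these details.

```python
import numpy as np

def mve_probability_map(saliency, theta=0.2, cx=1.0, cy=0.6,
                        stop_thresh=0.3, max_iter=200):
    """Iteratively diffuse a placement probability map for minor visual elements (Eq. 4)."""
    p = 1.0 - saliency                      # non-salient areas start with high MVE probability
    for _ in range(max_iter):
        dx = cx * np.gradient(p, axis=1)    # horizontal gradient
        dy = cy * np.gradient(p, axis=0)    # vertical gradient
        p = np.clip(p + theta * (dx + dy), 0.0, 1.0)
        # stop once the map has moved far enough away from the initial saliency map
        if np.abs(p - saliency).mean() > stop_thresh:
            break
    return p
```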
In object detection tasks, many methods for generating bounding boxes have been developed. However, most of them do not consider the local relationship between the main visual region and the surrounding image elements, and the generated candidate boxes are hard to apply to design layout segmentation. Here, we use the hierarchical segmentation algorithm in [40] to segment the design image. Different from [40], we use the probability maps of the MVE regions computed by Eq. (4) as the segmentation input. Furthermore, the main visual region is fixed and assigned the highest energy term. The algorithm takes as input a layout, a binary mask for each element, and the element class (graphic or text); the output is a hierarchical segmentation of the design into non-overlapping rectangular regions. Given a rectangular region R, a cut c is defined by a point (x, y) in R that divides the region into two rectangular subregions r1 and r2. The intersection energy penalizes cuts based on the distance to each element's bounding box: cuts closer to an element's center are weighted more, and cuts closer to the region border are given less weight.

F_{int}(c) = \frac{1}{n} \sum_{p \in c} \max_i \left( I_i^p \, dis_i^c(p) \right)^2    (5)

where p \in c ranges over the pixels p along the cut c, I_i^p is an indicator variable indicating whether element i overlaps pixel p, and dis_i^c(p) is the distance of pixel p to the bounding box of element i; this distance depends on the cut type c. A second energy function F_{elm}(c) counts the number of elements of the same class (text or graphic) in the two subregions r1 and r2:

F_{elm}(c) = -(N(r_1) + N(r_2))    (6)

where N(r) is the number of same-class elements in region r, and 0 means no element. The algorithm tends to divide regions evenly about the region center; thus, we normalize the distance of the cut to the center:

F_{cen}(c) = \frac{|c - r_c|}{r_l}    (7)

where r_c is the location of the region center and r_l is the length of the region. A cut c is then evaluated by:

F(c) = v_{int} F_{int}(c) + v_{elm} F_{elm}(c) + v_{cen} F_{cen}(c)    (8)

We set v_{int} = 50, v_{elm} = 100, and v_{cen} = 1 to obtain the best experimental results.
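The sketch below scores a single vertical cut under Eqs. (5)-(8). It is a simplified, self-contained reading of the energy terms: the element representation, the restriction to vertical cuts, and the interpretation of the same-class count N(r) are our assumptions, not the authors' exact formulation.

```python
import numpy as np

def vertical_cut_energy(x_cut, region, elements,
                        v_int=50.0, v_elm=100.0, v_cen=1.0):
    """Score a vertical cut at x_cut inside region = (x0, y0, x1, y1).
    elements: list of dicts {'bbox': (ex0, ey0, ex1, ey1), 'cls': 'text' or 'graphic'}."""
    x0, y0, x1, y1 = region
    # Eq. (5): penalize cutting through elements, weighted by how deep the cut enters the bbox.
    ys = np.arange(y0, y1)
    f_int = 0.0
    for y in ys:
        best = 0.0
        for e in elements:
            ex0, ey0, ex1, ey1 = e['bbox']
            if ex0 <= x_cut <= ex1 and ey0 <= y <= ey1:   # cut crosses this element at row y
                d = min(x_cut - ex0, ex1 - x_cut)         # distance to the nearest bbox edge
                best = max(best, d ** 2)
        f_int += best
    f_int /= max(len(ys), 1)
    # Eq. (6): count same-class elements fully contained in each subregion.
    def n_same(lo, hi):
        kept = [e['cls'] for e in elements if lo <= e['bbox'][0] and e['bbox'][2] <= hi]
        return max([kept.count(c) for c in set(kept)], default=0)
    f_elm = -(n_same(x0, x_cut) + n_same(x_cut, x1))
    # Eq. (7): prefer cuts near the region centre.
    f_cen = abs(x_cut - (x0 + x1) / 2.0) / (x1 - x0)
    # Eq. (8): weighted combination of the three terms.
    return v_int * f_int + v_elm * f_elm + v_cen * f_cen
```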
3.4 Similarity Evaluation

Similarity evaluation computes the similarity distance between feature vectors through a measurement algorithm. Commonly used measurements include Euclidean distance, cosine distance, hash distance, and mutual information, and there are also many image similarity calculation methods based on deep learning models. Because the layout similarity of graphic designs alone is not enough to judge the similarity or plagiarism of the works, it is also necessary to evaluate the similarity of the conceptual design of the main visual region. Since the segmented image already has a relatively obvious element relationship structure, we propose a multi-weight similarity measure based on the Structural SIMilarity (SSIM) index [14]:

S = \frac{1}{C} \sum_{i=1}^{C} W_V(i) \times SSIM(R_i)    (9)

where R_i is the i-th segmented region of the image, SSIM is computed on each segmented region, and C is the number of segmented regions. The weighted VSF of region R_i is calculated by the function W_V(x):

W_V(x) = \frac{w_x \times h_x \times VSF_x}{W \times H}    (10)

where VSF_x is the normalized VSF of R_x, w_x and h_x are the width and height of the segmented region R_x (the segmented region is a rectangle), respectively, and W and H are the width and height of the image.
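A compact sketch of this measure is shown below, using scikit-image's SSIM on 8-bit grayscale inputs. It assumes the two designs share the same segmentation (one rectangle list and one normalized VSF value per region); how regions of two different designs are matched is not prescribed here.

```python
from skimage.metrics import structural_similarity

def design_similarity(img_a, img_b, regions, region_vsf):
    """Multi-weight SSIM over segmented regions (Eqs. 9-10).
    img_a, img_b: uint8 grayscale images of the same size.
    regions: list of (x, y, w, h) rectangles; region_vsf: normalized VSF per region."""
    H, W = img_a.shape[:2]
    score = 0.0
    for (x, y, w, h), vsf in zip(regions, region_vsf):
        ra = img_a[y:y + h, x:x + w]
        rb = img_b[y:y + h, x:x + w]
        weight = (w * h * vsf) / (W * H)                                 # Eq. (10)
        score += weight * structural_similarity(ra, rb, data_range=255)  # per-region SSIM
    return score / len(regions)                                          # Eq. (9)
```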
4. Evaluation and Discussion

4.1 Experimental Setup

Two datasets are used to train and test the proposed scheme. 1) The Plagiarized Poster dataset [38], which contains 22,624 images with 224 query images; each poster has an average of 4.92 plagiarized designs. It is used to train and test the similarity evaluation ability of our scheme with respect to visual saliency in poster design. 2) The Graphic Design Importance dataset by O'Donovan et al. [43], which comes with importance annotations for 1,078 graphic designs from Flickr. It is used to test the similarity evaluation ability of our scheme with respect to layout in poster design. Some samples are shown in Fig. 6. Experiments were run on a PC server with two Nvidia GeForce GTX TitanX GPUs. The proposed models are trained with a learning rate of 0.0002; 70% of the samples were selected as the training set, and the rest were used for testing. As few studies have addressed the plagiarism issue for graphic designs, we compared our scheme to the method proposed in [38], which focuses on retrieving plagiarized posters. To make the test results clearer and more comparable, test metrics similar to those in [38] are chosen:

Top-K Accuracy: the number of correctly retrieved plagiarized samples within the top K ranked images. K is set to 10 and 20.

Normalized Discounted Cumulative Gain (NDCG) [44]: used to measure and evaluate the accuracy of retrieval algorithms and their ranking results.

4.2 Evaluation Results

We implemented six related image similarity measurement methods for comparison. Two of them [45, 46] focus on copy-move forgery detection, since plagiarism detection can be considered a kind of clone forensics. The other methods [13], [34], [37], [38] were developed for artwork plagiarism detection. The experimental results on the Plagiarized Poster dataset [38] are shown in Table 3.

Table 3 Experiment results.

Method                          Focus on         Top-10 Accuracy%   Top-20 Accuracy%   NDCG
SIFT-based [45]                 Image clone      12.52              8.45               0.55
Dense Inception Net [46]        Image clone      14.26              10.32              0.60
VAE-WGAN [13]                   Logo design      40.34              31.60              0.73
Spatial pyramid matching [34]   Website design   0                  0                  0
Plagiarized-Search-Net [37]     Clothes design   36.33              28.47              0.67
Conceptual filtering [38]       Poster design    67.45              48.14              0.92
Ours                            Graphic design   78.36              63.38              0.94

Our proposed method achieves the best performance on all metrics (Top-10 Accuracy of 78.36%, Top-20 Accuracy of 63.38%, and NDCG of 0.94). This benefits from the use of VSFGAN and the saliency-based segmentation algorithm. Since GAN does not require a large number of training samples, it is suitable for scenarios where plagiarism samples of designs are lacking; as long as an image database and a training loss function are available, the applicability of this generative adversarial mode is greatly improved. That is why the VAE-WGAN based method [13] can still handle some types of plagiarism in artworks, although it was developed for computing the cognitive similarity of graphic logos. The two copy-move forgery detection methods [45, 46] can barely expose plagiarism in graphic designs, which means that traditional clone forensics cannot be directly applied to perceptual similarity measurement. Notice that the spatial pyramid matching method [34], developed for website design, failed in this application because it only focuses on the similarity of the website layout structure; layout similarity measures alone are not sufficient to detect plagiarism in graphic designs.
To show the effect of using the RRM module and the MVE regions in the proposed scheme, we tried four different strategies (i.e., non-RRM and non-MVE, RRM without MVE, MVE without RRM, and RRM with MVE) to evaluate our method. The experimental results shown in Table 4 demonstrate the importance of using the RRM module and the MVE regions for evaluating the similarity of graphic designs.

Table 4 Experiment results of different scheme strategies.

Strategy               Top-10 Accuracy%   Top-20 Accuracy%   NDCG
non-RRM and non-MVE    63.77              47.63              0.74
RRM without MVE        70.91              56.44              0.80
MVE without RRM        75.45              60.72              0.88
RRM with MVE           78.36              63.38              0.94

Note that using MVE is more important than using RRM. This is because the MVE regions enable the similarity evaluation algorithm to notice insignificant regions, which is essential for higher-dimensional similarity calculations according to the aesthetic rules of designs. Insignificant regions may contain important elements that affect the similarity of graphic designs, such as painting style, texture, and composition.

5. Conclusions

Similarity studies of designs and artworks are rare due to their highly abstract and aesthetic features. We propose to analyze the similarity of graphic designs in the cognitive dimension through visual saliency features. A novel visual saliency feature extraction network based on the GAN model is developed. The RRM module enables the proposed network to simultaneously capture global high-level semantic information and low-level detail information of graphic designs. Finally, since the segmented image already has a relatively obvious element relationship structure, a multi-weight similarity measure based on SSIM is developed. There are currently some limitations to our scheme. Our optimization and learning process is not efficient enough for real-time interaction; predicting element importance is currently a time-consuming operation, and investigating simpler importance models is possible future work. In addition, the performance of similarity evaluation still has considerable room for improvement.

Declarations

Acknowledgements
This work was supported in part by the National Social Science Foundation of China (21BG131).

Conflict of interest
Not applicable.
Ethical approval
Not applicable.

Competing interests
The authors declared that they have no conflicts of interest to this work.

Authors' contributions
All authors have contributed equally to this work.

Funding
This work was supported in part by the National Social Science Foundation of China (21BG131).

Availability of data and materials
All data used to support the findings of this study are included within the article (datasets used can be accessed from [38] and [43]).

References

1. A. Borji and L. Itti, "State-of-the-Art in Visual Attention Modeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 185-207, 2013, doi: 10.1109/TPAMI.2012.89.
2. Z. Niu, G. Zhong, and H. Yu, "A review on the attention mechanism of deep learning," Neurocomputing, vol. 452, pp. 48-62, 2021.
3. J. Ross, R. Simpson, and B. Tomlinson, "Media richness, interactivity and retargeting to mobile devices: a survey," International Journal of Arts and Technology, vol. 4, no. 4, pp. 442-459, 2011.
4. A. Garg, A. Negi, and P. Jindal, "Structure preservation of image using an efficient content-aware image retargeting technique," Signal, Image and Video Processing, vol. 15, no. 1, pp. 185-193, 2021.
5. R. Nasiripour, H. Farsi, and S. Mohamadzadeh, "Visual saliency object detection using sparse learning," IET Image Processing, vol. 13, no. 13, pp. 2436-2447, 2019.
6. L. Shamir, "What makes a Pollock Pollock: a machine vision approach," International Journal of Arts and Technology, vol. 8, no. 1, pp. 1-10, 2015.
7. Y. Liu, D. Zhang, Q. Zhang, and J. Han, "Part-object relational visual saliency," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
8. Y. Yang, Y. Zhang, S. Huang, Y. Zuo, and J. Sun, "Infrared and visible image fusion using visual saliency sparse representation and detail injection model," IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1-15, 2020.
9. Y. Zhu, G. Zhai, Y. Yang, H. Duan, X. Min, and X. Yang, "Viewing behavior supported visual saliency predictor for 360 degree videos," IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 7, pp. 4188-4201, 2021.
10. C. Zhang, Y. He, Q. Tang, Z. Chen, and T. Mu, "Infrared Small Target Detection via Interpatch Correlation Enhancement and Joint Local Visual Saliency Prior," IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1-14, 2021.
11. B. Yang, L. Wei, and Z. Pu, "Measuring and Improving User Experience Through Artificial Intelligence-Aided Design," (in English), Frontiers in Psychology, vol. 11, no. 3, 2020, doi: 10.3389/fpsyg.2020.595374.
12. N. Farhan, M. Abdulmunem, and M. A. Abid-Ali, Image Plagiarism System for Forgery Detection in Maps Design. 2019, pp. 51-56.
13. B. Yang, "Perceptual similarity measurement based on generative adversarial neural networks in graphics design," Applied Soft Computing, vol. 110, p. 107548, 2021, doi: https://doi.org/10.1016/j.asoc.2021.107548.
14. Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2: IEEE, pp. 1398-1402.
15. L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, 1998.
16. J. K. Tsotsos, S. M. Culhane, W. Y. Kei Wai, Y. Lai, N. Davis, and F. Nuflo, "Modeling visual attention via selective tuning," Artificial Intelligence, vol. 78, no. 1, pp. 507-545, 1995, doi: https://doi.org/10.1016/0004-3702(95)00025-9.
17. L. Marchesotti, C. Cifarelli, and G. Csurka, "A framework for visual saliency detection with applications to image thumbnailing," in 2009 IEEE 12th International Conference on Computer Vision, 2009: IEEE, pp. 2232-2239.
18. C. Xia, F. Qi, and G. Shi, "Bottom-up visual saliency estimation with deep autoencoder-based sparse reconstruction," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 6, pp. 1227-1240, 2016.
19. J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," Advances in Neural Information Processing Systems, vol. 19, 2006.
20. H. Rezazadegan Tavakoli, E. Rahtu, and J. Heikkilä, "Fast and efficient saliency detection using sparse sampling and kernel density estimation," in Scandinavian Conference on Image Analysis, 2011: Springer, pp. 666-675.
21. A. Borji, "Boosting bottom-up and top-down visual features for saliency estimation," in 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012: IEEE, pp. 438-445.
22. R. Liu, J. Cao, Z. Lin, and S. Shan, "Adaptive partial differential equation learning for visual saliency detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3866-3873.
23. G. Li and Y. Yu, "Visual saliency detection based on multiscale deep CNN features," IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5012-5024, 2016.
24. W. Wang and J. Shen, "Deep visual attention prediction," IEEE Transactions on Image Processing, vol. 27, no. 5, pp. 2368-2378, 2017.
25. M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, "Predicting human eye fixations via an LSTM-based saliency attentive model," IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 5142-5154, 2018.
26. H. T. H. Phan, A. Kumar, D. Feng, M. Fulham, and J. Kim, "Unsupervised Two-Path Neural Network for Cell Event Detection and Classification Using Spatiotemporal Patterns," IEEE Transactions on Medical Imaging, vol. 38, no. 6, pp. 1477-1487, 2019, doi: 10.1109/tmi.2018.2885572.
27. O. Sbai, M. Elhoseiny, A. Bordes, Y. LeCun, and C. Couprie, "DeSIGN: Design Inspiration from Generative Networks," 2018.
28. A. Elgammal, B. Liu, M. Elhoseiny, and M. Mazzone, "CAN: Creative Adversarial Networks, Generating 'Art' by Learning About Styles and Deviating from Style Norms," in the Eighth International Conference on Computational Creativity (ICCC), Atlanta, GA, June 2017. [Online]. Available: https://arxiv.org/abs/1706.07068
29. M. Andries, A. Dehban, and J. Santos-Victor, "Automatic Generation of Object Shapes With Desired Affordances Using Voxelgrid Representation," Frontiers in Neurorobotics, vol. 14, 2020, doi: 10.3389/fnbot.2020.00022.
30. J. Pan et al., "SalGAN: Visual saliency prediction with generative adversarial networks," arXiv preprint arXiv:1701.01081, 2017.
31. X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand, "BASNet: Boundary-aware salient object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7479-7489.
32. Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand, "What do different evaluation metrics tell us about saliency models?," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 3, pp. 740-757, 2018.
33. L. Garrett and A. Robinson, "Spot the Difference! Plagiarism identification in the visual arts," 2012.
34. A. S. Bozkır and E. A. Sezer, "SimiLay: A Developing Web Page Layout Based Visual Similarity Search Engine," in 10th International Conference on Machine Learning and Data Mining (MLDM 2014), 2014.
35. A. Álvarez and T. Ritchey, "Applications of general morphological analysis," Acta Morphologica Generalis, vol. 4, no. 1, 2015.
36. E. Cetinic, T. Lipic, and S. Grgic, "Fine-tuning convolutional neural networks for fine art classification," Expert Systems with Applications, vol. 114, pp. 107-118, 2018.
37. Y. Lang, Y. He, F. Yang, J. Dong, and H. Xue, "Which is plagiarism: Fashion image retrieval based on regional representation for design protection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2595-2604.
38. S. Cui, F. Liu, T. Zhou, and M. Zhang, "Understanding and Identifying Artwork Plagiarism with the Wisdom of Designers: A Case Study on Poster Artworks," in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 1117-1127.
39. C. Huo, Z. Zhou, K. Ding, and C. Pan, "Online Target Recognition for Time-Sensitive Space Information Networks," IEEE Transactions on Computational Imaging, vol. 3, no. 2, pp. 254-263, 2017, doi: 10.1109/TCI.2017.2655448.
40. P. O'Donovan, A. Agarwala, and A. Hertzmann, "Learning Layouts for Single-Page Graphic Designs," IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 8, pp. 1200-1213, 2014, doi: 10.1109/TVCG.2014.48.
41. T. Judd, K. Ehinger, F. Durand, and A. Torralba, "Learning to predict where humans look," in 2009 IEEE 12th International Conference on Computer Vision, 2009: IEEE, pp. 2106-2113.
42. Y. Zhu, C. Chen, G. Yan, Y. Guo, and Y. Dong, "AR-Net: Adaptive attention and residual refinement network for copy-move forgery detection," IEEE Transactions on Industrial Informatics, vol. 16, no. 10, pp. 6714-6723, 2020.
43. Z. Bylinskii et al., "Learning visual importance for graphic designs and data visualizations," in Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, 2017, pp. 57-69.
44. Y. Wang, L. Wang, Y. Li, D. He, W. Chen, and T.-Y. Liu, "A Theoretical Analysis of Normalized Discounted Cumulative Gain (NDCG) Ranking Measures," 2013.
45. B. Yang, X. Sun, H. Guo, Z. Xia, and X. Chen, "A copy-move forgery detection method based on CMFD-SIFT," Multimedia Tools and Applications, vol. 77, no. 1, pp. 837-855, 2018, doi: 10.1007/s11042-016-4289-y.
46. J.-L. Zhong and C.-M. Pun, "An End-to-End Dense-InceptionNet for Image Copy-Move Forgery Detection," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 2134-2146, 2020, doi: 10.1109/TIFS.2019.2957693.

Figures
Figure 1 Two graphic posters with similar design concepts and layout structures. Available: https://ent.sina.cn/tv/jp_kr/2022-09-05/detail-imizmscv9236093.d.html
Figure 2 Posters suspected of plagiarism. Available: https://www.shrx.org/plus/view-60913-1.html
Figure 3 The proposed scheme.

Figure 4 DIM-HV generation process. The final DIM-HV (bottom right image) is the normalization of the 8 responses (black background images).

Figure 5 The architecture of the proposed VSFGAN.

Figure 6 Poster samples in the Plagiarized Poster dataset and layout design samples in the Graphic Design Importance dataset. (a) Plagiarized Poster dataset, (b) Graphic Design Importance dataset.