Automatic detection of regions of interest in complex video sequences

Wilfried Osberger* and Ann Marie Rohaly
Video Business Unit, Tektronix, Inc., Beaverton OR

ABSTRACT

Studies of visual attention and eye movements have shown that people generally attend to only a few areas in typical scenes. These areas are commonly referred to as regions of interest (ROIs). When scenes are viewed with the same context and motivation (e.g., a typical entertainment scenario), these ROIs are often highly correlated amongst different people, motivating the development of computational models of visual attention. This paper describes a novel model of visual attention designed to provide an accurate and robust prediction of a viewer’s locus of attention across a wide range of typical video content. The model has been calibrated and verified using data gathered in an experiment in which the eye movements of 24 viewers were recorded while viewing material from a large database of still (130 images) and video (~13 minutes) scenes. Certain characteristics of the scene content, such as moving objects, people, foreground and centrally-located objects, were found to exert a strong influence on viewers’ attention. The results of comparing model predictions to experimental data demonstrate a strong correlation between the predicted ROIs and viewers’ fixations.

Keywords: Visual attention, regions of interest, ROI, eye movements, human visual system.

1. INTRODUCTION

In order to efficiently process the mass of information presented to it, the resolution of the human retina varies across its spatial extent. High acuity is only available in the fovea, which is approximately 2 deg in diameter. Knowledge of a scene is obtained through regular eye movements which reposition the area under foveal view. These eye movements are by no means random; they are controlled by visual attention mechanisms, which direct the fovea to regions of interest (ROIs) within the scene.
The factors that influence visual attention are considered to be either top-down (i.e., task or context driven) or bottom-up (i.e., stimulus driven). A number of studies have shown that when scenes are viewed with the same context and motivation (e.g., a typical entertainment scenario), ROIs are often highly correlated amongst different people.1,2,3,4 As a result, it is possible to develop computational models of visual attention that can analyze a picture and accurately estimate the location of viewers’ ROIs. Many different applications can make use of such a model,5 including image and video compression, picture quality analysis, image and video databases and advertising.

A number of different models of visual attention have been proposed in the literature (see refs. 5 and 6 for a review). Some of these have been designed for use with simple, artificial scenes (like those used in visual search experiments) and consequently do not perform well on complex scenes. Others require top-down input. Because such information is not available in a typical entertainment video application, these types of models are not considered here. Koch and his colleagues have proposed a number of models for detecting visual saliency. In their most recent work,7 a multi-scale decomposition is performed on the input image and three features – contrast, color and orientation – are extracted. These features are weighted equally and combined to produce a saliency map. The ordering of viewers’ fixations is then estimated using an inhibition-of-return mechanism that suppresses recently fixated locations. Results of this model have so far only been demonstrated for still images.

More recently, object-based models of attention have been proposed.5,8,9,10 This approach is supported by evidence that attention is directed towards objects rather than locations (see ref. 5 for a discussion). The general framework for this class of models is quite similar.
The scene is first segmented into homogeneous objects. Factors that are known to influence visual attention are then calculated for each object in the scene, weighted and combined to produce a map that indicates the likelihood that observers would focus their attention on a particular region. We refer to these maps as Importance Maps (IMs).5 Although the results reported for these models have been promising, a number of issues prevent their use with complex video sequences:

* Further author information: Email: email@example.com
• Most of the models were designed for use with still images and cannot be directly used to predict attention in video sequences.
• The number of features used in the models is often too small to obtain high accuracy predictions across a wide range of complex scene types.
• The relative weightings of the different features are unknown or modeled in an ad hoc manner.

The model described in this paper specifically addresses the above issues. It is based on the framework developed in our previous work5,8,11 with a number of significant changes that improve its robustness and accuracy across a wide range of typical video content. This paper is organized as follows: Section 2 discusses human visual attention, focusing in particular on the different features that have been found to influence attention. In Section 3, the details of the attention model are presented, while the subjective eyetracking experiments used to calibrate and verify the model’s operation are described in Section 4. Results of the model’s performance across a wide range of complex video inputs are summarized in Section 5.

2. FACTORS THAT INFLUENCE VISUAL ATTENTION

Visual search experiments, eye movement studies and other psychophysical and psychological tests have resulted in the identification of a number of factors that influence visual attention and eye movements. These are often categorized as being either top-down (task or context driven) or bottom-up (stimulus driven), although for some factors the distinction may not be so clear. A general observation is that an area or object that stands out from its surroundings with respect to a particular factor is likely to attract attention. Some of the factors that have been found to exert the strongest influence on attention are listed below:

• Motion.
Motion has been found to be one of the strongest influences on visual attention.12 Peripheral vision is highly tuned to detect changes in motion, the result being that attention is involuntarily drawn to peripheral areas exhibiting motion which is distinct from surrounding areas. Areas undergoing smooth, steady motion can be tracked by the eye, and humans cannot tolerate any more distortion in these regions than in stationary regions.13

• Contrast. The human visual system converts luminance into contrast at an early stage of processing. Region contrast is consequently a very strong bottom-up visual attractor.14 Regions that have a high contrast with their surrounds attract attention and are likely to be of greater visual importance.

• Size. Findlay15 has shown that region size also has an important effect on attention. Larger regions are more likely to attract attention than smaller ones. A saturation point exists, however, after which the importance of size levels off.

• Shape. Regions whose shape is long and thin (edge-like) or which have many corners and angles have been found to be visual attractors.3 They are more likely to attract attention than smooth, homogeneous regions of predictable shape.

• Color. Color has also been found to be important in attracting attention.16 A strong influence occurs when the color of a region is distinct from the color of its background. Certain specific colors (e.g., red) have been shown to attract attention more than others.

• Location. Eyetracking experiments have shown that viewers’ fixations are directed at the central 25% of a frame for a majority of viewing material.17,18

• Foreground / Background. Viewers are more likely to be attracted to objects in the foreground than those in the background.4

• People. Many studies1,19 have shown that viewers are drawn to people in a scene, in particular their faces, eyes, mouths and hands.

• Context.
Viewers’ eye movements can be dramatically changed depending on the instructions they are given prior to or during the observation of an image.1,14

Other bottom-up factors that have been found to influence attention include brightness, orientation, edges and line ends. Although many factors that influence visual attention have been identified, little quantitative data exists regarding the exact weighting of the different factors and their inter-relationships. Some factors are clearly of very high importance (e.g., motion), but it is difficult to determine exactly the relative importance of one factor vs. another. To answer this question, we have
performed an eye movement test with a large number of viewers and a wide range of stimulus material (see Section 4). The individual factors used in the visual attention model were correlated to viewers’ fixations in order to determine the relative influence of each factor on eye movements. This provided a set of factor weightings which were then used to calibrate the model. Details of this process are contained in Section 4 of the paper.

3. MODEL DESCRIPTION

An overall block diagram showing the operation of the attention model is shown in Figure 1. While the general structure is similar to that reported previously,5,8,11 the algorithms and techniques used within each of the model components have been changed significantly, resulting in considerable improvements in accuracy and robustness. In addition, new features such as camera motion estimation, color and skin detection have been added.

[Figure 1: Block diagram of the visual attention model]
[Figure 2: Size importance factor: I_size (from 0 to 1) as a function of size(R_i)/size(frame), with breakpoints th_size1 through th_size4]

3.1 Segmentation

The spatial part of the model is represented by the upper branch in Figure 1. The original frame is first segmented into homogeneous regions. A recursive split-and-merge technique is used to perform the segmentation. Both the graylevel (gl) and the color (in L*u*v* coordinates) are used in determining the split and merge conditions. The condition for the recursive split is:

    If: (var(R_i(gl)) > th_splitlum) AND (var(R_i(col)) > th_splitcol),
    Then: split R_i into 4 quadrants, and recursively split each,

where var = variance and var(R_i(col)) = sqrt(var(R_i(u*))^2 + var(R_i(v*))^2). Values of the thresholds that have been found to provide good results are th_splitlum = 250 and th_splitcol = 120. For the region merging, both the mean and the variance of the region’s graylevel and color are used to determine whether two regions should be merged.
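The recursive split test above can be sketched as follows. This is an illustrative implementation rather than the authors’ code: the leaf bookkeeping, the 1-pixel stopping size and all function names are assumptions.

```python
import numpy as np

TH_SPLITLUM = 250.0  # th_splitlum from the text
TH_SPLITCOL = 120.0  # th_splitcol from the text

def color_variance(u_star, v_star):
    # var(R(col)) = sqrt(var(R(u*))^2 + var(R(v*))^2)
    return np.hypot(np.var(u_star), np.var(v_star))

def recursive_split(gl, u_star, v_star, regions, y0=0, x0=0):
    """Recursively split a block into quadrants while both the luminance
    and the color variance exceed their thresholds; leaves are collected
    in `regions` as (y, x, height, width) tuples (a bookkeeping choice
    not specified in the paper)."""
    h, w = gl.shape
    if (h <= 1 or w <= 1 or
            np.var(gl) <= TH_SPLITLUM or
            color_variance(u_star, v_star) <= TH_SPLITCOL):
        regions.append((y0, x0, h, w))  # homogeneous enough: stop splitting
        return
    hy, hx = h // 2, w // 2
    for dy, dx, sy, sx in ((0, 0, hy, hx), (0, hx, hy, w - hx),
                           (hy, 0, h - hy, hx), (hy, hx, h - hy, w - hx)):
        recursive_split(gl[dy:dy + sy, dx:dx + sx],
                        u_star[dy:dy + sy, dx:dx + sx],
                        v_star[dy:dy + sy, dx:dx + sx],
                        regions, y0 + dy, x0 + dx)
```

A uniform block is returned as a single region, while a high-variance block is split down to its leaves; the merge pass described next would then recombine compatible neighbors.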
The merge condition for testing whether two regions R_1 and R_2 should be merged is:

    If: ((var(R_12(gl)) < th_mergelum) AND (var(R_12(col)) < th_mergecol) AND
         (Δlum < th_meanmergelum) AND (Δcol < th_meanmergecol))
        OR ((Δlum < th_lumlow) AND (Δcol < th_collow)),
    Then: combine the two regions into one,
    Else: keep the regions separate,

where Δlum = |mean(R_1(gl)) − mean(R_2(gl))| and Δcol = sqrt((mean(R_1(u*)) − mean(R_2(u*)))^2 + (mean(R_1(v*)) − mean(R_2(v*)))^2). The thresholds th_mergelum and th_mergecol are adaptive and increase as the size of the regions being merged increases. The thresholds th_meanmergelum and th_meanmergecol are also adaptive and depend upon a region’s luminance and color. Following the split-and-merge procedure, regions that have a size less than 64 pixels (for a 512 x 512 image) are merged with the neighboring region having the closest luminance. This process removes the large number of small regions that may be present after the split-and-merge.

3.2 Spatial factors

The segmented frame is then analyzed with respect to a number of different factors that are known to influence attention (see Section 2), resulting in an importance map for each factor. Seven different attentional factors are used in the current implementation of the spatial model.

• Contrast of region with its surround. Regions that have a high luminance contrast with their local surrounds are known to attract visual attention. The contrast importance I_cont of a region R_i is calculated as:

    I_cont(R_i) = [ Σ_{j=1..J} |mean(R_i(gl)) − mean(R_j(gl))| · min(k_border · B_ij, size(R_j)) ] / [ Σ_{j=1..J} min(k_border · B_ij, size(R_j)) ],

where j = 1..J are the regions that share a 4-connected border with R_i, k_border is a constant that limits the extent of influence of neighbors (set to 10) and B_ij is the number of pixels in R_j that share a 4-connected border with R_i. Improved results can be achieved when the contrast is scaled to account for Weber and deVries-Rose conditions. I_cont is then scaled to the range 0-1. This is done in an adaptive manner, so the contrast importance for a region of a particular contrast is reduced in frames that have many high contrast regions and increased in frames where the highest contrast is low.

• Size of region. Large objects have been found to be visual attractors.
A saturation point exists, however, after which further size increases no longer increase the likelihood of the object attracting attention. The effect of this is illustrated in Figure 2. Parameter values that have been found to work well are th_size1 = 0, th_size2 = 0.01, th_size3 = 0.05 and th_size4 = 0.50.

• Shape of region. Areas with an unusual shape or areas with a long and thin shape have been identified as attractors of attention. Importance due to region shape is calculated as:

    I_shape(R_i) = k_shape · bp(R_i)^pow_shape / size(R_i),

where bp(R_i) is the number of pixels in R_i that share a 4-connected border with another region, pow_shape is used to increase the size-invariance of I_shape (set to 1.75) and k_shape is an adaptive scaling constant that reduces the shape importance for regions with many neighbors. I_shape is then scaled to the range 0-1. As with I_cont, this is done in an adaptive manner, so the shape importance of a particular region is reduced in frames that have many regions of high shape importance and increased in frames where the highest shape importance is low.

• Location of region. Several studies have shown a viewer preference for centrally-located objects. As a result, the location importance is calculated using four zones within the frame whose importance gradually decreases with distance from the center. This is shown graphically in Figure 3. The location importance for each region is calculated as:

    I_location(R_i) = [ Σ_{z=1..4} w_z · numpix(R_iz) ] / size(R_i),

where numpix(R_iz) is the number of pixels in region R_i that are located in zone z, and w_z are the zone weightings (values given in Figure 3).
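As an illustration, the location importance above reduces to a weighted pixel count. The construction of the zone map from the four nested rectangles of Figure 3 is assumed rather than reproduced, and all names here are hypothetical.

```python
import numpy as np

ZONE_WEIGHTS = [1.0, 0.7, 0.4, 0.0]  # w_1..w_4 from Figure 3

def location_importance(region_mask, zone_map):
    """I_location(R_i) = sum_z w_z * numpix(R_iz) / size(R_i).

    `region_mask` is a boolean image selecting the region's pixels;
    `zone_map` assigns each pixel a zone index 0..3 (0 = central zone)."""
    zones = zone_map[region_mask]             # zone index of each region pixel
    weights = np.take(ZONE_WEIGHTS, zones)    # w_z per pixel
    return weights.sum() / region_mask.sum()  # normalize by region size
```

A region lying entirely in the central zone scores 1.0; one straddling the center and the border zone scores the size-weighted average of the two zone weights.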
[Figure 3: Weightings and zones for the location IM (w_1 = 1.0 in the central zone, w_2 = 0.7, w_3 = 0.4 and w_4 = 0.0 toward the frame border)]
[Figure 4: Calculation of skin importance: I_skin(R_i) as a function of the proportion of skin pixels in R_i]

• Foreground/Background region. Foreground objects have been found to attract more attention than background objects. It is difficult to detect foreground and background objects in a still scene since no motion information is present. However, a general assumption can be made that foreground objects will not be located on the border of the scene. A region can then be assigned to the foreground or background on the basis of the number of its pixels which lie on the frame border. Regions with a high number of frame border pixels are classified as belonging to the background and have a low Foreground/Background importance, as given by:

    I_FGBG(R_i) = 1 − min( borderpix(R_i) / (0.3 · min(borderpix(frame), perimeterpix(R_i))), 1.0 ),

where borderpix(R) is the number of pixels in region R that also border on the edge of the frame and perimeterpix(R) is the number of pixels in region R that share a 4-connected border with another region.

• Color contrast of region. The color importance is calculated in a manner very similar to that of contrast importance. In effect, the two features perform a similar operation – one calculates the luminance contrast of a region with respect to its background while the other calculates the color contrast of a region with respect to its background. The calculation of color importance begins by calculating color contrasts separately for u* and v*, by substituting these values for gl in the formula used to compute contrast importance. The two color importance values are then combined:

    I_col(R_i) = sqrt(I_u*(R_i)^2 + I_v*(R_i)^2).

I_col is then scaled to the range 0-1.
This is done in an adaptive manner, so the color importance for a region of a particular color contrast is reduced in frames that have many regions of high color contrast and increased in frames where the highest color contrast is low.

• Skin. People, in particular their faces and hands, are very strong attractors of attention. Areas of skin can be detected by analyzing their color, since all human skin tones, regardless of race, fall within a restricted area of color space. The hue-saturation-value (HSV) color space is commonly used since human skin tones are strongly clustered into a narrow range of HSV values. After converting to HSV, an algorithm similar to that proposed by Herodotou20 is used on each pixel to determine whether or not the pixel has the same color as skin. The skin importance for the region is then calculated as illustrated in Figure 4.

3.3 Combining spatial factors

The seven spatial factors each produce a spatial IM. An example of this is shown in Figure 5 for the scene rapids. The seven spatial factors need to be combined to produce an overall spatial IM. The literature provides little quantitative indication of
[Figure 5: Spatial factor IMs produced for the image rapids. (a) original image, (b) segmented image, (c) location, (d) shape, (e) size, (f) foreground/background, (g) contrast, (h) color and (i) skin. In (c)-(i), lighter shading represents higher importance]

how the different factors should be weighted, although it is known that each factor exerts a different influence on visual attention. In order to quantify this relationship, the individual factor maps were correlated with the eye movements of viewers collected in the experiment described in Section 4. This provided an indication of the relative influence of the different factors on viewers’ eye movements. Using this information, the following weighting was used to calculate the overall spatial IM:

    I_spatial(R_i) = Σ_{f=1..7} w_f^pow_w · I_f(R_i)^pow_f,

where w_f is the feature weight (obtained from the eyetracking experiments), pow_w is the feature weighting exponent (to control the relative impact of w_f) and pow_f is the IM weighting exponent. The values of w_f are given in Section 4. The spatial IM was then scaled so that the region of highest importance had a value of 1.0. To expand the ROIs, block processing was performed on the resultant spatial IM. The block-processed IM is simply the maximum of the spatial IM within each local n x n block. Values of n = 16 and n = 32 have been shown to provide good results. The resultant spatial IMs can be quite noisy from frame to frame. In order to reduce this noisiness and improve the temporal consistency of the IMs, a temporal smoothing operation can be performed.
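A minimal sketch of this combination and the subsequent block processing, treating the factor IMs as per-pixel maps. The numeric values of pow_w and pow_f are not stated in the text, so the defaults below are placeholders, and the function names are hypothetical.

```python
import numpy as np

def combine_spatial(factor_maps, weights, pow_w=2.0, pow_f=2.0):
    """I_spatial = sum_f (w_f^pow_w * I_f^pow_f), rescaled so the most
    important location has value 1.0. pow_w and pow_f are placeholder
    exponents (the paper does not give their values)."""
    total = sum((w ** pow_w) * (m ** pow_f)
                for w, m in zip(weights, factor_maps))
    return total / total.max()

def block_max(im, n=16):
    """Replace each n x n block with its maximum to expand the ROIs."""
    h, w = im.shape
    out = np.empty_like(im)
    for y in range(0, h, n):
        for x in range(0, w, n):
            out[y:y + n, x:x + n] = im[y:y + n, x:x + n].max()
    return out
```

The block maximum deliberately dilates each ROI so that near-misses around an important region still receive high importance.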
3.4 Temporal IM

A temporal IM model that was found to work well on a subset of video material has been reported previously.5,11 Two major problems, however, prevented this model’s use with all video content:

• The model could not distinguish camera motion from true object motion. Hence, it failed when there was any camera movement (e.g., pan, tilt, zoom, rotate) while the video was being shot.
• Fixed thresholds were used when assigning importance to a particular motion. Since the amount of motion (and consequently the motion’s influence on attention) varies greatly across different video scenes, these thresholds need to adapt to the motion in the video.

A block diagram showing the improved temporal attention model is shown in Figure 6. Features have been added to solve the problems noted above, allowing the model to work reliably over a wide range of video content.

[Figure 6: Temporal attention model: the current and previous frames feed motion vector calculation, followed by camera motion estimation, compensation for camera motion, and smoothing and adaptive thresholding to produce the temporal IM]
[Figure 7: Mapping from object motion (deg/sec) to temporal importance I_temp, with breakpoints th_temp1 through th_temp4]

As in ref. 5, the current and previous frames are used in a hierarchical block matching process to calculate the motion vectors. These motion vectors are used by a novel camera motion estimation algorithm to determine four parameters describing the camera’s motion: pan, tilt, zoom and rotation. These four parameters are used to compensate the motion vectors (MVs) calculated in the first part of the model (MV_orig) so that the true object motion in the scene can be captured:

    MV_comp = MV_orig − MV_camera.

Since the motion vectors in texturally flat areas are not reliable, the compensated motion vectors in these areas are set to 0. In the final block in Figure 6, the compensated MVs are converted into a measure of temporal importance.
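The compensation step MV_comp = MV_orig − MV_camera might look as follows. The camera-induced vector field is modeled here as a simple similarity transform about the frame center; that model, and all names, are assumptions, since the paper does not detail its estimation algorithm.

```python
import numpy as np

def compensate_camera_motion(mv, pan, tilt, zoom, rot, flat_mask):
    """Subtract an assumed camera-induced vector field from the measured
    motion vectors `mv` (shape h x w x 2, (x, y) components), then zero
    the unreliable vectors in texturally flat areas (`flat_mask`)."""
    h, w, _ = mv.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xs -= w / 2.0                       # coordinates relative to frame center
    ys -= h / 2.0
    cam_x = pan + zoom * xs - rot * ys  # assumed similarity-transform field
    cam_y = tilt + zoom * ys + rot * xs
    comp = mv - np.stack([cam_x, cam_y], axis=-1)
    comp[flat_mask] = 0.0               # flat-area vectors are unreliable
    return comp
```

Under a pure pan, every measured vector equals the camera term, so the compensated field is zero and no spurious object motion survives.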
This conversion involves scene cut detection, temporal smoothing, flat area removal and adaptive thresholding. The adaptive thresholding process is shown graphically in Figure 7. The thresholds th_tempx are calculated adaptively, depending on the amount of object motion in the scene. Scenes with few moving objects and with slow moving objects should have lower thresholds than scenes with many fast moving objects, since human motion sensitivity will not be masked by numerous fast moving objects. An estimate of the amount of motion in the scene is obtained by taking the mth percentile of the camera motion compensated MV map (termed motion_m). The current model obtains good results using m = 98. The thresholds are then calculated as:

    th_temp1 = 0,
    th_temp2 = min(max(k_th2 · motion_m, k_th2min), k_th2max),
    th_temp3 = min(max(k_th3 · motion_m, k_th3min), k_th3max),
    th_temp4 = k_th4 · th_temp3.

Parameter values that provide good results are k_th2 = 0.5, k_th2min = 1.0 deg/sec, k_th2max = 10.0 deg/sec, k_th3 = 1.5, k_th3min = 5.0 deg/sec, k_th3max = 20.0 deg/sec and k_th4 = 2.0. Since the motion is measured in deg/sec, it is necessary to know the monitor’s resolution, pixel spacing and the viewing distance. The current model assumes a pixel spacing of 0.25 mm and a viewing distance of five screen heights, which is typical for SDTV viewing.
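The adaptive thresholds and the Figure 7 mapping can be sketched as follows; the exact shape of the Figure 7 curve is assumed here to be piecewise linear (rising from 0 to 1 between th_temp1 and th_temp2, flat to th_temp3, falling back to 0 at th_temp4), which is consistent with but not stated by the text.

```python
import numpy as np

def temporal_thresholds(mv_mag, m=98,
                        k2=0.5, k2min=1.0, k2max=10.0,
                        k3=1.5, k3min=5.0, k3max=20.0, k4=2.0):
    """Adaptive thresholds th_temp1..th_temp4 (deg/sec) derived from the
    m-th percentile of the camera-compensated motion magnitudes."""
    motion_m = np.percentile(mv_mag, m)
    th1 = 0.0
    th2 = min(max(k2 * motion_m, k2min), k2max)
    th3 = min(max(k3 * motion_m, k3min), k3max)
    th4 = k4 * th3
    return th1, th2, th3, th4

def temporal_importance(speed, th1, th2, th3, th4):
    """Assumed piecewise-linear mapping of Figure 7 from object speed
    (deg/sec) to temporal importance in [0, 1]."""
    return float(np.interp(speed, [th1, th2, th3, th4], [0.0, 1.0, 1.0, 0.0]))
```

For a scene whose 98th-percentile motion is 4 deg/sec, the thresholds come out to (0, 2, 6, 12) deg/sec, so an object moving at 4 deg/sec receives full temporal importance.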
In scenes where a fast moving object is being tracked by a fast pan or tilt movement, the object’s motion may be greater than th_temp3; hence, its temporal importance would be reduced to a value less than 1.0. To prevent this from occurring, objects being tracked by fast pan or tilt are detected and their temporal importance is set to 1.0. To increase the spatial extent of the ROIs, block processing at the 16 x 16 pixel level is performed, in the same way as for the spatial IM.

3.5 Combining spatial and temporal IMs

To provide an overall IM, a linear weighting of the spatial and temporal IMs is performed:

    I_total = k_comb · I_spat + (1 − k_comb) · I_temp.

An appropriate value of k_comb, determined from the eyetracker experiments, is 0.6.

4. MODEL CALIBRATION AND VALIDATION

As discussed previously, there is limited information available in the literature regarding the relative influence of the different attentional factors and their interactions. For this reason, an eye movement study was performed using a wide range of viewing material. The individual factors were then correlated with the viewers’ fixations in order to determine each factor’s relative influence. The correlation of the overall IMs could then be calculated to determine how accurately the model predicts viewers’ fixations. In this section, the eyetracker experiment is described and results given, along with details of how these results were used to calibrate the attention model.

4.1 Eyetracker experiment

The viewing room and monitor (Sony BVM-20F1U) were calibrated in accordance with ITU-R Rec. BT.500.21 Viewing distance was five screen heights. Twenty-four viewers (12 men and 12 women) from a range of technical and non-technical backgrounds and with normal or corrected-to-normal vision participated in the experiment. Their ages ranged from 27-57 years, with a mean of 40.2 years.
The viewers had their eye movements recorded non-obtrusively by an ASL model #504 eyetracker during the experiment. The accuracy of the eyetracker, as reported by the manufacturer, is within ±1.0 deg. Viewers were asked to watch the material as they would if they were watching television at home (i.e., for entertainment purposes). The stimulus set consisted of 130 still images (displayed for 5 seconds each) and 46 video clips derived from standard test scenes and from satellite channels. The total duration of the video clips was approximately 13 minutes. The material was selected to cover a broad cross-section of features such as saturated/unsaturated color, high/low motion and varying spatial complexity. The still images and video clips were presented in separate blocks and the ordering of the images and clips was pseudorandom within each block. None of the material was processed to introduce defects. Some of the material had previously undergone high bit-rate JPEG or MPEG compression but the effects were not perceptible. To ensure the accuracy of the recorded eye movements, calibration checks were performed every 1-2 minutes during the test. Re-calibration was performed if the accuracy of the eyetracker was found to have drifted. Post-processing of the data was also performed in order to correct for any calibration offsets.

4.2 Experimental results

The fixations of all viewers were corrected, aggregated and superimposed on the original stimuli. For the still images, this resulted in an average of approximately 250 fixations per scene, while for the video clips there were approximately 20 fixations per frame. Some examples of the aggregated images are shown in Figure 8. Each white square represents a single viewer fixation, with the size of the square being proportional to fixation duration. The smallest squares (1 x 1 pixel) show very short fixations (< 200 msec) while the largest squares (9 x 9 pixels) correspond to very long fixations (> 800 msec).
4.3 Correlation of factors with eye movements

To determine how well each factor predicts viewers’ eye movements, a hit ratio was computed for each factor. The hit ratio was defined as the percentage of all fixations falling within the most important regions of the scene, as identified by the attention model IM. Hit ratios were calculated over those regions whose combined area represented 10%, 20%, 30% and 40% of the scene’s total area. To calibrate the spatial model and compute the values of w_f, only the data from the 130 still images were used. These scenes were split into two equal-sized sets. One set was used for training, and the sequestered set was later used to verify the calibration parameters and to ensure that the model was not tuned to a particular set of data.
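A hit ratio as defined above can be computed as in this sketch. Ties at the importance cutoff are resolved naively here (all tied pixels count as "most important"), a detail the paper does not specify; the names are hypothetical.

```python
import numpy as np

def hit_ratio(im, fixations, area_fraction):
    """Percentage of fixations landing in the top `area_fraction` of the
    frame area, ranked by importance. `fixations` is a list of (row, col)
    pixel coordinates; `im` is a per-pixel importance map."""
    n_top = int(round(area_fraction * im.size))
    cutoff = np.sort(im, axis=None)[-n_top]   # importance of the last pixel in
    hits = sum(1 for r, c in fixations        # the top area_fraction of pixels
               if im[r, c] >= cutoff)
    return 100.0 * hits / len(fixations)
```

A random importance map yields a hit ratio near the baseline level (e.g., about 10% at the 10% area level), which is the comparison used in Table 1.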
[Figure 8: Aggregate fixations for two test images, (a) rapids and (b) liberty]

The hit ratios for the four area levels for each of the seven spatial factors are given in Table 1. These values show that the IMs for three of the factors – location, foreground/background and skin – correlated very strongly with the viewers’ fixations. Three other factors – shape, color and contrast – had lower but still strong correlation with viewers’ fixations, while the size factor produced the lowest hit ratios. Note that the hit ratios for all of the factors were greater than those expected if areas were chosen randomly (termed the baseline level), confirming that all factors exert some influence on viewers’ eye movements. Weights for each of the factors w_f were then calculated as follows:

    w_f = (h_f,10 + h_f,20 + h_f,30 + h_f,40) / Σ_{f=1..7} Σ_{a∈{10,20,30,40}} h_f,a,

where h_f,a is the hit ratio for feature f at area level a. This resulted in w_f = [0.193, 0.176, 0.172, 0.130, 0.121, 0.114, 0.094] for [location, foreground/background, skin, shape, contrast, color, size] respectively. Table 1 shows that when these unequal values of w_f were used in the spatial model, the hit ratios increased considerably. The hit ratios for the full spatial model for both the test and sequestered sets were very similar (within 2%), demonstrating that the weightings are not tuned to a specific set of scenes.
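The weight calculation above reduces to normalizing each factor’s summed hit ratios by the grand total over all factors and area levels, so the weights sum to 1. A minimal sketch (names are hypothetical):

```python
import numpy as np

def factor_weights(hit_ratios):
    """w_f = (sum of factor f's hit ratios over the four area levels)
    divided by the grand total over all factors and levels.

    `hit_ratios` is an (n_factors, 4) array holding each factor's hit
    ratios at the 10%, 20%, 30% and 40% area levels."""
    h = np.asarray(hit_ratios, dtype=float)
    return h.sum(axis=1) / h.sum()
```

Feeding in the per-factor rows of Table 1 reproduces the ordering reported in the text: location, foreground/background and skin receive the largest weights and size the smallest.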
    Stimulus Set   Factor or Model Tested        Hit Ratio (%) at area level:
                                                 10%     20%     30%     40%
    Still          foreground / background       31.8    46.9    63.2    75.2
    Still          center                        34.8    54.7    70.1    76.8
    Still          color                         17.8    32.1    44.8    54.8
    Still          contrast                      18.4    34.1    46.6    59.3
    Still          shape                         20.8    36.4    51.4    60.8
    Still          size                          14.6    27.0    35.9    46.0
    Still          skin                          25.2    48.3    70.8    79.1
    Still          Spatial model (equal wf)      29.3    49.2    63.6    74.8
    Still          Spatial model (unequal wf)    32.9    54.5    68.2    78.8
    Video          Spatial model                 46.6    67.2    75.8    81.9
    Video          Temporal model                32.4    50.4    62.9    72.3
    Video          Combined model                41.0    63.8    75.0    82.6
    -              baseline                      10.0    20.0    30.0    40.0

Table 1: Hit ratios for different factors and models.
5. RESULTS

Figure 9 shows the resulting IMs for the sequence football. The original scene is of high temporal and moderate spatial complexity, with large camera panning, some camera zoom and a variety of different objects in the scene. The superimposed fixations show that all of the viewers fixated on or near the player with the ball. The spatial IM identified the players as the primary ROIs and the background people as secondary ROIs. (The IMs in the figure have lighter shading in regions of high importance.) Note that the spatial extent of the ROIs was increased by the temporal smoothing process. The temporal model correctly compensated for camera pan and zoom and identified only those areas of true object motion. The combined IM identified the player with the ball as the most important object, while the other two players were also identified as being of high importance. The background people were classified as secondary ROIs while the playing field was assigned the lowest importance. The model’s predictions correspond well with viewers’ fixations.

Another example of the model’s performance, this time with a scene of high spatial and moderate temporal complexity (mobile), is shown in Figure 10. Spatially, this scene is very cluttered and there are a number of objects capable of attracting attention. Since there are numerous potential spatial attractors, the spatial IM is quite flat, with a general weighting towards the center. The temporal model compensated for the slow camera pan and correctly identified the mobile, train and calendar as the moving objects in the scene. Although they are located near the boundary of the scene, these moving objects still received a high weighting in the overall IM. This corresponds well with the experimental data, as most fixations were located around the moving objects in the lower part of the scene. The model has been tested on a wide range of video sequences and found to operate in a robust manner.
In order to quantify the model’s accuracy and robustness, hit ratios were calculated across the full set of video sequences used in the eyetracker experiment. The results are shown in Table 1 (video stimulus set). The hit ratios achieved by the combined model are high – for example, 75% of all fixations occur in the 30% area classified by the model as most important. The hit ratios are above the baseline level for each individual clip, indicating that the model does not fail catastrophically on any of the 46 sequences. Visual inspection of IMs and fixations showed that many of the fixations that were not hits had narrowly missed the model’s ROIs and were often within the ±1 deg accuracy of the eyetracker. Hence, if the ROIs were expanded spatially or if the eyetracker had higher accuracy, the hit ratio might increase.

The hit ratios of the spatial model were all considerably higher than those of the temporal model and were often slightly higher than those of the combined model. This may be caused in part by the fact that in scenes with no motion, or with extremely high and unpredictable motion, the correlation between motion and eye movements is low. Several of these types of sequences are contained in the test set, and the resultant low hit ratios for these scenes have reduced the overall hit ratio for the temporal model. Nevertheless, for most of the test sequences, the spatial model had a higher impact on predicting viewers’ attention than the temporal model. As a result, it was given a slightly higher weighting when combining the two models.

6. DISCUSSION

This paper has described a computational model of visual attention which automatically predicts ROIs in complex video sequences. The model was based on a number of factors known to influence visual attention. These factors were calibrated using eye movement data gathered from an experiment using a large database of still and video scenes.
A comparison of viewers' fixations and model predictions showed that, across a wide range of material, 75% of viewers' fixations occurred in the 30% of the frame area estimated by the model as being the most important. This verifies the accuracy of the model, particularly given that the eyetracking device exhibits some inaccuracy and that people's fixations naturally exhibit some degree of drift.

There are a number of applications where a computational model of visual attention can be readily utilized. These include areas as diverse as image and video compression, objective picture quality evaluation, image and video databases, machine vision and advertising. Any application requiring variable-resolution processing of scenes may also benefit from the use of a visual attention model such as the one described here.
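The hit-ratio metric quoted above (the fraction of fixations falling inside the top X% of the frame ranked by predicted importance) can be sketched as follows. This is an illustrative reconstruction, not the paper's exact procedure; the centre-weighted map and fixation coordinates are toy values.

```python
import numpy as np

def hit_ratio(importance_map, fixations, area_fraction=0.30):
    """Fraction of fixations landing in the `area_fraction` of the
    frame that the model ranks as most important.

    importance_map: 2-D array of per-pixel importance values.
    fixations: iterable of (row, col) fixation coordinates.
    """
    # Importance cutoff: pixels at or above it form the top-30% area
    cutoff = np.quantile(importance_map, 1.0 - area_fraction)
    roi = importance_map >= cutoff
    hits = sum(roi[r, c] for r, c in fixations)
    return hits / len(fixations)

# Toy centre-weighted map: importance falls off with distance from
# the frame centre (mimicking the centre bias noted in the paper).
yy, xx = np.mgrid[0:10, 0:10]
im = -np.hypot(yy - 4.5, xx - 4.5)
ratio = hit_ratio(im, [(5, 5), (4, 4), (6, 6), (0, 0)])
# Three central fixations hit the ROI; the corner fixation misses.
```

A real evaluation would additionally dilate the ROI by the eyetracker's ±1 deg accuracy before counting hits, as the near-miss analysis above suggests.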
Figure 9: IMs for a frame of the football sequence. (a) original frame, (b) spatial IM, (c) temporal IM and (d) combined IM.

Figure 10: IMs for a frame of the mobile sequence. (a) original frame, (b) spatial IM, (c) temporal IM and (d) combined IM.
REFERENCES

1. A.L. Yarbus, Eye Movements and Vision, Plenum Press, New York, 1967.
2. L. Stelmach, W.J. Tam and P.J. Hearty, "Static and dynamic spatial resolution in image coding: An investigation of eye movements," Proceedings SPIE, Vol. 1453, pp. 147-152, San Jose, 1992.
3. N.H. Mackworth and A.J. Morandi, "The gaze selects informative details within pictures," Perception & Psychophysics, 2(11), pp. 547-552, 1967.
4. G.T. Buswell, How People Look at Pictures, The University of Chicago Press, Chicago, 1935.
5. W. Osberger, Perceptual Vision Models for Picture Quality Assessment and Compression Applications, PhD Thesis, Queensland University of Technology, Brisbane, Australia, 2000. http://www.scsn.bee.qut.edu.au/~wosberg/thesis.htm
6. J.M. Wolfe, "Visual search: A review," in H. Pashler (ed), Attention, University College London Press, London, 1998.
7. L. Itti, C. Koch and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. PAMI, 20(11), pp. 1254-1259, 1998.
8. W. Osberger and A.J. Maeder, "Automatic identification of perceptually important regions in an image," 14th ICPR, pp. 701-704, Brisbane, Australia, August 1998.
9. X. Marichal, T. Delmot, C. De Vleeschouwer, V. Warscotte and B. Macq, "Automatic detection of interest areas of an image or a sequence of images," ICIP-96, pp. 371-374, Lausanne, 1996.
10. J. Zhao, Y. Shimazu, K. Ohta, R. Hayasaka and Y. Matsushita, "An outstandingness oriented image segmentation and its application," ISSPA, pp. 45-48, Gold Coast, Australia, 1996.
11. W. Osberger, A.J. Maeder and N. Bergmann, "A perceptually based quantisation technique for MPEG encoding," Proceedings SPIE, Vol. 3299, pp. 148-159, San Jose, 1998.
12. R.B. Ivry, "Asymmetry in visual search for targets defined by differences in movement speed," J. of Exp. Psych.: Human Perc. & Perf., 18(4), pp. 1045-1057, 1992.
13. M.P. Eckert and G. Buchsbaum, "The significance of eye movements and image acceleration for coding television image sequences," in A.B. Watson (ed), Digital Images and Human Vision, pp. 89-98, MIT Press, Cambridge MA, 1993.
14. L. Stark, I. Yamashita, G. Tharp and H.X. Ngo, "Search patterns and search paths in human visual search," in D. Brogan, A. Gale and K. Carr (eds), Visual Search 2, pp. 37-58, Taylor and Francis, London, 1993.
15. J.M. Findlay, "The visual stimulus for saccadic eye movements in human observers," Perception, 9, pp. 7-21, 1980.
16. M. D'Zmura, "Color in visual search," Vision Research, 31(6), pp. 951-966, 1991.
17. J. Enoch, "Effect of the size of a complex display on visual search," JOSA, 49(3), pp. 208-286, 1959.
18. J. Wise, "Eye movements while viewing commercial NTSC format television," SMPTE psychophysics committee white paper, 1984.
19. G. Walker-Smith, A. Gale and J. Findlay, "Eye movement strategies involved in face perception," Perception, 6, pp. 313-326, 1977.
20. N. Herodotou, K. Plataniotis and A. Venetsanopoulos, "Automatic location and tracking of the facial region in color video sequences," Signal Processing: Image Communication, 14, pp. 359-388, 1999.
21. ITU-R Rec. BT.500, "Methodology for the subjective assessment of the quality of television pictures."