Deep-Learning-Based Pupil Center Detection and Tracking Technology for Visible-Light Wearable Gaze Tracking Devices

 
Applied Sciences
Article
Wei-Liang Ou, Tzu-Ling Kuo, Chin-Chieh Chang and Chih-Peng Fan *

 Department of Electrical Engineering, National Chung Hsing University, 145 Xingda Rd., South Dist.,
 Taichung City 402, Taiwan; soft96123@gmail.com (W.-L.O.); claire5883claire5883@gmail.com (T.-L.K.);
 g107064065@mail.nchu.edu.tw (C.-C.C.)
 * Correspondence: cpfan@dragon.nchu.edu.tw

 Abstract: In this study, for the application of visible-light wearable eye trackers, a pupil tracking
 methodology based on deep-learning technology is developed. By applying deep-learning object
 detection technology based on the You Only Look Once (YOLO) model, the proposed pupil tracking
 method can effectively estimate and predict the center of the pupil in the visible-light mode. By
 using the developed YOLOv3-tiny-based model to test the pupil tracking performance, the detection
 accuracy is as high as 80%, and the recall rate is close to 83%. In addition, the average visible-light
 pupil tracking errors of the proposed YOLO-based deep-learning design are smaller than 2 pixels for
 the training mode and 5 pixels for the cross-person test, which are much smaller than those of the
 previous ellipse fitting design without using deep-learning technology under the same visible-light
 conditions. After being combined with the calibration process, the proposed
 YOLOv3-tiny-based pupil tracking models yield average gaze tracking errors smaller than
 2.9 and 3.5 degrees in the training and testing modes, respectively, and the proposed
 visible-light wearable gaze tracking system runs at up to 20 frames per second (FPS) on
 the GPU-based software embedded platform.
 

 Keywords: deep-learning; YOLOv3-tiny; visible-light; pupil tracking; gaze tracker; wearable
 eye tracker

 Citation: Ou, W.-L.; Kuo, T.-L.; Chang, C.-C.; Fan, C.-P. Deep-Learning-Based Pupil Center
 Detection and Tracking Technology for Visible-Light Wearable Gaze Tracking Devices.
 Appl. Sci. 2021, 11, 851. https://doi.org/10.3390/app11020851

 Received: 29 November 2020
 Accepted: 12 January 2021
 Published: 18 January 2021

 Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps
 and institutional affiliations.

 Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open
 access article distributed under the terms and conditions of the Creative Commons
 Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

 1. Introduction
 Recently, wearable gaze tracking devices (or called eye trackers) have started to
 become more widely used in human-computer interaction (HCI) applications. To evaluate
 people’s attention deficit, vision, and cognitive information and processes, gaze tracking
 devices have been well-utilized in this field [1–6]. In [4], the eye-tracking technology was a
 vastly applied methodology to examine hidden cognitive processes. The authors developed
 a program source code debugging procedure that was inspected by using the eye-tracking
 method. The eye-tracking-based analysis was elaborated to test the debug procedure and
 record the major parameters of eye movement. In [5], the eye-movement tracking device
 was used to observe and analyze a complex cognitive process for C# programming. By
 evaluating the knowledge level and eye movement parameters, the test subjects were
 involved to analyze the readability and intelligibility of the query and method at the
 Language-Integrated Query (LINQ) declarative query of the C# programming language.
 In [6], the forms and efficiency of debugging parts of software development were observed
 by tracking eye movements with the participation of test subjects. The routes of gaze for
 the test subjects were observed continuously during the procedure, and the coherences
 were decided by the assessment of the recorded metrics.
 Many up-to-date consumer products have applied embedded gaze tracking technology
 in their designs. Through the use of gaze tracking devices, the technology improves
 the functions and applications of intuitive human-computer interaction. Gaze tracking
 designs use an image sensor based on near-infrared (NIR), which was utilized in [7–16].
 Some previous eye-tracking designs used high-contrast eye images, which are recorded by
 active NIR image sensors. NIR image sensor-based technology can achieve high accuracy
 when used indoors. However, the NIR-based design is not suitable for use outdoors or
 with sunlight. Moreover, long-term operation may have the potential to hurt the user’s
 eyes. When eye trackers are considered for general and long-term use, visible-light pupil
 tracking technology becomes a suitable solution.
 Due to the characteristics of the selected sensor module, the contrast of the visible-light
 eye image may be insufficient, and the nearfield eye image may contain some random types
 of noise. In general, visible-light eye images usually include insufficient contrast and large
 random noise. In order to overcome this inconvenience, some other methods [17–19] have
 been developed for visible-light-based gaze tracking design. For the previous wearable
 eye tracking design in [15,19] with the visible-light mode, the estimation of pupil position
 was easily disturbed by light and shadow occlusion, and the prediction error of the pupil
 position, thus, becomes serious. Figure 1 illustrates some of the visible-light pupil tracking
 results by using the previous ellipse fitting-based methods in [15,19]. Thus, a large pupil
 tracking error will lead to inaccuracy of the gaze tracking operation. In [20,21], for indoor
 applications, the designers developed eye tracking devices based on visible-light image
 sensors. A previous study [20] performed experiments indoors, but showed that the device
 could not work normally outdoors. When the wearable eye-tracking device is correctly
 used in all indoor and outdoor environments, the method based on the visible-light image
 sensor is sufficient.

 Figure 1. Visible-light pupil tracking results by the previous ellipse fitting-based method.

 Related Works
 In recent years, the progress of deep-learning technologies has become very important,
 especially in the field of computer vision and pattern recognition. Through deep-learning
 technology, inference models can be trained to realize real-time object detection and
 recognition. For the deep-learning-based designs in [22–37], the eye tracking systems could
 detect the location of eyes or pupils, regardless of using the visible-light or near-infrared
 image sensors. Table 1 describes the overview of the deep-learning-based gaze tracking
 designs. From the viewpoints of device setup, the deep-learning-based gaze tracking
 designs in [22–37] can be divided into two different styles, which include the non-wearable
 and the wearable types.
 For the deep-learning-based non-wearable eye tracking designs, in [22], the authors
 proposed a deep-learning-based gaze estimation scheme to estimate the gaze direction
 from a single face image. The proposed gaze estimation design was based on applying
 multiple convolutional neural networks (CNN) to learn the model for gaze estimation
 from the eye images. The proposed design provided accurate gaze estimation for users
 with different head poses by including the head pose information into the gaze estimation
 framework. Compared with the previous methods, the developed gaze estimation design
 increased the accuracy of appearance-based gaze estimation under head pose variations.
 In [23], the purpose of this work was to use CNNs to classify eye tracking data. Firstly,
 a CNN was used to classify two different web interfaces for browsing news data. Secondly,
 a CNN was utilized to classify the nationalities of users. In addition, the data-preprocessing
 and feature-engineering techniques were applied. In [24], for emerging consumer electronic
 products, an accurate and efficient eye gaze estimation design was important, and
 the products were required to remain reliable under difficult environments with low
 power consumption and low hardware cost. In that work, a new hardware-friendly CNN
 model with minimal computational requirements was developed and assessed for efficient
 appearance-based gaze estimation. The proposed CNN model was tested and compared
 against existing appearance-based CNN approaches, achieving better eye gaze accuracy
 with large reductions of computational complexities.
 In [25], iRiter technology was developed, which can assist paralyzed people to write
 on screen by using only their eye movements. iRiter precisely detects the movement of the
 eye pupil by referring to the reflections of external near-infrared (NIR) signals. Then, a
 deep multilayer perceptron model was used for the calibration process. Using the position
 vector of the pupil area, the coordinates of the pupil were mapped to the given gaze position
 on the screen.
 In [26], the authors proposed a gaze zone estimation design by using deep-learning
 technology. Compared with traditional designs, the developed method did not need the
 procedure of calibration. In the proposed method, a Kinect was used to capture the video
 of a computer user, a Haar cascade classifier was applied to detect the face and eye regions,
 and data on the eye region was used to estimate the gaze zones on the monitor by the
 trained CNN. The proposed calibration-free method achieved an accuracy high enough to be
 applied to human–computer interaction.
 In [27], which focused on healthcare applications, high accuracy in the semantic
 segmentation of medical images was important. A key issue in training the CNNs was
 obtaining large-scale and precisely annotated images, and the authors sought to address
 the lack of annotated data by using an eye tracking method. The hypothesis was that
 segmentation masks generated with the help of eye tracking would be very similar to those
 rendered by hand annotation, and the results demonstrated that the eye tracking method
 created segmentation masks that were suitable for deep-learning-based semantic segmentation.
 In [29], for the conditions of unconstrained gaze estimation, most of the gaze datasets
 were collected under laboratory conditions, and the previous methods were not evaluated
 across multiple datasets. In this work, the authors studied key challenges, including target
 gaze range, illumination conditions, and facial appearance variation. The study showed
 that image resolution and the use of both eyes affected gaze estimation performance.
 Moreover, the authors proposed the first deep appearance-based gaze estimation method,
 i.e., GazeNet, to improve the state of the art by 22% for the cross-dataset evaluation.
 In [30], a subconscious response influenced the estimation of human gaze from factors
 related to the human mental activity. The previous methodologies were based only on the
 use of gaze statistical modeling and classifiers for assessing images, without referring to a
 user’s interests. In this work, the authors introduced a large-scale annotated gaze dataset,
 which was suitable for training deep-learning models. Then, a novel deep-learning-based
 method was used to capture gaze patterns for assessing image objects with respect to the
 user’s preferences. The experiments demonstrated that the proposed method performed
 effectively by considering key factors related to the human gaze behavior.
 In [31], the authors proposed a two-step training network, which was named the Gaze
 Estimator, to raise the gaze estimation accuracy on mobile devices. At the first step, an
 eye landmarks localization network was trained on the 300W-LP dataset, and the second
 step was to train a gaze estimation model on the GazeCapture dataset. In [28], the network
 localized the eye precisely on the image and the CNN network was robust when facial
 expressions and occlusion happened; the inputs of the gaze estimation network were eye
 images and eye grids.
 In [32], since many previous methods were based on a single camera, and focused on
 either the gaze point estimation or gaze direction estimation, the authors proposed an ef-
 fective multitask method for the gaze point estimation by multi-view cameras. Specifically,
 the authors analyzed the close relationship between the gaze point estimation and gaze
 direction estimation, and used partially shared convolutional neural networks to estimate
 the gaze direction and gaze point. Moreover, the authors introduced a new multi-view
 gaze tracking data set consisting of multi-view eye images of different subjects.
 In [33], the U2Eyes dataset was introduced, and U2Eyes was a binocular database of
 synthesized images regenerating real gaze tracking scripts with 2D/3D labels. The U2Eyes
 dataset could be used in machine and deep-learning technologies for eye tracker designs.
 In [34], for appearance-based gaze estimation techniques, using only a single camera
 for the gaze estimation limits the application field to short distances. To overcome this
 issue, the authors developed a new long-distance gaze estimation design. In the training
 phase, the Learning-based Single Camera eye tracker (LSC eye tracker) acquired gaze data
 by a commercial eye tracker, the face appearance images were captured by a long-distance
 camera, and deep convolutional neural network models were used to learn the mapping
 from appearance images to gazes. In the application phase, the LSC eye tracker predicted
 gazes based on the acquired appearance images by the single camera and the trained CNN
 models with effective accuracy and performance.
 In [35], the authors introduced an effective eye-tracking approach by using a deep-
 learning-based method. The location of the face relative to the computer was obtained
 by detecting color from the infrared LED with OpenCV, and the user’s gaze position was
 inferred by the YOLOv3 model. In [36], the authors developed a low-cost and accurate
 remote eye-tracking device which used an industrial prototype smartphone with integrated
 infrared illumination and camera. The proposed design used a 3D gaze-estimation model
 which enabled accurate point-of-gaze (PoG) estimation with free head and device motion.
 To accurately determine the input eye features, the design applied the convolutional neural
 networks with a novel center-of-mass output layer. The hybrid method, which used
 artificial illumination, a 3D gaze-estimation model, and a CNN feature extractor, achieved
 better accuracy than the existing eye-tracking methods on smartphones.
 On the other hand, for the deep-learning-based wearable eye tracking designs, in [28],
 the eye tracking method with video-oculography (VOG) images needed to provide an
 accurate localization of the pupil with artifacts and under naturalistic low-light conditions.
 The authors proposed a fully convolutional neural network (FCNN) for the segmentation
 of the pupil area that was trained on 3946 hand-annotated VOG images. The FCNN output
 simultaneously performed pupil center localization, elliptical contour estimation, and
 blink detection with a single network on commercial workstations with GPU acceleration.
 Compared with existing methods, the developed FCNN-based pupil segmentation design
 was accurate and robust for new VOG datasets. In [34], to conquer the obstructions and
 noises in nearfield eye images, the nearfield visible-light eye tracker design utilized the
 deep-learning-based gaze tracking technologies to solve the problem.
 In order to make the pupil tracking estimation more accurate, this work uses the
 YOLOv3-tiny [38] network-based deep-learning design to detect and track the position
 of the pupil. The proposed pupil detector is a well-trained deep-learning network model.
 Compared with the previous designs, the proposed method reduces the pupil tracking
 errors caused by light reflection and shadow interference in visible-light mode, and will
 also improve the accuracy of the calibration process of the entire eye tracking system.
 Therefore, the gaze tracking function of the wearable eye tracker becomes more reliable
 under visible-light conditions.

 Table 1. Overview of the deep-learning-based gaze tracking designs.

 Deep-learning-based Methods | Operational Mode/Setup | Dataset Used | Eye/Pupil Tracking Method | Gaze Estimation and Calibration Scheme
 ----------------------------------------------------------------------------------------------------------------------------------------
 Sun et al. [22] | Visible-light mode/Non-Wearable | UT Multiview | Uses a facial landmarks-based design to locate eye regions | Multiple pose-based and VGG-like CNN models for gaze estimation
 Lemley et al. [24] | Visible-light mode/Non-Wearable | MPII Gaze | The CNN-based eye detection | Joint eye-gaze CNN architecture with both eyes for gaze estimation
 Cha et al. [26] | Visible-light mode/Non-Wearable | Self-made dataset | Uses the OpenCV method to extract the face region and eye region | GoogleNetV1 based gaze estimation
 Zhang et al. [29] | Visible-light mode/Non-Wearable | MPII Gaze | Uses face alignment and 3D face model fitting to find eye zones | VGGNet-16-based GazeNet for gaze estimation
 Lian et al. [32] | Visible-light mode/Non-Wearable | ShanghaiTechGaze and MPII Gaze | Uses the ResNet-34 model for eye features extraction | CNN-based multi-view and multi-task gaze estimation
 Li et al. [34] | Visible-light mode/Non-Wearable | Self-made dataset | Uses the OpenFace CLM-framework/Face++/YOLOv3 to extract facial ROI and landmarks | Infrared-LED based calibration
 Rakhmatulin et al. [35] | Visible-light mode/Non-Wearable | Self-made dataset | Uses YOLOv3 to detect the eyeball and eye’s corners | Infrared-LED based calibration
 Brousseau et al. [36] | Infrared mode/Non-Wearable | Self-made dataset | Eye-Feature locator CNN model | 3D gaze-estimation model-based design
 Yiu et al. [28] | Near-eye infrared mode/Wearable | German center for vertigo and balance disorders | Uses a fully convolutional neural network for pupil segmentation and uses ellipse fitting to locate the eye center | 3D eyeball model, marker, and projector-assisted-based calibration
 Proposed design | Near-eye visible-light mode/Wearable | Self-made dataset | Uses the YOLOv3-tiny-based lite model to detect the pupil zone | Marker and affine transform-based calibration

 The main contribution of this study is that the pupil location in nearfield visible-light
 eye images is detected precisely by the YOLOv3-tiny-based deep-learning technology. To
 the best of our knowledge from surveying related studies, the novelty of this work is that
 it provides a leading reference design of deep-learning-based pupil tracking technology
 using nearfield visible-light eye images for the application of wearable gaze tracking devices.
 Moreover, the other novelties and advantages of the proposed deep-learning-based pupil
 tracking technology are described as follows:
 (1) By using a nearfield visible-light eye image dataset for training, the deep-learning
 models achieve real-time and accurate detection of the pupil position in the
 visible-light mode.
 (2) The proposed design detects the position of the pupil’s object at any eyeball movement
 condition, which is more effective than the traditional image processing methods
 without deep-learning technologies in the near-eye visible-light mode.
 (3) The proposed pupil tracking technology can efficiently overcome the light and shadow
 interferences in the near-eye visible-light mode, and the detection accuracy of a pupil’s
 location is higher than that of previous wearable gaze tracking designs without using
 deep-learning technologies.
 To enhance detection accuracy, we used the YOLOv3-tiny-based deep-learning model
 to detect the pupil’s object in the visible-light near-eye images. We also compared its
 detection results to the other designs, which include the methods without using deep-
 learning technologies. The pupil detection performance was evaluated in terms of precision,
 recall, and pupil tracking errors. By the proposed YOLOv3-tiny-based deep-learning-based
 method, the trained YOLO-based deep-learning models achieve enough precision with
 small average pupil tracking errors, which are less than 2 pixels for the training mode and
 5 pixels for the cross-person test under the near-eye visible-light conditions.
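 The three evaluation measures named above can be illustrated with a minimal sketch. It
 assumes one labeled pupil box per frame, boxes given as (x_min, y_min, x_max, y_max) in
 pixels, and a detection counted as correct when its intersection over union (IoU) with
 the label reaches 0.5. The box format, the IoU threshold, and all function names are
 illustrative assumptions and are not taken from the evaluation protocol of this work.

```python
import math


def pupil_tracking_error(pred_center, true_center):
    """Euclidean distance in pixels between predicted and labeled pupil centers."""
    return math.hypot(pred_center[0] - true_center[0],
                      pred_center[1] - true_center[1])


def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(frames, iou_threshold=0.5):
    """frames: list of (predicted_box or None, labeled_box or None) pairs.

    The 0.5 threshold is an assumed convention, not a value from the paper.
    """
    tp = fp = fn = 0
    for pred, truth in frames:
        if pred is not None and truth is not None and iou(pred, truth) >= iou_threshold:
            tp += 1
        else:
            if pred is not None:
                fp += 1  # a detection that matches no label
            if truth is not None:
                fn += 1  # a labeled pupil that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

 Averaging pupil_tracking_error over all frames of a test sequence gives a mean pixel
 error of the kind quoted above, under these assumptions.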
 2. Background of Wearable Eye Tracker Design
 Figure 2 depicts the general two-stage operational flow of the wearable gaze tracker
 with the four-point calibration scheme. The wearable gaze tracking device can use the NIR
 or visible-light image sensor-based gaze tracking design, which can apply appearance-,
 model-, or learning-based technology. First, the gaze tracking device estimates and tracks
 the pupil’s center to prepare the calibration process. For the NIR image sensor-based design,
 for example, the first-stage computations, which involve the processes for pupil location,
 extraction of the pupil’s region of interest (ROI), the binarization process with the two-stage
 scheme in pupil’s ROI, and the fast scheme of ellipse fitting with binarized pupil’s edge
 points, are utilized [15]. Then, the pupil’s center can be detected effectively. For the visible-
 light image sensor-based design, for example, by using the gradients-based iris center
 location technology [21] or the deep-learning-based methodology [37], the gaze tracking
 device realizes an effective iris center tracking capability. Next, at the second processing
 stage, the function and process for calibration and gaze prediction can be done. To increase
 the accuracy of gaze tracking, the head movement effect can be compensated by the motion
 measurement device [39]. In the following section, for the proposed wearable gaze tracking
 design, the first-stage processing procedure uses the deep-learning-based pupil tracking
 technology by using a visible-light nearfield eye image sensor. Next, the second-stage
 process achieves the function of calibration for the gaze tracking and prediction work.
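 As an illustration of the second stage, the following minimal sketch fits an affine map
 from pupil-center coordinates in the eye image to gaze coordinates in the scene image,
 using four calibration points. Table 1 states that the proposed design uses a marker and
 affine transform-based calibration, but the least-squares fitting procedure, the function
 names, and the sample coordinates below are assumptions made for illustration only.

```python
def solve3(m, v):
    """Solve the 3x3 system m x = v by Gauss-Jordan elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]  # augmented copy, inputs untouched
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("calibration points are collinear or too few")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def fit_affine(pupil_pts, scene_pts):
    """Least-squares affine map (normal equations) from pupil centers to scene points.

    Returns [[a, b, c], [d, e, f]] with gx = a*px + b*py + c and gy = d*px + e*py + f.
    """
    rows = [(px, py, 1.0) for px, py in pupil_pts]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    coeffs = []
    for k in range(2):  # k = 0 fits the x output, k = 1 fits the y output
        atb = [sum(r[i] * s[k] for r, s in zip(rows, scene_pts)) for i in range(3)]
        coeffs.append(solve3(ata, atb))
    return coeffs


def predict_gaze(coeffs, pupil_center):
    """Apply the fitted affine map to one pupil center."""
    px, py = pupil_center
    return tuple(c[0] * px + c[1] * py + c[2] for c in coeffs)


# Hypothetical four-point calibration: pupil centers paired with marker positions.
pupil = [(100.0, 80.0), (140.0, 80.0), (100.0, 110.0), (140.0, 110.0)]
scene = [(0.0, 0.0), (640.0, 0.0), (0.0, 480.0), (640.0, 480.0)]
calibration = fit_affine(pupil, scene)
gaze_point = predict_gaze(calibration, (120.0, 95.0))
```

 An affine map has six unknowns, so three non-collinear points already determine it; a
 fourth point makes the fit overdetermined, which is why a least-squares solution is used
 in this sketch.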

 Figure 2. The general processing flow of the wearable gaze tracker.

 3. Proposed Methods
 The proposed wearable eye tracking method is divided into several steps as follows:
 (1) Collecting the visible-light nearfield eye images in datasets and labeling these eye image
 datasets; (2) Choosing and designing the deep-learning network architecture for pupil object
 detection; (3) Training and testing the trained inference model; (4) Using the deep-learning
 model to infer and detect the pupil’s object, and estimate the center coordinate of pupil box;
 (5) Following the calibration process and predicting the gaze points. Figure 3 depicts the
 computational process of the proposed visible-light gaze tracking design based on the
 YOLO deep-learning model. In the proposed design, a visible-light camera module is used
 to capture nearfield eye images. The design uses a deep-learning network based on YOLO
 to train an inference model to detect the object of pupil, and then, the proposed system
 calculates the pupil center from the detected pupil box for the subsequent calibration and
 gaze tracking process.

 Figure 3. The operational flow of the proposed visible-light based wearable gaze tracker.
 Figure 3. The operational flow of the proposed visible-light based wearable gaze tracker.
 Figure 4 demonstrates the scheme of the self-made wearable eye tracking device. In
 the proposed wearable device, both the outward-view scene camera and the nearfield
 eye camera are required to provide the whole function of the wearable gaze tracker. The
 proposed design uses deep-learning-based technology to detect and track the pupil zone,
 estimates the pupil’s center from the detected pupil object, and then uses the coordinate
 information to predict the gaze points. Compared with the previous image processing
 methods of tracking the pupil in the visible-light mode, the proposed method can detect
 the pupil object at any eyeball position, and it is not affected by light and shadow
 interferences in real applications.

 Figure 4. The self-made visible-light wearable gaze tracking device.

 3.1. Pre-Processing for Pupil Center Estimation and Tracking
 In the pre-processing work, the visible-light nearfield eye images are recorded for
 datasets, and the collected eye image datasets will be labeled for training and testing the
 trained inference model. Moreover, we choose several YOLO-based deep-learning models
 for pupil object detection. By using the YOLO-based deep-learning model to detect the
 pupil’s object, the proposed system can track and estimate the center coordinate of pupil
 box for calibration and gaze prediction.

 3.1.1. Generation of Training and Validation Datasets
 In order to collect nearfield eye images from eight different users for training and
 testing, each user used a self-made wearable eye tracker to emulate the gaze tracking
 operations, after which the users rotated their gaze around the screen to capture the
 nearfield eye images with various eyeball positions. The collected and recorded near-eye
 images cover all possible eyeball positions. Figure 5 illustrates nearfield visible-light eye
 images collected for training and in-person tests. In Figure 5, the numbers of near-eye
 images from person 1, person 2, person 3, person 4, person 5, person 6, person 7, and
 person 8 are 849, 987, 734, 826, 977, 896, 886, and 939, respectively. Moreover, the near-eye
 images captured from person 9 to person 16 are used for the cross-person test. Thus, the
 nearfield eye images from person 1 to person 8 without closed eyes are selected, of which
 approximately 900 images per person are used for training and testing. The image datasets
 consisted of a total of 7094 eye patterns. Next, these selected eye images will go through a
 labeling process before training the inference model for pupil object detection and tracking.
 To increase the number of images in the training phase, the data augmentation process,
 which randomly changes angle, saturation, exposure, and hue of the selected image dataset,
 is enabled by using the framework of YOLO. After the 160,000 iterations for training the
 YOLOv3-tiny-based models, the total number of images used for training will be up to
 10 million.

 Figure 5. Collected datasets of nearfield visible-light eye images.

 Before training the model, the designer must find the correct image object (or answer)
 in the image datasets, to train the model what to learn and to identify the location of the
 correct object in the whole image, which is commonly known as the ground truth, so the
 labeling process must be done before the training process. The position of the pupil center
 varies in the different nearfield eye images. Figure 6 demonstrates the labeling process for
 the location of the pupil’s ROI object. For the following learning process of the detector
 of the pupil zone, the size of the labeled box also affects the detection accuracy of the
 network architecture. In our experiments, firstly, a labeled box with the 10 × 10 pixels size
 is tested, and then, a labeled box with a size ranging from 20 × 20 pixels to 30 × 30 pixels
 is adopted. The size of the labeled bounding box is closely related to the number of
 features analyzed by the deep-learning architecture. In the labeling process [40], in order
 to improve the accuracy of the ground truth, the ellipse fitting method was applied for
 the pre-positioning of the pupil’s center, after which we used a manual method to further
 correct the coordinates of the pupil center in each selected nearfield eye image. Before the
 training process, all labeled images (i.e., 7094 eye images) are randomly divided into a
 training part, verification part, and test part, with corresponding ratios of 8:1:1. Therefore,
 the number of labeled eye images used for training is 5676, and the number of labeled
 images used for verification and testing is 709.
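
 The 8:1:1 division described above can be sketched as follows. This is only an illustrative
 sketch, not the authors' tooling: the function name `split_dataset`, the fixed random seed,
 and the choice to give the rounding remainder to the training part are assumptions made
 here so that the counts match the 5676/709/709 figures quoted in the text.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into training/verification/test parts.

    Hypothetical helper: the verification and test parts are rounded down,
    and whatever is left over goes to the training part.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # reproducible random division
    total = sum(ratios)
    n_val = len(shuffled) * ratios[1] // total
    n_test = len(shuffled) * ratios[2] // total
    n_train = len(shuffled) - n_val - n_test
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# 7094 labeled eye images, identified here simply by their index.
train, val, test = split_dataset(range(7094))
print(len(train), len(val), len(test))
```

 With 7094 labeled images, this rounding convention yields 5676 training images and 709
 images each for verification and testing.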

 Figure 6. The labeling process for location of the pupil’s ROI.

 3.1.2. The Used YOLO-Based Models and Training/Validation Process
 The proposed design uses the YOLOv3-tiny deep-learning model. The reasons for
 choosing a network model based on YOLOv3-tiny are: (1) The YOLOv3-tiny-based model
 achieves good detection and recognition performance at the same time; (2) The YOLOv3-
 tiny-based model is effective for small object detection, and the pupil in the nearfield eye
 image is also a very small object. In addition, because the number of convolutional layers
 in YOLOv3-tiny is small, the inference model based on YOLOv3-tiny can achieve real-time
 operations on the software-based embedded platform.
 In the original YOLOv3-tiny model, as shown in Figure 7, each layer uses three anchor
 boxes to predict the bounding boxes in a grid cell. However, only one pupil object category
 needs to be detected in each nearfield eye image, so the number of anchor boxes
 required can be reduced. Figure 8 illustrates the network architecture of the YOLOv3-tiny
 model with only one anchor. Besides the reduction in the number of required anchor boxes,
 since the image features of the detected pupil object are not intricate, the number
 of convolutional layers can also be reduced. Figure 9 shows the network architecture of the
 YOLOv3-tiny model with one anchor and a one-way path. In Figure 9, since the number
 of convolutional layers is reduced, the computational complexity can also be reduced
 effectively.
 Before the labeled eye images are fed into the YOLOv3-tiny-based deep-learning
 models, the parameters used for training deep-learning networks (e.g., batch, decay,
 subdivision, and momentum) must be set properly so that the inference models for
 pupil center tracking can be trained accurately. In our training design for models, the
 parameters of Width/height, Channels, Batch, Max_batches, Subdivision, Momentum,
 Decay, and Learning_rate are set to 416, 3, 64, 500200, 2, 0.9, 0.0005, and 0.001, respectively.
 In the architecture of the YOLOv3-tiny model, 13 convolution layers with different
 filter numbers and pooling layers with different strides are established to give the
 YOLOv3-tiny model a high enough learning capability for training from images. In
 addition, the YOLOv3-tiny-based model not only uses the convolution neural network, but
 also adds upsampling and concatenation operations. During the learning process, the
 YOLO model is trained on the training dataset and tested on the images in the validation
 dataset to observe whether the model is indeed learning correctly. The weights are
 updated gradually by back propagation, and the loss function is monitored to observe
 whether the weights converge.

 Figure 7. The network architecture of the original YOLOv3-tiny model.

 Figure 8. The network architecture of the YOLOv3-tiny model with one anchor.

 Figure 9. The network architecture of the YOLOv3-tiny model with one anchor and a one-way path.

 Figure 10 depicts the training and testing processes by the YOLO-based models for
 tracking pupils. In the detection and classification process by the YOLO model, the grid
 cell, anchor box, and activation function are used to determine the position of the object
 bounding box in the image. Therefore, the well-trained model can detect the position of
 the pupil in the image with the bounding box. The detected pupil box can be compared
 with the ground truth to generate the intersection over union (IoU), precision, and recall
 values; we use these values to judge the quality of the model training.
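
 As an illustration of the IoU comparison between a detected pupil box and its ground truth,
 a minimal sketch is given below. The box format (x_min, y_min, x_max, y_max) in pixels, the
 function name `iou`, and the example coordinates are assumptions made here for
 illustration; they are not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max) in pixels (assumed format).
    """
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Overlap width/height are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical 30 x 30 pixel ground-truth box and a shifted detection.
ground_truth = (100, 100, 130, 130)
detection = (110, 105, 140, 135)
print(round(iou(ground_truth, detection), 4))
```

 A detection is typically counted as a true positive when its IoU with the ground truth
 exceeds a chosen threshold; the threshold value itself is not stated in this section.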
 Figure 10. The training and testing processes by the YOLO-based models for tracking pupil.

 Table 2 presents the confusion matrix used for the performance evaluation of classification.
 In pattern recognition for information retrieval and classification, recall is also referred
 to as the true positive rate or sensitivity, and precision is also referred to as the positive
 predictive value.

 Table 2. Confusion matrix for classification.

                                  Actual Values
                                  Positives               Negatives
 Predictive Values   Positives    True Positives (TP)     False Positives (FP)
                     Negatives    False Negatives (FN)    True Negatives (TN)

 Both precision and recall are based on an understanding and measurement of relevance.
 Based on Table 2, the definition of precision is expressed as follows:

 Precision = TP/(TP + FP), (1)

 and recall is defined as follows:

 Recall = TP/(TP + FN). (2)
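
 Equations (1) and (2) can be evaluated directly from the confusion-matrix counts. The
 sketch below is illustrative only: the function name `precision_recall` is hypothetical, and
 the counts in the example are arbitrary numbers chosen for demonstration, not results from
 the experiments.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) following Equations (1) and (2).

    A zero denominator is mapped to 0.0, which is a convention assumed here.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Arbitrary example counts (not measured data).
precision, recall = precision_recall(tp=90, fp=10, fn=30)
print(precision, recall)
```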

 After using the YOLO-based deep-learning model to detect the pupil’s ROI object,
 the midpoint coordinate of the pupil’s bounding box can be found, and this midpoint
 coordinate is taken as the estimated pupil’s center point. The estimated pupil centers
 are evaluated by the Euclidean distance between the estimated coordinates and the
 coordinates of the ground truth, which gives the tracking errors of the pupil’s center
 for the performance evaluation. Using the pupil’s central coordinates for the coordinate
 conversion of calibration, the proposed wearable device can correctly estimate and
 predict the outward gaze points.
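
 The center estimate and its tracking error can be sketched as below. The box format
 (x_min, y_min, x_max, y_max), the helper names `box_center` and `center_error`, and the
 example coordinates are assumptions made for illustration.

```python
import math


def box_center(box):
    """Midpoint of a bounding box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def center_error(estimated, ground_truth):
    """Euclidean distance in pixels between two (x, y) points."""
    return math.hypot(estimated[0] - ground_truth[0],
                      estimated[1] - ground_truth[1])


# Hypothetical detected pupil box and labeled ground-truth center.
detected_box = (100, 80, 130, 110)
labeled_center = (112.0, 99.0)
estimate = box_center(detected_box)
print(estimate, center_error(estimate, labeled_center))
```

 Averaging this per-image distance over a dataset gives an average tracking error in pixels
 of the kind reported for the training mode and the cross-person test.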

 3.2. Calibration and Gaze Tracking Processes
 Before using the wearable eye tracker to predict the gaze points, the gaze tracking
 device must undergo a calibration procedure. Because the appearance of each user’s eye
 and the application distance to the near-eye camera are different, the calibration must
 be done to confirm the coordinate mapping scheme between the pupil’s center and the
 outward scene, after which the wearable gaze tracker can correctly predict the gaze points
 corresponding to the outward scene. Figure 11 depicts the used calibration process of the
 proposed wearable gaze tracker. The steps of the applied calibration process are described
 as follows:

 Figure 11. The used calibration process of the proposed wearable gaze tracker.
 wearable gaze tracker.

 At the first step, the system collects the correspondence between the center point
 coordinates of the pupils captured by the nearfield eye camera and the gaze points simulated
 by the outward camera. Before calibration, in addition to recording the positions of the
 center points of the pupil movement, the trajectory of markers captured by the outward
 camera must also be recorded. Figure 12 shows the trajectory diagram of tracking the
 pupil’s center with the nearfield eye image for calibration. In addition, Figure 13 reveals
 the corresponding trajectory diagram of tracking four markers’ center points with the
 outward image for calibration.

 Figure 12. Diagram of tracking pupil’s center with the nearfield eye camera for calibration.

 In Figure 12, the yellow line indicates the sequence of the movement path of gazing
 the markers, and the red rectangle indicates the range of the eye movement. In Figure 13,
 when the calibration process is executed, the user gazes at the four markers sequentially
 from ID 1, ID 2, ID 3, to ID 4, and the system records the coordinates of the pupil movement
 path while watching the gazing coordinates of the outward scene with markers. Then, the
 proposed system performs the coordinate conversion to achieve the calibration effect.

 Figure 13. Diagram of tracking four markers’ center points with the outward camera for
 calibration. ① When the user gazes at the marker ID1 for calibration. ② When the user gazes
 at the marker ID2 for calibration. ③ When the user gazes at the marker ID3 for calibration.
 ④ When the user gazes at the marker ID4 for calibration.

 The pseudo-affine transform is used for the coordinate conversion between the pupil and
 the outward scene. The equation of the pseudo-affine conversion is expressed as follows:

 $$
 \begin{bmatrix} X \\ Y \end{bmatrix}
 =
 \begin{bmatrix} a_1 & a_2 & a_3 & a_4 \\ a_5 & a_6 & a_7 & a_8 \end{bmatrix}
 \begin{bmatrix} X_p \\ Y_p \\ X_p Y_p \\ 1 \end{bmatrix},
 \tag{3}
 $$

 where $(X, Y)$ is the outward coordinate of the markers’ center point coordinate system,
 and $(X_p, Y_p)$ is the nearfield eye coordinates of the pupils’ center point coordinate
 system. In (3), the unknown transformation coefficients of $a_1 \sim a_8$ need to be solved
 by the coordinate values computed and estimated from $(X, Y)$ and $(X_p, Y_p)$. In order to
 carry out the calculation of the transformation parameters of $a_1 \sim a_8$ for calibration,
 Equation (4) describes the pseudo-affine transformation with four-points calibration, which
 is used to solve the transformation coefficients of the coordinate conversion. The equation
 of the pseudo-affine conversion with four-point calibration is expressed as follows:
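
 $$
 \begin{bmatrix} X_1 \\ Y_1 \\ X_2 \\ Y_2 \\ X_3 \\ Y_3 \\ X_4 \\ Y_4 \end{bmatrix}
 =
 \begin{bmatrix}
 X_{p1} & Y_{p1} & X_{p1} Y_{p1} & 1 & 0 & 0 & 0 & 0 \\
 0 & 0 & 0 & 0 & X_{p1} & Y_{p1} & X_{p1} Y_{p1} & 1 \\
 X_{p2} & Y_{p2} & X_{p2} Y_{p2} & 1 & 0 & 0 & 0 & 0 \\
 0 & 0 & 0 & 0 & X_{p2} & Y_{p2} & X_{p2} Y_{p2} & 1 \\
 X_{p3} & Y_{p3} & X_{p3} Y_{p3} & 1 & 0 & 0 & 0 & 0 \\
 0 & 0 & 0 & 0 & X_{p3} & Y_{p3} & X_{p3} Y_{p3} & 1 \\
 X_{p4} & Y_{p4} & X_{p4} Y_{p4} & 1 & 0 & 0 & 0 & 0 \\
 0 & 0 & 0 & 0 & X_{p4} & Y_{p4} & X_{p4} Y_{p4} & 1
 \end{bmatrix}
 \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \\ a_7 \\ a_8 \end{bmatrix}
 \tag{4}
 $$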
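
 To make the four-point calibration concrete, the sketch below solves the eight
 pseudo-affine coefficients of Equation (3) from four pupil/marker correspondences and then
 maps a new pupil center to a gaze point. It is a minimal illustration under stated
 assumptions, not the authors' implementation: the function names, the plain Gaussian
 elimination solver, and the calibration coordinates are all hypothetical. Because the X and
 Y rows of Equation (3) share the same terms $(X_p, Y_p, X_p Y_p, 1)$, the eight unknowns are
 solved here as two independent 4 × 4 systems.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("calibration points are degenerate")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_pseudo_affine(pupil_pts, scene_pts):
    """Return [a1, ..., a8] from four (Xp, Yp) -> (X, Y) correspondences."""
    basis = [[xp, yp, xp * yp, 1.0] for xp, yp in pupil_pts]
    a_x = solve_linear(basis, [x for x, _ in scene_pts])  # a1..a4
    a_y = solve_linear(basis, [y for _, y in scene_pts])  # a5..a8
    return a_x + a_y


def map_gaze(coeffs, pupil_pt):
    """Apply Equation (3) to one pupil center (Xp, Yp)."""
    xp, yp = pupil_pt
    terms = [xp, yp, xp * yp, 1.0]
    x = sum(a * t for a, t in zip(coeffs[:4], terms))
    y = sum(a * t for a, t in zip(coeffs[4:], terms))
    return x, y


# Hypothetical calibration data: pupil centers recorded while gazing at
# markers ID 1 to ID 4, and the marker centers in the outward image.
pupil_pts = [(100.0, 80.0), (140.0, 80.0), (140.0, 110.0), (100.0, 110.0)]
scene_pts = [(50.0, 40.0), (590.0, 40.0), (590.0, 440.0), (50.0, 440.0)]
coeffs = fit_pseudo_affine(pupil_pts, scene_pts)
print(map_gaze(coeffs, (120.0, 95.0)))
```

 With exactly four correspondences the system is square, so the fitted mapping reproduces
 the four marker positions; collinear or repeated calibration points make it singular, which
 the sketch reports as an error.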