Unmasking Face Embeddings by Self-restrained Triplet Loss for Accurate Masked Face Recognition

Fadi Boutros (a,b), Naser Damer (a,b,*), Florian Kirchbuchner (a), Arjan Kuijper (a,b)
(a) Fraunhofer Institute for Computer Graphics Research IGD, Darmstadt, Germany
(b) Mathematical and Applied Visual Computing, TU Darmstadt, Darmstadt, Germany

Email: fadi.boutros@igd.fraunhofer.de

arXiv:2103.01716v1 [cs.CV] 2 Mar 2021

Using the face as a biometric identity trait is motivated by the contactless nature of the capture process and the high accuracy of the recognition algorithms. Since the outbreak of the COVID-19 pandemic, wearing a face mask has been imposed in public places to keep the pandemic under control. However, face occlusion due to wearing a mask presents an emerging challenge for face recognition systems. In this paper, we present a solution to improve masked face recognition performance. Specifically, we propose the Embedding Unmasking Model (EUM), operated on top of existing face recognition models. We also propose a novel loss function, the Self-restrained Triplet (SRT), which enables the EUM to produce embeddings similar to those of unmasked faces of the same identities. The evaluation results achieved on two face recognition models and two real masked datasets show that our proposed approach significantly improves the performance in most experimental settings.

Index Terms: COVID-19, Biometric recognition, Identity verification, Masked face recognition.

I. INTRODUCTION

Face recognition is one of the preferable biometric recognition solutions due to its contactless nature and the high accuracy achieved by face recognition algorithms. Face recognition systems have been widely deployed in many application scenarios such as automated border control, surveillance, as well as convenience applications [1], [2]. However, these systems are mostly designed to operate on non-occluded faces. Since the outbreak of the COVID-19 pandemic, wearing a protective face mask has been imposed in public places by many governments to reduce the rate of COVID-19 spread. This new situation raises a serious and unusual challenge for current face recognition systems. Recently, several studies have evaluated the effect of wearing a face mask on face recognition accuracy [3], [4], [5], [6]. The listed studies have reported the negative impact of masked faces on face recognition performance. The main conclusion of these studies [3], [4], [5], [6] is that the accuracy of face recognition algorithms with masked faces is significantly degraded in comparison to unmasked faces.

Motivated by this new circumstance, we propose in this paper a new approach to reduce the negative impact of wearing a facial mask on face recognition performance. The presented solution is designed to operate on top of existing face recognition models, and thus avoids retraining existing solutions used for unmasked faces. Recent works either proposed to train face recognition models with simulated masked faces [7], or to train a model to learn the periocular area of the face images exclusively [8]. Unlike these, our proposed solution does not require any modification or training of the existing face recognition model. We achieved this goal by proposing the Embedding Unmasking Model (EUM), operated on the embedding space. The input for the EUM is a feature embedding extracted from the masked face, and its output is a new feature embedding that is similar to an embedding of an unmasked face of the same identity and dissimilar from any embedding of any other identity. To achieve that through our EUM, we propose a novel loss function, the Self-restrained Triplet Loss (SRT), to guide the EUM during the training phase. The SRT shares the same learning objective with the triplet loss, i.e. it enables the model to minimize the distance between genuine pairs and maximize the distance between imposter pairs. Nonetheless, unlike triplet loss, the SRT can dynamically self-adjust its learning objective by focusing on minimizing the distance between the genuine pairs when the distance between the imposter pairs is deemed to be sufficient.

The presented approach is evaluated on top of two face recognition models, ResNet-50 [9] and MobileFaceNet [10], trained with the Arcface loss [11], to validate the feasibility of adopting our solution on top of different deep neural network architectures. With a detailed evaluation of the proposed EUM and SRT, we report the verification performance gain by the proposed approach on two real masked face datasets [7], [3]. We further experimentally support the theoretical motivation behind our SRT loss by comparing its performance with the conventional triplet loss. The overall verification results show that our proposed approach improves the performance in most of the experimental settings. For example, when the probes are masked, the achieved FMR100 measures (the lowest false non-match rate (FNMR) for false match rate (FMR) ≤ 1.0%) by our approach on top of MobileFaceNet are reduced by ∼28% and 26% on the two evaluated datasets.

In the rest of the paper, we first discuss the related works focusing on masked face recognition in Section II. Then, we present our proposed EUM architecture and our SRT loss in Section III. In Section IV, we present the experimental setups and implementation details applied in this work. Section V presents and discusses the achieved results. Finally, a set of conclusions is drawn in Section VI.

II. RELATED WORK

In recent years, significant progress has been made to improve face recognition verification performance with essentially non-occluded faces. Several previous works [12], [13] addressed general face occlusion, e.g. wearing sunglasses or a scarf. Nonetheless, they did not directly address facial mask occlusion (before the current COVID-19 situation).

Since the onset of the current COVID-19 situation, four major studies have evaluated the effect of wearing a facial mask on face recognition performance [3], [4], [5], [6]. The National Institute of Standards and Technology (NIST) has published two specific studies on the effect of masked faces on the performance of face recognition solutions submitted by vendors using pre-COVID-19 [5] algorithms and post-COVID-19 [6] algorithms. These studies are part of the ongoing Face Recognition Vendor Test (FRVT). The studies by NIST concluded that wearing a face mask has a negative effect on face recognition performance. However, the evaluation by NIST is conducted using synthetically generated masks, which may not fully reflect the actual effect of wearing a protective face mask on face recognition performance. The recent study by Damer et al. [3] has tackled this limitation by evaluating the effect of wearing a mask on two academic face recognition algorithms and one commercial solution, using a database specifically collected for this purpose from 24 participants over three collaborative sessions. The study indicates the significant effect of wearing a face mask on face recognition performance. A similar study was carried out by the Department of Homeland Security (DHS) [4]. In this study, several face recognition systems (using six face acquisition systems and 10 matching algorithms) were evaluated on a specifically collected database of 582 individuals. The main conclusion from this study is that the accuracy of the best-performing face recognition system is degraded from 100% to 96% when the subject is wearing a facial mask.

Li et al. [8] proposed to use an attention-based method to train a face recognition model to learn from the periocular area of masked faces. The presented method shows improvement in masked face recognition performance; however, the proposed approach is only tested on simulated masked face datasets and essentially only maps the problem into a periocular recognition problem. A recent preprint [7] presented a small dataset of 269 unmasked and masked face images of 53 identities crawled from the internet. The work proposed to fine-tune the FaceNet model [14] using simulated masked face images to improve the recognition accuracy. However, the proposed solution is only tested using a small database (269 images). Wang et al. [15] presented three datasets crawled from the internet for face recognition, detection, and simulated masked faces. The face recognition dataset contains 5000 masked face images of 525 identities, and 90000 unmasked face images of the same 525 identities. The authors claim to improve the verification accuracy from 50% to 95% on masked faces. However, they did not provide any information about the evaluation protocol, proposed solution, or implementation details. Moreover, the published part of the dataset does not contain pairs of unmasked-masked images for most of the identities^1. Thus, such a dataset could be more suitable for face mask detection [16], [17], [18] than for face recognition purposes. Recently, a rapidly growing number of studies have been published to address the detection of face mask wearing [16], [17], [18]. These studies did not directly address the effect of wearing a mask on the performance of face recognition or present a solution to improve masked face recognition. As reported in previous studies [3], [4], face recognition systems might fail in detecting a masked face. Thus, a face recognition system could benefit from the detection of a face mask to improve face detection, alignment, and cropping, as they are essential preprocessing steps for feature extraction.

^1 https://github.com/X-zhangyang/Real-World-Masked-Face-Dataset

Motivated by the recent evaluation efforts on the negative effect of wearing a facial mask on face recognition performance [3], [4], [5], [6], and driven by the need for an effective solution dedicated to improving masked face recognition, we present in this work a novel approach to improve masked face recognition performance. The proposed solution is designed to run on top of existing face recognition models. Thus, it does not require any retraining of the existing face recognition models, as presented in Section III.

III. METHODOLOGY

In this section, we present our proposed approach to improve the verification performance of masked face recognition. The presented solution is designed to be built on top of existing face recognition models. To achieve this goal, we propose an EUM. The input for our proposed model is a face embedding extracted from a masked face image, and the output is what we call an "unmasked face embedding", which aims at being more similar to the embedding of the same identity when not wearing a mask. Thus, the proposed solution does not require any modification or training of the existing face recognition solution. Figure 1 shows an overview of the proposed approach workflow in training and in operation modes.

Furthermore, we propose the SRT to control the model during the training phase. Similar to the well-known triplet-based learning, the SRT loss has two learning objectives: 1) Minimizing the intra-class variation, i.e. minimizing the distance between genuine pairs of unmasked and masked face embeddings. 2) Maximizing the inter-class variation, i.e. maximizing the distance between imposter pairs of masked face embeddings. However, unlike the traditional triplet loss, the proposed SRT loss function can self-adjust its learning objective by only focusing on optimizing the intra-class variation when the inter-class variation is deemed to be sufficient. When the gap on the inter-class variation is violated, our proposed loss behaves as the traditional triplet loss. The theoretical motivation behind our SRT loss will be presented along with the function formulation later in this section.

In the following, this section presents our proposed EUM architecture and the SRT loss.

A. Embedding Unmasking Model Architecture

The architecture of the EUM is based on a Fully Connected Neural Network (FCNN). Having a fully connected layer, where all neurons in two consecutive layers are connected to each other, enables us to demonstrate a generalized EUM design. This is the case because this structure is easily adaptable

[Figure 1: workflow diagram. The anchor, positive, and negative face images each pass through a pretrained CNN feature extractor to produce a face embedding. The anchor embedding is passed through the embedding unmasking model to produce the unmasked face embedding. The self-restrained triplet loss is computed from the anchor-positive distance d1 and the anchor-negative distance d2 and is backpropagated to the embedding unmasking model.]

Fig. 1: The workflow of the face recognition model with our proposed EUM. The top part of the figure (inside the dashed
rectangle) shows the proposed solution in operation mode. Given an embedding obtained from the masked face, the proposed
model is trained with SRT loss to output a new embedding that is similar to the one of the unmasked face of the same identity
and different from the masked face embedding from all other identities.

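As a rough illustration of the operation-mode path in Fig. 1, the sketch below passes a small batch of embeddings through four fully connected layers of width d, with batch normalization and Leaky ReLU after layers 1 to 3 and batch normalization alone after layer 4, which is the layer layout the paper gives for the EUM. Everything else is an assumption made for illustration: plain Python lists instead of a deep learning framework, random untrained weights, batch statistics without learned scale and shift, a Leaky ReLU slope of 0.01, and the hypothetical names `dense`, `batch_norm`, `leaky_relu`, `init_eum`, and `eum_forward`.

```python
import math
import random


def dense(batch, weights, bias):
    """Fully connected layer: y = W x + b for every row x of the batch."""
    return [[sum(w * x for w, x in zip(row_w, vec)) + b
             for row_w, b in zip(weights, bias)]
            for vec in batch]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature to zero mean and unit variance over the batch.
    Learned scale and shift parameters are omitted in this sketch."""
    n = len(batch)
    cols = list(zip(*batch))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - mu) ** 2 for v in c) / n
                 for c, mu in zip(cols, means)]
    return [[(v - mu) / math.sqrt(var + eps)
             for v, mu, var in zip(vec, means, variances)]
            for vec in batch]


def leaky_relu(batch, slope=0.01):
    """Element-wise Leaky ReLU (the slope value is an assumption)."""
    return [[v if v > 0 else slope * v for v in vec] for vec in batch]


def init_eum(d, seed=0):
    """Four d x d layers with small random, untrained weights."""
    rng = random.Random(seed)
    return [([[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]
              for _ in range(d)],
             [0.0] * d)
            for _ in range(4)]


def eum_forward(params, embeddings):
    """FC -> BN -> Leaky ReLU for layers 1-3, FC -> BN for layer 4."""
    out = embeddings
    for layer_idx, (weights, bias) in enumerate(params):
        out = batch_norm(dense(out, weights, bias))
        if layer_idx < 3:
            out = leaky_relu(out)
    return out


d = 8  # toy size; the paper uses d = 512 (ResNet-50) or d = 128 (MobileFaceNet)
rng = random.Random(1)
masked_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(4)]
unmasked_like = eum_forward(init_eum(d), masked_embeddings)
print(len(unmasked_like), len(unmasked_like[0]))  # 4 8
```

Because every layer maps d inputs to d outputs, the same sketch applies unchanged to either embedding size, which is the adaptability argument made for choosing an FCNN.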
to different input shapes, and thus can be adapted on top of different face recognition models, motivating our choice of the FCNN. The model input is a masked feature embedding (i.e. resulting from a masked face image) of size d (d depends on the used face recognition network), and the model output is a feature vector of the same size d. The proposed model consists of four fully connected (FC) layers: one input layer, two hidden layers, and one output layer. The input size for all FC layers is d. Each of the layers 1, 2, and 3 is followed by batch normalization (BN) [19] and a Leaky ReLU non-linear activation function [20]. The last FC layer is followed by BN.

B. Unmasked Face Embedding Learning

The learning objective of our model is to reduce the FNMR of genuine unmasked-masked pairs. The main motivation behind this learning goal is inspired by the latest reports on evaluating the effect of masked faces on face recognition performance by the National Institute of Standards and Technology (NIST) [5] and the recent work by Damer et al. [3]. The NIST report [5] stated that the false non-match rates (FNMR) are increased in all evaluated algorithms when the probes are masked. For the most accurate algorithms, the FNMR increased from 0.3% to 5% at an FMR of 0.001% when the probes are masked. On the other hand, the NIST report concluded that the FMR seems to be less affected when the probes are masked. A similar observation from the study by Damer et al. [3] stated that the genuine score distributions are significantly affected by masked probes. The study reported that the genuine score distribution strongly shifts towards the imposter score distributions. On the other hand, the imposter score distributions do not seem to be strongly affected by masked face probes.

One of the main observations of the previous studies [3], [5] is that wearing a face mask significantly increases the FNMR, whereas the FMR seems to be less affected by wearing a mask. Similar remarks have also been reported in our results (see Section V). Based on these observations, we motivate our proposed SRT loss function to focus on increasing the similarity between genuine pairs of unmasked and masked face embeddings, while maintaining the imposter similarity at an acceptable level. In the following section, we briefly present the naive triplet loss followed by our proposed SRT loss.

1) Self-restrained Triplet Loss

Previous works [14], [21] indicated that utilizing triplet-based learning is beneficial for learning discriminative face embeddings. Let $x \in X$ represent a batch of training samples, and $f(x)$ the face embeddings obtained from the face recognition model. Training with triplet loss requires a triplet of samples in the form $\{x_i^a, x_i^p, x_i^n\} \in X$, where $x_i^a$, the

[Figure 2: two line plots of distance (y-axis, 0.5 to 2.5) against training iteration (x-axis, 0 to 100000), each with four curves: Triplet loss-d1, Triplet loss-d2, SRT loss-d1, and SRT loss-d2. (a) ResNet-50: Triplet loss vs. SRT loss. (b) MobileFaceNet: Triplet loss vs. SRT loss.]
Fig. 2: Distances learned by the naive triplet loss and the SRT loss over training iterations. The plots show the learned d1 (distance between genuine pairs) and d2 (distance between imposter pairs) for each loss over the training iterations. It can be clearly noticed that the anchor (model output) of the model trained with SRT loss is more similar to the positive than the anchor of the model trained with naive triplet loss.

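The two losses compared in Fig. 2 can be sketched as follows, following Equations 1 to 3: the squared Euclidean distance on L2-normalized embeddings, the naive triplet loss, and the SRT loss, which swaps the anchor-negative term for the batch mean µ(d3) once µ(d2) is no longer smaller than µ(d3). This is a minimal pure-Python sketch of the forward computation only, not the authors' implementation. The function names, the margin of 0.5, and the two-dimensional toy vectors are assumptions chosen for illustration, and `anchors` stands for the embeddings output by the EUM.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dist(u, v):
    """Squared Euclidean distance between L2-normalized vectors (Eq. 2)."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchors, positives, negatives, margin):
    """Naive triplet loss averaged over the mini-batch (Eq. 1)."""
    terms = [max(dist(a, p) - dist(a, n) + margin, 0.0)
             for a, p, n in zip(anchors, positives, negatives)]
    return sum(terms) / len(terms)


def srt_loss(anchors, positives, negatives, margin):
    """Self-restrained triplet loss (Eq. 3)."""
    count = len(anchors)
    d1 = [dist(a, p) for a, p in zip(anchors, positives)]    # anchor-positive
    d2 = [dist(a, n) for a, n in zip(anchors, negatives)]    # anchor-negative
    d3 = [dist(p, n) for p, n in zip(positives, negatives)]  # positive-negative
    mu_d2, mu_d3 = sum(d2) / count, sum(d3) / count
    if mu_d2 < mu_d3:
        # Imposter distance not yet sufficient: behave as the naive triplet loss.
        terms = [max(x - y + margin, 0.0) for x, y in zip(d1, d2)]
    else:
        # Imposter distance sufficient: use mu(d3) as a fixed reference,
        # so only the anchor-positive distance d1 is penalized.
        terms = [max(x - mu_d3 + margin, 0.0) for x in d1]
    return sum(terms) / count


# Toy triplet in which the anchor is already far from the negative.
anchors, positives, negatives = [[1.0, 0.0]], [[0.0, 1.0]], [[-1.0, 0.0]]
print(triplet_loss(anchors, positives, negatives, 0.5))  # 0.0
print(srt_loss(anchors, positives, negatives, 0.5))      # 0.5
```

In this toy case d1 = 2, d2 = 4, and d3 = 2, so the naive triplet loss is already zero while the SRT loss still penalizes the anchor-positive distance, which is the behavior the SRT is designed to have.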
anchor, and $x_i^p$, the positive, are two different samples of the same identity, and $x_i^n$, the negative, is a sample belonging to a different identity. The learning objective of the triplet loss is that the distance between $f(x_i^a)$ and $f(x_i^p)$ (genuine pairs) with the addition of a fixed margin value ($m$) is smaller than the distance between $f(x_i^a)$ and any face embedding $f(x_i^n)$ of any other identity (imposter pairs). In FaceNet [14], triplet loss is proposed to learn face embeddings using the Euclidean distance applied on normalized face embeddings. Formally, the triplet loss $\ell_t$ for a mini-batch of $N$ samples is defined as follows:

$$\ell_t = \frac{1}{N} \sum_{i}^{N} \max\{ d(f(x_i^a), f(x_i^p)) - d(f(x_i^a), f(x_i^n)) + m,\ 0 \}, \tag{1}$$

where $m$ is a margin applied to impose the separability between genuine and imposter pairs, and $d$ is the Euclidean distance applied on normalized features, given by:

$$d(x_i, y_i) = \| x_i - y_i \|_2^2. \tag{2}$$

Figure 3 visualizes two triplet loss learning scenarios. Figure 3.a shows the initial training triplet, and Figures 3.b and 3.c illustrate two scenarios that can be learnt using triplet loss. In both scenarios, the goal of the triplet loss is achieved, i.e. $d(f(x_i^a), f(x_i^n)) > d(f(x_i^a), f(x_i^p)) + m$. In Figure 3.b (scenario 1), both distances are optimized. However, in this scenario, the optimization of the d2 distance is greater than the optimization of the d1 distance. In Figure 3.c (scenario 2), by contrast, the triplet loss enforces the model to focus on minimizing the distance between the anchor and the positive. The optimal state for the triplet loss is achieved when both distances are fully optimized, i.e. $d(f(x_i^a), f(x_i^p))$ is equal to zero and $d(f(x_i^a), f(x_i^n))$ is greater than the predefined margin. However, achieving such a state may not be feasible, and it requires a huge number of training triplets with large computational resources for selecting the optimal triplets for training. Given a masked face embedding, our model is trained to generate a new embedding such that it is similar to the unmasked face embedding of the same identity and dissimilar from the face embeddings of any other identities. As discussed earlier in this section, the distance between imposter pairs has been found to be less affected by wearing a mask [3], [5]. Thus, we aim to ensure that our proposed loss focuses on minimizing the distance between the genuine pairs (similar to scenario 2) while maintaining the distance between imposter pairs.

Training the EUM with SRT loss requires a triplet to be defined as follows: $f(x_i^a)$ is an anchor of masked face embedding, $EUM(f(x_i^a))$ is the anchor given as an output of the EUM, $f(x_i^p)$ is a positive of unmasked embedding, and $f(x_i^n)$ is a negative embedding of a different identity than the anchor and positive. This triplet is illustrated in Figure 1. We want to ensure that the distance (d1) between $EUM(f(x_i^a))$ and $f(x_i^p)$, in addition to a predefined margin, is smaller than the distance (d2) between $EUM(f(x_i^a))$ and $f(x_i^n)$. Our goal is to train the EUM to have more focus on minimizing d1, as d2 is less affected by the mask.

Under the assumption that the distance between the positive and the negative embeddings (d3) is close to optimal and does not contribute to the back-propagation of the EUM model, we propose to use this distance as a reference to control the triplet loss. The main idea is to train the model as a naive triplet loss when d2 (anchor-negative distance) is smaller than d3 (positive-negative distance). In this case, the SRT guides the model to maximize the d2 distance and to minimize the d1 distance. When d2 is equal to or greater than d3, we replace d2 by d3 in the loss calculation. This distance swapping allows the SRT to learn only, at this point, to minimize the d1 distance. At any point of the training, when the condition on d2 is violated, i.e.

$\mu(d2) < \mu(d3)$, the SRT behaves again as a naive triplet loss. We opt to compare the d2 and d3 distances on the batch level to avoid swapping the distance on every minor update of the distance between the imposter pairs. In this case, we want to ensure that the d1 distance, with the addition of a margin $m$, is smaller than the mean of the d3 distances calculated on the mini-batch of triplets. Thus, our loss is less sensitive to the outliers resulting from comparing imposter pairs. We define our SRT loss for a mini-batch of size $N$ as follows:

$$\ell_{SRT} = \begin{cases} \dfrac{1}{N} \sum_{i}^{N} \max\{ d(f(x_i^a), f(x_i^p)) - d(f(x_i^a), f(x_i^n)) + m,\ 0 \} & \text{if } \mu(d2) < \mu(d3) \\[2ex] \dfrac{1}{N} \sum_{i}^{N} \max\{ d(f(x_i^a), f(x_i^p)) - \mu(d3) + m,\ 0 \} & \text{otherwise,} \end{cases} \tag{3}$$

where $\mu(d2)$ is the mean of the distances between the anchor and the negative pairs calculated on the mini-batch level, given as $\frac{1}{N} \sum_{i}^{N} d(f(x_i^a), f(x_i^n))$. $\mu(d3)$ is the mean of the distances between the positive and the negative pairs calculated on the mini-batch level, given as $\frac{1}{N} \sum_{i}^{N} d(f(x_i^p), f(x_i^n))$, and $d$ is the Euclidean distance computed on normalized feature embeddings (Equation 2).

Figure 2 illustrates the optimization of d1 (distance between genuine pairs) and d2 (distance between imposter pairs) by naive triplet loss and SRT loss over the training iterations of two EUM models. In Figure 2a, the EUM model is trained on top of feature embeddings obtained from a ResNet-50 face recognition model. In Figure 2b, the EUM model is trained on top of feature embeddings obtained from the MobileFaceNet face recognition model. Details on the training are presented in Section IV. It can be clearly noticed that the d1 distance (anchor-positive distance) learned by SRT is significantly smaller than the one learned by naive triplet loss. This indicates that the output embedding of the EUM trained with SRT is more similar to the embedding of the same identity (the positive) than the output embedding of the EUM trained with triplet loss.

IV. EXPERIMENTAL SETUP

This section presents the experimental setups and the implementation details applied in the paper.

A. Face Recognition Model

version of MS-Celeb-1M [24] dataset. We chose to employ Arcface loss as it achieved state-of-the-art performance on several face recognition testing datasets such as Labeled Faces in the Wild (LFW) [25]. The accuracies achieved on LFW by ResNet-50 and MobileFaceNet trained with Arcface loss using the MS1MV2 dataset are 99.80% and 99.55%, respectively. The considered face recognition models are evaluated with cosine distance for comparison. The Multi-task Cascaded Convolutional Networks (MTCNN) solution [26] is employed to detect and align the input face image. Both models process aligned and cropped face images of size 112 × 112 pixels to produce 512-D embedding features by ResNet-50 and 128-D embedding features by MobileFaceNet.

B. Synthetic Mask Generation

As there is no large-scale database with pairs of unmasked and masked face images, we opted to use synthetically generated masks to train our proposed approach. Specifically, we employ the synthetic mask generation method proposed by NIST [5]. The synthetic generation method depends on the Dlib [27] toolkit to detect and extract 68 facial landmarks from a face image. Based on the extracted landmark points, face masks of different shapes, heights, and colors can be drawn on the face images. The detailed implementation of the synthetic mask generation method is described in [5]. The synthetic mask generation method provides six mask types with different heights and coverage: A) wide-high coverage, B) round-high coverage, C) wide-medium coverage, D) round-medium coverage, E) wide-low coverage, and F) round-low coverage. Figure 4 shows samples of simulated face masks of different types and coverage. The mask color is randomly selected from the RGB color space. To synthetically generate a masked face image, we first extract the facial landmarks of the input face image. Then, a mask of a specific color and type can be drawn on the face image using the x, y coordinates of the facial landmark points.

C. Database

We used the MS1MV2 dataset [11] to train our proposed approach. The MS1MV2 is a refined version of the MS-Celeb-1M [24] dataset. The MS1MV2 contains 5.8M images of 85k different identities. We generated a masked version of the
   To provide a deep evaluation of the performance of the               MS1MV2 noted as MS1MV2-Masked as described in Section
proposed solution, we evaluated our proposed solution on top            IV-C. The mask type (as described in Section ) and color
of two face recognition models - ResNet-50 [9] and Mobile-              are randomly selected for each image to add more diversity
FaceNet [10]. ResNet is one of the widely used Convolutional            of mask color and coverage to the training dataset. The Dlib
Neural Network (CNN) architecture employed in several face              failed in extracting the facial landmarks from 426k images.
recognition models e.g. ArcFace [11] and VGGFace2 [22].                 These images are neglected from the training database. A
   MobileFaceNet is a compact model designed for low                    subset of 5k images are randomly selected from MS1MV2-
computational powered devices. MobileFaceNet model archi-               Masked to validate the model during the training phase.
tecture is based on residual bottlenecks proposed by Mo-                   The proposed solution is evaluated using two real masked
bileNetV2 [23] and depth-wise separable convolutions layer,             face datasets: Masked Faces in Real World for Face Recog-
which allows building a CNN model with a much smaller set               nition (MRF2) [7] and the extended masked face recognition
of parameters in comparison to standard CNNs. To provide                (MFR) dataset [3]. To the best of our knowledge, these are
fair and comparable evaluation results, both models are trained         the only datasets available for research including pairs of
using the same loss function, Arcface loss [11], and the same           real masked face images and unmasked face images. Table
training database, MS1MV2 [11]. The MS1MV2 is a refined                 I summarize the evaluation datasets used in this work. The

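[Figure 3 panels: (a) Training triplet, (b) Scenario 1, (c) Scenario 2. Each panel shows an anchor, a positive, and a negative sample with the distances d1 (anchor-positive), d2 (anchor-negative), and d3 (positive-negative); learning leads from the training triplet in (a) to scenario 1 or scenario 2.]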

Fig. 3: Triplet loss guides the model to maximize the distance between the anchor and the negative such that it is greater than the distance between the anchor and the positive plus a fixed margin value. The similarity between the anchor and the positive (d1) learned in scenario 2 is clearly higher than the one learned in scenario 1, whereas the distance d2 between the anchor and the negative (imposter pairs) in scenario 1 is much larger than the d2 in scenario 2.
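
The objective illustrated in Fig. 3 and the SRT loss of Equation 3 can be sketched in a few lines of plain Python. This is a minimal sketch and not the authors' PyTorch implementation: the function names and the toy 2-D embeddings in the usage note are assumptions made for illustration, and `dist` assumes that Equation 2 is the Euclidean distance between L2-normalized embeddings.

```python
import math

def dist(u, v):
    """Euclidean distance between two embeddings after L2 normalization."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return math.sqrt(sum((a / nu - b / nv) ** 2 for a, b in zip(u, v)))

def mean(values):
    return sum(values) / len(values)

def triplet_loss(anchors, positives, negatives, margin):
    """Naive triplet loss averaged over a mini-batch of triplets."""
    return mean([max(dist(a, p) - dist(a, n) + margin, 0.0)
                 for a, p, n in zip(anchors, positives, negatives)])

def srt_loss(anchors, positives, negatives, margin):
    """Self-restrained triplet loss as we read Equation 3 (illustrative)."""
    d1 = [dist(a, p) for a, p in zip(anchors, positives)]    # anchor-positive
    d2 = [dist(a, n) for a, n in zip(anchors, negatives)]    # anchor-negative
    d3 = [dist(p, n) for p, n in zip(positives, negatives)]  # positive-negative
    if mean(d2) < mean(d3):
        # Negatives are still too close to the anchors: plain triplet loss.
        return mean([max(x1 - x2 + margin, 0.0) for x1, x2 in zip(d1, d2)])
    # Otherwise d2 is replaced by the batch constant mean(d3),
    # so only the anchor-positive distance d1 is optimized.
    mu_d3 = mean(d3)
    return mean([max(x1 - mu_d3 + margin, 0.0) for x1 in d1])
```

For a single triplet with anchor (1, 0), positive (0.8, 0.6), and negative (-1, 0), the anchor-negative distance already exceeds the positive-negative distance, so `srt_loss` takes its second branch and keeps penalizing d1, whereas `triplet_loss` would still spend part of its budget on pushing the negative further away.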

Fig. 4: Samples of the synthetically generated face masks of different shape and coverage: (a) wide-high coverage, (b) round-high coverage, (c) wide-medium coverage, (d) round-medium coverage, (e) wide-low coverage, (f) round-low coverage.
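
The idea of drawing a mask from facial landmarks can be sketched as follows. This is a simplified illustration and not the NIST method of [5]: it assumes the landmarks are already available as 68 (x, y) pairs in Dlib's usual ordering (indices 0-16 for the jawline, 27-30 for the nose bridge), it models only the coverage height and not the wide and round shape variants, and the choice of polygon vertices as well as all function names are hypothetical.

```python
import random

# Which nose-bridge point closes the mask for each coverage level is an
# assumption of this sketch, not the geometry defined in [5].
TOP_POINT = {"high": 28, "medium": 29, "low": 30}

def mask_polygon(landmarks, coverage):
    """Polygon along the jawline (points 1..15), closed at a nose-bridge point."""
    return list(landmarks[1:16]) + [landmarks[TOP_POINT[coverage]]]

def point_in_polygon(x, y, polygon):
    """Even-odd ray casting test for a point against a simple polygon."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside

def draw_mask(image, landmarks, coverage, color=None):
    """Return a copy of `image` (rows of RGB tuples) with the mask filled in."""
    if color is None:
        # The mask color is drawn at random from the RGB color space.
        color = tuple(random.randrange(256) for _ in range(3))
    polygon = mask_polygon(landmarks, coverage)
    return [[color if point_in_polygon(x, y, polygon) else pixel
             for x, pixel in enumerate(row)]
            for y, row in enumerate(image)]
```

A real pipeline would obtain the landmarks from a detector and would skip images for which landmark extraction fails.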

MFR2 contains 269 images of 53 identities crawled from the internet. Therefore, the images in the MRF2 dataset can be considered to be captured under in-the-wild conditions. The database contains images of masked and unmasked faces with an average of 5 images per identity. The baseline evaluation of the considered face recognition models on MFR2 is done by performing N:N comparisons between unmasked faces (noted as F-F), between unmasked and masked faces (noted as F-M), and between masked faces (noted as M-M) to evaluate the influence of wearing a facial mask on the face recognition performance. Finally, to evaluate our proposed solution, we report the verification performance of the M-M and F-M settings based on our proposed EUM model with SRT loss and with triplet loss, noted as F-M(SRT) and M-M(SRT), and F-M(T) and M-M(T), respectively.

We also use an extended version of the MFR dataset [3]. The extended version of MFR is collected from 48 participants using their webcams in three different sessions: session 1 (reference) and sessions 2 and 3 (probes). The sessions are captured on three different days. Each session contains data captured using three videos. In each session, the first video is recorded when the subject is not wearing a facial mask, in daylight without additional electric lighting. The second and third videos are recorded when the subject is wearing a facial mask, with no additional electric lighting in the second video and with electric lighting in the third

video (room light is turned on). The baseline reference (BLR) contains 480 images from the first video of the first session (day) as described in [3]. The mask reference (MR) contains 960 images from the second and third videos of the first session. The baseline probe (BLP) contains 960 images from the first video of the second and third sessions and contains face images with no mask. The mask probe (MP) contains 1920 images from the second and third videos of the second and third sessions. The baseline evaluation of face recognition performance on the MFR database is done by performing N:N comparisons between BLR and BLP (unmasked faces), noted as BLR-BLP. To evaluate the effect of wearing a facial mask on face recognition performance, we perform N:N comparisons between the BLR and MP data splits, noted as BLR-MP, and N:N comparisons between the MR and MP data splits, noted as MR-MP. Finally, we evaluate and report the verification performance of the MR-MP and BLR-MP settings achieved by our EUM model trained with SRT and with the naive triplet loss (as a baseline of our SRT loss), noted as MR-MP(SRT) and BLR-MP(SRT), and MR-MP(T) and BLR-MP(T), respectively.

D. Model Training Setup

We trained four instances of the EUM model. The first and second instances, model1 and model2, are trained with SRT loss using feature embeddings obtained from ResNet-50 and MobileFaceNet, respectively. The third and fourth instances, model3 and model4, are trained with triplet loss using feature embeddings obtained from ResNet-50 and MobileFaceNet, respectively, as an ablation study of our proposed SRT. The proposed EUM models in this paper are implemented in PyTorch and trained on an Nvidia GeForce RTX 2080 GPU. All models are trained using the SGD optimizer with an initial learning rate of 1e-1 and a batch size of 512. The learning rate is divided by 10 at 30k, 60k, and 90k training iterations. The early-stopping patience parameter is set to 3 (around 30k training iterations), causing model1, model2, model3, and model4 to stop after 80k, 70k, 60k, and 10k training iterations, respectively.

E. Evaluation Metric

The verification performance is reported as Equal Error Rate (EER), FMR100, and FMR1000, where FMR100 and FMR1000 are the lowest FNMR for an FMR ≤ 1.0% and ≤ 0.1%, respectively. We also report the mean of the genuine scores (G-mean) and the mean of the imposter scores (I-mean) to analyze the shifts in the genuine and imposter score distributions induced by wearing a face mask and to demonstrate the improvement in the verification performance achieved by our proposed solution. For each of the evaluation settings, we plot the receiver operating characteristic (ROC) curves. Also, for each of the conducted experiments, we report the failure to extract rate (FTX) to capture the effect of wearing a face mask on face detection. FTX measures the proportion of comparisons where the feature extraction was not possible. Further, we enrich our reported evaluation results by reporting the Fisher Discriminant Ratio (FDR) to provide an in-depth analysis of the separability of genuine and imposter scores for the different experimental settings. FDR is a class separability criterion described in [28], and it is given by:

$$
FDR = \frac{(\mu_G - \mu_I)^2}{(\sigma_G)^2 + (\sigma_I)^2},
$$

where $\mu_G$ and $\mu_I$ are the mean values of the genuine and imposter scores and $\sigma_G$ and $\sigma_I$ are their standard deviations. The larger the FDR value, the higher the separation between the genuine and imposter scores and thus the better the expected verification performance.

V. RESULT

In this section, we present and discuss our achieved results. We first present experimentally the negative impact of wearing a face mask on face recognition performance. Then, we present and discuss the impact of our EUM trained with SRT on enhancing the separability between the genuine and imposter comparison scores. Then, we present the gain in the masked face verification performance achieved by our proposed EUM trained with SRT on collaborative and in-the-wild masked face recognition. Finally, we present an ablation study on SRT to experimentally support our theoretical motivation behind the SRT loss by comparing its performance with the triplet loss.

A. Impact of Wearing a Protective Face Mask on Face Recognition Performance

Tables II, III, IV, and V present a comparison between the baseline evaluation where both reference and probe are unmasked (F-F, BLR-BLP), the case where only the probe is masked (F-M, BLR-MP), and the case where reference and probe are masked (M-M, MR-MP). Using unmasked images, the considered face recognition models, ResNet-50 and MobileFaceNet, achieve a very high verification performance. This is demonstrated by scoring 0.0% and 0.0% EER on the MFR database and 0.0% and 0.0106% EER on the MRF2 database, by ResNet-50 and MobileFaceNet, respectively. The verification performance of the considered models is substantially degraded when evaluated on masked face images. This is indicated by the degradation in the verification performance measures and FDR values, in comparison to the case where probe and reference are unmasked. Furthermore, the considered models achieved lower verification performance when a masked probe is compared to a masked reference than in the case where only the probe is masked, as seen in Tables II, III, IV, and V. For example, using the MFR dataset, when the probe is masked, the EER achieved by the ResNet-50 model is 1.2492% (BLR-MP). This error rate rises to 1.2963% when both probe and reference are masked (MR-MP), as seen in Table II. This result is also supported by G-mean, I-mean, and FDR as shown in Tables II, III, IV, and V.

We also make two general observations. 1) The compact model, MobileFaceNet, achieved lower verification performance than the ResNet-50 model when comparing masked probes to unmasked/masked references. One of the reasons for this performance degradation might be the smaller embedding size of MobileFaceNet (128-D), in comparison to

                          Dataset     Multi-sessions   No. images      No. identities     Capture scenario
                         MFR [3]           Yes            4320              48             Collaborative
                         MRF2 [7]          No             269               53              In-the-wild
TABLE I: An overview of the evaluation datasets employed in this work. We evaluate our solution on real (not simulated)
masked datasets in two different capture scenarios, collaborative and in-the-wild.
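
As a quick consistency check, the split sizes of the extended MFR dataset given in Section IV-C can be summed, and the number of comparisons per experimental setting can be estimated. This is a minimal sketch: it assumes that N:N means every reference image is compared with every probe image, and the resulting counts are upper bounds because failures to extract are ignored. The helper name is hypothetical.

```python
# Split sizes of the extended MFR dataset as stated in Section IV-C.
mfr_splits = {"BLR": 480, "MR": 960, "BLP": 960, "MP": 1920}

def n_to_n_comparisons(reference, probe):
    """Upper bound on the number of N:N comparisons between two splits."""
    return mfr_splits[reference] * mfr_splits[probe]

# Total number of images over all four splits.
total_images = sum(mfr_splits.values())

# Comparison counts for the three MFR experimental settings.
settings = {name: n_to_n_comparisons(*name.split("-"))
            for name in ("BLR-BLP", "BLR-MP", "MR-MP")}
```

The four splits add up to 4320 images collected from the 48 participants.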

              Resnet50          EER %      FMR100 %       FMR1000 %        G-mean       I-mean    FDR          FTX %
              BLR-BLP           0.0        0.0            0.0              0.8538       0.0349    55.9594      0.0
              BLR-MP            1.2492     1.4251         3.778            0.5254       0.0251    12.6189      4.4237
              BLR-MP(T)         1.9789     2.9533         7.9988           0.4401       0.0392    9.4412       4.4237
              BLR-MP(SRT)       0.9611     0.946          2.5652           0.5447       0.0272    13.4045      4.4237
              MR-MP             1.2963     1.4145         2.6311           0.7232       0.0675    15.1356      4.4736
              MR-MP(T)          1.3091     1.456          2.8259           0.8269       0.4169    13.0528      4.4736
              MR-MP(SRT)        1.1207     1.1367         2.4523           0.7189       0.0557    15.1666      4.4736
TABLE II: The verification performance of the different experimental settings achieved by the ResNet-50 model along with EUM trained with triplet loss and EUM trained with SRT loss. The results are reported on the MFR dataset. The proposed approach (SRT) reduces the EER from 1.2492% to 0.9611% in the BLR-MP setting and from 1.2963% to 1.1207% in the MR-MP setting.
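
To put the absolute EER differences of Table II into perspective, the relative error reduction can be computed directly from the tabulated values. The helper below is a small illustrative sketch; its name is hypothetical and the numbers are copied from Table II.

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent; positive means the error went down."""
    return 100.0 * (baseline - improved) / baseline

# EER values (in %) copied from Table II (ResNet-50, MFR dataset).
eer = {"BLR-MP": 1.2492, "BLR-MP(SRT)": 0.9611,
       "MR-MP": 1.2963, "MR-MP(SRT)": 1.1207}

blr_gain = relative_reduction(eer["BLR-MP"], eer["BLR-MP(SRT)"])
mr_gain = relative_reduction(eer["MR-MP"], eer["MR-MP(SRT)"])
```

With these values, the EER drops by roughly 23% in relative terms for BLR-MP and by roughly 13.5% for MR-MP.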

               Mobilefacenet      EER%      FMR100%       FMR1000%        G-mean        I-mean   FDR         FTX %
               BLR-BLP            0.0       0.0           0.0             0.8432        0.0488   37.382      0.0
               BLR-MP             3.4939    6.507         20.564          0.468         0.0307   7.1499      4.4237
               BLR-MP(T)          5.2759    12.7835       28.8175         0.3991        0.0501   5.9623      4.4237
               BLR-MP(SRT)        2.8805    4.6331        13.4384         0.5013        0.0383   8.6322      4.4237
               MR-MP              3.506     6.8842        17.3479         0.6769        0.1097   7.9614      4.4736
               MR-MP(T)           4.2947    7.9124        16.3772         0.8082        0.4716   6.6455      4.4736
               MR-MP(SRT)         3.1866     5.6166       13.529          0.6636        0.0837   8.0905      4.4736
TABLE III: The verification performance of the different experimental settings achieved by the MobileFaceNet model along with EUM trained with triplet loss and EUM trained with SRT loss. The results are reported on the MFR dataset. The proposed approach (SRT) reduces the EER from 3.4939% to 2.8805% in the BLR-MP setting and from 3.506% to 3.1866% in the MR-MP setting.
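
The FDR column reported in the tables follows the definition given in Section IV-E. A minimal sketch of that computation is shown below. The function name and the toy scores in the usage note are assumptions, and the population variance is used because the text does not state whether sample or population statistics are meant.

```python
from statistics import mean, pvariance

def fisher_discriminant_ratio(genuine, imposter):
    """FDR = (mu_G - mu_I)^2 / (sigma_G^2 + sigma_I^2) on comparison scores."""
    separation = (mean(genuine) - mean(imposter)) ** 2
    spread = pvariance(genuine) + pvariance(imposter)
    return separation / spread
```

For example, genuine scores of 0.8, 0.9, and 1.0 against imposter scores of 0.0, 0.1, and 0.2 are far apart relative to their spread and therefore give a large FDR, while shifting the genuine scores toward the imposter scores lowers it.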

the embedding size of 512-D in ResNet-50. 2) The considered models achieved lower performance when evaluated on the MRF2 dataset than when evaluated on the MFR dataset. This result was expected, as the images in the MRF2 dataset are crawled from the internet with large variations in facial expression, pose, and illumination. On the other hand, the images in the MFR dataset are collected in a collaborative environment.

To summarize, wearing a protective face mask has a negative effect on the face recognition verification performance. This result supports and complements (by evaluating masked-to-masked pairs in this work) the previous findings of the studies [3], [4], [5], [6] that evaluated the impact of wearing a mask on face recognition performance.

B. Impact of our EUM with SRT on the Separability between Genuine and Imposter Comparison Scores

The proposed approach enhanced the separability between the genuine and imposter comparison scores for both considered face recognition models and on both evaluated datasets. This improvement is noticeable from the increase in the FDR separability measure achieved by our proposed EUM trained with SRT, in comparison to the FDR measures achieved by the considered face recognition models, as shown in Tables II, III, IV, and V. This indicates a general expected improvement in the verification performance of the face recognition models and enhances the general trust in the verification decision. For example, when the ResNet-50 model is evaluated on the MFR dataset and the probe is masked, the FDR is increased from 12.6189 (BLR-MP) to 13.4045 (BLR-MP(SRT)) using our proposed approach. This improvement in the separability between the genuine and the imposter samples by our proposed approach is consistent in all reported results.

C. Impact of our EUM with SRT solution on the collaborative masked face recognition

When the considered models are evaluated on the MFR dataset, it can be observed that our proposed approach enhanced the masked face verification performance, as shown in Tables II and III. For example, when comparing masked probes to unmasked references, the EER achieved by the ResNet-50 model is 1.2492% (BLR-MP). This error rate is decreased to 0.9611% by our proposed approach (BLR-MP(SRT)), indicating a clear improvement in the verification performance induced by our proposed approach, as shown in Table II. Similar enhancement in the verification performance

                          Resnet50          EER%          FMR100%        FMR1000%               G-mean     I-mean      FDR               FTX %
                          F-F               0.0           0.0            0.0                    0.7477     0.0038      37.9345           0.0
                          F-M               4.3895        6.7568         10.473                 0.4263     0.0005      8.2432            0.9497
                          F-M(T)            6.4169        7.7703         12.1622                0.3567     -0.0066     6.8853            0.9497
                          F-M(SRT)          4.7274        7.4324         9.4595                 0.4553     0.0014      8.4507            0.9497
                          M-M               6.8831        10.0156        13.7715                0.6496     0.0301      4.7924            1.203
                          M-M(T)            6.8831        9.7027         14.0845                0.7759     0.3663      4.8791            1.203
                          M-M(SRT)          6.2578        9.0767         11.8936                0.6488     0.0144      4.9381            1.203
TABLE IV: The achieved verification performance of different experimental settings achieved by ResNet-50 model along
with EUM trained with triplet loss and EUM trained with SRT loss. The result is reported using MRF2 dataset. Our proposed
approach (SRT) achieved competitive performance on F-M experimental setting and the best performance on M-M experimental
setting.

[Figure 5 panels: ROC curves, x-axis FMR (0.0 to 0.5), y-axis 1 - FNMR (0.90 to 1.00), with the AUC values listed in each plot.]
(a) ResNet-50: BLR-MP. BLR-MP AUC = 0.9975, BLR-MP(T) AUC = 0.9960, BLR-MP(SRT) AUC = 0.9971
(b) ResNet-50: MR-MP. MR-MP AUC = 0.9968, MR-MP(T) AUC = 0.9969, MR-MP(SRT) AUC = 0.9968
(c) MobileFaceNet: BLR-MP. BLR-MP AUC = 0.9908, BLR-MP(T) AUC = 0.9878, BLR-MP(SRT) AUC = 0.9949
(d) MobileFaceNet: MR-MP. MR-MP AUC = 0.9916, MR-MP(T) AUC = 0.9875, MR-MP(SRT) AUC = 0.9918
Fig. 5: The verification performance of the considered face recognition models, our proposed EUM trained with naive triplet loss, and our proposed EUM trained with SRT loss, presented as ROC curves. The curves are plotted using the MFR dataset for the experimental settings BLR-MP (probe is masked) and MR-MP (reference and probe are masked). For each ROC curve, the area under the curve (AUC) is listed inside the plot. The improvement achieved by our proposed SRT is noticeable, and it is clearest in the cases where the masked probe strongly affected the verification performance.
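
The operating points behind such ROC curves, and the EER, FMR100, and FMR1000 values reported in the tables, can be derived from the genuine and imposter comparison scores. The sketch below is one straightforward way to do this and is not the authors' evaluation code: it assumes similarity scores (a comparison is accepted when its score is at or above the threshold), sweeps only the observed score values as thresholds, and approximates the EER by the operating point where FMR and FNMR are closest. All function names are hypothetical, and the returned values are fractions rather than percentages.

```python
def fmr_fnmr(genuine, imposter, threshold):
    """False match rate and false non-match rate at a decision threshold."""
    fmr = sum(s >= threshold for s in imposter) / len(imposter)
    fnmr = sum(s < threshold for s in genuine) / len(genuine)
    return fmr, fnmr

def verification_metrics(genuine, imposter):
    """Return (EER, FMR100, FMR1000) as fractions in [0, 1]."""
    thresholds = sorted(set(genuine) | set(imposter))
    thresholds.append(thresholds[-1] + 1.0)  # reject-all point, so FMR = 0 exists
    points = [fmr_fnmr(genuine, imposter, t) for t in thresholds]
    # EER: mean of FMR and FNMR at the threshold where they are closest.
    fmr, fnmr = min(points, key=lambda p: abs(p[0] - p[1]))
    eer = (fmr + fnmr) / 2
    # FMR100 / FMR1000: lowest FNMR with FMR <= 1.0% / <= 0.1%.
    fmr100 = min(fnmr for fmr, fnmr in points if fmr <= 0.01)
    fmr1000 = min(fnmr for fmr, fnmr in points if fmr <= 0.001)
    return eer, fmr100, fmr1000
```

Plotting 1 - FNMR against FMR over all thresholds yields ROC curves of the kind shown in Fig. 5.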

is observed by our approach when comparing masked probes to masked references. In this case, the EER is decreased from

[Figure 6 panels: ROC curves, x-axis FMR (0.0 to 0.5), y-axis 1 - FNMR (0.90 to 1.00), with the AUC values listed in each plot.]
(a) ResNet-50: F-M. F-M AUC = 0.9937, F-M(T) AUC = 0.9869, F-M(SRT) AUC = 0.9904
(b) ResNet-50: M-M. M-M AUC = 0.9766, M-M(T) AUC = 0.9836, M-M(SRT) AUC = 0.9780
(c) MobileFaceNet: F-M. F-M AUC = 0.9717, F-M(T) AUC = 0.9669, F-M(SRT) AUC = 0.9753
(d) MobileFaceNet: M-M. M-M AUC = 0.9609, M-M(T) AUC = 0.9671, M-M(SRT) AUC = 0.9675
Fig. 6: The verification performance of the considered face recognition models, our proposed EUM trained with naive triplet loss, and our proposed EUM trained with SRT loss, presented as ROC curves. The curves are plotted using the MRF2 dataset for the experimental settings F-M (probe is masked) and M-M (reference and probe are masked). For each ROC curve, the area under the curve (AUC) is listed inside the plot. The improvement achieved by our proposed SRT over the base face recognition model is noticeable in plots (c) and (d).

1.2963% (MR-MP) to 1.1207% (MR-MP(SRT)).

When the probes are masked, the EER achieved by the MobileFaceNet model is 3.4939% (BLR-MP). This error is reduced to 2.8805% using our proposed approach (BLR-MP(SRT)). Also, when comparing masked probes to masked references, the EER is decreased from 3.506% (MR-MP) to 3.1866% (MR-MP(SRT)) by our approach. The improvement in the masked face verification performance is also noticeable from the improvement in the FMR100 and FMR1000 measures.

D. Impact of our EUM with SRT on in-the-wild masked face recognition

The evaluation results achieved on the MRF2 dataset by the ResNet-50 and MobileFaceNet models are presented in Tables IV and V, respectively. Using masked probes, the EER achieved by the ResNet-50 model is 4.3895% (F-M), as shown in Table IV. Only in this case did the EER not improve with our proposed approach, which achieved an EER of 4.7274% (F-M(SRT)). Nonetheless, a notable improvement in the FMR1000 and the FDR separability measures can be observed from the reported result. The increase in FDR points to the possibility that, given larger and more representative evaluation data, the consistent enhancement in verification accuracy will be apparent here as well. A clear improvement in the verification performance is achieved by our approach when comparing masked probes to masked references. In this case, the achieved EER is decreased from 6.8831% (M-M) to 6.2578% (M-M(SRT)). A similar conclusion can be drawn from the improvement in the other verification performance measures and the FDR measure.

Using masked probes, the achieved verification performance

                MobileFaceNet      EER%          FMR100%        FMR1000%        G-mean     I-mean     FDR         FTX%
                F-F                0.0106        0.0            0.0             0.7318     0.0078     26.4276     0.0
                F-M                6.4169        16.8919        24.3243         0.3803     -0.0019    4.6457      0.9497
                F-M(T)             7.7685        15.8784        34.4595         0.3304     -0.0027    4.2067      0.9497
                F-M(SRT)           6.079         12.5           21.9595         0.4157     -0.0018    5.2918      0.9497
                M-M                8.4777        18.1534        28.795          0.6087     0.0509     3.2505      1.203
                M-M(T)             8.7634        17.5274        26.2911         0.7638     0.3966     3.5408      1.203
                M-M(SRT)           7.8232        15.0235        22.5352         0.6087     0.0241     3.5815      1.203
TABLE V: Verification performance of the different experimental settings achieved by the MobileFaceNet model alone and along
with EUM trained with triplet loss (T) and EUM trained with SRT loss (SRT). The results are reported on the MRF2 dataset. In
both the F-M and M-M experimental settings, our proposed approach (SRT) achieved the lowest EER, FMR100, and FMR1000.
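
The measures reported in the table can be illustrated with a minimal sketch in plain Python. It is not the evaluation code used in this work: the function names and the toy scores are hypothetical, and it assumes that a comparison is accepted when its similarity score is at or above the threshold, that FDR is the squared difference of the genuine and imposter means divided by the sum of their (population) variances, and that FMR100 is the lowest FNMR over thresholds whose FMR does not exceed 1%.

```python
from statistics import fmean, pvariance


def fdr(genuine, imposter):
    """Fisher discriminant ratio between genuine and imposter score sets."""
    mu_g, mu_i = fmean(genuine), fmean(imposter)
    return (mu_g - mu_i) ** 2 / (pvariance(genuine) + pvariance(imposter))


def error_rates(genuine, imposter, threshold):
    """Return (FMR, FNMR) when a pair is accepted if score >= threshold."""
    fmr = sum(s >= threshold for s in imposter) / len(imposter)
    fnmr = sum(s < threshold for s in genuine) / len(genuine)
    return fmr, fnmr


def eer(genuine, imposter):
    """Approximate EER: mean of FMR and FNMR where the two are closest."""
    best_gap, best_rate = None, None
    for t in sorted(set(genuine) | set(imposter)):
        fmr, fnmr = error_rates(genuine, imposter, t)
        gap = abs(fmr - fnmr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (fmr + fnmr) / 2
    return best_rate


def fnmr_at_fmr(genuine, imposter, max_fmr):
    """Lowest FNMR over all thresholds whose FMR is at most max_fmr."""
    # The infinite threshold rejects everything, so a candidate always exists.
    thresholds = sorted(set(genuine) | set(imposter)) + [float("inf")]
    rates = [error_rates(genuine, imposter, t) for t in thresholds]
    return min(fnmr for fmr, fnmr in rates if fmr <= max_fmr)


# Hypothetical similarity scores, not taken from any reported experiment.
genuine = [0.9, 0.8, 0.7, 0.6, 0.2]
imposter = [0.3, 0.1, 0.0, -0.1, 0.05]

print("G-mean:", fmean(genuine))
print("I-mean:", fmean(imposter))
print("FDR:", fdr(genuine, imposter))
print("EER:", eer(genuine, imposter))
print("FMR100:", fnmr_at_fmr(genuine, imposter, 0.01))
```

On real data the threshold sweep would run over many more scores, and an interpolated EER may be preferable to the closest-point approximation used here.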

by MobileFaceNet is significantly enhanced by our proposed approach. A similar improvement in the
verification performance is achieved by our approach when comparing masked probes to masked
references, as shown in Table V. For example, when the probes and references are masked, the EER
achieved by MobileFaceNet is 8.4777%. This error rate is reduced to 7.8232% using our proposed
approach.

  E. Ablation Study on Self-restrained Triplet loss

   In this subsection we experimentally demonstrate, and theoretically discuss, the advantage of our
proposed SRT solution over the common naive triplet loss. Using masked face data, we first explore
the validity of training the EUM model with triplet loss. It is noticeable that training EUM with
naive triplet loss is inefficient for learning from masked face embeddings, as presented in Tables
II, III, IV and V. For example, when the probe is masked, the EER achieved by EUM with triplet loss
on top of ResNet-50 is 1.9789%, in comparison to the 0.9611% EER achieved by EUM with our SRT, as
shown in Table II. It is crucial for learning with triplet loss that the input triplets violate the
condition $d(f(x_i^a), f(x_i^n)) > d(f(x_i^a), f(x_i^p)) + m$. Thus, the model can learn to minimize
the distance between the genuine pairs and maximize the distance between the imposter pairs. When
the previous condition is not violated, the loss value will be close to zero and the model will not
be able to further optimize the distances of the genuine pairs and imposter pairs. This motivated
our SRT solution.
   Given that our proposed EUM solution is built on top of a pre-trained face recognition model, the
feature embeddings of the genuine pairs are similar (to a large degree), and those of imposter pairs
are dissimilar. However, this similarity is affected (to some degree) when the faces are masked, and
our main goal is to reduce this effect. This can be observed from the results presented in Tables
II, III, IV and V. For example, using the MFR dataset, the G-mean and I-mean achieved by ResNet-50
are 0.8538 and 0.0349, respectively. When the probe is masked, the G-mean and I-mean shift to 0.5254
and 0.0251, respectively, as shown in Table II. The shift in G-mean indicates that the similarity
between the genuine pairs is reduced (to some degree) when the probe is masked. Training EUM with
naive triplet loss requires selecting triplets of embeddings that violate the triplet condition. As
we discussed earlier, the masked anchor is similar (to some degree) to the positive (unmasked
embedding) and dissimilar (to some degree) from the negative. Therefore, finding triplets that
violate the triplet condition is not trivial, and it may not be possible for many of the triplets in
the training dataset. This explains the poor result achieved when the EUM model is trained with
triplet loss, as only a few triplets violate the triplet loss condition. One could assume that using
a larger margin value allows the EUM model to further optimize the genuine pair distances and the
imposter pair distances, as the triplet condition can be violated by increasing the margin value.
However, by increasing the margin value, we increase the upper bound of the loss function. Thus, we
ignore the fact that the distance between imposter pairs is sufficient with respect to the distance
between genuine pairs in the embedding space. For example, using unmasked data, the mean of the
imposter scores achieved by ResNet-50 on the MFR dataset is 0.0349. When the probe is masked, the
mean of the imposter scores is 0.0251, as shown in Table II. Therefore, any further optimization of
the distance between the imposter pairs will affect the discriminative features learned by the base
face recognition model, and there is no restriction on the learning process that ensures that the
model will maintain the distance between the imposter pairs. Alternatively, training the EUM model
with our SRT loss achieved a significant improvement in minimizing the distance between the genuine
pairs; simultaneously, it keeps the distance between the imposter pairs close to the one learned by
the base face recognition model. It is noticeable from the reported results that the I-mean achieved
by our SRT is closer to the I-mean achieved when the model is evaluated on unmasked data, in
comparison to the one achieved by naive triplet loss, as shown in Tables II, III, IV and V. The
achieved results point to the efficiency of our proposed EUM trained with SRT in improving masked
face recognition, in comparison to the considered face recognition models. They also support our
theoretical motivation behind SRT, as training the EUM with SRT significantly outperformed the EUM
trained with naive triplet loss.

                                      VI. CONCLUSION
   In this paper, we presented and evaluated a novel solution to reduce the negative impact of
wearing a protective face mask

on face recognition performance. This work was motivated by the recent evaluation efforts on the
effect of masked faces on the face recognition performance [3], [4], [5], [6]. The presented
solution is designed to operate on top of existing face recognition models, thus avoiding the need
for retraining existing face recognition solutions used for unmasked faces. This goal has been
accomplished by proposing the EUM, which operates on the embedding space. The learning objective of
our EUM is to increase the similarity between genuine unmasked-masked pairs and to decrease the
similarity between imposter pairs. We achieved this learning objective by proposing a novel loss
function, the SRT. Through an ablation study and experiments on two real masked face datasets and
two face recognition models, we demonstrated that our proposed EUM with SRT significantly improved
the masked face verification performance in most experimental settings.

                                     ACKNOWLEDGMENT

   This research work has been funded by the German Federal Ministry of Education and Research and
the Hessen State Ministry for Higher Education, Research and the Arts within their joint support of
the National Research Center for Applied Cybersecurity ATHENE.

                                       REFERENCES

 [1] D. Gorodnichy, S. Yanushkevich, and V. Shmerko, “Automated border control: Problem
     formalization,” in Computational Intelligence in Biometrics and Identity Management (CIBIM),
     2014 IEEE Symposium on. IEEE, 2014, pp. 118–125.
 [2] G. Lovisotto, R. Malik, I. Sluganovic, M. Roeschlin, P. Trueman, and I. Martinovic, “Mobile
     biometrics in financial services: A five factor framework,” Tech. Rep.
 [3] N. Damer, J. H. Grebe, C. Chen, F. Boutros, F. Kirchbuchner, and A. Kuijper, “The effect of
     wearing a mask on face recognition performance: an exploratory study,” in BIOSIG 2020 -
     Proceedings of the 19th International Conference of the Biometrics Special Interest Group,
     online, 16.-18. September 2020, ser. LNI, A. Brömme, C. Busch, A. Dantcheva, K. B. Raja,
     C. Rathgeb, and A. Uhl, Eds., vol. P-306. Gesellschaft für Informatik e.V., 2020, pp. 1–10.
     [Online]. Available: https://dl.gi.de/20.500.12116/34316
 [4] Department of Homeland Security, “Biometric Technology Rally at MDTF,”
     https://mdtf.org/Rally2020, 2020, last accessed: March 3, 2021.
 [5] M. L. Ngan, P. J. Grother, and K. K. Hanaoka, “Ongoing face recognition vendor test (frvt)
     part 6b: Face recognition accuracy with face masks using post-covid-19 algorithms,” 2020.
 [6] M. L. Ngan, P. J. Grother, and K. K. Hanaoka, “Ongoing face recognition vendor test (frvt)
     part 6b: Face recognition accuracy with face masks using post-covid-19 algorithms,” 2020.
 [7] A. Anwar and A. Raychowdhury, “Masked face recognition for secure authentication,” 2020.
 [8] Y. Li, K. Guo, Y. Lu, and L. Liu, “Cropping and attention based approach for masked face
     recognition,” Applied Intelligence, pp. 1–14, 2021.
 [9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016
     IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA,
     June 27-30, 2016. IEEE Computer Society, 2016, pp. 770–778. [Online]. Available:
     https://doi.org/10.1109/CVPR.2016.90
[10] S. Chen, Y. Liu, X. Gao, and Z. Han, “Mobilefacenets: Efficient cnns for accurate real-time
     face verification on mobile devices,” CoRR, vol. abs/1804.07573, 2018. [Online]. Available:
     http://arxiv.org/abs/1804.07573
[11] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep
     face recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019,
     Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 2019,
     pp. 4690–4699. [Online]. Available:
     http://openaccess.thecvf.com/content_CVPR_2019/html/Deng_ArcFace_Additive_Angular_Margin_Loss_for_Deep_Face_Recognition_CVPR_2019_paper.html
[12] L. Song, D. Gong, Z. Li, C. Liu, and W. Liu, “Occlusion robust face recognition based on mask
     learning with pairwise differential siamese network,” in 2019 IEEE/CVF International
     Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019,
     2019, pp. 773–782. [Online]. Available: https://doi.org/10.1109/ICCV.2019.00086
[13] M. Opitz, G. Waltner, G. Poier, H. Possegger, and H. Bischof, “Grid loss: Detecting occluded
     faces,” CoRR, vol. abs/1609.00129, 2016. [Online]. Available: http://arxiv.org/abs/1609.00129
[14] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face
     recognition and clustering,” pp. 815–823, 2015. [Online]. Available:
     https://doi.org/10.1109/CVPR.2015.7298682
[15] Z. Wang, G. Wang, B. Huang, Z. Xiong, Q. Hong, H. Wu, P. Yi, K. Jiang, N. Wang, Y. Pei et al.,
     “Masked face recognition dataset and application,” arXiv preprint arXiv:2003.09093, 2020.
[16] M. Loey, G. Manogaran, M. H. N. Taha, and N. E. M. Khalifa, “A hybrid deep transfer learning
     model with machine learning methods for face mask detection in the era of the covid-19
     pandemic,” Measurement, vol. 167, p. 108288, 2021.
[17] Z. Wang, P. Wang, P. C. Louis, L. E. Wheless, and Y. Huo, “Wearmask: Fast in-browser face mask
     detection with serverless edge computing for covid-19,” arXiv preprint arXiv:2101.00784, 2021.
[18] B. Qin and D. Li, “Identifying facemask-wearing condition using image super-resolution with
     classification network to prevent covid-19,” Sensors, vol. 20, no. 18, p. 5236, 2020.
[19] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing
     internal covariate shift,” in Proceedings of the 32nd International Conference on Machine
     Learning, ICML 2015, Lille, France, 6-11 July 2015, ser. JMLR Workshop and Conference
     Proceedings, F. R. Bach and D. M. Blei, Eds., vol. 37. JMLR.org, 2015, pp. 448–456. [Online].
     Available: http://proceedings.mlr.press/v37/ioffe15.html
[20] A. L. Maas, “Rectifier nonlinearities improve neural network acoustic models,” 2013.
[21] Y. Feng, H. Wang, H. R. Hu, L. Yu, W. Wang, and S. Wang, “Triplet distillation for deep face
     recognition,” in IEEE International Conference on Image Processing, ICIP 2020, Abu Dhabi,
     United Arab Emirates, October 25-28, 2020. IEEE, 2020, pp. 808–812. [Online]. Available:
     https://doi.org/10.1109/ICIP40778.2020.9190651
[22] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “Vggface2: A dataset for recognising
     faces across pose and age,” in 13th IEEE International Conference on Automatic Face & Gesture
     Recognition, FG 2018, Xi’an, China, May 15-19, 2018. IEEE Computer Society, 2018, pp. 67–74.
     [Online]. Available: https://doi.org/10.1109/FG.2018.00020
[23] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Inverted residuals and linear
     bottlenecks: Mobile networks for classification, detection and segmentation,” CoRR,
     vol. abs/1801.04381, 2018. [Online]. Available: http://arxiv.org/abs/1801.04381
[24] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A dataset and benchmark for
     large-scale face recognition,” in Computer Vision - ECCV 2016 - 14th European Conference,
     Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, ser. Lecture Notes in
     Computer Science, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., vol. 9907. Springer,
     2016, pp. 87–102. [Online]. Available: https://doi.org/10.1007/978-3-319-46487-9_6
[25] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database
     for studying face recognition in unconstrained environments,” University of Massachusetts,
     Amherst, Tech. Rep. 07-49, October 2007.
[26] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask
     cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10,
     pp. 1499–1503, 2016.
[27] D. E. King, “Dlib-ml: A machine learning toolkit,” J. Mach. Learn. Res., vol. 10,
     pp. 1755–1758, 2009. [Online]. Available: https://dl.acm.org/citation.cfm?id=1755843
[28] N. Poh and S. Bengio, “A study of the effects of score normalisation prior to fusion in
     biometric authentication tasks,” IDIAP, Tech. Rep., 2004.