Hindawi
Computational and Mathematical Methods in Medicine
Volume 2021, Article ID 4019358, 11 pages
https://doi.org/10.1155/2021/4019358

Research Article
A Novel Approach to Classifying Breast Cancer Histopathology
Biopsy Images Using Bilateral Knowledge Distillation and Label
Smoothing Regularization

Sushovan Chaudhury,1 Nilesh Shelke,2 Kartik Sau,1 B. Prasanalakshmi,3 and Mohammad Shabaz4

1 University of Engineering and Management, Kolkata, India
2 Priyadarshini Indira Gandhi College of Engineering, Nagpur, India
3 Department of Computer Science, King Khalid University, Abha, Saudi Arabia
4 Arba Minch University, Ethiopia

Correspondence should be addressed to B. Prasanalakshmi; prengaraj@kku.edu.sa

Received 3 September 2021; Revised 1 October 2021; Accepted 1 October 2021; Published 20 October 2021

Academic Editor: Deepika Koundal

Copyright © 2021 Sushovan Chaudhury et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Breast cancer is the most common invasive cancer in women and the second leading cause of cancer death in females; tumours can be classified as benign or malignant. Research on breast cancer detection and prevention has attracted growing attention in recent years. At the same time, the development of data mining methods provides an effective way to extract useful information from complex databases, enabling prediction, classification, and clustering based on the extracted information. The generic notion of knowledge distillation is that a network of higher capacity acts as a teacher and a network of lower capacity acts as a student, and several distillation pipelines are known. However, previous work on knowledge distillation using label smoothing regularization breaks this general notion, showing experimentally that distillation also works when the roles are swapped, i.e., when a student model distils knowledge into a teacher model (reverse knowledge distillation). That work also shows that even a poorly trained teacher model can train a student model to comparable accuracy. Building on these ideas, we propose a novel bilateral knowledge distillation regime that enables multiple interactions between teacher and student models, i.e., teaching and distilling each other, eventually improving each other's performance, and we evaluate our results on the BACH breast cancer histopathology image dataset. The pretrained ResNeXt29 and MobileNetV2 models, both previously trained on the ImageNet dataset, are used for transfer learning on our dataset, and we obtain a final accuracy of more than 96% using this novel approach of bilateral KD.

1. Introduction

Breast cancer is the most lethal type of tumour in women worldwide, second only to lung cancer in terms of prevalence. In the most common type of breast cancer, popularly known as invasive ductal carcinoma (IDC), the cancer develops in the milk ducts of the breast and then rapidly spreads to involve the surrounding structures. Invasive ductal carcinoma is so aggressive that it can infiltrate the lymph nodes and the circulatory system, allowing the cancer to spread to various parts of the body. Even though invasive or infiltrating ductal carcinoma may be rapidly progressive, it is still treatable, at least when identified in its initial stages [1]. Machine learning is a branch of artificial intelligence that employs a variety of statistical, probabilistic, and optimization techniques that allow computers to "learn" from past examples and to detect hard-to-discern patterns in large, noisy, or complex datasets. As a result, machine learning is frequently used in cancer diagnosis and detection. Machine learning and deep learning approaches may help radiologists to a large extent by detecting breast cancer at an early stage through identifying useful patterns in malignant cells. The learnt patterns can be used to classify unknown images of the breast as benign or malignant. Mammography, ultrasonography, and histological biopsies are some of the numerous options for breast cancer detection (BCD). Mammograms can help to reveal calcification in the human breast, whereas studying histology images remains a challenge due to the complexity of pattern identification. This study analyzes the gene activities of surviving vs. deceased patients for each therapy, and the potential biomarkers will help to identify the best therapy for patients based on their gene expression tests. This model has very high accuracy, and it uses a hierarchical model as a tree that includes one-versus-rest classifications. In 2012, there were roughly 1.7 million new cancer cases (representing 25% of all malignancies in women) and 0.5 million cancer deaths (representing 15% of all cancer fatalities in women). In 140 nations, ductal carcinoma is the most prevalent cancer diagnosis among females, and it is the primary cause of cancer death in 101 nations. According to McGuire [2], histological study of breast cancers can also reveal the stage of the disease, or the extent to which the cancer has spread. This is measured in two ways: by the size of the tumour and by assessing the involvement of regional lymph nodes. Both of these variables provide valuable prognostic data for patient care. Smith et al. [3] recommend screening practices for the early recognition of breast cancer. Computer-aided diagnosis (CAD) of ductal carcinoma can improve the performance of radiologists severalfold, as explained by Filipczuk et al. and Kowal et al. [4, 5]. The basic idea of a CAD system is to take raw images from different modalities, such as mammograms and histology images, and preprocess them. This preprocessing includes adjusting size, contrast, and brightness and performing augmentation when input data are scarce. The next step involves segmentation, to identify the regions of interest in the image so that the system learns only the relevant part of the image and not the noise. The third stage is to extract features from a region of interest (ROI), which is usually done using a CNN, which handles automatic feature extraction by producing feature maps at each convolution layer and regularising the model with dropout so that it does not overfit [6]. The higher-order feature maps are flattened, and classification is performed using the class probabilities generated by a softmax layer. These models, however, are massive, containing millions (if not billions) of parameters, and so cannot be deployed on edge devices. To reduce computational complexity, the technique is tested on the BACH dataset, a large histology dataset for breast cancer detection. The concept is model compression: a larger and more sophisticated pretrained network (the teacher) teaches a smaller network (the student) step by step. Indeed, the theoretical search space of a more complicated model is bigger than that of a smaller one [7, 8]. However, if we suppose that equivalent (or even comparable) convergence can be attained using a smaller network, then the convergence space of the teacher network should overlap with the solution space of the student network. Unfortunately, that by itself does not ensure that the solution spaces of the student network and the teacher network converge in the same region; the convergence of the student network may be completely different from the convergence of the teacher network [9].

The basic steps of knowledge distillation are given as follows (a minimal sketch of steps (3) and (4) in code follows the list).

(1) Train the teacher network: the complex teacher network is first trained using the whole dataset on high-performing GPUs

(2) Establish a link of correspondence: the student network is then designed, and a correspondence needs to be formed between each output of the simple student network and the complex teacher network

(3) Forward propagation of knowledge through the teacher network: feedforward data through the teacher network to get all intermediate outputs; apply augmentation (if any) to the same

(4) Backpropagation through the student network: the output from the teacher and the correspondence relation are used to backpropagate the error to the student. Therefore, the student network learns to imitate the "knowledge" of the teacher
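The following PyTorch sketch illustrates steps (3) and (4) under the standard soft-target formulation of Hinton et al. [18]; the temperature and weighting values are illustrative assumptions, not the exact configuration used in this paper.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, images, labels,
                      temperature=4.0, alpha=0.9):
    """One knowledge-distillation step: the student mimics the
    teacher's softened outputs while also fitting the hard labels."""
    teacher.eval()
    with torch.no_grad():                      # step (3): forward pass through the teacher
        teacher_logits = teacher(images)

    student_logits = student(images)           # student forward pass

    # Soft targets: KL divergence between temperature-softened distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)

    hard_loss = F.cross_entropy(student_logits, labels)  # ordinary supervised loss

    loss = alpha * soft_loss + (1 - alpha) * hard_loss   # weighted combination
    optimizer.zero_grad()
    loss.backward()                            # step (4): backpropagate through the student only
    optimizer.step()
    return loss.item()
```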
2. Related Work

Koboldt et al. [10] studied primary ductal carcinoma by genomic DNA copy arrays, DNA methylation, exome sequencing, messenger RNA arrays, micro-RNA sequencing, and reverse-phase protein arrays. The different classifications of breast cancers have been researched in accordance with WHO standards [11], which outline the methods required to examine the histological description of breast cancer, similar to the review by Veta et al. [12]. With the introduction of whole slide imaging (WSI) scanners, which offer affordable, high-throughput histopathology slide digitalization, this research topic has become particularly relevant, with the goal of substituting the optical microscope as the principal tool used by pathologists. Breast cancer classification from histological photos is of considerable therapeutic value, according to [13], although it is yet to be fully studied. Multiclass classification of ductal carcinoma aims to find the subcategories of breast cancer (ductal carcinoma, fibroadenoma, lobular carcinoma, etc.). However, multiclass classification of breast cancer from histopathology pictures faces two major challenges: (1) sorting of binary classes (benign and malignant) and (2) minor variances between classes due to the wide diversity of high-resolution picture appearances, strong coherency of malignant cells, and large colour distribution inhomogeneity. Sharma et al. [14] gave special attention to histopathology images in the BACH dataset; an attention-based model was proposed in which the network emphasizes learning from patches, and it records good accuracy in multilevel classification. Kausar et al. [15] experimented with a number of deep convolutional models, using colour normalisation as a normalisation strategy before applying augmentation to the input data. Praveena Anjelin and Ganesh Kumar [16] experimented with different transfer learning techniques

and pretrained models such as MobileNetV2, AlexNet, ResNet, and VGGNet on the BACH dataset. Several efforts in the area of knowledge distillation have also been made. Ba and Caruana [17] experimented with the CIFAR10 dataset and found that a large network is capable of training smaller networks that are quite shallow yet able to match the accuracy of deep neural nets; since the student networks need not be deep, the computational complexity can be reduced. Hinton et al. [18] reported startling findings on MNIST, demonstrating that condensing the knowledge of an ensemble of models into a single model can greatly improve the acoustic model of a widely used commercial system [19]. Often the finest outcomes are attained by ensembling many individual classifiers, which is computationally expensive. In [20], researchers offer a strategy for "compressing" big, complicated ensembles into smaller, quicker models with little to no performance loss. Dipalma et al. [21] utilised knowledge distillation with model compression to address challenging problems in whole slide images, such as histology. Knowledge distillation has been used in object detection by Dai et al. [22], face recognition by Ge et al. [23], semantic segmentation by He et al. [24], classification of chest X-ray abnormalities by Ho and Gwak [25], and retrieval of land surface temperature from MODIS daytime mid-infrared data by Tang and Wang [26]. Knowledge distillation was used to enhance the computational efficiency of breast cancer diagnosis by Garg et al. and Thiagarajan et al. [27, 28]. They stress that in a variety of applications, such as medical image analysis, weakly supervised instance labelling using simple image-level labels rather than costly fine-grained pixel annotations is critical. In contrast to traditional computer vision instance segmentation, the challenges we address are characterised by a small number of training images and nonlocal patterns that lead to diagnosis. Motivated by all of these KD applications, as well as the use of KD and computationally efficient neural nets for training on large WSIs such as the BACH images of the breast, we decided to focus on bilateral knowledge distillation, in which the teacher and student learn from each other and improve each other's performance, as shown in [29–31].

3. Methodology

3.1. Dataset and Preprocessing. For our experiments, we used the ICIAR 2018 Grand Challenge on Breast Cancer Histology (BACH) dataset [32], as described in Figure 1. The dataset is comprised of haematoxylin and eosin (H&E) stained breast microscopy and whole slide images labelled as normal, benign, in situ carcinoma, or invasive carcinoma. The labelling was performed by two clinical specialists, and images with conflicting labels were discarded. A total of 400 microscopy images are in the collection, of which 100 each are normal, benign, in situ carcinoma, and invasive carcinoma images. All microscopy images are in .tiff format, and their specifications are given in Table 1.

According to the major cancer type in each image, these photos are categorized as [N] normal, [B] benign, [I] in situ carcinoma, or invasive carcinoma, with 100 labels each. The dataset was split into 70:30 train-test subgroups for the studies we ran. Because of variances in staining methods, scanner colour response, and tissue preparation, staining is a complex process that can result in undesired colour variations among tissue types. Since colour variance across pictures can affect performance, we employ structure-preserving colour normalisation with sparse stain separation. The approach uses one picture as the target, and the other images are normalised by merging their separate stain density maps with the target image's stain colour basis, maintaining the morphology. After that, the modified dataset is scaled between 0 and 1, and each channel is normalised to the ImageNet statistics. The ImageNet project is a large visual database used in visual object recognition research, comprising more than 14 million images of various types, both labelled and unlabelled. We use online data augmentation techniques to provide variety in the training dataset, allowing the network to generalise to new data. Because the histological pictures are rotationally invariant, horizontal and vertical flips with rotations ranging from -25 to 25 degrees were included. Random zoom and intensity shifts were also utilised as augmentation methods (a minimal sketch of this pipeline is given below).
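The augmentation and channel normalisation described above can be expressed with torchvision transforms; this is a minimal sketch under the stated settings (flips, ±25° rotations, ImageNet channel statistics). The resize target and jitter strength are illustrative assumptions, and the structure-preserving stain normalisation is a separate preprocessing pass not shown here.

```python
from torchvision import transforms

# ImageNet channel statistics, matching the pretrained backbones.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_transform = transforms.Compose([
    transforms.Resize((512, 384)),           # assumed working resolution
    transforms.RandomHorizontalFlip(),       # histology is rotationally invariant
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=25),   # rotations in [-25, 25] degrees
    transforms.RandomAffine(degrees=0, scale=(0.9, 1.1)),  # random zoom
    transforms.ColorJitter(brightness=0.1),  # mild intensity shift
    transforms.ToTensor(),                   # scales pixel values to [0, 1]
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```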

[Figure 1: The labelled images of each type of cell used in classification. Panels: normal, benign, in situ carcinoma, invasive carcinoma.]

Table 1: Specifications of microscopy images.

S no.   Parameter       Details
1       Colour model    Red, green, blue
2       Size            2048 (W) × 1536 (H) pixels
3       Pixel scale     0.42 μm × 0.42 μm
4       Memory size     10–20 MB (approx.)

3.2. Knowledge Distillation in Theory. Knowledge distillation refers to distilling knowledge from a complex teacher model into a relatively weak and light student model. However, it can be better summarized as a label smoothing regularization technique. The outcome of knowledge distillation depends not only on the similarity information between teacher categories but also on the regularization of soft targets. Label smoothing regularization refers to training the model on a mixture of the one-hot label and a soft, smoothed label exchanged between the teacher and the student. The loss function of label smoothing regularization (LSR) for a network S is derived as follows. For every training instance x, S outputs the probability of each label a ∈ {1, …, A} as shown in the following equation:

    r(a \mid x) = \operatorname{softmax}(T_a) = \frac{\exp(T_a)}{\sum_{f=1}^{A} \exp(T_f)},    (1)

where S is the neural network to train and T_a is the logit of the neural network for label a. The smoothed label distribution is given in the following equation:

    n'(a) = (1 - \beta)\, n(a) + \beta\, v(a),    (2)

where n'(a), the smoothed label distribution, is a mixture of the ground-truth distribution n(a) and a fixed distribution v(a) with weight β. Here, v(a) is the uniform distribution v(a) = 1/A.

The smoothed entropy loss B(n', r) is defined over the smoothed label as in the following equation:

    B(n', r) = -\sum_{a=1}^{A} n'(a) \log r(a) = (1 - \beta)\, B(n, r) + \beta\, B(v, r) = (1 - \beta)\, B(n, r) + \beta \left( C_{AM}(v, r) + B(v) \right),    (3)

where C_{AM} is the Kullback-Leibler (KL) divergence and B(v) denotes the entropy of v, a constant for the fixed uniform distribution v(a).

The loss function for label smoothing of the neural network S can therefore be written as in the following equations:

    M_{MO} = (1 - \beta)\, B(n, r) + \beta\, C_{AM}(v, r),    (4)

    m_{AC} = (1 - \beta)\, B(n, r) + \beta\, C_{AM}(r^x_\pi, r_\pi),    (5)

where B(n, r) is the minimized cross-entropy and r^x_\pi in C_{AM}(r^x_\pi, r_\pi) is a distribution from the teacher model softened with temperature π.

Additionally, C_{AM}(r^x_\pi, r_\pi) = B(r^x_\pi, r_\pi) - B(r^x_\pi), where B(r^x_\pi) is a constant entropy. Therefore, Equation (5) can be reframed as the following equation:

    m_{AC} = (1 - \beta)\, B(n, r) + \beta \left( B(r^x_\pi, r_\pi) - B(r^x_\pi) \right) \simeq (1 - \beta)\, B(n, r) + \beta\, B(r^x_\pi, r_\pi).    (6)

Setting the temperature π = 1 gives m_{AC} = B(\tilde{n}^x, r), where \tilde{n}^x is as follows:

    \tilde{n}^x(a) = (1 - \beta)\, n(a) + \beta\, r^x(a).    (7)

A minimal sketch of this loss in code is given below.
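The following PyTorch sketch expresses the LSR-style loss of Equations (5)-(7); the symbol-to-code mapping (β as the mixing weight, π as the temperature) follows the formulation above, the function name is our own, and the default values follow the hyperparameters reported in Section 4.3. With pi = 1 this reduces to training on the smoothed labels of Equation (7).

```python
import torch
import torch.nn.functional as F

def lsr_kd_loss(student_logits, teacher_logits, labels, beta=0.95, pi=20.0):
    """Equation (5): (1 - beta) * B(n, r) + beta * C_AM(r_pi^x, r_pi).

    B(n, r) is the ordinary cross-entropy against the hard labels; C_AM is
    the KL divergence between the temperature-softened teacher distribution
    r_pi^x and the student distribution r_pi."""
    hard = F.cross_entropy(student_logits, labels)          # B(n, r)
    kl = F.kl_div(
        F.log_softmax(student_logits / pi, dim=1),          # r_pi (student)
        F.softmax(teacher_logits / pi, dim=1),              # r_pi^x (teacher)
        reduction="batchmean",
    )                                                       # C_AM(r_pi^x, r_pi)
    return (1 - beta) * hard + beta * kl
```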
3.3. Reverse Knowledge Distillation in Theory. Unlike normal KD, Re-KD takes the teacher's accuracy as the baseline accuracy, which is much higher than in normal KD; in various cases, Re-KD even exceeds normal KD. The basis of this concept is that the teacher becomes the student and the student becomes the teacher. Using Re-KD on the BACH dataset of cancer histopathology biopsy images, we found that Re-KD produces better results.

3.4. Bilateral Knowledge Distillation in Theory. For conventional knowledge distillation, the loss function is mathematically defined as follows:

    M_{WAC} = \delta(j, j') + \sum_{h \in H} \eta_h\, \rho_h(E^S_h, E^X_h),    (8)

where δ(j, j') represents the task-specific loss, j is the ground truth, E^S_h is the transformed output of the student network, E^X_h is the transformed output of the teacher network (both produced from the h-th module), η_h is a tunable balance factor which balances the unlike losses, and ρ_h(·, ·) is the h-th distillation loss, which reduces the difference between E^S_h and E^X_h.

Consider k_f ∈ {0, 1} to be a Bernoulli random variable, where f indexes the f-th hybrid block in the hybrid network. When k_f = 1 the student path is selected, and when the teacher path is chosen, k_f = 0. Suppose B_f represents the input and B_{f+1} the output of the f-th hybrid block; the output of the f-th hybrid block can then be represented as follows:

    B_{f+1} = k_f\, I^S_f(B_f) + (1 - k_f)\, I^X_f(B_f).    (9)

Here, I^S_f(·) is the functional representation of the student block and I^X_f(·) is the functional representation of the teacher block (a minimal sketch of such a hybrid block is given below).
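A minimal PyTorch sketch of the hybrid block in Equation (9), assuming per-block student and teacher submodules with matching input/output shapes; the class name and sampling scheme are illustrative.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Equation (9): B_{f+1} = k_f * I_S(B_f) + (1 - k_f) * I_T(B_f),
    where k_f ~ Bernoulli(p) selects the student or teacher path."""

    def __init__(self, student_block: nn.Module, teacher_block: nn.Module,
                 p_student: float = 0.5):
        super().__init__()
        self.student_block = student_block
        self.teacher_block = teacher_block
        self.p_student = p_student

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sample k_f once per forward pass.
        k = float(torch.bernoulli(torch.tensor(self.p_student)))
        if k == 1.0:                      # k_f = 1: block reduces to the student path
            return self.student_block(x)
        return self.teacher_block(x)      # k_f = 0: block reduces to the teacher path
```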

[Figure 2: The proposed framework of knowledge distillation. After baseline training of ResNeXt29, phase 1 runs KD (teacher ResNeXt29 → student MobileNetV2, 200 epochs) and then Re-KD (teacher MobileNetV2 → student ResNeXt29, 200 epochs); phase 2 repeats both steps; phase 3 runs a final KD from ResNeXt29 to MobileNetV2.]

Phase 1
  Step 1: Knowledge distil MobileNetV2 - teacher model: ResNeXt29 - train e epochs
  Step 2: Reverse knowledge distil ResNeXt29 - teacher model: MobileNetV2 - train e epochs
Phase 2
  Step 1: Knowledge distil MobileNetV2 (phase 1) - teacher model: ResNeXt29 (phase 1) - train e epochs
  Step 2: Reverse knowledge distil ResNeXt29 - teacher model: MobileNetV2 - train e epochs
Phase 3
  Step 1: Knowledge distil MobileNetV2 (phase 2) - teacher model: ResNeXt29 (phase 2) - train e epochs

Algorithm 1: Bilateral knowledge distillation (base training: ResNeXt29 for e epochs).
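Algorithm 1 can be sketched as a simple phase loop; here `train` (plain supervised training) and `distill` (a KD run built on a step such as `distillation_step` above) are assumed helper routines, and the epoch count and phase schedule follow the algorithm.

```python
def bilateral_kd(resnext29, mobilenetv2, train_loader, epochs=200, phases=3):
    """Bilateral KD (Algorithm 1): alternate forward KD and reverse KD in
    each phase, ending with a final forward distillation in the last phase."""
    train(resnext29, train_loader, epochs)                # baseline teacher training
    for phase in range(1, phases + 1):
        # Step 1: forward KD, ResNeXt29 teaches MobileNetV2.
        distill(teacher=resnext29, student=mobilenetv2,
                loader=train_loader, epochs=epochs)
        if phase == phases:
            break                                         # last phase stops after forward KD
        # Step 2: reverse KD, MobileNetV2 teaches ResNeXt29.
        distill(teacher=mobilenetv2, student=resnext29,
                loader=train_loader, epochs=epochs)
    return mobilenetv2
```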

Taking the value of k_f to be 1, Equation (9) takes the following form:

    B_{f+1} = I^S_f(B_f).    (10)

This simply means the hybrid block reduces to the original student block. Whereas if we consider k_f = 0, Equation (9) can be formulated as follows:

    B_{f+1} = I^X_f(B_f).    (11)

Now suppose δ(·, ·) is the task-specific loss, j represents the ground truth, and \tilde{j}_y denotes the predicted result of the hybrid network; the optimized loss function d_y of the hybrid network can be written as follows:

    d_y = \delta(j, \tilde{j}_y).    (12)

As per Figure 2 and Algorithm 1, we propose a bilateral knowledge distillation regime. In phase 1, the more complex model, ResNeXt29, is used as the teacher model to train the lightweight MobileNetV2. In step 2, the teacher and the student roles are reversed, interactively exchanging knowledge. It has already been shown in the literature that the teacher and student can distil knowledge into each other. In this novel approach, phase 2 comes into the picture, where the learnings of the teacher and student are fed back to each other, improving accuracy at each phase. We stopped at phase 3, where knowledge distillation into MobileNetV2 produced state-of-the-art accuracy, as discussed in our results. This process can be iterated over several phases until the desired accuracy is reached, depending on the computational resources available. The idea is further illustrated in Figure 3 with a schematic description.

4. Results

4.1. Experimental Setup. For our experiments, we use two well-known transfer learning networks pretrained on the ImageNet dataset, ResNeXt29 and MobileNetV2. The BACH 2018 dataset of breast histopathology images was then trained on these networks. PyTorch (build 1.9.0) with the CUDA 10.2 platform was used for the experiments (a minimal setup sketch follows).
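A minimal setup sketch for the experiment: MobileNetV2 is available pretrained in torchvision, while ResNeXt29 (a CIFAR-style ResNeXt variant) is not, so it is represented here by an assumed local implementation; the four-class head matches the BACH labels.

```python
import torch
import torch.nn as nn
from torchvision import models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Student: ImageNet-pretrained MobileNetV2 with a 4-class BACH head.
mobilenetv2 = models.mobilenet_v2(pretrained=True)
mobilenetv2.classifier[1] = nn.Linear(mobilenetv2.last_channel, 4)
mobilenetv2 = mobilenetv2.to(device)

# Teacher: ResNeXt29 (8x64d, ~34.5M parameters); assumed to come from a
# local CIFAR-style implementation, since torchvision does not ship it.
from resnext import ResNeXt29_8x64d  # hypothetical local module
resnext29 = ResNeXt29_8x64d(num_classes=4).to(device)
```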

[Figure 3: Schematic flow diagram of bilateral KD as used in this paper. Phase 1: step 1 distils MobileNetV2 from teacher ResNeXt29 (e epochs), step 2 reverse-distils ResNeXt29 from teacher MobileNetV2 (e epochs); phase 2 repeats both steps with the phase-1 models; phase 3 performs the final distillation of MobileNetV2 from the phase-2 ResNeXt29.]

[Figure 4: Train and test set accuracy of MobileNetV2. Panels: train accuracy, test accuracy, train loss, and test loss over 200 epochs.]

[Figure 5: Train and test set accuracy of ResNeXt29. Panels: train accuracy, test accuracy, train loss, and test loss over 200 epochs.]

4.2. Baseline Teacher Training. Our baseline training of ResNeXt29 (8 × 64d, with 34.53 M parameters) and MobileNetV2 (comprising an initial full convolution layer with 32 filters, followed by 19 residual bottleneck layers) indicates that the validation accuracy for ResNeXt29 is as high as 75 percent, whereas that of MobileNetV2 is 60.71 percent, as shown in Figures 4 and 5.

4.3. Tuning Hyperparameters. Several hyperparameters were tuned as follows (a setup sketch is given below): optimizer: SGD with momentum = 0.9 and weight_decay = 5e-4; learning rate scheduler: learning rate warm-up; number of phases n: 3; epochs e: 200; alpha: 0.95; temperature: 20; learning_rate: 0.1; batch_size: 8; dropout_rate: 0.5; save_summary_steps: 100; num_workers: 4.
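Continuing from the model setup above, the optimizer, scheduler, and loader configuration can be sketched in PyTorch as follows; the linear warm-up rule is an assumption expressed with a LambdaLR, since the exact warm-up schedule is not specified, and `train_dataset` stands for the BACH training split.

```python
import torch

optimizer = torch.optim.SGD(
    mobilenetv2.parameters(),  # or resnext29.parameters() for the other run
    lr=0.1,                    # "learning_rate": 0.1
    momentum=0.9,
    weight_decay=5e-4,
)

# Assumed linear warm-up over the first 5 epochs, then constant.
warmup_epochs = 5
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda epoch: min(1.0, (epoch + 1) / warmup_epochs),
)

train_loader = torch.utils.data.DataLoader(
    train_dataset,             # assumed BACH training split
    batch_size=8,              # "batch_size": 8
    shuffle=True,
    num_workers=4,             # "num_workers": 4
)
```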
                                                                       tively learns from each other and thereby improves each
4.4. Normal Knowledge Distillation Results. Since ResNeXt29 gave us higher validation accuracy, we proceed by taking it as the teacher network and MobileNetV2 as the lighter student network. In line with the literature, we confirmed that the teacher model was able to distil into the lightweight student model. However, since histological images are computationally complex, the result was no more than 60.71 percent on the validation set, and the loss was considerable. The normal KD results are shown in Figure 6.

4.5. Reverse Knowledge Distillation Results. Next, we reversed the teacher and student networks to validate reverse knowledge distillation on the given set of histological images. As such, MobileNetV2 becomes the teacher and ResNeXt29 becomes the student. The results improved: we obtained a validation accuracy of 65.58 percent. The results obtained through these experiments are shown in Figure 7.

4.6. Bilateral Knowledge Distillation Results. From the results of the above two experiments, we infer that neither normal KD nor reverse KD is sufficient to give the best result for computationally complex histological images. Our proposed method of bilateral knowledge distillation works on the notion that teacher and student interactively learn from each other and thereby improve each other's accuracy. After three phases of interactive learning and exchanging information between the teacher and the student, MobileNetV2 reaches 96.3 percent accuracy on the validation set, which we claim as a state-of-the-art result for histological images of the breast. Results of training and validation using bilateral knowledge distillation are shown in Figure 8.

[Figure 6: Normal knowledge distillation results (train and test accuracy curves).]

[Figure 7: Reverse knowledge distillation result (train and test accuracy curves).]

5. Discussion of Results

In bilateral knowledge distillation, the student and teacher interact with each other. In noninteractive knowledge distillation, such as traditional knowledge distillation, the teacher network sets only one goal for the student network to mimic, ignoring communication from the student network. However, since the feature transformation capacity of the student network is much lower than that of the teacher, it is impossible for the student to mimic exactly the knowledge that the teacher transfers to it. Therefore, there is a gap between the knowledge taken in by the student and the knowledge transferred by the teacher, which limits KD performance. In interactive, bilateral knowledge distillation, the student takes action based on the problem it is facing, the teacher gives feedback based on the student's actions, and finally the student acts according to the teacher's feedback. In particular, we propose that the teacher and student be interactive and that knowledge be transferred from teacher to student and from student to teacher until the student that classifies the images reaches the allowable accuracy. We used two pretrained models trained on the ImageNet dataset for our baseline training, namely, ResNeXt29 and MobileNetV2, to validate the theories of knowledge distillation on breast histological images. Histology biopsy images are complex and have similar colour combinations when viewed under a microscope after staining; as such, even with robust networks like ResNeXt29 and MobileNetV2, the baseline accuracies were limited. In normal KD, due to its higher validation accuracy, ResNeXt29, the more complex model, was chosen as the teacher network with the simple MobileNetV2 as the corresponding student network, and the teacher and student networks were reversed for Re-KD. The limited accuracy obtained through these conventional methods prompted us to run the experiment with bilateral-interactive KD. After three phases of interactive learning and exchanging information between the teacher and the student, MobileNetV2 reaches 96.3 percent accuracy on the validation set, which we claim as a state-of-the-art result for histological images of the breast. As knowledge distillation is essentially a regularization technique, the student network is seldom overfitted and can backpropagate the errors. Therefore, there is constant feedforward and backpropagation across the different phases of the bilateral KD model. In each phase, the labels are smoothed with regularization, and hence the accuracy improves with each phase. All the results we discussed are summarized in Table 2 below. We also conclude that, as the temperature increases, the distribution of soft targets in the teacher network approaches the uniform distribution of label smoothing (a small numeric check is given below). The other important observation in this experiment is that bilateral KD does not need additional distillation losses and is in accordance with the conventional KD method.
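The temperature observation can be verified with a small numeric check: softening logits with a large temperature drives the teacher's soft targets toward the uniform distribution v(a) = 1/A used by label smoothing. The logits below are arbitrary illustrative values.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 1.0, 0.5, 0.1])   # arbitrary 4-class teacher logits

for t in (1.0, 5.0, 20.0):
    print(t, F.softmax(logits / t, dim=0).tolist())
# t=1  -> sharply peaked distribution
# t=20 -> nearly uniform (~0.25 per class), mimicking label smoothing's v(a) = 1/A
```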

[Figure 8: Results of training and validation using bilateral knowledge distillation. Panels: train accuracy and test loss over 60 epochs.]

Table 2: Results of all experiments performed.

Experiment performed        Model                                    Validation accuracy
Baseline teacher training   ResNeXt29                                75%
                            MobileNetV2                              60.71%
Normal KD                   ResNeXt29 teacher, MobileNetV2 student   60.71%
Reverse KD                  MobileNetV2 teacher, ResNeXt29 student   65.58%
Bilateral KD                Phase 1 + phase 2 + phase 3              96.3%

6. Conclusion

In this paper, we have discussed the results obtained in two baseline papers on knowledge distillation and interactive knowledge distillation, and we proposed a novel method called bilateral knowledge distillation for breast histological images. In our approach, we took ResNeXt29 and MobileNetV2 as the baseline pretrained models. The phenomenon that knowledge distillation is essentially a label smoothing regularization technique was established yet again through the experiments conducted on the histological image dataset of the breast. Additionally, the idea that label smoothing regularization is an ad hoc KD is reestablished. The interactive bilateral KD model which we proposed conforms to the fact that the teacher and student blocks interact with each other and are capable of transferring knowledge to each other through the correspondence being established, and this correspondence improves the accuracy in each subsequent phase. In normal KD, due to its higher validation accuracy, ResNeXt29, being the more complex model, was chosen as the teacher network with the simple MobileNetV2 as the corresponding student network, and the teacher and student networks were reversed for Re-KD. The limited accuracy obtained through these conventional methods prompted us to run the experiment with bilateral-interactive KD. The choice of transfer learning architecture was validated by baseline training. The result of bilateral KD is that the MobileNetV2 network produced excellent accuracy in phase 3 of the validation. Being a lightweight model, the MobileNetV2 architecture will also support deployment of the model in mobile applications, so the challenges of working with large histological images and their computational complexity can be avoided. This approach produces state-of-the-art results with an accuracy of 96.3 percent. It will also boost the performance of the student model on complex and diverse cancer cell images. Moreover, the model does not rely on copying the teacher's behaviour; rather, it can utilize the feature transformation ability of the teacher. Going forward, we believe that this paper will help researchers to work with histological data of other cancers as well.

Data Availability

Data will be made available on request.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This research is self-funded.

References

[1] P. Ratta, A. Kaur, S. Sharma, M. Shabaz, and G. Dhiman, "Application of blockchain and Internet of Things in healthcare and medical sector: applications, challenges, and future perspectives," Journal of Food Quality, vol. 2021, Article ID 7608296, 2021.
[2] S. McGuire, "World Cancer Report 2014. Geneva, Switzerland: World Health Organization, International Agency for Research on Cancer, WHO Press, 2015," Advances in Nutrition, vol. 7, no. 2, pp. 418-419, 2016.
[3] R. A. Smith, V. Cokkinides, and H. J. Eyre, "American Cancer Society guidelines for the early detection of cancer, 2006," CA: A Cancer Journal for Clinicians, vol. 56, no. 1, pp. 11–25, 2006.
[4] P. Filipczuk, T. Fevens, A. Krzyzak, and R. Monczak, "Computer-aided breast cancer diagnosis based on the analysis of cytological images of fine needle biopsies," IEEE Transactions on Medical Imaging, vol. 32, no. 12, pp. 2169–2178, 2013.
[5] M. Kowal, P. Filipczuk, A. Obuchowicz, J. Korbicz, and R. Monczak, "Computer-aided diagnosis of breast cancer based on fine needle biopsy microscopic images," Computers in Biology and Medicine, vol. 43, no. 10, pp. 1563–1572, 2013.
[6] B. Wang, X. Yao, Y. Jiang, C. Sun, and M. Shabaz, "Design of a real-time monitoring system for smoke and dust in thermal power plants based on improved genetic algorithm," Journal of Healthcare Engineering, vol. 2021, Article ID 7212567, 2021.
[7] X. Huang, V. Jagota, E. Espinoza-Muñoz, and J. Flores-Albornoz, "Tourist hot spots prediction model based on optimized neural network algorithm," International Journal of System Assurance Engineering and Management, vol. 12, 2021.
[8] V. Jagota, A. P. S. Sethi, and K. Kumar, "Finite element method: an overview," Walailak Journal of Science & Technology, vol. 10, no. 1, pp. 1–8, 2013.
[9] L. Chen, V. Jagota, and A. Kumar, "Research on optimization of scientific research performance management based on BP neural network," International Journal of System Assurance Engineering and Management, vol. 12, 2021.
[10] D. C. Koboldt, R. S. Fulton, M. D. McLellan et al., "Comprehensive molecular portraits of human breast tumours," Nature, vol. 490, no. 7418, pp. 61–70, 2012.
[11] S. R. Lakhani, I. O. Ellis, S. J. Schnitt, P. H. Tan, M. J. van de Vijver, and World Health Organisation, Classification of Tumours of the Breast, WHO Press, 4th edition, 2012.
[12] M. Veta, J. P. W. Pluim, P. J. Van Diest, and M. A. Viergever, "Breast cancer histopathology image analysis: a review," IEEE Transactions on Biomedical Engineering, vol. 61, no. 5, pp. 1400–1411, 2014.
[13] Z. Han, B. Wei, Y. Zheng, Y. Yin, K. Li, and S. Li, "Breast cancer multi-classification from histopathological images with structured deep learning model," Scientific Reports, vol. 7, no. 1, 2017.
[14] C. Sharma, B. Amandeep, R. Sobti, T. K. Lohani, and M. Shabaz, "A secured frame selection-based video watermarking technique to address quality loss of data: combining graph-based transform, singular valued decomposition, and hyperchaotic encryption," Security and Communication Networks, vol. 2021, Article ID 5536170, 2021.
[15] T. Kausar, M. A. Ashraf, A. Kausar, and I. Riaz, "Convolution neural network based approach for breast cancer classification," in 2021 International Bhurban Conference on Applied Sciences and Technologies (IBCAST), pp. 407–413, Islamabad, Pakistan, 2021.
[16] D. Praveena Anjelin and S. Ganesh Kumar, "Blockchain technology for data sharing in decentralized storage system," in Intelligent Computing and Applications (vol. 1172), Springer, Singapore, 2021.
[17] L. J. Ba and R. Caruana, "Do deep nets really need to be deep?," Advances in Neural Information Processing Systems, vol. 3, pp. 2654–2662, 2014.
[18] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," 2015, http://arxiv.org/abs/1503.02531.
[19] M. K. Sharma, T. Perumal, and N. Dey, "Innovations in computational intelligence and computer vision," in Advances in Intelligent Systems and Computing, p. 1189, Springer, 2021.
[20] C. Bucilǎ, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '06, pp. 535–541, New York, 2006.
[21] J. Dipalma, A. A. Suriawinata, L. J. Tafe, L. Torresani, and S. Hassanpour, "Resolution-based distillation for efficient histology image classification," 2021, http://arxiv.org/abs/2101.04170.
[22] X. Dai, Z. Jiang, Z. Wu et al., "General instance distillation for object detection," 2021, http://arxiv.org/abs/2103.02340.
[23] S. Ge, S. Zhao, C. Li, and J. Li, "Low-resolution face recognition in the wild via selective knowledge distillation," IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 2051–2062, 2019.
[24] T. He, C. Shen, Z. Tian, D. Gong, C. Sun, and Y. Yan, "Knowledge adaptation for efficient semantic segmentation," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 578–587, Long Beach, CA, USA, 2019.
[25] T. K. K. Ho and J. Gwak, "Utilizing knowledge distillation in deep learning for classification of chest X-ray abnormalities," IEEE Access, vol. 8, pp. 160749–160761, 2020.
[26] B. H. Tang and J. Wang, "A physics-based method to retrieve land surface temperature from MODIS daytime midinfrared data," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 8, pp. 4672–4679, 2016.
[27] A. Garg, K. Aggarwal, M. Saxena, and A. Bhat, "Classifying medical histology images using computationally efficient CNNs through distilling knowledge," in Emerging Technologies in Data Mining and Information Security, J. M. R. S. Tavares, S. Chakrabarti, A. Bhattacharya, and S. Ghatak, Eds., pp. 713–721, Springer, Singapore, 2021.
[28] J. J. Thiagarajan, S. Kashyap, and A. Karargyris, "Distill-to-label: weakly supervised instance labeling using knowledge distillation," in 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 902–907, Boca Raton, FL, USA, 2019.
[29] S. Fu, Z. Li, J. Xu, M.-M. Cheng, Z. Liu, and X. Yang, "Interactive knowledge distillation," 2020, http://arxiv.org/abs/2007.01476.
[30] Y. Li, "Performance evaluation of machine learning methods for breast cancer prediction," Applied and Computational Mathematics, vol. 7, no. 4, p. 212, 2018.
[31] L. Yuan, F. E. H. Tay, G. Li, T. Wang, and J. Feng, "Revisiting knowledge distillation via label smoothing regularization," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3902–3910, Seattle, WA, USA, 2020.
[32] G. Aresta, T. Araújo, S. Kwok et al., "BACH: grand challenge on breast cancer histology images," Medical Image Analysis, vol. 56, pp. 122–139, 2019.