Monolithic 3D-based SRAM/MRAM Hybrid Memory for an Energy-efficient Unified L2 TLB-Cache Architecture

Young-Ho Gong, Member, IEEE
School of Computer and Information Engineering, Kwangwoon University, Seoul, Republic of Korea (e-mail: yhgong@kw.ac.kr)

Corresponding author: Young-Ho Gong (e-mail: yhgong@kw.ac.kr).
Digital Object Identifier 10.1109/ACCESS.2021.3054021

The present research has been conducted by the Research Grant of Kwangwoon University in 2020. This work was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2020R1G1A1100040).

ABSTRACT Monolithic 3D (M3D) integration has emerged as a promising technology for fine-grained 3D stacking. As M3D integration offers extremely small, nanometer-scale vias, it is beneficial for small microarchitectural blocks such as caches, register files, translation look-aside buffers (TLBs), etc. However, since M3D integration requires a low-temperature process for the stacked layers, the stacked transistors show lower performance than those fabricated with the conventional 2D process. In contrast, non-volatile memory (NVM) such as magnetic RAM (MRAM) is originally fabricated at a low temperature, which enables M3D integration without performance degradation. In this paper, we propose an energy-efficient unified L2 TLB-cache architecture exploiting M3D-based SRAM/MRAM hybrid memory. Since the M3D-based SRAM/MRAM hybrid memory consumes much less energy than the conventional 2D SRAM-only memory and the 2D SRAM/MRAM hybrid memory, while providing comparable performance, our proposed architecture improves energy efficiency significantly. In particular, as our proposed architecture changes the memory partitioning of the unified L2 TLB-cache depending on the L2 cache miss rate, it maximizes the energy efficiency for parallel workloads suffering extremely high L2 cache miss rates. According to our analysis using PARSEC benchmark applications, our proposed architecture reduces the energy consumption of the L2 TLB+L2 cache by up to 97.7% (53.6% on average), compared to the baseline with the 2D SRAM-only memory, with negligible impact on performance. Furthermore, our proposed technique reduces the memory access energy consumption by up to 32.8% (10.9% on average) by reducing memory accesses due to TLB misses.

INDEX TERMS Monolithic 3D, Cache memory, Translation look-aside buffer, SRAM, MRAM, Energy efficiency

I. INTRODUCTION
As process technology shrinks, microprocessor performance has improved significantly. However, with technology scaling down to the sub-10nm node, the conventional 2D IC technology faces a physical limitation. To extend technology scaling, 3D stacking has emerged as a promising alternative technology [6][9][10][12][21][22][32]. Many researchers have studied 3D stacking based on through-silicon vias (TSVs), leading to commercial 3D products such as high bandwidth memory (HBM) [6][22] and a 3D microprocessor [9]. Though the TSV-based 3D stacking technology (TSV-3D) improves bandwidth and latency significantly for large 3D stacked DRAMs (i.e., HBM), it is not appropriate for finer-grained 3D integration (e.g., 3D stacking of small caches) due to the micrometer-scale dimension of TSVs; though Intel adopts TSV-3D in a recent commercial processor [9], the TSV-3D is utilized for 3D interconnects between two different (heterogeneous) processor packages, not for the 3D integration of microarchitectural blocks within a single processor.

For fine-grained 3D integration, monolithic 3D (M3D) has recently emerged as a promising alternative to TSV-based 3D stacking. Contrary to TSV-3D, which uses micrometer-scale TSVs, M3D enables extremely tiny (e.g., 100nm) vias called monolithic inter-tier vias (MIVs).
In the TSV-3D manufacturing process, each wafer is fabricated individually, and then the pre-fabricated wafers are stacked by using TSVs. The TSV fabrication process causes low alignment precision (>1um), thereby limiting the dimension scaling of TSVs. On the other hand, in the M3D fabrication process, a stacked layer is fabricated sequentially on top of the previous layer. The sequential integration process of M3D eliminates the alignment problem, and thus it enables an extremely small MIV dimension. Thanks to the small MIV dimension, M3D has recently been studied for stacking small microarchitectural blocks (e.g., L1 caches) [10][12][21] or even transistors [32].

By exploiting small MIVs, M3D provides much lower interconnect latency as well as higher bandwidth compared to TSV-3D. Additionally, M3D is considered a promising technology for energy-efficient 3D ICs due to the negligible energy consumption of MIVs. However, M3D has a drawback in the fabrication process, which requires a low-temperature process for the stacked layers and thus degrades the performance of the stacked transistors.

In this paper, we propose an energy-efficient unified L2 TLB-cache architecture exploiting an M3D-based SRAM/MRAM hybrid memory; note that, in our study, 'L2 cache' indicates the second-level private cache per core, while 'L3 cache' indicates the shared last-level cache. In the M3D-based SRAM/MRAM hybrid memory, MRAM banks are stacked on top of the bottom SRAM layer. As described above, MRAM can be fabricated at low temperature [8][30], so that it does not suffer from transistor performance degradation in the M3D fabrication process. Since the M3D integration reduces wire overhead in terms of latency and energy consumption, the M3D-based SRAM/MRAM hybrid memory provides better energy efficiency than the conventional 2D SRAM-only memory and the 2D SRAM/MRAM hybrid memory. Based on the M3D-based SRAM/MRAM hybrid memory, we apply different memory partitioning depending on the application characteristics (e.g., the L2 cache miss rate). For this purpose, we propose a unified L2 TLB-cache controller that adopts an energy-efficient memory partitioning for the unified L2 TLB-cache by profiling the L2 cache miss rate. According to the previous work [3][4], parallel workloads tend to have large working sets and thus suffer high L2 cache and L2 TLB miss rates.
II. BACKGROUND AND RELATED WORK

A. Monolithic 3D (M3D) Integration
TSV-3D is not suitable for fine-grained 3D integration due to the parallel nature of TSV-3D fabrication. In the TSV-3D fabrication, dies/wafers are fabricated separately. After that, the pre-fabricated dies are bonded by using TSVs as vertical interconnects between them. During the TSV bonding process, each die needs to be aligned precisely. Unfortunately, the alignment accuracy of TSV-3D is limited to 0.5um, which inhibits the dimension scaling of TSVs; even in state-of-the-art commercial products based on TSV-3D, the TSV diameter is limited to 5um~10um [14][22]. Due to the large size of TSVs, TSV-3D incurs serious overhead for small architectures [10] in terms of performance, power, and area. As a result, TSV-3D is not appropriate for 3D stacking of small architectural blocks.

M3D has emerged as a promising technology for finer-grained 3D stacking compared to TSV-3D. Different from TSV-3D (parallel integration), in M3D the top (stacked) transistors are sequentially fabricated on the pre-fabricated bottom layer. Thanks to the sequential integration, M3D enables much higher alignment precision; the alignment error is negligible compared to that of TSV-3D. Furthermore, the high alignment precision of M3D allows a nanometer-scale dimension of MIVs. Based on the MIVs, several recent studies proposed M3D-based microarchitectural blocks for high-performance processors [10][12][21]. Kong et al. [21] proposed M3D-based last-level caches using MIVs for vertical wordlines (M3D-VWL) or vertical bitlines (M3D-VBL). Since M3D offers nanometer-scale MIVs, it can be used for such fine-grained 3D integration with negligible overhead. In addition to large caches, M3D is considered to be beneficial for small microarchitectural blocks such as the L1 cache, TLB, ALU, branch predictor, etc. Gopireddy and Torrellas [12] presented a comprehensive architectural analysis of M3D-based microprocessors. According to [12], M3D provides benefits for various blocks such as register files and issue/store/load queues in terms of latency, energy, and footprint, compared to the conventional 2D.

B. Non-Volatile Memory (NVM)
As described in the previous studies, M3D enables block-level 3D integration based on the extremely small MIVs. Meanwhile, M3D may not provide a performance enhancement, as it inevitably causes transistor performance degradation due to the low-temperature process for the stacked transistors, which eventually offsets the benefits of 3D stacking. To avoid the transistor performance degradation, we need to consider other logic/memory cells that can be fabricated at low temperature. Thus, in many M3D studies, emerging non-volatile memories (NVMs) such as MRAM, RRAM, and PRAM are considered to be good candidates for M3D ICs. Since NVMs are originally manufactured using a low-temperature process, they do not suffer transistor performance degradation with M3D, while not causing damage to the bottom layer [38][39][42].

Among MRAM, RRAM, and PRAM, MRAM has been widely considered as an alternative to the conventional SRAM, which is the most commonly used for on-chip caches, as it provides high read performance, low leakage power, and high density. While SRAM stores (maintains) data based on a latch-based feedback circuit using 6 transistors, MRAM stores data using a magnetic tunnel junction (MTJ) with negligible standby power compared to SRAM; the data in the MTJ is accessed through an access transistor. Due to the smaller number of transistors, MRAM cells lead to much smaller dynamic and leakage power than SRAM cells [13]. By exploiting the advantages of MRAM, researchers have proposed MRAM-based caches. Sun et al. [35] proposed an SRAM/MRAM hybrid non-uniform cache architecture for multi-megabyte shared last-level caches. They considered SRAM for only 1 cache way among 32 ways and MRAM for the other cache ways. Also, they adopted a write buffer scheme to hide the long write latency of MRAM, which is widely used in many other previous works on MRAM-based on-chip caches [1][17][23][35][40][41]. In addition to MRAM-based large caches, several researchers proposed MRAM-based register files [17] and L1 caches [20][36], which also avoid the long MRAM write latency based on small write buffers.

Though many previous studies simply applied MRAM to cache memories as an alternative memory technology to SRAM, MRAM could be more beneficial for TLBs than for caches. Liu et al. [25] proposed MRAM-based TLB architectures. According to [25], TLBs have a much lower write ratio than caches for most multi-threaded workloads. Since a lower write ratio could reduce the negative impact of the long write latency on system performance, it would be better to adopt MRAM for TLBs rather than caches. Our proposed architecture is also motivated by the fact that MRAM would be beneficial for read-intensive architectural blocks.

C. Cache Partitioning and TLB Caching
Cache partitioning has been broadly studied in computer architecture. Most of the previous studies have considered partitioning a multi-megabyte shared last-level cache. Though researchers proposed various partitioning algorithms for the last-level caches, partitioning (or reducing) last-level caches would incur main memory access overhead to evict many dirty cache lines as well as to handle misses in the last-level cache. Based on cache partitioning, several recent studies proposed further enhanced techniques that exploit a part of the last-level cache for caching TLB entries [26][29]. Both previous studies [26][29] are motivated by the fact that workloads in virtualized environments have high L2 TLB miss rates, resulting in serious performance overhead due to the multi-level page table walk. However, they did not increase the size of the private L2 TLBs but utilized a part of the L3 caches as L3 TLBs, since increasing the size of private resources such as the L2 TLB causes longer latency compared to the original small-sized resources.

In contrast, an M3D-based memory structure incurs little performance overhead even with a larger size, while improving energy efficiency significantly; according to the previous M3D studies [10][12][21], M3D provides a comparable latency even with a larger size, compared to 2D. Thus, we utilize the M3D-based memory structure for a unified L2 TLB-cache architecture, different from the previous work that considers only a 2D memory structure [26][29]. We apply memory partitioning to the M3D-based SRAM/MRAM hybrid memory for caching more TLB entries.

III. WORKLOAD CHARACTERIZATION OF L2 TLB-CACHE USAGE
Modern CPUs have multiple cores and a shared last-level cache. Though the shared last-level cache has multi-megabyte capacity, private resources (such as L1/L2 caches and TLBs) have small capacities to guarantee low latency. Figure 1 shows the architecture diagram of a modern CPU core. As shown in Figure 1, each core includes private L1/L2 caches and private L1/L2 TLBs with fixed sizes. Though the detailed sizes of caches/TLBs vary depending on processors, such an architecture is widely adopted in commercial embedded processors [2] as well as high-performance processors [28]. As described in many previous studies, most applications have low miss rates in the L1 caches and L1 TLBs (i.e., L1 Icache, L1 Dcache, ITLB, and DTLB). However, applications have quite different access characteristics for the L2 TLB and L2 cache.

FIGURE 1. Architecture of a modern CPU core.

Figure 2 shows our preliminary analysis of cache/TLB miss rates for various parallel workloads in the PARSEC benchmark suite. We also depict the working set sizes in Figure 2; working set 1 consists of thread-private data, and working set 2 is composed of data used for inter-thread communication [4]. As shown in Figure 2, all the workloads have low miss rates in the L1 caches and L1 TLBs, while the applications with large working sets (>16MB) show high L2 cache miss rates (>50%); canneal and streamcluster show L2 cache miss rates higher than 90%. Additionally, more than half of the PARSEC applications suffer high L2 TLB miss rates, which leads to performance degradation due to frequent page table accesses. Such TLB access characteristics can be found in parallel workloads or virtualized workloads, which utilize many threads and larger working set sizes [29][37]. In our study, we focus on the high miss rates in the L2 TLB and L2 cache of parallel workloads. Considering the large working set sizes, the commonly used private L2 cache size (256KB~512KB) is not beneficial for the performance and energy efficiency of parallel workloads. Instead of caching data blocks, using the L2 cache SRAM arrays to store L2 TLB entries would be much more beneficial for parallel workloads, since it reduces the page walk overhead due to TLB misses. Moreover, to further improve energy efficiency, we apply MRAM to the unified L2 TLB-cache. Since the TLB is more read-intensive than the cache, using MRAM for the TLB would improve energy efficiency with a marginal performance impact. We will describe our proposed architecture in Section IV.

FIGURE 2. Cache/TLB Miss rates (left-axis) and working set size (right-axis) of PARSEC applications.
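To make the trade-off concrete, the short Python sketch below (our illustration, not code from the paper) estimates the address-translation reach of the baseline 32KB L2 TLB and of a structure that re-purposes the 256KB L2 cache SRAM arrays for TLB entries, assuming 16B per entry (as used in Section IV) and 4KB pages (our assumption).

# Illustrative L2 TLB "reach" estimate (4KB pages are an assumption).
ENTRY_SIZE_B = 16          # bytes per L2 TLB entry (Section IV: 2048 x 16B = 32KB)
PAGE_SIZE_B = 4 * 1024     # assumed page size

def tlb_reach_bytes(tlb_capacity_bytes: int) -> int:
    """Memory covered by a TLB of the given capacity, in bytes."""
    entries = tlb_capacity_bytes // ENTRY_SIZE_B
    return entries * PAGE_SIZE_B

baseline_reach = tlb_reach_bytes(32 * 1024)    # 2K entries  ->  8MB of address space
unified_reach = tlb_reach_bytes(256 * 1024)    # 16K entries -> 64MB of address space

print(f"baseline L2 TLB reach: {baseline_reach // 2**20} MB")
print(f"256KB of arrays used as TLB: {unified_reach // 2**20} MB")

Under these assumptions, turning the 256KB of cache arrays into TLB entries raises the translation reach from 8MB to 64MB, which is why TLB caching helps workloads whose working sets far exceed the private L2 cache capacity.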


FIGURE 3. Bank structure of (a) 2D SRAM-only Memory; (b) 2D SRAM/MRAM Hybrid Memory; (c) M3D-based SRAM-only memory; (d) M3D-based
SRAM/MRAM Hybrid Memory.

IV. MONOLITHIC 3D-BASED SRAM/MRAM HYBRID MEMORY FOR UNIFIED L2 TLB-CACHE ARCHITECTURE

A. M3D-based SRAM/MRAM Hybrid Memory
Figure 3 shows the bank structures depending on the memory configuration: (a) 2D SRAM-only memory, (b) 2D SRAM/MRAM hybrid memory, (c) M3D-based SRAM-only memory, and (d) M3D-based SRAM/MRAM hybrid memory. As depicted in Figure 3, we perform an iso-capacity evaluation in this paper, considering the same number of banks and the same capacity for all the memory configurations; while an iso-area evaluation would lead to much better performance results for our proposed SRAM/MRAM hybrid architecture than for the SRAM-only memory (e.g., L2 TLB/cache miss reduction by increasing capacity), we do not want to distort the performance results. Additionally, due to the iso-capacity assumption, the SRAM/MRAM hybrid memories require no additional decoding logic. We assume the 2D SRAM-only memory as our baseline, which is depicted in Figure 3(a). In the 2D SRAM-only bank structure, the h-tree wire delay accounts for a large portion of the total access latency. When we unify the memory banks of the L2 TLB and L2 cache, the access latencies of the L2 TLB and L2 cache would increase, compared to the non-unified architecture. For example, when we use all the unified capacity for only the L2 TLB, the L2 TLB access latency increases significantly, compared to the baseline L2 TLB access latency.

To reduce the impact of the unified L2 TLB-cache architecture on access latency, we can consider two options. First, replacing SRAM cells with smaller memory cells could reduce the h-tree wire delay, since smaller cells lead to area reduction, which eventually reduces the total wire length. Among various memory cells, MRAM is widely considered as a promising alternative to SRAM, due to its comparable read latency as well as its small area. However, as MRAM requires larger peripheral circuits compared to SRAM, the area reduction by MRAM would be marginal in small on-chip caches [41], while it would be significant in large caches. Second, we can consider 3D stacking of the SRAM/MRAM hybrid memory. 3D stacking of memory banks reduces the h-tree wire length significantly, as the 3D vertical routing length is much shorter than the 2D routing length. Especially, since M3D provides extremely small MIVs, it leads to negligible vertical routing latency [10][21]. Furthermore, since MRAM is originally fabricated by a low-temperature process [8][30], it does not suffer transistor performance degradation by M3D integration. For these reasons, we apply the M3D-based SRAM/MRAM hybrid memory to the unified L2 TLB-cache architecture.

Table I describes the latency and energy parameters depending on the memory configurations used in our study. To model the M3D-based memory configurations, we use CACTI [10] (a modified version of CACTI with M3D support) and NVSim [7]. In NVSim, the original MRAM model is outdated; thus, we consider an MRAM model proposed by Jan et al. [16]. According to [13], though there are several MRAM models that can be used with NVSim, Jan's MRAM model [16] is the most energy-efficient compared to the other MRAM models, while it consumes a slightly larger area. Note that we assume 8-way associative 256KB SRAM arrays as our baseline L2 cache. In case of the L2 TLB, we assume that the baseline L2 TLB consists of 2048 entries (=2048×16B per entry=32KB). The SRAM/MRAM hybrid memory configurations (both 2D and M3D) consist of 4 SRAM banks and 4 MRAM banks. As described in Table I, we consider SRAM-only memory and SRAM/MRAM hybrid memory with 2D and M3D structures.

TABLE I
LATENCY AND ENERGY DEPENDING ON MEMORY CONFIGURATIONS.

Parameter          | 2D SRAM-only | 2D SRAM/MRAM hybrid | M3D SRAM-only | M3D SRAM/MRAM hybrid
Read cycles (#)    | 5            | 5 (S = M)           | 4             | 4 (S = M)
Write cycles (#)   | 4            | 4 (S), 18 (M)       | 4             | 4 (S), 17 (M)
Read energy (nJ)   | 0.68         | 0.68 (S), 0.28 (M)  | 0.43          | 0.41 (S), 0.25 (M)
Write energy* (nJ) | 0.09         | 0.09 (S), 0.15 (M)  | 0.06          | 0.05 (S), 0.12 (M)
Leakage (mW)       | 431.7        | 318.6 (S+M)         | 382.3         | 269.2 (S+M)

'S' and 'M' stand for SRAM and MRAM, respectively.
* While the MRAM write latency is 4.5x higher than the SRAM write latency, the MRAM write energy is only 1.7x higher than the SRAM write energy, because MRAM has much smaller energy overhead in the routing wires and peripherals (e.g., routing components, sense amplifiers, bitline muxes, etc.).

Though the M3D-based SRAM-only memory may be expected to offer much better characteristics than the M3D-based SRAM/MRAM hybrid memory, it actually has the same read/write cycles and slightly higher energy consumption, compared to the SRAM part of the M3D-based SRAM/MRAM hybrid memory.
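As an illustration of how the Table I parameters translate into per-access energy, the sketch below (ours, not from the paper) computes the average dynamic energy of one access for each configuration; the 80%/20% read/write mix is only an assumed example.

# Average dynamic energy per access (nJ) from the Table I read/write energies.
# The read/write mix below is an assumed example; the paper does not fix one.
READ_RATIO, WRITE_RATIO = 0.8, 0.2

# (read_nJ, write_nJ); for the hybrid memories we list the MRAM-bank values.
CONFIGS = {
    "2D SRAM-only": (0.68, 0.09),
    "2D hybrid, MRAM bank": (0.28, 0.15),
    "M3D SRAM-only": (0.43, 0.06),
    "M3D hybrid, MRAM bank": (0.25, 0.12),
}

for name, (e_read, e_write) in CONFIGS.items():
    avg_nj = READ_RATIO * e_read + WRITE_RATIO * e_write
    print(f"{name}: {avg_nj:.3f} nJ per access")

Under this read-dominated mix, the M3D hybrid MRAM banks come out at roughly a third of the per-access dynamic energy of the 2D SRAM-only baseline, which is consistent with the energy trends discussed below.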


FIGURE 4. Our proposed unified L2 TLB-cache controller exploiting M3D-based SRAM/MRAM hybrid memory. The detailed memory partitioning is
described in Table II.

Since the low-temperature M3D fabrication causes transistor performance degradation for the stacked SRAM layer [10], it affects the performance and energy characteristics. On the other hand, the M3D-based SRAM/MRAM hybrid memory does not suffer the transistor performance degradation. In the SRAM/MRAM hybrid memories, since MRAM has the same read performance as SRAM, there is no difference in read latency (cycles) between SRAM and MRAM. Also, MRAM consumes much smaller read energy than SRAM, due to the smaller number of transistors. However, MRAM has longer write latency (4.5x) and higher write energy (+73.5%) compared to SRAM (though our MRAM model has higher write energy than SRAM, it is still much more energy-efficient than the original MRAM model in NVSim, which causes 3x~7x higher write energy than the classic 6T SRAM [1]). To mitigate the write overhead, we use a small portion of the SRAM arrays as a write buffer for MRAM; we describe the details of our proposed unified L2 TLB-cache controller in the following subsection.
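A minimal sketch of the SRAM write-buffer idea described above (our simplified model, not the paper's design): writes destined for MRAM are first absorbed by a small SRAM buffer at SRAM latency and drained to MRAM in the background, so the 4.5x MRAM write latency is normally hidden from the requester. The buffer depth and drain policy here are illustrative assumptions.

from collections import deque

# Simplified SRAM write buffer in front of the MRAM arrays; the depth and the
# drain policy are illustrative assumptions, not taken from the paper.
SRAM_WRITE_CYCLES, MRAM_WRITE_CYCLES = 4, 17   # Table I, M3D configuration

class MramWriteBuffer:
    def __init__(self, depth: int = 16):
        self.depth = depth
        self.pending = deque()                 # buffered (address, data) writes

    def write(self, addr: int, data: bytes) -> int:
        """Accept a write at SRAM latency; if the buffer is full, one entry is
        drained to MRAM first, exposing the long MRAM write latency."""
        stall = 0
        if len(self.pending) == self.depth:
            self.drain_one()
            stall = MRAM_WRITE_CYCLES
        self.pending.append((addr, data))
        return SRAM_WRITE_CYCLES + stall       # cycles observed by the requester

    def drain_one(self) -> None:
        """Retire the oldest buffered write into the MRAM array (background)."""
        if self.pending:
            self.pending.popleft()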
In terms of leakage power, as MRAM consumes much smaller leakage power than SRAM, both the 2D and M3D-based SRAM/MRAM hybrid memory configurations have lower leakage power than the 2D and M3D-based SRAM-only memory configurations. Especially, when we adopt M3D, the h-tree wire leakage power is reduced significantly [10] due to the reduced wire length. Thus, the M3D-based SRAM/MRAM hybrid memory has 37.6% lower leakage power than the 2D SRAM-only memory.

B. Unified L2 TLB-Cache Controller
Figure 4 shows our proposed unified L2 TLB-cache controller exploiting the SRAM/MRAM hybrid memory. Our proposed unified L2 TLB-cache controller determines the memory partitioning of the L2 TLB and L2 cache considering the L2 cache miss rate; note that we also describe the memory partitioning depending on configurations in Table II, assuming the baseline L2 TLB and L2 cache sizes are 32KB (=2048 TLB entries) and 256KB, respectively. As shown in Figure 4, at the beginning of an application, our proposed controller ① collects access statistics of the L2 cache and ② profiles the L2 cache miss rate per pre-defined epoch T. When the epoch T is too small, the profiled L2 cache miss rate fluctuates highly, so that our proposed controller may not provide an appropriate memory configuration for the running application (due to intermittently high (or low) L2 cache miss rates). To avoid this situation, we set the epoch T to 1ms. Based on the profiled L2 cache miss rate, the proposed controller ③ applies an appropriate memory configuration to the unified L2 TLB-cache memory. We consider two threshold values to determine the memory configuration. When the profiled L2 cache miss rate is lower than threshold_low, the proposed controller maintains the baseline configuration for the unified L2 TLB-cache architecture. Otherwise, we consider threshold_high. When the L2 cache miss rate is between threshold_low and threshold_high, the proposed controller applies the Half-L2$ configuration to the unified L2 TLB-cache architecture, allocating 128KB SRAM arrays for the L2 cache, 128KB MRAM arrays for the L2 TLB, and 32KB SRAM arrays for the TLB write buffer. When the L2 cache miss rate is higher than threshold_high, bypassing the L2 cache would be better for energy efficiency. Accordingly, the proposed controller applies the Bypass-L2$ configuration to the unified L2 TLB-cache architecture, which allocates all the capacity of the SRAM/MRAM arrays to the L2 TLB; in this case, all the L2 cache accesses are bypassed to the shared L3 cache.
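The partitioning rule described above can be summarized as a small lookup, sketched below in Python (ours; the function name choose_config is illustrative). It maps a profiled L2 cache miss rate to the Baseline, Half-L2$, or Bypass-L2$ partitioning of Table II, using the threshold values given in this subsection (0.5 and 0.8).

# Memory partitioning per configuration (sizes in KB), following Table II,
# and the miss-rate thresholds of Section IV.B. The function name is ours.
PARTITIONS = {
    "Baseline":   {"l2_tlb_kb": 32,  "l2_tlb_cell": "SRAM",
                   "l2_cache_kb": 256, "write_buffer_kb": 0},
    "Half-L2$":   {"l2_tlb_kb": 128, "l2_tlb_cell": "MRAM",
                   "l2_cache_kb": 128, "write_buffer_kb": 32},
    "Bypass-L2$": {"l2_tlb_kb": 256, "l2_tlb_cell": "SRAM+MRAM",
                   "l2_cache_kb": 0,   "write_buffer_kb": 32},
}

THRESHOLD_LOW, THRESHOLD_HIGH = 0.5, 0.8       # 50% and 80% L2 cache miss rate

def choose_config(l2_cache_miss_rate: float) -> str:
    """Map a profiled L2 cache miss rate to a target memory configuration."""
    if l2_cache_miss_rate < THRESHOLD_LOW:
        return "Baseline"
    if l2_cache_miss_rate < THRESHOLD_HIGH:
        return "Half-L2$"
    return "Bypass-L2$"

# Every configuration keeps the same 288KB total capacity (iso-capacity).
assert all(sum(v for k, v in p.items() if k.endswith("_kb")) == 288
           for p in PARTITIONS.values())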


FIGURE 5. Profiled L2 cache miss rates of PARSEC applications (epoch T=1ms).

TABLE II
MEMORY PARTITIONING OF THE UNIFIED L2 TLB-CACHE DEPENDING ON CONFIGURATIONS*.

Parameter    | Baseline              | Half-L2$               | Bypass-L2$
L2 TLB       | 32KB (S), 2K entries  | 128KB (M), 8K entries  | 256KB (S+M), 16K entries
L2 cache     | 256KB (S)             | 128KB (S)              | 0
Write buffer | 0                     | 32KB (S), 2K entries   | 32KB (S), 2K entries

'S' and 'M' stand for SRAM and MRAM, respectively.
* Note that the total capacity (L2 TLB + L2 cache + Write buffer) is the same for all the configurations for a fair comparison.

Furthermore, we adopt a 4-bit counter for each configuration, considering the fluctuations of the profiled L2 cache miss rate. Thus, only if one of the counters becomes all ones (meaning that the profiled L2 cache miss rate is expected to last during the runtime of the application), our proposed controller ③ sets the target memory configuration depending on the profiled L2 cache miss rate; otherwise, it ④ maintains all counters without changing the memory configuration until any counter becomes all ones. When the application is terminated, the operating system can reset the proposed controller, and thus the memory configuration is restored to the baseline. In our study, we set threshold_low and threshold_high to 0.5 (50%) and 0.8 (80%), respectively.

Figure 5 shows cumulative and periodic L2 cache miss rates of four PARSEC benchmark applications profiled every 1ms. As shown in Figure 5(a), blackscholes has a low cumulative L2 cache miss rate, which is lower than threshold_low. However, the profiled L2 cache miss rate of blackscholes is intermittently higher than threshold_low (0.5). If our proposed controller changed the memory configuration into Half-L2$ whenever the profiled L2 cache miss rate exceeds threshold_low, the performance of blackscholes could be degraded, since the profiled L2 cache miss rate becomes lower than threshold_low in the next epoch. In this case, frequent changes of the memory configuration between the Baseline and Half-L2$ cause excessive eviction overhead. Our proposed unified L2 TLB-cache controller prevents unnecessarily frequent changes of the memory configuration by exploiting the counters for each configuration (as described in Figure 4). In case of canneal (Figure 5(b)), the L2 cache miss rate is always much higher than threshold_high. In this case, bypassing the L2 cache would have little impact on performance while providing much better energy efficiency. Therefore, in case of canneal, our proposed controller sets the Bypass-L2$ configuration for the unified L2 TLB-cache memory. In case of fluidanimate, shown in Figure 5(c), most of the profiled L2 cache miss rate values are between threshold_low and threshold_high, which means that the Half-L2$ configuration would be better than the other configurations. In case of raytrace (Figure 5(d)), though there are fluctuations in the profiled L2 cache miss rate, our proposed controller applies Half-L2$ to the unified L2 TLB-cache memory, since the Half-L2$ counter hits all ones earlier than the Bypass-L2$ counter. According to our analysis, the L2 cache miss rate of an application does not vary significantly during runtime. Thus, our proposed unified L2 TLB-cache controller offers an appropriate memory configuration reflecting the application characteristics, with negligible eviction overhead.
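Putting the epoch profiling and the 4-bit saturating counters together, the following sketch (our reconstruction of the described policy, not the authors' implementation) shows how a configuration is committed only after its counter reaches all ones, which filters out the intermittent miss-rate spikes discussed above.

# Epoch-based controller with one 4-bit saturating counter per configuration.
# This is our reconstruction of the described policy, not the authors' code.
COUNTER_MAX = 0b1111                            # "all ones" for a 4-bit counter
THRESHOLD_LOW, THRESHOLD_HIGH = 0.5, 0.8        # Section IV.B threshold values

def choose_config(miss_rate: float) -> str:
    if miss_rate < THRESHOLD_LOW:
        return "Baseline"
    return "Half-L2$" if miss_rate < THRESHOLD_HIGH else "Bypass-L2$"

class UnifiedL2TlbCacheController:
    def __init__(self):
        self.counters = {"Baseline": 0, "Half-L2$": 0, "Bypass-L2$": 0}
        self.active = "Baseline"

    def end_of_epoch(self, l2_accesses: int, l2_misses: int) -> str:
        """Called every epoch T (1ms): profile the miss rate, bump the counter of
        the preferred configuration, and repartition only once it saturates."""
        miss_rate = l2_misses / max(l2_accesses, 1)
        target = choose_config(miss_rate)
        self.counters[target] = min(self.counters[target] + 1, COUNTER_MAX)
        if self.counters[target] == COUNTER_MAX and target != self.active:
            self.active = target                # apply the Table II partitioning
        return self.active

    def reset(self) -> None:
        """Called by the OS at application termination; restores the baseline."""
        self.counters = {k: 0 for k in self.counters}
        self.active = "Baseline"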
V. EVALUATION

A. Evaluation Methodology
We analyze the impact of the M3D-based SRAM/MRAM hybrid memory with our proposed unified L2 TLB-cache controller in terms of L2 TLB miss rate, performance, and energy consumption. For this purpose, we implement our proposed unified L2 TLB-cache controller by extending the Sniper simulator [5]. To reflect the M3D-based SRAM/MRAM hybrid memory, we apply the latency and energy parameters described in Table I to the unified L2 TLB-cache memory. Table III shows the architectural parameters used in our evaluation. The parameters of the private resources are similar to a commercial microprocessor specification [28]. In our simulation, each processor core has its own private L1/L2 caches and TLBs, and all the cores share the L3 cache as the last-level cache. 'Baseline' and 'Proposed' differ only in the L2 TLB and L2 cache specifications.


TABLE III
ARCHITECTURAL SIMULATION PARAMETERS.

Parameter              | Baseline                      | Proposed
Processor core         | Octa-core, 22nm               | (same)
ITLB (private)         | 128 entries, 4-way            | (same)
DTLB (private)         | 64 entries, 4-way             | (same)
L1 Icache              | 32KB, 8-way, 2 cycles         | (same)
L1 Dcache              | 32KB, 8-way, 2 cycles         | (same)
L2 TLB (private)       | 2048 entries, 4-way, 2 cycles | 32KB SRAM (write buffer)
L2 cache (private)     | 256KB, 8-way, 5 cycles        | 256KB (unified L2 TLB-cache): 128KB SRAM + 128KB MRAM
L3 cache (shared)      | 8MB, 16-way, 30 cycles        | (same)
Page walk penalty      | 200 cycles                    | (same)
Benchmark applications | blackscholes, canneal, dedup, facesim, fluidanimate, raytrace, streamcluster

To adopt a realistic value of the page walk penalty, we measured the page walk latency in our experimental environment by using Intel VTune Profiler [15]. Based on the real measurement, we adopt 200 cycles for the page walk penalty. For the benchmark applications, we simulate the entire region of interest (ROI) of each PARSEC benchmark application described in Table III, using the simlarge input data set [4].
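As a rough illustration of why the 200-cycle page walk penalty matters (our back-of-the-envelope numbers, not results from the paper), the sketch below estimates the cycles spent in page walks for a given number of L2 TLB misses and how they shrink when most of those misses are removed.

# Back-of-the-envelope page-walk overhead; the miss counts are placeholders,
# not measured values from the paper.
PAGE_WALK_CYCLES = 200                          # penalty adopted in Table III

def page_walk_cycles(l2_tlb_misses: int) -> int:
    """Cycles spent walking the page table due to L2 TLB misses."""
    return l2_tlb_misses * PAGE_WALK_CYCLES

baseline_misses = 1_000_000                     # example value only
proposed_misses = int(baseline_misses * (1 - 0.997))  # e.g., the 99.7% reduction
                                                      # reported below for streamcluster

saved = page_walk_cycles(baseline_misses) - page_walk_cycles(proposed_misses)
print(f"page-walk cycles saved: {saved:,}")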

B. L2 TLB Miss Rate
Figure 6 depicts the L2 TLB miss rate of PARSEC applications. Among the applications, since blackscholes and facesim have low L2 cache miss rates (lower than threshold_low), our proposed controller maintains the Baseline configuration for them, and thus their L2 TLB miss rates remain close to the baseline. In contrast, canneal, dedup, and streamcluster have L2 cache miss rates higher than threshold_high, so our proposed controller applies the Bypass-L2$. Thus, by increasing the L2 TLB size while decreasing the L2 cache size, our proposed controller eliminates most (99.7%) of the L2 TLB misses of streamcluster, as shown in Figure 6, compared to the baseline. Meanwhile, in case of fluidanimate and raytrace, which have L2 cache miss rates between threshold_low and threshold_high, our proposed controller applies the Half-L2$ configuration to the unified L2 TLB-cache architecture. Thus, in case of fluidanimate and raytrace, our proposed controller reduces the L2 TLB miss rate by 29.4% and 7.1%, respectively, compared to the baseline.

FIGURE 6. L2 TLB miss rate of PARSEC applications depending on configurations (Baseline and Proposed).

C. Performance
Our proposed controller changes the memory configurations of the unified L2 TLB-cache architecture depending on the profiled L2 cache miss rate, which may affect performance.

Figure 7 depicts the performance impact of our proposed controller with the 2D and M3D-based SRAM/MRAM hybrid memory, compared to the baseline with the 2D and M3D-based SRAM-only memory. As shown in Figure 7, our proposed controller has a marginal impact on performance for most cases, compared to the baseline with the 2D SRAM-only memory. In addition, since it is able to hide the performance impact of the long MRAM write latency by utilizing the SRAM write buffer, our proposed controller with the M3D-based SRAM/MRAM hybrid memory does not incur performance degradation compared to the baseline with the M3D-based SRAM-only memory. In case of canneal and streamcluster, our proposed controller with the M3D-based SRAM/MRAM hybrid memory improves the performance by 2.7% and 1.5%, respectively, compared to the baseline with the 2D SRAM-only memory. In case of dedup, our proposed controller with the M3D-based SRAM/MRAM hybrid memory has only a slight impact on performance.

D. Energy Consumption
With negligible impact on performance, our proposed architecture improves energy efficiency significantly, since we exploit the M3D-based SRAM/MRAM hybrid memory for the unified L2 TLB-cache architecture. We compare the energy consumption of the L2 TLB+L2 cache for four different configurations: i) Baseline with the 2D SRAM-only memory, ii) Baseline with the M3D-based SRAM-only memory, iii) our proposed architecture with the 2D SRAM/MRAM hybrid memory, and iv) our proposed architecture with the M3D-based SRAM/MRAM hybrid memory. Furthermore, as our proposed technique reduces the number of memory accesses due to TLB misses, we compare the memory energy consumption between the baseline and our proposed technique.

FIGURE 8. Normalized energy consumption (L2 TLB + L2 cache) of PARSEC applications depending on configurations.

Figure 8 shows the energy consumption of the L2 TLB and L2 cache, normalized to the energy consumption of the baseline with the 2D SRAM-only memory. As shown in Figure 8, our proposed architecture with the M3D-based SRAM/MRAM hybrid memory provides much lower energy than the other configurations. In case of blackscholes and facesim, our proposed controller with the M3D-based SRAM/MRAM hybrid memory reduces the energy consumption by 19.9% and 42.9%, respectively, compared to the baseline with the 2D SRAM-only memory. In addition, our proposed controller with the M3D-based SRAM/MRAM hybrid memory shows 6.5% (blackscholes) and 11.2% (facesim) lower energy than the baseline with the M3D-based SRAM-only memory. Though our proposed controller does not change the memory configuration for blackscholes and facesim due to their low L2 cache miss rates (lower than threshold_low), the M3D-based SRAM/MRAM hybrid memory itself reduces the energy consumption. In case of canneal, dedup, and streamcluster, our proposed controller applies the Bypass-L2$ configuration, since their profiled L2 cache miss rates are higher than threshold_high. As a result, our proposed controller provides 51.8%, 97.7%, and 93.2% lower energy consumption in the unified L2 TLB-cache for canneal, dedup, and streamcluster, respectively, compared to the baseline with the 2D SRAM-only memory. Especially, our proposed architecture is significantly beneficial for dedup and streamcluster, compared to the other applications. The reason is that our proposed architecture reduces the unnecessarily wasted energy for L2 cache accesses in dedup and streamcluster, while these applications do not incur a large number of L2 TLB accesses. For this reason, for dedup and streamcluster, our proposed controller with the M3D-based SRAM/MRAM hybrid memory shows much lower energy consumption in the L2 TLB+L2 cache, even compared to the baseline with the M3D-based SRAM-only memory.
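For reference, a minimal sketch of the kind of accounting behind such comparisons is shown below (our assumption of a standard dynamic + leakage breakdown using the Table I parameters; the paper does not spell out its exact energy formula, and the access counts and runtime are placeholders).

# Illustrative energy accounting: dynamic (per-access) energy plus leakage over time,
# using the Table I parameters. This is our assumed breakdown; the access counts and
# runtime below are placeholders, not data from the paper.
def structure_energy_mj(reads: int, writes: int, e_read_nj: float,
                        e_write_nj: float, leakage_mw: float, runtime_s: float) -> float:
    dynamic_mj = (reads * e_read_nj + writes * e_write_nj) * 1e-6   # nJ -> mJ
    leakage_mj = leakage_mw * runtime_s                             # mW * s = mJ
    return dynamic_mj + leakage_mj

# Example: baseline (2D SRAM-only) vs. proposed (M3D hybrid, MRAM read path),
# with made-up access counts purely to show the calculation.
baseline = structure_energy_mj(5_000_000, 1_000_000, 0.68, 0.09, 431.7, 0.5)
proposed = structure_energy_mj(5_000_000, 1_000_000, 0.25, 0.12, 269.2, 0.5)
print(f"energy normalized to the baseline: {proposed / baseline:.2f}")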


FIGURE 9. Normalized memory access energy consumption depending on configurations (Baseline and Proposed).

compared to the baseline with the M3D-based SRAM-only                                              with high L2 cache miss rate, where our proposed controller
memory.                                                                                            applies Bypass-L2$ configuration to the unified L2 TLB-
Lastly, for fluidanimate and raytrace, our proposed controller applies the Half-L2$ configuration to the unified L2 TLB-cache, reflecting the profiled L2 cache miss rate. As shown in Figure 8, our proposed controller with the 2D SRAM/MRAM hybrid memory results in slightly higher energy consumption than the baseline, since the 2D SRAM/MRAM hybrid memory does not reduce wire energy compared to the 2D SRAM-only memory. In addition, our proposed controller adds energy consumption by buffering write operations, so the proposed controller with the 2D SRAM/MRAM hybrid memory does not provide an energy reduction. On the other hand, when using the M3D-based SRAM/MRAM hybrid memory, the energy consumption of the SRAM arrays is reduced thanks to the wire energy reduction. As a result, our proposed controller with the M3D-based SRAM/MRAM hybrid memory reduces the L2 TLB+L2 cache energy consumption by 36.9% and 32.8% for fluidanimate and raytrace, respectively. Meanwhile, for raytrace, our proposed controller with the M3D-based SRAM/MRAM hybrid memory shows slightly higher energy consumption than the baseline with the M3D-based SRAM-only memory. In this case, because our proposed controller adopts the Half-L2$ configuration for the unified L2 TLB-cache, it causes slightly more L2 cache misses, which in turn increases the overall energy consumption due to the increased SRAM accesses.
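The partitioning decision itself is a simple threshold comparison on the profiled L2 cache miss rate. The sketch below illustrates one plausible form of this policy; only the upper threshold (threshold_high) is named in the text, so the lower threshold and both numeric values are assumptions for illustration rather than the evaluated settings.

```python
# Hedged sketch of the unified L2 TLB-cache reconfiguration policy.
# THRESHOLD_LOW and both numeric values are illustrative assumptions;
# only threshold_high appears in the text.

THRESHOLD_LOW = 0.30    # assumed: below this, the baseline partitioning works well
THRESHOLD_HIGH = 0.70   # assumed numeric value for threshold_high

def select_configuration(l2_cache_miss_rate: float) -> str:
    """Map the profiled L2 cache miss rate to a memory partitioning."""
    if l2_cache_miss_rate >= THRESHOLD_HIGH:
        # Caching is mostly futile (e.g., canneal, dedup, streamcluster):
        # bypass the L2 cache and devote the capacity to the L2 TLB.
        return "Bypass-L2$"
    if l2_cache_miss_rate >= THRESHOLD_LOW:
        # Moderate miss rate (e.g., fluidanimate, raytrace): split the
        # capacity between L2 cache lines and L2 TLB entries.
        return "Half-L2$"
    # Low miss rate: keep the baseline partitioning.
    return "Baseline"

for miss_rate in (0.92, 0.45, 0.08):   # placeholder profiled miss rates
    print(f"miss rate {miss_rate:.2f} -> {select_configuration(miss_rate)}")
```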
FIGURE 9. Normalized memory access energy consumption depending on configurations (Baseline and Proposed).

Figure 9 shows the memory access energy consumption for each configuration. Since our proposed technique significantly reduces the number of memory accesses caused by TLB misses, it leads to a considerable reduction in memory access energy. As shown in Figure 9, in the best case (canneal), our proposed technique reduces the memory access energy consumption by 32.9%, as it reduces the memory access energy due to TLB misses by 41.7%. For streamcluster, our proposed technique reduces the memory access energy consumption by 21.4% by eliminating most of the TLB misses. Our proposed technique is beneficial for applications with a high L2 cache miss rate, where our proposed controller applies the Bypass-L2$ configuration to the unified L2 TLB-cache. Thus, such applications (e.g., canneal, dedup, and streamcluster) gain energy reductions in the main memory as well as in the unified L2 TLB-cache. On average, our proposed technique reduces the memory access energy consumption by 10.9% across the PARSEC benchmark applications.
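The main-memory savings follow directly from the reduced number of page-table-walk accesses. The following sketch shows the arithmetic with illustrative access counts and an assumed per-access energy; it reproduces the shape of the effect, not the measured 32.9%/21.4%/10.9% figures reported above.

```python
# Minimal sketch of how eliminating TLB-miss page-table walks translates into
# main-memory access energy savings. Access counts and the per-access energy
# are illustrative placeholders, not the paper's measured values.

E_PER_DRAM_ACCESS_NJ = 15.0          # assumed energy per main-memory access (nJ)

def mem_energy_nj(demand_accesses, walk_accesses):
    """Total main-memory access energy: demand misses plus page-table walks."""
    return (demand_accesses + walk_accesses) * E_PER_DRAM_ACCESS_NJ

baseline = mem_energy_nj(demand_accesses=8_000_000, walk_accesses=3_000_000)
proposed = mem_energy_nj(demand_accesses=8_000_000, walk_accesses=300_000)  # most walks removed

reduction = 1.0 - proposed / baseline
print(f"memory access energy reduction: {reduction:.1%}")
```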
VI. DISCUSSION

A. Potential Impacts of Monolithic 3D ICs
Thermal Problem: As many researchers have noted, heat is a major concern for 3D stacking of logic/memory architectures. In particular, in TSV-3D the bonding layer is thick (e.g., ~2.5 µm), so it impedes the vertical heat flow towards the heat sink, which in turn severely increases the temperature of the bottom layer. In contrast, M3D has better thermal characteristics because its interlayer dielectric (ILD) is much thinner than the bonding layer in TSV-3D designs. Thus, M3D is considered more robust to thermal problems than TSV-3D. To examine the thermal impact of our proposed architecture, we run HotSpot thermal simulations reflecting the SRAM/MRAM hybrid memory, using the detailed layer material properties from [10]. According to our analysis, the CPU peak temperature reaches 83.2˚C for raytrace (i.e., the most compute-intensive application), which is still lower than the DTM trigger temperature (i.e., 90˚C) of modern microprocessors. Several previous studies on M3D [10][18] also report that M3D-based microprocessors may not suffer from thermal problems even with 4-layer stacking, since M3D leads to lower energy consumption as well as better heat conduction than TSV-3D. Consequently, our proposed M3D hybrid architecture does not incur a thermal problem, while it reduces system-wide energy significantly.
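A simple one-dimensional estimate illustrates why the thin ILD helps: the vertical thermal resistance of an interface layer scales with its thickness divided by its thermal conductivity. In the sketch below, only the ~2.5 µm bonding-layer thickness comes from the text; the ILD thickness, conductivities, and footprint are illustrative assumptions, so the result conveys the trend rather than the calibrated HotSpot parameters.

```python
# Back-of-the-envelope vertical thermal resistance of one stacked interface:
# R = thickness / (conductivity * area). Values are illustrative; only the
# ~2.5 um bonding-layer thickness is taken from the text.

AREA_M2 = 1e-6            # assumed 1 mm^2 footprint

def r_vertical(thickness_m, k_w_per_mk, area_m2=AREA_M2):
    return thickness_m / (k_w_per_mk * area_m2)

r_tsv_bond = r_vertical(2.5e-6, 0.25)   # thick adhesive bonding layer, assumed low conductivity
r_m3d_ild  = r_vertical(0.1e-6, 1.4)    # ~100 nm SiO2-like ILD, assumed

print(f"TSV-3D bonding layer: {r_tsv_bond:.2f} K/W")
print(f"M3D ILD:              {r_m3d_ild:.2f} K/W")
print(f"ratio: ~{r_tsv_bond / r_m3d_ild:.0f}x lower vertical thermal resistance for M3D")
```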

Noise Characteristics: TSV-3D has been shown to be inappropriate for 3D stacking of memory cells because of its poor alignment precision and area overhead. Furthermore, TSV-3D incurs significant leakage power in the TSVs and their buffer chains, which degrades the noise margin. On the contrary, M3D offers much better noise characteristics than TSV-3D for the following reasons. First, M3D has negligible leakage power in MIVs. Second, the ILD layers in M3D act as noise shielding layers between two active layers. Thus, several previous works have shown that M3D-based SRAM/NVMs have strong noise margins [18][24][31][34]. In addition, the coupling effect between layers can be controlled at design time by adjusting the ILD thickness with negligible performance impact, since even an increased ILD thickness remains much thinner (shorter) than typical 2D wire lengths.
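The design-time control over inter-tier coupling can likewise be seen with a parallel-plate estimate, in which the coupling capacitance between overlapping wires on adjacent tiers falls roughly in inverse proportion to the ILD thickness. The overlap area and permittivity below are illustrative assumptions.

```python
# Parallel-plate estimate of inter-tier coupling capacitance: C = eps0 * eps_r * A / d.
# Overlap area and relative permittivity are illustrative assumptions.

EPS0 = 8.854e-12                # F/m
EPS_R = 3.9                     # assumed SiO2-like ILD
OVERLAP_AREA_M2 = (1e-6) ** 2   # assumed 1 um x 1 um wire overlap

def coupling_capacitance(ild_thickness_m):
    return EPS0 * EPS_R * OVERLAP_AREA_M2 / ild_thickness_m

for t_nm in (100, 200, 400):
    c = coupling_capacitance(t_nm * 1e-9)
    print(f"ILD = {t_nm:3d} nm -> coupling capacitance ~ {c * 1e18:.1f} aF")
```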
B. Monolithic 3D Fabrication of SRAM-MRAM Hybrid Memories
Although several studies have applied M3D integration to SRAM-only architectures [10][21], M3D stacking of SRAM-only memory banks may cause transistor performance degradation due to the low-temperature process of M3D fabrication. On the other hand, since MRAM is originally fabricated at a low temperature, it can be stacked without damaging the bottom (i.e., SRAM) layer and without causing transistor performance degradation. For this reason, non-volatile memories such as MRAM have been considered good candidates for memory/storage layers in M3D ICs [38][39][42].
Nevertheless, for the mass production of M3D ICs, M3D fabrication technology certainly needs to be improved further, together with cost reductions. Thus, many researchers have worked on improving M3D fabrication technology at various levels, including layer/via materials, fabrication methods, and CAD tools [18][19][27].

VII. CONCLUSION
Emerging workloads have varying characteristics in the private L2 TLB and L2 cache. Nevertheless, modern microprocessors still use a private L2 TLB and L2 cache with fixed capacities, which causes inefficiency in performance and energy consumption. In particular, emerging parallel workloads exhibit a high miss rate in the private L2 cache due to their large working set sizes, and they also suffer from a high L2 TLB miss rate, resulting in a significant number of main memory accesses for page table walks.
In this paper, we propose an energy-efficient unified L2 TLB-cache architecture exploiting M3D-based SRAM/MRAM hybrid memory. We consider a unified L2 TLB-cache architecture, whereas the conventional architecture uses the L2 TLB and L2 cache separately. Our proposed controller applies different memory configurations (i.e., Baseline, Half-L2$, and Bypass-L2$) depending on the L2 cache miss rate. By reconfiguring the unified L2 TLB-cache architecture, our proposed controller reduces the L2 TLB miss rate by 7.14%~99.73% with negligible performance overhead. Furthermore, as our proposed technique utilizes the M3D-based SRAM/MRAM hybrid memory for the unified L2 TLB-cache architecture, it consumes much less energy than the conventional 2D SRAM-only memory. According to our analysis, our proposed controller with the M3D-based SRAM/MRAM hybrid memory reduces the energy consumption of the L2 TLB and L2 cache by 53.6% on average, compared to the 2D SRAM-only memory. Furthermore, our proposed technique leads to a memory energy reduction of up to 32.9%. Consequently, our proposed technique makes the unified L2 TLB-cache architecture more energy-efficient by using M3D technology. We expect that our proposed technique will be adopted in future M3D-based microprocessors to enhance energy efficiency.

ACKNOWLEDGMENT
The present research has been conducted by the Research Grant of Kwangwoon University in 2020. This work was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2020R1G1A1100040).

REFERENCES
[1] J. Ahn, S. Yoo and K. Choi, "Prediction Hybrid Cache: An Energy-Efficient STT-RAM Cache Architecture," IEEE Transactions on Computers, vol. 65, no. 3, pp. 940-951, March 2016.
[2] ARM, "ARM Cortex-A75 Specifications," 2017. [Online]. Available: https://developer.arm.com/ip-products/processors/cortex-a/cortex-a75
[3] A. Bhattacharjee and M. Martonosi, "Inter-core cooperative TLB for chip multiprocessors," ACM SIGPLAN Notices, vol. 45, no. 3, pp. 359-370, 2010.
[4] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 72-81, 2010.
[5] T. E. Carlson, W. Heirman, and L. Eeckhout, "Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation," Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 72-81, 2011.
[6] J. H. Cho et al., "A 1.2V 64Gb 341GB/s HBM2 stacked DRAM with spiral point-to-point TSV structure and improved bank group data control," 2018 IEEE International Solid-State Circuits Conference (ISSCC), pp. 208-210, 2018.
[7] X. Dong, C. Xu, Y. Xie and N. P. Jouppi, "NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 7, pp. 994-1007, July 2012.
[8] M. S. Ebrahimi et al., "Monolithic 3D integration advances and challenges: From technology to system levels," 2014 SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S), 2014.
[9] W. Gomes et al., "8.1 Lakefield and Mobility Compute: A 3D Stacked 10nm and 22FFL Hybrid Processor System in 12×12mm2, 1mm Package-on-Package," IEEE International Solid-State Circuits Conference (ISSCC), pp. 144-146, 2020.
[10] Y. Gong, J. Kong and S. W. Chung, "Quantifying the Impact of Monolithic 3D (M3D) Integration on L1 Caches," IEEE Transactions on Emerging Topics in Computing, published online.
[11] K. Gopalakrishnan et al., "Highly-scalable novel access device based on Mixed Ionic Electronic conduction (MIEC) materials for high density phase change memory (PCM) arrays," 2010 Symposium on VLSI Technology, 2010.

[12] B. Gopireddy and J. Torrellas, "Designing Vertical Processors in Monolithic 3D," 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), pp. 643-656, 2019.
[13] A. Hankin, T. Shapira, K. Sangaiah, M. Lui, and M. Hempstead, "Evaluation of Non-Volatile Memory Based Last Level Cache Given Modern Use Case Behavior," IEEE International Symposium on Workload Characterization (IISWC), November 3-5, 2019.
[14] D. B. Ingerly et al., "Foveros: 3D Integration and the use of Face-to-Face Chip Stacking for Logic Devices," IEEE International Electron Devices Meeting (IEDM), pp. 19.6.1-19.6.4, 2019.
[15] Intel VTune Profiler. [Online]. Available: https://software.intel.com/content/www/us/en/develop/tools/vtune-profiler.html
[16] G. Jan et al., "Demonstration of fully functional 8Mb perpendicular STT-MRAM chips with sub-5ns writing for non-volatile embedded memories," 2014 Symposium on VLSI Technology (VLSI Technology): Digest of Technical Papers, pp. 1-2, June 2014.
[17] W. Jeon, J. H. Park, Y. Kim, G. Koo and W. W. Ro, "Hi-End: Hierarchical, Endurance-Aware STT-MRAM-Based Register File for Energy-Efficient GPUs," IEEE Access, vol. 8, pp. 127768-127780, 2020.
[18] J. Jiang, K. Parto, W. Cao, and K. Banerjee, "Ultimate Monolithic-3D Integration With 2D Materials: Rationale, Prospects, and Challenges," IEEE Journal of the Electron Devices Society, vol. 7, pp. 878-887, 2019.
[19] P. S. Kanhaiya et al., "X3D: Heterogeneous Monolithic 3D Integration of X (Arbitrary) Nanowires: Silicon, III-V, and Carbon Nanotubes," IEEE Transactions on Nanotechnology, vol. 18, pp. 270-273, 2019.
[20] J. Kong, "A novel technique for technology-scalable STT-RAM based L1 instruction cache," IEICE Electronics Express, 2016.
[21] J. Kong, Y. Gong and S. W. Chung, "Architecting large-scale SRAM arrays with monolithic 3D integration," IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), Taipei, pp. 1-6, 2017.
[22] D. U. Lee et al., "25.2 A 1.2V 8Gb 8-channel 128GB/s high-bandwidth memory (HBM) stacked DRAM with effective microbump I/O test methods using 29nm process and TSV," IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 432-433, 2014.
[23] S. Lee, K. Kang and C. Kyung, "Runtime Thermal Management for 3-D Chip-Multiprocessors With Hybrid SRAM/MRAM L2 Cache," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 3, pp. 520-533, March 2015.
[24] C. Liu and S. K. Lim, "Ultra-high density 3D SRAM cell designs for monolithic 3D integration," IEEE International Interconnect Technology Conference, 2012.
[25] X. Liu, Y. Li, Y. Zhang, A. K. Jones and Y. Chen, "STD-TLB: A STT-RAM-based dynamically-configurable translation lookaside buffer for GPU architectures," 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC), Singapore, pp. 355-360, 2014.
[26] Y. Marathe, N. Gulur, J. H. Ryoo, S. Song, and L. K. John, "CSALT: context switch aware large TLB," Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50), Association for Computing Machinery, New York, NY, USA, pp. 449-462, 2017.
[27] Z. Or-Bach, B. Cronquist, Z. Wurman, I. Beinglass, and A. K. Henning, "Modified ELTRAN® - A game changer for Monolithic 3D," IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S), 2015.
[28] E. Rotem, "Intel architecture, code name Skylake deep dive: A new architecture to manage power performance and energy efficiency," Intel Developer Forum, vol. 24, 2015.
[29] J. H. Ryoo, N. Gulur, S. Song, and L. K. John, "Rethinking TLB designs in virtualized environments: A very large part-of-memory TLB," ACM SIGARCH Computer Architecture News, vol. 45, no. 2, pp. 469-480, 2017.
[30] H. Sato et al., "14ns write speed 128Mb density embedded STT-MRAM with endurance > 10^10 and 10yrs retention@85°C using novel low damage MTJ integration process," 2018 IEEE International Electron Devices Meeting (IEDM), 2018.
[31] C.-H. Shen et al., "Monolithic 3D Chip Integrated with 500ns NVM, 3ps Logic Circuits and SRAM," IEEE International Electron Devices Meeting (IEDM), 2013.
[32] J. Shi, D. Nayak, M. Ichihashi, S. Banna and C. A. Moritz, "On the Design of Ultra-High Density 14nm FinFET Based Transistor-Level Monolithic 3D ICs," 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 449-454, 2016.
[33] M. M. Shulaker et al., "Monolithic 3D integration of logic and memory: Carbon nanotube FETs, resistive RAM, and silicon FETs," 2014 IEEE International Electron Devices Meeting (IEDM), 2014.
[34] S. Srinivasa et al., "ROBIN: Monolithic-3D SRAM for enhanced robustness with in-memory computation support," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 7, pp. 2533-2545, 2019.
[35] G. Sun, X. Dong, Y. Xie, J. Li and Y. Chen, "A novel architecture of the 3D stacked MRAM L2 cache for CMPs," IEEE 15th International Symposium on High Performance Computer Architecture (HPCA), pp. 239-249, 2009.
[36] H. Sun, C. Liu, W. Xu, J. Zhao, N. Zheng and T. Zhang, "Using Magnetic RAM to Build Low-Power and Soft Error-Resilient L1 Cache," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 1, pp. 19-28, Jan. 2012.
[37] G. Venkatasubramanian, R. J. Figueiredo, R. Illikkal and D. Newell, "A Simulation Framework for the Analysis of the TLB Behavior in Virtualized Environments," IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pp. 211-221, 2010.
[38] H. Wong and S. Salahuddin, "Memory leads the way to better computing," Nature Nanotechnology, vol. 10, no. 3, pp. 191-194, 2015.
[39] Y. Yu and N. K. Jha, "SPRING: A Sparsity-Aware Reduced-Precision Monolithic 3D CNN Accelerator Architecture for Training and Inference," IEEE Transactions on Emerging Topics in Computing, published online.
[40] Y. Zhang, Y. Li, Z. Sun, H. Li, Y. Chen and A. K. Jones, "Read Performance: The Newest Barrier in Scaled STT-RAM," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 6, pp. 1170-1174, June 2015.
[41] L. Zhu et al., "Heterogeneous 3D Integration for a RISC-V System With STT-MRAM," IEEE Computer Architecture Letters, vol. 19, no. 1, pp. 51-54, May 2020.
[42] F. Zokaee, M. Zhang, X. Ye, D. Fan and L. Jiang, "Magma: A Monolithic 3D Vertical Heterogeneous ReRAM-Based Main Memory Architecture," 2019 56th ACM/IEEE Design Automation Conference (DAC), Las Vegas, NV, USA, pp. 1-6, 2019.

YOUNG-HO GONG received the BS degree in the Division of Computer and Communication Engineering from Korea University in 2012, and the PhD degree in the Department of Computer Science and Engineering, Korea University, in 2018. He was a staff engineer at Samsung Electronics DS. He is currently an assistant professor in the School of Computer and Information Engineering at Kwangwoon University. His research interests include 3D-stacked system design, low-power/thermal-aware computer architecture design, and architecture-level thermal modeling.
