Advances in Circuits, Systems, Signal Processing and Telecommunications

          Simplified Parallel Architecture for LTE-A Turbo Decoder
                            Implemented on FPGA

                            CRISTIAN ANGHEL, CONSTANTIN PALEOLOGU
                                    Telecommunications Department
                                   University Politehnica of Bucharest
                                      Iuliu Maniu, 1-3, Bucharest
                                               ROMANIA
                                     {canghel, pale}@comm.pub.ro

Abstract: - This paper describes a turbo decoder for the 3GPP Long Term Evolution Advanced (LTE-A) standard, using the Max Log MAP algorithm, implemented on a Field Programmable Gate Array (FPGA). Taking advantage of the quadratic permutation polynomial (QPP) interleaver properties and of some FPGA block memory characteristics, a simplified parallel decoding architecture is proposed. It is intended especially for large data blocks, for which serial decoding introduces a high latency. The parallelization factor N is usually a power of 2, the maximum considered value being 8. The obtained parallel decoding latency is N times lower than the serial decoding latency. At the cost of a very small additional latency, the parallel decoding performance is similar to that of serial decoding. The novelty of the proposed parallel architecture is that only one interleaver is used, independently of the value of N.

Key-Words: - LTE-A, turbo decoder, Max LOG MAP, parallel architecture, FPGA

1 Introduction
    The discussions around channel coding theory have been intense in the last decades, but even more interest in this topic arose once turbo codes were discovered by Berrou, Glavieux, and Thitimajshima [1][2][3].
    At the beginning of their existence, after their decoding performance had been proven, turbo codes were introduced in different standards as recommendations, while convolutional codes were still mandatory. The reason behind this decision was mainly the high implementation complexity of the turbo decoder. Turbo codes became more attractive once digital processing platforms, such as the Digital Signal Processor (DSP) or the Field Programmable Gate Array (FPGA), grew in processing capacity. Nowadays chips include dedicated hardware accelerators for different types of turbo decoders, but this approach makes them standard dependent.
    The Third-Generation Partnership Project (3GPP) [4] is an organization which adopted these advanced coding techniques early. Turbo codes were standardized from the first version of the Universal Mobile Telecommunications System (UMTS) technology, in 1999. The following UMTS releases (after High Speed Packet Access was introduced) added support for new and interesting features, while turbo coding remained unchanged. Some modifications were introduced by the Long Term Evolution (LTE) standard [5][6], not significant in volume, but important in concept. While keeping exactly the same coding structure as in UMTS, 3GPP proposed a new interleaver scheme for LTE.
    A UMTS-dedicated turbo decoding scheme is presented in [7]. Due to the new LTE/LTE-A interleaver, the decoding performance is improved compared with that of the UMTS standard. Moreover, the new LTE interleaver supports parallelization of the decoding process, taking advantage of the main principle introduced by turbo decoding, i.e., the reuse of extrinsic values from one turbo iteration to another. Parallel decoding is one software adaptation required by the high data rates, while additional hardware changes are also proposed [8].
    Many parallel decoding architectures have been proposed in the literature in recent years. The obtained results are evaluated on two axes. The first one is the decoding performance degradation

ISBN: 978-1-61804-271-2                                    102

introduced by the parallel method compared with the serial decoding scheme, and the second one is the amount of resources needed for the implementation of such a parallel architecture.
    A first set of parallel architectures is described in [9]. Starting from the classical method of implementing the Maximum A Posteriori (MAP) algorithm, i.e., going through the trellis once to compute the Forward State Metrics (FSM) and then twice to compute the Backward State Metrics (BSM) and the Log Likelihood Ratios (LLR), several solutions are introduced to reduce the decoding latency of 2K clock periods per semi-iteration, where K is the data block length. The first one halves the decoding time (to only K) by starting the BSM and FSM computations simultaneously. After half of these values are computed, 2 LLR blocks start working in parallel, the interleaver block being also doubled. Another proposed scheme eliminates the need for the second interleaver but increases the decoding time by K/2 compared with the previous one, for a total decoding latency of 3K/2 clock periods.
    A second set of parallel architectures takes advantage of the algebraic-geometric properties of the Quadratic Permutation Polynomial (QPP) interleaver, as described in [10][11]. Efficient hardware implementations of the QPP interleaver are proposed there, but the parallelization factor N is also the number of interleavers used in the proposed architectures.
    A third approach consists in using a folded memory to store simultaneously all the values needed for parallel processing [12]. For this kind of implementation, the main challenge is to correctly distribute the data to each decoding unit once a memory location containing all N values has been read. More precisely, the N decoding units working in parallel write their data in a concatenated order to the same location, but when the interleaved reading takes place, these values do not go in the same order to the same decoding unit; instead, they must be redistributed. To solve this, an architecture based on 2 Batcher sorting networks is proposed. But in this approach as well, N interleavers are needed to generate all the interleaved addresses that feed the master network.
    In this paper, we also introduce a folded memory based approach, but the main difference compared with the existing solutions described above is that our proposed solution uses only one interleaver. Additionally, with some multiplexing and demultiplexing blocks, the parallel architecture remains close to the serial one, only the Soft Input Soft Output (SISO) decoding unit being instantiated N times. The number and dimensions of the block memories are unchanged between the two block schemes. In terms of decoding performance, at the cost of a small added overhead, the performances of the serial and parallel decoding architectures remain similar.
    The paper is organized as follows. Section 2 describes the LTE coding scheme with the newly introduced QPP interleaver. Section 3 presents the decoding algorithm. Section 4 discusses the implementation solutions and the proposed decoding schemes, for both serial and parallel decoding. Section 5 presents throughput and speed results obtained when targeting a XC5VFX70T [13] chip on the Xilinx ML507 [14] board; it also provides simulation curves comparing the results obtained when using serial decoding, parallel decoding, and parallel decoding with overlap. Section 6 contains the conclusions of this work.


2 LTE Coding Scheme
    The coding scheme presented in the 3GPP LTE specification is a classic turbo coding scheme, including two constituent encoders and one interleaver module. It is described in Fig. 1. At the input of the LTE turbo encoder one can observe the data block Ck. The K bits corresponding to this block are sent as systematic bits at the output in the stream Xk. At the same time, the data block is processed by the first constituent encoder, resulting in the parity bits Zk, while the interleaved data block C'k is processed by the second constituent encoder, resulting in the parity bits Z'k. Combining the systematic bits and the two streams of parity bits, the following sequence is obtained at the output of the encoder: X1, Z1, Z'1, X2, Z2, Z'2, ..., XK, ZK, Z'K.
    At the end of the coding process, in order to drive the constituent encoders back to the initial state, the switches from Fig. 1 are moved from position A to position B. Since the final states of the two constituent encoders are different, depending on the input data block, this switching procedure will generate tail bits for each encoder. These tail bits have to be transmitted together with the systematic and parity bits, resulting in the following final sequence: XK+1, ZK+1, XK+2, ZK+2, XK+3, ZK+3, X'K+1, Z'K+1, X'K+2, Z'K+2, X'K+3, Z'K+3.
    As mentioned before, the novelty introduced by the LTE standard in terms of turbo coding is the interleaver module. The output bits are reorganized using:


[Fig. 1. LTE CTC encoder.]

        C'i = Cπ(i),  i = 1, 2, ..., K,                              (1)

where the interleaving function π applied to the output index i is defined as

        π(i) = (f1·i + f2·i²) mod K.                                 (2)

    The input block length K and the parameters f1 and f2 are provided in Table 5.1.3-3 in [5].


3 Decoding Algorithm
    The LTE turbo decoding scheme is depicted in Fig. 2. The two Recursive Systematic Convolutional (RSC) decoders use, in theory, the MAP algorithm. This classic algorithm provides the best decoding performance, but it suffers from very high implementation complexity and it can lead to a large dynamic range for its variables. For these reasons, the MAP algorithm is used as a reference for the targeted decoding performance, while for real implementations new sub-optimal algorithms have been studied: Logarithmic MAP (Log MAP) [15], Maximum Log MAP (Max Log MAP), Constant Log MAP (Const Log MAP) [16], and Linear Log MAP (Lin Log MAP) [17].

[Fig. 2. LTE turbo decoder: inputs Λi(Xk), Λi(Zk), Λi(Z'k); outputs Λ1o(Xk), Λ2o(X'k), Λ2o(Xk).]

    For the proposed decoding scheme, the Max Log MAP algorithm is selected. This algorithm reduces the implementation complexity and controls the dynamic range problem at the cost of an acceptable performance degradation compared to the classic MAP algorithm. The Max Log MAP algorithm keeps only the first term of the Jacobi logarithm, i.e.,

        max*(x, y) = ln(e^x + e^y)
                   = max(x, y) + ln(1 + e^(−|y−x|)) ≈ max(x, y).     (3)

    The LTE turbo decoder trellis diagram contains 8 states, as depicted in Fig. 3. Each diagram state permits 2 inputs and 2 outputs. The branch metric between the states Si and Sj is

        γij = V(Xk)·X(i,j) + Λi(Zk)·Z(i,j),                          (4)

where X(i,j) represents the data bit and Z(i,j) the parity bit, both associated with one branch. Λi(Zk) is the LLR of the input parity bit: when the SISO 1 decoder is considered, this input LLR is Λi(Zk), while for SISO 2 it becomes Λi(Z'k). V(Xk) = V1(Xk) represents the sum of Λi(Xk) and W(Xk) for SISO 1, and V(Xk) = V2(X'k) represents the interleaved version of the difference between Λ1o(Xk) and W(Xk) for SISO 2. In Fig. 2, W(Xk) is the extrinsic information, while Λ1o(Xk) and Λ2o(X'k) are the output LLRs generated by the two SISOs.
    In the LTE turbo encoder case, there are 4 possible values for the branch metrics between 2 states in the trellis:

        γ0 = 0
        γ1 = V(Xk)
        γ2 = Λi(Zk)                                                  (5)
        γ3 = V(Xk) + Λi(Zk).

    The decoding process is based on going forward and backward through the trellis.
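To make the interleaver concrete, the sketch below (Python, zero-based indices; an illustration, not the hardware implementation) generates the QPP permutation of eq. (2) and applies eq. (1). The parameters f1 = 3, f2 = 10 are the ones listed for the shortest block, K = 40, in Table 5.1.3-3 of [5].

```python
def qpp_indices(f1: int, f2: int, K: int) -> list:
    """QPP interleaver indices pi(i) = (f1*i + f2*i^2) mod K, eq. (2)."""
    return [(f1 * i + f2 * i * i) % K for i in range(K)]

# Parameters for the shortest LTE block length, K = 40 (Table 5.1.3-3 in [5])
K, f1, f2 = 40, 3, 10
pi = qpp_indices(f1, f2, K)

# The QPP polynomial is a permutation of 0..K-1, so it can serve directly
# as a read-address (interleave) or write-address (deinterleave) map.
assert sorted(pi) == list(range(K))

# Interleaving a data block C into C', eq. (1) with zero-based indexing
C = list(range(K))                       # placeholder data block
C_interleaved = [C[pi[i]] for i in range(K)]
```

In hardware, the multiplication is usually avoided by generating π incrementally: the step π(i+1) − π(i) itself changes by the constant 2·f2 modulo K from one index to the next, which is the basis of the implementations in [10][11].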


[Fig. 3. LTE turbo coder trellis.]

3.1 Backward recursion
    The trellis is covered backward and the computed metrics are stored in a normalized form at each node of the trellis. These stored values are used for the LLR computation during the forward trellis recursion. The backward metric for the state Si at the kth stage is βk(Si), where 2 ≤ k ≤ K+3 and 0 ≤ i ≤ 7. The backward recursion is initialized with βK+3(S0) = 0 and βK+3(Si) = −∞, ∀i > 0. Starting from the stage k = K+2 and continuing through the trellis until stage k = 2, the computed backward metrics are

        β̂k(Si) = max{(βk+1(Sj1) + γij1), (βk+1(Sj2) + γij2)},        (6)

where β̂k(Si) represents the un-normalized metric and Sj1 and Sj2 are the two states from stage k+1 connected to the state Si from stage k. After the computation of the β̂k(S0) value, the rest of the backward metrics are normalized as

        βk(Si) = β̂k(Si) − β̂k(S0)                                    (7)

and then stored in the dedicated memory.

3.2 Forward recursion
    During the forward recursion, the trellis is covered in the normal direction, this process being similar to the one specific to the Viterbi algorithm. In order to allow the computation of the current stage (k) metrics, only the forward metrics from the previous stage (k−1) have to be stored. The forward metric for the state Si at stage k is αk(Si), with 0 ≤ k ≤ K−1 and 0 ≤ i ≤ 7. The forward recursion is initialized with α0(S0) = 0 and α0(Si) = −∞, ∀i > 0. Starting from the stage k = 1 and continuing through the trellis until the last stage k = K, the un-normalized forward metrics are given by

        α̂k(Sj) = max{(αk−1(Si1) + γi1j), (αk−1(Si2) + γi2j)},        (8)

where Si1 and Si2 are the two states from stage k−1 connected to the state Sj from stage k. After the computation of the α̂k(S0) value, the rest of the forward metrics are normalized as

        αk(Si) = α̂k(Si) − α̂k(S0).                                   (9)

    Because the forward metrics α are computed for stage k, the decoding algorithm can at the same time obtain an LLR estimate for the data bit Xk. This LLR is obtained by first considering that the likelihood of the connection between the state Si at stage k−1 and the state Sj at stage k is

        λk(i, j) = αk−1(Si) + γij + βk(Sj).                          (10)

    The likelihood of the bit being 1 (or 0) is given by the Jacobi logarithm over all the branch likelihoods corresponding to 1 (or 0), and thus:

        Λo(Xk) = max{λk(i, j) : (Si→Sj), Xi = 1}
               − max{λk(i, j) : (Si→Sj), Xi = 0},                    (11)

where the "max" operator is computed recursively over the branches having at the input a bit of 1, {(Si→Sj) : Xi = 1}, or a bit of 0, {(Si→Sj) : Xi = 0}.


4 Proposed Decoding Scheme

4.1 Serial Decoder Block Scheme
    From the theoretical decoding scheme depicted in Fig. 2 it can be noticed that the SISO 2 decoder starts working only after the SISO 1 decoder finishes its job and vice versa, the usage of previously obtained extrinsic values being the main principle of turbo decoding.
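As a purely illustrative software reference (a sketch, not the FPGA datapath; tail-bit handling and the 8-state LTE trellis are omitted in favor of a generic trellis description, and the toy 2-state trellis below is our own stand-in), the recursions (6)-(11) of one Max Log MAP pass can be written as:

```python
NEG_INF = float("-inf")

def max_log_map(branches, num_states, K):
    """One Max Log MAP pass over a generic trellis (illustrative sketch).
    branches[k-1] lists the stage-k transitions as (s_from, s_to, x_bit, gamma).
    Returns the output LLRs of eq. (11). The trellis is assumed to start and
    terminate in state S0, hence the initializations below."""
    # Backward recursion, eqs. (6)-(7)
    beta = [[NEG_INF] * num_states for _ in range(K + 1)]
    beta[K][0] = 0.0
    for k in range(K, 0, -1):
        for (si, sj, x, g) in branches[k - 1]:
            beta[k - 1][si] = max(beta[k - 1][si], beta[k][sj] + g)
        norm = beta[k - 1][0]                    # normalize against state S0
        beta[k - 1] = [b - norm for b in beta[k - 1]]
    # Forward recursion and LLRs, eqs. (8)-(11)
    alpha = [0.0] + [NEG_INF] * (num_states - 1)
    llrs = []
    for k in range(1, K + 1):
        new_alpha = [NEG_INF] * num_states
        lam1 = lam0 = NEG_INF                    # running maxima of eq. (10)
        for (si, sj, x, g) in branches[k - 1]:
            lam = alpha[si] + g + beta[k][sj]    # eq. (10)
            if x == 1:
                lam1 = max(lam1, lam)
            else:
                lam0 = max(lam0, lam)
            new_alpha[sj] = max(new_alpha[sj], alpha[si] + g)   # eq. (8)
        norm = new_alpha[0]                      # eq. (9)
        alpha = [a - norm for a in new_alpha]
        llrs.append(lam1 - lam0)                 # eq. (11)
    return llrs

# Toy 2-state trellis, K = 2, systematic metric only: gamma = V(Xk) on x=1 branches
V = [2.0, -1.0]
branches = [[(0, 0, 0, 0.0), (0, 1, 1, V[k]), (1, 0, 1, V[k]), (1, 1, 0, 0.0)]
            for k in range(2)]
llrs = max_log_map(branches, 2, 2)
```

Note that the normalizations (7) and (9) subtract the same constant from every state metric of a stage, so the difference in eq. (11) is unaffected; this is what allows the fixed-width metrics used later in the implementation.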


[Fig. 4. Proposed serial turbo decoder block scheme: the Λi(Xk), Λi(Zk) and Λi(Z'k) input memories feed a single time-multiplexed RSC (SISO 1 or SISO 2) unit through the V1(Xk)/V2(X'k) selection; the V2(Xk), W(Xk), Λ2o(X'k) and Λ2o(Xk) memories, the Interleaver (write normal / read interleaved) and the Deinterleaver (write interleaved / read normal) close the loop toward the decision output X̂k.]

    Also, all the processing is based on complete data blocks, since the interleaver or deinterleaver procedures must be applied in between. It follows that the 2 SISOs decode data in non-overlapping time windows, so a single SISO unit can be used in a time-multiplexed manner, as one can observe in Fig. 4, where a serial decoder block scheme based on the previous work presented in [18] for a WiMAX CTC decoder is described.
    The memory blocks are used for storing data from one semi-iteration to another and from one iteration to another. The dotted-line memory blocks are virtual memories added only to ease the understanding of the introduced notations. Also, it should be mentioned that the Interleaver and Deinterleaver blocks are in fact the same, including a block memory called ILM (Interleaver Memory) and an interleaver. The ILM is the new approach introduced by the authors compared with the previous serial implementation presented in [19], and its goal is also to prepare the architecture for parallel decoding. The memory is written with the interleaved addresses each time a new data block is received. The values are then used as read addresses (when the interleaving process is ongoing) or as write addresses (when the deinterleaving process is ongoing). This ILM, together with the 3 memories on the left side of the picture (for the input data), are switched buffers, allowing new data to be written while the previous block is still being decoded.
    The scheme depicted in Fig. 4 works as follows. SISO 1 reads the memory locations corresponding to the V1(Xk) and Λi(Zk) vectors. The reading process is performed forward and backward and it serves the first semi-iteration. At the end of this process, SISO 2 reads forward and backward from the memory blocks corresponding to the V2(X'k) and Λi(Z'k) vectors in order to perform the second semi-iteration.
    The vector V1(Xk) is obtained by adding the input vector Λi(Xk) to the extrinsic information vector W(Xk). While reading these 2 memories, SISO 1 starts the decoding process. At the output, the LLRs become available and, by subtracting from them the delayed extrinsic values already read from the W(Xk) memory, the vector V2(Xk) is computed and then stored into its corresponding memory in normal order. The interleaving process is started (the initially written ILM is now read in normal order, so that the interleaved read addresses for V2(Xk) are obtained) and the re-ordered LLRs V2(X'k) become available, the corresponding values for the 3 tail bits X'K+1, X'K+2, X'K+3 being added at the end of this sequence. The second semi-iteration is then ongoing. The same SISO unit is used, but this time it reads its data inputs from the other memory blocks. As one can see from Fig. 4, two switching mechanisms are included in the scheme. When in position 1, the memory blocks for V1(Xk) and Λi(Zk) are used, while in position 2 the memory blocks for V2(X'k) and Λi(Z'k) become active.
    At the output of the SISO unit, after each semi-iteration, K LLRs are obtained. The ones corresponding to the second semi-iteration are stored in the Λ2o(X'k) memory (the ILM output, which was already available for the V2(Xk) interleaving process, is used as the write address for the Λ2o(X'k) memory, after a delay is added).
    Reading the Λ2o(X'k) memory and also the V2(Xk) memory in normal order provides the inputs for the W(Xk) memory and at the same time allows a new semi-iteration to start for SISO 1. So the W(Xk) memory update is made at the same time as a new semi-iteration starts. Fig. 5 depicts a time diagram for the serial turbo decoding, and the gray colored intervals


describe the W(Xk) memory writing. One can observe that the upper four memories in the picture are switched buffers, so they are written while the previous data block is still being processed. In the picture, R stands for Read, W for Write, (K − 1:0) denotes the backward trellis run, (0:K − 1) the forward trellis run, and IL an interleaved read (for the interleaver process) or an interleaved write (for the deinterleaver process).
    In order to be able to handle all the data block dimensions, the used memory blocks have 6144 locations (the maximum data block length), except the ones storing the input data for the RSCs, which have 6144 + 3 locations, including the tail bits. Each memory location is 10 bits wide: the first bit holds the sign, the next 6 bits represent the integer part, and the last 3 bits the fractional part. This format was decided by studying the dynamic range of the variables (for the integer part) and the variation of the decoding performance (for the fractional part).
    The constituent modules of the SISO block are the ones presented in Fig. 6. One can notice the un-normalized metric computing blocks ALPHA (forward) and BETA (backward), and the transition metric computing block GAMMA, which in addition includes the normalization function (subtracting the metric of the first state from all the other metrics).
    The L block computes the output LLRs, which are normalized by the NORM block. The MUX-MAX block selects the inputs corresponding to the forward or backward recursion and computes the maximum function. The MEM BETA block stores the backward metrics, which are computed before the forward metrics. The metric normalization is required to preserve the dynamic range.

Fig. 5. Time diagram for serial turbo decoder.

Fig. 6. Proposed SISO block scheme.

    Without normalization, the forward and backward metric widths would have to be larger in order to avoid saturation, which means more memory blocks, more complex arithmetic (i.e., more used resources), and, as an overall consequence, a lower frequency. Hence, reducing the logic levels by eliminating the normalization procedure does not increase the system performance.
    The ALPHA, BETA, and GAMMA blocks are implemented in a dedicated way: each metric corresponding to each state is computed separately, not using the same function with different input parameters. Consequently, 16 equations would be needed for the transition metric computation (2 possible transitions for each of the 8 states of a stage). In fact, only 4 equations are needed [as indicated in (5)]; moreover, one of these 4 equations leads to a zero value, so the computational effort is minimized for this implementation solution.
    The interleaver module is used both for interleaving and deinterleaving. The interleaved index is obtained based on a modified form of (2), i.e.,

        π(i) = {[(f1 + f2 · i) mod K] · i} mod K.    (12)

    In order to obtain both functions, either the input data is stored in the memory in natural order and then read in interleaved order, or the input data is stored in interleaved order and then read in natural order.
    The interleaved index computation is performed in three steps. First, the value of (f1 + f2 · i) mod K is computed. This partial result is then multiplied by the natural order index i, and finally a new modulo K operation is applied. The first step exploits the remark that the term f1 + f2 · i increases by f2 for consecutive values of the index i. In this way, a register value is increased by f2 at each new index i; if the resulting value is larger than K, the value of K is subtracted from the register. Each step of this
processing is one clock period long, which is why the interleaved indices are generated in a continuous manner.
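As a sketch of this three-step procedure (illustrative Python, not the VHDL implementation; the LTE-defined parameters for K = 40, f1 = 3 and f2 = 10, are used as the example), the register-based index generation can be written as:

```python
# Generate the QPP interleaved indices pi(i) = ((f1 + f2*i) mod K) * i mod K
# incrementally: the term (f1 + f2*i) mod K is kept in a register updated by
# +f2 with one conditional subtraction of K (assumes f2 < K, as in the LTE
# parameter tables), then multiplied by i and reduced mod K.
def qpp_indices(K, f1, f2):
    indices = []
    reg = f1 % K                  # (f1 + f2*i) mod K for i = 0
    for i in range(K):
        indices.append((reg * i) % K)
        reg += f2                 # update the register for the next index
        if reg >= K:
            reg -= K              # conditional subtraction instead of a division
    return indices

print(qpp_indices(40, 3, 10)[:6])  # → [0, 13, 6, 19, 12, 25]
```

Since multiplication distributes over the modulo, the result equals the direct QPP form (f1 · i + f2 · i²) mod K of (2), while the hardware only needs one conditional subtraction per index.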

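For reference, the 10-bit fixed-point format described earlier (1 sign bit, 6 integer bits, 3 fractional bits) can be modeled as below. This is an illustrative quantizer sketch with saturation assumed at the range limits, not the authors' exact RTL format handling:

```python
# Illustrative model of the 10-bit memory word format: 1 sign bit, 6 integer
# bits, 3 fractional bits. The resolution is 1/8 and the magnitude is assumed
# to saturate at 64 - 1/8 = 63.875 rather than wrap around.
def quantize_1_6_3(x):
    step = 1.0 / 8                          # 3 fractional bits
    max_mag = 64.0 - step                   # 6 integer bits plus sign
    q = round(x / step) * step              # round to nearest representable value
    return max(-max_mag, min(max_mag, q))   # clip instead of wrapping
```

For example, quantize_1_6_3(1.3) returns 1.25, and any metric beyond ±63.875 is clipped, which is why the dynamic range study mentioned above matters.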
4.2 Parallel Decoder Block Scheme
    The proposed parallel architecture is similar to the serial one described in Fig. 4, only that the RSC SISO module is instantiated N times in the scheme. We propose an architecture that concatenates the N values from the N RSCs and always points at the same memory location, for all the memories in the scheme. So, instead of having K locations with 10 bits per location as in the serial architecture, in the parallel one each memory contains K/N locations with 10N bits per location.
    The main advantage introduced by the proposed architecture is the fact that the interleaver block works only once, before the decoding itself takes place. The ILM memory is written when a new data block is received, while the previous one is still under decoding. This approach allows a simplified way of working for the parallel scheme. Knowing the parallelization factor N, the ILM memory can be prepared for the parallel processing that follows. More precisely, the ILM memory will have K/N locations, N values being written at each location. As mentioned in [20], a Virtex 5 block memory can be configured from (32k locations x 1 bit) to (512 locations x 72 bits). In the worst case scenario, when K = 6144, based on the N values and keeping the stored values on 10 bits as previously mentioned, the parallel ILM memory can be (768 locations x 80 bits), (1536 locations x 40 bits), (3072 locations x 20 bits), or (6144 locations x 10 bits), so still only 2 BRAMs are used, as in the case of the serial ILM.

Fig. 7. ILM memory writing procedure.

    Fig. 7 describes the way the ILM works. As one can observe, during the writing procedure, each index i from 0 to K − 1 generates a corresponding interleaved value. These interleaved values are written in normal order in the ILM: the first K/N interleaved values occupy the first position of each memory location, the second K/N values are placed on the second position of each location, and so on. In order to perform this procedure, a true dual-port BRAM is used. Each time a new position of location n is written, the content of location n + 1 is also read from the memory, so that in the next clock period the next interleaved value can be appended to the content already existing at that location. When the interleaver function is needed during a semi-iteration, the ILM is read in a normal way, so that the N interleaved values from one location represent the reading addresses for the V2(Xk) memory. The QPP properties guarantee that the N values that should be read in the interleaved way are placed at the same memory location, only that their positions should be re-arranged before being sent to the corresponding RSCs.

Fig. 8. Virtual parallel interleaver.

    For simplifying the representation, the case of K = 40 and N = 8 is exemplified in Fig. 8. On the left, one can see the content of the V2(Xk) memory; each column represents the outputs of one of the N RSC SISOs. On the right, the content of the ILM memory is described. The minimum value from each line of the ILM (the grey colored circle in the figure) represents the line address for the V2(Xk) memory. Then, using a re-ordering module implemented with multiplexers and de-multiplexers, each position from the read line is sent to its corresponding SISO. For example, position b from the first read line (index 5) is sent to SISO f, while position b from the second read line (index 8) is sent to SISO d. The same procedure applies also for the deinterleaver process, only that the
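The writing procedure of Fig. 7 can be mimicked behaviorally as follows (a Python sketch, not the dual-port BRAM logic): interleaved value number i lands at location i mod (K/N), position i div (K/N), so the first K/N values fill position 0 of every location, the next K/N fill position 1, and so on.

```python
# Behavioral sketch of the ILM organization from Fig. 7: pack the K
# interleaved indices into K/N locations of N positions each.
def pack_ilm(interleaved, N):
    K = len(interleaved)
    M = K // N                           # locations per memory (N divides K)
    ilm = [[0] * N for _ in range(M)]
    for i, pi in enumerate(interleaved):
        ilm[i % M][i // M] = pi          # location, position within location
    return ilm

# Example for K = 40, N = 8 with the QPP pi(i) = (3*i + 10*i*i) mod 40:
ilm = pack_ilm([(3 * i + 10 * i * i) % 40 for i in range(40)], 8)
print(len(ilm), len(ilm[0]))  # → 5 8
```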

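The QPP property invoked here (the N values needed in one cycle share one memory location) can be checked numerically: for M = K/N with N dividing K, the QPP satisfies π(i + jM) ≡ π(i) (mod M), so the N indices differ only in their position (lane) within the location. A small verification sketch, with the LTE parameters for K = 40 taken as f1 = 3, f2 = 10:

```python
# Check the QPP property used by the parallel scheme: for M = K/N, the N
# indices pi(i), pi(i+M), ..., pi(i+(N-1)M) are equal mod M (same memory
# location) and their quotients div M are a permutation of 0..N-1 (one value
# per lane), so one read plus re-ordering serves all N SISOs without conflict.
def contention_free(K, N, f1, f2):
    M = K // N
    pi = lambda i: (f1 * i + f2 * i * i) % K
    for i in range(M):
        group = [pi(i + j * M) for j in range(N)]
        if any(g % M != pi(i) % M for g in group):           # same location?
            return False
        if sorted(g // M for g in group) != list(range(N)):  # distinct lanes?
            return False
    return True

print(contention_free(40, 8, 3, 10))  # → True
```

This is why a single location read followed by the multiplexer-based re-ordering suffices, with no memory access conflicts between the N SISOs.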
write addresses are extracted from the ILM, while the read ones are in normal order.
    From the timing point of view, Fig. 9 depicts the case when N = 2 is used. The same comments as the ones for Fig. 5 apply.

Fig. 9. Time diagram for parallel turbo decoder (N = 2).

    Testing the parallel decoding performances, a certain level of degradation was observed, since the forward and backward metrics are altered at the data block extremities. In order to obtain results similar to the serial decoding case, a small overhead is accepted: if an overlap is added at each parallel block border, the metrics computation gains a training phase. The minimum overlap window may be as long as the minimum standard-defined data block, in this case Kmin = 40 bits. Fig. 10 describes this situation for the N = 2 case.

Fig. 10. a) Non-overlapping split; b) overlapping split.

5 Implementation Results
    From Figs. 5 and 9 it can be observed that the decoding latency is reduced in the case of parallel decoding by almost a factor equal to N. There is a certain Delay (11 clock periods in this implementation) that is added at each forward trellis run, when the LLRs are computed, so 2 such values are introduced at each iteration.
    The native latency for serial decoding is computed as follows: K clock periods are needed for the backward trellis run of the first semi-iteration, another K clock periods plus Delay for the forward trellis run and LLR computation, and the sum is multiplied by 2 for the second semi-iteration. Considering L the number of executed iterations, the total latency (in clock periods) for the serial decoding of each block is

        Latency_s = (4K + 2 · Delay) · L,    (13)

while for the parallel decoding the needed number of clock periods is

        Latency_p = (4K/N + 2 · Delay) · L.    (14)

    The corresponding latency for the overlapped split is, considering N > 2 (which leads to blocks with Kmin overlap at both the left and the right sides),

        Latency_po = (4(K/N + 2 · Kmin) + 2 · Delay) · L.    (15)

    In order to evaluate the performances, the hardware description language used is the VHSIC Hardware Description Language (VHDL). For the generation of the RAM/ROM memory blocks, Xilinx Core Generator 11.1 was used. The simulations were performed with ModelSim 6.5, and the synthesis was done using Xilinx XST from Xilinx ISE 11.1. Using these tools, the system frequency obtained when implementing the decoding structure on a Xilinx XC5VFX70T-FFG1136 chip is around 210 MHz.
    The values included in Table 1 are computed based on (13), (14), and (15) for the N = 8 case. It can be noticed that the overhead introduced by the overlapped split becomes less significant as the value of K increases, which is the scenario where parallel decoding is usually used.

Table 1. Latency values for N = 8, L = 3 or 4, and K = 1536, 4096, or 6144

    K    |  Latency_s [us]  |  Latency_p [us]  |  Latency_po [us]
         |   L=3      L=4   |   L=3      L=4   |   L=3      L=4
   1536  |  88.08    117.4  |  11.28    15.04  |  15.85    21.14
   4096  |  234.3    312.5  |  29.57    39.42  |  34.14    45.52
   6144  |  351.4    468.5  |  44.2     58.9   |  48.7     65.03
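Formulas (13), (14), and (15) are easy to cross-check numerically; the sketch below assumes the 210 MHz system clock and the Delay of 11 clock periods reported above:

```python
# Reproduce the latency formulas (13)-(15) in microseconds, assuming the
# reported 210 MHz system clock, Delay = 11 clock periods, and Kmin = 40.
F_CLK = 210e6
DELAY = 11
KMIN = 40

def latency_serial_us(K, L):
    return (4 * K + 2 * DELAY) * L / F_CLK * 1e6                     # eq. (13)

def latency_parallel_us(K, N, L):
    return (4 * K // N + 2 * DELAY) * L / F_CLK * 1e6                # eq. (14)

def latency_overlap_us(K, N, L):
    return (4 * (K // N + 2 * KMIN) + 2 * DELAY) * L / F_CLK * 1e6   # eq. (15)

print(round(latency_serial_us(1536, 3), 2))       # → 88.09
print(round(latency_parallel_us(1536, 8, 3), 2))  # → 11.29
print(round(latency_overlap_us(1536, 8, 3), 2))   # → 15.86
```

The small differences with respect to the tabulated values come only from rounding.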

    Table 2 provides the corresponding throughput rates when the values from Table 1 are used.

Table 2. Throughput values for N = 8, L = 3 or 4, and K = 1536, 4096, or 6144

    K    |  Tput_s [Mbps]  |  Tput_p [Mbps]  |  Tput_po [Mbps]
         |   L=3     L=4   |   L=3     L=4   |   L=3     L=4
   1536  |  17.43   13.07  |  136.1   102.0  |  96.86   72.64
   4096  |  17.47   13.10  |  138.5   103.8  |  119.9   89.9
   6144  |  17.48   13.11  |  139     104.2  |  125.9   94.4

    As one can observe from Table 2, the serial decoding performance is close to the theoretical one. Let us consider, for example, the case K = 6144 and L = 3. The native theoretical latency is 4KL clock periods, which leads to a theoretical throughput of 17.5 Mbps, while the result obtained for the proposed serial implementation is 17.48 Mbps.
    The following performance curves were obtained using a finite precision Matlab simulator. This approach was selected because the Matlab simulator produces exactly the same outputs as the ModelSIM simulator, while the simulation time is smaller.
    All the simulation results use the Max Log MAP algorithm. All figures show the Bit Error Rate (BER) versus the Signal-to-Noise Ratio (SNR), expressed as the ratio between the energy per bit and the noise power spectral density.
    Fig. 11 depicts the results obtained when a block of length K = 512 was decoded in a serial manner, in a parallel manner without overlapping, and in a parallel manner with overlapping. For this scenario, N = 2, QPSK modulation was used, and L = 3. Fig. 12 presents the same type of results for the case of K = 1024 and N = 4.
    As one can observe from Figs. 11 and 12, the parallel decoding with overlap produces the same results as the serial decoding.

Fig. 11. Comparative decoding results for QPSK, L = 3, K = 512, N = 2.

Fig. 12. Comparative decoding results for QPSK, L = 3, K = 1024, N = 4.

    On the other hand, the parallel decoding without overlap introduces a certain level of degradation compared with the serial decoding, the loss in terms of performance being dependent on the value of N.

6 Conclusions
    The most important aspects regarding the FPGA implementation of a turbo decoder for LTE-A systems were presented in this paper. The serial turbo decoder architecture was developed and implemented in an efficient manner, especially from the point of view of the interleaver/deinterleaver processes. The interleaver memory ILM was introduced so that the interleaver process works effectively only outside the decoding process itself. The ILM is written together with the input data, while the previous block is still under decoding. This approach allowed the transfer to the parallel architecture in a simplified way, using only concatenated values at the same memory locations. The parallel architecture uses the same number of block memories and only one interleaver, at the cost of some multiplexing/demultiplexing structures.
    The parallel decoding performances were compared with the serial ones and a certain degradation was observed. To eliminate this degradation, a small overhead was accepted through the overlapping split applied to the parallel data blocks.

Acknowledgment
    The work has been funded by the Sectoral Operational Programme Human Resources Development 2007-2013 of the Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/134398.

References:
[1] C. Berrou, A. Glavieux, and P. Thitimajshima, Near Shannon limit error-correcting coding and decoding: Turbo Codes, IEEE Proceedings of the Int. Conf. on Communications, Geneva, Switzerland, May 1993, pp. 1064-1070.
[2] C. Berrou and A. Glavieux, Near optimum error correcting coding and decoding: Turbo-Codes, IEEE Trans. Communications, vol. 44, no. 10, pp. 1261-1271, Oct. 1996.
[3] C. Berrou and M. Jézéquel, Non binary convolutional codes for turbo coding, Electronics Letters, vol. 35, no. 1, pp. 39-40, Jan. 1999.
[4] Third Generation Partnership Project, 3GPP home page, www.3gpp.org.
[5] 3GPP TS 36.212 V8.7.0 (2009-05) Technical Specification, "3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Evolved Universal Terrestrial Radio Access (E-UTRA); Multiplexing and channel coding (Release 8)."
[6] F. Khan, LTE for 4G Mobile Broadband, Cambridge University Press, New York, 2009.
[7] M. C. Valenti and J. Sun, The UMTS turbo code and an efficient decoder implementation suitable for software-defined radios, International Journal of Wireless Information Networks, vol. 8, no. 4, Oct. 2001.
[8] M. Sanad and N. Hassan, Novel wideband MIMO antennas that can cover the whole LTE spectrum in handsets and portable computers, The Scientific World Journal, vol. 2014, art. ID 694805, 9 pages, 2014.
[9] Suchang Chae, A low complexity parallel architecture of turbo decoder based on QPP interleaver for 3GPP-LTE/LTE-A, http://www.design-reuse.com/articles/31907/turbo-decoder-architecture-qpp-interleaver-3gpp-lte-lte-a.html
[10] Y. Sun and J. R. Cavallaro, Efficient hardware implementation of a highly-parallel 3GPP LTE/LTE-advance turbo decoder, Integration, the VLSI Journal, vol. 44, issue 4, pp. 305-315, Sept. 2011.
[11] Di Wu, R. Asghar, Yulin Huang, and D. Liu, Implementation of a high-speed parallel turbo decoder for 3GPP LTE terminals, ASICON '09, IEEE 8th International Conference on ASIC, pp. 481-484, 2009.
[12] C. Studer, C. Benkeser, S. Belfanti, and Qiuting Huang, Design and implementation of a parallel turbo-decoder ASIC for 3GPP-LTE, IEEE Journal of Solid-State Circuits, vol. 46, issue 1, pp. 8-17, Jan. 2011.
[13] "Xilinx Virtex 5 family user guide," www.xilinx.com.
[14] "Xilinx ML507 evaluation platform user guide," www.xilinx.com.
[15] P. Robertson, E. Villebrun, and P. Hoeher, A Comparison of Optimal and Sub-Optimal MAP Decoding Algorithms Operating in the Log Domain, Proc. IEEE International Conference on Communications (ICC'95), Seattle, pp. 1009-1013, June 1995.
[16] S. Papaharalabos, P. Sweeney, and B. G. Evans, Constant log-MAP decoding algorithm for duo-binary turbo codes, Electronics Letters, vol. 42, issue 12, pp. 709-710, June 2006.
[17] Jung-Fu Cheng and T. Ottosson, Linearly approximated log-MAP algorithms for turbo decoding, Vehicular Technology Conference Proceedings, VTC 2000-Spring Tokyo, IEEE 51st, vol. 3, pp. 2252-2256, 2000.
[18] C. Anghel, A. A. Enescu, C. Paleologu, and S. Ciochina, CTC Turbo Decoding Architecture for H-ARQ Capable WiMAX Systems Implemented on FPGA, Ninth International Conference on Networks ICN 2010, Menuires, France, April 2010.
[19] C. Anghel, V. Stanciu, C. Stanciu, and C. Paleologu, CTC Turbo Decoding Architecture for LTE Systems Implemented on FPGA, IARIA ICN 2012, Reunion, France, 2012.
[20] "Virtex 5 Family Overview," Feb. 2009, www.xilinx.com.