Advances in Circuits, Systems, Signal Processing and Telecommunications
Simplified Parallel Architecture for LTE-A Turbo Decoder
Implemented on FPGA
CRISTIAN ANGHEL, CONSTANTIN PALEOLOGU
Telecommunications Department
University Politehnica of Bucharest
Iuliu Maniu, 1-3, Bucharest
ROMANIA
{canghel, pale}@comm.pub.ro
Abstract: - This paper describes a turbo decoder for the 3GPP Long Term Evolution Advanced (LTE-A) standard, using the Max Log MAP algorithm, implemented on a Field Programmable Gate Array (FPGA). Taking advantage of the quadratic permutation polynomial (QPP) interleaver properties and considering some FPGA block memory characteristics, a simplified parallel decoding architecture is proposed. It is intended especially for large data blocks, for which serial decoding introduces a high latency. The parallelization factor N is usually a power of 2, the maximum considered value being 8. The obtained parallel decoding latency is N times lower than the serial decoding latency. At the cost of a small additional latency, the parallel decoding performances remain similar to the serial decoding ones. The novelty of the proposed parallel architecture is that only one interleaver is used, independently of the value of N.
Key-Words: - LTE-A, turbo decoder, Max LOG MAP, parallel architecture, FPGA
1 Introduction
The discussions around channel coding theory have been intense in the last decades, but even more interest was drawn to this topic once the turbo codes were discovered by Berrou, Glavieux, and Thitimajshima [1][2][3].

At the beginning of their existence, after proving the obtained decoding performances, the turbo codes were introduced in different standards as recommendations, while convolutional codes were still mandatory. The reason behind this decision was especially the high complexity of the turbo decoder implementation. But the turbo codes became more attractive once the supports for digital processing, like the Digital Signal Processor (DSP) or the Field Programmable Gate Array (FPGA), grew more and more in terms of processing capacity. Nowadays the chips include dedicated hardware accelerators for different types of turbo decoders, but this approach makes them standard dependent.

The Third-Generation Partnership Project (3GPP) [4] is an organization which adopted these advanced coding techniques early. Turbo codes were standardized from the first version of the Universal Mobile Telecommunications System (UMTS) technology, in 1999. The next UMTS releases (after High Speed Packet Access was introduced) added support for new and interesting features, while turbo coding remained unchanged. Some modifications were introduced by the Long Term Evolution (LTE) standard [5][6], not significant in volume, but important in concept. While keeping exactly the same coding structure as in UMTS, 3GPP proposed for LTE a new interleaver scheme.

A UMTS dedicated turbo decoding scheme is presented in [7]. Due to the new LTE/LTE-A interleaver, the decoding performances are improved compared with the ones corresponding to the UMTS standard. Moreover, the new LTE interleaver provides support for the parallelization of the decoding process inside the algorithm, taking advantage of the main principle introduced by turbo decoding, i.e., the usage of extrinsic values from one turbo iteration to another. The parallel decoding represents one software adaptation requested by the high data rates, while additional hardware changes are also proposed [8].

There are many parallel decoding architectures proposed in the literature in the last years. The obtained results are evaluated on two axes.
The first one is the degradation of the decoding performances introduced by the parallel method compared with the serial decoding scheme, and the second one is the amount of resources needed for such a parallel architecture implementation. A first set of parallel architectures is described in [9]. Starting from the classical method of implementing the Maximum A Posteriori (MAP) algorithm, i.e., going through the trellis once to compute the Forward State Metrics (FSM) and then a second time to compute the Backward State Metrics (BSM) and the Log Likelihood Ratios (LLR), several solutions are introduced there to reduce the decoding latency of 2K clock periods per semi-iteration, where K is the data block length. The first one reduces the decoding time to half (only K) by starting the BSM and FSM computations simultaneously. After half of these values are computed, 2 LLR blocks start working in parallel, the interleaver block being also doubled. Another proposed scheme eliminates the need for the second interleaver but increases the decoding time by K/2 compared with the previous one, a total decoding latency of 3K/2 clock periods being obtained.

A second set of parallel architectures takes advantage of the algebraic-geometric properties of the Quadratic Permutation Polynomial (QPP) interleaver, as described in [10][11]. Here efficient hardware implementations of the QPP interleaver are proposed, but the parallelization factor N also represents the number of interleavers used in the proposed architectures.

A third approach consists in using a folded memory to store simultaneously all the values needed for parallel processing [12]. But for this kind of implementation the main challenge is to correctly distribute the data to each decoding unit once a memory location containing all N values was read. More precisely, the N decoding units working in parallel write their data in a concatenated order to the same location, but when the interleaved reading takes place, these values do not go in the same order to the same decoding unit; instead they should be redistributed. To solve this, an architecture based on 2 Batcher sorting networks is proposed. But also in this approach, N interleavers are needed to generate all the interleaved addresses that feed the master network.

In this paper, we also introduce a folded memory based approach, but the main difference compared with the already existing solutions described above is that our proposed solution uses only one interleaver. Additionally, with some multiplexing and demultiplexing blocks, the parallel architecture remains close to the serial one, only the Soft Input Soft Output (SISO) decoding unit being instantiated N times. The block memory numbers and dimensions are unchanged between the two block schemes. In terms of decoding performances, with the cost of a small overhead added, the performances of the serial and parallel decoding architectures are kept similar.

The paper is organized as follows. Section 2 describes the LTE coding scheme with the newly introduced QPP interleaver. Section 3 presents the decoding algorithm. In Section 4, the implementation solutions and the proposed decoding schemes are discussed, for both serial and parallel decoding. Section 5 presents throughput and speed results obtained when targeting a XC5VFX70T [13] chip on a Xilinx ML507 [14] board; it also provides simulation curves comparing the results obtained when using serial decoding, parallel decoding, and parallel decoding with overlap. Section 6 contains the conclusions of this work.


2 LTE Coding Scheme
The coding scheme presented in the 3GPP LTE specification is a classic turbo coding scheme, including two constituent encoders and one interleaver module. It is described in Fig. 1. One can observe at the input of the LTE turbo encoder the data block Ck. The K bits corresponding to this block are sent as systematic bits at the output in the stream Xk. At the same time, the data block is processed by the first constituent encoder, resulting in the parity bits Zk, while the interleaved data block C'k is processed by the second constituent encoder, resulting in the parity bits Z'k. Combining the systematic bits and the two streams of parity bits, the following sequence is obtained at the output of the encoder: X1, Z1, Z'1, X2, Z2, Z'2, ..., XK, ZK, Z'K.

At the end of the coding process, in order to drive the constituent encoders back to the initial state, the switches from Fig. 1 are moved from position A to B. Since the final states of the two constituent encoders are different, depending on the input data block, this switching procedure will generate tail bits for each encoder. These tail bits have to be transmitted together with the systematic and parity bits, resulting in the following final sequence: XK+1, ZK+1, XK+2, ZK+2, XK+3, ZK+3, X'K+1, Z'K+1, X'K+2, Z'K+2, X'K+3, Z'K+3.
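For clarity, a behavioral Python sketch of one constituent encoder is given below. The generator polynomials (feedback g0(D) = 1 + D^2 + D^3, parity g1(D) = 1 + D + D^3) are not repeated in this paper, so they are taken here from the 3GPP TS 36.212 specification [5]; the function name is illustrative only.

    # Behavioral sketch of one LTE RSC constituent encoder (one of the two
    # encoders of Fig. 1), with polynomials from TS 36.212 [5]:
    # g0 = 1 + D^2 + D^3 (feedback), g1 = 1 + D + D^3 (parity output).

    def rsc_encode(bits):
        """Return (parity, tail) for a bit list, with trellis termination."""
        s = [0, 0, 0]                        # shift register contents
        parity = []
        for c in bits:
            fb = c ^ s[1] ^ s[2]             # feedback term of g0
            parity.append(fb ^ s[0] ^ s[2])  # parity bit per g1
            s = [fb, s[0], s[1]]
        tail = []                            # switch moved to position B
        for _ in range(3):
            c = s[1] ^ s[2]                  # input that zeroes the feedback
            tail.append((c, s[0] ^ s[2]))    # (tail systematic, tail parity)
            s = [0, s[0], s[1]]
        assert s == [0, 0, 0]                # encoder driven back to state 0
        return parity, tail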
As mentioned before, the novelty introduced by the LTE standard in terms of turbo coding is the interleaver module. The output bits are reorganized using
C'_i = C_π(i),   i = 1, 2, ..., K,   (1)

where the interleaving function π applied over the output index i is defined as

π(i) = (f1 · i + f2 · i²) mod K.   (2)

The input block length K and the parameters f1 and f2 are provided in Table 5.1.3-3 in [5].

Fig. 1. LTE CTC encoder.
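As an illustration, the following Python sketch evaluates (2) for one entry of that table (K = 40, f1 = 3, f2 = 10) and checks that π is indeed a permutation; indices are 0-based here, as in [5].

    # QPP interleaver of (1)-(2) for one (K, f1, f2) entry of Table 5.1.3-3.
    K, f1, f2 = 40, 3, 10

    def qpp(i):
        """pi(i) = (f1*i + f2*i^2) mod K, 0-based indexing."""
        return (f1 * i + f2 * i * i) % K

    perm = [qpp(i) for i in range(K)]
    assert sorted(perm) == list(range(K))    # pi is a permutation of 0..K-1

    C = list(range(K))                       # stand-in data block
    C_int = [C[qpp(i)] for i in range(K)]    # C'_i = C_pi(i), as in (1)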
parity bit, both associated to one branch. Also
Λ i ( Z k ) is the LLR for the input parity bit. When
3 Decoding Algorithm SISO 1 decoder is taken into discussion this input
The LTE turbo decoding scheme is depicted in LLR is Λ i ( Z k ) , while for SISO 2 it becomes
Fig. 2. The two Recursive Systematic Convolutional
(RSC) decoders use in theory the MAP algorithm. Λ i ( Z k' ) ; V(Xk)=V1(Xk) represents the sum between
This classic algorithm provides the best decoding Λ i ( X k ) and W(Xk) for SISO 1 and V(Xk)=V2(X’k)
performances, but it suffers from very high
implementation complexity and it can lead to large represents the interleaved version of the difference
dynamic range for its variables. For these reasons, between Λ1o ( X k ) and W(Xk) for SISO 2. In Fig. 2,
the MAP algorithm is used as a reference for W(Xk) is the extrinsic information, while Λ1o ( X k )
targeted decoding performances, while for real
implementation new sub-optimal algorithms have and Λ o2 ( X k' ) are the output LLRs generated by the
been studied: Logarithmic MAP (Log MAP) [15], two SISOs.
Maximum Log MAP (Max Log MAP), Constant In the LTE turbo encoder case, there are 4
Log MAP (Const Log MAP) [16], and Linear Log possible values for the branch metrics between 2
MAP (Lin Log MAP) [17]. states in the trellis:
For the proposed decoding scheme, the Max Log
MAP algorithm is selected. This algorithm reduces γ0 = 0
the implementation complexity and controls the
γ1 = V ( X k )
dynamic range problem with the cost of acceptable (5)
performances degradation, compared to classic γ 2 = Λi ( Z k )
MAP algorithm. The Max Log MAP algorithm γ 2 = V ( X k ) + Λi ( Z k ) .
keeps from Jacobi logarithm only the first term, i.e.,
The decoding process is based on going forward
and backward through the trellis.
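A minimal sketch of this observation, with V and Lz standing for V(Xk) and the input parity LLR Λi(Zk) of (4):

    # The 16 branch metrics of one trellis stage take only the four values
    # of (5), selected by the (data bit, parity bit) label of each branch.
    def branch_metrics(V, Lz):
        return {(0, 0): 0.0,       # gamma0
                (1, 0): V,         # gamma1
                (0, 1): Lz,        # gamma2
                (1, 1): V + Lz}    # gamma3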
Fig. 3. LTE turbo coder trellis.

3.1 Backward recursion
The trellis is covered backward and the computed metrics are stored in a normalized form at each node of the trellis. These stored values are used for the LLR computation during the trellis forward recursion. The backward metric for the state Si at the kth stage is βk(Si), where 2 ≤ k ≤ K+3 and 0 ≤ i ≤ 7. The backward recursion is initialized with βK+3(S0) = 0 and βK+3(Si) = −∞, ∀i > 0. Starting from the stage k = K+2 and continuing through the trellis until stage k = 2, the computed backward metrics are

β̂k(Si) = max{(βk+1(Sj1) + γ_i,j1), (βk+1(Sj2) + γ_i,j2)},   (6)

where β̂k(Si) represents the un-normalized metric and Sj1 and Sj2 are the two states from stage k+1 connected to the state Si from stage k. After the computation of the β̂k(S0) value, the rest of the backward metrics are normalized as

βk(Si) = β̂k(Si) − β̂k(S0)   (7)

and then stored in the dedicated memory.

3.2 Forward recursion
During the forward recursion, the trellis is covered in the normal direction, this process being similar to the one specific to the Viterbi algorithm. In order to allow the computation of the current stage (k) metrics, only the forward metrics from the previous stage (k − 1) have to be stored. The forward metric for the state Si at the stage k is αk(Si), with 0 ≤ k ≤ K and 0 ≤ i ≤ 7. The forward recursion is initialized with α0(S0) = 0 and α0(Si) = −∞, ∀i > 0. Starting from the stage k = 1 and continuing through the trellis until the last stage k = K, the un-normalized forward metrics are given by

α̂k(Sj) = max{(αk−1(Si1) + γ_i1,j), (αk−1(Si2) + γ_i2,j)},   (8)

where Si1 and Si2 are the two states from stage k − 1 connected to the state Sj from stage k. After the computation of the α̂k(S0) value, the rest of the forward metrics are normalized as

αk(Si) = α̂k(Si) − α̂k(S0).   (9)

Because the forward metrics α are computed for the stage k, the decoding algorithm can obtain at the same time an LLR estimate for the data bit Xk. This LLR is found by first considering that the likelihood of the connection between the state Si at stage k − 1 and the state Sj at stage k is

λk(i, j) = αk−1(Si) + γ_ij + βk(Sj).   (10)

The likelihood of having a bit equal to 1 (or 0) is given by the Jacobi logarithm of all the branch likelihoods corresponding to 1 (or 0), and thus

Λo(Xk) = max{λk(i, j) : (Si → Sj), Xi = 1} − max{λk(i, j) : (Si → Sj), Xi = 0},   (11)

where the "max" operator is recursively computed over the branches which have at the input a bit of 1, (Si → Sj) : Xi = 1, or a bit of 0, (Si → Sj) : Xi = 0.
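To make the recursions concrete, the sketch below is a floating-point Python model of (6)-(11) over the 8-state trellis; it is an illustration, not the fixed-point VHDL implementation, and the tail-bit stages and quantization are omitted. The trellis connectivity is rebuilt from the constituent encoder polynomials.

    # Compact floating-point sketch of the Max Log MAP recursions (6)-(11).
    NEG_INF = float("-inf")

    def rsc_next(state, bit):
        """Next state and parity bit of the 8-state RSC trellis of Fig. 3."""
        s0, s1, s2 = (state >> 2) & 1, (state >> 1) & 1, state & 1
        fb = bit ^ s1 ^ s2                 # feedback per g0 = 1 + D^2 + D^3
        parity = fb ^ s0 ^ s2              # output per g1 = 1 + D + D^3
        return (fb << 2) | (s0 << 1) | s1, parity

    def max_log_map(V, Lz):
        """Output LLRs (11) for input vectors V(Xk) and the parity LLRs."""
        K = len(V)
        # backward recursion (6)-(7), metrics stored for the forward pass
        beta = [[NEG_INF] * 8 for _ in range(K + 1)]
        beta[K] = [0.0] + [NEG_INF] * 7    # terminated trellis ends in S0
        for k in range(K - 1, -1, -1):
            for s in range(8):
                for bit in (0, 1):
                    ns, p = rsc_next(s, bit)
                    g = bit * V[k] + p * Lz[k]            # gamma of (4)-(5)
                    beta[k][s] = max(beta[k][s], g + beta[k + 1][ns])
            beta[k] = [b - beta[k][0] for b in beta[k]]   # normalization (7)
        # forward recursion (8)-(9) with LLR computation (10)-(11)
        alpha = [0.0] + [NEG_INF] * 7
        llr = []
        for k in range(K):
            new_alpha = [NEG_INF] * 8
            best = {0: NEG_INF, 1: NEG_INF}
            for s in range(8):
                for bit in (0, 1):
                    ns, p = rsc_next(s, bit)
                    g = bit * V[k] + p * Lz[k]
                    lam = alpha[s] + g + beta[k + 1][ns]  # lambda of (10)
                    best[bit] = max(best[bit], lam)
                    new_alpha[ns] = max(new_alpha[ns], alpha[s] + g)
            llr.append(best[1] - best[0])                 # LLR per (11)
            alpha = [a - new_alpha[0] for a in new_alpha] # normalization (9)
        return llr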
4 Proposed Decoding Scheme

4.1 Serial Decoder Block Scheme
From the theoretical decoding scheme depicted in Fig. 2 it can be noticed that the SISO 2 decoder starts working only after the SISO 1 decoder finishes its job, and vice-versa, the usage of previously obtained extrinsic values being the main principle of the turbo decoding.
Fig. 4. Proposed serial turbo decoder block scheme.
Also, all the processing is based on complete data blocks, since the interleaver or deinterleaver procedures should be applied in between. It results that the 2 SISOs are decoding data in non-overlapped time windows, so only one SISO unit can be used, processing in a time-multiplexed manner, as one can observe in Fig. 4, where a serial decoder block scheme based on the previous work presented in [18] for a WiMAX CTC decoder is described.

The memory blocks are used for storing data from one semi-iteration to another and from one iteration to another. The dotted-line memory blocks are virtual memories added only to ease the understanding of the introduced notations. Also, it should be mentioned that the Interleaver and Deinterleaver blocks are in fact the same, including a block memory called ILM (Interleaver Memory) and an interleaver. The ILM is the new approach introduced by the author compared with the previous serial implementation presented in [19], and its goal is to prepare the architecture for parallel decoding as well. The memory is written with the interleaved addresses each time a new data block is received. The values are then used as read addresses (when the interleaver process is ongoing) or as write addresses (when the deinterleaver process is ongoing). This ILM, together with the 3 memories from the left side of the picture (for the input data), are switched-buffers, allowing new data to be written while the previous block is still under decoding.

The scheme depicted in Fig. 4 works as follows. SISO 1 reads the memory locations corresponding to the V1(Xk) and Λi(Zk) vectors. The reading process is performed forward and backward and it serves the first semi-iteration. At the end of this process, SISO 2 reads forward and backward from the memory blocks corresponding to the V2(X'k) and Λi(Z'k) vectors in order to perform the second semi-iteration.

The vector V1(Xk) is obtained by adding the input vector Λi(Xk) to the extrinsic information vector W(Xk). While reading these 2 memories, SISO 1 starts the decoding process. At the output, the LLRs become available and, performing the subtraction between them and the delayed extrinsic values already read from the W(Xk) memory, the vector V2(Xk) is computed and then stored into its corresponding memory in normal order. The interleaving process is then started (the initially written ILM is now read in normal order, so that the interleaved read addresses for V2(Xk) are obtained) and the re-ordered LLRs V2(X'k) become available, the corresponding values for the 3 tail bits X'K+1, X'K+2, X'K+3 being added at the end of this sequence. The second semi-iteration is then ongoing. The same SISO unit is used, but this time it reads its data inputs from the other memory blocks. As one can see from Fig. 4, two switching mechanisms are included in the scheme. When in position 1, the memory blocks for V1(Xk) and Λi(Zk) are used, while in position 2 the memory blocks for V2(X'k) and Λi(Z'k) become active.

At the output of the SISO unit, after each semi-iteration, K LLRs are obtained. The ones corresponding to the second semi-iteration are stored in the Λo2(X'k) memory (the ILM output, which was already available for the V2(Xk) interleaving process, is used as the writing address for the Λo2(X'k) memory, after a delay is added). Reading the Λo2(X'k) memory and also the V2(Xk) memory in normal order provides the inputs for the W(Xk) memory and at the same time allows a new semi-iteration to start for SISO 1. So the W(Xk) memory update is made at the same time as a new semi-iteration starts.
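At the vector level, one full iteration of this scheme can be sketched as follows, reusing the qpp and max_log_map Python sketches from above; memories, switched buffers, and tail bits are abstracted away, so this only illustrates the dataflow of Fig. 4.

    def turbo_iteration(Lx, Lz, Lz_int, W, pi):
        """One turbo iteration; pi is the QPP permutation (the ILM content),
        e.g. pi = [qpp(i) for i in range(K)], written once per block."""
        K = len(Lx)
        V1 = [Lx[k] + W[k] for k in range(K)]      # V1(Xk) = Li(Xk) + W(Xk)
        L1 = max_log_map(V1, Lz)                   # first semi-iteration, Lo1
        V2 = [L1[k] - W[k] for k in range(K)]      # stored in normal order
        V2_int = [V2[pi[k]] for k in range(K)]     # interleaved read via ILM
        L2_int = max_log_map(V2_int, Lz_int)       # second semi-iteration, Lo2
        L2 = [0.0] * K
        for k in range(K):
            L2[pi[k]] = L2_int[k]                  # deinterleave: ILM as write address
        W_new = [L2[k] - V2[k] for k in range(K)]  # extrinsic update W(Xk)
        return W_new, L2                           # sign of L2 gives the hard decision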
Fig. 5 depicts a time diagram for the serial turbo decoding; the gray colored intervals describe the W(Xk) memory writing. One can observe that the upper 4 memories in the picture are switched-buffers, so they are written while the previous data block is still processed. In the picture, R stands for Read, W represents Write, (K−1:0) is the backward trellis run, (0:K−1) is the forward trellis run, and IL means interleaved read (for the interleaver process) or write (for the deinterleaver process).

Fig. 5. Time diagram for serial turbo decoder.
In order to be able to handle all the data block dimensions, the used memory blocks have 6144 locations (this is the maximum data block length), except the ones storing the input data for the RSCs, which have 6144 + 3 locations, including here also the tail bits. Each memory location is 10 bits wide, the first bit being used for the sign, the next 6 bits representing the integer part, and the last 3 bits indicating the fractional part. This format was decided by studying the dynamic range of the variables (for the integer part) and the variations of the decoding performances (for the fractional part).
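A sketch of this quantization step is given below; the saturation behavior and the two's complement reading of the sign bit are assumptions, since the paper only specifies the bit allocation.

    # Sketch of the 10-bit fixed-point format (1 sign bit, 6 integer bits,
    # 3 fractional bits), assuming two's complement saturation arithmetic.
    SCALE = 1 << 3                           # 3 fractional bits
    Q_MIN, Q_MAX = -(1 << 9), (1 << 9) - 1   # 10-bit two's complement range

    def quantize(x):
        """Map a real LLR to the 10-bit grid, saturating at the limits."""
        q = round(x * SCALE)
        return max(Q_MIN, min(Q_MAX, q))

    def dequantize(q):
        return q / SCALE

    assert dequantize(quantize(5.3)) == 5.25  # resolution is 1/8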
The constituent modules of the SISO block are the ones presented in Fig. 6. One can notice the un-normalized metric computing blocks ALPHA (forward) and BETA (backward), and the transition metric computing block GAMMA, which in addition includes the normalization function (subtracting the metric of the first state from all the other metrics). The L block computes the output LLRs, which are normalized by the NORM block. The MUX-MAX block selects the inputs corresponding to the forward or backward recursion and computes the maximum function. The MEM BETA block stores the backward metrics, which are computed before the forward metrics. The metric normalization is required to preserve the dynamic range.

Fig. 6. Proposed SISO block scheme.

Without normalization, the forward and backward metric width would have to be wider in order to avoid saturation, which means more memory blocks, more complex arithmetic (i.e., more used resources), and a lower frequency (as an overall consequence). Hence, reducing the logic levels by eliminating the normalizing procedure does not increase the system performances.

The ALPHA, BETA, and GAMMA blocks are implemented in a dedicated way. Each metric corresponding to each state is computed separately, not using the same function with different input parameters. Consequently, 16 equations would be needed for the transition metric computation (2 possible transitions for each of the 8 states of a stage). In fact, only 4 equations are needed [as indicated in (5)]; moreover, from these 4 equations one leads to a zero value, so the computational effort is minimized for this implementation solution.

The interleaver module is used both for interleaving and deinterleaving. The interleaved index is obtained based on a modified form of (2), i.e.,

π(i) = {[(f1 + f2 · i) mod K] · i} mod K.   (12)

In order to obtain both functions, either the input data is stored in the memory in natural order and then read in interleaved order, or the input data is stored in interleaved order and then read in natural order.

The interleaved index computation is performed in three steps. First, the value of (f1 + f2 · i) mod K is computed. This partial result is multiplied by the natural order index i and then a new modulo K function is applied. In the first stage of this process, the remark that the term is increased by f2 for consecutive values of the index i is used. This way, a register value is increased by f2 at each new index i. If the resulting value is bigger than K, the value of K is subtracted from the register value. This processing is one clock period long, this being the reason why the data can be generated in a continuous manner.
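The following behavioral sketch mirrors this incremental scheme; only the first modulo is replaced by the conditional subtraction described above, the second reduction being written here with the % operator since its hardware form is not detailed in the paper.

    def interleaved_addresses(K, f1, f2):
        """Generate pi(i) of (12) for i = 0..K-1 without a division in the
        first step: g(i) = (f1 + f2*i) mod K grows by f2 per index."""
        g = f1 % K                       # g(0) = f1 mod K
        for i in range(K):
            yield (g * i) % K            # pi(i) = [g(i) * i] mod K
            g += f2                      # g(i+1) = g(i) + f2 ...
            if g >= K:                   # ... reduced by one subtraction,
                g -= K                   #     since f2 < K in [5]

    assert list(interleaved_addresses(40, 3, 10)) == \
        [(3 * i + 10 * i * i) % 40 for i in range(40)]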
4.2 Parallel Decoder Block Scheme
The proposed parallel architecture is similar to the serial one described in Fig. 4, only that the RSC SISO module is instantiated N times in the scheme. We propose an architecture that concatenates the N values from the N RSCs and always points at the same memory location, for all the memories in the scheme. So instead of having K locations with 10 bits per location as in the serial architecture, in the parallel one each memory contains K/N locations with 10N bits per location.

The main advantage introduced by the proposed serial architecture is the fact that the interleaver block works only once, before the decoding itself takes place. The ILM memory is written when a new data block is received, while the previous one is still under decoding. This approach allows a simplified way of working for the parallel scheme. Knowing the parallelization factor N, the ILM memory can be prepared for the parallel processing that follows. More precisely, the ILM memory will have K/N locations, N values being written at each location. As mentioned in [20], a Virtex 5 block memory can be configured from (32k locations x 1 bit) to (512 locations x 72 bits). In the worst case scenario, when K = 6144, based on the N values and keeping the stored values on 10 bits as previously mentioned, the parallel ILM memory can be (768 locations x 80 bits), (1536 locations x 40 bits), (3072 locations x 20 bits), or (6144 locations x 10 bits), so still only 2 BRAMs are used, as in the case of the serial ILM.

Fig. 7 describes the way the ILM works. As one can observe, during the writing procedure, each index i from 0 to K − 1 generates a corresponding interleaved value. These interleaved values are written in normal order in the ILM. The first K/N interleaved values occupy the first position of each memory location. The second K/N values are placed on the second position of each location, and so on. In order to perform this procedure, a true dual port BRAM is used. Each time a new position of location n is written, the content of location n+1 is also read from the memory, so that on the next clock period the next interleaved value can be added to the already existing content of that location.

Fig. 7. ILM memory writing procedure.

When the interleaver function is needed during a semi-iteration, the ILM is read in a normal way, so that the N interleaved values from one location represent the reading addresses for the V2(Xk) memory. The QPP properties guarantee that the N values that should be read in the interleaved way from the memory are placed at the same memory location, only that their positions should be re-arranged before being sent to the corresponding RSCs. For simplifying the representation, the case of K = 40 and N = 8 is exemplified in Fig. 8. On the left one can see the content of the V2(Xk) memory. Each column represents the outputs of one of the N RSC SISOs. On the right there is described the content of the ILM memory. The minimum value from each line of the ILM (grey colored circle in the figure) represents the line address for the V2(Xk) memory. Then, using a re-ordering module implemented with multiplexers and de-multiplexers, each position from the read line is sent to its corresponding SISO. For example, position b from the first read line (index 5) is sent to SISO f, while position b from the second read line (index 8) is sent to SISO d. The same procedure applies also for the deinterleaver process, only that the write addresses are extracted from the ILM, while the read ones are in normal order.

Fig. 8. Virtual parallel interleaver.
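The QPP property behind this single-location access can be checked directly: with M = K/N, the addresses π(i), π(i+M), ..., π(i+(N−1)M) stored at ILM location i are all congruent modulo M, so they select one common line of the folded V2(Xk) memory and N distinct columns. A minimal Python check for the K = 40, N = 8 case of Fig. 8:

    K, N, f1, f2 = 40, 8, 3, 10
    M = K // N

    def qpp(i):
        return (f1 * i + f2 * i * i) % K

    for i in range(M):
        addrs = [qpp(i + t * M) for t in range(N)]
        rows = {a % M for a in addrs}          # memory line of each address
        cols = [a // M for a in addrs]         # memory column (SISO index)
        assert len(rows) == 1                  # one common line per ILM location
        assert sorted(cols) == list(range(N))  # every column hit exactly once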
From the timing point of view, Fig. 9 depicts the case when N = 2 is used. The same comments as the ones for Fig. 5 apply.

Fig. 9. Time diagram for parallel turbo decoder (N=2).


5 Implementation Results
From Fig. 5 and 9 it can be observed that the decoding latency is reduced in the case of parallel decoding by almost a factor equal to N. There is a certain Delay, which in this implementation case is 11 clock periods, that adds to each forward trellis run, when the LLRs are computed, so 2 such values are introduced at each iteration.

The native latency for serial decoding is computed as follows: K clock periods are needed for the backward trellis run of the first semi-iteration, and another K clock periods plus Delay for the forward trellis run and LLR computation; this is multiplied by 2 for the second semi-iteration. Considering L the number of executed iterations, it results a total latency in clock periods for each block serial decoding of

Latency_s = (4K + 2·Delay)·L,   (13)

while for the parallel decoding the needed number of clock periods is

Latency_p = (4K/N + 2·Delay)·L.   (14)

Testing the parallel decoding performances, a certain level of degradation was observed, since the forward and backward metrics are altered at the data block extremities. In order to obtain results similar to the serial decoding case, a small overhead is accepted. If at each parallel block border an overlap is added, the metrics computation will have a training phase. The minimum overlap window may be as long as the minimum standard defined data block, in this case Kmin = 40 bits. Fig. 10 describes this situation, for the N = 2 case.

Fig. 10. a) non-overlapping split; b) overlapping split.

The corresponding latency is in this case, considering N > 2, which leads to blocks with Kmin at both the left and right sides:

Latency_po = (4(K/N + 2·Kmin) + 2·Delay)·L.   (15)
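A sketch of the resulting window layout is given below; the exact clipping at the block edges is an assumption, the paper only stating that an overlap of at least Kmin is added at each border.

    K_MIN = 40   # minimum standard data block, used as overlap length

    def windows(K, N, kmin=K_MIN):
        """Per-SISO window: each K/N-bit slice is extended by a kmin-bit
        training region on each side; only the middle outputs are kept.
        Returns (start, stop, keep_from, keep_to) for each window."""
        M = K // N
        out = []
        for c in range(N):
            start, stop = max(0, c * M - kmin), min(K, (c + 1) * M + kmin)
            out.append((start, stop, c * M, (c + 1) * M))
        return out

    # e.g. for K = 320, N = 4: window 1 spans [40, 200) and keeps [80, 160)
    print(windows(320, 4))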
In order to evaluate the performances, the used hardware programming language is the Very High Speed Hardware Description Language (VHDL). For the generation of the RAM/ROM memory blocks, Xilinx Core Generator 11.1 was used. The simulations were performed with ModelSIM 6.5. The synthesis process was done using Xilinx XST from Xilinx ISE 11.1. Using these tools, the obtained system frequency when implementing the decoding structure on a Xilinx XC5VFX70T-FFG1136 chip is around 210 MHz.

The values included in Table 1 are computed based on (13), (14), and (15) for the N = 8 case. It can be noticed that the overhead introduced by the overlapped split is less significant once the value of K increases, which represents the scenario where parallel decoding is usually used.

Table 1. Latency values for N = 8, L = 3 or 4, and K = 1536, 4096, or 6144.

           Latency_s [us]    Latency_p [us]    Latency_po [us]
    K      L=3     L=4       L=3     L=4       L=3     L=4
    1536   88.08   117.4     11.28   15.04     15.85   21.14
    4096   234.3   312.5     29.57   39.42     34.14   45.52
    6144   351.4   468.5     44.2    58.9      48.7    56.02
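The table values can be reproduced (up to rounding) directly from (13)-(15) and the 210 MHz system frequency; the short Python script below also derives the throughput figures listed in Table 2 next, as K bits per decoded block divided by the block latency.

    F_CLK = 210e6          # synthesis result reported above
    DELAY = 11             # LLR pipeline delay, in clock periods
    N, K_MIN = 8, 40

    def latencies_us(K, L):
        """Latency_s, Latency_p, Latency_po of (13)-(15), in microseconds."""
        s  = (4 * K + 2 * DELAY) * L
        p  = (4 * K // N + 2 * DELAY) * L
        po = (4 * (K // N + 2 * K_MIN) + 2 * DELAY) * L
        return tuple(c / F_CLK * 1e6 for c in (s, p, po))

    for K in (1536, 4096, 6144):
        for L in (3, 4):
            lat = latencies_us(K, L)
            tput = tuple(K / t for t in lat)   # Mbps, since us * Mbps = bits
            print(K, L,
                  tuple(round(t, 2) for t in lat),
                  tuple(round(x, 2) for x in tput))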
Table 2 provides the corresponding throughput rates when the values from Table 1 are used.

Table 2. Throughput values for N = 8, L = 3 or 4, and K = 1536, 4096, or 6144.

           Tput_s [Mbps]    Tput_p [Mbps]    Tput_po [Mbps]
    K      L=3     L=4      L=3     L=4      L=3     L=4
    1536   17.43   13.07    136.1   102.0    96.86   72.64
    4096   17.47   13.10    138.5   103.8    119.9   89.9
    6144   17.48   13.11    139     104.2    125.9   94.4

As one can observe from Table 2, the serial decoding performance is close to the theoretical one. Let us consider for example the case K = 6144 and L = 3. The native theoretical latency is 4KL clock periods, which leads to a theoretical throughput of 17.5 Mbps, while the obtained result for the proposed serial implementation is 17.48 Mbps.

The following performance curves were obtained using a finite precision Matlab simulator. This approach was selected because the Matlab simulator produces exactly the same outputs as the ModelSIM simulator, while the simulation time is smaller. All the simulation results use the Max Log MAP algorithm. All pictures describe the Bit Error Rate (BER) versus the Signal-to-Noise Ratio (SNR), expressed as the ratio between the energy per bit and the noise power spectral density.

Fig. 11 depicts the obtained results when a block of length K = 512 was decoded in a serial manner, in a parallel manner without overlapping, and in a parallel manner with overlapping. For this scenario N = 2, QPSK modulation was used, and L = 3. Fig. 12 presents the same type of results, for the case of K = 1024 and N = 4.

As one can observe from Fig. 11 and 12, the parallel decoding with overlap produces the same results as the serial decoding. On the other hand, the parallel decoding without overlap introduces a certain level of degradation compared with the serial decoding, the loss in terms of performances being dependent on the value of N.

Fig. 11. Comparative decoding results for QPSK, L = 3, K = 512, N = 2.

Fig. 12. Comparative decoding results for QPSK, L = 3, K = 1024, N = 4.


6 Conclusions
The most important aspects regarding the FPGA implementation of a turbo decoder for LTE-A systems were presented in this paper. The serial turbo decoder architecture was developed and implemented in an efficient manner, especially from the interleaver/deinterleaver processes point of view. The interleaver memory ILM was introduced so that the interleaver process works effectively only outside the decoding process itself. The ILM is written together with the input data, while the previous block is still under decoding. This approach allowed the transfer to the parallel architecture in a simplified way, using only concatenated values at the same memory locations. The parallel architecture uses the same number of block memories and only one interleaver, at the cost of some multiplexing/demultiplexing structures.

The parallel decoding performances were compared with the serial ones and a certain degradation was observed. To eliminate this degradation, a small overhead was accepted through the overlapping split applied to the parallel data blocks.
Acknowledgment
The work has been funded by the Sectoral Operational Programme Human Resources Development 2007-2013 of the Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/134398.

References:
[1] C. Berrou, A. Glavieux, and P. Thitimajshima, Near Shannon limit error-correcting coding and decoding: Turbo Codes, Proc. IEEE Int. Conf. on Communications, Geneva, Switzerland, May 1993, pp. 1064-1070.
[2] C. Berrou and A. Glavieux, Near optimum error correcting coding and decoding: Turbo-Codes, IEEE Trans. Communications, vol. 44, no. 10, pp. 1261-1271, Oct. 1996.
[3] C. Berrou and M. Jézéquel, Non binary convolutional codes for turbo coding, Electronics Letters, vol. 35, no. 1, pp. 39-40, Jan. 1999.
[4] Third Generation Partnership Project, 3GPP home page, www.3gpp.org.
[5] 3GPP TS 36.212 V8.7.0 (2009-05) Technical Specification, "3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Evolved Universal Terrestrial Radio Access (E-UTRA); Multiplexing and channel coding (Release 8)."
[6] F. Khan, LTE for 4G Mobile Broadband, Cambridge University Press, New York, 2009.
[7] M. C. Valenti and J. Sun, The UMTS turbo code and an efficient decoder implementation suitable for software-defined radios, International Journal of Wireless Information Networks, vol. 8, no. 4, Oct. 2001.
[8] M. Sanad and N. Hassan, Novel wideband MIMO antennas that can cover the whole LTE spectrum in handsets and portable computers, The Scientific World Journal, vol. 2014, art. ID 694805, 9 pages, 2014.
[9] Suchang Chae, A low complexity parallel architecture of turbo decoder based on QPP interleaver for 3GPP-LTE/LTE-A, http://www.design-reuse.com/articles/31907/turbo-decoder-architecture-qpp-interleaver-3gpp-lte-lte-a.html
[10] Y. Sun and J. R. Cavallaro, Efficient hardware implementation of a highly-parallel 3GPP LTE/LTE-advance turbo decoder, Integration, the VLSI Journal, vol. 44, issue 4, pp. 305-315, Sept. 2011.
[11] Di Wu, R. Asghar, Yulin Huang, and D. Liu, Implementation of a high-speed parallel turbo decoder for 3GPP LTE terminals, ASICON '09, IEEE 8th International Conference on ASIC, pp. 481-484, 2009.
[12] C. Studer, C. Benkeser, S. Belfanti, and Qiuting Huang, Design and implementation of a parallel turbo-decoder ASIC for 3GPP-LTE, IEEE Journal of Solid-State Circuits, vol. 46, issue 1, pp. 8-17, Jan. 2011.
[13] "Xilinx Virtex 5 family user guide," www.xilinx.com.
[14] "Xilinx ML507 evaluation platform user guide," www.xilinx.com.
[15] P. Robertson, E. Villebrun, and P. Hoeher, A comparison of optimal and sub-optimal MAP decoding algorithms operating in the log domain, Proc. IEEE International Conference on Communications (ICC '95), Seattle, pp. 1009-1013, June 1995.
[16] S. Papaharalabos, P. Sweeney, and B. G. Evans, Constant log-MAP decoding algorithm for duo-binary turbo codes, Electronics Letters, vol. 42, issue 12, pp. 709-710, June 2006.
[17] Jung-Fu Cheng and T. Ottosson, Linearly approximated log-MAP algorithms for turbo decoding, Proc. IEEE 51st Vehicular Technology Conference (VTC 2000-Spring), Tokyo, vol. 3, pp. 2252-2256, 2000.
[18] C. Anghel, A. A. Enescu, C. Paleologu, and S. Ciochina, CTC turbo decoding architecture for H-ARQ capable WiMAX systems implemented on FPGA, Ninth International Conference on Networks (ICN 2010), Menuires, France, April 2010.
[19] C. Anghel, V. Stanciu, C. Stanciu, and C. Paleologu, CTC turbo decoding architecture for LTE systems implemented on FPGA, IARIA ICN 2012, Reunion, France, 2012.
[20] "Virtex 5 Family Overview," Feb. 2009, www.xilinx.com.