Design Analysis of Turbo Decoder Based on One MAP Decoder Using High Level Synthesis Tool

A High Level Synthesis (HLS) tool not only simplifies the design process and enables rapid prototyping but also allows designers to explore a large number of design techniques such as parallelism, pipelining, memory partitioning and many others. In this work, a turbo decoder based on the Maximum A Posteriori Probability (MAP) algorithm is designed using Vivado HLS. The normal turbo decoder with two MAP decoders was implemented with and without parallelism. A new turbo decoder design with one MAP decoder is proposed and likewise implemented with and without parallelism, using different window techniques in the HLS tool, which has not been explored previously. These designs were implemented for different frame sizes. Latency and resource utilization are compared, and the trade-off between these two parameters needed to reach a specific target design is discussed. The new design uses fewer resources and, in the fully parallel case, achieves lower maximum latency.

Forward error correction (FEC) has been introduced in many areas of wireless and space communications, such as 3rd Generation Partnership Project Long Term Evolution (3GPP LTE), Global System for Mobile communications (GSM), IEEE Standard P802.16, also known as Worldwide Interoperability for Microwave Access (WiMAX), Digital Video Broadcasting Satellite Services to Handheld (DVB-SH), Universal Mobile Telecommunication System (UMTS) and other applications [3]. FEC has a long history, so there are many types of codes, each with different characteristics in capability, purpose and mathematical computation: the repetition code, the parity bit, the Hamming code, Hadamard codes, Golay codes, Reed-Solomon codes, convolutional codes, low-density parity-check (LDPC) codes, turbo codes and many others [15]. The turbo code is one of these FEC types. The development and improvement of the turbo encoder and turbo decoder is a wide area of research, and there have been many attempts to implement the turbo code. Many studies used the Very High-Speed Integrated Circuit Hardware Description Language (VHDL). Recently, the trend has turned to implementations using High Level Synthesis (HLS) tools, which allow rapid prototyping and production of hardware designs at the Register Transfer Level (RTL) with short development cycles. HLS is based on widespread high-level languages (HLL) such as C/C++, which enables the user to build complex hardware designs [10]. The authors of [1] used a parallelism level of 64 for frame sizes of 2048-6144 and a parallelism level of 8 for frame sizes of 256-2048, together with an interleaver of their own design, and concluded from the performance and latency results that turbo codes could be used in future terrestrial broadcasting (TB) systems.
In [2], turbo, LDPC and polar codes were compared; the results showed that the turbo code outperformed the others in error correction and in flexibility with respect to block sizes and code rates. In [3], the bit error rate (BER) performance was investigated with multiple levels of parallelism and small frame sizes, and it was concluded that the self-concatenated convolutional code achieves better BER performance than the normal turbo code. In [4], HLS tools were evaluated as a means of producing prototypes and shortening the development cycles needed to produce device designs at the RTL. An LDPC codec was used, and the results showed that codecs designed with HLS tools, either as pipelined or as dataflow designs, can match the productivity of existing RTL designs. In the master's thesis by Conn [5], three types of turbo decoders were designed and implemented with different architectures using an HLS tool, targeting software defined radios (SDR). In [6], 46 articles were surveyed with respect to quality of results and design effort, comparing designs implemented with HLS tools against RTL designs. The authors found that in 40% of the cases studied, HLS tools equaled or outperformed RTL designs in terms of performance and resource usage. They also studied whether the size of the design affects the quality of results and concluded that HLS tools are suitable for both large and small designs. In [7], an HLS tool was used to implement turbo decoder algorithms, exploring HLS optimization with different directives to produce several turbo decoder architectures.
This work designs a normal turbo decoder with two decoders operating iteratively, written in C++, and makes it fully parallel with the unroll directive of the Vivado HLS tool. It is compared with a proposed turbo decoder that uses one decoder iteratively and is made fully parallel with the same directive. Both decoders decode three frame sizes of 108, 216 and 432 received bits with two different windows: 9 decoded bits from 27 received bits, and 36 decoded bits from 108 received bits. The designs are compared in latency and resource utilization.

THE THEORETICAL BASES
This section describes the turbo code, one of the forward error correction codes, which consists of two stages: the turbo encoder and the turbo decoder. It then introduces the HLS tool.

Turbo encoder
The turbo encoder is formed by two recursive systematic convolutional (RSC) encoders in parallel concatenation, separated by an interleaver. There is much research on the RSC encoder and how to design it; in this paper, two memory elements were used with recursive feedback, with the generator polynomial given in [16]. The main purpose of the interleaver is to break up burst errors, which helps error correction. This paper uses a block interleaver, which resembles a two-dimensional array: the input bits are written into the array bit by bit along the rows and then read out bit by bit along the columns. At the end of encoding, each input bit has been encoded into three bits: the systematic input bit and two parity bits. Fig. 2 shows the components of the turbo encoder.
Al-Rafidain Engineering Journal (AREJ) Vol. 25, No.1, June 2020, pp. 70-77
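The row-write, column-read behavior of the block interleaver can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the function names, the template parameters ROWS and COLS, and the use of int for the bits are assumptions. Fixed-size arrays are used because the HLS tool requires memory sizes to be known at compile time.

```cpp
// Block interleaver sketch: conceptually write the input row by row into a
// ROWS x COLS array, then read it out column by column.
// ROWS, COLS and the function names are illustrative assumptions.
template <int ROWS, int COLS>
void block_interleave(const int in[ROWS * COLS], int out[ROWS * COLS]) {
    int i = 0;
    for (int c = 0; c < COLS; ++c) {
        for (int r = 0; r < ROWS; ++r) {
            out[i++] = in[r * COLS + c];  // element at (row r, column c)
        }
    }
}

// Inverse operation: put each element back at its row-major position.
template <int ROWS, int COLS>
void block_deinterleave(const int in[ROWS * COLS], int out[ROWS * COLS]) {
    int i = 0;
    for (int c = 0; c < COLS; ++c) {
        for (int r = 0; r < ROWS; ++r) {
            out[r * COLS + c] = in[i++];
        }
    }
}
```

For example, with 2 rows and 3 columns the input order 0, 1, 2, 3, 4, 5 is read out as 0, 3, 1, 4, 2, 5, so bits that were adjacent at the input are separated at the output.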

Turbo decoder
All equations in this section are taken from [5]. The second stage of the turbo code is the turbo decoder, which is more complex than the turbo encoder. The decoding technique is based on the trellis method for soft-input soft-output (SISO) algorithms. Fig. 3 shows the turbo decoder, where SISO 1 and SISO 2 are the soft-input soft-output decoding algorithms. The hard decision maker decides whether a bit is zero or one depending on the value of the decoder output: if the value is positive the bit is one, and if the value is negative the bit is zero. This decision is taken after the message has been decoded.
The turbo decoder works as follows. The SISO 1 decoder decodes the message and passes its opinion, which depends on the decoding algorithm, to the SISO 2 decoder. This helps SISO 2 perform its own decoding, after which it passes its opinion back to SISO 1, and so on iteratively for a suitable number of iterations. This opinion is named extrinsic information.
One of the SISO decoding algorithms is the maximum a posteriori (MAP) algorithm, also called the BCJR algorithm [2]. The MAP algorithm calculates the logarithm of the ratio of the probability that a bit is one to the probability that it is zero; this ratio is named the log-likelihood ratio (LLR). The LLR is computed by the equation below:

$$L(u_k \mid y) = \ln \frac{P(u_k = 1 \mid y)}{P(u_k = 0 \mid y)} \tag{1}$$

where P is the probability, u_k is the bit to be decoded and y is the bit sequence received from the channel.
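The LLR definition and the hard-decision rule described above can be illustrated with a short sketch. The function names are hypothetical, and mapping an LLR of exactly zero to bit 0 is an arbitrary tie-break chosen here, since the text only covers the positive and negative cases.

```cpp
#include <cmath>

// LLR of a bit from the probability p1 that the bit is one:
// ln(P(bit = 1) / P(bit = 0)), with P(bit = 0) = 1 - p1.
// Assumes 0 < p1 < 1.
inline double llr_from_probability(double p1) {
    return std::log(p1 / (1.0 - p1));
}

// Hard decision on the decoder output: positive LLR gives 1, negative gives 0.
// An LLR of exactly zero is mapped to 0 (tie-break assumed for this sketch).
inline int hard_decision(double llr) {
    return llr > 0.0 ? 1 : 0;
}
```

A bit that is more likely one than zero has a positive LLR and is decided as one; the magnitude of the LLR expresses how reliable that decision is.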
After several steps of mathematical analysis and simplification, the equation becomes:

$$L(u_k \mid y) = \ln \frac{\sum_{(s',s):\,u_k=1} \alpha_{k-1}(s')\,\gamma_k(s',s)\,\beta_k(s)}{\sum_{(s',s):\,u_k=0} \alpha_{k-1}(s')\,\gamma_k(s',s)\,\beta_k(s)} \tag{2}$$

The numerator sums the products of the alpha, gamma and beta values over the trellis transitions from one state to another for which the bit equals one. The denominator is the same sum for the transitions for which the bit equals zero. α_k is called the forward recursion and is computed by the next equation:

$$\alpha_k(s) = \sum_{s'} \alpha_{k-1}(s')\,\gamma_k(s',s) \tag{3}$$

where α_k(s) is the alpha value for the next state, α_{k-1}(s') is the alpha value for the current state and γ_k(s',s) is the branch metric for branching from the current state to the next state. The alpha values are calculated from the beginning to the end of the trellis. The initial alpha value is one for state zero and zero for the other states. β_k is called the backward recursion and is computed by the equation below:

$$\beta_{k-1}(s') = \sum_{s} \beta_k(s)\,\gamma_k(s',s) \tag{4}$$

where β_k(s) is the beta value for the next state, β_{k-1}(s') is the beta value for the current state and γ_k(s',s) is the branch metric for branching from the current state to the next state. The beta values are calculated from the end to the beginning of the trellis. The initial value of the last beta is one for state zero and zero for the others. Finally, γ_k is the branch metric for branching from one state to another, and it is computed by the following equation:

$$\gamma_k(s',s) = C_k \exp\!\left(\frac{u_k\,L(u_k)}{2}\right) \exp\!\left(\frac{L_c}{2}\sum_{l=1}^{n} y_{kl}\,x_{kl}\right) \tag{5}$$

where C_k is a constant that can be ignored because it cancels in the division, L_c is the channel reliability, y_{kl} are the bits received from the channel for the systematic and parity bits, x_{kl} are the corresponding coded bits on the trellis branch, u_k represents the trellis input, L(u_k) represents the extrinsic information that came from the previous decoder, and n equals two bits: the systematic bit and the parity bit. At the end, the new extrinsic information is obtained by subtracting from the decoder output both the extrinsic information received from the previous decoder and the received channel value of the bit.
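The forward and backward recursions described above can be sketched as two short loops. This is an illustration under several assumptions that are not taken from the paper: a four-state trellis (two memory elements), a dense table gamma[k][sp][s] of branch metrics that holds zero where the trellis has no branch from state sp to state s, and plain probability-domain arithmetic without the normalization or log-domain computation that a practical decoder would normally add. The names NUM_STATES, forward_recursion and backward_recursion are hypothetical.

```cpp
constexpr int NUM_STATES = 4;  // two memory elements give 2^2 trellis states

// Forward recursion over K trellis steps:
//   alpha[k][s] = sum over sp of alpha[k-1][sp] * gamma[k-1][sp][s]
// gamma[k-1][sp][s] is the branch metric of step k from state sp to state s.
template <int K>
void forward_recursion(double gamma[K][NUM_STATES][NUM_STATES],
                       double alpha[K + 1][NUM_STATES]) {
    // Initial condition: one for state zero, zero for the other states.
    for (int s = 0; s < NUM_STATES; ++s) {
        alpha[0][s] = (s == 0) ? 1.0 : 0.0;
    }
    for (int k = 1; k <= K; ++k) {
        for (int s = 0; s < NUM_STATES; ++s) {
            double acc = 0.0;
            for (int sp = 0; sp < NUM_STATES; ++sp) {
                acc += alpha[k - 1][sp] * gamma[k - 1][sp][s];
            }
            alpha[k][s] = acc;
        }
    }
}

// Backward recursion over K trellis steps:
//   beta[k-1][sp] = sum over s of beta[k][s] * gamma[k-1][sp][s]
template <int K>
void backward_recursion(double gamma[K][NUM_STATES][NUM_STATES],
                        double beta[K + 1][NUM_STATES]) {
    // Initial condition at the end of the trellis: one for state zero.
    for (int s = 0; s < NUM_STATES; ++s) {
        beta[K][s] = (s == 0) ? 1.0 : 0.0;
    }
    for (int k = K; k >= 1; --k) {
        for (int sp = 0; sp < NUM_STATES; ++sp) {
            double acc = 0.0;
            for (int s = 0; s < NUM_STATES; ++s) {
                acc += beta[k][s] * gamma[k - 1][sp][s];
            }
            beta[k - 1][sp] = acc;
        }
    }
}
```

Both loops have fixed bounds, so the trip counts are known at compile time, which is what allows an HLS tool to unroll them.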

Vivado HLS
When designing with HLS, an algorithm written in an HLL such as C/C++ can be converted to RTL automatically, but several points must be dealt with. First, the tool has limitations: for example, all memories must be static and the compiler must know the memory size before compilation. For this reason, neither heap nor stack memory is used in the program [8]. Another point is how the program is written and how the tool translates it into hardware. For example, when the following statements are written in C: if (X==Val1) X=Function(A); if (Y==Val2) Y=Function(B); the hardware of (Function) is duplicated. But if they are written as: if (X==Val1) X=Function(A); else if (Y==Val2) Y=Function(B); the hardware is not duplicated, which benefits the resource utilization. Directives, also known as pragmas, are another point that needs attention. The hardware design cannot be optimized by using the HLL alone, so the HLS tool offers directives as a further means of controlling how the hardware is implemented. There are many directives for different design techniques such as parallelism, pipelining, memory partitioning and many others [8]. Only the directives used in this paper are described here:
• Loop unrolling: when this directive is used, the hardware of the loop body is replicated. This allows parallelism and is effective in reducing latency, but the cost is that more resources are used [8].
• Array partitioning: this directive is used to reduce data access conflicts by splitting the array into smaller block RAMs. Vivado HLS provides three types of partitioning: block, cyclic and complete. With the complete type, the array is split into individual elements (registers) and fully parallel access becomes possible, but at a cost in resource utilization [8].
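The two directives described above can be combined as in the following sketch. The function scale_array, the array length N and the element-wise operation are illustrative assumptions and not part of the designs in this paper; the pragmas use the Vivado HLS UNROLL and ARRAY_PARTITION syntax. A standard C++ compiler ignores the pragmas, so the function can still be tested in software.

```cpp
constexpr int N = 8;  // illustrative array length

// Multiply every element by a constant factor.
// With the loop fully unrolled and both arrays completely partitioned into
// registers, the HLS tool is free to build N multipliers working in parallel,
// trading resources for latency.
void scale_array(const double in[N], double out[N], double factor) {
#pragma HLS ARRAY_PARTITION variable=in complete dim=1
#pragma HLS ARRAY_PARTITION variable=out complete dim=1
    for (int i = 0; i < N; ++i) {
#pragma HLS UNROLL
        out[i] = in[i] * factor;
    }
}
```

Without the partitioning pragmas, unrolling alone may bring little benefit, because a block RAM offers only a limited number of ports and the replicated loop bodies would have to wait for memory access.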

THE PROPOSED METHODOLOGY
The MAP decoding algorithm proceeds in the following steps:
• Calculate the gamma values and, at the same time, the alpha values from the beginning of the trellis to its end.
• Calculate the beta values, the decoder output and the extrinsic information at the same time, from the end of the trellis to its beginning.
• Repeat the first two steps for 8 iterations.
The algorithm is implemented in C++ using the Vivado HLS tool, which supports C, C++ and SystemC for design entry. It also supports many processing techniques such as pipelining, parallelism, memory partitioning and many others through directives. In this paper the unroll directive is applied to all for loops in the turbo decoder (the inner loops that calculate alpha, gamma, beta and the extrinsic information, and the outer loops over the iterations) to obtain a fully parallel turbo decoder. Two designs are compared in this work: the normal turbo decoder with two MAP decoders and the turbo decoder with one MAP decoder. Fig. 4 shows how the algorithm works with one MAP decoder. In this research the message is not decoded as a whole; a smaller window is used instead: a window of 27 received bits to decode 9 bits, and a window of 108 received bits to decode 36 bits. Both windows were implemented for frame sizes of 108, 216 and 432 received bits, which decode 36, 72 and 144 bits respectively; these frame sizes were chosen to fit the two windows. The window size represents the parallelism level when the Vivado HLS unroll directive is applied to the program loops to make them work in parallel. A block interleaver was used for all designs. The results were collected and the designs were compared in latency and resource utilization.
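The relation between frame size, window size and decoded bits can be checked with a few lines of arithmetic. The sketch assumes three received bits per decoded bit (one systematic and two parity bits, as produced by the encoder) and non-overlapping windows; the second assumption and all names are introduced here for illustration only.

```cpp
// Three received bits (one systematic, two parity) per decoded bit (assumed).
constexpr int kReceivedPerDecoded = 3;

// Number of decoded bits obtained from a given number of received bits.
constexpr int decoded_bits(int received_bits) {
    return received_bits / kReceivedPerDecoded;
}

// Number of non-overlapping windows that fit into one frame (assumed layout).
constexpr int windows_per_frame(int frame_bits, int window_bits) {
    return frame_bits / window_bits;
}

// The two windows used in this work.
static_assert(decoded_bits(27) == 9, "27 received bits decode 9 bits");
static_assert(decoded_bits(108) == 36, "108 received bits decode 36 bits");

// Every frame size is a whole multiple of both window sizes.
static_assert(108 % 27 == 0 && 216 % 27 == 0 && 432 % 27 == 0, "small window fits");
static_assert(108 % 108 == 0 && 216 % 108 == 0 && 432 % 108 == 0, "large window fits");
```

Under these assumptions a frame of 432 received bits holds 16 windows of 27 bits or 4 windows of 108 bits, and a frame of 216 received bits holds 8 or 2 windows.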

RESULTS AND DISCUSSIONS
All programs were written in C++ using Vivado HLS. The data type used was double. The directives used were dependence and unroll for parallelism. The device used to implement all designs was a Virtex UltraScale+ FPGA (xcvu13p-fsga2577-2-I), which has 216000 slices, 1728000 LUTs, 3456000 FFs, 12288 DSPs and 5476 BRAMs. The target clock period was set to 10 ns. Table 1 lists the results for the turbo decoder with one MAP decoder, giving the minimum and maximum latency for different frame sizes and different windows. Fig. 6 and Fig. 7 plot the data of Table 1 and Table 2, and the meaning of the plotted symbols is listed in Table 5. From these results we can see that when the turbo decoder with one MAP decoder is used with the large window size and without parallelism, the latency is better than with the small window size and the resource utilization is lower; only the memory utilization is higher, because of the larger window size. The same holds for the turbo decoder with two MAP decoders, and for both maximum and minimum latency. When parallelism is used in either turbo decoder design, with one or two MAP decoders, the minimum and maximum latency of the small window are better than those of the large window, but its resource utilization is higher. In other words, the larger window size takes fewer resources than the small window size, but it is not the better choice for latency.
Comparing the turbo decoder with one MAP decoder against the turbo decoder with two MAP decoders without parallelism, the design with two MAP decoders is better in maximum latency. With parallelism, the turbo decoder with one MAP decoder is the better of the two designs in maximum latency. It also uses fewer resources than the turbo decoder with two MAP decoders, both with and without parallelism.

CONCLUSION
In this paper we showed that HLS can be used to create an architecture design in C++, to generate the RTL automatically with Vivado HLS, and to produce several designs using techniques such as parallelism. We conclude that the turbo decoder with one MAP decoder is better than the turbo decoder with two decoders, especially when the aim is to make the area of the design smaller by utilizing fewer resources. When the aim is a turbo decoder with low maximum latency in a fully parallel design, the turbo decoder with one MAP decoder also produces better results.