FPGA Implementation of Multiplierless DCT/IDCT Chip

Advances in mobile electronics technology have produced handheld appliances that support both wireless voice and data communications. One of the most important operations in digital signal and image processing is the 2-D Discrete Cosine Transform. This paper presents a multiplierless two-dimensional Discrete Cosine Transform/Inverse Discrete Cosine Transform (DCT/IDCT) based on the transpose method, in which the 2-D DCT is obtained by applying two 1-D DCTs in series. The input data is first divided into N×N blocks and the row-wise 1-D DCT of each block is taken; the intermediate result is then transposed and a column-wise 1-D DCT is applied, which gives the 2-D DCT of the data. The hardware implementation is parallel and pipelined, and it decomposes the coefficient matrix into four power-of-two terms (i.e., 16) so that shift-and-add operations are performed instead of multiplications (i.e., 16). It occupies only 1,443 slices and runs at a maximum frequency of 82.8 MHz with a throughput of 991.2 Mbit/s when synthesized onto a Spartan3-E XC3S500 FPGA device. The proposed 2-D DCT/IDCT design meets the most demanding real-time requirements of the standardized CODEC frame resolutions and rates.


Introduction
The discrete cosine transform (DCT), proposed by Ahmed et al. in 1974 [1], has become an increasingly important tool for image, audio filter and video signal processing applications due to its utility and its adoption in standards such as Joint Photographic Experts Group (JPEG), Moving Picture Experts Group (MPEG), and CCITT H.261 [2]-[4]. The DCT also plays an important role in digital watermarking [5]. Compared with other orthogonal transforms, the performance of the DCT is very close to that of the optimal Karhunen-Loeve transform (KLT) for highly correlated data [6].
To meet real-time audio, image and video processing requirements, both the DCT and the inverse DCT (IDCT) are needed [7]. Various algorithms for computing discrete transforms have been developed over the years [4]-[8]. The 2-D DCT is computationally intensive, so there is great demand for high-speed, high-throughput, short-latency computing architectures.
There are three dominant paradigms for implementing the 2-D DCT/IDCT: software, DSP and hardware. The BISON Configurable Digital Signal Processor (BCDSP) architecture presented in [8] uses multiple memories, few instructions, and a special pipelined floating-point arithmetic function core. Tian-Sheuan Chang and Chein-Wei Jen proposed a cost-effective processor core design for the 2-D DCT/IDCT that uses a fast algorithm to reduce the computation, the DA formulation for higher precision, and subexpression sharing for lower hardware cost and fewer computation cycles [9].
Owing to their high flexibility, low cost and short development time, FPGAs are widely used as a common platform for hardware implementation. Many researchers [10]-[12] have focused on the FPGA as an ideal platform for DCT/IDCT hardware implementation. Many 2-D DCT/IDCT algorithms have been proposed to reduce the computational complexity and thus increase the operational speed and throughput [10]-[14].
Multiplierless architectures, such as distributed arithmetic (DA), new DA (NEDA), coordinate rotation digital computer (CORDIC) and integer DCT (intDCT), are widely used for VLSI realization of the DCT because of improvements in speed, area overhead and power consumption [13]. Other designs [14]-[15] are based on butterfly coding schemes. Although many DCT/IDCT cores are available, speed, power and area remain important issues in the development of these components.
The rest of the paper is organized as follows. Section 2 gives background on the DCT algorithm. The mathematical basis of the inverse discrete cosine transform algorithm is briefly discussed in Section 3. Section 4 describes the design environment and its design flow. Section 4.1 presents synthesis and measured performance results. This is followed in Section 4.2 by a discussion of the experimental results, and conclusions are drawn in the last section.
The separability property of the DCT has the advantage that Z(u, v) can be computed in two successive steps: 1-D operations on the rows of an image block followed by 1-D operations on its columns, which together give the 2-D DCT as shown in Figure 1. Using this property, equation (2) becomes equation (3). This separated transformation can also be expressed in matrix notation, in which T is an N×N matrix whose basis vectors are sampled cosines and whose entries are written in terms of the constants A, B, C, D, E, F and G.
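The row-column computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard orthonormal DCT-II scaling for T (the scaling used in equations (2) and (3) may differ), floating-point arithmetic rather than the fixed-point hardware datapath, and hypothetical function names (`dct_matrix`, `transform_2d`, `dct_2d`, `idct_2d`). With these assumptions the result equals the matrix form Z = T·X·Tᵀ.

```python
import math

def dct_matrix(n=8):
    """N x N matrix of sampled cosines (orthonormal DCT-II scaling, an assumption)."""
    t = []
    for u in range(n):
        scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        t.append([scale * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                  for x in range(n)])
    return t

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def transform_rows(block, t):
    """Apply the 1-D transform t to every row of block (row r becomes t * r)."""
    return transpose(matmul(t, transpose(block)))

def transform_2d(block, t):
    rows_done = transform_rows(block, t)        # first 1-D pass (row-wise)
    transposed = transpose(rows_done)           # intermediate transposition
    cols_done = transform_rows(transposed, t)   # second 1-D pass (column-wise)
    return transpose(cols_done)                 # back to the usual orientation

def dct_2d(block):
    return transform_2d(block, dct_matrix(len(block)))

def idct_2d(coeffs):
    # For an orthonormal T the inverse transform matrix is simply T transposed.
    return transform_2d(coeffs, transpose(dct_matrix(len(coeffs))))
```

The same `transform_2d` routine serves both directions, which mirrors the idea of sharing one datapath between the DCT and the IDCT; only the coefficient matrix changes.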
The 8×1 1-D DCT is given by equation 4, where X(0) to X(7) represent the 8 data elements. A direct implementation needs 64 multipliers, 56 adders, and one shifter that performs arithmetic right shifts to divide by 2. This matrix can be further simplified to reduce the number of multipliers and adders [12]: the number of multipliers falls from 64 to 32 and the number of adders from 56 to 24, which substantially reduces the power consumption. The 1-D DCT can thus be decomposed to obtain a fast algorithm as follows.
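One standard way to obtain such a halving of the multiplications is the even/odd symmetry of the cosine basis: even-indexed outputs depend only on the sums X(n) + X(7-n), and odd-indexed outputs only on the differences X(n) - X(7-n). The sketch below illustrates that symmetry and checks it against the direct form. It assumes orthonormal DCT-II scaling and floating-point arithmetic, and the function names are hypothetical; it is not claimed to be the exact factorization of the paper's equations.

```python
import math

N = 8

def coeff(u, n):
    """Sampled-cosine coefficient (orthonormal DCT-II scaling, an assumption)."""
    scale = math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
    return scale * math.cos((2 * n + 1) * u * math.pi / (2 * N))

def dct8_direct(x):
    """Direct matrix-vector form: 8 multiplications per output, 64 in total."""
    return [sum(coeff(u, n) * x[n] for n in range(N)) for u in range(N)]

def dct8_even_odd(x):
    """Even/odd form: 4 multiplications per output, 32 in total."""
    half = N // 2
    sums = [x[n] + x[N - 1 - n] for n in range(half)]    # feed even outputs
    diffs = [x[n] - x[N - 1 - n] for n in range(half)]   # feed odd outputs
    out = []
    for u in range(N):
        src = sums if u % 2 == 0 else diffs
        out.append(sum(coeff(u, n) * src[n] for n in range(half)))
    return out
```

Because each output now needs only four products, four parallel multipliers are enough to produce one coefficient per step.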

The Proposed 2-D DCT and IDCT Architectures
A commonly used approach to construct the 2-D DCT/IDCT is the row-column decomposition method. The decomposition performs a row-wise 1-D transform followed by a column-wise 1-D transform, with an intermediate transposition, as shown in Fig. 1. This decomposition approach has two advantages. First, the computational complexity is significantly reduced. Second, the original 1-D DCT can easily be replaced by different DCT algorithms. The proposed architecture targets power efficiency by minimizing the number of arithmetic operations and by combining the two transforms in the same design. The proposed 1-D DCT/IDCT uses the fast method of equation 6, which organizes the coefficient matrix so that only four multiply operations are needed instead of eight; a toggle flip-flop selects whether X(n) and X(N-1-n) are added or subtracted. With this approach, vector processing with four parallel multipliers is enough to implement the 1-D DCT/IDCT. The following points record the operations of the proposed architecture:
1. Convert the input signal from serial to parallel form using the eight shift registers shown in Fig. 2. This step takes eight clock cycles.
2. Rearrange the input elements X(0) to X(7): in the DCT unit a toggle flip-flop adds X(n) + X(N-1-n) and then subtracts them, while the IDCT unit uses X(n), ..., X(N/2-1) and X(N/2) to X(N-1), which halves the number of multiplier elements. This step takes two clock cycles.

VHDL Synthesis Results
The 2-D DCT architecture was described in VHDL and synthesized for an XC3S500 Spartan3-E FPGA. Synthesis results for the 8×8-point DCT and IDCT units are given in Table 1. These results indicate that the Xilinx FPGA implementations of the DCT/IDCT can process a higher number of frames per unit time. The VHDL simulations of the forward and inverse transforms are shown in Fig. 4 and Fig. 5, respectively. The architecture has a latency of 94 clock cycles when the DCT is selected and 101 clock cycles for the IDCT.
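As a rough plausibility check on the frame-rate claim, the helper below converts a clock frequency into an upper bound on frames per second. It assumes the pipeline accepts one sample per clock cycle once it is full and ignores latency and memory bandwidth. The CIF resolution (352×288) and the 4:2:0 factor of 1.5 samples per pixel are illustrative assumptions, not values taken from Table 1; the function name is hypothetical.

```python
def frames_per_second(clock_hz, width, height, samples_per_pixel=1.5):
    """Upper bound on the frame rate at one transformed sample per clock.

    samples_per_pixel = 1.5 models 4:2:0 colour sampling (an assumption).
    """
    samples_per_frame = width * height * samples_per_pixel
    return clock_hz / samples_per_frame

# Illustrative use with the reported 82.8 MHz maximum clock and a CIF frame.
cif_rate = frames_per_second(82.8e6, 352, 288)
```

Under these assumptions the bound is several hundred CIF frames per second, well above typical video frame rates.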

A Multiplierless DCT/IDCT Architecture
The discrete cosine transform is one of the most compute-intensive parts of various image/video coding standards. Its computational burden is due to the large number of multiplications and additions. Multiplierless architectures are widely used for its VLSI realization because of improvements in speed, area overhead and power consumption [6]-[13]. These advantages arise because multiple constant multiplications are implemented with optimized shift-and-add operations instead of generic array multipliers. The iterative or overlapped additions in DCT multiplications improve the power efficiency because the number of arithmetic operations is reduced. CORDIC executes shift-and-add iteratively to compute angle rotation and magnitude compensation. DA uses ROMs in which the partial sums of inputs are stored; the result is obtained by shifting and adding the values stored in the ROMs. NEDA uses an adder array instead of ROMs to improve the area efficiency.
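The ROM-based DA scheme described above can be illustrated with a small bit-serial model. The sketch assumes four integer constants, two's-complement inputs and a 12-bit word length. All of these, and the function names, are hypothetical choices for illustration and are not taken from any of the cited designs.

```python
def build_da_rom(coeffs):
    """ROM holding every partial sum of the constants, addressed by one bit per input."""
    rom = []
    for addr in range(1 << len(coeffs)):
        rom.append(sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1))
    return rom

def da_inner_product(rom, xs, bits=12):
    """Compute sum(c_k * x_k) using only ROM look-ups, shifts and additions."""
    mask = (1 << bits) - 1
    words = [x & mask for x in xs]          # two's-complement encoding of each input
    acc = 0
    for b in range(bits):
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> b) & 1) << k     # gather bit b of every input
        partial = rom[addr] << b            # weight the ROM entry by 2^b
        # The most significant bit carries negative weight in two's complement.
        acc += -partial if b == bits - 1 else partial
    return acc

# Illustrative constants and inputs (hypothetical values).
rom = build_da_rom([90, 50, -18, 7])
result = da_inner_product(rom, [100, -200, 37, -1])
```

The ROM grows as 2^K for K inputs, which is the area cost that NEDA avoids by using an adder array.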
Common subexpression elimination (CSE) is another efficient scheme for reducing the number of additions required to realize multiple multiplications. It is mainly applied to single-input/multiple-output operations such as finite impulse response (FIR) filters. Constant matrix multiplication (CMM) is being researched for the efficient hardware design of multiple-input/multiple-output operations, targeting area and delay optimization. Byoung-Il Kim and Sotirios G. Ziavras minimized arithmetic-operation redundancy; their DCT design focuses on Chen's factorization approach and the constant matrix multiplication (CMM) problem [13].
In this paper, the proposed architecture is designed to reduce the number of additions/subtractions required: the multipliers used in the 1-D DCT architecture are decomposed into shifts and adds to minimize the hardware. Each DCT/IDCT matrix coefficient is decomposed into four power-of-two elements, which are stored in ROM and tabulated in Table 2. The multiplierless DCT/IDCT design replaces each multiplier shown in Fig. 3 with the circuit shown in Fig. 6. The new architecture shifts the input value by factors fetched from the ROM and then combines the four shifted terms using one adder and two subtractors. The multiplierless architecture uses three adder elements to perform this in one clock cycle, the same as the multiplier-based architecture illustrated in Fig. 3; using the three adder elements shown in Fig. 6(b) saves 3 clock cycles. In the same manner, a second ROM is used to store the shift values of the inverse DCT coefficients. This method differs from the barrel-shift approach to avoiding multipliers, in which the coefficient (90)₁₀ is represented in binary as (1011010)₂ and requires six adders plus AND operations with the multiplicand, whereas in the proposed architecture the coefficient (90)₁₀ needs only one adder and two subtractors, with no AND operations, to perform the multiplication.
Fig. 6 The proposed multiplier architecture.
Here x, y, z and w are the coefficients fetched from the ROM. The results of applying the proposed architecture to a block of an image, compared with those of a MATLAB program, are shown in Table 3(a, b, c, e). The proposed and the conventional multiplierless DCT architectures are implemented on a Spartan3-E Xilinx FPGA. Synthesis results for the multiplierless DCT/IDCT architecture are given in Table 4.
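The replacement of a constant multiplier by four shifted terms can be illustrated as follows. The sketch derives signed power-of-two terms with the canonical signed digit (CSD) recoding, which is a technique substituted here for illustration; the entries actually stored in the ROM (Table 2) are not reproduced and may differ. For the coefficient 90 the recoding gives 128 - 32 - 8 + 2, which needs one addition and two subtractions, consistent with the count quoted above. The function names are hypothetical.

```python
def csd_terms(c):
    """Signed power-of-two terms (sign, shift) of a non-negative integer constant."""
    assert c >= 0
    terms = []
    shift = 0
    while c != 0:
        if c & 1:
            digit = 2 - (c & 3)      # +1 if c % 4 == 1, -1 if c % 4 == 3
            terms.append((digit, shift))
            c -= digit
        c >>= 1
        shift += 1
    return terms

def shift_add_multiply(x, terms):
    """Multiply x by the constant encoded in terms using shifts and add/subtract only."""
    acc = 0
    for sign, shift in terms:
        shifted = x << shift
        acc = acc + shifted if sign > 0 else acc - shifted
    return acc

terms_90 = csd_terms(90)                      # 2 - 8 - 32 + 128
product = shift_add_multiply(13, terms_90)    # same value as 13 * 90
```

Not every constant recodes into exactly four terms; in a fixed four-term datapath, shorter decompositions would need padding and longer ones would need rounding of the coefficient.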

Multiplierless DCT Architectures Comparison
The paper by Byoung-Il Kim and Sotirios G. Ziavras reports a recent overall comparison for the 8×1 DCT, covering the required adders and the performance of the proposed architectures compared with previous multiplierless DCT architectures. The number of NEDA accumulation cycles depends on the precision of the cosine coefficients. The proposed architecture reduces the arithmetic operations by 58.9% compared to the conventional CMM, as illustrated in Table 5.

Implementation and Results
Byoung-Il Kim and Sotirios G. Ziavras implemented the conventional multiplierless DCT architectures NEDA, CORDIC and CMM on the Xilinx XC2VP50 FPGA. For comparison, the proposed architecture was implemented on a Virtex-II Pro XC2VP50 using the XtremeDSP development kit. The area required by the proposed design, using Block RAM to store the intermediate results and without the companding scheme, is shown in Table 6; area is measured in number of slices.

Conclusions
This paper proposed a low-power DCT architecture for image/video coders. Power reduction was achieved by minimizing the number of arithmetic operations. To minimize the operations, the 1-D DCT was decomposed into four sub-adder units. The total number of operations required for the 8×1 DCT was 23, which represents a reduction of 41.1% compared to the conventional CMM. An adaptive companding scheme was proposed to effectively reduce the arithmetic unit. The results showed that the proposed 2-D DCT/IDCT architecture meets the requirements of real-time video and image compression. The proposed architecture is expected to be useful in mobile multimedia applications because it needs fewer arithmetic units than the CMM architecture. It is well suited to low-complexity, multi-purpose video CODECs.

Arabic title (translated): Implementation of a Multiplierless Discrete Cosine Transform Chip and Its Inverse. Dr. Ahlam Fadhil Mahmood and Abdulkareem Mohammed Salih, Department of Computer Engineering, University of Mosul. abdlat_1986@yahoo.com, ahlam.mahmood@gmail.com. Abstract: The advance of mobile electronics technology has produced a number ...
4. The output of the adder/subtractor is fed into a multiplier element. The constant coefficient values are stored in a ROM and fed into the second input of the multiplier. This step takes one clock cycle.
5. The outputs of the four multipliers are added together, giving the intermediate coefficients.
6. The intermediate coefficients are stored in a RAM. The values stored in the intermediate RAM are read out one column at a time and form the input to the second stage. It takes 64 clock cycles to store all 1-D DCT/IDCT elements.
7. Two transpose buffer memories, RAM1 and RAM2, are used to synchronize the procedure in a pipelined manner: after 64 locations have been written, RAM1 goes into read mode and RAM2 goes into write mode. The cycle then repeats under a signal generated by the control unit.
8. The first DCT/IDCT coefficient appears at the output of RAM1 after 16 + 64 clock cycles, so the second DCT/IDCT operation starts after 80 clock cycles. The first output is obtained after 94 clock cycles, after which one sample can be taken at each clock.
The proposed block diagram is illustrated in Fig. 3.
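The alternating use of the two transpose memories can be modelled behaviourally as below. This is a simplified sketch: it models only the 64-word write/read swap and the column-wise read-out. It does not model the serial-to-parallel front end, the arithmetic stages or the exact 80- and 94-cycle figures quoted above. The class and attribute names are hypothetical.

```python
class PingPongTranspose:
    """Two 64-word buffers: one is written row by row while the other is read column by column."""

    def __init__(self, n=8):
        self.n = n
        self.rams = [[0] * (n * n), [0] * (n * n)]
        self.write_sel = 0        # index of the RAM currently in write mode
        self.count = 0            # shared address counter, 0 .. n*n - 1
        self.read_valid = False   # True once a full block has been written

    def clock(self, sample):
        """Advance one clock: store one sample and return one transposed sample (or None)."""
        n = self.n
        out = None
        if self.read_valid:
            r, c = divmod(self.count, n)
            out = self.rams[1 - self.write_sel][c * n + r]   # transposed read address
        self.rams[self.write_sel][self.count] = sample        # row-major write
        self.count += 1
        if self.count == n * n:   # block complete: swap the roles of the two RAMs
            self.count = 0
            self.write_sel ^= 1
            self.read_valid = True
        return out

# Illustrative use: the first block comes out column by column while the second is written.
buf = PingPongTranspose()
outputs = [buf.clock(v) for v in list(range(64)) + list(range(100, 164))]
```

Because reads and writes always target different memories, the second 1-D stage can consume one sample per clock while the first stage keeps producing the next block.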