Design and FPGA Implementation of Dual Scan Two-Dimensional Discrete Wavelet Transforms

In this paper, hardware architectures for the two-dimensional discrete wavelet transform (2-D DWT) are examined and a 4-input/4-output Dual Scan architecture for one-level DWT is presented. Using pipelining and parallel processing, the one-level architecture is then extended to perform a complete dyadic decomposition of an N×N image in a multi-level 2-D DWT. The internal memory sizes needed by the proposed architectures and a suitable fixed-point word length are then determined. The proposed architectures are downloaded onto an FPGA board (Spartan-3E) to measure their die area and critical path. The main advantage of the Dual Scan method is a large reduction in the computation time of the architecture.


Introduction
Many recent architectures for the 2-D DWT read the input image row by row. In these methods, pixels are read one by one from the first row until its end, then from the second row, and so on. This reading order follows the conventional approach of graphical processing. The 2-D architecture proposed by Andra et al. [1] is composed of simple processing units and computes one stage of the DWT at a time. Jiang et al. [2] proposed a parallel processing architecture that models the DWT computation as a finite state machine; it is only efficient for computing the wavelet coefficients near the boundary of each segment of the input signal. Lian et al. [3] proposed a 1-D folded architecture to improve the hardware utilization for the 5/3 and 9/7 wavelet filters. A block-scanning algorithm for DWT implementation was proposed by Yamauchi et al. [4] to achieve highly efficient calculations. However, all of those methods are block-based architectures and hence require a large raw data buffer. Dillen et al. [5] then proposed a combined line-based architecture for the 5/3 and 9/7 DWT, which was implemented for one-level decomposition. Liu et al. [6] also proposed an efficient line-based 2-D architecture using the spatial combinative lifting algorithm of the 9/7 DWT. Tseng et al. [7] and Liao et al. [8] proposed two similar lifting-based 2-D generic architectures that employ parallel and pipeline techniques with the recursive pyramid algorithm. Although these architectures support multi-level decomposition through an interleaving scheme that reduces the memory size and the number of memory accesses, they still suffer from slow throughput rates and inefficient hardware utilization. Many wavelet architectures have recently been proposed for specialized applications; these architectures try to minimize memory access [9].
A complex structure is presented in [10], which can be simplified with the lifting technique. Finally, the two-dimensional (2-D) transform is studied in [11] with the use of a transposition memory between the horizontal and vertical decompositions. The architectures are mostly folded and can be broadly classified into serial architectures (where the inputs are supplied to the filters serially) and parallel architectures (where the inputs are supplied to the filters in parallel). The serial architectures are based either on systolic arrays that interleave the computation of outputs of different levels to reduce storage and latency [12]-[14] or on digit pipelining, which implements the filter bank structure efficiently [15], [16]. The parallel architectures implement interleaving of the outputs and support pipelining to any level [17]. In this paper, a different way of reading pixels from an input image is proposed to reduce both the computation time of the architecture and the number of internal memory accesses. This method is called the Dual Scan Architecture (DSA). 2-D DWT architectures follow one of two approaches: separable and non-separable. A simple separable approach first applies the transform in the horizontal direction and then in the vertical direction, or vice versa. The non-separable approach decomposes an image into four sub-band images and processes them one after the other without any individual row or column processor. This can improve the performance of the architecture, but at the cost of considerably more hardware resources. To trade off speed against die area, a Dual Scan separable method is proposed in this paper to speed up the architecture while reducing the hardware resources. The rest of this paper is organized as follows. The dB4 filter lifting scheme analysis is presented in section 2.
The conventional and proposed one-level and multi-level Dual Scan Architectures are explained in section 3. In section 4, the simulation results and the hardware implementations are illustrated. The performance analysis is shown in section 5 with a comparative study. Finally, section 6 includes the conclusion of this paper.

Discrete Wavelet Transform and Lifting Scheme
In this section, brief reviews of Mallat's tree algorithm and the dB4 lifting scheme are presented. Mallat's algorithm is reviewed in sub-section A. The lifting scheme and the factorization of a wavelet filter are introduced in sub-sections B and C, respectively, while the boundary treatment is described in sub-section D.

A. Mallat's Algorithm
The classical DWT can be calculated using an approach known as Mallat's tree algorithm [8].
Here, the lower resolution wavelet coefficients of each DWT stage are calculated recursively according to equations (1) and (2), where c_{j,k} is the low pass (j,k)th coefficient at the pth resolution, d_{j,k} is the high pass (j,k)th coefficient at the pth resolution, h[n] is the low pass wavelet filter corresponding to the mother wavelet, g[n] is the high pass wavelet filter corresponding to the mother wavelet, and l is the length of c_{j,k} and d_{j,k}. The corresponding tree structure for a two-level DWT is illustrated in Fig. 1. The structure of the conventional separable 2-D DWT algorithm is shown in Fig. 2, where h[n] and g[n] represent the low pass and high pass sub-band filters, respectively. The input image is first decomposed horizontally; the resulting two bands are then decomposed vertically into four sub-bands, usually denoted by LL, LH, HL, and HH. The LL sub-band can then be further decomposed by the same method.
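For reference, the recursion of Mallat's algorithm is commonly written in the form below. This is the textbook statement, not a transcription of this paper's own equations; the summation index n and the filter alignment h[n - 2k] are assumptions about the indexing convention.

```latex
c_{j,k} = \sum_{n} h[n-2k]\, c_{j-1,n}, \qquad
d_{j,k} = \sum_{n} g[n-2k]\, c_{j-1,n}
```

Each stage filters the previous low pass sequence and keeps every second output, which is why the number of coefficients halves from one level to the next.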

B. Lifting Scheme
In 1994, Sweldens proposed a more efficient way of constructing the biorthogonal wavelet bases, which he called the lifting scheme [18]. Concurrently, similar ideas were also proposed by Carnicer et al. [19]. The basic structure of the lifting scheme is shown in Fig. 3. The input signal is first split into even and odd samples. The detail (i.e., high-frequency) coefficients of the signal are then generated by subtracting the output of a prediction function of the odd samples from the even samples. The smooth coefficients (the low-frequency components) are produced by adding the odd samples to the output of an update function of the details. The computation of either the detail or smooth coefficients is called a lifting step.
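The split, predict and update steps described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the predictor (identity) and update (halving) are simple stand-ins chosen so that the smooth output is the pairwise average.

```python
def lifting_forward(x, predict, update):
    """One lifting stage: split, predict (detail), update (smooth)."""
    even, odd = x[0::2], x[1::2]
    # Detail = even samples minus a prediction made from the odd samples.
    detail = [e - p for e, p in zip(even, predict(odd))]
    # Smooth = odd samples plus an update computed from the details.
    smooth = [o + u for o, u in zip(odd, update(detail))]
    return smooth, detail


def lifting_inverse(smooth, detail, predict, update):
    """Undo the stage by running the same steps backwards with flipped signs."""
    odd = [s - u for s, u in zip(smooth, update(detail))]
    even = [d + p for d, p in zip(detail, predict(odd))]
    x = [0] * (len(even) + len(odd))
    x[0::2], x[1::2] = even, odd
    return x


# Stand-in predictor and update (assumptions, not the dB4 steps).
identity = lambda v: list(v)
half = lambda v: [t / 2 for t in v]

smooth, detail = lifting_forward([2, 4, 6, 8], identity, half)
restored = lifting_inverse(smooth, detail, identity, half)
```

Because the inverse only reverses the order of the steps and flips the signs, reconstruction is exact for any choice of predictor and update.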

C. Lifting Steps of Factorized Wavelet Filters
Daubechies and Sweldens [20] showed that every FIR wavelet or filter bank can be decomposed into a cascade of lifting steps, that is, into a finite product of upper and lower triangular matrices with a diagonal normalization matrix. The high pass and low pass filters in (1) and (2) can thus be rewritten in the z-domain as Laurent polynomials in z, where l is the length of the filter.
The high pass and low pass filters can be separated into even and odd parts, and these parts can be arranged as a polyphase matrix. Using the Euclidean algorithm, which recursively finds the greatest common divisors of the even and odd parts of the original filters, the forward transform polyphase matrix can then be factored into lifting steps, where s_i(z) and t_i(z) are Laurent polynomials corresponding to the update and prediction steps, respectively. The inverse DWT is obtained by inverting this factorization. The low pass and high pass filters corresponding to the Daubechies 4-tap wavelet are expressed in terms of the four coefficients h_0 to h_3 given in [20], and the corresponding synthesis polyphase matrix can be factored into lifting steps in the same way.
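As an illustration of such a factorization, the sketch below runs one level of the Daubechies 4-tap transform as a sequence of lifting steps, using the well-known factorization published by Daubechies and Sweldens. It is not necessarily the exact step ordering used in the architecture of this paper. The function names are hypothetical, the input length is assumed to be even, and periodic extension at the boundaries is an assumption made only to keep the example short.

```python
import math

SQRT3 = math.sqrt(3.0)
SQRT2 = math.sqrt(2.0)


def d4_forward(x):
    """One level of the D4 DWT by lifting (periodic extension assumed)."""
    n = len(x) // 2
    even, odd = x[0::2], x[1::2]
    # Update 1: s1[l] = x[2l] + sqrt(3) * x[2l+1]
    s1 = [even[l] + SQRT3 * odd[l] for l in range(n)]
    # Predict: d1[l] = x[2l+1] - sqrt(3)/4 * s1[l] - (sqrt(3)-2)/4 * s1[l-1]
    d1 = [odd[l] - SQRT3 / 4 * s1[l] - (SQRT3 - 2) / 4 * s1[l - 1]
          for l in range(n)]
    # Update 2: s2[l] = s1[l] - d1[l+1]
    s2 = [s1[l] - d1[(l + 1) % n] for l in range(n)]
    # Diagonal normalization of the two channels.
    low = [(SQRT3 - 1) / SQRT2 * v for v in s2]
    high = [(SQRT3 + 1) / SQRT2 * v for v in d1]
    return low, high


def d4_inverse(low, high):
    """Invert d4_forward by undoing each lifting step in reverse order."""
    n = len(low)
    s2 = [(SQRT3 + 1) / SQRT2 * v for v in low]
    d1 = [(SQRT3 - 1) / SQRT2 * v for v in high]
    s1 = [s2[l] + d1[(l + 1) % n] for l in range(n)]
    odd = [d1[l] + SQRT3 / 4 * s1[l] + (SQRT3 - 2) / 4 * s1[l - 1]
           for l in range(n)]
    even = [s1[l] - SQRT3 * odd[l] for l in range(n)]
    x = [0.0] * (2 * n)
    x[0::2], x[1::2] = even, odd
    return x


signal = [1.0, 3.0, -2.0, 5.0, 0.5, 4.0, 7.0, -1.0]
low, high = d4_forward(signal)
restored = d4_inverse(low, high)
```

With periodic extension the transform is orthogonal, so the energy of `low` and `high` together equals the energy of `signal`, and a constant input produces zero high pass coefficients.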

Conventional and Proposed 2-D Architectures
A conventional implementation of a separable 2-D lifting scheme-based DWT is illustrated in Fig. 4. The resulting decomposed low and high frequency coefficients are stored in Memory Bank 1.
Since this bank stores all the horizontal DWT coefficients, the size of this memory bank is N² for an N×N image. When the row DWT is completed, the vertical 1-D DWT architecture starts reading the data from Memory Bank 2, which represents the coefficients from the horizontally decomposed image, and calculates the vertical DWT. The size of Memory Bank 2 is also N². The LL, LH, HL and HH sub-band images are the final results. The straightforward implementation of the 2-D DWT is both time- and memory-intensive. To increase the computation speed, a 2-D Dual Scan architecture for the separable lifting scheme-based Discrete Wavelet Transform is proposed.
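The row-then-column flow and its N² intermediate storage can be sketched in software as follows. This is only a behavioral illustration: the names are hypothetical, and an unnormalized Haar step stands in for the dB4 filter so that the example stays short. The first letter of each sub-band name refers to the horizontal filter and the second to the vertical filter.

```python
def haar_1d(v):
    """Stand-in 1-D transform (unnormalized Haar): returns (low, high)."""
    low = [(v[i] + v[i + 1]) / 2 for i in range(0, len(v), 2)]
    high = [(v[i] - v[i + 1]) / 2 for i in range(0, len(v), 2)]
    return low, high


def separable_dwt2(image, dwt1d=haar_1d):
    """Row pass over the whole image, then column pass over the stored result."""
    n = len(image)
    # Row pass: every row is transformed and buffered as [low | high].
    bank = []
    for row in image:
        low, high = dwt1d(row)
        bank.append(low + high)
    stored_words = sum(len(r) for r in bank)  # N*N intermediate coefficients
    # Column pass: can only start once the buffer is complete.
    cols = []
    for c in range(n):
        low, high = dwt1d([bank[r][c] for r in range(n)])
        cols.append(low + high)
    out = [[cols[c][r] for c in range(n)] for r in range(n)]
    h = n // 2
    LL = [row[:h] for row in out[:h]]
    LH = [row[:h] for row in out[h:]]  # horizontal low, vertical high
    HL = [row[h:] for row in out[:h]]  # horizontal high, vertical low
    HH = [row[h:] for row in out[h:]]
    return LL, LH, HL, HH, stored_words


flat = [[5, 5, 5, 5] for _ in range(4)]
LL, LH, HL, HH, stored_words = separable_dwt2(flat)
```

For the flat 4×4 test image all the energy ends up in LL, and `stored_words` is 16, matching the N² buffer size discussed above.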

The Proposed 2-Dimensional Dual Scan Architectures (2-D DSAs)
As mentioned before, in a conventional 2-D DWT algorithm, the vertical DWT can only be carried out after the horizontal DWT has finished. Executing the row and column computations sequentially in this way limits the processing speed.

One-Level Architecture
To implement the 4-input/4-output one-level dual scan lifting scheme-based architecture, the input signal first has to be separated into even and odd samples. In this architecture, however, the first two rows are read in parallel, and the odd samples of the first row are written into the FIFO1 memory, as shown in Fig. 6. The size of the IBU is 2N. The wavelet module (WTM) is designed to perform the 2-D DWT by receiving four input samples from the IBU and generating four output samples per cycle. The WTM is composed of two similar row-wise 1-D DWT modules (R-WT1 and R-WT2) and two similar column-wise 1-D DWT modules (C-WT1 and C-WT2), which work in parallel. In each internal clock cycle, four inputs are read by R-WT1 and R-WT2 in parallel: x_ee(n,m) and x_eo(n,m) (the samples of the even-numbered row in the even-numbered and odd-numbered columns, respectively) and x_oe(n,m) and x_oo(n,m) (the samples of the odd-numbered row in the even-numbered and odd-numbered columns, respectively). In each clock cycle, R-WT1 generates one low frequency coefficient after reading the data from the IBU and then, after three clock cycles, generates the high frequency coefficient, as shown in Fig. 7. At the same time, R-WT2, which performs the 1-D DWT on the odd rows, generates one low frequency coefficient one clock cycle after reading the data and generates the high frequency coefficient after three clock cycles. The outputs of R-WT1 and R-WT2 are then sent to C-WT1 and C-WT2 in parallel.
The low frequency coefficients generated by R-WT1 and R-WT2 are sent to C-WT1 and decomposed into the low-low (LL) and low-high (LH) frequency sub-bands, while the high frequency coefficients generated by R-WT1 and R-WT2 are sent to C-WT2, which performs the column wavelet transform and generates the high-low (HL) and high-high (HH) frequency sub-bands. The architecture of the wavelet module can be designed by directly mapping the lifting factorization of the chosen wavelet filter (i.e., the Daubechies 4-tap wavelet filter).
Fig. 7. The timing diagram of calculating approximation and detail coefficients.
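The data flow through the four modules can be modeled in software as below. This is a behavioral sketch under strong assumptions: an unnormalized Haar step replaces the dB4 lifting steps so that each group of four samples can be processed without internal state, whereas the real modules need neighboring samples and pipeline registers. The function name is hypothetical; only the module roles (R-WT1, R-WT2, C-WT1, C-WT2) and the four-samples-per-cycle reading order come from the text.

```python
def dual_scan_haar(image):
    """Scan two rows at a time, consuming four samples per modeled cycle."""
    n = len(image)
    h = n // 2
    LL = [[0.0] * h for _ in range(h)]
    LH = [[0.0] * h for _ in range(h)]
    HL = [[0.0] * h for _ in range(h)]
    HH = [[0.0] * h for _ in range(h)]
    cycles = 0
    for r in range(0, n, 2):        # an even row and the odd row below it
        for c in range(0, n, 2):    # an even column and the odd column after it
            x_ee, x_eo = image[r][c], image[r][c + 1]
            x_oe, x_oo = image[r + 1][c], image[r + 1][c + 1]
            # R-WT1 works on the even row, R-WT2 on the odd row.
            low_e, high_e = (x_ee + x_eo) / 2, (x_ee - x_eo) / 2
            low_o, high_o = (x_oe + x_oo) / 2, (x_oe - x_oo) / 2
            # C-WT1 takes the two low outputs and produces LL and LH.
            LL[r // 2][c // 2] = (low_e + low_o) / 2
            LH[r // 2][c // 2] = (low_e - low_o) / 2
            # C-WT2 takes the two high outputs and produces HL and HH.
            HL[r // 2][c // 2] = (high_e + high_o) / 2
            HH[r // 2][c // 2] = (high_e - high_o) / 2
            cycles += 1             # four inputs in, four outputs out
    return LL, LH, HL, HH, cycles


LL, LH, HL, HH, cycles = dual_scan_haar([[1, 2], [3, 4]])
```

The row and column stages operate on the same four samples, so no full-frame buffer is needed between them, and the modeled cycle count for an N×N image is N²/4.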

Multi-level Architecture
When a particular application requires J levels of decomposition (J > 1), where J is the number of decomposition levels of the 2-D DWT, the LL sub-band image (horizontal and vertical low frequencies) produced by the one-level transform module can, as in most cases, be transformed further, as shown in Fig. 8. However, the proposed multi-level (three-level) Dual Scan architecture requires a memory of size N²/4 for storing intermediate data. The other sub-band images (LH2, HL2 and HH2) are stored in external memory, as shown in Fig. 8.

The Implementations
The proposed architectures are implemented as behavioral-level VHDL models, and their correctness is confirmed in simulation. As the dynamic range of the 2-D DSA coefficients grows with the number of decomposition levels, the number of bits used to represent the coefficients should be large enough to prevent overflow. Bits representing the fractional part can be added to improve the signal to noise ratio (SNR). In our simulation, the filter coefficients and the 2-D DSA coefficients are represented with 16 bits. Therefore, 16-bit multipliers are implemented in our design. The SNR values of the three-level forward 2-D DSA for the 512×512 gray-level test images are listed in Table 1.
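The effect of the fractional bits on the SNR can be explored with a small model such as the one below. It is a sketch only: the helper names are hypothetical, and the split of the 16-bit word into integer and fractional bits (`frac_bits`) is an assumption, since the text gives only the total word length.

```python
import math


def quantize(value, frac_bits, word_bits=16):
    """Round to a signed fixed-point word, saturating on overflow."""
    scale = 1 << frac_bits
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    q = max(lo, min(hi, round(value * scale)))
    return q / scale


def snr_db(reference, approx):
    """SNR in dB of an approximation against a reference sequence."""
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approx))
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)


# Approximate Daubechies 4-tap low pass coefficients, used as test data.
coeffs = [0.4829629, 0.8365163, 0.2241439, -0.1294095]
snr_coarse = snr_db(coeffs, [quantize(c, 4) for c in coeffs])
snr_fine = snr_db(coeffs, [quantize(c, 12) for c in coeffs])
```

More fractional bits reduce the rounding noise, while fewer integer bits make saturation more likely as the coefficient range grows with the decomposition level, so the word length has to balance the two.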

Performance Analysis and Comparison
The performance of the 2-D DSA architectures is evaluated in the following subsections.

Hardware complexity
The numbers of multipliers and adders are the main factors affecting the hardware complexity of the 2-D DSA. These two factors depend on the type of filter used. In this paper, the dB4 filter is used to design a 2-D DSA with 18 adders and 10 multipliers.

Time of computations
The computation time is normalized to the number of internal clock cycles consumed and can be estimated as follows. Since the 2-D DSA is designed to take 4 inputs and generate 4 outputs at every internal clock cycle (cc), the one-level architecture needs N/2 cc to process 2N pixels. Thus, N²/4 cc are needed to process the N×N pixels for one level. In the proposed architecture, the number of levels is greater than one (three-level).
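The one-level estimate can be checked with a short calculation. The one-level count N²/4 comes directly from the text; the multi-level function is an assumption that each level runs to completion on the LL sub-band of the previous level, and a pipelined design that overlaps the levels would need fewer cycles. The function names are hypothetical.

```python
def one_level_cycles(n):
    """Cycles for one level when four pixels are consumed per cycle."""
    return n * n // 4


def sequential_multi_level_cycles(n, levels):
    """Assumed model: level j processes an (n / 2**j) x (n / 2**j) LL image."""
    return sum(one_level_cycles(n >> j) for j in range(levels))


cycles_one = one_level_cycles(512)
cycles_three = sequential_multi_level_cycles(512, 3)
```

For a 512×512 image this gives 65536 cycles for one level and, under the sequential assumption, 65536 + 16384 + 4096 = 86016 cycles for three levels.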

The memory bandwidth
The memory bandwidth is normalized to the number of samples required in each internal working clock cycle (in samples/cycle). The memory bandwidth of the proposed architectures is therefore 4 samples/cycle, which increases the speed of computing the corresponding 2-D DWT coefficients. As shown in Table 3, the architecture in [8] uses a small number of resources (only 8 adders and 8 multipliers), but its computation time is long. Although the proposed architecture uses more resources than the others, its speed is its advantage. Thus, the proposed architecture trades off the speed of calculation against die area.

Conclusions
In this paper, a 3-level dB4 2-D DSA has been proposed and captured in VHDL, in which the parallelism among the four sub-band transforms in a lifting scheme 2-D DWT is explored. The comparison demonstrates that the proposed 2-D DSA is faster than other lifting scheme architectures, with a shorter computation time. In addition, the 2-D DSA can continuously compute the 2-D DWT coefficients as soon as the samples become available at the IBU.