Design and FPGA Implementations of Four Orthogonal DWT Filter Banks Using Lattice Structures

In this paper, lattice structures for the DWT are introduced through the design and FPGA implementation of the orthogonal Daubechies filter banks of orders 2, 4, 6 and 8. Both dedicated multipliers and shift-add methods are used to perform the multiplications in these filter banks. Two implementation techniques are presented: pipelining, which is efficient in terms of throughput, and the area-efficient bit-serial technique. The results show that pipelining achieves a high throughput (2 output samples per clock) at the expense of area, while the bit-serial technique minimizes the allocated area at the expense of throughput, which may decrease as the filter order increases. Compared with other recent implementations, the designed filter banks, implemented on the SPARTAN-3E FPGA kit, reduce the implementation complexity to 0.584 to 0.712 of the corresponding values reported for different structures in recent hardware implementations. The resulting structures can also operate at high frequencies (up to 47.09 MHz).


Introduction
The wavelet transform represents a function by wavelets, providing a time-frequency representation of the signal. The transform was initially defined in continuous form, the Continuous Wavelet Transform (CWT), which gives the wavelet coefficients in a detailed manner. The redundancy that accompanies this type of transformation brought about the need for a sampled version of the wavelet transform [1], namely the Discrete Wavelet Transform (DWT). The DWT is a powerful tool for signal processing and analysis [2]. Its success stems from its ease of computation and its inherent decomposition of an image into non-overlapping subbands, which enables the design of efficient algorithms and allows for incorporation of the human visual system [3].
The DWT of a signal x[n] is computed by passing it through a series of filters. The samples are first passed through a lowpass filter with impulse response H_0[n], giving the approximation coefficients. The signal is simultaneously decomposed using a highpass filter G_0[n], giving the detail coefficients. Since each filter removes half the frequencies of the signal, the filter outputs are downsampled by 2 [4].
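As a minimal sketch of this filter-and-downsample step, the Python fragment below applies a lowpass/highpass pair to a signal and keeps every second output sample. The closed-form four-tap Daubechies coefficients, the alternating-flip rule used to derive the highpass filter, and the function name `dwt_level` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

# Four-tap Daubechies lowpass filter (standard closed form; assumed here).
S3 = np.sqrt(3.0)
H0 = np.array([1 + S3, 3 + S3, 3 - S3, 1 - S3]) / (4 * np.sqrt(2.0))
# Highpass filter from the alternating-flip rule g[n] = (-1)^n h[L-1-n].
G0 = H0[::-1] * np.array([1.0, -1.0, 1.0, -1.0])

def dwt_level(x, h=H0, g=G0):
    """One analysis level: filter, then keep every second sample."""
    x = np.asarray(x, dtype=float)
    approx = np.convolve(x, h)[::2]   # lowpass branch, downsampled by 2
    detail = np.convolve(x, g)[::2]   # highpass branch, downsampled by 2
    return approx, detail

x = np.arange(16, dtype=float)
approx, detail = dwt_level(x)
```

Because the filter pair is orthogonal, the total energy of the two coefficient sets equals the energy of the input when the full (zero-extended) convolution is used, which gives a quick sanity check on any implementation.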
The recent development of the DWT, which can be implemented efficiently with a filter bank, has provided a way to utilize wavelets efficiently for signal processing [2]. The DWT can be implemented with four different structures: the direct form, polyphase, lifting and lattice structures. Each of these structures has advantages and drawbacks that have to be weighed according to the specific application being implemented [5].
The availability of high-density, low-cost FPGA devices has given digital designers a great deal of flexibility to design custom digital architectures [6]. The FPGA is a hardware platform that can implement almost any hardware design from a description of the digital circuit written in a Hardware Description Language (HDL) [7].
The rest of this paper is organized as follows: Section 2 introduces the lattice structures for the DWT, and Section 3 explains the design of the lattice DWT for a group of Daubechies filter banks. Matlab 7.6 programming is used in Section 4 to verify the results and to estimate a suitable coefficient wordlength. Section 5 gives the results of implementation on a Xilinx FPGA device. A comparative study of the implemented filters with other implementations is given in Section 6. Finally, Section 7 concludes the paper.

Lattice Structures
Wavelet filter banks have highly efficient lattice structures that are easy to implement [5]. They preserve both modularity and regularity, with a high degree of parallelism that allows for high throughput [8].
The lattice structure has many advantages, such as a better response to coefficient quantization and a reduction by a factor of two in the number of stages needed for a given filter order [9]. This reduces the number of coefficients and thus the number of multipliers.
An efficient hardware implementation of the lattice structure of the wavelet transform can be accomplished using a number of Processing Elements (PEs): a chain of similar parts that work together to solve a single problem. They are arranged in a regular form, and the computational demand on each individual processing element is quite low. In the lattice structure, a processing element on both the analysis and synthesis sides consists of two adders and two multipliers. A single delay element appears before each PE on the analysis side and after it on the synthesis side, and a scaling gain appears at the end of each side. The i-th processing element on the analysis side of the lattice structure is shown in Fig. 1.
The lattice DWT offers a novel tool for the design of multiplexer-demultiplexer (Mux-Demux) information channels, such as data transmission in optical fiber networks and telemetric multichannel equipment, where parallel data channels are multiplexed into one signal, transmitted and then reconstructed [10]. Figure 2 shows an example of the dual-channel Mux-Demux arrangement using the lattice DWT synthesis and analysis parts. The design is carried out in the next section for the Daubechies filter banks of orders 2, 4, 6 and 8. These filter banks are orthogonal, satisfy the perfect reconstruction property [11], and have a large number of vanishing moments, which makes them effective feature extraction tools [12]. They also use overlapping windows, so the high-frequency coefficient spectrum reflects all high-frequency changes. Daubechies wavelets are therefore useful for compression and noise removal in audio signal processing [11].
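The processing element described above can be sketched in a few lines of Python. The sign convention used here (a 2x2 matrix with rows [1, alpha] and [-alpha, 1]) is a common one for orthogonal lattices and is an assumption; the exact signs in Fig. 1 may differ. The name `analysis_pe` is hypothetical.

```python
def analysis_pe(x_even, x_odd, alpha):
    """One lattice stage: two multiplications and two additions."""
    y0 = x_even + alpha * x_odd      # upper (even) output
    y1 = -alpha * x_even + x_odd     # lower (odd) output
    return y0, y1

# Each stage scales the signal energy by (1 + alpha^2); a single gain at
# the end of the chain can absorb this factor for all stages at once.
a, b, alpha = 3.0, -2.0, 0.5
y0, y1 = analysis_pe(a, b, alpha)
energy_ratio = (y0**2 + y1**2) / (a**2 + b**2)   # equals 1 + alpha**2
```

Under this convention the per-stage energy factor is what motivates placing one scaling gain at the end of each side rather than normalizing inside every PE.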

Filter Design
The first design step is to find the polyphase matrix H_p(z) of the specified filter bank and the corresponding matrix of the lattice structure. The filters' polyphase representation is expressed as [13] … (1). The transfer function of the polyphase structure is then written in matrix form (following the form given in Ref. [13]), where x_e and x_o denote the even and odd terms of the input x, respectively, and H_p(z) is the polyphase matrix. The two matrices are then compared in order to obtain the lattice coefficients.
Fig. 1: The i-th stage lattice DWT processing element. x_ei and x_oi are the even and odd inputs of the i-th stage, respectively, α_i is the value of the lattice coefficient at the i-th stage, and y_0i and y_1i are the forward even and odd results of the i-th stage, respectively.
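The polyphase idea can be illustrated numerically. The sketch below, which assumes a type-1 polyphase split (even-indexed and odd-indexed taps) and at least two input samples and two taps, checks that filtering followed by downsampling gives the same result as filtering the even and odd input streams separately with the even and odd filter taps. The function names are hypothetical.

```python
import numpy as np

def analysis_direct(x, h):
    """Filter at the full rate, then discard every second output."""
    return np.convolve(x, h)[::2]

def analysis_polyphase(x, h):
    """Same output computed at half rate from even/odd components."""
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    x_e, x_o = x[0::2], x[1::2]          # even and odd input samples
    h_e, h_o = h[0::2], h[1::2]          # even and odd filter taps
    upper = np.convolve(x_e, h_e)
    # The odd branch reaches the output one (half-rate) sample later.
    lower = np.concatenate(([0.0], np.convolve(x_o, h_o)))
    out = np.zeros(max(len(upper), len(lower)))
    out[:len(upper)] += upper
    out[:len(lower)] += lower
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
h = rng.standard_normal(4)
same = np.allclose(analysis_direct(x, h), analysis_polyphase(x, h))
```

The polyphase form performs the same arithmetic on half-rate streams, which is why it is the natural starting point for matching against the lattice matrix.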

Daubechies-2 filter bank lattice design
For this filter bank, H_0(z) and G_0(z) are given in [14]. From (3) and (4), the polyphase matrix is obtained, while the lattice transfer function of the analysis side, based on the structure given in Fig. 3, is given in matrix form as … (8).
Fig. 3: The analysis and synthesis sides of the Daubechies-2 lattice filter bank.
From (7) and (8), the lattice coefficients are readily obtained as … and ….

Daubechies-4 filter bank lattice design
For the Daubechies-4 filter bank, the two equations of H_0(z) and G_0(z) are [14] … (9) … (10), where …. After separating H_0(z) and G_0(z) into even and odd parts, the polyphase matrix is as follows: … (11), while the corresponding lattice matrix, based on the design given in Fig. 4, is … (12).
Fig. 4: The analysis and synthesis sides of the Daubechies-4 lattice filter bank.
The lattice coefficient values are thus obtained by comparing (11) and (12) as …, … and …. Note that adding an advance element at the end of the analysis side makes it possible to compare the two matrices of the polyphase and lattice structures.
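The matrix comparison can be checked numerically by expanding a lattice back into filter taps. The sketch below is one way to do this under an assumed convention: each stage is the matrix with rows [1, alpha] and [-alpha, 1], a delay precedes every stage after the first on the lower branch, and one gain normalizes the result. Under that convention the standard closed-form four-tap Daubechies filter corresponds to alpha_0 = sqrt(3) and alpha_1 = sqrt(3) - 2; the values and signs in the paper's own equations may differ if its convention differs. The function name `lattice_to_filters` is hypothetical.

```python
import numpy as np

def lattice_to_filters(alphas):
    """Expand lattice coefficients into lowpass/highpass filter taps."""
    alphas = np.asarray(alphas, dtype=float)
    h0 = np.array([1.0, alphas[0]])        # first stage applied to [1, z^-1]
    h1 = np.array([-alphas[0], 1.0])
    for a in alphas[1:]:
        kept = np.concatenate((h0, [0.0, 0.0]))      # upper branch, no delay
        delayed = np.concatenate(([0.0, 0.0], h1))   # lower branch times z^-2
        h0, h1 = kept + a * delayed, -a * kept + delayed
    gain = 1.0 / np.sqrt(np.prod(1.0 + alphas ** 2))  # scaling gain
    return gain * h0, gain * h1

s3 = np.sqrt(3.0)
h0, h1 = lattice_to_filters([s3, s3 - 2.0])
d4_taps = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
matches = np.allclose(h0, d4_taps)
```

The same routine can be used to confirm candidate coefficients for longer filters, since any coefficient set produces a lowpass/highpass pair that is orthogonal by construction.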

Daubechies-6 filter bank lattice design
For this filter bank, the LPF and HPF equations are [14] … (13) … (14), where …. After separating H_0(z) and G_0(z) into even and odd parts, the polyphase matrix and the corresponding lattice matrix are obtained. Note that two advance elements are placed at the end of both the upper and lower branches of the lattice structure of this filter bank for the purpose of comparison. These advance elements can be neglected in the implementation, as they have the same effect on both branches. The lattice coefficients are thus …, …, … and ….

Daubechies-8 filter bank lattice design
After separating H_0(z) and G_0(z) into even and odd parts, the polyphase matrix is obtained, and the values of the lattice coefficients follow as …, …, …, … and …. Note that the signs of the lattice multiplier coefficients alternate between stages, and that the values of the multiplier coefficients α_i decrease as i increases.
Comparing the analysis and synthesis sides, it can also be noted that the delay element is moved from the beginning of the lower branch to the end of the upper branch of each stage, the coefficient α_i remains in place on the two sub-branches but with reversed signs, and the first analysis stage becomes the last synthesis stage. This is due to the time reversal of the analysis side that must be performed on the synthesis side in order to satisfy the perfect reconstruction condition.
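The relationship between the two sides can be sketched as follows. The code assumes the stage matrix with rows [1, alpha] and [-alpha, 1] on the analysis side and its transpose on the synthesis side, with the delay placement described above; these conventions and all function names are assumptions for illustration. With J delays on each side, the reconstruction equals the input delayed by J half-rate samples.

```python
import numpy as np

def delay(s):
    """Delay a stream by one sample (zero initial state)."""
    return np.concatenate(([0.0], s[:-1]))

def lattice_analysis(x_e, x_o, alphas):
    alphas = list(alphas)
    gain = 1.0 / np.sqrt(np.prod(1.0 + np.asarray(alphas) ** 2))
    u, v = x_e + alphas[0] * x_o, -alphas[0] * x_e + x_o
    for a in alphas[1:]:
        v = delay(v)                     # delay before the PE, lower branch
        u, v = u + a * v, -a * u + v
    return gain * u, gain * v

def lattice_synthesis(u, v, alphas):
    alphas = list(alphas)
    gain = 1.0 / np.sqrt(np.prod(1.0 + np.asarray(alphas) ** 2))
    for a in reversed(alphas[1:]):       # last analysis stage is undone first
        u, v = u - a * v, a * u + v      # same alpha, reversed signs
        u = delay(u)                     # delay after the PE, upper branch
    a = alphas[0]
    u, v = u - a * v, a * u + v
    return gain * u, gain * v

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
x_e, x_o = x[0::2], x[1::2]
alphas = [np.sqrt(3.0), np.sqrt(3.0) - 2.0]
r_e, r_o = lattice_synthesis(*lattice_analysis(x_e, x_o, alphas), alphas)
err = max(np.max(np.abs(r_e[1:] - x_e[:-1])), np.max(np.abs(r_o[1:] - x_o[:-1])))
```

Here `err` measures the reconstruction error after accounting for the one-sample delay of this two-stage example.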

Matlab Representation
The filter bank designs are described using Matlab programs for verification. The results show that perfect reconstruction is achieved when the analysis-synthesis operations are performed using the lattice structures; that is, the difference between the original and reconstructed signals is zero. Matlab programs are also used to find the minimum wordlength of the lattice coefficients that gives acceptable PSNR values, taken here to be about 30 dB. This wordlength is found to be 6 bits, as indicated by the PSNR readings taken for the designed filter banks using a group of gray-scale test images, as shown in Fig. 7.
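A wordlength study of this kind rests on two small operations: rounding a coefficient to a fixed number of bits and measuring the PSNR between two images. The Python sketch below shows both, assuming that the bit count refers to fractional bits and that images are 8-bit (peak value 255); the paper does not state how its 6 bits are allocated, so these choices and the function names are assumptions.

```python
import numpy as np

def quantize(value, frac_bits):
    """Round to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return np.round(np.asarray(value, dtype=float) * scale) / scale

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two arrays."""
    reference = np.asarray(reference, dtype=float)
    test = np.asarray(test, dtype=float)
    mse = np.mean((reference - test) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

coeff = np.sqrt(3.0)                 # example coefficient value (assumed)
coeff_q = quantize(coeff, 6)         # 111/64 = 1.734375
```

A sweep over `frac_bits`, running the analysis-synthesis chain with quantized coefficients on each test image and recording `psnr`, would reproduce the kind of curve the text attributes to Fig. 7.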

FPGA Implementations
Two techniques are used to implement the DWT with the designed lattice structures. The first is pipelining, which gives the architecture a throughput of two output samples every clock cycle. The second is the bit-serial implementation technique, which minimizes the area by using a single processing element to perform the whole task, at the cost of increased processing time. The filter bank architectures are implemented on a Xilinx FPGA device, the SPARTAN-3E kit. This device has a capacity of 4656 logic slices in 1164 CLBs (each CLB contains four slices). It has 20 dedicated multipliers and 20 BRAMs, and can operate at a maximum clock speed of 50 MHz.

Pipelining implementation results
Pipelining is a technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments. The technique is efficient for those applications that need to repeat the same task many times with different sets of data [15]. Therefore, the pipeline technique is used to implement the lattice structure that consists of a sequence of similar PEs, in order to increase the amount of accomplished processing during a given time interval.
The pipelined implementation requires two inputs every clock cycle and, once the first input has been processed, delivers two output samples per clock cycle; as the filter order increases, the implementation cost also increases. Because the lattice structure has two output branches, this implementation can deliver two output samples every clock cycle, which is a very important property for real-time applications.
Table 1 summarizes the implementation results on the SPARTAN-3E device for two cases: using the dedicated multipliers and using shift-add operations to multiply the lattice coefficients by the data. Table 1 shows that using shift-add operations for multiplication increases the number of occupied slices while making no use of the dedicated multipliers available in the device.
Table 1: Pipelining area results.
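The trade-off between dedicated multipliers and shift-add logic can be made concrete with a small model of constant multiplication. In the sketch below, a fixed-point coefficient is decomposed into its set bits, and each set bit contributes one shifted copy of the data to a sum; in hardware each such term costs an adder built from slices, which is one plausible reading of why the slice count rises. The 6-bit fractional format, the example coefficient and the function name are assumptions.

```python
def shift_add_multiply(x, coeff):
    """Multiply integer x by integer constant coeff using shifts and adds."""
    negative = coeff < 0
    coeff = abs(coeff)
    acc, shift = 0, 0
    while coeff:
        if coeff & 1:              # one adder per set bit of the constant
            acc += x << shift
        coeff >>= 1
        shift += 1
    return -acc if negative else acc

FRAC_BITS = 6
coeff_fixed = round(1.7320508 * (1 << FRAC_BITS))              # 111
product = shift_add_multiply(100, coeff_fixed) >> FRAC_BITS    # about 100 * 1.73
```

With these example values the constant 111 has six set bits, so the product is formed from six shifted terms.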

Bit-serial implementation results
In the bit-serial implementation, a single processing element performs the whole task: each pair of input samples is folded through this processing element a number of times equal to the number of stages in the filter structure. The cost is lower than that of the pipelined implementation, but as the filter order increases, the time needed to process a pair of input samples also increases.
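The folding described above can be modeled in software as one PE routine called once per stage for every input pair, with a small register file holding the inter-stage delays. The sketch below models only this reuse of a single PE, not bit-level serial arithmetic. The stage convention, the pairing of x[2k] with x[2k-1], and the names `pe` and `folded_lattice` are assumptions for illustration.

```python
import numpy as np

def pe(a, b, alpha):
    """The single shared processing element."""
    return a + alpha * b, -alpha * a + b

def folded_lattice(x, alphas):
    """Analysis lattice that reuses one PE for every stage."""
    x = np.asarray(x, dtype=float)
    gain = 1.0 / np.sqrt(np.prod(1.0 + np.asarray(alphas) ** 2))
    delays = [0.0] * len(alphas)     # delays[m] precedes stage m (m >= 1)
    prev_odd = 0.0
    low, high, pe_calls = [], [], 0
    for k in range(0, len(x), 2):
        u, v = x[k], prev_odd        # input pair (x[2k], x[2k-1])
        prev_odd = x[k + 1] if k + 1 < len(x) else 0.0
        for m, a in enumerate(alphas):
            if m > 0:
                v, delays[m] = delays[m], v   # read old value, store new
            u, v = pe(u, v, a)
            pe_calls += 1            # one pass through the PE per stage
        low.append(gain * u)
        high.append(gain * v)
    return np.array(low), np.array(high), pe_calls

s3 = np.sqrt(3.0)
x = np.arange(16, dtype=float)
low, high, pe_calls = folded_lattice(x, [s3, s3 - 2.0])
```

The count `pe_calls` grows in proportion to the number of stages, which mirrors the statement that processing time per input pair increases with filter order.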
The results of implementing using the SPARTAN-3E device are shown in Table 2 in the two cases of using the dedicated multipliers and the shift-add operations.

A Comparative Study
A comparison is made with other implementations of two of the four Daubechies filter banks that use other wavelet structures: those given in Refs. [5], [9] and [16] for the Daubechies-4 filter bank and in Ref. [9] for the Daubechies-8 filter bank. The comparison indicates that a smaller area can be achieved using the implemented lattice structures, which also reach a higher circuit frequency than these earlier works. Table 3 shows the comparison with these references using the same devices, while Table 4 compares the implementation of the Daubechies-4 filter bank on the SPARTAN-3E device with other implementations on ALTERA devices.

Conclusions
The lattice structures for the DWT have been designed and tested through the implementation of the Daubechies filter banks. The results show that they are very appropriate DWT structures from the point of view of area. The filter banks have been implemented and the results tested on different FPGA devices, with the optimization goal of the synthesizer set to area. The results of implementing the designed filter banks on the SPARTAN-3E FPGA kit show an efficient implementation with minimum hardware utilization, with the area reduced to 0.584 to 0.712 of that of the corresponding recent implementations of the same filter banks.
From the comparative tables in the previous section, it can be seen that the pipelined implementation of the lattice structure is very efficient and is suitable for high-speed applications (minimum delay = 0.02 secs.). When area is the constraint, however, the bit-serial implementation becomes the most suitable lattice implementation scheme as the filter order increases, bearing in mind that the increased processing time for a given task is a disadvantage of this scheme.