A Prime-Factor Algorithm for the Inverse Discrete Sine Transform and its Architecture

Ajit Kumar Patro¹, Dr. Sudhansu Sekhar Nayak², Nibedita Sahoo³

¹Research Scholar, Centurion University of Technology and Management, Paralakhemundi, Odisha, India
²Professor, Centurion University of Technology and Management, Paralakhemundi, Odisha, India
³Associate Professor, Roland Institute of Technology, Berhampur, Odisha, India

Abstract: In this paper, we have suggested a simple scheme for prime-factor decomposition of Inverse Discrete Sine Transform (IDST) and systolic mesh architecture for its efficient implementation. In the systolic architecture the transposition of the intermediate matrix is avoided in this structure by orthogonal processing during the pair of matrix multiplication i.e., if the processing for the first matrix multiplication takes place along X-direction, the processing for the second matrix multiplication is carried out along Y-direction. Due to this feature, the systolic architecture is highly compact, offers saving for transposition hardware and at the same time yields high throughput with less latency. The proposed systolic architecture has very less area complexity, less computation time and very high VLSI performance measures compared with existing structures. Due to orthogonal nature of the Discrete Sine Transform (DST), the forward transform may however be realized by the transpositions of the inverse transform.

Keywords: Prime-factor decomposition, Inverse Discrete Sine Transform (IDST), VLSI, Systolic architecture, Orthogonal processing.

I. INTRODUCTION

The prime-factor decomposition technique is popularly used for fast computation of digital convolution and discrete orthogonal transforms. Prime-factor decomposition approach has, therefore, been tried for efficient computation of the DST. The main theoretical rationale of this technique is to convert N-point DST into a two dimensional \( (N_1 \times N_2) \)-point DST by employing certain index mapping where \( N = (N_1 \times N_2) \). Then we can deal with the resulting groups of small size problems in each dimension [1]. In a DSP processor the memory for data storage is always expensive. By the prime-factor approach it is feasible to implement the long-length DST by processor of small memory as short-length DSTs are implemented one after the other. In addition, when this approach is combined with efficient short-length algorithms the computational complexity is reduced considerably. In previous paper it is derived prime-factor DCT algorithm based on various DFT algorithms which require complex number multiplications [2]. A prime factor decomposed algorithm was proposed for fast computation of DST. A prime-factor DCT algorithm is implemented which included only real-number multiplications. However, its index mapping was complicated. Few paper presented input and output index mapping for a prime-factor decomposed computation of DCT. Chakrabarti developed a systolic architecture implementing Lee’s algorithm [3]. They wanted to compute the DCT from DHT. So they modified the index mappings which are essentially the same as Lee’s. However, they did not discuss the actual implementation of these index mappings. In Section II, we have presented prime-factor decomposition of IDST.

II. PRIME-FACTOR DECOMPOSITION OF IDST

The inverse discrete sine transform (IDST) [4] may be written as

\[
x(n) = \sum_{k=1}^{N} X(k) \sin \left[ \frac{\pi (2n-1)k}{2N} \right] \text{ for } n = 1, 2, \ldots, N
\]  \hspace{1cm} (1)

where transform length \( N = N_1 \times N_2 \), \( N_1 \) and \( N_2 \) being relatively prime, the input index \( k \) in equation (1) may be mapped into \( (k_1, k_2) \) as

\[
k = (N_2 k_1 + N_1 k_2) \mod N
\]  \hspace{1cm} (2)

For \( N_2 k_1 + N_1 k_2 < N \), using equation (2), equation (1) may be expressed as
\[ x(n) = \sum_{k_1=1}^{N_1} \sum_{k_2=1}^{N_2} X(k_1, k_2) \sin \left[ \frac{\pi(2n-1)(N_2 k_1 + N_1 k_2)}{2N} \right] \] (3)

\[ x(n) = \sum_{k_1=1}^{N_1} \sum_{k_2=1}^{N_2} X(k_1, k_2) \sin \left[ \frac{\pi(2n-1)k_1}{2N_1} + \frac{\pi(2n-1)k_2}{2N_2} \right] \] (4)

\[ x(n) = \sum_{k_1=1}^{N_1} \sum_{k_2=1}^{N_2} X(k_1, k_2) \left\{ \sin \left[ \frac{\pi(2n-1)k_1}{2N_1} \right] \cos \left[ \frac{\pi(2n-1)k_2}{2N_2} \right] + \cos \left[ \frac{\pi(2n-1)k_1}{2N_1} \right] \sin \left[ \frac{\pi(2n-1)k_2}{2N_2} \right] \right\} \] (5)

Equation (3) may otherwise be expressed as

\[ x(n) = \sum_{k_1=1}^{N_1} \sum_{k_2=1}^{N_2} (-1)^{n+1} [X(k_1, N_2 - k_2) + X(N_1 - k_1, k_2)] \sin \left[ \frac{\pi(2n-1)k_1}{2N_1} \right] \sin \left[ \frac{\pi(2n-1)k_2}{2N_2} \right] \] (6)

For \( N_2 k_1 + N_1 k_2 > N \), equation (1) may be expressed as

\[ x(n) = \sum_{k_1=1}^{N_1} \sum_{k_2=1}^{N_2} X(k_1, k_2) \sin \left[ \frac{\pi(2n-1)(N_2 k_1 + N_1 k_2 - N)}{2N} \right] \] (7)

\[ x(n) = (-1)^{n+2} \sum_{k_1=1}^{N_1} \sum_{k_2=1}^{N_2} X(k_1, k_2) \cos \left[ \frac{\pi(2n-1)k_1}{2N_1} + \frac{\pi(2n-1)k_2}{2N_2} \right] \] (8)

\[ x(n) = (-1)^{n+2} \sum_{k_1=1}^{N_1} \sum_{k_2=1}^{N_2} X(k_1, k_2) \left\{ \cos \left[ \frac{\pi(2n-1)k_1}{2N_1} \right] \cos \left[ \frac{\pi(2n-1)k_2}{2N_2} \right] - \sin \left[ \frac{\pi(2n-1)k_1}{2N_1} \right] \sin \left[ \frac{\pi(2n-1)k_2}{2N_2} \right] \right\} \] (9)

\[ x(n) = (-1)^{n+2} \sum_{k_1=1}^{N_1} \sum_{k_2=1}^{N_2} X(N_1 - k_1, N_2 - k_2) \sin \left[ \frac{\pi(2n-1)k_1}{2N_1} \right] \sin \left[ \frac{\pi(2n-1)k_2}{2N_2} \right] \]
\[ - X(k_1, k_2) \sin \left[ \frac{\pi(2n-1)k_1}{2N_1} \right] \sin \left[ \frac{\pi(2n-1)k_2}{2N_2} \right] \] (10a)

\[ x(n) = (-1)^{n+2} \sum_{k_1=1}^{N_1} \sum_{k_2=1}^{N_2} [X(N_1 - k_1, N_2 - k_2) - X(k_1, k_2)] \sin \left[ \frac{\pi(2n-1)k_1}{2N_1} \right] \sin \left[ \frac{\pi(2n-1)k_2}{2N_2} \right] \] (10b)

Equation (6) and (10) may be expressed as

\[ x(n) = \sum_{k_1=1}^{N_1} \sum_{k_2=1}^{N_2} y(k_1, k_2) \sin \left[ \frac{\pi(2n-1)k_1}{2N_1} \right] \sin \left[ \frac{\pi(2n-1)k_2}{2N_2} \right] \] (11)

where

\[ y(k_1, k_2) = X(k_1, N_2 - k_2) + X(N_1 - k_1, k_2) \text{ for } N_2 k_1 + N_1 k_2 < N \text{ and } n \text{ is odd} \]

\[ = -X(k_1, N_2 - k_2) + X(N_1 - k_1, k_2) \text{ for } N_2 k_1 + N_1 k_2 < N \text{ and } n \text{ is even} \]

©IJRASET (UGC Approved Journal): All Rights are Reserved 2714
\[ X(N_1 - k_1, N_2 - k_2) = X(k_1, k_2) \text{ for } N_2k_1 + N_1k_2 > N \text{ and } n \text{ is even} \]
\[ X(k_1, k_2) - X(N_1 - k_1, N_2 - k_2) \text{ for } N_2k_1 + N_1k_2 > N \text{ and } n \text{ is odd} \] (12)

The output index \( n \) in equation (11) may be mapped into \((n_1, n_2)\) as [5]
\[ n_i = \begin{cases} \bar{n}_i & \text{if } \bar{n}_i < N_i \text{ for } i = 1 \text{ and 2} \\ 2N_i - 1 - \bar{n}_i, & \text{otherwise, where } \bar{n}_i = n \mod N_i \end{cases} \] (13)

Using equation (12) and (13), equation (11) can be written as
\[ x(n_1, n_2) = \sum_{k_1=1}^{N_1} \sum_{k_2=1}^{N_2} y(k_1, k_2) \sin \left( \frac{\pi (2n_1 - 1) k_1}{2N_1} \right) \sin \left( \frac{\pi (2n_2 - 1) k_2}{2N_2} \right) \] (14)

Due to orthogonal nature of the DST, the forward transform, may however, be realized by the transpose of the inverse transform. In Section III, we have proposed a systolic architecture for implementation of the IDST.

### III. SYSTOLIC ARCHITECTURE FOR IMPLEMENTATION OF THE IDST

Equation (14) may be expressed in a form
\[ X^T = D[CY]^T \] (15)

Where \( X \) and \( Y \) are matrices of size \( N_1 \times N_2 \) while \( C \) and \( D \) are matrices of size \( N_1 \times N_2 \) and \( N_2 \times N_2 \) respectively, which represent the transform kernels. For avoiding the transpose operation, equation (14) may otherwise be expressed as
\[ X_{im} = \sum_{j=1}^{N_2} Z_{ij} D_{mj} \] (16a)

for
\[ Z_{ij} = \sum_{l=1}^{N_1} C_{il} Y_{lj} \] (16b)

where \( l = 1, 2, \ldots, N_1 \) and \( m = 1, 2, \ldots, N_2 \)

\[ Y_{lj} = y(i, j) \text{ as given by equation (12)} \]

\[ C_{il} = \sin \left( \frac{\pi (2l - 1) i}{2N_1} \right) \] (17)

\[ D_{mj} = \sin \left( \frac{\pi (2m - 1) j}{2N_2} \right) \] (18)

As given by equation (16), the prime-factor IDST may be computed two distinct stages. In the first stage, one has to perform multiplication of matrix \([Y_{lj}]\) of size \( N_1 \times N_2 \) with \( N_1 \)-point transform kernel \([C_{il}]\) of size \( N_1 \times N_1 \) to obtain an intermediate matrix \([Z_{ij}]\) of size \( N_1 \times N_2 \) (equation 16b). In the second stage each row of intermediate matrix \([Z_{ij}]\) is multiplied with the rows of \( N_2 \)-point transform kernel \([D_{mj}]\) of size \( N_2 \times N_2 \) (equation 16a). Multiplication of both these stages may be mapped into systolic mesh containing \( N_1 \times N_2 \) PEs for fully pipelined processing. The proposed systolic structure for computing \( N \)-point transform, \( N = (N_1 \times N_2) \), is shown in Fig. 1. The function of each PE is depicted in Fig. 2. The first row of the \( N_1 \)-point transform kernel \([C_{il}]\) is fed of the first array. The successive rows of the transform kernels are fed to the successive arrays, staggered by one time-step with respect
to the preceding one. The first column of the input matrix \([Y_{ij}]\) is fed to the first processing element (PE) of the first array [6]. The successive columns are fed to the successive processing PE’s of the first array in subsequent time-steps. The columns of the \(N_2\)-point transform kernel \([D_{mj}]\) are fed of the different PEs of the first array at the lag of \(N_1\) time-steps with respect to the corresponding columns of the matrix \([Y_{ij}]\). In the first stage of computation, the \((j+1)\)th PE of \((l+1)\)th array computes an element \([Z_{ij}]\) of the intermediate matrix in \(N_1\) time-steps, where a PE performs a multiplication and adds the result to the content of the accumulator \(A_1\), in every time-step. At the end \(N_1\) time-steps, the accumulator content is transferred to the accumulator \(A_2\) while \(A_1\) is reset to zero. In the second stage of computation, in each array, the multiplication of one row of the intermediate matrix \([Z_{ij}]\) with \(N_2\)-point transform kernel \([D_{mj}]\) is performed to obtain \(N_2\) number of desired components [7].

IV. HARDWARE AND THROUGHPUT CONSIDERATIONS

Fig. 1: Systolic Architecture for Implementation of the IDST
The systolic architecture for implementation of IDST as shown in Fig. 1 requires $N_1 \times N_2$ number of identical PEs. In each PE there are two multipliers, two adders and two accumulators. The area complexity of the proposed structure is $O(N_1 \times N_2)$. The first transform component is obtained after $N_1 + N_2$ time-steps. The first set of transform components is obtained after $2(N_1 + N_2 - 1)$ time-steps. However, the successive sets of transform components are obtained in every $N_2$ time-step interval. The throughput rate of the proposed structure would, therefore, be $R = \left(\frac{N_1}{T}\right)$ where $T$ is the duration of a time-step, given by $T = T_m + T_a$. $T_m$ and $T_a$ are, respectively, the time required for performing a real multiplication and a real addition in the PE. It is interesting to note that the transposition of the intermediate matrix is avoided in this structure by orthogonal processing during the pair of matrix multiplication i.e., if the processing for the first matrix multiplication takes place along X-direction, the processing for the second matrix multiplication is carried out along Y-direction. Due to this feature, the structure is highly compact, offers saving for transposition hardware and at the same time yields high throughput with less latency [8].

A. Algorithm
1) $A_1 = A_2$
   $A_1 = 0$

2) $A_1 = A_1 + C_{in} Z_{in}$
   $X_{out} = X_{in} + A_2 D_{in}$
   $C_{out} = C_{in}$
   $D_{out} = D_{in}$
   $Z_{out} = Z_{in}$
   Count = Count + 1
   if (Count = $N_1$) go to 1
   else
     go to 2
   end if

Table 1 comparison of area complexity, computation time and vlsi performance measure of the structures for computing prime-factor idst of size $n \times n$.

<table>
<thead>
<tr>
<th>Structures</th>
<th>Area Complexity (A)</th>
<th>Computation Time ($\tau$)</th>
<th>VLSI Performance Measure $(A\tau^2)$</th>
</tr>
</thead>
<tbody>
<tr>
<td>Structure $^6$</td>
<td>$3N^2$</td>
<td>$3N$</td>
<td>$27N^4$</td>
</tr>
<tr>
<td>Structure $^7$</td>
<td>$(N^2 - 1)/2$</td>
<td>$(N^2 + 1)/2$</td>
<td>$(N^8 + 2N^6 - 2N^2 - 1)/8$</td>
</tr>
<tr>
<td>Systolic Architecture for IDST (Fig. 1)</td>
<td>$N^2$</td>
<td>$2N$</td>
<td>$4N^4$</td>
</tr>
</tbody>
</table>
V. CONCLUSION

We have suggested a simple scheme for prime-factor decomposition of IDST and systolic mesh architecture for its implementation. It is interesting to note that the transposition of the intermediate matrix is avoided in this structure by orthogonal processing during the pair of matrix multiplication i.e., if the processing for the first matrix multiplication takes place along X-direction, the processing for the second matrix multiplication is carried out along Y-direction. Due to this feature, the structure is highly compact, offers saving for transposition hardware and at the same time yields high throughput with less latency.

REFERENCES


AUTHORS

Mr. Ajit Kumar Patro as received his B.E. (EIE) and M.Tech (ECE) degrees in 2004 and 2008 respectively from Biju Patnaik University of Technology, Rourkela, Odisha. Presently pursuing Ph.D (ECE) in Centurion University of Technology and Management, Paralakhemundi, Odisha and working as an Assistant Professor in Electronics Engineering Department of Gandhi Institute of Engineering and Technology (Autonomous), Gunupur, Odisha, India.

Prof. (Dr.) Sudhansu Sekhar Nayak received M.Sc. (Physics), MCA, Ph.D degrees in 1973, 2006 and 2001 respectively. At Present he is working as Professor in Centurion University of Technology and Management, Paralakhemundi, Odisha, India.

Mrs. Nibedita Sahoo as received her B.E. (EE) degree in 2003 from University College of Engineering, Burla, Odisha and M.Tech (EIE) degree in 2010 from Biju Patnaik University of Technology, Rourkela, Odisha. Presently working as an Associate Professor in Electrical & Electronics Engineering Department of Roland Institute of Technology, Berhampur, Odisha, India.