A device capable of executing interlocked fixed-point arithmetic logic unit (ALU) instructions in parallel with the instructions that cause the execution interlock is presented. The device incorporates the design of a 3-1 ALU and can execute two's complement, unsigned binary, and binary logical operations. It is shown that status for ALU operations using a 3-1 ALU can be determined in parallel, so the proposed device complies with the predetermined architectural behavior of single-instruction execution. The device requires no more logic stages than a 3-1 binary adder built from a carry-save adder (CSA) followed by a carry-lookahead adder (CLA). On a conventional machine, interlocked instructions are executed serially, as on any serial machine, so executing them takes more than one cycle and performance degrades. The solution is a special kind of device called the Interlock Collapsing ALU (ICALU). Unlike a conventional 2-1 ALU, the ICALU is a 3-1 ALU. Unlike other parallel machines, the proposed device executes the interlocked instructions in a single instruction cycle, resulting in high performance. The resulting implementation demonstrates that the proposed 3-1 ICALU can be designed to outperform existing ICALU schemes by a factor of at least two. The ICALU is implemented in VHDL, and its functionality is verified through simulation.
Introduction
1. Overview
Parallel processing enables multiple instructions to execute simultaneously, improving computational speed. However, interlocked instructions (those with data dependencies) cannot be processed in parallel, degrading performance and underutilizing hardware. To address this, the concept of an Interlock Collapsing ALU (ICALU) is introduced, which allows dependent instructions to be collapsed and executed together, reducing instruction cycles.
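As a rough illustration of what "interlocked" means here, the following Python sketch checks whether a second instruction reads a register that the first one writes. The `Instr` type, the `is_interlocked` helper, and the two-address format (the destination register doubles as the first source operand) are assumptions made for this example and are not part of the original design.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str  # destination register; assumed to also be the first source operand
    src: str   # second source operand


def is_interlocked(first: Instr, second: Instr) -> bool:
    """Return True if `second` reads the register that `first` writes."""
    return first.dest in (second.dest, second.src)


# Hypothetical instruction pairs used only for illustration.
independent = (Instr("ADD", "R1", "R2"), Instr("ADD", "R4", "R3"))
dependent = (Instr("ADD", "R2", "R1"), Instr("ADD", "R3", "R2"))

print(is_interlocked(*independent))  # False: no shared register
print(is_interlocked(*dependent))    # True: the second ADD reads R2
```

A pair for which this check returns True is the kind of pair the ICALU is meant to collapse.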
2. Key Innovation: Interlock Collapsing ALU (ICALU)
Traditional ALUs are 2-1 ALUs, handling two inputs per operation.
The proposed ICALU is a 3-1 ALU, capable of executing three-operand instructions, which allows collapsing of interlocked instruction pairs into one.
Implemented using a carry-save adder (CSA) followed by a carry-lookahead adder (CLA), with additional logic stages.
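The CSA-then-CLA arrangement can be sketched behaviorally in Python. This is a minimal model of three-operand addition only, assuming a 32-bit word; the function names are hypothetical, and Python's built-in `+` stands in for the carry-lookahead adder rather than modeling its gate structure.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit word size


def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """3-to-2 compression: reduce three operands to a sum word and a carry word."""
    partial_sum = a ^ b ^ c                     # per-bit sum, no carry propagation
    carry = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, moved to the next bit
    return partial_sum & MASK, carry & MASK


def three_to_one_add(a: int, b: int, c: int) -> int:
    """Add three operands: one CSA stage, then one carry-propagating add."""
    s, cy = carry_save_add(a, b, c)
    return (s + cy) & MASK  # `+` stands in for the CLA


print(three_to_one_add(5, 7, 9))           # 21
print(three_to_one_add(0xFFFFFFFF, 1, 1))  # 1 (wraps modulo 2**32)
```

The point of the CSA stage is that it has no carry chain, so the only carry propagation happens once, in the final adder.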
3. Design & Implementation
Implemented in VHDL and verified using simulation.
ALU1 is a conventional 2-1 ALU.
ICALU includes ALU1 and an extended logic unit for collapsing instructions.
Functional models are designed for each block, followed by optimization of delay and gate logic.
Reduced ICALU Model:
Optimized version with removed multiplexers and improved logic blocks.
Control Logic:
Control signals (e.g., K12, K13) guide the ICALU to perform arithmetic or logical operations based on instruction type.
4. Performance Optimization
The ICALU adds two logic stages (the CSA and the post-CLA logic) to the conventional ALU1.
Simulations validate ICALU across different operation sequences:
Arithmetic followed by Arithmetic
Arithmetic followed by Logical
Logical followed by Arithmetic
Logical followed by Logical
The simulation outputs show that the ICALU processes interlocked sequences more efficiently by merging each pair into a single operation.
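The four sequence categories above can be expressed as a small Python reference model of the kind one might compare simulation output against. This is a sketch under assumptions: the operation names, the 32-bit mask, and the operand order (the intermediate result is the left operand of the second operation) are illustrative choices, not details taken from the VHDL design.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit word size

ARITHMETIC = {
    "ADD": lambda x, y: (x + y) & MASK,
    "SUB": lambda x, y: (x - y) & MASK,
}
LOGICAL = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}
OPS = {**ARITHMETIC, **LOGICAL}


def category(op: str) -> str:
    return "Arithmetic" if op in ARITHMETIC else "Logical"


def classify(op1: str, op2: str) -> str:
    """Name the sequence category of an interlocked pair."""
    return f"{category(op1)} followed by {category(op2)}"


def collapsed(op1: str, op2: str, a: int, b: int, c: int) -> int:
    """Reference result of the collapsed pair: (a op1 b) op2 c."""
    return OPS[op2](OPS[op1](a, b), c)


print(classify("ADD", "AND"), collapsed("ADD", "AND", 6, 3, 12))  # Arithmetic followed by Logical 8
print(classify("AND", "OR"), collapsed("AND", "OR", 12, 10, 1))   # Logical followed by Logical 9
```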
6. Instruction Execution Example
Non-Interlocked (Conventional Parallel Machine):
Instructions execute in parallel as they are independent:
ADD R1, R2
ADD R4, R3
Interlocked (Using ICALU):
Interdependent instructions are merged:
ADD R2, R1
ADD R3, R2
becomes
ADD R3, R2, R1
ICALU allows collapsing these into a single instruction, maintaining correctness while improving execution speed.
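To see why collapsing preserves correctness, the pair can be modeled in Python both ways. This sketch assumes that `ADD Rd, Rs` means Rd = Rd + Rs, that ALU1 still executes the first instruction while the ICALU computes the three-operand result, and that words are 32 bits wide; the function names and register values are hypothetical.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit word size


def run_serial(regs: dict) -> dict:
    """Serial execution: the second ADD waits for the first to write R2."""
    r = dict(regs)
    r["R2"] = (r["R2"] + r["R1"]) & MASK  # ADD R2, R1
    r["R3"] = (r["R3"] + r["R2"]) & MASK  # ADD R3, R2 (reads the new R2)
    return r


def run_collapsed(regs: dict) -> dict:
    """Collapsed execution: both results come from the original register values."""
    r = dict(regs)
    new_r2 = (regs["R2"] + regs["R1"]) & MASK               # ALU1: ADD R2, R1
    new_r3 = (regs["R3"] + regs["R2"] + regs["R1"]) & MASK  # ICALU: ADD R3, R2, R1
    r["R2"], r["R3"] = new_r2, new_r3
    return r


start = {"R1": 10, "R2": 20, "R3": 30}
print(run_serial(start))     # {'R1': 10, 'R2': 30, 'R3': 60}
print(run_collapsed(start))  # {'R1': 10, 'R2': 30, 'R3': 60}
```

Both paths leave the register file in the same state; the collapsed path simply does not need the updated R2 before computing R3.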
7. Results
RTL Schematics for ALU1, ICALU, and full ICUNIT show architectural design.
Simulation outputs verify functional accuracy for various instruction patterns.
Execution delay is significantly reduced with the ICALU.
Conclusion
The objective of the thesis was the execution of interlocked instructions in one instruction cycle. This was achieved with the ICALU, which was successfully designed and implemented using VHDL in Xilinx. Its functionality was verified through simulation. The ICALU can be implemented in just 3 logic delays more than a conventional 2-1 ALU. The performance of an ordinary parallel machine without the ICALU was compared with that of a machine incorporating the ICALU.
This work was designed for a 32-bit word size, and the results were evaluated for parameters such as delay. It can be extended to larger word sizes. New architectures can be designed to reduce the delay of the circuit. Steps may also be taken to optimize other parameters such as area, power, frequency, and number of gate clocks.