# INTERNATIONAL JOURNAL FOR RESEARCH IN APPLIED SCIENCE & ENGINEERING TECHNOLOGY Volume: 2 Issue: XI Month of publication: November 2014 DOI: www.ijraset.com Call: © 08813907089 E-mail ID: ijraset@gmail.com ### International Journal for Research in Applied Science & Engineering Technology (IJRASET) ### Public Key Cryptography on Modern Graphics Hardware Jaspal Singh Ramola<sup>1</sup>, Vikram Singh<sup>2</sup> (B.tech 3rd sem ) Dept. of Computer Science Engineering, Dronacharya College of Engineering, Gurgaon, India Abstract. Graphics processing units (GPUs) are increasingly being used for general purpose pro-cessing. We present implementations of large integer modular exponentiation, the core of public-key cryptosystems such as RSA, on a DirectX 10 compliant GPU. DirectX 10 compliant graphics pro-cessors are the latest generation of GPU architecture, which provide increased programming flex-ibility and support for integer operations. We present high performance modular exponentiation implementations based on integers represented in both standard radix form and residue number system (RNS) form. We show how a GPU implementation of a 1024-bit RSA decrypt primitive can outperform for the first time a comparable CPU implementation by up to 4 times. We present how an adaptive approach to modular exponentiation involving implementations based on both a radix and a residue number system gives the best all-around performance on the GPU. We also highlight the criteria necessary to allow the GPU to improve the performance of public key cryptographic operations. #### I. INTRODUCTION The graphics processing unit (GPU) has enjoyed a large increase in floating point performance compared with the CPU in the last number of years. The traditional CPU has leveled off in terms of clock frequency as power and heat concerns increasingly become dominant restrictions. The latest GPU from Nvidia's GT200 series reports a peak throughput of almost 1 TeraFlop, whereas the latest Intel CPUs reported throughput is in the order of 100 GFlops [1]. This competitive advantage of the GPU comes at the price of a decreased applicability to general purpose computing. The latest generation of graphics processors, which are DirectX 10 [2] compliant, support integer processing and give more control over the processor's threading and memory model compared to previous GPU generations. We use this new generation of GPU to accelerate public key cryptography. In particular we use an Nvidia 8800GTX GPU with CUDA [3] to investigate the possibility of high speed 1024-bit RSA decryption. We focus on 1024-bit RSA decryption as it shows a high arithmetic intensity, ratio of arithmetic to IO operations, and also allows easy comparison with CPU implementations. We exploit the new GPU's flexibility to support a GPU sliding window [4] exponentiation implementation, based on Montgomery exponentiation [5] using both radix and residue number system (RNS) representations. We investigate both types of number representation showing how GPU occupancy and inter thread communication plays a central role to performance. Regarding the RNS implementations, we exploit the GPU's flexibility to use a more optimised base extension approach than was previously possible. We also explore various GPU implementations of single precision modular multiplication for use within the exponentiation approaches based on RNS. #### II. STANDARD MONTGOMERY EXPONENTIATION ON THE GPU We present two different GPU implementations with varying degrees of parallelism incorporating the Montgomery reduction method in radix representation and pencil-and-paper multiplication. One obser-vation that applies to all implementations of exponentiation on a CUDA compatible device is that it is only suitable to use a single exponent per CUDA warp, and in some scenarios per CUDA block. The reason for this is that the exponent largely determines the flow of control through the code. These con-ditional code paths dependant on the exponent cause thread divergence. When threads within a CUDA warp diverge on a single processor, all code paths are executed serially, thus a large performance overhead is incurred for threads that diverge for large portions of code. If inter thread communication is required, a synchronisation barrier must be used to prevent race conditions occurring. All threads within a CUDA block that perform a synchronisation barrier ### International Journal for Research in Applied Science & Engineering Technology (IJRASET) all threads within a single CUDA block are required to execute the same path at points of synchronisa-tion and so it follows that for exponentiation that uses inter thread communication, only one exponent can be used per CUDA block. #### A. Serial Approach Each thread within this implementation performs a full exponentiation without any inter thread com-munication or cooperation. This is a standard optimised implementation of an exponentiation using the Quisquater and Couvreur CRT approach [6], operating on two independent pairs of 16 limb numbers. The approach also uses the sliding window technique to reduce the number of Montgomery multiplies and squares required. As a single thread computes an exponentiation independently, a single exponent must be used across groups of 32 threads. In terms of RSA, assuming peak performance, this imple-mentation is restricted to using a maximum of 1 key per 32 primitives (or messages). As we are using the CRT based approach to split the input messages in two, we also use two different exponents for a single message. Thus a message must be split into different groups of 32 threads to avoid guaranteed thread divergence. We have adopted a simple strategy to avoid divergence, whereby CUDA blocks are used in pairs. The first block handles all 16 limb numbers relating to the modulus $\bf q$ , where $\bf n = pq$ and $\bf n$ is the original modulus. The threading model employed is illustrated in Figure 1. This separation of $\bf p$ and $\bf q$ related data is also used in the implementations in Section 2.2 and 3. Fig. 1. Serial Thread Model The added support for integers, bitwise operations and increased memory flexibility such as scatter operations, in the 8800GTX, allows this implementation to execute largely in a single kernel call. The byte and bit manipulation operations required for the efficient implementation of sliding window are now straightforward. The macro level details of this algorithm are largely standard and as such, we do not list the high level steps of the algorithm. However, we draw attention to the following optimisations that were applied within the implementation: all nxn limb multiplies used cumulative addition to reduce memory operations [7]; all squaring requirements were optimised to reduce the number of required multiplies [4]; nxn limb multiplies $\operatorname{mod} \mathbf{R}$ were truncated, again to remove redundant multiplies; and the final two steps within Montgomery multiplication were combined into a single nxn multiply and accumulate. - 1) Memory usage: The concept of a uniform, hierarchical memory structure such as a CPU's L1/L2 cache, etc does not exist on the GPU and performance cliffs can be encountered without careful memory planning. The following are the highlights of the various memory interactions of this implementation. Note that the implementations in Section 2.2 and Section 3 use similar adaptive memory approaches as described below. - 2) Adaptive memory approaches: The sliding window technique requires the pre-calculation of var-ious powers of the input data. This data is used during the exponentiation process to act as one of the n limb inputs into an nxn multi-precision multiplication. There are two options on how to handle the storage and retrieval of this pre-calculated data. 1. The pre-calculation is done on the GPU and is written to global memory. The data is stored in a single array with a stride width equal to the number messages being processed in a single kernel call multiplied by the message size. Reads are then made subsequently from this array direct from global memory. In this scenario only a single kernel call is required for the exponentiation. 2. Unfortunately the data reads cannot be coalesced as each thread reads a single limb which is separated by 16 integers from the next message. Coalesced global reads require the data to start at a 128-bit boundary for a warp and require each thread of the warp to read consecutively from memory with a stride of up to 4 32-bit integers wide. Non-coalesced reads generate separate memory transactions thus significantly reducing load/store throughput. To ameliorate this the sliding window pre-calculation data is first generated in an initialisation kernel writing its results to global memory. A texture can then be bound to this memory and the subsequent exponentiation kernel can use the pre-calculation data via texture references. Note that texture access uses the texture cache, which is a local on chip cache, however textures cannot be written to directly hence the need for a separate initialisation kernel. ### International Journal for Research in Applied Science & Engineering Technology (IJRASET) The first approach described above is better for smaller amounts of data. The second approach in beneficial for larger amounts of data when the advantage of texture use outweighs the fixed overhead of the extra kernel call. Another adaptive memory approach concerns the exponent. As mentioned, the exponent must be the same across a warp number of threads, thus all threads within a warp, when reading the exponent, access the same memory location at any one time. Constant memory has the best performance under this scenario [8], however is limited to 64KB on the G80. As each exponent requires 32 integers worth of storage, in an RSA 1024-bit context we can use constant memory for up to 512 different keys. If the amount of exponents exceed this threshold (in practice lower than 512 different keys as a small amount of constant memory is used for other purposes and a new key is used for at least each new block whether needed or not for lookup efficiency) then texture memory is used. 3) Other memory considerations: In an aim to increase the nxn multiplication performance we have allocated all of the on chip fast shared memory for storing and retrieving the most commonly used n limb multiplicand of the nxn operation. The less frequently accessed multiplier is retrieved from textures when possible. The input and output data is non exceptional in this implementation save that it cannot be coalesced due to the message stride within memory. A convolution of multiple messages could be an option to offset the lack of coalescing though this has not been explored here and would seem to be just adding extra steps to the CPU processing side. The other per key variables, $-\mathbf{n}^{-1}(\mathbf{mod}\ \mathbf{R})$ and $\mathbf{R}^2(\mathbf{mod}\ \mathbf{n})$ (for use in generating the Montgomery representation of the input) for both moduli $\mathbf{p}$ and $\mathbf{q}$ , where $\mathbf{n} = \mathbf{p}\mathbf{q}$ , are stored and loaded via texture references. In the context of RSA decryption these variables are assumed to be pre-calculated and it should be noted that performance will degrade slightly if these have to be calculated with a high frequency. The results for this implementation are presented in Section 2.3 in conjunction with the parallel approach described below. Note that two parts of the exponentiation are not included in these implementations, the initial $\mathbf{x}(\mathbf{mod}\ \mathbf{p})$ , $\mathbf{x}(\mathbf{mod}\ \mathbf{q})$ and the final CRT to recombine, these are done on the CPU. This is also the case for all implementations reported in this paper. These steps contribute little to the overall exponentiation runtime and so the performance impact is expected to minor. #### B. Parallel Approach This approach uses the same macro structure as the algorithm used above, however it executes the various stages within the algorithm in parallel. Each thread is responsible for loading a single limb of the input data, with 16 threads combining to calculate the exponentiation. Each thread undergoes the same high level code flow, following the sliding window main loop, however the Montgomery multiplication stages are implemented in parallel. This approach relies heavily on inter thread communication, which has a performance overhead as well as an implication that only one exponent is supported per CUDA block. As the number of threads per block in this implementation is limited to 256, due to shared resource constraints, the number of 1024-bit RSA primitives per key in effect is limited to a minimum of 16. The nxn multiplies within the Montgomery multiplication are parallelised by their separation into individual 1xn limb multiplications. Each thread is independently responsible for a single 1xn limb multiply. This is followed by a co-operative reduction across all threads to calculate the partial product additions. This parallel reduction carries with it an overhead where more and more threads are left idle. Figure 2 shows the distribution of the nxn operation across the 16 threads and its subsequent additive reduction. It also shows the use of shared memory to store the entire operations output and input of each stage. Also shown in the Figure 2 are the synchronisation points used to ensure all shared memory writes are committed before subsequent reads are performed, which add a signification performance burden. The optimisations applied to the different nxn multiplies, listed in the serial approach, are not possible in the parallel approach. The squaring optimisation, and also the modulo multiplication step, in general only execute half the limb multiplies that are required compared to a full nxn multiply. However, the longest limb within the nxn multiply dictates its overall execution time as all threads within a warp | TID: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 15 | | |------|------|--------|------|------|---|---|---|------|---------------------| | | 1x16 | 1x16 | 1x16 | 1x16 | | | | | | | | Mul | Mul | Mul | Mul | | | | | TID = Thread ID | | Sync | | | | | | | | | Sync = Sync Barrier | | SHM | | 17 Int | ts | | | | | | SHM = Shared Memory | | Sync | | Add | l | Ad | d | | | | | | SHM | | | | | | | | | | ### International Journal for Research in Applied Science & Engineering Technology (IJRASET) must execute in lock step. Thus, although one thread only executes a single multiply, it must wait until the largest 1xn multiply finishes. Also, as each thread executes its own 1xn multiply separately, the cumulative approach to addition must also be separated from the multiplication process. #### C. Results Figure 3 illustrates the performance of both the parallel and serial approaches. All measurements represent the number of 1024-bit RSA decrypt primitives executed per second. The GPU implementations show their dependence on an increasing number of messages to approach their peak performance. This is due to having more threads available to hide memory read/write latency, and also an increased ratio of kernel work compared to the fixed overheads associated with data transfer and kernel calls. We can see the advantage of the parallel approach over the serial approach at lower primitives per kernel call due to an higher level of occupancy. However the performance bottlenecks of excessive synchronisations and lack of optimisations limit the parallel approach. Also included in Figure 3, is the fastest implementation reported on the Crypto++ website [9] for a 1024-bit RSA decrypt, which is running on an AMD Opteron 2.4 GHz processor. Also included are the performance measurements for Openssl's [10] speed test for 1024-bit RSA decryption running in both single (SC) and dual core (DC) modes on an AMD Athlon 64 X2 Dual Core 3800+. As can be seen at peak performance, the serial approach on the GPU is almost 4 times the speed of the fastest CPU implementation at 5536.75 primitives per second. We can see that the serial approach becomes competitive with the fastest CPU implementation at 256 primitives per second. Unfortunately the parallel approach at no point is faster than both the serial GPU approach and the CPU implementations. #### III.MONTGOMERY EXPONENTIATION IN RNS ON THE GPU #### A. Single Precision Modular Multiplication on the GPU The most executed primitive operation within Montgomery RNS is a single precision modular multiplication. On the Nvidia CUDA hardware series the integer operations are limited to 32-bit input and output. Integer multiplies are reported to take 16 cycles, where divides are not quoted in cycles but rather a recommendation to avoid if possible [1]. Here we present an investigation into 6 different techniques for achieving single precision modular multiplication suitable for RNS based exponentiation implementations. - 1. 32-bit Simple Long Division: given two 32-bit unsigned integers we use the native multiply operation and the umulhi(x,y) CUDA intrinsic to generate the low and high 32-bit parts of the product. We then use the product as a 4 16-bit limb dividend and divide by the 2 16-bit limb divisor using standard multi-precision division [7] to generate the remainder. - 2. 32-bit Division by Invariant Integers using Multiplication: we make the observation that the divisors within an RNS Montgomery implementation are static. Also, as we select the moduli, they can be chosen to be close to the word size of the GPU. Thus we can assume that all invariant divisors, within the context of our implementation, are normalised (i.e. they have their most significant bit set). These two observations allow us to use a simplified variant of Granlund and Montgomery's approach for division by invariants using multiplication [11]. The basic concept used by [11] to calculate n/d is to find a sufficiently accurate approximation of 1/d in the form m/2<sup>x</sup>. Thus the division can be performed by the multiplication of n \* m and cheap byte manipulation for division. We pre-calculate m for each of the base residues used and transfer them to the GPU for use via texture lookups. The algorithm below removes all normalisation requirements from the original algorithm. It also rearranges some of the calculations to suit the efficient predication available on the GPU. Inputs: N is the word bit length on the GPU; single word multiplier and multiplicand x and y; m is a pre-calculated value dependent on d alone; d is the divisor. Output: r, the remainder. ``` n = x * y, n1 = hiword(n), n0 = loword(n) ns = n0 >> (N - 1) ``` #### **International Journal for Research in Applied Science & Engineering Technology (IJRASET)** ``` if(ns > 0) n0+=d t = hiword((m * (n1 + ns)) + q1 = n1 + t dr = (n - (d \ll N)) + ((2^{N} - 1 - q1) * d) r = loword(dr) + (d & hiword(dr)) ``` 3. 32-bit Reduction by Residue Multiplication: in this approach we use the observation that the moduli comprising the RNS bases can be selected close the GPU's maximum single word value. For 1024-bit RSA we can determine that for all moduli, d, the following holds $|2^{N}|_{d} < 2^{11}$ , where N is the word bit length of the GPU, i.e. 32. As such, given a single precision multiplication n = xy, and using the convention that n1 is the most signification word of n, and n0 the least significant word, we can rewrite n as $|n1 * 2^N + n0|_d$ . By repeatedly applying this representation to the most signification part of the equation, and using the pre-calculated value $r = |2^{N}| d$ , we can derive an algorithm for executing modular multiplication with multiplies and additions only. This observation is more formally stated as follows (left), and the resultant pseudocode is also listed (right). ``` Observation: Pseudocode: |\mathbf{x} * \mathbf{y}|_{\mathbf{d}} = |\mathbf{n}|_{\mathbf{d}} n=x*y = ||n1|_d * |2^N|_d + |n0|_d|_d \\ n0 = loword(n), n1 = hiword(n) Let r = |2^N|_d / r < 2^{11} * / n1r = n1 * r |n|_d = ||n1|_d * r + |n0|_d|_d n1r0 = loword(n1r) = ||n1r|_d + |n0|_d|_d /* n1r < 2^{43}*/ n1r1 = hiword(n1r) |n1r|_d = ||n1r1|_d * r + |n1r0|_d|_d n1r1r = loword(n1r1 * r) = ||n1r1r|_d + |n1r0|_d|_d /* n1r1r < 2^{22} */ r = n1r1r + n1r0 + n0 Thus: if(r < d)r - = d |\mathbf{n}|_{d} \equiv |\mathbf{n}1\mathbf{r}1\mathbf{r}|_{d} + |\mathbf{n}1\mathbf{r}0|_{d} + |\mathbf{n}0|_{d}, if(r < d)r - = d. which is < 3d. ``` - 4. 32-bit Native Reduction using CRT: using a modulus with two co-prime factors p and q, we can represent the modular multiplication input values, x and y, as $|x|_p$ , $|x|_q$ , $|y|_p$ , $|y|_q$ . Thus we have a mini RNS representation and as such can multiply these independently. We use CRT to recombine to give the final product. As p and q can be 16-bit, we are able to use the GPU's native integer modulus operator while maintaining 32-bit operands for our modular multiplication. This approach is described in more detail in the Moss et al. paper [12]. - 5. 16-bit Native Reduction: we can use 16-bit integers as the basic operand size of our modular multiplication, both input and output. We can then simply use the GPU's native multiply and modulus operators without any concern of overflow. However, we need to maintain the original dynamic range of the RNS bases when using 32-bit moduli. We can achieve this by doubling the number of moduli used in each base (note there is plenty of extra dynamic range when using 17 32-bit integers to accommodate this simple doubling). - 6. 12-bit Native Reduction: this is the same concept as the 16-bit native approach above, however using 12-bit input and outputs we can use the much faster floating point multiplies and modulus operators without overflow concerns. Again we need to maintain the dynamic range by approximately tripling the original 32-bit moduli. Also there is an issue where the Kawamura approximations require the base moduli to be within a certain range of the next power of 2. This is not discussed further here, though note that a full 12-bit implementation would require the use of another base extension method than the one described below. Results: All tests of the above approaches process 2<sup>32</sup> bytes, executing modular multiplication operations, reading and accumulating from and to shared memory. The results can be seen in Table 1. We can see that the 12-bit and 16-bit approaches show the best performance, however a correction step is required for these figures. As we will see, the base extension executes in O(n) time across n processors, where n is the number of moduli in the RNS base. In the context of 1024-bit RSA, the 12-bit approach requires a minimum of 43 moduli (512 bits / 12 bits) compared to 17 32-bit moduli for each RNS base. ### International Journal for Research in Applied Science & Engineering Technology (IJRASET) Also, the base extension step in Montgomery RNS is the most intensive part of our implementations consuming 80% of the execution time. A minimum approximation correction for the 12-bit result presented here is a division of 2, and for 16-bit 1.5. The most effective approach for use in Montgomery RNS is Reduction by Residue Multiplication. #### Modular Multiplication Approach Modular multiplications per second | 1. | 32-bit LongDiv | $2.89 * 10^9$ | |----|--------------------|---------------| | 2. | 32-bit Inverse Mul | $3.63 * 10^9$ | | 3. | 32-bit Residue Mul | $4.64 * 10^9$ | | 4. | 32-bit Native+CRT | $1.12 * 10^9$ | | 5. | 16-bit Native | $4.71 * 10^9$ | | 6. | 12-bit Native | $7.99 * 10^9$ | Table 1. GPU Modular Multiplication throughput using a variety of techniques #### IV. CONCLUSIONS We have focused on 1024-bit RSA decryption running on an Nvidia 8800 GTX and demonstrated a peak throughput of 0.18 ms/op giving a 4 times improvement over a comparable CPU implementation. We have shown that a standard serial implementation of Montgomery exponentiation gives the best performance in the context of a high number of parallel messages, while an RNS based Montgomery exponentiation gives better performance with fewer messages. We show that an optimised RNS approach proves better performance than a CPU implementation at 32 parallel ciphertext/plaintext messages per kernel call and the pencil-and-paper approach proves better than the RNS approach at 256 parallel messages. Also covered in the paper is the applicability of the GPU to general public key cryptography, where the observation is made that peak performance is only achievable in the context of substantial key reuse. In the case of 1024-bit RSA using RNS, peak performance requires the key to change at a maximum rate of once per 15 messages, and once per 32 messages when using a serial pencil-and-paper approach. We have also presented a variety of techniques for achieving efficient GPU based modular multiplication suitable for RNS. #### **REFERENCES** - [1] Nvidia CUDA Programming Guide, Version 2.0, 2008. - [2] Microsoft, "Direct X Technology", <a href="http://msdn.microsoft.com/directx/">http://msdn.microsoft.com/directx/</a> - [3] Nvidia Corporation, "CUDA", <a href="http://developer.nvidia.com/object/cuda.html">http://developer.nvidia.com/object/cuda.html</a> - [4] A. Menezes, P. van Oorschot, and S. Vanstone, "Handbook of Applied Cryptography". CRC Press, October 1996. ISBN 0-8493-8523-7. - [5] P.L. Montgomery. "Modular Multiplication Without Trial Division". Mathematics of Computation, 44, 519-521, 1985. - [6] J-J. Quisquater and C. Couvreur. "Fast Decipherment Algorithm for RSA Public-Key Cryptosystem". Electronics Letters, Vol. 18, No. 21, Pages 905907, 1982. - [7] D.E. Knuth. "The Art of Computer Programming. Vol 2". Addison-Wesley, 3rd ed., 1997. - [8] O. Harrison and J. Waldron, "Practical Symmetric Key Cryptography on Modern Graphics Hardware". 17th USENIX Security Symposium. San Jose, CA. July 28 August 1, 2008. - [9] AMD 64 RSA Benchmarks, http://www.cryptopp.com/benchmarks-amd64.html - [10] OpenSSL Open Source Project. <a href="http://www.openssl.org/">http://www.openssl.org/</a>. - [11] T. Granlund and P. Montgomery. "Division by Invariant Integers using Multiplication". SIGPLAN '94 Conference on Programming Language Design and Implementation. Orlando, Florida. June 1994. - [12] A. Moss, D. Page and N.P. Smart, "Toward Acceleration of RSA Using 3D Graphics Hardware". Eleventh IMA International Conference on Cryptography and Coding. Cirencester, UK. December 18-20, 2007. - [13] S. Kawamura, M. Koike, F. Sano and A. Shimbo, "Cox-Rower Architecture for Fast Parallel Montgomery Multiplication In Advances in Cryptology". Springer-Verlag LNCS 1807, 523538, 2000 - [14] K.C. Posch and R. Posch. "Base Extension Using a Convolution Sum in Residue Number Systems". In Computing 50, Pages 93104, 1993. 45.98 IMPACT FACTOR: 7.129 IMPACT FACTOR: 7.429 ## INTERNATIONAL JOURNAL FOR RESEARCH IN APPLIED SCIENCE & ENGINEERING TECHNOLOGY Call: 08813907089 🕓 (24\*7 Support on Whatsapp)