# Accelerating Computations on Encrypted Data with an FPGA

By David Bruce Cousins and Kurt Rohloff, Raytheon BBN Technologies

Healthcare, financial, government, and military organizations depend on encryption to secure sensitive data. Historically, this data has had to be decrypted before it could be processed or analyzed. As a result, data processing had to be performed on secured hardware, eliminating the possibility of using the cloud or other low-cost, third-party computing resources.

At the Symposium on the Theory of Computing in 2009, Craig Gentry of IBM presented a fully homomorphic encryption (FHE) scheme that made it possible to send sensitive data to an unsecured server, process it there, and receive an encrypted result, without ever decrypting the original data. While FHE was a major theoretical breakthrough, actual FHE implementations are many orders of magnitude too slow to be of practical use, particularly for large encryption keys and ciphertexts.

As a step toward a practical FHE implementation, we have developed a somewhat homomorphic encryption (SHE) scheme that, with modifications, can be converted into an FHE scheme. Current FHE implementations depend on complicated operations that are inefficient when performed on a CPU, and our goal was to take advantage of the parallelism and pipelining of FPGAs using MATLAB®, Simulink®, and HDL Coder™. Homomorphic encryption is an active area of study, and new advances are being made regularly. By using MATLAB and Simulink instead of a lower-level programming language, we can keep pace with these developments by rapidly implementing improvements to the algorithms.

## Homomorphic Encryption Basics

FHE enables secure and private computation on encrypted data. In theory, computations can be carried out using just two FHE operations: EvalAdd and EvalMult. These operations are similar to binary XOR and AND operations, but they operate on encrypted bits, and their result remains encrypted. Current FHE schemes are based on computationally intensive stochastic lattice theory problems. This randomness introduces noise into each EvalAdd or EvalMult operation, and the amount of noise grows rapidly with the number of operations performed. FHE schemes address this buildup of noise by periodically running a bootstrapping algorithm on the intermediate results. One problem with this approach is that the bootstrapping algorithm is computationally expensive. A second is that current FHE schemes entail modular arithmetic with a large modulus; the required operations are memory-intensive and inefficient when performed on standard CPUs.
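On the underlying plaintext bits, these two operations behave as ordinary Boolean logic: addition mod 2 is XOR and multiplication mod 2 is AND. A quick Python check of that correspondence (plaintext bits only; the encryption itself is omitted):

```python
# EvalAdd and EvalMult act on ciphertexts; on the underlying plaintext
# bits they correspond to addition and multiplication mod 2.
for a in (0, 1):
    for b in (0, 1):
        assert (a + b) % 2 == a ^ b   # EvalAdd <-> XOR on plaintexts
        assert (a * b) % 2 == a & b   # EvalMult <-> AND on plaintexts
print("mod-2 arithmetic matches XOR/AND")
```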

SHE schemes avoid the need for bootstrapping by limiting the number of EvalAdd and EvalMult operations that must be performed to keep noise below an acceptable threshold. SHE schemes can be augmented with bootstrapping operations to produce FHE schemes, provided that the additional number of operations for bootstrapping itself keeps the noise below this threshold.

## Implementing EvalAdd and EvalMult

FHE schemes are based on EvalAdd and EvalMult, elementwise vector addition and multiplication operations performed in modulo q arithmetic where q is a large prime integer. In MATLAB, EvalAdd is expressed simply as

c = mod(a+b, q)


and EvalMult is expressed as

c = mod(a.*b, q)


These straightforward representations make it easy to perform the calculations on a CPU, but to exploit the parallelism of FPGAs we needed to develop efficient implementations of EvalAdd and EvalMult in VHDL®. We began by prototyping algorithms in MATLAB. For example, the EvalAdd operation with inputs bounded by the modulus q:

c = a + b;
cgteq = (c >= q);
c(cgteq) = c(cgteq) - q;

Our initial MATLAB representation served as a reference for the Simulink model, which would be used for hardware implementation and HDL code generation (Figure 1). With the initial MATLAB representation, we were able to explore several theoretical approaches because MATLAB handled large complicated vector and matrix operations quickly and naturally. Additionally, we were able to use Fixed-Point Designer™ to generate bit-accurate fixed-point solutions, including modeling rollover in our adders, with only minor modifications to our systems.
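Because both inputs are already reduced below q, a single conditional subtraction stands in for the general mod operation, which is what makes the hardware mapping cheap. A Python sketch of the same trick, checked against the direct formula (the modulus below is an arbitrary illustrative value):

```python
def eval_add(a, b, q):
    """Modular add via conditional subtraction; assumes 0 <= a, b < q."""
    c = a + b
    return c - q if c >= q else c

q = 1099511627689  # illustrative modulus near 2^40
for a, b in [(0, 0), (1, q - 1), (q - 1, q - 1), (123456789, 987654321)]:
    assert eval_add(a, b, q) == (a + b) % q
print("conditional-subtraction add matches mod(a+b, q)")
```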

Once we had converged on a sound approach, a detailed Simulink model was created for HDL implementation. With Simulink, we were able to lay out the logical components in the design and automatically generate optimized HDL code. We were also able to add capabilities, including supporting multiple moduli and optimizing our models to use table lookups, in a controlled, incremental way.

The model can process one pair of inputs on each clock cycle. To verify it, we compared the results it produced with the results from our MATLAB code.

Modulo multiplication is much more complicated than modulo addition. To manage this complexity and enable more efficient pipelining in the generated HDL, we developed a multiple-word modulo multiply operation based on the Barrett reduction algorithm. Figure 2 shows the structure of a pipelined, four-stage multiple-word multiplier, where each stage operates on a subword of the large overall word.

Figure 2. Simulink model of the 64-bit Barrett modulo multiply operation, showing four stages.
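Barrett reduction replaces the division in a mod q step with a multiplication by a precomputed constant and a shift, which maps well to pipelined hardware. A single-word Python sketch of the underlying arithmetic (the Simulink model splits these multiplies across subword stages; the 64-bit modulus here is illustrative):

```python
def barrett_setup(q, k):
    """Precompute mu = floor(2^(2k) / q) for a k-bit modulus q."""
    return (1 << (2 * k)) // q

def barrett_mulmod(a, b, q, k, mu):
    """Compute (a * b) mod q without a division; assumes 0 <= a, b < q."""
    x = a * b                            # full product, x < q^2 < 2^(2k)
    t = x - ((x * mu) >> (2 * k)) * q    # quotient estimate is off by at most 2
    while t >= q:                        # so at most two conditional subtractions
        t -= q
    return t

q = 0xFFFFFFFF00000001                   # an illustrative 64-bit prime modulus
k = 64
mu = barrett_setup(q, k)
for a, b in [(3, 5), (q - 1, q - 1), (2**63, 2**62 + 1)]:
    assert barrett_mulmod(a, b, q, k, mu) == (a * b) % q
```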

Figure 3 shows the detailed model of a single stage, also pipelined for efficient HDL generation. Here again, the flexibility of Simulink enables us to specify the actual word widths at run time. The same models are used for generating 48-bit as well as 64-bit arithmetic HDL.

Figure 3. Simulink model of a single stage of the Barrett modulo multiply operation.

## Implementing the Chinese Remainder Transform

Our SHE scheme uses the Chinese remainder transform (CRT) to simplify the structure of our add and multiply operations. CRT resembles the discrete Fourier transform (DFT), but uses modular integer arithmetic instead of complex arithmetic. We implemented the CRT as an EvalMult operation followed by a fast numeric transform (like a fast Fourier transform [FFT] but with modulo arithmetic).
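The modular counterpart of the DFT's complex twiddle factors is a primitive N-th root of unity mod q, which exists for power-of-2 N when q is a prime with q ≡ 1 (mod N). A short Python sketch of finding such a root (q = 97 and N = 8 are toy values for illustration):

```python
def primitive_root_of_unity(q, n):
    """Find w with w^n = 1 (mod q) and order exactly n, for prime q = 1 (mod n),
    where n is a power of 2."""
    assert (q - 1) % n == 0
    for g in range(2, q):
        w = pow(g, (q - 1) // n, q)   # w has order dividing n
        if pow(w, n // 2, q) != 1:    # order exactly n when n is a power of 2
            return w

q, n = 97, 8                          # 97 is prime and 97 = 1 (mod 8)
w = primitive_root_of_unity(q, n)
assert pow(w, n, q) == 1 and pow(w, n // 2, q) == q - 1
```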

To create the Simulink model for the fast numeric transform, we started with a standard streaming FFT, reordered the inputs, and converted from complex arithmetic to modulo arithmetic using integer fixed-point types (Figures 4a, 4b, and 4c).

Figure 4a. Streaming fast numeric transform modeled in Simulink.
Figure 4b. Detail of a single stage of the fast numeric transform showing the butterfly and shuffle models.
Figure 4c. Detail of the modulo Butterfly operation of the fast numeric transform.

We developed a streaming model consisting of two basic blocks that are parameterized at run time. By cascading log2(N) of these blocks we can assemble transforms of various power-of-2 sizes. This approach allows us to process two input samples every clock cycle. We have implemented and benchmarked CRTs of up to 2^14 points on a Xilinx® Virtex®-7 in this manner.
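The cascade of log2(N) parameterized stages corresponds to a standard iterative radix-2 transform with modular butterflies. A compact Python sketch of the equivalent transform, verified by a round trip (toy parameters q = 97, N = 8, not the production sizes; the hardware realizes the same stage structure, not this code):

```python
def ntt(a, q, w):
    """Iterative radix-2 number-theoretic transform of a over Z_q.
    w is a primitive len(a)-th root of unity mod q; len(a) is a power of 2."""
    n = len(a)
    a = list(a)
    # bit-reversal permutation of the inputs
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # log2(n) cascaded stages of modulo butterflies
    length = 2
    while length <= n:
        wn = pow(w, n // length, q)           # stage twiddle factor
        for start in range(0, n, length):
            t = 1
            for k in range(length // 2):
                u = a[start + k]
                v = a[start + k + length // 2] * t % q
                a[start + k] = (u + v) % q             # modulo butterfly
                a[start + k + length // 2] = (u - v) % q
                t = t * wn % q
        length <<= 1
    return a

q, n, w = 97, 8, 64                  # 64 is a primitive 8th root of unity mod 97
x = [3, 1, 4, 1, 5, 9, 2, 6]
X = ntt(x, q, w)
# inverse: transform with w^-1, then scale by n^-1 (mod q)
winv, ninv = pow(w, -1, q), pow(n, -1, q)
assert [v * ninv % q for v in ntt(X, q, winv)] == x
```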

## Moving from Floating Point to Fixed Point and from Model to VHDL

When we started prototyping in MATLAB we used floating-point math, which provided a quick, easy way to understand the computations that we needed to support. We then used Fixed-Point Designer to transition from floating-point to fixed-point arithmetic with varying integer bit-widths.

Our MATLAB code and Simulink models use the same fixed-point variables and produce output in the same format, simplifying verification of test results. When running simulations, we can specify the bit-width of the input data. The intermediate mathematical operations are automatically sized by Fixed-Point Designer, enabling us to use the same models for inputs of varying bit-widths.

The transition to fixed-point math is a required step for VHDL implementation. We generated the required VHDL from our Simulink models using HDL Coder. We simulated the VHDL using Mentor Graphics® ModelSim®, and synthesized the code on a Xilinx Virtex FPGA. The generated VHDL can be used on FPGAs from different vendors, enabling us to benchmark across multiple platforms. We also use HDL Verifier™ to validate and demonstrate the generated VHDL running on a Xilinx ML605 evaluation board.

## Accelerating Development

The combination of MATLAB, Simulink, Fixed-Point Designer, HDL Coder, and HDL Verifier enables us to develop, implement, and improve our encryption scheme much faster than would be possible with traditional methods. Speed is essential to our efforts because FHE theory is evolving so rapidly. Several times a year, innovations come to light that require a rewrite of our code and subsequent changes to our model. We estimate that development would take two to three times longer if we were working in C or another lower-level language. Without Simulink and HDL Coder, producing VHDL would be similarly slow because our team does not have experience writing VHDL from scratch.

As we continue to increase the capabilities of our SHE scheme and improve its performance, rapid prototyping in MATLAB and Simulink and automatic VHDL code generation with HDL Coder remain central to our development efforts.

Sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under Contract No. FA8750-11-C-0098. The views expressed are those of the authors and do not necessarily reflect the official policy or position of the Department of Defense or the U.S. Government. Distribution Statement “A” (Approved for Public Release, Distribution Unlimited.)

Published 2013 - 92098v00