Over the past decade, high-performance graphics cards for PCs have become steadily more powerful and more general purpose. Graphics processing units (GPUs) now include gigabytes of high-speed memory and thousands of computing cores that can perform both integer and floating point computation.
One of the two major GPU vendors, NVIDIA, offers GPU accelerators for high-performance scientific computing as well as graphics cards aimed at the gaming and workstation markets. NVIDIA's GeForce (gaming), Quadro (workstation), and Tesla (high-performance computing) cards can all be used for scientific computing to a greater or lesser extent. CUDA is NVIDIA's platform and API for numerical computing on NVIDIA GPUs. The GPUs made by NVIDIA's main competitor, AMD, focus more on graphics performance, although they can also be used for scientific computing. Both AMD and NVIDIA support the OpenCL standard API for numerical computing, although in practice CUDA is nearly always used with NVIDIA GPUs. I'll focus here on the NVIDIA platform because I've been working with an NVIDIA card and because the NVIDIA cards have more software support.
The NVIDIA GPU can be thought of as a separate computer with its own memory and processor cores. The GPU is connected to the system's main CPU and memory via the PCIe bus. Memory is not shared directly between the CPU and GPU; rather, data and small programs (called kernels) are transferred from the main CPU to the GPU. After the GPU completes the computation, results are copied back to main memory. Since the GPU typically has less memory than the main system, it may be necessary to break a larger problem into smaller subproblems before sending them to the GPU for processing. The highest performance is often attained by codes that overlap copying data to and from main memory with computations on the GPU.
Custom kernels can be written in a special version of C and compiled with NVCC, a compiler supplied by NVIDIA. Writing these kernels requires a detailed understanding of the GPU architecture and is not something that I really want to learn.
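For the curious, though, here is a minimal sketch of what a custom kernel and its supporting host code look like. The kernel just adds two vectors, but it illustrates the allocate, copy, launch, and copy-back pattern described above. (This is purely illustrative and isn't used anywhere in the experiments below.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Each GPU thread adds one pair of elements. */
    __global__ void vec_add(const float *x, const float *y, float *z, int n)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            z[i] = x[i] + y[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n*sizeof(float);

        /* Host (CPU) arrays. */
        float *x = (float *) malloc(bytes);
        float *y = (float *) malloc(bytes);
        float *z = (float *) malloc(bytes);
        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* Allocate storage on the GPU and copy the inputs over the PCIe bus. */
        float *dx, *dy, *dz;
        cudaMalloc((void **) &dx, bytes);
        cudaMalloc((void **) &dy, bytes);
        cudaMalloc((void **) &dz, bytes);
        cudaMemcpy(dx, x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y, bytes, cudaMemcpyHostToDevice);

        /* Launch enough 256-thread blocks to cover all n elements. */
        int threads = 256;
        int blocks = (n + threads - 1)/threads;
        vec_add<<<blocks, threads>>>(dx, dy, dz, n);

        /* Copy the result back to main memory. */
        cudaMemcpy(z, dz, bytes, cudaMemcpyDeviceToHost);
        printf("z[0] = %f\n", z[0]);

        cudaFree(dx); cudaFree(dy); cudaFree(dz);
        free(x); free(y); free(z);
        return 0;
    }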
Fortunately, NVIDIA supplies software libraries for many common numerical computing tasks. Its CUBLAS library implements the functionality of the BLAS dense linear algebra routines on the GPU. Programming with CUBLAS involves explicitly controlling the allocation of storage on the GPU and the main system and the copying of data to and from the GPU. At a slightly higher level, the NVBLAS library is designed to act as a drop-in replacement for those parts of the BLAS library that are most effectively handled by the GPU. The NVBLAS functions take care of allocating storage on the GPU and copying data back and forth, so the user doesn't have to make any changes to their code.
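To make the difference concrete, here is a rough sketch (not code from this post) of what a single matrix multiplication looks like when written directly against CUBLAS. With NVBLAS, all of the explicit allocation and copying below goes away and the program simply calls the ordinary BLAS DGEMM routine.

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    /* Compute C = A*B for n-by-n column-major matrices using CUBLAS.
       A, B, and C are host arrays of length n*n. */
    void gpu_dgemm(int n, const double *A, const double *B, double *C)
    {
        size_t bytes = (size_t) n*n*sizeof(double);
        double *dA, *dB, *dC;
        cudaMalloc((void **) &dA, bytes);
        cudaMalloc((void **) &dB, bytes);
        cudaMalloc((void **) &dC, bytes);

        /* Copy the operands to the GPU. */
        cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

        /* Run DGEMM on the GPU: C = 1.0*A*B + 0.0*C. */
        cublasHandle_t handle;
        cublasCreate(&handle);
        double alpha = 1.0, beta = 0.0;
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);

        /* Copy the result back to main memory and clean up. */
        cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }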
One of the simplest ways to get started in using the GPU is to use the NVBLAS functions from within Octave, the open source MATLAB clone. Since Octave is dynamically linked to libblas.so, you can simply set the LD_PRELOAD environment variable to include libnvblas.so. As the system loads Octave, the dynamic linking will automatically resolve the BLAS library routines using the higher priority routines from libnvblas.so. Thus there is no need to recompile Octave. MATLAB also supports the use of an NVIDIA GPU. Similarly, support for using the GPU to handle basic linear algebra exists in Python, R, and Julia.
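For example, on my system the launch looks roughly like this; the path to libnvblas.so depends on where the CUDA Toolkit is installed, and NVBLAS_CONFIG_FILE should point at the nvblas.conf file discussed later in this post.

    # Tell NVBLAS where its configuration file is, then preload it ahead of the normal BLAS.
    export NVBLAS_CONFIG_FILE=$HOME/nvblas.conf
    LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so octave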
As it happens, the CUDA Toolkit also includes a drop-in replacement for the widely used FFTW library of routines for fast Fourier transforms. I was able to use the CUDA library, libcufft.so, in the same way as libnvblas.so.
At this stage, I could run Octave with either OpenBLAS or NVBLAS + OpenBLAS for comparison purposes. I could also substitute NVBLAS + OpenBLAS for OpenBLAS in any other program that dynamically links to OpenBLAS. In particular, I could use NVBLAS + OpenBLAS as a replacement for OpenBLAS in my CSDP solver without changing the code and without even having to recompile it!
Some quick testing revealed that using NVBLAS in this way didn't really improve the performance of CSDP. This was likely because the code was spending most of its time computing Cholesky factorizations using a LAPACK routine. Unfortunately, NVBLAS limits itself to a subset of the Level 3 BLAS functions. For the higher level functions in LAPACK, the choices are either to run the LAPACK code on the CPU and use the GPU only for the underlying BLAS calls, or to look at other software libraries that implement LAPACK functionality directly on the GPU.
NVIDIA has a library called CUSOLVER which provides some functions for solving systems of equations. MAGMA, developed at the University of Tennessee and Oak Ridge National Laboratory, offers LAPACK functionality on top of CUBLAS; MAGMA can also be used with AMD Radeon graphics cards and Intel Knights Landing processors. ViennaCL is another C++ package offering high-level linear algebra routines that can make use of NVIDIA and AMD Radeon GPUs and Intel Knights Landing processors.
I selected MAGMA and installed it on my system. In comparison with NVIDIA's software, MAGMA isn't as well documented and supported, but I was able to build the library without too much trouble.
Computational Experiments
The following computational experiments were performed on a Linux system with an Intel Xeon E5-1630 v4 processor and 32 gigabytes of RAM. The memory bandwidth is approximately 80 GB/s. For this Broadwell processor, the maximum possible single precision performance is 32 flops per cycle times the 3.7 GHz clock rate times 4 cores, or about 474 GFLOPS. The maximum possible double precision performance is 16 flops per cycle times the 3.7 GHz clock rate times 4 cores, or about 237 GFLOPS.
The GPU is an NVIDIA Quadro P4000 with 8 gigabytes of GDDR5 RAM. The memory bandwidth is 240 GB/s (about 4 times higher than the CPU's). This GPU has a maximum possible single precision performance of 5304 GFLOPS. Double precision performance on GeForce and Quadro GPUs is limited to 1/32 of the single precision performance (about 170 GFLOPS for the P4000).
The PCIe bus between the CPU and GPU has a bandwidth of about 12 GB/s, which is much lower than either the main memory bandwidth or the memory bandwidth on the GPU. Thus it's important to minimize the copying of data back and forth between the CPU and GPU and to maximize the amount of computing done on data already resident on the GPU.
Knowing that the GPU was likely to be most helpful for linear algebra tasks that require O(n^3) work on O(n^2) data, such as matrix multiplication and Cholesky factorization, and that the single precision floating point performance was likely to be much higher than the double precision performance, I first tested the BLAS matrix multiplication routine SGEMM(). Figure 1 shows the performance in GFLOPS for matrices of size N by N, using the OpenBLAS implementation of SGEMM() on the CPU and using MAGMA on the GPU. Using the MAGMA routine, the GPU is much faster than the CPU even at sizes as small as N=500. OpenBLAS on the CPU achieves a respectable 70% of the maximum possible performance. The GPU also achieves about 70% of the maximum possible performance, but only for much larger values of N.
Figure 1: Performance of OpenBLAS and MAGMA SGEMM() Routines.
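The measurements themselves are straightforward. As an illustrative sketch (not the exact benchmark code), the GFLOPS rate of a single SGEMM() call through MAGMA's GPU interface can be obtained roughly as follows, assuming dA, dB, and dC are N by N single precision matrices already placed in GPU memory with magma_smalloc() and magma_ssetmatrix(), and queue is a MAGMA queue created with magma_queue_create().

    #include "magma_v2.h"

    /* Time one n-by-n SGEMM on the GPU and return the GFLOPS rate.
       dA, dB, and dC are matrices already resident in GPU memory. */
    double sgemm_gflops(magma_int_t n, magmaFloat_ptr dA, magmaFloat_ptr dB,
                        magmaFloat_ptr dC, magma_queue_t queue)
    {
        double start = magma_sync_wtime(queue);
        /* C = 1.0*A*B + 0.0*C, computed entirely on the GPU. */
        magma_sgemm(MagmaNoTrans, MagmaNoTrans, n, n, n,
                    1.0f, dA, n, dB, n, 0.0f, dC, n, queue);
        double seconds = magma_sync_wtime(queue) - start;
        return 2.0*n*n*n/seconds/1.0e9;   /* matrix multiplication takes about 2n^3 flops */
    }

The 2n^3 operation count used here is the standard one for matrix multiplication; the corresponding count for the Cholesky factorization discussed next is n^3/3.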
Next, I tested the performance of OpenBLAS vs. MAGMA on the Cholesky factorization routine SPOTRF(). The results are shown in Figure 2. Again, the GPU using the MAGMA routine is much faster than the CPU for larger matrices, but for smaller matrices the CPU is faster, with a crossover at roughly N=7000. Again, both the CPU and GPU achieve roughly 70% of the maximum possible performance, but the GPU achieves this only for very large values of N.
Figure 2: Performance of OpenBLAS and MAGMA SPOTRF() Routines.
Figures 3 and 4 compare the performance of the CPU and GPU on double precision matrices. Again, the OpenBLAS routines achieve roughly 70% of the maximum possible performance on the CPU, while the GPU achieves nearly its maximum possible performance. The GPU is roughly 20% faster than the CPU, but only for quite large matrices (around N=3000 for DGEMM() and N=13000 for DPOTRF()).
Figure 3: Performance of OpenBLAS and MAGMA DGEMM() Routines.
Figure 4: Performance of OpenBLAS and MAGMA DPOTRF() Routines.
I was originally interested in seeing how the GPU could be used within the CSDP solver for semidefinite programming. Unfortunately, the primal-dual interior point method used by CSDP requires computations to be performed in double precision. Given the above results, it's unlikely that this GPU could provide any significant performance enhancement.
I did test a version of CSDP that used the MAGMA routines whenever matrices were large enough to justify the switch from OpenBLAS to MAGMA. I tested this code on the 1et.2048 test problem, which has M=22529 constraints and requires repeated Cholesky factorizations of matrices of size 22529 (well past the crossover point). The problem also requires repeated multiplications of matrices of size 2048 (not large enough for MAGMA to be worthwhile). Using MAGMA reduced the solution time from 930 seconds to 830 seconds, a reduction of about 11%.
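The switch itself can be implemented with a small wrapper around the factorization call. The sketch below shows one way to do it; this is an illustrative sketch rather than the actual CSDP code, and the threshold of 13000 is simply the approximate DPOTRF() crossover from Figure 4.

    #include "magma_v2.h"

    /* LAPACK's double precision Cholesky factorization (Fortran interface). */
    extern void dpotrf_(char *uplo, int *n, double *a, int *lda, int *info);

    /* Illustrative crossover size, roughly the DPOTRF() crossover in Figure 4. */
    #define GPU_THRESHOLD 13000

    /* Factor the lower triangle of the n-by-n column-major matrix A, choosing
       between LAPACK on the CPU and MAGMA on the GPU according to the size.
       Assumes magma_init() has been called once at program startup. */
    int cholesky(int n, double *A)
    {
        int info;
        if (n >= GPU_THRESHOLD)
        {
            /* MAGMA's CPU interface takes the matrix in main memory and
               handles the transfers to and from the GPU internally. */
            magma_int_t mn = n, mlda = n, minfo;
            magma_dpotrf(MagmaLower, mn, A, mlda, &minfo);
            info = (int) minfo;
        }
        else
        {
            char uplo = 'L';
            dpotrf_(&uplo, &n, A, &n, &info);
        }
        return info;
    }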
NVIDIA's Tesla GPUs do not have the same limitations on double precision performance. I'd be interested in testing my code on a system with one of the Tesla accelerators.
For reference, here are the steps I followed to set up the software on my Ubuntu system:

1. Install version 384 of the NVIDIA drivers for Linux. "sudo apt-get install nvidia-384"
2. Install the CUDA toolkit and libraries. https://developer.nvidia.com/cuda-toolkit
3. Install OpenBLAS 0.2.20. OpenBLAS is needed to provide the BLAS functions that are not efficiently computed on the GPU, and it also serves as the baseline for comparison with NVBLAS. I built my copy from source. http://www.openblas.net
4. Configure NVBLAS to use OpenBLAS as its fallback for BLAS routines that are not in NVBLAS. You'll need to create or edit the nvblas.conf file (an example configuration appears after this list).
5. Use the Ubuntu update-alternatives command to make OpenBLAS the system default BLAS and LAPACK library.
6. Install Octave 4.2.1 using the Ubuntu PPA for Octave. Octave links by default to the update-alternatives BLAS and LAPACK libraries (which were configured to be OpenBLAS in step 5).
7. Use the LD_PRELOAD environment variable to override OpenBLAS routines with available NVBLAS routines before running Octave or any other code that uses BLAS.
8. Install MAGMA 2.3.0.
9. Try MAGMA and NVBLAS out on your own code.
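For step 4, a minimal nvblas.conf along the following lines is enough to get started; the NVBLAS_CPU_BLAS_LIB path should point at wherever your copy of libopenblas.so was installed (the default for a source build is under /opt/OpenBLAS), and the remaining settings are optional logging and tuning parameters.

    # Local BLAS library to fall back on for routines NVBLAS doesn't run on the GPU.
    NVBLAS_CPU_BLAS_LIB /opt/OpenBLAS/lib/libopenblas.so
    # Use all of the GPUs in the system (just one in my case).
    NVBLAS_GPU_LIST ALL
    # Optional logging and tuning parameters.
    NVBLAS_LOGFILE nvblas.log
    NVBLAS_TILE_DIM 2048
    NVBLAS_AUTOPIN_MEM_ENABLED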
Dear Brian, have you had a chance to test your code on a Tesla or Radeon Instinct GPU? And if so, could this give CSDP a significant speed-up?
Best wishes, Jan