Using an NVIDIA GPU with CUDA For Numerical Linear Algebra in Linux
Trying out CSDP on a Xeon Phi Processor
Many software packages for scientific computing spend most of their time performing numerical linear algebra operations including matrix-vector and matrix-matrix multiplication, solving linear systems of equations, eigenvalue computations, and matrix factorizations such as the QR and SVD factorizations. These packages often make use of BLAS and LAPACK subroutines to perform these linear algebra operations.
The BLAS library performs basic linear algebra operations such as dot products, vector norms, matrix-vector multiplications, and matrix-matrix operations. The LAPACK uses functions from the BLAS library to perform higher level computations such as solving linear systems of equations, finding eigenvalues of a matrix, or finding the singular value decomposition of a matrix. You may also see references to two earlier libraries, LINPACK (for systems of equations) and EISPACK (for eigenvalue problems) that also use BLAS. They have been effectively superseded by LAPACK. In most cases, performance optimization at the BLAS level is sufficient. Some of the libraries mentioned below also include optimized versions of some LAPACK routines, but in general, it's usually sufficient to use an optimized BLAS with the standard LAPACK implementation.
LAPACK and BLAS were originally written in Fortran, and most software that calls BLAS/LAPACK routines uses the original Fortran interface. For programs written in other languages such as C, this can create complications in interfacing to the libraries. See the appendix below on calling Fortran BLAS/LAPACK routines from C. There are alternative interfaces that wrap around the BLAS and LAPACK libraries to provide a more natural C language interface. The LAPACKE interface provides C interfaces to the LAPACK routines. The CBLAS interface provides C interfaces to the BLAS routines.
The performance of the software can depend greatly on how well the BLAS operations have been optimized. There are a number of available implementations of the BLAS with different licenses, performance characteristics, features, and ease of installation and use. In helping users of my own package, CSDP, I often find that they're struggling to install BLAS and LAPACK to work with CSDP or that they're getting poor performance out of the BLAS that they've used.
This post discusses the available implementations of the BLAS on Linux and gives some recommendations on how to select a BLAS. I've written the blog posting from the point of view of someone who is installing software that makes use of the BLAS. A few appendices at the end of the post discuss technical issues that may be of interest if you're actually writing Fortran or C code that calls BLAS or LAPACK routines.
Factors Affecting the Performance of BLAS
In this section, I'll discuss some important aspects of computer architecture that affect the performance of the BLAS and LAPACK libraries. Intel and AMD processors based on the x86-64 (aka Intel 64) architecture are most commonly used on desktop and laptop Linux machines, so I'll limit my discussion to these processors.
It is import to understand that on modern processors it is relatively time-consuming to access a floating point number in main memory- it can take hundreds of nanoseconds to do this, while floating point additions, subtractions, multiplications, and divisions can all be done in a nanosecond or less.
To deal with the slow memory access, modern processors have cache memory. The cache memory is much smaller than main memory but also much faster. Processors now have three or even four levels of cache between the processor and main memory.
The key to obtaining higher performance with cache memory is to make multiple uses of data that has been brought into the cache from main memory before this data is flushed out of the cache. Algorithms that make a single pass over the data with one or two floating point operations for each number brought from main memory are effectively limited by the speed of main memory and the cache doesn't help.
The linear algebra operations performed by the BLAS are conveniently categorized into three levels. The level one operations act on one or two vectors of length n and perform a total of O(n) operations. Examples include vector dot products, norms, and the saxpy operation in which a multiple of one vector is added to a second vector. The level two operations involve matrices of and vectors of size n and perform a total of O(n^2) operations. The most common of these operations is matrix-vector multiplication. The level three operations involve matrices of size n but perform O(n^3) operations. The most common level three operation is matrix-matrix multiplication.
It should be clear that it is impossible to make effective use of cache with level one and level two operations since there are only a small number (one, two, or three) of floating point operations for each number brought in from main memory. However, the level three operations perform O(n^3) operations on O(n^2) data, so a factor of n reuse of data is possible. For example, blocked matrix-matrix multiplication algorithms take two large matrices of size n by n and break the matrices into block matrices whose individual blocks are small enough to fit in cache. The products of blocks are then performed in cache added together and written out to main memory. Cache aware blocked level 3 operations can be vastly faster than naively coded loops. LAPACK was designed to make as much use as possible of level 3 operations to improve performance on computers with cache memory.
An important aspect of modern random access memory is that the memory operates in burst mode. Rather than reading a single word from memory into the cache, the memory system reads in a "cache line" that might contain 64 or 128 consecutive bytes of memory. This is very convenient if the algorithm accesses data consecutively but very inefficient when the algorithm jumps around in memory. Matrices can be stored in memory in "column-major" order with the elements of a column in consecutive positions in memory or "row-major" order with the elements of a row in consecutive positions. Algorithms used by BLAS and LAPACK are designed to take advantage of the column-major storage used in the Fortran programming language. See the appendix below for a more detailed discussion of how Fortran and C store matrices.
In recent years microprocessors have become faster by performing operations in parallel. There are two main levels at which parallelism has been used to improve performance. At the lower level are instructions that can perform multiple arithmetic operations simultaneously. These Single Instruction Multiple Data (SIMD) instructions make it possible to perform 2, 4 or even 8 floating point operations per cycle rather than a single operation. In successive generations, SSE, SSE2, and AVX instructions have been added as new features to the x86-64 architecture. Compilers are generally capable of automatically producing code that performs operations in parallel using SIMD instructions. However, if a code is to be compiled so that it can be distributed in binary form and run on all x86-64 processors then the more recent SIMD instructions cannot be used.
At a higher level, modern processor chips actually consist of multiple processor cores. As individual transistors have gotten smaller on integrated circuits it has become possible to cram more and more processor cores onto a single chip. Some of the latest generation processors have as many as 18 separate processor cores on a single chip. Some programs are inherently sequential and can only run on one processor core. In other cases, it is possible to break up the work so that it can be performed in parallel on multiple cores. Fortunately for us, it is relatively easy to write multithreaded versions of the BLAS functions that can make use of multiple processor cores.
In order to get high performance, an implementation of the BLAS must make effective use of cache memory, SIMD instructions, and multiple cores. The software can either be written to work with the most basic x86-64 architecture (so that it will run on all AMD and Intel processors) or written to dynamically select routines that will work well on the current processor, or it can be tuned at compile time for a specific processor. All three of these approaches are used by some of the BLAS libraries listed below.
Available BLAS and LAPACK Libraries
The following table summarizes the features of some of the most commonly used BLAS and LAPACK libraries. OpenMP and ILP64 features are discussed in the appendices.
The LAPACK project is hosted at the NETLIB web site along with many other packages for numerical computing. The current version of the library is 3.7.0. The LAPACK library is written in Fortran, but a C language interface, LAPACKE, is also distributed with LAPACK. See the LAPACK package in Ubuntu 16.04.
NETLIB Reference BLAS
The reference implementation of the BLAS is a single threaded Fortran library that makes no use of parallel processing and doesn't take advantage of cache memory. Some performance improvement is possible by using compiler optimizations and SIMD instructions. The Ubuntu package libblas3 is a generic x86-64 version of the reference BLAS. It's easy to install but performs poorly in practice.
ATLAS stands for Automatically Tuned Linear Algebra Subroutines. The ATLAS library comes with a very complicated build procedure that runs timing tests to tune the software to select block size parameters and algorithms that work best on the current machine. The build process is time-consuming and requires that CPU throttling (commonly used for power management and cooling) be turned off. Once you've worked through this complicated build process, the library general performs well. ATLAS includes a few of the LAPACK routines, but for anything complicated, you'll need the regular LAPACK library.
Ubuntu has a generic x86-64 version of ATLAS, but its performance is typically poor in comparison with a version built from source because the generic binary package hasn't been properly tuned.
The OpenBLAS library can be compiled from source and automatically configures the block sizes and algorithms for your machine if the microarchitecture is on its supported list. For newer processors, it's generally OK to use the most recent supported architecture. For example, I built my copy of OpenBLAS on a Kaby Lake processor using the configuration for a Haswell processor and got good performance. OpenBLAS includes the cblas interface as well as LAPACK and LAPACKE routines. OpenBLAS works particularly well with main programs that are compiled with OpenMP parallelism.
Ubuntu has a generic x86-64 version of OpenBLAS, but much like ATLAS and the reference BLAS, this package performs poorly because it hasn't been tuned. (I'm told that this may have been fixed recently with a DYNAMIC_ARCH version of the openblas package.)
OpenBLAS can be built with a DYNAMIC_ARCH flag so that it will dynamically select code optimized for the particular architecture of the machine that it is running on. A variety of Intel and AMD processors are supported. Executable code built with this version of the library will be quite large, but the resulting code will typically run well on both older and newer systems.
The Intel MKL library includes BLAS, LAPACK, and other useful mathematical subroutines. The cblas and LAPACKE interfaces are also included. The BLAS is highly optimized and dynamically configures itself on a range of Intel processors. The library doesn't work well on AMD processors. Intel has a number of ways of getting free (as in beer) licenses to use MKL. It is not open source software, however.
ACML is roughly comparable to Intel's MKL in its functionality but optimized for use on AMD processors. The library doesn't work well on Intel processors. The ACML software is proprietary but can be downloaded for free (as in beer.) AMD has announced the end of support for the proprietary ACML and is moving to support open source high-performance computing projects including the BLIS effort to replace BLAS. (BLIS is pretty new and complicated to build, so I haven't tried it yet.)
If you simply need to run software that requires the BLAS and aren't concerned about performance, then I would recommend installing a precompiled OpenBLAS package or perhaps a precompiled version of the reference BLAS. If performance is of only moderate concern, then a precompiled version of OpenBLAS may be suitable. I would generally not pick ATLAS at this point because of the difficulty of installing it. If performance is important then I would compile OpenBLAS from source. The Intel MKL and AMD ACML libraries provide performance similar to (and perhaps better than) OpenBLAS, but I wouldn't bother with MKL or ACML unless you also need access to other features of these libraries beyond BLAS.
For example, I recently installed my CSDP software on a newly purchased laptop with a Kaby Lake (latest generation Intel) processor. The Ubuntu OpenBLAS binary package performed poorly. I found that a version of OpenBLAS compiled from source was at least four times faster than the Ubuntu OpenBLAS binary package, the Ubuntu ATLAS package, and the Ubuntu reference BLAS package. ACML performed poorly on this Intel machine. MKL was roughly similar in performance to the version of OpenBLAS that I built from source.
Appendix: Fortran and C array storage
In Fortran, an m by n matrix A is stored in a data structure called an array. To give flexibility in adding rows to a matrix, Fortran and the BLAS/LAPACK interfaces allow for an array A to have LDA > m rows. (LDA is an acronym for “leading dimension of A.”)
It is necessary to map the two-dimensional array of entries of A onto the one-dimensional linear layout of memory. In the Fortran scheme, the array is stored in “column major” order. That is, the entries in the first column are followed by the second column, and so on. In general, the A(i,j) element of the array A appears in position i+(j-1)*LDA in memory.
For example, a 3x3 matrix might be stored in a 5x3 array as
In calling Fortran routines from the BLAS and LAPACK libraries, the programmer must pass the array A together with the matrix size parameters m and n and the leading dimension LDA.
C actually has two different schemes for storing arrays. Dynamically allocated arrays consist of a vector of pointers to 1-D arrays which represent the individual rows of the matrix. This style of arrays is not supported by the cblas and LAPACKE interfaces. Statically allocated two-dimensional arrays are stored in a scheme very similar to the Fortran arrays, except that the array is stored in “row major” order, with all of the entries in row one proceeding row two, and so on. Another less important difference is that C array subscripts start at 0 rather than 1.
A C programmer has two options for interfacing to BLAS and LAPACK. The older approach is to implement Fortran array indexing within 1-D C arrays and then call the standard Fortran BLAS/LAPACK routines. This approach is used by many software packages. The alternative approach is to use LAPACKE and cblas if they are available. The LAPACKE and cblas interfaces work with matrices that are stored in either row-major or column-major order.
Appendix: Calling Fortran subroutines from C
Calling Fortran subroutines from C is not completely standardized. Different Fortran compilers use different naming and parameter passing conventions, so it is sometimes necessary to adjust C code to work with a different Fortran compiler.
In Fortran, subroutine names are case-insensitive (and often converted into lower-case by the compiler), while in C, they are case sensitive. Many Fortran compilers insert an underscore ("_") at the end of subroutine name.
In Fortran, parameters are passed by reference (the address of the parameter) rather than value, while in C parameters are typically passed by value. Thus when you call a Fortran subroutine from C it is necessary to pass a pointer to each of the parameters.
In C, variable length strengths have a null byte (0) to indicate the end of the string. Some Fortran compilers include additional hidden string length parameters to specify the length of each of the strings that are being passed to the subroutine.
Consider the Fortran call to an LAPACK routine
where the string 'U' tells LAPACK to put its answer in the upper triangular part of A, N is the dimension of the matrix A, LDA is the leading dimension of the array A (not necessarily equal to N!), and INFO returns a result code.
In C, the most common way to write this would be
However, depending on the Fortran compiler, it could be
In this last example, the parameter "1" specifies the length of the string "U".
Appendix: Pthreads and OpenMP threading
Many of the BLAS libraries are "threaded", meaning that the library will automatically spawn several threads of execution on different processor cores to compute (e.g.) a matrix-matrix multiplication. A program that uses BLAS/LAPACK libraries might itself run single threaded, or it might itself be multithreaded. Typically, it's best for a program to use as many threads as there are available processor cores. Running more threads than cores can hurt performance as the computer must spend time switching back and forth between threads. A common performance problem occurs when a multithreaded main program has several threads that each call threaded BLAS routines. E.g. on a four core machine, a program might use 4 threads, but these 4 threads can each call a BLAS routine that starts 4 threads, and suddenly there are 16 active threads.
There are two very commonly used threaded programming models. The pthreads (posix threads) model and OpenMP threading. Intel also has its "threaded building blocks", although these are not very widely used in open source software. There is no standard way in pthreads for a process to indicate that subroutines shouldn't create new threads and thus no effective way to avoid the performance problem mentioned in the previous paragraph. However, OpenMP does have a standard mechanism for doing this. Some BLAS libraries are "OpenMP aware" and will not create additional threads if the calling program doesn't want additional threads. This is a very useful feature.
Appendix: I32LP64 vs. ILP64
On 64 bit Linux systems, C uses 32 bits to represent integers, 64 bits to represent long integers, and 64 bits for pointers. Similarly, Fortran integers are 32 bits. This programming model is referred to as I32LP64. An alternative model, ILP64, uses 64 bits to represent integers. There is a potential issue with using the I32LP64 model in BLAS/CBLAS and LAPACK/LAPACKE, since it is possible in some circumstances that a vector might be so long that we can't represent its length with a 32-bit integer. The largest number that can be represented by a signed 32-bit integer is about 2 billion, but BLAS and LAPACK are often used on machines with hundreds of gigabytes of RAM, so this actually can occur in some cases. Some implementations of the BLAS and LAPACK libraries have versions that use or can be compiled to use long int 64-bit integer parameters.
Appendix: CPU Frequency Scaling
Modern processors from Intel and AMD have features to adjust the processor clock rate as needed to save power and prevent the processor from overheating. The processor cores can also go into a faster than normal (but not sustainable for long periods) "turbo" mode. For example, the cores on my desktop computer can run at clock rates from 1.0 Ghz (not loaded) to 3.7 Ghz (the nominal clock rate of the processor) and up to 4.0 Ghz (in turbo mode.) Rather than operating at a fixed, high, clock rate, processor cores are usually kept at a low clock rate. When the system has to start performing computations, the clock rate is adjusted upward by the operating system. However, if the processor cores start to overheat, the system will turn the clock rate down.
As a system administrator, you can control the minimum and maximum clock speeds and the "governor" procedure that determines how the system responds to increased mode. The default governor, "ondemand" increases the clock rate in response to high system loads, but it can take some time to get up to full speed. The "powersave" mode is useful for laptops running on battery power but doesn't give very high performance. There is also a "performance" mode which is the mode should typically be used for scientific computing. The cpufreq package has command line tools that can be used to configure CPU frequency scaling. I keep my systems in "performance" mode. I've configured my laptops for "powersave" while unplugged and "performance" while plugged in and charging.
Appendix: Hyperthreading (Simultaneous Multithreading)
Some AMD and Intel processors are capable of "simultaneous multithreading (SMT)" (Intel marketing uses the term "hyperthreading" for this.) This means that a single core can simultaneously execute two (or more) threads. The processor core has additional registers so that it can quickly switch from one thread to another. For example, when one thread of execution is stuck waiting for data from memory, the other thread can take over immediately without any of the overhead associated with switching tasks at the operating system level. For number crunching, hyperthreading typically provides relatively little benefit, and in some cases, codes even run slower with hyperthreading turned on. It can also be confusing that a 4-core processor with 2-way hyperthreading appears to Linux as an 8-core processor. You should experiment to see how much if any performance improvement you get by running more threads than processor cores.
Appendix: The TLB and Transparent Hugepages
Modern processors make use of "virtual memory", which allows the operating system to move pages of code and data in and out of main memory and store them on disk. This makes it possible to have multiple programs running and using more "virtual memory" than the real memory in the system. Because pages can end up being moved to different real memory locations as they are paged in and out of main memory, it is necessary to translate virtual addresses into real physical addresses. The operating system maintains an address translation table that maps each virtual memory page's virtual addresses into a corresponding physical address (or indicates that a page is not in main memory and must be brought in from disk before being used.)
A special cache called the translation lookaside buffer (TLB) stores the address translation for a small number (at most a couple of thousand) of the most recently accessed pages. If a page's information isn't in the TLB, then the operating system executes code to bring that entry from main memory into the TLB. The TLB handles "hits" instantly, while TLB "misses" cost a considerable amount of time (a microsecond or more.)
The Linux kernel was originally designed to use only 4K byte pages. More recently, support for hugepages (of 2 megabytes and also 1 gigabyte for the latest processors) was added to the kernel. Programs can explicitly allocate the larger pages, or a Linux feature called "transparent hugepages" can be used to dynamically identify situations where small 4K byte pages can be combined into larger hugepages.
For numerical linear algebra codes involving multi-gigabyte matrices, the older 4K byte pages are very problematic. A 4-gigabyte array uses one million 4K byte pages, far more than the capacity of the TLB. This results in many performance-sapping TLB misses. With two megabyte huge pages, the 4 gigabyte array uses a much more manageable two thousand hugepages. For some other workloads, particularly databases and web servers, that rapidly allocate and deallocate storage, the transparent hugepages feature actually hurts performance. The decision to turn on or off transparent hugepages depends on the workload of a particular computer. Some Linux distributions enable the feature by default while others disable it.
You can check to see whether transparent hugepages are enabled on your Linux machine with
The answer will either be "always", "madvise" (if a program has to ask for huge pages) or "never." If desired, the transparent hugepage feature can be turned on by an administrator by
sudo hugeadm --thb-always
Using transparent hugepages can improve the performance of codes that use BLAS and LAPACK on large arrays. For example, I typically see about a 10% performance improvement with hugepages on my laptop computer.
Appendix: The Alternatives system on Debian and Ubuntu Linux
Debian Linux (and by inheritance, Ubuntu) has a feature that makes it possible to have multiple versions of BLAS, LAPACK, and other libraries installed and to automatically adjust symlinks as packages are added to the system that implement these libraries. For example, the standard location for the version 3.x LAPACK library is /usr/lib/liblapack.so.3. This is actually a symlink to /etc/alternatives/liblapack.so.3, which itself is a symlink to the currently active version of LAPACK.
You can query the available alternatives for a package and use the update-alternatives tool to update the configuration as desired. For example,
sudo update-alternatives --query liblapack.so.3
will tell you the current configuration for liblapack.so.3. You can pick from among the installed versions of LAPACK with
sudo update-alternatives --config liblapack.so.3
If you have your own version of openblas in /opt/openblas/libopenblas.so, you can make it the active version of liblapack.so.3 by
sudo update-alternatives --install /usr/lib/liblapack.so.3 liblapack.so.3 /opt/openblas/libopenblas.so 99
where the priority "99" should be large enough to keep other versions from taking over as liblapack.so.3.
For general background on computer architecture including caching, the translation lookaside buffer, and other features that are important to performance, I highly recommend the book:
R. E. Bryant and D. R. O'Hallaron. Computer Systems: A Programmer's Perspective, 3/E, 2015. An earlier draft edition is available for free online.
For background on numerical linear algebra, algorithms that work by rows versus columns, and blocked algorithms, I highly recommend:
D. S. Watkins. Fundamentals of Matrix Computations, 3rd ed.