Tuesday, August 22, 2017

Trying out CSDP on a Xeon Phi Processor

I've written two other related posts that you might also want to read:

Using BLAS and LAPACK on Linux

Using an NVIDIA GPU with CUDA For Numerical Linear Algebra in Linux


I was recently given the opportunity to port my semidefinite programming code to an Intel Xeon Phi processor.  Although I only had about a week to work on this port and didn't have time to carefully optimize the code, I learned a lot about this new system and about some of the implications for the future development of CSDP.

My code, CSDP, was originally written in the late 1990s as a single-threaded C code that made extensive use of linear algebra routines from the BLAS and LAPACK libraries.  The decision to build on BLAS and LAPACK has proven to be a good one, because implementations of these libraries have consistently improved in performance over the years.
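To give a concrete sense of what that dependence looks like, here is a minimal standalone sketch (illustrative, not CSDP source) that multiplies two small dense matrices with the standard BLAS routine dgemm through its C interface; the header differs by implementation (cblas.h for the reference BLAS or OpenBLAS, mkl.h for Intel's MKL).

    #include <stdio.h>
    #include <cblas.h>   /* with MKL, include <mkl.h> instead */

    int main(void)
    {
      /* Compute C = 1.0*A*B + 0.0*C for 2 by 2 matrices in row-major order. */
      double A[4] = {1.0, 2.0,
                     3.0, 4.0};
      double B[4] = {5.0, 6.0,
                     7.0, 8.0};
      double C[4] = {0.0, 0.0,
                     0.0, 0.0};

      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

      printf("%6.1f %6.1f\n%6.1f %6.1f\n", C[0], C[1], C[2], C[3]);
      return 0;
    }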

However, the original code was written before multi-core processors became common.  Within a few years, it became apparent that multi-core processors were the wave of the future.  In particular, "cache coherent non-uniform memory access" architectures were starting to enter the mainstream.  I made the decision to rewrite the code for shared memory multiprocessor systems using OpenMP directives.  This version of the code is described in a 2007 paper.  In that paper, we presented computational results showing how the performance scaled up to 16 processor cores.  For a variety of problems the typical parallel efficiency was about 50% at 16 cores; that is, the code ran about 8 times faster on 16 cores than on one core.  In the decade since that paper was written, the number of cores in desktop and laptop processor chips has steadily increased, to the point that desktop machines with 16 or 32 cores are quite common.
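The flavor of that parallelization is captured by the following sketch (illustrative only, not an actual loop from CSDP): a single OpenMP directive spreads the iterations of an expensive loop across the available threads, with a reduction clause combining the per-thread results.

    /* Build with OpenMP enabled, e.g. icc -qopenmp or gcc -fopenmp. */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
      const int n = 10000000;
      double sum = 0.0;
      int i;

      /* Iterations are divided among the threads; the reduction clause
         safely combines each thread's partial sum at the end. */
    #pragma omp parallel for reduction(+:sum)
      for (i = 0; i < n; i++)
        sum += 1.0 / ((double)i + 1.0);

      printf("computed with up to %d threads: sum = %f\n",
             omp_get_max_threads(), sum);
      return 0;
    }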

The Intel Knights Landing Xeon Phi processor has 64 to 72 cores (depending on the particular model), and each core supports 4-way hyperthreading.  The particular model that I was working with, the Xeon Phi 7250, has 68 cores.  The cores share access to 16 gigabytes of high-bandwidth on-package memory (MCDRAM), which was configured as a cache in my tests.  Hyperthreading did not appear to help the performance of the code, so I limited my tests to 64 threads.
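Capping the thread count is straightforward with OpenMP; setting the OMP_NUM_THREADS environment variable to 64 before running, or making a call like the one in this small sketch (illustrative, not CSDP source), keeps the run to one thread per core on most of the chip.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
      /* Use at most 64 OpenMP threads, i.e. one hardware thread per core
         on 64 of the 68 cores, rather than 4 hyperthreads per core. */
      omp_set_num_threads(64);

    #pragma omp parallel
      {
    #pragma omp single
        printf("running with %d threads\n", omp_get_num_threads());
      }
      return 0;
    }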

The individual cores in the Xeon Phi processors have relatively low clock rates (1.4 GHz in the 7250).  Intel's high-performance desktop and server processors typically operate at about twice that clock rate.  Furthermore, these cores lack many of the sophisticated architectural features found in other Intel processors.  One important new feature in the Knights Landing cores is support for SIMD instructions operating on 512 bits (8 double precision floating point numbers) at a time.  In theory, the 7250 is capable of about 3 teraflops of double precision performance (68 cores × 1.4 GHz × 2 vector units per core × 8 doubles per vector × 2 floating point operations per fused multiply-add).  In practice, getting this kind of performance requires the code to be tuned for high parallel efficiency and careful use of the SIMD instructions.

In my tests, I did not carefully tune the CSDP code.  In particular, I made no source code changes.  Rather, I limited my efforts to linking against Intel's MKL implementation of the BLAS and LAPACK routines and setting optimization flags for Intel's C compiler.
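A minimal sketch of what that amounts to is below; the compile line in the comment is illustrative rather than the exact set of flags from my build, and the program simply verifies that MKL is linked and reports how many threads its BLAS and LAPACK routines will use.

    /* Illustrative build line (the exact flags in my tests may have differed):
         icc -O3 -qopenmp -mkl=parallel example.c -o example                   */
    #include <stdio.h>
    #include <mkl.h>

    int main(void)
    {
      /* MKL's threaded BLAS/LAPACK routines will use up to this many threads. */
      printf("MKL will use up to %d threads\n", mkl_get_max_threads());
      return 0;
    }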

The work performed in each iteration of CSDP consists of forming and solving a large, symmetric, and positive definite system of linear equations.  For a semidefinite programming problem with m constraints and matrix variables of size n, it takes O(m^2 n^2) time to compute the elements of this matrix (ELEMENTS), O(m^3) time to factor the matrix (FACTOR), and most of the other operations (OTHER) take O(n^3) time.  In most cases, m is greater than or equal to n.  When m is much bigger than n, the dominant computational cost is in the factorization step.  When m and n are of similar size, ELEMENTS or OTHER can be dominant.
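Since the system is symmetric and positive definite, the FACTOR step can be carried out with LAPACK's Cholesky routine dpotrf; the following standalone sketch of the idea (not CSDP's actual factorization code) factors a small positive definite matrix through the LAPACKE C interface.

    #include <stdio.h>
    #include <lapacke.h>   /* MKL users can include <mkl.h> instead */

    int main(void)
    {
      /* A small symmetric positive definite matrix, stored row-major. */
      double A[9] = { 4.0, 1.0, 1.0,
                      1.0, 3.0, 0.0,
                      1.0, 0.0, 2.0 };
      lapack_int info;

      /* Overwrites the upper triangle of A with its Cholesky factor R,
         where A = R^T R.  info > 0 would mean A is not positive definite. */
      info = LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'U', 3, A, 3);
      printf("dpotrf returned %d\n", (int)info);
      return 0;
    }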

As we showed in the 2007 paper, the overall parallel efficiency of the code approaches the parallel efficiency of its least efficient part as the number of processors increases.  This is a manifestation of Amdahl's law.
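To make that concrete, here is a small sketch of the classical Amdahl's law calculation (my own illustration, not from the paper): if a fraction p of the work parallelizes perfectly and the rest is serial, the speedup on N cores is 1/((1-p) + p/N).  The example below assumes, arbitrarily, that p = 0.95.

    #include <stdio.h>

    /* Amdahl's law: predicted speedup on n cores when a fraction p of
       the work is perfectly parallel and the rest is serial. */
    static double amdahl(double p, int n)
    {
      return 1.0 / ((1.0 - p) + p / (double)n);
    }

    int main(void)
    {
      int cores[] = {1, 2, 4, 8, 16, 32, 64};
      double p = 0.95;          /* assumed parallel fraction, for illustration */
      int i;

      for (i = 0; i < 7; i++) {
        double s = amdahl(p, cores[i]);
        printf("%2d cores: speedup %5.2f, efficiency %3.0f%%\n",
               cores[i], s, 100.0 * s / cores[i]);
      }
      return 0;
    }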

The maxG55 test problem has m=5000 and n=5000.  Table 1 gives the wall clock times with 1 and 2 cores on my Dell XPS-13 laptop, which has an Intel Core i7-7560U processor.  Table 2 gives the times with 1 to 64 cores on the Xeon Phi machine.  Table 3 shows the percent parallel efficiencies.



Cores            1        2
Elements       5.9      3.7
Factor        19.7     15.7
Other        521.6    371.1
Total        547.1    390.4

Table 1: Wall clock times in seconds for the maxG55 test problem with one and two cores on my Dell XPS-13 laptop.




Cores            1        2        4        8       16       32       64
Elements      39.7     23.8     12.1      6.2      3.6      2.6      3.8
Factor        23.5     12.3      7.1      4.0      2.5      1.4      1.1
Other        713.0    535.9    325.8    192.1    128.4     91.9     73.5
Total        776.1    572.0    344.9    202.3    134.5     96.0     78.4

Table 2: Wall clock times in seconds for the maxG55 test problem with 1 to 64 cores on the Xeon Phi system.



Cores            1        2        4        8       16       32       64
Elements      100%      83%      82%      80%      69%      48%      16%
Factor        100%      96%      83%      73%      59%      52%      33%
Other         100%      67%      55%      46%      35%      24%      15%
Total         100%      68%      56%      48%      36%      25%      15%

Table 3: Percent parallel efficiencies for the maxG55 test problem on the Xeon Phi system with 1 to 64 cores.

Note that one core of the laptop processor is actually somewhat faster than one core of the Xeon Phi.  However, the laptop has only two cores, compared with the 68 cores of the Xeon Phi.  For this problem, the other operations are the most time-consuming part of the computation, and since they also have the worst parallel efficiency, they largely determine the overall parallel efficiency.

The thetaG51 test problem has m=6910 and n=1001.  For this problem, we expect that the elements and factor operations will take more time than the other operations because m is considerably larger than n.

Table 4 shows the wall clock times for the thetaG51 problem with 1 and 2 cores on the laptop.  Tables 5 and 6 show the wall clock times and parallel efficiencies with 1 to 64 cores on the Xeon Phi system.  For this problem, with one core, computing the elements of the matrix dominates the computational effort.  However, the elements computation scales well while the other operations do not, so with 64 cores it is the other operations that consume the most time.




Cores            1        2
Elements      63.9     45.8
Factor        99.9     72.7
Other         30.4     25.4
Total        194.3    143.9

Table 4: Wall clock times in seconds for the thetaG51 test problem with one and two cores on my Dell XPS-13 laptop.


Cores            1        2        4        8       16       32       64
Elements     535.5    271.2    142.4     71.1     36.1     19.4     11.6
Factor       121.5     62.5     37.2     20.9     10.9      7.1      4.0
Other         58.2     43.2     35.6     29.6     28.1     27.2     27.5
Total        715.2    376.9    215.2    121.6     75.0     53.7     43.1

Table 5: Wall clock times in seconds for the thetaG51 test problem with 1 to 64 cores on the Xeon Phi system.


Cores            1        2        4        8       16       32       64
Elements      100%      99%      94%      94%      93%      86%      72%
Factor        100%      97%      82%      73%      70%      53%      47%
Other         100%      67%      41%      25%      13%       7%       3%
Total         100%      95%      83%      74%      60%      42%      26%

Table 6: Percent parallel efficiencies for the thetaG51 test problem on the Xeon Phi system with 1 to 64 cores.

It was surprisingly easy to get CSDP up and running on the Xeon Phi system, and I immediately obtained results that were about 5 times faster than the same code running on a fast laptop.  There is obviously room for improvement in these results.  In particular, the relatively poor scaling of the "other" operations should be the target of further efforts to optimize the code.

There are two levels at which I would work to improve the parallel efficiency of the other operations.  The lower-level optimization would be to improve the vectorization of the code by using OpenMP 4.5 SIMD directives to help the compiler vectorize loops that it currently cannot; the compiler's output showed that many of the loops in the code are not being vectorized.  At a higher level, it would be desirable to use OpenMP directives to parallelize various loops within the "other" operations.  It seems likely that over the next few years mainstream processors will acquire the 512-bit vector registers of the Xeon Phi, and that the number of cores per processor chip will continue to grow.  Thus improvements in these aspects of the code should help with performance on those systems as well.
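Both kinds of directives are illustrated in the sketch below (illustrative code, not an actual CSDP loop): the simd clause asks the compiler to vectorize the inner iterations, which map naturally onto the 8-doubles-per-register AVX-512 units, while parallel for spreads the iterations across cores.

    /* Scaled vector update y = y + a*x.  The combined directive splits the
       loop across threads and vectorizes each thread's chunk (8 doubles per
       512-bit register on Knights Landing). */
    void scaled_update(int n, double a, const double *restrict x,
                       double *restrict y)
    {
      int i;
    #pragma omp parallel for simd
      for (i = 0; i < n; i++)
        y[i] += a * x[i];
    }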