Thursday, September 14, 2017

Sabbatical Leaves at Mathematical Institutes

I'm on sabbatical leave this semester at the Institute for Computational and Experimental Research in Mathematics (ICERM) at Brown University.  Since many of my family, friends, and colleagues have been asking questions about this, I thought I'd write a blog posting about sabbatical leave and particularly about doing a sabbatical leave at one of the mathematics institutes.

Sabbatical leave is one of the great things about working in academia.  Traditionally, tenured faculty members are given the opportunity to take paid leave to work on research projects and professional development.  In the US, a common policy is that faculty can take fully paid leave for one semester every seven years.  Sometimes it's possible to take an entire year of sabbatical leave, although at my institution a one-year sabbatical leave is at only half pay.  Sabbatical leave is not supposed to be a simple vacation from work, so you have to apply in advance (a semester or more in advance at my institution) with a plan for what you will do during the leave.  That plan might include research in a laboratory, field work, travel to access specialized library collections, writing a book, attending academic conferences, collaborating with researchers at another university, and so on.  I've reviewed many sabbatical leave applications as a member of the sabbatical leave committee at my university, and most of the applications that I've seen clearly show that the faculty member has a serious and well-thought-out plan.

It is possible to do a "staybattical" in which you remain at your home institution, but most faculty find that it is better to get away from their home department and avoid the kinds of interruptions and obligations that tend to occur at their home institution.

One common kind of sabbatical leave is a semester or year long sabbatical in an academic department at another university.  Typically, this is based on a preexisting collaboration with a particular faculty member at that institution.  The visitor would typically be given an office, access to computers and a laboratory.  If the sabbatical is at half pay, it's not uncommon for the hosting department to pay the visitor to teach some courses.  Similarly, it's not uncommon for faculty to visit at one of the national laboratories.

Another option is to visit a research institute.  Some institutes, such as the Institute for Advanced Study (IAS) in Princeton or the Simons Institute at Berkeley, are privately funded.  In recent decades, the National Science Foundation in the US has funded a series of mathematics research institutes.  Some of these institutes, such as the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS) at Rutgers, have graduated from their initial NSF funding and become self-sustaining.  There are similar institutes in Canada and other countries.

These institutes are typically located at a major university, where they have their own separate office and meeting space and are perhaps loosely connected to an academic department.  The NSF institutes conduct summer school programs, short (one-week) workshops, and longer-term (semester-long, year-long, or even multi-year) programs on specific research topics.  The institutes have funding to support travel to the shorter programs and funding for both travel and lodging for participants in the longer-term programs.

In the past, I've done short term visits for workshops to the Institute for Mathematics and its Applications (IMA) at the University of Minnesota and at DIMACS.  In 1999 I participated in a six-week summer school on mathematical geophysics at Stanford University.  In 2003 and 2010 I took semester-long sabbatical leaves at the Institute for Pure and Applied Math (IPAM) at UCLA.

This fall, I'm participating in the semester-long program on Mathematical and Computational Challenges in Radar and Seismic Imaging at ICERM at Brown University.  Radar and seismic imaging problems have important similarities and differences, so the program is bringing together researchers from these areas to share their ideas and approaches.  The semester program will bring together over a hundred participants in three workshops.  A smaller core of long-term participants is staying for the whole semester.

My official status at Brown University is that I'm a "Visiting Scholar".  That means that I've got an ID and can access the libraries, gym, etc.  At ICERM I've been given an office for the semester.  It's pretty bare bones, with a desk, a phone, a computer, and a blackboard.  I'm living near campus in a small studio apartment owned by Brown University in a building that houses visiting scholars.  ICERM pays the rent for my apartment, and they'll reimburse me for my airplane ticket, but I'm responsible for my other expenses (meals, laundry, etc.).

I'm expected to participate in the semester-long program by being around and discussing the research topic with other participants, attending the three week-long workshops embedded within the semester, and perhaps providing some mentoring to the graduate students and post-docs who are here for the semester.  I'll be looking for new research projects to work on, supervising some MS students via Google Hangouts video calls, working on the forthcoming third edition of my inverse problems textbook, and trying to finish off some leftover projects that have been hanging around my neck.  I will try very hard not to get involved in committee and other service work from my own campus.

While I'm here, I'll also take some time to enjoy Providence and New England.  For example, on Saturday I took a commuter rail train up to Boston, had some dim sum, and then visited the Museum of Fine Arts.  Tonight, I went out for dinner at an Italian restaurant in Providence's Federal Hill neighborhood with an old friend from college.  There will be lots of time to visit other friends and family in the area.

I'd encourage all of my math/stat faculty friends to consider taking a sabbatical leave at one of the mathematical institutes.  There are also lots of opportunities for post-docs, graduate students, and even undergraduate students to visit the mathematical institutes.  Check out upcoming workshops, summer schools, and semester programs, and make sure to apply early to participate. 

Tuesday, August 22, 2017

Trying out CSDP on a Xeon Phi Processor

I was recently given the opportunity to port my semidefinite programming code to an Intel Xeon Phi processor.  Although I only had about a week to work on this port and didn't have time to carefully optimize the code, I did learn a lot about this new system and about some implications for the future development of CSDP.

My code, CSDP, was originally written in the late 1990s as a single-threaded C code that made extensive use of linear algebra routines from the BLAS and LAPACK libraries.  The decision to use BLAS and LAPACK has proven to be a good one, because implementations of these libraries have consistently improved in performance over the years.

However, the original code was written in the era before multi-core processors became common.  Within a few years, it became apparent that multi-core processors were the wave of the future.  In particular, "cache-coherent non-uniform memory access" architectures were starting to enter the mainstream.  I made the decision to rewrite the code for use on shared memory multiprocessor systems using OpenMP directives.  This version of the code is described in a 2007 paper.  In that paper, we presented computational results showing how the performance scaled up to 16 processor cores.  For a variety of problems the typical parallel efficiency was about 50% at 16 cores; that is, the runtime was about 8 times faster using 16 cores than using one core.  In the decade since that paper was written, the number of cores in desktop and laptop processor chips has steadily increased, to the point that desktop machines with 16 or 32 cores are quite common.
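To give a flavor of what this looks like, here is a minimal sketch of the approach (a generic illustration, not code taken from CSDP): a loop whose iterations are independent is handed to OpenMP with a single directive.

#include <omp.h>

/* Generic sketch, not CSDP source: fill in the entries of an m by m
 * symmetric matrix in parallel, one row of the upper triangle per
 * iteration.  The iterations are independent, so OpenMP can divide the
 * rows among the available threads; dynamic scheduling helps balance
 * the uneven amount of work per row. */
void build_matrix(double *O, int m, double (*entry)(int, int))
{
  int i, j;

#pragma omp parallel for private(j) schedule(dynamic)
  for (i = 0; i < m; i++)
    for (j = i; j < m; j++)
      O[i * m + j] = O[j * m + i] = entry(i, j);
}

The loops in CSDP that build the Schur complement matrix are parallelized in roughly this style, with BLAS calls doing most of the work inside each iteration.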

The Intel Knights Landing Xeon Phi processor has 64 to 72 cores (depending on the particular model), and each core supports 4-way hyperthreading.  The particular model that I was working with, the KNL7250, has 68 cores.  The cores share access to 16 gigabytes of high-speed memory, which was set up for use as a cache in my tests.  Hyperthreading did not appear to help the performance of the code, so I limited my tests to 64 threads.

The individual cores in the Xeon Phi processor have relatively low clock rates (1.4 GHz in the 7250).  Intel's high-performance desktop and server processors typically operate at about twice that clock rate.  Furthermore, these cores lack many of the sophisticated architectural features found in other Intel processors.  One important new feature in the Knights Landing cores is support for SIMD instructions operating on 512 bits (8 double precision floating point numbers) at a time.  In theory, the KNL7250 is capable of 3 teraflops of performance.  In practice, getting this kind of performance requires the code to be tuned for high parallel efficiency and careful use of the SIMD instructions.
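To see where that theoretical peak comes from (my understanding is that each Knights Landing core has two 512-bit vector units, each able to perform a fused multiply-add on 8 doubles per clock cycle):

  68 cores x 2 vector units x 8 doubles x 2 flops (fused multiply-add) x 1.4 GHz = 3,046 gigaflops, or about 3 teraflops.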

In my tests, I did not carefully tune the CSDP code.  In particular, I made no source code changes.  Rather, I limited my efforts to making use of Intel's MKL library for BLAS and LAPACK routines and setting optimization flags for Intel's C compiler.

The work performed in each iteration by CSDP consists of forming and solving a large, symmetric, positive definite system of linear equations.  For a semidefinite programming problem with m constraints and matrix variables of size n, it takes O(m^2n^2) time to compute the elements of the matrix (ELEMENTS), O(m^3) time to factor the matrix (FACTOR), and O(n^3) time for most of the other operations (OTHER).  In most cases, m is greater than or equal to n.  When m is much bigger than n, the dominant computational cost is in the factorization step.  When m and n are of similar size, either ELEMENTS or OTHER can be dominant.
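For concreteness, the FACTOR step amounts to a Cholesky factorization of the m x m matrix followed by triangular solves.  Here is a minimal sketch written against the LAPACKE C interface (an illustration of the operation, not the code actually used inside CSDP):

#include <stdio.h>
#include <lapacke.h>   /* standard C interface to LAPACK; MKL provides it as well */

/* Illustration only: factor the m x m symmetric positive definite
 * matrix O and solve O x = rhs for a single right-hand side.  dpotrf
 * is the O(m^3) FACTOR step; dpotrs does the forward and back
 * substitutions. */
int factor_and_solve(double *O, double *rhs, int m)
{
  lapack_int info;

  info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', m, O, m);
  if (info != 0)
    {
      fprintf(stderr, "Cholesky factorization failed, info=%d\n", (int)info);
      return -1;
    }

  return (int)LAPACKE_dpotrs(LAPACK_COL_MAJOR, 'L', m, 1, O, m, rhs, m);
}

When the code is linked against MKL, these routines are themselves multithreaded, which is largely why the FACTOR times in the tables below improve with more cores even though I made no source code changes.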

As we showed in the 2007 paper, the parallel efficiency of the code approaches the parallel efficiency of its least efficient part as the number of processors increases.  This is a manifestation of Amdahl's law (a close relative of the Universal Scalability Law).
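To make that concrete: if a fraction p of the work parallelizes perfectly and the remaining 1-p does not, then the speedup on N cores is at most

  S(N) = 1 / ((1 - p) + p/N).

For example, with p = 0.95 and N = 64, the speedup is at most about 15.4, or roughly 24% parallel efficiency, even though 95% of the work scales perfectly.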

The maxG55 test problem has m=5000 and n=5000.  Table 1 gives the wall clock times with 1 and 2 cores on my Dell XPS-13 laptop, which has a 7560U processor.  Table 2 gives the times with 1 to 64 cores on the Xeon Phi machine.  Table 3 shows the percent parallel efficiencies.  


Cores         1      2
Elements    5.9    3.7
Factor     19.7   15.7
Other     521.6  371.1
Total     547.1  390.4

Table 1: Wall clock times for the maxG55 test problem with one and two cores on my Dell XPS-13 laptop.



Cores         1      2      4      8     16     32     64
Elements   39.7   23.8   12.1    6.2    3.6    2.6    3.8
Factor     23.5   12.3    7.1    4.0    2.5    1.4    1.1
Other     713.0  535.9  325.8  192.1  128.4   91.9   73.5
Total     776.1  572.0  344.9  202.3  134.5   96.0   78.4

Table 2: Wall clock times for the maxG55 test problem with 1 to 64 cores on the Xeon Phi system.



Cores         1      2      4      8     16     32     64
Elements   100%    83%    82%    80%    69%    48%    16%
Factor     100%    96%    83%    73%    59%    52%    33%
Other      100%    67%    55%    46%    35%    24%    15%
Total      100%    68%    56%    48%    36%    25%    15%

Table 3: Percent parallel efficiencies for the maxG55 test problem on the Xeon Phi system with 1 to 64 cores.

Note that one core of the laptop processor is actually somewhat faster than one core of the Xeon Phi.  However, the laptop only has two cores, compared with the 68 cores on the Xeon Phi.  For this problem, the other operations are the most time-consuming part of the computation, and since the parallel efficiency is worst for these operations, the overall parallel efficiency is largely determined by them.

The thetaG51 test problem has m=6910 and n=1001.  For this problem, we expect that the elements and factor operations will take more time than the other operations because m is considerably larger than n.

Table 4 shows the wall clock times for the thetaG51 problem with 1 and 2 cores on the laptop.   Tables 5 and 6 show the wall clock times and parallel efficiencies with 1 to 64 cores on the Xeon Phi system.  For this problem, with one core, computing the elements of the matrix dominates the computational effort.  However, since these computations scale well, while the parallel efficiency of the other operations is poor, it turns out that the other operations consume the most time with 64 cores.



Cores         1      2
Elements   63.9   45.8
Factor     99.9   72.7
Other      30.4   25.4
Total     194.3  143.9

Table 4: Wall clock times for the thetaG51 test problem with one and two cores on my Dell XPS-13 laptop.


Cores         1      2      4      8     16     32     64
Elements  535.5  271.2  142.4   71.1   36.1   19.4   11.6
Factor    121.5   62.5   37.2   20.9   10.9    7.1    4.0
Other      58.2   43.2   35.6   29.6   28.1   27.2   27.5
Total     715.2  376.9  215.2  121.6   75.0   53.7   43.1

Table 5: Wall clock times for the thetaG51 test problem with 1 to 64 cores on the Xeon Phi system.


Cores         1      2      4      8     16     32     64
Elements   100%    99%    94%    94%    93%    86%    72%
Factor     100%    97%    82%    73%    70%    53%    47%
Other      100%    67%    41%    25%    13%     7%     3%
Total      100%    95%    83%    74%    60%    42%    26%

Table 6: Percent parallel efficiencies for the thetaG51 test problem on the Xeon Phi system with 1 to 64 cores.

It was surprisingly easy to get CSDP up and running on the Xeon Phi system, and I immediately obtained results that were about 5 times faster than the code running on a fast laptop computer.  There is obviously room for improvement in these results.  In particular, the relatively poor scaling of the "other" operations should be the target of further efforts to optimize the code.

There are two levels at which I would work to improve the parallel efficiency of the other operations.  The lower-level optimization would be to improve the vectorization of the code by using OpenMP 4.5 directives to help the compiler vectorize loops that it currently cannot; compiler reports showed that many of the loops in the code were not being vectorized.  At a higher level, it would be desirable to use OpenMP directives to parallelize various loops within the "other" operations; a generic sketch of both kinds of directives appears at the end of this post.
It seems likely that over the next few years mainstream processors will acquire the 512-bit vector registers of the Xeon Phi and that the number of cores per processor chip will continue to grow.  Thus, improvements in these aspects of the code should help with performance on those systems as well.
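As a generic illustration of the two kinds of directives mentioned above (a sketch of what I have in mind, not a change that has actually been made to CSDP):

#include <omp.h>

/* Illustration only, not CSDP source: "omp parallel for" spreads the
 * outer loop across cores, while "omp simd" asks the compiler to
 * vectorize the inner loop across the 512-bit registers. */
void scale_add(double *y, const double *x, double alpha, int n, int reps)
{
  int r, i;

#pragma omp parallel for private(i)
  for (r = 0; r < reps; r++)
    {
#pragma omp simd
      for (i = 0; i < n; i++)
        y[r * n + i] += alpha * x[i];
    }
}

OpenMP also provides a combined "parallel for simd" construct for cases where both kinds of parallelism apply to the same loop.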