Wednesday, June 14, 2017

Using BLAS and LAPACK on Linux


Many software packages for scientific computing spend most of their time performing numerical linear algebra operations including matrix-vector and matrix-matrix multiplication, solving linear systems of equations, eigenvalue computations, and matrix factorizations such as the QR and SVD factorizations. These packages often make use of BLAS and LAPACK subroutines to perform these linear algebra operations.

The BLAS library performs basic linear algebra operations such as dot products, vector norms, matrix-vector multiplications, and matrix-matrix operations.  The LAPACK library uses functions from the BLAS library to perform higher level computations such as solving linear systems of equations, finding eigenvalues of a matrix, or finding the singular value decomposition of a matrix.  You may also see references to two earlier libraries, LINPACK (for systems of equations) and EISPACK (for eigenvalue problems), which also use the BLAS; they have been effectively superseded by LAPACK.  In most cases, performance optimization at the BLAS level is sufficient.  Some of the libraries mentioned below also include optimized versions of some LAPACK routines, but in general it's sufficient to use an optimized BLAS with the standard LAPACK implementation.

LAPACK and BLAS were originally written in Fortran, and most software that calls BLAS/LAPACK routines uses the original Fortran interface.  For programs written in other languages such as C, this can create complications in interfacing to the libraries.  See the appendix below on calling Fortran BLAS/LAPACK routines from C.  There are alternative interfaces that wrap around the BLAS and LAPACK libraries to provide a more natural C language interface.  The LAPACKE interface provides C interfaces to the LAPACK routines.  The CBLAS interface provides C interfaces to the BLAS routines.

The performance of the software can depend greatly on how well the BLAS operations have been optimized.  There are a number of available implementations of the BLAS with different licenses, performance characteristics, features, and ease of installation and use.  In helping users of my own package, CSDP, I often find that they're struggling to install BLAS and LAPACK to work with CSDP or that they're getting poor performance out of the BLAS that they've used.  

This post discusses the available implementations of the BLAS on Linux and gives some recommendations on how to select a BLAS.  I've written the blog posting from the point of view of someone who is installing software that makes use of the BLAS.  A few appendices at the end of the post discuss technical issues that may be of interest if you're actually writing Fortran or C code that calls BLAS or LAPACK routines.

Factors Affecting the Performance of BLAS

In this section, I'll discuss some important aspects of computer architecture that affect the performance of the BLAS and LAPACK libraries.  Intel and AMD processors based on the x86-64 (aka Intel 64) architecture are most commonly used on desktop and laptop Linux machines, so I'll limit my discussion to these processors. 

It is important to understand that on modern processors it is relatively time-consuming to access a floating point number in main memory: it can take hundreds of nanoseconds, while floating point additions, subtractions, multiplications, and divisions can each be done in a nanosecond or less.
To deal with slow memory access, modern processors have cache memory.  Cache memory is much smaller than main memory but also much faster.  Processors now have three or even four levels of cache between the processor and main memory.

The key to obtaining high performance with cache memory is to reuse data that has been brought into the cache from main memory several times before that data is flushed out of the cache.  Algorithms that make a single pass over the data, performing only one or two floating point operations on each number brought in from main memory, are effectively limited by the speed of main memory, and the cache doesn't help.

The linear algebra operations performed by the BLAS are conveniently categorized into three levels.  The level one operations act on one or two vectors of length n and perform a total of O(n) operations.  Examples include vector dot products, norms, and the saxpy operation, in which a multiple of one vector is added to a second vector.  The level two operations involve matrices and vectors of size n and perform a total of O(n^2) operations.  The most common of these operations is matrix-vector multiplication.  The level three operations involve matrices of size n but perform O(n^3) operations.  The most common level three operation is matrix-matrix multiplication.
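To make the three levels concrete, here are deliberately naive C versions of one operation from each level (the naive_* names are my own; these are illustrations of the operation counts, not the optimized library routines, and they assume column-major storage as Fortran uses):

```c
#include <stddef.h>

/* Level 1: y <- a*x + y (the "daxpy" operation), O(n) work on O(n) data. */
void naive_daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Level 2: y <- A*x for an m-by-n column-major A with leading dimension
   lda, O(n^2) work on O(n^2) data. */
void naive_dgemv(size_t m, size_t n, const double *A, size_t lda,
                 const double *x, double *y) {
    for (size_t i = 0; i < m; i++) y[i] = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++)
            y[i] += A[i + j * lda] * x[j];
}

/* Level 3: C <- A*B for n-by-n column-major matrices, O(n^3) work on
   O(n^2) data. */
void naive_dgemm(size_t n, const double *A, const double *B, double *C) {
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i + k * n] * B[k + j * n];
            C[i + j * n] = s;
        }
}
```

Counting the loop iterations in each function shows the O(n), O(n^2), and O(n^3) operation totals directly.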

It should be clear that it is impossible to make effective use of the cache with level one and level two operations, since there are only a small number (one, two, or three) of floating point operations for each number brought in from main memory.  However, the level three operations perform O(n^3) operations on O(n^2) data, so a factor of n reuse of data is possible.  For example, blocked matrix-matrix multiplication algorithms take two large n by n matrices and break them into block matrices whose individual blocks are small enough to fit in cache.  The products of blocks are then performed in cache, added together, and written out to main memory.  Cache-aware blocked level three operations can be vastly faster than naively coded loops.  LAPACK was designed to make as much use as possible of level three operations to improve performance on computers with cache memory.
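A minimal sketch of the blocked approach, assuming column-major n-by-n matrices with n a multiple of the block size (a real BLAS tunes the block size to the cache hierarchy and handles the edge cases this sketch ignores):

```c
#include <stddef.h>
#include <string.h>

#define BS 4  /* block size; a real library tunes this to the cache size */

/* C <- A*B for n-by-n column-major matrices, with n a multiple of BS.
   The six nested loops multiply small BS-by-BS blocks that fit in cache,
   so each number loaded from main memory is reused about BS times. */
void blocked_dgemm(size_t n, const double *A, const double *B, double *C) {
    memset(C, 0, n * n * sizeof(double));
    for (size_t jj = 0; jj < n; jj += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t ii = 0; ii < n; ii += BS)
                /* multiply the (ii,kk) block of A by the (kk,jj) block
                   of B and accumulate into the (ii,jj) block of C */
                for (size_t j = jj; j < jj + BS; j++)
                    for (size_t k = kk; k < kk + BS; k++) {
                        double b = B[k + j * n];
                        for (size_t i = ii; i < ii + BS; i++)
                            C[i + j * n] += A[i + k * n] * b;
                    }
}
```

The result is identical to the naive triple loop; only the order in which memory is touched changes, and that ordering is what lets the cache help.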

An important aspect of modern random access memory is that the memory operates in burst mode.  Rather than reading a single word from memory into the cache, the memory system reads in a "cache line" that might contain 64 or 128 consecutive bytes of memory.   This is very convenient if the algorithm accesses data consecutively but very inefficient when the algorithm jumps around in memory.  Matrices can be stored in memory in "column-major" order with the elements of a column in consecutive positions in memory or "row-major" order with the elements of a row in consecutive positions.   Algorithms used by BLAS and LAPACK are designed to take advantage of the column-major storage used in the Fortran programming language.  See the appendix below for a more detailed discussion of how Fortran and C store matrices.  
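The effect of traversal order can be seen in two loops that compute the same sum over a column-major matrix; only the first walks memory with unit stride (the function names are mine, for illustration):

```c
#include <stddef.h>

/* Sum the entries of an m-by-n column-major matrix with leading
   dimension lda, walking down each column: consecutive addresses, so
   every word of each cache line brought in gets used. */
double sum_by_columns(size_t m, size_t n, const double *A, size_t lda) {
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++)
            s += A[i + j * lda];
    return s;
}

/* The same sum, walking across each row: every step jumps lda doubles
   ahead in memory, so for large lda each access touches a new cache
   line and most of each line goes unused. */
double sum_by_rows(size_t m, size_t n, const double *A, size_t lda) {
    double s = 0.0;
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            s += A[i + j * lda];
    return s;
}
```

For matrices much larger than the cache, the first version can be several times faster even though both perform exactly the same additions.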

In recent years microprocessors have become faster by performing operations in parallel.  There are two main levels at which parallelism has been used to improve performance.  At the lower level are instructions that can perform multiple arithmetic operations simultaneously.  These Single Instruction Multiple Data (SIMD) instructions make it possible to perform 2, 4 or even 8 floating point operations per cycle rather than a single operation.  In successive generations, SSE, SSE2, and AVX instructions have been added as new features to the x86-64 architecture.  Compilers are generally capable of automatically producing code that performs operations in parallel using SIMD instructions.  However, if a code is to be compiled so that it can be distributed in binary form and run on all x86-64 processors then the more recent SIMD instructions cannot be used.  

At a higher level, modern processor chips actually consist of multiple processor cores.  As individual transistors have gotten smaller on integrated circuits it has become possible to cram more and more processor cores onto a single chip.  Some of the latest generation processors have as many as 18 separate processor cores on a single chip.  Some programs are inherently sequential and can only run on one processor core.  In other cases, it is possible to break up the work so that it can be performed in parallel on multiple cores.  Fortunately for us, it is relatively easy to write multithreaded versions of the BLAS functions that can make use of multiple processor cores. 

In order to get high performance, an implementation of the BLAS must make effective use of cache memory, SIMD instructions, and multiple cores.  The software can be written to work with the most basic x86-64 architecture (so that it will run on all AMD and Intel processors), written to dynamically select routines that work well on the current processor, or tuned at compile time for a specific processor.  All three of these approaches are used by the BLAS libraries listed below.

Available BLAS and LAPACK Libraries

The following table summarizes the features of some of the most commonly used BLAS and LAPACK libraries.    OpenMP and ILP64 features are discussed in the appendices.

[Table: feature summary of the libraries discussed below, including license, tuning approach, OpenMP awareness, and ILP64 support.]
Reference LAPACK

The LAPACK project is hosted at the NETLIB web site along with many other packages for numerical computing.  The current version of the library is 3.7.0.  The LAPACK library is written in Fortran, but a C language interface, LAPACKE, is also distributed with LAPACK.  See the LAPACK package in Ubuntu 16.04.


Reference BLAS

The reference implementation of the BLAS is a single threaded Fortran library that makes no use of parallel processing and doesn't take advantage of cache memory.  Some performance improvement is possible by using compiler optimizations and SIMD instructions.  The Ubuntu package libblas3 is a generic x86-64 build of the reference BLAS.  It's easy to install but performs poorly in practice.


ATLAS

ATLAS stands for Automatically Tuned Linear Algebra Subroutines.  The ATLAS library comes with a very complicated build procedure that runs timing tests to select the block size parameters and algorithms that work best on the current machine.  The build process is time-consuming and requires that CPU throttling (commonly used for power management and cooling) be turned off.  Once you've worked through this complicated build process, the library generally performs well.  ATLAS includes a few of the LAPACK routines, but for anything complicated you'll need the regular LAPACK library.

Ubuntu has a generic x86-64 version of ATLAS, but its performance is typically poor in comparison with a version built from source because the generic binary package hasn't been properly tuned.


OpenBLAS

The OpenBLAS library can be compiled from source and automatically configures the block sizes and algorithms for your machine if the microarchitecture is on its supported list.  For newer processors, it's generally OK to use the most recent supported architecture.  For example, I built my copy of OpenBLAS on a Kaby Lake processor using the configuration for a Haswell processor and got good performance.  OpenBLAS includes the cblas interface as well as LAPACK and LAPACKE routines.  OpenBLAS works particularly well with main programs that are compiled with OpenMP parallelism.

Ubuntu has a generic x86-64 version of OpenBLAS, but much like ATLAS and the reference BLAS, this package performs poorly because it hasn't been tuned.

OpenBLAS can be built with a DYNAMIC_ARCH flag so that it will dynamically select code optimized for the particular architecture of the machine that it is running on.  A variety of Intel and AMD processors are supported.  Executable code built with this version of the library will be quite large, but the resulting code will typically run well on both older and newer systems.  

Intel MKL

The Intel MKL library includes BLAS, LAPACK, and other useful mathematical subroutines.  The cblas and LAPACKE interfaces are also included.  The BLAS is highly optimized and dynamically configures itself on a range of Intel processors.  The library doesn't work well on AMD processors.  Intel has a number of ways of getting free (as in beer) licenses to use MKL.  It is not open source software, however.


AMD ACML

ACML is roughly comparable to Intel's MKL in its functionality but optimized for use on AMD processors.  The library doesn't work well on Intel processors.  The ACML software is proprietary but can be downloaded for free (as in beer).  AMD has announced the end of support for the proprietary ACML and is moving to support open source high-performance computing projects, including the BLIS effort to replace BLAS.  (BLIS is pretty new and complicated to build, so I haven't tried it yet.)


Recommendations

If you simply need to run software that requires the BLAS and performance is of little or only moderate concern, then a precompiled OpenBLAS package (or perhaps a precompiled version of the reference BLAS) should be suitable.  I would generally not pick ATLAS at this point because of the difficulty of installing it.  If performance is important, then I would compile OpenBLAS from source.  The Intel MKL and AMD ACML libraries provide performance similar to (and perhaps better than) OpenBLAS, but I wouldn't bother with MKL or ACML unless you also need access to other features of these libraries beyond BLAS.

For example, I recently installed my CSDP software on a newly purchased laptop with a Kaby Lake (latest generation Intel) processor.  The Ubuntu OpenBLAS binary package performed poorly.  I found that a version of OpenBLAS compiled from source was at least four times faster than the Ubuntu OpenBLAS binary package, the Ubuntu ATLAS package, and the Ubuntu reference BLAS package.  ACML performed poorly on this Intel machine.  MKL was roughly similar in performance to the version of OpenBLAS that I built from source.

Appendix: Fortran and C array storage

In Fortran, an m by n matrix A is stored in a data structure called an array.  To give flexibility in adding rows to a matrix, Fortran and the BLAS/LAPACK interfaces allow for an array A to have LDA > m rows.  (LDA is an acronym for “leading dimension of A.”)

It is necessary to map the two-dimensional array of entries of A onto the one-dimensional linear layout of memory.  In the Fortran scheme, the array is stored in “column major” order. That is, the entries in the first column are followed by the second column, and so on.  In general, the A(i,j) element of the array A appears in position i+(j-1)*LDA in memory.  
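In C, with its 0-based indexing, the same mapping can be written as a small helper (the function name is my own; the BLAS and LAPACK routines themselves simply receive the array and LDA):

```c
#include <stddef.h>

/* 0-based C index of the Fortran element A(i,j), where i and j are
   1-based as in Fortran, for a column-major array with leading
   dimension lda.  This is the i+(j-1)*LDA rule, shifted to 0-based
   positions. */
size_t fortran_index(size_t i, size_t j, size_t lda) {
    return (i - 1) + (j - 1) * lda;
}
```

So, for example, with LDA=5 the element A(1,2) sits immediately after the five entries of column one, at 0-based position 5.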

For example, a 3x3 matrix might be stored in a 5x3 array (LDA=5) as

A(1,1) A(2,1) A(3,1) * * A(1,2) A(2,2) A(3,2) * * A(1,3) A(2,3) A(3,3) * *

where the entries marked * are the unused bottom rows of each column.

In calling Fortran routines from the BLAS and LAPACK libraries, the programmer must pass the array A together with the matrix size parameters m and n and the leading dimension LDA.

C actually has two different schemes for storing arrays.  Dynamically allocated arrays are often built as a vector of pointers to 1-D arrays that represent the individual rows of the matrix.  This style of array is not supported by the cblas and LAPACKE interfaces.  Statically allocated two-dimensional arrays are stored in a scheme very similar to Fortran arrays, except that the array is stored in "row major" order, with all of the entries of row one preceding row two, and so on.  Another, less important, difference is that C array subscripts start at 0 rather than 1.

A C programmer has two options for interfacing to BLAS and LAPACK.  The older approach is to implement Fortran array indexing within 1-D C arrays and then call the standard Fortran BLAS/LAPACK routines.  This approach is used by many software packages. The alternative approach is to use LAPACKE and cblas if they are available. The LAPACKE and cblas interfaces work with matrices that are stored in either row-major or column-major order.  
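With the older approach, a program that already holds its data in row-major order sometimes needs to copy it into column-major order before handing it to a Fortran routine.  A minimal sketch (the function name is my own):

```c
#include <stddef.h>

/* Copy an m-by-n row-major C matrix into a column-major (Fortran-style)
   array with leading dimension m.  Element (i,j), 0-based, moves from
   position i*n + j to position i + j*m. */
void row_to_col_major(size_t m, size_t n,
                      const double *rowmaj, double *colmaj) {
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            colmaj[i + j * m] = rowmaj[i * n + j];
}
```

Many programs avoid this copy entirely by doing column-major indexing into a 1-D array from the start, which is exactly the older interfacing approach described above.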

Appendix: Calling Fortran subroutines from C

Calling Fortran subroutines from C is not completely standardized.  Different Fortran compilers use different naming and parameter passing conventions, so it is sometimes necessary to adjust C code to work with a different Fortran compiler.  

In Fortran, subroutine names are case insensitive (and often converted to lower-case by the compiler), while in C they are case sensitive.  Many Fortran compilers append an underscore ("_") to the end of the subroutine name.

In Fortran, parameters are passed by reference (the address of the parameter) rather than value, while in C parameters are typically passed by value.  Thus when you call a Fortran subroutine from C it is necessary to pass a pointer to each of the parameters.    

In C, variable-length strings have a null byte (0) to indicate the end of the string.  Some Fortran compilers add hidden string length parameters to the end of the argument list to specify the length of each of the strings being passed to the subroutine.

Consider a Fortran call to the LAPACK Cholesky factorization routine DPOTRF

CALL DPOTRF( 'U', N, A, LDA, INFO )
where the string 'U' tells LAPACK to put its answer in the upper triangular part of A, N is the dimension of the matrix A, LDA is the leading dimension of the array A (not necessarily equal to N!), and INFO returns a result code.

In C, the most common way to write this would be

dpotrf_( "U", &n, a, &lda, &info );

However, depending on the Fortran compiler, it could be

dpotrf( "U", &n, a, &lda, &info );

or

DPOTRF( "U", &n, a, &lda, &info );

or even

dpotrf_( "U", &n, a, &lda, &info, 1 );
In this last example, the parameter "1" specifies the length of the string "U".

Appendix: Pthreads and OpenMP threading

Many of the BLAS libraries are "threaded", meaning that the library will automatically spawn several threads of execution on different processor cores to compute (e.g.) a matrix-matrix multiplication.  A program that uses BLAS/LAPACK libraries might itself run single threaded, or it might be multithreaded.  Typically, it's best for a program to use as many threads as there are available processor cores.  Running more threads than cores can hurt performance, as the computer must spend time switching back and forth between threads.  A common performance problem occurs when a multithreaded main program has several threads that each call threaded BLAS routines.  For example, on a four-core machine, a program might use 4 threads, but if these 4 threads each call a BLAS routine that starts 4 more threads, suddenly there are 16 active threads.

There are two very commonly used threaded programming models: the pthreads (POSIX threads) model and OpenMP threading.  Intel also has its "Threading Building Blocks", although these are not very widely used in open source software.  There is no standard way in pthreads for a process to indicate that subroutines shouldn't create new threads, and thus no effective way to avoid the performance problem mentioned in the previous paragraph.  However, OpenMP does have a standard mechanism for doing this.  Some BLAS libraries are "OpenMP aware" and will not create additional threads if the calling program doesn't want additional threads.  This is a very useful feature.

Appendix: I32LP64 vs. ILP64

On 64-bit Linux systems, C uses 32 bits to represent integers, 64 bits to represent long integers, and 64 bits for pointers.  Similarly, Fortran integers are 32 bits.  This programming model is referred to as I32LP64.  An alternative model, ILP64, uses 64 bits to represent integers.  There is a potential issue with using the I32LP64 model in BLAS/CBLAS and LAPACK/LAPACKE, since in some circumstances a vector might be so long that its length can't be represented by a 32-bit integer.  The largest number that can be represented by a signed 32-bit integer is about 2 billion, but BLAS and LAPACK are often used on machines with hundreds of gigabytes of RAM, so this actually can occur in some cases.  Some implementations of the BLAS and LAPACK libraries have versions that use, or can be compiled to use, 64-bit integer parameters.
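To make the limit concrete, here is a tiny C sketch (the helper name is my own) of the constraint that an I32LP64 interface implicitly places on vector lengths:

```c
#include <stdint.h>

/* Returns 1 if a vector length n can be passed through a 32-bit
   Fortran INTEGER parameter, 0 otherwise.  INT32_MAX is 2^31 - 1,
   about 2.1 billion. */
int length_fits_i32(int64_t n) {
    return n >= 0 && n <= INT32_MAX;
}
```

A vector of 3 billion doubles occupies only 24 gigabytes, well within the memory of a large machine, yet its length already overflows a 32-bit integer.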

Appendix: The TLB and Transparent Hugepages

Modern processors make use of "virtual memory", which allows the operating system to move pages of code and data in and out of main memory and store them on disk.  This makes it possible to have multiple programs running and using more "virtual memory" than the real memory in the system.   Because pages can end up being moved to different real memory locations as they are paged in and out of main memory, it is necessary to translate virtual addresses into real physical addresses.  The operating system maintains an address translation table that maps each virtual memory page's virtual addresses into a corresponding physical address (or indicates that a page is not in main memory and must be brought in from disk before being used.)

A special cache called the translation lookaside buffer (TLB) stores the address translation for a small number (at most a couple of thousand) of the most recently accessed pages.  If a page's information isn't in the TLB, then the operating system executes code to bring that entry from main memory into the TLB.   The TLB handles "hits" instantly,  while TLB "misses" cost a considerable amount of time (a microsecond or more.)

The Linux kernel was originally designed to use only 4K byte pages.  More recently, support for hugepages (of 2 megabytes) was added to the kernel.  Programs can explicitly allocate the larger pages, or a Linux feature called "transparent hugepages" can be used to dynamically identify situations where small 4K byte pages can be combined into larger hugepages.

For numerical linear algebra codes involving multi-gigabyte matrices, the older 4K byte pages are very problematic.  A 4-gigabyte array uses one million 4K byte pages, far more than the capacity of the TLB.  This results in many performance-sapping TLB misses.  With two megabyte huge pages, the 4 gigabyte array uses a much more manageable two thousand hugepages.  For some other workloads, particularly databases and web servers, that rapidly allocate and deallocate storage, the transparent hugepages feature actually hurts performance.  The decision to turn on or off transparent hugepages depends on the workload of a particular computer.  Some Linux distributions enable the feature by default while others disable it.
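The page-count arithmetic in the previous paragraph is easy to check with a few lines of C (the function name is my own):

```c
#include <stdint.h>

/* Number of pages needed to hold `bytes` of data with the given page
   size, rounding up for a partial final page. */
int64_t pages_needed(int64_t bytes, int64_t page_size) {
    return (bytes + page_size - 1) / page_size;
}
```

With 4K (4096-byte) pages, a 4-gigabyte array needs 1,048,576 pages, while with 2-megabyte hugepages it needs only 2,048.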

You can check to see whether transparent hugepages are enabled on your Linux machine with

cat /sys/kernel/mm/transparent_hugepage/enabled

The answer will either be "always", "madvise" (if a program has to ask for huge pages), or "never."  If desired, an administrator can turn on the transparent hugepage feature with

sudo hugeadm --thp-always

Using transparent hugepages can improve the performance of codes that use BLAS and LAPACK on large arrays.  For example, I typically see about a 10% performance improvement with hugepages on my laptop computer.  

Further Reading

For general background on computer architecture including caching, the translation lookaside buffer, and other features that are important to performance, I highly recommend the book:

R. E. Bryant and D. R. O'Hallaron.  Computer Systems: A Programmer's Perspective, 3/E, 2015.   An earlier draft edition is available for free online.

For background on numerical linear algebra, algorithms that work by rows versus columns, and blocked algorithms, I highly recommend:

D. S. Watkins.  Fundamentals of Matrix Computations, 3rd ed.

Monday, February 24, 2014

Scraping Citation Counts from a Google Scholar Profile

Google Scholar has no published API.  This means that if you want to extract information from someone's public Google Scholar profile, you have to write a script to scrape the web page.

I was recently asked to come up with a way to extract citation counts by year from a Google Scholar profile. This information is visible in a chart on the profile page, and it wasn't too hard to write a Python script to extract it. You can see my script at

Note that the citation counts are actually plotted as percentages of the citation counts in the highest year, and these percentages are rounded to one digit after the decimal point.  e.g. if an author's maximum citation year had 70 citations and a particular year had only 3 citations, then this is encoded in the table as 4.3%.  The script rounds to the nearest integer, but this can easily be off when the total number of citations is high.  
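The decoding step amounts to the following arithmetic (the script itself is in Python; this C sketch, with a helper name of my own, just shows the round-trip and why it loses precision):

```c
/* Recover an approximate citation count from a percentage that was
   rounded to one decimal place, given the count in the maximum year.
   Rounding to the nearest integer is exact only when max_count is
   small enough that 0.05% of it is less than half a citation. */
int count_from_percent(double pct, int max_count) {
    return (int)(pct / 100.0 * max_count + 0.5);
}
```

With a maximum of 70 citations, 4.3% decodes back to 3 exactly; with a maximum of 10,000 citations, 4.3% decodes to 430, but the true count could have been anywhere from 425 to 434.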

Similar plots are produced by Thomson Reuters' "ResearcherID" profiles, but in that case the chart is produced by an opaque process: you can get at the generated GIF, but in order to extract counts by year you'd have to run the graph through a "graph digitizer" tool.

It's a sad commentary on the state of the world wide web that information like this is presented in web pages in formats that are very hard to decode.  

Friday, February 8, 2013

Review of Coursera's "Computing for Data Analysis" MOOC

I signed up for this course primarily because I wanted to experience a Coursera massively open online course (MOOC), and also because I wanted to improve upon my very rudimentary knowledge of the R programming language.  For those of you too impatient to read the entire review the summary is that I did learn a few things from the course and I'm glad that I took the time to participate, but there were many things about the course that really could be improved.

First, a comment about the word "open" in the acronym MOOC.  Although some of the first MOOCs really were open in the sense that the course content was made available under open source licenses, the Coursera course content is not made available under any open license.  This makes it effectively impossible for students and other instructors to build upon the MOOC and improve it.  Many of my criticisms of Computing for Data Analysis could be very effectively addressed by incremental improvements made in the typical fashion of open source software projects.

It is interesting that there are already many open resources for learning R programming in the form of online tutorials and notes.  These clearly aren't as interactive and engaging as the Coursera course, but it seems likely that the authors who produced these materials have enough knowledge of the subject that, with the support of instructional designers and producers, they could also produce a MOOC on R programming.  Thus the special sauce is clearly not Roger Peng's expertise in R; lots of instructors have that background.

I also don't believe that the Coursera platform is a unique distinguishing feature- Coursera and its competitors have all been able to produce MOOC platforms that can scale to handling hundreds of thousands of students with very little difficulty.  The technologies involved in these platforms, including lecture videos with subtitles and indexing, online multiple choice tests, banks of test questions, algorithmically generated questions, and even more sophisticated adaptive instruction techniques are already available in many computer based instruction systems.

Many universities already make use of the same textbooks and instructional software; they're not competing with each other by controlling access to the content that they're teaching.  Rather, they're competing with each other in other areas, including the quality of interaction between students and faculty, the scholarly reputation of faculty, and various aspects of the college experience unrelated to academics.  In fact, faculty are often encouraged to publish textbooks because this helps to bolster the reputation of an institution.  If universities don't want to compete in the area of content, but they're concerned about Coursera dominating the world of MOOCs, then they have an easy alternative.  They can cooperatively develop truly open content in the form of Open Educational Resources (OER).

OER also offers an escape route for faculty who fear that they may be reduced to nothing more than tutors and graders for courses in which an academic star is featured in recorded lectures.  With OER, instructors have the opportunity to incorporate previous work by other instructors into their own courses, but they also have an opportunity to bring their own work into the mix.  Everyone gets to participate, and instructors can focus on finding the best ways to present the material that they are most interested in.

For these reasons, I think that OER represent a real and significant alternative to the commercial MOOCs.

Now on to the Computing for Data Analysis (hereafter I'll just use the acronym CFDA)  course itself.

One obvious issue in constructing a MOOC is whether to set the course up in self paced format or to run it on a schedule.  CFDA was run on a four week schedule, with weekly quizzes and programming assignments.  Given the way that the video lectures, quizzes, and programming assignments were presented, there was no technical reason that the course couldn't be presented in a self paced format with students working at their own pace.  The scheduled format did make it easier for participants to discuss the course in the online forums associated with the course.  However, it also caused problems for students who had to travel or were otherwise busy at some point during the four weeks of the course.  To help with this, students were given loose deadlines and a small number of extra days that they could spend on assignments.  I personally found the scheduled format to be overly restrictive, and I would have preferred to see the course presented in a self paced format.

The lectures in this course were presented as online videos, each up to about half an hour long.  The videos consisted of prepared slides with the lecturer's voice on the soundtrack.  To keep the videos from becoming too boring, there were multiple choice questions interspersed throughout the videos.  There were also some activities for the viewer to try at the R command line.  The production values were comparable to what I regularly see in lecture capture videos from conventional courses; these were not highly polished videos.

In most conventional courses, lectures are at least loosely tied to a textbook.  Although there are many open source texts on R and also many commercially available textbooks, CFDA was not tied to any particular text.  Using a  commercial text would obviously have been problematic in a "free" course.  Using one of the open source resources would have been equally problematic given Coursera's closed source approach.

Not having a reference is problematic in a course about a programming language, since anyone learning a programming language is going to have to frequently look up the details of syntax and library functions.  It was quite irritating to me in doing the programming assignments that I frequently had a vague memory of some function from the lecture but didn't remember it in enough detail.  I then had the unpleasant choice of going back through the video or going to some other resource to find that information.

I was disappointed by the lack of depth in the lecture material.  This was a course in R programming rather than statistics, so any in-depth discussion of statistical issues would have been inappropriate, but I was surprised by how little discussion there was of the structure and semantics of the R programming language.  For example, R uses a fairly unusual scheme of lexical scoping that can be particularly useful in statistical work.  This was mentioned briefly in one  lecture but not fully explained.

The quizzes were presented in multiple choice format, with the student given  the opportunity to retake a quiz up to 3 times to get a higher score.  Some of the  questions were written in such a way that it was easy to guess the correct answer without actually understanding the question.  On retakes, the same questions reappeared, sometimes with the order of the answers randomized.  The system did not appear to be using algorithmic question generation or even a large library of similar questions.   Although multiple choice questions like these are easy to implement in a MOOC, they simply don't have the depth of questions that require a written answer or actual coding.

The programming assignments were automatically graded: the student would simply run test scripts that tested his code on various example inputs and then upload the resulting output for checking.  Although this kind of automated grading of programming assignments can be quite rigorous if it's done with carefully designed tests, the test cases used in this course seemed to be pretty easy.  None of my codes ever lost a point.

In my opinion, this kind of automated grading of programming assignments often leads to poor programming practices, since it pushes students to debug until they "get the right answer," rather than developing a program that they know to be correct by analysis.  Furthermore, there's no opportunity to give the student feedback on proper programming style in the particular language.  The programs that I wrote for CFDA were not "pretty" and probably would have come across to an expert R programmer as non-idiomatic R code.  All that mattered within the course was that they produced the correct answers on the test cases.

Although the online discussion forums were billed as one of the most important features of the course, I was very disappointed by them.  The way that the assignments were set up, students were focused on completing particular tasks rather than on developing a broader understanding of the R system.  As a result, much of the discussion in the forums was at the level of "how do I do this particular task in R?"  I found that I could answer questions of this sort more efficiently by using other online resources on R and a bit of Google searching.

Although I think this course could have been much better, I certainly did learn some things about R programming that I wanted to pick up, and overall I'm glad that I took the time to participate in the course.