Sunday, April 24, 2011

Convex Optimization 2.0

My sabbatical leave last fall at the Institute for Pure and Applied Mathematics (IPAM) at UCLA was a wonderful opportunity to get up to speed on some recent developments in convex optimization and to reassess where the field is going. I recently gave a talk to our math department seminar with the somewhat provocative title of Convex Optimization 2.0.

This posting is a brief summary of what I talked about, along with some links to useful resources related to the talk.

I'll start with a quote from R. Tyrrell Rockafellar in 1993.

In fact, the great watershed in optimization isn't between linearity and nonlinearity, but convexity and nonconvexity.

R. Tyrrell Rockafellar (1993)


At the time, this was a somewhat surprising statement. Through the 60's, 70's, and into the 80's, most people working in the field thought of mathematical optimization as being divided into two parts: linear programming and nonlinear programming. The central mathematical ideas of linear programming were combinatorial in nature, while the mathematics of nonlinear programming centered on the analysis of smooth functions and applications of Newton's method.

What's amazing to me about this quote is that today, most of the people that I know who are working in optimization would agree with Rockafellar. The point of my talk and this blog post is to tell the story of how we got to this point.

George Dantzig's simplex method was the preferred method for solving linear programming problems from the 1950's through the mid 1980's. An enormous amount of research and software development effort (and lots of CPU cycles) was invested in the simplex method.

Methods for nonlinear programming are often divided into first order methods, which use only first derivative information, and second order methods, which use second derivative information to obtain faster convergence. Of course, second order methods also require storage that grows quadratically with the size of the problem; for a problem with a million variables, simply storing a dense Hessian in double precision would take on the order of 8 terabytes. Thus first order methods are often preferred for very large scale problems, while second order methods are more commonly used for small to medium sized problems. There are also methods such as the Limited Memory BFGS Method that try to fit into the space between first and second order methods to get faster convergence than the first order methods without the memory requirements of second order methods.

When primal-dual interior point methods for linear programming appeared on the scene in the mid 1980's, it was surprising that ideas from nonlinear programming could actually be useful in linear programming. In less than a decade, these new interior point methods evolved to the point that they were often competitive with the classical simplex method, particularly on very large and degenerate linear programming problems. Interior point methods were also being applied to nonlinear programming problems, further blurring the distinction between linear and nonlinear programming.

Convex optimization problems are problems in which the feasible region is a convex set and the objective function to be minimized is a convex function. Some researchers, including Rockafellar, had worked on methods for solving nonlinear optimization problems that were convex but not necessarily smooth. The huge advantage of this class of problems is that any local minimum of a convex optimization problem is also a global minimum. On the other hand, without smoothness it is impossible to directly apply methods like Newton's method that depend on the differentiability of the objective and constraint functions. Methods for nonsmooth convex optimization (or "nondifferentiable optimization" as it was sometimes called) were quite limited.

After the development of interior point methods for linear programming, it was quickly discovered that the primal-dual interior point method for LP could also be extended to handle other kinds of convex optimization problems. For example, interior point methods can be used to solve convex quadratically constrained programming problems (CQP).

Of particular interest were second order cone programming (SOCP) problems, which involve constraints of the form

\(
\|Ax-b \|_{2} \leq c^{T}x+d
\)

and semidefinite programming (SDP) problems, in which the variable $X$ is a symmetric matrix that is constrained to be positive semidefinite.

Looking at this situation from the point of view of our ability to solve various kinds of convex optimization problems, we now had a hierarchy

\(
LP \subset CQP \subset SOCP \subset SDP \subset CP
\)

with primal-dual interior point methods able to efficiently and reliably solve problems up to the level of SDP. (Here CP denotes general convex programming.)
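To see one of these inclusions concretely, a standard Schur complement argument shows that the second order cone constraint $\|Ax-b\|_{2} \leq c^{T}x+d$ is equivalent to the semidefinite constraint

\(
\begin{bmatrix} (c^{T}x+d)I & Ax-b \\ (Ax-b)^{T} & c^{T}x+d \end{bmatrix} \succeq 0,
\)

so every SOCP can be solved as an SDP.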

Although SOCP and SDP might seem to be extremely weird problem classes, it turned out that in practice, many convex optimization problems could be reformulated into SOCP or SDP form. This led to a paradigm that I'll call Convex Optimization 1.0: start with a convex optimization problem, reformulate it as an LP, CQP, SOCP, or SDP, and then use a primal-dual interior point method to solve the problem. This approach is beautifully laid out in the 2004 textbook, Convex Optimization, by Stephen Boyd and Lieven Vandenberghe.

In fact, the reformulation techniques used to put a problem into standard form are so simple that they can actually be built into software. For example, Michael Grant and Stephen Boyd developed a MATLAB package called CVX that automatically reformulates convex optimization problems as LP's, SOCP's, or SDP's, and then solves them using a general purpose LP/SOCP/SDP solver routine. Some other important software packages that also implement this approach include YALMIP, ROME for robust optimization, and GLOPTIPOLY and SOSTOOLS for polynomial optimization problems.
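To give a flavor of this modeling style, here is a minimal sketch using CVXPY, a Python package built around the same disciplined convex programming idea as CVX. The problem and data below are made up purely for illustration:

    import numpy as np
    import cvxpy as cp

    # Made-up problem data, for illustration only.
    np.random.seed(0)
    A = np.random.randn(30, 100)
    b = np.random.randn(30)

    x = cp.Variable(100)

    # Minimize a Euclidean norm subject to simple convex constraints.
    # The package verifies convexity, reformulates the problem into a
    # standard conic form (an SOCP here), and calls a conic solver.
    prob = cp.Problem(cp.Minimize(cp.norm(A @ x - b, 2)),
                      [x >= 0, cp.norm(x, "inf") <= 1])
    prob.solve()
    print(prob.status, prob.value)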

As interest in convex optimization grew, one important problem with primal-dual interior point methods became apparent. The primal-dual interior point method is a second order method, with storage requirements that grow as the square of the problem size. Users of interior point codes were often disappointed to find that the problems they wanted to solve simply couldn't be handled because of this storage requirement. The need for more storage efficient methods was particularly important in the burgeoning field of compressive sensing, where researchers wanted to solve problems involving millions of variables.

This led to a revival of interest in first order methods for convex optimization problems. Although these methods had originally been developed in the 1970's, the area had lain largely dormant for decades.

The classic algorithm for smooth convex minimization problems is the projected gradient descent method. Given a convex function $f(x)$, a convex feasible set $C$, and a projection operator $P_{C}$ that maps any point to the nearest point in $C$, the projected gradient descent method simply alternates between moving downhill along the negative gradient and projecting back onto the set $C$. If the problem is unconstrained, $P_{C}$ is just the identity. The projected gradient descent method follows the iteration:

\(
x^{k+1}=P_{C}(x^{k}-\alpha_{k}\nabla f(x^{k}))
\)

The step lengths $\alpha_{k}$ can be chosen in many different ways to ensure convergence. One simple rule is $\alpha_{k}=1/k$.
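As a concrete sketch of my own (with made-up data), here is the projected gradient method in NumPy for minimizing the smooth function $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}$ over the nonnegative orthant, where the projection is just componentwise clipping. For a smooth problem like this one, a standard safe choice is a constant step length $1/L$, where $L$ is the Lipschitz constant of the gradient:

    import numpy as np

    def projected_gradient(A, b, num_iters=500):
        """Minimize 0.5*||Ax - b||_2^2 subject to x >= 0
        by the projected gradient descent method."""
        L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
        x = np.zeros(A.shape[1])
        for _ in range(num_iters):
            grad = A.T @ (A @ x - b)         # gradient of the smooth objective
            x = np.maximum(x - grad / L, 0)  # P_C: clip back onto x >= 0
        return x

    # Illustrative data.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 20))
    b = rng.standard_normal(50)
    print(projected_gradient(A, b)[:5])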

For nonsmooth convex optimization problems, the projected gradient descent method can be extended by using subgradients instead of the gradient vector. A vector $g^{k}$ is a subgradient of $f$ at the point $x^{k}$ if

\(
f(x) \geq f(x^{k})+(g^{k})^{T}(x-x^{k})\;\; \mbox{for all} \; x.
\)
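For example, for $f(x)=|x|$, every $g \in [-1,1]$ is a subgradient at $x=0$. As another illustration of my own, here is a subgradient method for minimizing the nonsmooth function $f(x)=\|Ax-b\|_{1}$, using the fact that $A^{T}\mbox{sign}(Ax-b)$ is a subgradient of this function and the $\alpha_{k}=1/k$ step length rule mentioned above. Since subgradient steps need not decrease $f$, the code tracks the best point seen:

    import numpy as np

    def subgradient_method(A, b, num_iters=2000):
        """Minimize the nonsmooth function f(x) = ||Ax - b||_1
        by the subgradient method with step lengths alpha_k = 1/k."""
        x = np.zeros(A.shape[1])
        best_x, best_f = x.copy(), np.linalg.norm(A @ x - b, 1)
        for k in range(1, num_iters + 1):
            g = A.T @ np.sign(A @ x - b)      # a subgradient of f at x
            x = x - g / k                     # the alpha_k = 1/k step rule
            f = np.linalg.norm(A @ x - b, 1)
            if f < best_f:                    # steps need not be descent steps,
                best_x, best_f = x.copy(), f  # so remember the best iterate
        return best_x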

The convergence rate of these methods is sublinear, so it is useful to compare them by counting how many iterations are required to obtain a solution $x^{k}$ with $f(x^{k})-f(x^{*})<\epsilon$, where $x^{*}$ is an optimal solution. The classical (projected) (sub)gradient descent method takes $O(1/\epsilon)$ iterations on smooth problems and $O(1/\epsilon^{2})$ iterations on nonsmooth problems.

Nesterov (1983) gave an optimal first order method for smooth convex optimization problems with Lipschitz continuous gradients. This accelerated first order method uses previous gradients in a recursive formula for computing the next step, which effectively provides the algorithm with some information about the curvature of the function. The algorithm requires $O(\sqrt{L/\epsilon})$ iterations, where $L$ is the Lipschitz constant of the gradient.
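Here is a minimal sketch of one common variant of the accelerated method; the momentum schedule below is the one popularized by FISTA, and Nesterov's original scheme differs in details:

    import numpy as np

    def accelerated_gradient(grad_f, L, x0, num_iters=500):
        """Accelerated gradient descent for a smooth convex f whose
        gradient grad_f is Lipschitz continuous with constant L."""
        x = x0.copy()
        y = x0.copy()
        t = 1.0
        for _ in range(num_iters):
            x_next = y - grad_f(y) / L   # gradient step at the extrapolated point
            t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            y = x_next + ((t - 1.0) / t_next) * (x_next - x)   # momentum step
            x, t = x_next, t_next
        return x

    # Example use, minimizing 0.5*||Ax - b||_2^2 with illustrative data.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 20))
    b = rng.standard_normal(50)
    L = np.linalg.norm(A, 2) ** 2
    x = accelerated_gradient(lambda y: A.T @ (A @ y - b), L, np.zeros(20))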

In 2005, Nesterov returned to this earlier algorithm and showed how a nonsmooth convex optimization problem could be approximated by a smooth convex optimization problem with Lipschitz continuous gradient. Making the smoothed problem an $\epsilon$-accurate approximation requires $L=O(1/\epsilon)$, so the accelerated gradient method applied to the smoothed problem takes $O(\sqrt{L/\epsilon})=O(1/\epsilon)$ iterations, a substantial improvement over the $O(1/\epsilon^{2})$ bound for the subgradient method. Since 2005 there has been an explosion of interest in methods for various convex optimization problems that use this idea of accelerated gradient descent on a smoothed version of a nonsmooth convex optimization problem.
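A classic instance of this smoothing idea, given here as my own illustration, replaces the absolute value $|t|$ with the Huber function

\(
f_{\mu}(t)=\left\{ \begin{array}{ll} t^{2}/(2\mu) & |t| \leq \mu \\ |t|-\mu/2 & |t| > \mu . \end{array} \right.
\)

The gradient of $f_{\mu}$ is Lipschitz continuous with constant $L=1/\mu$, and $f_{\mu}(t)$ never differs from $|t|$ by more than $\mu/2$, so taking $\mu=O(\epsilon)$ gives exactly the $L=O(1/\epsilon)$ tradeoff described above.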

This approach has been particularly successful in the field of compressive sensing. Here the goal is to find a sparse (few nonzero entries) solution to an underdetermined linear system of equations $Ax=b$. Minimizing the number of nonzero entries in the solution is actually an NP-hard combinatorial optimization problem. However, it turns out (and this is another story worthy of its own blog posting) that sparse solutions can very often be recovered in practice by solving the basis pursuit denoising problem

\(
\min \| x \|_{1}
\)
subject to
\(
\| Ax - b \|_{2} \leq \sigma .
\)

Although these problems can be solved in the Convex Optimization 1.0 paradigm by a general purpose SOCP/SDP solver, in practice, specialized first order methods are vastly more efficient. Stephen Becker has produced a very useful collection of resources on software for BPDN and related problems.
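Most of these specialized methods exploit the fact that the $\ell_{1}$ norm has a cheap proximal operator (componentwise soft thresholding). As a minimal sketch of my own, here is ISTA, a basic proximal gradient method, applied to the closely related Lagrangian form $\min \frac{1}{2}\|Ax-b\|_{2}^{2}+\lambda\|x\|_{1}$; the data and the choice of $\lambda$ are purely illustrative:

    import numpy as np

    def soft_threshold(v, tau):
        """Proximal operator of tau*||.||_1 (componentwise soft thresholding)."""
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def ista(A, b, lam, num_iters=1000):
        """Proximal gradient method (ISTA) for the Lagrangian form of BPDN:
        minimize 0.5*||Ax - b||_2^2 + lam*||x||_1."""
        L = np.linalg.norm(A, 2) ** 2     # Lipschitz constant of the smooth part
        x = np.zeros(A.shape[1])
        for _ in range(num_iters):
            grad = A.T @ (A @ x - b)      # gradient step on the smooth term
            x = soft_threshold(x - grad / L, lam / L)  # prox step on the l1 term
        return x

    # Illustrative use on a random underdetermined system.
    rng = np.random.default_rng(1)
    A = rng.standard_normal((40, 200))
    b = rng.standard_normal(40)
    x = ista(A, b, lam=0.1)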

Now we have the beginnings of a new paradigm that I'll refer to as Convex Optimization 2.0: use an accelerated first order method, with smoothing if the problem is nonsmooth, to solve your convex optimization problem.

So far, most of the work on first order methods has been specialized to particular problems. An interesting recent development is the release of Templates for First Order Conic Solvers (TFOCS) by Stephen Becker, Emmanuel J. Candès and Michael Grant. TFOCS deals with these problems by formulating them in conic form (using for example SOCP constraints), dualizing, and then applying a first order method to the dual problem.

Although first order methods have worked out very well in many applications, we haven't yet seen these methods develop to the point that a general purpose first order method for LP, SOCP, and SDP can take the place of a primal-dual interior point method for general problems. The semidefinite programming case seems to be particularly hard. However, I'm hopeful that we may soon see a resurgence in interest in convex optimization as first order methods become more broadly applicable.