**FISTA**

Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [Beck09] was one of the first algorithms to use Nesterov’s accelerated gradient descent to speed up the convergence of iterative shrinkage-thresholding (ISTA) from O(β/ε) to O(√(β/ε)). The algorithm and proof sketch for in the previous post based were based on [Bubeck14] and [Beck09] so we’ll skip them. Instead give an R implementation and results that compares ISTA and FISTA performance (in terms of number of iterations) for various step sizes (iterations increase and step size decreases).

## ISTA ista <- function(A, x, beta, iter = 100) { ## parameters eta <- 1 / beta lambda <- 1 xx <- c(0, 0, 0, 0, 0, 0) ## helpers Axx <- A %*% x Axn <- norm(A %*% x) error <- vector(length = iter) iters <- vector(length = iter) ## main loop for (i in 1:iter) { gradient_x <- t(A) %*% ( A %*% xx - y ) xx_tmp <- xx - eta * gradient_x v <- eta * lambda # L1 prox/shrinkage-thresholding xx <- pmax(xx_tmp - v, 0) - pmax(- xx_tmp - v, 0) error[i] <- norm(A %*% xx - Axx) / Axn iters[i] <- i } return(list(xx, error, iters)) } ## FISTA fista <- function(A, x, beta, iter = 100) { ## parameters eta <- 1 / beta lambda <- 1 x_next <- c(0, 0, 0, 0, 0, 0) z_prev <- x_next z_next <- x_next mu_prev <- 0 mu_next <- 0 ## helpers Ax <- A %*% x Ax_norm <- norm(Ax) error <- vector(length = iter) iters <- vector(length = iter) ## main loop for (i in 1:iter) { mu_next <- 0.5 * (1 + sqrt(1 + 4 * mu_prev^2)) gamma <- (1 - mu_prev) / mu_next gradient_x_n <- t(A) %*% ( A %*% x_next - y ) z_tmp <- x_next - eta * gradient_x_n v <- eta * lambda # L1 prox/shrinkage-thresholding z_next <- pmax(z_tmp - v, 0) - pmax(- z_tmp - v, 0) x_next <- z_next + gamma * (z_prev - z_next) z_prev <- z_next mu_prev <- mu_next error[i] <- norm(A %*% x_next - Ax) / Ax_norm iters[i] <- i } return(list(x_next, error, iters)) } max_iter <- 500; beta <- norm(A, type="F")^2 res1 <- ista(A, x, beta, max_iter) res2 <- fista(A, x, beta, max_iter) plot(res1[[3]], res1[[2]], col = "red", xlab = "iterations", ylab = "error", type = 'l', lty = 1) lines(res2[[3]], res2[[2]], col = "blue", type = 'l', lty = 1)

Note the non-monotonic convergence in FISTA’s case – [Beck09b, Teboulle10] describe a simple change to the algorithm to fix that. Another interesting problem with FISTA is the dependence on the worst-case smoothness parameter β in the algorithm, which can substantially reduce convergence rate for large β. This is addressed in [Katya14] using a backtracking strategy that to improve the dependence from worst-case to average “local composite Lipschitz constant for ∇f”, which can have a much smaller value, implying a larger step size. Another solution is presented in [Baes12].

Recall that ISTA, and hence FISTA, solve the following optimization problems:

FISTA works for non-smooth objectives as long as they can be formulated as the sum of a smooth and a non-smooth function [Section 2, Beck09; Section 3, Tseng08]. Are there efficient first-order algorithms for a larger class of non-smooth problems?

**NESTA**

NESTA [Becker09] solves the following optimization problem that FISTA cannot since the objective function is non-composite non-smooth:

This algorithm draws on [Nesterov05], which introduces a way to create smooth approximations of certain non-smooth function. This approximation, coupled with an accelerated gradient descent method, gives good convergence for optimizing a large class of non-smooth functions (though not as good as FISTA or Nesterov07). We’ll review [Nesterov05] via [Becker09] before returning to NESTA.

*Smoothing non-smooth functions*

Nesterov considers a class of functions that can be represented as [Section 2, Nesterov05; Section 2, Becker09; Nesterov08]

where φ(u) is a convex function and Q_{d} is the dual domain (convex, closed) of a function f(x) minimized over a domain Q_{p}. The representation is similar to the convex conjugate, f*(u), of this function. Indeed Nesterov considers using φ(u) = f*(u), but then argues that such φ(u) might not be a simple enough function. This might be true in general, but as we’ll see that in the special case of L_{1} regularization, φ(u) = f*(u) will work.

Nesterov then presents a smooth approximation of the function

where μ is the smoothness parameter, prox_{α}(u) is an α-strongly convex proximal function over the dual domain Q_{d}. In the original paper [Nesterov05], Nesterov actually allows for a composite function approximation, that is f(x) = fº(x) + f_{μ}(x) where fº(x) is some smooth convex function. We’ll follow [Becker09] in assuming that fº(x) = 0. [Beck12] extends this smoothing framework to optimize an even broader class of non-smooth functions.

The important upshot is that (1) this function is smooth with factor [Theorem 1, Nesterov05], where ||A|| is the operator norm of A, and (2) that f_{μ}(x) is a uniform smooth approximation of f(x) for μ > 0. So now we can use an optimal gradient descent for smooth optimization to minimize f_{μ}(x).

*Smooth optimization*

Section 3 in [Nesterov05] describes an optimal first-order algorithm to minimize a β-smooth convex function f(x) using an α-strongly convex proximal function prox_{α}(x).

This is shown [Theorem 2, Nesterov05] to converge at a rate

This algorithm is more complicated than the earlier ones [Nesterov83, Nesterov88] – especially when compared to the version FISTA uses (see previous post). The only advantage seems to be a somewhat better convergence rate by a factor of α at the cost of solving two minimization problems (instead of one as in FISTA).

Applying this algorithm to optimize f_{μ}(x) [Theorem 3, Nesterov05] gives O(1/k) convergence for the optimal value of μ since

*Nesterov’s Algorithm*

[Becker09] call their application of [Nesterov05] to L_{1} regularization, NESTA. They show that

1. Optimizing over the convex conjugate of L_{1} norm when combined with an appropriate proximal function yields a simple, well-known function called the *Huber loss* function. It is analytic and can be computed very efficiently, which makes it trivial to implement gradient descent (quite like the shrinkage-thresholding operation in ISTA/FISTA).

Choosing prox_{α}(u) to be the squared Euclidean distance with α = 1 (see also [cf. Section 4.2, Nesterov05]),

2. y_{k} and z_{k} updates in the original algorithm need to be modified taking into account the L_{2} constraint. This is done using the Lagrangian form of each minimization that, under certain assumptions, allows expressing Lagrangian variables in terms of ε.

3. Another interesting idea is the use of “continuation.” Starting with a large value of the smoothing parameter and multiplicatively decreasing after each iteration helps converge more quickly (although the theoretical convergence rate stays the same).

Here’s R code for the unconstrained version of NESTA based on FISTA’s version of accelerated gradient descent:

nesta <- function(A, x, beta, iter = 100) { ## parameters eta <- 1 / beta lambda <- 1 x_next <- c(0, 0, 0, 0, 0, 0) z_prev <- x_next z_next <- x_next mu_prev <- 0 mu_next <- 0 smooth <- 0.9 * norm(A, "I") ## helpers Ax <- A %*% x Ax_norm <- norm(Ax) error <- vector(length = iter) iters <- vector(length = iter) ## main loop for (i in 1:iter) { mu_next <- 0.5 * (1 + sqrt(1 + 4 * mu_prev^2)) gamma <- (1 - mu_prev) / mu_next # grad_huber1 + grad_huber2 = gradient of Huber function grad_huber1 <- (abs(x_next) < smooth) / smooth grad_huber2 <- (abs(x_next) >= smooth) * ifelse(x_next==0, 0, (x_next / abs(x_next))) gradient_x_n <- t(A) %*% ( A %*% x_next - y ) + grad_huber1 + grad_huber2 z_next <- x_next - eta * gradient_x_n x_next <- z_next + gamma * (z_prev - z_next) z_prev <- z_next mu_prev <- mu_next smooth <- 0.5 * smooth error[i] <- norm(A %*% x_next - Ax) / Ax_norm iters[i] <- i } return(list(x_next, error, iters)) }

**Postscript [02/09/2014]**

Building on the message-passing (aka belief propagation) algorithms introduced in [Donoho09], [Mousavi13] improve the convergence rate of iterative-shrinkage thresholding algorithm (ISTA) to O(e^{-t}). This was made possible by finding the optimal values of the regularization parameter, λ, at each iteration under the assumption of Gaussian error distribution (which is probably why it escapes the upper-bounds for first-order methods).

**References**

[Baes12] M. Baes, M. Burgisser. An acceleration procedure for optimal first-order methods. 2012

[Beck09] A. Beck, M. Teboulle. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal of Imaging Sciences, 2009

[Beck09b] A. Beck, M. Teboulle. Fast Gradient-Based Algorithms for Constrained Total Variation Image Denoising and Deblurring Problems. IEEE Transactions on Image Processing, 2009.

[Beck12] A. Beck, M. Teboulle. Smoothing and First Order Methods: A Unified Approach. SIAM Journal of Optimization, 2012.

[Becker09] S. Becker, J. Bobin, E. Candes. NESTA: A Fast and Accurate First-Order Method for Sparse Recovery. Technical Report, Caltech, 2009

[Bubeck14] S. Bubeck, Theory of Convex Optimization for Machine Learning

[Donoho09] D. Donoho, A. Maleki, A. Montanari. Message-passing algorithms for compressed sensing. Proceedings of National Academy of Sciences, 2009.

[Mousavi13] A. Mousavi, A. Maleki, R. Baranuick. Parameterless optimal approximate message passing. CoRR, 2013

[Nesterov83] Y. Nesterov. A method for solving a convex programming problem with convergence rate O(1/k^{2}). Dokaldy AN SSR, 1983

[Nesterov88] Y. Nesterov. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonom. i. Mat. Metody, 1988

[Nesterov04] Y. Nesterov. Introductory Lectures On Convex Programming: A Basic Course. Kluwer Academic Publishers, 2004

[Nesterov05] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 2005.

[Nesterov07] Y. Nesterov. Gradient methods for minimizing composite objective function. Report, CORE, 2007

[Nesterov08] Y. Nesterov. How to advance in Structural Convex Optimization. Optima, 2008

[Katya14] K. Scheinberg, D. Goldfarb, X. Bai. Fast First-Order Methods for Composite Convex Optimization with Backtracking. Foundations of Computational Mathematics, 2014.

[Teboulle10]. M. Teboulle. First-Order Methods for Optimization. IPAM Optimization Tutorials, 2010

[Tseng08] P. Tseng. On Accelerated Proximal Gradient Algorithms for Convex-Concave Optimization. SIAM Journal of Optimization, 2008

**Implementations**

[1] http://web.stanford.edu/~boyd/papers/prox_algs/lasso.html

[2] https://github.com/gpeyre/numerical-tours/tree/master/matlab/solutions

[3] https://www.ceremade.dauphine.fr/~peyre/numerical-tour/tours/optim_4_fb/#54

[4] http://www.caam.rice.edu/~optimization/disparse/LASSO/FISTA/pFistaLasso.html

[5] http://www.mathworks.com/matlabcentral/fileexchange/16204-toolbox-sparse-optmization/content/toolbox_optim/perform_fb.m

[6] http://slipguru.disi.unige.it/Software/L1L2Py/algorithms.html

[7] http://www.eecs.berkeley.edu/~yang/software/l1benchmark/

[8] http://www.ece.rice.edu/~tag7/Tom_Goldstein/CGIST.html

[9] http://users.ece.gatech.edu/%7Esasif/index.html

[10] https://www.ceremade.dauphine.fr/~peyre/numerical-tour/

[11] http://www.mit.edu/~dimitrib/PTseng/papers/apg_alg.m

[12] http://www.ece.rice.edu/~tag7/Tom_Goldstein/CGIST.html

[13] http://www.math.ucla.edu/~wotaoyin/software.html