# NIPS 2009 — Day 1

It’s this time of the year again where the Neural Information Processing Systems (NIPS) conference focuses the collective brainpower of the machine learning community for a week of unstopping climax.

Here are some presentations and papers that piqued my interest during the first day (Monday December 9).

### TUTORIALS

1. Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization (Gunnar Martinsson). This is the study of randomized SVD, which goes beyond the “Google” method by drawing k+p random vectors in R^N (where k is the number of singular values to extract and p is a small constant like 5 or 10), and applies the matrix to be decomposed to each vector.  The result are random vectors within the span of the matrix to be decomposed.  The next steps consist in orthonormalizing (i.e. Gram-Schmidt) and carrying out an SVD on the k-by-k resulting matrix, which is much smaller than the original one. Then by applying a few minor substitutions, we recover the SVD of the original matrix. For an NxN matrix, this requires time $O(N^2 (k+p))$, which is comparable to classical approaches to the SVD. However, the beauty of randomized approaches, contrarily to classical methods, is that we now require a single (or at most a small number) of passes over the data, and we can traverse the data in a defined order (e.g. row by row). Hence it’s very efficient if the data is stored on a slow medium like a hard drive, as opposed to fast random-access memory. Moreover it is applicable to streaming data, which we can see only once. In addition, Martinsson discussed approached based on the fast Fourier transform which require time $O(N^2 \log(k))$, which is a considerable further saving. [Slides]
2. Sparse Methods for Machine Learning: Theory and Algorithms (Francis Bach). I found this to be a lot of theory, and a bit of algorithm. I was mostly interested by potential results related to sparse quantile regression, i.e. the main objective is to minimize an L1 norm with respect to the data fit, and the regularization term is also an L1 norm. But sadly, this question was not brought up, nor was the presenter aware of any substantive result in that respect. As an aside, many papers at the main confernce discuss this problem; see discussion below and over the next few days. [Slides]
3. Sequential Monte Carlo Methods (Arnaud Doucet et Nando de Freitas). This was mostly a tutorial on particle filtering. The first part, by Doucet, was clear; the second part, by de Freitas, was a bit quick for people lacking the background. The one thing I took out from the tutorial was the explanation of why particle filtering works when doing filtering, but fails for computing (possibly implicitly) functions of the joint distribution of the hidden states and observations: because of the resampling step (which is nevertheless essential), all particles end up having the same parent, and we don’t keep any representative of the historical diversity of particles. [Slides: to my knowledge, not yet available]

### ARTICLES

• M6: Gaussian process regression with Student-t likelihood (Jarno Vanhatalo, Pasi Jylänki, Aki Vehtari). If the likelihood function of a Gaussian Process is not Gaussian, it is usually impossible to analytically compute the posterior distribution over the latent function. However, normality of the likelihood makes learning sensitive to outliers. One solution is to use a Student-t likelihood function, and previous approaches have applied MCMC and Variational Bayes to the problem of posterior inference. However, surprisingly, the Laplace approximation, despite being commonplace for GP classification, had apparently never been proposed for this task. The authors’ approach appears to work well (at least for the examples shown), converges quickly, and is “only” between 50% and 100% slower than an analytical GP (whereas, for instance, VB is about 1000% slower). [Paper]
• M9: Non-stationary continuous dynamic Bayesian networks (Marco Grzegorczyk, Dirk Husmeier).  The authors seek to learn the structure of a Dynamic Bayes Net, and model each graph node (within a structure that may contain loops for the time delays) as being piecewise constant within a time interval. They propose a sampling approach that let them marginalize over all changepoint hypotheses. They apply the method to a problem in genetic analysis. At the moment, they have applied the approach to the smoothing case; it was not clear how quickly changepoints could be detected in the filtering case. The matlab code is available upon request. [Paper]
• M13: Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process  (Finale Doshi-Velez, David Knowles, Shakir Mohamed, Zoubin Ghahramani). The Indian Buffet Process can be used as a prior for Bayesian feature selection models. Traditionally, in nonparametric Bayesian models (for which representation can grow without bounds), inference is carried out by a variant of Gibbs sampling, and can be (very) expensive. The authors show how to split the problem of inference in IBP models into several parts (multiple CPUs/cores) and how to communicate results between compute nodes using belief propagations. This approach is promising for scaling up the models to more realistic use-cases. [Paper]
• M40: Streaming k-means approximation (Nir Ailon, Ragesh Jaiswal, Claire Monteleoni). This is a method to carry out an approximate K-Means using a single pass on the data (for streaming data, which can be seen only once). The algorithm is very simple, but the heart of the paper is in the analysis. [Paper]
• M49: Dual Averaging Method for Regularized Stochastic Learning and Online Optimization (Lin Xiao). Extremely simple online optimization algorithm for L1-penalized regression objectives. On logistic regression, the author obtains nearly as good classification performance with the online method as with a batch optimization, but with a much-reduced computational cost. [Paper]
• M85: Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation (Hamid Maei, Csaba Szepesvari, Shalabh Bhatnagar, Doina Precup, David Silver, Rich Sutton).  Boys and girls, this seems to be the real deal: reinforcement learning solved (ok, not quite, but getting there). In a nutshell, the authors introduce a correction to the classical Bellman error used as the learning objective in Temporal-Difference learning. They show that their algorithm is convergent in the off-policy case and with general function approximators (such as neural networks), which are two situations where a number of divergence examples have been illustrated in the late-90’s literature. As with many things, the algorithm is simple and the heart of the paper is the convergence proof. There is, as of yet, not a great deal of experimental validation of the algorithm, although the authors’ results on a 9×9 Go example appear very promising. [Paper]

Stay tuned for more candy tomorrow.