# Monthly Archives: December 2009

## NIPS 2009 — Day 3

Here are a few papers that interested me on the Wednesday of NIPS 2009.
• On Stochastic and Worst-case Models for Investing (Elad Hazan, Satyen Kale). The authors extend Cover’s theory of Universal Portfolios to cover the “average case” of Geometric Brownian Motion, while still providing a regret bound equivalent to that of universal portfolios in the worst case.  The approach seems to offer a viable compromise between classical Mean-Variance allocation and Cover’s rebalancing rules. However, there still remains the evaluation of the approach under transaction costs, which constitute a major problem of universal portfolios. [Paper]
• An LP View of the M-best MAP problem (Menachem Fromer, Amir Globerson). The authors frame the problem of obtaining the M most probable assignments in a probabilistic graphical model as solving a linear program on a polytope similar to the marginal polytope. Successive configurations are obtained by progressively adding constraints to the program, and the method present the distinct flavor of a Benders’ decomposition. This paper helps better understand the relationship between inference in graphical models and relaxations in linear programs. [Paper]
• Variational Gaussian-process factor analysis for modeling spatio-temporal data (Jaakko Luttinen, Alexander Ilin). This is an application of Gaussian processes to modeling climatic data over very long periods (decades), with irregular measurements spread across time and space. The authors introduce a very simple factorization that decouples time and space and allows for simple inference through a variational technique. [Paper]
• Sharing Features among Dynamical Systems with Beta Processes (Emily Fox, Erik Sudderth, Michael Jordan, Alan Willsky). This is a follow-up of a paper presented at last year’s NIPS by the same authors, in which they introduced the Sticky-HDP-HMM. This year, they serve the Beta-HMM, which provide a nonparametric Bayesian treatment of the case where several time series share states and emission distributions in an HMM (they do not cover the case of the switching linear dynamical system, as last year, but this is a logical extension of this year’s model). They apply this model to mocap data, where the model learns more-than-reasonable segmentations in a completely unsupervised fashion. Code is available on Emily Fox’s website. [Paper]
• Bayesian Belief Polarization (Alan Jern, Kai-min Chang, Charles Kemp). Amusing application in cognitive science: “common sense” shows, as an example of irrationality, the case where two people having a priori contradictory opinions reinforcing these opinons when shown identical evidence. The authors examine the class of Bayesian networks that can give rise to such inferences. They reach the conclusion that such inferences are common enough for many Bayesian networks with three variables: the key is to have a latent variable influencing the opinion variable, and itself influenced by observations. It is possible that some examples in behavioral finance or economics can be interpreted in a similar manner. [Paper]
• FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs (Andrew McCallum, Karl Schultz, Sameer Singh). This is a library of probabilistic graphical models (factor graphs) written in Scala, a functional language targeting the JVM. The interesting aspect is that contarily to dedicated probabilistic languages such as BLOG, Factorie allows plugging in new inference procedures very easily, which make it feasible to treat substructures for which efficient automatic inference is difficult. On a bibliographic coreference task, they claim performance 3-15x faster than Markov Logic Network, while reducing the number of errors by 20-25%. [Paper]

## NIPS 2009 — Day 2

Here are a few papers that interested me on the Tuesday of NIPS 2009.

• T1: Kernels and learning curves for Gaussian process regression on random graphs (Peter Sollich, Matthew Urry, Camille Coti). Gaussian processes (GPs) are generally used with continuous inputs. This paper defines a covariance function (kernel) on the vertices of a tree and shows how we can use a GP to regress on those constructions. The idea of the kernel is simple: $k(x,y)$ is proportional to the probability of jumping from vertex x to vertex y under a random walk on the tree edges. The current results have been applied to random graphs with constant vertex degree, but in principle this can be applied to any graph type. [Paper]
• T29: Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units (Feng Yan, Ningyi Xu, Yuan Qi). The authors explain how to efficiently carry out Latent Dirichlet Allocation on a GPU. They obtain speedups ranging from 25x to 200x with the particular GPU model they used. They also present a streaming version of the algorithm that does not require all data to lie within video memory (=1GB on their card). This potentially could be interesting for other non-parametric Bayesian models as well. [Paper]
• T42: A General Projection Property for Distribution Families (Yao-Liang Yu, Yuxi Li, Dale Schuurmans, Csaba Szepesvari). Proof of a theoretical result on projection of distributions. What is interesting, and applies to portfolio management, is that portfolio optimization under “worst-case Value-at-Risk” and “Conditional Value-at-Risk” (expected shortfall) both result in identical portfolios. This should be studied more deeply. [Paper]
• T44: Linear-time Algorithms for Pairwise Statistical Problems (Parikshit Ram, Dongryeol Lee, William March, Alexander Gray). One of the bottlenecks to applying SVMs and Gaussian Processes (GPs) lies in the requirement to compute the kernel matrix, namely the matrix of kernel evaluations between all pairs of points in the training set, which is naïvely $O(N^2)$, where N is the number of elements in the training sets. For the past few years, the group  led by Alex Gray at Georgia Tech has been working on tree-like structures to make these operations faster. In this paper, they analyze in detail the computational complexity of an algorithm introduced a few years ago, and show that pairwise tasks can be reduced to $O(N \log N)$ or even $O(N)$ in some cases. [Paper]
• T47: Accelerated Gradient Methods for Stochastic Optimization and Online Learning (Chonghai Hu, James Kwok, Weike Pan). This paper presents an online optimization algorithm based on Nesterov gradients, in which we try to optimize a strongly-convex training criterion (e.g. L2) penalized by a convex (but not necessarily strongly so) regularizer, e.g. L1. The authors show that they obtain optimal convergence rates, much quicker than simple stochastic gradient descent. The technique appears to be very simple to implement, much more so than homotopy methods. [Paper]
• T65: Unsupervised feature learning for audio classification using convolutional deep belief networks (Honglak Lee, Peter Pham, Yan Largman, Andrew Ng). Application of Convolutional Restricted Boltzmann Machines (e.g. Desjardins and Bengio, 2008) to the classification of phonemes, speakers and musical sequences. The conclusion appears to be that the audio features learned in an unsupervised way by the CRBM (working in the log-spectral domain of short-term Fourier transforms, i.e. spectrograms) are helpful to improve classification, but are not always sufficient to beat MFCCs. However, when combining unsupervised features with MFCCs, the results outperform what has previously been reported in the literature for the tasks under consideration. The classification algorithm itself is fairly simple: SVMs, which are applied frame-by-frame followed by some voting scheme. [Paper]

## NIPS 2009 — Day 1

It’s this time of the year again where the Neural Information Processing Systems (NIPS) conference focuses the collective brainpower of the machine learning community for a week of unstopping climax.

Here are some presentations and papers that piqued my interest during the first day (Monday December 9).

### TUTORIALS

1. Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization (Gunnar Martinsson). This is the study of randomized SVD, which goes beyond the “Google” method by drawing k+p random vectors in R^N (where k is the number of singular values to extract and p is a small constant like 5 or 10), and applies the matrix to be decomposed to each vector.  The result are random vectors within the span of the matrix to be decomposed.  The next steps consist in orthonormalizing (i.e. Gram-Schmidt) and carrying out an SVD on the k-by-k resulting matrix, which is much smaller than the original one. Then by applying a few minor substitutions, we recover the SVD of the original matrix. For an NxN matrix, this requires time $O(N^2 (k+p))$, which is comparable to classical approaches to the SVD. However, the beauty of randomized approaches, contrarily to classical methods, is that we now require a single (or at most a small number) of passes over the data, and we can traverse the data in a defined order (e.g. row by row). Hence it’s very efficient if the data is stored on a slow medium like a hard drive, as opposed to fast random-access memory. Moreover it is applicable to streaming data, which we can see only once. In addition, Martinsson discussed approached based on the fast Fourier transform which require time $O(N^2 \log(k))$, which is a considerable further saving. [Slides]
2. Sparse Methods for Machine Learning: Theory and Algorithms (Francis Bach). I found this to be a lot of theory, and a bit of algorithm. I was mostly interested by potential results related to sparse quantile regression, i.e. the main objective is to minimize an L1 norm with respect to the data fit, and the regularization term is also an L1 norm. But sadly, this question was not brought up, nor was the presenter aware of any substantive result in that respect. As an aside, many papers at the main confernce discuss this problem; see discussion below and over the next few days. [Slides]
3. Sequential Monte Carlo Methods (Arnaud Doucet et Nando de Freitas). This was mostly a tutorial on particle filtering. The first part, by Doucet, was clear; the second part, by de Freitas, was a bit quick for people lacking the background. The one thing I took out from the tutorial was the explanation of why particle filtering works when doing filtering, but fails for computing (possibly implicitly) functions of the joint distribution of the hidden states and observations: because of the resampling step (which is nevertheless essential), all particles end up having the same parent, and we don’t keep any representative of the historical diversity of particles. [Slides: to my knowledge, not yet available]

### ARTICLES

• M6: Gaussian process regression with Student-t likelihood (Jarno Vanhatalo, Pasi Jylänki, Aki Vehtari). If the likelihood function of a Gaussian Process is not Gaussian, it is usually impossible to analytically compute the posterior distribution over the latent function. However, normality of the likelihood makes learning sensitive to outliers. One solution is to use a Student-t likelihood function, and previous approaches have applied MCMC and Variational Bayes to the problem of posterior inference. However, surprisingly, the Laplace approximation, despite being commonplace for GP classification, had apparently never been proposed for this task. The authors’ approach appears to work well (at least for the examples shown), converges quickly, and is “only” between 50% and 100% slower than an analytical GP (whereas, for instance, VB is about 1000% slower). [Paper]
• M9: Non-stationary continuous dynamic Bayesian networks (Marco Grzegorczyk, Dirk Husmeier).  The authors seek to learn the structure of a Dynamic Bayes Net, and model each graph node (within a structure that may contain loops for the time delays) as being piecewise constant within a time interval. They propose a sampling approach that let them marginalize over all changepoint hypotheses. They apply the method to a problem in genetic analysis. At the moment, they have applied the approach to the smoothing case; it was not clear how quickly changepoints could be detected in the filtering case. The matlab code is available upon request. [Paper]
• M13: Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process  (Finale Doshi-Velez, David Knowles, Shakir Mohamed, Zoubin Ghahramani). The Indian Buffet Process can be used as a prior for Bayesian feature selection models. Traditionally, in nonparametric Bayesian models (for which representation can grow without bounds), inference is carried out by a variant of Gibbs sampling, and can be (very) expensive. The authors show how to split the problem of inference in IBP models into several parts (multiple CPUs/cores) and how to communicate results between compute nodes using belief propagations. This approach is promising for scaling up the models to more realistic use-cases. [Paper]
• M40: Streaming k-means approximation (Nir Ailon, Ragesh Jaiswal, Claire Monteleoni). This is a method to carry out an approximate K-Means using a single pass on the data (for streaming data, which can be seen only once). The algorithm is very simple, but the heart of the paper is in the analysis. [Paper]
• M49: Dual Averaging Method for Regularized Stochastic Learning and Online Optimization (Lin Xiao). Extremely simple online optimization algorithm for L1-penalized regression objectives. On logistic regression, the author obtains nearly as good classification performance with the online method as with a batch optimization, but with a much-reduced computational cost. [Paper]
• M85: Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation (Hamid Maei, Csaba Szepesvari, Shalabh Bhatnagar, Doina Precup, David Silver, Rich Sutton).  Boys and girls, this seems to be the real deal: reinforcement learning solved (ok, not quite, but getting there). In a nutshell, the authors introduce a correction to the classical Bellman error used as the learning objective in Temporal-Difference learning. They show that their algorithm is convergent in the off-policy case and with general function approximators (such as neural networks), which are two situations where a number of divergence examples have been illustrated in the late-90’s literature. As with many things, the algorithm is simple and the heart of the paper is the convergence proof. There is, as of yet, not a great deal of experimental validation of the algorithm, although the authors’ results on a 9×9 Go example appear very promising. [Paper]

Stay tuned for more candy tomorrow.