# NIPS 2009 — Day 2

Here are a few papers that interested me on the Tuesday of NIPS 2009.

• T1: Kernels and learning curves for Gaussian process regression on random graphs (Peter Sollich, Matthew Urry, Camille Coti). Gaussian processes (GPs) are generally used with continuous inputs. This paper defines a covariance function (kernel) on the vertices of a tree and shows how we can use a GP to regress on those constructions. The idea of the kernel is simple: $k(x,y)$ is proportional to the probability of jumping from vertex x to vertex y under a random walk on the tree edges. The current results have been applied to random graphs with constant vertex degree, but in principle this can be applied to any graph type. [Paper]
• T29: Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units (Feng Yan, Ningyi Xu, Yuan Qi). The authors explain how to efficiently carry out Latent Dirichlet Allocation on a GPU. They obtain speedups ranging from 25x to 200x with the particular GPU model they used. They also present a streaming version of the algorithm that does not require all data to lie within video memory (=1GB on their card). This potentially could be interesting for other non-parametric Bayesian models as well. [Paper]
• T42: A General Projection Property for Distribution Families (Yao-Liang Yu, Yuxi Li, Dale Schuurmans, Csaba Szepesvari). Proof of a theoretical result on projection of distributions. What is interesting, and applies to portfolio management, is that portfolio optimization under “worst-case Value-at-Risk” and “Conditional Value-at-Risk” (expected shortfall) both result in identical portfolios. This should be studied more deeply. [Paper]
• T44: Linear-time Algorithms for Pairwise Statistical Problems (Parikshit Ram, Dongryeol Lee, William March, Alexander Gray). One of the bottlenecks to applying SVMs and Gaussian Processes (GPs) lies in the requirement to compute the kernel matrix, namely the matrix of kernel evaluations between all pairs of points in the training set, which is naïvely $O(N^2)$, where N is the number of elements in the training sets. For the past few years, the group  led by Alex Gray at Georgia Tech has been working on tree-like structures to make these operations faster. In this paper, they analyze in detail the computational complexity of an algorithm introduced a few years ago, and show that pairwise tasks can be reduced to $O(N \log N)$ or even $O(N)$ in some cases. [Paper]
• T47: Accelerated Gradient Methods for Stochastic Optimization and Online Learning (Chonghai Hu, James Kwok, Weike Pan). This paper presents an online optimization algorithm based on Nesterov gradients, in which we try to optimize a strongly-convex training criterion (e.g. L2) penalized by a convex (but not necessarily strongly so) regularizer, e.g. L1. The authors show that they obtain optimal convergence rates, much quicker than simple stochastic gradient descent. The technique appears to be very simple to implement, much more so than homotopy methods. [Paper]
• T65: Unsupervised feature learning for audio classification using convolutional deep belief networks (Honglak Lee, Peter Pham, Yan Largman, Andrew Ng). Application of Convolutional Restricted Boltzmann Machines (e.g. Desjardins and Bengio, 2008) to the classification of phonemes, speakers and musical sequences. The conclusion appears to be that the audio features learned in an unsupervised way by the CRBM (working in the log-spectral domain of short-term Fourier transforms, i.e. spectrograms) are helpful to improve classification, but are not always sufficient to beat MFCCs. However, when combining unsupervised features with MFCCs, the results outperform what has previously been reported in the literature for the tasks under consideration. The classification algorithm itself is fairly simple: SVMs, which are applied frame-by-frame followed by some voting scheme. [Paper]