# Category Archives: Modeling

## Account of the Third McGill/IFM2 Risk Management Conference

Held in the charming Quebec ski resort of Mont-Tremblant, the Third McGill/IFM2 Risk Management Conference took place from March 12 to 14, 2010. In addition to the rather good ski, a set of rather interesting papers were presented; among them, the following piqued my interest:

- “Components of Bull and Bear Markets: Bull Corrections and Bear Rallies” by John Maheu, Tom McCurdy, Yong Song. The authors employ a Markov switching model to separate out bull and bear markets. In particular, they identify four states: bull market, bull correction, bear market, and bear rally. The interesting point of the paper is their insight to incorporate economic restrictions into the HMM parametrization: the transition matrix is appropriately constrained, and the emission distributions each state must obey:
- Bull market state: mean return > 0
- Bull market correction: mean return < 0
- Bear market state: mean return < 0
- Bear market rally: mean return > 0

The estimation proceeds in a full Bayesian manner, with Gibbs sampling applied throughout. A private discussion with the authors indicates that the Gibbs chain mixes very quickly, which is probably due to the strong constraints on the parameters induced by the economic restrictions. All in all, an intuitive and tantalizing model.

- “A Multifrequency Theory of the Term Structure of Interest Rates” by Laurent Calvet, Adlai Fisher, Liuren Wu. This paper introduces a no-arbitrage term structure model that is capable of fitting instantaneously-observed rates perfectly (which is important in many applications), yet provides realistic evolution dynamics. Its most interesting feature—from my point of view—is to propose a low-dimensional representation of a large number of time series, each operating at its own time scale, and that are superposed to yield the final forecast. The individual time series are linked through common parameters, and the time-scale dependence stems from a power law. This decomposition appears useful for a large number of applications in time-series modeling beyond term-structure of interest rates.
- “Multi-Period Forecasts of Volatility: Direct, Iterated, and Mixed-Data Approaches” by Eric Ghysels, Antonio Rubia, Rossen Valkanov. This paper follows on the previous MIDAS literature to compare various paradigms for long-horizon forecasting of volatility:
- The “direct” approach tries to have a model predict at horizon N given N-period returns
- The “iterated” approach iterates N times a one-period forecast
- The “mixed-data” approach combines a set of one-period returns to yield directly a N-period forecast.

The authors’ model falls in the latter category, and appears to beat both the direct and iterated approaches on US stock market portfolios, using a cross-section of size, book-to-market and industry portfolios.

- “Variance Risk Premia, Asset Predictability Puzzles, and Macroeconomic Uncertainty” by Hao Zhou (speaking for himself and not the Federal Reserve Board). One of the wonders of option markets is that implied volatility is generally higher than the subsequent realized volatility—in other words, implied volatility is usually an upwards-biased predictor of realized volatility. This explains why option writing strategies can be so profitable. This phenomenon, called the
*volatility premium*, has been characterized in depth by Bjørn Eraker. However, it remained some mystery as to the economic drivers of this premium. This paper tries to explain it via*economic uncertainty*(which is incorporated via a general equilibrium model). It also examines the implications for risk premia across equity, bond and credit asset classes. - “What Ties Return Volatilities to Price Valuations and Fundamentals?” by Alex David, Pietro Veronesi. It has long been observed that stock price volatility is much higher than can be explained by volatility in the fundamentals. The paper opens with a citation by Engle and Rangel:

“

*After more than 25 years of research on volatility, the central unsolved problem is the relation between the state of the economy and aggregate financial volatility. The number of models that have been developed to predict volatility based on time series information is astronomical, but the models that incorporate economic variables are hard to find. Using various methodologies, links are found but they are generally much weaker than seems reasonable. For example, it is widely recognized that volatility is higher during recessions and following announcements but these effects turn out to be a small part of measured volatility*.” [Engle and Rangel (2008)]This paper attemps to shed more light on this issue: they introduce a model where investors learning about unusual fundamental states leads to a V-shaped relationship between volatilities and prices.

- “Credit Conditions and Expected Stock and Bond Returns” by Sudheer Chava, Michael Gallmeyer, Heungju Park. This paper introduces a variable that appears predictive of stock returns at quarterly horizons. It is computed from a survey of credit standards administered since 1967 by the Federal Reserve to bank senior loan officers (although the data contains a big whole during the 1980’s, corresponding to changes in the survey methodology). The variable seems to introduce fundamentally new information not already captured by the standard predictors, and is robust to a variety of in-sample and out-of-sample tests.
- “Option Trading: Information or Differences of Opinion?” by Siu Kai Choy and Jason Zhanshun Wei. This was a last-minute addition to the conference and does not appear in the program. The authors investigate whether option traders have special information or they are merely speculators. They provide evidence that in aggregate, option traders are not particularly well informed, and therefore the “opinion difference” hypothesis (i.e. speculation) probably drives most option trades.

## NIPS 2009 — Day 3

**On Stochastic and Worst-case Models for Investing**(Elad Hazan, Satyen Kale). The authors extend Cover’s theory of Universal Portfolios to cover the “average case” of Geometric Brownian Motion, while still providing a regret bound equivalent to that of universal portfolios in the worst case. The approach seems to offer a viable compromise between classical Mean-Variance allocation and Cover’s rebalancing rules. However, there still remains the evaluation of the approach under transaction costs, which constitute a major problem of universal portfolios. [Paper]**An LP View of the M-best MAP problem**(Menachem Fromer, Amir Globerson). The authors frame the problem of obtaining the M most probable assignments in a probabilistic graphical model as solving a linear program on a polytope similar to the marginal polytope. Successive configurations are obtained by progressively adding constraints to the program, and the method present the distinct flavor of a Benders’ decomposition. This paper helps better understand the relationship between inference in graphical models and relaxations in linear programs. [Paper]**Variational Gaussian-process factor analysis for modeling spatio-temporal data**(Jaakko Luttinen, Alexander Ilin). This is an application of Gaussian processes to modeling climatic data over very long periods (decades), with irregular measurements spread across time and space. The authors introduce a very simple factorization that decouples time and space and allows for simple inference through a variational technique. [Paper]**Sharing Features among Dynamical Systems with Beta Processes**(Emily Fox, Erik Sudderth, Michael Jordan, Alan Willsky). This is a follow-up of a paper presented at last year’s NIPS by the same authors, in which they introduced the Sticky-HDP-HMM. This year, they serve the Beta-HMM, which provide a nonparametric Bayesian treatment of the case where several time series share states and emission distributions in an HMM (they do not cover the case of the switching linear dynamical system, as last year, but this is a logical extension of this year’s model). They apply this model to mocap data, where the model learns more-than-reasonable segmentations in a completely unsupervised fashion. Code is available on Emily Fox’s website. [Paper]**Bayesian Belief Polarization**(Alan Jern, Kai-min Chang, Charles Kemp). Amusing application in cognitive science: “common sense” shows, as an example of irrationality, the case where two people having a priori contradictory opinions reinforcing these opinons when shown identical evidence. The authors examine the class of Bayesian networks that can give rise to such inferences. They reach the conclusion that such inferences are common enough for many Bayesian networks with three variables: the key is to have a latent variable influencing the opinion variable, and itself influenced by observations. It is possible that some examples in behavioral finance or economics can be interpreted in a similar manner. [Paper]- FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs (Andrew McCallum, Karl Schultz, Sameer Singh). This is a library of probabilistic graphical models (factor graphs) written in Scala, a functional language targeting the JVM. The interesting aspect is that contarily to dedicated probabilistic languages such as BLOG, Factorie allows plugging in new inference procedures very easily, which make it feasible to treat substructures for which efficient automatic inference is difficult. On a bibliographic coreference task, they claim performance 3-15x faster than Markov Logic Network, while reducing the number of errors by 20-25%. [Paper]

## NIPS 2009 — Day 2

Here are a few papers that interested me on the Tuesday of NIPS 2009.

- T1:
**Kernels and learning curves for Gaussian process regression on random graphs**(Peter Sollich, Matthew Urry, Camille Coti). Gaussian processes (GPs) are generally used with continuous inputs. This paper defines a covariance function (kernel) on the vertices of a tree and shows how we can use a GP to regress on those constructions. The idea of the kernel is simple: is proportional to the probability of jumping from vertex x to vertex y under a**random walk**on the tree edges. The current results have been applied to random graphs with constant vertex degree, but in principle this can be applied to any graph type. [Paper] - T29:
**Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units**(Feng Yan, Ningyi Xu, Yuan Qi). The authors explain how to efficiently carry out Latent Dirichlet Allocation on a GPU. They obtain speedups ranging from 25x to 200x with the particular GPU model they used. They also present a streaming version of the algorithm that does not require all data to lie within video memory (=1GB on their card). This potentially could be interesting for other non-parametric Bayesian models as well. [Paper] - T42:
**A General Projection Property for Distribution Families**(Yao-Liang Yu, Yuxi Li, Dale Schuurmans, Csaba Szepesvari). Proof of a theoretical result on projection of distributions. What is interesting, and applies to portfolio management, is that portfolio optimization under “worst-case Value-at-Risk” and “Conditional Value-at-Risk” (expected shortfall) both result in identical portfolios. This should be studied more deeply. [Paper] - T44:
**Linear-time Algorithms for Pairwise Statistical Problems**(Parikshit Ram, Dongryeol Lee, William March, Alexander Gray). One of the bottlenecks to applying SVMs and Gaussian Processes (GPs) lies in the requirement to compute the kernel matrix, namely the matrix of kernel evaluations between all pairs of points in the training set, which is naïvely , where N is the number of elements in the training sets. For the past few years, the group led by Alex Gray at Georgia Tech has been working on tree-like structures to make these operations faster. In this paper, they analyze in detail the computational complexity of an algorithm introduced a few years ago, and show that pairwise tasks can be reduced to or even in some cases. [Paper] - T47:
**Accelerated Gradient Methods for Stochastic Optimization and Online Learning**(Chonghai Hu, James Kwok, Weike Pan). This paper presents an online optimization algorithm based on Nesterov gradients, in which we try to optimize a strongly-convex training criterion (e.g. L2) penalized by a convex (but not necessarily strongly so) regularizer, e.g. L1. The authors show that they obtain optimal convergence rates, much quicker than simple stochastic gradient descent. The technique appears to be very simple to implement, much more so than homotopy methods. [Paper] - T65:
**Unsupervised feature learning for audio classification using convolutional deep belief networks**(Honglak Lee, Peter Pham, Yan Largman, Andrew Ng). Application of Convolutional Restricted Boltzmann Machines (e.g. Desjardins and Bengio, 2008) to the classification of phonemes, speakers and musical sequences. The conclusion appears to be that the audio features learned in an unsupervised way by the CRBM (working in the log-spectral domain of short-term Fourier transforms, i.e. spectrograms) are helpful to improve classification, but are not always sufficient to beat MFCCs. However, when combining unsupervised features with MFCCs, the results outperform what has previously been reported in the literature for the tasks under consideration. The classification algorithm itself is fairly simple: SVMs, which are applied frame-by-frame followed by some voting scheme. [Paper]

## NIPS 2009 — Day 1

It’s this time of the year again where the Neural Information Processing Systems (NIPS) conference focuses the collective brainpower of the machine learning community for a week of unstopping climax.

Here are some presentations and papers that piqued my interest during the first day (Monday December 9).

### TUTORIALS

**Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization**(Gunnar Martinsson). This is the study of randomized SVD, which goes beyond the “Google” method by drawing k+p**random**vectors in R^N (where k is the number of singular values to extract and p is a small constant like 5 or 10), and applies the matrix to be decomposed to each vector. The result are random vectors*within the span of the matrix to be decomposed*. The next steps consist in orthonormalizing (i.e. Gram-Schmidt) and carrying out an SVD on the k-by-k resulting matrix, which is much smaller than the original one. Then by applying a few minor substitutions, we recover the SVD of the original matrix. For an NxN matrix, this requires time , which is comparable to classical approaches to the SVD. However, the beauty of randomized approaches, contrarily to classical methods, is that we now require a single (or at most a small number) of passes over the data, and we can traverse the data in a defined order (e.g. row by row). Hence it’s very efficient if the data is stored on a slow medium like a hard drive, as opposed to fast random-access memory. Moreover it is applicable to streaming data, which we can see only once. In addition, Martinsson discussed approached based on the fast Fourier transform which require time , which is a considerable further saving. [Slides]**Sparse Methods for Machine Learning: Theory and Algorithms**(Francis Bach). I found this to be a lot of theory, and a bit of algorithm. I was mostly interested by potential results related to sparse quantile regression, i.e. the main objective is to minimize an L1 norm with respect to the data fit, and the regularization term is also an L1 norm. But sadly, this question was not brought up, nor was the presenter aware of any substantive result in that respect. As an aside, many papers at the main confernce discuss this problem; see discussion below and over the next few days. [Slides]**Sequential Monte Carlo Methods**(Arnaud Doucet et Nando de Freitas). This was mostly a tutorial on particle filtering. The first part, by Doucet, was clear; the second part, by de Freitas, was a bit quick for people lacking the background. The one thing I took out from the tutorial was the explanation of why particle filtering works when doing*filtering*, but fails for computing (possibly implicitly) functions of the joint distribution of the hidden states and observations: because of the*resampling step*(which is nevertheless essential), all particles end up having the same parent, and we don’t keep any representative of the historical diversity of particles. [Slides: to my knowledge, not yet available]

### ARTICLES

- M6:
**Gaussian process regression with Student-t likelihood**(Jarno Vanhatalo, Pasi Jylänki, Aki Vehtari). If the likelihood function of a Gaussian Process is not Gaussian, it is usually impossible to analytically compute the posterior distribution over the latent function. However, normality of the likelihood makes learning sensitive to outliers. One solution is to use a Student-t likelihood function, and previous approaches have applied MCMC and Variational Bayes to the problem of posterior inference. However, surprisingly, the Laplace approximation, despite being commonplace for GP classification, had apparently never been proposed for this task. The authors’ approach appears to work well (at least for the examples shown), converges quickly, and is “only” between 50% and 100% slower than an analytical GP (whereas, for instance, VB is about 1000% slower). [Paper] - M9:
**Non-stationary continuous dynamic Bayesian networks**(Marco Grzegorczyk, Dirk Husmeier). The authors seek to learn the structure of a Dynamic Bayes Net, and model each graph node (within a structure that may contain loops for the time delays) as being piecewise constant within a time interval. They propose a sampling approach that let them marginalize over all changepoint hypotheses. They apply the method to a problem in genetic analysis. At the moment, they have applied the approach to the smoothing case; it was not clear how quickly changepoints could be detected in the filtering case. The matlab code is available upon request. [Paper] - M13:
**Large Scale Nonparametric Bayesian Inference**: Data Parallelisation in the Indian Buffet Process (Finale Doshi-Velez, David Knowles, Shakir Mohamed, Zoubin Ghahramani). The Indian Buffet Process can be used as a prior for Bayesian feature selection models. Traditionally, in nonparametric Bayesian models (for which representation can grow without bounds), inference is carried out by a variant of Gibbs sampling, and can be (very) expensive. The authors show how to split the problem of inference in IBP models into several parts (multiple CPUs/cores) and how to communicate results between compute nodes using belief propagations. This approach is promising for scaling up the models to more realistic use-cases. [Paper] - M40:
**Streaming k-means approximation**(Nir Ailon, Ragesh Jaiswal, Claire Monteleoni). This is a method to carry out an approximate K-Means using a single pass on the data (for streaming data, which can be seen only once). The algorithm is very simple, but the heart of the paper is in the analysis. [Paper] - M49:
**Dual Averaging Method for Regularized Stochastic Learning and Online Optimization**(Lin Xiao). Extremely simple online optimization algorithm for L1-penalized regression objectives. On logistic regression, the author obtains nearly as good classification performance with the online method as with a batch optimization, but with a much-reduced computational cost. [Paper] - M85:
**Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation**(Hamid Maei, Csaba Szepesvari, Shalabh Bhatnagar, Doina Precup, David Silver, Rich Sutton). Boys and girls, this seems to be the real deal: reinforcement learning solved (ok, not quite, but getting there). In a nutshell, the authors introduce a correction to the classical Bellman error used as the learning objective in Temporal-Difference learning. They show that their algorithm is convergent in the off-policy case and with general function approximators (such as neural networks), which are two situations where a number of divergence examples have been illustrated in the late-90’s literature. As with many things, the algorithm is simple and the heart of the paper is the convergence proof. There is, as of yet, not a great deal of experimental validation of the algorithm, although the authors’ results on a 9×9 Go example appear very promising. [Paper]

Stay tuned for more candy tomorrow.