Center for Nonlinear Studies

Monday, November 09, 200910:00 AM - 11:00 AMCNLS Conference Room (TA-3, Bldg 1690)
Seminar
High Efficiency Model Identification for Statistical Graphical Models
Terran LaneUniversity of New Mexico
Statistical graphical models, such as Bayesian networks, Markov random fields, or factor graphs, have become increasingly important for data modeling in fields as diverse as economics, ecology, physics, neuroscience, computer vision, and genomics. These models are attractive because they efficiently and compactly represent the often complex probability distributions that arise in such fields, and because they provide semantically rich models to domain scientists. However, often the graph structure of the target model is initially unknown -- indeed, in many cases the model structure is, itself, the quantity of interest to the domain scientist. The problem of structure identification (or model selection, if you prefer) remains a prominent open question in this field. Statistically, the problem is one of identifying (conditional, multivariate) dependencies from data, and is reasonably well understood. Computationally, however, the task is quite challenging: Worst case analysis shows that optimal structure identification is NP-hard, while practical algorithms are typically high-order polynomial runtime and may require many scans over the complete data in order to accumulate sufficient statistics. In this talk, we present preliminary work on a new approach to structure identification that exploits the topology of graph structure space itself. The key insight is that we need not compute the exact optimality criterion for every model we evalute during search, if we can approximate it well. And building good function approximators is precisely what Machine Learning and Statistics are very good at... We demonstrate how to use this insight to build a structure search algorithm that runs orders of magnitude faster than conventional approaches, while achieving results that are at least as good, if not better. We give preliminary data demonstrating our approach on a number of synthetic and real-world data sets, including some challenging neuroimaging data sets that involve hidden variables.

Host: Joshua Neil, jneil@lanl.gov