Course description
Information theory is an indispensable tool for a statistician (or data scientist). It allows one to quantify, bit by bit, the information one random variable reveals about another. This contrasts with the classical probabilistic view, in which one can quantify knowing or not knowing a random variable, but not possessing only a bit of information about it. The incremental notion of information and uncertainty provided by information theory enables proofs of lower bounds for statistical problems, the design of penalty functions for model selection, and the design of information-theoretic metrics for feature selection. Furthermore, the measures of information it provides give rise to a geometry on the probability simplex, which in turn offers a heuristic interpretation of several popular optimization procedures used in machine learning.
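For concreteness, the quantity behind this bit-by-bit accounting is the mutual information; the standard definitions are sketched below only to illustrate the viewpoint (they are not specific to any one part of the course).

\[
  I(X;Y) \;=\; H(X) - H(X \mid Y), \qquad
  H(X) = -\sum_{x} p(x)\log_2 p(x), \qquad
  H(X \mid Y) = -\sum_{x,y} p(x,y)\log_2 p(x \mid y).
\]

With logarithms taken base 2, I(X;Y) is measured in bits: it is the expected drop in uncertainty about X once Y is observed, and it equals zero exactly when X and Y are independent.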
In this topics course, we will present several use cases for information theory in statistics and machine learning, covering recent developments along with classic results. Our plan is to cover the following topics (but, as a wise person once said, prediction is difficult, especially about the future).
- Part 1: Information-theoretically optimal distribution testing and learning: Minimax and Bayesian formulations; information-theoretic lower bounds; case studies such as uniformity testing and learning Gaussian mixtures.
- Part 2: Probability estimation and compression: Optimal-redundancy compression; estimating discrete probabilities; universal portfolios and online learning; context tree weighting; model selection via the BIC and MDL criteria.
- Part 3: Information geometry: A geometric view of parametric families; an information-geometric view of popular optimization procedures (case studies: alternating minimization and belief propagation).
Resources: We will be drawing from several recent papers and books. The specific sources will be given along with the lecture notes.