Module 12 of E0 259, Data Analytics, August 2015


Selected Problems in Sports Analytics (Shubhabrata Das, IIM Bangalore)

This will be a lecture by Professor Shubhabrata Das of IIM Bangalore under the Big Data Lecture Series.

Date: Thursday 26 November 2015, Faculty Hall, Indian Institute of Science

Scribes: Indranil Bhattacharya (Problems 1 and 2), Jay Ilesh Oza (Problems 3 and 4)

Abstract

The application of advanced statistical methods in sports has been steadily rising, leading to academic conferences and journals devoted exclusively to this domain. In this talk, we will briefly discuss a few such problems.

Problem 1: Not out scores in cricket

In cricket, the batting average has always been the primary measure of a batsman's performance. But the traditional batting average has a serious limitation in reflecting a batsman's true performance when some innings end not out. Treating not-out scores as censored data, an adaptation of the Kaplan-Meier estimator provides a more reasonable solution, but it still suffers from both conceptual and operational problems in certain situations. A generalized class of geometric distributions (GGD) is proposed in this work to model the runs scored by individual batsmen, with the generalization coming in the form of the hazard of getting out changing from one score to another. We treat the change points as known or specified parameters and derive general expressions for the restricted maximum likelihood estimators of the hazard rates under this generalized structure. Given the domain context, we propose ten different variations of the GGD model and test across the nested models using the asymptotic distribution of the likelihood ratio statistic. We propose two alternative approaches for improved estimation of the batting average on the basis of the above modelling.
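
To make the censoring idea concrete, here is a minimal Python sketch of a piecewise-constant hazard model for runs scored. It is only an illustration under stated assumptions, not the estimator from the talk: it computes the plain (unrestricted) maximum likelihood hazards, the function names hazard_mle and mean_runs are hypothetical, and an innings that ends not out at score x is assumed to contribute only the survival probability P(X >= x). With no change points the model reduces to the ordinary geometric distribution, and the implied mean equals the traditional average (total runs divided by dismissals).

```python
def hazard_mle(scores, not_out, change_points=()):
    """Unrestricted MLE of piecewise-constant hazards of getting out.

    scores        : runs in each innings
    not_out       : True where the innings ended not out (censored)
    change_points : scores at which a new hazard level begins (assumed known)
    """
    bounds = [0, *change_points, float("inf")]
    n_seg = len(bounds) - 1
    outs = [0] * n_seg       # dismissals at a score inside each segment
    survived = [0] * n_seg   # score values in each segment that were survived
    for x, is_not_out in zip(scores, not_out):
        for j in range(n_seg):
            lo, hi = bounds[j], bounds[j + 1]
            # number of score values s in [lo, hi) with s < x
            survived[j] += max(0, min(x, hi) - lo)
            if not is_not_out and lo <= x < hi:
                outs[j] += 1
    return [d / (d + s) if d + s else float("nan")
            for d, s in zip(outs, survived)]


def mean_runs(hazards, change_points=(), tol=1e-12, max_score=100_000):
    """Expected runs, E[X] = sum over x >= 1 of P(X >= x), truncated numerically."""
    surv, total, s = 1.0, 0.0, 0
    while surv > tol and s < max_score:
        j = sum(s >= c for c in change_points)  # segment containing score s
        surv *= 1.0 - hazards[j]                # now equals P(X >= s + 1)
        total += surv
        s += 1
    return total


# Made-up toy innings, for illustration only.
scores = [10, 25, 40, 5]
not_out = [False, True, False, False]
print(mean_runs(hazard_mle(scores, not_out)))
print(hazard_mle(scores, not_out, change_points=(20,)))
```

Passing change_points=(20,) fits one hazard for scores below 20 and another from 20 onwards; the change point value here is arbitrary.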

Problem 2: Tracking the progress in a round-robin tournament (World Cup football, hockey, cricket)

The up-to-date standing of competing teams, based on the points they have obtained midway through a round-robin tournament (or a round-robin stage of one), may inadequately reflect their actual relative position, because it ignores the strength of the opposition each team has faced up to that stage. To help followers of the game, and possibly to help the teams strategize, a simple approach based on a probability matrix, followed by computation of the expected points, can bring clarity to the situation. While updating these probabilities in an unstructured or unconstrained way at successive stages of the tournament, reflecting an individual's perspective, may be acceptable, such a method is ad hoc, suffers from arbitrariness, and may lack consistency. In that context, we explore how a model-based Bayesian adaptation can work effectively.
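
The expected-points computation can be sketched in a few lines of Python. This is an illustrative sketch only: the function name expected_points, the 3/1/0 scoring (as in football), and the outcome probabilities are assumptions, and the probabilities would in practice come from the follower's own judgement or from a model.

```python
def expected_points(current, remaining, win_pts=3, draw_pts=1):
    """Expected final points per team.

    current   : dict team -> points earned so far
    remaining : dict (team_a, team_b) -> (P(a wins), P(draw), P(b wins))
    """
    expected = dict(current)
    for (a, b), (p_a, p_draw, p_b) in remaining.items():
        expected[a] += win_pts * p_a + draw_pts * p_draw
        expected[b] += win_pts * p_b + draw_pts * p_draw
    return expected


# Made-up example: A and B are level on points, but B's remaining
# opponent is assumed to be stronger than A's.
current = {"A": 3, "B": 3, "C": 0, "D": 0}
remaining = {
    ("A", "C"): (0.6, 0.2, 0.2),
    ("B", "D"): (0.3, 0.3, 0.4),
}
print(expected_points(current, remaining))
```

Teams level on current points can differ in expected final points once the strength of their remaining opposition is taken into account.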

Problem 3: New models for repeated tournaments (Illustration with NCAA College basketball)

The primary objective here is to model the win-loss records of matches in a repeated tournament, using the strengths of the teams. Of particular focus is the case of a standard knockout tournament with teams ranked a priori; data from the National Collegiate Athletic Association (NCAA) men's and women's basketball tournaments are used for demonstration. The work considers modifications of the Bradley-Terry (BT) model that are consistent with the ranks of the participating teams. The BT model with restricted maximum likelihood strengths involves estimating too many parameters, and the strength estimates typically lack strict monotonicity. A proposed class of rank-based percentile BT models from different parametric families provides an excellent fit to past data using only a few parameters, and this validates the ranking procedure adopted by the NCAA. Parameter estimation, goodness of fit using a suitably framed test statistic and its null distribution, selection between nested models in the change point framework, and other estimation aspects are discussed. Adaptive variations of the model, which allow strengths to change, are also considered. The model and analysis can be extended to more general tournament structures, as shown through an analysis of results from the Indian Premier League. The work has potential application in the wider domain of paired comparison.
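
As a rough illustration of a rank-based BT model, the Python sketch below assumes that a team seeded r has strength exp(-beta * r), so that the single parameter beta governs every match-up. This exponential family is a hypothetical choice made here for illustration; the abstract does not specify the parametric families used in the talk. The names win_prob, log_lik and fit_beta are likewise hypothetical, and the fit uses a simple ternary search over an assumed range for beta.

```python
import math


def win_prob(rank_i, rank_j, beta):
    """Bradley-Terry probability that the team ranked i beats the team ranked j."""
    s_i = math.exp(-beta * rank_i)  # assumed strength: decreasing in rank
    s_j = math.exp(-beta * rank_j)
    return s_i / (s_i + s_j)


def log_lik(beta, matches):
    """Log-likelihood of a list of (winner_rank, loser_rank) pairs."""
    return sum(math.log(win_prob(w, l, beta)) for w, l in matches)


def fit_beta(matches, lo=0.0, hi=5.0, iters=100):
    """Maximise the (concave) log-likelihood in beta by ternary search on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_lik(m1, matches) < log_lik(m2, matches):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


# Made-up results, for illustration only: (winner's seed, loser's seed).
matches = [(1, 16), (2, 15), (12, 5), (4, 13), (1, 8), (9, 8), (3, 6)]
beta_hat = fit_beta(matches)
print(beta_hat, win_prob(1, 16, beta_hat))
```

Because one parameter determines all the strengths, they are strictly monotone in rank by construction, which is the property the unrestricted BT fit typically lacks.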

Problem 4: Seeded Contests and Betting Odds (Illustration with tennis)

We next develop a model to predict the outcome (win or loss) of a game based on the ranks of the participating players and the betting odds set by the bookmakers. The model is based on the Bradley-Terry framework, in which the participating players are linked by a measure of their competitive ability. We illustrate the application of our model with a data set comprising records from international tennis tournaments for women and men. A Bayesian approach is adopted to make inferences about the parameters in the model. The estimates are also used to infer the margin by which the 'true odds' may be altered by the bookmakers. Predictions based on the estimated model are compared with the actual outcomes of games played in 2015. Various strategies for selecting bets based on the model are discussed. We propose two very promising betting strategies that have yielded positive results, albeit in the short run.
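
The link between quoted odds and the bookmaker's margin can be illustrated with a short Python sketch. It assumes decimal odds and removes the margin by proportional rescaling, which is a common simplification and not the Bayesian estimate described in the talk; the function names implied_probs and expected_profit are hypothetical, and the value-bet rule shown is a generic one rather than either of the proposed strategies.

```python
def implied_probs(odds_a, odds_b):
    """Margin-free win probabilities and bookmaker margin from decimal odds.

    Assumes the margin is spread proportionally over both players.
    """
    raw_a, raw_b = 1.0 / odds_a, 1.0 / odds_b
    total = raw_a + raw_b
    margin = total - 1.0  # excess over 1 is the bookmaker's margin
    return raw_a / total, raw_b / total, margin


def expected_profit(model_prob, odds):
    """Expected profit per unit stake if the true win probability is model_prob."""
    return model_prob * odds - 1.0


# Made-up odds and model probability, for illustration only.
p_a, p_b, margin = implied_probs(1.5, 2.5)
print(p_a, p_b, margin)

# Generic value-bet rule: back a player only when the expected profit is positive.
model_prob_b = 0.45
if expected_profit(model_prob_b, 2.5) > 0:
    print("bet on player B")
```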