| Lecturer | Jonathan Rougier, j.c.rougier@bristol.ac.uk |
|---|---|
| Level | H/6 and M/7, 20cp, TB2 |
| Official unit page | level H/6, level M/7 |
| Timetable | 1100-1150 Tue, Maths SM3 |
| | 1200-1250 Tue, Maths SM3 (office hour) |
| | 0900-0950 Wed, Maths SM4 |
| | 1500-1550 Thu, Maths SM1 |
Please note: All lectures start promptly on the hour and last for fifty minutes. |
Navigation: course outline, details, homework and assessment. A lecture-by-lecture summary is given under 'details', below.
I will arrange for paper copies of these chapters to be available from the Maths Reception (please await email confirmation).
You should read these three chapters in your own time over the course of the week (TW13), and mull them over. This should take about three hours. I will provide a self-assessment sheet (now available) at the end of the week for you to check your understanding. You should not find the material mathematically challenging, but you will find it philosophically challenging. This is because one of the main concerns in this course is to find the meaning in statistical inference.
From TW14 I will assume that you have assimilated this material, although the first lecture will involve a discussion of some of the issues raised. From time to time you will want to revisit these chapters; ch 1 in particular lays out my notation and some of my conventions.
Here is the self-assessment sheet. We will discuss some of the issues raised in the first lecture, and you can also stay for the Office Hour immediately afterwards.
For additional reading, start with
Previous exam papers are available on Blackboard. You should be aware that the course continues to evolve, and these questions cannot be taken as a reliable guide to the questions that will be set this year.
Answers to previous exam papers will not be made available. The exam is designed to assess whether you have attended the lectures, read and thought about your lecture notes and the handouts, done the homework, and read a bit more widely in the textbooks. Diligent students who have done the above will gain no relative benefit from studying the answers to previous exam questions. On the other hand, less diligent students may suffer the illusion that they will do well in the exam, when probably they will not.
Instead, I will supply 'exam-style' questions for revision purposes.
Here is a summary of the course, looking as far ahead as seems prudent. This plan is subject to revision. There will be some time at the end for revision of the major themes.
3 Feb. Das Ringen um Bedeutung ('the struggle for meaning'). Why statistical inference is more than just the manipulation of symbols according to rules. Franklin's law (uncertainty is ubiquitous). People are bad at reasoning, especially about uncertainty. See the books by Gigerenzer and Kahneman above. We also talked about the Fundamental Theorem of Prevision (FTP), and its alternative form in terms of the convex hull of the columns of the matrix G.
4 Feb. Das Ringen um Notation ('the struggle for notation'). Observables and observations; the need for a general model of how observables are related to the random quantities of interest. The observation proposition Q. How introducing observations knocks out all of the elements of the realm of X that are incompatible with them. Notation: "Y → y" = "the operation Y was carried out and y was the result". "Pr(Q) ← 1" = "assign the value 1 to the probability of Q". An image of Sakurajima volcano from the International Space Station.
5 Feb. Statistical inference is not learning. Bayesian conditionalisation: a model for learning favoured by philosophers, physicists, and computer scientists. But supposing ≠ knowing. Real statistical inference is a protracted negotiation between the client, the statistician, and the datapool. The dataset is that subset of the datapool that the statistician feels able to model. For alarming facts about global crop yields, see Yield trends are insufficient to double global crop production by 2050.
10 Feb. Frequentist inference. The Frequentist is cautious, and does not commit to a single PMF for the random quantities of interest, but rather to a family of PMFs, indexed by a parameter typically denoted θ. For any given member of the family he can update beliefs using the observations, but he needs an additional principle to deal with the width of the parameter space. Plugging in the maximum likelihood (ML) estimate of θ is very common. But actually finding the value of this estimate can be challenging in applications where the parameter space is not tiny.
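As a minimal sketch of ML estimation (not from the lectures; the data and the grid search are hypothetical), here is the Bernoulli case, where the grid search recovers the familiar closed-form answer, the sample mean:

```python
import math

def log_likelihood(theta, y):
    # Bernoulli log-likelihood for observations y in {0, 1}
    return sum(yi * math.log(theta) + (1 - yi) * math.log(1 - theta) for yi in y)

y = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical data

# crude grid search over the parameter space (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
theta_ml = max(grid, key=lambda t: log_likelihood(t, y))
print(theta_ml)  # the sample mean, 6/8 = 0.75
```

In a one-dimensional parameter space a grid search is fine; the point made in the lecture is that this brute-force approach fails once the parameter space has more than a few dimensions.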
11 Feb. Bayesian inference (naive view). The Bayesian adduces a prior distribution (PDF) πθ and then is able to proceed within the axioms of expectation. Contrasting the two approaches, they both need to add something to the basic set-up of a model and a parameter:
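As a sketch of the Bayesian side of this contrast (a hypothetical Bernoulli example with a flat prior on a grid, not from the lectures), the prior is the 'something added', and the posterior follows from the axioms:

```python
import math

y = [1, 0, 1, 1, 0, 1, 1, 1]        # hypothetical Bernoulli observations
grid = [i / 1000 for i in range(1, 1000)]

def likelihood(theta):
    return math.prod(theta if yi else (1 - theta) for yi in y)

prior = [1.0 for _ in grid]          # flat prior on the grid
unnorm = [p * likelihood(t) for p, t in zip(prior, grid)]
z = sum(unnorm)                      # normalising constant
posterior = [u / z for u in unnorm]

post_mean = sum(t * p for t, p in zip(grid, posterior))
print(post_mean)  # close to 0.7, the Beta(7, 3) mean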
17 Feb. The Stable estimation theorem. L.J. Savage identified simple conditions under which the normalised likelihood function approximates the posterior distribution. These conditions, which can be subjectively assessed by the statistician and client, permit the substitution of a simple prior distribution for a more considered one. The dominant condition is that the likelihood function is highly concentrated, which typically arises when the number of observations is far larger than the number of parameters. Have another read of the comments in section 4.6 of the notes.
18 Feb. Introduction to decision theory. A formal treatment of decisions based on a client's action set and loss function: the Bayes action minimises the expected loss. More complicated decisions involving observables; the Bayes Rule. The interpretation of the loss function as the expected loss supposing X = x. The difficulty of specifying a loss function; the role of the statistician as the client's 'critical friend'. Here is our article about donkeys.
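A minimal sketch of the Bayes action (the action set, losses, and probabilities here are hypothetical, invented for illustration):

```python
# Hypothetical decision problem: X takes values in {0, 1, 2} with these probabilities
p = {0: 0.5, 1: 0.3, 2: 0.2}

actions = ["do nothing", "repair", "replace"]

# loss[a][x]: hypothetical loss of action a supposing X = x
loss = {
    "do nothing": {0: 0, 1: 10, 2: 100},
    "repair":     {0: 2, 1: 2,  2: 40},
    "replace":    {0: 8, 1: 8,  2: 8},
}

def expected_loss(a):
    return sum(p[x] * loss[a][x] for x in p)

bayes_action = min(actions, key=expected_loss)
print(bayes_action, expected_loss(bayes_action))
```

The hard part in practice is not this minimisation but eliciting the loss table, which is where the statistician's role as 'critical friend' comes in.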
19 Feb. The Bayes Rule theorem. This amazing theorem states that the very challenging task of finding a Bayes rule for a decision problem simplifies to minimising expected loss conditional on the observables. It holds because of the FTP, the defining property of conditional expectation, and chronology.
24 Feb. Prediction as a decision problem. A prediction is a point value for a random quantity. But predictions are used in different ways, and should themselves be derived according to some notion of consequences, ie using a loss function. A generic prediction has a convex and symmetric loss function, which is usually well-approximated by a quadratic loss function. Under quadratic loss, the optimal prediction is the expectation (and the optimal prediction rule is the hypothetical expectation).
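The claim that the expectation is optimal under quadratic loss can be checked numerically (hypothetical PMF; a grid search stands in for the calculus):

```python
# PMF of a hypothetical random quantity X
p = {0: 0.2, 1: 0.5, 3: 0.3}
mean = sum(x * px for x, px in p.items())   # E(X) = 1.4

def expected_quadratic_loss(d):
    return sum(px * (x - d) ** 2 for x, px in p.items())

# grid search for the optimal point prediction
grid = [i / 100 for i in range(0, 301)]
d_star = min(grid, key=expected_quadratic_loss)
print(d_star, mean)  # the minimiser coincides with the expectation
```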
25 Feb. The Netflix prize. Why will computer science people sometimes outperform statisticians in prediction problems? The Netflix challenge shows why: with massive datasets a computer scientist can duplicate the conditions of the competition itself, on the understanding that the organisers have randomly-sampled the training data from the total available data. Whether this is a good idea is another matter: Netflix have ignored the issue of sample bias. See this excellent article by Tim Harford.
26 Feb. Admissible rules. An inadmissible rule is a rule which is dominated by another rule: using an inadmissible rule is an embarrassing mistake. It is easy to check that, informally, if a rule is a Bayes rule for some prior distribution for θ then it is admissible. Wald's theorem asserts the converse: that all admissible rules are either Bayes rules or the improper limit of Bayes rules. The maximum likelihood estimator is often the improper limit of Bayes rules, but Stein's paradox (aka Stein's bombshell!) showed that, even in rather simple situations, it is not admissible for quadratic loss. Here's a helpful article on Stein's paradox, including the proof.
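Stein's paradox is easy to see by simulation. A sketch (the dimension, true mean, and seed are arbitrary choices; the positive-part James-Stein shrinkage factor is used):

```python
import random
random.seed(1)

p = 10                               # dimension (Stein's result needs p >= 3)
theta = [1.0] * p                    # hypothetical true mean vector
n_rep = 2000

risk_ml, risk_js = 0.0, 0.0
for _ in range(n_rep):
    x = [random.gauss(t, 1.0) for t in theta]      # X ~ N(theta, I)
    s = sum(xi * xi for xi in x)
    shrink = max(0.0, 1.0 - (p - 2) / s)           # positive-part James-Stein factor
    js = [shrink * xi for xi in x]
    risk_ml += sum((xi - t) ** 2 for xi, t in zip(x, theta))
    risk_js += sum((ji - t) ** 2 for ji, t in zip(js, theta))

print(risk_ml / n_rep, risk_js / n_rep)  # the James-Stein risk comes out smaller
```

The ML risk is close to p = 10, while shrinking every coordinate towards zero gives a strictly smaller quadratic risk, whatever θ happens to be: the ML estimator is dominated, hence inadmissible.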
3 Mar. A multiplicity of models. How the client can end up with several different models of the same random quantities. Linear and logarithmic pooling, why linear pooling is more conservative. Bayes Model Averaging: after conditioning on the observations, each model updates and the weights update as well. The evidence for each model: how to get a large evidence.
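A minimal sketch of evidence-based weight updating, for two hypothetical Bernoulli models where the evidences have closed forms:

```python
import math

y = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]   # hypothetical data: 8 ones out of 10
n, k = len(y), sum(y)

# Model 1: theta fixed at 0.5.  Evidence = p(y | M1).
ev1 = 0.5 ** n

# Model 2: theta ~ Uniform(0, 1).  Evidence = the Beta(k+1, n-k+1) integral.
ev2 = math.factorial(k) * math.factorial(n - k) / math.factorial(n + 1)

# equal prior weights; posterior weights are proportional to the evidences
w1 = ev1 / (ev1 + ev2)
w2 = ev2 / (ev1 + ev2)
print(w1, w2)  # these data favour the more flexible model
```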
4 Mar. Choosing a single model. Restatement of the Bayes Model Averaging theorem in terms of a random quantity of interest X and an observable Y. Why the client might want to proceed with just a single model. An investigation of why the kneejerk reaction of taking the model with the largest evidence might not be the right thing to do: the best single model depends on the client's decision problem. Two special cases: when one model has a much larger evidence than all the other models combined; when the action set is very small and the actions have clearly different consequences.
5 Mar. Information criteria. What to do if the client is unable or unwilling to provide a loss function? It looks as though we might be able to select models using their evidences (even if this is not a great idea, see above). But there are two further difficulties. First, for the Bayesian statistician the value of the evidence of a model depends on the prior distribution for the model parameters (there is no 'stable estimation' type result in this case, as there was for the posterior distribution). Second, for the Frequentist the whole issue is moot, because he cannot compute the evidence anyhow, but only the evidence for each possible value for θ. Information criteria are designed to plug the gap; to do model selection on the basis of best fit penalised by model complexity. An alternative, for some inferences, is to use Leave-One-Out cross validation. This is n times as expensive. But in special cases, the Akaike Information Criterion (AIC) gives roughly the same answer at one nth of the cost.
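A sketch of AIC as fit-penalised-by-complexity, for two nested Normal models (the observations are hypothetical; AIC here is -2 max log-likelihood + 2 x number of parameters):

```python
import math, statistics

y = [2.1, 1.7, 2.4, 1.9, 2.6, 2.2]   # hypothetical observations

def normal_loglik(y, mu, sigma2):
    n = len(y)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((yi - mu) ** 2 for yi in y) / (2 * sigma2))

n = len(y)
# Model A: mu = 0 fixed, sigma^2 estimated by ML  (1 parameter)
s2_a = sum(yi ** 2 for yi in y) / n
aic_a = -2 * normal_loglik(y, 0.0, s2_a) + 2 * 1

# Model B: mu and sigma^2 both estimated by ML  (2 parameters)
mu_b = statistics.fmean(y)
s2_b = sum((yi - mu_b) ** 2 for yi in y) / n
aic_b = -2 * normal_loglik(y, mu_b, s2_b) + 2 * 2

print(aic_a, aic_b)  # smaller AIC is preferred; here the extra parameter earns its keep
```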
11 Mar. Hypothesis tests. Hypothesis tests: choosing between models defined as disjoint subsets of a parameter space Ω. The need to be able to evaluate p(y; θ), which is not as easy as Statistics 1 and the textbooks lead us to believe. Null and alternative hypotheses; simple and composite hypotheses; typical situations involving hypothesis tests.
12 Mar. Simple/simple hypothesis tests. This is the situation for which we have a complete theory! (If only it occurred more often.) Everyone agrees that the choice between two simple hypotheses should be according to the value of the likelihood ratio. For Bayesians, this follows from Bayes's theorem in odds form. For Frequentists, this is a consequence of the famous Neyman-Pearson Lemma. As for where the cut-off should be chosen, this is a more complicated issue (for Frequentists). The Neyman-Pearson approach, of choosing the cut-off so that the Type 1 error is controlled at some value such as 5%, is obsolete.
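A minimal sketch of the likelihood ratio for two simple hypotheses (the Bernoulli parameter values and data are hypothetical):

```python
import math

# Two simple hypotheses for iid Bernoulli observations (hypothetical values)
theta0, theta1 = 0.5, 0.8
y = [1, 1, 0, 1, 1, 1, 0, 1]

def lik(theta):
    return math.prod(theta if yi else 1 - theta for yi in y)

lr = lik(theta1) / lik(theta0)      # likelihood ratio, H1 against H0
print(lr)

# In odds form: posterior odds = prior odds x likelihood ratio
prior_odds = 1.0                    # equal prior probabilities for H0 and H1
posterior_odds = prior_odds * lr
```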
17 Mar. P-values: basics. The basic principle of P-values, which is that they attempt to measure whether the observations are compatible with a single hypothesis, rather than to show which of two hypotheses is favoured. In the latter case there is no requirement for the observations to be compatible with either hypothesis, so the P-value is definitely asking something new. The modern definition of a P-value (for a simple null hypothesis): a statistic with a subuniform distribution under H0.
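The subuniformity requirement, Pr(P ≤ α; H0) ≤ α for every α, can be checked by simulation. A sketch for a one-sided test of a Normal mean, where the P-value is in fact exactly uniform under H0 (the seed and repetition count are arbitrary):

```python
import random, math
random.seed(0)

def phi(z):
    # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def p_value(y):
    # P-value for H0: mu = 0 against mu > 0, from one N(mu, 1) observation
    return 1 - phi(y)

# Under H0 the P-value must be subuniform: Pr(P <= alpha) <= alpha
n_rep = 20000
pvals = [p_value(random.gauss(0, 1)) for _ in range(n_rep)]
for alpha in (0.01, 0.05, 0.2):
    freq = sum(p <= alpha for p in pvals) / n_rep
    print(alpha, freq)   # freq is close to alpha in this exactly-uniform case
```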
18 Mar. P-value interpretation and fallacies. Why a small P-value makes one suspect that H0 may not be true, but a large P-value does not contain any information. Some key things to understand about P-values:
19 Mar. Constructing P-values. The Probability Integral Transform. Every statistic induces a P-value. The statistic t(y) = c induces a P-value which is completely useless, from which we learn that P-values occupy a spectrum from completely useless to potentially useful. A useful P-value is designed to indicate decision-relevant departures from H0. Formally, if H1 were such a hypothesis, we would require t(Y) under H1 to stochastically dominate t(Y) under H0. Even cartoonists have fun with P-values!
24 Mar. Computing P-values. Three different approaches. In the exact case, there are one or two models and one or two test statistics for which we know exactly what the distribution of the test statistic under the null hypothesis is: e.g. the IID Normal with H0: μ=μ0, and t(y) = y1 + ... + yn. Then there is asymptotic theory, which can provide a null distribution for a particular test statistic, such as Pearson's χ2 test (precisely what this test statistic is sensitive to is more subtle). Then there is a simulation result based on approximating the probability using the Law of Large Numbers.
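The simulation approach can be sketched in a few lines (the data, test statistic, and seed are hypothetical; the +1 adjustment is a common convention, not the only one):

```python
import random, statistics
random.seed(42)

# Observed data and test statistic (hypothetical example)
y_obs = [0.3, 1.9, 1.1, 2.4, 0.8]
def t(y):
    # test statistic: the sample mean
    return statistics.fmean(y)

# H0: Y_i iid N(0, 1).  Estimate Pr(t(Y) >= t(y_obs); H0) by the Law of Large Numbers.
n_sim = 10000
t_obs = t(y_obs)
exceed = sum(t([random.gauss(0, 1) for _ in y_obs]) >= t_obs
             for _ in range(n_sim))
p_hat = (exceed + 1) / (n_sim + 1)   # the +1 guards against reporting p = 0
print(p_hat)
```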
25 Mar. Confidence sets. A confidence set makes a probabilistic statement about random sets in parameter space, in an attempt to quantify lack of knowledge about the parameters. Your reply to the Minister: "Not quite, Minister. (0.74m, 1.33m) is one realisation of a random interval that has probability of at least 95% of containing the true value of sea-level rise, no matter what that true value happens to be". We drew a confidence set on the board; it wasn't easy.
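The 'no matter what that true value happens to be' part can be checked by simulation. A sketch for the standard interval for a Normal mean with known variance (the values of μ tried, and the seed, are arbitrary):

```python
import random, statistics
random.seed(7)

# 95% confidence interval for mu from n iid N(mu, 1) observations:
# ybar +/- 1.96 / sqrt(n).  Check its coverage by simulation.
n, n_rep = 10, 5000
half_width = 1.96 / n ** 0.5

for mu in (0.0, -3.2, 10.0):           # coverage must hold for every mu
    cover = 0
    for _ in range(n_rep):
        ybar = statistics.fmean(random.gauss(mu, 1) for _ in range(n))
        if ybar - half_width <= mu <= ybar + half_width:
            cover += 1
    print(mu, cover / n_rep)           # each should be close to 0.95
```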
26 Mar. Duality between p-values and confidence sets. A p-value for the hypothesis H0: θ = t (for each t ∈ Ω) can be used to construct a level-β confidence set for θ. So there is an infinity of confidence sets (one for each p-value) and these range from the useless to the potentially interesting. The Marginalisation Theorem tells us that we can construct a level-β confidence set for any function of θ from a level-β confidence set for θ. This is how we deal with nuisance parameters.
20 Apr. Confidence sets again. Quick revision; Wilks's theorem, and its use in constructing generic confidence sets with specified coverage, based on level sets of the (log) likelihood function. These are 'approximately exact level β confidence sets', where β is specified, usually β = 0.95.
21 Apr. Bootstrap correction for level error. A confidence set based on conditions which do not hold (e.g. Wilks's confidence set) cannot guarantee coverage of β = 0.95 (or some other value). The difference between the nominal coverage of β and the minimum coverage over θ is termed the level error. A simulation experiment can be used to tune Wilks's confidence set so that its coverage is β at the ML estimate of θ and, one hopes, near to β in the parameter space around the ML estimate.
22 Apr. Application spotlight. Confidence sets for a simple likelihood function in a complicated parameter space: global large-eruption recording rates for stratovolcanoes.
Feedback: Remember that a good answer is compelling. This is about clarity and logic. You must state restrictions on results, and invoke conditions exactly when they are needed. This might seem like showing off, but you should do it anyway. If you do not practise now then it will not come naturally in the exam.
Feedback: We still need more words in order for your answers to be compelling. Remember that you are not having a dialogue with your reader: he cannot come back to you and ask for clarification. It's all got to be there on the page. A ten-mark question in the exam is an opportunity for you to show off: don't do the minimum because parts of your answer will not make much sense unless you have provided some background.
Feedback: No one really got the tedious Q1. The exam-style revision question was generally well-done, with one or two marks being lost per part for failures to be precise, or including information that was not relevant.
Feedback: We covered this homework in two Office Hours. Part (d) in the exam-style revision question was very hard, but it was logical, and marks could have been collected for starting the problem, and for stating what needed to be shown. Also see feedback from HW2: in an exam answer, you must do more than the bare minimum. You must convince your reader that you are totally on top of the question.
Feedback: Two things to be really clear about (mentioned in the answers). The difference between y, which is a value, and Y which is a random quantity. Probability statements that depend on a model for Y must always be indexed with a θ. Some definitions, such as for confidence sets, must hold for all θ in Ω.