| Lecturer | Jonathan Rougier, j.c.rougier@bristol.ac.uk |
|---|---|
| Level | H/6 and M/7, 20cp, TB2 |
| Official unit page | level H/6, level M/7 |
| Timetable | 1100-1150 Tue, Maths SM3 |
| | 1200-1250 Tue, Maths SM3 (office hour) |
| | 0900-0950 Wed, Maths SM4 |
| | 1500-1550 Thu, Maths SM1 |
Please note: All lectures start promptly on the hour and last for fifty minutes. |
Navigation: course outline, details, homework and assessment. A lecture-by-lecture summary is given under 'details', below.
I will arrange for paper copies of these chapters to be available from the Maths Reception (please await email confirmation).
You should read these three chapters in your own time over the course of the week (TW13), and mull them over. This should take about three hours. I will provide a self-assessment sheet (now available) at the end of the week for you to check your understanding. You should not find the material mathematically challenging, but you will find it philosophically challenging. This is because one of the main concerns in this course is to find the meaning in statistical inference.
From TW14 I will assume that you have assimilated this material, although the first lecture will involve a discussion of some of the issues raised. From time to time you will want to revisit these chapters; ch 1 in particular lays out my notation and some of my conventions.
Here is the self-assessment sheet. We will discuss some of the issues raised in the first lecture, and you can also stay for the Office Hour immediately afterwards.
For additional reading, start with
Previous exam papers are available on Blackboard. You should be aware that the course continues to evolve, and these questions cannot be taken as a reliable guide to the questions that will be set this year.
Answers to previous exam papers will not be made available. The exam is designed to assess whether you have attended the lectures, read and thought about your lecture notes and the handouts, done the homework, and read a bit more widely in the textbooks. Diligent students who have done the above will gain no relative benefit from studying the answers to previous exam questions. On the other hand, less diligent students may suffer the illusion that they will do well in the exam, when probably they will not.
Instead, I will supply 'exam-style' questions for revision purposes.
Here is a summary of the course, looking as far ahead as seems prudent. This plan is subject to revision. There will be some time at the end for revision of the major themes.
3 Feb. Das Ringen um Bedeutung ('the struggle for meaning'). Why statistical inference is more than just the manipulation of symbols according to rules. Franklin's law (uncertainty is ubiquitous). People are bad at reasoning, especially about uncertainty. See the books by Gigerenzer and Kahneman above. We also talked about the Fundamental Theorem of Prevision (FTP), and its alternative form in terms of the convex hull of the columns of the matrix G.
4 Feb. Das Ringen um Notation ('the struggle for notation'). Observables and observations; the need for a general model of how observables are related to the random quantities of interest. The observation proposition Q. How introducing observations knocks out all of the elements of the realm of X that are incompatible with them. Notation: "Y → y" = "the operation Y was carried out and y was the result". "Pr(Q) ← 1" = "assign the value 1 to the probability of Q". An image of Sakurajima volcano from the International Space Station.
5 Feb. Statistical inference is not learning. Bayesian conditionalisation: a model for learning favoured by philosophers, physicists, and computer scientists. But supposing ≠ knowing. Real statistical inference is a protracted negotiation between the client, the statistician, and the datapool. The dataset is that subset of the datapool that the statistician feels able to model. For alarming facts about global crop yields, see Yield trends are insufficient to double global crop production by 2050.
10 Feb. Frequentist inference. The Frequentist is cautious, and does not commit to a single PMF for the random quantities of interest, but rather to a family of PMFs, indexed by a parameter typically denoted θ. For any given member of the family he can update beliefs using the observations, but he needs an additional principle to deal with the width of the parameter space. Plugging in the maximum likelihood (ML) estimate of θ is very common. But actually finding the value of this estimate can be challenging in applications where the parameter space is not tiny.
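As a minimal sketch of ML estimation (not from the lectures; the data and the grid search are hypothetical), here is the Bernoulli case, where the grid search recovers the familiar closed-form answer, the sample mean:

```python
import math

def log_likelihood(theta, y):
    # Bernoulli log-likelihood for observations y in {0, 1}
    return sum(yi * math.log(theta) + (1 - yi) * math.log(1 - theta) for yi in y)

y = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical data

# crude grid search over the parameter space (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
theta_ml = max(grid, key=lambda t: log_likelihood(t, y))
print(theta_ml)  # the sample mean, 6/8 = 0.75
```

In a one-dimensional parameter space a grid search is fine; the point made in the lecture is that this brute-force approach fails once the parameter space has more than a few dimensions.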
11 Feb. Bayesian inference (naive view). The Bayesian adduces a prior distribution (PDF) πθ and then is able to proceed within the axioms of expectation. Contrasting the two approaches, they both need to add something to the basic set-up of a model and a parameter:
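As a sketch of the Bayesian side of this contrast (a hypothetical Bernoulli example with a flat prior on a grid, not from the lectures), the prior is the 'something added', and the posterior follows from the axioms:

```python
import math

y = [1, 0, 1, 1, 0, 1, 1, 1]        # hypothetical Bernoulli observations
grid = [i / 1000 for i in range(1, 1000)]

def likelihood(theta):
    return math.prod(theta if yi else (1 - theta) for yi in y)

prior = [1.0 for _ in grid]          # flat prior on the grid
unnorm = [p * likelihood(t) for p, t in zip(prior, grid)]
z = sum(unnorm)                      # normalising constant
posterior = [u / z for u in unnorm]

post_mean = sum(t * p for t, p in zip(grid, posterior))
print(post_mean)  # close to 0.7, the Beta(7, 3) mean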
17 Feb. The Stable estimation theorem. L.J. Savage identified simple conditions under which the normalised likelihood function approximates the posterior distribution. These conditions, which can be subjectively assessed by the statistician and client, permit the substitution of a simple prior distribution for a more considered one. The dominant condition is that the likelihood function is highly concentrated, which typically arises when the number of observations is far larger than the number of parameters. Have another read of the comments in section 4.6 of the notes.
18 Feb. Introduction to decision theory. A formal treatment of decisions based on a client's action set and loss function: the Bayes action minimises the expected loss. More complicated decisions involving observables; the Bayes Rule. The interpretation of the loss function as the expected loss supposing X = x. The difficulty of specifying a loss function; the role of the statistician as the client's 'critical friend'. Here is our article about donkeys.
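A minimal sketch of the Bayes action (the action set, losses, and probabilities here are hypothetical, invented for illustration):

```python
# Hypothetical decision problem: X takes values in {0, 1, 2} with these probabilities
p = {0: 0.5, 1: 0.3, 2: 0.2}

actions = ["do nothing", "repair", "replace"]

# loss[a][x]: hypothetical loss of action a supposing X = x
loss = {
    "do nothing": {0: 0, 1: 10, 2: 100},
    "repair":     {0: 2, 1: 2,  2: 40},
    "replace":    {0: 8, 1: 8,  2: 8},
}

def expected_loss(a):
    return sum(p[x] * loss[a][x] for x in p)

bayes_action = min(actions, key=expected_loss)
print(bayes_action, expected_loss(bayes_action))
```

The hard part in practice is not this minimisation but eliciting the loss table, which is where the statistician's role as 'critical friend' comes in.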
19 Feb. The Bayes Rule theorem. This amazing theorem states that the very challenging task of finding a Bayes rule for a decision problem simplifies to minimising expected loss conditional on the observables. It holds because of the FTP, the defining property of conditional expectation, and chronology.
24 Feb. Prediction as a decision problem. A prediction is a point value for a random quantity. But predictions are used in different ways, and should themselves be derived according to some notion of consequences, ie using a loss function. A generic prediction has a convex and symmetric loss function, which is usually well-approximated by a quadratic loss function. Under quadratic loss, the optimal prediction is the expectation (and the optimal prediction rule is the hypothetical expectation).
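The claim that the expectation is optimal under quadratic loss can be checked numerically (hypothetical PMF; a grid search stands in for the calculus):

```python
# PMF of a hypothetical random quantity X
p = {0: 0.2, 1: 0.5, 3: 0.3}
mean = sum(x * px for x, px in p.items())   # E(X) = 1.4

def expected_quadratic_loss(d):
    return sum(px * (x - d) ** 2 for x, px in p.items())

# grid search for the optimal point prediction
grid = [i / 100 for i in range(0, 301)]
d_star = min(grid, key=expected_quadratic_loss)
print(d_star, mean)  # the minimiser coincides with the expectation
```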
25 Feb. The Netflix prize. Why will computer science people sometimes outperform statisticians in prediction problems? The Netflix challenge shows why: with massive datasets a computer scientist can duplicate the conditions of the competition itself, on the understanding that the organisers have randomly-sampled the training data from the total available data. Whether this is a good idea is another matter: Netflix have ignored the issue of sample bias. See this excellent article by Tim Harford.
26 Feb. Admissible rules. An inadmissible rule is a rule which is dominated by another rule: using an inadmissible rule is an embarrassing mistake. It is easy to check that, informally, if a rule is a Bayes rule for some prior distribution for θ then it is admissible. Wald's theorem asserts the converse: that all admissible rules are either Bayes rules or the improper limit of Bayes rules. The maximum likelihood estimator is often the improper limit of Bayes rules, but Stein's paradox (aka Stein's bombshell!) showed that, even in rather simple situations, it is not admissible for quadratic loss. Here's a helpful article on Stein's paradox, including the proof.
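Stein's paradox is easy to see by simulation. A sketch (the dimension, true mean, and seed are arbitrary choices; the positive-part James-Stein shrinkage factor is used):

```python
import random
random.seed(1)

p = 10                               # dimension (Stein's result needs p >= 3)
theta = [1.0] * p                    # hypothetical true mean vector
n_rep = 2000

risk_ml, risk_js = 0.0, 0.0
for _ in range(n_rep):
    x = [random.gauss(t, 1.0) for t in theta]      # X ~ N(theta, I)
    s = sum(xi * xi for xi in x)
    shrink = max(0.0, 1.0 - (p - 2) / s)           # positive-part James-Stein factor
    js = [shrink * xi for xi in x]
    risk_ml += sum((xi - t) ** 2 for xi, t in zip(x, theta))
    risk_js += sum((ji - t) ** 2 for ji, t in zip(js, theta))

print(risk_ml / n_rep, risk_js / n_rep)  # the James-Stein risk comes out smaller
```

The ML risk is close to p = 10, while shrinking every coordinate towards zero gives a strictly smaller quadratic risk, whatever θ happens to be: the ML estimator is dominated, hence inadmissible.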
3 Mar. A multiplicity of models. How the client can end up with several different models of the same random quantities. Linear and logarithmic pooling, why linear pooling is more conservative. Bayes Model Averaging: after conditioning on the observations, each model updates and the weights update as well. The evidence for each model: how to get a large evidence.
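A minimal sketch of evidence-based weight updating, for two hypothetical Bernoulli models where the evidences have closed forms:

```python
import math

y = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]   # hypothetical data: 8 ones out of 10
n, k = len(y), sum(y)

# Model 1: theta fixed at 0.5.  Evidence = p(y | M1).
ev1 = 0.5 ** n

# Model 2: theta ~ Uniform(0, 1).  Evidence = the Beta(k+1, n-k+1) integral.
ev2 = math.factorial(k) * math.factorial(n - k) / math.factorial(n + 1)

# equal prior weights; posterior weights are proportional to the evidences
w1 = ev1 / (ev1 + ev2)
w2 = ev2 / (ev1 + ev2)
print(w1, w2)  # these data favour the more flexible model
```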
4 Mar. Choosing a single model. Restatement of the Bayes Model Averaging theorem in terms of a random quantity of interest X and an observable Y. Why the client might want to proceed with just a single model. An investigation of why the kneejerk reaction of taking the model with the largest evidence might not be the right thing to do: the best single model depends on the client's decision problem. Two special cases: when one model has a much larger evidence than all the other models combined; when the action set is very small and the actions have clearly different consequences.
5 Mar. Information criteria. What to do if the client is unable or unwilling to provide a loss function? It looks as though we might be able to select models using their evidences (even if this is not a great idea, see above). But there are two further difficulties. First, for the Bayesian statistician the value of the evidence of a model depends on the prior distribution for the model parameters (there is no 'stable estimation' type result in this case, as there was for the posterior distribution). Second, for the Frequentist the whole issue is moot, because he cannot compute the evidence anyhow, but only the evidence for each possible value for θ. Information criteria are designed to plug the gap; to do model selection on the basis of best fit penalised by model complexity. An alternative, for some inferences, is to use Leave-One-Out cross validation. This is n times as expensive. But in special cases, the Akaike Information Criterion (AIC) gives roughly the same answer at one nth of the cost.
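A sketch of AIC as fit-penalised-by-complexity, for two nested Normal models (the observations are hypothetical; AIC here is -2 max log-likelihood + 2 x number of parameters):

```python
import math, statistics

y = [2.1, 1.7, 2.4, 1.9, 2.6, 2.2]   # hypothetical observations

def normal_loglik(y, mu, sigma2):
    n = len(y)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((yi - mu) ** 2 for yi in y) / (2 * sigma2))

n = len(y)
# Model A: mu = 0 fixed, sigma^2 estimated by ML  (1 parameter)
s2_a = sum(yi ** 2 for yi in y) / n
aic_a = -2 * normal_loglik(y, 0.0, s2_a) + 2 * 1

# Model B: mu and sigma^2 both estimated by ML  (2 parameters)
mu_b = statistics.fmean(y)
s2_b = sum((yi - mu_b) ** 2 for yi in y) / n
aic_b = -2 * normal_loglik(y, mu_b, s2_b) + 2 * 2

print(aic_a, aic_b)  # smaller AIC is preferred; here the extra parameter earns its keep
```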
11 Mar. Hypothesis tests. Hypothesis tests: choosing between models defined as disjoint subsets of a parameter space Ω. The need to be able to evaluate p(y; θ), which is not as easy as Statistics 1 and the textbooks lead us to believe. Null and alternative hypotheses; simple and composite hypotheses; typical situations involving hypothesis tests.
12 Mar. Simple/simple hypothesis tests. This is the situation for which we have a complete theory! (If only it occurred more often.) Everyone agrees that the choice between two simple hypotheses should be according to the value of the likelihood ratio. For Bayesians, this follows from Bayes's theorem in odds form. For Frequentists, this is a consequence of the famous Neyman-Pearson Lemma. As for where the cut-off should be chosen, this is a more complicated issue (for Frequentists). The Neyman-Pearson approach, of choosing the cut-off so that the Type 1 error is controlled at some value such as 5%, is obsolete.
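A minimal sketch of the likelihood ratio for two simple hypotheses (the Bernoulli parameter values and data are hypothetical):

```python
import math

# Two simple hypotheses for iid Bernoulli observations (hypothetical values)
theta0, theta1 = 0.5, 0.8
y = [1, 1, 0, 1, 1, 1, 0, 1]

def lik(theta):
    return math.prod(theta if yi else 1 - theta for yi in y)

lr = lik(theta1) / lik(theta0)      # likelihood ratio, H1 against H0
print(lr)

# In odds form: posterior odds = prior odds x likelihood ratio
prior_odds = 1.0                    # equal prior probabilities for H0 and H1
posterior_odds = prior_odds * lr
```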
17 Mar. P-values: basics. The basic principle of P-values, which is that they attempt to measure whether the observations are compatible with a single hypothesis, rather than to show which of two hypotheses is favoured. In the latter case there is no requirement for the observations to be compatible with either hypothesis, so the P-value is definitely asking something new. The modern definition of a P-value (for a simple null hypothesis): a statistic with a subuniform distribution under H0.
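The subuniformity requirement, Pr(P ≤ α; H0) ≤ α for every α, can be checked by simulation. A sketch for a one-sided test of a Normal mean, where the P-value is in fact exactly uniform under H0 (the seed and repetition count are arbitrary):

```python
import random, math
random.seed(0)

def phi(z):
    # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def p_value(y):
    # P-value for H0: mu = 0 against mu > 0, from one N(mu, 1) observation
    return 1 - phi(y)

# Under H0 the P-value must be subuniform: Pr(P <= alpha) <= alpha
n_rep = 20000
pvals = [p_value(random.gauss(0, 1)) for _ in range(n_rep)]
for alpha in (0.01, 0.05, 0.2):
    freq = sum(p <= alpha for p in pvals) / n_rep
    print(alpha, freq)   # freq is close to alpha in this exactly-uniform case
```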
18 Mar. P-value interpretation and fallacies. Why a small P-value makes one suspect that H0 may not be true, but a large P-value does not contain any information. Some key things to understand about P-values:
19 Mar. Constructing P-values. The Probability Integral Transform. Every statistic induces a P-value. The statistic t(y) = c induces a P-value which is completely useless, from which we learn that P-values occupy a spectrum from completely useless to potentially useful. A useful P-value is designed to indicate decision-relevant departures from H0. Formally, if H1 were such a hypothesis, we would require t(Y) under H1 to stochastically dominate t(Y) under H0. Even cartoonists have fun with P-values!
24 Mar. Computing P-values. Three different approaches. In the exact case, there are one or two models and one or two test statistics for which we know exactly what the distribution of the test statistic under the null hypothesis is: e.g. the IID Normal with H0: μ=μ0, and t(y) = y1 + ... + yn. Then there is asymptotic theory, which can provide a null distribution for a particular test statistic, such as Pearson's χ2 test (precisely what this test statistic is sensitive to is more subtle). Then there is a simulation result based on approximating the probability using the Law of Large Numbers.
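The simulation approach can be sketched in a few lines (the data, test statistic, and seed are hypothetical; the +1 adjustment is a common convention, not the only one):

```python
import random, statistics
random.seed(42)

# Observed data and test statistic (hypothetical example)
y_obs = [0.3, 1.9, 1.1, 2.4, 0.8]
def t(y):
    # test statistic: the sample mean
    return statistics.fmean(y)

# H0: Y_i iid N(0, 1).  Estimate Pr(t(Y) >= t(y_obs); H0) by the Law of Large Numbers.
n_sim = 10000
t_obs = t(y_obs)
exceed = sum(t([random.gauss(0, 1) for _ in y_obs]) >= t_obs
             for _ in range(n_sim))
p_hat = (exceed + 1) / (n_sim + 1)   # the +1 guards against reporting p = 0
print(p_hat)
```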
25 Mar. Confidence sets. A confidence set makes a probabilistic statement about random sets in parameter space, in an attempt to quantify lack of knowledge about the parameters. Your reply to the Minister: "Not quite, Minister. (0.74m, 1.33m) is one realisation of a random interval that has probability of at least 95% of containing the true value of sea-level rise, no matter what that true value happens to be". We drew a confidence set on the board; it wasn't easy.
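The 'no matter what that true value happens to be' part can be checked by simulation. A sketch for the standard interval for a Normal mean with known variance (the values of μ tried, and the seed, are arbitrary):

```python
import random, statistics
random.seed(7)

# 95% confidence interval for mu from n iid N(mu, 1) observations:
# ybar +/- 1.96 / sqrt(n).  Check its coverage by simulation.
n, n_rep = 10, 5000
half_width = 1.96 / n ** 0.5

for mu in (0.0, -3.2, 10.0):           # coverage must hold for every mu
    cover = 0
    for _ in range(n_rep):
        ybar = statistics.fmean(random.gauss(mu, 1) for _ in range(n))
        if ybar - half_width <= mu <= ybar + half_width:
            cover += 1
    print(mu, cover / n_rep)           # each should be close to 0.95
```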
26 Mar. Duality between p-values and confidence sets. A p-value for the hypothesis H0: θ = t (for each t ∈ Ω) can be used to construct a level-β confidence set for θ. So there is an infinity of confidence sets (one for each p-value) and these range from the useless to the potentially interesting. The Marginalisation Theorem tells us that we can construct a level-β confidence set for any function of θ from a level-β confidence set for θ. This is how we deal with nuisance parameters.
20 Apr. Confidence sets again. Quick revision; Wilks's theorem, and its use in constructing generic confidence sets with specified coverage, based on level sets of the (log) likelihood function. These are 'approximately exact level β confidence sets', where β is specified, usually β = 0.95.
21 Apr. Bootstrap correction for level error. A confidence set based on conditions which do not hold (e.g. Wilks's confidence set) cannot guarantee coverage of β = 0.95 (or some other value). The difference between the nominal coverage of β and the minimum coverage over θ is termed the level error. A simulation experiment can be used to tune Wilks's confidence set so that its coverage is β at the ML estimate of θ and, one hopes, near to β in the parameter space around the ML estimate.
22 Apr. Application spotlight. Confidence sets for a simple likelihood function in a complicated parameter space: global large-eruption recording rates for stratovolcanoes.
Feedback: Remember that a good answer is compelling. This is about clarity and logic. You must state restrictions on results, and invoke conditions exactly when they are needed. This might seem like showing off, but you should do it anyway. If you do not practise now then it will not come naturally in the exam.
Feedback: We still need more words in order for your answers to be compelling. Remember that you are not having a dialogue with your reader: he cannot come back to you and ask for clarification. It's all got to be there on the page. A ten-mark question in the exam is an opportunity for you to show off: don't do the minimum because parts of your answer will not make much sense unless you have provided some background.
Feedback: No one really got the tedious Q1. The exam-style revision question was generally well-done, with one or two marks being lost per part for failures to be precise, or including information that was not relevant.
Feedback: We covered this homework in two Office Hours. Part (d) in the exam-style revision question was very hard, but it was logical, and marks could have been collected for starting the problem, and for stating what needed to be shown. Also see feedback from HW2: in an exam answer, you must do more than the bare minimum. You must convince your reader that you are totally on top of the question.
Feedback: Two things to be really clear about (mentioned in the answers). The difference between y, which is a value, and Y which is a random quantity. Probability statements that depend on a model for Y must always be indexed with a θ. Some definitions, such as for confidence sets, must hold for all θ in Ω.