| Lecturer | Yiannis Papastathopoulos, i.papastathopoulos@bristol.ac.uk |
|---|---|
| Official unit pages | MATH 30510, MATH M0510 |
| Timetable | 1600 Mon, SM4; 1400 Thu, C44; 1000 Fri, C44. |
| Office hours | 13.00 Tue, my office, Rm3.13 Maths Dept |
| Computing office hours | TBA. More information below. |
| Exam dates | TBA. See the comment on the exam below. |
The key questions we will address are:
The main textbook is Mardia, Kent, and Bibby (1979), Multivariate Analysis (Academic Press).
My only qualm about Mardia et al. is that I prefer to treat the singular value decomposition (SVD) as the primitive decomposition of a matrix, a view that is adopted by
Level M students should be aware that being able to use R is a prerequisite for the coursework assessment, which counts 20% towards the final mark.
Computing is a vital part of multivariate analysis: the matrix algebra calculations required are far too complicated to do by hand, and the pictures would be tedious to draw. Throughout the course, examples will be presented using the statistical computing environment R, which you were introduced to in your first year, and which you will have used in Statistics 2, if you took that course. You should download this software onto your computer, as you will need it for your homework (or else you will need to do your homework in one of the University's computer labs).
Computing office hours are available if you would like guidance in using R. Attendance is not obligatory, but you should go if you have questions about the use of R in the homeworks.
Code snippets for the examples are available below.
Previous exam papers are available on Blackboard.
Answers to previous exam papers will not be made available. The exam is designed to assess whether you have attended the lectures, read and thought about your lecture notes and the handouts, done the homework, and read a bit more widely in the textbooks. Diligent students who have done the above will gain no relative benefit from studying the answers to previous exam questions. On the other hand, less diligent students may suffer the illusion that they will do well in the exam, when probably they will not.
Here is a lecture-by-lecture summary of the course.
That's it!
#### recode factors, some examples
## ## This snippet is just illustrative: you will need to adapt it to
## ## your problem -- it will not run as written.
## Let X be a dataframe, and NAME be a factor: do str(X) to see which
## columns are factors
## ## to recode (ordered) factor NAME as integers
x <- as.numeric(X$NAME)
X$RECODENAME <- x
## obviously you will want to replace NAME and RECODENAME with your
## own variable names. If you then want to delete NAME,
## then X$NAME <- NULL
## ## to recode (ordered) factor NAME with specified values for the levels
x <- X$NAME
vals <- c(1.0, 1.2, 1.5, 2.7, 4.5) # one value for each level, say
stopifnot(length(vals) == length(levels(x)))
x <- vals[as.numeric(x)]
X$RECODENAME <- x
## ## to recode an unordered factor with k levels as k binary variables
x <- X$NAME
Y <- matrix(0, length(x), length(levels(x))) # n by k
Y[cbind(seq(along = x), as.numeric(x))] <- 1
colnames(Y) <- paste("NAME", "is", levels(x), sep = ".")
Y <- as.data.frame(Y)
X <- cbind(X, Y) # add new variables to the end
#### parallel coordinate plot
library(MASS) # has parcoord function, do ?parcoord for details
## have a look at the famous iris data set, do ?iris for details
parcoord(iris[, 1:4], col = 1 + as.numeric(iris$Species),
var.label = TRUE, main = "Parallel coordinate plot, Iris data")
legend(0.4, 1, legend = levels(iris$Species), text.col = 1 + 1:3,
title = "Species", title.col = 1, lty = 0, pch = NA, bty = "n", xpd = NA)
#### example of basic R functions
## load up the MKB exam data
X <- read.table("http://www.maths.leeds.ac.uk/~charles/mva-data/openclosedbook.dat",
header = TRUE)
X <- as.matrix(X)
## compute sample mean and variance
n <- nrow(X)
p <- ncol(X)
xbar <- colMeans(X)
S <- var(X) * (n-1) / n # use n in denominator
s <- sqrt(diag(S))
## centering and scaling
HX <- sweep(X, 2, xbar, "-") # centered
rr <- apply(X, 2, range)
rr <- rr[2, ] - rr[1, ]
iD <- diag(1 / rr, p)
Y <- X %*% iD # rescaled by variable ranges
iD <- diag(1/s, p)
Y <- X %*% iD # rescaled by variable standard deviations
Y <- HX %*% iD # both centered and rescaled
R <- crossprod(Y) / n # sample correlation matrix, but easier to compute:
R <- cor(X)
#### approaches to imputation
## let's use some real data, but remove some values
X <- read.table("http://www.maths.leeds.ac.uk/~charles/mva-data/openclosedbook.dat",
header = TRUE)
## 10 missing NO values
set.seed(101) # change this for different missing values
X$NO[sample(1:nrow(X), 10)] <- NA
## ## Mean imputation; in each case Y will be our imputed X
Y <- X
ybar <- colMeans(Y, na.rm = TRUE) # only computed for non-missing values
y <- Y$NO
y[is.na(y)] <- ybar["NO"] # ybar, not xbar: the means over non-missing values
Y$NO <- y
## ## random imputation
Y <- X
y <- Y$NO
y[is.na(y)] <- sample(y[!is.na(y)], sum(is.na(y)), replace = TRUE)
Y$NO <- y
## ## last value carried forward (LVCF)
Y <- X
y <- Y$NO
## let's order the objects by the mean mark in the other subjects
omean <- rowMeans(Y[, -match("NO", names(Y))], na.rm = TRUE)
oo <- order(omean)
y <- y[oo] # permute y to put similar objects together
## here's a little function
LVCF <- function(x) {
  miss <- which(is.na(x))
  for (i in miss) {
    if (i > 1)
      x[i] <- x[i-1]
  }
  x
}
y <- LVCF(y)
y[oo] <- y # undo the reordering
Y$NO <- y
## little picture
pairs(Y, col = ifelse(is.na(X$NO), "red", "darkgrey"), # set colours
xlim = c(0, 100), ylim = c(0, 100))
#### SVD of HX and the spectral decomposition of S
## load and process the data, as before
X <- read.table("http://www.maths.leeds.ac.uk/~charles/mva-data/openclosedbook.dat",
header = TRUE)
X <- as.matrix(X)
n <- nrow(X)
p <- ncol(X)
xbar <- colMeans(X)
S <- var(X) * (n-1) / n # use n in denominator
HX <- sweep(X, 2, xbar, "-") # centered
## now compute the SVD
udv <- svd(HX)
str(udv) # to see the components
lambda <- udv$d^2 / n
Lambda <- diag(lambda, p)
Gamma <- udv$v
## spectral decomposition is S = Gamma %*% Lambda %*% t(Gamma).
## We could have computed this directly from S using 'eigen'
spd <- eigen(S, symmetric = TRUE) # can also compute directly from S
all.equal.numeric(spd$values, lambda, check.attributes = FALSE) # TRUE
print(spd$vectors) # same up to the signs of the columns
print(Gamma) #
## can compute determinants with 'det'
dd <- det(S)
## but much better to get them from the SVD
logdd1 <- -p * log(n) + 2 * sum(log(udv$d))
all.equal.numeric(dd, exp(logdd1)) # TRUE
#### looking at R functions
## Some R functions cannot easily be inspected. For example, try
prcomp
#= function (x, ...)
#= UseMethod("prcomp")
#= <environment: namespace:stats>
## (output shown after #='s). This means that to see all the versions
## of prcomp you need to type
methods(prcomp)
#= [1] prcomp.default* prcomp.formula*
#=
#= Non-visible functions are asterisked
## prcomp.default is the one that we want, but it is 'non-visible'.
## To see its definition, you need to note that it is in the 'stats'
## namespace, and then do
stats:::prcomp.default
## and then you can see its definition. If you inspect this definition
## you will see that there are options for centering (TRUE) and rescaling
## (FALSE) the X matrix, and then it's just the SVD with some repackaging.
## You should use 'prcomp' to do your principal components analysis. But,
## just so you can see the structure, here is a stripped-down version in
## our notation.
myprcomp <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  HX <- scale(X, scale = FALSE) # centre
  udv <- svd(HX)
  ## return the same structure as prcomp.default
  robj <- list(sdev = udv$d / sqrt(n),
               rotation = udv$v,
               x = udv$u %*% diag(udv$d, p),
               center = TRUE,
               scale = FALSE)
  class(robj) <- "prcomp"
  robj
}
## If you test this out you will see that the sdev values are slightly
## different compared to the R 'prcomp' function: I prefer to use 1/n rather
## than 1/(n-1) in the denominator of the sample variance matrix.
## The R 'prcomp' function is better at handling the case where HX
## does not have full column rank, and does some extra naming of the outputs.
## mybiplot with some formatting, using the USArrests dataset
source("http://www.maths.bris.ac.uk/~ip12483/multivariate/mybiplot.R")
upop <- USArrests$UrbanPop
qq <- round(quantile(upop, c(0, 0.25, 0.5, 0.75, 1)))
upop <- cut(upop, breaks = qq, include.lowest = TRUE)
upopcol <- colorRampPalette(c("forestgreen", "brown"))(length(levels(upop)))
par(mar = c(1, 1, 3, 1), cex.main = 1.25)
mybiplot(prcomp(USArrests[, c("Murder", "Assault", "Rape")]),
asp = 1, axes = FALSE, bty = "n",
xlab = "", ylab = "", main = "US Arrests",
type = "labels", col.obj = upopcol[as.numeric(upop)],
expand.arrow = 1.75, col.arrow = "black", lwd.arrow = 2,
cex.arrow = 1.1, font.arrow = 2)
legend("topright", fill = upopcol,
legend = levels(upop), title = "% urban population",
bty = "n")
## here's how to save the plot as a file
dev.print(jpeg, file = "USArrestsBiplot.jpg", width = 600) # as a jpeg
dev.print(pdf, file = "USArrestsBiplot.pdf") # as a pdf
The example should produce a biplot like the saved file USArrestsBiplot.jpg.
#### try out a famous dataset using mylda
source("http://www.maths.bris.ac.uk/~ip12483/multivariate/mylda.R")
## here is the raw data as a pairs plot
X <- iris[, 1:4]
species <- iris$Species
colpalette <- 1 + seq(along = levels(species))
col <- colpalette[as.numeric(species)] # a species colour for each object
pairs(X, col = col, pch = 16)
## do the LDA
foo <- mylda(X, groups = species)
print(foo$a)
## can see that LDA is putting 'a' mainly along the diagonal of
## Petal.Length and Petal.Width, which seems sensible!
## plot the data in terms of first two canonical variables
Y <- as.matrix(X)
plot(Y %*% foo$loadings[, 1:2], col = col,
pch = 16,
xlab = "First canonical variable",
ylab = "Second canonical variable",
main = "First two canonical variables",
bty = "n")
legend("topleft", legend = levels(species), pch = 16, lty = 0,
col = colpallete, bty = "n", title = "Species")
## add on the linear discrimination rule
ybar <- drop(foo$means %*% foo$a)
midp <- sort(ybar)
midp <- (midp[-1] + midp[-length(ybar)]) / 2
abline(v = midp, lty = 2)
ybottom <- par("usr")[3]
points(ybar, rep(ybottom, length(ybar)), pch = 16,
col = colpalette, cex = 1.5, xpd = NA)
dev.print(jpeg, file = "irisLDA.jpg", width = 600) # as a jpeg
## Pretty good separation into species. Can see that 2 virginicas
## are misclassified as versicolor, and three versicolors as
## virginica.
## the cross-validation by jack-knifing takes a little longer
foo <- mylda(X, groups = species, CV = TRUE)
print(foo$CVtable)
print(round(100 * foo$CVtable / rowSums(foo$CVtable))) # as %age by group
## example of classification; hold out large sample of 40
set.seed(201) # change this for a different random sample
rsam <- sort(sample(1:nrow(X), 40))
foo <- mylda(X[-rsam, ], groups = species[-rsam])
class <- classify.mylda(foo, X[rsam, ])
print(cbind(true = species[rsam], class))
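## a confusion matrix may be easier to read than the printout above;
## a sketch only: this assumes classify.mylda returns one predicted
## group per held-out row (check the structure of 'class' first)
print(table(true = species[rsam], predicted = class))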
#### binary dissimilarities
threshold <- 5 # say
Y <- as.matrix(X) >= threshold
dd <- dist(Y, method = "binary")
The output from 'dist' is an object of class 'dist'. This is in fact an n(n-1)/2 vector, which will print as a lower-triangular matrix. If you definitely want the matrix, do as.matrix(dd), which is an n × n matrix.
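A quick check on the two representations, re-using dd from above:
print(dd) # prints as a lower triangle
D <- as.matrix(dd) # full n by n symmetric matrix with zero diagonal
dim(D)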
If you have a data matrix with a mixture of quantitative variables and factors, then the daisy function in the package cluster may be helpful. Remember, though, that an 'automatic' dissimilarity such as 'metric = "gower"' may not be what you need for your particular research objective.
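As an illustrative sketch (not a course example): the iris data frame has four numeric variables plus the factor Species, so Gower's dissimilarity applies directly:
library(cluster)
dg <- daisy(iris, metric = "gower") # handles mixed numeric/factor columns
summary(dg)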
R does not seem to have an explicit function for the Hellinger distance, which is closely related to the 'chordal distance' and the 'Bhattacharyya distance'. This is a distance between two vectors of non-negative proportions that sum to one. However, it is easy to construct one from the dist function:
hellinger <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X)
  stopifnot(is.numeric(X), X >= 0, X <= 1)
  ss <- rowSums(X)
  sumto1 <- all.equal.numeric(ss, rep(1, n), check.attributes = FALSE)
  if (!isTRUE(sumto1))
    warning("some rows do not sum to 1")
  dist(sqrt(X)) # note: some authors include an extra factor of 1/sqrt(2)
}
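As a quick check, here are the Hellinger distances between some made-up proportion vectors (the values are illustrative only):
P <- rbind(c(0.2, 0.3, 0.5),
           c(0.1, 0.4, 0.5),
           c(0.6, 0.2, 0.2)) # three objects, each row summing to 1
hellinger(P)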
You can find information about clustering in R using the CRAN clustering task view. There are lots of ways to do clustering, and these snippets just show one.
You will need to load the cluster package, and then the function agnes does agglomerative hierarchical clustering.
As a simple example, consider the dissimilarity matrix of white-toothed shrews from Mardia, Kent, and Bibby (1979, p 386). You can download this as a file. Here is some code to draw dendrograms for three different distance measures on groups.
source("http://www.maths.bris.ac.uk/~ip12483/multivariate/shrewDist.R")
print(shrew)
library(cluster)
par(mfrow = c(2, 2), oma = c(0, 0, 3, 0), cex.main = 1) # divide panel into four
sh1 <- agnes(shrew, method = "single")
plot(sh1, which = 2, main = "Single linkage", xlab = "Location")
sh2 <- agnes(shrew, method = "complete")
plot(sh2, which = 2, main = "Complete linkage", xlab = "Location")
sh3 <- agnes(shrew, method = "average")
plot(sh3, which = 2, main = "Average linkage", xlab = "Location")
title(main = "Dendrograms, shrew data", outer = TRUE, cex.main = 1.5)
dev.print(jpeg, file = "shrewDend.jpg", width = 600) # as a jpeg
For more control over the dendrogram, you will want to use plot.dendrogram explicitly. If you have a slightly out-of-date installation of R, as I do, then you will need to turn the output from agnes into a hclust object, and then into a dendrogram object. Here is the adapted example from the agnes helpfile.
library(cluster)
data(votes.repub)
agn1 <- agnes(votes.repub, metric = "manhattan", stand = TRUE)
hh1 <- as.hclust(agn1)
par(cex = 0.8)
dd1 <- as.dendrogram(hh1)
plot(dd1, main = "Dendrogram, % Republican votes, 1856-1976", edge.root = TRUE)
dev.print(jpeg, file = "repubDend.jpg", width = 600) # as a jpeg
This will do the clustering and create the dendrogram shown in the handout.
As you will see from these examples and the helpfile, agnes can operate with either the original data matrix, or directly with a dissimilarity matrix. In the former case it computes a dissimilarity matrix of type specified by the 'metric' argument.
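To see the two input forms side by side, here is a small sketch using the USArrests data from earlier (both calls use the default average linkage):
library(cluster)
ag1 <- agnes(USArrests, metric = "euclidean") # agnes computes the dissimilarities
ag2 <- agnes(dist(USArrests)) # the same clustering, starting from a 'dist' object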
You will want to access directly the partitions from your clustering, or at least some of them. The function cutree operates on an hclust object such as hh1 and allows you to extract either the partition with a specified number of groups, or the partition with a specified height (in the dendrogram). Here is a code snippet that also includes the output:
> cutree(hh1, k = 10) # 10 groups
Alabama Alaska Arizona Arkansas California
1 2 3 1 3
Colorado Connecticut Delaware Florida Georgia
3 3 3 4 1
Hawaii Idaho Illinois Indiana Iowa
5 3 3 3 3
Kansas Kentucky Louisiana Maine Maryland
3 3 6 3 3
Massachusetts Michigan Minnesota Mississippi Missouri
7 3 3 8 3
Montana Nebraska Nevada New Hampshire New Jersey
3 3 3 3 3
New Mexico New York North Carolina North Dakota Ohio
3 3 4 3 3
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
4 3 3 7 9
South Dakota Tennessee Texas Utah Vermont
3 4 5 3 10
Virginia Washington West Virginia Wisconsin Wyoming
4 3 3 3 3
> cutree(hh1, h = 40) # partition at height 40
Alabama Alaska Arizona Arkansas California
1 2 3 1 3
Colorado Connecticut Delaware Florida Georgia
3 3 3 3 1
Hawaii Idaho Illinois Indiana Iowa
4 3 3 3 3
Kansas Kentucky Louisiana Maine Maryland
3 3 1 3 3
Massachusetts Michigan Minnesota Mississippi Missouri
3 3 3 5 3
Montana Nebraska Nevada New Hampshire New Jersey
3 3 3 3 3
New Mexico New York North Carolina North Dakota Ohio
3 3 3 3 3
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
3 3 3 3 5
South Dakota Tennessee Texas Utah Vermont
3 3 4 3 2
Virginia Washington West Virginia Wisconsin Wyoming
3 3 3 3 3
For more fun, you can identify these on the dendrogram itself,
using rect.hclust; try
rect.hclust(hh1, k = 10, border = c("red", "blue"))
This function has a couple of useful arguments that allow you to highlight certain clusters.
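For example, to outline just the first and fourth of the ten groups (the 'which' argument counts clusters from left to right along the dendrogram):
rect.hclust(hh1, k = 10, which = c(1, 4), border = "red")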
There are a number of ways of assessing how good your dendrogram is. The agglomerative coefficient is printed in the plots of the shrew data, above. Another useful measure is the cophenetic correlation, which is the correlation between the dissimilarities and the heights at which each pair of objects becomes joined in the dendrogram.
coph <- cophenetic(hh1) # cophenetic distances
print(cor(coph, agn1$diss)) # cophenetic correlation, 0.849
source("http://www.maths.bris.ac.uk/~ip12483/multivariate/shrewDist.R")
mm <- cmdscale(shrew, eig = TRUE)
print(mm$GOF[1]) # alpha_2, 90%
par(mar = c(1, 1, 4, 4), cex.main = 1)
plot(mm$points, pch = 16,
main = "MDS, shrew data", xlab = "", ylab = "", axes = FALSE, asp = 1)
text(mm$points, row.names(mm$points), pos = 4, cex = 0.8, xpd = NA)
X <- as.matrix(USArrests[, c(1, 2, 4)])
D <- as.matrix(dist(X)) # Euclidean distance
n <- nrow(D)
H <- diag(1, n) - matrix(1, n, n) / n # centering matrix
B <- H %*% (-D^2 / 2) %*% H
spd <- eigen(B, symmetric = TRUE)
all(spd$values >= -1e-8) # TRUE, so D is Euclidean (not surprising!)
pev <- spd$values >= 1e-8 * abs(spd$values[1])
k <- sum(pev) # 3 (not surprising, since X has 3 variables)
Z <- spd$vectors[, pev, drop=FALSE] %*% diag(sqrt(spd$values[pev]))
alpha <- spd$values[pev] / sum(abs(spd$values))
print(alpha) # alpha[1] = 0.99
#### there is NO reason for Z and X to agree! But we ought to be able to
#### transform Z to X using translations and rotations/reflections, because
#### D is Euclidean.
source("http://www.maths.bris.ac.uk/~ip12483/multivariate/procrustes.R")
Zhat <- procrustes(X, Z)
attr(Zhat, "R2") # zero: shows we have got an exact match
all.equal.numeric(X, Zhat, check.attributes = FALSE) # TRUE - excellent!
Two of the homeworks will be assessed for course credit, probably homeworks 2 and 5. You are strongly encouraged to do all the homeworks and to hand in your efforts, which will be commented on and marked.
Level 4 students will have two additional assignments, which will count towards their final mark: see below.
| Week | Sheet | Answers |
|---|---|---|
| 12 | Question sheet 1 | Answers |
| 15 | Question sheet 2 | Answers |
| 16 | Question sheet 4 | Answers |
| 17 (assessed) | Question sheet 5 | |
| 18 | Question sheet 6 | |

Feedback on sheet 1: These matrix questions were generally well done. There were one or two 'howlers'.

Feedback on sheet 2: Not much to say here that is not addressed in the answers. Have a look at the answer to Q3; note that the mahalanobis function, although tempting, is not the right function for us (see the help-page).
This dataset comprises proportions by species of the pollen extracted from sediments in Abernethy forest, Scotland, indexed by depth. The data is in the rioja package in R, and the help-page is online. For simplicity, you can download the data here, and then load it into R using
source("aber.R")(you may
need to specify the folder where you saved the file).
You will see that the dataset aber comprises 49 rows, indexed by depth (cm), and 36 columns, indexed by species. Thus aber[1, 1] is 15.11, which indicates that at a depth of 300 cm, the proportion of species BETULA was 15.11 percent. The rows sum to almost 100: some very minor species have not been recorded. Remember that you can inspect the structure of aber using the str and head commands.
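For example, some quick checks once the data are loaded (assuming source("aber.R") has created aber):
str(aber, list.len = 5) # 49 rows and 36 columns; show just the first few
head(rownames(aber)) # the depths, in cm
summary(rowSums(aber)) # each row sums to almost 100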
For this assignment, you should:
Your report should comprise a document containing the plots and the analysis, and an appendix containing your code. It should be handed in at the reception desk of the Maths Department. Because this is a summative assessment, your document will not be returned, and no 'answers' will be given. However, general feedback will be available once the assignment has been marked.
Feedback:
Some general feedback on your first assignments.
Where marks were lost, it was sometimes because the question had not been correctly answered (e.g. some 'scree plots' were not scree plots). But it was mainly because the resulting plots and text were not as effective as they might have been at communicating the information in the dataset. Your plot must be as clear as possible, and this requires you to be critical of your first attempt, and to adjust things such as the scale, the choice of points or lines, the point type or line type, colours, font size, legend placement and size, the title, and the axis titles. You must also choose exactly how much information to include: too much information in your plot can obscure the message, and likewise too much text.
This will involve the following stages:
As before, you should make your plots as informative as possible. Please see the general feedback on the first assignment.
Your report should comprise a document containing the plots and the
analysis, and an appendix containing your code. It should be handed
in at the reception desk of the Maths Department.
Because this is a summative assessment, your
document will not be returned, and no 'answers' will be given.
However, general feedback will be available once the assignment has
been marked.