Multivariate Analysis (MATH 30510 and MATH M0510)

Lecturer: Yiannis Papastathopoulos, i.papastathopoulos@bristol.ac.uk
Official unit pages: MATH 30510, MATH M0510
Timetable: 1600 Mon, SM4; 1400 Thu, C44; 1000 Fri, C44.
Office hours: 1300 Tue, my office, Rm 3.13, Maths Dept.
Computing office hours: TBA; see Computing below for more information.
Exam dates: TBA; see the comment on the exam below.

Navigation: Course outline, Course details, Code snippets, Homework and assignments.

Course outline

Multivariate analysis concerns measurements of two or more variables collected on many objects. The basic structure in multivariate analysis is the data matrix, denoted X, which consists of n rows and p columns, where n is the number of objects and p is the number of variables. The basic mathematical tool for multivariate analysis is matrix algebra.
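
As a small illustration (this uses R's built-in iris data purely as an example; R itself is discussed under Computing below), the four measurement columns of iris form a data matrix with n = 150 and p = 4:

X <- as.matrix(iris[, 1:4]) # n = 150 objects, p = 4 variables
dim(X)                      # rows = objects, columns = variables
head(X, 3)                  # the first three rows of the data matrix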

The key questions we will address are:

  1. How to summarise and visualise X in terms of objects, variables, and both.
  2. If the objects are divided into distinct clusters, how to construct rules to allocate a new object to a cluster, and how to assess the accuracy of the rule. This is an example of supervised learning.
  3. How to arrange the objects into distinct clusters, and to assess the relationship between clusters. This is an example of unsupervised learning.

Textbooks

The main textbook is Mardia, Kent, and Bibby, Multivariate Analysis (Academic Press, 1979), available on Amazon. This is widely regarded among statisticians as one of the great textbooks. Unfortunately it is expensive, but if you are thinking about doing any statistics after you graduate then it has to be on your bookshelf, if only for the reference material on matrix algebra in the appendix. There are lots of other books to browse at classmark QA278 in the library in the Queen's Building and also the Arts and Social Sciences Library. In particular, you might find [reference] a useful introduction.

My only qualm about Mardia et al. is that I prefer to treat the singular value decomposition (SVD) as the primitive decomposition of a matrix, a view that is adopted by [reference] (see chapter 2). In this course the SVD will be taken as given.

Prerequisites

I am going to assume that you are familiar with matrix algebra. For references, you may want to consult [reference] (this is where I go for references if I don't have Mardia et al. handy). If you are really struggling, then [reference] has an appropriate title, at least!

Level M students should be aware that being able to use R is a prerequisite for the coursework assessment, which counts 20% towards the final mark.

Computing

Computing is a vital part of multivariate analysis, because the matrix algebra calculations required are far too complicated to do by hand, and the pictures would be tedious to draw. Throughout the course, examples will be presented using the statistical computing environment R, which you will have been introduced to in your 1st year, and which you would have used in Statistics 2, if you took that course. You should download this software onto your computer, as you will need it for your homework (or else you will need to do your homework in one of the University's computer labs).

Computing office hours are available if you would like guidance in using R. Attendance is not obligatory, but you should go if you have questions about the use of R in the homeworks.

Code snippets for the examples are available below.

Handouts

There are many textbooks on multivariate data analysis in the University Library, and also excellent resources online. Acquiring and synthesising information is a valuable transferable skill, and here is an opportunity to practise it. Having said that, there will be handouts on topics that are less well covered in the standard textbooks. An initial handout will cover notation and concepts.

Comment on the exam

The exam comprises three questions, of which your best two count. In my experience it is much better to pick out your best two questions at the start of the exam, and focus on these, than to try all three. If you adopt the latter strategy, all of the time you put into your weakest question is wasted.

Previous exam papers are available on Blackboard.

Answers to previous exam papers will not be made available. The exam is designed to assess whether you have attended the lectures, read and thought about your lecture notes and the handouts, done the homework, and read a bit more widely in the textbooks. Diligent students who have done the above will gain no relative benefit from studying the answers to previous exam questions. On the other hand, less diligent students may suffer the illusion that they will do well in the exam, when probably they will not.

Course details

Here is a lecture-by-lecture summary of the course.

Introduction, concepts

  1. The data frame and the data matrix, notation, types of variable, recoding the data frame as quantitative variables, parallel coordinate plots. Handout Code snippet (PCP) Code snippet (recoding)
  2. The sample mean vector and variance matrix, the correlation matrix, centering and rescaling (brief recap on means, variances, and linear algebra). Code snippet
  3. Dealing with missing values, imputation: mean value, last value carried forwards (LVCF). There is useful material in the early parts of this chapter from what I think is the first edition of A. Gelman et al, Bayesian Data Analysis, CRC Press. Code snippet

Dimensional reduction and visualisation

  1. The SVD of the data matrix, the spectral decomposition of the sample variance matrix. Handout Code snippet
  2. Principal component analysis (PCA). (Lecture cancelled due to AGM) Code snippet
  3. More on principal components analysis: scree plots, selecting important components, interpreting the loadings.
  4. Biplots: the principal components biplot of Gabriel (1971). Handout
  5. R chill zone on biplots. Code snippet

Linear Discriminant Analysis (LDA)

  1. Discriminant analysis and facial recognition.
  2. Cluster analysis and discriminant analysis, overview. Geometry of discriminant analysis.
  3. Fisher's linear discriminant rule. Handout
  4. Assessing your classifier, R chill zone on LDA. Code snippet

Cluster analysis

There is some excellent material in [reference].

  1. Dissimilarity measures and dissimilarity matrices. Code snippet
  2. Hierarchical clustering methods 1. Handout
  3. Hierarchical clustering methods 2, R chill zone on hierarchical clustering. Code snippet

Multidimensional scaling (MDS)

  1. Multidimensional scaling: starting with a dissimilarity matrix. Handout
  2. More on MDS, mention of Procrustes transformation. Code snippet
  3. Blue-sky lecture: non-linear dimensional reduction, and ISOMAP. Read the original paper. A simple explanation of Floyd's algorithm (a minimal sketch follows below).
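
For the curious, here is a minimal sketch of Floyd's all-pairs shortest-path update, which ISOMAP applies to distances on the neighbourhood graph. The function name floydAPSP and the tiny three-node example are my own, for illustration only.

#### Floyd's algorithm: all-pairs shortest paths (illustrative sketch)

## D is a square matrix of edge lengths, with 0 on the diagonal and Inf
## where there is no direct edge

floydAPSP <- function(D) {
  n <- nrow(D)
  for (k in 1:n)
    for (i in 1:n)
      for (j in 1:n)
        D[i, j] <- min(D[i, j], D[i, k] + D[k, j])
  D
}

## tiny example: a path graph 1 -- 2 -- 3

D <- matrix(c(0, 1, Inf,
              1, 0, 1,
              Inf, 1, 0), 3, 3, byrow = TRUE)
print(floydAPSP(D)) # the [1, 3] entry becomes 2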

That's it!

Code snippets

These can be cut-and-pasted into your own files.
  1. Recoding factors
  2. Parallel coordinate plot
  3. Basic R functions
  4. Simple approaches to imputation
  5. The SVD
  6. Looking at the prcomp function
  7. Biplots
  8. Linear discriminant analysis
  9. Distances
  10. Clustering
  11. Multidimensional scaling
  12. Procrustes transformation

Recoding factors

#### recode factors, some examples

## ## This snippet just illustrative: you will need to adapt it to
## ## your problem -- it will not run as written.

## Let X be a dataframe, and NAME be a factor: do str(X) to see which
## columns are factors

## ## to recode (ordered) factor NAME as integers

x <- as.numeric(X$NAME)
X$RECODENAME <- x

## obviously you will want to replace NAME and RECODENAME with your
## own variable names.  If you then want to delete NAME,
## then X$NAME <- NULL

## ## to recode (ordered) factor NAME with specified values for the levels

x <- X$NAME
vals <- c(1.0, 1.2, 1.5, 2.7, 4.5) # say, should be one for each level
stopifnot(length(vals) == length(levels(x)))
x <- vals[as.numeric(x)]
X$RECODENAME <- x

## ## to recode an unordered factor with k levels as k binary variables

x <- X$NAME
Y <- matrix(0, length(x), length(levels(x))) # n by k
Y[cbind(seq(along = x), as.numeric(x))] <- 1
colnames(Y) <- paste("NAME", "is", levels(x), sep = ".")
Y <- as.data.frame(Y)
X <- cbind(X, Y) # add new variables to the end

Parallel coordinate plot

#### parallel coordinate plot

library(MASS) # has parcoord function, do ?parcoord for details

## have a look at the famous iris data set, do ?iris for details

parcoord(iris[, 1:4], col = 1 + as.numeric(iris$Species),
         var.label = TRUE, main = "Parallel coordinate plot, Iris data")

legend(0.4, 1, legend = levels(iris$Species), text.col = 1 + 1:3,
       title = "Species", title.col = 1, lty = 0, pch = NA, bty = "n", xpd = NA)

Basic R functions

#### example of basic R functions

## load up the MKB exam data

X <- read.table("http://www.maths.leeds.ac.uk/~charles/mva-data/openclosedbook.dat",
                       header = TRUE)
X <- as.matrix(X)

## compute sample mean and variance

n <- nrow(X)
p <- ncol(X)
xbar <- colMeans(X)
S <- var(X) * (n-1) / n # use n in denominator
s <- sqrt(diag(S))

## centering and scaling

HX <- sweep(X, 2, xbar, "-") # centered

rr <- apply(X, 2, range)
rr <- rr[2, ] - rr[1, ]
iD <- diag(1 / rr, p)
Y <- X %*% iD # rescaled by variable ranges

iD <- diag(1/s, p)
Y <- X %*% iD # rescaled by variable standard deviations

Y <- HX %*% iD # both centered and rescaled

R <- crossprod(Y) / n # sample correlation matrix, but easier to compute:
R <- cor(X)

Simple approaches to imputation

#### approaches to imputation

## let's use some real data, but remove some values

X <- read.table("http://www.maths.leeds.ac.uk/~charles/mva-data/openclosedbook.dat",
                       header = TRUE)

## 10 missing NO values

set.seed(101) # change this for different missing values
X$NO[sample(1:nrow(X), 10)] <- NA

## ## Mean imputation; in each case Y will be our imputed X

Y <- X
ybar <- colMeans(Y, na.rm = TRUE) # only computed for non-missing values

y <- Y$NO
y[is.na(y)] <- ybar["NO"]
Y$NO <- y

## ## random imputation

Y <- X
y <- Y$NO
y[is.na(y)] <- sample(y[!is.na(y)], sum(is.na(y)), replace = TRUE)
Y$NO <- y

## ## last value carried forward (LVCF)

Y <- X
y <- Y$NO

## let's order the objects by the mean mark in the other subjects

omean <- rowMeans(Y[, -match("NO", names(Y))], na.rm = TRUE)
oo <- order(omean)
y <- y[oo] # permute y to put similar objects together

## here's a little function

LVCF <- function(x) {
  miss <- which(is.na(x))
  for (i in miss) {
    if (i > 1)
      x[i] <- x[i-1]
  }
  x
}

y <- LVCF(y)
y[oo] <- y # undo the reordering
Y$NO <- y

## little picture

pairs(Y, col = ifelse(is.na(X$NO), "red", "darkgrey"), # set colours
  xlim = c(0, 100), ylim = c(0, 100))

The SVD

#### SVD of HX and the spectral decomposition of S

## load and process the data, as before

X <- read.table("http://www.maths.leeds.ac.uk/~charles/mva-data/openclosedbook.dat",
                       header = TRUE)
X <- as.matrix(X)
n <- nrow(X)
p <- ncol(X)
xbar <- colMeans(X)
S <- var(X) * (n-1) / n # use n in denominator

HX <- sweep(X, 2, xbar, "-") # centered

## now compute the SVD

udv <- svd(HX)
str(udv) # to see the components

lambda <- udv$d^2 / n
Lambda <- diag(lambda, p)
Gamma <- udv$v

## spectral decomposition is S = Gamma %*% Lambda %*% t(Gamma).
## We could have computed this directly from S using 'eigen'

spd <- eigen(S, symmetric = TRUE) # can also compute directly from S

all.equal.numeric(spd$values, lambda, check.attributes = FALSE) # TRUE

print(spd$vectors) # same up to the signs of the columns
print(Gamma)       #

## can compute determinants with 'det'

dd <- det(S)

## but much better to get them from the SVD

logdd1 <- -p * log(n) + 2 * sum(log(udv$d))

all.equal.numeric(dd, exp(logdd1)) # TRUE

Looking at the prcomp function

#### looking at R functions

## Some R functions cannot easily be inspected.  For example, try

prcomp
#= function (x, ...)
#= UseMethod("prcomp")
#= <environment: namespace:stats>

## (output shown after #='s).  This means that to see all the versions
## of prcomp you need to type

methods(prcomp)
#= [1] prcomp.default* prcomp.formula*
#= 
#=    Non-visible functions are asterisked

## prcomp.default is the one that we want, but it is 'non-visible'.
## To see its definition, you need to note that it is in the 'stats'
## namespace, and then do

stats:::prcomp.default

## and then you can see its definition.  If you inspect this definition
## you will see that there are options for centering (TRUE) and rescaling
## (FALSE) the X matrix, and then it's just the SVD with some repackaging.

## You should use 'prcomp' to do your principal components analysis.  But,
## just so you can see the structure, here is a stripped-down version in 
## our notation.

myprcomp <- function(X) {

  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  HX <- scale(X, scale = FALSE) # centre
  udv <- svd(HX)

  ## return same structure as prcomp.default

  robj <- list(sdev = udv$d / sqrt(n),
               rotation = udv$v,
               x = udv$u %*% diag(udv$d, p),
               center = TRUE,
               scale = FALSE)
  class(robj) <- "prcomp"
  robj
}

## If you test this out you will see that the sdev values are slightly
## different compared to the R 'prcomp' function: I prefer to use 1/n rather
## than 1/(n-1) in the denominator of the sample variance matrix.

## The R 'prcomp' function is better at handling the case where HX 
## does not have full column rank, and does some extra naming of the outputs.
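
## As a quick check of that difference (an illustrative sketch, re-loading
## the exam data used in 'The SVD' snippet): the two conventions differ by
## a factor of sqrt((n-1)/n).

X <- read.table("http://www.maths.leeds.ac.uk/~charles/mva-data/openclosedbook.dat",
                       header = TRUE)
n <- nrow(X)
all.equal.numeric(myprcomp(X)$sdev,
                  prcomp(X)$sdev * sqrt((n - 1) / n),
                  check.attributes = FALSE) # TRUE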

Biplots

I've put the mybiplot code into a separate file, mybiplot.R. Here is a demonstration of how it is used.
## mybiplot with some formatting, using the USArrests dataset

source("http://www.maths.bris.ac.uk/~ip12483/multivariate/mybiplot.R")

upop <- USArrests$UrbanPop
qq <- round(quantile(upop, c(0, 0.25, 0.5, 0.75, 1)))
upop <- cut(upop, breaks = qq, include.lowest = TRUE)
upopcol <- colorRampPalette(c("forestgreen", "brown"))(length(levels(upop)))

par(mar = c(1, 1, 3, 1), cex.main = 1.25)

mybiplot(prcomp(USArrests[, c("Murder", "Assault", "Rape")]),
         asp = 1, axes = FALSE, bty = "n",
         xlab = "", ylab = "", main = "US Arrests",
         type = "labels", col.obj = upopcol[as.numeric(upop)],
         expand.arrow = 1.75, col.arrow = "black", lwd.arrow = 2,
         cex.arrow = 1.1, font.arrow = 2)

legend("topright", fill = upopcol,
       legend = levels(upop), title = "% urban population",
       bty = "n")

## here's how to save the plot as a file

dev.print(jpeg, file = "USArrestsBiplot.jpg", width = 600) # as a jpeg
dev.print(pdf, file = "USArrestsBiplot.pdf") # as a pdf

The example should look like this:

Image of US Arrests biplot

Linear discriminant analysis

This code snippet is a bit long for the webpage, so you can download it as a file. Then load it into R using source("mylda.R"), where you might need to give the path to the file (or else source it directly from the web address, as in the snippet below). After you have sourced this file, you can try out the following code, which will produce the picture below (and other things as well).
#### try out a famous dataset using mylda

source("http://www.maths.bris.ac.uk/~ip12483/multivariate/mylda.R")

## here is the raw data as a pairs plot

X <- iris[, 1:4]
species <- iris$Species

colpalette <- 1 + seq(along = levels(species))
col <- colpalette[as.numeric(species)] # a species colour for each object

pairs(X, col = col, pch = 16)

## do the LDA
  
foo <- mylda(X, groups = species)
print(foo$a)

## can see that LDA is putting 'a' mainly along the diagonal of
## Petal.Length and Petal.Width, which seems sensible!

## plot the data in terms of first two canonical variables
  
Y <- as.matrix(X)
plot(Y %*% foo$loadings[, 1:2], col = col,
     pch = 16,
     xlab = "First canonical variable",
     ylab = "Second canonical variable",
     main = "First two canonical variables",
     bty = "n")
legend("topleft", legend = levels(species), pch = 16, lty = 0,
       col = colpallete, bty = "n", title = "Species")

## add on the linear discrimination rule

ybar <- drop(foo$means %*% foo$a)
midp <- sort(ybar)
midp <- (midp[-1] + midp[-length(ybar)]) / 2
abline(v = midp, lty = 2)
ybottom <- par("usr")[3]
points(ybar, rep(ybottom, length(ybar)), pch = 16,
  col = colpalette, cex = 1.5, xpd = NA)

dev.print(jpeg, file = "irisLDA.jpg", width = 600) # as a jpeg

## Pretty good separation into species.  Can see that 2 virginicas
## are misclassified as versicolor, and three versicolors as
## virginica.

## the cross-validation by jack-knifing takes a little longer

foo <- mylda(X, groups = species, CV = TRUE)
print(foo$CVtable)
print(round(100 * foo$CVtable / rowSums(foo$CVtable))) # as %age by group

## example of classification; hold out large sample of 40

set.seed(201) # change this for a different random sample
rsam <- sort(sample(1:nrow(X), 40))
foo <- mylda(X[-rsam, ], groups = species[-rsam])
class <- classify.mylda(foo, X[rsam, ])
print(cbind(true = species[rsam], class))

Image of Iris LDA

Distances

R provides several different ways of computing the dissimilarity matrix starting from a data matrix. The dist function offers Euclidean, Manhattan, Canberra, and Jaccard distances (among others); Jaccard is referred to in dist as 'binary' (aka 'asymmetric binary'). For 'binary' the only distinction is between zero and non-zero values, so the data matrix can be converted to presence/absence with a threshold by doing

threshold <- 5 # say
Y <- as.matrix(X) >= threshold
dd <- dist(Y, method = "binary")

The output from 'dist' is an object of class 'dist'. This is in fact a vector of the n(n-1)/2 pairwise dissimilarities, which will print as a lower-triangular matrix. If you definitely want the matrix, do as.matrix(dd), which gives an n × n symmetric matrix.
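
For example (a small sketch, using the first few rows of the iris measurements just to have something concrete; the names W and d0 are only for illustration):

W <- iris[1:4, 1:4]                        # four objects, four variables
d0 <- dist(W)                              # Euclidean by default
print(d0)                                  # prints as a lower triangle
print(as.matrix(d0))                       # the full 4 x 4 symmetric matrix
length(d0) == nrow(W) * (nrow(W) - 1) / 2  # TRUE: n(n-1)/2 dissimilarities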

If you have a data matrix with a mixture of quantitative variables and factors, then the daisy function in the package cluster may be helpful. Remember, though, that an 'automatic' dissimilarity such as 'metric = "gower"' may not be what you need for your particular research objective.
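
Here is a sketch of how daisy might be called on a data frame with both quantitative variables and a factor (again using iris, with Species as the factor; substitute your own data frame):

library(cluster) # for daisy

Z <- iris[1:10, ]                   # four quantitative variables plus one factor
dg <- daisy(Z, metric = "gower")    # a 'dissimilarity' object (inherits from 'dist')
summary(dg)
D <- as.matrix(dg)                  # if you want the full matrix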

R does not seem to have an explicit function for Hellinger distance, also called 'chordal distance' or 'Bhattacharyya distance'. This is the distance between two vectors of non-negative proportions that sum to one. However, it is easy to construct one from the dist function:

hellinger <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X)
  stopifnot(is.numeric(X), X >= 0, X <= 1)
  ss <- rowSums(X)
  sumto1 <- all.equal.numeric(ss, rep(1, n), check.attributes = FALSE)
  if (!identical(sumto1, TRUE))
    warning("Some rows sum to less than 1")
  dist(sqrt(X))
}
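
And a quick usage sketch, with a made-up matrix of proportions (three objects, four categories, rows summing to one):

P <- matrix(c(0.20, 0.30, 0.10, 0.40,
              0.25, 0.25, 0.25, 0.25,
              0.10, 0.10, 0.40, 0.40), 3, 4, byrow = TRUE)
hellinger(P) # a 'dist' object with 3 * 2 / 2 = 3 entries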

Clustering

You can find information about clustering in R using the CRAN clustering task view. There are lots of ways to do clustering, and these snippets just show one.

You will need to load the cluster package, and then the function agnes does agglomerative hierarchical clustering.

As a simple example, consider the dissimilarity matrix of white-toothed shrews from Mardia, Kent, and Bibby (1979, p 386). You can download this as a file. Here is some code to draw dendrograms for three different measures of between-group dissimilarity.

source("http://www.maths.bris.ac.uk/~ip12483/multivariate/shrewDist.R")
print(shrew)

library(cluster)

par(mfrow = c(2, 2), oma = c(0, 0, 3, 0), cex.main = 1) # divide panel into four

sh1 <- agnes(shrew, method = "single")
plot(sh1, which = 2, main = "Single linkage", xlab = "Location")

sh2 <- agnes(shrew, method = "complete")
plot(sh2, which = 2, main = "Complete linkage", xlab = "Location")

sh3 <- agnes(shrew, method = "average")
plot(sh3, which = 2, main = "Average linkage", xlab = "Location")

title(main = "Dendrograms, shrew data", outer = TRUE, cex.main = 1.5)

dev.print(jpeg, file = "shrewDend.jpg", width = 600) # as a jpeg

Dendrogram of shrew data

For more control over the dendrogram, you will want to use plot.dendrogram explicitly. If you have a slightly out-of-date installation of R, as I do, then you will need to turn the output from agnes into a hclust object, and then into a dendrogram object. Here is the adapted example from the agnes helpfile.

library(cluster)
data(votes.repub)
agn1 <- agnes(votes.repub, metric = "manhattan", stand = TRUE)
hh1 <- as.hclust(agn1)
par(cex = 0.8)
dd1 <- as.dendrogram(hh1)
plot(dd1, main = "Dendrogram, % Republican votes, 1856-1976", edge.root = TRUE)

dev.print(jpeg, file = "repubDend.jpg", width = 600) # as a jpeg

This will do the clustering and create the dendrogram in the handout:

Dendrogram of votes.repub data

As you will see from these examples and the helpfile, agnes can operate with either the original data matrix, or directly with a dissimilarity matrix. In the former case it computes a dissimilarity matrix of type specified by the 'metric' argument.
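
For instance (an illustrative sketch using the USArrests data; the two calls should give the same dendrogram heights):

library(cluster)

W <- USArrests
ag1 <- agnes(W, metric = "euclidean", method = "average") # agnes computes the distances
ag2 <- agnes(dist(W), method = "average")                 # we supply the distances
all.equal(ag1$height, ag2$height) # should be TRUE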

You will want to access the partitions from your clustering directly, or at least some of them. The function cutree operates on an hclust object such as hh1, and allows you to extract either the partition with a specified number of groups, or the partition obtained by cutting the dendrogram at a specified height. Here is a code snippet that also includes the output:

> cutree(hh1, k = 10) # 10 groups
       Alabama         Alaska        Arizona       Arkansas     California 
             1              2              3              1              3 
      Colorado    Connecticut       Delaware        Florida        Georgia 
             3              3              3              4              1 
        Hawaii          Idaho       Illinois        Indiana           Iowa 
             5              3              3              3              3 
        Kansas       Kentucky      Louisiana          Maine       Maryland 
             3              3              6              3              3 
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
             7              3              3              8              3 
       Montana       Nebraska         Nevada  New Hampshire     New Jersey 
             3              3              3              3              3 
    New Mexico       New York North Carolina   North Dakota           Ohio 
             3              3              4              3              3 
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
             4              3              3              7              9 
  South Dakota      Tennessee          Texas           Utah        Vermont 
             3              4              5              3             10 
      Virginia     Washington  West Virginia      Wisconsin        Wyoming 
             4              3              3              3              3 

> cutree(hh1, h = 40) # partition at height 40
       Alabama         Alaska        Arizona       Arkansas     California 
             1              2              3              1              3 
      Colorado    Connecticut       Delaware        Florida        Georgia 
             3              3              3              3              1 
        Hawaii          Idaho       Illinois        Indiana           Iowa 
             4              3              3              3              3 
        Kansas       Kentucky      Louisiana          Maine       Maryland 
             3              3              1              3              3 
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
             3              3              3              5              3 
       Montana       Nebraska         Nevada  New Hampshire     New Jersey 
             3              3              3              3              3 
    New Mexico       New York North Carolina   North Dakota           Ohio 
             3              3              3              3              3 
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
             3              3              3              3              5 
  South Dakota      Tennessee          Texas           Utah        Vermont 
             3              3              4              3              2 
      Virginia     Washington  West Virginia      Wisconsin        Wyoming 
             3              3              3              3              3 
For more fun, you can identify these on the dendrogram itself, using rect.hclust; try
rect.hclust(hh1, k = 10, border = c("red", "blue"))

This function has some useful arguments (such as 'which', 'h', and 'border') that allow you to highlight particular clusters.

There are a number of ways of assessing how good your dendrogram is. The agglomerative coefficient is printed in the plots of the shrew data, above. Another useful measure is the cophenetic correlation, which is the correlation between the dissimilarities and the heights at which each pair of objects becomes joined in the dendrogram.

coph <- cophenetic(hh1) # cophenetic distances (a 'dist' object)
print(cor(coph, agn1$diss)) # cophenetic correlation, 0.849
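
If you want the agglomerative coefficient directly, rather than reading it off the banner plot, it is stored in the agnes object (here agn1 from the votes.repub example above):

print(agn1$ac) # agglomerative coefficient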

Multidimensional scaling

Here is a little code snippet for the shrew data, also given in the handout. The R function for 'classical' (or 'metric') MDS is cmdscale. Also see Procrustes transformation for hand-calculating MDS, including assessing whether the dissimilarity matrix is Euclidean.
source("http://www.maths.bris.ac.uk/~ip12483/multivariate/shrewDist.R")
mm <- cmdscale(shrew, eig = TRUE)
print(mm$GOF[1]) # alpha_2, 90%

par(mar = c(1, 1, 4, 4), cex.main = 1)
plot(mm$points, pch = 16,
  main = "MDS, shrew data", xlab = "", ylab = "", axes = FALSE, asp = 1)
text(mm$points, row.names(mm$points), pos = 4, cex = 0.8, xpd = NA)
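
As a rough check of how well the two-dimensional configuration reproduces the original dissimilarities (assuming shrew is the dissimilarity matrix loaded above), you can correlate the fitted interpoint distances with the dissimilarities:

print(cor(as.dist(as.matrix(shrew)), dist(mm$points)))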

Procrustes transformation

Procrustes transformations are discussed in Question sheet 6. You can download my procrustes function as a file. Here's a simple example.
X <- as.matrix(USArrests[, c(1, 2, 4)])
D <- as.matrix(dist(X)) # Euclidean distance
n <- nrow(D)
H <- diag(1, n) - matrix(1, n, n) / n # centering matrix
B <- H %*% (-D^2 / 2) %*% H

spd <- eigen(B, symmetric = TRUE)
all(spd$values >= -1e-8) # TRUE, so D is Euclidean (not surprising!)
pev <- spd$values >= 1e-8 * abs(spd$values[1])
k <- sum(pev) # 3 (not surprising, since X has 3 variables)

Z <- spd$vectors[, pev, drop=FALSE] %*% diag(sqrt(spd$values[pev]))
alpha <- spd$values[pev] / sum(abs(spd$values))
print(alpha) # alpha[1] = 0.99

#### there is NO reason for Z and X to agree!  But we ought to be able to
#### transform Z to X using translations and rotations/reflections, because
#### D is Euclidean.

source("http://www.maths.bris.ac.uk/~ip12483/multivariate/procrustes.R")

Zhat <- procrustes(X, Z)
attr(Zhat, "R2") # zero: shows we have got an exact match
all.equal.numeric(X, Zhat, check.attributes = FALSE) # TRUE - excellent!

Homework, assignments

There will be a homework every week, set on the Monday and due back a week later. Either hand in your homework after the Monday lecture, or leave it in the box outside my office door (before 5.30pm).

Two of the homeworks will be assessed for the course credit points, probably homeworks 2 and 5. You are strongly encouraged to do the homeworks and to hand in your efforts, to be commented on and marked.

Level 4 students will have two additional assignments, which will count towards their final mark: see below.

Week 12: Question sheet 1 (Answers)

Feedback: These matrix questions were generally well done. There were one or two 'howlers':
  • AB = 0 does not imply A = 0 or B = 0. To prove that A A^T = I, we note that since A^T A = I, then A^T = A^(-1). From this we get that A A^(-1) = I = A A^T.
  • |a|, where a is a scalar, does not behave in the same way as |A|, where A is a square matrix and |A| = det(A).
  • Π_i λ_i = Π_i λ_i^2 does not imply that each λ_i is 0 or 1. Take, for example, λ_1 = 1/λ_2 with λ_1 different from 0 or 1.

Week 15: Question sheet 2 (Answers)

Feedback: Not much to say here that is not addressed in the answers. Have a look at the answer to Q3; note that the mahalanobis function, although tempting, is not the right function for us (see the help-page).

Week 16: Question sheet 4 (Answers)
Week 17 (assessed): Question sheet 5
Week 18: Question sheet 6

Level 4 Assignments

Each of these two assignments counts for 10 percent of your final mark. The work you hand in must be your own. If you are in any doubt about this, or about the consequences of plagiarism, you should consult the University's guidelines.
  1. First assignment
  2. Second assignment

1. Abernethy forest data

This dataset comprises proportions by species of the pollen extracted from sediments in Abernethy forest, Scotland, indexed by depth. The data is in the rioja package in R, and the help-page is online. For simplicity, you can download the data here, and then load it into R using

source("aber.R")
(you may need to specify the folder where you saved the file).

You will see that the dataset aber comprises 49 rows, indexed by depth (cm), and 36 columns, indexed by species. Thus aber[1, 1] is 15.11, which indicates that at depth 300cm, the proportion of species BETULA was 15.11 percent. The rows sum to almost 100: some very minor species have not been recorded. Remember that you can inspect the structure of aber using the str and head commands.

For this assignment, you should:

  1. Perform a principal component analysis of the dataset, and present your results as a scree plot. (3 marks)
  2. Plot the important principal components by depth, justifying your choice of which principal components are important. Use the resulting plot to divide the depths into three periods: Late, Middle, and Early, and indicate this division on your plot. (3 marks)
  3. Redo the principal components analysis, retaining only those species which are most prevalent. Now represent the results as a biplot, using colour to indicate the period. Interpret your results. (4 marks) (I do not expect there will be more than ten prevalent species, at the most.)
You should make your plots as informative as possible.

Your report should comprise a document containing the plots and the analysis, and an appendix containing your code. It should be handed in at the reception desk of the Maths Department. Because this is a summative assessment, your document will not be returned, and no 'answers' will be given. However, general feedback will be available once the assignment has been marked.


Feedback: Some general feedback on your first assignments. Where marks were lost it was sometimes because the question had not been correctly answered (eg some 'scree plots' were not scree plots).

But it was mainly because the resulting plots and text were not as effective as they might have been in communicating the information in the dataset. Your plot must be as clear as possible, and this requires you to be critical of your first attempt, and to adjust things such as scale, points or lines, point type or line type, colours, font size, legend placement and scale, title, axis titles, and so on. You also need to choose exactly how much information to include: too much information in your plot can obscure the message. Likewise, too much text can also obscure the message.

2. More on the Abernethy forest data

Two more analyses using the Abernethy forest data.
  1. First, repeat the principal components analysis of assignment 1 using Hellinger distance rather than Euclidean distance. (5 marks)

    This will involve the following stages:

    1. Rescale the (49×36) data matrix so that the rows sum to one (this is only a minor change), call this X. Then compute the Hellinger dissimilarity matrix for the 49 depths.
    2. Check that this dissimilarity matrix is Euclidean, and then apply Multidimensional Scaling (MDS) to construct a configuration of points consistent with the dissimilarity matrix, call this Z. Report the number of variables in Z and produce a table of the proportion of the dissimilarity matrix explained by each of the variables in Z.
    3. Plot the first three principal components of Z against depth (depth on the horizontal axis). Comment briefly on any differences you observe between the first three principal components computed using Euclidean distance, and the first three using Hellinger distance.

  2. Second, do a cluster analysis of the data in R-mode, in which the species are the objects and the depths are the variables. (5 marks)

    This will involve the following stages:

    1. Compute the Jaccard dissimilarity matrix for the species, using a threshold of 0.5% to distinguish between absence and presence at each depth.
    2. Perform a cluster analysis for three different measures of group dissimilarity: single linkage, group average, and complete linkage. For each one, compute the cophenetic correlation coefficient, and report the results.
    3. Plot the dendrogram of the cluster analysis with the largest cophenetic correlation coefficient.

As before, you should make your plots as informative as possible. Please see the general feedback on the first assignment.

Your report should comprise a document containing the plots and the analysis, and an appendix containing your code. It should be handed in at the reception desk of the Maths Department. Because this is a summative assessment, your document will not be returned, and no 'answers' will be given. However, general feedback will be available once the assignment has been marked.
