PaintMyChromosomes.com
fineSTRUCTURE v2 & GLOBETROTTER

© 2012 Daniel Lawson.

2 How to use this software

This software attempts to automate as much of the processing pipeline as possible. You need to start with phased data, as output by a tool such as SHAPEIT, BEAGLE or IMPUTE2. Conversion scripts are provided for each of these and are described in Section 11. We don’t want to make any recommendations, but most people use SHAPEIT; the others may be better depending on your circumstances.

Running FineSTRUCTURE for small datasets is now extremely easy. If you have a modest number of both individuals (fewer than around 200) and SNPs (100K), you can run the whole pipeline on a single machine (exploiting multiple cores, if you have them). Running the entire pipeline could be as simple as:


Listing 1: Simple example
> fs example.cp -idfile data.ids -phasefiles data.phase -recombfiles data.recombfile -go

where we have specified 5 things:

  • example.cp: This is the file where the results and intermediate quantities are stored. A directory called ‘example’ will be created to store intermediate files.
  • -idfile data.ids: This defines the names of each individual in the data, one per row.
  • -phasefiles data.phase: This contains the data in PHASE format. Different files can be given for different chromosomes, e.g. -phasefiles chr1.phase chr2.phase.
  • -recombfiles data.recombfile: This contains the linkage information about the genetic distance between the SNPs specified in the phase data.
  • -go: fs will figure out what needs to be done and in what order. It will then (in this example) run the entire pipeline, including ensuring that the MCMC has been run long enough.
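For data split by chromosome, the command line can get long. The following is a minimal sketch that only assembles and prints such a command (a dry run); it does not call fs. The file names chrN.phase and chrN.recombfile are hypothetical, and the sketch assumes that -recombfiles accepts a list of files in the same order as -phasefiles, which you should confirm against the option reference before use.

```shell
#!/bin/sh
# Assemble (and print, not run) an fs command line for several chromosomes.
# Assumptions: input files are named chrN.phase / chrN.recombfile (hypothetical),
# and -recombfiles takes one file per phase file, in the same order.
build_fs_command() {
    phase=""
    recomb=""
    for chr in "$@"; do
        phase="$phase chr${chr}.phase"
        recomb="$recomb chr${chr}.recombfile"
    done
    echo "fs example.cp -idfile data.ids -phasefiles$phase -recombfiles$recomb -go"
}

# Example: print the command for chromosomes 1 and 2.
build_fs_command 1 2
```

Printing the command first makes it easy to inspect the file lists before committing to a full run.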

Converting or writing idfiles, phase files and recombination files is described in Section 4.1, with conversion scripts in Section 11.
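Since the idfile names one individual per row, a quick sanity check before running the pipeline is to look for repeated names. This is a small sketch, not part of the fs distribution; it assumes the individual's name is the first whitespace-separated field on each row, and the function name duplicate_ids is made up for illustration.

```shell
#!/bin/sh
# Print every individual name that occurs more than once in an id file.
# Assumption: the name is the first whitespace-separated field of each row.
duplicate_ids() {
    awk 'NF > 0 { print $1 }' "$1" | sort | uniq -d
}

# Usage (prints nothing if all names are unique):
#   duplicate_ids data.ids
```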

Running FineSTRUCTURE for larger datasets is more difficult, because we assume that users will want to exploit High Performance Computing (HPC) resources. We therefore split the computation into a number of stages, each of which can be run on a cluster. At each stage, a text file is generated containing the commands to run, one per line, and fs reports the location of this file. The process becomes:


Listing 2: HPC example
> fs example.cp -idfile data.ids -phasefiles data.phase -recombfiles data.recombfile -hpc 1 -go
> qsub_run.sh -f example_commandfile1.txt # and wait for it to execute
> fs example.cp -go
> qsub_run.sh -f example_commandfile2.txt # and wait for it to execute
> fs example.cp -go
> qsub_run.sh -f example_commandfile3.txt # and wait for it to execute
> fs example.cp -go
> qsub_run.sh -f example_commandfile4.txt # and wait for it to execute
> fs example.cp -go
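The alternating pattern in Listing 2 (run a command file, then rerun fs) can be scripted. The sketch below is not a replacement for a cluster submission script: it runs each line of a command file serially on the local machine, which is only useful for trying out the workflow. The function names are made up for illustration, and the four stages and the command file naming are taken from the listing above rather than from any guarantee about what fs will generate for your data.

```shell
#!/bin/sh
# Run each non-empty line of a command file as its own shell command,
# one after another, stopping at the first failure.
run_commandfile() {
    while IFS= read -r cmd; do
        [ -n "$cmd" ] || continue
        # </dev/null stops a command from consuming the rest of the file
        sh -c "$cmd" < /dev/null || return 1
    done < "$1"
}

# Alternate between running a command file and rerunning fs.
# $1: the settings file given in the first fs call
# $2: the command file prefix (files are assumed to be PREFIX1.txt .. PREFIX4.txt)
# The first fs call (with -hpc 1 -go) must already have been made.
run_all_stages() {
    settings=$1
    prefix=$2
    for stage in 1 2 3 4; do
        run_commandfile "${prefix}${stage}.txt" || return 1
        fs "$settings" -go || return 1
    done
}
```

On a real cluster the call to run_commandfile would be replaced by your submission step, followed by waiting for all jobs to finish before fs is rerun.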

Because there are things that can go wrong in each processing step, and rerunning has an overhead in this approach, it is more important to get the parameters right in advance in HPC mode. See ‘Potential pitfalls’ (Section 12) to get these right first time.

You are STRONGLY ENCOURAGED to go through the provided example, to get a feeling for how this works in practice, to see how to set various important parameters, and to cover some basic problems that you might encounter.