Codebase list srst2 / HEAD example.txt
HEAD

Tree @HEAD (Download .tar.gz)

example.txt @HEADraw · history · blame

This example uses real public data from three Shigella sonnei genomes, two of which were 
sequenced twice and so allow for reproducibility testing. 

As Shigella sonnei is a sublineage of E. coli, we must use the E. coli MLST scheme. 

We will also use the resistance genes database provided with srst2, resistance.fasta 
(modified from ResFinder).

1. Download the files for the E. coli MLST scheme, using the script provided:

getmlst.py --species "Escherichia coli#1"

  For SRST2, remember to check what separator is being used in this allele database

  Looks like --mlst_delimiter '-'

  >adk-1  --> -->   ('adk', '-', '1')
  
  Suggested srst2 command for use with this MLST database:

    srst2 --output test --input_pe *.fastq.gz --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --mlst_delimiter '-'
  

Note, this is correctly guessing that we should use the default --mlst_delimiter '-' with
this database. The log file will tell you exactly what files were downloaded.

Check that the allele sequences have been downloaded and compiled into a single fasta file:
  Escherichia_coli#1.fasta

Check that the ST definitions have been downloaded:
  ecoli.txt


2. Download Illumina paired end read sets from the ENA. Note if you have limited download
capacity or time to spare, you could download the first two files only (total ~30MB) for 
basic testing of srst2.

# strain 20081885 from ERP000182 (14M, 13M; 51M, 44M)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024070/ERR024070_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024070/ERR024070_2.fastq.gz

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028678/ERR028678_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028678/ERR028678_2.fastq.gz

# strain 20031275 from ERP000182 (89M, 94M; 88M, 74M)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024082/ERR024082_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024082/ERR024082_2.fastq.gz

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028690/ERR028690_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028690/ERR028690_2.fastq.gz

# strain IB694 from ERP000182 (80M, 64M)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024619/ERR024619_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024619/ERR024619_2.fastq.gz


3. Try running MLST and gene detection against a tiny read set with low coverage (~3x):

srst2 --input_pe ERR024070*.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta

Note that, as the databases have not been indexed for bowtie2 yet, that srst2 has to run
bowtie2-build which spits out a lot of messages to stdout. This is not a problem!

Bowtie2 also reports its mapping stats, it should indicate that there were only 
400,882 reads in this data set, of which 0.02% mapped to our MLST/resistance loci.
This probably won't be enough to get good quality information.

Check the log file (shigella1.log) to see what happened.

Outputs are printed to:
	shigella1__mlst__Escherichia_coli#1__results.txt - mlst results only
	shigella1__fullgenes__ARGannot__results.txt - resistance results, one line per gene
	shigella1__genes__ARGannot__results.txt - resistance results only, tabulated
	shigella1__compiledResults.txt - MLST and resistance genes, tabulated
	shigella1__ERR024070.Escherichia_coli#1.scores - full score & alignment info for all MLST alleles
	shigella1__ERR024070.ARGannot.scores - full score & alignment info for resistance genes with >90% coverage
	
Because we ran with the --save_scores flag, we have also got a scores file for each database, 
which details the score and mapping information for every allele in that database.
	
The ST was not called correctly because depth was too low (average read depth 2.7; this 
is printed in the MLST result table.

2 resistance genes were detected with >90% coverage (the default threshold), which look 
like variants of ampC and strB (marked with * to indicate they were not precise matches, 
and ? to indicate uncertainty because some bases weren't covered). 

This shows us that this level of data (which was considered a failed sequencing run) isn't 
enough to get good results. Luckily this genome was resequenced; see what happens when you analyse 
that data set.


4. Try running MLST and gene detection against a proper read set, and compile together with
the results from the poor read set for comparison.

srst2 --input_pe ERR028678*.fastq.gz --output shigella2 --log --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta --prev_output shigella1__compiledResults.txt

Check the log file to see what happened.

Outputs are printed to:
	shigella2__mlst__Escherichia_coli#1__results.txt - mlst results only
	shigella2__fullgenes__ARGannot__results.txt - resistance results, one line per gene
	shigella2__genes__ARGannot__results.txt - resistance results only, tabulated
	shigella2__compiledResults.txt - MLST and resistance genes, tabulated

This time there is an ST called, which should look like this:

Sample  ST      adk     fumC    gyrB    icd     mdh     purA    recA    mismatches      uncertainty     depth   maxMAF
ERR028678  152?    11?     63      7?      1       14      7       7       0       adk-11/edge2.0;gyrB-7/edge2.0   13.9595714286   0.125

Note that the ST has a question mark (?) after it, indicating that we have some uncertainty about the call. 
The table shows this uncertainty comes from two loci, adk and gyrB, which are themselves labelled with question marks.
The 'uncertainty' column indicates that the problem is a drop-off in depth at the edge of these loci, down to 2x, which is the default minimum to report a potential uncertainty (--min_edge_depth 2).

There are also lots of resistance genes identified.

Because we supplied a prior results file, shigella2__compiledResults.txt contains the new
results as well as the old results.


5. Now try running on the other three read sets:

srst2 --input_pe ERR024082*.fastq.gz ERR028690*.fastq.gz ERR024619*.fastq.gz --output shigella3 --log --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta


6. Now compile the results from all 5 read sets:

srst2 --output all --prev_output shigella2__compiledResults.txt shigella3__compiledResults.txt

This outputs a file, all__compiledResults.txt, containing the compilation of all results
for MLST and resistance genes across the 5 strains.