Codebase list art-nextgen-simulation-tools / 7b993e2a-50b0-47df-acd2-434a07cb9a85/main art_SOLiD_README
7b993e2a-50b0-47df-acd2-434a07cb9a85/main

Tree @7b993e2a-50b0-47df-acd2-434a07cb9a85/main (Download .tar.gz)

art_SOLiD_README @7b993e2a-50b0-47df-acd2-434a07cb9a85/mainraw · history · blame

ART_SOLiD (version 1.3.3) README (last updated on 03/19/2015) Weichun Huang at whduke@gmail.com                              

DESCRIPTION:

	ART_SOLiD  is a simulation program to generate sequence read data of SOLiD sequencing reads.
       	ART generates reads according to a SOLiD read error profile. The built-in error profile is an
       	empiricial error profile summarized from large SOLiD sequencing data. ART has been using for 
	testing or benchmarking a variety of method or tools for next-generation sequencing data analysis,
       	including read alignment, de novo assembly, detection of SNP, CNV, or other structure variation.
       	
	art_SOLiD can generate both single-end, matepair, and paired-end of SOLiD sequencing platform.
       	art_SOLiD also support amplicon sequencing simulation with RNA references. Its outputs include 
	FASTQ read, MAP alignment, and optional SAM alignment files. The map2bed.pl tool included can
       	convert ART MAP alignment files to a UCSC BED file 	

COMPILATION AND INSTALLATION:

	PREREQUISITES: 
			1) GNU g++ 4.0 or above (http://gcc.gnu.org/install) 
			2) GNU gsl library (http://www.gnu.org/s/gsl/) 

	COMPILATION & INSTALLATION

		1) add gsl library installation directory to compiler's search path

	       	If your GSL library is not installed in system standard search path, your GSL installation 
		directory need to be added to gcc/g++ complier FLAGS. For example, gsl is typically installed
	       	in /opt/local directory in MacOS X. You can add the directory to the search path of gcc/g++ 
		complier by run the following command: 
		
		export CFLAGS="$CFLAGS -I/opt/local/include" CPPFLAGS="$CPPFLAGS -I/opt/local/include" LDFLAGS="$LDFLAGS -L/opt/local/lib"

		2) compile and install 

		./configure --prefix=$HOME
	       	make
	       	make install

EXAMPLES

	In the "examples" subdirectory, the shell script "run_test_examples_SOLiD.sh" gives examples of using
       	art_SOLiD for read simulation.  To test these two examples, just run the script "run_test_examples_SOLiD.sh"

USAGES

	SINGLE-END (F3 READ) SIMULATION
		art_SOLiD [ options ] <INPUT_SEQ_FILE> <OUTPUT_FILE_PREFIX> <LEN_READ> <FOLD_COVERAGE>
	
	MATE-PAIR READS (F3-R3 PAIR) SIMULATION
		art_SOLiD [ options ] <INPUT_SEQ_FILE> <OUTPUT_FILE_PREFIX> <LEN_READ> <FOLD_COVERAGE> <MEAN_FRAG_LEN> <STD_DEV>
	
	PAIRED-END READS (F3-F5 PAIR) SIMULATION
		art_SOLiD [ options ] <INPUT_SEQ_FILE> <OUTPUT_FILE_PREFIX> <LEN_READ_F3> <LEN_READ_F5> <FOLD_COVERAGE> <MEAN_FRAG_LEN> <STD_DEV>
	
	AMPLICON SEQUENCING SIMULATION
		art_SOLiD [ options ] -A s <INPUT_SEQ_FILE> <OUTPUT_FILE_PREFIX> <LEN_READ> <READS_PER_AMPLICON>
		art_SOLiD [ options ] -A m <INPUT_SEQ_FILE> <OUTPUT_FILE_PREFIX> <LEN_READ> <READ_PAIRS_PER_AMPLICON>
		art_SOLiD [ options ] -A p <INPUT_SEQ_FILE> <OUTPUT_FILE_PREFIX> <LEN_READ_F3> <LEN_READ_F5> <READ_PAIRS_PER_AMPLICON>
	
	===== PARAMETERS =====
	
	INPUT_SEQ_FILE            -  filename of DNA/RNA reference sequences in FASTA format
	OUTPUT_FILE_PREFIX        -  prefix or directory for all output data files
	FOLD_COVERAGE             -  fold of read coverage over the reference sequences 
	LEN_READ                  -  length of F3/R3 reads
	LEN_READ_F3               -  length of F3 reads for paired-end read simulation
	LEN_READ_F5               -  length of F5 reads for paired-end read simulation
	READS_PER_AMPLICON        -  number of reads per amplicon
	READ_PAIRS_PER_AMPLICON   -  number of read pairs per amplicon
	MEAN_FRAG_LEN             -  mean DNA/RNA fragment size for matepair/paired-end read simulation
	STD_DEV                   -  standard deviation of the DNA/RNA fragment sizes for matepair/paired-end read simulation
	
	===== OPTIONAL PARAMETERS =====
	
	-A specify the read type for amplicon sequencing simulation (s:single-end, m: matepair, p: paired-end)
	-M indicate to use CIGAR 'M' instead of '=/X' for alignment match/mismatch
	-s indicate to generate a SAM alignment file
	-r specify the random seed for the simulation
	-f specify the scale factor adjusting error rate (e.g., -f 0 for zero-error rate simulation)
	-p specify user's own read profile for simulation
	
	===== EXAMPLES =====
       	1) singl-end 25bp reads simulation at 10X coverage
		art_SOLiD -s seq_reference.fa ./outdir/single_dat 25 10
        2) singl-end 75bp reads simulation at 20X coverage with users' error profile
       		art_SOLiD -s -p ../SOLiD_profiles/pseudo_profile ./seq_reference.fa ./dat_userProfile 75 20
	3) matepair 35bp (F3-R3) reads simulation at 20X coverage with DNA MEAN fragment size 2000bp and STD 50
		art_SOLiD -s seq_reference.fa ./outdir/matepair_dat 35 20 2000 50
        4) matepair reads simulation with a fixed random seed
		art_SOLiD -r 777 -s seq_reference.fa ./outdir/matepair_fs 50 10 1500 50
	5) paired-end reads (75bp F3, 35bp F5) simulation with the MEAN fragment size 250 and STD 10 at 20X coverage
		art_SOLiD -s seq_reference.fa ./outdir/paired_dat 75 35 50 250 10
        6) amplicon sequencing with 25bp single-end reads at 100 reads per amplicon
		art_SOLiD -A s -s amp_reference.fa ./outdir/amp_single 25 100
	7) amplicon sequencing with 50bp matepair reads at 80 read pairs per amplicon
		art_SOLiD -A m -s amp_reference.fa ./outdir/amp_matepair 50 80
        8) amplicon sequencing with paired-end reads (35bp F3, 25bp F5 reads) at 50 pairs per amplicon
		art_SOLiD -A p -s amp_reference.fa ./outdir/amp_pair 35 25 50
	 
OUTPUT DATA FILES

	*.fq   - FASTQ read data files. For matepair/paired-end read simulation, *_R3.fq/*_F5.fq contains data of the first reads, and *_F3.fq for the second reads.
	*.map  - read alignment files. For matepair read simulation, *_R3/*_F5.map has read alignments for the first reads and *_F3.map for the second reads.
	*.sam  - SAM read alignment files 

	FASTQ file format 

		A FASTQ file contains both sequence bases and quality scores of sequencing reads and is in the following format:  

			@read_id 
			sequence_read 
			+ 
			base_quality_scores 
	
		A base quality score is coded by the ASCII code of a single character, where the quality score is equal to ASCII code of the
       		character minus 33.    

		Example: 
		@1_1_1_F3
	       	T10121011322000311302213102311132
	       	+
	       	163=+48./48<347//,=/84-4)77(''-)
	       	@1_1_2_F3
	       	T01213000012110232021321303011332
	       	+
	       	1>7<-653?01:55./.%3+'61-+40(*)#'

	MAP file format 
		A MAP file has a Header and main Body parts. The header part includes the command used to generate this file and reference
	       	sequence id and length. The header @CM tag for command line, and @SQ for reference sequence.  A header always starts with 
		"##ART" and ends with  "##Header End".  ART MAP alignment files can be converted to a UCSC BED file by the map2bed.pl tool.

		HEADER EXAMPLE

		##ART_SOLiD     read_length     32
		@CM     ../../bin/MacOS64/art_SOLiD -s ./testSeq.fa ./single_SOLiD_test 32 10 
		@SQ     seq1    7207
	       	@SQ     seq2    3056
		##Header End

		The body part of a MAP file is a tab-delimited text alignment data where each line is the mapping information record of a read. The format of each line
		is below:
		
		ref_seq_id	read_id		ref_map_pos	ref_seq_strand	num_sequencing_errors	err_pos_1  WrongCorrectColor_1	err_pos_2  WrongCorrectColor_2	...	
	
		ref_map_pos - the alignment start position of reference sequence. ref_map_pos is always relative to the strand of reference
       		sequence. That is, ref_map_pos 10 in the plus (+) strand is different from ref_map_pos 10 in the minus (‐) stand.  

		num_sequencing_errors - number of sequencing call errors

		err_pos_n - the zero-indexed read position of the n_th call error
		WrongCorrectColor_n  - the n_th call error representing by two digits number with the 1st number the wrong called color and the 2nd the correct color

		Example: 
		
		seq1	1_1_1_F3	6028	+	4	18	31	22	03	25	32	27	23

	SAM format

       		SAM is a standard format for next-gen sequencing read alignments. The details of the format and examples are available at the links below:	
		1) http://samtools.sourceforge.net/SAM1.pdf
	       	2) http://genome.sph.umich.edu/wiki/SAM	

		Read sequences in a SAM file are in the regular DNA base space. As one change in SOLiD reads in the color space can result multiple changes in base space,
	        so not all mismatches in base space between reads and its reference sequence are of SOLiD sequencing errors.

	BED format

		See the format at UCSC http://genome.ucsc.edu/FAQ/FAQformat.html#format1

		NOTE: both MAP and BED format files use 0-based coordinate system while SAM format uses 1-based coordinate system.

ERROR PROFILE FILE FORMAT

	An ART SOLiD read error profile has the following five columns:
		1) read position
	       	2) correct base
	       	3) wrong base called 
		4) probability of calling the wrong base in the 1st read
	       	5) probability of calling the wrong base in the 2nd read

	and an error profile end with "*end_of_profile".

	Please see "profile_default" in the folder SOLiD_profiles for a real example