Commit upstream/1.11+dfsg - prokka

Imported Upstream version 1.11+dfsg Andreas Tille 9 years ago

11 changed file(s) with 601 addition(s) and 749 deletion(s). Raw diff Collapse all Expand all

+24

-0

.gitignore less more

	0	tmp/
	1	blib/
	2	.build/
	3	_build/
	4	cover_db/
	5	inc/
	6	Build
	7	!Build/
	8	Build.bat
	9	.last_cover_stats
	10	Makefile
	11	Makefile.old
	12	MANIFEST.bak
	13	META.yml
	14	MYMETA.yml
	15	nytprof.out
	16	pm_to_blib
	17	doc/prokka-*
	18	bug/
	19	db/cm/.i1
	20	db/kingdom//sprot.p
	21	db/hmm/*.h3?
	22	db/genus/.p
	23

+374

-0

README.md less more

	0	# Prokka: rapid prokaryotic genome annotation
	1
	2	Torsten Seemann <torsten.seemann@monash.edu>
	3	Victorian Bioinformatics Consortium, AUSTRALIA <http://vicbioinformatics.com>
	4
	5	##Contents
	6
	7	- [Introduction](#introduction)
	8	- [Installation](#installation)
	9	- [Invoking Prokka](#invoking-prokka)
	10	- [Output Files](#output-files)
	11	- [Command line options](#command-line-options)
	12	- [Dependencies](#dependencies)
	13	- [Databases](#databases)
	14	- [FAQ](#faq)
	15	- [Still To Do](#still-to-do)
	16	- [Changes](#changes)
	17	- [Citation](#citation)
	18
	19	##Introduction
	20
	21	Whole genome annotation is the process of identifying features of interest
	22	in a set of genomic DNA sequences, and labelling them with useful
	23	information. Prokka is a software tool to annotate bacterial, archaeal and
	24	viral genomes quickly and produce standards-compliant output files.
	25
	26	##Installation
	27
	28	###Download
	29
	30	Download the latest `prokka-1.xx.tar.gz` archive from http://www.bioinformatics.net.au/software.prokka.shtml
	31
	32	###Extract
	33
	34	Choose somewhere to put it, for example in your home directory (no root access required):
	35
	36	% cd $HOME
	37	% tar zxvf prokka-1.xx.tar.gz
	38	% ls prokka-1.xx
	39
	40	###Add to PATH
	41
	42	Add the following line to your `$HOME/.bashrc` file,
	43	or to `/etc/profile.d/prokka.sh` to make it available to all users:
	44
	45	export PATH=$PATH:$HOME/prokka-1.xx/bin
	46
	47	###Index the sequence databases
	48
	49	% prokka --setupdb
	50
	51	###Install dependencies
	52
	53	Prokka comes with many binaries for Linux and Mac OS X. It will always use your existing installed versions if they exist, but will use the included ones if that fails. For some older systems (eg. Centos 4.x) some of them won't work due to them being dynamically linked against new GLIBC libraries you don't have.
	54
	55	You can consult the list of dependencies later in this document.
	56
	57	###Choose a rRNA predictor
	58
	59	####Option 1 - Don't use one
	60
	61	If Prokka can't find a predictor for rRNA featues (either Barrnap or RNAmmer below) then it simply won't annotate any. Most people don't care that much about them anyway,
	62
	63	####Option 2 - Barrnap
	64
	65	This was written by the author of Prokka and is recommended if you prefer speed over absolute accuracy. It uses the new multi-core NHMMER for DNA:DNA profile searches. Download it from http://www.vicbioinformatics.com/software.barrnap.shtml
	66
	67	####Option 3 - RNAmmer
	68
	69	RNAmmer was written when HMMER 2.x was the latest release. Since them, HMMER 3.x has been released, and uses the same executable binary names. Prokka needs HMMER3 and RNAmmer (and hence HMMER2) so you need to edit your RNAmmer script to explicitly point your HMMER2 binary instead of using the HMMER3 binary which is more likely to be in your PATH first.
	70
	71	Type which rnammer to find the script, and then edit it with your favourite editor. Find the following lines at the top:
	72
	73	if ( $uname eq "Linux" ) {
	74	# $HMMSEARCH_BINARY = "/usr/cbs/bio/bin/linux64/hmmsearch"; # OLD
	75	$HMMSEARCH_BINARY = "/path/to/my/hmmer-2.3.2/bin/hmmsearch"; # NEW (yours)
	76	}
	77
	78	If you are using Mac OS X, you'll also have to change the `"Linux"` to `"Darwin"` too. As you can see, I have commented out the original part, and replaced it with the location of my HMMER2 hmmsearch tool, so it doesn't run the HMMER3 one. You need to ensure HMMER3 is in your PATH before the old HMMER2 too.
	79
	80	###Test
	81
	82	* Type `prokka` and it should output it's help screen.
	83	* Type `prokka --version` and you should see an output like `prokka 1.x`
	84	* Type `prokka --listdb` and it will show you what databases it has installed to use.
	85
	86
	87	##Invoking Prokka
	88
	89	###Beginner
	90
	91	# Vanilla (but with free toppings)
	92	% prokka contigs.fa
	93
	94	# Look for a folder called PROKKA_yyyymmdd (today's date) and look at stats
	95	% cat PROKKA_yyyymmdd/*.txt
	96
	97	###Moderate
	98
	99	# Choose the names of the output files
	100	% prokka --outdir mydir --prefix mygenome contigs.fa
	101
	102	# Visualize it in Artemis
	103	% art mydir/mygenome.gff
	104
	105	###Expert
	106
	107	# It's not just for bacteria, people
	108	% prokka --kingdom Archaea --outdir mydir --genus Pyrococcus --locustag PYCC
	109
	110	# Search for my favourite gene
	111	% exonerate --bestn 1 zetatoxin.fasta mydir/PYCC_06072012.faa \| less
	112
	113	###Wizard
	114
	115	# Watch and learn
	116	% prokka --outdir mydir --locustag EHEC --proteins NewToxins.faa --evalue 0.001 --gram neg --addgenes contigs.fa
	117
	118	# Check to see if anything went really wrong
	119	% less mydir/EHEC_06072012.err
	120
	121	# Add final details using Sequin
	122	% sequin mydir/EHEC_0607201.sqn
	123
	124	###Genbank submitter
	125
	126	# Register your BioProject and your locus_tag prefix first!
	127	% prokka --compliant --centre UoN --outdir PRJNA123456 --locustag EHEC --prefix EHEC-Chr1 contigs.fa
	128
	129	# Check to see if anything went really wrong
	130	% less PRJNA123456/EHEC-Chr1.err
	131
	132	# Add final details using Sequin
	133	% sequin PRJNA123456/EHEC-Chr1.sqn
	134
	135	###Crazy Person
	136
	137	# No stinking Perl script is going to control me
	138	% prokka \
	139	--outdir $HOME/genomes/Ec_POO247 --force \
	140	--prefix Ec_POO247 --addgenes --locustag ECPOOp \
	141	--increment 10 --gffver 2 --centre CDC --compliant \
	142	--genus Escherichia --species coli --strain POO247 --plasmid pECPOO247 \
	143	--kingdom Bacteria --gcode 11 --usegenus \
	144	--proteins /opt/prokka/db/trusted/Ecocyc-17.6 \
	145	--evalue 1e-9 --rfam \
	146	plasmid-closed.fna
	147
	148
	149	##Output Files
	150
	151	\| Extension \| Description \|
	152	\| --------- \| ----------- \|
	153	\| .gff \| This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV. \|
	154	\| .gbk \| This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence. \|
	155	\| .fna \| Nucleotide FASTA file of the input contig sequences. \|
	156	\| .faa \| Protein FASTA file of the translated CDS sequences. \|
	157	\| .ffn \| Nucleotide FASTA file of all the annotated sequences, not just CDS. \|
	158	\| .sqn \| An ASN1 format "Sequin" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. \|
	159	\| .fsa \| Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. \|
	160	\| .tbl \| Feature Table file, used by "tbl2asn" to create the .sqn file. \|
	161	\| .err \| Unacceptable annotations - the NCBI discrepancy report. \|
	162	\| .log \| Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled. \|
	163	\| .txt \| Statistics relating to the annotated features found. \|
	164
	165	##Command line options
	166
	167	General:
	168	--help This help
	169	--version Print version and exit
	170	--docs Show full manual/documentation
	171	--citation Print citation for referencing Prokka
	172	--quiet No screen output (default OFF)
	173	--debug Debug mode: keep all temporary files (default OFF)
	174	Setup:
	175	--listdb List all configured databases
	176	--setupdb Index all installed databases
	177	--cleandb Remove all database indices
	178	--depends List all software dependencies
	179	Outputs:
	180	--outdir [X] Output folder [auto] (default '')
	181	--force Force overwriting existing output folder (default OFF)
	182	--prefix [X] Filename output prefix [auto] (default '')
	183	--addgenes Add 'gene' features for each 'CDS' feature (default OFF)
	184	--locustag [X] Locus tag prefix (default 'PROKKA')
	185	--increment [N] Locus tag counter increment (default '1')
	186	--gffver [N] GFF version (default '3')
	187	--compliant Force Genbank/ENA/DDJB compliance: --genes --mincontiglen 200 --centre XXX (default OFF)
	188	--centre [X] Sequencing centre ID. (default '')
	189	Organism details:
	190	--genus [X] Genus name (default 'Genus')
	191	--species [X] Species name (default 'species')
	192	--strain [X] Strain name (default 'strain')
	193	--plasmid [X] Plasmid name or identifier (default '')
	194	Annotations:
	195	--kingdom [X] Annotation mode: Archaea\|Bacteria\|Mitochondria\|Viruses (default 'Bacteria')
	196	--gcode [N] Genetic code / Translation table (set if --kingdom is set) (default '0')
	197	--gram [X] Gram: -/neg +/pos (default '')
	198	--usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)
	199	--proteins [X] Fasta file of trusted proteins to first annotate from (default '')
	200	--hmms [X] Trusted HMM to first annotate from (default '')
	201	--metagenome Improve gene predictions for highly fragmented genomes (default OFF)
	202	--rawproduct Do not clean up /product annotation (default OFF)
	203	Computation:
	204	--fast Fast mode - skip CDS /product searching (default OFF)
	205	--cpus [N] Number of CPUs to use [0=all] (default '8')
	206	--mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
	207	--evalue [n.n] Similarity e-value cut-off (default '1e-06')
	208	--rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
	209	--norrna Don't run rRNA search (default OFF)
	210	--notrna Don't run tRNA search (default OFF)
	211	--rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)
	212
	213	##Dependencies
	214
	215	* __GNU Parallel__
	216	A shell tool for executing jobs in parallel using one or more computers
	217	_O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, Feb 2011:42-47._
	218
	219	* __BioPerl__
	220	Used for input/output of various file formats
	221	_Stajich et al, The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct;12(10):1611-8._
	222
	223	* __Aragorn__
	224	Finds transfer RNA features (tRNA)
	225	_Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004 Jan 2;32(1):11-6._
	226
	227	* __Barrnap__
	228	Used to predict ribosomal RNA features (rRNA). My licence-free replacement for RNAmmmer.
	229	_Manuscript under preparation._
	230
	231	* __RNAmmer__
	232	Finds ribosomal RNA features (rRNA)
	233	_Lagesen K et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100-8._
	234
	235	* __Prodigal__
	236	Finds protein-coding features (CDS)
	237	_Hyatt D et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11:119._
	238
	239	* __SignalP__
	240	Finds signal peptide features in CDS (sig_peptide)
	241	_Petersen TN et al. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods. 2011 Sep 29;8(10):785-6._
	242
	243	* __BLAST+__
	244	Used for similarity searching against protein sequence libraries
	245	_Camacho C et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009 Dec 15;10:421._
	246
	247	* __HMMER3__
	248	Used for similarity searching against protein family profiles
	249	_Finn RD et al. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011 Jul;39(Web Server issue):W29-37._
	250
	251	* __Infernal__
	252	Used for similarity searching against ncRNA family profiles
	253	_D. L. Kolbe, S. R. Eddy. Fast Filtering for RNA Homology Search. Bioinformatics, 27:3102-3109, 2011._
	254
	255
	256	##Databases
	257
	258	###The Core (BLAST+) Databases
	259
	260	Prokka uses a variety of databases when trying to assign function to the predicted CDS features. It takes a hierarchial approach to make it fast. A small, core set of well characterized proteins are first searched using BLAST+. This combination of small database and fast search typically completes about 70% of the workload. Then a series of slower but more sensitive HMM databases are searched using HMMER3.
	261
	262	The initial core databases are derived from UniProtKB; there is one per "kingdom" supported. To qualify for inclusion, a protein must be (1) from Bacteria (or Archaea or Viruses); (2) not be "Fragment" entries; and (3) have an evidence level ("PE") of 2 or lower, which corresponds to experimental mRNA or proteomics evidence.
	263
	264	####Making a Core Databases
	265
	266	If you want to modify these core databases, the included script `prokka-uniprot_to_fasta_db`, along with the official `uniprot_sprot.dat`, can be used to generate a new database to put in `/opt/prokka/db/kingdom/`. If you add new ones, the command `prokka --listdb` will show you whether it has been detected properly.
	267
	268	####The Genus Databases
	269
	270	If you enable `--usegenus` and also provide a Genus via `--genus` then it will first use a BLAST database which is Genus specific. Prokka comes with a set of databases for the most common Bacterial genera; type prokka `--listdb` to see what they are.
	271
	272	####Adding a Genus Databases
	273
	274	If you have a set of Genbank files and want to create a new Genus database, Prokka comes with a tool called
	275	`prokka-genbank_to_fasta_db` to help. For example, if you had four annotated "Coccus" genomes, you could do the following:
	276
	277	% prokka-genbank_to_fasta_db Coccus1.gbk Coccus2.gbk Coccus3.gbk Coccus4.gbk > Coccus.faa
	278	% cd-hit -i Coccus.faa -o Coccus -T 0 -M 0 -g 1 -s 0.8 -c 0.9
	279	% rm -fv Coccus.faa Coccus.bak.clstr Coccus.clstr
	280	% makeblastdb -dbtype prot -in Coccus
	281	% mv Coccus.p* /path/to/prokka/db/genus/
	282
	283	###The HMM Databases
	284
	285	Prokka comes with a bunch of HMM libraries for HMMER3. They are mostly Bacteria-specific. They are searched after the core and genus databases. You can add more simply by putting them in `/opt/prokka/db/hmm`. Type `prokka --listdb` to confirm they are recognised.
	286
	287	###FASTA database format
	288
	289	Prokka understands two annotation tag formats, a plain one and a detailed one.
	290
	291	The plain one is a standard FASTA-like line with the ID after the `>` sign, and the protein `/product`
	292	after the ID (the "description" part of the line):
	293
	294	>SeqID product
	295
	296	The detailed one consists of a special encoded three-part description line. The parts are the `/EC_number`,
	297	the `/gene` code, then the `/product` - and they are separated by a special "~~~" sequence:
	298
	299	>SeqID EC_number~~~gene~~~product
	300
	301	Here are some examples. Note that not all parts need to be present, but the "~~~" should still be there:
	302
	303	>YP_492693.1 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
	304	MNEKNIKHSQNFITSKHNIDKIMTNIRLNEHDNIFEIGSGKGHFTLELVQRCNFVTAIEI
	305	DHKLCKTTENKLVDHDNFQVLNKDILQFKFPKNQSYKIFGNIPYNISTDIIRKIVF*
	306	>YP_492697.1 ~~~traB~~~transfer complex protein TraB
	307	MIKKFSLTTVYVAFLSIVLSNITLGAENPGPKIEQGLQQVQTFLTGLIVAVGICAGVWIV
	308	LKKLPGIDDPMVKNEMFRGVGMVLAGVAVGAALVWLVPWVYNLFQ*
	309	>YP_492694.1 ~~~~~~transposase
	310	MNYFRYKQFNKDVITVAVGYYLRYALSYRDISEILRGRGVNVHHSTVYRWVQEYAPILYQ
	311	QSINTAKNTLKGIECIYALYKKNRRSLQIYGFSPCHEISIMLAS*
	312
	313	The same description lines apply to HMM models, except the "NAME" and "DESC" fields are used:
	314
	315	NAME PRK00001
	316	ACC PRK00001
	317	DESC 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
	318	LENG 284
	319
	320
	321	##FAQ
	322
	323	* __Where does the name "Prokka" come from?__
	324	Prokka is a contraction of "prokaryotic annotation". It's also relatively unique within Google, and also rhymes with a native Australian marsupial called the quokka.
	325
	326	* __Can I annotate by eukaryote genome with Prokka?__
	327	No. Prokka is specifically designed for Bacteria, Archaea and Viruses. It can't handle multi-exon gene models; I would recommend using MAKER 2 for that purpose.
	328
	329	* __Why does Prokka keeps on crashing when it gets to tge "tbl2asn" stage?__
	330	It seems that the tbl2asn program from NCBI "expires" after 12 months, and refuses to run. Unfortunately you need to install a newer version which you can download from here.
	331
	332	* __The hmmscan step seems to hang and do nothing?__
	333	The problem here is GNU Parallel. It seems the Debian package for hmmer has modified it to require the `--gnu` option to behave in the 'default' way. There is no clear reason for this. The only way to restore normal behaviour is to edit the prokka script and change `parallel` to `parallel --gnu`.
	334
	335	* __Why does prokka fail when it gets to hmmscan?__
	336	Unfortunately HMMER keeps changing it's database format, and they aren't upward compatible. If you upgraded HMMER (from 3.0 to 3.1 say) then you need to "re-press" the files. This can be done as follows:
	337	cd /path/to/prokka/db/hmm
	338	mkdir new
	339	for D in *.hmm ; do hmmconvert $D > new/$D ; done
	340	cd new
	341	for D in *.hmm ; do hmmpress $D ; done
	342	mv * ..
	343	rmdir new
	344
	345	* __Why does Prokka take so long to download?__
	346	Our server is in Australia, and the international pipes aren't always flowing as well as we'd like. I try to put it on GoogleDrive. Dropbox is no longer possible due to bandwidth quotas. If you are able to mirror Prokka (~2 GB) outside please let me know.
	347
	348	* __Why can't I load Prokka .GBK files into Mauve?__
	349	Mauve uses BioJava to parse GenBank files, and it is very picky about Genbank files.
	350	It does not like long contig names,
	351	like those from Velvet or Spades. One solution is to use `--centre XXX`
	352	in Prokka and it will rename all your contigs to be NCBI (and Mauve)
	353	compliant. It does not like the ACCESSION and VERSION strings that Prokka
	354	produces via the "tbl2asn" tool. The following Unix command will fix them:
	355	`egrep -v '^(ACCESSION\|VERSION)' prokka.gbk > mauve.gbk`
	356
	357
	358	##Still To Do
	359
	360	* ToDoList.txt: https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/doc/ToDoList.txt
	361
	362
	363	##Changes
	364
	365	* ChangeLog.txt: https://raw.githubusercontent.com/Victorian-Bioinformatics-Consortium/prokka/master/doc/ChangeLog.txt
	366	* Github commits: https://github.com/Victorian-Bioinformatics-Consortium/prokka/commits/master
	367
	368	##Citation
	369
	370	Seemann T.
	371	Prokka: rapid prokaryotic genome annotation
	372	Bioinformatics 2014 Jul 15;30(14):2068-9.
	373	[PMID:24642063](http://www.ncbi.nlm.nih.gov/pubmed/24642063)

+133

-54

bin/prokka less more

31	31	use Bio::Seq;
32	32	use Bio::SeqFeature::Generic;
33	33	use Bio::Tools::GFF;
	34	use Bio::Tools::GuessSeqFormat;
34	35	use FindBin;
35	36
36	37	# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

40	41	my $OPSYS = $^O;
41	42	my $BINDIR = "$FindBin::RealBin/../binaries/$OPSYS";
42	43	my $EXE = $FindBin::RealScript;
43		my $VERSION = "1.10";
	44	my $VERSION = "1.11";
44	45	my $AUTHOR = 'Torsten Seemann <torsten.seemann@monash.edu>';
45	46	my $URL = 'http://www.vicbioinformatics.com';
46	47	my $DBDIR = "$FindBin::RealBin/../db";
47	48	my $HYPO = 'hypothetical protein';
48	49	my $UNANN = 'unannotated protein';
49		my $MAXCONTIGIDLEN = 38; # Genbank rule
	50	my $MAXCONTIGIDLEN = 20; # Genbank rule
50	51
51	52	# these should accept .faa on STDIN and write report to STDOUT
52	53	my $BLASTPCMD = "blastp -query - -db %d -evalue %e -num_threads 1 -num_descriptions 1 -num_alignments 1 -seg no";

56	57	my $aragorn_opt = '';
57	58
58	59	# debian package broke compatibility so have to force it now grumble
59		my $PARALLELCMD = "parallel --gnu";
	60	my $PARALLELCMD = "parallel --gnu --plain";
60	61
61	62	# not used anymore
62	63	#my $INFERNALCMD = "cmscan --noali --notextw --acc -E %e --cpu 1 -o %o %d %i 2>/dev/null";
63	64
64	65	my $starttime = localtime;
65	66	my %seq;
	67	my @seq;
66	68
67	69	# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
68	70	# table of tools we need/optional and min versions

163	165	'egrep' => { NEEDED=>1 },
164	166	'sed' => { NEEDED=>1 },
165	167	'find' => { NEEDED=>1 },
	168	# for the new --proteins option ability to take .gbk or .embl files
	169	'prokka-genbank_to_fasta_db' => { NEEDED=>1 },
166	170	);
167	171
168	172	# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

215	219	# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
216	220	# command line options
217	221
218		my(@Options, $quiet, $kingdom, $fast, $force, $outdir, $prefix, $cpus, $addgenes,
	222	my(@Options, $quiet, $debug, $kingdom, $fast, $force, $outdir, $prefix, $cpus, $addgenes,
219	223	$gcode, $gram, $gffver, $locustag, $increment, $mincontiglen, $evalue, $coverage,
220	224	$genus, $species, $strain, $plasmid,
221		$usegenus, $proteins, $centre, $scaffolds,
	225	$usegenus, $proteins, $hmms, $centre, $scaffolds,
222	226	$rfam, $norrna, $notrna, $rnammer, $rawproduct,
223	227	$metagenome, $compliant, $listdb, $citation);
224	228	setOptions();

232	236	msg("Local time is $starttime");
233	237	msg("You are", $ENV{USER} \|\| 'not telling me who you are!');
234	238	msg("Operating system is $OPSYS");
	239	msg("You have enabled DEBUG mode. Temporary files will NOT be deleted.") if $debug;
235	240
236	241	# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
237	242	# check BioPerl version

240	245	my $bpver = $Bio::Root::Version::VERSION;
241	246	msg("You have BioPerl $bpver");
242	247	err("Please install BioPerl $minbpver or higher") if $bpver < $minbpver;
	248
	249	# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
	250	# Check some incompatible options
	251
	252	$fast && $proteins and err("In --fast mode, the --proteins will not be used!");
	253	$fast && $hmms and err("In --fast mode, the --hmms will not be used!");
243	254
244	255	# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
245	256	# Determine CPU cores available

312	323	}
313	324
314	325	#$centre or err("You must set --centre or the NCBI tbl2asn tool won't work properly, sorry.");
315		($gcode < 1 or $gcode > 24) and err("Invalid genetic code, must be 1..24");
	326	($gcode < 1 or $gcode > 25) and err("Invalid genetic code, must be 1..25");
316	327	$evalue >= 0 or err("Invalid --evalue, must be >= 0");
317	328	#($coverage >= 0 and $coverage <= 100) or err("Invalid --coverage, must be 0..100");
318	329	$increment >= 1 or err("Invalid --increment, must be >= 1");

392	403	my $contigprefix = $locustag \|\| $prefix \|\| $outdir \|\| $strain \|\| '';
393	404	$contigprefix .= '_' if $contigprefix;
394	405	my $contig_name_len = length($centre) + 1 + length($contigprefix) + 6;
395		if ($compliant and $contig_name_len > $MAXCONTIGIDLEN) {
396		err("Genbank contig IDs are $contig_name_len chars, must be <= $MAXCONTIGIDLEN. Prefix is: $contigprefix");
	406	if ($compliant) {
	407	my $contig_name_len = length($centre) + 1 + length($contigprefix) + 6;
	408	if ($contig_name_len > $MAXCONTIGIDLEN) {
	409	err("Genbank contig IDs are $contig_name_len chars, must be <= $MAXCONTIGIDLEN. Prefix is: $contigprefix");
	410	}
397	411	}
398	412
399	413	while (my $seq = $fin->next_seq) {

408	422	$seq->id( sprintf "gnl\|$centre\|${contigprefix}contig%06d", $ncontig );
409	423	}
410	424	if (length($seq->id) > $MAXCONTIGIDLEN) {
411		msg("WARNING: Contig IDs must be less than 38 characters for Genbank compliance - ".$seq->id)
	425	msg("Contig ID must <= $MAXCONTIGIDLEN chars long: ".$seq->id);
	426	err("Please rename your contigs or use --centre XXX to generate clean contig names.");
412	427	}
413	428	my $s = $seq->seq;
414	429	$s = uc($s);

421	436	err("Uh oh! Sequence file '$in' contains duplicate sequence ID:", $seq->id);
422	437	}
423	438	$seq{ $seq->id }{DNA} = $seq;
	439	push @seq, $seq->id; # this array it used to preserve original contig order
424	440	}
425	441	$ncontig > 0 or err("FASTA file '$in' contains no suitable sequence entries");
426	442	msg("Wrote $ncontig contigs");
427		#msg(sort keys %seq); exit;
428	443
429	444	# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
430	445	# tRNA + tmRNA

567	582	my $num_ncrna = 0;
568	583	my $tool = "Infernal:".$tools{'cmscan'}->{VERSION};
569	584	my $icpu = $cpus \|\| 1;
570		open INFERNAL, '-\|', "cmscan --cpu $icpu -E $evalue --tblout /dev/stdout -o /dev/null --noali $cmdb \Q$outdir/$prefix.fna\E";
	585	my $cmd = "cmscan --cpu $icpu -E $evalue --tblout /dev/stdout -o /dev/null --noali $cmdb \Q$outdir/$prefix.fna\E";
	586	msg("Running: $cmd");
	587	open INFERNAL, '-\|', $cmd;
571	588	while (<INFERNAL>) {
572	589	next if m/^#/; # ignore comments
573	590	my @x = split ' '; # magic Perl whitespace splitter

607	624	# Tally all the RNA features __ which we want to exclude overlaps with CDS __
608	625
609	626	my @allrna;
610		for my $sid (sort keys %seq) {
	627	for my $sid (@seq) {
611	628	push @allrna, (grep { $_->primary_tag =~ m/[tr]RNA/ } @{ $seq{$sid}{FEATURE} });
612	629	}
613	630	msg("Total of", scalar(@allrna), "tRNA + rRNA features");

641	658	# CDS
642	659
643	660	msg("Predicting coding sequences");
644		my $totalbp = sum( map { $seq{$_}{DNA}->length } keys %seq);
	661	my $totalbp = sum( map { $seq{$_}{DNA}->length } @seq);
645	662	my $prodigal_mode = ($totalbp >= 100000 && !$metagenome) ? 'single' : 'meta';
646	663	msg("Contigs total $totalbp bp, so using $prodigal_mode mode");
647	664	my $num_cds=0;

700	717	# Connect features to their parent sequences
701	718
702	719	msg("Connecting features back to sequences");
703		for my $sid (sort keys %seq) {
	720	for my $sid (@seq) {
704	721	for my $f (@{ $seq{$sid}{FEATURE} }) {
705	722	$f->attach_seq( $seq{$sid}{DNA} );
706	723	}

722	739	my $spout = Bio::SeqIO->new(-fh=>$spoutfh, -format=>'fasta');
723	740	my %cds;
724	741	my $count=0;
725		for my $sid (sort keys %seq) {
	742	for my $sid (@seq) {
726	743	for my $f (@{ $seq{$sid}{FEATURE} }) {
727	744	next unless $f->primary_tag eq 'CDS';
728	745	$cds{++$count} = $f;
729		my $seq = $f->seq->translate;
	746	my $seq = $f->seq->translate(-codontable_id=>$gcode, -complete=>1);
730	747	$seq->display_id($count);
731	748	$spout->write_seq($seq);
732	749	}

746	763	my $prob = $x[5];
747	764	my $cleave = $x[3];
748	765	my $start = $parent->strand > 0 ? $parent->start : $parent->end;
749		my $end = $start + $parent->strand * ($cleave - 1);
	766	# need to convert to DNA coordinates
	767	my $end = $start + $parent->strand * ($cleave*3 - 1);
750	768	my $sigpep = Bio::SeqFeature::Generic->new(
751	769	-seq_id => $parent->seq_id,
752	770	-source_tag => $tool,

772	790	my $parent = $cds{ $x[0] };
773	791	my $cleave = $x[2];
774	792	my $start = $parent->strand > 0 ? $parent->start : $parent->end;
775		my $end = $start + $parent->strand * ($cleave - 1);
	793	# need to convert to DNA coordinates
	794	my $end = $start + $parent->strand * ($cleave*3 - 1);
776	795	my $sigpep = Bio::SeqFeature::Generic->new(
777	796	-seq_id => $parent->seq_id,
778	797	-source_tag => $tool,

817	836
818	837	# secondary sources are a series of HMMs
819	838	unless ($kingdom eq 'Viruses') {
820		for my $name (qw(HAMAP CLUSTERS Pfam)) {
821		push @database, {
822		DB => "$DBDIR/hmm/$name.hmm",
823		SRC => "protein motif:$name:",
824		FMT => 'hmmer3',
825		CMD => $HMMER3CMD,
826		VERSION => 3, # without this, latest Bioperl goes into infinite loop
827		};
	839	for my $name ( hmms() ) {
	840	my $dbfile = "$DBDIR/hmm/$name.hmm";
	841	if (-r $dbfile) {
	842	push @database, {
	843	DB => $dbfile,
	844	SRC => "protein motif:$name:",
	845	FMT => 'hmmer3',
	846	CMD => $HMMER3CMD,
	847	VERSION => 3, # without this, latest Bioperl goes into infinite loop
	848	}
	849	}
	850	else {
	851	msg("Will not use $name HMM database, $dbfile not installed.");
	852	}
828	853	}
829	854	}
830	855

849	874	msg("Not using genus-specific database. Try --usegenus to enable it.");
850	875	}
851	876
	877	# if user supplies a trusted set of HMMs, we try these first!
	878	if (-r $hmms) {
	879	msg("Preparing user-supplied primary HMMER annotation source: $hmms");
	880	-r "$hmms.h3i" or err("Your HMM is not indexed, please run: hmmpress $hmms");
	881	my $src = $hmms;
	882	$src =~ s{^.*/}{};
	883	$src =~ s/.hmm$//;
	884	msg("Using /inference source as '$src'");
	885	unshift @database, {
	886	DB => $hmms,
	887	SRC => "protein motif:$src:",
	888	FMT => 'hmmer3',
	889	CMD => $HMMER3CMD,
	890	VERSION => 3, # without this, latest Bioperl goes into infinite loop
	891	};
	892	}
	893
852	894	# if user supplies a trusted set of proteins, we try these first!
853	895	if (-r $proteins) {
854		msg("Preparing user-supplied primary annotation source: $proteins");
855		runcmd("makeblastdb -dbtype prot -in \Q$proteins\E -out \Q$outdir/proteins\E -logfile /dev/null");
	896	msg("Preparing user-supplied primary BLAST annotation source: $proteins");
	897	my $faa_file = $proteins;
	898	my $format = Bio::Tools::GuessSeqFormat->new(-file=>$proteins)->guess
	899	or err("could not determine format of --proteins file '$proteins'");
	900	msg("Guessed source was in $format format.");
	901	if ($format =~ m/^(embl\|genbank)$/) {
	902	$faa_file = "$outdir/proteins.faa";
	903	msg("Converting $format '$proteins' into Prokka FASTA '$faa_file'");
	904	runcmd("prokka-genbank_to_fasta_db --format $format \Q$proteins\E > \Q$faa_file\E 2> /dev/null");
	905	}
	906	elsif ($format ne 'fasta') {
	907	err("Option --proteins only supports FASTA, GenBank, and EMBL formats.");
	908	}
	909	runcmd("makeblastdb -dbtype prot -in \Q$faa_file\E -out \Q$outdir/proteins\E -logfile /dev/null");
856	910	my $src = $proteins;
857	911	$src =~ s{^.*/}{};
858	912	msg("Using /inference source as '$src'");

876	930
877	931	for my $db (@database) {
878	932
	933	# create a unqiue output name so we can save them in --debug mode
	934	my $outname = $db->{DB};
	935	$outname =~ s{^.*/}{};
	936
879	937	# we write out all the CDS which haven't been annotated yet and then search them
880		my $faa_name = "$outdir/proteins.faa";
	938	my $faa_name = "$outdir/$outname.faa";
881	939	open my $faa, '>', $faa_name;
882	940
883	941	my %cds;
884	942	my $count=0;
885		for my $sid (sort keys %seq) {
	943	for my $sid (@seq) {
886	944	for my $f (@{ $seq{$sid}{FEATURE} }) {
887	945	next unless $f->primary_tag eq 'CDS';
888	946	next if $f->has_tag('product');
889	947	$cds{++$count} = $f;
890		print $faa ">$count\n",$f->seq->translate->seq,"\n";
	948	print $faa ">$count\n",
	949	$f->seq->translate(-codontable_id=>$gcode, -complete=>1)->seq,"\n";
891	950	}
892	951	}
893	952	close $faa;

909	968	my $bsize = int($faa_bytes / $cpus / 2); # div 2 to allow for slow vs fast subtasks?
910	969	my $paropts = $cpus > 0 ? " -j $cpus" : "";
911	970
912		my $bls_name = "$outdir/proteins.bls";
	971	my $bls_name = "$outdir/$outname.".$db->{FMT};
913	972	runcmd(
914	973	"cat \Q$faa_name\E \| ${PARALLELCMD}$paropts --block $bsize --recstart '>' --pipe".
915	974	" $cmd > \Q$bls_name\E 2> /dev/null"

943	1002	$cds{$pid}->add_tag_value('inference', $db->{SRC}.$hit->name);
944	1003	}
945	1004	msg("Cleaned $num_cleaned /product names") if $num_cleaned > 0;
946		delfile( $faa_name, $bls_name);
	1005	delfile($faa_name, $bls_name);
947	1006	}
948	1007	}
949	1008

956	1015
957	1016	my $empty_label = $fast ? 'unannotated protein' : $HYPO;
958	1017	my $num_hypo=0;
959		for my $sid (sort keys %seq) {
	1018	for my $sid (@seq) {
960	1019	for my $f ( @{ $seq{$sid}{FEATURE} }) {
961	1020	if ($f->primary_tag eq 'CDS' and not $f->has_tag('product')) {
962	1021	$f->add_tag_value('product', $empty_label);

969	1028	# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
970	1029	# Look for possible /pseudo genes - adjacent with same annotation
971	1030
972		for my $sid (sort keys %seq) {
	1031	for my $sid (@seq) {
973	1032	my $prev = '';
974	1033	for my $f ( grep { $_->primary_tag eq 'CDS' } @{ $seq{$sid}{FEATURE} } ) {
975	1034	my $this = TAG($f, 'product');

987	1046
988	1047	my %collide;
989	1048
990		for my $sid (sort keys %seq) {
	1049	for my $sid (@seq) {
991	1050	for my $f ( sort { $a->start <=> $b->start } @{ $seq{$sid}{FEATURE} }) {
992	1051	next unless $f->primary_tag eq 'CDS';
993	1052	my $gene = TAG($f, 'gene') or next;
	1053	$gene =~ s/_(\d+)$//; # remove existing de-duplicated suffix
994	1054	push @{ $collide{$gene} }, $f;
995	1055	}
996	1056	}

1006	1066	$n++;
1007	1067	$f->add_tag_value('gene', "${gene}_${n}");
1008	1068	}
	1069	msg("Fixed $n duplicate /gene -", map { $_->get_tag_values('gene') } @cds);
1009	1070	$num_collide++;
1010	1071	}
1011	1072	msg("Fixed $num_collide colliding /gene names.");

1015	1076
1016	1077	msg("Adding /locus_tag identifiers");
1017	1078	my $num_lt=0;
1018		for my $sid (sort keys %seq) {
	1079	for my $sid (@seq) {
1019	1080	for my $f ( sort { $a->start <=> $b->start } @{ $seq{$sid}{FEATURE} }) {
1020	1081	next unless $f->primary_tag =~ m/CDS\|RNA/;
1021	1082	$num_lt++;

1029	1090	if ($addgenes) {
1030	1091	# make a 'sister' gene feature for the CDS feature
1031	1092	# (ideally it would encompass the UTRs as well, but we don't know them)
	1093	my $gene_id = "${ID}_gene";
1032	1094	my $g = Bio::SeqFeature::Generic->new(
1033	1095	-primary => 'gene',
1034	1096	-seq_id => $f->seq_id,

1036	1098	-end => $f->end,
1037	1099	-strand => $f->strand,
1038	1100	-source_tag => $EXE,
1039		-tag => { 'locus_tag'=> $ID },
	1101	-tag => {
	1102	'locus_tag' => $ID,
	1103	'ID' => $gene_id, # Add suffix to ID for GFF output
	1104	},
1040	1105	);
	1106	# Make a Parent tag from the CDS to the gene
	1107	$f->add_tag_value('Parent', $gene_id);
	1108	# copy the /gene from the CDS
1041	1109	if (my $gENE = TAG($f, 'gene')) {
1042	1110	$g->add_tag_value('gene', $gENE);
1043	1111	}

1059	1127
1060	1128	my $gff_factory = Bio::Tools::GFF->new(-gff_version=>$gffver);
1061	1129	print $gff_fh "##gff-version $gffver\n";
1062		for my $id (sort keys %seq) {
	1130	for my $id (@seq) {
1063	1131	print $gff_fh "##sequence-region $id 1 ", $seq{$id}{DNA}->length, "\n";
1064	1132	}
1065	1133
1066	1134	my $fsa_desc = "[gcode=$gcode] [organism=$genus $species] [strain=$strain]";
1067	1135	$fsa_desc .= " [plasmid=$plasmid]" if $plasmid;
1068	1136
1069		for my $sid (sort keys %seq) {
	1137	for my $sid (@seq) {
1070	1138	my $ctg = $seq{$sid}{DNA};
1071	1139	$ctg->desc($fsa_desc);
1072	1140	$fsa_fh->write_seq($ctg);

1095	1163	$ffn_fh->write_seq($p);
1096	1164	}
1097	1165	if ($f->primary_tag eq 'CDS') {
1098		$faa_fh->write_seq( $p->translate(-codontable_id=>$gcode) );
1099		}
1100		}
1101		}
1102
1103		if (scalar keys %seq) {
	1166	$faa_fh->write_seq( $p->translate(-codontable_id=>$gcode, -complete=>1) );
	1167	}
	1168	}
	1169	}
	1170
	1171	if (@seq) {
1104	1172	print $gff_fh "##FASTA\n";
1105	1173	my $seqio = Bio::SeqIO->new(-fh=>$gff_fh, -format=>'fasta');
1106		for my $sid (sort keys %seq) {
	1174	for my $sid (@seq) {
1107	1175	$seqio->write_seq($seq{$sid}{DNA});
1108	1176	}
1109	1177	}

1116	1184	select $txtFh;
1117	1185
1118	1186	printf "organism: $genus $species $strain $plasmid\n";
1119		printf "contigs: %d\n", scalar(keys %seq);
1120		printf "bases: %d\n", sum( map { $seq{$_}{DNA}->length } keys %seq );
	1187	printf "contigs: %d\n", scalar(@seq);
	1188	printf "bases: %d\n", sum( map { $seq{$_}{DNA}->length } @seq );
1121	1189
1122	1190	my %count;
1123		for my $sid (sort keys %seq) {
	1191	for my $sid (@seq) {
1124	1192	for my $f (@{ $seq{$sid}{FEATURE} }) {
1125	1193	$count{ $f->primary_tag }++;
1126	1194	}

1149	1217	);
1150	1218	delfile("$outdir/errorsummary.val");
1151	1219	delfile( map { "$outdir/$prefix.$_" } qw(dr fixedproducts ecn val) );
1152		move("$outdir/$prefix.gbf", "$outdir/$prefix.gbk"); # rename XXX.gbf to XXX.gbk
	1220
	1221	msg("Repairing broken .GBK output that tbl2asn produces...");
	1222	runcmd("sed 's/COORDINATES: profile/COORDINATES:profile/' < \Q$outdir/$prefix.gbf\E > \Q$outdir/$prefix.gbk\E");
	1223	delfile("$outdir/$prefix.gbf");
1153	1224
1154	1225	# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1155	1226	# Some final log output

1181	1252
1182	1253	$p =~ s/^arCOG\d+\s+//;
1183	1254	$p =~ s/$(EC\|COG).*?$//;
1184		$p =~ s/\s*\w+\d{4,}c?//; # remove possible locus tags
	1255	$p =~ s/\s+\w+\d{4,}c?//; # remove possible locus tags
1185	1256	$p =~ s/ and (inactivated\|related) \w+//;
	1257	$p =~ s/,\s*family$//;
1186	1258
1187	1259	$p =~ s/^(potential\|possible\|probable\|predicted\|uncharacteri.ed)/putative/i;
1188	1260	if ($p =~ m/(domain\|family\|binding\|fold\|like\|motif\|repeat)\s*$/i and $p !~ m/,/) {

1259	1331
1260	1332	sub delfile {
1261	1333	for my $file (@_) {
1262		msg("Deleting unwanted file:", $file);
1263		unlink $file;
	1334	if ($debug) {
	1335	msg("In --debug mode, saving temporary file:", $file);
	1336	}
	1337	else {
	1338	msg("Deleting unwanted file:", $file);
	1339	unlink $file;
	1340	}
1264	1341	}
1265	1342	}
1266	1343

1395	1472	{OPT=>"docs", VAR=>\&showdoc, DESC=>"Show full manual/documentation"},
1396	1473	{OPT=>"citation",VAR=>\&show_citation, DESC=>"Print citation for referencing Prokka"},
1397	1474	{OPT=>"quiet!", VAR=>\$quiet, DEFAULT=>0, DESC=>"No screen output"},
	1475	{OPT=>"debug!", VAR=>\$debug, DEFAULT=>0, DESC=>"Debug mode: keep all temporary files"},
1398	1476	'Setup:',
1399	1477	{OPT=>"listdb", VAR=>\&list_db, DESC=>"List all configured databases"},
1400	1478	{OPT=>"setupdb", VAR=>\&setup_db, DESC=>"Index all installed databases"},

1422	1500	{OPT=>"gram=s", VAR=>\$gram, DEFAULT=>'', DESC=>"Gram: -/neg +/pos"},
1423	1501	{OPT=>"usegenus!", VAR=>\$usegenus, DEFAULT=>0, DESC=>"Use genus-specific BLAST databases (needs --genus)"},
1424	1502	{OPT=>"proteins=s", VAR=>\$proteins, DEFAULT=>'', DESC=>"Fasta file of trusted proteins to first annotate from"},
	1503	{OPT=>"hmms=s", VAR=>\$hmms, DEFAULT=>'', DESC=>"Trusted HMM to first annotate from"},
1425	1504	{OPT=>"metagenome!", VAR=>\$metagenome, DEFAULT=>0, DESC=>"Improve gene predictions for highly fragmented genomes"},
1426	1505	{OPT=>"rawproduct!", VAR=>\$rawproduct, DEFAULT=>0, DESC=>"Do not clean up /product annotation"},
1427	1506	'Computation:',

+43

-19

bin/prokka-genbank_to_fasta_db less more

0	0	#!/usr/bin/perl -w
1	1	use strict;
2	2	use Bio::SeqIO;
	3	use Bio::Tools::CodonTable;
3	4	use Data::Dumper;
4	5
5		my(@Options, $verbose, $format, $hypo, $idtag, $sep, $blank, $pseudo, $minlen);
	6	# gene complement(1..1200)
	7	# /locus_tag="P148_SR1C00001G0001"
	8	# CDS complement(1..1200)
	9	# /locus_tag="P148_SR1C00001G0001"
	10	# /note="RAAC1_SR1_1_1"
	11	# /codon_start=1
	12	# /transl_table=25
	13	# /product="MAEBL"
	14	# /protein_id="AHB40819.1"
	15	# /db_xref="GI:563351664"
	16
	17	my @IDTAG = ('protein_id', 'locus_tag', 'db_xref');
	18
	19	my(@Options, $verbose, $format, $hypo, $idtag, $sep, $blank, $pseudo, $minlen, $gcode);
6	20	setOptions();
7	21
8	22	my $in = Bio::SeqIO->new(-fh=>\*ARGV, -format=>$format);
9	23	my $out = Bio::SeqIO->new(-fh=>\*STDOUT, -format=>'Fasta');
10	24
	25	my @TRYIDTAG = $idtag ? ($idtag) : @IDTAG;
	26	print STDERR "Will use first of (@TRYIDTAG) as FASTA ID\n";
	27
	28	# support different /transl_table=XX tags in CDS
	29	#my $table = Bio::Tools::CodonTable->new();
	30	#printf STDERR "Default /transl_table is %d - %s\n", $table->id, $table->name;
	31
11	32	while (my $seq = $in->next_seq) {
12		print STDERR "\rParsing: ",$seq->display_id;
	33	print STDERR "\rParsing: ",$seq->display_id, "\n";
13	34	my $counter = 0;
14	35	for my $f ($seq->get_SeqFeatures) {
15	36

21	42
22	43	my $prod = TAG($f, 'product') \|\| $blank;
23	44
24		next if $prod eq 'hypothetical protein';
	45	next if !$hypo and $prod eq 'hypothetical protein';
25	46
26	47	my $cds = $f->spliced_seq; # don't forget eukaryotes!
27	48
28		my $id = TAG($f, $idtag) or die "feature #$counter does not have /$idtag tag!";
	49	my $id;
	50	for my $idtag (@TRYIDTAG) {
	51	$id = TAG($f, $idtag);
	52	last if $id;
	53	}
	54	$id or die "\nFeature #$counter does not have any of these tags: @TRYIDTAG";
29	55	$counter++;
30		print STDERR "Processing: [$counter] ", $seq->display_id, " \| $id \| $prod\n" if $verbose;
	56	print STDERR "[$counter] ", $seq->display_id, " \| $id \| $prod\n" if $verbose;
31	57
32	58	# HANDLE CODON START FOR FUZZY FEATURES!
33	59	if ($f->has_tag('codon_start')) {
34	60	my($cs) = $f->get_tag_values('codon_start');
35	61	if ($cs != 1) {
36		print STDERR "/codon_start=$cs - trimming mRNA!";
	62	print STDERR "\t/codon_start=$cs - trimming mRNA!\n";
37	63	$cds = $cds->trunc($cs, $cds->length);
38	64	}
39	65	}
40	66	#END
41	67
42
43		# DNA -> AA
44		$cds = $cds->translate;
	68	# DNA -> AA
	69	my $tt = $gcode \|\| TAG($f, 'transl_table') \|\| 11;
	70	print STDERR "\tUsing specified /transl_table=$tt\n" if $verbose && $tt != 1;
45	71
46		# remove stop codon
47		my $aa = $cds->seq;
48		$aa =~ s/\*$//;
49		$cds->seq($aa);
	72	# http://www.bioperl.org/wiki/HOWTO:Beginners#Translating
	73	$cds = $cds->translate(-codontable_id => $tt, -complete => 1);
50	74
51	75	my $ec = TAG($f, 'EC_number') \|\| $blank;
52	76	my $gene = TAG($f, 'gene') \|\| $blank;
53	77
54		$cds->desc("$ec$sep$gene$sep$prod");
	78	$cds->desc( join $sep, $ec, $gene, $prod );
55	79	$cds->display_id($id);
56	80
57	81	$out->write_seq($cds);
58	82	}
59		print STDERR "\n";
60	83	}
61		print STDERR "\nDone\n";
	84	print STDERR "Done.\n";
62	85
63	86	#----------------------------------------------------------------------
64	87

78	101	use Getopt::Long;
79	102
80	103	@Options = (
81		{OPT=>"help", VAR=>\&usage, DESC=>"This help"},
82		{OPT=>"verbose!", VAR=>\$verbose, DEFAULT=>0, DESC=>"Verbose progress"},
	104	{OPT=>"help", VAR=>\&usage, DESC=>"This help"},
	105	{OPT=>"verbose!", VAR=>\$verbose, DEFAULT=>0, DESC=>"Verbose progress"},
83	106	{OPT=>"format=s", VAR=>\$format, DEFAULT=>'genbank', DESC=>"Input format"},
84		{OPT=>"idtag=s", VAR=>\$idtag, DEFAULT=>'protein_id', DESC=>"What tag to use as Fasta ID"},
	107	{OPT=>"idtag=s", VAR=>\$idtag, DEFAULT=>'', DESC=>"What tag to use as Fasta ID (default = try first of: @IDTAG)"},
85	108	{OPT=>"sep=s", VAR=>\$sep, DEFAULT=>'~~~', DESC=>"Separator between EC/gene/product" },
86	109	{OPT=>"blank=s", VAR=>\$blank, DEFAULT=>'', DESC=>"Replace empty EC/gene/product with this"},
87	110	{OPT=>"pseudo!", VAR=>\$pseudo, DEFAULT=>0, DESC=>"Include /pseudo genes"},
88	111	{OPT=>"hypo!", VAR=>\$hypo, DEFAULT=>0, DESC=>"Include 'hypothetical protein' genes"},
	112	{OPT=>"gcode=i", VAR=>\$gcode, DEFAULT=>0, DESC=>"Force this genetic code for translation (otherwise /transl_table)"},
89	113	{OPT=>"minlen=i", VAR=>\$minlen, DEFAULT=>0, DESC=>"Minimum peptide length"},
90	114	);
91	115

-3

bin/prokka-uniprot_to_fasta_db less more

4	4	#use SWISS::OC;
5	5	use Data::Dumper;
6	6
7		my(@Options, $verbose, $frag, $evlev, $minlen, $sep, $blank, $term, $hypo);
	7	my(@Options, $verbose, $frag, $evlev, $minlen, $maxlen, $sep, $blank, $term, $hypo);
8	8	setOptions();
9	9
10	10	my $HYPO = 'hypothetical protein';

25	25	next if not $frag and $entry->isFragment;
26	26	next if not $frag and $entry->DEs->hasFragment;
27	27
28		# Too short?
29		next if $minlen > 0 and length($entry->SQs->seq) < $minlen;
	28	# Too short or to long?
	29	my $L = length($entry->SQs->seq);
	30	next if $minlen > 0 and $L < $minlen;
	31	next if $maxlen > 0 and $L > $maxlen;
30	32
31	33	# Reject on evidence level
32	34	# grep ^PE uniprot_sprot.dat \| sort \| uniq -c

103	105	{OPT=>"evidence=i", VAR=>\$evlev, DEFAULT=>2, DESC=>"1=prot 2=mrna 3=homol 4=pred 5=unsure"},
104	106	{OPT=>"fragments!", VAR=>\$frag, DEFAULT=>0, DESC=>"Include 'DE Flags: Fragment;' entries"},
105	107	{OPT=>"minlen=i", VAR=>\$minlen, DEFAULT=>20, DESC=>"Minimum peptide length"},
	108	{OPT=>"maxlen=i", VAR=>\$maxlen, DEFAULT=>1E5, DESC=>"Minimum peptide length"},
106	109	{OPT=>"term=s", VAR=>\$term, DEFAULT=>'', DESC=>"Lineage must contain this term eg. 'Bacteria'"},
107	110	{OPT=>"hypo!", VAR=>\$hypo, DEFAULT=>0, DESC=>"Don't filter out hypothetical proteins"},
108	111	);

~~db/hmm/CLUSTERS.hmm~~ less more

Binary diff not shown

~~db/hmm/Pfam.hmm~~ less more

Binary diff not shown

+19

-0

doc/ChangeLog.txt less more

0	0
1	1	Prokka ChangeLog
2	2	----------------
	3
	4	v1.11 - 15 Feb 2015
	5	* Option --proteins now supports .GBK/.EMBL files directly!
	6	* Support for user supplied HMM via --hmms [Connor Driscoll]
	7	* Add Prodigal -c option when in metagenome mode [Daan Speth]
	8	* Wierd coordinate errors with long Spades contig names [Stephan Pabinger]
	9	* Replaced dodgy OSX aragorn binary [Yevgeny Nikolaichik]
	10	* Handle translation table 25 [Mads Albertsen]
	11	* Fix semantically incorrect GFF3 output [Marc Hoeppner]
	12	* Fix contigs with pipe characters from Quiver [Vaughn Cooper]
	13	* Keep original contig ordering [Jane Hawkey]
	14	* Fix protein translation with alternate --gcode [Andreas Leimbach]
	15	* Various bug fixes in prokka-* support scripts [Andreas Leimbach]
	16	* Workaround tbl2asn bug with COORDINATES:profile [Wiep Smits]
	17	* Provide MD5 checksum for website download [Willem VA Viljoen]
	18	* Hopefully slightly improved cleanup_product() function
	19	* Ensure released are tagged on github for Linux Brew [Shaun Jackman]
	20	* Check databases exist before searching against them
	21	* Updated bundled binaries
3	22
4	23	v1.10 - 28 July 2014
5	24	* Support for barrnap 0.4 with Archaea and Mito support [Lionel Guy]

-4

doc/ToDoList.txt less more

1	1	Prokka Wishlist
2	2	---------------
3	3
	4	* New --first-blast option, like --proteins?
	5	* New --last-blast option eg. "nr" [Check who emailed me this]
4	6	* annotate possible frame-shifts
5	7	* FAQ about annotation transfer
6	8	* use included binary if PATH one is wrong version [Simon Gladman]
7	9	* reconsider !~m/a-z/ rule in cleanup_product() [Roderick Felsheim]
8	10	* all Kingdom=ALL or ANY for metagenomes [Andreas Bremges]
9	11	* make --cleanup option for /product names [Andreas Bremges]
10		* check databases exist before spawning blast/hmmer
11	12	* locus_tag rules: http://www.ncbi.nlm.nih.gov/genomes/locustag/Proposal.pdf
12		* Add .PTT file output [Andrew Buultjens]
13	13	* Bug: "existing RNA (repeat_region)" CRISPR is not RNA ?
14		* Allow --proteins to be a .GBK or .GFF3 file ie. extract the CDS annotations
15	14	* Factor out some functions into modules!
16	15	* Check for large genomic tracts which are unannotated. Sometimes Prodigal misses big genes.
17	16	* Add an example input/output so users can check their copy is producing similar results.
18	17	* Output potential homopolymer assembly errors near CDS flanks
19	18	* Add the CLUSTERS "PHA" library to Viruses mode.
20	19	* Option to include ribosomal binding sites (RBS) as features.
21		* Check input contigs for runs of Ns, and either complain, or split the file, or additionally create a .AGP scaffold file. (Mitchell Stanton-Cook has done this, need to incorporate)
22	20	* Add hyperlinks to tool references in Manual
23	21

-215

~~doc/prokka-manual.html~~ less more

0		<h1 id="prokka-rapid-prokaryotic-genome-annotation">Prokka: rapid prokaryotic genome annotation</h1>
1		<p>Torsten Seemann <script type="text/javascript">
2		<!--
3		h='monash.edu';a='@';n='torsten.seemann';e=n+a+h;
4		document.write('<a h'+'ref'+'="ma'+'ilto'+':'+e+'">'+'<code>'+e+'</code>'+'<\/'+'a'+'>');
5		// -->
6		</script><noscript>torsten.seemann at monash dot edu</noscript><br />Victorian Bioinformatics Consortium, AUSTRALIA <a href="http://vicbioinformatics.com"><code class="url">http://vicbioinformatics.com</code></a></p>
7		<h2 id="introduction">Introduction</h2>
8		<p>Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.</p>
9		<h2 id="installation">Installation</h2>
10		<h3 id="download">Download</h3>
11		<p>Download the latest <code>prokka-1.xx.tar.gz</code> archive from http://www.bioinformatics.net.au/software.prokka.shtml</p>
12		<h3 id="extract">Extract</h3>
13		<p>Choose somewhere to put it, for example in your home directory (no root access required):</p>
14		<pre><code>% cd $HOME
15		% tar zxvf prokka-1.xx.tar.gz
16		% ls prokka-1.xx</code></pre>
17		<h3 id="add-to-path">Add to PATH</h3>
18		<p>Add the following line to your <code>$HOME/.bashrc</code> file, or to <code>/etc/profile.d/prokka.sh</code> to make it available to all users:</p>
19		<pre><code>export PATH=$PATH:$HOME/prokka-1.xx/bin</code></pre>
20		<h3 id="index-the-sequence-databases">Index the sequence databases</h3>
21		<pre><code>% prokka --setupdb</code></pre>
22		<h3 id="install-dependencies">Install dependencies</h3>
23		<p>Prokka comes with many binaries for Linux and Mac OS X. It will always use your existing installed versions if they exist, but will use the included ones if that fails. For some older systems (eg. Centos 4.x) some of them won't work due to them being dynamically linked against new GLIBC libraries you don't have.</p>
24		<p>You can consult the list of dependencies later in this document.</p>
25		<h3 id="choose-a-rrna-predictor">Choose a rRNA predictor</h3>
26		<h4 id="option-1---dont-use-one">Option 1 - Don't use one</h4>
27		<p>If Prokka can't find a predictor for rRNA featues (either Barrnap or RNAmmer below) then it simply won't annotate any. Most people don't care that much about them anyway,</p>
28		<h4 id="option-2---barrnap">Option 2 - Barrnap</h4>
29		<p>This was written by the author of Prokka and is recommended if you prefer speed over absolute accuracy. It uses the new multi-core NHMMER for DNA:DNA profile searches. Download it from http://www.vicbioinformatics.com/software.barrnap.shtml</p>
30		<h4 id="option-3---rnammer">Option 3 - RNAmmer</h4>
31		<p>RNAmmer was written when HMMER 2.x was the latest release. Since them, HMMER 3.x has been released, and uses the same executable binary names. Prokka needs HMMER3 and RNAmmer (and hence HMMER2) so you need to edit your RNAmmer script to explicitly point your HMMER2 binary instead of using the HMMER3 binary which is more likely to be in your PATH first.</p>
32		<p>Type which rnammer to find the script, and then edit it with your favourite editor. Find the following lines at the top:</p>
33		<pre><code>if ( $uname eq "Linux" ) {
34		# $HMMSEARCH_BINARY = "/usr/cbs/bio/bin/linux64/hmmsearch"; # OLD
35		$HMMSEARCH_BINARY = "/path/to/my/hmmer-2.3.2/bin/hmmsearch"; # NEW (yours)
36		}</code></pre>
37		<p>If you are using Mac OS X, you'll also have to change the <code>"Linux"</code> to <code>"Darwin"</code> too. As you can see, I have commented out the original part, and replaced it with the location of my HMMER2 hmmsearch tool, so it doesn't run the HMMER3 one. You need to ensure HMMER3 is in your PATH before the old HMMER2 too.</p>
38		<h3 id="test">Test</h3>
39		<ul>
40		<li>Type <code>prokka</code> and it should output it's help screen.</li>
41		<li>Type <code>prokka --version</code> and you should see an output like <code>prokka 1.x</code></li>
42		<li>Type <code>prokka --listdb</code> and it will show you what databases it has installed to use.</li>
43		</ul>
44		<h2 id="invoking-prokka">Invoking Prokka</h2>
45		<h3 id="beginner">Beginner</h3>
46		<pre><code># Vanilla (but with free toppings)
47		% prokka contigs.fa
48
49		# Look for a folder called PROKKA_yyyymmdd (today's date) and look at stats
50		% cat PROKKA_yyyymmdd/*.txt</code></pre>
51		<h3 id="moderate">Moderate</h3>
52		<pre><code># Choose the names of the output files
53		% prokka --outdir mydir --prefix mygenome contigs.fa
54
55		# Visualize it in Artemis
56		% art mydir/mygenome.gff</code></pre>
57		<h3 id="expert">Expert</h3>
58		<pre><code># It's not just for bacteria, people
59		% prokka --kingdom Archaea --outdir mydir --genus Pyrococcus --locustag PYCC
60
61		# Search for my favourite gene
62		% exonerate --bestn 1 zetatoxin.fasta mydir/PYCC_06072012.faa \| less</code></pre>
63		<h3 id="wizard">Wizard</h3>
64		<pre><code># Watch and learn
65		% prokka --outdir mydir --locustag EHEC --proteins NewToxins.faa --evalue 0.001 --gram neg --addgenes contigs.fa
66
67		# Check to see if anything went really wrong
68		% less mydir/EHEC_06072012.err
69
70		# Add final details using Sequin
71		% sequin mydir/EHEC_0607201.sqn</code></pre>
72		<h3 id="genbank-submitter">Genbank submitter</h3>
73		<pre><code># Register your BioProject and your locus_tag prefix first!
74		% prokka --compliant --centre UoN --outdir PRJNA123456 --locustag EHEC --prefix EHEC-Chr1 contigs.fa
75
76		# Check to see if anything went really wrong
77		% less PRJNA123456/EHEC-Chr1.err
78
79		# Add final details using Sequin
80		% sequin PRJNA123456/EHEC-Chr1.sqn</code></pre>
81		<h3 id="crazy-person">Crazy Person</h3>
82		<pre><code># No stinking Perl script is going to control me
83		% prokka \
84		--outdir $HOME/genomes/Ec_POO247 --force \
85		--prefix Ec_POO247 --addgenes --locustag ECPOOp \
86		--increment 10 --gffver 2 --centre CDC --compliant \
87		--genus Escherichia --species coli --strain POO247 --plasmid pECPOO247 \
88		--kingdom Bacteria --gcode 11 --usegenus \
89		--proteins /opt/prokka/db/trusted/Ecocyc-17.6 \
90		--evalue 1e-9 --rfam \
91		plasmid-closed.fna</code></pre>
92		<h2 id="output-files">Output Files</h2>
93		<p>\| Extension \| Description \| \| --------- \| ----------- \| \| .gff \| This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV. \| \| .gbk \| This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence. \| \| .fna \| Nucleotide FASTA file of the input contig sequences. \| \| .faa \| Protein FASTA file of the translated CDS sequences. \| \| .ffn \| Nucleotide FASTA file of all the annotated sequences, not just CDS. \| \| .sqn \| An ASN1 format "Sequin" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. \| \| .fsa \| Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. \| \| .tbl \| Feature Table file, used by "tbl2asn" to create the .sqn file. \| \| .err \| Unacceptable annotations - the NCBI discrepancy report. \| \| .log \| Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled. \| \| .txt \| Statistics relating to the annotated features found. \|</p>
94		<h2 id="command-line-options">Command line options</h2>
95		<pre><code>General:
96		--help This help
97		--version Print version and exit
98		--docs Show full manual/documentation
99		--citation Print citation for referencing Prokka
100		--quiet No screen output (default OFF)
101		Setup:
102		--listdb List all configured databases
103		--setupdb Index all installed databases
104		--cleandb Remove all database indices
105		--depends List all software dependencies
106		Outputs:
107		--outdir [X] Output folder [auto] (default '')
108		--force Force overwriting existing output folder (default OFF)
109		--prefix [X] Filename output prefix [auto] (default '')
110		--addgenes Add 'gene' features for each 'CDS' feature (default OFF)
111		--locustag [X] Locus tag prefix (default 'PROKKA')
112		--increment [N] Locus tag counter increment (default '1')
113		--gffver [N] GFF version (default '3')
114		--compliant Force Genbank/ENA/DDJB compliance: --genes --mincontiglen 200 --centre XXX (default OFF)
115		--centre [X] Sequencing centre ID. (default '')
116		Organism details:
117		--genus [X] Genus name (default 'Genus')
118		--species [X] Species name (default 'species')
119		--strain [X] Strain name (default 'strain')
120		--plasmid [X] Plasmid name or identifier (default '')
121		Annotations:
122		--kingdom [X] Annotation mode: Archaea\|Bacteria\|Mitochondria\|Viruses (default 'Bacteria')
123		--gcode [N] Genetic code / Translation table (set if --kingdom is set) (default '0')
124		--gram [X] Gram: -/neg +/pos (default '')
125		--usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)
126		--proteins [X] Fasta file of trusted proteins to first annotate from (default '')
127		--metagenome Improve gene predictions for highly fragmented genomes (default OFF)
128		--rawproduct Do not clean up /product annotation (default OFF)
129		Computation:
130		--fast Fast mode - skip CDS /product searching (default OFF)
131		--cpus [N] Number of CPUs to use [0=all] (default '8')
132		--mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
133		--evalue [n.n] Similarity e-value cut-off (default '1e-06')
134		--rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
135		--norrna Don't run rRNA search (default OFF)
136		--notrna Don't run tRNA search (default OFF)
137		--rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)</code></pre>
138		<h2 id="dependencies">Dependencies</h2>
139		<ul>
140		<li><p><strong>GNU Parallel</strong><br />A shell tool for executing jobs in parallel using one or more computers<br /><em>O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, Feb 2011:42-47.</em></p></li>
141		<li><p><strong>BioPerl</strong><br />Used for input/output of various file formats<br /><em>Stajich et al, The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct;12(10):1611-8.</em></p></li>
142		<li><p><strong>Aragorn</strong><br />Finds transfer RNA features (tRNA)<br /><em>Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004 Jan 2;32(1):11-6.</em></p></li>
143		<li><p><strong>Barrnap</strong><br />Used to predict ribosomal RNA features (rRNA). My licence-free replacement for RNAmmmer.<br /><em>Manuscript under preparation.</em></p></li>
144		<li><p><strong>RNAmmer</strong><br />Finds ribosomal RNA features (rRNA)<br /><em>Lagesen K et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100-8.</em></p></li>
145		<li><p><strong>Prodigal</strong><br />Finds protein-coding features (CDS)<br /><em>Hyatt D et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11:119.</em></p></li>
146		<li><p><strong>SignalP</strong><br />Finds signal peptide features in CDS (sig_peptide)<br /><em>Petersen TN et al. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods. 2011 Sep 29;8(10):785-6.</em></p></li>
147		<li><p><strong>BLAST+</strong><br />Used for similarity searching against protein sequence libraries<br /><em>Camacho C et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009 Dec 15;10:421.</em></p></li>
148		<li><p><strong>HMMER3</strong><br />Used for similarity searching against protein family profiles<br /><em>Finn RD et al. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011 Jul;39(Web Server issue):W29-37.</em></p></li>
149		<li><p><strong>Infernal</strong><br />Used for similarity searching against ncRNA family profiles<br /><em>D. L. Kolbe, S. R. Eddy. Fast Filtering for RNA Homology Search. Bioinformatics, 27:3102-3109, 2011.</em></p></li>
150		</ul>
151		<h2 id="databases">Databases</h2>
152		<h3 id="the-core-blast-databases">The Core (BLAST+) Databases</h3>
153		<p>Prokka uses a variety of databases when trying to assign function to the predicted CDS features. It takes a hierarchial approach to make it fast. A small, core set of well characterized proteins are first searched using BLAST+. This combination of small database and fast search typically completes about 70% of the workload. Then a series of slower but more sensitive HMM databases are searched using HMMER3.</p>
154		<p>The initial core databases are derived from UniProtKB; there is one per "kingdom" supported. To qualify for inclusion, a protein must be (1) from Bacteria (or Archaea or Viruses); (2) not be "Fragment" entries; and (3) have an evidence level ("PE") of 2 or lower, which corresponds to experimental mRNA or proteomics evidence.</p>
155		<h4 id="making-a-core-databases">Making a Core Databases</h4>
156		<p>If you want to modify these core databases, the included script <code>prokka-uniprot_to_fasta_db</code>, along with the official <code>uniprot_sprot.dat</code>, can be used to generate a new database to put in <code>/opt/prokka/db/kingdom/</code>. If you add new ones, the command <code>prokka --listdb</code> will show you whether it has been detected properly.</p>
157		<h4 id="the-genus-databases">The Genus Databases</h4>
158		<p>If you enable <code>--usegenus</code> and also provide a Genus via <code>--genus</code> then it will first use a BLAST database which is Genus specific. Prokka comes with a set of databases for the most common Bacterial genera; type prokka <code>--listdb</code> to see what they are.</p>
159		<h4 id="adding-a-genus-databases">Adding a Genus Databases</h4>
160		<p>If you have a set of Genbank files and want to create a new Genus database, Prokka comes with a tool called <code>prokka-genbank_to_fasta_db</code> to help. For example, if you had four annotated "Coccus" genomes, you could do the following:</p>
161		<pre><code>% prokka-genbank_to_fasta_db Coccus1.gbk Coccus2.gbk Coccus3.gbk Coccus4.gbk > Coccus.faa
162		% cd-hit -i Coccus.faa -o Coccus -T 0 -M 0 -g 1 -s 0.8 -c 0.9
163		% rm -fv Coccus.faa Coccus.bak.clstr Coccus.clstr
164		% makeblastdb -dbtype prot -in Coccus
165		% mv Coccus.p* /path/to/prokka/db/genus/</code></pre>
166		<h3 id="the-hmm-databases">The HMM Databases</h3>
167		<p>Prokka comes with a bunch of HMM libraries for HMMER3. They are mostly Bacteria-specific. They are searched after the core and genus databases. You can add more simply by putting them in <code>/opt/prokka/db/hmm</code>. Type <code>prokka --listdb</code> to confirm they are recognised.</p>
168		<h3 id="fasta-database-format">FASTA database format</h3>
169		<p>Prokka understands two annotation tag formats, a plain one and a detailed one.</p>
170		<p>The plain one is a standard FASTA-like line with the ID after the <code>></code> sign, and the protein <code>/product</code> after the ID (the "description" part of the line):</p>
171		<pre><code>>SeqID product</code></pre>
172		<p>The detailed one consists of a special encoded three-part description line. The parts are the <code>/EC_number</code>, the <code>/gene</code> code, then the <code>/product</code> - and they are separated by a special "<sub>~</sub>" sequence:</p>
173		<pre><code>>SeqID EC_number~~~gene~~~product</code></pre>
174		<p>Here are some examples. Note that not all parts need to be present, but the "<sub>~</sub>" should still be there:</p>
175		<pre><code>>YP_492693.1 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
176		MNEKNIKHSQNFITSKHNIDKIMTNIRLNEHDNIFEIGSGKGHFTLELVQRCNFVTAIEI
177		DHKLCKTTENKLVDHDNFQVLNKDILQFKFPKNQSYKIFGNIPYNISTDIIRKIVF*
178		>YP_492697.1 ~~~traB~~~transfer complex protein TraB
179		MIKKFSLTTVYVAFLSIVLSNITLGAENPGPKIEQGLQQVQTFLTGLIVAVGICAGVWIV
180		LKKLPGIDDPMVKNEMFRGVGMVLAGVAVGAALVWLVPWVYNLFQ*
181		>YP_492694.1 ~~~~~~transposase
182		MNYFRYKQFNKDVITVAVGYYLRYALSYRDISEILRGRGVNVHHSTVYRWVQEYAPILYQ
183		QSINTAKNTLKGIECIYALYKKNRRSLQIYGFSPCHEISIMLAS*</code></pre>
184		<p>The same description lines apply to HMM models, except the "NAME" and "DESC" fields are used:</p>
185		<pre><code>NAME PRK00001
186		ACC PRK00001
187		DESC 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
188		LENG 284</code></pre>
189		<h2 id="faq">FAQ</h2>
190		<ul>
191		<li><p><strong>Where does the name "Prokka" come from?</strong><br />Prokka is a contraction of "prokaryotic annotation". It's also relatively unique within Google, and also rhymes with a native Australian marsupial called the quokka.</p></li>
192		<li><p><strong>Can I annotate by eukaryote genome with Prokka?</strong><br />No. Prokka is specifically designed for Bacteria, Archaea and Viruses. It can't handle multi-exon gene models; I would recommend using MAKER 2 for that purpose.</p></li>
193		<li><p><strong>Why does Prokka keeps on crashing when it gets to tge "tbl2asn" stage?</strong><br />It seems that the tbl2asn program from NCBI "expires" after 12 months, and refuses to run. Unfortunately you need to install a newer version which you can download from here.</p></li>
194		<li><p><strong>The hmmscan step seems to hang and do nothing?</strong><br />The problem here is GNU Parallel. It seems the Debian package for hmmer has modified it to require the <code>--gnu</code> option to behave in the 'default' way. There is no clear reason for this. The only way to restore normal behaviour is to edit the prokka script and change <code>parallel</code> to <code>parallel --gnu</code>.</p></li>
195		<li><p><strong>Why does prokka fail when it gets to hmmscan?</strong><br />Unfortunately HMMER keeps changing it's database format, and they aren't upward compatible. If you upgraded HMMER (from 3.0 to 3.1 say) then you need to "re-press" the files. This can be done as follows: cd /path/to/prokka/db/hmm mkdir new for D in .hmm ; do hmmconvert <span class="math"><em>D</em> > <em>n</em><em>e</em><em>w</em> / </span>D ; done cd new for D in .hmm ; do hmmpress $D ; done mv * .. rmdir new</p></li>
196		<li><p><strong>Why does Prokka take so long to download?</strong><br />Our server is in Australia, and the international pipes aren't always flowing as well as we'd like. I try to put it on GoogleDrive. Dropbox is no longer possible due to bandwidth quotas. If you are able to mirror Prokka (~2 GB) outside please let me know.</p></li>
197		<li><p><strong>Why can't I load Prokka .GBK files into Mauve?</strong><br />Mauve is very picky about Genbank files. It does not like long contig names, like those from Velvet or Spades. The simple solution is to use <code>--centre XXX</code> in Prokka and it will rename all your contigs to be NCBI (and Mauve) compliant. It does not like the ACCESSION and VERSION strings that Prokka produces via the "tbl2asn" tool. The following Unix command will fix them: <code>egrep -v '^(ACCESSION\|VERSION)' prokka.gbk > mauve.gbk</code></p></li>
198		</ul>
199		<h2 id="still-to-do">Still To Do</h2>
200		<ul>
201		<li>ToDoList.txt: https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/doc/ToDoList.txt</li>
202		</ul>
203		<h2 id="changes">Changes</h2>
204		<ul>
205		<li>ChangeLog.txt: https://raw.githubusercontent.com/Victorian-Bioinformatics-Consortium/prokka/master/doc/ChangeLog.txt</li>
206		<li>Github commits: https://github.com/Victorian-Bioinformatics-Consortium/prokka/commits/master</li>
207		</ul>
208		<h2 id="bugs">Bugs</h2>
209		<ul>
210		<li>tbl2asn seems to be removing the "(anti-codon)" part in my tRNA /product values</li>
211		<li>tbl2asn putting space in /inference for Infernal</li>
212		</ul>
213		<h2 id="citation">Citation</h2>
214		<p>Seemann T.<br /><em>Prokka: rapid prokaryotic genome annotation</em><br /><strong>Bioinformatics</strong> 2014 Jul 15;30(14):2068-9. <a href="http://www.ncbi.nlm.nih.gov/pubmed/24642063">PMID:24642063</a><br /><a href="doi:10.1093/bioinformatics/btu153">DOI</a></p>

-454

~~doc/prokka-manual.txt~~ less more

0		Prokka: rapid prokaryotic genome annotation
1		===========================================
2
3		Torsten Seemann torsten.seemann@monash.edu
4		Victorian Bioinformatics Consortium, AUSTRALIA
5		http://vicbioinformatics.com
6
7		Introduction
8		------------
9
10		Whole genome annotation is the process of identifying features of
11		interest in a set of genomic DNA sequences, and labelling them with
12		useful information. Prokka is a software tool to annotate bacterial,
13		archaeal and viral genomes quickly and produce standards-compliant
14		output files.
15
16		Installation
17		------------
18
19		Download
20
21		Download the latest prokka-1.xx.tar.gz archive from
22		http://www.bioinformatics.net.au/software.prokka.shtml
23
24		Extract
25
26		Choose somewhere to put it, for example in your home directory (no root
27		access required):
28
29		% cd $HOME
30		% tar zxvf prokka-1.xx.tar.gz
31		% ls prokka-1.xx
32
33		Add to PATH
34
35		Add the following line to your $HOME/.bashrc file, or to
36		/etc/profile.d/prokka.sh to make it available to all users:
37
38		export PATH=$PATH:$HOME/prokka-1.xx/bin
39
40		Index the sequence databases
41
42		% prokka --setupdb
43
44		Install dependencies
45
46		Prokka comes with many binaries for Linux and Mac OS X. It will always
47		use your existing installed versions if they exist, but will use the
48		included ones if that fails. For some older systems (eg. Centos 4.x)
49		some of them won't work due to them being dynamically linked against new
50		GLIBC libraries you don't have.
51
52		You can consult the list of dependencies later in this document.
53
54		Choose a rRNA predictor
55
56		Option 1 - Don't use one
57
58		If Prokka can't find a predictor for rRNA featues (either Barrnap or
59		RNAmmer below) then it simply won't annotate any. Most people don't care
60		that much about them anyway,
61
62		Option 2 - Barrnap
63
64		This was written by the author of Prokka and is recommended if you
65		prefer speed over absolute accuracy. It uses the new multi-core NHMMER
66		for DNA:DNA profile searches. Download it from
67		http://www.vicbioinformatics.com/software.barrnap.shtml
68
69		Option 3 - RNAmmer
70
71		RNAmmer was written when HMMER 2.x was the latest release. Since them,
72		HMMER 3.x has been released, and uses the same executable binary names.
73		Prokka needs HMMER3 and RNAmmer (and hence HMMER2) so you need to edit
74		your RNAmmer script to explicitly point your HMMER2 binary instead of
75		using the HMMER3 binary which is more likely to be in your PATH first.
76
77		Type which rnammer to find the script, and then edit it with your
78		favourite editor. Find the following lines at the top:
79
80		if ( $uname eq "Linux" ) {
81		# $HMMSEARCH_BINARY = "/usr/cbs/bio/bin/linux64/hmmsearch"; # OLD
82		$HMMSEARCH_BINARY = "/path/to/my/hmmer-2.3.2/bin/hmmsearch"; # NEW (yours)
83		}
84
85		If you are using Mac OS X, you'll also have to change the "Linux" to
86		"Darwin" too. As you can see, I have commented out the original part,
87		and replaced it with the location of my HMMER2 hmmsearch tool, so it
88		doesn't run the HMMER3 one. You need to ensure HMMER3 is in your PATH
89		before the old HMMER2 too.
90
91		Test
92
93		- Type prokka and it should output it's help screen.
94		- Type prokka --version and you should see an output like prokka 1.x
95		- Type prokka --listdb and it will show you what databases it has
96		installed to use.
97
98		Invoking Prokka
99		---------------
100
101		Beginner
102
103		# Vanilla (but with free toppings)
104		% prokka contigs.fa
105
106		# Look for a folder called PROKKA_yyyymmdd (today's date) and look at stats
107		% cat PROKKA_yyyymmdd/*.txt
108
109		Moderate
110
111		# Choose the names of the output files
112		% prokka --outdir mydir --prefix mygenome contigs.fa
113
114		# Visualize it in Artemis
115		% art mydir/mygenome.gff
116
117		Expert
118
119		# It's not just for bacteria, people
120		% prokka --kingdom Archaea --outdir mydir --genus Pyrococcus --locustag PYCC
121
122		# Search for my favourite gene
123		% exonerate --bestn 1 zetatoxin.fasta mydir/PYCC_06072012.faa \| less
124
125		Wizard
126
127		# Watch and learn
128		% prokka --outdir mydir --locustag EHEC --proteins NewToxins.faa --evalue 0.001 --gram neg --addgenes contigs.fa
129
130		# Check to see if anything went really wrong
131		% less mydir/EHEC_06072012.err
132
133		# Add final details using Sequin
134		% sequin mydir/EHEC_0607201.sqn
135
136		Genbank submitter
137
138		# Register your BioProject and your locus_tag prefix first!
139		% prokka --compliant --centre UoN --outdir PRJNA123456 --locustag EHEC --prefix EHEC-Chr1 contigs.fa
140
141		# Check to see if anything went really wrong
142		% less PRJNA123456/EHEC-Chr1.err
143
144		# Add final details using Sequin
145		% sequin PRJNA123456/EHEC-Chr1.sqn
146
147		Crazy Person
148
149		# No stinking Perl script is going to control me
150		% prokka \
151		--outdir $HOME/genomes/Ec_POO247 --force \
152		--prefix Ec_POO247 --addgenes --locustag ECPOOp \
153		--increment 10 --gffver 2 --centre CDC --compliant \
154		--genus Escherichia --species coli --strain POO247 --plasmid pECPOO247 \
155		--kingdom Bacteria --gcode 11 --usegenus \
156		--proteins /opt/prokka/db/trusted/Ecocyc-17.6 \
157		--evalue 1e-9 --rfam \
158		plasmid-closed.fna
159
160		Output Files
161		------------
162
163		\| Extension \| Description \| \| --------- \| ----------- \| \| .gff \| This is
164		the master annotation in GFF3 format, containing both sequences and
165		annotations. It can be viewed directly in Artemis or IGV. \| \| .gbk \|
166		This is a standard Genbank file derived from the master .gff. If the
167		input to prokka was a multi-FASTA, then this will be a multi-Genbank,
168		with one record for each sequence. \| \| .fna \| Nucleotide FASTA file of
169		the input contig sequences. \| \| .faa \| Protein FASTA file of the
170		translated CDS sequences. \| \| .ffn \| Nucleotide FASTA file of all the
171		annotated sequences, not just CDS. \| \| .sqn \| An ASN1 format "Sequin"
172		file for submission to Genbank. It needs to be edited to set the correct
173		taxonomy, authors, related publication etc. \| \| .fsa \| Nucleotide FASTA
174		file of the input contig sequences, used by "tbl2asn" to create the .sqn
175		file. It is mostly the same as the .fna file, but with extra Sequin tags
176		in the sequence description lines. \| \| .tbl \| Feature Table file, used
177		by "tbl2asn" to create the .sqn file. \| \| .err \| Unacceptable
178		annotations - the NCBI discrepancy report. \| \| .log \| Contains all the
179		output that Prokka produced during its run. This is a record of what
180		settings you used, even if the --quiet option was enabled. \| \| .txt \|
181		Statistics relating to the annotated features found. \|
182
183		Command line options
184		--------------------
185
186		General:
187		--help This help
188		--version Print version and exit
189		--docs Show full manual/documentation
190		--citation Print citation for referencing Prokka
191		--quiet No screen output (default OFF)
192		Setup:
193		--listdb List all configured databases
194		--setupdb Index all installed databases
195		--cleandb Remove all database indices
196		--depends List all software dependencies
197		Outputs:
198		--outdir [X] Output folder [auto] (default '')
199		--force Force overwriting existing output folder (default OFF)
200		--prefix [X] Filename output prefix [auto] (default '')
201		--addgenes Add 'gene' features for each 'CDS' feature (default OFF)
202		--locustag [X] Locus tag prefix (default 'PROKKA')
203		--increment [N] Locus tag counter increment (default '1')
204		--gffver [N] GFF version (default '3')
205		--compliant Force Genbank/ENA/DDJB compliance: --genes --mincontiglen 200 --centre XXX (default OFF)
206		--centre [X] Sequencing centre ID. (default '')
207		Organism details:
208		--genus [X] Genus name (default 'Genus')
209		--species [X] Species name (default 'species')
210		--strain [X] Strain name (default 'strain')
211		--plasmid [X] Plasmid name or identifier (default '')
212		Annotations:
213		--kingdom [X] Annotation mode: Archaea\|Bacteria\|Mitochondria\|Viruses (default 'Bacteria')
214		--gcode [N] Genetic code / Translation table (set if --kingdom is set) (default '0')
215		--gram [X] Gram: -/neg +/pos (default '')
216		--usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)
217		--proteins [X] Fasta file of trusted proteins to first annotate from (default '')
218		--metagenome Improve gene predictions for highly fragmented genomes (default OFF)
219		--rawproduct Do not clean up /product annotation (default OFF)
220		Computation:
221		--fast Fast mode - skip CDS /product searching (default OFF)
222		--cpus [N] Number of CPUs to use [0=all] (default '8')
223		--mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
224		--evalue [n.n] Similarity e-value cut-off (default '1e-06')
225		--rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
226		--norrna Don't run rRNA search (default OFF)
227		--notrna Don't run tRNA search (default OFF)
228		--rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)
229
230		Dependencies
231		------------
232
233		- GNU Parallel
234		A shell tool for executing jobs in parallel using one or more
235		computers
236		O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The
237		USENIX Magazine, Feb 2011:42-47.
238
239		- BioPerl
240		Used for input/output of various file formats
241		Stajich et al, The Bioperl toolkit: Perl modules for the life
242		sciences. Genome Res. 2002 Oct;12(10):1611-8.
243
244		- Aragorn
245		Finds transfer RNA features (tRNA)
246		Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and
247		tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004 Jan
248		2;32(1):11-6.
249
250		- Barrnap
251		Used to predict ribosomal RNA features (rRNA). My licence-free
252		replacement for RNAmmmer.
253		Manuscript under preparation.
254
255		- RNAmmer
256		Finds ribosomal RNA features (rRNA)
257		Lagesen K et al. RNAmmer: consistent and rapid annotation of
258		ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100-8.
259
260		- Prodigal
261		Finds protein-coding features (CDS)
262		Hyatt D et al. Prodigal: prokaryotic gene recognition and
263		translation initiation site identification. BMC Bioinformatics. 2010
264		Mar 8;11:119.
265
266		- SignalP
267		Finds signal peptide features in CDS (sig_peptide)
268		Petersen TN et al. SignalP 4.0: discriminating signal peptides from
269		transmembrane regions. Nat Methods. 2011 Sep 29;8(10):785-6.
270
271		- BLAST+
272		Used for similarity searching against protein sequence libraries
273		Camacho C et al. BLAST+: architecture and applications. BMC
274		Bioinformatics. 2009 Dec 15;10:421.
275
276		- HMMER3
277		Used for similarity searching against protein family profiles
278		Finn RD et al. HMMER web server: interactive sequence similarity
279		searching. Nucleic Acids Res. 2011 Jul;39(Web Server issue):W29-37.
280
281		- Infernal
282		Used for similarity searching against ncRNA family profiles
283		D. L. Kolbe, S. R. Eddy. Fast Filtering for RNA Homology Search.
284		Bioinformatics, 27:3102-3109, 2011.
285
286		Databases
287		---------
288
289		The Core (BLAST+) Databases
290
291		Prokka uses a variety of databases when trying to assign function to the
292		predicted CDS features. It takes a hierarchial approach to make it fast.
293		A small, core set of well characterized proteins are first searched
294		using BLAST+. This combination of small database and fast search
295		typically completes about 70% of the workload. Then a series of slower
296		but more sensitive HMM databases are searched using HMMER3.
297
298		The initial core databases are derived from UniProtKB; there is one per
299		"kingdom" supported. To qualify for inclusion, a protein must be (1)
300		from Bacteria (or Archaea or Viruses); (2) not be "Fragment" entries;
301		and (3) have an evidence level ("PE") of 2 or lower, which corresponds
302		to experimental mRNA or proteomics evidence.
303
304		Making a Core Databases
305
306		If you want to modify these core databases, the included script
307		prokka-uniprot_to_fasta_db, along with the official uniprot_sprot.dat,
308		can be used to generate a new database to put in
309		/opt/prokka/db/kingdom/. If you add new ones, the command
310		prokka --listdb will show you whether it has been detected properly.
311
312		The Genus Databases
313
314		If you enable --usegenus and also provide a Genus via --genus then it
315		will first use a BLAST database which is Genus specific. Prokka comes
316		with a set of databases for the most common Bacterial genera; type
317		prokka --listdb to see what they are.
318
319		Adding a Genus Databases
320
321		If you have a set of Genbank files and want to create a new Genus
322		database, Prokka comes with a tool called prokka-genbank_to_fasta_db to
323		help. For example, if you had four annotated "Coccus" genomes, you could
324		do the following:
325
326		% prokka-genbank_to_fasta_db Coccus1.gbk Coccus2.gbk Coccus3.gbk Coccus4.gbk > Coccus.faa
327		% cd-hit -i Coccus.faa -o Coccus -T 0 -M 0 -g 1 -s 0.8 -c 0.9
328		% rm -fv Coccus.faa Coccus.bak.clstr Coccus.clstr
329		% makeblastdb -dbtype prot -in Coccus
330		% mv Coccus.p* /path/to/prokka/db/genus/
331
332		The HMM Databases
333
334		Prokka comes with a bunch of HMM libraries for HMMER3. They are mostly
335		Bacteria-specific. They are searched after the core and genus databases.
336		You can add more simply by putting them in /opt/prokka/db/hmm. Type
337		prokka --listdb to confirm they are recognised.
338
339		FASTA database format
340
341		Prokka understands two annotation tag formats, a plain one and a
342		detailed one.
343
344		The plain one is a standard FASTA-like line with the ID after the >
345		sign, and the protein /product after the ID (the "description" part of
346		the line):
347
348		>SeqID product
349
350		The detailed one consists of a special encoded three-part description
351		line. The parts are the /EC_number, the /gene code, then the /product -
352		and they are separated by a special "~" sequence:
353
354		>SeqID EC_number~~~gene~~~product
355
356		Here are some examples. Note that not all parts need to be present, but
357		the "~" should still be there:
358
359		>YP_492693.1 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
360		MNEKNIKHSQNFITSKHNIDKIMTNIRLNEHDNIFEIGSGKGHFTLELVQRCNFVTAIEI
361		DHKLCKTTENKLVDHDNFQVLNKDILQFKFPKNQSYKIFGNIPYNISTDIIRKIVF*
362		>YP_492697.1 ~~~traB~~~transfer complex protein TraB
363		MIKKFSLTTVYVAFLSIVLSNITLGAENPGPKIEQGLQQVQTFLTGLIVAVGICAGVWIV
364		LKKLPGIDDPMVKNEMFRGVGMVLAGVAVGAALVWLVPWVYNLFQ*
365		>YP_492694.1 ~~~~~~transposase
366		MNYFRYKQFNKDVITVAVGYYLRYALSYRDISEILRGRGVNVHHSTVYRWVQEYAPILYQ
367		QSINTAKNTLKGIECIYALYKKNRRSLQIYGFSPCHEISIMLAS*
368
369		The same description lines apply to HMM models, except the "NAME" and
370		"DESC" fields are used:
371
372		NAME PRK00001
373		ACC PRK00001
374		DESC 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
375		LENG 284
376
377		FAQ
378		---
379
380		- Where does the name "Prokka" come from?
381		Prokka is a contraction of "prokaryotic annotation". It's also
382		relatively unique within Google, and also rhymes with a native
383		Australian marsupial called the quokka.
384
385		- Can I annotate by eukaryote genome with Prokka?
386		No. Prokka is specifically designed for Bacteria, Archaea and
387		Viruses. It can't handle multi-exon gene models; I would recommend
388		using MAKER 2 for that purpose.
389
390		- Why does Prokka keeps on crashing when it gets to tge "tbl2asn"
391		stage?
392		It seems that the tbl2asn program from NCBI "expires" after 12
393		months, and refuses to run. Unfortunately you need to install a
394		newer version which you can download from here.
395
396		- The hmmscan step seems to hang and do nothing?
397		The problem here is GNU Parallel. It seems the Debian package for
398		hmmer has modified it to require the --gnu option to behave in the
399		'default' way. There is no clear reason for this. The only way to
400		restore normal behaviour is to edit the prokka script and change
401		parallel to parallel --gnu.
402
403		- Why does prokka fail when it gets to hmmscan?
404		Unfortunately HMMER keeps changing it's database format, and they
405		aren't upward compatible. If you upgraded HMMER (from 3.0 to 3.1
406		say) then you need to "re-press" the files. This can be done as
407		follows: cd /path/to/prokka/db/hmm mkdir new for D in *.hmm ; do
408		hmmconvert D > new/D ; done cd new for D in *.hmm ; do hmmpress $D ;
409		done mv * .. rmdir new
410
411		- Why does Prokka take so long to download?
412		Our server is in Australia, and the international pipes aren't
413		always flowing as well as we'd like. I try to put it on GoogleDrive.
414		Dropbox is no longer possible due to bandwidth quotas. If you are
415		able to mirror Prokka (~2 GB) outside please let me know.
416
417		- Why can't I load Prokka .GBK files into Mauve?
418		Mauve is very picky about Genbank files. It does not like long
419		contig names, like those from Velvet or Spades. The simple solution
420		is to use --centre XXX in Prokka and it will rename all your contigs
421		to be NCBI (and Mauve) compliant. It does not like the ACCESSION and
422		VERSION strings that Prokka produces via the "tbl2asn" tool. The
423		following Unix command will fix them:
424		egrep -v '^(ACCESSION\|VERSION)' prokka.gbk > mauve.gbk
425
426		Still To Do
427		-----------
428
429		- ToDoList.txt:
430		https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/doc/ToDoList.txt
431
432		Changes
433		-------
434
435		- ChangeLog.txt:
436		https://raw.githubusercontent.com/Victorian-Bioinformatics-Consortium/prokka/master/doc/ChangeLog.txt
437		- Github commits:
438		https://github.com/Victorian-Bioinformatics-Consortium/prokka/commits/master
439
440		Bugs
441		----
442
443		- tbl2asn seems to be removing the "(anti-codon)" part in my tRNA
444		/product values
445		- tbl2asn putting space in /inference for Infernal
446
447		Citation
448		--------
449
450		Seemann T.
451		Prokka: rapid prokaryotic genome annotation
452		Bioinformatics 2014 Jul 15;30(14):2068-9. PMID:24642063
453		DOI