Codebase list prokka / upstream/1.11+dfsg
Imported Upstream version 1.11+dfsg Andreas Tille 9 years ago
11 changed file(s) with 601 addition(s) and 749 deletion(s). Raw diff Collapse all Expand all
0 tmp/
1 blib/
2 .build/
3 _build/
4 cover_db/
5 inc/
6 Build
7 !Build/
8 Build.bat
9 .last_cover_stats
10 Makefile
11 Makefile.old
12 MANIFEST.bak
13 META.yml
14 MYMETA.yml
15 nytprof.out
16 pm_to_blib
17 doc/prokka-*
18 bug/
19 db/cm/*.i1*
20 db/kingdom/*/sprot.p*
21 db/hmm/*.h3?
22 db/genus/*.p*
23
0 # Prokka: rapid prokaryotic genome annotation
1
2 Torsten Seemann <torsten.seemann@monash.edu>
3 Victorian Bioinformatics Consortium, AUSTRALIA <http://vicbioinformatics.com>
4
5 ##Contents
6
7 - [Introduction](#introduction)
8 - [Installation](#installation)
9 - [Invoking Prokka](#invoking-prokka)
10 - [Output Files](#output-files)
11 - [Command line options](#command-line-options)
12 - [Dependencies](#dependencies)
13 - [Databases](#databases)
14 - [FAQ](#faq)
15 - [Still To Do](#still-to-do)
16 - [Changes](#changes)
17 - [Citation](#citation)
18
19 ##Introduction
20
21 Whole genome annotation is the process of identifying features of interest
22 in a set of genomic DNA sequences, and labelling them with useful
23 information. Prokka is a software tool to annotate bacterial, archaeal and
24 viral genomes quickly and produce standards-compliant output files.
25
26 ##Installation
27
28 ###Download
29
30 Download the latest `prokka-1.xx.tar.gz` archive from http://www.bioinformatics.net.au/software.prokka.shtml
31
32 ###Extract
33
34 Choose somewhere to put it, for example in your home directory (no root access required):
35
36 % cd $HOME
37 % tar zxvf prokka-1.xx.tar.gz
38 % ls prokka-1.xx
39
40 ###Add to PATH
41
42 Add the following line to your `$HOME/.bashrc` file,
43 or to `/etc/profile.d/prokka.sh` to make it available to all users:
44
45 export PATH=$PATH:$HOME/prokka-1.xx/bin
46
47 ###Index the sequence databases
48
49 % prokka --setupdb
50
51 ###Install dependencies
52
53 Prokka comes with many binaries for Linux and Mac OS X. It will always use your existing installed versions if they exist, but will use the included ones if that fails. For some older systems (eg. Centos 4.x) some of them won't work due to them being dynamically linked against new GLIBC libraries you don't have.
54
55 You can consult the list of dependencies later in this document.
56
57 ###Choose a rRNA predictor
58
59 ####Option 1 - Don't use one
60
61 If Prokka can't find a predictor for rRNA featues (either Barrnap or RNAmmer below) then it simply won't annotate any. Most people don't care that much about them anyway,
62
63 ####Option 2 - Barrnap
64
65 This was written by the author of Prokka and is recommended if you prefer speed over absolute accuracy. It uses the new multi-core NHMMER for DNA:DNA profile searches. Download it from http://www.vicbioinformatics.com/software.barrnap.shtml
66
67 ####Option 3 - RNAmmer
68
69 RNAmmer was written when HMMER 2.x was the latest release. Since them, HMMER 3.x has been released, and uses the same executable binary names. Prokka needs HMMER3 and RNAmmer (and hence HMMER2) so you need to edit your RNAmmer script to explicitly point your HMMER2 binary instead of using the HMMER3 binary which is more likely to be in your PATH first.
70
71 Type which rnammer to find the script, and then edit it with your favourite editor. Find the following lines at the top:
72
73 if ( $uname eq "Linux" ) {
74 # $HMMSEARCH_BINARY = "/usr/cbs/bio/bin/linux64/hmmsearch"; # OLD
75 $HMMSEARCH_BINARY = "/path/to/my/hmmer-2.3.2/bin/hmmsearch"; # NEW (yours)
76 }
77
78 If you are using Mac OS X, you'll also have to change the `"Linux"` to `"Darwin"` too. As you can see, I have commented out the original part, and replaced it with the location of my HMMER2 hmmsearch tool, so it doesn't run the HMMER3 one. You need to ensure HMMER3 is in your PATH before the old HMMER2 too.
79
80 ###Test
81
82 * Type `prokka` and it should output it's help screen.
83 * Type `prokka --version` and you should see an output like `prokka 1.x`
84 * Type `prokka --listdb` and it will show you what databases it has installed to use.
85
86
87 ##Invoking Prokka
88
89 ###Beginner
90
91 # Vanilla (but with free toppings)
92 % prokka contigs.fa
93
94 # Look for a folder called PROKKA_yyyymmdd (today's date) and look at stats
95 % cat PROKKA_yyyymmdd/*.txt
96
97 ###Moderate
98
99 # Choose the names of the output files
100 % prokka --outdir mydir --prefix mygenome contigs.fa
101
102 # Visualize it in Artemis
103 % art mydir/mygenome.gff
104
105 ###Expert
106
107 # It's not just for bacteria, people
108 % prokka --kingdom Archaea --outdir mydir --genus Pyrococcus --locustag PYCC
109
110 # Search for my favourite gene
111 % exonerate --bestn 1 zetatoxin.fasta mydir/PYCC_06072012.faa | less
112
113 ###Wizard
114
115 # Watch and learn
116 % prokka --outdir mydir --locustag EHEC --proteins NewToxins.faa --evalue 0.001 --gram neg --addgenes contigs.fa
117
118 # Check to see if anything went really wrong
119 % less mydir/EHEC_06072012.err
120
121 # Add final details using Sequin
122 % sequin mydir/EHEC_0607201.sqn
123
124 ###Genbank submitter
125
126 # Register your BioProject and your locus_tag prefix first!
127 % prokka --compliant --centre UoN --outdir PRJNA123456 --locustag EHEC --prefix EHEC-Chr1 contigs.fa
128
129 # Check to see if anything went really wrong
130 % less PRJNA123456/EHEC-Chr1.err
131
132 # Add final details using Sequin
133 % sequin PRJNA123456/EHEC-Chr1.sqn
134
135 ###Crazy Person
136
137 # No stinking Perl script is going to control me
138 % prokka \
139 --outdir $HOME/genomes/Ec_POO247 --force \
140 --prefix Ec_POO247 --addgenes --locustag ECPOOp \
141 --increment 10 --gffver 2 --centre CDC --compliant \
142 --genus Escherichia --species coli --strain POO247 --plasmid pECPOO247 \
143 --kingdom Bacteria --gcode 11 --usegenus \
144 --proteins /opt/prokka/db/trusted/Ecocyc-17.6 \
145 --evalue 1e-9 --rfam \
146 plasmid-closed.fna
147
148
149 ##Output Files
150
151 | Extension | Description |
152 | --------- | ----------- |
153 | .gff | This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV. |
154 | .gbk | This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence. |
155 | .fna | Nucleotide FASTA file of the input contig sequences. |
156 | .faa | Protein FASTA file of the translated CDS sequences. |
157 | .ffn | Nucleotide FASTA file of all the annotated sequences, not just CDS. |
158 | .sqn | An ASN1 format "Sequin" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. |
159 | .fsa | Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. |
160 | .tbl | Feature Table file, used by "tbl2asn" to create the .sqn file. |
161 | .err | Unacceptable annotations - the NCBI discrepancy report. |
162 | .log | Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled. |
163 | .txt | Statistics relating to the annotated features found. |
164
165 ##Command line options
166
167 General:
168 --help This help
169 --version Print version and exit
170 --docs Show full manual/documentation
171 --citation Print citation for referencing Prokka
172 --quiet No screen output (default OFF)
173 --debug Debug mode: keep all temporary files (default OFF)
174 Setup:
175 --listdb List all configured databases
176 --setupdb Index all installed databases
177 --cleandb Remove all database indices
178 --depends List all software dependencies
179 Outputs:
180 --outdir [X] Output folder [auto] (default '')
181 --force Force overwriting existing output folder (default OFF)
182 --prefix [X] Filename output prefix [auto] (default '')
183 --addgenes Add 'gene' features for each 'CDS' feature (default OFF)
184 --locustag [X] Locus tag prefix (default 'PROKKA')
185 --increment [N] Locus tag counter increment (default '1')
186 --gffver [N] GFF version (default '3')
187 --compliant Force Genbank/ENA/DDJB compliance: --genes --mincontiglen 200 --centre XXX (default OFF)
188 --centre [X] Sequencing centre ID. (default '')
189 Organism details:
190 --genus [X] Genus name (default 'Genus')
191 --species [X] Species name (default 'species')
192 --strain [X] Strain name (default 'strain')
193 --plasmid [X] Plasmid name or identifier (default '')
194 Annotations:
195 --kingdom [X] Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
196 --gcode [N] Genetic code / Translation table (set if --kingdom is set) (default '0')
197 --gram [X] Gram: -/neg +/pos (default '')
198 --usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)
199 --proteins [X] Fasta file of trusted proteins to first annotate from (default '')
200 --hmms [X] Trusted HMM to first annotate from (default '')
201 --metagenome Improve gene predictions for highly fragmented genomes (default OFF)
202 --rawproduct Do not clean up /product annotation (default OFF)
203 Computation:
204 --fast Fast mode - skip CDS /product searching (default OFF)
205 --cpus [N] Number of CPUs to use [0=all] (default '8')
206 --mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
207 --evalue [n.n] Similarity e-value cut-off (default '1e-06')
208 --rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
209 --norrna Don't run rRNA search (default OFF)
210 --notrna Don't run tRNA search (default OFF)
211 --rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)
212
213 ##Dependencies
214
215 * __GNU Parallel__
216 A shell tool for executing jobs in parallel using one or more computers
217 _O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, Feb 2011:42-47._
218
219 * __BioPerl__
220 Used for input/output of various file formats
221 _Stajich et al, The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct;12(10):1611-8._
222
223 * __Aragorn__
224 Finds transfer RNA features (tRNA)
225 _Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004 Jan 2;32(1):11-6._
226
227 * __Barrnap__
228 Used to predict ribosomal RNA features (rRNA). My licence-free replacement for RNAmmmer.
229 _Manuscript under preparation._
230
231 * __RNAmmer__
232 Finds ribosomal RNA features (rRNA)
233 _Lagesen K et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100-8._
234
235 * __Prodigal__
236 Finds protein-coding features (CDS)
237 _Hyatt D et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11:119._
238
239 * __SignalP__
240 Finds signal peptide features in CDS (sig_peptide)
241 _Petersen TN et al. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods. 2011 Sep 29;8(10):785-6._
242
243 * __BLAST+__
244 Used for similarity searching against protein sequence libraries
245 _Camacho C et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009 Dec 15;10:421._
246
247 * __HMMER3__
248 Used for similarity searching against protein family profiles
249 _Finn RD et al. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011 Jul;39(Web Server issue):W29-37._
250
251 * __Infernal__
252 Used for similarity searching against ncRNA family profiles
253 _D. L. Kolbe, S. R. Eddy. Fast Filtering for RNA Homology Search. Bioinformatics, 27:3102-3109, 2011._
254
255
256 ##Databases
257
258 ###The Core (BLAST+) Databases
259
260 Prokka uses a variety of databases when trying to assign function to the predicted CDS features. It takes a hierarchial approach to make it fast. A small, core set of well characterized proteins are first searched using BLAST+. This combination of small database and fast search typically completes about 70% of the workload. Then a series of slower but more sensitive HMM databases are searched using HMMER3.
261
262 The initial core databases are derived from UniProtKB; there is one per "kingdom" supported. To qualify for inclusion, a protein must be (1) from Bacteria (or Archaea or Viruses); (2) not be "Fragment" entries; and (3) have an evidence level ("PE") of 2 or lower, which corresponds to experimental mRNA or proteomics evidence.
263
264 ####Making a Core Databases
265
266 If you want to modify these core databases, the included script `prokka-uniprot_to_fasta_db`, along with the official `uniprot_sprot.dat`, can be used to generate a new database to put in `/opt/prokka/db/kingdom/`. If you add new ones, the command `prokka --listdb` will show you whether it has been detected properly.
267
268 ####The Genus Databases
269
270 If you enable `--usegenus` and also provide a Genus via `--genus` then it will first use a BLAST database which is Genus specific. Prokka comes with a set of databases for the most common Bacterial genera; type prokka `--listdb` to see what they are.
271
272 ####Adding a Genus Databases
273
274 If you have a set of Genbank files and want to create a new Genus database, Prokka comes with a tool called
275 `prokka-genbank_to_fasta_db` to help. For example, if you had four annotated "Coccus" genomes, you could do the following:
276
277 % prokka-genbank_to_fasta_db Coccus1.gbk Coccus2.gbk Coccus3.gbk Coccus4.gbk > Coccus.faa
278 % cd-hit -i Coccus.faa -o Coccus -T 0 -M 0 -g 1 -s 0.8 -c 0.9
279 % rm -fv Coccus.faa Coccus.bak.clstr Coccus.clstr
280 % makeblastdb -dbtype prot -in Coccus
281 % mv Coccus.p* /path/to/prokka/db/genus/
282
283 ###The HMM Databases
284
285 Prokka comes with a bunch of HMM libraries for HMMER3. They are mostly Bacteria-specific. They are searched after the core and genus databases. You can add more simply by putting them in `/opt/prokka/db/hmm`. Type `prokka --listdb` to confirm they are recognised.
286
287 ###FASTA database format
288
289 Prokka understands two annotation tag formats, a plain one and a detailed one.
290
291 The plain one is a standard FASTA-like line with the ID after the `>` sign, and the protein `/product`
292 after the ID (the "description" part of the line):
293
294 >SeqID product
295
296 The detailed one consists of a special encoded three-part description line. The parts are the `/EC_number`,
297 the `/gene` code, then the `/product` - and they are separated by a special "~~~" sequence:
298
299 >SeqID EC_number~~~gene~~~product
300
301 Here are some examples. Note that not all parts need to be present, but the "~~~" should still be there:
302
303 >YP_492693.1 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
304 MNEKNIKHSQNFITSKHNIDKIMTNIRLNEHDNIFEIGSGKGHFTLELVQRCNFVTAIEI
305 DHKLCKTTENKLVDHDNFQVLNKDILQFKFPKNQSYKIFGNIPYNISTDIIRKIVF*
306 >YP_492697.1 ~~~traB~~~transfer complex protein TraB
307 MIKKFSLTTVYVAFLSIVLSNITLGAENPGPKIEQGLQQVQTFLTGLIVAVGICAGVWIV
308 LKKLPGIDDPMVKNEMFRGVGMVLAGVAVGAALVWLVPWVYNLFQ*
309 >YP_492694.1 ~~~~~~transposase
310 MNYFRYKQFNKDVITVAVGYYLRYALSYRDISEILRGRGVNVHHSTVYRWVQEYAPILYQ
311 QSINTAKNTLKGIECIYALYKKNRRSLQIYGFSPCHEISIMLAS*
312
313 The same description lines apply to HMM models, except the "NAME" and "DESC" fields are used:
314
315 NAME PRK00001
316 ACC PRK00001
317 DESC 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
318 LENG 284
319
320
321 ##FAQ
322
323 * __Where does the name "Prokka" come from?__
324 Prokka is a contraction of "prokaryotic annotation". It's also relatively unique within Google, and also rhymes with a native Australian marsupial called the quokka.
325
326 * __Can I annotate by eukaryote genome with Prokka?__
327 No. Prokka is specifically designed for Bacteria, Archaea and Viruses. It can't handle multi-exon gene models; I would recommend using MAKER 2 for that purpose.
328
329 * __Why does Prokka keeps on crashing when it gets to tge "tbl2asn" stage?__
330 It seems that the tbl2asn program from NCBI "expires" after 12 months, and refuses to run. Unfortunately you need to install a newer version which you can download from here.
331
332 * __The hmmscan step seems to hang and do nothing?__
333 The problem here is GNU Parallel. It seems the Debian package for hmmer has modified it to require the `--gnu` option to behave in the 'default' way. There is no clear reason for this. The only way to restore normal behaviour is to edit the prokka script and change `parallel` to `parallel --gnu`.
334
335 * __Why does prokka fail when it gets to hmmscan?__
336 Unfortunately HMMER keeps changing it's database format, and they aren't upward compatible. If you upgraded HMMER (from 3.0 to 3.1 say) then you need to "re-press" the files. This can be done as follows:
337 cd /path/to/prokka/db/hmm
338 mkdir new
339 for D in *.hmm ; do hmmconvert $D > new/$D ; done
340 cd new
341 for D in *.hmm ; do hmmpress $D ; done
342 mv * ..
343 rmdir new
344
345 * __Why does Prokka take so long to download?__
346 Our server is in Australia, and the international pipes aren't always flowing as well as we'd like. I try to put it on GoogleDrive. Dropbox is no longer possible due to bandwidth quotas. If you are able to mirror Prokka (~2 GB) outside please let me know.
347
348 * __Why can't I load Prokka .GBK files into Mauve?__
349 Mauve uses BioJava to parse GenBank files, and it is very picky about Genbank files.
350 It does not like long contig names,
351 like those from Velvet or Spades. One solution is to use `--centre XXX`
352 in Prokka and it will rename all your contigs to be NCBI (and Mauve)
353 compliant. It does not like the ACCESSION and VERSION strings that Prokka
354 produces via the "tbl2asn" tool. The following Unix command will fix them:
355 `egrep -v '^(ACCESSION|VERSION)' prokka.gbk > mauve.gbk`
356
357
358 ##Still To Do
359
360 * ToDoList.txt: https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/doc/ToDoList.txt
361
362
363 ##Changes
364
365 * ChangeLog.txt: https://raw.githubusercontent.com/Victorian-Bioinformatics-Consortium/prokka/master/doc/ChangeLog.txt
366 * Github commits: https://github.com/Victorian-Bioinformatics-Consortium/prokka/commits/master
367
368 ##Citation
369
370 Seemann T.
371 *Prokka: rapid prokaryotic genome annotation*
372 **Bioinformatics** 2014 Jul 15;30(14):2068-9.
373 [PMID:24642063](http://www.ncbi.nlm.nih.gov/pubmed/24642063)
3131 use Bio::Seq;
3232 use Bio::SeqFeature::Generic;
3333 use Bio::Tools::GFF;
34 use Bio::Tools::GuessSeqFormat;
3435 use FindBin;
3536
3637 # . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4041 my $OPSYS = $^O;
4142 my $BINDIR = "$FindBin::RealBin/../binaries/$OPSYS";
4243 my $EXE = $FindBin::RealScript;
43 my $VERSION = "1.10";
44 my $VERSION = "1.11";
4445 my $AUTHOR = 'Torsten Seemann <torsten.seemann@monash.edu>';
4546 my $URL = 'http://www.vicbioinformatics.com';
4647 my $DBDIR = "$FindBin::RealBin/../db";
4748 my $HYPO = 'hypothetical protein';
4849 my $UNANN = 'unannotated protein';
49 my $MAXCONTIGIDLEN = 38; # Genbank rule
50 my $MAXCONTIGIDLEN = 20; # Genbank rule
5051
5152 # these should accept .faa on STDIN and write report to STDOUT
5253 my $BLASTPCMD = "blastp -query - -db %d -evalue %e -num_threads 1 -num_descriptions 1 -num_alignments 1 -seg no";
5657 my $aragorn_opt = '';
5758
5859 # debian package broke compatibility so have to force it now *grumble*
59 my $PARALLELCMD = "parallel --gnu";
60 my $PARALLELCMD = "parallel --gnu --plain";
6061
6162 # not used anymore
6263 #my $INFERNALCMD = "cmscan --noali --notextw --acc -E %e --cpu 1 -o %o %d %i 2>/dev/null";
6364
6465 my $starttime = localtime;
6566 my %seq;
67 my @seq;
6668
6769 # . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6870 # table of tools we need/optional and min versions
163165 'egrep' => { NEEDED=>1 },
164166 'sed' => { NEEDED=>1 },
165167 'find' => { NEEDED=>1 },
168 # for the new --proteins option ability to take .gbk or .embl files
169 'prokka-genbank_to_fasta_db' => { NEEDED=>1 },
166170 );
167171
168172 # . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
215219 # . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
216220 # command line options
217221
218 my(@Options, $quiet, $kingdom, $fast, $force, $outdir, $prefix, $cpus, $addgenes,
222 my(@Options, $quiet, $debug, $kingdom, $fast, $force, $outdir, $prefix, $cpus, $addgenes,
219223 $gcode, $gram, $gffver, $locustag, $increment, $mincontiglen, $evalue, $coverage,
220224 $genus, $species, $strain, $plasmid,
221 $usegenus, $proteins, $centre, $scaffolds,
225 $usegenus, $proteins, $hmms, $centre, $scaffolds,
222226 $rfam, $norrna, $notrna, $rnammer, $rawproduct,
223227 $metagenome, $compliant, $listdb, $citation);
224228 setOptions();
232236 msg("Local time is $starttime");
233237 msg("You are", $ENV{USER} || 'not telling me who you are!');
234238 msg("Operating system is $OPSYS");
239 msg("You have enabled DEBUG mode. Temporary files will NOT be deleted.") if $debug;
235240
236241 # . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
237242 # check BioPerl version
240245 my $bpver = $Bio::Root::Version::VERSION;
241246 msg("You have BioPerl $bpver");
242247 err("Please install BioPerl $minbpver or higher") if $bpver < $minbpver;
248
249 # . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
250 # Check some incompatible options
251
252 $fast && $proteins and err("In --fast mode, the --proteins will not be used!");
253 $fast && $hmms and err("In --fast mode, the --hmms will not be used!");
243254
244255 # . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
245256 # Determine CPU cores available
312323 }
313324
314325 #$centre or err("You must set --centre or the NCBI tbl2asn tool won't work properly, sorry.");
315 ($gcode < 1 or $gcode > 24) and err("Invalid genetic code, must be 1..24");
326 ($gcode < 1 or $gcode > 25) and err("Invalid genetic code, must be 1..25");
316327 $evalue >= 0 or err("Invalid --evalue, must be >= 0");
317328 #($coverage >= 0 and $coverage <= 100) or err("Invalid --coverage, must be 0..100");
318329 $increment >= 1 or err("Invalid --increment, must be >= 1");
392403 my $contigprefix = $locustag || $prefix || $outdir || $strain || '';
393404 $contigprefix .= '_' if $contigprefix;
394405 my $contig_name_len = length($centre) + 1 + length($contigprefix) + 6;
395 if ($compliant and $contig_name_len > $MAXCONTIGIDLEN) {
396 err("Genbank contig IDs are $contig_name_len chars, must be <= $MAXCONTIGIDLEN. Prefix is: $contigprefix");
406 if ($compliant) {
407 my $contig_name_len = length($centre) + 1 + length($contigprefix) + 6;
408 if ($contig_name_len > $MAXCONTIGIDLEN) {
409 err("Genbank contig IDs are $contig_name_len chars, must be <= $MAXCONTIGIDLEN. Prefix is: $contigprefix");
410 }
397411 }
398412
399413 while (my $seq = $fin->next_seq) {
408422 $seq->id( sprintf "gnl|$centre|${contigprefix}contig%06d", $ncontig );
409423 }
410424 if (length($seq->id) > $MAXCONTIGIDLEN) {
411 msg("WARNING: Contig IDs must be less than 38 characters for Genbank compliance - ".$seq->id)
425 msg("Contig ID must <= $MAXCONTIGIDLEN chars long: ".$seq->id);
426 err("Please rename your contigs or use --centre XXX to generate clean contig names.");
412427 }
413428 my $s = $seq->seq;
414429 $s = uc($s);
421436 err("Uh oh! Sequence file '$in' contains duplicate sequence ID:", $seq->id);
422437 }
423438 $seq{ $seq->id }{DNA} = $seq;
439 push @seq, $seq->id; # this array it used to preserve original contig order
424440 }
425441 $ncontig > 0 or err("FASTA file '$in' contains no suitable sequence entries");
426442 msg("Wrote $ncontig contigs");
427 #msg(sort keys %seq); exit;
428443
429444 # . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
430445 # tRNA + tmRNA
567582 my $num_ncrna = 0;
568583 my $tool = "Infernal:".$tools{'cmscan'}->{VERSION};
569584 my $icpu = $cpus || 1;
570 open INFERNAL, '-|', "cmscan --cpu $icpu -E $evalue --tblout /dev/stdout -o /dev/null --noali $cmdb \Q$outdir/$prefix.fna\E";
585 my $cmd = "cmscan --cpu $icpu -E $evalue --tblout /dev/stdout -o /dev/null --noali $cmdb \Q$outdir/$prefix.fna\E";
586 msg("Running: $cmd");
587 open INFERNAL, '-|', $cmd;
571588 while (<INFERNAL>) {
572589 next if m/^#/; # ignore comments
573590 my @x = split ' '; # magic Perl whitespace splitter
607624 # Tally all the RNA features __ which we want to exclude overlaps with CDS __
608625
609626 my @allrna;
610 for my $sid (sort keys %seq) {
627 for my $sid (@seq) {
611628 push @allrna, (grep { $_->primary_tag =~ m/[tr]RNA/ } @{ $seq{$sid}{FEATURE} });
612629 }
613630 msg("Total of", scalar(@allrna), "tRNA + rRNA features");
641658 # CDS
642659
643660 msg("Predicting coding sequences");
644 my $totalbp = sum( map { $seq{$_}{DNA}->length } keys %seq);
661 my $totalbp = sum( map { $seq{$_}{DNA}->length } @seq);
645662 my $prodigal_mode = ($totalbp >= 100000 && !$metagenome) ? 'single' : 'meta';
646663 msg("Contigs total $totalbp bp, so using $prodigal_mode mode");
647664 my $num_cds=0;
700717 # Connect features to their parent sequences
701718
702719 msg("Connecting features back to sequences");
703 for my $sid (sort keys %seq) {
720 for my $sid (@seq) {
704721 for my $f (@{ $seq{$sid}{FEATURE} }) {
705722 $f->attach_seq( $seq{$sid}{DNA} );
706723 }
722739 my $spout = Bio::SeqIO->new(-fh=>$spoutfh, -format=>'fasta');
723740 my %cds;
724741 my $count=0;
725 for my $sid (sort keys %seq) {
742 for my $sid (@seq) {
726743 for my $f (@{ $seq{$sid}{FEATURE} }) {
727744 next unless $f->primary_tag eq 'CDS';
728745 $cds{++$count} = $f;
729 my $seq = $f->seq->translate;
746 my $seq = $f->seq->translate(-codontable_id=>$gcode, -complete=>1);
730747 $seq->display_id($count);
731748 $spout->write_seq($seq);
732749 }
746763 my $prob = $x[5];
747764 my $cleave = $x[3];
748765 my $start = $parent->strand > 0 ? $parent->start : $parent->end;
749 my $end = $start + $parent->strand * ($cleave - 1);
766 # need to convert to DNA coordinates
767 my $end = $start + $parent->strand * ($cleave*3 - 1);
750768 my $sigpep = Bio::SeqFeature::Generic->new(
751769 -seq_id => $parent->seq_id,
752770 -source_tag => $tool,
772790 my $parent = $cds{ $x[0] };
773791 my $cleave = $x[2];
774792 my $start = $parent->strand > 0 ? $parent->start : $parent->end;
775 my $end = $start + $parent->strand * ($cleave - 1);
793 # need to convert to DNA coordinates
794 my $end = $start + $parent->strand * ($cleave*3 - 1);
776795 my $sigpep = Bio::SeqFeature::Generic->new(
777796 -seq_id => $parent->seq_id,
778797 -source_tag => $tool,
817836
818837 # secondary sources are a series of HMMs
819838 unless ($kingdom eq 'Viruses') {
820 for my $name (qw(HAMAP CLUSTERS Pfam)) {
821 push @database, {
822 DB => "$DBDIR/hmm/$name.hmm",
823 SRC => "protein motif:$name:",
824 FMT => 'hmmer3',
825 CMD => $HMMER3CMD,
826 VERSION => 3, # without this, latest Bioperl goes into infinite loop
827 };
839 for my $name ( hmms() ) {
840 my $dbfile = "$DBDIR/hmm/$name.hmm";
841 if (-r $dbfile) {
842 push @database, {
843 DB => $dbfile,
844 SRC => "protein motif:$name:",
845 FMT => 'hmmer3',
846 CMD => $HMMER3CMD,
847 VERSION => 3, # without this, latest Bioperl goes into infinite loop
848 }
849 }
850 else {
851 msg("Will not use $name HMM database, $dbfile not installed.");
852 }
828853 }
829854 }
830855
849874 msg("Not using genus-specific database. Try --usegenus to enable it.");
850875 }
851876
877 # if user supplies a trusted set of HMMs, we try these first!
878 if (-r $hmms) {
879 msg("Preparing user-supplied primary HMMER annotation source: $hmms");
880 -r "$hmms.h3i" or err("Your HMM is not indexed, please run: hmmpress $hmms");
881 my $src = $hmms;
882 $src =~ s{^.*/}{};
883 $src =~ s/.hmm$//;
884 msg("Using /inference source as '$src'");
885 unshift @database, {
886 DB => $hmms,
887 SRC => "protein motif:$src:",
888 FMT => 'hmmer3',
889 CMD => $HMMER3CMD,
890 VERSION => 3, # without this, latest Bioperl goes into infinite loop
891 };
892 }
893
852894 # if user supplies a trusted set of proteins, we try these first!
853895 if (-r $proteins) {
854 msg("Preparing user-supplied primary annotation source: $proteins");
855 runcmd("makeblastdb -dbtype prot -in \Q$proteins\E -out \Q$outdir/proteins\E -logfile /dev/null");
896 msg("Preparing user-supplied primary BLAST annotation source: $proteins");
897 my $faa_file = $proteins;
898 my $format = Bio::Tools::GuessSeqFormat->new(-file=>$proteins)->guess
899 or err("could not determine format of --proteins file '$proteins'");
900 msg("Guessed source was in $format format.");
901 if ($format =~ m/^(embl|genbank)$/) {
902 $faa_file = "$outdir/proteins.faa";
903 msg("Converting $format '$proteins' into Prokka FASTA '$faa_file'");
904 runcmd("prokka-genbank_to_fasta_db --format $format \Q$proteins\E > \Q$faa_file\E 2> /dev/null");
905 }
906 elsif ($format ne 'fasta') {
907 err("Option --proteins only supports FASTA, GenBank, and EMBL formats.");
908 }
909 runcmd("makeblastdb -dbtype prot -in \Q$faa_file\E -out \Q$outdir/proteins\E -logfile /dev/null");
856910 my $src = $proteins;
857911 $src =~ s{^.*/}{};
858912 msg("Using /inference source as '$src'");
876930
877931 for my $db (@database) {
878932
933 # create a unqiue output name so we can save them in --debug mode
934 my $outname = $db->{DB};
935 $outname =~ s{^.*/}{};
936
879937 # we write out all the CDS which haven't been annotated yet and then search them
880 my $faa_name = "$outdir/proteins.faa";
938 my $faa_name = "$outdir/$outname.faa";
881939 open my $faa, '>', $faa_name;
882940
883941 my %cds;
884942 my $count=0;
885 for my $sid (sort keys %seq) {
943 for my $sid (@seq) {
886944 for my $f (@{ $seq{$sid}{FEATURE} }) {
887945 next unless $f->primary_tag eq 'CDS';
888946 next if $f->has_tag('product');
889947 $cds{++$count} = $f;
890 print $faa ">$count\n",$f->seq->translate->seq,"\n";
948 print $faa ">$count\n",
949 $f->seq->translate(-codontable_id=>$gcode, -complete=>1)->seq,"\n";
891950 }
892951 }
893952 close $faa;
909968 my $bsize = int($faa_bytes / $cpus / 2); # div 2 to allow for slow vs fast subtasks?
910969 my $paropts = $cpus > 0 ? " -j $cpus" : "";
911970
912 my $bls_name = "$outdir/proteins.bls";
971 my $bls_name = "$outdir/$outname.".$db->{FMT};
913972 runcmd(
914973 "cat \Q$faa_name\E | ${PARALLELCMD}$paropts --block $bsize --recstart '>' --pipe".
915974 " $cmd > \Q$bls_name\E 2> /dev/null"
9431002 $cds{$pid}->add_tag_value('inference', $db->{SRC}.$hit->name);
9441003 }
9451004 msg("Cleaned $num_cleaned /product names") if $num_cleaned > 0;
946 delfile( $faa_name, $bls_name);
1005 delfile($faa_name, $bls_name);
9471006 }
9481007 }
9491008
9561015
9571016 my $empty_label = $fast ? 'unannotated protein' : $HYPO;
9581017 my $num_hypo=0;
959 for my $sid (sort keys %seq) {
1018 for my $sid (@seq) {
9601019 for my $f ( @{ $seq{$sid}{FEATURE} }) {
9611020 if ($f->primary_tag eq 'CDS' and not $f->has_tag('product')) {
9621021 $f->add_tag_value('product', $empty_label);
9691028 # . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9701029 # Look for possible /pseudo genes - adjacent with same annotation
9711030
972 for my $sid (sort keys %seq) {
1031 for my $sid (@seq) {
9731032 my $prev = '';
9741033 for my $f ( grep { $_->primary_tag eq 'CDS' } @{ $seq{$sid}{FEATURE} } ) {
9751034 my $this = TAG($f, 'product');
9871046
9881047 my %collide;
9891048
990 for my $sid (sort keys %seq) {
1049 for my $sid (@seq) {
9911050 for my $f ( sort { $a->start <=> $b->start } @{ $seq{$sid}{FEATURE} }) {
9921051 next unless $f->primary_tag eq 'CDS';
9931052 my $gene = TAG($f, 'gene') or next;
1053 $gene =~ s/_(\d+)$//; # remove existing de-duplicated suffix
9941054 push @{ $collide{$gene} }, $f;
9951055 }
9961056 }
10061066 $n++;
10071067 $f->add_tag_value('gene', "${gene}_${n}");
10081068 }
1069 msg("Fixed $n duplicate /gene -", map { $_->get_tag_values('gene') } @cds);
10091070 $num_collide++;
10101071 }
10111072 msg("Fixed $num_collide colliding /gene names.");
10151076
10161077 msg("Adding /locus_tag identifiers");
10171078 my $num_lt=0;
1018 for my $sid (sort keys %seq) {
1079 for my $sid (@seq) {
10191080 for my $f ( sort { $a->start <=> $b->start } @{ $seq{$sid}{FEATURE} }) {
10201081 next unless $f->primary_tag =~ m/CDS|RNA/;
10211082 $num_lt++;
10291090 if ($addgenes) {
10301091 # make a 'sister' gene feature for the CDS feature
10311092 # (ideally it would encompass the UTRs as well, but we don't know them)
1093 my $gene_id = "${ID}_gene";
10321094 my $g = Bio::SeqFeature::Generic->new(
10331095 -primary => 'gene',
10341096 -seq_id => $f->seq_id,
10361098 -end => $f->end,
10371099 -strand => $f->strand,
10381100 -source_tag => $EXE,
1039 -tag => { 'locus_tag'=> $ID },
1101 -tag => {
1102 'locus_tag' => $ID,
1103 'ID' => $gene_id, # Add suffix to ID for GFF output
1104 },
10401105 );
1106 # Make a Parent tag from the CDS to the gene
1107 $f->add_tag_value('Parent', $gene_id);
1108 # copy the /gene from the CDS
10411109 if (my $gENE = TAG($f, 'gene')) {
10421110 $g->add_tag_value('gene', $gENE);
10431111 }
10591127
10601128 my $gff_factory = Bio::Tools::GFF->new(-gff_version=>$gffver);
10611129 print $gff_fh "##gff-version $gffver\n";
1062 for my $id (sort keys %seq) {
1130 for my $id (@seq) {
10631131 print $gff_fh "##sequence-region $id 1 ", $seq{$id}{DNA}->length, "\n";
10641132 }
10651133
10661134 my $fsa_desc = "[gcode=$gcode] [organism=$genus $species] [strain=$strain]";
10671135 $fsa_desc .= " [plasmid=$plasmid]" if $plasmid;
10681136
1069 for my $sid (sort keys %seq) {
1137 for my $sid (@seq) {
10701138 my $ctg = $seq{$sid}{DNA};
10711139 $ctg->desc($fsa_desc);
10721140 $fsa_fh->write_seq($ctg);
10951163 $ffn_fh->write_seq($p);
10961164 }
10971165 if ($f->primary_tag eq 'CDS') {
1098 $faa_fh->write_seq( $p->translate(-codontable_id=>$gcode) );
1099 }
1100 }
1101 }
1102
1103 if (scalar keys %seq) {
1166 $faa_fh->write_seq( $p->translate(-codontable_id=>$gcode, -complete=>1) );
1167 }
1168 }
1169 }
1170
1171 if (@seq) {
11041172 print $gff_fh "##FASTA\n";
11051173 my $seqio = Bio::SeqIO->new(-fh=>$gff_fh, -format=>'fasta');
1106 for my $sid (sort keys %seq) {
1174 for my $sid (@seq) {
11071175 $seqio->write_seq($seq{$sid}{DNA});
11081176 }
11091177 }
11161184 select $txtFh;
11171185
11181186 printf "organism: $genus $species $strain $plasmid\n";
1119 printf "contigs: %d\n", scalar(keys %seq);
1120 printf "bases: %d\n", sum( map { $seq{$_}{DNA}->length } keys %seq );
1187 printf "contigs: %d\n", scalar(@seq);
1188 printf "bases: %d\n", sum( map { $seq{$_}{DNA}->length } @seq );
11211189
11221190 my %count;
1123 for my $sid (sort keys %seq) {
1191 for my $sid (@seq) {
11241192 for my $f (@{ $seq{$sid}{FEATURE} }) {
11251193 $count{ $f->primary_tag }++;
11261194 }
11491217 );
11501218 delfile("$outdir/errorsummary.val");
11511219 delfile( map { "$outdir/$prefix.$_" } qw(dr fixedproducts ecn val) );
1152 move("$outdir/$prefix.gbf", "$outdir/$prefix.gbk"); # rename XXX.gbf to XXX.gbk
1220
1221 msg("Repairing broken .GBK output that tbl2asn produces...");
1222 runcmd("sed 's/COORDINATES: profile/COORDINATES:profile/' < \Q$outdir/$prefix.gbf\E > \Q$outdir/$prefix.gbk\E");
1223 delfile("$outdir/$prefix.gbf");
11531224
11541225 # . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11551226 # Some final log output
11811252
11821253 $p =~ s/^arCOG\d+\s+//;
11831254 $p =~ s/\((EC|COG).*?\)//;
1184 $p =~ s/\s*\w+\d{4,}c?//; # remove possible locus tags
1255 $p =~ s/\s+\w+\d{4,}c?//; # remove possible locus tags
11851256 $p =~ s/ and (inactivated|related) \w+//;
1257 $p =~ s/,\s*family$//;
11861258
11871259 $p =~ s/^(potential|possible|probable|predicted|uncharacteri.ed)/putative/i;
11881260 if ($p =~ m/(domain|family|binding|fold|like|motif|repeat)\s*$/i and $p !~ m/,/) {
12591331
12601332 sub delfile {
12611333 for my $file (@_) {
1262 msg("Deleting unwanted file:", $file);
1263 unlink $file;
1334 if ($debug) {
1335 msg("In --debug mode, saving temporary file:", $file);
1336 }
1337 else {
1338 msg("Deleting unwanted file:", $file);
1339 unlink $file;
1340 }
12641341 }
12651342 }
12661343
13951472 {OPT=>"docs", VAR=>\&showdoc, DESC=>"Show full manual/documentation"},
13961473 {OPT=>"citation",VAR=>\&show_citation, DESC=>"Print citation for referencing Prokka"},
13971474 {OPT=>"quiet!", VAR=>\$quiet, DEFAULT=>0, DESC=>"No screen output"},
1475 {OPT=>"debug!", VAR=>\$debug, DEFAULT=>0, DESC=>"Debug mode: keep all temporary files"},
13981476 'Setup:',
13991477 {OPT=>"listdb", VAR=>\&list_db, DESC=>"List all configured databases"},
14001478 {OPT=>"setupdb", VAR=>\&setup_db, DESC=>"Index all installed databases"},
14221500 {OPT=>"gram=s", VAR=>\$gram, DEFAULT=>'', DESC=>"Gram: -/neg +/pos"},
14231501 {OPT=>"usegenus!", VAR=>\$usegenus, DEFAULT=>0, DESC=>"Use genus-specific BLAST databases (needs --genus)"},
14241502 {OPT=>"proteins=s", VAR=>\$proteins, DEFAULT=>'', DESC=>"Fasta file of trusted proteins to first annotate from"},
1503 {OPT=>"hmms=s", VAR=>\$hmms, DEFAULT=>'', DESC=>"Trusted HMM to first annotate from"},
14251504 {OPT=>"metagenome!", VAR=>\$metagenome, DEFAULT=>0, DESC=>"Improve gene predictions for highly fragmented genomes"},
14261505 {OPT=>"rawproduct!", VAR=>\$rawproduct, DEFAULT=>0, DESC=>"Do not clean up /product annotation"},
14271506 'Computation:',
00 #!/usr/bin/perl -w
11 use strict;
22 use Bio::SeqIO;
3 use Bio::Tools::CodonTable;
34 use Data::Dumper;
45
5 my(@Options, $verbose, $format, $hypo, $idtag, $sep, $blank, $pseudo, $minlen);
6 # gene complement(1..1200)
7 # /locus_tag="P148_SR1C00001G0001"
8 # CDS complement(1..1200)
9 # /locus_tag="P148_SR1C00001G0001"
10 # /note="RAAC1_SR1_1_1"
11 # /codon_start=1
12 # /transl_table=25
13 # /product="MAEBL"
14 # /protein_id="AHB40819.1"
15 # /db_xref="GI:563351664"
16
17 my @IDTAG = ('protein_id', 'locus_tag', 'db_xref');
18
19 my(@Options, $verbose, $format, $hypo, $idtag, $sep, $blank, $pseudo, $minlen, $gcode);
620 setOptions();
721
822 my $in = Bio::SeqIO->new(-fh=>\*ARGV, -format=>$format);
923 my $out = Bio::SeqIO->new(-fh=>\*STDOUT, -format=>'Fasta');
1024
25 my @TRYIDTAG = $idtag ? ($idtag) : @IDTAG;
26 print STDERR "Will use first of (@TRYIDTAG) as FASTA ID\n";
27
28 # support different /transl_table=XX tags in CDS
29 #my $table = Bio::Tools::CodonTable->new();
30 #printf STDERR "Default /transl_table is %d - %s\n", $table->id, $table->name;
31
1132 while (my $seq = $in->next_seq) {
12 print STDERR "\rParsing: ",$seq->display_id;
33 print STDERR "\rParsing: ",$seq->display_id, "\n";
1334 my $counter = 0;
1435 for my $f ($seq->get_SeqFeatures) {
1536
2142
2243 my $prod = TAG($f, 'product') || $blank;
2344
24 next if $prod eq 'hypothetical protein';
45 next if !$hypo and $prod eq 'hypothetical protein';
2546
2647 my $cds = $f->spliced_seq; # don't forget eukaryotes!
2748
28 my $id = TAG($f, $idtag) or die "feature #$counter does not have /$idtag tag!";
49 my $id;
50 for my $idtag (@TRYIDTAG) {
51 $id = TAG($f, $idtag);
52 last if $id;
53 }
54 $id or die "\nFeature #$counter does not have any of these tags: @TRYIDTAG";
2955 $counter++;
30 print STDERR "Processing: [$counter] ", $seq->display_id, " | $id | $prod\n" if $verbose;
56 print STDERR "[$counter] ", $seq->display_id, " | $id | $prod\n" if $verbose;
3157
3258 # HANDLE CODON START FOR FUZZY FEATURES!
3359 if ($f->has_tag('codon_start')) {
3460 my($cs) = $f->get_tag_values('codon_start');
3561 if ($cs != 1) {
36 print STDERR "/codon_start=$cs - trimming mRNA!";
62 print STDERR "\t/codon_start=$cs - trimming mRNA!\n";
3763 $cds = $cds->trunc($cs, $cds->length);
3864 }
3965 }
4066 #END
4167
42
43 # DNA -> AA
44 $cds = $cds->translate;
68 # DNA -> AA
69 my $tt = $gcode || TAG($f, 'transl_table') || 11;
70 print STDERR "\tUsing specified /transl_table=$tt\n" if $verbose && $tt != 1;
4571
46 # remove stop codon
47 my $aa = $cds->seq;
48 $aa =~ s/\*$//;
49 $cds->seq($aa);
72 # http://www.bioperl.org/wiki/HOWTO:Beginners#Translating
73 $cds = $cds->translate(-codontable_id => $tt, -complete => 1);
5074
5175 my $ec = TAG($f, 'EC_number') || $blank;
5276 my $gene = TAG($f, 'gene') || $blank;
5377
54 $cds->desc("$ec$sep$gene$sep$prod");
78 $cds->desc( join $sep, $ec, $gene, $prod );
5579 $cds->display_id($id);
5680
5781 $out->write_seq($cds);
5882 }
59 print STDERR "\n";
6083 }
61 print STDERR "\nDone\n";
84 print STDERR "Done.\n";
6285
6386 #----------------------------------------------------------------------
6487
78101 use Getopt::Long;
79102
80103 @Options = (
81 {OPT=>"help", VAR=>\&usage, DESC=>"This help"},
82 {OPT=>"verbose!", VAR=>\$verbose, DEFAULT=>0, DESC=>"Verbose progress"},
104 {OPT=>"help", VAR=>\&usage, DESC=>"This help"},
105 {OPT=>"verbose!", VAR=>\$verbose, DEFAULT=>0, DESC=>"Verbose progress"},
83106 {OPT=>"format=s", VAR=>\$format, DEFAULT=>'genbank', DESC=>"Input format"},
84 {OPT=>"idtag=s", VAR=>\$idtag, DEFAULT=>'protein_id', DESC=>"What tag to use as Fasta ID"},
107 {OPT=>"idtag=s", VAR=>\$idtag, DEFAULT=>'', DESC=>"What tag to use as Fasta ID (default = try first of: @IDTAG)"},
85108 {OPT=>"sep=s", VAR=>\$sep, DEFAULT=>'~~~', DESC=>"Separator between EC/gene/product" },
86109 {OPT=>"blank=s", VAR=>\$blank, DEFAULT=>'', DESC=>"Replace empty EC/gene/product with this"},
87110 {OPT=>"pseudo!", VAR=>\$pseudo, DEFAULT=>0, DESC=>"Include /pseudo genes"},
88111 {OPT=>"hypo!", VAR=>\$hypo, DEFAULT=>0, DESC=>"Include 'hypothetical protein' genes"},
112 {OPT=>"gcode=i", VAR=>\$gcode, DEFAULT=>0, DESC=>"Force this genetic code for translation (otherwise /transl_table)"},
89113 {OPT=>"minlen=i", VAR=>\$minlen, DEFAULT=>0, DESC=>"Minimum peptide length"},
90114 );
91115
44 #use SWISS::OC;
55 use Data::Dumper;
66
7 my(@Options, $verbose, $frag, $evlev, $minlen, $sep, $blank, $term, $hypo);
7 my(@Options, $verbose, $frag, $evlev, $minlen, $maxlen, $sep, $blank, $term, $hypo);
88 setOptions();
99
1010 my $HYPO = 'hypothetical protein';
2525 next if not $frag and $entry->isFragment;
2626 next if not $frag and $entry->DEs->hasFragment;
2727
28 # Too short?
29 next if $minlen > 0 and length($entry->SQs->seq) < $minlen;
28 # Too short or to long?
29 my $L = length($entry->SQs->seq);
30 next if $minlen > 0 and $L < $minlen;
31 next if $maxlen > 0 and $L > $maxlen;
3032
3133 # Reject on evidence level
3234 # grep ^PE uniprot_sprot.dat | sort | uniq -c
103105 {OPT=>"evidence=i", VAR=>\$evlev, DEFAULT=>2, DESC=>"1=prot 2=mrna 3=homol 4=pred 5=unsure"},
104106 {OPT=>"fragments!", VAR=>\$frag, DEFAULT=>0, DESC=>"Include 'DE Flags: Fragment;' entries"},
105107 {OPT=>"minlen=i", VAR=>\$minlen, DEFAULT=>20, DESC=>"Minimum peptide length"},
108 {OPT=>"maxlen=i", VAR=>\$maxlen, DEFAULT=>1E5, DESC=>"Minimum peptide length"},
106109 {OPT=>"term=s", VAR=>\$term, DEFAULT=>'', DESC=>"Lineage must contain this term eg. 'Bacteria'"},
107110 {OPT=>"hypo!", VAR=>\$hypo, DEFAULT=>0, DESC=>"Don't filter out hypothetical proteins"},
108111 );
db/hmm/CLUSTERS.hmm less more
Binary diff not shown
db/hmm/Pfam.hmm less more
Binary diff not shown
00
11 Prokka ChangeLog
22 ----------------
3
4 v1.11 - 15 Feb 2015
5 * Option --proteins now supports .GBK/.EMBL files directly!
6 * Support for user supplied HMM via --hmms [Connor Driscoll]
7 * Add Prodigal -c option when in metagenome mode [Daan Speth]
8 * Wierd coordinate errors with long Spades contig names [Stephan Pabinger]
9 * Replaced dodgy OSX aragorn binary [Yevgeny Nikolaichik]
10 * Handle translation table 25 [Mads Albertsen]
11 * Fix semantically incorrect GFF3 output [Marc Hoeppner]
12 * Fix contigs with pipe characters from Quiver [Vaughn Cooper]
13 * Keep original contig ordering [Jane Hawkey]
14 * Fix protein translation with alternate --gcode [Andreas Leimbach]
15 * Various bug fixes in prokka-* support scripts [Andreas Leimbach]
16 * Workaround tbl2asn bug with COORDINATES:profile [Wiep Smits]
17 * Provide MD5 checksum for website download [Willem VA Viljoen]
18 * Hopefully slightly improved cleanup_product() function
19 * Ensure released are tagged on github for Linux Brew [Shaun Jackman]
20 * Check databases exist before searching against them
21 * Updated bundled binaries
322
423 v1.10 - 28 July 2014
524 * Support for barrnap 0.4 with Archaea and Mito support [Lionel Guy]
11 Prokka Wishlist
22 ---------------
33
4 * New --first-blast option, like --proteins?
5 * New --last-blast option eg. "nr" [Check who emailed me this]
46 * annotate possible frame-shifts
57 * FAQ about annotation transfer
68 * use included binary if PATH one is wrong version [Simon Gladman]
79 * reconsider !~m/a-z/ rule in cleanup_product() [Roderick Felsheim]
810 * all Kingdom=ALL or ANY for metagenomes [Andreas Bremges]
911 * make --cleanup option for /product names [Andreas Bremges]
10 * check databases exist before spawning blast/hmmer
1112 * locus_tag rules: http://www.ncbi.nlm.nih.gov/genomes/locustag/Proposal.pdf
12 * Add .PTT file output [Andrew Buultjens]
1313 * Bug: "existing RNA (repeat_region)" CRISPR is not RNA ?
14 * Allow --proteins to be a .GBK or .GFF3 file ie. extract the CDS annotations
1514 * Factor out some functions into modules!
1615 * Check for large genomic tracts which are unannotated. Sometimes Prodigal misses big genes.
1716 * Add an example input/output so users can check their copy is producing similar results.
1817 * Output potential homopolymer assembly errors near CDS flanks
1918 * Add the CLUSTERS "PHA" library to Viruses mode.
2019 * Option to include ribosomal binding sites (RBS) as features.
21 * Check input contigs for runs of Ns, and either complain, or split the file, or additionally create a .AGP scaffold file. (Mitchell Stanton-Cook has done this, need to incorporate)
2220 * Add hyperlinks to tool references in Manual
2321
+0
-215
doc/prokka-manual.html less more
0 <h1 id="prokka-rapid-prokaryotic-genome-annotation">Prokka: rapid prokaryotic genome annotation</h1>
1 <p>Torsten Seemann <script type="text/javascript">
2 <!--
3 h='&#x6d;&#x6f;&#110;&#x61;&#x73;&#104;&#46;&#x65;&#100;&#x75;';a='&#64;';n='&#116;&#x6f;&#114;&#x73;&#116;&#x65;&#110;&#46;&#x73;&#x65;&#x65;&#x6d;&#x61;&#110;&#110;';e=n+a+h;
4 document.write('<a h'+'ref'+'="ma'+'ilto'+':'+e+'">'+'<code>'+e+'</code>'+'<\/'+'a'+'>');
5 // -->
6 </script><noscript>&#116;&#x6f;&#114;&#x73;&#116;&#x65;&#110;&#46;&#x73;&#x65;&#x65;&#x6d;&#x61;&#110;&#110;&#32;&#x61;&#116;&#32;&#x6d;&#x6f;&#110;&#x61;&#x73;&#104;&#32;&#100;&#x6f;&#116;&#32;&#x65;&#100;&#x75;</noscript><br />Victorian Bioinformatics Consortium, AUSTRALIA <a href="http://vicbioinformatics.com"><code class="url">http://vicbioinformatics.com</code></a></p>
7 <h2 id="introduction">Introduction</h2>
8 <p>Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.</p>
9 <h2 id="installation">Installation</h2>
10 <h3 id="download">Download</h3>
11 <p>Download the latest <code>prokka-1.xx.tar.gz</code> archive from http://www.bioinformatics.net.au/software.prokka.shtml</p>
12 <h3 id="extract">Extract</h3>
13 <p>Choose somewhere to put it, for example in your home directory (no root access required):</p>
14 <pre><code>% cd $HOME
15 % tar zxvf prokka-1.xx.tar.gz
16 % ls prokka-1.xx</code></pre>
17 <h3 id="add-to-path">Add to PATH</h3>
18 <p>Add the following line to your <code>$HOME/.bashrc</code> file, or to <code>/etc/profile.d/prokka.sh</code> to make it available to all users:</p>
19 <pre><code>export PATH=$PATH:$HOME/prokka-1.xx/bin</code></pre>
20 <h3 id="index-the-sequence-databases">Index the sequence databases</h3>
21 <pre><code>% prokka --setupdb</code></pre>
22 <h3 id="install-dependencies">Install dependencies</h3>
23 <p>Prokka comes with many binaries for Linux and Mac OS X. It will always use your existing installed versions if they exist, but will use the included ones if that fails. For some older systems (eg. Centos 4.x) some of them won't work due to them being dynamically linked against new GLIBC libraries you don't have.</p>
24 <p>You can consult the list of dependencies later in this document.</p>
25 <h3 id="choose-a-rrna-predictor">Choose a rRNA predictor</h3>
26 <h4 id="option-1---dont-use-one">Option 1 - Don't use one</h4>
27 <p>If Prokka can't find a predictor for rRNA featues (either Barrnap or RNAmmer below) then it simply won't annotate any. Most people don't care that much about them anyway,</p>
28 <h4 id="option-2---barrnap">Option 2 - Barrnap</h4>
29 <p>This was written by the author of Prokka and is recommended if you prefer speed over absolute accuracy. It uses the new multi-core NHMMER for DNA:DNA profile searches. Download it from http://www.vicbioinformatics.com/software.barrnap.shtml</p>
30 <h4 id="option-3---rnammer">Option 3 - RNAmmer</h4>
31 <p>RNAmmer was written when HMMER 2.x was the latest release. Since them, HMMER 3.x has been released, and uses the same executable binary names. Prokka needs HMMER3 and RNAmmer (and hence HMMER2) so you need to edit your RNAmmer script to explicitly point your HMMER2 binary instead of using the HMMER3 binary which is more likely to be in your PATH first.</p>
32 <p>Type which rnammer to find the script, and then edit it with your favourite editor. Find the following lines at the top:</p>
33 <pre><code>if ( $uname eq &quot;Linux&quot; ) {
34 # $HMMSEARCH_BINARY = &quot;/usr/cbs/bio/bin/linux64/hmmsearch&quot;; # OLD
35 $HMMSEARCH_BINARY = &quot;/path/to/my/hmmer-2.3.2/bin/hmmsearch&quot;; # NEW (yours)
36 }</code></pre>
37 <p>If you are using Mac OS X, you'll also have to change the <code>&quot;Linux&quot;</code> to <code>&quot;Darwin&quot;</code> too. As you can see, I have commented out the original part, and replaced it with the location of my HMMER2 hmmsearch tool, so it doesn't run the HMMER3 one. You need to ensure HMMER3 is in your PATH before the old HMMER2 too.</p>
38 <h3 id="test">Test</h3>
39 <ul>
40 <li>Type <code>prokka</code> and it should output it's help screen.</li>
41 <li>Type <code>prokka --version</code> and you should see an output like <code>prokka 1.x</code></li>
42 <li>Type <code>prokka --listdb</code> and it will show you what databases it has installed to use.</li>
43 </ul>
44 <h2 id="invoking-prokka">Invoking Prokka</h2>
45 <h3 id="beginner">Beginner</h3>
46 <pre><code># Vanilla (but with free toppings)
47 % prokka contigs.fa
48
49 # Look for a folder called PROKKA_yyyymmdd (today&#39;s date) and look at stats
50 % cat PROKKA_yyyymmdd/*.txt</code></pre>
51 <h3 id="moderate">Moderate</h3>
52 <pre><code># Choose the names of the output files
53 % prokka --outdir mydir --prefix mygenome contigs.fa
54
55 # Visualize it in Artemis
56 % art mydir/mygenome.gff</code></pre>
57 <h3 id="expert">Expert</h3>
58 <pre><code># It&#39;s not just for bacteria, people
59 % prokka --kingdom Archaea --outdir mydir --genus Pyrococcus --locustag PYCC
60
61 # Search for my favourite gene
62 % exonerate --bestn 1 zetatoxin.fasta mydir/PYCC_06072012.faa | less</code></pre>
63 <h3 id="wizard">Wizard</h3>
64 <pre><code># Watch and learn
65 % prokka --outdir mydir --locustag EHEC --proteins NewToxins.faa --evalue 0.001 --gram neg --addgenes contigs.fa
66
67 # Check to see if anything went really wrong
68 % less mydir/EHEC_06072012.err
69
70 # Add final details using Sequin
71 % sequin mydir/EHEC_0607201.sqn</code></pre>
72 <h3 id="genbank-submitter">Genbank submitter</h3>
73 <pre><code># Register your BioProject and your locus_tag prefix first!
74 % prokka --compliant --centre UoN --outdir PRJNA123456 --locustag EHEC --prefix EHEC-Chr1 contigs.fa
75
76 # Check to see if anything went really wrong
77 % less PRJNA123456/EHEC-Chr1.err
78
79 # Add final details using Sequin
80 % sequin PRJNA123456/EHEC-Chr1.sqn</code></pre>
81 <h3 id="crazy-person">Crazy Person</h3>
82 <pre><code># No stinking Perl script is going to control me
83 % prokka \
84 --outdir $HOME/genomes/Ec_POO247 --force \
85 --prefix Ec_POO247 --addgenes --locustag ECPOOp \
86 --increment 10 --gffver 2 --centre CDC --compliant \
87 --genus Escherichia --species coli --strain POO247 --plasmid pECPOO247 \
88 --kingdom Bacteria --gcode 11 --usegenus \
89 --proteins /opt/prokka/db/trusted/Ecocyc-17.6 \
90 --evalue 1e-9 --rfam \
91 plasmid-closed.fna</code></pre>
92 <h2 id="output-files">Output Files</h2>
93 <p>| Extension | Description | | --------- | ----------- | | .gff | This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV. | | .gbk | This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence. | | .fna | Nucleotide FASTA file of the input contig sequences. | | .faa | Protein FASTA file of the translated CDS sequences. | | .ffn | Nucleotide FASTA file of all the annotated sequences, not just CDS. | | .sqn | An ASN1 format &quot;Sequin&quot; file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. | | .fsa | Nucleotide FASTA file of the input contig sequences, used by &quot;tbl2asn&quot; to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. | | .tbl | Feature Table file, used by &quot;tbl2asn&quot; to create the .sqn file. | | .err | Unacceptable annotations - the NCBI discrepancy report. | | .log | Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled. | | .txt | Statistics relating to the annotated features found. |</p>
94 <h2 id="command-line-options">Command line options</h2>
95 <pre><code>General:
96 --help This help
97 --version Print version and exit
98 --docs Show full manual/documentation
99 --citation Print citation for referencing Prokka
100 --quiet No screen output (default OFF)
101 Setup:
102 --listdb List all configured databases
103 --setupdb Index all installed databases
104 --cleandb Remove all database indices
105 --depends List all software dependencies
106 Outputs:
107 --outdir [X] Output folder [auto] (default &#39;&#39;)
108 --force Force overwriting existing output folder (default OFF)
109 --prefix [X] Filename output prefix [auto] (default &#39;&#39;)
110 --addgenes Add &#39;gene&#39; features for each &#39;CDS&#39; feature (default OFF)
111 --locustag [X] Locus tag prefix (default &#39;PROKKA&#39;)
112 --increment [N] Locus tag counter increment (default &#39;1&#39;)
113 --gffver [N] GFF version (default &#39;3&#39;)
114 --compliant Force Genbank/ENA/DDJB compliance: --genes --mincontiglen 200 --centre XXX (default OFF)
115 --centre [X] Sequencing centre ID. (default &#39;&#39;)
116 Organism details:
117 --genus [X] Genus name (default &#39;Genus&#39;)
118 --species [X] Species name (default &#39;species&#39;)
119 --strain [X] Strain name (default &#39;strain&#39;)
120 --plasmid [X] Plasmid name or identifier (default &#39;&#39;)
121 Annotations:
122 --kingdom [X] Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default &#39;Bacteria&#39;)
123 --gcode [N] Genetic code / Translation table (set if --kingdom is set) (default &#39;0&#39;)
124 --gram [X] Gram: -/neg +/pos (default &#39;&#39;)
125 --usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)
126 --proteins [X] Fasta file of trusted proteins to first annotate from (default &#39;&#39;)
127 --metagenome Improve gene predictions for highly fragmented genomes (default OFF)
128 --rawproduct Do not clean up /product annotation (default OFF)
129 Computation:
130 --fast Fast mode - skip CDS /product searching (default OFF)
131 --cpus [N] Number of CPUs to use [0=all] (default &#39;8&#39;)
132 --mincontiglen [N] Minimum contig size [NCBI needs 200] (default &#39;1&#39;)
133 --evalue [n.n] Similarity e-value cut-off (default &#39;1e-06&#39;)
134 --rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default &#39;0&#39;)
135 --norrna Don&#39;t run rRNA search (default OFF)
136 --notrna Don&#39;t run tRNA search (default OFF)
137 --rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)</code></pre>
138 <h2 id="dependencies">Dependencies</h2>
139 <ul>
140 <li><p><strong>GNU Parallel</strong><br />A shell tool for executing jobs in parallel using one or more computers<br /><em>O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, Feb 2011:42-47.</em></p></li>
141 <li><p><strong>BioPerl</strong><br />Used for input/output of various file formats<br /><em>Stajich et al, The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct;12(10):1611-8.</em></p></li>
142 <li><p><strong>Aragorn</strong><br />Finds transfer RNA features (tRNA)<br /><em>Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004 Jan 2;32(1):11-6.</em></p></li>
143 <li><p><strong>Barrnap</strong><br />Used to predict ribosomal RNA features (rRNA). My licence-free replacement for RNAmmmer.<br /><em>Manuscript under preparation.</em></p></li>
144 <li><p><strong>RNAmmer</strong><br />Finds ribosomal RNA features (rRNA)<br /><em>Lagesen K et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100-8.</em></p></li>
145 <li><p><strong>Prodigal</strong><br />Finds protein-coding features (CDS)<br /><em>Hyatt D et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11:119.</em></p></li>
146 <li><p><strong>SignalP</strong><br />Finds signal peptide features in CDS (sig_peptide)<br /><em>Petersen TN et al. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods. 2011 Sep 29;8(10):785-6.</em></p></li>
147 <li><p><strong>BLAST+</strong><br />Used for similarity searching against protein sequence libraries<br /><em>Camacho C et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009 Dec 15;10:421.</em></p></li>
148 <li><p><strong>HMMER3</strong><br />Used for similarity searching against protein family profiles<br /><em>Finn RD et al. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011 Jul;39(Web Server issue):W29-37.</em></p></li>
149 <li><p><strong>Infernal</strong><br />Used for similarity searching against ncRNA family profiles<br /><em>D. L. Kolbe, S. R. Eddy. Fast Filtering for RNA Homology Search. Bioinformatics, 27:3102-3109, 2011.</em></p></li>
150 </ul>
151 <h2 id="databases">Databases</h2>
152 <h3 id="the-core-blast-databases">The Core (BLAST+) Databases</h3>
153 <p>Prokka uses a variety of databases when trying to assign function to the predicted CDS features. It takes a hierarchial approach to make it fast. A small, core set of well characterized proteins are first searched using BLAST+. This combination of small database and fast search typically completes about 70% of the workload. Then a series of slower but more sensitive HMM databases are searched using HMMER3.</p>
154 <p>The initial core databases are derived from UniProtKB; there is one per &quot;kingdom&quot; supported. To qualify for inclusion, a protein must be (1) from Bacteria (or Archaea or Viruses); (2) not be &quot;Fragment&quot; entries; and (3) have an evidence level (&quot;PE&quot;) of 2 or lower, which corresponds to experimental mRNA or proteomics evidence.</p>
155 <h4 id="making-a-core-databases">Making a Core Databases</h4>
156 <p>If you want to modify these core databases, the included script <code>prokka-uniprot_to_fasta_db</code>, along with the official <code>uniprot_sprot.dat</code>, can be used to generate a new database to put in <code>/opt/prokka/db/kingdom/</code>. If you add new ones, the command <code>prokka --listdb</code> will show you whether it has been detected properly.</p>
157 <h4 id="the-genus-databases">The Genus Databases</h4>
158 <p>If you enable <code>--usegenus</code> and also provide a Genus via <code>--genus</code> then it will first use a BLAST database which is Genus specific. Prokka comes with a set of databases for the most common Bacterial genera; type prokka <code>--listdb</code> to see what they are.</p>
159 <h4 id="adding-a-genus-databases">Adding a Genus Databases</h4>
160 <p>If you have a set of Genbank files and want to create a new Genus database, Prokka comes with a tool called <code>prokka-genbank_to_fasta_db</code> to help. For example, if you had four annotated &quot;Coccus&quot; genomes, you could do the following:</p>
161 <pre><code>% prokka-genbank_to_fasta_db Coccus1.gbk Coccus2.gbk Coccus3.gbk Coccus4.gbk &gt; Coccus.faa
162 % cd-hit -i Coccus.faa -o Coccus -T 0 -M 0 -g 1 -s 0.8 -c 0.9
163 % rm -fv Coccus.faa Coccus.bak.clstr Coccus.clstr
164 % makeblastdb -dbtype prot -in Coccus
165 % mv Coccus.p* /path/to/prokka/db/genus/</code></pre>
166 <h3 id="the-hmm-databases">The HMM Databases</h3>
167 <p>Prokka comes with a bunch of HMM libraries for HMMER3. They are mostly Bacteria-specific. They are searched after the core and genus databases. You can add more simply by putting them in <code>/opt/prokka/db/hmm</code>. Type <code>prokka --listdb</code> to confirm they are recognised.</p>
168 <h3 id="fasta-database-format">FASTA database format</h3>
169 <p>Prokka understands two annotation tag formats, a plain one and a detailed one.</p>
170 <p>The plain one is a standard FASTA-like line with the ID after the <code>&gt;</code> sign, and the protein <code>/product</code> after the ID (the &quot;description&quot; part of the line):</p>
171 <pre><code>&gt;SeqID product</code></pre>
172 <p>The detailed one consists of a special encoded three-part description line. The parts are the <code>/EC_number</code>, the <code>/gene</code> code, then the <code>/product</code> - and they are separated by a special &quot;<sub>~</sub>&quot; sequence:</p>
173 <pre><code>&gt;SeqID EC_number~~~gene~~~product</code></pre>
174 <p>Here are some examples. Note that not all parts need to be present, but the &quot;<sub>~</sub>&quot; should still be there:</p>
175 <pre><code>&gt;YP_492693.1 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
176 MNEKNIKHSQNFITSKHNIDKIMTNIRLNEHDNIFEIGSGKGHFTLELVQRCNFVTAIEI
177 DHKLCKTTENKLVDHDNFQVLNKDILQFKFPKNQSYKIFGNIPYNISTDIIRKIVF*
178 &gt;YP_492697.1 ~~~traB~~~transfer complex protein TraB
179 MIKKFSLTTVYVAFLSIVLSNITLGAENPGPKIEQGLQQVQTFLTGLIVAVGICAGVWIV
180 LKKLPGIDDPMVKNEMFRGVGMVLAGVAVGAALVWLVPWVYNLFQ*
181 &gt;YP_492694.1 ~~~~~~transposase
182 MNYFRYKQFNKDVITVAVGYYLRYALSYRDISEILRGRGVNVHHSTVYRWVQEYAPILYQ
183 QSINTAKNTLKGIECIYALYKKNRRSLQIYGFSPCHEISIMLAS*</code></pre>
184 <p>The same description lines apply to HMM models, except the &quot;NAME&quot; and &quot;DESC&quot; fields are used:</p>
185 <pre><code>NAME PRK00001
186 ACC PRK00001
187 DESC 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
188 LENG 284</code></pre>
189 <h2 id="faq">FAQ</h2>
190 <ul>
191 <li><p><strong>Where does the name &quot;Prokka&quot; come from?</strong><br />Prokka is a contraction of &quot;prokaryotic annotation&quot;. It's also relatively unique within Google, and also rhymes with a native Australian marsupial called the quokka.</p></li>
192 <li><p><strong>Can I annotate by eukaryote genome with Prokka?</strong><br />No. Prokka is specifically designed for Bacteria, Archaea and Viruses. It can't handle multi-exon gene models; I would recommend using MAKER 2 for that purpose.</p></li>
193 <li><p><strong>Why does Prokka keeps on crashing when it gets to tge &quot;tbl2asn&quot; stage?</strong><br />It seems that the tbl2asn program from NCBI &quot;expires&quot; after 12 months, and refuses to run. Unfortunately you need to install a newer version which you can download from here.</p></li>
194 <li><p><strong>The hmmscan step seems to hang and do nothing?</strong><br />The problem here is GNU Parallel. It seems the Debian package for hmmer has modified it to require the <code>--gnu</code> option to behave in the 'default' way. There is no clear reason for this. The only way to restore normal behaviour is to edit the prokka script and change <code>parallel</code> to <code>parallel --gnu</code>.</p></li>
195 <li><p><strong>Why does prokka fail when it gets to hmmscan?</strong><br />Unfortunately HMMER keeps changing it's database format, and they aren't upward compatible. If you upgraded HMMER (from 3.0 to 3.1 say) then you need to &quot;re-press&quot; the files. This can be done as follows: cd /path/to/prokka/db/hmm mkdir new for D in *.hmm ; do hmmconvert <span class="math"><em>D</em> &gt; <em>n</em><em>e</em><em>w</em> / </span>D ; done cd new for D in *.hmm ; do hmmpress $D ; done mv * .. rmdir new</p></li>
196 <li><p><strong>Why does Prokka take so long to download?</strong><br />Our server is in Australia, and the international pipes aren't always flowing as well as we'd like. I try to put it on GoogleDrive. Dropbox is no longer possible due to bandwidth quotas. If you are able to mirror Prokka (~2 GB) outside please let me know.</p></li>
197 <li><p><strong>Why can't I load Prokka .GBK files into Mauve?</strong><br />Mauve is very picky about Genbank files. It does not like long contig names, like those from Velvet or Spades. The simple solution is to use <code>--centre XXX</code> in Prokka and it will rename all your contigs to be NCBI (and Mauve) compliant. It does not like the ACCESSION and VERSION strings that Prokka produces via the &quot;tbl2asn&quot; tool. The following Unix command will fix them: <code>egrep -v '^(ACCESSION|VERSION)' prokka.gbk &gt; mauve.gbk</code></p></li>
198 </ul>
199 <h2 id="still-to-do">Still To Do</h2>
200 <ul>
201 <li>ToDoList.txt: https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/doc/ToDoList.txt</li>
202 </ul>
203 <h2 id="changes">Changes</h2>
204 <ul>
205 <li>ChangeLog.txt: https://raw.githubusercontent.com/Victorian-Bioinformatics-Consortium/prokka/master/doc/ChangeLog.txt</li>
206 <li>Github commits: https://github.com/Victorian-Bioinformatics-Consortium/prokka/commits/master</li>
207 </ul>
208 <h2 id="bugs">Bugs</h2>
209 <ul>
210 <li>tbl2asn seems to be removing the &quot;(anti-codon)&quot; part in my tRNA /product values</li>
211 <li>tbl2asn putting space in /inference for Infernal</li>
212 </ul>
213 <h2 id="citation">Citation</h2>
214 <p>Seemann T.<br /><em>Prokka: rapid prokaryotic genome annotation</em><br /><strong>Bioinformatics</strong> 2014 Jul 15;30(14):2068-9. <a href="http://www.ncbi.nlm.nih.gov/pubmed/24642063">PMID:24642063</a><br /><a href="doi:10.1093/bioinformatics/btu153">DOI</a></p>
+0
-454
doc/prokka-manual.txt less more
0 Prokka: rapid prokaryotic genome annotation
1 ===========================================
2
3 Torsten Seemann torsten.seemann@monash.edu
4 Victorian Bioinformatics Consortium, AUSTRALIA
5 http://vicbioinformatics.com
6
7 Introduction
8 ------------
9
10 Whole genome annotation is the process of identifying features of
11 interest in a set of genomic DNA sequences, and labelling them with
12 useful information. Prokka is a software tool to annotate bacterial,
13 archaeal and viral genomes quickly and produce standards-compliant
14 output files.
15
16 Installation
17 ------------
18
19 Download
20
21 Download the latest prokka-1.xx.tar.gz archive from
22 http://www.bioinformatics.net.au/software.prokka.shtml
23
24 Extract
25
26 Choose somewhere to put it, for example in your home directory (no root
27 access required):
28
29 % cd $HOME
30 % tar zxvf prokka-1.xx.tar.gz
31 % ls prokka-1.xx
32
33 Add to PATH
34
35 Add the following line to your $HOME/.bashrc file, or to
36 /etc/profile.d/prokka.sh to make it available to all users:
37
38 export PATH=$PATH:$HOME/prokka-1.xx/bin
39
40 Index the sequence databases
41
42 % prokka --setupdb
43
44 Install dependencies
45
46 Prokka comes with many binaries for Linux and Mac OS X. It will always
47 use your existing installed versions if they exist, but will use the
48 included ones if that fails. For some older systems (eg. Centos 4.x)
49 some of them won't work due to them being dynamically linked against new
50 GLIBC libraries you don't have.
51
52 You can consult the list of dependencies later in this document.
53
54 Choose a rRNA predictor
55
56 Option 1 - Don't use one
57
58 If Prokka can't find a predictor for rRNA featues (either Barrnap or
59 RNAmmer below) then it simply won't annotate any. Most people don't care
60 that much about them anyway,
61
62 Option 2 - Barrnap
63
64 This was written by the author of Prokka and is recommended if you
65 prefer speed over absolute accuracy. It uses the new multi-core NHMMER
66 for DNA:DNA profile searches. Download it from
67 http://www.vicbioinformatics.com/software.barrnap.shtml
68
69 Option 3 - RNAmmer
70
71 RNAmmer was written when HMMER 2.x was the latest release. Since them,
72 HMMER 3.x has been released, and uses the same executable binary names.
73 Prokka needs HMMER3 and RNAmmer (and hence HMMER2) so you need to edit
74 your RNAmmer script to explicitly point your HMMER2 binary instead of
75 using the HMMER3 binary which is more likely to be in your PATH first.
76
77 Type which rnammer to find the script, and then edit it with your
78 favourite editor. Find the following lines at the top:
79
80 if ( $uname eq "Linux" ) {
81 # $HMMSEARCH_BINARY = "/usr/cbs/bio/bin/linux64/hmmsearch"; # OLD
82 $HMMSEARCH_BINARY = "/path/to/my/hmmer-2.3.2/bin/hmmsearch"; # NEW (yours)
83 }
84
85 If you are using Mac OS X, you'll also have to change the "Linux" to
86 "Darwin" too. As you can see, I have commented out the original part,
87 and replaced it with the location of my HMMER2 hmmsearch tool, so it
88 doesn't run the HMMER3 one. You need to ensure HMMER3 is in your PATH
89 before the old HMMER2 too.
90
91 Test
92
93 - Type prokka and it should output it's help screen.
94 - Type prokka --version and you should see an output like prokka 1.x
95 - Type prokka --listdb and it will show you what databases it has
96 installed to use.
97
98 Invoking Prokka
99 ---------------
100
101 Beginner
102
103 # Vanilla (but with free toppings)
104 % prokka contigs.fa
105
106 # Look for a folder called PROKKA_yyyymmdd (today's date) and look at stats
107 % cat PROKKA_yyyymmdd/*.txt
108
109 Moderate
110
111 # Choose the names of the output files
112 % prokka --outdir mydir --prefix mygenome contigs.fa
113
114 # Visualize it in Artemis
115 % art mydir/mygenome.gff
116
117 Expert
118
119 # It's not just for bacteria, people
120 % prokka --kingdom Archaea --outdir mydir --genus Pyrococcus --locustag PYCC
121
122 # Search for my favourite gene
123 % exonerate --bestn 1 zetatoxin.fasta mydir/PYCC_06072012.faa | less
124
125 Wizard
126
127 # Watch and learn
128 % prokka --outdir mydir --locustag EHEC --proteins NewToxins.faa --evalue 0.001 --gram neg --addgenes contigs.fa
129
130 # Check to see if anything went really wrong
131 % less mydir/EHEC_06072012.err
132
133 # Add final details using Sequin
134 % sequin mydir/EHEC_0607201.sqn
135
136 Genbank submitter
137
138 # Register your BioProject and your locus_tag prefix first!
139 % prokka --compliant --centre UoN --outdir PRJNA123456 --locustag EHEC --prefix EHEC-Chr1 contigs.fa
140
141 # Check to see if anything went really wrong
142 % less PRJNA123456/EHEC-Chr1.err
143
144 # Add final details using Sequin
145 % sequin PRJNA123456/EHEC-Chr1.sqn
146
147 Crazy Person
148
149 # No stinking Perl script is going to control me
150 % prokka \
151 --outdir $HOME/genomes/Ec_POO247 --force \
152 --prefix Ec_POO247 --addgenes --locustag ECPOOp \
153 --increment 10 --gffver 2 --centre CDC --compliant \
154 --genus Escherichia --species coli --strain POO247 --plasmid pECPOO247 \
155 --kingdom Bacteria --gcode 11 --usegenus \
156 --proteins /opt/prokka/db/trusted/Ecocyc-17.6 \
157 --evalue 1e-9 --rfam \
158 plasmid-closed.fna
159
160 Output Files
161 ------------
162
163 | Extension | Description | | --------- | ----------- | | .gff | This is
164 the master annotation in GFF3 format, containing both sequences and
165 annotations. It can be viewed directly in Artemis or IGV. | | .gbk |
166 This is a standard Genbank file derived from the master .gff. If the
167 input to prokka was a multi-FASTA, then this will be a multi-Genbank,
168 with one record for each sequence. | | .fna | Nucleotide FASTA file of
169 the input contig sequences. | | .faa | Protein FASTA file of the
170 translated CDS sequences. | | .ffn | Nucleotide FASTA file of all the
171 annotated sequences, not just CDS. | | .sqn | An ASN1 format "Sequin"
172 file for submission to Genbank. It needs to be edited to set the correct
173 taxonomy, authors, related publication etc. | | .fsa | Nucleotide FASTA
174 file of the input contig sequences, used by "tbl2asn" to create the .sqn
175 file. It is mostly the same as the .fna file, but with extra Sequin tags
176 in the sequence description lines. | | .tbl | Feature Table file, used
177 by "tbl2asn" to create the .sqn file. | | .err | Unacceptable
178 annotations - the NCBI discrepancy report. | | .log | Contains all the
179 output that Prokka produced during its run. This is a record of what
180 settings you used, even if the --quiet option was enabled. | | .txt |
181 Statistics relating to the annotated features found. |
182
183 Command line options
184 --------------------
185
186 General:
187 --help This help
188 --version Print version and exit
189 --docs Show full manual/documentation
190 --citation Print citation for referencing Prokka
191 --quiet No screen output (default OFF)
192 Setup:
193 --listdb List all configured databases
194 --setupdb Index all installed databases
195 --cleandb Remove all database indices
196 --depends List all software dependencies
197 Outputs:
198 --outdir [X] Output folder [auto] (default '')
199 --force Force overwriting existing output folder (default OFF)
200 --prefix [X] Filename output prefix [auto] (default '')
201 --addgenes Add 'gene' features for each 'CDS' feature (default OFF)
202 --locustag [X] Locus tag prefix (default 'PROKKA')
203 --increment [N] Locus tag counter increment (default '1')
204 --gffver [N] GFF version (default '3')
205 --compliant Force Genbank/ENA/DDJB compliance: --genes --mincontiglen 200 --centre XXX (default OFF)
206 --centre [X] Sequencing centre ID. (default '')
207 Organism details:
208 --genus [X] Genus name (default 'Genus')
209 --species [X] Species name (default 'species')
210 --strain [X] Strain name (default 'strain')
211 --plasmid [X] Plasmid name or identifier (default '')
212 Annotations:
213 --kingdom [X] Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
214 --gcode [N] Genetic code / Translation table (set if --kingdom is set) (default '0')
215 --gram [X] Gram: -/neg +/pos (default '')
216 --usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)
217 --proteins [X] Fasta file of trusted proteins to first annotate from (default '')
218 --metagenome Improve gene predictions for highly fragmented genomes (default OFF)
219 --rawproduct Do not clean up /product annotation (default OFF)
220 Computation:
221 --fast Fast mode - skip CDS /product searching (default OFF)
222 --cpus [N] Number of CPUs to use [0=all] (default '8')
223 --mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
224 --evalue [n.n] Similarity e-value cut-off (default '1e-06')
225 --rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
226 --norrna Don't run rRNA search (default OFF)
227 --notrna Don't run tRNA search (default OFF)
228 --rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)
229
230 Dependencies
231 ------------
232
233 - GNU Parallel
234 A shell tool for executing jobs in parallel using one or more
235 computers
236 O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The
237 USENIX Magazine, Feb 2011:42-47.
238
239 - BioPerl
240 Used for input/output of various file formats
241 Stajich et al, The Bioperl toolkit: Perl modules for the life
242 sciences. Genome Res. 2002 Oct;12(10):1611-8.
243
244 - Aragorn
245 Finds transfer RNA features (tRNA)
246 Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and
247 tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004 Jan
248 2;32(1):11-6.
249
250 - Barrnap
251 Used to predict ribosomal RNA features (rRNA). My licence-free
252 replacement for RNAmmmer.
253 Manuscript under preparation.
254
255 - RNAmmer
256 Finds ribosomal RNA features (rRNA)
257 Lagesen K et al. RNAmmer: consistent and rapid annotation of
258 ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100-8.
259
260 - Prodigal
261 Finds protein-coding features (CDS)
262 Hyatt D et al. Prodigal: prokaryotic gene recognition and
263 translation initiation site identification. BMC Bioinformatics. 2010
264 Mar 8;11:119.
265
266 - SignalP
267 Finds signal peptide features in CDS (sig_peptide)
268 Petersen TN et al. SignalP 4.0: discriminating signal peptides from
269 transmembrane regions. Nat Methods. 2011 Sep 29;8(10):785-6.
270
271 - BLAST+
272 Used for similarity searching against protein sequence libraries
273 Camacho C et al. BLAST+: architecture and applications. BMC
274 Bioinformatics. 2009 Dec 15;10:421.
275
276 - HMMER3
277 Used for similarity searching against protein family profiles
278 Finn RD et al. HMMER web server: interactive sequence similarity
279 searching. Nucleic Acids Res. 2011 Jul;39(Web Server issue):W29-37.
280
281 - Infernal
282 Used for similarity searching against ncRNA family profiles
283 D. L. Kolbe, S. R. Eddy. Fast Filtering for RNA Homology Search.
284 Bioinformatics, 27:3102-3109, 2011.
285
286 Databases
287 ---------
288
289 The Core (BLAST+) Databases
290
291 Prokka uses a variety of databases when trying to assign function to the
292 predicted CDS features. It takes a hierarchial approach to make it fast.
293 A small, core set of well characterized proteins are first searched
294 using BLAST+. This combination of small database and fast search
295 typically completes about 70% of the workload. Then a series of slower
296 but more sensitive HMM databases are searched using HMMER3.
297
298 The initial core databases are derived from UniProtKB; there is one per
299 "kingdom" supported. To qualify for inclusion, a protein must be (1)
300 from Bacteria (or Archaea or Viruses); (2) not be "Fragment" entries;
301 and (3) have an evidence level ("PE") of 2 or lower, which corresponds
302 to experimental mRNA or proteomics evidence.
303
304 Making a Core Databases
305
306 If you want to modify these core databases, the included script
307 prokka-uniprot_to_fasta_db, along with the official uniprot_sprot.dat,
308 can be used to generate a new database to put in
309 /opt/prokka/db/kingdom/. If you add new ones, the command
310 prokka --listdb will show you whether it has been detected properly.
311
312 The Genus Databases
313
314 If you enable --usegenus and also provide a Genus via --genus then it
315 will first use a BLAST database which is Genus specific. Prokka comes
316 with a set of databases for the most common Bacterial genera; type
317 prokka --listdb to see what they are.
318
319 Adding a Genus Databases
320
321 If you have a set of Genbank files and want to create a new Genus
322 database, Prokka comes with a tool called prokka-genbank_to_fasta_db to
323 help. For example, if you had four annotated "Coccus" genomes, you could
324 do the following:
325
326 % prokka-genbank_to_fasta_db Coccus1.gbk Coccus2.gbk Coccus3.gbk Coccus4.gbk > Coccus.faa
327 % cd-hit -i Coccus.faa -o Coccus -T 0 -M 0 -g 1 -s 0.8 -c 0.9
328 % rm -fv Coccus.faa Coccus.bak.clstr Coccus.clstr
329 % makeblastdb -dbtype prot -in Coccus
330 % mv Coccus.p* /path/to/prokka/db/genus/
331
332 The HMM Databases
333
334 Prokka comes with a bunch of HMM libraries for HMMER3. They are mostly
335 Bacteria-specific. They are searched after the core and genus databases.
336 You can add more simply by putting them in /opt/prokka/db/hmm. Type
337 prokka --listdb to confirm they are recognised.
338
339 FASTA database format
340
341 Prokka understands two annotation tag formats, a plain one and a
342 detailed one.
343
344 The plain one is a standard FASTA-like line with the ID after the >
345 sign, and the protein /product after the ID (the "description" part of
346 the line):
347
348 >SeqID product
349
350 The detailed one consists of a special encoded three-part description
351 line. The parts are the /EC_number, the /gene code, then the /product -
352 and they are separated by a special "~" sequence:
353
354 >SeqID EC_number~~~gene~~~product
355
356 Here are some examples. Note that not all parts need to be present, but
357 the "~" should still be there:
358
359 >YP_492693.1 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
360 MNEKNIKHSQNFITSKHNIDKIMTNIRLNEHDNIFEIGSGKGHFTLELVQRCNFVTAIEI
361 DHKLCKTTENKLVDHDNFQVLNKDILQFKFPKNQSYKIFGNIPYNISTDIIRKIVF*
362 >YP_492697.1 ~~~traB~~~transfer complex protein TraB
363 MIKKFSLTTVYVAFLSIVLSNITLGAENPGPKIEQGLQQVQTFLTGLIVAVGICAGVWIV
364 LKKLPGIDDPMVKNEMFRGVGMVLAGVAVGAALVWLVPWVYNLFQ*
365 >YP_492694.1 ~~~~~~transposase
366 MNYFRYKQFNKDVITVAVGYYLRYALSYRDISEILRGRGVNVHHSTVYRWVQEYAPILYQ
367 QSINTAKNTLKGIECIYALYKKNRRSLQIYGFSPCHEISIMLAS*
368
369 The same description lines apply to HMM models, except the "NAME" and
370 "DESC" fields are used:
371
372 NAME PRK00001
373 ACC PRK00001
374 DESC 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
375 LENG 284
376
377 FAQ
378 ---
379
380 - Where does the name "Prokka" come from?
381 Prokka is a contraction of "prokaryotic annotation". It's also
382 relatively unique within Google, and also rhymes with a native
383 Australian marsupial called the quokka.
384
385 - Can I annotate by eukaryote genome with Prokka?
386 No. Prokka is specifically designed for Bacteria, Archaea and
387 Viruses. It can't handle multi-exon gene models; I would recommend
388 using MAKER 2 for that purpose.
389
390 - Why does Prokka keeps on crashing when it gets to tge "tbl2asn"
391 stage?
392 It seems that the tbl2asn program from NCBI "expires" after 12
393 months, and refuses to run. Unfortunately you need to install a
394 newer version which you can download from here.
395
396 - The hmmscan step seems to hang and do nothing?
397 The problem here is GNU Parallel. It seems the Debian package for
398 hmmer has modified it to require the --gnu option to behave in the
399 'default' way. There is no clear reason for this. The only way to
400 restore normal behaviour is to edit the prokka script and change
401 parallel to parallel --gnu.
402
403 - Why does prokka fail when it gets to hmmscan?
404 Unfortunately HMMER keeps changing it's database format, and they
405 aren't upward compatible. If you upgraded HMMER (from 3.0 to 3.1
406 say) then you need to "re-press" the files. This can be done as
407 follows: cd /path/to/prokka/db/hmm mkdir new for D in *.hmm ; do
408 hmmconvert D > new/D ; done cd new for D in *.hmm ; do hmmpress $D ;
409 done mv * .. rmdir new
410
411 - Why does Prokka take so long to download?
412 Our server is in Australia, and the international pipes aren't
413 always flowing as well as we'd like. I try to put it on GoogleDrive.
414 Dropbox is no longer possible due to bandwidth quotas. If you are
415 able to mirror Prokka (~2 GB) outside please let me know.
416
417 - Why can't I load Prokka .GBK files into Mauve?
418 Mauve is very picky about Genbank files. It does not like long
419 contig names, like those from Velvet or Spades. The simple solution
420 is to use --centre XXX in Prokka and it will rename all your contigs
421 to be NCBI (and Mauve) compliant. It does not like the ACCESSION and
422 VERSION strings that Prokka produces via the "tbl2asn" tool. The
423 following Unix command will fix them:
424 egrep -v '^(ACCESSION|VERSION)' prokka.gbk > mauve.gbk
425
426 Still To Do
427 -----------
428
429 - ToDoList.txt:
430 https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/doc/ToDoList.txt
431
432 Changes
433 -------
434
435 - ChangeLog.txt:
436 https://raw.githubusercontent.com/Victorian-Bioinformatics-Consortium/prokka/master/doc/ChangeLog.txt
437 - Github commits:
438 https://github.com/Victorian-Bioinformatics-Consortium/prokka/commits/master
439
440 Bugs
441 ----
442
443 - tbl2asn seems to be removing the "(anti-codon)" part in my tRNA
444 /product values
445 - tbl2asn putting space in /inference for Infernal
446
447 Citation
448 --------
449
450 Seemann T.
451 Prokka: rapid prokaryotic genome annotation
452 Bioinformatics 2014 Jul 15;30(14):2068-9. PMID:24642063
453 DOI