diff --git a/debian/NEWS.Debian b/debian/NEWS.Debian new file mode 100644 index 0000000..799761e --- /dev/null +++ b/debian/NEWS.Debian @@ -0,0 +1,5 @@ +With version 2.13-1, the binaries and man pages all have a prefix "tigr-" +to avoid conflicts with other programs, with extract in particular. + + -- Steffen Moeller , Thu, 10 Nov 2004 17:33:46 +0100 + diff --git a/debian/README.Debian b/debian/README.Debian new file mode 100644 index 0000000..700b230 --- /dev/null +++ b/debian/README.Debian @@ -0,0 +1,23 @@ +tigr-glimmer for Debian +----------------------- + +The glimmer software of the TIGR institute was renamed to tigr-glimmer +because of a name conflict with the GNOME library. The package works +for me, most efforts went into the reformatting of the readme files +for the man pages, feedback is welcome. + +The upstream authors are very supportive of this debian package for +their software and I thank them for this. + +In version 2.13-1, the binaries and man pages all have a prefix "tigr-" +to avoid conflicts with other programs, with extract in particular. +This was changed in the current packaging of version 3.x in favour of +putting the executables under /usr/lib/tigr-glimmer. A wrapper is +provided that enables to call the binaries via + + tigr-glimmer + +(see man tigr-glimmer). Alternatively you might add this directory +to your search PATH to call the binaries directly. 
+ + -- Steffen Moeller , Thu, 10 Nov 2004 12:33:46 +0100 diff --git a/debian/bin/tigr-glimmer b/debian/bin/tigr-glimmer new file mode 100644 index 0000000..91e5ba9 --- /dev/null +++ b/debian/bin/tigr-glimmer @@ -0,0 +1,24 @@ +#!/bin/sh + +BINDIR=/usr/lib/tigr-glimmer + +if [ $# -lt 1 ] ; then + echo "Usage: $0 " 1>&2 + echo " Existing programs are:" + ls ${BINDIR} + exit 1 +fi + +WRAPPER=$0 +PROGRAM=$1 +shift +ARGS=$* + +if [ -x ${BINDIR}/${PROGRAM} ]; then + exec ${BINDIR}/${PROGRAM} ${ARGS} +else + echo "Usage: ${PROGRAM} does not exist in Tigr Glimmer" + echo " Existing programs are:" + ls ${BINDIR} + exit 1 +fi diff --git a/debian/bin/tigr-run-glimmer3 b/debian/bin/tigr-run-glimmer3 new file mode 100755 index 0000000..4014d77 --- /dev/null +++ b/debian/bin/tigr-run-glimmer3 @@ -0,0 +1,20 @@ +#!/bin/sh +echo "run Glimmer3" +clear +echo "Genome is " $1 +echo "Find non-overlapping orfs in tmp.coord" +BINDIR="/usr/lib/tigr-glimmer" +rm -f tmp.coord +${BINDIR}/long-orfs $1 | ${BINDIR}/get-putative >tmp.coord +echo "Extract training sequences to tmp.train" +rm -f tmp.train +${BINDIR}/extract $1 tmp.coord >tmp.train +wc tmp.train +echo "Build interpolated context model in tmp.model" +rm -f tmp.model +${BINDIR}/build-icm tmp.model +echo "Predict genes with Glimmer3 with coordinates in g3.coord" +rm -f g3.coord +# get-putative is not contained in version 3.x any more +# ${BINDIR}/glimmer3 $1 tmp.model | ${BINDIR}/get-putative >g3.coord +${BINDIR}/glimmer3 $1 tmp.model >g3.coord diff --git a/debian/changelog b/debian/changelog new file mode 100644 index 0000000..f163ee9 --- /dev/null +++ b/debian/changelog @@ -0,0 +1,82 @@ +tigr-glimmer (3.02-4) unstable; urgency=medium + + * moved debian/upstream to debian/upstream/metadata + * cme fix dpkg-control + * Fix crashes reported by Mayhem + Closes: #715701, #715702 + + -- Andreas Tille Tue, 15 Dec 2015 10:17:14 +0100 + +tigr-glimmer (3.02-3) unstable; urgency=low + + * debian/upstream: publication information + * 
debian/source/format: 3.0 (quilt) + * debian/control: + - updated homepage URL + - cme fix dpkg-control + - canonical Vcs URLs + - dropped cdbs + quilt from Build-Depends + * debian/README.source: deleted because redundant + * debian/rules: switch from cdbs to dh + * Hardening by dropping Makefile patch in favour of providing + options directly inside debian/rules + * debian/copyright: DEP5 + * Verified current build log and noticed that -L/usr/lib is not used + Closes: #722845 + + -- Andreas Tille Tue, 05 Nov 2013 10:33:48 +0100 + +tigr-glimmer (3.02-2) unstable; urgency=low + + * debian/control: + - Fixed Vcs-Svn (missing svn/) + - Updated Standards-Version to 3.8.1 (no changes needed) + - Standards-Version: 3.8.3 (no changes needed) + - debhelper (>= 7) + * Fixed E-Mail address of upstream author in debian/copyright + * Fix FTBFS on amd64 + Closes: #560442 + * Added README.source + + -- Andreas Tille Thu, 21 Jan 2010 22:52:45 +0100 + +tigr-glimmer (3.02-1) unstable; urgency=low + + [ Charles Plessy ] + * debian/watch: + - Replaced by the new one written by Nelson (Closes: #385258) + + [ Andreas Tille ] + * New upstream version + * Group maintenance by Debian-Med team + - DM-Upload-Allowed: Yes + - Vcs tags + - Use correct address as Uploader: Steffen Moeller + * Standards-Version: 3.7.3 (no changes needed) + * debhelper >= 5 + * Moved Homepage from long description to control fields + * Removed [Biology] from short description + + -- Andreas Tille Tue, 22 Apr 2008 11:59:07 +0200 + +tigr-glimmer (2.13-1.1) unstable; urgency=low + + * Non-maintainer upload. + * Fix GCC 4.3 compatibility, patch by Kumar Appaiah (Closes: #461691) + + -- Moritz Muehlenhoff Thu, 20 Mar 2008 00:02:06 +0100 + +tigr-glimmer (2.13-1) unstable; urgency=low + + * New upstream release - no significant changes for Linux users. + * Resolves conflict for "extract" binary and man page (Closes:Bug#227790,Bug#274780). 
+ + -- Steffen Moeller Wed, 10 Nov 2004 11:58:46 +0100 + +tigr-glimmer (2.12-1) unstable; urgency=low + + * Initial Release (Closes:#219453). + * Added man pages to upstream release + + -- Steffen Moeller Thu, 16 Oct 2003 17:33:46 +0200 + diff --git a/debian/compat b/debian/compat new file mode 100644 index 0000000..ec63514 --- /dev/null +++ b/debian/compat @@ -0,0 +1 @@ +9 diff --git a/debian/control b/debian/control new file mode 100644 index 0000000..a7b2511 --- /dev/null +++ b/debian/control @@ -0,0 +1,25 @@ +Source: tigr-glimmer +Maintainer: Debian Med Packaging Team +Uploaders: Steffen Moeller , + Andreas Tille +Section: science +Priority: optional +Build-Depends: debhelper (>= 9), + docbook-to-man +Standards-Version: 3.9.6 +Vcs-Browser: http://anonscm.debian.org/viewvc/debian-med/trunk/packages/tigr-glimmer/trunk/ +Vcs-Svn: svn://anonscm.debian.org/debian-med/trunk/packages/tigr-glimmer/trunk/ +Homepage: http://ccb.jhu.edu/software/glimmer/index.shtml + +Package: tigr-glimmer +Architecture: any +Depends: ${shlibs:Depends}, + ${misc:Depends} +Description: Gene detection in archaea and bacteria + Developed by the TIGR institute this software detects coding sequences in + bacteria and archaea. + . + Glimmer is a system for finding genes in microbial DNA, especially the + genomes of bacteria and archaea. Glimmer (Gene Locator and Interpolated + Markov Modeler) uses interpolated Markov models (IMMs) to identify the + coding regions and distinguish them from noncoding DNA. 
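Review note on debian/bin/tigr-glimmer above: the wrapper collects its arguments with ARGS=$* and expands them unquoted, so an argument that contains spaces (for example a genome file named "my genome.seq") would reach the program split into several words. The following is a minimal sketch of a dispatcher that forwards arguments intact with "$@". It is not part of this change; the function names run_tool and list_programs, the TIGR_GLIMMER_BINDIR override, and the usage text are assumptions made for illustration only.

```shell
#!/bin/sh
# Sketch only: dispatch to a program under BINDIR and forward arguments unchanged.
# The default directory matches the wrapper in this diff; the environment
# override (TIGR_GLIMMER_BINDIR) is hypothetical and exists only so the
# sketch can be tried without installing the package.
BINDIR=${TIGR_GLIMMER_BINDIR:-/usr/lib/tigr-glimmer}

# Print the available programs to standard error.
list_programs() {
    echo "  Existing programs are:" 1>&2
    ls "${BINDIR}" 1>&2
}

# run_tool PROGRAM [ARGUMENTS...]
run_tool() {
    if [ $# -lt 1 ]; then
        echo "Usage: tigr-glimmer program [arguments]" 1>&2
        list_programs
        return 1
    fi
    program=$1
    shift
    if [ -f "${BINDIR}/${program}" ] && [ -x "${BINDIR}/${program}" ]; then
        # "$@" keeps every remaining argument as a single word,
        # even when it contains spaces.
        "${BINDIR}/${program}" "$@"
    else
        echo "${program} does not exist in Tigr Glimmer" 1>&2
        list_programs
        return 1
    fi
}
```

With this sketch, run_tool long-orfs "my genome.seq" would hand the file name to long-orfs as one argument. The function calls the program instead of using exec so that it can return a status to its caller; a standalone wrapper script could keep exec as in the diff.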
diff --git a/debian/copyright b/debian/copyright new file mode 100644 index 0000000..51cea7c --- /dev/null +++ b/debian/copyright @@ -0,0 +1,23 @@ +Format: http://www.debian.org/doc/packaging-manuals/copyright-format/1.0/ +Upstream-Name: Glimmer +Upstream-Contact: Art Delcher , + Steven Salzberg +Source: http://ccb.jhu.edu/software/glimmer/glimmer302b.tar.gz + +Files: * +Copyright: © 1999-2008 Art Delcher , + Steven Salzberg +License: Artistic + THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR + IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED + WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. + On Debian systems, the complete text of the GNU General + Public License can be found in `/usr/share/common-licenses/Artistic'. + +Files: debian/* +Copyright: © 2003-2004 Steffen Moeller + © 2008 Charles Plessy + © 2008-2013 Andreas Tille +License: GPL-2+ + On Debian systems, the complete text of the GNU General + Public License can be found in `/usr/share/common-licenses/GPL'. diff --git a/debian/docs b/debian/docs new file mode 100644 index 0000000..b34877b --- /dev/null +++ b/debian/docs @@ -0,0 +1,2 @@ +docs/notes.pdf +debian/glimmer2_docs diff --git a/debian/glimmer2_docs/README b/debian/glimmer2_docs/README new file mode 100644 index 0000000..6287b13 --- /dev/null +++ b/debian/glimmer2_docs/README @@ -0,0 +1,101 @@ + This file and all files in this release of the Glimmer system are + copyright (c) 1999 and (c) 2000 by Arthur Delcher, Steven Salzberg, + Simon Kasif, and Owen White. All rights reserved. Redistribution + is not permitted without the express written permission of + the authors. + +Glimmer 2.0 is described in: + A.L. Delcher, D. Harmon, S. Kasif, O. White, and S.L. Salzberg. + Improved Microbial Gene Identification with Glimmer. + Nucleic Acids Research, 27 (1999), 4636-4641. +Please reference this paper if you use the system as part of any +published research. Note that Glimmer 1.0 is described in + S. 
Salzberg, A. Delcher, S. Kasif, and O. White. + Microbial Gene Identification using Interpolated Markov Models. + Nucleic Acids Research, 26:2 (1998), 544-548. + +Quickstart: if you just want to run Glimmer 2.0 on your genome +and you don't want to adjust any parameters (although we don't +recommend this), you can simply compile this system and run +it with the included run-glimmer2 script. E.g.: +unix-prompt> make +[various compilation messages appear] +unix-prompt> run-glimmer2 mygenome + +run-glimmer2 will create an Interpolated Markov Model of your genome +and store it in a binary file called tmp.model. It will store +the predicted gene coordinates in g2.coord. Along the way +it will extract long ORFs and store them and their coordinates +in tmp.train and tmp.coord. + +Recommended: read the readmes. + +Glimmer 1.0 had 4 readme files, and Glimmer 2.0 maintains that +structure. The four main programs are: + 1. long-orfs + 2. extract + 3. build-icm + 4. glimmer2 +There are files called *.readme for each of these programs. Please +read these first before emailing the authors with any questions. + +Art Delcher, adelcher@tigr.org, was the primary programmer for +most of the Glimmer 2.0 code, and he can answer most technical +questions. + +CHANGELOG, 7/31/00: + - Weak scores are now only invoked with the -w option. Any weak-score + gene is rejected automatically by an overlap with a regular gene. + - Weak-scores genes and "voted" genes are now annotated by [Weak] and + [Vote] in the final listing. Voted genes are those which have a + significant number of relatively high-scoring subregions. Voted + genes also are rejected automatically by overlaps with regular genes. + - Weak scores are computed to be more independent of architecture-dependent + floating-point features. (Previously, 64-bit machines would sometimes + generate different results from 32-bit machines.) 
+ - Fixed bug in RNABin function that occurred when the gene + started on the very last base of the genome. This function is + now not called at all if the Choose_First_Start_Codon option is + selected (which is the default). + - Fixed problem that occurred on short pieces of genome when one + frame (or more) had no stop codons. + - An ignore option (-i) to specify a list of regions in which no predictions + will be made, such as ribosomal RNAs. This feature has not yet been + thoroughly tested. + +CHANGELOG, 9 December 2002 + - Raw scores are now printed in the main listing and in []'s in + the final list of putative genes + - Add +S option to use a "stricter" independent (intergenic) model + that discounts stop codons. Since only orfs (which have no stop + codons) are ever scored, the independent model is at a disadvantage + unless it also assumes that it is only scoring orfs. Thus, with the + +S option, the independent score is done codon by codon. + The probabilities of codons are initially set to what the + previous independent model would be: + The probability of a codon "atg", for example, is: + Pr[a] * Pr[t] * Pr[g] + Then each of these is divided by the sum of the probabilities of the + non-stop codons. + - Add -L option to specify the name of a file containing a list + of coordinates. The genes in these lists are scored separately by + the ICM, output, and then the program stops (i.e., no + overlapping/voting rules). + +CHANGELOG, 5 February 2003 + - The strict independent (intergenic) model is now the only mode. + The +S option is tolerated but has no effect. + +CHANGELOG, 18 April 2003 + - Compute the optimal length for minimum "long" orfs, so that the + program will return the largest number of orfs possible. The -g + switch still works if specified, but I don't know why anyone would + want to use that for a training set. + - Change minimum overlap by default to be 0. 
This means that genes + that overlap even by 1 base will be considered in conflict by Glimmer, + and the program will try to adjust their start codons to remove the + conflict or else delete one of the genes. + +CHANGELOG, 7 October 2003 + - Fix bug on long-orfs.cc to avoid occasional array out-of-bounds + error (detected on Mac OS X). diff --git a/debian/glimmer2_docs/build-icm.readme b/debian/glimmer2_docs/build-icm.readme new file mode 100644 index 0000000..9a89599 --- /dev/null +++ b/debian/glimmer2_docs/build-icm.readme @@ -0,0 +1,60 @@ +// Copyright (c) 1997-99 by Arthur Delcher, Steven Salzberg, Simon +// Kasif, and Owen White. All rights reserved. Redistribution +// is not permitted without the express written permission of +// the authors. + +Program build-icm.c creates and outputs an interpolated Markov +model (IMM) as described in the paper + A.L. Delcher, D. Harmon, S. Kasif, O. White, and S.L. Salzberg. + Improved Microbial Gene Identification with Glimmer. + Nucleic Acids Research, 1999, in press. +Please reference this paper if you use the system as part of any +published research. + +Input comes from the file named on the command-line. Format should be +one string per line. Each line has an ID string followed by white space +followed by the sequence itself. The script run-glimmer2 generates +an input file in the correct format using the 'extract' program. + +The IMM is constructed as follows: For a given context, say +acgtta, we want to estimate the probability distribution of the +next character. We shall do this as a linear combination of the +observed probability distributions for this context and all of +its suffixes, i.e., cgtta, gtta, tta, ta, a and empty. By +observed distributions I mean the counts of the number of +occurrences of these strings in the training set. The linear +combination is determined by a set of probabilities, lambda, one +for each context string. 
For context acgtta the linear combination +coefficients are: + lambda (acgtta) + (1 - lambda (acgtta)) x lambda (cgtta) + (1 - lambda (acgtta)) x (1 - lambda (cgtta)) x lambda (gtta) + (1 - lambda (acgtta)) x (1 - lambda (cgtta)) x (1 - lambda (gtta)) x lambda (tta) + : + (1 - lambda (acgtta)) x (1 - lambda (cgtta)) x (1 - lambda (gtta)) + x (1 - lambda (tta)) x (1 - lambda (ta)) x (1 - lambda (a)) + +We compute the lambda values for each context as follows: + - If the number of observations in the training set is >= the constant + SAMPLE_SIZE_BOUND, the lambda for that context is 1.0 + - Otherwise, do a chi-square test on the observations for this context + compared to the distribution predicted for the one-character shorter + suffix context. + If the chi-square significance < 0.5, set the lambda for this context to 0.0 + Otherwise set the lambda for this context to: + (chi-square significance) x (# observations) / SAMPLE_WEIGHT + +To compile the program: + + g++ build-icm.c -lm -o build-icm + + Uses include files delcher.h context.h strarray.h gene.h + +To run the program: + + build-icm train.model + + This will use the training data in train.seq to produce the file + train.model, containing your IMM. + + diff --git a/debian/glimmer2_docs/extract.readme b/debian/glimmer2_docs/extract.readme new file mode 100644 index 0000000..19d630e --- /dev/null +++ b/debian/glimmer2_docs/extract.readme @@ -0,0 +1,55 @@ +// Copyright (c) 1997 by Arthur Delcher, Steven Salzberg, Simon +// Kasif, and Owen White. All rights reserved. Redistribution +// is not permitted without the express written permission of +// the authors. + +Program extract takes a FASTA format sequence file and a file +with a list of start/stop positions in that file (e.g., as produced +by the long-orfs program) and extracts and outputs the +specified sequences. + +The first command-line argument is the name of the sequence file, +which must be in FASTA format. 
+ +The second command-line argument is the name of the coordinate file. +It must contain a list of pairs of positions in the first file, one +per line. The format of each entry is: + +This file should contain no other information, so if you're using +the output of glimmer or long-orfs, you'll have to cut off +header lines. + +The output of the program goes to the standard output and has one +line for each line in the coordinate file. Each line contains +the IDstring, followed by white space, followed by the substring +of the sequence file specified by the coordinate pair. Specifically, +the substring starts at the first position of the pair and ends at +the second position (inclusive). If the first position is bigger +than the second, then the DNA reverse complement of each position +is generated. Start/stop pairs that "wrap around" the end of the +genome are allowed. + +There are two optional command-line arguments: + + -skip makes the output omit the first 3 characters of each sequence, + i.e., it skips over the start codon. This was the default + behaviour of the previous version of the program. + + -l n makes the output omit any sequences shorter than n characters. + n includes the 3 skipped characters if the -skip switch + is on. + +To compile the program: + + g++ extract.c -lm -o extract + + Uses include file delcher.h + + +To run the program: + + extract genome.seq list.coord + + where genome.seq is a genome sequence in FASTA format and + list.coord is a list of start/stop pairs + diff --git a/debian/glimmer2_docs/glimmer2.readme b/debian/glimmer2_docs/glimmer2.readme new file mode 100644 index 0000000..0b71d8c --- /dev/null +++ b/debian/glimmer2_docs/glimmer2.readme @@ -0,0 +1,295 @@ +// Copyright (c) 1997-99 by Arthur Delcher, Steven Salzberg, Simon +// Kasif, and Owen White. All rights reserved. Redistribution +// is not permitted without the express written permission of +// the authors. 
+ +// Version 1.02 revised 25 Feb 98 to ignore the independent +// (random) model for long orfs. The default +// length for "long" in this case is set to the length at which +// exactly 1 orf of this length would be expected per 1 million +// bases given the gc content of the genome. This value also can be +// set by command-line option -q . + +// Version 1.03 revised 8 Feb 99 to make it easier to specify +// start and stop codons. + +// Version 1.04 revised 10 May 99 to add -l command-line switch +// to both glimmer and long-orfs to regard genome as *NOT* +// circular. Default is to regard it as circular. +// Version 2.0 uses a tree-based IMM as described in the references +// given in the README file. It also implements an extensive new +// algorithm (see the paper) to adjust the start locations of genes +// whose initial coordinates result in an overlap. + +// Version: 2.01 31 Jul 98 +// Change probability model +// Simplify wraparounds +// Move start codons to eliminate overlaps +// Discount independent model scores when +// there are no overlaps +// Uses Harmon's model + +// Version: 2.03 9 Dec 2002 +// Include raw scores in output +// Add strict option to use independent intergenic +// model that discounts stop codons +// Add option to score each entry from a list of coordinates +// separately, without overlapping/voting rules + +// Version: 2.10 5 Feb 2003 +// Strict option to use independent intergenic +// model that discounts stop codons is only behaviour + +// Version: 2.11 18 Apr 2003 +// Change long-orfs to automatically compute the +// optimal value of ORF length in order to maximize +// the amount of training data. +Program glimmer takes two inputs: a sequence file (in FASTA format) +and a collection of Markov models for genes as produced by the program +build-icm . It outputs a list of all open reading frames (orfs) together +with scores for each as a gene. 
+ +The first few lines of output specify the settings of various +parameter in the program: + + Minimum gene length is the length of the smallest fragment + considered to be a gene. The length is measured from the first base + of the start codon to the last base *before* the stop codon. + This value can be specified when running the program with the -g option. + + Minimum overlap length is a lower bound on the number of bases overlap + between 2 genes that is considered a problem. Overlaps shorter than + this are ignored. + + Minimum overlap percent is another lower bound on the number of bases + overlap that is considered a problem. Overlaps shorter than this + percentage of *both* genes are ignored. + + Threshold score is the minimum in-frame score for a fragment to be + considered a potential gene. + + Use independent scores indicates whether the last column that scores each + fragment using independent base probabilities is present. + + Use first start codon indicates whether the first possible start codon + is used or not. If not, the function Choose_Start is called to + choose the start codon. Currently it computes hybridization energy + between the string Ribosome_Pattern and the region in front of + the start codon, and if this is above a threshold, that start site + is chosen. The ribosome pattern string can be set by the -s option. + Presumably function Choose_Start should be modified to do something + cleverer. + + Currently used start codons are atg, gtg & ttg . These can be changed + in the function Is_Start , but corresponding changes should be + made in Choose_Start . + + +The next portion of the output is the result for each orf: + + Column 1 is an ID number for reference purposes. It is assigned + sequentially starting with 1 to all orfs whose Gene Score is + at least 90 . I'll make this a command-line option when I decide + what letter to use. + + Column 2 is the reading frame of the orf. 
Three forward (F1, F2 and F3) + and three reverse (R1, R2 and R3). These correspond with the headings + for the scores in columns 9-14. + + Column 3 is the start position of the orf, i.e., the first base *after* + the previous stop codon. + + Column 4 is the position of the first base of the first start codon in + the orf. Currently I use atg, ctg, gtg and ttg as start codons. + + Column 5 is the position of the last base *before* the stop codon. Stop + codons are taa, tag, and tga. Note that for orfs in the reverse + reading frames have their start position higher than the end position. + The order in which orfs are listed is in increasing order by + Max {OrfStart, End}, i.e., the highest numbered position in the orf, + except for orfs that "wrap around" the end of the sequence. + + Columns 6 and 7 are the lengths of the orf and gene, respectively, i.e., + 1 + |OrfStart - End| and 1 + |GeneStart - End| . + + Column 8 is the score for the gene region. It is the probability (as + a percent) that the Markov model in the correct frame generated this + sequence. This value matches the value in the corresponding column + of frame scores--an orf in reading frame R1 has a Gene Score equal to + the value in the R1 column of frame scores for that orf. + + Columns 9-14 are the scores for the gene region in each of the 6 reading + frames. It is the probability (as a percent) that the Markov model in + that frame generated this sequence. + + Column 15 is the probability as a percent that the gene sequence was generated + by a model of independent probabilities for each base, and represents to + some extent the probability that the sequence is "random". + + +When two genes with ID numbers overlap by at least a sufficient +amount (as determined by Min_Olap and Min_Olap_Percent ), a line +beginning with *** is printed and scores for the overlap region +are printed. 
If the frame of the high score of the overlap +region matches the frame of the longer gene, then a message is +printed that the shorter gene is rejected. Otherwise, a message +is printed that *both* genes are "suspect". A suspect or reject +message for any gene is only printed once, however. + +A message is also printed if a gene with an ID number wholly contains another +gene with an ID number. The longer "shadows" the shorter. + + +At the end a list of "putative" gene positions is produced. The first +column is the ID number, the second is the start position, the third +is the end position. For "suspect" genes, a notation in []'s follows: + + [Bad Olap a b c] means that gene number a overlapped this one and + was shorter but scored higher on the overlap region. b is the length + of the overlap region and c is the score of *this* gene on the overlap + region. There should be a [Shorter ...] notation with gene a + giving its score. + + [Shorter a b c] means that gene number a overlapped this one and + was longer but scored lower on the overlap region. b is the length + of the overlap region and c is the score of *this* gene on the overlap + region. There should be a [Bad olap ...] notation with gene a + giving its score. + + [Shadowed by a] means that this gene was completely contained as part + of gene a's region, but in another frame. + + [Delay by a b c d] means that this gene was tentatively rejected + because of an overlap with gene b, but if the start codon is postponed + by a positions, then this would be a valid gene. The start position + reported for this gene includes the delay. c is the length of the overlap + region that caused the rejection and d is the score in this gene's frame + on that overlap region. + + [Weak] means that this gene did not meet the regular scoring threshold, + but if the independent model were ignored, its score would be high + enough. Should only occur if the -w option is used. 
+ + [Vote] means that this gene did not meet the regular scoring threshold, + but sufficiently many of its subranges had high enough scores to + indicate it might be a gene. + +Note that a gene marked as rejected may appear in this list. This can +occur if the gene that caused the rejection was itself rejected. The +actual algorithm to produce the list is as follows: + + Consider the genes in decreasing order by length. If gene x is to + be rejected because of an overlap with longer gene y that has not been + rejected, then gene x is rejected and does not appear in the list. + Otherwise, all notations for gene x that are not caused by rejected + genes are reported. + +I think a "delayed" gene might incorrectly be listed as causing a problem +by the part of it that was eliminated by the delay. Probably the remaining +portion should be reinserted into the sorted list based on its now-shorter +length, and any notations caused by it should be re-checked to see if +they're affected by shortening the gene. Let's save this for the next +version. + + + +Specifying Different Start and Stop Codons: + +To specify different sets of start and stop codons, modify the file +gene.h. Specifically, the functions: + + Is_Forward_Start Is_Reverse_Start Is_Start + Is_Forward_Stop Is_Reverse_Stop Is_Stop + +are used to determine what is used for start and stop codons. + +Is_Start and Is_Stop do simple string comparisons to specify +which patterns are used. To add a new pattern, just add the comparison +for it. To remove a pattern, comment out or delete the comparison +for it. + +The other four functions use a bit comparison to determine start and +stop patterns. They represent a codon as a 12-bit pattern, with 4 bits +for each base, one bit for each possible value of the bases, T, G, C +or A. Thus the bit pattern 0010 0101 1100 represents the base +pattern [C] [A or G] [G or T]. 
By doing bit operations (& | ~) and +comparisons, more complicated patterns involving ambiguous reads +can be tested efficiently. Simple patterns can be tested as in +the current code. + +For example, to insert an additional start codon of CAT requires 3 changes: +1. The line + || (Codon & 0x218) == Codon + should be inserted into Is_Forward_Start , since 0x218 = 0010 0001 1000 + represents CAT. +2. The line + || (Codon & 0x184) == Codon + should be inserted into Is_Reverse_Start , since 0x184 = 0001 1000 0100 + represents ATG, which is the reverse-complement of CAT. Alternately, + the #define constant ATG_MASK could be used. +3. The line + || strncmp (S, "cat", 3) == 0 + should be inserted into Is_Start . +If not automatically using the first start codon, some changes might +also be made to the function Choose_Start . + + + +To compile the program: + + Use the Makefile. It will put the executables in a bin subdirectory. + + To compile just this program use: + + g++ glimmer2.c -lm -o glimmer + + Uses include files delcher.h context.h strarray.h gene.h + + +To run the program: + + First run build-icm on a set of sequences to make the Markov models. + + build-icm train.model + + This will produce a file train.model. You can call this file anything + you like, train.model, myicm, itsrainingtoday, etc. + + Then run glimmer2 + + glimmer2 hflu.seq train.model + + Options can be specified after the 2nd file name + + glimmer2 hflu.seq train.model + + Options are: + -f Use ribosome-binding energy to choose start codon. This is + not fully tested and likely to be buggy. Better not to use it. + +f Use first codon in orf as start codon + -g n Set minimum gene length to n + -i s Ignore bases within the coordinates listed in file s. File s + should consist of one base pair per line (no tags), and the ignore + region should be a multiple of three bases long. 
[Somewhat buggy] + -l Regard the genome as linear (not circular), i.e., do not allow + genes to "wrap around" the end of the genome. + This option works on both glimmer and long-orfs . + The default behavior is to regard the genome as circular. + -o n Set minimum overlap length to n. Overlaps shorter than this + are ignored. + -p n Set minimum overlap percentage to n%. Overlaps shorter than + this percentage of *both* strings are ignored. + -q n If using independent model scores (+r option), it will only + apply to orfs shorter than n . The default value for n + has an expectation of one orf that length or longer occurring + per million bases in a random genome with the same gc content + -r Don't use independent probability score column + +r Use independent probability score column + -s s Use string s as the ribosome binding pattern to find start codons. + Not fully tested and known to have bugs. + -t n Set threshold score for calling as gene to n. If the in-frame + score >= n, then the region is given a number and considered + a potential gene. + -w n Use "weak" scores on potential genes at least n bases long. + Weak scores ignore the independent model. + -X Allow orfs extending off ends of sequence to be scored diff --git a/debian/glimmer2_docs/long-orfs.readme b/debian/glimmer2_docs/long-orfs.readme new file mode 100644 index 0000000..a19bfdf --- /dev/null +++ b/debian/glimmer2_docs/long-orfs.readme @@ -0,0 +1,140 @@ +// Copyright (c) 1997-99 by Arthur Delcher, Steven Salzberg, Simon +// Kasif, and Owen White. All rights reserved. Redistribution +// is not permitted without the express written permission of +// the authors. +// Version: 1.1 April 2003 (S. Salzberg) +// Compute the optimal length for minimum "long" +// orfs, so that the program will return the largest +// number of orfs possible. The -g switch still works +// if specified, but I don't know why anyone would want +// to use that for a training set. +// Also, change min overlap by default to be 0. 
+// Version 1.04 revised 10 May 99 to add -l command-line switch +// to both glimmer and long-orfs to regard genome as *NOT* +// circular. Default is to regard it as circular. + +Program long-orfs takes a sequence file (in FASTA format) and +outputs a list of all long "potential genes" in it that do not +overlap by too much. By "potential gene" I mean the portion of +an orf from the first start codon to the stop codon at the end. + +The first few lines of output specify the settings of various +parameters in the program: + + Minimum gene length is the length of the smallest fragment + considered to be a gene. The length is measured from the first base + of the start codon to the last base *before* the stop codon. + This value can be specified when running the program with the -g option. + By default, the program now (April 2003) will compute an optimal length + for this parameter, where "optimal" is the value that produces the + greatest number of long ORFs, thereby increasing the amount of data + used for training. + + Minimum overlap length is a lower bound on the number of bases overlap + between 2 genes that is considered a problem. Overlaps shorter than + this are ignored. + + Minimum overlap percent is another lower bound on the number of bases + overlap that is considered a problem. Overlaps shorter than this + percentage of *both* genes are ignored. + +The next portion of the output is a list of potential genes: + + Column 1 is an ID number for reference purposes. It is assigned + sequentially starting with 1 to all long potential genes. If + overlapping genes are eliminated, gaps in the numbers will occur. + The ID prefix is specified in the constant ID_PREFIX . + + Column 2 is the position of the first base of the first start codon in + the orf. Currently I use atg, and gtg as start codons. This is + easily changed in the function Is_Start () . + + Column 3 is the position of the last base *before* the stop codon. Stop + codons are taa, tag, and tga. 
Note that for orfs in the reverse + reading frames have their start position higher than the end position. + The order in which orfs are listed is in increasing order by + Max {OrfStart, End}, i.e., the highest numbered position in the orf, + except for orfs that "wrap around" the end of the sequence. + +When two genes with ID numbers overlap by at least a sufficient +amount (as determined by Min_Olap and Min_Olap_Percent ), they +are eliminated and do not appear in the output. + +The final output of the program (sent to the standard error file so +it does not show up when output is redirected to a file) is the +length of the longest orf found. + + + +Specifying Different Start and Stop Codons: + +To specify different sets of start and stop codons, modify the file +gene.h . Specifically, the functions: + + Is_Forward_Start Is_Reverse_Start Is_Start + Is_Forward_Stop Is_Reverse_Stop Is_Stop + +are used to determine what is used for start and stop codons. + +Is_Start and Is_Stop do simple string comparisons to specify +which patterns are used. To add a new pattern, just add the comparison +for it. To remove a pattern, comment out or delete the comparison +for it. + +The other four functions use a bit comparison to determine start and +stop patterns. They represent a codon as a 12-bit pattern, with 4 bits +for each base, one bit for each possible value of the bases, T, G, C +or A. Thus the bit pattern 0010 0101 1100 represents the base +pattern [C] [A or G] [G or T]. By doing bit operations (& | ~) and +comparisons, more complicated patterns involving ambiguous reads +can be tested efficiently. Simple patterns can be tested as in +the current code. + +For example, to insert an additional start codon of CAT requires 3 changes: +1. The line + || (Codon & 0x218) == Codon + should be inserted into Is_Forward_Start , since 0x218 = 0010 0001 1000 + represents CAT. +2. 
The line + || (Codon & 0x184) == Codon + should be inserted into Is_Reverse_Start , since 0x184 = 0001 1000 0100 + represents ATG, which is the reverse-complement of CAT. Alternately, + the #define constant ATG_MASK could be used. +3. The line + || strncmp (S, "cat", 3) == 0 + should be inserted into Is_Start . + + + +To compile the program: + + g++ long-orfs.c -lm -o long-orfs + + Uses include files delcher.h gene.h + + +To run the program: + + long-orfs genome.seq + + where genome.seq is a genome sequence in FASTA format. + + Options can be specified after the genome file name + + long-orfs genome.seq + + Options are: + -g n Set minimum gene length to n. Default is to compute an + optimal value automatically. Don't change this unless you + know what you're doing. + -l Regard the genome as linear (not circular), i.e., do not allow + genes to "wrap around" the end of the genome. + This option works on both glimmer and long-orfs . + The default behavior is to regard the genome as circular. + -o n Set maximum overlap length to n. Overlaps shorter than this + are permitted. (Default is 0 bp.) + -p n Set maximum overlap percentage to n%. Overlaps shorter than + this percentage of *both* strings are ignored. (Default is 10%.) + +If you *DON'T* want to eliminate overlapping genes, just use the -p 100 +option. diff --git a/debian/glimmer2_mans/tigr-anomaly.sgml b/debian/glimmer2_mans/tigr-anomaly.sgml new file mode 100644 index 0000000..807de64 --- /dev/null +++ b/debian/glimmer2_mans/tigr-anomaly.sgml @@ -0,0 +1,124 @@ + manpage.1'. You may view + the manual page with: `docbook-to-man manpage.sgml | nroff -man | + less'. A typical entry in a Makefile or Makefile.am is: + +manpage.1: manpage.sgml + docbook-to-man $< > $@ + + + The docbook-to-man binary is found in the docbook-to-man package. 
+ Please remember that if you create the nroff version in one of the + debian/rules file targets (such as build), you will need to include + docbook-to-man in your Build-Depends control field. + + --> + + + Steffen"> + Möller"> + + November 10, 2004"> + 1"> + moeller@debian.org"> + + TIGR-GLIMMER"> + + + Debian"> + GNU"> + GPL"> +]> + + + +
+ &dhemail; +
+ + &dhfirstname; + &dhsurname; + + + 2003 + &dhusername; + + &dhdate; +
+ + &dhucpackage; + + &dhsection; + + + &dhpackage; + + +The program lacks a description + + + + + tigr-anomaly + >dna-file + >coord-file + + + + DESCRIPTION + + + + + + OPTIONS + + + SEE ALSO + +tigr-glimmer3 (1), +tigr-adjust (1), +tigr-anomaly (1), +tigr-build-icm (1), +tigr-check (1), +tigr-codon-usage (1), +tigr-compare-lists (1), +tigr-extract (1), +tigr-generate (1), +tigr-get-len (1), +tigr-get-putative (1), + + +http://www.tigr.org/software/glimmer/ + + + Please see the readme in /usr/share/doc/tigr-glimmer for a description on how to use Glimmer3. + + + AUTHOR + + This manual page was quickly copied from the glimmer web site by &dhusername; &dhemail; for + the &debian; system. + + + +
+ + diff --git a/debian/glimmer2_mans/tigr-build-icm.sgml b/debian/glimmer2_mans/tigr-build-icm.sgml new file mode 100644 index 0000000..344be5a --- /dev/null +++ b/debian/glimmer2_mans/tigr-build-icm.sgml @@ -0,0 +1,162 @@ + manpage.1'. You may view + the manual page with: `docbook-to-man manpage.sgml | nroff -man | + less'. A typical entry in a Makefile or Makefile.am is: + +manpage.1: manpage.sgml + docbook-to-man $< > $@ + + + The docbook-to-man binary is found in the docbook-to-man package. + Please remember that if you create the nroff version in one of the + debian/rules file targets (such as build), you will need to include + docbook-to-man in your Build-Depends control field. + + --> + + + Steffen"> + Möller"> + + November 10, 2004"> + 1"> + moeller@debian.org"> + + TIGR-GLIMMER"> + <!ENTITY dhpackage "tigr-glimmer"> + + <!ENTITY debian "<productname>Debian</productname>"> + <!ENTITY gnu "<acronym>GNU</acronym>"> + <!ENTITY gpl "&gnu; <acronym>GPL</acronym>"> +]> + +<refentry> + <refentryinfo> + <address> + &dhemail; + </address> + <author> + &dhfirstname; + &dhsurname; + </author> + <copyright> + <year>2003</year> + <holder>&dhusername;</holder> + </copyright> + &dhdate; + </refentryinfo> + <refmeta> + &dhucpackage; + + &dhsection; + </refmeta> + <refnamediv> + <refname>&dhpackage;</refname> + <refpurpose>Creates and outputs an interpolated Markov model (IMM)</refpurpose> + </refnamediv> + <refsynopsisdiv> + <cmdsynopsis> + <command>tigr-build-icm</command> + </cmdsynopsis> + </refsynopsisdiv> + <refsect1> + <title>DESCRIPTION + +Program build-icm.c creates and outputs an interpolated Markov +model (IMM) as described in the paper + A.L. Delcher, D. Harmon, S. Kasif, O. White, and S.L. Salzberg. + Improved Microbial Gene Identification with Glimmer. + Nucleic Acids Research, 1999, in press. +Please reference this paper if you use the system as part of any +published research. + +Input comes from the file named on the command-line.
Format should be +one string per line. Each line has an ID string followed by white space +followed by the sequence itself. The script run-glimmer3 generates +an input file in the correct format using the 'extract' program. + +The IMM is constructed as follows: For a given context, say +acgtta, we want to estimate the probability distribution of the +next character. We shall do this as a linear combination of the +observed probability distributions for this context and all of +its suffixes, i.e., cgtta, gtta, tta, ta, a and empty. By +observed distributions I mean the counts of the number of +occurrences of these strings in the training set. The linear +combination is determined by a set of probabilities, lambda, one +for each context string. For context acgtta the linear combination +coefficients are: + + lambda (acgtta) + (1 - lambda (acgtta)) x lambda (cgtta) + (1 - lambda (acgtta)) x (1 - lambda (cgtta)) x lambda (gtta) + (1 - lambda (acgtta)) x (1 - lambda (cgtta)) x (1 - lambda (gtta)) x lambda (tta) + (1 - lambda (acgtta)) x (1 - lambda (cgtta)) x (1 - lambda (gtta)) + x (1 - lambda (tta)) x (1 - lambda (ta)) x (1 - lambda (a)) + +We compute the lambda values for each context as follows: + - If the number of observations in the training set is >= the constant + SAMPLE_SIZE_BOUND, the lambda for that context is 1.0 + - Otherwise, do a chi-square test on the observations for this context + compared to the distribution predicted for the one-character shorter + suffix context. + If the chi-square significance < 0.5, set the lambda for this context to 0.0 + Otherwise set the lambda for this context to: + (chi-square significance) x (# observations) / SAMPLE_WEIGHT + +To run the program: + + build-icm <train.seq > train.model + + This will use the training data in train.seq to produce the file + train.model, containing your IMM. 
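As a rough illustration of the interpolation described above (this is not the build-icm source), the lambda rule and the resulting linear-combination coefficients can be sketched in Python. The function names are made up for this example, SAMPLE_SIZE_BOUND and SAMPLE_WEIGHT are passed in as arguments because their values are not given here, and the chi-square significance is taken as an input rather than computed.

```python
def context_lambda(observations, chi2_significance, sample_size_bound, sample_weight):
    """Lambda for one context, following the rules stated in the text.

    sample_size_bound and sample_weight stand in for the constants
    SAMPLE_SIZE_BOUND and SAMPLE_WEIGHT; their real values are not assumed.
    """
    if observations >= sample_size_bound:
        return 1.0
    if chi2_significance < 0.5:
        return 0.0
    return chi2_significance * observations / sample_weight


def interpolation_weights(lambdas):
    """Coefficients of the linear combination.

    lambdas is ordered from the longest context to the shortest non-empty
    one, e.g. [lambda(acgtta), lambda(cgtta), ..., lambda(a)].  The returned
    list has one extra, final entry: the weight left for the empty context.
    """
    weights = []
    remaining = 1.0  # product of (1 - lambda) over all longer contexts
    for lam in lambdas:
        weights.append(remaining * lam)
        remaining *= 1.0 - lam
    weights.append(remaining)
    return weights
```

Whatever the lambda values are, the coefficients produced this way sum to 1, which is why the last entry in the list above is the product of all the (1 - lambda) factors.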
+ + + + SEE ALSO + +tigr-glimmer3 (1), +tigr-long-orfs (1), +tigr-adjust (1), +tigr-anomaly (1), +tigr-extract (1), +tigr-check (1), +tigr-codon-usage (1), +tigr-compare-lists (1), +tigr-extract (1), +tigr-generate (1), +tigr-get-len (1), +tigr-get-putative (1), + + http://www.tigr.org/software/glimmer/ + Please see the readme in /usr/share/doc/tigr-glimmer for a description on how to use Glimmer3. + + + AUTHOR + + This manual page was quickly copied from the glimmer web site and readme file by &dhusername; &dhemail; for + the &debian; system. + + + + + + + + diff --git a/debian/glimmer2_mans/tigr-extract.sgml b/debian/glimmer2_mans/tigr-extract.sgml new file mode 100644 index 0000000..f701860 --- /dev/null +++ b/debian/glimmer2_mans/tigr-extract.sgml @@ -0,0 +1,165 @@ + manpage.1'. You may view + the manual page with: `docbook-to-man manpage.sgml | nroff -man | + less'. A typical entry in a Makefile or Makefile.am is: + +manpage.1: manpage.sgml + docbook-to-man $< > $@ + + + The docbook-to-man binary is found in the docbook-to-man package. + Please remember that if you create the nroff version in one of the + debian/rules file targets (such as build), you will need to include + docbook-to-man in your Build-Depends control field. + + --> + + + Steffen"> + Möller"> + + November 10, 2004"> + 1"> + moeller@debian.org"> + + TIGR-GLIMMER"> + + + Debian"> + GNU"> + GPL"> +]> + + + +
+ &dhemail; +
+ + &dhfirstname; + &dhsurname; + + + 2003 + &dhusername; + + &dhdate; +
+ + &dhucpackage; + + &dhsection; + + + &dhpackage; + + +Extract sequences at given start/stop positions from a genome sequence + + + + + tigr-extract + genome-file + + + + DESCRIPTION + +Program extract takes a FASTA format sequence file and a file +with a list of start/stop positions in that file (e.g., as produced +by the long-orfs program) and extracts and outputs the +specified sequences. + +The first command-line argument is the name of the sequence file, +which must be in FASTA format. + +The second command-line argument is the name of the coordinate file. +It must contain a list of pairs of positions in the first file, one +per line. The format of each entry is: + <IDstring> <start position> <stop position> +This file should contain no other information, so if you're using +the output of glimmer or long-orfs , you'll have to cut off +header lines. + +The output of the program goes to the standard output and has one +line for each line in the coordinate file. Each line contains +the IDstring , followed by white space, followed by the substring +of the sequence file specified by the coordinate pair. Specifically, +the substring starts at the first position of the pair and ends at +the second position (inclusive). If the first position is bigger +than the second, then the DNA reverse complement of each position +is generated. Start/stop pairs that "wrap around" the end of the +genome are allowed. + + + + OPTIONS + + + + + makes the output omit the first 3 characters of each sequence, i.e., it skips over the start codon. This was the behaviour of the previous version of the program. + + + + + makes the output omit any sequences shorter than n characters. + n includes the 3 skipped characters if the -skip switch + is on.
+ + + + + + SEE ALSO + +tigr-glimmer3 (1), +tigr-long-orfs (1), +tigr-adjust (1), +tigr-anomaly (1), +tigr-build-icm (1), +tigr-check (1), +tigr-codon-usage (1), +tigr-compare-lists (1), +tigr-extract (1), +tigr-generate (1), +tigr-get-len (1), +tigr-get-putative (1), + + +http://www.tigr.org/software/glimmer/ + + + Please see the readme in /usr/share/doc/tigr-glimmer for a description on how to use Glimmer3. + + + AUTHOR + + This manual page was quickly copied from the glimmer web site by &dhusername; &dhemail; for + the &debian; system. + + + +
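The coordinate convention in the extract description (1-based, inclusive positions, reverse complement when the start is larger than the stop) can be illustrated with a small Python sketch. This is an assumption-laden example, not the extract program itself: the function name extract_region is invented here, and pairs that "wrap around" the end of the genome are deliberately not handled.

```python
# Complement table for the four bases, in both cases.
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def extract_region(genome, start, stop):
    """Return the substring for a 1-based, inclusive coordinate pair.

    If start > stop the region lies on the reverse strand, so the
    reverse complement is returned.  Wrap-around pairs are not supported
    in this sketch.
    """
    if start <= stop:
        return genome[start - 1:stop]
    forward = genome[stop - 1:start]
    return forward.translate(_COMPLEMENT)[::-1]
```

For example, on the sequence aaacccgggttt the pair (4, 6) gives ccc, while the pair (6, 1) gives the reverse complement of aaaccc.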
+ + + + diff --git a/debian/glimmer2_mans/tigr-glimmer3.sgml b/debian/glimmer2_mans/tigr-glimmer3.sgml new file mode 100644 index 0000000..c552f16 --- /dev/null +++ b/debian/glimmer2_mans/tigr-glimmer3.sgml @@ -0,0 +1,246 @@ + manpage.1'. You may view + the manual page with: `docbook-to-man manpage.sgml | nroff -man | + less'. A typical entry in a Makefile or Makefile.am is: + +manpage.1: manpage.sgml + docbook-to-man $< > $@ + + + The docbook-to-man binary is found in the docbook-to-man package. + Please remember that if you create the nroff version in one of the + debian/rules file targets (such as build), you will need to include + docbook-to-man in your Build-Depends control field. + + --> + + + Steffen"> + Möller"> + + November 10, 2004"> + 1"> + moeller@debian.org"> + + TIGR-GLIMMER"> + + + Debian"> + GNU"> + GPL"> +]> + + + +
+ &dhemail; +
+ + &dhfirstname; + &dhsurname; + + + 2003 + &dhusername; + + &dhdate; +
+ + &dhucpackage; + + &dhsection; + + + &dhpackage; + +Find/Score potential genes in genome-file using the probability model in icm-file + + + + + tigr-glimmer3 + + + + + + + DESCRIPTION + +&dhpackage; is a system for finding genes in microbial DNA, especially the genomes of bacteria and archaea. &dhpackage; (Gene Locator and Interpolated Markov Modeler) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA. The IMM approach, described in our Nucleic Acids Research paper on &dhpackage; 1.0 and in our subsequent paper on &dhpackage; 2.0, uses a combination of Markov models from 1st through 8th-order, weighting each model according to its predictive power. &dhpackage; 1.0 and 2.0 use 3-periodic nonhomogenous Markov models in their IMMs. + +&dhpackage; is the primary microbial gene finder at TIGR, and has been used to annotate the complete genomes of B. burgdorferi (Fraser et al., Nature, Dec. 1997), T. pallidum (Fraser et al., Science, July 1998), T. maritima, D. radiodurans, M. tuberculosis, and non-TIGR projects including C. trachomatis, C. pneumoniae, and others. Its analyses of some of these genomes and others is available at the TIGR microbial database site. + +A special version of &dhpackage; designed for small eukaryotes, GlimmerM, was used to find the genes in chromosome 2 of the malaria parasite, P. falciparum.. GlimmerM is described in S.L. Salzberg, M. Pertea, A.L. Delcher, M.J. Gardner, and H. Tettelin, "Interpolated Markov models for eukaryotic gene finding," Genomics 59 (1999), 24-31. Click here (http://www.tigr.org/software/glimmerm/) to visit the GlimmerM site, which includes information on how to download the GlimmerM system. + +The &dhpackage; system consists of two main programs. The first of these is the training program, build-imm. This program takes an input set of sequences and builds and outputs the IMM for them. These sequences can be complete genes or just partial orfs. 
For a new genome, this training data can consist of those genes with strong database hits as well as very long open reading frames that are statistically almost certain to be genes. The second program is glimmer, which uses this IMM to identify putative genes in an entire genome. &dhpackage; automatically resolves conflicts between most overlapping genes by choosing one of them. It also identifies genes that are suspected to truly overlap, and flags these for closer inspection by the user. These ``suspect'' gene candidates have been a very small percentage of the total for all the genomes analyzed thus far. + &dhpackage; is a program that... + + + OPTIONS + + + + + Use n as GC percentage of independent model + Note: n should be a percentage, e.g., -C 45.2 + + + + -fUse ribosome-binding energy to choose start codon + + + Use first codon in orf as start codon + + + Set minimum gene length to n + + + + + Use + to select regions of bases that are off + limits, so that no bases within that area will be examined + + + + + + Assume linear rather than circular genome, i.e., no wraparound + + + + Use filename to specify a list of orfs that should + be scored separately, with no overlap rules + + + + + Input is a multifasta file of separate genes to be scored + separately, with no overlap rules + + + + + + + Set minimum overlap length to n. Overlaps shorter than this + are ignored. + + + + + + + Set minimum overlap percentage to n%. Overlaps shorter than this percentage of *both* strings are ignored. + + + + + + + Set the maximum length orf that can be rejected because of + the independent probability score column to (n - 1) + + + + + + + + Don't use independent probability score column + + + + + + +Use independent probability score column + + + + + + + + Don't use independent probability score column + + + Use string s as the ribosome binding pattern to find start codons. 
+ + + + + + + Do use stricter independent intergenic model that doesn't + give probabilities to in-frame stop codons. (Option is obsolete + since this is now the only behaviour.) + + + + + + Set threshold score for calling as gene to n. If the in-frame + score >= n, then the region is given a number and considered + a potential gene. + + + + + + Use "weak" scores on tentative genes n or longer. Weak + scores ignore the independent probability score. + + + + + + SEE ALSO + +tigr-adjust (1), +tigr-anomaly (1), +tigr-build-icm (1), +tigr-check (1), +tigr-codon-usage (1), +tigr-compare-lists (1), +tigr-extract (1), +tigr-generate (1), +tigr-get-len (1), +tigr-get-putative (1), +tigr-glimmer3 (1), +tigr-long-orfs (1) + + +http://www.tigr.org/software/glimmer/ + + Please see the readme in /usr/share/doc/glimmer for a description on how to use Glimmer. + + + AUTHOR + This manual page was quickly copied from the glimmer web site by &dhusername; &dhemail; for + the &debian; system. + + +
+ + + + diff --git a/debian/glimmer2_mans/tigr-long-orfs.sgml b/debian/glimmer2_mans/tigr-long-orfs.sgml new file mode 100644 index 0000000..a3922d1 --- /dev/null +++ b/debian/glimmer2_mans/tigr-long-orfs.sgml @@ -0,0 +1,238 @@ + manpage.1'. You may view + the manual page with: `docbook-to-man manpage.sgml | nroff -man | + less'. A typical entry in a Makefile or Makefile.am is: + +manpage.1: manpage.sgml + docbook-to-man $< > $@ + + + The docbook-to-man binary is found in the docbook-to-man package. + Please remember that if you create the nroff version in one of the + debian/rules file targets (such as build), you will need to include + docbook-to-man in your Build-Depends control field. + + --> + + + Steffen"> + Möller"> + + November 10, 2004"> + 1"> + moeller@debian.org"> + + LONG-ORFS"> + + + Debian"> + GNU"> + GPL"> +]> + + + +
+ &dhemail; +
+ + &dhfirstname; + &dhsurname; + + + 2003 + &dhusername; + + &dhdate; +
+ + &dhucpackage; + + &dhsection; + + + &dhpackage; + + +Find long potential genes in genome-file that do not +overlap by too much + + + + + tigr-long-orfs + genome-file + + + + DESCRIPTION + +Program long-orfs takes a sequence file (in FASTA format) and +outputs a list of all long "potential genes" in it that do not +overlap by too much. By "potential gene" I mean the portion of +an orf from the first start codon to the stop codon at the end. + +The first few lines of output specify the settings of various +parameters in the program: + + Minimum gene length is the length of the smallest fragment + considered to be a gene. The length is measured from the first base + of the start codon to the last base *before* the stop codon. + This value can be specified when running the program with the -g option. + By default, the program now (April 2003) will compute an optimal length + for this parameter, where "optimal" is the value that produces the + greatest number of long ORFs, thereby increasing the amount of data + used for training. + + Minimum overlap length is a lower bound on the number of bases overlap + between 2 genes that is considered a problem. Overlaps shorter than + this are ignored. + + Minimum overlap percent is another lower bound on the number of bases + overlap that is considered a problem. Overlaps shorter than this + percentage of *both* genes are ignored. + +The next portion of the output is a list of potential genes: + + Column 1 is an ID number for reference purposes. It is assigned + sequentially starting with 1 to all long potential genes. If + overlapping genes are eliminated, gaps in the numbers will occur. + The ID prefix is specified in the constant ID_PREFIX . + + Column 2 is the position of the first base of the first start codon in + the orf. Currently I use atg and gtg as start codons. This is + easily changed in the function Is_Start () . + + Column 3 is the position of the last base *before* the stop codon.
Stop + codons are taa, tag, and tga. Note that for orfs in the reverse + reading frames have their start position higher than the end position. + The order in which orfs are listed is in increasing order by + Max {OrfStart, End}, i.e., the highest numbered position in the orf, + except for orfs that "wrap around" the end of the sequence. + +When two genes with ID numbers overlap by at least a sufficient +amount (as determined by Min_Olap and Min_Olap_Percent ), they +are eliminated and do not appear in the output. + +The final output of the program (sent to the standard error file so +it does not show up when output is redirected to a file) is the +length of the longest orf found. + + + +Specifying Different Start and Stop Codons: + +To specify different sets of start and stop codons, modify the file +gene.h . Specifically, the functions: + + Is_Forward_Start Is_Reverse_Start Is_Start + Is_Forward_Stop Is_Reverse_Stop Is_Stop + +are used to determine what is used for start and stop codons. + +Is_Start and Is_Stop do simple string comparisons to specify +which patterns are used. To add a new pattern, just add the comparison +for it. To remove a pattern, comment out or delete the comparison +for it. + +The other four functions use a bit comparison to determine start and +stop patterns. They represent a codon as a 12-bit pattern, with 4 bits +for each base, one bit for each possible value of the bases, T, G, C +or A. Thus the bit pattern 0010 0101 1100 represents the base +pattern [C] [A or G] [G or T]. By doing bit operations (& | ~) and +comparisons, more complicated patterns involving ambiguous reads +can be tested efficiently. Simple patterns can be tested as in +the current code. + +For example, to insert an additional start codon of CAT requires 3 changes: +1. The line + || (Codon & 0x218) == Codon + should be inserted into Is_Forward_Start , since 0x218 = 0010 0001 1000 + represents CAT. +2. 
The line + || (Codon & 0x184) == Codon + should be inserted into Is_Reverse_Start , since 0x184 = 0001 1000 0100 + represents ATG, which is the reverse-complement of CAT. Alternately, + the #define constant ATG_MASK could be used. +3. The line + || strncmp (S, "cat", 3) == 0 + should be inserted into Is_Start . + + + + + OPTIONS + + + + + Set minimum gene length to n. Default is to compute an + optimal value automatically. Don't change this unless you + know what you're doing. + + + + Regard the genome as linear (not circular), i.e., do not allow + genes to "wrap around" the end of the genome. + This option works on both glimmer and long-orfs . + The default behavior is to regard the genome as circular. + + + Set maximum overlap length to n. Overlaps shorter than this + are permitted. (Default is 0 bp.) + + + Set maximum overlap percentage to n%. Overlaps shorter than + this percentage of *both* strings are ignored. (Default is 10%.) + + + + + SEE ALSO + +tigr-glimmer3 (1), +tigr-adjust (1), +tigr-anomaly (1), +tigr-build-icm (1), +tigr-check (1), +tigr-codon-usage (1), +tigr-compare-lists (1), +tigr-extract (1), +tigr-generate (1), +tigr-get-len (1), +tigr-get-putative (1), + + +http://www.tigr.org/software/glimmer/ + + + Please see the readme in /usr/share/doc/tigr-glimmer for a description on how to use Glimmer3. + + + AUTHOR + + This manual page was quickly copied from the glimmer web site by &dhusername; &dhemail; for + the &debian; system. + + + +
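The 12-bit codon encoding described above for Is_Forward_Start and Is_Reverse_Start can be checked with a few lines of Python. This is only an illustrative sketch, not code from gene.h: the names BASE_BITS, codon_mask and matches_pattern are made up for the example, and the bit assignment (T = 1000, G = 0100, C = 0010, A = 0001) is inferred from the masks quoted in the text.

```python
# One bit per base, as implied by 0x218 = 0010 0001 1000 = CAT.
BASE_BITS = {"a": 0b0001, "c": 0b0010, "g": 0b0100, "t": 0b1000}


def codon_mask(codon):
    """Pack a 3-letter codon into a 12-bit value, 4 bits per base."""
    mask = 0
    for base in codon.lower():
        mask = (mask << 4) | BASE_BITS[base]
    return mask


def matches_pattern(codon_bits, pattern_mask):
    """Same test as `(Codon & MASK) == Codon` in the text.

    It is true when every base allowed by codon_bits is also allowed by
    pattern_mask, so ambiguous patterns work with the same comparison.
    """
    return (codon_bits & pattern_mask) == codon_bits
```

With this encoding, codon_mask("CAT") is 0x218 and codon_mask("ATG") is 0x184, matching the two masks given in steps 1 and 2, and the ambiguous pattern 0010 0101 1100 accepts cag (C, then A or G, then G or T) but rejects cac.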
+ + diff --git a/debian/glimmer2_mans/tigr-run-glimmer3.sgml b/debian/glimmer2_mans/tigr-run-glimmer3.sgml new file mode 100644 index 0000000..abee996 --- /dev/null +++ b/debian/glimmer2_mans/tigr-run-glimmer3.sgml @@ -0,0 +1,120 @@ + manpage.1'. You may view + the manual page with: `docbook-to-man manpage.sgml | nroff -man | + less'. A typical entry in a Makefile or Makefile.am is: + +manpage.1: manpage.sgml + docbook-to-man $< > $@ + + + The docbook-to-man binary is found in the docbook-to-man package. + Please remember that if you create the nroff version in one of the + debian/rules file targets (such as build), you will need to include + docbook-to-man in your Build-Depends control field. + + --> + + + Steffen"> + Möller"> + + November 10, 2004"> + 1"> + moeller@debian.org"> + + TIGR-GLIMMER"> + + + Debian"> + GNU"> + GPL"> +]> + + + +
+ &dhemail; +
+ + &dhfirstname; + &dhsurname; + + + 2003 + &dhusername; + + &dhdate; +
+ + &dhucpackage; + + &dhsection; + + + &dhpackage; + + +Apply the suite of programs within glimmer3 to a prokaryotic or archaeal genome. + + + + + tigr-run-glimmer3 + + + + DESCRIPTION + +A shell script that wraps a set of tigr-* utilities of the glimmer package to retrieve coding regions. + + + + SEE ALSO + +tigr-glimmer3 (1), +tigr-adjust (1), +tigr-anomaly (1), +tigr-build-icm (1), +tigr-check (1), +tigr-codon-usage (1), +tigr-compare-lists (1), +tigr-extract (1), +tigr-generate (1), +tigr-get-len (1), +tigr-get-putative (1), +tigr-long-orfs (1), + + +http://www.tigr.org/software/glimmer/ + + + Please see the readme in /usr/share/doc/tigr-glimmer for a description on how to use Glimmer3. + + + AUTHOR + + This manual page was quickly copied from the glimmer web site by &dhusername; &dhemail; for + the &debian; system. + + + +
+ + diff --git a/debian/install b/debian/install new file mode 100644 index 0000000..e8750d7 --- /dev/null +++ b/debian/install @@ -0,0 +1,2 @@ +bin/* usr/lib/tigr-glimmer +debian/bin/* usr/bin diff --git a/debian/manpages b/debian/manpages new file mode 100644 index 0000000..5ff77c4 --- /dev/null +++ b/debian/manpages @@ -0,0 +1,2 @@ +debian/*.1 +debian/glimmer2_mans/*.1 diff --git a/debian/patches/10_gcc4.3.patch b/debian/patches/10_gcc4.3.patch new file mode 100644 index 0000000..3d74c84 --- /dev/null +++ b/debian/patches/10_gcc4.3.patch @@ -0,0 +1,175 @@ +Author: Kumar Appaiah +Description: Fix #461691 + +--- a/src/Common/delcher.cc ++++ b/src/Common/delcher.cc +@@ -9,6 +9,7 @@ + + #include "delcher.hh" + ++#include + + const int COMMATIZE_BUFF_LEN = 50; + // Length of buffer for creating string with commas +--- a/src/Common/fasta.cc ++++ b/src/Common/fasta.cc +@@ -9,7 +9,7 @@ + + #include "fasta.hh" + +- ++#include + + void Fasta_Print + (FILE * fp, const char * s, const char * hdr, int fasta_width) +--- a/src/Common/gene.cc ++++ b/src/Common/gene.cc +@@ -10,6 +10,7 @@ + #include "delcher.hh" + #include "gene.hh" + ++#include + + static const char COMPLEMENT_TABLE [] + = "nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn" +--- a/src/Glimmer/anomaly.cc ++++ b/src/Glimmer/anomaly.cc +@@ -12,6 +12,7 @@ + + #include "anomaly.hh" + ++#include + + // Global variables + +--- a/src/ICM/icm.cc ++++ b/src/ICM/icm.cc +@@ -15,6 +15,8 @@ + + #include "icm.hh" + ++#include ++ + using namespace std; + + extern int Verbose; +--- a/src/Util/entropy-score.cc ++++ b/src/Util/entropy-score.cc +@@ -9,7 +9,7 @@ + // regions in it by entropy distance. Results are output + // to stdout . 
+ +- ++#include + + #include "entropy-score.hh" + +--- a/src/Glimmer/glimmer3.cc ++++ b/src/Glimmer/glimmer3.cc +@@ -12,11 +12,10 @@ + // Copyright (c) 2006 University of Maryland Center for Bioinformatics + // & Computational Biology + +- ++#include + + #include "glimmer3.hh" + +- + static int For_Edwin = 0; + + +--- a/src/ICM/build-icm.cc ++++ b/src/ICM/build-icm.cc +@@ -13,6 +13,7 @@ + + #include "build-icm.hh" + ++#include + + static int Genbank_Xlate_Code = 0; + // Holds the Genbank translation table number that determines +--- a/src/Util/extract.cc ++++ b/src/Util/extract.cc +@@ -9,7 +9,7 @@ + // sequences specified by coordinates. The resulting sequences + // are output (in multifasta or two-string format) to stdout. + +- ++#include + + #include "extract.hh" + +--- a/src/Glimmer/glimmer2.cc ++++ b/src/Glimmer/glimmer2.cc +@@ -37,6 +37,7 @@ + #include "delcher.h" + #include "gene.h" + ++#include + + const int DEFAULT_MIN_GENE_LEN = 90; + const double DEFAULT_MIN_OLAP_PERCENT = 0.10; +--- a/src/Glimmer/long-orfs.cc ++++ b/src/Glimmer/long-orfs.cc +@@ -15,7 +15,7 @@ + + #include "long-orfs.hh" + +- ++#include + + // External variables + +--- a/src/ICM/build-fixed.cc ++++ b/src/ICM/build-fixed.cc +@@ -12,6 +12,7 @@ + + #include "build-fixed.hh" + ++#include + + static FILE * Index_File_fp = NULL; + // File containing a list of subscripts of strings to train model +--- a/src/ICM/score-fixed.cc ++++ b/src/ICM/score-fixed.cc +@@ -8,6 +8,7 @@ + + #include "score-fixed.hh" + ++#include + + static char * Pos_Model_Path; + // Name of file containing the positive model +--- a/src/Util/multi-extract.cc ++++ b/src/Util/multi-extract.cc +@@ -10,7 +10,7 @@ + // resulting sequences are output (in multifasta or two-string format) + // to stdout. 
+ +- ++#include + + #include "multi-extract.hh" + +--- a/src/Util/start-codon-distrib.cc ++++ b/src/Util/start-codon-distrib.cc +@@ -17,6 +17,7 @@ + + #include "start-codon-distrib.hh" + ++#include + + // External variables + +--- a/src/Util/uncovered.cc ++++ b/src/Util/uncovered.cc +@@ -10,7 +10,7 @@ + // specified in the file named as the second command-line argument. + // Output is a multifasta file sent to stdout. + +- ++#include + + #include "uncovered.hh" + diff --git a/debian/patches/10_gcc4.4.patch b/debian/patches/10_gcc4.4.patch new file mode 100644 index 0000000..1a8428a --- /dev/null +++ b/debian/patches/10_gcc4.4.patch @@ -0,0 +1,25 @@ +Author: Andreas Tille +Description: Fix FTBFS #560442 + +--- a/src/Common/gene.cc ++++ b/src/Common/gene.cc +@@ -444,7 +444,7 @@ int Char_Sub + // Return a subscript corresponding to character ch . + + { +- char * p; ++ const char * p; + + p = strchr (CONVERSION_STRING, tolower (ch)); + if (p == NULL) +--- a/src/ICM/icm.cc ++++ b/src/ICM/icm.cc +@@ -1983,7 +1983,7 @@ int Subscript + // model) for character ch . + + { +- char * p; ++ const char * p; + + p = strchr (ALPHA_STRING, tolower (Filter (ch))); + if (p == NULL) diff --git a/debian/patches/mayhem.patch b/debian/patches/mayhem.patch new file mode 100644 index 0000000..94a300b --- /dev/null +++ b/debian/patches/mayhem.patch @@ -0,0 +1,140 @@ +Author: Andreas Tille +Last-Update: Mon, 14 Dec 2015 16:44:19 +0100 +Bug-Debian: http://bugs.debian.org/715701, + http://bugs.debian.org/715702 +Description: Fix crashes reported by Mayhem + See http://www.drpaulcarter.com/cs/common-c-errors.php#4.1 + to make fgetc() more safe. However, the original problem is + that for empty strings no space at all is allocated. This is + now done in advance. 
+ +--- a/src/ICM/build-fixed.cc ++++ b/src/ICM/build-fixed.cc +@@ -234,20 +234,24 @@ static int Read_String + { + int ch, ct; + +- while ((ch = fgetc (fp)) != EOF && ch != '>') ++ while ((ch = fgetc (fp)) != EOF && ch != ((int) '>')) + ; + + if (ch == EOF) + return FALSE; + + ct = 0; +- while ((ch = fgetc (fp)) != EOF && ch != '\n' && isspace (ch)) ++ while ((ch = fgetc (fp)) != EOF && ch != ((int) '\n') && isspace (ch)) + ; + if (ch == EOF) + return FALSE; +- if (ch != '\n' && ! isspace (ch)) ++ if (ch != ((int) '\n') && ! isspace (ch)) + ungetc (ch, fp); +- while ((ch = fgetc (fp)) != EOF && ch != '\n') ++ if (tag_size == 0 ) { ++ tag_size += INCR_SIZE; ++ tag = (char *) Safe_realloc (tag, tag_size); ++ } ++ while ((ch = fgetc (fp)) != EOF && ch != ((int) '\n')) + { + if (ct >= tag_size - 1) + { +@@ -259,7 +263,11 @@ static int Read_String + tag [ct ++] = '\0'; + + ct = 0; +- while ((ch = fgetc (fp)) != EOF && ch != '>') ++ if (s_size == 0) { ++ s_size += INCR_SIZE; ++ s = (char *) Safe_realloc (s, s_size); ++ } ++ while ((ch = fgetc (fp)) != EOF && ch != ((int) '>')) + { + if (isspace (ch)) + continue; +--- a/src/ICM/build-icm.cc ++++ b/src/ICM/build-icm.cc +@@ -271,20 +271,24 @@ static int Read_String + { + int ch, ct; + +- while ((ch = fgetc (fp)) != EOF && ch != '>') ++ while ((ch = fgetc (fp)) != EOF && ch != ((int) '>')) + ; + + if (ch == EOF) + return FALSE; + + ct = 0; +- while ((ch = fgetc (fp)) != EOF && ch != '\n' && isspace (ch)) ++ while ((ch = fgetc (fp)) != EOF && ch != ((int) '\n') && isspace (ch)) + ; + if (ch == EOF) + return FALSE; + if (ch != '\n' && ! 
isspace (ch)) + ungetc (ch, fp); +- while ((ch = fgetc (fp)) != EOF && ch != '\n') ++ if (tag_size == 0) { ++ tag_size += INCR_SIZE; ++ tag = (char *) Safe_realloc (tag, tag_size); ++ } ++ while ((ch = fgetc (fp)) != EOF && ch != ((int) '\n')) + { + if (ct >= tag_size - 1) + { +@@ -296,7 +300,11 @@ static int Read_String + tag [ct ++] = '\0'; + + ct = 0; +- while ((ch = fgetc (fp)) != EOF && ch != '>') ++ if (s_size == 0) { ++ s_size += INCR_SIZE; ++ s = (char *) Safe_realloc (s, s_size); ++ } ++ while ((ch = fgetc (fp)) != EOF && ch != ((int) '>')) + { + if (isspace (ch)) + continue; +--- a/src/ICM/score-fixed.cc ++++ b/src/ICM/score-fixed.cc +@@ -163,20 +163,24 @@ int Read_String + { + int ch, ct; + +- while ((ch = fgetc (fp)) != EOF && ch != '>') ++ while ((ch = fgetc (fp)) != EOF && ch != ((int) '>')) + ; + + if (ch == EOF) + return FALSE; + + ct = 0; +- while ((ch = fgetc (fp)) != EOF && ch != '\n' && isspace (ch)) ++ while ((ch = fgetc (fp)) != EOF && ch != ((int) '\n') && isspace (ch)) + ; + if (ch == EOF) + return FALSE; + if (ch != '\n' && ! 
isspace (ch)) + ungetc (ch, fp); +- while ((ch = fgetc (fp)) != EOF && ch != '\n') ++ if (tag_size == 0 ) { ++ tag_size += INCR_SIZE; ++ tag = (char *) Safe_realloc (tag, tag_size); ++ } ++ while ((ch = fgetc (fp)) != EOF && ch != ((int) '\n')) + { + if (ct >= tag_size - 1) + { +@@ -188,7 +192,11 @@ int Read_String + tag [ct ++] = '\0'; + + ct = 0; +- while ((ch = fgetc (fp)) != EOF && ch != '>') ++ if (s_size == 0) { ++ s_size += INCR_SIZE; ++ s = (char *) Safe_realloc (s, s_size); ++ } ++ while ((ch = fgetc (fp)) != EOF && ch != ((int) '>')) + { + if (isspace (ch)) + continue; diff --git a/debian/patches/series b/debian/patches/series new file mode 100644 index 0000000..58ad637 --- /dev/null +++ b/debian/patches/series @@ -0,0 +1,4 @@ +10_gcc4.3.patch +10_gcc4.4.patch + +mayhem.patch diff --git a/debian/rules b/debian/rules new file mode 100755 index 0000000..442bc1a --- /dev/null +++ b/debian/rules @@ -0,0 +1,25 @@ +#!/usr/bin/make -f + +MANPAGES=debian/glimmer2_mans/tigr-anomaly.1 \ + debian/glimmer2_mans/tigr-build-icm.1 \ + debian/glimmer2_mans/tigr-extract.1 \ + debian/glimmer2_mans/tigr-glimmer3.1 \ + debian/glimmer2_mans/tigr-long-orfs.1 \ + debian/glimmer2_mans/tigr-run-glimmer3.1 + +.SUFFIXES: .1 .sgml + +.sgml.1: + docbook-to-man $< > $@ + +%: + dh $@ + +override_dh_clean: + dh_clean $(MANPAGES) + cd src; make clean + rm -f bin/* lib/* obj/* + +override_dh_auto_build: $(MANPAGES) + # dh_auto_build + cd src; make CFLAGS="$(CFLAGS)" CPPFLAGS="$(CPPFLAGS)" CXXFLAGS="$(CXXFLAGS)" LDFLAGS="$(LDFLAGS)" diff --git a/debian/source/format b/debian/source/format new file mode 100644 index 0000000..163aaf8 --- /dev/null +++ b/debian/source/format @@ -0,0 +1 @@ +3.0 (quilt) diff --git a/debian/tigr-glimmer.1 b/debian/tigr-glimmer.1 new file mode 100644 index 0000000..561c705 --- /dev/null +++ b/debian/tigr-glimmer.1 @@ -0,0 +1,39 @@ +.TH TIGR-GLIMMER 1 "April 16, 2008" +.SH NAME +tigr-glimmer \- runs various programs of the TIGR Glimmer suite +.SH SYNOPSIS +.B 
tigr-glimmer +.B program +[arguments] +.SH DESCRIPTION +This manual page documents briefly the +.B tigr-glimmer +wrapper to the TIGR Glimmer programs. +This manual page was written for the Debian GNU/Linux distribution +because upstream does not provide this wrapper and it was invented +for Debian to avoid conflicts with other packages that might cause +namespace pollution. +.PP +\fBtigr-glimmer\fP is just a wrapper that invokes the various programs in +the TIGR Glimmer software package. You can get more detailed documentation +in /usr/share/doc/tigr-glimmer. Please note that the documentation there +is a part of the former version Glimmer 2. The version Glimmer 3 has +some features that are described in the notes.pdf document inside +the documentation directory. +.PP +The following programs are included: anomaly, build-fixed, build-icm, +entropy-profile, entropy-score, extract, glimmer3, long-orfs, multi-extract, +score-fixed, start-codon-distrib, test, uncovered and window-acgt. +.SH OPTIONS +There are no options. +.SH EXAMPLES +.IP tigr-glimmer\ build-icm +.IP tigr-glimmer\ long-orfs +.SH SEE ALSO +For the previously packaged version Glimmer2 some text files from +the documentation were turned into man pages for the Debian GNU/Linux +distribution by Steffen Moeller +.br +.SH AUTHORS +This manual page was written by Andreas Tille , for +the Debian GNU/Linux system (but may be used by others). diff --git a/debian/upstream/metadata b/debian/upstream/metadata new file mode 100644 index 0000000..3d96ff3 --- /dev/null +++ b/debian/upstream/metadata @@ -0,0 +1,12 @@ +Reference: + Author: Steven L. Salzberg and Arthur L. Delcher and S. Kasif and O.
White + Title: Microbial gene identification using interpolated Markov models + Journal: Nucleic Acids Research + Year: 1998 + Volume: 26 + Number: 2 + Pages: 544-8 + DOI: 10.1093/nar/26.2.544 + PMID: 9421513 + URL: http://nar.oxfordjournals.org/content/26/2/544 + eprint: http://nar.oxfordjournals.org/content/26/2/544.full.pdf+html diff --git a/debian/watch b/debian/watch new file mode 100644 index 0000000..294c791 --- /dev/null +++ b/debian/watch @@ -0,0 +1,3 @@ +version=3 +opts="dversionmangle=s/\.//" \ +http://www.cbcb.umd.edu/software/glimmer/ glimmer(.*)\.tar.gz
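
Note on debian/patches/mayhem.patch: its description says the crash came from writing the terminating '\0' into a buffer that had never been allocated when a tag or sequence was empty, and that the buffers are now allocated in advance. The following is a minimal standalone sketch of that idea, not the patched Read_String itself. The names read_record, ensure_allocated, append_char and checked_realloc are hypothetical, checked_realloc only stands in for the package's Safe_realloc, and the INCR_SIZE value is an assumption because the patch does not show it. The sketch also treats an empty header line as an empty tag, which may differ from upstream behaviour.

```cpp
#include <cassert>
#include <cctype>
#include <cstdio>
#include <cstdlib>

// Assumed growth step; the real INCR_SIZE value is not shown in the patch.
static const long INCR_SIZE = 10000;

// Hypothetical stand-in for the package's Safe_realloc: exit on failure.
static void *checked_realloc(void *p, size_t n)
{
  void *q = std::realloc(p, n);
  if (q == nullptr) {
    std::fprintf(stderr, "out of memory\n");
    std::exit(EXIT_FAILURE);
  }
  return q;
}

// Allocate a first chunk if the buffer has no storage yet, so that
// buf[0] = '\0' is valid even when nothing is ever appended.
static void ensure_allocated(char *&buf, long &size)
{
  if (size == 0) {
    size += INCR_SIZE;
    buf = (char *) checked_realloc(buf, size);
  }
}

// Append one character, growing first so one byte stays free for '\0'.
static void append_char(char *&buf, long &size, long &len, int ch)
{
  if (len >= size - 1) {
    size += INCR_SIZE;
    buf = (char *) checked_realloc(buf, size);
  }
  buf[len++] = (char) ch;
}

// Read one FASTA-like record: header after '>' into tag, residues into s.
// Returns false when no further record starts before end of file.
static bool read_record(FILE *fp, char *&s, long &s_size,
                        char *&tag, long &tag_size)
{
  assert(fp != nullptr);
  int ch;  // int rather than char, so EOF stays distinguishable

  while ((ch = std::fgetc(fp)) != EOF && ch != '>')
    ;
  if (ch == EOF)
    return false;

  // Skip blanks after '>' but stop at the end of the header line.
  while ((ch = std::fgetc(fp)) != EOF && ch != '\n' && std::isspace(ch))
    ;
  if (ch == EOF)
    return false;

  long len = 0;
  ensure_allocated(tag, tag_size);  // allocate before the loop, as in the patch
  if (ch != '\n') {
    std::ungetc(ch, fp);
    while ((ch = std::fgetc(fp)) != EOF && ch != '\n')
      append_char(tag, tag_size, len, ch);
  }
  tag[len] = '\0';

  len = 0;
  ensure_allocated(s, s_size);  // same guard for the sequence buffer
  while ((ch = std::fgetc(fp)) != EOF && ch != '>') {
    if (std::isspace(ch))
      continue;
    append_char(s, s_size, len, ch);
  }
  s[len] = '\0';
  if (ch == '>')
    std::ungetc(ch, fp);  // leave the next header for the next call
  return true;
}
```

A caller would start with s = tag = nullptr and both sizes 0, call read_record in a loop until it returns false, and free both buffers afterwards. Without the two ensure_allocated calls, a record with an empty header or an empty sequence would store '\0' through a null pointer, which matches the failure the patch description attributes to the Mayhem reports.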