fasta3 / fb82b5f
New upstream version 36.3.8h Andreas Tille 4 years ago
115 changed file(s) with 6906 addition(s) and 3194 deletion(s).
00
11 ## The FASTA package - protein and DNA sequence similarity searching and alignment programs
22
3 The **FASTA** (pronounced FAST-Aye, not FAST-Ah) programs are a
4 comprehensive set of similarity searching and alignment programs for
5 searching protein and DNA sequence databases. Like the **BLAST** programs `blastp` and `blastn`, the `fasta` program itself uses a rapid heuristic strategy for finding similar regions in protein and DNA sequences. But in
6 addition to heuristic similarity searching, the FASTA package provides
7 programs for rigorous local (`ssearch`) and global (`ggsearch`)
8 similarity searching, as well as a program for finding non-overlapping
9 sequence similarities (`lalign`). Like BLAST, the FASTA package also
10 includes programs for aligning translated DNA sequences against
11 proteins (`fastx`, `fasty` are equivalent to `blastx`, `tfastx`,
12 `tfasty` are similar to `tblastn`).
3 The **FASTA** (pronounced FAST-Aye, not FAST-Ah) programs are a comprehensive set of similarity searching and alignment programs for searching protein and DNA sequence databases. Like the **BLAST** programs `blastp` and `blastn`, the `fasta` program itself uses a rapid heuristic strategy for finding similar regions in protein and DNA sequences. But in addition to heuristic similarity searching, the FASTA package provides
4 programs for rigorous local (`ssearch`) and global (`ggsearch`) similarity searching, as well as a program for finding non-overlapping sequence similarities (`lalign`). Like BLAST, the FASTA package also includes programs for aligning translated DNA sequences against proteins (`fastx`, `fasty` are equivalent to `blastx`, and `tfastx`, `tfasty` are similar to `tblastn`).
135
14 ####December, 2017
15 The current FASTA version is fasta-36.3.8f, Dec. 2017
6 #### March, 2019
7
8 An updated release of the FASTA package (`fasta-36.3.8h`) is
9 available. In addition to minor bug fixes, the latest version can
10 generate query and library sequences using program scripts.
11
12 See doc/README_v36.3.8h.md and doc/readme.v36 for a more complete summary of changes.
13
14 #### December, 2018
15
16 The latest version of the FASTA package is `fasta-36.3.8h`, Dec. 2018.
17
18 See doc/README_v36.3.8h.md for a more complete summary of changes.
19
20 #### November, 2018
21
22 The current released version of the FASTA package is `fasta-36.3.8h`, Nov. 2018
23
24 See doc/README_v36.3.8h.md for a more complete summary of changes.
25
26 #### October, 2018
27
28 The current version of the FASTA package is fasta-36.3.8g, Oct. 2018
29
30 See doc/README_v36.3.8h.md for a more complete summary of changes.
31
32 #### April, 2018
33 The current version of the FASTA package is fasta-36.3.8g, Apr. 2018
34
35 #### December, 2017
36 The current FASTA version is fasta-36.3.8g, Dec. 2017
1637
1738 The statistics routines for normally distributed scores (ggsearch36,
1839 glsearch36) are more robust to very low E()-value thresholds.
1940
20 ####Sept, 2017
41 #### Sept, 2017
2142 The current FASTA version is fasta-36.3.8f, Sept. 2017
2243
2344 If the -S option is used and a query sequence has no upper case
2445 letters, it is re-read with lower-case letters converted to upper-case.
2546
26 ####May, 2017
47 #### May, 2017
2748 The current FASTA version is fasta-36.3.8f, May. 2017
2849
2950 Various bugs in sub-alignment scoring corrected and support for the
30 EBI SP:GSTM1_HUMAN P09488 added. The format for the $SRCH_URL and
31 $SRCH_URL2 format strings has changed to enable pairwise alignment.
51 EBI SP:GSTM1_HUMAN P09488 added. The format for the `$SRCH_URL` and
52 `$SRCH_URL2` format strings has changed to enable pairwise alignment.
3253
33 ####September, 2016
54 #### September, 2016
3455
3556 The fasta-36.3.6e version includes a new directory, `psisearch2`, with
3657 scripts to run iterative PSSM (PSI-BLAST or SSEARCH36) searches using
+0
-18
doc/README_v36.3.8g.md
0
1
2 ## The FASTA package - protein and DNA sequence similarity searching and alignment programs
3
4 Changes in **fasta-36.3.8f** released 31-Dec-2017
5
6 1. (December, 2017) -- Make statistical thresholds more robust for
7 small E()-values with normally distributed scores (ggsearch36,
8 glsearch36).
9
10 2. (September, 2017) Treat all lower-case queries as uppercase with -S option.
11
12 3. (May, 2017) Improvements/fixes to sub-alignment scoring strategies.
13
14 4. Improvements/fixes to psisearch2 scripts.
15
16 For more detailed information, see `doc/readme.v36`.
17
0
1 ## The FASTA package - protein and DNA sequence similarity searching and alignment programs
2
3 Changes in **fasta-36.3.8h** August, 2019
4
5 1. Modifications to support makeblastdb format v5 databases. Currently, only simple database reads have been tested.
6
7
8 Changes in **fasta-36.3.8h** March, 2019
9
10 1. Translation table 1 (`-t 1`) now translates 'TGA'->'U' (selenocysteine).
11
12 2. New script for extracting DNA sequences from genomes (`scripts/get_genome_seq.py`). Currently works with human (hg38), mouse (mm10), and rat (rn6).
13
14 Changes in **fasta-36.3.8h** January, 2019
15
16 1. Bug fixes: `fastx`/`tfastx` searches done with the `-t t` option (which adds a `*` to protein sequences so that termination codons can be matched) did not work properly with the `VT` series of matrices, particularly `VT10`. This has been fixed.
17
18 2. New features: Both query and library/subject sequences can be generated by specifying a program script, either by putting a `!` at the start of the query/subject file name, or by specifying library type `9`. Thus, `fasta36 \!../scripts/get_protein.py+P09488+P30711 /seqlib/swissprot.fa` or `fasta36 "../scripts/get_protein.py+P09488+P30711 9" /seqlib/swissprot.fa` will compare two query sequences, `P09488` and `P30711`, to SwissProt, by downloading them from Uniprot using the `get_protein.py` script (which can download sequences using either Uniprot or RefSeq protein accessions). Often, the leading `!` must be escaped from shell interpretation with `\!`.
19
20 New scripts that return FASTA sequences using accessions or genome coordinates are available in `scripts/`: `get_protein.py`, `get_uniprot.py`, `get_up_prot_iso_sql.py` and `get_refseq.py`. `get_refseq.py` can download either protein or mRNA RefSeq entries. `get_up_prot_iso_sql.py` retrieves a protein and its isoforms from a MySQL database.
21
22 `get_genome_seq.py` extracts genome sequences using coordinates from local reference genomes (`hg38` and `mm10` included by default).
23
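The script-as-query syntax described above can be assembled programmatically. The following is a minimal sketch, not part of the FASTA distribution: the helper name `script_query` is hypothetical, and it only encodes the two conventions stated in the text (a leading `!`, or a trailing ` 9` library type, with `+` standing in for spaces between the script and its arguments).

```python
def script_query(script, args, use_bang=True):
    """Build a FASTA query/library argument that names a script.

    Hypothetical helper: it follows the two forms described in the
    release notes and does not run or validate anything.
    """
    # '+' replaces the spaces between the script and its arguments
    spec = "+".join([script, *args])
    if use_bang:
        # form 1: a leading '!' marks the file name as a script
        return "!" + spec
    # form 2: library type 9, separated from the script by a real space
    return spec + " 9"
```

If the resulting string is passed as one element of an argument list (for example to `subprocess.run` without `shell=True`), no shell sees the `!`, so the backslash escape should not be needed.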
24 Changes in **fasta-36.3.8h** December, 2018
25
26 The `scripts/ann_exons_up_www.pl` and `ann_exons_up_sql.pl` now include the option `--gen_coord` which provides the associated genome coordinate (including chromosome) as a feature, indicated by `'<'` (start of exon) and `'>'` (end of exon).
27
28 Changes in **fasta-36.3.8h** released November, 2018
29
30 **fasta-36.3.8h** provides new scripts and modifications to the `fasta` programs that normalize the process of merging sub-alignment scores and region information into both FASTA and BLAST results. To move BLASTP towards FASTA with respect to alignment annotation and sub-alignment scoring:
31
32 1. The `blastp_annot_cmd.sh` runs a blast search, finds and scores domain information for the alignments, and merges this information back into the blast output `.html` file. This script uses:
33
34 1. `annot_blast_btab2.pl --query query.file --ann_script annot_script.pl --q_ann_script annot_script.pl blast.btab_file > blast.btab_file_ann` (a blast tabular file with one or two new fields, an annotation field and (optionally with --dom_info) a raw domain content field).
35 2. `merge_blast_btab.pl --btab blast.btab_file_ann blast.html > blast_ann.html` (merge the annotations and domain content information in the `blast.btab_file_ann` file together with the standard blast output file to produce annotated alignments).
36 3. In addition, `rename_exons.py` is available to rename exons (later other domains) in the subject sequences to match the exon labeling in the aligned query sequence.
37 4. `relabel_domains.py` can be used to adjust color sets for homologous domains.
38
39 2. There is also an equivalent `fasta_annot_cmd.sh` script that provides similar functionality for the FASTA programs. This script does not need to use `annot_blast_btab2.pl` to produce domain subalignment scores (that functionality is provided in FASTA), but it also can use `merge_fasta_btab.pl` and `rename_exons.py` to modify the names of the aligned exons/domains in the subject sequences.
40
41 3. To support the independence of the `blastp`/`fasta` output from html annotation, the FASTA package includes some new options:
42
43 1. The `-m 8CBL` option includes query sequence length and subject sequence length in the blast tabular output. In addition, if domain annotations are available, the raw domain coordinates are provided in an additional field after the annotation/subalignment scoring field. `-m 8CBl` provides the sequence lengths, but does not add the raw domain coordinates.
44
45 2. The `-Xa` option prevents annotation information from being included in the html output -- it is only available in the `-m 8CB` (or `-m 8CBL/l`) output.
46
47 3. To reduce problems with spaces in script arguments, annotation scripts with spaces separating arguments can use '+' instead of ' '.
48
49 4. The `fasta_annot_cmd.sh` script produces both a conventional alignment on `stdout` and a `-m 8CBL` alignment, which is sent to a separate file. The file name is separated from the `-m F8CBL` option with a `=`, thus `-m F8CBL=tmp_output.blast_tab`.
50
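A file written with `-m F8CBL=...` can be read back with a few lines of Python. This is a sketch under assumptions: the function name `parse_8cbl_row` is hypothetical, the rows are assumed to be tab-separated, and only the first five columns (`qseqid qlen sseqid slen percid`, the order given in the manual for `-m 8CBL`) are named. Everything after them, including any annotation and raw domain fields, is kept as an uninterpreted list.

```python
def parse_8cbl_row(line):
    """Split one -m 8CBL row into its leading fields.

    Assumes tab-separated columns beginning with
    qseqid, qlen, sseqid, slen, percid; later columns are
    returned untouched in 'rest'.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 5:
        raise ValueError("expected at least 5 tab-separated columns")
    return {
        "qseqid": cols[0],
        "qlen": int(cols[1]),      # query length added by the L/l option
        "sseqid": cols[2],
        "slen": int(cols[3]),      # subject length added by the L/l option
        "percid": float(cols[4]),
        "rest": cols[5:],          # remaining columns, not interpreted here
    }
```

Comment lines (those starting with `#`) would need to be skipped by the caller before a row is passed to this function.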
51 Changes in **fasta-36.3.8g** released 23-Oct-2018
52
53 1. (Oct. 2018) Improvements to scripts in the `psisearch2/` directory:
54
55 1. `psisearch2/m89_btop_msa2.pl`
56 1. the `--clustal` option produces a "CLUSTALW (1.8)" header, which is required for some downstream programs
57 2. the `--trunc_acc` option removes the database and accession from identifiers of the form: `sp|P09488|GSTM1_HUMAN` to produce `GSTM1_HUMAN`.
58 3. the `--min_align` option specifies the fraction of the query sequence that must be aligned, `(q_end-q_start+1)/q_length`
59 Together, these changes make it possible for the output of `m89_btop_msa2.pl` to be used by the EMBOSS program `fprotdist`.
60
61 2. A more general implementation of `psisearch2_msa_iter.sh`, which does `psisearch2` one iteration at a time, and a new equivalent `psisearch2_msa_iter_bl.sh`, which uses `psiblast` to do the search.
62
63 * (Oct. 2018) A small restructuring of the `make/Makefiles` to remove the `-lz` dependence for non-debugging scripts (and add it back when -DDEBUG is used).
64
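The `--min_align` coverage test described for `m89_btop_msa2.pl` reduces to one ratio. The sketch below is illustrative only: the function names are hypothetical, the default of 0.8 is borrowed from the `--min_align 0.8` example in `readme.v36`, and whether the real script compares with `>=` or `>` is an assumption.

```python
def aligned_fraction(q_start, q_end, q_length):
    """Fraction of the query covered: (q_end - q_start + 1) / q_length."""
    if q_length <= 0:
        raise ValueError("q_length must be positive")
    return (q_end - q_start + 1) / q_length


def passes_min_align(q_start, q_end, q_length, min_align=0.8):
    """True if the alignment covers at least min_align of the query.

    The '>=' comparison is an assumption, not taken from the script.
    """
    return aligned_fraction(q_start, q_end, q_length) >= min_align
```

Coordinates are treated as 1-based and inclusive, which is why the `+1` appears in the numerator.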
65 Changes in **fasta-36.3.8g** released 5-Aug-2018
66
67 1. (Apr 2018) incorporation of `-t t1` termination codes ("*") in `-m 8CB`, `-m 8CC`, and `-m9C` so that aligned termination codons are indicated as `**` (`-m8CB`) or `*1` (`-m8CC`, `-m9C`).
68
69 2. (Mar 2018) Updates to scripts/annot_blast_btop2.pl to provide subalignment scoring for blastp searches (BLOSUM62 only). (see doc/readme.v36)
70
71 3. (Feb. 2018) a new extended option, `-XB`, which causes percent identity, percent similarity, and alignment length to be calculated using the BLAST model, which does not count gaps in the alignment length.
72
73 see readme.v36 for other bug fixes.
74
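The difference that `-XB` makes can be illustrated with a toy calculation. This is a sketch of the two counting conventions as the release notes describe them, not the code in `mshowbest.c`: the function name `identity_stats` is hypothetical, and gaps are assumed to be written as `-` in both aligned strings.

```python
def identity_stats(q_aln, s_aln, blast_model=False):
    """Return (percent identity, alignment length) for two aligned strings.

    blast_model=False counts every aligned column, gaps included.
    blast_model=True drops columns containing a gap ('-') before
    counting, which is the behaviour the notes attribute to -XB.
    """
    if len(q_aln) != len(s_aln):
        raise ValueError("aligned strings must have equal length")
    pairs = list(zip(q_aln, s_aln))
    if blast_model:
        # exclude gapped columns from the alignment length
        pairs = [(q, s) for q, s in pairs if q != "-" and s != "-"]
    ident = sum(1 for q, s in pairs if q == s and q != "-")
    alen = len(pairs)
    percid = 100.0 * ident / alen if alen else 0.0
    return percid, alen
```

For an alignment that is identical apart from one gap, the gap-excluding count reports 100% identity over a shorter length, while the default count reports a lower identity over the full length.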
75 Changes in **fasta-36.3.8g** released 31-Dec-2017
76
77 1. (December, 2017) -- Make statistical thresholds more robust for small E()-values with normally distributed scores (`ggsearch36`,`glsearch36`).
78
79 2. (September, 2017) Treat lower-case queries with no upper-case residues as uppercase with `-S` option.
80
81 3. (May, 2017) Improvements/fixes to sub-alignment scoring strategies.
82
83 4. Improvements/fixes to psisearch2 scripts.
84
85 For more detailed information, see `doc/readme.v36`.
86
2323 </small>
2424 </pre>
2525 <hr>
26 <h2>Latest Updates - FASTA version 36.3.8d (April, 2016)</h2>
27 <ol>
28 <li>
29 The <tt>fasta-36.3.8d/scripts/</tt> directory now provides a
30 script, <tt>annot_blast_btop2.pl</tt> that allows annotations and
31 sub-alignment scoring on BLAST alignments that use the tabular format
32 with BTOP alignment encoding.
33 <p>
34 <li>
35 Bug fixes for overlapping domain scoring. v36.3.7 was not thread-safe.
36 <li>
37 Annotation scripts accessing the Pfam domain database can now use
38 the <tt>--vdoms</tt> option to highlight missing parts of a Pfam
39 domain model. In addition, domains from clans are labeled as clans
40 unless <tt>--no-clans</tt> is specified.
41 </ol>
42 <h2>Updates - FASTA version 36.3.7 (November, 2014)</h2>
26 <h2>Latest Updates - FASTA version 36.3.8h (March, 2019)</h2>
4327 <ol>
4428 <li>The FASTA programs have been released under the Apache2.0 Open
4529 Source License. The COPYRIGHT file, and copyright notices in
4630 program files, have been updated to reflect this change.
4731 <p>
32 <li>
33 fasta-36.3.8h includes bug fixes for translated alignments
34 with termination codons, the ability to use scripts as query
35 and library sequences, and new scripts for extracting genomic
36 DNA sequences given chromosome coordinates.
37 <li>
38 fasta-36.3.8g includes bug fixes for sub-alignment scoring and
39 psisearch2 scripts, new annotation scripts for exons, and
40 fixes enabling very low statistical thresholds with ggsearch36
41 and glsearch36.
42 <li>
43 fasta-36.3.8e/scripts includes updated scripts for
44 capturing domain and feature annotations using the
45 EBI/proteins API (https://www.ebi.ac.uk/proteins/api/) to get
46 Uniprot annotations and exon locations.
47 <p>
48 <li>
49 The <tt>fasta-36.3.8e/psisearch2/</tt> directory now
50 provides <tt>psisearch2_msa.pl</tt>
51 and <tt>psisearch2_msa.py</tt>, functionally identical scripts
52 for iterative searching with <tt>psiblast</tt>
53 or <tt>ssearch36</tt>. <tt>psisearch2_msa.pl</tt> offers an
54 option, <tt>--query_seed</tt>, that can dramatically reduce
55 false-positives caused by alignment overextension, with very
56 little loss of search sensitivity.
57 <p>
58 <li>
59 The <tt>fasta-36.3.8d/scripts/</tt> directory now provides a
60 script, <tt>annot_blast_btop2.pl</tt> that allows annotations and
61 sub-alignment scoring on BLAST alignments that use the tabular format
62 with BTOP alignment encoding.
63 <p>
4864 <li>Alignment sub-scoring scripts have been extended to allow
4965 overlapping domains. This requires a modified annotation file format.
5066 The "classic" format placed the beginning and end of a domain on different lines:
6985 </pre>
7086 <p>
7187 <li> New annotation scripts are available in
72 the <tt>fasta-36.3.7/scripts</tt> directory,
88 the <tt>fasta-36.3.8/scripts</tt> directory,
7389 e.g. <tt>ann_pfam_www_e.pl</tt> (Pfam) and <tt>ann_up_www2_e.pl</tt>
7490 (Uniprot) to support this new format. If the domain annotations
7591 provided by Pfam or Uniprot overlap, then overlapping domains are
Binary diff not shown
266266 with a '$>$' character, followed by the sequence itself:
267267 \begin{quote}
268268 \begin{verbatim}
269 >sequence name and description 1
269 >sequence_name1 and description
270270 A F A S Y T .... actual sequence.
271271 F S S .... second line of sequence.
272 >sequence name and description 2
272 >sequence_name2 and description
273273 PMILTYV ... sequence 2
274274 \end{verbatim}
275275 \end{quote}
276276 All of the characters of the description line are read, and special
277277 characters can be used to indicate additional information about the
278 sequence. In general, non-amino-acid/non-nucleotide sequences in the
279 sequence lines are ignored.
278 sequence. In particular, a \texttt{'@:C 12345'} at the end of the
279 description line indicates that the first residue of the sequence has
280 coordinate \texttt{'12345'}, instead of starting at \texttt{'1'}.
281 Coordinates can be negative; a DNA sequence upstream from the start of
282 transcription could be displayed with negative coordinates.
283
284 In general, non-amino-acid/non-nucleotide sequences in the sequence
285 lines are ignored, with the exception of \texttt{'*'}, which indicates
286 a termination codon in a protein sequence, and can be used to indicate
287 the match to a termination codon in protein:DNA alignments.
280288
281289 FASTA format files from major sequence distributors, like the NCBI and
282290 EBI, have specially formatted description lines, e.g.:\\
283291 \indent
284292 \texttt{
285 >gi|54321|ref|np\_12345| example NCBI refseq sequence\\
293 >np\_12345| example NCBI refseq sequence\\
286294 }
287295 or\\
288296 \indent
289297 \texttt{
290 >sw:gstm1\_human P01234 glutathione transferase GSTM1 - human\\
298 >sp:gstm1\_human P01234 glutathione transferase GSTM1 - human\\
299 }
300 or
301 \indent
302 \texttt{
303 >sp|P09488|GSTM1\_HUMAN glutathione transferase GSTM1 - human\\
291304 }
292305
293306 Several sample test files are included with the FASTA distribution:
851864 comments, \texttt{-m 8XC} without comments) and, if available, an
852865 annotation encoding matching FASTA \texttt{-m 9C} output. All the
853866 \texttt{-m 9c/C/d/D} encodings are available with BLAST tabular
854 output using \texttt{-m 8C[c/C/d/D]}.
867 output using \texttt{-m 8C[c/C/d/D]}. In the v36.3.8h release, a
868 new option has been added to \texttt{-m 8CB}, \texttt{-m 8CBL} (or
869 \texttt{-m 8CBl}). The \texttt{L/l} option adds the lengths of the
870 query and subject sequences after the \texttt{seqid}'s to BLAST
871 tabular output, e.g. \texttt{qseqid qlen sseqid slen percid ...}.
855872
856873 \item[\texttt{-m 9}] display alignment coordinates and scores with the
857874 best score information. \texttt{-m 9i} provides alignment length,
925942 \texttt{1M1X2M4X2M1X2M7X3M9D1M2X1M4X2M1X1M1X2I1X1M1X1M3X1M2X1I3M1D1X1M2X1M}
926943 \end{footnotesize}
927944 \item[\texttt{-m 10}]
928 a parseable format for use with other programs.
945 a parseable format for use with other programs (this option is no longer reliably tested; \texttt{-m 8CBL} is easier to parse and more extensively tested).
929946 \item[\texttt{-m 11}]
930947 Provide \texttt{lav}-like output (used by \texttt{lalign}) for graphical output.
931948 \begin{quote}
11231140 programs. (There is an option in the \texttt{Makefile},
11241141 \texttt{-DDNALIB\_LC}, to enable preserving case in DNA sequences.)
11251142
1126 \item[\texttt{-t \#}]
1127 Translation table - fastx36, tfastx36, fasty36, and
1128 tfasty3 now support the BLAST translation tables. See
1129 \url{http://www.ncbi.nih.gov/Taxonomy/Utils/wprintgc.cgi}.
1130
1131 \texttt{-t t} or \texttt{-t t\#} enables the addition of
1132 an implicit termination codon to a protein:translated DNA match. That
1133 is, each protein sequence implicitly ends with \texttt{*}, which
1134 matches the termination codes for the appropriate genetic code.
1135 \texttt{-t t\#} sets implicit termination and a different genetic
1136 code.
1143 \item[\texttt{-t \#}] Translation table - fastx36, tfastx36, fasty36,
1144 and tfasty36 now support the BLAST translation tables. See
1145 \url{http://www.ncbi.nih.gov/Taxonomy/Utils/wprintgc.cgi}.
1146
1147 \texttt{-t 1} also enables translation of \texttt{'TGA'} to
1148 \texttt{'U'} (seleno-cysteine) (by default, \texttt{'TGA'} is
1149 translated to \texttt{'*'}). Because of the ambiguity of the
1150 \texttt{'TGA'} codon, translated alignments of \texttt{'TGA'} with
1151 \texttt{-t 1} match \texttt{'U'} and \texttt{'*'} (termination)
1152 equally well.
1153
1154 \texttt{-t t} enables the addition of an implicit termination codon to
1155 a protein:translated DNA match. That is, each protein sequence
1156 implicitly ends with \texttt{*}, which matches the termination codes
1157 for the appropriate genetic code. To change the translation table and
1158 insert a termination character after each protein sequence, use
1159 \texttt{-t 1 -t t}.
1160
11371161 \item[\texttt{-T \#}]
11381162 set number of threads/workers. Normally on a multi-core machine, the maximum
11391163 number of processors/cores is used.
13481372 \item[\texttt{X1}] sort output by \texttt{init1} score (for
13491373 compatibility with FASTP; obsolete).
13501374
1351 \item[\texttt{XB}] (Previously \texttt{-B}.) Show the z-score, rather
1375 \item[\texttt{XB}] Calculate percent identity, percent similarity, and
1376 alignment length using the BLAST model, which excludes gapped residues.
1377 This allows very high identity alignments with large gaps to look
1378 much closer, but causes the alignment length to drop by the length
1379 of the gap.
1380
1381 \item[\texttt{Xb}] (Previously \texttt{-B}.) Show the z-score, rather
13521382 than the bit-score in the list of best scores (rarely used, provided
13531383 for backward compatibility).
13541384
17941824 5 & NBRF/PIR VMS (\texttt{>P1;SEQID}/comment/sequence) (obsolete)\\
17951825 6 & GCG (version 8.0) Unix Protein and DNA (compressed)\\
17961826 7 & FASTQ (sequence only, quality ignored)\\
1827 9 & a script that is executed to produce FASTA format sequences \\
17971828 10 & subset format (</slib2/swissprot.lseg 0:2 4|) \\
17981829 11 & NCBI Blast1.3.2 format (unix only) (obsolete)\\
17991830 12 & NCBI Blast2.0 format\\
18691900 \section{Frequently Asked Questions (FAQs)}
18701901
18711902 {\noindent}\textbf{Where can I get FASTA?} --
1872 \url{http://faculty.virginia.edu/wrpearson/fasta} has the latest
1873 versions of the FASTA programs. This document describes
1874 \texttt{\CURRENT}, which is available from
1875 \url{http://faculty.virginia.edu/wrpearson/fasta/fasta3.tar.gz}.
1876 In addition, pre-compiled versions of the programs are available for
1903
1904 The most current version of the FASTA source code is available from
1905 \url{http://github.com/wrpearson/fasta36}. In addition, you can get
1906 the programs from \url{http://faculty.virginia.edu/wrpearson/fasta},
1907 but sometimes there is a lag between the latest release on GitHub and
1908 the compiled versions at \url{faculty.virginia.edu}. This document
1909 describes \texttt{\CURRENT}, which is available from
1910 \url{http://faculty.virginia.edu/wrpearson/fasta/fasta3.tar.gz}. In
1911 addition, pre-compiled versions of the programs are available for
18771912 MacOSX and Windows.
18781913
18791914 \needspace{4\baselineskip}
18861921 Prot. & Prot. & \texttt{fasta36} & \texttt{blastp} & heuristic local similarity \\
18871922 & & \texttt{ssearch36} & & optimal local sim.\\
18881923 & & \texttt{ggsearch36} & & global:global sim. \\
1889 & & \texttt{ggearch36} & & global:local sim.\\
1924 & & \texttt{glsearch36} & & global:local sim.\\
18901925 DNA & DNA & \texttt{fasta36}$^*$ & \texttt{blastn} & \\[1.2ex]
18911926 \hline \\[-1.0ex]
18921927 Prot. & Prot. & \texttt{lalign36} & & multiple non-intersecting \\
20282063 \begin{quote}
20292064 William R. Pearson\\
20302065 Department of Biochemistry\\
2031 Jordan Hall Box 800733\\
2066 Pinn Hall Box 800733\\
20322067 U. of Virginia\\
20332068 Charlottesville, VA\\
20342069 wrp@virginia.EDU
0 README_v36.3.8h.md
110110
111111 This release provides an extremely efficient SSE2 implementation of
112112 the Smith-Waterman algorithm for the SSE2 vector instructions written
113 by Michael Farrar (farrar.michael@gmail.com). The SSE code speeds up
113 by Michael Farrar. The SSE code speeds up
114114 Smith-Waterman 8 - 10-fold in my tests, making it comparable to Eric
115115 Lindahl's Altivec code for the Apple/IBM G4/G5 architecture.
116116
44 multiple high-scoring alignments to be shown, rather than just one.
55 This is the main functional difference between FASTA and BLAST -
66 BLAST could show multiple HSPs, FASTA did not.
7
8 >>Aug. 9, 2019
9 [src/ncbl2_mlib.c, ncbl2_head.h]
10
11 Modest extensions made to support reading makeblastdb format v5
12 databases. Changes have only been made to read the db.pin file, but
13 things work in simple tests.
14
15 >>July 16, 2019
16 [src/comp_lib9.c]
17
18 Fixed a memory leak problem when searching with large libraries that
19 could be memory mapped (libraries with .xin index files). If the
20 library did not fit in memory, then the program kept allocating new memory.
21 By default, the largest database that fits in memory must be less than
22 16 GB. Larger libraries will be re-read, which slows down multi-query
23 searches considerably. To increase the size of the library allowed in
24 memory, use the option: "-X M32G" to fit 32 GB libraries.
25
26 >>Mar. 8, 2019
27 [src/initfa.c,faatran.c,dropfx2.c]
28 Modify translation table 1 to allow selenocysteine translation
29 (TGA->'U'), and modify scoring matrices to give positive scores to
30 '*':'U'. The translation modification ONLY works with "-t 1". In
31 addition, BLAST BTOP alignments (-m 8CB) convert a 'U' aligned with a
32 '*' to a '*', so the end of the alignment is '**' rather than 'U*'
33 (fastx36) or '*U' (tfastx36).
34
35 dropfx2.c (fastx36/tfastx36), dropfz3.c(fasty36/tfasty36) did not
36 properly switch protein and translated DNA codes with -m 8CB -- fixed.
37
38 version date updated to Mar, 2019
39
40 >>Feb. 26, 2019
41 [scripts/get_genome_seq.py]
42 added get_genome_seq.py as a replacement for get_hg38_bed.py, remove
43 get_hg38_bed.py. 'get_genome_seq.py --genome mm10' also produces
44 sequences from mouse mm10 (and can now do any genome that bedtools can
45 read).
46
47 >>Feb. 23, 2019
48 [src/comp_lib9.c, mshowbest.c]
49 Modify repeat_thresh so that poor alignment scores (E() >
50 ppst->e_cut_r, typically -E-threshold/10.0) do not look for additional
51 alignments.
52
53 >>Feb. 21, 2019
54 [src/nmgetaa.c, scaleswn.c, scripts/get_protein.py, get_hg38_bed.py]
55
56 Modify nmgetaa.c to ignore ':'s (for sequence subsets) in scripts.
57 The script can do the subsetting. Modify scripts/get_protein.py to
58 provide subsetting. Add scripts/get_hg38_bed.py to extract fasta
59 sequences using the format "chr2:123456-543210"
60
61 Modify scaleswn.c to estimate Altschul-Gish parameters when gap and
62 extension do not match exactly.
63
64 >>Feb. 6, 2019
65 [src/compacc2e.c, nmgetaa.c]
66 modify build_link_data() to allow '+' for space in scripts. Ensure
67 that lib_type is properly initialized (open_lib.c()).
68
69 >>Jan. 23, 2019
70 [nmgetaa.c]
71 Fix bug introduced when checking for lib_type.
72
73 >>Jan. 15, 2019
74 [src/upam.h, altlib.h, nmgetaa.c]
75 [scripts/rename_exons.py, map_exons_coords.py, get_uniprot.py, get_refseq.py, get_proteins.py]
76
77 Bug fixes: The VT10, VT20, etc scoring matrices did not have scores for '*:*'
78 alignments, used with FASTX/TFASTX for extending alignments through
79 the termination codon. As a result, searchs with '-t t' did not
80 extend through the termination codon, even though they should have.
81 This has been fixed.
82
83 Enhancements: FASTA can now download both query and library sequences using a script, by specifying file type 9. Thus:
84
85 fasta36 "../scripts/get_uniprot.py+P09488 9" /seqlib/swissprot.fasta
86
87 Will run the script "get_uniprot.py" with the argument "P09488" and
88 use the output of the script as the query sequence. In this example,
89 the library type (9) is specified by the " 9" (this space cannot be
90 replaced with a '+' character).
91
92 Alternatively, library type '9' can be specified by putting a '!' before the script file name.
93
94 fasta36 \!../scripts/get_uniprot.py+P09488 /seqlib/swissprot.fasta
95
96 Scripts can be used to produce query or library sequences, or both.
97 Three scripts that download sequences from the NCBI and Uniprot have
98 been added in the "scripts" directory: "get_uniprot.py" takes Uniprot
99 accessions as arguments, "get_refseq.py" takes refseq accessions
100 (protein or mRNA), and "get_protein.py" gets both Uniprot and RefSeq
101 protein sequences.
102
103 rename_exons.py and map_exons_coords.py can take annotated BTOP
104 alignments with genome coordinates and map exons to the alternative
105 genome.
106
107 >>Jan. 2, 2019
108 [src/mshowbest.c]
109 Fix problems with site annotation when dom_info is provided with -m8CBL
110 [scripts/ann_exons_up_sql.pl, ann_exons_up_www.pl]
111 Make scripts more robust to missing chromosome information,
112 reverse-strand coordinates.
113
114 >>Dec. 11, 2018
115 [scripts/ann_exons_up_www.pl, ann_exons_up_sql.pl]
116 Add the option "--gen_coord" to report exon start ('<') and end ('>')
117 genome coordinates features of exons.
118
119 >>Nov. 14, 2018
120 [scripts/rename_exons.py, relabel_domains.py, compacc2e.c]
121
122 Two new scripts, rename_exons.py and relabel_domains.py, take a
123 blast tabular output file with domain alignment annotations (and
124 possibly raw domain information) and modify the names
125 (rename_exons.py) or colors (relabel_domains.py). rename_exons.py
126 takes the exon numbering associated with the query sequence and maps
127 it onto the subject alignments. relabel_domains.py can be used to use
128 different color numbers for homologous and non-homologous domains.
129
130 Both of these programs modify blast tabular output files, which can
131 then be merged back into an alignment display using
132 merge_blastp_annot.pl or merge_fasta_annot.pl.
133
134 compacc2e.c:build_link_data() has been modified to convert '+' in the
135 script string to ' ', to allow passing command line options. A space
136 in the script string is used to separate the script from the library
137 type of the file returned by the script.
138
139 >>Nov. 6-7, 2018
140 [doinit.c, mshowbest.c, mshowalign2.c, defs.h, structs.h]
141
142 (a) Add options to provide query and subject sequence lengths and raw
143 domain coordinates in BLASTP tabular output with the options -m 8CBl
144 and -m 8CBL. If domain annotations are available, -m 8CBL also
145 provides the raw domain coordinates (not just those included in the
146 alignment) in the form |DX:1-100;C=PF12345|XD:1-100;C=PF12345 where
147 |DX indicates a query annotation and |XD indicates a subject annotation. -m
148 8CBl (lower-case L) shows the sequence lengths, but not the raw domain
149 info.
150
151 (b) parse the annotation program strings so that '+' are converted to
152 ' '. This greatly simplifies passing arguments to the annotation scripts. Thus:
153
154 -V \!ann_pfam_sql.pl --db=pfam31 --neg --vdoms can be written as:
155 -V \!ann_pfam_sql.pl+--db=pfam31+--neg+--vdoms (likewise for -V q\!ann_pfam...)
156
157 (c) provide an option to remove region/feature annotations from non-m8
158 (blast-tabular) output. This simplifies the process of using
159 scripts/merge_fasta_btab.pl to use .bl_tab (-m 8CBL) files to inject
160 sub-alignment scores and domain information.
161
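The raw domain field described in (a) can be unpacked with a short parser. This is a sketch that assumes the field looks exactly like the example given, `|DX:1-100;C=PF12345|XD:1-100;C=PF12345`; the function name `parse_raw_domains` is hypothetical, and any entry that does not match that shape is skipped rather than interpreted.

```python
import re

# one entry: DX (query) or XD (subject), start-end, then C=<label>
_DOMAIN = re.compile(r"^(DX|XD):(\d+)-(\d+);C=(.+)$")


def parse_raw_domains(field):
    """Parse a '|DX:...|XD:...' string into (side, start, end, label) tuples."""
    domains = []
    for token in field.split("|"):
        if not token:
            continue  # empty piece before the leading '|'
        match = _DOMAIN.match(token)
        if match is None:
            continue  # unrecognized entry: skip instead of guessing
        side = "query" if match.group(1) == "DX" else "subject"
        domains.append(
            (side, int(match.group(2)), int(match.group(3)), match.group(4))
        )
    return domains
```

Only the `DX`/`XD` prefixes named in the text are recognized; other annotation types that may appear in real output are outside the scope of this sketch.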
162 >>Nov. 1, 2018
163 [doinit.c]
164 Allow -m F#=file.name in addition to -m "F# file.name" to address
165 problems I had with spaces in shell scripts.
166
167 >>Oct. 23, 2018 [re-released as fasta-36.3.8g] (see README_v36.3.8g.md)
168 [make/Makefiles*,psisearch2/m89_btop_msa2.pl]
169
170 Add options to psisearch2/m89_btop_msa2.pl to provide clustalw header
171 (--clustal), require a minimum coverage of the query sequence
172 (--min_align 0.8), and edit sequence identifiers to remove database
173 and accession (--trunc_acc).
174
175 Remove -lz dependency from non-debug Makefiles.
176
177 >>Aug. 5, 2018 [re-released as fasta-36.3.8g]
178 [lib_sel.c]
179 Make lib_select.c more robust to missing indirect name files.
180 [scripts/ann*.pl]
181 update various annotation scripts to use https:// instead of http://
182
183 >>April 3, 2018
184 [initfa.c, comp_lib.c, dropfx2.c]
185 Changes to (a) ensure that the "-t t" option correctly inserts and
186 aligns a termination codon '*'. (b) changes to -m 8CB, -m8CC, and -m9C
187 so that aligned termination codons are indicated as "**" (-m8CB) or
188 "*1" (-m8CC, -m9C).
189
190 >>Mar. 9, 2018
191 [scripts/annot_blast_btop2.pl, merge_blast_btab.pl, blastp_annot_cmd.sh]
192 Code is now in place to provide sub-alignment scoring using domain
193 annotations with blastp searches (BLOSUM62 only). blastp_annot_cmd.sh
194 runs blast and produces both a standard HTML and a tabular output
195 file. It then runs annot_blast_btop2.pl to add sub-alignment scoring
196 to the tabular output file, and then merge_blast_btab.pl merges the
197 domain-annotated blast tabular file with the HTML output file. When
198 combined in this way, the FASTA web server (fasta.bioch.virginia.edu)
199 can produce blastp searches with domain highlights/scoring.
200
201 >>Feb. 6, 2018
202 [initfa.c, doinit.c, mshowbest.c, mshowalign2.c]
203 Add a new extended option, -XB, which causes percent identity, percent
204 similarity, and alignment length to be presented using the BLAST
205 model, which does not count gaps in the alignment length.
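
The effect of counting or not counting gap columns can be seen in a small sketch. It takes the description above at face value and is not the code in mshowbest.c or mshowalign2.c; the function name `percent_identity` and the toy alignment are assumptions made for illustration.

```python
def percent_identity(query_aln, subject_aln, count_gaps):
    """Return (percent identity, alignment length) for two aligned strings.

    Gap columns ('-' in either sequence) are included in the length only
    when count_gaps is True. Hypothetical helper for illustration.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have equal length")
    identical = 0
    length = 0
    for q, s in zip(query_aln, subject_aln):
        has_gap = q == "-" or s == "-"
        if has_gap and not count_gaps:
            continue  # skip gap columns entirely
        length += 1
        if not has_gap and q == s:
            identical += 1
    pct = 100.0 * identical / length if length else 0.0
    return pct, length


# Seven columns, one gap, five identical residues.
print(percent_identity("ACD-FGH", "ACDEFGK", count_gaps=True))
print(percent_identity("ACD-FGH", "ACDEFGK", count_gaps=False))
```

With gap columns excluded, the same five identities are divided by a length of 6 instead of 7, so the reported percent identity rises.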
7206
8207 >>Dec. 30, 2017 [released as fasta-36.3.8g]
9208 [scaleswn.c]
+0
-1
make/Makefile.linux
0 Makefile.linux64_sse2
0 # $ Id: $
1 #
2 # makefile for fasta3, fasta3_t Use Makefile.mpi for fasta36_mpi
3 #
4 # This file is designed for 64-bit Linux systems using an X86
5 # architecture with SSE2 extensions. -D_LARGEFILE64_SOURCE and
6 # -DBIG_LIB64 require a 64-bit linux system.
7 # SSE2 extensions are used for ssearch35(_t)
8 #
9 # Use Makefile.linux32_sse2 for 32-bit linux x86
10 #
11
12 SHELL=/bin/bash
13
14 CC = gcc -g -O -msse2
15 LIB_DB=
16
17 #CC= gcc -pg -g -O -msse2 -ffast-math
18 #CC = gcc -g -DDEBUG -msse2
19 #CC=gcc -Wall -pedantic -ansi -g -msse2 -DDEBUG
20
21 # EBI uses the following with pgcc, -O3 does not work:
22 # CC= pgcc -O2 -pipe -mcpu=pentiumpro -march=pentiumpro -fomit-frame-pointer
23
24 # this file works for x86 LINUX
25
26 # standard options
27
28 CFLAGS= -DSHOW_HELP -DSHOWSIM -DUNIX -DTIMES -DHZ=100 -DMAX_WORKERS=8 -DTHR_EXIT=pthread_exit -DM10_CONS -D_REENTRANT -DHAS_INTTYPES -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -DUSE_FSEEKO -DSAMP_STATS -DPGM_DOC -DUSE_MMAP -D_LARGEFILE64_SOURCE -DBIG_LIB64
29 # -I/usr/include/mysql -DMYSQL_DB
30 # -DSUPERFAMNUM -DSFCHAR="'|'"
31
32 #
33 #(for mySQL databases) (also requires change to Makefile36m.common or use of Makefile36m.common_mysql)
34 # run 'mysql_config' to find locations of mySQL files
35
36 LIB_M = -lm
37 # for mySQL databases
38 # LIB_M = -L/usr/lib64/mysql -lmysqlclient -lm
39
40 HFLAGS= -o
41 NFLAGS= -o
42
43 # for Linux
44 THR_SUBS = pthr_subs2
45 THR_LIBS = -lpthread
46 THR_CC =
47
48 BIN = ../bin
49 XDIR = /seqprg/bin
50 #XDIR = ~/bin/LINUX
51
52 # set up files for SSE2/Altivec acceleration
53 #
54 include ../make/Makefile.sse_alt
55
56 # SSE2 acceleration
57 #
58 DROPGSW_O = $(DROPGSW_SSE_O)
59 DROPLAL_O = $(DROPLAL_SSE_O)
60 DROPGNW_O = $(DROPGNW_SSE_O)
61 DROPLNW_O = $(DROPLNW_SSE_O)
62
63 # renamed (fasta36) programs
64 include ../make/Makefile36m.common
65 # conventional (fasta3) names
66 # include ../make/Makefile.common
1212
1313 #CC= gcc -g -O
1414 #CC = gcc -g -DDEBUG
15 #LIB_DB=
1516
1617 #CC=gcc -Wall -pedantic -ansi -g -O
1718 CC= /usr/local/parasoft/bin/insure -g -DDEBUG
19 LIB_DB=-lz
1820
1921 # EBI uses the following with pgcc, -O3 does not work:
2022 # CC= pgcc -O2 -pipe -mcpu=pentiumpro -march=pentiumpro -fomit-frame-pointer
1212 SHELL=/bin/bash
1313
1414 CC= gcc -g -O -msse2 -ffast-math
15 LIB_DB=
1516 #CC = gcc -g -DDEBUG -msse2
1617
1718 #CC= /usr/local/parasoft/bin/insure -g -DDEBUG
19 #LIB_DB=-lz
1820
1921 #CC=gcc -Wall -pedantic -ansi -g -O
2022
+0
-1
make/Makefile.linux64
0 Makefile.linux64_sse2
0 # $ Id: $
1 #
2 # makefile for fasta3, fasta3_t Use Makefile.mpi for fasta36_mpi
3 #
4 # This file is designed for 64-bit Linux systems using an X86
5 # architecture with SSE2 extensions. -D_LARGEFILE64_SOURCE and
6 # -DBIG_LIB64 require a 64-bit linux system.
7 # SSE2 extensions are used for ssearch35(_t)
8 #
9 # Use Makefile.linux32_sse2 for 32-bit linux x86
10 #
11
12 SHELL=/bin/bash
13
14 CC = gcc -g -O -msse2
15 LIB_DB=
16
17 #CC= gcc -pg -g -O -msse2 -ffast-math
18 #CC = gcc -g -DDEBUG -msse2
19 #CC=gcc -Wall -pedantic -ansi -g -msse2 -DDEBUG
20
21 # EBI uses the following with pgcc, -O3 does not work:
22 # CC= pgcc -O2 -pipe -mcpu=pentiumpro -march=pentiumpro -fomit-frame-pointer
23
24 # this file works for x86 LINUX
25
26 # standard options
27
28 CFLAGS= -DSHOW_HELP -DSHOWSIM -DUNIX -DTIMES -DHZ=100 -DMAX_WORKERS=8 -DTHR_EXIT=pthread_exit -DM10_CONS -D_REENTRANT -DHAS_INTTYPES -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -DUSE_FSEEKO -DSAMP_STATS -DPGM_DOC -DUSE_MMAP -D_LARGEFILE64_SOURCE -DBIG_LIB64
29 # -I/usr/include/mysql -DMYSQL_DB
30 # -DSUPERFAMNUM -DSFCHAR="'|'"
31
32 #
33 #(for mySQL databases) (also requires change to Makefile36m.common or use of Makefile36m.common_mysql)
34 # run 'mysql_config' to find locations of mySQL files
35
36 LIB_M = -lm
37 # for mySQL databases
38 # LIB_M = -L/usr/lib64/mysql -lmysqlclient -lm
39
40 HFLAGS= -o
41 NFLAGS= -o
42
43 # for Linux
44 THR_SUBS = pthr_subs2
45 THR_LIBS = -lpthread
46 THR_CC =
47
48 BIN = ../bin
49 XDIR = /seqprg/bin
50 #XDIR = ~/bin/LINUX
51
52 # set up files for SSE2/Altivec acceleration
53 #
54 include ../make/Makefile.sse_alt
55
56 # SSE2 acceleration
57 #
58 DROPGSW_O = $(DROPGSW_SSE_O)
59 DROPLAL_O = $(DROPLAL_SSE_O)
60 DROPGNW_O = $(DROPGNW_SSE_O)
61 DROPLNW_O = $(DROPLNW_SSE_O)
62
63 # renamed (fasta36) programs
64 include ../make/Makefile36m.common
65 # conventional (fasta3) names
66 # include ../make/Makefile.common
1212 SHELL=/bin/bash
1313
1414 CC = gcc -g -O -msse2
15 LIB_DB=
16
1517 #CC= gcc -pg -g -O -msse2 -ffast-math
1618 #CC = gcc -g -DDEBUG -msse2
1719 #CC=gcc -Wall -pedantic -ansi -g -msse2 -DDEBUG
77 SHELL=/bin/bash
88
99 CC= icc -g -O3
10 LIB_DB=
1011 #CC = icc -g -DDEBUG
12 #LIB_DB=-lz
1113
1214 #CC=gcc -Wall -pedantic -ansi -g -O
1315 #CC= /usr/local/parasoft/bin/insure -g -DDEBUG
88
99 SHELL=/bin/bash
1010
11 CC= icc -O3 -g
11 CC= icc -O3 -g -pthread
12 LIB_DB=
1213 #CC = icc -g -DDEBUG
14 #LIB_DB=-lz
1315
1416 #CC=gcc -Wall -pedantic -ansi -g -O
1517 #CC= /usr/local/parasoft/bin/insure -g -DDEBUG
1010 SHELL=/bin/bash
1111
1212 CC= gcc -g -O2
13 LIB_DB=
1314 #CC= gcc -g -DDEBUG
15 #LIB_DB=-lz
1416
1517 # this file works for x86 LINUX
1618
1010 SHELL=/bin/bash
1111
1212 CC= gcc -g -O
13 LIB_DB=
1314 #CC= gcc -g -DDEBUG
15 #LIB_DB=-lz
1416 #CC=/opt/parasoft/bin.linux2/insure -g -DDEBUG
1517
1618 # this file works for x86 LINUX
1010 SHELL=/bin/bash
1111
1212 CC= gcc -g -O
13 LIB_DB=
1314 #CC= gcc -g -DDEBUG
15 #LIB_DB=-lz
1416 #CC=/opt/parasoft/bin.linux2/insure -g -DDEBUG
1517
1618 # this file works for x86 LINUX
+0
-1
make/Makefile.linux_sse2
0 Makefile.linux64_sse2
0 # $ Id: $
1 #
2 # makefile for fasta3, fasta3_t Use Makefile.mpi for fasta36_mpi
3 #
4 # This file is designed for 64-bit Linux systems using an X86
5 # architecture with SSE2 extensions. -D_LARGEFILE64_SOURCE and
6 # -DBIG_LIB64 require a 64-bit linux system.
7 # SSE2 extensions are used for ssearch35(_t)
8 #
9 # Use Makefile.linux32_sse2 for 32-bit linux x86
10 #
11
12 SHELL=/bin/bash
13
14 CC = gcc -g -O -msse2
15 LIB_DB=
16
17 #CC= gcc -pg -g -O -msse2 -ffast-math
18 #CC = gcc -g -DDEBUG -msse2
19 #CC=gcc -Wall -pedantic -ansi -g -msse2 -DDEBUG
20
21 # EBI uses the following with pgcc, -O3 does not work:
22 # CC= pgcc -O2 -pipe -mcpu=pentiumpro -march=pentiumpro -fomit-frame-pointer
23
24 # this file works for x86 LINUX
25
26 # standard options
27
28 CFLAGS= -DSHOW_HELP -DSHOWSIM -DUNIX -DTIMES -DHZ=100 -DMAX_WORKERS=8 -DTHR_EXIT=pthread_exit -DM10_CONS -D_REENTRANT -DHAS_INTTYPES -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -DUSE_FSEEKO -DSAMP_STATS -DPGM_DOC -DUSE_MMAP -D_LARGEFILE64_SOURCE -DBIG_LIB64
29 # -I/usr/include/mysql -DMYSQL_DB
30 # -DSUPERFAMNUM -DSFCHAR="'|'"
31
32 #
33 #(for mySQL databases) (also requires change to Makefile36m.common or use of Makefile36m.common_mysql)
34 # run 'mysql_config' to find locations of mySQL files
35
36 LIB_M = -lm
37 # for mySQL databases
38 # LIB_M = -L/usr/lib64/mysql -lmysqlclient -lm
39
40 HFLAGS= -o
41 NFLAGS= -o
42
43 # for Linux
44 THR_SUBS = pthr_subs2
45 THR_LIBS = -lpthread
46 THR_CC =
47
48 BIN = ../bin
49 XDIR = /seqprg/bin
50 #XDIR = ~/bin/LINUX
51
52 # set up files for SSE2/Altivec acceleration
53 #
54 include ../make/Makefile.sse_alt
55
56 # SSE2 acceleration
57 #
58 DROPGSW_O = $(DROPGSW_SSE_O)
59 DROPLAL_O = $(DROPLAL_SSE_O)
60 DROPGNW_O = $(DROPGNW_SSE_O)
61 DROPLNW_O = $(DROPLNW_SSE_O)
62
63 # renamed (fasta36) programs
64 include ../make/Makefile36m.common
65 # conventional (fasta3) names
66 # include ../make/Makefile.common
1212
1313 # in my hands, gcc-4.0 is about 40% slower than gcc-3.3 on the Altivec code
1414 CC= gcc -g -O3 -arch ppc -falign-loops=32 -O3 -maltivec -mpim-altivec -force_cpusubtype_ALL
15 LIB_DB=
16
1517 # -pg -finstrument-functions -lSaturn
1618
1719 #CC= gcc-3.3 -g -falign-loops=32 -O3 -mcpu=7450 -faltivec
1820 #CC= gcc-3.3 -g -DDEBUG -mcpu=7450 -faltivec
21 #LIB_DB=-lz
1922 #CC= cc -g -Wall -pedantic -faltivec
2023 #
2124 # standard line for normal searching
1212 SHELL=/bin/bash
1313
1414 CC= gcc -g -O3 -arch i386 -msse2
15 LIB_DB=
1516 #CC= gcc -g -DDEBUG -arch i386 -msse2
17 #LIB_DB=-lz
1618
1719 #CC= cc -g -Wall -pedantic
1820 #
1212 SHELL=/bin/bash
1313
1414 CC= cc -O -g -arch x86_64 -msse2
15 LIB_DB=
16
1517 #CC= cc -g -DDEBUG -fsanitize=address -arch x86_64 -msse2
18 #LIB_DB=-lz
1619
1720 #CC= cc -g -Wall -pedantic
1821 #
1212 SHELL=/bin/bash
1313
1414 CC= clang -g -O -arch x86_64 -msse2
15 LIB_DB=
1516 #CC= clang -g -DDEBUG -arch x86_64 -msse2
17 #LIB_DB=-lz
1618
1719 #CC= cc -g -Wall -pedantic
1820 #
1212 SHELL=/bin/bash
1313
1414 CC= icc -g -O -m64 # intel icc compiler
15 LIB_DB=
1516 #CC= icc -g -DDEBUG -m64
17 #LIB_DB=-lz
1618
1719 #CC= cc -g -Wall -pedantic
1820 #
6161 pushd $(BIN); cp $(TPROGS) $(XDIR); popd
6262
6363 fasta36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
64 $(CC) $(HFLAGS) $(BIN)/fasta36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M)
64 $(CC) $(HFLAGS) $(BIN)/fasta36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB)
6565
6666 fastx36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fx.o scale_se.o karlin.o drop_fx.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o
67 $(CC) $(HFLAGS) $(BIN)/fastx36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fx.o drop_fx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M)
67 $(CC) $(HFLAGS) $(BIN)/fastx36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fx.o drop_fx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB)
6868
6969 fasty36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fy.o scale_se.o karlin.o drop_fz.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o
70 $(CC) $(HFLAGS) $(BIN)/fasty36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fy.o drop_fz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M)
70 $(CC) $(HFLAGS) $(BIN)/fasty36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fy.o drop_fz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB)
7171
7272 fastf36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ff.o scaleswts.o last_tat.o tatstats_ff.o karlin.o $(DROPFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
73 $(CC) $(HFLAGS) $(BIN)/fastf36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswts.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M)
73 $(CC) $(HFLAGS) $(BIN)/fastf36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswts.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB)
7474
7575 fasts36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fs.o scaleswts.o last_tat.o tatstats_fs.o karlin.o $(DROPFS_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
76 $(CC) $(HFLAGS) $(BIN)/fasts36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fs.o $(DROPFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M)
76 $(CC) $(HFLAGS) $(BIN)/fasts36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fs.o $(DROPFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB)
7777
7878 fastm36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fm.o scaleswts.o last_tat.o tatstats_fm.o karlin.o $(DROPFM_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
79 $(CC) $(HFLAGS) $(BIN)/fastm36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fm.o $(DROPFM_O) scaleswts.o last_tat.o tatstats_fm.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M)
79 $(CC) $(HFLAGS) $(BIN)/fastm36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fm.o $(DROPFM_O) scaleswts.o last_tat.o tatstats_fm.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB)
8080
8181 tfastx36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfx.o scale_se.o karlin.o drop_tfx.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o
82 $(CC) $(HFLAGS) $(BIN)/tfastx36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfx.o drop_tfx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M)
82 $(CC) $(HFLAGS) $(BIN)/tfastx36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfx.o drop_tfx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB)
8383
8484 tfasty36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfy.o scale_se.o karlin.o drop_tfz.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o
85 $(CC) $(HFLAGS) $(BIN)/tfasty36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfy.o drop_tfz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M)
85 $(CC) $(HFLAGS) $(BIN)/tfasty36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfy.o drop_tfz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB)
8686
8787 tfastf36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(DROPTFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o
88 $(CC) $(HFLAGS) $(BIN)/tfastf36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M)
88 $(CC) $(HFLAGS) $(BIN)/tfastf36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB)
8989
9090 tfastf36s : $(COMP_LIBO) $(COMPACC_SO) showsum.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o scaleswtf.o karlin.o $(DROPTFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o
91 $(CC) $(HFLAGS) $(BIN)/tfastf36s $(COMP_LIBO) $(COMPACC_SO) showsum.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M)
91 $(CC) $(HFLAGS) $(BIN)/tfastf36s $(COMP_LIBO) $(COMPACC_SO) showsum.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB)
9292
9393 tfasts36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfs.o scaleswts.o tatstats_fs.o last_tat.o karlin.o $(DROPTFS_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o
94 $(CC) $(HFLAGS) $(BIN)/tfasts36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfs.o $(DROPTFS_O) scaleswts.o tatstats_fs.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M)
94 $(CC) $(HFLAGS) $(BIN)/tfasts36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfs.o $(DROPTFS_O) scaleswts.o tatstats_fs.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB)
9595
9696 tfastm36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o scaleswts.o tatstats_fm.o last_tat.o karlin.o $(DROPTFM_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o
97 $(CC) $(HFLAGS) $(BIN)/tfastm36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o $(DROPTFM_O) scaleswts.o tatstats_fm.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M)
97 $(CC) $(HFLAGS) $(BIN)/tfastm36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o $(DROPTFM_O) scaleswts.o tatstats_fm.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB)
9898
9999 ssearch36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o scale_se.o karlin.o $(DROPGSW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
100 $(CC) $(HFLAGS) $(BIN)/ssearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M)
100 $(CC) $(HFLAGS) $(BIN)/ssearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB)
101101
102102 # do not use accelerated Smith-Waterman
103103 ssearch36s : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o scale_se.o karlin.o $(DROPGSW_NA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
104 $(CC) $(HFLAGS) $(BIN)/ssearch36s $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGSW_NA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M)
104 $(CC) $(HFLAGS) $(BIN)/ssearch36s $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGSW_NA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB)
105105
106106 lalign36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o scale_se.o karlin.o last_thresh.o $(DROPLAL_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
107 $(CC) $(HFLAGS) $(BIN)/lalign36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o $(DROPLAL_O) scale_se.o karlin.o last_thresh.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M)
107 $(CC) $(HFLAGS) $(BIN)/lalign36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o $(DROPLAL_O) scale_se.o karlin.o last_thresh.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB)
108108
109109 osearch36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ssw.o scale_se.o karlin.o $(DROPNSW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o
110 $(CC) $(HFLAGS) $(BIN)/osearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ssw.o $(DROPNSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M)
110 $(CC) $(HFLAGS) $(BIN)/osearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ssw.o $(DROPNSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB)
111111
112112 glsearch36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPLNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
113 $(CC) $(HFLAGS) $(BIN)/glsearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M)
113 $(CC) $(HFLAGS) $(BIN)/glsearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB)
114114
115115 ggsearch36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPGNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
116 $(CC) $(HFLAGS) $(BIN)/ggsearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M)
116 $(CC) $(HFLAGS) $(BIN)/ggsearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB)
117117
118118 prss36 : ssearch36
119119 ln -sf ssearch36 prss36
120120
121121 ssearch36_t : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_se.o karlin.o $(DROPGSW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
122 $(CC) $(HFLAGS) $(BIN)/ssearch36_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS)
122 $(CC) $(HFLAGS) $(BIN)/ssearch36_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
123123
124124 ssearch36s_t : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_se.o karlin.o $(DROPGSW_NA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
125 $(CC) $(HFLAGS) $(BIN)/ssearch36s_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGSW_NA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS)
125 $(CC) $(HFLAGS) $(BIN)/ssearch36s_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGSW_NA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
126126
127127 glsearch36_t : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPLNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
128 $(CC) $(HFLAGS) $(BIN)/glsearch36_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS)
128 $(CC) $(HFLAGS) $(BIN)/glsearch36_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
129129
130130 glsearch36s_t : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPLNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
131 $(CC) $(HFLAGS) $(BIN)/glsearch36s_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS)
131 $(CC) $(HFLAGS) $(BIN)/glsearch36s_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
132132
133133 ggsearch36_t : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPGNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
134 $(CC) $(HFLAGS) $(BIN)/ggsearch36_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS)
134 $(CC) $(HFLAGS) $(BIN)/ggsearch36_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
135135
136136 ggsearch36s_t : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPGNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
137 $(CC) $(HFLAGS) $(BIN)/ggsearch36s_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS)
137 $(CC) $(HFLAGS) $(BIN)/ggsearch36s_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
138138
139139 fasta36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o
140 $(CC) $(HFLAGS) $(BIN)/fasta36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS)
140 $(CC) $(HFLAGS) $(BIN)/fasta36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
141141
142142 fasta36sum_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o
143 $(CC) $(HFLAGS) $(BIN)/fasta36sum_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS)
143 $(CC) $(HFLAGS) $(BIN)/fasta36sum_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
144144
145145 fasta36u_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showun.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
146 $(CC) $(HFLAGS) $(BIN)/fasta36u_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showun.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
146 $(CC) $(HFLAGS) $(BIN)/fasta36u_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showun.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
147147
148148 fasta36r_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showrel.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
149 $(CC) $(HFLAGS) $(BIN)/fasta36r_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showrel.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
149 $(CC) $(HFLAGS) $(BIN)/fasta36r_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showrel.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
150150
151151 fastf36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(DROPFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
152 $(CC) $(HFLAGS) $(BIN)/fastf36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
152 $(CC) $(HFLAGS) $(BIN)/fastf36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
153153
154154 fastf36s_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o scaleswtf.o karlin.o $(DROPFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
155 $(CC) $(HFLAGS) $(BIN)/fastf36s_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswtf.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
155 $(CC) $(HFLAGS) $(BIN)/fastf36s_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswtf.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
156156
157157 fasts36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fs.o scaleswts.o last_tat.o tatstats_fs.o karlin.o $(DROPFS_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
158 $(CC) $(HFLAGS) $(BIN)/fasts36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fs.o $(DROPFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
158 $(CC) $(HFLAGS) $(BIN)/fasts36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fs.o $(DROPFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
159159
160160 fastm36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fm.o scaleswts.o last_tat.o tatstats_fm.o karlin.o $(DROPFM_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
161 $(CC) $(HFLAGS) $(BIN)/fastm36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fm.o $(DROPFM_O) scaleswts.o last_tat.o tatstats_fm.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
161 $(CC) $(HFLAGS) $(BIN)/fastm36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fm.o $(DROPFM_O) scaleswts.o last_tat.o tatstats_fm.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
162162
163163 fastx36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_fx.o faatran.o scale_se.o karlin.o drop_fx.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o
164 $(CC) $(HFLAGS) $(BIN)/fastx36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fx.o drop_fx.o faatran.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS)
164 $(CC) $(HFLAGS) $(BIN)/fastx36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fx.o drop_fx.o faatran.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
165165
166166 fasty36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_fy.o faatran.o scale_se.o karlin.o drop_fz.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o
167 $(CC) $(HFLAGS) $(BIN)/fasty36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fy.o drop_fz.o faatran.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS)
167 $(CC) $(HFLAGS) $(BIN)/fasty36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fy.o drop_fz.o faatran.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
168168
169169 tfasta36 : $(COMP_LIBO) compacc.o $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o scale_se.o karlin.o $(DROPTFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o
170 $(CC) $(HFLAGS) $(BIN)/tfasta36 $(COMP_LIBO) compacc.o $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o $(DROPTFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M)
170 $(CC) $(HFLAGS) $(BIN)/tfasta36 $(COMP_LIBO) compacc.o $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o $(DROPTFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB)
171171
172172 tfasta36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_tfa.o scale_se.o karlin.o $(DROPTFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o
173 $(CC) $(HFLAGS) $(BIN)/tfasta36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o $(DROPTFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS)
173 $(CC) $(HFLAGS) $(BIN)/tfasta36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o $(DROPTFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
174174
175175 tfastf36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_tf.o scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(DROPTFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o
176 $(CC) $(HFLAGS) $(BIN)/tfastf36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
176 $(CC) $(HFLAGS) $(BIN)/tfastf36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
177177
178178 tfasts36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_tfs.o scaleswts.o last_tat.o tatstats_fs.o karlin.o $(DROPTFS_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o
179 $(CC) $(HFLAGS) $(BIN)/tfasts36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfs.o $(DROPTFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
179 $(CC) $(HFLAGS) $(BIN)/tfasts36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfs.o $(DROPTFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
180180
181181 tfastx36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfx.o scale_se.o karlin.o drop_tfx.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o
182 $(CC) $(HFLAGS) $(BIN)/tfastx36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfx.o drop_tfx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS)
182 $(CC) $(HFLAGS) $(BIN)/tfastx36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfx.o drop_tfx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
183183
184184 tfasty36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfy.o scale_se.o karlin.o drop_tfz.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o
185 $(CC) $(HFLAGS) $(BIN)/tfasty36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfy.o drop_tfz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS)
185 $(CC) $(HFLAGS) $(BIN)/tfasty36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfy.o drop_tfz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
186186
187187 comp_mlib5e.o : comp_lib5e.c mw.h structs.h defs.h param.h
188188 $(CC) $(THR_CC) $(CFLAGS) -DCOMP_MLIB -c comp_lib5e.c -o comp_mlib5e.o
212212 $(CC) $(THR_CC) $(CFLAGS) -c work_thr2.c
213213
214214 print_pssm : print_pssm.c getseq.c karlin.c apam.cn pssm_asn_subs.c
215 $(CC) -o print_pssm $(CFLAGS) print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c $(LIB_M)
215 $(CC) -o print_pssm $(CFLAGS) print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c $(LIB_M) $(LIB_DB)
216216
217217 map_db : map_db.c uascii.h ncbl2_head.h
218218 $(CC) $(CFLAGS) -o $(BIN)/map_db map_db.c
5757 pushd $(BIN); cp $(TPROGS) $(XDIR); popd
5858
5959 fasta36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
60 $(CC) $(HFLAGS) $(BIN)/fasta36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M)
60 $(CC) $(HFLAGS) $(BIN)/fasta36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB)
6161
6262 fastx36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fx.o scale_se.o karlin.o drop_fx.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o
63 $(CC) $(HFLAGS) $(BIN)/fastx36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fx.o drop_fx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M)
63 $(CC) $(HFLAGS) $(BIN)/fastx36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fx.o drop_fx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB)
6464
6565 fasty36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fy.o scale_se.o karlin.o drop_fz.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o
66 $(CC) $(HFLAGS) $(BIN)/fasty36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fy.o drop_fz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M)
66 $(CC) $(HFLAGS) $(BIN)/fasty36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fy.o drop_fz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB)
6767
6868 fastf36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ff.o scaleswts.o last_tat.o tatstats_ff.o karlin.o $(DROPFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
69 $(CC) $(HFLAGS) $(BIN)/fastf36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswts.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M)
69 $(CC) $(HFLAGS) $(BIN)/fastf36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswts.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB)
7070
7171 fasts36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fs.o scaleswts.o last_tat.o tatstats_fs.o karlin.o $(DROPFS_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
72 $(CC) $(HFLAGS) $(BIN)/fasts36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fs.o $(DROPFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M)
72 $(CC) $(HFLAGS) $(BIN)/fasts36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fs.o $(DROPFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB)
7373
7474 fastm36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fm.o scaleswts.o last_tat.o tatstats_fm.o karlin.o $(DROPFM_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
75 $(CC) $(HFLAGS) $(BIN)/fastm36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fm.o $(DROPFM_O) scaleswts.o last_tat.o tatstats_fm.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M)
75 $(CC) $(HFLAGS) $(BIN)/fastm36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fm.o $(DROPFM_O) scaleswts.o last_tat.o tatstats_fm.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB)
7676
7777 tfastx36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfx.o scale_se.o karlin.o drop_tfx.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o
78 $(CC) $(HFLAGS) $(BIN)/tfastx36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfx.o drop_tfx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M)
78 $(CC) $(HFLAGS) $(BIN)/tfastx36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfx.o drop_tfx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB)
7979
8080 tfasty36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfy.o scale_se.o karlin.o drop_tfz.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o
81 $(CC) $(HFLAGS) $(BIN)/tfasty36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfy.o drop_tfz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M)
81 $(CC) $(HFLAGS) $(BIN)/tfasty36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfy.o drop_tfz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB)
8282
8383 tfasta36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o scale_se.o karlin.o $(DROPTFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o
84 $(CC) $(HFLAGS) $(BIN)/tfasta36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o $(DROPTFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M)
84 $(CC) $(HFLAGS) $(BIN)/tfasta36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o $(DROPTFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB)
8585
8686 tfastf36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(DROPTFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o
87 $(CC) $(HFLAGS) $(BIN)/tfastf36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M)
87 $(CC) $(HFLAGS) $(BIN)/tfastf36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB)
8888
8989 tfastf36s : $(COMP_LIBO) $(COMPACC_SO) showsum.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o scaleswtf.o karlin.o $(DROPTFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o
90 $(CC) $(HFLAGS) $(BIN)/tfastf36s $(COMP_LIBO) $(COMPACC_SO) showsum.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M)
90 $(CC) $(HFLAGS) $(BIN)/tfastf36s $(COMP_LIBO) $(COMPACC_SO) showsum.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB)
9191
9292 tfasts36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfs.o scaleswts.o tatstats_fs.o last_tat.o karlin.o $(DROPTFS_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o
93 $(CC) $(HFLAGS) $(BIN)/tfasts36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfs.o $(DROPTFS_O) scaleswts.o tatstats_fs.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M)
93 $(CC) $(HFLAGS) $(BIN)/tfasts36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfs.o $(DROPTFS_O) scaleswts.o tatstats_fs.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB)
9494
9595 tfastm36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o scaleswts.o tatstats_fm.o last_tat.o karlin.o $(DROPTFM_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o
96 $(CC) $(HFLAGS) $(BIN)/tfastm36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o $(DROPTFM_O) scaleswts.o tatstats_fm.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M)
96 $(CC) $(HFLAGS) $(BIN)/tfastm36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o $(DROPTFM_O) scaleswts.o tatstats_fm.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB)
9797
9898 ssearch36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o scale_se.o karlin.o $(DROPGSW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
99 $(CC) $(HFLAGS) $(BIN)/ssearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M)
99 $(CC) $(HFLAGS) $(BIN)/ssearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB)
100100
101101 # do not use accelerated Smith-Waterman
102102 ssearch36s : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o scale_se.o karlin.o $(DROPGSW_NA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
103 $(CC) $(HFLAGS) $(BIN)/ssearch36s $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGSW_NA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M)
103 $(CC) $(HFLAGS) $(BIN)/ssearch36s $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGSW_NA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB)
104104
105105 lalign36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o scale_se.o karlin.o last_thresh.o $(DROPLAL_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
106 $(CC) $(HFLAGS) $(BIN)/lalign36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o $(DROPLAL_O) scale_se.o karlin.o last_thresh.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M)
106 $(CC) $(HFLAGS) $(BIN)/lalign36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o $(DROPLAL_O) scale_se.o karlin.o last_thresh.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB)
107107
108108 osearch36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ssw.o scale_se.o karlin.o $(DROPNSW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o
109 $(CC) $(HFLAGS) $(BIN)/osearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ssw.o $(DROPNSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M)
109 $(CC) $(HFLAGS) $(BIN)/osearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ssw.o $(DROPNSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB)
110110
111111 glsearch36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPLNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
112 $(CC) $(HFLAGS) $(BIN)/glsearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M)
112 $(CC) $(HFLAGS) $(BIN)/glsearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB)
113113
114114 ggsearch36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPGNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
115 $(CC) $(HFLAGS) $(BIN)/ggsearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M)
115 $(CC) $(HFLAGS) $(BIN)/ggsearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB)
116116
117117 prss36 : ssearch36
118118 ln -sf ssearch36 prss36
145145 $(CC) $(THR_CC) $(CFLAGS) -c work_thr2.c
146146
147147 print_pssm : print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c
148 $(CC) -o print_pssm $(CFLAGS) print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c $(LIB_M)
148 $(CC) -o print_pssm $(CFLAGS) print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c $(LIB_M) $(LIB_DB)
149149
150150 map_db : map_db.c uascii.h ncbl2_head.h
151151 $(CC) $(CFLAGS) -o $(BIN)/map_db map_db.c
5353 pushd $(BIN); cp $(TPROGS) $(XDIR); popd
5454
5555 lalign36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o scale_se.o karlin.o last_thresh.o $(DROPLAL_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
56 $(CC) $(HFLAGS) $(BIN)/lalign36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o $(DROPLAL_O) scale_se.o karlin.o last_thresh.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M)
56 $(CC) $(HFLAGS) $(BIN)/lalign36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o $(DROPLAL_O) scale_se.o karlin.o last_thresh.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB)
5757
5858 ssearch36 : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_se.o karlin.o $(DROPGSW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
59 $(CC) $(HFLAGS) $(BIN)/ssearch36 $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS)
59 $(CC) $(HFLAGS) $(BIN)/ssearch36 $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
6060
6161 ssearch36s : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_se.o karlin.o $(DROPGSW_NA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
62 $(CC) $(HFLAGS) $(BIN)/ssearch36s $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGSW_NA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS)
62 $(CC) $(HFLAGS) $(BIN)/ssearch36s $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGSW_NA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
6363
6464 glsearch36 : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPLNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
65 $(CC) $(HFLAGS) $(BIN)/glsearch36 $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS)
65 $(CC) $(HFLAGS) $(BIN)/glsearch36 $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
6666
6767 glsearch36s : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPLNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
68 $(CC) $(HFLAGS) $(BIN)/glsearch36s $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS)
68 $(CC) $(HFLAGS) $(BIN)/glsearch36s $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
6969
7070 ggsearch36 : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPGNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
71 $(CC) $(HFLAGS) $(BIN)/ggsearch36 $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS)
71 $(CC) $(HFLAGS) $(BIN)/ggsearch36 $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
7272
7373 ggsearch36s : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPGNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o
74 $(CC) $(HFLAGS) $(BIN)/ggsearch36s $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS)
74 $(CC) $(HFLAGS) $(BIN)/ggsearch36s $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
7575
7676 fasta36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o
77 $(CC) $(HFLAGS) $(BIN)/fasta36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS)
77 $(CC) $(HFLAGS) $(BIN)/fasta36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
7878
7979 fasta36sum : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o
80 $(CC) $(HFLAGS) $(BIN)/fasta36sum $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS)
80 $(CC) $(HFLAGS) $(BIN)/fasta36sum $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
8181
8282 fasta36u : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showun.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
83 $(CC) $(HFLAGS) $(BIN)/fasta36u $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showun.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
83 $(CC) $(HFLAGS) $(BIN)/fasta36u $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showun.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
8484
8585 fasta36r : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showrel.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
86 $(CC) $(HFLAGS) $(BIN)/fasta36r $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showrel.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
86 $(CC) $(HFLAGS) $(BIN)/fasta36r $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showrel.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
8787
8888 fastf36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(DROPFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
89 $(CC) $(HFLAGS) $(BIN)/fastf36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
89 $(CC) $(HFLAGS) $(BIN)/fastf36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
9090
9191 fastf36s : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o scaleswtf.o karlin.o $(DROPFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
92 $(CC) $(HFLAGS) $(BIN)/fastf36s $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswtf.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
92 $(CC) $(HFLAGS) $(BIN)/fastf36s $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswtf.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
9393
9494 fasts36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fs.o scaleswts.o last_tat.o tatstats_fs.o karlin.o $(DROPFS_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
95 $(CC) $(HFLAGS) $(BIN)/fasts36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fs.o $(DROPFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
95 $(CC) $(HFLAGS) $(BIN)/fasts36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fs.o $(DROPFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
9696
9797 fastm36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fm.o scaleswts.o last_tat.o tatstats_fm.o karlin.o $(DROPFM_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o
98 $(CC) $(HFLAGS) $(BIN)/fastm36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fm.o $(DROPFM_O) scaleswts.o last_tat.o tatstats_fm.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
98 $(CC) $(HFLAGS) $(BIN)/fastm36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fm.o $(DROPFM_O) scaleswts.o last_tat.o tatstats_fm.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
9999
100100 fastx36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_fx.o faatran.o scale_se.o karlin.o drop_fx.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o
101 $(CC) $(HFLAGS) $(BIN)/fastx36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fx.o drop_fx.o faatran.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS)
101 $(CC) $(HFLAGS) $(BIN)/fastx36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fx.o drop_fx.o faatran.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
102102
103103 fasty36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_fy.o faatran.o scale_se.o karlin.o drop_fz.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o
104 $(CC) $(HFLAGS) $(BIN)/fasty36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fy.o drop_fz.o faatran.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS)
104 $(CC) $(HFLAGS) $(BIN)/fasty36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fy.o drop_fz.o faatran.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
105105
106106 tfasta36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_tfa.o scale_se.o karlin.o $(DROPTFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o
107 $(CC) $(HFLAGS) $(BIN)/tfasta36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o $(DROPTFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS)
107 $(CC) $(HFLAGS) $(BIN)/tfasta36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o $(DROPTFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
108108
109109 tfastf36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_tf.o scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(DROPTFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o
110 $(CC) $(HFLAGS) $(BIN)/tfastf36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
110 $(CC) $(HFLAGS) $(BIN)/tfastf36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
111111
112112 tfasts36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_tfs.o scaleswts.o last_tat.o tatstats_fs.o karlin.o $(DROPTFS_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o
113 $(CC) $(HFLAGS) $(BIN)/tfasts36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfs.o $(DROPTFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
113 $(CC) $(HFLAGS) $(BIN)/tfasts36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfs.o $(DROPTFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
114114
115115 tfastm36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o scaleswts.o tatstats_fm.o last_tat.o karlin.o $(DROPTFM_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o
116 $(CC) $(HFLAGS) $(BIN)/tfastm36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o $(DROPTFM_O) scaleswts.o tatstats_fm.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS)
116 $(CC) $(HFLAGS) $(BIN)/tfastm36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o $(DROPTFM_O) scaleswts.o tatstats_fm.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
117117
118118 tfastx36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfx.o scale_se.o karlin.o drop_tfx.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o
119 $(CC) $(HFLAGS) $(BIN)/tfastx36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfx.o drop_tfx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS)
119 $(CC) $(HFLAGS) $(BIN)/tfastx36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfx.o drop_tfx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
120120
121121 tfasty36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfy.o scale_se.o karlin.o drop_tfz.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o
122 $(CC) $(HFLAGS) $(BIN)/tfasty36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfy.o drop_tfz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS)
122 $(CC) $(HFLAGS) $(BIN)/tfasty36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfy.o drop_tfz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS)
123123
124124 comp_mlib4.o : comp_lib4.c mw.h structs.h defs.h param.h
125125 $(CC) $(THR_CC) $(CFLAGS) -DCOMP_MLIB -c comp_lib4.c -o comp_mlib4.o
167167 $(CC) $(THR_CC) $(CFLAGS) -c work_thr2.c
168168
169169 print_pssm : print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c
170 $(CC) -o print_pssm $(CFLAGS) print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c $(LIB_M)
170 $(CC) -o print_pssm $(CFLAGS) print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c $(LIB_M) $(LIB_DB)
171171
172172 map_db : map_db.c uascii.h ncbl2_head.h
173173 $(CC) $(CFLAGS) -o $(BIN)/map_db map_db.c
3333 # and "-L/usr/lib64/mysql -lmysqlclient -lz" in LIB_M
3434 # some systems may also require a LD_LIBRARY_PATH change
3535
36 LIB_M= -lm -lz
37 #LIB_M= -L/usr/lib64/mysql -lmysqlclient -lz -lm
36 LIB_M= -lm
37 #LIB_M= -L/usr/lib64/mysql -lmysqlclient -lm # -lz
3838 NCBL_LIB=ncbl2_mlib.o
3939 #NCBL_LIB=ncbl2_mlib.o mysql_lib.o
4040
0 #!/usr/bin/env perl
1
2 ################################################################
3 # copyright (c) 2014,2015 by William R. Pearson and The Rector &
4 # Visitors of the University of Virginia */
5 ################################################################
6 # Licensed under the Apache License, Version 2.0 (the "License");
7 # you may not use this file except in compliance with the License.
8 # You may obtain a copy of the License at
9 #
10 # http://www.apache.org/licenses/LICENSE-2.0
11 #
12 # Unless required by applicable law or agreed to in writing,
13 # software distributed under this License is distributed on an "AS
14 # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
15 # express or implied. See the License for the specific language
16 # governing permissions and limitations under the License.
17 ################################################################
18
19 ################################################################
20 # clustal2fasta.pl
21 ################################################################
22 # clustal2fasta.pl takes a standard clustal format alignment file
23 # and produces the corresponding FASTA file.
24 #
25 ################################################################
26
27 use warnings;
28 use strict;
29 use Pod::Usage;
30 use Getopt::Long;
31
32 my ($shelp, $help) = (0, 0);
33
34 GetOptions(
35 "h|?" => \$shelp,
36 "help" => \$help,
37 );
38
39 pod2usage(1) if $shelp;
40 pod2usage(exitstatus => 0, verbose => 2) if $help;
41 unless (-f STDIN || -p STDIN || @ARGV) {
42 pod2usage(1);
43 }
44
45 my @seq_ids = ();
46 my %msa = ();
47
48 # read and discard the title line; the first line should not be blank
49 my $title = <>;
50
51 while (my $line = <>) {
52 chomp $line;
53 next unless ($line);
54 next if ($line =~ m/^[\s:\*\+\.]+$/); # skip conservation line
55
56 my ($seq_id, $align) = split(/\s+/,$line);
57
58 if (defined($msa{$seq_id})) {
59 $msa{$seq_id} .= $align;
60 }
61 else {
62 $msa{$seq_id} = $align;
63 push @seq_ids, $seq_id;
64 }
65 }
66
67 for my $seq_id ( @seq_ids ) {
68 my $fmt_seq = $msa{$seq_id};
69 $fmt_seq =~ s/(.{0,60})/$1\n/g;
70 print ">$seq_id\n$fmt_seq";
71 }
72
73 __END__
74
75 =pod
76
77 =head1 NAME
78
79 clustal2fasta.pl
80
81 =head1 SYNOPSIS
82
83 clustal2fasta.pl clustal.msa
84
85 =head1 OPTIONS
86
87 -h short help
88 --help include description
89
90
91 =head1 DESCRIPTION
92
93 C<clustal2fasta.pl> takes a Clustal format interleaved multiple
94 sequence alignment and produces the corresponding fasta format library.
95
96 =head1 AUTHOR
97
98 William R. Pearson, wrp@virginia.edu
99
100 =cut
0 #!/usr/bin/env python
1
2 ################################################################
3 # copyright (c) 2014,2015 by William R. Pearson and The Rector &
4 # Visitors of the University of Virginia */
5 ################################################################
6 # Licensed under the Apache License, Version 2.0 (the "License");
7 # you may not use this file except in compliance with the License.
8 # You may obtain a copy of the License at
9 #
10 # http://www.apache.org/licenses/LICENSE-2.0
11 #
12 # Unless required by applicable law or agreed to in writing,
13 # software distributed under this License is distributed on an "AS
14 # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
15 # express or implied. See the License for the specific language
16 # governing permissions and limitations under the License.
17 ################################################################
18
19 ################################################################
20 # clustal2fasta.py
21 ################################################################
22 # clustal2fasta.py takes a standard clustal format alignment file
23 # and produces the corresponding FASTA file.
24 #
25 # if --end_mask or --int_mask are set, then end or internal '-'s are converted to the query (first) sequence
26 # if --trim is set, then alignments beyond the beginning/end of the query sequence are trimmed
27 #
28 ################################################################
29
30 import argparse
31 import fileinput
32 import re
33
34 ################
35 #
36 # python re-write of clustal2fasta.pl
37 #
38 # in the future, modify for various query seeding strategies
39 ################
40
41 arg_parse = argparse.ArgumentParser(description='Convert clustal MSA to FASTA library')
42 arg_parse.add_argument('--query', '--query_file', dest='query_file', action='store',help='query sequence file')
43 arg_parse.add_argument('files', metavar='FILE', nargs='*', help='files to read, if empty, stdin is used')
44 args=arg_parse.parse_args()
45
46 msa = {}
47 seq_ids = []
48
49 is_line1 = True
50 for line in fileinput.input(args.files):
51     if is_line1:
52         is_line1 = False
53         continue
54     line = line.strip()
55     if not line:
56         continue
57     if re.search(r'^[\s:\*\+\.]+$',line):
58         continue
59
60     (seq_id, align) = re.split(r'\s+',line)[:2]
61
62     if seq_id in msa:
63         msa[seq_id] += align
64     else:
65         msa[seq_id] = align
66         seq_ids.append(seq_id)
67
68 for seq_id in seq_ids:
69     fmt_seq = re.sub(r'(.{0,60})',r'\1\n',msa[seq_id])
70     print(">%s\n%s" % (seq_id, fmt_seq))
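The Clustal-to-FASTA conversion performed by the two scripts above can be sketched as a single Python 3 function. This is a minimal illustration, not code from the FASTA package: the name `clustal_to_fasta` and its `width` parameter are hypothetical, and the sketch assumes interleaved Clustal input in which each alignment line begins with an identifier followed by one aligned block. It wraps sequences by slicing rather than by regular-expression substitution, so it does not emit an empty line after each record.

```python
import re


def clustal_to_fasta(lines, width=60):
    """Convert interleaved Clustal lines to FASTA text (illustrative sketch)."""
    msa = {}       # seq_id -> concatenated aligned sequence
    seq_ids = []   # identifiers in order of first appearance
    for n, line in enumerate(lines):
        if n == 0:
            continue                      # skip the title line
        line = line.strip()
        if not line or re.search(r'^[\s:*+.]+$', line):
            continue                      # skip blank and conservation lines
        fields = line.split()
        if len(fields) < 2:
            continue                      # ignore lines without an aligned block
        seq_id, align = fields[0], fields[1]
        if seq_id not in msa:
            msa[seq_id] = ''
            seq_ids.append(seq_id)
        msa[seq_id] += align
    out = []
    for seq_id in seq_ids:
        seq = msa[seq_id]
        out.append('>' + seq_id)
        # wrap the aligned sequence at `width` residues per line
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return '\n'.join(out) + '\n'
```

Keeping the identifiers in a separate list preserves the input order of the sequences, which matters when the first sequence is treated as the query.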
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014,2015 by William R. Pearson and The Rector &
3737 #
3838 ################################################################
3939
40 use warnings;
4041 use strict;
4142 use Pod::Usage;
4243 use Getopt::Long;
5051
5152 my ($shelp, $help, $m_format, $evalue, $qvalue, $domain_bound) = (0, 0, "m8CB", 0.001, 30.0,0);
5253 my ($query_file, $sel_file, $bound_file_in, $bound_file_only, $bound_file_out, $masked_lib_out,$mask_type_end, $mask_type_int) = ("","","","","","","","");
54 my ($clustal_id,$trunc_acc,$min_align) = (0,0,0.0);
5355 my $query_lib_r = 0;
5456 my ($eval2_fmt, $eval2) = (0,"");
5557
5759 "query=s" => \$query_file,
5860 "query_file=s" => \$query_file,
5961 "eval2=s" => \$eval2, # change the evalue used for inclusion
60 "evalue=f" => \$evalue,
61 "expect=f" => \$evalue,
62 "evalue|expect=f" => \$evalue,
6263 "qvalue=f" => \$qvalue,
6364 "format=s" => \$m_format,
64 "selected_file_in=s" => \$sel_file,
65 "sel_file_in=s" => \$sel_file,
66 "sel_file=s" => \$sel_file,
67 "m_format=s" => \$m_format,
68 "mformat=s" => \$m_format,
65 "clustal!" => \$clustal_id,
66 "trunc_acc!" => \$trunc_acc,
67 "selected_file_in|sel_file_in|sel_accs=s" => \$sel_file,
68 "m_format|mformat=s" => \$m_format,
6969 "bound_file_in=s" => \$bound_file_in,
7070 "bound_file_only=s" => \$bound_file_only,
7171 "bound_file_out=s" => \$bound_file_out,
8282 "domain" => \$domain_bound,
8383 "int_mask_type=s" => \$mask_type_int,
8484 "int_mask=s" => \$mask_type_int,
85 "min_align=f" => \$min_align,
8586 "h|?" => \$shelp,
8687 "help" => \$help,
8788 );
214215 $q_acc = $query_descr;
215216 }
216217
217 $acc_names{$q_acc} = 1; # this is necessary for the new acc-only NCBI SwissProt libraries
218 $acc_names{$q_acc} = $q_acc; # this is necessary for the new acc-only NCBI SwissProt libraries
218219
219220 $q_acc =~ s/\.\d+$//;
220221
227228 my $annot_f='NULL';
228229
229230 if ($m_format =~ m/^m9/i) {
230 last if $line =~ m/>>>/;
231 last if $line =~ m/>>>/ || $line =~ m/^<\/pre>/;
231232 next if $line =~ m/^\+\-/; # skip over HSPs
232233 my ($left, $right, $align_f) = ("","",'NULL');
233234 ($left, $right, $align_f, $annot_f) = split(/\t/,$line);
235236 $align_f= 'NULL' unless $align_f;
236237 $annot_f= 'NULL' unless $annot_f;
237238
239 if ($left =~ m/<font/) {
240 $left =~ s/<font color="darkred">//;
241 $left =~ s/<\/font>//;
242 }
243
238244 my @fields = split(/\s+/,$left);
239 my ($ldb, $l_id, $l_acc) = ("","","");
240 if ($fields[0] =~ m/:/) {
241 ($ldb, $l_id) = split(/:/,$fields[0]);
242 ($l_acc) = $fields[1];
243 } else {
244 ($ldb, $l_acc,$l_id) = split(/\|/,$fields[0]);
245 }
245 $subj_acc = $s_seqid = $fields[0];
246
247 # my ($ldb, $l_id, $l_acc) = ("","","");
248 # if ($fields[0] =~ m/:/) {
249 # ($ldb, $l_id) = split(/:/,$fields[0]);
250 # ($l_acc) = $fields[1];
251 # } else {
252 # ($ldb, $l_acc,$l_id) = split(/\|/,$fields[0]);
253 # }
246254
247255 @hit_data{@m9_field_names} = split(/\s+/,$right);
256
248257 if ($eval2_fmt) {
249258 @hit_data{qw(bits evalue eval2)} = @fields[-3, -2,-1];
250259 }
255264 #
256265 # currently preselbdr files have $ldb|$l_acc, not full s_seqid, so construct it
257266 #
258 ($s_seqid, $subj_acc) = (join('|',($ldb, $l_acc, $l_id)), "$ldb|$l_acc");
267 # ($s_seqid, $subj_acc) = (join('|',($ldb, $l_acc, $l_id)), "$ldb|$l_acc");
259268 @hit_data{qw(s_seqid subj_acc)} = ($s_seqid, $subj_acc);
260269 @hit_data{qw(query_id query_acc)} = ($query_descr, $q_acc);
261270 $hit_data{BTOP} = $align_f;
265274 last if $line =~ m/^#/;
266275 @hit_data{@m8_field_names} = split(/\t/,$line);
267276 $subj_acc = $hit_data{'s_seqid'};
268 $subj_acc =~ s/^gi\|\d+\|(\w+\|\w+)\|?\w+/$1/;
277 # remove gi number
278 if ($subj_acc =~ m/^gi\|\d+\|/) {
279 $subj_acc =~ s/^gi\|\d+\|//;
280 }
269281 }
270282
271283 if ($have_sel_accs) {
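The gi-number removal introduced in this hunk can be sketched in Python as follows. The function name `strip_gi` is hypothetical and used only for illustration. Both pipe characters are escaped in the pattern, because an unescaped `|` is regular-expression alternation and would match far more than a leading `gi|<number>|` prefix.

```python
import re


def strip_gi(seqid):
    """Remove a leading 'gi|<digits>|' prefix, if present (illustrative sketch)."""
    return re.sub(r'^gi\|\d+\|', '', seqid)
```

Identifiers without a gi prefix pass through unchanged, so the function can be applied to every subject identifier without a prior test.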
284296 # $s_seqid_u .= "_". $acc_names{$subj_acc};
285297 }
286298 else {
299 my $tr_acc = $hit_data{'s_seqid'};
287300 $acc_names{$hit_data{'s_seqid'}} = 1;
288301 }
289302
290303 # must be after duplicate seqid check because blast HSP's have bad E-values after good.
291304 next if ($eval_fptr->(\%hit_data) > $evalue);
292305
306 next if (($hit_data{q_end}-$hit_data{q_start}+1)/$query_len < $min_align);
307
293308 $hit_data{s_seqid_u} = $s_seqid_u;
294
295 if (length($s_seqid_u) > $max_sseqid_len) {
296 $max_sseqid_len = length($s_seqid_u);
297 }
298309
299310 my $have_dom = 0;
300311 if ($domain_bound && $hit_data{annot}) {
369380 }
370381 }
371382
383 $max_sseqid_len = 10;
384 for my $acc ( @multi_names) {
385 my $this_len = length($acc);
386 if ($trunc_acc && ($acc=~m/\|\w+\|(\w+)$/)) {
387 $this_len = length($1);
388 }
389 if ($this_len > $max_sseqid_len) {
390 $max_sseqid_len = $this_len;
391 }
392 }
393
372394 # final MSA output
373395 $max_sseqid_len += 2;
374396
375 printf "BTOP%s multiple sequence alignment\n\n\n",$m_format;
397 if (! $clustal_id) {
398 printf "BTOP%s multiple sequence alignment\n\n\n",$m_format;
399 }
400 else {
401 print "CLUSTALW (1.8) multiple sequence alignment\n\n\n";
402 }
376403
377404 my $i_pos = 0;
378405 for (my $j = 0; $j < $query_len/60; $j++) {
380407 if ($i_end >= $query_len) {$i_end = $query_len-1;}
381408 for my $acc (@multi_names) {
382409 next unless $acc;
383 printf("%-".$max_sseqid_len."s %s\n",$acc,join("",@{$multi_align{$acc}}[$i_pos .. $i_end]));
410
411 my $this_acc = $acc;
412 if ($trunc_acc && ($acc=~m/\|\w+\|(\w+)$/)) {
413 $this_acc = $1;
414 }
415 printf("%-".$max_sseqid_len."s %s\n",$this_acc,join("",@{$multi_align{$acc}}[$i_pos .. $i_end]));
384416 }
385417 $i_pos += 60;
386418 print "\n\n";
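The label handling added for `--trunc_acc` can be sketched in Python. The function names `display_label` and `label_width` are hypothetical. The pattern and the starting width of 10 plus 2 columns of padding are taken from the hunks above, and the example identifier is the one used in the option's help text.

```python
import re


def display_label(acc, trunc_acc=False):
    """Return the identifier part of db|acc|ident when trunc_acc is set."""
    if trunc_acc:
        m = re.search(r'\|\w+\|(\w+)$', acc)
        if m:
            return m.group(1)
    return acc


def label_width(accs, trunc_acc=False, min_width=10, pad=2):
    """Column width for the label field: longest displayed label, plus padding."""
    longest = max([min_width] + [len(display_label(a, trunc_acc)) for a in accs])
    return longest + pad
```

Computing the width from the displayed label, rather than from the full identifier, keeps the alignment columns from being padded for text that is never printed.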
752784 my ($q_num, $query_desc, $q_start, $q_stop, $q_len, $l_num, $l_len, $best_yes);
753785
754786 while (my $line = <>) {
755 if ($line =~ m/^\s*(\d+)>>>(\S+)\s.+ \- (\d+) aa$/) {
787 if ($line =~ m/^\s*(\d+)>>>(\S+)\s.*\- (\d+) aa$/) {
756788 ($q_num,$query_desc, $q_len) = ($1,$2,$3);
757789 # ($q_len) = ($line =~ m/(\d+) aa$/);
758790 $line = <>; # skip Library:
890922 --query -- same as --query_file
891923 (only one sequence per file)
892924
925 --expect|evalue: 0.001 -- maximum e-value to be included in output
926
893927 --eval2 : "": use E()-value, "eval2": use E2()/eval2, "ave": use geom. mean
928
929 --qvalue: 30.0 -- minimum qvalue for domain to be considered
894930
895931 --bound_file_in -- tab delimited accession<tab>start<tab>end that
896932 specifies MSA boundaries WITHIN alignment.
903939
904940 --bound_file_out -- "--bound_file" for next iteration of psisearch2
905941
942 --clustal -- use "CLUSTALW (1.8)" multiple alignment string
943
944 --trunc_acc -- remove db, acc from db|acc|ident, e.g. sp|P0948|GSTM1_HUMAN becomes GSTM1_HUMAN
945
906946 --domain_bound parse domain annotations (-V) from m9B file
907947 --domain
908948
909949 --masked_lib_out -- FASTA format library of MSA sequences
950
951 --min_align:0.0 -- minimum fractional alignment (q_end-q_start+1)/q_len
910952
911953 --int_mask_type = "query", "rand", "X", "none"
912954 --end_mask_type = "query", "rand", "X", "none"
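The `--min_align` test described above, the fraction `(q_end-q_start+1)/q_len` compared against a threshold, can be sketched in Python. The function name `passes_min_align` is hypothetical, and coordinates are assumed to be 1-based and inclusive, as the `+1` in the formula suggests.

```python
def passes_min_align(q_start, q_end, query_len, min_align=0.0):
    """True when the hit covers at least min_align of the query (sketch)."""
    covered = (q_end - q_start + 1) / float(query_len)
    return covered >= min_align
```

With the default of 0.0 every hit passes, which matches the default given in the option list.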
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2016 by William R. Pearson and The Rector &
1616 # governing permissions and limitations under the License.
1717 ################################################################
1818
19 use warnings;
1920 use strict;
2021 use Getopt::Long;
2122 use Pod::Usage;
3233 ################
3334 #
3435 # command:
35 # psisearch2_msa.pl --query query.file --db database.file --num_iter N --pssm_evalue 0.002 --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --out_suffix none --pgm ssearch/psiblast --prev_m89res prev_results.itx.m8CB.file --sel_res selected_accs.file --prev_bounds boundary.file
36 # psisearch2_msa.pl --query query.file --in_msa msa.file --db database.file --num_iter N --pssm_evalue 0.002 --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --out_suffix none --pgm ssearch/psiblast --prev_m89res prev_results.itx.m8CB.file --sel_res selected_accs.file --prev_bounds boundary.file
3637 #
3738 ################
3839
5354 my $makeblastdb_bin = "$pgm_bin/makeblastdb";
5455 my $datatool_bin = "$pgm_bin/datatool -m $pgm_data/NCBI_all.asn";
5556 my $align2msa_lib = "$pgm_bin/m89_btop_msa2.pl";
57 my $clustal2fasta = "$pgm_bin/clustal2fasta.pl";
5658
5759 my %srch_subs = ('ssearch' => \&get_ssearch_cmd,
5860 'psiblast' => \&get_psiblast_cmd,
6062
6163 my %annot_cmds = ('rpd3' => qq("\!ann_pfam28.pl --pfacc --db RPD3 --vdoms --split_over"),
6264 'rpd3nv' => qq("\!ann_pfam28.pl --pfacc --db RPD3 --split_over"),
63 'rpd3nvn' => qq("\!ann_pfam28.pl --pfacc --db RPD3 --split_over --neg"),
64 'pfam' => qq("\!ann_pfam30.pl --vdoms --split_over --neg")
65 'rpd3nvn' => qq("\!./annot/ann_pfam28.pl --pfacc --db RPD3 --split_over --neg"),
66 'pfam' => qq("\!./annot/ann_pfam30.pl --db pfam31_qfo --vdoms --split_over --neg")
6567 );
6668
6769 ($num_iter, $pssm_evalue, $srch_evalue, $dom_flag, $align_flag, $int_mask, $end_mask, $query_mask, $srch_pgm, $tmp_dir, $error_log, $annot_type, $quiet) =
6870 ( 5, 0.002, 5.0, 0, 0, 'none', 'none', 0, 'ssearch','',0, 0, "", 0);
6971 ($save_all, $tmp_file_list, $delete_bnd, $delete_tmp) = (0, "", 0, 0);
70 ($prev_m89res, $m_format, $prev_sel_res, $prev_bound, $this_iter, $use_stdout) = ("","", "","", 1, 0);
72 ($prev_m89res, $m_format, $prev_sel_res, $prev_bound, $this_iter, $use_stdout) = ("","m8CB", "","", 1, 0);
7173
7274 my $pgm_command = "# ".join(" ",($0,@ARGV));
7375 print STDERR "# ",join(" ",($0,@ARGV)),"\n" if ($error_log);
8991 'sel_accs=s' => \$prev_sel_res,
9092 'sel_file=s' => \$prev_sel_res,
9193 'sel_file_in=s' => \$prev_sel_res,
92 # 'in_msa=s' => \$prev_msa,
94 'in_msa=s' => \$prev_msa,
9395 # 'out_msa=s' => \$next_msa,
9496 # 'in_hitdb=s' => \$prev_hitdb,
9597 # 'out_hitdb=s' => \$next_hitdb,
183185
184186 my @del_err_files = ();
185187
186 unless ($prev_m89res) {
188 unless ($prev_m89res || $prev_msa) {
187189 $search = $srch_subs{$srch_pgm}($query_file, $db_file, $prev_pssm);
188190 unless ($use_stdout) {
189191 log_system("$search > $this_file_out 2> $this_file_out.err");
194196 push @del_err_files, "$this_file_out.err";
195197 $first_iter++;
196198 }
197 else {
199 elsif ($prev_m89res) {
198200 $this_file_out = $prev_m89res;
201 }
202 elsif ($prev_msa) {
203 # build a PSSM, do a search, up the iteration count
204 $prev_pssm = pssm_from_msa($query_file, $prev_msa);
205 $search = $srch_subs{$srch_pgm}($query_file, $db_file, $prev_pssm);
206 unless ($use_stdout) {
207 log_system("$search > $this_file_out 2> $this_file_out.err");
208 }
209 else {
210 log_system("$search 2> $this_file_out.err");
211 }
212 push @del_err_files, "$this_file_out.err";
213 $first_iter++;
199214 }
200215
201216 my ($this_pssm, $this_bound_out) = ("","");
264279
265280 my ($cmd) = @_;
266281
267 print STDERR "$cmd\n" if $error_log;
282 print STDERR "# $cmd\n" if $error_log;
268283 system($cmd);
269284 }
270285
275290 sub get_ssearch_cmd {
276291 my ($query_file, $db_file, $pssm_file) = @_;
277292
278 my $search_cmd = qq($ssearch_bin -S -m 6 -m 9B -E "$srch_evalue 0" -s BP62);
293 my $mf_arg = $m_format;
294 $mf_arg =~ s/^m//;
295 $mf_arg =~ s/\+/ /;
296
297 my $search_cmd = qq($ssearch_bin -S -E "$srch_evalue 0" -s BP62 -m $mf_arg);
298
279299 if ($annot_type) {
280300 $search_cmd .= qq( -V $annot_cmds{$annot_type});
281301 }
383403 }
384404 else {
385405 return ($this_pssm_asntxt, $this_bound_out);
406 }
407 }
408
409 ################
410 # pssm_from_msa()
411 #
412 # given query, --in_msa Clustal MSA
413 # use psiblast to generate PSSM in .asntxt or .asnbin format
414 # (later - optionally deletes intermediate files)
415 #
416 # always produce a $bound_file_out file to test for convergence
417 #
418 sub pssm_from_msa {
419 my ($query_file, $msa_file) = @_;
420
421 my $this_file_out = $query_file;
422
423 my ($this_hit_db, $this_pssm_asntxt, $this_pssm_asnbin, $this_psibl_out, $this_bound_out) =
424 ("$this_file_out.hit_db",
425 "$this_file_out.asntxt",
426 "$this_file_out.asnbin",
427 "$this_file_out.psibl_out",
428 "$this_file_out.bnd_out",
429 );
430
431 my $blastdb_err = "$this_file_out.mkbldb_err";
432 ## should not need this, but may need to convert in_msa file to fasta file for equivalence to build_msa_pssm()
433 my $clus2fa_cmd = qq($clustal2fasta $msa_file > $this_hit_db);
434
435 log_system($clus2fa_cmd);
436
437 my $makeblastdb_cmd = "$makeblastdb_bin -in $this_hit_db -dbtype prot -parse_seqids > $blastdb_err";
438 log_system($makeblastdb_cmd);
439
440 my $buildpssm_cmd = "$psiblast_bin -max_target_seqs 5000 -outfmt 7 -inclusion_ethresh 100.0 -in_msa $msa_file -db $this_hit_db -out_pssm $this_pssm_asntxt -num_iterations 1 -save_pssm_after_last_round";
441
442 log_system("$buildpssm_cmd > $this_psibl_out 2> $this_psibl_out.err");
443
444 log_system("rm $this_hit_db.p* $blastdb_err");
445
446 # remove uninformative error logs
447 log_system("rm $this_psibl_out.err") unless $error_log;
448
449 unless ($srch_pgm eq 'psiblast') {
450 my $asn2asn_cmd = "$datatool_bin -v $this_pssm_asntxt -e $this_pssm_asnbin";
451 log_system($asn2asn_cmd);
452 return ($this_pssm_asnbin);
453 }
454 else {
455 return ($this_pssm_asntxt);
386456 }
387457 }
388458
0 #!/usr/bin/python
0 #!/usr/bin/env python
11
22 ################################################################
33 # copyright (c) 2016 by William R. Pearson and The Rector &
3333 ################
3434 #
3535 # command:
36 # psisearch2_msa.py --query query_file --db database --num_iter N --evalue 0.002 --no_msa --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --suffix M8CB --pgm ssearch/psiblast --prev_m89res pre_iter.out --this_iter # --num_iter #
36 # psisearch2_msa.py --query query_file --db database --num_iter N --pssm_evalue 0.002 --no_msa --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --suffix M8CB --pgm ssearch/psiblast --prev_m89res pre_iter.out --this_iter # --num_iter #
3737 #
3838 ################
3939
5151 makeblastdb_bin = pgm_bin+"/makeblastdb"
5252 datatool_bin = "%s/datatool -m %s/NCBI_all.asn" % (pgm_bin,pgm_data)
5353 align2msa_lib = "m89_btop_msa2.pl"
54 clustal2fasta = "clustal2fasta.py"
5455
5556 annot_cmds = {'rpd3': '"!../scripts/ann_pfam28.pl --pfacc --db RPD3 --vdoms --split_over"',
5657 'rpd3nv':'"!../scripts/ann_pfam28.pl --pfacc --db RPD3 --split_over"',
5758 'pfam':'"!../scripts/ann_pfam30.pl --pfacc --vdoms --split_over"'}
5859
5960 num_iter = 5
60 evalue = 0.002
6161 srch_pgm = 'ssearch'
62 error_log = 0
6362 rm_flag = 0
6463 quiet = 0
6564
6665 ################
6766 # log_system()
68 # run system on string, logging first if error_log
67 # run system on string, logging first if args.error_log
6968 #
7069 def log_system (cmd, error_log):
7170
7978 # sub get_ssearch_cmd()
8079 # builds an ssearch command line with query, db, and pssm
8180 #
82 def get_ssearch_cmd(query_file, db_file, pssm_file) :
83
84 search_cmd = '%s -S -m 8CB -d 0 -E "1.0 0" -s BP62' % (ssearch_bin)
81 def get_ssearch_cmd(query_file, db_file, pssm_file, args) :
82
83 search_cmd = '%s -S -m 8CB -d 0 -E "%f 0" -s BP62' % (ssearch_bin, args.srch_evalue)
8584
8685 if (args.annot_type) :
8786 search_cmd += " -V %s" % (annot_cmds[args.annot_type])
9897 # sub get_psiblast_cmd()
9998 # builds a psiblast command line with query, db, and pssm
10099 #
101 def get_psiblast_cmd(query_file, db_file, pssm_file) :
102
103 search_cmd = "%s -num_threads 4 -max_target_seqs 5000 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' -inclusion_ethresh %f -num_iterations 1 -db %s" % (psiblast_bin, args.evalue, db_file)
100 def get_psiblast_cmd(query_file, db_file, pssm_file, args) :
101
102 search_cmd = "%s -num_threads 4 -max_target_seqs 5000 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' -inclusion_ethresh %f -evalue %f -num_iterations 1 -db %s" % (psiblast_bin, args.pssm_evalue, args.srch_evalue, db_file)
104103
105104 if (pssm_file) :
106105 search_cmd += " -in_pssm %s" % (pssm_file)
119118 #
120119 # always produce a bound_file_out file to test for convergence
121120 #
122 def build_msa_pssm(query_file, this_file_out,prev_bound_in, prev_sel_res, args, error_log) :
121 def build_msa_pssm(query_file, this_file_out,prev_bound_in, prev_sel_res, error_log) :
123122
124123 (this_msa, this_hit_db, this_pssm_asntxt, this_pssm_asnbin, this_psibl_out, this_bound_out) = (this_file_out+".msa",this_file_out+".hit_db",this_file_out+".asntxt",this_file_out+".asnbin",this_file_out+".psibl_out",this_file_out+".bnd_out")
125124
129128 if (prev_sel_res) :
130129 aln2msa_cmd += " --sel_res %s" % (prev_sel_res)
131130 else:
132 aln2msa_cmd += " --evalue %f" % (args.evalue)
131 aln2msa_cmd += " --evalue %f" % (args.pssm_evalue)
133132
134133 if (args.int_mask) :
135134 aln2msa_cmd += " --int_mask_type %s" % (args.int_mask)
141140 aln2msa_cmd += " --domain"
142141
143142 if (args.align_flag and args.prev_bound_in) :
144 aln2msa_cmd += " --bound_file_in %s" %(args.prev_bound_in)
143 aln2msa_cmd += " --bound_file_in %s" %(args.prev_bound_in)
144
145 if (args.m_format):
146 aln2msa_cmd += " --m_format %s" % (args.m_format)
145147
146148 # always produce this file to check for convergence
147149 aln2msa_cmd += " --bound_file_out %s" % (this_bound_out)
170172 return (this_pssm_asntxt, this_bound_out)
171173
172174 ################
175 # sub pssm_from_msa
176 # read multiple sequence alignment, produce pssm file
177 #
178 def pssm_from_msa(query_file, msa_file, error_log):
179
180 this_file_out = query_file
181
182 this_hit_db = this_file_out+".hit_db"
183 this_pssm_asntxt = this_file_out+".asntxt"
184 this_pssm_asnbin = this_file_out+".asnbin"
185 this_psibl_out = this_file_out+".psibl_out"
186 this_bound_out = this_file_out+".bnd_out"
187
188 blastdb_err = this_file_out + ".mkbldb_err"
189
190 clus2fa_cmd = "%s %s > %s" % (clustal2fasta, msa_file, this_hit_db)
191
192 log_system(clus2fa_cmd, error_log);
193
194 makeblastdb_cmd = "%s -in %s -dbtype prot -parse_seqids > %s" % (makeblastdb_bin, this_hit_db, blastdb_err);
195 log_system(makeblastdb_cmd, error_log);
196
197 buildpssm_cmd = "%s -max_target_seqs 5000 -outfmt 7 -inclusion_ethresh 100.0 -in_msa %s -db %s -out_pssm %s -num_iterations 1 -save_pssm_after_last_round" % (psiblast_bin, msa_file, this_hit_db, this_pssm_asntxt)
198
199 log_system("%s > %s 2> %s.err" % (buildpssm_cmd, this_psibl_out, this_psibl_out), error_log)
200
201 log_system("rm %s.p* %s" % (this_hit_db,blastdb_err), error_log)
202
203 # remove uninformative error logs
204 if (not error_log):
205 log_system("rm %s.err" % (this_psibl_out), error_log)
206
207 if (srch_pgm != 'psiblast'):
208 asn2asn_cmd = "%s -v %s -e %s" % (datatool_bin, this_pssm_asntxt, this_pssm_asnbin)
209 log_system(asn2asn_cmd, error_log);
210 return this_pssm_asnbin
211 else:
212 return this_pssm_asntxt
213
214 ################
173215 # sub has_converged()
174216 # reads two boundary files and compares accessions
175217 #
210252
211253 srch_subs = {'ssearch' : get_ssearch_cmd,
212254 'psiblast': get_psiblast_cmd}
213
214 pgm_command = "# "+" ".join(sys.argv);
215 if (error_log) :
216 sys.stderr.write('pgm_command\n')
217255
218256 arg_parse = argparse.ArgumentParser(description='Iterative search with SSEARCH/PSIBLAST')
219257 arg_parse.add_argument('--query', dest='query_file', action='store',help='query sequence file')
221259 arg_parse.add_argument('--db', dest='db_file', action='store',help='sequence database name')
222260 arg_parse.add_argument('--database', dest='db_file', action='store',help='sequence database name')
223261 arg_parse.add_argument('--dir', dest='tmp_dir', action='store',help='directory for result and tmp_file output')
224 arg_parse.add_argument('--evalue', dest='evalue', default=0.002, type=float, action='store',help='E()-value threshold for inclusion in PSSM')
262 arg_parse.add_argument('--pssm_evalue', dest='pssm_evalue', default=0.002, type=float, action='store',help='E()-value threshold for inclusion in PSSM')
263 arg_parse.add_argument('--search_evalue', dest='srch_evalue', default=5.0, type=float, action='store',help='E()-value threshold for search display')
264 arg_parse.add_argument('--m_format', dest='m_format', action='store',help='input result format m8 [def] or m9')
225265 arg_parse.add_argument('--annot_db', dest='annot_type', action='store',help='source of domain annotations')
226266 arg_parse.add_argument('--suffix', dest='suffix', action='store',help='suffix for result output')
227267 arg_parse.add_argument('--out_name', dest='file_out', action='store',help='result file name')
233273 arg_parse.add_argument('--pgm', dest='srch_pgm', action='store',default='ssearch',help='search program: ssearch/psiblast')
234274 arg_parse.add_argument('--query_seed', dest='query_mask', action='store_true',help='use query seeding')
235275 arg_parse.add_argument('--prev_m89res', dest='prev_m89res', action='store', help='previous iteration result')
276 arg_parse.add_argument('--prev_msa', dest='prev_msa', action='store', help='previous MSA')
236277 arg_parse.add_argument('--sel_res', dest='prev_sel_res', action='store', help='selected accession file')
237278 arg_parse.add_argument('--this_iter', dest='this_iter', help='this iteration number',type=int)
238279 arg_parse.add_argument('--int_seed', dest='int_mask', action='store',default='none',help='sequence masking: none/query/random')
243284 arg_parse.add_argument('--save_all', dest='save_all', action='store_true',help='save all temporary files')
244285 arg_parse.add_argument('--delete_all', dest='delete_tmp', action='store_true',help='delete all temporary files')
245286 arg_parse.add_argument('--delete_bnd', dest='delete_bnd', action='store_true',help='delete boundary temporary file')
287 arg_parse.add_argument('--use_stdout', dest='use_stdout', action='store_true',help='send results to stdout',default=False)
288 arg_parse.add_argument('--errors', dest='error_log', action='store_true', help='log errors', default=False)
246289 arg_parse.add_argument('--quiet', dest='quiet', action='store_true',help='fewer messages')
247290 arg_parse.add_argument('-Q', dest='quiet', action='store_true',help='fewer messages')
248291
249292 args = arg_parse.parse_args()
293
294 pgm_command = "# "+" ".join(sys.argv);
295 if (args.error_log) :
296 sys.stderr.write(pgm_command+"\n")
297
250298 if (args.quiet) :
251299 quiet = args.quiet
252300
317365 del_err_files = []
318366
319367 # do the first search
320 if (not args.prev_m89res):
321 search_str = srch_subs[srch_pgm](args.query_file, args.db_file, args.prev_pssm)
322 log_system(search_str+" > "+this_file_out+" 2> "+this_file_out+".err", error_log)
368 if (not (args.prev_m89res or args.prev_msa)):
369 search_str = srch_subs[srch_pgm](args.query_file, args.db_file, args.prev_pssm, args)
370 if (not args.use_stdout):
371 log_system(search_str+" > "+this_file_out+" 2> "+this_file_out+".err", args.error_log)
372 else:
373 log_system(search_str + " 2> "+this_file_out+".err", args.error_log)
323374 del_err_files.append(this_file_out+".err")
324375 first_iter += 1
325 else:
376 elif (args.prev_m89res):
326377 this_file_out = args.prev_m89res
327
378 elif (args.prev_msa):
379 # build a PSSM, do a search, up the iteration count
380 prev_pssm = pssm_from_msa(args.query_file, args.prev_msa, args.error_log)
381 search_str = srch_subs[srch_pgm](args.query_file, args.db_file, prev_pssm, args)
382 if (not args.use_stdout):
383 log_system(search_str + " > " + this_file_out + " 2> " + this_file_out + ".err", args.error_log)
384 else:
385 log_system(search_str + " 2> " + this_file_out + ".err", args.error_log)
386
387 del_err_files.append(this_file_out+".err")
388 first_iter += 1
328389
329390 it=first_iter
330391
332393
333394 while (it < args.num_iter) :
334395
335 (this_pssm, this_bound_out) = build_msa_pssm(args.query_file, this_file_out, prev_bound_in, arg.prev_sel_res, error_log)
396 (this_pssm, this_bound_out) = build_msa_pssm(args.query_file, this_file_out, prev_bound_in, args.prev_sel_res, args.error_log)
336397 prev_file_out = this_file_out
337 arg.prev_sel_res = ''
398 args.prev_sel_res = ''
338399
339400 iter_val = this_iter + it
340401
347408 if (args.tmp_dir) :
348409 this_file_out = args.tmp_dir+"/"+this_file_out
349410
350 search_str = srch_subs[srch_pgm](args.query_file, args.db_file, prev_pssm)
351 log_system("%s > %s 2> %s" % (search_str,this_file_out,this_file_out+".err"), error_log)
411 search_str = srch_subs[srch_pgm](args.query_file, args.db_file, prev_pssm, args)
412 log_system("%s > %s 2> %s" % (search_str,this_file_out,this_file_out+".err"), args.error_log)
352413 del_err_files.append(this_file_out+".err")
353414
354415 if (len(del_file_ext)):
355416 del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
356 log_system('rm '+' '.join(del_file_list),error_log)
417 log_system('rm '+' '.join(del_file_list),args.error_log)
357418
358419 if (has_converged(prev_bound_in, this_bound_out)) :
359420 if (not quiet) :
361422
362423 # if (len(del_file_ext)):
363424 # del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
364 # log_system('rm '+' '.join(del_file_list),error_log)
425 # log_system('rm '+' '.join(del_file_list),args.error_log)
365426
366427 if (delete_bnd) :
367 log_system("rm "+prev_bound_in,error_log)
428 log_system("rm "+prev_bound_in,args.error_log)
368429
369430 exit(0)
370431
371432 if (delete_bnd) :
372 log_system("rm "+prev_bound_in,error_log)
433 log_system("rm "+prev_bound_in,args.error_log)
373434 prev_bound_in = this_bound_out
374435
375436 it += 1
376437
377438 if (len(del_err_files)):
378 log_system('rm '+' '.join(del_err_files),error_log)
439 log_system('rm '+' '.join(del_err_files),args.error_log)
379440
380441 # if (len(del_file_ext)):
381442 # del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
382 # log_system('rm '+' '.join(del_file_list),error_log)
443 # log_system('rm '+' '.join(del_file_list),args.error_log)
383444
384445 if (delete_bnd):
385 log_system("rm "+this_bound_out,error_log)
446 log_system("rm "+this_bound_out,args.error_log)
386447
387448 if (not quiet) :
388449 sys.stderr.write(" %s %s %s %s finished (%d iterations)\n" % (sys.argv[0], srch_pgm, query_file, args.db_file, it))
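
The search calls in this script either write results to a file or leave them on stdout (`--use_stdout`), and in both cases capture stderr in a `.err` file. A minimal sketch of that redirection logic follows. The helper name `redirect_cmd` and the command strings are hypothetical placeholders, not part of psisearch2_msa.py:

```python
def redirect_cmd(search_cmd, out_file, use_stdout=False):
    """Append shell redirections to a search command string.

    Hypothetical helper: mirrors the --use_stdout branch, where
    stderr always goes to <out_file>.err.
    """
    err_file = out_file + ".err"
    if use_stdout:
        # results stay on stdout; only stderr is captured
        return "%s 2> %s" % (search_cmd, err_file)
    # results go to out_file, stderr to out_file.err
    return "%s > %s 2> %s" % (search_cmd, out_file, err_file)


if __name__ == "__main__":
    # placeholder command strings, for illustration only
    print(redirect_cmd("search_cmd query.aa db", "query.it1.m8CB"))
    print(redirect_cmd("search_cmd query.aa db", "query.it1.m8CB", use_stdout=True))
```

Putting the redirection in one place keeps the spacing around `>` consistent and ensures every call passes the same arguments to the logging wrapper.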
0 #!/bin/sh
1
2 ################
3 # example that runs psisearch2_msa.pl iteratively through 5 iterations.
4 # Equivalent to:
5 # psisearch2_msa.pl --query CL0238_emb.fa --num_iter 5 --db /slib2/fa_dbs/rpd3_pfam28_lib.lseg
6 #
7
8
9 PS_BIN=~/Devel/fa36_v3.8/psisearch2
10 Q_DIR="../seq"
11 FA_DB=/slib2/fa_dbs/qfo78.lseg
12 BL_DB=/slib2/bl_dbs/qfo78
13 DB=$FA_DB
14
15 OUT_SUFF='qm8CB'
16
17 M_FORMAT='m8CB'
18 ITERS='2 3 4 5'
19
20 for q_file_p in $*; do
21
22 q_file=${q_file_p##*/}
23 echo $q_file
24
25 # iteration 1:
26
27 $PS_BIN/psisearch2_msa.pl --query $Q_DIR/$q_file --num_iter 1 --db $DB --int_mask query --end_mask query --out_suffix $OUT_SUFF --m_format $M_FORMAT
28
29 # iteration 2 - 5
30 for it in $ITERS; do
31 prev=$(($it-1))
32 $PS_BIN/psisearch2_msa.pl --query $Q_DIR/$q_file --num_iter 1 --db $DB --int_mask query --end_mask query --out_suffix $OUT_SUFF --this_iter $it --prev_m89res $q_file.it${prev}.$OUT_SUFF --m_format $M_FORMAT
33 done
34
35 done
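
The shell loop above runs one iteration per invocation and passes the previous result file (`<query>.it<N-1>.<suffix>`) through `--prev_m89res`. As a rough sketch, the same command lines can be generated in Python. The function name `iteration_cmds` and its defaults are assumptions made for illustration; only the option names and the file-name pattern come from the script above:

```python
def iteration_cmds(q_file, db, out_suff="qm8CB", m_format="m8CB", n_iter=5,
                   script="psisearch2_msa.pl", q_dir="../seq"):
    """Return one command line per iteration, each run with --num_iter 1."""
    base = ("%s --query %s/%s --num_iter 1 --db %s --int_mask query "
            "--end_mask query --out_suffix %s"
            % (script, q_dir, q_file, db, out_suff))

    # iteration 1 has no previous result file
    cmds = [base + " --m_format %s" % m_format]

    # iterations 2..n_iter read the previous iteration's output
    for it in range(2, n_iter + 1):
        prev_res = "%s.it%d.%s" % (q_file, it - 1, out_suff)
        cmds.append(base + " --this_iter %d --prev_m89res %s --m_format %s"
                    % (it, prev_res, m_format))
    return cmds


if __name__ == "__main__":
    for cmd in iteration_cmds("query.aa", "/slib2/fa_dbs/qfo78.lseg"):
        print(cmd)
```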
0 #!/bin/sh
1
2 ################
3 # example that runs psisearch2_msa.pl iteratively through 5 iterations using psiblast instead of ssearch
4 # Equivalent to:
5 # psisearch2_msa.pl --pgm psiblast --query query.aa --num_iter 5 --db /slib2/bl_dbs/qfo78
6 #
7
8 PS_BIN=~/Devel/fa36_v3.8/psisearch2
9 q_file=$1
10 m_format='m8CB'
11 SRC_QDIR=../hum_1dom200_queries
12
13 iters='2 3 4 5'
14 # iters=''
15
16 for q_file_p in $*; do
17
18 q_file=${q_file_p##*/}
19 echo $q_file
20
21 # iteration 1:
22 # echo "$PS_BIN/psisearch2_msa.pl --pgm psiblast --query $SRC_QDIR/$q_file --num_iter 1 --db /slib2/bl_dbs/qfo78 --int_mask query --end_mask query --out_suffix q_pblt --m_format $m_format --save_list asnbin"
23 $PS_BIN/psisearch2_msa.pl --pgm psiblast --query $SRC_QDIR/$q_file --num_iter 1 --db /slib2/bl_dbs/qfo78 --int_mask query --end_mask query --out_suffix q_pblt --m_format $m_format --save_list asntxt
24
25 # iteration 2 - 5
26 for it in $iters; do
27 prev=$(($it-1))
28 $PS_BIN/psisearch2_msa.pl --pgm psiblast --query $SRC_QDIR/$q_file --num_iter 1 --db /slib2/bl_dbs/qfo78 --int_mask query --end_mask query --out_suffix q_pblt --this_iter $it --prev_m89res $q_file.it${prev}.q_pblt --m_format $m_format --save_list asntxt
29 done
30 done
00
11 22-Jan-2014
22 13-Apr-2016 updated
3 22-Feb-2019 updated
34
45 fasta36/scripts
56
67 Perl scripts for annotating sequences and expanding libraries
8
9 -- Sequence generation (January, February, 2019)
10
11 The FASTA programs can now use sequences that are downloaded from
12 Uniprot or NCBI/RefSeq (or otherwise provided by a program script that
13 produces FASTA sequences from an identifier) by specifying the name of
14 the script, the accession(s), and library type 9, e.g.
15
16 fasta36 \!../scripts/get_protein.py+P09488 /seqlib/swissprot.fasta
17
18 Scripts are available for downloading protein sequences from Uniprot
19 or RefSeq (get_protein.py), Uniprot (get_uniprot.py), and for
20 downloading either protein or mRNA sequences from RefSeq
21 (get_refseq.py).
22
23 scripts/get_protein.py get Refseq or Uniprot proteins
24 scripts/get_refseq.py get RefSeq proteins or mRNAs
25 scripts/get_up_prot_iso_sql.py get a protein and its isoforms using a mysql database
26 scripts/get_genome_seq.py get human (hg38) or mouse (mm10, with --genome mm10) genome sequences using bedtools, e.g. "get_genome_seq.py chr1:123456-126543"
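
The only requirement stated above for such a script is that it produces FASTA sequences from an identifier. The following is a minimal sketch of a script of that shape, not the code of get_protein.py. It assumes the accessions arrive as command-line arguments, and it uses a hypothetical in-memory table (`TOY_SEQS`) in place of a Uniprot or RefSeq download:

```python
import sys

# Hypothetical in-memory "database"; a real script would query Uniprot or RefSeq.
TOY_SEQS = {
    "ACC0001": ("toy protein one", "MKTAYIAKQRQISFVKSHFSRQ"),
    "ACC0002": ("toy protein two", "MSDNEQLLKKAVE"),
}


def fasta_record(acc, width=60):
    """Format one entry as a FASTA record with fixed-width sequence lines."""
    desc, seq = TOY_SEQS[acc]
    lines = [">%s %s" % (acc, desc)]
    lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines)


def main(argv):
    # assumption: each argument is one accession
    for acc in argv:
        if acc in TOY_SEQS:
            print(fasta_record(acc))
        else:
            sys.stderr.write("# %s not found\n" % acc)


if __name__ == "__main__":
    main(sys.argv[1:])
```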
727
828 -- Sequence alignment scoring/annotation
929
82102 ann_pdb_cath.pl -- generate CATH domains using PDB accessions from a mySQL database
83103 ann_pdb_vast.pl -- use VAST domains, but domain names are not informative
84104
85 ann_pfam27.pl -- generate Pfam domains using local Pfam mySQL database (Pfam27 with auto_pfamA, auto_pfamseq)
86 ann_pfam28.pl -- generate Pfam domains using local Pfam mySQL database (Pfam28, no auto_pfamA, auto_pfamseq)
105 ann_pfam28.pl -- generate Pfam domains using local Pfam mySQL database
106 (Pfam28, no auto_pfamA, auto_pfamseq)
107
87108 ann_pfam_www.pl -- use Pfam Website, and XML::Twig, to get Pfam domain info.
88109
89 ann_exons_ens.pl -- generate exon boundaries on SwissProt proteins from Ensembl.
90 ann_exons_up_www.pl -- generate exon boundaries on SwissProt proteins using the EBI/Proteins/API/coordinate service
110 ann_exons_up_www.pl -- generate exon boundaries on Uniprot proteins
111 using the EBI/Proteins/API/coordinate service
112
113 ann_exons_up_sql_www.pl -- generate exon boundaries on Uniprot
114 proteins using an SQL database (if available) or the EBI/Proteins
115 coordinate service. The SQL results are dramatically faster.
116
91117 ann_exons_ncbi.pl -- generate exon boundaries on NCBI refseq proteins.
92118
93119 -- Library expansion
94120
121 expand_up_isoforms.pl -- for Uniprot reference proteomes, provide
122 isoforms for each canonical sequence.
123
95124 expand_uniref50.pl -- allows search of uniref50 to be expanded
96 expand_links.pl -- script to take hits from a smaller library and expand to complete library
125
126 expand_links.pl -- script to take hits from a smaller library and
127 expand to complete library
128
97129 links2sql.pl -- create links for expand_links.pl
98130
99131 exp_up_ensg.pl -- expand uniprot sequences to include Ensembl splice variants
0 #!/usr/bin/env perl
1
2 ################################################################
3 # copyright (c) 2014,2015 by William R. Pearson and The Rector &
4 # Visitors of the University of Virginia
5 ################################################################
6 # Licensed under the Apache License, Version 2.0 (the "License");
7 # you may not use this file except in compliance with the License.
8 # You may obtain a copy of the License at
9 #
10 # http://www.apache.org/licenses/LICENSE-2.0
11 #
12 # Unless required by applicable law or agreed to in writing,
13 # software distributed under this License is distributed on an "AS
14 # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
15 # express or implied. See the License for the specific language
16 # governing permissions and limitations under the License.
17 ################################################################
18
19 # ann_exons_up_sql.pl gets an annotation file from fasta36 -V with a line of the form:
20
21 # gi|62822551|sp|P00502|GSTA1_RAT Glutathione S-transfer\n (at least from pir1.lseg)
22 #
23 # it must:
24 # (1) read in the line
25 # (2) parse it to get the up_acc
26 # (3) return the tab delimited features
27
28 # this version can read feature2 uniprot features (acc/pos/end/label/value), but returns sorted start/end domains
29 # modified 18-Jan-2016 to produce annotation symbols consistent with ann_exons_up_www2.pl
30 # modified Dec 2018 to generate genomic coordinates with --gen_coord
31 # modified 3-Jan-2019 to merge sql and www (--www) access to exon coordinates
32
33 use warnings;
34 use strict;
35
36 use DBI;
37 use Getopt::Long;
38 use Pod::Usage;
39 use LWP::Simple;
40 use LWP::UserAgent;
41 use JSON qw(decode_json);
42
43 use vars qw($host $db $a_table $port $user $pass);
44
45 my %domains = ();
46 my $domain_cnt = 0;
47
48 my $hostname = `/bin/hostname`;
49
50 unless ($hostname =~ m/ebi/) {
51 ($host, $db, $a_table, $port, $user, $pass) = ("wrpxdb.its.virginia.edu", "uniprot", "annot2", 0, "web_user", "fasta_www");
52 # $host = 'xdb';
53 }
54 else {
55 ($host, $db, $a_table, $port, $user, $pass) = ("mysql-pearson-prod", "up_db", "annot", 4124, "web_user", "fasta_www");
56 }
57
58 my ($lav, $gen_coord, $exon_label, $use_www, $shelp, $help) = (0,0,0,0,0,0);
59
60 my ($show_color) = (1);
61 my $color_sep_str = " :";
62 $color_sep_str = '~';
63
64 GetOptions(
65 "gen_coord|gene_coord!" => \$gen_coord,
66 "exon_label|label_exons!" => \$exon_label,
67 "www!" => \$use_www,
68 "host=s" => \$host,
69 "db=s" => \$db,
70 "user=s" => \$user,
71 "password=s" => \$pass,
72 "port=i" => \$port,
73 "lav" => \$lav,
74 "h|?" => \$shelp,
75 "help" => \$help,
76 );
77
78 pod2usage(1) if $shelp;
79 pod2usage(exitstatus => 0, verbose => 2) if $help;
80 pod2usage(1) unless (@ARGV || -p STDIN || -f STDIN);
81
82 my $connect = "dbi:mysql(AutoCommit=>1,RaiseError=>1):database=$db";
83 $connect .= ";host=$host" if $host;
84 $connect .= ";port=$port" if $port;
85
86 my $dbh = DBI->connect($connect,
87 $user,
88 $pass
89 ) or die $DBI::errstr;
90
91
92 my $get_annot_sub = \&get_annots;
93
94
95 my $ua = LWP::UserAgent->new(ssl_opts=>{verify_hostname => 0});
96 my $uniprot_url = 'https://www.ebi.ac.uk/proteins/api/coordinates/';
97 my $uniprot_suff = ".json";
98
99
100 if ($use_www) {
101 $get_annot_sub = \&get_annots_up_www;
102 }
103
104
105 my $get_annots_id = $dbh->prepare(qq(select up_exons.* from up_exons join annot2 using(acc) where id=? order by ix));
106 my $get_annots_acc = $dbh->prepare(qq(select up_exons.* from up_exons where acc=? order by ix));
107 my $get_annots_refacc = $dbh->prepare(qq(select ref_acc, start, end, ix from up_exons join annot2 using(acc) where ref_acc=? order by ix));
108 my $get_annots_refseq = $dbh->prepare(qq(select acc, ex_p_start as start, ex_p_end as end, ex_num as ix, chrom, g_start, g_end from seqdb_demo2.ref_exons where acc=? order by ix));
109
110 my $get_annots_sql = $get_annots_acc;
111
112 my ($tmp, $gi, $sdb, $acc, $id, $use_acc);
113
114 # get the query
115 my ($query, $seq_len) = @ARGV;
116 $seq_len = 0 unless defined($seq_len);
117
118 $query =~ s/^>// if ($query);
119
120 my @annots = ();
121
122 #if it's a file I can open, read and parse it
123 unless ($query && ($query =~ m/[\|:]/ ||
124 $query =~ m/^[NX]P_/ ||
125 $query =~ m/^[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}\s/)) {
126
127 while (my $a_line = <>) {
128 $a_line =~ s/^>//;
129 chomp $a_line;
130 push @annots, show_annots($a_line, $get_annot_sub, $use_www);
131 }
132 }
133 else {
134 push @annots, show_annots("$query\t$seq_len", $get_annot_sub, $use_www);
135 }
136
137 for my $seq_annot (@annots) {
138 print ">",$seq_annot->{seq_info},"\n";
139 for my $annot (@{$seq_annot->{list}}) {
140 if (!$lav && $show_color && defined($domains{$annot->[-1]})) {
141 $annot->[-1] .= $color_sep_str.$domains{$annot->[-1]};
142 }
143 print join("\t",@$annot),"\n";
144 }
145 }
146
147 exit(0);
148
149 sub show_annots {
150 my ($query_len, $get_annot_sub, $use_www) = @_;
151
152 my ($annot_line, $seq_len) = split(/\t/,$query_len);
153
154 my %annot_data = (seq_info=>$annot_line);
155
156 if ($annot_line =~ m/^gi\|/) {
157 $use_acc = 1;
158 ($tmp, $gi, $sdb, $acc, $id) = split(/\|/,$annot_line);
159 }
160 elsif ($annot_line =~ m/^(SP|TR):(\w+) (\w+)/) {
161 ($sdb, $id, $acc) = ($1,$2,$3);
162 $use_acc = 1;
163 $sdb = lc($sdb)
164 }
165 elsif ($annot_line =~ m/^(SP|TR):(\w+)/) {
166 ($sdb, $id) = ($1,$2);
167 $use_acc = 0;
168 $sdb = lc($sdb)
169 }
170 elsif ($annot_line !~ m/\|/) { # new NCBI swissprot format
171 $use_acc =1;
172 if ($annot_line =~ m/[NXY]P_\d+/) {
173 $sdb = 'ref';
174 }
175 else {
176 $sdb = 'sp';
177 }
178 ($acc) = split(/\s+/,$annot_line);
179 }
180 else {
181 $use_acc = 1;
182 ($sdb, $acc, $id) = split(/\|/,$annot_line);
183 }
184
185 unless ($use_acc) {
186 $get_annots_sql = $get_annots_id;
187 $get_annots_sql->execute($id);
188 }
189 else {
190 if ($sdb =~ m/ref/) {
191 $get_annots_sql = $get_annots_refseq;
192 } else {
193 $get_annots_sql = $get_annots_acc;
194 }
195 $acc =~ s/\.\d+$//;
196
197 unless ($use_www) {
198 $get_annots_sql->execute($acc);
199 }
200 else {
201 $get_annots_sql = $acc;
202 }
203 }
204
205 $annot_data{list} = $get_annot_sub->($get_annots_sql, $seq_len);
206
207 return \%annot_data;
208 }
209
210 sub get_annots {
211 my ($get_annots_sql, $seq_len) = @_;
212
213 my @feats = ();
214
215 while (my $exon_hr = $get_annots_sql->fetchrow_hashref()) {
216 my $ix = $exon_hr->{ix};
217 if ($lav) {
218 push @feats, [$exon_hr->{start}, $exon_hr->{end}, "exon_$ix~$ix"];
219 } else {
220 my ($exon_info,$ex_info_start, $ex_info_end) = ("exon_$ix~$ix","","");
221 if ($gen_coord) {
222 if (defined($exon_hr->{g_start})) {
223 my $chr=$exon_hr->{chrom};
224 $chr = "unk" unless $chr;
225 if ($chr =~ m/^\d+$/ || $chr =~m/^[XYZ]+$/) {
226 $chr = "chr$chr";
227 }
228 $ex_info_start = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_start});
229 $ex_info_end = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_end});
230 if ($exon_label) {
231 $exon_info = sprintf("exon_%d{%s:%d-%d}~%d",$ix, $chr, $exon_hr->{g_start}, $exon_hr->{g_end}, $ix);
232 push @feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info];
233 } else {
234 push @feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info];
235 push @feats, [$exon_hr->{start},'<','-',$ex_info_start];
236 push @feats, [$exon_hr->{end},'>','-',$ex_info_end];
237 }
238 }
239 } else {
240 push @feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info];
241 }
242 }
243 }
244
245 return \@feats;
246 }
247
248 sub get_annots_up_www {
249 my ($acc, $seq_len) = @_;
250
251 my @feats = ();
252
253 # my $exon_json = get_https($uniprot_url.$acc.$uniprot_suff);
254 my $exon_json = get($uniprot_url.$acc.$uniprot_suff);
255
256 unless (!$exon_json || $exon_json =~ m/errorMessage/ || $exon_json =~ m/Can not find/) {
257 return parse_json_up_exons($exon_json);
258 }
259 else {
260 return [];
261 }
262 }
263
264 sub parse_json_up_exons {
265 my ($exon_json) = @_;
266
267 my @exons = ();
268 my @ex_coords = ();
269
270 my $acc_exons = decode_json($exon_json);
271
272 my $exon_num = 1;
273 my $last_end = 0;
274 my $last_phase = 0;
275
276 my $chrom = $acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'chromosome'};
277 my $rev_strand = $acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'reverseStrand'};
278
279 for my $exon ( @{$acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'exon'}} ) {
280 my ($p_begin, $p_end) = ($exon->{'proteinLocation'}{'begin'}{'position'},$exon->{'proteinLocation'}{'end'}{'position'});
281 my ($g_begin, $g_end) = ($exon->{'genomeLocation'}{'begin'}{'position'},$exon->{'genomeLocation'}{'end'}{'position'});
282
283 my $this_phase = 0;
284 if (defined($g_begin) && defined($g_end)) {
285 $this_phase = ($g_end - $g_begin + 1) % 3;
286 }
287
288 if (!defined($p_begin) || !defined($p_end)) {
289 $exon_num++;
290 $last_phase = 0;
291 next;
292 }
293
294 if ($p_end >= $p_begin) {
295 if ($p_begin == $last_end) {
296 if ($last_phase==2) {
297 $p_begin += 1;
298 }
299 elsif ($last_phase==1) {
300 $last_end -= 1;
301 $exons[-1]->{seq_end} -= 1;
302 }
303 }
304
305 if ($p_begin <= $last_end && $p_end > $last_end) {
306 $p_begin = $last_end+1;
307 }
308 $last_end = $p_end;
309 $last_phase = $this_phase;
310
311 my ($gs_begin, $gs_end) = ($g_begin, $g_end);
312 if ($rev_strand) {
313 ($gs_begin, $gs_end) = ($g_end, $g_begin);
314 }
315
316 push @exons, {
317 ix=>$exon_num,
318 start=>$p_begin,
319 end=>$p_end,
320 g_start=>$gs_begin,
321 g_end=>$gs_end,
322 chrom=>$chrom,
323 };
324
325 $exon_num++;
326 }
327 }
328
329 # convert the exon list into annotation features: lav format, or
330 # tab-delimited start/end features, with genomic coordinates when
331 # --gen_coord is set
332
333 my @ex_feats = ();
334
335 for my $exon_hr (@exons) {
336 my $ix = $exon_hr->{ix};
337 if ($lav) {
338 push @ex_feats, [$exon_hr->{start}, $exon_hr->{end}, "exon_$ix~$ix" ];
339 }
340 else {
341 my ($exon_info,$ex_info_start, $ex_info_end) = ("exon_$ix~$ix","","");
342 if ($gen_coord) {
343 if (defined($exon_hr->{g_start})) {
344 my $chr=$exon_hr->{chrom};
345 $chr = "unk" unless $chr;
346 if ($chr =~ m/^\d+$/ || $chr =~m/^[XYZ]+$/) {
347 $chr = "chr$chr";
348 }
349 $ex_info_start = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_start});
350 $ex_info_end = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_end});
351 if ($exon_label) {
352 $exon_info = sprintf("exon_%d{%s:%d-%d}~%d",$ix, $chr, $exon_hr->{g_start}, $exon_hr->{g_end},$ix);
353 push @ex_feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info];
354 } else {
355 push @ex_feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info];
356 push @ex_feats, [$exon_hr->{start},'<','-',$ex_info_start];
357 push @ex_feats, [$exon_hr->{end},'>','-',$ex_info_end];
358 }
359 }
360 } else {
361 push @ex_feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info];
362 }
363 }
364 }
365 return \@ex_feats;
366 }
367
368 sub get_https {
369 my ($url) = @_;
370
371 my $result = "";
372 my $response = $ua->get($url);
373
374 if ($response->is_success) {
375 $result = $response->decoded_content;
376 } else {
377 $result = '';
378 }
379 return $result;
380 }
381
382
383
384 __END__
385
386 =pod
387
388 =head1 NAME
389
390 ann_exons_up_sql.pl
391
392 =head1 SYNOPSIS
393
394 ann_exons_up_sql.pl --lav 'sp|P09488|GSTM1_HUMAN' | accession.file
395
396 =head1 OPTIONS
397
398 -h short help
399 --help include description
400 --gen_coord -- provide genomic exon start/stop coordinates as features
401 --lav produce lav2plt.pl annotation format, only show domains/repeats
402 --host, --user, --password, --port --db -- info for mysql database
403
404 =head1 DESCRIPTION
405
406 C<ann_exons_all.pl> extracts exon location information from mysql
407 databases (uniprot for Uniprot proteins, seqdb_demo2 for refseq) built
408 from EBI/proteins API data (Uniprot) or Refseq GFF data (refseq).
409
410 Given a command line argument that contains a sequence accession
411 (P09488) or identifier (GSTM1_HUMAN), the program looks up the
412 features available for that sequence and returns them in a
413 tab-delimited format:
414
415 >sp|P09488|GSTM1_HUMAN
416 1 - 12 exon_1~1
417 13 - 38 exon_2~2
418 39 - 59 exon_3~3
419 60 - 87 exon_4~4
420 88 - 120 exon_5~5
421 121 - 152 exon_6~6
422 153 - 189 exon_7~7
423 190 - 218 exon_8~8
424
425 C<ann_exons_all.pl --gen_coord 'sp|P09488|GSTM1_HUMAN'> also provides genomic coordinates:
426
427 >sp|P09488|GSTM1_HUMAN
428 1 - 12 exon_1~1
429 1 < - exon_1::chr1:109687874
430 12 > - exon_1::chr1:109687909
431 13 - 37 exon_2~2
432 13 < - exon_2::chr1:109688170
433 37 > - exon_2::chr1:109688245
434 38 - 59 exon_3~3
435 38 < - exon_3::chr1:109688673
436 59 > - exon_3::chr1:109688737
437 ...
438 190 - 218 exon_8~8
439 190 < - exon_8::chr1:109693206
440 218 > - exon_8::chr1:109693292
441
442 C<ann_exons_all.pl> is designed to be used by the B<FASTA> programs
443 with the C<-V \!ann_exons_all.pl> option, or by the
444 C<annot_blast_btop.pl> script. It can also be used with the
445 lav2plt.pl program with the C<--xA "\!ann_exons_all.pl --lav"> or
446 C<--yA "\!ann_exons_all.pl --lav"> options.
447
448 =head1 AUTHOR
449
450 William R. Pearson, wrp@virginia.edu
451
452 =cut
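
The tab-delimited output described in the POD above (position, symbol, end, label) can be read back with a few lines of Perl. This is a minimal sketch, assuming one header line starting with `>` per sequence followed by four tab-separated fields per feature; `parse_annot_lines` is a hypothetical helper and is not part of the FASTA scripts.

```perl
use strict;
use warnings;

# parse_annot_lines is a hypothetical helper (not in the FASTA package).
# It groups annotation lines by the sequence identifier on the ">" line.
sub parse_annot_lines {
    my @lines = @_;
    my (%annots, $seq_id);
    for my $line (@lines) {
        chomp $line;
        next unless length $line;
        if ($line =~ m/^>(\S+)/) {      # header, e.g. >sp|P09488|GSTM1_HUMAN
            $seq_id = $1;
            $annots{$seq_id} = [];
            next;
        }
        next unless defined $seq_id;    # ignore features before any header
        my ($pos, $type, $end, $label) = split /\t/, $line;
        push @{ $annots{$seq_id} },
            { pos => $pos, type => $type, end => $end, label => $label };
    }
    return \%annots;
}

my $annots = parse_annot_lines(
    ">sp|P09488|GSTM1_HUMAN",
    "1\t-\t12\texon_1~1",
    "1\t<\t-\texon_1::chr1:109687874",
    "12\t>\t-\texon_1::chr1:109687909",
);

for my $feat (@{ $annots->{'sp|P09488|GSTM1_HUMAN'} }) {
    print join(" ", @{$feat}{qw(pos type end label)}), "\n";
}
```

Range features carry `-` in the second field and an end position in the third; the `<` and `>` features mark a single residue and carry `-` in the third field.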
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014,2015 by William R. Pearson and The Rector &
2828 # (3) return the tab delimited exon boundaries
2929
3030
31 use warnings;
3132 use strict;
3233
3334 use DBI;
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 # ann_exons_ncbi.pl gets an annotation file from fasta36 -V with a line of the form:
33
4 # gi|23065544|ref|NP_000552.2|
4 # gi|23065544|ref|NP_000552.2| or
5 # NP_000552
56 #
67 # and returns the exons present in the protein from NCBI gff3 tables (human, mouse, rat, xtrop)
78 #
1112 # (3) return the tab delimited exon boundaries
1213 #
1314
15 use warnings;
1416 use strict;
1517
1618 use DBI;
2325
2426 ($host, $db, $port, $user, $pass) = ("wrpxdb.its.virginia.edu", "seqdb_demo2", 0, "web_user", "fasta_www");
2527
26 my ($auto_reg,$rpd2_fams, $neg_doms, $lav, $no_doms, $pf_acc, $shelp, $help) = (0, 0, 0, 0,0, 0,0,0);
27 my ($min_nodom) = (10);
28 my ($lav, $shelp, $help) = (0, 0, 0);
2829
2930 my $color_sep_str = " :";
3031 $color_sep_str = '~';
3637 "password=s" => \$pass,
3738 "port=i" => \$port,
3839 "lav" => \$lav,
39 "neg" => \$neg_doms,
40 "neg_doms" => \$neg_doms,
41 "neg-doms" => \$neg_doms,
42 "min_nodom=i" => \$min_nodom,
43 "pfacc" => \$pf_acc,
44 "RPD2" => \$rpd2_fams,
45 "auto_reg" => \$auto_reg,
4640 "h|?" => \$shelp,
4741 "help" => \$help,
4842 );
130124 elsif ($annot_line =~ m/^ref\|/) {
131125 ($sdb, $acc) = split(/\|/,$annot_line);
132126 }
127 else {
128 $acc = $annot_line;
129 }
133130
134131 $acc =~ s/\.\d+$//;
135132 $get_annots_sql->execute($acc);
147144 # get the list of domains, sorted by start
148145 while ( my $row_href = $get_annots->fetchrow_hashref()) {
149146
150 $row_href->{info} = "exon_".$row_href->{ex_num};
147 $row_href->{info} = "exon_".$row_href->{ex_num}.$color_sep_str.$row_href->{ex_num};
151148 push @exons, $row_href
152149 }
153150
171168 return \@feats;
172169 }
173170
174 # domain name takes a uniprot domain label, removes comments ( ;
175 # truncated) and numbers and returns a canonical form. Thus:
176 # Cortactin 6.
177 # Cortactin 7; truncated.
178 # becomes "Cortactin"
179 #
180
181 sub domain_name {
182
183 my ($value) = @_;
184
185 if (!defined($domains{$value})) {
186 $domain_cnt++;
187 $domains{$value} = $domain_cnt;
188 }
189 return $value;
190 }
191
192171 __END__
193172
194173 =pod
195174
196175 =head1 NAME
197176
198 ann_feats.pl
177 ann_exons_ncbi.pl
199178
200179 =head1 SYNOPSIS
201180
202 ann_pfam.pl --neg-doms 'sp|P09488|GSTM1_NUMAN' | accession.file
181 ann_exons_ncbi.pl NP_000552
203182
204183 =head1 OPTIONS
205184
206185 -h short help
207186 --help include description
208 --neg-doms, -- report domains between annotated domains as NODOM
209 (also --neg, --neg_doms)
210 --min_nodom=10 -- minimum length between domains for NODOM
211
187 --lav produce lav2plt.pl annotation format, only show domains/repeats
212188 --host, --user, --password, --port, --db -- info for mysql database
213189
214190 =head1 DESCRIPTION
215191
216 C<ann_pfam.pl> extracts domain information from a msyql
192 C<ann_exons_ncbi.pl> extracts exon information from a mysql
217193 database. Currently, the program works with database sequence
218194 descriptions in one of two formats:
219195
220 >pf26|649|O94823|AT10B_HUMAN -- RPD2_seqs
221
222 (pf26 databases have auto_pfamseq in the second field) and
223
224 >gi|1705556|sp|P54670.1|CAF1_DICDI
225
226 C<ann_pfam.pl> uses the C<pfamA_reg_full_significant>, C<pfamseq>,
227 and C<pfamA> tables of the C<pfam> database to extract domain
228 information on a protein. For proteins that have multiple domains
229 associated with the same overlapping region (domains overlap by more
230 than 1/3 of the domain length), C<auto_pfam.pl> selects the domain
231 annotation with the best C<domain_evalue_score>. When domains overlap
232 by less than 1/3 of the domain length, they are shortened to remove
233 the overlap.
234
235 C<ann_pfam.pl> is designed to be used by the B<FASTA> programs with
236 the C<-V \!ann_pfam.pl> or C<-V "\!ann_pfam.pl --neg"> option.
196 >gi|23065544|ref|NP_000552.2| or
197 >NP_000552
198
199 C<ann_exons_ncbi.pl> uses the C<ref_exons> table of the C<seqdb2>
200 database to extract exon position information on a protein. The
201 C<seqdb2/ref_exons> table is constructed from refseq gff files using
202 the C<ncbi_refseq_ex2prot.pl> script.
203
204 C<ann_exons_ncbi.pl> is designed to be used by the B<FASTA> programs with
205 the C<-V \!ann_exons_ncbi.pl> option.
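
The two accepted description formats reduce to the same unversioned RefSeq accession before the database lookup. The sketch below restates that step on its own. `refseq_acc` is a hypothetical name, the handling of the `gi|` form is an assumption (that branch is not shown in this diff), and splitting the bare form on whitespace is also an assumption.

```perl
use strict;
use warnings;

# refseq_acc is a hypothetical helper, not a function in ann_exons_ncbi.pl.
sub refseq_acc {
    my ($annot_line) = @_;
    my $acc;
    if ($annot_line =~ m/^gi\|/) {          # gi|23065544|ref|NP_000552.2|
        (undef, undef, undef, $acc) = split /\|/, $annot_line;
    }
    elsif ($annot_line =~ m/^ref\|/) {      # ref|NP_000552.2|
        (undef, $acc) = split /\|/, $annot_line;
    }
    else {                                  # bare accession: NP_000552
        ($acc) = split /\s+/, $annot_line;
    }
    $acc =~ s/\.\d+$//;                     # drop the version suffix, as the script does
    return $acc;
}

print refseq_acc($_), "\n"
    for ('gi|23065544|ref|NP_000552.2|', 'ref|NP_000552.2|', 'NP_000552');
```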
237206
238207 =head1 AUTHOR
239208
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014,2015 by William R. Pearson and The Rector &
2424 # (1) read in the line
2525 # (2) parse it to get the up_acc
2626 # (3) return the tab delimited features
27 #
2827
2928 # this version can read feature2 uniprot features (acc/pos/end/label/value), but returns sorted start/end domains
3029 # modified 18-Jan-2016 to produce annotation symbols consistent with ann_exons_up_www2.pl
3130
31 use warnings;
3232 use strict;
3333
3434 use DBI;
5050 ($host, $db, $a_table, $port, $user, $pass) = ("mysql-pearson-prod", "up_db", "annot", 4124, "web_user", "fasta_www");
5151 }
5252
53 my ($sstr, $lav, $neg_doms, $no_vars, $no_doms, $no_feats, $shelp, $help, $pfam26) = (0,0,0,0,0,0,0,0,0,0);
54 my ($min_nodom) = (10);
53 my ($lav, $gen_coord, $shelp, $help) = (0,0,0,0);
5554
5655 my ($show_color) = (1);
5756 my $color_sep_str = " :";
5857 $color_sep_str = '~';
5958
6059 GetOptions(
60 "gen_coord!" => \$gen_coord,
6161 "host=s" => \$host,
6262 "db=s" => \$db,
6363 "user=s" => \$user,
6464 "password=s" => \$pass,
6565 "port=i" => \$port,
6666 "lav" => \$lav,
67 "no_doms" => \$no_doms,
68 "no-doms" => \$no_doms,
69 "nodoms" => \$no_doms,
70 "no_var" => \$no_vars,
71 "no-var" => \$no_vars,
72 "novar" => \$no_vars,
73 "neg" => \$neg_doms,
74 "neg_doms" => \$neg_doms,
75 "neg-doms" => \$neg_doms,
76 "negdoms" => \$neg_doms,
77 "min_nodom=i" => \$min_nodom,
78 "min-nodom=i" => \$min_nodom,
79 "no_feats" => \$no_feats,
80 "no-feats" => \$no_feats,
81 "nofeats" => \$no_feats,
82 "color!" => \$show_color,
83 "sstr" => \$sstr,
8467 "h|?" => \$shelp,
8568 "help" => \$help,
8669 );
9982 ) or die $DBI::errstr;
10083
10184
102 my $get_annot_sub = \&get_fasta_annots;
103 if ($lav) {
104 $no_feats = 1;
105 $get_annot_sub = \&get_lav_annots;
106 }
107
108 my $get_annots_id = $dbh->prepare(qq(select acc, start, end, ix from up_exons join annot2 using(acc) where id=? order by ix));
109 my $get_annots_acc = $dbh->prepare(qq(select acc, start, end, ix from up_exons where acc=? order by ix));
85 my $get_annot_sub = \&get_annots;
86
87 my $get_annots_id = $dbh->prepare(qq(select up_exons.* from up_exons join annot2 using(acc) where id=? order by ix));
88 my $get_annots_acc = $dbh->prepare(qq(select up_exons.* from up_exons where acc=? order by ix));
11089 my $get_annots_refacc = $dbh->prepare(qq(select ref_acc, start, end, ix from up_exons join annot2 using(acc) where ref_acc=? order by ix));
11190
11291 my $get_annots_sql = $get_annots_acc;
199178 return \%annot_data;
200179 }
201180
202 sub get_fasta_annots {
181 sub get_annots {
203182 my ($get_annots_sql, $seq_len) = @_;
204183
205 my ($acc, $start, $end, $ix);
206184 my @feats = ();
207185
208 while (($acc, $start, $end, $ix) = $get_annots_sql->fetchrow_array()) {
209 push @feats, [$start, "-", $end, "exon_$ix~$ix"];
186 while (my $exon_hr = $get_annots_sql->fetchrow_hashref()) {
187 my $ix = $exon_hr->{ix};
188 if ($lav) {
189 push @feats, [$exon_hr->{start}, $exon_hr->{end}, "exon_$ix~$ix"];
190 }
191 else {
192 push @feats, [$exon_hr->{start}, "-", $exon_hr->{end}, "exon_$ix~$ix"];
193 if ($gen_coord) {
194 if (not defined($exon_hr->{g_start})) {
195 next;
196 }
197
198 my $chr=$exon_hr->{chrom};
199 $chr = "unk" unless $chr;
200 if ($chr =~ m/^\d+$/ || $chr =~m/^[XYZ]+$/) {
201 $chr = "chr$chr";
202 }
203 my $ex_info = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_start});
204 push @feats, [$exon_hr->{start},'<','-',$ex_info];
205 $ex_info = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_end});
206 push @feats, [$exon_hr->{end},'>','-',$ex_info];
207 }
208 }
210209 }
211210
212211 return \@feats;
213212 }
214213
215 sub get_lav_annots {
216 my ($get_annots_sql, $seq_len) = @_;
217
218 my ($pos, $end, $label, $value, $comment);
219
220 my @feats = ();
221
222 my %annot = ();
223 while (($acc, $pos, $end, $label, $value) = $get_annots_sql->fetchrow_array()) {
224 next unless ($label =~ m/^DOMAIN/ || $label =~ m/^REPEAT/);
225 $value =~ s/\s?\{.+\}\.?$//;
226 $value = domain_name($label,$value);
227 push @feats, [$pos, $end, $value];
228 }
229
230 return \@feats;
231 }
232
233 # domain name takes a uniprot domain label, removes comments ( ;
234 # truncated) and numbers and returns a canonical form. Thus:
235 # Cortactin 6.
236 # Cortactin 7; truncated.
237 # becomes "Cortactin"
238 #
239
240 sub domain_name {
241
242 my ($label, $value) = @_;
243
244 if ($label =~ /DOMAIN|REPEAT/) {
245 $value =~ s/;.*$//;
246 $value =~ s/\s+\d+\.?$//;
247 $value =~ s/\.\s*$//;
248 $value =~ s/\s+\d+\.\s+.*$//;
249 $value =~ s/\s+/_/;
250 if (!defined($domains{$value})) {
251 $domain_cnt++;
252 $domains{$value} = $domain_cnt;
253 }
254 return $value;
255 }
256 else {
257 return $value;
258 }
259 }
260
261214 __END__
262215
263216 =pod
268221
269222 =head1 SYNOPSIS
270223
271 ann_exons_up_sql.pl --no_doms --no_feats --lav 'sp|P09488|GSTM1_NUMAN' | accession.file
224 ann_exons_up_sql.pl --lav 'sp|P09488|GSTM1_HUMAN' | accession.file
272225
273226 =head1 OPTIONS
274227
275228 -h short help
276229 --help include description
277 --no-doms do not show domain boundaries (domains are always shown with --lav)
278 --no-feats do not show features (variants, active sites, phospho-sites)
279 --no-var do not show variant sites (--no_var, --novar)
230 --gen_coord -- provide genomic exon start/stop coordinates as features
280231 --lav produce lav2plt.pl annotation format, only show domains/repeats
281 --neg-doms, -- report domains between annotated domains as NODOM
282 (also --neg, --neg_doms)
283 --min_nodom=10 minimum non-domain length to produce NODOM
284232 --host, --user, --password, --port, --db -- info for mysql database
285233
286234 =head1 DESCRIPTION
287235
288 C<ann_exons_up_sql.pl> extracts feature, domain, and repeat information from
289 a msyql database (default name, uniprot) built by parsing the
290 uniprot_sprot.dat and uniprot_trembl.dat feature tables. Given a
291 command line argument that contains a sequence accession (P09488) or
292 identifier (GSTM1_HUMAN), the program looks up the features available
293 for that sequence and returns them in a tab-delimited format:
236 C<ann_exons_up_sql.pl> extracts exon location information from
237 a mysql database (default name, uniprot) built from EBI/proteins API data.
238
239 Given a command line argument that contains a sequence accession
240 (P09488) or identifier (GSTM1_HUMAN), the program looks up the
241 features available for that sequence and returns them in a
242 tab-delimited format:
294243
295244 >sp|P09488|GSTM1_HUMAN
296 2 - 88 GST_N-terminal~1
297 7 V F Mutagen: Reduces catalytic activity 100- fold. {ECO:0000269|PubMed:16548513}.
298 34 * - MOD_RES: Phosphothreonine. {ECO:0000250|UniProtKB:P10649}.
299 90 - 208 GST_C-terminal~2
300 108 V S Mutagen: Changes the properties of the enzyme toward some substrates. {ECO:0000269|PubMed:16548513, ECO:0000269|PubMed:9930979}.
301 108 V Q Mutagen: Reduces catalytic activity by half. {ECO:0000269|PubMed:16548513, ECO:0000269|PubMed:9930979}.
302 109 V I Mutagen: Reduces catalytic activity by half. {ECO:0000269|PubMed:16548513}.
303 116 # - BINDING: Substrate.
304 116 V A Mutagen: Reduces catalytic activity 10-fold. {ECO:0000269|PubMed:16548513}.
305 116 V F Mutagen: Slight increase of catalytic activity. {ECO:0000269|PubMed:16548513}.
306 173 V N in allele GSTM1B; dbSNP:rs1065411. {ECO:0000269|Ref.3, ECO:0000269|Ref.5}.
307 210 * - MOD_RES: Phosphoserine. {ECO:0000250|UniProtKB:P04905}.
308 210 V T in dbSNP:rs449856.
309
310 If features are provided, then a legend of feature symbols is provided
311 as well:
312
313 ==:Active site
314 =*:Modified
315 =#:Substrate binding
316 =^:Site
317 =!:Metal binding
318
319 If the C<--lav> option is specified, domain and repeat features are
320 presented in a different format for the C<lav2plt.pl> program:
321
322 >sp|P09488|GSTM1_HUMAN
323 2 88 GST N-terminal.
324 90 208 GST C-terminal.
245 1 - 12 exon_1~1
246 13 - 38 exon_2~2
247 39 - 59 exon_3~3
248 60 - 87 exon_4~4
249 88 - 120 exon_5~5
250 121 - 152 exon_6~6
251 153 - 189 exon_7~7
252 190 - 218 exon_8~8
253
254 C<ann_exons_up_sql.pl --gen_coord 'sp|P09488|GSTM1_HUMAN'> also provides genomic coordinates:
255
256 >sp|P09488|GSTM1_HUMAN
257 1 - 12 exon_1~1
258 1 < - exon_1::chr1:109687874
259 12 > - exon_1::chr1:109687909
260 13 - 37 exon_2~2
261 13 < - exon_2::chr1:109688170
262 37 > - exon_2::chr1:109688245
263 38 - 59 exon_3~3
264 38 < - exon_3::chr1:109688673
265 59 > - exon_3::chr1:109688737
266 ...
267 190 - 218 exon_8~8
268 190 < - exon_8::chr1:109693206
269 218 > - exon_8::chr1:109693292
325270
326271 C<ann_exons_up_sql.pl> is designed to be used by the B<FASTA> programs
327272 with the C<-V \!ann_exons_up_sql.pl> option, or by the
0 #!/usr/bin/env perl
1
2 ################################################################
3 # copyright (c) 2014,2015 by William R. Pearson and The Rector &
4 # Visitors of the University of Virginia
5 ################################################################
6 # Licensed under the Apache License, Version 2.0 (the "License");
7 # you may not use this file except in compliance with the License.
8 # You may obtain a copy of the License at
9 #
10 # http://www.apache.org/licenses/LICENSE-2.0
11 #
12 # Unless required by applicable law or agreed to in writing,
13 # software distributed under this License is distributed on an "AS
14 # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
15 # express or implied. See the License for the specific language
16 # governing permissions and limitations under the License.
17 ################################################################
18
19 # ann_exons_up_sql.pl gets an annotation file from fasta36 -V with a line of the form:
20
21 # gi|62822551|sp|P00502|GSTA1_RAT Glutathione S-transfer\n (at least from pir1.lseg)
22 #
23 # it must:
24 # (1) read in the line
25 # (2) parse it to get the up_acc
26 # (3) return the tab delimited features
27
28 # this version can read feature2 uniprot features (acc/pos/end/label/value), but returns sorted start/end domains
29 # modified 18-Jan-2016 to produce annotation symbols consistent with ann_exons_up_www2.pl
30 # modified Dec 2018 to generate genomic coordinates with --gen_coord
31 # modified 3-Jan-2019 to merge sql and www (--www) access to exon coordinates
32
33 use warnings;
34 use strict;
35
36 use DBI;
37 use Getopt::Long;
38 use Pod::Usage;
39 use LWP::Simple;
40 use LWP::UserAgent;
41 use JSON qw(decode_json);
42
43 use vars qw($host $db $a_table $port $user $pass);
44
45 my %domains = ();
46 my $domain_cnt = 0;
47
48 my $hostname = `/bin/hostname`;
49
50 unless ($hostname =~ m/ebi/) {
51 ($host, $db, $a_table, $port, $user, $pass) = ("wrpxdb.its.virginia.edu", "uniprot", "annot2", 0, "web_user", "fasta_www");
52 # $host = 'xdb';
53 }
54 else {
55 ($host, $db, $a_table, $port, $user, $pass) = ("mysql-pearson-prod", "up_db", "annot", 4124, "web_user", "fasta_www");
56 }
57
58 my ($lav, $gen_coord, $exon_label, $use_www, $shelp, $help) = (0,0,0,0,0,0);
59
60 my ($show_color) = (1);
61 my $color_sep_str = " :";
62 $color_sep_str = '~';
63
64 GetOptions(
65 "gen_coord|gene_coord!" => \$gen_coord,
66 "exon_label|label_exons!" => \$exon_label,
67 "www!" => \$use_www,
68 "host=s" => \$host,
69 "db=s" => \$db,
70 "user=s" => \$user,
71 "password=s" => \$pass,
72 "port=i" => \$port,
73 "lav" => \$lav,
74 "h|?" => \$shelp,
75 "help" => \$help,
76 );
77
78 pod2usage(1) if $shelp;
79 pod2usage(exitstatus => 0, verbose => 2) if $help;
80 pod2usage(1) unless (@ARGV || -p STDIN || -f STDIN);
81
82 my $connect = "dbi:mysql(AutoCommit=>1,RaiseError=>1):database=$db";
83 $connect .= ";host=$host" if $host;
84 $connect .= ";port=$port" if $port;
85
86 my $dbh = DBI->connect($connect,
87 $user,
88 $pass
89 ) or die $DBI::errstr;
90
91
92 my $get_annot_sub = \&get_annots;
93
94
95 my $ua = LWP::UserAgent->new(ssl_opts=>{verify_hostname => 0});
96 my $uniprot_url = 'https://www.ebi.ac.uk/proteins/api/coordinates/';
97 my $uniprot_suff = ".json";
98
99
100 if ($use_www) {
101 $get_annot_sub = \&get_annots_up_www;
102 }
103
104
105 my $get_annots_id = $dbh->prepare(qq(select up_exons.* from up_exons join annot2 using(acc) where id=? order by ix));
106 my $get_annots_acc = $dbh->prepare(qq(select up_exons.* from up_exons where acc=? order by ix));
107 my $get_annots_refacc = $dbh->prepare(qq(select ref_acc, start, end, ix from up_exons join annot2 using(acc) where ref_acc=? order by ix));
108
109 my $get_annots_sql = $get_annots_acc;
110
111 my ($tmp, $gi, $sdb, $acc, $id, $use_acc);
112
113 # get the query
114 my ($query, $seq_len) = @ARGV;
115 $seq_len = 0 unless defined($seq_len);
116
117 $query =~ s/^>// if ($query);
118
119 my @annots = ();
120
121 #if it's a file I can open, read and parse it
122 unless ($query && ($query =~ m/[\|:]/ ||
123 $query =~ m/^[NX]P_/ ||
124 $query =~ m/^[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}\s/)) {
125
126 while (my $a_line = <>) {
127 $a_line =~ s/^>//;
128 chomp $a_line;
129 push @annots, show_annots($a_line, $get_annot_sub, $use_www);
130 }
131 }
132 else {
133 push @annots, show_annots("$query\t$seq_len", $get_annot_sub, $use_www);
134 }
135
136 for my $seq_annot (@annots) {
137 print ">",$seq_annot->{seq_info},"\n";
138 for my $annot (@{$seq_annot->{list}}) {
139 if (!$lav && $show_color && defined($domains{$annot->[-1]})) {
140 $annot->[-1] .= $color_sep_str.$domains{$annot->[-1]};
141 }
142 print join("\t",@$annot),"\n";
143 }
144 }
145
146 exit(0);
147
148 sub show_annots {
149 my ($query_len, $get_annot_sub, $use_www) = @_;
150
151 my ($annot_line, $seq_len) = split(/\t/,$query_len);
152
153 my %annot_data = (seq_info=>$annot_line);
154
155 if ($annot_line =~ m/^gi\|/) {
156 $use_acc = 1;
157 ($tmp, $gi, $sdb, $acc, $id) = split(/\|/,$annot_line);
158 }
159 elsif ($annot_line =~ m/^(SP|TR):(\w+) (\w+)/) {
160 ($sdb, $id, $acc) = ($1,$2,$3);
161 $use_acc = 1;
162 $sdb = lc($sdb)
163 }
164 elsif ($annot_line =~ m/^(SP|TR):(\w+)/) {
165 ($sdb, $id) = ($1,$2);
166 $use_acc = 0;
167 $sdb = lc($sdb)
168 }
169 elsif ($annot_line !~ m/\|/) { # new NCBI swissprot format
170 $use_acc =1;
171 $sdb = 'sp';
172 ($acc) = split(/\s+/,$annot_line);
173 }
174 else {
175 $use_acc = 1;
176 ($sdb, $acc, $id) = split(/\|/,$annot_line);
177 }
178
179 unless ($use_acc) {
180 $get_annots_sql = $get_annots_id;
181 $get_annots_sql->execute($id);
182 }
183 else {
184 unless ($sdb =~ m/ref/) {
185 $get_annots_sql = $get_annots_acc;
186 } else {
187 $get_annots_sql = $get_annots_refacc;
188 }
189 $acc =~ s/\.\d+$//;
190
191 unless ($use_www) {
192 $get_annots_sql->execute($acc);
193 }
194 else {
195 $get_annots_sql = $acc;
196 }
197 }
198
199 $annot_data{list} = $get_annot_sub->($get_annots_sql, $seq_len);
200
201 return \%annot_data;
202 }
203
204 sub get_annots {
205 my ($get_annots_sql, $seq_len) = @_;
206
207 my @feats = ();
208
209 while (my $exon_hr = $get_annots_sql->fetchrow_hashref()) {
210 my $ix = $exon_hr->{ix};
211 if ($lav) {
212 push @feats, [$exon_hr->{start}, $exon_hr->{end}, "exon_$ix~$ix"];
213 } else {
214 my ($exon_info,$ex_info_start, $ex_info_end) = ("exon_$ix~$ix","","");
215 if ($gen_coord) {
216 if (defined($exon_hr->{g_start})) {
217 my $chr=$exon_hr->{chrom};
218 $chr = "unk" unless $chr;
219 if ($chr =~ m/^\d+$/ || $chr =~m/^[XYZ]+$/) {
220 $chr = "chr$chr";
221 }
222 $ex_info_start = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_start});
223 $ex_info_end = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_end});
224 if ($exon_label) {
225 $exon_info = sprintf("exon_%d{%s:%d-%d}~%d",$ix, $chr, $exon_hr->{g_start}, $exon_hr->{g_end}, $ix);
226 push @feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info];
227 } else {
228 push @feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info];
229 push @feats, [$exon_hr->{start},'<','-',$ex_info_start];
230 push @feats, [$exon_hr->{end},'>','-',$ex_info_end];
231 }
232 }
233 } else {
234 push @feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info];
235 }
236 }
237 }
238
239 return \@feats;
240 }
241
242 sub get_annots_up_www {
243 my ($acc, $seq_len) = @_;
244
245 my @feats = ();
246
247 # my $exon_json = get_https($uniprot_url.$acc.$uniprot_suff);
248 my $exon_json = get($uniprot_url.$acc.$uniprot_suff);
249
250 unless (!$exon_json || $exon_json =~ m/errorMessage/ || $exon_json =~ m/Can not find/) {
251 return parse_json_up_exons($exon_json);
252 }
253 else {
254 return ();
255 }
256 }
257
258 sub parse_json_up_exons {
259 my ($exon_json) = @_;
260
261 my @exons = ();
262 my @ex_coords = ();
263
264 my $acc_exons = decode_json($exon_json);
265
266 my $exon_num = 1;
267 my $last_end = 0;
268 my $last_phase = 0;
269
270 my $chrom = $acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'chromosome'};
271 my $rev_strand = $acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'reverseStrand'};
272
273 for my $exon ( @{$acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'exon'}} ) {
274 my ($p_begin, $p_end) = ($exon->{'proteinLocation'}{'begin'}{'position'},$exon->{'proteinLocation'}{'end'}{'position'});
275 my ($g_begin, $g_end) = ($exon->{'genomeLocation'}{'begin'}{'position'},$exon->{'genomeLocation'}{'end'}{'position'});
276
277 my $this_phase = 0;
278 if (defined($g_begin) && defined($g_end)) {
279 $this_phase = ($g_end - $g_begin + 1) % 3;
280 }
281
282 if (!defined($p_begin) || !defined($p_end)) {
283 $exon_num++;
284 $last_phase = 0;
285 next;
286 }
287
288 if ($p_end >= $p_begin) {
289 if ($p_begin == $last_end) {
290 if ($last_phase==2) {
291 $p_begin += 1;
292 }
293 elsif ($last_phase==1) {
294 $last_end -= 1;
295 $exons[-1]->{seq_end} -= 1;
296 }
297 }
298
299 if ($p_begin <= $last_end && $p_end > $last_end) {
300 $p_begin = $last_end+1;
301 }
302 $last_end = $p_end;
303 $last_phase = $this_phase;
304
305 my ($gs_begin, $gs_end) = ($g_begin, $g_end);
306 if ($rev_strand) {
307 ($gs_begin, $gs_end) = ($g_end, $g_begin);
308 }
309
310 push @exons, {
311 ix=>$exon_num,
312 start=>$p_begin,
313 end=>$p_end,
314 g_start=>$gs_begin,
315 g_end=>$gs_end,
316 chrom=>$chrom,
317 };
318
319 $exon_num++;
320 }
321 }
322
323 # check for domain overlap, and resolve check for domain overlap
324 # (possibly more than 2 domains), choosing the domain with the best
325 # evalue
326
327 my @ex_feats = ();
328
329 for my $exon_hr (@exons) {
330 my $ix = $exon_hr->{ix};
331 if ($lav) {
332 push @ex_feats, [$exon_hr->{start}, $exon_hr->{end}, "exon_$ix~$ix" ];
333 }
334 else {
335 my ($exon_info,$ex_info_start, $ex_info_end) = ("exon_$ix~$ix","","");
336 if ($gen_coord) {
337 if (defined($exon_hr->{g_start})) {
338 my $chr=$exon_hr->{chrom};
339 $chr = "unk" unless $chr;
340 if ($chr =~ m/^\d+$/ || $chr =~m/^[XYZ]+$/) {
341 $chr = "chr$chr";
342 }
343 $ex_info_start = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_start});
344 $ex_info_end = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_end});
345 if ($exon_label) {
346 $exon_info = sprintf("exon_%d{%s:%d-%d}~%d",$ix, $chr, $exon_hr->{g_start}, $exon_hr->{g_end},$ix);
347 push @ex_feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info];
348 } else {
349 push @ex_feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info];
350 push @ex_feats, [$exon_hr->{start},'<','-',$ex_info_start];
351 push @ex_feats, [$exon_hr->{end},'>','-',$ex_info_end];
352 }
353 }
354 } else {
355 push @ex_feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info];
356 }
357 }
358 }
359 return \@ex_feats;
360 }
361
362 sub get_https {
363 my ($url) = @_;
364
365 my $result = "";
366 my $response = $ua->get($url);
367
368 if ($response->is_success) {
369 $result = $response->decoded_content;
370 } else {
371 $result = '';
372 }
373 return $result;
374 }
375
376
377
378 __END__
379
380 =pod
381
382 =head1 NAME
383
384 ann_exons_up_sql.pl
385
386 =head1 SYNOPSIS
387
388 ann_exons_up_sql.pl --lav 'sp|P09488|GSTM1_HUMAN' | accession.file
389
390 =head1 OPTIONS
391
392 -h short help
393 --help include description
394 --gen_coord -- provide genomic exon start/stop coordinates as features
395 --lav produce lav2plt.pl annotation format, only show domains/repeats
396 --host, --user, --password, --port, --db -- info for mysql database
397
398 =head1 DESCRIPTION
399
400 C<ann_exons_up_sql.pl> extracts exon location information from
401 a mysql database (default name, uniprot) built from EBI/proteins API data.
402
403 Given a command line argument that contains a sequence accession
404 (P09488) or identifier (GSTM1_HUMAN), the program looks up the
405 features available for that sequence and returns them in a
406 tab-delimited format:
407
408 >sp|P09488|GSTM1_HUMAN
409 1 - 12 exon_1~1
410 13 - 38 exon_2~2
411 39 - 59 exon_3~3
412 60 - 87 exon_4~4
413 88 - 120 exon_5~5
414 121 - 152 exon_6~6
415 153 - 189 exon_7~7
416 190 - 218 exon_8~8
417
418 C<ann_exons_up_sql.pl --gen_coord 'sp|P09488|GSTM1_HUMAN'> also provides genomic coordinates:
419
420 >sp|P09488|GSTM1_HUMAN
421 1 - 12 exon_1~1
422 1 < - exon_1::chr1:109687874
423 12 > - exon_1::chr1:109687909
424 13 - 37 exon_2~2
425 13 < - exon_2::chr1:109688170
426 37 > - exon_2::chr1:109688245
427 38 - 59 exon_3~3
428 38 < - exon_3::chr1:109688673
429 59 > - exon_3::chr1:109688737
430 ...
431 190 - 218 exon_8~8
432 190 < - exon_8::chr1:109693206
433 218 > - exon_8::chr1:109693292
434
435 C<ann_exons_up_sql.pl> is designed to be used by the B<FASTA> programs
436 with the C<-V \!ann_exons_up_sql.pl> option, or by the
437 C<annot_blast_btop.pl> script. It can also be used with the
438 lav2plt.pl program with the C<--xA "\!ann_exons_up_sql.pl --lav"> or
439 C<--yA "\!ann_exons_up_sql.pl --lav"> options.
440
441 =head1 AUTHOR
442
443 William R. Pearson, wrp@virginia.edu
444
445 =cut
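
The per-exon logic in `get_annots` (plain range, `--gen_coord` boundary features, or a combined `--exon_label` label) can be restated as a standalone function. This is a sketch under stated assumptions: `exon_to_feats` and its `%opt` keys are hypothetical names, and the input hash uses the column names the script reads (`ix`, `start`, `end`, `chrom`, `g_start`, `g_end`).

```perl
use strict;
use warnings;

# exon_to_feats is a hypothetical restatement of the per-exon branch in
# get_annots(); it is not a function in ann_exons_up_sql.pl.
sub exon_to_feats {
    my ($exon, %opt) = @_;
    my $ix    = $exon->{ix};
    my $range = [$exon->{start}, '-', $exon->{end}, "exon_$ix~$ix"];

    return ($range) unless $opt{gen_coord};
    # as in the script: with --gen_coord, an exon lacking g_start yields nothing
    return () unless defined $exon->{g_start};

    my $chr = $exon->{chrom} || 'unk';
    $chr = "chr$chr" if $chr =~ m/^\d+$/ || $chr =~ m/^[XYZ]+$/;

    if ($opt{exon_label}) {
        # fold the genomic span into the exon label itself
        $range->[3] = sprintf("exon_%d{%s:%d-%d}~%d",
                              $ix, $chr, $exon->{g_start}, $exon->{g_end}, $ix);
        return ($range);
    }
    return (
        $range,
        [$exon->{start}, '<', '-', sprintf("exon_%d::%s:%d", $ix, $chr, $exon->{g_start})],
        [$exon->{end},   '>', '-', sprintf("exon_%d::%s:%d", $ix, $chr, $exon->{g_end})],
    );
}

my %exon = (ix => 1, start => 1, end => 12, chrom => '1',
            g_start => 109687874, g_end => 109687909);
print join("\t", @$_), "\n" for exon_to_feats(\%exon, gen_coord => 1);
```

Called with `gen_coord => 1`, the sample exon produces the three `exon_1` lines shown in the POD example; adding `exon_label => 1` collapses them into a single range feature.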
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014,2015 by William R. Pearson and The Rector &
1616 # governing permissions and limitations under the License.
1717 ################################################################
1818
19 # ann_exons_up_www.pl gets an annotation file from fasta36 -V with a line of the form:
20
21 # gi|23065544|ref|NP_000552.2|
22 #
23 # and returns the exons present in the protein from NCBI gff3 tables (human and mouse only)
19 # ann_exons_up_www.pl gets an annotation file from fasta36 -V with a
20 # line of the form:
21 #
22 # sp|P09488|GSTM1_HUMAN<tab>218
23 #
24 # and uses the EBI protein coordinate API to get the locations of exons
25 # https://www.ebi.ac.uk/proteins/api/coordinates/P09488.json
2426 #
2527 # it must:
2628 # (1) read in the line
2830 # (3) get exon information from EBI/Uniprot
2931 # (4) return the tab delimited exon boundaries
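
Steps (3) and (4) depend on the shape of the JSON that the coordinates API returns. The sketch below walks the same keys the script reads (`gnCoordinate`, `genomicLocation`, `exon`, `proteinLocation`, `genomeLocation`). The JSON text is a hand-written stand-in with made-up positions rather than a real API response, `exon_rows` is a hypothetical name, and the script's phase correction and skipping of exons without protein coordinates are left out.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hand-written stand-in for an API response; all positions are illustrative.
my $json = <<'EOJ';
{"gnCoordinate":[{"genomicLocation":{
  "chromosome":"1","reverseStrand":false,
  "exon":[
    {"proteinLocation":{"begin":{"position":1},"end":{"position":12}},
     "genomeLocation":{"begin":{"position":1000},"end":{"position":1035}}},
    {"proteinLocation":{"begin":{"position":13},"end":{"position":37}},
     "genomeLocation":{"begin":{"position":2000},"end":{"position":2074}}}
  ]}}]}
EOJ

# exon_rows is a hypothetical helper: one [ix, p_begin, p_end, chrom,
# g_begin, g_end] row per exon of the first gnCoordinate entry.
sub exon_rows {
    my ($text) = @_;
    my $loc = decode_json($text)->{gnCoordinate}[0]{genomicLocation};
    my @rows;
    my $ix = 1;
    for my $exon (@{ $loc->{exon} }) {
        my ($p_beg, $p_end) = map { $exon->{proteinLocation}{$_}{position} } qw(begin end);
        my ($g_beg, $g_end) = map { $exon->{genomeLocation}{$_}{position} } qw(begin end);
        # on the reverse strand the script swaps genomic begin and end
        ($g_beg, $g_end) = ($g_end, $g_beg) if $loc->{reverseStrand};
        push @rows, [$ix++, $p_beg, $p_end, $loc->{chromosome}, $g_beg, $g_end];
    }
    return @rows;
}

print join("\t", @$_), "\n" for exon_rows($json);
```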
3032
31 # 22-May-2017 -- use get("http://"), not get_https("https://"), because EBI does not have LWP::Protocol:https
32
33 # 22-May-2017 -- use get("https://"), not get_https("https://"), because EBI does not have LWP::Protocol::https
34
35 # 11-Dec-2018 -- modified to include --gen_coord, which reports exon starts and stops in genomic coordinates as <, >
36
37 use warnings;
3338 use strict;
3439
3540 use Getopt::Long;
4146
4247 use vars qw($host $db $port $user $pass);
4348
44 my ($lav, $shelp, $help) = (0, 0,0);
49 my ($lav, $gen_coord, $shelp, $help) = (0, 0, 0, 0);
4550
4651 my $color_sep_str = " :";
4752 $color_sep_str = '~';
4853
4954 GetOptions(
55 "gen_coord!" => \$gen_coord,
5056 "lav" => \$lav,
5157 "h|?" => \$shelp,
5258 "help" => \$help,
6571 my $get_annot_sub = \&get_up_www_exons;
6672
6773 my $ua = LWP::UserAgent->new(ssl_opts=>{verify_hostname => 0});
68 my $uniprot_url = 'http://www.ebi.ac.uk/proteins/api/coordinates/';
74 my $uniprot_url = 'https://www.ebi.ac.uk/proteins/api/coordinates/';
6975 my $uniprot_suff = ".json";
7076
7177 # get the query
131137
132138 $acc =~ s/\.\d+$//;
133139
140 # my $exon_json = get_https($uniprot_url.$acc.$uniprot_suff);
134141 my $exon_json = get($uniprot_url.$acc.$uniprot_suff);
135142
136143 unless (!$exon_json || $exon_json =~ m/errorMessage/ || $exon_json =~ m/Can not find/) {
144151 my ($exon_json) = @_;
145152
146153 my @exons = ();
154 my @ex_coords = ();
147155
148156 my $acc_exons = decode_json($exon_json);
149157
150158 my $exon_num = 1;
151159 my $last_end = 0;
152160 my $last_phase = 0;
161
162 my $chrom = $acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'chromosome'};
163 my $rev_strand = $acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'reverseStrand'};
153164
154165 for my $exon ( @{$acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'exon'}} ) {
155166 my ($p_begin, $p_end) = ($exon->{'proteinLocation'}{'begin'}{'position'},$exon->{'proteinLocation'}{'end'}{'position'});
183194 $last_end = $p_end;
184195 $last_phase = $this_phase;
185196
197 my $info ="exon_".$exon_num.$color_sep_str.$exon_num;
198
199 my ($gs_begin, $gs_end) = ($g_begin, $g_end);
200 if ($rev_strand) {
201 ($gs_begin, $gs_end) = ($g_end, $g_begin);
202 }
203
186204 push @exons, {
187 info=>"exon_".$exon_num.$color_sep_str.$exon_num,
205 info=>$info,
206 exon_num=>$exon_num,
188207 seq_start=>$p_begin,
189208 seq_end=>$p_end,
209 gen_seq_start=>$gs_begin,
210 gen_seq_end=>$gs_end,
211 chrom=>$chrom,
190212 };
213
191214 $exon_num++;
192215 }
193216 }
204227 }
205228 else {
206229 push @ex_feats, [$d_ref->{seq_start}, '-', $d_ref->{seq_end}, $d_ref->{info} ];
230 if ($gen_coord) {
231 my $chr=$d_ref->{chrom};
232 if ($chr =~ m/^\d+$/ || $chr =~m/^[XYZ]+$/) {
233 $chr = "chr$chr";
234 }
235 my $ex_info = sprintf("exon_%d::%s:%d",$d_ref->{exon_num}, $chr, $d_ref->{gen_seq_start});
236 push @ex_feats, [$d_ref->{seq_start},'<','-',$ex_info];
237 $ex_info = sprintf("exon_%d::%s:%d",$d_ref->{exon_num}, $chr, $d_ref->{gen_seq_end});
238 push @ex_feats, [$d_ref->{seq_end},'>','-',$ex_info];
239 }
207240 }
208241 }
209242 return \@ex_feats;
223256 return $result;
224257 }
225258
226 sub domain_name {
227
228 my ($value) = @_;
229
230 if (!defined($domains{$value})) {
231 $domain_cnt++;
232 $domains{$value} = $domain_cnt;
233 }
234 return $value;
235 }
236
237259 __END__
238260
239261 =pod
251273 -h short help
252274 --help include description
253275 --lav produce lav2plt.pl annotation format
276 --gen_coord produce genome coordinate features
254277
255278 =head1 DESCRIPTION
256279
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014 by William R. Pearson and The Rector &
3434 # ann_feats2ipr.pl is largely identical to ann_feats2l.pl, except that
3535 # it uses Interpro for domain/repeat information.
3636
37 use warnings;
3738 use strict;
3839
3940 use DBI;
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014 by William R. Pearson and The Rector &
3434 # ann_feats2ipr.pl is largely identical to ann_feats2l.pl, except that
3535 # it uses Interpro for domain/repeat information.
3636
37 use warnings;
3738 use strict;
3839
3940 use DBI;
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014,2015 by William R. Pearson and The Rector &
2929 # this version can read feature2 uniprot features (acc/pos/end/label/value), but returns sorted start/end domains
3030 # modified 18-Jan-2016 to produce annotation symbols consistent with ann_feats_up_www2.pl
3131
32 use warnings;
3233 use strict;
3334
3435 use DBI;
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014,2015 by William R. Pearson and The Rector &
1717 ################################################################
1818
1919 ## modified 29-Sept-2016 to use EBI/proteins JSON URL:
20 ## http://www.ebi.ac.uk/proteins/api/features/p12345
20 ## https://www.ebi.ac.uk/proteins/api/features/p12345
2121
2222 # ann_feats_up_www2.pl gets an annotation file from fasta36 -V with a line of the form:
2323
3131
3232 # this version can read feature2 uniprot features (acc/pos/end/label/value), but returns sorted start/end domains
3333
34 use warnings;
3435 use strict;
3536
3637 use Getopt::Long;
3738 use Pod::Usage;
3839 use LWP::Simple;
40 use LWP::UserAgent;
3941 use JSON qw(decode_json);
4042
4143 ## use IO::String;
4244
43 my $up_base = 'http://www.ebi.ac.uk/proteins/api/features';
45 my $ua = LWP::UserAgent->new(ssl_opts=>{verify_hostname => 0});
46 my $up_base = 'https://www.ebi.ac.uk/proteins/api/features';
47 my $uniprot_suff = ".json";
4448
4549 my %domains = ();
4650 my $domain_cnt = 0;
213217 my $lwp_features = "";
214218
215219 if ($acc && ($acc =~ m/^[A-Z][0-9][A-Z0-9]{3}[0-9]/)) {
216 $lwp_features = get("$up_base/$acc.json");
220 $lwp_features = get_https("$up_base/$acc.json");
217221 }
218222 # elsif ($id && ($id =~ m/^\w+$/)) {
219223 # $lwp_features = get("$up_base/$id/$gff_post");
366370 }
367371 }
368372
373 sub get_https {
374 my ($url) = @_;
375
376 my $result = "";
377 my $response = $ua->get($url);
378
379 if ($response->is_success) {
380 $result = $response->decoded_content;
381 } else {
382 $result = '';
383 }
384 return $result;
385 }
369386
370387
371388 __END__
398415
399416 C<ann_feats_up_www2.pl> extracts feature, domain, and repeat
400417 information from the Uniprot DAS server through an XSLT translation
401 provided by http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/uniprotkb.
418 provided by https://www.ebi.ac.uk/Tools/dbfetch/dbfetch/uniprotkb.
402419 This server provides GFF descriptions of Uniprot entries, with most of
403420 the information provided in UniProt feature tables.
404421
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014 by William R. Pearson and The Rector &
3636 # (3) return the tab delimited domains
3737 #
3838
39 use warnings;
3940 use strict;
4041
4142 use Getopt::Long;
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014 by William R. Pearson and The Rector &
3434 # database
3535 #
3636
37 use warnings;
3738 use strict;
3839
3940 use DBI;
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014, 2015 by William R. Pearson and The Rector &
3434 # database
3535 #
3636
37 use warnings;
3738 use strict;
3839
3940 use LWP::Simple;
+0
-656
scripts/ann_pfam27.pl
0 #!/usr/bin/perl -w
1
2 ################################################################
3 # copyright (c) 2014 by William R. Pearson and The Rector &
4 # Visitors of the University of Virginia */
5 ################################################################
6 # Licensed under the Apache License, Version 2.0 (the "License");
7 # you may not use this file except in compliance with the License.
8 # You may obtain a copy of the License at
9 #
10 # http://www.apache.org/licenses/LICENSE-2.0
11 #
12 # Unless required by applicable law or agreed to in writing,
13 # software distributed under this License is distributed on an "AS
14 # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
15 # express or implied. See the License for the specific language
16 # governing permissions and limitations under the License.
17 ################################################################
18
19 # ann_pfam_e.pl gets an annotation file from fasta36 -V with a line of the form:
20
21 # gi|62822551|sp|P00502|GSTA1_RAT Glutathione S-transfer\n (at least from pir1.lseg)
22 #
23 # it must:
24 # (1) read in the line
25 # (2) parse it to get the up_acc
26 # (3) return the tab delimited features
27 #
28
29 # this version only annotates sequences known to Pfam:pfamseq:
30 # >pf26|164|O57809|1A1D_PYRHO
31 # and only provides domain information
32
33 use strict;
34
35 use DBI;
36 use Getopt::Long;
37 use Pod::Usage;
38
39 use vars qw($host $db $port $user $pass);
40
41 my $hostname = `/bin/hostname`;
42
43 ($host, $db, $port, $user, $pass) = ("wrpxdb.its.virginia.edu", "pfam27", 0, "web_user", "fasta_www");
44 #$host = 'xdb';
45
46 my ($auto_reg,$rpd2_fams, $vdoms, $neg_doms, $lav, $no_doms, $no_clans, $pf_acc, $no_over, $acc_comment, $shelp, $help) =
47 (0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0);
48 my ($min_nodom, $min_vdom) = (10,10);
49
50 my $color_sep_str = " :";
51 $color_sep_str = '~';
52
53
54 GetOptions(
55 "host=s" => \$host,
56 "db=s" => \$db,
57 "user=s" => \$user,
58 "password=s" => \$pass,
59 "port=i" => \$port,
60 "lav" => \$lav,
61 "acc_comment" => \$acc_comment,
62 "no-over" => \$no_over,
63 "no_over" => \$no_over,
64 "no-clans" => \$no_clans,
65 "no_clans" => \$no_clans,
66 "neg" => \$neg_doms,
67 "neg_doms" => \$neg_doms,
68 "neg-doms" => \$neg_doms,
69 "min_nodom=i" => \$min_nodom,
70 "pfacc" => \$pf_acc,
71 "RPD2" => \$rpd2_fams,
72 "auto_reg" => \$auto_reg,
73 "vdoms" => \$vdoms,
74 "h|?" => \$shelp,
75 "help" => \$help,
76 );
77
78 pod2usage(1) if $shelp;
79 pod2usage(exitstatus => 0, verbose => 2) if $help;
80 pod2usage(1) unless (@ARGV || -p STDIN || -f STDIN);
81
82 my $connect = "dbi:mysql(AutoCommit=>1,RaiseError=>1):database=$db";
83 $connect .= ";host=$host" if $host;
84 $connect .= ";port=$port" if $port;
85
86 my $dbh = DBI->connect($connect,
87 $user,
88 $pass
89 ) or die $DBI::errstr;
90
91 my %annot_types = ();
92 my %domains = (NODOM=>0);
93 my %domain_clan = (NODOM => {clan_id => 'NODOM', clan_acc=>0, domain_cnt=>0});
94 my @domain_list = (0);
95 my $domain_cnt = 0;
96
97 my $get_annot_sub = \&get_pfam_annots;
98
99 my $get_pfam_acc = $dbh->prepare(<<EOSQL);
100
101 SELECT seq_start, seq_end, model_start, model_end, model_length, auto_pfamA, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length
102 FROM pfamseq
103 JOIN pfamA_reg_full_significant using(auto_pfamseq)
104 JOIN pfamA USING (auto_pfamA)
105 WHERE in_full = 1
106 AND pfamseq_acc=?
107 ORDER BY seq_start
108
109 EOSQL
110
111 my $get_pfam_refacc = $dbh->prepare(<<EOSQL);
112
113 SELECT seq_start, seq_end, auto_pfamA, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length
114 FROM pfamseq
115 JOIN pfamA_reg_full_significant using(auto_pfamseq)
116 JOIN pfamA USING (auto_pfamA)
117 JOIN seqdb_demo2.annot as sa1 on(sa1.acc=pfamseq_acc and sa1.db='sp')
118 JOIN seqdb_demo2.annot as sa2 using(prot_id)
119 WHERE in_full = 1
120 AND sa2.acc=?
121 AND sa2.db='ref'
122 ORDER BY seq_start
123
124 EOSQL
125
126 my $get_annots_sql = $get_pfam_acc;
127
128 my $get_pfam_id = $dbh->prepare(<<EOSQL);
129
130 SELECT seq_start, seq_end, auto_pfamA, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length
131 FROM pfamseq
132 JOIN pfamA_reg_full_significant using(auto_pfamseq)
133 JOIN pfamA USING (auto_pfamA)
134 WHERE in_full=1
135 AND pfamseq_id=?
136 ORDER BY seq_start
137
138 EOSQL
139
140 my $get_pfam_clan = $dbh->prepare(<<EOSQL);
141
142 SELECT clan_acc, clan_id
143 FROM clans
144 JOIN clan_membership using(auto_clan)
145 WHERE auto_pfamA=?
146
147 EOSQL
148
149 my $get_rpd2_clans = $dbh->prepare(<<EOSQL);
150
151 SELECT auto_pfamA, clan
152 FROM ljm_db.RPD2_final_fams
153 WHERE clan is not NULL
154
155 EOSQL
156
157 # -- LEFT JOIN clan_membership USING (auto_pfamA)
158 # -- LEFT JOIN clans using(auto_clan)
159
160 my ($tmp, $gi, $sdb, $acc, $id, $use_acc);
161
162 # get the query
163 my ($query, $seq_len) = @ARGV;
164 $seq_len = 0 unless defined($seq_len);
165
166 $query =~ s/^>// if ($query);
167
168 my @annots = ();
169
170 my %rpd2_clan_fams = ();
171
172 if ($rpd2_fams) {
173 $get_rpd2_clans->execute();
174 my ($auto_pfam, $auto_clan);
175 while (($auto_pfam, $auto_clan)=$get_rpd2_clans->fetchrow_array()) {
176 $rpd2_clan_fams{$auto_pfam} = $auto_clan;
177 }
178 }
179
180 #if it's a file I can open, read and parse it
181 unless ($query && $query =~ m/[\|:]/) {
182
183 while (my $a_line = <>) {
184 $a_line =~ s/^>//;
185 chomp $a_line;
186 push @annots, show_annots($a_line, $get_annot_sub);
187 }
188 }
189 else {
190 push @annots, show_annots("$query $seq_len", $get_annot_sub);
191 }
192
193 for my $seq_annot (@annots) {
194 print ">",$seq_annot->{seq_info},"\n";
195 for my $annot (@{$seq_annot->{list}}) {
196 if (!$lav && defined($domains{$annot->[-1]})) {
197 my ($a_name, $a_num) = domain_num($annot->[-1],$domains{$annot->[-1]});
198 if ($acc_comment) {
199 $annot->[-1] .= $a_name."{$domain_list[$a_num]}";
200 }
201 $annot->[-1] = $a_name.$color_sep_str.$a_num;
202 }
203 print join("\t",@$annot),"\n";
204 }
205 }
206
207 exit(0);
208
209 sub show_annots {
210 my ($query_len, $get_annot_sub) = @_;
211
212 my ($annot_line, $seq_len) = split(/\s+/,$query_len);
213
214 my $pfamA_acc;
215
216 my %annot_data = (seq_info=>$annot_line);
217
218 $use_acc = 1;
219 $get_annots_sql = $get_pfam_acc;
220
221 if ($annot_line =~ m/^pf26\|/) {
222 ($sdb, $gi, $acc, $id) = split(/\|/,$annot_line);
223 $dbh->do("use RPD2_pfam");
224 }
225 elsif ($annot_line =~ m/^gi\|/) {
226 ($tmp, $gi, $sdb, $acc, $id) = split(/\|/,$annot_line);
227 if ($sdb =~ m/ref/) {
228 $get_annots_sql = $get_pfam_refacc;
229 }
230 }
231 elsif ($annot_line =~ m/^sp\|/) {
232 ($sdb, $acc, $id) = split(/\|/,$annot_line);
233 }
234 elsif ($annot_line =~ m/^ref\|/) {
235 ($sdb, $acc) = split(/\|/,$annot_line);
236 $get_annots_sql = $get_pfam_refacc;
237 }
238 elsif ($annot_line =~ m/^tr\|/) {
239 ($sdb, $acc, $id) = split(/\|/,$annot_line);
240 }
241 elsif ($annot_line =~ m/^SP:/i) {
242 ($sdb, $id) = split(/:/,$annot_line);
243 $use_acc = 0;
244 }
245 else {
246 $use_acc = 1;
247 ($acc) = split(/\s+/,$annot_line);
248 }
249
250 # remove version number
251 unless ($use_acc) {
252 $get_annots_sql = $get_pfam_id;
253 $get_annots_sql->execute($id);
254 }
255 else {
256 $acc =~ s/\.\d+$//;
257 $get_annots_sql->execute($acc);
258 }
259
260 $annot_data{list} = $get_annot_sub->($get_annots_sql, $seq_len);
261
262 return \%annot_data;
263 }
264
265 sub get_pfam_annots {
266 my ($get_annots, $seq_length) = @_;
267
268 $seq_length = 0 unless $seq_length;
269
270 my @pf_domains = ();
271
272 # get the list of domains, sorted by start
273 while ( my $row_href = $get_annots->fetchrow_hashref()) {
274 if ($auto_reg) {
275 $row_href->{info} = $row_href->{auto_pfamA_reg_full};
276 }
277 elsif ($pf_acc) {
278 $row_href->{info} = $row_href->{pfamA_acc};
279 }
280 else {
281 $row_href->{info} = $row_href->{pfamA_id};
282 }
283
284 if ($row_href && $row_href->{length} > $seq_length && $seq_length == 0) { $seq_length = $row_href->{length};}
285
286 next if ($row_href->{seq_start} >= $seq_length);
287 if ($row_href->{seq_end} > $seq_length) {
288 $row_href->{seq_end} = $seq_length;
289 }
290
291 push @pf_domains, $row_href
292 }
293
294 # check for domain overlap, and resolve any overlaps
295 # (possibly more than 2 domains), choosing the domain with the best
296 # evalue
297
298 if($no_over && scalar(@pf_domains) > 1) {
299
300 my @tmp_domains = @pf_domains;
301 my @save_domains = ();
302
303 my $prev_dom = shift @tmp_domains;
304
305 while (my $curr_dom = shift @tmp_domains) {
306
307 my @overlap_domains = ($prev_dom);
308
309 my $diff = $prev_dom->{seq_end} - $curr_dom->{seq_start};
310 # check for overlap > domain_length/3
311
312 my ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, $curr_dom->{seq_end}-$curr_dom->{seq_start}+1);
313 my $inclusion = ((($curr_dom->{seq_start} >= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})) ||
314 (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} >= $prev_dom->{seq_end})));
315
316 my $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len;
317
318 while ($inclusion || ($diff > 0 && $diff > $longer_len/3)) {
319 push @overlap_domains, $curr_dom;
320 $curr_dom = shift @tmp_domains;
321 last unless $curr_dom;
322 $diff = $prev_dom->{seq_end} - $curr_dom->{seq_start};
323 ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, $curr_dom->{seq_end}-$curr_dom->{seq_start}+1);
324 $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len;
325 $inclusion = ((($curr_dom->{seq_start} >= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})) ||
326 (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} >= $prev_dom->{seq_end})));
327 }
328
329 # check for overlapping domains; >1 because $prev_dom is always there
330 if (scalar(@overlap_domains) > 1 ) {
331 # if $rpd2_fams, check for a chosen one
332 if ($rpd2_fams) {
333 for my $dom (@overlap_domains) {
334 if ($rpd2_clan_fams{$dom->{auto_pfamA}}) {
335 $prev_dom = $dom;
336 last;
337 }
338 }
339 }
340 else {
341 @overlap_domains = sort { $a->{evalue} <=> $b->{evalue} } @overlap_domains;
342 $prev_dom = $overlap_domains[0];
343 }
344 }
345
346 # $prev_dom should be the best of the overlaps, and we are no longer overlapping > dom_length/3
347 push @save_domains, $prev_dom;
348 $prev_dom = $curr_dom;
349 }
350 if ($prev_dom) {push @save_domains, $prev_dom;}
351
352 @pf_domains = @save_domains;
353
354 # now check for smaller overlaps
355 for (my $i=1; $i < scalar(@pf_domains); $i++) {
356 if ($pf_domains[$i-1]->{seq_end} >= $pf_domains[$i]->{seq_start}) {
357 my $overlap = $pf_domains[$i-1]->{seq_end} - $pf_domains[$i]->{seq_start};
358 $pf_domains[$i-1]->{seq_end} -= int($overlap/2);
359 $pf_domains[$i]->{seq_start} = $pf_domains[$i-1]->{seq_end}+1;
360 }
361 }
362 }
363
364 # $vdoms -- virtual Pfam domains -- the equivalent of $neg_doms,
365 # but covering parts of a Pfam model that are not annotated. split
366 # domains have been joined, so simply check beginning and end of
367 # each domain (but must also check for bounded-ness)
368 # only add when 10% or more is missing and missing length > $min_nodom
369
370 if ($vdoms && scalar(@pf_domains)) {
371 my @vpf_domains;
372
373 my $curr_dom = $pf_domains[0];
374 my $length = $curr_dom->{length};
375
376 my $prev_dom={seq_end=>0, pfamA_acc=>''};
377 my $prev_dom_end = 0;
378 my $next_dom_start = $length+1;
379
380 for (my $dom_ix=0; $dom_ix < scalar(@pf_domains); $dom_ix++ ) {
381 $curr_dom = $pf_domains[$dom_ix];
382
383 my $pfamA = $curr_dom->{pfamA_acc};
384
385 # first, look left, is there a domain there (if there is,
386 # it should be updated right)
387
388 # my $min_vdom = $curr_dom->{model_length} / 10;
389
390 if ($prev_dom->{pfamA_acc}) { # look for previous domain
391 $prev_dom_end = $prev_dom->{seq_end};
392 }
393
394 # there is a domain to the left, how much room is available?
395 my $left_dom_len = min($curr_dom->{seq_start}-$prev_dom_end-1, $curr_dom->{model_start}-1);
396 if ( $left_dom_len > $min_vdom) {
397 # there is room for a virtual domain
398 my %new_dom = (seq_start=> $curr_dom->{seq_start}-$left_dom_len,
399 seq_end => $curr_dom->{seq_start}-1,
400 info=>'@'.$curr_dom->{info},
401 model_length=>$curr_dom->{model_length},
402 model_end => $curr_dom->{model_start}-1,
403 model_start => $left_dom_len,
404 pfamA_acc=>$pfamA,
405 );
406 push @vpf_domains, \%new_dom;
407 }
408
409 # save the current domain
410 push @vpf_domains, $curr_dom;
411 $prev_dom = $curr_dom;
412
413 if ($dom_ix < $#pf_domains) { # there is a domain to the right
414 # first, give all the extra space to the first domain (no splitting)
415 $next_dom_start = $pf_domains[$dom_ix+1]->{seq_start};
416 }
417 else {
418 $next_dom_start = $length;
419 }
420
421 # is there room for a virtual domain right
422
423 my $right_dom_len = min($next_dom_start-$curr_dom->{seq_end}-1, # space available
424 $curr_dom->{model_length}-$curr_dom->{model_end} # space needed
425 );
426 if ( $right_dom_len > $min_vdom) {
427 my %new_dom = (seq_start=> $curr_dom->{seq_end}+1,
428 seq_end=> $curr_dom->{seq_end}+$right_dom_len,
429 info=>'@'.$pfamA,
430 model_length => $curr_dom->{model_length},
431 pfamA_acc=> $pfamA,
432 );
433 push @vpf_domains, \%new_dom;
434 $prev_dom = \%new_dom;
435 }
436 } # all done, check for last one
437
438 # $curr_dom=$pf_domains[-1];
439 # # my $min_vdom = $curr_dom->{model_length}/10;
440
441 # my $right_dom_len = min($length - $curr_dom->{seq_end}+1, # space available
442 # $curr_dom->{model_length}-$curr_dom->{model_end} # space needed
443 # );
444 # if ($right_dom_len > $min_vdom) {
445 # my %new_dom = (seq_start=> $curr_dom->{seq_end}+1,
446 # seq_end => $curr_dom->{seq_end}+$right_dom_len,
447 # info=>'@'.$curr_dom->{pfamA_acc},
448 # model_len=> $curr_dom->{model_len},
449 # pfamA_acc => $curr_dom->{pfamA_acc},
450 # model_start => $curr_dom->{model_end}+1,
451 # model_end => $curr_dom->{model_len},
452 # );
453
454 # push @vpf_domains, \%new_dom;
455 # }
456
457 # @vpf_domains has both old @pf_domains and new neg-domains
458 @pf_domains = @vpf_domains;
459 }
460
461 if ($neg_doms) {
462 my @npf_domains;
463 my $prev_dom={seq_end=>0};
464 for my $curr_dom ( @pf_domains) {
465 if ($curr_dom->{seq_start} - $prev_dom->{seq_end} > $min_nodom) {
466 my %new_dom = (seq_start=>$prev_dom->{seq_end}+1, seq_end => $curr_dom->{seq_start}-1, info=>'NODOM');
467 push @npf_domains, \%new_dom;
468 }
469 push @npf_domains, $curr_dom;
470 $prev_dom = $curr_dom;
471 }
472 if ($seq_length - $prev_dom->{seq_end} > $min_nodom) {
473 my %new_dom = (seq_start=>$prev_dom->{seq_end}+1, seq_end=>$seq_length, info=>'NODOM');
474 if ($new_dom{seq_end} > $new_dom{seq_start}) {push @npf_domains, \%new_dom;}
475 }
476
477 # @npf_domains has both old @pf_domains and new neg-domains
478 @pf_domains = @npf_domains;
479 }
480
481 # now make sure we have useful names: colors
482
483 for my $pf (@pf_domains) {
484 $pf->{info} = domain_name($pf->{info}, $pf->{auto_pfamA}, $pf->{pfamA_acc});
485 }
486
487 my @feats = ();
488 for my $d_ref (@pf_domains) {
489 if ($lav) {
490 push @feats, [$d_ref->{seq_start}, $d_ref->{seq_end}, $d_ref->{info}];
491 }
492 else {
493 push @feats, [$d_ref->{seq_start}, '-', $d_ref->{seq_end}, $d_ref->{info} ];
494 # push @feats, [$d_ref->{seq_end}, ']', '-', ""];
495 }
496
497 }
498
499 return \@feats;
500 }
501
502 sub min {
503 my ($arg1, $arg2) = @_;
504
505 return ($arg1 <= $arg2 ? $arg1 : $arg2);
506 }
507
508 sub max {
509 my ($arg1, $arg2) = @_;
510
511 return ($arg1 >= $arg2 ? $arg1 : $arg2);
512 }
513
514 # domain name takes a uniprot domain label, removes comments ( ;
515 # truncated) and numbers and returns a canonical form. Thus:
516 # Cortactin 6.
517 # Cortactin 7; truncated.
518 # becomes "Cortactin"
519 #
520
521 sub domain_name {
522
523 my ($value, $pfamA_acc) = @_;
524 my $is_virtual = 0;
525
526 if ($value =~ m/^@/) {
527 $is_virtual = 1;
528 $value =~ s/^@//;
529 }
530
531 # check for clan:
532 if ($no_clans) {
533 if (! defined($domains{$value})) {
534 $domain_clan{$value} = 0;
535 $domains{$value} = ++$domain_cnt;
536 push @domain_list, $pfamA_acc;
537 }
538 }
539 elsif (!defined($domain_clan{$value})) {
540 ## only do this for new domains, old domains have known mappings
541
542 ## ways to highlight the same domain:
543 # (1) for clans, substitute clan name for family name
544 # (2) for clans, use the same color for the same clan, but don't change the name
545 # (3) for clans, combine family name with clan name, but use colors based on clan
546
547 # check to see if it's a clan
548 $get_pfam_clan->execute($pfamA_acc);
549
550 my $pfam_clan_href=0;
551
552 if ($pfam_clan_href=$get_pfam_clan->fetchrow_hashref()) { # is a clan
553 my ($clan_id, $clan_acc) = @{$pfam_clan_href}{qw(clan_id clan_acc)};
554
555 # now check to see if we have seen this clan before (if so, do not increment $domain_cnt)
556 my $c_value = "C." . $clan_id;
557 if ($pf_acc) {$c_value = $clan_acc;}
558
559 $domain_clan{$value} = {clan_id => $clan_id,
560 clan_acc => $clan_acc};
561
562 if ($domains{$c_value}) {
563 $domain_clan{$value}->{domain_cnt} = $domains{$c_value};
564 $value = $c_value;
565 }
566 else {
567 $domain_clan{$value}->{domain_cnt} = ++ $domain_cnt;
568 $value = $c_value;
569 $domains{$value} = $domain_cnt;
570 push @domain_list, $pfamA_acc;
571 }
572 }
573 else { # not a clan
574 $domain_clan{$value} = 0;
575 $domains{$value} = ++$domain_cnt;
576 push @domain_list, $pfamA_acc;
577 }
578 }
579 elsif ($domain_clan{$value} && $domain_clan{$value}->{clan_acc}) {
580 if ($pf_acc) {$value = $domain_clan{$value}->{clan_acc};}
581 else { $value = "C." . $domain_clan{$value}->{clan_id}; }
582 }
583
584 if ($is_virtual) {
585 $domains{'@'.$value} = $domains{$value};
586 $value = '@'.$value;
587 }
588 return $value;
589 }
590
591 sub domain_num {
592 my ($value, $number) = @_;
593 if ($value =~ m/^@/) {
594 $value =~ s/^@/v/;
595 # $number = $number."v";
596 }
597 return ($value, $number);
598 }
599
600 __END__
601
602 =pod
603
604 =head1 NAME
605
606 ann_feats.pl
607
608 =head1 SYNOPSIS
609
610 ann_pfam_e.pl --neg-doms 'sp|P09488|GSTM1_HUMAN' | accession.file
611
612 =head1 OPTIONS
613
614 -h short help
615 --help include description
616 --no-over : generate non-overlapping domains (equivalent to ann_pfam.pl)
617 --no-clans : do not use clans with multiple families from same clan
618 --neg-doms : report domains between annotated domains as NODOM
619 (also --neg, --neg_doms)
620 --min_nodom=10 : minimum length between domains for NODOM
621
622 --host, --user, --password, --port --db : info for mysql database
623
624 =head1 DESCRIPTION
625
626 C<ann_pfam_e.pl> extracts domain information from the pfam mysql
627 database.
628
629
630 Currently, the program works with database
631 sequence descriptions in several formats:
632
633 >gi|1705556|sp|P54670.1|CAF1_DICDI
634 >sp|P09488|GSTM1_HUMAN
635 >sp:CALM_HUMAN
636
637 C<ann_pfam_e.pl> uses the C<pfamA_reg_full_significant>, C<pfamseq>,
638 and C<pfamA> tables of the C<pfam> database to extract domain
639 information on a protein.
640
641 If the "--no-over" option is set, overlapping domains are selected and
642 edited to remove overlaps. For proteins with multiple overlapping
643 domains (domains overlap by more than 1/3 of the domain length),
644 C<ann_pfam_e.pl> selects the domain annotation with the best
645 C<domain_evalue_score>. When domains overlap by less than 1/3 of the
646 domain length, they are shortened to remove the overlap.
647
648 C<ann_pfam_e.pl> is designed to be used by the B<FASTA> programs with
649 the C<-V \!ann_pfam_e.pl> or C<-V "\!ann_pfam_e.pl --neg"> option.
650
651 =head1 AUTHOR
652
653 William R. Pearson, wrp@virginia.edu
654
655 =cut
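
The second half of the `--no-over` behaviour described in the POD above (neighbouring domains that overlap by less than 1/3 are shortened until they no longer overlap) can be sketched in isolation. This is only an illustrative sketch that mirrors the "smaller overlaps" loop of the removed script: the helper name `trim_overlaps` and the sample coordinates are hypothetical, and it assumes the domains are already sorted by `seq_start` and that larger overlaps were resolved earlier by picking the best evalue.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the FASTA scripts): given domain
# hashes sorted by seq_start, split any remaining overlap between
# neighbours so that adjacent domains no longer share residues.
sub trim_overlaps {
    my @doms = map { +{ %$_ } } @_;    # shallow copies; callers' hashes stay untouched
    for my $i (1 .. $#doms) {
        my ($prev, $curr) = @doms[$i - 1, $i];
        next if $prev->{seq_end} < $curr->{seq_start};    # no overlap
        my $overlap = $prev->{seq_end} - $curr->{seq_start};
        # the left domain gives up half of the overlap (rounded down) ...
        $prev->{seq_end} -= int($overlap / 2);
        # ... and the right domain starts immediately after it
        $curr->{seq_start} = $prev->{seq_end} + 1;
    }
    return @doms;
}

# Sample (made-up) coordinates: domains A and B share residues 91-100.
my @trimmed = trim_overlaps(
    { seq_start => 1,  seq_end => 100, info => 'A' },
    { seq_start => 91, seq_end => 200, info => 'B' },
);

# Print one tab-delimited line per domain: name, start, end.
printf "%s\t%d\t%d\n", @{$_}{qw(info seq_start seq_end)} for @trimmed;
```

Copying the hashes before editing keeps the sketch free of side effects; the original loop edits the fetched rows in place, which is equally valid when the rows are not reused.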
+0
-782
scripts/ann_pfam28.pl
0 #!/usr/bin/perl -w
1
2 ################################################################
3 # copyright (c) 2014,2015 by William R. Pearson and The Rector &
4 # Visitors of the University of Virginia */
5 ################################################################
6 # Licensed under the Apache License, Version 2.0 (the "License");
7 # you may not use this file except in compliance with the License.
8 # You may obtain a copy of the License at
9 #
10 # http://www.apache.org/licenses/LICENSE-2.0
11 #
12 # Unless required by applicable law or agreed to in writing,
13 # software distributed under this License is distributed on an "AS
14 # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
15 # express or implied. See the License for the specific language
16 # governing permissions and limitations under the License.
17 ################################################################
18
19 # ann_pfam.pl gets an annotation file from fasta36 -V with a line of the form:
20
21 # gi|62822551|sp|P00502|GSTA1_RAT Glutathione S-transfer\n (at least from pir1.lseg)
22 #
23 # it must:
24 # (1) read in the line
25 # (2) parse it to get the up_acc
26 # (3) return the tab delimited features
27 #
28
29 # this version only annotates sequences known to Pfam:pfamseq:
30 # and only provides domain information
31
32 use strict;
33
34 use DBI;
35 use Getopt::Long;
36 use Pod::Usage;
37
38 use vars qw($host $db $port $user $pass);
39
40 my $hostname = `/bin/hostname`;
41
42 ($host, $db, $port, $user, $pass) = ("wrpxdb.its.virginia.edu", "pfam28", 0, "web_user", "fasta_www");
43 #$host = 'xdb';
44 #$host = 'localhost';
45 #$db = 'RPD2_pfam28u';
46
47 my ($auto_reg,$rpd2_fams, $neg_doms, $vdoms, $lav, $no_doms, $no_clans, $pf_acc, $acc_comment, $bound_comment, $shelp, $help) =
48 (0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0,);
49 my ($no_over, $split_over, $over_fract) = (0, 0, 3.0);
50
51 my $color_sep_str = " :";
52 $color_sep_str = '~';
53
54 my ($min_nodom, $min_vdom) = (10,10);
55
56 GetOptions(
57 "host=s" => \$host,
58 "db=s" => \$db,
59 "user=s" => \$user,
60 "password=s" => \$pass,
61 "port=i" => \$port,
62 "lav" => \$lav,
63 "acc_comment" => \$acc_comment,
64 "bound_comment" => \$bound_comment,
65 "no-over" => \$no_over,
66 "no_over" => \$no_over,
67 "split-over" => \$split_over,
68 "split_over" => \$split_over,
69 "over_fract" => \$over_fract,
70 "over-fract" => \$over_fract,
71 "no-clans" => \$no_clans,
72 "no_clans" => \$no_clans,
73 "neg" => \$neg_doms,
74 "neg_doms" => \$neg_doms,
75 "neg-doms" => \$neg_doms,
76 "min_nodom=i" => \$min_nodom,
77 "vdoms" => \$vdoms,
78 "v_doms" => \$vdoms,
79 "pfacc" => \$pf_acc,
80 "RPD2" => \$rpd2_fams,
81 "auto_reg" => \$auto_reg,
82 "h|?" => \$shelp,
83 "help" => \$help,
84 );
85
86 pod2usage(1) if $shelp;
87 pod2usage(exitstatus => 0, verbose => 2) if $help;
88 pod2usage(1) unless (@ARGV || -p STDIN || -f STDIN);
89
90 my $connect = "dbi:mysql(AutoCommit=>1,RaiseError=>1):database=$db";
91 $connect .= ";host=$host" if $host;
92 $connect .= ";port=$port" if $port;
93
94 my $dbh = DBI->connect($connect,
95 $user,
96 $pass
97 ) or die $DBI::errstr;
98
99 my %annot_types = ();
100 my %domains = (NODOM=>0);
101 my %domain_clan = (NODOM => {clan_id => 'NODOM', clan_acc=>0, domain_cnt=>0});
102 my @domain_list = (0);
103 my $domain_cnt = 0;
104
105 my $pfamA_reg_full = 'pfamA_reg_full_significant';
106
107 my $get_annot_sub = \&get_pfam_annots;
108
109 my @pfam_fields = qw(seq_start seq_end model_start model_end model_length pfamA_acc pfamA_id auto_pfamA_reg_full domain_evalue_score as evalue length);
110
111 my $get_pfam_acc = $dbh->prepare(<<EOSQL);
112 SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length
113 FROM pfamseq
114 JOIN $pfamA_reg_full using(pfamseq_acc)
115 JOIN pfamA USING (pfamA_acc)
116 WHERE in_full = 1
117 AND pfamseq_acc=?
118 ORDER BY seq_start
119
120 EOSQL
121
122 my $get_pfam_refacc = $dbh->prepare(<<EOSQL);
123
124 SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length
125 FROM pfamseq
126 JOIN $pfamA_reg_full using(pfamseq_acc)
127 JOIN pfamA USING (pfamA_acc)
128 JOIN seqdb_demo2.annot as sa1 on(sa1.acc=pfamseq_acc and sa1.db='sp')
129 JOIN seqdb_demo2.annot as sa2 using(prot_id)
130 WHERE in_full = 1
131 AND sa2.acc=?
132 AND sa2.db='ref'
133 ORDER BY seq_start
134
135 EOSQL
136
137 my $get_annots_sql = $get_pfam_acc;
138
139 my $get_pfam_id = $dbh->prepare(<<EOSQL);
140
141 SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length
142 FROM pfamseq
143 JOIN $pfamA_reg_full using(pfamseq_acc)
144 JOIN pfamA USING (pfamA_acc)
145 WHERE in_full=1
146 AND pfamseq_id=?
147 ORDER BY seq_start
148
149 EOSQL
150
151 my $get_pfam_clan = $dbh->prepare(<<EOSQL);
152
153 SELECT clan_acc, clan_id
154 FROM clan
155 JOIN clan_membership using(clan_acc)
156 WHERE pfamA_acc=?
157
158 EOSQL
159
160 my $get_rpd2_clans = $dbh->prepare(<<EOSQL);
161
162 SELECT auto_pfamA, clan
163 FROM ljm_db.RPD2_final_fams
164 WHERE clan is not NULL
165
166 EOSQL
167
168 # -- LEFT JOIN clan_membership USING (auto_pfamA)
169 # -- LEFT JOIN clans using(auto_clan)
170
171 my ($tmp, $gi, $sdb, $acc, $id, $use_acc);
172
173 # get the query
174 my ($query, $seq_len) = @ARGV;
175 $seq_len = 0 unless defined($seq_len);
176
177 $query =~ s/^>// if ($query);
178
179 my @annots = ();
180
181 my %rpd2_clan_fams = ();
182
183 if ($rpd2_fams) {
184 $get_rpd2_clans->execute();
185 my ($auto_pfam, $auto_clan);
186 while (($auto_pfam, $auto_clan)=$get_rpd2_clans->fetchrow_array()) {
187 $rpd2_clan_fams{$auto_pfam} = $auto_clan;
188 }
189 }
190
191 #if it's a file I can open, read and parse it
192 unless ($query && $query =~ m/[\|:]/) {
193
194 while (my $a_line = <>) {
195 $a_line =~ s/^>//;
196 chomp $a_line;
197 push @annots, show_annots($a_line, $get_annot_sub);
198 }
199 }
200 else {
201 push @annots, show_annots("$query $seq_len", $get_annot_sub);
202 }
203
204 for my $seq_annot (@annots) {
205 print ">",$seq_annot->{seq_info},"\n";
206 for my $annot (@{$seq_annot->{list}}) {
207 if (!$lav && defined($domains{$annot->[-1]})) {
208 my ($a_name, $a_num) = domain_num($annot->[-1],$domains{$annot->[-1]});
209 $annot->[-1] = $a_name;
210 my $tmp_a_num = $a_num;
211 $tmp_a_num =~ s/v$//;
212 if ($acc_comment) {
213 $annot->[-1] .= "{$domain_list[$tmp_a_num]}";
214 }
215 if ($bound_comment) {
216 $annot->[-1] .= $color_sep_str.$annot->[0].":".$annot->[2];
217 }
218 $annot->[-1] .= $color_sep_str.$a_num;
219 }
220 print join("\t",@$annot),"\n";
221 }
222 }
223
224 exit(0);
225
226 sub show_annots {
227 my ($query_len, $get_annot_sub) = @_;
228
229 my ($annot_line, $seq_len) = split(/\t/,$query_len);
230
231 my $pfamA_acc;
232
233 my %annot_data = (seq_info=>$annot_line);
234
235 $use_acc = 1;
236 $get_annots_sql = $get_pfam_acc;
237
238 if ($annot_line =~ m/^pf\d+\|/) {
239 ($sdb, $gi, $pfamA_acc, $acc, $id) = split(/\|/,$annot_line);
240 # $dbh->do("use RPD2_pfam");
241 }
242 elsif ($annot_line =~ m/^gi\|/) {
243 ($tmp, $gi, $sdb, $acc, $id) = split(/\|/,$annot_line);
244 if ($sdb =~ m/ref/) {
245 $get_annots_sql = $get_pfam_refacc;
246 }
247 }
248 elsif ($annot_line =~ m/^(sp|tr)\|/) {
249 ($sdb, $acc, $id) = split(/\|/,$annot_line);
250 }
251 elsif ($annot_line =~ m/^ref\|/) {
252 ($sdb, $acc) = split(/\|/,$annot_line);
253 $get_annots_sql = $get_pfam_refacc;
254 }
255 elsif ($annot_line =~ m/^(SP|TR):/i) {
256 ($sdb, $id) = split(/:/,$annot_line);
257 $use_acc = 0;
258 }
259 elsif ($annot_line !~ m/\|/) { # new NCBI swissprot format
260 $use_acc =1;
261 $sdb = 'sp';
262 ($acc) = split(/\s+/,$annot_line);
263 }
264
265 # remove version number
266 unless ($use_acc) {
267 $get_annots_sql = $get_pfam_id;
268 $get_annots_sql->execute($id);
269 } else {
270 unless ($acc) {
271 warn "missing acc in $annot_line";
272 next;
273 } else {
274 $acc =~ s/\.\d+$//;
275 $get_annots_sql->execute($acc);
276 }
277 }
278
279 $annot_data{list} = $get_annot_sub->($get_annots_sql, $seq_len);
280
281 return \%annot_data;
282 }
283
284 sub get_pfam_annots {
285 my ($get_annots, $seq_length) = @_;
286
287 $seq_length = 0 unless $seq_length;
288
289 my @pf_domains = ();
290
291 # get the list of domains, sorted by start
292
293 # $row_href has: seq_start, seq_end, model_start, model_end, model_length,
294 # pfamA_acc, pfamA_id, auto_pfamA_reg_full,
295 # domain_evalue_score as evalue, length
296
297 while ( my $row_href = $get_annots->fetchrow_hashref()) {
298 if ($auto_reg) {
299 $row_href->{info} = $row_href->{auto_pfamA_reg_full};
300 } elsif ($pf_acc) {
301 $row_href->{info} = $row_href->{pfamA_acc};
302 } else {
303 $row_href->{info} = $row_href->{pfamA_id};
304 }
305
306 if ($row_href && $row_href->{length} > $seq_length && $seq_length == 0) {
307 $seq_length = $row_href->{length};
308 }
309
310 next if ($row_href->{seq_start} >= $seq_length);
311 if ($row_href->{seq_end} > $seq_length) {
312 $row_href->{seq_end} = $seq_length;
313 }
314
315 push @pf_domains, $row_href
316 }
317
318 # before checking for domain overlap, check for "split-domains"
319 # (self-unbound) by looking for runs of the same domain that are
320 # ordered by model_start
321
322 if (scalar(@pf_domains) > 1) {
323 my @j_domains; #joined domains
324 my @tmp_domains = @pf_domains;
325
326 my $prev_dom = shift(@tmp_domains);
327
328 for my $curr_dom (@tmp_domains) {
329 # to join domains:
330 # (1) the domains must be in order by model_start/end coordinates
331 # (2) joining the domains cannot make the total combination too long
332
333 # check for model and sequence consistency
334 if (($prev_dom->{pfamA_acc} eq $curr_dom->{pfamA_acc}) # same family
335 && $prev_dom->{model_start} < $curr_dom->{model_start} # model check
336 && $prev_dom->{model_end} < $curr_dom->{model_end}
337
338 && ($curr_dom->{model_start} > $prev_dom->{model_end} * 0.80 # limit overlap
339 || $curr_dom->{model_start} < $prev_dom->{model_end} * 1.25)
340 && ((($curr_dom->{model_end} - $curr_dom->{model_start}+1)/$curr_dom->{model_length} +
341 ($prev_dom->{model_end} - $prev_dom->{model_start}+1)/$prev_dom->{model_length}) < 1.33)
342 ) { # join them by updating $prev_dom
343 $prev_dom->{seq_end} = $curr_dom->{seq_end};
344 $prev_dom->{model_end} = $curr_dom->{model_end};
345 $prev_dom->{auto_pfamA_reg_full} = $prev_dom->{auto_pfamA_reg_full} . ";". $curr_dom->{auto_pfamA_reg_full};
346 $prev_dom->{evalue} = ($prev_dom->{evalue} < $curr_dom->{evalue} ? $prev_dom->{evalue} : $curr_dom->{evalue});
347 } else {
348 push @j_domains, $prev_dom;
349 $prev_dom = $curr_dom;
350 }
351 }
352 push @j_domains, $prev_dom;
353 @pf_domains = @j_domains;
354
355
356 if ($no_over) { # for either $no_over or $split_over, check for overlapping domains and edit/split them
357
358 my @tmp_domains = @pf_domains; # allow shifts from copy of @pf_domains
359 my @save_domains = (); # where the new domains go
360
361 my $prev_dom = shift @tmp_domains;
362
363 while (my $curr_dom = shift @tmp_domains) {
364
365 my @overlap_domains = ($prev_dom);
366
367 my $diff = $prev_dom->{seq_end} - $curr_dom->{seq_start};
368
369 my ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1,
370 $curr_dom->{seq_end}-$curr_dom->{seq_start}+1);
371
372 my $inclusion = ((($curr_dom->{seq_start} >= $prev_dom->{seq_start}) # start is right && end is left
373 && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})) || # -- curr inside prev
374 (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) # start is left && end is right
375 && ($curr_dom->{seq_end} >= $prev_dom->{seq_end}))); # -- prev is inside curr
376
377 my $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len;
378
379 # check for overlap > domain_length/$over_fract
380 while ($inclusion || ($diff > 0 && $diff > $longer_len/$over_fract)) {
381 push @overlap_domains, $curr_dom;
382 $curr_dom = shift @tmp_domains;
383 last unless $curr_dom;
384 $diff = $prev_dom->{seq_end} - $curr_dom->{seq_start};
385 ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, $curr_dom->{seq_end}-$curr_dom->{seq_start}+1);
386 $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len;
387 $inclusion = ((($curr_dom->{seq_start} >= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})) ||
388 (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} >= $prev_dom->{seq_end})));
389 }
390
391 # check for overlapping domains; >1 because $prev_dom is always there
392 if (scalar(@overlap_domains) > 1 ) {
393 # if $rpd2_fams, check for a chosen one
394
395 for my $dom ( @overlap_domains) {
396 $dom->{evalue} = 1.0 unless defined($dom->{evalue});
397 }
398
399 @overlap_domains = sort { $a->{evalue} <=> $b->{evalue} } @overlap_domains;
400 $prev_dom = $overlap_domains[0];
401 }
402
403 # $prev_dom should be the best of the overlaps, and we are no longer overlapping > dom_length/3
404 push @save_domains, $prev_dom;
405 $prev_dom = $curr_dom;
406 }
407
408 if ($prev_dom) {
409 push @save_domains, $prev_dom;
410 }
411
412 @pf_domains = @save_domains;
413
414 # now check for smaller overlaps
415 for (my $i=1; $i < scalar(@pf_domains); $i++) {
416 if ($pf_domains[$i-1]->{seq_end} >= $pf_domains[$i]->{seq_start}) {
417 my $overlap = $pf_domains[$i-1]->{seq_end} - $pf_domains[$i]->{seq_start};
418 $pf_domains[$i-1]->{seq_end} -= int($overlap/2);
419 $pf_domains[$i]->{seq_start} = $pf_domains[$i-1]->{seq_end}+1;
420 }
421 }
422 }
423 elsif ($split_over) { # here, everything that overlaps by > $min_vdom should be split into a separate domain
424 my @save_domains = (); # where the new domains go
425
426 # check to see if one domain is included (or overlapping) more
427 # than xx% of the other. If so, pick the longer one
428
429 my ($prev_dom, $curr_dom) = ($pf_domains[0],0) ;
430 for (my $i=1; $i < scalar(@pf_domains); $i++) {
431 $curr_dom = $pf_domains[$i];
432
433 my ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, $curr_dom->{seq_end}-$curr_dom->{seq_start}+1);
434 my $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len;
435
436 if (($curr_dom->{seq_start} >= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})
437 && $cur_len / $prev_len > 0.80) {
438 # $prev_dom stays the same, $curr_dom deleted
439 next;
440 }
441 elsif (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} >= $prev_dom->{seq_end})
442 && $prev_len / $cur_len > 0.80) {
443 $prev_dom = $curr_dom; # this should delete $prev_dom
444 next;
445 }
446
447 if ($prev_dom->{seq_end} >= $curr_dom->{seq_start} + $min_vdom) {
448 my ($l_seq_end, $r_seq_start) = ($curr_dom->{seq_start}-1, $prev_dom->{seq_end}+1);
449
450 $prev_dom->{seq_end} = $l_seq_end;
451 push @save_domains, $prev_dom;
452 my $new_dom = {seq_start => $l_seq_end+1, seq_end=>$r_seq_start-1,
453 model_length => -1,
454 pfamA_acc=>$prev_dom->{pfamA_acc}."/".$curr_dom->{pfamA_acc},
455 pfamA_id=>$prev_dom->{pfamA_id}."/".$curr_dom->{pfamA_id},
456 };
457
458 if ($pf_acc) {
459 $new_dom->{info} = $new_dom->{pfamA_acc};
460 }
461 else {
462 $new_dom->{info} = $new_dom->{pfamA_id};
463 }
464
465 push @save_domains, $new_dom;
466 $curr_dom->{seq_start} = $r_seq_start;
467 $prev_dom = $curr_dom;
468 }
469 else {
470 push @save_domains, $prev_dom;
471 $prev_dom = $curr_dom;
472 }
473 }
474 push @save_domains, $prev_dom;
475 @pf_domains = @save_domains;
476 }
477 }
478
479 # $vdoms -- virtual Pfam domains -- the equivalent of $neg_doms,
480 # but covering parts of a Pfam model that are not annotated. split
481 # domains have been joined, so simply check beginning and end of
482 # each domain (but must also check for bounded-ness)
483 # only add when 10% or more is missing and missing length > $min_nodom
484
485 if ($vdoms && scalar(@pf_domains)) {
486 my @vpf_domains;
487
488 my $curr_dom = $pf_domains[0];
489 my $length = $curr_dom->{length};
490
491 my $prev_dom={seq_end=>0, pfamA_acc=>''};
492 my $prev_dom_end = 0;
493 my $next_dom_start = $length+1;
494
495 for (my $dom_ix=0; $dom_ix < scalar(@pf_domains); $dom_ix++ ) {
496 $curr_dom = $pf_domains[$dom_ix];
497
498 my $pfamA = $curr_dom->{pfamA_acc};
499
500 # first, look left, is there a domain there (if there is,
501 # it should be updated right)
502
503 # my $min_vdom = $curr_dom->{model_length} / 10;
504
505 if ($curr_dom->{model_length} < $min_vdom) {
506 push @vpf_domains, $curr_dom;
507 next;
508 }
509 if ($prev_dom->{pfamA_acc}) { # look for previous domain
510 $prev_dom_end = $prev_dom->{seq_end};
511 }
512
513 # there is a domain to the left, how much room is available?
514 my $left_dom_len = min($curr_dom->{seq_start}-$prev_dom_end-1, $curr_dom->{model_start}-1);
515 if ( $left_dom_len > $min_vdom) {
516 # there is room for a virtual domain
517 my %new_dom = (seq_start=> $curr_dom->{seq_start}-$left_dom_len,
518 seq_end => $curr_dom->{seq_start}-1,
519 info=>'@'.$curr_dom->{info},
520 model_length=>$curr_dom->{model_length},
521 model_end => $curr_dom->{model_start}-1,
522 model_start => $left_dom_len,
523 pfamA_acc=>$pfamA,
524 );
525 push @vpf_domains, \%new_dom;
526 }
527
528 # save the current domain
529 push @vpf_domains, $curr_dom;
530 $prev_dom = $curr_dom;
531
532 if ($dom_ix < $#pf_domains) { # there is a domain to the right
533 # first, give all the extra space to the first domain (no splitting)
534 $next_dom_start = $pf_domains[$dom_ix+1]->{seq_start};
535 }
536 else {
537 $next_dom_start = $length;
538 }
539
540 # is there room for a virtual domain right
541
542 my $right_dom_len = min($next_dom_start-$curr_dom->{seq_end}-1, # space available
543 $curr_dom->{model_length}-$curr_dom->{model_end} # space needed
544 );
545 if ( $right_dom_len > $min_vdom) {
546 my %new_dom = (seq_start=> $curr_dom->{seq_end}+1,
547 seq_end=> $curr_dom->{seq_end}+$right_dom_len,
548 info=>'@'.$curr_dom->{info},
549 model_length => $curr_dom->{model_length},
550 pfamA_acc=> $pfamA,
551 );
552 push @vpf_domains, \%new_dom;
553 $prev_dom = \%new_dom;
554 }
555 } # all done, check for last one
556
557 # $curr_dom=$pf_domains[-1];
558 # # my $min_vdom = $curr_dom->{model_length}/10;
559
560 # my $right_dom_len = min($length - $curr_dom->{seq_end}+1, # space available
561 # $curr_dom->{model_length}-$curr_dom->{model_end} # space needed
562 # );
563 # if ($right_dom_len > $min_vdom) {
564 # my %new_dom = (seq_start=> $curr_dom->{seq_end}+1,
565 # seq_end => $curr_dom->{seq_end}+$right_dom_len,
566 # info=>'@'.$curr_dom->{pfamA_acc},
567 # model_len=> $curr_dom->{model_len},
568 # pfamA_acc => $curr_dom->{pfamA_acc},
569 # model_start => $curr_dom->{model_end}+1,
570 # model_end => $curr_dom->{model_len},
571 # );
572
573 # push @vpf_domains, \%new_dom;
574 # }
575
576 # @vpf_domains has both old @pf_domains and new neg-domains
577 @pf_domains = @vpf_domains;
578 }
579
580 if ($neg_doms) {
581 my @npf_domains;
582 my $prev_dom={seq_end=>0};
583 for my $curr_dom ( @pf_domains) {
584 if ($curr_dom->{seq_start} - $prev_dom->{seq_end} > $min_nodom) {
585 my %new_dom = (seq_start=>$prev_dom->{seq_end}+1, seq_end => $curr_dom->{seq_start}-1, info=>'NODOM');
586 push @npf_domains, \%new_dom;
587 }
588 push @npf_domains, $curr_dom;
589 $prev_dom = $curr_dom;
590 }
591 if ($seq_length - $prev_dom->{seq_end} > $min_nodom) {
592 my %new_dom = (seq_start=>$prev_dom->{seq_end}+1, seq_end=>$seq_length, info=>'NODOM');
593 if ($new_dom{seq_end} > $new_dom{seq_start}) {
594 push @npf_domains, \%new_dom;
595 }
596 }
597
598 # @npf_domains has both old @pf_domains and new neg-domains
599 @pf_domains = @npf_domains;
600 }
601
602 # now make sure we have useful names: colors
603
604 for my $pf (@pf_domains) {
605 $pf->{info} = domain_name($pf->{info}, $pf->{pfamA_acc});
606 }
607
608 my @feats = ();
609 for my $d_ref (@pf_domains) {
610 if ($lav) {
611 push @feats, [$d_ref->{seq_start}, $d_ref->{seq_end}, $d_ref->{info}];
612 } else {
613 push @feats, [$d_ref->{seq_start}, '-', $d_ref->{seq_end}, $d_ref->{info} ];
614 # push @feats, [$d_ref->{seq_end}, ']', '-', ""];
615 }
616
617 }
618
619 return \@feats;
620 }
621
622 sub min {
623 my ($arg1, $arg2) = @_;
624
625 return ($arg1 <= $arg2 ? $arg1 : $arg2);
626 }
627
628 sub max {
629 my ($arg1, $arg2) = @_;
630
631 return ($arg1 >= $arg2 ? $arg1 : $arg2);
632 }
633
634 # domain name takes a uniprot domain label, removes comments ( ;
635 # truncated) and numbers and returns a canonical form. Thus:
636 # Cortactin 6.
637 # Cortactin 7; truncated.
638 # becomes "Cortactin"
639 #
640
641 sub domain_name {
642
643 my ($value, $pfamA_acc) = @_;
644 my $is_virtual = 0;
645
646 if ($value =~ m/^@/) {
647 $is_virtual = 1;
648 $value =~ s/^@//;
649 }
650
651 # check for clan:
652 if ($no_clans) {
653 if (! defined($domains{$value})) {
654 $domain_clan{$value} = 0;
655 $domains{$value} = ++$domain_cnt;
656 push @domain_list, $pfamA_acc;
657 }
658 }
659 elsif (!defined($domain_clan{$value})) {
660 ## only do this for new domains, old domains have known mappings
661
662 ## ways to highlight the same domain:
663 # (1) for clans, substitute clan name for family name
664 # (2) for clans, use the same color for the same clan, but don't change the name
665 # (3) for clans, combine family name with clan name, but use colors based on clan
666
667 # check to see if it's a clan
668 $get_pfam_clan->execute($pfamA_acc);
669
670 my $pfam_clan_href=0;
671
672 if ($pfam_clan_href=$get_pfam_clan->fetchrow_hashref()) { # is a clan
673 my ($clan_id, $clan_acc) = @{$pfam_clan_href}{qw(clan_id clan_acc)};
674
675 # now check to see if we have seen this clan before (if so, do not increment $domain_cnt)
676 my $c_value = "C." . $clan_id;
677 if ($pf_acc) {$c_value = $clan_acc;}
678
679 $domain_clan{$value} = {clan_id => $clan_id,
680 clan_acc => $clan_acc};
681
682 if ($domains{$c_value}) {
683 $domain_clan{$value}->{domain_cnt} = $domains{$c_value};
684 $value = $c_value;
685 }
686 else {
687 $domain_clan{$value}->{domain_cnt} = ++ $domain_cnt;
688 $value = $c_value;
689 $domains{$value} = $domain_cnt;
690 push @domain_list, $pfamA_acc;
691 }
692 }
693 else { # not a clan
694 $domain_clan{$value} = 0;
695 $domains{$value} = ++$domain_cnt;
696 push @domain_list, $pfamA_acc;
697 }
698 }
699 elsif ($domain_clan{$value} && $domain_clan{$value}->{clan_acc}) {
700 if ($pf_acc) {$value = $domain_clan{$value}->{clan_acc};}
701 else { $value = "C." . $domain_clan{$value}->{clan_id}; }
702 }
703
704 if ($is_virtual) {
705 $domains{'@'.$value} = $domains{$value};
706 $value = '@'.$value;
707 }
708 return $value;
709 }
710
711 sub domain_num {
712 my ($value, $number) = @_;
713 if ($value =~ m/^@/) {
714 $value =~ s/^@/v/;
715 $number = $number."v";
716 }
717 return ($value, $number);
718 }
719
720
721 __END__
722
723 =pod
724
725 =head1 NAME
726
727 ann_pfam28.pl
728
729 =head1 SYNOPSIS
730
731 ann_pfam28.pl --neg-doms --vdoms 'sp|P09488|GSTM1_HUMAN' | accession.file
732
733 =head1 OPTIONS
734
735 -h short help
736 --help include description
737 --no-over : generate non-overlapping domains (equivalent to ann_pfam.pl)
738 --split-over : overlaps of two domains generate a new hybrid domain
739 --no-clans : do not use clans with multiple families from same clan
740 --neg-doms : report domains between annotated domains as NODOM
741 (also --neg, --neg_doms)
742 --vdoms : produce "virtual domains" using model_start,
743 model_end for partial pfam domains
744 --min_nodom=10 : minimum length between domains for NODOM
745
746 --host, --user, --password, --port, --db : info for mysql database
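
The C<--neg-doms> and C<--min_nodom> options can be illustrated with a minimal sketch. It assumes domains are already sorted by start and non-overlapping; the helper name C<add_nodoms> and the sample coordinates are hypothetical, chosen only to mirror the gap test used in the script's NODOM block.

```perl
use strict;
use warnings;

# Hypothetical helper: insert NODOM features into gaps longer than $min_nodom.
# $domains is a reference to a list of hashes with seq_start, seq_end, info.
sub add_nodoms {
    my ($domains, $seq_length, $min_nodom) = @_;
    my @out;
    my $prev_end = 0;
    for my $d (@$domains) {
        # gap between the previous domain end and this domain start
        if ($d->{seq_start} - $prev_end > $min_nodom) {
            push @out, { seq_start => $prev_end + 1,
                         seq_end   => $d->{seq_start} - 1,
                         info      => 'NODOM' };
        }
        push @out, $d;
        $prev_end = $d->{seq_end};
    }
    # gap between the last domain and the end of the sequence
    if ($seq_length - $prev_end > $min_nodom) {
        push @out, { seq_start => $prev_end + 1,
                     seq_end   => $seq_length,
                     info      => 'NODOM' };
    }
    return \@out;
}

# Illustrative coordinates only (not taken from the Pfam database).
my $feats = add_nodoms(
    [ { seq_start => 21,  seq_end => 100, info => 'GST_N' },
      { seq_start => 120, seq_end => 200, info => 'GST_C' } ],
    218, 10);
print join("\t", @{$_}{qw(seq_start seq_end info)}), "\n" for @$feats;
```

With these inputs the sketch reports NODOM regions at 1-20, 101-119 and 201-218 around the two domains.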
747
748 =head1 DESCRIPTION
749
750 C<ann_pfam28.pl> extracts domain information from the pfam mysql
751 database. Currently, the program works with database
752 sequence descriptions in several formats:
753
754 >gi|1705556|sp|P54670.1|CAF1_DICDI
755 >sp|P09488|GSTM1_HUMAN
756 >sp:CALM_HUMAN
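
As a rough illustration of how these three description formats map to a database lookup key, the following sketch extracts either an accession (with any version suffix removed) or a sequence identifier. The function name C<parse_header> and its return convention are assumptions made for this example; the script itself does the equivalent work inside C<show_annots> and handles more formats.

```perl
use strict;
use warnings;

# Hypothetical helper: return ('acc', $accession) or ('id', $identifier),
# or an empty list when the format is not recognized.
sub parse_header {
    my ($line) = @_;
    $line =~ s/^>//;
    if ($line =~ m/^gi\|/) {
        # gi|<gi>|<db>|<acc>|<id>
        my (undef, undef, undef, $acc) = split(/\|/, $line);
        $acc =~ s/\.\d+$//;    # drop version suffix, e.g. P54670.1 -> P54670
        return ('acc', $acc);
    }
    elsif ($line =~ m/^(?:sp|tr)\|/) {
        # sp|<acc>|<id>
        my (undef, $acc) = split(/\|/, $line);
        $acc =~ s/\.\d+$//;
        return ('acc', $acc);
    }
    elsif ($line =~ m/^(?:sp|tr):/i) {
        # sp:<id> has no accession, so look up by identifier
        my (undef, $id) = split(/:/, $line);
        return ('id', $id);
    }
    return;
}

for my $hdr ('>gi|1705556|sp|P54670.1|CAF1_DICDI',
             '>sp|P09488|GSTM1_HUMAN',
             '>sp:CALM_HUMAN') {
    my ($type, $key) = parse_header($hdr);
    print "$hdr => $type $key\n";
}
```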
757
758 C<ann_pfam28.pl> uses the C<pfamA_reg_full_significant>, C<pfamseq>,
759 and C<pfamA> tables of the C<pfam> database to extract domain
760 information on a protein.
761
762 If the C<--no-over> option is set, overlapping domains are selected and
763 edited to remove overlaps. For proteins with multiple overlapping
764 domains (domains overlap by more than 1/3 of the domain length),
765 C<ann_pfam28.pl> selects the domain annotation with the best
766 C<domain_evalue_score>. When domains overlap by less than 1/3 of the
767 domain length, they are shortened to remove the overlap.
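
The shortening step for small overlaps can be sketched as follows. This is a simplified illustration under the assumption that the two domains are adjacent in start order and that the large-overlap selection has already happened; C<trim_overlap> is a hypothetical name, and the arithmetic follows the script's "smaller overlaps" loop.

```perl
use strict;
use warnings;

# Hypothetical helper: shorten two neighbouring domains so they no longer overlap.
sub trim_overlap {
    my ($prev, $curr) = @_;
    if ($prev->{seq_end} >= $curr->{seq_start}) {
        my $overlap = $prev->{seq_end} - $curr->{seq_start};
        # pull the left domain back by half the overlap ...
        $prev->{seq_end} -= int($overlap / 2);
        # ... and start the right domain immediately after it
        $curr->{seq_start} = $prev->{seq_end} + 1;
    }
    return ($prev, $curr);
}

# Illustrative coordinates only.
my ($left, $right) = trim_overlap({ seq_start => 10, seq_end => 60 },
                                  { seq_start => 51, seq_end => 120 });
print "$left->{seq_start}-$left->{seq_end} $right->{seq_start}-$right->{seq_end}\n";
```

For these inputs the left domain becomes 10-56 and the right domain becomes 57-120, so the boundary falls roughly in the middle of the former overlap.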
768
769 If the C<--split-over> option is set and two domains overlap, the
770 overlapping region is split out of the domains and labeled as a new,
771 virtual-like, domain. If one domain is internal to another and spans
772 more than 80% of it, the shorter domain is removed.
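
A minimal sketch of the split step is shown below. It assumes two domains sorted by start, ignores the inclusion test, and uses a hypothetical helper C<split_overlap> with a C<$min_overlap> argument that plays the role of the script's C<$min_vdom> threshold.

```perl
use strict;
use warnings;

# Hypothetical helper: carve the overlapping region of two domains out
# into a third, hybrid domain named "<left>/<right>".
sub split_overlap {
    my ($prev, $curr, $min_overlap) = @_;
    # only split when the overlap reaches the minimum length
    return ($prev, $curr)
        unless $prev->{seq_end} >= $curr->{seq_start} + $min_overlap;

    my %hybrid = (
        seq_start => $curr->{seq_start},
        seq_end   => $prev->{seq_end},
        pfamA_id  => $prev->{pfamA_id} . "/" . $curr->{pfamA_id},
    );
    $prev->{seq_end}   = $hybrid{seq_start} - 1;   # left domain stops before the overlap
    $curr->{seq_start} = $hybrid{seq_end} + 1;     # right domain starts after it
    return ($prev, \%hybrid, $curr);
}

# Illustrative domains only.
my @doms = split_overlap({ seq_start => 1,  seq_end => 100, pfamA_id => 'A' },
                         { seq_start => 81, seq_end => 200, pfamA_id => 'B' },
                         10);
print join("\t", @{$_}{qw(seq_start seq_end pfamA_id)}), "\n" for @doms;
```

Here the result is three features: A at 1-80, the hybrid A/B at 81-100, and B at 101-200.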
773
774 C<ann_pfam28.pl> is designed to be used by the B<FASTA> programs with
775 the C<-V \!ann_pfam28.pl> or C<-V "\!ann_pfam28.pl --neg"> option.
776
777 =head1 AUTHOR
778
779 William R. Pearson, wrp@virginia.edu
780
781 =cut
+0
-859
scripts/ann_pfam30.pl
0 #!/usr/bin/perl -w
1
2 ################################################################
3 # copyright (c) 2014,2015 by William R. Pearson and The Rector &
4 # Visitors of the University of Virginia */
5 ################################################################
6 # Licensed under the Apache License, Version 2.0 (the "License");
7 # you may not use this file except in compliance with the License.
8 # You may obtain a copy of the License at
9 #
10 # http://www.apache.org/licenses/LICENSE-2.0
11 #
12 # Unless required by applicable law or agreed to in writing,
13 # software distributed under this License is distributed on an "AS
14 # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
15 # express or implied. See the License for the specific language
16 # governing permissions and limitations under the License.
17 ################################################################
18
19 # ann_pfam.pl gets an annotation file from fasta36 -V with a line of the form:
20
21 # gi|62822551|sp|P00502|GSTA1_RAT Glutathione S-transfer\n (at least from pir1.lseg)
22 #
23 # it must:
24 # (1) read in the line
25 # (2) parse it to get the up_acc
26 # (3) return the tab delimited features
27 #
28
29 # this is the first version that works with the new Pfam strategy of
30 # separating Uniprot reference sequences from the rest of uniprot. as
31 # a result, it is possible that 2 SQL queries will be required, one to
32 # pfamA_reg_full_significant and a second to uniprot_reg_full.
33
34 # modified 15-Jan-2017 to reduce the number of calls when the same
35 # accession is present multiple times. Accessions are saved in a hash
36 # that ensures uniqueness. (Could also speed things up by creating a temporary table.)
37 #
38
39
40 use strict;
41
42 use DBI;
43 use Getopt::Long;
44 use Pod::Usage;
45
46 use vars qw($host $db $port $user $pass);
47
48 my $hostname = `/bin/hostname`;
49
50 ($host, $db, $port, $user, $pass) = ("wrpxdb.its.virginia.edu", "pfam31", 0, "web_user", "fasta_www");
51 #$host = 'xdb';
52 #$host = 'localhost';
53 #$db = 'RPD2_pfam28u';
54
55 my ($auto_reg,$rpd2_fams, $neg_doms, $vdoms, $lav, $no_doms, $no_clans, $pf_acc, $acc_comment, $bound_comment, $shelp, $help) =
56 (0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0,);
57 my ($no_over, $split_over, $over_fract) = (0, 0, 3.0);
58
59 my ($color_sep_str, $show_color) = (" :",1);
60 $color_sep_str = '~';
61
62 my ($min_nodom, $min_vdom) = (10,10);
63
64 GetOptions(
65 "host=s" => \$host,
66 "db=s" => \$db,
67 "user=s" => \$user,
68 "password=s" => \$pass,
69 "port=i" => \$port,
70 "lav" => \$lav,
71 "acc_comment" => \$acc_comment,
72 "bound_comment" => \$bound_comment,
73 "color!" => \$show_color,
74 "no-over" => \$no_over,
75 "no_over" => \$no_over,
76 "split-over" => \$split_over,
77 "split_over" => \$split_over,
78 "over_fract=f" => \$over_fract,
79 "over-fract=f" => \$over_fract,
80 "no-clans" => \$no_clans,
81 "no_clans" => \$no_clans,
82 "neg" => \$neg_doms,
83 "neg_doms" => \$neg_doms,
84 "neg-doms" => \$neg_doms,
85 "min_nodom=i" => \$min_nodom,
86 "vdoms" => \$vdoms,
87 "v_doms" => \$vdoms,
88 "pfacc" => \$pf_acc,
89 "RPD2" => \$rpd2_fams,
90 "auto_reg" => \$auto_reg,
91 "h|?" => \$shelp,
92 "help" => \$help,
93 );
94
95 pod2usage(1) if $shelp;
96 pod2usage(exitstatus => 0, verbose => 2) if $help;
97 pod2usage(1) unless (@ARGV || -p STDIN || -f STDIN);
98
99 my $connect = "dbi:mysql(AutoCommit=>1,RaiseError=>1):database=$db";
100 $connect .= ";host=$host" if $host;
101 $connect .= ";port=$port" if $port;
102
103 my $dbh = DBI->connect($connect,
104 $user,
105 $pass
106 ) or die $DBI::errstr;
107
108 my %annot_types = ();
109 my %domains = (NODOM=>0);
110 my %domain_clan = (NODOM => {clan_id => 'NODOM', clan_acc=>0, domain_cnt=>0});
111 my @domain_list = (0);
112 my $domain_cnt = 0;
113
114 my $pfamA_reg_full = 'pfamA_reg_full_significant';
115 my $uniprot_reg_full = 'uniprot_reg_full';
116
117 my $get_annot_sub = \&get_pfam_annots;
118
119 my @pfam_fields = qw(seq_start seq_end model_start model_end model_length pfamA_acc pfamA_id auto_pfamA_reg_full domain_evalue_score as evalue length);
120 my @upfam_fields = qw(seq_start seq_end model_start model_end model_length pfamA_acc pfamA_id auto_uniprot_reg_full domain_evalue_score as evalue length);
121
122 my $get_pfam_acc = $dbh->prepare(<<EOSQL);
123 SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length
124 FROM pfamseq
125 JOIN pfamA_reg_full_significant using(pfamseq_acc)
126 JOIN pfamA USING (pfamA_acc)
127 WHERE in_full = 1
128 AND pfamseq_acc=?
129 ORDER BY seq_start
130
131 EOSQL
132
133 my $get_upfam_acc = $dbh->prepare(<<EOSQL);
134 SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_uniprot_reg_full as auto_pfamA_reg_full, domain_evalue_score as evalue, length
135 FROM uniprot
136 JOIN uniprot_reg_full using(uniprot_acc)
137 JOIN pfamA USING (pfamA_acc)
138 WHERE in_full = 1
139 AND uniprot_acc=?
140 ORDER BY seq_start
141
142 EOSQL
143
144 my $get_pfam_refacc = $dbh->prepare(<<EOSQL);
145 SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length
146 FROM $pfamA_reg_full
147 JOIN pfamseq using(pfamseq_acc)
148 JOIN pfamA USING (pfamA_acc)
149 JOIN uniprot.refseq2up as rf2up on(rf2up.up_acc=pfamseq_acc)
150 WHERE in_full = 1
151 AND rf2up.refseq_acc=?
152 ORDER BY seq_start
153
154 EOSQL
155
156 my $get_upfam_refacc = $dbh->prepare(<<EOSQL);
157 SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_uniprot_reg_full as auto_pfamA_reg_full, domain_evalue_score as evalue, length
158 FROM uniprot
159 JOIN uniprot_reg_full using(uniprot_acc)
160 JOIN pfamA USING (pfamA_acc)
161 JOIN uniprot.refseq2up as rf2up on(rf2up.up_acc=uniprot_acc)
162 WHERE in_full = 1
163 AND refseq_acc=?
164 ORDER BY seq_start
165
166 EOSQL
167
168 my $get_annots_sql = $get_pfam_acc;
169
170 my $get_pfam_id = $dbh->prepare(<<EOSQL);
171 SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length
172 FROM pfamseq
173 JOIN $pfamA_reg_full using(pfamseq_acc)
174 JOIN pfamA USING (pfamA_acc)
175 WHERE in_full=1
176 AND pfamseq_id=?
177 ORDER BY seq_start
178
179 EOSQL
180
181 my $get_upfam_id = $dbh->prepare(<<EOSQL);
182 SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_uniprot_reg_full as auto_pfamA_reg_full, domain_evalue_score as evalue, length
183 FROM uniprot
184 JOIN uniprot_reg_full using(uniprot_acc)
185 JOIN pfamA USING (pfamA_acc)
186 WHERE in_full=1
187 AND uniprot_id=?
188 ORDER BY seq_start
189
190 EOSQL
191
192 my $get_pfam_clan = $dbh->prepare(<<EOSQL);
193
194 SELECT clan_acc, clan_id
195 FROM clan
196 JOIN clan_membership using(clan_acc)
197 WHERE pfamA_acc=?
198
199 EOSQL
200
201 my $get_rpd2_clans = $dbh->prepare(<<EOSQL);
202
203 SELECT auto_pfamA, clan
204 FROM ljm_db.RPD2_final_fams
205 WHERE clan is not NULL
206
207 EOSQL
208
209 # -- LEFT JOIN clan_membership USING (auto_pfamA)
210 # -- LEFT JOIN clans using(auto_clan)
211
212 my ($tmp, $gi, $sdb, $acc, $id, $use_acc);
213
214 # get the query
215 my ($query, $seq_len) = @ARGV;
216 $seq_len = 0 unless defined($seq_len);
217
218 $query =~ s/^>// if ($query);
219
220 my @annots = ();
221 my %annot_set = ();
222
223 my %rpd2_clan_fams = ();
224
225 if ($rpd2_fams) {
226 $get_rpd2_clans->execute();
227 my ($auto_pfam, $auto_clan);
228 while (($auto_pfam, $auto_clan)=$get_rpd2_clans->fetchrow_array()) {
229 $rpd2_clan_fams{$auto_pfam} = $auto_clan;
230 }
231 }
232
233 #if it's a file I can open, read and parse it
234 unless ($query && ($query =~ m/[\|:]/ ||
235 $query =~ m/^[NX]P_/ ||
236 $query =~ m/^[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}\s/)) {
237
238 while (my $a_line = <>) {
239 $a_line =~ s/^>//;
240 chomp $a_line;
241 push @annots, show_annots($a_line, $get_annot_sub);
242 }
243 }
244 else {
245 push @annots, show_annots("$query\t$seq_len", $get_annot_sub);
246 }
247
248 for my $seq_annot (@annots) {
249 next unless $seq_annot;
250 my $annot_r = $annot_set{$seq_annot};
251 print ">",$annot_r->{seq_info},"\n";
252 for my $annot (@{$annot_r->{list}}) {
253 if (!$lav && defined($domains{$annot->[-1]})) {
254 my ($a_name, $a_num) = domain_num($annot->[-1],$domains{$annot->[-1]});
255 $annot->[-1] = $a_name;
256 my $tmp_a_num = $a_num;
257 $tmp_a_num =~ s/v$//;
258 if ($acc_comment) {
259 $annot->[-1] .= "{$domain_list[$tmp_a_num]}";
260 }
261 if ($bound_comment) {
262 $annot->[-1] .= $color_sep_str.$annot->[0].":".$annot->[2];
263 }
264 elsif ($show_color) {
265 $annot->[-1] .= $color_sep_str.$a_num;
266 }
267 }
268 print join("\t",@$annot),"\n";
269 }
270 }
271
272 exit(0);
273
274 sub show_annots {
275 my ($query_len, $get_annot_sub) = @_;
276
277 my ($annot_line, $seq_len) = split(/\t/,$query_len);
278
279 my $pfamA_acc;
280
281 $use_acc = 1;
282 $get_annots_sql = $get_pfam_acc;
283
284 my $get_annots_sql_u = $get_upfam_acc;
285
286 if ($annot_line =~ m/^pf\d+\|/) {
287 ($sdb, $gi, $pfamA_acc, $acc, $id) = split(/\|/,$annot_line);
288 # $dbh->do("use RPD2_pfam");
289 }
290 elsif ($annot_line =~ m/^gi\|/) {
291 ($tmp, $gi, $sdb, $acc, $id) = split(/\|/,$annot_line);
292 if ($sdb =~ m/ref/) {
293 $get_annots_sql = $get_pfam_refacc;
294 $get_annots_sql_u = $get_upfam_refacc;
295 }
296 }
297 elsif ($annot_line =~ m/^(sp|tr|up)\|/) {
298 ($sdb, $acc, $id) = split(/\|/,$annot_line);
299 }
300 elsif ($annot_line =~ m/^ref\|/) {
301 ($sdb, $acc) = split(/\|/,$annot_line);
302 $get_annots_sql = $get_pfam_refacc;
303 $get_annots_sql_u = $get_upfam_refacc;
304 }
305 elsif ($annot_line =~ m/^(SP|TR):/i) {
306 ($sdb, $id) = split(/:/,$annot_line);
307 $use_acc = 0;
308 }
309 elsif ($annot_line !~ m/\|/ && $annot_line !~ m/:/) {
310 $use_acc = 1;
311 ($acc) = split(/\s+/,$annot_line);
312 }
313 # deal with no-database SwissProt/NR
314 else {
315 ($acc)=($annot_line =~ /^(\S+)/);
316 }
317
318 # here we have an $acc or an $id: check to see if we have the data
319
320 my %annot_data = (seq_info=>$annot_line);
321 my $annot_key = '';
322 unless ($use_acc) {
323 next if ($annot_set{$id});
324 $annot_set{$id} = \%annot_data;
325 $annot_key = $id;
326
327 $get_annots_sql = $get_pfam_id;
328 $get_annots_sql->execute($id);
329 unless ($get_annots_sql->rows()) {
330 $get_annots_sql = $get_upfam_id; # fall back to the uniprot lookup by id, not by accession
331 $get_annots_sql->execute($id);
332 }
333 } else {
334 unless ($acc) {
335 warn "missing acc in $annot_line";
336 return "";
337 }
338 else {
339 $acc =~ s/\.\d+$//;
340
341 $annot_key = $acc;
342 if ($annot_set{$acc}) {
343 goto ret_label;
344 }
345 $annot_set{$acc} = \%annot_data;
346
347 $get_annots_sql->execute($acc);
348 unless ($get_annots_sql->rows()) {
349 $get_annots_sql = $get_annots_sql_u;
350 $get_annots_sql->execute($acc);
351 }
352 }
353 }
354
355 $annot_data{list} = $get_annot_sub->($get_annots_sql, $seq_len);
356
357 ret_label:
358 return $annot_key;
359 }
360
361 sub get_pfam_annots {
362 my ($get_annots, $seq_length) = @_;
363
364 $seq_length = 0 unless $seq_length;
365
366 my @pf_domains = ();
367
368 # get the list of domains, sorted by start
369
370 # $row_href has: seq_start, seq_end, model_start, model_end, model_length,
371 # pfamA_acc, pfamA_id, auto_pfamA_reg_full,
372 # domain_evalue_score as evalue, length
373
374 while ( my $row_href = $get_annots->fetchrow_hashref()) {
375 if ($auto_reg) {
376 $row_href->{info} = $row_href->{auto_pfamA_reg_full};
377 } elsif ($pf_acc) {
378 $row_href->{info} = $row_href->{pfamA_acc};
379 } else {
380 $row_href->{info} = $row_href->{pfamA_id};
381 }
382
383 if ($row_href && $row_href->{length} > $seq_length && $seq_length == 0) {
384 $seq_length = $row_href->{length};
385 }
386
387 next if ($row_href->{seq_start} >= $seq_length);
388 if ($row_href->{seq_end} > $seq_length) {
389 $row_href->{seq_end} = $seq_length;
390 }
391
392 push @pf_domains, $row_href
393 }
394
395 # before checking for domain overlap, check for "split-domains"
396 # (self-unbound) by looking for runs of the same domain that are
397 # ordered by model_start
398
399 if (scalar(@pf_domains) > 1) {
400 my @j_domains; #joined domains
401 my @tmp_domains = @pf_domains;
402
403 my $prev_dom = shift(@tmp_domains);
404
405 for my $curr_dom (@tmp_domains) {
406 # to join domains:
407 # (1) the domains must be in order by model_start/end coordinates
408 # (3) joining the domains cannot make the total combination too long
409
410 # check for model and sequence consistency
411 if (($prev_dom->{pfamA_acc} eq $curr_dom->{pfamA_acc}) # same family
412 && $prev_dom->{model_start} < $curr_dom->{model_start} # model check
413 && $prev_dom->{model_end} < $curr_dom->{model_end}
414
415 && ($curr_dom->{model_start} > $prev_dom->{model_end} * 0.80 # limit overlap
416 || $curr_dom->{model_start} < $prev_dom->{model_end} * 1.25)
417 && ((($curr_dom->{model_end} - $curr_dom->{model_start}+1)/$curr_dom->{model_length} +
418 ($prev_dom->{model_end} - $prev_dom->{model_start}+1)/$prev_dom->{model_length}) < 1.33)
419 ) { # join them by updating $prev_dom
420 $prev_dom->{seq_end} = $curr_dom->{seq_end};
421 $prev_dom->{model_end} = $curr_dom->{model_end};
422 $prev_dom->{auto_pfamA_reg_full} = $prev_dom->{auto_pfamA_reg_full} . ";". $curr_dom->{auto_pfamA_reg_full};
423 $prev_dom->{evalue} = ($prev_dom->{evalue} < $curr_dom->{evalue} ? $prev_dom->{evalue} : $curr_dom->{evalue});
424 } else {
425 push @j_domains, $prev_dom;
426 $prev_dom = $curr_dom;
427 }
428 }
429 push @j_domains, $prev_dom;
430 @pf_domains = @j_domains;
431
432
433 if ($no_over) { # for either $no_over or $split_over, check for overlapping domains and edit/split them
434
435 my @tmp_domains = @pf_domains; # allow shifts from copy of @pf_domains
436 my @save_domains = (); # where the new domains go
437
438 my $prev_dom = shift @tmp_domains;
439
440 while (my $curr_dom = shift @tmp_domains) {
441
442 my @overlap_domains = ($prev_dom);
443
444 my $diff = $prev_dom->{seq_end} - $curr_dom->{seq_start};
445
446 my ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1,
447 $curr_dom->{seq_end}-$curr_dom->{seq_start}+1);
448
449 my $inclusion = ((($curr_dom->{seq_start} >= $prev_dom->{seq_start}) # start is right && end is left
450 && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})) || # -- curr inside prev
451 (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) # start is left && end is right
452 && ($curr_dom->{seq_end} >= $prev_dom->{seq_end}))); # -- prev is inside curr
453
454 my $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len;
455
456 # check for overlap > domain_length/$over_fract
457 while ($inclusion || ($diff > 0 && $diff > $longer_len/$over_fract)) {
458 push @overlap_domains, $curr_dom;
459 $curr_dom = shift @tmp_domains;
460 last unless $curr_dom;
461 $diff = $prev_dom->{seq_end} - $curr_dom->{seq_start};
462 ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, $curr_dom->{seq_end}-$curr_dom->{seq_start}+1);
463 $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len;
464 $inclusion = ((($curr_dom->{seq_start} >= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})) ||
465 (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} >= $prev_dom->{seq_end})));
466 }
467
468 # check for overlapping domains; >1 because $prev_dom is always there
469 if (scalar(@overlap_domains) > 1 ) {
470 # if $rpd2_fams, check for a chosen one
471
472 for my $dom ( @overlap_domains) {
473 $dom->{evalue} = 1.0 unless defined($dom->{evalue});
474 }
475
476 @overlap_domains = sort { $a->{evalue} <=> $b->{evalue} } @overlap_domains;
477 $prev_dom = $overlap_domains[0];
478 }
479
480 # $prev_dom should be the best of the overlaps, and we are no longer overlapping > dom_length/3
481 push @save_domains, $prev_dom;
482 $prev_dom = $curr_dom;
483 }
484
485 if ($prev_dom) {
486 push @save_domains, $prev_dom;
487 }
488
489 @pf_domains = @save_domains;
490
491 # now check for smaller overlaps
492 for (my $i=1; $i < scalar(@pf_domains); $i++) {
493 if ($pf_domains[$i-1]->{seq_end} >= $pf_domains[$i]->{seq_start}) {
494 my $overlap = $pf_domains[$i-1]->{seq_end} - $pf_domains[$i]->{seq_start};
495 $pf_domains[$i-1]->{seq_end} -= int($overlap/2);
496 $pf_domains[$i]->{seq_start} = $pf_domains[$i-1]->{seq_end}+1;
497 }
498 }
499 }
500 elsif ($split_over) { # here, everything that overlaps by > $min_vdom should be split into a separate domain
501 my @save_domains = (); # where the new domains go
502
503 # check to see if one domain is included (or overlapping) more
504 # than xx% of the other. If so, pick the longer one
505
506 my ($prev_dom, $curr_dom) = ($pf_domains[0],0) ;
507 for (my $i=1; $i < scalar(@pf_domains); $i++) {
508 $curr_dom = $pf_domains[$i];
509
510 my ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, $curr_dom->{seq_end}-$curr_dom->{seq_start}+1);
511 my $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len;
512
513 if (($curr_dom->{seq_start} >= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})
514 && $cur_len / $prev_len > 0.80) {
515 # $prev_dom stays the same, $curr_dom deleted
516 next;
517 }
518 elsif (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} >= $prev_dom->{seq_end})
519 && $prev_len / $cur_len > 0.80) {
520 $prev_dom = $curr_dom; # this should delete $prev_dom
521 next;
522 }
523
524 if ($prev_dom->{seq_end} >= $curr_dom->{seq_start} + $min_vdom) {
525 my ($l_seq_end, $r_seq_start) = ($curr_dom->{seq_start}-1, $prev_dom->{seq_end}+1);
526
527 $prev_dom->{seq_end} = $l_seq_end;
528 push @save_domains, $prev_dom;
529 my $new_dom = {seq_start => $l_seq_end+1, seq_end=>$r_seq_start-1,
530 model_length => -1,
531 pfamA_acc=>$prev_dom->{pfamA_acc}."/".$curr_dom->{pfamA_acc},
532 pfamA_id=>$prev_dom->{pfamA_id}."/".$curr_dom->{pfamA_id},
533 };
534
535 if ($pf_acc) {
536 $new_dom->{info} = $new_dom->{pfamA_acc};
537 }
538 else {
539 $new_dom->{info} = $new_dom->{pfamA_id};
540 }
541
542 push @save_domains, $new_dom;
543 $curr_dom->{seq_start} = $r_seq_start;
544 $prev_dom = $curr_dom;
545 }
546 else {
547 push @save_domains, $prev_dom;
548 $prev_dom = $curr_dom;
549 }
550 }
551 push @save_domains, $prev_dom;
552 @pf_domains = @save_domains;
553 }
554 }
555
556 # $vdoms -- virtual Pfam domains -- the equivalent of $neg_doms,
557 # but covering parts of a Pfam model that are not annotated. split
558 # domains have been joined, so simply check beginning and end of
559 # each domain (but must also check for bounded-ness)
560 # only add when 10% or more is missing and missing length > $min_nodom
561
562 if ($vdoms && scalar(@pf_domains)) {
563 my @vpf_domains;
564
565 my $curr_dom = $pf_domains[0];
566 my $length = $curr_dom->{length};
567
568 my $prev_dom={seq_end=>0, pfamA_acc=>''};
569 my $prev_dom_end = 0;
570 my $next_dom_start = $length+1;
571
572 for (my $dom_ix=0; $dom_ix < scalar(@pf_domains); $dom_ix++ ) {
573 $curr_dom = $pf_domains[$dom_ix];
574
575 my $pfamA = $curr_dom->{pfamA_acc};
576
577 # first, look left, is there a domain there (if there is,
578 # it should be updated right
579
580 # my $min_vdom = $curr_dom->{model_length} / 10;
581
582 if ($curr_dom->{model_length} < $min_vdom) {
583 push @vpf_domains, $curr_dom;
584 next;
585 }
586 if ($prev_dom->{pfamA_acc}) { # look for previous domain
587 $prev_dom_end = $prev_dom->{seq_end};
588 }
589
590 # there is a domain to the left, how much room is available?
591 my $left_dom_len = min($curr_dom->{seq_start}-$prev_dom_end-1, $curr_dom->{model_start}-1);
592 if ( $left_dom_len > $min_vdom) {
593 # there is room for a virtual domain
594 my %new_dom = (seq_start=> $curr_dom->{seq_start}-$left_dom_len,
595 seq_end => $curr_dom->{seq_start}-1,
596 info=>'@'.$curr_dom->{info},
597 model_length=>$curr_dom->{model_length},
598 model_end => $curr_dom->{model_start}-1,
599 model_start => $left_dom_len,
600 pfamA_acc=>$pfamA,
601 );
602 push @vpf_domains, \%new_dom;
603 }
604
605 # save the current domain
606 push @vpf_domains, $curr_dom;
607 $prev_dom = $curr_dom;
608
609 if ($dom_ix < $#pf_domains) { # there is a domain to the right
610 # first, give all the extra space to the first domain (no splitting)
611 $next_dom_start = $pf_domains[$dom_ix+1]->{seq_start};
612 }
613 else {
614 $next_dom_start = $length;
615 }
616
617 # is there room for a virtual domain right
618
619 my $right_dom_len = min($next_dom_start-$curr_dom->{seq_end}-1, # space available
620 $curr_dom->{model_length}-$curr_dom->{model_end} # space needed
621 );
622 if ( $right_dom_len > $min_vdom) {
623 my %new_dom = (seq_start=> $curr_dom->{seq_end}+1,
624 seq_end=> $curr_dom->{seq_end}+$right_dom_len,
625 info=>'@'.$curr_dom->{info},
626 model_length => $curr_dom->{model_length},
627 pfamA_acc=> $pfamA,
628 );
629 push @vpf_domains, \%new_dom;
630 $prev_dom = \%new_dom;
631 }
632 } # all done, check for last one
633
634 # $curr_dom=$pf_domains[-1];
635 # # my $min_vdom = $curr_dom->{model_length}/10;
636
637 # my $right_dom_len = min($length - $curr_dom->{seq_end}+1, # space available
638 # $curr_dom->{model_length}-$curr_dom->{model_end} # space needed
639 # );
640 # if ($right_dom_len > $min_vdom) {
641 # my %new_dom = (seq_start=> $curr_dom->{seq_end}+1,
642 # seq_end => $curr_dom->{seq_end}+$right_dom_len,
643 # info=>'@'.$curr_dom->{pfamA_acc},
644 # model_len=> $curr_dom->{model_len},
645 # pfamA_acc => $curr_dom->{pfamA_acc},
646 # model_start => $curr_dom->{model_end}+1,
647 # model_end => $curr_dom->{model_len},
648 # );
649
650 # push @vpf_domains, \%new_dom;
651 # }
652
653 # @vpf_domains has both old @pf_domains and new virtual domains
654 @pf_domains = @vpf_domains;
655 }
656
657 if ($neg_doms) {
658 my @npf_domains;
659 my $prev_dom={seq_end=>0};
660 for my $curr_dom ( @pf_domains) {
661 if ($curr_dom->{seq_start} - $prev_dom->{seq_end} > $min_nodom) {
662 my %new_dom = (seq_start=>$prev_dom->{seq_end}+1, seq_end => $curr_dom->{seq_start}-1, info=>'NODOM');
663 push @npf_domains, \%new_dom;
664 }
665 push @npf_domains, $curr_dom;
666 $prev_dom = $curr_dom;
667 }
668 if ($seq_length - $prev_dom->{seq_end} > $min_nodom) {
669 my %new_dom = (seq_start=>$prev_dom->{seq_end}+1, seq_end=>$seq_length, info=>'NODOM');
670 if ($new_dom{seq_end} > $new_dom{seq_start}) {
671 push @npf_domains, \%new_dom;
672 }
673 }
674
675 # @npf_domains has both old @pf_domains and new neg-domains
676 @pf_domains = @npf_domains;
677 }
678
679 # now make sure we have useful names: colors
680
681 for my $pf (@pf_domains) {
682 $pf->{info} = domain_name($pf->{info}, $pf->{pfamA_acc});
683 }
684
685 my @feats = ();
686 for my $d_ref (@pf_domains) {
687 if ($lav) {
688 push @feats, [$d_ref->{seq_start}, $d_ref->{seq_end}, $d_ref->{info}];
689 } else {
690 push @feats, [$d_ref->{seq_start}, '-', $d_ref->{seq_end}, $d_ref->{info} ];
691 # push @feats, [$d_ref->{seq_end}, ']', '-', ""];
692 }
693
694 }
695
696 return \@feats;
697 }
698
699 sub min {
700 my ($arg1, $arg2) = @_;
701
702 return ($arg1 <= $arg2 ? $arg1 : $arg2);
703 }
704
705 sub max {
706 my ($arg1, $arg2) = @_;
707
708 return ($arg1 >= $arg2 ? $arg1 : $arg2);
709 }
710
711 # domain name takes a uniprot domain label, removes comments ( ;
712 # truncated) and numbers and returns a canonical form. Thus:
713 # Cortactin 6.
714 # Cortactin 7; truncated.
715 # becomes "Cortactin"
716 #
717
718 sub domain_name {
719
720 my ($value, $pfamA_acc) = @_;
721 my $is_virtual = 0;
722
723 if ($value =~ m/^@/) {
724 $is_virtual = 1;
725 $value =~ s/^@//;
726 }
727
728 # check for clan:
729 if ($no_clans) {
730 if (! defined($domains{$value})) {
731 $domain_clan{$value} = 0;
732 $domains{$value} = ++$domain_cnt;
733 push @domain_list, $pfamA_acc;
734 }
735 }
736 elsif (!defined($domain_clan{$value})) {
737 ## only do this for new domains, old domains have known mappings
738
739 ## ways to highlight the same domain:
740 # (1) for clans, substitute clan name for family name
741 # (2) for clans, use the same color for the same clan, but don't change the name
742 # (3) for clans, combine family name with clan name, but use colors based on clan
743
744 # check to see if it's a clan
745 $get_pfam_clan->execute($pfamA_acc);
746
747 my $pfam_clan_href=0;
748
749 if ($pfam_clan_href=$get_pfam_clan->fetchrow_hashref()) { # is a clan
750 my ($clan_id, $clan_acc) = @{$pfam_clan_href}{qw(clan_id clan_acc)};
751
752 # now check to see if we have seen this clan before (if so, do not increment $domain_cnt)
753 my $c_value = "C." . $clan_id;
754 if ($pf_acc) {$c_value = $clan_acc;}
755
756 $domain_clan{$value} = {clan_id => $clan_id,
757 clan_acc => $clan_acc};
758
759 if ($domains{$c_value}) {
760 $domain_clan{$value}->{domain_cnt} = $domains{$c_value};
761 $value = $c_value;
762 }
763 else {
764 $domain_clan{$value}->{domain_cnt} = ++ $domain_cnt;
765 $value = $c_value;
766 $domains{$value} = $domain_cnt;
767 push @domain_list, $pfamA_acc;
768 }
769 }
770 else { # not a clan
771 $domain_clan{$value} = 0;
772 $domains{$value} = ++$domain_cnt;
773 push @domain_list, $pfamA_acc;
774 }
775 }
776 elsif ($domain_clan{$value} && $domain_clan{$value}->{clan_acc}) {
777 if ($pf_acc) {$value = $domain_clan{$value}->{clan_acc};}
778 else { $value = "C." . $domain_clan{$value}->{clan_id}; }
779 }
780
781 if ($is_virtual) {
782 $domains{'@'.$value} = $domains{$value};
783 $value = '@'.$value;
784 }
785 return $value;
786 }
787
788 sub domain_num {
789 my ($value, $number) = @_;
790 if ($value =~ m/^@/) {
791 $value =~ s/^@/v/;
792 $number = $number."v";
793 }
794 return ($value, $number);
795 }
796
797
798 __END__
799
800 =pod
801
802 =head1 NAME
803
804 ann_pfam30.pl
805
806 =head1 SYNOPSIS
807
808 ann_pfam30.pl --neg-doms --vdoms 'sp|P09488|GSTM1_HUMAN' | accession.file
809
810 =head1 OPTIONS
811
812 -h short help
813 --help include description
814 --no-over : generate non-overlapping domains (equivalent to ann_pfam.pl)
815 --split-over : overlaps of two domains generate a new hybrid domain
816 --no-clans : do not use clans with multiple families from same clan
817 --neg-doms : report domains between annotated domains as NODOM
818 (also --neg, --neg_doms)
819 --vdoms : produce "virtual domains" using model_start,
820 model_end for partial pfam domains
821 --min_nodom=10 : minimum length between domains for NODOM
822
823 --host, --user, --password, --port, --db : info for mysql database
824
825 =head1 DESCRIPTION
826
827 C<ann_pfam30.pl> extracts domain information from the pfam mysql
828 database. Currently, the program works with database
829 sequence descriptions in several formats:
830
831 >gi|1705556|sp|P54670.1|CAF1_DICDI
832 >sp|P09488|GSTM1_HUMAN
833 >sp:CALM_HUMAN
834
835 C<ann_pfam30.pl> uses the C<pfamA_reg_full_significant>, C<pfamseq>,
836 and C<pfamA> tables of the C<pfam> database to extract domain
837 information on a protein.
838
839 If the C<--no-over> option is set, overlapping domains are selected and
840 edited to remove overlaps. For proteins with multiple overlapping
841 domains (domains overlap by more than 1/3 of the domain length),
842 C<ann_pfam30.pl> selects the domain annotation with the best (lowest)
843 C<domain_evalue_score>. When domains overlap by less than 1/3 of the
844 domain length, they are shortened to remove the overlap.
845
846 If the C<--split-over> option is set and two domains overlap, the
847 overlapping region is split out of the domains and labeled as a new,
848 virtual-like, domain. If one domain is internal to another and spans
849 more than 80% of it, the shorter domain is removed.
850
851 C<ann_pfam30.pl> is designed to be used by the B<FASTA> programs with
852 the C<-V \!ann_pfam30.pl> or C<-V "\!ann_pfam30.pl --neg"> option.
853
854 =head1 AUTHOR
855
856 William R. Pearson, wrp@virginia.edu
857
858 =cut
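
The `--no-over` handling of small overlaps described in the POD above can be sketched in isolation. This is a minimal illustration, not part of `ann_pfam30.pl`: the helper name `trim_overlaps` and the sample coordinates are assumptions made for the example, and it covers only the final step in which two adjacent domains that still overlap share the overlap between them.

```perl
use strict;
use warnings;

# Hypothetical helper (not in ann_pfam30.pl): given domains sorted by
# seq_start, shorten adjacent domains so that they no longer overlap.
# Mirrors the "now check for smaller overlaps" loop: the left domain
# gives up int(overlap/2) residues, the right domain starts just after it.
sub trim_overlaps {
    my @doms = map { +{ %$_ } } @_;    # shallow copies; inputs are not modified
    for my $i (1 .. $#doms) {
        my ($left, $right) = @doms[$i - 1, $i];
        next unless $left->{seq_end} >= $right->{seq_start};
        my $overlap = $left->{seq_end} - $right->{seq_start};
        $left->{seq_end}    -= int($overlap / 2);
        $right->{seq_start}  = $left->{seq_end} + 1;
    }
    return @doms;
}

# Made-up coordinates: domain A ends at 100, domain B starts at 91.
my @trimmed = trim_overlaps(
    { seq_start => 1,  seq_end => 100, info => 'A' },
    { seq_start => 91, seq_end => 180, info => 'B' },
);
printf "%s\t%d\t%d\n", @{$_}{qw(info seq_start seq_end)} for @trimmed;
```

With these inputs the boundary moves to 96/97, so the two domains become adjacent rather than overlapping. The script itself edits the domain hashes in place; the copy here only keeps the sketch free of side effects.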
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014,2015 by William R. Pearson and The Rector &
3939 # create temporary tables/select permissions for tmp_annot
4040 #
4141
42 use warnings;
4243 use strict;
4344
4445 use DBI;
0 #!/usr/bin/env perl
1
2 ################################################################
3 # copyright (c) 2015 by William R. Pearson and The Rector &
4 # Visitors of the University of Virginia */
5 ################################################################
6 # Licensed under the Apache License, Version 2.0 (the "License");
7 # you may not use this file except in compliance with the License.
8 # You may obtain a copy of the License at
9 #
10 # http://www.apache.org/licenses/LICENSE-2.0
11 #
12 # Unless required by applicable law or agreed to in writing,
13 # software distributed under this License is distributed on an "AS
14 # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
15 # express or implied. See the License for the specific language
16 # governing permissions and limitations under the License.
17 ################################################################
18
19 # ann_pfam.pl gets an annotation file from fasta36 -V with a line of the form:
20
21 # gi|62822551|sp|P00502|GSTA1_RAT Glutathione S-transfer\n (at least from pir1.lseg)
22 #
23 # it must:
24 # (1) read in the line
25 # (2) parse it to get the up_acc
26 # (3) return the tab delimited features
27 #
28
29 # this is the first version that works with the new Pfam strategy of
30 # separating Uniprot reference sequences from the rest of uniprot. as
31 # a result, it is possible that 2 SQL queries will be required, one to
32 # pfamA_reg_full_significant and a second to uniprot_reg_full.
33
34 # modified 15-Jan-2017 to reduce the number of calls when the same
35 # accession is present multiple times. Accessions are saved in a hash
36 # that ensures uniqueness. (Could also speed things up by creating temporary table.)
37 #
38
39 use warnings;
40 use strict;
41
42 use DBI;
43 use Getopt::Long;
44 use Pod::Usage;
45
46 use vars qw($host $db $port $user $pass);
47
48 my $hostname = `/bin/hostname`;
49
50 ($host, $db, $port, $user, $pass) = ("wrpxdb.its.virginia.edu", "pfam32", 0, "web_user", "fasta_www");
51 #$host = 'xdb';
52 #$host = 'localhost';
53 #$db = 'RPD2_pfam28u';
54
55 my ($auto_reg,$rpd2_fams, $neg_doms, $vdoms, $lav, $no_doms, $no_clans, $pf_acc, $acc_comment, $bound_comment, $shelp, $help) =
56 (0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0,);
57 my ($no_over, $split_over, $over_fract) = (0, 0, 3.0);
58 my ($clan_fam) = (0);
59
60 my ($color_sep_str, $show_color) = (" :",1);
61 $color_sep_str = '~';
62
63 my ($min_nodom, $min_vdom) = (10,10);
64
65 GetOptions(
66 "host=s" => \$host,
67 "db=s" => \$db,
68 "user=s" => \$user,
69 "password=s" => \$pass,
70 "port=i" => \$port,
71 "lav" => \$lav,
72 "acc_comment" => \$acc_comment,
73 "bound_comment" => \$bound_comment,
74 "color!" => \$show_color,
75 "clan_fam|clan-fam" => \$clan_fam,
76 "no_over|no-over" => \$no_over,
77 "split_over|split-over=f" => \$split_over,
78 "over_fract|over-fract=f" => \$over_fract,
79 "no-clans|no_clans" => \$no_clans,
80 "neg|neg_doms|neg-doms" => \$neg_doms,
81 "min_nodom=i" => \$min_nodom,
82 "vdoms|v_doms" => \$vdoms,
83 "pfacc" => \$pf_acc,
84 "RPD2" => \$rpd2_fams,
85 "auto_reg" => \$auto_reg,
86 "h|?" => \$shelp,
87 "help" => \$help,
88 );
89
90 pod2usage(1) if $shelp;
91 pod2usage(exitstatus => 0, verbose => 2) if $help;
92 pod2usage(1) unless (@ARGV || -p STDIN || -f STDIN);
93
94 my $connect = "dbi:mysql(AutoCommit=>1,RaiseError=>1):database=$db";
95 $connect .= ";host=$host" if $host;
96 $connect .= ";port=$port" if $port;
97
98 my $dbh = DBI->connect($connect,
99 $user,
100 $pass
101 ) or die $DBI::errstr;
102
103 my %annot_types = ();
104 my %domains = (NODOM=>0);
105 my %domain_clan = (NODOM => {clan_id => 'NODOM', clan_acc=>0, domain_cnt=>0});
106 my @domain_list = (0);
107 my $domain_cnt = 0;
108
109 my $pfamA_reg_full = 'pfamA_reg_full_significant';
110 my $uniprot_reg_full = 'uniprot_reg_full';
111
112 my $get_annot_sub = \&get_pfam_annots;
113
114 my @pfam_fields = qw(seq_start seq_end model_start model_end model_length pfamA_acc pfamA_id auto_pfamA_reg_full domain_evalue_score as evalue length);
115 my @upfam_fields = qw(seq_start seq_end model_start model_end model_length pfamA_acc pfamA_id auto_uniprot_reg_full domain_evalue_score as evalue length);
116
117 my $get_pfam_acc = $dbh->prepare(<<EOSQL);
118 SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length
119 FROM pfamseq
120 JOIN pfamA_reg_full_significant using(pfamseq_acc)
121 JOIN pfamA USING (pfamA_acc)
122 WHERE in_full = 1
123 AND pfamseq_acc=?
124 ORDER BY seq_start
125
126 EOSQL
127
128 my $get_upfam_acc = $dbh->prepare(<<EOSQL);
129 SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_uniprot_reg_full as auto_pfamA_reg_full, domain_evalue_score as evalue, length
130 FROM uniprot
131 JOIN uniprot_reg_full using(uniprot_acc)
132 JOIN pfamA USING (pfamA_acc)
133 WHERE in_full = 1
134 AND uniprot_acc=?
135 ORDER BY seq_start
136
137 EOSQL
138
139 my $get_pfam_refacc = $dbh->prepare(<<EOSQL);
140 SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length
141 FROM $pfamA_reg_full
142 JOIN pfamseq using(pfamseq_acc)
143 JOIN pfamA USING (pfamA_acc)
144 JOIN uniprot.up2ref_acc as up2ref on(up2ref.acc=pfamseq_acc)
145 WHERE in_full = 1
146 AND up2ref.ref_acc=?
147 ORDER BY seq_start
148
149 EOSQL
150
151 my $get_upfam_refacc = $dbh->prepare(<<EOSQL);
152 SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_uniprot_reg_full as auto_pfamA_reg_full, domain_evalue_score as evalue, length
153 FROM uniprot
154 JOIN uniprot_reg_full using(uniprot_acc)
155 JOIN pfamA USING (pfamA_acc)
156 JOIN uniprot.up2ref_acc as up2ref on(up2ref.acc=uniprot_acc)
157 WHERE in_full = 1
158 AND ref_acc=?
159 ORDER BY seq_start
160
161 EOSQL
162
163 my $get_annots_sql = $get_pfam_acc;
164
165 my $get_pfam_id = $dbh->prepare(<<EOSQL);
166 SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length
167 FROM pfamseq
168 JOIN $pfamA_reg_full using(pfamseq_acc)
169 JOIN pfamA USING (pfamA_acc)
170 WHERE in_full=1
171 AND pfamseq_id=?
172 ORDER BY seq_start
173
174 EOSQL
175
176 my $get_upfam_id = $dbh->prepare(<<EOSQL);
177 SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_uniprot_reg_full as auto_pfamA_reg_full, domain_evalue_score as evalue, length
178 FROM uniprot
179 JOIN uniprot_reg_full using(uniprot_acc)
180 JOIN pfamA USING (pfamA_acc)
181 WHERE in_full=1
182 AND uniprot_id=?
183 ORDER BY seq_start
184
185 EOSQL
186
187 my $get_pfam_clan = $dbh->prepare(<<EOSQL);
188
189 SELECT clan_acc, clan_id
190 FROM clan
191 JOIN clan_membership using(clan_acc)
192 WHERE pfamA_acc=?
193
194 EOSQL
195
196 my $get_rpd2_clans = $dbh->prepare(<<EOSQL);
197
198 SELECT auto_pfamA, clan
199 FROM ljm_db.RPD2_final_fams
200 WHERE clan is not NULL
201
202 EOSQL
203
204 # -- LEFT JOIN clan_membership USING (auto_pfamA)
205 # -- LEFT JOIN clans using(auto_clan)
206
207 my ($tmp, $gi, $sdb, $acc, $id, $use_acc);
208
209 ################
210 ## check for db=*_qfo -- do not use get_upfam_acc in that case
211 if ($db =~ m/_qfo/) {
212 $get_upfam_acc= '';
213 }
214
215 # get the query
216 my ($query, $seq_len) = @ARGV;
217 $seq_len = 0 unless defined($seq_len);
218
219 $query =~ s/^>// if ($query);
220
221 my @annots = ();
222 my %annot_set = ();
223
224 my %rpd2_clan_fams = ();
225
226 if ($rpd2_fams) {
227 $get_rpd2_clans->execute();
228 my ($auto_pfam, $auto_clan);
229 while (($auto_pfam, $auto_clan)=$get_rpd2_clans->fetchrow_array()) {
230 $rpd2_clan_fams{$auto_pfam} = $auto_clan;
231 }
232 }
233
234 #if it's a file I can open, read and parse it
235 unless ($query && ($query =~ m/[\|:]/ ||
236 $query =~ m/^[NX]P_/ ||
237 $query =~ m/^[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}\s/)) {
238
239 while (my $a_line = <>) {
240 $a_line =~ s/^>//;
241 chomp $a_line;
242 push @annots, show_annots($a_line, $get_annot_sub);
243 }
244 }
245 else {
246 push @annots, show_annots("$query\t$seq_len", $get_annot_sub);
247 }
248
249 for my $seq_annot (@annots) {
250 next unless $seq_annot;
251 my $annot_r = $annot_set{$seq_annot};
252 print ">",$annot_r->{seq_info},"\n";
253 for my $annot (@{$annot_r->{list}}) {
254 if (!$lav && defined($domains{$annot->[-1]})) {
255 my ($a_name, $a_num) = domain_num($annot->[-1],$domains{$annot->[-1]});
256 $annot->[-1] = $a_name;
257 my $tmp_a_num = $a_num;
258 $tmp_a_num =~ s/v$//;
259 if ($acc_comment) {
260 $annot->[-1] .= "{$domain_list[$tmp_a_num]}";
261 }
262 if ($bound_comment) {
263 $annot->[-1] .= $color_sep_str.$annot->[0].":".$annot->[2];
264 }
265 elsif ($show_color) {
266 $annot->[-1] .= $color_sep_str.$a_num;
267 }
268 }
269 print join("\t",@$annot),"\n";
270 }
271 }
272
273 exit(0);
274
275 sub show_annots {
276 my ($query_len, $get_annot_sub) = @_;
277
278 my ($annot_line, $seq_len) = split(/\t/,$query_len);
279
280 my $pfamA_acc;
281
282 $use_acc = 1;
283 $get_annots_sql = $get_pfam_acc;
284
285 my $get_annots_sql_u = $get_upfam_acc;
286
287 if ($annot_line =~ m/^pf\d+\|/) {
288 ($sdb, $gi, $pfamA_acc, $acc, $id) = split(/\|/,$annot_line);
289 # $dbh->do("use RPD2_pfam");
290 }
291 elsif ($annot_line =~ m/^gi\|/) {
292 ($tmp, $gi, $sdb, $acc, $id) = split(/\|/,$annot_line);
293 if ($sdb =~ m/ref/) {
294 $get_annots_sql = $get_pfam_refacc;
295 $get_annots_sql_u = $get_upfam_refacc;
296 }
297 }
298 elsif ($annot_line =~ m/^(sp|tr|up)\|/) {
299 ($sdb, $acc, $id) = split(/\|/,$annot_line);
300 }
301 elsif ($annot_line =~ m/^ref\|/) {
302 ($sdb, $acc) = split(/\|/,$annot_line);
303 $get_annots_sql = $get_pfam_refacc;
304 $get_annots_sql_u = $get_upfam_refacc;
305 }
306 elsif ($annot_line =~ m/^(SP|TR):/i) {
307 ($sdb, $id) = split(/:/,$annot_line);
308 $use_acc = 0;
309 }
310 elsif ($annot_line !~ m/\|/ && $annot_line !~ m/:/) {
311 $use_acc = 1;
312 ($acc) = split(/\s+/,$annot_line);
313 }
314 # deal with no-database SwissProt/NR
315 else {
316 ($acc)=($annot_line =~ /^(\S+)/);
317 }
318
319 # here we have an $acc or an $id: check to see if we have the data
320
321 my %annot_data = (seq_info=>$annot_line, seq_len=>$seq_len);
322 my $annot_key = '';
323 unless ($use_acc) {
324 return "" if ($annot_set{$id}); # already seen; 'next' is not valid outside a loop
325 $annot_set{$id} = \%annot_data;
326 $annot_key = $id;
327
328 $get_annots_sql = $get_pfam_id;
329 $get_annots_sql->execute($id);
330 unless ($get_annots_sql->rows()) {
331 if ($get_annots_sql_u) {
332 $get_annots_sql = $get_annots_sql_u;
333 $get_annots_sql->execute($id);
334 }
335 }
336 } else {
337 unless ($acc) {
338 warn "missing acc in $annot_line";
339 return "";
340 }
341 else {
342 $acc =~ s/\.\d+$//;
343
344 $annot_key = $acc;
345 if ($annot_set{$acc}) {
346 goto ret_label;
347 }
348 $annot_set{$acc} = \%annot_data;
349
350 $get_annots_sql->execute($acc);
351 unless ($get_annots_sql->rows()) {
352 if ($get_annots_sql_u) {
353 $get_annots_sql = $get_annots_sql_u;
354 $get_annots_sql->execute($acc);
355 }
356 }
357 }
358 }
359
360 $annot_data{list} = $get_annot_sub->($get_annots_sql, $seq_len);
361
362 ret_label:
363 return $annot_key;
364 }
365
366 sub get_pfam_annots {
367 my ($get_annots, $seq_length) = @_;
368
369 $seq_length = 0 unless $seq_length;
370
371 my @pf_domains = ();
372
373 # get the list of domains, sorted by start
374
375 # $row_href has: seq_start, seq_end, model_start, model_end, model_length,
376 # pfamA_acc, pfamA_id, auto_pfamA_reg_full,
377 # domain_evalue_score as evalue, length
378
379 while ( my $row_href = $get_annots->fetchrow_hashref()) {
380 if ($auto_reg) {
381 $row_href->{info} = $row_href->{auto_pfamA_reg_full};
382 } elsif ($pf_acc) {
383 $row_href->{info} = $row_href->{pfamA_acc};
384 } else {
385 $row_href->{info} = $row_href->{pfamA_id};
386 }
387
388 if ($row_href && $row_href->{length} > $seq_length && $seq_length == 0) {
389 $seq_length = $row_href->{length};
390 }
391
392 next if ($row_href->{seq_start} >= $seq_length);
393 if ($row_href->{seq_end} > $seq_length) {
394 $row_href->{seq_end} = $seq_length;
395 }
396
397 push @pf_domains, $row_href
398 }
399
400 # before checking for domain overlap, check for "split-domains"
401 # (self-unbound) by looking for runs of the same domain that are
402 # ordered by model_start
403
404 if (scalar(@pf_domains) > 1) {
405 my @j_domains; #joined domains
406 my @tmp_domains = @pf_domains;
407
408 my $prev_dom = shift(@tmp_domains);
409
410 for my $curr_dom (@tmp_domains) {
411 # to join domains:
412 # (1) the domains must be in order by model_start/end coordinates
413 # (3) joining the domains cannot make the total combination too long
414
415 # check for model and sequence consistency
416 if (($prev_dom->{pfamA_acc} eq $curr_dom->{pfamA_acc}) # same family
417 && $prev_dom->{model_start} < $curr_dom->{model_start} # model check
418 && $prev_dom->{model_end} < $curr_dom->{model_end}
419
420 && ($curr_dom->{model_start} > $prev_dom->{model_end} * 0.80 # limit overlap
421 || $curr_dom->{model_start} < $prev_dom->{model_end} * 1.25)
422 && ((($curr_dom->{model_end} - $curr_dom->{model_start}+1)/$curr_dom->{model_length} +
423 ($prev_dom->{model_end} - $prev_dom->{model_start}+1)/$prev_dom->{model_length}) < 1.33)
424 ) { # join them by updating $prev_dom
425 $prev_dom->{seq_end} = $curr_dom->{seq_end};
426 $prev_dom->{model_end} = $curr_dom->{model_end};
427 $prev_dom->{auto_pfamA_reg_full} = $prev_dom->{auto_pfamA_reg_full} . ";". $curr_dom->{auto_pfamA_reg_full};
428 $prev_dom->{evalue} = ($prev_dom->{evalue} < $curr_dom->{evalue} ? $prev_dom->{evalue} : $curr_dom->{evalue});
429 } else {
430 push @j_domains, $prev_dom;
431 $prev_dom = $curr_dom;
432 }
433 }
434 push @j_domains, $prev_dom;
435 @pf_domains = @j_domains;
436
437
438 if ($no_over) { # for either $no_over or $split_over, check for overlapping domains and edit/split them
439
440 my @tmp_domains = @pf_domains; # allow shifts from copy of @pf_domains
441 my @save_domains = (); # where the new domains go
442
443 my $prev_dom = shift @tmp_domains;
444
445 while (my $curr_dom = shift @tmp_domains) {
446
447 my @overlap_domains = ($prev_dom);
448
449 my $diff = $prev_dom->{seq_end} - $curr_dom->{seq_start};
450
451 my ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1,
452 $curr_dom->{seq_end}-$curr_dom->{seq_start}+1);
453
454 my $inclusion = ((($curr_dom->{seq_start} >= $prev_dom->{seq_start}) # start is right && end is left
455 && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})) || # -- curr inside prev
456 (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) # start is left && end is right
457 && ($curr_dom->{seq_end} >= $prev_dom->{seq_end}))); # -- prev is inside curr
458
459 my $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len;
460
461 # check for overlap > domain_length/$over_fract
462 while ($inclusion || ($diff > 0 && $diff > $longer_len/$over_fract)) {
463 push @overlap_domains, $curr_dom;
464 $curr_dom = shift @tmp_domains;
465 last unless $curr_dom;
466 $diff = $prev_dom->{seq_end} - $curr_dom->{seq_start};
467 ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, $curr_dom->{seq_end}-$curr_dom->{seq_start}+1);
468 $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len;
469 $inclusion = ((($curr_dom->{seq_start} >= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})) ||
470 (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} >= $prev_dom->{seq_end})));
471 }
472
473 # check for overlapping domains; >1 because $prev_dom is always there
474 if (scalar(@overlap_domains) > 1 ) {
475 # if $rpd2_fams, check for a chosen one
476
477 for my $dom ( @overlap_domains) {
478 $dom->{evalue} = 1.0 unless defined($dom->{evalue});
479 }
480
481 @overlap_domains = sort { $a->{evalue} <=> $b->{evalue} } @overlap_domains;
482 $prev_dom = $overlap_domains[0];
483 }
484
485 # $prev_dom should be the best of the overlaps, and we are no longer overlapping > dom_length/3
486 push @save_domains, $prev_dom;
487 $prev_dom = $curr_dom;
488 }
489
490 if ($prev_dom) {
491 push @save_domains, $prev_dom;
492 }
493
494 @pf_domains = @save_domains;
495
496 # now check for smaller overlaps
497 for (my $i=1; $i < scalar(@pf_domains); $i++) {
498 if ($pf_domains[$i-1]->{seq_end} >= $pf_domains[$i]->{seq_start}) {
499 my $overlap = $pf_domains[$i-1]->{seq_end} - $pf_domains[$i]->{seq_start};
500 $pf_domains[$i-1]->{seq_end} -= int($overlap/2);
501 $pf_domains[$i]->{seq_start} = $pf_domains[$i-1]->{seq_end}+1;
502 }
503 }
504 }
505 elsif ($split_over) { # here, everything that overlaps by > $min_vdom should be split into a separate domain
506 my @save_domains = (); # where the new domains go
507
508 # check to see if one domain is included (or overlapping) more
509 # than xx% of the other. If so, pick the longer one
510
511 my ($prev_dom, $curr_dom) = ($pf_domains[0],0) ;
512 for (my $i=1; $i < scalar(@pf_domains); $i++) {
513 $curr_dom = $pf_domains[$i];
514
515 my ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, $curr_dom->{seq_end}-$curr_dom->{seq_start}+1);
516 my $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len;
517
518 if (($curr_dom->{seq_start} >= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})
519 && $cur_len / $prev_len > 0.80) {
520 # $prev_dom stays the same, $curr_dom deleted
521 next;
522 }
523 elsif (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} >= $prev_dom->{seq_end})
524 && $prev_len / $cur_len > 0.80) {
525 $prev_dom = $curr_dom; # this should delete $prev_dom
526 next;
527 }
528
529 if ($prev_dom->{seq_end} >= $curr_dom->{seq_start} + $min_vdom) {
530 my ($l_seq_end, $r_seq_start) = ($curr_dom->{seq_start}-1, $prev_dom->{seq_end}+1);
531
532 $prev_dom->{seq_end} = $l_seq_end;
533 push @save_domains, $prev_dom;
534 my $new_dom = {seq_start => $l_seq_end+1, seq_end=>$r_seq_start-1,
535 model_length => -1,
536 pfamA_acc=>$prev_dom->{pfamA_acc}."/".$curr_dom->{pfamA_acc},
537 pfamA_id=>$prev_dom->{pfamA_id}."/".$curr_dom->{pfamA_id},
538 };
539
540 if ($pf_acc) {
541 $new_dom->{info} = $new_dom->{pfamA_acc};
542 }
543 else {
544 $new_dom->{info} = $new_dom->{pfamA_id};
545 }
546
547 push @save_domains, $new_dom;
548 $curr_dom->{seq_start} = $r_seq_start;
549 $prev_dom = $curr_dom;
550 }
551 else {
552 push @save_domains, $prev_dom;
553 $prev_dom = $curr_dom;
554 }
555 }
556 push @save_domains, $prev_dom;
557 @pf_domains = @save_domains;
558 }
559 }
560
561 # $vdoms -- virtual Pfam domains -- the equivalent of $neg_doms,
562 # but covering parts of a Pfam model that are not annotated. split
563 # domains have been joined, so simply check beginning and end of
564 # each domain (but must also check for bounded-ness)
565 # only add when 10% or more is missing and missing length > $min_nodom
566
567 if ($vdoms && scalar(@pf_domains)) {
568 my @vpf_domains;
569
570 my $curr_dom = $pf_domains[0];
571 my $length = $curr_dom->{length};
572
573 my $prev_dom={seq_end=>0, pfamA_acc=>''};
574 my $prev_dom_end = 0;
575 my $next_dom_start = $length+1;
576
577 for (my $dom_ix=0; $dom_ix < scalar(@pf_domains); $dom_ix++ ) {
578 $curr_dom = $pf_domains[$dom_ix];
579
580 my $pfamA = $curr_dom->{pfamA_acc};
581
582 # first, look left, is there a domain there (if there is,
583 # it should be updated right
584
585 # my $min_vdom = $curr_dom->{model_length} / 10;
586
587 if ($curr_dom->{model_length} < $min_vdom) {
588 push @vpf_domains, $curr_dom;
589 next;
590 }
591 if ($prev_dom->{pfamA_acc}) { # look for previous domain
592 $prev_dom_end = $prev_dom->{seq_end};
593 }
594
595 # there is a domain to the left, how much room is available?
596 my $left_dom_len = min($curr_dom->{seq_start}-$prev_dom_end-1, $curr_dom->{model_start}-1);
597 if ( $left_dom_len > $min_vdom) {
598 # there is room for a virtual domain
599 my %new_dom = (seq_start=> $curr_dom->{seq_start}-$left_dom_len,
600 seq_end => $curr_dom->{seq_start}-1,
601 info=>'@'.$curr_dom->{info},
602 model_length=>$curr_dom->{model_length},
603 model_end => $curr_dom->{model_start}-1,
604 model_start => $left_dom_len,
605 pfamA_acc=>$pfamA,
606 );
607 push @vpf_domains, \%new_dom;
608 }
609
610 # save the current domain
611 push @vpf_domains, $curr_dom;
612 $prev_dom = $curr_dom;
613
614 if ($dom_ix < $#pf_domains) { # there is a domain to the right
615 # first, give all the extra space to the first domain (no splitting)
616 $next_dom_start = $pf_domains[$dom_ix+1]->{seq_start};
617 }
618 else {
619 $next_dom_start = $length;
620 }
621
622 # is there room for a virtual domain right
623
624 my $right_dom_len = min($next_dom_start-$curr_dom->{seq_end}-1, # space available
625 $curr_dom->{model_length}-$curr_dom->{model_end} # space needed
626 );
627 if ( $right_dom_len > $min_vdom) {
628 my %new_dom = (seq_start=> $curr_dom->{seq_end}+1,
629 seq_end=> $curr_dom->{seq_end}+$right_dom_len,
630 info=>'@'.$curr_dom->{info},
631 model_length => $curr_dom->{model_length},
632 pfamA_acc=> $pfamA,
633 );
634 push @vpf_domains, \%new_dom;
635 $prev_dom = \%new_dom;
636 }
637 } # all done, check for last one
638
639 # $curr_dom=$pf_domains[-1];
640 # # my $min_vdom = $curr_dom->{model_length}/10;
641
642 # my $right_dom_len = min($length - $curr_dom->{seq_end}+1, # space available
643 # $curr_dom->{model_length}-$curr_dom->{model_end} # space needed
644 # );
645 # if ($right_dom_len > $min_vdom) {
646 # my %new_dom = (seq_start=> $curr_dom->{seq_end}+1,
647 # seq_end => $curr_dom->{seq_end}+$right_dom_len,
648 # info=>'@'.$curr_dom->{pfamA_acc},
649 # model_len=> $curr_dom->{model_len},
650 # pfamA_acc => $curr_dom->{pfamA_acc},
651 # model_start => $curr_dom->{model_end}+1,
652 # model_end => $curr_dom->{model_len},
653 # );
654
655 # push @vpf_domains, \%new_dom;
656 # }
657
658 # @vpf_domains has both old @pf_domains and new virtual domains
659 @pf_domains = @vpf_domains;
660 }
661
662 if ($neg_doms) {
663 my @npf_domains;
664 my $prev_dom={seq_end=>0};
665 for my $curr_dom ( @pf_domains) {
666 if ($curr_dom->{seq_start} - $prev_dom->{seq_end} > $min_nodom) {
667 my %new_dom = (seq_start=>$prev_dom->{seq_end}+1, seq_end => $curr_dom->{seq_start}-1, info=>'NODOM');
668 push @npf_domains, \%new_dom;
669 }
670 push @npf_domains, $curr_dom;
671 $prev_dom = $curr_dom;
672 }
673 if ($seq_length - $prev_dom->{seq_end} > $min_nodom) {
674 my %new_dom = (seq_start=>$prev_dom->{seq_end}+1, seq_end=>$seq_length, info=>'NODOM');
675 if ($new_dom{seq_end} > $new_dom{seq_start}) {
676 push @npf_domains, \%new_dom;
677 }
678 }
679
680 if (scalar(@pf_domains)==0) {
681 my %new_dom = (seq_start=>1, seq_end=> $seq_len, info=>'NODOM');
682 push @pf_domains, \%new_dom;
683 }
684
685 # @npf_domains has both old @pf_domains and new neg-domains
686 @pf_domains = @npf_domains;
687 }
688
689 # now make sure we have useful names: colors
690
691 for my $pf (@pf_domains) {
692 $pf->{info} = domain_name($pf->{info}, $pf->{pfamA_acc});
693 }
694
695 my @feats = ();
696 for my $d_ref (@pf_domains) {
697 if ($lav) {
698 push @feats, [$d_ref->{seq_start}, $d_ref->{seq_end}, $d_ref->{info}];
699 } else {
700 push @feats, [$d_ref->{seq_start}, '-', $d_ref->{seq_end}, $d_ref->{info} ];
701 # push @feats, [$d_ref->{seq_end}, ']', '-', ""];
702 }
703
704 }
705
706 return \@feats;
707 }
708
709 sub min {
710 my ($arg1, $arg2) = @_;
711
712 return ($arg1 <= $arg2 ? $arg1 : $arg2);
713 }
714
715 sub max {
716 my ($arg1, $arg2) = @_;
717
718 return ($arg1 >= $arg2 ? $arg1 : $arg2);
719 }
720
721 # domain name takes a uniprot domain label, removes comments ( ;
722 # truncated) and numbers and returns a canonical form. Thus:
723 # Cortactin 6.
724 # Cortactin 7; truncated.
725 # becomes "Cortactin"
726 #
727
728 sub domain_name {
729
730 my ($value, $pfamA_acc) = @_;
731 my $is_virtual = 0;
732
733 if ($value =~ m/^@/) {
734 $is_virtual = 1;
735 $value =~ s/^@//;
736 }
737
738 # check for clan:
739 if ($no_clans) {
740 if (! defined($domains{$value})) {
741 $domain_clan{$value} = 0;
742 $domains{$value} = ++$domain_cnt;
743 push @domain_list, $pfamA_acc;
744 }
745 }
746 elsif (!defined($domain_clan{$value})) {
747 ## only do this for new domains, old domains have known mappings
748
749 ## ways to highlight the same domain:
750 # (1) for clans, substitute clan name for family name
751 # (2) for clans, use the same color for the same clan, but don't change the name
752 # (3) for clans, combine family name with clan name, but use colors based on clan
753
754 # check to see if it's a clan
755 $get_pfam_clan->execute($pfamA_acc);
756
757 my $pfam_clan_href=0;
758
759 if ($pfam_clan_href=$get_pfam_clan->fetchrow_hashref()) { # is a clan
760 my ($clan_id, $clan_acc) = @{$pfam_clan_href}{qw(clan_id clan_acc)};
761
762 # now check to see if we have seen this clan before (if so, do not increment $domain_cnt)
763 my $c_value = "C." . $clan_id;
764
765 if ($clan_fam) {
766 $c_value = $c_value;
767 }
768
769 if ($pf_acc) {
770 $c_value = $clan_acc;
771 }
772
773 $domain_clan{$value} = {clan_id => $clan_id,
774 clan_acc => $clan_acc};
775
776 if ($domains{$c_value}) {
777 $domain_clan{$value}->{domain_cnt} = $domains{$c_value};
778 $value = $c_value;
779 }
780 else {
781 $domain_clan{$value}->{domain_cnt} = ++ $domain_cnt;
782 $value = $c_value;
783 $domains{$value} = $domain_cnt;
784 push @domain_list, $pfamA_acc;
785 }
786 }
787 else { # not a clan
788 $domain_clan{$value} = 0;
789 $domains{$value} = ++$domain_cnt;
790 push @domain_list, $pfamA_acc;
791 }
792 }
793 elsif ($domain_clan{$value} && $domain_clan{$value}->{clan_acc}) {
794 if ($pf_acc) {$value = $domain_clan{$value}->{clan_acc};}
795 else { $value = "C." . $domain_clan{$value}->{clan_id}; }
796 }
797
798 if ($is_virtual) {
799 $domains{'@'.$value} = $domains{$value};
800 $value = '@'.$value;
801 }
802
803 return $value;
804 }
805
806 sub domain_num {
807 my ($value, $number) = @_;
808 if ($value =~ m/^@/) {
809 $value =~ s/^@/v/;
810 $number = $number."v";
811 }
812 return ($value, $number);
813 }
814
815
816 __END__
817
818 =pod
819
820 =head1 NAME
821
822 ann_pfam_sql.pl
823
824 =head1 SYNOPSIS
825
826 ann_pfam_sql.pl --neg-doms --vdoms 'sp|P09488|GSTM1_HUMAN' | accession.file
827
828 =head1 OPTIONS
829
830 -h short help
831 --help include description
832 --no-over : generate non-overlapping domains (equivalent to ann_pfam.pl)
833 --split-over : overlaps of two domains generate a new hybrid domain
834 --no-clans : do not use clans with multiple families from same clan
835 --neg-doms : report domains between annotated domains as NODOM
836 (also --neg, --neg_doms)
837 --pfacc : report Pfam ACC (PF01234), rather than Pfam identifier (GST-N)
838 --vdoms : produce "virtual domains" using model_start,
839 model_end for partial pfam domains
840 --min_nodom=10 : minimum length between domains for NODOM
841
842 --host, --user, --password, --port --db : info for mysql database
843
844 =head1 DESCRIPTION
845
846 C<ann_pfam_sql.pl> extracts domain information from the pfam mysql
847 database. Currently, the program works with database
848 sequence descriptions in several formats:
849
850 >gi|1705556|sp|P54670.1|CAF1_DICDI
851 >sp|P09488|GSTM1_HUMAN
852 >sp:CALM_HUMAN
853
854 C<ann_pfam_sql.pl> uses the C<pfamA_reg_full_significant>, C<pfamseq>,
855 and C<pfamA> tables of the C<pfam> database to extract domain
856 information on a protein.
857
858 If the C<--no-over> option is set, overlapping domains are selected and
859 edited to remove overlaps. For proteins with multiple overlapping
860 domains (domains overlap by more than 1/3 of the domain length),
861 C<auto_pfam28.pl> selects the domain annotation with the best
862 C<domain_evalue_score>. When domains overlap by less than 1/3 of the
863 domain length, they are shortened to remove the overlap.
864
865 If the C<--split-over> option is set, if two domains overlap, the
866 overlapping region is split out of the domains and labeled as a new,
867 virtual-like, domain. If one domain is internal to another and spans
868 80% of the domain, the shorter domain is removed.
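The C<--split-over> rule described above can be sketched in a few lines. This is an illustrative Python rendering of the Perl loop in this file, not the script itself: the dictionary keys (`start`, `end`, `name`) and the function name `split_overlaps` are assumptions made for the example, and `min_vdom=10` is an assumed default.

```python
def split_overlaps(domains, min_vdom=10, contain_frac=0.80):
    """Split overlapping domains into left / hybrid / right pieces.

    `domains` is a list of dicts with 'start', 'end' and 'name' keys
    (hypothetical names), sorted by 'start'. The input is not modified.
    """
    if not domains:
        return []
    saved = []
    prev = dict(domains[0])
    for cur in domains[1:]:
        cur = dict(cur)
        prev_len = prev["end"] - prev["start"] + 1
        cur_len = cur["end"] - cur["start"] + 1

        # cur lies inside prev and covers more than 80% of it: drop cur
        if (cur["start"] >= prev["start"] and cur["end"] <= prev["end"]
                and cur_len / prev_len > contain_frac):
            continue
        # prev lies inside cur and covers more than 80% of it: drop prev
        if (cur["start"] <= prev["start"] and cur["end"] >= prev["end"]
                and prev_len / cur_len > contain_frac):
            prev = cur
            continue

        if prev["end"] >= cur["start"] + min_vdom:
            # overlap is long enough: carve it out as a hybrid domain
            ov_start, ov_end = cur["start"], prev["end"]
            prev["end"] = ov_start - 1
            saved.append(prev)
            saved.append({"start": ov_start, "end": ov_end,
                          "name": prev["name"] + "/" + cur["name"]})
            cur["start"] = ov_end + 1
        else:
            saved.append(prev)
        prev = cur
    saved.append(prev)
    return saved


doms = [{"start": 1, "end": 100, "name": "A"},
        {"start": 80, "end": 200, "name": "B"}]
for d in split_overlaps(doms):
    print(d["start"], d["end"], d["name"])
# 1 79 A
# 80 100 A/B
# 101 200 B
```

The hybrid region takes the joined name (`A/B`), which corresponds to the `pfamA_id` concatenation in the Perl code.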
869
870 C<ann_pfam_sql.pl> is designed to be used by the B<FASTA> programs with
871 the C<-V \!ann_pfam_sql.pl> or C<-V "\!ann_pfam_sql.pl --neg"> option.
872
873 =head1 AUTHOR
874
875 William R. Pearson, wrp@virginia.edu
876
877 =cut
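The `--neg-doms` behaviour (gaps between annotated domains reported as `NODOM` when they are longer than `--min_nodom`) can be sketched as follows. This is a Python illustration of the Perl loop above under assumed names: `add_nodoms` and the `start`/`end`/`name` keys are hypothetical, chosen only for the example.

```python
def add_nodoms(domains, seq_length, min_nodom=10):
    """Insert NODOM records into gaps longer than `min_nodom`.

    `domains` is a list of dicts with 'start', 'end' and 'name' keys
    (hypothetical names), sorted by 'start'.
    """
    out = []
    prev_end = 0
    for dom in domains:
        # gap between the previous domain (or the N-terminus) and this one
        if dom["start"] - prev_end > min_nodom:
            out.append({"start": prev_end + 1,
                        "end": dom["start"] - 1,
                        "name": "NODOM"})
        out.append(dom)
        prev_end = dom["end"]
    # gap between the last domain and the end of the sequence
    if seq_length - prev_end > min_nodom:
        out.append({"start": prev_end + 1, "end": seq_length, "name": "NODOM"})
    return out


for d in add_nodoms([{"start": 20, "end": 100, "name": "GST_N"}], 150):
    print(d["start"], d["end"], d["name"])
# 1 19 NODOM
# 20 100 GST_N
# 101 150 NODOM
```

With an empty domain list, the final check alone covers the whole sequence with one `NODOM`, provided the sequence is longer than `min_nodom`.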
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014, 2015 by William R. Pearson and The Rector &
3030 # >pf26|164|O57809|1A1D_PYRHO
3131 # and only provides domain information
3232
33 use warnings;
3334 # use strict;
3435
3536 use Getopt::Long;
7980 my @domain_list = (0);
8081 my $domain_cnt = 0;
8182
82 my $loc="http://pfam.xfam.org/";
83 my $loc="https://pfam.xfam.org/";
8384 my $url;
8485
8586 my @pf_domains;
0 ann_exons_ens.pl
0 ann_exons_up_sql.pl
11 ann_exons_up_www.pl
22 ann_feats2ipr.pl
33 ann_feats_up_sql.pl
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014 by William R. Pearson and The Rector &
2828
2929 # this version can read feature2 uniprot features (acc/pos/end/label/value), but returns sorted start/end domains
3030
31 use warnings;
3132 use strict;
3233
3334 use Getopt::Long;
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
3 # copyright (c) 2014,2015 by William R. Pearson and The Rector &
3 # copyright (c) 2017,2018 by William R. Pearson and The Rector &
44 # Visitors of the University of Virginia */
55 ################################################################
66 # Licensed under the Apache License, Version 2.0 (the "License");
1717 ################################################################
1818
1919 ################################################################
20 # annot_blast_btop2.pl --query query.file --ann_script ann_pfam_www.pl blast_tab_btop_file
20 # annot_blast_btop2.pl --query query.file --ann_script ann_pfam_www.pl --include_doms blast_tab_btop_file
2121 ################################################################
2222 # annot_blast_btop2.pl associates domain annotation information and
2323 # subalignment scores with a blast tabular (-outfmt 6 or -outfmt 7)
2929 # If the BTOP field or query_file is not available, the script
3030 # produces domain content without sub-alignment scores.
3131 ################################################################
32 ## 4-Nov-2018
33 # add --include_doms, which adds a new field with the coordinates of
34 # the domains in the protein (independent of alignment)
35 #
36 ################################################################
37 ## 21-July-2018
38 # include sequence length (actually alignment end) to produce NODOM's (no NODOM's without length).
39 #
40 ################################################################
3241 ## 13-Jan-2017
3342 # modified to provide query/subject coordinates and identities if no
3443 # query sequence -- does not decrement for reverse-complement fastx/blastx DNA
4049 # add -q_annot_script to annotate query sequence
4150 #
4251
52 use warnings;
4353 use strict;
4454 use IPC::Open2;
4555 use Pod::Usage;
4656 use Getopt::Long;
57 use File::Temp qw/ tempfile /;
58
4759 # use Data::Dumper;
4860
4961 # read lines of the form:
5567 # and report the domain content ala -m 8CC
5668
5769 my ($matrix, $ann_script, $q_ann_script, $show_raw, $shelp, $help) = ("BLOSUM62", "", "", 0, 0, 0);
70 my ($have_qslen, $dom_info, $sub2query) = (0,0,0); # blast tabular file has sseqid sseqlen qseqid qseqlen
5871 my ($query_lib_name) = (""); # if $query_lib_name, do not use $query_file_name
5972 my ($out_field_str) = ("");
6073 my $query_lib_r = 0;
6780
6881 GetOptions(
6982 "matrix:s" => \$matrix,
70 "ann_script:s" => \$ann_script,
71 "q_ann_script:s" => \$q_ann_script,
83 "ann_script|script:s" => \$ann_script,
84 "q_ann_script|q_script:s" => \$q_ann_script,
85 "have_qslen|have_sqlen!" => \$have_qslen,
86 "domain_info|dom_info!" => \$dom_info,
87 "sub2query!" => \$sub2query,
7288 "query:s" => \$query_lib_name,
7389 "query_file:s" => \$query_lib_name,
7490 "query_lib:s" => \$query_lib_name,
7591 "out_fields:s" => \$out_field_str,
76 "script:s" => \$ann_script,
77 "q_script:s" => \$q_ann_script,
7892 "raw_score" => \$show_raw,
7993 "h|?" => \$shelp,
8094 "help" => \$help,
92106
93107 my @tab_fields = qw(q_seqid s_seqid percid alen mismatch gopen q_start q_end s_start s_end evalue bits score BTOP);
94108
109 if ($have_qslen) {
110 @tab_fields = qw(q_seqid q_len s_seqid s_len percid alen mismatch gopen q_start q_end s_start s_end evalue bits score BTOP);
111 }
112
95113 # the fields that are displayed are listed here. By default, all fields except score and BTOP are displayed.
96114 my @out_tab_fields = @tab_fields[0 .. $#tab_fields-1];
115
97116 if ($show_raw) {
98117 push @out_tab_fields, "raw_score";
99
100 }
118 }
119
101120 if ($out_field_str) {
102121 @out_tab_fields = split(/\s+/,$out_field_str);
103122 }
134153 push @hit_list, \%hit_data;
135154 }
136155
137 # get the current query sequence
156 # get the query annotations
157 if ($q_ann_script) {
158 $q_ann_script =~ s/\+/ /g;
159 }
160
138161 if ($q_ann_script && -x (split(/\s+/,$q_ann_script))[0]) {
139162 # get the domains for the q_seqid using --q_ann_script
140163 #
142165 my $pid = open2($Reader, $Writer, $q_ann_script);
143166 my $hit = $hit_list[0];
144167
145 print $Writer $hit->{q_seqid},"\n";
168 my $q_seq_len = scalar(@{$query_lib_r->{$hit->{q_seqid}}});
169 print $Writer $hit->{q_seqid},"\t",$q_seq_len,"\n";
146170 close($Writer);
147171
148 @q_hit_list = ({ s_seq_id=> $hit->{q_seqid} });
172 push @q_hit_list,{ s_seq_id=> $hit->{q_seqid}, s_end=> $q_seq_len};
149173
150174 read_annots($Reader, \@q_hit_list, 0);
151175
152176 waitpid($pid, 0);
153177 }
154178
155 # get the current query sequence
179 # get the subject annotations
180 if ($ann_script) {
181 $ann_script =~ s/\+/ /g;
182 }
183
156184 if ($ann_script && -x (split(/\s+/,$ann_script))[0]) {
157185 # get the domains for each s_seqid using --ann_script
158186 #
187 # this does not work currently because only one accession is sent.
188 # For multiple hits, I need to make a tmp_file.
189
159190 my ($Reader, $Writer);
160191 my $pid = open2($Reader, $Writer, $ann_script);
192
161193 for my $hit (@hit_list) {
162 print $Writer $hit->{s_seqid},"\n";
194 # print STDERR $hit->{s_seqid},"\t", $hit->{s_end},"\n";
195 # print $Writer $hit->{s_seqid},"\t", $hit->{s_end},"\n";
196 my $s_len = 100000;
197 if ($have_qslen) {
198 $s_len = $hit->{s_len};
199 }
200 print $Writer $hit->{s_seqid},"\t", $s_len,"\n";
163201 }
164202 close($Writer);
165203
174212 @header_lines = ($next_line);
175213
176214 # now get query sequence if available
215
216 if ($sub2query && scalar(@q_hit_list)==0) {
217 # copy the information from $hit_list
218 for my $tmp_hit ( @hit_list ) {
219 if ($tmp_hit->{q_seqid} eq $tmp_hit->{s_seqid}) {
220 my %tmp_q_hit = (s_seq_id=> $tmp_hit->{q_seqid}, s_end=> $tmp_hit->{s_len});
221
222 $tmp_q_hit{'domains'} = [];
223 for my $dom ( @{$tmp_hit->{domains}} ) {
224 my %new_dom = map { $_ => $dom->{$_} } keys(%$dom);
225 $new_dom{target} = 0;
226 push @{$tmp_q_hit{'domains'}}, \%new_dom;
227 }
228
229 $tmp_q_hit{'sites'} = [];
230 for my $site ( @{$tmp_hit->{sites}} ) {
231 my %new_site = map { $_ => $site->{$_} } keys(%$site);
232 $new_site{target} = 0;
233 push @{$tmp_q_hit{'sites'}}, \%new_site;
234 }
235 push @q_hit_list,\%tmp_q_hit;
236 last;
237 }
238 }
239 }
177240
178241 my $q_hit = $q_hit_list[0];
179242
237300
238301 if (scalar(@$merged_annots_r)) { # show subalignment scores if available
239302 print "\t";
240
241303 print format_annot_info($hit, $merged_annots_r);
304 if ($dom_info) {
305 print "\t",format_dom_info($q_hit->{domains}, $hit->{domains});
306 }
242307 }
243308 elsif (@list_covered) { # otherwise show domain content
244309 print "\t",join(";",@list_covered);
245 }
310 if ($dom_info) {
311 print "\t",format_dom_info($q_hit->{domains}, $hit->{domains});
312 }
313 }
314
246315 print "\n";
247316 }
248317
275344 while (my $line = <$Reader>) {
276345 next if $line=~ m/^=/;
277346 chomp $line;
347
348 # print STDERR "$line\n";
278349
279350 # check for header
280351 if ($line =~ m/^>/) {
289360 }
290361 @hit_domains = (); # current domains
291362 @hit_sites = (); # current sites
292 $current_domain = $line;
363 $current_domain = (split(/\s+/,$line))[0];
293364 $current_domain =~ s/^>//;
294365 } else { # check for data
295366 my %annot_info = (target=>$target);
308379 }
309380 close($Reader);
310381
311 # all done, save the last one
312382 $hit_list_r->[$hit_ix]{domains} = \@hit_domains;
313383 $hit_list_r->[$hit_ix]{sites} = \@hit_sites;
384
385 # clean up NODOMs in {domains}
386 for my $hit ( @$hit_list_r ) {
387 # clean-up last NODOM if < 10
388 my $tmp_domains = $hit->{domains};
389 next unless (scalar(@{$tmp_domains}));
390 my ($last_dom, $left_coord) = ($tmp_domains->[-1], $hit->{s_end});
391 if ($last_dom->{descr} =~ m/^NODOM/ && (($left_coord - $last_dom->{d_pos} + 1) < 10)) {
392 pop @$tmp_domains;
393 }
394 }
314395 }
315396
316397 # input: a blast BTOP string of the form: "1VA160TS7KG10RK27"
416497 $blosum62[22] = [ qw( 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4) ];
417498 $blosum62[23] = [ qw( -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1) ];
418499
419
420500 die "blosum62 length mismatch $#blosum62 != $#ncbi_blaa" if (scalar(@blosum62) != scalar(@ncbi_blaa));
421501
422502 for (my $i=0; $i < scalar(@ncbi_blaa); $i++) {
499579 my @aligned_domains = ();
500580
501581 my $left_active_end = $domain_r->[-1]->{d_end}+1; # as far right as possible
582 my $left_align_end = $hit_r->{q_end};
583 if ($target) {
584 $left_align_end = $hit_r->{s_end};
585 }
586
587 if ($left_active_end > $left_align_end ) {
588 $left_active_end = $left_align_end ;
589 }
590
502591 my ($q_start, $s_start, $h_start, $h_end) = @{$hit_r}{qw(q_start s_start s_start s_end)};
503 my ($qix, $six) = ($q_start, $s_start); # $qix now starts from 1, like $ssix;
592 my ($qix, $six) = ($q_start, $s_start); # $qix now starts from 1, like $six;
504593
505594 my $ds_ix = \$six; # use to track the subject position
506595 # reverse coordinate names if $target==0
11371226 return \@merged_array;
11381227 }
11391228
1140 # domain output formatter
1229 ####
1230 # print raw domain info:
1231 # |DX:%d-%d;C=dom_info|XD:%d-%d;C=dom_info
1232 #
11411233 sub format_dom_info {
1142 my ($hit_r, $raw_score, $dom_r) = @_;
1143
1144 unless ($raw_score) {
1145 warn "no raw_score at: ".$hit_r->{s_seqid}."\n";
1146 $raw_score = $hit_r->{score};
1147 }
1148
1149 my ($score_scale, $fsub_score) = ($hit_r->{score}/$raw_score, $dom_r->{score}/$raw_score);
1150
1151 my $qval = 0.0;
1152 if ($hit_r->{evalue} == 0.0) {
1153 $qval = 3000.0
1154 }
1155 else {
1156 $qval = -10.0*log($hit_r->{evalue})*$fsub_score/(log(10.0))
1157 }
1158
1159 my ($ns_score, $s_bit) = (int($dom_r->{score} * $score_scale+0.5),
1160 int($hit_r->{bits} * $fsub_score +0.5),
1161 );
1162 $qval = 0 if $qval < 0;
1163
1164 # print join(":",($dom_r->{ad_pos},$dom_r->{ad_end},$ns_score, $s_bit, sprintf("%.1f",$qval))),"\n";
1165 return join(";",(sprintf("|XR:%d-%d:%d-%d:s=%d",
1166 $dom_r->{qa_start},$dom_r->{qa_end},
1167 $dom_r->{sa_start},$dom_r->{sa_end},$ns_score),
1168 sprintf("b=%.1f",$s_bit),
1169 sprintf("I=%.3f",$dom_r->{percid}),
1170 sprintf("Q=%.1f",$qval),$dom_r->{descr}));
1234 my ($q_dom_r, $dom_r) = @_;
1235
1236 my $dom_str = "";
1237 for my $dom ( @$q_dom_r ) {
1238 $dom_str .= sprintf("|DX:%d-%d;C=%s",@{$dom}{qw(d_pos d_end descr)});
1239 }
1240 for my $dom ( @$dom_r ) {
1241 $dom_str .= sprintf("|XD:%d-%d;C=%s",@{$dom}{qw(d_pos d_end descr)});
1242 }
1243
1244 return $dom_str;
11711245 }
11721246
11731247 # merged annot output formatter
11951269 if ($annot_r->{type} eq '-') { # domain with scores
11961270 my $fsub_score = $annot_r->{score}/$raw_score;
11971271
1272 my ($ns_score, $s_bit) = (int($annot_r->{score} * $score_scale + 0.5),
1273 int($hit_r->{bits} * $fsub_score + 0.5),
1274 );
11981275 my $qval = 0.0;
11991276 if ($hit_r->{evalue} == 0.0) {
1200 $qval = 3000.0
1277 if ($s_bit > 50) {
1278 $qval = 3000.0
1279 }
1280 else {
1281 $qval = -10.0 * (log(400.0 * 400.) + $s_bit)/log(10.0);
1282 }
12011283 } else {
12021284 $qval = -10.0*log($hit_r->{evalue})*$fsub_score/(log(10.0))
12031285 }
12041286
1205 my ($ns_score, $s_bit) = (int($annot_r->{score} * $score_scale+0.5),
1206 int($hit_r->{bits} * $fsub_score +0.5),
1207 );
12081287 $qval = 0 if $qval < 0;
12091288
12101289 $annot_str .= join(";",(sprintf("|%s:%d-%d:%d-%d:s=%d",
12131292 $annot_r->{sa_start},$annot_r->{sa_end},$ns_score),
12141293 sprintf("b=%.1f",$s_bit),
12151294 sprintf("I=%.3f",$annot_r->{percid}),
1216 sprintf("Q=%.1f",$qval),$annot_r->{descr}));
1295 sprintf("Q=%.1f",$qval),"C=".$annot_r->{descr}));
12171296 }
12181297 else { # site annotation
12191298 my $ann_type = $annot_r->{type};
12521331
12531332 --ann_script -- annotation script returning site/domain locations for subject sequences
12541333 -- same as --script
1334
1335 --have_qslen -- use a blast tabular format that includes the query and subject sequence lengths:
1336 -- q_seqid q_len s_seqid s_len ...
12551337
12561338 --q_ann_script -- annotation script for query sequences
12571339 -- same as --q_script
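The per-domain `b=` and `Q=` values that `annot_blast_btop2.pl` reports are scaled from the whole-alignment bit score and E()-value by the fraction of the raw score that falls in the domain. A minimal Python sketch of that arithmetic follows. It is a simplification: the function name is hypothetical, and an E()-value of exactly 0 is simply mapped to 3000 here, whereas the Perl code in this commit only does so when the sub-alignment bit score exceeds 50.

```python
import math


def sub_alignment_scores(bits, evalue, raw_score, dom_score):
    """Return (sub_bits, qval) for one domain of an alignment.

    frac is the share of the alignment's raw score inside the domain.
    """
    frac = dom_score / raw_score
    sub_bits = int(bits * frac + 0.5)          # rounded sub-alignment bit score
    if evalue == 0.0:
        qval = 3000.0                          # simplification, see lead-in
    else:
        qval = -10.0 * math.log10(evalue) * frac
    return sub_bits, max(qval, 0.0)            # negative Q values are clamped to 0


print(sub_alignment_scores(100.0, 1e-50, 200, 100))
```

A domain holding half of the raw score of an alignment with E() = 1e-50 gets a Q value near 250, that is, half of -10*log10(E).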
0 #!/bin/bash
1
2 cmd="";
3 for i in "$@"
4 do
5 case $i in
6 -o=*|--outname=*)
7 OUTNAME="${i#*=}"
8 shift # past argument=value
9 ;;
10 -q=*|--query=*)
11 QUERY="${i#*=}"
12 cmd="$cmd -query $QUERY"
13 shift # past argument=value
14 ;;
15 --ann_script=*)
16 ANN_SCRIPT="${i#*=}"
17 shift
18 ;;
19 --q_ann_script=*)
20 Q_ANN_SCRIPT="${i#*=}"
21 shift
22 ;;
23 *)
24 cmd="$cmd $i"
25 ;;
26 esac
27 done
28
29 # echo "OUTNAME: " $OUTNAME
30 # echo "CMD: " $cmd
31
32 if [[ $OUTNAME == '' ]]; then
33 OUTNAME=${QUERY}_out
34 fi
35
36 #if [[ $ANN_SCRIPT == '' ]]; then
37 # ANN_SCRIPT="/seqprg/bin/ann_pfam30.pl --db=pfam31_qfo --host=localhost --neg --vdoms --acc_comment"
38 #fi
39
40
41 # echo "OUTNAME2: " $OUTNAME
42
43 bl_asn="$OUTNAME.asn"
44 bl0_out="$OUTNAME.html"
45 bla_out="${OUTNAME}_an.html"
46 blm_out="$OUTNAME.msa"
47 blt_out="$OUTNAME.bl_tab"
48 blt_ann="$OUTNAME.bl_tab_ann"
49 blr_out="$OUTNAME.bl_tab_rn"
50
51 # echo "tmp_files:"
52 # echo $bl_asn $bl0_out $bla_out $blt_out
53
54 # echo "OUTFILE = ${OUTNAME}"
55
56 #export BLAST_PATH="/ebi/extserv/bin/ncbi-blast+/bin"
57 export BLAST_PATH="/seqprg/bin"
58
59 $BLAST_PATH/blastp -outfmt 11 $cmd > $bl_asn
60 $BLAST_PATH/blast_formatter -archive $bl_asn -outfmt 0 -html > $bl0_out
61 $BLAST_PATH/blast_formatter -archive $bl_asn -outfmt '7 qseqid qlen sseqid slen pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' > $blt_out
62 annot_blast_btop2.pl --query $QUERY --have_qslen --dom_info --ann_script "$ANN_SCRIPT" --q_ann_script "$Q_ANN_SCRIPT" $blt_out > $blt_ann
63
64 rename_exons.py --have_qslen --dom_info $blt_ann > $blr_out
65 merge_blast_btab.pl --plot_url="plot_domain6t.cgi" --have_qslen --dom_info --btab $blr_out $bl0_out
66
67 # $BLAST_PATH/blast_formatter -archive $bl_asn -outfmt 2 > $blm_out
2525
2626 $BLAST_PATH/blastp -outfmt 11 $cmd > $bl_asn
2727 $BLAST_PATH/blast_formatter -archive $bl_asn -outfmt 0 -html > $bl0_out
28 $BLAST_PATH/blast_formatter -archive $bl_asn -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' > $blt_out
28 $BLAST_PATH/blast_formatter -archive $bl_asn -outfmt '7 qseqid qlen sseqid slen pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' > $blt_out
2929 $BLAST_PATH/blast_formatter -archive $bl_asn -outfmt 2 > $blm_out
3030
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2010, 2014 by William R. Pearson and The Rector &
3434 ## sequences from an NCBI blast-formatted database.
3535 ##
3636
37 use warnings;
3738 use strict;
3839 use DBI;
3940
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2010, 2014 by William R. Pearson and The Rector &
3434 ## sequences from an NCBI blast-formatted database.
3535 ##
3636
37 use warnings;
3738 use strict;
3839 use DBI;
3940
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2010, 2014 by William R. Pearson and The Rector &
2424 # (2) take the uniprot accessions and produce a fasta library file
2525 # from them
2626
27 use warnings;
2728 use strict;
2829 use DBI;
2930
0 #!/usr/bin/env perl
1
2 ################################################################
3 # copyright (c) 2010, 2014 by William R. Pearson and The Rector &
4 # Visitors of the University of Virginia */
5 ################################################################
6 # Licensed under the Apache License, Version 2.0 (the "License");
7 # you may not use this file except in compliance with the License.
8 # You may obtain a copy of the License at
9 #
10 # http://www.apache.org/licenses/LICENSE-2.0
11 #
12 # Unless required by applicable law or agreed to in writing,
13 # software distributed under this License is distributed on an "AS
14 # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
15 # express or implied. See the License for the specific language
16 # governing permissions and limitations under the License.
17 ################################################################
18
19 ## usage - expand_up_isoforms.pl [--prim_acc] up_hits.file > up_isoforms.file
20 ##
21 ## take a fasta36 -e expand.sh result file of the form:
22 ## sp|P09488|GSTM1_HUMAN<tab>1.1e-50
23 ##
24 ## and extract the accession number, looking it up from an SQL
25 ## table $table -- in this case "annot2_iso" to provide Uniprot
26 ## isoforms based on a uniprot accession.
27 ##
28 ## if --prim_acc, then the primary accession (used to find the isoforms) is added to the isoform seq_id, e.g.
29 ## sp|P09488|GSTM1_HUMAN has isoforms: with --prim_acc, the identifiers become
30 ## >iso|E7EWW9|E7EWW9_HUMAN >iso|E7EWW9|E7EWW9_HUMAN_P09488
31 ## >iso|H3BRM6|H3BRM6_HUMAN >iso|H3BRM6|H3BRM6_HUMAN_P09488
32 ## >iso|H3BQT3|H3BQT3_HUMAN >iso|H3BQT3|H3BQT3_HUMAN_P09488
33
34 use warnings;
35 use strict;
36 use Getopt::Long;
37 use Pod::Usage;
38 use DBI;
39
40 my ($host, $db, $port, $user, $pass) = ("xdb", "uniprot", 0, "web_user", "fasta_www");
41 $host = 'wrpxdb.its.virginia.edu';
42 my ($a_table, $i_table) = ("annot2", "annot2_iso");
43 my ($help, $shelp) = (0,0);
44 my ($e_thresh, $prim_acc) = (1e-6, 0);
45
46 GetOptions(
47 "h" => \$shelp,
48 "help" => \$help,
49 "host=s" => \$host,
50 "prim_acc!" => \$prim_acc,
51 "db=s" => \$db,
52 "expect|evalue|e_thresh=f" => \$e_thresh,
53 "user=s" => \$user,
54 "password=s" => \$pass,
55 "port=i" => \$port,
56 "i_table=s" => \$i_table,
57 "a_table=s" => \$a_table,
58 );
59
60 pod2usage(1) if $shelp;
61 pod2usage(exitstatus => 0, verbose => 2) if $help;
62 pod2usage(1) unless (@ARGV || -p STDIN || -f STDIN);
63
64 my $dbh = DBI->connect("dbi:mysql:host=$host:$db",
65 $user, $pass,
66 { RaiseError => 1, AutoCommit => 1}
67 ) or die $DBI::errstr;
68
69 my %sth = (
70 seed2link_acc => "SELECT acc FROM $i_table WHERE prim_acc=?",
71 seed2link_id => "SELECT iso_a.acc FROM $i_table as iso_a JOIN $a_table as an2 on(iso_a.prim_acc=an2.acc) where an2.id=?",
72 link2seq => "SELECT db, acc, prim_acc, id, descr, seq FROM annot2_iso JOIN protein_iso USING(acc) WHERE acc=?"
73 );
74
75 for my $sth (keys(%sth)) {
76 $sth{$sth} = $dbh->prepare($sth{$sth});
77 }
78
79 my %acc_uniq = ();
80
81 # get the query
82 my ($query, $eval_arg) = @ARGV;
83 $eval_arg = 1e-10 unless $eval_arg;
84 $query =~ s/^>// if ($query);
85 my @link_lines = ();
86
87 #if it's a file I can open, read and parse it
88 unless ($query && ($query =~ m/[\|:]/ ||
89 $query =~ m/^[NX]P_/ ||
90 $query =~ m/^[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}\s/)) {
91
92 while (my $a_line = <>) {
93 $a_line =~ s/^>//;
94 chomp $a_line;
95 push @link_lines, $a_line;
96 }
97 }
98 else {
99 push @link_lines, "$query\t$eval_arg";
100 }
101
102 for my $line ( @link_lines ) {
103 my ($hit, $e_val) = split(/\t/,$line);
104
105 if ($e_val <= $e_thresh) {
106 process_line($hit,$sth{seed2link_acc},$sth{seed2link_id});
107 }
108 }
109
110 for my $acc ( keys %acc_uniq ) {
111
112 $sth{link2seq}->execute($acc);
113 while (my $row_href = $sth{link2seq}->fetchrow_hashref ) {
114 my $id_str = $row_href->{id};
115 if ($prim_acc) {
116 $id_str .= "_".$row_href->{prim_acc};
117 }
118
119 printf(">%s|%s|%s %s\n","iso",$acc,$id_str,$row_href->{descr});
120 my $iso_seq = $row_href->{seq};
121 $iso_seq =~ s/(.{60})/$1\n/g;
122
123 print "$iso_seq\n";
124 }
125 $sth{link2seq}->finish();
126 }
127
128 $dbh->disconnect();
129
130 sub process_line{
131 my ($seqid,$sth_acc, $sth_id)=@_;
132
133 my $sth = $sth_acc;
134
135 my ($db, $link_acc, $link_id) = ("","","");
136
137 if ($seqid =~ m/\|/) {
138 ($db, $link_acc, $link_id) = split('\|',$seqid);
139 $link_acc =~ s/\.\d+$//;
140
141 $sth_acc->execute($link_acc);
142 }
143 elsif ($seqid =~ m/:/) {
144 ($db, $link_id) = split(':',$seqid);
145 $sth_id->execute($link_id);
146 $sth = $sth_id;
147 }
148 else {
149 $link_acc = $seqid;
150 $link_acc =~ s/\.\d+$//;
151 $sth_acc->execute($link_acc);
152 }
153
154 while (my ($acc) = $sth->fetchrow_array()) {
155 next if ($acc eq $link_acc);
156 $acc_uniq{$acc} = $link_acc unless $acc_uniq{$acc};
157 }
158 $sth->finish();
159 }
160
161 __END__
162
163 =pod
164
165 =head1 NAME
166
167 expand_up_isoforms.pl expand_file.tab
168
169 =head1 SYNOPSIS
170
171 expand_up_isoforms.pl expand_file.tab
172
173 =head1 OPTIONS
174
175 -h short help
176 --help include description
177 --evalue E()-value threshold for expansion
178 --prim_acc : show primary accession as part of sequence identifier
179 >iso|E7EWW9|E7EWW9_HUMAN becomes >iso|E7EWW9|E7EWW9_HUMAN_P09488
180
181 --host, --user, --password, --port --db : info for mysql database
182 --a_table, --i_table -- SQL table names with reference and isoform acc/id/prim_acc mappings.
183
184 =head1 DESCRIPTION
185
186 C<expand_up_isoforms.pl> uses protein isoform tables in an SQL database to identify and extract
187 isoforms of proteins in a reference protein sequence database.
188
189 C<expand_up_isoforms.pl> takes a file with sequence identifiers and E()-values of the form:
190
191 sp|P09488|GSTM1_HUMAN <tab> 1e-40
192 sp:CALM_HUMAN <tab> 1e-40
193
194 Lines with E()-values less than --evalue (1E-6 by default) are used to
195 identify protein isoforms, which are included in the set of sequences to be aligned.
196
197 C<expand_up_isoforms.pl> is designed to be used by the B<FASTA> programs with
198 the C<-e expand_up_isoforms.pl> option.
199
200 =head1 AUTHOR
201
202 William R. Pearson, wrp@virginia.edu
203
204 =cut
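The input format described in the POD above (an identifier, a tab, and an E()-value, with expansion only below a threshold) can be illustrated with a small Python sketch. This is not part of `expand_up_isoforms.pl`; the function names `select_seqids` and `seqid_to_key` are hypothetical, and the identifier handling only mirrors the three branches visible in `process_line()` (`db|acc|id`, `db:id`, and a bare accession), with no database lookup.

```python
import re


def select_seqids(lines, evalue_max=1e-6):
    """Return identifiers whose E()-value is below evalue_max.

    Each line is expected to look like 'sp|P09488|GSTM1_HUMAN<tab>1e-40'.
    The 1e-6 default follows the value quoted in the POD.
    """
    selected = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        seqid, evalue = fields[0], float(fields[1])
        if evalue < evalue_max:
            selected.append(seqid)
    return selected


def seqid_to_key(seqid):
    """Classify an identifier as an accession or an id lookup (hypothetical helper)."""
    if "|" in seqid:
        # db|acc|id form: look up by accession, dropping any '.N' version suffix
        acc = seqid.split("|")[1]
        return ("acc", re.sub(r"\.\d+$", "", acc))
    if ":" in seqid:
        # db:id form: look up by id
        return ("id", seqid.split(":")[1])
    # bare accession
    return ("acc", re.sub(r"\.\d+$", "", seqid))
```

For example, `seqid_to_key("sp:CALM_HUMAN")` gives `("id", "CALM_HUMAN")`, which corresponds to the branch that executes the id-based statement handle.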
0 #!/bin/bash
1
2 cmd="";
3 for i in "$@"
4 do
5 case $i in
6 --outname=*)
7 OUTNAME="${i#*=}"
8 shift # past argument=value
9 ;;
10 --query=*)
11 QUERY="${i#*=}"
12 shift # past argument=value
13 ;;
14 --db=*)
15 DATABASE="${i#*=}"
16 shift # past argument=value
17 ;;
18 --cmd=*)
19 SRCH_CMD="${i#*=}"
20 shift
21 ;;
22 --ktup=*)
23 KTUP="${i#*=}"
24 shift
25 ;;
26 *)
27 cmd="$cmd $i"
28 ;;
29 esac
30 done
31
32
33 # echo "OUTNAME: " $OUTNAME
34 echo "# CMD: " $cmd
35
36 if [[ $OUTNAME == '' ]]; then
37 OUTNAME=${QUERY}_out
38 fi
39
40 if [[ $SRCH_CMD == '' ]]; then
41 SRCH_CMD=fasta36
42 fi
43
44 #if [[ $ANN_SCRIPT == '' ]]; then
45 # ANN_SCRIPT="/seqprg/bin/ann_pfam30.pl --db=pfam31_qfo --host=localhost --neg --vdoms --acc_comment"
46 #fi
47
48
49 # echo "OUTNAME: " $OUTNAME
50
51 bl0_out="$OUTNAME.html"
52 bla_out="${OUTNAME}_an.html"
53 blt_out="$OUTNAME.fa_tab"
54 blr_out="$OUTNAME.fa_tab_rn"
55
56 export BLAST_PATH="/seqprg/bin"
57 # BLAST_PATH="../bin"
58
59 cmd="$cmd -mF8CBL=$blt_out $QUERY $DATABASE"
60
61 # echo "tmp_files:"
62 # echo $bl_asn $bl0_out $bla_out $blt_out
63 # echo "OUTFILE = ${OUTNAME}"
64
65 #echo "cmd: $cmd"
66 #echo "==="
67 #echo "bl0_out: $bl0_out"
68 #echo "==="
69
70 # echo "$BLAST_PATH/$SRCH_CMD $cmd > $bl0_out"
71
72 # run the program
73 $BLAST_PATH/$SRCH_CMD $cmd > $bl0_out
74
75 $BLAST_PATH/rename_exons.py --have_qslen --dom_info $blt_out > $blr_out
76
77 if [ ! -s $blr_out ]; then
78 # echo "# " `ls -l $blt_out $blr_out`
79 blr_out=$blt_out
80 # echo "# " `ls -l $blt_out $blr_out`
81 fi
82
83 $BLAST_PATH/merge_fasta_btab.pl --plot_url="plot_domain6t.cgi" --have_qslen --dom_info --btab $blr_out $bl0_out
0 #!/usr/bin/python
1
2 ################
3 ## get_hg38_bed.py parses an HG38 coordinate into a pseudo-bed entry,
4 ## and runs bedtools getfasta to return the fasta sequence
5 ##
6
7 import sys
8 import re
9 from subprocess import Popen, PIPE, STDOUT
10 import shlex
11 import argparse
12
13 ## a genome_loc should look like: chr#:start-stop
14 ## if stop < start, coordinates are reversed
15
16 genome_dict={'hg38':'genome_dna/hg38/reference.fa',
17 'mm10':'genome_dna/mm10/reference.fa',
18 'rn6':'genome_dna/rn6/rn6.fa'}
19
20 parser=argparse.ArgumentParser(description='get_genome_seq.py : get fasta sequence from genome coordinates ')
21 parser.add_argument('--genome', help='genome: hg38 | mm10 | rn6',dest='genome',action='store',default='hg38')
22 parser.add_argument('coords', help='genome coordinates chr1:12345-54321', nargs='*')
23
24 args=parser.parse_args()
25
26 bed_cmd = 'bedtools getfasta -fi $RDLIB2/%s -bed stdin' % (genome_dict[args.genome])
27
28 bed_lines = ''
29 for genome_loc in args.coords:
30
31 chrom, g_range = genome_loc.split(':')
32 g_start, g_end = g_range.split('-')
33
34 g_start, g_end = int(g_start), int(g_end)  # compare as integers, not strings
35 if (g_start > g_end):
36 g_start, g_end = g_end, g_start
37
38 g_start -= 1
39
40 bed_lines += '%s\t%d\t%d\n' % (chrom, g_start, g_end)
41
42 bed_p = Popen(bed_cmd, stdout=PIPE, stdin=PIPE, stderr=STDOUT, shell=True)
43 out, err = bed_p.communicate(input=bed_lines)
44
45 for line in out.split('\n'):
46 if (line and line[0]=='>'):
47 (chrom, start, stop) = re.search(r'>([^:]+):(\d+)\-(\d+)',line).groups()
48 print line + " @C:%s" % (start)
49 elif (line):
50 print line
51
52
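The coordinate handling in the script above (a `chr#:start-stop` location turned into a zero-based, half-open BED interval) can be sketched on its own, without calling `bedtools`. The helper name `loc_to_bed` is hypothetical. The sketch converts the coordinates to integers before comparing them, since the strings `"100"` and `"99"` would otherwise be ordered lexically.

```python
def loc_to_bed(genome_loc):
    """Convert 'chr1:12345-54321' (1-based, inclusive) to a BED line.

    BED intervals are 0-based and half-open, so only the start is
    decremented. Reversed coordinates (stop < start) are swapped.
    """
    chrom, g_range = genome_loc.split(":")
    g_start, g_end = (int(x) for x in g_range.split("-"))
    if g_start > g_end:
        g_start, g_end = g_end, g_start
    return "%s\t%d\t%d" % (chrom, g_start - 1, g_end)
```

With this helper, `loc_to_bed("chr1:12345-54321")` produces the three tab-separated fields `chr1`, `12344`, `54321`.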
0 #!/usr/bin/python
1
2 ## get_protein.py --
3 ## get a protein sequence from Uniprot or NCBI/Refseq using the accession
4 ##
5
6 import sys
7 import re
8 import textwrap
9 from urllib2 import urlopen
10
11 ncbi_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
12 uniprot_url = "https://www.uniprot.org/uniprot/"
13
14 sub_range = ''
15 for acc in sys.argv[1:]:
16
17 if (re.search(r':',acc)):
18 (acc, sub_range) = acc.split(':')
19
20 if (re.match(r'^(sp|tr|iso|ref)\|',acc)):
21 acc=acc.split('|')[1]
22
23 if (re.match(r'[NX]P_',acc)):
24 db_type="protein"
25
26 seq_args = "db=%s&id=" % (db_type) + ",".join(sys.argv[1:]) + "&rettype=fasta"
27 seq_html = urlopen(ncbi_url + seq_args).read()
28 else:
29 seq_html = urlopen(uniprot_url + acc + ".fasta").read()
30
31 header=''
32 seq = ''
33 for line in seq_html.split('\n'):
34 if (line and line[0]=='>'):
35 # print out old one if there
36 if (header):
37 if (sub_range):
38 start, stop = sub_range.split('-')
39 start, stop = int(start), int(stop)
40 if (start > 0):
41 start -= 1
42 new_seq = seq[start:stop]
43 else:
44 start = 0
45 new_seq = seq
46
47 if (start > 0):
48 print "%s @C:%d" %(header, start+1)
49 else:
50 print header
51 print '\n'.join(textwrap.wrap(new_seq))
52
53 header = line;
54 seq = ''
55 else:
56 seq += line
57
58 start=0
59 if (sub_range):
60 start, stop = sub_range.split('-')
61 start, stop = int(start), int(stop)
62 if (start > 0):
63 start -= 1
64 new_seq = seq[start:stop]
65 else:
66 new_seq = seq
67
68 if (start > 0):
69 print "%s @C:%d" %(header, start+1)
70 else:
71 print header
72
73 print '\n'.join(textwrap.wrap(new_seq))
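The sub-range logic in the script above (an accession given as `acc:start-stop`, a 1-based start converted to a slice offset, and an `@C:` offset appended to the header) can be isolated in a short Python 3 sketch. The function name `subrange_fasta` is hypothetical, and the sketch works on a sequence already in memory rather than fetching one from UniProt or NCBI.

```python
import textwrap


def subrange_fasta(header, seq, sub_range=""):
    """Return a FASTA record, optionally restricted to 'start-stop' (1-based)."""
    start = 0
    if sub_range:
        first, last = (int(x) for x in sub_range.split("-"))
        # convert the 1-based start to a 0-based slice offset
        start = first - 1 if first > 0 else first
        seq = seq[start:last]
    if start > 0:
        # record where the sub-sequence begins in the original coordinates
        header = "%s @C:%d" % (header, start + 1)
    return header + "\n" + "\n".join(textwrap.wrap(seq))
```

A call such as `subrange_fasta(">x", "ABCDEFGHIJ", "3-5")` returns the header `>x @C:3` followed by the residues `CDE`.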
0 #!/usr/bin/python
1
2 import sys
3 import re
4 from urllib2 import urlopen
5
6
7 db_type="protein"
8 if (re.match(r'[NX]M_',sys.argv[1])):
9 db_type="nucleotide"
10
11 seq_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
12 seq_args = "db=%s&id=" % (db_type) + ",".join(sys.argv[1:]) + "&rettype=fasta"
13
14 seq_html = urlopen(seq_url + seq_args).read()
15
16 print seq_html
0 #!/usr/bin/python
1
2 import sys
3 from urllib import urlopen
4
5 ARGV = sys.argv[1:];
6
7 for acc in ARGV :
8 url = "https://www.uniprot.org/uniprot/" + acc + ".fasta"
9 # print url
10 fa_seq = urlopen(url).read()
11 print fa_seq
0 #!/usr/bin/python
1
2 import sys
3 import re
4 import textwrap
5 import argparse
6 import MySQLdb.cursors
7
8 from urllib2 import urlopen
9
10 ncbi_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
11 uniprot_url = "https://www.uniprot.org/uniprot/"
12
13 db = MySQLdb.connect(db='uniprot', host='xdb', user='web_user', passwd='fasta_www',
14 cursorclass=MySQLdb.cursors.DictCursor)
15
16 cur1 = db.cursor()
17 cur2 = db.cursor()
18 get_iso_acc='select acc from annot2_iso where prim_acc="%s"'
19 get_fasta_info='select db, acc, id, descr, seq from annot2 join protein using(acc) where acc="%s"'
20 get_iso_fasta_info='select db, acc, id, descr, seq from annot2_iso join protein_iso using(acc) where prim_acc="%s"'
21
22 fasta_seqs=[]
23
24 for acc in sys.argv[1:]:
25
26 if (re.search(r':',acc)):
27 (acc, sub_range) = acc.split(':')
28
29 if (re.match(r'^(sp|tr|iso|ref)\|',acc)):
30 acc=acc.split('|')[1]
31
32 cur1.execute(get_fasta_info%(acc,))
33 row = cur1.fetchone()
34 if (row):
35 fasta_seqs.append(row)
36 else:
37 sys.stderr.write("***error*** %s sequence not found\n"%(acc))
38 continue
39
40 cur2.execute(get_iso_fasta_info%(acc,))
41 for row in cur2:
42 fasta_seqs.append(row)
43
44 for row in fasta_seqs:
45 print ">%s|%s|%s %s"%(row['db'],row['acc'],row['id'],row['descr'])
46 print '\n'.join(textwrap.wrap(row['seq']))
47
48
49
50
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 # lav2plt.pl - produce plot from lav output */
33
2121 # governing permissions and limitations under the License.
2222 ################################################################
2323
24 use warnings;
2425 use strict;
2526 use Getopt::Long;
2627 use Pod::Usage;
0 #!/usr/bin/env perl
1 #
02 ################################################################
13 # copyright (c) 2012, 2014 by William R. Pearson and The Rector &
24 # Visitors of the University of Virginia */
1315 # express or implied. See the License for the specific language
1416 # governing permissions and limitations under the License.
1517 ################################################################
18
19 use warnings;
20 use strict;
1621
1722 #define SX(x) (int)((double)(x)*fxscal+fxoff+24)
1823 sub SX {
0
0 #!/usr/bin/env perl
1 #
12 ################################################################
23 # copyright (c) 2012, 2014 by William R. Pearson and The Rector &
34 # Visitors of the University of Virginia */
1516 # governing permissions and limitations under the License.
1617 ################################################################
1718
19 use warnings;
20 use strict;
21
1822 #define SX(x) (int)((double)(x)*fxscal+fxoff+6)
1923 sub SX {
2024 my $xx = shift;
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014 by William R. Pearson and The Rector &
1616 # governing permissions and limitations under the License.
1717 ################################################################
1818
19 use warnings;
1920 use strict;
2021 use DBI;
2122 use Getopt::Long;
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014,2015 by William R. Pearson and The Rector &
3939 #
4040 ################################################################
4141
42 use warnings;
4243 use strict;
4344 use IPC::Open2;
4445 use Pod::Usage;
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014,2015 by William R. Pearson and The Rector &
3636 #
3737 ################################################################
3838
39 use warnings;
3940 use strict;
4041 use IPC::Open2;
4142 use Pod::Usage;
0 #!/usr/bin/env python
1 #
2 # given a -m8CB file with exon annotations for the query and subject,
3 # provide a function that maps subject coordinates to query, or vice versa
4
5 ################################################################
6 # copyright (c) 2018 by William R. Pearson and The Rector &
7 # Visitors of the University of Virginia */
8 ################################################################
9 # Licensed under the Apache License, Version 2.0 (the "License");
10 # you may not use this file except in compliance with the License.
11 # You may obtain a copy of the License at
12 #
13 # http://www.apache.org/licenses/LICENSE-2.0
14 #
15 # Unless required by applicable law or agreed to in writing,
16 # software distributed under this License is distributed on an "AS
17 # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
18 # express or implied. See the License for the specific language
19 # governing permissions and limitations under the License.
20 ################################################################
21
22 import fileinput
23 import sys
24 import re
25 import argparse
26 import copy
27
28 ################
29 # "domain" class that describes a domain/exon alignment annotation
30 #
31 class exonInfo:
32 def __init__(self, name, q_target, p_start, p_end, chrom, d_start, d_end, full_text):
33 self.name = name
34 self.q_target = q_target
35 self.p_start = p_start
36 self.p_end = p_end
37 self.chrom = chrom
38 self.d_start = d_start
39 self.d_end = d_end
40 self.text = full_text
41 self.plus_strand = True
42 if (d_start > d_end):
43 self.plus_strand = False
44
45 def __str__(self):
46 rxr_str = "XD"
47 if (self.q_target):
48 rxr_str="DX"
49 return '|%s:%i-%i:%s{%s:%i-%i}' % (rxr_str, self.p_start, self.p_end, self.name, self.chrom, self.d_start, self.d_end)
50
51 class exonAlign:
52 def __init__(self, name, q_target, qp_start, qp_end, sp_start, sp_end, full_text):
53 self.exon = None
54
55 self.name = name
56 self.q_target = q_target
57
58 self.q_start = qp_start
59 self.q_end = qp_end
60 self.s_start = sp_start
61 self.s_end = sp_end
62
63 self.text = full_text
64 self.out_str = ''
65
66 def __str__(self):
67 rxr_str = "XR"
68 if (self.q_target):
69 rxr_str="RX"
70 return "[%s:%i-%i:%i-%i::%s" % (rxr_str,self.q_start, self.q_end, self.s_start, self.s_end, self.name)
71
72 def print_bar_str(self): # checking for 'NADA'
73 if (not self.out_str):
74 self.out_str = self.text
75 return str("|%s"%(self.out_str))
76
77 # Parses domain annotations after split at '|'
78
79 #
80 def parse_exon_align(text):
81 # takes a domain in string form, turns it into a domain object
82 # looks like: RX:5-82:5-82:s=397;b=163.1;I=1.000;Q=453.6;C=C.Thioredoxin~1
83 # could also look like: RX:5-82:5-82:s=397;b=163.1;I=1.000;Q=453.6;C=C.Thioredoxin{PF012445}~1
84
85 # get RX/XR and qstart/qstop sstart/sstop as strings
86 m = re.search(r'^(\w+):(\d+)-(\d+):(\d+)-(\d+):',text)
87 if (m):
88 (RXRState, qstart_s, qend_s, sstart_s, send_s) = m.groups()
89 else:
90 sys.stderr.write("could not parse exon location: %s\n"%(text))
91
92 # get domain name/color (and possibly {info})
93
94 (name, color_s) = re.search(r';C=([^~]+)~(.+)$',text).groups()
95 info_s=""
96
97 if (re.search(r'\}$',name)):
98 (name, info_s) = re.search(r'([^\{]+)(\{[^\}]+\})$',name).groups()
99
100 q_target = True
101 if (RXRState=='XR'):
102 q_target = False
103
104 exon_align = exonAlign(name, q_target, int(qstart_s), int(qend_s), int(sstart_s), int(send_s),
105 text)
106
107 return exon_align
108
109 ################
110 # exon_info is like domain, but no scores
111 #
112 def parse_exon_info(text):
113 # takes a domain in string form, turns it into a domain object
114 # looks like: DX:1-100;C=C.Thioredoxin~1
115
116 (RXRState, start_s, end_s,name, color) = re.search(r'^(\w+):(\d+)-(\d+);C=([^~]+)~(.*)$',text).groups()
117 info = ""
118 if (re.search(r'\}$',name)):
119 (name, info) = re.search(r'([^\{]+)(\{[^\}]+\})$',name).groups()
120
121 gene_re = re.search(r'^\{(\w+):(\d+)\-(\d+)\}',info)
122 if (gene_re):
123 (chrom, d_start, d_end) = gene_re.groups()
124 else:
125 sys.stderr.write("genome info not found: %s\n" % (text))
126
127 q_target = True;
128 if (RXRState == 'XD'):
129 q_target = False
130
131 exon_info = exonInfo(name, q_target, int(start_s), int(end_s), chrom, int(d_start), int(d_end), text)
132
133 return exon_info
134
135 ####
136 # parse_protein(result_line)
137 # takes a protein in string format, turns it into a dictionary properly
138 # looks like: sp|P30711|GSTT1_HUMAN up|Q2NL00|GSTT1_BOVIN 86.67 240 32 0 1 240 1 240 1.4e-123 444.0 16VI7DR6IT3IR15KQ3AI6TI11TA7YH8RC12TA3SN10FL10QETM2AT6VMTA2LV2DG4ND6PS24EK6TA11DV14FSPQ5IL3LMML1WK5RQ |XR:4-76:4-76:s=327;b=134.6;I=0.895;Q=367.8;C=C.Thioredoxin~1|RX:5-82:5-82:s=356;b=146.5;I=0.902;Q=403.3;C=C.Thioredoxin~1|RX:83-93:83-93:s=52;b=21.4;I=0.818;Q=30.9;C=NODOM~0|XR:77-93:77-93:s=86;b=35.4;I=0.882;Q=72.6;C=NODOM~0|RX:94-110:94-110:s=88;b=36.2;I=0.882;Q=75.0;C=vC.GST_C~2v|XR:94-110:94-110:s=88;b=36.2;I=0.882;Q=75.0;C=vC.GST_C~2v|RX:111-201:111-201:s=409;b=168.3;I=0.868;Q=468.3;C=C.GST_C~2|XR:111-201:111-201:s=409;b=168.3;I=0.868;Q=468.3;C=C.GST_C~2|RX:202-240:202-240:s=154;b=63.4;I=0.795;Q=155.9;C=NODOM~0|XR:202-240:202-240:s=154;b=63.4;I=0.795;Q=155.9;C=NODOM~0
139 #
140 def parse_protein(line_data,fields, req_name):
141 # last part (domain annotations) split('|') and parsed by parse_exon_align()
142
143 data = {}
144 data = dict(zip(fields, line_data))
145 if (re.search(r'\|',data['qseqid'])):
146 data['qseq_acc'] = data['qseqid'].split('|')[1]
147 else:
148 data['qseq_acc'] = data['qseqid']
149
150 if (re.search(r'\|',data['sseqid'])):
151 data['sseq_acc'] = data['sseqid'].split('|')[1]
152 else:
153 data['sseq_acc'] = data['sseqid']
154
155 Qexon_list = []
156 Sexon_list = []
157
158 Qinfo_list = []
159 Sinfo_list = []
160
161 counter = 0
162
163 if ('align_annot' in data and len(data['align_annot']) > 0):
164 for exon_str in data['align_annot'].split('|')[1:]:
165 if (req_name and not re.search(req_name, exon_str)):
166 continue
167
168 counter += 1
169 exon = parse_exon_align(exon_str)
170 if (exon.q_target):
171 Qexon_list.append(exon)
172 else:
173 Sexon_list.append(exon)
174
175 data['q_exalign_list'] = Qexon_list
176 data['s_exalign_list'] = Sexon_list
177
178 if ('exon_info' in data and len(data['exon_info']) > 0):
179 for info_str in data['exon_info'].split('|')[1:]:
180 if (not re.search(r'^[DX][XD]',info_str)):
181 continue
182
183 dinfo = parse_exon_info(info_str)
184
185 if (dinfo.q_target):
186 Qinfo_list.append(dinfo)
187 else:
188 Sinfo_list.append(dinfo)
189
190
191 # put links to info_list into exon_list so info_list names can
192 # be changed -- give S/Qinfo's the S/Qdom ids of the overlapping domain
193
194 # find_info_overlaps(Qinfo_list, Qexon_list)
195 # find_info_overlaps(Sinfo_list, Sexon_list)
196
197 data['q_exinfo_list'] = Qinfo_list
198 data['s_exinfo_list'] = Sinfo_list
199
200 return data
201
202 ################
203 #
204 # decode_btop() -
205 # input: a blast BTOP string of the form: "1VA160TS7KG10RK27"
206 # returns a list of string tokens: ("1", "VA", "160", "TS", "7", "KG", "10", "RK", "27")
207 def decode_btop(btop_str):
208 out_tokens = []
209 for token in re.split(r'(\d+)',btop_str):
210 if (not token): continue
211 if re.match(r'\d+',token):
212 out_tokens.append(token)
213 else:
214 for mismat in re.split(r'(..)',token):
215 if (mismat): out_tokens.append(mismat)
216
217 return out_tokens
218
219 ################
220 #
221 # map_align(btop, q_start, s_start)
222 # input: btop
223 # output: q_pos_arr, s_pos_arr
224 #
225 def map_align(btop_str, q_start, s_start):
226
227 q_pos = q_start
228 s_pos = s_start
229
230 q_pos_arr = []
231 s_pos_arr = []
232
233 btop_tokens = decode_btop(btop_str)
234
235 for t in btop_tokens:
236 if (re.match(r'\d+',t)):
237 for i in range(int(t)) :
238 q_pos_arr.append(q_pos)
239 q_pos += 1
240 s_pos_arr.append(s_pos)
241 s_pos += 1
242 elif (re.match(r'\-\w',t)):
243 q_pos_arr.append(q_pos)
244 s_pos_arr.append(s_pos)
245 s_pos += 1
246 elif (re.match(r'\w\-',t)):
247 q_pos_arr.append(q_pos)
248 q_pos += 1
249 s_pos_arr.append(s_pos)
250 else:
251 q_pos_arr.append(q_pos)
252 q_pos += 1
253 s_pos_arr.append(s_pos)
254 s_pos += 1
255
256 return q_pos_arr, s_pos_arr
257
258 ################
259 #
260 # map_coords(from_coords, to_coords, coord_list)
261 #
262 def map_coords(from_coords, to_coords, coord_list):
263
264 mapped_coords = []
265
266 fx = 0
267 mx = 0
268 while mx < len(coord_list):
269 this_from_coord = coord_list[mx]
270 while (from_coords[fx] < this_from_coord):
271 fx += 1
272 continue
273
274 mapped_coords.append(to_coords[fx])
275 mx += 1
276
277 return mapped_coords
278
279 ################
280 #
281 # map_align_coords() given a BTOP, q_start, s_start, and s_target, generate s_coords for list of q_coords
282 #
283 def map_align_coords(btop_str, q_start, s_start, s_target, coord_list):
284
285 (q_coords, s_coords) = map_align(btop_str, q_start, s_start)
286
287 sorted_coord_list = sorted(coord_list)
288
289 if (s_target):
290 s_mapped_coords = map_coords(q_coords, s_coords, sorted_coord_list)
291 else:
292 s_mapped_coords = map_coords(s_coords, q_coords, sorted_coord_list)
293
294 coord_dict={}
295 for ix, s_coord in enumerate(sorted_coord_list):
296 coord_dict[s_coord]=s_mapped_coords[ix]
297
298 return [ coord_dict[c] for c in coord_list ]
299
300
301 ################
302 #
303 # aa_to_exon() --- given a coordinate and the corresponding exon map, return the exon coordinate
304 # (can only be done for aligned exons)
305 #
306 # this version of the function must use an info_list, not an
307 # align_list, because it uses p_start/p_end rather than qp_start/sp_start, etc.
308 # a version using qp_start/sp_start would also need a target argument
309 #
310 def aa_to_exon(aa_coords, exon_info_list):
311
312 sorted_aa_coords = sorted(aa_coords)
313
314 pos_strand = True
315 if (exon_info_list[0].d_start > exon_info_list[0].d_end):
316 pos_strand = False
317
318 ex_x = 0
319 exon_coords = []
320
321 aap_x = 0
322 this_aap = sorted_aa_coords[aap_x]
323 while (ex_x < len(exon_info_list)):
324 this_exon = exon_info_list[ex_x]
325 if (this_aap <= this_exon.p_end and this_aap >= this_exon.p_start):
326 aa_dna_offset = (this_aap - this_exon.p_start) * 3
327
328 if (pos_strand):
329 aa_dna_pos = this_exon.d_start + aa_dna_offset
330 else:
331 aa_dna_pos = this_exon.d_start - aa_dna_offset
332
333 exon_coords.append({'chrom':this_exon.chrom, 'dpos':aa_dna_pos})
334 aap_x += 1
335 if (aap_x < len(sorted_aa_coords)):
336 this_aap = sorted_aa_coords[aap_x]
337 else:
338 break
339 else:
340 ex_x += 1
341
342 aa_coord_dict = {}
343 for aap_x, aap in enumerate(sorted_aa_coords):
344 aa_coord_dict[aap] = exon_coords[aap_x]
345
346 return [aa_coord_dict[ax] for ax in aa_coords]
347
348 ################
349 # set_data_fields() -- initialize field[] used to generate data[] dict
350 #
351 def set_data_fields(args, line_data) :
352
353 field_str = 'qseqid sseqid pident length mismatch gapopen q_start q_end s_start s_end evalue bitscore BTOP align_annot'
354 field_qs_str = 'qseqid q_len sseqid s_len pident length mismatch gapopen q_start q_end s_start s_end evalue bitscore BTOP align_annot'
355
356 if (len(line_data) > 1) :
357 if ((not args.have_qslen) and re.search(r'\d+',line_data[1])):
358 args.have_qslen=True
359
360 if ((not args.exon_info) and re.search(r'^\|[DX][XD]\:',line_data[-1])):
361 args.exon_info = True
362
363 end_field = -1
364 fields = field_str.split(' ')
365
366 if (args.have_qslen):
367 fields = field_qs_str.split(' ')
368
369 if (args.exon_info):
370 fields.append('exon_info')
371 end_field = -2
372
373 return (fields, end_field)
374
375 ################################################################
376 #
377 # main program
378 # print "#"," ".join(sys.argv)
379
380 def main():
381
382 data_fields_reset=False
383
384 parser=argparse.ArgumentParser(description='map_exon_coords.py result_file.m8CB saa:coord : map subject coordinate to query genomic coordinate')
385 parser.add_argument('--have_qslen', help='bl_tab fields include query/subject lengths',dest='have_qslen',action='store_true',default=False)
386 parser.add_argument('--exon_info', help='raw domain coordinates included',action='store_true',default=True)
387 parser.add_argument('--subj_aa',help='subject aa coordinate to map',action='store',type=int,dest='subj_aa_coord',default=1)
388 parser.add_argument('files', metavar='FILE', nargs='*', help='files to read, if empty, stdin is used')
389 args=parser.parse_args()
390
391 end_field = -1
392 data_fields_reset=False
393
394 (fields, end_field) = set_data_fields(args, [])
395
396 if (args.have_qslen and args.exon_info):
397 data_fields_reset=True
398
399 saved_qexon_list = []
400 qexon_list = []
401
402 for line in fileinput.input(args.files):
403 # pass through comments
404 if (line[0] == '#'):
405 print line, # ',' because have not stripped
406 continue
407
408 ################
409 # break up tab fields, check for extra fields
410 line = line.strip('\n')
411 line_data = line.split('\t')
412 if (not data_fields_reset): # look for --have_qslen number, --exon_info data, even if not set
413 (fields, end_field) = set_data_fields(args, line_data)
414 data_fields_reset = True
415
416 ################
417 # get exon annotations
418 # produces: data['q_exalign_list'], data['s_exalign_list']
419 # data['q_exinfo_list'], data['s_exinfo_list']
420 data = parse_protein(line_data,fields,"exon") # get score/alignment/domain data
421
422 # extract aligned query_coordinates
423 q_coords = []
424 sa_from_qa = []
425 for q_ex in data['q_exalign_list']:
426 q_coords.append(q_ex.q_start)
427 q_coords.append(q_ex.q_end)
428 sa_from_qa.append(q_ex.s_start)
429 sa_from_qa.append(q_ex.s_end)
430
431 s_coords = []
432 qa_from_sa = []
433 for s_ex in data['s_exalign_list']:
434 s_coords.append(s_ex.s_start)
435 s_coords.append(s_ex.s_end)
436 qa_from_sa.append(s_ex.q_start)
437 qa_from_sa.append(s_ex.q_end)
438
439 ################
440 # map aligned coordinates in query to subject exons
441 # -- this is not necessary -- it is already in data['q_exalign_list'].s_start/s_end
442 # s_target=True
443 # sa_from_qa = map_align_coords(data['BTOP'], int(data['q_start']), int(data['s_start']),
444 # s_target, qa_coords)
445 sex_from_qa2sa = aa_to_exon(sa_from_qa, data['s_exinfo_list'])
446 qex_from_sa2qa = aa_to_exon(qa_from_sa, data['q_exinfo_list'])
447
448
449 ################
450 # print out non-exon info
451
452 print '\t'.join([str(data[x]) for x in fields[:end_field]]),
453
454 ################
455 # edit the full text to insert the other aligned coordinates
456 # (also re-order the regions query-first, then subject)
457 # for 'q_exalign_list', I need to add the subj_genome_coords sex_from_qa2sa
458 # and they need to be second
459 # for 's_exalign_list', I need to add the query_genome_coords from qex_from_sa2qa
460 # and they need to be first
461
462 q_exalign_out=[]
463 for qx, q_exon in enumerate(data['q_exalign_list']):
464 sg_start = sex_from_qa2sa[2*qx]
465 sg_end = sex_from_qa2sa[2*qx+1]
466 sg_replace="::%s:%d-%d}"%(sg_start['chrom'],sg_start['dpos'],sg_end['dpos'])
467
468 this_outstr=re.sub(r'\}',sg_replace,q_exon.text)
469 q_exalign_out.append(this_outstr)
470
471 s_exalign_out=[]
472 for sx, s_exon in enumerate(data['s_exalign_list']):
473 qg_start = qex_from_sa2qa[2*sx]
474 qg_end = qex_from_sa2qa[2*sx+1]
475 qg_replace="{%s:%d-%d::"%(qg_start['chrom'],qg_start['dpos'],qg_end['dpos'])
476
477 this_outstr=re.sub(r'\{',qg_replace,s_exon.text)
478 s_exalign_out.append(this_outstr)
479
480 print "\t|"+"|".join(q_exalign_out+s_exalign_out)+"\t"+line_data[-1]
481
482 ################
483 # run the program ...
484
485 if __name__ == '__main__':
486 main()
487
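The BTOP handling used by `decode_btop()` and `map_align()` above can be tried out separately with the following Python 3 sketch. It is a restatement under assumptions, not the script's code: tokens for identical runs are returned as integers rather than strings, the helper name `aligned_positions` is hypothetical, and the BTOP string is assumed to be well formed (every non-numeric run has an even number of characters).

```python
import re


def decode_btop(btop_str):
    """Split a BTOP string such as '1VA160TS7KG10RK27' into tokens.

    Runs of digits become integers (counts of identical positions);
    everything else is cut into two-character query/subject pairs.
    """
    tokens = []
    for token in re.split(r"(\d+)", btop_str):
        if not token:
            continue
        if token.isdigit():
            tokens.append(int(token))
        else:
            tokens.extend(token[i:i + 2] for i in range(0, len(token), 2))
    return tokens


def aligned_positions(btop_str, q_start, s_start):
    """Return one (query_pos, subject_pos) pair per alignment column."""
    q_pos, s_pos = q_start, s_start
    pairs = []
    for t in decode_btop(btop_str):
        if isinstance(t, int):
            for _ in range(t):           # run of identical residues
                pairs.append((q_pos, s_pos))
                q_pos += 1
                s_pos += 1
        elif t[0] == "-":                # gap in query: only subject advances
            pairs.append((q_pos, s_pos))
            s_pos += 1
        elif t[1] == "-":                # gap in subject: only query advances
            pairs.append((q_pos, s_pos))
            q_pos += 1
        else:                            # mismatch: both advance
            pairs.append((q_pos, s_pos))
            q_pos += 1
            s_pos += 1
    return pairs
```

Keeping the two coordinates together as pairs, rather than in two parallel lists, is only a presentation choice for the sketch; the parallel lists in `map_align()` carry the same information.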
0 #!/usr/bin/env perl
1
2 ################################################################
3 # copyright (c) 2018 by William R. Pearson and The Rector &
4 # Visitors of the University of Virginia */
5 ################################################################
6 # Licensed under the Apache License, Version 2.0 (the "License");
7 # you may not use this file except in compliance with the License.
8 # You may obtain a copy of the License at
9 #
10 # http://www.apache.org/licenses/LICENSE-2.0
11 #
12 # Unless required by applicable law or agreed to in writing,
13 # software distributed under this License is distributed on an "AS
14 # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
15 # express or implied. See the License for the specific language
16 # governing permissions and limitations under the License.
17 ################################################################
18
19 ################################################################
20 # merge_blast_btab.pl --btab .btab file html_file
21 ################################################################
22
23 use warnings;
24 use strict;
25 use Getopt::Long;
26 use Pod::Usage;
27 use URI::Encode qw(uri_encode);
28 use URI::Escape qw(uri_escape);
29
30 my ($btab_file, $have_qslen, $help, $shelp, $dom_info) = ("", 0, 0, 0, 0);
31 my ($plot_url) = ("");
32
33 GetOptions(
34 "btab_file|btab=s" => \$btab_file,
35 "have_qslen|have_sqlen!" => \$have_qslen,
36 "domain_info|dom_info!" => \$dom_info,
37 "plot_url=s"=> \$plot_url,
38 "h|?" => \$shelp,
39 "help" => \$help,
40 );
41
42 pod2usage(1) if $shelp;
43 pod2usage(exitstatus => 0, verbose => 2) if $help;
44 unless (-f STDIN || -p STDIN || @ARGV) {
45 pod2usage(1);
46 }
47
48 # require a btab file
49
50 # read it in, save structure as list/hash on accession (list more robust)
51 # what happens with multiple hits for same library -- need to add code
52 #
53
54 my @bl_fields = qw(q_seqid s_seqid percid alen mismatch gopen q_start q_end s_start s_end evalue bits score annot);
55
56 if ($have_qslen) {
57 @bl_fields = qw(q_seqid q_len s_seqid s_len percid alen mismatch gopen q_start q_end s_start s_end evalue bits score annot);
58 }
59
60 if ($dom_info) {
61 push @bl_fields, "dom_info";
62 }
63
64 my %tab_data = ();
65 my @sseq_ids = ();
66
67 unless ($btab_file) {
68 die "--btab_file required"
69 }
70 else {
71 # read in btab file
72 open(my $fd, $btab_file) || die "cannot open $btab_file";
73
74 while (my $line = <$fd>) {
75 next if ($line =~ m/^#/); # ignore comments
76 chomp($line);
77 my %a_data = ();
78 @a_data{@bl_fields} = split(/\t/,$line);
79
80 # here we should confirm that the sseqid is new. If it is not, then add to a list.
81 my $sseqid = $a_data{'s_seqid'};
82
83 if (defined($tab_data{$sseqid})) {
84 push @{$tab_data{$sseqid}}, \%a_data
85 }
86 else {
87 $tab_data{$sseqid} = [ \%a_data ];
88 push @sseq_ids, $sseqid;
89 }
90 }
91 }
92
93 # have the annotation data in %tab_data and @sseq_ids
94 # read in the blastp html file and annotate it
95
96 my ($in_best, $in_align) = (0,0);
97 my ($best_ix, $align_ix, $hsp_ix) = (0,0,0);
98
99 while (my $line = <>) {
100 chomp($line);
101 unless ($line) {
102 print "\n";
103 next;
104 }
105 if ($line =~ m/^Sequences producing/) {
106 $in_best = 1;
107 $best_ix = 0;
108 print "$line\n";
109 next;
110 }
111
112 if ($in_best) {
113 if ($line =~ /^>/) {
114 $in_best = 0;
115 $in_align = 1;
116 $align_ix = 0;
117 $hsp_ix = 0;
118 # print out the first line
119 print "$line\n";
120 next;
121 }
122 else {
123 $line = add_best($line, $tab_data{$sseq_ids[$best_ix]}->[0]);
124 $best_ix++;
125 }
126 }
127
128 if ($in_align) {
129 if ($line =~ m/^\s+Score = \d+/) { # have Length= match, put out annotations if available
130 my $regions_str = regions_to_str($tab_data{$sseq_ids[$align_ix]}->[$hsp_ix]);
131 print $regions_str;
132
133 if ($plot_url) {
134 my $raw_dom_str = "";
135 if ($dom_info) {
136 $raw_dom_str = dom_info_str($tab_data{$sseq_ids[$align_ix]}->[$hsp_ix]{'dom_info'});
137 }
138
139 my $plot_tag = plot_tag_str($plot_url, $tab_data{$sseq_ids[$align_ix]}->[$hsp_ix], $regions_str, $raw_dom_str);
140 if ($plot_tag) {print $plot_tag,"\n";}
141 }
142
143 $hsp_ix++;
144
145 }
146 elsif ($line =~ m/^>/) {
147 $align_ix++;
148 $hsp_ix = 0;
149 }
150 }
151
152 print "$line\n";
153 }
154
155 sub parse_annots {
156 my ($annot_str) = @_;
157
158 my @annot_list = ();
159
160 unless ($annot_str && $annot_str =~ m/^\|/) {
161 return \@annot_list;
162 }
163
164 my @annots = split('\|',$annot_str);
165 shift @annots;
166
167 for my $annot ( @annots ) {
168 my %annot_data = ();
169 next unless ($annot =~ m/^[XR][RX]/);
170 my @a_fields = split(/;/,$annot);
171 for my $f (@a_fields) {
172 if ($f =~ m/^[XR][XR]/) {
173 my @a2_f = split(':',$f);
174 if ($a2_f[0] =~ m/^XR/) {
175 $annot_data{target} = 'subj';
176 }
177 else {
178 $annot_data{target} = 'query';
179 }
180 $annot_data{coord} = "$a2_f[1]:$a2_f[2]";
181 $annot_data{score} = (split('=',$a2_f[3]))[1]
182 }
183 elsif ($f =~ m/(\w)=(.+)/) {
184 $annot_data{$1} = $2;
185 }
186 }
187 $annot_data{name} = $a_fields[-1];
188 $annot_data{name} =~ s/^C=//;
189 push @annot_list, \%annot_data;
190 }
191 return \@annot_list;
192 }
193
194 sub regions_to_str {
195 my ($a_data_r) = @_;
196
197 my $annot_ref = parse_annots($a_data_r->{annot});
198
199 my $region_str = "";
200 my $annot_str = "";
201
202 for my $annot ( @{$annot_ref}) {
203 if ($annot->{target} =~ m/^q/) {
204 $region_str = "qRegion";
205 }
206 else {
207 $region_str = " Region";
208 }
209
210 $annot_str .= sprintf "%s: %s : score=%d; bits=%.1f; Id=%.3f; Q=%.1f : %s\n", $region_str,
211 @{$annot}{qw(coord score b I Q name)};
212 }
213 return $annot_str;
214 }
215
216 sub add_best {
217 my ($line, $a_data) = @_;
218
219 my $annot_str = '';
220
221 my $annot_refs = parse_annots($a_data->{annot});
222
223 for my $annot ( @$annot_refs) {
224 if ($annot->{target} !~ m/^q/) {
225 $annot_str .= $annot->{name} . ";"
226 }
227 }
228
229 if ($annot_str) {
230 return "$line $annot_str";
231 }
232 else {
233 return $line;
234 }
235 }
236
237 sub plot_tag_str {
238
239 my ($plot_script, $align_data_r, $regions_str, $doms_str) = @_;
240
241 my $svg_pref = q(<object type="image/svg+xml" );
242 my $svg_post = q( width="660" height="76" ></object>);
243
244 #build argument string
245 my %plt_args = ();
246 @plt_args{qw(q_cstart l_cstart)} = (1, 1);
247 @plt_args{qw(q_name q_cstop q_astart q_astop l_name l_cstop l_astart l_astop)} =
248 @{$align_data_r}{qw(q_seqid q_len q_start q_end s_seqid s_len s_start s_end)};
249 $plt_args{'regions'}= uri_escape(uri_encode($regions_str));
250 if ($doms_str) {
251 $plt_args{'doms'} = uri_encode($doms_str);
252 }
253
254 my $dom_info = ();
255
256 my @args = map {"$_=$plt_args{$_}"} keys(%plt_args);
257
258 return $svg_pref . qq( data="$plot_url?) . join('&amp;',@args) . '"' . $svg_post;
259 }
260
261 sub dom_info_str {
262 my ($raw_dom_info) = @_;
263
264 my $dom_str = "";
265
266 unless ($raw_dom_info) { return "";}
267
268 my @raw_doms = split('\|',$raw_dom_info);
269 shift(@raw_doms);
270
271 for my $dom ( @raw_doms ) {
272 my $tmp_dom = $dom;
273 $tmp_dom =~ s/^DX:/qDomain:\t/g;
274 $tmp_dom =~ s/^XD:/lDomain:\t/g;
275 $tmp_dom =~ s/;C=/\t/g;
276
277 $dom_str .= "$tmp_dom\n";
278 }
279
280 return $dom_str;
281 }
282
283
284 __END__
285
286 =pod
287
288 =head1 NAME
289
290 merge_blast_btab.pl
291
292 =head1 SYNOPSIS
293
294 merge_blast_btab.pl --btab_file=result.b_tab result.html
295
296 =head1 OPTIONS
297
298 -h short help
299 --help include description
300
301 --btab_file|--btab file_name -- blast tabular output file with
302 sub-alignment scoring
303
304 =head1 DESCRIPTION
305
306 C<merge_blast_btab.pl> merges the domain annotations and sub-alignment scoring from C<annot_blast_btop2.pl> blast tabular output file with a conventional blast result file.
307
308 The tab file is read and parsed, and then the subject/query seqid is used to
309 capture domain locations in the subject/query sequence. If the domains
310 overlap the aligned region, the domain names are appended to the output.
311
312 =head1 AUTHOR
313
314 William R. Pearson, wrp@virginia.edu
315
316 =cut
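
The annotation handling done by `parse_annots` and `regions_to_str` in the script above can be sketched in Python 3 as follows. This is a minimal illustration under stated assumptions, not code from the FASTA distribution: the function names are reused only for readability, and the input is assumed to follow the `|RX:q_start-q_end:s_start-s_end:s=..;b=..;I=..;Q=..;C=name` layout that the Perl code splits on.

```python
def parse_annots(annot_str):
    """Turn '|RX:...;C=name|XR:...' into a list of dicts, one per region."""
    annots = []
    if not annot_str or not annot_str.startswith("|"):
        return annots
    for annot in annot_str.split("|")[1:]:
        if not annot.startswith(("RX", "XR")):
            continue  # only scored region annotations are kept
        fields = annot.split(";")
        # the first field looks like 'RX:1-38:3-40:s=37'
        tag, q_coord, s_coord, score = fields[0].split(":")
        data = {
            "target": "subj" if tag == "XR" else "query",
            "coord": "%s:%s" % (q_coord, s_coord),
            "score": int(score.split("=")[1]),
        }
        # remaining fields are key=value pairs: b (bits), I (identity), Q, C (name)
        for field in fields[1:]:
            key, _, value = field.partition("=")
            data[key] = value
        data["name"] = data.pop("C", "")
        annots.append(data)
    return annots


def regions_to_str(annots):
    """Format parsed regions as 'qRegion'/' Region' lines, one per annotation."""
    lines = []
    for a in annots:
        label = "qRegion" if a["target"].startswith("q") else " Region"
        lines.append("%s: %s : score=%d; bits=%.1f; Id=%.3f; Q=%.1f : %s\n" % (
            label, a["coord"], a["score"],
            float(a["b"]), float(a["I"]), float(a["Q"]), a["name"]))
    return "".join(lines)
```

As in the Perl version, an `XR` tag marks a region annotated on the subject and anything else is treated as a query region; the name keeps its `~N` color suffix.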
0 #!/usr/bin/env perl
1
2 ################################################################
3 # copyright (c) 2018 by William R. Pearson and The Rector &
4 # Visitors of the University of Virginia */
5 ################################################################
6 # Licensed under the Apache License, Version 2.0 (the "License");
7 # you may not use this file except in compliance with the License.
8 # You may obtain a copy of the License at
9 #
10 # http://www.apache.org/licenses/LICENSE-2.0
11 #
12 # Unless required by applicable law or agreed to in writing,
13 # software distributed under this License is distributed on an "AS
14 # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
15 # express or implied. See the License for the specific language
16 # governing permissions and limitations under the License.
17 ################################################################
18
19 ################################################################
20 # merge_fasta_btab.pl --btab .btab file html_file
21 ################################################################
22
23 ################################################################
24 # takes a standard (or <html>) FASTA output file and converts (or adds) labels using .btab information
25 ################################################################
26
27
28 use warnings;
29 use strict;
30 use Getopt::Long;
31 use Pod::Usage;
32 use URI::Encode qw(uri_encode);
33 use URI::Escape qw(uri_escape);
34
35 my ($btab_file, $have_qslen, $help, $shelp, $dom_info) = ("", 0, 0, 0, 0);
36 my ($plot_url) = ("");
37
38 GetOptions(
39 "btab_file|btab=s" => \$btab_file,
41 "have_qslen|have_sqlen!" => \$have_qslen,
42 "domain_info|dom_info!" => \$dom_info,
43 "plot_url=s"=> \$plot_url,
44 "h|?" => \$shelp,
45 "help" => \$help,
46 );
47
48 pod2usage(1) if $shelp;
49 pod2usage(exitstatus => 0, verbose => 2) if $help;
50 unless (-f STDIN || -p STDIN || @ARGV) {
51 pod2usage(1);
52 }
53
54 # require a btab file
55
56 # read it in, save structure as list/hash on accession (list more robust)
57 # what happens with multiple hits for same library -- need to add code
58 #
59
60 my @bl_fields = qw(q_seqid s_seqid percid alen mismatch gopen q_start q_end s_start s_end evalue bits score annot);
61
62 if ($have_qslen) {
63 @bl_fields = qw(q_seqid q_len s_seqid s_len percid alen mismatch gopen q_start q_end s_start s_end evalue bits score annot);
64 }
65
66 my %pgm_names= ('FASTA'=>'fap', 'FASTX'=>'fx', 'FASTY'=>'fy', 'FASTS'=>'fs', 'FASTM'=>'fm',
67 'SSEARCH' => 'gsw', 'GGSEARCH'=>'gnw', 'GLSEARCH'=>'lnw',
68 'TFASTX' => 'tfx', 'TFASTY'=>'tfx', 'TFASTS'=>'tfs', 'TFASTM'=>'tfm',
69 'BLASTP'=>'bp', 'BLASTN'=>'bn', 'TBLASTN'=>'tbn' );
70
71 if ($dom_info) {
72 push @bl_fields, "dom_info";
73 }
74
75 my $pgm_name = '';
76 my %tab_data = ();
77 my @sseq_ids = ();
78
79 unless ($btab_file) {
80 die "--btab_file required"
81 }
82 else {
83 # read in btab file
84 open(my $fd, '<', $btab_file) || die "cannot open $btab_file: $!";
85
86 while (my $line = <$fd>) {
87 if ($line =~ m/^#/) { # check for program name
88 if (!$pgm_name) {
89 my ($name) = ($line =~ m/^# (\w+) /);
90 if ($name && $pgm_names{$name}) {
91 $pgm_name = $pgm_names{$name};
92 }
93 }
94 next;
95 }
96 chomp($line);
97
98 my %a_data = ();
99 @a_data{@bl_fields} = split(/\t/,$line);
100
101 # here we should confirm that the sseqid is new. If it is not, then add to a list.
102 my $sseqid = $a_data{'s_seqid'};
103
104 if (defined($tab_data{$sseqid})) {
105 push @{$tab_data{$sseqid}}, \%a_data
106 }
107 else {
108 $tab_data{$sseqid} = [\%a_data ];
109 push @sseq_ids, $sseqid;
110 }
111 }
112 }
113
114 # have the annotation data in %tab_data and @sseq_ids
115 # read in the FASTA html file and annotate it
116
117 my ($in_best, $in_align, $in_annot) = (0,0,0);
118 my ($annot_id) = ("");
119 my ($best_ix, $align_ix, $hsp_ix) = (0,0,0);
120
121 while (my $line = <>) {
122 chomp($line);
123 unless ($line) {
124 print "\n";
125 next;
126 }
127 if ($line =~ m/^The best scores are:/) {
128 $in_best = 1;
129 $best_ix = 0;
130 print "$line\n";
131 next;
132 }
133
134 if ($in_best) {
135 if ($line =~ /<pre>>>/) {
136 $in_best = 0;
137 $in_align = 1;
138 $in_annot = 0;
139 $align_ix = 0;
140 $hsp_ix = 0;
141 # print out the first line
142 print "$line\n";
143 next;
144 }
145 else {
146 if (scalar(@sseq_ids) && $sseq_ids[$best_ix]) {
147 $line = add_best($line, $tab_data{$sseq_ids[$best_ix]}->[0]);
148 $best_ix++;
149 }
150 }
151 }
152
153 if ($in_align) {
154 if ($line =~ m/^<!\-\- ANNOT_START "([^"]+)" \-\->/) {
155 $annot_id = $1;
156 my $regions_str = regions_to_str($tab_data{$sseq_ids[$align_ix]}->[$hsp_ix]);
157 print qq(<!-- ANNOT_START "$annot_id" -->);
158 print $regions_str;
159
160 if ($plot_url) {
161 my $raw_dom_str = "";
162 if ($dom_info) {
163 $raw_dom_str = dom_info_str($tab_data{$sseq_ids[$align_ix]}->[$hsp_ix]{'dom_info'});
164 }
165
166 my $plot_tag = plot_tag_str($plot_url, $pgm_name, $tab_data{$sseq_ids[$align_ix]}->[$hsp_ix], $regions_str, $raw_dom_str);
167 if ($plot_tag) {print $plot_tag,"\n";}
168 }
169
170 $hsp_ix++;
171
172 # remove the old domain information
173 while ($line = <> ) {
174 chomp($line);
175 if ($line !~ m/^\s*q?Region:/ && $line !~ /ANNOT_STOP/) {
176 print "$line\n";
177 }
178 if ($line =~ m/^<!\-\- ANNOT_STOP \-\->/) {
179 last;
180 }
181 }
182 }
183 elsif ($line =~ m/<pre>>>/) {
184 $align_ix++;
185 $hsp_ix=0;
186 }
187 }
188
189 print "$line\n";
190 }
191
192 sub parse_annots {
193 my ($annot_str) = @_;
194
195 my @annot_list = ();
196
197 unless ($annot_str && $annot_str =~ m/^\|/) {
198 return \@annot_list;
199 }
200
201 my @annots = split('\|',$annot_str);
202 shift @annots;
203
204 for my $annot ( @annots ) {
205 my %annot_data = ();
206 next unless ($annot =~ m/^[XR][RX]/);
207 my @a_fields = split(/;/,$annot);
208 for my $f (@a_fields) {
209 if ($f =~ m/^[XR][XR]/) {
210 my @a2_f = split(':',$f);
211 if ($a2_f[0] =~ m/^XR/) {
212 $annot_data{target} = 'subj';
213 }
214 else {
215 $annot_data{target} = 'query';
216 }
217 $annot_data{coord} = "$a2_f[1]:$a2_f[2]";
218 $annot_data{score} = (split('=',$a2_f[3]))[1]
219 }
220 elsif ($f =~ m/(\w)=(.+)/) {
221 $annot_data{$1} = $2;
222 }
223 }
224 $annot_data{name} = $a_fields[-1];
225 $annot_data{name} =~ s/^C=//;
226
227 push @annot_list, \%annot_data;
228 }
229 return \@annot_list;
230 }
231
232 sub print_regions {
233 my ($annot_id, $annot_ref) = @_;
234
235 my $region_str = "";
236
237 print qq(<!-- ANNOT_START "$annot_id" -->);
238
239 for my $annot ( @{$annot_ref}) {
240 if ($annot->{target} =~ m/^q/) {
241 $region_str = "qRegion";
242 }
243 else {
244 $region_str = " Region";
245 }
246
247 printf "%s: %s : score=%d; bits=%.1f; Id=%.3f; Q=%.1f : %s\n", $region_str,
248 @{$annot}{qw(coord score b I Q name)};
249 }
250 }
251
252 sub regions_to_str {
253 my ($a_data_r) = @_;
254
255 my $annot_ref = parse_annots($a_data_r->{annot});
256
257 my $region_str = "";
258 my $annot_str = "";
259
260 for my $annot ( @{$annot_ref}) {
261 if ($annot->{target} =~ m/^q/) {
262 $region_str = "qRegion";
263 }
264 else {
265 $region_str = " Region";
266 }
267
268 $annot_str .= sprintf "%s: %s : score=%d; bits=%.1f; Id=%.3f; Q=%.1f : %s\n", $region_str,
269 @{$annot}{qw(coord score b I Q name)};
270 }
271 return $annot_str;
272 }
273
274 sub add_best {
275 my ($line, $a_data) = @_;
276
277 my $annot_str = '';
278
279 my $annot_refs = parse_annots($a_data->{annot});
280
281 # remove old annotation if present
282 my @line_words = split(/\s/,$line);
283 if ($line_words[-1] =~ m/~\d/) {
284 $line = join(' ',@line_words[0 .. $#line_words-1]);
285 }
286
287 for my $annot ( @$annot_refs) {
288 if ($annot->{target} !~ m/^q/) {
289 $annot_str .= $annot->{name} . ";"
290 }
291 }
292
293 if ($annot_str) {
294 return "$line $annot_str";
295 }
296 else {
297 return $line;
298 }
299 }
300
301 sub plot_tag_str {
302
303 my ($plot_script, $pgm_name, $align_data_r, $regions_str, $doms_str) = @_;
304
305 my $svg_pref = q(<object type="image/svg+xml" );
306 my $svg_post = q( width="660" height="76" ></object>);
307
308 #build argument string
309 my %plt_args = ();
310 @plt_args{qw(pgm q_cstart l_cstart)} = ($pgm_name, 1, 1);
311 @plt_args{qw(q_name q_cstop q_astart q_astop l_name l_cstop l_astart l_astop)} =
312 @{$align_data_r}{qw(q_seqid q_len q_start q_end s_seqid s_len s_start s_end)};
313 $plt_args{'regions'}= uri_escape(uri_encode($regions_str));
314 if ($doms_str) {
315 $plt_args{'doms'} = uri_encode($doms_str);
316 }
317
318 my $dom_info = ();
319
320 my @args = map {"$_=$plt_args{$_}"} keys(%plt_args);
321
322 return $svg_pref . qq( data="$plot_script?) . join('&amp;',@args) . '"' . $svg_post;
323 }
324
325 sub dom_info_str {
326 my ($raw_dom_info) = @_;
327
328 my $dom_str = "";
329
330 unless ($raw_dom_info) { return "";}
331
332 my @raw_doms = split('\|',$raw_dom_info);
333 shift(@raw_doms);
334
335 for my $dom ( @raw_doms ) {
336 my $tmp_dom = $dom;
337 $tmp_dom =~ s/^DX:/qDomain:\t/g;
338 $tmp_dom =~ s/^XD:/lDomain:\t/g;
339 $tmp_dom =~ s/;C=/\t/g;
340
341 $dom_str .= "$tmp_dom\n";
342 }
343
344 return $dom_str;
345 }
346
347 __END__
348
349 =pod
350
351 =head1 NAME
352
353 merge_fasta_btab.pl
354
355 =head1 SYNOPSIS
356
357 merge_fasta_btab.pl --btab_file=result.b_tab result.html
358
359 =head1 OPTIONS
360
361 -h short help
362 --help include description
363
364 --btab_file|--btab file_name -- blast tabular output file with
365 sub-alignment scoring
366
367 =head1 DESCRIPTION
368
369 C<merge_fasta_btab.pl> merges the domain annotations and sub-alignment scoring from a blast tabular output file (such as one produced by C<annot_blast_btop2.pl>) with a conventional FASTA result file.
370
371 The tab file is read and parsed, and then the subject/query seqid is used to
372 capture domain locations in the subject/query sequence. If the domains
373 overlap the aligned region, the domain names are appended to the output.
374
375 =head1 AUTHOR
376
377 William R. Pearson, wrp@virginia.edu
378
379 =cut
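
The way the script above loads the `.btab` file (one hash per line, grouped by subject id, with the first-seen order of subjects kept separately) can be sketched in Python 3 like this. It is an illustrative sketch only; `read_btab` and `BL_FIELDS` are hypothetical names, the field list mirrors `@bl_fields` without the optional length and `dom_info` columns, and a plain `dict` is assumed to preserve insertion order (Python 3.7+), which lets it stand in for both `%tab_data` and `@sseq_ids`.

```python
# Field names mirror @bl_fields in the Perl script (no --have_qslen, no --dom_info).
BL_FIELDS = ["q_seqid", "s_seqid", "percid", "alen", "mismatch", "gopen",
             "q_start", "q_end", "s_start", "s_end", "evalue", "bits",
             "score", "annot"]


def read_btab(lines, fields=BL_FIELDS):
    """Group tab-delimited hits by subject id, keeping first-seen order.

    `lines` is any iterable of text lines (an open file works).
    Returns {s_seqid: [hit_dict, ...]}; each list holds one dict per HSP.
    """
    tab_data = {}
    for line in lines:
        if line.startswith("#"):
            continue  # comment lines carry the program name, not hits
        row = dict(zip(fields, line.rstrip("\n").split("\t")))
        tab_data.setdefault(row["s_seqid"], []).append(row)
    return tab_data
```

Keeping a list per subject id is what allows the HTML pass to address a hit as "subject number `align_ix`, HSP number `hsp_ix`", as the main loop of the script does.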
0 #!/usr/bin/env python
1
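2 # Given a blast_tabular file with search results from one or more protein queries,
3 # relabel the domain annotations so that the same domain gets a consistent color in query and subject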
4
5 ################################################################
6 # copyright (c) 2018 by William R. Pearson and The Rector & Visitors
7 # of the University of Virginia */
8 # ###############################################################
9 # Licensed under the Apache License, Version 2.0 (the "License"); you
10 # may not use this file except in compliance with the License. You
11 # may obtain a copy of the License at
12 # http://www.apache.org/licenses/LICENSE-2.0 Unless required by
13 # applicable law or agreed to in writing, software distributed under
14 # this License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
15 # OR CONDITIONS OF ANY KIND, either express or implied. See the
16 # License for the specific language governing permissions and
17 # limitations under the License.
18 # ###############################################################
19
20
21 import fileinput
22 import sys
23 import re
24 import argparse
25 import urllib2
26
27 from rename_exons import *
28
29 def replace_dom_number(line):
30
31 out_str = ''
32 if (not re.search(r'~',line)):
33 return line
34
35 (info, num, vdom) = re.search(r'^([^~]+)~(\d+)(v?)$',line).groups()
36 if (vdom is None):
37 vdom=''
38
39 if (num in homolog_dict):
40 return "%s~h%s%s" % (info, str(homolog_dict[num]['num']), vdom)
41
42 else:
43 name = line.split(" ")[-1].split("{")[0]
44 if (name == "NODOM"):
45 return line
46 else:
47 if (name in nonhomolog_dict):
48 return '~'.join(line.split('~')[:-1]) + "~" + str(nonhomolog_dict[name])
49 return out_str
50
51
52 ################
53 # __main__ function
54 #
55
56 e_thresh = 1e-6
57 q_thresh = 30.0
58
59 homolog_dict = {}
60 nonhomolog_dict = {}
61
62 def main():
63
64 # print "#"," ".join(sys.argv)
65
66 hom_color = 1
67 n_hom_color = 11
68
69 parser=argparse.ArgumentParser(description='relabel_domains.py result_file.m8CB')
70
71 parser.add_argument('--have_qslen', help='bl_tab fields include query/subject lengths',dest='have_qslen',action='store_true',default=False)
72 parser.add_argument('--dom_info', help='raw domain coordinates included',action='store_true',default=False)
73 parser.add_argument('files', metavar='FILE', nargs='*', help='files to read, if empty, stdin is used')
74
75 args=parser.parse_args()
76
77 end_field = -1
78 data_fields_reset=False
79
80 (fields, end_field) = set_data_fields(args, [])
81
82 if (args.have_qslen and args.dom_info):
83 data_fields_reset=True
84
85
86 for line in fileinput.input(args.files):
87 # pass through comments
88 if (line[0] == '#'):
89 print line, # ',' because have not stripped
90 continue
91
92 ################
93 # break up tab fields, check for extra fields
94 line = line.strip('\n')
95 line_data = line.split('\t')
96 if (not data_fields_reset): # look for --have_qslen number, --dom_info data, even if not set
97 (fields, end_field) = set_data_fields(args, line_data)
98 data_fields_reset = True
99
100 ################
101 # get exon annotations
102 data = parse_protein(line_data,fields,'') # get score/alignment/domain data
103
104 if (len(data['sdom_list'])==0 and len(data['qdom_list'])==0):
105 print line # no domains to be edited, print stripped line and continue
106 continue
107
108 ################
109 # have domains, check if significant and new, or old and known
110 # goals are: (1) consistent coloring between query and subject for same domain
111 # (2) homologous domains get special labels
112 # need dict of good domain names
113
114 ################
115 # check to update doms with good E()-value
116 if float(data['evalue']) <= e_thresh:
117 for q_dom in data['qdom_list']:
118 if (float(q_dom.q_score) >= q_thresh and q_dom.name not in homolog_dict ):
119 homolog_dict[q_dom.name] = hom_color # key on the domain name, not the literal string
120 hom_color += 1
121
122 for s_dom in data['sdom_list']:
123 if (float(s_dom.q_score) >= q_thresh and s_dom.name not in homolog_dict):
124 homolog_dict[s_dom.name] = hom_color
125 hom_color += 1
126 else:
127 for s_dom in data['sdom_list']:
128 if (s_dom.name not in homolog_dict):
129 nonhomolog_dict[s_dom.name] = n_hom_color
130 n_hom_color += 1
131
132
133 ################
134 # done storing good domains, write things out
135
136 btab_str = '\t'.join(str(data[x]) for x in fields[:end_field])
137
138 for s_dom in data['sdom_list']:
139 if (s_dom.name in homolog_dict):
140 s_dom.color=homolog_dict[s_dom.name]
141 elif (s_dom.name in nonhomolog_dict):
142 s_dom.color=nonhomolog_dict[s_dom.name]
143
144
145 dom_bar_str = ''
146 for dom in sorted(data['qdom_list']+data['sdom_list'],key=lambda r: r.idnum):
147 dom_bar_str += dom.make_bar_str()
148
149 print btab_str+dom_bar_str
150
151
152 if __name__ == '__main__':
153 main()
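
The coloring goal stated in the script above (the same domain name always maps to the same color number, with significant domains numbered separately from the rest) can be sketched in Python 3 as follows. This is a hedged illustration, not the script's implementation: `assign_colors` and its input shape are hypothetical, the thresholds and starting color numbers are copied from the script, and the extra check that a name is not already in `nonhomolog` is an assumption added here so that a repeated name keeps its first color.

```python
def assign_colors(hits, e_thresh=1e-6, q_thresh=30.0):
    """Assign stable color numbers to domain names.

    hits: iterable of (evalue, [(domain_name, q_score), ...]) in file order.
    Returns (homolog, nonhomolog), two dicts mapping name -> color number.
    """
    homolog, nonhomolog = {}, {}
    hom_color, n_hom_color = 1, 11   # same starting values as the script
    for evalue, doms in hits:
        for name, q_score in doms:
            if name in homolog:
                continue             # color already fixed by a significant hit
            if evalue <= e_thresh:
                if q_score >= q_thresh:
                    homolog[name] = hom_color
                    hom_color += 1
            elif name not in nonhomolog:   # assumption: first color wins
                nonhomolog[name] = n_hom_color
                n_hom_color += 1
    return homolog, nonhomolog
```

Because the dictionaries are keyed on the domain name, a later lookup such as `homolog.get(dom_name)` gives the same color for a query domain and a subject domain that share a name.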
0 #!/usr/bin/env python
1 #
2 # given a -m8CB file with exon annotations for the query and subject,
3 # adjust the subject exon names to match the query exon names
4
5 ################################################################
6 # copyright (c) 2018 by William R. Pearson and The Rector &
7 # Visitors of the University of Virginia */
8 ################################################################
9 # Licensed under the Apache License, Version 2.0 (the "License");
10 # you may not use this file except in compliance with the License.
11 # You may obtain a copy of the License at
12 #
13 # http://www.apache.org/licenses/LICENSE-2.0
14 #
15 # Unless required by applicable law or agreed to in writing,
16 # software distributed under this License is distributed on an "AS
17 # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
18 # express or implied. See the License for the specific language
19 # governing permissions and limitations under the License.
20 ################################################################
21
22 import fileinput
23 import sys
24 import re
25 import argparse
26 import copy
27
28 ################
29 # "domain" class that describes a domain/exon alignment annotation
30 #
31 class DomAlign:
32 def __init__(self, name, info, color, qstart, qend, sstart, send, raw_score, bit_score, ident, qscore, RXRState, fulltext):
33 self.name = name
34 self.info = info
35 self.color_type = ''
36 if (not re.search(r'^\d+$',color)):
37 m=re.search(r'^(\d+)([a-z]?\w*)$',color)
38 if (m):
39 (self.color, self.color_type) = m.groups()
40 self.color = int(self.color)
41 else:
42 self.color = int(color)
43
44 self.q_start = qstart
45 self.q_end = qend
46 self.s_start = sstart
47 self.s_end = send
48 self.raw_score = raw_score
49 self.bit_score = bit_score
50 self.percid = ident
51 self.q_score = qscore
52 self.rxr = RXRState
53 self.idnum = 0
54 self.overlap_list = []
55 self.info_dom = None
56 self.text = fulltext
57 self.out_str = ''
58 self.over_cnt = 0
59
60 def append_overlap(self, overlap_dict):
61 self.overlap_list.append(overlap_dict)
62
63 def __str__(self):
64 # return "[%d]name: %s : %i-%i : %i-%i I=%.1f Q=%.1f %s" % (self.idnum, self.name, self.q_start, self.q_end, self.s_start, self.s_end, self.percid, self.q_score, self.rxr)
65 return "[%d:%s] %i-%i:%i-%i::%s [over:%d]" % (self.idnum, self.rxr, self.q_start, self.q_end, self.s_start, self.s_end, self.name,len(self.overlap_list))
66
67 def print_bar_str(self): # checking for 'NADA'
68 if (not self.out_str):
69 self.out_str = self.text
70 return str("|%s"%(self.out_str))
71
72 def make_bar_str(self): # create original string from values
73 bar_str = "|%s:%d-%d:%d-%d:s=%d;b=%.1f;I=%.3f;Q=%.1f;C=%s%s~%d" % (
74 self.rxr, self.q_start, self.q_end, self.s_start, self.s_end,
75 self.raw_score, self.bit_score, self.percid, self.q_score, self.name, self.info, self.color)
76
77 if (self.color_type):
78 bar_str += self.color_type
79 return bar_str
80
81 ################
82 # "exonInfo" class describes raw (un-aligned) exons with genome coordinates
83 #
84 class exonInfo:
85 def __init__(self, name, q_target, p_start, p_end, chrom, d_start, d_end, full_text):
86 self.name = name
87 self.q_target = q_target
88 self.p_start = p_start
89 self.p_end = p_end
90 self.chrom = chrom
91 self.d_start = d_start
92 self.d_end = d_end
93 self.text = full_text
94 self.plus_strand = True
95 if (d_start > d_end):
96 self.plus_strand = False
97
98 def __str__(self):
99 rxr_str = "XD"
100 if (self.q_target):
101 rxr_str="DX"
102 return '|%s:%i-%i:%s{%s:%i-%i}' % (rxr_str, self.p_start, self.p_end, self.name, self.chrom, self.d_start, self.d_end)
103
104
105 # Parses domain annotations after split at '|'
106 #|RX:1-38:3-40:s=37;b=17.0;I=0.289;Q=15.9;C=exon_1~1
107 #|RX:39-67:41-69:s=78;b=35.8;I=0.483;Q=68.7;C=exon_2~2
108 #|XR:1-67:3-69:s=115;b=52.8;I=0.373;Q=116.3;C=exon_1~1
109 #|RX:68-117:72-113:s=14;b=6.4;I=0.385;Q=0.0;C=exon_3~3
110 #|XR:68-124:70-119:s=-11;b=0.0;I=0.378;Q=0.0;C=exon_2~2
111 #|XR:125-167:120-165:s=39;b=17.9;I=0.429;Q=18.5;C=exon_3~3
112 #|RX:118-176:114-175:s=24;b=11.0;I=0.411;Q=1.5;C=exon_4~4
113 #|RX:177-200:176-198:s=27;b=12.4;I=0.435;Q=4.0;C=exon_5~5
114 #|XR:168-200:166-198:s=12;b=5.5;I=0.419;Q=0.0;C=exon_4~4
115 #
116 def parse_domain(text):
117 # takes a domain in string form, turns it into a domain object
118 # looks like: RX:5-82:5-82:s=397;b=163.1;I=1.000;Q=453.6;C=C.Thioredoxin~1
119 # could also look like: RX:5-82:5-82:s=397;b=163.1;I=1.000;Q=453.6;C=C.Thioredoxin{PF012445}~1
120
121 # get RX/XR and qstart/qstop sstart/sstop as strings
122 m = re.search(r'^(\w+):(\d+)-(\d+):(\d+)-(\d+):',text)
123 if (m):
124 (RXRState, qstart_s, qend_s, sstart_s, send_s) = m.groups()
125 else:
126 sys.stderr.write("could not parse location: %s\n"%(text))
127
128 # get score, bits, identity, Q info
129 m = re.search(r's=(\-?\d+);b=(\-?[\d\.]+);I=([\d\.]+);Q=(\-?\d+\.\d*);',text)
130 if (m):
131 (r_score_s, b_score_s, ident_s, qscore_s) = m.groups()
132 else:
133 sys.stderr.write("Error: no scores: %s\n" %(text))
134 r_score_s = b_score_s = qscore_s = "-1.0"
135
136 # get domain name/color (and possibly {info})
137
138 (name, color_s) = re.search(r';C=([^~]+)~(.+)$',text).groups()
139 info_s=""
140
141 if (re.search(r'\}$',name)):
142 (name, info_s) = re.search(r'([^\{]+)(\{[^\}]+\})$',name).groups()
143
144 dom_align = DomAlign(name, info_s, color_s, int(qstart_s), int(qend_s), int(sstart_s), int(send_s),
145 int(r_score_s), float(b_score_s), float(ident_s),float(qscore_s), RXRState, text)
146
147 return dom_align
148
149 # dom_info is like domain, but no scores
150 ################
151 # exon_info is like domain, but no scores
152 #
153 def parse_exon_info(text):
154 # takes a domain in string form, turns it into a domain object
155 # looks like: DX:1-100;C=C.Thioredoxin~1
156
157 (RXRState, start_s, end_s,name, color) = re.search(r'^(\w+):(\d+)-(\d+);C=([^~]+)~(.*)$',text).groups()
158 info = ""
159 if (re.search(r'\}$',name)):
160 (name, info) = re.search(r'([^\{]+)(\{[^\}]+\})$',name).groups()
161
162 gene_re = re.search(r'^\{([\w\.]+):(\d+)\-(\d+)\}',info)
163 if (gene_re):
164 (chrom, d_start, d_end) = gene_re.groups()
165 else:
166 (chrom, d_start, d_end) = ('',-1,-1)
167 # sys.stderr.write("genome info not found: %s\n" % (text))
168
169 q_target = True;
170 if (RXRState == 'XD'):
171 q_target = False
172
173 exon_info = exonInfo(name, q_target, int(start_s), int(end_s), chrom, int(d_start), int(d_end), text)
174
175 return exon_info
176
177 def overlap_fract(qdom, sdom):
178 # checks if a query and subject domain overlap
179 # if they do, return the amount of overlap with respect to each domain
180 # how much of query is covered by subject, how much of subject is covered by query
181
182 q_overlap = 0.0
183 s_overlap = 0.0
184
185 qq_len = qdom.q_end-qdom.q_start+1 # query alignment length in query coordinates
186 qs_len = qdom.s_end-qdom.s_start+1 # query alignment length in subj coordinates
187 sq_len = sdom.q_end-sdom.q_start+1 # subj alignment length in query coordinates
188 ss_len = sdom.s_end-sdom.s_start+1 # subj alignment length in subject coordinates
189
190 case = -1
191
192 # case (0) no overlap at all
193 if (qdom.q_end < sdom.q_start or sdom.s_end < qdom.s_start or qdom.q_start > sdom.q_end or sdom.q_start > qdom.q_end) :
194 case = 0
195 q_overlap = s_overlap = 0.0
196 # case (1) query surrounds subject
197 elif (qdom.q_start <= sdom.q_start and qdom.q_end >= sdom.q_end):
198 case = 1
199 s_overlap = 1.0
200 q_overlap = float(sq_len)/qq_len
201 # case (2) subject surrounds query
202 elif (sdom.s_start <= qdom.s_start and sdom.s_end >= qdom.s_end):
203 case = 2
204 q_overlap = 1.0
205 s_overlap = float(qs_len)/ss_len
206 # case (3) query left of subject
207 elif (qdom.q_start <= sdom.q_start and qdom.q_end <= sdom.q_end):
208 case = 3
209 q_overlap = float(qdom.q_end-sdom.q_start+1)/qq_len
210 s_overlap = float(qdom.s_end-sdom.s_start+1)/ss_len
211 # case (4) subject left of query
212 elif (sdom.s_start <= qdom.s_start and sdom.s_end <= qdom.s_end):
213 case = 4
214 q_overlap = float(sdom.q_end-qdom.q_start+1)/qq_len
215 s_overlap = float(sdom.s_end-qdom.s_start+1)/ss_len
216
217 if (q_overlap > 1.0 or s_overlap > 1.0):
218 if (1):
219 sys.stderr.write("***%i: qdom: %s sdom: %s\n"% (case,str(qdom),str(sdom)))
220 sys.stderr.write(" ** qover %.3f sover: %.3f\n"% (q_overlap, s_overlap))
221 sys.stderr.write(" ** qq_len: %d qs_len: %d ss_len: %d sq_len %d\n"%(qq_len, qs_len, ss_len, sq_len))
222
223 return (q_overlap, s_overlap)
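
For comparison, the overlap fractions for two intervals on a single coordinate axis can be computed without enumerating cases. The Python 3 sketch below is a simplified variant, not the method used by `overlap_fract`, which works on domain objects and mixes query and subject coordinates; `overlap_fraction` and its arguments are hypothetical names, and coordinates are assumed to be 1-based and inclusive.

```python
def overlap_fraction(a_start, a_end, b_start, b_end):
    """Return (fraction of a covered by b, fraction of b covered by a).

    Intervals are 1-based and inclusive, so the length is end - start + 1.
    """
    shared = min(a_end, b_end) - max(a_start, b_start) + 1
    if shared <= 0:
        return (0.0, 0.0)            # disjoint intervals
    a_len = a_end - a_start + 1
    b_len = b_end - b_start + 1
    return (shared / a_len, shared / b_len)
```

With this form neither fraction can exceed 1.0, which is the condition the diagnostic output at the end of `overlap_fract` is checking for.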
224
225 ####
226 # parse_protein(result_line)
227 # takes a protein in string format, turns it into a dictionary properly
228 # looks like: sp|P30711|GSTT1_HUMAN up|Q2NL00|GSTT1_BOVIN 86.67 240 32 0 1 240 1 240 1.4e-123 444.0 16VI7DR6IT3IR15KQ3AI6TI11TA7YH8RC12TA3SN10FL10QETM2AT6VMTA2LV2DG4ND6PS24EK6TA11DV14FSPQ5IL3LMML1WK5RQ |XR:4-76:4-76:s=327;b=134.6;I=0.895;Q=367.8;C=C.Thioredoxin~1|RX:5-82:5-82:s=356;b=146.5;I=0.902;Q=403.3;C=C.Thioredoxin~1|RX:83-93:83-93:s=52;b=21.4;I=0.818;Q=30.9;C=NODOM~0|XR:77-93:77-93:s=86;b=35.4;I=0.882;Q=72.6;C=NODOM~0|RX:94-110:94-110:s=88;b=36.2;I=0.882;Q=75.0;C=vC.GST_C~2v|XR:94-110:94-110:s=88;b=36.2;I=0.882;Q=75.0;C=vC.GST_C~2v|RX:111-201:111-201:s=409;b=168.3;I=0.868;Q=468.3;C=C.GST_C~2|XR:111-201:111-201:s=409;b=168.3;I=0.868;Q=468.3;C=C.GST_C~2|RX:202-240:202-240:s=154;b=63.4;I=0.795;Q=155.9;C=NODOM~0|XR:202-240:202-240:s=154;b=63.4;I=0.795;Q=155.9;C=NODOM~0
229 #
230 # returns [data[x] for x in fields] but also data['q/s_dom_list'] and data['q/sinfo_list']
231 def parse_protein(line_data,fields, req_name):
232 # last part (domain annotations) split('|') and parsed by parse_domain()
233
234 data = {}
235 data = dict(zip(fields, line_data))
236 if (re.search(r'\|',data['qseqid'])):
237 data['qseq_acc'] = data['qseqid'].split('|')[1]
238 else:
239 data['qseq_acc'] = data['qseqid']
240
241 if (re.search(r'\|',data['sseqid'])):
242 data['sseq_acc'] = data['sseqid'].split('|')[1]
243 else:
244 data['sseq_acc'] = data['sseqid']
245
246 Qdom_list = []
247 Sdom_list = []
248
249 Qinfo_list = []
250 Sinfo_list = []
251
252 counter = 0
253
254 if ('dom_annot' in data and len(data['dom_annot']) > 0):
255 for dom_str in data['dom_annot'].split('|')[1:]:
256 if (req_name and not re.search(req_name, dom_str)):
257 continue
258
259 counter += 1
260 dom = parse_domain(dom_str)
261 dom.idnum = counter
262 if (dom.rxr == 'RX'):
263 Qdom_list.append(dom)
264 else:
265 Sdom_list.append(dom)
266
267 data['qdom_list'] = Qdom_list
268 data['sdom_list'] = Sdom_list
269
270 if ('dom_info' in data and len(data['dom_info']) > 0):
271 for info_str in data['dom_info'].split('|')[1:]:
272 if (req_name and not re.search(req_name, info_str)):
273 continue
274 if (not re.search(r'^[DX][XD]',info_str)):
275 continue
276
277 dinfo = parse_exon_info(info_str)
278
279 if (dinfo.q_target):
280 Qinfo_list.append(dinfo)
281 else:
282 Sinfo_list.append(dinfo)
283
284
285 # put links to info_list into dom_list so info_list names can
286 # be changed -- give S/Qinfo's the S/Qdom ids of the overlapping domain
287
288 find_info_overlaps(Qinfo_list, Qdom_list)
289 find_info_overlaps(Sinfo_list, Sdom_list)
290
291 data['qinfo_list'] = Qinfo_list
292 data['sinfo_list'] = Sinfo_list
293
294 return data
295
296 # "domain" : RX:1-38:3-40:s=37;b=17.0;I=0.289;Q=15.9;C=exon_1~1
297 # "name" : like exon_2
298 # expanded for domain: RX:1-38:3-40:s=37;b=17.0;I=0.289;Q=15.9;C=exon_1{chr1:12345678-123456987}~1
299 def replace_name(domain_text, new_name, new_color_s):
300 out = "=".join(domain_text.split("=")[:-1]) # out has everything to last '='
301
302 old_name = domain_text.split(";C=")[-1]
303 old_info=""
304
305 if (re.search(r'\}~',old_name)):
306 (old_info)=re.search(r'(\{[^\}]+\})~',old_name).group(1)
307
308 if (not re.match(r'\d+',new_color_s)):
309 new_color_s="0"
310 out += "="+new_name+old_info+"~"+new_color_s # put it together
311 return out
312
313 ################
314 # check for overlaps using mid-point
315 #
316 def mid_overlaps(qdom_list, sdom_list):
317
318 if (len(qdom_list) != len(sdom_list)):
319 return False
320
321 for ix, q_dom in enumerate(qdom_list):
322 s_dom = sdom_list[ix]
323 q_mid = q_dom.q_start + (q_dom.q_end - q_dom.q_start + 1)/2.0
324 if not (q_mid >= s_dom.q_start and q_mid <= s_dom.q_end):
325 return False
326
327 q_qfract, q_sfract = overlap_fract(q_dom, s_dom) # overlap from query perspective
328 s_sfract, s_qfract = overlap_fract(s_dom, q_dom) # overlap from subject perspective
329
330 q_dom.overlap_list.append({"dom": s_dom, "q_over": q_qfract, "s_over": q_sfract})
331 s_dom.overlap_list.append({"dom": q_dom, "q_over": s_qfract, "s_over": s_sfract})
332
333 return True
334
335 ################
336 # find_overlaps -- populates dom.overlap_list for qdoms, sdoms
337 #
338 def find_overlaps(qdom_list, sdom_list, over_thresh):
339 # find qdom, sdom overlaps in O(N) time
340 #
341
342 if (len(sdom_list) == 0 or len(qdom_list)==0):
343 return
344
345 if (len(sdom_list) == len(qdom_list)): # same number of domains
346 if (mid_overlaps(qdom_list, sdom_list)):
347 return;
348 else:
349 for d in qdom_list:
350 d.overlap_list = []
351 for d in sdom_list:
352 d.overlap_list = []
353
354
355 qdom_queue = [x for x in qdom_list] # build a duplicate list
356 sdom_queue = [x for x in sdom_list]
357
358 qdom = qdom_queue.pop(0) # get the first element of each
359 sdom = sdom_queue.pop(0)
360
361 while (True):
362 pop_s = pop_q = False
363
364 q_qfract, q_sfract = overlap_fract(qdom, sdom) # overlap from query perspective
365 if (q_qfract > over_thresh or q_sfract > over_thresh):
366 qdom.append_overlap({"dom": sdom, "q_over": q_qfract, "s_over": q_sfract})
367 qdom.over_cnt += 1
368
369 s_sfract, s_qfract = overlap_fract(sdom, qdom) # overlap from subject perspective
370 if (s_qfract > over_thresh or s_sfract > over_thresh):
371 sdom.append_overlap({"dom": qdom, "q_over": s_qfract, "s_over": s_sfract})
372 sdom.over_cnt += 1
373
374 # check to see if we've used up the domain
375 if (qdom.s_end >= sdom.s_end):
376 pop_s = True
377 # else there are more s_dom's that are part of this q_dom
378
379 if (sdom.q_end >= qdom.q_end):
380 pop_q = True
381 # else there are more q_dom's that are part of this s_dom
382
383 # print 'QS: %s %s\t%s %s' %(pop_q, pop_s, qdom, sdom)
384
385 if (len(qdom_queue) > 0):
386 if (pop_q): # done with this qdom, get next
387 qdom = qdom_queue.pop(0)
388 elif (pop_q): # don't break until we try to get the next domain
389 break;
390
391 if (len(sdom_queue) > 0):
392 if (pop_s): # done with this sdom, get next
393 sdom = sdom_queue.pop(0)
394 elif (pop_s): # don't break until we try to get the next domain
395 break;
396 ####
397 # all done with overlaps
398
399 # # print "overlaps done"
400 # for qd in qdom_list:
401 # print qd, qd.over_cnt
402 # for sd in qd.overlap_list:
403 # print " s: q_over %.3f s_over: %.3f %s" % (sd['q_over'], sd['s_over'], str(sd['dom']))
404 # print "===="
405
406 # for sd in sdom_list:
407 # print sd, sd.over_cnt
408 # for qd in sd.overlap_list:
409 # print " q: q_over %.3f s_over: %.3f %s" % (qd['q_over'], qd['s_over'], str(qd['dom']))
410 # print "===="
411
412 ################
413 # find_info_overlaps -- sets dom.info_dom for each domain that overlaps an exon info entry
414 #
415 def find_info_overlaps(info_list, dom_list):
416
417 if (len(info_list) == 0 or len(dom_list)==0):
418 return
419
420 info_queue = [x for x in info_list] # build a duplicate list
421 dom_queue = [x for x in dom_list]
422
423 info = info_queue.pop(0) # get the first element of each
424 dom = dom_queue.pop(0)
425
426 while (True):
427 pop_d = pop_i = False
428
429 if (dom.rxr == 'RX'): # use dom.q_start/q_end
430 if (dom.q_end < info.p_start):
431 pop_d = True
432 elif (dom.q_end >= info.p_start and dom.q_start <= info.p_end): # overlap
433 dom.info_dom = info
434 pop_d = True
435 pop_i = True
436 elif (info.p_end < dom.q_start):
437 pop_i = True
438
439 else: # use dom.s_start/s_end
440 if (dom.s_end < info.p_start):
441 pop_d = True
442 elif (dom.s_end >= info.p_start and dom.s_start <= info.p_end): # overlap
443 dom.info_dom = info
444 pop_d = True
445 pop_i = True
446 elif (info.p_end < dom.s_start):
447 pop_i = True
448
449 if (len(info_queue) > 0):
450 if (pop_i): # done with this info, get next
451 info = info_queue.pop(0)
452 elif (pop_i): # don't break until we try to get the next domain
453 break;
454
455 if (len(dom_queue) > 0):
456 if (pop_d): # done with this dom, get next
457 dom = dom_queue.pop(0)
458 elif (pop_d):
459 break;
460
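The queue-based sweep in `find_info_overlaps()` can be sketched more compactly with two indices. This is a minimal illustration, not the script's code: the `Interval` record and `pair_overlaps()` name are hypothetical, both lists are assumed to be sorted by start coordinate, and, as in the function above, an overlap consumes both the domain and the info entry.

```python
from collections import namedtuple

# Hypothetical stand-in for the script's domain/info objects.
Interval = namedtuple("Interval", ["name", "start", "end"])

def pair_overlaps(doms, infos):
    """Pair each dom with the info interval it overlaps.

    Both lists are assumed sorted by start, with inclusive end coordinates.
    """
    pairs = {}
    d_ix = i_ix = 0
    while d_ix < len(doms) and i_ix < len(infos):
        dom, info = doms[d_ix], infos[i_ix]
        if dom.end < info.start:      # dom lies entirely before info
            d_ix += 1
        elif info.end < dom.start:    # info lies entirely before dom
            i_ix += 1
        else:                         # inclusive ranges overlap: consume both
            pairs[dom.name] = info.name
            d_ix += 1
            i_ix += 1
    return pairs

doms = [Interval("exon_1", 1, 12), Interval("exon_2", 13, 37), Interval("exon_3", 60, 86)]
infos = [Interval("info_a", 1, 12), Interval("info_b", 40, 50), Interval("info_c", 55, 90)]
result = pair_overlaps(doms, infos)
```

Because each step advances at least one index, the sweep is linear in the combined list length.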
461 ################
462 # build_multi_dict -- builds a dictionary of multiple overlaps in
463 # qdom.overlap_list or sdom.overlap_list
464 # returns multi_dict[idnum]
465 #
466 def build_multi_dict(dom_list):
467 # this code looks for xdom's that are associated with multiple ydoms
468 #
469 multi_dict = {} # dict of idnum: {yids:[], ydoms:[], xdom:}
470 for dom in dom_list: # for each subject domain
471 if (dom.over_cnt <= 1):
472 continue
473
474 multi_id_list = []
475 multi_dom_list = []
476 multi_q_cnt = 0
477 for xd_over_yd in dom.overlap_list: # a set of q_doms that overlap the subject
478 multi_q_cnt += 1
479 multi_id_list.append(xd_over_yd["dom"].idnum) # these are q_dom idnum's
480 multi_dom_list.append(xd_over_yd["dom"]) # these are q_doms
481
482 if (multi_q_cnt > 1): # only save when two (or more) overlaps
483 multi_dict[dom.idnum] = {"yids": multi_id_list, "ydoms":multi_dom_list, 'xdom':dom}
484
485 # # print out current multi_q_list
486 # print "--- multi_q dict ---"
487 # for db in multi_dict.keys():
488 # print "sdom: %s"%(db)
489 # for ix, qd in enumerate(multi_dict[db]['ydoms']):
490 # print " %d %d: %s"%(ix, multi_dict[db]['yids'][ix], qd)
491
492 # print "--- multi_dict done"
493
494 return multi_dict
495
496 ################
497 # find_best_id() -- returns idnum of the domain with the largest over_type ('q_over' or 's_over') fraction
498 #
499 def find_best_id(overlap_list, over_type):
500
501 max_fract = 0.0
502 max_idnum = 0
503 for over_d in overlap_list:
504 if (over_d[over_type] > max_fract):
505 max_idnum = over_d['dom'].idnum
506 max_fract = over_d[over_type]
507
508 return max_idnum
509
510 ################################################################
511 # final labeling routine -- leave qdom's alone, modify sdoms based on qdoms.
512 ################
513 # sdom's in more than one qdom are in multi_q_dict[]
514 # qdom's in more than one sdom are in multi_s_dict[]
515 # everyone else just gets the qdom name
516 # returns sdom_displayed_dict{idnum} -- the set of sdoms that have been modified
517 #
518 # 13-Nov-2018 -- ensure that there is an info_dom before replacing info_dom.text
519 #
520 def label_doms(qdom_list, sdom_list, multi_q_dict, multi_s_dict):
521
522 sdom_displayed_dict = {}
523 for qdom in qdom_list:
524 # qdom's stay the same
525 qdom.out_str = qdom.text
526
527 # check for q_doms with multiple s_doms
528 if (qdom.idnum in multi_s_dict):
529 # the best-overlapping sdom gets qdom.name, the rest are named exon_X
530 multi_s_entry = multi_s_dict[qdom.idnum]
531 best_id = find_best_id(qdom.overlap_list,'q_over') # find sdom with most overlap
532 for s_over in qdom.overlap_list: # find the sdom's that overlap this qdom
533 sdom = s_over['dom']
534 if (sdom.idnum == best_id):
535 sdom.out_str = replace_name(sdom.text, qdom.name, str(qdom.color))
536 if (sdom.info_dom):
537 sdom.info_dom.out_str = replace_name(sdom.info_dom.text,qdom.name, str(qdom.color))
538 else:
539 sdom.out_str = replace_name(sdom.text, "exon_X","0")
540 if (sdom.info_dom):
541 sdom.info_dom.out_str = replace_name(sdom.info_dom.text,"exon_X","0")
542 sdom_displayed_dict[sdom.idnum] = sdom;
543 continue # prevents re-labeling later
544
545 # check for s_doms with multiple q_doms
546 for sd_over in qdom.overlap_list:
547 sdom = sd_over['dom']
548 # it might make sense to do this in a second for loop after
549 # all the multiple stuff is done
550 if (sdom.idnum not in multi_q_dict):
551 # this is the simplest case -- sdom.text gets qdom.name
552 if (sdom.idnum not in sdom_displayed_dict):
553 sdom.out_str = replace_name(sdom.text, qdom.name, str(qdom.color))
554 if (sdom.info_dom):
555 sdom.info_dom.out_str = replace_name(sdom.info_dom.text,qdom.name, str(qdom.color))
556 else:
557 # this sdom belongs to multiple q_doms, add each of those q_doms to the name
558 exon_str='exon_'
559 # "ydoms" here are the qdoms overlapped by sdom
560 exon_str += '/'.join([ x.name.split("_")[1] for x in multi_q_dict[sdom.idnum]['ydoms']])
561 sdom.out_str = replace_name(sdom.text, exon_str,"0")
562 if (sdom.info_dom):
563 sdom.info_dom.out_str = replace_name(sdom.info_dom.text,exon_str,"0")
564
565 sdom_displayed_dict[sdom.idnum]=sdom
566
567 # done with labeling sdoms based on qdoms, but some may be unlabeled
568 # check for missing s_doms
569 while (len(sdom_displayed_dict.keys()) < len(sdom_list)):
570 for sdom in sdom_list:
571 if (sdom.idnum not in sdom_displayed_dict):
572 sdom.out_str = replace_name(sdom.text, "exon_X","0")
573 if (sdom.info_dom):
574 sdom.info_dom.out_str = replace_name(sdom.info_dom.text,"exon_X","0")
575
576 sdom_displayed_dict[sdom.idnum] = sdom
577
578 return sdom_displayed_dict
579
580 ################
581 #
582 # aa_to_exon() --- given a coordinate and the corresponding exon map, return the exon coordinate
583 # (can only be done for aligned exons)
584 #
585 # this version of the function must use an info_list, not an
586 # align_list, because it uses p_start/p_end rather than q_start/s_start, etc.
587 # a version using qp_start/sp_start would also need a target argument
588 #
589 def aa_to_exon(aa_coords, exon_info_list):
590
591 sorted_aa_coords = sorted(aa_coords)
592
593 pos_strand = True
594 if (exon_info_list[0].d_start > exon_info_list[0].d_end):
595 pos_strand = False
596
597 ex_x = 0
598 exon_coords = []
599
600 aap_x = 0
601 this_aap = sorted_aa_coords[aap_x]
602 while (ex_x < len(exon_info_list)):
603 this_exon = exon_info_list[ex_x]
604 if (this_aap <= this_exon.p_end and this_aap >= this_exon.p_start):
605 aa_dna_offset = (this_aap - this_exon.p_start) * 3
606
607 if (pos_strand):
608 aa_dna_pos = this_exon.d_start + aa_dna_offset
609 else:
610 aa_dna_pos = this_exon.d_start - aa_dna_offset
611
612 exon_coords.append({'chrom':this_exon.chrom, 'dpos':aa_dna_pos})
613 aap_x += 1
614 if (aap_x < len(sorted_aa_coords)):
615 this_aap = sorted_aa_coords[aap_x]
616 else:
617 break
618 else:
619 ex_x += 1
620
621 aa_coord_dict = {}
622 for aap_x, aap in enumerate(sorted_aa_coords):
623 aa_coord_dict[aap] = exon_coords[aap_x]
624
625 return [aa_coord_dict[ax] for ax in aa_coords]
626
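The coordinate arithmetic in `aa_to_exon()` can be shown for a single residue. The sketch below is an illustration under assumptions: the `Exon` record and `aa_to_dna()` name are hypothetical, the strand is checked per exon rather than once from the first exon, and a residue outside every exon returns `None`, a case the function above does not appear to handle.

```python
from collections import namedtuple

# Hypothetical exon record: protein range (p_start..p_end) and DNA range
# (d_start..d_end); d_start > d_end signals the reverse strand.
Exon = namedtuple("Exon", ["chrom", "p_start", "p_end", "d_start", "d_end"])

def aa_to_dna(aa_pos, exons):
    """Return (chrom, dna_pos) for a residue, or None if no exon covers it."""
    for ex in exons:
        if ex.p_start <= aa_pos <= ex.p_end:
            offset = (aa_pos - ex.p_start) * 3   # three nucleotides per codon
            if ex.d_start <= ex.d_end:           # forward strand
                return (ex.chrom, ex.d_start + offset)
            return (ex.chrom, ex.d_start - offset)  # reverse strand
    return None

fwd = [Exon("chr1", 1, 12, 1000, 1035), Exon("chr1", 13, 37, 2000, 2074)]
rev = [Exon("chr2", 1, 12, 5035, 5000)]

fwd_first = aa_to_dna(1, fwd)    # first residue of the first exon
fwd_mid = aa_to_dna(15, fwd)     # third residue of the second exon
rev_pos = aa_to_dna(3, rev)      # reverse strand counts downward
missing = aa_to_dna(40, fwd)     # not covered by any exon
```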
627 ################
628 #
629 def set_data_fields(args, line_data) :
630
631 field_str = 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore BTOP dom_annot'
632 field_qs_str = 'qseqid qlen sseqid slen pident length mismatch gapopen qstart qend sstart send evalue bitscore BTOP dom_annot'
633
634 if (len(line_data) > 1) :
635 if ((not args.have_qslen) and re.search(r'\d+',line_data[1])):
636 args.have_qslen=True
637
638 if ((not args.dom_info) and re.search(r'^\|[DX][XD]\:',line_data[-1])):
639 args.dom_info = True
640
641 end_field = -1
642 fields = field_str.split(' ')
643
644 if (args.have_qslen):
645 fields = field_qs_str.split(' ')
646
647 if (args.dom_info):
648 fields.append('dom_info')
649 end_field = -2
650
651 return (fields, end_field)
652
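The field list returned by `set_data_fields()` names the tab-separated columns of each hit line. How `parse_protein()` consumes those names is not shown here, so the following is only a sketch of the general idea, with a hypothetical `split_hit()` helper and values taken from the GSTM1 example line elsewhere in this diff.

```python
FIELD_STR = ("qseqid sseqid pident length mismatch gapopen qstart qend "
             "sstart send evalue bitscore BTOP dom_annot")

def split_hit(line, fields):
    """Map one tab-separated hit line onto the given field names."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(fields, values))

fields = FIELD_STR.split(" ")
line = "\t".join([
    "sp|P09488|GSTM1_HUMAN", "gi|121735|sp|P09488.3|GSTM1_HUMAN",
    "100.00", "218", "0", "0", "1", "218", "1", "218",
    "2.9e-113", "408.2", "218M",
    "|RX:1-12:1-12:s=64;b=25.0;I=1.000;Q=47.5;C=exon_1",
]) + "\n"
hit = split_hit(line, fields)
```

With `--have_qslen` the list gains `qlen` and `slen` columns, so the same helper would be called with the longer field list.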
653 ################################################################
654 #
655 # main program
656 # print "#"," ".join(sys.argv)
657
658 def main():
659
660 parser=argparse.ArgumentParser(description='scan_exons.py result_file.m8CB : re-label subject exons to match query')
661 parser.add_argument('--have_qslen', help='bl_tab fields include query/subject lengths',dest='have_qslen',action='store_true',default=False)
662 parser.add_argument('--dom_info', help='raw domain coordinates included',action='store_true',default=False)
663 parser.add_argument('--fill_gcoords', help='fill in genomic coordinates',action='store_true',default=False)
664 parser.add_argument('files', metavar='FILE', nargs='*', help='files to read, if empty, stdin is used')
665
666 args=parser.parse_args()
667
668 end_field = -1
669 data_fields_reset=False
670
671 (fields, end_field) = set_data_fields(args, [])
672
673 if (args.have_qslen and args.dom_info):
674 data_fields_reset=True
675
676 saved_qdom_list = []
677 qdom_list = []
678
679 for line in fileinput.input(args.files):
680 # pass through comments
681 if (line[0] == '#'):
682 print line, # ',' because have not stripped
683 continue
684
685 ################
686 # break up tab fields, check for extra fields
687 line = line.strip('\n')
688 line_data = line.split('\t')
689 if (not data_fields_reset): # look for --have_qslen number, --dom_info data, even if not set
690 (fields, end_field) = set_data_fields(args, line_data)
691 data_fields_reset = True
692
693 ################
694 # get exon annotations
695 data = parse_protein(line_data,fields,"exon") # get score/alignment/domain data
696
697 if (len(data['sdom_list'])==0 and len(data['qdom_list'])==0):
698 print line # no domains to be edited, print stripped line and continue
699 continue
700
701 # qdom_list=[] outside of loop for cases where the qseqid==sseqid match is not first
702 if len(data['qdom_list'])== 0:
703 if data['qseqid'] == data['sseqid']:
704 saved_qdom_list = [ copy.deepcopy(x) for x in data['sdom_list']]
705 max_sdom_id=len(data['sdom_list'])+1
706 for qdom in saved_qdom_list:
707 qdom.rxr = 'RX'
708 qdom.idnum = max_sdom_id
709 max_sdom_id += 1
710
711 qdom_list = [copy.deepcopy(x) for x in saved_qdom_list]
712 else:
713 qdom_list = data['qdom_list']
714
715 # print out non-exon info
716
717 if (len(qdom_list) == 0):
718 print line
719 continue
720
721 btab_str = '\t'.join(str(data[x]) for x in fields[:end_field])
722 # print # comment out for single line
723
724 ################
725 # find overlaps and multi-overlaps
726 #
727 find_overlaps(qdom_list,data['sdom_list'], 0.2)
728
729 multi_q_dict = build_multi_dict(data['sdom_list']) # keys are sdoms hitting multiple qdoms
730 multi_s_dict = build_multi_dict(qdom_list) # keys are qdoms hitting multiple sdoms
731
732 ################
733 # label qdoms, relabel sdoms
734 #
735 sdom_displayed_dict = label_doms(qdom_list, data['sdom_list'], multi_q_dict, multi_s_dict)
736
737 ################
738 # print exon annotations
739 #
740 q_exon_list = data['qdom_list']
741
742 s_exon_list = [sdom_displayed_dict[x] for x in sdom_displayed_dict.keys()]
743
744 ################
745 # if args.fill_gcoords, then do the transformations on the current exon lists
746
747 if (args.fill_gcoords):
748 sa_from_qa = []
749 for q_ex in q_exon_list:
750 sa_from_qa.append(q_ex.q_start)
751 sa_from_qa.append(q_ex.q_end)
752
753 # have list of coordinates, map them to exon
754 sex_from_qa2sa = aa_to_exon(sa_from_qa,data['sinfo_list'])
755
756 for iqx, q_ex in enumerate(q_exon_list):
757 sg_start = sex_from_qa2sa[2*iqx]
758 sg_end = sex_from_qa2sa[2*iqx+1]
759 sg_replace="::%s:%d-%d}"%(sg_start['chrom'],sg_start['dpos'],sg_end['dpos'])
760 q_ex.text=re.sub(r'\}',sg_replace,q_ex.text)
761 q_ex.out_str=re.sub(r'\}',sg_replace,q_ex.out_str)
762
763 qa_from_sa = []
764 for s_ex in s_exon_list:
765 qa_from_sa.append(s_ex.q_start)
766 qa_from_sa.append(s_ex.q_end)
767
768 # have list of coordinates, map them to exon
769 qex_from_sa2qa = aa_to_exon(qa_from_sa,data['qinfo_list'])
770
771 for isx, s_ex in enumerate(s_exon_list):
772 qg_start = qex_from_sa2qa[2*isx]
773 qg_end = qex_from_sa2qa[2*isx+1]
774 qg_replace="{%s:%d-%d::"%(qg_start['chrom'],qg_start['dpos'],qg_end['dpos'])
775 s_ex.text=re.sub(r'\{',qg_replace,s_ex.text)
776 s_ex.out_str = re.sub(r'\{',qg_replace,s_ex.out_str)
777
778 sorted_exon_list = sorted(q_exon_list+s_exon_list,key = lambda r: r.idnum)
779
780 dom_bar_str = ''
781 for exon in sorted_exon_list:
782 # print exon.print_bar_str() # for multi-line output
783 dom_bar_str += exon.print_bar_str()
784
785 info_bar_str = ''
786 for info in data['qinfo_list'] + data['sinfo_list']:
787 info_bar_str += info.text
788
789 print '\t'.join((btab_str, dom_bar_str, info_bar_str))
790
791 ################
792 # run the program ...
793
794 if __name__ == '__main__':
795 main()
796
0 #!/usr/bin/perl -w
0 #!/usr/bin/env perl
11
22 ################################################################
33 # copyright (c) 2014 by William R. Pearson and The Rector &
2121 # parse:
2222 # sp|P09488|GSTM1_HUMAN gi|121735|sp|P09488.3|GSTM1_HUMAN 100.00 218 0 0 1 218 1 218 2.9e-113 408.2 218M |RX:1-12:1-12:s=64;b=25.0;I=1.000;Q=47.5;C=exon_1|RX:13-37:13-37:s=128;b=49.9;I=1.000;Q=121.4;C=exon_2|RX:38-59:38-59:s=125;b=48.7;I=1.000;Q=117.9;C=exon_3|RX:60-86:60-86:s=145;b=56.5;I=1.000;Q=141.0;C=exon_4|RX:87-120:87-120:s=185;b=72.1;I=1.000;Q=187.2;C=exon_5|RX:121-152:121-152:s=174;b=67.8;I=1.000;Q=174.5;C=exon_6|RX:153-189:153-189:s=197;b=76.8;I=1.000;Q=201.0;C=exon_7|RX:190-218:190-218:s=151;b=58.9;I=1.000;Q=147.9;C=exon_8
2323
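The annotation string in the example line above can be split into per-exon records with a regular expression. This is a hedged sketch, not the script's parser: `ANNOT_RE` and `parse_annot()` are hypothetical names, and it assumes that the first coordinate range is the query, the second is the subject, and the trailing `key=value` pairs are separated by `;`.

```python
import re

# Assumed layout of one annotation, based on the example line:
#   |RX:qstart-qend:sstart-send:s=..;b=..;I=..;Q=..;C=name
ANNOT_RE = re.compile(
    r"\|(?P<type>[A-Z]{2}):(?P<q_start>\d+)-(?P<q_end>\d+)"
    r":(?P<s_start>\d+)-(?P<s_end>\d+):(?P<attrs>[^|]*)"
)

def parse_annot(annot_str):
    """Return one dict per '|XX:...' annotation found in annot_str."""
    doms = []
    for m in ANNOT_RE.finditer(annot_str):
        # split 's=64;b=25.0;...' into a {'s': '64', 'b': '25.0', ...} dict
        attrs = dict(kv.split("=", 1)
                     for kv in m.group("attrs").split(";") if "=" in kv)
        doms.append({
            "type": m.group("type"),
            "q_start": int(m.group("q_start")),
            "q_end": int(m.group("q_end")),
            "s_start": int(m.group("s_start")),
            "s_end": int(m.group("s_end")),
            "name": attrs.get("C"),
        })
    return doms

example = ("|RX:1-12:1-12:s=64;b=25.0;I=1.000;Q=47.5;C=exon_1"
           "|RX:13-37:13-37:s=128;b=49.9;I=1.000;Q=121.4;C=exon_2")
doms = parse_annot(example)
```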
24
25
24 use warnings;
2625 use strict;
2726 use Getopt::Long;
2827 use Pod::Usage;
seq/gstt1_pssm.asn1
Binary diff not shown
3838 #define VMSPIR 5
3939 #define GCGBIN 6
4040 #define FASTQ 7
41 #define LASTTXT 7
41 #define ACC_SCRIPT 9
42 #define LASTTXT 9
4243 #define ACC_LIST 10
4344
4445 #include "mm_file.h"
9495 #endif
9596
9697 int (*getliba[LASTLIB])(unsigned char *, int, char *, int, fseek_t *, int *,
97 struct lmf_str *, long *)={
98 agetlib,lgetlib,pgetlib,egetlib,
99 igetlib,vgetlib,gcg_getlib,qgetlib,
100 agetlib,agetlib
98 struct lmf_str *, long *)={
99 agetlib,lgetlib,pgetlib,egetlib, /* 0 - 3 */
100 igetlib,vgetlib,gcg_getlib,qgetlib, /* 4 - 7 */
101 agetlib,agetlib /* 8,9 */
101102 #ifdef UNIX
102 ,agetlib
103 ,agetlib /* 10 */
103104 #ifdef NCBIBL13
104 ,ncbl_getliba
105 ,ncbl_getliba /* 11 */
105106 #else
106 ,ncbl2_getliba
107 ,ncbl2_getliba /* 11 */
107108 #endif
108109 #ifdef NCBIBL20
109 ,ncbl2_getliba
110 ,ncbl2_getliba /* 12 */
111 #else
112 ,agetlib /* 12 - place holder */
110113 #endif
111114 #ifdef MYSQL_DB
112 ,agetlib
113 ,agetlib
114 ,agetlib
115 ,mysql_getlib
115 ,agetlib /* 13 */
116 ,agetlib /* 14 */
117 ,agetlib /* 15 */
118 ,mysql_getlib /* 16 */
116119 #endif
117120 #endif
118121 };
0 /* cal_cons.c - routines for printing translated alignments for
0 /* cal_cons.c - routines for printing alignments for
11 fasta, ssearch, ggsearch, glsearch */
22
33 /* $Id: cal_cons.c 1280 2014-08-21 00:47:55Z wrp $ */
638638
639639 /* Open query library */
640640 if ((q_file_p= open_lib(q_lib_p, m_msg.qdnaseq,qascii,!m_msg.quiet))==NULL) {
641 s_abort(" cannot open library ",m_msg.tname);
641 fprintf(stderr,"*** error [%s:%d] cannot open library %s\n",__FILE__,__LINE__, m_msg.tname);
642 exit(1);
643
644 /* s_abort(" cannot open library ",m_msg.tname); */
642645 }
643646 /* Fetch first sequence */
644647 qlib = 0;
663666
664667 /* if protein and ldb_info.term_code set, add '*' if not there */
665668 if (m_msg.ldb_info.term_code && !(m_msg.qdnaseq==SEQT_DNA || m_msg.qdnaseq==SEQT_RNA) &&
666 aa0[0][m_msg.n0-1]!='*') {
667 aa0[0][m_msg.n0++]='*';
669 aa0[0][m_msg.n0-1]!=aascii['*']) {
670 aa0[0][m_msg.n0++]=aascii['*'];
668671 aa0[0][m_msg.n0]=0;
669672 }
670673
762765 }
763766
764767 /* get a list of files to search */
765 lib_list_p = lib_select(lib_db_file, m_msg.ltitle, m_msg.flstr,
766 m_msg.ldb_info.ldnaseq);
768 lib_list_p = lib_select(lib_db_file, m_msg.ltitle, m_msg.flstr, m_msg.ldb_info.ldnaseq);
767769 }
768770 else {
769771 /* get a list of files to search */
914916
915917 if (!validate_params(aa0[0],m_msg.n0, &m_msg, &pst,
916918 lascii, pascii)) {
917 fprintf(stderr," *** ERROR *** validate_params() failed:\n -- %s\n", argv_line);
919 fprintf(stderr," *** error [%s:%d] - validate_params() failed:\n -- %s\n", __FILE__, __LINE__, argv_line);
918920 exit(1);
919921 }
920922
15201522 if (pst.do_rep) {
15211523 if (pst.zsflag >= 0) {
15221524 for (i=m_msg.nskip; i < m_msg.nskip + m_msg.nshow; i++) {
1523 bestp_arr[i]->repeat_thresh =
1524 min(E1_to_s(pst.e_cut_r, m_msg.n0, bestp_arr[i]->seq->n1,
1525 pst.zdb_size, m_msg.pstat_void),bestp_arr[i]->rst.score[pst.score_ix]);
1525 if (bestp_arr[i]->rst.escore > pst.e_cut_r) {
1526 bestp_arr[i]->repeat_thresh = bestp_arr[i]->rst.score[pst.score_ix] * 10;
1527 }
1528 else {
1529 bestp_arr[i]->repeat_thresh =
1530 min(E1_to_s(pst.e_cut_r, m_msg.n0, bestp_arr[i]->seq->n1, pst.zdb_size, m_msg.pstat_void),
1531 bestp_arr[i]->rst.score[pst.score_ix]);
1532 }
15261533 }
15271534 }
15281535 else {
22422249 getlib() calls */
22432250 /* **************************************************************** */
22442251 struct getlib_str *
2245 init_getlib_info(struct lib_struct *lib_list_p, int maxn,long max_memK) {
2252 init_getlib_info(struct lib_struct *lib_list_p, int maxn, long max_memK) {
22462253 struct getlib_str *my_getlib_info;
22472254 unsigned char *aa1save;
22482255
23532360 if ((cur_lib_p->m_file_p =
23542361 open_lib(cur_lib_p, m_msp->ldb_info.ldnaseq, lascii, !m_msp->quiet))
23552362 ==NULL) {
2356 fprintf(stderr," cannot open library %s\n",cur_lib_p->file_name);
2363 fprintf(stderr,"*** warning [%s:%d] cannot open library %s\n",__FILE__,__LINE__,cur_lib_p->file_name);
23572364 getlib_info->lib_list_p = getlib_info->lib_list_p->next;
23582365 if (getlib_info->lib_list_p == NULL) {
23592366 goto return_null;
23742381 /* if the library is NCBIBL20 and memory mapped, simply return
23752382 pointers to the memory map */
23762383 m_fd = getlib_info->lib_list_p->m_file_p;
2377 if (m_fd->get_mmap_chain) {
2384
2385 if (m_fd->get_mmap_chain && getlib_info->use_memory>=0) {
23782386 /* get a new seqr_chain */
23792387 my_seqr_chain =
23802388 new_seqr_chain(m_bufi_p->max_chain_seqs,(m_bufi_p->seq_buf_size+1),
222222
223223 /* subs_env takes a string, possibly with ${ENV}, and looks up all the
224224 potential environment variables and substitutes them into the
225 string */
226
225 string
226 */
227227 void subs_env(char *dest, char *src, int dest_size) {
228228 char *last_src, *bp, *bp1;
229229
273273 dest[dest_size-1]='\0';
274274 }
275275 }
276
277276
278277 void
279278 selectbest(struct beststr **bptr, int k, int n) /* k is rank in array */
14031402 char *link_lib_str;
14041403 char link_script[MAX_LSTR];
14051404 int link_lib_type;
1406 char *bp, *link_bp;
1405 char *bp, *link_bp, *bp_s;
14071406 FILE *link_fd=NULL; /* file for link accessions */
14081407
14091408 #ifndef UNIX
14661465 }
14671466
14681467 strncpy(link_script,link_bp,sizeof(link_script));
1468 /* un-edit m_msp->link_lname */
1469 if (bp != NULL) *bp = ' ';
1470
14691471 link_script[sizeof(link_script)-1] = '\0';
1472
1473 /* convert + to space in script string */
1474 for (bp_s = strchr(link_script+1,'+'); bp_s; bp_s=strchr(bp_s+1,'+')) {
1475 *bp_s = ' ';
1476 }
1477
14701478 SAFE_STRNCAT(link_script," ",sizeof(link_script));
14711479 SAFE_STRNCAT(link_script,link_acc_file,sizeof(link_script));
14721480 SAFE_STRNCAT(link_script," >",sizeof(link_script));
14731481 SAFE_STRNCAT(link_script,link_lib_file,sizeof(link_script));
1474
1475 /* un-edit m_msp->link_lname */
1476 if (bp != NULL) *bp = ' ';
14771482
14781483 /* run link_script link_acc_file > link_lib_file */
14791484 status = system(link_script);
15801585 }
15811586
15821587 strncpy(lib_db_script,lib_bp,sizeof(lib_db_script));
1588 bp = strchr(lib_db_script,'+');
1589 for ( ; bp; bp=strchr(bp+1,'+')) {
1590 *bp=' ';
1591 }
1592
15831593 lib_db_script[sizeof(lib_db_script)-1] = '\0';
15841594 SAFE_STRNCAT(lib_db_script," >",sizeof(lib_db_script));
15851595 SAFE_STRNCAT(lib_db_script,lib_db_file,sizeof(lib_db_script));
16491659
16501660 this->max_annot += (this->max_annot/2);
16511661 if ((this->tmp_arr_p= (struct annot_entry *)realloc(this->tmp_arr_p, this->max_annot*sizeof(struct annot_entry)))==NULL) {
1652 fprintf(stderr,"[*** error [%s:%d] - cannot reallocate tmp_ann_astr[%d]\n",
1662 fprintf(stderr,"*** error [%s:%d] - cannot reallocate tmp_ann_astr[%d]\n",
16531663 __FILE__, __LINE__, this->max_annot);
16541664 return 0;
16551665 }
17021712 annotations back
17031713 */
17041714
1715 /* create filename for input accessions */
17051716 annot_bline_file[0] = '\0';
17061717
17071718 if ((annot_descr_file=(char *)calloc(MAX_STR,sizeof(char)))==NULL) {
17101721 }
17111722 annot_descr_file[0] = '\0';
17121723
1724 /* add temporary directory if $TMP_DIR */
17131725 if ((bp=getenv("TMP_DIR"))!=NULL) {
17141726 strncpy(annot_bline_file,bp,sizeof(annot_bline_file));
17151727 annot_bline_file[sizeof(annot_bline_file)-1] = '\0';
17281740 goto no_annots;
17291741 }
17301742
1743 /* write out accessions, sequence length */
17311744 for (i=0; i<nbest; i++) {
17321745 if (bestp_arr[i]->mseq->annot_req_flag) { continue; }
17331746 if ((strlen(bestp_arr[i]->mseq->bline) > DESCR_OFFSET) &&
17431756 }
17441757 fclose(annot_fd);
17451758
1746 subs_env(annot_script, sname+1, sizeof(annot_script));
1759 /* convert '+' in annot_script to ' ' */
1760 bp = strchr(sname+1,'+');
1761 for ( ; bp; bp=strchr(bp+1,'+')) {
1762 *bp=' ';
1763 }
1764
1765 subs_env(annot_script, sname+1, sizeof(annot_script));
17471766 annot_script[sizeof(annot_script)-1] = '\0';
17481767 SAFE_STRNCAT(annot_script," ",sizeof(annot_script));
17491768 SAFE_STRNCAT(annot_script,annot_bline_file,sizeof(annot_script));
17521771
17531772 /* run annot_script annot_bline_file > annot_descr_file */
17541773 status = system(annot_script);
1774
1775 #ifdef DEBUG
1776 if (debug) {
1777 fprintf(stderr,"%s\n",annot_script);
1778 }
1779 #endif
1780
17551781 if (!debug) {
17561782 #ifdef UNIX
17571783 unlink(annot_bline_file);
21712197
21722198 q_offset = m_msp->q_offset + m_msp->q_off - 1;
21732199 if (q_offset < 0) { q_offset = 0;}
2200
2201 /* convert '+' in annot_script to ' ' */
2202 bp = strchr(sname+1,'+');
2203 for ( ; bp; bp=strchr(bp+1,'+')) {
2204 *bp=' ';
2205 }
2206
21742207 sprintf(annot_script,"%s \"%s\" %ld",sname+1, bline_descr,q_offset+m_msp->n0);
21752208 annot_script[sizeof(annot_script)-1] = '\0';
21762209
41044137 else if (aln && toupper(sp0) == 'N') aln->ngap_q++;
41054138 else if (aln && toupper(sp1) == 'N') aln->ngap_l++;
41064139 }
4140 else if ((sp0 == '*' && toupper(sp1) == 'U') ||
4141 (toupper(sp0) == 'U' && sp1 == '*')) {
4142 spa_val = M_IDENT;
4143 if (aln) {
4144 aln->nident++;
4145 aln->nmismatch--;
4146 }
4147 }
41074148
41084149 /* correct nident, nmismatch for N:N / X:X */
41094150 if (pam_x_id_sim < 0) { /* > 0 -> identical, similar */
6767
6868 #ifndef MAX_MEMK
6969 #if defined(BIG_LIB64) && (defined(COMP_THR) || defined(PCOMPLIB))
70 #define MAX_MEMK 8*1024*1024 /* 12 GB (<<10) for library in memory */
70 #define MAX_MEMK 16*1024*1024 /* 16 GB (<<10) for library in memory */
7171 #else
7272 #define MAX_MEMK 2*1024*1024 /* 2 GB (<<10) for library in memory */
7373 #endif
151151 #define MX_M9SUMM 64 /* markx==9(c) */
152152 #define MX_M10FORM 128 /* markx==10 - verbose output */
153153 #define MX_M11OUT 256 /* markx==11 - lalign lav */
154 #define MX_M8OUT 512 /* markx==8 blast8 output */
155 #define MX_M8COMMENT 1024 /* markx==8 blast8 output */
156 #define MX_MBLAST 2048 /* markx=B blast output */
157 #define MX_MBLAST2 4096 /* markx=BB more blast output */
154 #define MX_M8OUT 512 /* markx==8 blast tabular (-outfmt=6) output */
155 #define MX_M8COMMENT 1024 /* markx==8 blast tabular (-outfmt=7) with comments output */
156 #define MX_MBLAST 2048 /* markx=B blast alignment -outfmt=0 output */
157 #define MX_MBLAST2 4096 /* markx=BB blast best scores and alignment (-outfmt=0) output */
158158 #define MX_ANNOT_COORD 16384 /* -m 0, use -m 0B for both */
159159 #define MX_ANNOT_MID 32768 /* markx 0M, 1M, 2M annotations in middle */
160160 #define MX_RES_ALIGN_SCORE (1<<20) /* show residue alignment score, not alignment */
161 #define MX_M8_BTAB_LEN (1<<21) /* show query/subject seq. lens in -m 8 output */
161162
162 /* codes for -m 9 */
163 /* codes for -m 9, -m 8C? */
163164 #define SHOW_CODE_ID 1 /* identity only */
164165 #define SHOW_CODE_IDD 2 /* identity with domains */
165166 #define SHOW_CODE_ALIGN 4 /* encoded alignment */
168169 #define SHOW_CODE_MASK 12 /* use higher bits for annotation format */
169170 #define SHOW_CODE_EXT 16 /* encode identity, mismatch state */
170171 #define SHOW_ANNOT_FULL 32 /* show full-length annot in calc_code */
172 #define SHOW_CODE_DOMINFO 64 /* include raw domain info in btab/BTOP */
173
293293 m_msp->do_showbest = 1;
294294 m_msp->ashow = -1;
295295 m_msp->ashow_set = 0;
296
296297 m_msp->nmlen = DEF_NMLEN;
298
299
300 /* values set in initfa.c: parse_ext_opts() */
297301 m_msp->z_bits = 1;
298302 m_msp->tot_ident = 0;
303 m_msp->blast_ident = 0;
304 m_msp->m8_show_annot = 0;
305
299306 m_msp->mshow_set = 0;
300307 m_msp->mshow_min = 0;
301308 m_msp->aln.llen = 60;
620627 else {
621628 m_msp->ann_arr_def[i_ann] = NULL;
622629 }
623
624
625630 }
626631
627632 /* read definitions of annotation symbols from a file */
710715
711716 return markx;
712717 }
718
719 /* specify output format. If output format type is 'F', then provide
720 file name and write to file.
721
722 Thus, -m "F8CB outfile.m8CB" writes -m 8CB output to outfile.m8CB
723 Different format outputs can be written to different files
724
725 */
713726
714727 void
715728 pre_parse_markx(char *opt_arg, struct mngmsg *m_msp) {
757770
758771 /* first check for -m "F file" format */
759772 if (optarg[0] == 'F') {
760 if ((bp=strchr(optarg+1,' '))==NULL) {
773 if ((bp=strchr(optarg+1,' '))==NULL && (bp=strchr(optarg+1,'='))==NULL) {
761774 fprintf(stderr,"-m F missing file name: %s\n",optarg);
762775 return;
763776 }
823836 void
824837 parse_markx(char *optarg, struct markx_str *this) {
825838 int itmp;
826 char ctmp, ctmp2;
839 char ctmp, ctmp2, ctmp3;
827840
828841 itmp = 0;
829 ctmp = ctmp2 = '\0';
842 ctmp = ctmp2 = ctmp3 = '\0';
830843
831844 if (optarg[0] == 'B') { /* BLAST alignment output */
832845 this->markx = MX_MBLAST;
853866 return;
854867 }
855868 else {
856 sscanf(optarg,"%d%c%c",&itmp,&ctmp,&ctmp2);
869 sscanf(optarg,"%d%c%c%c",&itmp,&ctmp,&ctmp2,&ctmp3);
857870 }
858871 if (itmp==9) {
859872 if (ctmp=='c') {this->show_code = SHOW_CODE_ALIGN;}
876889 else if (ctmp2 == 'C') {this->show_code = SHOW_CODE_CIGAR;}
877890 else if (ctmp2 == 'D') {this->show_code = SHOW_CODE_CIGAR + SHOW_CODE_EXT;}
878891 else if (ctmp2 == 'B') {this->show_code = SHOW_CODE_BTOP;}
892
893 if (ctmp3 == 'L') {
894 this->markx |= MX_M8_BTAB_LEN;
895 this->show_code |= SHOW_CODE_DOMINFO;
896 }
897 else if (ctmp3 == 'l') {
898 this->markx |= MX_M8_BTAB_LEN;
899 }
900
879901 }
880902 }
881903
116116
117117 f_str = (struct f_struct *) calloc(1, sizeof(struct f_struct));
118118 if(f_str == NULL) {
119 fprintf(stderr, "Couldn't calloc f_str\n");
119 fprintf(stderr, "*** error [%s:%d] - cannot calloc f_str [%lu]\n",
120 __FILE__, __LINE__, sizeof(struct f_struct));
120121 exit(1);
121122 }
122123
134135 if (ppst->hsq[i0] < NMAP && ppst->hsq[i0] > mhv) mhv = ppst->hsq[i0];
135136
136137 if (mhv <= 0) {
137 fprintf (stderr, " maximum hsq <=0 %d\n", mhv);
138 fprintf (stderr, "*** error [%s:%d] - maximum hsq <=0 %d\n",
139 __FILE__, __LINE__, mhv);
138140 exit (1);
139141 }
140142
146148 f_str->hmask = (hmax >> f_str->kshft) - 1;
147149
148150 if ((f_str->aa0 = (unsigned char *) calloc(n0+1, sizeof(char))) == NULL) {
149 fprintf (stderr, " cannot allocate f_str->aa0 array; %d\n",n0+1);
151 fprintf (stderr, "*** error [%s:%d] - cannot allocate f_str->aa0 array; %d\n",
152 __FILE__, __LINE__, n0+1);
150153 exit (1);
151154 }
152155 for (i=0; i<n0; i++) f_str->aa0[i] = aa0[i];
153156 aa0 = f_str->aa0;
154157
155158 if ((f_str->aa0t = (unsigned char *) calloc(n0+1, sizeof(char))) == NULL) {
156 fprintf (stderr, " cannot allocate f_str0->aa0t array; %d\n",n0+1);
159 fprintf (stderr, "*** error [%s:%d] - cannot allocate f_str0->aa0t array; %d\n",
160 __FILE__, __LINE__, n0+1);
157161 exit (1);
158162 }
159163 f_str->aa0ix = 0;
160164
161165 if ((f_str->harr = (struct hlstr *) calloc (hmax, sizeof (struct hlstr))) == NULL) {
162 fprintf (stderr, " cannot allocate hash array; hmax: %d hmask: %d\n",
163 hmax,f_str->hmask);
166 fprintf (stderr, "*** error [%s:%d] - cannot allocate hash array; hmax: %d hmask: %d\n",
167 __FILE__, __LINE__, hmax,f_str->hmask);
164168 exit (1);
165169 }
166170 if ((f_str->pamh1 = (int *) calloc (nsq+1, sizeof (int))) == NULL) {
167 fprintf (stderr, " cannot allocate pamh1 array\n");
171 fprintf (stderr, "*** error [%s:%d] - cannot allocate pamh1 array [%d]\n",
172 __FILE__, __LINE__, nsq+1);
168173 exit (1);
169174 }
170175 if ((f_str->pamh2 = (int *) calloc (hmax, sizeof (int))) == NULL) {
171 fprintf (stderr, " cannot allocate pamh2 array\n");
176 fprintf (stderr, "*** error [%s:%d] - cannot allocate pamh2 array [%d]\n",
177 __FILE__, __LINE__, hmax);
172178 exit (1);
173179 }
174180 if ((f_str->link = (struct hlstr *) calloc (n0, sizeof (struct hlstr))) == NULL) {
175 fprintf (stderr, " cannot allocate hash link array");
181 fprintf (stderr, "*** error [%s:%d] - cannot allocate hash link array [%d]\n",
182 __FILE__, __LINE__, n0);
176183 exit (1);
177184 }
178185
247254 f_str->maxsav = MAXSAV;
248255 if ((f_str->vmax = (struct savestr *)
249256 calloc(MAXSAV,sizeof(struct savestr)))==NULL) {
250 fprintf(stderr, "Couldn't allocate vmax[%d].\n",f_str->maxsav);
257 fprintf(stderr, "*** error [%s:%d] - cannot allocate vmax[%d].\n",
258 __FILE__, __LINE__, f_str->maxsav);
251259 exit(1);
252260 }
253261
254262 if ((f_str->vptr = (struct savestr **)
255263 calloc(MAXSAV,sizeof(struct savestr *)))==NULL) {
256 fprintf(stderr, "Couldn't allocate vptr[%d].\n",f_str->maxsav);
264 fprintf(stderr, "*** error [%s:%d] - cannot allocate vptr[%d].\n",
265 __FILE__, __LINE__, f_str->maxsav);
257266 exit(1);
258267 }
259268
260269 for (vmptr = f_str->vmax; vmptr < &f_str->vmax[MAXSAV]; vmptr++) {
261270 vmptr->used = (int *) calloc(n0, sizeof(int));
262271 if(vmptr->used == NULL) {
263 fprintf(stderr, "Couldn't alloc vmptr->used\n");
272 fprintf(stderr, "*** error [%s:%d] - cannot alloc vmptr->used [%d]\n",
273 __FILE__, __LINE__, n0);
264274 exit(1);
265275 }
266276 }
284294
285295 if (f_str->diag == NULL)
286296 {
287 fprintf (stderr, " cannot allocate diagonal arrays: %ld\n",
288 (long) MAXDIAG * (long) (sizeof (struct dstruct)));
297 fprintf (stderr, "*** error [%s:%d] - cannot allocate diagonal arrays: %ld\n",
298 __FILE__, __LINE__, (long) MAXDIAG * (long) (sizeof (struct dstruct)));
289299 exit (1);
290300 }
291301
293303 if ((f_str->aa1x =(unsigned char *)calloc((size_t)ppst->maxlen+2,
294304 sizeof(unsigned char)))
295305 == NULL) {
296 fprintf (stderr, "cannot allocate aa1x array %d\n", ppst->maxlen+2);
306 fprintf (stderr, "*** error [%s:%d] - cannot allocate aa1x array %d\n",
307 __FILE__, __LINE__, ppst->maxlen+2);
297308 exit (1);
298309 }
299310 f_str->aa1x++;
304315
305316 maxn0 = max(3*n0/2,MIN_RES);
306317 if ((res = (int *)calloc((size_t)maxn0,sizeof(int)))==NULL) {
307 fprintf(stderr,"cannot allocate alignment results array %d\n",maxn0);
318 fprintf(stderr,"*** error [%s:%d] - cannot allocate alignment results array %d\n",
319 __FILE__, __LINE__, maxn0);
308320 exit(1);
309321 }
310322 f_str->res = res;
314326
315327 /* initialize priors array. */
316328 if((f_str->priors = (double *)calloc(ppst->nsq+1, sizeof(double))) == NULL) {
317 fprintf(stderr, "Couldn't allocate priors array.\n");
329 fprintf(stderr, "*** error [%s:%d] - cannot allocate priors array [%d]\n",
330 __FILE__, __LINE__, ppst->nsq+1);
318331 exit(1);
319332 }
320333 calc_priors(f_str->priors, ppst, f_str, NULL, 0, ppst->pseudocts);
420433 }
421434
422435 if (n0+n1+1 >= MAXDIAG) {
423 fprintf(stderr,"n0,n1 too large: %d, %d\n",n0,n1);
436 fprintf(stderr,"*** error [%s:%d] - n0,n1 too large %d + %d > %d\n",
437 __FILE__, __LINE__, n0,n1, MAXDIAG);
424438 rst->score[0] = rst->score[1] = rst->score[2] = -1;
425439 rst->escore = 2.0;
426440 rst->segnum = 0;
642656 if (ppst->debug_lib)
643657 for (i=0; i<n10; i++)
644658 if (f_str->aa1x[i]>ppst->nsq) {
645 fprintf(stderr,
646 "residue[%d/%d] %d range (%d)\n",i,n1,
647 f_str->aa1x[i],ppst->nsq);
659 fprintf(stderr, "*** error [%s:%d] - residue[%d/%d] %d range (%d)\n",
660 __FILE__, __LINE__, i,n1, f_str->aa1x[i],ppst->nsq);
648661 f_str->aa1x[i]=0;
649662 n10=i-1;
650663 }
842855 }
843856 tot += ctot;
844857 if (ci >= 0) {
845 if (ci >= n0) {fprintf(stderr," warning - ci off end %d/%d\n",ci,n0);}
858 if (ci >= n0) {fprintf(stderr,"*** warning [%s:%d] - ci off end %d/%d\n",
859 __FILE__, __LINE__, ci,n0);}
846860 else {
847861 *aa0pt++ = aa0p[ci];
848862 aa0p[ci] += 32;
855869 if (aa0t_flg) {
856870 dmax->dp -= f_str->aa0ix; /* shift ->dp for aa0t */
857871 if ((ci=(int)(aa0pt-f_str->aa0t)) > n0) {
858 fprintf(stderr," warning - aapt off %d/%d end\n",ci,n0);
872 fprintf(stderr,"*** warning [%s:%d] - aapt off %d/%d end\n",
873 __FILE__, __LINE__, ci,n0);
859874 }
860875 else
861876 *aa0pt++ = 0; /* skip over NULL */
11571172 *have_ares = 0x2; /* set 0x2 bit to indicate local copy */
11581173
11591174 if ((a_res = (struct a_res_str *)calloc(1, sizeof(struct a_res_str)))==NULL) {
1160 fprintf(stderr," [do_walign] Cannot allocate a_res");
1175 fprintf(stderr,"*** error [%s:%d] - cannot allocate a_res [%lu]\n",
1176 __FILE__, __LINE__, sizeof(struct a_res_str));
11611177 return NULL;
11621178 }
11631179
11801196 */
11811197
11821198 if ((aa0t = (unsigned char *)calloc(n0+1,sizeof(unsigned char)))==NULL) {
1183 fprintf(stderr," cannot allocate aa0t %d\n",n0+1);
1199 fprintf(stderr,"*** error [%s:%d] - cannot allocate aa0t %d\n",
1200 __FILE__, __LINE__, n0+1);
11841201 exit(1);
11851202 }
11861203
20652065 #define XTERNAL
20662066 #include "upam.h"
20672067
2068 /* this code shows the alignment of the protein with the three phased
2069 translation of the DNA sequence
2070 */
2071
20682072 extern void
2069 display_alig(int *a, unsigned char *dna, unsigned char * pro, int length, int ld)
2073 display_alig(int *a, unsigned char *dna_p, unsigned char * pro, int length, int ld)
20702074 {
20712075 int len = 0, i, j, x, y, lines, k;
20722076 char line1[100], line2[100], line3[100],
20732077 tmp[10] = " ";
2074 unsigned char *dna1, c1, c2, c3, *st;
2075
2076 dna1 = ckalloc((size_t)ld);
2077 for (st = dna, i = 0; i < ld; i++, st++) dna1[i] = NCBIstdaa[*st];
2078 unsigned char *dna_p1, c1, c2, c3, *st;
2079
2080 dna_p1 = ckalloc((size_t)ld);
2081 for (st = dna_p, i = 0; i < ld; i++, st++) dna_p1[i] = NCBIstdaa[*st];
20782082 line1[0] = line2[0] = line3[0] = '\0'; x= a[0]; y = a[1]-1;
20792083
20802084 for (len = 0, j = 2, lines = 0; j < length; j++) {
20862090 if (a[j+1] == 2) tmp[2] = ' ';
20872091 }
20882092 if (i > 0) {
2089 strncpy(&line1[len], (const char *)&dna1[y], i); y+=i;
2090 } else {line1[len] = '-'; i = 1; tmp[0] = NCBIstdaa[pro[x++]];}
2093 strncpy(&line1[len], (const char *)&dna_p1[y], i);
2094 y+=i;
2095 }
2096 else {
2097 line1[len] = '-';
2098 i = 1;
2099 tmp[0] = NCBIstdaa[pro[x++]];
2100 }
20912101 strncpy(&line2[len], tmp, i);
20922102 for (k = 0; k < i; k++) {
20932103 if (tmp[k] != ' ' && tmp[k] != '-') {
2094 if (k == 2) tmp[k] = '\\';
2095 else if (k == 1) tmp[k] = '|';
2096 else tmp[k] = '/';
2097 } else tmp[k] = ' ';
2104 if (k == 2) {tmp[k] = '\\';}
2105 else if (k == 1) { tmp[k] = '|'; }
2106 else { tmp[k] = '/'; }
2107 }
2108 else { tmp[k] = ' '; }
20982109 }
20992110 if (i == 1) tmp[0] = ' ';
21002111 strncpy(&line3[len], tmp, i);
21032114 line1[len] = line2[len] =line3[len] = '\0';
21042115 if (len >= WIDTH) {
21052116 printf("\n%5d", WIDTH*lines++);
2106 for (k = 10; k <= WIDTH; k+=10)
2117 for (k = 10; k <= WIDTH; k+=10) {
21072118 printf(" . :");
2108 if (k-5 < WIDTH) printf(" .");
2119 }
2120 if (k-5 < WIDTH) { printf(" ."); }
21092121 c1 = line1[WIDTH]; c2 = line2[WIDTH]; c3 = line3[WIDTH];
21102122 line1[WIDTH] = line2[WIDTH] = line3[WIDTH] = '\0';
2123
21112124 printf("\n %s\n %s\n %s\n", line1, line3, line2);
2125
21122126 line1[WIDTH] = c1; line2[WIDTH] = c2; line3[WIDTH] = c3;
21132127 strncpy(line1, &line1[WIDTH], sizeof(line1)-1);
21142128 strncpy(line2, &line2[WIDTH], sizeof(line2)-1);
21222136 if (k-5 < len) printf(" .");
21232137 printf("\n %s\n %s\n %s\n", line1, line3, line2);
21242138 }
2125
21262139
21272140 /* alignment store the operation that align the protein and dna sequence.
21282141 The code of the number in the array is as follows:
21372150 in the protein and dna sequences in the local alignment.
21382151
21392152 Display looks like where WIDTH is assumed to be divisible by 10.
2153
2154 -- this alignment is incorrect, protein phases rather than DNA are shown --
21402155
21412156 0 . : . : . : . : . : . :
21422157 CCTATGATACTGGGATACTGGAACGTCCGCGGACTGACACACCCGATCCGCATGCTCCTG
281281 if (hsq[i0] < NMAP && hsq[i0] > mhv) mhv = hsq[i0];
282282
283283 if (mhv <= 0) {
284 fprintf (stderr, " maximum hsq <=0 %d\n", mhv);
284 fprintf (stderr, "*** error [%s:%d] - maximum hsq <=0 %d\n",
285 __FILE__, __LINE__, mhv);
285286 exit (1);
286287 }
287288
298299 f_str->hmask = (hmax >> f_str->kshft) - 1;
299300
300301 if ((f_str->harr = (int *) calloc (hmax, sizeof (int))) == NULL) {
301 fprintf (stderr, " cannot allocate hash array\n");
302 fprintf (stderr, "*** error [%s:%d] - cannot allocate hash array [%d]\n",
303 __FILE__, __LINE__, hmax );
302304 exit (1);
303305 }
304306 if ((f_str->pamh1 = (int *) calloc (ppst->nsq+1, sizeof (int))) == NULL) {
305 fprintf (stderr, " cannot allocate pamh1 array\n");
307 fprintf (stderr, "*** error [%s:%d] - cannot allocate pamh1 array [%d]\n",
308 __FILE__, __LINE__, ppst->nsq+1);
306309 exit (1);
307310 }
308311 if ((f_str->pamh2 = (int *) calloc (hmax, sizeof (int))) == NULL) {
309 fprintf (stderr, " cannot allocate pamh2 array\n");
312 fprintf (stderr, "*** error [%s:%d] - cannot allocate pamh2 array [%d]\n",
313 __FILE__, __LINE__, hmax);
310314 exit (1);
311315 }
312316 if ((f_str->link = (int *) calloc (n0, sizeof (int))) == NULL) {
313 fprintf (stderr, " cannot allocate hash link array");
317 fprintf (stderr, "*** error [%s:%d] - cannot allocate hash link array [%d]\n",
318 __FILE__, __LINE__, n0);
314319 exit (1);
315320 }
316321
318323 if ((f_str->aa1x =(unsigned char *)calloc((size_t)ppst->maxlen+2,
319324 sizeof(unsigned char)))
320325 == NULL) {
321 fprintf (stderr, "cannot allocate aa1x array %d\n", ppst->maxlen+2);
326 fprintf (stderr, "*** error [%s:%d] - cannot allocate aa1x array %d\n",
327 __FILE__, __LINE__, ppst->maxlen+2);
322328 exit (1);
323329 }
324330 f_str->aa1x++;
326332 if ((f_str->aa1y =(unsigned char *)calloc((size_t)ppst->maxlen+2,
327333 sizeof(unsigned char)))
328334 == NULL) {
329 fprintf (stderr, "cannot allocate aa1y array %d\n", ppst->maxlen+2);
335 fprintf (stderr, "*** error [%s:%d] - cannot allocate aa1y array %d\n",
336 __FILE__, __LINE__, ppst->maxlen+2);
330337 exit (1);
331338 }
332339 f_str->aa1y++;
334341 maxn0 = n0 + 2;
335342 if ((aa0x =(unsigned char *)calloc((size_t)maxn0,sizeof(unsigned char)))
336343 == NULL) {
337 fprintf (stderr, "cannot allocate aa0x array %d\n", maxn0);
344 fprintf (stderr, "*** error [%s:%d] - cannot allocate aa0x array %d\n",
345 __FILE__, __LINE__, maxn0);
338346 exit (1);
339347 }
340348 aa0x++;
342350
343351 if ((aa0y =(unsigned char *)calloc((size_t)maxn0,sizeof(unsigned char)))
344352 == NULL) {
345 fprintf (stderr, "cannot allocate aa0y array %d\n", maxn0);
353 fprintf (stderr, "*** error [%s:%d] - cannot allocate aa0y array %d\n",
354 __FILE__, __LINE__, maxn0);
346355 exit (1);
347356 }
348357 aa0y++;
437446 #ifndef ALLOCN0
438447 if ((f_str->diag = (struct dstruct *) calloc ((size_t)MAXDIAG,
439448 sizeof (struct dstruct)))==NULL) {
440 fprintf (stderr," cannot allocate diagonal arrays: %ld\n",
449 fprintf (stderr,"*** error [%s:%d] - cannot allocate diagonal arrays: %ld\n",
450 __FILE__, __LINE__,
441451 (long) MAXDIAG *sizeof (struct dstruct));
442452 exit (1);
443453 };
444454 #else
445455 if ((f_str->diag = (struct dstruct *) calloc ((size_t)n0,
446456 sizeof (struct dstruct)))==NULL) {
447 fprintf (stderr," cannot allocate diagonal arrays: %ld\n",
448 (long)n0*sizeof (struct dstruct));
457 fprintf (stderr,"*** error [%s:%d] - cannot allocate diagonal arrays: %ld\n",
458 __FILE__, __LINE__, (long)n0*sizeof (struct dstruct));
449459 exit (1);
450460 };
451461 #endif
452462
453463
454464 if ((waa= (int *)malloc (sizeof(int)*(nsq+1)*n0)) == NULL) {
455 fprintf(stderr,"cannot allocate waa struct %3d\n",nsq*n0);
465 fprintf(stderr,"*** error [%s:%d] - cannot allocate waa struct %3d\n",
466 __FILE__, __LINE__, nsq*n0);
456467 exit(1);
457468 }
458469
466477 f_str->waa0 = waa;
467478
468479 if ((waa= (int *)malloc (sizeof(int)*(nsq+1)*n0)) == NULL) {
469 fprintf(stderr,"cannot allocate waa struct %3d\n",nsq*n0);
480 fprintf(stderr,"*** error [%s:%d] - cannot allocate waa struct %3d\n",
481 __FILE__, __LINE__, nsq*n0);
470482 exit(1);
471483 }
472484
488500 maxn0 = max(4*n0,MIN_RES);
489501 #endif
490502 if ((res = (int *)calloc((size_t)maxn0,sizeof(int)))==NULL) {
491 fprintf(stderr,"cannot allocate alignment results array %d\n",maxn0);
503 fprintf(stderr,"*** error [%s:%d] - cannot allocate alignment results array %d\n",
504 __FILE__, __LINE__, maxn0);
492505 exit(1);
493506 }
494507 f_str->res = res;
690703 }
691704
692705 if (n0+n1+1 >= MAXDIAG) {
693 fprintf(stderr,"n0,n1 too large: %d, %d\n",n0,n1);
706 fprintf(stderr,"*** error [%s:%d] - n0,n1 too large > %d: %d, %d\n",
707 __FILE__, __LINE__, MAXDIAG, n0, n1);
694708 rst->score[0] = rst->score[1] = rst->score[2] = -1;
695709 return;
696710 }
15231537 }
15241538
15251539 if (i >= max_res) {
1526 fprintf(stderr," alignment truncated: %d/%d\n", max_res,i);
1540 fprintf(stderr,"*** error [%s:%d] - alignment truncated: %d > %d (max_res)\n",
1541 __FILE__, __LINE__, i, max_res);
15271542 }
15281543
15291544 up = &up[-3]; down = &down[-3]; tp = &tp[-3];
15801595 ld += 2;
15811596 init_ROW(up, ld+1); /* set to zero */
15821597 init_ROW(down, ld+1); /* set to zero */
1583
15841598
15851599 cur = up+1;
15861600 last = down+1;
20702084 #define XTERNAL
20712085 #include "upam.h"
20722086
2087 /* this code is not used by the program, it was included for testing */
2088 /* display_alig(*align_enc, *dna_p, *prot, length, ld) takes the
2089
2090 alignment encoding, and the DNA and protein sequences, and produces an alignment.
2091 *dna_p is the three phases of the translated DNA sequence
2092 *prot is the original protein sequence
2093
2094 length is the length of the encoding
2095 ld is the length of the alignment(?)
2096
2097 the first two entries in align_enc[] are the start of the protein
2098 and DNA sequences.
2099
2100 The encoding is: (why no code 1?:)
2101
2102 0: delete amino acid.
2103 2: frame shift, 2 nucleotides match with an amino acid
2104 3: match an amino acid with a codon
2105 4: the other type of frame shift
2106 5: delete of a codon
2107
2108 One of the properties of this encoding is that it indicates the
2109 amount that the DNA sequence index needs to be incremented after
2110 prot match (except for 5)
2111
2112 */
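The encoding described in the comment above can be illustrated with a small standalone sketch. The function names (`enc_dna_step`, `enc_prot_step`, `walk_encoding`) are hypothetical and are not part of the FASTA sources. The step sizes are read off what `display_alig()` does with each value: codes 0, 2, 3 and 4 consume one amino acid, code 5 consumes none, and the DNA index moves by the code value except for code 5, which moves it by 3. Treating any other value (including 1) as invalid is an assumption taken from the list of codes in the comment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: DNA positions consumed by one encoding value.
   Returns -1 for a value that is not listed in the comment. */
int enc_dna_step(int code) {
  switch (code) {
  case 0: return 0;   /* delete amino acid: DNA index does not move */
  case 2: return 2;   /* frame shift: 2 nucleotides match an amino acid */
  case 3: return 3;   /* codon matches an amino acid */
  case 4: return 4;   /* the other type of frame shift */
  case 5: return 3;   /* special case: codon deleted, DNA index moves by 3 */
  default: return -1; /* assumption: anything else is invalid */
  }
}

/* Hypothetical helper: amino acids consumed by one encoding value. */
int enc_prot_step(int code) {
  switch (code) {
  case 0: case 2: case 3: case 4: return 1;
  case 5: return 0;   /* deleted codon: protein index does not move */
  default: return -1;
  }
}

/* Walk enc[2..length-1]; enc[0] and enc[1] hold the start positions in the
   protein and DNA sequences and are skipped here.  On success returns 0 and
   stores the number of amino acids and DNA positions consumed. */
int walk_encoding(const int *enc, int length, int *prot_used, int *dna_used) {
  int j, p = 0, d = 0;

  assert(enc != NULL && prot_used != NULL && dna_used != NULL);

  for (j = 2; j < length; j++) {
    int ds = enc_dna_step(enc[j]);
    int ps = enc_prot_step(enc[j]);
    if (ds < 0 || ps < 0) return -1;  /* unknown code */
    d += ds;
    p += ps;
  }
  *prot_used = p;
  *dna_used = d;
  return 0;
}
```

For the encoding `{10, 100, 3, 3, 2, 3, 5, 0, 4}` (length 9), the walk consumes 6 amino acids and 18 DNA positions. Keeping the two step functions separate mirrors the asymmetry noted in the comment: only code 5 breaks the rule that the code value equals the DNA increment.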
2113
20732114 extern void
2074 display_alig(int *a, unsigned char *dna, unsigned char * pro, int length, int ld)
2115 display_alig(int *a, unsigned char *dna_p, unsigned char * pro, int length, int ld)
20752116 {
20762117 int len = 0, i, j, x, y, lines, k;
20772118 char line1[100], line2[100], line3[100],
20782119 tmp[10] = " ";
2079 unsigned char *dna1, c1, c2, c3, *st;
2080
2081 dna1 = ckalloc((size_t)ld);
2082 for (st = dna, i = 0; i < ld; i++, st++) dna1[i] = NCBIstdaa[*st];
2083 line1[0] = line2[0] = line3[0] = '\0'; x= a[0]; y = a[1]-1;
2120 unsigned char *dna_p1, c1, c2, c3, *st;
2121
2122 dna_p1 = ckalloc((size_t)ld); /* dna_p1 is the ascii (sq0) translated-DNA residue */
2123
2124 /* generate the ascii aa characters */
2125 for (st = dna_p, i = 0; i < ld; i++, st++) {
2126 dna_p1[i] = NCBIstdaa[*st];
2127 }
2128 line1[0] = line2[0] = line3[0] = '\0';
2129
2130 x= a[0]; /* start in protein */
2131 y = a[1]-1; /* start in DNA */
20842132
20852133 for (len = 0, j = 2, lines = 0; j < length; j++) {
2086 i = a[j];
2134 i = a[j]; /* i is align_enc value 0-5 */
20872135 /*printf("%d %d %d\n", i, len, b->j);*/
2136
20882137 if (i > 0 && i < 5) tmp[i-2] = NCBIstdaa[pro[x++]];
2089 if (i == 5) {
2090 i = 3; tmp[0] = tmp[1] = tmp[2] = '-';
2138 if (i == 5) { /* special case */
2139 i = 3; /* increment DNA value by 3, prot by 0 */
2140 tmp[0] = tmp[1] = tmp[2] = '-';
20912141 if (a[j+1] == 2) tmp[2] = ' ';
20922142 }
20932143 if (i > 0) {
2094 strncpy(&line1[len], (const char *)&dna1[y], i); y+=i;
2095 } else {line1[len] = '-'; i = 1; tmp[0] = NCBIstdaa[pro[x++]];}
2144 strncpy(&line1[len], (const char *)&dna_p1[y], i);
2145 y+=i;
2146 }
2147 else {
2148 line1[len] = '-';
2149 i = 1;
2150 tmp[0] = NCBIstdaa[pro[x++]];
2151 }
2152
20962153 strncpy(&line2[len], tmp, i);
2154
20972155 for (k = 0; k < i; k++) {
20982156 if (tmp[k] != ' ' && tmp[k] != '-') {
20992157 if (k == 2) tmp[k] = '\\';
21282186 printf("\n %s\n %s\n %s\n", line1, line3, line2);
21292187 }
21302188
2131
21322189 /* alignment store the operation that align the protein and dna sequence.
21332190 The code of the number in the array is as follows:
21342191 0: delete of an amino acid.
21372194 4: the other type of frame shift
21382195 5: delete of a codon
21392196
2140
21412197 Also the first two element of the array stores the starting point
21422198 in the protein and dna sequences in the local alignment.
21432199
23782434
23792435 /* now we need alignment storage - get it */
23802436 if ((cur_ares->res = (int *)calloc((size_t)max_res,sizeof(int)))==NULL) {
2381 fprintf(stderr," *** cannot allocate alignment results array %d\n",max_res);
2437 fprintf(stderr,"*** error [%s:%d] - cannot allocate alignment results array %d\n",
2438 __FILE__, __LINE__, max_res);
23822439 exit(1);
23832440 }
23842441
25992656 *have_ares = 0x3; /* set 0x2 bit to indicate local copy */
26002657
26012658 if ((a_res = (struct a_res_str *)calloc(1, sizeof(struct a_res_str)))==NULL) {
2602 fprintf(stderr," [do_walign] Cannot allocate a_res");
2659 fprintf(stderr,"*** error [%s:%d] - cannot allocate a_res [%lu]\n",
2660 __FILE__, __LINE__, sizeof(struct a_res_str));
26032661 return NULL;
26042662 }
26052663
26472705 #endif
26482706 /*
26492707 if (a_res->res[0] != 3) {
2650 fprintf(stderr, "*** alignment does not start with match: %d\n",a_res->res[0]);
2708 fprintf(stderr, "*** error [%s:%d] - alignment does not start with match: %d\n",
2709 __FILE__, __LINE__, a_res->res[0]);
26512710 }
26522711 */
26532712
26542713 #ifdef DEBUG
26552714 if (adler32(1L,aa1,n1) != adler32_crc) {
2656 fprintf(stderr,"[dropfx.c/do_walign] adler32_crc mismatch n1: %d\n",n1);
2715 fprintf(stderr,"*** error [%s:%d] - adler32_crc mismatch n1: %d\n",
2716 __FILE__, __LINE__, n1);
26572717 }
26582718 #endif
26592719
27302790 }
27312791
27322792 /*
2733 Alignment: store the operation that align the protein and dna sequence.
2793 Alignment: store the operation that aligns the protein and dna sequences.
27342794 The code of the number in the array is as follows:
27352795 0: delete of an amino acid.
27362796 2: frame shift, 2 nucleotides match with an amino acid
29773037 else if (calc_func_mode == CALC_ID || calc_func_mode == CALC_ID_DOM) {
29783038 have_ann = (annotp_p && annotp_p->n_annot > 0);
29793039 spa_p = &spa_c;
2980 sp0_p = &sp0_c;
2981 sp1_p = &sp1_c;
2982
2983 sp0a_p = &sp0a_c;
2984 sp1a_p = &sp1a_c;
3040 sp0_p = &sp1_c;
3041 sp1_p = &sp0_c;
3042
3043 sp0a_p = &sp1a_c;
3044 sp1a_p = &sp0a_c;
29853045 annot_fmt = 3;
29863046
29873047 /* does not require aa0a/aa1a, only for variants */
29883048 }
29893049 else if (calc_func_mode == CALC_CODE) {
29903050 spa_p = &spa_c;
2991 sp0_p = &sp0_c;
2992 sp1_p = &sp1_c;
2993
2994 sp0a_p = &sp0a_c;
2995 sp1a_p = &sp1a_c;
3051 sp0_p = &sp1_c;
3052 sp1_p = &sp0_c;
3053
3054 sp0a_p = &sp1a_c;
3055 sp1a_p = &sp0a_c;
29963056
29973057 show_code = (display_code & (SHOW_CODE_MASK+SHOW_CODE_EXT)); /* see defs.h; SHOW_CODE_ALIGN=2,_CIGAR=3,_CIGAR_EXT=4 */
29983058 annot_fmt = 2;
30173077 rpmax = &a_res->res[a_res->nres];
30183078
30193079 lenc = not_c = aln->nident = aln->nmismatch = aln->nsim = aln->npos = ngap_p = ngap_d = nfs= 0;
3080
30203081 i0 = a_res->min1;
30213082 i1 = a_res->min0;
30223083
31413202 *spa_p = M_DEL;
31423203
31433204 if (calc_func_mode == CALC_CODE) {
3205 #ifndef TFAST
31443206 update_code(align_code_dyn, update_data_p, 2, *spa_p,*sp0_p,*sp1_p);
3207 #else
3208 update_code(align_code_dyn, update_data_p, 2, *spa_p,*sp1_p,*sp0_p);
3209 #endif
3210
31453211 }
31463212
31473213 if (calc_func_mode == CALC_CONS) {
32183284 *spa_p = align_type(itmp, *sp0_p, *sp1_p, 0, aln, ppst->pam_x_id_sim);
32193285
32203286 if (calc_func_mode == CALC_CODE) {
3287 #ifndef TFAST
32213288 update_code(align_code_dyn, update_data_p, 3, *spa_p,*sp0_p,*sp1_p);
3289 #else
3290 update_code(align_code_dyn, update_data_p, 3, *spa_p,*sp1_p,*sp0_p);
3291 #endif
32223292 }
32233293
32243294 d1_alen++;
33203390 if (cumm_seq_score) *i_spa++ = itmp;
33213391
33223392 if (calc_func_mode == CALC_CODE) {
3393 #ifndef TFAST
33233394 update_code(align_code_dyn, update_data_p, 3, *spa_p, *sp0_p, *sp1_p);
3395 #else
3396 update_code(align_code_dyn, update_data_p, 3, *spa_p, *sp1_p, *sp0_p);
3397 #endif
33243398
33253399 if (have_push_features) {
33263400 add_annot_code(have_ann, *sp0_p, *sp1_p, *sp1a_p,
33663440 *spa_p = M_DEL;
33673441
33683442 if (calc_func_mode == CALC_CODE) {
3443 #ifndef TFAST
33693444 update_code(align_code_dyn, update_data_p, 4, *spa_p, *sp0_p, *sp1_p);
3445 #else
3446 update_code(align_code_dyn, update_data_p, 4, *spa_p, *sp1_p, *sp0_p);
3447 #endif
33703448 }
33713449
33723450 if (calc_func_mode == CALC_CONS) {sp0_p++; sp1_p++; spa_p++;}
34353513 if (*spa_p == M_IDENT) {d1_ident++;}
34363514
34373515 if (calc_func_mode == CALC_CODE) {
3516 #ifndef TFAST
34383517 update_code(align_code_dyn, update_data_p, 3, *spa_p,*sp0_p,*sp1_p);
3518 #else
3519 update_code(align_code_dyn, update_data_p, 3, *spa_p,*sp1_p,*sp0_p);
3520 #endif
34393521 }
34403522
34413523 if (cumm_seq_score) *i_spa++ = itmp;
34843566
34853567 if (calc_func_mode == CALC_CODE) {
34863568 *spa_p = 5;
3569 #ifndef TFAST
34873570 update_code(align_code_dyn, update_data_p, 5, *spa_p,*sp0_p,*sp1_p);
3571 #else
3572 update_code(align_code_dyn, update_data_p, 5, *spa_p,*sp1_p,*sp0_p);
3573 #endif
34883574 }
34893575
34903576 if (calc_func_mode == CALC_CONS) {sp0_p++; sp1_p++; spa_p++;}
36143700 */
36153701
36163702 static struct update_code_str *
3617 init_update_data(show_code) {
3703 init_update_data(int show_code) {
36183704
36193705 struct update_code_str *update_data_p;
36203706
37163802
37173803 /* only aligned identities update counts */
37183804 if (op==3 && sim_code == M_IDENT) {
3719 up_dp->p_op_cnt++;
3720 return;
3805 if ((sp0 == '*' && (sp1 == '*' || toupper(sp1) == 'U'))
3806 || (sp1 == '*' && (sp0 == '*' || toupper(sp0) == 'U'))) {
3807 if (up_dp->p_op_cnt > 0) {
3808 sprintf(tmp_str,"%d**",up_dp->p_op_cnt);
3809 up_dp->p_op_cnt = 0;
3810 return;
3811 }
3812 }
3813 else {
3814 up_dp->p_op_cnt++;
3815 return;
3816 }
37213817 }
37223818 else {
37233819 if (up_dp->p_op_cnt > 0) {
37853881 }
37863882 }
37873883 else { /* have a termination codon, output for !SHOW_CODE_CIGAR */
3788 if (!up_dp->cigar_order) {
3789 if (sp0 == '*' || sp1 == '*') { op = 6;}
3790 }
3791 else if (up_dp->show_ext && (sp0 != sp1)) { op = 1;}
3884 if (!up_dp->cigar_order) { /* -m9c : -m9C and -m8CC are cigar_order */
3885 if (sp0 == '*' || sp1 == '*') {
3886 /* op = 6 gets '*' from op_map="-x/=\\+*" when the string is closed */
3887 op = 6;
3888 }
3889 }
3890 else if (sp0=='*' && sp1=='*') {
3891 op=6;
3892 }
3893 else if (up_dp->show_ext && (sp0 != sp1)) {
3894 op = 1;
3895 }
37923896 }
37933897
37943898 if (up_dp->p_op_cnt == 0) {
218218 char le[MAXLC+1][64];
219219
220220 if (naa > MAXLC) {
221 fprintf(stderr,"*** dropfz2.c compilation problem naa(%d) > MAXLX(%d) ***\n",
222 naa, MAXLC);
221 fprintf(stderr,"*** error [%s:%d] - compilation problem naa(%d) > MAXLC(%d) ***\n",
222 __FILE__, __LINE__, naa, MAXLC);
223223 }
224224
225225 if ((*weighti=(struct wgt **)calloc((size_t)(naa+1),sizeof(struct wgt *)))
226226 ==NULL) {
227 fprintf(stderr," cannot allocate weights array: %d\n",naa);
227 fprintf(stderr,"*** error [%s:%d] - cannot allocate weights array: %d\n",
228 __FILE__, __LINE__, naa);
228229 exit(1);
229230 }
230231
233234 for (aa=0; aa <= naa; aa++) {
234235 if ((weight[aa]=(struct wgt *)calloc((size_t)256,sizeof(struct wgt)))
235236 ==NULL) {
236 fprintf(stderr," cannot allocate weight[]: %d/%d\n",aa,naa);
237 fprintf(stderr,"*** error [%s:%d] - cannot allocate weight[]: %d/%d\n",
238 __FILE__, __LINE__, aa,naa);
237239 exit(1);
238240 }
239241 }
242244 if (weightci !=NULL) {
243245 if ((*weightci=(struct wgtc **)calloc((size_t)(naa+1),
244246 sizeof(struct wgtc *)))==NULL) {
245 fprintf(stderr," cannot allocate weight_c array: %d\n",naa);
247 fprintf(stderr,"*** error [%s:%d] - cannot allocate weight_c array: %d\n",
248 __FILE__, __LINE__, naa);
246249 exit(1);
247250 }
248251 weightc = *weightci;
250253 for (aa=0; aa <= naa; aa++) {
251254 if ((weightc[aa]=(struct wgtc *)calloc((size_t)256,sizeof(struct wgtc)))
252255 ==NULL) {
253 fprintf(stderr," cannot allocate weightc[]: %d/%d\n",aa,naa);
256 fprintf(stderr,"*** error [%s:%d] - cannot allocate weightc[]: %d/%d\n",
257 __FILE__, __LINE__, aa,naa);
254258 exit(1);
255259 }
256260 }
411415 #endif
412416
413417 if (nt[NT_N] != 'N') {
414 fprintf(stderr," nt[NT_N] (%d) != 'X' (%c) - recompile\n",NT_N,nt[NT_N]);
418 fprintf(stderr,"*** error [%s:%d] - nt[NT_N] (%d) != 'N' (%c) - recompile\n",
419 __FILE__, __LINE__, NT_N,nt[NT_N]);
415420 exit(1);
416421 }
417422
460465 if ((aa0x =(unsigned char *)calloc((size_t)maxn0,
461466 sizeof(unsigned char)))
462467 == NULL) {
463 fprintf (stderr, "cannot allocate aa0x array %d\n", maxn0);
468 fprintf (stderr, "*** error [%s:%d] - cannot allocate aa0x array %d\n",
469 __FILE__, __LINE__, maxn0);
464470 exit (1);
465471 }
466472 aa0x++;
470476 if ((aa0v =(unsigned char *)calloc((size_t)maxn0,
471477 sizeof(unsigned char)))
472478 == NULL) {
473 fprintf (stderr, "cannot allocate aa0v array %d\n", maxn0);
479 fprintf (stderr, "*** error [%s:%d] - cannot allocate aa0v array %d\n",
480 __FILE__, __LINE__, maxn0);
474481 exit (1);
475482 }
476483 aa0v++;
522529 if (hsq[i0] < NMAP && hsq[i0] > mhv)
523530 mhv = ppst->hsq[i0];
524531
525 if (mhv <= 0)
526 {
527 fprintf (stderr, " maximum hsq <=0 %d\n", mhv);
532 if (mhv <= 0) {
533 fprintf (stderr, "*** error [%s:%d] - maximum hsq <=0 %d\n",
534 __FILE__, __LINE__, mhv);
528535 exit (1);
529536 }
530537
539546 f_str->hmask = (hmax >> f_str->kshft) - 1;
540547
541548 if ((f_str->harr = (int *) calloc (hmax, sizeof (int))) == NULL) {
542 fprintf (stderr, " cannot allocate hash array\n");
549 fprintf (stderr, "*** error [%s:%d] - cannot allocate hash array [%d]\n",
550 __FILE__, __LINE__, hmax);
543551 exit (1);
544552 }
545553 if ((f_str->pamh1 = (int *) calloc (ppst->nsq+1, sizeof (int))) == NULL) {
546 fprintf (stderr, " cannot allocate pamh1 array\n");
554 fprintf (stderr, "*** error [%s:%d] - cannot allocate pamh1 array [%d]\n",
555 __FILE__, __LINE__, ppst->nsq+1);
547556 exit (1);
548557 }
549 if ((f_str->pamh2 = (int *) calloc (hmax, sizeof (int))) == NULL) {
550 fprintf (stderr, " cannot allocate pamh2 array\n");
558 if ((f_str->pamh2 = (int *)calloc (hmax, sizeof (int))) == NULL) {
559 fprintf (stderr, "*** error [%s:%d] - cannot allocate pamh2 array [%d]\n",
560 __FILE__, __LINE__, hmax);
551561 exit (1);
552562 }
553563 if ((f_str->link = (int *) calloc (n0, sizeof (int))) == NULL) {
554 fprintf (stderr, " cannot allocate hash link array");
564 fprintf (stderr, "*** error [%s:%d] - cannot allocate hash link array [%d]\n",
565 __FILE__, __LINE__, n0);
555566 exit (1);
556567 }
557568
614625 #ifndef ALLOCN0
615626 if ((f_str->diag = (struct dstruct *) calloc ((size_t)MAXDIAG,
616627 sizeof (struct dstruct)))==NULL) {
617 fprintf (stderr," cannot allocate diagonal arrays: %lu\n",
618 MAXDIAG *sizeof (struct dstruct));
628 fprintf (stderr,"*** error [%s:%d] - cannot allocate diagonal arrays: %lu\n",
629 __FILE__, __LINE__, MAXDIAG *sizeof (struct dstruct));
619630 exit (1);
620631 };
621632 #else
622633 if ((f_str->diag = (struct dstruct *) calloc ((size_t)n0,
623634 sizeof (struct dstruct)))==NULL) {
624 fprintf (stderr," cannot allocate diagonal arrays: %ld\n",
625 (long)n0*sizeof (struct dstruct));
635 fprintf (stderr,"*** error [%s:%d] - cannot allocate diagonal arrays: %ld\n",
636 __FILE__, __LINE__, (long)n0*sizeof (struct dstruct));
626637 exit (1);
627638 };
628639 #endif
636647 if ((f_str->aa1x =(unsigned char *)calloc((size_t)ppst->maxlen+4,
637648 sizeof(unsigned char)))
638649 == NULL) {
639 fprintf (stderr, "cannot allocate aa1x array %d\n", ppst->maxlen+4);
650 fprintf (stderr, "*** error [%s:%d] - cannot allocate aa1x array %d\n",
651 __FILE__, __LINE__, ppst->maxlen+4);
640652 exit (1);
641653 }
642654 f_str->aa1x++;
643655
644656 if ((f_str->aa1v =(unsigned char *)calloc((size_t)ppst->maxlen+4,
645657 sizeof(unsigned char))) == NULL) {
646 fprintf (stderr, "cannot allocate aa1v array %d\n", ppst->maxlen+4);
658 fprintf (stderr, "*** error [%s:%d] - cannot allocate aa1v array %d\n",
659 __FILE__, __LINE__, ppst->maxlen+4);
647660 exit (1);
648661 }
649662 f_str->aa1v++;
651664 #endif
652665
653666 if ((waa= (int *)malloc (sizeof(int)*(nsq+1)*n0)) == NULL) {
654 fprintf(stderr,"cannot allocate waa struct %3d\n",nsq*n0);
667 fprintf(stderr,"*** error [%s:%d] - cannot allocate waa struct %3d\n",
668 __FILE__, __LINE__, nsq*n0);
655669 exit(1);
656670 }
657671
670684 maxn0 = max(4*n0,MIN_RES);
671685 #endif
672686 if ((res = (int *)calloc((size_t)maxn0,sizeof(int)))==NULL) {
673 fprintf(stderr,"cannot allocate alignment results array %d\n",maxn0);
687 fprintf(stderr,"*** error [%s:%d] - cannot allocate alignment results array %d\n",
688 __FILE__, __LINE__, maxn0);
674689 exit(1);
675690 }
676691 f_str->res = res;
848863 }
849864
850865 if (n0+n1+1 >= MAXDIAG) {
851 fprintf(stderr,"n0,n1 too large: %d, %d\n",n0,n1);
866 fprintf(stderr,"*** error [%s:%d] - n0,n1 too large > %d: %d, %d\n",
867 __FILE__, __LINE__, MAXDIAG, n0, n1);
852868 rst->score[0] = rst->score[1] = rst->score[2] = -1;
853869 return;
854870 }
10961112 aa1x = f_str->aa1x;
10971113 #ifdef DEBUG
10981114 if (frame > 1) {
1099 fprintf(stderr, "*** fz_walign - frame: %d - out of range [0,1]\n",frame);
1115 fprintf(stderr, "*** error [%s:%d] - fz_walign - frame: %d - out of range [0,1]\n",
1116 __FILE__, __LINE__, frame);
11001117 }
11011118 #endif
11021119
16321649 aq = ap->next; free(ap); ap = aq;
16331650 }
16341651 if (i >= max_res)
1635 fprintf(stderr,"***alignment truncated: %d/%d***\n", max_res,i);
1652 fprintf(stderr,"*** error [%s:%d] - alignment truncated: %d >= %d***\n",
1653 __FILE__, __LINE__, i, max_res);
16361654
16371655 /* up = &up[-3]; down = &down[-3]; tp = &tp[-3]; */
16381656 free(&f_str->up[-3]); free(&f_str->tp[-3]); free(&f_str->down[-3]);
24782496
24792497 /* now we need alignment storage - get it */
24802498 if ((cur_ares->res = (int *)calloc((size_t)max_res,sizeof(int)))==NULL) {
2481 fprintf(stderr," *** cannot allocate alignment results array %d\n",max_res);
2499 fprintf(stderr,"*** error [%s:%d] - cannot allocate alignment results array %d\n",
2500 __FILE__, __LINE__, max_res);
24822501 exit(1);
24832502 }
24842503
26492668 *have_ares = 0x3; /* set 0x2 bit to indicate local copy */
26502669
26512670 if ((a_res = (struct a_res_str *)calloc(1, sizeof(struct a_res_str)))==NULL) {
2652 fprintf(stderr," [do_walign] Cannot allocate a_res");
2671 fprintf(stderr,"*** error [%s:%d] - cannot allocate a_res [%lu]\n",
2672 __FILE__, __LINE__, sizeof(struct a_res_str));
26532673 return NULL;
26542674 }
26552675
29402960 update_data_p = init_update_data(show_code);
29412961 }
29422962 else {
2943 fprintf(stderr,"*** error [%s:%d] --- cal_cons_u() invalid calc_func_mode: %d\n",
2963 fprintf(stderr,"*** error [%s:%d] --- calc_cons_u() invalid calc_func_mode: %d\n",
29442964 __FILE__, __LINE__, calc_func_mode);
29452965 exit(1);
29462966 }
29722992 else if (calc_func_mode == CALC_ID || calc_func_mode == CALC_ID_DOM) {
29732993 have_ann = (annotp_p && annotp_p->n_annot > 0);
29742994 spa_p = &spa_c;
2975 sp0_p = &sp0_c;
2976 sp1_p = &sp1_c;
2977
2978 sp0a_p = &sp0a_c;
2979 sp1a_p = &sp1a_c;
2995 sp0_p = &sp1_c;
2996 sp1_p = &sp0_c;
2997
2998 sp0a_p = &sp1a_c;
2999 sp1a_p = &sp0a_c;
29803000 annot_fmt = 3;
29813001
29823002 /* does not require aa0a/aa1a, only for variants */
29833003 }
29843004 else if (calc_func_mode == CALC_CODE) {
29853005 spa_p = &spa_c;
2986 sp0_p = &sp0_c;
2987 sp1_p = &sp1_c;
2988
2989 sp0a_p = &sp0a_c;
2990 sp1a_p = &sp1a_c;
3006 sp0_p = &sp1_c;
3007 sp1_p = &sp0_c;
3008
3009 sp0a_p = &sp1a_c;
3010 sp1a_p = &sp0a_c;
29913011
29923012 show_code = (display_code & (SHOW_CODE_MASK+SHOW_CODE_EXT)); /* see defs.h; SHOW_CODE_ALIGN=2,_CIGAR=3,_CIGAR_EXT=4 */
29933013 annot_fmt = 2;
30013021 update_data_p = init_update_data(show_code);
30023022 }
30033023 else {
3004 fprintf(stderr,"*** error [%s:%d] --- cal_cons_u() invalid calc_func_mode: %d\n",
3024 fprintf(stderr,"*** error [%s:%d] --- calc_cons_u() invalid calc_func_mode: %d\n",
30053025 __FILE__, __LINE__, calc_func_mode);
30063026 exit(1);
30073027 }
31173137 if (cumm_seq_score) *i_spa++ = itmp;
31183138
31193139 if (calc_func_mode == CALC_CODE) {
3140 #ifndef TFAST
31203141 update_code(align_code_dyn, update_data_p, 3, *spa_p, *sp0_p, *sp1_p);
3142 #else
3143 update_code(align_code_dyn, update_data_p, 3, *spa_p, *sp1_p, *sp0_p);
3144 #endif
31213145
31223146 if (have_ann && have_push_features) {
31233147 add_annot_code(have_ann, *sp0_p, *sp1_p, *sp1a_p,
31593183 *spa_p = M_DEL;
31603184
31613185 if (calc_func_mode == CALC_CODE) {
3186 #ifndef TFAST
31623187 update_code(align_code_dyn, update_data_p, 2, *spa_p,*sp0_p,*sp1_p);
3188 #else
3189 update_code(align_code_dyn, update_data_p, 2, *spa_p,*sp1_p,*sp0_p);
3190 #endif
31633191 }
31643192
31653193 if (cumm_seq_score) *i_spa++ = ppst->gshift;
32323260 *spa_p = align_type(itmp, *sp0_p, *sp1_p, 0, aln, ppst->pam_x_id_sim);
32333261
32343262 if (calc_func_mode == CALC_CODE) {
3263 #ifndef TFAST
32353264 update_code(align_code_dyn, update_data_p, 3, *spa_p,*sp0_p,*sp1_p);
3265 #else
3266 update_code(align_code_dyn, update_data_p, 3, *spa_p,*sp1_p,*sp0_p);
3267 #endif
32363268 }
32373269
32383270 d1_alen++;
32793311 *spa_p = M_DEL;
32803312
32813313 if (calc_func_mode == CALC_CODE) {
3314 #ifndef TFAST
32823315 update_code(align_code_dyn, update_data_p, 4, *spa_p,*sp0_p,*sp1_p);
3316 #else
3317 update_code(align_code_dyn, update_data_p, 4, *spa_p,*sp1_p,*sp0_p);
3318 #endif
32833319 }
32843320
32853321 if (calc_func_mode == CALC_CONS) {sp0_p++; sp1_p++; spa_p++;}
33443380 *spa_p = align_type(itmp, *sp0_p, *sp1_p, 0, aln, ppst->pam_x_id_sim);
33453381
33463382 if (calc_func_mode == CALC_CODE) {
3383 #ifndef TFAST
33473384 update_code(align_code_dyn, update_data_p, 3, *spa_p,*sp0_p,*sp1_p);
3385 #else
3386 update_code(align_code_dyn, update_data_p, 3, *spa_p,*sp1_p,*sp0_p);
3387 #endif
33483388 }
33493389
33503390 d1_alen++;
33923432
33933433 if (calc_func_mode == CALC_CODE) {
33943434 *spa_p = 5;
3435 #ifndef TFAST
33953436 update_code(align_code_dyn, update_data_p, 5, *spa_p,*sp0_p,*sp1_p);
3437 #else
3438 update_code(align_code_dyn, update_data_p, 5, *spa_p,*sp1_p,*sp0_p);
3439 #endif
33963440 }
33973441
33983442 lenc++;
34083452
34093453 if (calc_func_mode == CALC_CODE) {
34103454 *spa_p = 5; /* indel code */
3455 #ifndef TFAST
34113456 update_code(align_code_dyn, update_data_p, 0, *spa_p,*sp0_p,*sp1_p);
3457 #else
3458 update_code(align_code_dyn, update_data_p, 0, *spa_p,*sp1_p,*sp0_p);
3459 #endif
34123460 }
34133461
34143462 if (cumm_seq_score) {
35943642 */
35953643
35963644 static struct update_code_str *
3597 init_update_data(show_code) {
3645 init_update_data(int show_code) {
35983646
35993647 struct update_code_str *update_data_p;
36003648
36403688
36413689 if (!up_dp) return;
36423690
3643 if (up_dp->btop_enc) {
3644 sprintf(tmp_cnt,"%d",up_dp->p_op_cnt);
3645 up_dp->p_op_cnt = 0;
3646 }
3647 else {
3648 sprintf_code(tmp_cnt,up_dp, up_dp->p_op_idx, up_dp->p_op_cnt);
3649 }
3650 dyn_strcat(align_code_dyn, tmp_cnt);
3691 if (up_dp->p_op_cnt) {
3692 if (up_dp->btop_enc) {
3693 sprintf(tmp_cnt,"%d",up_dp->p_op_cnt);
3694 up_dp->p_op_cnt = 0;
3695 }
3696 else {
3697 sprintf_code(tmp_cnt,up_dp, up_dp->p_op_idx, up_dp->p_op_cnt);
3698 }
3699 dyn_strcat(align_code_dyn, tmp_cnt);
3700 }
36513701
36523702 free(up_dp);
36533703 }
37003750
37013751 /* only aligned identities update counts */
37023752 if (op==3 && sim_code == M_IDENT) {
3703 up_dp->p_op_cnt++;
3704 return;
3753 if ((sp0 == '*' && (sp1 == '*' || toupper(sp1) == 'U'))
3754 || (sp1 == '*' && (sp0 == '*' || toupper(sp0) == 'U'))) {
3755 if (up_dp->p_op_cnt > 0) {
3756 sprintf(tmp_str,"%d**",up_dp->p_op_cnt);
3757 up_dp->p_op_cnt = 0;
3758 return;
3759 }
3760 }
3761 else {
3762 up_dp->p_op_cnt++;
3763 return;
3764 }
37053765 }
37063766 else {
37073767 if (up_dp->p_op_cnt > 0) {
208208 if (hsq[i0] < NMAP && hsq[i0] > mhv) mhv = hsq[i0];
209209
210210 if (mhv <= 0) {
211 fprintf (stderr, " maximum hsq <=0 %d\n", mhv);
211 fprintf (stderr, "*** error [%s:%d] maximum hsq <=0 %d\n", __FILE__, __LINE__, mhv);
212212 exit (1);
213213 }
214214
222222 f_str->hmask = (hmax >> f_str->kshft) - 1;
223223
224224 if ((f_str->harr = (int *) calloc (hmax, sizeof (int))) == NULL) {
225 fprintf (stderr, " *** cannot allocate hash array: hmax: %d hmask: %d\n",
226 hmax, f_str->hmask);
225 fprintf (stderr, "*** error [%s:%d] - cannot allocate hash array: hmax: %d hmask: %d\n",
226 __FILE__,__LINE__,hmax, f_str->hmask);
227227 exit (1);
228228 }
229229
230230 if ((f_str->pamh1 = (int *) calloc (nsq+1, sizeof (int))) == NULL) {
231 fprintf (stderr, " *** cannot allocate pamh1 array nsq=%d\n",nsq);
231 fprintf (stderr, "*** error [%s:%d] - cannot allocate pamh1 array nsq=%d\n",
232 __FILE__, __LINE__, nsq);
232233 exit (1);
233234 }
234235
235236 if ((f_str->pamh2 = (int *) calloc (hmax, sizeof (int))) == NULL) {
236 fprintf (stderr, " *** cannot allocate pamh2 array hmax=%d\n",hmax);
237 fprintf (stderr, "*** error [%s:%d] - cannot allocate pamh2 array hmax=%d\n",
238 __FILE__, __LINE__,hmax);
237239 exit (1);
238240 }
239241
240242 if ((f_str->link = (int *) calloc (n0, sizeof (int))) == NULL) {
241 fprintf (stderr, " *** cannot allocate hash link array n0=%d",n0);
243 fprintf (stderr, "*** error [%s:%d] - cannot allocate hash link array n0=%d",
244 __FILE__, __LINE__, n0);
242245 exit (1);
243246 }
244247
299302 f_str->ndo = 0;
300303 if ((f_str->diag = (struct dstruct *) calloc ((size_t)MAXDIAG,
301304 sizeof (struct dstruct)))==NULL) {
302 fprintf (stderr," *** cannot allocate diagonal arrays: %lu\n",
303 MAXDIAG *sizeof (struct dstruct));
305 fprintf (stderr,"*** error [%s:%d] - cannot allocate diagonal arrays: %lu\n",
306 __FILE__, __LINE__, MAXDIAG *sizeof (struct dstruct));
304307 exit (1);
305308 };
306309
309312 if ((f_str->aa1x =(unsigned char *)calloc((size_t)ppst->maxlen+2,
310313 sizeof(unsigned char)))
311314 == NULL) {
312 fprintf (stderr, " *** cannot allocate aa1x array %d\n", ppst->maxlen+2);
315 fprintf (stderr, "*** error [%s:%d] - cannot allocate aa1x array %d\n",
316 __FILE__, __LINE__, ppst->maxlen+2);
313317 exit (1);
314318 }
315319 f_str->aa1x++;
324328 maxn0 = n0 + 4;
325329 if ((ss = (struct swstr *) calloc (maxn0, sizeof (struct swstr)))
326330 == NULL) {
327 fprintf (stderr, " *** cannot allocate ss array %3d\n", n0);
331 fprintf (stderr, "*** error [%s:%d] - cannot allocate ss array %3d\n",
332 __FILE__, __LINE__, n0);
328333 exit (1);
329334 }
330335 ss++;
335340
336341 /* initialize variable (-S) pam matrix */
337342 if ((f_str->waa_s= (int *)calloc((nsq+1)*(n0+1),sizeof(int))) == NULL) {
338 fprintf(stderr,"*** error [%s:%d] cannot allocate waa_s array %3d\n",
343 fprintf(stderr,"*** error [%s:%d] - cannot allocate waa_s array %3d\n",
339344 __FILE__, __LINE__, nsq*n0);
340345 exit(1);
341346 }
342347
343348 /* initialize pam2p[1] pointers */
344349 if ((f_str->pam2p[1]= (int **)calloc((n0+1),sizeof(int *))) == NULL) {
345 fprintf(stderr,"*** error [%s:%d] cannot allocate pam2p[1] array %3d\n",
350 fprintf(stderr,"*** error [%s:%d] - cannot allocate pam2p[1] array %3d\n",
346351 __FILE__, __LINE__, n0);
347352 exit(1);
348353 }
349354
350355 pam2p = f_str->pam2p[1];
351356 if ((pam2p[0]=(int *)calloc((nsq+1)*(n0+1),sizeof(int))) == NULL) {
352 fprintf(stderr,"*** error [%s:%d] cannot allocate pam2p[1][] array %3d\n",
357 fprintf(stderr,"*** error [%s:%d] - cannot allocate pam2p[1][] array %3d\n",
353358 __FILE__, __LINE__, nsq*n0);
354359 exit(1);
355360 }
360365
361366 /* initialize universal (alignment) matrix */
362367 if ((f_str->waa_a= (int *)calloc((nsq+1)*(n0+1),sizeof(int))) == NULL) {
363 fprintf(stderr,"*** error [%s:%d] cannot allocate waa_a struct %3d\n",
368 fprintf(stderr,"*** error [%s:%d] - cannot allocate waa_a struct %3d\n",
364369 __FILE__, __LINE__, nsq*n0);
365370 exit(1);
366371 }
367372
368373 /* initialize pam2p[0] pointers */
369374 if ((f_str->pam2p[0]= (int **)calloc((n0+1),sizeof(int *))) == NULL) {
370 fprintf(stderr,"*** error [%s:%d] cannot allocate pam2p[1] array %3d\n",
375 fprintf(stderr,"*** error [%s:%d] - cannot allocate pam2p[1] array %3d\n",
371376 __FILE__, __LINE__, n0);
372377 exit(1);
373378 }
374379
375380 pam2p = f_str->pam2p[0];
376381 if ((pam2p[0]=(int *)calloc((nsq+1)*(n0+1),sizeof(int))) == NULL) {
377 fprintf(stderr,"*** error [%s:%d] cannot allocate pam2p[1][] array %3d\n",
382 fprintf(stderr,"*** error [%s:%d] - cannot allocate pam2p[1][] array %3d\n",
378383 __FILE__, __LINE__, nsq*n0);
379384 exit(1);
380385 }
527532 *f_arg = NULL;
528533 }
529534 else {
530 fprintf(stderr, "*** error [%s:%d] close_work() with NULL f_str ***\n",
535 fprintf(stderr, "*** error [%s:%d] - close_work() with NULL f_str ***\n",
531536 __FILE__, __LINE__);
532537 }
533538 }
615620 }
616621
617622 if (n0+n1+1 >= MAXDIAG) {
618 fprintf(stderr,"*** error [%s:%d] n0,n1 too large: %d + %d (%d) > %d \n",
623 fprintf(stderr,"*** error [%s:%d] - n0,n1 too large: %d + %d (%d) > %d \n",
619624 __FILE__, __LINE__, n0,n1,n0+n1+1,MAXDIAG);
620625 rst->score[0] = rst->score[1] = rst->score[2] = -1;
621626 return;
11361141
11371142 #ifdef DEBUG
11381143 if (window > f_str->bss_size) {
1139 fprintf(stderr,"*** error [%s:%d] dropnfa.c:dmatch window [%d] out of range [%d]\n",
1144 fprintf(stderr,"*** error [%s:%d] - dmatch window [%d] out of range [%d]\n",
11401145 __FILE__, __LINE__, window, f_str->bss_size);
11411146 window = f_str->bss_size - 4;
11421147 }
12041209
12051210 band = up-low+1;
12061211 if (band < 1) {
1207 fprintf(stderr,"*** error [%s:%d] low > up is unacceptable!: M: %d N: %d l/u: %d/%d\n",
1212 fprintf(stderr,"*** error [%s:%d] - low > up is unacceptable!: M: %d N: %d l/u: %d/%d\n",
12081213 __FILE__, __LINE__, M, N, low, up);
12091214 return 0;
12101215 }
13461351
13471352 /* now we need alignment storage - get it */
13481353 if ((cur_ares->res = (int *)calloc((size_t)max_res,sizeof(int)))==NULL) {
1349 fprintf(stderr,"*** error [%s:%d] cannot allocate alignment results array %d\n",
1354 fprintf(stderr,"*** error [%s:%d] - cannot allocate alignment results array %d\n",
13501355 __FILE__, __LINE__, max_res);
13511356 exit(1);
13521357 }
13841389 local_aa1 = (unsigned char *)aa1;
13851390 if (l_min > 0 || l_max < n1 - 1) {
13861391 if (l_max - l_min < 0) {
1387 fprintf(stderr,"*** error [%s:%d] l_min: %d > l_max %d\n",__FILE__, __LINE__, l_min,l_max);
1392 fprintf(stderr,"*** error [%s:%d] - l_min: %d > l_max %d\n",__FILE__, __LINE__, l_min,l_max);
13881393 exit(1);
13891394 }
13901395 if ((local_aa1 = (unsigned char *)calloc(l_max - l_min +2,sizeof(unsigned char *)))==NULL) {
1391 fprintf(stderr,"*** error [%s:%d] Cannot allocate local_aa1\n",__FILE__, __LINE__);
1396 fprintf(stderr,"*** error [%s:%d] - cannot allocate local_aa1\n",__FILE__, __LINE__);
13921397 exit(1);
13931398 }
13941399
15641569
15651570 window = min (n1, ppst->param_u.fa.optwid);
15661571 if (window > f_str->bss_size) {
1567 fprintf(stderr,"*** error [%s:%d] walign window [%d] out of range [%d]\n",
1572 fprintf(stderr,"*** error [%s:%d] - walign window [%d] out of range [%d]\n",
15681573 __FILE__, __LINE__, window, f_str->bss_size);
15691574 window = f_str->bss_size - 4;
15701575 }
15791584 a_res->n1 = n1;
15801585
15811586 if (score <=0) {
1582 fprintf(stderr,"*** [%s:%d] n0/n1: %d/%d hoff: %d window: %d\n",
1587 fprintf(stderr,"*** [%s:%d] - score <= 0 - n0/n1: %d/%d hoff: %d window: %d\n",
15831588 __FILE__, __LINE__, n0, n1, hoff, window);
15841589 return 0;
15851590 }
21772182 *have_ares = 0x3; /* set 0x2 bit to indicate local copy */
21782183
21792184 if ((a_res = (struct a_res_str *)calloc(1, sizeof(struct a_res_str)))==NULL) {
2180 fprintf(stderr,"*** error [%s:%d] Cannot allocate a_res", __FILE__, __LINE__);
2185 fprintf(stderr,"*** error [%s:%d] - cannot allocate a_res", __FILE__, __LINE__);
21812186 return NULL;
21822187 }
21832188
22032208
22042209 #ifdef DEBUG
22052210 if (adler32(1L,aa1,n1) != adler32_crc) {
2206 fprintf(stderr,"*** error [%s:%d] adler32_crc mismatch n1: %d\n",__FILE__, __LINE__, n1);
2211 fprintf(stderr,"*** error [%s:%d] - adler32_crc mismatch n1: %d\n",__FILE__, __LINE__, n1);
22072212 }
22082213 #endif
22092214
574574 * be rerun with 16 bits. If it is more, and we have tried at least
575575 * 500 sequences, we switch off the 8-bit mode.
576576 */
577 if (score == OVERFLOW) {
577 if (score == OVERFLOW_SCORE) {
578578 f_str->done_16bit++;
579579 if(f_str->done_8bit>500 && (3*f_str->done_16bit)>(f_str->done_8bit))
580580 f_str->try_8bit = 0;
3737
3838 */
3939 static
40 char *AA1="FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
40 char *AA1="FFLLSSSSYY**CCUWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
4141 /*
4242 Starts = ---M---------------M---------------M----------------------------
4343 Base1 = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
415415 aacmap[ii]= *aasmap++;
416416 }
417417
418
419 for (i=0; i<64; i++) {
420 fprintf(stderr,"'%c',",aacmap[i]);
421 if ((i%16)==15) fputc('\n',stderr);
422 }
423 fputc('\n',stderr);
424
418 if (debug) {
419 for (i=0; i<64; i++) {
420 fprintf(stderr,"'%c',",aacmap[i]);
421 if ((i%16)==15) fputc('\n',stderr);
422 }
423 fputc('\n',stderr);
424 }
425425 }
426426 for (i=0; i<64; i++) {
427427 aamap[i]=aascii[aacmap[i]];
497497 char *iprompt2=" database file name: ";
498498
499499 #ifdef PCOMPLIB
500 char *verstr="36.3.8g Dec, 2017 MPI";
501 #else
502 char *verstr="36.3.8g Dec, 2017";
500 char *verstr="36.3.8h Aug, 2019 MPI";
501 #else
502 char *verstr="36.3.8h Aug, 2019";
503503 #endif
504504
505505 static int mktup=3;
779779 ppst->pam2[0][ix_j][p_i] = ppst->pam2[0][ix_i][p_i];
780780 ppst->pam2[0][p_i][ix_j] = ppst->pam2[0][p_i][ix_i];
781781 }
782 }
782 p_i = pascii['*'];
783 ppst->pam2[0][ix_j][p_i] = ppst->pam2[0][p_i][ix_j] = ppst->pam2[0][p_i][p_i];
784 }
783785 else {
784786 pascii['U'] = pascii['C'];
785787 pascii['u'] = pascii['c'];
12891291 }
12901292 }
12911293
1292 static char my_opts[] = "1BIM:ox:y:N:";
1294 /* Extended options:
1295 -X1 - use the init1 score, rather than initn, for statistics and ordering results
1296 -Xa - only report annotation information in -m 8CB output (for later merge)
1297 -Xb - report z-score, not bit-score
1298 -XB - use blast identities
1299 -XI - ensure that identities are not rounded to 100%
1300 -XM: - specify memory limits for database buffering
1301 -XN:[+S] - treat N:N/X:X as similar as well as identical
1302 -Xo - use initn score, not opt score, for statistics and ordering results
1303 -Xx: - penalties for X:X, X:not-X match
1304 -Xy: - width of band for optimized scores
1305 */
1306
1307 static char my_opts[] = "1aBbIM:ox:y:N:";
12931308
12941309 void
12951310 parse_ext_opts(char *opt_arg, int pgm_id, struct mngmsg *m_msp, struct pstruct *ppst) {
13091324 ppst->param_u.fa.iniflag=1;
13101325 }
13111326 break;
1312 case 'B': m_msp->z_bits = 0; break;
1327
1328 case 'a': m_msp->m8_show_annot = 1; break;
1329
1330 case 'B': m_msp->blast_ident = 1; break;
1331
1332 case 'b': m_msp->z_bits = 0; break;
13131333 case 'I':
13141334 m_msp->tot_ident = 1;
13151335 /*
28652885
28662886 for (i=0; i< ppst->nsq; i++) {
28672887 if (ppst->pam2[0][0][i] > -1000) {
2868 fprintf(stderr," *** ERROR *** pam2[0][0][%d/%c] == %d\n",
2869 i,NCBIstdaa[i],ppst->pam2[0][0][i]);
2888 fprintf(stderr," *** error[%s:%d]*** pam2[0][0][%d/%c] == %d\n",
2889 __FILE__, __LINE__, i,NCBIstdaa[i],ppst->pam2[0][0][i]);
28702890 good_params = 0;
28712891 }
28722892 if (ppst->pam2[0][i][0] > -1000) {
2873 fprintf(stderr," *** ERROR *** pam2[0][%d/%c][0] == %d\n",
2874 i,NCBIstdaa[i],ppst->pam2[0][i][0]);
2893 fprintf(stderr," *** error[%s:%d] (validate_params)- pam2[0][%d/%c][0] == %d\n",
2894 __FILE__,__LINE__,i,NCBIstdaa[i],ppst->pam2[0][i][0]);
28752895 good_params = 0;
28762896 }
28772897 }
28802900 if (ppst->ext_sq_set) {
28812901 for (i=0; i< ppst->nsqx; i++) {
28822902 if (ppst->pam2[1][0][i] > -1000) {
2883 fprintf(stderr," *** ERROR *** pam2[1][0][%d] == %d\n",
2884 i,ppst->pam2[1][0][i]);
2903 fprintf(stderr," *** error[%s:%d] (validate_params) - pam2[1][0][%d] == %d\n",
2904 __FILE__, __LINE__, i,ppst->pam2[1][0][i]);
28852905 good_params = 0;
28862906 }
28872907 if (ppst->pam2[1][i][0] > -1000) {
2888 fprintf(stderr," *** ERROR *** pam2[1][%d][0] == %d\n",
2889 i,ppst->pam2[1][i][0]);
2908 fprintf(stderr," *** error[%s:%d] (validate_params) - pam2[1][%d][0] == %d\n",
2909 __FILE__, __LINE__, i,ppst->pam2[1][i][0]);
28902910 good_params = 0;
28912911 }
28922912 }
28952915 /* check for valid residues in query */
28962916 for (i=0; i<n0; i++) {
28972917 if (aa0[i] > ppst->nsq_e && aa0[i] != ESS) {
2898 fprintf(stderr," *** ERROR *** aa0[%d] = %c[%d > %d] out of range\n",
2899 i, aa0[i], aa0[i], ppst->nsq_e);
2918 fprintf(stderr," *** error [%s:%d] (validate_params) - aa0[%d] = %c[%d > %d] out of range\n",
2919 __FILE__,__LINE__,i, aa0[i], aa0[i], ppst->nsq_e);
29002920 good_params = 0;
29012921 }
29022922 }
29032923
29042924 for (i=0; i<128; i++) {
29052925 if (lascii[i] < NA && lascii[i] > ppst->nsq_e) {
2906 fprintf(stderr," *** ERROR *** lascii [%c|%d] = %d > %d out of range\n",
2907 i, i, lascii[i], ppst->nsq_e);
2926 fprintf(stderr," *** error[%s:%d] (validate_params) - lascii [%c|%d] = %d > %d out of range\n",
2927 __FILE__, __LINE__, i, i, lascii[i], ppst->nsq_e);
29082928 good_params = 0;
29092929 }
29102930
7272 if ((bp=strchr(tname,' '))!=NULL) *bp='\0';
7373
7474 if ((tptr=fopen(tname,"r"))==NULL) {
75 fprintf(stderr," could not open file of names: %s\n",tname);
75 fprintf(stderr,"*** error [%s:%d] could not open file of names: %s\n",__FILE__,__LINE__,tname);
7676 return NULL;
7777 }
7878
108108 if (strlen(flstr)> (size_t)0) {
109109 chlen = MAX_CH*MAX_FN;
110110 if ((chtmp=charr=calloc((size_t)chlen,sizeof(char)))==NULL) {
111 fprintf(stderr,"cannot allocate choice file array\n");
111 fprintf(stderr,"*** error [%s:%d] cannot allocate choice file array\n",__FILE__,__LINE__);
112112 goto l1;
113113 }
114114 chlen--;
115115 if ((fch=fopen(flstr,"r"))==NULL) {
116 fprintf(stderr," cannot open choice file: %s\n",flstr);
116 fprintf(stderr,"*** error [%s:%d] cannot open choice file: %s\n",__FILE__,__LINE__,flstr);
117117 goto l1;
118118 }
119119 fprintf(stderr,"\n Choose sequence library:\n\n");
185185 int new_abbr,ich, nch; /* use new multi-letter abbr */
186186 int ltmp;
187187 FILE *fch;
188 struct lib_struct *cur_lib_p = NULL;
188 struct lib_struct *cur_lib_p = NULL, *tmp_lib_p;
189189
190190 new_abbr = 0;
191191 *ltitle = '\0';
195195 }
196196 else {
197197 if (*flstr=='\0') {
198 fprintf(stderr," abbrv. list request but FASTLIBS undefined, cannot use %s\n",lname);
198 fprintf(stderr,"*** error [%s:%d] abbrv. list request but FASTLIBS undefined, cannot use %s\n",__FILE__,__LINE__,lname);
199199 exit(1);
200200 }
201201
217217
218218 if (strlen(flstr) > (size_t)0) {
219219 if ((fch=fopen(flstr,"r"))==NULL) {
220 fprintf(stderr," cannot open choice file: %s\n",flstr);
220 fprintf(stderr,"*** error [%s:%d] cannot open choice file: %s\n",__FILE__,__LINE__,flstr);
221221 return NULL;
222222 }
223223 }
232232
233233 /* if !new_abbr, match on one letter with ulindex() */
234234 if (!new_abbr) {
235 if (*bp=='+') continue; /* not a &lib& */
235 if (*bp=='+') continue; /* not a +lib+ */
236236 else if (ulindex(lname,bp)!=NULL) {
237237 if (ltitle[0] == '\0') {
238238 strncpy(ltitle,line,MAX_STR);
242242 strncat(ltitle,",\n ",MAX_STR-ltmp);
243243 strncat(ltitle,line,MAX_STR-ltmp-4);
244244 }
245 cur_lib_p = get_lnames(bp+1, cur_lib_p);
245 tmp_lib_p = get_lnames(bp+1, cur_lib_p);
246 if (tmp_lib_p) { cur_lib_p = tmp_lib_p;}
246247 }
247248 }
248249 else {
267268 }
268269 *bp1='+';
269270 }
270 else fprintf(stderr,"%s missing final '+'\n",bp);
271 else fprintf(stderr,"*** error [%s:%d] %s missing final '+'\n",__FILE__,__LINE__,bp);
271272 }
272273 }
273274 }
1818 governing permissions and limitations under the License.
1919 */
2020
21 /* input is a libtype 1,5, or 6 sequence database */
21 /* input is a lib_type 1,5, or 6 sequence database (lib_type specified after filename),
22 e.g. 'swissprot.lseg 1' */
23 /* map_db -n specifies a DNA database */
24
2225 /* output is a BLAST2 formatdb type index file */
2326
2427 /* format of the index file:
155155 int nc, lc, maxc;
156156 double lzscore, lzscore2, lbits;
157157 struct a_struct l_aln, *l_aln_p;
158 float percent, gpercent;
158 float percent, gpercent, ng_percent, disp_percent, disp_similar;
159 int disp_alen;
159160 /* strings, lengths for conventional alignment */
160161 char *seqc0, *seqc0a, *seqc1, *seqc1a, *seqca;
161162 int *cumm_seq_score;
489490
490491 if (lc > 0) {
491492 percent = (100.0*(float)l_aln_p->nident)/(float)lc;
492 }
493 else { percent = -1.00; }
493 ng_percent = (100.0*(float)l_aln_p->nident)/(float)(lc-(l_aln_p->ngap_q + l_aln_p->ngap_l));
494 }
495 else { percent = ng_percent = -1.00; }
494496
495497 fprintf (fp, "a {\n");
496498 if (annot_var_dyn->string[0]) {
533535
534536 if (cur_ares_p->score_delta > 0) score_delta -= cur_ares_p->score_delta;
535537
536 percent = calc_fpercent_id(100.0, l_aln_p->nident,lc,m_msp->tot_ident, -1.0);
538 disp_percent = percent = calc_fpercent_id(100.0, l_aln_p->nident,lc,m_msp->tot_ident, -1.0);
539 disp_similar = calc_fpercent_id(100.0, l_aln_p->nsim, lc, m_msp->tot_ident, -1.0);
540 disp_alen = lc;
537541
538542 ngap = l_aln_p->ngap_q + l_aln_p->ngap_l;
543 ng_percent = calc_fpercent_id(100.0, l_aln_p->nident,lc-ngap,m_msp->tot_ident, -1.0);
544 if (m_msp->blast_ident) {
545 disp_percent = ng_percent;
546 disp_similar = calc_fpercent_id(100.0, l_aln_p->npos, lc-ngap, m_msp->tot_ident, -1.0);
547 disp_alen = lc - ngap;
548 }
549
539550 #ifndef SHOWSIM
540 gpercent = calc_fpercent_id(100.0,l_aln_p->nident,lc-ngap,m_msp->tot_ident, -1.0);
551 gpercent = ng_percent;
541552 #else
542 gpercent = calc_fpercent_id(100.0,l_aln_p->nsim,lc,m_msp->tot_ident, -1.0);
553 gpercent = disp_similar;
543554 #endif
544555
545556 lsw_score = cur_ares_p->sw_score + score_delta;
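
Note on the hunk above: with `-XB` (`m_msp->blast_ident`), the displayed identity and alignment length are computed over the alignment with gap columns removed (`lc - ngap`) instead of the full alignment length `lc`. The following is a minimal sketch of that arithmetic only. The helpers `pct_ident()` and `display_alen()` are hypothetical names for illustration; the real `calc_fpercent_id()` also takes the `tot_ident` flag (the `-XI` option, which keeps identities from rounding to 100%), and that is not modelled here.

```c
/* Hypothetical helper, not part of the FASTA sources:
   percent identity over `len` aligned columns; returns `fail`
   when there is nothing to divide by. */
static double
pct_ident(int n_ident, int len, double fail)
{
  if (len <= 0) return fail;
  return 100.0 * (double)n_ident / (double)len;
}

/* Hypothetical helper: denominator used for display.
   FASTA-style identity counts every alignment column (lc);
   BLAST-style identity (-XB) excludes gap columns. */
static int
display_alen(int lc, int ngap_q, int ngap_l, int blast_ident)
{
  return blast_ident ? lc - (ngap_q + ngap_l) : lc;
}
```

For an alignment with 100 columns, 90 identities and 10 gap columns, the default calculation gives 90% over 100 columns, while the `-XB` style calculation gives 100% over 90 columns.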
663674 if (m_msp->markx & MX_HTML) {
664675 fprintf(fp,"<!-- ANNOT_START \"%s\" -->",link_name);}
665676 /* ensure that last character is "\n" */
666 if (annot_var_dyn->string[strlen(annot_var_dyn->string)-1] != '\n') {
667 annot_var_dyn->string[strlen(annot_var_dyn->string)-1] = '\n';
668 }
669 fputs(annot_var_dyn->string, fp);
677 if (!m_msp->m8_show_annot) {
678 if (annot_var_dyn->string[strlen(annot_var_dyn->string)-1] != '\n') {
679 annot_var_dyn->string[strlen(annot_var_dyn->string)-1] = '\n';
680 }
681 fputs(annot_var_dyn->string, fp);
682 }
683 else { fputs("\n",fp);}
684
670685 if (m_msp->markx & MX_HTML) {fputs("<!-- ANNOT_STOP -->",fp);}
671686 }
672687
745760 do_show(fp, m_msp->n0, bbp->seq->n1, lsw_score, name0, name1, nml,
746761 link_name,
747762 m_msp, ppst, seqc0, seqc0a, seqc1, seqc1a, seqca, cumm_seq_score,
748 nc, percent, gpercent, lc, l_aln_p, annot_var_dyn->string,
763 nc, disp_percent, gpercent, disp_alen, l_aln_p, annot_var_dyn->string,
749764 m_msp->annot_p, bbp->seq->annot_p);
750765
751766 /* display the encoded alignment left over from showbest()*/
808823 int tmp;
809824
810825 if (m_msp->markx & MX_AMAP && (m_msp->markx & MX_ATYPE)==7)
826 /* show text graphic of alignment (very rarely used) */
811827 disgraph(fp, n0, n1, percent, score,
812828 aln->amin0, aln->amin1, aln->amax0, aln->amax1, m_msp->sq0off,
813829 name0, name1, nml, aln->llen, m_msp->markx);
814830 else if (m_msp->markx & MX_M10FORM) {
831 /* old tagged/parse-able format */
815832 if (ppst->sw_flag && m_msp->arelv>0)
816833 fprintf(fp,"; %s_score: %d\n",m_msp->f_id1,score);
817834 fprintf(fp,"; %s_ident: %5.3f\n",m_msp->f_id1,percent/100.0);
826843 seqc0, seqc0a, seqc1, seqc1a, seqca, cumm_seq_score, nc,
827844 n0, n1, name0, name1, nml, aln);
828845 }
829 else {
846 else { /* all "normal" alignment formats */
830847 if (!(m_msp->markx & MX_MBLAST)) {
831848 #ifndef LALIGN
832849 fprintf(fp,"%s score: %d; ",m_msp->alabel, score);
847864 annot_var_s, q_annot_p, l_annot_p);
848865 }
849866
850 if (m_msp->markx & MX_AMAP && (m_msp->markx & MX_ATYPE)!=7) {
867 if ((m_msp->markx & MX_AMAP) && ((m_msp->markx & MX_ATYPE)!=MX_ATYPE)) {
851868 fputc('\n',fp);
852869 tmp = n0;
853870
9090 void w_abort (char *p, char *p1);
9191
9292 extern double zs_to_bit(double, int, int);
93
94 void dominfo_to_str(struct dyn_string_str *d, struct annot_str *annot);
9395
9496 /* showbest() shows a list of high scoring sequence descriptions, and
9597 their rst.scores. If -m 9, then an additional complete set of
136138 struct rstruct rst;
137139 int l_score0, ngap;
138140 double lzscore, lzscore2, lbits;
139 float percent, gpercent, ng_percent;
141 float percent, gpercent, ng_percent, disp_percent, disp_similar;
142 int disp_alen;
140143 struct a_struct *aln_p;
141144 struct a_res_str *cur_ares_p;
142145 struct rstruct *rst_p;
143146 int gi_num;
144147 char html_pre_E[120], html_post_E[120];
145148 int have_lalign = 0;
149 struct dyn_string_str *dominfo_dstr;
146150
147151 struct lmf_str *m_fptr;
148152
241245 /* display number of hits for -m 8C (Blast Tab-commented format) */
242246 if (m_msp->markx & MX_M8COMMENT) {
243247 /* line below copied from BLAST+ output */
244 fprintf(fp,"# Fields: query id, subject id, %% identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score");
248 if (m_msp->markx & MX_M8_BTAB_LEN) {
249 fprintf(fp,"# Fields: query id, query length, subject id, subject length, %% identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score");
250 }
251 else {
252 fprintf(fp,"# Fields: query id, subject id, %% identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score");
253 }
254
245255 if (ppst->zsflag > 20) {fprintf(fp,", eval2");}
246256 if (m_msp->show_code & (SHOW_CODE_ALIGN+SHOW_CODE_CIGAR)) { fprintf(fp,", aln_code");}
247257 else if ((m_msp->show_code & SHOW_CODE_BTOP)==SHOW_CODE_BTOP) { fprintf(fp,", BTOP");}
328338 for (ib=istart; ib<istop; ib++) {
329339 bbp = bptr[ib];
330340 if (ppst->do_rep) {
331 bbp->repeat_thresh =
332 min(E1_to_s(ppst->e_cut_r, m_msp->n0, bbp->seq->n1,ppst->zdb_size, m_msp->pstat_void),
333 bbp->rst.score[ppst->score_ix]);
341 if (bbp->rst.escore > ppst->e_cut_r) { /* for poor alignment scores, don't look for more */
342 bbp->repeat_thresh = bbp->rst.score[ppst->score_ix] * 10;
343 }
344 else {
345 bbp->repeat_thresh =
346 min(E1_to_s(ppst->e_cut_r, m_msp->n0, bbp->seq->n1,ppst->zdb_size, m_msp->pstat_void),
347 bbp->rst.score[ppst->score_ix]);
348 }
334349 }
335350
336351 #ifdef DEBUG
518533 }
519534 else if (m_msp->markx & MX_M8OUT) { /* MX_M8OUT -- provide query, library */
520535 if (first_line) {first_line = 0;}
521 fprintf (fp,"%s\t%s",m_msp->qtitle,bline_p);
536 if (m_msp->markx & MX_M8_BTAB_LEN) {
537 fprintf (fp,"%s\t%d\t%s\t%d",m_msp->qtitle,m_msp->n0,bline_p,bbp->seq->n1);
538 }
539 else {
540 fprintf (fp,"%s\t%s",m_msp->qtitle,bline_p);
541 }
522542 }
523543 else if (m_msp->markx & MX_MBLAST2) { /* blast "Sequences producing" */
524544 if (first_line) {first_line = 0;}
536556 annot_str_len = cur_ares_p->annot_code_n;
537557
538558 ngap = cur_ares_p->aln.ngap_q + cur_ares_p->aln.ngap_l;
539 percent = calc_fpercent_id(100.0,aln_p->nident,aln_p->lc, m_msp->tot_ident, -100.0);
559 disp_percent = percent = calc_fpercent_id(100.0,aln_p->nident,aln_p->lc, m_msp->tot_ident, -100.0);
540560 ng_percent = calc_fpercent_id(100.0,aln_p->nident,aln_p->lc-ngap, m_msp->tot_ident, -100.0);
561 disp_similar = calc_fpercent_id(100.0, cur_ares_p->aln.nsim, aln_p->lc, m_msp->tot_ident, -100.0);
562 disp_alen = aln_p->lc;
563 if (m_msp->blast_ident) {
564 disp_percent = ng_percent;
565 disp_similar = calc_fpercent_id(100.0, cur_ares_p->aln.npos, aln_p->lc - ngap, m_msp->tot_ident, -100.0);
566 disp_alen = aln_p->lc - ngap;
567 }
541568
542569 #ifndef SHOWSIM
543 gpercent = calc_fpercent_id(100.0, aln_p->nident, aln_p->lc-ngap, m_msp->tot_ident, -100.0);
570 gpercent = ng_percent;
544571 #else
545 gpercent = calc_fpercent_id(100.0, cur_ares_p->aln.nsim, aln_p->lc, m_msp->tot_ident, -100.0);
572 gpercent = disp_similar;
546573 #endif /* SHOWSIM */
547574
548575 if (m_msp->show_code != SHOW_CODE_ID && m_msp->show_code != SHOW_CODE_IDD) { /* show more complete info than just identity */
563590 /* sequence coordinate min max min max */
564591 if (!(m_msp->markx & MX_M8OUT)) {
565592 fprintf(fp,"\t%5.3f %5.3f %4d %4d %4ld %4ld %4ld %4ld %4ld %4ld %4ld %4ld %3d %3d %3d",
566 percent/100.0,gpercent/100.0,
593 disp_percent/100.0,gpercent/100.0,
567594 cur_ares_p->sw_score,
568 aln_p->lc,
595 disp_alen,
569596 aln_p->d_start0,aln_p->d_stop0,
570597 aln_p->q_start_off, aln_p->q_end_off,
571598 aln_p->d_start1,aln_p->d_stop1,
581608 }
582609 else { /* MX_M8OUT -- blast order, tab separated */
583610 fprintf(fp,"\t%.2f\t%d\t%d\t%d\t%ld\t%ld\t%ld\t%ld\t%.2g\t%.1f",
584 ng_percent,aln_p->lc,aln_p->nmismatch,
611 ng_percent,aln_p->lc-ngap,aln_p->nmismatch,
585612 aln_p->ngap_q + aln_p->ngap_l+aln_p->nfs,
586613 aln_p->d_start0, aln_p->d_stop0,
587614 aln_p->d_start1, aln_p->d_stop1,
588615 zs_to_E(lzscore,n1,ppst->dnaseq,ppst->zdb_size,m_msp->db),
589616 lbits);
617
590618 if (ppst->zsflag > 20) {
591619 fprintf(fp,"\t%.2g",zs_to_E(lzscore2, n1, ppst->dnaseq, ppst->zdb_size, m_msp->db));
592620 }
593621 if ((m_msp->show_code & (SHOW_CODE_ALIGN+SHOW_CODE_CIGAR+SHOW_CODE_BTOP)) && seq_code_len > 0 && seq_code != NULL) {
594622 fprintf(fp,"\t%s",seq_code);
623
595624 if (annot_str_len > 0 && annot_str != NULL) {
596625 fprintf(fp,"\t%s",annot_str);
597626 }
627
628 if (m_msp->show_code & SHOW_CODE_DOMINFO) {
629 dominfo_dstr = init_dyn_string(1024,1024);
630 if (m_msp->annot_p) {
631 dominfo_to_str(dominfo_dstr,m_msp->annot_p);
632 }
633 if (bbp->seq->annot_p) {
634 dominfo_to_str(dominfo_dstr,bbp->seq->annot_p);
635 }
636
637 if (dominfo_dstr->string[0]) {
638 fprintf(fp,"\t%s",dominfo_dstr->string);
639 }
640 free_dyn_string(dominfo_dstr);
641 }
598642 }
599643 fprintf(fp,"\n");
600644 }
602646 else { /* !SHOW_CODE -> SHOW_ID or SHOW_IDD*/
603647 #ifdef SHOWSIM
604648 fprintf(fp," %5.3f %5.3f %4d",
605 percent/100.0,
606 (float)aln_p->nsim/(float)aln_p->lc,aln_p->lc);
649 disp_percent/100.0,disp_similar/100.0,disp_alen);
607650 #else
608 fprintf(fp," %5.3f %4d", percent/100.0,aln_p->lc);
651 fprintf(fp," %5.3f %4d", disp_percent/100.0,disp_alen);
609652 #endif
610653 if (m_msp->markx & MX_HTML) {
611654 if (cur_ares_p->index > 0) {
619662 }
620663 else { link_shown = 0;}
621664
622 if ((m_msp->show_code & SHOW_CODE_ID) == SHOW_CODE_ID) {
665 if ((m_msp->show_code & SHOW_CODE_ID) == SHOW_CODE_ID ) {
623666 annot_str = cur_ares_p->annot_var_id;
624667 }
625668 else if ((m_msp->show_code & SHOW_CODE_IDD) == SHOW_CODE_IDD) {
628671 else {
629672 annot_str = NULL;
630673 }
631 if (annot_str && annot_str[0]) {
674 if (annot_str && annot_str[0] && (!m_msp->m8_show_annot || (m_msp->markx & MX_M8OUT))) {
632675 fprintf(fp," %s",annot_str);
633676 }
634677 }
662705
663706 if (m_msp->markx & MX_HTML) fprintf(fp,"</pre><hr>\n");
664707 }
708
709 /* dominfo_to_str() -- convert domain annotations to a |DX:1-100;C=PF12345~1 dyn_string */
710 /* used for both query and subject strings */
711 void
712 dominfo_to_str(struct dyn_string_str *dominfo_dstr, struct annot_str *annots) {
713 int i;
714 char tmp_string[MAX_STR];
715 struct annot_entry *annot;
716 struct dyn_string_str *dyn_dom_str;
717
718 for (i=0; i < annots->n_annot; i++) {
719
720 annot = &annots->annot_arr_p[i];
721
722 if (annot->target) {
723 if (annot->label == '-') {
724 sprintf(tmp_string,"|XD:%ld-%ld;C=%s",annot->pos+1,annot->end+1,annot->comment);
725 }
726 else {
727 sprintf(tmp_string,"|X%c:%ld-%ld;C=%s",annot->label, annot->pos+1,annot->end+1,annot->comment);
728 }
729 }
730 else {
731 if (annot->label == '-') {
732 sprintf(tmp_string,"|DX:%ld-%ld;C=%s",annot->pos+1,annot->end+1,annot->comment);
733 }
734 else {
735 sprintf(tmp_string,"|%cX:%ld-%ld;C=%s",annot->label, annot->pos+1,annot->end+1,annot->comment);
736 }
737
738 }
739
740
741 dyn_strcat(dominfo_dstr, tmp_string);
742 }
743 }
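
Note on `dominfo_to_str()` above: each annotation is written with `sprintf()` into a `MAX_STR` buffer, so a very long `annot->comment` could overrun it. The following is a minimal sketch of a bounded variant for one entry, assuming the same `|XD:start-end;C=comment` layout. `format_dom_entry()` is a hypothetical helper, not part of the FASTA sources, and it takes plain arguments in place of `struct annot_entry`.

```c
#include <stdio.h>

/* Hypothetical helper: format one domain annotation into buf.
   Coordinates are 0-based on input and written 1-based, as in
   dominfo_to_str(). A '-' label is written as 'D'. Subject (target)
   entries put 'X' first, query entries put 'X' second.
   Returns 0 on success, -1 if the text would not fit in buf. */
static int
format_dom_entry(char *buf, size_t buf_len, int is_target, char label,
                 long pos, long end, const char *comment)
{
  char tag[3];
  int n;

  if (label == '-') label = 'D';
  if (is_target) { tag[0] = 'X'; tag[1] = label; }
  else           { tag[0] = label; tag[1] = 'X'; }
  tag[2] = '\0';

  /* snprintf never writes more than buf_len bytes, including the NUL */
  n = snprintf(buf, buf_len, "|%s:%ld-%ld;C=%s", tag, pos + 1, end + 1, comment);
  return (n < 0 || (size_t)n >= buf_len) ? -1 : 0;
}
```

A caller could skip or truncate an entry when the helper returns -1 instead of appending a partial string to the output.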
2323
2424 #define FORMATDBV3 3 /* formatdb version */
2525 #define FORMATDBV4 4 /* formatdb version */
26 #define FORMATDBV5 5 /* formatdb version */
2627
2728 #define NULLB '\0' /* sentinel byte */
2829
7979
8080
8181 /* ****************************************************************
82 This code reads NCBI Blast2 format databases from formatdb version 3 and 4
82 This code reads NCBI Blast2 format databases from formatdb version 3 -- 5
8383
8484 (From NCBI) This section describes the format of the databases.
8585
449449 src_uint4_read(ifile,(unsigned *)&dbformat); /* get format DB version number */
450450 src_uint4_read(ifile,(unsigned *)&dbtype); /* get 1 for protein/0 DNA */
451451
452 if (dbformat != FORMATDBV3 && dbformat!=FORMATDBV4) {
452 if (dbformat != FORMATDBV3 && dbformat!=FORMATDBV4 && dbformat!=FORMATDBV5) {
453453 fprintf(stderr,"error - %s wrong formatdb version (%d/%d)\n",
454454 tname,dbformat,FORMATDBV3);
455455 return NULL;
787787 int title_len;
788788 char *title_str=NULL;
789789 int date_len;
790 char *pdb_title_str=NULL;
791 int pdb_title_len;
790792 char *date_str=NULL;
791793 long ltmp;
792794 int64_t l8tmp;
793795 int i, tmp;
794796 unsigned int *f_pos_arr;
795797
798 if (dbformat == FORMATDBV5) {
799 src_uint4_read(ifile,(unsigned int *)&ltmp);
800 }
801
796802 src_uint4_read(ifile,(unsigned *)&title_len);
797803
798804 if (title_len > 0) {
803809 fread(title_str,(size_t)1,(size_t)title_len,ifile);
804810 }
805811
812 if (dbformat == FORMATDBV5) {
813 src_uint4_read(ifile,(unsigned int *)&pdb_title_len);
814 if (pdb_title_len > 0) {
815 if ((pdb_title_str = calloc((size_t)pdb_title_len+1,sizeof(char)))==NULL) {
816 fprintf(stderr," cannot allocate pdb_title string (%d)\n",pdb_title_len);
817 goto error_r;
818 }
819 fread(pdb_title_str,(size_t)1,(size_t)pdb_title_len,ifile);
820 }
821 }
822
806823 src_uint4_read(ifile,(unsigned *)&date_len);
807824
808825 if (date_len > 0) {
5252 4 - Intelligentics format
5353 5 - NBRF/PIR VMS format
5454 6 - GCG 2bit format
55 7 - FASTQ format
56 8 - accession script
5557
5658 10 - list of gi/acc's
5759 11 - NCBI setdb/blastp (1.3.2) AA/NT
5860 12 - NCBI setdb/blastp (2.0) AA/NT
5961 16 - mySQL queries
60
62
6163 see file altlib.h to confirm numbers
6264
6365 */
166168 struct lmf_str *m_fptr=NULL;
167169 int acc_off=0;
168170 char fmt_term;
171 char acc_script[MAX_LSTR];
169172 struct lib_struct *next_lib_p, *this_lib_p, *tmp_lib_p;
170173
171174 om_fptr = lib_p->m_file_p;
177180
178181 wcnt = 0; /* number of times to ask for file name */
179182
183 /* check for library type */
184 lib_type=0;
185 if ((bp=strchr(lib_p->file_name,' '))!=NULL
186 || (bp=strchr(lib_p->file_name,'^'))!=NULL) {
187 if (isdigit((int)(bp+1)[0])) { /* check for number for lib_type */
188 *bp='\0';
189 sscanf(bp+1,"%d",&lib_type);
190 if (lib_type<0 || lib_type >= LASTLIB) {
191 fprintf(stderr,"\n invalid library type: %d (>%d)- resetting\n%s\n",
192 lib_type,LASTLIB,lib_p->file_name);
193 lib_type=0;
194 }
195 } /* don't change lib_type if it's not a number */
196 }
197 else if (lib_p->file_name[0] =='!') { /* check for script */
198 lib_type = lib_p->lib_type = ACC_SCRIPT;
199 }
200
201 /* check for stdin indicator '-' or '@' (or ACC_SCRIPT) */
202 if (lib_p->file_name[0] == '-' || lib_p->file_name[0] == '@'
203 || lib_type == ACC_SCRIPT) {
204 use_stdin = 1;
205 }
206 else use_stdin=0;
207
208 if (use_stdin && !(lib_type ==0 || lib_type==ACC_SCRIPT)) {
209 fprintf(stderr,"\n @/- STDIN libraries must be in FASTA format\n");
210 return NULL;
211 }
212
213 opt_text[0]='\0';
214 if (lib_type != ACC_SCRIPT) {
180215 /* check to see if there is a file option ":1-100" */
181216 #ifndef WIN32
182 if ((bp=strchr(lib_p->file_name,':'))!=NULL && *(bp+1)!='\0') {
217 if ((bp=strchr(lib_p->file_name,':'))!=NULL && *(bp+1)!='\0') {
183218 #else
184 if ((bp=strchr(lib_p->file_name+3,':'))!=NULL && *(bp+1)!='\0') {
219 if ((bp=strchr(lib_p->file_name+3,':'))!=NULL && *(bp+1)!='\0') {
185220 #endif
186 strncpy(opt_text,bp+1,sizeof(opt_text));
187 opt_text[sizeof(opt_text)-1]='\0';
188 *bp = '\0';
189 }
190 else opt_text[0]='\0';
191
192 if (lib_p->file_name[0] == '-' || lib_p->file_name[0] == '@') {
193 use_stdin = 1;
194 }
195 else use_stdin=0;
196
197 /* check for library type */
198 if ((bp=strchr(lib_p->file_name,' '))!=NULL) {
199 *bp='\0';
200 sscanf(bp+1,"%d",&lib_type);
201 if (lib_type<0 || lib_type >= LASTLIB) {
202 fprintf(stderr,"\n invalid library type: %d (>%d)- resetting\n%s\n",
203 lib_type,LASTLIB,lib_p->file_name);
204 lib_type=0;
205 }
206 else {
207 lib_p->lib_type = lib_type;
208 }
209 }
210 else lib_type = lib_p->lib_type;
211
212 if (use_stdin && lib_type !=0 ) {
213 fprintf(stderr,"\n @/- STDIN libraries must be in FASTA format\n");
214 return NULL;
221 strncpy(opt_text,bp+1,sizeof(opt_text));
222 opt_text[sizeof(opt_text)-1]='\0';
223 *bp = '\0';
224 }
215225 }
216226
217227 /* check to see if file can be open()ed? */
218
219228 l1:
220229 opnflg = 0;
221230 if (lib_type<=LASTTXT) {
222231 if (!use_stdin) {
223232 opnflg=((libf=fopen(lib_p->file_name,RBSTR))!=NULL);
233 }
234 else if (lib_type==ACC_SCRIPT) {
235 bp = lib_p->file_name;
236 if (lib_p->file_name[0] == '!') { bp += 1;}
237 strncpy(acc_script, bp, sizeof(acc_script)-1);
238 acc_script[sizeof(acc_script)-1] = '\0';
239
240 /* convert '+' in acc_script to ' ' */
241 bp = strchr(acc_script,'+');
242 for ( ; bp; bp=strchr(bp+1,'+')) {
243 *bp=' ';
244 }
245 libf=popen(acc_script,"r");
246 opnflg=1;
224247 }
225248 else {
226249 libf=stdin;
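The library-name handling added in this hunk can be summarized in a small standalone sketch: a trailing ` N` or `^N` selects the library type, a leading `!` marks an accession script, and `+` characters in the script string become spaces before the command is run. The function names and `LASTLIB_SKETCH` are hypothetical; the real upper bound `LASTLIB` is defined in `altlib.h` and is not shown in this diff. `ACC_SCRIPT` is taken as 8 from the "8 - accession script" entry in the comment block above.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define ACC_SCRIPT 8        /* "8 - accession script" in the library-type list */
#define LASTLIB_SKETCH 17   /* hypothetical bound; the real LASTLIB is in altlib.h */

/* Split "name N" or "name^N" in place and return N (0 if absent or invalid).
   A name starting with '!' and carrying no type suffix is an accession script. */
int split_lib_type(char *name)
{
  int lib_type = 0;
  char *bp;

  if ((bp = strchr(name, ' ')) != NULL || (bp = strchr(name, '^')) != NULL) {
    if (isdigit((unsigned char)bp[1])) {   /* only a number counts as a type */
      *bp = '\0';
      if (sscanf(bp + 1, "%d", &lib_type) != 1 ||
          lib_type < 0 || lib_type >= LASTLIB_SKETCH) {
        lib_type = 0;                      /* out of range: fall back to FASTA */
      }
    }
  }
  else if (name[0] == '!') {
    lib_type = ACC_SCRIPT;
  }
  return lib_type;
}

/* Copy the script spec into cmd (dropping a leading '!') and turn '+' into ' '. */
void script_to_command(const char *spec, char *cmd, size_t n)
{
  char *bp;

  if (n == 0) return;
  if (spec[0] == '!') spec++;
  strncpy(cmd, spec, n - 1);
  cmd[n - 1] = '\0';
  for (bp = strchr(cmd, '+'); bp != NULL; bp = strchr(bp + 1, '+')) {
    *bp = ' ';
  }
}
```

With these helpers, `"!get_acc.pl+P09488"` (a made-up script name) is classified as an accession script and becomes the command `get_acc.pl P09488`, while `"nr^5"` is split into the name `nr` and library type 5.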
759759
760760 for (i=1; parm[i].gap > 0; i++) {
761761 if (parm[i].gap > gap) continue;
762 else if (parm[i].gap == gap && parm[i].ext > ext ) continue;
763 else if (parm[i].gap == gap && parm[i].ext == ext) {
762 else if (parm[i].gap <= gap && parm[i].ext > ext ) continue;
763 else if (parm[i].gap <= gap && parm[i].ext <= ext) {
764764 *K = parm[i].K;
765765 *Lambda = parm[i].Lambda;
766766 *H = parm[i].H;
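The change in this hunk relaxes the lookup from an exact gap/extension match to the first table entry whose penalties do not exceed the requested ones. A minimal sketch of that selection rule follows. `struct alt_p_sketch` and `look_p_sketch` are hypothetical names, the sketch assumes entry 0 is reserved (the loop in the hunk starts at 1) and that the search stops at the first match, and any table passed to it in an example holds placeholder values, not real statistical parameters.

```c
struct alt_p_sketch {
  int gap, ext;            /* gap-open and gap-extend penalties */
  double K, Lambda, H;     /* statistical parameters for this pair */
};

/* Scan entries 1..n (terminated by gap == 0) and copy out the first one
   whose gap and ext are both <= the requested values.
   Returns 1 when an entry was found, 0 otherwise. */
int look_p_sketch(const struct alt_p_sketch *parm, int gap, int ext,
                  double *K, double *Lambda, double *H)
{
  int i;

  for (i = 1; parm[i].gap > 0; i++) {
    if (parm[i].gap > gap) continue;   /* table gap penalty too large */
    if (parm[i].ext > ext) continue;   /* table ext penalty too large */
    *K = parm[i].K;
    *Lambda = parm[i].Lambda;
    *H = parm[i].H;
    return 1;
  }
  return 0;
}
```

Under the earlier exact-match rule, a request such as gap 11, extension 2 would find nothing in a table that only lists (12,2), (11,1) and (10,1). With the relaxed rule it selects the (11,1) entry.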
123123 char sqnam[4]; /* "aa" or "nt" */
124124 char sqtype[10]; /* "DNA" or "protein" */
125125 int long_info; /* long description flag*/
126 int blast_ident; /* calculate identities excluding gaps */
126127 long sq0off, sq1off; /* virtual offset into aa0, aa1 */
127128 int markx; /* alignment display type */
128129 int tot_markx; /* markx as sum of all alternative markx */
156157 int ashow_set; /* ashow set with -d */
157158 int nmlen; /* length of name label */
158159 int show_code; /* show alignment code in -m 9; ==1 => identity only, ==2 alignment code*/
160 int m8_show_annot; /* show annotations only in -m 8CB output */
159161 int tot_show_code; /* show alignment for all outputs */
160162 int pre_load_done; /* set after pre_load_best() call */
161163 int align_done; /* do_walign() called */
202202 -5, -11, -11, -11, -6, -9, -9, -12, -10, -1, -5, -9, -5, -8, -10, -10, -6, -17, -9, 8,
203203 -8, -11, 3, 2, -14, -6, -5, -7, -5, -13, -15, -6, -10, -16, -9, -5, -6, -12, -12, -11, 8,
204204 -7, -9, -6, -4, -17, 3, 2, -9, -6, -12, -9, -4, -8, -14, -7, -6, -7, -19, -12, -9, -4, 8,
205 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
205 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
206 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 8
206207 };
207208
208209 /*
240241 -3, -9, -9, -9, -4, -7, -7, -10, -8, 1, -3, -8, -3, -6, -8, -8, -4, -13, -7, 7,
241242 -6, -8, 3, 3, -11, -4, -3, -5, -4, -11, -12, -4, -8, -13, -7, -3, -5, -10, -10, -9, 8,
242243 -5, -6, -4, -3, -13, 3, 3, -7, -4, -10, -7, -2, -6, -11, -5, -4, -5, -15, -9, -7, -2, 7,
243 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
244 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
245 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 8
246
244247 };
245248
246249 /*
278281 -1, -7, -7, -7, -2, -5, -6, -8, -6, 3, -1, -6, -1, -4, -6, -6, -2, -10, -5, 7,
279282 -4, -5, 4, 3, -8, -2, -1, -3, -2, -8, -9, -2, -6, -10, -5, -2, -3, -8, -7, -7, 7,
280283 -3, -4, -2, -1, -10, 4, 3, -5, -2, -7, -6, -1, -4, -9, -4, -3, -3, -12, -7, -5, 0, 7,
281 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
284 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
285 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 8
286
282287 };
283288
284289 /*
316321 0, -4, -5, -5, -1, -4, -4, -6, -4, 3, 0, -4, 0, -2, -4, -4, -1, -6, -4, 6,
317322 -2, -3, 4, 4, -5, -1, 0, -1, 0, -6, -6, -1, -4, -7, -3, 0, -1, -6, -5, -5, 7,
318323 -2, -1, -1, 0, -6, 4, 3, -3, -1, -5, -4, 0, -3, -6, -2, -1, -2, -8, -5, -4, 0, 6,
319 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
324 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
325 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 8
326
320327 };
321328
322329 /*
354361 0, -3, -4, -4, 0, -3, -3, -4, -3, 3, 1, -3, 1, -1, -3, -3, 0, -4, -2, 5,
355362 -1, -2, 4, 4, -4, 0, 1, -1, 0, -4, -5, 0, -3, -5, -2, 0, 0, -5, -3, -4, 6,
356363 -1, 0, 0, 0, -5, 3, 3, -2, 0, -4, -3, 1, -2, -4, -1, -1, -1, -6, -3, -3, 0, 5,
357 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
364 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
365 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 6
358366 };
359367
360368 /*
432440 0, -3, -3, -4, 1, -2, -3, -4, -3, 4, 2, -3, 2, -1, -3, -2, 0, -4, -2, 4,
433441 -1, -1, 4, 4, -3, 1, 2, 0, 0, -4, -4, 0, -3, -5, -1, 0, 0, -5, -3, -3, 6,
434442 -1, 0, 1, 2, -3, 3, 3, -1, 1, -3, -3, 1, -2, -4, -1, 0, 0, -6, -3, -2, 2, 5,
435 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
443 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
444 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 6
436445 };
437446
438447 /*
317317 char line[MAX_STR];
318318 int i, i_doms, n_domain_s = MAX_LSTR;
319319
320 /* since (currently) annot_var_s is MAX_LSOTR, do the same for domain_s */
320 /* since (currently) annot_var_s is MAX_LSTR, do the same for domain_s */
321321 if ((domain_s = (char *)calloc(n_domain_s, sizeof(char)))==NULL) {
322322 fprintf(stderr,"*** error [%s:%d] *** cannot allocate domain_s[%d]\n",__FILE__, __LINE__,n_domain_s);
323323 return NULL;
172172
173173 /* now we need alignment storage - get it */
174174 if ((cur_ares->res = (int *)calloc((size_t)max_res,sizeof(int)))==NULL) {
175 fprintf(stderr," *** cannot allocate alignment results array %d\n",max_res);
175 fprintf(stderr,"*** error [%s:%d] - cannot allocate alignment results array %d\n",
176 __FILE__, __LINE__, max_res);
176177 exit(1);
177178 }
178179
485486
486487 if ((f_ss = (struct swstr *) calloc (N+2, sizeof (struct swstr)))
487488 == NULL) {
488 fprintf (stderr, " *** cannot allocate f_ss array %3d\n", N+2);
489 fprintf (stderr, "*** error [%s:%d] - cannot allocate f_ss array %3d\n",
490 __FILE__, __LINE__, N+2);
489491 exit (1);
490492 }
491493 f_ss++;
492494
493495 if ((r_ss = (struct swstr *) calloc (N+2, sizeof (struct swstr)))
494496 == NULL) {
495 fprintf (stderr, " *** cannot allocate r_ss array %3d\n", N+2);
497 fprintf (stderr, "*** error [%s:%d] - cannot allocate r_ss array %3d\n",
498 __FILE__, __LINE__, N+2);
496499 exit (1);
497500 }
498501 r_ss++;
502505
503506 ck = CHECK_SCORE(IW,B,M,N,S,W,G,H,NC, &sw);
504507 if (c != ck) {
505 fprintf(stderr," *** Check_score error. %d != %d ***\n",c,ck);
508 fprintf(stderr,"*** error [%s:%d] - check_score error. %d != %d ***\n",
509 __FILE__, __LINE__, c,ck);
506510 }
507511
508512 f_ss--; r_ss--;
55 if [ ! -d results ]; then
66 mkdir results
77 fi
8
9 export FA_DB=/slib2/fa_dbs/qfo20.lseg
10
811 echo "starting fasta36 - protein" `date`
9 ../bin/fasta36 -q -m 6 -Z 100000 ../seq/mgstm1.aa:1-100 q > results/test_m1.ok2.html
10 ../bin/fasta36 -S -q -z 11 -O results/test_m1.ok2_p25 -s P250 ../seq/mgstm1.aa:100-218 q
12 ../bin/fasta36 -q -m 6 -Z 100000 ../seq/mgstm1.aa:1-100 $FA_DB > results/test_m1.ok2.html
13 ../bin/fasta36 -S -q -z 11 -O results/test_m1.ok2_p25 -s P250 ../seq/mgstm1.aa:100-218 $FA_DB
1114 echo "done"
1215 echo "starting fastxy36" `date`
13 ../bin/fastx36 -m 9c -S -q ../seq/mgtt2_x.seq q 1 > results/test_t2.xk1
14 ../bin/fasty36 -S -q ../seq/mgtt2_x.seq q > results/test_t2.yk2
15 ../bin/fastx36 -m 9c -S -q -z 2 ../seq/mgstm1.esq a > results/test_m1.xk2z2
16 ../bin/fasty36 -S -q -z 2 ../seq/mgstm1.esq a > results/test_m1.yk2z2
16 ../bin/fastx36 -m 9c -S -q ../seq/mgtt2_x.seq $FA_DB 1 > results/test_t2.xk1
17 ../bin/fasty36 -S -q ../seq/mgtt2_x.seq $FA_DB > results/test_t2.yk2
18 ../bin/fastx36 -m 9c -S -q -z 2 ../seq/mgstm1.esq $FA_DB > results/test_m1.xk2z2
19 ../bin/fasty36 -S -q -z 2 ../seq/mgstm1.esq $FA_DB > results/test_m1.yk2z2
1720 echo "done"
1821 echo "starting fastxy36 rev" `date`
19 ../bin/fastx36 -m 9c -q -m 5 ../seq/mgstm1.rev q > results/test_m1.xk2r
20 ../bin/fasty36 -q -m 5 -M 200-300 -z 2 ../seq/mgstm1.rev q > results/test_m1.yk2rz2
21 ../bin/fasty36 -q -m 5 -z 11 ../seq/mgstm1.rev q > results/test_m1.yk2rz11
22 ../bin/fastx36 -m 9c -q -m 5 ../seq/mgstm1.rev $FA_DB > results/test_m1.xk2r
23 ../bin/fasty36 -q -m 5 -M 200-300 -z 2 ../seq/mgstm1.rev $FA_DB > results/test_m1.yk2rz2
24 ../bin/fasty36 -q -m 5 -z 11 ../seq/mgstm1.rev $FA_DB > results/test_m1.yk2rz11
2225 echo "done"
2326 echo "starting ssearch36" `date`
24 ../bin/ssearch36 -m 9c -S -z 3 -q ../seq/mgstm1.aa q > results/test_m1.ssz3
25 ../bin/ssearch36 -q -M 200-300 -z 2 -Z 100000 -s P250 ../seq/mgstm1.aa q > results/test_m1.ss_p25
27 ../bin/ssearch36 -m 9c -S -z 3 -q ../seq/mgstm1.aa $FA_DB > results/test_m1.ssz3
28 ../bin/ssearch36 -q -M 200-300 -z 2 -Z 100000 -s P250 ../seq/mgstm1.aa $FA_DB > results/test_m1.ss_p25
2629 echo "done"
2730 if [ -e ../bin/ssearch36s ]; then
2831 echo "starting ssearch36s" `date`
29 ../bin/ssearch36s -m 9c -S -z 3 -q ../seq/mgstm1.aa q > results/test_m1.sssz3
30 ../bin/ssearch36s -q -M 200-300 -z 2 -Z 100000 -s P250 ../seq/mgstm1.aa q > results/test_m1.sss_p25
32 ../bin/ssearch36s -m 9c -S -z 3 -q ../seq/mgstm1.aa $FA_DB > results/test_m1.sssz3
33 ../bin/ssearch36s -q -M 200-300 -z 2 -Z 100000 -s P250 ../seq/mgstm1.aa $FA_DB > results/test_m1.sss_p25
3134 echo "done"
3235 fi
3336 echo "starting prss36(ssearch/fastx)" `date`
3538 ../bin/fastx36 -q -k 1000 ../seq/mgstm1.esq ../seq/xurt8c.aa > results/test_m1.rfx
3639 echo "done"
3740 echo "starting ggsearch36/glsearch36" `date`
38 ../bin/ggsearch36 -q -m 9i -w 80 ../seq/hahu.aa q > results/test_h1.gg
39 ../bin/glsearch36 -q -m 9i -w 80 ../seq/hahu.aa q > results/test_h1.gl
40 ../bin/ggsearch36 -q ../seq/gtt1_drome.aa q > results/test_t1.gg
41 ../bin/glsearch36 -q ../seq/gtt1_drome.aa q > results/test_t1.gl
41 ../bin/ggsearch36 -q -m 9i -w 80 ../seq/hahu.aa $FA_DB > results/test_h1.gg
42 ../bin/glsearch36 -q -m 9i -w 80 ../seq/hahu.aa $FA_DB > results/test_h1.gl
43 ../bin/ggsearch36 -q ../seq/gtt1_drome.aa $FA_DB > results/test_t1.gg
44 ../bin/glsearch36 -q ../seq/gtt1_drome.aa $FA_DB > results/test_t1.gl
4245 echo "done"
4346 echo "starting fasta36 - DNA" `date`
4447 ../bin/fasta36 -S -q ../seq/mgstm1.nt %RMB 4 > results/test_m1.ok4
5255 ../bin/tfasty36 -q -i -3 -N 5000 ../seq/mgstm1.aa %p > results/test_m1.ty2
5356 echo "done"
5457 echo "starting fastf36" `date`
55 ../bin/fastf36 -q ../seq/m1r.aa q > results/test_mf.ff
56 ../bin/fastf36 -q ../seq/m1r.aa q > results/test_mf.ff_s
58 ../bin/fastf36 -q ../seq/m1r.aa $FA_DB > results/test_mf.ff
59 ../bin/fastf36 -q ../seq/m1r.aa $FA_DB > results/test_mf.ff_s
5760 echo "done"
5861 echo "starting tfastf36" `date`
5962 ../bin/tfastf36 -q ../seq/m1r.aa %r > results/test_mf.tfr
6063 echo "done"
6164 echo "starting fasts36" `date`
62 ../bin/fasts36 -q -V '*?@' ../seq/ngts.aa q > results/test_m1.fs1
63 ../bin/fasts36 -q ../seq/ngt.aa q > results/test_m1.fs
65 ../bin/fasts36 -q -V '*?@' ../seq/ngts.aa $FA_DB > results/test_m1.fs1
66 ../bin/fasts36 -q ../seq/ngt.aa $FA_DB > results/test_m1.fs
6467 ../bin/fasts36 -q -n ../seq/mgstm1.nts m > results/test_m1.nfs
6568 echo "starting fastm36" `date`
66 ../bin/fastm36 -q ../seq/ngts.aa q > results/test_m1.fm
69 ../bin/fastm36 -q ../seq/ngts.aa $FA_DB > results/test_m1.fm
6770 ../bin/fastm36 -q -n ../seq/mgstm1.nts m > results/test_m1.nfm
6871 echo "done"
6972 echo "starting tfasts36" `date`
33 echo `uname -a`
44 echo ""
55 echo "starting fasta36 - protein" `date`
6
7 FA_DB=/slib2/fa_dbs/qfo20.lseg
8
69 if [ ! -d results ]; then
710 mkdir results
811 fi
9 ../bin/fasta36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -S -z 21 -s BP62 ../seq/gstm1_human.vaa q > results/test2V_m1.ok2_bp62
10 ../bin/fasta36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -S -z 21 ../seq/gstm1_human.vaa q > results/test2V_m1.ok2_z21
11 ../bin/fasta36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -S -m BB ../seq/gstm1_human.vaa q > results/test2V_m1.ok2mB
12 ../bin/fasta36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -S -z 21 -s BP62 ../seq/gstm1_human.vaa $FA_DB > results/test2V_m1.ok2_bp62
13 ../bin/fasta36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -S -z 21 ../seq/gstm1_human.vaa $FA_DB > results/test2V_m1.ok2_z21
14 ../bin/fasta36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -S -m BB ../seq/gstm1_human.vaa $FA_DB > results/test2V_m1.ok2mB
1215 echo "done"
1316 echo "starting fastxy36" `date`
14 ../bin/fastx36 -V \!../scripts/ann_feats_up_www2.pl -m 9c -S -q ../seq/mgtt2_x.seq q > results/test2V_t2.xk2m9c
15 ../bin/fastx36 -V \!../scripts/ann_feats_up_www2.pl -m BB -S -q ../seq/mgtt2_x.seq q > results/test2V_t2.xk2mB
16 ../bin/fastx36 -V \!../scripts/ann_feats_up_www2.pl -m 9c -S -q -z 22 ../seq/gstm1b_human.nt q > results/test2V_m1.xk2m9cz22
17 ../bin/fasty36 -V \!../scripts/ann_feats_up_www2.pl -S -q -z 21 ../seq/gstm1b_human.nt q > results/test2V_m1.yk2z21
17 ../bin/fastx36 -V \!../scripts/ann_feats_up_www2.pl -m 9c -S -q ../seq/mgtt2_x.seq $FA_DB > results/test2V_t2.xk2m9c
18 ../bin/fastx36 -V \!../scripts/ann_feats_up_www2.pl -m BB -S -q ../seq/mgtt2_x.seq $FA_DB > results/test2V_t2.xk2mB
19 ../bin/fastx36 -V \!../scripts/ann_feats_up_www2.pl -m 9c -S -q -z 22 ../seq/gstm1b_human.nt $FA_DB > results/test2V_m1.xk2m9cz22
20 ../bin/fasty36 -V \!../scripts/ann_feats_up_www2.pl -S -q -z 21 ../seq/gstm1b_human.nt $FA_DB > results/test2V_m1.yk2z21
1821 echo "done"
1922 echo "starting ssearch36" `date`
20 ../bin/ssearch36 -V q\!../scripts/ann_pfam_www.pl -V \!../scripts/ann_pfam_www.pl -m 9c -S -z 22 -q ../seq/gstm1_human.vaa q > results/test2V_m1.ssm9cz22
21 ../bin/ssearch36 -V q\!../scripts/ann_pfam_www.pl -V \!../scripts/ann_pfam_www.pl -m 9C -S -z 21 -q ../seq/gstm1_human.vaa q > results/test2V_m1.ssm9Cz21
22 ../bin/ssearch36 -V q\!../scripts/ann_pfam_www.pl -V \!../scripts/ann_pfam_www.pl -m 8CC -S -q ../seq/gstm1_human.vaa q > results/test2V_m1.ssm8CC
23 ../bin/ssearch36 -V q\!../scripts/ann_pfam_www.pl -V \!../scripts/ann_pfam_www.pl -m 9c -S -z 22 -q ../seq/gstm1_human.vaa $FA_DB > results/test2V_m1.ssm9cz22
24 ../bin/ssearch36 -V q\!../scripts/ann_pfam_www.pl -V \!../scripts/ann_pfam_www.pl -m 9C -S -z 21 -q ../seq/gstm1_human.vaa $FA_DB > results/test2V_m1.ssm9Cz21
25 ../bin/ssearch36 -V q\!../scripts/ann_pfam_www.pl -V \!../scripts/ann_pfam_www.pl -m 8CC -S -q ../seq/gstm1_human.vaa $FA_DB > results/test2V_m1.ssm8CC
2326 echo "done" `date`
2427 echo "starting ggsearch36" `date`
25 ../bin/ggsearch36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -m 9c -S -q ../seq/gstm1_human.vaa q > results/test2V_m1.ggm9c
26 ../bin/ggsearch36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -m 9C -S -z 21 -q ../seq/gstm1_human.vaa q > results/test2V_m1.ggm9Cz21
28 ../bin/ggsearch36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -m 9c -S -q ../seq/gstm1_human.vaa $FA_DB > results/test2V_m1.ggm9c
29 ../bin/ggsearch36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -m 9C -S -z 21 -q ../seq/gstm1_human.vaa $FA_DB > results/test2V_m1.ggm9Cz21
2730 echo "done" `date`