New upstream version 36.3.8h
Andreas Tille
4 years ago
0 | 0 | |
1 | 1 | ## The FASTA package - protein and DNA sequence similarity searching and alignment programs |
2 | 2 | |
3 | The **FASTA** (pronounced FAST-Aye, not FAST-Ah) programs are a | |
4 | comprehensive set of similarity searching and alignment programs for | |
5 | searching protein and DNA sequence databases. Like the **BLAST** programs `blastp` and `blastn`, the `fasta` program itself uses a rapid heuristic strategy for finding similar regions in protein and DNA sequences. But in | |
6 | addition to heuristic similarity searching, the FASTA package provides | |
7 | programs for rigorous local (`ssearch`) and global (`ggsearch`) | |
8 | similarity searching, as well as a program for finding non-overlapping | |
9 | sequence similarities (`lalign`). Like BLAST, the FASTA package also | |
10 | includes programs for aligning translated DNA sequences against | |
11 | proteins (`fastx`, `fasty` are equivalent to `blastx`, `tfastx`, | |
12 | `tfasty` are similar to `tblastn`). | |
3 | The **FASTA** (pronounced FAST-Aye, not FAST-Ah) programs are a comprehensive set of similarity searching and alignment programs for searching protein and DNA sequence databases. Like the **BLAST** programs `blastp` and `blastn`, the `fasta` program itself uses a rapid heuristic strategy for finding similar regions in protein and DNA sequences. But in addition to heuristic similarity searching, the FASTA package provides | |
4 | programs for rigorous local (`ssearch`) and global (`ggsearch`) similarity searching, as well as a program for finding non-overlapping sequence similarities (`lalign`). Like BLAST, the FASTA package also includes programs for aligning translated DNA sequences against proteins (`fastx`, `fasty` are equivalent to `blastx`, and `tfastx`, `tfasty` are similar to `tblastn`). | |
13 | 5 | |
14 | ####December, 2017 | |
15 | The current FASTA version is fasta-36.3.8f, Dec. 2017 | |
6 | #### March, 2019 | |
7 | ||
8 | An updated release of the FASTA package (`fasta-36.3.8h`) is | |
9 | available. In addition to minor bug fixes, the latest version can | |
10 | generate query and library sequences using program scripts. | |
11 | ||
12 | See doc/README_v36.3.8h.md and doc/readme.v36 for a more complete summary of changes. | |
13 | ||
14 | #### December, 2018 | |
15 | ||
16 | The latest version of the FASTA package is `fasta-36.3.8h`, Dec. 2018. | |
17 | ||
18 | See doc/README_v36.3.8h.md for a more complete summary of changes. | |
19 | ||
20 | #### November, 2018 | |
21 | ||
22 | The current released version of the FASTA package is `fasta-36.3.8h`, Nov. 2018 | |
23 | ||
24 | See doc/README_v36.3.8h.md for a more complete summary of changes. | |
25 | ||
26 | #### October, 2018 | |
27 | ||
28 | The current version of the FASTA package is fasta-36.3.8g, Oct. 2018 | |
29 | ||
30 | See doc/README_v36.3.8h.md for a more complete summary of changes. | |
31 | ||
32 | #### April, 2018 | |
33 | The current version of the FASTA package is fasta-36.3.8g, Apr. 2018 | |
34 | ||
35 | #### December, 2017 | |
36 | The current FASTA version is fasta-36.3.8g, Dec. 2017 | |
16 | 37 | |
17 | 38 | The statistics routines for normally distributed scores (ggsearch36, |
18 | 39 | glsearch36) are more robust to very low E()-value thresholds. |
19 | 40 | |
20 | ####Sept, 2017 | |
41 | #### Sept, 2017 | |
21 | 42 | The current FASTA version is fasta-36.3.8f, Sept. 2017 |
22 | 43 | |
23 | 44 | If the -S option is used and a query sequence has no upper case |
24 | 45 | letters, it is re-read with lower-case letters converted to upper-case. |
25 | 46 | |
26 | ####May, 2017 | |
47 | #### May, 2017 | |
27 | 48 | The current FASTA version is fasta-36.3.8f, May. 2017 |
28 | 49 | |
29 | 50 | Various bugs in sub-alignment scoring corrected and support for the |
30 | EBI SP:GSTM1_HUMAN P09488 added. The format for the $SRCH_URL and | |
31 | $SRCH_URL2 format strings has changed to enable pairwise alignment. | |
51 | EBI SP:GSTM1_HUMAN P09488 added. The format for the `$SRCH_URL` and | |
52 | `$SRCH_URL2` format strings has changed to enable pairwise alignment. | |
32 | 53 | |
33 | ####September, 2016 | |
54 | #### September, 2016 | |
34 | 55 | |
35 | 56 | The fasta-36.3.6e version includes a new directory, `psisearch2`, with |
36 | 57 | scripts to run iterative PSSM (PSI-BLAST or SSEARCH36) searches using |
0 | ||
1 | ||
2 | ## The FASTA package - protein and DNA sequence similarity searching and alignment programs | |
3 | ||
4 | Changes in **fasta-36.3.8f** released 31-Dec-2017 | |
5 | ||
6 | 1. (December, 2017) -- Make statistical thresholds more robust for | |
7 | small E()-values with normally distributed scores (ggsearch36, | |
8 | glsearch36). | |
9 | ||
10 | 2. (September, 2017) Treat all lower-case queries as uppercase with -S option. | |
11 | ||
12 | 3. (May, 2017) Improvements/fixes to sub-alignment scoring strategies. | |
13 | ||
14 | 4. Improvements/fixes to psisearch2 scripts. | |
15 | ||
16 | For more detailed information, see `doc/readme.v36`. | |
17 |
0 | ||
1 | ## The FASTA package - protein and DNA sequence similarity searching and alignment programs | |
2 | ||
3 | Changes in **fasta-36.3.8h** August, 2019 | |
4 | ||
5 | 1. Modifications to support makeblastdb format v5 databases. Currently, only simple database reads have been tested. | |
6 | ||
7 | ||
8 | Changes in **fasta-36.3.8h** March, 2019 | |
9 | ||
10 | 1. Translation table 1 (`-t 1`) now translates 'TGA'->'U' (selenocysteine). | |
11 | ||
12 | 2. New script for extracting DNA sequences from genomes (`scripts/get_genome_seq.py`). Currently works with human (hg38), mouse (mm10), and rat (rn6). | |
13 | ||
14 | Changes in **fasta-36.3.8h** January, 2019 | |
15 | ||
16 | 1. Bug fixes: `fastx`/`tfastx` searches done with the `-t t` option (which adds a `*` to protein sequences so that termination codons can be matched), did not work properly with the `VT` series of matrices, particularly `VT10`. This has been fixed. | |
17 | ||
18 | 2. New features: Both query and library/subject sequences can be generated by specifying a program script, either by putting a `!` at the start of the query/subject file name, or by specifying library type `9`. Thus, `fasta36 \\!../scripts/get_protein.py+P09488+P30711 /seqlib/swissprot.fa` or `fasta36 "../scripts/get_protein.py+P09488+P30711 9" /seqlib/swissprot.fa` will compare two query sequences, `P09488` and `P30711`, to SwissProt, by downloading them from Uniprot using the `get_protein.py` script (which can download sequences using either Uniprot or RefSeq protein accessions). Often, the leading `!` must be escaped from shell interpretation with `\\!`. | |
19 | ||
20 | New scripts that return FASTA sequences using accessions or genome coordinates are available in `scripts/`: `get_protein.py`, `get_uniprot.py`, `get_up_prot_iso_sql.py`, and `get_refseq.py`. `get_refseq.py` can download either protein or mRNA RefSeq entries. `get_up_prot_iso_sql.py` retrieves a protein and its isoforms from a MySQL database. | |
21 | ||
22 | `get_genome_seq.py` extracts genome sequences using coordinates from local reference genomes (`hg38` and `mm10` included by default). | |
23 | ||
24 | Changes in **fasta-36.3.8h** December, 2018 | |
25 | ||
26 | The `scripts/ann_exons_up_www.pl` and `ann_exons_up_sql.pl` now include the option `--gen_coord` which provides the associated genome coordinate (including chromosome) as a feature, indicated by `'<'` (start of exon) and `'>'` (end of exon). | |
27 | ||
28 | Changes in **fasta-36.3.8h** released November, 2018 | |
29 | ||
30 | **fasta-36.3.8h** provides new scripts and modifications to the `fasta` programs that normalize the process of merging sub-alignment scores and region information into both FASTA and BLAST results. To move BLASTP towards FASTA with respect to alignment annotation and sub-alignment scoring: | |
31 | ||
32 | 1. The `blastp_annot_cmd.sh` runs a blast search, finds and scores domain information for the alignments, and merges this information back into the blast output `.html` file. This script uses: | |
33 | ||
34 | 1. `annot_blast_btab2.pl --query query.file --ann_script annot_script.pl --q_ann_script annot_script.pl blast.btab_file > blast.btab_file_ann` (produces a blast tabular file with one or two new fields: an annotation field and, optionally with `--dom_info`, a raw domain content field). | |
35 | 2. `merge_blast_btab.pl --btab blast.btab_file_ann blast.html > blast_ann.html` (merges the annotations and domain content information in the `blast.btab_file_ann` file with the standard blast output file to produce annotated alignments). | |
36 | 3. In addition, `rename_exons.py` is available to rename exons (later other domains) in the subject sequences to match the exon labeling in the aligned query sequence. | |
37 | 4. `relabel_domains.py` can be used to adjust color sets for homologous domains. | |
38 | ||
39 | 2. There is also an equivalent `fasta_annot_cmd.sh` script that provides similar functionality for the FASTA programs. This script does not need to use `annot_blast_btab2.pl` to produce domain subalignment scores (that functionality is provided in FASTA), but it also can use `merge_fasta_btab.pl` and `rename_exons.py` to modify the names of the aligned exons/domains in the subject sequences. | |
40 | ||
41 | 3. To support the independence of the `blastp`/`fasta` output from html annotation, the FASTA package includes some new options: | |
42 | ||
43 | 1. The `-m 8CBL` option includes query sequence length and subject sequence length in the blast tabular output. In addition, if domain annotations are available, the raw domain coordinates are provided in an additional field after the annotation/subalignment scoring field. `-m 8CBl` provides the sequence lengths, but does not add the raw domain coordinates. | |
44 | ||
45 | 2. The `-Xa` option prevents annotation information from being included in the html output -- it is only available in the `-m 8CB` (or `-m 8CBL/l`) output | |
46 | ||
47 | 3. To reduce problems with spaces in script arguments, annotation scripts with spaces separating arguments can use '+' instead of ' '. | |
48 | ||
49 | 4. The `fasta_annot_cmd.sh` script produces both a conventional alignment on `stdout` and a `-m 8CBL` alignment, which is sent to a separate file. The file name is separated from the `-m F8CBL` option with a `=`, thus `-m F8CBL=tmp_output.blast_tab`. | |
50 | ||
51 | Changes in **fasta-36.3.8g** released 23-Oct-2018 | |
52 | ||
53 | 1. (Oct. 2018) Improvements to scripts in the `psisearch2/` directory: | |
54 | ||
55 | 1. `psisearch2/m89_btop_msa2.pl` | |
56 | 1. the `--clustal` option produces a "CLUSTALW (1.8)", which is required for some downstream programs | |
57 | 2. the `--trunc_acc` option removes the database and accession from identifiers of the form: `sp|P09488|GSTM1_HUMAN` to produce `GSTM1_HUMAN`. | |
58 | 3. the `--min_align` option specifies the fraction of the query sequence that must be aligned `(q_end-q_start+1)/q_length` | |
59 | Together, these changes make it possible for the output of `m89_btop_msa2.pl` to be used by the EMBOSS program `fprotdist`. | |
60 | ||
61 | 2. A more general implementation of `psisearch2_msa_iter.sh`, which does `psisearch2` one iteration at a time, and a new equivalent `psisearch2_msa_iter_bl.sh`, which uses `psiblast` to do the search. | |
62 | ||
63 | * (Oct. 2018) A small restructuring of the `make/Makefiles` to remove the `-lz` dependence for non-debugging scripts (and add it back when -DDEBUG is used). | |
64 | ||
65 | Changes in **fasta-36.3.8g** released 5-Aug-2018 | |
66 | ||
67 | 1. (Apr 2018) incorporation of `-t t1` termination codes ("*") in `-m 8CB`, `-m 8CC`, and `-m9C` so that aligned termination codons are indicated as `**` (`-m8CB`) or `*1` (`-m8CC`, `-m9C`). | |
68 | ||
69 | 2. (Mar 2018) Updates to scripts/annot_blast_btop2.pl to provide subalignment scoring for blastp searches (BLOSUM62 only). (see doc/readme.v36) | |
70 | ||
71 | 3. (Feb. 2018) a new extended option, `-XB`, which causes percent identity, percent similarity, and alignment length to be calculated using the BLAST model, which does not count gaps in the alignment length. | |
72 | ||
73 | see readme.v36 for other bug fixes. | |
74 | ||
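The effect of the `-XB` option described above can be illustrated with a small Python sketch. It is not the FASTA implementation: the function name `percent_identity` and the gap character `-` are assumptions made for illustration, and the "BLAST model" is implemented only as this README describes it (gapped columns are left out of the alignment length).

```python
from typing import Tuple


def percent_identity(query_aln: str, subject_aln: str,
                     blast_model: bool = False) -> Tuple[float, int]:
    """Return (percent identity, alignment length) for two aligned strings.

    blast_model=False counts every alignment column, including gaps.
    blast_model=True skips columns that contain a gap ('-'), following
    the README's description of -XB (an assumption, not FASTA's code).
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have the same length")

    columns = list(zip(query_aln, subject_aln))
    # Identical residues; a gap never counts as an identity.
    identical = sum(1 for q, s in columns if q == s and q != "-")

    if blast_model:
        length = sum(1 for q, s in columns if q != "-" and s != "-")
    else:
        length = len(columns)

    if length == 0:
        raise ValueError("alignment has no scorable columns")
    return 100.0 * identical / length, length


# A 10-column alignment with a 3-residue gap in the subject.
default_pid, default_len = percent_identity("ACDEFGHIKL", "ACD---HIKL")
blast_pid, blast_len = percent_identity("ACDEFGHIKL", "ACD---HIKL",
                                        blast_model=True)
```

With the gap excluded, the same alignment reports a higher identity over a shorter length, which is the trade-off the option introduces.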
75 | Changes in **fasta-36.3.8g** released 31-Dec-2017 | |
76 | ||
77 | 1. (December, 2017) -- Make statistical thresholds more robust for small E()-values with normally distributed scores (`ggsearch36`,`glsearch36`). | |
78 | ||
79 | 2. (September, 2017) Treat lower-case queries with no upper-case residues as uppercase with `-S` option. | |
80 | ||
81 | 3. (May, 2017) Improvements/fixes to sub-alignment scoring strategies. | |
82 | ||
83 | 4. Improvements/fixes to psisearch2 scripts. | |
84 | ||
85 | For more detailed information, see `doc/readme.v36`. | |
86 |
23 | 23 | </small> |
24 | 24 | </pre> |
25 | 25 | <hr> |
26 | <h2>Latest Updates - FASTA version 36.3.8d (April, 2016)</h2> | |
27 | <ol> | |
28 | <li> | |
29 | The <tt>fasta-36.3.8d/scripts/</tt> directory now provides a | |
30 | script, <tt>annot_blast_btop2.pl</tt> that allows annotations and | |
31 | sub-alignment scoring on BLAST alignments that use the tabular format | |
32 | with BTOP alignment encoding. | |
33 | <p> | |
34 | <li> | |
35 | Bug fixes for overlapping domain domain scoring. v36.3.7 was not thread-safe. | |
36 | <li> | |
37 | Annotation scripts accessing the Pfam domain database can now use | |
38 | the <tt>--vdoms</tt> option to highlight missing parts of a Pfam | |
39 | domain model. In addtion, domains from clans are labeled as clans | |
40 | unless <tt>--no-clans</tt> is specified. | |
41 | </ol> | |
42 | <h2>Updates - FASTA version 36.3.7 (November, 2014)</h2> | |
26 | <h2>Latest Updates - FASTA version 36.3.8h (March, 2019)</h2> | |
43 | 27 | <ol> |
44 | 28 | <li>The FASTA programs have been released under the Apache2.0 Open |
45 | 29 | Source License. The COPYRIGHT file, and copyright notices in |
46 | 30 | program files, have been updated to reflect this change. |
47 | 31 | <p> |
32 | <li> | |
33 | fasta-36.3.8h includes bug fixes for translated alignments | |
34 | with termination codons, the ability to use scripts as query | |
35 | and library sequences, and new scripts for extracting genomic | |
36 | DNA sequences given chromosome coordinates. | |
37 | <li> | |
38 | fasta-36.3.8g includes bug fixes for sub-alignment scoring and | |
39 | psisearch2 scripts, new annotation scripts for exons, and | |
40 | fixes enabling very low statistical thresholds with ggsearch36 | |
41 | and glsearch36. | |
42 | <li> | |
43 | fasta-36.3.8e/scripts includes updated scripts for | |
44 | capturing domain and feature annotations using the | |
45 | EBI/proteins API (https://www.ebi.ac.uk/proteins/api/) to get | |
46 | Uniprot annotations and exon locations. | |
47 | <p> | |
48 | <li> | |
49 | The <tt>fasta-36.3.8e/psisearch2/</tt> directory now | |
50 | provides <tt>psisearch2_msa.pl</tt> | |
51 | and <tt>psisearch2_msa.py</tt>, functionally identical scripts | |
52 | for iterative searching with <tt>psiblast</tt> | |
53 | or <tt>ssearch36</tt>. <tt>psisearch2_msa.pl</tt> offers an | |
54 | option, <tt>--query_seed</tt>, that can dramatically reduce | |
55 | false-positives caused by alignment overextension, with very | |
56 | little loss of search sensitivity. | |
57 | <p> | |
58 | <li> | |
59 | The <tt>fasta-36.3.8d/scripts/</tt> directory now provides a | |
60 | script, <tt>annot_blast_btop2.pl</tt> that allows annotations and | |
61 | sub-alignment scoring on BLAST alignments that use the tabular format | |
62 | with BTOP alignment encoding. | |
63 | <p> | |
48 | 64 | <li>Alignment sub-scoring scripts have been extended to allow |
49 | 65 | overlapping domains. This requires a modified annotation file format. |
50 | 66 | The "classic" format placed the beginning and end of a domain on different lines: |
69 | 85 | </pre> |
70 | 86 | <p> |
71 | 87 | <li> New annotation scripts are available in |
72 | the <tt>fasta-36.3.7/scripts</tt> directory, | |
88 | the <tt>fasta-36.3.8/scripts</tt> directory, | |
73 | 89 | e.g. <tt>ann_pfam_www_e.pl</tt> (Pfam) and <tt>ann_up_www2_e.pl</tt> |
74 | 90 | (Uniprot) to support this new format. If the domain annotations |
75 | 91 | provided by Pfam or Uniprot overlap, then overlapping domains are |
Binary diff not shown
266 | 266 | with a '$>$' character, followed by the sequence itself: |
267 | 267 | \begin{quote} |
268 | 268 | \begin{verbatim} |
269 | >sequence name and description 1 | |
269 | >sequence_name1 and description | |
270 | 270 | A F A S Y T .... actual sequence. |
271 | 271 | F S S .... second line of sequence. |
272 | >sequence name and description 2 | |
272 | >sequence_name2 and description | |
273 | 273 | PMILTYV ... sequence 2 |
274 | 274 | \end{verbatim} |
275 | 275 | \end{quote} |
276 | 276 | All of the characters of the description line are read, and special |
277 | 277 | characters can be used to indicate additional information about the |
278 | sequence. In general, non-amino-acid/non-nucleotide sequences in the | |
279 | sequence lines are ignored. | |
278 | sequence. In particular, a \texttt{'@:C 12345'} at the end of the | |
279 | description line indicates that the first residue of the sequence has | |
280 | coordinate \texttt{'12345'}, instead of starting at \texttt{'1'}. | |
281 | Coordinates can be negative; a DNA sequence upstream from the start of | |
282 | transcription could be displayed with negative coordinates. | |
283 | ||
284 | In general, non-amino-acid/non-nucleotide sequences in the sequence | |
285 | lines are ignored, with the exception of \texttt{'*'}, which indicates | |
286 | a termination codon in a protein sequence, and can be used to indicate | |
287 | the match to a termination codon in protein:DNA alignments. | |
280 | 288 | |
281 | 289 | FASTA format files from major sequence distributors, like the NCBI and |
282 | 290 | EBI, have specially formatted description lines, e.g.:\\ |
283 | 291 | \indent |
284 | 292 | \texttt{ |
285 | >gi|54321|ref|np\_12345| example NCBI refseq sequence\\ | |
293 | >np\_12345| example NCBI refseq sequence\\ | |
286 | 294 | } |
287 | 295 | or\\ |
288 | 296 | \indent |
289 | 297 | \texttt{ |
290 | >sw:gstm1\_human P01234 glutathione transferase GSTM1 - human\\ | |
298 | >sp:gstm1\_human P01234 glutathione transferase GSTM1 - human\\ | |
299 | } | |
300 | or | |
301 | \indent | |
302 | \texttt{ | |
303 | >sp|P09488|GSTM1\_HUMAN glutathione transferase GSTM1 - human\\ | |
291 | 304 | } |
292 | 305 | |
293 | 306 | Several sample test files are included with the FASTA distribution: |
851 | 864 | comments, \texttt{-m 8XC} without comments) and, if available, an |
852 | 865 | annotation encoding matching FASTA \texttt{-m 9C} output. All the |
853 | 866 | \texttt{-m 9c/C/d/D} encodings are available with BLAST tabular |
854 | output using \texttt{-m 8C[c/C/d/D]}. | |
867 | output using \texttt{-m 8C[c/C/d/D]}. In the v36.3.8h release, a | |
868 | new option has been added to \texttt{-m 8CB}, \texttt{-m 8CBL} (or | |
869 | \texttt{-m 8CBl}). The \texttt{L/l} option adds the lengths of the | |
870 | query and subject sequences after the \texttt{seqid}'s to BLAST | |
871 | tabular output, e.g. \texttt{qseqid qlen sseqid slen percid ...} | |
855 | 872 | |
856 | 873 | \item[\texttt{-m 9}] display alignment coordinates and scores with the |
857 | 874 | best score information. \texttt{-m 9i} provides alignment length, |
925 | 942 | \texttt{1M1X2M4X2M1X2M7X3M9D1M2X1M4X2M1X1M1X2I1X1M1X1M3X1M2X1I3M1D1X1M2X1M} |
926 | 943 | \end{footnotesize} |
927 | 944 | \item[\texttt{-m 10}] |
928 | a parseable format for use with other programs. | |
945 | a parseable format for use with other programs (this option is no longer reliably tested; \texttt{-m 8CBL} is easier to parse and tested more extensively). | |
929 | 946 | \item[\texttt{-m 11}] |
930 | 947 | Provide \texttt{lav}-like output (used by \texttt{lalign}) for graphical output. |
931 | 948 | \begin{quote} |
1123 | 1140 | programs. (There is an option in the \texttt{Makefile}, |
1124 | 1141 | \texttt{-DDNALIB\_LC}, to enable preserving case in DNA sequences.) |
1125 | 1142 | |
1126 | \item[\texttt{-t \#}] | |
1127 | Translation table - fastx36, tfastx36, fasty36, and | |
1128 | tfasty3 now support the BLAST translation tables. See | |
1129 | \url{http://www.ncbi.nih.gov/Taxonomy/Utils/wprintgc.cgi}. | |
1130 | ||
1131 | \texttt{-t t} or \texttt{-t t\#} enables the addition of | |
1132 | an implicit termination codon to a protein:translated DNA match. That | |
1133 | is, each protein sequence implicitly ends with \texttt{*}, which | |
1134 | matches the termination codes for the appropriate genetic code. | |
1135 | \texttt{-t t\#} sets implicit termination and a different genetic | |
1136 | code. | |
1143 | \item[\texttt{-t \#}] Translation table - fastx36, tfastx36, fasty36, | |
1144 | and tfasty36 now support the BLAST translation tables. See | |
1145 | \url{http://www.ncbi.nih.gov/Taxonomy/Utils/wprintgc.cgi}. | |
1146 | ||
1147 | \texttt{-t 1} also enables translation of \texttt{'TGA'} to | |
1148 | \texttt{'U'} (seleno-cysteine) (by default, \texttt{'TGA'} is | |
1149 | translated to \texttt{'*'}). Because of the ambiguity of the | |
1150 | \texttt{'TGA'} codon, translated alignments of \texttt{'TGA'} with | |
1151 | \texttt{-t 1} match \texttt{'U'} and \texttt{'*'} (termination) | |
1152 | equally well. | |
1153 | ||
1154 | \texttt{-t t} enables the addition of an implicit termination codon to | |
1155 | a protein:translated DNA match. That is, each protein sequence | |
1156 | implicitly ends with \texttt{*}, which matches the termination codes | |
1157 | for the appropriate genetic code. To change the translation table and | |
1158 | insert a termination character after each protein sequence, use | |
1159 | \texttt{-t 1 -t t}. | |
1160 | ||
1137 | 1161 | \item[\texttt{-T \#}] |
1138 | 1162 | set number of threads/workers. Normally on a multi-core machine, the maximum |
1139 | 1163 | number of processors/cores is used. |
1348 | 1372 | \item[\texttt{X1}] sort output by \texttt{init1} score (for |
1349 | 1373 | compatibility with FASTP; obsolete). |
1350 | 1374 | |
1351 | \item[\texttt{XB}] (Previously \texttt{-B}.) Show the z-score, rather | |
1375 | \item[\texttt{XB}] Calculate percent identity, percent similarity, and | |
1376 | alignment length using the BLAST model, which excludes gapped residues. | |
1377 | This allows very high identity alignments with large gaps to look | |
1378 | much closer, but causes the alignment length to drop by the length | |
1379 | of the gap. | |
1380 | ||
1381 | \item[\texttt{Xb}] (Previously \texttt{-B}.) Show the z-score, rather | |
1352 | 1382 | than the bit-score in the list of best scores (rarely used, provided |
1353 | 1383 | for backward compatibility). |
1354 | 1384 | |
1794 | 1824 | 5 & NBRF/PIR VMS (\texttt{>P1;SEQID}/comment/sequence) (obsolete)\\ |
1795 | 1825 | 6 & GCG (version 8.0) Unix Protein and DNA (compressed)\\ |
1796 | 1826 | 7 & FASTQ (sequence only, quality ignored)\\ |
1827 | 9 & a script that is executed to produce FASTA format sequences \\ | |
1797 | 1828 | 10 & subset format (</slib2/swissprot.lseg 0:2 4|) \\ |
1798 | 1829 | 11 & NCBI Blast1.3.2 format (unix only) (obsolete)\\ |
1799 | 1830 | 12 & NCBI Blast2.0 format\\ |
1869 | 1900 | \section{Frequently Asked Questions (FAQs)} |
1870 | 1901 | |
1871 | 1902 | {\noindent}\textbf{Where can I get FASTA?} -- |
1872 | \url{http://faculty.virginia.edu/wrpearson/fasta} has the latest | |
1873 | versions of the FASTA programs. This document describes | |
1874 | \texttt{\CURRENT}, which is available from | |
1875 | \url{http://faculty.virginia.edu/wrpearson/fasta/fasta3.tar.gz}. | |
1876 | In addition, pre-compiled versions of the programs are available for | |
1903 | ||
1904 | The most current version of the FASTA source code is available from | |
1905 | \url{http://github.com/wrpearson/fasta36}. In addition, you can get | |
1906 | the programs from \url{http://faculty.virginia.edu/wrpearson/fasta}, | |
1907 | but sometimes there is a lag between the latest release on GITHUB and | |
1908 | the compiled versions at \url{faculty.virginia.edu}. This document | |
1909 | describes \texttt{\CURRENT}, which is available from | |
1910 | \url{http://faculty.virginia.edu/wrpearson/fasta/fasta3.tar.gz}. In | |
1911 | addition, pre-compiled versions of the programs are available for | |
1877 | 1912 | MacOSX and Windows. |
1878 | 1913 | |
1879 | 1914 | \needspace{4\baselineskip} |
1886 | 1921 | Prot. & Prot. & \texttt{fasta36} & \texttt{blastp} & heuristic local similarity \\ |
1887 | 1922 | & & \texttt{ssearch36} & & optimal local sim.\\ |
1888 | 1923 | & & \texttt{ggsearch36} & & global:global sim. \\
1889 | & & \texttt{ggearch36} & & global:local sim.\\ | |
1924 | & & \texttt{glsearch36} & & global:local sim.\\ | |
1890 | 1925 | DNA & DNA & \texttt{fasta36}$^*$ & \texttt{blastn} & \\[1.2ex] |
1891 | 1926 | \hline \\[-1.0ex] |
1892 | 1927 | Prot. & Prot. & \texttt{lalign36} & & multiple non-intersecting \\ |
2028 | 2063 | \begin{quote} |
2029 | 2064 | William R. Pearson\\ |
2030 | 2065 | Department of Biochemistry\\ |
2031 | Jordan Hall Box 800733\\ | |
2066 | Pinn Hall Box 800733\\ | |
2032 | 2067 | U. of Virginia\\ |
2033 | 2068 | Charlottesville, VA\\ |
2034 | 2069 | wrp@virginia.EDU |
0 | README_v36.3.8h.md⏎ |
110 | 110 | |
111 | 111 | This release provides an extremely efficient SSE2 implementation of |
112 | 112 | the Smith-Waterman algorithm for the SSE2 vector instructions written |
113 | by Michael Farrar (farrar.michael@gmail.com). The SSE code speeds up | |
113 | by Michael Farrar. The SSE code speeds up | |
114 | 114 | Smith-Waterman 8 - 10-fold in my tests, making it comparable to Eric |
115 | 115 | Lindahl's Altivec code for the Apple/IBM G4/G5 architecture. |
116 | 116 |
4 | 4 | multiple high-scoring alignments to be shown, rather than just one. |
5 | 5 | This is the main functional difference between FASTA and BLAST - |
6 | 6 | BLAST could show multiple HSPs, FASTA did not. |
7 | ||
8 | >>Aug. 9, 2019 | |
9 | [src/ncbl2_mlib.c, ncbl2_head.h] | |
10 | ||
11 | Modest extensions made to support reading makeblastdb format v5 | |
12 | databases. Changes have only been made to read the db.pin file, but | |
13 | things work in simple tests. | |
14 | ||
15 | >>July 16, 2019 | |
16 | [src/comp_lib9.c] | |
17 | ||
18 | Fixed a memory leak problem when searching with large libraries that | |
19 | could be memory mapped (libraries with .xin index files). If the | |
20 | library did not fit in memory, then the program kept allocating new memory. | |
21 | By default, the largest database that fits in memory must be less than | |
22 | 16 GB. Larger libraries will be re-read, which slows down multi-query | |
23 | searches considerably. To increase the size of the library allowed in | |
24 | memory, use the option: "-X M32G" to fit 32 GB libraries. | |
25 | ||
26 | >>Mar. 8, 2019 | |
27 | [src/initfa.c,faatran.c,dropfx2.c] | |
28 | Modify translation table 1 to allow selenocysteine translation | |
29 | (TGA->'U'), and modify scoring matrices to give positive scores to | |
30 | '*':'U'. The translation modification ONLY works with "-t 1". In | |
31 | addition, BLAST BTOP alignments (-m 8CB) convert a 'U' aligned with a | |
32 | '*' to a '*', so the end of the alignment is '**' rather than 'U*' | |
33 | (fastx36) or '*U' (tfastx36). | |
34 | ||
35 | dropfx2.c (fastx36/tfastx36), dropfz3.c(fasty36/tfasty36) did not | |
36 | properly switch protein and translated DNA codes with -m 8CB -- fixed. | |
37 | ||
38 | version date updated to Mar, 2019 | |
39 | ||
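The translation change in the Mar. 8, 2019 entry above can be illustrated with a short Python sketch. It is not the code in faatran.c: the function name `translate` and the flag `tga_as_sec` are hypothetical, and only the standard genetic code (NCBI translation table 1) is included. The equal scoring of 'U' and '*' in the matrices is not modeled here.

```python
# Standard genetic code (NCBI translation table 1), with codons
# ordered by base T, C, A, G at each of the three positions.
BASES = "TCAG"
STANDARD_AAS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
                "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

CODON_TABLE = {
    b1 + b2 + b3: STANDARD_AAS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str, tga_as_sec: bool = False) -> str:
    """Translate DNA in frame 1; incomplete trailing codons are dropped.

    tga_as_sec=True mimics the changelog's description of '-t 1':
    'TGA' becomes 'U' (selenocysteine) instead of '*'.
    Codons with unknown bases are translated as 'X'.
    """
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[pos:pos + 3]
        if tga_as_sec and codon == "TGA":
            protein.append("U")
        else:
            protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


default_protein = translate("ATGTGATAA")                 # TGA read as '*'
sec_protein = translate("ATGTGATAA", tga_as_sec=True)    # TGA read as 'U'
```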
40 | >>Feb. 26, 2019 | |
41 | [scripts/get_genome_seq.py] | |
42 | added get_genome_seq.py as a replacement for get_hg38_bed.py, remove | |
43 | get_hg38_bed.py. 'get_genome_seq.py --genome mm10' also produces | |
44 | sequences from mouse mm10 (and can now do any genome that bedtools can | |
45 | read). | |
46 | ||
47 | >>Feb. 23, 2019 | |
48 | [src/comp_lib9.c, mshowbest.c] | |
49 | Modify repeat_thresh so that poor alignment scores (E() > | |
50 | ppst->e_cut_r, typically -E-threshold/10.0) do not look for additional | |
51 | alignments. | |
52 | ||
53 | >>Feb. 21, 2019 | |
54 | [src/nmgetaa.c, scaleswn.c, scripts/get_protein.py, get_hg38_bed.py] | |
55 | ||
56 | Modify nmgetaa.c to ignore ':'s (for sequence subsets) in scripts. | |
57 | The script can do the subsetting. Modify scripts/get_protein.py to | |
58 | provide subsetting. Add scripts/get_hg38_bed.py to extract fasta | |
59 | sequences using the format "chr2:123456-543210" | |
60 | ||
61 | Modify scaleswn.c to estimate Altschul-Gish parameters when gap and | |
62 | extension do not match exactly. | |
63 | ||
64 | >>Feb. 6, 2019 | |
65 | [src/compacc2e.c, nmgetaa.c] | |
66 | modify build_link_data() to allow '+' for space in scripts. Ensure | |
67 | that lib_type is properly initialized (open_lib.c()). | |
68 | ||
69 | >>Jan. 23, 2019 | |
70 | [nmgetaa.c] | |
71 | Fix bug introduced when checking for lib_type. | |
72 | ||
73 | >>Jan. 15, 2019 | |
74 | [src/upam.h, altlib.h, nmgetaa.c] | |
75 | [scripts/rename_exons.py, map_exons_coords.py, get_uniprot.py, get_refseq.py, get_proteins.py] | |
76 | ||
77 | Bug fixes: The VT10, VT20, etc scoring matrices did not have scores for '*:*' | |
78 | alignments, used with FASTX/TFASTX for extending alignments through | |
79 | the termination codon. As a result, searches with '-t t' did not | |
80 | extend through the termination codon, even though they should have. | |
81 | This has been fixed. | |
82 | ||
83 | Enhancements: FASTA can now download both query and library sequences using a script, by specifying file type 9. Thus: | |
84 | ||
85 | fasta36 "../scripts/get_uniprot.py+P09488 9" /seqlib/swissprot.fasta | |
86 | ||
87 | Will run the script "get_uniprot.py" with the argument "P09488" and | |
88 | use the output of the script as the query sequence. In this example, | |
89 | the library type (9) is specified by the " 9" (this space cannot be | |
90 | replaced with a '+' character). | |
91 | ||
92 | Alternatively, library type '9' can be specified by putting a '!' before the script file name. | |
93 | ||
94 | fasta36 \!../scripts/get_uniprot.py+P09488 /seqlib/swissprot.fasta | |
95 | ||
96 | Scripts can be used to produce query or library sequences, or both. | |
97 | Three scripts that download sequences from the NCBI and Uniprot have | |
98 | been added in the "scripts" directory: "get_uniprot.py" takes Uniprot | |
99 | accessions as arguments, "get_refseq.py" takes refseq accessions | |
100 | (protein or mRNA), and "get_protein.py" gets both Uniprot and RefSeq | |
101 | protein sequences. | |
102 | ||
103 | rename_exons.py and map_exons_coords.py can take annotated BTOP | |
104 | alignments with genome coordinates and map exons to the alternative | |
105 | genome. | |
106 | ||
107 | >>Jan. 2, 2019 | |
108 | [src/mshowbest.c] | |
109 | Fix problems with site annotation when dom_info is provided with -m8CBL | |
110 | [scripts/ann_exons_up_sql.pl, ann_exons_up_www.pl] | |
111 | Make scripts more robust to missing chromosome information, | |
112 | reverse-strand coordinates. | |
113 | ||
114 | >>Dec. 11, 2018 | |
115 | [scripts/ann_exons_up_www.pl, ann_exons_up_sql.pl] | |
116 | Add the option "--gen_coord" to report exon start ('<') and end ('>') | |
117 | genome coordinates features of exons. | |
118 | ||
119 | >>Nov. 14, 2018 | |
120 | [scripts/rename_exons.py, relabel_domains.py, compacc2e.c] | |
121 | ||
122 | Two new scripts, rename_exons.py and relabel_domains.py, take a | |
123 | blast tabular output file with domain alignment annotations (and | |
124 | possibly raw domain information) and modify the names | |
125 | (rename_exons.py) or colors (relabel_domains.py). rename_exons.py | |
126 | takes the exon numbering associated with the query sequence and maps | |
127 | it onto the subject alignments. relabel_domains.py can be used to assign | |
128 | different color numbers to homologous and non-homologous domains. | |
129 | ||
130 | Both of these programs modify blast tabular output files, which can | |
131 | then be merged back into an alignment display using | |
132 | merge_blastp_annot.pl or merge_fasta_annot.pl. | |
133 | ||
134 | compacc2.c:build_link_data() has been modified to convert '+' in the | |
135 | script string to ' ', to allow passing command line options. A space | |
136 | in the script string is used to separate the script from the library | |
137 | type of the file returned by the script. | |
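
The '+' handling described above can be illustrated with a short Python sketch. This is not the C code in build_link_data(); the function name split_script_spec is hypothetical, and treating a leading '!' as library type 9 and the first space as the script/library-type separator are assumptions based on the description in this file.

```python
def split_script_spec(spec):
    """Split a script string into (command, library_type).

    Hypothetical helper for illustration; not part of the FASTA package.
    """
    lib_type = None
    if spec.startswith("!"):
        # Assumption: a leading '!' marks the file as library type 9.
        spec = spec[1:]
        lib_type = "9"
    # The first real space separates the script from the library type.
    script, _, trailing = spec.partition(" ")
    if trailing.strip():
        lib_type = trailing.strip()
    # '+' stands in for ' ' so that arguments survive shell quoting.
    command = script.replace("+", " ")
    return command, lib_type
```

Because the conversion is applied only to the script part, the space before the library type has to be a real space, which matches the note that it cannot be replaced with '+'.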
138 | ||
139 | >>Nov. 6-7, 2018 | |
140 | [doinit.c, mshowbest.c, mshowalign2.c, defs.h, structs.h] | |
141 | ||
142 | (a) Add options to provide query and subject sequence lengths and raw | |
143 | domain coordinates in BLASTP tabular output with the options -m 8CBl | |
144 | and -m 8CBL. If domain annotations are available, -m 8CBL also | |
145 | provides the raw domain coordinates (not just those included in the | |
146 | alignment) in the form |DX:1-100;C=PF12345|XD:1-100;C=PF12345 where | |
147 | |DX indicates a query annotation and |XD indicates a subject annotation. -m | |
148 | 8CBl (lower-case L) shows the sequence lengths, but not the raw domain | |
149 | info. | |
150 | ||
151 | (b) parse the annotation program strings so that '+' are converted to | |
152 | ' '. This greatly simplifies passing arguments to the annotation scripts. Thus: | |
153 | ||
154 | -V \!ann_pfam_sql.pl --db=pfam31 --neg --vdoms can be written as: | |
155 | -V \!ann_pfam_sql.pl+--db=pfam31+--neg+--vdoms (likewise for -V q\!ann_pfam...) | |
156 | ||
157 | (c) provide an option to remove region/feature annotations from non-m8 | |
158 | (blast-tabular) output. This simplifies the process of using | |
159 | scripts/merge_fasta_btab.pl to use .bl_tab (-m 8CBL) files to inject | |
160 | sub-alignment scores and domain information. | |
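
The raw domain coordinate format described in (a) can be read back with a few lines of Python. This is a minimal sketch that assumes the field looks exactly like the example |DX:1-100;C=PF12345|XD:1-100;C=PF12345; the function name parse_raw_domains and the returned tuple layout are hypothetical, and real -m 8CBL output may carry additional fields that this pattern does not handle.

```python
import re

# One entry: |DX or |XD, a start-end range, then ";C=" and the domain name.
_RAW_DOMAIN = re.compile(r"\|(DX|XD):(\d+)-(\d+);C=([^|]+)")


def parse_raw_domains(field):
    """Return a list of (side, start, end, name) tuples.

    Hypothetical helper for illustration; side is "query" for |DX
    entries and "subject" for |XD entries.
    """
    side = {"DX": "query", "XD": "subject"}
    return [
        (side[tag], int(start), int(end), name)
        for tag, start, end, name in _RAW_DOMAIN.findall(field)
    ]
```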
161 | ||
162 | >>Nov. 1, 2018 | |
163 | [doinit.c] | |
164 | Allow -m F#=file.name in addition to -m "F# file.name" to address | |
165 | problems I had with spaces in shell scripts. | |
166 | ||
167 | >>Oct. 23, 2018 [re-released as fasta-36.3.8g] (see README_v36.3.8g.md) | |
168 | [make/Makefiles*,psisearch2/m89_btop_msa2.pl] | |
169 | ||
170 | Add options to psisearch2/m89_btop_msa2.pl to provide clustalw header | |
171 | (--clustal), require a minimum coverage of the query sequence | |
172 | (--min_align 0.8), and edit sequence identifiers to remove database | |
173 | and accession (--trunc_acc). | |
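
As an illustration of the minimum-coverage idea behind --min_align 0.8, the following Python sketch keeps an alignment only if it spans at least the given fraction of the query. How m89_btop_msa2.pl actually computes coverage is not stated here, so the formula, the function name passes_min_align, and its parameters are assumptions.

```python
def passes_min_align(q_start, q_end, q_len, min_align=0.8):
    """Return True if the aligned query region covers enough of the query.

    Hypothetical helper for illustration; coordinates are 1-based and
    inclusive, and coverage is taken as aligned length / query length.
    """
    aligned = q_end - q_start + 1
    return aligned / q_len >= min_align
```

With the default threshold, an alignment covering residues 1-90 of a 100-residue query would be kept, and one covering residues 1-50 would be dropped.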
174 | ||
175 | Remove -lz dependency from non-debug Makefiles. | |
176 | ||
177 | >>Aug. 5, 2018 [re-released as fasta-36.3.8g] | |
178 | [lib_sel.c] | |
179 | Make lib_select.c more robust to missing indirect name files. | |
180 | [scripts/ann*.pl] | |
181 | update various annotation scripts to use https:// instead of http:// | |
182 | ||
183 | >>April 3, 2018 | |
184 | [initfa.c, comp_lib.c, dropfx2.c] | |
185 | Changes to (a) ensure that the "-t t" option correctly inserts and | |
186 | aligns a termination codon '*'; (b) change -m 8CB, -m8CC, and -m9C | |
187 | so that aligned termination codons are indicated as "**" (-m8CB) or | |
188 | "*1" (-m8CC, -m9C). | |
189 | ||
190 | >>Mar. 9, 2018 | |
191 | [scripts/annot_blast_btop2.pl, merge_blast_btab.pl, blastp_annot_cmd.sh] | |
192 | Code is now in place to provide sub-alignment scoring using domain | |
193 | annotations with blastp searches (BLOSUM62 only). blastp_annot_cmd.sh | |
194 | runs blast and produces both a standard HTML and a tabular output | |
195 | file. It then runs annot_blast_btop2.pl to add sub-alignment scoring | |
196 | to the tabular output file, and then merge_blast_btab.pl merges the | |
197 | domain-annotated blast tabular file with the HTML output file. When | |
198 | combined in this way, the FASTA web server (fasta.bioch.virginia.edu) | |
199 | can produce blastp searches with domain highlights/scoring. | |
200 | ||
201 | >>Feb. 6, 2018 | |
202 | [initfa.c, doinit.c, mshowbest.c, mshowalign2.c] | |
203 | Add a new extended option, -XB, which causes percent identity, percent | |
204 | similarity, and alignment length to be presented using the BLAST | |
205 | model, which does not count gaps in the alignment length. | |
7 | 206 | |
8 | 207 | >>Dec. 30, 2017 [released as fasta-36.3.8g] |
9 | 208 | [scaleswn.c] |
0 | # $ Id: $ | |
1 | # | |
2 | # makefile for fasta3, fasta3_t Use Makefile.mpi for fasta36_mpi | |
3 | # | |
4 | # This file is designed for 64-bit Linux systems using an X86 | |
5 | # architecture with SSE2 extensions. -D_LARGEFILE64_SOURCE and | |
6 | # -DBIG_LIB64 require a 64-bit linux system. | |
7 | # SSE2 extensions are used for ssearch35(_t) | |
8 | # | |
9 | # Use Makefile.linux32_sse2 for 32-bit linux x86 | |
10 | # | |
11 | ||
12 | SHELL=/bin/bash | |
13 | ||
14 | CC = gcc -g -O -msse2 | |
15 | LIB_DB= | |
16 | ||
17 | #CC= gcc -pg -g -O -msse2 -ffast-math | |
18 | #CC = gcc -g -DDEBUG -msse2 | |
19 | #CC=gcc -Wall -pedantic -ansi -g -msse2 -DDEBUG | |
20 | ||
21 | # EBI uses the following with pgcc, -O3 does not work: | |
22 | # CC= pgcc -O2 -pipe -mcpu=pentiumpro -march=pentiumpro -fomit-frame-pointer | |
23 | ||
24 | # this file works for x86 LINUX | |
25 | ||
26 | # standard options | |
27 | ||
28 | CFLAGS= -DSHOW_HELP -DSHOWSIM -DUNIX -DTIMES -DHZ=100 -DMAX_WORKERS=8 -DTHR_EXIT=pthread_exit -DM10_CONS -D_REENTRANT -DHAS_INTTYPES -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -DUSE_FSEEKO -DSAMP_STATS -DPGM_DOC -DUSE_MMAP -D_LARGEFILE64_SOURCE -DBIG_LIB64 | |
29 | # -I/usr/include/mysql -DMYSQL_DB | |
30 | # -DSUPERFAMNUM -DSFCHAR="'|'" | |
31 | ||
32 | # | |
33 | #(for mySQL databases) (also requires change to Makefile36m.common or use of Makefile36m.common_mysql) | |
34 | # run 'mysql_config' to find locations of mySQL files | |
35 | ||
36 | LIB_M = -lm | |
37 | # for mySQL databases | |
38 | # LIB_M = -L/usr/lib64/mysql -lmysqlclient -lm | |
39 | ||
40 | HFLAGS= -o | |
41 | NFLAGS= -o | |
42 | ||
43 | # for Linux | |
44 | THR_SUBS = pthr_subs2 | |
45 | THR_LIBS = -lpthread | |
46 | THR_CC = | |
47 | ||
48 | BIN = ../bin | |
49 | XDIR = /seqprg/bin | |
50 | #XDIR = ~/bin/LINUX | |
51 | ||
52 | # set up files for SSE2/Altivec acceleration | |
53 | # | |
54 | include ../make/Makefile.sse_alt | |
55 | ||
56 | # SSE2 acceleration | |
57 | # | |
58 | DROPGSW_O = $(DROPGSW_SSE_O) | |
59 | DROPLAL_O = $(DROPLAL_SSE_O) | |
60 | DROPGNW_O = $(DROPGNW_SSE_O) | |
61 | DROPLNW_O = $(DROPLNW_SSE_O) | |
62 | ||
63 | # renamed (fasta36) programs | |
64 | include ../make/Makefile36m.common | |
65 | # conventional (fasta3) names | |
66 | # include ../make/Makefile.common |
12 | 12 | |
13 | 13 | #CC= gcc -g -O |
14 | 14 | #CC = gcc -g -DDEBUG |
15 | #LIB_DB= | |
15 | 16 | |
16 | 17 | #CC=gcc -Wall -pedantic -ansi -g -O |
17 | 18 | CC= /usr/local/parasoft/bin/insure -g -DDEBUG |
19 | LIB_DB=-lz | |
18 | 20 | |
19 | 21 | # EBI uses the following with pgcc, -O3 does not work: |
20 | 22 | # CC= pgcc -O2 -pipe -mcpu=pentiumpro -march=pentiumpro -fomit-frame-pointer |
12 | 12 | SHELL=/bin/bash |
13 | 13 | |
14 | 14 | CC= gcc -g -O -msse2 -ffast-math |
15 | LIB_DB= | |
15 | 16 | #CC = gcc -g -DDEBUG -msse2 |
16 | 17 | |
17 | 18 | #CC= /usr/local/parasoft/bin/insure -g -DDEBUG |
19 | #LIB_DB=-lz | |
18 | 20 | |
19 | 21 | #CC=gcc -Wall -pedantic -ansi -g -O |
20 | 22 |
0 | # $ Id: $ | |
1 | # | |
2 | # makefile for fasta3, fasta3_t Use Makefile.mpi for fasta36_mpi | |
3 | # | |
4 | # This file is designed for 64-bit Linux systems using an X86 | |
5 | # architecture with SSE2 extensions. -D_LARGEFILE64_SOURCE and | |
6 | # -DBIG_LIB64 require a 64-bit linux system. | |
7 | # SSE2 extensions are used for ssearch35(_t) | |
8 | # | |
9 | # Use Makefile.linux32_sse2 for 32-bit linux x86 | |
10 | # | |
11 | ||
12 | SHELL=/bin/bash | |
13 | ||
14 | CC = gcc -g -O -msse2 | |
15 | LIB_DB= | |
16 | ||
17 | #CC= gcc -pg -g -O -msse2 -ffast-math | |
18 | #CC = gcc -g -DDEBUG -msse2 | |
19 | #CC=gcc -Wall -pedantic -ansi -g -msse2 -DDEBUG | |
20 | ||
21 | # EBI uses the following with pgcc, -O3 does not work: | |
22 | # CC= pgcc -O2 -pipe -mcpu=pentiumpro -march=pentiumpro -fomit-frame-pointer | |
23 | ||
24 | # this file works for x86 LINUX | |
25 | ||
26 | # standard options | |
27 | ||
28 | CFLAGS= -DSHOW_HELP -DSHOWSIM -DUNIX -DTIMES -DHZ=100 -DMAX_WORKERS=8 -DTHR_EXIT=pthread_exit -DM10_CONS -D_REENTRANT -DHAS_INTTYPES -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -DUSE_FSEEKO -DSAMP_STATS -DPGM_DOC -DUSE_MMAP -D_LARGEFILE64_SOURCE -DBIG_LIB64 | |
29 | # -I/usr/include/mysql -DMYSQL_DB | |
30 | # -DSUPERFAMNUM -DSFCHAR="'|'" | |
31 | ||
32 | # | |
33 | #(for mySQL databases) (also requires change to Makefile36m.common or use of Makefile36m.common_mysql) | |
34 | # run 'mysql_config' to find locations of mySQL files | |
35 | ||
36 | LIB_M = -lm | |
37 | # for mySQL databases | |
38 | # LIB_M = -L/usr/lib64/mysql -lmysqlclient -lm | |
39 | ||
40 | HFLAGS= -o | |
41 | NFLAGS= -o | |
42 | ||
43 | # for Linux | |
44 | THR_SUBS = pthr_subs2 | |
45 | THR_LIBS = -lpthread | |
46 | THR_CC = | |
47 | ||
48 | BIN = ../bin | |
49 | XDIR = /seqprg/bin | |
50 | #XDIR = ~/bin/LINUX | |
51 | ||
52 | # set up files for SSE2/Altivec acceleration | |
53 | # | |
54 | include ../make/Makefile.sse_alt | |
55 | ||
56 | # SSE2 acceleration | |
57 | # | |
58 | DROPGSW_O = $(DROPGSW_SSE_O) | |
59 | DROPLAL_O = $(DROPLAL_SSE_O) | |
60 | DROPGNW_O = $(DROPGNW_SSE_O) | |
61 | DROPLNW_O = $(DROPLNW_SSE_O) | |
62 | ||
63 | # renamed (fasta36) programs | |
64 | include ../make/Makefile36m.common | |
65 | # conventional (fasta3) names | |
66 | # include ../make/Makefile.common |
12 | 12 | SHELL=/bin/bash |
13 | 13 | |
14 | 14 | CC = gcc -g -O -msse2 |
15 | LIB_DB= | |
16 | ||
15 | 17 | #CC= gcc -pg -g -O -msse2 -ffast-math |
16 | 18 | #CC = gcc -g -DDEBUG -msse2 |
17 | 19 | #CC=gcc -Wall -pedantic -ansi -g -msse2 -DDEBUG |
7 | 7 | SHELL=/bin/bash |
8 | 8 | |
9 | 9 | CC= icc -g -O3 |
10 | LIB_DB= | |
10 | 11 | #CC = icc -g -DDEBUG |
12 | #LIB_DB=-lz | |
11 | 13 | |
12 | 14 | #CC=gcc -Wall -pedantic -ansi -g -O |
13 | 15 | #CC= /usr/local/parasoft/bin/insure -g -DDEBUG |
8 | 8 | |
9 | 9 | SHELL=/bin/bash |
10 | 10 | |
11 | CC= icc -O3 -g | |
11 | CC= icc -O3 -g -pthread | |
12 | LIB_DB= | |
12 | 13 | #CC = icc -g -DDEBUG |
14 | #LIB_DB=-lz | |
13 | 15 | |
14 | 16 | #CC=gcc -Wall -pedantic -ansi -g -O |
15 | 17 | #CC= /usr/local/parasoft/bin/insure -g -DDEBUG |
10 | 10 | SHELL=/bin/bash |
11 | 11 | |
12 | 12 | CC= gcc -g -O2 |
13 | LIB_DB= | |
13 | 14 | #CC= gcc -g -DDEBUG |
15 | #LIB_DB=-lz | |
14 | 16 | |
15 | 17 | # this file works for x86 LINUX |
16 | 18 |
10 | 10 | SHELL=/bin/bash |
11 | 11 | |
12 | 12 | CC= gcc -g -O |
13 | LIB_DB= | |
13 | 14 | #CC= gcc -g -DDEBUG |
15 | #LIB_DB=-lz | |
14 | 16 | #CC=/opt/parasoft/bin.linux2/insure -g -DDEBUG |
15 | 17 | |
16 | 18 | # this file works for x86 LINUX |
10 | 10 | SHELL=/bin/bash |
11 | 11 | |
12 | 12 | CC= gcc -g -O |
13 | LIB_DB= | |
13 | 14 | #CC= gcc -g -DDEBUG |
15 | #LIB_DB=-lz | |
14 | 16 | #CC=/opt/parasoft/bin.linux2/insure -g -DDEBUG |
15 | 17 | |
16 | 18 | # this file works for x86 LINUX |
0 | # $ Id: $ | |
1 | # | |
2 | # makefile for fasta3, fasta3_t Use Makefile.mpi for fasta36_mpi | |
3 | # | |
4 | # This file is designed for 64-bit Linux systems using an X86 | |
5 | # architecture with SSE2 extensions. -D_LARGEFILE64_SOURCE and | |
6 | # -DBIG_LIB64 require a 64-bit linux system. | |
7 | # SSE2 extensions are used for ssearch35(_t) | |
8 | # | |
9 | # Use Makefile.linux32_sse2 for 32-bit linux x86 | |
10 | # | |
11 | ||
12 | SHELL=/bin/bash | |
13 | ||
14 | CC = gcc -g -O -msse2 | |
15 | LIB_DB= | |
16 | ||
17 | #CC= gcc -pg -g -O -msse2 -ffast-math | |
18 | #CC = gcc -g -DDEBUG -msse2 | |
19 | #CC=gcc -Wall -pedantic -ansi -g -msse2 -DDEBUG | |
20 | ||
21 | # EBI uses the following with pgcc, -O3 does not work: | |
22 | # CC= pgcc -O2 -pipe -mcpu=pentiumpro -march=pentiumpro -fomit-frame-pointer | |
23 | ||
24 | # this file works for x86 LINUX | |
25 | ||
26 | # standard options | |
27 | ||
28 | CFLAGS= -DSHOW_HELP -DSHOWSIM -DUNIX -DTIMES -DHZ=100 -DMAX_WORKERS=8 -DTHR_EXIT=pthread_exit -DM10_CONS -D_REENTRANT -DHAS_INTTYPES -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -DUSE_FSEEKO -DSAMP_STATS -DPGM_DOC -DUSE_MMAP -D_LARGEFILE64_SOURCE -DBIG_LIB64 | |
29 | # -I/usr/include/mysql -DMYSQL_DB | |
30 | # -DSUPERFAMNUM -DSFCHAR="'|'" | |
31 | ||
32 | # | |
33 | #(for mySQL databases) (also requires change to Makefile36m.common or use of Makefile36m.common_mysql) | |
34 | # run 'mysql_config' to find locations of mySQL files | |
35 | ||
36 | LIB_M = -lm | |
37 | # for mySQL databases | |
38 | # LIB_M = -L/usr/lib64/mysql -lmysqlclient -lm | |
39 | ||
40 | HFLAGS= -o | |
41 | NFLAGS= -o | |
42 | ||
43 | # for Linux | |
44 | THR_SUBS = pthr_subs2 | |
45 | THR_LIBS = -lpthread | |
46 | THR_CC = | |
47 | ||
48 | BIN = ../bin | |
49 | XDIR = /seqprg/bin | |
50 | #XDIR = ~/bin/LINUX | |
51 | ||
52 | # set up files for SSE2/Altivec acceleration | |
53 | # | |
54 | include ../make/Makefile.sse_alt | |
55 | ||
56 | # SSE2 acceleration | |
57 | # | |
58 | DROPGSW_O = $(DROPGSW_SSE_O) | |
59 | DROPLAL_O = $(DROPLAL_SSE_O) | |
60 | DROPGNW_O = $(DROPGNW_SSE_O) | |
61 | DROPLNW_O = $(DROPLNW_SSE_O) | |
62 | ||
63 | # renamed (fasta36) programs | |
64 | include ../make/Makefile36m.common | |
65 | # conventional (fasta3) names | |
66 | # include ../make/Makefile.common |
12 | 12 | |
13 | 13 | # in my hands, gcc-4.0 is about 40% slower than gcc-3.3 on the Altivec code |
14 | 14 | CC= gcc -g -O3 -arch ppc -falign-loops=32 -O3 -maltivec -mpim-altivec -force_cpusubtype_ALL |
15 | LIB_DB= | |
16 | ||
15 | 17 | # -pg -finstrument-functions -lSaturn |
16 | 18 | |
17 | 19 | #CC= gcc-3.3 -g -falign-loops=32 -O3 -mcpu=7450 -faltivec |
18 | 20 | #CC= gcc-3.3 -g -DDEBUG -mcpu=7450 -faltivec |
21 | #LIB_DB=-lz | |
19 | 22 | #CC= cc -g -Wall -pedantic -faltivec |
20 | 23 | # |
21 | 24 | # standard line for normal searching |
12 | 12 | SHELL=/bin/bash |
13 | 13 | |
14 | 14 | CC= gcc -g -O3 -arch i386 -msse2 |
15 | LIB_DB= | |
15 | 16 | #CC= gcc -g -DDEBUG -arch i386 -msse2 |
17 | #LIB_DB=-lz | |
16 | 18 | |
17 | 19 | #CC= cc -g -Wall -pedantic |
18 | 20 | # |
12 | 12 | SHELL=/bin/bash |
13 | 13 | |
14 | 14 | CC= cc -O -g -arch x86_64 -msse2 |
15 | LIB_DB= | |
16 | ||
15 | 17 | #CC= cc -g -DDEBUG -fsanitize=address -arch x86_64 -msse2 |
18 | #LIB_DB=-lz | |
16 | 19 | |
17 | 20 | #CC= cc -g -Wall -pedantic |
18 | 21 | # |
12 | 12 | SHELL=/bin/bash |
13 | 13 | |
14 | 14 | CC= clang -g -O -arch x86_64 -msse2 |
15 | LIB_DB= | |
15 | 16 | #CC= clang -g -DDEBUG -arch x86_64 -msse2 |
17 | #LIB_DB=-lz | |
16 | 18 | |
17 | 19 | #CC= cc -g -Wall -pedantic |
18 | 20 | # |
12 | 12 | SHELL=/bin/bash |
13 | 13 | |
14 | 14 | CC= icc -g -O -m64 # intel icc compiler |
15 | LIB_DB= | |
15 | 16 | #CC= icc -g -DDEBUG -m64 |
17 | #LIB_DB=-lz | |
16 | 18 | |
17 | 19 | #CC= cc -g -Wall -pedantic |
18 | 20 | # |
61 | 61 | pushd $(BIN); cp $(TPROGS) $(XDIR); popd |
62 | 62 | |
63 | 63 | fasta36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
64 | $(CC) $(HFLAGS) $(BIN)/fasta36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) | |
64 | $(CC) $(HFLAGS) $(BIN)/fasta36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) | |
65 | 65 | |
66 | 66 | fastx36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fx.o scale_se.o karlin.o drop_fx.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o |
67 | $(CC) $(HFLAGS) $(BIN)/fastx36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fx.o drop_fx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) | |
67 | $(CC) $(HFLAGS) $(BIN)/fastx36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fx.o drop_fx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) | |
68 | 68 | |
69 | 69 | fasty36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fy.o scale_se.o karlin.o drop_fz.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o |
70 | $(CC) $(HFLAGS) $(BIN)/fasty36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fy.o drop_fz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) | |
70 | $(CC) $(HFLAGS) $(BIN)/fasty36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fy.o drop_fz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) | |
71 | 71 | |
72 | 72 | fastf36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ff.o scaleswts.o last_tat.o tatstats_ff.o karlin.o $(DROPFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
73 | $(CC) $(HFLAGS) $(BIN)/fastf36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswts.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) | |
73 | $(CC) $(HFLAGS) $(BIN)/fastf36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswts.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) | |
74 | 74 | |
75 | 75 | fasts36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fs.o scaleswts.o last_tat.o tatstats_fs.o karlin.o $(DROPFS_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
76 | $(CC) $(HFLAGS) $(BIN)/fasts36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fs.o $(DROPFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) | |
76 | $(CC) $(HFLAGS) $(BIN)/fasts36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fs.o $(DROPFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) | |
77 | 77 | |
78 | 78 | fastm36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fm.o scaleswts.o last_tat.o tatstats_fm.o karlin.o $(DROPFM_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
79 | $(CC) $(HFLAGS) $(BIN)/fastm36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fm.o $(DROPFM_O) scaleswts.o last_tat.o tatstats_fm.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) | |
79 | $(CC) $(HFLAGS) $(BIN)/fastm36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fm.o $(DROPFM_O) scaleswts.o last_tat.o tatstats_fm.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) | |
80 | 80 | |
81 | 81 | tfastx36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfx.o scale_se.o karlin.o drop_tfx.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o |
82 | $(CC) $(HFLAGS) $(BIN)/tfastx36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfx.o drop_tfx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) | |
82 | $(CC) $(HFLAGS) $(BIN)/tfastx36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfx.o drop_tfx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) | |
83 | 83 | |
84 | 84 | tfasty36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfy.o scale_se.o karlin.o drop_tfz.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o |
85 | $(CC) $(HFLAGS) $(BIN)/tfasty36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfy.o drop_tfz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) | |
85 | $(CC) $(HFLAGS) $(BIN)/tfasty36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfy.o drop_tfz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) | |
86 | 86 | |
87 | 87 | tfastf36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(DROPTFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o |
88 | $(CC) $(HFLAGS) $(BIN)/tfastf36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) | |
88 | $(CC) $(HFLAGS) $(BIN)/tfastf36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) | |
89 | 89 | |
90 | 90 | tfastf36s : $(COMP_LIBO) $(COMPACC_SO) showsum.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o scaleswtf.o karlin.o $(DROPTFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o |
91 | $(CC) $(HFLAGS) $(BIN)/tfastf36s $(COMP_LIBO) $(COMPACC_SO) showsum.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) | |
91 | $(CC) $(HFLAGS) $(BIN)/tfastf36s $(COMP_LIBO) $(COMPACC_SO) showsum.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) | |
92 | 92 | |
93 | 93 | tfasts36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfs.o scaleswts.o tatstats_fs.o last_tat.o karlin.o $(DROPTFS_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o |
94 | $(CC) $(HFLAGS) $(BIN)/tfasts36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfs.o $(DROPTFS_O) scaleswts.o tatstats_fs.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) | |
94 | $(CC) $(HFLAGS) $(BIN)/tfasts36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfs.o $(DROPTFS_O) scaleswts.o tatstats_fs.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) | |
95 | 95 | |
96 | 96 | tfastm36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o scaleswts.o tatstats_fm.o last_tat.o karlin.o $(DROPTFM_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o |
97 | $(CC) $(HFLAGS) $(BIN)/tfastm36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o $(DROPTFM_O) scaleswts.o tatstats_fm.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) | |
97 | $(CC) $(HFLAGS) $(BIN)/tfastm36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o $(DROPTFM_O) scaleswts.o tatstats_fm.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) | |
98 | 98 | |
99 | 99 | ssearch36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o scale_se.o karlin.o $(DROPGSW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
100 | $(CC) $(HFLAGS) $(BIN)/ssearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) | |
100 | $(CC) $(HFLAGS) $(BIN)/ssearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) | |
101 | 101 | |
102 | 102 | # do not use accelerated Smith-Waterman |
103 | 103 | ssearch36s : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o scale_se.o karlin.o $(DROPGSW_NA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
104 | $(CC) $(HFLAGS) $(BIN)/ssearch36s $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGSW_NA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) | |
104 | $(CC) $(HFLAGS) $(BIN)/ssearch36s $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGSW_NA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) | |
105 | 105 | |
106 | 106 | lalign36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o scale_se.o karlin.o last_thresh.o $(DROPLAL_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
107 | $(CC) $(HFLAGS) $(BIN)/lalign36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o $(DROPLAL_O) scale_se.o karlin.o last_thresh.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) | |
107 | $(CC) $(HFLAGS) $(BIN)/lalign36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o $(DROPLAL_O) scale_se.o karlin.o last_thresh.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) | |
108 | 108 | |
109 | 109 | osearch36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ssw.o scale_se.o karlin.o $(DROPNSW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o |
110 | $(CC) $(HFLAGS) $(BIN)/osearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ssw.o $(DROPNSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) | |
110 | $(CC) $(HFLAGS) $(BIN)/osearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ssw.o $(DROPNSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) | |
111 | 111 | |
112 | 112 | glsearch36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPLNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
113 | $(CC) $(HFLAGS) $(BIN)/glsearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) | |
113 | $(CC) $(HFLAGS) $(BIN)/glsearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) | |
114 | 114 | |
115 | 115 | ggsearch36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPGNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
116 | $(CC) $(HFLAGS) $(BIN)/ggsearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) | |
116 | $(CC) $(HFLAGS) $(BIN)/ggsearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) | |
117 | 117 | |
118 | 118 | prss36 : ssearch36 |
119 | 119 | ln -sf ssearch36 prss36 |
120 | 120 | |
121 | 121 | ssearch36_t : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_se.o karlin.o $(DROPGSW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
122 | $(CC) $(HFLAGS) $(BIN)/ssearch36_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS) | |
122 | $(CC) $(HFLAGS) $(BIN)/ssearch36_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
123 | 123 | |
124 | 124 | ssearch36s_t : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_se.o karlin.o $(DROPGSW_NA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
125 | $(CC) $(HFLAGS) $(BIN)/ssearch36s_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGSW_NA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS) | |
125 | $(CC) $(HFLAGS) $(BIN)/ssearch36s_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGSW_NA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
126 | 126 | |
127 | 127 | glsearch36_t : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPLNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
128 | $(CC) $(HFLAGS) $(BIN)/glsearch36_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS) | |
128 | $(CC) $(HFLAGS) $(BIN)/glsearch36_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
129 | 129 | |
130 | 130 | glsearch36s_t : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPLNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
131 | $(CC) $(HFLAGS) $(BIN)/glsearch36s_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS) | |
131 | $(CC) $(HFLAGS) $(BIN)/glsearch36s_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
132 | 132 | |
133 | 133 | ggsearch36_t : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPGNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
134 | $(CC) $(HFLAGS) $(BIN)/ggsearch36_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS) | |
134 | $(CC) $(HFLAGS) $(BIN)/ggsearch36_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
135 | 135 | |
136 | 136 | ggsearch36s_t : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPGNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
137 | $(CC) $(HFLAGS) $(BIN)/ggsearch36s_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS) | |
137 | $(CC) $(HFLAGS) $(BIN)/ggsearch36s_t $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
138 | 138 | |
139 | 139 | fasta36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o |
140 | $(CC) $(HFLAGS) $(BIN)/fasta36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS) | |
140 | $(CC) $(HFLAGS) $(BIN)/fasta36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
141 | 141 | |
142 | 142 | fasta36sum_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o |
143 | $(CC) $(HFLAGS) $(BIN)/fasta36sum_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS) | |
143 | $(CC) $(HFLAGS) $(BIN)/fasta36sum_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
144 | 144 | |
145 | 145 | fasta36u_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showun.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
146 | $(CC) $(HFLAGS) $(BIN)/fasta36u_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showun.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
146 | $(CC) $(HFLAGS) $(BIN)/fasta36u_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showun.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
147 | 147 | |
148 | 148 | fasta36r_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showrel.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
149 | $(CC) $(HFLAGS) $(BIN)/fasta36r_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showrel.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
149 | $(CC) $(HFLAGS) $(BIN)/fasta36r_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showrel.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
150 | 150 | |
151 | 151 | fastf36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(DROPFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
152 | $(CC) $(HFLAGS) $(BIN)/fastf36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
152 | $(CC) $(HFLAGS) $(BIN)/fastf36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
153 | 153 | |
154 | 154 | fastf36s_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o scaleswtf.o karlin.o $(DROPFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
155 | $(CC) $(HFLAGS) $(BIN)/fastf36s_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswtf.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
155 | $(CC) $(HFLAGS) $(BIN)/fastf36s_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswtf.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
156 | 156 | |
157 | 157 | fasts36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fs.o scaleswts.o last_tat.o tatstats_fs.o karlin.o $(DROPFS_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
158 | $(CC) $(HFLAGS) $(BIN)/fasts36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fs.o $(DROPFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
158 | $(CC) $(HFLAGS) $(BIN)/fasts36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fs.o $(DROPFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
159 | 159 | |
160 | 160 | fastm36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fm.o scaleswts.o last_tat.o tatstats_fm.o karlin.o $(DROPFM_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
161 | $(CC) $(HFLAGS) $(BIN)/fastm36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fm.o $(DROPFM_O) scaleswts.o last_tat.o tatstats_fm.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
161 | $(CC) $(HFLAGS) $(BIN)/fastm36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fm.o $(DROPFM_O) scaleswts.o last_tat.o tatstats_fm.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
162 | 162 | |
163 | 163 | fastx36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_fx.o faatran.o scale_se.o karlin.o drop_fx.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o |
164 | $(CC) $(HFLAGS) $(BIN)/fastx36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fx.o drop_fx.o faatran.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS) | |
164 | $(CC) $(HFLAGS) $(BIN)/fastx36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fx.o drop_fx.o faatran.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
165 | 165 | |
166 | 166 | fasty36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_fy.o faatran.o scale_se.o karlin.o drop_fz.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o |
167 | $(CC) $(HFLAGS) $(BIN)/fasty36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fy.o drop_fz.o faatran.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS) | |
167 | $(CC) $(HFLAGS) $(BIN)/fasty36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fy.o drop_fz.o faatran.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
168 | 168 | |
169 | 169 | tfasta36 : $(COMP_LIBO) compacc.o $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o scale_se.o karlin.o $(DROPTFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o |
170 | $(CC) $(HFLAGS) $(BIN)/tfasta36 $(COMP_LIBO) compacc.o $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o $(DROPTFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) | |
170 | $(CC) $(HFLAGS) $(BIN)/tfasta36 $(COMP_LIBO) compacc.o $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o $(DROPTFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) | |
171 | 171 | |
172 | 172 | tfasta36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_tfa.o scale_se.o karlin.o $(DROPTFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o |
173 | $(CC) $(HFLAGS) $(BIN)/tfasta36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o $(DROPTFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS) | |
173 | $(CC) $(HFLAGS) $(BIN)/tfasta36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o $(DROPTFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
174 | 174 | |
175 | 175 | tfastf36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_tf.o scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(DROPTFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o |
176 | $(CC) $(HFLAGS) $(BIN)/tfastf36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
176 | $(CC) $(HFLAGS) $(BIN)/tfastf36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
177 | 177 | |
178 | 178 | tfasts36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_tfs.o scaleswts.o last_tat.o tatstats_fs.o karlin.o $(DROPTFS_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o |
179 | $(CC) $(HFLAGS) $(BIN)/tfasts36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfs.o $(DROPTFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
179 | $(CC) $(HFLAGS) $(BIN)/tfasts36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfs.o $(DROPTFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
180 | 180 | |
181 | 181 | tfastx36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfx.o scale_se.o karlin.o drop_tfx.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o |
182 | $(CC) $(HFLAGS) $(BIN)/tfastx36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfx.o drop_tfx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS) | |
182 | $(CC) $(HFLAGS) $(BIN)/tfastx36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfx.o drop_tfx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
183 | 183 | |
184 | 184 | tfasty36_t : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfy.o scale_se.o karlin.o drop_tfz.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o |
185 | $(CC) $(HFLAGS) $(BIN)/tfasty36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfy.o drop_tfz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS) | |
185 | $(CC) $(HFLAGS) $(BIN)/tfasty36_t $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfy.o drop_tfz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
186 | 186 | |
187 | 187 | comp_mlib5e.o : comp_lib5e.c mw.h structs.h defs.h param.h |
188 | 188 | $(CC) $(THR_CC) $(CFLAGS) -DCOMP_MLIB -c comp_lib5e.c -o comp_mlib5e.o |
212 | 212 | $(CC) $(THR_CC) $(CFLAGS) -c work_thr2.c |
213 | 213 | |
214 | 214 | print_pssm : print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c |
215 | $(CC) -o print_pssm $(CFLAGS) print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c $(LIB_M) | |
215 | $(CC) -o print_pssm $(CFLAGS) print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c $(LIB_M) $(LIB_DB) | |
216 | 216 | |
217 | 217 | map_db : map_db.c uascii.h ncbl2_head.h |
218 | 218 | $(CC) $(CFLAGS) -o $(BIN)/map_db map_db.c |
57 | 57 | pushd $(BIN); cp $(TPROGS) $(XDIR); popd |
58 | 58 | |
59 | 59 | fasta36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
60 | $(CC) $(HFLAGS) $(BIN)/fasta36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) | |
60 | $(CC) $(HFLAGS) $(BIN)/fasta36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) | |
61 | 61 | |
62 | 62 | fastx36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fx.o scale_se.o karlin.o drop_fx.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o |
63 | $(CC) $(HFLAGS) $(BIN)/fastx36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fx.o drop_fx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) | |
63 | $(CC) $(HFLAGS) $(BIN)/fastx36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fx.o drop_fx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) | |
64 | 64 | |
65 | 65 | fasty36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fy.o scale_se.o karlin.o drop_fz.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o |
66 | $(CC) $(HFLAGS) $(BIN)/fasty36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fy.o drop_fz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) | |
66 | $(CC) $(HFLAGS) $(BIN)/fasty36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fy.o drop_fz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) | |
67 | 67 | |
68 | 68 | fastf36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ff.o scaleswts.o last_tat.o tatstats_ff.o karlin.o $(DROPFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
69 | $(CC) $(HFLAGS) $(BIN)/fastf36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswts.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) | |
69 | $(CC) $(HFLAGS) $(BIN)/fastf36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswts.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) | |
70 | 70 | |
71 | 71 | fasts36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fs.o scaleswts.o last_tat.o tatstats_fs.o karlin.o $(DROPFS_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
72 | $(CC) $(HFLAGS) $(BIN)/fasts36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fs.o $(DROPFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) | |
72 | $(CC) $(HFLAGS) $(BIN)/fasts36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fs.o $(DROPFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) | |
73 | 73 | |
74 | 74 | fastm36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fm.o scaleswts.o last_tat.o tatstats_fm.o karlin.o $(DROPFM_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
75 | $(CC) $(HFLAGS) $(BIN)/fastm36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fm.o $(DROPFM_O) scaleswts.o last_tat.o tatstats_fm.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) | |
75 | $(CC) $(HFLAGS) $(BIN)/fastm36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_fm.o $(DROPFM_O) scaleswts.o last_tat.o tatstats_fm.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) | |
76 | 76 | |
77 | 77 | tfastx36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfx.o scale_se.o karlin.o drop_tfx.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o |
78 | $(CC) $(HFLAGS) $(BIN)/tfastx36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfx.o drop_tfx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) | |
78 | $(CC) $(HFLAGS) $(BIN)/tfastx36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfx.o drop_tfx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) | |
79 | 79 | |
80 | 80 | tfasty36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfy.o scale_se.o karlin.o drop_tfz.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o |
81 | $(CC) $(HFLAGS) $(BIN)/tfasty36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfy.o drop_tfz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) | |
81 | $(CC) $(HFLAGS) $(BIN)/tfasty36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfy.o drop_tfz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) | |
82 | 82 | |
83 | 83 | tfasta36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o scale_se.o karlin.o $(DROPTFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o |
84 | $(CC) $(HFLAGS) $(BIN)/tfasta36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o $(DROPTFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) | |
84 | $(CC) $(HFLAGS) $(BIN)/tfasta36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o $(DROPTFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) | |
85 | 85 | |
86 | 86 | tfastf36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(DROPTFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o |
87 | $(CC) $(HFLAGS) $(BIN)/tfastf36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) | |
87 | $(CC) $(HFLAGS) $(BIN)/tfastf36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) | |
88 | 88 | |
89 | 89 | tfastf36s : $(COMP_LIBO) $(COMPACC_SO) showsum.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o scaleswtf.o karlin.o $(DROPTFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o |
90 | $(CC) $(HFLAGS) $(BIN)/tfastf36s $(COMP_LIBO) $(COMPACC_SO) showsum.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) | |
90 | $(CC) $(HFLAGS) $(BIN)/tfastf36s $(COMP_LIBO) $(COMPACC_SO) showsum.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) | |
91 | 91 | |
92 | 92 | tfasts36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfs.o scaleswts.o tatstats_fs.o last_tat.o karlin.o $(DROPTFS_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o |
93 | $(CC) $(HFLAGS) $(BIN)/tfasts36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfs.o $(DROPTFS_O) scaleswts.o tatstats_fs.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) | |
93 | $(CC) $(HFLAGS) $(BIN)/tfasts36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfs.o $(DROPTFS_O) scaleswts.o tatstats_fs.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) | |
94 | 94 | |
95 | 95 | tfastm36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o scaleswts.o tatstats_fm.o last_tat.o karlin.o $(DROPTFM_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o |
96 | $(CC) $(HFLAGS) $(BIN)/tfastm36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o $(DROPTFM_O) scaleswts.o tatstats_fm.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) | |
96 | $(CC) $(HFLAGS) $(BIN)/tfastm36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o $(DROPTFM_O) scaleswts.o tatstats_fm.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) | |
97 | 97 | |
98 | 98 | ssearch36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o scale_se.o karlin.o $(DROPGSW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
99 | $(CC) $(HFLAGS) $(BIN)/ssearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) | |
99 | $(CC) $(HFLAGS) $(BIN)/ssearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) | |
100 | 100 | |
101 | 101 | # do not use accelerated Smith-Waterman |
102 | 102 | ssearch36s : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o scale_se.o karlin.o $(DROPGSW_NA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
103 | $(CC) $(HFLAGS) $(BIN)/ssearch36s $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGSW_NA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) | |
103 | $(CC) $(HFLAGS) $(BIN)/ssearch36s $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGSW_NA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) | |
104 | 104 | |
105 | 105 | lalign36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o scale_se.o karlin.o last_thresh.o $(DROPLAL_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
106 | $(CC) $(HFLAGS) $(BIN)/lalign36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o $(DROPLAL_O) scale_se.o karlin.o last_thresh.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) | |
106 | $(CC) $(HFLAGS) $(BIN)/lalign36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o $(DROPLAL_O) scale_se.o karlin.o last_thresh.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) | |
107 | 107 | |
108 | 108 | osearch36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ssw.o scale_se.o karlin.o $(DROPNSW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o |
109 | $(CC) $(HFLAGS) $(BIN)/osearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ssw.o $(DROPNSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) | |
109 | $(CC) $(HFLAGS) $(BIN)/osearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_ssw.o $(DROPNSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) | |
110 | 110 | |
111 | 111 | glsearch36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPLNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
112 | $(CC) $(HFLAGS) $(BIN)/glsearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) | |
112 | $(CC) $(HFLAGS) $(BIN)/glsearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) | |
113 | 113 | |
114 | 114 | ggsearch36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPGNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
115 | $(CC) $(HFLAGS) $(BIN)/ggsearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) | |
115 | $(CC) $(HFLAGS) $(BIN)/ggsearch36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) | |
116 | 116 | |
117 | 117 | prss36 : ssearch36 |
118 | 118 | ln -sf ssearch36 prss36 |
145 | 145 | $(CC) $(THR_CC) $(CFLAGS) -c work_thr2.c |
146 | 146 | |
147 | 147 | print_pssm : print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c |
148 | $(CC) -o print_pssm $(CFLAGS) print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c $(LIB_M) | |
148 | $(CC) -o print_pssm $(CFLAGS) print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c $(LIB_M) $(LIB_DB) | |
149 | 149 | |
150 | 150 | map_db : map_db.c uascii.h ncbl2_head.h |
151 | 151 | $(CC) $(CFLAGS) -o $(BIN)/map_db map_db.c |
53 | 53 | pushd $(BIN); cp $(TPROGS) $(XDIR); popd |
54 | 54 | |
55 | 55 | lalign36 : $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o scale_se.o karlin.o last_thresh.o $(DROPLAL_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
56 | $(CC) $(HFLAGS) $(BIN)/lalign36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o $(DROPLAL_O) scale_se.o karlin.o last_thresh.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) | |
56 | $(CC) $(HFLAGS) $(BIN)/lalign36 $(COMP_LIBO) $(COMPACC_SO) $(SHOWBESTO) re_getlib.o $(LSHOWALIGN).o htime.o apam.o doinit.o init_lal.o $(DROPLAL_O) scale_se.o karlin.o last_thresh.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) | |
57 | 57 | |
58 | 58 | ssearch36 : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_se.o karlin.o $(DROPGSW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
59 | $(CC) $(HFLAGS) $(BIN)/ssearch36 $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS) | |
59 | $(CC) $(HFLAGS) $(BIN)/ssearch36 $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGSW_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
60 | 60 | |
61 | 61 | ssearch36s : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_se.o karlin.o $(DROPGSW_NA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
62 | $(CC) $(HFLAGS) $(BIN)/ssearch36s $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGSW_NA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS) | |
62 | $(CC) $(HFLAGS) $(BIN)/ssearch36s $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGSW_NA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
63 | 63 | |
64 | 64 | glsearch36 : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPLNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
65 | $(CC) $(HFLAGS) $(BIN)/glsearch36 $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS) | |
65 | $(CC) $(HFLAGS) $(BIN)/glsearch36 $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
66 | 66 | |
67 | 67 | glsearch36s : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPLNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
68 | $(CC) $(HFLAGS) $(BIN)/glsearch36s $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS) | |
68 | $(CC) $(HFLAGS) $(BIN)/glsearch36s $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPLNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
69 | 69 | |
70 | 70 | ggsearch36 : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPGNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
71 | $(CC) $(HFLAGS) $(BIN)/ggsearch36 $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS) | |
71 | $(CC) $(HFLAGS) $(BIN)/ggsearch36 $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
72 | 72 | |
73 | 73 | ggsearch36s : $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o scale_sn.o karlin.o $(DROPGNW_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o |
74 | $(CC) $(HFLAGS) $(BIN)/ggsearch36s $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(THR_LIBS) | |
74 | $(CC) $(HFLAGS) $(BIN)/ggsearch36s $(COMP_THRO) ${WORK_THRO} $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o $(DROPGNW_O) scale_sn.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o pssm_asn_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
75 | 75 | |
76 | 76 | fasta36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o |
77 | $(CC) $(HFLAGS) $(BIN)/fasta36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS) | |
77 | $(CC) $(HFLAGS) $(BIN)/fasta36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
78 | 78 | |
79 | 79 | fasta36sum : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o |
80 | $(CC) $(HFLAGS) $(BIN)/fasta36sum $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS) | |
80 | $(CC) $(HFLAGS) $(BIN)/fasta36sum $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
81 | 81 | |
82 | 82 | fasta36u : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showun.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
83 | $(CC) $(HFLAGS) $(BIN)/fasta36u $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showun.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
83 | $(CC) $(HFLAGS) $(BIN)/fasta36u $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showun.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
84 | 84 | |
85 | 85 | fasta36r : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showrel.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o scale_se.o karlin.o $(DROPNFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
86 | $(CC) $(HFLAGS) $(BIN)/fasta36r $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showrel.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
86 | $(CC) $(HFLAGS) $(BIN)/fasta36r $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showrel.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fa.o $(DROPNFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
87 | 87 | |
88 | 88 | fastf36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(DROPFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
89 | $(CC) $(HFLAGS) $(BIN)/fastf36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
89 | $(CC) $(HFLAGS) $(BIN)/fastf36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
90 | 90 | |
91 | 91 | fastf36s : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o scaleswtf.o karlin.o $(DROPFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
92 | $(CC) $(HFLAGS) $(BIN)/fastf36s $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswtf.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
92 | $(CC) $(HFLAGS) $(BIN)/fastf36s $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) showsum.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_ff.o $(DROPFF_O) scaleswtf.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
93 | 93 | |
94 | 94 | fasts36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fs.o scaleswts.o last_tat.o tatstats_fs.o karlin.o $(DROPFS_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
95 | $(CC) $(HFLAGS) $(BIN)/fasts36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fs.o $(DROPFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
95 | $(CC) $(HFLAGS) $(BIN)/fasts36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fs.o $(DROPFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
96 | 96 | |
97 | 97 | fastm36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fm.o scaleswts.o last_tat.o tatstats_fm.o karlin.o $(DROPFM_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o |
98 | $(CC) $(HFLAGS) $(BIN)/fastm36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fm.o $(DROPFM_O) scaleswts.o last_tat.o tatstats_fm.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
98 | $(CC) $(HFLAGS) $(BIN)/fastm36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fm.o $(DROPFM_O) scaleswts.o last_tat.o tatstats_fm.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
99 | 99 | |
100 | 100 | fastx36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_fx.o faatran.o scale_se.o karlin.o drop_fx.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o |
101 | $(CC) $(HFLAGS) $(BIN)/fastx36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fx.o drop_fx.o faatran.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS) | |
101 | $(CC) $(HFLAGS) $(BIN)/fastx36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fx.o drop_fx.o faatran.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
102 | 102 | |
103 | 103 | fasty36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_fy.o faatran.o scale_se.o karlin.o drop_fz.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o |
104 | $(CC) $(HFLAGS) $(BIN)/fasty36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fy.o drop_fz.o faatran.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS) | |
104 | $(CC) $(HFLAGS) $(BIN)/fasty36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_fy.o drop_fz.o faatran.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
105 | 105 | |
106 | 106 | tfasta36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_tfa.o scale_se.o karlin.o $(DROPTFA_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o |
107 | $(CC) $(HFLAGS) $(BIN)/tfasta36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o $(DROPTFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS) | |
107 | $(CC) $(HFLAGS) $(BIN)/tfasta36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfa.o $(DROPTFA_O) scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
108 | 108 | |
109 | 109 | tfastf36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_tf.o scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(DROPTFF_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o |
110 | $(CC) $(HFLAGS) $(BIN)/tfastf36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
110 | $(CC) $(HFLAGS) $(BIN)/tfastf36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tf.o $(DROPTFF_O) scaleswtf.o last_tat.o tatstats_ff.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
111 | 111 | |
112 | 112 | tfasts36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o c_dispn.o htime.o apam.o doinit.o init_tfs.o scaleswts.o last_tat.o tatstats_fs.o karlin.o $(DROPTFS_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o |
113 | $(CC) $(HFLAGS) $(BIN)/tfasts36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfs.o $(DROPTFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
113 | $(CC) $(HFLAGS) $(BIN)/tfasts36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfs.o $(DROPTFS_O) scaleswts.o last_tat.o tatstats_fs.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
114 | 114 | |
115 | 115 | tfastm36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o scaleswts.o tatstats_fm.o last_tat.o karlin.o $(DROPTFM_O) $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o |
116 | $(CC) $(HFLAGS) $(BIN)/tfastm36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o $(DROPTFM_O) scaleswts.o tatstats_fm.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(THR_LIBS) | |
116 | $(CC) $(HFLAGS) $(BIN)/tfastm36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_S).o htime.o apam.o doinit.o init_tfm.o $(DROPTFM_O) scaleswts.o tatstats_fm.o last_tat.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o mrandom.o url_subs.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
117 | 117 | |
118 | 118 | tfastx36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfx.o scale_se.o karlin.o drop_tfx.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o |
119 | $(CC) $(HFLAGS) $(BIN)/tfastx36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfx.o drop_tfx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS) | |
119 | $(CC) $(HFLAGS) $(BIN)/tfastx36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfx.o drop_tfx.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
120 | 120 | |
121 | 121 | tfasty36 : $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfy.o scale_se.o karlin.o drop_tfz.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o |
122 | $(CC) $(HFLAGS) $(BIN)/tfasty36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfy.o drop_tfz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(THR_LIBS) | |
122 | $(CC) $(HFLAGS) $(BIN)/tfasty36 $(COMP_THRO) $(WORK_THRO) $(THR_SUBS).o $(COMPACC_TO) $(SHOWBESTO) re_getlib.o $(SHOWALIGN_T).o htime.o apam.o doinit.o init_tfy.o drop_tfz.o scale_se.o karlin.o $(LGETLIB) c_dispn.o $(NCBL_LIB) lib_sel.o faatran.o url_subs.o mrandom.o $(LIB_M) $(LIB_DB) $(THR_LIBS) | |
123 | 123 | |
124 | 124 | comp_mlib4.o : comp_lib4.c mw.h structs.h defs.h param.h |
125 | 125 | $(CC) $(THR_CC) $(CFLAGS) -DCOMP_MLIB -c comp_lib4.c -o comp_mlib4.o |
167 | 167 | $(CC) $(THR_CC) $(CFLAGS) -c work_thr2.c |
168 | 168 | |
169 | 169 | print_pssm : print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c |
170 | $(CC) -o print_pssm $(CFLAGS) print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c $(LIB_M) | |
170 | $(CC) -o print_pssm $(CFLAGS) print_pssm.c getseq.c karlin.c apam.c pssm_asn_subs.c $(LIB_M) $(LIB_DB) | |
171 | 171 | |
172 | 172 | map_db : map_db.c uascii.h ncbl2_head.h |
173 | 173 | $(CC) $(CFLAGS) -o $(BIN)/map_db map_db.c |
33 | 33 | # and "-L/usr/lib64/mysql -lmysqlclient -lz" in LIB_M |
34 | 34 | # some systems may also require a LD_LIBRARY_PATH change |
35 | 35 | |
36 | LIB_M= -lm -lz | |
37 | #LIB_M= -L/usr/lib64/mysql -lmysqlclient -lz -lm | |
36 | LIB_M= -lm | |
37 | #LIB_M= -L/usr/lib64/mysql -lmysqlclient -lm # -lz | |
38 | 38 | NCBL_LIB=ncbl2_mlib.o |
39 | 39 | #NCBL_LIB=ncbl2_mlib.o mysql_lib.o |
40 | 40 |
0 | #!/usr/bin/env perl | |
1 | ||
2 | ################################################################ | |
3 | # copyright (c) 2014,2015 by William R. Pearson and The Rector & | |
4 | # Visitors of the University of Virginia */ | |
5 | ################################################################ | |
6 | # Licensed under the Apache License, Version 2.0 (the "License"); | |
7 | # you may not use this file except in compliance with the License. | |
8 | # You may obtain a copy of the License at | |
9 | # | |
10 | # http://www.apache.org/licenses/LICENSE-2.0 | |
11 | # | |
12 | # Unless required by applicable law or agreed to in writing, | |
13 | # software distributed under this License is distributed on an "AS | |
14 | # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either | |
15 | # express or implied. See the License for the specific language | |
16 | # governing permissions and limitations under the License. | |
17 | ################################################################ | |
18 | ||
19 | ################################################################ | |
20 | # clustal2fasta.pl | |
21 | ################################################################ | |
22 | # clustal2fasta.pl takes a standard clustal format alignment file | |
23 | # and produces the corresponding FASTA file. | |
24 | # | |
25 | ################################################################ | |
26 | ||
27 | use warnings; | |
28 | use strict; | |
29 | use Pod::Usage; | |
30 | use Getopt::Long; | |
31 | ||
32 | my ($shelp, $help) = (0, 0); | |
33 | ||
34 | GetOptions( | |
35 | "h|?" => \$shelp, | |
36 | "help" => \$help, | |
37 | ); | |
38 | ||
39 | pod2usage(1) if $shelp; | |
40 | pod2usage(exitstatus => 0, verbose => 2) if $help; | |
41 | unless (-f STDIN || -p STDIN || @ARGV) { | |
42 | pod2usage(1); | |
43 | } | |
44 | ||
45 | my @seq_ids = (); | |
46 | my %msa = (); | |
47 | ||
48 | # read the first line, first should not be blank | |
49 | my $title = <>; | |
50 | ||
51 | while (my $line = <>) { | |
52 | chomp $line; | |
53 | next unless ($line); | |
54 | next if ($line =~ m/^[\s:\*\+\.]+$/); # skip conservation line | |
55 | ||
56 | my ($seq_id, $align) = split(/\s+/,$line); | |
57 | ||
58 | if (defined($msa{$seq_id})) { | |
59 | $msa{$seq_id} .= $align; | |
60 | } | |
61 | else { | |
62 | $msa{$seq_id} = $align; | |
63 | push @seq_ids, $seq_id; | |
64 | } | |
65 | } | |
66 | ||
67 | for my $seq_id ( @seq_ids ) { | |
68 | my $fmt_seq = $msa{$seq_id}; | |
69 | $fmt_seq =~ s/(.{0,60})/$1\n/g; | |
70 | print ">$seq_id\n$fmt_seq"; | |
71 | } | |
72 | ||
73 | __END__ | |
74 | ||
75 | =pod | |
76 | ||
77 | =head1 NAME | |
78 | ||
79 | clustal2fasta.pl | |
80 | ||
81 | =head1 SYNOPSIS | |
82 | ||
83 | clustal2fasta.pl clustal.msa | |
84 | ||
85 | =head1 OPTIONS | |
86 | ||
87 | -h short help | |
88 | --help include description | |
89 | ||
90 | ||
91 | =head1 DESCRIPTION | |
92 | ||
93 | C<clustal2fasta.pl> takes a Clustal format interleaved multiple | |
94 | sequence alignment and produces the corresponding fasta format library. | |
95 | ||
96 | =head1 AUTHOR | |
97 | ||
98 | William R. Pearson, wrp@virginia.edu | |
99 | ||
100 | =cut |
0 | #!/usr/bin/env python | |
1 | ||
2 | ################################################################ | |
3 | # copyright (c) 2014,2015 by William R. Pearson and The Rector & | |
4 | # Visitors of the University of Virginia */ | |
5 | ################################################################ | |
6 | # Licensed under the Apache License, Version 2.0 (the "License"); | |
7 | # you may not use this file except in compliance with the License. | |
8 | # You may obtain a copy of the License at | |
9 | # | |
10 | # http://www.apache.org/licenses/LICENSE-2.0 | |
11 | # | |
12 | # Unless required by applicable law or agreed to in writing, | |
13 | # software distributed under this License is distributed on an "AS | |
14 | # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either | |
15 | # express or implied. See the License for the specific language | |
16 | # governing permissions and limitations under the License. | |
17 | ################################################################ | |
18 | ||
19 | ################################################################ | |
20 | # clustal2fasta.pl | |
21 | ################################################################ | |
22 | # clustal2fasta.pl takes a standard clustal format alignment file | |
23 | # and produces the corresponding FASTA file. | |
24 | # | |
25 | # if --end_mask or --int_mask are set, then end or internal '-'s are converted to the query (first) sequence | |
26 | # if --trim is set, then alignments beyond the beginning/end of the query sequence are trimmed | |
27 | # | |
28 | ################################################################ | |
29 | ||
30 | import argparse | |
31 | import fileinput | |
32 | import re | |
33 | ||
34 | ################ | |
35 | # | |
36 | # python re-write of clustal2fasta.pl | |
37 | # | |
38 | # in the future, modify for various query seeding strategies | |
39 | ################ | |
40 | ||
41 | arg_parse = argparse.ArgumentParser(description='Convert clustal MSA to FASTA library') | |
42 | arg_parse.add_argument('--query', '--query_file', dest='query_file', action='store', help='query sequence file') | |
43 | arg_parse.add_argument('files', metavar='FILE', nargs='*', help='files to read, if empty, stdin is used') | |
44 | args=arg_parse.parse_args() | |
45 | ||
46 | msa = {} | |
47 | seq_ids = [] | |
48 | ||
49 | is_line1 = True | |
50 | for line in fileinput.input(args.files): | |
51 | if is_line1: | |
52 | is_line1 = False | |
53 | continue | |
54 | line = line.strip() | |
55 | if not line: | |
56 | continue | |
57 | if re.search(r'^[\s:\*\+\.]+$',line): | |
58 | continue | |
59 | ||
60 | (seq_id, align) = re.split(r'\s+',line) | |
61 | ||
62 | if seq_id in msa: | |
63 | msa[seq_id] += align | |
64 | else: | |
65 | msa[seq_id] = align | |
66 | seq_ids.append(seq_id) | |
67 | ||
68 | for seq_id in seq_ids: | |
69 | fmt_seq = re.sub(r'(.{0,60})',r'\1\n',msa[seq_id]) | |
70 | print(">%s\n%s" % (seq_id, fmt_seq)) |
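The diff view above drops the script's indentation, so the control flow of the conversion is hard to follow. The following is a minimal sketch of the same Clustal-to-FASTA logic as a standalone Python 3 function, not the script as committed. The function name `clustal_to_fasta` and the `width` parameter are hypothetical, introduced only for illustration. Unlike the script, it keeps only the first two whitespace-separated fields of each alignment line and wraps sequences by slicing rather than with a regular expression.

```python
import re


def clustal_to_fasta(lines, width=60):
    """Convert interleaved Clustal alignment lines to FASTA-format text.

    `lines` is any iterable of strings; the first one is treated as the
    Clustal title line and skipped, as in the script above.
    """
    msa = {}       # sequence id -> concatenated aligned sequence
    seq_ids = []   # ids in order of first appearance

    for n, line in enumerate(lines):
        if n == 0:
            continue                      # skip the title line
        line = line.strip()
        if not line:
            continue                      # skip blank separator lines
        if re.search(r'^[\s:\*\+\.]+$', line):
            continue                      # skip conservation lines
        fields = re.split(r'\s+', line)
        if len(fields) < 2:
            continue                      # assumption: ignore malformed lines
        seq_id, align = fields[0], fields[1]
        if seq_id in msa:
            msa[seq_id] += align
        else:
            msa[seq_id] = align
            seq_ids.append(seq_id)

    out = []
    for seq_id in seq_ids:
        seq = msa[seq_id]
        out.append(">%s" % seq_id)
        # wrap the aligned sequence at `width` characters per line
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

It could be called as `sys.stdout.write(clustal_to_fasta(fileinput.input(files)))`, which mirrors how the script reads from named files or standard input.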
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014,2015 by William R. Pearson and The Rector & |
37 | 37 | # |
38 | 38 | ################################################################ |
39 | 39 | |
40 | use warnings; | |
40 | 41 | use strict; |
41 | 42 | use Pod::Usage; |
42 | 43 | use Getopt::Long; |
50 | 51 | |
51 | 52 | my ($shelp, $help, $m_format, $evalue, $qvalue, $domain_bound) = (0, 0, "m8CB", 0.001, 30.0,0); |
52 | 53 | my ($query_file, $sel_file, $bound_file_in, $bound_file_only, $bound_file_out, $masked_lib_out,$mask_type_end, $mask_type_int) = ("","","","","","","",""); |
54 | my ($clustal_id,$trunc_acc,$min_align) = (0,0,0.0); | |
53 | 55 | my $query_lib_r = 0; |
54 | 56 | my ($eval2_fmt, $eval2) = (0,""); |
55 | 57 | |
57 | 59 | "query=s" => \$query_file, |
58 | 60 | "query_file=s" => \$query_file, |
59 | 61 | "eval2=s" => \$eval2, # change the evalue used for inclusion |
60 | "evalue=f" => \$evalue, | |
61 | "expect=f" => \$evalue, | |
62 | "evalue|expect=f" => \$evalue, | |
62 | 63 | "qvalue=f" => \$qvalue, |
63 | 64 | "format=s" => \$m_format, |
64 | "selected_file_in=s" => \$sel_file, | |
65 | "sel_file_in=s" => \$sel_file, | |
66 | "sel_file=s" => \$sel_file, | |
67 | "m_format=s" => \$m_format, | |
68 | "mformat=s" => \$m_format, | |
65 | "clustal!" => \$clustal_id, | |
66 | "trunc_acc!" => \$trunc_acc, | |
67 | "selected_file_in|sel_file_in|sel_accs=s" => \$sel_file, | |
68 | "m_format|mformat=s" => \$m_format, | |
69 | 69 | "bound_file_in=s" => \$bound_file_in, |
70 | 70 | "bound_file_only=s" => \$bound_file_only, |
71 | 71 | "bound_file_out=s" => \$bound_file_out, |
82 | 82 | "domain" => \$domain_bound, |
83 | 83 | "int_mask_type=s" => \$mask_type_int, |
84 | 84 | "int_mask=s" => \$mask_type_int, |
85 | "min_align=f" => \$min_align, | |
85 | 86 | "h|?" => \$shelp, |
86 | 87 | "help" => \$help, |
87 | 88 | ); |
214 | 215 | $q_acc = $query_descr; |
215 | 216 | } |
216 | 217 | |
217 | $acc_names{$q_acc} = 1; # this is necessary for the new acc-only NCBI SwissProt libraries | |
218 | $acc_names{$q_acc} = $q_acc; # this is necessary for the new acc-only NCBI SwissProt libraries | |
218 | 219 | |
219 | 220 | $q_acc =~ s/\.\d+$//; |
220 | 221 | |
227 | 228 | my $annot_f='NULL'; |
228 | 229 | |
229 | 230 | if ($m_format =~ m/^m9/i) { |
230 | last if $line =~ m/>>>/; | |
231 | last if $line =~ m/>>>/ || $line =~ m/^<\/pre>/; | |
231 | 232 | next if $line =~ m/^\+\-/; # skip over HSPs |
232 | 233 | my ($left, $right, $align_f) = ("","",'NULL'); |
233 | 234 | ($left, $right, $align_f, $annot_f) = split(/\t/,$line); |
235 | 236 | $align_f= 'NULL' unless $align_f; |
236 | 237 | $annot_f= 'NULL' unless $annot_f; |
237 | 238 | |
239 | if ($left =~ m/<font/) { | |
240 | $left =~ s/<font color="darkred">//; | |
241 | $left =~ s/<\/font>//; | |
242 | } | |
243 | ||
238 | 244 | my @fields = split(/\s+/,$left); |
239 | my ($ldb, $l_id, $l_acc) = ("","",""); | |
240 | if ($fields[0] =~ m/:/) { | |
241 | ($ldb, $l_id) = split(/:/,$fields[0]); | |
242 | ($l_acc) = $fields[1]; | |
243 | } else { | |
244 | ($ldb, $l_acc,$l_id) = split(/\|/,$fields[0]); | |
245 | } | |
245 | $subj_acc = $s_seqid = $fields[0]; | |
246 | ||
247 | # my ($ldb, $l_id, $l_acc) = ("","",""); | |
248 | # if ($fields[0] =~ m/:/) { | |
249 | # ($ldb, $l_id) = split(/:/,$fields[0]); | |
250 | # ($l_acc) = $fields[1]; | |
251 | # } else { | |
252 | # ($ldb, $l_acc,$l_id) = split(/\|/,$fields[0]); | |
253 | # } | |
246 | 254 | |
247 | 255 | @hit_data{@m9_field_names} = split(/\s+/,$right); |
256 | ||
248 | 257 | if ($eval2_fmt) { |
249 | 258 | @hit_data{qw(bits evalue eval2)} = @fields[-3, -2,-1]; |
250 | 259 | } |
255 | 264 | # |
256 | 265 | # currently preselbdr files have $ldb|$l_acc, not full s_seqid, so construct it |
257 | 266 | # |
258 | ($s_seqid, $subj_acc) = (join('|',($ldb, $l_acc, $l_id)), "$ldb|$l_acc"); | |
267 | # ($s_seqid, $subj_acc) = (join('|',($ldb, $l_acc, $l_id)), "$ldb|$l_acc"); | |
259 | 268 | @hit_data{qw(s_seqid subj_acc)} = ($s_seqid, $subj_acc); |
260 | 269 | @hit_data{qw(query_id query_acc)} = ($query_descr, $q_acc); |
261 | 270 | $hit_data{BTOP} = $align_f; |
265 | 274 | last if $line =~ m/^#/; |
266 | 275 | @hit_data{@m8_field_names} = split(/\t/,$line); |
267 | 276 | $subj_acc = $hit_data{'s_seqid'}; |
268 | $subj_acc =~ s/^gi\|\d+\|(\w+\|\w+)\|?\w+/$1/; | |
277 | # remove gi number | |
278 | if ($subj_acc =~ m/^gi\|\d+\|/) { | |
279 | $subj_acc =~ s/^gi\|\d+\|//; | |
280 | } | |
269 | 281 | } |
270 | 282 | |
271 | 283 | if ($have_sel_accs) { |
284 | 296 | # $s_seqid_u .= "_". $acc_names{$subj_acc}; |
285 | 297 | } |
286 | 298 | else { |
299 | my $tr_acc = $hit_data{'s_seqid'}; | |
287 | 300 | $acc_names{$hit_data{'s_seqid'}} = 1; |
288 | 301 | } |
289 | 302 | |
290 | 303 | # must be after duplicate seqid check because blast HSP's have bad E-values after good. |
291 | 304 | next if ($eval_fptr->(\%hit_data) > $evalue); |
292 | 305 | |
306 | next if (($hit_data{q_end}-$hit_data{q_start}+1)/$query_len < $min_align); | |
307 | ||
293 | 308 | $hit_data{s_seqid_u} = $s_seqid_u; |
294 | ||
295 | if (length($s_seqid_u) > $max_sseqid_len) { | |
296 | $max_sseqid_len = length($s_seqid_u); | |
297 | } | |
298 | 309 | |
299 | 310 | my $have_dom = 0; |
300 | 311 | if ($domain_bound && $hit_data{annot}) { |
369 | 380 | } |
370 | 381 | } |
371 | 382 | |
383 | $max_sseqid_len = 10; | |
384 | for my $acc ( @multi_names) { | |
385 | my $this_len = length($acc); | |
386 | if ($trunc_acc && ($acc=~m/\|\w+\|(\w+)$/)) { | |
387 | $this_len = length($1); | |
388 | } | |
389 | if ($this_len > $max_sseqid_len) { | |
390 | $max_sseqid_len = $this_len; | |
391 | } | |
392 | } | |
393 | ||
372 | 394 | # final MSA output |
373 | 395 | $max_sseqid_len += 2; |
374 | 396 | |
375 | printf "BTOP%s multiple sequence alignment\n\n\n",$m_format; | |
397 | if (! $clustal_id) { | |
398 | printf "BTOP%s multiple sequence alignment\n\n\n",$m_format; | |
399 | } | |
400 | else { | |
401 | print "CLUSTALW (1.8) multiple sequence alignment\n\n\n"; | |
402 | } | |
376 | 403 | |
377 | 404 | my $i_pos = 0; |
378 | 405 | for (my $j = 0; $j < $query_len/60; $j++) { |
380 | 407 | if ($i_end >= $query_len) {$i_end = $query_len-1;} |
381 | 408 | for my $acc (@multi_names) { |
382 | 409 | next unless $acc; |
383 | printf("%-".$max_sseqid_len."s %s\n",$acc,join("",@{$multi_align{$acc}}[$i_pos .. $i_end])); | |
410 | ||
411 | my $this_acc = $acc; | |
412 | if ($trunc_acc && ($acc=~m/\|\w+\|(\w+)$/)) { | |
413 | $this_acc = $1; | |
414 | } | |
415 | printf("%-".$max_sseqid_len."s %s\n",$this_acc,join("",@{$multi_align{$acc}}[$i_pos .. $i_end])); | |
384 | 416 | } |
385 | 417 | $i_pos += 60; |
386 | 418 | print "\n\n"; |
752 | 784 | my ($q_num, $query_desc, $q_start, $q_stop, $q_len, $l_num, $l_len, $best_yes); |
753 | 785 | |
754 | 786 | while (my $line = <>) { |
755 | if ($line =~ m/^\s*(\d+)>>>(\S+)\s.+ \- (\d+) aa$/) { | |
787 | if ($line =~ m/^\s*(\d+)>>>(\S+)\s.*\- (\d+) aa$/) { | |
756 | 788 | ($q_num,$query_desc, $q_len) = ($1,$2,$3); |
757 | 789 | # ($q_len) = ($line =~ m/(\d+) aa$/); |
758 | 790 | $line = <>; # skip Library: |
890 | 922 | --query -- same as --query_file |
891 | 923 | (only one sequence per file) |
892 | 924 | |
925 | --expect|evalue: 0.001 -- maximum e-value to be included in output | |
926 | ||
893 | 927 | --eval2 : "": use E()-value, "eval2": use E2()/eval2, "ave": use geom. mean |
928 | ||
929 | --qvalue: 30.0 -- minimum qvalue for domain to be considered | |
894 | 930 | |
895 | 931 | --bound_file_in -- tab delimited accession<tab>start<tab>end that |
896 | 932 | specifies MSA boundaries WITHIN alignment. |
903 | 939 | |
904 | 940 | --bound_file_out -- "--bound_file" for next iteration of psisearch2 |
905 | 941 | |
942 | --clustal -- use "CLUSTALW (1.8)" multiple alignment string | |
943 | ||
944 | --trunc_acc -- remove db, acc from db|acc|ident, e.g. sp|P09488|GSTM1_HUMAN becomes GSTM1_HUMAN | |
945 | ||
906 | 946 | --domain_bound parse domain annotations (-V) from m9B file |
907 | 947 | --domain |
908 | 948 | |
909 | 949 | --masked_lib_out -- FASTA format library of MSA sequences |
950 | ||
951 | --min_align:0.0 -- minimum fractional alignment (q_end-q_start+1)/q_len | |
910 | 952 | |
911 | 953 | --int_mask_type = "query", "rand", "X", "none" |
912 | 954 | --end_mask_type = "query", "rand", "X", "none" |
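
The three accession and filtering changes in this file (gi-number removal, --trunc_acc, and --min_align) can be sketched outside the script. The following is a minimal Python illustration, not the script's own code; the function names are hypothetical, and only the regular expressions and the fraction formula are taken from the Perl shown above.

```python
import re

def strip_gi(seqid):
    """Drop a leading 'gi|<number>|' prefix, if present (hypothetical helper)."""
    return re.sub(r'^gi\|\d+\|', '', seqid)

def truncate_acc(seqid):
    """Mirror --trunc_acc: db|acc|ident becomes ident; other ids are unchanged."""
    m = re.search(r'\|\w+\|(\w+)$', seqid)
    return m.group(1) if m else seqid

def passes_min_align(q_start, q_end, query_len, min_align=0.0):
    """Mirror --min_align: keep a hit when (q_end-q_start+1)/q_len >= min_align."""
    return (q_end - q_start + 1) / float(query_len) >= min_align
```

With the default min_align of 0.0 every hit passes, which matches the default listed in the help text.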
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2016 by William R. Pearson and The Rector & |
16 | 16 | # governing permissions and limitations under the License. |
17 | 17 | ################################################################ |
18 | 18 | |
19 | use warnings; | |
19 | 20 | use strict; |
20 | 21 | use Getopt::Long; |
21 | 22 | use Pod::Usage; |
32 | 33 | ################ |
33 | 34 | # |
34 | 35 | # command: |
35 | # psisearch2_msa.pl --query query.file --db database.file --num_iter N --pssm_evalue 0.002 --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --out_suffix none --pgm ssearch/psiblast --prev_m89res prev_results.itx.m8CB.file --sel_res selected_accs.file --prev_bounds boundary.file | |
36 | # psisearch2_msa.pl --query query.file --in_msa msa.file --db database.file --num_iter N --pssm_evalue 0.002 --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --out_suffix none --pgm ssearch/psiblast --prev_m89res prev_results.itx.m8CB.file --sel_res selected_accs.file --prev_bounds boundary.file | |
36 | 37 | # |
37 | 38 | ################ |
38 | 39 | |
53 | 54 | my $makeblastdb_bin = "$pgm_bin/makeblastdb"; |
54 | 55 | my $datatool_bin = "$pgm_bin/datatool -m $pgm_data/NCBI_all.asn"; |
55 | 56 | my $align2msa_lib = "$pgm_bin/m89_btop_msa2.pl"; |
57 | my $clustal2fasta = "$pgm_bin/clustal2fasta.pl"; | |
56 | 58 | |
57 | 59 | my %srch_subs = ('ssearch' => \&get_ssearch_cmd, |
58 | 60 | 'psiblast' => \&get_psiblast_cmd, |
60 | 62 | |
61 | 63 | my %annot_cmds = ('rpd3' => qq("\!ann_pfam28.pl --pfacc --db RPD3 --vdoms --split_over"), |
62 | 64 | 'rpd3nv' => qq("\!ann_pfam28.pl --pfacc --db RPD3 --split_over"), |
63 | 'rpd3nvn' => qq("\!ann_pfam28.pl --pfacc --db RPD3 --split_over --neg"), | |
64 | 'pfam' => qq("\!ann_pfam30.pl --vdoms --split_over --neg") | |
65 | 'rpd3nvn' => qq("\!./annot/ann_pfam28.pl --pfacc --db RPD3 --split_over --neg"), | |
66 | 'pfam' => qq("\!./annot/ann_pfam30.pl --db pfam31_qfo --vdoms --split_over --neg") | |
65 | 67 | ); |
66 | 68 | |
67 | 69 | ($num_iter, $pssm_evalue, $srch_evalue, $dom_flag, $align_flag, $int_mask, $end_mask, $query_mask, $srch_pgm, $tmp_dir, $error_log, $annot_type, $quiet) = |
68 | 70 | ( 5, 0.002, 5.0, 0, 0, 'none', 'none', 0, 'ssearch','',0, 0, "", 0); |
69 | 71 | ($save_all, $tmp_file_list, $delete_bnd, $delete_tmp) = (0, "", 0, 0); |
70 | ($prev_m89res, $m_format, $prev_sel_res, $prev_bound, $this_iter, $use_stdout) = ("","", "","", 1, 0); | |
72 | ($prev_m89res, $m_format, $prev_sel_res, $prev_bound, $this_iter, $use_stdout) = ("","m8CB", "","", 1, 0); | |
71 | 73 | |
72 | 74 | my $pgm_command = "# ".join(" ",($0,@ARGV)); |
73 | 75 | print STDERR "# ",join(" ",($0,@ARGV)),"\n" if ($error_log); |
89 | 91 | 'sel_accs=s' => \$prev_sel_res, |
90 | 92 | 'sel_file=s' => \$prev_sel_res, |
91 | 93 | 'sel_file_in=s' => \$prev_sel_res, |
92 | # 'in_msa=s' => \$prev_msa, | |
94 | 'in_msa=s' => \$prev_msa, | |
93 | 95 | # 'out_msa=s' => \$next_msa, |
94 | 96 | # 'in_hitdb=s' => \$prev_hitdb, |
95 | 97 | # 'out_hitdb=s' => \$next_hitdb, |
183 | 185 | |
184 | 186 | my @del_err_files = (); |
185 | 187 | |
186 | unless ($prev_m89res) { | |
188 | unless ($prev_m89res || $prev_msa) { | |
187 | 189 | $search = $srch_subs{$srch_pgm}($query_file, $db_file, $prev_pssm); |
188 | 190 | unless ($use_stdout) { |
189 | 191 | log_system("$search > $this_file_out 2> $this_file_out.err"); |
194 | 196 | push @del_err_files, "$this_file_out.err"; |
195 | 197 | $first_iter++; |
196 | 198 | } |
197 | else { | |
199 | elsif ($prev_m89res) { | |
198 | 200 | $this_file_out = $prev_m89res; |
201 | } | |
202 | elsif ($prev_msa) { | |
203 | # build a PSSM, do a search, up the iteration count | |
204 | $prev_pssm = pssm_from_msa($query_file, $prev_msa); | |
205 | $search = $srch_subs{$srch_pgm}($query_file, $db_file, $prev_pssm); | |
206 | unless ($use_stdout) { | |
207 | log_system("$search > $this_file_out 2> $this_file_out.err"); | |
208 | } | |
209 | else { | |
210 | log_system("$search 2> $this_file_out.err"); | |
211 | } | |
212 | push @del_err_files, "$this_file_out.err"; | |
213 | $first_iter++; | |
199 | 214 | } |
200 | 215 | |
201 | 216 | my ($this_pssm, $this_bound_out) = ("",""); |
264 | 279 | |
265 | 280 | my ($cmd) = @_; |
266 | 281 | |
267 | print STDERR "$cmd\n" if $error_log; | |
282 | print STDERR "# $cmd\n" if $error_log; | |
268 | 283 | system($cmd); |
269 | 284 | } |
270 | 285 | |
275 | 290 | sub get_ssearch_cmd { |
276 | 291 | my ($query_file, $db_file, $pssm_file) = @_; |
277 | 292 | |
278 | my $search_cmd = qq($ssearch_bin -S -m 6 -m 9B -E "$srch_evalue 0" -s BP62); | |
293 | my $mf_arg = $m_format; | |
294 | $mf_arg =~ s/^m//; | |
295 | $mf_arg =~ s/\+/ /; | |
296 | ||
297 | my $search_cmd = qq($ssearch_bin -S -E "$srch_evalue 0" -s BP62 -m $mf_arg); | |
298 | ||
279 | 299 | if ($annot_type) { |
280 | 300 | $search_cmd .= qq( -V $annot_cmds{$annot_type}); |
281 | 301 | } |
383 | 403 | } |
384 | 404 | else { |
385 | 405 | return ($this_pssm_asntxt, $this_bound_out); |
406 | } | |
407 | } | |
408 | ||
409 | ################ | |
410 | # pssm_from_msa() | |
411 | # | |
412 | # given query, --in_msa Clustal MSA | |
413 | # use psiblast to generate PSSM in .asntxt or .asnbin format | |
414 | # (later - optionally deletes intermediate files) | |
415 | # | |
416 | # always produce a $bound_file_out file to test for convergence | |
417 | # | |
418 | sub pssm_from_msa { | |
419 | my ($query_file, $msa_file) = @_; | |
420 | ||
421 | my $this_file_out = $query_file; | |
422 | ||
423 | my ($this_hit_db, $this_pssm_asntxt, $this_pssm_asnbin, $this_psibl_out, $this_bound_out) = | |
424 | ("$this_file_out.hit_db", | |
425 | "$this_file_out.asntxt", | |
426 | "$this_file_out.asnbin", | |
427 | "$this_file_out.psibl_out", | |
428 | "$this_file_out.bnd_out", | |
429 | ); | |
430 | ||
431 | my $blastdb_err = "$this_file_out.mkbldb_err"; | |
432 | ## should not need this, but may need to convert in_msa file to fasta file for equivalence to build_msa_pssm() | |
433 | my $clus2fa_cmd = qq($clustal2fasta $msa_file > $this_hit_db); | |
434 | ||
435 | log_system($clus2fa_cmd); | |
436 | ||
437 | my $makeblastdb_cmd = "$makeblastdb_bin -in $this_hit_db -dbtype prot -parse_seqids > $blastdb_err"; | |
438 | log_system($makeblastdb_cmd); | |
439 | ||
440 | my $buildpssm_cmd = "$psiblast_bin -max_target_seqs 5000 -outfmt 7 -inclusion_ethresh 100.0 -in_msa $msa_file -db $this_hit_db -out_pssm $this_pssm_asntxt -num_iterations 1 -save_pssm_after_last_round"; | |
441 | ||
442 | log_system("$buildpssm_cmd > $this_psibl_out 2> $this_psibl_out.err"); | |
443 | ||
444 | log_system("rm $this_hit_db.p* $blastdb_err"); | |
445 | ||
446 | # remove uninformative error logs | |
447 | log_system("rm $this_psibl_out.err") unless $error_log; | |
448 | ||
449 | unless ($srch_pgm eq 'psiblast') { | |
450 | my $asn2asn_cmd = "$datatool_bin -v $this_pssm_asntxt -e $this_pssm_asnbin"; | |
451 | log_system($asn2asn_cmd); | |
452 | return ($this_pssm_asnbin); | |
453 | } | |
454 | else { | |
455 | return ($this_pssm_asntxt); | |
386 | 456 | } |
387 | 457 | } |
388 | 458 |
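
The new pssm_from_msa() runs three external steps in order: convert the Clustal MSA to FASTA, build a BLAST database from it, and run psiblast with -in_msa to write the PSSM. The sketch below only assembles those command strings so the file naming is easy to follow; it does not run anything. The function name and the default program names are assumptions for illustration; the flags and file suffixes are copied from the Perl above.

```python
def pssm_from_msa_cmds(query_file, msa_file,
                       clustal2fasta="clustal2fasta.pl",   # assumed to be on PATH
                       makeblastdb_bin="makeblastdb",      # assumed to be on PATH
                       psiblast_bin="psiblast"):           # assumed to be on PATH
    """Return the three shell command strings, in order, without running them."""
    # all intermediate files are named after the query file
    hit_db = query_file + ".hit_db"
    pssm_asntxt = query_file + ".asntxt"
    psibl_out = query_file + ".psibl_out"
    blastdb_err = query_file + ".mkbldb_err"

    clus2fa = "%s %s > %s" % (clustal2fasta, msa_file, hit_db)
    mkdb = "%s -in %s -dbtype prot -parse_seqids > %s" % (
        makeblastdb_bin, hit_db, blastdb_err)
    build = ("%s -max_target_seqs 5000 -outfmt 7 -inclusion_ethresh 100.0 "
             "-in_msa %s -db %s -out_pssm %s -num_iterations 1 "
             "-save_pssm_after_last_round > %s 2> %s.err"
             % (psiblast_bin, msa_file, hit_db, pssm_asntxt, psibl_out, psibl_out))
    return [clus2fa, mkdb, build]
```

When the search program is not psiblast, the script additionally converts the .asntxt PSSM to .asnbin with datatool; that step is left out of this sketch.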
0 | #!/usr/bin/python | |
0 | #!/usr/bin/env python | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2016 by William R. Pearson and The Rector & |
33 | 33 | ################ |
34 | 34 | # |
35 | 35 | # command: |
36 | # psisearch2_msa.py --query query_file --db database --num_iter N --evalue 0.002 --no_msa --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --suffix M8CB --pgm ssearch/psiblast --prev_m89res pre_iter.out --this_iter # --num_iter # | |
36 | # psisearch2_msa.py --query query_file --db database --num_iter N --pssm_evalue 0.002 --no_msa --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --suffix M8CB --pgm ssearch/psiblast --prev_m89res pre_iter.out --this_iter # --num_iter # | |
37 | 37 | # |
38 | 38 | ################ |
39 | 39 | |
51 | 51 | makeblastdb_bin = pgm_bin+"/makeblastdb" |
52 | 52 | datatool_bin = "%s/datatool -m %s/NCBI_all.asn" % (pgm_bin,pgm_data) |
53 | 53 | align2msa_lib = "m89_btop_msa2.pl" |
54 | clustal2fasta = "clustal2fasta.py" | |
54 | 55 | |
55 | 56 | annot_cmds = {'rpd3': '"!../scripts/ann_pfam28.pl --pfacc --db RPD3 --vdoms --split_over"', |
56 | 57 | 'rpd3nv':'"!../scripts/ann_pfam28.pl --pfacc --db RPD3 --split_over"', |
57 | 58 | 'pfam':'"!../scripts/ann_pfam30.pl --pfacc --vdoms --split_over"'} |
58 | 59 | |
59 | 60 | num_iter = 5 |
60 | evalue = 0.002 | |
61 | 61 | srch_pgm = 'ssearch' |
62 | error_log = 0 | |
63 | 62 | rm_flag = 0 |
64 | 63 | quiet = 0 |
65 | 64 | |
66 | 65 | ################ |
67 | 66 | # log_system() |
68 | # run system on string, logging first if error_log | |
67 | # run system on string, logging first if args.error_log | |
69 | 68 | # |
70 | 69 | def log_system (cmd, error_log): |
71 | 70 | |
79 | 78 | # sub get_ssearch_cmd() |
80 | 79 | # builds an ssearch command line with query, db, and pssm |
81 | 80 | # |
82 | def get_ssearch_cmd(query_file, db_file, pssm_file) : | |
83 | ||
84 | search_cmd = '%s -S -m 8CB -d 0 -E "1.0 0" -s BP62' % (ssearch_bin) | |
81 | def get_ssearch_cmd(query_file, db_file, pssm_file, args) : | |
82 | ||
83 | search_cmd = '%s -S -m 8CB -d 0 -E "%f 0" -s BP62' % (ssearch_bin, args.srch_evalue) | |
85 | 84 | |
86 | 85 | if (args.annot_type) : |
87 | 86 | search_cmd += " -V %s" % (annot_cmds[args.annot_type]) |
98 | 97 | # sub get_psiblast_cmd() |
99 | 98 | # builds a psiblast command line with query, db, and pssm |
100 | 99 | # |
101 | def get_psiblast_cmd(query_file, db_file, pssm_file) : | |
102 | ||
103 | search_cmd = "%s -num_threads 4 -max_target_seqs 5000 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' -inclusion_ethresh %f -num_iterations 1 -db %s" % (psiblast_bin, args.evalue, db_file) | |
100 | def get_psiblast_cmd(query_file, db_file, pssm_file, args) : | |
101 | ||
102 | search_cmd = "%s -num_threads 4 -max_target_seqs 5000 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' -inclusion_ethresh %f -evalue %f -num_iterations 1 -db %s" % (psiblast_bin, args.pssm_evalue, args.srch_evalue, db_file) | |
104 | 103 | |
105 | 104 | if (pssm_file) : |
106 | 105 | search_cmd += " -in_pssm %s" % (pssm_file) |
119 | 118 | # |
120 | 119 | # always produce a bound_file_out file to test for convergence |
121 | 120 | # |
122 | def build_msa_pssm(query_file, this_file_out,prev_bound_in, prev_sel_res, args, error_log) : | |
121 | def build_msa_pssm(query_file, this_file_out,prev_bound_in, prev_sel_res, error_log) : | |
123 | 122 | |
124 | 123 | (this_msa, this_hit_db, this_pssm_asntxt, this_pssm_asnbin, this_psibl_out, this_bound_out) = (this_file_out+".msa",this_file_out+".hit_db",this_file_out+".asntxt",this_file_out+".asnbin",this_file_out+".psibl_out",this_file_out+".bnd_out") |
125 | 124 | |
129 | 128 | if (prev_sel_res) : |
130 | 129 | aln2msa_cmd += " --sel_res %s" % (prev_sel_res) |
131 | 130 | else: |
132 | aln2msa_cmd += " --evalue %f" % (args.evalue) | |
131 | aln2msa_cmd += " --evalue %f" % (args.pssm_evalue) | |
133 | 132 | |
134 | 133 | if (args.int_mask) : |
135 | 134 | aln2msa_cmd += " --int_mask_type %s" % (args.int_mask) |
141 | 140 | aln2msa_cmd += " --domain" |
142 | 141 | |
143 | 142 | if (args.align_flag and args.prev_bound_in) : |
144 | aln2msa_cmd += " --bound_file_in %s" %(args.prev_bound_in) | |
143 | aln2msa_cmd += " --bound_file_in %s" %(args.prev_bound_in) | |
144 | ||
145 | if (args.m_format): | |
146 | aln2msa_cmd += " --m_format %s" % (args.m_format) | |
145 | 147 | |
146 | 148 | # always produce this file to check for convergence |
147 | 149 | aln2msa_cmd += " --bound_file_out %s" % (this_bound_out) |
170 | 172 | return (this_pssm_asntxt, this_bound_out) |
171 | 173 | |
172 | 174 | ################ |
175 | # sub pssm_from_msa | |
176 | # read multiple sequence alignment, produce pssm file | |
177 | # | |
178 | def pssm_from_msa(query_file, msa_file, error_log): | |
179 | ||
180 | this_file_out = query_file | |
181 | ||
182 | this_hit_db = this_file_out+".hit_db" | |
183 | this_pssm_asntxt = this_file_out+".asntxt" | |
184 | this_pssm_asnbin = this_file_out+".asnbin" | |
185 | this_psibl_out = this_file_out+".psibl_out" | |
186 | this_bound_out = this_file_out+".bnd_out" | |
187 | ||
188 | blastdb_err = this_file_out + ".mkbldb_err" | |
189 | ||
190 | clus2fa_cmd = "%s %s > %s" % (clustal2fasta, msa_file, this_hit_db) | |
191 | ||
192 | log_system(clus2fa_cmd, error_log); | |
193 | ||
194 | makeblastdb_cmd = "%s -in %s -dbtype prot -parse_seqids > %s" % (makeblastdb_bin, this_hit_db, blastdb_err); | |
195 | log_system(makeblastdb_cmd, error_log); | |
196 | ||
197 | buildpssm_cmd = "%s -max_target_seqs 5000 -outfmt 7 -inclusion_ethresh 100.0 -in_msa %s -db %s -out_pssm %s -num_iterations 1 -save_pssm_after_last_round" % (psiblast_bin, msa_file, this_hit_db, this_pssm_asntxt) | |
198 | ||
199 | log_system("%s > %s 2> %s.err" % (buildpssm_cmd, this_psibl_out, this_psibl_out), error_log) | |
200 | ||
201 | log_system("rm %s.p* %s" % (this_hit_db,blastdb_err), error_log) | |
202 | ||
203 | # remove uninformative error logs | |
204 | if (not error_log): | |
205 | log_system("rm %s.err" % (this_psibl_out), error_log) | |
206 | ||
207 | if (srch_pgm != 'psiblast'): | |
208 | asn2asn_cmd = "%s -v %s -e %s" % (datatool_bin, this_pssm_asntxt, this_pssm_asnbin) | |
209 | log_system(asn2asn_cmd, error_log); | |
210 | return this_pssm_asnbin | |
211 | else: | |
212 | return this_pssm_asntxt | |
213 | ||
214 | ################ | |
173 | 215 | # sub has_converged() |
174 | 216 | # reads two boundary files and compares accessions |
175 | 217 | # |
210 | 252 | |
211 | 253 | srch_subs = {'ssearch' : get_ssearch_cmd, |
212 | 254 | 'psiblast': get_psiblast_cmd} |
213 | ||
214 | pgm_command = "# "+" ".join(sys.argv); | |
215 | if (error_log) : | |
216 | sys.stderr.write('pgm_command\n') | |
217 | 255 | |
218 | 256 | arg_parse = argparse.ArgumentParser(description='Iterative search with SSEARCH/PSIBLAST') |
219 | 257 | arg_parse.add_argument('--query', dest='query_file', action='store',help='query sequence file') |
221 | 259 | arg_parse.add_argument('--db', dest='db_file', action='store',help='sequence database name') |
222 | 260 | arg_parse.add_argument('--database', dest='db_file', action='store',help='sequence database name') |
223 | 261 | arg_parse.add_argument('--dir', dest='tmp_dir', action='store',help='directory for result and tmp_file output') |
224 | arg_parse.add_argument('--evalue', dest='evalue', default=0.002, type=float, action='store',help='E()-value threshold for inclusion in PSSM') | |
262 | arg_parse.add_argument('--pssm_evalue', dest='pssm_evalue', default=0.002, type=float, action='store',help='E()-value threshold for inclusion in PSSM') | |
263 | arg_parse.add_argument('--search_evalue', dest='srch_evalue', default=5.0, type=float, action='store',help='E()-value threshold for search display') | |
264 | arg_parse.add_argument('--m_format', dest='m_format', action='store',help='input result format m8 [def] or m9') | |
225 | 265 | arg_parse.add_argument('--annot_db', dest='annot_type', action='store',help='source of domain annotations') |
226 | 266 | arg_parse.add_argument('--suffix', dest='suffix', action='store',help='suffix for result output') |
227 | 267 | arg_parse.add_argument('--out_name', dest='file_out', action='store',help='result file name') |
233 | 273 | arg_parse.add_argument('--pgm', dest='srch_pgm', action='store',default='ssearch',help='search program: ssearch/psiblast') |
234 | 274 | arg_parse.add_argument('--query_seed', dest='query_mask', action='store_true',help='use query seeding') |
235 | 275 | arg_parse.add_argument('--prev_m89res', dest='prev_m89res', action='store', help='previous iteration result')
276 | arg_parse.add_argument('--prev_msa', dest='prev_msa', action='store', help='previous MSA') | |
236 | 277 | arg_parse.add_argument('--sel_res', dest='prev_sel_res', action='store', help='selected accession file') |
237 | 278 | arg_parse.add_argument('--this_iter', dest='this_iter', help='this iteration number',type=int) |
238 | 279 | arg_parse.add_argument('--int_seed', dest='int_mask', action='store',default='none',help='sequence masking: none/query/random') |
243 | 284 | arg_parse.add_argument('--save_all', dest='save_all', action='store_true',help='save all temporary files') |
244 | 285 | arg_parse.add_argument('--delete_all', dest='delete_tmp', action='store_true',help='delete all temporary files') |
245 | 286 | arg_parse.add_argument('--delete_bnd', dest='delete_bnd', action='store_true',help='delete boundary temporary file') |
287 | arg_parse.add_argument('--use_stdout', dest='use_stdout', action='store_true',help='send results to stdout',default=False) | |
288 | arg_parse.add_argument('--errors', dest='error_log', action='store_true', help='log errors', default=False) | |
246 | 289 | arg_parse.add_argument('--quiet', dest='quiet', action='store_true',help='fewer messages') |
247 | 290 | arg_parse.add_argument('-Q', dest='quiet', action='store_true',help='fewer messages') |
248 | 291 | |
249 | 292 | args = arg_parse.parse_args() |
293 | ||
294 | pgm_command = "# "+" ".join(sys.argv); | |
295 | if (args.error_log) : | |
296 | sys.stderr.write(pgm_command+'\n') | |
297 | ||
250 | 298 | if (args.quiet) : |
251 | 299 | quiet = args.quiet |
252 | 300 | |
317 | 365 | del_err_files = [] |
318 | 366 | |
319 | 367 | # do the first search |
320 | if (not args.prev_m89res): | |
321 | search_str = srch_subs[srch_pgm](args.query_file, args.db_file, args.prev_pssm) | |
322 | log_system(search_str+" > "+this_file_out+" 2> "+this_file_out+".err", error_log) | |
368 | if (not (args.prev_m89res or args.prev_msa)): | |
369 | search_str = srch_subs[srch_pgm](args.query_file, args.db_file, args.prev_pssm, args) | |
370 | if (not args.use_stdout): | |
371 | log_system(search_str+" > "+this_file_out+" 2> "+this_file_out+".err", args.error_log) | |
372 | else: | |
373 | log_system(search_str + " 2> "+this_file_out+".err", args.error_log) | |
323 | 374 | del_err_files.append(this_file_out+".err") |
324 | 375 | first_iter += 1 |
325 | else: | |
376 | elif (args.prev_m89res): | |
326 | 377 | this_file_out = args.prev_m89res |
327 | ||
378 | elif (args.prev_msa): | |
379 | # build a PSSM, do a search, up the iteration count | |
380 | prev_pssm = pssm_from_msa(args.query_file, args.prev_msa, args.error_log) | |
381 | search_str = srch_subs[srch_pgm](args.query_file, args.db_file, prev_pssm, args) | |
382 | if (not args.use_stdout): | |
383 | log_system(search_str + " > " + this_file_out + " 2> " + this_file_out + ".err", args.error_log) | |
384 | else: | |
385 | log_system(search_str + " 2> " + this_file_out + ".err", args.error_log) | |
386 | ||
387 | del_err_files.append(this_file_out+".err") | |
388 | first_iter += 1 | |
328 | 389 | |
329 | 390 | it=first_iter |
330 | 391 | |
332 | 393 | |
333 | 394 | while (it < args.num_iter) : |
334 | 395 | |
335 | (this_pssm, this_bound_out) = build_msa_pssm(args.query_file, this_file_out, prev_bound_in, arg.prev_sel_res, error_log) | |
396 | (this_pssm, this_bound_out) = build_msa_pssm(args.query_file, this_file_out, prev_bound_in, args.prev_sel_res, args.error_log) | |
336 | 397 | prev_file_out = this_file_out |
337 | arg.prev_sel_res = '' | |
398 | args.prev_sel_res = '' | |
338 | 399 | |
339 | 400 | iter_val = this_iter + it |
340 | 401 | |
347 | 408 | if (args.tmp_dir) : |
348 | 409 | this_file_out = args.tmp_dir+"/"+this_file_out |
349 | 410 | |
350 | search_str = srch_subs[srch_pgm](args.query_file, args.db_file, prev_pssm) | |
351 | log_system("%s > %s 2> %s" % (search_str,this_file_out,this_file_out+".err"), error_log) | |
411 | search_str = srch_subs[srch_pgm](args.query_file, args.db_file, prev_pssm, args) | |
412 | log_system("%s > %s 2> %s" % (search_str,this_file_out,this_file_out+".err"), args.error_log) | |
352 | 413 | del_err_files.append(this_file_out+".err") |
353 | 414 | |
354 | 415 | if (len(del_file_ext)): |
355 | 416 | del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext] |
356 | log_system('rm '+' '.join(del_file_list),error_log) | |
417 | log_system('rm '+' '.join(del_file_list),args.error_log) | |
357 | 418 | |
358 | 419 | if (has_converged(prev_bound_in, this_bound_out)) : |
359 | 420 | if (not quiet) : |
361 | 422 | |
362 | 423 | # if (len(del_file_ext)): |
363 | 424 | # del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext] |
364 | # log_system('rm '+' '.join(del_file_list),error_log) | |
425 | # log_system('rm '+' '.join(del_file_list),args.error_log) | |
365 | 426 | |
366 | 427 | if (delete_bnd) : |
367 | log_system("rm "+prev_bound_in,error_log) | |
428 | log_system("rm "+prev_bound_in,args.error_log) | |
368 | 429 | |
369 | 430 | exit(0) |
370 | 431 | |
371 | 432 | if (delete_bnd) : |
372 | log_system("rm "+prev_bound_in,error_log) | |
433 | log_system("rm "+prev_bound_in,args.error_log) | |
373 | 434 | prev_bound_in = this_bound_out |
374 | 435 | |
375 | 436 | it += 1 |
376 | 437 | |
377 | 438 | if (len(del_err_files)): |
378 | log_system('rm '+' '.join(del_err_files),error_log) | |
439 | log_system('rm '+' '.join(del_err_files),args.error_log) | |
379 | 440 | |
380 | 441 | # if (len(del_file_ext)): |
381 | 442 | # del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext] |
382 | # log_system('rm '+' '.join(del_file_list),error_log) | |
443 | # log_system('rm '+' '.join(del_file_list),args.error_log) | |
383 | 444 | |
384 | 445 | if (delete_bnd): |
385 | log_system("rm "+this_bound_out,error_log) | |
446 | log_system("rm "+this_bound_out,args.error_log) | |
386 | 447 | |
387 | 448 | if (not quiet) : |
388 | 449 | sys.stderr.write(" %s %s %s %s finished (%d iterations)\n" % (sys.argv[0], srch_pgm, query_file, args.db_file, it)) |
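
The first-iteration logic now has three cases, and --prev_m89res takes precedence when both options are given. A minimal sketch of that dispatch, with a hypothetical function name and hypothetical return labels chosen only for this illustration:

```python
def first_step(prev_m89res=None, prev_msa=None):
    """Return which start-up path applies (labels are illustrative only)."""
    if not (prev_m89res or prev_msa):
        # no earlier results: run the initial search without a PSSM
        return "search"
    elif prev_m89res:
        # reuse an earlier result file as this iteration's output
        return "reuse_results"
    else:
        # build a PSSM from the supplied MSA, then search with it
        return "pssm_from_msa_then_search"
```

Both searching branches increment the iteration count; the reuse branch does not.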
0 | #!/bin/sh | |
1 | ||
2 | ################ | |
3 | # example that runs psisearch2_msa.pl iteratively through 5 iterations. | |
4 | # Equivalent to: | |
5 | # psisearch2_msa.pl --query CL0238_emb.fa --num_iter 5 --db /slib2/fa_dbs/rpd3_pfam28_lib.lseg | |
6 | # | |
7 | ||
8 | ||
9 | PS_BIN=~/Devel/fa36_v3.8/psisearch2 | |
10 | Q_DIR="../seq" | |
11 | FA_DB=/slib2/fa_dbs/qfo78.lseg | |
12 | BL_DB=/slib2/bl_dbs/qfo78 | |
13 | DB=$FA_DB | |
14 | ||
15 | OUT_SUFF='qm8CB' | |
16 | ||
17 | M_FORMAT='m8CB' | |
18 | ITERS='2 3 4 5' | |
19 | ||
20 | for q_file_p in $*; do | |
21 | ||
22 | q_file=${q_file_p##*/} | |
23 | echo $q_file | |
24 | ||
25 | # iteration 1: | |
26 | ||
27 | $PS_BIN/psisearch2_msa.pl --query $Q_DIR/$q_file --num_iter 1 --db $DB --int_mask query --end_mask query --out_suffix $OUT_SUFF --m_format $M_FORMAT | |
28 | ||
29 | # iteration 2 - 5 | |
30 | for it in $ITERS; do | |
31 | prev=$(($it-1)) | |
32 | $PS_BIN/psisearch2_msa.pl --query $Q_DIR/$q_file --num_iter 1 --db $DB --int_mask query --end_mask query --out_suffix $OUT_SUFF --this_iter $it --prev_m89res $q_file.it${prev}.$OUT_SUFF --m_format $M_FORMAT | |
33 | done | |
34 | ||
35 | done |
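
The script above chains single-iteration runs by passing each run's result file to the next through --prev_m89res. The sketch below reproduces only the argument lists and the `<query>.it<N>.<suffix>` naming it relies on; the function name is hypothetical, the defaults are the values from the shell script, and nothing is executed.

```python
def iteration_cmds(q_file, db, out_suff="qm8CB", m_format="m8CB",
                   n_iter=5, q_dir="../seq", script="psisearch2_msa.pl"):
    """Build one argument list per iteration, as the shell loop would run them."""
    cmds = []
    for it in range(1, n_iter + 1):
        cmd = [script, "--query", "%s/%s" % (q_dir, q_file),
               "--num_iter", "1", "--db", db,
               "--int_mask", "query", "--end_mask", "query",
               "--out_suffix", out_suff]
        if it > 1:
            # later iterations start from the previous iteration's result file
            cmd += ["--this_iter", str(it),
                    "--prev_m89res", "%s.it%d.%s" % (q_file, it - 1, out_suff)]
        cmd += ["--m_format", m_format]
        cmds.append(cmd)
    return cmds
```

Each list could be handed to subprocess.run() as-is, since no shell quoting is involved.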
0 | #!/bin/sh | |
1 | ||
2 | ################ | |
3 | # example that runs psisearch2_msa.pl iteratively through 5 iterations using psiblast instead of ssearch | |
4 | # Equivalent to: | |
5 | # psisearch2_msa.pl --pgm psiblast --query query.aa --num_iter 5 --db /slib2/bl_dbs/qfo78 | |
6 | # | |
7 | ||
8 | PS_BIN=~/Devel/fa36_v3.8/psisearch2 | |
9 | q_file=$1 | |
10 | m_format='m8CB' | |
11 | SRC_QDIR=../hum_1dom200_queries | |
12 | ||
13 | iters='2 3 4 5' | |
14 | # iters='' | |
15 | ||
16 | for q_file_p in $*; do | |
17 | ||
18 | q_file=${q_file_p##*/} | |
19 | echo $q_file | |
20 | ||
21 | # iteration 1: | |
22 | # echo "$PS_BIN/psisearch2_msa.pl --pgm psiblast --query $SRC_QDIR/$q_file --num_iter 1 --db /slib2/bl_dbs/qfo78 --int_mask query --end_mask query --out_suffix q_pblt --m_format $m_format --save_list asnbin" | |
23 | $PS_BIN/psisearch2_msa.pl --pgm psiblast --query $SRC_QDIR/$q_file --num_iter 1 --db /slib2/bl_dbs/qfo78 --int_mask query --end_mask query --out_suffix q_pblt --m_format $m_format --save_list asntxt | |
24 | ||
25 | # iteration 2 - 5 | |
26 | for it in $iters; do | |
27 | prev=$(($it-1)) | |
28 | $PS_BIN/psisearch2_msa.pl --pgm psiblast --query $SRC_QDIR/$q_file --num_iter 1 --db /slib2/bl_dbs/qfo78 --int_mask query --end_mask query --out_suffix q_pblt --this_iter $it --prev_m89res $q_file.it${prev}.q_pblt --m_format $m_format --save_list asntxt | |
29 | done | |
30 | done |
0 | 0 | |
1 | 1 | 22-Jan-2014 |
2 | 2 | 13-Apr-2016 updated |
3 | 22-Feb-2019 updated | |
3 | 4 | |
4 | 5 | fasta36/scripts |
5 | 6 | |
6 | 7 | Perl scripts for annotating sequences and expanding libraries |
8 | ||
9 | -- Sequence generation (January, February, 2019) | |
10 | ||
11 | The FASTA programs can now use sequences that are downloaded from | |
12 | Uniprot or NCBI/RefSeq (or otherwise provided by a program script that | |
13 | produces FASTA sequences from an identifier) by specifying the name of | |
14 | the script, the accession(s), and library type 9, e.g. | |
15 | ||
16 | fasta36 \!../scripts/get_protein.py+P09488 /seqlib/swissprot.fasta | |
17 | ||
18 | Scripts are available for downloading protein sequences from Uniprot | |
19 | or RefSeq (get_protein.py), Uniprot (get_uniprot.py), and for | |
20 | downloading either protein or mRNA sequences from RefSeq | |
21 | (get_refseq.py). | |
22 | ||
23 | scripts/get_protein.py get Refseq or Uniprot proteins | |
24 | scripts/get_refseq.py get RefSeq proteins or mRNAs | |
25 | scripts/get_up_prot_iso_sql.py get a protein and its isoforms using a mysql database | |
26 | scripts/get_genome_seq.py get human (hg38) or mouse (mm10, with --genome mm10) genome sequences using bedtools, e.g. "get_genome_seq.py chr1:123456-126543" | |
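
When one of these scripts is used as the query from another program rather than from an interactive shell, the "!script+accession" string can be assembled directly. The following is a minimal sketch with a hypothetical helper name. It assumes that the backslash in the example above is only there to protect "!" from shell history expansion, so it is dropped when the arguments are passed as a list with no shell involved.

```python
def script_query_spec(script_path, accession):
    """Build the '!script+accession' query string (hypothetical helper)."""
    return "!%s+%s" % (script_path, accession)

# argument list equivalent to the fasta36 example above; not executed here
fasta_cmd = ["fasta36",
             script_query_spec("../scripts/get_protein.py", "P09488"),
             "/seqlib/swissprot.fasta"]
```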
7 | 27 | |
8 | 28 | -- Sequence alignment scoring/annotation |
9 | 29 | |
82 | 102 | ann_pdb_cath.pl -- generate CATH domains using PDB accessions from a mySQL database |
83 | 103 | ann_pdb_vast.pl -- use VAST domains, but domain names are not informative |
84 | 104 | |
85 | ann_pfam27.pl -- generate Pfam domains using local Pfam mySQL database (Pfam27 with auto_pfamA, auto_pfamseq) | |
86 | ann_pfam28.pl -- generate Pfam domains using local Pfam mySQL database (Pfam28, no auto_pfamA, auto_pfamseq) | |
105 | ann_pfam28.pl -- generate Pfam domains using local Pfam mySQL database | |
106 | (Pfam28, no auto_pfamA, auto_pfamseq) | |
107 | ||
87 | 108 | ann_pfam_www.pl -- use Pfam Website, and XML::Twig, to get Pfam domain info. |
88 | 109 | |
89 | ann_exons_ens.pl -- generate exon boundaries on SwissProt proteins from Ensembl. | |
90 | ann_exons_up_www.pl -- generate exon boundaries on SwissProt proteins using the EBI/Proteins/API/coordinate service | |
110 | ann_exons_up_www.pl -- generate exon boundaries on Uniprot proteins | |
111 | using the EBI/Proteins/API/coordinate service | |
112 | ||
113 | ann_exons_up_sql_www.pl -- generate exon boundaries on Uniprot | |
114 | proteins using an SQL database (if available) or the EBI/Proteins | |
115 | coordinate service. The SQL results are dramatically faster. | |
116 | ||
91 | 117 | ann_exons_ncbi.pl -- generate exon boundaries on NCBI refseq proteins. |
92 | 118 | |
93 | 119 | -- Library expansion |
94 | 120 | |
121 | expand_up_isoforms.pl -- for Uniprot reference proteomes, provide | |
122 | isoforms for each canonical sequence. | |
123 | ||
95 | 124 | expand_uniref50.pl -- allows search of uniref50 to be expanded |
96 | expand_links.pl -- script to take hits from a smaller library and expand to complete library | |
125 | ||
126 | expand_links.pl -- script to take hits from a smaller library and | |
127 | expand to complete library | |
128 | ||
97 | 129 | links2sql.pl -- create links for expand_links.pl |
98 | 130 | |
99 | 131 | exp_up_ensg.pl -- expand uniprot sequences to include Ensembl splice variants |
0 | #!/usr/bin/env perl | |
1 | ||
2 | ################################################################ | |
3 | # copyright (c) 2014,2015 by William R. Pearson and The Rector & | |
4 | # Visitors of the University of Virginia | |
5 | ################################################################ | |
6 | # Licensed under the Apache License, Version 2.0 (the "License"); | |
7 | # you may not use this file except in compliance with the License. | |
8 | # You may obtain a copy of the License at | |
9 | # | |
10 | # http://www.apache.org/licenses/LICENSE-2.0 | |
11 | # | |
12 | # Unless required by applicable law or agreed to in writing, | |
13 | # software distributed under this License is distributed on an "AS | |
14 | # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either | |
15 | # express or implied. See the License for the specific language | |
16 | # governing permissions and limitations under the License. | |
17 | ################################################################ | |
18 | ||
19 | # ann_exons_up_sql.pl gets an annotation file from fasta36 -V with a line of the form: | |
20 | ||
21 | # gi|62822551|sp|P00502|GSTA1_RAT Glutathione S-transfer\n (at least from pir1.lseg) | |
22 | # | |
23 | # it must: | |
24 | # (1) read in the line | |
25 | # (2) parse it to get the up_acc | |
26 | # (3) return the tab delimited features | |
27 | ||
28 | # this version can read feature2 uniprot features (acc/pos/end/label/value), but returns sorted start/end domains | |
29 | # modified 18-Jan-2016 to produce annotation symbols consistent with ann_exons_up_www2.pl | |
30 | # modified Dec 2018 to generate genomic coordinates with --gen_coord | |
31 | # modified 3-Jan-2019 to merge sql and www (--www) access to exon coordinates | |
32 | ||
33 | use warnings; | |
34 | use strict; | |
35 | ||
36 | use DBI; | |
37 | use Getopt::Long; | |
38 | use Pod::Usage; | |
39 | use LWP::Simple; | |
40 | use LWP::UserAgent; | |
41 | use JSON qw(decode_json); | |
42 | ||
43 | use vars qw($host $db $a_table $port $user $pass); | |
44 | ||
45 | my %domains = (); | |
46 | my $domain_cnt = 0; | |
47 | ||
48 | my $hostname = `/bin/hostname`; | |
49 | ||
50 | unless ($hostname =~ m/ebi/) { | |
51 | ($host, $db, $a_table, $port, $user, $pass) = ("wrpxdb.its.virginia.edu", "uniprot", "annot2", 0, "web_user", "fasta_www"); | |
52 | # $host = 'xdb'; | |
53 | } | |
54 | else { | |
55 | ($host, $db, $a_table, $port, $user, $pass) = ("mysql-pearson-prod", "up_db", "annot", 4124, "web_user", "fasta_www"); | |
56 | } | |
57 | ||
58 | my ($lav, $gen_coord, $exon_label, $use_www, $shelp, $help) = (0,0,0,0,0,0); | |
59 | ||
60 | my ($show_color) = (1); | |
61 | my $color_sep_str = " :"; | |
62 | $color_sep_str = '~'; | |
63 | ||
64 | GetOptions( | |
65 | "gen_coord|gene_coord!" => \$gen_coord, | |
66 | "exon_label|label_exons!" => \$exon_label, | |
67 | "www!" => \$use_www, | |
68 | "host=s" => \$host, | |
69 | "db=s" => \$db, | |
70 | "user=s" => \$user, | |
71 | "password=s" => \$pass, | |
72 | "port=i" => \$port, | |
73 | "lav" => \$lav, | |
74 | "h|?" => \$shelp, | |
75 | "help" => \$help, | |
76 | ); | |
77 | ||
78 | pod2usage(1) if $shelp; | |
79 | pod2usage(exitstatus => 0, verbose => 2) if $help; | |
80 | pod2usage(1) unless (@ARGV || -p STDIN || -f STDIN); | |
81 | ||
82 | my $connect = "dbi:mysql(AutoCommit=>1,RaiseError=>1):database=$db"; | |
83 | $connect .= ";host=$host" if $host; | |
84 | $connect .= ";port=$port" if $port; | |
85 | ||
86 | my $dbh = DBI->connect($connect, | |
87 | $user, | |
88 | $pass | |
89 | ) or die $DBI::errstr; | |
90 | ||
91 | ||
92 | my $get_annot_sub = \&get_annots; | |
93 | ||
94 | ||
95 | my $ua = LWP::UserAgent->new(ssl_opts=>{verify_hostname => 0}); | |
96 | my $uniprot_url = 'https://www.ebi.ac.uk/proteins/api/coordinates/'; | |
97 | my $uniprot_suff = ".json"; | |
98 | ||
99 | ||
100 | if ($use_www) { | |
101 | $get_annot_sub = \&get_annots_up_www; | |
102 | } | |
103 | ||
104 | ||
105 | my $get_annots_id = $dbh->prepare(qq(select up_exons.* from up_exons join annot2 using(acc) where id=? order by ix)); | |
106 | my $get_annots_acc = $dbh->prepare(qq(select up_exons.* from up_exons where acc=? order by ix)); | |
107 | my $get_annots_refacc = $dbh->prepare(qq(select ref_acc, start, end, ix from up_exons join annot2 using(acc) where ref_acc=? order by ix)); | |
108 | my $get_annots_refseq = $dbh->prepare(qq(select acc, ex_p_start as start, ex_p_end as end, ex_num as ix, chrom, g_start, g_end from seqdb_demo2.ref_exons where acc=? order by ix)); | |
109 | ||
110 | my $get_annots_sql = $get_annots_acc; | |
111 | ||
112 | my ($tmp, $gi, $sdb, $acc, $id, $use_acc); | |
113 | ||
114 | # get the query | |
115 | my ($query, $seq_len) = @ARGV; | |
116 | $seq_len = 0 unless defined($seq_len); | |
117 | ||
118 | $query =~ s/^>// if ($query); | |
119 | ||
120 | my @annots = (); | |
121 | ||
122 | #if it's a file I can open, read and parse it | |
123 | unless ($query && ($query =~ m/[\|:]/ || | |
124 | $query =~ m/^[NX]P_/ || | |
125 | $query =~ m/^[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}\s/)) { | |
126 | ||
127 | while (my $a_line = <>) { | |
128 | $a_line =~ s/^>//; | |
129 | chomp $a_line; | |
130 | push @annots, show_annots($a_line, $get_annot_sub, $use_www); | |
131 | } | |
132 | } | |
133 | else { | |
134 | push @annots, show_annots("$query\t$seq_len", $get_annot_sub, $use_www); | |
135 | } | |
136 | ||
137 | for my $seq_annot (@annots) { | |
138 | print ">",$seq_annot->{seq_info},"\n"; | |
139 | for my $annot (@{$seq_annot->{list}}) { | |
140 | if (!$lav && $show_color && defined($domains{$annot->[-1]})) { | |
141 | $annot->[-1] .= $color_sep_str.$domains{$annot->[-1]}; | |
142 | } | |
143 | print join("\t",@$annot),"\n"; | |
144 | } | |
145 | } | |
146 | ||
147 | exit(0); | |
148 | ||
149 | sub show_annots { | |
150 | my ($query_len, $get_annot_sub, $use_www) = @_; | |
151 | ||
152 | my ($annot_line, $seq_len) = split(/\t/,$query_len); | |
153 | ||
154 | my %annot_data = (seq_info=>$annot_line); | |
155 | ||
156 | if ($annot_line =~ m/^gi\|/) { | |
157 | $use_acc = 1; | |
158 | ($tmp, $gi, $sdb, $acc, $id) = split(/\|/,$annot_line); | |
159 | } | |
160 | elsif ($annot_line =~ m/^(SP|TR):(\w+) (\w+)/) { | |
161 | ($sdb, $id, $acc) = ($1,$2,$3); | |
162 | $use_acc = 1; | |
163 | $sdb = lc($sdb) | |
164 | } | |
165 | elsif ($annot_line =~ m/^(SP|TR):(\w+)/) { | |
166 | ($sdb, $id) = ($1,$2); | |
167 | $use_acc = 0; | |
168 | $sdb = lc($sdb) | |
169 | } | |
170 | elsif ($annot_line !~ m/\|/) { # new NCBI swissprot format | |
171 | $use_acc =1; | |
172 | if ($annot_line =~ m/[NXY]P_\d+/) { | |
173 | $sdb = 'ref'; | |
174 | } | |
175 | else { | |
176 | $sdb = 'sp'; | |
177 | } | |
178 | ($acc) = split(/\s+/,$annot_line); | |
179 | } | |
180 | else { | |
181 | $use_acc = 1; | |
182 | ($sdb, $acc, $id) = split(/\|/,$annot_line); | |
183 | } | |
184 | ||
185 | unless ($use_acc) { | |
186 | $get_annots_sql = $get_annots_id; | |
187 | $get_annots_sql->execute($id); | |
188 | } | |
189 | else { | |
190 | if ($sdb =~ m/ref/) { | |
191 | $get_annots_sql = $get_annots_refseq; | |
192 | } else { | |
193 | $get_annots_sql = $get_annots_acc; | |
194 | } | |
195 | $acc =~ s/\.\d+$//; | |
196 | ||
197 | unless ($use_www) { | |
198 | $get_annots_sql->execute($acc); | |
199 | } | |
200 | else { | |
201 | $get_annots_sql = $acc; | |
202 | } | |
203 | } | |
204 | ||
205 | $annot_data{list} = $get_annot_sub->($get_annots_sql, $seq_len); | |
206 | ||
207 | return \%annot_data; | |
208 | } | |
209 | ||
210 | sub get_annots { | |
211 | my ($get_annots_sql, $seq_len) = @_; | |
212 | ||
213 | my @feats = (); | |
214 | ||
215 | while (my $exon_hr = $get_annots_sql->fetchrow_hashref()) { | |
216 | my $ix = $exon_hr->{ix}; | |
217 | if ($lav) { | |
218 | push @feats, [$exon_hr->{start}, $exon_hr->{end}, "exon_$ix~$ix"]; | |
219 | } else { | |
220 | my ($exon_info,$ex_info_start, $ex_info_end) = ("exon_$ix~$ix","",""); | |
221 | if ($gen_coord) { | |
222 | if (defined($exon_hr->{g_start})) { | |
223 | my $chr=$exon_hr->{chrom}; | |
224 | $chr = "unk" unless $chr; | |
225 | if ($chr =~ m/^\d+$/ || $chr =~m/^[XYZ]+$/) { | |
226 | $chr = "chr$chr"; | |
227 | } | |
228 | $ex_info_start = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_start}); | |
229 | $ex_info_end = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_end}); | |
230 | if ($exon_label) { | |
231 | $exon_info = sprintf("exon_%d{%s:%d-%d}~%d",$ix, $chr, $exon_hr->{g_start}, $exon_hr->{g_end}, $ix); | |
232 | push @feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info]; | |
233 | } else { | |
234 | push @feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info]; | |
235 | push @feats, [$exon_hr->{start},'<','-',$ex_info_start]; | |
236 | push @feats, [$exon_hr->{end},'>','-',$ex_info_end]; | |
237 | } | |
238 | } | |
239 | } else { | |
240 | push @feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info]; | |
241 | } | |
242 | } | |
243 | } | |
244 | ||
245 | return \@feats; | |
246 | } | |
247 | ||
248 | sub get_annots_up_www { | |
249 | my ($acc, $seq_len) = @_; | |
250 | ||
251 | my @feats = (); | |
252 | ||
253 | # my $exon_json = get_https($uniprot_url.$acc.$uniprot_suff); | |
254 | my $exon_json = get($uniprot_url.$acc.$uniprot_suff); | |
255 | ||
256 | unless (!$exon_json || $exon_json =~ m/errorMessage/ || $exon_json =~ m/Can not find/) { | |
257 | return parse_json_up_exons($exon_json); | |
258 | } | |
259 | else { | |
260 | return []; | |
261 | } | |
262 | } | |
263 | ||
264 | sub parse_json_up_exons { | |
265 | my ($exon_json) = @_; | |
266 | ||
267 | my @exons = (); | |
268 | my @ex_coords = (); | |
269 | ||
270 | my $acc_exons = decode_json($exon_json); | |
271 | ||
272 | my $exon_num = 1; | |
273 | my $last_end = 0; | |
274 | my $last_phase = 0; | |
275 | ||
276 | my $chrom = $acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'chromosome'}; | |
277 | my $rev_strand = $acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'reverseStrand'}; | |
278 | ||
279 | for my $exon ( @{$acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'exon'}} ) { | |
280 | my ($p_begin, $p_end) = ($exon->{'proteinLocation'}{'begin'}{'position'},$exon->{'proteinLocation'}{'end'}{'position'}); | |
281 | my ($g_begin, $g_end) = ($exon->{'genomeLocation'}{'begin'}{'position'},$exon->{'genomeLocation'}{'end'}{'position'}); | |
282 | ||
283 | my $this_phase = 0; | |
284 | if (defined($g_begin) && defined($g_end)) { | |
285 | $this_phase = ($g_end - $g_begin + 1) % 3; | |
286 | } | |
287 | ||
288 | if (!defined($p_begin) || !defined($p_end)) { | |
289 | $exon_num++; | |
290 | $last_phase = 0; | |
291 | next; | |
292 | } | |
293 | ||
294 | if ($p_end >= $p_begin) { | |
295 | if ($p_begin == $last_end) { | |
296 | if ($last_phase==2) { | |
297 | $p_begin += 1; | |
298 | } | |
299 | elsif ($last_phase==1) { | |
300 | $last_end -= 1; | |
301 | $exons[-1]->{seq_end} -= 1; | |
302 | } | |
303 | } | |
304 | ||
305 | if ($p_begin <= $last_end && $p_end > $last_end) { | |
306 | $p_begin = $last_end+1; | |
307 | } | |
308 | $last_end = $p_end; | |
309 | $last_phase = $this_phase; | |
310 | ||
311 | my ($gs_begin, $gs_end) = ($g_begin, $g_end); | |
312 | if ($rev_strand) { | |
313 | ($gs_begin, $gs_end) = ($g_end, $g_begin); | |
314 | } | |
315 | ||
316 | push @exons, { | |
317 | ix=>$exon_num, | |
318 | start=>$p_begin, | |
319 | end=>$p_end, | |
320 | g_start=>$gs_begin, | |
321 | g_end=>$gs_end, | |
322 | chrom=>$chrom, | |
323 | }; | |
324 | ||
325 | $exon_num++; | |
326 | } | |
327 | } | |
328 | ||
329 | # convert each exon record into annotation features: | |
330 | # lav2plt.pl format with --lav, otherwise FASTA annotation rows, | |
331 | # optionally with genomic coordinates (--gen_coord) | |
332 | ||
333 | my @ex_feats = (); | |
334 | ||
335 | for my $exon_hr (@exons) { | |
336 | my $ix = $exon_hr->{ix}; | |
337 | if ($lav) { | |
338 | push @ex_feats, [$exon_hr->{start}, $exon_hr->{end}, "exon_$ix~$ix" ]; | |
339 | } | |
340 | else { | |
341 | my ($exon_info,$ex_info_start, $ex_info_end) = ("exon_$ix~$ix","",""); | |
342 | if ($gen_coord) { | |
343 | if (defined($exon_hr->{g_start})) { | |
344 | my $chr=$exon_hr->{chrom}; | |
345 | $chr = "unk" unless $chr; | |
346 | if ($chr =~ m/^\d+$/ || $chr =~m/^[XYZ]+$/) { | |
347 | $chr = "chr$chr"; | |
348 | } | |
349 | $ex_info_start = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_start}); | |
350 | $ex_info_end = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_end}); | |
351 | if ($exon_label) { | |
352 | $exon_info = sprintf("exon_%d{%s:%d-%d}~%d",$ix, $chr, $exon_hr->{g_start}, $exon_hr->{g_end},$ix); | |
353 | push @ex_feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info]; | |
354 | } else { | |
355 | push @ex_feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info]; | |
356 | push @ex_feats, [$exon_hr->{start},'<','-',$ex_info_start]; | |
357 | push @ex_feats, [$exon_hr->{end},'>','-',$ex_info_end]; | |
358 | } | |
359 | } | |
360 | } else { | |
361 | push @ex_feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info]; | |
362 | } | |
363 | } | |
364 | } | |
365 | return \@ex_feats; | |
366 | } | |
367 | ||
368 | sub get_https { | |
369 | my ($url) = @_; | |
370 | ||
371 | my $result = ""; | |
372 | my $response = $ua->get($url); | |
373 | ||
374 | if ($response->is_success) { | |
375 | $result = $response->decoded_content; | |
376 | } else { | |
377 | $result = ''; | |
378 | } | |
379 | return $result; | |
380 | } | |
381 | ||
382 | ||
383 | ||
384 | __END__ | |
385 | ||
386 | =pod | |
387 | ||
388 | =head1 NAME | |
389 | ||
390 | ann_exons_up_sql.pl | |
391 | ||
392 | =head1 SYNOPSIS | |
393 | ||
394 | ann_exons_up_sql.pl --lav 'sp|P09488|GSTM1_HUMAN' | accession.file | |
395 | ||
396 | =head1 OPTIONS | |
397 | ||
398 | -h short help | |
399 | --help include description | |
400 | --gen_coord -- provide genomic exon start/stop coordinates as features | |
401 | --lav produce lav2plt.pl annotation format, only show domains/repeats | |
402 | --host, --user, --password, --port --db -- info for mysql database | |
403 | ||
404 | =head1 DESCRIPTION | |
405 | ||
406 | C<ann_exons_all.pl> extracts exon location information from mysql | |
407 | databases (uniprot for Uniprot proteins, seqdb_demo2 for refseq) built | |
408 | from EBI/proteins API data (Uniprot) or Refseq GFF data (refseq). | |
409 | ||
410 | Given a command line argument that contains a sequence accession | |
411 | (P09488) or identifier (GSTM1_HUMAN), the program looks up the | |
412 | features available for that sequence and returns them in a | |
413 | tab-delimited format: | |
414 | ||
415 | >sp|P09488|GSTM1_HUMAN | |
416 | 1 - 12 exon_1~1 | |
417 | 13 - 38 exon_2~2 | |
418 | 39 - 59 exon_3~3 | |
419 | 60 - 87 exon_4~4 | |
420 | 88 - 120 exon_5~5 | |
421 | 121 - 152 exon_6~6 | |
422 | 153 - 189 exon_7~7 | |
423 | 190 - 218 exon_8~8 | |
424 | ||
425 | C<ann_exons_all.pl --gen_coord 'sp|P09488|GSTM1_HUMAN'> also provides genomic coordinates: | |
426 | ||
427 | >sp|P09488|GSTM1_HUMAN | |
428 | 1 - 12 exon_1~1 | |
429 | 1 < - exon_1::chr1:109687874 | |
430 | 12 > - exon_1::chr1:109687909 | |
431 | 13 - 37 exon_2~2 | |
432 | 13 < - exon_2::chr1:109688170 | |
433 | 37 > - exon_2::chr1:109688245 | |
434 | 38 - 59 exon_3~3 | |
435 | 38 < - exon_3::chr1:109688673 | |
436 | 59 > - exon_3::chr1:109688737 | |
437 | ... | |
438 | 190 - 218 exon_8~8 | |
439 | 190 < - exon_8::chr1:109693206 | |
440 | 218 > - exon_8::chr1:109693292 | |
441 | ||
442 | C<ann_exons_all.pl> is designed to be used by the B<FASTA> programs | |
443 | with the C<-V \!ann_exons_all.pl> option, or by the | |
444 | C<annot_blast_btop.pl> script. It can also be used with the | |
445 | lav2plt.pl program with the C<--xA "\!ann_exons_all.pl --lav"> or | |
446 | C<--yA "\!ann_exons_all.pl --lav"> options. | |
447 | ||
448 | =head1 AUTHOR | |
449 | ||
450 | William R. Pearson, wrp@virginia.edu | |
451 | ||
452 | =cut |
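
The `--gen_coord` output described in the POD above can be illustrated with a small standalone sketch. It assumes an exon record with the same keys the script uses (`ix`, `start`, `end`, `chrom`, `g_start`, `g_end`); the helper name `exon_to_feats` is hypothetical and is not part of the FASTA distribution. Unlike the `get_annots` routine above, this sketch always emits the protein-coordinate row, even when no genomic coordinates are available.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of the FASTA scripts): turn one exon
# record into the tab-delimited feature rows shown for --gen_coord.
sub exon_to_feats {
    my ($exon) = @_;
    my $ix = $exon->{ix};

    # protein-coordinate row: start, '-', end, label~color_index
    my @feats = ([$exon->{start}, '-', $exon->{end}, "exon_$ix~$ix"]);

    # without genomic coordinates, only the protein row is returned
    return \@feats unless defined $exon->{g_start};

    # bare chromosome names (1, 22, X) get a "chr" prefix
    my $chr = $exon->{chrom} || 'unk';
    $chr = "chr$chr" if $chr =~ m/^\d+$/ || $chr =~ m/^[XYZ]+$/;

    # '<' marks the genomic position of the exon start, '>' the exon end
    push @feats, [$exon->{start}, '<', '-',
                  sprintf("exon_%d::%s:%d", $ix, $chr, $exon->{g_start})];
    push @feats, [$exon->{end}, '>', '-',
                  sprintf("exon_%d::%s:%d", $ix, $chr, $exon->{g_end})];
    return \@feats;
}

# values taken from the exon_1 example in the POD above
my $feats = exon_to_feats({ix => 1, start => 1, end => 12, chrom => '1',
                           g_start => 109687874, g_end => 109687909});
print join("\t", @$_), "\n" for @$feats;
```

Run as written, the sketch prints the three `exon_1` rows from the `--gen_coord` example: the protein range, then the `<` and `>` rows carrying `chr1:109687874` and `chr1:109687909`.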
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014,2015 by William R. Pearson and The Rector & |
28 | 28 | # (3) return the tab delimited exon boundaries |
29 | 29 | |
30 | 30 | |
31 | use warnings; | |
31 | 32 | use strict; |
32 | 33 | |
33 | 34 | use DBI; |
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | # ann_exons_ncbi.pl gets an annotation file from fasta36 -V with a line of the form: |
3 | 3 | |
4 | # gi|23065544|ref|NP_000552.2| | |
4 | # gi|23065544|ref|NP_000552.2| or | |
5 | # NP_000552 | |
5 | 6 | # |
6 | 7 | # and returns the exons present in the protein from NCBI gff3 tables (human, mouse, rat, xtrop) |
7 | 8 | # |
11 | 12 | # (3) return the tab delimited exon boundaries |
12 | 13 | # |
13 | 14 | |
15 | use warnings; | |
14 | 16 | use strict; |
15 | 17 | |
16 | 18 | use DBI; |
23 | 25 | |
24 | 26 | ($host, $db, $port, $user, $pass) = ("wrpxdb.its.virginia.edu", "seqdb_demo2", 0, "web_user", "fasta_www"); |
25 | 27 | |
26 | my ($auto_reg,$rpd2_fams, $neg_doms, $lav, $no_doms, $pf_acc, $shelp, $help) = (0, 0, 0, 0,0, 0,0,0); | |
27 | my ($min_nodom) = (10); | |
28 | my ($lav, $shelp, $help) = (0, 0, 0); | |
28 | 29 | |
29 | 30 | my $color_sep_str = " :"; |
30 | 31 | $color_sep_str = '~'; |
36 | 37 | "password=s" => \$pass, |
37 | 38 | "port=i" => \$port, |
38 | 39 | "lav" => \$lav, |
39 | "neg" => \$neg_doms, | |
40 | "neg_doms" => \$neg_doms, | |
41 | "neg-doms" => \$neg_doms, | |
42 | "min_nodom=i" => \$min_nodom, | |
43 | "pfacc" => \$pf_acc, | |
44 | "RPD2" => \$rpd2_fams, | |
45 | "auto_reg" => \$auto_reg, | |
46 | 40 | "h|?" => \$shelp, |
47 | 41 | "help" => \$help, |
48 | 42 | ); |
130 | 124 | elsif ($annot_line =~ m/^ref\|/) { |
131 | 125 | ($sdb, $acc) = split(/\|/,$annot_line); |
132 | 126 | } |
127 | else { | |
128 | $acc = $annot_line; | |
129 | } | |
133 | 130 | |
134 | 131 | $acc =~ s/\.\d+$//; |
135 | 132 | $get_annots_sql->execute($acc); |
147 | 144 | # get the list of domains, sorted by start |
148 | 145 | while ( my $row_href = $get_annots->fetchrow_hashref()) { |
149 | 146 | |
150 | $row_href->{info} = "exon_".$row_href->{ex_num}; | |
147 | $row_href->{info} = "exon_".$row_href->{ex_num}.$color_sep_str.$row_href->{ex_num}; | |
151 | 148 | push @exons, $row_href |
152 | 149 | } |
153 | 150 | |
171 | 168 | return \@feats; |
172 | 169 | } |
173 | 170 | |
174 | # domain name takes a uniprot domain label, removes comments ( ; | |
175 | # truncated) and numbers and returns a canonical form. Thus: | |
176 | # Cortactin 6. | |
177 | # Cortactin 7; truncated. | |
178 | # becomes "Cortactin" | |
179 | # | |
180 | ||
181 | sub domain_name { | |
182 | ||
183 | my ($value) = @_; | |
184 | ||
185 | if (!defined($domains{$value})) { | |
186 | $domain_cnt++; | |
187 | $domains{$value} = $domain_cnt; | |
188 | } | |
189 | return $value; | |
190 | } | |
191 | ||
192 | 171 | __END__ |
193 | 172 | |
194 | 173 | =pod |
195 | 174 | |
196 | 175 | =head1 NAME |
197 | 176 | |
198 | ann_feats.pl | |
177 | ann_exons_ncbi.pl | |
199 | 178 | |
200 | 179 | =head1 SYNOPSIS |
201 | 180 | |
202 | ann_pfam.pl --neg-doms 'sp|P09488|GSTM1_NUMAN' | accession.file | |
181 | ann_exons_ncbi.pl NP_000552 | |
203 | 182 | |
204 | 183 | =head1 OPTIONS |
205 | 184 | |
206 | 185 | -h short help |
207 | 186 | --help include description |
208 | --neg-doms, -- report domains between annotated domains as NODOM | |
209 | (also --neg, --neg_doms) | |
210 | --min_nodom=10 -- minimum length between domains for NODOM | |
211 | ||
187 | --lav produce lav2plt.pl annotation format, only show domains/repeats | |
212 | 188 | --host, --user, --password, --port --db -- info for mysql database |
213 | 189 | |
214 | 190 | =head1 DESCRIPTION |
215 | 191 | |
216 | C<ann_pfam.pl> extracts domain information from a msyql | |
192 | C<ann_exons_ncbi.pl> extracts exon information from a mysql | |
217 | 193 | database. Currently, the program works with database sequence |
218 | 194 | descriptions in one of two formats: |
219 | 195 | |
220 | >pf26|649|O94823|AT10B_HUMAN -- RPD2_seqs | |
221 | ||
222 | (pf26 databases have auto_pfamseq in the second field) and | |
223 | ||
224 | >gi|1705556|sp|P54670.1|CAF1_DICDI | |
225 | ||
226 | C<ann_pfam.pl> uses the C<pfamA_reg_full_significant>, C<pfamseq>, | |
227 | and C<pfamA> tables of the C<pfam> database to extract domain | |
228 | information on a protein. For proteins that have multiple domains | |
229 | associated with the same overlapping region (domains overlap by more | |
230 | than 1/3 of the domain length), C<auto_pfam.pl> selects the domain | |
231 | annotation with the best C<domain_evalue_score>. When domains overlap | |
232 | by less than 1/3 of the domain length, they are shortened to remove | |
233 | the overlap. | |
234 | ||
235 | C<ann_pfam.pl> is designed to be used by the B<FASTA> programs with | |
236 | the C<-V \!ann_pfam.pl> or C<-V "\!ann_pfam.pl --neg"> option. | |
196 | >gi|23065544|ref|NP_000552.2| or | |
197 | >NP_000552 | |
198 | ||
199 | C<ann_exons_ncbi.pl> uses the C<ref_exons> table of the C<seqdb2> | |
200 | database to extract exon position information on a protein. The | |
201 | C<seqdb2/ref_exons> table is constructed from refseq gff files using | |
202 | the C<ncbi_refseq_ex2prot.pl> script. | |
203 | ||
204 | C<ann_exons_ncbi.pl> is designed to be used by the B<FASTA> programs with | |
205 | the C<-V \!ann_exons_ncbi.pl> option. | |
237 | 206 | |
238 | 207 | =head1 AUTHOR |
239 | 208 |
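
The two description formats that `ann_exons_ncbi.pl` accepts can be reduced to an unversioned RefSeq accession as in the following sketch. The helper name `refseq_acc` is hypothetical, and the `gi|` branch is an assumption based on the header comment, since that part of the script is not shown in this diff. The `ref|` and bare-accession branches and the version stripping follow the lines shown above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: extract the unversioned RefSeq accession from a
# sequence description line.
sub refseq_acc {
    my ($annot_line) = @_;
    $annot_line =~ s/^>//;                      # tolerate a leading '>'
    my ($first) = split(/\s+/, $annot_line);    # drop any description text

    my $acc;
    if ($first =~ m/^gi\|/) {                   # gi|23065544|ref|NP_000552.2| (assumed handling)
        (undef, undef, undef, $acc) = split(/\|/, $first);
    }
    elsif ($first =~ m/^ref\|/) {               # ref|NP_000552.2|
        (undef, $acc) = split(/\|/, $first);
    }
    else {                                      # NP_000552
        $acc = $first;
    }

    $acc =~ s/\.\d+$//;                         # strip the version suffix
    return $acc;
}

print refseq_acc($_), "\n"
    for ('gi|23065544|ref|NP_000552.2|', 'ref|NP_000552.2|', 'NP_000552');
```

All three inputs reduce to `NP_000552`, which is the value the script would pass to its prepared SQL statement.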
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014,2015 by William R. Pearson and The Rector & |
24 | 24 | # (1) read in the line |
25 | 25 | # (2) parse it to get the up_acc |
26 | 26 | # (3) return the tab delimited features |
27 | # | |
28 | 27 | |
29 | 28 | # this version can read feature2 uniprot features (acc/pos/end/label/value), but returns sorted start/end domains |
30 | 29 | # modified 18-Jan-2016 to produce annotation symbols consistent with ann_exons_up_www2.pl |
31 | 30 | |
31 | use warnings; | |
32 | 32 | use strict; |
33 | 33 | |
34 | 34 | use DBI; |
50 | 50 | ($host, $db, $a_table, $port, $user, $pass) = ("mysql-pearson-prod", "up_db", "annot", 4124, "web_user", "fasta_www"); |
51 | 51 | } |
52 | 52 | |
53 | my ($sstr, $lav, $neg_doms, $no_vars, $no_doms, $no_feats, $shelp, $help, $pfam26) = (0,0,0,0,0,0,0,0,0,0); | |
54 | my ($min_nodom) = (10); | |
53 | my ($lav, $gen_coord, $shelp, $help) = (0,0,0,0); | |
55 | 54 | |
56 | 55 | my ($show_color) = (1); |
57 | 56 | my $color_sep_str = " :"; |
58 | 57 | $color_sep_str = '~'; |
59 | 58 | |
60 | 59 | GetOptions( |
60 | "gen_coord!" => \$gen_coord, | |
61 | 61 | "host=s" => \$host, |
62 | 62 | "db=s" => \$db, |
63 | 63 | "user=s" => \$user, |
64 | 64 | "password=s" => \$pass, |
65 | 65 | "port=i" => \$port, |
66 | 66 | "lav" => \$lav, |
67 | "no_doms" => \$no_doms, | |
68 | "no-doms" => \$no_doms, | |
69 | "nodoms" => \$no_doms, | |
70 | "no_var" => \$no_vars, | |
71 | "no-var" => \$no_vars, | |
72 | "novar" => \$no_vars, | |
73 | "neg" => \$neg_doms, | |
74 | "neg_doms" => \$neg_doms, | |
75 | "neg-doms" => \$neg_doms, | |
76 | "negdoms" => \$neg_doms, | |
77 | "min_nodom=i" => \$min_nodom, | |
78 | "min-nodom=i" => \$min_nodom, | |
79 | "no_feats" => \$no_feats, | |
80 | "no-feats" => \$no_feats, | |
81 | "nofeats" => \$no_feats, | |
82 | "color!" => \$show_color, | |
83 | "sstr" => \$sstr, | |
84 | 67 | "h|?" => \$shelp, |
85 | 68 | "help" => \$help, |
86 | 69 | ); |
99 | 82 | ) or die $DBI::errstr; |
100 | 83 | |
101 | 84 | |
102 | my $get_annot_sub = \&get_fasta_annots; | |
103 | if ($lav) { | |
104 | $no_feats = 1; | |
105 | $get_annot_sub = \&get_lav_annots; | |
106 | } | |
107 | ||
108 | my $get_annots_id = $dbh->prepare(qq(select acc, start, end, ix from up_exons join annot2 using(acc) where id=? order by ix)); | |
109 | my $get_annots_acc = $dbh->prepare(qq(select acc, start, end, ix from up_exons where acc=? order by ix)); | |
85 | my $get_annot_sub = \&get_annots; | |
86 | ||
87 | my $get_annots_id = $dbh->prepare(qq(select up_exons.* from up_exons join annot2 using(acc) where id=? order by ix)); | |
88 | my $get_annots_acc = $dbh->prepare(qq(select up_exons.* from up_exons where acc=? order by ix)); | |
110 | 89 | my $get_annots_refacc = $dbh->prepare(qq(select ref_acc, start, end, ix from up_exons join annot2 using(acc) where ref_acc=? order by ix)); |
111 | 90 | |
112 | 91 | my $get_annots_sql = $get_annots_acc; |
199 | 178 | return \%annot_data; |
200 | 179 | } |
201 | 180 | |
202 | sub get_fasta_annots { | |
181 | sub get_annots { | |
203 | 182 | my ($get_annots_sql, $seq_len) = @_; |
204 | 183 | |
205 | my ($acc, $start, $end, $ix); | |
206 | 184 | my @feats = (); |
207 | 185 | |
208 | while (($acc, $start, $end, $ix) = $get_annots_sql->fetchrow_array()) { | |
209 | push @feats, [$start, "-", $end, "exon_$ix~$ix"]; | |
186 | while (my $exon_hr = $get_annots_sql->fetchrow_hashref()) { | |
187 | my $ix = $exon_hr->{ix}; | |
188 | if ($lav) { | |
189 | push @feats, [$exon_hr->{start}, $exon_hr->{end}, "exon_$ix~$ix"]; | |
190 | } | |
191 | else { | |
192 | push @feats, [$exon_hr->{start}, "-", $exon_hr->{end}, "exon_$ix~$ix"]; | |
193 | if ($gen_coord) { | |
194 | if (not defined($exon_hr->{g_start})) { | |
195 | next; | |
196 | } | |
197 | ||
198 | my $chr=$exon_hr->{chrom}; | |
199 | $chr = "unk" unless $chr; | |
200 | if ($chr =~ m/^\d+$/ || $chr =~m/^[XYZ]+$/) { | |
201 | $chr = "chr$chr"; | |
202 | } | |
203 | my $ex_info = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_start}); | |
204 | push @feats, [$exon_hr->{start},'<','-',$ex_info]; | |
205 | $ex_info = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_end}); | |
206 | push @feats, [$exon_hr->{end},'>','-',$ex_info]; | |
207 | } | |
208 | } | |
210 | 209 | } |
211 | 210 | |
212 | 211 | return \@feats; |
213 | 212 | } |
214 | 213 | |
215 | sub get_lav_annots { | |
216 | my ($get_annots_sql, $seq_len) = @_; | |
217 | ||
218 | my ($pos, $end, $label, $value, $comment); | |
219 | ||
220 | my @feats = (); | |
221 | ||
222 | my %annot = (); | |
223 | while (($acc, $pos, $end, $label, $value) = $get_annots_sql->fetchrow_array()) { | |
224 | next unless ($label =~ m/^DOMAIN/ || $label =~ m/^REPEAT/); | |
225 | $value =~ s/\s?\{.+\}\.?$//; | |
226 | $value = domain_name($label,$value); | |
227 | push @feats, [$pos, $end, $value]; | |
228 | } | |
229 | ||
230 | return \@feats; | |
231 | } | |
232 | ||
233 | # domain name takes a uniprot domain label, removes comments ( ; | |
234 | # truncated) and numbers and returns a canonical form. Thus: | |
235 | # Cortactin 6. | |
236 | # Cortactin 7; truncated. | |
237 | # becomes "Cortactin" | |
238 | # | |
239 | ||
240 | sub domain_name { | |
241 | ||
242 | my ($label, $value) = @_; | |
243 | ||
244 | if ($label =~ /DOMAIN|REPEAT/) { | |
245 | $value =~ s/;.*$//; | |
246 | $value =~ s/\s+\d+\.?$//; | |
247 | $value =~ s/\.\s*$//; | |
248 | $value =~ s/\s+\d+\.\s+.*$//; | |
249 | $value =~ s/\s+/_/; | |
250 | if (!defined($domains{$value})) { | |
251 | $domain_cnt++; | |
252 | $domains{$value} = $domain_cnt; | |
253 | } | |
254 | return $value; | |
255 | } | |
256 | else { | |
257 | return $value; | |
258 | } | |
259 | } | |
260 | ||
261 | 214 | __END__ |
262 | 215 | |
263 | 216 | =pod |
268 | 221 | |
269 | 222 | =head1 SYNOPSIS |
270 | 223 | |
271 | ann_exons_up_sql.pl --no_doms --no_feats --lav 'sp|P09488|GSTM1_NUMAN' | accession.file | |
224 | ann_exons_up_sql.pl --lav 'sp|P09488|GSTM1_HUMAN' | accession.file | |
272 | 225 | |
273 | 226 | =head1 OPTIONS |
274 | 227 | |
275 | 228 | -h short help |
276 | 229 | --help include description |
277 | --no-doms do not show domain boundaries (domains are always shown with --lav) | |
278 | --no-feats do not show features (variants, active sites, phospho-sites) | |
279 | --no-var do not show variant sites (--no_var, --novar) | |
230 | --gen_coord -- provide genomic exon start/stop coordinates as features | |
280 | 231 | --lav produce lav2plt.pl annotation format, only show domains/repeats |
281 | --neg-doms, -- report domains between annotated domains as NODOM | |
282 | (also --neg, --neg_doms) | |
283 | --min_nodom=10 minimum non-domain length to produce NODOM | |
284 | 232 | --host, --user, --password, --port --db -- info for mysql database |
285 | 233 | |
286 | 234 | =head1 DESCRIPTION |
287 | 235 | |
288 | C<ann_exons_up_sql.pl> extracts feature, domain, and repeat information from | |
289 | a msyql database (default name, uniprot) built by parsing the | |
290 | uniprot_sprot.dat and uniprot_trembl.dat feature tables. Given a | |
291 | command line argument that contains a sequence accession (P09488) or | |
292 | identifier (GSTM1_HUMAN), the program looks up the features available | |
293 | for that sequence and returns them in a tab-delimited format: | |
236 | C<ann_exons_up_sql.pl> extracts exon location information from | |
237 | a mysql database (default name, uniprot) built from EBI/proteins API data. | |
238 | ||
239 | Given a command line argument that contains a sequence accession | |
240 | (P09488) or identifier (GSTM1_HUMAN), the program looks up the | |
241 | features available for that sequence and returns them in a | |
242 | tab-delimited format: | |
294 | 243 | |
295 | 244 | >sp|P09488|GSTM1_HUMAN |
296 | 2 - 88 GST_N-terminal~1 | |
297 | 7 V F Mutagen: Reduces catalytic activity 100- fold. {ECO:0000269|PubMed:16548513}. | |
298 | 34 * - MOD_RES: Phosphothreonine. {ECO:0000250|UniProtKB:P10649}. | |
299 | 90 - 208 GST_C-terminal~2 | |
300 | 108 V S Mutagen: Changes the properties of the enzyme toward some substrates. {ECO:0000269|PubMed:16548513, ECO:0000269|PubMed:9930979}. | |
301 | 108 V Q Mutagen: Reduces catalytic activity by half. {ECO:0000269|PubMed:16548513, ECO:0000269|PubMed:9930979}. | |
302 | 109 V I Mutagen: Reduces catalytic activity by half. {ECO:0000269|PubMed:16548513}. | |
303 | 116 # - BINDING: Substrate. | |
304 | 116 V A Mutagen: Reduces catalytic activity 10-fold. {ECO:0000269|PubMed:16548513}. | |
305 | 116 V F Mutagen: Slight increase of catalytic activity. {ECO:0000269|PubMed:16548513}. | |
306 | 173 V N in allele GSTM1B; dbSNP:rs1065411. {ECO:0000269|Ref.3, ECO:0000269|Ref.5}. | |
307 | 210 * - MOD_RES: Phosphoserine. {ECO:0000250|UniProtKB:P04905}. | |
308 | 210 V T in dbSNP:rs449856. | |
309 | ||
310 | If features are provided, then a legend of feature symbols is provided | |
311 | as well: | |
312 | ||
313 | ==:Active site | |
314 | =*:Modified | |
315 | =#:Substrate binding | |
316 | =^:Site | |
317 | =!:Metal binding | |
318 | ||
319 | If the C<--lav> option is specified, domain and repeat features are | |
320 | presented in a different format for the C<lav2plt.pl> program: | |
321 | ||
322 | >sp|P09488|GSTM1_HUMAN | |
323 | 2 88 GST N-terminal. | |
324 | 90 208 GST C-terminal. | |
245 | 1 - 12 exon_1~1 | |
246 | 13 - 38 exon_2~2 | |
247 | 39 - 59 exon_3~3 | |
248 | 60 - 87 exon_4~4 | |
249 | 88 - 120 exon_5~5 | |
250 | 121 - 152 exon_6~6 | |
251 | 153 - 189 exon_7~7 | |
252 | 190 - 218 exon_8~8 | |
253 | ||
254 | C<ann_exons_up_sql.pl --gen_coord 'sp|P09488|GSTM1_HUMAN'> also provides genomic coordinates: | |
255 | ||
256 | >sp|P09488|GSTM1_HUMAN | |
257 | 1 - 12 exon_1~1 | |
258 | 1 < - exon_1::chr1:109687874 | |
259 | 12 > - exon_1::chr1:109687909 | |
260 | 13 - 37 exon_2~2 | |
261 | 13 < - exon_2::chr1:109688170 | |
262 | 37 > - exon_2::chr1:109688245 | |
263 | 38 - 59 exon_3~3 | |
264 | 38 < - exon_3::chr1:109688673 | |
265 | 59 > - exon_3::chr1:109688737 | |
266 | ... | |
267 | 190 - 218 exon_8~8 | |
268 | 190 < - exon_8::chr1:109693206 | |
269 | 218 > - exon_8::chr1:109693292 | |
325 | 270 | |
326 | 271 | C<ann_exons_up_sql.pl> is designed to be used by the B<FASTA> programs |
327 | 272 | with the C<-V \!ann_exons_up_sql.pl> option, or by the |
0 | #!/usr/bin/env perl | |
1 | ||
2 | ################################################################ | |
3 | # copyright (c) 2014,2015 by William R. Pearson and The Rector & | |
4 | # Visitors of the University of Virginia | |
5 | ################################################################ | |
6 | # Licensed under the Apache License, Version 2.0 (the "License"); | |
7 | # you may not use this file except in compliance with the License. | |
8 | # You may obtain a copy of the License at | |
9 | # | |
10 | # http://www.apache.org/licenses/LICENSE-2.0 | |
11 | # | |
12 | # Unless required by applicable law or agreed to in writing, | |
13 | # software distributed under this License is distributed on an "AS | |
14 | # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either | |
15 | # express or implied. See the License for the specific language | |
16 | # governing permissions and limitations under the License. | |
17 | ################################################################ | |
18 | ||
19 | # ann_exons_up_sql.pl gets an annotation file from fasta36 -V with a line of the form: | |
20 | ||
21 | # gi|62822551|sp|P00502|GSTA1_RAT Glutathione S-transfer\n (at least from pir1.lseg) | |
22 | # | |
23 | # it must: | |
24 | # (1) read in the line | |
25 | # (2) parse it to get the up_acc | |
26 | # (3) return the tab delimited features | |
27 | ||
28 | # this version can read feature2 uniprot features (acc/pos/end/label/value), but returns sorted start/end domains | |
29 | # modified 18-Jan-2016 to produce annotation symbols consistent with ann_exons_up_www2.pl | |
30 | # modified Dec 2018 to generate genomic coordinates with --gen_coord | |
31 | # modified 3-Jan-2019 to merge sql and www (--www) access to exon coordinates | |
32 | ||
33 | use warnings; | |
34 | use strict; | |
35 | ||
36 | use DBI; | |
37 | use Getopt::Long; | |
38 | use Pod::Usage; | |
39 | use LWP::Simple; | |
40 | use LWP::UserAgent; | |
41 | use JSON qw(decode_json); | |
42 | ||
43 | use vars qw($host $db $a_table $port $user $pass); | |
44 | ||
45 | my %domains = (); | |
46 | my $domain_cnt = 0; | |
47 | ||
48 | my $hostname = `/bin/hostname`; | |
49 | ||
50 | unless ($hostname =~ m/ebi/) { | |
51 | ($host, $db, $a_table, $port, $user, $pass) = ("wrpxdb.its.virginia.edu", "uniprot", "annot2", 0, "web_user", "fasta_www"); | |
52 | # $host = 'xdb'; | |
53 | } | |
54 | else { | |
55 | ($host, $db, $a_table, $port, $user, $pass) = ("mysql-pearson-prod", "up_db", "annot", 4124, "web_user", "fasta_www"); | |
56 | } | |
57 | ||
58 | my ($lav, $gen_coord, $exon_label, $use_www, $shelp, $help) = (0,0,0,0,0,0); | |
59 | ||
60 | my ($show_color) = (1); | |
61 | my $color_sep_str = " :"; | |
62 | $color_sep_str = '~'; | |
63 | ||
64 | GetOptions( | |
65 | "gen_coord|gene_coord!" => \$gen_coord, | |
66 | "exon_label|label_exons!" => \$exon_label, | |
67 | "www!" => \$use_www, | |
68 | "host=s" => \$host, | |
69 | "db=s" => \$db, | |
70 | "user=s" => \$user, | |
71 | "password=s" => \$pass, | |
72 | "port=i" => \$port, | |
73 | "lav" => \$lav, | |
74 | "h|?" => \$shelp, | |
75 | "help" => \$help, | |
76 | ); | |
77 | ||
78 | pod2usage(1) if $shelp; | |
79 | pod2usage(exitstatus => 0, verbose => 2) if $help; | |
80 | pod2usage(1) unless (@ARGV || -p STDIN || -f STDIN); | |
81 | ||
82 | my $connect = "dbi:mysql(AutoCommit=>1,RaiseError=>1):database=$db"; | |
83 | $connect .= ";host=$host" if $host; | |
84 | $connect .= ";port=$port" if $port; | |
85 | ||
86 | my $dbh = DBI->connect($connect, | |
87 | $user, | |
88 | $pass | |
89 | ) or die $DBI::errstr; | |
90 | ||
91 | ||
92 | my $get_annot_sub = \&get_annots; | |
93 | ||
94 | ||
95 | my $ua = LWP::UserAgent->new(ssl_opts=>{verify_hostname => 0}); | |
96 | my $uniprot_url = 'https://www.ebi.ac.uk/proteins/api/coordinates/'; | |
97 | my $uniprot_suff = ".json"; | |
98 | ||
99 | ||
100 | if ($use_www) { | |
101 | $get_annot_sub = \&get_annots_up_www; | |
102 | } | |
103 | ||
104 | ||
105 | my $get_annots_id = $dbh->prepare(qq(select up_exons.* from up_exons join annot2 using(acc) where id=? order by ix)); | |
106 | my $get_annots_acc = $dbh->prepare(qq(select up_exons.* from up_exons where acc=? order by ix)); | |
107 | my $get_annots_refacc = $dbh->prepare(qq(select ref_acc, start, end, ix from up_exons join annot2 using(acc) where ref_acc=? order by ix)); | |
108 | ||
109 | my $get_annots_sql = $get_annots_acc; | |
110 | ||
111 | my ($tmp, $gi, $sdb, $acc, $id, $use_acc); | |
112 | ||
113 | # get the query | |
114 | my ($query, $seq_len) = @ARGV; | |
115 | $seq_len = 0 unless defined($seq_len); | |
116 | ||
117 | $query =~ s/^>// if ($query); | |
118 | ||
119 | my @annots = (); | |
120 | ||
121 | #if it's a file I can open, read and parse it | |
122 | unless ($query && ($query =~ m/[\|:]/ || | |
123 | $query =~ m/^[NX]P_/ || | |
124 | $query =~ m/^[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}\s/)) { | |
125 | ||
126 | while (my $a_line = <>) { | |
127 | $a_line =~ s/^>//; | |
128 | chomp $a_line; | |
129 | push @annots, show_annots($a_line, $get_annot_sub, $use_www); | |
130 | } | |
131 | } | |
132 | else { | |
133 | push @annots, show_annots("$query\t$seq_len", $get_annot_sub, $use_www); | |
134 | } | |
135 | ||
136 | for my $seq_annot (@annots) { | |
137 | print ">",$seq_annot->{seq_info},"\n"; | |
138 | for my $annot (@{$seq_annot->{list}}) { | |
139 | if (!$lav && $show_color && defined($domains{$annot->[-1]})) { | |
140 | $annot->[-1] .= $color_sep_str.$domains{$annot->[-1]}; | |
141 | } | |
142 | print join("\t",@$annot),"\n"; | |
143 | } | |
144 | } | |
145 | ||
146 | exit(0); | |
147 | ||
148 | sub show_annots { | |
149 | my ($query_len, $get_annot_sub, $use_www) = @_; | |
150 | ||
151 | my ($annot_line, $seq_len) = split(/\t/,$query_len); | |
152 | ||
153 | my %annot_data = (seq_info=>$annot_line); | |
154 | ||
155 | if ($annot_line =~ m/^gi\|/) { | |
156 | $use_acc = 1; | |
157 | ($tmp, $gi, $sdb, $acc, $id) = split(/\|/,$annot_line); | |
158 | } | |
159 | elsif ($annot_line =~ m/^(SP|TR):(\w+) (\w+)/) { | |
160 | ($sdb, $id, $acc) = ($1,$2,$3); | |
161 | $use_acc = 1; | |
162 | $sdb = lc($sdb) | |
163 | } | |
164 | elsif ($annot_line =~ m/^(SP|TR):(\w+)/) { | |
165 | ($sdb, $id) = ($1,$2); | |
166 | $use_acc = 0; | |
167 | $sdb = lc($sdb) | |
168 | } | |
169 | elsif ($annot_line !~ m/\|/) { # new NCBI swissprot format | |
170 | $use_acc =1; | |
171 | $sdb = 'sp'; | |
172 | ($acc) = split(/\s+/,$annot_line); | |
173 | } | |
174 | else { | |
175 | $use_acc = 1; | |
176 | ($sdb, $acc, $id) = split(/\|/,$annot_line); | |
177 | } | |
178 | ||
179 | unless ($use_acc) { | |
180 | $get_annots_sql = $get_annots_id; | |
181 | $get_annots_sql->execute($id); | |
182 | } | |
183 | else { | |
184 | unless ($sdb =~ m/ref/) { | |
185 | $get_annots_sql = $get_annots_acc; | |
186 | } else { | |
187 | $get_annots_sql = $get_annots_refacc; | |
188 | } | |
189 | $acc =~ s/\.\d+$//; | |
190 | ||
191 | unless ($use_www) { | |
192 | $get_annots_sql->execute($acc); | |
193 | } | |
194 | else { | |
195 | $get_annots_sql = $acc; | |
196 | } | |
197 | } | |
198 | ||
199 | $annot_data{list} = $get_annot_sub->($get_annots_sql, $seq_len); | |
200 | ||
201 | return \%annot_data; | |
202 | } | |
203 | ||
204 | sub get_annots { | |
205 | my ($get_annots_sql, $seq_len) = @_; | |
206 | ||
207 | my @feats = (); | |
208 | ||
209 | while (my $exon_hr = $get_annots_sql->fetchrow_hashref()) { | |
210 | my $ix = $exon_hr->{ix}; | |
211 | if ($lav) { | |
212 | push @feats, [$exon_hr->{start}, $exon_hr->{end}, "exon_$ix~$ix"]; | |
213 | } else { | |
214 | my ($exon_info,$ex_info_start, $ex_info_end) = ("exon_$ix~$ix","",""); | |
215 | if ($gen_coord) { | |
216 | if (defined($exon_hr->{g_start})) { | |
217 | my $chr=$exon_hr->{chrom}; | |
218 | $chr = "unk" unless $chr; | |
219 | if ($chr =~ m/^\d+$/ || $chr =~m/^[XYZ]+$/) { | |
220 | $chr = "chr$chr"; | |
221 | } | |
222 | $ex_info_start = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_start}); | |
223 | $ex_info_end = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_end}); | |
224 | if ($exon_label) { | |
225 | $exon_info = sprintf("exon_%d{%s:%d-%d}~%d",$ix, $chr, $exon_hr->{g_start}, $exon_hr->{g_end}, $ix); | |
226 | push @feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info]; | |
227 | } else { | |
228 | push @feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info]; | |
229 | push @feats, [$exon_hr->{start},'<','-',$ex_info_start]; | |
230 | push @feats, [$exon_hr->{end},'>','-',$ex_info_end]; | |
231 | } | |
232 | } | |
233 | } else { | |
234 | push @feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info]; | |
235 | } | |
236 | } | |
237 | } | |
238 | ||
239 | return \@feats; | |
240 | } | |
241 | ||
242 | sub get_annots_up_www { | |
243 | my ($acc, $seq_len) = @_; | |
244 | ||
245 | my @feats = (); | |
246 | ||
247 | # my $exon_json = get_https($uniprot_url.$acc.$uniprot_suff); | |
248 | my $exon_json = get($uniprot_url.$acc.$uniprot_suff); | |
249 | ||
250 | unless (!$exon_json || $exon_json =~ m/errorMessage/ || $exon_json =~ m/Can not find/) { | |
251 | return parse_json_up_exons($exon_json); | |
252 | } | |
253 | else { | |
254 | return (); | |
255 | } | |
256 | } | |
257 | ||
258 | sub parse_json_up_exons { | |
259 | my ($exon_json) = @_; | |
260 | ||
261 | my @exons = (); | |
262 | my @ex_coords = (); | |
263 | ||
264 | my $acc_exons = decode_json($exon_json); | |
265 | ||
266 | my $exon_num = 1; | |
267 | my $last_end = 0; | |
268 | my $last_phase = 0; | |
269 | ||
270 | my $chrom = $acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'chromosome'}; | |
271 | my $rev_strand = $acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'reverseStrand'}; | |
272 | ||
273 | for my $exon ( @{$acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'exon'}} ) { | |
274 | my ($p_begin, $p_end) = ($exon->{'proteinLocation'}{'begin'}{'position'},$exon->{'proteinLocation'}{'end'}{'position'}); | |
275 | my ($g_begin, $g_end) = ($exon->{'genomeLocation'}{'begin'}{'position'},$exon->{'genomeLocation'}{'end'}{'position'}); | |
276 | ||
277 | my $this_phase = 0; | |
278 | if (defined($g_begin) && defined($g_end)) { | |
279 | $this_phase = ($g_end - $g_begin + 1) % 3; | |
280 | } | |
281 | ||
282 | if (!defined($p_begin) || !defined($p_end)) { | |
283 | $exon_num++; | |
284 | $last_phase = 0; | |
285 | next; | |
286 | } | |
287 | ||
288 | if ($p_end >= $p_begin) { | |
289 | if ($p_begin == $last_end) { | |
290 | if ($last_phase==2) { | |
291 | $p_begin += 1; | |
292 | } | |
293 | elsif ($last_phase==1) { | |
294 | $last_end -= 1; | |
295 | $exons[-1]->{end} -= 1; | |
296 | } | |
297 | } | |
298 | ||
299 | if ($p_begin <= $last_end && $p_end > $last_end) { | |
300 | $p_begin = $last_end+1; | |
301 | } | |
302 | $last_end = $p_end; | |
303 | $last_phase = $this_phase; | |
304 | ||
305 | my ($gs_begin, $gs_end) = ($g_begin, $g_end); | |
306 | if ($rev_strand) { | |
307 | ($gs_begin, $gs_end) = ($g_end, $g_begin); | |
308 | } | |
309 | ||
310 | push @exons, { | |
311 | ix=>$exon_num, | |
312 | start=>$p_begin, | |
313 | end=>$p_end, | |
314 | g_start=>$gs_begin, | |
315 | g_end=>$gs_end, | |
316 | chrom=>$chrom, | |
317 | }; | |
318 | ||
319 | $exon_num++; | |
320 | } | |
321 | } | |
322 | ||
323 | # check for domain overlap (possibly more than 2 domains) | |
324 | # and resolve it by choosing the domain with the best | |
325 | # evalue | |
326 | ||
327 | my @ex_feats = (); | |
328 | ||
329 | for my $exon_hr (@exons) { | |
330 | my $ix = $exon_hr->{ix}; | |
331 | if ($lav) { | |
332 | push @ex_feats, [$exon_hr->{start}, $exon_hr->{end}, "exon_$ix~$ix" ]; | |
333 | } | |
334 | else { | |
335 | my ($exon_info,$ex_info_start, $ex_info_end) = ("exon_$ix~$ix","",""); | |
336 | if ($gen_coord) { | |
337 | if (defined($exon_hr->{g_start})) { | |
338 | my $chr=$exon_hr->{chrom}; | |
339 | $chr = "unk" unless $chr; | |
340 | if ($chr =~ m/^\d+$/ || $chr =~m/^[XYZ]+$/) { | |
341 | $chr = "chr$chr"; | |
342 | } | |
343 | $ex_info_start = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_start}); | |
344 | $ex_info_end = sprintf("exon_%d::%s:%d",$ix, $chr, $exon_hr->{g_end}); | |
345 | if ($exon_label) { | |
346 | $exon_info = sprintf("exon_%d{%s:%d-%d}~%d",$ix, $chr, $exon_hr->{g_start}, $exon_hr->{g_end},$ix); | |
347 | push @ex_feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info]; | |
348 | } else { | |
349 | push @ex_feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info]; | |
350 | push @ex_feats, [$exon_hr->{start},'<','-',$ex_info_start]; | |
351 | push @ex_feats, [$exon_hr->{end},'>','-',$ex_info_end]; | |
352 | } | |
353 | } | |
354 | } else { | |
355 | push @ex_feats, [$exon_hr->{start}, "-", $exon_hr->{end}, $exon_info]; | |
356 | } | |
357 | } | |
358 | } | |
359 | return \@ex_feats; | |
360 | } | |
361 | ||
362 | sub get_https { | |
363 | my ($url) = @_; | |
364 | ||
365 | my $result = ""; | |
366 | my $response = $ua->get($url); | |
367 | ||
368 | if ($response->is_success) { | |
369 | $result = $response->decoded_content; | |
370 | } else { | |
371 | $result = ''; | |
372 | } | |
373 | return $result; | |
374 | } | |
375 | ||
376 | ||
377 | ||
378 | __END__ | |
379 | ||
380 | =pod | |
381 | ||
382 | =head1 NAME | |
383 | ||
384 | ann_exons_up_sql.pl | |
385 | ||
386 | =head1 SYNOPSIS | |
387 | ||
388 | ann_exons_up_sql.pl --lav 'sp|P09488|GSTM1_HUMAN' | accession.file | |
389 | ||
390 | =head1 OPTIONS | |
391 | ||
392 | -h short help | |
393 | --help include description | |
394 | --gen_coord -- provide genomic exon start/stop coordinates as features | |
395 | --lav produce lav2plt.pl annotation format, only show domains/repeats | |
396 | --host, --user, --password, --port, --db -- info for mysql database | |
397 | ||
398 | =head1 DESCRIPTION | |
399 | ||
400 | C<ann_exons_up_sql.pl> extracts exon location information from | |
401 | a mysql database (default name, uniprot) built from EBI/proteins API data. | |
402 | ||
403 | Given a command line argument that contains a sequence accession | |
404 | (P09488) or identifier (GSTM1_HUMAN), the program looks up the | |
405 | features available for that sequence and returns them in a | |
406 | tab-delimited format: | |
407 | ||
408 | >sp|P09488|GSTM1_HUMAN | |
409 | 1 - 12 exon_1~1 | |
410 | 13 - 38 exon_2~2 | |
411 | 39 - 59 exon_3~3 | |
412 | 60 - 87 exon_4~4 | |
413 | 88 - 120 exon_5~5 | |
414 | 121 - 152 exon_6~6 | |
415 | 153 - 189 exon_7~7 | |
416 | 190 - 218 exon_8~8 | |
417 | ||
418 | C<ann_exons_up_sql.pl --gen_coord 'sp|P09488|GSTM1_HUMAN'> also provides genomic coordinates: | |
419 | ||
420 | >sp|P09488|GSTM1_HUMAN | |
421 | 1 - 12 exon_1~1 | |
422 | 1 < - exon_1::chr1:109687874 | |
423 | 12 > - exon_1::chr1:109687909 | |
424 | 13 - 37 exon_2~2 | |
425 | 13 < - exon_2::chr1:109688170 | |
426 | 37 > - exon_2::chr1:109688245 | |
427 | 38 - 59 exon_3~3 | |
428 | 38 < - exon_3::chr1:109688673 | |
429 | 59 > - exon_3::chr1:109688737 | |
430 | ... | |
431 | 190 - 218 exon_8~8 | |
432 | 190 < - exon_8::chr1:109693206 | |
433 | 218 > - exon_8::chr1:109693292 | |
434 | ||
435 | C<ann_exons_up_sql.pl> is designed to be used by the B<FASTA> programs | |
436 | with the C<-V \!ann_exons_up_sql.pl> option, or by the | |
437 | C<annot_blast_btop.pl> script. It can also be used with the | |
438 | lav2plt.pl program with the C<--xA "\!ann_exons_up_sql.pl --lav"> or | |
439 | C<--yA "\!ann_exons_up_sql.pl --lav"> options. | |
440 | ||
441 | =head1 AUTHOR | |
442 | ||
443 | William R. Pearson, wrp@virginia.edu | |
444 | ||
445 | =cut |
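
The header parsing done in `show_annots` above can be sketched as a standalone function. This is only an illustration, not code from the FASTA distribution: the name `parse_header` and the returned hash layout are assumptions, and stripping the description text from the identifier is an addition that the script itself does not perform.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the FASTA scripts): split a sequence
# header into database, accession, and identifier, following the same
# header styles that show_annots() distinguishes.
sub parse_header {
    my ($line) = @_;
    $line =~ s/^>//;
    my ($name) = split(/\t/, $line);    # drop an optional <tab>length field
    my ($sdb, $acc, $id);

    if ($name =~ m/^gi\|/) {
        # gi|62822551|sp|P00502|GSTA1_RAT
        (undef, undef, $sdb, $acc, $id) = split(/\|/, $name);
    }
    elsif ($name =~ m/^(SP|TR):(\w+)(?: (\w+))?/) {
        # SP:GSTM1_HUMAN P09488  (the accession may be absent)
        ($sdb, $id, $acc) = (lc($1), $2, $3);
    }
    elsif ($name !~ m/\|/) {
        # bare accession, as in the newer NCBI SwissProt headers
        $sdb = 'sp';
        ($acc) = split(/\s+/, $name);
    }
    else {
        # sp|P09488|GSTM1_HUMAN
        ($sdb, $acc, $id) = split(/\|/, $name);
    }

    for ($sdb, $acc, $id) { $_ = '' unless defined $_; }
    $acc =~ s/\.\d+$//;    # remove a version suffix such as ".2"
    $id  =~ s/\s.*$//;     # assumption: discard any trailing description
    return { db => $sdb, acc => $acc, id => $id };
}

for my $hdr ("sp|P09488|GSTM1_HUMAN\t218", 'gi|23065544|ref|NP_000552.2|') {
    my $h = parse_header($hdr);
    print join("\t", map { $h->{$_} } qw(db acc id)), "\n";
}
```

Returning a hash reference rather than setting file-level variables keeps the sketch free of shared state; the script itself uses globals (`$sdb`, `$acc`, `$id`, `$use_acc`) for the same purpose.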
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014,2015 by William R. Pearson and The Rector & |
16 | 16 | # governing permissions and limitations under the License. |
17 | 17 | ################################################################ |
18 | 18 | |
19 | # ann_exons_up_www.pl gets an annotation file from fasta36 -V with a line of the form: | |
20 | ||
21 | # gi|23065544|ref|NP_000552.2| | |
22 | # | |
23 | # and returns the exons present in the protein from NCBI gff3 tables (human and mouse only) | |
19 | # ann_exons_up_www.pl gets an annotation file from fasta36 -V with a | |
20 | # line of the form: | |
21 | # | |
22 | # sp|P09488|GSTM1_HUMAN<tab>218 | |
23 | # | |
24 | # and uses the EBI protein coordinate API to get the locations of exons | |
25 | # https://www.ebi.ac.uk/proteins/api/coordinates/P09488.json | |
24 | 26 | # |
25 | 27 | # it must: |
26 | 28 | # (1) read in the line |
28 | 30 | # (3) get exon information from EBI/Uniprot |
29 | 31 | # (4) return the tab delimited exon boundaries |
30 | 32 | |
31 | # 22-May-2017 -- use get("http://"), not get_https("https://"), because EBI does not have LWP::Protocol:https | |
32 | ||
33 | # 22-May-2017 -- use get("https://"), not get_https("https://"), because EBI does not have LWP::Protocol::https | |
34 | ||
35 | # 11-Dec-2018 -- modified to include --gen_coord, which reports exon starts and stops in genomic coordinates as <, > | |
36 | ||
37 | use warnings; | |
33 | 38 | use strict; |
34 | 39 | |
35 | 40 | use Getopt::Long; |
41 | 46 | |
42 | 47 | use vars qw($host $db $port $user $pass); |
43 | 48 | |
44 | my ($lav, $shelp, $help) = (0, 0,0); | |
49 | my ($lav, $gen_coord, $shelp, $help) = (0, 0, 0, 0); | |
45 | 50 | |
46 | 51 | my $color_sep_str = " :"; |
47 | 52 | $color_sep_str = '~'; |
48 | 53 | |
49 | 54 | GetOptions( |
55 | "gen_coord!" => \$gen_coord, | |
50 | 56 | "lav" => \$lav, |
51 | 57 | "h|?" => \$shelp, |
52 | 58 | "help" => \$help, |
65 | 71 | my $get_annot_sub = \&get_up_www_exons; |
66 | 72 | |
67 | 73 | my $ua = LWP::UserAgent->new(ssl_opts=>{verify_hostname => 0}); |
68 | my $uniprot_url = 'http://www.ebi.ac.uk/proteins/api/coordinates/'; | |
74 | my $uniprot_url = 'https://www.ebi.ac.uk/proteins/api/coordinates/'; | |
69 | 75 | my $uniprot_suff = ".json"; |
70 | 76 | |
71 | 77 | # get the query |
131 | 137 | |
132 | 138 | $acc =~ s/\.\d+$//; |
133 | 139 | |
140 | # my $exon_json = get_https($uniprot_url.$acc.$uniprot_suff); | |
134 | 141 | my $exon_json = get($uniprot_url.$acc.$uniprot_suff); |
135 | 142 | |
136 | 143 | unless (!$exon_json || $exon_json =~ m/errorMessage/ || $exon_json =~ m/Can not find/) { |
144 | 151 | my ($exon_json) = @_; |
145 | 152 | |
146 | 153 | my @exons = (); |
154 | my @ex_coords = (); | |
147 | 155 | |
148 | 156 | my $acc_exons = decode_json($exon_json); |
149 | 157 | |
150 | 158 | my $exon_num = 1; |
151 | 159 | my $last_end = 0; |
152 | 160 | my $last_phase = 0; |
161 | ||
162 | my $chrom = $acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'chromosome'}; | |
163 | my $rev_strand = $acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'reverseStrand'}; | |
153 | 164 | |
154 | 165 | for my $exon ( @{$acc_exons->{'gnCoordinate'}[0]{'genomicLocation'}{'exon'}} ) { |
155 | 166 | my ($p_begin, $p_end) = ($exon->{'proteinLocation'}{'begin'}{'position'},$exon->{'proteinLocation'}{'end'}{'position'}); |
183 | 194 | $last_end = $p_end; |
184 | 195 | $last_phase = $this_phase; |
185 | 196 | |
197 | my $info ="exon_".$exon_num.$color_sep_str.$exon_num; | |
198 | ||
199 | my ($gs_begin, $gs_end) = ($g_begin, $g_end); | |
200 | if ($rev_strand) { | |
201 | ($gs_begin, $gs_end) = ($g_end, $g_begin); | |
202 | } | |
203 | ||
186 | 204 | push @exons, { |
187 | info=>"exon_".$exon_num.$color_sep_str.$exon_num, | |
205 | info=>$info, | |
206 | exon_num=>$exon_num, | |
188 | 207 | seq_start=>$p_begin, |
189 | 208 | seq_end=>$p_end, |
209 | gen_seq_start=>$gs_begin, | |
210 | gen_seq_end=>$gs_end, | |
211 | chrom=>$chrom, | |
190 | 212 | }; |
213 | ||
191 | 214 | $exon_num++; |
192 | 215 | } |
193 | 216 | } |
204 | 227 | } |
205 | 228 | else { |
206 | 229 | push @ex_feats, [$d_ref->{seq_start}, '-', $d_ref->{seq_end}, $d_ref->{info} ]; |
230 | if ($gen_coord) { | |
231 | my $chr=$d_ref->{chrom}; | |
232 | if ($chr =~ m/^\d+$/ || $chr =~m/^[XYZ]+$/) { | |
233 | $chr = "chr$chr"; | |
234 | } | |
235 | my $ex_info = sprintf("exon_%d::%s:%d",$d_ref->{exon_num}, $chr, $d_ref->{gen_seq_start}); | |
236 | push @ex_feats, [$d_ref->{seq_start},'<','-',$ex_info]; | |
237 | $ex_info = sprintf("exon_%d::%s:%d",$d_ref->{exon_num}, $chr, $d_ref->{gen_seq_end}); | |
238 | push @ex_feats, [$d_ref->{seq_end},'>','-',$ex_info]; | |
239 | } | |
207 | 240 | } |
208 | 241 | } |
209 | 242 | return \@ex_feats; |
223 | 256 | return $result; |
224 | 257 | } |
225 | 258 | |
226 | sub domain_name { | |
227 | ||
228 | my ($value) = @_; | |
229 | ||
230 | if (!defined($domains{$value})) { | |
231 | $domain_cnt++; | |
232 | $domains{$value} = $domain_cnt; | |
233 | } | |
234 | return $value; | |
235 | } | |
236 | ||
237 | 259 | __END__ |
238 | 260 | |
239 | 261 | =pod |
251 | 273 | -h short help |
252 | 274 | --help include description |
253 | 275 | --lav produce lav2plt.pl annotation format |
276 | --gen_coord produce genome coordinate features | |
254 | 277 | |
255 | 278 | =head1 DESCRIPTION |
256 | 279 |
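
The `--gen_coord` rows added in this change can be illustrated in isolation. The following is a minimal sketch, not the script's own code: `exon_feature_rows` is a hypothetical name, and the hash keys follow the ones used in `ann_exons_up_sql.pl` (`ix`, `start`, `end`, `g_start`, `g_end`, `chrom`). Unlike that script as shown, the sketch still emits the plain exon row when no genomic start is available.

```perl
use strict;
use warnings;

# Hypothetical helper: build the annotation rows for one exon.
# Each row is an array ref that the caller joins with tabs.
sub exon_feature_rows {
    my ($exon, $gen_coord) = @_;
    my $ix = $exon->{ix};

    # protein-coordinate row: start, '-', end, label~color
    my @rows = ([ $exon->{start}, '-', $exon->{end}, "exon_$ix~$ix" ]);
    return @rows unless $gen_coord && defined $exon->{g_start};

    # numeric or X/Y/Z chromosome names get a "chr" prefix
    my $chr = $exon->{chrom} || 'unk';
    $chr = "chr$chr" if $chr =~ m/^\d+$/ || $chr =~ m/^[XYZ]+$/;

    # '<' marks the genomic start of the exon, '>' marks its end
    push @rows, [ $exon->{start}, '<', '-',
                  sprintf("exon_%d::%s:%d", $ix, $chr, $exon->{g_start}) ];
    push @rows, [ $exon->{end}, '>', '-',
                  sprintf("exon_%d::%s:%d", $ix, $chr, $exon->{g_end}) ];
    return @rows;
}

my %exon = (ix => 1, start => 1, end => 12,
            g_start => 109687874, g_end => 109687909, chrom => '1');
print join("\t", @$_), "\n" for exon_feature_rows(\%exon, 1);
```

With the example exon taken from the POD above, the three rows correspond to the first three lines of the documented `--gen_coord` output for `sp|P09488|GSTM1_HUMAN`.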
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014 by William R. Pearson and The Rector & |
34 | 34 | # ann_feats2ipr.pl is largely identical to ann_feats2l.pl, except that |
35 | 35 | # it uses Interpro for domain/repeat information. |
36 | 36 | |
37 | use warnings; | |
37 | 38 | use strict; |
38 | 39 | |
39 | 40 | use DBI; |
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014 by William R. Pearson and The Rector & |
34 | 34 | # ann_feats2ipr.pl is largely identical to ann_feats2l.pl, except that |
35 | 35 | # it uses Interpro for domain/repeat information. |
36 | 36 | |
37 | use warnings; | |
37 | 38 | use strict; |
38 | 39 | |
39 | 40 | use DBI; |
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014,2015 by William R. Pearson and The Rector & |
29 | 29 | # this version can read feature2 uniprot features (acc/pos/end/label/value), but returns sorted start/end domains |
30 | 30 | # modified 18-Jan-2016 to produce annotation symbols consistent with ann_feats_up_www2.pl |
31 | 31 | |
32 | use warnings; | |
32 | 33 | use strict; |
33 | 34 | |
34 | 35 | use DBI; |
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014,2015 by William R. Pearson and The Rector & |
17 | 17 | ################################################################ |
18 | 18 | |
19 | 19 | ## modified 29-Sept-2016 to use EBI/proteins JSON URL: |
20 | ## http://www.ebi.ac.uk/proteins/api/features/p12345 | |
20 | ## https://www.ebi.ac.uk/proteins/api/features/p12345 | |
21 | 21 | |
22 | 22 | # ann_feats_up_www2.pl gets an annotation file from fasta36 -V with a line of the form: |
23 | 23 | |
31 | 31 | |
32 | 32 | # this version can read feature2 uniprot features (acc/pos/end/label/value), but returns sorted start/end domains |
33 | 33 | |
34 | use warnings; | |
34 | 35 | use strict; |
35 | 36 | |
36 | 37 | use Getopt::Long; |
37 | 38 | use Pod::Usage; |
38 | 39 | use LWP::Simple; |
40 | use LWP::UserAgent; | |
39 | 41 | use JSON qw(decode_json); |
40 | 42 | |
41 | 43 | ## use IO::String; |
42 | 44 | |
43 | my $up_base = 'http://www.ebi.ac.uk/proteins/api/features'; | |
45 | my $ua = LWP::UserAgent->new(ssl_opts=>{verify_hostname => 0}); | |
46 | my $up_base = 'https://www.ebi.ac.uk/proteins/api/features'; | |
47 | my $uniprot_suff = ".json"; | |
44 | 48 | |
45 | 49 | my %domains = (); |
46 | 50 | my $domain_cnt = 0; |
213 | 217 | my $lwp_features = ""; |
214 | 218 | |
215 | 219 | if ($acc && ($acc =~ m/^[A-Z][0-9][A-Z0-9]{3}[0-9]/)) { |
216 | $lwp_features = get("$up_base/$acc.json"); | |
220 | $lwp_features = get_https("$up_base/$acc.json"); | |
217 | 221 | } |
218 | 222 | # elsif ($id && ($id =~ m/^\w+$/)) { |
219 | 223 | # $lwp_features = get("$up_base/$id/$gff_post"); |
366 | 370 | } |
367 | 371 | } |
368 | 372 | |
373 | sub get_https { | |
374 | my ($url) = @_; | |
375 | ||
376 | my $result = ""; | |
377 | my $response = $ua->get($url); | |
378 | ||
379 | if ($response->is_success) { | |
380 | $result = $response->decoded_content; | |
381 | } else { | |
382 | $result = ''; | |
383 | } | |
384 | return $result; | |
385 | } | |
369 | 386 | |
370 | 387 | |
371 | 388 | __END__ |
398 | 415 | |
399 | 416 | C<ann_feats_up_www2.pl> extracts feature, domain, and repeat |
400 | 417 | information from the Uniprot DAS server through an XSLT translation |
401 | provided by http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/uniprotkb. | |
418 | provided by https://www.ebi.ac.uk/Tools/dbfetch/dbfetch/uniprotkb. | |
402 | 419 | This server provides GFF descriptions of Uniprot entries, with most of |
403 | 420 | the information provided in UniProt feature tables. |
404 | 421 |
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014 by William R. Pearson and The Rector & |
36 | 36 | # (3) return the tab delimited domains |
37 | 37 | # |
38 | 38 | |
39 | use warnings; | |
39 | 40 | use strict; |
40 | 41 | |
41 | 42 | use Getopt::Long; |
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014 by William R. Pearson and The Rector & |
34 | 34 | # database |
35 | 35 | # |
36 | 36 | |
37 | use warnings; | |
37 | 38 | use strict; |
38 | 39 | |
39 | 40 | use DBI; |
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014, 2015 by William R. Pearson and The Rector & |
34 | 34 | # database |
35 | 35 | # |
36 | 36 | |
37 | use warnings; | |
37 | 38 | use strict; |
38 | 39 | |
39 | 40 | use LWP::Simple; |
0 | #!/usr/bin/perl -w | |
1 | ||
2 | ################################################################ | |
3 | # copyright (c) 2014 by William R. Pearson and The Rector & | |
4 | # Visitors of the University of Virginia */ | |
5 | ################################################################ | |
6 | # Licensed under the Apache License, Version 2.0 (the "License"); | |
7 | # you may not use this file except in compliance with the License. | |
8 | # You may obtain a copy of the License at | |
9 | # | |
10 | # http://www.apache.org/licenses/LICENSE-2.0 | |
11 | # | |
12 | # Unless required by applicable law or agreed to in writing, | |
13 | # software distributed under this License is distributed on an "AS | |
14 | # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either | |
15 | # express or implied. See the License for the specific language | |
16 | # governing permissions and limitations under the License. | |
17 | ################################################################ | |
18 | ||
19 | # ann_pfam_e.pl gets an annotation file from fasta36 -V with a line of the form: | |
20 | ||
21 | # gi|62822551|sp|P00502|GSTA1_RAT Glutathione S-transfer\n (at least from pir1.lseg) | |
22 | # | |
23 | # it must: | |
24 | # (1) read in the line | |
25 | # (2) parse it to get the up_acc | |
26 | # (3) return the tab delimited features | |
27 | # | |
28 | ||
29 | # this version only annotates sequences known to Pfam:pfamseq: | |
30 | # >pf26|164|O57809|1A1D_PYRHO | |
31 | # and only provides domain information | |
32 | ||
33 | use strict; | |
34 | ||
35 | use DBI; | |
36 | use Getopt::Long; | |
37 | use Pod::Usage; | |
38 | ||
39 | use vars qw($host $db $port $user $pass); | |
40 | ||
41 | my $hostname = `/bin/hostname`; | |
42 | ||
43 | ($host, $db, $port, $user, $pass) = ("wrpxdb.its.virginia.edu", "pfam27", 0, "web_user", "fasta_www"); | |
44 | #$host = 'xdb'; | |
45 | ||
46 | my ($auto_reg,$rpd2_fams, $vdoms, $neg_doms, $lav, $no_doms, $no_clans, $pf_acc, $no_over, $acc_comment, $shelp, $help) = | |
47 | (0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0); | |
48 | my ($min_nodom, $min_vdom) = (10,10); | |
49 | ||
50 | my $color_sep_str = " :"; | |
51 | $color_sep_str = '~'; | |
52 | ||
53 | ||
54 | GetOptions( | |
55 | "host=s" => \$host, | |
56 | "db=s" => \$db, | |
57 | "user=s" => \$user, | |
58 | "password=s" => \$pass, | |
59 | "port=i" => \$port, | |
60 | "lav" => \$lav, | |
61 | "acc_comment" => \$acc_comment, | |
62 | "no-over" => \$no_over, | |
63 | "no_over" => \$no_over, | |
64 | "no-clans" => \$no_clans, | |
65 | "no_clans" => \$no_clans, | |
66 | "neg" => \$neg_doms, | |
67 | "neg_doms" => \$neg_doms, | |
68 | "neg-doms" => \$neg_doms, | |
69 | "min_nodom=i" => \$min_nodom, | |
70 | "pfacc" => \$pf_acc, | |
71 | "RPD2" => \$rpd2_fams, | |
72 | "auto_reg" => \$auto_reg, | |
73 | "vdoms" => \$vdoms, | |
74 | "h|?" => \$shelp, | |
75 | "help" => \$help, | |
76 | ); | |
77 | ||
78 | pod2usage(1) if $shelp; | |
79 | pod2usage(exitstatus => 0, verbose => 2) if $help; | |
80 | pod2usage(1) unless (@ARGV || -p STDIN || -f STDIN); | |
81 | ||
82 | my $connect = "dbi:mysql(AutoCommit=>1,RaiseError=>1):database=$db"; | |
83 | $connect .= ";host=$host" if $host; | |
84 | $connect .= ";port=$port" if $port; | |
85 | ||
86 | my $dbh = DBI->connect($connect, | |
87 | $user, | |
88 | $pass | |
89 | ) or die $DBI::errstr; | |
90 | ||
91 | my %annot_types = (); | |
92 | my %domains = (NODOM=>0); | |
93 | my %domain_clan = (NODOM => {clan_id => 'NODOM', clan_acc=>0, domain_cnt=>0}); | |
94 | my @domain_list = (0); | |
95 | my $domain_cnt = 0; | |
96 | ||
97 | my $get_annot_sub = \&get_pfam_annots; | |
98 | ||
99 | my $get_pfam_acc = $dbh->prepare(<<EOSQL); | |
100 | ||
101 | SELECT seq_start, seq_end, model_start, model_end, model_length, auto_pfamA, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
102 | FROM pfamseq | |
103 | JOIN pfamA_reg_full_significant using(auto_pfamseq) | |
104 | JOIN pfamA USING (auto_pfamA) | |
105 | WHERE in_full = 1 | |
106 | AND pfamseq_acc=? | |
107 | ORDER BY seq_start | |
108 | ||
109 | EOSQL | |
110 | ||
111 | my $get_pfam_refacc = $dbh->prepare(<<EOSQL); | |
112 | ||
113 | SELECT seq_start, seq_end, auto_pfamA, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
114 | FROM pfamseq | |
115 | JOIN pfamA_reg_full_significant using(auto_pfamseq) | |
116 | JOIN pfamA USING (auto_pfamA) | |
117 | JOIN seqdb_demo2.annot as sa1 on(sa1.acc=pfamseq_acc and sa1.db='sp') | |
118 | JOIN seqdb_demo2.annot as sa2 using(prot_id) | |
119 | WHERE in_full = 1 | |
120 | AND sa2.acc=? | |
121 | AND sa2.db='ref' | |
122 | ORDER BY seq_start | |
123 | ||
124 | EOSQL | |
125 | ||
126 | my $get_annots_sql = $get_pfam_acc; | |
127 | ||
128 | my $get_pfam_id = $dbh->prepare(<<EOSQL); | |
129 | ||
130 | SELECT seq_start, seq_end, auto_pfamA, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
131 | FROM pfamseq | |
132 | JOIN pfamA_reg_full_significant using(auto_pfamseq) | |
133 | JOIN pfamA USING (auto_pfamA) | |
134 | WHERE in_full=1 | |
135 | AND pfamseq_id=? | |
136 | ORDER BY seq_start | |
137 | ||
138 | EOSQL | |
139 | ||
140 | my $get_pfam_clan = $dbh->prepare(<<EOSQL); | |
141 | ||
142 | SELECT clan_acc, clan_id | |
143 | FROM clans | |
144 | JOIN clan_membership using(auto_clan) | |
145 | WHERE auto_pfamA=? | |
146 | ||
147 | EOSQL | |
148 | ||
149 | my $get_rpd2_clans = $dbh->prepare(<<EOSQL); | |
150 | ||
151 | SELECT auto_pfamA, clan | |
152 | FROM ljm_db.RPD2_final_fams | |
153 | WHERE clan is not NULL | |
154 | ||
155 | EOSQL | |
156 | ||
157 | # -- LEFT JOIN clan_membership USING (auto_pfamA) | |
158 | # -- LEFT JOIN clans using(auto_clan) | |
159 | ||
160 | my ($tmp, $gi, $sdb, $acc, $id, $use_acc); | |
161 | ||
162 | # get the query | |
163 | my ($query, $seq_len) = @ARGV; | |
164 | $seq_len = 0 unless defined($seq_len); | |
165 | ||
166 | $query =~ s/^>// if ($query); | |
167 | ||
168 | my @annots = (); | |
169 | ||
170 | my %rpd2_clan_fams = (); | |
171 | ||
172 | if ($rpd2_fams) { | |
173 | $get_rpd2_clans->execute(); | |
174 | my ($auto_pfam, $auto_clan); | |
175 | while (($auto_pfam, $auto_clan)=$get_rpd2_clans->fetchrow_array()) { | |
176 | $rpd2_clan_fams{$auto_pfam} = $auto_clan; | |
177 | } | |
178 | } | |
179 | ||
180 | #if it's a file I can open, read and parse it | |
181 | unless ($query && $query =~ m/[\|:]/) { | |
182 | ||
183 | while (my $a_line = <>) { | |
184 | $a_line =~ s/^>//; | |
185 | chomp $a_line; | |
186 | push @annots, show_annots($a_line, $get_annot_sub); | |
187 | } | |
188 | } | |
189 | else { | |
190 | push @annots, show_annots("$query $seq_len", $get_annot_sub); | |
191 | } | |
192 | ||
193 | for my $seq_annot (@annots) { | |
194 | print ">",$seq_annot->{seq_info},"\n"; | |
195 | for my $annot (@{$seq_annot->{list}}) { | |
196 | if (!$lav && defined($domains{$annot->[-1]})) { | |
197 | my ($a_name, $a_num) = domain_num($annot->[-1],$domains{$annot->[-1]}); | |
198 | if ($acc_comment) { | |
199 | $a_name .= "{$domain_list[$a_num]}"; | |
200 | } | |
201 | $annot->[-1] = $a_name.$color_sep_str.$a_num; | |
202 | } | |
203 | print join("\t",@$annot),"\n"; | |
204 | } | |
205 | } | |
206 | ||
207 | exit(0); | |
208 | ||
209 | sub show_annots { | |
210 | my ($query_len, $get_annot_sub) = @_; | |
211 | ||
212 | my ($annot_line, $seq_len) = split(/\s+/,$query_len); | |
213 | ||
214 | my $pfamA_acc; | |
215 | ||
216 | my %annot_data = (seq_info=>$annot_line); | |
217 | ||
218 | $use_acc = 1; | |
219 | $get_annots_sql = $get_pfam_acc; | |
220 | ||
221 | if ($annot_line =~ m/^pf26\|/) { | |
222 | ($sdb, $gi, $acc, $id) = split(/\|/,$annot_line); | |
223 | $dbh->do("use RPD2_pfam"); | |
224 | } | |
225 | elsif ($annot_line =~ m/^gi\|/) { | |
226 | ($tmp, $gi, $sdb, $acc, $id) = split(/\|/,$annot_line); | |
227 | if ($sdb =~ m/ref/) { | |
228 | $get_annots_sql = $get_pfam_refacc; | |
229 | } | |
230 | } | |
231 | elsif ($annot_line =~ m/^sp\|/) { | |
232 | ($sdb, $acc, $id) = split(/\|/,$annot_line); | |
233 | } | |
234 | elsif ($annot_line =~ m/^ref\|/) { | |
235 | ($sdb, $acc) = split(/\|/,$annot_line); | |
236 | $get_annots_sql = $get_pfam_refacc; | |
237 | } | |
238 | elsif ($annot_line =~ m/^tr\|/) { | |
239 | ($sdb, $acc, $id) = split(/\|/,$annot_line); | |
240 | } | |
241 | elsif ($annot_line =~ m/^SP:/i) { | |
242 | ($sdb, $id) = split(/:/,$annot_line); | |
243 | $use_acc = 0; | |
244 | } | |
245 | else { | |
246 | $use_acc = 1; | |
247 | ($acc) = split(/\s+/,$annot_line); | |
248 | } | |
249 | ||
250 | # remove version number | |
251 | unless ($use_acc) { | |
252 | $get_annots_sql = $get_pfam_id; | |
253 | $get_annots_sql->execute($id); | |
254 | } | |
255 | else { | |
256 | $acc =~ s/\.\d+$//; | |
257 | $get_annots_sql->execute($acc); | |
258 | } | |
259 | ||
260 | $annot_data{list} = $get_annot_sub->($get_annots_sql, $seq_len); | |
261 | ||
262 | return \%annot_data; | |
263 | } | |
264 | ||
265 | sub get_pfam_annots { | |
266 | my ($get_annots, $seq_length) = @_; | |
267 | ||
268 | $seq_length = 0 unless $seq_length; | |
269 | ||
270 | my @pf_domains = (); | |
271 | ||
272 | # get the list of domains, sorted by start | |
273 | while ( my $row_href = $get_annots->fetchrow_hashref()) { | |
274 | if ($auto_reg) { | |
275 | $row_href->{info} = $row_href->{auto_pfamA_reg_full}; | |
276 | } | |
277 | elsif ($pf_acc) { | |
278 | $row_href->{info} = $row_href->{pfamA_acc}; | |
279 | } | |
280 | else { | |
281 | $row_href->{info} = $row_href->{pfamA_id}; | |
282 | } | |
283 | ||
284 | if ($row_href && $row_href->{length} > $seq_length && $seq_length == 0) { $seq_length = $row_href->{length};} | |
285 | ||
286 | next if ($row_href->{seq_start} >= $seq_length); | |
287 | if ($row_href->{seq_end} > $seq_length) { | |
288 | $row_href->{seq_end} = $seq_length; | |
289 | } | |
290 | ||
291 | push @pf_domains, $row_href | |
292 | } | |
293 | ||
294 | # check for domain overlap, and resolve the overlap | |
295 | # (possibly more than 2 domains), choosing the domain with the best | |
296 | # evalue | |
297 | ||
298 | if($no_over && scalar(@pf_domains) > 1) { | |
299 | ||
300 | my @tmp_domains = @pf_domains; | |
301 | my @save_domains = (); | |
302 | ||
303 | my $prev_dom = shift @tmp_domains; | |
304 | ||
305 | while (my $curr_dom = shift @tmp_domains) { | |
306 | ||
307 | my @overlap_domains = ($prev_dom); | |
308 | ||
309 | my $diff = $prev_dom->{seq_end} - $curr_dom->{seq_start}; | |
310 | # check for overlap > domain_length/3 | |
311 | ||
312 | my ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, $curr_dom->{seq_end}-$curr_dom->{seq_start}+1); | |
313 | my $inclusion = ((($curr_dom->{seq_start} >= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})) || | |
314 | (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} >= $prev_dom->{seq_end}))); | |
315 | ||
316 | my $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len; | |
317 | ||
318 | while ($inclusion || ($diff > 0 && $diff > $longer_len/3)) { | |
319 | push @overlap_domains, $curr_dom; | |
320 | $curr_dom = shift @tmp_domains; | |
321 | last unless $curr_dom; | |
322 | $diff = $prev_dom->{seq_end} - $curr_dom->{seq_start}; | |
323 | ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, $curr_dom->{seq_end}-$curr_dom->{seq_start}+1); | |
324 | $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len; | |
325 | $inclusion = ((($curr_dom->{seq_start} >= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})) || | |
326 | (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} >= $prev_dom->{seq_end}))); | |
327 | } | |
328 | ||
329 | # check for overlapping domains; >1 because $prev_dom is always there | |
330 | if (scalar(@overlap_domains) > 1 ) { | |
331 | # if $rpd2_fams, check for a chosen one | |
332 | if ($rpd2_fams) { | |
333 | for my $dom (@overlap_domains) { | |
334 | if ($rpd2_clan_fams{$dom->{auto_pfamA}}) { | |
335 | $prev_dom = $dom; | |
336 | last; | |
337 | } | |
338 | } | |
339 | } | |
340 | else { | |
341 | @overlap_domains = sort { $a->{evalue} <=> $b->{evalue} } @overlap_domains; | |
342 | $prev_dom = $overlap_domains[0]; | |
343 | } | |
344 | } | |
345 | ||
346 | # $prev_dom should be the best of the overlaps, and we are no longer overlapping > dom_length/3 | |
347 | push @save_domains, $prev_dom; | |
348 | $prev_dom = $curr_dom; | |
349 | } | |
350 | if ($prev_dom) {push @save_domains, $prev_dom;} | |
351 | ||
352 | @pf_domains = @save_domains; | |
353 | ||
354 | # now check for smaller overlaps | |
355 | for (my $i=1; $i < scalar(@pf_domains); $i++) { | |
356 | if ($pf_domains[$i-1]->{seq_end} >= $pf_domains[$i]->{seq_start}) { | |
357 | my $overlap = $pf_domains[$i-1]->{seq_end} - $pf_domains[$i]->{seq_start}; | |
358 | $pf_domains[$i-1]->{seq_end} -= int($overlap/2); | |
359 | $pf_domains[$i]->{seq_start} = $pf_domains[$i-1]->{seq_end}+1; | |
360 | } | |
361 | } | |
362 | } | |
363 | ||
364 | # $vdoms -- virtual Pfam domains -- the equivalent of $neg_doms, | |
365 | # but covering parts of a Pfam model that are not annotated. split | |
366 | # domains have been joined, so simply check beginning and end of | |
367 | # each domain (but must also check for bounded-ness) | |
368 | # only add when 10% or more is missing and missing length > $min_nodom | |
369 | ||
370 | if ($vdoms && scalar(@pf_domains)) { | |
371 | my @vpf_domains; | |
372 | ||
373 | my $curr_dom = $pf_domains[0]; | |
374 | my $length = $curr_dom->{length}; | |
375 | ||
376 | my $prev_dom={seq_end=>0, pfamA_acc=>''}; | |
377 | my $prev_dom_end = 0; | |
378 | my $next_dom_start = $length+1; | |
379 | ||
380 | for (my $dom_ix=0; $dom_ix < scalar(@pf_domains); $dom_ix++ ) { | |
381 | $curr_dom = $pf_domains[$dom_ix]; | |
382 | ||
383 | my $pfamA = $curr_dom->{pfamA_acc}; | |
384 | ||
385 | # first, look left, is there a domain there (if there is, | |
386 | # it should be updated right) | |
387 | ||
388 | # my $min_vdom = $curr_dom->{model_length} / 10; | |
389 | ||
390 | if ($prev_dom->{pfamA_acc}) { # look for previous domain | |
391 | $prev_dom_end = $prev_dom->{seq_end}; | |
392 | } | |
393 | ||
394 | # there is a domain to the left, how much room is available? | |
395 | my $left_dom_len = min($curr_dom->{seq_start}-$prev_dom_end-1, $curr_dom->{model_start}-1); | |
396 | if ( $left_dom_len > $min_vdom) { | |
397 | # there is room for a virtual domain | |
398 | my %new_dom = (seq_start=> $curr_dom->{seq_start}-$left_dom_len, | |
399 | seq_end => $curr_dom->{seq_start}-1, | |
400 | info=>'@'.$curr_dom->{info}, | |
401 | model_length=>$curr_dom->{model_length}, | |
402 | model_end => $curr_dom->{model_start}-1, | |
403 | model_start => $left_dom_len, | |
404 | pfamA_acc=>$pfamA, | |
405 | ); | |
406 | push @vpf_domains, \%new_dom; | |
407 | } | |
408 | ||
409 | # save the current domain | |
410 | push @vpf_domains, $curr_dom; | |
411 | $prev_dom = $curr_dom; | |
412 | ||
413 | if ($dom_ix < $#pf_domains) { # there is a domain to the right | |
414 | # first, give all the extra space to the first domain (no splitting) | |
415 | $next_dom_start = $pf_domains[$dom_ix+1]->{seq_start}; | |
416 | } | |
417 | else { | |
418 | $next_dom_start = $length; | |
419 | } | |
420 | ||
421 | # is there room for a virtual domain right | |
422 | ||
423 | my $right_dom_len = min($next_dom_start-$curr_dom->{seq_end}-1, # space available | |
424 | $curr_dom->{model_length}-$curr_dom->{model_end} # space needed | |
425 | ); | |
426 | if ( $right_dom_len > $min_vdom) { | |
427 | my %new_dom = (seq_start=> $curr_dom->{seq_end}+1, | |
428 | seq_end=> $curr_dom->{seq_end}+$right_dom_len, | |
429 | info=>'@'.$pfamA, | |
430 | model_length => $curr_dom->{model_length}, | |
431 | pfamA_acc=> $pfamA, | |
432 | ); | |
433 | push @vpf_domains, \%new_dom; | |
434 | $prev_dom = \%new_dom; | |
435 | } | |
436 | } # all done, check for last one | |
437 | ||
438 | # $curr_dom=$pf_domains[-1]; | |
439 | # # my $min_vdom = $curr_dom->{model_length}/10; | |
440 | ||
441 | # my $right_dom_len = min($length - $curr_dom->{seq_end}+1, # space available | |
442 | # $curr_dom->{model_length}-$curr_dom->{model_end} # space needed | |
443 | # ); | |
444 | # if ($right_dom_len > $min_vdom) { | |
445 | # my %new_dom = (seq_start=> $curr_dom->{seq_end}+1, | |
446 | # seq_end => $curr_dom->{seq_end}+$right_dom_len, | |
447 | # info=>'@'.$curr_dom->{pfamA_acc}, | |
448 | # model_len=> $curr_dom->{model_len}, | |
449 | # pfamA_acc => $curr_dom->{pfamA_acc}, | |
450 | # model_start => $curr_dom->{model_end}+1, | |
451 | # model_end => $curr_dom->{model_len}, | |
452 | # ); | |
453 | ||
454 | # push @vpf_domains, \%new_dom; | |
455 | # } | |
456 | ||
457 | # @vpf_domains has both old @pf_domains and new virtual domains | |
458 | @pf_domains = @vpf_domains; | |
459 | } | |
460 | ||
461 | if ($neg_doms) { | |
462 | my @npf_domains; | |
463 | my $prev_dom={seq_end=>0}; | |
464 | for my $curr_dom ( @pf_domains) { | |
465 | if ($curr_dom->{seq_start} - $prev_dom->{seq_end} > $min_nodom) { | |
466 | my %new_dom = (seq_start=>$prev_dom->{seq_end}+1, seq_end => $curr_dom->{seq_start}-1, info=>'NODOM'); | |
467 | push @npf_domains, \%new_dom; | |
468 | } | |
469 | push @npf_domains, $curr_dom; | |
470 | $prev_dom = $curr_dom; | |
471 | } | |
472 | if ($seq_length - $prev_dom->{seq_end} > $min_nodom) { | |
473 | my %new_dom = (seq_start=>$prev_dom->{seq_end}+1, seq_end=>$seq_length, info=>'NODOM'); | |
474 | if ($new_dom{seq_end} > $new_dom{seq_start}) {push @npf_domains, \%new_dom;} | |
475 | } | |
476 | ||
477 | # @npf_domains has both old @pf_domains and new neg-domains | |
478 | @pf_domains = @npf_domains; | |
479 | } | |
480 | ||
481 | # now make sure we have useful names: colors | |
482 | ||
483 | for my $pf (@pf_domains) { | |
484 | $pf->{info} = domain_name($pf->{info}, $pf->{auto_pfamA}, $pf->{pfamA_acc}); | |
485 | } | |
486 | ||
487 | my @feats = (); | |
488 | for my $d_ref (@pf_domains) { | |
489 | if ($lav) { | |
490 | push @feats, [$d_ref->{seq_start}, $d_ref->{seq_end}, $d_ref->{info}]; | |
491 | } | |
492 | else { | |
493 | push @feats, [$d_ref->{seq_start}, '-', $d_ref->{seq_end}, $d_ref->{info} ]; | |
494 | # push @feats, [$d_ref->{seq_end}, ']', '-', ""]; | |
495 | } | |
496 | ||
497 | } | |
498 | ||
499 | return \@feats; | |
500 | } | |
501 | ||
502 | sub min { | |
503 | my ($arg1, $arg2) = @_; | |
504 | ||
505 | return ($arg1 <= $arg2 ? $arg1 : $arg2); | |
506 | } | |
507 | ||
508 | sub max { | |
509 | my ($arg1, $arg2) = @_; | |
510 | ||
511 | return ($arg1 >= $arg2 ? $arg1 : $arg2); | |
512 | } | |
513 | ||
514 | # domain name takes a uniprot domain label, removes comments ( ; | |
515 | # truncated) and numbers and returns a canonical form. Thus: | |
516 | # Cortactin 6. | |
517 | # Cortactin 7; truncated. | |
518 | # becomes "Cortactin" | |
519 | # | |
520 | ||
521 | sub domain_name { | |
522 | ||
523 | my ($value, $pfamA_acc) = @_; | |
524 | my $is_virtual = 0; | |
525 | ||
526 | if ($value =~ m/^@/) { | |
527 | $is_virtual = 1; | |
528 | $value =~ s/^@//; | |
529 | } | |
530 | ||
531 | # check for clan: | |
532 | if ($no_clans) { | |
533 | if (! defined($domains{$value})) { | |
534 | $domain_clan{$value} = 0; | |
535 | $domains{$value} = ++$domain_cnt; | |
536 | push @domain_list, $pfamA_acc; | |
537 | } | |
538 | } | |
539 | elsif (!defined($domain_clan{$value})) { | |
540 | ## only do this for new domains, old domains have known mappings | |
541 | ||
542 | ## ways to highlight the same domain: | |
543 | # (1) for clans, substitute clan name for family name | |
544 | # (2) for clans, use the same color for the same clan, but don't change the name | |
545 | # (3) for clans, combine family name with clan name, but use colors based on clan | |
546 | ||
547 | # check to see if it's a clan | |
548 | $get_pfam_clan->execute($pfamA_acc); | |
549 | ||
550 | my $pfam_clan_href=0; | |
551 | ||
552 | if ($pfam_clan_href=$get_pfam_clan->fetchrow_hashref()) { # is a clan | |
553 | my ($clan_id, $clan_acc) = @{$pfam_clan_href}{qw(clan_id clan_acc)}; | |
554 | ||
555 | # now check to see if we have seen this clan before (if so, do not increment $domain_cnt) | |
556 | my $c_value = "C." . $clan_id; | |
557 | if ($pf_acc) {$c_value = $clan_acc;} | |
558 | ||
559 | $domain_clan{$value} = {clan_id => $clan_id, | |
560 | clan_acc => $clan_acc}; | |
561 | ||
562 | if ($domains{$c_value}) { | |
563 | $domain_clan{$value}->{domain_cnt} = $domains{$c_value}; | |
564 | $value = $c_value; | |
565 | } | |
566 | else { | |
567 | $domain_clan{$value}->{domain_cnt} = ++ $domain_cnt; | |
568 | $value = $c_value; | |
569 | $domains{$value} = $domain_cnt; | |
570 | push @domain_list, $pfamA_acc; | |
571 | } | |
572 | } | |
573 | else { # not a clan | |
574 | $domain_clan{$value} = 0; | |
575 | $domains{$value} = ++$domain_cnt; | |
576 | push @domain_list, $pfamA_acc; | |
577 | } | |
578 | } | |
579 | elsif ($domain_clan{$value} && $domain_clan{$value}->{clan_acc}) { | |
580 | if ($pf_acc) {$value = $domain_clan{$value}->{clan_acc};} | |
581 | else { $value = "C." . $domain_clan{$value}->{clan_id}; } | |
582 | } | |
583 | ||
584 | if ($is_virtual) { | |
585 | $domains{'@'.$value} = $domains{$value}; | |
586 | $value = '@'.$value; | |
587 | } | |
588 | return $value; | |
589 | } | |
590 | ||
591 | sub domain_num { | |
592 | my ($value, $number) = @_; | |
593 | if ($value =~ m/^@/) { | |
594 | $value =~ s/^@/v/; | |
595 | # $number = $number."v"; | |
596 | } | |
597 | return ($value, $number); | |
598 | } | |
599 | ||
600 | __END__ | |
601 | ||
602 | =pod | |
603 | ||
604 | =head1 NAME | |
605 | ||
606 | ann_pfam_e.pl | |
607 | ||
608 | =head1 SYNOPSIS | |
609 | ||
610 | ann_pfam_e.pl --neg-doms 'sp|P09488|GSTM1_HUMAN' | accession.file | |
611 | ||
612 | =head1 OPTIONS | |
613 | ||
614 | -h short help | |
615 | --help include description | |
616 | --no-over : generate non-overlapping domains (equivalent to ann_pfam.pl) | |
617 | --no-clans : do not use clans with multiple families from same clan | |
618 | --neg-doms : report domains between annotated domains as NODOM | |
619 | (also --neg, --neg_doms) | |
620 | --min_nodom=10 : minimum length between domains for NODOM | |
621 | ||
622 | --host, --user, --password, --port, --db : info for mysql database | |
623 | ||
624 | =head1 DESCRIPTION | |
625 | ||
626 | C<ann_pfam_e.pl> extracts domain information from the pfam mysql | |
627 | database. Currently, the program works with database sequence | |
628 | descriptions in several formats: | |
632 | ||
633 | >gi|1705556|sp|P54670.1|CAF1_DICDI | |
634 | >sp|P09488|GSTM1_HUMAN | |
635 | >sp:CALM_HUMAN | |
636 | ||
637 | C<ann_pfam_e.pl> uses the C<pfamA_reg_full_significant>, C<pfamseq>, | |
638 | and C<pfamA> tables of the C<pfam> database to extract domain | |
639 | information on a protein. | |
640 | ||
641 | If the "--no-over" option is set, overlapping domains are selected and | |
642 | edited to remove overlaps. For proteins with multiple overlapping | |
643 | domains (domains overlap by more than 1/3 of the domain length), | |
644 | C<ann_pfam_e.pl> selects the domain annotation with the best | |
645 | C<domain_evalue_score>. When domains overlap by less than 1/3 of the | |
646 | domain length, they are shortened to remove the overlap. | |
647 | ||
648 | C<ann_pfam_e.pl> is designed to be used by the B<FASTA> programs with | |
649 | the C<-V \!ann_pfam_e.pl> or C<-V "\!ann_pfam_e.pl --neg"> option. | |
650 | ||
651 | =head1 AUTHOR | |
652 | ||
653 | William R. Pearson, wrp@virginia.edu | |
654 | ||
655 | =cut |
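The overlap handling described in the DESCRIPTION above (strong overlaps are resolved by picking one domain, weak overlaps are trimmed) can be sketched in isolation. This is a minimal illustration, not code from the script: the helper names `strong_overlap` and `trim_overlap` are hypothetical, and the default fraction of 3 is assumed from the "1/3 of the domain length" wording.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script): true when one domain
# contains the other, or when the overlap exceeds 1/$over_fract of the
# longer domain. Domains are hashrefs with seq_start/seq_end.
sub strong_overlap {
    my ($prev, $curr, $over_fract) = @_;
    $over_fract = 3.0 unless defined $over_fract;

    my $diff     = $prev->{seq_end} - $curr->{seq_start};
    my $prev_len = $prev->{seq_end} - $prev->{seq_start} + 1;
    my $curr_len = $curr->{seq_end} - $curr->{seq_start} + 1;
    my $longer   = $prev_len > $curr_len ? $prev_len : $curr_len;

    my $inclusion =
        ($curr->{seq_start} >= $prev->{seq_start} && $curr->{seq_end} <= $prev->{seq_end})
     || ($curr->{seq_start} <= $prev->{seq_start} && $curr->{seq_end} >= $prev->{seq_end});

    return ($inclusion || ($diff > 0 && $diff > $longer / $over_fract)) ? 1 : 0;
}

# Hypothetical helper: split a small overlap between two adjacent domains,
# mirroring the "smaller overlaps" loop in get_pfam_annots.
sub trim_overlap {
    my ($prev, $curr) = @_;
    if ($prev->{seq_end} >= $curr->{seq_start}) {
        my $overlap = $prev->{seq_end} - $curr->{seq_start};
        $prev->{seq_end} -= int($overlap / 2);
        $curr->{seq_start} = $prev->{seq_end} + 1;
    }
    return ($prev, $curr);
}

# Two made-up domains that overlap by a few residues only.
my $a_dom = { seq_start => 1,  seq_end => 100 };
my $b_dom = { seq_start => 91, seq_end => 200 };

print strong_overlap($a_dom, $b_dom) ? "strong\n" : "weak\n";   # prints "weak"
trim_overlap($a_dom, $b_dom);
print "$a_dom->{seq_start}-$a_dom->{seq_end} $b_dom->{seq_start}-$b_dom->{seq_end}\n";   # prints "1-96 97-200"
```

In the script itself, a strong overlap leads to keeping the domain with the best `evalue` (or the RPD2-selected family when `--RPD2` is given), while the trimming step only runs on the domains that survive that selection.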
0 | #!/usr/bin/perl -w | |
1 | ||
2 | ################################################################ | |
3 | # copyright (c) 2014,2015 by William R. Pearson and The Rector & | |
4 | # Visitors of the University of Virginia | |
5 | ################################################################ | |
6 | # Licensed under the Apache License, Version 2.0 (the "License"); | |
7 | # you may not use this file except in compliance with the License. | |
8 | # You may obtain a copy of the License at | |
9 | # | |
10 | # http://www.apache.org/licenses/LICENSE-2.0 | |
11 | # | |
12 | # Unless required by applicable law or agreed to in writing, | |
13 | # software distributed under this License is distributed on an "AS | |
14 | # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either | |
15 | # express or implied. See the License for the specific language | |
16 | # governing permissions and limitations under the License. | |
17 | ################################################################ | |
18 | ||
19 | # ann_pfam.pl gets an annotation file from fasta36 -V with a line of the form: | |
20 | ||
21 | # gi|62822551|sp|P00502|GSTA1_RAT Glutathione S-transfer\n (at least from pir1.lseg) | |
22 | # | |
23 | # it must: | |
24 | # (1) read in the line | |
25 | # (2) parse it to get the up_acc | |
26 | # (3) return the tab delimited features | |
27 | # | |
28 | ||
29 | # this version only annotates sequences known to Pfam:pfamseq: | |
30 | # and only provides domain information | |
31 | ||
32 | use strict; | |
33 | ||
34 | use DBI; | |
35 | use Getopt::Long; | |
36 | use Pod::Usage; | |
37 | ||
38 | use vars qw($host $db $port $user $pass); | |
39 | ||
40 | my $hostname = `/bin/hostname`; | |
41 | ||
42 | ($host, $db, $port, $user, $pass) = ("wrpxdb.its.virginia.edu", "pfam28", 0, "web_user", "fasta_www"); | |
43 | #$host = 'xdb'; | |
44 | #$host = 'localhost'; | |
45 | #$db = 'RPD2_pfam28u'; | |
46 | ||
47 | my ($auto_reg,$rpd2_fams, $neg_doms, $vdoms, $lav, $no_doms, $no_clans, $pf_acc, $acc_comment, $bound_comment, $shelp, $help) = | |
48 | (0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0,); | |
49 | my ($no_over, $split_over, $over_fract) = (0, 0, 3.0); | |
50 | ||
51 | my $color_sep_str = " :"; | |
52 | $color_sep_str = '~'; | |
53 | ||
54 | my ($min_nodom, $min_vdom) = (10,10); | |
55 | ||
56 | GetOptions( | |
57 | "host=s" => \$host, | |
58 | "db=s" => \$db, | |
59 | "user=s" => \$user, | |
60 | "password=s" => \$pass, | |
61 | "port=i" => \$port, | |
62 | "lav" => \$lav, | |
63 | "acc_comment" => \$acc_comment, | |
64 | "bound_comment" => \$bound_comment, | |
65 | "no-over" => \$no_over, | |
66 | "no_over" => \$no_over, | |
67 | "split-over" => \$split_over, | |
68 | "split_over" => \$split_over, | |
69 | "over_fract=f" => \$over_fract, | |
70 | "over-fract=f" => \$over_fract, | |
71 | "no-clans" => \$no_clans, | |
72 | "no_clans" => \$no_clans, | |
73 | "neg" => \$neg_doms, | |
74 | "neg_doms" => \$neg_doms, | |
75 | "neg-doms" => \$neg_doms, | |
76 | "min_nodom=i" => \$min_nodom, | |
77 | "vdoms" => \$vdoms, | |
78 | "v_doms" => \$vdoms, | |
79 | "pfacc" => \$pf_acc, | |
80 | "RPD2" => \$rpd2_fams, | |
81 | "auto_reg" => \$auto_reg, | |
82 | "h|?" => \$shelp, | |
83 | "help" => \$help, | |
84 | ); | |
85 | ||
86 | pod2usage(1) if $shelp; | |
87 | pod2usage(exitstatus => 0, verbose => 2) if $help; | |
88 | pod2usage(1) unless (@ARGV || -p STDIN || -f STDIN); | |
89 | ||
90 | my $connect = "dbi:mysql(AutoCommit=>1,RaiseError=>1):database=$db"; | |
91 | $connect .= ";host=$host" if $host; | |
92 | $connect .= ";port=$port" if $port; | |
93 | ||
94 | my $dbh = DBI->connect($connect, | |
95 | $user, | |
96 | $pass | |
97 | ) or die $DBI::errstr; | |
98 | ||
99 | my %annot_types = (); | |
100 | my %domains = (NODOM=>0); | |
101 | my %domain_clan = (NODOM => {clan_id => 'NODOM', clan_acc=>0, domain_cnt=>0}); | |
102 | my @domain_list = (0); | |
103 | my $domain_cnt = 0; | |
104 | ||
105 | my $pfamA_reg_full = 'pfamA_reg_full_significant'; | |
106 | ||
107 | my $get_annot_sub = \&get_pfam_annots; | |
108 | ||
109 | my @pfam_fields = qw(seq_start seq_end model_start model_end model_length pfamA_acc pfamA_id auto_pfamA_reg_full domain_evalue_score as evalue length); | |
110 | ||
111 | my $get_pfam_acc = $dbh->prepare(<<EOSQL); | |
112 | SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
113 | FROM pfamseq | |
114 | JOIN $pfamA_reg_full using(pfamseq_acc) | |
115 | JOIN pfamA USING (pfamA_acc) | |
116 | WHERE in_full = 1 | |
117 | AND pfamseq_acc=? | |
118 | ORDER BY seq_start | |
119 | ||
120 | EOSQL | |
121 | ||
122 | my $get_pfam_refacc = $dbh->prepare(<<EOSQL); | |
123 | ||
124 | SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
125 | FROM pfamseq | |
126 | JOIN $pfamA_reg_full using(pfamseq_acc) | |
127 | JOIN pfamA USING (pfamA_acc) | |
128 | JOIN seqdb_demo2.annot as sa1 on(sa1.acc=pfamseq_acc and sa1.db='sp') | |
129 | JOIN seqdb_demo2.annot as sa2 using(prot_id) | |
130 | WHERE in_full = 1 | |
131 | AND sa2.acc=? | |
132 | AND sa2.db='ref' | |
133 | ORDER BY seq_start | |
134 | ||
135 | EOSQL | |
136 | ||
137 | my $get_annots_sql = $get_pfam_acc; | |
138 | ||
139 | my $get_pfam_id = $dbh->prepare(<<EOSQL); | |
140 | ||
141 | SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
142 | FROM pfamseq | |
143 | JOIN $pfamA_reg_full using(pfamseq_acc) | |
144 | JOIN pfamA USING (pfamA_acc) | |
145 | WHERE in_full=1 | |
146 | AND pfamseq_id=? | |
147 | ORDER BY seq_start | |
148 | ||
149 | EOSQL | |
150 | ||
151 | my $get_pfam_clan = $dbh->prepare(<<EOSQL); | |
152 | ||
153 | SELECT clan_acc, clan_id | |
154 | FROM clan | |
155 | JOIN clan_membership using(clan_acc) | |
156 | WHERE pfamA_acc=? | |
157 | ||
158 | EOSQL | |
159 | ||
160 | my $get_rpd2_clans = $dbh->prepare(<<EOSQL); | |
161 | ||
162 | SELECT auto_pfamA, clan | |
163 | FROM ljm_db.RPD2_final_fams | |
164 | WHERE clan is not NULL | |
165 | ||
166 | EOSQL | |
167 | ||
168 | # -- LEFT JOIN clan_membership USING (auto_pfamA) | |
169 | # -- LEFT JOIN clans using(auto_clan) | |
170 | ||
171 | my ($tmp, $gi, $sdb, $acc, $id, $use_acc); | |
172 | ||
173 | # get the query | |
174 | my ($query, $seq_len) = @ARGV; | |
175 | $seq_len = 0 unless defined($seq_len); | |
176 | ||
177 | $query =~ s/^>// if ($query); | |
178 | ||
179 | my @annots = (); | |
180 | ||
181 | my %rpd2_clan_fams = (); | |
182 | ||
183 | if ($rpd2_fams) { | |
184 | $get_rpd2_clans->execute(); | |
185 | my ($auto_pfam, $auto_clan); | |
186 | while (($auto_pfam, $auto_clan)=$get_rpd2_clans->fetchrow_array()) { | |
187 | $rpd2_clan_fams{$auto_pfam} = $auto_clan; | |
188 | } | |
189 | } | |
190 | ||
191 | #if it's a file I can open, read and parse it | |
192 | unless ($query && $query =~ m/[\|:]/) { | |
193 | ||
194 | while (my $a_line = <>) { | |
195 | $a_line =~ s/^>//; | |
196 | chomp $a_line; | |
197 | push @annots, show_annots($a_line, $get_annot_sub); | |
198 | } | |
199 | } | |
200 | else { | |
201 | push @annots, show_annots("$query $seq_len", $get_annot_sub); | |
202 | } | |
203 | ||
204 | for my $seq_annot (@annots) { | |
205 | print ">",$seq_annot->{seq_info},"\n"; | |
206 | for my $annot (@{$seq_annot->{list}}) { | |
207 | if (!$lav && defined($domains{$annot->[-1]})) { | |
208 | my ($a_name, $a_num) = domain_num($annot->[-1],$domains{$annot->[-1]}); | |
209 | $annot->[-1] = $a_name; | |
210 | my $tmp_a_num = $a_num; | |
211 | $tmp_a_num =~ s/v$//; | |
212 | if ($acc_comment) { | |
213 | $annot->[-1] .= "{$domain_list[$tmp_a_num]}"; | |
214 | } | |
215 | if ($bound_comment) { | |
216 | $annot->[-1] .= $color_sep_str.$annot->[0].":".$annot->[2]; | |
217 | } | |
218 | $annot->[-1] .= $color_sep_str.$a_num; | |
219 | } | |
220 | print join("\t",@$annot),"\n"; | |
221 | } | |
222 | } | |
223 | ||
224 | exit(0); | |
225 | ||
226 | sub show_annots { | |
227 | my ($query_len, $get_annot_sub) = @_; | |
228 | ||
229 | my ($annot_line, $seq_len) = split(/\t/,$query_len); | |
230 | ||
231 | my $pfamA_acc; | |
232 | ||
233 | my %annot_data = (seq_info=>$annot_line); | |
234 | ||
235 | $use_acc = 1; | |
236 | $get_annots_sql = $get_pfam_acc; | |
237 | ||
238 | if ($annot_line =~ m/^pf\d+\|/) { | |
239 | ($sdb, $gi, $pfamA_acc, $acc, $id) = split(/\|/,$annot_line); | |
240 | # $dbh->do("use RPD2_pfam"); | |
241 | } | |
242 | elsif ($annot_line =~ m/^gi\|/) { | |
243 | ($tmp, $gi, $sdb, $acc, $id) = split(/\|/,$annot_line); | |
244 | if ($sdb =~ m/ref/) { | |
245 | $get_annots_sql = $get_pfam_refacc; | |
246 | } | |
247 | } | |
248 | elsif ($annot_line =~ m/^(sp|tr)\|/) { | |
249 | ($sdb, $acc, $id) = split(/\|/,$annot_line); | |
250 | } | |
251 | elsif ($annot_line =~ m/^ref\|/) { | |
252 | ($sdb, $acc) = split(/\|/,$annot_line); | |
253 | $get_annots_sql = $get_pfam_refacc; | |
254 | } | |
255 | elsif ($annot_line =~ m/^(SP|TR):/i) { | |
256 | ($sdb, $id) = split(/:/,$annot_line); | |
257 | $use_acc = 0; | |
258 | } | |
259 | elsif ($annot_line !~ m/\|/) { # new NCBI swissprot format | |
260 | $use_acc =1; | |
261 | $sdb = 'sp'; | |
262 | ($acc) = split(/\s+/,$annot_line); | |
263 | } | |
264 | ||
265 | # remove version number | |
266 | unless ($use_acc) { | |
267 | $get_annots_sql = $get_pfam_id; | |
268 | $get_annots_sql->execute($id); | |
269 | } else { | |
270 | unless ($acc) { | |
271 | warn "missing acc in $annot_line"; | |
272 | return; | |
273 | } else { | |
274 | $acc =~ s/\.\d+$//; | |
275 | $get_annots_sql->execute($acc); | |
276 | } | |
277 | } | |
278 | ||
279 | $annot_data{list} = $get_annot_sub->($get_annots_sql, $seq_len); | |
280 | ||
281 | return \%annot_data; | |
282 | } | |
283 | ||
284 | sub get_pfam_annots { | |
285 | my ($get_annots, $seq_length) = @_; | |
286 | ||
287 | $seq_length = 0 unless $seq_length; | |
288 | ||
289 | my @pf_domains = (); | |
290 | ||
291 | # get the list of domains, sorted by start | |
292 | ||
293 | # $row_href has: seq_start, seq_end, model_start, model_end, model_length, | |
294 | # pfamA_acc, pfamA_id, auto_pfamA_reg_full, | |
295 | # domain_evalue_score as evalue, length | |
296 | ||
297 | while ( my $row_href = $get_annots->fetchrow_hashref()) { | |
298 | if ($auto_reg) { | |
299 | $row_href->{info} = $row_href->{auto_pfamA_reg_full}; | |
300 | } elsif ($pf_acc) { | |
301 | $row_href->{info} = $row_href->{pfamA_acc}; | |
302 | } else { | |
303 | $row_href->{info} = $row_href->{pfamA_id}; | |
304 | } | |
305 | ||
306 | if ($row_href && $row_href->{length} > $seq_length && $seq_length == 0) { | |
307 | $seq_length = $row_href->{length}; | |
308 | } | |
309 | ||
310 | next if ($row_href->{seq_start} >= $seq_length); | |
311 | if ($row_href->{seq_end} > $seq_length) { | |
312 | $row_href->{seq_end} = $seq_length; | |
313 | } | |
314 | ||
315 | push @pf_domains, $row_href | |
316 | } | |
317 | ||
318 | # before checking for domain overlap, check for "split-domains" | |
319 | # (self-unbound) by looking for runs of the same domain that are | |
320 | # ordered by model_start | |
321 | ||
322 | if (scalar(@pf_domains) > 1) { | |
323 | my @j_domains; #joined domains | |
324 | my @tmp_domains = @pf_domains; | |
325 | ||
326 | my $prev_dom = shift(@tmp_domains); | |
327 | ||
328 | for my $curr_dom (@tmp_domains) { | |
329 | # to join domains: | |
330 | # (1) the domains must be in order by model_start/end coordinates | |
331 | # (2) joining the domains cannot make the total combination too long | |
332 | ||
333 | # check for model and sequence consistency | |
334 | if (($prev_dom->{pfamA_acc} eq $curr_dom->{pfamA_acc}) # same family | |
335 | && $prev_dom->{model_start} < $curr_dom->{model_start} # model check | |
336 | && $prev_dom->{model_end} < $curr_dom->{model_end} | |
337 | ||
338 | && ($curr_dom->{model_start} > $prev_dom->{model_end} * 0.80 # limit overlap | |
339 | || $curr_dom->{model_start} < $prev_dom->{model_end} * 1.25) | |
340 | && ((($curr_dom->{model_end} - $curr_dom->{model_start}+1)/$curr_dom->{model_length} + | |
341 | ($prev_dom->{model_end} - $prev_dom->{model_start}+1)/$prev_dom->{model_length}) < 1.33) | |
342 | ) { # join them by updating $prev_dom | |
343 | $prev_dom->{seq_end} = $curr_dom->{seq_end}; | |
344 | $prev_dom->{model_end} = $curr_dom->{model_end}; | |
345 | $prev_dom->{auto_pfamA_reg_full} = $prev_dom->{auto_pfamA_reg_full} . ";". $curr_dom->{auto_pfamA_reg_full}; | |
346 | $prev_dom->{evalue} = ($prev_dom->{evalue} < $curr_dom->{evalue} ? $prev_dom->{evalue} : $curr_dom->{evalue}); | |
347 | } else { | |
348 | push @j_domains, $prev_dom; | |
349 | $prev_dom = $curr_dom; | |
350 | } | |
351 | } | |
352 | push @j_domains, $prev_dom; | |
353 | @pf_domains = @j_domains; | |
354 | ||
355 | ||
356 | if ($no_over) { # for either $no_over or $split_over, check for overlapping domains and edit/split them | |
357 | ||
358 | my @tmp_domains = @pf_domains; # allow shifts from copy of @pf_domains | |
359 | my @save_domains = (); # where the new domains go | |
360 | ||
361 | my $prev_dom = shift @tmp_domains; | |
362 | ||
363 | while (my $curr_dom = shift @tmp_domains) { | |
364 | ||
365 | my @overlap_domains = ($prev_dom); | |
366 | ||
367 | my $diff = $prev_dom->{seq_end} - $curr_dom->{seq_start}; | |
368 | ||
369 | my ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, | |
370 | $curr_dom->{seq_end}-$curr_dom->{seq_start}+1); | |
371 | ||
372 | my $inclusion = ((($curr_dom->{seq_start} >= $prev_dom->{seq_start}) # start is right && end is left | |
373 | && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})) || # -- curr inside prev | |
374 | (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) # start is left && end is right | |
375 | && ($curr_dom->{seq_end} >= $prev_dom->{seq_end}))); # -- prev is inside curr | |
376 | ||
377 | my $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len; | |
378 | ||
379 | # check for overlap > domain_length/$over_fract | |
380 | while ($inclusion || ($diff > 0 && $diff > $longer_len/$over_fract)) { | |
381 | push @overlap_domains, $curr_dom; | |
382 | $curr_dom = shift @tmp_domains; | |
383 | last unless $curr_dom; | |
384 | $diff = $prev_dom->{seq_end} - $curr_dom->{seq_start}; | |
385 | ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, $curr_dom->{seq_end}-$curr_dom->{seq_start}+1); | |
386 | $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len; | |
387 | $inclusion = ((($curr_dom->{seq_start} >= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})) || | |
388 | (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} >= $prev_dom->{seq_end}))); | |
389 | } | |
390 | ||
391 | # check for overlapping domains; >1 because $prev_dom is always there | |
392 | if (scalar(@overlap_domains) > 1 ) { | |
393 | # if $rpd2_fams, check for a chosen one | |
394 | ||
395 | for my $dom ( @overlap_domains) { | |
396 | $dom->{evalue} = 1.0 unless defined($dom->{evalue}); | |
397 | } | |
398 | ||
399 | @overlap_domains = sort { $a->{evalue} <=> $b->{evalue} } @overlap_domains; | |
400 | $prev_dom = $overlap_domains[0]; | |
401 | } | |
402 | ||
403 | # $prev_dom should be the best of the overlaps, and we are no longer overlapping > dom_length/3 | |
404 | push @save_domains, $prev_dom; | |
405 | $prev_dom = $curr_dom; | |
406 | } | |
407 | ||
408 | if ($prev_dom) { | |
409 | push @save_domains, $prev_dom; | |
410 | } | |
411 | ||
412 | @pf_domains = @save_domains; | |
413 | ||
414 | # now check for smaller overlaps | |
415 | for (my $i=1; $i < scalar(@pf_domains); $i++) { | |
416 | if ($pf_domains[$i-1]->{seq_end} >= $pf_domains[$i]->{seq_start}) { | |
417 | my $overlap = $pf_domains[$i-1]->{seq_end} - $pf_domains[$i]->{seq_start}; | |
418 | $pf_domains[$i-1]->{seq_end} -= int($overlap/2); | |
419 | $pf_domains[$i]->{seq_start} = $pf_domains[$i-1]->{seq_end}+1; | |
420 | } | |
421 | } | |
422 | } | |
423 | elsif ($split_over) { # here, everything that overlaps by > $min_vdom should be split into a separate domain | |
424 | my @save_domains = (); # where the new domains go | |
425 | ||
426 | # check to see if one domain is included (or overlapping) more | |
427 | # than xx% of the other. If so, pick the longer one | |
428 | ||
429 | my ($prev_dom, $curr_dom) = ($pf_domains[0],0) ; | |
430 | for (my $i=1; $i < scalar(@pf_domains); $i++) { | |
431 | $curr_dom = $pf_domains[$i]; | |
432 | ||
433 | my ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, $curr_dom->{seq_end}-$curr_dom->{seq_start}+1); | |
434 | my $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len; | |
435 | ||
436 | if (($curr_dom->{seq_start} >= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} <= $prev_dom->{seq_end}) | |
437 | && $cur_len / $prev_len > 0.80) { | |
438 | # $prev_dom stays the same, $curr_dom deleted | |
439 | next; | |
440 | } | |
441 | elsif (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} >= $prev_dom->{seq_end}) | |
442 | && $prev_len / $cur_len > 0.80) { | |
443 | $prev_dom = $curr_dom; # this should delete $prev_dom | |
444 | next; | |
445 | } | |
446 | ||
447 | if ($prev_dom->{seq_end} >= $curr_dom->{seq_start} + $min_vdom) { | |
448 | my ($l_seq_end, $r_seq_start) = ($curr_dom->{seq_start}-1, $prev_dom->{seq_end}+1); | |
449 | ||
450 | $prev_dom->{seq_end} = $l_seq_end; | |
451 | push @save_domains, $prev_dom; | |
452 | my $new_dom = {seq_start => $l_seq_end+1, seq_end=>$r_seq_start-1, | |
453 | model_length => -1, | |
454 | pfamA_acc=>$prev_dom->{pfamA_acc}."/".$curr_dom->{pfamA_acc}, | |
455 | pfamA_id=>$prev_dom->{pfamA_id}."/".$curr_dom->{pfamA_id}, | |
456 | }; | |
457 | ||
458 | if ($pf_acc) { | |
459 | $new_dom->{info} = $new_dom->{pfamA_acc}; | |
460 | } | |
461 | else { | |
462 | $new_dom->{info} = $new_dom->{pfamA_id}; | |
463 | } | |
464 | ||
465 | push @save_domains, $new_dom; | |
466 | $curr_dom->{seq_start} = $r_seq_start; | |
467 | $prev_dom = $curr_dom; | |
468 | } | |
469 | else { | |
470 | push @save_domains, $prev_dom; | |
471 | $prev_dom = $curr_dom; | |
472 | } | |
473 | } | |
474 | push @save_domains, $prev_dom; | |
475 | @pf_domains = @save_domains; | |
476 | } | |
477 | } | |
478 | ||
479 | # $vdoms -- virtual Pfam domains -- the equivalent of $neg_doms, | |
480 | # but covering parts of a Pfam model that are not annotated. split | |
481 | # domains have been joined, so simply check beginning and end of | |
482 | # each domain (but must also check for bounded-ness) | |
483 | # only add when 10% or more is missing and missing length > $min_nodom | |
484 | ||
485 | if ($vdoms && scalar(@pf_domains)) { | |
486 | my @vpf_domains; | |
487 | ||
488 | my $curr_dom = $pf_domains[0]; | |
489 | my $length = $curr_dom->{length}; | |
490 | ||
491 | my $prev_dom={seq_end=>0, pfamA_acc=>''}; | |
492 | my $prev_dom_end = 0; | |
493 | my $next_dom_start = $length+1; | |
494 | ||
495 | for (my $dom_ix=0; $dom_ix < scalar(@pf_domains); $dom_ix++ ) { | |
496 | $curr_dom = $pf_domains[$dom_ix]; | |
497 | ||
498 | my $pfamA = $curr_dom->{pfamA_acc}; | |
499 | ||
500 | # first, look left, is there a domain there (if there is, | |
501 | # it should be updated right | |
502 | ||
503 | # my $min_vdom = $curr_dom->{model_length} / 10; | |
504 | ||
505 | if ($curr_dom->{model_length} < $min_vdom) { | |
506 | push @vpf_domains, $curr_dom; | |
507 | next; | |
508 | } | |
509 | if ($prev_dom->{pfamA_acc}) { # look for previous domain | |
510 | $prev_dom_end = $prev_dom->{seq_end}; | |
511 | } | |
512 | ||
513 | # there is a domain to the left, how much room is available? | |
514 | my $left_dom_len = min($curr_dom->{seq_start}-$prev_dom_end-1, $curr_dom->{model_start}-1); | |
515 | if ( $left_dom_len > $min_vdom) { | |
516 | # there is room for a virtual domain | |
517 | my %new_dom = (seq_start=> $curr_dom->{seq_start}-$left_dom_len, | |
518 | seq_end => $curr_dom->{seq_start}-1, | |
519 | info=>'@'.$curr_dom->{info}, | |
520 | model_length=>$curr_dom->{model_length}, | |
521 | model_end => $curr_dom->{model_start}-1, | |
522 | model_start => $left_dom_len, | |
523 | pfamA_acc=>$pfamA, | |
524 | ); | |
525 | push @vpf_domains, \%new_dom; | |
526 | } | |
527 | ||
528 | # save the current domain | |
529 | push @vpf_domains, $curr_dom; | |
530 | $prev_dom = $curr_dom; | |
531 | ||
532 | if ($dom_ix < $#pf_domains) { # there is a domain to the right | |
533 | # first, give all the extra space to the first domain (no splitting) | |
534 | $next_dom_start = $pf_domains[$dom_ix+1]->{seq_start}; | |
535 | } | |
536 | else { | |
537 | $next_dom_start = $length; | |
538 | } | |
539 | ||
540 | # is there room for a virtual domain right | |
541 | ||
542 | my $right_dom_len = min($next_dom_start-$curr_dom->{seq_end}-1, # space available | |
543 | $curr_dom->{model_length}-$curr_dom->{model_end} # space needed | |
544 | ); | |
545 | if ( $right_dom_len > $min_vdom) { | |
546 | my %new_dom = (seq_start=> $curr_dom->{seq_end}+1, | |
547 | seq_end=> $curr_dom->{seq_end}+$right_dom_len, | |
548 | info=>'@'.$curr_dom->{info}, | |
549 | model_length => $curr_dom->{model_length}, | |
550 | pfamA_acc=> $pfamA, | |
551 | ); | |
552 | push @vpf_domains, \%new_dom; | |
553 | $prev_dom = \%new_dom; | |
554 | } | |
555 | } # all done, check for last one | |
556 | ||
557 | # $curr_dom=$pf_domains[-1]; | |
558 | # # my $min_vdom = $curr_dom->{model_length}/10; | |
559 | ||
560 | # my $right_dom_len = min($length - $curr_dom->{seq_end}+1, # space available | |
561 | # $curr_dom->{model_length}-$curr_dom->{model_end} # space needed | |
562 | # ); | |
563 | # if ($right_dom_len > $min_vdom) { | |
564 | # my %new_dom = (seq_start=> $curr_dom->{seq_end}+1, | |
565 | # seq_end => $curr_dom->{seq_end}+$right_dom_len, | |
566 | # info=>'@'.$curr_dom->{pfamA_acc}, | |
567 | # model_len=> $curr_dom->{model_len}, | |
568 | # pfamA_acc => $curr_dom->{pfamA_acc}, | |
569 | # model_start => $curr_dom->{model_end}+1, | |
570 | # model_end => $curr_dom->{model_len}, | |
571 | # ); | |
572 | ||
573 | # push @vpf_domains, \%new_dom; | |
574 | # } | |
575 | ||
576 | # @vpf_domains has both old @pf_domains and new virtual domains | |
577 | @pf_domains = @vpf_domains; | |
578 | } | |
579 | ||
580 | if ($neg_doms) { | |
581 | my @npf_domains; | |
582 | my $prev_dom={seq_end=>0}; | |
583 | for my $curr_dom ( @pf_domains) { | |
584 | if ($curr_dom->{seq_start} - $prev_dom->{seq_end} > $min_nodom) { | |
585 | my %new_dom = (seq_start=>$prev_dom->{seq_end}+1, seq_end => $curr_dom->{seq_start}-1, info=>'NODOM'); | |
586 | push @npf_domains, \%new_dom; | |
587 | } | |
588 | push @npf_domains, $curr_dom; | |
589 | $prev_dom = $curr_dom; | |
590 | } | |
591 | if ($seq_length - $prev_dom->{seq_end} > $min_nodom) { | |
592 | my %new_dom = (seq_start=>$prev_dom->{seq_end}+1, seq_end=>$seq_length, info=>'NODOM'); | |
593 | if ($new_dom{seq_end} > $new_dom{seq_start}) { | |
594 | push @npf_domains, \%new_dom; | |
595 | } | |
596 | } | |
597 | ||
598 | # @npf_domains has both old @pf_domains and new neg-domains | |
599 | @pf_domains = @npf_domains; | |
600 | } | |
601 | ||
602 | # now make sure we have useful names: colors | |
603 | ||
604 | for my $pf (@pf_domains) { | |
605 | $pf->{info} = domain_name($pf->{info}, $pf->{pfamA_acc}); | |
606 | } | |
607 | ||
608 | my @feats = (); | |
609 | for my $d_ref (@pf_domains) { | |
610 | if ($lav) { | |
611 | push @feats, [$d_ref->{seq_start}, $d_ref->{seq_end}, $d_ref->{info}]; | |
612 | } else { | |
613 | push @feats, [$d_ref->{seq_start}, '-', $d_ref->{seq_end}, $d_ref->{info} ]; | |
614 | # push @feats, [$d_ref->{seq_end}, ']', '-', ""]; | |
615 | } | |
616 | ||
617 | } | |
618 | ||
619 | return \@feats; | |
620 | } | |
621 | ||
622 | sub min { | |
623 | my ($arg1, $arg2) = @_; | |
624 | ||
625 | return ($arg1 <= $arg2 ? $arg1 : $arg2); | |
626 | } | |
627 | ||
628 | sub max { | |
629 | my ($arg1, $arg2) = @_; | |
630 | ||
631 | return ($arg1 >= $arg2 ? $arg1 : $arg2); | |
632 | } | |
633 | ||
634 | # domain name takes a uniprot domain label, removes comments ( ; | |
635 | # truncated) and numbers and returns a canonical form. Thus: | |
636 | # Cortactin 6. | |
637 | # Cortactin 7; truncated. | |
638 | # becomes "Cortactin" | |
639 | # | |
640 | ||
641 | sub domain_name { | |
642 | ||
643 | my ($value, $pfamA_acc) = @_; | |
644 | my $is_virtual = 0; | |
645 | ||
646 | if ($value =~ m/^@/) { | |
647 | $is_virtual = 1; | |
648 | $value =~ s/^@//; | |
649 | } | |
650 | ||
651 | # check for clan: | |
652 | if ($no_clans) { | |
653 | if (! defined($domains{$value})) { | |
654 | $domain_clan{$value} = 0; | |
655 | $domains{$value} = ++$domain_cnt; | |
656 | push @domain_list, $pfamA_acc; | |
657 | } | |
658 | } | |
659 | elsif (!defined($domain_clan{$value})) { | |
660 | ## only do this for new domains, old domains have known mappings | |
661 | ||
662 | ## ways to highlight the same domain: | |
663 | # (1) for clans, substitute clan name for family name | |
664 | # (2) for clans, use the same color for the same clan, but don't change the name | |
665 | # (3) for clans, combine family name with clan name, but use colors based on clan | |
666 | ||
667 | # check to see if it's a clan | |
668 | $get_pfam_clan->execute($pfamA_acc); | |
669 | ||
670 | my $pfam_clan_href=0; | |
671 | ||
672 | if ($pfam_clan_href=$get_pfam_clan->fetchrow_hashref()) { # is a clan | |
673 | my ($clan_id, $clan_acc) = @{$pfam_clan_href}{qw(clan_id clan_acc)}; | |
674 | ||
675 | # now check to see if we have seen this clan before (if so, do not increment $domain_cnt) | |
676 | my $c_value = "C." . $clan_id; | |
677 | if ($pf_acc) {$c_value = $clan_acc;} | |
678 | ||
679 | $domain_clan{$value} = {clan_id => $clan_id, | |
680 | clan_acc => $clan_acc}; | |
681 | ||
682 | if ($domains{$c_value}) { | |
683 | $domain_clan{$value}->{domain_cnt} = $domains{$c_value}; | |
684 | $value = $c_value; | |
685 | } | |
686 | else { | |
687 | $domain_clan{$value}->{domain_cnt} = ++ $domain_cnt; | |
688 | $value = $c_value; | |
689 | $domains{$value} = $domain_cnt; | |
690 | push @domain_list, $pfamA_acc; | |
691 | } | |
692 | } | |
693 | else { # not a clan | |
694 | $domain_clan{$value} = 0; | |
695 | $domains{$value} = ++$domain_cnt; | |
696 | push @domain_list, $pfamA_acc; | |
697 | } | |
698 | } | |
699 | elsif ($domain_clan{$value} && $domain_clan{$value}->{clan_acc}) { | |
700 | if ($pf_acc) {$value = $domain_clan{$value}->{clan_acc};} | |
701 | else { $value = "C." . $domain_clan{$value}->{clan_id}; } | |
702 | } | |
703 | ||
704 | if ($is_virtual) { | |
705 | $domains{'@'.$value} = $domains{$value}; | |
706 | $value = '@'.$value; | |
707 | } | |
708 | return $value; | |
709 | } | |
710 | ||
711 | sub domain_num { | |
712 | my ($value, $number) = @_; | |
713 | if ($value =~ m/^@/) { | |
714 | $value =~ s/^@/v/; | |
715 | $number = $number."v"; | |
716 | } | |
717 | return ($value, $number); | |
718 | } | |
719 | ||
720 | ||
721 | __END__ | |
722 | ||
723 | =pod | |
724 | ||
725 | =head1 NAME | |
726 | ||
727 | ann_pfam28.pl | |
728 | ||
729 | =head1 SYNOPSIS | |
730 | ||
731 | ann_pfam28.pl --neg-doms --vdoms 'sp|P09488|GSTM1_HUMAN' | accession.file | |
732 | ||
733 | =head1 OPTIONS | |
734 | ||
735 | -h short help | |
736 | --help include description | |
737 | --no-over : generate non-overlapping domains (equivalent to ann_pfam.pl) | |
738 | --split-over : overlaps of two domains generate a new hybrid domain | |
739 | --no-clans : do not use clans with multiple families from same clan | |
740 | --neg-doms : report domains between annotated domains as NODOM | |
741 | (also --neg, --neg_doms) | |
742 | --vdoms : produce "virtual domains" using model_start, | |
743 | model_end for partial pfam domains | |
744 | --min_nodom=10 : minimum length between domains for NODOM | |
745 | ||
746 | --host, --user, --password, --port, --db : info for mysql database | |
747 | ||
748 | =head1 DESCRIPTION | |
749 | ||
750 | C<ann_pfam28.pl> extracts domain information from the pfam mysql | |
751 | database. Currently, the program works with database | |
752 | sequence descriptions in several formats: | |
753 | ||
754 | >gi|1705556|sp|P54670.1|CAF1_DICDI | |
755 | >sp|P09488|GSTM1_HUMAN | |
756 | >sp:CALM_HUMAN | |
757 | ||
758 | C<ann_pfam28.pl> uses the C<pfamA_reg_full_significant>, C<pfamseq>, | |
759 | and C<pfamA> tables of the C<pfam> database to extract domain | |
760 | information on a protein. | |
761 | ||
762 | If the C<--no-over> option is set, overlapping domains are selected and | |
763 | edited to remove overlaps. For proteins with multiple overlapping | |
764 | domains (domains overlap by more than 1/3 of the domain length), | |
765 | C<ann_pfam28.pl> selects the domain annotation with the best | |
766 | C<domain_evalue_score>. When domains overlap by less than 1/3 of the | |
767 | domain length, they are shortened to remove the overlap. | |
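
The shortening of small overlaps can be sketched in a few lines. This
is a minimal sketch, not the script's own code: C<trim_small_overlaps>
and the two example domains are hypothetical, and it assumes the
domains are already sorted by C<seq_start> and that each overlap is
divided roughly in half between the two neighbors.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of ann_pfam28.pl): trim small overlaps
# between neighboring domains, which must be sorted by seq_start.
sub trim_small_overlaps {
    my @doms = map { +{ %$_ } } @_;    # shallow copies; inputs stay untouched
    for my $i (1 .. $#doms) {
        my ($left, $right) = @doms[$i - 1, $i];
        next unless $left->{seq_end} >= $right->{seq_start};
        my $overlap = $left->{seq_end} - $right->{seq_start};
        $left->{seq_end} -= int($overlap / 2);         # left domain gives up half
        $right->{seq_start} = $left->{seq_end} + 1;    # right domain starts just after
    }
    return @doms;
}

# Made-up example: two domains that overlap at residues 91-100.
my @trimmed = trim_small_overlaps(
    { info => 'A', seq_start => 1,  seq_end => 100 },
    { info => 'B', seq_start => 91, seq_end => 200 },
);
printf "%s\t%d\t%d\n", @{$_}{qw(info seq_start seq_end)} for @trimmed;
```

After trimming, the two example domains are adjacent rather than
overlapping, so each residue carries a single annotation.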
768 | ||
769 | If the C<--split-over> option is set and two domains overlap, the | |
770 | overlapping region is split out of the domains and labeled as a new, | |
771 | virtual-like, domain. If one domain lies inside another and spans more | |
772 | than 80% of it, the shorter domain is removed. | |
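
As an illustration of the splitting step, the following minimal sketch
turns two overlapping domains into three pieces. The helper name
C<split_overlap>, the C<$min_overlap> argument, and the example
coordinates are assumptions made for this example; it also assumes the
second domain starts after the first and ends beyond it.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of ann_pfam28.pl): split the overlap of two
# neighboring domains into a separate hybrid domain named "left/right".
sub split_overlap {
    my ($prev, $curr, $min_overlap) = @_;

    # overlap too short to split: return unchanged copies
    return (+{ %$prev }, +{ %$curr })
        unless $prev->{seq_end} >= $curr->{seq_start} + $min_overlap;

    my %left   = (%$prev, seq_end => $curr->{seq_start} - 1);
    my %hybrid = (
        seq_start => $curr->{seq_start},
        seq_end   => $prev->{seq_end},
        info      => "$prev->{info}/$curr->{info}",
    );
    my %right  = (%$curr, seq_start => $prev->{seq_end} + 1);
    return (\%left, \%hybrid, \%right);
}

# Made-up example: domains A (1-100) and B (81-200) share residues 81-100.
my @parts = split_overlap(
    { info => 'A', seq_start => 1,  seq_end => 100 },
    { info => 'B', seq_start => 81, seq_end => 200 },
    10,
);
printf "%s\t%d\t%d\n", @{$_}{qw(info seq_start seq_end)} for @parts;
```

The shared region is reported once, under a combined label, instead of
being assigned to either parent domain.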
773 | ||
774 | C<ann_pfam28.pl> is designed to be used by the B<FASTA> programs with | |
775 | the C<-V \!ann_pfam28.pl> or C<-V "\!ann_pfam28.pl --neg"> option. | |
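
With C<--neg>, stretches between annotated domains are reported as
C<NODOM>. The sketch below shows one way such gap filling can work;
C<add_nodom>, its argument order, and the example values are
hypothetical, and the domains are assumed to be sorted and
non-overlapping.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of ann_pfam28.pl): insert NODOM entries for
# unannotated stretches longer than $min_nodom residues.
sub add_nodom {
    my ($seq_length, $min_nodom, @doms) = @_;
    my @out;
    my $prev_end = 0;
    for my $dom (@doms) {
        if ($dom->{seq_start} - $prev_end > $min_nodom) {
            push @out, { seq_start => $prev_end + 1,
                         seq_end   => $dom->{seq_start} - 1,
                         info      => 'NODOM' };
        }
        push @out, $dom;
        $prev_end = $dom->{seq_end};
    }
    # unannotated tail after the last domain
    if ($seq_length - $prev_end > $min_nodom) {
        push @out, { seq_start => $prev_end + 1,
                     seq_end   => $seq_length,
                     info      => 'NODOM' };
    }
    return @out;
}

# Made-up example: a 300-residue protein with two annotated domains.
my @with_gaps = add_nodom(300, 10,
    { info => 'A', seq_start => 21,  seq_end => 100 },
    { info => 'B', seq_start => 105, seq_end => 250 },
);
printf "%s\t%d\t%d\n", @{$_}{qw(info seq_start seq_end)} for @with_gaps;
```

In this example the four-residue gap between the two domains is below
the threshold and is left unlabeled, while the leading and trailing
stretches each receive a C<NODOM> entry.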
776 | ||
777 | =head1 AUTHOR | |
778 | ||
779 | William R. Pearson, wrp@virginia.edu | |
780 | ||
781 | =cut |
0 | #!/usr/bin/perl -w | |
1 | ||
2 | ################################################################ | |
3 | # copyright (c) 2014,2015 by William R. Pearson and The Rector & | |
4 | # Visitors of the University of Virginia | |
5 | ################################################################ | |
6 | # Licensed under the Apache License, Version 2.0 (the "License"); | |
7 | # you may not use this file except in compliance with the License. | |
8 | # You may obtain a copy of the License at | |
9 | # | |
10 | # http://www.apache.org/licenses/LICENSE-2.0 | |
11 | # | |
12 | # Unless required by applicable law or agreed to in writing, | |
13 | # software distributed under this License is distributed on an "AS | |
14 | # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either | |
15 | # express or implied. See the License for the specific language | |
16 | # governing permissions and limitations under the License. | |
17 | ################################################################ | |
18 | ||
19 | # ann_pfam.pl gets an annotation file from fasta36 -V with a line of the form: | |
20 | ||
21 | # gi|62822551|sp|P00502|GSTA1_RAT Glutathione S-transfer\n (at least from pir1.lseg) | |
22 | # | |
23 | # it must: | |
24 | # (1) read in the line | |
25 | # (2) parse it to get the up_acc | |
26 | # (3) return the tab delimited features | |
27 | # | |
28 | ||
29 | # this is the first version that works with the new Pfam strategy of | |
30 | # separating Uniprot reference sequences from the rest of uniprot. as | |
31 | # a result, it is possible that 2 SQL queries will be required, one to | |
32 | # pfamA_reg_full_significant and a second to uniprot_reg_full. | |
33 | ||
34 | # modified 15-Jan-2017 to reduce the number of calls when the same | |
35 | # accession is present multiple times. Accessions are saved in a hash | |
36 | # that ensures uniqueness. (Could also speed things up by creating temporary table.) | |
37 | # | |
38 | ||
39 | ||
40 | use strict; | |
41 | ||
42 | use DBI; | |
43 | use Getopt::Long; | |
44 | use Pod::Usage; | |
45 | ||
46 | use vars qw($host $db $port $user $pass); | |
47 | ||
48 | my $hostname = `/bin/hostname`; | |
49 | ||
50 | ($host, $db, $port, $user, $pass) = ("wrpxdb.its.virginia.edu", "pfam31", 0, "web_user", "fasta_www"); | |
51 | #$host = 'xdb'; | |
52 | #$host = 'localhost'; | |
53 | #$db = 'RPD2_pfam28u'; | |
54 | ||
55 | my ($auto_reg,$rpd2_fams, $neg_doms, $vdoms, $lav, $no_doms, $no_clans, $pf_acc, $acc_comment, $bound_comment, $shelp, $help) = | |
56 | (0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0,); | |
57 | my ($no_over, $split_over, $over_fract) = (0, 0, 3.0); | |
58 | ||
59 | my ($color_sep_str, $show_color) = (" :",1); | |
60 | $color_sep_str = '~'; | |
61 | ||
62 | my ($min_nodom, $min_vdom) = (10,10); | |
63 | ||
64 | GetOptions( | |
65 | "host=s" => \$host, | |
66 | "db=s" => \$db, | |
67 | "user=s" => \$user, | |
68 | "password=s" => \$pass, | |
69 | "port=i" => \$port, | |
70 | "lav" => \$lav, | |
71 | "acc_comment" => \$acc_comment, | |
72 | "bound_comment" => \$bound_comment, | |
73 | "color!" => \$show_color, | |
74 | "no-over" => \$no_over, | |
75 | "no_over" => \$no_over, | |
76 | "split-over" => \$split_over, | |
77 | "split_over" => \$split_over, | |
78 | "over_fract=f" => \$over_fract, | |
79 | "over-fract=f" => \$over_fract, | |
80 | "no-clans" => \$no_clans, | |
81 | "no_clans" => \$no_clans, | |
82 | "neg" => \$neg_doms, | |
83 | "neg_doms" => \$neg_doms, | |
84 | "neg-doms" => \$neg_doms, | |
85 | "min_nodom=i" => \$min_nodom, | |
86 | "vdoms" => \$vdoms, | |
87 | "v_doms" => \$vdoms, | |
88 | "pfacc" => \$pf_acc, | |
89 | "RPD2" => \$rpd2_fams, | |
90 | "auto_reg" => \$auto_reg, | |
91 | "h|?" => \$shelp, | |
92 | "help" => \$help, | |
93 | ); | |
94 | ||
95 | pod2usage(1) if $shelp; | |
96 | pod2usage(exitstatus => 0, verbose => 2) if $help; | |
97 | pod2usage(1) unless (@ARGV || -p STDIN || -f STDIN); | |
98 | ||
99 | my $connect = "dbi:mysql(AutoCommit=>1,RaiseError=>1):database=$db"; | |
100 | $connect .= ";host=$host" if $host; | |
101 | $connect .= ";port=$port" if $port; | |
102 | ||
103 | my $dbh = DBI->connect($connect, | |
104 | $user, | |
105 | $pass | |
106 | ) or die $DBI::errstr; | |
107 | ||
108 | my %annot_types = (); | |
109 | my %domains = (NODOM=>0); | |
110 | my %domain_clan = (NODOM => {clan_id => 'NODOM', clan_acc=>0, domain_cnt=>0}); | |
111 | my @domain_list = (0); | |
112 | my $domain_cnt = 0; | |
113 | ||
114 | my $pfamA_reg_full = 'pfamA_reg_full_significant'; | |
115 | my $uniprot_reg_full = 'uniprot_reg_full'; | |
116 | ||
117 | my $get_annot_sub = \&get_pfam_annots; | |
118 | ||
119 | my @pfam_fields = qw(seq_start seq_end model_start model_end model_length pfamA_acc pfamA_id auto_pfamA_reg_full domain_evalue_score as evalue length); | |
120 | my @upfam_fields = qw(seq_start seq_end model_start model_end model_length pfamA_acc pfamA_id auto_uniprot_reg_full domain_evalue_score as evalue length); | |
121 | ||
122 | my $get_pfam_acc = $dbh->prepare(<<EOSQL); | |
123 | SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
124 | FROM pfamseq | |
125 | JOIN pfamA_reg_full_significant using(pfamseq_acc) | |
126 | JOIN pfamA USING (pfamA_acc) | |
127 | WHERE in_full = 1 | |
128 | AND pfamseq_acc=? | |
129 | ORDER BY seq_start | |
130 | ||
131 | EOSQL | |
132 | ||
133 | my $get_upfam_acc = $dbh->prepare(<<EOSQL); | |
134 | SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_uniprot_reg_full as auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
135 | FROM uniprot | |
136 | JOIN uniprot_reg_full using(uniprot_acc) | |
137 | JOIN pfamA USING (pfamA_acc) | |
138 | WHERE in_full = 1 | |
139 | AND uniprot_acc=? | |
140 | ORDER BY seq_start | |
141 | ||
142 | EOSQL | |
143 | ||
144 | my $get_pfam_refacc = $dbh->prepare(<<EOSQL); | |
145 | SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
146 | FROM $pfamA_reg_full | |
147 | JOIN pfamseq using(pfamseq_acc) | |
148 | JOIN pfamA USING (pfamA_acc) | |
149 | JOIN uniprot.refseq2up as rf2up on(rf2up.up_acc=pfamseq_acc) | |
150 | WHERE in_full = 1 | |
151 | AND rf2up.refseq_acc=? | |
152 | ORDER BY seq_start | |
153 | ||
154 | EOSQL | |
155 | ||
156 | my $get_upfam_refacc = $dbh->prepare(<<EOSQL); | |
157 | SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_uniprot_reg_full as auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
158 | FROM uniprot | |
159 | JOIN uniprot_reg_full using(uniprot_acc) | |
160 | JOIN pfamA USING (pfamA_acc) | |
161 | JOIN uniprot.refseq2up as rf2up on(rf2up.up_acc=uniprot_acc) | |
162 | WHERE in_full = 1 | |
163 | AND refseq_acc=? | |
164 | ORDER BY seq_start | |
165 | ||
166 | EOSQL | |
167 | ||
168 | my $get_annots_sql = $get_pfam_acc; | |
169 | ||
170 | my $get_pfam_id = $dbh->prepare(<<EOSQL); | |
171 | SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
172 | FROM pfamseq | |
173 | JOIN $pfamA_reg_full using(pfamseq_acc) | |
174 | JOIN pfamA USING (pfamA_acc) | |
175 | WHERE in_full=1 | |
176 | AND pfamseq_id=? | |
177 | ORDER BY seq_start | |
178 | ||
179 | EOSQL | |
180 | ||
181 | my $get_upfam_id = $dbh->prepare(<<EOSQL); | |
182 | SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_uniprot_reg_full as auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
183 | FROM uniprot | |
184 | JOIN uniprot_reg_full using(uniprot_acc) | |
185 | JOIN pfamA USING (pfamA_acc) | |
186 | WHERE in_full=1 | |
187 | AND uniprot_id=? | |
188 | ORDER BY seq_start | |
189 | ||
190 | EOSQL | |
191 | ||
192 | my $get_pfam_clan = $dbh->prepare(<<EOSQL); | |
193 | ||
194 | SELECT clan_acc, clan_id | |
195 | FROM clan | |
196 | JOIN clan_membership using(clan_acc) | |
197 | WHERE pfamA_acc=? | |
198 | ||
199 | EOSQL | |
200 | ||
201 | my $get_rpd2_clans = $dbh->prepare(<<EOSQL); | |
202 | ||
203 | SELECT auto_pfamA, clan | |
204 | FROM ljm_db.RPD2_final_fams | |
205 | WHERE clan is not NULL | |
206 | ||
207 | EOSQL | |
208 | ||
209 | # -- LEFT JOIN clan_membership USING (auto_pfamA) | |
210 | # -- LEFT JOIN clans using(auto_clan) | |
211 | ||
212 | my ($tmp, $gi, $sdb, $acc, $id, $use_acc); | |
213 | ||
214 | # get the query | |
215 | my ($query, $seq_len) = @ARGV; | |
216 | $seq_len = 0 unless defined($seq_len); | |
217 | ||
218 | $query =~ s/^>// if ($query); | |
219 | ||
220 | my @annots = (); | |
221 | my %annot_set = (); | |
222 | ||
223 | my %rpd2_clan_fams = (); | |
224 | ||
225 | if ($rpd2_fams) { | |
226 | $get_rpd2_clans->execute(); | |
227 | my ($auto_pfam, $auto_clan); | |
228 | while (($auto_pfam, $auto_clan)=$get_rpd2_clans->fetchrow_array()) { | |
229 | $rpd2_clan_fams{$auto_pfam} = $auto_clan; | |
230 | } | |
231 | } | |
232 | ||
233 | #if it's a file I can open, read and parse it | |
234 | unless ($query && ($query =~ m/[\|:]/ || | |
235 | $query =~ m/^[NX]P_/ || | |
236 | $query =~ m/^[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}\s/)) { | |
237 | ||
238 | while (my $a_line = <>) { | |
239 | $a_line =~ s/^>//; | |
240 | chomp $a_line; | |
241 | push @annots, show_annots($a_line, $get_annot_sub); | |
242 | } | |
243 | } | |
244 | else { | |
245 | push @annots, show_annots("$query\t$seq_len", $get_annot_sub); | |
246 | } | |
247 | ||
248 | for my $seq_annot (@annots) { | |
249 | next unless $seq_annot; | |
250 | my $annot_r = $annot_set{$seq_annot}; | |
251 | print ">",$annot_r->{seq_info},"\n"; | |
252 | for my $annot (@{$annot_r->{list}}) { | |
253 | if (!$lav && defined($domains{$annot->[-1]})) { | |
254 | my ($a_name, $a_num) = domain_num($annot->[-1],$domains{$annot->[-1]}); | |
255 | $annot->[-1] = $a_name; | |
256 | my $tmp_a_num = $a_num; | |
257 | $tmp_a_num =~ s/v$//; | |
258 | if ($acc_comment) { | |
259 | $annot->[-1] .= "{$domain_list[$tmp_a_num]}"; | |
260 | } | |
261 | if ($bound_comment) { | |
262 | $annot->[-1] .= $color_sep_str.$annot->[0].":".$annot->[2]; | |
263 | } | |
264 | elsif ($show_color) { | |
265 | $annot->[-1] .= $color_sep_str.$a_num; | |
266 | } | |
267 | } | |
268 | print join("\t",@$annot),"\n"; | |
269 | } | |
270 | } | |
271 | ||
272 | exit(0); | |
273 | ||
274 | sub show_annots { | |
275 | my ($query_len, $get_annot_sub) = @_; | |
276 | ||
277 | my ($annot_line, $seq_len) = split(/\t/,$query_len); | |
278 | ||
279 | my $pfamA_acc; | |
280 | ||
281 | $use_acc = 1; | |
282 | $get_annots_sql = $get_pfam_acc; | |
283 | ||
284 | my $get_annots_sql_u = $get_upfam_acc; | |
285 | ||
286 | if ($annot_line =~ m/^pf\d+\|/) { | |
287 | ($sdb, $gi, $pfamA_acc, $acc, $id) = split(/\|/,$annot_line); | |
288 | # $dbh->do("use RPD2_pfam"); | |
289 | } | |
290 | elsif ($annot_line =~ m/^gi\|/) { | |
291 | ($tmp, $gi, $sdb, $acc, $id) = split(/\|/,$annot_line); | |
292 | if ($sdb =~ m/ref/) { | |
293 | $get_annots_sql = $get_pfam_refacc; | |
294 | $get_annots_sql_u = $get_upfam_refacc; | |
295 | } | |
296 | } | |
297 | elsif ($annot_line =~ m/^(sp|tr|up)\|/) { | |
298 | ($sdb, $acc, $id) = split(/\|/,$annot_line); | |
299 | } | |
300 | elsif ($annot_line =~ m/^ref\|/) { | |
301 | ($sdb, $acc) = split(/\|/,$annot_line); | |
302 | $get_annots_sql = $get_pfam_refacc; | |
303 | $get_annots_sql_u = $get_upfam_refacc; | |
304 | } | |
305 | elsif ($annot_line =~ m/^(SP|TR):/i) { | |
306 | ($sdb, $id) = split(/:/,$annot_line); | |
307 | $use_acc = 0; | |
308 | } | |
309 | elsif ($annot_line !~ m/\|/ && $annot_line !~ m/:/) { | |
310 | $use_acc = 1; | |
311 | ($acc) = split(/\s+/,$annot_line); | |
312 | } | |
313 | # deal with no-database SwissProt/NR | |
314 | else { | |
315 | ($acc)=($annot_line =~ /^(\S+)/); | |
316 | } | |
317 | ||
318 | # here we have an $acc or an $id: check to see if we have the data | |
319 | ||
320 | my %annot_data = (seq_info=>$annot_line); | |
321 | my $annot_key = ''; | |
322 | unless ($use_acc) { | |
323 | return "" if ($annot_set{$id}); # already seen; avoid leaving the sub via "next" | |
324 | $annot_set{$id} = \%annot_data; | |
325 | $annot_key = $id; | |
326 | ||
327 | $get_annots_sql = $get_pfam_id; | |
328 | $get_annots_sql->execute($id); | |
329 | unless ($get_annots_sql->rows()) { | |
330 | $get_annots_sql = $get_upfam_id; # fall back to the uniprot query keyed on id | |
331 | $get_annots_sql->execute($id); | |
332 | } | |
333 | } else { | |
334 | unless ($acc) { | |
335 | warn "missing acc in $annot_line"; | |
336 | return ""; | |
337 | } | |
338 | else { | |
339 | $acc =~ s/\.\d+$//; | |
340 | ||
341 | $annot_key = $acc; | |
342 | if ($annot_set{$acc}) { | |
343 | goto ret_label; | |
344 | } | |
345 | $annot_set{$acc} = \%annot_data; | |
346 | ||
347 | $get_annots_sql->execute($acc); | |
348 | unless ($get_annots_sql->rows()) { | |
349 | $get_annots_sql = $get_annots_sql_u; | |
350 | $get_annots_sql->execute($acc); | |
351 | } | |
352 | } | |
353 | } | |
354 | ||
355 | $annot_data{list} = $get_annot_sub->($get_annots_sql, $seq_len); | |
356 | ||
357 | ret_label: | |
358 | return $annot_key; | |
359 | } | |
360 | ||
361 | sub get_pfam_annots { | |
362 | my ($get_annots, $seq_length) = @_; | |
363 | ||
364 | $seq_length = 0 unless $seq_length; | |
365 | ||
366 | my @pf_domains = (); | |
367 | ||
368 | # get the list of domains, sorted by start | |
369 | ||
370 | # $row_href has: seq_start, seq_end, model_start, model_end, model_length, | |
371 | # pfamA_acc, pfamA_id, auto_pfamA_reg_full, | |
372 | # domain_evalue_score as evalue, length | |
373 | ||
374 | while ( my $row_href = $get_annots->fetchrow_hashref()) { | |
375 | if ($auto_reg) { | |
376 | $row_href->{info} = $row_href->{auto_pfamA_reg_full}; | |
377 | } elsif ($pf_acc) { | |
378 | $row_href->{info} = $row_href->{pfamA_acc}; | |
379 | } else { | |
380 | $row_href->{info} = $row_href->{pfamA_id}; | |
381 | } | |
382 | ||
383 | if ($row_href && $row_href->{length} > $seq_length && $seq_length == 0) { | |
384 | $seq_length = $row_href->{length}; | |
385 | } | |
386 | ||
387 | next if ($row_href->{seq_start} >= $seq_length); | |
388 | if ($row_href->{seq_end} > $seq_length) { | |
389 | $row_href->{seq_end} = $seq_length; | |
390 | } | |
391 | ||
392 | push @pf_domains, $row_href | |
393 | } | |
394 | ||
395 | # before checking for domain overlap, check for "split-domains" | |
396 | # (self-unbound) by looking for runs of the same domain that are | |
397 | # ordered by model_start | |
398 | ||
399 | if (scalar(@pf_domains) > 1) { | |
400 | my @j_domains; #joined domains | |
401 | my @tmp_domains = @pf_domains; | |
402 | ||
403 | my $prev_dom = shift(@tmp_domains); | |
404 | ||
405 | for my $curr_dom (@tmp_domains) { | |
406 | # to join domains: | |
407 | # (1) the domains must be in order by model_start/end coordinates | |
408 | # (2) joining the domains cannot make the total combination too long | |
409 | ||
410 | # check for model and sequence consistency | |
411 | if (($prev_dom->{pfamA_acc} eq $curr_dom->{pfamA_acc}) # same family | |
412 | && $prev_dom->{model_start} < $curr_dom->{model_start} # model check | |
413 | && $prev_dom->{model_end} < $curr_dom->{model_end} | |
414 | ||
415 | && ($curr_dom->{model_start} > $prev_dom->{model_end} * 0.80 # limit overlap | |
416 | || $curr_dom->{model_start} < $prev_dom->{model_end} * 1.25) | |
417 | && ((($curr_dom->{model_end} - $curr_dom->{model_start}+1)/$curr_dom->{model_length} + | |
418 | ($prev_dom->{model_end} - $prev_dom->{model_start}+1)/$prev_dom->{model_length}) < 1.33) | |
419 | ) { # join them by updating $prev_dom | |
420 | $prev_dom->{seq_end} = $curr_dom->{seq_end}; | |
421 | $prev_dom->{model_end} = $curr_dom->{model_end}; | |
422 | $prev_dom->{auto_pfamA_reg_full} = $prev_dom->{auto_pfamA_reg_full} . ";". $curr_dom->{auto_pfamA_reg_full}; | |
423 | $prev_dom->{evalue} = ($prev_dom->{evalue} < $curr_dom->{evalue} ? $prev_dom->{evalue} : $curr_dom->{evalue}); | |
424 | } else { | |
425 | push @j_domains, $prev_dom; | |
426 | $prev_dom = $curr_dom; | |
427 | } | |
428 | } | |
429 | push @j_domains, $prev_dom; | |
430 | @pf_domains = @j_domains; | |
431 | ||
432 | ||
433 | if ($no_over) { # for either $no_over or $split_over, check for overlapping domains and edit/split them | |
434 | ||
435 | my @tmp_domains = @pf_domains; # allow shifts from copy of @pf_domains | |
436 | my @save_domains = (); # where the new domains go | |
437 | ||
438 | my $prev_dom = shift @tmp_domains; | |
439 | ||
440 | while (my $curr_dom = shift @tmp_domains) { | |
441 | ||
442 | my @overlap_domains = ($prev_dom); | |
443 | ||
444 | my $diff = $prev_dom->{seq_end} - $curr_dom->{seq_start}; | |
445 | ||
446 | my ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, | |
447 | $curr_dom->{seq_end}-$curr_dom->{seq_start}+1); | |
448 | ||
449 | my $inclusion = ((($curr_dom->{seq_start} >= $prev_dom->{seq_start}) # start is right && end is left | |
450 | && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})) || # -- curr inside prev | |
451 | (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) # start is left && end is right | |
452 | && ($curr_dom->{seq_end} >= $prev_dom->{seq_end}))); # -- prev is inside curr | |
453 | ||
454 | my $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len; | |
455 | ||
456 | # check for overlap > domain_length/$over_fract | |
457 | while ($inclusion || ($diff > 0 && $diff > $longer_len/$over_fract)) { | |
458 | push @overlap_domains, $curr_dom; | |
459 | $curr_dom = shift @tmp_domains; | |
460 | last unless $curr_dom; | |
461 | $diff = $prev_dom->{seq_end} - $curr_dom->{seq_start}; | |
462 | ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, $curr_dom->{seq_end}-$curr_dom->{seq_start}+1); | |
463 | $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len; | |
464 | $inclusion = ((($curr_dom->{seq_start} >= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})) || | |
465 | (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} >= $prev_dom->{seq_end}))); | |
466 | } | |
467 | ||
468 | # check for overlapping domains; >1 because $prev_dom is always there | |
469 | if (scalar(@overlap_domains) > 1 ) { | |
470 | # if $rpd2_fams, check for a chosen one | |
471 | ||
472 | for my $dom ( @overlap_domains) { | |
473 | $dom->{evalue} = 1.0 unless defined($dom->{evalue}); | |
474 | } | |
475 | ||
476 | @overlap_domains = sort { $a->{evalue} <=> $b->{evalue} } @overlap_domains; | |
477 | $prev_dom = $overlap_domains[0]; | |
478 | } | |
479 | ||
480 | # $prev_dom should be the best of the overlaps, and we are no longer overlapping > dom_length/3 | |
481 | push @save_domains, $prev_dom; | |
482 | $prev_dom = $curr_dom; | |
483 | } | |
484 | ||
485 | if ($prev_dom) { | |
486 | push @save_domains, $prev_dom; | |
487 | } | |
488 | ||
489 | @pf_domains = @save_domains; | |
490 | ||
491 | # now check for smaller overlaps | |
492 | for (my $i=1; $i < scalar(@pf_domains); $i++) { | |
493 | if ($pf_domains[$i-1]->{seq_end} >= $pf_domains[$i]->{seq_start}) { | |
494 | my $overlap = $pf_domains[$i-1]->{seq_end} - $pf_domains[$i]->{seq_start}; | |
495 | $pf_domains[$i-1]->{seq_end} -= int($overlap/2); | |
496 | $pf_domains[$i]->{seq_start} = $pf_domains[$i-1]->{seq_end}+1; | |
497 | } | |
498 | } | |
499 | } | |
500 | elsif ($split_over) { # here, everything that overlaps by > $min_vdom should be split into a separate domain | |
501 | my @save_domains = (); # where the new domains go | |
502 | ||
503 | # check to see if one domain is included (or overlapping) more | |
504 | # than xx% of the other. If so, pick the longer one | |
505 | ||
506 | my ($prev_dom, $curr_dom) = ($pf_domains[0],0) ; | |
507 | for (my $i=1; $i < scalar(@pf_domains); $i++) { | |
508 | $curr_dom = $pf_domains[$i]; | |
509 | ||
510 | my ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, $curr_dom->{seq_end}-$curr_dom->{seq_start}+1); | |
511 | my $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len; | |
512 | ||
513 | if (($curr_dom->{seq_start} >= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} <= $prev_dom->{seq_end}) | |
514 | && $cur_len / $prev_len > 0.80) { | |
515 | # $prev_dom stays the same, $curr_dom deleted | |
516 | next; | |
517 | } | |
518 | elsif (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} >= $prev_dom->{seq_end}) | |
519 | && $prev_len / $cur_len > 0.80) { | |
520 | $prev_dom = $curr_dom; # this should delete $prev_dom | |
521 | next; | |
522 | } | |
523 | ||
524 | if ($prev_dom->{seq_end} >= $curr_dom->{seq_start} + $min_vdom) { | |
525 | my ($l_seq_end, $r_seq_start) = ($curr_dom->{seq_start}-1, $prev_dom->{seq_end}+1); | |
526 | ||
527 | $prev_dom->{seq_end} = $l_seq_end; | |
528 | push @save_domains, $prev_dom; | |
529 | my $new_dom = {seq_start => $l_seq_end+1, seq_end=>$r_seq_start-1, | |
530 | model_length => -1, | |
531 | pfamA_acc=>$prev_dom->{pfamA_acc}."/".$curr_dom->{pfamA_acc}, | |
532 | pfamA_id=>$prev_dom->{pfamA_id}."/".$curr_dom->{pfamA_id}, | |
533 | }; | |
534 | ||
535 | if ($pf_acc) { | |
536 | $new_dom->{info} = $new_dom->{pfamA_acc}; | |
537 | } | |
538 | else { | |
539 | $new_dom->{info} = $new_dom->{pfamA_id}; | |
540 | } | |
541 | ||
542 | push @save_domains, $new_dom; | |
543 | $curr_dom->{seq_start} = $r_seq_start; | |
544 | $prev_dom = $curr_dom; | |
545 | } | |
546 | else { | |
547 | push @save_domains, $prev_dom; | |
548 | $prev_dom = $curr_dom; | |
549 | } | |
550 | } | |
551 | push @save_domains, $prev_dom; | |
552 | @pf_domains = @save_domains; | |
553 | } | |
554 | } | |
555 | ||
556 | # $vdoms -- virtual Pfam domains -- the equivalent of $neg_doms, | |
557 | # but covering parts of a Pfam model that are not annotated. split | |
558 | # domains have been joined, so simply check beginning and end of | |
559 | # each domain (but must also check for bounded-ness) | |
560 | # only add when 10% or more is missing and missing length > $min_nodom | |
561 | ||
562 | if ($vdoms && scalar(@pf_domains)) { | |
563 | my @vpf_domains; | |
564 | ||
565 | my $curr_dom = $pf_domains[0]; | |
566 | my $length = $curr_dom->{length}; | |
567 | ||
568 | my $prev_dom={seq_end=>0, pfamA_acc=>''}; | |
569 | my $prev_dom_end = 0; | |
570 | my $next_dom_start = $length+1; | |
571 | ||
572 | for (my $dom_ix=0; $dom_ix < scalar(@pf_domains); $dom_ix++ ) { | |
573 | $curr_dom = $pf_domains[$dom_ix]; | |
574 | ||
575 | my $pfamA = $curr_dom->{pfamA_acc}; | |
576 | ||
577 | # first, look left, is there a domain there (if there is, | |
578 | # it should be updated right | |
579 | ||
580 | # my $min_vdom = $curr_dom->{model_length} / 10; | |
581 | ||
582 | if ($curr_dom->{model_length} < $min_vdom) { | |
583 | push @vpf_domains, $curr_dom; | |
584 | next; | |
585 | } | |
586 | if ($prev_dom->{pfamA_acc}) { # look for previous domain | |
587 | $prev_dom_end = $prev_dom->{seq_end}; | |
588 | } | |
589 | ||
590 | # there is a domain to the left, how much room is available? | |
591 | my $left_dom_len = min($curr_dom->{seq_start}-$prev_dom_end-1, $curr_dom->{model_start}-1); | |
592 | if ( $left_dom_len > $min_vdom) { | |
593 | # there is room for a virtual domain | |
594 | my %new_dom = (seq_start=> $curr_dom->{seq_start}-$left_dom_len, | |
595 | seq_end => $curr_dom->{seq_start}-1, | |
596 | info=>'@'.$curr_dom->{info}, | |
597 | model_length=>$curr_dom->{model_length}, | |
598 | model_end => $curr_dom->{model_start}-1, | |
599 | model_start => $left_dom_len, | |
600 | pfamA_acc=>$pfamA, | |
601 | ); | |
602 | push @vpf_domains, \%new_dom; | |
603 | } | |
604 | ||
605 | # save the current domain | |
606 | push @vpf_domains, $curr_dom; | |
607 | $prev_dom = $curr_dom; | |
608 | ||
609 | if ($dom_ix < $#pf_domains) { # there is a domain to the right | |
610 | # first, give all the extra space to the first domain (no splitting) | |
611 | $next_dom_start = $pf_domains[$dom_ix+1]->{seq_start}; | |
612 | } | |
613 | else { | |
614 | $next_dom_start = $length; | |
615 | } | |
616 | ||
617 | # is there room for a virtual domain right | |
618 | ||
619 | my $right_dom_len = min($next_dom_start-$curr_dom->{seq_end}-1, # space available | |
620 | $curr_dom->{model_length}-$curr_dom->{model_end} # space needed | |
621 | ); | |
622 | if ( $right_dom_len > $min_vdom) { | |
623 | my %new_dom = (seq_start=> $curr_dom->{seq_end}+1, | |
624 | seq_end=> $curr_dom->{seq_end}+$right_dom_len, | |
625 | info=>'@'.$curr_dom->{info}, | |
626 | model_length => $curr_dom->{model_length}, | |
627 | pfamA_acc=> $pfamA, | |
628 | ); | |
629 | push @vpf_domains, \%new_dom; | |
630 | $prev_dom = \%new_dom; | |
631 | } | |
632 | } # all done, check for last one | |
633 | ||
634 | # $curr_dom=$pf_domains[-1]; | |
635 | # # my $min_vdom = $curr_dom->{model_length}/10; | |
636 | ||
637 | # my $right_dom_len = min($length - $curr_dom->{seq_end}+1, # space available | |
638 | # $curr_dom->{model_length}-$curr_dom->{model_end} # space needed | |
639 | # ); | |
640 | # if ($right_dom_len > $min_vdom) { | |
641 | # my %new_dom = (seq_start=> $curr_dom->{seq_end}+1, | |
642 | # seq_end => $curr_dom->{seq_end}+$right_dom_len, | |
643 | # info=>'@'.$curr_dom->{pfamA_acc}, | |
644 | # model_len=> $curr_dom->{model_len}, | |
645 | # pfamA_acc => $curr_dom->{pfamA_acc}, | |
646 | # model_start => $curr_dom->{model_end}+1, | |
647 | # model_end => $curr_dom->{model_len}, | |
648 | # ); | |
649 | ||
650 | # push @vpf_domains, \%new_dom; | |
651 | # } | |
652 | ||
653 | # @vpf_domains has both old @pf_domains and new virtual domains | |
654 | @pf_domains = @vpf_domains; | |
655 | } | |
656 | ||
657 | if ($neg_doms) { | |
658 | my @npf_domains; | |
659 | my $prev_dom={seq_end=>0}; | |
660 | for my $curr_dom ( @pf_domains) { | |
661 | if ($curr_dom->{seq_start} - $prev_dom->{seq_end} > $min_nodom) { | |
662 | my %new_dom = (seq_start=>$prev_dom->{seq_end}+1, seq_end => $curr_dom->{seq_start}-1, info=>'NODOM'); | |
663 | push @npf_domains, \%new_dom; | |
664 | } | |
665 | push @npf_domains, $curr_dom; | |
666 | $prev_dom = $curr_dom; | |
667 | } | |
668 | if ($seq_length - $prev_dom->{seq_end} > $min_nodom) { | |
669 | my %new_dom = (seq_start=>$prev_dom->{seq_end}+1, seq_end=>$seq_length, info=>'NODOM'); | |
670 | if ($new_dom{seq_end} > $new_dom{seq_start}) { | |
671 | push @npf_domains, \%new_dom; | |
672 | } | |
673 | } | |
674 | ||
675 | # @npf_domains has both old @pf_domains and new neg-domains | |
676 | @pf_domains = @npf_domains; | |
677 | } | |
678 | ||
679 | # now make sure we have useful names: colors | |
680 | ||
681 | for my $pf (@pf_domains) { | |
682 | $pf->{info} = domain_name($pf->{info}, $pf->{pfamA_acc}); | |
683 | } | |
684 | ||
685 | my @feats = (); | |
686 | for my $d_ref (@pf_domains) { | |
687 | if ($lav) { | |
688 | push @feats, [$d_ref->{seq_start}, $d_ref->{seq_end}, $d_ref->{info}]; | |
689 | } else { | |
690 | push @feats, [$d_ref->{seq_start}, '-', $d_ref->{seq_end}, $d_ref->{info} ]; | |
691 | # push @feats, [$d_ref->{seq_end}, ']', '-', ""]; | |
692 | } | |
693 | ||
694 | } | |
695 | ||
696 | return \@feats; | |
697 | } | |
698 | ||
699 | sub min { | |
700 | my ($arg1, $arg2) = @_; | |
701 | ||
702 | return ($arg1 <= $arg2 ? $arg1 : $arg2); | |
703 | } | |
704 | ||
705 | sub max { | |
706 | my ($arg1, $arg2) = @_; | |
707 | ||
708 | return ($arg1 >= $arg2 ? $arg1 : $arg2); | |
709 | } | |
710 | ||
711 | # domain name takes a uniprot domain label, removes comments ( ; | |
712 | # truncated) and numbers and returns a canonical form. Thus: | |
713 | # Cortactin 6. | |
714 | # Cortactin 7; truncated. | |
715 | # becomes "Cortactin" | |
716 | # | |
717 | ||
718 | sub domain_name { | |
719 | ||
720 | my ($value, $pfamA_acc) = @_; | |
721 | my $is_virtual = 0; | |
722 | ||
723 | if ($value =~ m/^@/) { | |
724 | $is_virtual = 1; | |
725 | $value =~ s/^@//; | |
726 | } | |
727 | ||
728 | # check for clan: | |
729 | if ($no_clans) { | |
730 | if (! defined($domains{$value})) { | |
731 | $domain_clan{$value} = 0; | |
732 | $domains{$value} = ++$domain_cnt; | |
733 | push @domain_list, $pfamA_acc; | |
734 | } | |
735 | } | |
736 | elsif (!defined($domain_clan{$value})) { | |
737 | ## only do this for new domains, old domains have known mappings | |
738 | ||
739 | ## ways to highlight the same domain: | |
740 | # (1) for clans, substitute clan name for family name | |
741 | # (2) for clans, use the same color for the same clan, but don't change the name | |
742 | # (3) for clans, combine family name with clan name, but use colors based on clan | |
743 | ||
744 | # check to see if it's a clan | |
745 | $get_pfam_clan->execute($pfamA_acc); | |
746 | ||
747 | my $pfam_clan_href=0; | |
748 | ||
749 | if ($pfam_clan_href=$get_pfam_clan->fetchrow_hashref()) { # is a clan | |
750 | my ($clan_id, $clan_acc) = @{$pfam_clan_href}{qw(clan_id clan_acc)}; | |
751 | ||
752 | # now check to see if we have seen this clan before (if so, do not increment $domain_cnt) | |
753 | my $c_value = "C." . $clan_id; | |
754 | if ($pf_acc) {$c_value = $clan_acc;} | |
755 | ||
756 | $domain_clan{$value} = {clan_id => $clan_id, | |
757 | clan_acc => $clan_acc}; | |
758 | ||
759 | if ($domains{$c_value}) { | |
760 | $domain_clan{$value}->{domain_cnt} = $domains{$c_value}; | |
761 | $value = $c_value; | |
762 | } | |
763 | else { | |
764 | $domain_clan{$value}->{domain_cnt} = ++ $domain_cnt; | |
765 | $value = $c_value; | |
766 | $domains{$value} = $domain_cnt; | |
767 | push @domain_list, $pfamA_acc; | |
768 | } | |
769 | } | |
770 | else { # not a clan | |
771 | $domain_clan{$value} = 0; | |
772 | $domains{$value} = ++$domain_cnt; | |
773 | push @domain_list, $pfamA_acc; | |
774 | } | |
775 | } | |
776 | elsif ($domain_clan{$value} && $domain_clan{$value}->{clan_acc}) { | |
777 | if ($pf_acc) {$value = $domain_clan{$value}->{clan_acc};} | |
778 | else { $value = "C." . $domain_clan{$value}->{clan_id}; } | |
779 | } | |
780 | ||
781 | if ($is_virtual) { | |
782 | $domains{'@'.$value} = $domains{$value}; | |
783 | $value = '@'.$value; | |
784 | } | |
785 | return $value; | |
786 | } | |
787 | ||
788 | sub domain_num { | |
789 | my ($value, $number) = @_; | |
790 | if ($value =~ m/^@/) { | |
791 | $value =~ s/^@/v/; | |
792 | $number = $number."v"; | |
793 | } | |
794 | return ($value, $number); | |
795 | } | |
796 | ||
797 | ||
798 | __END__ | |
799 | ||
800 | =pod | |
801 | ||
802 | =head1 NAME | |
803 | ||
804 | ann_pfam30.pl | |
805 | ||
806 | =head1 SYNOPSIS | |
807 | ||
808 | ann_pfam30.pl --neg-doms --vdoms 'sp|P09488|GSTM1_HUMAN' | accession.file | |
809 | ||
810 | =head1 OPTIONS | |
811 | ||
812 | -h short help | |
813 | --help include description | |
814 | --no-over : generate non-overlapping domains (equivalent to ann_pfam.pl) | |
815 | --split-over : overlaps of two domains generate a new hybrid domain | |
816 | --no-clans : do not use clans with multiple families from same clan | |
817 | --neg-doms : report domains between annotated domains as NODOM | |
818 | (also --neg, --neg_doms) | |
819 | --vdoms : produce "virtual domains" using model_start, | |
820 | model_end for partial pfam domains | |
821 | --min_nodom=10 : minimum length between domains for NODOM | |
822 | ||
823 | --host, --user, --password, --port --db : info for mysql database | |
824 | ||
825 | =head1 DESCRIPTION | |
826 | ||
827 | C<ann_pfam30.pl> extracts domain information from the pfam mysql | |
828 | database. Currently, the program works with database | |
829 | sequence descriptions in several formats: | |
830 | ||
831 | >gi|1705556|sp|P54670.1|CAF1_DICDI | |
832 | >sp|P09488|GSTM1_HUMAN | |
833 | >sp:CALM_HUMAN | |
834 | ||
835 | C<ann_pfam30.pl> uses the C<pfamA_reg_full_significant>, C<pfamseq>, | |
836 | and C<pfamA> tables of the C<pfam> database to extract domain | |
837 | information on a protein. | |
838 | ||
839 | If the C<--no-over> option is set, overlapping domains are selected and | |
840 | edited to remove overlaps. For proteins with multiple overlapping | |
841 | domains (domains overlap by more than 1/3 of the domain length), | |
842 | C<ann_pfam30.pl> selects the domain annotation with the best | |
843 | C<domain_evalue_score>. When domains overlap by less than 1/3 of the | |
844 | domain length, they are shortened to remove the overlap. | |
845 | ||
846 | If the C<--split-over> option is set, if two domains overlap, the | |
847 | overlapping region is split out of the domains and labeled as a new, | |
848 | virtual-like, domain. If one domain is internal to another and spans | |
849 | 80% of the domain, the shorter domain is removed. | |
850 | ||
851 | C<ann_pfam30.pl> is designed to be used by the B<FASTA> programs with | |
852 | the C<-V \!ann_pfam30.pl> or C<-V "\!ann_pfam30.pl --neg"> option. | |
853 | ||
854 | =head1 AUTHOR | |
855 | ||
856 | William R. Pearson, wrp@virginia.edu | |
857 | ||
858 | =cut |
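The POD above says that under `--no-over`, domains with a small overlap are shortened to remove it. A minimal standalone sketch of that trimming step follows. It mirrors the "now check for smaller overlaps" loop in `get_pfam_annots`, but `trim_overlaps` is a hypothetical helper name that does not appear in the script, and the input is assumed to be already sorted by `seq_start`.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of ann_pfam30.pl): trim small overlaps
# between adjacent domains. Input must already be sorted by seq_start.
sub trim_overlaps {
    my @doms = map { +{ %$_ } } @_;    # shallow copies; caller's hashes are untouched
    for my $i (1 .. $#doms) {
        my ($left, $right) = @doms[$i - 1, $i];
        next if $left->{seq_end} < $right->{seq_start};    # no overlap
        my $overlap = $left->{seq_end} - $right->{seq_start};
        $left->{seq_end}    -= int($overlap / 2);      # left domain gives up half
        $right->{seq_start}  = $left->{seq_end} + 1;   # right domain starts just after
    }
    return @doms;
}

my @trimmed = trim_overlaps(
    { seq_start => 1,  seq_end => 100, info => 'A' },
    { seq_start => 91, seq_end => 200, info => 'B' },
);
printf "%s\t%d\t%d\n", @{$_}{qw(info seq_start seq_end)} for @trimmed;
```

With the two example domains, the overlap value is 9, so domain A ends at 96 and domain B starts at 97. Because the boundary is computed from the left domain's new end, the two domains always end up adjacent with no gap.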
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014,2015 by William R. Pearson and The Rector & |
39 | 39 | # create temporary tables/select permissions for tmp_annot |
40 | 40 | # |
41 | 41 | |
42 | use warnings; | |
42 | 43 | use strict; |
43 | 44 | |
44 | 45 | use DBI; |
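The annotation scripts in this commit accept sequence headers in several formats, for example `gi|1705556|sp|P54670.1|CAF1_DICDI`, `sp|P09488|GSTM1_HUMAN`, and `sp:CALM_HUMAN`. The sketch below shows one way to pull an accession and an identifier out of such a line. `parse_header` is a hypothetical helper, not a function in the scripts, and it covers only a subset of the cases handled by `show_annots`.

```perl
use strict;
use warnings;

# Hypothetical helper: return (accession, identifier) for a header line.
# Either value may be undef when the format does not carry it.
sub parse_header {
    my ($line) = @_;
    $line =~ s/^>//;
    my ($acc, $id);
    if ($line =~ m/^gi\|/) {                    # gi|number|db|acc|id
        (undef, undef, undef, $acc, $id) = split(/\|/, $line);
    }
    elsif ($line =~ m/^(?:sp|tr|up)\|/) {       # db|acc|id
        (undef, $acc, $id) = split(/\|/, $line);
    }
    elsif ($line =~ m/^(?:sp|tr):/i) {          # db:id, no accession
        (undef, $id) = split(/:/, $line);
    }
    else {                                      # bare accession
        ($acc) = ($line =~ m/^(\S+)/);
    }
    $acc =~ s/\.\d+$// if defined $acc;         # drop a version suffix such as ".1"
    ($id) = split(' ', $id) if defined $id;     # drop any trailing description
    return ($acc, $id);
}

for my $hdr ('>gi|1705556|sp|P54670.1|CAF1_DICDI', 'sp|P09488|GSTM1_HUMAN', 'sp:CALM_HUMAN') {
    my ($acc, $id) = parse_header($hdr);
    print join("\t", $hdr, $acc // '-', $id // '-'), "\n";
}
```

Returning both values lets the caller decide whether to query by accession or by identifier, which is the role `$use_acc` plays in the scripts.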
0 | #!/usr/bin/env perl | |
1 | ||
2 | ################################################################ | |
3 | # copyright (c) 2015 by William R. Pearson and The Rector & | |
4 | # Visitors of the University of Virginia */ | |
5 | ################################################################ | |
6 | # Licensed under the Apache License, Version 2.0 (the "License"); | |
7 | # you may not use this file except in compliance with the License. | |
8 | # You may obtain a copy of the License at | |
9 | # | |
10 | # http://www.apache.org/licenses/LICENSE-2.0 | |
11 | # | |
12 | # Unless required by applicable law or agreed to in writing, | |
13 | # software distributed under this License is distributed on an "AS | |
14 | # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either | |
15 | # express or implied. See the License for the specific language | |
16 | # governing permissions and limitations under the License. | |
17 | ################################################################ | |
18 | ||
19 | # ann_pfam.pl gets an annotation file from fasta36 -V with a line of the form: | |
20 | ||
21 | # gi|62822551|sp|P00502|GSTA1_RAT Glutathione S-transfer\n (at least from pir1.lseg) | |
22 | # | |
23 | # it must: | |
24 | # (1) read in the line | |
25 | # (2) parse it to get the up_acc | |
26 | # (3) return the tab delimited features | |
27 | # | |
28 | ||
29 | # this is the first version that works with the new Pfam strategy of | |
30 | # separating Uniprot reference sequences from the rest of uniprot. as | |
31 | # a result, it is possible that 2 SQL queries will be required, one to | |
32 | # pfamA_reg_full_significant and a second to uniprot_reg_full. | |
33 | ||
34 | # modified 15-Jan-2017 to reduce the number of calls when the same | |
35 | # accession is present multiple times. Accessions are saved in a hash | |
36 | # that ensures uniqueness. (Could also speed things up by creating temporary table.) | |
37 | # | |
38 | ||
39 | use warnings; | |
40 | use strict; | |
41 | ||
42 | use DBI; | |
43 | use Getopt::Long; | |
44 | use Pod::Usage; | |
45 | ||
46 | use vars qw($host $db $port $user $pass); | |
47 | ||
48 | my $hostname = `/bin/hostname`; | |
49 | ||
50 | ($host, $db, $port, $user, $pass) = ("wrpxdb.its.virginia.edu", "pfam32", 0, "web_user", "fasta_www"); | |
51 | #$host = 'xdb'; | |
52 | #$host = 'localhost'; | |
53 | #$db = 'RPD2_pfam28u'; | |
54 | ||
55 | my ($auto_reg,$rpd2_fams, $neg_doms, $vdoms, $lav, $no_doms, $no_clans, $pf_acc, $acc_comment, $bound_comment, $shelp, $help) = | |
56 | (0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0,); | |
57 | my ($no_over, $split_over, $over_fract) = (0, 0, 3.0); | |
58 | my ($clan_fam) = (0); | |
59 | ||
60 | my ($color_sep_str, $show_color) = (" :",1); | |
61 | $color_sep_str = '~'; | |
62 | ||
63 | my ($min_nodom, $min_vdom) = (10,10); | |
64 | ||
65 | GetOptions( | |
66 | "host=s" => \$host, | |
67 | "db=s" => \$db, | |
68 | "user=s" => \$user, | |
69 | "password=s" => \$pass, | |
70 | "port=i" => \$port, | |
71 | "lav" => \$lav, | |
72 | "acc_comment" => \$acc_comment, | |
73 | "bound_comment" => \$bound_comment, | |
74 | "color!" => \$show_color, | |
75 | "clan_fam|clan-fam" => \$clan_fam, | |
76 | "no_over|no-over" => \$no_over, | |
77 | "split_over|split-over=f" => \$split_over, | |
78 | "over_fract|over-fract=f" => \$over_fract, | |
79 | "no-clans|no_clans" => \$no_clans, | |
80 | "neg|neg_doms|neg-doms" => \$neg_doms, | |
81 | "min_nodom=i" => \$min_nodom, | |
82 | "vdoms|v_doms" => \$vdoms, | |
83 | "pfacc" => \$pf_acc, | |
84 | "RPD2" => \$rpd2_fams, | |
85 | "auto_reg" => \$auto_reg, | |
86 | "h|?" => \$shelp, | |
87 | "help" => \$help, | |
88 | ); | |
89 | ||
90 | pod2usage(1) if $shelp; | |
91 | pod2usage(exitstatus => 0, verbose => 2) if $help; | |
92 | pod2usage(1) unless (@ARGV || -p STDIN || -f STDIN); | |
93 | ||
94 | my $connect = "dbi:mysql(AutoCommit=>1,RaiseError=>1):database=$db"; | |
95 | $connect .= ";host=$host" if $host; | |
96 | $connect .= ";port=$port" if $port; | |
97 | ||
98 | my $dbh = DBI->connect($connect, | |
99 | $user, | |
100 | $pass | |
101 | ) or die $DBI::errstr; | |
102 | ||
103 | my %annot_types = (); | |
104 | my %domains = (NODOM=>0); | |
105 | my %domain_clan = (NODOM => {clan_id => 'NODOM', clan_acc=>0, domain_cnt=>0}); | |
106 | my @domain_list = (0); | |
107 | my $domain_cnt = 0; | |
108 | ||
109 | my $pfamA_reg_full = 'pfamA_reg_full_significant'; | |
110 | my $uniprot_reg_full = 'uniprot_reg_full'; | |
111 | ||
112 | my $get_annot_sub = \&get_pfam_annots; | |
113 | ||
114 | my @pfam_fields = qw(seq_start seq_end model_start model_end model_length pfamA_acc pfamA_id auto_pfamA_reg_full domain_evalue_score as evalue length); | |
115 | my @upfam_fields = qw(seq_start seq_end model_start model_end model_length pfamA_acc pfamA_id auto_uniprot_reg_full domain_evalue_score as evalue length); | |
116 | ||
117 | my $get_pfam_acc = $dbh->prepare(<<EOSQL); | |
118 | SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
119 | FROM pfamseq | |
120 | JOIN pfamA_reg_full_significant using(pfamseq_acc) | |
121 | JOIN pfamA USING (pfamA_acc) | |
122 | WHERE in_full = 1 | |
123 | AND pfamseq_acc=? | |
124 | ORDER BY seq_start | |
125 | ||
126 | EOSQL | |
127 | ||
128 | my $get_upfam_acc = $dbh->prepare(<<EOSQL); | |
129 | SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_uniprot_reg_full as auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
130 | FROM uniprot | |
131 | JOIN uniprot_reg_full using(uniprot_acc) | |
132 | JOIN pfamA USING (pfamA_acc) | |
133 | WHERE in_full = 1 | |
134 | AND uniprot_acc=? | |
135 | ORDER BY seq_start | |
136 | ||
137 | EOSQL | |
138 | ||
139 | my $get_pfam_refacc = $dbh->prepare(<<EOSQL); | |
140 | SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
141 | FROM $pfamA_reg_full | |
142 | JOIN pfamseq using(pfamseq_acc) | |
143 | JOIN pfamA USING (pfamA_acc) | |
144 | JOIN uniprot.up2ref_acc as up2ref on(up2ref.acc=pfamseq_acc) | |
145 | WHERE in_full = 1 | |
146 | AND up2ref.ref_acc=? | |
147 | ORDER BY seq_start | |
148 | ||
149 | EOSQL | |
150 | ||
151 | my $get_upfam_refacc = $dbh->prepare(<<EOSQL); | |
152 | SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_uniprot_reg_full as auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
153 | FROM uniprot | |
154 | JOIN uniprot_reg_full using(uniprot_acc) | |
155 | JOIN pfamA USING (pfamA_acc) | |
156 | JOIN uniprot.up2ref_acc as up2ref on(up2ref.acc=uniprot_acc) | |
157 | WHERE in_full = 1 | |
158 | AND ref_acc=? | |
159 | ORDER BY seq_start | |
160 | ||
161 | EOSQL | |
162 | ||
163 | my $get_annots_sql = $get_pfam_acc; | |
164 | ||
165 | my $get_pfam_id = $dbh->prepare(<<EOSQL); | |
166 | SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
167 | FROM pfamseq | |
168 | JOIN $pfamA_reg_full using(pfamseq_acc) | |
169 | JOIN pfamA USING (pfamA_acc) | |
170 | WHERE in_full=1 | |
171 | AND pfamseq_id=? | |
172 | ORDER BY seq_start | |
173 | ||
174 | EOSQL | |
175 | ||
176 | my $get_upfam_id = $dbh->prepare(<<EOSQL); | |
177 | SELECT seq_start, seq_end, model_start, model_end, model_length, pfamA_acc, pfamA_id, auto_uniprot_reg_full as auto_pfamA_reg_full, domain_evalue_score as evalue, length | |
178 | FROM uniprot | |
179 | JOIN uniprot_reg_full using(uniprot_acc) | |
180 | JOIN pfamA USING (pfamA_acc) | |
181 | WHERE in_full=1 | |
182 | AND uniprot_id=? | |
183 | ORDER BY seq_start | |
184 | ||
185 | EOSQL | |
186 | ||
187 | my $get_pfam_clan = $dbh->prepare(<<EOSQL); | |
188 | ||
189 | SELECT clan_acc, clan_id | |
190 | FROM clan | |
191 | JOIN clan_membership using(clan_acc) | |
192 | WHERE pfamA_acc=? | |
193 | ||
194 | EOSQL | |
195 | ||
196 | my $get_rpd2_clans = $dbh->prepare(<<EOSQL); | |
197 | ||
198 | SELECT auto_pfamA, clan | |
199 | FROM ljm_db.RPD2_final_fams | |
200 | WHERE clan is not NULL | |
201 | ||
202 | EOSQL | |
203 | ||
204 | # -- LEFT JOIN clan_membership USING (auto_pfamA) | |
205 | # -- LEFT JOIN clans using(auto_clan) | |
206 | ||
207 | my ($tmp, $gi, $sdb, $acc, $id, $use_acc); | |
208 | ||
209 | ################ | |
210 | ## check for db=*_qfo -- do not use get_upfam_acc in that case | |
211 | if ($db =~ m/_qfo/) { | |
212 | $get_upfam_acc= ''; | |
213 | } | |
214 | ||
215 | # get the query | |
216 | my ($query, $seq_len) = @ARGV; | |
217 | $seq_len = 0 unless defined($seq_len); | |
218 | ||
219 | $query =~ s/^>// if ($query); | |
220 | ||
221 | my @annots = (); | |
222 | my %annot_set = (); | |
223 | ||
224 | my %rpd2_clan_fams = (); | |
225 | ||
226 | if ($rpd2_fams) { | |
227 | $get_rpd2_clans->execute(); | |
228 | my ($auto_pfam, $auto_clan); | |
229 | while (($auto_pfam, $auto_clan)=$get_rpd2_clans->fetchrow_array()) { | |
230 | $rpd2_clan_fams{$auto_pfam} = $auto_clan; | |
231 | } | |
232 | } | |
233 | ||
234 | #if it's a file I can open, read and parse it | |
235 | unless ($query && ($query =~ m/[\|:]/ || | |
236 | $query =~ m/^[NX]P_/ || | |
237 | $query =~ m/^[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}\s/)) { | |
238 | ||
239 | while (my $a_line = <>) { | |
240 | $a_line =~ s/^>//; | |
241 | chomp $a_line; | |
242 | push @annots, show_annots($a_line, $get_annot_sub); | |
243 | } | |
244 | } | |
245 | else { | |
246 | push @annots, show_annots("$query\t$seq_len", $get_annot_sub); | |
247 | } | |
248 | ||
249 | for my $seq_annot (@annots) { | |
250 | next unless $seq_annot; | |
251 | my $annot_r = $annot_set{$seq_annot}; | |
252 | print ">",$annot_r->{seq_info},"\n"; | |
253 | for my $annot (@{$annot_r->{list}}) { | |
254 | if (!$lav && defined($domains{$annot->[-1]})) { | |
255 | my ($a_name, $a_num) = domain_num($annot->[-1],$domains{$annot->[-1]}); | |
256 | $annot->[-1] = $a_name; | |
257 | my $tmp_a_num = $a_num; | |
258 | $tmp_a_num =~ s/v$//; | |
259 | if ($acc_comment) { | |
260 | $annot->[-1] .= "{$domain_list[$tmp_a_num]}"; | |
261 | } | |
262 | if ($bound_comment) { | |
263 | $annot->[-1] .= $color_sep_str.$annot->[0].":".$annot->[2]; | |
264 | } | |
265 | elsif ($show_color) { | |
266 | $annot->[-1] .= $color_sep_str.$a_num; | |
267 | } | |
268 | } | |
269 | print join("\t",@$annot),"\n"; | |
270 | } | |
271 | } | |
272 | ||
273 | exit(0); | |
274 | ||
275 | sub show_annots { | |
276 | my ($query_len, $get_annot_sub) = @_; | |
277 | ||
278 | my ($annot_line, $seq_len) = split(/\t/,$query_len); | |
279 | ||
280 | my $pfamA_acc; | |
281 | ||
282 | $use_acc = 1; | |
283 | $get_annots_sql = $get_pfam_acc; | |
284 | ||
285 | my $get_annots_sql_u = $get_upfam_acc; | |
286 | ||
287 | if ($annot_line =~ m/^pf\d+\|/) { | |
288 | ($sdb, $gi, $pfamA_acc, $acc, $id) = split(/\|/,$annot_line); | |
289 | # $dbh->do("use RPD2_pfam"); | |
290 | } | |
291 | elsif ($annot_line =~ m/^gi\|/) { | |
292 | ($tmp, $gi, $sdb, $acc, $id) = split(/\|/,$annot_line); | |
293 | if ($sdb =~ m/ref/) { | |
294 | $get_annots_sql = $get_pfam_refacc; | |
295 | $get_annots_sql_u = $get_upfam_refacc; | |
296 | } | |
297 | } | |
298 | elsif ($annot_line =~ m/^(sp|tr|up)\|/) { | |
299 | ($sdb, $acc, $id) = split(/\|/,$annot_line); | |
300 | } | |
301 | elsif ($annot_line =~ m/^ref\|/) { | |
302 | ($sdb, $acc) = split(/\|/,$annot_line); | |
303 | $get_annots_sql = $get_pfam_refacc; | |
304 | $get_annots_sql_u = $get_upfam_refacc; | |
305 | } | |
306 | elsif ($annot_line =~ m/^(SP|TR):/i) { | |
307 | ($sdb, $id) = split(/:/,$annot_line); | |
308 | $use_acc = 0; | |
309 | } | |
310 | elsif ($annot_line !~ m/\|/ && $annot_line !~ m/:/) { | |
311 | $use_acc = 1; | |
312 | ($acc) = split(/\s+/,$annot_line); | |
313 | } | |
314 | # deal with no-database SwissProt/NR | |
315 | else { | |
316 | ($acc)=($annot_line =~ /^(\S+)/); | |
317 | } | |
318 | ||
319 | # here we have an $acc or an $id: check to see if we have the data | |
320 | ||
321 | my %annot_data = (seq_info=>$annot_line, seq_len=>$seq_len); | |
322 | my $annot_key = ''; | |
323 | unless ($use_acc) { | |
324 | return "" if ($annot_set{$id}); | |
325 | $annot_set{$id} = \%annot_data; | |
326 | $annot_key = $id; | |
327 | ||
328 | $get_annots_sql = $get_pfam_id; | |
329 | $get_annots_sql->execute($id); | |
330 | unless ($get_annots_sql->rows()) { | |
331 | if ($get_annots_sql_u) { | |
332 | $get_annots_sql = $get_annots_sql_u; | |
333 | $get_annots_sql->execute($id); | |
334 | } | |
335 | } | |
336 | } else { | |
337 | unless ($acc) { | |
338 | warn "missing acc in $annot_line"; | |
339 | return ""; | |
340 | } | |
341 | else { | |
342 | $acc =~ s/\.\d+$//; | |
343 | ||
344 | $annot_key = $acc; | |
345 | if ($annot_set{$acc}) { | |
346 | goto ret_label; | |
347 | } | |
348 | $annot_set{$acc} = \%annot_data; | |
349 | ||
350 | $get_annots_sql->execute($acc); | |
351 | unless ($get_annots_sql->rows()) { | |
352 | if ($get_annots_sql_u) { | |
353 | $get_annots_sql = $get_annots_sql_u; | |
354 | $get_annots_sql->execute($acc); | |
355 | } | |
356 | } | |
357 | } | |
358 | } | |
359 | ||
360 | $annot_data{list} = $get_annot_sub->($get_annots_sql, $seq_len); | |
361 | ||
362 | ret_label: | |
363 | return $annot_key; | |
364 | } | |
365 | ||
366 | sub get_pfam_annots { | |
367 | my ($get_annots, $seq_length) = @_; | |
368 | ||
369 | $seq_length = 0 unless $seq_length; | |
370 | ||
371 | my @pf_domains = (); | |
372 | ||
373 | # get the list of domains, sorted by start | |
374 | ||
375 | # $row_href has: seq_start, seq_end, model_start, model_end, model_length, | |
376 | # pfamA_acc, pfamA_id, auto_pfamA_reg_full, | |
377 | # domain_evalue_score as evalue, length | |
378 | ||
379 | while ( my $row_href = $get_annots->fetchrow_hashref()) { | |
380 | if ($auto_reg) { | |
381 | $row_href->{info} = $row_href->{auto_pfamA_reg_full}; | |
382 | } elsif ($pf_acc) { | |
383 | $row_href->{info} = $row_href->{pfamA_acc}; | |
384 | } else { | |
385 | $row_href->{info} = $row_href->{pfamA_id}; | |
386 | } | |
387 | ||
388 | if ($row_href && $row_href->{length} > $seq_length && $seq_length == 0) { | |
389 | $seq_length = $row_href->{length}; | |
390 | } | |
391 | ||
392 | next if ($row_href->{seq_start} >= $seq_length); | |
393 | if ($row_href->{seq_end} > $seq_length) { | |
394 | $row_href->{seq_end} = $seq_length; | |
395 | } | |
396 | ||
397 | push @pf_domains, $row_href | |
398 | } | |
399 | ||
400 | # before checking for domain overlap, check for "split-domains" | |
401 | # (self-unbound) by looking for runs of the same domain that are | |
402 | # ordered by model_start | |
403 | ||
404 | if (scalar(@pf_domains) > 1) { | |
405 | my @j_domains; #joined domains | |
406 | my @tmp_domains = @pf_domains; | |
407 | ||
408 | my $prev_dom = shift(@tmp_domains); | |
409 | ||
410 | for my $curr_dom (@tmp_domains) { | |
411 | # to join domains: | |
412 | # (1) the domains must be in order by model_start/end coordinates | |
413 | # (2) joining the domains cannot make the total combination too long | |
414 | ||
415 | # check for model and sequence consistency | |
416 | if (($prev_dom->{pfamA_acc} eq $curr_dom->{pfamA_acc}) # same family | |
417 | && $prev_dom->{model_start} < $curr_dom->{model_start} # model check | |
418 | && $prev_dom->{model_end} < $curr_dom->{model_end} | |
419 | ||
420 | && ($curr_dom->{model_start} > $prev_dom->{model_end} * 0.80 # limit overlap | |
421 | || $curr_dom->{model_start} < $prev_dom->{model_end} * 1.25) | |
422 | && ((($curr_dom->{model_end} - $curr_dom->{model_start}+1)/$curr_dom->{model_length} + | |
423 | ($prev_dom->{model_end} - $prev_dom->{model_start}+1)/$prev_dom->{model_length}) < 1.33) | |
424 | ) { # join them by updating $prev_dom | |
425 | $prev_dom->{seq_end} = $curr_dom->{seq_end}; | |
426 | $prev_dom->{model_end} = $curr_dom->{model_end}; | |
427 | $prev_dom->{auto_pfamA_reg_full} = $prev_dom->{auto_pfamA_reg_full} . ";". $curr_dom->{auto_pfamA_reg_full}; | |
428 | $prev_dom->{evalue} = ($prev_dom->{evalue} < $curr_dom->{evalue} ? $prev_dom->{evalue} : $curr_dom->{evalue}); | |
429 | } else { | |
430 | push @j_domains, $prev_dom; | |
431 | $prev_dom = $curr_dom; | |
432 | } | |
433 | } | |
434 | push @j_domains, $prev_dom; | |
435 | @pf_domains = @j_domains; | |
436 | ||
437 | ||
438 | if ($no_over) { # for either $no_over or $split_over, check for overlapping domains and edit/split them | |
439 | ||
440 | my @tmp_domains = @pf_domains; # allow shifts from copy of @pf_domains | |
441 | my @save_domains = (); # where the new domains go | |
442 | ||
443 | my $prev_dom = shift @tmp_domains; | |
444 | ||
445 | while (my $curr_dom = shift @tmp_domains) { | |
446 | ||
447 | my @overlap_domains = ($prev_dom); | |
448 | ||
449 | my $diff = $prev_dom->{seq_end} - $curr_dom->{seq_start}; | |
450 | ||
451 | my ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, | |
452 | $curr_dom->{seq_end}-$curr_dom->{seq_start}+1); | |
453 | ||
454 | my $inclusion = ((($curr_dom->{seq_start} >= $prev_dom->{seq_start}) # start is right && end is left | |
455 | && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})) || # -- curr inside prev | |
456 | (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) # start is left && end is right | |
457 | && ($curr_dom->{seq_end} >= $prev_dom->{seq_end}))); # -- prev is inside curr | |
458 | ||
459 | my $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len; | |
460 | ||
461 | # check for overlap > domain_length/$over_fract | |
462 | while ($inclusion || ($diff > 0 && $diff > $longer_len/$over_fract)) { | |
463 | push @overlap_domains, $curr_dom; | |
464 | $curr_dom = shift @tmp_domains; | |
465 | last unless $curr_dom; | |
466 | $diff = $prev_dom->{seq_end} - $curr_dom->{seq_start}; | |
467 | ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, $curr_dom->{seq_end}-$curr_dom->{seq_start}+1); | |
468 | $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len; | |
469 | $inclusion = ((($curr_dom->{seq_start} >= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} <= $prev_dom->{seq_end})) || | |
470 | (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} >= $prev_dom->{seq_end}))); | |
471 | } | |
472 | ||
473 | # check for overlapping domains; >1 because $prev_dom is always there | |
474 | if (scalar(@overlap_domains) > 1 ) { | |
475 | # if $rpd2_fams, check for a chosen one | |
476 | ||
477 | for my $dom ( @overlap_domains) { | |
478 | $dom->{evalue} = 1.0 unless defined($dom->{evalue}); | |
479 | } | |
480 | ||
481 | @overlap_domains = sort { $a->{evalue} <=> $b->{evalue} } @overlap_domains; | |
482 | $prev_dom = $overlap_domains[0]; | |
483 | } | |
484 | ||
485 | # $prev_dom should be the best of the overlaps, and we are no longer overlapping > dom_length/3 | |
486 | push @save_domains, $prev_dom; | |
487 | $prev_dom = $curr_dom; | |
488 | } | |
489 | ||
490 | if ($prev_dom) { | |
491 | push @save_domains, $prev_dom; | |
492 | } | |
493 | ||
494 | @pf_domains = @save_domains; | |
495 | ||
496 | # now check for smaller overlaps | |
497 | for (my $i=1; $i < scalar(@pf_domains); $i++) { | |
498 | if ($pf_domains[$i-1]->{seq_end} >= $pf_domains[$i]->{seq_start}) { | |
499 | my $overlap = $pf_domains[$i-1]->{seq_end} - $pf_domains[$i]->{seq_start}; | |
500 | $pf_domains[$i-1]->{seq_end} -= int($overlap/2); | |
501 | $pf_domains[$i]->{seq_start} = $pf_domains[$i-1]->{seq_end}+1; | |
502 | } | |
503 | } | |
504 | } | |
505 | elsif ($split_over) { # here, everything that overlaps by > $min_vdom should be split into a separate domain | |
506 | my @save_domains = (); # where the new domains go | |
507 | ||
508 | # check to see if one domain is included (or overlapping) more | |
509 | # than xx% of the other. If so, pick the longer one | |
510 | ||
511 | my ($prev_dom, $curr_dom) = ($pf_domains[0],0) ; | |
512 | for (my $i=1; $i < scalar(@pf_domains); $i++) { | |
513 | $curr_dom = $pf_domains[$i]; | |
514 | ||
515 | my ($prev_len, $cur_len) = ($prev_dom->{seq_end}-$prev_dom->{seq_start}+1, $curr_dom->{seq_end}-$curr_dom->{seq_start}+1); | |
516 | my $longer_len = ($prev_len > $cur_len) ? $prev_len : $cur_len; | |
517 | ||
518 | if (($curr_dom->{seq_start} >= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} <= $prev_dom->{seq_end}) | |
519 | && $cur_len / $prev_len > 0.80) { | |
520 | # $prev_dom stays the same, $curr_dom deleted | |
521 | next; | |
522 | } | |
523 | elsif (($curr_dom->{seq_start} <= $prev_dom->{seq_start}) && ($curr_dom->{seq_end} >= $prev_dom->{seq_end}) | |
524 | && $prev_len / $cur_len > 0.80) { | |
525 | $prev_dom = $curr_dom; # this should delete $prev_dom | |
526 | next; | |
527 | } | |
528 | ||
529 | if ($prev_dom->{seq_end} >= $curr_dom->{seq_start} + $min_vdom) { | |
530 | my ($l_seq_end, $r_seq_start) = ($curr_dom->{seq_start}-1, $prev_dom->{seq_end}+1); | |
531 | ||
532 | $prev_dom->{seq_end} = $l_seq_end; | |
533 | push @save_domains, $prev_dom; | |
534 | my $new_dom = {seq_start => $l_seq_end+1, seq_end=>$r_seq_start-1, | |
535 | model_length => -1, | |
536 | pfamA_acc=>$prev_dom->{pfamA_acc}."/".$curr_dom->{pfamA_acc}, | |
537 | pfamA_id=>$prev_dom->{pfamA_id}."/".$curr_dom->{pfamA_id}, | |
538 | }; | |
539 | ||
540 | if ($pf_acc) { | |
541 | $new_dom->{info} = $new_dom->{pfamA_acc}; | |
542 | } | |
543 | else { | |
544 | $new_dom->{info} = $new_dom->{pfamA_id}; | |
545 | } | |
546 | ||
547 | push @save_domains, $new_dom; | |
548 | $curr_dom->{seq_start} = $r_seq_start; | |
549 | $prev_dom = $curr_dom; | |
550 | } | |
551 | else { | |
552 | push @save_domains, $prev_dom; | |
553 | $prev_dom = $curr_dom; | |
554 | } | |
555 | } | |
556 | push @save_domains, $prev_dom; | |
557 | @pf_domains = @save_domains; | |
558 | } | |
559 | } | |
560 | ||
561 | # $vdoms -- virtual Pfam domains -- the equivalent of $neg_doms, | |
562 | # but covering parts of a Pfam model that are not annotated. split | |
563 | # domains have been joined, so simply check beginning and end of | |
564 | # each domain (but must also check for bounded-ness) | |
565 | # only add when 10% or more is missing and missing length > $min_nodom | |
566 | ||
567 | if ($vdoms && scalar(@pf_domains)) { | |
568 | my @vpf_domains; | |
569 | ||
570 | my $curr_dom = $pf_domains[0]; | |
571 | my $length = $curr_dom->{length}; | |
572 | ||
573 | my $prev_dom={seq_end=>0, pfamA_acc=>''}; | |
574 | my $prev_dom_end = 0; | |
575 | my $next_dom_start = $length+1; | |
576 | ||
577 | for (my $dom_ix=0; $dom_ix < scalar(@pf_domains); $dom_ix++ ) { | |
578 | $curr_dom = $pf_domains[$dom_ix]; | |
579 | ||
580 | my $pfamA = $curr_dom->{pfamA_acc}; | |
581 | ||
582 | # first, look left, is there a domain there (if there is, | |
583 | # it should be updated right) | |
584 | ||
585 | # my $min_vdom = $curr_dom->{model_length} / 10; | |
586 | ||
587 | if ($curr_dom->{model_length} < $min_vdom) { | |
588 | push @vpf_domains, $curr_dom; | |
589 | next; | |
590 | } | |
591 | if ($prev_dom->{pfamA_acc}) { # look for previous domain | |
592 | $prev_dom_end = $prev_dom->{seq_end}; | |
593 | } | |
594 | ||
595 | # there is a domain to the left, how much room is available? | |
596 | my $left_dom_len = min($curr_dom->{seq_start}-$prev_dom_end-1, $curr_dom->{model_start}-1); | |
597 | if ( $left_dom_len > $min_vdom) { | |
598 | # there is room for a virtual domain | |
599 | my %new_dom = (seq_start=> $curr_dom->{seq_start}-$left_dom_len, | |
600 | seq_end => $curr_dom->{seq_start}-1, | |
601 | info=>'@'.$curr_dom->{info}, | |
602 | model_length=>$curr_dom->{model_length}, | |
603 | model_end => $curr_dom->{model_start}-1, | |
604 | model_start => $left_dom_len, | |
605 | pfamA_acc=>$pfamA, | |
606 | ); | |
607 | push @vpf_domains, \%new_dom; | |
608 | } | |
609 | ||
610 | # save the current domain | |
611 | push @vpf_domains, $curr_dom; | |
612 | $prev_dom = $curr_dom; | |
613 | ||
614 | if ($dom_ix < $#pf_domains) { # there is a domain to the right | |
615 | # first, give all the extra space to the first domain (no splitting) | |
616 | $next_dom_start = $pf_domains[$dom_ix+1]->{seq_start}; | |
617 | } | |
618 | else { | |
619 | $next_dom_start = $length; | |
620 | } | |
621 | ||
622 | # is there room for a virtual domain right | |
623 | ||
624 | my $right_dom_len = min($next_dom_start-$curr_dom->{seq_end}-1, # space available | |
625 | $curr_dom->{model_length}-$curr_dom->{model_end} # space needed | |
626 | ); | |
627 | if ( $right_dom_len > $min_vdom) { | |
628 | my %new_dom = (seq_start=> $curr_dom->{seq_end}+1, | |
629 | seq_end=> $curr_dom->{seq_end}+$right_dom_len, | |
630 | info=>'@'.$curr_dom->{info}, | |
631 | model_length => $curr_dom->{model_length}, | |
632 | pfamA_acc=> $pfamA, | |
633 | ); | |
634 | push @vpf_domains, \%new_dom; | |
635 | $prev_dom = \%new_dom; | |
636 | } | |
637 | } # all done, check for last one | |
638 | ||
639 | # $curr_dom=$pf_domains[-1]; | |
640 | # # my $min_vdom = $curr_dom->{model_length}/10; | |
641 | ||
642 | # my $right_dom_len = min($length - $curr_dom->{seq_end}+1, # space available | |
643 | # $curr_dom->{model_length}-$curr_dom->{model_end} # space needed | |
644 | # ); | |
645 | # if ($right_dom_len > $min_vdom) { | |
646 | # my %new_dom = (seq_start=> $curr_dom->{seq_end}+1, | |
647 | # seq_end => $curr_dom->{seq_end}+$right_dom_len, | |
648 | # info=>'@'.$curr_dom->{pfamA_acc}, | |
649 | # model_len=> $curr_dom->{model_len}, | |
650 | # pfamA_acc => $curr_dom->{pfamA_acc}, | |
651 | # model_start => $curr_dom->{model_end}+1, | |
652 | # model_end => $curr_dom->{model_len}, | |
653 | # ); | |
654 | ||
655 | # push @vpf_domains, \%new_dom; | |
656 | # } | |
657 | ||
658 | # @vpf_domains has both old @pf_domains and new neg-domains | |
659 | @pf_domains = @vpf_domains; | |
660 | } | |
661 | ||
662 | if ($neg_doms) { | |
663 | my @npf_domains; | |
664 | my $prev_dom={seq_end=>0}; | |
665 | for my $curr_dom ( @pf_domains) { | |
666 | if ($curr_dom->{seq_start} - $prev_dom->{seq_end} > $min_nodom) { | |
667 | my %new_dom = (seq_start=>$prev_dom->{seq_end}+1, seq_end => $curr_dom->{seq_start}-1, info=>'NODOM'); | |
668 | push @npf_domains, \%new_dom; | |
669 | } | |
670 | push @npf_domains, $curr_dom; | |
671 | $prev_dom = $curr_dom; | |
672 | } | |
673 | if ($seq_length - $prev_dom->{seq_end} > $min_nodom) { | |
674 | my %new_dom = (seq_start=>$prev_dom->{seq_end}+1, seq_end=>$seq_length, info=>'NODOM'); | |
675 | if ($new_dom{seq_end} > $new_dom{seq_start}) { | |
676 | push @npf_domains, \%new_dom; | |
677 | } | |
678 | } | |
679 | ||
680 | if (scalar(@pf_domains)==0) { | |
681 | my %new_dom = (seq_start=>1, seq_end=> $seq_len, info=>'NODOM'); | |
682 | push @pf_domains, \%new_dom; | |
683 | } | |
684 | ||
685 | # @npf_domains has both old @pf_domains and new neg-domains | |
686 | @pf_domains = @npf_domains; | |
687 | } | |
688 | ||
689 | # now make sure we have useful names: colors | |
690 | ||
691 | for my $pf (@pf_domains) { | |
692 | $pf->{info} = domain_name($pf->{info}, $pf->{pfamA_acc}); | |
693 | } | |
694 | ||
695 | my @feats = (); | |
696 | for my $d_ref (@pf_domains) { | |
697 | if ($lav) { | |
698 | push @feats, [$d_ref->{seq_start}, $d_ref->{seq_end}, $d_ref->{info}]; | |
699 | } else { | |
700 | push @feats, [$d_ref->{seq_start}, '-', $d_ref->{seq_end}, $d_ref->{info} ]; | |
701 | # push @feats, [$d_ref->{seq_end}, ']', '-', ""]; | |
702 | } | |
703 | ||
704 | } | |
705 | ||
706 | return \@feats; | |
707 | } | |
708 | ||
709 | sub min { | |
710 | my ($arg1, $arg2) = @_; | |
711 | ||
712 | return ($arg1 <= $arg2 ? $arg1 : $arg2); | |
713 | } | |
714 | ||
715 | sub max { | |
716 | my ($arg1, $arg2) = @_; | |
717 | ||
718 | return ($arg1 >= $arg2 ? $arg1 : $arg2); | |
719 | } | |
720 | ||
721 | # domain name takes a uniprot domain label, removes comments ( ; | |
722 | # truncated) and numbers and returns a canonical form. Thus: | |
723 | # Cortactin 6. | |
724 | # Cortactin 7; truncated. | |
725 | # becomes "Cortactin" | |
726 | # | |
727 | ||
728 | sub domain_name { | |
729 | ||
730 | my ($value, $pfamA_acc) = @_; | |
731 | my $is_virtual = 0; | |
732 | ||
733 | if ($value =~ m/^@/) { | |
734 | $is_virtual = 1; | |
735 | $value =~ s/^@//; | |
736 | } | |
737 | ||
738 | # check for clan: | |
739 | if ($no_clans) { | |
740 | if (! defined($domains{$value})) { | |
741 | $domain_clan{$value} = 0; | |
742 | $domains{$value} = ++$domain_cnt; | |
743 | push @domain_list, $pfamA_acc; | |
744 | } | |
745 | } | |
746 | elsif (!defined($domain_clan{$value})) { | |
747 | ## only do this for new domains, old domains have known mappings | |
748 | ||
749 | ## ways to highlight the same domain: | |
750 | # (1) for clans, substitute clan name for family name | |
751 | # (2) for clans, use the same color for the same clan, but don't change the name | |
752 | # (3) for clans, combine family name with clan name, but use colors based on clan | |
753 | ||
754 | # check to see if it's a clan | |
755 | $get_pfam_clan->execute($pfamA_acc); | |
756 | ||
757 | my $pfam_clan_href=0; | |
758 | ||
759 | if ($pfam_clan_href=$get_pfam_clan->fetchrow_hashref()) { # is a clan | |
760 | my ($clan_id, $clan_acc) = @{$pfam_clan_href}{qw(clan_id clan_acc)}; | |
761 | ||
762 | # now check to see if we have seen this clan before (if so, do not increment $domain_cnt) | |
763 | my $c_value = "C." . $clan_id; | |
764 | ||
765 | if ($clan_fam) { | |
766 | $c_value = $c_value; | |
767 | } | |
768 | ||
769 | if ($pf_acc) { | |
770 | $c_value = $clan_acc; | |
771 | } | |
772 | ||
773 | $domain_clan{$value} = {clan_id => $clan_id, | |
774 | clan_acc => $clan_acc}; | |
775 | ||
776 | if ($domains{$c_value}) { | |
777 | $domain_clan{$value}->{domain_cnt} = $domains{$c_value}; | |
778 | $value = $c_value; | |
779 | } | |
780 | else { | |
781 | $domain_clan{$value}->{domain_cnt} = ++ $domain_cnt; | |
782 | $value = $c_value; | |
783 | $domains{$value} = $domain_cnt; | |
784 | push @domain_list, $pfamA_acc; | |
785 | } | |
786 | } | |
787 | else { # not a clan | |
788 | $domain_clan{$value} = 0; | |
789 | $domains{$value} = ++$domain_cnt; | |
790 | push @domain_list, $pfamA_acc; | |
791 | } | |
792 | } | |
793 | elsif ($domain_clan{$value} && $domain_clan{$value}->{clan_acc}) { | |
794 | if ($pf_acc) {$value = $domain_clan{$value}->{clan_acc};} | |
795 | else { $value = "C." . $domain_clan{$value}->{clan_id}; } | |
796 | } | |
797 | ||
798 | if ($is_virtual) { | |
799 | $domains{'@'.$value} = $domains{$value}; | |
800 | $value = '@'.$value; | |
801 | } | |
802 | ||
803 | return $value; | |
804 | } | |
805 | ||
806 | sub domain_num { | |
807 | my ($value, $number) = @_; | |
808 | if ($value =~ m/^@/) { | |
809 | $value =~ s/^@/v/; | |
810 | $number = $number."v"; | |
811 | } | |
812 | return ($value, $number); | |
813 | } | |
814 | ||
815 | ||
816 | __END__ | |
817 | ||
818 | =pod | |
819 | ||
820 | =head1 NAME | |
821 | ||
822 | ann_pfam_sql.pl | |
823 | ||
824 | =head1 SYNOPSIS | |
825 | ||
826 | ann_pfam_sql.pl --neg-doms --vdoms 'sp|P09488|GSTM1_HUMAN' | accession.file | |
827 | ||
828 | =head1 OPTIONS | |
829 | ||
830 | -h short help | |
831 | --help include description | |
832 | --no-over : generate non-overlapping domains (equivalent to ann_pfam.pl) | |
833 | --split-over : overlaps of two domains generate a new hybrid domain | |
834 | --no-clans : do not use clans with multiple families from same clan | |
835 | --neg-doms : report domains between annotated domains as NODOM | |
836 | (also --neg, --neg_doms) | |
837 | --pfacc : report Pfam ACC (PF01234), rather than Pfam identifier (GST-N) | |
838 | --vdoms : produce "virtual domains" using model_start, | |
839 | model_end for partial pfam domains | |
840 | --min_nodom=10 : minimum length between domains for NODOM | |
841 | ||
842 | --host, --user, --password, --port, --db : info for mysql database | |
843 | ||
844 | =head1 DESCRIPTION | |
845 | ||
846 | C<ann_pfam_sql.pl> extracts domain information from the Pfam MySQL | |
847 | database. Currently, the program works with database | |
848 | sequence descriptions in several formats: | |
849 | ||
850 | >gi|1705556|sp|P54670.1|CAF1_DICDI | |
851 | >sp|P09488|GSTM1_HUMAN | |
852 | >sp:CALM_HUMAN | |
853 | ||
854 | C<ann_pfam_sql.pl> uses the C<pfamA_reg_full_significant>, C<pfamseq>, | |
855 | and C<pfamA> tables of the C<pfam> database to extract domain | |
856 | information on a protein. | |
857 | ||
858 | If the C<--no-over> option is set, overlapping domains are selected and | |
859 | edited to remove overlaps. For proteins with multiple overlapping | |
860 | domains (domains overlap by more than 1/3 of the domain length), | |
861 | C<ann_pfam_sql.pl> selects the domain annotation with the best | |
862 | C<domain_evalue_score>. When domains overlap by less than 1/3 of the | |
863 | domain length, they are shortened to remove the overlap. | |
864 | ||
865 | If the C<--split-over> option is set and two domains overlap, the | |
866 | overlapping region is split out of the domains and labeled as a new, | |
867 | virtual-like domain. If one domain is internal to another and spans | |
868 | more than 80% of the enclosing domain, the shorter domain is removed. | |
869 | ||
870 | C<ann_pfam_sql.pl> is designed to be used by the B<FASTA> programs with | |
871 | the C<-V \!ann_pfam_sql.pl> or C<-V "\!ann_pfam_sql.pl --neg"> option. | |
872 | ||
873 | =head1 AUTHOR | |
874 | ||
875 | William R. Pearson, wrp@virginia.edu | |
876 | ||
877 | =cut |
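
The small-overlap trimming that the `--no-over` description above refers to (domains that overlap by less than 1/3 of the domain length are shortened) can be sketched as a standalone Perl snippet. This is a minimal illustration, not the script's own code: `trim_small_overlaps` and the `DomA`/`DomB` records are hypothetical names introduced here, and the sketch assumes the domain list is already sorted by `seq_start`, as in `get_pfam_annots`.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of ann_pfam_sql.pl): remove small overlaps
# between adjacent domains by moving the shared boundary to roughly the
# midpoint of the overlap.  Assumes @$domains_r is sorted by seq_start.
sub trim_small_overlaps {
    my ($domains_r) = @_;
    for (my $i = 1; $i < scalar(@$domains_r); $i++) {
        my ($left, $right) = @{$domains_r}[$i-1, $i];
        if ($left->{seq_end} >= $right->{seq_start}) {
            my $overlap = $left->{seq_end} - $right->{seq_start};
            $left->{seq_end}    -= int($overlap/2);      # shorten the left domain
            $right->{seq_start}  = $left->{seq_end} + 1; # right starts just after it
        }
    }
    return $domains_r;
}

# Illustrative data only: two domains sharing residues 91-100.
my @domains = (
    { seq_start => 1,  seq_end => 100, info => 'DomA' },
    { seq_start => 91, seq_end => 200, info => 'DomB' },
);

trim_small_overlaps(\@domains);
printf("%s %d-%d\n", @{$_}{qw(info seq_start seq_end)}) for @domains;
```

With these inputs the boundary is moved into the overlap, so the two domains become adjacent and no longer share residues.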
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014, 2015 by William R. Pearson and The Rector & |
30 | 30 | # >pf26|164|O57809|1A1D_PYRHO |
31 | 31 | # and only provides domain information |
32 | 32 | |
33 | use warnings; | |
33 | 34 | # use strict; |
34 | 35 | |
35 | 36 | use Getopt::Long; |
79 | 80 | my @domain_list = (0); |
80 | 81 | my $domain_cnt = 0; |
81 | 82 | |
82 | my $loc="http://pfam.xfam.org/"; | |
83 | my $loc="https://pfam.xfam.org/"; | |
83 | 84 | my $url; |
84 | 85 | |
85 | 86 | my @pf_domains; |
0 | ann_exons_ens.pl | |
0 | ann_exons_up_sql.pl | |
1 | 1 | ann_exons_up_www.pl |
2 | 2 | ann_feats2ipr.pl |
3 | 3 | ann_feats_up_sql.pl |
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014 by William R. Pearson and The Rector & |
28 | 28 | |
29 | 29 | # this version can read feature2 uniprot features (acc/pos/end/label/value), but returns sorted start/end domains |
30 | 30 | |
31 | use warnings; | |
31 | 32 | use strict; |
32 | 33 | |
33 | 34 | use Getopt::Long; |
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | # copyright (c) 2014,2015 by William R. Pearson and The Rector & | |
3 | # copyright (c) 2017,2018 by William R. Pearson and The Rector & | |
4 | 4 | # Visitors of the University of Virginia */ |
5 | 5 | ################################################################ |
6 | 6 | # Licensed under the Apache License, Version 2.0 (the "License"); |
17 | 17 | ################################################################ |
18 | 18 | |
19 | 19 | ################################################################ |
20 | # annot_blast_btop2.pl --query query.file --ann_script ann_pfam_www.pl blast_tab_btop_file | |
20 | # annot_blast_btop2.pl --query query.file --ann_script ann_pfam_www.pl --dom_info blast_tab_btop_file | |
21 | 21 | ################################################################ |
22 | 22 | # annot_blast_btop2.pl associates domain annotation information and |
23 | 23 | # subalignment scores with a blast tabular (-outfmt 6 or -outfmt 7) |
29 | 29 | # If the BTOP field or query_file is not available, the script |
30 | 30 | # produces domain content without sub-alignment scores. |
31 | 31 | ################################################################ |
32 | ## 4-Nov-2018 | |
33 | # add --dom_info (--domain_info), which adds a new field with the coordinates of | |
34 | # the domains in the protein (independent of alignment) | |
35 | # | |
36 | ################################################################ | |
37 | ## 21-July-2018 | |
38 | # include sequence length (actually alignment end) to produce NODOM's (no NODOM's without length). | |
39 | # | |
40 | ################################################################ | |
32 | 41 | ## 13-Jan-2017 |
33 | 42 | # modified to provide query/subject coordinates and identities if no |
34 | 43 | # query sequence -- does not decrement for reverse-complement fastx/blastx DNA |
40 | 49 | # add -q_annot_script to annotate query sequence |
41 | 50 | # |
42 | 51 | |
52 | use warnings; | |
43 | 53 | use strict; |
44 | 54 | use IPC::Open2; |
45 | 55 | use Pod::Usage; |
46 | 56 | use Getopt::Long; |
57 | use File::Temp qw/ tempfile /; | |
58 | ||
47 | 59 | # use Data::Dumper; |
48 | 60 | |
49 | 61 | # read lines of the form: |
55 | 67 | # and report the domain content ala -m 8CC |
56 | 68 | |
57 | 69 | my ($matrix, $ann_script, $q_ann_script, $show_raw, $shelp, $help) = ("BLOSUM62", "", "", 0, 0, 0); |
70 | my ($have_qslen, $dom_info, $sub2query) = (0,0,0); # blast tabular file has sseqid sseqlen qseqid qseqlen | |
58 | 71 | my ($query_lib_name) = (""); # if $query_lib_name, do not use $query_file_name |
59 | 72 | my ($out_field_str) = (""); |
60 | 73 | my $query_lib_r = 0; |
67 | 80 | |
68 | 81 | GetOptions( |
69 | 82 | "matrix:s" => \$matrix, |
70 | "ann_script:s" => \$ann_script, | |
71 | "q_ann_script:s" => \$q_ann_script, | |
83 | "ann_script|script:s" => \$ann_script, | |
84 | "q_ann_script|q_script:s" => \$q_ann_script, | |
85 | "have_qslen|have_sqlen!" => \$have_qslen, | |
86 | "domain_info|dom_info!" => \$dom_info, | |
87 | "sub2query!" => \$sub2query, | |
72 | 88 | "query:s" => \$query_lib_name, |
73 | 89 | "query_file:s" => \$query_lib_name, |
74 | 90 | "query_lib:s" => \$query_lib_name, |
75 | 91 | "out_fields:s" => \$out_field_str, |
76 | "script:s" => \$ann_script, | |
77 | "q_script:s" => \$q_ann_script, | |
78 | 92 | "raw_score" => \$show_raw, |
79 | 93 | "h|?" => \$shelp, |
80 | 94 | "help" => \$help, |
92 | 106 | |
93 | 107 | my @tab_fields = qw(q_seqid s_seqid percid alen mismatch gopen q_start q_end s_start s_end evalue bits score BTOP); |
94 | 108 | |
109 | if ($have_qslen) { | |
110 | @tab_fields = qw(q_seqid q_len s_seqid s_len percid alen mismatch gopen q_start q_end s_start s_end evalue bits score BTOP); | |
111 | } | |
112 | ||
95 | 113 | # the fields that are displayed are listed here. By default, all fields except score and BTOP are displayed. |
96 | 114 | my @out_tab_fields = @tab_fields[0 .. $#tab_fields-1]; |
115 | ||
97 | 116 | if ($show_raw) { |
98 | 117 | push @out_tab_fields, "raw_score"; |
99 | ||
100 | } | |
118 | } | |
119 | ||
101 | 120 | if ($out_field_str) { |
102 | 121 | @out_tab_fields = split(/\s+/,$out_field_str); |
103 | 122 | } |
134 | 153 | push @hit_list, \%hit_data; |
135 | 154 | } |
136 | 155 | |
137 | # get the current query sequence | |
156 | # get the query annotations | |
157 | if ($q_ann_script) { | |
158 | $q_ann_script =~ s/\+/ /g; | |
159 | } | |
160 | ||
138 | 161 | if ($q_ann_script && -x (split(/\s+/,$q_ann_script))[0]) { |
139 | 162 | # get the domains for the q_seqid using --q_ann_script |
140 | 163 | # |
142 | 165 | my $pid = open2($Reader, $Writer, $q_ann_script); |
143 | 166 | my $hit = $hit_list[0]; |
144 | 167 | |
145 | print $Writer $hit->{q_seqid},"\n"; | |
168 | my $q_seq_len = scalar(@{$query_lib_r->{$hit->{q_seqid}}}); | |
169 | print $Writer $hit->{q_seqid},"\t",$q_seq_len,"\n"; | |
146 | 170 | close($Writer); |
147 | 171 | |
148 | @q_hit_list = ({ s_seq_id=> $hit->{q_seqid} }); | |
172 | push @q_hit_list,{ s_seq_id=> $hit->{q_seqid}, s_end=> $q_seq_len}; | |
149 | 173 | |
150 | 174 | read_annots($Reader, \@q_hit_list, 0); |
151 | 175 | |
152 | 176 | waitpid($pid, 0); |
153 | 177 | } |
154 | 178 | |
155 | # get the current query sequence | |
179 | # get the subject annotations | |
180 | if ($ann_script) { | |
181 | $ann_script =~ s/\+/ /g; | |
182 | } | |
183 | ||
156 | 184 | if ($ann_script && -x (split(/\s+/,$ann_script))[0]) { |
157 | 185 | # get the domains for each s_seqid using --ann_script |
158 | 186 | # |
187 | # this does not work currently because only one accession is sent. | |
188 | # For multiple hits, I need to make a tmp_file. | |
189 | ||
159 | 190 | my ($Reader, $Writer); |
160 | 191 | my $pid = open2($Reader, $Writer, $ann_script); |
192 | ||
161 | 193 | for my $hit (@hit_list) { |
162 | print $Writer $hit->{s_seqid},"\n"; | |
194 | # print STDERR $hit->{s_seqid},"\t", $hit->{s_end},"\n"; | |
195 | # print $Writer $hit->{s_seqid},"\t", $hit->{s_end},"\n"; | |
196 | my $s_len = 100000; | |
197 | if ($have_qslen) { | |
198 | $s_len = $hit->{s_len}; | |
199 | } | |
200 | print $Writer $hit->{s_seqid},"\t", $s_len,"\n"; | |
163 | 201 | } |
164 | 202 | close($Writer); |
165 | 203 | |
174 | 212 | @header_lines = ($next_line); |
175 | 213 | |
176 | 214 | # now get query sequence if available |
215 | ||
216 | if ($sub2query && scalar(@q_hit_list)==0) { | |
217 | # copy the information from $hit_list | |
218 | for my $tmp_hit ( @hit_list ) { | |
219 | if ($tmp_hit->{q_seqid} eq $tmp_hit->{s_seqid}) { | |
220 | my %tmp_q_hit = (s_seq_id=> $tmp_hit->{q_seqid}, s_end=> $tmp_hit->{s_len}); | |
221 | ||
222 | $tmp_q_hit{'domains'} = []; | |
223 | for my $dom ( @{$tmp_hit->{domains}} ) { | |
224 | my %new_dom = map { $_ => $dom->{$_} } keys(%$dom); | |
225 | $new_dom{target} = 0; | |
226 | push @{$tmp_q_hit{'domains'}}, \%new_dom; | |
227 | } | |
228 | ||
229 | $tmp_q_hit{'sites'} = []; | |
230 | for my $site ( @{$tmp_hit->{sites}} ) { | |
231 | my %new_site = map { $_ => $site->{$_} } keys(%$site); | |
232 | $new_site{target} = 0; | |
233 | push @{$tmp_q_hit{'sites'}}, \%new_site; | |
234 | } | |
235 | push @q_hit_list,\%tmp_q_hit; | |
236 | last; | |
237 | } | |
238 | } | |
239 | } | |
177 | 240 | |
178 | 241 | my $q_hit = $q_hit_list[0]; |
179 | 242 | |
237 | 300 | |
238 | 301 | if (scalar(@$merged_annots_r)) { # show subalignment scores if available |
239 | 302 | print "\t"; |
240 | ||
241 | 303 | print format_annot_info($hit, $merged_annots_r); |
304 | if ($dom_info) { | |
305 | print "\t",format_dom_info($q_hit->{domains}, $hit->{domains}); | |
306 | } | |
242 | 307 | } |
243 | 308 | elsif (@list_covered) { # otherwise show domain content |
244 | 309 | print "\t",join(";",@list_covered); |
245 | } | |
310 | if ($dom_info) { | |
311 | print "\t",format_dom_info($q_hit->{domains}, $hit->{domains}); | |
312 | } | |
313 | } | |
314 | ||
246 | 315 | print "\n"; |
247 | 316 | } |
248 | 317 | |
275 | 344 | while (my $line = <$Reader>) { |
276 | 345 | next if $line=~ m/^=/; |
277 | 346 | chomp $line; |
347 | ||
348 | # print STDERR "$line\n"; | |
278 | 349 | |
279 | 350 | # check for header |
280 | 351 | if ($line =~ m/^>/) { |
289 | 360 | } |
290 | 361 | @hit_domains = (); # current domains |
291 | 362 | @hit_sites = (); # current sites |
292 | $current_domain = $line; | |
363 | $current_domain = (split(/\s+/,$line))[0]; | |
293 | 364 | $current_domain =~ s/^>//; |
294 | 365 | } else { # check for data |
295 | 366 | my %annot_info = (target=>$target); |
308 | 379 | } |
309 | 380 | close($Reader); |
310 | 381 | |
311 | # all done, save the last one | |
312 | 382 | $hit_list_r->[$hit_ix]{domains} = \@hit_domains; |
313 | 383 | $hit_list_r->[$hit_ix]{sites} = \@hit_sites; |
384 | ||
385 | # clean up NODOMs in {domains} | |
386 | for my $hit ( @$hit_list_r ) { | |
387 | # clean-up last NODOM if < 10 | |
388 | my $tmp_domains = $hit->{domains}; | |
389 | next unless (scalar(@{$tmp_domains})); | |
390 | my ($last_dom, $left_coord) = ($tmp_domains->[-1], $hit->{s_end}); | |
391 | if ($last_dom->{descr} =~ m/^NODOM/ && (($left_coord - $last_dom->{d_pos} + 1) < 10)) { | |
392 | pop @$tmp_domains; | |
393 | } | |
394 | } | |
314 | 395 | } |
315 | 396 | |
316 | 397 | # input: a blast BTOP string of the form: "1VA160TS7KG10RK27" |
416 | 497 | $blosum62[22] = [ qw( 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4) ]; |
417 | 498 | $blosum62[23] = [ qw( -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1) ]; |
418 | 499 | |
419 | ||
420 | 500 | die "blosum62 length mismatch $#blosum62 != $#ncbi_blaa" if (scalar(@blosum62) != scalar(@ncbi_blaa)); |
421 | 501 | |
422 | 502 | for (my $i=0; $i < scalar(@ncbi_blaa); $i++) { |
499 | 579 | my @aligned_domains = (); |
500 | 580 | |
501 | 581 | my $left_active_end = $domain_r->[-1]->{d_end}+1; # as far right as possible |
582 | my $left_align_end = $hit_r->{q_end}; | |
583 | if ($target) { | |
584 | $left_align_end = $hit_r->{s_end}; | |
585 | } | |
586 | ||
587 | if ($left_active_end > $left_align_end ) { | |
588 | $left_active_end = $left_align_end ; | |
589 | } | |
590 | ||
502 | 591 | my ($q_start, $s_start, $h_start, $h_end) = @{$hit_r}{qw(q_start s_start s_start s_end)}; |
503 | my ($qix, $six) = ($q_start, $s_start); # $qix now starts from 1, like $ssix; | |
592 | my ($qix, $six) = ($q_start, $s_start); # $qix now starts from 1, like $six; | |
504 | 593 | |
505 | 594 | my $ds_ix = \$six; # use to track the subject position |
506 | 595 | # reverse coordinate names if $target==0 |
1137 | 1226 | return \@merged_array; |
1138 | 1227 | } |
1139 | 1228 | |
1140 | # domain output formatter | |
1229 | #### | |
1230 | # print raw domain info: | |
1231 | # |DX:%d-%d;C=dom_info|XD:%d-%d;C=dom_info | |
1232 | # | |
1141 | 1233 | sub format_dom_info { |
1142 | my ($hit_r, $raw_score, $dom_r) = @_; | |
1143 | ||
1144 | unless ($raw_score) { | |
1145 | warn "no raw_score at: ".$hit_r->{s_seqid}."\n"; | |
1146 | $raw_score = $hit_r->{score}; | |
1147 | } | |
1148 | ||
1149 | my ($score_scale, $fsub_score) = ($hit_r->{score}/$raw_score, $dom_r->{score}/$raw_score); | |
1150 | ||
1151 | my $qval = 0.0; | |
1152 | if ($hit_r->{evalue} == 0.0) { | |
1153 | $qval = 3000.0 | |
1154 | } | |
1155 | else { | |
1156 | $qval = -10.0*log($hit_r->{evalue})*$fsub_score/(log(10.0)) | |
1157 | } | |
1158 | ||
1159 | my ($ns_score, $s_bit) = (int($dom_r->{score} * $score_scale+0.5), | |
1160 | int($hit_r->{bits} * $fsub_score +0.5), | |
1161 | ); | |
1162 | $qval = 0 if $qval < 0; | |
1163 | ||
1164 | # print join(":",($dom_r->{ad_pos},$dom_r->{ad_end},$ns_score, $s_bit, sprintf("%.1f",$qval))),"\n"; | |
1165 | return join(";",(sprintf("|XR:%d-%d:%d-%d:s=%d", | |
1166 | $dom_r->{qa_start},$dom_r->{qa_end}, | |
1167 | $dom_r->{sa_start},$dom_r->{sa_end},$ns_score), | |
1168 | sprintf("b=%.1f",$s_bit), | |
1169 | sprintf("I=%.3f",$dom_r->{percid}), | |
1170 | sprintf("Q=%.1f",$qval),$dom_r->{descr})); | |
1234 | my ($q_dom_r, $dom_r) = @_; | |
1235 | ||
1236 | my $dom_str = ""; | |
1237 | for my $dom ( @$q_dom_r ) { | |
1238 | $dom_str .= sprintf("|DX:%d-%d;C=%s",@{$dom}{qw(d_pos d_end descr)}); | |
1239 | } | |
1240 | for my $dom ( @$dom_r ) { | |
1241 | $dom_str .= sprintf("|XD:%d-%d;C=%s",@{$dom}{qw(d_pos d_end descr)}); | |
1242 | } | |
1243 | ||
1244 | return $dom_str; | |
1171 | 1245 | } |
1172 | 1246 | |
1173 | 1247 | # merged annot output formatter |
1195 | 1269 | if ($annot_r->{type} eq '-') { # domain with scores |
1196 | 1270 | my $fsub_score = $annot_r->{score}/$raw_score; |
1197 | 1271 | |
1272 | my ($ns_score, $s_bit) = (int($annot_r->{score} * $score_scale + 0.5), | |
1273 | int($hit_r->{bits} * $fsub_score + 0.5), | |
1274 | ); | |
1198 | 1275 | my $qval = 0.0; |
1199 | 1276 | if ($hit_r->{evalue} == 0.0) { |
1200 | $qval = 3000.0 | |
1277 | if ($s_bit > 50) { | |
1278 | $qval = 3000.0 | |
1279 | } | |
1280 | else { | |
1281 | $qval = -10.0 * (log(400.0 * 400.) + $s_bit)/log(10.0); | |
1282 | } | |
1201 | 1283 | } else { |
1202 | 1284 | $qval = -10.0*log($hit_r->{evalue})*$fsub_score/(log(10.0)) |
1203 | 1285 | } |
1204 | 1286 | |
1205 | my ($ns_score, $s_bit) = (int($annot_r->{score} * $score_scale+0.5), | |
1206 | int($hit_r->{bits} * $fsub_score +0.5), | |
1207 | ); | |
1208 | 1287 | $qval = 0 if $qval < 0; |
1209 | 1288 | |
1210 | 1289 | $annot_str .= join(";",(sprintf("|%s:%d-%d:%d-%d:s=%d", |
1213 | 1292 | $annot_r->{sa_start},$annot_r->{sa_end},$ns_score), |
1214 | 1293 | sprintf("b=%.1f",$s_bit), |
1215 | 1294 | sprintf("I=%.3f",$annot_r->{percid}), |
1216 | sprintf("Q=%.1f",$qval),$annot_r->{descr})); | |
1295 | sprintf("Q=%.1f",$qval),"C=".$annot_r->{descr})); | |
1217 | 1296 | } |
1218 | 1297 | else { # site annotation |
1219 | 1298 | my $ann_type = $annot_r->{type}; |
1252 | 1331 | |
1253 | 1332 | --ann_script -- annotation script returning site/domain locations for subject sequences |
1254 | 1333 | -- same as --script |
1334 | ||
1335 | --have_qslen -- use a blast tabular format that includes the query and subject sequence lengths: | |
1336 | -- q_seqid q_len s_seqid s_len ... | |
1255 | 1337 | |
1256 | 1338 | --q_ann_script -- annotation script for query sequences |
1257 | 1339 | -- same as --q_script |
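
With `--have_qslen`, each tabular line carries the query and subject lengths directly after the corresponding sequence identifiers. The following is a minimal Python sketch of how such a line could be split into named fields, assuming the remaining columns follow the standard BLAST tabular order. The helper `parse_tab_line` and the sample values are hypothetical and are not part of the package.

```python
# Hypothetical helper (not part of the FASTA package): split one BLAST
# tabular line into a dict, with or without the qlen/slen columns.
STD_FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
              "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# with --have_qslen: q_seqid q_len s_seqid s_len ...
QSLEN_FIELDS = ["qseqid", "qlen", "sseqid", "slen"] + STD_FIELDS[2:]

INT_FIELDS = {"qlen", "slen", "length", "mismatch", "gapopen",
              "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_tab_line(line, have_qslen=False):
    """Return a dict of named fields; extra trailing columns are ignored."""
    fields = QSLEN_FIELDS if have_qslen else STD_FIELDS
    values = line.rstrip("\n").split("\t")
    hit = dict(zip(fields, values))
    for name, value in hit.items():
        if name in INT_FIELDS:
            hit[name] = int(value)
        elif name in FLOAT_FIELDS:
            hit[name] = float(value)
    return hit


# made-up example line, for illustration only
sample = "\t".join(["sp|P09488|GSTM1_HUMAN", "218", "sp|P28161|GSTM2_HUMAN", "218",
                    "84.40", "218", "34", "0", "1", "218", "1", "218",
                    "1e-100", "300.0"])
hit = parse_tab_line(sample, have_qslen=True)
```

Keeping the two field lists side by side makes it explicit that the only difference between the formats is the two inserted length columns.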
0 | #!/bin/bash | |
1 | ||
2 | cmd=""; | |
3 | for i in "$@" | |
4 | do | |
5 | case $i in | |
6 | -o=*|--outname=*) | |
7 | OUTNAME="${i#*=}" | |
8 | shift # past argument=value | |
9 | ;; | |
10 | -q=*|--query=*) | |
11 | QUERY="${i#*=}" | |
12 | cmd="$cmd -query $QUERY" | |
13 | shift # past argument=value | |
14 | ;; | |
15 | --ann_script=*) | |
16 | ANN_SCRIPT="${i#*=}" | |
17 | shift | |
18 | ;; | |
19 | --q_ann_script=*) | |
20 | Q_ANN_SCRIPT="${i#*=}" | |
21 | shift | |
22 | ;; | |
23 | *) | |
24 | cmd="$cmd $i" | |
25 | ;; | |
26 | esac | |
27 | done | |
28 | ||
29 | # echo "OUTNAME: " $OUTNAME | |
30 | # echo "CMD: " $cmd | |
31 | ||
32 | if [[ $OUTNAME == '' ]]; then | |
33 | OUTNAME=${QUERY}_out | |
34 | fi | |
35 | ||
36 | #if [[ $ANN_SCRIPT == '' ]]; then | |
37 | # ANN_SCRIPT="/seqprg/bin/ann_pfam30.pl --db=pfam31_qfo --host=localhost --neg --vdoms --acc_comment" | |
38 | #fi | |
39 | ||
40 | ||
41 | # echo "OUTNAME2: " $OUTNAME | |
42 | ||
43 | bl_asn="$OUTNAME.asn" | |
44 | bl0_out="$OUTNAME.html" | |
45 | bla_out="${OUTNAME}_an.html" | |
46 | blm_out="$OUTNAME.msa" | |
47 | blt_out="$OUTNAME.bl_tab" | |
48 | blt_ann="$OUTNAME.bl_tab_ann" | |
49 | blr_out="$OUTNAME.bl_tab_rn" | |
50 | ||
51 | # echo "tmp_files:" | |
52 | # echo $bl_asn $bl0_out $bla_out $blt_out | |
53 | ||
54 | # echo "OUTFILE = ${OUTNAME}" | |
55 | ||
56 | #export BLAST_PATH="/ebi/extserv/bin/ncbi-blast+/bin" | |
57 | export BLAST_PATH="/seqprg/bin" | |
58 | ||
59 | $BLAST_PATH/blastp -outfmt 11 $cmd > $bl_asn | |
60 | $BLAST_PATH/blast_formatter -archive $bl_asn -outfmt 0 -html > $bl0_out | |
61 | $BLAST_PATH/blast_formatter -archive $bl_asn -outfmt '7 qseqid qlen sseqid slen pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' > $blt_out | |
62 | annot_blast_btop2.pl --query $QUERY --have_qslen --dom_info --ann_script "$ANN_SCRIPT" --q_ann_script "$Q_ANN_SCRIPT" $blt_out > $blt_ann | |
63 | ||
64 | rename_exons.py --have_qslen --dom_info $blt_ann > $blr_out | |
65 | merge_blast_btab.pl --plot_url="plot_domain6t.cgi" --have_qslen --dom_info --btab $blr_out $bl0_out | |
66 | ||
67 | # $BLAST_PATH/blast_formatter -archive $bl_asn -outfmt 2 > $blm_out |
25 | 25 | |
26 | 26 | $BLAST_PATH/blastp -outfmt 11 $cmd > $bl_asn |
27 | 27 | $BLAST_PATH/blast_formatter -archive $bl_asn -outfmt 0 -html > $bl0_out |
28 | $BLAST_PATH/blast_formatter -archive $bl_asn -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' > $blt_out | |
28 | $BLAST_PATH/blast_formatter -archive $bl_asn -outfmt '7 qseqid qlen sseqid slen pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' > $blt_out | |
29 | 29 | $BLAST_PATH/blast_formatter -archive $bl_asn -outfmt 2 > $blm_out |
30 | 30 |
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2010, 2014 by William R. Pearson and The Rector & |
34 | 34 | ## sequences from an NCBI blast-formatted database. |
35 | 35 | ## |
36 | 36 | |
37 | use warnings; | |
37 | 38 | use strict; |
38 | 39 | use DBI; |
39 | 40 |
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2010, 2014 by William R. Pearson and The Rector & |
34 | 34 | ## sequences from an NCBI blast-formatted database. |
35 | 35 | ## |
36 | 36 | |
37 | use warnings; | |
37 | 38 | use strict; |
38 | 39 | use DBI; |
39 | 40 |
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2010, 2014 by William R. Pearson and The Rector & |
24 | 24 | # (2) take the uniprot accessions and produce a fasta library file |
25 | 25 | # from them |
26 | 26 | |
27 | use warnings; | |
27 | 28 | use strict; |
28 | 29 | use DBI; |
29 | 30 |
0 | #!/usr/bin/env perl | |
1 | ||
2 | ################################################################ | |
3 | # copyright (c) 2010, 2014 by William R. Pearson and The Rector & | |
4 | # Visitors of the University of Virginia */ | |
5 | ################################################################ | |
6 | # Licensed under the Apache License, Version 2.0 (the "License"); | |
7 | # you may not use this file except in compliance with the License. | |
8 | # You may obtain a copy of the License at | |
9 | # | |
10 | # http://www.apache.org/licenses/LICENSE-2.0 | |
11 | # | |
12 | # Unless required by applicable law or agreed to in writing, | |
13 | # software distributed under this License is distributed on an "AS | |
14 | # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either | |
15 | # express or implied. See the License for the specific language | |
16 | # governing permissions and limitations under the License. | |
17 | ################################################################ | |
18 | ||
19 | ## usage - expand_up_isoforms.pl [--prim_acc] up_hits.file > up_isoforms.file | |
20 | ## | |
21 | ## take a fasta36 -e expand.sh result file of the form: | |
22 | ## sp|P09488|GSTM1_HUMAN<tab>1.1e-50 | |
23 | ## | |
24 | ## and extract the accession number, looking it up from an SQL | |
25 | ## table $table -- in this case "annot2_iso" to provide Uniprot | |
26 | ## isoforms based on a uniprot accession. | |
27 | ## | |
28 | ## if --prim_acc, then the primary accession (used to find the isoforms) is added to the isoform seq_id, e.g. | |
29 | ## sp|P09488|GSTM1_HUMAN has isoforms: with --prim_acc, the identifiers become | |
30 | ## >iso|E7EWW9|E7EWW9_HUMAN >iso|E7EWW9|E7EWW9_HUMAN_P09488 | |
31 | ## >iso|H3BRM6|H3BRM6_HUMAN >iso|H3BRM6|H3BRM6_HUMAN_P09488 | |
32 | ## >iso|H3BQT3|H3BQT3_HUMAN >iso|H3BQT3|H3BQT3_HUMAN_P09488 | |
33 | ||
34 | use warnings; | |
35 | use strict; | |
36 | use Getopt::Long; | |
37 | use Pod::Usage; | |
38 | use DBI; | |
39 | ||
40 | my ($host, $db, $port, $user, $pass) = ("xdb", "uniprot", 0, "web_user", "fasta_www"); | |
41 | $host = 'wrpxdb.its.virginia.edu'; | |
42 | my ($a_table, $i_table) = ("annot2", "annot2_iso"); | |
43 | my ($help, $shelp) = (0,0); | |
44 | my ($e_thresh, $prim_acc) = (1e-6, 0); | |
45 | ||
46 | GetOptions( | |
47 | "h" => \$shelp, | |
48 | "help" => \$help, | |
49 | "host=s" => \$host, | |
50 | "prim_acc!" => \$prim_acc, | |
51 | "db=s" => \$db, | |
52 | "expect|evalue|e_thresh=f" => \$e_thresh, | |
53 | "user=s" => \$user, | |
54 | "password=s" => \$pass, | |
55 | "port=i" => \$port, | |
56 | "i_table=s" => \$i_table, | |
57 | "a_table=s" => \$a_table, | |
58 | ); | |
59 | ||
60 | pod2usage(1) if $shelp; | |
61 | pod2usage(exitstatus => 0, verbose => 2) if $help; | |
62 | pod2usage(1) unless (@ARGV || -p STDIN || -f STDIN); | |
63 | ||
64 | my $dbh = DBI->connect("dbi:mysql:host=$host:$db", | |
65 | $user, $pass, | |
66 | { RaiseError => 1, AutoCommit => 1} | |
67 | ) or die $DBI::errstr; | |
68 | ||
69 | my %sth = ( | |
70 | seed2link_acc => "SELECT acc FROM $i_table WHERE prim_acc=?", | |
71 | seed2link_id => "SELECT iso_a.acc FROM $i_table as iso_a JOIN $a_table as an2 on(iso_a.prim_acc=an2.acc) where an2.id=?", | |
72 | link2seq => "SELECT db, acc, prim_acc, id, descr, seq FROM annot2_iso JOIN protein_iso USING(acc) WHERE acc=?" | |
73 | ); | |
74 | ||
75 | for my $sth (keys(%sth)) { | |
76 | $sth{$sth} = $dbh->prepare($sth{$sth}); | |
77 | } | |
78 | ||
79 | my %acc_uniq = (); | |
80 | ||
81 | # get the query | |
82 | my ($query, $eval_arg) = @ARGV; | |
83 | $eval_arg = 1e-10 unless $eval_arg; | |
84 | $query =~ s/^>// if ($query); | |
85 | my @link_lines = (); | |
86 | ||
87 | #if it's a file I can open, read and parse it | |
88 | unless ($query && ($query =~ m/[\|:]/ || | |
89 | $query =~ m/^[NX]P_/ || | |
90 | $query =~ m/^[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}\s/)) { | |
91 | ||
92 | while (my $a_line = <>) { | |
93 | $a_line =~ s/^>//; | |
94 | chomp $a_line; | |
95 | push @link_lines, $a_line; | |
96 | } | |
97 | } | |
98 | else { | |
99 | push @link_lines, "$query\t$eval_arg"; | |
100 | } | |
101 | ||
102 | for my $line ( @link_lines ) { | |
103 | my ($hit, $e_val) = split(/\t/,$line); | |
104 | ||
105 | if ($e_val <= $e_thresh) { | |
106 | process_line($hit,$sth{seed2link_acc},$sth{seed2link_id}); | |
107 | } | |
108 | } | |
109 | ||
110 | for my $acc ( keys %acc_uniq ) { | |
111 | ||
112 | $sth{link2seq}->execute($acc); | |
113 | while (my $row_href = $sth{link2seq}->fetchrow_hashref ) { | |
114 | my $id_str = $row_href->{id}; | |
115 | if ($prim_acc) { | |
116 | $id_str .= "_".$row_href->{prim_acc}; | |
117 | } | |
118 | ||
119 | printf(">%s|%s|%s %s\n","iso",$acc,$id_str,$row_href->{descr}); | |
120 | my $iso_seq = $row_href->{seq}; | |
121 | $iso_seq =~ s/(.{60})/$1\n/g; | |
122 | ||
123 | print "$iso_seq\n"; | |
124 | } | |
125 | $sth{link2seq}->finish(); | |
126 | } | |
127 | ||
128 | $dbh->disconnect(); | |
129 | ||
130 | sub process_line{ | |
131 | my ($seqid,$sth_acc, $sth_id)=@_; | |
132 | ||
133 | my $sth = $sth_acc; | |
134 | ||
135 | my ($db, $link_acc, $link_id) = ("","",""); | |
136 | ||
137 | if ($seqid =~ m/\|/) { | |
138 | ($db, $link_acc, $link_id) = split('\|',$seqid); | |
139 | $link_acc =~ s/\.\d+$//; | |
140 | ||
141 | $sth_acc->execute($link_acc); | |
142 | } | |
143 | elsif ($seqid =~ m/:/) { | |
144 | ($db, $link_id) = split(':',$seqid); | |
145 | $sth_id->execute($link_id); | |
146 | $sth = $sth_id; | |
147 | } | |
148 | else { | |
149 | $link_acc = $seqid; | |
150 | $link_acc =~ s/\.\d+$//; | |
151 | $sth_acc->execute($link_acc); | |
152 | } | |
153 | ||
154 | while (my ($acc) = $sth->fetchrow_array()) { | |
155 | next if ($acc eq $link_acc); | |
156 | $acc_uniq{$acc} = $link_acc unless $acc_uniq{$acc}; | |
157 | } | |
158 | $sth->finish(); | |
159 | } | |
160 | ||
161 | __END__ | |
162 | ||
163 | =pod | |
164 | ||
165 | =head1 NAME | |
166 | ||
167 | expand_up_isoforms.pl expand_file.tab | |
168 | ||
169 | =head1 SYNOPSIS | |
170 | ||
171 | expand_up_isoforms.pl expand_file.tab | |
172 | ||
173 | =head1 OPTIONS | |
174 | ||
175 | -h short help | |
176 | --help include description | |
177 | --evalue E()-value threshold for expansion | |
178 | --prim_acc : show primary accession as part of sequence identifier | |
179 | >iso|E7EWW9|E7EWW9_HUMAN becomes >iso|E7EWW9|E7EWW9_HUMAN_P09488 | |
180 | ||
181 | --host, --user, --password, --port --db : info for mysql database | |
182 | --a_table, --i_table -- SQL table names with reference and isoform acc/id/prim_acc mappings. | |
183 | ||
184 | =head1 DESCRIPTION | |
185 | ||
186 | C<expand_up_isoforms.pl> uses protein isoform tables in an SQL database to identify and extract | |
187 | isoforms of proteins in a reference protein sequence database. | |
188 | ||
189 | C<expand_up_isoforms.pl> takes a file with sequence identifiers and E()-values of the form: | |
190 | ||
191 | sp|P09488|GSTM1_HUMAN <tab> 1e-40 | |
192 | sp:CALM_HUMAN <tab> 1e-40 | |
193 | ||
194 | Lines with E()-values less than --evalue (1E-6 by default) are used to | |
195 | identify protein isoforms, which are included in the set of sequences to be aligned. | |
196 | ||
197 | C<expand_up_isoforms.pl> is designed to be used by the B<FASTA> programs with | |
198 | the C<-e expand_up_isoforms.pl> option. | |
199 | ||
200 | =head1 AUTHOR | |
201 | ||
202 | William R. Pearson, wrp@virginia.edu | |
203 | ||
204 | =cut |
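
The input handling described in the POD above (a sequence identifier, a tab, and an E()-value, with `db|acc|id`, `db:id`, or bare-accession identifiers) can be sketched in Python as follows. This is an illustration under stated assumptions, not the script's implementation: `parse_hit` and `select_hits` are hypothetical names, and the database lookup is left out entirely.

```python
import re


def parse_hit(line):
    """Split 'seqid<tab>evalue' into (db, acc, seq_id, evalue).

    Handles the three identifier forms described above:
    'db|acc|id', 'db:id', and a bare accession.
    """
    seqid, e_str = line.rstrip("\n").lstrip(">").split("\t")
    seqid = seqid.strip()
    db = acc = seq_id = ""
    if "|" in seqid:
        parts = seqid.split("|")
        db, acc = parts[0], parts[1]
        seq_id = parts[2] if len(parts) > 2 else ""
        acc = re.sub(r"\.\d+$", "", acc)      # drop a version suffix
    elif ":" in seqid:
        db, seq_id = seqid.split(":", 1)
    else:
        acc = re.sub(r"\.\d+$", "", seqid)
    return db, acc, seq_id, float(e_str)


def select_hits(lines, e_thresh=1e-6):
    """Keep only hits whose E()-value is at or below the threshold."""
    hits = [parse_hit(line) for line in lines]
    return [h for h in hits if h[3] <= e_thresh]


kept = select_hits(["sp|P09488|GSTM1_HUMAN\t1e-40",
                    "sp:CALM_HUMAN\t1e-40",
                    "NP_000552.2\t0.5"])
```

Only the first two lines pass the default threshold of 1e-6; the third is parsed but filtered out.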
0 | #!/bin/bash | |
1 | ||
2 | cmd=""; | |
3 | for i in "$@" | |
4 | do | |
5 | case $i in | |
6 | --outname=*) | |
7 | OUTNAME="${i#*=}" | |
8 | shift # past argument=value | |
9 | ;; | |
10 | --query=*) | |
11 | QUERY="${i#*=}" | |
12 | shift # past argument=value | |
13 | ;; | |
14 | --db=*) | |
15 | DATABASE="${i#*=}" | |
16 | shift # past argument=value | |
17 | ;; | |
18 | --cmd=*) | |
19 | SRCH_CMD="${i#*=}" | |
20 | shift | |
21 | ;; | |
22 | --ktup=*) | |
23 | KTUP="${i#*=}" | |
24 | shift | |
25 | ;; | |
26 | *) | |
27 | cmd="$cmd $i" | |
28 | ;; | |
29 | esac | |
30 | done | |
31 | ||
32 | ||
33 | # echo "OUTNAME: " $OUTNAME | |
34 | echo "# CMD: " $cmd | |
35 | ||
36 | if [[ $OUTNAME == '' ]]; then | |
37 | OUTNAME=${QUERY}_out | |
38 | fi | |
39 | ||
40 | if [[ $SRCH_CMD == '' ]]; then | |
41 | SRCH_CMD=fasta36 | |
42 | fi | |
43 | ||
44 | #if [[ $ANN_SCRIPT == '' ]]; then | |
45 | # ANN_SCRIPT="/seqprg/bin/ann_pfam30.pl --db=pfam31_qfo --host=localhost --neg --vdoms --acc_comment" | |
46 | #fi | |
47 | ||
48 | ||
49 | # echo "OUTNAME: " $OUTNAME | |
50 | ||
51 | bl0_out="$OUTNAME.html" | |
52 | bla_out="${OUTNAME}_an.html" | |
53 | blt_out="$OUTNAME.fa_tab" | |
54 | blr_out="$OUTNAME.fa_tab_rn" | |
55 | ||
56 | export BLAST_PATH="/seqprg/bin" | |
57 | # BLAST_PATH="../bin" | |
58 | ||
59 | cmd="$cmd -mF8CBL=$blt_out $QUERY $DATABASE" | |
60 | ||
61 | # echo "tmp_files:" | |
62 | # echo $bl_asn $bl0_out $bla_out $blt_out | |
63 | # echo "OUTFILE = ${OUTNAME}" | |
64 | ||
65 | #echo "cmd: $cmd" | |
66 | #echo "===" | |
67 | #echo "bl0_out: $bl0_out" | |
68 | #echo "===" | |
69 | ||
70 | # echo "$BLAST_PATH/$SRCH_CMD $cmd > $bl0_out" | |
71 | ||
72 | # run the program | |
73 | $BLAST_PATH/$SRCH_CMD $cmd > $bl0_out | |
74 | ||
75 | $BLAST_PATH/rename_exons.py --have_qslen --dom_info $blt_out > $blr_out | |
76 | ||
77 | if [ ! -s $blr_out ]; then | |
78 | # echo "# " `ls -l $blt_out $blr_out` | |
79 | blr_out=$blt_out | |
80 | # echo "# " `ls -l $blt_out $blr_out` | |
81 | fi | |
82 | ||
83 | $BLAST_PATH/merge_fasta_btab.pl --plot_url="plot_domain6t.cgi" --have_qslen --dom_info --btab $blr_out $bl0_out |
0 | #!/usr/bin/python | |
1 | ||
2 | ################ | |
3 | ## get_hg38_bed.py parses an HG38 coordinate into a pseudo-bed entry, | |
4 | ## and runs bedtools getfasta to return the fasta sequence | |
5 | ## | |
6 | ||
7 | import sys | |
8 | import re | |
9 | from subprocess import Popen, PIPE, STDOUT | |
10 | import shlex | |
11 | import argparse | |
12 | ||
13 | ## a genome_loc should look like: chr#:start-stop | |
14 | ## if stop < start, coordinates are reversed | |
15 | ||
16 | genome_dict={'hg38':'genome_dna/hg38/reference.fa', | |
17 | 'mm10':'genome_dna/mm10/reference.fa', | |
18 | 'rn6':'genome_dna/rn6/rn6.fa'} | |
19 | ||
20 | parser=argparse.ArgumentParser(description='get_genome_seq.py : get fasta sequence from genome coordinates ') | |
21 | parser.add_argument('--genome', help='genome: hg38 | mm10 | rn6',dest='genome',action='store',default='hg38') | |
22 | parser.add_argument('coords', help='genome coordinates chr1:12345-54321', nargs='*') | |
23 | ||
24 | args=parser.parse_args() | |
25 | ||
26 | bed_cmd = 'bedtools getfasta -fi $RDLIB2/%s -bed stdin' % (genome_dict[args.genome]) | |
27 | ||
28 | bed_lines = '' | |
29 | for genome_loc in args.coords: | |
30 | ||
31 | chrom, g_range = genome_loc.split(':') | |
32 | g_start, g_end = g_range.split('-') | |
33 | ||
34 | g_start, g_end = int(g_start), int(g_end)  # compare as integers, not strings | |
35 | ||
36 | if (g_start > g_end): | |
37 | g_start, g_end = g_end, g_start | |
38 | g_start -= 1 | |
39 | ||
40 | bed_lines += '%s\t%d\t%d\n' % (chrom, g_start, g_end) | |
41 | ||
42 | bed_p = Popen(bed_cmd, stdout=PIPE, stdin=PIPE, stderr=STDOUT, shell=True) | |
43 | out, err = bed_p.communicate(input=bed_lines) | |
44 | ||
45 | for line in out.split('\n'): | |
46 | if (line and line[0]=='>'): | |
47 | (chrom, start, stop) = re.search(r'>([^:]+):(\d+)\-(\d+)',line).groups() | |
48 | print line + " @C:%s" % (start) | |
49 | elif (line): | |
50 | print line | |
51 | ||
52 |
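
The coordinate handling in this script (a 1-based `chr:start-stop` location turned into a 0-based, half-open BED interval, with reversed coordinates swapped) can be sketched on its own. The helper name `loc_to_bed` is hypothetical; the sketch converts to integers before comparing, so that `100-9` is recognized as reversed.

```python
def loc_to_bed(genome_loc):
    """Convert 'chr:start-stop' (1-based, inclusive) to a BED line.

    BED intervals are 0-based and half-open, so only the start is
    decremented. Reversed coordinates (stop < start) are swapped.
    """
    chrom, g_range = genome_loc.split(":")
    g_start, g_end = (int(x) for x in g_range.split("-"))
    if g_start > g_end:
        g_start, g_end = g_end, g_start
    return "%s\t%d\t%d" % (chrom, g_start - 1, g_end)


bed_lines = "\n".join(loc_to_bed(loc) for loc in ["chr1:12345-54321", "chr1:100-9"])
```

The resulting text can be passed on standard input to a tool that reads BED intervals, as the script does with `bedtools getfasta`.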
0 | #!/usr/bin/python | |
1 | ||
2 | ## get_protein.py -- | |
3 | ## get a protein sequence from Uniprot or NCBI/Refseq using the accession | |
4 | ## | |
5 | ||
6 | import sys | |
7 | import re | |
8 | import textwrap | |
9 | from urllib2 import urlopen | |
10 | ||
11 | ncbi_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" | |
12 | uniprot_url = "https://www.uniprot.org/uniprot/" | |
13 | ||
14 | sub_range = '' | |
15 | for acc in sys.argv[1:]: | |
16 | ||
17 | if (re.search(r':',acc)): | |
18 | (acc, sub_range) = acc.split(':') | |
19 | ||
20 | if (re.match(r'^(sp|tr|iso|ref)\|',acc)): | |
21 | acc=acc.split('|')[1] | |
22 | ||
23 | if (re.match(r'[NX]P_',acc)): | |
24 | db_type="protein" | |
25 | ||
26 | seq_args = "db=%s&id=%s&rettype=fasta" % (db_type, acc) | |
27 | seq_html = urlopen(ncbi_url + seq_args).read() | |
28 | else: | |
29 | seq_html = urlopen(uniprot_url + acc + ".fasta").read() | |
30 | ||
31 | header='' | |
32 | seq = '' | |
33 | for line in seq_html.split('\n'): | |
34 | if (line and line[0]=='>'): | |
35 | # print out old one if there | |
36 | if (header): | |
37 | if (sub_range): | |
38 | start, stop = sub_range.split('-') | |
39 | start, stop = int(start), int(stop) | |
40 | if (start > 0): | |
41 | start -= 1 | |
42 | new_seq = seq[start:stop] | |
43 | else: | |
44 | start = 0 | |
45 | new_seq = seq | |
46 | ||
47 | if (start > 0): | |
48 | print "%s @C:%d" %(header, start+1) | |
49 | else: | |
50 | print header | |
51 | print '\n'.join(textwrap.wrap(new_seq)) | |
52 | ||
53 | header = line; | |
54 | seq = '' | |
55 | else: | |
56 | seq += line | |
57 | ||
58 | start=0 | |
59 | if (sub_range): | |
60 | start, stop = sub_range.split('-') | |
61 | start, stop = int(start), int(stop) | |
62 | if (start > 0): | |
63 | start -= 1 | |
64 | new_seq = seq[start:stop] | |
65 | else: | |
66 | new_seq = seq | |
67 | ||
68 | if (start > 0): | |
69 | print "%s @C:%d" %(header, start+1) | |
70 | else: | |
71 | print header | |
72 | ||
73 | print '\n'.join(textwrap.wrap(new_seq)) |
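
The sub-range logic appears twice in this script, once inside the loop and once after it. A small helper along the following lines could hold it in one place. This is a sketch only: `extract_range` is a hypothetical name, and it assumes the `acc:start-stop` convention and the `@C:<offset>` header suffix used above.

```python
import textwrap


def extract_range(header, seq, sub_range=""):
    """Return (header, wrapped_sequence) for an optional 'start-stop' range.

    Coordinates are 1-based and inclusive. When the extracted region does
    not begin at residue 1, '@C:<start>' is appended to the header so that
    downstream coordinates can be offset.
    """
    start = 0
    if sub_range:
        start, stop = (int(x) for x in sub_range.split("-"))
        if start > 0:
            start -= 1              # convert to a 0-based slice start
        seq = seq[start:stop]
    if start > 0:
        header = "%s @C:%d" % (header, start + 1)
    return header, "\n".join(textwrap.wrap(seq))


hdr, body = extract_range(">sp|P09488|GSTM1_HUMAN", "ABCDEFGHIJ", "3-6")
```

Calling the helper with an empty `sub_range` returns the header unchanged and the full sequence wrapped at the default `textwrap` width.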
0 | #!/usr/bin/python | |
1 | ||
2 | import sys | |
3 | import re | |
4 | from urllib2 import urlopen | |
5 | ||
6 | ||
7 | db_type="protein" | |
8 | if (re.match(r'[NX]M_',sys.argv[1])): | |
9 | db_type="nucleotide" | |
10 | ||
11 | seq_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" | |
12 | seq_args = "db=%s&id=" % (db_type) + ",".join(sys.argv[1:]) + "&rettype=fasta" | |
13 | ||
14 | seq_html = urlopen(seq_url + seq_args).read() | |
15 | ||
16 | print seq_html |
0 | #!/usr/bin/python | |
1 | ||
2 | import sys | |
3 | from urllib import urlopen | |
4 | ||
5 | ARGV = sys.argv[1:]; | |
6 | ||
7 | for acc in ARGV : | |
8 | url = "https://www.uniprot.org/uniprot/" + acc + ".fasta" | |
9 | # print url | |
10 | fa_seq = urlopen(url).read() | |
11 | print fa_seq |
0 | #!/usr/bin/python | |
1 | ||
2 | import sys | |
3 | import re | |
4 | import textwrap | |
5 | import argparse | |
6 | import MySQLdb.cursors | |
7 | ||
8 | from urllib2 import urlopen | |
9 | ||
10 | ncbi_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" | |
11 | uniprot_url = "https://www.uniprot.org/uniprot/" | |
12 | ||
13 | db = MySQLdb.connect(db='uniprot', host='xdb', user='web_user', passwd='fasta_www', | |
14 | cursorclass=MySQLdb.cursors.DictCursor) | |
15 | ||
16 | cur1 = db.cursor() | |
17 | cur2 = db.cursor() | |
18 | get_iso_acc='select acc from annot2_iso where prim_acc="%s"' | |
19 | get_fasta_info='select db, acc, id, descr, seq from annot2 join protein using(acc) where acc="%s"' | |
20 | get_iso_fasta_info='select db, acc, id, descr, seq from annot2_iso join protein_iso using(acc) where prim_acc="%s"' | |
21 | ||
22 | fasta_seqs=[] | |
23 | ||
24 | for acc in sys.argv[1:]: | |
25 | ||
26 | if (re.search(r':',acc)): | |
27 | (acc, sub_range) = acc.split(':') | |
28 | ||
29 | if (re.match(r'^(sp|tr|iso|ref)\|',acc)): | |
30 | acc=acc.split('|')[1] | |
31 | ||
32 | cur1.execute(get_fasta_info%(acc,)) | |
33 | row = cur1.fetchone() | |
34 | if (row): | |
35 | fasta_seqs.append(row) | |
36 | else: | |
37 | sys.stderr.write("***error*** %s sequence not found\n"%(acc)) | |
38 | continue | |
39 | ||
40 | cur2.execute(get_iso_fasta_info%(acc,)) | |
41 | for row in cur2: | |
42 | fasta_seqs.append(row) | |
43 | ||
44 | for row in fasta_seqs: | |
45 | print ">%s|%s|%s %s"%(row['db'],row['acc'],row['id'],row['descr']) | |
46 | print '\n'.join(textwrap.wrap(row['seq'])) | |
47 | ||
48 | ||
49 | ||
50 |
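
The lookups in this script build their SQL by interpolating the accession into the query text. A possible alternative is to pass the accession as a bound parameter. The sketch below illustrates the idea with the standard-library `sqlite3` module standing in for MySQL, which is an assumption made only to keep the example self-contained. The table and column names follow the queries above; the helper names and the demo row are hypothetical. Note that `sqlite3` uses `?` placeholders, whereas MySQLdb uses `%s`.

```python
import sqlite3
import textwrap


def fetch_fasta(conn, acc):
    """Return a FASTA record for acc, or None when it is not found."""
    cur = conn.execute(
        "SELECT db, annot2.acc, id, descr, seq "
        "FROM annot2 JOIN protein USING(acc) WHERE annot2.acc = ?",
        (acc,),                     # bound parameter, never string-formatted
    )
    row = cur.fetchone()
    if row is None:
        return None
    db, acc, seq_id, descr, seq = row
    return ">%s|%s|%s %s\n%s" % (db, acc, seq_id, descr,
                                 "\n".join(textwrap.wrap(seq)))


def demo_connection():
    """Build an in-memory stand-in database with one made-up entry."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE annot2 (db TEXT, acc TEXT, id TEXT, descr TEXT)")
    conn.execute("CREATE TABLE protein (acc TEXT, seq TEXT)")
    conn.execute("INSERT INTO annot2 VALUES ('sp', 'P09488', 'GSTM1_HUMAN', 'demo entry')")
    conn.execute("INSERT INTO protein VALUES ('P09488', 'ACDEFGHIKL')")
    return conn


conn = demo_connection()
record = fetch_fasta(conn, "P09488")
```

With a bound parameter, an accession containing quote characters is treated as data rather than as part of the SQL statement.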
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | # lav2plt.pl - produce plot from lav output */ |
3 | 3 | |
21 | 21 | # governing permissions and limitations under the License. |
22 | 22 | ################################################################ |
23 | 23 | |
24 | use warnings; | |
24 | 25 | use strict; |
25 | 26 | use Getopt::Long; |
26 | 27 | use Pod::Usage; |
0 | #!/usr/bin/env perl | |
1 | # | |
0 | 2 | ################################################################ |
1 | 3 | # copyright (c) 2012, 2014 by William R. Pearson and The Rector & |
2 | 4 | # Visitors of the University of Virginia */ |
13 | 15 | # express or implied. See the License for the specific language |
14 | 16 | # governing permissions and limitations under the License. |
15 | 17 | ################################################################ |
18 | ||
19 | use warnings; | |
20 | use strict; | |
16 | 21 | |
17 | 22 | #define SX(x) (int)((double)(x)*fxscal+fxoff+24) |
18 | 23 | sub SX { |
0 | ||
0 | #!/usr/bin/env perl | |
1 | # | |
1 | 2 | ################################################################ |
2 | 3 | # copyright (c) 2012, 2014 by William R. Pearson and The Rector & |
3 | 4 | # Visitors of the University of Virginia */ |
15 | 16 | # governing permissions and limitations under the License. |
16 | 17 | ################################################################ |
17 | 18 | |
19 | use warnings; | |
20 | use strict; | |
21 | ||
18 | 22 | #define SX(x) (int)((double)(x)*fxscal+fxoff+6) |
19 | 23 | sub SX { |
20 | 24 | my $xx = shift; |
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014 by William R. Pearson and The Rector & |
16 | 16 | # governing permissions and limitations under the License. |
17 | 17 | ################################################################ |
18 | 18 | |
19 | use warnings; | |
19 | 20 | use strict; |
20 | 21 | use DBI; |
21 | 22 | use Getopt::Long; |
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014,2015 by William R. Pearson and The Rector & |
39 | 39 | # |
40 | 40 | ################################################################ |
41 | 41 | |
42 | use warnings; | |
42 | 43 | use strict; |
43 | 44 | use IPC::Open2; |
44 | 45 | use Pod::Usage; |
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014,2015 by William R. Pearson and The Rector & |
36 | 36 | # |
37 | 37 | ################################################################ |
38 | 38 | |
39 | use warnings; | |
39 | 40 | use strict; |
40 | 41 | use IPC::Open2; |
41 | 42 | use Pod::Usage; |
0 | #!/usr/bin/env python | |
1 | # | |
2 | # given a -m8CB file with exon annotations for the query and subject, | |
3 | # provide a function that maps subject coordinates to query, or vice versa | |
4 | ||
5 | ################################################################ | |
6 | # copyright (c) 2018 by William R. Pearson and The Rector & | |
7 | # Visitors of the University of Virginia */ | |
8 | ################################################################ | |
9 | # Licensed under the Apache License, Version 2.0 (the "License"); | |
10 | # you may not use this file except in compliance with the License. | |
11 | # You may obtain a copy of the License at | |
12 | # | |
13 | # http://www.apache.org/licenses/LICENSE-2.0 | |
14 | # | |
15 | # Unless required by applicable law or agreed to in writing, | |
16 | # software distributed under this License is distributed on an "AS | |
17 | # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either | |
18 | # express or implied. See the License for the specific language | |
19 | # governing permissions and limitations under the License. | |
20 | ################################################################ | |
21 | ||
22 | import fileinput | |
23 | import sys | |
24 | import re | |
25 | import argparse | |
26 | import copy | |
27 | ||
28 | ################ | |
29 | # "domain" class that describes a domain/exon alignment annotation | |
30 | # | |
31 | class exonInfo: | |
32 | def __init__(self, name, q_target, p_start, p_end, chrom, d_start, d_end, full_text): | |
33 | self.name = name | |
34 | self.q_target = q_target | |
35 | self.p_start = p_start | |
36 | self.p_end = p_end | |
37 | self.chrom = chrom | |
38 | self.d_start = d_start | |
39 | self.d_end = d_end | |
40 | self.text = full_text | |
41 | self.plus_strand = True | |
42 | if (d_start > d_end): | |
43 | self.plus_strand = False | |
44 | ||
45 | def __str__(self): | |
46 | rxr_str = "XD" | |
47 | if (self.q_target): | |
48 | rxr_str="DX" | |
49 | return '|%s:%i-%i:%s{%s:%i-%i}' % (rxr_str, self.p_start, self.p_end, self.name, self.chrom, self.d_start, self.d_end) | |
50 | ||
51 | class exonAlign: | |
52 | def __init__(self, name, q_target, qp_start, qp_end, sp_start, sp_end, full_text): | |
53 | self.exon = None | |
54 | ||
55 | self.name = name | |
56 | self.q_target = q_target | |
57 | ||
58 | self.q_start = qp_start | |
59 | self.q_end = qp_end | |
60 | self.s_start = sp_start | |
61 | self.s_end = sp_end | |
62 | ||
63 | self.text = full_text | |
64 | self.out_str = '' | |
65 | ||
66 | def __str__(self): | |
67 | rxr_str = "RX" | |
68 | if (self.q_target): | |
69 | rxr_str="XR" | |
70 | return "[%s:%i-%i:%i-%i::%s" % (rxr_str,self.q_start, self.q_end, self.s_start, self.s_end, self.name) | |
71 | ||
72 | def print_bar_str(self): # checking for 'NADA' | |
73 | if (not self.out_str): | |
74 | self.out_str = self.text | |
75 | return str("|%s"%(self.out_str)) | |
76 | ||
77 | # Parses domain annotations after split at '|' | |
78 | ||
79 | # | |
80 | def parse_exon_align(text): | |
81 | # takes a domain in string form, turns it into a domain object | |
82 | # looks like: RX:5-82:5-82:s=397;b=163.1;I=1.000;Q=453.6;C=C.Thioredoxin~1 | |
83 | # could also look like: RX:5-82:5-82:s=397;b=163.1;I=1.000;Q=453.6;C=C.Thioredoxin{PF012445}~1 | |
84 | ||
85 | # get RX/XR and qstart/qstop sstart/sstop as strings | |
86 | m = re.search(r'^(\w+):(\d+)-(\d+):(\d+)-(\d+):',text) | |
87 | if (m): | |
88 | (RXRState, qstart_s, qend_s, sstart_s, send_s) = m.groups() | |
89 | else: | |
90 | sys.stderr.write("could not parse exon location: %s\n"%(text)) | |
91 | ||
92 | # get domain name/color (and possibly {info}) | |
93 | ||
94 | (name, color_s) = re.search(r';C=([^~]+)~(.+)$',text).groups() | |
95 | info_s="" | |
96 | ||
97 | if (re.search(r'\}$',name)): | |
98 | (name, info_s) = re.search(r'([^\{]+)(\{[^\}]+\})$',name).groups() | |
99 | ||
100 | q_target = True | |
101 | if (RXRState=='XR'): | |
102 | q_target = False | |
103 | ||
104 | exon_align = exonAlign(name, q_target, int(qstart_s), int(qend_s), int(sstart_s), int(send_s), | |
105 | text) | |
106 | ||
107 | return exon_align | |
108 | ||
109 | ################ | |
110 | # exon_info is like domain, but no scores | |
111 | # | |
112 | def parse_exon_info(text): | |
113 | # takes a domain in string form, turns it into a domain object | |
114 | # looks like: DX:1-100;C=C.Thioredoxin~1 | |
115 | ||
116 | (RXRState, start_s, end_s,name, color) = re.search(r'^(\w+):(\d+)-(\d+);C=([^~]+)~(.*)$',text).groups() | |
117 | info = "" | |
118 | if (re.search(r'\}$',name)): | |
119 | (name, info) = re.search(r'([^\{]+)(\{[^\}]+\})$',name).groups() | |
120 | ||
121 | gene_re = re.search(r'^\{(\w+):(\d+)\-(\d+)\}',info) | |
122 | if (gene_re): | |
123 | (chrom, d_start, d_end) = gene_re.groups() | |
124 | else: | |
125 | sys.stderr.write("genome info not found: %s\n" % (text)) | |
126 | ||
127 | q_target = True; | |
128 | if (RXRState == 'XD'): | |
129 | q_target = False | |
130 | ||
131 | exon_info = exonInfo(name, q_target, int(start_s), int(end_s), chrom, int(d_start), int(d_end), text) | |
132 | ||
133 | return exon_info | |
134 | ||
135 | #### | |
136 | # parse_protein(result_line) | |
137 | # takes a protein in string format, turns it into a dictionary properly | |
138 | # looks like: sp|P30711|GSTT1_HUMAN up|Q2NL00|GSTT1_BOVIN 86.67 240 32 0 1 240 1 240 1.4e-123 444.0 16VI7DR6IT3IR15KQ3AI6TI11TA7YH8RC12TA3SN10FL10QETM2AT6VMTA2LV2DG4ND6PS24EK6TA11DV14FSPQ5IL3LMML1WK5RQ |XR:4-76:4-76:s=327;b=134.6;I=0.895;Q=367.8;C=C.Thioredoxin~1|RX:5-82:5-82:s=356;b=146.5;I=0.902;Q=403.3;C=C.Thioredoxin~1|RX:83-93:83-93:s=52;b=21.4;I=0.818;Q=30.9;C=NODOM~0|XR:77-93:77-93:s=86;b=35.4;I=0.882;Q=72.6;C=NODOM~0|RX:94-110:94-110:s=88;b=36.2;I=0.882;Q=75.0;C=vC.GST_C~2v|XR:94-110:94-110:s=88;b=36.2;I=0.882;Q=75.0;C=vC.GST_C~2v|RX:111-201:111-201:s=409;b=168.3;I=0.868;Q=468.3;C=C.GST_C~2|XR:111-201:111-201:s=409;b=168.3;I=0.868;Q=468.3;C=C.GST_C~2|RX:202-240:202-240:s=154;b=63.4;I=0.795;Q=155.9;C=NODOM~0|XR:202-240:202-240:s=154;b=63.4;I=0.795;Q=155.9;C=NODOM~0 | |
139 | # | |
140 | def parse_protein(line_data,fields, req_name): | |
141 | # last part (exon annotations) is split('|') and parsed by parse_exon_align() | |
142 | ||
143 | data = {} | |
144 | data = dict(zip(fields, line_data)) | |
145 | if (re.search(r'\|',data['qseqid'])): | |
146 | data['qseq_acc'] = data['qseqid'].split('|')[1] | |
147 | else: | |
148 | data['qseq_acc'] = data['qseqid'] | |
149 | ||
150 | if (re.search(r'\|',data['sseqid'])): | |
151 | data['sseq_acc'] = data['sseqid'].split('|')[1] | |
152 | else: | |
153 | data['sseq_acc'] = data['sseqid'] | |
154 | ||
155 | Qexon_list = [] | |
156 | Sexon_list = [] | |
157 | ||
158 | Qinfo_list = [] | |
159 | Sinfo_list = [] | |
160 | ||
161 | counter = 0 | |
162 | ||
163 | if ('align_annot' in data and len(data['align_annot']) > 0): | |
164 | for exon_str in data['align_annot'].split('|')[1:]: | |
165 | if (req_name and not re.search(req_name, exon_str)): | |
166 | continue | |
167 | ||
168 | counter += 1 | |
169 | exon = parse_exon_align(exon_str) | |
170 | if (exon.q_target): | |
171 | Qexon_list.append(exon) | |
172 | else: | |
173 | Sexon_list.append(exon) | |
174 | ||
175 | data['q_exalign_list'] = Qexon_list | |
176 | data['s_exalign_list'] = Sexon_list | |
177 | ||
178 | if ('exon_info' in data and len(data['exon_info']) > 0): | |
179 | for info_str in data['exon_info'].split('|')[1:]: | |
180 | if (not re.search(r'^[DX][XD]',info_str)): | |
181 | continue | |
182 | ||
183 | dinfo = parse_exon_info(info_str) | |
184 | ||
185 | if (dinfo.q_target): | |
186 | Qinfo_list.append(dinfo) | |
187 | else: | |
188 | Sinfo_list.append(dinfo) | |
189 | ||
190 | ||
191 | # put links to info_list into exon_list so info_list names can | |
192 | # be changed -- give S/Qinfo's the S/Qdom ids of the overlapping domain | |
193 | ||
194 | # find_info_overlaps(Qinfo_list, Qexon_list) | |
195 | # find_info_overlaps(Sinfo_list, Sexon_list) | |
196 | ||
197 | data['q_exinfo_list'] = Qinfo_list | |
198 | data['s_exinfo_list'] = Sinfo_list | |
199 | ||
200 | return data | |
201 | ||
202 | ################ | |
203 | # | |
204 | # decode_btop() - | |
205 | # input: a blast BTOP string of the form: "1VA160TS7KG10RK27" | |
206 | # returns a list of string tokens: ["1", "VA", "160", "TS", "7", "KG", "10", "RK", "27"] | |
207 | def decode_btop(btop_str): | |
208 | out_tokens = [] | |
209 | for token in re.split(r'(\d+)',btop_str): | |
210 | if (not token): continue | |
211 | if re.match(r'\d+',token): | |
212 | out_tokens.append(token) | |
213 | else: | |
214 | for mismat in re.split(r'(..)',token): | |
215 | if (mismat): out_tokens.append(mismat) | |
216 | ||
217 | return out_tokens | |
218 | ||
219 | ################ | |
220 | # | |
221 | # map_align(btop, q_start, s_start) | |
222 | # input: btop | |
223 | # output: q_pos_arr, s_pos_arr | |
224 | # | |
225 | def map_align(btop_str, q_start, s_start): | |
226 | ||
227 | q_pos = q_start | |
228 | s_pos = s_start | |
229 | ||
230 | q_pos_arr = [] | |
231 | s_pos_arr = [] | |
232 | ||
233 | btop_tokens = decode_btop(btop_str) | |
234 | ||
235 | for t in btop_tokens: | |
236 | if (re.match(r'\d+',t)): | |
237 | for i in range(int(t)) : | |
238 | q_pos_arr.append(q_pos) | |
239 | q_pos += 1 | |
240 | s_pos_arr.append(s_pos) | |
241 | s_pos += 1 | |
242 | elif (re.match(r'\-\w',t)): | |
243 | q_pos_arr.append(q_pos) | |
244 | s_pos_arr.append(s_pos) | |
245 | s_pos += 1 | |
246 | elif (re.match(r'\w\-',t)): | |
247 | q_pos_arr.append(q_pos) | |
248 | q_pos += 1 | |
249 | s_pos_arr.append(s_pos) | |
250 | else: | |
251 | q_pos_arr.append(q_pos) | |
252 | q_pos += 1 | |
253 | s_pos_arr.append(s_pos) | |
254 | s_pos += 1 | |
255 | ||
256 | return q_pos_arr, s_pos_arr | |
257 | ||
258 | ################ | |
259 | # | |
260 | # map_coords(from_coords, to_coords, coord_list) | |
261 | # | |
262 | def map_coords(from_coords, to_coords, coord_list): | |
263 | ||
264 | mapped_coords = [] | |
265 | ||
266 | fx = 0 | |
267 | mx = 0 | |
268 | while mx < len(coord_list): | |
269 | this_from_coord = coord_list[mx] | |
270 | while (from_coords[fx] < this_from_coord): | |
271 | fx += 1 | |
272 | continue | |
273 | ||
274 | mapped_coords.append(to_coords[fx]) | |
275 | mx += 1 | |
276 | ||
277 | return mapped_coords | |
278 | ||
279 | ################ | |
280 | # | |
281 | # map_align_coords() given a BTOP, q_start, s_start, and s_target, generate s_coords for list of q_coords | |
282 | # | |
283 | def map_align_coords(btop_str, q_start, s_start, s_target, coord_list): | |
284 | ||
285 | (q_coords, s_coords) = map_align(btop_str, q_start, s_start) | |
286 | ||
287 | sorted_coord_list = sorted(coord_list) | |
288 | ||
289 | if (s_target): | |
290 | s_mapped_coords = map_coords(q_coords, s_coords, sorted_coord_list) | |
291 | else: | |
292 | s_mapped_coords = map_coords(s_coords, q_coords, sorted_coord_list) | |
293 | ||
294 | coord_dict={} | |
295 | for ix, s_coord in enumerate(sorted_coord_list): | |
296 | coord_dict[s_coord]=s_mapped_coords[ix] | |
297 | ||
298 | return [ coord_dict[c] for c in coord_list ] | |
299 | ||
300 | ||
301 | ################ | |
302 | # | |
303 | # aa_to_exon() --- given a coordinate and the corresponding exon map, return the exon coordinate | |
304 | # (can only be done for aligned exons) | |
305 | # | |
306 | # this version of the function must use an info_list, not an | |
307 | # align_list, because it uses p_start/p_end rather than qp_start/sp_start, etc. | |
308 | # a version using qp_start/sp_start would also need a target argument | |
309 | # | |
310 | def aa_to_exon(aa_coords, exon_info_list): | |
311 | ||
312 | sorted_aa_coords = sorted(aa_coords) | |
313 | ||
314 | pos_strand = True | |
315 | if (exon_info_list[0].d_start > exon_info_list[0].d_end): | |
316 | pos_strand = False | |
317 | ||
318 | ex_x = 0 | |
319 | exon_coords = [] | |
320 | ||
321 | aap_x = 0 | |
322 | this_aap = sorted_aa_coords[aap_x] | |
323 | while (ex_x < len(exon_info_list)): | |
324 | this_exon = exon_info_list[ex_x] | |
325 | if (this_aap <= this_exon.p_end and this_aap >= this_exon.p_start): | |
326 | aa_dna_offset = (this_aap - this_exon.p_start) * 3 | |
327 | ||
328 | if (pos_strand): | |
329 | aa_dna_pos = this_exon.d_start + aa_dna_offset | |
330 | else: | |
331 | aa_dna_pos = this_exon.d_start - aa_dna_offset | |
332 | ||
333 | exon_coords.append({'chrom':this_exon.chrom, 'dpos':aa_dna_pos}) | |
334 | aap_x += 1 | |
335 | if (aap_x < len(sorted_aa_coords)): | |
336 | this_aap = sorted_aa_coords[aap_x] | |
337 | else: | |
338 | break | |
339 | else: | |
340 | ex_x += 1 | |
341 | ||
342 | aa_coord_dict = {} | |
343 | for aap_x, aap in enumerate(sorted_aa_coords): | |
344 | aa_coord_dict[aap] = exon_coords[aap_x] | |
345 | ||
346 | return [aa_coord_dict[ax] for ax in aa_coords] | |
347 | ||
348 | ################ | |
349 | # set_data_fields() -- initialize field[] used to generate data[] dict | |
350 | # | |
351 | def set_data_fields(args, line_data) : | |
352 | ||
353 | field_str = 'qseqid sseqid pident length mismatch gapopen q_start q_end s_start s_end evalue bitscore BTOP align_annot' | |
354 | field_qs_str = 'qseqid q_len sseqid s_len pident length mismatch gapopen q_start q_end s_start s_end evalue bitscore BTOP align_annot' | |
355 | ||
356 | if (len(line_data) > 1) : | |
357 | if ((not args.have_qslen) and re.search(r'\d+',line_data[1])): | |
358 | args.have_qslen=True | |
359 | ||
360 | if ((not args.exon_info) and re.search(r'^\|[DX][XD]\:',line_data[-1])): | |
361 | args.exon_info = True | |
362 | ||
363 | end_field = -1 | |
364 | fields = field_str.split(' ') | |
365 | ||
366 | if (args.have_qslen): | |
367 | fields = field_qs_str.split(' ') | |
368 | ||
369 | if (args.exon_info): | |
370 | fields.append('exon_info') | |
371 | end_field = -2 | |
372 | ||
373 | return (fields, end_field) | |
374 | ||
375 | ################################################################ | |
376 | # | |
377 | # main program | |
378 | # print "#"," ".join(sys.argv) | |
379 | ||
380 | def main(): | |
381 | ||
382 | data_fields_reset=False | |
383 | ||
384 | parser=argparse.ArgumentParser(description='map_exon_coords.py result_file.m8CB saa:coord : map subject coordinate to query genomic coordinate') | |
385 | parser.add_argument('--have_qslen', help='bl_tab fields include query/subject lengths',dest='have_qslen',action='store_true',default=False) | |
386 | parser.add_argument('--exon_info', help='raw domain coordinates included',action='store_true',default=True) | |
387 | parser.add_argument('--subj_aa',help='subject aa coordinate to map',action='store',type=int,dest='subj_aa_coord',default=1) | |
388 | parser.add_argument('files', metavar='FILE', nargs='*', help='files to read, if empty, stdin is used') | |
389 | args=parser.parse_args() | |
390 | ||
391 | end_field = -1 | |
392 | data_fields_reset=False | |
393 | ||
394 | (fields, end_field) = set_data_fields(args, []) | |
395 | ||
396 | if (args.have_qslen and args.exon_info): | |
397 | data_fields_reset=True | |
398 | ||
399 | saved_qexon_list = [] | |
400 | qexon_list = [] | |
401 | ||
402 | for line in fileinput.input(args.files): | |
403 | # pass through comments | |
404 | if (line[0] == '#'): | |
405 | print line, # ',' because have not stripped | |
406 | continue | |
407 | ||
408 | ################ | |
409 | # break up tab fields, check for extra fields | |
410 | line = line.strip('\n') | |
411 | line_data = line.split('\t') | |
412 | if (not data_fields_reset): # look for --have_qslen number, --exon_info data, even if not set | |
413 | (fields, end_field) = set_data_fields(args, line_data) | |
414 | data_fields_reset = True | |
415 | ||
416 | ################ | |
417 | # get exon annotations | |
418 | # produces: data['q_exalign_list'], data['s_exalign_list'] | |
419 | # data['q_exinfo_list'], data['s_exinfo_list'] | |
420 | data = parse_protein(line_data,fields,"exon") # get score/alignment/domain data | |
421 | ||
422 | # extract aligned query_coordinates | |
423 | q_coords = [] | |
424 | sa_from_qa = [] | |
425 | for q_ex in data['q_exalign_list']: | |
426 | q_coords.append(q_ex.q_start) | |
427 | q_coords.append(q_ex.q_end) | |
428 | sa_from_qa.append(q_ex.s_start) | |
429 | sa_from_qa.append(q_ex.s_end) | |
430 | ||
431 | s_coords = [] | |
432 | qa_from_sa = [] | |
433 | for s_ex in data['s_exalign_list']: | |
434 | s_coords.append(s_ex.s_start) | |
435 | s_coords.append(s_ex.s_end) | |
436 | qa_from_sa.append(s_ex.q_start) | |
437 | qa_from_sa.append(s_ex.q_end) | |
438 | ||
439 | ################ | |
440 | # map aligned coordinates in query to subject exons | |
441 | # -- this is not necessary -- it is already in data['q_exalign_list'].s_start/s_end | |
442 | # s_target=True | |
443 | # sa_from_qa = map_align_coords(data['BTOP'], int(data['q_start']), int(data['s_start']), | |
444 | # s_target, qa_coords) | |
445 | sex_from_qa2sa = aa_to_exon(sa_from_qa, data['s_exinfo_list']) | |
446 | qex_from_sa2qa = aa_to_exon(qa_from_sa, data['q_exinfo_list']) | |
447 | ||
448 | ||
449 | ################ | |
450 | # print out non-exon info | |
451 | ||
452 | print '\t'.join([str(data[x]) for x in fields[:end_field]]), | |
453 | ||
454 | ################ | |
455 | # edit the full text to insert the other aligned coordinates | |
456 | # (also re-order the regions query-first, then subject) | |
457 | # for 'q_exalign_list', I need to add the subj_genome_coords sex_from_qa2sa | |
458 | # and they need to be second | |
459 | # for 's_exalign_list', I need to add the query_genome_coords from qex_from_sa2qa | |
460 | # and they need to be first | |
461 | ||
462 | q_exalign_out=[] | |
463 | for qx, q_exon in enumerate(data['q_exalign_list']): | |
464 | sg_start = sex_from_qa2sa[2*qx] | |
465 | sg_end = sex_from_qa2sa[2*qx+1] | |
466 | sg_replace="::%s:%d-%d}"%(sg_start['chrom'],sg_start['dpos'],sg_end['dpos']) | |
467 | ||
468 | this_outstr=re.sub(r'\}',sg_replace,q_exon.text) | |
469 | q_exalign_out.append(this_outstr) | |
470 | ||
471 | s_exalign_out=[] | |
472 | for sx, s_exon in enumerate(data['s_exalign_list']): | |
473 | qg_start = qex_from_sa2qa[2*sx] | |
474 | qg_end = qex_from_sa2qa[2*sx+1] | |
475 | qg_replace="{%s:%d-%d::"%(qg_start['chrom'],qg_start['dpos'],qg_end['dpos']) | |
476 | ||
477 | this_outstr=re.sub(r'\{',qg_replace,s_exon.text) | |
478 | s_exalign_out.append(this_outstr) | |
479 | ||
480 | print "\t|"+"|".join(q_exalign_out+s_exalign_out)+"\t"+line_data[-1] | |
481 | ||
482 | ################ | |
483 | # run the program ... | |
484 | ||
485 | if __name__ == '__main__': | |
486 | main() | |
487 |
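
The BTOP handling in `decode_btop()` and `map_align()` above can be sketched on its own. This is a minimal Python 3 restatement for illustration, not the shipped script: it assumes a BTOP string consists only of decimal run lengths and two-character (query, subject) pairs, with `-` marking a gap, as in the header comment's example. The sample input `2A-1-C1` and the start coordinates are made up for the demonstration.

```python
import re

def decode_btop(btop_str):
    """Split a BTOP string into run-length tokens and two-character pair tokens."""
    tokens = []
    for token in re.split(r'(\d+)', btop_str):
        if not token:
            continue
        if token.isdigit():
            tokens.append(token)  # run of identical aligned residues
        else:
            # non-digit stretch: consecutive (query, subject) residue pairs
            tokens.extend(token[i:i + 2] for i in range(0, len(token), 2))
    return tokens

def map_align(btop_str, q_start, s_start):
    """Return parallel lists of query and subject positions, one entry per alignment column."""
    q_pos, s_pos = q_start, s_start
    q_pos_arr, s_pos_arr = [], []
    for t in decode_btop(btop_str):
        if t.isdigit():
            for _ in range(int(t)):
                q_pos_arr.append(q_pos)
                s_pos_arr.append(s_pos)
                q_pos += 1
                s_pos += 1
        else:
            q_pos_arr.append(q_pos)
            s_pos_arr.append(s_pos)
            if t[0] != '-':   # query residue present, so the query advances
                q_pos += 1
            if t[1] != '-':   # subject residue present, so the subject advances
                s_pos += 1
    return q_pos_arr, s_pos_arr

tokens = decode_btop("1VA160TS7KG10RK27")
q_arr, s_arr = map_align("2A-1-C1", 10, 20)
```

Both position lists have one entry per alignment column, which is what lets `map_coords()` walk them in parallel to translate a coordinate from one sequence to the other.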
0 | #!/usr/bin/env perl | |
1 | ||
2 | ################################################################ | |
3 | # copyright (c) 2018 by William R. Pearson and The Rector & | |
4 | # Visitors of the University of Virginia | |
5 | ################################################################ | |
6 | # Licensed under the Apache License, Version 2.0 (the "License"); | |
7 | # you may not use this file except in compliance with the License. | |
8 | # You may obtain a copy of the License at | |
9 | # | |
10 | # http://www.apache.org/licenses/LICENSE-2.0 | |
11 | # | |
12 | # Unless required by applicable law or agreed to in writing, | |
13 | # software distributed under this License is distributed on an "AS | |
14 | # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either | |
15 | # express or implied. See the License for the specific language | |
16 | # governing permissions and limitations under the License. | |
17 | ################################################################ | |
18 | ||
19 | ################################################################ | |
20 | # merge_blast_btab.pl --btab .btab file html_file | |
21 | ################################################################ | |
22 | ||
23 | use warnings; | |
24 | use strict; | |
25 | use Getopt::Long; | |
26 | use Pod::Usage; | |
27 | use URI::Encode qw(uri_encode); | |
28 | use URI::Escape qw(uri_escape); | |
29 | ||
30 | my ($btab_file, $have_qslen, $help, $shelp, $dom_info) = ("", 0, 0, 0, 0); | |
31 | my ($plot_url) = (""); | |
32 | ||
33 | GetOptions( | |
34 | "btab_file|btab=s" => \$btab_file, | |
35 | "have_qslen|have_sqlen!" => \$have_qslen, | |
36 | "domain_info|dom_info!" => \$dom_info, | |
37 | "plot_url=s"=> \$plot_url, | |
38 | "h|?" => \$shelp, | |
39 | "help" => \$help, | |
40 | ); | |
41 | ||
42 | pod2usage(1) if $shelp; | |
43 | pod2usage(exitstatus => 0, verbose => 2) if $help; | |
44 | unless (-f STDIN || -p STDIN || @ARGV) { | |
45 | pod2usage(1); | |
46 | } | |
47 | ||
48 | # require a btab file | |
49 | ||
50 | # read it in, save structure as list/hash on accession (list more robust) | |
51 | # what happens with multiple hits for same library -- need to add code | |
52 | # | |
53 | ||
54 | my @bl_fields = qw(q_seqid s_seqid percid alen mismatch gopen q_start q_end s_start s_end evalue bits score annot); | |
55 | ||
56 | if ($have_qslen) { | |
57 | @bl_fields = qw(q_seqid q_len s_seqid s_len percid alen mismatch gopen q_start q_end s_start s_end evalue bits score annot); | |
58 | } | |
59 | ||
60 | if ($dom_info) { | |
61 | push @bl_fields, "dom_info"; | |
62 | } | |
63 | ||
64 | my %tab_data = (); | |
65 | my @sseq_ids = (); | |
66 | ||
67 | unless ($btab_file) { | |
68 | die "--btab_file required" | |
69 | } | |
70 | else { | |
71 | # read in btab file | |
72 | open(my $fd, $btab_file) || die "cannot open $btab_file"; | |
73 | ||
74 | while (my $line = <$fd>) { | |
75 | next if ($line =~ m/^#/); # ignore comments | |
76 | chomp($line); | |
77 | my %a_data = (); | |
78 | @a_data{@bl_fields} = split(/\t/,$line); | |
79 | ||
80 | # here we should confirm that the sseqid is new. If it is not, then add to a list. | |
81 | my $sseqid = $a_data{'s_seqid'}; | |
82 | ||
83 | if (defined($tab_data{$sseqid})) { | |
84 | push @{$tab_data{$sseqid}}, \%a_data | |
85 | } | |
86 | else { | |
87 | $tab_data{$sseqid} = [ \%a_data ]; | |
88 | push @sseq_ids, $sseqid; | |
89 | } | |
90 | } | |
91 | } | |
92 | ||
93 | # have the annotation data in %tab_data{} and @sseq_ids | |
94 | # read in the blastp html file and annotate it | |
95 | ||
96 | my ($in_best, $in_align) = (0,0); | |
97 | my ($best_ix, $align_ix, $hsp_ix) = (0,0,0); | |
98 | ||
99 | while (my $line = <>) { | |
100 | chomp($line); | |
101 | unless ($line) { | |
102 | print "\n"; | |
103 | next; | |
104 | } | |
105 | if ($line =~ m/^Sequences producing/) { | |
106 | $in_best = 1; | |
107 | $best_ix = 0; | |
108 | print "$line\n"; | |
109 | next; | |
110 | } | |
111 | ||
112 | if ($in_best) { | |
113 | if ($line =~ /^>/) { | |
114 | $in_best = 0; | |
115 | $in_align = 1; | |
116 | $align_ix = 0; | |
117 | $hsp_ix = 0; | |
118 | # print out the first line | |
119 | print "$line\n"; | |
120 | next; | |
121 | } | |
122 | else { | |
123 | $line = add_best($line, $tab_data{$sseq_ids[$best_ix]}->[0]); | |
124 | $best_ix++; | |
125 | } | |
126 | } | |
127 | ||
128 | if ($in_align) { | |
129 | if ($line =~ m/^\s+Score = \d+/) { # have Length= match, put out annotations if available | |
130 | my $regions_str = regions_to_str($tab_data{$sseq_ids[$align_ix]}->[$hsp_ix]); | |
131 | print $regions_str; | |
132 | ||
133 | if ($plot_url) { | |
134 | my $raw_dom_str = ""; | |
135 | if ($dom_info) { | |
136 | $raw_dom_str = dom_info_str($tab_data{$sseq_ids[$align_ix]}->[$hsp_ix]{'dom_info'}); | |
137 | } | |
138 | ||
139 | my $plot_tag = plot_tag_str($plot_url, $tab_data{$sseq_ids[$align_ix]}->[$hsp_ix], $regions_str, $raw_dom_str); | |
140 | if ($plot_tag) {print $plot_tag,"\n";} | |
141 | } | |
142 | ||
143 | $hsp_ix++; | |
144 | ||
145 | } | |
146 | elsif ($line =~ m/^>/) { | |
147 | $align_ix++; | |
148 | $hsp_ix = 0; | |
149 | } | |
150 | } | |
151 | ||
152 | print "$line\n"; | |
153 | } | |
154 | ||
155 | sub parse_annots { | |
156 | my ($annot_str) = @_; | |
157 | ||
158 | my @annot_list = (); | |
159 | ||
160 | unless ($annot_str && $annot_str =~ m/^\|/) { | |
161 | return \@annot_list; | |
162 | } | |
163 | ||
164 | my @annots = split('\|',$annot_str); | |
165 | shift @annots; | |
166 | ||
167 | for my $annot ( @annots ) { | |
168 | my %annot_data = (); | |
169 | next unless ($annot =~ m/^[XR][RX]/); | |
170 | my @a_fields = split(/;/,$annot); | |
171 | for my $f (@a_fields) { | |
172 | if ($f =~ m/^[XR][XR]/) { | |
173 | my @a2_f = split(':',$f); | |
174 | if ($a2_f[0] =~ m/^XR/) { | |
175 | $annot_data{target} = 'subj'; | |
176 | } | |
177 | else { | |
178 | $annot_data{target} = 'query'; | |
179 | } | |
180 | $annot_data{coord} = "$a2_f[1]:$a2_f[2]"; | |
181 | $annot_data{score} = (split('=',$a2_f[3]))[1] | |
182 | } | |
183 | elsif ($f =~ m/(\w)=(.+)/) { | |
184 | $annot_data{$1} = $2; | |
185 | } | |
186 | } | |
187 | $annot_data{name} = $a_fields[-1]; | |
188 | $annot_data{name} =~ s/^C=//; | |
189 | push @annot_list, \%annot_data; | |
190 | } | |
191 | return \@annot_list; | |
192 | } | |
193 | ||
194 | sub regions_to_str { | |
195 | my ($a_data_r) = @_; | |
196 | ||
197 | my $annot_ref = parse_annots($a_data_r->{annot}); | |
198 | ||
199 | my $region_str = ""; | |
200 | my $annot_str = ""; | |
201 | ||
202 | for my $annot ( @{$annot_ref}) { | |
203 | if ($annot->{target} =~ m/^q/) { | |
204 | $region_str = "qRegion"; | |
205 | } | |
206 | else { | |
207 | $region_str = " Region"; | |
208 | } | |
209 | ||
210 | $annot_str .= sprintf "%s: %s : score=%d; bits=%.1f; Id=%.3f; Q=%.1f : %s\n", $region_str, | |
211 | @{$annot}{qw(coord score b I Q name)}; | |
212 | } | |
213 | return $annot_str; | |
214 | } | |
215 | ||
216 | sub add_best { | |
217 | my ($line, $a_data) = @_; | |
218 | ||
219 | my $annot_str = ''; | |
220 | ||
221 | my $annot_refs = parse_annots($a_data->{annot}); | |
222 | ||
223 | for my $annot ( @$annot_refs) { | |
224 | if ($annot->{target} !~ m/^q/) { | |
225 | $annot_str .= $annot->{name} . ";" | |
226 | } | |
227 | } | |
228 | ||
229 | if ($annot_str) { | |
230 | return "$line $annot_str"; | |
231 | } | |
232 | else { | |
233 | return $line; | |
234 | } | |
235 | } | |
236 | ||
237 | sub plot_tag_str { | |
238 | ||
239 | my ($plot_script, $align_data_r, $regions_str, $doms_str) = @_; | |
240 | ||
241 | my $svg_pref = q(<object type="image/svg+xml" ); | |
242 | my $svg_post = q( width="660" height="76" ></object>); | |
243 | ||
244 | #build argument string | |
245 | my %plt_args = (); | |
246 | @plt_args{qw(q_cstart l_cstart)} = (1, 1); | |
247 | @plt_args{qw(q_name q_cstop q_astart q_astop l_name l_cstop l_astart l_astop)} = | |
248 | @{$align_data_r}{qw(q_seqid q_len q_start q_end s_seqid s_len s_start s_end)}; | |
249 | $plt_args{'regions'}= uri_escape(uri_encode($regions_str)); | |
250 | if ($doms_str) { | |
251 | $plt_args{'doms'} = uri_encode($doms_str); | |
252 | } | |
253 | ||
254 | my $dom_info = (); | |
255 | ||
256 | my @args = map {"$_=$plt_args{$_}"} keys(%plt_args); | |
257 | ||
258 | return $svg_pref . qq( data="$plot_script?) . join('&',@args) . '"' . $svg_post; | |
259 | } | |
260 | ||
261 | sub dom_info_str { | |
262 | my ($raw_dom_info) = @_; | |
263 | ||
264 | my $dom_str = ""; | |
265 | ||
266 | unless ($raw_dom_info) { return "";} | |
267 | ||
268 | my @raw_doms = split('\|',$raw_dom_info); | |
269 | shift(@raw_doms); | |
270 | ||
271 | for my $dom ( @raw_doms ) { | |
272 | my $tmp_dom = $dom; | |
273 | $tmp_dom =~ s/^DX:/qDomain:\t/g; | |
274 | $tmp_dom =~ s/^XD:/lDomain:\t/g; | |
275 | $tmp_dom =~ s/;C=/\t/g; | |
276 | ||
277 | $dom_str .= "$tmp_dom\n"; | |
278 | } | |
279 | ||
280 | return $dom_str; | |
281 | } | |
282 | ||
283 | ||
284 | __END__ | |
285 | ||
286 | =pod | |
287 | ||
288 | =head1 NAME | |
289 | ||
290 | merge_blast_btab.pl | |
291 | ||
292 | =head1 SYNOPSIS | |
293 | ||
294 | merge_blast_btab.pl --btab_file=result.b_tab result.html | |
295 | ||
296 | =head1 OPTIONS | |
297 | ||
298 | -h short help | |
299 | --help include description | |
300 | ||
301 | --btab_file|--btab file_name -- blast tabular output file with | |
302 | sub-alignment scoring | |
303 | ||
304 | =head1 DESCRIPTION | |
305 | ||
306 | C<merge_blast_btab.pl> merges the domain annotations and sub-alignment scoring from C<annot_blast_btop2.pl> blast tabular output file with a conventional blast result file. | |
307 | ||
308 | The tab file is read and parsed, and then the subject/query seqid is used to | |
309 | capture domain locations in the subject/query sequence. If the domains | |
310 | overlap the aligned region, the domain names are appended to the output. | |
311 | ||
312 | =head1 AUTHOR | |
313 | ||
314 | William R. Pearson, wrp@virginia.edu | |
315 | ||
316 | =cut |
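
The annotation column that `parse_annots` consumes can be illustrated with a short sketch. It is written in Python rather than Perl purely for illustration and is an approximation of the Perl logic above, not a replacement for it. The sample string is taken from the example alignment line quoted in the `parse_protein()` comment of `map_exon_coords.py`; the record keys mirror the Perl hash keys.

```python
import re

def parse_annots(annot_str):
    """Parse '|XR:4-76:4-76:s=327;b=134.6;...;C=name|RX:...' into a list of dicts."""
    annots = []
    if not annot_str or not annot_str.startswith('|'):
        return annots
    for annot in annot_str.split('|')[1:]:
        if not re.match(r'[XR][RX]', annot):
            continue
        fields = annot.split(';')
        head = fields[0].split(':')      # e.g. ['XR', '4-76', '4-76', 's=327']
        rec = {
            'target': 'subj' if head[0] == 'XR' else 'query',
            'coord': head[1] + ':' + head[2],
            'score': head[3].split('=')[1],
        }
        for f in fields[1:]:             # remaining key=value pairs: b, I, Q, C
            key, _, val = f.partition('=')
            rec[key] = val
        # the name is the last field with its leading 'C=' removed
        rec['name'] = re.sub(r'^C=', '', fields[-1])
        annots.append(rec)
    return annots

sample = ("|XR:4-76:4-76:s=327;b=134.6;I=0.895;Q=367.8;C=C.Thioredoxin~1"
          "|RX:5-82:5-82:s=356;b=146.5;I=0.902;Q=403.3;C=C.Thioredoxin~1")
records = parse_annots(sample)
```

As in the Perl code, `XR` records are treated as subject annotations and `RX` records as query annotations; values are kept as strings and only converted when formatted for output.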
0 | #!/usr/bin/env perl | |
1 | ||
2 | ################################################################ | |
3 | # copyright (c) 2018 by William R. Pearson and The Rector & | |
4 | # Visitors of the University of Virginia | |
5 | ################################################################ | |
6 | # Licensed under the Apache License, Version 2.0 (the "License"); | |
7 | # you may not use this file except in compliance with the License. | |
8 | # You may obtain a copy of the License at | |
9 | # | |
10 | # http://www.apache.org/licenses/LICENSE-2.0 | |
11 | # | |
12 | # Unless required by applicable law or agreed to in writing, | |
13 | # software distributed under this License is distributed on an "AS | |
14 | # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either | |
15 | # express or implied. See the License for the specific language | |
16 | # governing permissions and limitations under the License. | |
17 | ################################################################ | |
18 | ||
19 | ################################################################ | |
20 | # merge_fasta_btab.pl --btab .btab file html_file | |
21 | ################################################################ | |
22 | ||
23 | ################################################################ | |
24 | # takes a standard (or <html>) FASTA output file and converts (or adds) labels using .btab information | |
25 | ################################################################ | |
26 | ||
27 | ||
28 | use warnings; | |
29 | use strict; | |
30 | use Getopt::Long; | |
31 | use Pod::Usage; | |
32 | use URI::Encode qw(uri_encode); | |
33 | use URI::Escape qw(uri_escape); | |
34 | ||
35 | my ($btab_file, $have_qslen, $help, $shelp, $dom_info) = ("", 0, 0, 0, 0); | |
36 | my ($plot_url) = (""); | |
37 | ||
38 | GetOptions( | |
39 | "btab_file|btab=s" => \$btab_file, | |
41 | "have_qslen|have_sqlen!" => \$have_qslen, | |
42 | "domain_info|dom_info!" => \$dom_info, | |
43 | "plot_url=s"=> \$plot_url, | |
44 | "h|?" => \$shelp, | |
45 | "help" => \$help, | |
46 | ); | |
47 | ||
48 | pod2usage(1) if $shelp; | |
49 | pod2usage(exitstatus => 0, verbose => 2) if $help; | |
50 | unless (-f STDIN || -p STDIN || @ARGV) { | |
51 | pod2usage(1); | |
52 | } | |
53 | ||
54 | # require a btab file | |
55 | ||
56 | # read it in, save structure as list/hash on accession (list more robust) | |
57 | # what happens with multiple hits for same library -- need to add code | |
58 | # | |
59 | ||
60 | my @bl_fields = qw(q_seqid s_seqid percid alen mismatch gopen q_start q_end s_start s_end evalue bits score annot); | |
61 | ||
62 | if ($have_qslen) { | |
63 | @bl_fields = qw(q_seqid q_len s_seqid s_len percid alen mismatch gopen q_start q_end s_start s_end evalue bits score annot); | |
64 | } | |
65 | ||
66 | my %pgm_names= ('FASTA'=>'fap', 'FASTX'=>'fx', 'FASTY'=>'fy', 'FASTS'=>'fs', 'FASTM'=>'fm', | |
67 | 'SSEARCH' => 'gsw', 'GGSEARCH'=>'gnw', 'GLSEARCH'=>'lnw', | |
68 | 'TFASTX' => 'tfx', 'TFASTY'=>'tfx', 'TFASTS'=>'tfs', 'TFASTM'=>'tfm', | |
69 | 'BLASTP'=>'bp', 'BLASTN'=>'bn', 'TBLASTN'=>'tbn' ); | |
70 | ||
71 | if ($dom_info) { | |
72 | push @bl_fields, "dom_info"; | |
73 | } | |
74 | ||
75 | my $pgm_name = ''; | |
76 | my %tab_data = (); | |
77 | my @sseq_ids = (); | |
78 | ||
79 | unless ($btab_file) { | |
80 | die "--btab_file required" | |
81 | } | |
82 | else { | |
83 | # read in btab file | |
84 | open(my $fd, $btab_file) || die "cannot open $btab_file"; | |
85 | ||
86 | while (my $line = <$fd>) { | |
87 | if ($line =~ m/^#/) { # check for program name | |
88 | if (!$pgm_name) { | |
89 | my ($name) = ($line =~ m/^# (\w+) /); | |
90 | if ($name && $pgm_names{$name}) { | |
91 | $pgm_name = $pgm_names{$name}; | |
92 | } | |
93 | } | |
94 | next; | |
95 | } | |
96 | chomp($line); | |
97 | ||
98 | my %a_data = (); | |
99 | @a_data{@bl_fields} = split(/\t/,$line); | |
100 | ||
101 | # here we should confirm that the sseqid is new. If it is not, then add to a list. | |
102 | my $sseqid = $a_data{'s_seqid'}; | |
103 | ||
104 | if (defined($tab_data{$sseqid})) { | |
105 | push @{$tab_data{$sseqid}}, \%a_data | |
106 | } | |
107 | else { | |
108 | $tab_data{$sseqid} = [\%a_data ]; | |
109 | push @sseq_ids, $sseqid; | |
110 | } | |
111 | } | |
112 | } | |
113 | ||
114 | # have the annotation data in %tab_data{} and @sseq_ids | |
115 | # read in the FASTA html file and annotate it | |
116 | ||
117 | my ($in_best, $in_align, $in_annot) = (0,0,0); | |
118 | my ($annot_id) = (""); | |
119 | my ($best_ix, $align_ix, $hsp_ix) = (0,0,0); | |
120 | ||
121 | while (my $line = <>) { | |
122 | chomp($line); | |
123 | unless ($line) { | |
124 | print "\n"; | |
125 | next; | |
126 | } | |
127 | if ($line =~ m/^The best scores are:/) { | |
128 | $in_best = 1; | |
129 | $best_ix = 0; | |
130 | print "$line\n"; | |
131 | next; | |
132 | } | |
133 | ||
134 | if ($in_best) { | |
135 | if ($line =~ /<pre>>>/) { | |
136 | $in_best = 0; | |
137 | $in_align = 1; | |
138 | $in_annot = 0; | |
139 | $align_ix = 0; | |
140 | $hsp_ix = 0; | |
141 | # print out the first line | |
142 | print "$line\n"; | |
143 | next; | |
144 | } | |
145 | else { | |
146 | if (scalar(@sseq_ids) && $sseq_ids[$best_ix]) { | |
147 | $line = add_best($line, $tab_data{$sseq_ids[$best_ix]}->[0]); | |
148 | $best_ix++; | |
149 | } | |
150 | } | |
151 | } | |
152 | ||
153 | if ($in_align) { | |
154 | if ($line =~ m/^<!\-\- ANNOT_START "([^"]+)" \-\->/) { | |
155 | $annot_id = $1; | |
156 | my $regions_str = regions_to_str($tab_data{$sseq_ids[$align_ix]}->[$hsp_ix]); | |
157 | print qq(<!-- ANNOT_START "$annot_id" -->); | |
158 | print $regions_str; | |
159 | ||
160 | if ($plot_url) { | |
161 | my $raw_dom_str = ""; | |
162 | if ($dom_info) { | |
163 | $raw_dom_str = dom_info_str($tab_data{$sseq_ids[$align_ix]}->[$hsp_ix]{'dom_info'}); | |
164 | } | |
165 | ||
166 | my $plot_tag = plot_tag_str($plot_url, $pgm_name, $tab_data{$sseq_ids[$align_ix]}->[$hsp_ix], $regions_str, $raw_dom_str); | |
167 | if ($plot_tag) {print $plot_tag,"\n";} | |
168 | } | |
169 | ||
170 | $hsp_ix++; | |
171 | ||
172 | # remove the old domain information | |
173 | while ($line = <> ) { | |
174 | chomp($line); | |
175 | if ($line !~ m/^\s*q?Region:/ && $line !~ /ANNOT_STOP/) { | |
176 | print "$line\n"; | |
177 | } | |
178 | if ($line =~ m/^<!\-\- ANNOT_STOP \-\->/) { | |
179 | last; | |
180 | } | |
181 | } | |
182 | } | |
183 | elsif ($line =~ m/<pre>>>/) { | |
184 | $align_ix++; | |
185 | $hsp_ix=0; | |
186 | } | |
187 | } | |
188 | ||
189 | print "$line\n"; | |
190 | } | |
191 | ||
192 | sub parse_annots { | |
193 | my ($annot_str) = @_; | |
194 | ||
195 | my @annot_list = (); | |
196 | ||
197 | unless ($annot_str && $annot_str =~ m/^\|/) { | |
198 | return \@annot_list; | |
199 | } | |
200 | ||
201 | my @annots = split('\|',$annot_str); | |
202 | shift @annots; | |
203 | ||
204 | for my $annot ( @annots ) { | |
205 | my %annot_data = (); | |
206 | next unless ($annot =~ m/^[XR][RX]/); | |
207 | my @a_fields = split(/;/,$annot); | |
208 | for my $f (@a_fields) { | |
209 | if ($f =~ m/^[XR][XR]/) { | |
210 | my @a2_f = split(':',$f); | |
211 | if ($a2_f[0] =~ m/^XR/) { | |
212 | $annot_data{target} = 'subj'; | |
213 | } | |
214 | else { | |
215 | $annot_data{target} = 'query'; | |
216 | } | |
217 | $annot_data{coord} = "$a2_f[1]:$a2_f[2]"; | |
218 | $annot_data{score} = (split('=',$a2_f[3]))[1] | |
219 | } | |
220 | elsif ($f =~ m/(\w)=(.+)/) { | |
221 | $annot_data{$1} = $2; | |
222 | } | |
223 | } | |
224 | $annot_data{name} = $a_fields[-1]; | |
225 | $annot_data{name} =~ s/^C=//; | |
226 | ||
227 | push @annot_list, \%annot_data; | |
228 | } | |
229 | return \@annot_list; | |
230 | } | |
231 | ||
232 | sub print_regions { | |
233 | my ($annot_id, $annot_ref) = @_; | |
234 | ||
235 | my $region_str = ""; | |
236 | ||
237 | print qq(<!-- ANNOT_START "$annot_id" -->); | |
238 | ||
239 | for my $annot ( @{$annot_ref}) { | |
240 | if ($annot->{target} =~ m/^q/) { | |
241 | $region_str = "qRegion"; | |
242 | } | |
243 | else { | |
244 | $region_str = " Region"; | |
245 | } | |
246 | ||
247 | printf "%s: %s : score=%d; bits=%.1f; Id=%.3f; Q=%.1f : %s\n", $region_str, | |
248 | @{$annot}{qw(coord score b I Q name)}; | |
249 | } | |
250 | } | |
251 | ||
252 | sub regions_to_str { | |
253 | my ($a_data_r) = @_; | |
254 | ||
255 | my $annot_ref = parse_annots($a_data_r->{annot}); | |
256 | ||
257 | my $region_str = ""; | |
258 | my $annot_str = ""; | |
259 | ||
260 | for my $annot ( @{$annot_ref}) { | |
261 | if ($annot->{target} =~ m/^q/) { | |
262 | $region_str = "qRegion"; | |
263 | } | |
264 | else { | |
265 | $region_str = " Region"; | |
266 | } | |
267 | ||
268 | $annot_str .= sprintf "%s: %s : score=%d; bits=%.1f; Id=%.3f; Q=%.1f : %s\n", $region_str, | |
269 | @{$annot}{qw(coord score b I Q name)}; | |
270 | } | |
271 | return $annot_str; | |
272 | } | |
273 | ||
274 | sub add_best { | |
275 | my ($line, $a_data) = @_; | |
276 | ||
277 | my $annot_str = ''; | |
278 | ||
279 | my $annot_refs = parse_annots($a_data->{annot}); | |
280 | ||
281 | # remove old annotation if present | |
282 | my @line_words = split(/\s/,$line); | |
283 | if ($line_words[-1] =~ m/~\d/) { | |
284 | $line = join(' ',@line_words[0 .. $#line_words-1]); | |
285 | } | |
286 | ||
287 | for my $annot ( @$annot_refs) { | |
288 | if ($annot->{target} !~ m/^q/) { | |
289 | $annot_str .= $annot->{name} . ";" | |
290 | } | |
291 | } | |
292 | ||
293 | if ($annot_str) { | |
294 | return "$line $annot_str"; | |
295 | } | |
296 | else { | |
297 | return $line; | |
298 | } | |
299 | } | |
300 | ||
301 | sub plot_tag_str { | |
302 | ||
303 | my ($plot_script, $pgm_name, $align_data_r, $regions_str, $doms_str) = @_; | |
304 | ||
305 | my $svg_pref = q(<object type="image/svg+xml" ); | |
306 | my $svg_post = q( width="660" height="76" ></object>); | |
307 | ||
308 | #build argument string | |
309 | my %plt_args = (); | |
310 | @plt_args{qw(pgm q_cstart l_cstart)} = ($pgm_name, 1, 1); | |
311 | @plt_args{qw(q_name q_cstop q_astart q_astop l_name l_cstop l_astart l_astop)} = | |
312 | @{$align_data_r}{qw(q_seqid q_len q_start q_end s_seqid s_len s_start s_end)}; | |
313 | $plt_args{'regions'}= uri_escape(uri_encode($regions_str)); | |
314 | if ($doms_str) { | |
315 | $plt_args{'doms'} = uri_encode($doms_str); | |
316 | } | |
317 | ||
318 | my $dom_info = (); | |
319 | ||
320 | my @args = map {"$_=$plt_args{$_}"} keys(%plt_args); | |
321 | ||
322 | return $svg_pref . qq( data="$plot_url?) . join('&',@args) . '"' . $svg_post; | |
323 | } | |
324 | ||
325 | sub dom_info_str { | |
326 | my ($raw_dom_info) = @_; | |
327 | ||
328 | my $dom_str = ""; | |
329 | ||
330 | unless ($raw_dom_info) { return "";} | |
331 | ||
332 | my @raw_doms = split('\|',$raw_dom_info); | |
333 | shift(@raw_doms); | |
334 | ||
335 | for my $dom ( @raw_doms ) { | |
336 | my $tmp_dom = $dom; | |
337 | $tmp_dom =~ s/^DX:/qDomain:\t/g; | |
338 | $tmp_dom =~ s/^XD:/lDomain:\t/g; | |
339 | $tmp_dom =~ s/;C=/\t/g; | |
340 | ||
341 | $dom_str .= "$tmp_dom\n"; | |
342 | } | |
343 | ||
344 | return $dom_str; | |
345 | } | |
346 | ||
347 | __END__ | |
348 | ||
349 | =pod | |
350 | ||
351 | =head1 NAME | |
352 | ||
353 | merge_blast_btab.pl | |
354 | ||
355 | =head1 SYNOPSIS | |
356 | ||
357 | merge_blast_btab.pl --btab_file=result.b_tab result.html | |
358 | ||
359 | =head1 OPTIONS | |
360 | ||
361 | -h short help | |
362 | --help include description | |
363 | ||
364 | --btab_file|--btab file_name -- blast tabular output file with | |
365 | sub-alignment scoring | |
366 | ||
367 | =head1 DESCRIPTION | |
368 | ||
369 | C<merge_blast_btab.pl> merges the domain annotations and sub-alignment scores from the blast tabular output file produced by C<annot_blast_btop2.pl> with a conventional blast result file. | |
370 | ||
371 | The tab file is read and parsed, and then the subject/query seqid is used to | |
372 | capture domain locations in the subject/query sequence. If the domains | |
373 | overlap the aligned region, the domain names are appended to the output. | |
374 | ||
375 | =head1 AUTHOR | |
376 | ||
377 | William R. Pearson, wrp@virginia.edu | |
378 | ||
379 | =cut |
0 | #!/usr/bin/env python | |
1 | ||
2 | # Given a blast_tabular file with search results from one or more protein queries, | |
3 | # relabel the domain annotations so the same domain is colored consistently in query and subject | |
4 | ||
5 | ################################################################ | |
6 | # copyright (c) 2018 by William R. Pearson and The Rector & Visitors | |
7 | # of the University of Virginia | |
8 | # ############################################################### | |
9 | # Licensed under the Apache License, Version 2.0 (the "License"); you | |
10 | # may not use this file except in compliance with the License. You | |
11 | # may obtain a copy of the License at | |
12 | # http://www.apache.org/licenses/LICENSE-2.0 Unless required by | |
13 | # applicable law or agreed to in writing, software distributed under | |
14 | # this License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES | |
15 | # OR CONDITIONS OF ANY KIND, either express or implied. See the | |
16 | # License for the specific language governing permissions and | |
17 | # limitations under the License. | |
18 | # ############################################################### | |
19 | ||
20 | ||
21 | import fileinput | |
22 | import sys | |
23 | import re | |
24 | import argparse | |
25 | import urllib2 | |
26 | ||
27 | from rename_exons import * | |
28 | ||
29 | def replace_dom_number(line): | |
30 | ||
31 | out_str = '' | |
32 | if (not re.search(r'~',line)): | |
33 | return line | |
34 | ||
35 | (info, num, vdom) = re.search(r'^([^~]+)~(\d+)(v?)$',line).groups() | |
36 | if (vdom is None): | |
37 | vdom='' | |
38 | ||
39 | if (num in homolog_dict): | |
40 | return "%s~h%s%s" % (info, str(homolog_dict[num]['num']), vdom) | |
41 | ||
42 | else: | |
43 | name = line.split(" ")[-1].split("{")[0] | |
44 | if (name == "NODOM"): | |
45 | return line | |
46 | else: | |
47 | if (name in nonhomolog_dict): | |
48 | return '~'.join(line.split('~')[:-1]) + "~" + str(nonhomolog_dict[name]) | |
49 | return out_str | |
50 | ||
51 | ||
52 | ################ | |
53 | # __main__ function | |
54 | # | |
55 | ||
56 | e_thresh = 1e-6 | |
57 | q_thresh = 30.0 | |
58 | ||
59 | homolog_dict = {} | |
60 | nonhomolog_dict = {} | |
61 | ||
62 | def main(): | |
63 | ||
64 | # print "#"," ".join(sys.argv) | |
65 | ||
66 | hom_color = 1 | |
67 | n_hom_color = 11 | |
68 | ||
69 | parser=argparse.ArgumentParser(description='relabel_domains.py result_file.m8CB') | |
70 | ||
71 | parser.add_argument('--have_qslen', help='bl_tab fields include query/subject lengths',dest='have_qslen',action='store_true',default=False) | |
72 | parser.add_argument('--dom_info', help='raw domain coordinates included',action='store_true',default=False) | |
73 | parser.add_argument('files', metavar='FILE', nargs='*', help='files to read, if empty, stdin is used') | |
74 | ||
75 | args=parser.parse_args() | |
76 | ||
77 | end_field = -1 | |
78 | data_fields_reset=False | |
79 | ||
80 | (fields, end_field) = set_data_fields(args, []) | |
81 | ||
82 | if (args.have_qslen and args.dom_info): | |
83 | data_fields_reset=True | |
84 | ||
85 | ||
86 | for line in fileinput.input(args.files): | |
87 | # pass through comments | |
88 | if (line[0] == '#'): | |
89 | print line, # ',' because have not stripped | |
90 | continue | |
91 | ||
92 | ################ | |
93 | # break up tab fields, check for extra fields | |
94 | line = line.strip('\n') | |
95 | line_data = line.split('\t') | |
96 | if (not data_fields_reset): # look for --have_qslen number, --dom_info data, even if not set | |
97 | (fields, end_field) = set_data_fields(args, line_data) | |
98 | data_fields_reset = True | |
99 | ||
100 | ################ | |
101 | # get exon annotations | |
102 | data = parse_protein(line_data,fields,'') # get score/alignment/domain data | |
103 | ||
104 | if (len(data['sdom_list'])==0 and len(data['qdom_list'])==0): | |
105 |             print line   # no domains to be edited, print stripped line and continue | |
106 | continue | |
107 | ||
108 | ################ | |
109 | # have domains, check if significant and new, or old and known | |
110 | # goals are: (1) consistent coloring between query and subject for same domain | |
111 | # (2) homologous domains get special labels | |
112 | # need dict of good domain names | |
113 | ||
114 | ################ | |
115 | # check to update doms with good E()-value | |
116 |         if float(data['evalue']) <= e_thresh: | |
117 |             for q_dom in data['qdom_list']: | |
118 |                 if (float(q_dom.q_score) >= q_thresh and q_dom.name not in homolog_dict ): | |
119 |                     homolog_dict[q_dom.name] = hom_color   # key on the domain name, not the literal string 'q_dom.name' | |
120 |                     hom_color += 1 | |
121 | ||
122 |             for s_dom in data['sdom_list']: | |
123 |                 if (float(s_dom.q_score) >= q_thresh and s_dom.name not in homolog_dict): | |
124 |                     homolog_dict[s_dom.name] = s_dom.color | |
125 |                     hom_color += 1 | |
126 |         else: | |
127 |             for s_dom in data['sdom_list']: | |
128 |                 if (s_dom.name not in homolog_dict): | |
129 |                     nonhomolog_dict[s_dom.name] = s_dom.color | |
130 |                     n_hom_color += 1 | |
131 | ||
132 | ||
133 | ################ | |
134 | # done storing good domains, write things out | |
135 | ||
136 | btab_str = '\t'.join(str(data[x]) for x in fields[:end_field]) | |
137 | ||
138 | for s_dom in data['sdom_list']: | |
139 | if (s_dom.name in homolog_dict): | |
140 | s_dom.color=homolog_dict[s_dom.name] | |
141 | elif (s_dom.name in nonhomolog_dict): | |
142 | s_dom.color=nonhomolog_dict[s_dom.name] | |
143 | ||
144 | ||
145 | dom_bar_str = '' | |
146 | for dom in sorted(data['qdom_list']+data['sdom_list'],key=lambda r: r.idnum): | |
147 | dom_bar_str += dom.make_bar_str() | |
148 | ||
149 | print btab_str+dom_bar_str | |
150 | ||
151 | ||
152 | if __name__ == '__main__': | |
153 | main() |
0 | #!/usr/bin/env python | |
1 | # | |
2 | # given a -m8CB file with exon annotations for the query and subject, | |
3 | # adjust the subject exon names to match the query exon names | |
4 | ||
5 | ################################################################ | |
6 | # copyright (c) 2018 by William R. Pearson and The Rector & | |
7 | # Visitors of the University of Virginia | |
8 | ################################################################ | |
9 | # Licensed under the Apache License, Version 2.0 (the "License"); | |
10 | # you may not use this file except in compliance with the License. | |
11 | # You may obtain a copy of the License at | |
12 | # | |
13 | # http://www.apache.org/licenses/LICENSE-2.0 | |
14 | # | |
15 | # Unless required by applicable law or agreed to in writing, | |
16 | # software distributed under this License is distributed on an "AS | |
17 | # IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either | |
18 | # express or implied. See the License for the specific language | |
19 | # governing permissions and limitations under the License. | |
20 | ################################################################ | |
21 | ||
22 | import fileinput | |
23 | import sys | |
24 | import re | |
25 | import argparse | |
26 | import copy | |
27 | ||
28 | ################ | |
29 | # "domain" class that describes a domain/exon alignment annotation | |
30 | # | |
31 | class DomAlign: | |
32 | def __init__(self, name, info, color, qstart, qend, sstart, send, raw_score, bit_score, ident, qscore, RXRState, fulltext): | |
33 | self.name = name | |
34 | self.info = info | |
35 | self.color_type = '' | |
36 | if (not re.search(r'^\d+$',color)): | |
37 | m=re.search(r'^(\d+)([a-z]?\w*)$',color) | |
38 | if (m): | |
39 | (self.color, self.color_type) = m.groups() | |
40 | self.color = int(self.color) | |
41 | else: | |
42 | self.color = int(color) | |
43 | ||
44 | self.q_start = qstart | |
45 | self.q_end = qend | |
46 | self.s_start = sstart | |
47 | self.s_end = send | |
48 | self.raw_score = raw_score | |
49 | self.bit_score = bit_score | |
50 | self.percid = ident | |
51 | self.q_score = qscore | |
52 | self.rxr = RXRState | |
53 | self.idnum = 0 | |
54 | self.overlap_list = [] | |
55 | self.info_dom = None | |
56 | self.text = fulltext | |
57 | self.out_str = '' | |
58 | self.over_cnt = 0 | |
59 | ||
60 | def append_overlap(self, overlap_dict): | |
61 | self.overlap_list.append(overlap_dict) | |
62 | ||
63 | def __str__(self): | |
64 | # return "[%d]name: %s : %i-%i : %i-%i I=%.1f Q=%.1f %s" % (self.idnum, self.name, self.q_start, self.q_end, self.s_start, self.s_end, self.percid, self.q_score, self.rxr) | |
65 | return "[%d:%s] %i-%i:%i-%i::%s [over:%d]" % (self.idnum, self.rxr, self.q_start, self.q_end, self.s_start, self.s_end, self.name,len(self.overlap_list)) | |
66 | ||
67 | def print_bar_str(self): # checking for 'NADA' | |
68 | if (not self.out_str): | |
69 | self.out_str = self.text | |
70 | return str("|%s"%(self.out_str)) | |
71 | ||
72 | def make_bar_str(self): # create original string from values | |
73 | bar_str = "|%s:%d-%d:%d-%d:s=%d;b=%.1f;I=%.3f;Q=%.1f;C=%s%s~%d" % ( | |
74 | self.rxr, self.q_start, self.q_end, self.s_start, self.s_end, | |
75 | self.raw_score, self.bit_score, self.percid, self.q_score, self.name, self.info, self.color) | |
76 | ||
77 | if (self.color_type): | |
78 | bar_str += self.color_type | |
79 | return bar_str | |
80 | ||
81 | ################ | |
82 | # "exonInfo" class describes raw (un-aligned) exons with genome coordinates | |
83 | # | |
84 | class exonInfo: | |
85 | def __init__(self, name, q_target, p_start, p_end, chrom, d_start, d_end, full_text): | |
86 | self.name = name | |
87 | self.q_target = q_target | |
88 | self.p_start = p_start | |
89 | self.p_end = p_end | |
90 | self.chrom = chrom | |
91 | self.d_start = d_start | |
92 | self.d_end = d_end | |
93 | self.text = full_text | |
94 | self.plus_strand = True | |
95 | if (d_start > d_end): | |
96 | self.plus_strand = False | |
97 | ||
98 | def __str__(self): | |
99 | rxr_str = "XD" | |
100 | if (self.q_target): | |
101 | rxr_str="DX" | |
102 | return '|%s:%i-%i:%s{%s:%i-%i}' % (rxr_str, self.p_start, self.p_end, self.name, self.chrom, self.d_start, self.d_end) | |
103 | ||
104 | ||
105 | # Parses domain annotations after split at '|' | |
106 | #|RX:1-38:3-40:s=37;b=17.0;I=0.289;Q=15.9;C=exon_1~1 | |
107 | #|RX:39-67:41-69:s=78;b=35.8;I=0.483;Q=68.7;C=exon_2~2 | |
108 | #|XR:1-67:3-69:s=115;b=52.8;I=0.373;Q=116.3;C=exon_1~1 | |
109 | #|RX:68-117:72-113:s=14;b=6.4;I=0.385;Q=0.0;C=exon_3~3 | |
110 | #|XR:68-124:70-119:s=-11;b=0.0;I=0.378;Q=0.0;C=exon_2~2 | |
111 | #|XR:125-167:120-165:s=39;b=17.9;I=0.429;Q=18.5;C=exon_3~3 | |
112 | #|RX:118-176:114-175:s=24;b=11.0;I=0.411;Q=1.5;C=exon_4~4 | |
113 | #|RX:177-200:176-198:s=27;b=12.4;I=0.435;Q=4.0;C=exon_5~5 | |
114 | #|XR:168-200:166-198:s=12;b=5.5;I=0.419;Q=0.0;C=exon_4~4 | |
115 | # | |
116 | def parse_domain(text): | |
117 | # takes a domain in string form, turns it into a domain object | |
118 | # looks like: RX:5-82:5-82:s=397;b=163.1;I=1.000;Q=453.6;C=C.Thioredoxin~1 | |
119 | # could also look like: RX:5-82:5-82:s=397;b=163.1;I=1.000;Q=453.6;C=C.Thioredoxin{PF012445}~1 | |
120 | ||
121 | # get RX/XR and qstart/qstop sstart/sstop as strings | |
122 | m = re.search(r'^(\w+):(\d+)-(\d+):(\d+)-(\d+):',text) | |
123 | if (m): | |
124 | (RXRState, qstart_s, qend_s, sstart_s, send_s) = m.groups() | |
125 | else: | |
126 | sys.stderr.write("could not parse location: %s\n"%(text)) | |
127 | ||
128 | # get score, bits, identity, Q info | |
129 | m = re.search(r's=(\-?\d+);b=(\-?[\d\.]+);I=([\d\.]+);Q=(\-?\d+\.\d*);',text) | |
130 | if (m): | |
131 | (r_score_s, b_score_s, ident_s, qscore_s) = m.groups() | |
132 |     else: | |
133 |         sys.stderr.write("Error: no scores: %s\n" %(text)) | |
134 |         r_score_s = "-1"; b_score_s = ident_s = qscore_s = "-1.0"   # r_score_s is parsed with int(), ident_s must also be set | |
135 | ||
136 | # get domain name/color (and possibly {info}) | |
137 | ||
138 | (name, color_s) = re.search(r';C=([^~]+)~(.+)$',text).groups() | |
139 | info_s="" | |
140 | ||
141 | if (re.search(r'\}$',name)): | |
142 | (name, info_s) = re.search(r'([^\{]+)(\{[^\}]+\})$',name).groups() | |
143 | ||
144 | dom_align = DomAlign(name, info_s, color_s, int(qstart_s), int(qend_s), int(sstart_s), int(send_s), | |
145 | int(r_score_s), float(b_score_s), float(ident_s),float(qscore_s), RXRState, text) | |
146 | ||
147 | return dom_align | |
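
The annotation format handled by `parse_domain()` can be illustrated with a standalone sketch. This is a Python 3 illustration, not part of `rename_exons.py`: the helper name `parse_annotation` and the single combined regular expression are assumptions made for the example, and it returns a plain dict rather than a `DomAlign` object.

```python
import re

# Hypothetical helper (not in rename_exons.py): one regular expression for a
# complete annotation field such as
#   RX:5-82:5-82:s=397;b=163.1;I=1.000;Q=453.6;C=C.Thioredoxin{PF012445}~1
ANNOT_RE = re.compile(
    r'^(?P<rxr>\w+):(?P<q_start>\d+)-(?P<q_end>\d+):(?P<s_start>\d+)-(?P<s_end>\d+):'
    r's=(?P<score>-?\d+);b=(?P<bits>-?[\d.]+);I=(?P<ident>[\d.]+);Q=(?P<q_score>-?[\d.]+);'
    r'C=(?P<name>[^~{]+)(?P<info>\{[^}]+\})?~(?P<color>.+)$'
)

def parse_annotation(text):
    """Return a dict of the annotation fields, or None if the text does not match."""
    m = ANNOT_RE.match(text)
    if m is None:
        return None
    fields = m.groupdict()
    for key in ('q_start', 'q_end', 's_start', 's_end', 'score'):
        fields[key] = int(fields[key])
    for key in ('bits', 'ident', 'q_score'):
        fields[key] = float(fields[key])
    fields['info'] = fields['info'] or ''   # the {info} block is optional
    return fields

example = "RX:5-82:5-82:s=397;b=163.1;I=1.000;Q=453.6;C=C.Thioredoxin{PF012445}~1"
parsed = parse_annotation(example)
print(parsed['rxr'], parsed['name'], parsed['info'], parsed['color'])
```

Returning `None` on a failed match lets the caller decide how to handle a malformed field instead of failing later on an undefined variable.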
148 | ||
149 | # dom_info is like domain, but no scores | |
150 | ################ | |
151 | # exon_info is like domain, but no scores | |
152 | # | |
153 | def parse_exon_info(text): | |
154 |     # takes an exon/domain info string, turns it into an exonInfo object | |
155 | # looks like: DX:1-100;C=C.Thioredoxin~1 | |
156 | ||
157 | (RXRState, start_s, end_s,name, color) = re.search(r'^(\w+):(\d+)-(\d+);C=([^~]+)~(.*)$',text).groups() | |
158 | info = "" | |
159 | if (re.search(r'\}$',name)): | |
160 | (name, info) = re.search(r'([^\{]+)(\{[^\}]+\})$',name).groups() | |
161 | ||
162 | gene_re = re.search(r'^\{([\w\.]+):(\d+)\-(\d+)\}',info) | |
163 | if (gene_re): | |
164 | (chrom, d_start, d_end) = gene_re.groups() | |
165 | else: | |
166 | (chrom, d_start, d_end) = ('',-1,-1) | |
167 | # sys.stderr.write("genome info not found: %s\n" % (text)) | |
168 | ||
169 | q_target = True; | |
170 | if (RXRState == 'XD'): | |
171 | q_target = False | |
172 | ||
173 | exon_info = exonInfo(name, q_target, int(start_s), int(end_s), chrom, int(d_start), int(d_end), text) | |
174 | ||
175 | return exon_info | |
176 | ||
177 | def overlap_fract(qdom, sdom): | |
178 | # checks if a query and subject domain overlap | |
179 | # if they do, return the amount of overlap with respect to each domain | |
180 | # how much of query is covered by subject, how much of subject is covered by query | |
181 | ||
182 | q_overlap = 0.0 | |
183 | s_overlap = 0.0 | |
184 | ||
185 | qq_len = qdom.q_end-qdom.q_start+1 # query alignment length in query coordinates | |
186 | qs_len = qdom.s_end-qdom.s_start+1 # query alignment length in subj coordinates | |
187 | sq_len = sdom.q_end-sdom.q_start+1 # subj alignment length in query coordinates | |
188 | ss_len = sdom.s_end-sdom.s_start+1 # subj alignment length in subject coordinates | |
189 | ||
190 | case = -1 | |
191 | ||
192 | # case (0) no overlap at all | |
193 | if (qdom.q_end < sdom.q_start or sdom.s_end < qdom.s_start or qdom.q_start > sdom.q_end or sdom.q_start > qdom.q_end) : | |
194 | case = 0 | |
195 | q_overlap = s_overlap = 0.0 | |
196 | # case (1) query surrounds subject | |
197 | elif (qdom.q_start <= sdom.q_start and qdom.q_end >= sdom.q_end): | |
198 | case = 1 | |
199 | s_overlap = 1.0 | |
200 | q_overlap = float(sq_len)/qq_len | |
201 | # case (2) subject surrounds query | |
202 | elif (sdom.s_start <= qdom.s_start and sdom.s_end >= qdom.s_end): | |
203 | case = 2 | |
204 | q_overlap = 1.0 | |
205 | s_overlap = float(qs_len)/ss_len | |
206 | # case (3) query left of subject | |
207 | elif (qdom.q_start <= sdom.q_start and qdom.q_end <= sdom.q_end): | |
208 | case = 3 | |
209 | q_overlap = float(qdom.q_end-sdom.q_start+1)/qq_len | |
210 | s_overlap = float(qdom.s_end-sdom.s_start+1)/ss_len | |
211 |     # case (4) subject left of query | |
212 | elif (sdom.s_start <= qdom.s_start and sdom.s_end <= qdom.s_end): | |
213 | case = 4 | |
214 | q_overlap = float(sdom.q_end-qdom.q_start+1)/qq_len | |
215 | s_overlap = float(sdom.s_end-qdom.s_start+1)/ss_len | |
216 | ||
217 | if (q_overlap > 1.0 or s_overlap > 1.0): | |
218 | if (1): | |
219 | sys.stderr.write("***%i: qdom: %s sdom: %s\n"% (case,str(qdom),str(sdom))) | |
220 | sys.stderr.write(" ** qover %.3f sover: %.3f\n"% (q_overlap, s_overlap)) | |
221 | sys.stderr.write(" ** qq_len: %d qs_len: %d ss_len: %d sq_len %d\n"%(qq_len, qs_len, ss_len, sq_len)) | |
222 | ||
223 | return (q_overlap, s_overlap) | |
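
The idea behind `overlap_fract()` can be shown in a reduced form. The sketch below is a Python 3 simplification that works in a single coordinate system with inclusive positions, whereas the function above mixes query and subject coordinates; the name `overlap_fractions` is hypothetical.

```python
def overlap_fractions(a_start, a_end, b_start, b_end):
    """Return (fraction of a covered by b, fraction of b covered by a).

    Coordinates are inclusive, as in the annotation strings, so a
    region from 1 to 10 has length 10.
    """
    a_len = a_end - a_start + 1
    b_len = b_end - b_start + 1
    shared = min(a_end, b_end) - max(a_start, b_start) + 1
    if shared <= 0:            # the regions do not touch
        return (0.0, 0.0)
    return (shared / a_len, shared / b_len)

print(overlap_fractions(1, 100, 51, 150))   # two partially overlapping regions
print(overlap_fractions(1, 100, 11, 20))    # the second region lies inside the first
```

Computing the shared length with `min`/`max` covers the "surrounds", "left of", and "right of" cases with one expression, and neither fraction can exceed 1.0.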
224 | ||
225 | #### | |
226 | # parse_protein(result_line) | |
227 | # takes a protein in string format, turns it into a dictionary properly | |
228 | # looks like: sp|P30711|GSTT1_HUMAN up|Q2NL00|GSTT1_BOVIN 86.67 240 32 0 1 240 1 240 1.4e-123 444.0 16VI7DR6IT3IR15KQ3AI6TI11TA7YH8RC12TA3SN10FL10QETM2AT6VMTA2LV2DG4ND6PS24EK6TA11DV14FSPQ5IL3LMML1WK5RQ |XR:4-76:4-76:s=327;b=134.6;I=0.895;Q=367.8;C=C.Thioredoxin~1|RX:5-82:5-82:s=356;b=146.5;I=0.902;Q=403.3;C=C.Thioredoxin~1|RX:83-93:83-93:s=52;b=21.4;I=0.818;Q=30.9;C=NODOM~0|XR:77-93:77-93:s=86;b=35.4;I=0.882;Q=72.6;C=NODOM~0|RX:94-110:94-110:s=88;b=36.2;I=0.882;Q=75.0;C=vC.GST_C~2v|XR:94-110:94-110:s=88;b=36.2;I=0.882;Q=75.0;C=vC.GST_C~2v|RX:111-201:111-201:s=409;b=168.3;I=0.868;Q=468.3;C=C.GST_C~2|XR:111-201:111-201:s=409;b=168.3;I=0.868;Q=468.3;C=C.GST_C~2|RX:202-240:202-240:s=154;b=63.4;I=0.795;Q=155.9;C=NODOM~0|XR:202-240:202-240:s=154;b=63.4;I=0.795;Q=155.9;C=NODOM~0 | |
229 | # | |
230 | # returns [data[x] for x in fields] but also data['q/s_dom_list'] and data['q/sinfo_list'] | |
231 | def parse_protein(line_data,fields, req_name): | |
232 |     # last part (domain annotations) split('|') and parsed by parse_domain() | |
233 | ||
234 | data = {} | |
235 | data = dict(zip(fields, line_data)) | |
236 | if (re.search(r'\|',data['qseqid'])): | |
237 | data['qseq_acc'] = data['qseqid'].split('|')[1] | |
238 | else: | |
239 | data['qseq_acc'] = data['qseqid'] | |
240 | ||
241 | if (re.search(r'\|',data['sseqid'])): | |
242 | data['sseq_acc'] = data['sseqid'].split('|')[1] | |
243 | else: | |
244 | data['sseq_acc'] = data['sseqid'] | |
245 | ||
246 | Qdom_list = [] | |
247 | Sdom_list = [] | |
248 | ||
249 | Qinfo_list = [] | |
250 | Sinfo_list = [] | |
251 | ||
252 | counter = 0 | |
253 | ||
254 | if ('dom_annot' in data and len(data['dom_annot']) > 0): | |
255 | for dom_str in data['dom_annot'].split('|')[1:]: | |
256 | if (req_name and not re.search(req_name, dom_str)): | |
257 | continue | |
258 | ||
259 | counter += 1 | |
260 | dom = parse_domain(dom_str) | |
261 | dom.idnum = counter | |
262 | if (dom.rxr == 'RX'): | |
263 | Qdom_list.append(dom) | |
264 | else: | |
265 | Sdom_list.append(dom) | |
266 | ||
267 | data['qdom_list'] = Qdom_list | |
268 | data['sdom_list'] = Sdom_list | |
269 | ||
270 | if ('dom_info' in data and len(data['dom_info']) > 0): | |
271 | for info_str in data['dom_info'].split('|')[1:]: | |
272 | if (req_name and not re.search(req_name, info_str)): | |
273 | continue | |
274 | if (not re.search(r'^[DX][XD]',info_str)): | |
275 | continue | |
276 | ||
277 | dinfo = parse_exon_info(info_str) | |
278 | ||
279 | if (dinfo.q_target): | |
280 | Qinfo_list.append(dinfo) | |
281 | else: | |
282 | Sinfo_list.append(dinfo) | |
283 | ||
284 | ||
285 | # put links to info_list into dom_list so info_list names can | |
286 | # be changed -- give S/Qinfo's the S/Qdom ids of the overlapping domain | |
287 | ||
288 | find_info_overlaps(Qinfo_list, Qdom_list) | |
289 | find_info_overlaps(Sinfo_list, Sdom_list) | |
290 | ||
291 | data['qinfo_list'] = Qinfo_list | |
292 | data['sinfo_list'] = Sinfo_list | |
293 | ||
294 | return data | |
295 | ||
296 | # "domain" : RX:1-38:3-40:s=37;b=17.0;I=0.289;Q=15.9;C=exon_1~1 | |
297 | # "name" : like exon_2 | |
298 | # expanded for domain: RX:1-38:3-40:s=37;b=17.0;I=0.289;Q=15.9;C=exon_1{chr1:12345678-123456987}~1 | |
299 | def replace_name(domain_text, new_name, new_color_s): | |
300 | out = "=".join(domain_text.split("=")[:-1]) # out has everything to last '=' | |
301 | ||
302 | old_name = domain_text.split(";C=")[-1] | |
303 | old_info="" | |
304 | ||
305 | if (re.search(r'\}~',old_name)): | |
306 | (old_info)=re.search(r'(\{[^\}]+\})~',old_name).group(1) | |
307 | ||
308 | if (not re.match(r'\d+',new_color_s)): | |
309 | new_color_s="0" | |
310 | out += "="+new_name+old_info+"~"+new_color_s # put it together | |
311 | return out | |
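
The effect of `replace_name()` on one annotation string can be reproduced with a short Python 3 sketch. The function name `relabel` and the regular expression are assumptions made for this example. It mirrors the behavior described above: the `{info}` block is kept, and the color falls back to `"0"` when the new color does not start with a digit.

```python
import re

def relabel(annot_text, new_name, new_color_s):
    """Swap the name and color at the end of an annotation, keeping any {info} block."""
    m = re.match(
        r'^(?P<head>.*;C=)(?P<name>[^~{]+)(?P<info>\{[^}]*\})?~(?P<color>.*)$',
        annot_text)
    if m is None:
        return annot_text                    # leave unparseable text unchanged
    if not re.match(r'\d+', new_color_s):    # same fallback as replace_name()
        new_color_s = "0"
    return m.group('head') + new_name + (m.group('info') or '') + '~' + new_color_s

old = "RX:1-38:3-40:s=37;b=17.0;I=0.289;Q=15.9;C=exon_1{chr1:12345678-123456987}~1"
print(relabel(old, "exon_2", "3"))
```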
312 | ||
313 | ################ | |
314 | # check for overlaps using mid-point | |
315 | # | |
316 | def mid_overlaps(qdom_list, sdom_list): | |
317 | ||
318 | if (len(qdom_list) != len(sdom_list)): | |
319 | return False | |
320 | ||
321 | for ix, q_dom in enumerate(qdom_list): | |
322 | s_dom = sdom_list[ix] | |
323 | q_mid = q_dom.q_start + (q_dom.q_end - q_dom.q_start + 1)/2.0 | |
324 | if not (q_mid >= s_dom.q_start and q_mid <= s_dom.q_end): | |
325 | return False | |
326 | ||
327 | q_qfract, q_sfract = overlap_fract(q_dom, s_dom) # overlap from query perspective | |
328 | s_sfract, s_qfract = overlap_fract(s_dom, q_dom) # overlap from subject perspective | |
329 | ||
330 | q_dom.overlap_list.append({"dom": s_dom, "q_over": q_qfract, "s_over": q_sfract}) | |
331 | s_dom.overlap_list.append({"dom": q_dom, "q_over": s_qfract, "s_over": s_sfract}) | |
332 | ||
333 | return True | |
334 | ||
335 | ################ | |
336 | # find_overlaps -- populates dom.overlap_list for qdoms, sdoms | |
337 | # | |
338 | def find_overlaps(qdom_list, sdom_list, over_thresh): | |
339 | # find qdom, sdom overlaps in O(N) time | |
340 | # | |
341 | ||
342 | if (len(sdom_list) == 0 or len(qdom_list)==0): | |
343 | return | |
344 | ||
345 | if (len(sdom_list) == len(qdom_list)): # same number of domains | |
346 | if (mid_overlaps(qdom_list, sdom_list)): | |
347 | return; | |
348 | else: | |
349 | for d in qdom_list: | |
350 | d.overlap_list = [] | |
351 | for d in sdom_list: | |
352 | d.overlap_list = [] | |
353 | ||
354 | ||
355 | qdom_queue = [x for x in qdom_list] # build a duplicate list | |
356 | sdom_queue = [x for x in sdom_list] | |
357 | ||
358 | qdom = qdom_queue.pop(0) # get the first element of each | |
359 | sdom = sdom_queue.pop(0) | |
360 | ||
361 | while (True): | |
362 | pop_s = pop_q = False | |
363 | ||
364 | q_qfract, q_sfract = overlap_fract(qdom, sdom) # overlap from query perspective | |
365 | if (q_qfract > over_thresh or q_sfract > over_thresh): | |
366 | qdom.append_overlap({"dom": sdom, "q_over": q_qfract, "s_over": q_sfract}) | |
367 | qdom.over_cnt += 1 | |
368 | ||
369 |         s_sfract, s_qfract = overlap_fract(sdom, qdom) # overlap from subject perspective | |
370 | if (s_qfract > over_thresh or s_sfract > over_thresh): | |
371 | sdom.append_overlap({"dom": qdom, "q_over": s_qfract, "s_over": s_sfract}) | |
372 | sdom.over_cnt += 1 | |
373 | ||
374 | # check to see if we've used up the domain | |
375 | if (qdom.s_end >= sdom.s_end): | |
376 | pop_s = True | |
377 | # else there are more s_dom's that are part of this q_dom | |
378 | ||
379 | if (sdom.q_end >= qdom.q_end): | |
380 | pop_q = True | |
381 | # else there are more q_dom's that are part of this s_dom | |
382 | ||
383 | # print 'QS: %s %s\t%s %s' %(pop_q, pop_s, qdom, sdom) | |
384 | ||
385 | if (len(qdom_queue) > 0): | |
386 | if (pop_q): # done with this qdom, get next | |
387 | qdom = qdom_queue.pop(0) | |
388 | elif (pop_q): # don't break until we try to get the next domain | |
389 | break; | |
390 | ||
391 | if (len(sdom_queue) > 0): | |
392 | if (pop_s): # done with this sdom, get next | |
393 | sdom = sdom_queue.pop(0) | |
394 | elif (pop_s): # don't break until we try to get the next domain | |
395 | break; | |
396 | #### | |
397 | # all done with overlaps | |
398 | ||
399 | # # print "overlaps done" | |
400 | # for qd in qdom_list: | |
401 | # print qd, qd.over_cnt | |
402 | # for sd in qd.overlap_list: | |
403 | # print " s: q_over %.3f s_over: %.3f %s" % (sd['q_over'], sd['s_over'], str(sd['dom'])) | |
404 | # print "====" | |
405 | ||
406 | # for sd in sdom_list: | |
407 | # print sd, sd.over_cnt | |
408 | # for qd in sd.overlap_list: | |
409 | # print " q: q_over %.3f s_over: %.3f %s" % (qd['q_over'], qd['s_over'], str(qd['dom'])) | |
410 | # print "====" | |
411 | ||
412 | ################ | |
413 | # info_overlaps -- populates dom.overlap_list for qdoms, sdoms | |
414 | # | |
415 | def find_info_overlaps(info_list, dom_list): | |
416 | ||
417 | if (len(info_list) == 0 or len(dom_list)==0): | |
418 | return | |
419 | ||
420 | info_queue = [x for x in info_list] # build a duplicate list | |
421 | dom_queue = [x for x in dom_list] | |
422 | ||
423 | info = info_queue.pop(0) # get the first element of each | |
424 | dom = dom_queue.pop(0) | |
425 | ||
426 | while (True): | |
427 | pop_d = pop_i = False | |
428 | ||
429 | if (dom.rxr == 'RX'): # use dom.q_start/q_end | |
430 | if (dom.q_end < info.p_start): | |
431 | pop_d = True | |
432 | elif (dom.q_end >= info.p_start and dom.q_start <= info.p_end): # overlap | |
433 | dom.info_dom = info | |
434 | pop_d = True | |
435 | pop_i = True | |
436 | elif (info.p_end < dom.q_start): | |
437 | pop_i = True | |
438 | ||
439 | else: # use dom.s_start/s_end | |
440 | if (dom.s_end < info.p_start): | |
441 | pop_d = True | |
442 | elif (dom.s_end >= info.p_start and dom.s_start <= info.p_end): # overlap | |
443 | dom.info_dom = info | |
444 | pop_d = True | |
445 | pop_i = True | |
446 | elif (info.p_end < dom.s_start): | |
447 | pop_i = True | |
448 | ||
449 | if (len(info_queue) > 0): | |
450 | if (pop_i): # done with this info, get next | |
451 | info = info_queue.pop(0) | |
452 | elif (pop_i): # don't break until we try to get the next domain | |
453 | break; | |
454 | ||
455 | if (len(dom_queue) > 0): | |
456 | if (pop_d): # done with this dom, get next | |
457 | dom = dom_queue.pop(0) | |
458 | elif (pop_d): | |
459 | break; | |
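
Both `find_overlaps()` and `find_info_overlaps()` walk two lists in a single pass, advancing one or both queues at each step. The general two-pointer technique can be sketched as follows. This is a Python 3 illustration rather than a transcription of either function: it assumes both lists hold inclusive `(start, end)` tuples sorted by position and non-overlapping within each list, and the name `sweep_overlaps` is hypothetical.

```python
def sweep_overlaps(a_list, b_list):
    """Return index pairs (i, j) where a_list[i] overlaps b_list[j].

    Assumes each list is sorted by position and its intervals do not
    overlap each other, so one pass over both lists is sufficient.
    """
    pairs = []
    i = j = 0
    while i < len(a_list) and j < len(b_list):
        a_start, a_end = a_list[i]
        b_start, b_end = b_list[j]
        if a_end >= b_start and b_end >= a_start:
            pairs.append((i, j))
        # advance whichever interval ends first; it cannot overlap anything later
        if a_end <= b_end:
            i += 1
        else:
            j += 1
    return pairs

query_exons = [(1, 38), (39, 67), (68, 117)]
subject_exons = [(1, 67), (68, 124)]
print(sweep_overlaps(query_exons, subject_exons))
```

Each iteration advances one of the two indices, so the loop runs at most `len(a_list) + len(b_list)` times.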
460 | ||
461 | ################ | |
462 | # build_multi_dict -- builds a dictionary of multiple overlaps in | |
463 | # qdom.overlap_list or sdom.overlap_list | |
464 | # returns multi_dict[idnum] | |
465 | # | |
466 | def build_multi_dict(dom_list): | |
467 | # this code looks for xdom's that are associated with multiple ydoms | |
468 | # | |
469 | multi_dict = {} # dict of {qids:/sdom:/qdoms:[]} | |
470 | for dom in dom_list: # for each subject domain | |
471 | if (dom.over_cnt <= 1): | |
472 | continue | |
473 | ||
474 | multi_id_list = [] | |
475 | multi_dom_list = [] | |
476 | multi_q_cnt = 0 | |
477 | for xd_over_yd in dom.overlap_list: # a set of q_doms that overlap the subject | |
478 | multi_q_cnt += 1 | |
479 | multi_id_list.append(xd_over_yd["dom"].idnum) # these are q_dom idnum's | |
480 | multi_dom_list.append(xd_over_yd["dom"]) # these are q_doms | |
481 | ||
482 | if (multi_q_cnt > 1): # only save when two (or more) overlaps | |
483 | multi_dict[dom.idnum] = {"yids": multi_id_list, "ydoms":multi_dom_list, 'xdom':dom} | |
484 | ||
485 | # # print out current multi_q_list | |
486 | # print "--- multi_q dict ---" | |
487 | # for db in multi_dict.keys(): | |
488 | # print "sdom: %s"%(db) | |
489 | # for ix, qd in enumerate(multi_dict[db]['ydoms']): | |
490 | # print " %d %d: %s"%(ix, multi_dict[db]['yids'][ix], qd) | |
491 | ||
492 | # print "--- multi_dict done" | |
493 | ||
494 | return multi_dict | |
495 | ||
496 | ################ | |
497 | # find_best_id() -- returns idnum of the domain with the largest overlap fraction of type over_type (e.g. 'q_over') | |
498 | # | |
499 | def find_best_id(overlap_list, over_type): | |
500 | ||
501 | max_fract = 0.0 | |
502 | max_idnum = 0 | |
503 | for over_d in overlap_list: | |
504 | if (over_d[over_type] > max_fract): | |
505 | max_idnum = over_d['dom'].idnum | |
506 | max_fract = over_d[over_type] | |
507 | ||
508 | return max_idnum | |
509 | ||
510 | ################################################################ | |
511 | # final labeling routine -- leave qdom's alone, modify sdoms based on qdoms. | |
512 | ################ | |
513 | # sdom's in more than one qdom are in multi_q_dict[] | |
514 | # qdom's in more than one sdom are in multi_s_dict[] | |
515 | # everyone else just gets the qdom name | |
516 | # returns sdom_displayed_dict{idnum} -- the set of sdoms that have been modified | |
517 | # | |
518 | # 13-Nov-2018 -- ensure that there is an info_dom before replacing info_dom.text | |
519 | # | |
520 | def label_doms(qdom_list, sdom_list, multi_q_dict, multi_s_dict): | |
521 | ||
522 | sdom_displayed_dict = {} | |
523 | for qdom in qdom_list: | |
524 | # qdom's stay the same | |
525 | qdom.out_str = qdom.text | |
526 | ||
527 |         # check for q_doms with multiple s_doms | |
528 |         if (qdom.idnum in multi_s_dict): | |
529 |             # the best-overlapping s_dom gets qdom.name, the rest are named exon_X | |
530 | multi_s_entry = multi_s_dict[qdom.idnum] | |
531 | best_id = find_best_id(qdom.overlap_list,'q_over') # find sdom with most overlap | |
532 | for s_over in qdom.overlap_list: # find the sdom's that overlap this qdom | |
533 | sdom = s_over['dom'] | |
534 | if (sdom.idnum == best_id): | |
535 | sdom.out_str = replace_name(sdom.text, qdom.name, str(qdom.color)) | |
536 | if (sdom.info_dom): | |
537 | sdom.info_dom.out_str = replace_name(sdom.info_dom.text,qdom.name, str(qdom.color)) | |
538 | else: | |
539 | sdom.out_str = replace_name(sdom.text, "exon_X","0") | |
540 | if (sdom.info_dom): | |
541 | sdom.info_dom.out_str = replace_name(sdom.info_dom.text,"exon_X","0") | |
542 | sdom_displayed_dict[sdom.idnum] = sdom; | |
543 | continue # prevents re-labeling later | |
544 | ||
545 | # check for q_doms with multiple doms | |
546 | for sd_over in qdom.overlap_list: | |
547 | sdom = sd_over['dom'] | |
548 | # it might make sense to do this in a second for loop after | |
549 | # all the multiple stuff is done | |
550 | if (sdom.idnum not in multi_q_dict): | |
551 | # this is the simplest case -- sdom.text gets qdom.name | |
552 | if (sdom.idnum not in sdom_displayed_dict): | |
553 | sdom.out_str = replace_name(sdom.text, qdom.name, str(qdom.color)) | |
554 | if (sdom.info_dom): | |
555 | sdom.info_dom.out_str = replace_name(sdom.info_dom.text,qdom.name, str(qdom.color)) | |
556 | else: | |
557 | # this sdom belongs to multiple q_doms, add each of those q_doms to the name | |
558 | exon_str='exon_' | |
559 | # "ydoms" here are the qdoms overlapped by sdom | |
560 | exon_str += '/'.join([ x.name.split("_")[1] for x in multi_q_dict[sdom.idnum]['ydoms']]) | |
561 | sdom.out_str = replace_name(sdom.text, exon_str,"0") | |
562 | if (sdom.info_dom): | |
563 | sdom.info_dom.out_str = replace_name(sdom.info_dom.text,exon_str,"0") | |
564 | ||
565 | sdom_displayed_dict[sdom.idnum]=sdom | |
566 | ||
567 | # done with labeling sdoms based on qdoms, but some may be unlabeled | |
568 | # check for missing s_doms | |
569 | while (len(sdom_displayed_dict.keys()) < len(sdom_list)): | |
570 | for sdom in sdom_list: | |
571 | if (sdom.idnum not in sdom_displayed_dict): | |
572 | sdom.out_str = replace_name(sdom.text, "exon_X","0") | |
573 | if (sdom.info_dom): | |
574 | sdom.info_dom.out_str = replace_name(sdom.info_dom.text,"exon_X","0") | |
575 | ||
576 | sdom_displayed_dict[sdom.idnum] = sdom | |
577 | ||
578 | return sdom_displayed_dict | |
579 | ||
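The multi-exon case in `label_doms()` builds one label from every query exon that a subject domain overlaps. A minimal sketch of that naming step, assuming query exon names follow the `exon_N` pattern seen in the sample annotations; the helper name `joined_exon_label` is hypothetical and not part of the script:

```python
def joined_exon_label(qdom_names):
    """Combine names such as 'exon_2' and 'exon_3' into 'exon_2/3'.

    Assumes every name has the form '<prefix>_<number>'; only the part
    after the first underscore is kept, as in label_doms().
    """
    return 'exon_' + '/'.join(name.split('_')[1] for name in qdom_names)


if __name__ == '__main__':
    print(joined_exon_label(['exon_2', 'exon_3']))  # exon_2/3
```

Subject domains with no overlapping query exon fall through to the generic `exon_X` label instead.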
580 | ################ | |
581 | # | |
582 | # aa_to_exon() --- given a coordinate and the corresponding exon map, return the exon coordinate | |
583 | # (can only be done for aligned exons) | |
584 | # | |
585 | # this version of the function must use an info_list, not an | |
586 | # align_list, because it uses p_start/p_end rather than q_start/s_start, etc. | |
587 | # a version using qp_start/sp_start would also need a target argument | |
588 | # | |
589 | def aa_to_exon(aa_coords, exon_info_list): | |
590 | ||
591 | sorted_aa_coords = sorted(aa_coords) | |
592 | ||
593 | pos_strand = True | |
594 | if (exon_info_list[0].d_start > exon_info_list[0].d_end): | |
595 | pos_strand = False | |
596 | ||
597 | ex_x = 0 | |
598 | exon_coords = [] | |
599 | ||
600 | aap_x = 0 | |
601 | this_aap = sorted_aa_coords[aap_x] | |
602 | while (ex_x < len(exon_info_list)): | |
603 | this_exon = exon_info_list[ex_x] | |
604 | if (this_aap <= this_exon.p_end and this_aap >= this_exon.p_start): | |
605 | aa_dna_offset = (this_aap - this_exon.p_start) * 3 | |
606 | ||
607 | if (pos_strand): | |
608 | aa_dna_pos = this_exon.d_start + aa_dna_offset | |
609 | else: | |
610 | aa_dna_pos = this_exon.d_start - aa_dna_offset | |
611 | ||
612 | exon_coords.append({'chrom':this_exon.chrom, 'dpos':aa_dna_pos}) | |
613 | aap_x += 1 | |
614 | if (aap_x < len(sorted_aa_coords)): | |
615 | this_aap = sorted_aa_coords[aap_x] | |
616 | else: | |
617 | break | |
618 | else: | |
619 | ex_x += 1 | |
620 | ||
621 | aa_coord_dict = {} | |
622 | for aap_x, aap in enumerate(sorted_aa_coords): | |
623 | aa_coord_dict[aap] = exon_coords[aap_x] | |
624 | ||
625 | return [aa_coord_dict[ax] for ax in aa_coords] | |
626 | ||
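The coordinate arithmetic in `aa_to_exon()` can be illustrated in isolation: each residue past the exon's protein start advances the DNA position by three bases, added on the plus strand and subtracted on the minus strand. The sketch below is a simplified stand-in rather than the script's function. The `ExonInfo` tuple and `aa_to_dna` name are hypothetical, it handles one coordinate at a time, and it decides the strand per exon, whereas the original checks only the first exon. It is written for Python 3, unlike the surrounding Python 2 script.

```python
from collections import namedtuple

# Hypothetical stand-in for the script's exon info objects.
ExonInfo = namedtuple('ExonInfo', 'chrom p_start p_end d_start d_end')


def aa_to_dna(aa_pos, exons):
    """Return (chrom, dna_pos) for a protein coordinate, or None if unmapped."""
    for ex in exons:
        if ex.p_start <= aa_pos <= ex.p_end:
            offset = (aa_pos - ex.p_start) * 3    # three bases per residue
            if ex.d_start <= ex.d_end:            # plus strand
                return ex.chrom, ex.d_start + offset
            return ex.chrom, ex.d_start - offset  # minus strand
    return None


if __name__ == '__main__':
    plus = [ExonInfo('chr1', 1, 12, 1000, 1035),
            ExonInfo('chr1', 13, 37, 2000, 2074)]
    minus = [ExonInfo('chr1', 1, 12, 5000, 4965)]
    print(aa_to_dna(15, plus))   # ('chr1', 2006)
    print(aa_to_dna(3, minus))   # ('chr1', 4994)
```

Returning `None` for an unmapped coordinate is a choice made for this sketch; the original assumes every requested coordinate falls inside an aligned exon.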
627 | ################ | |
628 | # | |
629 | def set_data_fields(args, line_data) : | |
630 | ||
631 | field_str = 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore BTOP dom_annot' | |
632 | field_qs_str = 'qseqid qlen sseqid slen pident length mismatch gapopen qstart qend sstart send evalue bitscore BTOP dom_annot' | |
633 | ||
634 | if (len(line_data) > 1) : | |
635 | if ((not args.have_qslen) and re.search(r'\d+',line_data[1])): | |
636 | args.have_qslen=True | |
637 | ||
638 | if ((not args.dom_info) and re.search(r'^\|[DX][XD]\:',line_data[-1])): | |
639 | args.dom_info = True | |
640 | ||
641 | end_field = -1 | |
642 | fields = field_str.split(' ') | |
643 | ||
644 | if (args.have_qslen): | |
645 | fields = field_qs_str.split(' ') | |
646 | ||
647 | if (args.dom_info): | |
648 | fields.append('dom_info') | |
649 | end_field = -2 | |
650 | ||
651 | return (fields, end_field) | |
652 | ||
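`set_data_fields()` returns both the column names and a negative `end_field` index, which `main()` later uses as `fields[:end_field]` to keep only the plain tabular columns. A small sketch of that relationship, assuming the same column names as above; `choose_fields` is a hypothetical helper that takes the two flags directly instead of sniffing them from a data line:

```python
BASE_FIELDS = ('qseqid sseqid pident length mismatch gapopen qstart qend '
               'sstart send evalue bitscore BTOP dom_annot').split()


def choose_fields(have_qslen=False, dom_info=False):
    """Return (fields, end_field) for the given output options."""
    fields = list(BASE_FIELDS)
    if have_qslen:
        fields[1:1] = ['qlen']   # qseqid qlen sseqid ...
        fields[3:3] = ['slen']   # ... sseqid slen pident ...
    end_field = -1               # drop dom_annot
    if dom_info:
        fields.append('dom_info')
        end_field = -2           # drop dom_annot and dom_info
    return fields, end_field


if __name__ == '__main__':
    fields, end_field = choose_fields(have_qslen=True, dom_info=True)
    print(fields[:end_field][-1])  # BTOP
```

In both cases the slice ends at `BTOP`, so the annotation columns can be rebuilt and appended separately.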
653 | ################################################################ | |
654 | # | |
655 | # main program | |
656 | # print "#"," ".join(sys.argv) | |
657 | ||
658 | def main(): | |
659 | ||
660 | parser=argparse.ArgumentParser(description='scan_exons.py result_file.m8CB : re-label subject exons to match query') | |
661 | parser.add_argument('--have_qslen', help='bl_tab fields include query/subject lengths',dest='have_qslen',action='store_true',default=False) | |
662 | parser.add_argument('--dom_info', help='raw domain coordinates included',action='store_true',default=False) | |
663 | parser.add_argument('--fill_gcoords', help='fill in genomic coordinates',action='store_true',default=False) | |
664 | parser.add_argument('files', metavar='FILE', nargs='*', help='files to read, if empty, stdin is used') | |
665 | ||
666 | args=parser.parse_args() | |
667 | ||
668 | end_field = -1 | |
669 | data_fields_reset=False | |
670 | ||
671 | (fields, end_field) = set_data_fields(args, []) | |
672 | ||
673 | if (args.have_qslen and args.dom_info): | |
674 | data_fields_reset=True | |
675 | ||
676 | saved_qdom_list = [] | |
677 | qdom_list = [] | |
678 | ||
679 | for line in fileinput.input(args.files): | |
680 | # pass through comments | |
681 | if (line[0] == '#'): | |
682 | print line, # ',' because have not stripped | |
683 | continue | |
684 | ||
685 | ################ | |
686 | # break up tab fields, check for extra fields | |
687 | line = line.strip('\n') | |
688 | line_data = line.split('\t') | |
689 | if (not data_fields_reset): # look for --have_qslen number, --dom_info data, even if not set | |
690 | (fields, end_field) = set_data_fields(args, line_data) | |
691 | data_fields_reset = True | |
692 | ||
693 | ################ | |
694 | # get exon annotations | |
695 | data = parse_protein(line_data,fields,"exon") # get score/alignment/domain data | |
696 | ||
697 | if (len(data['sdom_list'])==0 and len(data['qdom_list'])==0): | |
698 | print line # no domains to be edited, print stripped line and continue | |
699 | continue | |
700 | ||
701 | # qdom_list=[] outside of loop for cases where the qseqid==sseqid match is not first | |
702 | if len(data['qdom_list'])== 0: | |
703 | if data['qseqid'] == data['sseqid']: | |
704 | saved_qdom_list = [ copy.deepcopy(x) for x in data['sdom_list']] | |
705 | max_sdom_id=len(data['sdom_list'])+1 | |
706 | for qdom in saved_qdom_list: | |
707 | qdom.rxr = 'RX' | |
708 | qdom.idnum = max_sdom_id | |
709 | max_sdom_id += 1 | |
710 | ||
711 | qdom_list = [copy.deepcopy(x) for x in saved_qdom_list] | |
712 | else: | |
713 | qdom_list = data['qdom_list'] | |
714 | ||
715 | # print out non-exon info | |
716 | ||
717 | if (len(qdom_list) == 0): | |
718 | print line | |
719 | continue | |
720 | ||
721 | btab_str = '\t'.join(str(data[x]) for x in fields[:end_field]) | |
722 | # print # comment out for single line | |
723 | ||
724 | ################ | |
725 | # find overlaps and multi-overlaps | |
726 | # | |
727 | find_overlaps(qdom_list,data['sdom_list'], 0.2) | |
728 | ||
729 | multi_q_dict = build_multi_dict(data['sdom_list']) # keys are sdoms hitting multiple qdoms | |
730 | multi_s_dict = build_multi_dict(qdom_list) # keys are qdoms hitting multiple sdoms | |
731 | ||
732 | ################ | |
733 | # label qdoms, relabel sdoms | |
734 | # | |
735 | sdom_displayed_dict = label_doms(qdom_list, data['sdom_list'], multi_q_dict, multi_s_dict) | |
736 | ||
737 | ################ | |
738 | # print exon annotations | |
739 | # | |
740 | q_exon_list = data['qdom_list'] | |
741 | ||
742 | s_exon_list = [sdom_displayed_dict[x] for x in sdom_displayed_dict.keys()] | |
743 | ||
744 | ################ | |
745 | # if args.fill_gcoords, then do the transformations on the current exon lists | |
746 | ||
747 | if (args.fill_gcoords): | |
748 | sa_from_qa = [] | |
749 | for q_ex in q_exon_list: | |
750 | sa_from_qa.append(q_ex.q_start) | |
751 | sa_from_qa.append(q_ex.q_end) | |
752 | ||
753 | # have list of coordinates, map them to exon | |
754 | sex_from_qa2sa = aa_to_exon(sa_from_qa,data['sinfo_list']) | |
755 | ||
756 | for iqx, q_ex in enumerate(q_exon_list): | |
757 | sg_start = sex_from_qa2sa[2*iqx] | |
758 | sg_end = sex_from_qa2sa[2*iqx+1] | |
759 | sg_replace="::%s:%d-%d}"%(sg_start['chrom'],sg_start['dpos'],sg_end['dpos']) | |
760 | q_ex.text=re.sub(r'\}',sg_replace,q_ex.text) | |
761 | q_ex.out_str=re.sub(r'\}',sg_replace,q_ex.out_str) | |
762 | ||
763 | qa_from_sa = [] | |
764 | for s_ex in s_exon_list: | |
765 | qa_from_sa.append(s_ex.q_start) | |
766 | qa_from_sa.append(s_ex.q_end) | |
767 | ||
768 | # have list of coordinates, map them to exon | |
769 | qex_from_sa2qa = aa_to_exon(qa_from_sa,data['qinfo_list']) | |
770 | ||
771 | for isx, s_ex in enumerate(s_exon_list): | |
772 | qg_start = qex_from_sa2qa[2*isx] | |
773 | qg_end = qex_from_sa2qa[2*isx+1] | |
774 | qg_replace="{%s:%d-%d::"%(qg_start['chrom'],qg_start['dpos'],qg_end['dpos']) | |
775 | s_ex.text=re.sub(r'\{',qg_replace,s_ex.text) | |
776 | s_ex.out_str = re.sub(r'\{',qg_replace,s_ex.out_str) | |
777 | ||
778 | sorted_exon_list = sorted(q_exon_list+s_exon_list,key = lambda r: r.idnum) | |
779 | ||
780 | dom_bar_str = '' | |
781 | for exon in sorted_exon_list: | |
782 | # print exon.print_bar_str() # for multi-line output | |
783 | dom_bar_str += exon.print_bar_str() | |
784 | ||
785 | info_bar_str = '' | |
786 | for info in data['qinfo_list'] + data['sinfo_list']: | |
787 | info_bar_str += info.text | |
788 | ||
789 | print '\t'.join((btab_str, dom_bar_str, info_bar_str)) | |
790 | ||
791 | ################ | |
792 | # run the program ... | |
793 | ||
794 | if __name__ == '__main__': | |
795 | main() | |
796 |
0 | #!/usr/bin/perl -w | |
0 | #!/usr/bin/env perl | |
1 | 1 | |
2 | 2 | ################################################################ |
3 | 3 | # copyright (c) 2014 by William R. Pearson and The Rector & |
21 | 21 | # parse: |
22 | 22 | # sp|P09488|GSTM1_HUMAN gi|121735|sp|P09488.3|GSTM1_HUMAN 100.00 218 0 0 1 218 1 218 2.9e-113 408.2 218M |RX:1-12:1-12:s=64;b=25.0;I=1.000;Q=47.5;C=exon_1|RX:13-37:13-37:s=128;b=49.9;I=1.000;Q=121.4;C=exon_2|RX:38-59:38-59:s=125;b=48.7;I=1.000;Q=117.9;C=exon_3|RX:60-86:60-86:s=145;b=56.5;I=1.000;Q=141.0;C=exon_4|RX:87-120:87-120:s=185;b=72.1;I=1.000;Q=187.2;C=exon_5|RX:121-152:121-152:s=174;b=67.8;I=1.000;Q=174.5;C=exon_6|RX:153-189:153-189:s=197;b=76.8;I=1.000;Q=201.0;C=exon_7|RX:190-218:190-218:s=151;b=58.9;I=1.000;Q=147.9;C=exon_8 |
23 | 23 | |
24 | ||
25 | ||
24 | use warnings; | |
26 | 25 | use strict; |
27 | 26 | use Getopt::Long; |
28 | 27 | use Pod::Usage; |
38 | 38 | #define VMSPIR 5 |
39 | 39 | #define GCGBIN 6 |
40 | 40 | #define FASTQ 7 |
41 | #define LASTTXT 7 | |
41 | #define ACC_SCRIPT 9 | |
42 | #define LASTTXT 9 | |
42 | 43 | #define ACC_LIST 10 |
43 | 44 | |
44 | 45 | #include "mm_file.h" |
94 | 95 | #endif |
95 | 96 | |
96 | 97 | int (*getliba[LASTLIB])(unsigned char *, int, char *, int, fseek_t *, int *, |
97 | struct lmf_str *, long *)={ | |
98 | agetlib,lgetlib,pgetlib,egetlib, | |
99 | igetlib,vgetlib,gcg_getlib,qgetlib, | |
100 | agetlib,agetlib | |
98 | struct lmf_str *, long *)={ | |
99 | agetlib,lgetlib,pgetlib,egetlib, /* 0 - 3 */ | |
100 | igetlib,vgetlib,gcg_getlib,qgetlib, /* 4- 7 */ | |
101 | agetlib,agetlib /* 8,9 */ | |
101 | 102 | #ifdef UNIX |
102 | ,agetlib | |
103 | ,agetlib /* 10 */ | |
103 | 104 | #ifdef NCBIBL13 |
104 | ,ncbl_getliba | |
105 | ,ncbl_getliba /* 11 */ | |
105 | 106 | #else |
106 | ,ncbl2_getliba | |
107 | ,ncbl2_getliba /* 12 */ | |
107 | 108 | #endif |
108 | 109 | #ifdef NCBIBL20 |
109 | ,ncbl2_getliba | |
110 | ,ncbl2_getliba /* 12 */ | |
111 | #else | |
112 | ,agetlib /* 12 - place holder */ | |
110 | 113 | #endif |
111 | 114 | #ifdef MYSQL_DB |
112 | ,agetlib | |
113 | ,agetlib | |
114 | ,agetlib | |
115 | ,mysql_getlib | |
115 | ,agetlib /* 13 */ | |
116 | ,agetlib /* 14 */ | |
117 | ,agetlib /* 15 */ | |
118 | ,mysql_getlib /* 16 */ | |
116 | 119 | #endif |
117 | 120 | #endif |
118 | 121 | }; |
0 | /* cal_cons.c - routines for printing translated alignments for | |
0 | /* cal_cons.c - routines for printing alignments for | |
1 | 1 | fasta, ssearch, ggsearch, glsearch */ |
2 | 2 | |
3 | 3 | /* $Id: cal_cons.c 1280 2014-08-21 00:47:55Z wrp $ */ |
638 | 638 | |
639 | 639 | /* Open query library */ |
640 | 640 | if ((q_file_p= open_lib(q_lib_p, m_msg.qdnaseq,qascii,!m_msg.quiet))==NULL) { |
641 | s_abort(" cannot open library ",m_msg.tname); | |
641 | fprintf(stderr,"*** error [%s:%d] cannot open library %s\n",__FILE__,__LINE__, m_msg.tname); | |
642 | exit(1); | |
643 | ||
644 | /* s_abort(" cannot open library ",m_msg.tname); */ | |
642 | 645 | } |
643 | 646 | /* Fetch first sequence */ |
644 | 647 | qlib = 0; |
663 | 666 | |
664 | 667 | /* if protein and ldb_info.term_code set, add '*' if not there */ |
665 | 668 | if (m_msg.ldb_info.term_code && !(m_msg.qdnaseq==SEQT_DNA || m_msg.qdnaseq==SEQT_RNA) && |
666 | aa0[0][m_msg.n0-1]!='*') { | |
667 | aa0[0][m_msg.n0++]='*'; | |
669 | aa0[0][m_msg.n0-1]!=aascii['*']) { | |
670 | aa0[0][m_msg.n0++]=aascii['*']; | |
668 | 671 | aa0[0][m_msg.n0]=0; |
669 | 672 | } |
670 | 673 | |
762 | 765 | } |
763 | 766 | |
764 | 767 | /* get a list of files to search */ |
765 | lib_list_p = lib_select(lib_db_file, m_msg.ltitle, m_msg.flstr, | |
766 | m_msg.ldb_info.ldnaseq); | |
768 | lib_list_p = lib_select(lib_db_file, m_msg.ltitle, m_msg.flstr, m_msg.ldb_info.ldnaseq); | |
767 | 769 | } |
768 | 770 | else { |
769 | 771 | /* get a list of files to search */ |
914 | 916 | |
915 | 917 | if (!validate_params(aa0[0],m_msg.n0, &m_msg, &pst, |
916 | 918 | lascii, pascii)) { |
917 | fprintf(stderr," *** ERROR *** validate_params() failed:\n -- %s\n", argv_line); | |
919 | fprintf(stderr," *** error [%s:%d] - validate_params() failed:\n -- %s\n", __FILE__, __LINE__, argv_line); | |
918 | 920 | exit(1); |
919 | 921 | } |
920 | 922 | |
1520 | 1522 | if (pst.do_rep) { |
1521 | 1523 | if (pst.zsflag >= 0) { |
1522 | 1524 | for (i=m_msg.nskip; i < m_msg.nskip + m_msg.nshow; i++) { |
1523 | bestp_arr[i]->repeat_thresh = | |
1524 | min(E1_to_s(pst.e_cut_r, m_msg.n0, bestp_arr[i]->seq->n1, | |
1525 | pst.zdb_size, m_msg.pstat_void),bestp_arr[i]->rst.score[pst.score_ix]); | |
1525 | if (bestp_arr[i]->rst.escore > pst.e_cut_r) { | |
1526 | bestp_arr[i]->repeat_thresh = bestp_arr[i]->rst.score[pst.score_ix] * 10; | |
1527 | } | |
1528 | else { | |
1529 | bestp_arr[i]->repeat_thresh = | |
1530 | min(E1_to_s(pst.e_cut_r, m_msg.n0, bestp_arr[i]->seq->n1, pst.zdb_size, m_msg.pstat_void), | |
1531 | bestp_arr[i]->rst.score[pst.score_ix]); | |
1532 | } | |
1526 | 1533 | } |
1527 | 1534 | } |
1528 | 1535 | else { |
2242 | 2249 | getlib() calls */ |
2243 | 2250 | /* **************************************************************** */ |
2244 | 2251 | struct getlib_str * |
2245 | init_getlib_info(struct lib_struct *lib_list_p, int maxn,long max_memK) { | |
2252 | init_getlib_info(struct lib_struct *lib_list_p, int maxn, long max_memK) { | |
2246 | 2253 | struct getlib_str *my_getlib_info; |
2247 | 2254 | unsigned char *aa1save; |
2248 | 2255 | |
2353 | 2360 | if ((cur_lib_p->m_file_p = |
2354 | 2361 | open_lib(cur_lib_p, m_msp->ldb_info.ldnaseq, lascii, !m_msp->quiet)) |
2355 | 2362 | ==NULL) { |
2356 | fprintf(stderr," cannot open library %s\n",cur_lib_p->file_name); | |
2363 | fprintf(stderr,"*** warning [%s:%d] cannot open library %s\n",__FILE__,__LINE__,cur_lib_p->file_name); | |
2357 | 2364 | getlib_info->lib_list_p = getlib_info->lib_list_p->next; |
2358 | 2365 | if (getlib_info->lib_list_p == NULL) { |
2359 | 2366 | goto return_null; |
2374 | 2381 | /* if the library is NCBIBL20 and memory mapped, simply return |
2375 | 2382 | pointers to the memory map */ |
2376 | 2383 | m_fd = getlib_info->lib_list_p->m_file_p; |
2377 | if (m_fd->get_mmap_chain) { | |
2384 | ||
2385 | if (m_fd->get_mmap_chain && getlib_info->use_memory>=0) { | |
2378 | 2386 | /* get a new seqr_chain */ |
2379 | 2387 | my_seqr_chain = |
2380 | 2388 | new_seqr_chain(m_bufi_p->max_chain_seqs,(m_bufi_p->seq_buf_size+1), |
222 | 222 | |
223 | 223 | /* subs_env takes a string, possibly with ${ENV}, and looks up all the |
224 | 224 | potential environment variables and substitutes them into the |
225 | string */ | |
226 | ||
225 | string | |
226 | */ | |
227 | 227 | void subs_env(char *dest, char *src, int dest_size) { |
228 | 228 | char *last_src, *bp, *bp1; |
229 | 229 | |
273 | 273 | dest[dest_size-1]='\0'; |
274 | 274 | } |
275 | 275 | } |
276 | ||
277 | 276 | |
278 | 277 | void |
279 | 278 | selectbest(struct beststr **bptr, int k, int n) /* k is rank in array */ |
1403 | 1402 | char *link_lib_str; |
1404 | 1403 | char link_script[MAX_LSTR]; |
1405 | 1404 | int link_lib_type; |
1406 | char *bp, *link_bp; | |
1405 | char *bp, *link_bp, *bp_s; | |
1407 | 1406 | FILE *link_fd=NULL; /* file for link accessions */ |
1408 | 1407 | |
1409 | 1408 | #ifndef UNIX |
1466 | 1465 | } |
1467 | 1466 | |
1468 | 1467 | strncpy(link_script,link_bp,sizeof(link_script)); |
1468 | /* un-edit m_msp->link_lname */ | |
1469 | if (bp != NULL) *bp = ' '; | |
1470 | ||
1469 | 1471 | link_script[sizeof(link_script)-1] = '\0'; |
1472 | ||
1473 | /* convert + to space in script string */ | |
1474 | for (bp_s = strchr(link_script+1,'+'); bp_s; bp_s=strchr(bp_s+1,'+')) { | |
1475 | *bp_s = ' '; | |
1476 | } | |
1477 | ||
1470 | 1478 | SAFE_STRNCAT(link_script," ",sizeof(link_script)); |
1471 | 1479 | SAFE_STRNCAT(link_script,link_acc_file,sizeof(link_script)); |
1472 | 1480 | SAFE_STRNCAT(link_script," >",sizeof(link_script)); |
1473 | 1481 | SAFE_STRNCAT(link_script,link_lib_file,sizeof(link_script)); |
1474 | ||
1475 | /* un-edit m_msp->link_lname */ | |
1476 | if (bp != NULL) *bp = ' '; | |
1477 | 1482 | |
1478 | 1483 | /* run link_script link_acc_file > link_lib_file */ |
1479 | 1484 | status = system(link_script); |
1580 | 1585 | } |
1581 | 1586 | |
1582 | 1587 | strncpy(lib_db_script,lib_bp,sizeof(lib_db_script)); |
1588 | bp = strchr(lib_db_script,'+'); | |
1589 | for ( ; bp; bp=strchr(bp+1,'+')) { | |
1590 | *bp=' '; | |
1591 | } | |
1592 | ||
1583 | 1593 | lib_db_script[sizeof(lib_db_script)-1] = '\0'; |
1584 | 1594 | SAFE_STRNCAT(lib_db_script," >",sizeof(lib_db_script)); |
1585 | 1595 | SAFE_STRNCAT(lib_db_script,lib_db_file,sizeof(lib_db_script)); |
1649 | 1659 | |
1650 | 1660 | this->max_annot += (this->max_annot/2); |
1651 | 1661 | if ((this->tmp_arr_p= (struct annot_entry *)realloc(this->tmp_arr_p, this->max_annot*sizeof(struct annot_entry)))==NULL) { |
1652 | fprintf(stderr,"[*** error [%s:%d] - cannot reallocate tmp_ann_astr[%d]\n", | |
1662 | fprintf(stderr,"*** error [%s:%d] - cannot reallocate tmp_ann_astr[%d]\n", | |
1653 | 1663 | __FILE__, __LINE__, this->max_annot); |
1654 | 1664 | return 0; |
1655 | 1665 | } |
1702 | 1712 | annotations back |
1703 | 1713 | */ |
1704 | 1714 | |
1715 | /* create filename for input accessions */ | |
1705 | 1716 | annot_bline_file[0] = '\0'; |
1706 | 1717 | |
1707 | 1718 | if ((annot_descr_file=(char *)calloc(MAX_STR,sizeof(char)))==NULL) { |
1710 | 1721 | } |
1711 | 1722 | annot_descr_file[0] = '\0'; |
1712 | 1723 | |
1724 | /* add temporary directory if $TMP_DIR */ | |
1713 | 1725 | if ((bp=getenv("TMP_DIR"))!=NULL) { |
1714 | 1726 | strncpy(annot_bline_file,bp,sizeof(annot_bline_file)); |
1715 | 1727 | annot_bline_file[sizeof(annot_bline_file)-1] = '\0'; |
1728 | 1740 | goto no_annots; |
1729 | 1741 | } |
1730 | 1742 | |
1743 | /* write out accessions, sequence length */ | |
1731 | 1744 | for (i=0; i<nbest; i++) { |
1732 | 1745 | if (bestp_arr[i]->mseq->annot_req_flag) { continue; } |
1733 | 1746 | if ((strlen(bestp_arr[i]->mseq->bline) > DESCR_OFFSET) && |
1743 | 1756 | } |
1744 | 1757 | fclose(annot_fd); |
1745 | 1758 | |
1746 | subs_env(annot_script, sname+1, sizeof(annot_script)); | |
1759 | /* convert '+' in annot_script to ' ' */ | |
1760 | bp = strchr(sname+1,'+'); | |
1761 | for ( ; bp; bp=strchr(bp+1,'+')) { | |
1762 | *bp=' '; | |
1763 | } | |
1764 | ||
1765 | subs_env(annot_script, sname+1, sizeof(annot_script)); | |
1747 | 1766 | annot_script[sizeof(annot_script)-1] = '\0'; |
1748 | 1767 | SAFE_STRNCAT(annot_script," ",sizeof(annot_script)); |
1749 | 1768 | SAFE_STRNCAT(annot_script,annot_bline_file,sizeof(annot_script)); |
1752 | 1771 | |
1753 | 1772 | /* run annot_script annot_bline_file > annot_descr_file */ |
1754 | 1773 | status = system(annot_script); |
1774 | ||
1775 | #ifdef DEBUG | |
1776 | if (debug) { | |
1777 | fprintf(stderr,"%s\n",annot_script); | |
1778 | } | |
1779 | #endif | |
1780 | ||
1755 | 1781 | if (!debug) { |
1756 | 1782 | #ifdef UNIX |
1757 | 1783 | unlink(annot_bline_file); |
2171 | 2197 | |
2172 | 2198 | q_offset = m_msp->q_offset + m_msp->q_off - 1; |
2173 | 2199 | if (q_offset < 0) { q_offset = 0;} |
2200 | ||
2201 | /* convert '+' in annot_script to ' ' */ | |
2202 | bp = strchr(sname+1,'+'); | |
2203 | for ( ; bp; bp=strchr(bp+1,'+')) { | |
2204 | *bp=' '; | |
2205 | } | |
2206 | ||
2174 | 2207 | sprintf(annot_script,"%s \"%s\" %ld",sname+1, bline_descr,q_offset+m_msp->n0); |
2175 | 2208 | annot_script[sizeof(annot_script)-1] = '\0'; |
2176 | 2209 | |
4104 | 4137 | else if (aln && toupper(sp0) == 'N') aln->ngap_q++; |
4105 | 4138 | else if (aln && toupper(sp1) == 'N') aln->ngap_l++; |
4106 | 4139 | } |
4140 | else if ((sp0 == '*' && toupper(sp1) == 'U') || | |
4141 | (toupper(sp0) == 'U' && sp1 == '*')) { | |
4142 | spa_val = M_IDENT; | |
4143 | if (aln) { | |
4144 | aln->nident++; | |
4145 | aln->nmismatch--; | |
4146 | } | |
4147 | } | |
4107 | 4148 | |
4108 | 4149 | /* correct nident, nmismatch for N:N / X:X */ |
4109 | 4150 | if (pam_x_id_sim < 0) { /* > 0 -> identical, similar */ |
67 | 67 | |
68 | 68 | #ifndef MAX_MEMK |
69 | 69 | #if defined(BIG_LIB64) && (defined(COMP_THR) || defined(PCOMPLIB)) |
70 | #define MAX_MEMK 8*1024*1024 /* 12 GB (<<10) for library in memory */ | |
70 | #define MAX_MEMK 16*1024*1024 /* 16 GB (<<10) for library in memory */ | |
71 | 71 | #else |
72 | 72 | #define MAX_MEMK 2*1024*1024 /* 2 GB (<<10) for library in memory */ |
73 | 73 | #endif |
151 | 151 | #define MX_M9SUMM 64 /* markx==9(c) */ |
152 | 152 | #define MX_M10FORM 128 /* markx==10 - verbose output */ |
153 | 153 | #define MX_M11OUT 256 /* markx==11 - lalign lav */ |
154 | #define MX_M8OUT 512 /* markx==8 blast8 output */ | |
155 | #define MX_M8COMMENT 1024 /* markx==8 blast8 output */ | |
156 | #define MX_MBLAST 2048 /* markx=B blast output */ | |
157 | #define MX_MBLAST2 4096 /* markx=BB more blast output */ | |
154 | #define MX_M8OUT 512 /* markx==8 blast tabular (-outfmt=6) output */ | |
155 | #define MX_M8COMMENT 1024 /* markx==8 blast tabular (-outfmt=7) with comments output */ | |
156 | #define MX_MBLAST 2048 /* markx=B blast alignment -outfmt=0 output */ | |
157 | #define MX_MBLAST2 4096 /* markx=BB blast best scores and alignment (-outfmt=0) output */ | |
158 | 158 | #define MX_ANNOT_COORD 16384 /* -m 0, use -m 0B for both */ |
159 | 159 | #define MX_ANNOT_MID 32768 /* markx 0M, 1M, 2M annotations in middle */ |
160 | 160 | #define MX_RES_ALIGN_SCORE (1<<20) /* show residue alignment score, not alignment */ |
161 | #define MX_M8_BTAB_LEN (1<<21) /* show query/subject seq. lens in -m 8 output */ | |
161 | 162 | |
162 | /* codes for -m 9 */ | |
163 | /* codes for -m 9, -m 8C? */ | |
163 | 164 | #define SHOW_CODE_ID 1 /* identity only */ |
164 | 165 | #define SHOW_CODE_IDD 2 /* identity with domains */ |
165 | 166 | #define SHOW_CODE_ALIGN 4 /* encoded alignment */ |
168 | 169 | #define SHOW_CODE_MASK 12 /* use higher bits for annotation format */ |
169 | 170 | #define SHOW_CODE_EXT 16 /* encode identity, mismatch state */ |
170 | 171 | #define SHOW_ANNOT_FULL 32 /* show full-length annot in calc_code */ |
172 | #define SHOW_CODE_DOMINFO 64 /* include raw domain info in btab/BTOP */ | |
173 |
293 | 293 | m_msp->do_showbest = 1; |
294 | 294 | m_msp->ashow = -1; |
295 | 295 | m_msp->ashow_set = 0; |
296 | ||
296 | 297 | m_msp->nmlen = DEF_NMLEN; |
298 | ||
299 | ||
300 | /* values set in initfa.c: parse_ext_opts() */ | |
297 | 301 | m_msp->z_bits = 1; |
298 | 302 | m_msp->tot_ident = 0; |
303 | m_msp->blast_ident = 0; | |
304 | m_msp->m8_show_annot = 0; | |
305 | ||
299 | 306 | m_msp->mshow_set = 0; |
300 | 307 | m_msp->mshow_min = 0; |
301 | 308 | m_msp->aln.llen = 60; |
620 | 627 | else { |
621 | 628 | m_msp->ann_arr_def[i_ann] = NULL; |
622 | 629 | } |
623 | ||
624 | ||
625 | 630 | } |
626 | 631 | |
627 | 632 | /* read definitions of annotation symbols from a file */ |
710 | 715 | |
711 | 716 | return markx; |
712 | 717 | } |
718 | ||
719 | /* specify output format. If output format type is 'F', then provide | |
720 | file name and write to file. | |
721 | ||
722 | Thus, -m "F8CB outfile.m8CB" writes -m 8CB output to outfile.m8CB | |
723 | Different format outputs can be written to different files | |
724 | ||
725 | */ | |
713 | 726 | |
714 | 727 | void |
715 | 728 | pre_parse_markx(char *opt_arg, struct mngmsg *m_msp) { |
757 | 770 | |
758 | 771 | /* first check for -m "F file" format */ |
759 | 772 | if (optarg[0] == 'F') { |
760 | if ((bp=strchr(optarg+1,' '))==NULL) { | |
773 | if ((bp=strchr(optarg+1,' '))==NULL && (bp=strchr(optarg+1,'='))==NULL) { | |
761 | 774 | fprintf(stderr,"-m F missing file name: %s\n",optarg); |
762 | 775 | return; |
763 | 776 | } |
823 | 836 | void |
824 | 837 | parse_markx(char *optarg, struct markx_str *this) { |
825 | 838 | int itmp; |
826 | char ctmp, ctmp2; | |
839 | char ctmp, ctmp2, ctmp3; | |
827 | 840 | |
828 | 841 | itmp = 0; |
829 | ctmp = ctmp2 = '\0'; | |
842 | ctmp = ctmp2 = ctmp3 = '\0'; | |
830 | 843 | |
831 | 844 | if (optarg[0] == 'B') { /* BLAST alignment output */ |
832 | 845 | this->markx = MX_MBLAST; |
853 | 866 | return; |
854 | 867 | } |
855 | 868 | else { |
856 | sscanf(optarg,"%d%c%c",&itmp,&ctmp,&ctmp2); | |
869 | sscanf(optarg,"%d%c%c%c",&itmp,&ctmp,&ctmp2,&ctmp3); | |
857 | 870 | } |
858 | 871 | if (itmp==9) { |
859 | 872 | if (ctmp=='c') {this->show_code = SHOW_CODE_ALIGN;} |
876 | 889 | else if (ctmp2 == 'C') {this->show_code = SHOW_CODE_CIGAR;} |
877 | 890 | else if (ctmp2 == 'D') {this->show_code = SHOW_CODE_CIGAR + SHOW_CODE_EXT;} |
878 | 891 | else if (ctmp2 == 'B') {this->show_code = SHOW_CODE_BTOP;} |
892 | ||
893 | if (ctmp3 == 'L') { | |
894 | this->markx |= MX_M8_BTAB_LEN; | |
895 | this->show_code |= SHOW_CODE_DOMINFO; | |
896 | } | |
897 | else if (ctmp3 == 'l') { | |
898 | this->markx |= MX_M8_BTAB_LEN; | |
899 | } | |
900 | ||
879 | 901 | } |
880 | 902 | } |
881 | 903 |
116 | 116 | |
117 | 117 | f_str = (struct f_struct *) calloc(1, sizeof(struct f_struct)); |
118 | 118 | if(f_str == NULL) { |
119 | fprintf(stderr, "Couldn't calloc f_str\n"); | |
119 | fprintf(stderr, "*** error [%s:%d] - cannot calloc f_str [%lu]\n", | |
120 | __FILE__, __LINE__, sizeof(struct f_struct)); | |
120 | 121 | exit(1); |
121 | 122 | } |
122 | 123 | |
134 | 135 | if (ppst->hsq[i0] < NMAP && ppst->hsq[i0] > mhv) mhv = ppst->hsq[i0]; |
135 | 136 | |
136 | 137 | if (mhv <= 0) { |
137 | fprintf (stderr, " maximum hsq <=0 %d\n", mhv); | |
138 | fprintf (stderr, "*** error [%s:%d] - maximum hsq <=0 %d\n", | |
139 | __FILE__, __LINE__, mhv); | |
138 | 140 | exit (1); |
139 | 141 | } |
140 | 142 | |
146 | 148 | f_str->hmask = (hmax >> f_str->kshft) - 1; |
147 | 149 | |
148 | 150 | if ((f_str->aa0 = (unsigned char *) calloc(n0+1, sizeof(char))) == NULL) { |
149 | fprintf (stderr, " cannot allocate f_str->aa0 array; %d\n",n0+1); | |
151 | fprintf (stderr, "*** error [%s:%d] - cannot allocate f_str->aa0 array; %d\n", | |
152 | __FILE__, __LINE__, n0+1); | |
150 | 153 | exit (1); |
151 | 154 | } |
152 | 155 | for (i=0; i<n0; i++) f_str->aa0[i] = aa0[i]; |
153 | 156 | aa0 = f_str->aa0; |
154 | 157 | |
155 | 158 | if ((f_str->aa0t = (unsigned char *) calloc(n0+1, sizeof(char))) == NULL) { |
156 | fprintf (stderr, " cannot allocate f_str0->aa0t array; %d\n",n0+1); | |
159 | fprintf (stderr, "*** error [%s:%d] - cannot allocate f_str0->aa0t array; %d\n", | |
160 | __FILE__, __LINE__, n0+1); | |
157 | 161 | exit (1); |
158 | 162 | } |
159 | 163 | f_str->aa0ix = 0; |
160 | 164 | |
161 | 165 | if ((f_str->harr = (struct hlstr *) calloc (hmax, sizeof (struct hlstr))) == NULL) { |
162 | fprintf (stderr, " cannot allocate hash array; hmax: %d hmask: %d\n", | |
163 | hmax,f_str->hmask); | |
166 | fprintf (stderr, "*** error [%s:%d] - cannot allocate hash array; hmax: %d hmask: %d\n", | |
167 | __FILE__, __LINE__, hmax,f_str->hmask); | |
164 | 168 | exit (1); |
165 | 169 | } |
166 | 170 | if ((f_str->pamh1 = (int *) calloc (nsq+1, sizeof (int))) == NULL) { |
167 | fprintf (stderr, " cannot allocate pamh1 array\n"); | |
171 | fprintf (stderr, "*** error [%s:%d] - cannot allocate pamh1 array [%d]\n", | |
172 | __FILE__, __LINE__, nsq+1); | |
168 | 173 | exit (1); |
169 | 174 | } |
170 | 175 | if ((f_str->pamh2 = (int *) calloc (hmax, sizeof (int))) == NULL) { |
171 | fprintf (stderr, " cannot allocate pamh2 array\n"); | |
176 | fprintf (stderr, "*** error [%s:%d] - cannot allocate pamh2 array [%d]\n", | |
177 | __FILE__, __LINE__, hmax); | |
172 | 178 | exit (1); |
173 | 179 | } |
174 | 180 | if ((f_str->link = (struct hlstr *) calloc (n0, sizeof (struct hlstr))) == NULL) { |
175 | fprintf (stderr, " cannot allocate hash link array"); | |
181 | fprintf (stderr, "*** error [%s:%d] - cannot allocate hash link array [%d]", | |
182 | __FILE__, __LINE__, n0); | |
176 | 183 | exit (1); |
177 | 184 | } |
178 | 185 | |
247 | 254 | f_str->maxsav = MAXSAV; |
248 | 255 | if ((f_str->vmax = (struct savestr *) |
249 | 256 | calloc(MAXSAV,sizeof(struct savestr)))==NULL) { |
250 | fprintf(stderr, "Couldn't allocate vmax[%d].\n",f_str->maxsav); | |
257 | fprintf(stderr, "*** error [%s:%d] - cannot allocate vmax[%d].\n", | |
258 | __FILE__, __LINE__, f_str->maxsav); | |
251 | 259 | exit(1); |
252 | 260 | } |
253 | 261 | |
254 | 262 | if ((f_str->vptr = (struct savestr **) |
255 | 263 | calloc(MAXSAV,sizeof(struct savestr *)))==NULL) { |
256 | fprintf(stderr, "Couldn't allocate vptr[%d].\n",f_str->maxsav); | |
264 | fprintf(stderr, "*** error [%s:%d] - cannot allocate vptr[%d].\n", | |
265 | __FILE__, __LINE__, f_str->maxsav); | |
257 | 266 | exit(1); |
258 | 267 | } |
259 | 268 | |
260 | 269 | for (vmptr = f_str->vmax; vmptr < &f_str->vmax[MAXSAV]; vmptr++) { |
261 | 270 | vmptr->used = (int *) calloc(n0, sizeof(int)); |
262 | 271 | if(vmptr->used == NULL) { |
263 | fprintf(stderr, "Couldn't alloc vmptr->used\n"); | |
272 | fprintf(stderr, "*** error [%s:%d] - cannot alloc vmptr->used [%d]\n", | |
273 | __FILE__, __LINE__, n0); | |
264 | 274 | exit(1); |
265 | 275 | } |
266 | 276 | } |
284 | 294 | |
285 | 295 | if (f_str->diag == NULL) |
286 | 296 | { |
287 | fprintf (stderr, " cannot allocate diagonal arrays: %ld\n", | |
288 | (long) MAXDIAG * (long) (sizeof (struct dstruct))); | |
297 | fprintf (stderr, "*** error [%s:%d] - cannot allocate diagonal arrays: %ld\n", | |
298 | __FILE__, __LINE__, (long) MAXDIAG * (long) (sizeof (struct dstruct))); | |
289 | 299 | exit (1); |
290 | 300 | } |
291 | 301 | |
293 | 303 | if ((f_str->aa1x =(unsigned char *)calloc((size_t)ppst->maxlen+2, |
294 | 304 | sizeof(unsigned char))) |
295 | 305 | == NULL) { |
296 | fprintf (stderr, "cannot allocate aa1x array %d\n", ppst->maxlen+2); | |
306 | fprintf (stderr, "*** error [%s:%d] - cannot allocate aa1x array %d\n", | |
307 | __FILE__, __LINE__, ppst->maxlen+2); | |
297 | 308 | exit (1); |
298 | 309 | } |
299 | 310 | f_str->aa1x++; |
304 | 315 | |
305 | 316 | maxn0 = max(3*n0/2,MIN_RES); |
306 | 317 | if ((res = (int *)calloc((size_t)maxn0,sizeof(int)))==NULL) { |
307 | fprintf(stderr,"cannot allocate alignment results array %d\n",maxn0); | |
318 | fprintf(stderr,"*** error [%s:%d] - cannot allocate alignment results array %d\n", | |
319 | __FILE__, __LINE__, maxn0); | |
308 | 320 | exit(1); |
309 | 321 | } |
310 | 322 | f_str->res = res; |
314 | 326 | |
315 | 327 | /* initialize priors array. */ |
316 | 328 | if((f_str->priors = (double *)calloc(ppst->nsq+1, sizeof(double))) == NULL) { |
317 | fprintf(stderr, "Couldn't allocate priors array.\n"); | |
329 | fprintf(stderr, "*** error [%s:%d] - cannot allocate priors array [%d]\n", | |
330 | __FILE__, __LINE__, ppst->nsq+1); | |
318 | 331 | exit(1); |
319 | 332 | } |
320 | 333 | calc_priors(f_str->priors, ppst, f_str, NULL, 0, ppst->pseudocts); |
420 | 433 | } |
421 | 434 | |
422 | 435 | if (n0+n1+1 >= MAXDIAG) { |
423 | fprintf(stderr,"n0,n1 too large: %d, %d\n",n0,n1); | |
436 | fprintf(stderr,"*** error [%s:%d] - n0,n1 too large %d + %d > %d\n", | |
437 | __FILE__, __LINE__, n0,n1, MAXDIAG); | |
424 | 438 | rst->score[0] = rst->score[1] = rst->score[2] = -1; |
425 | 439 | rst->escore = 2.0; |
426 | 440 | rst->segnum = 0; |
642 | 656 | if (ppst->debug_lib) |
643 | 657 | for (i=0; i<n10; i++) |
644 | 658 | if (f_str->aa1x[i]>ppst->nsq) { |
645 | fprintf(stderr, | |
646 | "residue[%d/%d] %d range (%d)\n",i,n1, | |
647 | f_str->aa1x[i],ppst->nsq); | |
659 | fprintf(stderr, "*** error [%s:%d] - residue[%d/%d] %d range (%d)\n", | |
660 | __FILE__, __LINE__, i,n1, f_str->aa1x[i],ppst->nsq); | |
648 | 661 | f_str->aa1x[i]=0; |
649 | 662 | n10=i-1; |
650 | 663 | } |
842 | 855 | } |
843 | 856 | tot += ctot; |
844 | 857 | if (ci >= 0) { |
845 | if (ci >= n0) {fprintf(stderr," warning - ci off end %d/%d\n",ci,n0);} | |
858 | if (ci >= n0) {fprintf(stderr,"*** warning [%s:%d] - ci off end %d/%d\n", | |
859 | __FILE__, __LINE__, ci,n0);} | |
846 | 860 | else { |
847 | 861 | *aa0pt++ = aa0p[ci]; |
848 | 862 | aa0p[ci] += 32; |
855 | 869 | if (aa0t_flg) { |
856 | 870 | dmax->dp -= f_str->aa0ix; /* shift ->dp for aa0t */ |
857 | 871 | if ((ci=(int)(aa0pt-f_str->aa0t)) > n0) { |
858 | fprintf(stderr," warning - aapt off %d/%d end\n",ci,n0); | |
872 | fprintf(stderr,"*** warning [%s:%d] - aapt off %d/%d end\n", | |
873 | __FILE__, __LINE__, ci,n0); | |
859 | 874 | } |
860 | 875 | else |
861 | 876 | *aa0pt++ = 0; /* skip over NULL */ |
1157 | 1172 | *have_ares = 0x2; /* set 0x2 bit to indicate local copy */ |
1158 | 1173 | |
1159 | 1174 | if ((a_res = (struct a_res_str *)calloc(1, sizeof(struct a_res_str)))==NULL) { |
1160 | fprintf(stderr," [do_walign] Cannot allocate a_res"); | |
1175 | fprintf(stderr,"*** error [%s:%d] - cannot allocate a_res [%lu]\n", | |
1176 | __FILE__, __LINE__, (unsigned long)sizeof(struct a_res_str)); | |
1161 | 1177 | return NULL; |
1162 | 1178 | } |
1163 | 1179 | |
1180 | 1196 | */ |
1181 | 1197 | |
1182 | 1198 | if ((aa0t = (unsigned char *)calloc(n0+1,sizeof(unsigned char)))==NULL) { |
1183 | fprintf(stderr," cannot allocate aa0t %d\n",n0+1); | |
1199 | fprintf(stderr,"*** error [%s:%d] - cannot allocate aa0t %d\n", | |
1200 | __FILE__, __LINE__, n0+1); | |
1184 | 1201 | exit(1); |
1185 | 1202 | } |
1186 | 1203 |
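The hunks above replace ad hoc messages with a uniform `*** error [file:line] - ...` prefix built from `__FILE__` and `__LINE__`. As a minimal sketch of how that prefix could be produced in one place rather than repeated at every call site, the block below uses a hypothetical helper `format_error()` and a hypothetical macro `REPORT_ERROR`; neither name appears in the FASTA sources, and the 512-byte buffer is an arbitrary assumption.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of FASTA): writes
   "*** error [file:line] - " followed by the formatted message into buf.
   Returns the number of characters that would have been written,
   or a negative value on a formatting error. */
static int
format_error(char *buf, size_t buf_len, const char *file, int line,
             const char *fmt, ...)
{
  va_list ap;
  int n_prefix, n_msg;

  n_prefix = snprintf(buf, buf_len, "*** error [%s:%d] - ", file, line);
  if (n_prefix < 0 || (size_t)n_prefix >= buf_len) return n_prefix;

  va_start(ap, fmt);
  n_msg = vsnprintf(buf + n_prefix, buf_len - (size_t)n_prefix, fmt, ap);
  va_end(ap);

  return (n_msg < 0) ? n_msg : n_prefix + n_msg;
}

/* Hypothetical convenience macro: captures __FILE__/__LINE__ at the call
   site and sends the message to stderr.  C99 variadic macro; at least one
   argument must follow the format string. */
#define REPORT_ERROR(fmt, ...)                                        \
  do {                                                                \
    char msg_[512];                                                   \
    format_error(msg_, sizeof(msg_), __FILE__, __LINE__,              \
                 fmt, __VA_ARGS__);                                   \
    fputs(msg_, stderr);                                              \
  } while (0)
```

With such a macro, a call site would shrink to something like `REPORT_ERROR("cannot allocate aa0t %d\n", n0+1);`, and the trailing newline would be the only part of the convention left to the caller.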
2065 | 2065 | #define XTERNAL |
2066 | 2066 | #include "upam.h" |
2067 | 2067 | |
2068 | /* this code shows the alignment of the protein with the three phased | |
2069 | translation of the DNA sequence | |
2070 | */ | |
2071 | ||
2068 | 2072 | extern void |
2069 | display_alig(int *a, unsigned char *dna, unsigned char * pro, int length, int ld) | |
2073 | display_alig(int *a, unsigned char *dna_p, unsigned char * pro, int length, int ld) | |
2070 | 2074 | { |
2071 | 2075 | int len = 0, i, j, x, y, lines, k; |
2072 | 2076 | char line1[100], line2[100], line3[100], |
2073 | 2077 | tmp[10] = " "; |
2074 | unsigned char *dna1, c1, c2, c3, *st; | |
2075 | ||
2076 | dna1 = ckalloc((size_t)ld); | |
2077 | for (st = dna, i = 0; i < ld; i++, st++) dna1[i] = NCBIstdaa[*st]; | |
2078 | unsigned char *dna_p1, c1, c2, c3, *st; | |
2079 | ||
2080 | dna_p1 = ckalloc((size_t)ld); | |
2081 | for (st = dna_p, i = 0; i < ld; i++, st++) dna_p1[i] = NCBIstdaa[*st]; | |
2078 | 2082 | line1[0] = line2[0] = line3[0] = '\0'; x= a[0]; y = a[1]-1; |
2079 | 2083 | |
2080 | 2084 | for (len = 0, j = 2, lines = 0; j < length; j++) { |
2086 | 2090 | if (a[j+1] == 2) tmp[2] = ' '; |
2087 | 2091 | } |
2088 | 2092 | if (i > 0) { |
2089 | strncpy(&line1[len], (const char *)&dna1[y], i); y+=i; | |
2090 | } else {line1[len] = '-'; i = 1; tmp[0] = NCBIstdaa[pro[x++]];} | |
2093 | strncpy(&line1[len], (const char *)&dna_p1[y], i); | |
2094 | y+=i; | |
2095 | } | |
2096 | else { | |
2097 | line1[len] = '-'; | |
2098 | i = 1; | |
2099 | tmp[0] = NCBIstdaa[pro[x++]]; | |
2100 | } | |
2091 | 2101 | strncpy(&line2[len], tmp, i); |
2092 | 2102 | for (k = 0; k < i; k++) { |
2093 | 2103 | if (tmp[k] != ' ' && tmp[k] != '-') { |
2094 | if (k == 2) tmp[k] = '\\'; | |
2095 | else if (k == 1) tmp[k] = '|'; | |
2096 | else tmp[k] = '/'; | |
2097 | } else tmp[k] = ' '; | |
2104 | if (k == 2) {tmp[k] = '\\';} | |
2105 | else if (k == 1) { tmp[k] = '|'; } | |
2106 | else { tmp[k] = '/'; } | |
2107 | } | |
2108 | else { tmp[k] = ' '; } | |
2098 | 2109 | } |
2099 | 2110 | if (i == 1) tmp[0] = ' '; |
2100 | 2111 | strncpy(&line3[len], tmp, i); |
2103 | 2114 | line1[len] = line2[len] =line3[len] = '\0'; |
2104 | 2115 | if (len >= WIDTH) { |
2105 | 2116 | printf("\n%5d", WIDTH*lines++); |
2106 | for (k = 10; k <= WIDTH; k+=10) | |
2117 | for (k = 10; k <= WIDTH; k+=10) { | |
2107 | 2118 | printf(" . :"); |
2108 | if (k-5 < WIDTH) printf(" ."); | |
2119 | } | |
2120 | if (k-5 < WIDTH) { printf(" ."); } | |
2109 | 2121 | c1 = line1[WIDTH]; c2 = line2[WIDTH]; c3 = line3[WIDTH]; |
2110 | 2122 | line1[WIDTH] = line2[WIDTH] = line3[WIDTH] = '\0'; |
2123 | ||
2111 | 2124 | printf("\n %s\n %s\n %s\n", line1, line3, line2); |
2125 | ||
2112 | 2126 | line1[WIDTH] = c1; line2[WIDTH] = c2; line3[WIDTH] = c3; |
2113 | 2127 | strncpy(line1, &line1[WIDTH], sizeof(line1)-1); |
2114 | 2128 | strncpy(line2, &line2[WIDTH], sizeof(line2)-1); |
2122 | 2136 | if (k-5 < len) printf(" ."); |
2123 | 2137 | printf("\n %s\n %s\n %s\n", line1, line3, line2); |
2124 | 2138 | } |
2125 | ||
2126 | 2139 | |
2127 | 2140 | /* alignment store the operation that align the protein and dna sequence. |
2128 | 2141 | The code of the number in the array is as follows: |
2137 | 2150 | in the protein and dna sequences in the local alignment. |
2138 | 2151 | |
2139 | 2152 | Display looks like where WIDTH is assumed to be divisible by 10. |
2153 | ||
2154 | -- this alignment is incorrect, protein phases rather than DNA are shown -- | |
2140 | 2155 | |
2141 | 2156 | 0 . : . : . : . : . : . : |
2142 | 2157 | CCTATGATACTGGGATACTGGAACGTCCGCGGACTGACACACCCGATCCGCATGCTCCTG |
281 | 281 | if (hsq[i0] < NMAP && hsq[i0] > mhv) mhv = hsq[i0]; |
282 | 282 | |
283 | 283 | if (mhv <= 0) { |
284 | fprintf (stderr, " maximum hsq <=0 %d\n", mhv); | |
284 | fprintf (stderr, "*** error [%s:%d] - maximum hsq <=0 %d\n", | |
285 | __FILE__, __LINE__, mhv); | |
285 | 286 | exit (1); |
286 | 287 | } |
287 | 288 | |
298 | 299 | f_str->hmask = (hmax >> f_str->kshft) - 1; |
299 | 300 | |
300 | 301 | if ((f_str->harr = (int *) calloc (hmax, sizeof (int))) == NULL) { |
301 | fprintf (stderr, " cannot allocate hash array\n"); | |
302 | fprintf (stderr, "*** error [%s:%d] - cannot allocate hash array [%d]\n", | |
303 | __FILE__, __LINE__, hmax ); | |
302 | 304 | exit (1); |
303 | 305 | } |
304 | 306 | if ((f_str->pamh1 = (int *) calloc (ppst->nsq+1, sizeof (int))) == NULL) { |
305 | fprintf (stderr, " cannot allocate pamh1 array\n"); | |
307 | fprintf (stderr, "*** error [%s:%d] - cannot allocate pamh1 array [%d]\n", | |
308 | __FILE__, __LINE__, ppst->nsq+1); | |
306 | 309 | exit (1); |
307 | 310 | } |
308 | 311 | if ((f_str->pamh2 = (int *) calloc (hmax, sizeof (int))) == NULL) { |
309 | fprintf (stderr, " cannot allocate pamh2 array\n"); | |
312 | fprintf (stderr, "*** error [%s:%d] - cannot allocate pamh2 array [%d]\n", | |
313 | __FILE__, __LINE__, hmax); | |
310 | 314 | exit (1); |
311 | 315 | } |
312 | 316 | if ((f_str->link = (int *) calloc (n0, sizeof (int))) == NULL) { |
313 | fprintf (stderr, " cannot allocate hash link array"); | |
317 | fprintf (stderr, "*** error [%s:%d] - cannot allocate hash link array [%d]\n", | |
318 | __FILE__, __LINE__, n0); | |
314 | 319 | exit (1); |
315 | 320 | } |
316 | 321 | |
318 | 323 | if ((f_str->aa1x =(unsigned char *)calloc((size_t)ppst->maxlen+2, |
319 | 324 | sizeof(unsigned char))) |
320 | 325 | == NULL) { |
321 | fprintf (stderr, "cannot allocate aa1x array %d\n", ppst->maxlen+2); | |
326 | fprintf (stderr, "*** error [%s:%d] - cannot allocate aa1x array %d\n", | |
327 | __FILE__, __LINE__, ppst->maxlen+2); | |
322 | 328 | exit (1); |
323 | 329 | } |
324 | 330 | f_str->aa1x++; |
326 | 332 | if ((f_str->aa1y =(unsigned char *)calloc((size_t)ppst->maxlen+2, |
327 | 333 | sizeof(unsigned char))) |
328 | 334 | == NULL) { |
329 | fprintf (stderr, "cannot allocate aa1y array %d\n", ppst->maxlen+2); | |
335 | fprintf (stderr, "*** error [%s:%d] - cannot allocate aa1y array %d\n", | |
336 | __FILE__, __LINE__, ppst->maxlen+2); | |
330 | 337 | exit (1); |
331 | 338 | } |
332 | 339 | f_str->aa1y++; |
334 | 341 | maxn0 = n0 + 2; |
335 | 342 | if ((aa0x =(unsigned char *)calloc((size_t)maxn0,sizeof(unsigned char))) |
336 | 343 | == NULL) { |
337 | fprintf (stderr, "cannot allocate aa0x array %d\n", maxn0); | |
344 | fprintf (stderr, "*** error [%s:%d] - cannot allocate aa0x array %d\n", | |
345 | __FILE__, __LINE__, maxn0); | |
338 | 346 | exit (1); |
339 | 347 | } |
340 | 348 | aa0x++; |
342 | 350 | |
343 | 351 | if ((aa0y =(unsigned char *)calloc((size_t)maxn0,sizeof(unsigned char))) |
344 | 352 | == NULL) { |
345 | fprintf (stderr, "cannot allocate aa0y array %d\n", maxn0); | |
353 | fprintf (stderr, "*** error [%s:%d] - cannot allocate aa0y array %d\n", | |
354 | __FILE__, __LINE__, maxn0); | |
346 | 355 | exit (1); |
347 | 356 | } |
348 | 357 | aa0y++; |
437 | 446 | #ifndef ALLOCN0 |
438 | 447 | if ((f_str->diag = (struct dstruct *) calloc ((size_t)MAXDIAG, |
439 | 448 | sizeof (struct dstruct)))==NULL) { |
440 | fprintf (stderr," cannot allocate diagonal arrays: %ld\n", | |
449 | fprintf (stderr,"*** error [%s:%d] - cannot allocate diagonal arrays: %ld\n", | |
450 | __FILE__, __LINE__, | |
441 | 451 | (long) MAXDIAG *sizeof (struct dstruct)); |
442 | 452 | exit (1); |
443 | 453 | }; |
444 | 454 | #else |
445 | 455 | if ((f_str->diag = (struct dstruct *) calloc ((size_t)n0, |
446 | 456 | sizeof (struct dstruct)))==NULL) { |
447 | fprintf (stderr," cannot allocate diagonal arrays: %ld\n", | |
448 | (long)n0*sizeof (struct dstruct)); | |
457 | fprintf (stderr,"*** error [%s:%d] - cannot allocate diagonal arrays: %ld\n", | |
458 | __FILE__, __LINE__, (long)n0*sizeof (struct dstruct)); | |
449 | 459 | exit (1); |
450 | 460 | }; |
451 | 461 | #endif |
452 | 462 | |
453 | 463 | |
454 | 464 | if ((waa= (int *)malloc (sizeof(int)*(nsq+1)*n0)) == NULL) { |
455 | fprintf(stderr,"cannot allocate waa struct %3d\n",nsq*n0); | |
465 | fprintf(stderr,"*** error [%s:%d] - cannot allocate waa struct %3d\n", | |
466 | __FILE__, __LINE__, nsq*n0); | |
456 | 467 | exit(1); |
457 | 468 | } |
458 | 469 | |
466 | 477 | f_str->waa0 = waa; |
467 | 478 | |
468 | 479 | if ((waa= (int *)malloc (sizeof(int)*(nsq+1)*n0)) == NULL) { |
469 | fprintf(stderr,"cannot allocate waa struct %3d\n",nsq*n0); | |
480 | fprintf(stderr,"*** error [%s:%d] - cannot allocate waa struct %3d\n", | |
481 | __FILE__, __LINE__, nsq*n0); | |
470 | 482 | exit(1); |
471 | 483 | } |
472 | 484 | |
488 | 500 | maxn0 = max(4*n0,MIN_RES); |
489 | 501 | #endif |
490 | 502 | if ((res = (int *)calloc((size_t)maxn0,sizeof(int)))==NULL) { |
491 | fprintf(stderr,"cannot allocate alignment results array %d\n",maxn0); | |
503 | fprintf(stderr,"*** error [%s:%d] - cannot allocate alignment results array %d\n", | |
504 | __FILE__, __LINE__, maxn0); | |
492 | 505 | exit(1); |
493 | 506 | } |
494 | 507 | f_str->res = res; |
690 | 703 | } |
691 | 704 | |
692 | 705 | if (n0+n1+1 >= MAXDIAG) { |
693 | fprintf(stderr,"n0,n1 too large: %d, %d\n",n0,n1); | |
706 | fprintf(stderr,"*** error [%s:%d] - n0,n1 too large > %d: %d, %d\n", | |
707 | __FILE__, __LINE__, MAXDIAG, n0, n1); | |
694 | 708 | rst->score[0] = rst->score[1] = rst->score[2] = -1; |
695 | 709 | return; |
696 | 710 | } |
1523 | 1537 | } |
1524 | 1538 | |
1525 | 1539 | if (i >= max_res) { |
1526 | fprintf(stderr," alignment truncated: %d/%d\n", max_res,i); | |
1540 | fprintf(stderr,"*** error [%s:%d] - alignment truncated: %d >= %d (max_res)\n", | |
1541 | __FILE__, __LINE__, i, max_res); | |
1527 | 1542 | } |
1528 | 1543 | |
1529 | 1544 | up = &up[-3]; down = &down[-3]; tp = &tp[-3]; |
1580 | 1595 | ld += 2; |
1581 | 1596 | init_ROW(up, ld+1); /* set to zero */ |
1582 | 1597 | init_ROW(down, ld+1); /* set to zero */ |
1583 | ||
1584 | 1598 | |
1585 | 1599 | cur = up+1; |
1586 | 1600 | last = down+1; |
2070 | 2084 | #define XTERNAL |
2071 | 2085 | #include "upam.h" |
2072 | 2086 | |
2087 | /* this code is not used by the program, it was included for testing */ | |
2088 | /* display_alig(*align_enc, *dna_p, *prot, length, ld) takes the | |
2089 | ||
2090 | alignment encoding, and the DNA and protein sequences, and produces an alignment. | |
2091 | *dna_p is the three phases of the translated DNA sequence | |
2092 | *prot is the original protein sequence | |
2093 | ||
2094 | length is the length of the encoding | |
2095 | ld is the length of the alignment(?) | |
2096 | ||
2097 | the first two entries in align_enc[] are the start of the protein | |
2098 | and DNA sequences. | |
2099 | ||
2100 | The encoding is: (why no code 1?:) | |
2101 | ||
2102 | 0: delete amino acid. | |
2103 | 2: frame shift, 2 nucleotides match with an amino acid | |
2104 | 3: match an amino acid with a codon | |
2105 | 4: the other type of frame shift | |
2106 | 5: delete of a codon | |
2107 | ||
2108 | One of the properties of this encoding is that it indicates the | |
2109 | amount that the DNA sequence index needs to be incremented after | |
2110 | prot match (except for 5) | |
2111 | ||
2112 | */ | |
2113 | ||
2073 | 2114 | extern void |
2074 | display_alig(int *a, unsigned char *dna, unsigned char * pro, int length, int ld) | |
2115 | display_alig(int *a, unsigned char *dna_p, unsigned char * pro, int length, int ld) | |
2075 | 2116 | { |
2076 | 2117 | int len = 0, i, j, x, y, lines, k; |
2077 | 2118 | char line1[100], line2[100], line3[100], |
2078 | 2119 | tmp[10] = " "; |
2079 | unsigned char *dna1, c1, c2, c3, *st; | |
2080 | ||
2081 | dna1 = ckalloc((size_t)ld); | |
2082 | for (st = dna, i = 0; i < ld; i++, st++) dna1[i] = NCBIstdaa[*st]; | |
2083 | line1[0] = line2[0] = line3[0] = '\0'; x= a[0]; y = a[1]-1; | |
2120 | unsigned char *dna_p1, c1, c2, c3, *st; | |
2121 | ||
2122 | dna_p1 = ckalloc((size_t)ld); /* dna_p1 is the ascii (sq0) translated-DNA residue */ | |
2123 | ||
2124 | /* generate the ascii aa characters */ | |
2125 | for (st = dna_p, i = 0; i < ld; i++, st++) { | |
2126 | dna_p1[i] = NCBIstdaa[*st]; | |
2127 | } | |
2128 | line1[0] = line2[0] = line3[0] = '\0'; | |
2129 | ||
2130 | x= a[0]; /* start in protein */ | |
2131 | y = a[1]-1; /* start in DNA */ | |
2084 | 2132 | |
2085 | 2133 | for (len = 0, j = 2, lines = 0; j < length; j++) { |
2086 | i = a[j]; | |
2134 | i = a[j]; /* i is align_enc value 0-5 */ | |
2087 | 2135 | /*printf("%d %d %d\n", i, len, b->j);*/ |
2136 | ||
2088 | 2137 | if (i > 0 && i < 5) tmp[i-2] = NCBIstdaa[pro[x++]]; |
2089 | if (i == 5) { | |
2090 | i = 3; tmp[0] = tmp[1] = tmp[2] = '-'; | |
2138 | if (i == 5) { /* special case */ | |
2139 | i = 3; /* increment DNA value by 3, prot by 0 */ | |
2140 | tmp[0] = tmp[1] = tmp[2] = '-'; | |
2091 | 2141 | if (a[j+1] == 2) tmp[2] = ' '; |
2092 | 2142 | } |
2093 | 2143 | if (i > 0) { |
2094 | strncpy(&line1[len], (const char *)&dna1[y], i); y+=i; | |
2095 | } else {line1[len] = '-'; i = 1; tmp[0] = NCBIstdaa[pro[x++]];} | |
2144 | strncpy(&line1[len], (const char *)&dna_p1[y], i); | |
2145 | y+=i; | |
2146 | } | |
2147 | else { | |
2148 | line1[len] = '-'; | |
2149 | i = 1; | |
2150 | tmp[0] = NCBIstdaa[pro[x++]]; | |
2151 | } | |
2152 | ||
2096 | 2153 | strncpy(&line2[len], tmp, i); |
2154 | ||
2097 | 2155 | for (k = 0; k < i; k++) { |
2098 | 2156 | if (tmp[k] != ' ' && tmp[k] != '-') { |
2099 | 2157 | if (k == 2) tmp[k] = '\\'; |
2128 | 2186 | printf("\n %s\n %s\n %s\n", line1, line3, line2); |
2129 | 2187 | } |
2130 | 2188 | |
2131 | ||
2132 | 2189 | /* alignment store the operation that align the protein and dna sequence. |
2133 | 2190 | The code of the number in the array is as follows: |
2134 | 2191 | 0: delete of an amino acid. |
2137 | 2194 | 4: the other type of frame shift |
2138 | 2195 | 5: delete of a codon |
2139 | 2196 | |
2140 | ||
2141 | 2197 | Also the first two element of the array stores the starting point |
2142 | 2198 | in the protein and dna sequences in the local alignment. |
2143 | 2199 | |
2378 | 2434 | |
2379 | 2435 | /* now we need alignment storage - get it */ |
2380 | 2436 | if ((cur_ares->res = (int *)calloc((size_t)max_res,sizeof(int)))==NULL) { |
2381 | fprintf(stderr," *** cannot allocate alignment results array %d\n",max_res); | |
2437 | fprintf(stderr,"*** error [%s:%d] - cannot allocate alignment results array %d\n", | |
2438 | __FILE__, __LINE__, max_res); | |
2382 | 2439 | exit(1); |
2383 | 2440 | } |
2384 | 2441 | |
2599 | 2656 | *have_ares = 0x3; /* set 0x2 bit to indicate local copy */ |
2600 | 2657 | |
2601 | 2658 | if ((a_res = (struct a_res_str *)calloc(1, sizeof(struct a_res_str)))==NULL) { |
2602 | fprintf(stderr," [do_walign] Cannot allocate a_res"); | |
2659 | fprintf(stderr,"*** error [%s:%d] - cannot allocate a_res [%lu]\n", | |
2660 | __FILE__, __LINE__, (unsigned long)sizeof(struct a_res_str)); | |
2603 | 2661 | return NULL; |
2604 | 2662 | } |
2605 | 2663 | |
2647 | 2705 | #endif |
2648 | 2706 | /* |
2649 | 2707 | if (a_res->res[0] != 3) { |
2650 | fprintf(stderr, "*** alignment does not start with match: %d\n",a_res->res[0]); | |
2708 | fprintf(stderr, "*** error [%s:%d] - alignment does not start with match: %d\n", | |
2709 | __FILE__, __LINE__, a_res->res[0]); | |
2651 | 2710 | } |
2652 | 2711 | */ |
2653 | 2712 | |
2654 | 2713 | #ifdef DEBUG |
2655 | 2714 | if (adler32(1L,aa1,n1) != adler32_crc) { |
2656 | fprintf(stderr,"[dropfx.c/do_walign] adler32_crc mismatch n1: %d\n",n1); | |
2715 | fprintf(stderr,"*** error [%s:%d] - adler32_crc mismatch n1: %d\n", | |
2716 | __FILE__, __LINE__, n1); | |
2657 | 2717 | } |
2658 | 2718 | #endif |
2659 | 2719 | |
2730 | 2790 | } |
2731 | 2791 | |
2732 | 2792 | /* |
2733 | Alignment: store the operation that align the protein and dna sequence. | |
2793 | Alignment: store the operation that aligns the protein and dna sequences. | |
2734 | 2794 | The code of the number in the array is as follows: |
2735 | 2795 | 0: delete of an amino acid. |
2736 | 2796 | 2: frame shift, 2 nucleotides match with an amino acid |
2977 | 3037 | else if (calc_func_mode == CALC_ID || calc_func_mode == CALC_ID_DOM) { |
2978 | 3038 | have_ann = (annotp_p && annotp_p->n_annot > 0); |
2979 | 3039 | spa_p = &spa_c; |
2980 | sp0_p = &sp0_c; | |
2981 | sp1_p = &sp1_c; | |
2982 | ||
2983 | sp0a_p = &sp0a_c; | |
2984 | sp1a_p = &sp1a_c; | |
3040 | sp0_p = &sp1_c; | |
3041 | sp1_p = &sp0_c; | |
3042 | ||
3043 | sp0a_p = &sp1a_c; | |
3044 | sp1a_p = &sp0a_c; | |
2985 | 3045 | annot_fmt = 3; |
2986 | 3046 | |
2987 | 3047 | /* does not require aa0a/aa1a, only for variants */ |
2988 | 3048 | } |
2989 | 3049 | else if (calc_func_mode == CALC_CODE) { |
2990 | 3050 | spa_p = &spa_c; |
2991 | sp0_p = &sp0_c; | |
2992 | sp1_p = &sp1_c; | |
2993 | ||
2994 | sp0a_p = &sp0a_c; | |
2995 | sp1a_p = &sp1a_c; | |
3051 | sp0_p = &sp1_c; | |
3052 | sp1_p = &sp0_c; | |
3053 | ||
3054 | sp0a_p = &sp1a_c; | |
3055 | sp1a_p = &sp0a_c; | |
2996 | 3056 | |
2997 | 3057 | show_code = (display_code & (SHOW_CODE_MASK+SHOW_CODE_EXT)); /* see defs.h; SHOW_CODE_ALIGN=2,_CIGAR=3,_CIGAR_EXT=4 */ |
2998 | 3058 | annot_fmt = 2; |
3017 | 3077 | rpmax = &a_res->res[a_res->nres]; |
3018 | 3078 | |
3019 | 3079 | lenc = not_c = aln->nident = aln->nmismatch = aln->nsim = aln->npos = ngap_p = ngap_d = nfs= 0; |
3080 | ||
3020 | 3081 | i0 = a_res->min1; |
3021 | 3082 | i1 = a_res->min0; |
3022 | 3083 | |
3141 | 3202 | *spa_p = M_DEL; |
3142 | 3203 | |
3143 | 3204 | if (calc_func_mode == CALC_CODE) { |
3205 | #ifndef TFAST | |
3144 | 3206 | update_code(align_code_dyn, update_data_p, 2, *spa_p,*sp0_p,*sp1_p); |
3207 | #else | |
3208 | update_code(align_code_dyn, update_data_p, 2, *spa_p,*sp1_p,*sp0_p); | |
3209 | #endif | |
3210 | ||
3145 | 3211 | } |
3146 | 3212 | |
3147 | 3213 | if (calc_func_mode == CALC_CONS) { |
3218 | 3284 | *spa_p = align_type(itmp, *sp0_p, *sp1_p, 0, aln, ppst->pam_x_id_sim); |
3219 | 3285 | |
3220 | 3286 | if (calc_func_mode == CALC_CODE) { |
3287 | #ifndef TFAST | |
3221 | 3288 | update_code(align_code_dyn, update_data_p, 3, *spa_p,*sp0_p,*sp1_p); |
3289 | #else | |
3290 | update_code(align_code_dyn, update_data_p, 3, *spa_p,*sp1_p,*sp0_p); | |
3291 | #endif | |
3222 | 3292 | } |
3223 | 3293 | |
3224 | 3294 | d1_alen++; |
3320 | 3390 | if (cumm_seq_score) *i_spa++ = itmp; |
3321 | 3391 | |
3322 | 3392 | if (calc_func_mode == CALC_CODE) { |
3393 | #ifndef TFAST | |
3323 | 3394 | update_code(align_code_dyn, update_data_p, 3, *spa_p, *sp0_p, *sp1_p); |
3395 | #else | |
3396 | update_code(align_code_dyn, update_data_p, 3, *spa_p, *sp1_p, *sp0_p); | |
3397 | #endif | |
3324 | 3398 | |
3325 | 3399 | if (have_push_features) { |
3326 | 3400 | add_annot_code(have_ann, *sp0_p, *sp1_p, *sp1a_p, |
3366 | 3440 | *spa_p = M_DEL; |
3367 | 3441 | |
3368 | 3442 | if (calc_func_mode == CALC_CODE) { |
3443 | #ifndef TFAST | |
3369 | 3444 | update_code(align_code_dyn, update_data_p, 4, *spa_p, *sp0_p, *sp1_p); |
3445 | #else | |
3446 | update_code(align_code_dyn, update_data_p, 4, *spa_p, *sp1_p, *sp0_p); | |
3447 | #endif | |
3370 | 3448 | } |
3371 | 3449 | |
3372 | 3450 | if (calc_func_mode == CALC_CONS) {sp0_p++; sp1_p++; spa_p++;} |
3435 | 3513 | if (*spa_p == M_IDENT) {d1_ident++;} |
3436 | 3514 | |
3437 | 3515 | if (calc_func_mode == CALC_CODE) { |
3516 | #ifndef TFAST | |
3438 | 3517 | update_code(align_code_dyn, update_data_p, 3, *spa_p,*sp0_p,*sp1_p); |
3518 | #else | |
3519 | update_code(align_code_dyn, update_data_p, 3, *spa_p,*sp1_p,*sp0_p); | |
3520 | #endif | |
3439 | 3521 | } |
3440 | 3522 | |
3441 | 3523 | if (cumm_seq_score) *i_spa++ = itmp; |
3484 | 3566 | |
3485 | 3567 | if (calc_func_mode == CALC_CODE) { |
3486 | 3568 | *spa_p = 5; |
3569 | #ifndef TFAST | |
3487 | 3570 | update_code(align_code_dyn, update_data_p, 5, *spa_p,*sp0_p,*sp1_p); |
3571 | #else | |
3572 | update_code(align_code_dyn, update_data_p, 5, *spa_p,*sp1_p,*sp0_p); | |
3573 | #endif | |
3488 | 3574 | } |
3489 | 3575 | |
3490 | 3576 | if (calc_func_mode == CALC_CONS) {sp0_p++; sp1_p++; spa_p++;} |
3614 | 3700 | */ |
3615 | 3701 | |
3616 | 3702 | static struct update_code_str * |
3617 | init_update_data(show_code) { | |
3703 | init_update_data(int show_code) { | |
3618 | 3704 | |
3619 | 3705 | struct update_code_str *update_data_p; |
3620 | 3706 | |
3716 | 3802 | |
3717 | 3803 | /* only aligned identities update counts */ |
3718 | 3804 | if (op==3 && sim_code == M_IDENT) { |
3719 | up_dp->p_op_cnt++; | |
3720 | return; | |
3805 | if ((sp0 == '*' && (sp1 == '*' || toupper(sp1) == 'U')) | |
3806 | || (sp1 == '*' && (sp0 == '*' || toupper(sp0) == 'U'))) { | |
3807 | if (up_dp->p_op_cnt > 0) { | |
3808 | sprintf(tmp_str,"%d**",up_dp->p_op_cnt); | |
3809 | up_dp->p_op_cnt = 0; | |
3810 | return; | |
3811 | } | |
3812 | } | |
3813 | else { | |
3814 | up_dp->p_op_cnt++; | |
3815 | return; | |
3816 | } | |
3721 | 3817 | } |
3722 | 3818 | else { |
3723 | 3819 | if (up_dp->p_op_cnt > 0) { |
3785 | 3881 | } |
3786 | 3882 | } |
3787 | 3883 | else { /* have a termination codon, output for !SHOW_CODE_CIGAR */ |
3788 | if (!up_dp->cigar_order) { | |
3789 | if (sp0 == '*' || sp1 == '*') { op = 6;} | |
3790 | } | |
3791 | else if (up_dp->show_ext && (sp0 != sp1)) { op = 1;} | |
3884 | if (!up_dp->cigar_order) { /* -m9c : -m9C and -m8CC are cigar_order */ | |
3885 | if (sp0 == '*' || sp1 == '*') { | |
3886 | /* op = 6 gets '*' from op_map="-x/=\\+*" when the string is closed */ | |
3887 | op = 6; | |
3888 | } | |
3889 | } | |
3890 | else if (sp0=='*' && sp1=='*') { | |
3891 | op=6; | |
3892 | } | |
3893 | else if (up_dp->show_ext && (sp0 != sp1)) { | |
3894 | op = 1; | |
3895 | } | |
3792 | 3896 | } |
3793 | 3897 | |
3794 | 3898 | if (up_dp->p_op_cnt == 0) { |
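The comment added before `display_alig()` in this file describes the alignment encoding: codes 0, 2, 3, 4 and 5, with the first two array entries holding the start positions. A minimal sketch of how those codes advance the protein and DNA indices follows. It is based only on that comment and on the visible body of `display_alig()`; the names `enc_span` and `enc_consumed` are hypothetical, and code 1 is treated as undefined and skipped, since the comment itself asks why it is missing.

```c
#include <stddef.h>

/* Hypothetical result type: residues and nucleotide positions consumed. */
struct enc_span {
  int n_prot;  /* amino acids consumed from the protein */
  int n_dna;   /* positions consumed from the DNA-side index */
};

/* Hypothetical helper (not part of FASTA): walks an alignment encoding
   and totals how far each sequence index advances.
     0       : delete amino acid      -> protein +1, DNA +0
     2, 3, 4 : frameshift/match       -> protein +1, DNA +code
     5       : delete of a codon      -> protein +0, DNA +3
   enc[0] and enc[1] are start positions, so operations begin at enc[2]. */
static struct enc_span
enc_consumed(const int *enc, int length)
{
  struct enc_span s = {0, 0};
  int j;

  if (enc == NULL) return s;

  for (j = 2; j < length; j++) {
    int op = enc[j];
    if (op == 0) {
      s.n_prot += 1;
    }
    else if (op >= 2 && op <= 4) {
      s.n_prot += 1;
      s.n_dna += op;   /* the code itself is the DNA increment */
    }
    else if (op == 5) {
      s.n_dna += 3;    /* the exception: code 5 advances by 3, not 5 */
    }
    /* any other value is not defined by the comment and is ignored here */
  }
  return s;
}
```

This makes the property noted in the comment explicit: for codes 2 to 4 the code value doubles as the DNA increment, and only code 5 needs special handling.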
218 | 218 | char le[MAXLC+1][64]; |
219 | 219 | |
220 | 220 | if (naa > MAXLC) { |
221 | fprintf(stderr,"*** dropfz2.c compilation problem naa(%d) > MAXLX(%d) ***\n", | |
222 | naa, MAXLC); | |
221 | fprintf(stderr,"*** error [%s:%d] - compilation problem naa(%d) > MAXLC(%d) ***\n", | |
222 | __FILE__, __LINE__, naa, MAXLC); | |
223 | 223 | } |
224 | 224 | |
225 | 225 | if ((*weighti=(struct wgt **)calloc((size_t)(naa+1),sizeof(struct wgt *))) |
226 | 226 | ==NULL) { |
227 | fprintf(stderr," cannot allocate weights array: %d\n",naa); | |
227 | fprintf(stderr,"*** error [%s:%d] - cannot allocate weights array: %d\n", | |
228 | __FILE__, __LINE__, naa); | |
228 | 229 | exit(1); |
229 | 230 | } |
230 | 231 | |
233 | 234 | for (aa=0; aa <= naa; aa++) { |
234 | 235 | if ((weight[aa]=(struct wgt *)calloc((size_t)256,sizeof(struct wgt))) |
235 | 236 | ==NULL) { |
236 | fprintf(stderr," cannot allocate weight[]: %d/%d\n",aa,naa); | |
237 | fprintf(stderr,"*** error [%s:%d] - cannot allocate weight[]: %d/%d\n", | |
238 | __FILE__, __LINE__, aa,naa); | |
237 | 239 | exit(1); |
238 | 240 | } |
239 | 241 | } |
242 | 244 | if (weightci !=NULL) { |
243 | 245 | if ((*weightci=(struct wgtc **)calloc((size_t)(naa+1), |
244 | 246 | sizeof(struct wgtc *)))==NULL) { |
245 | fprintf(stderr," cannot allocate weight_c array: %d\n",naa); | |
247 | fprintf(stderr,"*** error [%s:%d] - cannot allocate weight_c array: %d\n", | |
248 | __FILE__, __LINE__, naa); | |
246 | 249 | exit(1); |
247 | 250 | } |
248 | 251 | weightc = *weightci; |
250 | 253 | for (aa=0; aa <= naa; aa++) { |
251 | 254 | if ((weightc[aa]=(struct wgtc *)calloc((size_t)256,sizeof(struct wgtc))) |
252 | 255 | ==NULL) { |
253 | fprintf(stderr," cannot allocate weightc[]: %d/%d\n",aa,naa); | |
256 | fprintf(stderr,"*** error [%s:%d] - cannot allocate weightc[]: %d/%d\n", | |
257 | __FILE__, __LINE__, aa,naa); | |
254 | 258 | exit(1); |
255 | 259 | } |
256 | 260 | } |
411 | 415 | #endif |
412 | 416 | |
413 | 417 | if (nt[NT_N] != 'N') { |
414 | fprintf(stderr," nt[NT_N] (%d) != 'X' (%c) - recompile\n",NT_N,nt[NT_N]); | |
418 | fprintf(stderr,"*** error [%s:%d] - nt[NT_N] (%d) != 'N' (%c) - recompile\n", | |
419 | __FILE__, __LINE__, NT_N,nt[NT_N]); | |
415 | 420 | exit(1); |
416 | 421 | } |
417 | 422 | |
460 | 465 | if ((aa0x =(unsigned char *)calloc((size_t)maxn0, |
461 | 466 | sizeof(unsigned char))) |
462 | 467 | == NULL) { |
463 | fprintf (stderr, "cannot allocate aa0x array %d\n", maxn0); | |
468 | fprintf (stderr, "*** error [%s:%d] - cannot allocate aa0x array %d\n", | |
469 | __FILE__, __LINE__, maxn0); | |
464 | 470 | exit (1); |
465 | 471 | } |
466 | 472 | aa0x++; |
470 | 476 | if ((aa0v =(unsigned char *)calloc((size_t)maxn0, |
471 | 477 | sizeof(unsigned char))) |
472 | 478 | == NULL) { |
473 | fprintf (stderr, "cannot allocate aa0v array %d\n", maxn0); | |
479 | fprintf (stderr, "*** error [%s:%d] - cannot allocate aa0v array %d\n", | |
480 | __FILE__, __LINE__, maxn0); | |
474 | 481 | exit (1); |
475 | 482 | } |
476 | 483 | aa0v++; |
522 | 529 | if (hsq[i0] < NMAP && hsq[i0] > mhv) |
523 | 530 | mhv = ppst->hsq[i0]; |
524 | 531 | |
525 | if (mhv <= 0) | |
526 | { | |
527 | fprintf (stderr, " maximum hsq <=0 %d\n", mhv); | |
532 | if (mhv <= 0) { | |
533 | fprintf (stderr, "*** error [%s:%d] - maximum hsq <=0 %d\n", | |
534 | __FILE__, __LINE__, mhv); | |
528 | 535 | exit (1); |
529 | 536 | } |
530 | 537 | |
539 | 546 | f_str->hmask = (hmax >> f_str->kshft) - 1; |
540 | 547 | |
541 | 548 | if ((f_str->harr = (int *) calloc (hmax, sizeof (int))) == NULL) { |
542 | fprintf (stderr, " cannot allocate hash array\n"); | |
549 | fprintf (stderr, "*** error [%s:%d] - cannot allocate hash array [%d]\n", | |
550 | __FILE__, __LINE__, hmax); | |
543 | 551 | exit (1); |
544 | 552 | } |
545 | 553 | if ((f_str->pamh1 = (int *) calloc (ppst->nsq+1, sizeof (int))) == NULL) { |
546 | fprintf (stderr, " cannot allocate pamh1 array\n"); | |
554 | fprintf (stderr, "*** error [%s:%d] - cannot allocate pamh1 array [%d]\n", | |
555 | __FILE__, __LINE__, ppst->nsq+1); | |
547 | 556 | exit (1); |
548 | 557 | } |
549 | if ((f_str->pamh2 = (int *) calloc (hmax, sizeof (int))) == NULL) { | |
550 | fprintf (stderr, " cannot allocate pamh2 array\n"); | |
558 | if ((f_str->pamh2 = (int *)calloc (hmax, sizeof (int))) == NULL) { | |
559 | fprintf (stderr, "*** error [%s:%d] - cannot allocate pamh2 array [%d]\n", | |
560 | __FILE__, __LINE__, hmax); | |
551 | 561 | exit (1); |
552 | 562 | } |
553 | 563 | if ((f_str->link = (int *) calloc (n0, sizeof (int))) == NULL) { |
554 | fprintf (stderr, " cannot allocate hash link array"); | |
564 | fprintf (stderr, "*** error [%s:%d] - cannot allocate hash link array [%d]\n", | |
565 | __FILE__, __LINE__, n0); | |
555 | 566 | exit (1); |
556 | 567 | } |
557 | 568 | |
614 | 625 | #ifndef ALLOCN0 |
615 | 626 | if ((f_str->diag = (struct dstruct *) calloc ((size_t)MAXDIAG, |
616 | 627 | sizeof (struct dstruct)))==NULL) { |
617 | fprintf (stderr," cannot allocate diagonal arrays: %lu\n", | |
618 | MAXDIAG *sizeof (struct dstruct)); | |
628 | fprintf (stderr,"*** error [%s:%d] - cannot allocate diagonal arrays: %lu\n", | |
629 | __FILE__, __LINE__, MAXDIAG *sizeof (struct dstruct)); | |
619 | 630 | exit (1); |
620 | 631 | }; |
621 | 632 | #else |
622 | 633 | if ((f_str->diag = (struct dstruct *) calloc ((size_t)n0, |
623 | 634 | sizeof (struct dstruct)))==NULL) { |
624 | fprintf (stderr," cannot allocate diagonal arrays: %ld\n", | |
625 | (long)n0*sizeof (struct dstruct)); | |
635 | fprintf (stderr,"*** error [%s:%d] - cannot allocate diagonal arrays: %ld\n", | |
636 | __FILE__, __LINE__, (long)n0*sizeof (struct dstruct)); | |
626 | 637 | exit (1); |
627 | 638 | }; |
628 | 639 | #endif |
636 | 647 | if ((f_str->aa1x =(unsigned char *)calloc((size_t)ppst->maxlen+4, |
637 | 648 | sizeof(unsigned char))) |
638 | 649 | == NULL) { |
639 | fprintf (stderr, "cannot allocate aa1x array %d\n", ppst->maxlen+4); | |
650 | fprintf (stderr, "*** error [%s:%d] - cannot allocate aa1x array %d\n", | |
651 | __FILE__, __LINE__, ppst->maxlen+4); | |
640 | 652 | exit (1); |
641 | 653 | } |
642 | 654 | f_str->aa1x++; |
643 | 655 | |
644 | 656 | if ((f_str->aa1v =(unsigned char *)calloc((size_t)ppst->maxlen+4, |
645 | 657 | sizeof(unsigned char))) == NULL) { |
646 | fprintf (stderr, "cannot allocate aa1v array %d\n", ppst->maxlen+4); | |
658 | fprintf (stderr, "*** error [%s:%d] - cannot allocate aa1v array %d\n", | |
659 | __FILE__, __LINE__, ppst->maxlen+4); | |
647 | 660 | exit (1); |
648 | 661 | } |
649 | 662 | f_str->aa1v++; |
651 | 664 | #endif |
652 | 665 | |
653 | 666 | if ((waa= (int *)malloc (sizeof(int)*(nsq+1)*n0)) == NULL) { |
654 | fprintf(stderr,"cannot allocate waa struct %3d\n",nsq*n0); | |
667 | fprintf(stderr,"*** error [%s:%d] - cannot allocate waa struct %3d\n", | |
668 | __FILE__, __LINE__, nsq*n0); | |
655 | 669 | exit(1); |
656 | 670 | } |
657 | 671 | |
670 | 684 | maxn0 = max(4*n0,MIN_RES); |
671 | 685 | #endif |
672 | 686 | if ((res = (int *)calloc((size_t)maxn0,sizeof(int)))==NULL) { |
673 | fprintf(stderr,"cannot allocate alignment results array %d\n",maxn0); | |
687 | fprintf(stderr,"*** error [%s:%d] - cannot allocate alignment results array %d\n", | |
688 | __FILE__, __LINE__, maxn0); | |
674 | 689 | exit(1); |
675 | 690 | } |
676 | 691 | f_str->res = res; |
848 | 863 | } |
849 | 864 | |
850 | 865 | if (n0+n1+1 >= MAXDIAG) { |
851 | fprintf(stderr,"n0,n1 too large: %d, %d\n",n0,n1); | |
866 | fprintf(stderr,"*** error [%s:%d] - n0,n1 too large > %d: %d, %d\n", | |
867 | __FILE__, __LINE__, MAXDIAG, n0, n1); | |
852 | 868 | rst->score[0] = rst->score[1] = rst->score[2] = -1; |
853 | 869 | return; |
854 | 870 | } |
1096 | 1112 | aa1x = f_str->aa1x; |
1097 | 1113 | #ifdef DEBUG |
1098 | 1114 | if (frame > 1) { |
1099 | fprintf(stderr, "*** fz_walign - frame: %d - out of range [0,1]\n",frame); | |
1115 | fprintf(stderr, "*** error [%s:%d] - fz_walign - frame: %d - out of range [0,1]\n", | |
1116 | __FILE__, __LINE__, frame); | |
1100 | 1117 | } |
1101 | 1118 | #endif |
1102 | 1119 | |
1632 | 1649 | aq = ap->next; free(ap); ap = aq; |
1633 | 1650 | } |
1634 | 1651 | if (i >= max_res) |
1635 | fprintf(stderr,"***alignment truncated: %d/%d***\n", max_res,i); | |
1652 | fprintf(stderr,"*** error [%s:%d] - alignment truncated: %d >= %d***\n", | |
1653 | __FILE__, __LINE__, i, max_res); | |
1636 | 1654 | |
1637 | 1655 | /* up = &up[-3]; down = &down[-3]; tp = &tp[-3]; */ |
1638 | 1656 | free(&f_str->up[-3]); free(&f_str->tp[-3]); free(&f_str->down[-3]); |
2478 | 2496 | |
2479 | 2497 | /* now we need alignment storage - get it */ |
2480 | 2498 | if ((cur_ares->res = (int *)calloc((size_t)max_res,sizeof(int)))==NULL) { |
2481 | fprintf(stderr," *** cannot allocate alignment results array %d\n",max_res); | |
2499 | fprintf(stderr,"*** error [%s:%d] - cannot allocate alignment results array %d\n", | |
2500 | __FILE__, __LINE__, max_res); | |
2482 | 2501 | exit(1); |
2483 | 2502 | } |
2484 | 2503 | |
2649 | 2668 | *have_ares = 0x3; /* set 0x2 bit to indicate local copy */ |
2650 | 2669 | |
2651 | 2670 | if ((a_res = (struct a_res_str *)calloc(1, sizeof(struct a_res_str)))==NULL) { |
2652 | fprintf(stderr," [do_walign] Cannot allocate a_res"); | |
2671 | fprintf(stderr,"*** error [%s:%d] - cannot allocate a_res [%lu]\n", | |
2672 | __FILE__, __LINE__, sizeof(struct a_res_str)); | |
2653 | 2673 | return NULL; |
2654 | 2674 | } |
2655 | 2675 | |
2940 | 2960 | update_data_p = init_update_data(show_code); |
2941 | 2961 | } |
2942 | 2962 | else { |
2943 | fprintf(stderr,"*** error [%s:%d] --- cal_cons_u() invalid calc_func_mode: %d\n", | |
2963 | fprintf(stderr,"*** error [%s:%d] --- calc_cons_u() invalid calc_func_mode: %d\n", | |
2944 | 2964 | __FILE__, __LINE__, calc_func_mode); |
2945 | 2965 | exit(1); |
2946 | 2966 | } |
2972 | 2992 | else if (calc_func_mode == CALC_ID || calc_func_mode == CALC_ID_DOM) { |
2973 | 2993 | have_ann = (annotp_p && annotp_p->n_annot > 0); |
2974 | 2994 | spa_p = &spa_c; |
2975 | sp0_p = &sp0_c; | |
2976 | sp1_p = &sp1_c; | |
2977 | ||
2978 | sp0a_p = &sp0a_c; | |
2979 | sp1a_p = &sp1a_c; | |
2995 | sp0_p = &sp1_c; | |
2996 | sp1_p = &sp0_c; | |
2997 | ||
2998 | sp0a_p = &sp1a_c; | |
2999 | sp1a_p = &sp0a_c; | |
2980 | 3000 | annot_fmt = 3; |
2981 | 3001 | |
2982 | 3002 | /* does not require aa0a/aa1a, only for variants */ |
2983 | 3003 | } |
2984 | 3004 | else if (calc_func_mode == CALC_CODE) { |
2985 | 3005 | spa_p = &spa_c; |
2986 | sp0_p = &sp0_c; | |
2987 | sp1_p = &sp1_c; | |
2988 | ||
2989 | sp0a_p = &sp0a_c; | |
2990 | sp1a_p = &sp1a_c; | |
3006 | sp0_p = &sp1_c; | |
3007 | sp1_p = &sp0_c; | |
3008 | ||
3009 | sp0a_p = &sp1a_c; | |
3010 | sp1a_p = &sp0a_c; | |
2991 | 3011 | |
2992 | 3012 | show_code = (display_code & (SHOW_CODE_MASK+SHOW_CODE_EXT)); /* see defs.h; SHOW_CODE_ALIGN=2,_CIGAR=3,_CIGAR_EXT=4 */ |
2993 | 3013 | annot_fmt = 2; |
3001 | 3021 | update_data_p = init_update_data(show_code); |
3002 | 3022 | } |
3003 | 3023 | else { |
3004 | fprintf(stderr,"*** error [%s:%d] --- cal_cons_u() invalid calc_func_mode: %d\n", | |
3024 | fprintf(stderr,"*** error [%s:%d] --- calc_cons_u() invalid calc_func_mode: %d\n", | |
3005 | 3025 | __FILE__, __LINE__, calc_func_mode); |
3006 | 3026 | exit(1); |
3007 | 3027 | } |
3117 | 3137 | if (cumm_seq_score) *i_spa++ = itmp; |
3118 | 3138 | |
3119 | 3139 | if (calc_func_mode == CALC_CODE) { |
3140 | #ifndef TFAST | |
3120 | 3141 | update_code(align_code_dyn, update_data_p, 3, *spa_p, *sp0_p, *sp1_p); |
3142 | #else | |
3143 | update_code(align_code_dyn, update_data_p, 3, *spa_p, *sp1_p, *sp0_p); | |
3144 | #endif | |
3121 | 3145 | |
3122 | 3146 | if (have_ann && have_push_features) { |
3123 | 3147 | add_annot_code(have_ann, *sp0_p, *sp1_p, *sp1a_p, |
3159 | 3183 | *spa_p = M_DEL; |
3160 | 3184 | |
3161 | 3185 | if (calc_func_mode == CALC_CODE) { |
3186 | #ifndef TFAST | |
3162 | 3187 | update_code(align_code_dyn, update_data_p, 2, *spa_p,*sp0_p,*sp1_p); |
3188 | #else | |
3189 | update_code(align_code_dyn, update_data_p, 2, *spa_p,*sp1_p,*sp0_p); | |
3190 | #endif | |
3163 | 3191 | } |
3164 | 3192 | |
3165 | 3193 | if (cumm_seq_score) *i_spa++ = ppst->gshift; |
3232 | 3260 | *spa_p = align_type(itmp, *sp0_p, *sp1_p, 0, aln, ppst->pam_x_id_sim); |
3233 | 3261 | |
3234 | 3262 | if (calc_func_mode == CALC_CODE) { |
3263 | #ifndef TFAST | |
3235 | 3264 | update_code(align_code_dyn, update_data_p, 3, *spa_p,*sp0_p,*sp1_p); |
3265 | #else | |
3266 | update_code(align_code_dyn, update_data_p, 3, *spa_p,*sp1_p,*sp0_p); | |
3267 | #endif | |
3236 | 3268 | } |
3237 | 3269 | |
3238 | 3270 | d1_alen++; |
3279 | 3311 | *spa_p = M_DEL; |
3280 | 3312 | |
3281 | 3313 | if (calc_func_mode == CALC_CODE) { |
3314 | #ifndef TFAST | |
3282 | 3315 | update_code(align_code_dyn, update_data_p, 4, *spa_p,*sp0_p,*sp1_p); |
3316 | #else | |
3317 | update_code(align_code_dyn, update_data_p, 4, *spa_p,*sp1_p,*sp0_p); | |
3318 | #endif | |
3283 | 3319 | } |
3284 | 3320 | |
3285 | 3321 | if (calc_func_mode == CALC_CONS) {sp0_p++; sp1_p++; spa_p++;} |
3344 | 3380 | *spa_p = align_type(itmp, *sp0_p, *sp1_p, 0, aln, ppst->pam_x_id_sim); |
3345 | 3381 | |
3346 | 3382 | if (calc_func_mode == CALC_CODE) { |
3383 | #ifndef TFAST | |
3347 | 3384 | update_code(align_code_dyn, update_data_p, 3, *spa_p,*sp0_p,*sp1_p); |
3385 | #else | |
3386 | update_code(align_code_dyn, update_data_p, 3, *spa_p,*sp1_p,*sp0_p); | |
3387 | #endif | |
3348 | 3388 | } |
3349 | 3389 | |
3350 | 3390 | d1_alen++; |
3392 | 3432 | |
3393 | 3433 | if (calc_func_mode == CALC_CODE) { |
3394 | 3434 | *spa_p = 5; |
3435 | #ifndef TFAST | |
3395 | 3436 | update_code(align_code_dyn, update_data_p, 5, *spa_p,*sp0_p,*sp1_p); |
3437 | #else | |
3438 | update_code(align_code_dyn, update_data_p, 5, *spa_p,*sp1_p,*sp0_p); | |
3439 | #endif | |
3396 | 3440 | } |
3397 | 3441 | |
3398 | 3442 | lenc++; |
3408 | 3452 | |
3409 | 3453 | if (calc_func_mode == CALC_CODE) { |
3410 | 3454 | *spa_p = 5; /* indel code */ |
3455 | #ifndef TFAST | |
3411 | 3456 | update_code(align_code_dyn, update_data_p, 0, *spa_p,*sp0_p,*sp1_p); |
3457 | #else | |
3458 | update_code(align_code_dyn, update_data_p, 0, *spa_p,*sp1_p,*sp0_p); | |
3459 | #endif | |
3412 | 3460 | } |
3413 | 3461 | |
3414 | 3462 | if (cumm_seq_score) { |
3594 | 3642 | */ |
3595 | 3643 | |
3596 | 3644 | static struct update_code_str * |
3597 | init_update_data(show_code) { | |
3645 | init_update_data(int show_code) { | |
3598 | 3646 | |
3599 | 3647 | struct update_code_str *update_data_p; |
3600 | 3648 | |
3640 | 3688 | |
3641 | 3689 | if (!up_dp) return; |
3642 | 3690 | |
3643 | if (up_dp->btop_enc) { | |
3644 | sprintf(tmp_cnt,"%d",up_dp->p_op_cnt); | |
3645 | up_dp->p_op_cnt = 0; | |
3646 | } | |
3647 | else { | |
3648 | sprintf_code(tmp_cnt,up_dp, up_dp->p_op_idx, up_dp->p_op_cnt); | |
3649 | } | |
3650 | dyn_strcat(align_code_dyn, tmp_cnt); | |
3691 | if (up_dp->p_op_cnt) { | |
3692 | if (up_dp->btop_enc) { | |
3693 | sprintf(tmp_cnt,"%d",up_dp->p_op_cnt); | |
3694 | up_dp->p_op_cnt = 0; | |
3695 | } | |
3696 | else { | |
3697 | sprintf_code(tmp_cnt,up_dp, up_dp->p_op_idx, up_dp->p_op_cnt); | |
3698 | } | |
3699 | dyn_strcat(align_code_dyn, tmp_cnt); | |
3700 | } | |
3651 | 3701 | |
3652 | 3702 | free(up_dp); |
3653 | 3703 | } |
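
The change above appears to append the pending identity count only when it is non-zero, so an encoded alignment no longer ends with a stray `0`. The sketch below illustrates that guard with a simplified BTOP-like encoder; `btop_sketch()` is a hypothetical stand-in and does not reproduce `update_code()` or `close_update_data()`:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: encode two equal-length aligned strings as a
   BTOP-like string (run lengths for identities, residue pairs otherwise). */
static void btop_sketch(const char *q, const char *s, char *out, size_t outlen) {
    size_t pos = 0;
    int run = 0;   /* pending count of identical positions */

    assert(out != NULL && outlen > 0);
    out[0] = '\0';
    for (; *q != '\0' && *s != '\0'; q++, s++) {
        if (*q == *s) { run++; continue; }
        if (run > 0 && pos < outlen) {   /* flush only a non-zero run */
            pos += (size_t)snprintf(out + pos, outlen - pos, "%d", run);
        }
        run = 0;
        if (pos < outlen) {              /* mismatch: emit the residue pair */
            pos += (size_t)snprintf(out + pos, outlen - pos, "%c%c", *q, *s);
        }
    }
    if (run > 0 && pos < outlen) {       /* final flush, guarded like the patch */
        snprintf(out + pos, outlen - pos, "%d", run);
    }
}
```

With this guard, `ACGT` against `ACTT` encodes as `2GT1`, while `ACGT` against `ACGA` encodes as `3TA` with no trailing count.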
3700 | 3750 | |
3701 | 3751 | /* only aligned identities update counts */ |
3702 | 3752 | if (op==3 && sim_code == M_IDENT) { |
3703 | up_dp->p_op_cnt++; | |
3704 | return; | |
3753 | if ((sp0 == '*' && (sp1 == '*' || toupper(sp1) == 'U')) | |
3754 | || (sp1 == '*' && (sp0 == '*' || toupper(sp0) == 'U'))) { | |
3755 | if (up_dp->p_op_cnt > 0) { | |
3756 | sprintf(tmp_str,"%d**",up_dp->p_op_cnt); | |
3757 | up_dp->p_op_cnt = 0; | |
3758 | return; | |
3759 | } | |
3760 | } | |
3761 | else { | |
3762 | up_dp->p_op_cnt++; | |
3763 | return; | |
3764 | } | |
3705 | 3765 | } |
3706 | 3766 | else { |
3707 | 3767 | if (up_dp->p_op_cnt > 0) { |
208 | 208 | if (hsq[i0] < NMAP && hsq[i0] > mhv) mhv = hsq[i0]; |
209 | 209 | |
210 | 210 | if (mhv <= 0) { |
211 | fprintf (stderr, " maximum hsq <=0 %d\n", mhv); | |
211 | fprintf (stderr, "*** error [%s:%d] maximum hsq <=0 %d\n", __FILE__, __LINE__, mhv); | |
212 | 212 | exit (1); |
213 | 213 | } |
214 | 214 | |
222 | 222 | f_str->hmask = (hmax >> f_str->kshft) - 1; |
223 | 223 | |
224 | 224 | if ((f_str->harr = (int *) calloc (hmax, sizeof (int))) == NULL) { |
225 | fprintf (stderr, " *** cannot allocate hash array: hmax: %d hmask: %d\n", | |
226 | hmax, f_str->hmask); | |
225 | fprintf (stderr, "*** error [%s:%d] - cannot allocate hash array: hmax: %d hmask: %d\n", | |
226 | __FILE__,__LINE__,hmax, f_str->hmask); | |
227 | 227 | exit (1); |
228 | 228 | } |
229 | 229 | |
230 | 230 | if ((f_str->pamh1 = (int *) calloc (nsq+1, sizeof (int))) == NULL) { |
231 | fprintf (stderr, " *** cannot allocate pamh1 array nsq=%d\n",nsq); | |
231 | fprintf (stderr, "*** error [%s:%d] - cannot allocate pamh1 array nsq=%d\n", | |
232 | __FILE__, __LINE__, nsq); | |
232 | 233 | exit (1); |
233 | 234 | } |
234 | 235 | |
235 | 236 | if ((f_str->pamh2 = (int *) calloc (hmax, sizeof (int))) == NULL) { |
236 | fprintf (stderr, " *** cannot allocate pamh2 array hmax=%d\n",hmax); | |
237 | fprintf (stderr, "*** error [%s:%d] - cannot allocate pamh2 array hmax=%d\n", | |
238 | __FILE__, __LINE__,hmax); | |
237 | 239 | exit (1); |
238 | 240 | } |
239 | 241 | |
240 | 242 | if ((f_str->link = (int *) calloc (n0, sizeof (int))) == NULL) { |
241 | fprintf (stderr, " *** cannot allocate hash link array n0=%d",n0); | |
243 | fprintf (stderr, "*** error [%s:%d] - cannot allocate hash link array n0=%d\n", | |
244 | __FILE__, __LINE__, n0); | |
242 | 245 | exit (1); |
243 | 246 | } |
244 | 247 | |
299 | 302 | f_str->ndo = 0; |
300 | 303 | if ((f_str->diag = (struct dstruct *) calloc ((size_t)MAXDIAG, |
301 | 304 | sizeof (struct dstruct)))==NULL) { |
302 | fprintf (stderr," *** cannot allocate diagonal arrays: %lu\n", | |
303 | MAXDIAG *sizeof (struct dstruct)); | |
305 | fprintf (stderr,"*** error [%s:%d] - cannot allocate diagonal arrays: %lu\n", | |
306 | __FILE__, __LINE__, MAXDIAG *sizeof (struct dstruct)); | |
304 | 307 | exit (1); |
305 | 308 | }; |
306 | 309 | |
309 | 312 | if ((f_str->aa1x =(unsigned char *)calloc((size_t)ppst->maxlen+2, |
310 | 313 | sizeof(unsigned char))) |
311 | 314 | == NULL) { |
312 | fprintf (stderr, " *** cannot allocate aa1x array %d\n", ppst->maxlen+2); | |
315 | fprintf (stderr, "*** error [%s:%d] - cannot allocate aa1x array %d\n", | |
316 | __FILE__, __LINE__, ppst->maxlen+2); | |
313 | 317 | exit (1); |
314 | 318 | } |
315 | 319 | f_str->aa1x++; |
324 | 328 | maxn0 = n0 + 4; |
325 | 329 | if ((ss = (struct swstr *) calloc (maxn0, sizeof (struct swstr))) |
326 | 330 | == NULL) { |
327 | fprintf (stderr, " *** cannot allocate ss array %3d\n", n0); | |
331 | fprintf (stderr, "*** error [%s:%d] - cannot allocate ss array %3d\n", | |
332 | __FILE__, __LINE__, n0); | |
328 | 333 | exit (1); |
329 | 334 | } |
330 | 335 | ss++; |
335 | 340 | |
336 | 341 | /* initialize variable (-S) pam matrix */ |
337 | 342 | if ((f_str->waa_s= (int *)calloc((nsq+1)*(n0+1),sizeof(int))) == NULL) { |
338 | fprintf(stderr,"*** error [%s:%d] cannot allocate waa_s array %3d\n", | |
343 | fprintf(stderr,"*** error [%s:%d] - cannot allocate waa_s array %3d\n", | |
339 | 344 | __FILE__, __LINE__, nsq*n0); |
340 | 345 | exit(1); |
341 | 346 | } |
342 | 347 | |
343 | 348 | /* initialize pam2p[1] pointers */ |
344 | 349 | if ((f_str->pam2p[1]= (int **)calloc((n0+1),sizeof(int *))) == NULL) { |
345 | fprintf(stderr,"*** error [%s:%d] cannot allocate pam2p[1] array %3d\n", | |
350 | fprintf(stderr,"*** error [%s:%d] - cannot allocate pam2p[1] array %3d\n", | |
346 | 351 | __FILE__, __LINE__, n0); |
347 | 352 | exit(1); |
348 | 353 | } |
349 | 354 | |
350 | 355 | pam2p = f_str->pam2p[1]; |
351 | 356 | if ((pam2p[0]=(int *)calloc((nsq+1)*(n0+1),sizeof(int))) == NULL) { |
352 | fprintf(stderr,"*** error [%s:%d] cannot allocate pam2p[1][] array %3d\n", | |
357 | fprintf(stderr,"*** error [%s:%d] - cannot allocate pam2p[1][] array %3d\n", | |
353 | 358 | __FILE__, __LINE__, nsq*n0); |
354 | 359 | exit(1); |
355 | 360 | } |
360 | 365 | |
361 | 366 | /* initialize universal (alignment) matrix */ |
362 | 367 | if ((f_str->waa_a= (int *)calloc((nsq+1)*(n0+1),sizeof(int))) == NULL) { |
363 | fprintf(stderr,"*** error [%s:%d] cannot allocate waa_a struct %3d\n", | |
368 | fprintf(stderr,"*** error [%s:%d] - cannot allocate waa_a struct %3d\n", | |
364 | 369 | __FILE__, __LINE__, nsq*n0); |
365 | 370 | exit(1); |
366 | 371 | } |
367 | 372 | |
368 | 373 | /* initialize pam2p[0] pointers */ |
369 | 374 | if ((f_str->pam2p[0]= (int **)calloc((n0+1),sizeof(int *))) == NULL) { |
370 | fprintf(stderr,"*** error [%s:%d] cannot allocate pam2p[1] array %3d\n", | |
375 | fprintf(stderr,"*** error [%s:%d] - cannot allocate pam2p[0] array %3d\n", | |
371 | 376 | __FILE__, __LINE__, n0); |
372 | 377 | exit(1); |
373 | 378 | } |
374 | 379 | |
375 | 380 | pam2p = f_str->pam2p[0]; |
376 | 381 | if ((pam2p[0]=(int *)calloc((nsq+1)*(n0+1),sizeof(int))) == NULL) { |
377 | fprintf(stderr,"*** error [%s:%d] cannot allocate pam2p[1][] array %3d\n", | |
382 | fprintf(stderr,"*** error [%s:%d] - cannot allocate pam2p[0][] array %3d\n", | |
378 | 383 | __FILE__, __LINE__, nsq*n0); |
379 | 384 | exit(1); |
380 | 385 | } |
527 | 532 | *f_arg = NULL; |
528 | 533 | } |
529 | 534 | else { |
530 | fprintf(stderr, "*** error [%s:%d] close_work() with NULL f_str ***\n", | |
535 | fprintf(stderr, "*** error [%s:%d] - close_work() with NULL f_str ***\n", | |
531 | 536 | __FILE__, __LINE__); |
532 | 537 | } |
533 | 538 | } |
615 | 620 | } |
616 | 621 | |
617 | 622 | if (n0+n1+1 >= MAXDIAG) { |
618 | fprintf(stderr,"*** error [%s:%d] n0,n1 too large: %d + %d (%d) > %d \n", | |
623 | fprintf(stderr,"*** error [%s:%d] - n0,n1 too large: %d + %d (%d) > %d \n", | |
619 | 624 | __FILE__, __LINE__, n0,n1,n0+n1+1,MAXDIAG); |
620 | 625 | rst->score[0] = rst->score[1] = rst->score[2] = -1; |
621 | 626 | return; |
1136 | 1141 | |
1137 | 1142 | #ifdef DEBUG |
1138 | 1143 | if (window > f_str->bss_size) { |
1139 | fprintf(stderr,"*** error [%s:%d] dropnfa.c:dmatch window [%d] out of range [%d]\n", | |
1144 | fprintf(stderr,"*** error [%s:%d] - dmatch window [%d] out of range [%d]\n", | |
1140 | 1145 | __FILE__, __LINE__, window, f_str->bss_size); |
1141 | 1146 | window = f_str->bss_size - 4; |
1142 | 1147 | } |
1204 | 1209 | |
1205 | 1210 | band = up-low+1; |
1206 | 1211 | if (band < 1) { |
1207 | fprintf(stderr,"*** error [%s:%d] low > up is unacceptable!: M: %d N: %d l/u: %d/%d\n", | |
1212 | fprintf(stderr,"*** error [%s:%d] - low > up is unacceptable!: M: %d N: %d l/u: %d/%d\n", | |
1208 | 1213 | __FILE__, __LINE__, M, N, low, up); |
1209 | 1214 | return 0; |
1210 | 1215 | } |
1346 | 1351 | |
1347 | 1352 | /* now we need alignment storage - get it */ |
1348 | 1353 | if ((cur_ares->res = (int *)calloc((size_t)max_res,sizeof(int)))==NULL) { |
1349 | fprintf(stderr,"*** error [%s:%d] cannot allocate alignment results array %d\n", | |
1354 | fprintf(stderr,"*** error [%s:%d] - cannot allocate alignment results array %d\n", | |
1350 | 1355 | __FILE__, __LINE__, max_res); |
1351 | 1356 | exit(1); |
1352 | 1357 | } |
1384 | 1389 | local_aa1 = (unsigned char *)aa1; |
1385 | 1390 | if (l_min > 0 || l_max < n1 - 1) { |
1386 | 1391 | if (l_max - l_min < 0) { |
1387 | fprintf(stderr,"*** error [%s:%d] l_min: %d > l_max %d\n",__FILE__, __LINE__, l_min,l_max); | |
1392 | fprintf(stderr,"*** error [%s:%d] - l_min: %d > l_max %d\n",__FILE__, __LINE__, l_min,l_max); | |
1388 | 1393 | exit(1); |
1389 | 1394 | } |
1390 | 1395 | if ((local_aa1 = (unsigned char *)calloc(l_max - l_min +2,sizeof(unsigned char *)))==NULL) { |
1391 | fprintf(stderr,"*** error [%s:%d] Cannot allocate local_aa1\n",__FILE__, __LINE__); | |
1396 | fprintf(stderr,"*** error [%s:%d] - cannot allocate local_aa1\n",__FILE__, __LINE__); | |
1392 | 1397 | exit(1); |
1393 | 1398 | } |
1394 | 1399 | |
1564 | 1569 | |
1565 | 1570 | window = min (n1, ppst->param_u.fa.optwid); |
1566 | 1571 | if (window > f_str->bss_size) { |
1567 | fprintf(stderr,"*** error [%s:%d] walign window [%d] out of range [%d]\n", | |
1572 | fprintf(stderr,"*** error [%s:%d] - walign window [%d] out of range [%d]\n", | |
1568 | 1573 | __FILE__, __LINE__, window, f_str->bss_size); |
1569 | 1574 | window = f_str->bss_size - 4; |
1570 | 1575 | } |
1579 | 1584 | a_res->n1 = n1; |
1580 | 1585 | |
1581 | 1586 | if (score <=0) { |
1582 | fprintf(stderr,"*** [%s:%d] n0/n1: %d/%d hoff: %d window: %d\n", | |
1587 | fprintf(stderr,"*** [%s:%d] - score <= 0 - n0/n1: %d/%d hoff: %d window: %d\n", | |
1583 | 1588 | __FILE__, __LINE__, n0, n1, hoff, window); |
1584 | 1589 | return 0; |
1585 | 1590 | } |
2177 | 2182 | *have_ares = 0x3; /* set 0x2 bit to indicate local copy */ |
2178 | 2183 | |
2179 | 2184 | if ((a_res = (struct a_res_str *)calloc(1, sizeof(struct a_res_str)))==NULL) { |
2180 | fprintf(stderr,"*** error [%s:%d] Cannot allocate a_res", __FILE__, __LINE__); | |
2185 | fprintf(stderr,"*** error [%s:%d] - cannot allocate a_res\n", __FILE__, __LINE__); | |
2181 | 2186 | return NULL; |
2182 | 2187 | } |
2183 | 2188 | |
2203 | 2208 | |
2204 | 2209 | #ifdef DEBUG |
2205 | 2210 | if (adler32(1L,aa1,n1) != adler32_crc) { |
2206 | fprintf(stderr,"*** error [%s:%d] adler32_crc mismatch n1: %d\n",__FILE__, __LINE__, n1); | |
2211 | fprintf(stderr,"*** error [%s:%d] - adler32_crc mismatch n1: %d\n",__FILE__, __LINE__, n1); | |
2207 | 2212 | } |
2208 | 2213 | #endif |
2209 | 2214 |
574 | 574 | * be rerun with 16 bits. If it is more, and we have tried at least |
575 | 575 | * 500 sequences, we switch off the 8-bit mode. |
576 | 576 | */ |
577 | if (score == OVERFLOW) { | |
577 | if (score == OVERFLOW_SCORE) { | |
578 | 578 | f_str->done_16bit++; |
579 | 579 | if(f_str->done_8bit>500 && (3*f_str->done_16bit)>(f_str->done_8bit)) |
580 | 580 | f_str->try_8bit = 0; |
37 | 37 | |
38 | 38 | */ |
39 | 39 | static |
40 | char *AA1="FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"; | |
40 | char *AA1="FFLLSSSSYY**CCUWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"; | |
41 | 41 | /* |
42 | 42 | Starts = ---M---------------M---------------M---------------------------- |
43 | 43 | Base1 = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG |
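
The one-character change to `AA1` replaces the `*` at the TGA position with `U`, so that codon translates to selenocysteine instead of a stop. The sketch below shows how a 64-entry table of this kind can be indexed. It assumes the usual T, C, A, G base ordering for all three positions (only the `Base1` row is visible here), and the names `AA1_U`, `base_idx()` and `translate_codon()` are hypothetical:

```c
#include <assert.h>
#include <string.h>

/* 64-entry table copied from the patched line; TGA now maps to 'U'. */
static const char AA1_U[] =
    "FFLLSSSSYY**CCUWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/* Map a base to 0..3 in T,C,A,G order; -1 for anything else. */
static int base_idx(char b) {
    const char *order = "TCAG";
    const char *p = (b != '\0') ? strchr(order, b) : NULL;
    return p ? (int)(p - order) : -1;
}

/* Translate one upper-case DNA codon; returns 'X' for unknown bases. */
static char translate_codon(const char *codon) {
    int b1, b2, b3;

    assert(codon != NULL && strlen(codon) >= 3);
    b1 = base_idx(codon[0]);
    b2 = base_idx(codon[1]);
    b3 = base_idx(codon[2]);
    if (b1 < 0 || b2 < 0 || b3 < 0) return 'X';
    return AA1_U[16 * b1 + 4 * b2 + b3];   /* TGA -> 0*16 + 3*4 + 2 = 14 */
}
```

Under that ordering TGA lands on index 14, the position that changed, while TAA and TAG (indices 10 and 11) remain stop codons.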
415 | 415 | aacmap[ii]= *aasmap++; |
416 | 416 | } |
417 | 417 | |
418 | ||
419 | for (i=0; i<64; i++) { | |
420 | fprintf(stderr,"'%c',",aacmap[i]); | |
421 | if ((i%16)==15) fputc('\n',stderr); | |
422 | } | |
423 | fputc('\n',stderr); | |
424 | ||
418 | if (debug) { | |
419 | for (i=0; i<64; i++) { | |
420 | fprintf(stderr,"'%c',",aacmap[i]); | |
421 | if ((i%16)==15) fputc('\n',stderr); | |
422 | } | |
423 | fputc('\n',stderr); | |
424 | } | |
425 | 425 | } |
426 | 426 | for (i=0; i<64; i++) { |
427 | 427 | aamap[i]=aascii[aacmap[i]]; |
497 | 497 | char *iprompt2=" database file name: "; |
498 | 498 | |
499 | 499 | #ifdef PCOMPLIB |
500 | char *verstr="36.3.8g Dec, 2017 MPI"; | |
501 | #else | |
502 | char *verstr="36.3.8g Dec, 2017"; | |
500 | char *verstr="36.3.8h Aug, 2019 MPI"; | |
501 | #else | |
502 | char *verstr="36.3.8h Aug, 2019"; | |
503 | 503 | #endif |
504 | 504 | |
505 | 505 | static int mktup=3; |
779 | 779 | ppst->pam2[0][ix_j][p_i] = ppst->pam2[0][ix_i][p_i]; |
780 | 780 | ppst->pam2[0][p_i][ix_j] = ppst->pam2[0][p_i][ix_i]; |
781 | 781 | } |
782 | } | |
782 | p_i = pascii['*']; | |
783 | ppst->pam2[0][ix_j][p_i] = ppst->pam2[0][p_i][ix_j] = ppst->pam2[0][p_i][p_i]; | |
784 | } | |
783 | 785 | else { |
784 | 786 | pascii['U'] = pascii['C']; |
785 | 787 | pascii['u'] = pascii['c']; |
1289 | 1291 | } |
1290 | 1292 | } |
1291 | 1293 | |
1292 | static char my_opts[] = "1BIM:ox:y:N:"; | |
1294 | /* Extended options: | |
1295 | -X1 - use the init1 score, rather than initn, for statistics and ordering results | |
1296 | -Xa - only report annotation information in -m 8CB output (for later merge) | |
1297 | -Xb - report z-score, not bit-score | |
1298 | -XB - use blast identities | |
1299 | -XI - ensure that identities are not rounded to 100% | |
1300 | -XM: - specify memory limits for database buffering | |
1301 | -XN:[+S] - treat N:N/X:X as similar as well as identical | |
1302 | -Xo - use initn score, not opt score, for statistics and ordering results | |
1303 | -Xx: - penalties for X:X, X:not-X match | |
1304 | -Xy: - width of band for optimized scores | |
1305 | */ | |
1306 | ||
1307 | static char my_opts[] = "1aBbIM:ox:y:N:"; | |
1293 | 1308 | |
1294 | 1309 | void |
1295 | 1310 | parse_ext_opts(char *opt_arg, int pgm_id, struct mngmsg *m_msp, struct pstruct *ppst) { |
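
The new comment lists the `-X` extended options, and `my_opts` gains `a` and `b`. How `parse_ext_opts()` consumes that string is not shown in this hunk. The sketch below assumes the getopt-style convention that a trailing `:` marks an option taking an argument, which matches the `-XM:`, `-Xx:`, `-Xy:` and `-XN:` entries in the comment. `ext_opt_kind()` is a hypothetical helper:

```c
#include <assert.h>
#include <string.h>

/* Illustrative lookup in a getopt-style option string:
   returns 1 if `c` takes an argument, 0 if it is a plain flag,
   -1 if `c` is not listed. */
static int ext_opt_kind(const char *opts, char c) {
    const char *p;

    assert(opts != NULL);
    if (c == '\0' || c == ':') return -1;   /* never valid option letters */
    p = strchr(opts, c);
    if (p == NULL) return -1;
    return (p[1] == ':') ? 1 : 0;
}
```

Against `"1aBbIM:ox:y:N:"`, the new `a` and `b` letters are plain flags, while `M`, `x`, `y` and `N` expect an argument.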
1309 | 1324 | ppst->param_u.fa.iniflag=1; |
1310 | 1325 | } |
1311 | 1326 | break; |
1312 | case 'B': m_msp->z_bits = 0; break; | |
1327 | ||
1328 | case 'a': m_msp->m8_show_annot = 1; break; | |
1329 | ||
1330 | case 'B': m_msp->blast_ident = 1; break; | |
1331 | ||
1332 | case 'b': m_msp->z_bits = 0; break; | |
1313 | 1333 | case 'I': |
1314 | 1334 | m_msp->tot_ident = 1; |
1315 | 1335 | /* |
2865 | 2885 | |
2866 | 2886 | for (i=0; i< ppst->nsq; i++) { |
2867 | 2887 | if (ppst->pam2[0][0][i] > -1000) { |
2868 | fprintf(stderr," *** ERROR *** pam2[0][0][%d/%c] == %d\n", | |
2869 | i,NCBIstdaa[i],ppst->pam2[0][0][i]); | |
2888 | fprintf(stderr," *** error[%s:%d] (validate_params) - pam2[0][0][%d/%c] == %d\n", | |
2889 | __FILE__, __LINE__, i,NCBIstdaa[i],ppst->pam2[0][0][i]); | |
2870 | 2890 | good_params = 0; |
2871 | 2891 | } |
2872 | 2892 | if (ppst->pam2[0][i][0] > -1000) { |
2873 | fprintf(stderr," *** ERROR *** pam2[0][%d/%c][0] == %d\n", | |
2874 | i,NCBIstdaa[i],ppst->pam2[0][i][0]); | |
2893 | fprintf(stderr," *** error[%s:%d] (validate_params)- pam2[0][%d/%c][0] == %d\n", | |
2894 | __FILE__,__LINE__,i,NCBIstdaa[i],ppst->pam2[0][i][0]); | |
2875 | 2895 | good_params = 0; |
2876 | 2896 | } |
2877 | 2897 | } |
2880 | 2900 | if (ppst->ext_sq_set) { |
2881 | 2901 | for (i=0; i< ppst->nsqx; i++) { |
2882 | 2902 | if (ppst->pam2[1][0][i] > -1000) { |
2883 | fprintf(stderr," *** ERROR *** pam2[1][0][%d] == %d\n", | |
2884 | i,ppst->pam2[1][0][i]); | |
2903 | fprintf(stderr," *** error[%s:%d] (validate_params) - pam2[1][0][%d] == %d\n", | |
2904 | __FILE__, __LINE__, i,ppst->pam2[1][0][i]); | |
2885 | 2905 | good_params = 0; |
2886 | 2906 | } |
2887 | 2907 | if (ppst->pam2[1][i][0] > -1000) { |
2888 | fprintf(stderr," *** ERROR *** pam2[1][%d][0] == %d\n", | |
2889 | i,ppst->pam2[1][i][0]); | |
2908 | fprintf(stderr," *** error[%s:%d] (validate_params) - pam2[1][%d][0] == %d\n", | |
2909 | __FILE__, __LINE__, i,ppst->pam2[1][i][0]); | |
2890 | 2910 | good_params = 0; |
2891 | 2911 | } |
2892 | 2912 | } |
2895 | 2915 | /* check for valid residues in query */ |
2896 | 2916 | for (i=0; i<n0; i++) { |
2897 | 2917 | if (aa0[i] > ppst->nsq_e && aa0[i] != ESS) { |
2898 | fprintf(stderr," *** ERROR *** aa0[%d] = %c[%d > %d] out of range\n", | |
2899 | i, aa0[i], aa0[i], ppst->nsq_e); | |
2918 | fprintf(stderr," *** error [%s:%d] (validate_params) - aa0[%d] = %c[%d > %d] out of range\n", | |
2919 | __FILE__,__LINE__,i, aa0[i], aa0[i], ppst->nsq_e); | |
2900 | 2920 | good_params = 0; |
2901 | 2921 | } |
2902 | 2922 | } |
2903 | 2923 | |
2904 | 2924 | for (i=0; i<128; i++) { |
2905 | 2925 | if (lascii[i] < NA && lascii[i] > ppst->nsq_e) { |
2906 | fprintf(stderr," *** ERROR *** lascii [%c|%d] = %d > %d out of range\n", | |
2907 | i, i, lascii[i], ppst->nsq_e); | |
2926 | fprintf(stderr," *** error[%s:%d] (validate_params) - lascii [%c|%d] = %d > %d out of range\n", | |
2927 | __FILE__, __LINE__, i, i, lascii[i], ppst->nsq_e); | |
2908 | 2928 | good_params = 0; |
2909 | 2929 | } |
2910 | 2930 |
72 | 72 | if ((bp=strchr(tname,' '))!=NULL) *bp='\0'; |
73 | 73 | |
74 | 74 | if ((tptr=fopen(tname,"r"))==NULL) { |
75 | fprintf(stderr," could not open file of names: %s\n",tname); | |
75 | fprintf(stderr,"*** error [%s:%d] could not open file of names: %s\n",__FILE__,__LINE__,tname); | |
76 | 76 | return NULL; |
77 | 77 | } |
78 | 78 | |
108 | 108 | if (strlen(flstr)> (size_t)0) { |
109 | 109 | chlen = MAX_CH*MAX_FN; |
110 | 110 | if ((chtmp=charr=calloc((size_t)chlen,sizeof(char)))==NULL) { |
111 | fprintf(stderr,"cannot allocate choice file array\n"); | |
111 | fprintf(stderr,"*** error [%s:%d] cannot allocate choice file array\n",__FILE__,__LINE__); | |
112 | 112 | goto l1; |
113 | 113 | } |
114 | 114 | chlen--; |
115 | 115 | if ((fch=fopen(flstr,"r"))==NULL) { |
116 | fprintf(stderr," cannot open choice file: %s\n",flstr); | |
116 | fprintf(stderr,"*** error [%s:%d] cannot open choice file: %s\n",__FILE__,__LINE__,flstr); | |
117 | 117 | goto l1; |
118 | 118 | } |
119 | 119 | fprintf(stderr,"\n Choose sequence library:\n\n"); |
185 | 185 | int new_abbr,ich, nch; /* use new multi-letter abbr */ |
186 | 186 | int ltmp; |
187 | 187 | FILE *fch; |
188 | struct lib_struct *cur_lib_p = NULL; | |
188 | struct lib_struct *cur_lib_p = NULL, *tmp_lib_p; | |
189 | 189 | |
190 | 190 | new_abbr = 0; |
191 | 191 | *ltitle = '\0'; |
195 | 195 | } |
196 | 196 | else { |
197 | 197 | if (*flstr=='\0') { |
198 | fprintf(stderr," abbrv. list request but FASTLIBS undefined, cannot use %s\n",lname); | |
198 | fprintf(stderr,"*** error [%s:%d] abbrv. list request but FASTLIBS undefined, cannot use %s\n",__FILE__,__LINE__,lname); | |
199 | 199 | exit(1); |
200 | 200 | } |
201 | 201 | |
217 | 217 | |
218 | 218 | if (strlen(flstr) > (size_t)0) { |
219 | 219 | if ((fch=fopen(flstr,"r"))==NULL) { |
220 | fprintf(stderr," cannot open choice file: %s\n",flstr); | |
220 | fprintf(stderr,"*** error [%s:%d] cannot open choice file: %s\n",__FILE__,__LINE__,flstr); | |
221 | 221 | return NULL; |
222 | 222 | } |
223 | 223 | } |
232 | 232 | |
233 | 233 | /* if !new_abbr, match on one letter with ulindex() */ |
234 | 234 | if (!new_abbr) { |
235 | if (*bp=='+') continue; /* not a &lib& */ | |
235 | if (*bp=='+') continue; /* not a +lib+ */ | |
236 | 236 | else if (ulindex(lname,bp)!=NULL) { |
237 | 237 | if (ltitle[0] == '\0') { |
238 | 238 | strncpy(ltitle,line,MAX_STR); |
242 | 242 | strncat(ltitle,",\n ",MAX_STR-ltmp); |
243 | 243 | strncat(ltitle,line,MAX_STR-ltmp-4); |
244 | 244 | } |
245 | cur_lib_p = get_lnames(bp+1, cur_lib_p); | |
245 | tmp_lib_p = get_lnames(bp+1, cur_lib_p); | |
246 | if (tmp_lib_p) { cur_lib_p = tmp_lib_p;} | |
246 | 247 | } |
247 | 248 | } |
248 | 249 | else { |
267 | 268 | } |
268 | 269 | *bp1='+'; |
269 | 270 | } |
270 | else fprintf(stderr,"%s missing final '+'\n",bp); | |
271 | else fprintf(stderr,"*** error [%s:%d] %s missing final '+'\n",__FILE__,__LINE__,bp); | |
271 | 272 | } |
272 | 273 | } |
273 | 274 | } |
18 | 18 | governing permissions and limitations under the License. |
19 | 19 | */ |
20 | 20 | |
21 | /* input is a libtype 1,5, or 6 sequence database */ | |
21 | /* input is a lib_type 1,5, or 6 sequence database (lib_type specified after filename), | |
22 | e.g. 'swissprot.lseg 1' */ | |
23 | /* map_db -n specifies a DNA database */ | |
24 | ||
22 | 25 | /* output is a BLAST2 formatdb type index file */ |
23 | 26 | |
24 | 27 | /* format of the index file: |
155 | 155 | int nc, lc, maxc; |
156 | 156 | double lzscore, lzscore2, lbits; |
157 | 157 | struct a_struct l_aln, *l_aln_p; |
158 | float percent, gpercent; | |
158 | float percent, gpercent, ng_percent, disp_percent, disp_similar; | |
159 | int disp_alen; | |
159 | 160 | /* strings, lengths for conventional alignment */ |
160 | 161 | char *seqc0, *seqc0a, *seqc1, *seqc1a, *seqca; |
161 | 162 | int *cumm_seq_score; |
489 | 490 | |
490 | 491 | if (lc > 0) { |
491 | 492 | percent = (100.0*(float)l_aln_p->nident)/(float)lc; |
492 | } | |
493 | else { percent = -1.00; } | |
493 | ng_percent = (100.0*(float)l_aln_p->nident)/(float)(lc-(l_aln_p->ngap_q + l_aln_p->ngap_l)); | |
494 | } | |
495 | else { percent = ng_percent = -1.00; } | |
494 | 496 | |
495 | 497 | fprintf (fp, "a {\n"); |
496 | 498 | if (annot_var_dyn->string[0]) { |
533 | 535 | |
534 | 536 | if (cur_ares_p->score_delta > 0) score_delta -= cur_ares_p->score_delta; |
535 | 537 | |
536 | percent = calc_fpercent_id(100.0, l_aln_p->nident,lc,m_msp->tot_ident, -1.0); | |
538 | disp_percent = percent = calc_fpercent_id(100.0, l_aln_p->nident,lc,m_msp->tot_ident, -1.0); | |
539 | disp_similar = calc_fpercent_id(100.0, l_aln_p->nsim, lc, m_msp->tot_ident, -1.0); | |
540 | disp_alen = lc; | |
537 | 541 | |
538 | 542 | ngap = l_aln_p->ngap_q + l_aln_p->ngap_l; |
543 | ng_percent = calc_fpercent_id(100.0, l_aln_p->nident,lc-ngap,m_msp->tot_ident, -1.0); | |
544 | if (m_msp->blast_ident) { | |
545 | disp_percent = ng_percent; | |
546 | disp_similar = calc_fpercent_id(100.0, l_aln_p->npos, lc-ngap, m_msp->tot_ident, -1.0); | |
547 | disp_alen = lc - ngap; | |
548 | } | |
549 | ||
539 | 550 | #ifndef SHOWSIM |
540 | gpercent = calc_fpercent_id(100.0,l_aln_p->nident,lc-ngap,m_msp->tot_ident, -1.0); | |
551 | gpercent = ng_percent; | |
541 | 552 | #else |
542 | gpercent = calc_fpercent_id(100.0,l_aln_p->nsim,lc,m_msp->tot_ident, -1.0); | |
553 | gpercent = disp_similar; | |
543 | 554 | #endif |
544 | 555 | |
545 | 556 | lsw_score = cur_ares_p->sw_score + score_delta; |
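
With `blast_ident` set (`-XB`), the hunk above switches the displayed identity from `nident / lc` to `nident / (lc - ngap)` and reports `lc - ngap` as the alignment length. A minimal sketch of that choice follows. `pct()` is a simplified stand-in for `calc_fpercent_id()` that omits the `tot_ident` handling, and both function names are hypothetical:

```c
#include <assert.h>

/* Simplified stand-in for calc_fpercent_id(); returns -1.0 when the
   denominator is not positive. */
static double pct(int count, int len) {
    return (len > 0) ? (100.0 * (double)count) / (double)len : -1.0;
}

/* Percent identity as displayed: over the full alignment length by
   default, over the gap-free length (alen - ngap) when blast_ident is set. */
static double display_pct_ident(int nident, int alen, int ngap, int blast_ident) {
    assert(ngap >= 0);
    return blast_ident ? pct(nident, alen - ngap) : pct(nident, alen);
}
```

For an alignment of length 100 with 90 identities and 10 gap positions, the default is 90% and the gap-free variant is 100%, which is why the two modes can rank borderline hits differently.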
663 | 674 | if (m_msp->markx & MX_HTML) { |
664 | 675 | fprintf(fp,"<!-- ANNOT_START \"%s\" -->",link_name);} |
665 | 676 | /* ensure that last character is "\n" */ |
666 | if (annot_var_dyn->string[strlen(annot_var_dyn->string)-1] != '\n') { | |
667 | annot_var_dyn->string[strlen(annot_var_dyn->string)-1] = '\n'; | |
668 | } | |
669 | fputs(annot_var_dyn->string, fp); | |
677 | if (!m_msp->m8_show_annot) { | |
678 | if (annot_var_dyn->string[strlen(annot_var_dyn->string)-1] != '\n') { | |
679 | annot_var_dyn->string[strlen(annot_var_dyn->string)-1] = '\n'; | |
680 | } | |
681 | fputs(annot_var_dyn->string, fp); | |
682 | } | |
683 | else { fputs("\n",fp);} | |
684 | ||
670 | 685 | if (m_msp->markx & MX_HTML) {fputs("<!-- ANNOT_STOP -->",fp);} |
671 | 686 | } |
672 | 687 | |
745 | 760 | do_show(fp, m_msp->n0, bbp->seq->n1, lsw_score, name0, name1, nml, |
746 | 761 | link_name, |
747 | 762 | m_msp, ppst, seqc0, seqc0a, seqc1, seqc1a, seqca, cumm_seq_score, |
748 | nc, percent, gpercent, lc, l_aln_p, annot_var_dyn->string, | |
763 | nc, disp_percent, gpercent, disp_alen, l_aln_p, annot_var_dyn->string, | |
749 | 764 | m_msp->annot_p, bbp->seq->annot_p); |
750 | 765 | |
751 | 766 | /* display the encoded alignment left over from showbest()*/ |
808 | 823 | int tmp; |
809 | 824 | |
810 | 825 | if (m_msp->markx & MX_AMAP && (m_msp->markx & MX_ATYPE)==7) |
826 | /* show text graphic of alignment (very rarely used) */ | |
811 | 827 | disgraph(fp, n0, n1, percent, score, |
812 | 828 | aln->amin0, aln->amin1, aln->amax0, aln->amax1, m_msp->sq0off, |
813 | 829 | name0, name1, nml, aln->llen, m_msp->markx); |
814 | 830 | else if (m_msp->markx & MX_M10FORM) { |
831 | /* old tagged/parse-able format */ | |
815 | 832 | if (ppst->sw_flag && m_msp->arelv>0) |
816 | 833 | fprintf(fp,"; %s_score: %d\n",m_msp->f_id1,score); |
817 | 834 | fprintf(fp,"; %s_ident: %5.3f\n",m_msp->f_id1,percent/100.0); |
826 | 843 | seqc0, seqc0a, seqc1, seqc1a, seqca, cumm_seq_score, nc, |
827 | 844 | n0, n1, name0, name1, nml, aln); |
828 | 845 | } |
829 | else { | |
846 | else { /* all "normal" alignment formats */ | |
830 | 847 | if (!(m_msp->markx & MX_MBLAST)) { |
831 | 848 | #ifndef LALIGN |
832 | 849 | fprintf(fp,"%s score: %d; ",m_msp->alabel, score); |
847 | 864 | annot_var_s, q_annot_p, l_annot_p); |
848 | 865 | } |
849 | 866 | |
850 | if (m_msp->markx & MX_AMAP && (m_msp->markx & MX_ATYPE)!=7) { | |
867 | if ((m_msp->markx & MX_AMAP) && ((m_msp->markx & MX_ATYPE)!=MX_ATYPE)) { | |
851 | 868 | fputc('\n',fp); |
852 | 869 | tmp = n0; |
853 | 870 |
90 | 90 | void w_abort (char *p, char *p1); |
91 | 91 | |
92 | 92 | extern double zs_to_bit(double, int, int); |
93 | ||
94 | void dominfo_to_str(struct dyn_string_str *d, struct annot_str *annot); | |
93 | 95 | |
94 | 96 | /* showbest() shows a list of high scoring sequence descriptions, and |
95 | 97 | their rst.scores. If -m 9, then an additional complete set of |
136 | 138 | struct rstruct rst; |
137 | 139 | int l_score0, ngap; |
138 | 140 | double lzscore, lzscore2, lbits; |
139 | float percent, gpercent, ng_percent; | |
141 | float percent, gpercent, ng_percent, disp_percent, disp_similar; | |
142 | int disp_alen; | |
140 | 143 | struct a_struct *aln_p; |
141 | 144 | struct a_res_str *cur_ares_p; |
142 | 145 | struct rstruct *rst_p; |
143 | 146 | int gi_num; |
144 | 147 | char html_pre_E[120], html_post_E[120]; |
145 | 148 | int have_lalign = 0; |
149 | struct dyn_string_str *dominfo_dstr; | |
146 | 150 | |
147 | 151 | struct lmf_str *m_fptr; |
148 | 152 | |
241 | 245 | /* display number of hits for -m 8C (Blast Tab-commented format) */ |
242 | 246 | if (m_msp->markx & MX_M8COMMENT) { |
243 | 247 | /* line below copied from BLAST+ output */ |
244 | fprintf(fp,"# Fields: query id, subject id, %% identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score"); | |
248 | if (m_msp->markx & MX_M8_BTAB_LEN) { | |
249 | fprintf(fp,"# Fields: query id, query length, subject id, subject length, %% identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score"); | |
250 | } | |
251 | else { | |
252 | fprintf(fp,"# Fields: query id, subject id, %% identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score"); | |
253 | } | |
254 | ||
245 | 255 | if (ppst->zsflag > 20) {fprintf(fp,", eval2");} |
246 | 256 | if (m_msp->show_code & (SHOW_CODE_ALIGN+SHOW_CODE_CIGAR)) { fprintf(fp,", aln_code");} |
247 | 257 | else if ((m_msp->show_code & SHOW_CODE_BTOP)==SHOW_CODE_BTOP) { fprintf(fp,", BTOP");} |
328 | 338 | for (ib=istart; ib<istop; ib++) { |
329 | 339 | bbp = bptr[ib]; |
330 | 340 | if (ppst->do_rep) { |
331 | bbp->repeat_thresh = | |
332 | min(E1_to_s(ppst->e_cut_r, m_msp->n0, bbp->seq->n1,ppst->zdb_size, m_msp->pstat_void), | |
333 | bbp->rst.score[ppst->score_ix]); | |
341 | if (bbp->rst.escore > ppst->e_cut_r) { /* for poor alignment scores, don't look for more */ | |
342 | bbp->repeat_thresh = bbp->rst.score[ppst->score_ix] * 10; | |
343 | } | |
344 | else { | |
345 | bbp->repeat_thresh = | |
346 | min(E1_to_s(ppst->e_cut_r, m_msp->n0, bbp->seq->n1,ppst->zdb_size, m_msp->pstat_void), | |
347 | bbp->rst.score[ppst->score_ix]); | |
348 | } | |
334 | 349 | } |
335 | 350 | |
336 | 351 | #ifdef DEBUG |
518 | 533 | } |
519 | 534 | else if (m_msp->markx & MX_M8OUT) { /* MX_M8OUT -- provide query, library */ |
520 | 535 | if (first_line) {first_line = 0;} |
521 | fprintf (fp,"%s\t%s",m_msp->qtitle,bline_p); | |
536 | if (m_msp->markx & MX_M8_BTAB_LEN) { | |
537 | fprintf (fp,"%s\t%d\t%s\t%d",m_msp->qtitle,m_msp->n0,bline_p,bbp->seq->n1); | |
538 | } | |
539 | else { | |
540 | fprintf (fp,"%s\t%s",m_msp->qtitle,bline_p); | |
541 | } | |
522 | 542 | } |
523 | 543 | else if (m_msp->markx & MX_MBLAST2) { /* blast "Sequences producing" */ |
524 | 544 | if (first_line) {first_line = 0;} |
536 | 556 | annot_str_len = cur_ares_p->annot_code_n; |
537 | 557 | |
538 | 558 | ngap = cur_ares_p->aln.ngap_q + cur_ares_p->aln.ngap_l; |
539 | percent = calc_fpercent_id(100.0,aln_p->nident,aln_p->lc, m_msp->tot_ident, -100.0); | |
559 | disp_percent = percent = calc_fpercent_id(100.0,aln_p->nident,aln_p->lc, m_msp->tot_ident, -100.0); | |
540 | 560 | ng_percent = calc_fpercent_id(100.0,aln_p->nident,aln_p->lc-ngap, m_msp->tot_ident, -100.0); |
561 | disp_similar = calc_fpercent_id(100.0, cur_ares_p->aln.nsim, aln_p->lc, m_msp->tot_ident, -100.0); | |
562 | disp_alen = aln_p->lc; | |
563 | if (m_msp->blast_ident) { | |
564 | disp_percent = ng_percent; | |
565 | disp_similar = calc_fpercent_id(100.0, cur_ares_p->aln.npos, aln_p->lc - ngap, m_msp->tot_ident, -100.0); | |
566 | disp_alen = aln_p->lc - ngap; | |
567 | } | |
541 | 568 | |
542 | 569 | #ifndef SHOWSIM |
543 | gpercent = calc_fpercent_id(100.0, aln_p->nident, aln_p->lc-ngap, m_msp->tot_ident, -100.0); | |
570 | gpercent = ng_percent; | |
544 | 571 | #else |
545 | gpercent = calc_fpercent_id(100.0, cur_ares_p->aln.nsim, aln_p->lc, m_msp->tot_ident, -100.0); | |
572 | gpercent = disp_similar; | |
546 | 573 | #endif /* SHOWSIM */ |
547 | 574 | |
548 | 575 | if (m_msp->show_code != SHOW_CODE_ID && m_msp->show_code != SHOW_CODE_IDD) { /* show more complete info than just identity */ |
563 | 590 | /* sequence coordinate min max min max */ |
564 | 591 | if (!(m_msp->markx & MX_M8OUT)) { |
565 | 592 | fprintf(fp,"\t%5.3f %5.3f %4d %4d %4ld %4ld %4ld %4ld %4ld %4ld %4ld %4ld %3d %3d %3d", |
566 | percent/100.0,gpercent/100.0, | |
593 | disp_percent/100.0,gpercent/100.0, | |
567 | 594 | cur_ares_p->sw_score, |
568 | aln_p->lc, | |
595 | disp_alen, | |
569 | 596 | aln_p->d_start0,aln_p->d_stop0, |
570 | 597 | aln_p->q_start_off, aln_p->q_end_off, |
571 | 598 | aln_p->d_start1,aln_p->d_stop1, |
581 | 608 | } |
582 | 609 | else { /* MX_M8OUT -- blast order, tab separated */ |
583 | 610 | fprintf(fp,"\t%.2f\t%d\t%d\t%d\t%ld\t%ld\t%ld\t%ld\t%.2g\t%.1f", |
584 | ng_percent,aln_p->lc,aln_p->nmismatch, | |
611 | ng_percent,aln_p->lc-ngap,aln_p->nmismatch, | |
585 | 612 | aln_p->ngap_q + aln_p->ngap_l+aln_p->nfs, |
586 | 613 | aln_p->d_start0, aln_p->d_stop0, |
587 | 614 | aln_p->d_start1, aln_p->d_stop1, |
588 | 615 | zs_to_E(lzscore,n1,ppst->dnaseq,ppst->zdb_size,m_msp->db), |
589 | 616 | lbits); |
617 | ||
590 | 618 | if (ppst->zsflag > 20) { |
591 | 619 | fprintf(fp,"\t%.2g",zs_to_E(lzscore2, n1, ppst->dnaseq, ppst->zdb_size, m_msp->db)); |
592 | 620 | } |
593 | 621 | if ((m_msp->show_code & (SHOW_CODE_ALIGN+SHOW_CODE_CIGAR+SHOW_CODE_BTOP)) && seq_code_len > 0 && seq_code != NULL) { |
594 | 622 | fprintf(fp,"\t%s",seq_code); |
623 | ||
595 | 624 | if (annot_str_len > 0 && annot_str != NULL) { |
596 | 625 | fprintf(fp,"\t%s",annot_str); |
597 | 626 | } |
627 | ||
628 | if (m_msp->show_code & SHOW_CODE_DOMINFO) { | |
629 | dominfo_dstr = init_dyn_string(1024,1024); | |
630 | if (m_msp->annot_p) { | |
631 | dominfo_to_str(dominfo_dstr,m_msp->annot_p); | |
632 | } | |
633 | if (bbp->seq->annot_p) { | |
634 | dominfo_to_str(dominfo_dstr,bbp->seq->annot_p); | |
635 | } | |
636 | ||
637 | if (dominfo_dstr->string[0]) { | |
638 | fprintf(fp,"\t%s",dominfo_dstr->string); | |
639 | } | |
640 | free_dyn_string(dominfo_dstr); | |
641 | } | |
598 | 642 | } |
599 | 643 | fprintf(fp,"\n"); |
600 | 644 | } |
602 | 646 | else { /* !SHOW_CODE -> SHOW_ID or SHOW_IDD*/ |
603 | 647 | #ifdef SHOWSIM |
604 | 648 | fprintf(fp," %5.3f %5.3f %4d", |
605 | percent/100.0, | |
606 | (float)aln_p->nsim/(float)aln_p->lc,aln_p->lc); | |
649 | disp_percent/100.0,disp_similar/100.0,disp_alen); | |
607 | 650 | #else |
608 | fprintf(fp," %5.3f %4d", percent/100.0,aln_p->lc); | |
651 | fprintf(fp," %5.3f %4d", disp_percent/100.0,disp_alen); | |
609 | 652 | #endif |
610 | 653 | if (m_msp->markx & MX_HTML) { |
611 | 654 | if (cur_ares_p->index > 0) { |
619 | 662 | } |
620 | 663 | else { link_shown = 0;} |
621 | 664 | |
622 | if ((m_msp->show_code & SHOW_CODE_ID) == SHOW_CODE_ID) { | |
665 | if ((m_msp->show_code & SHOW_CODE_ID) == SHOW_CODE_ID ) { | |
623 | 666 | annot_str = cur_ares_p->annot_var_id; |
624 | 667 | } |
625 | 668 | else if ((m_msp->show_code & SHOW_CODE_IDD) == SHOW_CODE_IDD) { |
628 | 671 | else { |
629 | 672 | annot_str = NULL; |
630 | 673 | } |
631 | if (annot_str && annot_str[0]) { | |
674 | if (annot_str && annot_str[0] && (!m_msp->m8_show_annot || (m_msp->markx & MX_M8OUT))) { | |
632 | 675 | fprintf(fp," %s",annot_str); |
633 | 676 | } |
634 | 677 | } |
662 | 705 | |
663 | 706 | if (m_msp->markx & MX_HTML) fprintf(fp,"</pre><hr>\n"); |
664 | 707 | } |
708 | ||
709 | /* dominfo_to_str() -- convert domain annotations to a |DX:1-100;C=PF12345~1 dyn_string */ | |
710 | /* used for both query and subject strings */ | |
711 | void | |
712 | dominfo_to_str(struct dyn_string_str *dominfo_dstr, struct annot_str *annots) { | |
713 | int i; | |
714 | char tmp_string[MAX_STR]; | |
715 | struct annot_entry *annot; | |
716 | struct dyn_string_str *dyn_dom_str; | |
717 | ||
718 | for (i=0; i < annots->n_annot; i++) { | |
719 | ||
720 | annot = &annots->annot_arr_p[i]; | |
721 | ||
722 | if (annot->target) { | |
723 | if (annot->label == '-') { | |
724 | sprintf(tmp_string,"|XD:%ld-%ld;C=%s",annot->pos+1,annot->end+1,annot->comment); | |
725 | } | |
726 | else { | |
727 | sprintf(tmp_string,"|X%c:%ld-%ld;C=%s",annot->label, annot->pos+1,annot->end+1,annot->comment); | |
728 | } | |
729 | } | |
730 | else { | |
731 | if (annot->label == '-') { | |
732 | sprintf(tmp_string,"|DX:%ld-%ld;C=%s",annot->pos+1,annot->end+1,annot->comment); | |
733 | } | |
734 | else { | |
735 | sprintf(tmp_string,"|%cX:%ld-%ld;C=%s",annot->label, annot->pos+1,annot->end+1,annot->comment); | |
736 | } | |
737 | ||
738 | } | |
739 | ||
740 | ||
741 | dyn_strcat(dominfo_dstr, tmp_string); | |
742 | } | |
743 | } |
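The field layout built by `dominfo_to_str()` can be sketched with a bounded write. This is an illustration under stated assumptions, not the package's code: `struct dom_annot` and `dom_to_field` are hypothetical stand-ins, and `snprintf` is swapped in for the `sprintf` used above so that an over-long comment is reported and not written past the buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the package's annotation entry. */
struct dom_annot {
    int target;          /* 0: query annotation, 1: library (subject) annotation */
    char label;          /* '-' marks a domain */
    long pos, end;       /* zero-based start and end */
    const char *comment; /* for example a Pfam accession */
};

/* Write one "|DX:start-end;C=comment" style field into buf.
   Returns the number of characters written, or -1 if the field did not fit. */
static int dom_to_field(char *buf, size_t buf_len, const struct dom_annot *a) {
    char c0, c1, lab;
    int n;

    assert(buf != NULL && a != NULL);
    /* A '-' label is shown as 'D'; 'X' fills the side without the annotation. */
    lab = (a->label == '-') ? 'D' : a->label;
    if (a->target) { c0 = 'X'; c1 = lab; }
    else           { c0 = lab; c1 = 'X'; }

    /* Coordinates are reported one-based, as in the hunk above. */
    n = snprintf(buf, buf_len, "|%c%c:%ld-%ld;C=%s",
                 c0, c1, a->pos + 1, a->end + 1, a->comment);
    if (n < 0 || (size_t)n >= buf_len) return -1;
    return n;
}
```

A query domain at zero-based positions 0 to 99 with comment `PF12345` yields `|DX:1-100;C=PF12345`; the same domain on the library side would start with `|XD:`.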
23 | 23 | |
24 | 24 | #define FORMATDBV3 3 /* formatdb version */ |
25 | 25 | #define FORMATDBV4 4 /* formatdb version */ |
26 | #define FORMATDBV5 5 /* formatdb version */ | |
26 | 27 | |
27 | 28 | #define NULLB '\0' /* sentinel byte */ |
28 | 29 |
79 | 79 | |
80 | 80 | |
81 | 81 | /* **************************************************************** |
82 | This code reads NCBI Blast2 format databases from formatdb version 3 and 4 | |
82 | This code reads NCBI Blast2 format databases from formatdb version 3 -- 5 | |
83 | 83 | |
84 | 84 | (From NCBI) This section describes the format of the databases. |
85 | 85 | |
449 | 449 | src_uint4_read(ifile,(unsigned *)&dbformat); /* get format DB version number */ |
450 | 450 | src_uint4_read(ifile,(unsigned *)&dbtype); /* get 1 for protein/0 DNA */ |
451 | 451 | |
452 | if (dbformat != FORMATDBV3 && dbformat!=FORMATDBV4) { | |
452 | if (dbformat != FORMATDBV3 && dbformat!=FORMATDBV4 && dbformat!=FORMATDBV5) { | |
453 | 453 | fprintf(stderr,"error - %s wrong formatdb version (%d/%d)\n", |
454 | 454 | tname,dbformat,FORMATDBV3); |
455 | 455 | return NULL; |
787 | 787 | int title_len; |
788 | 788 | char *title_str=NULL; |
789 | 789 | int date_len; |
790 | char *pdb_title_str=NULL; | |
791 | int pdb_title_len; | |
790 | 792 | char *date_str=NULL; |
791 | 793 | long ltmp; |
792 | 794 | int64_t l8tmp; |
793 | 795 | int i, tmp; |
794 | 796 | unsigned int *f_pos_arr; |
795 | 797 | |
798 | if (dbformat == FORMATDBV5) { | |
799 | src_uint4_read(ifile,(unsigned int *)&ltmp); | |
800 | } | |
801 | ||
796 | 802 | src_uint4_read(ifile,(unsigned *)&title_len); |
797 | 803 | |
798 | 804 | if (title_len > 0) { |
803 | 809 | fread(title_str,(size_t)1,(size_t)title_len,ifile); |
804 | 810 | } |
805 | 811 | |
812 | if (dbformat == FORMATDBV5) { | |
813 | src_uint4_read(ifile,(unsigned int *)&pdb_title_len); | |
814 | if (pdb_title_len > 0) { | |
815 | if ((pdb_title_str = calloc((size_t)pdb_title_len+1,sizeof(char)))==NULL) { | |
816 | fprintf(stderr," cannot allocate pdb_title string (%d)\n",pdb_title_len); | |
817 | goto error_r; | |
818 | } | |
819 | fread(pdb_title_str,(size_t)1,(size_t)pdb_title_len,ifile); | |
820 | } | |
821 | } | |
822 | ||
806 | 823 | src_uint4_read(ifile,(unsigned *)&date_len); |
807 | 824 | |
808 | 825 | if (date_len > 0) { |
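The header fields in this hunk (title, the extra version 5 string, date) all follow the same pattern: a 4-byte length followed by that many bytes. A minimal sketch of the pattern follows, assuming the big-endian byte order that BLAST index files are generally described as using. `read_be_u32` and `read_len_string` are hypothetical helpers, not the package's `src_uint4_read()`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Read a 4-byte big-endian unsigned integer.
   Returns 0 on success, -1 on a short read. */
static int read_be_u32(FILE *fp, uint32_t *val) {
    unsigned char b[4];

    assert(fp != NULL && val != NULL);
    if (fread(b, 1, 4, fp) != 4) return -1;
    *val = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
         | ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    return 0;
}

/* Read a length-prefixed string. The caller frees the result.
   Returns NULL on a short read or allocation failure, and an
   empty string when the stored length is 0. */
static char *read_len_string(FILE *fp) {
    uint32_t len;
    char *s;

    if (read_be_u32(fp, &len) != 0) return NULL;
    if ((s = calloc((size_t)len + 1, 1)) == NULL) return NULL;
    if (len > 0 && fread(s, 1, len, fp) != len) {
        free(s);
        return NULL;
    }
    return s;
}
```

Checking the `fread` return value, which the hunk above does not do, lets a truncated index file be reported instead of yielding a partly filled string.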
52 | 52 | 4 - Intelligentics format |
53 | 53 | 5 - NBRF/PIR VMS format |
54 | 54 | 6 - GCG 2bit format |
55 | 7 - FASTQ format | |
56 | 8 - accession script | |
55 | 57 | |
56 | 58 | 10 - list of gi/acc's |
57 | 59 | 11 - NCBI setdb/blastp (1.3.2) AA/NT |
58 | 60 | 12 - NCBI setdb/blastp (2.0) AA/NT |
59 | 61 | 16 - mySQL queries |
60 | ||
62 | ||
61 | 63 | see file altlib.h to confirm numbers |
62 | 64 | |
63 | 65 | */ |
166 | 168 | struct lmf_str *m_fptr=NULL; |
167 | 169 | int acc_off=0; |
168 | 170 | char fmt_term; |
171 | char acc_script[MAX_LSTR]; | |
169 | 172 | struct lib_struct *next_lib_p, *this_lib_p, *tmp_lib_p; |
170 | 173 | |
171 | 174 | om_fptr = lib_p->m_file_p; |
177 | 180 | |
178 | 181 | wcnt = 0; /* number of times to ask for file name */ |
179 | 182 | |
183 | /* check for library type */ | |
184 | lib_type=0; | |
185 | if ((bp=strchr(lib_p->file_name,' '))!=NULL | |
186 | || (bp=strchr(lib_p->file_name,'^'))!=NULL) { | |
187 | if (isdigit((int)(bp+1)[0])) { /* check for number for lib_type */ | |
188 | *bp='\0'; | |
189 | sscanf(bp+1,"%d",&lib_type); | |
190 | if (lib_type<0 || lib_type >= LASTLIB) { | |
191 | fprintf(stderr,"\n invalid library type: %d (>%d)- resetting\n%s\n", | |
192 | lib_type,LASTLIB,lib_p->file_name); | |
193 | lib_type=0; | |
194 | } | |
195 | } /* don't change lib_type if it's not a number */ | |
196 | } | |
197 | else if (lib_p->file_name[0] =='!') { /* check for script */ | |
198 | lib_type = lib_p->lib_type = ACC_SCRIPT; | |
199 | } | |
200 | ||
201 | /* check for stdin indicator '-' or '@' (or ACC_SCRIPT) */ | |
202 | if (lib_p->file_name[0] == '-' || lib_p->file_name[0] == '@' | |
203 | || lib_type == ACC_SCRIPT) { | |
204 | use_stdin = 1; | |
205 | } | |
206 | else use_stdin=0; | |
207 | ||
208 | if (use_stdin && !(lib_type ==0 || lib_type==ACC_SCRIPT)) { | |
209 | fprintf(stderr,"\n @/- STDIN libraries must be in FASTA format\n"); | |
210 | return NULL; | |
211 | } | |
212 | ||
213 | opt_text[0]='\0'; | |
214 | if (lib_type != ACC_SCRIPT) { | |
180 | 215 | /* check to see if there is a file option ":1-100" */ |
181 | 216 | #ifndef WIN32 |
182 | if ((bp=strchr(lib_p->file_name,':'))!=NULL && *(bp+1)!='\0') { | |
217 | if ((bp=strchr(lib_p->file_name,':'))!=NULL && *(bp+1)!='\0') { | |
183 | 218 | #else |
184 | if ((bp=strchr(lib_p->file_name+3,':'))!=NULL && *(bp+1)!='\0') { | |
219 | if ((bp=strchr(lib_p->file_name+3,':'))!=NULL && *(bp+1)!='\0') { | |
185 | 220 | #endif |
186 | strncpy(opt_text,bp+1,sizeof(opt_text)); | |
187 | opt_text[sizeof(opt_text)-1]='\0'; | |
188 | *bp = '\0'; | |
189 | } | |
190 | else opt_text[0]='\0'; | |
191 | ||
192 | if (lib_p->file_name[0] == '-' || lib_p->file_name[0] == '@') { | |
193 | use_stdin = 1; | |
194 | } | |
195 | else use_stdin=0; | |
196 | ||
197 | /* check for library type */ | |
198 | if ((bp=strchr(lib_p->file_name,' '))!=NULL) { | |
199 | *bp='\0'; | |
200 | sscanf(bp+1,"%d",&lib_type); | |
201 | if (lib_type<0 || lib_type >= LASTLIB) { | |
202 | fprintf(stderr,"\n invalid library type: %d (>%d)- resetting\n%s\n", | |
203 | lib_type,LASTLIB,lib_p->file_name); | |
204 | lib_type=0; | |
205 | } | |
206 | else { | |
207 | lib_p->lib_type = lib_type; | |
208 | } | |
209 | } | |
210 | else lib_type = lib_p->lib_type; | |
211 | ||
212 | if (use_stdin && lib_type !=0 ) { | |
213 | fprintf(stderr,"\n @/- STDIN libraries must be in FASTA format\n"); | |
214 | return NULL; | |
221 | strncpy(opt_text,bp+1,sizeof(opt_text)); | |
222 | opt_text[sizeof(opt_text)-1]='\0'; | |
223 | *bp = '\0'; | |
224 | } | |
215 | 225 | } |
216 | 226 | |
217 | 227 | /* check to see if file can be open()ed? */ |
218 | ||
219 | 228 | l1: |
220 | 229 | opnflg = 0; |
221 | 230 | if (lib_type<=LASTTXT) { |
222 | 231 | if (!use_stdin) { |
223 | 232 | opnflg=((libf=fopen(lib_p->file_name,RBSTR))!=NULL); |
233 | } | |
234 | else if (lib_type==ACC_SCRIPT) { | |
235 | bp = lib_p->file_name; | |
236 | if (lib_p->file_name[0] == '!') { bp += 1;} | |
237 | strncpy(acc_script, bp, sizeof(acc_script)-1); | |
238 | acc_script[sizeof(acc_script)-1] = '\0'; | |
239 | ||
240 | /* convert '+' in acc_script to ' ' */ | |
241 | bp = strchr(acc_script,'+'); | |
242 | for ( ; bp; bp=strchr(bp+1,'+')) { | |
243 | *bp=' '; | |
244 | } | |
245 | libf=popen(acc_script,"r"); | |
246 | opnflg=(libf!=NULL); | |
224 | 247 | } |
225 | 248 | else { |
226 | 249 | libf=stdin; |
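The `ACC_SCRIPT` branch above turns a library name of the form `!script+arg1+arg2` into a shell command by dropping the leading `!` and replacing each `+` with a space. A minimal sketch of that conversion follows. `script_name_to_cmd` is a hypothetical helper, and unlike the `strncpy` in the hunk it rejects a name that does not fit instead of truncating it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy a "!script+arg1+arg2" library name into cmd, dropping the leading
   '!' and turning each '+' into a space.
   Returns 0 on success, -1 if the command does not fit in cmd. */
static int script_name_to_cmd(const char *file_name, char *cmd, size_t cmd_len) {
    char *bp;

    assert(file_name != NULL && cmd != NULL && cmd_len > 0);
    if (file_name[0] == '!') file_name++;
    if (strlen(file_name) >= cmd_len) return -1;
    strcpy(cmd, file_name);

    /* Walk from one '+' to the next, replacing each in place. */
    for (bp = strchr(cmd, '+'); bp != NULL; bp = strchr(bp + 1, '+')) {
        *bp = ' ';
    }
    return 0;
}
```

For a made-up script name, `!get_seqs.pl+--db+up` becomes `get_seqs.pl --db up`. The resulting string would then be passed to `popen()`, which returns `NULL` on failure.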
759 | 759 | |
760 | 760 | for (i=1; parm[i].gap > 0; i++) { |
761 | 761 | if (parm[i].gap > gap) continue; |
762 | else if (parm[i].gap == gap && parm[i].ext > ext ) continue; | |
763 | else if (parm[i].gap == gap && parm[i].ext == ext) { | |
762 | else if (parm[i].gap <= gap && parm[i].ext > ext ) continue; | |
763 | else if (parm[i].gap <= gap && parm[i].ext <= ext) { | |
764 | 764 | *K = parm[i].K; |
765 | 765 | *Lambda = parm[i].Lambda; |
766 | 766 | *H = parm[i].H; |
123 | 123 | char sqnam[4]; /* "aa" or "nt" */ |
124 | 124 | char sqtype[10]; /* "DNA" or "protein" */ |
125 | 125 | int long_info; /* long description flag*/ |
126 | int blast_ident; /* calculate identities excluding gaps */ | |
126 | 127 | long sq0off, sq1off; /* virtual offset into aa0, aa1 */ |
127 | 128 | int markx; /* alignment display type */ |
128 | 129 | int tot_markx; /* markx as sum of all alternative markx */ |
156 | 157 | int ashow_set; /* ashow set with -d */ |
157 | 158 | int nmlen; /* length of name label */ |
158 | 159 | int show_code; /* show alignment code in -m 9; ==1 => identity only, ==2 alignment code*/ |
160 | int m8_show_annot; /* show annotations only in -m 8CB output */ | |
159 | 161 | int tot_show_code; /* show alignment for all outputs */ |
160 | 162 | int pre_load_done; /* set after pre_load_best() call */ |
161 | 163 | int align_done; /* do_walign() called */ |
202 | 202 | -5, -11, -11, -11, -6, -9, -9, -12, -10, -1, -5, -9, -5, -8, -10, -10, -6, -17, -9, 8, |
203 | 203 | -8, -11, 3, 2, -14, -6, -5, -7, -5, -13, -15, -6, -10, -16, -9, -5, -6, -12, -12, -11, 8, |
204 | 204 | -7, -9, -6, -4, -17, 3, 2, -9, -6, -12, -9, -4, -8, -14, -7, -6, -7, -19, -12, -9, -4, 8, |
205 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 | |
205 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, | |
206 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 8 | |
206 | 207 | }; |
207 | 208 | |
208 | 209 | /* |
240 | 241 | -3, -9, -9, -9, -4, -7, -7, -10, -8, 1, -3, -8, -3, -6, -8, -8, -4, -13, -7, 7, |
241 | 242 | -6, -8, 3, 3, -11, -4, -3, -5, -4, -11, -12, -4, -8, -13, -7, -3, -5, -10, -10, -9, 8, |
242 | 243 | -5, -6, -4, -3, -13, 3, 3, -7, -4, -10, -7, -2, -6, -11, -5, -4, -5, -15, -9, -7, -2, 7, |
243 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 | |
244 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, | |
245 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 8 | |
246 | ||
244 | 247 | }; |
245 | 248 | |
246 | 249 | /* |
278 | 281 | -1, -7, -7, -7, -2, -5, -6, -8, -6, 3, -1, -6, -1, -4, -6, -6, -2, -10, -5, 7, |
279 | 282 | -4, -5, 4, 3, -8, -2, -1, -3, -2, -8, -9, -2, -6, -10, -5, -2, -3, -8, -7, -7, 7, |
280 | 283 | -3, -4, -2, -1, -10, 4, 3, -5, -2, -7, -6, -1, -4, -9, -4, -3, -3, -12, -7, -5, 0, 7, |
281 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 | |
284 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, | |
285 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 8 | |
286 | ||
282 | 287 | }; |
283 | 288 | |
284 | 289 | /* |
316 | 321 | 0, -4, -5, -5, -1, -4, -4, -6, -4, 3, 0, -4, 0, -2, -4, -4, -1, -6, -4, 6, |
317 | 322 | -2, -3, 4, 4, -5, -1, 0, -1, 0, -6, -6, -1, -4, -7, -3, 0, -1, -6, -5, -5, 7, |
318 | 323 | -2, -1, -1, 0, -6, 4, 3, -3, -1, -5, -4, 0, -3, -6, -2, -1, -2, -8, -5, -4, 0, 6, |
319 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 | |
324 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, | |
325 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 8 | |
326 | ||
320 | 327 | }; |
321 | 328 | |
322 | 329 | /* |
354 | 361 | 0, -3, -4, -4, 0, -3, -3, -4, -3, 3, 1, -3, 1, -1, -3, -3, 0, -4, -2, 5, |
355 | 362 | -1, -2, 4, 4, -4, 0, 1, -1, 0, -4, -5, 0, -3, -5, -2, 0, 0, -5, -3, -4, 6, |
356 | 363 | -1, 0, 0, 0, -5, 3, 3, -2, 0, -4, -3, 1, -2, -4, -1, -1, -1, -6, -3, -3, 0, 5, |
357 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 | |
364 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, | |
365 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 6 | |
358 | 366 | }; |
359 | 367 | |
360 | 368 | /* |
432 | 440 | 0, -3, -3, -4, 1, -2, -3, -4, -3, 4, 2, -3, 2, -1, -3, -2, 0, -4, -2, 4, |
433 | 441 | -1, -1, 4, 4, -3, 1, 2, 0, 0, -4, -4, 0, -3, -5, -1, 0, 0, -5, -3, -3, 6, |
434 | 442 | -1, 0, 1, 2, -3, 3, 3, -1, 1, -3, -3, 1, -2, -4, -1, 0, 0, -6, -3, -2, 2, 5, |
435 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 | |
443 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, | |
444 | -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 6 | |
436 | 445 | }; |
437 | 446 | |
438 | 447 | /* |
317 | 317 | char line[MAX_STR]; |
318 | 318 | int i, i_doms, n_domain_s = MAX_LSTR; |
319 | 319 | |
320 | /* since (currently) annot_var_s is MAX_LSOTR, do the same for domain_s */ | |
320 | /* since (currently) annot_var_s is MAX_LSTR, do the same for domain_s */ | |
321 | 321 | if ((domain_s = (char *)calloc(n_domain_s, sizeof(char)))==NULL) { |
322 | 322 | fprintf(stderr,"*** error [%s:%d] *** cannot allocate domain_s[%d]\n",__FILE__, __LINE__,n_domain_s); |
323 | 323 | return NULL; |
172 | 172 | |
173 | 173 | /* now we need alignment storage - get it */ |
174 | 174 | if ((cur_ares->res = (int *)calloc((size_t)max_res,sizeof(int)))==NULL) { |
175 | fprintf(stderr," *** cannot allocate alignment results array %d\n",max_res); | |
175 | fprintf(stderr,"*** error [%s:%d] - cannot allocate alignment results array %d\n", | |
176 | __FILE__, __LINE__, max_res); | |
176 | 177 | exit(1); |
177 | 178 | } |
178 | 179 | |
485 | 486 | |
486 | 487 | if ((f_ss = (struct swstr *) calloc (N+2, sizeof (struct swstr))) |
487 | 488 | == NULL) { |
488 | fprintf (stderr, " *** cannot allocate f_ss array %3d\n", N+2); | |
489 | fprintf (stderr, "*** error [%s:%d] - cannot allocate f_ss array %3d\n", | |
490 | __FILE__, __LINE__, N+2); | |
489 | 491 | exit (1); |
490 | 492 | } |
491 | 493 | f_ss++; |
492 | 494 | |
493 | 495 | if ((r_ss = (struct swstr *) calloc (N+2, sizeof (struct swstr))) |
494 | 496 | == NULL) { |
495 | fprintf (stderr, " *** cannot allocate r_ss array %3d\n", N+2); | |
497 | fprintf (stderr, "*** error [%s:%d] - cannot allocate r_ss array %3d\n", | |
498 | __FILE__, __LINE__, N+2); | |
496 | 499 | exit (1); |
497 | 500 | } |
498 | 501 | r_ss++; |
502 | 505 | |
503 | 506 | ck = CHECK_SCORE(IW,B,M,N,S,W,G,H,NC, &sw); |
504 | 507 | if (c != ck) { |
505 | fprintf(stderr," *** Check_score error. %d != %d ***\n",c,ck); | |
508 | fprintf(stderr,"*** error [%s:%d] - check_score error. %d != %d ***\n", | |
509 | __FILE__, __LINE__, c,ck); | |
506 | 510 | } |
507 | 511 | |
508 | 512 | f_ss--; r_ss--; |
5 | 5 | if [ ! -d results ]; then |
6 | 6 | mkdir results |
7 | 7 | fi |
8 | ||
9 | export FA_DB=/slib2/fa_dbs/qfo20.lseg | |
10 | ||
8 | 11 | echo "starting fasta36 - protein" `date` |
9 | ../bin/fasta36 -q -m 6 -Z 100000 ../seq/mgstm1.aa:1-100 q > results/test_m1.ok2.html | |
10 | ../bin/fasta36 -S -q -z 11 -O results/test_m1.ok2_p25 -s P250 ../seq/mgstm1.aa:100-218 q | |
12 | ../bin/fasta36 -q -m 6 -Z 100000 ../seq/mgstm1.aa:1-100 $FA_DB > results/test_m1.ok2.html | |
13 | ../bin/fasta36 -S -q -z 11 -O results/test_m1.ok2_p25 -s P250 ../seq/mgstm1.aa:100-218 $FA_DB | |
11 | 14 | echo "done" |
12 | 15 | echo "starting fastxy36" `date` |
13 | ../bin/fastx36 -m 9c -S -q ../seq/mgtt2_x.seq q 1 > results/test_t2.xk1 | |
14 | ../bin/fasty36 -S -q ../seq/mgtt2_x.seq q > results/test_t2.yk2 | |
15 | ../bin/fastx36 -m 9c -S -q -z 2 ../seq/mgstm1.esq a > results/test_m1.xk2z2 | |
16 | ../bin/fasty36 -S -q -z 2 ../seq/mgstm1.esq a > results/test_m1.yk2z2 | |
16 | ../bin/fastx36 -m 9c -S -q ../seq/mgtt2_x.seq $FA_DB 1 > results/test_t2.xk1 | |
17 | ../bin/fasty36 -S -q ../seq/mgtt2_x.seq $FA_DB > results/test_t2.yk2 | |
18 | ../bin/fastx36 -m 9c -S -q -z 2 ../seq/mgstm1.esq $FA_DB > results/test_m1.xk2z2 | |
19 | ../bin/fasty36 -S -q -z 2 ../seq/mgstm1.esq $FA_DB > results/test_m1.yk2z2 | |
17 | 20 | echo "done" |
18 | 21 | echo "starting fastxy36 rev" `date` |
19 | ../bin/fastx36 -m 9c -q -m 5 ../seq/mgstm1.rev q > results/test_m1.xk2r | |
20 | ../bin/fasty36 -q -m 5 -M 200-300 -z 2 ../seq/mgstm1.rev q > results/test_m1.yk2rz2 | |
21 | ../bin/fasty36 -q -m 5 -z 11 ../seq/mgstm1.rev q > results/test_m1.yk2rz11 | |
22 | ../bin/fastx36 -m 9c -q -m 5 ../seq/mgstm1.rev $FA_DB > results/test_m1.xk2r | |
23 | ../bin/fasty36 -q -m 5 -M 200-300 -z 2 ../seq/mgstm1.rev $FA_DB > results/test_m1.yk2rz2 | |
24 | ../bin/fasty36 -q -m 5 -z 11 ../seq/mgstm1.rev $FA_DB > results/test_m1.yk2rz11 | |
22 | 25 | echo "done" |
23 | 26 | echo "starting ssearch36" `date` |
24 | ../bin/ssearch36 -m 9c -S -z 3 -q ../seq/mgstm1.aa q > results/test_m1.ssz3 | |
25 | ../bin/ssearch36 -q -M 200-300 -z 2 -Z 100000 -s P250 ../seq/mgstm1.aa q > results/test_m1.ss_p25 | |
27 | ../bin/ssearch36 -m 9c -S -z 3 -q ../seq/mgstm1.aa $FA_DB > results/test_m1.ssz3 | |
28 | ../bin/ssearch36 -q -M 200-300 -z 2 -Z 100000 -s P250 ../seq/mgstm1.aa $FA_DB > results/test_m1.ss_p25 | |
26 | 29 | echo "done" |
27 | 30 | if [ -e ../bin/ssearch36s ]; then |
28 | 31 | echo "starting ssearch36s" `date` |
29 | ../bin/ssearch36s -m 9c -S -z 3 -q ../seq/mgstm1.aa q > results/test_m1.sssz3 | |
30 | ../bin/ssearch36s -q -M 200-300 -z 2 -Z 100000 -s P250 ../seq/mgstm1.aa q > results/test_m1.sss_p25 | |
32 | ../bin/ssearch36s -m 9c -S -z 3 -q ../seq/mgstm1.aa $FA_DB > results/test_m1.sssz3 | |
33 | ../bin/ssearch36s -q -M 200-300 -z 2 -Z 100000 -s P250 ../seq/mgstm1.aa $FA_DB > results/test_m1.sss_p25 | |
31 | 34 | echo "done" |
32 | 35 | fi |
33 | 36 | echo "starting prss36(ssearch/fastx)" `date` |
35 | 38 | ../bin/fastx36 -q -k 1000 ../seq/mgstm1.esq ../seq/xurt8c.aa > results/test_m1.rfx |
36 | 39 | echo "done" |
37 | 40 | echo "starting ggsearch36/glsearch36" `date` |
38 | ../bin/ggsearch36 -q -m 9i -w 80 ../seq/hahu.aa q > results/test_h1.gg | |
39 | ../bin/glsearch36 -q -m 9i -w 80 ../seq/hahu.aa q > results/test_h1.gl | |
40 | ../bin/ggsearch36 -q ../seq/gtt1_drome.aa q > results/test_t1.gg | |
41 | ../bin/glsearch36 -q ../seq/gtt1_drome.aa q > results/test_t1.gl | |
41 | ../bin/ggsearch36 -q -m 9i -w 80 ../seq/hahu.aa $FA_DB > results/test_h1.gg | |
42 | ../bin/glsearch36 -q -m 9i -w 80 ../seq/hahu.aa $FA_DB > results/test_h1.gl | |
43 | ../bin/ggsearch36 -q ../seq/gtt1_drome.aa $FA_DB > results/test_t1.gg | |
44 | ../bin/glsearch36 -q ../seq/gtt1_drome.aa $FA_DB > results/test_t1.gl | |
42 | 45 | echo "done" |
43 | 46 | echo "starting fasta36 - DNA" `date` |
44 | 47 | ../bin/fasta36 -S -q ../seq/mgstm1.nt %RMB 4 > results/test_m1.ok4 |
52 | 55 | ../bin/tfasty36 -q -i -3 -N 5000 ../seq/mgstm1.aa %p > results/test_m1.ty2 |
53 | 56 | echo "done" |
54 | 57 | echo "starting fastf36" `date` |
55 | ../bin/fastf36 -q ../seq/m1r.aa q > results/test_mf.ff | |
56 | ../bin/fastf36 -q ../seq/m1r.aa q > results/test_mf.ff_s | |
58 | ../bin/fastf36 -q ../seq/m1r.aa $FA_DB > results/test_mf.ff | |
59 | ../bin/fastf36 -q ../seq/m1r.aa $FA_DB > results/test_mf.ff_s | |
57 | 60 | echo "done" |
58 | 61 | echo "starting tfastf36" `date` |
59 | 62 | ../bin/tfastf36 -q ../seq/m1r.aa %r > results/test_mf.tfr |
60 | 63 | echo "done" |
61 | 64 | echo "starting fasts36" `date` |
62 | ../bin/fasts36 -q -V '*?@' ../seq/ngts.aa q > results/test_m1.fs1 | |
63 | ../bin/fasts36 -q ../seq/ngt.aa q > results/test_m1.fs | |
65 | ../bin/fasts36 -q -V '*?@' ../seq/ngts.aa $FA_DB > results/test_m1.fs1 | |
66 | ../bin/fasts36 -q ../seq/ngt.aa $FA_DB > results/test_m1.fs | |
64 | 67 | ../bin/fasts36 -q -n ../seq/mgstm1.nts m > results/test_m1.nfs |
65 | 68 | echo "starting fastm36" `date` |
66 | ../bin/fastm36 -q ../seq/ngts.aa q > results/test_m1.fm | |
69 | ../bin/fastm36 -q ../seq/ngts.aa $FA_DB > results/test_m1.fm | |
67 | 70 | ../bin/fastm36 -q -n ../seq/mgstm1.nts m > results/test_m1.nfm |
68 | 71 | echo "done" |
69 | 72 | echo "starting tfasts36" `date` |
3 | 3 | echo `uname -a` |
4 | 4 | echo "" |
5 | 5 | echo "starting fasta36 - protein" `date` |
6 | ||
7 | FA_DB=/slib2/fa_dbs/qfo20.lseg | |
8 | ||
6 | 9 | if [ ! -d results ]; then |
7 | 10 | mkdir results |
8 | 11 | fi |
9 | ../bin/fasta36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -S -z 21 -s BP62 ../seq/gstm1_human.vaa q > results/test2V_m1.ok2_bp62 | |
10 | ../bin/fasta36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -S -z 21 ../seq/gstm1_human.vaa q > results/test2V_m1.ok2_z21 | |
11 | ../bin/fasta36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -S -m BB ../seq/gstm1_human.vaa q > results/test2V_m1.ok2mB | |
12 | ../bin/fasta36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -S -z 21 -s BP62 ../seq/gstm1_human.vaa $FA_DB > results/test2V_m1.ok2_bp62 | |
13 | ../bin/fasta36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -S -z 21 ../seq/gstm1_human.vaa $FA_DB > results/test2V_m1.ok2_z21 | |
14 | ../bin/fasta36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -S -m BB ../seq/gstm1_human.vaa $FA_DB > results/test2V_m1.ok2mB | |
12 | 15 | echo "done" |
13 | 16 | echo "starting fastxy36" `date` |
14 | ../bin/fastx36 -V \!../scripts/ann_feats_up_www2.pl -m 9c -S -q ../seq/mgtt2_x.seq q > results/test2V_t2.xk2m9c | |
15 | ../bin/fastx36 -V \!../scripts/ann_feats_up_www2.pl -m BB -S -q ../seq/mgtt2_x.seq q > results/test2V_t2.xk2mB | |
16 | ../bin/fastx36 -V \!../scripts/ann_feats_up_www2.pl -m 9c -S -q -z 22 ../seq/gstm1b_human.nt q > results/test2V_m1.xk2m9cz22 | |
17 | ../bin/fasty36 -V \!../scripts/ann_feats_up_www2.pl -S -q -z 21 ../seq/gstm1b_human.nt q > results/test2V_m1.yk2z21 | |
17 | ../bin/fastx36 -V \!../scripts/ann_feats_up_www2.pl -m 9c -S -q ../seq/mgtt2_x.seq $FA_DB > results/test2V_t2.xk2m9c | |
18 | ../bin/fastx36 -V \!../scripts/ann_feats_up_www2.pl -m BB -S -q ../seq/mgtt2_x.seq $FA_DB > results/test2V_t2.xk2mB | |
19 | ../bin/fastx36 -V \!../scripts/ann_feats_up_www2.pl -m 9c -S -q -z 22 ../seq/gstm1b_human.nt $FA_DB > results/test2V_m1.xk2m9cz22 | |
20 | ../bin/fasty36 -V \!../scripts/ann_feats_up_www2.pl -S -q -z 21 ../seq/gstm1b_human.nt $FA_DB > results/test2V_m1.yk2z21 | |
18 | 21 | echo "done" |
19 | 22 | echo "starting ssearch36" `date` |
20 | ../bin/ssearch36 -V q\!../scripts/ann_pfam_www.pl -V \!../scripts/ann_pfam_www.pl -m 9c -S -z 22 -q ../seq/gstm1_human.vaa q > results/test2V_m1.ssm9cz22 | |
21 | ../bin/ssearch36 -V q\!../scripts/ann_pfam_www.pl -V \!../scripts/ann_pfam_www.pl -m 9C -S -z 21 -q ../seq/gstm1_human.vaa q > results/test2V_m1.ssm9Cz21 | |
22 | ../bin/ssearch36 -V q\!../scripts/ann_pfam_www.pl -V \!../scripts/ann_pfam_www.pl -m 8CC -S -q ../seq/gstm1_human.vaa q > results/test2V_m1.ssm8CC | |
23 | ../bin/ssearch36 -V q\!../scripts/ann_pfam_www.pl -V \!../scripts/ann_pfam_www.pl -m 9c -S -z 22 -q ../seq/gstm1_human.vaa $FA_DB > results/test2V_m1.ssm9cz22 | |
24 | ../bin/ssearch36 -V q\!../scripts/ann_pfam_www.pl -V \!../scripts/ann_pfam_www.pl -m 9C -S -z 21 -q ../seq/gstm1_human.vaa $FA_DB > results/test2V_m1.ssm9Cz21 | |
25 | ../bin/ssearch36 -V q\!../scripts/ann_pfam_www.pl -V \!../scripts/ann_pfam_www.pl -m 8CC -S -q ../seq/gstm1_human.vaa $FA_DB > results/test2V_m1.ssm8CC | |
23 | 26 | echo "done" `date` |
24 | 27 | echo "starting ggsearch36" `date` |
25 | ../bin/ggsearch36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -m 9c -S -q ../seq/gstm1_human.vaa q > results/test2V_m1.ggm9c | |
26 | ../bin/ggsearch36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -m 9C -S -z 21 -q ../seq/gstm1_human.vaa q > results/test2V_m1.ggm9Cz21 | |
28 | ../bin/ggsearch36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -m 9c -S -q ../seq/gstm1_human.vaa $FA_DB > results/test2V_m1.ggm9c | |
29 | ../bin/ggsearch36 -V q\!../scripts/ann_feats_up_www2.pl -V \!../scripts/ann_feats_up_www2.pl -m 9C -S -z 21 -q ../seq/gstm1_human.vaa $FA_DB > results/test2V_m1.ggm9Cz21 | |
27 | 30 | echo "done" `date` |