include scripts/test.sh
Metadata-Version: 1.2
Name: pauvre
Version: 0.1924
Summary: Tools for plotting Oxford Nanopore and other long-read data.
Home-page: https://github.com/conchoecia/pauvre
Author: Darrin Schultz
Author-email: dts@ucsc.edu
License: GPLv3
Description:
        'pauvre' is a package for plotting Oxford Nanopore and other long-read data.
        The name means 'poor' in French, a play on words on the oft-used 'pore' prefix
        of similar packages. This package was designed for Python 3, but it might work
        in Python 2. You can visit the GitHub page for more detailed information:
        https://github.com/conchoecia/pauvre

Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Operating System :: POSIX :: Linux
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Intended Audience :: Science/Research
Requires: python (>3.0)
Provides: pauvre
Requires-Python: >=3
[](https://travis-ci.org/conchoecia/pauvre) [](https://zenodo.org/badge/latestdoi/112774670)


## pauvre: a plotting package designed for nanopore and PacBio long reads

This package currently hosts four scripts for plotting and/or printing stats.

- `pauvre marginplot`
  - Takes a fastq file as input and outputs a marginal histogram with a heatmap.
- `pauvre stats`
  - Takes a fastq file as input and prints out a table of stats, including how many
    basepairs/reads there are for a length/mean-quality cutoff.
  - This is also automagically called when using `pauvre marginplot`.
- `pauvre redwood`
  - I am happy to introduce the redwood plot to the world as a method
    of representing circular genomes. A redwood plot contains long
    reads as "rings" on the inside, a gene annotation
    "cambium/phloem", and an RNAseq "bark". The input is `.bam` files
    for the long reads and RNAseq data, and a `.gff` file for the
    annotation. More details to follow as we document this program
    better...
- `pauvre synteny`
  - Makes a synteny plot of circular genomes. Finds the most
    parsimonious rotation to display the synteny of all the input
    genomes with the fewest crossings. Input is one `.gff` file
    per circular genome and one directory of gene alignments.

## Updates:
- 20200215 - v0.1.924 - Made some minor updates to work with python 3.7 and the latest version of pandas.
- 20171130 - v0.1.86 - Some changes by @wdecoster to integrate `pauvre` into [nanoplot](https://github.com/wdecoster/NanoPlot),
  as well as some formatting changes that *may* make `pauvre` work better with python2.7. Added Travis-CI functionality.
- 20171025 - v0.1.83 - Added some changes to make marginplot interface
  with @wdecoster's [nanoPlot](https://github.com/wdecoster/NanoPlot)
  package, and made `pauvre stats` only output data tables for
  filtered reads. `pauvre stats` also now has the `--filt_maxlen`,
  `--filt_maxqual`, `--filt_minlen`, and `--filt_minqual` options.
- 20171018 - v0.1.8 - You can now filter reads and adjust the plotting viewing window.
  [See below for a demonstration.](#filter-reads-and-adjust-viewing-window) I added the following options:

```
--filt_maxlen FILT_MAXLEN
                      This sets the max read length to filter reads.
--filt_maxqual FILT_MAXQUAL
                      This sets the max mean read quality to filter reads.
--filt_minlen FILT_MINLEN
                      This sets the min read length to filter reads.
--filt_minqual FILT_MINQUAL
                      This sets the min mean read quality to filter reads.
--plot_maxlen PLOT_MAXLEN
                      Sets the maximum viewing area in the length dimension.
--plot_maxqual PLOT_MAXQUAL
                      Sets the maximum viewing area in the quality dimension.
--plot_minlen PLOT_MINLEN
                      Sets the minimum viewing area in the length dimension.
--plot_minqual PLOT_MINQUAL
                      Sets the minimum viewing area in the quality dimension.
```
- 20171014 - Uploaded information on `pauvre redwood` and `pauvre synteny` usage.
- 20171012 - Made `pauvre stats` more consistently produce useful histograms.
  `pauvre stats` now also calculates some statistics for different size ranges.
- 20170529 - Added automatic scaling to the input fastq file. It
  scales to show the highest read quality and the top 99th percentile
  of reads by length.

# Requirements

- You must have the following installed on your system to install this software:
  - python 3.x
  - matplotlib
  - biopython
  - pandas
  - pillow

# Installation

- Instructions to install on your Mac or Linux system. Not sure about
  Windows! Make sure *python 3* is the active environment before
  installing.
  - `git clone https://github.com/conchoecia/pauvre.git`
  - `cd ./pauvre`
  - `pip3 install .`
- Or, install with pip:
  - `pip3 install pauvre`

# Usage

## `stats`
- Generates basic statistics about the fastq file. For example, if I
  want to know the number of bases and reads with AT LEAST a PHRED
  score of 5 and AT LEAST a read length of 500, I run the program as below
  and look at the cells highlighted with `<braces>`.
  - `pauvre stats --fastq miniDSMN15.fastq`


```
numReads: 1000
numBasepairs: 1029114
meanLen: 1029.114
medianLen: 875.5
minLen: 11
maxLen: 5337
N50: 1278
L50: 296

                   Basepairs >= bin by mean PHRED and length
minLen       Q0        Q5     Q10    Q15   Q17.5    Q20  Q21.5   Q25  Q25.5  Q30
     0  1029114   1010681  935366 429279  143948  25139   3668  2938   2000    0
   500   984212  <968653>  904787 421307  142003  24417   3668  2938   2000    0
  1000   659842    649319  616788 300948  103122  17251   2000  2000   2000    0
et cetera...
                   Number of reads >= bin by mean Phred+Len
minLen    Q0     Q5  Q10  Q15  Q17.5  Q20  Q21.5  Q25  Q25.5  Q30
     0  1000    969  865  366    118   22      3    2      1    0
   500   873  <859>  789  347    113   20      3    2      1    0
  1000   424    418  396  187     62   11      1    1      1    0
et cetera...
```
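The N50 and L50 in the table above can be cross-checked by hand. A minimal sketch of the standard calculation (illustrative only, not pauvre's actual code):

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of read lengths.

    N50 is the length L such that reads of length >= L contain at least
    half of all basepairs; L50 is the number of such reads.
    """
    total = sum(lengths)
    running = 0
    # walk reads from longest to shortest until half the bases are covered
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 2 >= total:
            return length, count

print(n50_l50([2, 2, 2, 3, 3, 4, 8, 8]))  # (8, 2): the two longest reads hold half of 32 bp
```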

## `marginplot`

### Basic usage
- Automatically calls `pauvre stats` for each fastq file.
- Make the default plot showing the 99th percentile of longest reads:
  - `pauvre marginplot --fastq miniDSMN15.fastq`
  - ![default](files/default_miniDSMN15.png)
- Make a marginal histogram for ONT 2D or 1D^2 cDNA data with a
  lower maxlen and higher maxqual:
  - `pauvre marginplot --maxlen 4000 --maxqual 25 --lengthbin 50 --fileform pdf png --qualbin 0.5 --fastq miniDSMN15.fastq`
  - ![2d hist](files/miniDSMN15.png)

### Filter reads and adjust viewing window
- Filter out reads with a mean quality less than 5, and a length
  less than 800. Zoom in to plot only mean quality of at least 4 and
  read length of at least 500 bp:
  - `pauvre marginplot -f miniDSMN15.fastq --filt_minqual 5 --filt_minlen 800 -y --plot_minlen 500 --plot_minqual 4`
  - ![filtered plot](files/filt_miniDSMN15.png)

### Specialized Options

- Plot ONT 1D data with a large tail:
  - `pauvre marginplot --maxlen 100000 --maxqual 15 --lengthbin 500 <myfile>.fastq`
- Get more resolution on lengths:
  - `pauvre marginplot --maxlen 100000 --lengthbin 5 <myfile>.fastq`

### Transparency

- Turn off transparency if you just want a white background:
  - `pauvre marginplot --transparent False <myfile>.fastq`
  - Note: transparency is the default behavior.
  - ![transparency](files/transparency.png)

# Contributors

@conchoecia (Darrin Schultz)
@mebbert (Mark Ebbert)
@wdecoster (Wouter De Coster)
from pauvre.version import __version__
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# pauvre - just a pore plotting package
# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
# twitter @conchoecia
#
# This file is part of pauvre.
#
# pauvre is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# pauvre is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
import pysam
import pandas as pd
import os

class BAMParse():
    """This class reads in a sam/bam file and constructs a pandas
    dataframe of all the relevant information for the reads to pass on
    and plot.
    """
    def __init__(self, filename, chrid=None, start=None,
                 stop=None, doubled=None):
        self.filename = filename
        self.doubled = doubled
        # determine if the file is bam or sam
        self.filetype = os.path.splitext(self.filename)[1]
        # throw an error if the file is not bam
        if self.filetype not in ['.bam']:
            raise Exception("""You have provided a file with an extension other than
                            '.bam', please check your command-line arguments""")
        # now make sure there is an index file for the bam file
        if not os.path.exists("{}.bai".format(self.filename)):
            raise Exception("""Your .bam file is there, but it isn't indexed and
                            there isn't a .bai file to go with it. Use
                            'samtools index <yourfile>.bam' to fix it.""")
        # now open the file and just call it a sambam file
        filetype_dict = {'.sam': '', '.bam': 'b'}
        self.sambam = pysam.AlignmentFile(self.filename, "r{}".format(filetype_dict[self.filetype]))
        if chrid is None:
            self.chrid = self.sambam.references[0]
        else:
            self.chrid = chrid
        self.refindex = self.sambam.references.index(self.chrid)
        self.seqlength = self.sambam.lengths[self.refindex]
        self.true_seqlength = self.seqlength if not self.doubled else int(self.seqlength / 2)
        if start is None or stop is None:
            self.start = 1
            self.stop = self.true_seqlength
        else:
            # without this branch, user-supplied coordinates were silently dropped
            self.start = start
            self.stop = stop

        self.features = self.parse()
        self.features.sort_values(by=['POS', 'MAPLEN'], ascending=[True, False], inplace=True)
        self.features.reset_index(drop=True, inplace=True)

        self.raw_depthmap = self.get_depthmap()
        self.features_depthmap = self.get_features_depthmap()

    def get_depthmap(self):
        depthmap = [0] * (self.stop - self.start + 1)
        for p in self.sambam.pileup(self.chrid, self.start, self.stop):
            index = p.reference_pos
            if index >= self.true_seqlength:
                index -= self.true_seqlength
            depthmap[index] += p.nsegments
        return depthmap

    def get_features_depthmap(self):
        """this method builds a more accurate pileup that is
        based on if there is actually a mapped base at any
        given position or not. better for long reads and RNA"""
        depthmap = [0] * (self.stop - self.start + 1)
        print("depthmap is: {} long".format(len(depthmap)))
        for index, row in self.features.iterrows():
            thisindex = row["POS"] - self.start
            for thistup in row["TUPS"]:
                b_type = thistup[1]
                b_len = thistup[0]
                if b_type == "M":
                    for j in range(b_len):
                        # this is necessary to reset the index if we wrap
                        #  around to the beginning
                        if self.doubled and thisindex == len(depthmap):
                            thisindex = 0
                        depthmap[thisindex] += 1
                        thisindex += 1
                elif b_type in ["S", "H", "I"]:
                    pass
                elif b_type in ["D", "N"]:
                    thisindex += b_len
                # this is necessary to reset the index if we wrap
                #  around to the beginning
                if self.doubled and thisindex >= len(depthmap):
                    thisindex = thisindex - len(depthmap)

        return depthmap

    def parse(self):
        data = {'POS': [], 'MAPQ': [], 'TUPS': []}
        for read in self.sambam.fetch(self.chrid, self.start, self.stop):
            data['POS'].append(read.reference_start + 1)
            data['TUPS'].append(self.cigar_parse(read.cigartuples))
            data['MAPQ'].append(read.mapq)
        features = pd.DataFrame.from_dict(data, orient='columns')
        features['ALNLEN'] = features['TUPS'].apply(self.aln_len)
        features['TRULEN'] = features['TUPS'].apply(self.tru_len)
        features['MAPLEN'] = features['TUPS'].apply(self.map_len)
        features['POS'] = features['POS'].apply(self.fix_pos)
        return features

    def cigar_parse(self, tuples):
        """
        arguments:
         <tuples> a CIGAR string tuple list in pysam format

        purpose:
         This function uses the pysam cigarstring tuples format and returns
         a list of tuples in the internal format, [(20, 'M'), (5, "I")], et
         cetera. The zeroth element of each tuple is the number of bases for the
         CIGAR string feature. The first element of each tuple is the CIGAR
         string feature type.

         There are several feature types in SAM/BAM files. See below:
          'M' - match
          'I' - insertion relative to reference
          'D' - deletion relative to reference
          'N' - skipped region from the reference
          'S' - soft clip, not aligned but still in sam file
          'H' - hard clip, not aligned and not in sam file
          'P' - padding (silent deletion from padded reference)
          '=' - sequence match
          'X' - sequence mismatch
          'B' - BAM_CBACK (I don't actually know what this is)
        """
        # I used the map values from http://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment
        psam_to_char = {0: 'M', 1: 'I', 2: 'D', 3: 'N', 4: 'S',
                        5: 'H', 6: 'P', 7: '=', 8: 'X', 9: 'B'}
        return [(value, psam_to_char[feature]) for feature, value in tuples]

    def aln_len(self, TUPS):
        """
        arguments:
         <TUPS> a list of tuples output from the cigar_parse() function.

        purpose:
         This returns the alignment length of the read to the reference.
         Specifically, it sums the length of all of the matches and deletions.
         In effect, this number is the length of the region of the reference
         sequence to which the read maps. This number is probably the most useful
         for selecting reads to visualize in the mapped read plot.
        """
        return sum([pair[0] for pair in TUPS if pair[1] not in ['S', 'H', 'I']])

    def map_len(self, TUPS):
        """
        arguments:
         <TUPS> a list of tuples output from the cigar_parse() function.

        purpose:
         This function returns the map length (all matches and deletions relative to
         the reference), plus the unmapped 5' and 3' hard/soft clipped sequences.
         This number is useful if you want to visualize how much 5' and 3' sequence
         of a read did not map to the reference. For example, poor quality 5' and 3'
         tails are common in Nanopore reads.
        """
        return sum([pair[0] for pair in TUPS if pair[1] not in ['I']])

    def tru_len(self, TUPS):
        """
        arguments:
         <TUPS> a list of tuples output from the cigar_parse() function.

        purpose:
         This function returns the total length of the read, including insertions,
         deletions, matches, soft clips, and hard clips. This is useful for
         comparing to the map length or alignment length to see what percentage of
         the read aligned to the reference.
        """
        return sum([pair[0] for pair in TUPS])

    def fix_pos(self, start_index):
        """
        arguments:
         an int

        purpose:
         When using a doubled SAMfile, any reads that start after the first copy
         of the reference risk running over the plotting window, causing the program
         to crash. This function corrects for this issue by changing the start site
         of the read.

         Note: this will probably break the program if not using a doubled alignment,
         since no reads would map past half the length of the single reference.
        """
        if self.doubled:
            if start_index > int(self.seqlength / 2):
                return start_index - int(self.seqlength / 2) - 1
            else:
                return start_index
        else:
            return start_index
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# pauvre - a pore plotting package
# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
#
# This file is part of pauvre.
#
# pauvre is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# pauvre is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with pauvre. If not, see <http://www.gnu.org/licenses/>.

# following this tutorial to install helvetica
# https://github.com/olgabot/sciencemeetproductivity.tumblr.com/blob/master/posts/2012/11/how-to-set-helvetica-as-the-default-sans-serif-font-in.md
global hfont
hfont = {'fontname': 'Helvetica'}

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap, Normalize
import matplotlib.patches as patches


import gffutils
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
import numpy as np
import os
import pauvre.rcparams as rc
from pauvre.functions import GFFParse, print_images, timestamp
from pauvre import gfftools
from pauvre.lsi.lsi import intersection
from pauvre.bamparse import BAMParse
import progressbar
import platform
import sys
import time

# Biopython stuff
from Bio import SeqIO
import Bio.SubsMat.MatrixInfo as MI

class PlotCommand:
    def __init__(self, plotcmd, REF):
        self.ref = REF
        self.style_choices = []
        self.cmdtype = ""
        self.path = ""
        self.style = ""
        self.options = ""
        self._parse_cmd(plotcmd)

    def _parse_cmd(self, plotcmd):
        chunks = plotcmd.split(":")
        if chunks[0] == "ref":
            self.cmdtype = "ref"
            if len(chunks) < 2:
                self._len_error()
            self.path = self.ref
            self.style = chunks[1]
            self.style_choices = ["normal", "colorful"]
            self._check_style_choices()
            if len(chunks) > 2:
                self.options = chunks[2].split(",")
        elif chunks[0] in ["bam", "peptides"]:
            if len(chunks) < 3:
                self._len_error()
            self.cmdtype = chunks[0]
            self.path = os.path.abspath(os.path.expanduser(chunks[1]))
            self.style = chunks[2]
            if self.cmdtype == "bam":
                self.style_choices = ["depth", "reads"]
            else:
                self.style_choices = ["depth"]
            self._check_style_choices()
            if len(chunks) > 3:
                self.options = chunks[3].split(",")
        elif chunks[0] in ["gff3"]:
            if len(chunks) < 2:
                self._len_error()
            self.cmdtype = chunks[0]
            self.path = os.path.abspath(os.path.expanduser(chunks[1]))
            if len(chunks) > 2:
                self.options = chunks[2].split(",")

    def _len_error(self):
        raise IOError("""You selected {} to plot,
                      but need to specify the style at least.""".format(self.cmdtype))

    def _check_style_choices(self):
        if self.style not in self.style_choices:
            raise IOError("""You selected the {} style for
                          {}. You must select from {}.""".format(
                          self.style, self.cmdtype, self.style_choices))

global dna_color
dna_color = {"A": (81/255, 87/255, 251/255, 1),
             "T": (230/255, 228/255, 49/255, 1),
             "G": (28/255, 190/255, 32/255, 1),
             "C": (220/255, 10/255, 23/255, 1)}

# these are the line widths for the different cigar string flags.
#  usually, only M, I, D, S, and H appear in bwa mem output
global widthDict
widthDict = {'M': 0.45,  # match
             'I': 0.9,   # insertion relative to reference
             'D': 0.05,  # deletion relative to reference
             'N': 0.1,   # skipped region from the reference
             'S': 0.1,   # soft clip, not aligned but still in sam file
             'H': 0.1,   # hard clip, not aligned and not in sam file
             'P': 0.1,   # padding (silent deletion from padded reference)
             '=': 0.1,   # sequence match
             'X': 0.1}   # sequence mismatch


global richgrey
richgrey = (60/255, 54/255, 69/255, 1)
def plot_ref(panel, chrid, start, stop, thiscmd):
    panel.set_xlim([start, stop])
    panel.set_ylim([-2.5, 2.5])
    panel.set_xticks([int(val) for val in np.linspace(start, stop, 6)])
    if thiscmd.style == "colorful":
        thisseq = ""
        for record in SeqIO.parse(thiscmd.ref, "fasta"):
            if record.id == chrid:
                thisseq = record.seq[start - 1: stop]
        for i in range(len(thisseq)):
            left = start + i
            bottom = -0.5
            width = 1
            height = 1
            rect = patches.Rectangle((left, bottom),
                                     width, height,
                                     linewidth=0,
                                     facecolor=dna_color[thisseq[i]])
            panel.add_patch(rect)
    return panel

def safe_log10(value):
    # np.log10(0) returns -inf with a warning rather than raising,
    #  so guard against non-positive values explicitly
    if value <= 0:
        return 0
    return np.log10(value)

def plot_bam(panel, chrid, start, stop, thiscmd):
    bam = BAMParse(thiscmd.path)
    panel.set_xlim([start, stop])
    if thiscmd.style == "depth":
        maxdepth = max(bam.features_depthmap)
        maxdepthlog = safe_log10(maxdepth)
        if "log" in thiscmd.options:
            panel.set_ylim([-maxdepthlog, maxdepthlog])
            panel.set_yticks([int(val) for val in np.linspace(0, maxdepthlog, 2)])
        else:
            panel.set_yticks([int(val) for val in np.linspace(0, maxdepth, 2)])
            if "c" in thiscmd.options:
                panel.set_ylim([-maxdepth, maxdepth])
            else:
                panel.set_ylim([0, maxdepth])

        for i in range(len(bam.features_depthmap)):
            left = start + i
            width = 1
            if "c" in thiscmd.options and "log" in thiscmd.options:
                bottom = -1 * safe_log10(bam.features_depthmap[i])
                height = safe_log10(bam.features_depthmap[i]) * 2
            elif "c" in thiscmd.options and "log" not in thiscmd.options:
                bottom = -bam.features_depthmap[i]
                height = bam.features_depthmap[i] * 2
            else:
                bottom = 0
                height = bam.features_depthmap[i]
            if height > 0:
                rect = patches.Rectangle((left, bottom),
                                         width, height,
                                         linewidth=0,
                                         facecolor=richgrey)
                panel.add_patch(rect)

    if thiscmd.style == "reads":
        # if we're plotting reads, we don't need a y-axis
        panel.tick_params(bottom="off", labelbottom="off",
                          left="off", labelleft="off")
        reads = bam.features.copy()
        panel.set_xlim([start, stop])
        # sort reads by start position, plotting longer-mapping reads first
        reads.sort_values(by=['POS', 'MAPLEN'], ascending=[True, False], inplace=True)
        reads.reset_index(drop=True, inplace=True)

        depth_count = -1
        plotind = start
        while len(reads) > 0:
            potential = reads.query("POS >= {}".format(plotind))
            if len(potential) == 0:
                # no remaining read starts after the current position;
                #  wrap around and start a new row of reads
                readsindex = 0
                depth_count -= 1
            else:
                readsindex = int(potential.index.values[0])
            plotind = reads.loc[readsindex, "POS"]

            for TUP in reads.loc[readsindex, "TUPS"]:
                b_type = TUP[1]
                b_len = TUP[0]
                # plotting params
                #  left is the same for all.
                left = plotind
                bottom = depth_count
                height = widthDict[b_type]
                width = b_len
                plot = True
                color = richgrey
                if b_type in ["H", "S"]:
                    """We don't plot hard or soft clips - like IGV"""
                    plot = False
                elif b_type == "M":
                    """just plot matches normally"""
                    plotind += b_len
                elif b_type in ["D", "P", "=", "X"]:
                    """deletions get an especially thin line"""
                    plotind += b_len
                elif b_type == "I":
                    """insertions get a special purple bar"""
                    left = plotind - (b_len / 2)
                    color = (200/255, 41/255, 226/255, 0.5)
                elif b_type == "N":
                    """skips for splice junctions, line in middle"""
                    bottom += (widthDict["M"] / 2) - (widthDict["N"] / 2)
                    plotind += b_len
                if plot:
                    rect = patches.Rectangle((left, bottom),
                                             width, height,
                                             linewidth=0,
                                             facecolor=color)
                    panel.add_patch(rect)
            reads.drop([readsindex], inplace=True)
            reads.reset_index(drop=True, inplace=True)
        panel.set_ylim([depth_count, 0])

    return panel

def plot_gff3(panel, chrid, start, stop, thiscmd):

    db = gffutils.create_db(thiscmd.path, ":memory:")
    bottom = 0
    genes_to_plot = [thing.id
                     for thing in db.region(
                         region=(chrid, start, stop),
                         completely_within=False)
                     if thing.featuretype == "gene"]
    #print("genes to plot are:", genes_to_plot)
    panel.set_xlim([start, stop])
    # we don't need labels on one of the axes
    #panel.tick_params(bottom="off", labelbottom="off",
    #                  left="off", labelleft="off")

    ticklabels = []
    for geneid in genes_to_plot:
        plotnow = False
        if "id" in thiscmd.options and geneid in thiscmd.options:
            plotnow = True
        elif "id" not in thiscmd.options:
            plotnow = True
        if plotnow:
            ticklabels.append(geneid)
            if db[geneid].strand == "+":
                panel = gfftools._plot_left_to_right_introns_top(panel, geneid, db,
                                                                 bottom, text=None)
                bottom += 1
            else:
                raise IOError("""Plotting things on the reverse strand is
                              not yet implemented""")
    #print("tick labels are", ticklabels)
    panel.set_ylim([0, len(ticklabels)])
    yticks_vals = [val for val in np.linspace(0.5, len(ticklabels) - 0.5, len(ticklabels))]
    panel.set_yticks(yticks_vals)
    print("bottom is: ", bottom)
    print("len tick labels is: ", len(ticklabels))
    print("intervals are: ", yticks_vals)
    panel.set_yticklabels(ticklabels)

    return panel

312 | def browser(args): | |
313 | rc.update_rcParams() | |
314 | print(args) | |
315 | ||
316 | # if the user forgot to add a reference, they must add one | |
317 | if args.REF is None: | |
318 | raise IOError("You must specify the reference fasta file") | |
319 | ||
320 | # if the user forgot to add the start and stop, | |
321 | # Print the id and the start/stop | |
322 | if args.CHR is None or args.START is None or args.STOP is None: | |
323 | print("""\n You have forgotten to specify the chromosome, | |
324 | the start coordinate, or the stop coordinate to plot. | |
325 | Try something like '-c chr1 --start 20 --stop 2000'. | |
326 | Here is a list of chromosome ids and their lengths | |
327 | from the provided reference. The minimum start coordinate | |
328 | is one and the maximum stop coordinate is the length of | |
329 | the chromosome.\n\nID\tLength""") | |
330 | for record in SeqIO.parse(args.REF, "fasta"): | |
331 | print("{}\t{}".format(record.id, len(record.seq))) | |
332 | sys.exit(0) | |
333 | ||
334 | if args.CMD is None: | |
335 | raise IOError("You must specify a plotting command.") | |
336 | ||
337 | # now we parse each set of commands | |
338 | commands = [PlotCommand(thiscmd, args.REF) | |
339 | for thiscmd in reversed(args.CMD)] | |
340 | ||
341 | # set the figure dimensions | |
342 | if args.ratio: | |
343 | figWidth = args.ratio[0] + 1 | |
344 | figHeight = args.ratio[1] + 1 | |
345 | #set the panel dimensions | |
346 | panelWidth = args.ratio[0] | |
347 | panelHeight = args.ratio[1] | |
348 | ||
349 | else: | |
350 | figWidth = 7 | |
351 | figHeight = len(commands) + 2 | |
352 | #set the panel dimensions | |
353 | panelWidth = 5 | |
354 | # panel margin x 2 + panel height = total vertical height | |
355 | panelHeight = 0.8 | |
356 | panelMargin = 0.1 | |
357 | ||
358 | figure = plt.figure(figsize=(figWidth,figHeight)) | |
359 | ||
360 | #find the margins to center the panel in figure | |
361 | leftMargin = (figWidth - panelWidth)/2 | |
362 | bottomMargin = ((figHeight - panelHeight)/2) + panelMargin | |
363 | ||
364 | plot_dict = {"ref": plot_ref, | |
365 | "bam": plot_bam, | |
366 | "gff3": plot_gff3 | |
367 | #"peptides": plot_peptides | |
368 | } | |
369 | ||
370 | panels = [] | |
371 | for i in range(len(commands)): | |
372 | thiscmd = commands[i] | |
373 | if thiscmd.cmdtype in ["gff3", "ref", "peptides"] \ | |
374 | or thiscmd.style == "depth" \ | |
375 | or "narrow" in thiscmd.options: | |
376 | temp_panelHeight = 0.5 | |
377 | else: | |
378 | temp_panelHeight = panelHeight | |
379 | panels.append( plt.axes([leftMargin/figWidth, #left | |
380 | bottomMargin/figHeight, #bottom | |
381 | panelWidth/figWidth, #width | |
382 | temp_panelHeight/figHeight]) #height | |
383 | ) | |
384 | panels[i].tick_params(axis='both',which='both',\ | |
385 | bottom='off', labelbottom='off',\ | |
386 | left='on', labelleft='on', \ | |
387 | right='off', labelright='off',\ | |
388 | top='off', labeltop='off') | |
389 | if thiscmd.cmdtype == "ref": | |
390 | panels[i].tick_params(bottom='on', labelbottom='on') | |
391 | ||
392 | ||
393 | ||
394 | #turn off some of the axes | |
395 | panels[i].spines["top"].set_visible(False) | |
396 | panels[i].spines["bottom"].set_visible(False) | |
397 | panels[i].spines["right"].set_visible(False) | |
398 | panels[i].spines["left"].set_visible(False) | |
399 | ||
400 | panels[i] = plot_dict[thiscmd.cmdtype](panels[i], args.CHR, | |
401 | args.START, args.STOP, | |
402 | thiscmd) | |
403 | ||
404 | bottomMargin = bottomMargin + temp_panelHeight + (2 * panelMargin) | |
405 | ||
406 | # Print image(s) | |
407 | if args.BASENAME is None: | |
408 | file_base = "browser" | |
409 | else: | |
410 | file_base = args.BASENAME | |
411 | path = None | |
412 | if args.path: | |
413 | path = args.path | |
414 | transparent = args.transparent | |
415 | print_images( | |
416 | base=file_base, | |
417 | image_formats=args.fileform, | |
418 | dpi=args.dpi, | |
419 | no_timestamp=args.no_timestamp, | |
420 | path=path, | |
421 | transparent=transparent) | |
422 | ||
423 | ||
424 | def run(args): | |
425 | browser(args) |
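The panel-centering arithmetic in browser() above computes margins in inches and then divides by the figure dimensions to get the [0, 1] figure fractions that plt.axes() expects. A minimal sketch of that math, using the default widths from the non-ratio branch (figHeight is hypothetical here, since it really depends on len(commands)):

```python
# Sketch of the panel-centering math: inch-based margins become figure
# fractions once divided by the figure dimensions.
figWidth = 7          # default figure width in inches
panelWidth = 5        # default panel width in inches
panelHeight = 0.8
panelMargin = 0.1
figHeight = 4         # hypothetical; normally len(commands) + 2

leftMargin = (figWidth - panelWidth) / 2                    # 1.0 inch per side
bottomMargin = ((figHeight - panelHeight) / 2) + panelMargin

# the rectangle handed to plt.axes(), all in figure fractions
rect = [leftMargin / figWidth, bottomMargin / figHeight,
        panelWidth / figWidth, panelHeight / figHeight]
```

Because everything is expressed as a fraction of the figure, the panel stays centered horizontally no matter what DPI or output format is chosen later.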
0 | #!/usr/bin/env python | |
1 | # -*- coding: utf-8 -*- | |
2 | ||
3 | # pauvre - just a pore PhD student's plotting package | |
4 | # Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved. | |
5 | # | |
6 | # This file is part of pauvre. | |
7 | # | |
8 | # pauvre is free software: you can redistribute it and/or modify | |
9 | # it under the terms of the GNU General Public License as published by | |
10 | # the Free Software Foundation, either version 3 of the License, or | |
11 | # (at your option) any later version. | |
12 | # | |
13 | # pauvre is distributed in the hope that it will be useful, | |
14 | # but WITHOUT ANY WARRANTY; without even the implied warranty of | |
15 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
16 | # GNU General Public License for more details. | |
17 | # | |
18 | # You should have received a copy of the GNU General Public License | |
19 | # along with pauvre. If not, see <http://www.gnu.org/licenses/>. | |
20 | ||
21 | import ast | |
22 | import matplotlib | |
23 | matplotlib.use('Agg') | |
24 | import matplotlib.pyplot as plt | |
25 | import matplotlib.patches as mplpatches | |
26 | from matplotlib.colors import LinearSegmentedColormap | |
27 | import numpy as np | |
28 | import pandas as pd | |
29 | import os.path as opath | |
30 | from sys import stderr | |
31 | from pauvre.functions import print_images | |
32 | from pauvre.stats import stats | |
33 | import pauvre.rcparams as rc | |
34 | import sys | |
35 | import logging | |
36 | ||
37 | # logging | |
38 | logger = logging.getLogger('pauvre') | |
39 | ||
40 | ||
41 | def generate_panel(panel_left, panel_bottom, panel_width, panel_height, | |
42 | axis_tick_param='both', which_tick_param='both', | |
43 | bottom_tick_param='on', label_bottom_tick_param='on', | |
44 | left_tick_param='on', label_left_tick_param='on', | |
45 | right_tick_param='off', label_right_tick_param='off', | |
46 | top_tick_param='off', label_top_tick_param='off'): | |
47 | """ | |
48 | Set up a panel with default tick parameters. Some of these are | |
49 | matplotlib's defaults anyway, but we specify them explicitly for | |
50 | readability. Here are the options and defaults for the parameters | |
51 | used below: | |
51 | ||
52 | axis : {'x', 'y', 'both'}; which axis to modify; default = 'both' | |
53 | which : {'major', 'minor', 'both'}; which ticks to modify; | |
54 | default = 'major' | |
55 | bottom, top, left, right : bool or {'on', 'off'}; ticks on or off; | |
56 | labelbottom, labeltop, labelleft, labelright : bool or {'on', 'off'} | |
57 | """ | |
58 | ||
59 | # create the panel | |
60 | panel_rectangle = [panel_left, panel_bottom, panel_width, panel_height] | |
61 | panel = plt.axes(panel_rectangle) | |
62 | ||
63 | # Set tick parameters | |
64 | panel.tick_params(axis=axis_tick_param, | |
65 | which=which_tick_param, | |
66 | bottom=bottom_tick_param, | |
67 | labelbottom=label_bottom_tick_param, | |
68 | left=left_tick_param, | |
69 | labelleft=label_left_tick_param, | |
70 | right=right_tick_param, | |
71 | labelright=label_right_tick_param, | |
72 | top=top_tick_param, | |
73 | labeltop=label_top_tick_param) | |
74 | ||
75 | return panel | |
76 | ||
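A quick sanity check of what generate_panel does at its core, sketched without the tick-parameter plumbing: plt.axes() takes a [left, bottom, width, height] rectangle in figure fractions and reports the same rectangle back through get_position():

```python
# Minimal sketch of generate_panel's axes creation (tick handling omitted).
import matplotlib
matplotlib.use('Agg')          # headless backend, as at the top of this module
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(4, 3))
panel = plt.axes([0.15, 0.15, 0.7, 0.7])   # left, bottom, width, height
bounds = panel.get_position().bounds        # same four figure fractions back
```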
77 | ||
78 | def _generate_histogram_bin_patches(panel, bins, bin_values, horizontal=True): | |
79 | """This helper method generates the histogram that is added to the panel. | |
80 | ||
81 | In this case, horizontal = True applies to the mean quality histogram. | |
82 | So, horizontal = False only applies to the length histogram. | |
83 | """ | |
84 | l_width = 0.0 | |
85 | f_color = (0.5, 0.5, 0.5) | |
86 | e_color = (0, 0, 0) | |
87 | if horizontal: | |
88 | for step in np.arange(0, len(bin_values), 1): | |
89 | left = bins[step] | |
90 | bottom = 0 | |
91 | width = bins[step + 1] - bins[step] | |
92 | height = bin_values[step] | |
93 | hist_rectangle = mplpatches.Rectangle((left, bottom), width, height, | |
94 | linewidth=l_width, | |
95 | facecolor=f_color, | |
96 | edgecolor=e_color) | |
97 | panel.add_patch(hist_rectangle) | |
98 | else: | |
99 | for step in np.arange(0, len(bin_values), 1): | |
100 | left = 0 | |
101 | bottom = bins[step] | |
102 | width = bin_values[step] | |
103 | height = bins[step + 1] - bins[step] | |
104 | ||
105 | hist_rectangle = mplpatches.Rectangle((left, bottom), width, height, | |
106 | linewidth=l_width, | |
107 | facecolor=f_color, | |
108 | edgecolor=e_color) | |
109 | panel.add_patch(hist_rectangle) | |
110 | ||
111 | ||
112 | def generate_histogram(panel, data_list, min_plot_val, max_plot_val, | |
113 | bin_interval, hist_horizontal=True, | |
114 | left_spine=True, bottom_spine=True, | |
115 | top_spine=False, right_spine=False, x_label=None, | |
116 | y_label=None): | |
117 | ||
118 | bins = np.arange(0, max_plot_val, bin_interval) | |
119 | ||
120 | bin_values, bins2 = np.histogram(data_list, bins) | |
121 | ||
122 | # hist_horizontal is used for quality | |
123 | if hist_horizontal: | |
124 | panel.set_xlim([min_plot_val, max_plot_val]) | |
125 | panel.set_ylim([0, max(bin_values * 1.1)]) | |
126 | # and hist_horizontal == False is for read length | |
127 | else: | |
128 | panel.set_xlim([0, max(bin_values * 1.1)]) | |
129 | panel.set_ylim([min_plot_val, max_plot_val]) | |
130 | ||
131 | # Generate histogram bin patches, depending on whether we're plotting | |
132 | # vertically or horizontally | |
133 | _generate_histogram_bin_patches(panel, bins, bin_values, hist_horizontal) | |
134 | ||
135 | panel.spines['left'].set_visible(left_spine) | |
136 | panel.spines['bottom'].set_visible(bottom_spine) | |
137 | panel.spines['top'].set_visible(top_spine) | |
138 | panel.spines['right'].set_visible(right_spine) | |
139 | ||
140 | if y_label is not None: | |
141 | panel.set_ylabel(y_label) | |
142 | if x_label is not None: | |
143 | panel.set_xlabel(x_label) | |
144 | ||
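The binning in generate_histogram can be sketched in isolation. One subtlety worth noting: np.arange(0, max_plot_val, bin_interval) excludes max_plot_val itself, so values at or beyond the last edge fall outside every bin. The numbers below are hypothetical:

```python
import numpy as np

lengths = [100, 250, 300, 450, 500, 900]    # hypothetical read lengths
bins = np.arange(0, 1000, 250)              # edges: 0, 250, 500, 750
bin_values, bin_edges = np.histogram(lengths, bins)
# n edges give n - 1 bins; only the last bin is closed on the right,
# and 900 lies past the final edge, so it is never counted
```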
145 | def generate_square_map(panel, data_frame, plot_min_y, plot_min_x, | |
146 | plot_max_y, plot_max_x, color, | |
147 | xcol, ycol, **kwargs): | |
148 | """This generates the heatmap panels using squares. Everything is | |
149 | quantized by ints. | |
150 | """ | |
151 | panel.set_xlim([plot_min_x, plot_max_x]) | |
152 | panel.set_ylim([plot_min_y, plot_max_y]) | |
153 | tempdf = data_frame[[xcol, ycol]] | |
154 | data_frame = tempdf.astype(int) | |
155 | ||
156 | querystring = "{}<={} and {}<={}".format(plot_min_y, ycol, plot_min_x, xcol) | |
157 | print(" - Filtering squares with {}".format(querystring)) | |
158 | square_this = data_frame.query(querystring) | |
159 | ||
160 | querystring = "{}<{} and {}<{}".format(ycol, plot_max_y, xcol, plot_max_x) | |
161 | print(" - Filtering squares with {}".format(querystring)) | |
162 | square_this = square_this.query(querystring) | |
163 | ||
164 | counts = square_this.groupby([xcol, ycol]).size().reset_index(name='counts') | |
165 | for index, row in counts.iterrows(): | |
166 | x_pos = row[xcol] | |
167 | y_pos = row[ycol] | |
168 | thiscolor = color(row["counts"]/(counts["counts"].max())) | |
169 | rectangle1 = mplpatches.Rectangle((x_pos, y_pos), 1, 1, | |
170 | linewidth=0, | |
171 | facecolor=thiscolor) | |
172 | panel.add_patch(rectangle1) | |
173 | ||
174 | all_counts = counts["counts"] | |
175 | return all_counts | |
176 | ||
177 | def generate_heat_map(panel, data_frame, plot_min_y, plot_min_x, | |
178 | plot_max_y, plot_max_x, color, | |
179 | xcol, ycol, **kwargs): | |
180 | panel.set_xlim([plot_min_x, plot_max_x]) | |
181 | panel.set_ylim([plot_min_y, plot_max_y]) | |
182 | ||
183 | querystring = "{}<={} and {}<={}".format(plot_min_y, ycol, plot_min_x, xcol) | |
184 | print(" - Filtering hexmap with {}".format(querystring)) | |
185 | hex_this = data_frame.query(querystring) | |
186 | ||
187 | querystring = "{}<{} and {}<{}".format(ycol, plot_max_y, xcol, plot_max_x) | |
188 | print(" - Filtering hexmap with {}".format(querystring)) | |
189 | hex_this = hex_this.query(querystring) | |
190 | ||
191 | # This single line controls plotting the hex bins in the panel | |
192 | hex_vals = panel.hexbin(hex_this[xcol], hex_this[ycol], gridsize=49, | |
193 | linewidths=0.0, cmap=color) | |
194 | for each in panel.spines: | |
195 | panel.spines[each].set_visible(False) | |
196 | ||
197 | counts = hex_vals.get_array() | |
198 | return counts | |
199 | ||
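Both heatmap functions build their pandas query strings by formatting the bounds and column names into plain comparison expressions. A small sketch of how those strings behave, on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"length": [10, 200, 900], "meanQual": [5, 11, 12]})
# lower bounds are inclusive, exactly as in the functions above
querystring = "{}<={} and {}<={}".format(0, "length", 7, "meanQual")
kept = df.query(querystring)        # drops the row with meanQual == 5
```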
200 | def generate_legend(panel, counts, color): | |
201 | # completely custom for more control | |
202 | panel.set_xlim([0, 1]) | |
203 | panel.set_ylim([0, 1000]) | |
204 | panel.set_yticks([int(x) for x in np.linspace(0, 1000, 6)]) | |
205 | panel.set_yticklabels([int(x) for x in np.linspace(0, max(counts), 6)]) | |
206 | for i in np.arange(0, 1001, 1): | |
207 | rgba = color(i / 1001) | |
208 | alpha = rgba[-1] | |
209 | facec = rgba[0:3] | |
210 | hist_rectangle = mplpatches.Rectangle((0, i), 1, 1, | |
211 | linewidth=0.0, | |
212 | facecolor=facec, | |
213 | edgecolor=(0, 0, 0), | |
214 | alpha=alpha) | |
215 | panel.add_patch(hist_rectangle) | |
216 | panel.spines['top'].set_visible(False) | |
217 | panel.spines['left'].set_visible(False) | |
218 | panel.spines['bottom'].set_visible(False) | |
219 | panel.yaxis.set_label_position("right") | |
220 | panel.set_ylabel('count') | |
221 | ||
222 | def custommargin(df, **kwargs): | |
223 | rc.update_rcParams() | |
224 | ||
225 | # 250, 231, 34 light yellow | |
226 | # 67, 1, 85 | |
227 | # R=np.linspace(65/255,1,101) | |
228 | # G=np.linspace(0/255, 231/255, 101) | |
229 | # B=np.linspace(85/255, 34/255, 101) | |
230 | # R=65/255, G=0/255, B=85/255 | |
231 | Rf = 65 / 255 | |
232 | Bf = 85 / 255 | |
233 | pdict = {'red': ((0.0, Rf, Rf), | |
234 | (1.0, Rf, Rf)), | |
235 | 'green': ((0.0, 0.0, 0.0), | |
236 | (1.0, 0.0, 0.0)), | |
237 | 'blue': ((0.0, Bf, Bf), | |
238 | (1.0, Bf, Bf)), | |
239 | 'alpha': ((0.0, 0.0, 0.0), | |
240 | (1.0, 1.0, 1.0)) | |
241 | } | |
242 | # Now we will use this example to illustrate 3 ways of | |
243 | # handling custom colormaps. | |
244 | # First, the most direct and explicit: | |
245 | purple1 = LinearSegmentedColormap('Purple1', pdict) | |
246 | ||
247 | # set the figure dimensions | |
248 | fig_width = 1.61 * 3 | |
249 | fig_height = 1 * 3 | |
250 | fig = plt.figure(figsize=(fig_width, fig_height)) | |
251 | ||
252 | # set the panel dimensions | |
253 | heat_map_panel_width = fig_width * 0.5 | |
254 | heat_map_panel_height = heat_map_panel_width * 0.62 | |
255 | ||
256 | # find the margins to center the panel in figure | |
257 | fig_left_margin = fig_bottom_margin = (1 / 6) | |
258 | ||
259 | # lengthPanel | |
260 | y_panel_width = (1 / 8) | |
261 | ||
262 | # the color Bar parameters | |
263 | legend_panel_width = (1 / 24) | |
264 | ||
265 | # define padding | |
266 | h_padding = 0.02 | |
267 | v_padding = 0.05 | |
268 | ||
269 | # Set whether to include y-axes in histograms | |
270 | print(" - Setting panel options.", file = sys.stderr) | |
271 | if kwargs["Y_AXES"]: | |
272 | y_bottom_spine = True | |
273 | y_bottom_tick = 'on' | |
274 | y_bottom_label = 'on' | |
275 | x_left_spine = True | |
276 | x_left_tick = 'on' | |
277 | x_left_label = 'on' | |
278 | x_y_label = 'Count' | |
279 | else: | |
280 | y_bottom_spine = False | |
281 | y_bottom_tick = 'off' | |
282 | y_bottom_label = 'off' | |
283 | x_left_spine = False | |
284 | x_left_tick = 'off' | |
285 | x_left_label = 'off' | |
286 | x_y_label = None | |
287 | ||
288 | panels = [] | |
289 | ||
290 | # Quality histogram panel | |
291 | print(" - Generating the x-axis panel.", file = sys.stderr) | |
292 | x_panel_left = fig_left_margin + y_panel_width + h_padding | |
293 | x_panel_width = heat_map_panel_width / fig_width | |
294 | x_panel_height = y_panel_width * fig_width / fig_height | |
295 | x_panel = generate_panel(x_panel_left, | |
296 | fig_bottom_margin, | |
297 | x_panel_width, | |
298 | x_panel_height, | |
299 | left_tick_param=x_left_tick, | |
300 | label_left_tick_param=x_left_label) | |
301 | panels.append(x_panel) | |
302 | ||
303 | # y histogram panel | |
304 | print(" - Generating the y-axis panel.", file = sys.stderr) | |
305 | y_panel_bottom = fig_bottom_margin + x_panel_height + v_padding | |
306 | y_panel_height = heat_map_panel_height / fig_height | |
307 | y_panel = generate_panel(fig_left_margin, | |
308 | y_panel_bottom, | |
309 | y_panel_width, | |
310 | y_panel_height, | |
311 | bottom_tick_param=y_bottom_tick, | |
312 | label_bottom_tick_param=y_bottom_label) | |
313 | panels.append(y_panel) | |
314 | ||
315 | # Heat map panel | |
316 | heat_map_panel_left = fig_left_margin + y_panel_width + h_padding | |
317 | heat_map_panel_bottom = fig_bottom_margin + x_panel_height + v_padding | |
318 | print(" - Generating the heat map panel.", file = sys.stderr) | |
319 | heat_map_panel = generate_panel(heat_map_panel_left, | |
320 | heat_map_panel_bottom, | |
321 | heat_map_panel_width / fig_width, | |
322 | heat_map_panel_height / fig_height, | |
323 | bottom_tick_param='off', | |
324 | label_bottom_tick_param='off', | |
325 | left_tick_param='off', | |
326 | label_left_tick_param='off') | |
327 | panels.append(heat_map_panel) | |
328 | heat_map_panel.set_title(kwargs["title"]) | |
329 | ||
330 | # Legend panel | |
331 | print(" - Generating the legend panel.", file = sys.stderr) | |
332 | legend_panel_left = fig_left_margin + y_panel_width + \ | |
333 | heat_map_panel_width / fig_width + h_padding | |
334 | legend_panel_bottom = fig_bottom_margin + x_panel_height + v_padding | |
335 | legend_panel_height = heat_map_panel_height / fig_height | |
336 | legend_panel = generate_panel(legend_panel_left, legend_panel_bottom, | |
337 | legend_panel_width, legend_panel_height, | |
338 | bottom_tick_param='off', | |
339 | label_bottom_tick_param='off', | |
340 | left_tick_param='off', | |
341 | label_left_tick_param='off', | |
342 | right_tick_param='on', | |
343 | label_right_tick_param='on') | |
344 | panels.append(legend_panel) | |
345 | ||
346 | # | |
347 | # Everything above this is just to set up the panels | |
348 | # | |
349 | ################################################################## | |
350 | ||
351 | # Set max and min viewing window for the xaxis | |
352 | if kwargs["plot_max_x"]: | |
353 | plot_max_x = kwargs["plot_max_x"] | |
354 | else: | |
355 | if kwargs["square"]: | |
356 | plot_max_x = df[kwargs["xcol"]].max() | |
357 | else: | |
358 | plot_max_x = max(np.ceil(df[kwargs["xcol"]])) | |
358 | plot_min_x = kwargs["plot_min_x"] | |
359 | ||
360 | # Set x bin sizes | |
361 | if kwargs["xbin"]: | |
362 | x_bin_interval = kwargs["xbin"] | |
363 | else: | |
364 | # again, this is just based on what looks good from experience | |
365 | x_bin_interval = 1 | |
366 | ||
367 | # Generate x histogram | |
368 | print(" - Generating the x-axis histogram.", file = sys.stderr) | |
369 | generate_histogram(panel = x_panel, | |
370 | data_list = df[kwargs['xcol']], | |
371 | min_plot_val = plot_min_x, | |
372 | max_plot_val = plot_max_x, | |
373 | bin_interval = x_bin_interval, | |
374 | hist_horizontal = True, | |
375 | x_label=kwargs['xcol'], | |
376 | y_label=x_y_label, | |
377 | left_spine=x_left_spine) | |
378 | ||
379 | # Set max and min viewing window for the y axis | |
380 | if kwargs["plot_max_y"]: | |
381 | plot_max_y = kwargs["plot_max_y"] | |
382 | else: | |
383 | if kwargs["square"]: | |
384 | plot_max_y = df[kwargs["ycol"]].max() | |
385 | else: | |
386 | plot_max_y = max(np.ceil(df[kwargs["ycol"]])) | |
387 | ||
388 | plot_min_y = kwargs["plot_min_y"] | |
389 | # Set y bin sizes | |
390 | if kwargs["ybin"]: | |
391 | y_bin_interval = kwargs["ybin"] | |
392 | else: | |
393 | y_bin_interval = 1 | |
394 | ||
395 | # Generate y histogram | |
396 | print(" - Generating the y-axis histogram.", file = sys.stderr) | |
397 | generate_histogram(panel = y_panel, | |
398 | data_list = df[kwargs['ycol']], | |
399 | min_plot_val = plot_min_y, | |
400 | max_plot_val = plot_max_y, | |
401 | bin_interval = y_bin_interval, | |
402 | hist_horizontal = False, | |
403 | y_label = kwargs['ycol'], | |
404 | bottom_spine = y_bottom_spine) | |
405 | ||
406 | # Generate heat map | |
407 | if kwargs["square"]: | |
408 | print(" - Generating the square heatmap.", file = sys.stderr) | |
409 | counts = generate_square_map(panel = heat_map_panel, | |
410 | data_frame = df, | |
411 | plot_min_y = plot_min_y, | |
412 | plot_min_x = plot_min_x, | |
413 | plot_max_y = plot_max_y, | |
414 | plot_max_x = plot_max_x, | |
415 | color = purple1, | |
416 | xcol = kwargs["xcol"], | |
417 | ycol = kwargs["ycol"]) | |
418 | else: | |
419 | print(" - Generating the heatmap.", file = sys.stderr) | |
420 | counts = generate_heat_map(panel = heat_map_panel, | |
421 | data_frame = df, | |
422 | plot_min_y = plot_min_y, | |
423 | plot_min_x = plot_min_x, | |
424 | plot_max_y = plot_max_y, | |
425 | plot_max_x = plot_max_x, | |
426 | color = purple1, | |
427 | xcol = kwargs["xcol"], | |
428 | ycol = kwargs["ycol"]) | |
429 | ||
430 | # Generate legend | |
431 | print(" - Generating the legend.", file = sys.stderr) | |
432 | generate_legend(legend_panel, counts, purple1) | |
433 | ||
434 | # inform the user of the plotting window if not quiet mode | |
435 | #if not kwargs["QUIET"]: | |
436 | # print("""plotting in the following window: | |
437 | # {0} <= Q-score (x-axis) <= {1} | |
438 | # {2} <= length (y-axis) <= {3}""".format( | |
439 | # plot_min_x, plot_max_x, min_plot_val, max_plot_val), | |
440 | # file=stderr) | |
441 | ||
442 | # Print image(s) | |
443 | if kwargs["output_base_name"] is None: | |
444 | file_base = "custommargin" | |
445 | else: | |
446 | file_base = kwargs["output_base_name"] | |
447 | ||
448 | print(" - Saving your images", file = sys.stderr) | |
449 | print_images( | |
450 | base=file_base, | |
451 | image_formats=kwargs["fileform"], | |
452 | dpi=kwargs["dpi"], | |
453 | no_timestamp=kwargs["no_timestamp"], | |
454 | transparent=kwargs["no_transparent"]) | |
455 | ||
456 | def run(args): | |
457 | print(args) | |
458 | if not opath.exists(args.input_file): | |
459 | raise IOError("The input file does not exist: {}".format( | |
460 | args.input_file)) | |
461 | df = pd.read_csv(args.input_file, header='infer', sep='\t') | |
462 | # make sure that the column names that were specified are actually | |
463 | # in the dataframe | |
464 | if args.xcol not in df.columns: | |
465 | raise IOError("""The x-column name that you specified, {}, is not in the | |
466 | dataframe column names: {}""".format(args.xcol, df.columns)) | |
467 | if args.ycol not in df.columns: | |
468 | raise IOError("""The y-column name that you specified, {}, is not in the | |
469 | dataframe column names: {}""".format(args.ycol, df.columns)) | |
470 | print(" - Successfully read the tsv file. Here are a few lines:", | |
471 | file=sys.stderr) | |
472 | print(df.head(), file = sys.stderr) | |
473 | print(" - Plotting {} on the x-axis".format(args.xcol),file=sys.stderr) | |
474 | print(df[args.xcol].head(), file = sys.stderr) | |
475 | print(" - Plotting {} on the y-axis".format(args.ycol),file=sys.stderr) | |
476 | print(df[args.ycol].head(), file = sys.stderr) | |
477 | custommargin(df=df.dropna(), **vars(args)) |
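The Purple1 colormap constructed inside custommargin keeps the hue fixed and ramps only the alpha channel, so denser heatmap bins render more opaque. Rebuilding the same segment dictionary shows this directly:

```python
from matplotlib.colors import LinearSegmentedColormap

Rf = 65 / 255
Bf = 85 / 255
pdict = {'red':   ((0.0, Rf, Rf), (1.0, Rf, Rf)),
         'green': ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
         'blue':  ((0.0, Bf, Bf), (1.0, Bf, Bf)),
         'alpha': ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))}
purple1 = LinearSegmentedColormap('Purple1', pdict)

r0, g0, b0, a0 = purple1(0.0)   # fully transparent purple
r1, g1, b1, a1 = purple1(1.0)   # fully opaque purple
```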
0 | #!/usr/bin/env python | |
1 | # -*- coding: utf-8 -*- | |
2 | ||
3 | # pauvre | |
4 | # Copyright (c) 2016-2020 Darrin T. Schultz. | |
5 | # | |
6 | # This file is part of pauvre. | |
7 | # | |
8 | # pauvre is free software: you can redistribute it and/or modify | |
9 | # it under the terms of the GNU General Public License as published by | |
10 | # the Free Software Foundation, either version 3 of the License, or | |
11 | # (at your option) any later version. | |
12 | # | |
13 | # pauvre is distributed in the hope that it will be useful, | |
14 | # but WITHOUT ANY WARRANTY; without even the implied warranty of | |
15 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
16 | # GNU General Public License for more details. | |
17 | # | |
18 | # You should have received a copy of the GNU General Public License | |
19 | # along with pauvre. If not, see <http://www.gnu.org/licenses/>. | |
20 | ||
21 | from Bio import SeqIO | |
22 | import copy | |
23 | import gzip | |
24 | import matplotlib.pyplot as plt | |
25 | import numpy as np | |
26 | import os | |
27 | import pandas as pd | |
28 | from sys import stderr | |
29 | import time | |
30 | ||
31 | ||
32 | # this makes opening files more robust for different platforms | |
33 | # currently only used in GFFParse | |
34 | import codecs | |
35 | ||
36 | import warnings | |
37 | ||
38 | def print_images(base, image_formats, dpi, | |
39 | transparent=False, no_timestamp=False, path=None): | |
40 | """ | |
41 | Save the plot in multiple formats, with or without transparency, | |
42 | with or without timestamps, and optionally into the directory path. | |
43 | """ | |
44 | for fmt in image_formats: | |
45 | if no_timestamp: | |
46 | out_name = "{0}.{1}".format(base, fmt) | |
47 | else: | |
48 | out_name = "{0}_{1}.{2}".format(base, timestamp(), fmt) | |
49 | if path is not None: | |
50 | out_name = os.path.join(path, out_name) | |
51 | try: | |
52 | if fmt == 'png': | |
53 | plt.savefig(out_name, dpi=dpi, transparent=transparent) | |
54 | else: | |
55 | plt.savefig(out_name, format=fmt, transparent=transparent) | |
56 | except PermissionError: | |
57 | # thanks to https://github.com/wdecoster for the suggestion | |
58 | print("""You don't have permission to save pauvre plots to this | |
59 | directory. Try changing the directory and running the script again!""") | |
58 | ||
59 | class GFFParse(): | |
60 | def __init__(self, filename, stop_codons=None, species=None): | |
61 | self.filename = filename | |
62 | self.samplename = os.path.splitext(os.path.basename(filename))[0] | |
63 | self.species = species | |
64 | self.featureDict = {"name": [], | |
65 | "featType": [], | |
66 | "start": [], | |
67 | "stop": [], | |
68 | "strand": []} | |
69 | gffnames = ["sequence", "source", "featType", "start", "stop", "dunno1", | |
70 | "strand", "dunno2", "tags"] | |
71 | self.features = pd.read_csv(self.filename, comment='#', | |
72 | sep='\t', names=gffnames) | |
73 | self.features['name'] = self.features['tags'].apply(self._get_name) | |
74 | self.features.drop(['dunno1', 'dunno2'], axis=1, inplace=True) | |
76 | self.features.reset_index(inplace=True, drop=True) | |
77 | # warn the user if there are CDS or gene entries not divisible by three | |
78 | self._check_triplets() | |
79 | # sort the database by start | |
80 | self.features.sort_values(by='start', ascending=True, inplace=True) | |
81 | if stop_codons: | |
82 | strip_codons = ['gene', 'CDS'] | |
83 | # if the direction is forward, subtract three from the stop to bring it closer to the start | |
84 | self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '+'), 'stop'] =\ | |
85 | self.features.loc[(self.features['featType'].isin(strip_codons)) | |
86 | & (self.features['strand'] == '+'), 'stop'] - 3 | |
87 | # if the direction is reverse, add three to the start (since the coords are flip-flopped) | |
88 | self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '-'), 'start'] =\ | |
89 | self.features.loc[(self.features['featType'].isin(strip_codons)) | |
90 | & (self.features['strand'] == '-'), 'start'] + 3 | |
91 | self.features['center'] = self.features['start'] + \ | |
92 | ((self.features['stop'] - self.features['start']) / 2) | |
93 | # we need to add one since it doesn't account for the last base otherwise | |
94 | self.features['width'] = abs(self.features['stop'] - self.features['start']) + 1 | |
95 | self.features['lmost'] = self.features.apply(self._determine_lmost, axis=1) | |
96 | self.features['rmost'] = self.features.apply(self._determine_rmost, axis=1) | |
97 | self.features['track'] = 0 | |
98 | if len(self.features.loc[self.features['tags'] == "Is_circular=true", 'stop']) < 1: | |
99 | raise IOError("""The GFF file needs to have a tag ending in "Is_circular=true" | |
100 | with a region from 1 to the number of bases in the mitogenome | |
101 | ||
102 | example: | |
103 | Bf201311 Geneious region 1 13337 . + 0 Is_circular=true | |
104 | """) | |
105 | self.seqlen = int(self.features.loc[self.features['tags'] == "Is_circular=true", 'stop']) | |
106 | self.features.reset_index(inplace=True, drop=True) | |
107 | #print("float", self.features.loc[self.features['name'] == 'COX1', 'center']) | |
108 | #print("float cat", len(self.features.loc[self.features['name'] == 'CAT', 'center'])) | |
109 | # print(self.features) | |
110 | # print(self.seqlen) | |
111 | ||
112 | def set_features(self, new_features): | |
113 | """all this does is reset the features pandas dataframe""" | |
114 | self.features = new_features | |
115 | ||
116 | def get_unique_genes(self): | |
117 | """This returns a series of gene names""" | |
118 | plottable = self.features.query( | |
119 | "featType != 'tRNA' and featType != 'region' and featType != 'source'") | |
120 | return set(plottable['name'].unique()) | |
121 | ||
122 | def shuffle(self): | |
123 | """ | |
124 | this returns a list of all possible shuffles of features. | |
125 | A shuffle is when the frontmost bit of coding + noncoding DNA up | |
126 | until the next bit of coding DNA is removed and tagged on the | |
127 | end of the sequence. In this case this process is represented by | |
128 | shifting gff coordinates. | |
129 | """ | |
130 | shuffles = [] | |
131 | # get the index of the first element | |
132 | # get the index of the next thing | |
133 | # subtract the indices of everything, then reset the ones that are below | |
134 | # zero | |
135 | done = False | |
136 | shuffle_features = self.features[self.features['featType'].isin( | |
137 | ['gene', 'rRNA', 'CDS', 'tRNA'])].copy(deep=True) | |
138 | # we first add the shuffle features without reorganizing | |
139 | # print("shuffle\n",shuffle_features) | |
140 | add_first = copy.deepcopy(self) | |
141 | add_first.set_features(shuffle_features) | |
142 | shuffles.append(add_first) | |
143 | # first gene is changed with every iteration | |
144 | first_gene = list(shuffle_features['name'])[0] | |
145 | # absolute first is the first gene in the original gff file, used to determine if we are done in this while loop | |
146 | absolute_first = list(shuffle_features['name'])[0] | |
147 | while not done: | |
148 | # We need to prevent the case of shuffling in the middle of | |
149 | # overlapped genes. Do this by ensuring that the end of the | |
150 | # first gene is less than the start of the next gene. | |
151 | first_stop = int(shuffle_features.loc[shuffle_features['name'] == first_gene, 'stop']) | |
152 | next_gene = "" | |
153 | for next_index in range(1, len(shuffle_features)): | |
154 | # get the df of the next list, if len == 0, then it is a tRNA and we need to go to the next index | |
155 | next_gene_df = list( | |
156 | shuffle_features[shuffle_features['featType'].isin(['gene', 'rRNA', 'CDS'])]['name']) | |
157 | if len(next_gene_df) != 0: | |
158 | next_gene = next_gene_df[next_index] | |
159 | next_start = int(shuffle_features.loc[shuffle_features['name'] == next_gene, 'start']) | |
160 | #print("looking at {}, prev_stop is {}, start is {}".format( | |
161 | # next_gene, first_stop, next_start)) | |
162 | #print(shuffle_features[shuffle_features['featType'].isin(['gene', 'rRNA', 'CDS'])]) | |
163 | # if the gene we're looking at and the next one don't overlap, move on | |
164 | if first_stop < next_start: | |
165 | break | |
166 | #print("next_gene before checking for first is {}".format(next_gene)) | |
167 | if next_gene == absolute_first: | |
168 | done = True | |
169 | break | |
170 | # now we can reset the first gene for the next iteration | |
171 | first_gene = next_gene | |
172 | shuffle_features = shuffle_features.copy(deep=True) | |
173 | # figure out where the next start point is going to be | |
174 | next_start = int(shuffle_features.loc[shuffle_features['name'] == next_gene, 'start']) | |
175 | #print('next gene: {}'.format(next_gene)) | |
176 | shuffle_features['start'] = shuffle_features['start'] - next_start + 1 | |
177 | shuffle_features['stop'] = shuffle_features['stop'] - next_start + 1 | |
178 | shuffle_features['center'] = shuffle_features['center'] - next_start + 1 | |
179 | # now correct the values that are less than 0 | |
180 | shuffle_features.loc[shuffle_features['start'] < 1, | |
181 | 'start'] = shuffle_features.loc[shuffle_features['start'] < 1, 'start'] + self.seqlen | |
182 | shuffle_features.loc[shuffle_features['stop'] < 1, 'stop'] = shuffle_features.loc[shuffle_features['stop'] | |
183 | < 1, 'start'] + shuffle_features.loc[shuffle_features['stop'] < 1, 'width'] | |
184 | shuffle_features['center'] = shuffle_features['start'] + \ | |
185 | ((shuffle_features['stop'] - shuffle_features['start']) / 2) | |
186 | shuffle_features['lmost'] = shuffle_features.apply(self._determine_lmost, axis=1) | |
187 | shuffle_features['rmost'] = shuffle_features.apply(self._determine_rmost, axis=1) | |
188 | shuffle_features.sort_values(by='start', ascending=True, inplace=True) | |
189 | shuffle_features.reset_index(inplace=True, drop=True) | |
190 | new_copy = copy.deepcopy(self) | |
191 | new_copy.set_features(shuffle_features) | |
192 | shuffles.append(new_copy) | |
193 | #print("len shuffles: {}".format(len(shuffles))) | |
194 | return shuffles | |
195 | ||
196 | def couple(self, other_GFF, this_y=0, other_y=1): | |
197 | """ | |
198 | Compares this set of features to another set and generates tuples of | |
199 | (x,y) coordinate pairs to input into lsi | |
200 | """ | |
201 | other_features = other_GFF.features | |
202 | coordinates = [] | |
203 | for thisname in self.features['name']: | |
204 | othermatch = other_features.loc[other_features['name'] == thisname, 'center'] | |
205 | if len(othermatch) == 1: | |
206 | this_x = float(self.features.loc[self.features['name'] | |
207 | == thisname, 'center']) # /self.seqlen | |
208 | other_x = float(othermatch) # /other_GFF.seqlen | |
209 | # lsi can't handle vertical or horizontal lines, and we don't | |
210 | # need them either for our comparison. Don't add if equivalent. | |
211 | if this_x != other_x: | |
212 | these_coords = ((this_x, this_y), (other_x, other_y)) | |
213 | coordinates.append(these_coords) | |
214 | return coordinates | |
215 | ||
216 | def _check_triplets(self): | |
217 | """This method verifies that all entries of featType gene and CDS are | |
218 | divisible by three""" | |
219 | genesCDSs = self.features.query("featType == 'CDS' or featType == 'gene'") | |
220 | not_trips = genesCDSs.loc[((abs(genesCDSs['stop'] - genesCDSs['start']) + 1) % 3) > 0, ] | |
221 | if len(not_trips) > 0: | |
222 | print_string = "" | |
223 | print_string += "There are CDS and gene entries that are not divisible by three\n" | |
224 | print_string += str(not_trips) | |
225 | warnings.warn(print_string, SyntaxWarning) | |
226 | ||
227 | def _get_name(self, tag_value): | |
228 | """This extracts a name from a single row in 'tags' of the pandas | |
229 | dataframe | |
230 | """ | |
231 | try: | |
232 | if ";" in tag_value: | |
233 | name = tag_value[5:].split(';')[0] | |
234 | else: | |
235 | name = tag_value[5:].split()[0] | |
236 | except Exception: | |
237 | name = tag_value | |
238 | print("Couldn't correctly parse {}".format(tag_value)) | |
240 | return name | |
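The slicing rule above assumes every tags value begins with a fixed five-character prefix such as `Name=`, then keeps everything up to the first `;` or whitespace. A standalone sketch of the same rule (the function name and sample tags are illustrative, not part of pauvre's API):

```python
def get_name_from_tag(tag_value):
    """Mimic pauvre's _get_name: drop a 5-char prefix like 'Name=' and
    keep everything up to the first ';' or whitespace."""
    try:
        if ";" in tag_value:
            return tag_value[5:].split(';')[0]
        return tag_value[5:].split()[0]
    except Exception:
        # fall back to the raw value if parsing fails
        return tag_value

print(get_name_from_tag("Name=ND1;note=NADH dehydrogenase subunit 1"))  # ND1
print(get_name_from_tag("Name=COX1 product"))                           # COX1
```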
241 | ||
242 | def _determine_lmost(self, row): | |
243 | """Return the leftmost (smaller) of start/stop; intended for DataFrame.apply(axis=1). | |
244 | """ | |
245 | if row['start'] < row['stop']: | |
246 | return row['start'] | |
247 | else: | |
248 | return row['stop'] | |
249 | ||
250 | def _determine_rmost(self, row): | |
251 | """Return the rightmost (larger) of start/stop; intended for DataFrame.apply(axis=1). | |
252 | """ | |
253 | if row['start'] < row['stop']: | |
254 | return row['stop'] | |
255 | else: | |
256 | return row['start'] | |
257 | ||
258 | ||
259 | def parse_fastq_length_meanqual(fastq): | |
260 | """ | |
261 | arguments: | |
262 | <fastq> the fastq file path. Hopefully it has been verified to exist already | |
263 | ||
264 | purpose: | |
265 | This function parses a fastq and returns a pandas dataframe of read lengths | |
266 | and read meanQuals. | |
267 | """ | |
268 | # First try to open the file with the gzip module. Reading a plain-text | |
269 | # file through gzip raises OSError, so this doubles as a test of whether | |
270 | # the fastq file is gzipped or not. | |
271 | try: | |
272 | handle = gzip.open(fastq, "rt") | |
273 | length, meanQual = _fastq_parse_helper(handle) | |
274 | except OSError: | |
275 | handle = open(fastq, "r") | |
276 | length, meanQual = _fastq_parse_helper(handle) | |
277 | ||
278 | handle.close() | |
279 | df = pd.DataFrame(list(zip(length, meanQual)), columns=['length', 'meanQual']) | |
280 | return df | |
281 | ||
282 | ||
283 | def filter_fastq_length_meanqual(df, min_len, max_len, | |
284 | min_mqual, max_mqual): | |
285 | querystring = "length >= {0} and meanQual >= {1}".format(min_len, min_mqual) | |
286 | if max_len is not None: | |
287 | querystring += " and length <= {}".format(max_len) | |
288 | if max_mqual is not None: | |
289 | querystring += " and meanQual <= {}".format(max_mqual) | |
290 | print("Keeping reads that satisfy: {}".format(querystring), file=stderr) | |
291 | filtdf = df.query(querystring) | |
292 | #filtdf["length"] = pd.to_numeric(filtdf["length"], errors='coerce') | |
293 | #filtdf["meanQual"] = pd.to_numeric(filtdf["meanQual"], errors='coerce') | |
294 | return filtdf | |
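`filter_fastq_length_meanqual` composes a pandas query string from the cutoffs and applies it in a single `DataFrame.query` call. A minimal self-contained sketch of that pattern (the column names match the dataframe built by `parse_fastq_length_meanqual`; the sample read data is invented):

```python
import pandas as pd

# toy stand-in for the dataframe returned by parse_fastq_length_meanqual
df = pd.DataFrame({'length': [500, 1500, 8000],
                   'meanQual': [6.0, 11.2, 14.9]})

# same idea as filter_fastq_length_meanqual: compose the query, then filter
querystring = "length >= {0} and meanQual >= {1}".format(1000, 10)
filtdf = df.query(querystring)
print(len(filtdf))  # 2
```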
295 | ||
296 | ||
297 | def _fastq_parse_helper(handle): | |
298 | length = [] | |
299 | meanQual = [] | |
300 | for record in SeqIO.parse(handle, "fastq"): | |
301 | if len(record) > 0: | |
302 | meanQual.append(_arithmetic_mean(record.letter_annotations["phred_quality"])) | |
303 | length.append(len(record)) | |
304 | return length, meanQual | |
305 | ||
306 | ||
307 | def _geometric_mean(phred_values): | |
308 | """Placeholder in case a geometric mean of the raw phred values is | |
309 | wanted in the future, e.g. np.exp(np.mean(np.log(phred_values)))""" | |
310 | pass | |
311 | ||
312 | ||
313 | def _arithmetic_mean(phred_values): | |
314 | """ | |
315 | Convert Phred to 1-accuracy (error probabilities), calculate the arithmetic mean, | |
316 | log transform back to Phred. | |
317 | """ | |
318 | if not isinstance(phred_values, np.ndarray): | |
319 | phred_values = np.array(phred_values) | |
320 | return _erate_to_phred(np.mean(_phred_to_erate(phred_values))) | |
321 | ||
322 | ||
323 | def _phred_to_erate(phred_values): | |
324 | """ | |
325 | converts a list or numpy array of phred values to a numpy array | |
326 | of error rates | |
327 | """ | |
328 | if not isinstance(phred_values, np.ndarray): | |
329 | phred_values = np.array(phred_values) | |
330 | return np.power(10, (-1 * (phred_values / 10))) | |
331 | ||
332 | ||
333 | def _erate_to_phred(erate_values): | |
334 | """ | |
335 | converts a list or numpy array of error rates to a numpy array | |
336 | of phred values | |
337 | """ | |
338 | if not isinstance(erate_values, np.ndarray): | |
339 | erate_values = np.array(erate_values) | |
340 | return -10 * np.log10(erate_values) | |
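Because Phred is a log scale, averaging the raw scores overweights high-quality bases; the helpers above therefore average in error-rate space and convert back. A quick check of the round trip, in pure numpy, mirroring the formulas above:

```python
import numpy as np

def phred_to_erate(phred):
    # error probability p = 10^(-Q/10)
    return np.power(10, -np.asarray(phred, dtype=float) / 10)

def erate_to_phred(erate):
    # Q = -10 * log10(p)
    return -10 * np.log10(np.asarray(erate, dtype=float))

# uniform Q20 read: the mean in error space is still Q20
uniform = erate_to_phred(np.mean(phred_to_erate([20, 20, 20, 20])))
# one Q2 base dominates a Q40 base: the mean collapses toward Q2
mixed = erate_to_phred(np.mean(phred_to_erate([40, 2])))
print(round(float(uniform), 2), round(float(mixed), 2))
```

The second value lands near Q5, far below the naive arithmetic mean of 21, which is why `_arithmetic_mean` works in error space.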
341 | ||
342 | def timestamp(): | |
343 | """ | |
344 | Returns the current time in :samp:`YYYYMMDD_HHMMSS` format. | |
345 | """ | |
346 | return time.strftime("%Y%m%d_%H%M%S") |
0 | #!/usr/bin/env python | |
1 | # -*- coding: utf-8 -*- | |
2 | ||
3 | # pauvre - a pore plotting package | |
4 | # Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved. | |
5 | # | |
6 | # This file is part of pauvre. | |
7 | # | |
8 | # pauvre is free software: you can redistribute it and/or modify | |
9 | # it under the terms of the GNU General Public License as published by | |
10 | # the Free Software Foundation, either version 3 of the License, or | |
11 | # (at your option) any later version. | |
12 | # | |
13 | # pauvre is distributed in the hope that it will be useful, | |
14 | # but WITHOUT ANY WARRANTY; without even the implied warranty of | |
15 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
16 | # GNU General Public License for more details. | |
17 | # | |
18 | # You should have received a copy of the GNU General Public License | |
19 | # along with pauvre. If not, see <http://www.gnu.org/licenses/>. | |
20 | ||
21 | """This file contains things related to parsing and plotting GFF files""" | |
22 | ||
23 | import copy | |
24 | from matplotlib.path import Path | |
25 | import matplotlib.patches as patches | |
26 | ||
27 | import sys | |
28 | ||
29 | # layout constants shared by the plotting helpers in this module | |
30 | arrow_width = 80 | |
31 | chevron_width = 40 | |
32 | min_text = 550 | |
33 | text_cutoff = 150 | |
34 | ||
35 | colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA': 'pink', 'rRNA': 'red', | |
36 | 'misc_feature': 'purple', 'rep_origin': 'orange', 'spacebar': 'white', | |
37 | 'ORF': 'orange'} | |
42 | ||
43 | def _plot_left_to_right_introns(panel, geneid, db, y_pos, text = None): | |
44 | """ plots a left to right patch with introns when there are no intervening | |
45 | sequences to consider. Uses a gene id and gffutils database as input. | |
46 | b | |
47 | a .-=^=-. c | |
48 | 1__________2---/ e `---1__________2 | |
49 | | #lff \f d| #lff \ | |
50 | | left to \3 | left to \3 | |
51 | | right / | right / | |
52 | 5___________/4 5___________/4 | |
53 | """ | |
54 | #first we need to determine the number of exons | |
55 | bar_thickness = 0.75 | |
56 | #now we can start plotting the exons | |
57 | exonlist = list(db.children(geneid, featuretype='CDS', order_by="start")) | |
58 | for i in range(len(exonlist)): | |
59 | cds_start = exonlist[i].start | |
60 | cds_stop = exonlist[i].stop | |
61 | verts = [(cds_start, y_pos + bar_thickness), #1 | |
62 | (cds_stop - chevron_width, y_pos + bar_thickness), #2 | |
63 | (cds_stop, y_pos + (bar_thickness/2)), #3 | |
64 | (cds_stop - chevron_width, y_pos), #4 | |
65 | (cds_start, y_pos), #5 | |
66 | (cds_start, y_pos + bar_thickness), #1 | |
67 | ] | |
68 | codes = [Path.MOVETO, | |
69 | Path.LINETO, | |
70 | Path.LINETO, | |
71 | Path.LINETO, | |
72 | Path.LINETO, | |
73 | Path.CLOSEPOLY, | |
74 | ] | |
75 | path = Path(verts, codes) | |
76 | patch = patches.PathPatch(path, lw = 0, | |
77 | fc=colorMap['CDS'] ) | |
78 | panel.add_patch(patch) | |
79 | ||
80 | # we must draw the splice junction | |
81 | if i < len(exonlist) - 1: | |
82 | next_start = exonlist[i+1].start | |
83 | next_stop = exonlist[i+1].stop | |
84 | middle = cds_stop + ((next_start - cds_stop)/2) | |
85 | ||
86 | verts = [(cds_stop - chevron_width, y_pos + bar_thickness), #2/a | |
87 | (middle, y_pos + 0.95), #b | |
88 | (next_start, y_pos + bar_thickness), #c | |
89 | (next_start, y_pos + bar_thickness - 0.05), #d | |
90 | (middle, y_pos + 0.95 - 0.05), #e | |
91 | (cds_stop - chevron_width, y_pos + bar_thickness -0.05), #f | |
92 | (cds_stop - chevron_width, y_pos + bar_thickness), #2/a | |
93 | ] | |
94 | codes = [Path.MOVETO, | |
95 | Path.LINETO, | |
96 | Path.LINETO, | |
97 | Path.LINETO, | |
98 | Path.LINETO, | |
99 | Path.LINETO, | |
100 | Path.CLOSEPOLY, | |
101 | ] | |
102 | path = Path(verts, codes) | |
103 | patch = patches.PathPatch(path, lw = 0, | |
104 | fc=colorMap['CDS'] ) | |
105 | panel.add_patch(patch) | |
106 | ||
107 | return panel | |
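Each exon above is drawn by pairing a list of (x, y) vertices with a list of matplotlib `Path` codes, then wrapping the result in a `PathPatch`. The pattern in isolation, drawing one right-pointing exon arrow like vertices 1-5 in the docstring diagram (the coordinates and color are arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
from matplotlib.path import Path
import matplotlib.patches as patches

# top-left, top-right, arrow tip, bottom-right, bottom-left, back to start
verts = [(0, 1), (8, 1), (10, 0.5), (8, 0), (0, 0), (0, 1)]
codes = [Path.MOVETO] + [Path.LINETO] * 4 + [Path.CLOSEPOLY]

fig, ax = plt.subplots()
ax.add_patch(patches.PathPatch(Path(verts, codes), lw=0, fc="green"))
ax.set_xlim(0, 12)
ax.set_ylim(-1, 2)
print(len(ax.patches))  # 1
```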
108 | ||
109 | def _plot_left_to_right_introns_top(panel, geneid, db, y_pos, text = None): | |
110 | """ Slightly different from the version above in that splice junctions | |
111 | are more visually explicit. | |
112 | ||
113 | plots a left to right patch with introns when there are no intervening | |
114 | sequences to consider. Uses a gene id and gffutils database as input. | |
115 | b | |
116 | a .-=^=-. c | |
117 | 1_____________2---/ e `---1_____________2 | |
118 | | #lff /f d| #lff / | |
119 | | left to / | left to / | |
120 | | right / | right / | |
121 | 4_________/3 4_________/3 | |
122 | """ | |
123 | #first we need to determine the number of exons | |
124 | bar_thickness = 0.75 | |
125 | #now we can start plotting the exons | |
126 | exonlist = list(db.children(geneid, featuretype='CDS', order_by="start")) | |
127 | for i in range(len(exonlist)): | |
128 | cds_start = exonlist[i].start | |
129 | cds_stop = exonlist[i].stop | |
130 | verts = [(cds_start, y_pos + bar_thickness), #1 | |
131 | (cds_stop, y_pos + bar_thickness), #2 | |
132 | (cds_stop - chevron_width, y_pos), #3 | |
133 | (cds_start, y_pos), #4 | |
134 | (cds_start, y_pos + bar_thickness), #1 | |
135 | ] | |
136 | codes = [Path.MOVETO, | |
137 | Path.LINETO, | |
138 | Path.LINETO, | |
139 | Path.LINETO, | |
140 | Path.CLOSEPOLY, | |
141 | ] | |
142 | path = Path(verts, codes) | |
143 | patch = patches.PathPatch(path, lw = 0, | |
144 | fc=colorMap['CDS'] ) | |
145 | panel.add_patch(patch) | |
146 | ||
147 | # we must draw the splice junction | |
148 | if i < len(exonlist) - 1: | |
149 | next_start = exonlist[i+1].start | |
150 | next_stop = exonlist[i+1].stop | |
151 | middle = cds_stop + ((next_start - cds_stop)/2) | |
152 | ||
153 | verts = [(cds_stop-5, y_pos + bar_thickness), #2/a | |
154 | (middle, y_pos + 0.95), #b | |
155 | (next_start, y_pos + bar_thickness), #c | |
156 | (next_start, y_pos + bar_thickness - 0.05), #d | |
157 | (middle, y_pos + 0.95 - 0.05), #e | |
158 | (cds_stop-5, y_pos + bar_thickness -0.05), #f | |
159 | (cds_stop-5, y_pos + bar_thickness), #2/a | |
160 | ] | |
161 | codes = [Path.MOVETO, | |
162 | Path.LINETO, | |
163 | Path.LINETO, | |
164 | Path.LINETO, | |
165 | Path.LINETO, | |
166 | Path.LINETO, | |
167 | Path.CLOSEPOLY, | |
168 | ] | |
169 | path = Path(verts, codes) | |
170 | patch = patches.PathPatch(path, lw = 0, | |
171 | fc=colorMap['CDS'] ) | |
172 | panel.add_patch(patch) | |
173 | ||
174 | return panel | |
175 | ||
176 | def _plot_lff(panel, left_df, right_df, colorMap, y_pos, bar_thickness, text): | |
177 | """ plots a lff patch | |
178 | 1__________2 ____________ | |
179 | | #lff \ \ #rff \ | |
180 | | left for \3 \ right for \ | |
181 | | forward / / forward / | |
182 | 5___________/4 /___________/ | |
183 | """ | |
184 | print("plotting lff") | |
187 | verts = [(left_df['start'], y_pos + bar_thickness), #1 | |
188 | (right_df['start'] - chevron_width, y_pos + bar_thickness), #2 | |
189 | (left_df['stop'], y_pos + (bar_thickness/2)), #3 | |
190 | (right_df['start'] - chevron_width, y_pos), #4 | |
191 | (left_df['start'], y_pos), #5 | |
192 | (left_df['start'], y_pos + bar_thickness), #1 | |
193 | ] | |
194 | codes = [Path.MOVETO, | |
195 | Path.LINETO, | |
196 | Path.LINETO, | |
197 | Path.LINETO, | |
198 | Path.LINETO, | |
199 | Path.CLOSEPOLY, | |
200 | ] | |
201 | path = Path(verts, codes) | |
202 | patch = patches.PathPatch(path, lw = 0, | |
203 | fc=colorMap[left_df['featType']] ) | |
204 | text_width = left_df['width'] | |
205 | if text and text_width >= min_text: | |
206 | panel = _plot_label(panel, left_df, y_pos, bar_thickness) | |
207 | elif text and text_width < min_text and text_width >= text_cutoff: | |
208 | panel = _plot_label(panel, left_df, | |
209 | y_pos, bar_thickness, | |
210 | rotate = True, arrow = True) | |
211 | ||
212 | return panel, patch | |
213 | ||
214 | def _plot_label(panel, df, y_pos, bar_thickness, rotate = False, arrow = False): | |
215 | # handles the case where a dataframe was passed | |
216 | fontsize = 8 | |
217 | rotation = 0 | |
218 | if rotate: | |
219 | fontsize = 5 | |
220 | rotation = 90 | |
221 | if len(df) == 1: | |
222 | x =((df.loc[0, 'stop'] - df.loc[0, 'start'])/2) + df.loc[0, 'start'] | |
223 | y = y_pos + (bar_thickness/2) | |
224 | # if we need to center somewhere other than the arrow, need to adjust | |
225 | # for the direction of the arrow | |
226 | # it doesn't look good if it shifts by the whole arrow width, so only | |
227 | # shift by half the arrow width | |
228 | if arrow: | |
229 | if df.loc[0, 'strand'] == "+": | |
230 | shift_start = df.loc[0, 'start'] | |
231 | else: | |
232 | shift_start = df.loc[0, 'start'] + (arrow_width/2) | |
233 | x =((df.loc[0, 'stop'] - (arrow_width/2) - df.loc[0, 'start'])/2) + shift_start | |
234 | panel.text(x, y, | |
235 | df.loc[0, 'name'], fontsize = fontsize, | |
236 | ha='center', va='center', | |
237 | color = 'white', family = 'monospace', | |
238 | zorder = 100, rotation = rotation) | |
239 | # and the case where a series was passed | |
240 | else: | |
241 | x = ((df['stop'] - df['start'])/2) + df['start'] | |
242 | y = y_pos + (bar_thickness/2) | |
243 | if arrow: | |
244 | if df['strand'] == "+": | |
245 | shift_start = df['start'] | |
246 | else: | |
247 | shift_start = df['start'] + (arrow_width/2) | |
248 | x =((df['stop'] - (arrow_width/2) - df['start'])/2) + shift_start | |
249 | panel.text(x, y, | |
250 | df['name'], fontsize = fontsize, | |
251 | ha='center', va='center', | |
252 | color = 'white', family = 'monospace', | |
253 | zorder = 100, rotation = rotation) | |
254 | ||
255 | return panel | |
256 | ||
257 | def _plot_rff(panel, left_df, right_df, colorMap, y_pos, bar_thickness, text): | |
258 | """ plots a rff patch | |
259 | ____________ 1__________2 | |
260 | | #lff \ \ #rff \ | |
261 | | left for \ 6\ right for \3 | |
262 | | forward / / forward / | |
263 | |___________/ /5__________/4 | |
264 | """ | |
267 | print("plotting rff") | |
268 | verts = [(right_df['start'], y_pos + bar_thickness), #1 | |
269 | (right_df['stop'] - arrow_width, y_pos + bar_thickness), #2 | |
270 | (right_df['stop'], y_pos + (bar_thickness/2)), #3 | |
271 | (right_df['stop'] - arrow_width, y_pos), #4 | |
272 | (right_df['start'], y_pos), #5 | |
273 | (left_df['stop'] + chevron_width, y_pos + (bar_thickness/2)), #6 | |
274 | (right_df['start'], y_pos + bar_thickness), #1 | |
275 | ] | |
276 | codes = [Path.MOVETO, | |
277 | Path.LINETO, | |
278 | Path.LINETO, | |
279 | Path.LINETO, | |
280 | Path.LINETO, | |
281 | Path.LINETO, | |
282 | Path.CLOSEPOLY, | |
283 | ] | |
284 | path = Path(verts, codes) | |
285 | patch = patches.PathPatch(path, lw = 0, | |
286 | fc=colorMap[right_df['featType']] ) | |
287 | text_width = right_df['width'] | |
288 | if text and text_width >= min_text: | |
289 | panel = _plot_label(panel, right_df, y_pos, bar_thickness) | |
290 | elif text and text_width < min_text and text_width >= text_cutoff: | |
291 | panel = _plot_label(panel, right_df, | |
292 | y_pos, bar_thickness, rotate = True) | |
293 | return panel, patch | |
294 | ||
295 | def x_offset_gff(GFFParseobj, x_offset): | |
296 | """Takes in a gff object (a gff file parsed as a pandas dataframe), | |
297 | and an x_offset value and shifts the start, stop, center, lmost, and rmost. | |
298 | ||
299 | Returns a GFFParse object with the shifted values in GFFParse.features. | |
300 | """ | |
301 | for columnname in ['start', 'stop', 'center', 'lmost', 'rmost']: | |
302 | GFFParseobj.features[columnname] = GFFParseobj.features[columnname] + x_offset | |
303 | return GFFParseobj | |
304 | ||
305 | def gffplot_horizontal(figure, panel, args, gff_object, | |
306 | track_width=0.2, start_y=0.1, **kwargs): | |
307 | """ | |
308 | this plots horizontal things from gff files. it was probably written for synplot, | |
309 | as the browser does not use this at all. | |
310 | """ | |
311 | # Because this size should be relative to the circle that it is plotted next | |
312 | # to, define the start_radius as the place to work from, and the width of | |
313 | # each track. | |
314 | colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red', | |
315 | 'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white'} | |
316 | augment = 0 | |
317 | bar_thickness = 0.9 * track_width | |
318 | # return these at the end | |
319 | myPatches=[] | |
320 | plot_order = [] | |
321 | ||
322 | idone = False | |
323 | # we need to filter out the tRNAs since those are plotted last | |
324 | plottable_features = gff_object.features.query("featType != 'tRNA' and featType != 'region' and featType != 'source'") | |
325 | plottable_features.reset_index(inplace=True, drop=True) | |
326 | print(plottable_features) | |
327 | ||
328 | len_plottable = len(plottable_features) | |
329 | print('len plottable', len_plottable) | |
330 | # - this for loop relies on the gff features to already be sorted | |
331 | # - The algorithm for this loop works by starting at the 0th index of the | |
332 | # plottable features (i). | |
333 | # - It then looks to see if the next object (the jth) overlaps with the | |
334 | # ith element. | |
335 | i = 0 | |
336 | j = 1 | |
337 | while i < len(plottable_features): | |
338 | if i + j == len(plottable_features): | |
339 | #we have run off of the df and need to include everything from i to the end | |
340 | these_features = plottable_features.loc[i::,].copy(deep=True) | |
341 | these_features = these_features.reset_index() | |
342 | print(these_features) | |
343 | plot_order.append(these_features) | |
344 | i = len(plottable_features) | |
345 | break | |
346 | print(" - i,j are currently: {},{}".format(i, j)) | |
347 | stop = plottable_features.loc[i]["stop"] | |
348 | start = plottable_features.loc[i+j]["start"] | |
349 | print("stop: {}. start: {}.".format(stop, start)) | |
350 | if plottable_features.loc[i]["stop"] <= plottable_features.loc[i+j]["start"]: | |
351 | print(" - putting elements {} through (including) {} together".format(i, i+j)) | |
352 | these_features = plottable_features.loc[i:i+j-1,].copy(deep=True) | |
353 | these_features = these_features.reset_index() | |
354 | print(these_features) | |
355 | plot_order.append(these_features) | |
356 | i += 1 | |
357 | j = 1 | |
358 | else: | |
359 | j += 1 | |
360 | ||
361 | #while idone == False: | |
362 | # print("im in the overlap-pairing while loop i={}".format(i)) | |
363 | # # look ahead at all of the elements that overlap with the ith element | |
364 | # jdone = False | |
365 | # j = 1 | |
366 | # this_set_minimum_index = i | |
367 | # this_set_maximum_index = i | |
368 | # while jdone == False: | |
369 | # print("new i= {} j={} len={}".format(i, j, len_plottable)) | |
370 | # print("len plottable in jdone: {}".format(len_plottable)) | |
371 | # print("plottable features in jdone:\n {}".format(plottable_features)) | |
372 | # # first make sure that we haven't gone off the end of the dataframe | |
373 | # # This is an edge case where i has a jth element that overlaps with it, | |
374 | # # and j is the last element in the plottable features. | |
375 | # if i+j == len_plottable: | |
376 | # print("i+j == len_plottable") | |
377 | # # this checks for the case that i is the last element of the | |
378 | # # plottable features. | |
379 | # # In both of the above cases, we are done with both the ith and | |
380 | # # the jth features. | |
381 | # if i == len_plottable-1: | |
382 | # print("i == len_plottable-1") | |
383 | ||
384 | # # this is the last analysis, so set idone to true | |
385 | # # to finish after this | |
386 | # idone = True | |
387 | # # the last one can't be in its own group, so just add it solo | |
388 | # these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,].copy(deep=True) | |
389 | # plot_order.append(these_features.reset_index(drop=True)) | |
390 | # break | |
391 | # jdone = True | |
392 | # else: | |
393 | # print("i+j != len_plottable") | |
394 | # # if the lmost of the next gene overlaps with the rmost of | |
395 | # # the current one, it overlaps and couple together | |
396 | # if plottable_features.loc[i+j, 'lmost'] < plottable_features.loc[i, 'rmost']: | |
397 | # print("lmost < rmost") | |
398 | # # note that this feature overlaps with the current | |
399 | # this_set_maximum_index = i+j | |
400 | # # ... and we need to look at the next in line | |
401 | # j += 1 | |
402 | # else: | |
403 | # print("lmost !< rmost") | |
404 | # i += 1 + (this_set_maximum_index - this_set_minimum_index) | |
405 | # #add all of the things that grouped together once we don't find any more groups | |
406 | # these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,].copy(deep=True) | |
407 | # plot_order.append(these_features.reset_index(drop=True)) | |
408 | # jdone = True | |
409 | # print("plot order is now: {}".format(plot_order)) | |
410 | # print("jdone: {}".format(str(jdone))) | |
411 | ||
412 | for feature_set in plot_order: | |
413 | # plot_feature_hori handles overlapping cases as well as normal cases | |
414 | panel, patches = gffplot_feature_hori(figure, panel, feature_set, colorMap, | |
415 | start_y, bar_thickness, text = True) | |
416 | for each in patches: | |
417 | print("there are {} patches after gffplot_feature_hori".format(len(patches))) | |
418 | print(each) | |
419 | myPatches.append(each) | |
420 | print("length of myPatches is: {}".format(len(myPatches))) | |
421 | ||
422 | # Now we add all of the tRNAs to this to plot, do it last to overlay | |
423 | # everything else | |
424 | tRNAs = gff_object.features.query("featType == 'tRNA'") | |
425 | tRNAs.reset_index(inplace=True, drop = True) | |
426 | tRNA_bar_thickness = bar_thickness * (0.8) | |
427 | tRNA_start_y = start_y + ((bar_thickness - tRNA_bar_thickness)/2) | |
428 | for i in range(0,len(tRNAs)): | |
429 | this_feature = tRNAs[i:i+1].copy(deep=True) | |
430 | this_feature.reset_index(inplace=True, drop = True) | |
431 | panel, patches = gffplot_feature_hori(figure, panel, this_feature, colorMap, | |
432 | tRNA_start_y, tRNA_bar_thickness, text = True) | |
433 | for patch in patches: | |
434 | myPatches.append(patch) | |
435 | print("There are {} patches at the end of gffplot_horizontal()".format(len(myPatches))) | |
436 | return panel, myPatches | |
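The i/j loop above walks the sorted features and batches them before handing each batch to `gffplot_feature_hori`. The underlying idea, clustering sorted intervals into overlap groups, can be sketched independently (a simplification: the real loop operates on dataframe rows and uses its own grouping rule):

```python
def group_overlapping(intervals):
    """Cluster sorted (start, stop) intervals into runs where each interval
    overlaps the running extent of its current group."""
    groups = [[intervals[0]]]
    for iv in intervals[1:]:
        if iv[0] < max(stop for _, stop in groups[-1]):
            groups[-1].append(iv)   # overlaps the current group
        else:
            groups.append([iv])     # starts a new group
    return groups

print(group_overlapping([(1, 5), (4, 8), (10, 12)]))
# [[(1, 5), (4, 8)], [(10, 12)]]
```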
437 | ||
438 | def gffplot_feature_hori(figure, panel, feature_df, | |
439 | colorMap, y_pos, bar_thickness, text=True): | |
440 | """This plots the track for a feature, and if there is something for | |
441 | 'this_feature_overlaps_feature', then there is special processing to | |
442 | add the white bar and the extra slope for the chevron | |
443 | """ | |
444 | myPatches = [] | |
445 | #if there is only one feature to plot, then just plot it | |
446 | if len(feature_df) == 1: | |
447 | #print("plotting a single thing: {} {}".format(str(feature_df['sequence']).split()[1], | |
448 | # str(feature_df['featType']).split()[1] )) | |
449 | #print(this_feature['name'], "is not overlapping") | |
450 | # This plots this shape: 1_________2 2_________1 | |
451 | # | forward \3 3/ reverse | | |
452 | # |5__________/4 \4________5| | |
453 | if feature_df.loc[0,'strand'] == '+': | |
454 | verts = [(feature_df.loc[0, 'start'], y_pos + bar_thickness), #1 | |
455 | (feature_df.loc[0, 'stop'] - arrow_width, y_pos + bar_thickness), #2 | |
456 | (feature_df.loc[0, 'stop'], y_pos + (bar_thickness/2)), #3 | |
457 | (feature_df.loc[0, 'stop'] - arrow_width, y_pos), #4 | |
458 | (feature_df.loc[0, 'start'], y_pos), #5 | |
459 | (feature_df.loc[0, 'start'], y_pos + bar_thickness)] #1 | |
460 | elif feature_df.loc[0,'strand'] == '-': | |
461 | verts = [(feature_df.loc[0, 'stop'], y_pos + bar_thickness), #1 | |
462 | (feature_df.loc[0, 'start'] + arrow_width, y_pos + bar_thickness), #2 | |
463 | (feature_df.loc[0, 'start'], y_pos + (bar_thickness/2)), #3 | |
464 | (feature_df.loc[0, 'start'] + arrow_width, y_pos), #4 | |
465 | (feature_df.loc[0, 'stop'], y_pos), #5 | |
466 | (feature_df.loc[0, 'stop'], y_pos + bar_thickness)] #1 | |
467 | feat_width = feature_df.loc[0,'width'] | |
468 | if text and feat_width >= min_text: | |
469 | panel = _plot_label(panel, feature_df.loc[0,], | |
470 | y_pos, bar_thickness) | |
471 | elif text and feat_width < min_text and feat_width >= text_cutoff: | |
472 | panel = _plot_label(panel, feature_df.loc[0,], | |
473 | y_pos, bar_thickness, | |
474 | rotate = True, arrow = True) | |
475 | ||
476 | codes = [Path.MOVETO, | |
477 | Path.LINETO, | |
478 | Path.LINETO, | |
479 | Path.LINETO, | |
480 | Path.LINETO, | |
481 | Path.CLOSEPOLY] | |
482 | path = Path(verts, codes) | |
483 | print("normal path is: {}".format(path)) | |
484 | # If the feature is narrower than the arrow head, drop the two chevron | |
485 | # vertices and draw it as a simple triangle instead. | |
485 | if feature_df.loc[0,'width'] <= arrow_width: | |
486 | path = Path([verts[i] for i in [0,2,4,5]], | |
487 | [codes[i] for i in [0,2,4,5]]) | |
488 | patch = patches.PathPatch(path, lw = 0, | |
489 | fc=colorMap[feature_df.loc[0, 'featType']] ) | |
490 | myPatches.append(patch) | |
491 | # there are four possible scenarios if there are two overlapping sequences: | |
492 | # ___________ ____________ ____________ ___________ | |
493 | # | #1 \ \ #1 \ / #2 / / #2 | | |
494 | # | both seqs \ \ both seqs \ / both seqs / / both seqs | | |
495 | # | forward / / forward / \ reverse \ \ reverse | | |
496 | # |__________/ /___________/ \___________\ \___________| | |
497 | # ___________ _____________ ____________ _ _________ | |
498 | # | #3 \ \ #3 | / #2 _| #2 \ | |
499 | # | one seq \ \ one seq | / one seq |_ one seq \ | |
500 | # | forward \ \ reverse | \ reverse _| forward / | |
501 | # |_____________\ \_________| \__________|_ ___________/ | |
502 | # | |
503 | # These different scenarios can be thought of as different left/right | |
504 | # flanking segment types. | |
505 | # In the annotation #rff: | |
506 | # - 'r' refers to the annotation type as being on the right | |
507 | # - the first 'f' refers to the what element is to the left of this one. | |
508 | # Since it is forward the 5' end of this annotation must be a chevron | |
509 | # - the second 'f' refers to the right side of this element. Since it is | |
510 | # forward it must be a normal arrow. | |
511 | # being on the right | |
512 | # | |
513 | # *LEFT TYPES* *RIGHT TYPES* | |
514 | # ____________ ____________ | |
515 | # | #lff \ \ #rff \ | |
516 | # | left for \ \ right for \ | |
517 | # | forward / / forward / | |
518 | # |___________/ /___________/ | |
519 | # ___________ _____________ | |
520 | # | #lfr \ \ #rfr | | |
521 | # | left for \ \ right for | | |
522 | # | reverse \ \ reverse | | |
523 | # |_____________\ \_________| | |
524 | # ____________ ___________ | |
525 | # / #lrr / / #rrr | | |
526 | # / left rev / / right rev | | |
527 | # \ reverse \ \ reverse | | |
528 | # \___________\ \___________| | |
529 | # ____________ __________ | |
530 | # / #lrf _| _| #rrf \ | |
531 | # / left rev |_ | _ right rev \ | |
532 | # \ forward _| _| forward / | |
533 | # \__________| |____________/ | |
534 | # | |
535 | # To properly plot these elements, we must go through each element of the | |
536 | # feature_df to determine which patch type it is. | |
537 | elif len(feature_df) == 2: | |
538 | print("im in here feat len=2") | |
539 | for i in range(len(feature_df)): | |
540 | # this tests for which left type we're dealing with | |
541 | if i == 0: | |
542 | # type could be lff or lfr | |
543 | if feature_df.loc[i, 'strand'] == '+': | |
544 | if feature_df.loc[i + 1, 'strand'] == '+': | |
545 | # plot a lff type | |
546 | panel, patch = _plot_lff(panel, feature_df.iloc[i,], feature_df.iloc[i+1,], | |
547 | colorMap, y_pos, bar_thickness, text) | |
548 | myPatches.append(patch) | |
549 | elif feature_df.loc[i + 1, 'strand'] == '-': | |
550 | #plot a lfr type | |
551 | raise NotImplementedError("can't plot {} patches yet".format("lfr")) | |
552 | # or type could be lrr or lrf | |
553 | elif feature_df.loc[i, 'strand'] == '-': | |
554 | if feature_df.loc[i + 1, 'strand'] == '+': | |
555 | # plot a lrf type | |
556 | raise NotImplementedError("can't plot {} patches yet".format("lrf")) | |
557 | elif feature_df.loc[i + 1, 'strand'] == '-': | |
558 | #plot a lrr type | |
559 | raise NotImplementedError("can't plot {} patches yet".format("lrr")) | |
560 | # in this case we're only dealing with 'right type' patches | |
561 | elif i == len(feature_df) - 1: | |
562 | # type could be rff or rfr | |
563 | if feature_df.loc[i-1, 'strand'] == '+': | |
564 | if feature_df.loc[i, 'strand'] == '+': | |
565 | # plot a rff type | |
566 | panel, patch = _plot_rff(panel, feature_df.iloc[i-1,], feature_df.iloc[i,], | |
567 | colorMap, y_pos, bar_thickness, text) | |
568 | myPatches.append(patch) | |
569 | elif feature_df.loc[i, 'strand'] == '-': | |
570 | #plot a rfr type | |
571 | raise NotImplementedError("can't plot {} patches yet".format("rfr")) | |
572 | # or type could be rrr or rrf | |
573 | elif feature_df.loc[i-1, 'strand'] == '-': | |
574 | if feature_df.loc[i, 'strand'] == '+': | |
575 | # plot a rrf type | |
576 | raise NotImplementedError("can't plot {} patches yet".format("rrf")) | |
577 | elif feature_df.loc[i, 'strand'] == '-': | |
578 | #plot a rrr type | |
579 | raise NotImplementedError("can't plot {} patches yet".format("rrr")) | |
580 | return panel, myPatches |
0 | # Binary search tree that holds status of sweep line. Only leaves hold values. | |
1 | # Operations for finding left and right neighbors of a query point p and finding which segments contain p. | |
2 | # Author: Sam Lichtenberg | |
3 | # Email: splichte@princeton.edu | |
4 | # Date: 09/02/2013 | |
5 | ||
6 | from pauvre.lsi.helper import * | |
7 | ||
8 | ev = 0.00000001 | |
9 | ||
10 | class Q: | |
11 | def __init__(self, key, value): | |
12 | self.key = key | |
13 | self.value = value | |
14 | self.left = None | |
15 | self.right = None | |
16 | ||
17 | def find(self, key): | |
18 | if self.key is None: | |
19 | return False | |
20 | c = compare_by_y(key, self.key) | |
21 | if c==0: | |
22 | return True | |
23 | elif c==-1: | |
24 | if self.left: | |
25 | return self.left.find(key) | |
26 | else: | |
27 | return False | |
28 | else: | |
29 | if self.right: | |
30 | return self.right.find(key) | |
31 | else: | |
32 | return False | |
33 | def insert(self, key, value): | |
34 | if self.key is None: | |
35 | self.key = key | |
36 | self.value = value | |
37 | return | |
38 | c = compare_by_y(key, self.key) | |
38 | if c==0: | |
39 | self.value += value | |
40 | elif c==-1: | |
41 | if self.left is None: | |
42 | self.left = Q(key, value) | |
43 | else: | |
44 | self.left.insert(key, value) | |
45 | else: | |
46 | if self.right is None: | |
47 | self.right = Q(key, value) | |
48 | else: | |
49 | self.right.insert(key, value) | |
50 | # must return key AND value | |
51 | def get_and_del_min(self, parent=None): | |
52 | if self.left is not None: | |
53 | return self.left.get_and_del_min(self) | |
54 | else: | |
55 | k = self.key | |
56 | v = self.value | |
57 | if parent: | |
58 | parent.left = self.right | |
59 | # i.e. is root node | |
60 | else: | |
61 | if self.right: | |
62 | self.key = self.right.key | |
63 | self.value = self.right.value | |
64 | self.left = self.right.left | |
65 | self.right = self.right.right | |
66 | else: | |
67 | self.key = None | |
68 | return k,v | |
69 | ||
70 | def print_tree(self): | |
71 | if self.left: | |
72 | self.left.print_tree() | |
73 | print(self.key) | |
74 | print(self.value) | |
75 | if self.right: | |
76 | self.right.print_tree() |
0 | # Binary search tree that holds status of sweep line. Only leaves hold values. | |
1 | # Operations for finding left and right neighbors of a query point p and finding which segments contain p. | |
2 | # Author: Sam Lichtenberg | |
3 | # Email: splichte@princeton.edu | |
4 | # Date: 09/02/2013 | |
5 | ||
6 | from pauvre.lsi.helper import * | |
7 | ||
8 | ev = 0.00000001 | |
9 | ||
10 | class T: | |
11 | def __init__(self): | |
12 | self.root = Node(None, None, None, None) | |
13 | def contain_p(self, p): | |
14 | if self.root.value is None: | |
15 | return [[], []] | |
16 | lists = [[], []] | |
17 | self.root.contain_p(p, lists) | |
18 | return (lists[0], lists[1]) | |
19 | def get_left_neighbor(self, p): | |
20 | if self.root.value is None: | |
21 | return None | |
22 | return self.root.get_left_neighbor(p) | |
23 | def get_right_neighbor(self, p): | |
24 | if self.root.value is None: | |
25 | return None | |
26 | return self.root.get_right_neighbor(p) | |
27 | def insert(self, key, s): | |
28 | if self.root.value is None: | |
29 | self.root.left = Node(s, None, None, self.root) | |
30 | self.root.value = s | |
31 | self.root.m = get_slope(s) | |
32 | else: | |
33 | (node, path) = self.root.find_insert_pt(key, s) | |
34 | if path == 'r': | |
35 | node.right = Node(s, None, None, node) | |
36 | node.right.adjust() | |
37 | elif path == 'l': | |
38 | node.left = Node(s, None, None, node) | |
39 | else: | |
40 | # this means matching Node was a leaf | |
41 | # need to make a new internal Node | |
42 | if node.compare_to_key(key) < 0 or (node.compare_to_key(key)==0 and node.compare_lower(key, s) < 1): | |
43 | new_internal = Node(s, None, node, node.parent) | |
44 | new_leaf = Node(s, None, None, new_internal) | |
45 | new_internal.left = new_leaf | |
46 | if node is node.parent.left: | |
47 | node.parent.left = new_internal | |
48 | node.adjust() | |
49 | else: | |
50 | node.parent.right = new_internal | |
51 | else: | |
52 | new_internal = Node(node.value, node, None, node.parent) | |
53 | new_leaf = Node(s, None, None, new_internal) | |
54 | new_internal.right = new_leaf | |
55 | if node is node.parent.left: | |
56 | node.parent.left = new_internal | |
57 | new_leaf.adjust() | |
58 | else: | |
59 | node.parent.right = new_internal | |
60 | node.parent = new_internal | |
61 | ||
62 | def delete(self, p, s): | |
63 | key = p | |
64 | node = self.root.find_delete_pt(key, s) | |
65 | val = node.value | |
66 | if node is node.parent.left: | |
67 | parent = node.parent.parent | |
68 | if parent is None: | |
69 | if self.root.right is not None: | |
70 | if self.root.right.left or self.root.right.right: | |
71 | self.root = self.root.right | |
72 | self.root.parent = None | |
73 | else: | |
74 | self.root.left = self.root.right | |
75 | self.root.value = self.root.right.value | |
76 | self.root.m = self.root.right.m | |
77 | self.root.right = None | |
78 | else: | |
79 | self.root.left = None | |
80 | self.root.value = None | |
81 | elif node.parent is parent.left: | |
82 | parent.left = node.parent.right | |
83 | node.parent.right.parent = parent | |
84 | else: | |
85 | parent.right = node.parent.right | |
86 | node.parent.right.parent = parent | |
87 | else: | |
88 | parent = node.parent.parent | |
89 | if parent is None: | |
90 | if self.root.left: | |
91 | # switch properties | |
92 | if self.root.left.right or self.root.left.left: | |
93 | self.root = self.root.left | |
94 | self.root.parent = None | |
95 | else: | |
96 | self.root.right = None | |
97 | else: | |
98 | self.root.right = None | |
99 | self.root.value = None | |
100 | elif node.parent is parent.left: | |
101 | parent.left = node.parent.left | |
102 | node.parent.left.parent = parent | |
103 | farright = node.parent.left | |
104 | while farright.right is not None: | |
105 | farright = farright.right | |
106 | farright.adjust() | |
107 | else: | |
108 | parent.right = node.parent.left | |
109 | node.parent.left.parent = parent | |
110 | farright = node.parent.left | |
111 | while farright.right is not None: | |
112 | farright = farright.right | |
113 | farright.adjust() | |
114 | return val | |
115 | ||
116 | def print_tree(self): | |
117 | self.root.print_tree() | |
118 | class Node: | |
119 | def __init__(self, value, left, right, parent): | |
120 | self.value = value # associated line segment | |
121 | self.left = left | |
122 | self.right = right | |
123 | self.parent = parent | |
124 | self.m = None | |
125 | if value is not None: | |
126 | self.m = get_slope(value) | |
127 | ||
128 | # compares line segment at y-val of p to p | |
129 | # TODO: remove this and replace with get_x_at | |
130 | def compare_to_key(self, p): | |
131 | x0 = self.value[0][0] | |
132 | y0 = self.value[0][1] | |
133 | y1 = p[1] | |
134 | if self.m != 0 and self.m is not None: | |
135 | x1 = x0 - float(y0-y1)/self.m | |
136 | return compare_by_x(p, (x1, y1)) | |
137 | else: | |
138 | # horizontal or vertical segment: treated as equal at this y | |
139 | return 0 | |
140 | ||
141 | def get_left_neighbor(self, p): | |
142 | neighbor = None | |
143 | n = self | |
144 | if n.left is None and n.right is None: | |
145 | return neighbor | |
146 | last_right = None | |
147 | found = False | |
148 | while not found: | |
149 | c = n.compare_to_key(p) | |
150 | if c < 1 and n.left: | |
151 | n = n.left | |
152 | elif c==1 and n.right: | |
153 | n = n.right | |
154 | last_right = n.parent | |
155 | else: | |
156 | found = True | |
157 | c = n.compare_to_key(p) | |
158 | if c==0: | |
159 | if n is n.parent.right: | |
160 | return n.parent | |
161 | else: | |
162 | goright = None | |
163 | if last_right: | |
164 | goright = last_right.left | |
165 | return self.get_lr(None, goright)[0] | |
166 | # n stores the highest-value in the left subtree | |
167 | if c==-1: | |
168 | goright = None | |
169 | if last_right: | |
170 | goright = last_right.left | |
171 | return self.get_lr(None, goright)[0] | |
172 | if c==1: | |
173 | neighbor = n | |
174 | return neighbor | |
175 | ||
176 | def get_right_neighbor(self, p): | |
177 | neighbor = None | |
178 | n = self | |
179 | if n.left is None and n.right is None: | |
180 | return neighbor | |
181 | last_left = None | |
182 | found = False | |
183 | while not found: | |
184 | c = n.compare_to_key(p) | |
185 | if c==0 and n.right: | |
186 | n = n.right | |
187 | elif c < 0 and n.left: | |
188 | n = n.left | |
189 | last_left = n.parent | |
190 | elif c==1 and n.right: | |
191 | n = n.right | |
192 | else: | |
193 | found = True | |
194 | c = n.compare_to_key(p) | |
195 | # can be c==0 and n.left if at root node | |
196 | if c==0: | |
197 | if n.parent is None: | |
198 | return None | |
199 | if n is n.parent.right: | |
200 | goleft = None | |
201 | if last_left: | |
202 | goleft = last_left.right | |
203 | return self.get_lr(goleft, None)[1] | |
204 | else: | |
205 | return self.get_lr(n.parent.right, None)[1] | |
206 | if c==1: | |
207 | goleft = None | |
208 | if last_left: | |
209 | goleft = last_left.right | |
210 | return self.get_lr(goleft, None)[1] | |
211 | if c==-1: | |
212 | return n | |
213 | return neighbor | |
214 | ||
215 | # travels down a single direction to get neighbors | |
216 | def get_lr(self, left, right): | |
217 | lr = [None, None] | |
218 | if left: | |
219 | while left.left: | |
220 | left = left.left | |
221 | lr[1] = left | |
222 | if right: | |
223 | while right.right: | |
224 | right = right.right | |
225 | lr[0] = right | |
226 | return lr | |
227 | ||
228 | def contain_p(self, p, lists): | |
229 | c = self.compare_to_key(p) | |
230 | if c==0: | |
231 | if self.left is None and self.right is None: | |
232 | if compare_by_x(p, self.value[1])==0: | |
233 | lists[1].append(self.value) | |
234 | else: | |
235 | lists[0].append(self.value) | |
236 | if self.left: | |
237 | self.left.contain_p(p, lists) | |
238 | if self.right: | |
239 | self.right.contain_p(p, lists) | |
240 | elif c < 0: | |
241 | if self.left: | |
242 | self.left.contain_p(p, lists) | |
243 | else: | |
244 | if self.right: | |
245 | self.right.contain_p(p, lists) | |
246 | ||
247 | def find_insert_pt(self, key, seg): | |
248 | if self.left and self.right: | |
249 | if self.compare_to_key(key) == 0 and self.compare_lower(key, seg)==1: | |
250 | return self.right.find_insert_pt(key, seg) | |
251 | elif self.compare_to_key(key) < 1: | |
252 | return self.left.find_insert_pt(key, seg) | |
253 | else: | |
254 | return self.right.find_insert_pt(key, seg) | |
255 | # this case only happens at root | |
256 | elif self.left: | |
257 | if self.compare_to_key(key) == 0 and self.compare_lower(key, seg)==1: | |
258 | return (self, 'r') | |
259 | elif self.compare_to_key(key) < 1: | |
260 | return self.left.find_insert_pt(key, seg) | |
261 | else: | |
262 | return (self, 'r') | |
263 | else: | |
264 | return (self, 'n') | |
265 | ||
266 | # adjusts stored segments in inner nodes | |
267 | def adjust(self): | |
268 | value = self.value | |
269 | m = self.m | |
270 | parent = self.parent | |
271 | node = self | |
272 | # go up left as much as possible | |
273 | while parent and node is parent.right: | |
274 | node = parent | |
275 | parent = node.parent | |
276 | # parent to adjust will be on the immediate right | |
277 | if parent and node is parent.left: | |
278 | parent.value = value | |
279 | parent.m = m | |
280 | ||
281 | def compare_lower(self, p, s2): | |
282 | y = p[1] - 10 | |
283 | key = get_x_at(s2, (p[0], y)) | |
284 | return self.compare_to_key(key) | |
285 | ||
286 | # returns matching leaf node, or None if no match | |
287 | # when deleting, you don't delete below; you delete above, so compare_lower == -1. | |
288 | def find_delete_pt(self, key, value): | |
289 | if self.left and self.right: | |
290 | # if equal at this pt, and this node's value is less than the seg's slightly above this pt | |
291 | if self.compare_to_key(key) == 0 and self.compare_lower(key, value)==-1: | |
292 | return self.right.find_delete_pt(key, value) | |
293 | if self.compare_to_key(key) < 1: | |
294 | return self.left.find_delete_pt(key, value) | |
295 | else: | |
296 | return self.right.find_delete_pt(key, value) | |
297 | elif self.left: | |
298 | if self.compare_to_key(key) < 1: | |
299 | return self.left.find_delete_pt(key, value) | |
300 | else: | |
301 | return None | |
302 | # is leaf | |
303 | else: | |
304 | if self.compare_to_key(key)==0 and segs_equal(self.value, value): | |
305 | return self | |
306 | else: | |
307 | return None | |
308 | ||
309 | # also prints depth of each node | |
310 | def print_tree(self, l=0): | |
311 | l += 1 | |
312 | if self.left: | |
313 | self.left.print_tree(l) | |
314 | if self.left or self.right: | |
315 | print('INTERNAL: {0}'.format(l)) | |
316 | else: | |
317 | print('LEAF: {0}'.format(l)) | |
318 | print(self) | |
319 | print(self.value) | |
320 | if self.right: | |
321 | self.right.print_tree(l) |
0 | # Helper functions for use in the lsi implementation. | |
1 | ||
2 | ev = 0.0000001 | |
3 | # floating-point comparison | |
4 | def approx_equal(a, b, tol): | |
5 | return abs(a - b) < tol | |
6 | ||
7 | # compares x-values of two pts | |
8 | # used for ordering in T | |
9 | def compare_by_x(k1, k2): | |
10 | if approx_equal(k1[0], k2[0], ev): | |
11 | return 0 | |
12 | elif k1[0] < k2[0]: | |
13 | return -1 | |
14 | else: | |
15 | return 1 | |
16 | ||
17 | # higher y value is "less"; if y value equal, lower x value is "less" | |
18 | # used for ordering in Q | |
19 | def compare_by_y(k1, k2): | |
20 | if approx_equal(k1[1], k2[1], ev): | |
21 | if approx_equal(k1[0], k2[0], ev): | |
22 | return 0 | |
23 | elif k1[0] < k2[0]: | |
24 | return -1 | |
25 | else: | |
26 | return 1 | |
27 | elif k1[1] > k2[1]: | |
28 | return -1 | |
29 | else: | |
30 | return 1 | |
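`compare_by_y` defines the sweep order used by the event queue `Q` above: events are handled top-to-bottom, with ties broken left-to-right. A minimal self-contained sketch of that ordering contract, with the comparator and tolerance copied inline from this module:

```python
import functools

ev = 0.0000001  # floating-point tolerance, as in this module

def approx_equal(a, b, tol):
    return abs(a - b) < tol

def compare_by_y(k1, k2):
    # higher y value is "less"; if y is equal, lower x value is "less"
    if approx_equal(k1[1], k2[1], ev):
        if approx_equal(k1[0], k2[0], ev):
            return 0
        return -1 if k1[0] < k2[0] else 1
    return -1 if k1[1] > k2[1] else 1

# sweep order: top-to-bottom, ties broken left-to-right
events = [(3.0, 1.0), (2.0, 5.0), (1.0, 5.0), (0.0, 0.0)]
sweep_order = sorted(events, key=functools.cmp_to_key(compare_by_y))
# sweep_order == [(1.0, 5.0), (2.0, 5.0), (3.0, 1.0), (0.0, 0.0)]
```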
31 | ||
32 | # tests if s0 and s1 represent the same segment (i.e. pts can be in 2 different orders) | |
33 | def segs_equal(s0, s1): | |
34 | x00 = s0[0][0] | |
35 | y00 = s0[0][1] | |
36 | x01 = s0[1][0] | |
37 | y01 = s0[1][1] | |
38 | x10 = s1[0][0] | |
39 | y10 = s1[0][1] | |
40 | x11 = s1[1][0] | |
41 | y11 = s1[1][1] | |
42 | if (approx_equal(x00, x10, ev) and approx_equal(y00, y10, ev)): | |
43 | if (approx_equal(x01, x11, ev) and approx_equal(y01, y11, ev)): | |
44 | return True | |
45 | if (approx_equal(x00, x11, ev) and approx_equal(y00, y11, ev)): | |
46 | if (approx_equal(x01, x10, ev) and approx_equal(y01, y10, ev)): | |
47 | return True | |
48 | return False | |
49 | ||
50 | # get m for a given seg in (p1, p2) form | |
51 | def get_slope(s): | |
52 | x0 = s[0][0] | |
53 | y0 = s[0][1] | |
54 | x1 = s[1][0] | |
55 | y1 = s[1][1] | |
56 | if (x1-x0)==0: | |
57 | return None | |
58 | else: | |
59 | return float(y1-y0)/(x1-x0) | |
60 | ||
61 | # given a point p, return the point on s that shares p's y-val | |
62 | def get_x_at(s, p): | |
63 | m = get_slope(s) | |
64 | # TODO: this should check if p's x-val is actually on seg; we're assuming | |
65 | # for now that it would have been deleted already if not | |
66 | if m == 0: # horizontal segment | |
67 | return p | |
68 | # ditto; should check if y-val on seg | |
69 | if m is None: # vertical segment | |
70 | return (s[0][0], p[1]) | |
71 | x1 = s[0][0]-(s[0][1]-p[1])/m | |
72 | return (x1, p[1]) | |
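`get_x_at` projects a point horizontally onto a segment's supporting line, which is how the status tree compares segments at the sweep line's current y. A quick self-contained check, with both helpers copied inline:

```python
def get_slope(s):
    # slope m of a segment in ((x, y), (x, y)) form; None if vertical
    (x0, y0), (x1, y1) = s
    if x1 - x0 == 0:
        return None
    return float(y1 - y0) / (x1 - x0)

def get_x_at(s, p):
    # point on s sharing p's y-value (same caveats as noted above)
    m = get_slope(s)
    if m == 0:       # horizontal: p already shares the y-value
        return p
    if m is None:    # vertical: x is fixed
        return (s[0][0], p[1])
    return (s[0][0] - (s[0][1] - p[1]) / m, p[1])

# the segment from (0, 0) to (4, 4) passes y = 2 at x = 2
pt = get_x_at(((0, 0), (4, 4)), (9, 2))   # (2.0, 2)
```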
73 | ||
74 | # returns the point at which two line segments intersect, or None if no intersection. | |
75 | def intersect(seg1, seg2): | |
76 | p = seg1[0] | |
77 | r = (seg1[1][0]-seg1[0][0], seg1[1][1]-seg1[0][1]) | |
78 | q = seg2[0] | |
79 | s = (seg2[1][0]-seg2[0][0], seg2[1][1]-seg2[0][1]) | |
80 | denom = r[0]*s[1]-r[1]*s[0] | |
81 | if denom == 0: | |
82 | return None | |
83 | numer = float(q[0]-p[0])*s[1]-(q[1]-p[1])*s[0] | |
84 | t = numer/denom | |
85 | numer = float(q[0]-p[0])*r[1]-(q[1]-p[1])*r[0] | |
86 | u = numer/denom | |
87 | if (t < 0 or t > 1) or (u < 0 or u > 1): | |
88 | return None | |
89 | x = p[0]+t*r[0] | |
90 | y = p[1]+t*r[1] | |
91 | return (x, y) | |
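`intersect` solves the parametric system p + t·r = q + u·s for the two segments and accepts the solution only when both t and u lie in [0, 1]. A self-contained check, with the function copied inline:

```python
def intersect(seg1, seg2):
    # parametric form: seg1 = p + t*r, seg2 = q + u*s
    p = seg1[0]
    r = (seg1[1][0] - seg1[0][0], seg1[1][1] - seg1[0][1])
    q = seg2[0]
    s = (seg2[1][0] - seg2[0][0], seg2[1][1] - seg2[0][1])
    denom = r[0] * s[1] - r[1] * s[0]
    if denom == 0:   # parallel (or collinear) segments
        return None
    t = (float(q[0] - p[0]) * s[1] - (q[1] - p[1]) * s[0]) / denom
    u = (float(q[0] - p[0]) * r[1] - (q[1] - p[1]) * r[0]) / denom
    if not (0 <= t <= 1 and 0 <= u <= 1):
        return None  # lines cross, but outside one of the segments
    return (p[0] + t * r[0], p[1] + t * r[1])

# the two diagonals of a square cross in its center
hit = intersect(((0, 0), (2, 2)), ((0, 2), (2, 0)))   # (1.0, 1.0)
miss = intersect(((0, 0), (1, 1)), ((2, 2), (3, 3)))  # collinear: None
```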
92 | ||
93 |
0 | # Implementation of the Bentley-Ottmann algorithm, described in deBerg et al, ch. 2. | |
1 | # See README for more information. | |
2 | # Author: Sam Lichtenberg | |
3 | # Email: splichte@princeton.edu | |
4 | # Date: 09/02/2013 | |
5 | ||
6 | from pauvre.lsi.Q import Q | |
7 | from pauvre.lsi.T import T | |
8 | from pauvre.lsi.helper import * | |
9 | ||
10 | # "close enough" for floating point | |
11 | ev = 0.00000001 | |
12 | ||
13 | # how much lower to get the x of a segment, to determine which of a set of segments is the farthest right/left | |
14 | lower_check = 100 | |
15 | ||
16 | # gets the point on a segment at a lower y value. | |
17 | def getNextPoint(p, seg, y_lower): | |
18 | p1 = seg[0] | |
19 | p2 = seg[1] | |
20 | if (p1[0]-p2[0])==0: | |
21 | return (p[0]+10, p[1]) | |
22 | slope = float(p1[1]-p2[1])/(p1[0]-p2[0]) | |
23 | if slope==0: | |
24 | return (p1[0], p[1]-y_lower) | |
25 | y = p[1]-y_lower | |
26 | x = p1[0]-(p1[1]-y)/slope | |
27 | return (x, y) | |
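`getNextPoint` evaluates the segment's supporting line at y = p[1] - y_lower, nudging vertical segments sideways instead; the sweep uses it to rank segments just below an event point. A quick check, with the function copied inline (renamed only to mark it as a standalone sketch):

```python
def get_next_point(p, seg, y_lower):
    # standalone copy of getNextPoint above
    p1, p2 = seg
    if (p1[0] - p2[0]) == 0:   # vertical: nudge x instead of y
        return (p[0] + 10, p[1])
    slope = float(p1[1] - p2[1]) / (p1[0] - p2[0])
    if slope == 0:             # horizontal: only the y drops
        return (p1[0], p[1] - y_lower)
    y = p[1] - y_lower
    x = p1[0] - (p1[1] - y) / slope
    return (x, y)

# on the segment y = 10 - x, stepping 5 below (0, 10) lands at (5, 5)
pt = get_next_point((0, 10), ((0, 10), (10, 0)), 5)   # (5.0, 5)
```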
28 | ||
29 | """ | |
30 | for each event point: | |
31 | U_p = segments that have p as an upper endpoint | |
32 | C_p = segments that contain p | |
33 | L_p = segments that have p as a lower endpoint | |
34 | """ | |
35 | def handle_event_point(p, segs, q, t, intersections): | |
36 | rightmost = (float("-inf"), 0) | |
37 | rightmost_seg = None | |
38 | leftmost = (float("inf"), 0) | |
39 | leftmost_seg = None | |
40 | ||
41 | U_p = segs | |
42 | (C_p, L_p) = t.contain_p(p) | |
43 | merge_all = U_p+C_p+L_p | |
44 | if len(merge_all) > 1: | |
45 | intersections[p] = [] | |
46 | for s in merge_all: | |
47 | intersections[p].append(s) | |
48 | merge_CL = C_p+L_p | |
49 | merge_UC = U_p+C_p | |
50 | for s in merge_CL: | |
51 | # deletes at a point slightly above (to break ties) - where seg is located in tree | |
52 | # above intersection point | |
53 | t.delete(p, s) | |
54 | # put segments into T based on where they are at y-val just below p[1] | |
55 | for s in merge_UC: | |
56 | n = getNextPoint(p, s, lower_check) | |
57 | if n[0] > rightmost[0]: | |
58 | rightmost = n | |
59 | rightmost_seg = s | |
60 | if n[0] < leftmost[0]: | |
61 | leftmost = n | |
62 | leftmost_seg = s | |
63 | t.insert(p, s) | |
64 | ||
65 | # means only L_p -> check newly-neighbored segments | |
66 | if len(merge_UC) == 0: | |
67 | neighbors = (t.get_left_neighbor(p), t.get_right_neighbor(p)) | |
68 | if neighbors[0] and neighbors[1]: | |
69 | find_new_event(neighbors[0].value, neighbors[1].value, p, q) | |
70 | ||
71 | # of newly inserted pts, find possible intersections to left and right | |
72 | else: | |
73 | left_neighbor = t.get_left_neighbor(p) | |
74 | if left_neighbor: | |
75 | find_new_event(left_neighbor.value, leftmost_seg, p, q) | |
76 | right_neighbor = t.get_right_neighbor(p) | |
77 | if right_neighbor: | |
78 | find_new_event(right_neighbor.value, rightmost_seg, p, q) | |
79 | ||
80 | def find_new_event(s1, s2, p, q): | |
81 | i = intersect(s1, s2) | |
82 | if i: | |
83 | if compare_by_y(i, p) == 1: | |
84 | if not q.find(i): | |
85 | q.insert(i, []) | |
86 | ||
87 | # segment is in ((x, y), (x, y)) form | |
88 | # first pt in a segment should have higher y-val - this is handled in function | |
89 | def intersection(S): | |
90 | s0 = S[0] | |
91 | if s0[1][1] > s0[0][1]: | |
92 | s0 = (s0[1], s0[0]) | |
93 | q = Q(s0[0], [s0]) | |
94 | q.insert(s0[1], []) | |
95 | intersections = {} | |
96 | for s in S[1:]: | |
97 | if s[1][1] > s[0][1]: | |
98 | s = (s[1], s[0]) | |
99 | q.insert(s[0], [s]) | |
100 | q.insert(s[1], []) | |
101 | t = T() | |
102 | while q.key: | |
103 | p, segs = q.get_and_del_min() | |
104 | handle_event_point(p, segs, q, t, intersections) | |
105 | return intersections | |
106 |
0 | # Test file for lsi. | |
1 | # Author: Sam Lichtenberg | |
2 | # Email: splichte@princeton.edu | |
3 | # Date: 09/02/2013 | |
4 | ||
5 | from pauvre.lsi.lsi import intersection | |
6 | import random | |
7 | import time, sys | |
8 | from pauvre.lsi.helper import * | |
9 | ||
10 | ev = 0.00000001 | |
11 | ||
12 | def scale(i): | |
13 | return float(i) | |
14 | ||
15 | use_file = None | |
16 | try: | |
17 | use_file = sys.argv[2] | |
18 | except IndexError: | |
19 | pass | |
20 | ||
21 | if not use_file: | |
22 | S = [] | |
23 | for i in range(int(sys.argv[1])): | |
24 | p1 = (scale(random.randint(0, 1000)), scale(random.randint(0, 1000))) | |
25 | p2 = (scale(random.randint(0, 1000)), scale(random.randint(0, 1000))) | |
26 | s = (p1, p2) | |
27 | S.append(s) | |
28 | f = open('input', 'w') | |
29 | f.write(str(S)) | |
30 | f.close() | |
31 | ||
32 | else: | |
33 | f = open(sys.argv[2], 'r') | |
34 | S = eval(f.read())  # trusts the input file; ast.literal_eval would be safer | |
35 | ||
36 | intersections = [] | |
37 | seen = [] | |
38 | vs = False | |
39 | hs = False | |
40 | es = False | |
41 | now = time.time() | |
42 | for seg1 in S: | |
43 | if approx_equal(seg1[0][0], seg1[1][0], ev): | |
44 | print('VERTICAL SEG') | |
45 | print('') | |
46 | print('') | |
47 | vs = True | |
48 | if approx_equal(seg1[0][1], seg1[1][1], ev): | |
49 | print('HORIZONTAL SEG') | |
50 | print('') | |
51 | print('') | |
52 | hs = True | |
53 | for seg2 in S: | |
54 | if seg1 is not seg2 and segs_equal(seg1, seg2): | |
55 | print('EQUAL SEGS') | |
56 | print('') | |
57 | print('') | |
58 | es = True | |
59 | if seg1 is not seg2 and (seg2, seg1) not in seen: | |
60 | i = intersect(seg1, seg2) | |
61 | if i: | |
62 | intersections.append((i, [seg1, seg2])) | |
63 | # xpts = [seg1[0][0], seg1[1][0], seg2[0][0], seg2[1][0]] | |
64 | # xpts = sorted(xpts) | |
65 | # if (i[0] <= xpts[2] and i[0] >= xpts[1]: | |
66 | # intersections.append((i, [seg1, seg2])) | |
67 | seen.append((seg1, seg2)) | |
68 | later = time.time() | |
69 | n2time = later-now | |
70 | print("Line sweep results:") | |
71 | now = time.time() | |
72 | lsinters = intersection(S) | |
73 | inters = [] | |
74 | for k, v in lsinters.items(): | |
75 | #print '{0}: {1}'.format(k, v) | |
76 | inters.append(k) | |
77 | # inters.append(v) | |
78 | later = time.time() | |
79 | print('TIME ELAPSED: {0}'.format(later-now)) | |
80 | print("N^2 comparison results:") | |
81 | pts_seen = [] | |
82 | highestseen = 0 | |
83 | for i in intersections: | |
84 | seen_already = False | |
85 | seen = 0 | |
86 | for p in pts_seen: | |
87 | if approx_equal(i[0][0], p[0], ev) and approx_equal(i[0][1], p[1], ev): | |
88 | seen += 1 | |
89 | seen_already = True | |
90 | if seen > highestseen: | |
91 | highestseen = seen | |
92 | if not seen_already: | |
93 | pts_seen.append(i[0]) | |
94 | in_k = False | |
95 | for k in inters: | |
96 | if approx_equal(k[0], i[0][0], ev) and approx_equal(k[1], i[0][1], ev): | |
97 | in_k = True | |
98 | if in_k == False: | |
99 | print('Not in K: {0}: {1}'.format(i[0], i[1])) | |
100 | # print i | |
101 | print(highestseen) | |
102 | print('TIME ELAPSED: {0}'.format(n2time)) | |
103 | #print 'Missing from line sweep but in N^2:' | |
104 | #for i in seen: | |
105 | # matched = False | |
106 | print(len(lsinters)) | |
107 | print(len(pts_seen)) | |
108 | if len(lsinters) != len(pts_seen): | |
109 | print('uh oh!') | |
0 | #!/usr/bin/env python | |
1 | # -*- coding: utf-8 -*- | |
2 | ||
3 | # pauvre | |
4 | # Copyright (c) 2016-2020 Darrin T. Schultz. | |
5 | # | |
6 | # This file is part of pauvre. | |
7 | # | |
8 | # pauvre is free software: you can redistribute it and/or modify | |
9 | # it under the terms of the GNU General Public License as published by | |
10 | # the Free Software Foundation, either version 3 of the License, or | |
11 | # (at your option) any later version. | |
12 | # | |
13 | # pauvre is distributed in the hope that it will be useful, | |
14 | # but WITHOUT ANY WARRANTY; without even the implied warranty of | |
15 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
16 | # GNU General Public License for more details. | |
17 | # | |
18 | # You should have received a copy of the GNU General Public License | |
19 | # along with pauvre. If not, see <http://www.gnu.org/licenses/>. | |
20 | ||
21 | import ast | |
22 | import matplotlib | |
23 | matplotlib.use('Agg') | |
24 | import matplotlib.pyplot as plt | |
25 | import matplotlib.patches as mplpatches | |
26 | from matplotlib.colors import LinearSegmentedColormap | |
27 | import numpy as np | |
28 | import pandas as pd | |
29 | import os.path as opath | |
30 | from sys import stderr | |
31 | from pauvre.functions import parse_fastq_length_meanqual, print_images, filter_fastq_length_meanqual | |
32 | from pauvre.stats import stats | |
33 | import pauvre.rcparams as rc | |
34 | import logging | |
35 | ||
36 | # logging | |
37 | logger = logging.getLogger('pauvre') | |
38 | ||
39 | ||
40 | def generate_panel(panel_left, panel_bottom, panel_width, panel_height, | |
41 | axis_tick_param='both', which_tick_param='both', | |
42 | bottom_tick_param=True, label_bottom_tick_param=True, | |
43 | left_tick_param=True, label_left_tick_param=True, | |
44 | right_tick_param=False, label_right_tick_param=False, | |
45 | top_tick_param=False, label_top_tick_param=False): | |
46 | """ | |
47 | Setting default panel tick parameters. Some of these are the defaults | |
48 | for matplotlib anyway, but specifying them for readability. Here are | |
49 | options and defaults for the parameters used below: | |
50 | ||
51 | axis : {'x', 'y', 'both'}; which axis to modify; default = 'both' | |
52 | which : {'major', 'minor', 'both'}; which ticks to modify; | |
53 | default = 'major' | |
54 | bottom, top, left, right : bool or {True, False}; ticks on or off; | |
55 | labelbottom, labeltop, labelleft, labelright : bool or {True, False} | |
56 | """ | |
57 | ||
58 | # create the panel | |
59 | panel_rectangle = [panel_left, panel_bottom, panel_width, panel_height] | |
60 | panel = plt.axes(panel_rectangle) | |
61 | ||
62 | # Set tick parameters | |
63 | panel.tick_params(axis=axis_tick_param, | |
64 | which=which_tick_param, | |
65 | bottom=bottom_tick_param, | |
66 | labelbottom=label_bottom_tick_param, | |
67 | left=left_tick_param, | |
68 | labelleft=label_left_tick_param, | |
69 | right=right_tick_param, | |
70 | labelright=label_right_tick_param, | |
71 | top=top_tick_param, | |
72 | labeltop=label_top_tick_param) | |
73 | ||
74 | return panel | |
75 | ||
76 | ||
77 | def _generate_histogram_bin_patches(panel, bins, bin_values, horizontal=True): | |
78 | """This helper method generates the histogram that is added to the panel. | |
79 | ||
80 | In this case, horizontal = True applies to the mean quality histogram. | |
81 | So, horizontal = False only applies to the length histogram. | |
82 | """ | |
83 | l_width = 0.0 | |
84 | f_color = (0.5, 0.5, 0.5) | |
85 | e_color = (0, 0, 0) | |
86 | if horizontal: | |
87 | for step in np.arange(0, len(bin_values), 1): | |
88 | left = bins[step] | |
89 | bottom = 0 | |
90 | width = bins[step + 1] - bins[step] | |
91 | height = bin_values[step] | |
92 | hist_rectangle = mplpatches.Rectangle((left, bottom), width, height, | |
93 | linewidth=l_width, | |
94 | facecolor=f_color, | |
95 | edgecolor=e_color) | |
96 | panel.add_patch(hist_rectangle) | |
97 | else: | |
98 | for step in np.arange(0, len(bin_values), 1): | |
99 | left = 0 | |
100 | bottom = bins[step] | |
101 | width = bin_values[step] | |
102 | height = bins[step + 1] - bins[step] | |
103 | ||
104 | hist_rectangle = mplpatches.Rectangle((left, bottom), width, height, | |
105 | linewidth=l_width, | |
106 | facecolor=f_color, | |
107 | edgecolor=e_color) | |
108 | panel.add_patch(hist_rectangle) | |
109 | ||
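The `(bins, bin_values)` pair consumed by `_generate_histogram_bin_patches` is exactly what `np.histogram` returns (in the opposite order): `bin_values[i]` is the count for the half-open interval `[bins[i], bins[i+1])`, so there is always one more edge than count. A sketch with made-up read lengths:

```python
import numpy as np

# hypothetical read lengths, for illustration only
read_lengths = np.array([150, 420, 980, 1020, 1500, 2200, 2300, 4000])

# np.histogram returns (counts, edges); the helper above draws one
# rectangle per count, spanning edges[i] to edges[i+1]
bin_values, bins = np.histogram(read_lengths, bins=4)

assert len(bins) == len(bin_values) + 1
```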