Package list python-pauvre / 8944f59
New upstream version 0.1924 Andreas Tille 1 year, 4 months ago
34 changed file(s) with 6109 addition(s) and 0 deletion(s).
0 include scripts/test.sh
0 Metadata-Version: 1.2
1 Name: pauvre
2 Version: 0.1924
3 Summary: Tools for plotting Oxford Nanopore and other long-read data.
4 Home-page: https://github.com/conchoecia/pauvre
5 Author: Darrin Schultz
6 Author-email: dts@ucsc.edu
7 License: GPLv3
8 Description:
9 'pauvre' is a package for plotting Oxford Nanopore and other long read data.
10 The name means 'poor' in French, a play on words to the oft-used 'pore' prefix
11 for similar packages. This package was designed for python 3, but it might work in
12 python 2. You can visit the GitHub page for more detailed information here:
13 https://github.com/conchoecia/pauvre
14
15 Platform: UNKNOWN
16 Classifier: Development Status :: 2 - Pre-Alpha
17 Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
18 Classifier: Programming Language :: Python :: 3
19 Classifier: Programming Language :: Python :: 3.5
20 Classifier: Operating System :: POSIX :: Linux
21 Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
22 Classifier: Intended Audience :: Science/Research
23 Requires: python (>3.0)
24 Provides: pauvre
25 Requires-Python: >=3
0 [![travis-ci](https://travis-ci.org/conchoecia/pauvre.svg?branch=master)](https://travis-ci.org/conchoecia/pauvre) [![DOI](https://zenodo.org/badge/112774670.svg)](https://zenodo.org/badge/latestdoi/112774670)
1
2
3 ## pauvre: a plotting package designed for nanopore and PacBio long reads
4
5 This package currently hosts four scripts for plotting and/or printing stats.
6
7 - `pauvre marginplot`
8 - Takes a fastq file as input and outputs a marginal histogram with a heatmap.
9 - `pauvre stats`
10 - Takes a fastq file as input and prints out a table of stats, including how many basepairs/reads there are for a length/mean quality cutoff.
11 - This is also automagically called when using `pauvre marginplot`
12 - `pauvre redwood`
13 - I am happy to introduce the redwood plot to the world as a method
14 of representing circular genomes. A redwood plot contains long
15 reads as "rings" on the inside, a gene annotation
16 "cambium/phloem", and an RNAseq "bark". The input is `.bam` files
17 for the long reads and RNAseq data, and a `.gff` file for the
18 annotation. More details to follow as we document this program
19 better...
20 - `pauvre synteny`
21 - Makes a synteny plot of circular genomes. Finds the most
22 parsimonious rotation to display the synteny of all the input
23 genomes with the fewest crossings-over. Input is one `.gff` file
24 per circular genome and one directory of gene alignments.
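
To illustrate what choosing a rotation with the fewest crossings can look like, here is a minimal sketch. It is not pauvre's actual implementation. It assumes each circular genome has already been reduced to an ordered list of gene names, and it counts one crossing for every pair of shared genes that appear in opposite relative order in the two rows. The function names `count_crossings` and `best_rotation` and the gene lists are hypothetical.

```python
from itertools import combinations


def count_crossings(top, bottom):
    """Count pairs of shared genes whose connecting lines would cross."""
    pos_bottom = {gene: i for i, gene in enumerate(bottom)}
    # Positions in the bottom row, listed in top-row order.
    mapped = [pos_bottom[gene] for gene in top if gene in pos_bottom]
    # Every out-of-order pair (an inversion) corresponds to one crossing.
    return sum(1 for a, b in combinations(mapped, 2) if a > b)


def best_rotation(reference, genome):
    """Try every rotation of a circular gene order; keep the one with fewest crossings."""
    rotations = [genome[i:] + genome[:i] for i in range(len(genome))]
    return min(rotations, key=lambda rot: count_crossings(reference, rot))


# Hypothetical gene orders for two circular genomes.
reference = ["COX1", "COX2", "ATP6", "ND1", "CYTB"]
genome = ["ND1", "CYTB", "COX1", "COX2", "ATP6"]

print(count_crossings(reference, genome))
print(best_rotation(reference, genome))
```

This brute-force search checks every rotation and every gene pair, which is fine for small organellar genomes but would need a faster inversion count for long gene lists.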
25
26 ## Updates:
27 - 20200215 - v0.1.924 - Made some minor updates to work with python 3.7 and the latest version of pandas.
28 - 20171130 - v0.1.86 - some changes by @wdecoster to integrate `pauvre` into [nanoplot](https://github.com/wdecoster/NanoPlot),
29 as well as some formatting changes that *may* make `pauvre` work better with python2.7. Adding Travis-CI functionality.
30 - 20171025 - v0.1.83 - added some changes to make marginplot interface
31 with @wdecoster's [nanoPlot](https://github.com/wdecoster/NanoPlot)
32 package, and made `pauvre stats` only output data tables for
33 filtered reads. `pauvre stats` also now has the `--filt_maxlen`,
34 `--filt_maxqual`, `--filt_minlen`, and `--filt_minqual` options.
35 - 20171018 - v0.1.8 - you can now filter reads and adjust the plotting viewing window.
36 [See below for a demonstration.](#filter-reads-and-adjust-viewing-window) I added the following options:
37
38 ```
39 --filt_maxlen FILT_MAXLEN
40 This sets the max read length filter reads.
41 --filt_maxqual FILT_MAXQUAL
42 This sets the max mean read quality to filter reads.
43 --filt_minlen FILT_MINLEN
44 This sets the min read length to filter reads.
45 --filt_minqual FILT_MINQUAL
46 This sets the min mean read quality to filter reads.
47 --plot_maxlen PLOT_MAXLEN
48 Sets the maximum viewing area in the length dimension.
49 --plot_maxqual PLOT_MAXQUAL
50 Sets the maximum viewing area in the quality
51 dimension.
52 --plot_minlen PLOT_MINLEN
53 Sets the minimum viewing area in the length dimension.
54 --plot_minqual PLOT_MINQUAL
55 Sets the minimum viewing area in the quality
56 dimension.
57 ```
58 - 20171014 - uploading information on `pauvre redwood` and `pauvre synteny` usage.
59 - 20171012 - made `pauvre stats` more consistently produce useful histograms.
60 `pauvre stats` now also calculates some statistics for different size ranges.
61 - 20170529 - added automatic scaling to the input fastq file. It
62 scales to show the highest read quality and the top 99th percentile
63 of reads by length.
64
65 # Requirements
66
67 - You must have the following installed on your system to install this software:
68 - python 3.x
69 - matplotlib
70 - biopython
71 - pandas
72 - pillow
73
74 # Installation
75
76 - Instructions to install on your mac or linux system. Not sure on
77 Windows! Make sure *python 3* is the active environment before
78 installing.
79 - `git clone https://github.com/conchoecia/pauvre.git`
80 - `cd ./pauvre`
81 - `pip3 install .`
82 - Or, install with pip
83 - `pip3 install pauvre`
84
85 # Usage
86
87 ## `stats`
88 - generate basic statistics about the fastq file. For example, if I
89 want to know the number of bases and reads with AT LEAST a PHRED
90 score of 5 and AT LEAST a read length of 500, run the program as below
91 and look at the cells highlighted with `<angle brackets>`.
92 - `pauvre stats --fastq miniDSMN15.fastq`
93
94
95 ```
96 numReads: 1000
97 numBasepairs: 1029114
98 meanLen: 1029.114
99 medianLen: 875.5
100 minLen: 11
101 maxLen: 5337
102 N50: 1278
103 L50: 296
104
105 Basepairs >= bin by mean PHRED and length
106 minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
107 0 1029114 1010681 935366 429279 143948 25139 3668 2938 2000 0
108 500 984212 <968653> 904787 421307 142003 24417 3668 2938 2000 0
109 1000 659842 649319 616788 300948 103122 17251 2000 2000 2000 0
110 et cetera...
111 Number of reads >= bin by mean Phred+Len
112 minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
113 0 1000 969 865 366 118 22 3 2 1 0
114 500 873 <859> 789 347 113 20 3 2 1 0
115 1000 424 418 396 187 62 11 1 1 1 0
116 et cetera...
117 ```
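
As a rough guide to reading the table above, the sketch below shows one common way to compute N50/L50 and a single "reads and basepairs at or above a cutoff" cell. It assumes each read is represented as a `(length, mean_quality)` pair; the helper names `n50_l50` and `count_at_least` and the sample data are hypothetical, and pauvre's own code may differ in detail.

```python
def n50_l50(lengths):
    """Return (N50, L50): the read length at which the longest reads
    cover at least half of all bases, and how many reads that takes."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no reads given")


def count_at_least(reads, min_len, min_qual):
    """Return (number of reads, number of basepairs) for reads with
    length >= min_len and mean quality >= min_qual."""
    kept = [length for length, qual in reads
            if length >= min_len and qual >= min_qual]
    return len(kept), sum(kept)


# Hypothetical reads as (length, mean quality) pairs.
reads = [(1200, 12.5), (800, 9.0), (450, 6.1), (300, 4.2), (2500, 15.3)]

print(n50_l50([length for length, _ in reads]))
print(count_at_least(reads, min_len=500, min_qual=5))
```

Each cell in the two tables corresponds to one call of `count_at_least` with that row's `minLen` and that column's quality cutoff.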
118
119 ## `marginplot`
120
121 ### Basic usage
122 - automatically calls `pauvre stats` for each fastq file
123 - Make the default plot showing the 99th percentile of longest reads
124 - `pauvre marginplot --fastq miniDSMN15.fastq`
125 - ![default](files/default_miniDSMN15.png)
126 - Make a marginal histogram for ONT 2D or 1D^2 cDNA data with a
127 lower maxlen and higher maxqual.
128 - `pauvre marginplot --maxlen 4000 --maxqual 25 --lengthbin 50 --fileform pdf png --qualbin 0.5 --fastq miniDSMN15.fastq`
129 - ![example1](files/miniDSMN15.png)
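
The default view is described as covering the 99th percentile of reads by length. A minimal sketch of such a cutoff follows, assuming a nearest-rank percentile; the function name `percentile_cutoff` and the sample data are hypothetical, and pauvre may use a different percentile method.

```python
import math


def percentile_cutoff(lengths, percentile=99):
    """Nearest-rank percentile: the smallest read length such that at
    least `percentile` percent of reads are that long or shorter."""
    if not lengths:
        raise ValueError("no read lengths given")
    ordered = sorted(lengths)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


# 100 ordinary reads plus one very long outlier.
lengths = list(range(1, 101)) + [50000]

print(percentile_cutoff(lengths))
```

Capping the axis this way keeps a single extremely long read from compressing the rest of the plot into a corner.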
130
131 ### Filter reads and adjust viewing window
132 - Filter out reads with a mean quality less than 5, and a length
133 less than 800. Zoom in to plot only mean quality of at least 4 and
134 read length at least 500bp.
135 - `pauvre marginplot -f miniDSMN15.fastq --filt_minqual 5 --filt_minlen 800 -y --plot_minlen 500 --plot_minqual 4`
136 - ![test4](files/test4.png)
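
The filtering step can be pictured with the sketch below. It assumes a four-line-per-record FASTQ with Phred+33 qualities and takes the mean quality as the arithmetic mean of the per-base Phred scores, which may not match pauvre's exact calculation. The helpers `read_fastq`, `mean_phred`, and `filter_reads` are hypothetical; only the `filt_minlen` and `filt_minqual` names mirror the command-line options.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality string) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual


def mean_phred(qual, offset=33):
    """Arithmetic mean of the Phred scores encoded in a quality string."""
    if not qual:
        return 0.0
    return sum(ord(char) - offset for char in qual) / len(qual)


def filter_reads(records, filt_minlen=0, filt_minqual=0.0):
    """Keep reads at least filt_minlen long with mean quality >= filt_minqual."""
    for name, seq, qual in records:
        if len(seq) >= filt_minlen and mean_phred(qual) >= filt_minqual:
            yield name, seq, qual


# Two toy records: read1 is long and high quality, read2 is short and low quality.
fastq = io.StringIO("@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACG\n+\n###\n")
kept = [name for name, _, _ in
        filter_reads(read_fastq(fastq), filt_minlen=5, filt_minqual=5)]
print(kept)
```

Filtering changes which reads are counted, whereas the `--plot_*` options only change the viewing window.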
137
138 ### Specialized Options
139
140 - Plot ONT 1D data with a large tail
141 - `pauvre marginplot --maxlen 100000 --maxqual 15 --lengthbin 500 <myfile>.fastq`
142 - Get more resolution on lengths
143 - `pauvre marginplot --maxlen 100000 --lengthbin 5 <myfile>.fastq`
144
145 ### Transparency
146
147 - Turn off transparency if you just want a white background
148 - `pauvre marginplot --transparent False <myfile>.fastq`
149 - Note: transparency is the default behavior
150 - ![transparency](files/transparency.001.jpeg)
151
152 # Contributors
153
154 @conchoecia (Darrin Schultz)
155 @mebbert (Mark Ebbert)
156 @wdecoster (Wouter De Coster)
0 from pauvre.version import __version__
0 #!/usr/bin/env python
1 # -*- coding: utf-8 -*-
2
3 # pauvre - just a pore plotting package
4 # Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
5 # twitter @conchoecia
6 #
7 # This file is part of pauvre.
8 #
9 # pauvre is free software: you can redistribute it and/or modify
10 # it under the terms of the GNU General Public License as published by
11 # the Free Software Foundation, either version 3 of the License, or
12 # (at your option) any later version.
13 #
14 # pauvre is distributed in the hope that it will be useful,
15 # but WITHOUT ANY WARRANTY; without even the implied warranty of
16 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
17 # GNU General Public License for more details.
18 #
19 # You should have received a copy of the GNU General Public License
20 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
21 import pysam
22 import pandas as pd
23 import os
24
class BAMParse():
    """This class reads in a bam file and constructs a pandas
    dataframe of all the relevant information for the reads to pass on
    and plot.
    """
    def __init__(self, filename, chrid=None, start=None,
                 stop=None, doubled=None):
        self.filename = filename
        self.doubled = doubled
        # determine the file type from the extension
        self.filetype = os.path.splitext(self.filename)[1]
        # throw an error if the file is not a bam file
        if self.filetype not in ['.bam']:
            raise Exception("""You have provided a file with an extension other than
                            '.bam', please check your command-line arguments""")
        # now make sure there is an index file for the bam file
        if not os.path.exists("{}.bai".format(self.filename)):
            raise Exception("""Your .bam file is there, but it isn't indexed and
                            there isn't a .bai file to go with it. Use
                            'samtools index <yourfile>.bam' to fix it.""")
        # now open the file and just call it a sambam file
        filetype_dict = {'.sam': '', '.bam': 'b'}
        self.sambam = pysam.AlignmentFile(self.filename, "r{}".format(filetype_dict[self.filetype]))
        if chrid is None:
            self.chrid = self.sambam.references[0]
        else:
            self.chrid = chrid
        self.refindex = self.sambam.references.index(self.chrid)
        self.seqlength = self.sambam.lengths[self.refindex]
        self.true_seqlength = self.seqlength if not self.doubled else int(self.seqlength/2)
        if start is None or stop is None:
            # default to the whole (undoubled) reference
            self.start = 1
            self.stop = self.true_seqlength
        else:
            # honor an explicitly requested window
            self.start = start
            self.stop = stop

        self.features = self.parse()
        self.features.sort_values(by=['POS', 'MAPLEN'], ascending=[True, False], inplace=True)
        # drop=True discards the old index instead of keeping it as a column
        self.features.reset_index(drop=True, inplace=True)

        self.raw_depthmap = self.get_depthmap()
        self.features_depthmap = self.get_features_depthmap()
66
67 def get_depthmap(self):
68 depthmap = [0] * (self.stop - self.start + 1)
69 for p in self.sambam.pileup(self.chrid, self.start, self.stop):
70 index = p.reference_pos
71 if index >= self.true_seqlength:
72 index -= self.true_seqlength
73 depthmap[index] += p.nsegments
74 return depthmap
75
76 def get_features_depthmap(self):
77 """this method builds a more accurate pileup that is
78 based on if there is actually a mapped base at any
79 given position or not. better for long reads and RNA"""
80 depthmap = [0] * (self.stop - self.start + 1)
81 print("depthmap is: {} long".format(len(depthmap)))
82 for index, row in self.features.iterrows():
83 thisindex = row["POS"] - self.start
84 for thistup in row["TUPS"]:
85 b_type = thistup[1]
86 b_len = thistup[0]
87 if b_type == "M":
88 for j in range(b_len):
89 #this is necessary to reset the index if we wrap
90 # around to the beginning
91 if self.doubled and thisindex == len(depthmap):
92 thisindex = 0
93 depthmap[thisindex] += 1
94 thisindex += 1
95 elif b_type in ["S", "H", "I"]:
96 pass
97 elif b_type in ["D", "N"]:
98 thisindex += b_len
99 #this is necessary to reset the index if we wrap
100 # around to the beginning
101 if self.doubled and thisindex >= len(depthmap):
102 thisindex = thisindex - len(depthmap)
103
104 return depthmap
105
106 def parse(self):
107 data = {'POS': [], 'MAPQ': [], 'TUPS': [] }
108 for read in self.sambam.fetch(self.chrid, self.start, self.stop):
109 data['POS'].append(read.reference_start + 1)
110 data['TUPS'].append(self.cigar_parse(read.cigartuples))
111 data['MAPQ'].append(read.mapping_quality)
112 features = pd.DataFrame.from_dict(data, orient='columns')
113 features['ALNLEN'] = features['TUPS'].apply(self.aln_len)
114 features['TRULEN'] = features['TUPS'].apply(self.tru_len)
115 features['MAPLEN'] = features['TUPS'].apply(self.map_len)
116 features['POS'] = features['POS'].apply(self.fix_pos)
117 return features
118
119 def cigar_parse(self, tuples):
120 """
121 arguments:
122 <tuples> a CIGAR string tuple list in pysam format
123
124 purpose:
125 This function uses the pysam cigarstring tuples format and returns
126 a list of tuples in the internal format, [(20, 'M'), (5, "I")], et
127 cetera. The zeroth element of each tuple is the number of bases for the
128 CIGAR string feature. The first element of each tuple is the CIGAR
129 string feature type.
130
131 There are several feature types in SAM/BAM files. See below:
132 'M' - match
133 'I' - insertion relative to reference
134 'D' - deletion relative to reference
135 'N' - skipped region from the reference
136 'S' - soft clip, not aligned but still in sam file
137 'H' - hard clip, not aligned and not in sam file
138 'P' - padding (silent deletion from padded reference)
139 '=' - sequence match
140 'X' - sequence mismatch
141 'B' - BAM_CBACK (I don't actually know what this is)
142
143 """
144 # I used the map values from http://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment
145 psam_to_char = {0: 'M', 1: 'I', 2: 'D', 3: 'N', 4: 'S',
146 5: 'H', 6: 'P', 7: '=', 8: 'X', 9: 'B'}
147 return [(value, psam_to_char[feature]) for feature, value in tuples]
148
149 def aln_len(self, TUPS):
150 """
151 arguments:
152 <TUPS> a list of tuples output from the cigar_parse() function.
153
154 purpose:
155 This returns the alignment length of the read to the reference.
156 Specifically, it sums the length of all of the matches and deletions.
157 In effect, this number is length of the region of the reference sequence to
158 which the read maps. This number is probably the most useful for selecting
159 reads to visualize in the mapped read plot.
160 """
161 return sum([pair[0] for pair in TUPS if pair[1] not in ['S', 'H', 'I']])
162
163 def map_len(self, TUPS):
164 """
165 arguments:
166 <TUPS> a list of tuples output from the cigar_parse() function.
167
168 purpose:
169 This function returns the map length (all matches and deletions relative to
170 the reference), plus the unmapped 5' and 3' hard/soft clipped sequences.
171 This number is useful if you want to visualize how much 5' and 3' sequence
172 of a read did not map to the reference. For example, poor quality 5' and 3'
173 tails are common in Nanopore reads.
174 """
175 return sum([pair[0] for pair in TUPS if pair[1] not in ['I']])
176
177 def tru_len(self, TUPS):
178 """
179 arguments:
180 <TUPS> a list of tuples output from the cigar_parse() function.
181
182 purpose:
183 This function returns the total length of the read, including insertions,
184 deletions, matches, soft clips, and hard clips. This is useful for
185 comparing to the map length or alignment length to see what percentage of
186 the read aligned to the reference.
187 """
188 return sum([pair[0] for pair in TUPS])
189
190 def fix_pos(self, start_index):
191 """
192 arguments:
193 an int
194
195 purpose:
196 When using a doubled SAMfile, any reads that start after the first copy
197 of the reference risk running over the plotting window, causing the program
198 to crash. This function corrects for this issue by changing the start site
199 of the read.
200
201 Note: this will probably break the program if not using a double alignment
202 since no reads would map past half the length of the single reference
203 """
204 if self.doubled:
205 if start_index > int(self.seqlength/2):
206 return start_index - int(self.seqlength/2) - 1
207 else:
208 return start_index
209 else:
210 return start_index
0 #!/usr/bin/env python
1 # -*- coding: utf-8 -*-
2
3 # pauvre - a pore plotting package
4 # Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
5 #
6 # This file is part of pauvre.
7 #
8 # pauvre is free software: you can redistribute it and/or modify
9 # it under the terms of the GNU General Public License as published by
10 # the Free Software Foundation, either version 3 of the License, or
11 # (at your option) any later version.
12 #
13 # pauvre is distributed in the hope that it will be useful,
14 # but WITHOUT ANY WARRANTY; without even the implied warranty of
15 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16 # GNU General Public License for more details.
17 #
18 # You should have received a copy of the GNU General Public License
19 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
20
21 # following this tutorial to install helvetica
22 # https://github.com/olgabot/sciencemeetproductivity.tumblr.com/blob/master/posts/2012/11/how-to-set-helvetica-as-the-default-sans-serif-font-in.md
23 global hfont
24 hfont = {'fontname':'Helvetica'}
25
26 import matplotlib
27 matplotlib.use('Agg')
28 import matplotlib.pyplot as plt
29 from matplotlib.colors import LinearSegmentedColormap, Normalize
30 import matplotlib.patches as patches
31
32
33 import gffutils
34 import pandas as pd
35 pd.set_option('display.max_columns', 500)
36 pd.set_option('display.width', 1000)
37 import numpy as np
38 import os
39 import pauvre.rcparams as rc
40 from pauvre.functions import GFFParse, print_images, timestamp
41 from pauvre import gfftools
42 from pauvre.lsi.lsi import intersection
43 from pauvre.bamparse import BAMParse
44 import progressbar
45 import platform
46 import sys
47 import time
48
49 # Biopython stuff
50 from Bio import SeqIO
51 import Bio.SubsMat.MatrixInfo as MI
52
53
54 class PlotCommand:
55 def __init__(self, plotcmd, REF):
56 self.ref = REF
57 self.style_choices = []
58 self.cmdtype = ""
59 self.path = ""
60 self.style = ""
61 self.options = ""
62 self._parse_cmd(plotcmd)
63
64 def _parse_cmd(self, plotcmd):
65 chunks = plotcmd.split(":")
66 if chunks[0] == "ref":
67 self.cmdtype = "ref"
68 if len(chunks) < 2:
69 self._len_error()
70 self.path = self.ref
71 self.style = chunks[1]
72 self.style_choices = ["normal", "colorful"]
73 self._check_style_choices()
74 if len(chunks) > 2:
75 self.options = chunks[2].split(",")
76 elif chunks[0] in ["bam", "peptides"]:
77 if len(chunks) < 3:
78 self._len_error()
79 self.cmdtype = chunks[0]
80 self.path = os.path.abspath(os.path.expanduser(chunks[1]))
81 self.style = chunks[2]
82 if self.cmdtype == "bam":
83 self.style_choices = ["depth", "reads"]
84 else:
85 self.style_choices = ["depth"]
86 self._check_style_choices()
87 if len(chunks) > 3:
88 self.options = chunks[3].split(",")
89 elif chunks[0] in ["gff3"]:
90 if len(chunks) < 2:
91 self._len_error()
92 self.cmdtype = chunks[0]
93 self.path = os.path.abspath(os.path.expanduser(chunks[1]))
94 if len(chunks) > 2:
95 self.options = chunks[2].split(",")
96
97
98 def _len_error(self):
99 raise IOError("""You selected {} to plot,
100 but need to specify the style at least.""".format(self.cmdtype))
101 def _check_style_choices(self):
102 if self.style not in self.style_choices:
103 raise IOError("""You selected {} style for
104 ref. You must select from {}. """.format(
105 self.style, self.style_choices))
106
107 global dna_color
108 dna_color = {"A": (81/255, 87/255, 251/255, 1),
109 "T": (230/255, 228/255, 49/255, 1),
110 "G": (28/255, 190/255, 32/255, 1),
111 "C": (220/255, 10/255, 23/255, 1)}
112
113 #these are the line width for the different cigar string flags.
114 # usually, only M, I, D, S, and H appear in bwa mem output
115 global widthDict
116 widthDict = {'M':0.45, # match
117 'I':0.9, # insertion relative to reference
118 'D':0.05, # deletion relative to reference
119 'N':0.1, # skipped region from the reference
120 'S':0.1, # soft clip, not aligned but still in sam file
121 'H':0.1, # hard clip, not aligned and not in sam file
122 'P':0.1, # padding (silent deletion from padded reference)
123 '=':0.1, # sequence match
124 'X':0.1} # sequence mismatch
125
126
127 global richgrey
128 richgrey = (60/255, 54/255, 69/255, 1)
129
130 def plot_ref(panel, chrid, start, stop, thiscmd):
131 panel.set_xlim([start, stop])
132 panel.set_ylim([-2.5, 2.5])
133 panel.set_xticks([int(val) for val in np.linspace(start, stop, 6)])
134 if thiscmd.style == "colorful":
135 thisseq = ""
136 for record in SeqIO.parse(thiscmd.ref, "fasta"):
137 if record.id == chrid:
138 thisseq = record.seq[start-1: stop]
139 for i in range(len(thisseq)):
140 left = start + i
141 bottom = -0.5
142 width = 1
143 height = 1
144 rect = patches.Rectangle((left, bottom),
145 width, height,
146 linewidth = 0,
147 facecolor = dna_color[thisseq[i]] )
148 panel.add_patch(rect)
149 return panel
150
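def safe_log10(value):
    """Return log10(value), or 0 when value is not positive.

    np.log10 returns -inf (with a warning) for 0 and nan for negative
    input rather than raising, so those cases are checked explicitly.
    """
    if value > 0:
        return np.log10(value)
    return 0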
157
158 def plot_bam(panel, chrid, start, stop, thiscmd):
159 bam = BAMParse(thiscmd.path)
160 panel.set_xlim([start, stop])
161 if thiscmd.style == "depth":
162 maxdepth = max(bam.features_depthmap)
163 maxdepthlog = safe_log10(maxdepth)
164 if "log" in thiscmd.options:
165 panel.set_ylim([-maxdepthlog, maxdepthlog])
166 panel.set_yticks([int(val) for val in np.linspace(0, maxdepthlog, 2)])
167
168 else:
169 panel.set_yticks([int(val) for val in np.linspace(0, maxdepth, 2)])
170 if "c" in thiscmd.options:
171 panel.set_ylim([-maxdepth, maxdepth])
172 else:
173 panel.set_ylim([0, maxdepth])
174
175
176 for i in range(len(bam.features_depthmap)):
177 left = start + i
178 width = 1
179 if "c" in thiscmd.options and "log" in thiscmd.options:
180 bottom = -1 * safe_log10(bam.features_depthmap[i])
181 height = safe_log10(bam.features_depthmap[i]) * 2
182 elif "c" in thiscmd.options and "log" not in thiscmd.options:
183 bottom = -bam.features_depthmap[i]
184 height = bam.features_depthmap[i] * 2
185 else:
186 bottom = 0
187 height = bam.features_depthmap[i]
188 if height > 0:
189 rect = patches.Rectangle((left, bottom),
190 width, height,
191 linewidth = 0,
192 facecolor = richgrey )
193 panel.add_patch(rect)
194
195 if thiscmd.style == "reads":
196 #If we're plotting reads, we don't need y-axis
197 panel.tick_params(bottom=False, labelbottom=False,
198 left=False, labelleft=False)
199 reads = bam.features.copy()
200 panel.set_xlim([start, stop])
201 direction = "for"
202 if direction == 'for':
203 bav = {"by":['POS','MAPLEN'], "asc": [True, False]}
204 direction= 'rev'
205 elif direction == 'rev':
206 bav = {"by":['POS','MAPLEN'], "asc": [True, False]}
207 direction = 'for'
208 reads.sort_values(by=bav["by"], ascending=bav['asc'],inplace=True)
209 reads.reset_index(drop=True, inplace=True)
210
211 depth_count = -1
212 plotind = start
213 while len(reads) > 0:
214 #depth_count -= 1
215 #print("len of reads is {}".format(len(reads)))
216 potential = reads.query("POS >= {}".format(plotind))
217 if len(potential) == 0:
218 readsindex = 0
219 #print("resetting plot ind from {} to {}".format(
220 # plotind, reads.loc[readsindex, "POS"]))
221 depth_count -= 1
222
223 else:
224 readsindex = int(potential.index.values[0])
225 #print("pos of potential is {}".format(reads.loc[readsindex, "POS"]))
226 plotind = reads.loc[readsindex, "POS"]
227
228 for TUP in reads.loc[readsindex, "TUPS"]:
229 b_type = TUP[1]
230 b_len = TUP[0]
231 #plotting params
232 # left same for all.
233 left = plotind
234 bottom = depth_count
235 height = widthDict[b_type]
236 width = b_len
237 plot = True
238 color = richgrey
239 if b_type in ["H", "S"]:
240 """We don't plot hard or soft clips - like IGV"""
241 plot = False
242 pass
243 elif b_type == "M":
244 """just plot matches normally"""
245 plotind += b_len
246 elif b_type in ["D", "P", "=", "X"]:
247 """deletions get an especially thin line"""
248 plotind += b_len
249 elif b_type == "I":
250 """insertions get a special purple bar"""
251 left = plotind - (b_len/2)
252 color = (200/255, 41/255, 226/255, 0.5)
253 elif b_type == "N":
254 """skips for splice junctions, line in middle"""
255 bottom += (widthDict["M"]/2) - (widthDict["N"]/2)
256 plotind += b_len
257 if plot:
258 rect = patches.Rectangle((left, bottom),
259 width, height,
260 linewidth = 0,
261 facecolor = color )
262 panel.add_patch(rect)
263 reads.drop([readsindex], inplace=True)
264 reads.reset_index(drop = True, inplace=True)
265 panel.set_ylim([depth_count, 0])
266
267 return panel
268
269 def plot_gff3(panel, chrid, start, stop, thiscmd):
270
271 db = gffutils.create_db(thiscmd.path, ":memory:")
272 bottom = 0
273 genes_to_plot = [thing.id
274 for thing in db.region(
275 region=(chrid, start, stop),
276 completely_within=False)
277 if thing.featuretype == "gene" ]
278 #print("genes to plot are: " genes_to_plot)
279 panel.set_xlim([start, stop])
280 # we don't need labels on one of the axes
281 #panel.tick_params(bottom="off", labelbottom="off",
282 # left ="off", labelleft = "off")
283
284
285 ticklabels = []
286 for geneid in genes_to_plot:
287 plotnow = False
288 if "id" in thiscmd.options and geneid in thiscmd.options:
289 plotnow = True
290 elif "id" not in thiscmd.options:
291 plotnow = True
292 if plotnow:
293 ticklabels.append(geneid)
294 if db[geneid].strand == "+":
295 panel = gfftools._plot_left_to_right_introns_top(panel, geneid, db,
296 bottom, text = None)
297 bottom += 1
298 else:
299 raise IOError("""Plotting things on the reverse strand is
300 not yet implemented""")
301 #print("tick labels are", ticklabels)
302 panel.set_ylim([0, len(ticklabels)])
303 yticks_vals = [val for val in np.linspace(0.5, len(ticklabels) - 0.5, len(ticklabels))]
304 panel.set_yticks(yticks_vals)
305 print("bottom is: ", bottom)
306 print("len tick labels is: ", len(ticklabels))
307 print("intervals are: ", yticks_vals)
308 panel.set_yticklabels(ticklabels)
309
310 return panel
311
312 def browser(args):
313 rc.update_rcParams()
314 print(args)
315
316 # if the user forgot to add a reference, they must add one
317 if args.REF is None:
318 raise IOError("You must specify the reference fasta file")
319
320 # if the user forgot to add the start and stop,
321 # Print the id and the start/stop
322 if args.CHR is None or args.START is None or args.STOP is None:
323 print("""\n You have forgotten to specify the chromosome,
324 the start coordinate, or the stop coordinate to plot.
325 Try something like '-c chr1 --start 20 --stop 2000'.
326 Here is a list of chromosome ids and their lengths
327 from the provided reference. The minimum start coordinate
328 is one and the maximum stop coordinate is the length of
329 the chromosome.\n\nID\tLength""")
330 for record in SeqIO.parse(args.REF, "fasta"):
331 print("{}\t{}".format(record.id, len(record.seq)))
332 sys.exit(0)
333
334 if args.CMD is None:
335 raise IOError("You must specify a plotting command.")
336
337 # now we parse each set of commands
338 commands = [PlotCommand(thiscmd, args.REF)
339 for thiscmd in reversed(args.CMD)]
340
341 # set the figure dimensions
342 if args.ratio:
343 figWidth = args.ratio[0] + 1
344 figHeight = args.ratio[1] + 1
345 #set the panel dimensions
346 panelWidth = args.ratio[0]
347 panelHeight = args.ratio[1]
348
349 else:
350 figWidth = 7
351 figHeight = len(commands) + 2
352 #set the panel dimensions
353 panelWidth = 5
354 # panel margin x 2 + panel height = total vertical height
355 panelHeight = 0.8
356 panelMargin = 0.1
357
358 figure = plt.figure(figsize=(figWidth,figHeight))
359
360 #find the margins to center the panel in figure
361 leftMargin = (figWidth - panelWidth)/2
362 bottomMargin = ((figHeight - panelHeight)/2) + panelMargin
363
364 plot_dict = {"ref": plot_ref,
365 "bam": plot_bam,
366 "gff3": plot_gff3
367 #"peptides": plot_peptides
368 }
369
370 panels = []
371 for i in range(len(commands)):
372 thiscmd = commands[i]
373 if thiscmd.cmdtype in ["gff3", "ref", "peptides"] \
374 or thiscmd.style == "depth" \
375 or "narrow" in thiscmd.options:
376 temp_panelHeight = 0.5
377 else:
378 temp_panelHeight = panelHeight
379 panels.append( plt.axes([leftMargin/figWidth, #left
380 bottomMargin/figHeight, #bottom
381 panelWidth/figWidth, #width
382 temp_panelHeight/figHeight]) #height
383 )
384 panels[i].tick_params(axis='both', which='both',
385 bottom=False, labelbottom=False,
386 left=True, labelleft=True,
387 right=False, labelright=False,
388 top=False, labeltop=False)
389 if thiscmd.cmdtype == "ref":
390 panels[i].tick_params(bottom=True, labelbottom=True)
391
392
393
394 #turn off some of the axes
395 panels[i].spines["top"].set_visible(False)
396 panels[i].spines["bottom"].set_visible(False)
397 panels[i].spines["right"].set_visible(False)
398 panels[i].spines["left"].set_visible(False)
399
400 panels[i] = plot_dict[thiscmd.cmdtype](panels[i], args.CHR,
401 args.START, args.STOP,
402 thiscmd)
403
404 bottomMargin = bottomMargin + temp_panelHeight + (2 * panelMargin)
405
406 # Print image(s)
407 if args.BASENAME is None:
408 file_base = 'browser_{}.png'.format(timestamp())
409 else:
410 file_base = args.BASENAME
411 path = None
412 if args.path:
413 path = args.path
414 transparent = args.transparent
415 print_images(
416 base_output_name=file_base,
417 image_formats=args.fileform,
418 dpi=args.dpi,
419 no_timestamp = getattr(args, "no_timestamp", False), # falls back to False if args has no such flag
420 path = path,
421 transparent=transparent)
422
423
424 def run(args):
425 browser(args)
0 #!/usr/bin/env python
1 # -*- coding: utf-8 -*-
2
3 # pauvre - just a pore PhD student's plotting package
4 # Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved.
5 #
6 # This file is part of pauvre.
7 #
8 # pauvre is free software: you can redistribute it and/or modify
9 # it under the terms of the GNU General Public License as published by
10 # the Free Software Foundation, either version 3 of the License, or
11 # (at your option) any later version.
12 #
13 # pauvre is distributed in the hope that it will be useful,
14 # but WITHOUT ANY WARRANTY; without even the implied warranty of
15 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16 # GNU General Public License for more details.
17 #
18 # You should have received a copy of the GNU General Public License
19 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
20
21 import ast
22 import matplotlib
23 matplotlib.use('Agg')
24 import matplotlib.pyplot as plt
25 import matplotlib.patches as mplpatches
26 from matplotlib.colors import LinearSegmentedColormap
27 import numpy as np
28 import pandas as pd
29 import os.path as opath
30 from sys import stderr
31 from pauvre.functions import print_images
32 from pauvre.stats import stats
33 import pauvre.rcparams as rc
34 import sys
35 import logging
36
37 # logging
38 logger = logging.getLogger('pauvre')
39
40
41 def generate_panel(panel_left, panel_bottom, panel_width, panel_height,
42 axis_tick_param='both', which_tick_param='both',
43 bottom_tick_param=True, label_bottom_tick_param=True,
44 left_tick_param=True, label_left_tick_param=True,
45 right_tick_param=False, label_right_tick_param=False,
46 top_tick_param=False, label_top_tick_param=False):
47 """
48 Setting default panel tick parameters. Some of these are the defaults
49 for matplotlib anyway, but specifying them for readability. Here are
50 options and defaults for the parameters used below:
51
52 axis : {'x', 'y', 'both'}; which axis to modify; default = 'both'
53 which : {'major', 'minor', 'both'}; which ticks to modify;
54 default = 'major'
55 bottom, top, left, right : bool; ticks on or off;
56 labelbottom, labeltop, labelleft, labelright : bool; tick labels on or off
57 """
58
59 # create the panel
60 panel_rectangle = [panel_left, panel_bottom, panel_width, panel_height]
61 panel = plt.axes(panel_rectangle)
62
63 # Set tick parameters
64 panel.tick_params(axis=axis_tick_param,
65 which=which_tick_param,
66 bottom=bottom_tick_param,
67 labelbottom=label_bottom_tick_param,
68 left=left_tick_param,
69 labelleft=label_left_tick_param,
70 right=right_tick_param,
71 labelright=label_right_tick_param,
72 top=top_tick_param,
73 labeltop=label_top_tick_param)
74
75 return panel
76
77
78 def _generate_histogram_bin_patches(panel, bins, bin_values, horizontal=True):
79 """This helper method generates the histogram that is added to the panel.
80
81 In this case, horizontal = True applies to the mean quality histogram.
82 So, horizontal = False only applies to the length histogram.
83 """
84 l_width = 0.0
85 f_color = (0.5, 0.5, 0.5)
86 e_color = (0, 0, 0)
87 if horizontal:
88 for step in np.arange(0, len(bin_values), 1):
89 left = bins[step]
90 bottom = 0
91 width = bins[step + 1] - bins[step]
92 height = bin_values[step]
93 hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
94 linewidth=l_width,
95 facecolor=f_color,
96 edgecolor=e_color)
97 panel.add_patch(hist_rectangle)
98 else:
99 for step in np.arange(0, len(bin_values), 1):
100 left = 0
101 bottom = bins[step]
102 width = bin_values[step]
103 height = bins[step + 1] - bins[step]
104
105 hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
106 linewidth=l_width,
107 facecolor=f_color,
108 edgecolor=e_color)
109 panel.add_patch(hist_rectangle)
110
111
112 def generate_histogram(panel, data_list, min_plot_val, max_plot_val,
113 bin_interval, hist_horizontal=True,
114 left_spine=True, bottom_spine=True,
115 top_spine=False, right_spine=False, x_label=None,
116 y_label=None):
117
118 bins = np.arange(0, max_plot_val, bin_interval)
119
120 bin_values, bins2 = np.histogram(data_list, bins)
121
122 # hist_horizontal is used for quality
123 if hist_horizontal:
124 panel.set_xlim([min_plot_val, max_plot_val])
125 panel.set_ylim([0, max(bin_values * 1.1)])
126 # and hist_horizontal == False is for read length
127 else:
128 panel.set_xlim([0, max(bin_values * 1.1)])
129 panel.set_ylim([min_plot_val, max_plot_val])
130
131 # Generate histogram bin patches, depending on whether we're plotting
132 # vertically or horizontally
133 _generate_histogram_bin_patches(panel, bins, bin_values, hist_horizontal)
134
135 panel.spines['left'].set_visible(left_spine)
136 panel.spines['bottom'].set_visible(bottom_spine)
137 panel.spines['top'].set_visible(top_spine)
138 panel.spines['right'].set_visible(right_spine)
139
140 if y_label is not None:
141 panel.set_ylabel(y_label)
142 if x_label is not None:
143 panel.set_xlabel(x_label)
144
145 def generate_square_map(panel, data_frame, plot_min_y, plot_min_x,
146 plot_max_y, plot_max_x, color,
147 xcol, ycol, **kwargs):
148 """This generates the heatmap panels using squares. Everything is
149 quantized by ints.
150 """
151 panel.set_xlim([plot_min_x, plot_max_x])
152 panel.set_ylim([plot_min_y, plot_max_y])
153 tempdf = data_frame[[xcol, ycol]]
154 data_frame = tempdf.astype(int)
155
156 querystring = "{}<={} and {}<={}".format(plot_min_y, ycol, plot_min_x, xcol)
157 print(" - Filtering squares with {}".format(querystring))
158 square_this = data_frame.query(querystring)
159
160 querystring = "{}<{} and {}<{}".format(ycol, plot_max_y, xcol, plot_max_x)
161 print(" - Filtering squares with {}".format(querystring))
162 square_this = square_this.query(querystring)
163
164 counts = square_this.groupby([xcol, ycol]).size().reset_index(name='counts')
165 for index, row in counts.iterrows():
166 x_pos = row[xcol]
167 y_pos = row[ycol]
168 thiscolor = color(row["counts"]/(counts["counts"].max()))
169 rectangle1=mplpatches.Rectangle((x_pos,y_pos),1,1,
170 linewidth=0,\
171 facecolor=thiscolor)
172 panel.add_patch(rectangle1)
173
174 all_counts = counts["counts"]
175 return all_counts
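# Illustrative sketch only, not part of pauvre: the square map above counts
# points per integer cell and scales each count by the largest count before
# passing it to the colormap. The helper name square_bin_fractions and the
# sample points are assumptions made for this example; it uses plain Python
# instead of pandas so that it runs on its own.

```python
from collections import Counter


def square_bin_fractions(points, min_x, max_x, min_y, max_y):
    """Return {(x, y): count / max_count} for points inside the window.

    Lower bounds are inclusive and upper bounds exclusive, matching the
    two query strings built in generate_square_map.
    """
    cells = Counter()
    for x, y in points:
        cell = (int(x), int(y))  # truncate toward zero, like astype(int)
        if min_x <= cell[0] < max_x and min_y <= cell[1] < max_y:
            cells[cell] += 1
    if not cells:
        return {}
    biggest = max(cells.values())
    return {cell: n / biggest for cell, n in cells.items()}


fractions = square_bin_fractions(
    [(1.2, 3.9), (1.7, 3.1), (2.0, 5.5), (9.0, 9.0)],
    min_x=0, max_x=9, min_y=0, max_y=9)
```

# The point (9.0, 9.0) is dropped because the upper bounds are exclusive.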
176
177 def generate_heat_map(panel, data_frame, plot_min_y, plot_min_x,
178 plot_max_y, plot_max_x, color,
179 xcol, ycol, **kwargs):
180 panel.set_xlim([plot_min_x, plot_max_x])
181 panel.set_ylim([plot_min_y, plot_max_y])
182
183 querystring = "{}<={} and {}<={}".format(plot_min_y, ycol, plot_min_x, xcol)
184 print(" - Filtering hexmap with {}".format(querystring))
185 hex_this = data_frame.query(querystring)
186
187 querystring = "{}<{} and {}<{}".format(ycol, plot_max_y, xcol, plot_max_x)
188 print(" - Filtering hexmap with {}".format(querystring))
189 hex_this = hex_this.query(querystring)
190
191 # This single line controls plotting the hex bins in the panel
192 hex_vals = panel.hexbin(hex_this[xcol], hex_this[ycol], gridsize=49,
193 linewidths=0.0, cmap=color)
194 for each in panel.spines:
195 panel.spines[each].set_visible(False)
196
197 counts = hex_vals.get_array()
198 return counts
199
200 def generate_legend(panel, counts, color):
201 # completely custom for more control
202 panel.set_xlim([0, 1])
203 panel.set_ylim([0, 1000])
204 panel.set_yticks([int(x) for x in np.linspace(0, 1000, 6)])
205 panel.set_yticklabels([int(x) for x in np.linspace(0, max(counts), 6)])
206 for i in np.arange(0, 1001, 1):
207 rgba = color(i / 1001)
208 alpha = rgba[-1]
209 facec = rgba[0:3]
210 hist_rectangle = mplpatches.Rectangle((0, i), 1, 1,
211 linewidth=0.0,
212 facecolor=facec,
213 edgecolor=(0, 0, 0),
214 alpha=alpha)
215 panel.add_patch(hist_rectangle)
216 panel.spines['top'].set_visible(False)
217 panel.spines['left'].set_visible(False)
218 panel.spines['bottom'].set_visible(False)
219 panel.yaxis.set_label_position("right")
220 panel.set_ylabel('count')
221
222 def custommargin(df, **kwargs):
223 rc.update_rcParams()
224
225 # 250, 231, 34 light yellow
226 # 67, 1, 85
227 # R=np.linspace(65/255,1,101)
228 # G=np.linspace(0/255, 231/255, 101)
229 # B=np.linspace(85/255, 34/255, 101)
230 # R=65/255, G=0/255, B=85/255
231 Rf = 65 / 255
232 Bf = 85 / 255
233 pdict = {'red': ((0.0, Rf, Rf),
234 (1.0, Rf, Rf)),
235 'green': ((0.0, 0.0, 0.0),
236 (1.0, 0.0, 0.0)),
237 'blue': ((0.0, Bf, Bf),
238 (1.0, Bf, Bf)),
239 'alpha': ((0.0, 0.0, 0.0),
240 (1.0, 1.0, 1.0))
241 }
242 # Build the colormap directly from the segment dictionary above:
243 # constant purple RGB, with alpha ramping from 0 (transparent)
244 # to 1 (opaque).
245 purple1 = LinearSegmentedColormap('Purple1', pdict)
246
247 # set the figure dimensions
248 fig_width = 1.61 * 3
249 fig_height = 1 * 3
250 fig = plt.figure(figsize=(fig_width, fig_height))
251
252 # set the panel dimensions
253 heat_map_panel_width = fig_width * 0.5
254 heat_map_panel_height = heat_map_panel_width * 0.62
255
256 # find the margins to center the panel in figure
257 fig_left_margin = fig_bottom_margin = (1 / 6)
258
259 # lengthPanel
260 y_panel_width = (1 / 8)
261
262 # the color Bar parameters
263 legend_panel_width = (1 / 24)
264
265 # define padding
266 h_padding = 0.02
267 v_padding = 0.05
268
269 # Set whether to include y-axes in histograms
270 print(" - Setting panel options.", file = sys.stderr)
271 if kwargs["Y_AXES"]:
272 y_bottom_spine = True
273 y_bottom_tick = True
274 y_bottom_label = True
275 x_left_spine = True
276 x_left_tick = True
277 x_left_label = True
278 x_y_label = 'Count'
279 else:
280 y_bottom_spine = False
281 y_bottom_tick = False
282 y_bottom_label = False
283 x_left_spine = False
284 x_left_tick = False
285 x_left_label = False
286 x_y_label = None
287
288 panels = []
289
290 # Quality histogram panel
291 print(" - Generating the x-axis panel.", file = sys.stderr)
292 x_panel_left = fig_left_margin + y_panel_width + h_padding
293 x_panel_width = heat_map_panel_width / fig_width
294 x_panel_height = y_panel_width * fig_width / fig_height
295 x_panel = generate_panel(x_panel_left,
296 fig_bottom_margin,
297 x_panel_width,
298 x_panel_height,
299 left_tick_param=x_left_tick,
300 label_left_tick_param=x_left_label)
301 panels.append(x_panel)
302
303 # y histogram panel
304 print(" - Generating the y-axis panel.", file = sys.stderr)
305 y_panel_bottom = fig_bottom_margin + x_panel_height + v_padding
306 y_panel_height = heat_map_panel_height / fig_height
307 y_panel = generate_panel(fig_left_margin,
308 y_panel_bottom,
309 y_panel_width,
310 y_panel_height,
311 bottom_tick_param=y_bottom_tick,
312 label_bottom_tick_param=y_bottom_label)
313 panels.append(y_panel)
314
315 # Heat map panel
316 heat_map_panel_left = fig_left_margin + y_panel_width + h_padding
317 heat_map_panel_bottom = fig_bottom_margin + x_panel_height + v_padding
318 print(" - Generating the heat map panel.", file = sys.stderr)
319 heat_map_panel = generate_panel(heat_map_panel_left,
320 heat_map_panel_bottom,
321 heat_map_panel_width / fig_width,
322 heat_map_panel_height / fig_height,
323 bottom_tick_param=False,
324 label_bottom_tick_param=False,
325 left_tick_param=False,
326 label_left_tick_param=False)
327 panels.append(heat_map_panel)
328 heat_map_panel.set_title(kwargs["title"])
329
330 # Legend panel
331 print(" - Generating the legend panel.", file = sys.stderr)
332 legend_panel_left = fig_left_margin + y_panel_width + \
333 heat_map_panel_width / fig_width + h_padding
334 legend_panel_bottom = fig_bottom_margin + x_panel_height + v_padding
335 legend_panel_height = heat_map_panel_height / fig_height
336 legend_panel = generate_panel(legend_panel_left, legend_panel_bottom,
337 legend_panel_width, legend_panel_height,
338 bottom_tick_param=False,
339 label_bottom_tick_param=False,
340 left_tick_param=False,
341 label_left_tick_param=False,
342 right_tick_param=True,
343 label_right_tick_param=True)
344 panels.append(legend_panel)
345
346 #
347 # Everything above this is just to set up the panels
348 #
349 ##################################################################
350
351 # Set max and min viewing window for the xaxis
352 if kwargs["plot_max_x"]:
353 plot_max_x = kwargs["plot_max_x"]
354 else:
355 # raw maximum for square maps, rounded-up maximum otherwise (as for y)
356 xvals = df[kwargs["xcol"]]
357 plot_max_x = xvals.max() if kwargs["square"] else max(np.ceil(xvals))
358 plot_min_x = kwargs["plot_min_x"]
359
360 # Set x bin sizes
361 if kwargs["xbin"]:
362 x_bin_interval = kwargs["xbin"]
363 else:
364 # again, this is just based on what looks good from experience
365 x_bin_interval = 1
366
367 # Generate x histogram
368 print(" - Generating the x-axis histogram.", file = sys.stderr)
369 generate_histogram(panel = x_panel,
370 data_list = df[kwargs['xcol']],
371 min_plot_val = plot_min_x,
372 max_plot_val = plot_max_x,
373 bin_interval = x_bin_interval,
374 hist_horizontal = True,
375 x_label=kwargs['xcol'],
376 y_label=x_y_label,
377 left_spine=x_left_spine)
378
379 # Set max and min viewing window for the y axis
380 if kwargs["plot_max_y"]:
381 plot_max_y = kwargs["plot_max_y"]
382 else:
383 if kwargs["square"]:
384 plot_max_y = df[kwargs["ycol"]].max()
385 else:
386 plot_max_y = max(np.ceil(df[kwargs["ycol"]]))
387
388 plot_min_y = kwargs["plot_min_y"]
389 # Set y bin sizes
390 if kwargs["ybin"]:
391 y_bin_interval = kwargs["ybin"]
392 else:
393 y_bin_interval = 1
394
395 # Generate y histogram
396 print(" - Generating the y-axis histogram.", file = sys.stderr)
397 generate_histogram(panel = y_panel,
398 data_list = df[kwargs['ycol']],
399 min_plot_val = plot_min_y,
400 max_plot_val = plot_max_y,
401 bin_interval = y_bin_interval,
402 hist_horizontal = False,
403 y_label = kwargs['ycol'],
404 bottom_spine = y_bottom_spine)
405
406 # Generate heat map
407 if kwargs["square"]:
408 print(" - Generating the square heatmap.", file = sys.stderr)
409 counts = generate_square_map(panel = heat_map_panel,
410 data_frame = df,
411 plot_min_y = plot_min_y,
412 plot_min_x = plot_min_x,
413 plot_max_y = plot_max_y,
414 plot_max_x = plot_max_x,
415 color = purple1,
416 xcol = kwargs["xcol"],
417 ycol = kwargs["ycol"])
418 else:
419 print(" - Generating the heatmap.", file = sys.stderr)
420 counts = generate_heat_map(panel = heat_map_panel,
421 data_frame = df,
422 plot_min_y = plot_min_y,
423 plot_min_x = plot_min_x,
424 plot_max_y = plot_max_y,
425 plot_max_x = plot_max_x,
426 color = purple1,
427 xcol = kwargs["xcol"],
428 ycol = kwargs["ycol"])
429
430 # Generate legend
431 print(" - Generating the legend.", file = sys.stderr)
432 generate_legend(legend_panel, counts, purple1)
433
434 # inform the user of the plotting window if not quiet mode
435 #if not kwargs["QUIET"]:
436 # print("""plotting in the following window:
437 # {0} <= Q-score (x-axis) <= {1}
438 # {2} <= length (y-axis) <= {3}""".format(
439 # plot_min_x, plot_max_x, min_plot_val, max_plot_val),
440 # file=stderr)
441
442 # Print image(s)
443 if kwargs["output_base_name"] is None:
444 file_base = "custommargin"
445 else:
446 file_base = kwargs["output_base_name"]
447
448 print(" - Saving your images", file = sys.stderr)
449 print_images(
450 base =file_base,
451 image_formats=kwargs["fileform"],
452 dpi=kwargs["dpi"],
453 no_timestamp = kwargs["no_timestamp"],
454 transparent= kwargs["no_transparent"])
455
456 def run(args):
457 print(args)
458 if not opath.exists(args.input_file):
459 raise IOError("The input file does not exist: {}".format(
460 args.input_file))
461 df = pd.read_csv(args.input_file, header='infer', sep='\t')
462 # make sure that the column names that were specified are actually
463 # in the dataframe
464 if args.xcol not in df.columns:
465 raise IOError("""The x-column name that you specified, {}, is not in the
466 dataframe column names: {}""".format(args.xcol, df.columns))
467 if args.ycol not in df.columns:
468 raise IOError("""The y-column name that you specified, {}, is not in the
469 dataframe column names: {}""".format(args.ycol, df.columns))
470 print(" - Successfully read csv file. Here are a few lines:",
471 file = sys.stderr)
472 print(df.head(), file = sys.stderr)
473 print(" - Plotting {} on the x-axis".format(args.xcol),file=sys.stderr)
474 print(df[args.xcol].head(), file = sys.stderr)
475 print(" - Plotting {} on the y-axis".format(args.ycol),file=sys.stderr)
476 print(df[args.ycol].head(), file = sys.stderr)
477 custommargin(df=df.dropna(), **vars(args))
0 #!/usr/bin/env python
1 # -*- coding: utf-8 -*-
2
3 # pauvre
4 # Copyright (c) 2016-2020 Darrin T. Schultz.
5 #
6 # This file is part of pauvre.
7 #
8 # pauvre is free software: you can redistribute it and/or modify
9 # it under the terms of the GNU General Public License as published by
10 # the Free Software Foundation, either version 3 of the License, or
11 # (at your option) any later version.
12 #
13 # pauvre is distributed in the hope that it will be useful,
14 # but WITHOUT ANY WARRANTY; without even the implied warranty of
15 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16 # GNU General Public License for more details.
17 #
18 # You should have received a copy of the GNU General Public License
19 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
20
21 from Bio import SeqIO
22 import copy
23 import gzip
24 import matplotlib.pyplot as plt
25 import numpy as np
26 import os
27 import pandas as pd
28 from sys import stderr
29 import time
30
31
32 # this makes opening files more robust for different platforms
33 # currently only used in GFFParse
34 import codecs
35
36 import warnings
37
38 def print_images(base, image_formats, dpi,
39 transparent=False, no_timestamp = False):
40 """
41 Save the plot in multiple formats, with or without transparency
42 and with or without timestamps.
43 """
44 for fmt in image_formats:
45 if no_timestamp:
46 out_name = "{0}.{1}".format(base, fmt)
47 else:
48 out_name = "{0}_{1}.{2}".format(base, timestamp(), fmt)
49 try:
50 if fmt == 'png':
51 plt.savefig(out_name, dpi=dpi, transparent=transparent)
52 else:
53 plt.savefig(out_name, format=fmt, transparent=transparent)
54 except PermissionError:
55 # thanks to https://github.com/wdecoster for the suggestion
56 print("""You don't have permission to save pauvre plots to this
57 directory. Try changing the directory and running the script again!""")
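# Illustrative sketch only: how print_images derives one output file name per
# requested format, with or without a timestamp. The helper name output_names
# is an assumption made for this example; it only builds the names and does
# not save any figure.

```python
import time


def output_names(base, image_formats, no_timestamp=False):
    """Build one output file name per image format."""
    stamp = time.strftime("%Y%m%d_%H%M%S")  # same format as timestamp()
    names = []
    for fmt in image_formats:
        if no_timestamp:
            names.append("{0}.{1}".format(base, fmt))
        else:
            names.append("{0}_{1}.{2}".format(base, stamp, fmt))
    return names


plain = output_names("custommargin", ["png", "pdf"], no_timestamp=True)
stamped = output_names("custommargin", ["png"])
```

# With no_timestamp=True the names are simply base.format; otherwise the
# current time is inserted between the base name and the extension.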
58
59 class GFFParse():
60 def __init__(self, filename, stop_codons=None, species=None):
61 self.filename = filename
62 self.samplename = os.path.splitext(os.path.basename(filename))[0]
63 self.species = species
64 self.featureDict = {"name": [],
65 "featType": [],
66 "start": [],
67 "stop": [],
68 "strand": []}
69 gffnames = ["sequence", "source", "featType", "start", "stop", "dunno1",
70 "strand", "dunno2", "tags"]
71 self.features = pd.read_csv(self.filename, comment='#',
72 sep='\t', names=gffnames)
73 self.features['name'] = self.features['tags'].apply(self._get_name)
74 self.features.drop('dunno1', axis=1, inplace=True)
75 self.features.drop('dunno2', axis=1, inplace=True)
76 self.features.reset_index(inplace=True, drop=True)
77 # warn the user if there are CDS or gene entries not divisible by three
78 self._check_triplets()
79 # sort the database by start
80 self.features.sort_values(by='start', ascending=True, inplace=True)
81 if stop_codons:
82 strip_codons = ['gene', 'CDS']
83 # if the direction is forward, subtract three from the stop to bring it closer to the start
84 self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '+'), 'stop'] =\
85 self.features.loc[(self.features['featType'].isin(strip_codons))
86 & (self.features['strand'] == '+'), 'stop'] - 3
87 # if the direction is reverse, add three to the start (since the coords are flip-flopped)
88 self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '-'), 'start'] =\
89 self.features.loc[(self.features['featType'].isin(strip_codons))
90 & (self.features['strand'] == '-'), 'start'] + 3
91 self.features['center'] = self.features['start'] + \
92 ((self.features['stop'] - self.features['start']) / 2)
93 # we need to add one since it doesn't account for the last base otherwise
94 self.features['width'] = abs(self.features['stop'] - self.features['start']) + 1
95 self.features['lmost'] = self.features.apply(self._determine_lmost, axis=1)
96 self.features['rmost'] = self.features.apply(self._determine_rmost, axis=1)
97 self.features['track'] = 0
98 if len(self.features.loc[self.features['tags'] == "Is_circular=true", 'stop']) < 1:
99 raise IOError("""The GFF file needs to have a tag ending in "Is_circular=true"
100 with a region from 1 to the number of bases in the mitogenome
101
102 example:
103 Bf201311 Geneious region 1 13337 . + 0 Is_circular=true
104 """)
105 self.seqlen = int(self.features.loc[self.features['tags'] == "Is_circular=true", 'stop'])
106 self.features.reset_index(inplace=True, drop=True)
107 #print("float", self.features.loc[self.features['name'] == 'COX1', 'center'])
108 #print("float cat", len(self.features.loc[self.features['name'] == 'CAT', 'center']))
109 # print(self.features)
110 # print(self.seqlen)
111
112 def set_features(self, new_features):
113 """all this does is reset the features pandas dataframe"""
114 self.features = new_features
115
116 def get_unique_genes(self):
117 """This returns a set of gene names"""
118 plottable = self.features.query(
119 "featType != 'tRNA' and featType != 'region' and featType != 'source'")
120 return set(plottable['name'].unique())
121
122 def shuffle(self):
123 """
124 this returns a list of all possible shuffles of features.
125 A shuffle is when the frontmost bit of coding + noncoding DNA up
126 until the next bit of coding DNA is removed and tagged on the
127 end of the sequence. In this case this process is represented by
128 shifting gff coordinates.
129 """
130 shuffles = []
131 # get the index of the first element
132 # get the index of the next thing
133 # subtract the indices of everything, then reset the ones that are below
134 # zero
135 done = False
136 shuffle_features = self.features[self.features['featType'].isin(
137 ['gene', 'rRNA', 'CDS', 'tRNA'])].copy(deep=True)
138 # we first add the shuffle features without reorganizing
139 # print("shuffle\n",shuffle_features)
140 add_first = copy.deepcopy(self)
141 add_first.set_features(shuffle_features)
142 shuffles.append(add_first)
143 # first gene is changed with every iteration
144 first_gene = list(shuffle_features['name'])[0]
145 # absolute first is the first gene in the original gff file, used to determine if we are done in this while loop
146 absolute_first = list(shuffle_features['name'])[0]
147 while not done:
148 # We need to prevent the case of shuffling in the middle of
149 # overlapped genes. Do this by ensuring that the end of the
150 # first gene is less than the start of the next gene.
151 first_stop = int(shuffle_features.loc[shuffle_features['name'] == first_gene, 'stop'])
152 next_gene = ""
153 for next_index in range(1, len(shuffle_features)):
154 # get the df of the next list, if len == 0, then it is a tRNA and we need to go to the next index
155 next_gene_df = list(
156 shuffle_features[shuffle_features['featType'].isin(['gene', 'rRNA', 'CDS'])]['name'])
157 if len(next_gene_df) != 0:
158 next_gene = next_gene_df[next_index]
159 next_start = int(shuffle_features.loc[shuffle_features['name'] == next_gene, 'start'])
160 #print("looking at {}, prev_stop is {}, start is {}".format(
161 # next_gene, first_stop, next_start))
162 #print(shuffle_features[shuffle_features['featType'].isin(['gene', 'rRNA', 'CDS'])])
163 # if the gene we're looking at and the next one don't overlap, move on
164 if first_stop < next_start:
165 break
166 #print("next_gene before checking for first is {}".format(next_gene))
167 if next_gene == absolute_first:
168 done = True
169 break
170 # now we can reset the first gene for the next iteration
171 first_gene = next_gene
172 shuffle_features = shuffle_features.copy(deep=True)
173 # figure out where the next start point is going to be
174 next_start = int(shuffle_features.loc[shuffle_features['name'] == next_gene, 'start'])
175 #print('next gene: {}'.format(next_gene))
176 shuffle_features['start'] = shuffle_features['start'] - next_start + 1
177 shuffle_features['stop'] = shuffle_features['stop'] - next_start + 1
178 shuffle_features['center'] = shuffle_features['center'] - next_start + 1
179 # now correct the values that are less than 0
180 shuffle_features.loc[shuffle_features['start'] < 1,
181 'start'] = shuffle_features.loc[shuffle_features['start'] < 1, 'start'] + self.seqlen
182 shuffle_features.loc[shuffle_features['stop'] < 1, 'stop'] = shuffle_features.loc[shuffle_features['stop']
183 < 1, 'start'] + shuffle_features.loc[shuffle_features['stop'] < 1, 'width']
184 shuffle_features['center'] = shuffle_features['start'] + \
185 ((shuffle_features['stop'] - shuffle_features['start']) / 2)
186 shuffle_features['lmost'] = shuffle_features.apply(self._determine_lmost, axis=1)
187 shuffle_features['rmost'] = shuffle_features.apply(self._determine_rmost, axis=1)
188 shuffle_features.sort_values(by='start', ascending=True, inplace=True)
189 shuffle_features.reset_index(inplace=True, drop=True)
190 new_copy = copy.deepcopy(self)
191 new_copy.set_features(shuffle_features)
192 shuffles.append(new_copy)
193 #print("len shuffles: {}".format(len(shuffles)))
194 return shuffles
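# Illustrative sketch only: the coordinate shift behind one shuffle step on a
# circular sequence. This is a simplification that assumes no feature spans
# the new origin, so it wraps start and stop independently with a modulo
# rather than reproducing the width-based correction used above. The function
# name rotate_features and the gene coordinates are made up for this example.

```python
def rotate_features(features, new_origin, seqlen):
    """Shift 1-based (name, start, stop) features so new_origin becomes 1."""
    rotated = []
    for name, start, stop in features:
        # Positions left of the new origin wrap around to the end.
        new_start = (start - new_origin) % seqlen + 1
        new_stop = (stop - new_origin) % seqlen + 1
        rotated.append((name, new_start, new_stop))
    # Keep the features ordered by start, as the shuffled dataframe is.
    return sorted(rotated, key=lambda feat: feat[1])


genes = [("COX1", 1, 300), ("ND1", 401, 700), ("CYTB", 801, 1000)]
rotated = rotate_features(genes, new_origin=401, seqlen=1000)
```

# After the shift ND1 starts at position 1 and COX1, which used to be first,
# ends up last.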
195
196 def couple(self, other_GFF, this_y=0, other_y=1):
197 """
198 Compares this set of features to another set and generates tuples of
199 (x,y) coordinate pairs to input into lsi
200 """
201 other_features = other_GFF.features
202 coordinates = []
203 for thisname in self.features['name']:
204 othermatch = other_features.loc[other_features['name'] == thisname, 'center']
205 if len(othermatch) == 1:
206 this_x = float(self.features.loc[self.features['name']
207 == thisname, 'center']) # /self.seqlen
208 other_x = float(othermatch) # /other_GFF.seqlen
209 # lsi can't handle vertical or horizontal lines, and we don't
210 # need them either for our comparison. Don't add if equivalent.
211 if this_x != other_x:
212 these_coords = ((this_x, this_y), (other_x, other_y))
213 coordinates.append(these_coords)
214 return coordinates
215
216 def _check_triplets(self):
217 """This method verifies that all entries of featType gene and CDS are
218 divisible by three"""
219 genesCDSs = self.features.query("featType == 'CDS' or featType == 'gene'")
220 not_trips = genesCDSs.loc[((abs(genesCDSs['stop'] - genesCDSs['start']) + 1) % 3) > 0, ]
221 if len(not_trips) > 0:
222 print_string = ""
223 print_string += "There are CDS and gene entries that are not divisible by three\n"
224 print_string += str(not_trips)
225 warnings.warn(print_string, SyntaxWarning)
226
227 def _get_name(self, tag_value):
228 """This extracts a name from a single row in 'tags' of the pandas
229 dataframe
230 """
231 try:
232 if ";" in tag_value:
233 name = tag_value[5:].split(';')[0]
234 else:
235 name = tag_value[5:].split()[0]
236 except Exception:
237 name = tag_value
238 print("Couldn't correctly parse {}".format(
239 tag_value))
240 return name
241
242 def _determine_lmost(self, row):
243 """Booleans don't work well for pandas dataframes, so I need to use apply
244 """
245 if row['start'] < row['stop']:
246 return row['start']
247 else:
248 return row['stop']
249
250 def _determine_rmost(self, row):
251 """Booleans don't work well for pandas dataframes, so I need to use apply
252 """
253 if row['start'] < row['stop']:
254 return row['stop']
255 else:
256 return row['start']
257
258
259 def parse_fastq_length_meanqual(fastq):
260 """
261 arguments:
262 <fastq> the fastq file path. Hopefully it has been verified to exist already
263
264 purpose:
265 This function parses a fastq and returns a pandas dataframe of read lengths
266 and read meanQuals.
267 """
268 # First try to read the file with the gzip package. Reading fails
269 # if the file is not gzipped, so this is an easy way to test if
270 # the fastq file is gzipped or not.
271 try:
272 handle = gzip.open(fastq, "rt")
273 length, meanQual = _fastq_parse_helper(handle)
274 except OSError:
275 handle = open(fastq, "r")
276 length, meanQual = _fastq_parse_helper(handle)
277
278 handle.close()
279 df = pd.DataFrame(list(zip(length, meanQual)), columns=['length', 'meanQual'])
280 return df
281
282
283 def filter_fastq_length_meanqual(df, min_len, max_len,
284 min_mqual, max_mqual):
285 querystring = "length >= {0} and meanQual >= {1}".format(min_len, min_mqual)
286 if max_len is not None:
287 querystring += " and length <= {}".format(max_len)
288 if max_mqual is not None:
289 querystring += " and meanQual <= {}".format(max_mqual)
290 print("Keeping reads that satisfy: {}".format(querystring), file=stderr)
291 filtdf = df.query(querystring)
292 #filtdf["length"] = pd.to_numeric(filtdf["length"], errors='coerce')
293 #filtdf["meanQual"] = pd.to_numeric(filtdf["meanQual"], errors='coerce')
294 return filtdf
295
296
297 def _fastq_parse_helper(handle):
298 length = []
299 meanQual = []
300 for record in SeqIO.parse(handle, "fastq"):
301 if len(record) > 0:
302 meanQual.append(_arithmetic_mean(record.letter_annotations["phred_quality"]))
303 length.append(len(record))
304 return length, meanQual
305
306
307 def _geometric_mean(phred_values):
308 """in case I want geometric mean in the future, can calculate it like this"""
309 # np.mean(record.letter_annotations["phred_quality"]))
310 pass
311
312
313 def _arithmetic_mean(phred_values):
314 """
315 Convert Phred to 1-accuracy (error probabilities), calculate the arithmetic mean,
316 log transform back to Phred.
317 """
318 if not isinstance(phred_values, np.ndarray):
319 phred_values = np.array(phred_values)
320 return _erate_to_phred(np.mean(_phred_to_erate(phred_values)))
321
322
323 def _phred_to_erate(phred_values):
324 """
325 converts a list or numpy array of phred values to a numpy array
326 of error rates
327 """
328 if not isinstance(phred_values, np.ndarray):
329 phred_values = np.array(phred_values)
330 return np.power(10, (-1 * (phred_values / 10)))
331
332
333 def _erate_to_phred(erate_values):
334 """
335 converts a list or numpy array of error rates to a numpy array
336 of phred values
337 """
338 if not isinstance(erate_values, np.ndarray):
339 erate_values = np.array(erate_values)
340 return -10 * np.log10(erate_values)
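# Illustrative sketch only: the mean quality computed by _arithmetic_mean,
# written with the standard library instead of numpy. Phred scores are
# converted to error probabilities, averaged, and converted back. The
# function name mean_phred and the sample scores are assumptions made for
# this example.

```python
import math


def mean_phred(phred_values):
    """Average Phred scores via their error probabilities."""
    error_rates = [10 ** (-q / 10) for q in phred_values]
    mean_error = sum(error_rates) / len(error_rates)
    return -10 * math.log10(mean_error)


scores = [10, 20, 30]
via_error_rates = mean_phred(scores)        # about 14.3
naive_average = sum(scores) / len(scores)   # 20.0
```

# Averaging the error probabilities gives a lower value than averaging the
# Phred scores directly, because low-quality bases dominate the mean error.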
341
342 def timestamp():
343 """
344 Returns the current time in :samp:`YYYYMMDD_HHMMSS` format.
345 """
346 return time.strftime("%Y%m%d_%H%M%S")
0 #!/usr/bin/env python
1 # -*- coding: utf-8 -*-
2
3 # pauvre - a pore plotting package
4 # Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
5 #
6 # This file is part of pauvre.
7 #
8 # pauvre is free software: you can redistribute it and/or modify
9 # it under the terms of the GNU General Public License as published by
10 # the Free Software Foundation, either version 3 of the License, or
11 # (at your option) any later version.
12 #
13 # pauvre is distributed in the hope that it will be useful,
14 # but WITHOUT ANY WARRANTY; without even the implied warranty of
15 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16 # GNU General Public License for more details.
17 #
18 # You should have received a copy of the GNU General Public License
19 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
20
21 """This file contains things related to parsing and plotting GFF files"""
22
23 import copy
24 from matplotlib.path import Path
25 import matplotlib.patches as patches
26
27 global chevron_width
28 global arrow_width
29 global min_text
30 global text_cutoff
31
32 arrow_width = 80
33 chevron_width = 40
34 min_text = 550
35 text_cutoff = 150
36 import sys
37
38 global colorMap
39 colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red',
40 'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white',
41 'ORF':'orange'}
42
43 def _plot_left_to_right_introns(panel, geneid, db, y_pos, text = None):
44 """ plots a left to right patch with introns when there are no intervening
45 sequences to consider. Uses a gene id and gffutils database as input.
46 b
47 a .-=^=-. c
48 1__________2---/ e `---1__________2
49 | #lff \f d| #lff \
50 | left to \3 | left to \3
51 | right / | right /
52 5___________/4 5___________/4
53 """
54 #first we need to determine the number of exons
55 bar_thickness = 0.75
56 #now we can start plotting the exons
57 exonlist = list(db.children(geneid, featuretype='CDS', order_by="start"))
58 for i in range(len(exonlist)):
59 cds_start = exonlist[i].start
60 cds_stop = exonlist[i].stop
61 verts = [(cds_start, y_pos + bar_thickness), #1
62 (cds_stop - chevron_width, y_pos + bar_thickness), #2
63 (cds_stop, y_pos + (bar_thickness/2)), #3
64 (cds_stop - chevron_width, y_pos), #4
65 (cds_start, y_pos), #5
66 (cds_start, y_pos + bar_thickness), #1
67 ]
68 codes = [Path.MOVETO,
69 Path.LINETO,
70 Path.LINETO,
71 Path.LINETO,
72 Path.LINETO,
73 Path.CLOSEPOLY,
74 ]
75 path = Path(verts, codes)
76 patch = patches.PathPatch(path, lw = 0,
77 fc=colorMap['CDS'] )
78 panel.add_patch(patch)
79
80 # we must draw the splice junction
81 if i < len(exonlist) - 1:
82 next_start = exonlist[i+1].start
83 next_stop = exonlist[i+1].stop
84 middle = cds_stop + ((next_start - cds_stop)/2)
85
86 verts = [(cds_stop - chevron_width, y_pos + bar_thickness), #2/a
87 (middle, y_pos + 0.95), #b
88 (next_start, y_pos + bar_thickness), #c
89 (next_start, y_pos + bar_thickness - 0.05), #d
90 (middle, y_pos + 0.95 - 0.05), #e
91 (cds_stop - chevron_width, y_pos + bar_thickness -0.05), #f
92 (cds_stop - chevron_width, y_pos + bar_thickness), #2/a
93 ]
94 codes = [Path.MOVETO,
95 Path.LINETO,
96 Path.LINETO,
97 Path.LINETO,
98 Path.LINETO,
99 Path.LINETO,
100 Path.CLOSEPOLY,
101 ]
102 path = Path(verts, codes)
103 patch = patches.PathPatch(path, lw = 0,
104 fc=colorMap['CDS'] )
105 panel.add_patch(patch)
106
107 return panel
108
109 def _plot_left_to_right_introns_top(panel, geneid, db, y_pos, text = None):
110 """ slightly different from the above version such that splice junctions
111 are more visually explicit.
112
113 plots a left to right patch with introns when there are no intervening
114 sequences to consider. Uses a gene id and gffutils database as input.
115 b
116 a .-=^=-. c
117 1_____________2---/ e `---1_____________2
118 | #lff /f d| #lff /
119 | left to / | left to /
120 | right / | right /
121 4_________/3 4_________/3
122 """
123 #first we need to determine the number of exons
124 bar_thickness = 0.75
125 #now we can start plotting the exons
126 exonlist = list(db.children(geneid, featuretype='CDS', order_by="start"))
127 for i in range(len(exonlist)):
128 cds_start = exonlist[i].start
129 cds_stop = exonlist[i].stop
130 verts = [(cds_start, y_pos + bar_thickness), #1
131 (cds_stop, y_pos + bar_thickness), #2
132 (cds_stop - chevron_width, y_pos), #4
133 (cds_start, y_pos), #5
134 (cds_start, y_pos + bar_thickness), #1
135 ]
136 codes = [Path.MOVETO,
137 Path.LINETO,
138 Path.LINETO,
139 Path.LINETO,
140 Path.CLOSEPOLY,
141 ]
142 path = Path(verts, codes)
143 patch = patches.PathPatch(path, lw = 0,
144 fc=colorMap['CDS'] )
145 panel.add_patch(patch)
146
147 # we must draw the splice junction
148 if i < len(exonlist) - 1:
149 next_start = exonlist[i+1].start
150 next_stop = exonlist[i+1].stop
151 middle = cds_stop + ((next_start - cds_stop)/2)
152
153 verts = [(cds_stop-5, y_pos + bar_thickness), #2/a
154 (middle, y_pos + 0.95), #b
155 (next_start, y_pos + bar_thickness), #c
156 (next_start, y_pos + bar_thickness - 0.05), #d
157 (middle, y_pos + 0.95 - 0.05), #e
158 (cds_stop-5, y_pos + bar_thickness -0.05), #f
159 (cds_stop-5, y_pos + bar_thickness), #2/a
160 ]
161 codes = [Path.MOVETO,
162 Path.LINETO,
163 Path.LINETO,
164 Path.LINETO,
165 Path.LINETO,
166 Path.LINETO,
167 Path.CLOSEPOLY,
168 ]
169 path = Path(verts, codes)
170 patch = patches.PathPatch(path, lw = 0,
171 fc=colorMap['CDS'] )
172 panel.add_patch(patch)
173
174 return panel
175
176 def _plot_lff(panel, left_df, right_df, colorMap, y_pos, bar_thickness, text):
177 """ plots a lff patch
178 1__________2 ____________
179 | #lff \ \ #rff \
180 | left for \3 \ right for \
181 | forward / / forward /
182 5___________/4 /___________/
183 """
184 #if there is only one feature to plot, then just plot it
185
186 print("plotting lff")
187 verts = [(left_df['start'], y_pos + bar_thickness), #1
188 (right_df['start'] - chevron_width, y_pos + bar_thickness), #2
189 (left_df['stop'], y_pos + (bar_thickness/2)), #3
190 (right_df['start'] - chevron_width, y_pos), #4
191 (left_df['start'], y_pos), #5
192 (left_df['start'], y_pos + bar_thickness), #1
193 ]
194 codes = [Path.MOVETO,
195 Path.LINETO,
196 Path.LINETO,
197 Path.LINETO,
198 Path.LINETO,
199 Path.CLOSEPOLY,
200 ]
201 path = Path(verts, codes)
202 patch = patches.PathPatch(path, lw = 0,
203 fc=colorMap[left_df['featType']] )
204 text_width = left_df['width']
205 if text and text_width >= min_text:
206 panel = _plot_label(panel, left_df, y_pos, bar_thickness)
207 elif text and text_width < min_text and text_width >= text_cutoff:
208 panel = _plot_label(panel, left_df,
209 y_pos, bar_thickness,
210 rotate = True, arrow = True)
211
212 return panel, patch
213
214 def _plot_label(panel, df, y_pos, bar_thickness, rotate = False, arrow = False):
215 # handles the case where a dataframe was passed
216 fontsize = 8
217 rotation = 0
218 if rotate:
219 fontsize = 5
220 rotation = 90
221 if len(df) == 1:
222 x =((df.loc[0, 'stop'] - df.loc[0, 'start'])/2) + df.loc[0, 'start']
223 y = y_pos + (bar_thickness/2)
224 # if we need to center somewhere other than the arrow, need to adjust
225 # for the direction of the arrow
226 # it doesn't look good if it shifts by the whole arrow width, so only
227 # shift by half the arrow width
228 if arrow:
229 if df.loc[0, 'strand'] == "+":
230 shift_start = df.loc[0, 'start']
231 else:
232 shift_start = df.loc[0, 'start'] + (arrow_width/2)
233 x =((df.loc[0, 'stop'] - (arrow_width/2) - df.loc[0, 'start'])/2) + shift_start
234 panel.text(x, y,
235 df.loc[0, 'name'], fontsize = fontsize,
236 ha='center', va='center',
237 color = 'white', family = 'monospace',
238 zorder = 100, rotation = rotation)
239 # and the case where a series was passed
240 else:
241 x = ((df['stop'] - df['start'])/2) + df['start']
242 y = y_pos + (bar_thickness/2)
243 if arrow:
244 if df['strand'] == "+":
245 shift_start = df['start']
246 else:
247 shift_start = df['start'] + (arrow_width/2)
248 x =((df['stop'] - (arrow_width/2) - df['start'])/2) + shift_start
249 panel.text(x, y,
250 df['name'], fontsize = fontsize,
251 ha='center', va='center',
252 color = 'white', family = 'monospace',
253 zorder = 100, rotation = rotation)
254
255 return panel
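The label-centering arithmetic in `_plot_label` can be isolated into a small pure function. This is a sketch under assumptions: `label_x` is a hypothetical name, and `arrow_width` is passed explicitly here, whereas the code above reads it from module scope.

```python
def label_x(start, stop, strand, arrow_width, arrow=False):
    """Return the x position at which a feature label would be centered.

    Without `arrow`, the label sits at the midpoint of the feature.
    With `arrow`, half the arrow width is excluded from the span, and for
    reverse-strand features the usable span starts half an arrow width in.
    """
    if not arrow:
        return (stop - start) / 2 + start
    if strand == "+":
        shift_start = start
    else:
        shift_start = start + arrow_width / 2
    return (stop - arrow_width / 2 - start) / 2 + shift_start


# Hypothetical feature spanning 100..200 with an arrow 20 units wide.
plain = label_x(100, 200, "+", arrow_width=20)
forward = label_x(100, 200, "+", arrow_width=20, arrow=True)
reverse = label_x(100, 200, "-", arrow_width=20, arrow=True)
```

The forward and reverse results are symmetric about the plain midpoint, which is the effect the comments in `_plot_label` describe.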
256
257 def _plot_rff(panel, left_df, right_df, colorMap, y_pos, bar_thickness, text):
258 """ plots a rff patch
259 ____________ 1__________2
260 | #lff \ \ #rff \
261 | left for \ 6\ right for \3
262 | forward / / forward /
263 |___________/ /5__________/4
264 """
265 #if there is only one feature to plot, then just plot it
266
267 print("plotting rff")
268 verts = [(right_df['start'], y_pos + bar_thickness), #1
269 (right_df['stop'] - arrow_width, y_pos + bar_thickness), #2
270 (right_df['stop'], y_pos + (bar_thickness/2)), #3
271 (right_df['stop'] - arrow_width, y_pos), #4
272 (right_df['start'], y_pos), #5
273 (left_df['stop'] + chevron_width, y_pos + (bar_thickness/2)), #6
274 (right_df['start'], y_pos + bar_thickness), #1
275 ]
276 codes = [Path.MOVETO,
277 Path.LINETO,
278 Path.LINETO,
279 Path.LINETO,
280 Path.LINETO,
281 Path.LINETO,
282 Path.CLOSEPOLY,
283 ]
284 path = Path(verts, codes)
285 patch = patches.PathPatch(path, lw = 0,
286 fc=colorMap[right_df['featType']] )
287 text_width = right_df['width']
288 if text and text_width >= min_text:
289 panel = _plot_label(panel, right_df, y_pos, bar_thickness)
290 elif text and text_width < min_text and text_width >= text_cutoff:
291 panel = _plot_label(panel, right_df,
292 y_pos, bar_thickness, rotate = True)
293 return panel, patch
294
295 def x_offset_gff(GFFParseobj, x_offset):
296 """Takes in a gff object (a gff file parsed as a pandas dataframe),
297 and an x_offset value and shifts the start, stop, center, lmost, and rmost.
298
299 Returns a GFFParse object with the shifted values in GFFParse.features.
300 """
301 for columnname in ['start', 'stop', 'center', 'lmost', 'rmost']:
302 GFFParseobj.features[columnname] = GFFParseobj.features[columnname] + x_offset
303 return GFFParseobj
304
305 def gffplot_horizontal(figure, panel, args, gff_object,
306 track_width=0.2, start_y=0.1, **kwargs):
307 """
308 Plots gff features along a horizontal track. It was probably written for synplot,
309 as the browser does not use this function at all.
310 """
311 # Because this size should be relative to the circle that it is plotted next
312 # to, define the start_radius as the place to work from, and the width of
313 # each track.
314 colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red',
315 'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white'}
316 augment = 0
317 bar_thickness = 0.9 * track_width
318 # return these at the end
319 myPatches=[]
320 plot_order = []
321
322 idone = False
323 # we need to filter out the tRNAs since those are plotted last
324 plottable_features = gff_object.features.query("featType != 'tRNA' and featType != 'region' and featType != 'source'")
325 plottable_features.reset_index(inplace=True, drop=True)
326 print(plottable_features)
327
328 len_plottable = len(plottable_features)
329 print('len plottable', len_plottable)
330 # - this for loop relies on the gff features to already be sorted
331 # - The algorithm for this loop works by starting at the 0th index of the
332 # plottable features (i).
333 # - It then looks to see if the next object (the jth) overlaps with the
334 # ith element.
335 i = 0
336 j = 1
337 while i < len(plottable_features):
338 if i + j == len(plottable_features):
339 #we have run off of the df and need to include everything from i to the end
340 these_features = plottable_features.loc[i::,].copy(deep=True)
341 these_features = these_features.reset_index()
342 print(these_features)
343 plot_order.append(these_features)
344 i = len(plottable_features)
345 break
346 print(" - i,j are currently: {},{}".format(i, j))
347 stop = plottable_features.loc[i]["stop"]
348 start = plottable_features.loc[i+j]["start"]
349 print("stop: {}. start: {}.".format(stop, start))
350 if plottable_features.loc[i]["stop"] <= plottable_features.loc[i+j]["start"]:
351 print(" - putting elements {} through (including) {} together".format(i, i+j-1))
352 these_features = plottable_features.loc[i:i+j-1,].copy(deep=True)
353 these_features = these_features.reset_index()
354 print(these_features)
355 plot_order.append(these_features)
356 i += j # skip every feature consumed by this group so none is plotted twice
357 j = 1
358 else:
359 j += 1
360
361 #while idone == False:
362 # print("im in the overlap-pairing while loop i={}".format(i))
363 # # look ahead at all of the elements that overlap with the ith element
364 # jdone = False
365 # j = 1
366 # this_set_minimum_index = i
367 # this_set_maximum_index = i
368 # while jdone == False:
369 # print("new i= {} j={} len={}".format(i, j, len_plottable))
370 # print("len plottable in jdone: {}".format(len_plottable))
371 # print("plottable features in jdone:\n {}".format(plottable_features))
372 # # first make sure that we haven't gone off the end of the dataframe
373 # # This is an edge case where i has a jth element that overlaps with it,
374 # # and j is the last element in the plottable features.
375 # if i+j == len_plottable:
376 # print("i+j == len_plottable")
377 # # this checks for the case that i is the last element of the
378 # # plottable features.
379 # # In both of the above cases, we are done with both the ith and
380 # # the jth features.
381 # if i == len_plottable-1:
382 # print("i == len_plottable-1")
383
384 # # this is the last analysis, so set idone to true
385 # # to finish after this
386 # idone = True
387 # # the last one can't be in its own group, so just add it solo
388 # these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,].copy(deep=True)
389 # plot_order.append(these_features.reset_index(drop=True))
390 # break
391 # jdone = True
392 # else:
393 # print("i+j != len_plottable")
394 # # if the lmost of the next gene overlaps with the rmost of
395 # # the current one, it overlaps and couple together
396 # if plottable_features.loc[i+j, 'lmost'] < plottable_features.loc[i, 'rmost']:
397 # print("lmost < rmost")
398 # # note that this feature overlaps with the current
399 # this_set_maximum_index = i+j
400 # # ... and we need to look at the next in line
401 # j += 1
402 # else:
403 # print("lmost !< rmost")
404 # i += 1 + (this_set_maximum_index - this_set_minimum_index)
405 # #add all of the things that grouped together once we don't find any more groups
406 # these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,].copy(deep=True)
407 # plot_order.append(these_features.reset_index(drop=True))
408 # jdone = True
409 # print("plot order is now: {}".format(plot_order))
410 # print("jdone: {}".format(str(jdone)))
411
412 for feature_set in plot_order:
413 # gffplot_feature_hori handles overlapping cases as well as normal cases
414 panel, patches = gffplot_feature_hori(figure, panel, feature_set, colorMap,
415 start_y, bar_thickness, text = True)
416 for each in patches:
417 print("there are {} patches after gffplot_feature_hori".format(len(patches)))
418 print(each)
419 myPatches.append(each)
420 print("length of myPatches is: {}".format(len(myPatches)))
421
422 # Now we add all of the tRNAs to this to plot, do it last to overlay
423 # everything else
424 tRNAs = gff_object.features.query("featType == 'tRNA'")
425 tRNAs.reset_index(inplace=True, drop = True)
426 tRNA_bar_thickness = bar_thickness * (0.8)
427 tRNA_start_y = start_y + ((bar_thickness - tRNA_bar_thickness)/2)
428 for i in range(0,len(tRNAs)):
429 this_feature = tRNAs[i:i+1].copy(deep=True)
430 this_feature.reset_index(inplace=True, drop = True)
431 panel, patches = gffplot_feature_hori(figure, panel, this_feature, colorMap,
432 tRNA_start_y, tRNA_bar_thickness, text = True)
433 for patch in patches:
434 myPatches.append(patch)
435 print("There are {} patches at the end of gffplot_horizontal()".format(len(myPatches)))
436 return panel, myPatches
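The grouping step in `gffplot_horizontal` can be sketched on plain tuples instead of a dataframe. This is an illustration, not the package's code: `group_overlapping` is a hypothetical name, features are assumed to be sorted by start, and the sketch assumes each group is consumed whole before the scan moves on.

```python
def group_overlapping(features):
    """Group consecutive features that overlap the first member of each group.

    `features` is a list of (start, stop) tuples already sorted by start.
    A feature joins the current group while it starts before the group's
    first feature stops, mirroring the stop/start comparison above.
    """
    groups = []
    i = 0
    while i < len(features):
        j = 1
        # Extend the group while the next feature starts before feature i ends.
        while i + j < len(features) and features[i + j][0] < features[i][1]:
            j += 1
        groups.append(features[i:i + j])
        i += j  # move past everything that was just grouped
    return groups


# Hypothetical coordinates: the first two features overlap, the rest do not.
features = [(0, 100), (90, 200), (250, 300), (300, 400)]
groups = group_overlapping(features)
```

Features that merely touch (one stops where the next starts) land in separate groups, matching the `<=` comparison in the loop above.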
437
438 def gffplot_feature_hori(figure, panel, feature_df,
439 colorMap, y_pos, bar_thickness, text=True):
440 """This plots the track for a feature, and if there is something for
441 'this_feature_overlaps_feature', then there is special processing to
442 add the white bar and the extra slope for the chevron
443 """
444 myPatches = []
445 #if there is only one feature to plot, then just plot it
446 if len(feature_df) == 1:
447 #print("plotting a single thing: {} {}".format(str(feature_df['sequence']).split()[1],
448 # str(feature_df['featType']).split()[1] ))
449 #print(this_feature['name'], "is not overlapping")
450 # This plots this shape: 1_________2 2_________1
451 # | forward \3 3/ reverse |
452 # |5__________/4 \4________5|
453 if feature_df.loc[0,'strand'] == '+':
454 verts = [(feature_df.loc[0, 'start'], y_pos + bar_thickness), #1
455 (feature_df.loc[0, 'stop'] - arrow_width, y_pos + bar_thickness), #2
456 (feature_df.loc[0, 'stop'], y_pos + (bar_thickness/2)), #3
457 (feature_df.loc[0, 'stop'] - arrow_width, y_pos), #4
458 (feature_df.loc[0, 'start'], y_pos), #5
459 (feature_df.loc[0, 'start'], y_pos + bar_thickness)] #1
460 elif feature_df.loc[0,'strand'] == '-':
461 verts = [(feature_df.loc[0, 'stop'], y_pos + bar_thickness), #1
462 (feature_df.loc[0, 'start'] + arrow_width, y_pos + bar_thickness), #2
463 (feature_df.loc[0, 'start'], y_pos + (bar_thickness/2)), #3
464 (feature_df.loc[0, 'start'] + arrow_width, y_pos), #4
465 (feature_df.loc[0, 'stop'], y_pos), #5
466 (feature_df.loc[0, 'stop'], y_pos + bar_thickness)] #1
467 feat_width = feature_df.loc[0,'width']
468 if text and feat_width >= min_text:
469 panel = _plot_label(panel, feature_df.loc[0,],
470 y_pos, bar_thickness)
471 elif text and feat_width < min_text and feat_width >= text_cutoff:
472 panel = _plot_label(panel, feature_df.loc[0,],
473 y_pos, bar_thickness,
474 rotate = True, arrow = True)
475
476 codes = [Path.MOVETO,
477 Path.LINETO,
478 Path.LINETO,
479 Path.LINETO,
480 Path.LINETO,
481 Path.CLOSEPOLY]
482 path = Path(verts, codes)
483 print("normal path is: {}".format(path))
484 # If the feature is no wider than the arrow, draw a triangle from vertices 1, 3, and 5 instead.
485 if feature_df.loc[0,'width'] <= arrow_width:
486 path = Path([verts[i] for i in [0,2,4,5]],
487 [codes[i] for i in [0,2,4,5]])
488 patch = patches.PathPatch(path, lw = 0,
489 fc=colorMap[feature_df.loc[0, 'featType']] )
490 myPatches.append(patch)
491 # there are four possible scenarios if there are two overlapping sequences:
492 # ___________ ____________ ____________ ___________
493 # | #1 \ \ #1 \ / #2 / / #2 |
494 # | both seqs \ \ both seqs \ / both seqs / / both seqs |
495 # | forward / / forward / \ reverse \ \ reverse |
496 # |__________/ /___________/ \___________\ \___________|
497 # ___________ _____________ ____________ _ _________
498 # | #3 \ \ #3 | / #2 _| #2 \
499 # | one seq \ \ one seq | / one seq |_ one seq \
500 # | forward \ \ reverse | \ reverse _| forward /
501 # |_____________\ \_________| \__________|_ ___________/
502 #
503 # These different scenarios can be thought of as different left/right
504 # flanking segment types.
505 # In the annotation #rff:
506 # - 'r' means this annotation is the right-hand member of the pair
507 # - the first 'f' refers to the element to the left of this one.
508 # Since that element is forward, the 5' end of this annotation must be a chevron
509 # - the second 'f' refers to the right side of this element. Since it is
510 # forward, it must be a normal arrow.
511 #
512 #
513 # *LEFT TYPES* *RIGHT TYPES*
514 # ____________ ____________
515 # | #lff \ \ #rff \
516 # | left for \ \ right for \
517 # | forward / / forward /
518 # |___________/ /___________/
519 # ___________ _____________
520 # | #lfr \ \ #rfr |
521 # | left for \ \ right for |
522 # | reverse \ \ reverse |
523 # |_____________\ \_________|
524 # ____________ ___________
525 # / #lrr / / #rrr |
526 # / left rev / / right rev |
527 # \ reverse \ \ reverse |
528 # \___________\ \___________|
529 # ____________ __________
530 # / #lrf _| _| #rrf \
531 # / left rev |_ | _ right rev \
532 # \ forward _| _| forward /
533 # \__________| |____________/
534 #
535 # To properly plot these elements, we must go through each element of the
536 # feature_df to determine which patch type it is.
537 elif len(feature_df) == 2:
538 print("im in here feat len=2")
539 for i in range(len(feature_df)):
540 # this tests for which left type we're dealing with
541 if i == 0:
542 # type could be lff or lfr
543 if feature_df.loc[i, 'strand'] == '+':
544 if feature_df.loc[i + 1, 'strand'] == '+':
545 # plot a lff type
546 panel, patch = _plot_lff(panel, feature_df.iloc[i,], feature_df.iloc[i+1,],
547 colorMap, y_pos, bar_thickness, text)
548 myPatches.append(patch)
549 elif feature_df.loc[i + 1, 'strand'] == '-':
550 #plot a lfr type
551 raise IOError("can't plot {} patches yet".format("lfr"))
552 # or type could be lrr or lrf
553 elif feature_df.loc[i, 'strand'] == '-':
554 if feature_df.loc[i + 1, 'strand'] == '+':
555 # plot a lrf type
556 raise IOError("can't plot {} patches yet".format("lrf"))
557 elif feature_df.loc[i + 1, 'strand'] == '-':
558 #plot a lrr type
559 raise IOError("can't plot {} patches yet".format("lrr"))
560 # in this case we're only dealing with 'right type' patches
561 elif i == len(feature_df) - 1:
562 # type could be rff or rfr
563 if feature_df.loc[i-1, 'strand'] == '+':
564 if feature_df.loc[i, 'strand'] == '+':
565 # plot a rff type
566 panel, patch = _plot_rff(panel, feature_df.iloc[i-1,], feature_df.iloc[i,],
567 colorMap, y_pos, bar_thickness, text)
568 myPatches.append(patch)
569 elif feature_df.loc[i, 'strand'] == '-':
570 #plot a rfr type
571 raise IOError("can't plot {} patches yet".format("rfr"))
572 # or type could be rrr or rrf
573 elif feature_df.loc[i-1, 'strand'] == '-':
574 if feature_df.loc[i, 'strand'] == '+':
575 # plot a rrf type
576 raise IOError("can't plot {} patches yet".format("rrf"))
577 elif feature_df.loc[i, 'strand'] == '-':
578 #plot a rrr type
579 raise IOError("can't plot {} patches yet".format("rrr"))
580 return panel, myPatches
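The left/right naming scheme described in the comments above reduces to a lookup on the two strands. The sketch below is illustrative only: `flank_types` is a hypothetical helper, not part of the package. Note that of the eight labels, only `lff` and `rff` are actually plotted by the code above; the rest raise an error.

```python
def flank_types(left_strand, right_strand):
    """Return the (left, right) patch labels for a pair of overlapping features.

    Each label is a side letter ('l' or 'r') followed by the orientation of
    the left feature and then the right feature ('f' forward, 'r' reverse).
    """
    orientation = {"+": "f", "-": "r"}
    pair = orientation[left_strand] + orientation[right_strand]
    return "l" + pair, "r" + pair


both_forward = flank_types("+", "+")
forward_then_reverse = flank_types("+", "-")
reverse_then_forward = flank_types("-", "+")
```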
0 # Binary search tree that holds the event queue, ordered by compare_by_y.
1 # Operations for finding and inserting event points and for removing the minimum one.
2 # Author: Sam Lichtenberg
3 # Email: splichte@princeton.edu
4 # Date: 09/02/2013
5
6 from pauvre.lsi.helper import *
7
8 ev = 0.00000001
9
10 class Q:
11 def __init__(self, key, value):
12 self.key = key
13 self.value = value
14 self.left = None
15 self.right = None
16
17 def find(self, key):
18 if self.key is None:
19 return False
20 c = compare_by_y(key, self.key)
21 if c==0:
22 return True
23 elif c==-1:
24 if self.left:
25 return self.left.find(key)
26 else:
27 return False
28 else:
29 if self.right:
30 return self.right.find(key)
31 else:
32 return False
33 def insert(self, key, value):
34 if self.key is None:
35 self.key = key
36 self.value = value
37 c = compare_by_y(key, self.key)
38 if c==0:
39 self.value += value
40 elif c==-1:
41 if self.left is None:
42 self.left = Q(key, value)
43 else:
44 self.left.insert(key, value)
45 else:
46 if self.right is None:
47 self.right = Q(key, value)
48 else:
49 self.right.insert(key, value)
50 # must return key AND value
51 def get_and_del_min(self, parent=None):
52 if self.left is not None:
53 return self.left.get_and_del_min(self)
54 else:
55 k = self.key
56 v = self.value
57 if parent:
58 parent.left = self.right
59 # i.e. is root node
60 else:
61 if self.right:
62 self.key = self.right.key
63 self.value = self.right.value
64 self.left = self.right.left
65 self.right = self.right.right
66 else:
67 self.key = None
68 return k,v
69
70 def print_tree(self):
71 if self.left:
72 self.left.print_tree()
73 print(self.key)
74 print(self.value)
75 if self.right:
76 self.right.print_tree()
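The order in which `Q` hands out event points follows `compare_by_y`: higher y first, then lower x. A minimal sketch of that ordering, using a sorted list in place of the tree, is shown below. The comparator is restated here so the example is self-contained, and the tolerance is assumed to match the value in helper.py.

```python
from functools import cmp_to_key

EV = 0.0000001  # tolerance, assumed to match helper.py


def approx_equal(a, b, tol=EV):
    return abs(a - b) < tol


def compare_by_y(k1, k2):
    """Higher y is 'less'; for equal y, lower x is 'less'."""
    if approx_equal(k1[1], k2[1]):
        if approx_equal(k1[0], k2[0]):
            return 0
        return -1 if k1[0] < k2[0] else 1
    return -1 if k1[1] > k2[1] else 1


# Hypothetical event points; the sweep line moves from top to bottom.
points = [(0, 0), (2, 5), (1, 5), (3, 2)]
ordered = sorted(points, key=cmp_to_key(compare_by_y))
```

Repeated calls to `get_and_del_min` on a tree holding these keys should visit them in the same order as `ordered`.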
0 # Binary search tree that holds status of sweep line. Only leaves hold values.
1 # Operations for finding left and right neighbors of a query point p and finding which segments contain p.
2 # Author: Sam Lichtenberg
3 # Email: splichte@princeton.edu
4 # Date: 09/02/2013
5
6 from pauvre.lsi.helper import *
7
8 ev = 0.00000001
9
10 class T:
11 def __init__(self):
12 self.root = Node(None, None, None, None)
13 def contain_p(self, p):
14 if self.root.value is None:
15 return [[], []]
16 lists = [[], []]
17 self.root.contain_p(p, lists)
18 return (lists[0], lists[1])
19 def get_left_neighbor(self, p):
20 if self.root.value is None:
21 return None
22 return self.root.get_left_neighbor(p)
23 def get_right_neighbor(self, p):
24 if self.root.value is None:
25 return None
26 return self.root.get_right_neighbor(p)
27 def insert(self, key, s):
28 if self.root.value is None:
29 self.root.left = Node(s, None, None, self.root)
30 self.root.value = s
31 self.root.m = get_slope(s)
32 else:
33 (node, path) = self.root.find_insert_pt(key, s)
34 if path == 'r':
35 node.right = Node(s, None, None, node)
36 node.right.adjust()
37 elif path == 'l':
38 node.left = Node(s, None, None, node)
39 else:
40 # this means matching Node was a leaf
41 # need to make a new internal Node
42 if node.compare_to_key(key) < 0 or (node.compare_to_key(key)==0 and node.compare_lower(key, s) < 1):
43 new_internal = Node(s, None, node, node.parent)
44 new_leaf = Node(s, None, None, new_internal)
45 new_internal.left = new_leaf
46 if node is node.parent.left:
47 node.parent.left = new_internal
48 node.adjust()
49 else:
50 node.parent.right = new_internal
51 else:
52 new_internal = Node(node.value, node, None, node.parent)
53 new_leaf = Node(s, None, None, new_internal)
54 new_internal.right = new_leaf
55 if node is node.parent.left:
56 node.parent.left = new_internal
57 new_leaf.adjust()
58 else:
59 node.parent.right = new_internal
60 node.parent = new_internal
61
62 def delete(self, p, s):
63 key = p
64 node = self.root.find_delete_pt(key, s)
65 val = node.value
66 if node is node.parent.left:
67 parent = node.parent.parent
68 if parent is None:
69 if self.root.right is not None:
70 if self.root.right.left or self.root.right.right:
71 self.root = self.root.right
72 self.root.parent = None
73 else:
74 self.root.left = self.root.right
75 self.root.value = self.root.right.value
76 self.root.m = self.root.right.m
77 self.root.right = None
78 else:
79 self.root.left = None
80 self.root.value = None
81 elif node.parent is parent.left:
82 parent.left = node.parent.right
83 node.parent.right.parent = parent
84 else:
85 parent.right = node.parent.right
86 node.parent.right.parent = parent
87 else:
88 parent = node.parent.parent
89 if parent is None:
90 if self.root.left:
91 # switch properties
92 if self.root.left.right or self.root.left.left:
93 self.root = self.root.left
94 self.root.parent = None
95 else:
96 self.root.right = None
97 else:
98 self.root.right = None
99 self.root.value = None
100 elif node.parent is parent.left:
101 parent.left = node.parent.left
102 node.parent.left.parent = parent
103 farright = node.parent.left
104 while farright.right is not None:
105 farright = farright.right
106 farright.adjust()
107 else:
108 parent.right = node.parent.left
109 node.parent.left.parent = parent
110 farright = node.parent.left
111 while farright.right is not None:
112 farright = farright.right
113 farright.adjust()
114 return val
115
116 def print_tree(self):
117 self.root.print_tree()
118 class Node:
119 def __init__(self, value, left, right, parent):
120 self.value = value # associated line segment
121 self.left = left
122 self.right = right
123 self.parent = parent
124 self.m = None
125 if value is not None:
126 self.m = get_slope(value)
127
128 # compares line segment at y-val of p to p
129 # TODO: remove this and replace with get_x_at
130 def compare_to_key(self, p):
131 x0 = self.value[0][0]
132 y0 = self.value[0][1]
133 y1 = p[1]
134 if self.m != 0 and self.m is not None:
135 x1 = x0 - float(y0-y1)/self.m
136 return compare_by_x(p, (x1, y1))
137 else:
138 x1 = p[0]
139 return 0
140
141 def get_left_neighbor(self, p):
142 neighbor = None
143 n = self
144 if n.left is None and n.right is None:
145 return neighbor
146 last_right = None
147 found = False
148 while not found:
149 c = n.compare_to_key(p)
150 if c < 1 and n.left:
151 n = n.left
152 elif c==1 and n.right:
153 n = n.right
154 last_right = n.parent
155 else:
156 found = True
157 c = n.compare_to_key(p)
158 if c==0:
159 if n is n.parent.right:
160 return n.parent
161 else:
162 goright = None
163 if last_right:
164 goright =last_right.left
165 return self.get_lr(None, goright)[0]
166 # n stores the highest-value in the left subtree
167 if c==-1:
168 goright = None
169 if last_right:
170 goright = last_right.left
171 return self.get_lr(None, goright)[0]
172 if c==1:
173 neighbor = n
174 return neighbor
175
176 def get_right_neighbor(self, p):
177 neighbor = None
178 n = self
179 if n.left is None and n.right is None:
180 return neighbor
181 last_left = None
182 found = False
183 while not found:
184 c = n.compare_to_key(p)
185 if c==0 and n.right:
186 n = n.right
187 elif c < 0 and n.left:
188 n = n.left
189 last_left = n.parent
190 elif c==1 and n.right:
191 n = n.right
192 else:
193 found = True
194 c = n.compare_to_key(p)
195 # can be c==0 and n.left if at root node
196 if c==0:
197 if n.parent is None:
198 return None
199 if n is n.parent.right:
200 goleft = None
201 if last_left:
202 goleft = last_left.right
203 return self.get_lr(goleft, None)[1]
204 else:
205 return self.get_lr(n.parent.right, None)[1]
206 if c==1:
207 goleft = None
208 if last_left:
209 goleft = last_left.right
210 return self.get_lr(goleft, None)[1]
211 if c==-1:
212 return n
213 return neighbor
214
215 # travels down a single direction to get neighbors
216 def get_lr(self, left, right):
217 lr = [None, None]
218 if left:
219 while left.left:
220 left = left.left
221 lr[1] = left
222 if right:
223 while right.right:
224 right = right.right
225 lr[0] = right
226 return lr
227
228 def contain_p(self, p, lists):
229 c = self.compare_to_key(p)
230 if c==0:
231 if self.left is None and self.right is None:
232 if compare_by_x(p, self.value[1])==0:
233 lists[1].append(self.value)
234 else:
235 lists[0].append(self.value)
236 if self.left:
237 self.left.contain_p(p, lists)
238 if self.right:
239 self.right.contain_p(p, lists)
240 elif c < 0:
241 if self.left:
242 self.left.contain_p(p, lists)
243 else:
244 if self.right:
245 self.right.contain_p(p, lists)
246
247 def find_insert_pt(self, key, seg):
248 if self.left and self.right:
249 if self.compare_to_key(key) == 0 and self.compare_lower(key, seg)==1:
250 return self.right.find_insert_pt(key, seg)
251 elif self.compare_to_key(key) < 1:
252 return self.left.find_insert_pt(key, seg)
253 else:
254 return self.right.find_insert_pt(key, seg)
255 # this case only happens at root
256 elif self.left:
257 if self.compare_to_key(key) == 0 and self.compare_lower(key, seg)==1:
258 return (self, 'r')
259 elif self.compare_to_key(key) < 1:
260 return self.left.find_insert_pt(key, seg)
261 else:
262 return (self, 'r')
263 else:
264 return (self, 'n')
265
266 # adjusts stored segments in inner nodes
267 def adjust(self):
268 value = self.value
269 m = self.m
270 parent = self.parent
271 node = self
272 # go up left as much as possible
273 while parent and node is parent.right:
274 node = parent
275 parent = node.parent
276 # parent to adjust will be on the immediate right
277 if parent and node is parent.left:
278 parent.value = value
279 parent.m = m
280
281 def compare_lower(self, p, s2):
282 y = p[1] - 10
283 key = get_x_at(s2, (p[0], y))
284 return self.compare_to_key(key)
285
286 # returns matching leaf node, or None if no match
287 # when deleting, you don't delete below--you delete above! so compare lower = -1.
288 def find_delete_pt(self, key, value):
289 if self.left and self.right:
290 # if equal at this pt, and this node's value is less than the seg's slightly above this pt
291 if self.compare_to_key(key) == 0 and self.compare_lower(key, value)==-1:
292 return self.right.find_delete_pt(key, value)
293 if self.compare_to_key(key) < 1:
294 return self.left.find_delete_pt(key, value)
295 else:
296 return self.right.find_delete_pt(key, value)
297 elif self.left:
298 if self.compare_to_key(key) < 1:
299 return self.left.find_delete_pt(key, value)
300 else:
301 return None
302 # is leaf
303 else:
304 if self.compare_to_key(key)==0 and segs_equal(self.value, value):
305 return self
306 else:
307 return None
308
309 # also prints depth of each node
310 def print_tree(self, l=0):
311 l += 1
312 if self.left:
313 self.left.print_tree(l)
314 if self.left or self.right:
315 print('INTERNAL: {0}'.format(l))
316 else:
317 print('LEAF: {0}'.format(l))
318 print(self)
319 print(self.value)
320 if self.right:
321 self.right.print_tree(l)
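The comparison that drives the status tree (`compare_to_key`) asks where a stored segment crosses the sweep line at the query point's height. A stand-alone sketch of that calculation follows. The function names are hypothetical, and horizontal and vertical segments are reported as ties, as in the code above.

```python
def x_on_segment_at_y(seg, y):
    """Return the x where the line through `seg` crosses height y.

    Returns None for horizontal and vertical segments, which the tree code
    treats as a special case.
    """
    (x0, y0), (x1, y1) = seg
    if x1 == x0 or y1 == y0:
        return None
    m = (y1 - y0) / (x1 - x0)
    return x0 - (y0 - y) / m


def side_of_segment(p, seg, tol=0.0000001):
    """Return -1 if p lies left of seg at p's height, 1 if right, 0 if on it."""
    x = x_on_segment_at_y(seg, p[1])
    if x is None or abs(p[0] - x) < tol:
        return 0
    return -1 if p[0] < x else 1


# Hypothetical segment running from (0, 10) down to (10, 0).
seg = ((0, 10), (10, 0))
crossing = x_on_segment_at_y(seg, 5)
```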
(New empty file)
0 # Helper functions for use in the lsi implementation.
1
2 ev = 0.0000001
3 # floating-point comparison
4 def approx_equal(a, b, tol):
5 return abs(a - b) < tol
6
7 # compares x-values of two pts
8 # used for ordering in T
9 def compare_by_x(k1, k2):
10 if approx_equal(k1[0], k2[0], ev):
11 return 0
12 elif k1[0] < k2[0]:
13 return -1
14 else:
15 return 1
16
17 # higher y value is "less"; if y value equal, lower x value is "less"
18 # used for ordering in Q
19 def compare_by_y(k1, k2):
20 if approx_equal(k1[1], k2[1], ev):
21 if approx_equal(k1[0], k2[0], ev):
22 return 0
23 elif k1[0] < k2[0]:
24 return -1
25 else:
26 return 1
27 elif k1[1] > k2[1]:
28 return -1
29 else:
30 return 1
31
32 # tests if s0 and s1 represent the same segment (i.e. pts can be in 2 different orders)
33 def segs_equal(s0, s1):
34 x00 = s0[0][0]
35 y00 = s0[0][1]
36 x01 = s0[1][0]
37 y01 = s0[1][1]
38 x10 = s1[0][0]
39 y10 = s1[0][1]
40 x11 = s1[1][0]
41 y11 = s1[1][1]
42 if (approx_equal(x00, x10, ev) and approx_equal(y00, y10, ev)):
43 if (approx_equal(x01, x11, ev) and approx_equal(y01, y11, ev)):
44 return True
45 if (approx_equal(x00, x11, ev) and approx_equal(y00, y11, ev)):
46 if (approx_equal(x01, x10, ev) and approx_equal(y01, y10, ev)):
47 return True
48 return False
49
50 # get m for a given seg in (p1, p2) form
51 def get_slope(s):
52 x0 = s[0][0]
53 y0 = s[0][1]
54 x1 = s[1][0]
55 y1 = s[1][1]
56 if (x1-x0)==0:
57 return None
58 else:
59 return float(y1-y0)/(x1-x0)
60
61 # given a point p, return the point on s that shares p's y-val
62 def get_x_at(s, p):
63 m = get_slope(s)
64 # TODO: this should check if p's x-val is actually on seg; we're assuming
65 # for now that it would have been deleted already if not
66 if m == 0: # horizontal segment
67 return p
68 # ditto; should check if y-val on seg
69 if m is None: # vertical segment
70 return (s[0][0], p[1])
71 x1 = s[0][0]-(s[0][1]-p[1])/m
72 return (x1, p[1])
73
74 # returns the point at which two line segments intersect, or None if no intersection.
75 def intersect(seg1, seg2):
76 p = seg1[0]
77 r = (seg1[1][0]-seg1[0][0], seg1[1][1]-seg1[0][1])
78 q = seg2[0]
79 s = (seg2[1][0]-seg2[0][0], seg2[1][1]-seg2[0][1])
80 denom = r[0]*s[1]-r[1]*s[0]
81 if denom == 0:
82 return None
83 numer = float(q[0]-p[0])*s[1]-(q[1]-p[1])*s[0]
84 t = numer/denom
85 numer = float(q[0]-p[0])*r[1]-(q[1]-p[1])*r[0]
86 u = numer/denom
87 if (t < 0 or t > 1) or (u < 0 or u > 1):
88 return None
89 x = p[0]+t*r[0]
90 y = p[1]+t*r[1]
91 return (x, y)
92
93
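The `intersect` helper above uses the standard parametric form: with p + t·r on the first segment and q + u·s on the second, the segments meet when both t and u fall in [0, 1]. A compact restatement with worked inputs is sketched below. `segment_intersection` is a hypothetical name, and the coordinates are made up for illustration.

```python
def segment_intersection(seg1, seg2):
    """Return the point where two segments cross, or None if they do not."""
    p, q = seg1[0], seg2[0]
    r = (seg1[1][0] - p[0], seg1[1][1] - p[1])  # direction of seg1
    s = (seg2[1][0] - q[0], seg2[1][1] - q[1])  # direction of seg2
    denom = r[0] * s[1] - r[1] * s[0]           # 2D cross product r x s
    if denom == 0:
        return None                             # parallel or collinear
    qp = (q[0] - p[0], q[1] - p[1])
    t = (qp[0] * s[1] - qp[1] * s[0]) / denom   # position along seg1
    u = (qp[0] * r[1] - qp[1] * r[0]) / denom   # position along seg2
    if not (0 <= t <= 1 and 0 <= u <= 1):
        return None                             # lines cross outside the segments
    return (p[0] + t * r[0], p[1] + t * r[1])


crossing = segment_intersection(((0, 0), (2, 2)), ((0, 2), (2, 0)))
parallel = segment_intersection(((0, 0), (1, 1)), ((0, 1), (1, 2)))
too_short = segment_intersection(((0, 0), (1, 1)), ((3, 0), (0, 3)))
```

In the first case denom is -8 and both t and u come out to 0.5, giving the midpoint of each diagonal. In the last case the supporting lines meet at (1.5, 1.5), but t is 1.5, so the point lies beyond the end of the first segment.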
0 # Implementation of the Bentley-Ottmann algorithm, described in de Berg et al., ch. 2.
1 # See README for more information.
2 # Author: Sam Lichtenberg
3 # Email: splichte@princeton.edu
4 # Date: 09/02/2013
5
6 from pauvre.lsi.Q import Q
7 from pauvre.lsi.T import T
8 from pauvre.lsi.helper import *
9
10 # "close enough" for floating point
11 ev = 0.00000001
12
13 # how much lower to get the x of a segment, to determine which of a set of segments is the farthest right/left
14 lower_check = 100
15
# gets the point on a segment at a lower y value.
def getNextPoint(p, seg, y_lower):
    p1 = seg[0]
    p2 = seg[1]
    if (p1[0]-p2[0])==0:
        return (p[0]+10, p[1])
    slope = float(p1[1]-p2[1])/(p1[0]-p2[0])
    if slope==0:
        return (p1[0], p[1]-y_lower)
    y = p[1]-y_lower
    x = p1[0]-(p1[1]-y)/slope
    return (x, y)

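# Illustration (not part of the upstream file): a sketch of why a point is
# sampled below the event point. Segments that all pass through p cannot be
# ordered by their x at p, but their x-values a little lower down give a
# left-to-right order. The name x_below is hypothetical. This sketch keeps a
# vertical segment at its own x and rejects horizontal segments, which differs
# from the special cases in getNextPoint.

```python
def x_below(p, seg, dy):
    """x-coordinate of the line through seg at height p[1] - dy."""
    (x0, y0), (x1, y1) = seg
    if x0 == x1:
        return x0                      # vertical: x never changes
    if y0 == y1:
        raise ValueError("horizontal segments are not handled in this sketch")
    slope = (y1 - y0) / (x1 - x0)
    return x0 - (y0 - (p[1] - dy)) / slope


p = (5.0, 5.0)
falls_right = ((0.0, 10.0), (10.0, 0.0))
falls_left = ((10.0, 10.0), (0.0, 0.0))
vertical = ((5.0, 10.0), (5.0, 0.0))

# All three segments pass through p; sort them by where they are one unit lower.
ordered = sorted([falls_right, falls_left, vertical], key=lambda s: x_below(p, s, 1.0))
print(ordered[0] == falls_left, ordered[-1] == falls_right)  # True True
```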
"""
for each event point:
    U_p = segments that have p as an upper endpoint
    C_p = segments that contain p
    L_p = segments that have p as a lower endpoint
"""
def handle_event_point(p, segs, q, t, intersections):
    rightmost = (float("-inf"), 0)
    rightmost_seg = None
    leftmost = (float("inf"), 0)
    leftmost_seg = None

    U_p = segs
    (C_p, L_p) = t.contain_p(p)
    merge_all = U_p+C_p+L_p
    if len(merge_all) > 1:
        intersections[p] = []
        for s in merge_all:
            intersections[p].append(s)
    merge_CL = C_p+L_p
    merge_UC = U_p+C_p
    for s in merge_CL:
        # deletes at a point slightly above (to break ties) - where seg is located in tree
        # above intersection point
        t.delete(p, s)
    # put segments into T based on where they are at y-val just below p[1]
    for s in merge_UC:
        n = getNextPoint(p, s, lower_check)
        if n[0] > rightmost[0]:
            rightmost = n
            rightmost_seg = s
        if n[0] < leftmost[0]:
            leftmost = n
            leftmost_seg = s
        t.insert(p, s)

    # means only L_p -> check newly-neighbored segments
    if len(merge_UC) == 0:
        neighbors = (t.get_left_neighbor(p), t.get_right_neighbor(p))
        if neighbors[0] and neighbors[1]:
            find_new_event(neighbors[0].value, neighbors[1].value, p, q)

    # of newly inserted pts, find possible intersections to left and right
    else:
        left_neighbor = t.get_left_neighbor(p)
        if left_neighbor:
            find_new_event(left_neighbor.value, leftmost_seg, p, q)
        right_neighbor = t.get_right_neighbor(p)
        if right_neighbor:
            find_new_event(right_neighbor.value, rightmost_seg, p, q)

def find_new_event(s1, s2, p, q):
    i = intersect(s1, s2)
    if i:
        if compare_by_y(i, p) == 1:
            if not q.find(i):
                q.insert(i, [])

# segment is in ((x, y), (x, y)) form
# first pt in a segment should have higher y-val - this is handled in function
def intersection(S):
    s0 = S[0]
    if s0[1][1] > s0[0][1]:
        s0 = (s0[1], s0[0])
    q = Q(s0[0], [s0])
    q.insert(s0[1], [])
    intersections = {}
    for s in S[1:]:
        if s[1][1] > s[0][1]:
            s = (s[1], s[0])
        q.insert(s[0], [s])
        q.insert(s[1], [])
    t = T()
    while q.key:
        p, segs = q.get_and_del_min()
        handle_event_point(p, segs, q, t, intersections)
    return intersections

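# Illustration (not part of the upstream file): a brute-force O(n^2) sketch that
# produces a result of the same shape as intersection(S), a dict mapping each
# intersection point to the segments that meet there. It can serve as a
# cross-check for the sweep on small inputs. The names orient, crossing_point
# and brute_force_intersections are hypothetical, and the sweep's Q and T
# structures are deliberately not used here.

```python
from itertools import combinations


def orient(seg):
    """Return seg with the higher-y endpoint first, mirroring intersection()."""
    a, b = seg
    return (b, a) if b[1] > a[1] else (a, b)


def crossing_point(seg1, seg2):
    """Parametric segment test, restated so this sketch runs on its own."""
    p, q = seg1[0], seg2[0]
    r = (seg1[1][0] - p[0], seg1[1][1] - p[1])
    s = (seg2[1][0] - q[0], seg2[1][1] - q[1])
    denom = r[0] * s[1] - r[1] * s[0]
    if denom == 0:
        return None
    t = ((q[0] - p[0]) * s[1] - (q[1] - p[1]) * s[0]) / denom
    u = ((q[0] - p[0]) * r[1] - (q[1] - p[1]) * r[0]) / denom
    if not (0 <= t <= 1 and 0 <= u <= 1):
        return None
    return (p[0] + t * r[0], p[1] + t * r[1])


def brute_force_intersections(segments):
    found = {}
    for s1, s2 in combinations([orient(s) for s in segments], 2):
        pt = crossing_point(s1, s2)
        if pt is None:
            continue
        members = found.setdefault(pt, [])
        for s in (s1, s2):
            if s not in members:
                members.append(s)
    return found


segments = [((0.0, 0.0), (4.0, 4.0)),
            ((0.0, 4.0), (4.0, 0.0)),
            ((2.0, 5.0), (2.0, -1.0))]
result = brute_force_intersections(segments)
print(result)  # one key, (2.0, 2.0), listing all three segments
```

# Points are used as exact float dict keys, so crossings that differ only by
# rounding error would land under separate keys; a tolerance-based merge would
# be needed for general input.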
# Test file for lsi.
# Author: Sam Lichtenberg
# Email: splichte@princeton.edu
# Date: 09/02/2013

from lsi import intersection
import ast
import random
import time, sys
from helper import *

ev = 0.00000001

def scale(i):
    return float(i)

# optional second argument: a file holding a previously written segment list
use_file = None
try:
    use_file = sys.argv[2]
except IndexError:
    pass

if not use_file:
    S = []
    for i in range(int(sys.argv[1])):
        p1 = (scale(random.randint(0, 1000)), scale(random.randint(0, 1000)))
        p2 = (scale(random.randint(0, 1000)), scale(random.randint(0, 1000)))
        s = (p1, p2)
        S.append(s)
    # save the random input so a failing run can be replayed
    with open('input', 'w') as f:
        f.write(str(S))

else:
    # the file holds a Python literal written by str(S) above
    with open(sys.argv[2], 'r') as f:
        S = ast.literal_eval(f.read())

intersections = []
seen = []
vs = False
hs = False
es = False
now = time.time()
for seg1 in S:
    if approx_equal(seg1[0][0], seg1[1][0], ev):
        print('VERTICAL SEG')
        print('')
        print('')
        vs = True
    if approx_equal(seg1[0][1], seg1[1][1], ev):
        print('HORIZONTAL SEG')
        print('')
        print('')
        hs = True
    for seg2 in S:
        if seg1 is not seg2 and segs_equal(seg1, seg2):
            print('EQUAL SEGS')
            print('')
            print('')
            es = True
        if seg1 is not seg2 and (seg2, seg1) not in seen:
            i = intersect(seg1, seg2)
            if i:
                intersections.append((i, [seg1, seg2]))
                # xpts = [seg1[0][0], seg1[1][0], seg2[0][0], seg2[1][0]]
                # xpts = sorted(xpts)
                # if (i[0] <= xpts[2] and i[0] >= xpts[1]:
                #     intersections.append((i, [seg1, seg2]))
            seen.append((seg1, seg2))
later = time.time()
n2time = later-now
print("Line sweep results:")
now = time.time()
lsinters = intersection(S)
inters = []
for k, v in lsinters.items():
    #print('{0}: {1}'.format(k, v))
    inters.append(k)
    # inters.append(v)
later = time.time()
print('TIME ELAPSED: {0}'.format(later-now))
print("N^2 comparison results:")
pts_seen = []
highestseen = 0
for i in intersections:
    seen_already = False
    seen = 0
    for p in pts_seen:
        if approx_equal(i[0][0], p[0], ev) and approx_equal(i[0][1], p[1], ev):
            seen += 1
            seen_already = True
    if seen > highestseen:
        highestseen = seen
    if not seen_already:
        pts_seen.append(i[0])
    in_k = False
    for k in inters:
        if approx_equal(k[0], i[0][0], ev) and approx_equal(k[1], i[0][1], ev):
            in_k = True
    if in_k == False:
        print('Not in K: {0}: {1}'.format(i[0], i[1]))
#        print(i)
print(highestseen)
print('TIME ELAPSED: {0}'.format(n2time))
#print('Missing from line sweep but in N^2:')
#for i in seen:
#    matched = False
print(len(lsinters))
print(len(pts_seen))
if len(lsinters) != len(pts_seen):
    print('uh oh!')
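# Illustration (not part of the upstream file): a sketch of the tolerance-based
# de-duplication that the test performs on the N^2 results before comparing
# counts with the line sweep. The approx_equal defined here is an assumption
# (absolute difference below a tolerance); the real helper lives elsewhere in
# the package and may differ. The name unique_points is hypothetical.

```python
def approx_equal(a, b, tol):
    """Assumed behaviour: True when a and b differ by less than tol."""
    return abs(a - b) < tol


def unique_points(points, tol=1e-8):
    """Keep the first of each group of points that agree within tol."""
    unique = []
    for x, y in points:
        already_seen = any(
            approx_equal(x, ux, tol) and approx_equal(y, uy, tol)
            for ux, uy in unique
        )
        if not already_seen:
            unique.append((x, y))
    return unique


pts = [(1.0, 1.0), (1.0 + 1e-10, 1.0), (2.0, 2.0)]
print(unique_points(pts))  # [(1.0, 1.0), (2.0, 2.0)]
```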
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# pauvre
# Copyright (c) 2016-2020 Darrin T. Schultz.
#
# This file is part of pauvre.
#
# pauvre is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# pauvre is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with pauvre. If not, see <http://www.gnu.org/licenses/>.

import ast
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.patches as mplpatches
from matplotlib.colors import LinearSegmentedColormap
import numpy as np
import pandas as pd
import os.path as opath
from sys import stderr
from pauvre.functions import parse_fastq_length_meanqual, print_images, filter_fastq_length_meanqual
from pauvre.stats import stats
import pauvre.rcparams as rc
import logging

# logging
logger = logging.getLogger('pauvre')


def generate_panel(panel_left, panel_bottom, panel_width, panel_height,
                   axis_tick_param='both', which_tick_param='both',
                   bottom_tick_param=True, label_bottom_tick_param=True,
                   left_tick_param=True, label_left_tick_param=True,
                   right_tick_param=False, label_right_tick_param=False,
                   top_tick_param=False, label_top_tick_param=False):
    """
    Create a panel and set its tick parameters. Some of the values passed
    here are the defaults for matplotlib anyway, but they are specified for
    readability. Options accepted by matplotlib's tick_params, with
    matplotlib's own defaults where one is listed:

    axis : {'x', 'y', 'both'}; which axis to modify; default = 'both'
    which : {'major', 'minor', 'both'}; which ticks to modify;
            default = 'major'
    bottom, top, left, right : bool; ticks on or off
    labelbottom, labeltop, labelleft, labelright : bool; labels on or off
    """

    # create the panel
    panel_rectangle = [panel_left, panel_bottom, panel_width, panel_height]
    panel = plt.axes(panel_rectangle)

    # Set tick parameters
    panel.tick_params(axis=axis_tick_param,
                      which=which_tick_param,
                      bottom=bottom_tick_param,
                      labelbottom=label_bottom_tick_param,
                      left=left_tick_param,
                      labelleft=label_left_tick_param,
                      right=right_tick_param,
                      labelright=label_right_tick_param,
                      top=top_tick_param,
                      labeltop=label_top_tick_param)

    return panel


def _generate_histogram_bin_patches(panel, bins, bin_values, horizontal=True):
    """This helper method generates the histogram that is added to the panel.

    In this case, horizontal = True applies to the mean quality histogram.
    So, horizontal = False only applies to the length histogram.
    """
    l_width = 0.0
    f_color = (0.5, 0.5, 0.5)
    e_color = (0, 0, 0)
    if horizontal:
        for step in np.arange(0, len(bin_values), 1):
            left = bins[step]
            bottom = 0
            width = bins[step + 1] - bins[step]
            height = bin_values[step]
            hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
                                                  linewidth=l_width,
                                                  facecolor=f_color,
                                                  edgecolor=e_color)
            panel.add_patch(hist_rectangle)
    else:
        for step in np.arange(0, len(bin_values), 1):
            left = 0
            bottom = bins[step]
            width = bin_values[step]
            height = bins[step + 1] - bins[step]

            hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
                                                  linewidth=l_width,
                                                  facecolor=f_color,
                                                  edgecolor=e_color)
            panel.add_patch(hist_rectangle)

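# Illustration (not part of the upstream file): a matplotlib-free sketch of the
# geometry computed by _generate_histogram_bin_patches. Each bin becomes one
# rectangle, given here as a (left, bottom, width, height) tuple; with
# horizontal=False the roles of the bin span and the bin count are swapped so
# the bars grow sideways. The name histogram_rectangles is hypothetical.

```python
def histogram_rectangles(bins, bin_values, horizontal=True):
    """Return one (left, bottom, width, height) tuple per histogram bin.

    bins holds the len(bin_values) + 1 bin edges; bin_values holds the counts.
    """
    rectangles = []
    for i, count in enumerate(bin_values):
        span = bins[i + 1] - bins[i]
        if horizontal:
            # bars stand on the x-axis and rise with the count
            rectangles.append((bins[i], 0, span, count))
        else:
            # bars start at the y-axis and extend right with the count
            rectangles.append((0, bins[i], count, span))
    return rectangles


edges = [0, 10, 20, 30]
counts = [3, 7, 2]
print(histogram_rectangles(edges, counts))
# [(0, 0, 10, 3), (10, 0, 10, 7), (20, 0, 10, 2)]
print(histogram_rectangles(edges, counts, horizontal=False))
# [(0, 0, 3, 10), (0, 10, 7, 10), (0, 20, 2, 10)]
```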