Codebase list python-pauvre / 8944f59
New upstream version 0.1924 Andreas Tille 3 years ago
34 changed file(s) with 6109 addition(s) and 0 deletion(s). Raw diff Collapse all Expand all
0 include scripts/test.sh
0 Metadata-Version: 1.2
1 Name: pauvre
2 Version: 0.1924
3 Summary: Tools for plotting Oxford Nanopore and other long-read data.
4 Home-page: https://github.com/conchoecia/pauvre
5 Author: Darrin Schultz
6 Author-email: dts@ucsc.edu
7 License: GPLv3
8 Description:
9 'pauvre' is a package for plotting Oxford Nanopore and other long read data.
10 The name means 'poor' in French, a play on words to the oft-used 'pore' prefix
11 for similar packages. This package was designed for python 3, but it might work in
12 python 2. You can visit the gitub page for more detailed information here:
13 https://github.com/conchoecia/pauvre
14
15 Platform: UNKNOWN
16 Classifier: Development Status :: 2 - Pre-Alpha
17 Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
18 Classifier: Programming Language :: Python :: 3
19 Classifier: Programming Language :: Python :: 3.5
20 Classifier: Operating System :: POSIX :: Linux
21 Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
22 Classifier: Intended Audience :: Science/Research
23 Requires: python (>3.0)
24 Provides: pauvre
25 Requires-Python: >=3
0 [![travis-ci](https://travis-ci.org/conchoecia/pauvre.svg?branch=master)](https://travis-ci.org/conchoecia/pauvre) [![DOI](https://zenodo.org/badge/112774670.svg)](https://zenodo.org/badge/latestdoi/112774670)
1
2
3 ## pauvre: a plotting package designed for nanopore and PacBio long reads
4
5 This package currently hosts four scripts for plotting and/or printing stats.
6
7 - `pauvre marginplot`
8 - takes a fastq file as input and outputs a marginal histogram with a heatmap.
9 - `pauvre stats`
10 - Takes a fastq file as input and prints out a table of stats, including how many basepairs/reads there are for a length/mean quality cutoff.
11 - This is also automagically called when using `pauvre marginplot`
12 - `pauvre redwood`
13 - I am happy to introduce the redwood plot to the world as a method
14 of representing circular genomes. A redwood plot contains long
15 reads as "rings" on the inside, a gene annotation
16 "cambrium/phloem", and a RNAseq "bark". The input is `.bam` files
17 for the long reads and RNAseq data, and a `.gff` file for the
18 annotation. More details to follow as we document this program
19 better...
20 - `pauvre synteny`
21 - Makes a synteny plot of circular genomes. Finds the most
22 parsimonius rotation to display the synteny of all the input
23 genomes with the fewest crossings-over. Input is one `.gff` file
24 per circular genome and one directory of gene alignments.
25
26 ## Updates:
27 - 20200215 - v0.1.924 - Made some minor updates to work with python 3.7 and the latest version of pandas,
28 - 20171130 - v0.1.86 - some changes by @wdecoster to integrate `pauvre` into [nanoplot](https://github.com/wdecoster/NanoPlot),
29 as well as some formatting changes that *may* make `pauvre` work better with python2.7. Adding Travis-CI functionality.
30 - 20171025 - v0.1.83 - added some changes to make marginplot interface
31 with @wdecoster's [nanoPlot](https://github.com/wdecoster/NanoPlot)
32 package, and made `pauvre stats` only output data tables for
33 filtered reads. `pauvre stats` also now has the `--filt_maxlen`,
34 `--filt_maxqual`, `--filt_minlen`, and `--filt_minqual` options.
35 - 20171018 - v0.1.8 - you can now filter reads and adjust the plotting viewing window.
36 [See below for a demonstration.](#filter-reads-and-adjust-viewing-window) I added the following options:
37
38 ```
39 --filt_maxlen FILT_MAXLEN
40 This sets the max read length filter reads.
41 --filt_maxqual FILT_MAXQUAL
42 This sets the max mean read quality to filter reads.
43 --filt_minlen FILT_MINLEN
44 This sets the min read length to filter reads.
45 --filt_minqual FILT_MINQUAL
46 This sets the min mean read quality to filter reads.
47 --plot_maxlen PLOT_MAXLEN
48 Sets the maximum viewing area in the length dimension.
49 --plot_maxqual PLOT_MAXQUAL
50 Sets the maximum viewing area in the quality
51 dimension.
52 --plot_minlen PLOT_MINLEN
53 Sets the minimum viewing area in the length dimension.
54 --plot_minqual PLOT_MINQUAL
55 Sets the minimum viewing area in the quality
56 dimension.
57 ```
58 - 20171014 - uploading information on `pauvre redwood` and `pauvre synteny` usage.
59 - 20171012 - made `pauvre stats` more consistently produce useful histograms.
60 `pauvre stats` now also calculates some statistics for different size ranges.
61 - 20170529 - added automatic scaling to the input fastq file. It
62 scales to show the highest read quality and the top 99th percentile
63 of reads by length.
64
65 # Requirements
66
67 - You must have the following installed on your system to install this software:
68 - python 3.x
69 - matplotlib
70 - biopython
71 - pandas
72 - pillow
73
74 # Installation
75
76 - Instructions to install on your mac or linux system. Not sure on
77 Windows! Make sure *python 3* is the active environment before
78 installing.
79 - `git clone https://github.com/conchoecia/pauvre.git`
80 - `cd ./pauvre`
81 - `pip3 install .`
82 - Or, install with pip
83 - `pip3 install pauvre`
84
85 # Usage
86
87 ## `stats`
88 - generate basic statistics about the fastq file. For example, if I
89 want to know the number of bases and reads with AT LEAST a PHRED
90 score of 5 and AT LEAST a read length of 500, run the program as below
91 and look at the cells highlighted with `<braces>`.
92 - `pauvre stats --fastq miniDSMN15.fastq`
93
94
95 ```
96 numReads: 1000
97 numBasepairs: 1029114
98 meanLen: 1029.114
99 medianLen: 875.5
100 minLen: 11
101 maxLen: 5337
102 N50: 1278
103 L50: 296
104
105 Basepairs >= bin by mean PHRED and length
106 minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
107 0 1029114 1010681 935366 429279 143948 25139 3668 2938 2000 0
108 500 984212 <968653> 904787 421307 142003 24417 3668 2938 2000 0
109 1000 659842 649319 616788 300948 103122 17251 2000 2000 2000 0
110 et cetera...
111 Number of reads >= bin by mean Phred+Len
112 minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
113 0 1000 969 865 366 118 22 3 2 1 0
114 500 873 <859> 789 347 113 20 3 2 1 0
115 1000 424 418 396 187 62 11 1 1 1 0
116 et cetera...
117 ```
118
119 ## `marginplot`
120
121 ### Basic usage
122 - automatically calls `pauvre stats` for each fastq file
123 - Make the default plot showing the 99th percentile of longest reads
124 - `pauvre marginplot --fastq miniDSMN15.fastq`
125 - ![default](files/default_miniDSMN15.png)
126 - Make a marginal histogram for ONT 2D or 1D^2 cDNA data with a
127 lower maxlen and higher maxqual.
128 - `pauvre marginplot --maxlen 4000 --maxqual 25 --lengthbin 50 --fileform pdf png --qualbin 0.5 --fastq miniDSMN15.fastq`
129 - ![example1](files/miniDSMN15.png)
130
131 ### Filter reads and adjust viewing window
132 - Filter out reads with a mean quality less than 5, and a length
133 less than 800. Zoom in to plot only mean quality of at least 4 and
134 read length at least 500bp.
135 - `pauvre marginplot -f miniDSMN15.fastq --filt_minqual 5 --filt_minlen 800 -y --plot_minlen 500 --plot_minqual 4`
136 - ![test4](files/test4.png)
137
138 ### Specialized Options
139
140 - Plot ONT 1D data with a large tail
141 - `pauvre marginplot --maxlen 100000 --maxqual 15 --lengthbin 500 <myfile>.fastq`
142 - Get more resolution on lengths
143 - `pauvre marginplot --maxlen 100000 --lengthbin 5 <myfile>.fastq`
144
145 ### Transparency
146
147 - Turn off transparency if you just want a white background
148 - `pauvre marginplot --transparent False <myfile>.fastq`
149 - Note: transparency is the default behavior
150 - ![transparency](files/transparency.001.jpeg)
151
152 # Contributors
153
154 @conchoecia (Darrin Schultz)
155 @mebbert (Mark Ebbert)
156 @wdecoster (Wouter De Coster)
0 from pauvre.version import __version__
0 #!/usr/bin/env python
1 # -*- coding: utf-8 -*-
2
3 # pauvre - just a pore plotting package
4 # Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
5 # twitter @conchoecia
6 #
7 # This file is part of pauvre.
8 #
9 # pauvre is free software: you can redistribute it and/or modify
10 # it under the terms of the GNU General Public License as published by
11 # the Free Software Foundation, either version 3 of the License, or
12 # (at your option) any later version.
13 #
14 # pauvre is distributed in the hope that it will be useful,
15 # but WITHOUT ANY WARRANTY; without even the implied warranty of
16 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
17 # GNU General Public License for more details.
18 #
19 # You should have received a copy of the GNU General Public License
20 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
21 import pysam
22 import pandas as pd
23 import os
24
25 class BAMParse():
26 """This class reads in a sam/bam file and constructs a pandas
27 dataframe of all the relevant information for the reads to pass on
28 and plot.
29 """
30 def __init__(self, filename, chrid = None, start = None,
31 stop = None, doubled = None):
32 self.filename = filename
33 self.doubled = doubled
34 #determine if the file is bam or sam
35 self.filetype = os.path.splitext(self.filename)[1]
36 #throw an error if the file is not bam or sam
37 if self.filetype not in ['.bam']:
38 raise Exception("""You have provided a file with an extension other than
39 '.bam', please check your command-line arguments""")
40 #now make sure there is an index file for the bam file
41 if not os.path.exists("{}.bai".format(self.filename)):
42 raise Exception("""Your .bam file is there, but it isn't indexed and
43 there isn't a .bai file to go with it. Use
44 'samtools index <yourfile>.bam' to fix it.""")
45 #now open the file and just call it a sambam file
46 filetype_dict = {'.sam': '', '.bam': 'b'}
47 self.sambam = pysam.AlignmentFile(self.filename, "r{}".format(filetype_dict[self.filetype]))
48 if chrid == None:
49 self.chrid = self.sambam.references[0]
50 else:
51 self.chrid = chrid
52 self.refindex = self.sambam.references.index(self.chrid)
53 self.seqlength = self.sambam.lengths[self.refindex]
54 self.true_seqlength = self.seqlength if not self.doubled else int(self.seqlength/2)
55 if start == None or stop == None:
56 self.start = 1
57 self.stop = self.true_seqlength
58
59 self.features = self.parse()
60 self.features.sort_values(by=['POS','MAPLEN'], ascending=[True, False] ,inplace=True)
61 self.features.reset_index(inplace=True)
62 self.features.drop('index', 1, inplace=True)
63
64 self.raw_depthmap = self.get_depthmap()
65 self.features_depthmap = self.get_features_depthmap()
66
67 def get_depthmap(self):
68 depthmap = [0] * (self.stop - self.start + 1)
69 for p in self.sambam.pileup(self.chrid, self.start, self.stop):
70 index = p.reference_pos
71 if index >= self.true_seqlength:
72 index -= self.true_seqlength
73 depthmap[index] += p.nsegments
74 return depthmap
75
76 def get_features_depthmap(self):
77 """this method builds a more accurate pileup that is
78 based on if there is actually a mapped base at any
79 given position or not. better for long reads and RNA"""
80 depthmap = [0] * (self.stop - self.start + 1)
81 print("depthmap is: {} long".format(len(depthmap)))
82 for index, row in self.features.iterrows():
83 thisindex = row["POS"] - self.start
84 for thistup in row["TUPS"]:
85 b_type = thistup[1]
86 b_len = thistup[0]
87 if b_type == "M":
88 for j in range(b_len):
89 #this is necessary to reset the index if we wrap
90 # around to the beginning
91 if self.doubled and thisindex == len(depthmap):
92 thisindex = 0
93 depthmap[thisindex] += 1
94 thisindex += 1
95 elif b_type in ["S", "H", "I"]:
96 pass
97 elif b_type in ["D", "N"]:
98 thisindex += b_len
99 #this is necessary to reset the index if we wrap
100 # around to the beginning
101 if self.doubled and thisindex >= len(depthmap):
102 thisindex = thisindex - len(depthmap)
103
104 return depthmap
105
106 def parse(self):
107 data = {'POS': [], 'MAPQ': [], 'TUPS': [] }
108 for read in self.sambam.fetch(self.chrid, self.start, self.stop):
109 data['POS'].append(read.reference_start + 1)
110 data['TUPS'].append(self.cigar_parse(read.cigartuples))
111 data['MAPQ'].append(read.mapq)
112 features = pd.DataFrame.from_dict(data, orient='columns')
113 features['ALNLEN'] = features['TUPS'].apply(self.aln_len)
114 features['TRULEN'] = features['TUPS'].apply(self.tru_len)
115 features['MAPLEN'] = features['TUPS'].apply(self.map_len)
116 features['POS'] = features['POS'].apply(self.fix_pos)
117 return features
118
119 def cigar_parse(self, tuples):
120 """
121 arguments:
122 <tuples> a CIGAR string tuple list in pysam format
123
124 purpose:
125 This function uses the pysam cigarstring tuples format and returns
126 a list of tuples in the internal format, [(20, 'M'), (5, "I")], et
127 cetera. The zeroth element of each tuple is the number of bases for the
128 CIGAR string feature. The first element of each tuple is the CIGAR
129 string feature type.
130
131 There are several feature types in SAM/BAM files. See below:
132 'M' - match
133 'I' - insertion relative to reference
134 'D' - deletion relative to reference
135 'N' - skipped region from the reference
136 'S' - soft clip, not aligned but still in sam file
137 'H' - hard clip, not aligned and not in sam file
138 'P' - padding (silent deletion from padded reference)
139 '=' - sequence match
140 'X' - sequence mismatch
141 'B' - BAM_CBACK (I don't actually know what this is)
142
143 """
144 # I used the map values from http://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment
145 psam_to_char = {0: 'M', 1: 'I', 2: 'D', 3: 'N', 4: 'S',
146 5: 'H', 6: 'P', 7: '=', 8: 'X', 9: 'B'}
147 return [(value, psam_to_char[feature]) for feature, value in tuples]
148
149 def aln_len(self, TUPS):
150 """
151 arguments:
152 <TUPS> a list of tuples output from the cigar_parse() function.
153
154 purpose:
155 This returns the alignment length of the read to the reference.
156 Specifically, it sums the length of all of the matches and deletions.
157 In effect, this number is length of the region of the reference sequence to
158 which the read maps. This number is probably the most useful for selecting
159 reads to visualize in the mapped read plot.
160 """
161 return sum([pair[0] for pair in TUPS if pair[1] not in ['S', 'H', 'I']])
162
163 def map_len(self, TUPS):
164 """
165 arguments:
166 <TUPS> a list of tuples output from the cigar_parse() function.
167
168 purpose:
169 This function returns the map length (all matches and deletions relative to
170 the reference), plus the unmapped 5' and 3' hard/soft clipped sequences.
171 This number is useful if you want to visualize how much 5' and 3' sequence
172 of a read did not map to the reference. For example, poor quality 5' and 3'
173 tails are common in Nanopore reads.
174 """
175 return sum([pair[0] for pair in TUPS if pair[1] not in ['I']])
176
177 def tru_len(self, TUPS):
178 """
179 arguments:
180 <TUPS> a list of tuples output from the cigar_parse() function.
181
182 purpose:
183 This function returns the total length of the read, including insertions,
184 deletions, matches, soft clips, and hard clips. This is useful for
185 comparing to the map length or alignment length to see what percentage of
186 the read aligned to the reference.
187 """
188 return sum([pair[0] for pair in TUPS])
189
190 def fix_pos(self, start_index):
191 """
192 arguments:
193 an int
194
195 purpose:
196 When using a doubled SAMfile, any reads that start after the first copy
197 of the reference risk running over the plotting window, causing the program
198 to crash. This function corrects for this issue by changing the start site
199 of the read.
200
201 Note: this will probably break the program if not using a double alignment
202 since no reads would map past half the length of the single reference
203 """
204 if self.doubled:
205 if start_index > int(self.seqlength/2):
206 return start_index - int(self.seqlength/2) - 1
207 else:
208 return start_index
209 else:
210 return start_index
0 #!/usr/bin/env python
1 # -*- coding: utf-8 -*-
2
3 # pauvre - a pore plotting package
4 # Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
5 #
6 # This file is part of pauvre.
7 #
8 # pauvre is free software: you can redistribute it and/or modify
9 # it under the terms of the GNU General Public License as published by
10 # the Free Software Foundation, either version 3 of the License, or
11 # (at your option) any later version.
12 #
13 # pauvre is distributed in the hope that it will be useful,
14 # but WITHOUT ANY WARRANTY; without even the implied warranty of
15 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16 # GNU General Public License for more details.
17 #
18 # You should have received a copy of the GNU General Public License
19 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
20
21 # following this tutorial to install helvetica
22 # https://github.com/olgabot/sciencemeetproductivity.tumblr.com/blob/master/posts/2012/11/how-to-set-helvetica-as-the-default-sans-serif-font-in.md
23 global hfont
24 hfont = {'fontname':'Helvetica'}
25
26 import matplotlib
27 matplotlib.use('Agg')
28 import matplotlib.pyplot as plt
29 from matplotlib.colors import LinearSegmentedColormap, Normalize
30 import matplotlib.patches as patches
31
32
33 import gffutils
34 import pandas as pd
35 pd.set_option('display.max_columns', 500)
36 pd.set_option('display.width', 1000)
37 import numpy as np
38 import os
39 import pauvre.rcparams as rc
40 from pauvre.functions import GFFParse, print_images, timestamp
41 from pauvre import gfftools
42 from pauvre.lsi.lsi import intersection
43 from pauvre.bamparse import BAMParse
44 import progressbar
45 import platform
46 import sys
47 import time
48
49 # Biopython stuff
50 from Bio import SeqIO
51 import Bio.SubsMat.MatrixInfo as MI
52
53
54 class PlotCommand:
55 def __init__(self, plotcmd, REF):
56 self.ref = REF
57 self.style_choices = []
58 self.cmdtype = ""
59 self.path = ""
60 self.style = ""
61 self.options = ""
62 self._parse_cmd(plotcmd)
63
64 def _parse_cmd(self, plotcmd):
65 chunks = plotcmd.split(":")
66 if chunks[0] == "ref":
67 self.cmdtype = "ref"
68 if len(chunks) < 2:
69 self._len_error()
70 self.path = self.ref
71 self.style = chunks[1]
72 self.style_choices = ["normal", "colorful"]
73 self._check_style_choices()
74 if len(chunks) > 2:
75 self.options = chunks[2].split(",")
76 elif chunks[0] in ["bam", "peptides"]:
77 if len(chunks) < 3:
78 self._len_error()
79 self.cmdtype = chunks[0]
80 self.path = os.path.abspath(os.path.expanduser(chunks[1]))
81 self.style = chunks[2]
82 if self.cmdtype == "bam":
83 self.style_choices = ["depth", "reads"]
84 else:
85 self.style_choices = ["depth"]
86 self._check_style_choices()
87 if len(chunks) > 3:
88 self.options = chunks[3].split(",")
89 elif chunks[0] in ["gff3"]:
90 if len(chunks) < 2:
91 self._len_error()
92 self.cmdtype = chunks[0]
93 self.path = os.path.abspath(os.path.expanduser(chunks[1]))
94 if len(chunks) > 2:
95 self.options = chunks[2].split(",")
96
97
98 def _len_error(self):
99 raise IOError("""You selected {} to plot,
100 but need to specify the style at least.""".format(self.cmdtype))
101 def _check_style_choices(self):
102 if self.style not in self.style_choices:
103 raise IOError("""You selected {} style for
104 ref. You must select from {}. """.format(
105 self.style, self.style_choices))
106
107 global dna_color
108 dna_color = {"A": (81/255, 87/255, 251/255, 1),
109 "T": (230/255, 228/255, 49/255, 1),
110 "G": (28/255, 190/255, 32/255, 1),
111 "C": (220/255, 10/255, 23/255, 1)}
112
113 #these are the line width for the different cigar string flags.
114 # usually, only M, I, D, S, and H appear in bwa mem output
115 global widthDict
116 widthDict = {'M':0.45, # match
117 'I':0.9, # insertion relative to reference
118 'D':0.05, # deletion relative to reference
119 'N':0.1, # skipped region from the reference
120 'S':0.1, # soft clip, not aligned but still in sam file
121 'H':0.1, # hard clip, not aligned and not in sam file
122 'P':0.1, # padding (silent deletion from padded reference)
123 '=':0.1, # sequence match
124 'X':0.1} # sequence mismatch
125
126
127 global richgrey
128 richgrey = (60/255, 54/255, 69/255, 1)
129
130 def plot_ref(panel, chrid, start, stop, thiscmd):
131 panel.set_xlim([start, stop])
132 panel.set_ylim([-2.5, 2.5])
133 panel.set_xticks([int(val) for val in np.linspace(start, stop, 6)])
134 if thiscmd.style == "colorful":
135 thisseq = ""
136 for record in SeqIO.parse(thiscmd.ref, "fasta"):
137 if record.id == chrid:
138 thisseq = record.seq[start-1: stop]
139 for i in range(len(thisseq)):
140 left = start + i
141 bottom = -0.5
142 width = 1
143 height = 1
144 rect = patches.Rectangle((left, bottom),
145 width, height,
146 linewidth = 0,
147 facecolor = dna_color[thisseq[i]] )
148 panel.add_patch(rect)
149 return panel
150
151 def safe_log10(value):
152 try:
153 logval = np.log10(value)
154 except:
155 logval = 0
156 return logval
157
158 def plot_bam(panel, chrid, start, stop, thiscmd):
159 bam = BAMParse(thiscmd.path)
160 panel.set_xlim([start, stop])
161 if thiscmd.style == "depth":
162 maxdepth = max(bam.features_depthmap)
163 maxdepthlog = safe_log10(maxdepth)
164 if "log" in thiscmd.options:
165 panel.set_ylim([-maxdepthlog, maxdepthlog])
166 panel.set_yticks([int(val) for val in np.linspace(0, maxdepthlog, 2)])
167
168 else:
169 panel.set_yticks([int(val) for val in np.linspace(0, maxdepth, 2)])
170 if "c" in thiscmd.options:
171 panel.set_ylim([-maxdepth, maxdepth])
172 else:
173 panel.set_ylim([0, maxdepth])
174
175
176 for i in range(len(bam.features_depthmap)):
177 left = start + i
178 width = 1
179 if "c" in thiscmd.options and "log" in thiscmd.options:
180 bottom = -1 * safe_log10(bam.features_depthmap[i])
181 height = safe_log10(bam.features_depthmap[i]) * 2
182 elif "c" in thiscmd.options and "log" not in thiscmd.options:
183 bottom = -bam.features_depthmap[i]
184 height = bam.features_depthmap[i] * 2
185 else:
186 bottom = 0
187 height = bam.features_depthmap[i]
188 if height > 0:
189 rect = patches.Rectangle((left, bottom),
190 width, height,
191 linewidth = 0,
192 facecolor = richgrey )
193 panel.add_patch(rect)
194
195 if thiscmd.style == "reads":
196 #If we're plotting reads, we don't need y-axis
197 panel.tick_params(bottom="off", labelbottom="off",
198 left ="off", labelleft = "off")
199 reads = bam.features.copy()
200 panel.set_xlim([start, stop])
201 direction = "for"
202 if direction == 'for':
203 bav = {"by":['POS','MAPLEN'], "asc": [True, False]}
204 direction= 'rev'
205 elif direction == 'rev':
206 bav = {"by":['POS','MAPLEN'], "asc": [True, False]}
207 direction = 'for'
208 reads.sort_values(by=bav["by"], ascending=bav['asc'],inplace=True)
209 reads.reset_index(drop=True, inplace=True)
210
211 depth_count = -1
212 plotind = start
213 while len(reads) > 0:
214 #depth_count -= 1
215 #print("len of reads is {}".format(len(reads)))
216 potential = reads.query("POS >= {}".format(plotind))
217 if len(potential) == 0:
218 readsindex = 0
219 #print("resetting plot ind from {} to {}".format(
220 # plotind, reads.loc[readsindex, "POS"]))
221 depth_count -= 1
222
223 else:
224 readsindex = int(potential.index.values[0])
225 #print("pos of potential is {}".format(reads.loc[readsindex, "POS"]))
226 plotind = reads.loc[readsindex, "POS"]
227
228 for TUP in reads.loc[readsindex, "TUPS"]:
229 b_type = TUP[1]
230 b_len = TUP[0]
231 #plotting params
232 # left same for all.
233 left = plotind
234 bottom = depth_count
235 height = widthDict[b_type]
236 width = b_len
237 plot = True
238 color = richgrey
239 if b_type in ["H", "S"]:
240 """We don't plot hard or sort clips - like IGV"""
241 plot = False
242 pass
243 elif b_type == "M":
244 """just plot matches normally"""
245 plotind += b_len
246 elif b_type in ["D", "P", "=", "X"]:
247 """deletions get an especially thin line"""
248 plotind += b_len
249 elif b_type == "I":
250 """insertions get a special purple bar"""
251 left = plotind - (b_len/2)
252 color = (200/255, 41/255, 226/255, 0.5)
253 elif b_type == "N":
254 """skips for splice junctions, line in middle"""
255 bottom += (widthDict["M"]/2) - (widthDict["N"]/2)
256 plotind += b_len
257 if plot:
258 rect = patches.Rectangle((left, bottom),
259 width, height,
260 linewidth = 0,
261 facecolor = color )
262 panel.add_patch(rect)
263 reads.drop([readsindex], inplace=True)
264 reads.reset_index(drop = True, inplace=True)
265 panel.set_ylim([depth_count, 0])
266
267 return panel
268
269 def plot_gff3(panel, chrid, start, stop, thiscmd):
270
271 db = gffutils.create_db(thiscmd.path, ":memory:")
272 bottom = 0
273 genes_to_plot = [thing.id
274 for thing in db.region(
275 region=(chrid, start, stop),
276 completely_within=False)
277 if thing.featuretype == "gene" ]
278 #print("genes to plot are: " genes_to_plot)
279 panel.set_xlim([start, stop])
280 # we don't need labels on one of the axes
281 #panel.tick_params(bottom="off", labelbottom="off",
282 # left ="off", labelleft = "off")
283
284
285 ticklabels = []
286 for geneid in genes_to_plot:
287 plotnow = False
288 if "id" in thiscmd.options and geneid in thiscmd.options:
289 plotnow = True
290 elif "id" not in thiscmd.options:
291 plotnow = True
292 if plotnow:
293 ticklabels.append(geneid)
294 if db[geneid].strand == "+":
295 panel = gfftools._plot_left_to_right_introns_top(panel, geneid, db,
296 bottom, text = None)
297 bottom += 1
298 else:
299 raise IOError("""Plotting things on the reverse strand is
300 not yet implemented""")
301 #print("tick labels are", ticklabels)
302 panel.set_ylim([0, len(ticklabels)])
303 yticks_vals = [val for val in np.linspace(0.5, len(ticklabels) - 0.5, len(ticklabels))]
304 panel.set_yticks(yticks_vals)
305 print("bottom is: ", bottom)
306 print("len tick labels is: ", len(ticklabels))
307 print("intervals are: ", yticks_vals)
308 panel.set_yticklabels(ticklabels)
309
310 return panel
311
312 def browser(args):
313 rc.update_rcParams()
314 print(args)
315
316 # if the user forgot to add a reference, they must add one
317 if args.REF is None:
318 raise IOError("You must specify the reference fasta file")
319
320 # if the user forgot to add the start and stop,
321 # Print the id and the start/stop
322 if args.CHR is None or args.START is None or args.STOP is None:
323 print("""\n You have forgotten to specify the chromosome,
324 the start coordinate, or the stop coordinate to plot.
325 Try something like '-c chr1 --start 20 --stop 2000'.
326 Here is a list of chromosome ids and their lengths
327 from the provided reference. The minimum start coordinate
328 is one and the maximum stop coordinate is the length of
329 the chromosome.\n\nID\tLength""")
330 for record in SeqIO.parse(args.REF, "fasta"):
331 print("{}\t{}".format(record.id, len(record.seq)))
332 sys.exit(0)
333
334 if args.CMD is None:
335 raise IOError("You must specify a plotting command.")
336
337 # now we parse each set of commands
338 commands = [PlotCommand(thiscmd, args.REF)
339 for thiscmd in reversed(args.CMD)]
340
341 # set the figure dimensions
342 if args.ratio:
343 figWidth = args.ratio[0] + 1
344 figHeight = args.ratio[1] + 1
345 #set the panel dimensions
346 panelWidth = args.ratio[0]
347 panelHeight = args.ratio[1]
348
349 else:
350 figWidth = 7
351 figHeight = len(commands) + 2
352 #set the panel dimensions
353 panelWidth = 5
354 # panel margin x 2 + panel height = total vertical height
355 panelHeight = 0.8
356 panelMargin = 0.1
357
358 figure = plt.figure(figsize=(figWidth,figHeight))
359
360 #find the margins to center the panel in figure
361 leftMargin = (figWidth - panelWidth)/2
362 bottomMargin = ((figHeight - panelHeight)/2) + panelMargin
363
364 plot_dict = {"ref": plot_ref,
365 "bam": plot_bam,
366 "gff3": plot_gff3
367 #"peptides": plot_peptides
368 }
369
370 panels = []
371 for i in range(len(commands)):
372 thiscmd = commands[i]
373 if thiscmd.cmdtype in ["gff3", "ref", "peptides"] \
374 or thiscmd.style == "depth" \
375 or "narrow" in thiscmd.options:
376 temp_panelHeight = 0.5
377 else:
378 temp_panelHeight = panelHeight
379 panels.append( plt.axes([leftMargin/figWidth, #left
380 bottomMargin/figHeight, #bottom
381 panelWidth/figWidth, #width
382 temp_panelHeight/figHeight]) #height
383 )
384 panels[i].tick_params(axis='both',which='both',\
385 bottom='off', labelbottom='off',\
386 left='on', labelleft='on', \
387 right='off', labelright='off',\
388 top='off', labeltop='off')
389 if thiscmd.cmdtype == "ref":
390 panels[i].tick_params(bottom='on', labelbottom='on')
391
392
393
394 #turn off some of the axes
395 panels[i].spines["top"].set_visible(False)
396 panels[i].spines["bottom"].set_visible(False)
397 panels[i].spines["right"].set_visible(False)
398 panels[i].spines["left"].set_visible(False)
399
400 panels[i] = plot_dict[thiscmd.cmdtype](panels[i], args.CHR,
401 args.START, args.STOP,
402 thiscmd)
403
404 bottomMargin = bottomMargin + temp_panelHeight + (2 * panelMargin)
405
406 # Print image(s)
407 if args.BASENAME is None:
408 file_base = 'browser_{}.png'.format(timestamp())
409 else:
410 file_base = args.BASENAME
411 path = None
412 if args.path:
413 path = args.path
414 transparent = args.transparent
415 print_images(
416 base_output_name=file_base,
417 image_formats=args.fileform,
418 dpi=args.dpi,
419 no_timestamp = kwargs["no_timestamp"],
420 path = path,
421 transparent=transparent)
422
423
424 def run(args):
425 browser(args)
0 #!/usr/bin/env python
1 # -*- coding: utf-8 -*-
2
3 # pauvre - just a pore PhD student's plotting package
4 # Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved.
5 #
6 # This file is part of pauvre.
7 #
8 # pauvre is free software: you can redistribute it and/or modify
9 # it under the terms of the GNU General Public License as published by
10 # the Free Software Foundation, either version 3 of the License, or
11 # (at your option) any later version.
12 #
13 # pauvre is distributed in the hope that it will be useful,
14 # but WITHOUT ANY WARRANTY; without even the implied warranty of
15 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16 # GNU General Public License for more details.
17 #
18 # You should have received a copy of the GNU General Public License
19 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
20
21 import ast
22 import matplotlib
23 matplotlib.use('Agg')
24 import matplotlib.pyplot as plt
25 import matplotlib.patches as mplpatches
26 from matplotlib.colors import LinearSegmentedColormap
27 import numpy as np
28 import pandas as pd
29 import os.path as opath
30 from sys import stderr
31 from pauvre.functions import print_images
32 from pauvre.stats import stats
33 import pauvre.rcparams as rc
34 import sys
35 import logging
36
37 # logging
38 logger = logging.getLogger('pauvre')
39
40
41 def generate_panel(panel_left, panel_bottom, panel_width, panel_height,
42 axis_tick_param='both', which_tick_param='both',
43 bottom_tick_param='on', label_bottom_tick_param='on',
44 left_tick_param='on', label_left_tick_param='on',
45 right_tick_param='off', label_right_tick_param='off',
46 top_tick_param='off', label_top_tick_param='off'):
47 """
48 Setting default panel tick parameters. Some of these are the defaults
49 for matplotlib anyway, but specifying them for readability. Here are
50 options and defaults for the parameters used below:
51
52 axis : {'x', 'y', 'both'}; which axis to modify; default = 'both'
53 which : {'major', 'minor', 'both'}; which ticks to modify;
54 default = 'major'
55 bottom, top, left, right : bool or {'on', 'off'}; ticks on or off;
56 labelbottom, labeltop, labelleft, labelright : bool or {'on', 'off'}
57 """
58
59 # create the panel
60 panel_rectangle = [panel_left, panel_bottom, panel_width, panel_height]
61 panel = plt.axes(panel_rectangle)
62
63 # Set tick parameters
64 panel.tick_params(axis=axis_tick_param,
65 which=which_tick_param,
66 bottom=bottom_tick_param,
67 labelbottom=label_bottom_tick_param,
68 left=left_tick_param,
69 labelleft=label_left_tick_param,
70 right=right_tick_param,
71 labelright=label_right_tick_param,
72 top=top_tick_param,
73 labeltop=label_top_tick_param)
74
75 return panel
76
77
78 def _generate_histogram_bin_patches(panel, bins, bin_values, horizontal=True):
79 """This helper method generates the histogram that is added to the panel.
80
81 In this case, horizontal = True applies to the mean quality histogram.
82 So, horizontal = False only applies to the length histogram.
83 """
84 l_width = 0.0
85 f_color = (0.5, 0.5, 0.5)
86 e_color = (0, 0, 0)
87 if horizontal:
88 for step in np.arange(0, len(bin_values), 1):
89 left = bins[step]
90 bottom = 0
91 width = bins[step + 1] - bins[step]
92 height = bin_values[step]
93 hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
94 linewidth=l_width,
95 facecolor=f_color,
96 edgecolor=e_color)
97 panel.add_patch(hist_rectangle)
98 else:
99 for step in np.arange(0, len(bin_values), 1):
100 left = 0
101 bottom = bins[step]
102 width = bin_values[step]
103 height = bins[step + 1] - bins[step]
104
105 hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
106 linewidth=l_width,
107 facecolor=f_color,
108 edgecolor=e_color)
109 panel.add_patch(hist_rectangle)
110
111
112 def generate_histogram(panel, data_list, min_plot_val, max_plot_val,
113 bin_interval, hist_horizontal=True,
114 left_spine=True, bottom_spine=True,
115 top_spine=False, right_spine=False, x_label=None,
116 y_label=None):
117
118 bins = np.arange(0, max_plot_val, bin_interval)
119
120 bin_values, bins2 = np.histogram(data_list, bins)
121
122 # hist_horizontal is used for quality
123 if hist_horizontal:
124 panel.set_xlim([min_plot_val, max_plot_val])
125 panel.set_ylim([0, max(bin_values * 1.1)])
126 # and hist_horizontal == Fale is for read length
127 else:
128 panel.set_xlim([0, max(bin_values * 1.1)])
129 panel.set_ylim([min_plot_val, max_plot_val])
130
131 # Generate histogram bin patches, depending on whether we're plotting
132 # vertically or horizontally
133 _generate_histogram_bin_patches(panel, bins, bin_values, hist_horizontal)
134
135 panel.spines['left'].set_visible(left_spine)
136 panel.spines['bottom'].set_visible(bottom_spine)
137 panel.spines['top'].set_visible(top_spine)
138 panel.spines['right'].set_visible(right_spine)
139
140 if y_label is not None:
141 panel.set_ylabel(y_label)
142 if x_label is not None:
143 panel.set_xlabel(x_label)
144
145 def generate_square_map(panel, data_frame, plot_min_y, plot_min_x,
146 plot_max_y, plot_max_x, color,
147 xcol, ycol, **kwargs):
148 """This generates the heatmap panels using squares. Everything is
149 quantized by ints.
150 """
151 panel.set_xlim([plot_min_x, plot_max_x])
152 panel.set_ylim([plot_min_y, plot_max_y])
153 tempdf = data_frame[[xcol, ycol]]
154 data_frame = tempdf.astype(int)
155
156 querystring = "{}<={} and {}<={}".format(plot_min_y, ycol, plot_min_x, xcol)
157 print(" - Filtering squares with {}".format(querystring))
158 square_this = data_frame.query(querystring)
159
160 querystring = "{}<{} and {}<{}".format(ycol, plot_max_y, xcol, plot_max_x)
161 print(" - Filtering squares with {}".format(querystring))
162 square_this = square_this.query(querystring)
163
164 counts = square_this.groupby([xcol, ycol]).size().reset_index(name='counts')
165 for index, row in counts.iterrows():
166 x_pos = row[xcol]
167 y_pos = row[ycol]
168 thiscolor = color(row["counts"]/(counts["counts"].max()))
169 rectangle1=mplpatches.Rectangle((x_pos,y_pos),1,1,
170 linewidth=0,\
171 facecolor=thiscolor)
172 panel.add_patch(rectangle1)
173
174 all_counts = counts["counts"]
175 return all_counts
176
177 def generate_heat_map(panel, data_frame, plot_min_y, plot_min_x,
178 plot_max_y, plot_max_x, color,
179 xcol, ycol, **kwargs):
180 panel.set_xlim([plot_min_x, plot_max_x])
181 panel.set_ylim([plot_min_y, plot_max_y])
182
183 querystring = "{}<={} and {}<={}".format(plot_min_y, ycol, plot_min_x, xcol)
184 print(" - Filtering hexmap with {}".format(querystring))
185 hex_this = data_frame.query(querystring)
186
187 querystring = "{}<{} and {}<{}".format(ycol, plot_max_y, xcol, plot_max_x)
188 print(" - Filtering hexmap with {}".format(querystring))
189 hex_this = hex_this.query(querystring)
190
191 # This single line controls plotting the hex bins in the panel
192 hex_vals = panel.hexbin(hex_this[xcol], hex_this[ycol], gridsize=49,
193 linewidths=0.0, cmap=color)
194 for each in panel.spines:
195 panel.spines[each].set_visible(False)
196
197 counts = hex_vals.get_array()
198 return counts
199
200 def generate_legend(panel, counts, color):
201 # completely custom for more control
202 panel.set_xlim([0, 1])
203 panel.set_ylim([0, 1000])
204 panel.set_yticks([int(x) for x in np.linspace(0, 1000, 6)])
205 panel.set_yticklabels([int(x) for x in np.linspace(0, max(counts), 6)])
206 for i in np.arange(0, 1001, 1):
207 rgba = color(i / 1001)
208 alpha = rgba[-1]
209 facec = rgba[0:3]
210 hist_rectangle = mplpatches.Rectangle((0, i), 1, 1,
211 linewidth=0.0,
212 facecolor=facec,
213 edgecolor=(0, 0, 0),
214 alpha=alpha)
215 panel.add_patch(hist_rectangle)
216 panel.spines['top'].set_visible(False)
217 panel.spines['left'].set_visible(False)
218 panel.spines['bottom'].set_visible(False)
219 panel.yaxis.set_label_position("right")
220 panel.set_ylabel('count')
221
222 def custommargin(df, **kwargs):
223 rc.update_rcParams()
224
225 # 250, 231, 34 light yellow
226 # 67, 1, 85
227 # R=np.linspace(65/255,1,101)
228 # G=np.linspace(0/255, 231/255, 101)
229 # B=np.linspace(85/255, 34/255, 101)
230 # R=65/255, G=0/255, B=85/255
231 Rf = 65 / 255
232 Bf = 85 / 255
233 pdict = {'red': ((0.0, Rf, Rf),
234 (1.0, Rf, Rf)),
235 'green': ((0.0, 0.0, 0.0),
236 (1.0, 0.0, 0.0)),
237 'blue': ((0.0, Bf, Bf),
238 (1.0, Bf, Bf)),
239 'alpha': ((0.0, 0.0, 0.0),
240 (1.0, 1.0, 1.0))
241 }
242 # Now we will use this example to illustrate 3 ways of
243 # handling custom colormaps.
244 # First, the most direct and explicit:
245 purple1 = LinearSegmentedColormap('Purple1', pdict)
246
247 # set the figure dimensions
248 fig_width = 1.61 * 3
249 fig_height = 1 * 3
250 fig = plt.figure(figsize=(fig_width, fig_height))
251
252 # set the panel dimensions
253 heat_map_panel_width = fig_width * 0.5
254 heat_map_panel_height = heat_map_panel_width * 0.62
255
256 # find the margins to center the panel in figure
257 fig_left_margin = fig_bottom_margin = (1 / 6)
258
259 # lengthPanel
260 y_panel_width = (1 / 8)
261
262 # the color Bar parameters
263 legend_panel_width = (1 / 24)
264
265 # define padding
266 h_padding = 0.02
267 v_padding = 0.05
268
269 # Set whether to include y-axes in histograms
270 print(" - Setting panel options.", file = sys.stderr)
271 if kwargs["Y_AXES"]:
272 y_bottom_spine = True
273 y_bottom_tick = 'on'
274 y_bottom_label = 'on'
275 x_left_spine = True
276 x_left_tick = 'on'
277 x_left_label = 'on'
278 x_y_label = 'Count'
279 else:
280 y_bottom_spine = False
281 y_bottom_tick = 'off'
282 y_bottom_label = 'off'
283 x_left_spine = False
284 x_left_tick = 'off'
285 x_left_label = 'off'
286 x_y_label = None
287
288 panels = []
289
290 # Quality histogram panel
291 print(" - Generating the x-axis panel.", file = sys.stderr)
292 x_panel_left = fig_left_margin + y_panel_width + h_padding
293 x_panel_width = heat_map_panel_width / fig_width
294 x_panel_height = y_panel_width * fig_width / fig_height
295 x_panel = generate_panel(x_panel_left,
296 fig_bottom_margin,
297 x_panel_width,
298 x_panel_height,
299 left_tick_param=x_left_tick,
300 label_left_tick_param=x_left_label)
301 panels.append(x_panel)
302
303 # y histogram panel
304 print(" - Generating the y-axis panel.", file = sys.stderr)
305 y_panel_bottom = fig_bottom_margin + x_panel_height + v_padding
306 y_panel_height = heat_map_panel_height / fig_height
307 y_panel = generate_panel(fig_left_margin,
308 y_panel_bottom,
309 y_panel_width,
310 y_panel_height,
311 bottom_tick_param=y_bottom_tick,
312 label_bottom_tick_param=y_bottom_label)
313 panels.append(y_panel)
314
315 # Heat map panel
316 heat_map_panel_left = fig_left_margin + y_panel_width + h_padding
317 heat_map_panel_bottom = fig_bottom_margin + x_panel_height + v_padding
318 print(" - Generating the heat map panel.", file = sys.stderr)
319 heat_map_panel = generate_panel(heat_map_panel_left,
320 heat_map_panel_bottom,
321 heat_map_panel_width / fig_width,
322 heat_map_panel_height / fig_height,
323 bottom_tick_param='off',
324 label_bottom_tick_param='off',
325 left_tick_param='off',
326 label_left_tick_param='off')
327 panels.append(heat_map_panel)
328 heat_map_panel.set_title(kwargs["title"])
329
330 # Legend panel
331 print(" - Generating the legend panel.", file = sys.stderr)
332 legend_panel_left = fig_left_margin + y_panel_width + \
333 heat_map_panel_width / fig_width + h_padding
334 legend_panel_bottom = fig_bottom_margin + x_panel_height + v_padding
335 legend_panel_height = heat_map_panel_height / fig_height
336 legend_panel = generate_panel(legend_panel_left, legend_panel_bottom,
337 legend_panel_width, legend_panel_height,
338 bottom_tick_param='off',
339 label_bottom_tick_param='off',
340 left_tick_param='off',
341 label_left_tick_param='off',
342 right_tick_param='on',
343 label_right_tick_param='on')
344 panels.append(legend_panel)
345
346 #
347 # Everything above this is just to set up the panels
348 #
349 ##################################################################
350
351 # Set max and min viewing window for the xaxis
352 if kwargs["plot_max_x"]:
353 plot_max_x = kwargs["plot_max_x"]
354 else:
355 if kwargs["square"]:
356 plot_max_x = df[kwargs["xcol"]].max()
357 plot_max_x = max(np.ceil(df[kwargs["xcol"]]))
358 plot_min_x = kwargs["plot_min_x"]
359
360 # Set x bin sizes
361 if kwargs["xbin"]:
362 x_bin_interval = kwargs["xbin"]
363 else:
364 # again, this is just based on what looks good from experience
365 x_bin_interval = 1
366
367 # Generate x histogram
368 print(" - Generating the x-axis histogram.", file = sys.stderr)
369 generate_histogram(panel = x_panel,
370 data_list = df[kwargs['xcol']],
371 min_plot_val = plot_min_x,
372 max_plot_val = plot_max_x,
373 bin_interval = x_bin_interval,
374 hist_horizontal = True,
375 x_label=kwargs['xcol'],
376 y_label=x_y_label,
377 left_spine=x_left_spine)
378
379 # Set max and min viewing window for the y axis
380 if kwargs["plot_max_y"]:
381 plot_max_y = kwargs["plot_max_y"]
382 else:
383 if kwargs["square"]:
384 plot_max_y = df[kwargs["ycol"]].max()
385 else:
386 plot_max_y = max(np.ceil(df[kwargs["ycol"]]))
387
388 plot_min_y = kwargs["plot_min_y"]
389 # Set y bin sizes
390 if kwargs["ybin"]:
391 y_bin_interval = kwargs["ybin"]
392 else:
393 y_bin_interval = 1
394
395 # Generate y histogram
396 print(" - Generating the y-axis histogram.", file = sys.stderr)
397 generate_histogram(panel = y_panel,
398 data_list = df[kwargs['ycol']],
399 min_plot_val = plot_min_y,
400 max_plot_val = plot_max_y,
401 bin_interval = y_bin_interval,
402 hist_horizontal = False,
403 y_label = kwargs['ycol'],
404 bottom_spine = y_bottom_spine)
405
406 # Generate heat map
407 if kwargs["square"]:
408 print(" - Generating the square heatmap.", file = sys.stderr)
409 counts = generate_square_map(panel = heat_map_panel,
410 data_frame = df,
411 plot_min_y = plot_min_y,
412 plot_min_x = plot_min_x,
413 plot_max_y = plot_max_y,
414 plot_max_x = plot_max_x,
415 color = purple1,
416 xcol = kwargs["xcol"],
417 ycol = kwargs["ycol"])
418 else:
419 print(" - Generating the heatmap.", file = sys.stderr)
420 counts = generate_heat_map(panel = heat_map_panel,
421 data_frame = df,
422 plot_min_y = plot_min_y,
423 plot_min_x = plot_min_x,
424 plot_max_y = plot_max_y,
425 plot_max_x = plot_max_x,
426 color = purple1,
427 xcol = kwargs["xcol"],
428 ycol = kwargs["ycol"])
429
430 # Generate legend
431 print(" - Generating the legend.", file = sys.stderr)
432 generate_legend(legend_panel, counts, purple1)
433
434 # inform the user of the plotting window if not quiet mode
435 #if not kwargs["QUIET"]:
436 # print("""plotting in the following window:
437 # {0} <= Q-score (x-axis) <= {1}
438 # {2} <= length (y-axis) <= {3}""".format(
439 # plot_min_x, plot_max_x, min_plot_val, max_plot_val),
440 # file=stderr)
441
442 # Print image(s)
443 if kwargs["output_base_name"] is None:
444 file_base = "custommargin"
445 else:
446 file_base = kwargs["output_base_name"]
447
448 print(" - Saving your images", file = sys.stderr)
449 print_images(
450 base =file_base,
451 image_formats=kwargs["fileform"],
452 dpi=kwargs["dpi"],
453 no_timestamp = kwargs["no_timestamp"],
454 transparent= kwargs["no_transparent"])
455
456 def run(args):
457 print(args)
458 if not opath.exists(args.input_file):
459 raise IOError("The input file does not exist: {}".format(
460 args.input_file))
461 df = pd.read_csv(args.input_file, header='infer', sep='\t')
462 # make sure that the column names that were specified are actually
463 # in the dataframe
464 if args.xcol not in df.columns:
465 raise IOError("""The x-column name that you specified, {}, is not in the
466 dataframe column names: {}""".format(args.xcol, df.columns))
467 if args.ycol not in df.columns:
468 raise IOError("""The y-column name that you specified, {}, is not in the
469 dataframe column names: {}""".format(args.ycol, df.columns))
470 print(" - Successfully read csv file. Here are a few lines:",
471 file = sys.stderr)
472 print(df.head(), file = sys.stderr)
473 print(" - Plotting {} on the x-axis".format(args.xcol),file=sys.stderr)
474 print(df[args.xcol].head(), file = sys.stderr)
475 print(" - Plotting {} on the y-axis".format(args.ycol),file=sys.stderr)
476 print(df[args.ycol].head(), file = sys.stderr)
477 custommargin(df=df.dropna(), **vars(args))
0 #!/usr/bin/env python
1 # -*- coding: utf-8 -*-
2
3 # pauvre
4 # Copyright (c) 2016-2020 Darrin T. Schultz.
5 #
6 # This file is part of pauvre.
7 #
8 # pauvre is free software: you can redistribute it and/or modify
9 # it under the terms of the GNU General Public License as published by
10 # the Free Software Foundation, either version 3 of the License, or
11 # (at your option) any later version.
12 #
13 # pauvre is distributed in the hope that it will be useful,
14 # but WITHOUT ANY WARRANTY; without even the implied warranty of
15 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16 # GNU General Public License for more details.
17 #
18 # You should have received a copy of the GNU General Public License
19 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
20
21 from Bio import SeqIO
22 import copy
23 import gzip
24 import matplotlib.pyplot as plt
25 import numpy as np
26 import os
27 import pandas as pd
28 from sys import stderr
29 import time
30
31
32 # this makes opening files more robust for different platforms
33 # currently only used in GFFParse
34 import codecs
35
36 import warnings
37
38 def print_images(base, image_formats, dpi,
39 transparent=False, no_timestamp = False):
40 """
41 Save the plot in multiple formats, with or without transparency
42 and with or without timestamps.
43 """
44 for fmt in image_formats:
45 if no_timestamp:
46 out_name = "{0}.{1}".format(base, fmt)
47 else:
48 out_name = "{0}_{1}.{2}".format(base, timestamp(), fmt)
49 try:
50 if fmt == 'png':
51 plt.savefig(out_name, dpi=dpi, transparent=transparent)
52 else:
53 plt.savefig(out_name, format=fmt, transparent=transparent)
54 except PermissionError:
55 # thanks to https://github.com/wdecoster for the suggestion
56 print("""You don't have permission to save pauvre plots to this
57 directory. Try changing the directory and running the script again!""")
58
59 class GFFParse():
60 def __init__(self, filename, stop_codons=None, species=None):
61 self.filename = filename
62 self.samplename = os.path.splitext(os.path.basename(filename))[0]
63 self.species = species
64 self.featureDict = {"name": [],
65 "featType": [],
66 "start": [],
67 "stop": [],
68 "strand": []}
69 gffnames = ["sequence", "source", "featType", "start", "stop", "dunno1",
70 "strand", "dunno2", "tags"]
71 self.features = pd.read_csv(self.filename, comment='#',
72 sep='\t', names=gffnames)
73 self.features['name'] = self.features['tags'].apply(self._get_name)
74 self.features.drop('dunno1', 1, inplace=True)
75 self.features.drop('dunno2', 1, inplace=True)
76 self.features.reset_index(inplace=True, drop=True)
77 # warn the user if there are CDS or gene entries not divisible by three
78 self._check_triplets()
79 # sort the database by start
80 self.features.sort_values(by='start', ascending=True, inplace=True)
81 if stop_codons:
82 strip_codons = ['gene', 'CDS']
83 # if the direction is forward, subtract three from the stop to bring it closer to the start
84 self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '+'), 'stop'] =\
85 self.features.loc[(self.features['featType'].isin(strip_codons))
86 & (self.features['strand'] == '+'), 'stop'] - 3
87 # if the direction is reverse, add three to the start (since the coords are flip-flopped)
88 self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '-'), 'start'] =\
89 self.features.loc[(self.features['featType'].isin(strip_codons))
90 & (self.features['strand'] == '-'), 'start'] + 3
91 self.features['center'] = self.features['start'] + \
92 ((self.features['stop'] - self.features['start']) / 2)
93 # we need to add one since it doesn't account for the last base otherwise
94 self.features['width'] = abs(self.features['stop'] - self.features['start']) + 1
95 self.features['lmost'] = self.features.apply(self._determine_lmost, axis=1)
96 self.features['rmost'] = self.features.apply(self._determine_rmost, axis=1)
97 self.features['track'] = 0
98 if len(self.features.loc[self.features['tags'] == "Is_circular=true", 'stop']) < 1:
99 raise IOError("""The GFF file needs to have a tag ending in "Is_circular=true"
100 with a region from 1 to the number of bases in the mitogenome
101
102 example:
103 Bf201311 Geneious region 1 13337 . + 0 Is_circular=true
104 """)
105 self.seqlen = int(self.features.loc[self.features['tags'] == "Is_circular=true", 'stop'])
106 self.features.reset_index(inplace=True, drop=True)
107 #print("float", self.features.loc[self.features['name'] == 'COX1', 'center'])
108 #print("float cat", len(self.features.loc[self.features['name'] == 'CAT', 'center']))
109 # print(self.features)
110 # print(self.seqlen)
111
112 def set_features(self, new_features):
113 """all this does is reset the features pandas dataframe"""
114 self.features = new_features
115
116 def get_unique_genes(self):
117 """This returns a series of gene names"""
118 plottable = self.features.query(
119 "featType != 'tRNA' and featType != 'region' and featType != 'source'")
120 return set(plottable['name'].unique())
121
122 def shuffle(self):
123 """
124 this returns a list of all possible shuffles of features.
125 A shuffle is when the frontmost bit of coding + noncoding DNA up
126 until the next bit of coding DNA is removed and tagged on the
127 end of the sequence. In this case this process is represented by
128 shifting gff coordinates.
129 """
130 shuffles = []
131 # get the index of the first element
132 # get the index of the next thing
133 # subtract the indices of everything, then reset the ones that are below
134 # zero
135 done = False
136 shuffle_features = self.features[self.features['featType'].isin(
137 ['gene', 'rRNA', 'CDS', 'tRNA'])].copy(deep=True)
138 # we first add the shuffle features without reorganizing
139 # print("shuffle\n",shuffle_features)
140 add_first = copy.deepcopy(self)
141 add_first.set_features(shuffle_features)
142 shuffles.append(add_first)
143 # first gene is changed with every iteration
144 first_gene = list(shuffle_features['name'])[0]
145 # absolute first is the first gene in the original gff file, used to determine if we are done in this while loop
146 absolute_first = list(shuffle_features['name'])[0]
147 while not done:
148 # We need to prevent the case of shuffling in the middle of
149 # overlapped genes. Do this by ensuring that the the start of
150 # end of first gene is less than the start of the next gene.
151 first_stop = int(shuffle_features.loc[shuffle_features['name'] == first_gene, 'stop'])
152 next_gene = ""
153 for next_index in range(1, len(shuffle_features)):
154 # get the df of the next list, if len == 0, then it is a tRNA and we need to go to the next index
155 next_gene_df = list(
156 shuffle_features[shuffle_features['featType'].isin(['gene', 'rRNA', 'CDS'])]['name'])
157 if len(next_gene_df) != 0:
158 next_gene = next_gene_df[next_index]
159 next_start = int(shuffle_features.loc[shuffle_features['name'] == next_gene, 'start'])
160 #print("looking at {}, prev_stop is {}, start is {}".format(
161 # next_gene, first_stop, next_start))
162 #print(shuffle_features[shuffle_features['featType'].isin(['gene', 'rRNA', 'CDS'])])
163 # if the gene we're looking at and the next one don't overlap, move on
164 if first_stop < next_start:
165 break
166 #print("next_gene before checking for first is {}".format(next_gene))
167 if next_gene == absolute_first:
168 done = True
169 break
170 # now we can reset the first gene for the next iteration
171 first_gene = next_gene
172 shuffle_features = shuffle_features.copy(deep=True)
173 # figure out where the next start point is going to be
174 next_start = int(shuffle_features.loc[shuffle_features['name'] == next_gene, 'start'])
175 #print('next gene: {}'.format(next_gene))
176 shuffle_features['start'] = shuffle_features['start'] - next_start + 1
177 shuffle_features['stop'] = shuffle_features['stop'] - next_start + 1
178 shuffle_features['center'] = shuffle_features['center'] - next_start + 1
179 # now correct the values that are less than 0
180 shuffle_features.loc[shuffle_features['start'] < 1,
181 'start'] = shuffle_features.loc[shuffle_features['start'] < 1, 'start'] + self.seqlen
182 shuffle_features.loc[shuffle_features['stop'] < 1, 'stop'] = shuffle_features.loc[shuffle_features['stop']
183 < 1, 'start'] + shuffle_features.loc[shuffle_features['stop'] < 1, 'width']
184 shuffle_features['center'] = shuffle_features['start'] + \
185 ((shuffle_features['stop'] - shuffle_features['start']) / 2)
186 shuffle_features['lmost'] = shuffle_features.apply(self._determine_lmost, axis=1)
187 shuffle_features['rmost'] = shuffle_features.apply(self._determine_rmost, axis=1)
188 shuffle_features.sort_values(by='start', ascending=True, inplace=True)
189 shuffle_features.reset_index(inplace=True, drop=True)
190 new_copy = copy.deepcopy(self)
191 new_copy.set_features(shuffle_features)
192 shuffles.append(new_copy)
193 #print("len shuffles: {}".format(len(shuffles)))
194 return shuffles
195
196 def couple(self, other_GFF, this_y=0, other_y=1):
197 """
198 Compares this set of features to another set and generates tuples of
199 (x,y) coordinate pairs to input into lsi
200 """
201 other_features = other_GFF.features
202 coordinates = []
203 for thisname in self.features['name']:
204 othermatch = other_features.loc[other_features['name'] == thisname, 'center']
205 if len(othermatch) == 1:
206 this_x = float(self.features.loc[self.features['name']
207 == thisname, 'center']) # /self.seqlen
208 other_x = float(othermatch) # /other_GFF.seqlen
209 # lsi can't handle vertical or horizontal lines, and we don't
210 # need them either for our comparison. Don't add if equivalent.
211 if this_x != other_x:
212 these_coords = ((this_x, this_y), (other_x, other_y))
213 coordinates.append(these_coords)
214 return coordinates
215
216 def _check_triplets(self):
217 """This method verifies that all entries of featType gene and CDS are
218 divisible by three"""
219 genesCDSs = self.features.query("featType == 'CDS' or featType == 'gene'")
220 not_trips = genesCDSs.loc[((abs(genesCDSs['stop'] - genesCDSs['start']) + 1) % 3) > 0, ]
221 if len(not_trips) > 0:
222 print_string = ""
223 print_string += "There are CDS and gene entries that are not divisible by three\n"
224 print_string += str(not_trips)
225 warnings.warn(print_string, SyntaxWarning)
226
227 def _get_name(self, tag_value):
228 """This extracts a name from a single row in 'tags' of the pandas
229 dataframe
230 """
231 try:
232 if ";" in tag_value:
233 name = tag_value[5:].split(';')[0]
234 else:
235 name = tag_value[5:].split()[0]
236 except:
237 name = tag_value
238 print("Couldn't correctly parse {}".format(
239 tag_value))
240 return name
241
242 def _determine_lmost(self, row):
243 """Booleans don't work well for pandas dataframes, so I need to use apply
244 """
245 if row['start'] < row['stop']:
246 return row['start']
247 else:
248 return row['stop']
249
250 def _determine_rmost(self, row):
251 """Booleans don't work well for pandas dataframes, so I need to use apply
252 """
253 if row['start'] < row['stop']:
254 return row['stop']
255 else:
256 return row['start']
257
258
259 def parse_fastq_length_meanqual(fastq):
260 """
261 arguments:
262 <fastq> the fastq file path. Hopefully it has been verified to exist already
263
264 purpose:
265 This function parses a fastq and returns a pandas dataframe of read lengths
266 and read meanQuals.
267 """
268 # First try to open the file with the gzip package. It will crash
269 # if the file is not gzipped, so this is an easy way to test if
270 # the fastq file is gzipped or not.
271 try:
272 handle = gzip.open(fastq, "rt")
273 length, meanQual = _fastq_parse_helper(handle)
274 except:
275 handle = open(fastq, "r")
276 length, meanQual = _fastq_parse_helper(handle)
277
278 handle.close()
279 df = pd.DataFrame(list(zip(length, meanQual)), columns=['length', 'meanQual'])
280 return df
281
282
283 def filter_fastq_length_meanqual(df, min_len, max_len,
284 min_mqual, max_mqual):
285 querystring = "length >= {0} and meanQual >= {1}".format(min_len, min_mqual)
286 if max_len != None:
287 querystring += " and length <= {}".format(max_len)
288 if max_mqual != None:
289 querystring += " and meanQual <= {}".format(max_mqual)
290 print("Keeping reads that satisfy: {}".format(querystring), file=stderr)
291 filtdf = df.query(querystring)
292 #filtdf["length"] = pd.to_numeric(filtdf["length"], errors='coerce')
293 #filtdf["meanQual"] = pd.to_numeric(filtdf["meanQual"], errors='coerce')
294 return filtdf
295
296
297 def _fastq_parse_helper(handle):
298 length = []
299 meanQual = []
300 for record in SeqIO.parse(handle, "fastq"):
301 if len(record) > 0:
302 meanQual.append(_arithmetic_mean(record.letter_annotations["phred_quality"]))
303 length.append(len(record))
304 return length, meanQual
305
306
307 def _geometric_mean(phred_values):
308 """in case I want geometric mean in the future, can calculate it like this"""
309 # np.mean(record.letter_annotations["phred_quality"]))
310 pass
311
312
313 def _arithmetic_mean(phred_values):
314 """
315 Convert Phred to 1-accuracy (error probabilities), calculate the arithmetic mean,
316 log transform back to Phred.
317 """
318 if not isinstance(phred_values, np.ndarray):
319 phred_values = np.array(phred_values)
320 return _erate_to_phred(np.mean(_phred_to_erate(phred_values)))
321
322
323 def _phred_to_erate(phred_values):
324 """
325 converts a list or numpy array of phred values to a numpy array
326 of error rates
327 """
328 if not isinstance(phred_values, np.ndarray):
329 phred_values = np.array(phred_values)
330 return np.power(10, (-1 * (phred_values / 10)))
331
332
333 def _erate_to_phred(erate_values):
334 """
335 converts a list or numpy array of error rates to a numpy array
336 of phred values
337 """
338 if not isinstance(erate_values, np.ndarray):
339 phred_values = np.array(erate_values)
340 return -10 * np.log10(erate_values)
341
342 def timestamp():
343 """
344 Returns the current time in :samp:`YYYYMMDD_HHMMSS` format.
345 """
346 return time.strftime("%Y%m%d_%H%M%S")
0 #!/usr/bin/env python
1 # -*- coding: utf-8 -*-
2
3 # pauvre - a pore plotting package
4 # Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
5 #
6 # This file is part of pauvre.
7 #
8 # pauvre is free software: you can redistribute it and/or modify
9 # it under the terms of the GNU General Public License as published by
10 # the Free Software Foundation, either version 3 of the License, or
11 # (at your option) any later version.
12 #
13 # pauvre is distributed in the hope that it will be useful,
14 # but WITHOUT ANY WARRANTY; without even the implied warranty of
15 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16 # GNU General Public License for more details.
17 #
18 # You should have received a copy of the GNU General Public License
19 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
20
21 """This file contains things related to parsing and plotting GFF files"""
22
23 import copy
24 from matplotlib.path import Path
25 import matplotlib.patches as patches
26
27 global chevron_width
28 global arrow_width
29 global min_text
30 global text_cutoff
31
32 arrow_width = 80
33 chevron_width = 40
34 min_text = 550
35 text_cutoff = 150
36 import sys
37
38 global colorMap
39 colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red',
40 'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white',
41 'ORF':'orange'}
42
43 def _plot_left_to_right_introns(panel, geneid, db, y_pos, text = None):
44 """ plots a left to right patch with introns when there are no intervening
45 sequences to consider. Uses a gene id and gffutils database as input.
46 b
47 a .-=^=-. c
48 1__________2---/ e `---1__________2
49 | #lff \f d| #lff \
50 | left to \3 | left to \3
51 | right / | right /
52 5___________/4 5___________/4
53 """
54 #first we need to determine the number of exons
55 bar_thickness = 0.75
56 #now we can start plotting the exons
57 exonlist = list(db.children(geneid, featuretype='CDS', order_by="start"))
58 for i in range(len(exonlist)):
59 cds_start = exonlist[i].start
60 cds_stop = exonlist[i].stop
61 verts = [(cds_start, y_pos + bar_thickness), #1
62 (cds_stop - chevron_width, y_pos + bar_thickness), #2
63 (cds_stop, y_pos + (bar_thickness/2)), #3
64 (cds_stop - chevron_width, y_pos), #4
65 (cds_start, y_pos), #5
66 (cds_start, y_pos + bar_thickness), #1
67 ]
68 codes = [Path.MOVETO,
69 Path.LINETO,
70 Path.LINETO,
71 Path.LINETO,
72 Path.LINETO,
73 Path.CLOSEPOLY,
74 ]
75 path = Path(verts, codes)
76 patch = patches.PathPatch(path, lw = 0,
77 fc=colorMap['CDS'] )
78 panel.add_patch(patch)
79
80 # we must draw the splice junction
81 if i < len(exonlist) - 1:
82 next_start = exonlist[i+1].start
83 next_stop = exonlist[i+1].stop
84 middle = cds_stop + ((next_start - cds_stop)/2)
85
86 verts = [(cds_stop - chevron_width, y_pos + bar_thickness), #2/a
87 (middle, y_pos + 0.95), #b
88 (next_start, y_pos + bar_thickness), #c
89 (next_start, y_pos + bar_thickness - 0.05), #d
90 (middle, y_pos + 0.95 - 0.05), #e
91 (cds_stop - chevron_width, y_pos + bar_thickness -0.05), #f
92 (cds_stop - chevron_width, y_pos + bar_thickness), #2/a
93 ]
94 codes = [Path.MOVETO,
95 Path.LINETO,
96 Path.LINETO,
97 Path.LINETO,
98 Path.LINETO,
99 Path.LINETO,
100 Path.CLOSEPOLY,
101 ]
102 path = Path(verts, codes)
103 patch = patches.PathPatch(path, lw = 0,
104 fc=colorMap['CDS'] )
105 panel.add_patch(patch)
106
107 return panel
108
109 def _plot_left_to_right_introns_top(panel, geneid, db, y_pos, text = None):
110 """ slightly different from the above version such thatsplice junctions
111 are more visually explicit.
112
113 plots a left to right patch with introns when there are no intervening
114 sequences to consider. Uses a gene id and gffutils database as input.
115 b
116 a .-=^=-. c
117 1_____________2---/ e `---1_____________2
118 | #lff /f d| #lff /
119 | left to / | left to /
120 | right / | right /
121 4_________/3 4_________/3
122 """
123 #first we need to determine the number of exons
124 bar_thickness = 0.75
125 #now we can start plotting the exons
126 exonlist = list(db.children(geneid, featuretype='CDS', order_by="start"))
127 for i in range(len(exonlist)):
128 cds_start = exonlist[i].start
129 cds_stop = exonlist[i].stop
130 verts = [(cds_start, y_pos + bar_thickness), #1
131 (cds_stop, y_pos + bar_thickness), #2
132 (cds_stop - chevron_width, y_pos), #4
133 (cds_start, y_pos), #5
134 (cds_start, y_pos + bar_thickness), #1
135 ]
136 codes = [Path.MOVETO,
137 Path.LINETO,
138 Path.LINETO,
139 Path.LINETO,
140 Path.CLOSEPOLY,
141 ]
142 path = Path(verts, codes)
143 patch = patches.PathPatch(path, lw = 0,
144 fc=colorMap['CDS'] )
145 panel.add_patch(patch)
146
147 # we must draw the splice junction
148 if i < len(exonlist) - 1:
149 next_start = exonlist[i+1].start
150 next_stop = exonlist[i+1].stop
151 middle = cds_stop + ((next_start - cds_stop)/2)
152
153 verts = [(cds_stop-5, y_pos + bar_thickness), #2/a
154 (middle, y_pos + 0.95), #b
155 (next_start, y_pos + bar_thickness), #c
156 (next_start, y_pos + bar_thickness - 0.05), #d
157 (middle, y_pos + 0.95 - 0.05), #e
158 (cds_stop-5, y_pos + bar_thickness -0.05), #f
159 (cds_stop-5, y_pos + bar_thickness), #2/a
160 ]
161 codes = [Path.MOVETO,
162 Path.LINETO,
163 Path.LINETO,
164 Path.LINETO,
165 Path.LINETO,
166 Path.LINETO,
167 Path.CLOSEPOLY,
168 ]
169 path = Path(verts, codes)
170 patch = patches.PathPatch(path, lw = 0,
171 fc=colorMap['CDS'] )
172 panel.add_patch(patch)
173
174 return panel
175
176 def _plot_lff(panel, left_df, right_df, colorMap, y_pos, bar_thickness, text):
177 """ plots a lff patch
178 1__________2 ____________
179 | #lff \ \ #rff \
180 | left for \3 \ right for \
181 | forward / / forward /
182 5___________/4 /___________/
183 """
184 #if there is only one feature to plot, then just plot it
185
186 print("plotting lff")
187 verts = [(left_df['start'], y_pos + bar_thickness), #1
188 (right_df['start'] - chevron_width, y_pos + bar_thickness), #2
189 (left_df['stop'], y_pos + (bar_thickness/2)), #3
190 (right_df['start'] - chevron_width, y_pos), #4
191 (left_df['start'], y_pos), #5
192 (left_df['start'], y_pos + bar_thickness), #1
193 ]
194 codes = [Path.MOVETO,
195 Path.LINETO,
196 Path.LINETO,
197 Path.LINETO,
198 Path.LINETO,
199 Path.CLOSEPOLY,
200 ]
201 path = Path(verts, codes)
202 patch = patches.PathPatch(path, lw = 0,
203 fc=colorMap[left_df['featType']] )
204 text_width = left_df['width']
205 if text and text_width >= min_text:
206 panel = _plot_label(panel, left_df, y_pos, bar_thickness)
207 elif text and text_width < min_text and text_width >= text_cutoff:
208 panel = _plot_label(panel, left_df,
209 y_pos, bar_thickness,
210 rotate = True, arrow = True)
211
212 return panel, patch
213
214 def _plot_label(panel, df, y_pos, bar_thickness, rotate = False, arrow = False):
215 # handles the case where a dataframe was passed
216 fontsize = 8
217 rotation = 0
218 if rotate:
219 fontsize = 5
220 rotation = 90
221 if len(df) == 1:
222 x =((df.loc[0, 'stop'] - df.loc[0, 'start'])/2) + df.loc[0, 'start']
223 y = y_pos + (bar_thickness/2)
224 # if we need to center somewhere other than the arrow, need to adjust
225 # for the direction of the arrow
226 # it doesn't look good if it shifts by the whole arrow width, so only
227 # shift by half the arrow width
228 if arrow:
229 if df.loc[0, 'strand'] == "+":
230 shift_start = df.loc[0, 'start']
231 else:
232 shift_start = df.loc[0, 'start'] + (arrow_width/2)
233 x =((df.loc[0, 'stop'] - (arrow_width/2) - df.loc[0, 'start'])/2) + shift_start
234 panel.text(x, y,
235 df.loc[0, 'name'], fontsize = fontsize,
236 ha='center', va='center',
237 color = 'white', family = 'monospace',
238 zorder = 100, rotation = rotation)
239 # and the case where a series was passed
240 else:
241 x = ((df['stop'] - df['start'])/2) + df['start']
242 y = y_pos + (bar_thickness/2)
243 if arrow:
244 if df['strand'] == "+":
245 shift_start = df['start']
246 else:
247 shift_start = df['start'] + (arrow_width/2)
248 x =((df['stop'] - (arrow_width/2) - df['start'])/2) + shift_start
249 panel.text(x, y,
250 df['name'], fontsize = fontsize,
251 ha='center', va='center',
252 color = 'white', family = 'monospace',
253 zorder = 100, rotation = rotation)
254
255 return panel
256
257 def _plot_rff(panel, left_df, right_df, colorMap, y_pos, bar_thickness, text):
258 """ plots a rff patch
259 ____________ 1__________2
260 | #lff \ \ #rff \
261 | left for \ 6\ right for \3
262 | forward / / forward /
263 |___________/ /5__________/4
264 """
265 #if there is only one feature to plot, then just plot it
266
267 print("plotting rff")
268 verts = [(right_df['start'], y_pos + bar_thickness), #1
269 (right_df['stop'] - arrow_width, y_pos + bar_thickness), #2
270 (right_df['stop'], y_pos + (bar_thickness/2)), #3
271 (right_df['stop'] - arrow_width, y_pos), #4
272 (right_df['start'], y_pos), #5
273 (left_df['stop'] + chevron_width, y_pos + (bar_thickness/2)), #6
274 (right_df['start'], y_pos + bar_thickness), #1
275 ]
276 codes = [Path.MOVETO,
277 Path.LINETO,
278 Path.LINETO,
279 Path.LINETO,
280 Path.LINETO,
281 Path.LINETO,
282 Path.CLOSEPOLY,
283 ]
284 path = Path(verts, codes)
285 patch = patches.PathPatch(path, lw = 0,
286 fc=colorMap[right_df['featType']] )
287 text_width = right_df['width']
288 if text and text_width >= min_text:
289 panel = _plot_label(panel, right_df, y_pos, bar_thickness)
290 elif text and text_width < min_text and text_width >= text_cutoff:
291 panel = _plot_label(panel, right_df,
292 y_pos, bar_thickness, rotate = True)
293 return panel, patch
294
295 def x_offset_gff(GFFParseobj, x_offset):
296 """Takes in a gff object (a gff file parsed as a pandas dataframe),
297 and an x_offset value and shifts the start, stop, center, lmost, and rmost.
298
299 Returns a GFFParse object with the shifted values in GFFParse.features.
300 """
301 for columnname in ['start', 'stop', 'center', 'lmost', 'rmost']:
302 GFFParseobj.features[columnname] = GFFParseobj.features[columnname] + x_offset
303 return GFFParseobj
304
305 def gffplot_horizontal(figure, panel, args, gff_object,
306 track_width=0.2, start_y=0.1, **kwargs):
307 """
308 this plots horizontal things from gff files. it was probably written for synplot,
309 as the browser does not use this at all.
310 """
311 # Because this size should be relative to the circle that it is plotted next
312 # to, define the start_radius as the place to work from, and the width of
313 # each track.
314 colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red',
315 'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white'}
316 augment = 0
317 bar_thickness = 0.9 * track_width
318 # return these at the end
319 myPatches=[]
320 plot_order = []
321
322 idone = False
323 # we need to filter out the tRNAs since those are plotted last
324 plottable_features = gff_object.features.query("featType != 'tRNA' and featType != 'region' and featType != 'source'")
325 plottable_features.reset_index(inplace=True, drop=True)
326 print(plottable_features)
327
328 len_plottable = len(plottable_features)
329 print('len plottable', len_plottable)
330 # - this for loop relies on the gff features to already be sorted
331 # - The algorithm for this loop works by starting at the 0th index of the
332 # plottable features (i).
333 # - It then looks to see if the next object (the jth) overlaps with the
334 # ith element.
335 i = 0
336 j = 1
337 while i < len(plottable_features):
338 if i + j == len(plottable_features):
339 #we have run off of the df and need to include everything from i to the end
340 these_features = plottable_features.loc[i::,].copy(deep=True)
341 these_features = these_features.reset_index()
342 print(these_features)
343 plot_order.append(these_features)
344 i = len(plottable_features)
345 break
346 print(" - i,j are currently: {},{}".format(i, j))
347 stop = plottable_features.loc[i]["stop"]
348 start = plottable_features.loc[i+j]["start"]
349 print("stop: {}. start: {}.".format(stop, start))
350 if plottable_features.loc[i]["stop"] <= plottable_features.loc[i+j]["start"]:
351 print(" - putting elements {} through (including) {} together".format(i, i+j))
352 these_features = plottable_features.loc[i:i+j-1,].copy(deep=True)
353 these_features = these_features.reset_index()
354 print(these_features)
355 plot_order.append(these_features)
356 i += 1
357 j = 1
358 else:
359 j += 1
360
361 #while idone == False:
362 # print("im in the overlap-pairing while loop i={}".format(i))
363 # # look ahead at all of the elements that overlap with the ith element
364 # jdone = False
365 # j = 1
366 # this_set_minimum_index = i
367 # this_set_maximum_index = i
368 # while jdone == False:
369 # print("new i= {} j={} len={}".format(i, j, len_plottable))
370 # print("len plottable in jdone: {}".format(len_plottable))
371 # print("plottable features in jdone:\n {}".format(plottable_features))
372 # # first make sure that we haven't gone off the end of the dataframe
373 # # This is an edge case where i has a jth element that overlaps with it,
374 # # and j is the last element in the plottable features.
375 # if i+j == len_plottable:
376 # print("i+j == len_plottable")
377 # # this checks for the case that i is the last element of the
378 # # plottable features.
379 # # In both of the above cases, we are done with both the ith and
380 # # the jth features.
381 # if i == len_plottable-1:
382 # print("i == len_plottable-1")
383
384 # # this is the last analysis, so set idone to true
385 # # to finish after this
386 # idone = True
387 # # the last one can't be in its own group, so just add it solo
388 # these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,].copy(deep=True)
389 # plot_order.append(these_features.reset_index(drop=True))
390 # break
391 # jdone = True
392 # else:
393 # print("i+j != len_plottable")
394 # # if the lmost of the next gene overlaps with the rmost of
395 # # the current one, it overlaps and couple together
396 # if plottable_features.loc[i+j, 'lmost'] < plottable_features.loc[i, 'rmost']:
397 # print("lmost < rmost")
398 # # note that this feature overlaps with the current
399 # this_set_maximum_index = i+j
400 # # ... and we need to look at the next in line
401 # j += 1
402 # else:
403 # print("lmost !< rmost")
404 # i += 1 + (this_set_maximum_index - this_set_minimum_index)
405 # #add all of the things that grouped together once we don't find any more groups
406 # these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,].copy(deep=True)
407 # plot_order.append(these_features.reset_index(drop=True))
408 # jdone = True
409 # print("plot order is now: {}".format(plot_order))
410 # print("jdone: {}".format(str(jdone)))
411
412 for feature_set in plot_order:
413 # plot_feature_hori handles overlapping cases as well as normal cases
414 panel, patches = gffplot_feature_hori(figure, panel, feature_set, colorMap,
415 start_y, bar_thickness, text = True)
416 for each in patches:
417 print("there are {} patches after gffplot_feature_hori".format(len(patches)))
418 print(each)
419 myPatches.append(each)
420 print("length of myPatches is: {}".format(len(myPatches)))
421
422 # Now we add all of the tRNAs to this to plot, do it last to overlay
423 # everything else
424 tRNAs = gff_object.features.query("featType == 'tRNA'")
425 tRNAs.reset_index(inplace=True, drop = True)
426 tRNA_bar_thickness = bar_thickness * (0.8)
427 tRNA_start_y = start_y + ((bar_thickness - tRNA_bar_thickness)/2)
428 for i in range(0,len(tRNAs)):
429 this_feature = tRNAs[i:i+1].copy(deep=True)
430 this_feature.reset_index(inplace=True, drop = True)
431 panel, patches = gffplot_feature_hori(figure, panel, this_feature, colorMap,
432 tRNA_start_y, tRNA_bar_thickness, text = True)
433 for patch in patches:
434 myPatches.append(patch)
435 print("There are {} patches at the end of gffplot_horizontal()".format(len(myPatches)))
436 return panel, myPatches
437
438 def gffplot_feature_hori(figure, panel, feature_df,
439 colorMap, y_pos, bar_thickness, text=True):
440 """This plots the track for a feature, and if there is something for
441 'this_feature_overlaps_feature', then there is special processing to
442 add the white bar and the extra slope for the chevron
443 """
444 myPatches = []
445 #if there is only one feature to plot, then just plot it
446 if len(feature_df) == 1:
447 #print("plotting a single thing: {} {}".format(str(feature_df['sequence']).split()[1],
448 # str(feature_df['featType']).split()[1] ))
449 #print(this_feature['name'], "is not overlapping")
450 # This plots this shape: 1_________2 2_________1
451 # | forward \3 3/ reverse |
452 # |5__________/4 \4________5|
453 if feature_df.loc[0,'strand'] == '+':
454 verts = [(feature_df.loc[0, 'start'], y_pos + bar_thickness), #1
455 (feature_df.loc[0, 'stop'] - arrow_width, y_pos + bar_thickness), #2
456 (feature_df.loc[0, 'stop'], y_pos + (bar_thickness/2)), #3
457 (feature_df.loc[0, 'stop'] - arrow_width, y_pos), #4
458 (feature_df.loc[0, 'start'], y_pos), #5
459 (feature_df.loc[0, 'start'], y_pos + bar_thickness)] #1
460 elif feature_df.loc[0,'strand'] == '-':
461 verts = [(feature_df.loc[0, 'stop'], y_pos + bar_thickness), #1
462 (feature_df.loc[0, 'start'] + arrow_width, y_pos + bar_thickness), #2
463 (feature_df.loc[0, 'start'], y_pos + (bar_thickness/2)), #3
464 (feature_df.loc[0, 'start'] + arrow_width, y_pos), #4
465 (feature_df.loc[0, 'stop'], y_pos), #5
466 (feature_df.loc[0, 'stop'], y_pos + bar_thickness)] #1
467 feat_width = feature_df.loc[0,'width']
468 if text and feat_width >= min_text:
469 panel = _plot_label(panel, feature_df.loc[0,],
470 y_pos, bar_thickness)
471 elif text and feat_width < min_text and feat_width >= text_cutoff:
472 panel = _plot_label(panel, feature_df.loc[0,],
473 y_pos, bar_thickness,
474 rotate = True, arrow = True)
475
476 codes = [Path.MOVETO,
477 Path.LINETO,
478 Path.LINETO,
479 Path.LINETO,
480 Path.LINETO,
481 Path.CLOSEPOLY]
482 path = Path(verts, codes)
483 print("normal path is: {}".format(path))
484 # If the feature itself is smaller than the arrow, we need to take special measures to
485 if feature_df.loc[0,'width'] <= arrow_width:
486 path = Path([verts[i] for i in [0,2,4,5]],
487 [codes[i] for i in [0,2,4,5]])
488 patch = patches.PathPatch(path, lw = 0,
489 fc=colorMap[feature_df.loc[0, 'featType']] )
490 myPatches.append(patch)
491 # there are four possible scenarios if there are two overlapping sequences:
492 # ___________ ____________ ____________ ___________
493 # | #1 \ \ #1 \ / #2 / / #2 |
494 # | both seqs \ \ both seqs \ / both seqs / / both seqs |
495 # | forward / / forward / \ reverse \ \ reverse |
496 # |__________/ /___________/ \___________\ \___________|
497 # ___________ _____________ ____________ _ _________
498 # | #3 \ \ #3 | / #2 _| #2 \
499 # | one seq \ \ one seq | / one seq |_ one seq \
500 # | forward \ \ reverse | \ reverse _| forward /
501 # |_____________\ \_________| \__________|_ ___________/
502 #
503 # These different scenarios can be thought of as different left/right
504 # flanking segment types.
505 # In the annotation #rff:
506 # - 'r' refers to the annotation type as being on the right
507 # - the first 'f' refers to the what element is to the left of this one.
508 # Since it is forward the 5' end of this annotation must be a chevron
509 # - the second 'f' refers to the right side of this element. Since it is
510 # forward it must be a normal arrow.
511 # being on the right
512 #
513 # *LEFT TYPES* *RIGHT TYPES*
514 # ____________ ____________
515 # | #lff \ \ #rff \
516 # | left for \ \ right for \
517 # | forward / / forward /
518 # |___________/ /___________/
519 # ___________ _____________
520 # | #lfr \ \ #rfr |
521 # | left for \ \ right for |
522 # | reverse \ \ reverse |
523 # |_____________\ \_________|
524 # ____________ ___________
525 # / #lrr / / #rrr |
526 # / left rev / / right rev |
527 # \ reverse \ \ reverse |
528 # \___________\ \___________|
529 # ____________ __________
530 # / #lrf _| _| #rrf \
531 # / left rev |_ | _ right rev \
532 # \ forward _| _| forward /
533 # \__________| |____________/
534 #
535 # To properly plot these elements, we must go through each element of the
536 # feature_df to determine which patch type it is.
537 elif len(feature_df) == 2:
538 print("im in here feat len=2")
539 for i in range(len(feature_df)):
540 # this tests for which left type we're dealing with
541 if i == 0:
542 # type could be lff or lfr
543 if feature_df.loc[i, 'strand'] == '+':
544 if feature_df.loc[i + 1, 'strand'] == '+':
545 # plot a lff type
546 panel, patch = _plot_lff(panel, feature_df.iloc[i,], feature_df.iloc[i+1,],
547 colorMap, y_pos, bar_thickness, text)
548 myPatches.append(patch)
549 elif feature_df.loc[i + 1, 'strand'] == '-':
550 #plot a lfr type
551 raise IOError("can't plot {} patches yet".format("lfr"))
552 # or type could be lrr or lrf
553 elif feature_df.loc[i, 'strand'] == '-':
554 if feature_df.loc[i + 1, 'strand'] == '+':
555 # plot a lrf type
556 raise IOError("can't plot {} patches yet".format("lrf"))
557 elif feature_df.loc[i + 1, 'strand'] == '-':
558 #plot a lrr type
559 raise IOError("can't plot {} patches yet".format("lrr"))
560 # in this case we're only dealing with 'right type' patches
561 elif i == len(feature_df) - 1:
562 # type could be rff or rfr
563 if feature_df.loc[i-1, 'strand'] == '+':
564 if feature_df.loc[i, 'strand'] == '+':
565 # plot a rff type
566 panel, patch = _plot_rff(panel, feature_df.iloc[i-1,], feature_df.iloc[i,],
567 colorMap, y_pos, bar_thickness, text)
568 myPatches.append(patch)
569 elif feature_df.loc[i, 'strand'] == '-':
570 #plot a rfr type
571 raise IOError("can't plot {} patches yet".format("rfr"))
572 # or type could be rrr or rrf
573 elif feature_df.loc[i-1, 'strand'] == '-':
574 if feature_df.loc[i, 'strand'] == '+':
575 # plot a rrf type
576 raise IOError("can't plot {} patches yet".format("rrf"))
577 elif feature_df.loc[i, 'strand'] == '-':
578 #plot a rrr type
579 raise IOError("can't plot {} patches yet".format("rrr"))
580 return panel, myPatches
0 # Binary search tree that holds status of sweep line. Only leaves hold values.
1 # Operations for finding left and right neighbors of a query point p and finding which segments contain p.
2 # Author: Sam Lichtenberg
3 # Email: splichte@princeton.edu
4 # Date: 09/02/2013
5
6 from pauvre.lsi.helper import *
7
8 ev = 0.00000001
9
10 class Q:
11 def __init__(self, key, value):
12 self.key = key
13 self.value = value
14 self.left = None
15 self.right = None
16
17 def find(self, key):
18 if self.key is None:
19 return False
20 c = compare_by_y(key, self.key)
21 if c==0:
22 return True
23 elif c==-1:
24 if self.left:
25 self.left.find(key)
26 else:
27 return False
28 else:
29 if self.right:
30 self.right.find(key)
31 else:
32 return False
33 def insert(self, key, value):
34 if self.key is None:
35 self.key = key
36 self.value = value
37 c = compare_by_y(key, self.key)
38 if c==0:
39 self.value += value
40 elif c==-1:
41 if self.left is None:
42 self.left = Q(key, value)
43 else:
44 self.left.insert(key, value)
45 else:
46 if self.right is None:
47 self.right = Q(key, value)
48 else:
49 self.right.insert(key, value)
50 # must return key AND value
51 def get_and_del_min(self, parent=None):
52 if self.left is not None:
53 return self.left.get_and_del_min(self)
54 else:
55 k = self.key
56 v = self.value
57 if parent:
58 parent.left = self.right
59 # i.e. is root node
60 else:
61 if self.right:
62 self.key = self.right.key
63 self.value = self.right.value
64 self.left = self.right.left
65 self.right = self.right.right
66 else:
67 self.key = None
68 return k,v
69
70 def print_tree(self):
71 if self.left:
72 self.left.print_tree()
73 print(self.key)
74 print(self.value)
75 if self.right:
76 self.right.print_tree()
0 # Binary search tree that holds status of sweep line. Only leaves hold values.
1 # Operations for finding left and right neighbors of a query point p and finding which segments contain p.
2 # Author: Sam Lichtenberg
3 # Email: splichte@princeton.edu
4 # Date: 09/02/2013
5
6 from pauvre.lsi.helper import *
7
8 ev = 0.00000001
9
10 class T:
11 def __init__(self):
12 self.root = Node(None, None, None, None)
13 def contain_p(self, p):
14 if self.root.value is None:
15 return [[], []]
16 lists = [[], []]
17 self.root.contain_p(p, lists)
18 return (lists[0], lists[1])
19 def get_left_neighbor(self, p):
20 if self.root.value is None:
21 return None
22 return self.root.get_left_neighbor(p)
23 def get_right_neighbor(self, p):
24 if self.root.value is None:
25 return None
26 return self.root.get_right_neighbor(p)
27 def insert(self, key, s):
28 if self.root.value is None:
29 self.root.left = Node(s, None, None, self.root)
30 self.root.value = s
31 self.root.m = get_slope(s)
32 else:
33 (node, path) = self.root.find_insert_pt(key, s)
34 if path == 'r':
35 node.right = Node(s, None, None, node)
36 node.right.adjust()
37 elif path == 'l':
38 node.left = Node(s, None, None, node)
39 else:
40 # this means matching Node was a leaf
41 # need to make a new internal Node
42 if node.compare_to_key(key) < 0 or (node.compare_to_key(key)==0 and node.compare_lower(key, s) < 1):
43 new_internal = Node(s, None, node, node.parent)
44 new_leaf = Node(s, None, None, new_internal)
45 new_internal.left = new_leaf
46 if node is node.parent.left:
47 node.parent.left = new_internal
48 node.adjust()
49 else:
50 node.parent.right = new_internal
51 else:
52 new_internal = Node(node.value, node, None, node.parent)
53 new_leaf = Node(s, None, None, new_internal)
54 new_internal.right = new_leaf
55 if node is node.parent.left:
56 node.parent.left = new_internal
57 new_leaf.adjust()
58 else:
59 node.parent.right = new_internal
60 node.parent = new_internal
61
62 def delete(self, p, s):
63 key = p
64 node = self.root.find_delete_pt(key, s)
65 val = node.value
66 if node is node.parent.left:
67 parent = node.parent.parent
68 if parent is None:
69 if self.root.right is not None:
70 if self.root.right.left or self.root.right.right:
71 self.root = self.root.right
72 self.root.parent = None
73 else:
74 self.root.left = self.root.right
75 self.root.value = self.root.right.value
76 self.root.m = self.root.right.m
77 self.root.right = None
78 else:
79 self.root.left = None
80 self.root.value = None
81 elif node.parent is parent.left:
82 parent.left = node.parent.right
83 node.parent.right.parent = parent
84 else:
85 parent.right = node.parent.right
86 node.parent.right.parent = parent
87 else:
88 parent = node.parent.parent
89 if parent is None:
90 if self.root.left:
91 # switch properties
92 if self.root.left.right or self.root.left.left:
93 self.root = self.root.left
94 self.root.parent = None
95 else:
96 self.root.right = None
97 else:
98 self.root.right = None
99 self.root.value = None
100 elif node.parent is parent.left:
101 parent.left = node.parent.left
102 node.parent.left.parent = parent
103 farright = node.parent.left
104 while farright.right is not None:
105 farright = farright.right
106 farright.adjust()
107 else:
108 parent.right = node.parent.left
109 node.parent.left.parent = parent
110 farright = node.parent.left
111 while farright.right is not None:
112 farright = farright.right
113 farright.adjust()
114 return val
115
116 def print_tree(self):
117 self.root.print_tree()
118 class Node:
119 def __init__(self, value, left, right, parent):
120 self.value = value # associated line segment
121 self.left = left
122 self.right = right
123 self.parent = parent
124 self.m = None
125 if value is not None:
126 self.m = get_slope(value)
127
128 # compares line segment at y-val of p to p
129 # TODO: remove this and replace with get_x_at
130 def compare_to_key(self, p):
131 x0 = self.value[0][0]
132 y0 = self.value[0][1]
133 y1 = p[1]
134 if self.m != 0 and self.m is not None:
135 x1 = x0 - float(y0-y1)/self.m
136 return compare_by_x(p, (x1, y1))
137 else:
138 x1 = p[0]
139 return 0
140
141 def get_left_neighbor(self, p):
142 neighbor = None
143 n = self
144 if n.left is None and n.right is None:
145 return neighbor
146 last_right = None
147 found = False
148 while not found:
149 c = n.compare_to_key(p)
150 if c < 1 and n.left:
151 n = n.left
152 elif c==1 and n.right:
153 n = n.right
154 last_right = n.parent
155 else:
156 found = True
157 c = n.compare_to_key(p)
158 if c==0:
159 if n is n.parent.right:
160 return n.parent
161 else:
162 goright = None
163 if last_right:
164 goright =last_right.left
165 return self.get_lr(None, goright)[0]
166 # n stores the highest-value in the left subtree
167 if c==-1:
168 goright = None
169 if last_right:
170 goright = last_right.left
171 return self.get_lr(None, goright)[0]
172 if c==1:
173 neighbor = n
174 return neighbor
175
176 def get_right_neighbor(self, p):
177 neighbor = None
178 n = self
179 if n.left is None and n.right is None:
180 return neighbor
181 last_left = None
182 found = False
183 while not found:
184 c = n.compare_to_key(p)
185 if c==0 and n.right:
186 n = n.right
187 elif c < 0 and n.left:
188 n = n.left
189 last_left = n.parent
190 elif c==1 and n.right:
191 n = n.right
192 else:
193 found = True
194 c = n.compare_to_key(p)
195 # can be c==0 and n.left if at root node
196 if c==0:
197 if n.parent is None:
198 return None
199 if n is n.parent.right:
200 goleft = None
201 if last_left:
202 goleft = last_left.right
203 return self.get_lr(goleft, None)[1]
204 else:
205 return self.get_lr(n.parent.right, None)[1]
206 if c==1:
207 goleft = None
208 if last_left:
209 goleft = last_left.right
210 return self.get_lr(goleft, None)[1]
211 if c==-1:
212 return n
213 return neighbor
214
215 # travels down a single direction to get neighbors
216 def get_lr(self, left, right):
217 lr = [None, None]
218 if left:
219 while left.left:
220 left = left.left
221 lr[1] = left
222 if right:
223 while right.right:
224 right = right.right
225 lr[0] = right
226 return lr
227
228 def contain_p(self, p, lists):
229 c = self.compare_to_key(p)
230 if c==0:
231 if self.left is None and self.right is None:
232 if compare_by_x(p, self.value[1])==0:
233 lists[1].append(self.value)
234 else:
235 lists[0].append(self.value)
236 if self.left:
237 self.left.contain_p(p, lists)
238 if self.right:
239 self.right.contain_p(p, lists)
240 elif c < 0:
241 if self.left:
242 self.left.contain_p(p, lists)
243 else:
244 if self.right:
245 self.right.contain_p(p, lists)
246
247 def find_insert_pt(self, key, seg):
248 if self.left and self.right:
249 if self.compare_to_key(key) == 0 and self.compare_lower(key, seg)==1:
250 return self.right.find_insert_pt(key, seg)
251 elif self.compare_to_key(key) < 1:
252 return self.left.find_insert_pt(key, seg)
253 else:
254 return self.right.find_insert_pt(key, seg)
255 # this case only happens at root
256 elif self.left:
257 if self.compare_to_key(key) == 0 and self.compare_lower(key, seg)==1:
258 return (self, 'r')
259 elif self.compare_to_key(key) < 1:
260 return self.left.find_insert_pt(key, seg)
261 else:
262 return (self, 'r')
263 else:
264 return (self, 'n')
265
266 # adjusts stored segments in inner nodes
267 def adjust(self):
268 value = self.value
269 m = self.m
270 parent = self.parent
271 node = self
272 # go up left as much as possible
273 while parent and node is parent.right:
274 node = parent
275 parent = node.parent
276 # parent to adjust will be on the immediate right
277 if parent and node is parent.left:
278 parent.value = value
279 parent.m = m
280
281 def compare_lower(self, p, s2):
282 y = p[1] - 10
283 key = get_x_at(s2, (p[0], y))
284 return self.compare_to_key(key)
285
286 # returns matching leaf node, or None if no match
287 # when deleting, you don't delete below--you delete above! so compare lower = -1.
288 def find_delete_pt(self, key, value):
289 if self.left and self.right:
290 # if equal at this pt, and this node's value is less than the seg's slightly above this pt
291 if self.compare_to_key(key) == 0 and self.compare_lower(key, value)==-1:
292 return self.right.find_delete_pt(key, value)
293 if self.compare_to_key(key) < 1:
294 return self.left.find_delete_pt(key, value)
295 else:
296 return self.right.find_delete_pt(key, value)
297 elif self.left:
298 if self.compare_to_key(key) < 1:
299 return self.left.find_delete_pt(key, value)
300 else:
301 return None
302 # is leaf
303 else:
304 if self.compare_to_key(key)==0 and segs_equal(self.value, value):
305 return self
306 else:
307 return None
308
309 # also prints depth of each node
310 def print_tree(self, l=0):
311 l += 1
312 if self.left:
313 self.left.print_tree(l)
314 if self.left or self.right:
315 print('INTERNAL: {0}'.format(l))
316 else:
317 print('LEAF: {0}'.format(l))
318 print(self)
319 print(self.value)
320 if self.right:
321 self.right.print_tree(l)
(New empty file)
0 # Helper functions for use in the lsi implementation.
1
2 ev = 0.0000001
3 # floating-point comparison
4 def approx_equal(a, b, tol):
5 return abs(a - b) < tol
6
7 # compares x-values of two pts
8 # used for ordering in T
9 def compare_by_x(k1, k2):
10 if approx_equal(k1[0], k2[0], ev):
11 return 0
12 elif k1[0] < k2[0]:
13 return -1
14 else:
15 return 1
16
17 # higher y value is "less"; if y value equal, lower x value is "less"
18 # used for ordering in Q
19 def compare_by_y(k1, k2):
20 if approx_equal(k1[1], k2[1], ev):
21 if approx_equal(k1[0], k2[0], ev):
22 return 0
23 elif k1[0] < k2[0]:
24 return -1
25 else:
26 return 1
27 elif k1[1] > k2[1]:
28 return -1
29 else:
30 return 1
31
32 # tests if s0 and s1 represent the same segment (i.e. pts can be in 2 different orders)
33 def segs_equal(s0, s1):
34 x00 = s0[0][0]
35 y00 = s0[0][1]
36 x01 = s0[1][0]
37 y01 = s0[1][1]
38 x10 = s1[0][0]
39 y10 = s1[0][1]
40 x11 = s1[1][0]
41 y11 = s1[1][1]
42 if (approx_equal(x00, x10, ev) and approx_equal(y00, y10, ev)):
43 if (approx_equal(x01, x11, ev) and approx_equal(y01, y11, ev)):
44 return True
45 if (approx_equal(x00, x11, ev) and approx_equal(y00, y11, ev)):
46 if (approx_equal(x01, x10, ev) and approx_equal(y01, y10, ev)):
47 return True
48 return False
49
50 # get m for a given seg in (p1, p2) form
51 def get_slope(s):
52 x0 = s[0][0]
53 y0 = s[0][1]
54 x1 = s[1][0]
55 y1 = s[1][1]
56 if (x1-x0)==0:
57 return None
58 else:
59 return float(y1-y0)/(x1-x0)
60
61 # given a point p, return the point on s that shares p's y-val
62 def get_x_at(s, p):
63 m = get_slope(s)
64 # TODO: this should check if p's x-val is octually on seg; we're assuming
65 # for now that it would have been deleted already if not
66 if m == 0: # horizontal segment
67 return p
68 # ditto; should check if y-val on seg
69 if m is None: # vertical segment
70 return (s[0][0], p[1])
71 x1 = s[0][0]-(s[0][1]-p[1])/m
72 return (x1, p[1])
73
74 # returns the point at which two line segments intersect, or None if no intersection.
75 def intersect(seg1, seg2):
76 p = seg1[0]
77 r = (seg1[1][0]-seg1[0][0], seg1[1][1]-seg1[0][1])
78 q = seg2[0]
79 s = (seg2[1][0]-seg2[0][0], seg2[1][1]-seg2[0][1])
80 denom = r[0]*s[1]-r[1]*s[0]
81 if denom == 0:
82 return None
83 numer = float(q[0]-p[0])*s[1]-(q[1]-p[1])*s[0]
84 t = numer/denom
85 numer = float(q[0]-p[0])*r[1]-(q[1]-p[1])*r[0]
86 u = numer/denom
87 if (t < 0 or t > 1) or (u < 0 or u > 1):
88 return None
89 x = p[0]+t*r[0]
90 y = p[1]+t*r[1]
91 return (x, y)
92
93
0 # Implementation of the Bentley-Ottmann algorithm, described in deBerg et al, ch. 2.
1 # See README for more information.
2 # Author: Sam Lichtenberg
3 # Email: splichte@princeton.edu
4 # Date: 09/02/2013
5
6 from pauvre.lsi.Q import Q
7 from pauvre.lsi.T import T
8 from pauvre.lsi.helper import *
9
10 # "close enough" for floating point
11 ev = 0.00000001
12
13 # how much lower to get the x of a segment, to determine which of a set of segments is the farthest right/left
14 lower_check = 100
15
16 # gets the point on a segment at a lower y value.
17 def getNextPoint(p, seg, y_lower):
18 p1 = seg[0]
19 p2 = seg[1]
20 if (p1[0]-p2[0])==0:
21 return (p[0]+10, p[1])
22 slope = float(p1[1]-p2[1])/(p1[0]-p2[0])
23 if slope==0:
24 return (p1[0], p[1]-y_lower)
25 y = p[1]-y_lower
26 x = p1[0]-(p1[1]-y)/slope
27 return (x, y)
28
29 """
30 for each event point:
31 U_p = segments that have p as an upper endpoint
32 C_p = segments that contain p
33 L_p = segments that have p as a lower endpoint
34 """
35 def handle_event_point(p, segs, q, t, intersections):
36 rightmost = (float("-inf"), 0)
37 rightmost_seg = None
38 leftmost = (float("inf"), 0)
39 leftmost_seg = None
40
41 U_p = segs
42 (C_p, L_p) = t.contain_p(p)
43 merge_all = U_p+C_p+L_p
44 if len(merge_all) > 1:
45 intersections[p] = []
46 for s in merge_all:
47 intersections[p].append(s)
48 merge_CL = C_p+L_p
49 merge_UC = U_p+C_p
50 for s in merge_CL:
51 # deletes at a point slightly above (to break ties) - where seg is located in tree
52 # above intersection point
53 t.delete(p, s)
54 # put segments into T based on where they are at y-val just below p[1]
55 for s in merge_UC:
56 n = getNextPoint(p, s, lower_check)
57 if n[0] > rightmost[0]:
58 rightmost = n
59 rightmost_seg = s
60 if n[0] < leftmost[0]:
61 leftmost = n
62 leftmost_seg = s
63 t.insert(p, s)
64
65 # means only L_p -> check newly-neighbored segments
66 if len(merge_UC) == 0:
67 neighbors = (t.get_left_neighbor(p), t.get_right_neighbor(p))
68 if neighbors[0] and neighbors[1]:
69 find_new_event(neighbors[0].value, neighbors[1].value, p, q)
70
71 # of newly inserted pts, find possible intersections to left and right
72 else:
73 left_neighbor = t.get_left_neighbor(p)
74 if left_neighbor:
75 find_new_event(left_neighbor.value, leftmost_seg, p, q)
76 right_neighbor = t.get_right_neighbor(p)
77 if right_neighbor:
78 find_new_event(right_neighbor.value, rightmost_seg, p, q)
79
80 def find_new_event(s1, s2, p, q):
81 i = intersect(s1, s2)
82 if i:
83 if compare_by_y(i, p) == 1:
84 if not q.find(i):
85 q.insert(i, [])
86
87 # segment is in ((x, y), (x, y)) form
88 # first pt in a segment should have higher y-val - this is handled in function
89 def intersection(S):
90 s0 = S[0]
91 if s0[1][1] > s0[0][1]:
92 s0 = (s0[1], s0[0])
93 q = Q(s0[0], [s0])
94 q.insert(s0[1], [])
95 intersections = {}
96 for s in S[1:]:
97 if s[1][1] > s[0][1]:
98 s = (s[1], s[0])
99 q.insert(s[0], [s])
100 q.insert(s[1], [])
101 t = T()
102 while q.key:
103 p, segs = q.get_and_del_min()
104 handle_event_point(p, segs, q, t, intersections)
105 return intersections
106
0 # Test file for lsi.
1 # Author: Sam Lichtenberg
2 # Email: splichte@princeton.edu
3 # Date: 09/02/2013
4
5 from lsi import intersection
6 import random
7 import time, sys
8 from helper import *
9
10 ev = 0.00000001
11
12 def scale(i):
13 return float(i)
14
15 use_file = None
16 try:
17 use_file = sys.argv[2]
18 except:
19 pass
20
21 if not use_file:
22 S = []
23 for i in range(int(sys.argv[1])):
24 p1 = (scale(random.randint(0, 1000)), scale(random.randint(0, 1000)))
25 p2 = (scale(random.randint(0, 1000)), scale(random.randint(0, 1000)))
26 s = (p1, p2)
27 S.append(s)
28 f = open('input', 'w')
29 f.write(str(S))
30 f.close()
31
32 else:
33 f = open(sys.argv[2], 'r')
34 S = eval(f.read())
35
36 intersections = []
37 seen = []
38 vs = False
39 hs = False
40 es = False
41 now = time.time()
42 for seg1 in S:
43 if approx_equal(seg1[0][0], seg1[1][0], ev):
44 print 'VERTICAL SEG'
45 print ''
46 print ''
47 vs = True
48 if approx_equal(seg1[0][1], seg1[1][1], ev):
49 print 'HORIZONTAL SEG'
50 print ''
51 print ''
52 hs = True
53 for seg2 in S:
54 if seg1 is not seg2 and segs_equal(seg1, seg2):
55 print 'EQUAL SEGS'
56 print ''
57 print ''
58 es = True
59 if seg1 is not seg2 and (seg2, seg1) not in seen:
60 i = intersect(seg1, seg2)
61 if i:
62 intersections.append((i, [seg1, seg2]))
63 # xpts = [seg1[0][0], seg1[1][0], seg2[0][0], seg2[1][0]]
64 # xpts = sorted(xpts)
65 # if (i[0] <= xpts[2] and i[0] >= xpts[1]:
66 # intersections.append((i, [seg1, seg2]))
67 seen.append((seg1, seg2))
68 later = time.time()
69 n2time = later-now
70 print "Line sweep results:"
71 now = time.time()
72 lsinters = intersection(S)
73 inters = []
74 for k, v in lsinters.iteritems():
75 #print '{0}: {1}'.format(k, v)
76 inters.append(k)
77 # inters.append(v)
78 later = time.time()
79 print 'TIME ELAPSED: {0}'.format(later-now)
80 print "N^2 comparison results:"
81 pts_seen = []
82 highestseen = 0
83 for i in intersections:
84 seen_already = False
85 seen = 0
86 for p in pts_seen:
87 if approx_equal(i[0][0], p[0], ev) and approx_equal(i[0][1], p[1], ev):
88 seen += 1
89 seen_already = True
90 if seen > highestseen:
91 highestseen = seen
92 if not seen_already:
93 pts_seen.append(i[0])
94 in_k = False
95 for k in inters:
96 if approx_equal(k[0], i[0][0], ev) and approx_equal(k[1], i[0][1], ev):
97 in_k = True
98 if in_k == False:
99 print 'Not in K: {0}: {1}'.format(i[0], i[1])
100 # print i
101 print highestseen
102 print 'TIME ELAPSED: {0}'.format(n2time)
103 #print 'Missing from line sweep but in N^2:'
104 #for i in seen:
105 # matched = False
106 print len(lsinters)
107 print len(pts_seen)
108 if len(lsinters) != len(pts_seen):
109 print 'uh oh!'
0 #!/usr/bin/env python
1 # -*- coding: utf-8 -*-
2
3 # pauvre
4 # Copyright (c) 2016-2020 Darrin T. Schultz.
5 #
6 # This file is part of pauvre.
7 #
8 # pauvre is free software: you can redistribute it and/or modify
9 # it under the terms of the GNU General Public License as published by
10 # the Free Software Foundation, either version 3 of the License, or
11 # (at your option) any later version.
12 #
13 # pauvre is distributed in the hope that it will be useful,
14 # but WITHOUT ANY WARRANTY; without even the implied warranty of
15 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16 # GNU General Public License for more details.
17 #
18 # You should have received a copy of the GNU General Public License
19 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
20
21 import ast
22 import matplotlib
23 matplotlib.use('Agg')
24 import matplotlib.pyplot as plt
25 import matplotlib.patches as mplpatches
26 from matplotlib.colors import LinearSegmentedColormap
27 import numpy as np
28 import pandas as pd
29 import os.path as opath
30 from sys import stderr
31 from pauvre.functions import parse_fastq_length_meanqual, print_images, filter_fastq_length_meanqual
32 from pauvre.stats import stats
33 import pauvre.rcparams as rc
34 import logging
35
36 # logging
37 logger = logging.getLogger('pauvre')
38
39
40 def generate_panel(panel_left, panel_bottom, panel_width, panel_height,
41 axis_tick_param='both', which_tick_param='both',
42 bottom_tick_param=True, label_bottom_tick_param=True,
43 left_tick_param=True, label_left_tick_param=True,
44 right_tick_param=False, label_right_tick_param=False,
45 top_tick_param=False, label_top_tick_param=False):
46 """
47 Setting default panel tick parameters. Some of these are the defaults
48 for matplotlib anyway, but specifying them for readability. Here are
49 options and defaults for the parameters used below:
50
51 axis : {'x', 'y', 'both'}; which axis to modify; default = 'both'
52 which : {'major', 'minor', 'both'}; which ticks to modify;
53 default = 'major'
54 bottom, top, left, right : bool or {True, False}; ticks on or off;
55 labelbottom, labeltop, labelleft, labelright : bool or {True, False}
56 """
57
58 # create the panel
59 panel_rectangle = [panel_left, panel_bottom, panel_width, panel_height]
60 panel = plt.axes(panel_rectangle)
61
62 # Set tick parameters
63 panel.tick_params(axis=axis_tick_param,
64 which=which_tick_param,
65 bottom=bottom_tick_param,
66 labelbottom=label_bottom_tick_param,
67 left=left_tick_param,
68 labelleft=label_left_tick_param,
69 right=right_tick_param,
70 labelright=label_right_tick_param,
71 top=top_tick_param,
72 labeltop=label_top_tick_param)
73
74 return panel
75
76
77 def _generate_histogram_bin_patches(panel, bins, bin_values, horizontal=True):
78 """This helper method generates the histogram that is added to the panel.
79
80 In this case, horizontal = True applies to the mean quality histogram.
81 So, horizontal = False only applies to the length histogram.
82 """
83 l_width = 0.0
84 f_color = (0.5, 0.5, 0.5)
85 e_color = (0, 0, 0)
86 if horizontal:
87 for step in np.arange(0, len(bin_values), 1):
88 left = bins[step]
89 bottom = 0
90 width = bins[step + 1] - bins[step]
91 height = bin_values[step]
92 hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
93 linewidth=l_width,
94 facecolor=f_color,
95 edgecolor=e_color)
96 panel.add_patch(hist_rectangle)
97 else:
98 for step in np.arange(0, len(bin_values), 1):
99 left = 0
100 bottom = bins[step]
101 width = bin_values[step]
102 height = bins[step + 1] - bins[step]
103
104 hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
105 linewidth=l_width,
106 facecolor=f_color,
107 edgecolor=e_color)
108 panel.add_patch(hist_rectangle)
109
110
111 def generate_histogram(panel, data_list, max_plot_length, min_plot_length,
112 bin_interval, hist_horizontal=True,
113 left_spine=True, bottom_spine=True,
114 top_spine=False, right_spine=False, x_label=None,
115 y_label=None):
116
117 bins = np.arange(0, max_plot_length, bin_interval)
118
119 bin_values, bins2 = np.histogram(data_list, bins)
120
121 # hist_horizontal is used for quality
122 if hist_horizontal:
123 panel.set_xlim([min_plot_length, max_plot_length])
124 panel.set_ylim([0, max(bin_values * 1.1)])
125 # and hist_horizontal == Fale is for read length
126 else:
127 panel.set_xlim([0, max(bin_values * 1.1)])
128 panel.set_ylim([min_plot_length, max_plot_length])
129
130 # Generate histogram bin patches, depending on whether we're plotting
131 # vertically or horizontally
132 _generate_histogram_bin_patches(panel, bins, bin_values, hist_horizontal)
133
134 panel.spines['left'].set_visible(left_spine)
135 panel.spines['bottom'].set_visible(bottom_spine)
136 panel.spines['top'].set_visible(top_spine)
137 panel.spines['right'].set_visible(right_spine)
138
139 if y_label is not None:
140 panel.set_ylabel(y_label)
141 if x_label is not None:
142 panel.set_xlabel(x_label)
143
144
145 def generate_heat_map(panel, data_frame, min_plot_length, min_plot_qual,
146 max_plot_length, max_plot_qual, color, **kwargs):
147 panel.set_xlim([min_plot_qual, max_plot_qual])
148 panel.set_ylim([min_plot_length, max_plot_length])
149
150 if kwargs["kmerdf"]:
151 hex_this = data_frame.query('length<{} and numks<{}'.format(
152 max_plot_length, max_plot_qual))
153 # This single line controls plotting the hex bins in the panel
154 hex_vals = panel.hexbin(hex_this['numks'], hex_this['length'], gridsize=int(np.ceil(max_plot_qual/2)),
155 linewidths=0.0, cmap=color)
156
157 else:
158 hex_this = data_frame.query('length<{} and meanQual<{}'.format(
159 max_plot_length, max_plot_qual))
160
161 # This single line controls plotting the hex bins in the panel
162 hex_vals = panel.hexbin(hex_this['meanQual'], hex_this['length'], gridsize=49,
163 linewidths=0.0, cmap=color)
164 for each in panel.spines:
165 panel.spines[each].set_visible(False)
166
167 counts = hex_vals.get_array()
168 return counts
169
170
171 def generate_legend(panel, counts, color):
172
173 # completely custom for more control
174 panel.set_xlim([0, 1])
175 panel.set_ylim([0, 1000])
176 panel.set_yticks([int(x) for x in np.linspace(0, 1000, 6)])
177 panel.set_yticklabels([int(x) for x in np.linspace(0, max(counts), 6)])
178 for i in np.arange(0, 1001, 1):
179 rgba = color(i / 1001)
180 alpha = rgba[-1]
181 facec = rgba[0:3]
182 hist_rectangle = mplpatches.Rectangle((0, i), 1, 1,
183 linewidth=0.0,
184 facecolor=facec,
185 edgecolor=(0, 0, 0),
186 alpha=alpha)
187 panel.add_patch(hist_rectangle)
188 panel.spines['top'].set_visible(False)
189 panel.spines['left'].set_visible(False)
190 panel.spines['bottom'].set_visible(False)
191 panel.yaxis.set_label_position("right")
192 panel.set_ylabel('Number of Reads')
193
194
195 def margin_plot(df, **kwargs):
196 rc.update_rcParams()
197
198 # 250, 231, 34 light yellow
199 # 67, 1, 85
200 # R=np.linspace(65/255,1,101)
201 # G=np.linspace(0/255, 231/255, 101)
202 # B=np.linspace(85/255, 34/255, 101)
203 # R=65/255, G=0/255, B=85/255
204 Rf = 65 / 255
205 Bf = 85 / 255
206 pdict = {'red': ((0.0, Rf, Rf),
207 (1.0, Rf, Rf)),
208 'green': ((0.0, 0.0, 0.0),
209 (1.0, 0.0, 0.0)),
210 'blue': ((0.0, Bf, Bf),
211 (1.0, Bf, Bf)),
212 'alpha': ((0.0, 0.0, 0.0),
213 (1.0, 1.0, 1.0))
214 }
215 # Now we will use this example to illustrate 3 ways of
216 # handling custom colormaps.
217 # First, the most direct and explicit:
218 purple1 = LinearSegmentedColormap('Purple1', pdict)
219
220 # set the figure dimensions
221 fig_width = 1.61 * 3
222 fig_height = 1 * 3
223 fig = plt.figure(figsize=(fig_width, fig_height))
224
225 # set the panel dimensions
226 heat_map_panel_width = fig_width * 0.5
227 heat_map_panel_height = heat_map_panel_width * 0.62
228
229 # find the margins to center the panel in figure
230 fig_left_margin = fig_bottom_margin = (1 / 6)
231
232 # lengthPanel
233 length_panel_width = (1 / 8)
234
235 # the color Bar parameters
236 legend_panel_width = (1 / 24)
237
238 # define padding
239 h_padding = 0.02
240 v_padding = 0.05
241
242 # Set whether to include y-axes in histograms
243 if kwargs["Y_AXES"]:
244 length_bottom_spine = True
245 length_bottom_tick = False
246 length_bottom_label = True
247 qual_left_spine = True
248 qual_left_tick = True
249 qual_left_label = True
250 qual_y_label = 'Count'
251 else:
252 length_bottom_spine = False
253 length_bottom_tick = False
254 length_bottom_label = False
255 qual_left_spine = False
256 qual_left_tick = False
257 qual_left_label = False
258 qual_y_label = None
259
260 panels = []
261
262 # Quality histogram panel
263 qual_panel_left = fig_left_margin + length_panel_width + h_padding
264 qual_panel_width = heat_map_panel_width / fig_width
265 qual_panel_height = length_panel_width * fig_width / fig_height
266 qual_panel = generate_panel(qual_panel_left,
267 fig_bottom_margin,
268 qual_panel_width,
269 qual_panel_height,
270 left_tick_param=qual_left_tick,
271 label_left_tick_param=qual_left_label)
272 panels.append(qual_panel)
273
274 # Length histogram panel
275 length_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding
276 length_panel_height = heat_map_panel_height / fig_height
277 length_panel = generate_panel(fig_left_margin,
278 length_panel_bottom,
279 length_panel_width,
280 length_panel_height,
281 bottom_tick_param=length_bottom_tick,
282 label_bottom_tick_param=length_bottom_label)
283 panels.append(length_panel)
284
285 # Heat map panel
286 heat_map_panel_left = fig_left_margin + length_panel_width + h_padding
287 heat_map_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding
288
289 heat_map_panel = generate_panel(heat_map_panel_left,
290 heat_map_panel_bottom,
291 heat_map_panel_width / fig_width,
292 heat_map_panel_height / fig_height,
293 bottom_tick_param=False,
294 label_bottom_tick_param=False,
295 left_tick_param=False,
296 label_left_tick_param=False)
297 panels.append(heat_map_panel)
298 heat_map_panel.set_title(kwargs["title"])
299
300 # Legend panel
301 legend_panel_left = fig_left_margin + length_panel_width + \
302 (heat_map_panel_width / fig_width) + (h_padding * 2)
303 legend_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding
304 legend_panel_height = heat_map_panel_height / fig_height
305 legend_panel = generate_panel(legend_panel_left, legend_panel_bottom,
306 legend_panel_width, legend_panel_height,
307 bottom_tick_param = False,
308 label_bottom_tick_param = False,
309 left_tick_param = False,
310 label_left_tick_param = False,
311 right_tick_param = True,
312 label_right_tick_param = True)
313 panels.append(legend_panel)
314
315 # Set min and max viewing window for length
316 if kwargs["plot_maxlen"]:
317 max_plot_length = kwargs["plot_maxlen"]
318 else:
319 max_plot_length = int(np.percentile(df['length'], 99))
320 min_plot_length = kwargs["plot_minlen"]
321
322 # Set length bin sizes
323 if kwargs["lengthbin"]:
324 length_bin_interval = kwargs["lengthbin"]
325 else:
326 # Dividing by 80 is based on what looks good from experience
327 length_bin_interval = int(max_plot_length / 80)
328
329 # length_bins = np.arange(0, max_plot_length, length_bin_interval)
330
331 # Set max and min viewing window for quality
332 if kwargs["plot_maxqual"]:
333 max_plot_qual = kwargs["plot_maxqual"]
334 elif kwargs["kmerdf"]:
335 max_plot_qual = np.ceil(df["numks"].median() * 2)
336 else:
337 max_plot_qual = max(np.ceil(df['meanQual']))
338 min_plot_qual = kwargs["plot_minqual"]
339
340 # Set qual bin sizes
341 if kwargs["qualbin"]:
342 qual_bin_interval = kwargs["qualbin"]
343 elif kwargs["kmerdf"]:
344 qual_bin_interval = 1
345 else:
346 # again, this is just based on what looks good from experience
347 qual_bin_interval = max_plot_qual / 85
348 qual_bins = np.arange(0, max_plot_qual, qual_bin_interval)
349
350 # Generate length histogram
351 generate_histogram(length_panel, df['length'], max_plot_length, min_plot_length,
352 length_bin_interval, hist_horizontal=False,
353 y_label='Read Length', bottom_spine=length_bottom_spine)
354
355 # Generate quality histogram
356 if kwargs["kmerdf"]:
357 generate_histogram(qual_panel, df['numks'], max_plot_qual, min_plot_qual,
358 qual_bin_interval, x_label='number of kmers',
359 y_label=qual_y_label, left_spine=qual_left_spine)
360 else:
361 generate_histogram(qual_panel, df['meanQual'], max_plot_qual, min_plot_qual,
362 qual_bin_interval, x_label='Phred Quality',
363 y_label=qual_y_label, left_spine=qual_left_spine)
364
365 # Generate heat map
366 counts = generate_heat_map(heat_map_panel, df, min_plot_length, min_plot_qual,
367 max_plot_length, max_plot_qual, purple1, kmerdf = kwargs["kmerdf"])
368
369 # Generate legend
370 generate_legend(legend_panel, counts, purple1)
371
372 # inform the user of the plotting window if not quiet mode
373 if not kwargs["QUIET"]:
374 print("""plotting in the following window:
375 {0} <= Q-score (x-axis) <= {1}
376 {2} <= length (y-axis) <= {3}""".format(
377 min_plot_qual, max_plot_qual, min_plot_length, max_plot_length),
378 file=stderr)
379 # Print image(s)
380 if kwargs["BASENAME"] is None and not kwargs["path"] is None:
381 file_base = kwargs["BASENAME"]
382 elif kwargs["BASENAME"] is None:
383 file_base = opath.splitext(opath.basename(kwargs["fastq"]))[0]
384 else:
385 file_base = kwargs["BASENAME"]
386 print_images(
387 file_base,
388 image_formats=kwargs["fileform"],
389 dpi=kwargs["dpi"],
390 no_timestamp = kwargs["no_timestamp"],
391 transparent=kwargs["TRANSPARENT"])
392
393 def run(args):
394 if args.kmerdf:
395 df = pd.read_csv(args.kmerdf, header='infer', sep='\t')
396 df["kmers"] = df["kmers"].apply(ast.literal_eval)
397 else:
398 df = parse_fastq_length_meanqual(args.fastq)
399 df = filter_fastq_length_meanqual(df, args.filt_minlen, args.filt_maxlen,
400 args.filt_minqual, args.filt_maxqual)
401 stats(df, args.fastq, False)
402 margin_plot(df=df.dropna(), **vars(args))
0 #!/usr/bin/env python
1 # -*- coding: utf-8 -*-
2
3 # pauvre - just a pore plotting package
4 # Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved.
5 #
6 # This file is part of pauvre.
7 #
8 # pauvre is free software: you can redistribute it and/or modify
9 # it under the terms of the GNU General Public License as published by
10 # the Free Software Foundation, either version 3 of the License, or
11 # (at your option) any later version.
12 #
13 # pauvre is distributed in the hope that it will be useful,
14 # but WITHOUT ANY WARRANTY; without even the implied warranty of
15 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16 # GNU General Public License for more details.
17 #
18 # You should have received a copy of the GNU General Public License
19 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
20
21 # I modeled this code on https://github.com/arq5x/poretools/. Check it out. - DTS
22
23 import sys
24 import os.path
25 import argparse
26
27 # pauvre imports
28 import pauvre.version
29 # This class is used in argparse to expand the ~. This avoids errors caused on
30 # some systems.
31
32
33 class FullPaths(argparse.Action):
34 """Expand user- and relative-paths"""
35
36 def __call__(self, parser, namespace, values, option_string=None):
37 setattr(namespace, self.dest,
38 os.path.abspath(os.path.expanduser(values)))
39
40
41 class FullPathsList(argparse.Action):
42 """Expand user- and relative-paths when a list of paths is passed to the
43 program"""
44
45 def __call__(self, parser, namespace, values, option_string=None):
46 setattr(namespace, self.dest,
47 [os.path.abspath(os.path.expanduser(value)) for value in values])
48
49 def run_subtool(parser, args):
50 if args.command == 'browser':
51 import pauvre.browser as submodule
52 elif args.command == 'custommargin':
53 import pauvre.custommargin as submodule
54 elif args.command == 'marginplot':
55 import pauvre.marginplot as submodule
56 elif args.command == 'redwood':
57 import pauvre.redwood as submodule
58 elif args.command == 'stats':
59 import pauvre.stats as submodule
60 elif args.command == 'synplot':
61 import pauvre.synplot as submodule
62 # run the chosen submodule.
63 submodule.run(args)
64
65
66 class ArgumentParserWithDefaults(argparse.ArgumentParser):
67 def __init__(self, *args, **kwargs):
68 super(ArgumentParserWithDefaults, self).__init__(*args, **kwargs)
69 self.add_argument("-q", "--quiet", help="Do not output warnings to stderr",
70 action="store_true",
71 dest="QUIET")
72
73 def main():
74
75 #########################################
76 # create the top-level parser
77 #########################################
78 parser = argparse.ArgumentParser(
79 prog='pauvre', formatter_class=argparse.ArgumentDefaultsHelpFormatter)
80 parser.add_argument("-v", "--version", help="Installed pauvre version",
81 action="version",
82 version="%(prog)s " + str(pauvre.version.__version__))
83 subparsers = parser.add_subparsers(
84 title='[sub-commands]', dest='command', parser_class=ArgumentParserWithDefaults)
85
86 #########################################
87 # create the individual tool parsers
88 #########################################
89
90 #############
91 # browser
92 #############
93 parser_browser = subparsers.add_parser('browser',
94 help="""an adaptable genome
95 browser with various track types""")
96 parser_browser.add_argument('-c', '--chromosomeid',
97 metavar = "Chr",
98 dest = 'CHR',
99 type = str,
100 help = """The fasta sequence to observe.
101 Use the header name of the fasta file
102 without the '>' character""")
103 parser_browser.add_argument('--dpi',
104 metavar='dpi',
105 default=600,
106 type=int,
107 help="""Change the dpi from the default 600
108 if you need it higher""")
109 parser_browser.add_argument('--fileform',
110 dest='fileform',
111 metavar='STRING',
112 choices=['png', 'pdf', 'eps', 'jpeg', 'jpg',
113 'pdf', 'pgf', 'ps', 'raw', 'rgba',
114 'svg', 'svgz', 'tif', 'tiff'],
115 default=['png'],
116 nargs='+',
117 help='Which output format would you like? Def.=png')
118 parser_browser.add_argument("--no_timestamp",
119 action = 'store_true',
120 help="""Turn off time stamps in the filename
121 output.""")
122 parser_browser.add_argument('-o', '--output-base-name',
123 dest='BASENAME',
124 help="""Specify a base name for the output file(
125 s). The input file base name is the
126 default.""")
127 parser_browser.add_argument('--path',
128 type=str,
129 help="""Set an explicit filepath for the output.
130 Only do this if you have selected one output type.""")
131 parser_browser.add_argument('-p', '--plot_commands',
132 dest='CMD',
133 nargs = '+',
134 help="""Write strings here to select what
135 to plot. The format for each track is:
136 <type>:<path>:<style>:<options>
137
138 To plot the reference, the format is:
139 ref:<style>:<options>
140
141 Surround each track string with double
142 quotes and a space between subsequent strings.
143 "bam:mybam.bam:depth" "ref:colorful"
144 """)
145 parser_browser.add_argument('--ratio',
146 nargs = '+',
147 type = float,
148 default=None,
149 help="""Enter the dimensions (arbitrary units)
150 to plot the figure. For example a figure that is
151 seven times wider than tall is:
152 --ratio 7 1""")
153 parser_browser.add_argument('-r', '--reference',
154 metavar='REF',
155 dest = 'REF',
156 action=FullPaths,
157 help='The reference fasta file.')
158 parser_browser.add_argument('--start',
159 metavar='START',
160 dest = 'START',
161 type = int,
162 help="""The start position to observe on the
163 fasta file. Uses 1-based indexing.""")
164 parser_browser.add_argument('--stop',
165 metavar='STOP',
166 dest = 'STOP',
167 type = int,
168 help="""The stop position to observe on the
169 fasta file. Uses 1-based indexing.""")
170 parser_browser.add_argument('-T', '--transparent',
171 action='store_false',
172 help="""Specify this option if you DON'T want a
173 transparent background. Default is on.""")
174 parser_browser.set_defaults(func=run_subtool)
175
176 ################
177 # custommargin
178 ################
179 parser_custmar = subparsers.add_parser('custommargin',
180 help='plot custom marginal histograms of tab-delimited files')
181 parser_custmar.add_argument('--dpi',
182 metavar='dpi',
183 default=600,
184 type=int,
185 help="""Change the dpi from the default 600
186 if you need it higher""")
187 parser_custmar.add_argument('--fileform',
188 dest='fileform',
189 metavar='STRING',
190 choices=['png', 'pdf', 'eps', 'jpeg', 'jpg',
191 'pdf', 'pgf', 'ps', 'raw', 'rgba',
192 'svg', 'svgz', 'tif', 'tiff'],
193 default=['png'],
194 nargs='+',
195 help='Which output format would you like? Def.=png')
196 parser_custmar.add_argument('-i', '--input_file',
197 action=FullPaths,
198 help="""A tab-separated file with a header row
199 of column names.""")
200 parser_custmar.add_argument('--xcol',
201 type=str,
202 help="""The column name of the data to plot on
203 the x-axis""")
204 parser_custmar.add_argument('--ycol',
205 type=str,
206 help="""The column name of the data to plot on
207 the y-axis""")
208 parser_custmar.add_argument('-n', '--no_transparent',
209 action='store_false',
210 help="""Specify this option if
211 you don't want a transparent background. Default
212 is on.""")
213 parser_custmar.add_argument("--no_timestamp",
214 action = 'store_true',
215 help="""Turn off time stamps in the filename
216 output.""")
217 parser_custmar.add_argument('-o', '--output_base_name',
218 default = None,
219 help='Specify a base name for the output file('
220 's). The input file base name is the '
221 'default.')
222 parser_custmar.add_argument('--plot_max_y',
223 type=int,
224 help="""Sets the maximum viewing area in the
225 length dimension.""")
226 parser_custmar.add_argument('--plot_max_x',
227 type=float,
228 help="""Sets the maximum viewing area in the
229 quality dimension.""")
230 parser_custmar.add_argument('--plot_min_y',
231 type=int,
232 default=0,
233 help="""Sets the minimum viewing area in the
234 length dimension. Default value = 0""")
235 parser_custmar.add_argument('--plot_min_x',
236 type=float,
237 default=0,
238 help="""Sets the minimum viewing area in the
239 quality dimension. Default value = 0""")
240 parser_custmar.add_argument('--square',
241 action='store_true',
242 help="""changes the hexmap into a square
243 map quantized by ints.""")
244 parser_custmar.add_argument('-t', '--title',
245 type = str,
246 help="""This sets the title for the whole plot.
247 Use --title "Crustacean's DNA read quality"
248 if you need single quote or apostrophe
249 inside title.""")
250 parser_custmar.add_argument('--ybin',
251 type=int,
252 help="""This sets the bin size to use for length.""")
253 parser_custmar.add_argument('--xbin',
254 type=float,
255 help="""This sets the bin size to use for quality""")
256 parser_custmar.add_argument('--add-yaxes',
257 dest='Y_AXES',
258 action='store_true',
259 help='Add Y-axes to both marginal histograms.')
260 parser_custmar.set_defaults(func=run_subtool)
261
262
263 #############
264 # marginplot
265 #############
266 parser_mnplot = subparsers.add_parser('marginplot',
267 help='plot a marginal histogram of a fastq file')
268 parser_mnplot.add_argument('--dpi',
269 metavar='dpi',
270 default=600,
271 type=int,
272 help="""Change the dpi from the default 600
273 if you need it higher""")
274 parser_mnplot.add_argument('-f', '--fastq',
275 metavar='FASTQ',
276 action=FullPaths,
277 help='The input FASTQ file.')
278 parser_mnplot.add_argument('--fileform',
279 dest='fileform',
280 metavar='STRING',
281 choices=['png', 'pdf', 'eps', 'jpeg', 'jpg',
282 'pdf', 'pgf', 'ps', 'raw', 'rgba',
283 'svg', 'svgz', 'tif', 'tiff'],
284 default=['png'],
285 nargs='+',
286 help='Which output format would you like? Def.=png')
287 parser_mnplot.add_argument('--filt_maxlen',
288 type=int,
289 help="""This sets the max read length filter reads.""")
290 parser_mnplot.add_argument('--filt_maxqual',
291 type=float,
292 help="""This sets the max mean read quality
293 to filter reads.""")
294 parser_mnplot.add_argument('--filt_minlen',
295 type=int,
296 default=0,
297 help="""This sets the min read length to
298 filter reads.""")
299 parser_mnplot.add_argument('--filt_minqual',
300 type=float,
301 default=0,
302 help="""This sets the min mean read quality
303 to filter reads.""")
304 parser_mnplot.add_argument('--kmerdf',
305 type = str,
306 default = None,
307 help = """Pass the filename of a data matrix if
308 you would like to plot read length
309 versus number of kmers in that read. The data matrix
310 is a tab-separated text file with columns
311 "id length numks and kmers", where:
312 <id> = read id
313 <length> = the length of the read
314 <numks> = the number of canonical kmers in the read
315 <kmers> = a list representation of kmers ie ['GAT', 'GTA']""")
316 parser_mnplot.add_argument('-n', '--no_transparent',
317 dest='TRANSPARENT',
318 action='store_false',
319 help="""Specify this option if
320 you don't want a transparent background. Default
321 is on.""")
322 parser_mnplot.add_argument("--no_timestamp",
323 action = 'store_true',
324 help="""Turn off time stamps in the filename
325 output.""")
326 parser_mnplot.add_argument('-o', '--output_base_name',
327 dest='BASENAME',
328 help='Specify a base name for the output file('
329 's). The input file base name is the '
330 'default.')
331 parser_mnplot.add_argument('--plot_maxlen',
332 type=int,
333 help="""Sets the maximum viewing area in the
334 length dimension.""")
335 parser_mnplot.add_argument('--plot_maxqual',
336 type=float,
337 help="""Sets the maximum viewing area in the
338 quality dimension.""")
339 parser_mnplot.add_argument('--plot_minlen',
340 type=int,
341 default=0,
342 help="""Sets the minimum viewing area in the
343 length dimension.""")
344 parser_mnplot.add_argument('--plot_minqual',
345 type=float,
346 default=0,
347 help="""Sets the minimum viewing area in the
348 quality dimension.""")
349 parser_mnplot.add_argument('--lengthbin',
350 metavar='LENGTHBIN',
351 type=int,
352 help="""This sets the bin size to use for length.""")
353 parser_mnplot.add_argument('--qualbin',
354 metavar='QUALBIN',
355 type=float,
356 help="""This sets the bin size to use for quality""")
357 parser_mnplot.add_argument('-t', '--title',
358 metavar='TITLE',
359 default='Read length vs mean quality',
360 help="""This sets the title for the whole plot.
361 Use --title "Crustacean's DNA read quality"
362 if you need single quote or apostrophe
363 inside title.""")
364 parser_mnplot.add_argument('-y', '--add-yaxes',
365 dest='Y_AXES',
366 action='store_true',
367 help='Add Y-axes to both marginal histograms.')
368 parser_mnplot.set_defaults(func=run_subtool)
369
370 #############
371 # redwood
372 #############
373 parser_redwood = subparsers.add_parser('redwood',
374 help='make a redwood plot from a bam file')
375 parser_redwood.add_argument('-d', '--doubled',
376 dest='doubled',
377 choices=['main', 'rnaseq'],
378 default=[],
379 nargs='+',
380 help="""If your fasta file was doubled to map
381 circularly, use this flag or you may encounter
382 plotting errors. Accepts multiple arguments.
383 'main' is for the sam file passed with --sam,
384 'rnaseq' is for the sam file passed with --rnaseq""")
385 parser_redwood.add_argument('--dpi',
386 metavar='dpi',
387 default=600,
388 type=int,
389 help="""Change the dpi from the default 600
390 if you need it higher""")
391 parser_redwood.add_argument('--fileform',
392 dest='fileform',
393 metavar='STRING',
394 choices=['png', 'pdf', 'eps', 'jpeg', 'jpg',
395 'pdf', 'pgf', 'ps', 'raw', 'rgba',
396 'svg', 'svgz', 'tif', 'tiff'],
397 default=['png'],
398 nargs='+',
399 help='Which output format would you like? Def.=png')
400 parser_redwood.add_argument('--gff',
401 metavar='gff',
402 action=FullPaths,
403 help="""The input filepath for the gff annotation
404 to plot""")
405 parser_redwood.add_argument('-I', '--interlace',
406 action='store_true',
407 default=False,
408 help="""Interlace the reads so the pileup plot
409 looks better. Doesn't work currently""")
410 parser_redwood.add_argument('-i', '--invert',
411 action='store_true',
412 default=False,
413 help="""invert the image so that it looks better
414 on a dark background. DOESN'T DO ANYTHING.""")
415 parser_redwood.add_argument('-L', '--log',
416 action='store_true',
417 default=False,
418 help="""Plot the RNAseq track with a log scale""")
419 parser_redwood.add_argument('-M', '--main_bam',
420 metavar='mainbam',
421 action=FullPaths,
422 help="""The input filepath for the bam file to
423 plot. Ideally was plotted with a fasta file that
424 is two copies of the mitochondrial genome
425 concatenated. This should be long reads (ONT, PB)
426 and will be displayed in the interior of the
427 redwood plot.""")
428 parser_redwood.add_argument("--no_timestamp",
429 action = 'store_true',
430 help="""Turn off time stamps in the filename
431 output.""")
432 parser_redwood.add_argument('-o', '--output-base-name',
433 dest='BASENAME',
434 help='Specify a base name for the output file('
435 's). The input file base name is the '
436 'default.')
437 parser_redwood.add_argument('--query',
438 dest='query',
439 default=['ALNLEN >= 10000', 'MAPLEN < reflength'],
440 nargs='+',
441 help='Query your reads to change plotting options')
442 parser_redwood.add_argument('-R', '--rnaseq_bam',
443 metavar='rnabam',
444 action=FullPaths,
445 help='The input filepath for the rnaseq bam file to plot')
446 parser_redwood.add_argument('--small_start',
447 dest='small_start',
448 choices=['inside', 'outside'],
449 default='inside',
450 help="""This determines where the shortest of the
451 filtered reads will appear on the redwood plot:
452 on the outside or on the inside? The default
453 option puts the longest reads on the outside and
454 the shortest reads on the inside.""")
455 parser_redwood.add_argument('--sort',
456 dest='sort',
457 choices=['ALNLEN', 'TRULEN', 'MAPLEN', 'POS'],
458 default='ALNLEN',
459 help="""What value to use to sort the order in
460 which the reads are plotted?""")
461 parser_redwood.add_argument('--ticks',
462 type = int,
463 nargs = '+',
464 default = [0, 10, 100, 1000],
465 help="""Specify control for the number of ticks.""")
466 parser_redwood.add_argument('-T', '--transparent',
467 action='store_false',
468 help="""Specify this option if you DON'T want a
469 transparent background. Default is on.""")
470 parser_redwood.set_defaults(func=run_subtool)
471
472 #############
473 # stats
474 #############
475 parser_stats = subparsers.add_parser('stats',
476 help='outputs stats from a fastq file')
477 parser_stats.add_argument('-f', '--fastq',
478 metavar='FASTQ',
479 action=FullPaths,
480 help='The input FASTQ file.')
481 parser_stats.add_argument('-H', '--histogram',
482 action='store_true',
483 help="""Make a histogram of the read lengths and
484 saves it to a new file""")
485 parser_stats.add_argument('--filt_maxlen',
486 type=int,
487 help="""This sets the max read length filter reads.""")
488 parser_stats.add_argument('--filt_maxqual',
489 type=float,
490 help="""This sets the max mean read quality
491 to filter reads.""")
492 parser_stats.add_argument('--filt_minlen',
493 type=int,
494 default=0,
495 help="""This sets the min read length to
496 filter reads.""")
497 parser_stats.add_argument('--filt_minqual',
498 type=float,
499 default=0,
500 help="""This sets the min mean read quality
501 to filter reads.""")
502
503 parser_stats.set_defaults(func=run_subtool)
504
505 #############
506 # synplot
507 #############
508 parser_synplot = subparsers.add_parser('synplot',
509 help="""make a synteny plot from a gff
510 file, protein alignment, and partition
511 file""")
512 parser_synplot.add_argument('--aln_dir',
513 metavar='aln_dir',
514 action=FullPaths,
515 help="""The directory where all the fasta
516 alignments are contained.""")
517 parser_synplot.add_argument('--center_on',
518 type=str,
519 default=None,
520 help="""Centers the plot around the gene that
521 you pass as an argument. For example, if there
522 is a locus called 'COI' in the gff file and in
523 the alignments directory, center using:
524 --center_on COI""")
525 parser_synplot.add_argument('--dpi',
526 metavar='dpi',
527 default=600,
528 type=int,
529 help="""Change the dpi from the default 600
530 if you need it higher""")
531 parser_synplot.add_argument('--fileform',
532 dest='fileform',
533 choices=['png', 'pdf', 'eps', 'jpeg', 'jpg',
534 'pdf', 'pgf', 'ps', 'raw', 'rgba',
535 'svg', 'svgz', 'tif', 'tiff'],
536 default=['png'],
537 nargs='+',
538 help='Which output format would you like? Def.=png')
539 parser_synplot.add_argument('--gff_paths',
540 metavar='gff_paths',
541 action=FullPathsList,
542 nargs='+',
543 help="""The input filepath. for the gff annotation
544 to plot. Individual filepaths separated by spaces.
545 For example, --gff_paths sp1.gff sp2.gff sp3.gff""")
546 parser_synplot.add_argument('--gff_labels',
547 metavar='gff_labels',
548 type=str,
549 nargs='+',
550 help="""In case the gff names and sequence names
551 don't match, change the labels that will appear
552 over the text.""")
553 parser_synplot.add_argument("--no_timestamp",
554 action = 'store_true',
555 help="""Turn off time stamps in the filename
556 output.""")
557 parser_synplot.add_argument('--optimum_order',
558 action='store_true',
559 help="""If selected, this doesn't plot the
560 optimum arrangement of things as they are input
561 into gff_paths. Instead, it uses the first gff
562 file as the top-most sequence in the plot, and
563 reorganizes the remaining gff files to minimize
564 the number of intersections.""")
565 parser_synplot.add_argument('-o', '--output-basename',
566 dest='BASENAME',
567 help='Specify a base name for the output file('
568 's). The input file base name is the '
569 'default.')
570 parser_synplot.add_argument('--ratio',
571 nargs = '+',
572 type = float,
573 default=None,
574 help="""Enter the dimensions (arbitrary units)
575 to plot the figure. For example a figure that is
576 seven times wider than tall is:
577 --ratio 7 1""")
578 parser_synplot.add_argument('--sandwich',
579 action='store_true',
580 default=False,
581 help="""Put an additional copy of the first gff
582 file on the bottom of the plot for comparison.""")
583 parser_synplot.add_argument('--start_with_aligned_genes',
584 action='store_true',
585 default=False,
586 help="""Minimizes the number of intersections
587 but only selects combos where the first gene in
588 each sequence is aligned.""")
589 parser_synplot.add_argument('--stop_codons',
590 action='store_true',
591 default=True,
592 help="""Performs some internal corrections if
593 the gff annotation includes the stop
594 codons in the coding sequences.""")
595 parser_synplot.add_argument('-T', '--transparent',
596 action='store_false',
597 help="""Specify this option if you DON'T want a
598 transparent background. Default is on.""")
599 parser_synplot.set_defaults(func=run_subtool)
600
601 #######################################################
602 # parse the args and call the selected function
603 #######################################################
604
605 args = parser.parse_args()
606
607 # If there were no args, print the help function
608 if len(sys.argv) == 1:
609 parser.print_help()
610 sys.exit(1)
611
612 # If there were no args, but someone selected a program,
613 # print the program's help.
614 commandDict = {'browser': parser_browser.print_help,
615 'custommargin': parser_custmar.print_help,
616 'marginplot': parser_mnplot.print_help,
617 'redwood': parser_redwood.print_help,
618 'stats': parser_stats.print_help,
619 'synplot': parser_synplot.print_help}
620
621 if len(sys.argv) == 2:
622 commandDict[args.command]()
623 sys.exit(1)
624
625 if args.QUIET:
626 logger.setLevel(logging.ERROR)
627
628 try:
629 args.func(parser, args)
630 except IOError as e:
631 if e.errno != 32: # ignore SIGPIPE
632 raise
633
634 if __name__ == "__main__":
635 main()
0 # -*- coding: utf-8 -*-
1
2 from matplotlib import rcParams
3
4 def update_rcParams():
5 # This mpl style is from the UCSC BME163 class.
6 rcParams.update({
7 'font.size' : 8.0 ,
8 'font.sans-serif' : 'Arial' ,
9 'xtick.major.size' : 2 ,
10 'xtick.major.width' : 0.75 ,
11 'xtick.labelsize' : 8.0 ,
12 'xtick.direction' : 'out' ,
13 'ytick.major.size' : 2 ,
14 'ytick.major.width' : 0.75 ,
15 'ytick.labelsize' : 8.0 ,
16 'ytick.direction' : 'out' ,
17 'xtick.major.pad' : 2 ,
18 'xtick.minor.pad' : 2 ,
19 'ytick.major.pad' : 2 ,
20 'ytick.minor.pad' : 2 ,
21 'savefig.dpi' : 601 ,
22 'axes.linewidth' : 0.75 ,
23 'text.usetex' : False ,
24 'text.latex.unicode' : False })
0 #!/usr/bin/env python
1 # -*- coding: utf-8 -*-
2
3 # pauvre - just a pore plotting package
4 # Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
5 # twitter @conchoecia
6 #
7 # This file is part of pauvre.
8 #
9 # pauvre is free software: you can redistribute it and/or modify
10 # it under the terms of the GNU General Public License as published by
11 # the Free Software Foundation, either version 3 of the License, or
12 # (at your option) any later version.
13 #
14 # pauvre is distributed in the hope that it will be useful,
15 # but WITHOUT ANY WARRANTY; without even the implied warranty of
16 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
17 # GNU General Public License for more details.
18 #
19 # You should have received a copy of the GNU General Public License
20 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
21
22 # SAM/BAM todo
23 # i/ for loop took 269 seconds ~ 4.48 minutes
24
25 #things to do
26
27 # This doesn't make much sense. it is perfectly possible that it could
28 # double and logbe both (in reference to how histograms are plotted
29
30 # - make each layer operate independently
31 # - for each of these, make the program figure out the total length either from the GFF or from the bam file
32 # - make the error "Your query was too stringent and no reads resulted..." not give a traceback.
33 # - raise the proper error if the sam file has no header.
34 # - figure out how to update rcParams every time we run a program
35 # - Write a better docstring for how plotArc works.
36 # - Write a docstring for seqOrder method. I don't remember what it does
37 # - write a better function to get the alignment length of the sam/bam file
38 # right now it opens the file twice and only gets the length the first time
39 # - drop columns by name, not by column number (samFile.drop...)
40 # - Here's another that needs to be fixed: samFile.drop(samFile.columns[3], axis=1, inplace=True)
41 # - args
42 # - get the filename from args
43 # - set up a double-mapped mode to wrap reads around the 0th coordinate for
44 # circular assemblies
45 # - Make the r-dist something that the user can change.
46 # Getting Everything on the Same Track
47 # - Make a list of features to plot `plot_order = []` or something like that
48 # - First, go through the GFF features and come up with all of the things that
49 # AREN'T tRNAs that overlap. Store each set of overlaps as its own set of
50 # features. For things with no overlap, add those to the `plot_order` alone
51 # - Iterate through all elements of `plot_order`, if all elements are forward
52 # (start < stop), then draw the element at the end first with no modification,
53 # then for every subsequent element draw a white border around the arrow.
54 # The element same element should have the start of its arrow drawn to 1
55 # degree before the start of the next one. This will make a chevron pattern
56 # That will show both the start and stop of both reads accurately.
57 # - If both elements are reverse, then do the same thing, but in reverse
58 # - If one element is in reverse, but another is forward, then make all the
59 # elements in the set half-width since there are too many possible
60 # combinations to code reliably.
61
62 import matplotlib
63 matplotlib.use('Agg')
64 import matplotlib.pyplot as plt
65 import matplotlib.patches as patches
66 from matplotlib import rcParams
67 import platform
68 import itertools
69 import pandas as pd
70 import numpy as np
71 import scipy as sp
72 import time
73 import os, sys
74 import warnings
75
76 # import the pauvre rcParams
77 import pauvre.rcparams as rc
78 from pauvre.functions import GFFParse, print_images
79 from pauvre.bamparse import BAMParse
80
81 # following this tutorial to install helvetica
82 # https://github.com/olgabot/sciencemeetproductivity.tumblr.com/blob/master/posts/2012/11/how-to-set-helvetica-as-the-default-sans-serif-font-in.md
83 global hfont
84 hfont = {'fontname':'Helvetica'}
85
86 def plotArc(start_angle, stop_angle, radius, width, **kwargs):
87 """ write a docstring for this function"""
88 numsegments = 100
89 theta = np.radians(np.linspace(start_angle+90, stop_angle+90, numsegments))
90 centerx = 0
91 centery = 0
92 x1 = -np.cos(theta) * (radius)
93 y1 = np.sin(theta) * (radius)
94 stack1 = np.column_stack([x1, y1])
95 x2 = -np.cos(theta) * (radius + width)
96 y2 = np.sin(theta) * (radius + width)
97 stack2 = np.column_stack([np.flip(x2, axis=0), np.flip(y2,axis=0)])
98 #add the first values from the first set to close the polygon
99 np.append(stack2, [[x1[0],y1[0]]], axis=0)
100 arcArray = np.concatenate((stack1,stack2), axis=0)
101 return patches.Polygon(arcArray, True, **kwargs), ((x1, y1), (x2, y2))
102
103 def fix_query_reflength(sequence_length, queries, doubled):
104 """
105 arguments:
106 <sequence_length> This is the reference fasta length. It should be 2x the actual
107 length of the reference since this program takes a sam file from
108 a concatenated reference.
109 <queries> This is a list of SQL-type query strings. This is generated
110 from argparse.
111
112 purpose:
113 This function takes in a list of queries to use for read filtering
114 for the redwood plot. It is often not advisable to plot all mapped reads
115 since many of them are too small relative to the reference length. Also,
116 the point of a death star plot is to show continuity of a circular
117 reference, so short reads aren't very helpful there either.
118
119 Currently, this function only recognizes the keyword argument 'reflength'.
120 """
121 if not doubled:
122 sequence_length = int(sequence_length * 2)
123 for i in range(len(queries)):
124 if 'reflength' in queries[i].split():
125 queries[i] = queries[i].replace('reflength', str(int(sequence_length/2)))
126
127 def cust_log(base, val):
128 try:
129 #val = np.log(val)/np.log(base)
130 with warnings.catch_warnings():
131 warnings.simplefilter("ignore")
132 val = np.log(val)
133 except:
134 val = 0
135 if np.isinf(val):
136 val = 0
137 return val
138
139 def plot_histo(panel, args, angleMap, width_map, start_radius, track_radius,
140 thisLog = False, double = False, ticks = False):
141 """Plots a histogram given a width map. Width map must be the true length
142 of the circular genome"""
143 myPatches = []
144 if double:
145 #plot a line in the middle if doubled to distinguish the center
146 mid_radius = start_radius+((track_radius - (track_radius/100))/2)
147 arc, arcArray = plotArc(start_angle=0, stop_angle=360,
148 radius=mid_radius,
149 width=track_radius/100, fc='yellow')
150 myPatches.append(arc)
151 #this is only 1/100 the width of the band
152
153 for i in range(len(width_map)):
154 iStartIndex = i
155 iStopIndex = i + 1
156 iStartAngle = angleMap[iStartIndex]
157 iStopAngle = angleMap[iStopIndex]
158 # This doesn't make much sense. it is perfectly possible that it could
159 # double and logbe both
160 if double:
161 logwidth = -1 * track_radius * (np.log(width_map[i]-9)/np.log(max(width_map))) * 0.495
162
163 log_start_radius = start_radius+((track_radius - (track_radius/100))/2)
164 width = track_radius * (width_map[i]/max(width_map)) * 0.495
165 normal_start_radius = start_radius+((track_radius - (track_radius/100))/2) + (track_radius/100)
166 #first plot the log inside
167 arc, arcArray = plotArc(start_angle=iStartAngle, stop_angle=iStopAngle,
168 radius=log_start_radius,
169 width=logwidth, fc='black')
170 myPatches.append(arc)
171 # now plot the normal outside
172 arc, arcArray = plotArc(start_angle=iStartAngle, stop_angle=iStopAngle,
173 radius=normal_start_radius,
174 width=width, fc='black')
175 else:
176 if thisLog:
177 base = 10
178 width = track_radius * (cust_log(base, width_map[i])/cust_log(base, max(width_map)))
179
180 else:
181 width = track_radius * (width_map[i]/max(width_map))
182 arc, arcArray = plotArc(start_angle=iStartAngle, stop_angle=iStopAngle,
183 radius=start_radius,
184 width=width, fc='black')
185 myPatches.append(arc)
186
187 maxx = start_radius
188 maxy = 0
189 kerning = 12
190 if ticks != None and len(ticks) > 0:
191 # plot the scalebar
192 maxval=max(width_map)
193 xs = []
194 ys = []
195 xend = start_radius
196 value_list = ticks
197 for value in value_list:
198 centerAngle = 45
199 this_radius = start_radius + \
200 (track_radius * (cust_log(10, value)/cust_log(10, maxval)))
201 print("this_radius", this_radius)
202 arc, arcArray = plotArc(start_angle=centerAngle-1, stop_angle=centerAngle+1,
203 radius=this_radius,
204 width=track_radius/25, fc='red')
205 xs.append(arcArray[0][0][-1])
206 ys.append(arcArray[0][1][-1])
207 myPatches.append(arc)
208
209 middle = np.mean(ys) - (kerning/2)
210 # The tick marks should start slightly below the lowest tick
211 # mark on the track itself and extend upward evenly with text.
212 new_ys = [middle-kerning + (kerning * i)
213 for i in range(len(value_list))]
214 for i in range(len(value_list)):
215 xlist = [xs[i], xend]
216 ylist = [ys[i], new_ys[i]]
217 # plots the lines between the markers and the labels
218 panel.plot(xlist, ylist, ls='--', color='red', lw=0.75)
219 # plots the text labels
220 panel.text(xend, new_ys[i],
221 value_list[i], fontsize = 12,
222 ha='left', va='center',
223 color = 'black', **hfont)
224 #plots the title
225 panel.text(start_radius, new_ys[-1]+kerning,
226 "Read Depth ", fontsize = 10,
227 ha='center', va='center',
228 color = 'black', **hfont)
229
230
231 return myPatches, start_radius + track_radius, panel
232
233 def plot_reads(args, angleMap, widthDict, samFiledf, start_radius,
234 doubled = False, collapse = False, track_width=False,
235 track_depth = False, thisLog = False):
236 """this function plots the reads in a sam file.
237 Outputs the patches, and the final_radius
238
239 Author:
240 Darrin T. Schultz (github@conchoecia)
241 """
242
243 # there should be a different width dict with log scale
244 #if args.log and thisLog:
245 # widthDict = {'M':0.8, # match
246 # 'I':0.95, # insertion relative to reference
247 # 'D':0.05, # deletion relative to reference
248 # 'N':0.1, # skipped region from the reference
249 # 'S':0.1, # soft clip, not aligned but still in sam file
250 # 'H':0.1, # hard clip, not aligned and not in sam file
251 # 'P':0.1, # padding (silent deletion from padded reference)
252 # '=':0.1, # sequence match
253 # 'X':0.1} # sequence mismatch
254
255 # clean up the df to reset the number of rows, otherwise there might be
256 # errors in the while loop below
257 samFiledf = samFiledf.reset_index()
258
259 append_radius = 0
260 myPatches = []
261 depth_map = []
262 # if we need to collapse the reads so they aren't 1 read per line, keep track
263 # of plotted depth at a position with depth_map
264 if collapse == True:
265 if doubled:
266 plotted_depth_map = [0] * int((len(angleMap)/2))
267 else:
268 plotted_depth_map = [0] * len(angleMap)
269
270 # If we define a track width, then use the max read depth
271 # of the track to figure out the read_width
272 read_width = 1
273 if track_width:
274 read_width = track_width/track_depth
275
276 # This while loops cycles through the the list until everything is plotted
277 # We start at the first index of the file and count down. If we plot it, then
278 # we remove that read from the dataframe. Once we get to the last index,
279 # we attempt to plot and afterward reset the index of the dataframe and
280 # start from the beginning again.
281 i = 0
282 plotted = False
283 #this is used to dtermine where we are during collapse
284 current_row = 1
285 if args.log and thisLog:
286 current_row = 10
287 skip = False
288 # for printing out the progress of plotting,
289 # get the original number of rows to make placeholders
290 original_rownum_charcount = len(str(len(samFiledf)))
291 direction = 'for'
292 print(samFiledf)
293 while len(samFiledf) > 0:
294 stringTuples = samFiledf.loc[i, 'TUPS']
295 mapLen = samFiledf.loc[i, 'MAPLEN']
296 #print("i: {} and maplen: {}".format(i, mapLen))
297 #Subtract one because BAM uses 1-based indexing but plotting uses 0.
298 # I think I could avoid this in the future by changing the parse
299 start_index = samFiledf.loc[i, 'POS'] - 1
300 start_angle= angleMap[start_index]
301 stop_index = 0
302 stop_angle = angleMap[stop_index]
303 if args.log and thisLog:
304 log = 100
305 read_width = track_width * ((cust_log(log, current_row + 1)/cust_log(log, track_depth + 1))\
306 - (cust_log(log, current_row)/cust_log(log, track_depth + 1)))
307 #print("log track depth: {}".format(np.log10(track_depth + 1)))
308 #print("log current row: {}".format(np.log10(current_row + 1)))
309 #now look in the plotted_depth_map to see if there is already a read
310 # that overlaps with the current read we are considering.
311 if collapse:
312 for collapse_i in range(start_index, start_index+mapLen):
313 #print("looking at {} and found {}".format(collapse_i, plotted_depth_map[collapse_i]))
314 if plotted_depth_map[collapse_i] >= current_row:
315 skip = True
316 break
317 #if we don't skip it, we're gonna plot it!
318 if not skip:
319 # Note that we are plotting something here
320 if collapse:
321 for collapse_i in range(start_index, start_index+mapLen):
322 plotted_depth_map[collapse_i] = plotted_depth_map[collapse_i] + 1
323 for tup in stringTuples:
324 if tup[1] == 'I':
325 #If there is an insertion, back up halfway and make plot the
326 # insertion to visually show a "bulge" with too much sequence.
327 # do not advance the start index to resume normal plotting
328 # after the insertion.
329 iStartIndex = start_index-int(tup[0]/2)
330 iStopIndex = iStartIndex + tup[0]
331 iStartAngle = angleMap[iStartIndex]
332 iStopAngle = angleMap[iStopIndex]
333 arc, arcArray = plotArc(start_angle=iStartAngle, stop_angle=iStopAngle,
334 radius=start_radius + append_radius,
335 width=widthDict[tup[1]] * read_width, fc='black')
336 else:
337 stop_index = start_index + tup[0]
338 stop_angle = angleMap[stop_index]
339 arc, arcArray = plotArc(start_angle=start_angle, stop_angle=stop_angle,
340 radius=start_radius + append_radius,
341 width=widthDict[tup[1]] * read_width, fc='black')
342 start_index = stop_index
343 start_angle = angleMap[start_index]
344 myPatches.append(arc)
345 # If we're not collapsing the reads, we just advance one every row
346 if not collapse:
347 append_radius += read_width
348 plotted = True
349 #if we've ploted something, remove that read and reset the indices
350 # since we will stay at the current index
351 samFiledf = samFiledf.drop(i)
352 samFiledf = samFiledf.reset_index()
353 samFiledf = samFiledf.drop('index', 1)
354 print("(countdown until done) rows: {0:0>{num}} dir: {1}\r".format(
355 len(samFiledf), direction, num= original_rownum_charcount), end="")
356 else:
357 #if we weren't able to plot the current read, we will advance the
358 # index and look for another read to plot
359 # I tried to speed up the algorithm here by making the software 'smart'
360 # about looking forward for the next read, but it was ~50% slower
361 i += 1
362 skip = False
363 # we will only reach this condition if we aren't collapsing, since all
364 # the reads will be removed before getting to this point
365 if i >= len(samFiledf):
366 i = 0
367 #Once we've gone around a complete cycle, we can jump up to start
368 # plotting the next row
369 append_radius += read_width
370 current_row += 1
371 #every time we reset, reorganize so that we're now going in the
372 # opposite direction to avoid skewing all the reads in the forward direction
373 if args.interlace:
374 if direction == 'for':
375 bav = {"by":['POS','MAPLEN'], "asc": [False, False]}
376 direction= 'rev'
377 elif direction == 'rev':
378 bav = {"by":['POS','MAPLEN'], "asc": [True, False]}
379 direction = 'for'
380 samFiledf.sort_values(by=bav["by"], ascending=bav['asc'],inplace=True)
381 samFiledf.reset_index(inplace=True)
382 samFiledf.drop('index', 1, inplace=True)
383 print("\nfinal row is {}".format(current_row))
384 return myPatches, start_radius + append_radius
385
386 def redwood(args):
387 """This function controls all the plotting features for the redwood plot.
388 1) determine the length of the circular fasta sequence length
389 """
390 # get the plotting options specific to this program.
391 rc.update_rcParams()
392 # This stops numpy from printing numbers in scientific notation.
393 np.set_printoptions(suppress=True)
394
395
396 start = time.time()
397 print(args)
398
399 #---------------------------------------------------------------
400 # 1) determine the length of the circular fasta sequence length
401 #_______________________________________________________________
402 main_doubled = True if 'main' in args.doubled else False
403 # It is fine that we are using a global since this will never be manipulated
404 global sequence_length
405 if args.main_bam:
406 samFile = BAMParse(args.main_bam, doubled = main_doubled)
407 sequence_length = samFile.seqlength
408 filename = samFile.filename
409 else:
410 sequence_length = GFFParse(args.gff).seqlen
411 if sequence_length == 0:
412 raise OSError("""You have used a SAM/BAM file with no header. Please add a header to
413 the file.""")
414
415 # this also needs to be changed depending on if it was a concatenated SAM
416 # if doubled = true, then use linspace between 0,720
417 # if doubled = false, then use linspace between 0, 360
418 # on second thought, it might not be necessary to change this value even
419 # for doubled sequences
420 global angleMap
421 if 'main' in args.doubled:
422 angleMap = np.linspace(0,720,sequence_length)
423 else:
424 angleMap = np.linspace(0,360,sequence_length)
425
426 #these are the line width for the different cigar string flags.
427 # usually, only M, I, D, S, and H appear in bwa mem output
428 widthDict = {'M':0.45, # match
429 'I':0.9, # insertion relative to reference
430 'D':0.05, # deletion relative to reference
431 'N':0.1, # skipped region from the reference
432 'S':0.1, # soft clip, not aligned but still in sam file
433 'H':0.1, # hard clip, not aligned and not in sam file
434 'P':0.1, # padding (silent deletion from padded reference)
435 '=':0.1, # sequence match
436 'X':0.1} # sequence mismatch
437
438 ##################
439 # make the redwood plot
440 ##################
441
442 figWidth = 5
443 figHeight = 5
444
445 # fig1 is an arbitrary name, only one figure currently
446 fig_1 = plt.figure(figsize=(figWidth, figHeight))
447
448 # This is the width and height of the plot in absolute
449 # values relative to the figWidth and figHeight.
450 circleDiameter = 5.0
451
452 # Center to plot
453 leftMargin = (figWidth - circleDiameter)/2
454 bottomMargin = leftMargin
455
456 # There is one panel in this figure that contains
457 # the concentric circles that represent reads. In addition,
458 # the latest version of this program adds the annotation to the exterior
459 panelCircle = plt.axes([leftMargin/figWidth, #left
460 bottomMargin/figHeight, #bottom
461 circleDiameter/figWidth, #width
462 circleDiameter/figHeight #height
463 ],frameon=False)
464 # Some of these are defaults and redundant, but are included
465 # for readability and convenience.
466 panelCircle.tick_params(axis='both',which='both',\
467 bottom='off', labelbottom='off',\
468 left='off', labelleft='off', \
469 right='off', labelright='off',\
470 top='off', labeltop='off')
471
472 # Simply the list in which patches will be stored.
473 myPatches = []
474
475 # Each read occupies a width of radius = 1. The max width occupied by any
476 # type of match to the read is 0.9 (Insertion). The line width of a match
477 # is 0.45. Look at the `widthDict` dictionary above to see the defined
478 # widths. I chose radius = 15 to start with because it looks decent in most
479 # scenarios, but maybe could use some tweaking in future iterations.
480 start_radius_dict = {1: 10,
481 20: 15,
482 30: 20,
483 35: 90}
484 if args.main_bam:
485 print("The number of rows is: {}".format(len(samFile.features)))
486 for key in sorted(start_radius_dict):
487 if len(samFile.features) >= key:
488 radius = start_radius_dict[key]
489 print("Chose radius: {}".format(radius))
490 else:
491 radius = start_radius_dict[30]
492
493 circle_fontsize = 14
494 panelCircle.text(0, 0, "Position\n(bp)", fontsize = circle_fontsize,
495 ha='center', va='center',
496 color = 'black', **hfont)
497
498 # now plot ticks around the interior, as well as the text
499 if 'main' in args.doubled:
500 real_length = int(sequence_length/2)
501 else:
502 real_length = sequence_length
503 this_len = int(real_length / 1000) * 1000
504 for i in range(0, this_len + 1000, 1000):
505 startAngle = angleMap[i] - 0.75
506 stopAngle = angleMap[i] + 0.75
507 this_radius = radius * 0.93
508 this_width = radius * 0.04
509 arc, arcArray = plotArc(start_angle=startAngle, stop_angle=stopAngle,
510 radius=this_radius,
511 width=this_width, fc='black')
512 myPatches.append(arc)
513 # now plot text if (val/1000) % 2 == 0
514 if (i/1000) % 2 == 0:
515 #the 0.98 gives some float
516 x_pos = -np.cos(np.radians(angleMap[i]+90)) * this_radius
517 y_pos = np.sin(np.radians(angleMap[i]+90)) * this_radius
518 rotation = 0
519 if i == 0:
520 y_pos = np.sin(np.radians(angleMap[i]+90)) * this_radius * 0.98
521 rotation = 0
522 ha = 'center'
523 va = 'top'
524 if (angleMap[i] > 0 and angleMap[i] < 45):
525 x_pos = -np.cos(np.radians(angleMap[i]+90 + 1.5)) * this_radius
526 y_pos = np.sin(np.radians(angleMap[i]+90 + 1.5)) * this_radius
527 rotation = (90 - angleMap[i]) * 0.95
528 ha = 'right'
529 va = 'top'
530 elif (angleMap[i] >= 45 and angleMap[i] < 67.5):
531 rotation = 90 - angleMap[i]
532 ha = 'right'
533 va = 'top'
534 elif (angleMap[i] >= 67.5 and angleMap[i] < 90):
535 # subtracted an addtl degree because it looked bad otherwise
536 x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.75)) * this_radius
537 y_pos = np.sin(np.radians(angleMap[i]+90 - 1.75)) * this_radius
538 rotation = 90 - angleMap[i]
539 ha = 'right'
540 va = 'top'
541 elif (angleMap[i] >= 90 and angleMap[i] < 112.5):
542 # added an addtl degree because it looked bad otherwise
543 x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.75)) * this_radius
544 y_pos = np.sin(np.radians(angleMap[i]+90 - 1.75)) * this_radius
545 rotation = 90 - angleMap[i]
546 ha = 'right'
547 #bottom is not good
548 va = 'center'
549 elif (angleMap[i] >= 112.5 and angleMap[i] < 135):
550 # added an addtl degree because it looked bad otherwise
551 x_pos = -np.cos(np.radians(angleMap[i]+90 + 1.25)) * this_radius
552 y_pos = np.sin(np.radians(angleMap[i]+90 + 1.25)) * this_radius
553 rotation = 90 - angleMap[i]
554 ha = 'right'
555 #bottom not good
556 #center really not good
557 va = 'bottom'
558 elif (angleMap[i] >= 135 and angleMap[i] < 157.5):
559 rotation = 90 - angleMap[i]
560 ha = 'right'
561 va = 'bottom'
562 elif (angleMap[i] >= 157.5 and angleMap[i] < 180):
563 x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.0)) * this_radius
564 y_pos = np.sin(np.radians(angleMap[i]+90 - 1.0)) * this_radius
565 rotation = 90 - angleMap[i]
566 ha = 'right'
567 va = 'bottom'
568 elif (angleMap[i] >= 180 and angleMap[i] < 202.5):
569 # added an addtl degree because it looked bad otherwise
570 x_pos = -np.cos(np.radians(angleMap[i]+90 + 2.0)) * this_radius
571 y_pos = np.sin(np.radians(angleMap[i]+90 + 2.0)) * this_radius
572 rotation = 270 - angleMap[i]
573 ha = 'left'
574 va = 'bottom'
575 elif (angleMap[i] >= 202.5 and angleMap[i] < 225):
576 x_pos = -np.cos(np.radians(angleMap[i]+90 + 2.25)) * this_radius
577 y_pos = np.sin(np.radians(angleMap[i]+90 + 2.25)) * this_radius
578 rotation = 270 - angleMap[i]
579 ha = 'left'
580 va = 'bottom'
581 elif (angleMap[i] >= 225 and angleMap[i] < 247.5):
582 x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.0)) * this_radius
583 y_pos = np.sin(np.radians(angleMap[i]+90 - 1.0)) * this_radius
584 rotation = 270 - angleMap[i]
585 ha = 'left'
586 va = 'bottom'
587 elif (angleMap[i] >= 247.5 and angleMap[i] < 270):
588 x_pos = -np.cos(np.radians(angleMap[i]+90 - 3.0)) * this_radius
589 y_pos = np.sin(np.radians(angleMap[i]+90 - 3.0)) * this_radius
590 rotation = 270 - angleMap[i]
591 ha = 'left'
592 va = 'bottom'
593 elif (angleMap[i] >= 270 and angleMap[i] < 292.5):
594 rotation = 270 - angleMap[i]
595 ha = 'left'
596 va = 'bottom'
597 elif (angleMap[i] >= 292.5 and angleMap[i] < 315):
598 x_pos = -np.cos(np.radians(angleMap[i]+90 + 0.25)) * this_radius
599 y_pos = np.sin(np.radians(angleMap[i]+90 + 0.25)) * this_radius
600 rotation = 270 - angleMap[i]
601 ha = 'left'
602 va = 'top'
603 elif (angleMap[i] >= 315 and angleMap[i] < 337.5):
604 x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.6)) * this_radius
605 y_pos = np.sin(np.radians(angleMap[i]+90 - 1.6)) * this_radius
606 rotation = 270 - angleMap[i]
607 ha = 'left'
608 va = 'top'
609 elif (angleMap[i] >= 337.5 and angleMap[i] < 360):
610 x_pos = -np.cos(np.radians(angleMap[i]+90 - 5.0)) * this_radius
611 y_pos = np.sin(np.radians(angleMap[i]+90 - 5.0)) * this_radius
612 rotation = 270 - angleMap[i]
613 ha = 'left'
614 va = 'top'
615 print(" angleMap: {} value: {} rotation: {}".format(angleMap[i], i, rotation))
616 text = "{}".format(i)
617 if i == 0:
618 text = "1/\n{}".format(real_length)
619 panelCircle.text(x_pos, y_pos, text, fontsize = circle_fontsize,
620 ha=ha, va=va, color = 'black', rotation=rotation, **hfont)
621 # Now add a legend
622
623 # If the user wants to plot long reads, plot them
624 if args.main_bam:
625 #turn the query string into something usable,
626 # get rid of variables from argparse
627 # Make sure we haven't told the program to not query
628 if 'False' not in args.query:
629 doubled = 'main' in args.doubled
630 fix_query_reflength(sequence_length, args.query, doubled)
631 # string all the queries together
632 queryString = " and ".join(args.query)
633 print("You are using this query string to filter reads:\n'{}'".format(queryString))
634 samFile = samFile.features.query(queryString)
635 if len(samFile) == 0:
636 raise IOError("""Your query was too stringent and no reads resulted. Please try
637 again with a less stringent test. Redwood plotter
638 exiting""")
639
640 # now determine how to sort the reads in the order they will be plotted
641 if args.small_start == 'inside':
642 ascend = True
643 elif args.small_start == 'outside':
644 ascend = False
645 samFile = samFile.sort_values(by=args.sort, ascending=ascend)
646 samFile = samFile.reset_index()
647 # It is necessary to drop the column called index, since reset_index()
648 # adds this.
649 samFile = samFile.drop('index', 1)
650
651 #this plots the central rings from the sam file
652 read_patches, radius = plot_reads(args, angleMap, widthDict, samFile,
653 radius, doubled = True, collapse = False)
654 myPatches = myPatches + read_patches
655
656 # if the user would like to plot the annotation, plot it now. In the future,
657 # allow the user to select the order in which the individual elements are
658 # plotted. Since the annotation should have a fixed proportional size to
659 # the circle independent of the number of plotted reads, define a new
660 # radius context.
661 if args.gff:
662 print("the radius at the end of annotation is: {}".format(radius))
663 panelCircle, gff_patches, gff_radius = plot_gff(args, panelCircle, args.gff, radius)
664 myPatches = myPatches + gff_patches
665 radius = gff_radius
666
667 # it is helpful to be able to plot the RNAseq data along with the annotation.
668 # plot that directly outside the annotation
669 if args.rnaseq_bam:
670 print("in RNAseq")
671 rna_doubled = True if 'rnaseq' in args.doubled else False
672 bamobject = BAMParse(args.rnaseq_bam, doubled = rna_doubled)
673 samFile = bamobject.features
674 track_width = radius * 0.15
675 #read_patches, radius = plot_reads(args, angleMap,
676 # widthDict, samFile, radius,
677 # doubled = rna_doubled,
678 # collapse = True, track_width = track_width,
679 # track_depth = max(bamobject.raw_depthmap),
680 # thisLog = True)
681 radius_orig=radius
682 read_patches, radius, panelCircle = plot_histo(panelCircle, args, angleMap,
683 bamobject.get_depthmap(), radius,
684 track_width,
685 thisLog = True,
686 ticks = args.ticks)
687 myPatches = myPatches + read_patches
688 myPatches.append(arc)
689
690 # The numseqs value is used to determine the viewing dimensions
691 # of the circle we will plot. It is scaled with the number of sequences
692 # that will be plotted. This value should probably be set last to
693 # accommodate other tracks, like annotation.
694 min_radius = int(-5 - np.ceil(radius))
695 max_radius = int(5 + np.ceil(radius))
696 panelCircle.set_xlim([min_radius, max_radius])
697 panelCircle.set_ylim([min_radius, max_radius])
698
699 for patch in myPatches:
700 panelCircle.add_patch(patch)
701
702 end = time.time()
703 print(end - start)
704 # Print image(s)
705 ifargs.BASENAME is None:
706 file_base = "redwood"
707 else:
708 file_base = args.BASENAME
709 print_images(
710 base_output_name=file_base,
711 image_formats=args.fileform,
712 no_timestamp = args.no_timestamp,
713 dpi=args.dpi,
714 transparent=args.transparent)
715
716 def feature_set_direction(feature_set_df):
717 """This function determines if the features in the dataframe passed here are
718 all forward, all reverse, or mixed"""
719 all_pos = all(feature_set_df['strand'] == '+')
720 all_neg = all(feature_set_df['strand'] == '-')
721 if all_pos:
722 return '+'
723 elif all_neg:
724 return '-'
725 else:
726 return 'mixed'
727
728 def plot_feature(this_feature, colorMap, start_radius,
729 bar_thickness, direction, this_feature_overlaps_feature):
730 """This plots the track for a feature, and if there is something for
731 'this_feature_overlaps_feature', then there is special processing to
732 add the white bar and the extra slope for the chevron
733 """
734 myPatches = []
735 if this_feature_overlaps_feature.empty:
736 iStartAngle = angleMap[this_feature['start']]
737 iStopAngle = angleMap[this_feature['stop']] - 2
738 arc, arcArray = plotArc(start_angle=iStartAngle,
739 stop_angle=iStopAngle,
740 radius = start_radius, width=bar_thickness,
741 fc=colorMap[this_feature['featType']])
742 myPatches.append(arc)
743 #this bit plots the arrow triangles for the genes.
744 # Right now it makes each arrow only 1 degree in width and uses 100 segments to plot it.
745 # This resolution hasn't given me any artifacts thus far
746 tStartAngle = angleMap[this_feature['stop']]-2
747 tStopAngle = angleMap[this_feature['stop']]
748 angles = np.linspace(tStartAngle,tStopAngle,100)
749 widths = np.linspace(bar_thickness,0,100)
750 for j in range(len(angles)-1):
751 arc, arcArray = plotArc(start_angle=tStartAngle,
752 stop_angle=angles[j+1],
753 radius=start_radius+(bar_thickness-widths[j+1])/2,
754 width=widths[j+1],
755 fc=colorMap[this_feature['featType']])
756 myPatches.append(arc)
757 else:
758 # First, make a solid white bar
759 iStartAngle = angleMap[this_feature['start']]
760 iStopAngle = angleMap[this_feature_overlaps_feature['start']]
761 arc, arcArray = plotArc(start_angle=iStartAngle,
762 stop_angle=iStopAngle,
763 radius = start_radius, width=bar_thickness,
764 fc=colorMap['spacebar'])
765 myPatches.append(arc)
766 # Now, make the actual color bar
767 iStartAngle = angleMap[this_feature['start']]
768 iStopAngle = angleMap[this_feature_overlaps_feature['start']] - 1
769 arc, arcArray = plotArc(start_angle=iStartAngle,
770 stop_angle=iStopAngle,
771 radius = start_radius, width=bar_thickness,
772 fc=colorMap[this_feature['featType']])
773 myPatches.append(arc)
774 #first plot a little pink bar for the outline
775 tStartAngle = angleMap[this_feature_overlaps_feature['start']]
776 tStopAngle = angleMap[this_feature['stop']]+1
777 angles = np.linspace(tStartAngle,tStopAngle,100)
778 widths = np.linspace(bar_thickness,0,100)
779 for j in range(len(angles)-1):
780 arc, arcArray = plotArc(start_angle=tStartAngle,
781 stop_angle=angles[j+1],
782 radius=start_radius+(bar_thickness-widths[j+1])/2,
783 width=widths[j+1],
784 fc=colorMap['spacebar'])
785 myPatches.append(arc)
786 #this bit plots the arrow triangles for the genes.
787 # Right now it makes each arrow only 1 degree in width and uses 100 segments to plot it.
788 # This resolution hasn't given me any artifacts thus far
789 tStartAngle = angleMap[this_feature_overlaps_feature['start']]-1
790 tStopAngle = angleMap[this_feature['stop']]
791 angles = np.linspace(tStartAngle,tStopAngle,100)
792 widths = np.linspace(bar_thickness,0,100)
793 for j in range(len(angles)-1):
794 arc, arcArray = plotArc(start_angle=tStartAngle,
795 stop_angle=angles[j+1],
796 radius=start_radius+(bar_thickness-widths[j+1])/2,
797 width=widths[j+1],
798 fc=colorMap[this_feature['featType']])
799 myPatches.append(arc)
800 return myPatches
801
802 def get_angles(name, center_angle, kerning_angle):
803 num_chars = len(name.strip())
804 if num_chars == 2:
805 start_pos = center_angle
806 stop_pos = center_angle + kerning_angle
807 return (start_pos, stop_pos)
808 if num_chars % 2 == 0:
809 start_pos = center_angle - (kerning_angle/2) - ((num_chars - 1)/2)
810 stop_pos = center_angle + (kerning_angle/2) + ((num_chars - 1)/2)
811 else:
812 start_pos = center_angle - (num_chars/2)
813 stop_pos = center_angle + (num_chars/2)
814 angles = np.arange(start_pos, stop_pos + kerning_angle, kerning_angle)
815 return angles
816
817 def plot_gff(args, panelCircle, gff_path, radius):
818 #parse the gff file
819 gffParser = GFFParse(gff_path)
820
821 # Because this size should be relative to the circle that it is plotted next
822 # to, define the start_radius as the place to work from, and the width of
823 # each track.
824 start_radius = radius
825 track_width = radius * 0.15
826
827 colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red',
828 'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white',
829 'ORF':'orange'}
830 augment = 0
831 bar_thickness = 0.9 * track_width
832 # return these at the end
833 myPatches=[]
834 plot_order = []
835 # this for loop relies on the gff features to already be sorted
836 i = 0
837 idone = False
838 # we need to filter out the tRNAs since those are plotted last
839 plottable_features = gffParser.features.query("featType != 'tRNA' and featType != 'region' and featType != 'source'")
840 plottable_features = plottable_features.reset_index(drop=True)
841 while idone == False:
842 print("im in the overlap-pairing while loop i={}".format(i))
843 # look ahead at all of the elements that overlap with the ith element
844 jdone = False
845 j = 1
846 this_set_minimum_index = i
847 this_set_maximum_index = i
848 while jdone == False:
849 print("new i= {} j={} len={}".format(i, j, len(plottable_features)))
850 # first make sure that we haven't gone off the end of the dataframe
851 if i+j == len(plottable_features):
852 if i == len(plottable_features)-1:
853 # this is the last analysis, so set idone to true
854 # to finish after this
855 idone = True
856 # the last one can't be in its own group, so just add it solo
857 these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,]
858 plot_order.append(these_features.reset_index(drop=True))
859 break
860 jdone == True
861 else:
862 # if the lmost of the next gene overlaps with the rmost of
863 # the current one, it overlaps and couple together
864 if plottable_features.loc[i+j, 'lmost'] < plottable_features.loc[i, 'rmost']:
865 # note that this feature overlaps with the current
866 this_set_maximum_index = i+j
867 # ... and we need to look at the next in line
868 j += 1
869 else:
870 i += 1 + (this_set_maximum_index - this_set_minimum_index)
871 #add all of the things that grouped together once we don't find any more groups
872 these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,]
873 plot_order.append(these_features.reset_index(drop=True))
874 jdone = True
875
876 for feature_set in plot_order:
877 print(feature_set)
878 direction = feature_set_direction(feature_set)
879 print("direction = {}".format(direction))
880 if direction == '+':
881 for i in range(len(feature_set)-1, -1,-1):
882 print("inside the plot for loop i = {}".format(i))
883 this_feature = feature_set.loc[i,]
884 print("got this single feature")
885 #For the first element, just plot it normally
886 if i == len(feature_set) - 1:
887 #plot the annotation
888 print("Im in the first plot thing")
889 patches = plot_feature(this_feature, colorMap, start_radius,
890 bar_thickness, direction,
891 pd.Series([]))
892 for each in patches:
893 myPatches.append(each)
894 else:
895 print("now I'm plotting the other one")
896 overlapped_feature = feature_set.loc[i+1,]
897 print("this is the overlapped feature")
898 print(overlapped_feature)
899 patches = plot_feature(this_feature, colorMap, start_radius,
900 bar_thickness, direction,
901 overlapped_feature)
902 for each in patches:
903 myPatches.append(each)
904
905 final_radius = start_radius + track_width
906 # Now we add all of the tRNAs to this to plot, do it last to overlay
907 # everything else
908 tRNAs = gffParser.features.query("featType == 'tRNA'")
909 tRNAs.reset_index(inplace=True, drop = True)
910 print(tRNAs)
911 tRNA_bar_thickness = bar_thickness * (0.8)
912 tRNA_start_radius = start_radius + ((tRNA_bar_thickness * (0.2))/2)
913 #tRNA_bar_thickness = bar_thickness
914 #tRNA_start_radius = start_radius
915 print("sequence_length = {}".format(sequence_length))
916 angle_ranges = []
917 for i in range(0,len(tRNAs)):
918 this_feature = tRNAs.loc[i,]
919 min_angle = angleMap[min(this_feature.loc['start'],this_feature.loc['stop'])]
920 max_angle = angleMap[max(this_feature.loc['start'],this_feature.loc['stop'])]
921 angle_ranges.append((min_angle, max_angle))
922 patches = plot_feature(this_feature, colorMap, tRNA_start_radius,
923 tRNA_bar_thickness, direction,
924 pd.Series([]))
925 for each in patches:
926 myPatches.append(each)
927 print("angle ranges of tRNAs")
928 print(angle_ranges)
929
930 # now plot the text of the all the things. do this last
931 # to cover the previous things we plotted
932 for i in range(len(gffParser.features.index)):
933 if gffParser.features.loc[i, 'featType'] not in ['region', 'source', 'tRNA']:
934 angles = []
935 # to plot the centers, first figure out the center of the annotated
936 # gene by subtracting the end from the beginning
937 name = gffParser.features.loc[i,'name']
938 middle_position = int(gffParser.features.loc[i,'center'])
939 # calculate the radius, because it changes depending on the track.
940 # Remember that the radius is the position at the bottom of the track,
941 # so we must use va=bottom.
942 center_radius = radius + (gffParser.features.loc[i,'track'] * track_width) + (bar_thickness/2)
943 # then, figure out the center angle of that position.
944 # I subtract one since I want to center this for the non-arrow
945 # part of the bar.
946 center_angle = angleMap[middle_position]-2
947 count = 0
948 while count < 1:
949 print()
950 print(name)
951 #figure out how many characters to plot
952 kerning_angle = 2.5
953 print("putting in {} into get angles with center angle = {}, kerning_angle = {}".format(name, center_angle, kerning_angle))
954 angles = get_angles(name, center_angle, kerning_angle)
955 char_min_angle = angles[0]-(kerning_angle / 2)
956 char_max_angle = angles[-1] - (kerning_angle / 2)
957 overlapping_angles = []
958 # for every possible tRNA position, see if the minimum or
959 # max angle lies in the range of the text
960 for tRNA_overlap_range in angle_ranges:
961 for tRNA_position in tRNA_overlap_range:
962 if (char_min_angle < tRNA_position and
963 tRNA_position < char_max_angle):
964 #overlapping_angles contains the tRNA overlapping angles now
965 overlapping_angles.append(tRNA_position)
966 print("character angles {}".format(angles))
967 print("overlapping angles {}".format(overlapping_angles))
968 print("in while loop, center = {}".format(center_angle))
969 if len(overlapping_angles) == 0:
970 count += 1
971 else:
972 min_overlapping_angle = min(overlapping_angles)
973 print("min_overlapping_angle = {}".format(min_overlapping_angle))
974 max_overlapping_angle = max(overlapping_angles)
975 print("max_overlapping_angle = {}".format(max_overlapping_angle))
976 gene_start_angle = angleMap[gffParser.features.loc[i,'start']]
977 gene_stop_angle = angleMap[gffParser.features.loc[i,'stop']]
978 start_dif = abs(min_overlapping_angle - gene_start_angle)
979 stop_dif = abs(gene_stop_angle - max_overlapping_angle)
980 print("start_dif = {}, stop_dif = {}".format(start_dif, stop_dif))
981 if start_dif > stop_dif:
982 center_angle = min_overlapping_angle - ((min_overlapping_angle - gene_start_angle)/2)
983 else:
984 # the extra -2 is to compensate for the taper. This only
985 # applies to the end of a gene stop. This code won't
986 # work properly for - strand things currently
987 center_angle = gene_stop_angle - 2 - ((gene_stop_angle - max_overlapping_angle)/2)
988 count += 1
989 print("center angle after assignment = {}".format(center_angle))
990 #There is some overlap in the text and we need to
991 # figure out the min overlap and max overlap. Then figure
992 # out if the angle between min and start or max and end is greater
993 # use whichever is greater as the new place to put the text, then calculate the center
994 angles = get_angles(name, center_angle, kerning_angle)
995 text_angle = angles[-1] - angles[0]
996 # this prints all the characters in their positions and rotates
997 print_count = 0
998 for j in range(len(angles)):
999 this_width = gffParser.features.loc[i,'width']
1000 if not text_angle > (angleMap[this_width] - 1):
1001 if center_angle > 95 and center_angle < 265:
1002 rotation = (-1 * center_angle) - 180
1003 this_char = name[len(angles) - 1 - j]
1004 else:
1005 rotation = -1 * center_angle
1006 print("name: {} j: {}".format(name, j))
1007 this_char = name[j]
1008 this_angle = angles[j]
1009 # now calculate the absolute x position
1010 x_pos = -np.cos(np.radians(this_angle+90)) * center_radius
1011 # now calculate the absolute y position
1012 y_pos = np.sin(np.radians(this_angle+90)) * center_radius
1013 #rotate the text if necessary
1014 panelCircle.text(x_pos, y_pos, this_char, fontsize=10,
1015 ha='center', va='center',
1016 rotation = rotation,
1017 family = 'monospace',
1018 color = 'white')
1019 # this block handles the case where the text is too small to put
1020 # parallel with the mitochondrial circle, so it is perpindicular
1021 else:
1022 if print_count == 0:
1023 if gffParser.features.loc[i,'strand'] == '+':
1024 gene_start_angle = angleMap[gffParser.features.loc[i,'start']]
1025 gene_stop_angle = angleMap[gffParser.features.loc[i,'stop']]
1026 center_angle = gene_stop_angle - ((gene_stop_angle-gene_start_angle)/2)
1027 # now calculate the absolute x position
1028 x_pos = -np.cos(np.radians(center_angle+90)) * center_radius
1029 # now calculate the absolute y position
1030 y_pos = np.sin(np.radians(center_angle+90)) * center_radius
1031 if (center_angle > 0 and center_angle < 180):
1032 rotation = -1 * center_angle + 90
1033 else:
1034 rotation = -1 * center_angle - 90
1035 panelCircle.text(x_pos, y_pos, name, fontsize=6,
1036 ha='center', va='center',
1037 rotation = rotation,
1038 family = 'monospace',
1039 color = 'white')
1040
1041 print_count += 1
1042
1043 return panelCircle, myPatches, final_radius
1044
1045 def run(args):
1046 redwood(args)
0 #!/usr/bin/env python
1 # -*- coding: utf-8 -*-
2
3 # pauvre - just a pore PhD student's plotting package
4 # Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved.
5 #
6 # This file is part of pauvre.
7 #
8 # pauvre is free software: you can redistribute it and/or modify
9 # it under the terms of the GNU General Public License as published by
10 # the Free Software Foundation, either version 3 of the License, or
11 # (at your option) any later version.
12 #
13 # pauvre is distributed in the hope that it will be useful,
14 # but WITHOUT ANY WARRANTY; without even the implied warranty of
15 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16 # GNU General Public License for more details.
17 #
18 # You should have received a copy of the GNU General Public License
19 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
20
21 # TODO: make a function to nicely print out the pandas dataframes
22
23 """
24 Program: pauvre stats
25
26 example usage/output:
27 - For example, if I want to know the number of bases and reads with AT LEAST
28 PHRED score 5 and AT LEAST length of 500, I can run the program as you
29 see below and look at the cells highlighted with <braces>.
30 - We find that there are 968,653 basepairs contained in 859 reads that
31 fit meanReadQuality >= Q5, readLen >= 500.
32 > pauvre marginplot --fastq miniDSMN15.fastq
33
34 >numReads: 1000
35 >numBasepairs: 1029114
36 >meanLen: 1029.114
37 >medianLen: 875.5
38 >minLen: 11
39 >maxLen: 5337
40 >N50: 1278
41 >L50: 296
42
43 > Basepairs >= bin by mean PHRED and length
44 >minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
45 > 0 1029114 1010681 935366 429279 143948 25139 3668 2938 2000 0
46 > 500 984212 <968653> 904787 421307 142003 24417 3668 2938 2000 0
47 > 1000 659842 649319 616788 300948 103122 17251 2000 2000 2000 0
48 > et cetera...
49 > Number of reads >= bin by mean Phred+Len
50 >minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
51 > 0 1000 969 865 366 118 22 3 2 1 0
52 > 500 873 <859> 789 347 113 20 3 2 1 0
53 > 1000 424 418 396 187 62 11 1 1 1 0
54 > et cetera...
55 """
56
57
58 from pauvre.functions import parse_fastq_length_meanqual
59 import os
60 import pandas as pd
61 import numpy as np
62
63
64 def stats(df, fastqName, histogram=False):
65 """
66 arguments:
67 <df>
68 - a pandas dataframe with cols ['length', 'meanQual'] that contains read
69 length and quality information
70 <fastqName>
71 - just the name of the fastq file. This is used for printing out headers
72 for the data tables.
73
74 purpose:
75 Calculate and output various stats about this fastq file. Currently
76 this program reports:
77 - Total number of reads
78 - Total number of basepairs
79 - mean
80 - median
81 - min
82 - max
83 - N50
84 - A table of num basepairs filtered by min mean PHRED and length
85 - A table of num reads filtered by the same as above ^
86
87 example usage/output:
88 - For example, if I want to know the number of bases and reads with AT LEAST
89 PHRED score 5 and AT LEAST length of 500, I can run the program as you
90 see below and look at the cells highlighted with <braces>.
91 - We find that there are 968,653 basepairs contained in 859 reads that
92 fit meanReadQuality >= Q5, readLen >= 500.
93 > pauvre marginplot --fastq miniDSMN15.fastq
94
95 >numReads: 1000
96 >numBasepairs: 1029114
97 >meanLen: 1029.114
98 >medianLen: 875.5
99 >minLen: 11
100 >maxLen: 5337
101 >N50: 1278
102 >L50: 296
103
104 > Basepairs >= bin by mean PHRED and length
105 >minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
106 > 0 1029114 1010681 935366 429279 143948 25139 3668 2938 2000 0
107 > 500 984212 <968653> 904787 421307 142003 24417 3668 2938 2000 0
108 > 1000 659842 649319 616788 300948 103122 17251 2000 2000 2000 0
109 > et cetera...
110 > Number of reads >= bin by mean Phred+Len
111 >minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
112 > 0 1000 969 865 366 118 22 3 2 1 0
113 > 500 873 <859> 789 347 113 20 3 2 1 0
114 > 1000 424 418 396 187 62 11 1 1 1 0
115 > et cetera...
116 """
117 fastqBase = os.path.basename(fastqName)
118
119 analysis_sizes = [0, 1000, 5000, 10000]
120 print_string = ""
121 for this_size in analysis_sizes:
122 these_lengths = df.loc[df["length"] >= this_size, "length"]
123 if len(these_lengths) > 0:
124 print_string += "# Fastq stats for {}, reads >= {}bp\n".format(fastqBase, this_size)
125 print_string += "numReads: {}\n".format(len(these_lengths))
126 print_string += "%totalNumReads: {0:.2f}\n".format((len(these_lengths) / len(df)) * 100)
127 print_string += "numBasepairs: {}\n".format(sum(these_lengths))
128 print_string += "%totalBasepairs: {0:.2f}\n".format(
129 (sum(these_lengths) / sum(df["length"])) * 100)
130 print_string += "meanLen: {}\n".format(np.mean(these_lengths))
131 print_string += "medianLen: {}\n".format(np.median(these_lengths))
132 print_string += "minLen: {}\n".format(min(these_lengths))
133 maxLen = max(these_lengths)
134 print_string += "maxLen: {}\n".format(maxLen)
135
136 # calculate the N50
137 fiftypercent = 0.5 * sum(these_lengths)
138 lenSum = 0
139 count = 0
140 for val in sorted(these_lengths, reverse=True):
141 lenSum += val
142 count += 1
143 if lenSum >= fiftypercent:
144 print_string += "N50: {}\n".format(int(val))
145 print_string += "L50: {}\n".format(count)
146 break
147 print_string += "\n"
148
149 print_string += lengthQual_table(df)
150
151 if histogram: # now make a histogram with read lengths
152 histogram_lengths(df["length"], fastqBase.split('.')[0])
153 print(print_string)
154
155
156 def lengthQual_table(df):
157 """Create a table with lengths/basepairs on columns and qualities on rows"""
158 # This block calculates the number of length bins for this data set
159 lengthBinList = []
160 size_map = [(1000, 250),
161 (10000, 500),
162 (40000, 1000),
163 (100000, 5000),
164 (500000, 20000),
165 (1000000, 50000),
166 (10000000000, 100000)]
167 # first, figure out where we will start the table
168 minlen = min(df["length"])
169 current_val = 0
170 firstDone = False
171 for this_max_size, this_size_step in size_map:
172 for this_bin in range(current_val, this_max_size, this_size_step):
173 if minlen < this_bin:
174 if not firstDone:
175 lengthBinList.append(prev)
176 firstDone = True
177 lengthBinList.append(this_bin)
178 prev = this_bin
179 current_val = this_max_size
180 # now figure out the largest bin
181 maxLen = df["length"].max()
182 first_index_gt_maxLen = next(i for i, v in enumerate(lengthBinList) if v > maxLen) + 1
183 lengthBinList = lengthBinList[0:first_index_gt_maxLen]
184
185 qualBinList = []
186 increment_by = 1
187 while len(qualBinList) == 0 or len(qualBinList) > 15:
188 # now set up the bins for mean PHRED
189 minQual = int(np.floor(min(df["meanQual"])))
190 maxQual = int(np.ceil(max(df["meanQual"])))
191 qualBinList = list(np.arange(minQual, maxQual + increment_by, increment_by))
192 increment_by += 0.25
193
194 # now make a table of read lengths
195 bpTots = []
196 readnumTots = []
197 for row in range(len(lengthBinList)):
198 dataNums = []
199 readNums = []
200 for column in range(len(qualBinList)):
201 thisQuery = df.query("length >= {} and meanQual >= {}".format(
202 lengthBinList[row], qualBinList[column]))
203 dataNums.append(sum(thisQuery['length']))
204 readNums.append(len(thisQuery.index))
205 bpTots.append(dataNums)
206 readnumTots.append(readNums)
207
208 tables = {"Basepairs >= bin by mean PHRED and length": bpTots,
209 "Number of reads >= bin by mean Phred+Len": readnumTots}
210 print_table = ""
211 for key in sorted(tables):
212 # make a dataframe of our basepair distribution table
213 dataDf = pd.DataFrame(tables[key], columns=["Q{}".format(x) for x in qualBinList])
214 # add the min lengths as a column
215 dataDf.insert(0, 'minLen', lengthBinList)
216 print_table += pretty_print_table(dataDf, key)
217 return print_table
218
219
220 def histogram_lengths(length, name_prefix):
221 """Create a histogram of read counts per length."""
222 counts = length.value_counts().to_frame(name="readCount")
223 counts.index.rename('readLen', inplace=True)
224 counts.sort_index(inplace=True)
225 counts.to_csv("{}.hist.csv".format(name_prefix), index=True)
226
227
228 def pretty_print_table(df, title):
229 print_string = ""
230 dataframeStr = df.to_string(index=False)
231 # this is the char width of the whole table printed
232 lendataframeStr = len(dataframeStr.split('\n')[0])
233 # this is the char width of the minlen portion of the printed table
234 minLenLen = len(dataframeStr.split()[0])
235 blank = " " * minLenLen
236 # center the text on this offset as the table title
237 txtoffset = lendataframeStr - minLenLen
238 print_string += "\n{}{:^{offset}}\n".format(
239 blank, title, offset=txtoffset)
240 print_string += dataframeStr + "\n"
241 return print_string
242
243
244 def run(args):
245 """This just opens the fastq file and passes the info to the stats() function.
246 This is a wrapper function that is accessed by pauvre_main.
247 Useful since we can call stats() independently from other pauvre programs."""
248 df = parse_fastq_length_meanqual(args.fastq)
249 stats(df, args.fastq, histogram=args.histogram)
0 #!/usr/bin/env python
1 # -*- coding: utf-8 -*-
2
3 # pauvre - just a pore plotting package
4 # Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
5 #
6 # This file is part of pauvre.
7 #
8 # pauvre is free software: you can redistribute it and/or modify
9 # it under the terms of the GNU General Public License as published by
10 # the Free Software Foundation, either version 3 of the License, or
11 # (at your option) any later version.
12 #
13 # pauvre is distributed in the hope that it will be useful,
14 # but WITHOUT ANY WARRANTY; without even the implied warranty of
15 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16 # GNU General Public License for more details.
17 #
18 # You should have received a copy of the GNU General Public License
19 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
20
21 # TODO
22 # import the pauvre rcParams
23 # Cleanup everything
24
25 import pandas as pd
26 pd.set_option('display.max_columns', 500)
27 pd.set_option('display.width', 1000)
28 import numpy as np
29 import os
30 import pauvre.rcparams as rc
31 from pauvre.functions import GFFParse, print_images, timestamp
32 from pauvre import gfftools
33 from pauvre.lsi.lsi import intersection
34 import progressbar
35 import platform
36 import sys
37 import time
38 import warnings
39
40 # for the shuffling algorithm
41 from itertools import product
42
43 # Biopython stuff
44 from Bio import SeqIO
45 import Bio.SubsMat.MatrixInfo as MI
46
47 # following this tutorial to install helvetica
48 # https://github.com/olgabot/sciencemeetproductivity.tumblr.com/blob/master/posts/2012/11/how-to-set-helvetica-as-the-default-sans-serif-font-in.md
49 global hfont
50 hfont = {'fontname':'Helvetica'}
51
52 import matplotlib
53 matplotlib.use('agg')
54 import matplotlib.pyplot as plt
55 from matplotlib.colors import LinearSegmentedColormap, Normalize
56 import matplotlib.patches as patches
57
58 def shuffle_optimize_gffs(args, GFFs):
59 """This function takes in a list of GFF objects and reshuffles the
60 individual files such that the resulting sequence of GFF files has
61 the minimum number of intersections when plotting synteny
62
63 if args.optimum_order, then the program will find the global minimum
64 arrangement using the first GFF file as the anchor.
65
66 if not args.optimum_order, then the program will find the local minimum
67 shuffle between every input pair of GFF files to plot in the best way possible
68 given the input order.
69
70 returns a list of GFF files from which the user can calculate plotting coordinates
71 """
72 # we use the first-input gff as the topmost sequence,
73 # and then find the best synteny match for the remaining sequences
74 shuffled_gffs = []
75 if args.optimum_order:
76 firstgff = GFFs[0]
77 # save the first gff file unadultered
78 shuffled_gffs.append(firstgff)
79 nextgffs = GFFs[1:]
80 while len(nextgffs) > 0:
81 obs_list = []
82 for i in range(len(nextgffs)):
83 # every observation will be stored here as a tuple.
84 # zeroth element is the num intersections with the current gff
85 # first element is the index of nextgffs
86 # second element is the GFF object
87 shuffles = nextgffs[i].shuffle()
88 for k in range(len(shuffles)):
89 shuf = shuffles[k]
90 coords = firstgff.couple(shuf, this_y = 0, other_y = 1)
91 num_inters = len(intersection(coords))
92 obs_list.append((num_inters, i, shuf))
93 #print(obs_list[-1])
94 intersections, gffixs, shufs = zip(*obs_list)
95 # get the index of the shuffled gff with the least number of
96 # intersections to the current one against which we are comparing
97 #print("intersections", intersections)
98 selected_ix = intersections.index(min(intersections))
99 # save this gff to shuffled gffs to use later for plotting
100 shuffled_gffs.append(shufs[selected_ix])
101 # remove the origin of the shuffled gff from nextgffs
102 del nextgffs[gffixs[selected_ix]]
103 #print("global minimum was {} intersections".format(min(intersections)))
104 # now update the firstgff to the latest shuffled one we collected
105 firstgff = shufs[selected_ix]
106 # plot the gff files in the order in which you input them,
107 # but shuffle them to the order with least intersections
108 else:
109 # first we need to find the best arrangement by finding the combinations
110 # that share the most unique genes
111 genes_series = [GFFs[i].get_unique_genes() for i in range(len(GFFs))]
112 combinations_indices = [0]
113 remaining_indices = list(range(1, len(GFFs)))
114 done = False
115 biggest_intersection_index = -1
116 biggest_intersection_value = 0
117 current_remaining_indices_index = 0
118 while not done:
119 #get the len of the intersection
120 #print("combinations_indices: {}".format(combinations_indices))
121 #print("current_remaining_indices_index: {}".format(current_remaining_indices_index))
122 #print("remaining_indices[current_remaining_indices_index]: {}".format(remaining_indices[current_remaining_indices_index]))
123 #print("genes_series[remaining_indices[current_remaining_indices_index]]: {}".format(genes_series[remaining_indices[current_remaining_indices_index]]))
124 this_intersection_value = len(genes_series[combinations_indices[-1]] &\
125 genes_series[remaining_indices[current_remaining_indices_index]])
126 if this_intersection_value > biggest_intersection_value:
127 biggest_intersection_value = this_intersection_value
128 biggest_intersection_index = current_remaining_indices_index
129 if current_remaining_indices_index < len(remaining_indices)-1:
130 current_remaining_indices_index += 1
131 else:
132 combinations_indices.append(remaining_indices[biggest_intersection_index])
133 del remaining_indices[biggest_intersection_index]
134 biggest_intersection_value = 0
135 current_remaining_indices_index = 0
136 biggest_intersection_index = -1
137 if len(remaining_indices) == 0:
138 done = True
139 # The best order of genes with the most shared genes
140 #I don't know if this is really that useful though since many species will overlap.
141 # In a future implementation of this program it might be necessary to do sub-sorting of this list to get the lest number of line intersections
142 print("The best gene combination is {}".format(combinations_indices))
143 # now we rearrange the GFFs to the best order
144 new_GFFs = [GFFs[i] for i in combinations_indices]
145 # If we're adding another copy of the top one, add it here before shuffling
146 if args.sandwich:
147 new_GFFs.append(new_GFFs[0])
148 shuffles = [new_GFFs[i].shuffle() for i in range(len(new_GFFs))]
149 #print([len(shuffles[i]) for i in range(len(shuffles))])
150 cumulative_least_shuffled_value = 999999999999999999999999999999999999
151 bar = progressbar.ProgressBar()
152 for combination in bar(list(product(*shuffles))):
153 num_intersections = []
154 #I have no idea what this list comprehension does anymore.
155 first_genes = [str(combination[i].features[combination[i].features['featType'].isin(['gene', 'rRNA', 'CDS', 'tRNA'])]['name'].head(n=1)).split()[1] for i in range(len(combination))]
156 # skip to the next iteration if all the genes aren't the same
157 if args.start_with_aligned_genes and len(set(first_genes)) != 1:
158 continue
159 for i in range(len(new_GFFs) - 1):
160 j = i + 1
161 #figure out the best shuffle the next sequence
162 coords = combination[i].couple(combination[j], this_y = i, other_y = j)
163 num_intersections.append(len(intersection(coords)))
164 if sum(num_intersections) < cumulative_least_shuffled_value:
165 shuffled_gffs = combination
166 cumulative_least_shuffled_value = sum(num_intersections)
167 print("\nnew fewest global intersections: {}".format(sum(num_intersections)))
168 return shuffled_gffs
169
170 def black_colormap():
171 zeroone = np.linspace(0, 1, 100)
172 colorrange = [(0,0,0,x) for x in zeroone]
173 minblosum = min(MI.blosum62.values())
174 maxblosum = max(MI.blosum62.values())
175 colormap = {i: colorrange[int(translate(i, minblosum, maxblosum, 0, 99))]
176 for i in range(minblosum, maxblosum + 1, 1)}
177 return colormap
178
179 def translate(value, left_min, left_max, right_min, right_max):
180 """This code maps values from the left range and interpolates to the
181 corresponding range on the right. This is used to translate the amino acid
182 substition matrix scores to a scale between 0 and 1 for making alphamaps.
183
184 I don't know if this works if the directionality of the ranges are swapped.
185 IE [5, -10] mapped to [0, 1]
186
187 args:
188 <value> - the value in [<left_min>:<left_max>] to scale between
189 [<right_min>:<right_max>]
190 <left_min> - the 'min' of the left (source) range
191 <left_max> - the 'max' of the left (source) range
192 <right_min> - the 'min' of the right (target) range
193 <right_max> the 'max' of the right (target) range
194
195 output:
196 the <value>(float) scaled between <right_min> and <right_max>
197 """
198 # Figure out how 'wide' each range is
199 left_span = left_max - left_min
200 right_span = right_max - right_min
201
202 # Convert the left range into a 0-1 range (float)
203 value_scaled = float(value - left_min) / float(left_span)
204
205 # Convert the 0-1 range into a value in the right range.
206 return right_min + (value_scaled * right_span)
207
208 def _samplename_warning(samplename, filename):
209 warnings.warn("""
210 There is a sample in your fasta alignments that
211 does not match the samplenames from the gff filenames. Please
212 rename this samplename to not contain any spaces or underscores.
213 IE for sample 'NC016', '>NC_016_-_ND6' will not work but
214 '>NC016_-_ND6' will work.
215
216 Erroneous name: {}
217 File: {}""".format(samplename, os.path.basename(filename)))
218
219 def _samplelength_warning(samplename, genename, featType, gfflen, alnlen):
220 raise Warning("""The length of the protein alignment isn't the same as the
221 length in the GFF file for the sample. Maybe you used a sequence in the
222 alignment that is different from the annotation source? Check if the
223 stop codons are deleted/inserted from either the GFF or alignment. The
224 protein alignment length should be 3 less than the gff length if the
225 stop codons were included in the gff annotation.
226
227 Another possibility is that the RNA that is generating this error has
228 post-transcriptional modifications that complete the stop codon. In
229 this case, you can fudge the stop position in the gff file (increase
230 the value by one or two) to make the plotting script run.
231
232 Sample name: {}
233 feat type: {}
234 gene name: {}
235 gff length: {}
236 aln length: {}""".format(samplename, featType, genename, gfflen, alnlen))
237
238 def _nosample_warning(samplename, alngenename, gffnames):
239 raise Warning("""One of the gff files doesn't contain a sequence that the
240 alignment file indicates should be present. Either the alignment file
241 is misnamed or the sequence name in the GFF file is not what you
242 intended.
243
244 Sample name: {}
245 aln gene name: {}
246 gff names: {}""".format(samplename, alngenename, gffnames))
247
248 def get_alignments(args):
249 """
250 this reads in all the alignments from the fasta directory.
251 """
252 # This is a dict object with key as
253 filelist = {os.path.splitext(x)[0]:os.path.join(os.path.abspath(args.aln_dir), x)
254 for x in os.listdir(args.aln_dir)
255 if os.path.splitext(x)[1]}
256 print("file list is:")
257 print(filelist)
258 # one entry in seq_dict is:
259 # {seqname: {"featType": featType,
260 # "seqs": {samplename: seq},
261 # "indices": {samplename: indices}}
262 seqs_dict = {}
263 # go through every gene in the genelist
264 for genename in filelist:
265 thisFeatType = ""
266 seqs_list = {}
267 indices_list = {}
268 #print("We found the following samplenames: {}".format(args.samplenames), file = sys.stderr)
269 # this block handles reading in the fasta files to interpret the alignments
270 for record in SeqIO.parse(filelist[genename], "fasta"):
271 # get the sample name and make sure that the sample names match
272 samplename = record.id.replace("_", " ").split()[0]
273 #print("Looking at sample: {}".format(samplename), file=sys.stderr)
274 if samplename not in args.samplenames:
275 #if there's a sequence in the fasta that we did not specify
276 # in the command, ignore that sequence
277 _samplename_warning(samplename, filelist[genename])
278 else:
279 # first, get the sample features
280 samplegff = args.samplenames[samplename].features
281 featType = samplegff.loc[samplegff['name'] == genename, 'featType'].to_string().split()[1]
282 # now we determine if this is a prot alignment or a nucleotide aln
283 if featType in ['gene', 'CDS']:
284 final_seq = "".join([x*3 for x in record.seq])
285 elif featType == 'rRNA':
286 final_seq = str(record.seq)
287 # we now need to verify that the protein sequence is
288 # the length of the gene in the gff file. Do this by removing
289 # gaps in the alignment
290 gfffilt = samplegff.loc[samplegff['name'] == genename, 'width']
291 if len(gfffilt) == 0:
292 _nosample_warning(samplename, genename, list(samplegff['name']))
293 gfflen = int(gfffilt)
294 aln = final_seq.replace("-", "")
295 alnlen = len(aln)
296 if gfflen != alnlen:
297 _samplelength_warning(samplename, genename, featType, gfflen, alnlen)
298 # If we've made it this far without any errors, then incorporate the
299 # indices for each index
300 #print("start_index", start_index)
301 final_indices = [-1] * len(final_seq)
302 # up until the next for loop, here we are determining which
303 # direction to move in. Reverse sequences decrease from the start
304 strand = samplegff.loc[samplegff['name'] == genename, 'strand'].to_string().split()[1]
305 if strand == '+':
306 direction = 1
307 start_index = int(samplegff.loc[samplegff['name'] == genename, 'start'])
308 elif strand == '-':
309 direction = -1
310 start_index = int(samplegff.loc[samplegff['name'] == genename, 'stop'])
311 for i in range(len(final_indices)):
312 if final_seq[i] != '-':
313 final_indices[i] = start_index
314 start_index = start_index + (1 * direction)
315 seqs_list[samplename] = final_seq
316 if args.center_on:
317 center_coord = int(args.samplenames[samplename].features.loc[args.samplenames[samplename].features['name'] == args.center_on, 'center'])
318 indices_list[samplename] = np.array(final_indices) - center_coord
319 else:
320 indices_list[samplename] = final_indices
321 thisFeatType = featType
322 seqs_dict[genename] = {"featType": thisFeatType,
323 "seqs": seqs_list,
324 "indices": indices_list}
325 return seqs_dict
326
327
328 def plot_synteny(seq1, ind1, seq2, ind2, y1, y2,
329 featType, matrix, cm, seqname):
330 """This function plots all the lines for each"""
331 print("PLOTTING SYNTENY")
332 myPatches = []
333 colormap = {"COX1": '#c0d9ef',
334 "L": '#e8f1df',
335 "I": '#f7dedc',
336 "16S": '#ff2e00',
337 "12S": '#ffc239',
338 "cal": '#ffff54',
339 "COX2": "#7fce66",
340 "ND2": "#00ae60",
341 "COX3": "#00aeec",
342 "ND1": "#006fbb",
343 "*": "#ffffff",
344 "(": "#ded9c5",
345 "Q": "#ffc294",
346 "?": "#b5a2c4",
347 "ND4": "#968b5a",
348 "ND3": "#00fc65",
349 "ND4L": "#00dcf0",
350 "ND6": "#ff994e",
351 "ND5": "#dc31e6",
352 "X": "#d8d8d8",
353 "G": "#abdce7",
354 "CYTB": "#ff0059"}
355
356 for i in range(len(seq1)):
357 feat1 = seq1[i]
358 feat2 = seq2[i]
359 if feat1 != '-' and feat2 != '-':
360 xs = []
361 ys = []
362 xs.append(ind1[i]) # top left
363 ys.append(y1)
364 xs.append(ind1[i] + 1) # top right
365 ys.append(y1)
366 xs.append(ind2[i] + 1) # bottom right
367 ys.append(y2)
368 xs.append(ind2[i]) #bottom left
369 ys.append(y2)
370 xs.append(ind1[i]) #top left
371 ys.append(y1)
372 alpha = 0.5
373 if featType in ['CDS', 'gene']:
374 try:
375 val = matrix[(feat1, feat2)]
376 except:
377 val = matrix[(feat2, feat1)]
378 color = cm[val]
379 alpha = color[-1]
380 elif featType == 'rRNA':
381 if feat1 != feat2:
382 alpha=0
383 color = colormap[seqname]
384 stack1 = np.column_stack([xs, ys])
385 myPatches.append(patches.Polygon(stack1, closed=True,
386 color = color,
387 alpha = alpha,
388 lw=0))
389 return myPatches
390
391 def synplot(args):
392 rc.update_rcParams()
393 print(args)
394 GFFs = []
395 for i in range(len(args.gff_paths)):
396 gffpath = args.gff_paths[i]
397 species = ""
398 if args.gff_labels:
399 species = args.gff_labels[i]
400 GFFs.append(GFFParse(gffpath, args.stop_codons, species))
401
402 # find the optimum shuffling pattern
403 # and add a list of samplenames to the args
404 optGFFs = shuffle_optimize_gffs(args, GFFs)
405 # Make a sandwich for a circular comparison
406 setattr(args, 'samplenames', {gff.samplename:gff for gff in optGFFs})
407
408 # now get the cms and normalize
409 #cms, normalize = gen_colormaps()
410 cm = black_colormap()
411
412 ## and we get the protein alignment objects
413 # {seqname: {"featType": featType,
414 # "seqs": {samplename: seq},
415 # "indices": {samplename: indices}}
416 print("getting alignments")
417 seqs_dict = get_alignments(args)
418 print("done getting alignments")
419 print("seqs_dict is:")
420 print(seqs_dict)
421
422 # set the figure dimensions
423 if args.ratio:
424 figWidth = args.ratio[0] + 1
425 figHeight = args.ratio[1] + 1
426 #set the panel dimensions
427 panelWidth = args.ratio[0]
428 panelHeight = args.ratio[1]
429
430 else:
431 figWidth = 2.5*4
432 figHeight = 5
433 #set the panel dimensions
434 panelWidth = 2.5 * 3
435 panelHeight = 1.75
436
437 figure = plt.figure(figsize=(figWidth,figHeight))
438
439
440 #find the margins to center the panel in figure
441 leftMargin = (figWidth - panelWidth)/2
442 bottomMargin = ((figHeight - panelHeight)/2) + 0.25
443
444 panel0=plt.axes([leftMargin/figWidth, #left
445 bottomMargin/figHeight, #bottom
446 panelWidth/figWidth, #width
447 panelHeight/figHeight]) #height
448 panel0.tick_params(axis='both',which='both',\
449 bottom='on', labelbottom='off',\
450 left='off', labelleft='off', \
451 right='off', labelright='off',\
452 top='off', labeltop='off')
453 #turn off some of the axes
454 panel0.spines['top'].set_visible(False)
455 panel0.spines['right'].set_visible(False)
456 panel0.spines['left'].set_visible(False)
457
458 # {seqname: {"featType": featType,
459 # "seqs": {samplename: seq},
460 # "indices": {samplename: indices}}
461 allPatches = []
462 for seqname in seqs_dict:
463 #go through in order
464 print(seqname)
465 for i in range(0, len(optGFFs) - 1):
466 samplei = optGFFs[i].samplename
467 samplej = optGFFs[i+1].samplename
468 if samplei in seqs_dict[seqname]["seqs"].keys() and\
469 samplej in seqs_dict[seqname]["seqs"].keys():
470 featType = seqs_dict[seqname]["featType"]
471 seq1 = seqs_dict[seqname]["seqs"][samplei]
472 ind1 = seqs_dict[seqname]["indices"][samplei]
473 seq2 = seqs_dict[seqname]["seqs"][samplej]
474 ind2 = seqs_dict[seqname]["indices"][samplej]
475 # this is the top one, just leave it at the actual value since
476 # the base of the annotations start on the integer
477 y1 = len(optGFFs) - 1 - i
478 # this needs to be increased by the bar_thickness (0.9 * track_width in this case, or 0.09)
479 y2 = len(optGFFs) - 2 - i
480 myPatches = plot_synteny(seq1, ind1, seq2, ind2, y1, y2,
481 featType, MI.blosum62, cm, seqname)
482 for patch in myPatches:
483 allPatches.append(patch)
484
485 #print("len allPatches", len(allPatches))
486 # this bit plots the simplified lines in the centers
487 ## first we plot all the lines from the centers of matching genes.
488 ## This is temporary. Or maybe it should be a feature
489 #verts = []
490 #for i in range(len(optGFFs) - 1):
491 # j = i + 1
492 # coords = optGFFs[i].couple(optGFFs[j], this_y = len(optGFFs) - i, other_y = len(optGFFs) - i - 1)
493 # for coord in coords:
494 # verts.append(coord)
495
496 #for vert in verts:
497 # xxyy = list(zip(*vert))
498 # panel0.plot(xxyy[0], xxyy[1])
499 # now we plot horizontal lines showing the length of the mitochondrial sequence
500 maxseqlen = 0
501 # this is a heuristic for trackwidth of what looks good in my experience
502 track_multiplier = 0.08
503 if args.ratio:
504 track_width = track_multiplier * panelWidth
505 else:
506 #0.032 if only 3
507 #0.062 if 6
508 track_width = track_multiplier * panelWidth
509 for i in range(len(optGFFs)):
510 gff = optGFFs[i]
511 #print(" - Plotting panels of {}".format(gff), file = sys.stderr)
512 x_offset = 0
513 #print(" - Detecting if centering is on.".format(gff), file = sys.stderr)
514 if args.center_on:
515 x_offset = -1 * int(gff.features.loc[gff.features['name'] == args.center_on, 'center'])
516 gff = gfftools.x_offset_gff(gff, x_offset)
517 #print(" - Centering is on.".format(gff), file = sys.stderr)
518 #print(" - Plotting horizontal portions with gffplot_horizontal.".format(gff), file = sys.stderr)
519 panel0, patches = gfftools.gffplot_horizontal(
520 figure, panel0, args, gff,
521 track_width = track_width,
522 start_y = len(optGFFs) - i - 1 - ((0.9 * track_width)/2),
523 x_offset = x_offset)
524 #print("{} patches came out of gffplot_horizontal()".format(len(patches)))
525 seq_name = gff.features['sequence'].unique()[0]
526 if args.gff_labels:
527 seq_name = "$\it{{{0}}}$".format(gff.species)
528 panel0.text(0 + x_offset, len(optGFFs) - i - 1 + (0.18/2),
529 seq_name, fontsize = 12,
530 ha='left', va='bottom',
531 color = 'black',
532 zorder = 100)
533
534 if gff.seqlen > maxseqlen:
535 maxseqlen = gff.seqlen
536 xs = (1 + x_offset, gff.seqlen + x_offset)
537 #ys = [len(optGFFs) - i - 1 + (0.09/2)]*2
538 ys = [len(optGFFs) - i - 1]*2
539
540 #print(" - Plotting lines.".format(gff), file = sys.stderr)
541 panel0.plot(xs, ys, color='black', zorder = -9)
542 #print(" - Adding patches.".format(gff), file = sys.stderr)
543 #print("Right before adding patches there are {} patches.".format(len(patches)))
544 for i in range(len(patches)):
545 patch = patches[i]
546 allPatches.append(patch)
547 for patch in allPatches:
548 panel0.add_patch(patch)
549 panel0.set_xlabel("position (bp)")
550
551 #panel0.set_xlim([-15000, int(np.ceil(maxseqlen/1000)*1000)])
552 panel0.set_ylim([ 0 - ( (track_width/2) * 1.1 ),
553 len(optGFFs) - 1 + ( (track_width/2) * 1.1 )])
554
555 # This removes the text labels from the plot
556 labels = [item.get_text() for item in panel0.get_xticklabels()]
557 empty_string_labels = ['']*len(labels)
558 print(" - Setting tick labels.".format(gff), file = sys.stderr)
559
560 panel0.set_xticklabels(empty_string_labels)
561
562 # Print image(s)
563 print(" - Running print_images.".format(gff), file = sys.stderr)
564 if args.BASENAME is None:
565 file_base = "synteny"
566 else:
567 file_base = args.BASENAME
568 print_images(
569 base=file_base,
570 image_formats=args.fileform,
571 no_timestamp = args.no_timestamp,
572 dpi=args.dpi,
573 transparent=args.transparent)
574
575
576 def run(args):
577 synplot(args)
(New empty file)
0 #!/usr/bin/env python
1 # -*- coding: utf-8 -*-
2
3 # pauvre - just a pore plotting package
4 # Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
5 #
6 # This file is part of pauvre.
7 #
8 # pauvre is free software: you can redistribute it and/or modify
9 # it under the terms of the GNU General Public License as published by
10 # the Free Software Foundation, either version 3 of the License, or
11 # (at your option) any later version.
12 #
13 # pauvre is distributed in the hope that it will be useful,
14 # but WITHOUT ANY WARRANTY; without even the implied warranty of
15 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16 # GNU General Public License for more details.
17 #
18 # You should have received a copy of the GNU General Public License
19 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
20
21 import os
22 import subprocess as sp
23 import unittest
24
25 class libSeq_test_case(unittest.TestCase):
26 """Tests that different features of synplot work correctly
27 """
28 def setUp(self):
29 # Here we must safely make the test directory if it does
30 # not exist
31 scriptdir = os.path.dirname(os.path.realpath(__file__))
32 testoutputdir = os.path.join(scriptdir, "testresults")
33 if not os.path.exists(testoutputdir):
34 os.makedirs(testoutputdir)
35 # now we see if the specific subdirectory for the program
36 # we are testing exists
37 self.thisoutdir = os.path.join(testoutputdir, "synplot")
38 if not os.path.exists(self.thisoutdir):
39 os.makedirs(self.thisoutdir)
40 # now we change our working directory to the
41 # specific subdirectory for this file
42
43 self.aln_dir = os.path.join(scriptdir, "testdata/alignments/")
44 self.gff1 = os.path.join(scriptdir, "testdata/gff_files/Bf201706.gff")
45 self.gff2 = os.path.join(scriptdir, "testdata/gff_files/JN392469.gff")
46 self.gff3 = os.path.join(scriptdir, "testdata/gff_files/NC016117.gff")
47
48
49 def test_normal_plotting_scenario(self):
50 """This verifies that the LibSeq class is constructed with all of the
51 parameters that are present in the meraculous config files"""
52 os.chdir(self.thisoutdir)
53 thiscommand = """pauvre synplot --aln_dir {0} \
54 --fileform pdf \
55 --gff_paths {1} {2} {3} \
56 --center_on COX1 \
57 --gff_labels "B. forskalii" "P. bachei" "M. leidyi" """.format(
58 self.aln_dir, self.gff1,
59 self.gff2, self.gff3)
60
61 data = sp.Popen(thiscommand, shell = True,
62 stdout=sp.PIPE)
63 print("The data is:", data.communicate()[0].decode())
64 print("The return code is: ", data.returncode)
65 self.assertEqual(0, int(data.returncode))
0 # -*- coding: utf-8 -*-
1
2 __version__ = "0.1924"
0 Metadata-Version: 1.2
1 Name: pauvre
2 Version: 0.1924
3 Summary: Tools for plotting Oxford Nanopore and other long-read data.
4 Home-page: https://github.com/conchoecia/pauvre
5 Author: Darrin Schultz
6 Author-email: dts@ucsc.edu
7 License: GPLv3
8 Description:
9 'pauvre' is a package for plotting Oxford Nanopore and other long read data.
10 The name means 'poor' in French, a play on words to the oft-used 'pore' prefix
11 for similar packages. This package was designed for python 3, but it might work in
12 python 2. You can visit the gitub page for more detailed information here:
13 https://github.com/conchoecia/pauvre
14
15 Platform: UNKNOWN
16 Classifier: Development Status :: 2 - Pre-Alpha
17 Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
18 Classifier: Programming Language :: Python :: 3
19 Classifier: Programming Language :: Python :: 3.5
20 Classifier: Operating System :: POSIX :: Linux
21 Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
22 Classifier: Intended Audience :: Science/Research
23 Requires: python (>3.0)
24 Provides: pauvre
25 Requires-Python: >=3
0 MANIFEST.in
1 README.md
2 setup.py
3 pauvre/__init__.py
4 pauvre/bamparse.py
5 pauvre/browser.py
6 pauvre/custommargin.py
7 pauvre/functions.py
8 pauvre/gfftools.py
9 pauvre/marginplot.py
10 pauvre/pauvre_main.py
11 pauvre/rcparams.py
12 pauvre/redwood.py
13 pauvre/stats.py
14 pauvre/synplot.py
15 pauvre/version.py
16 pauvre.egg-info/PKG-INFO
17 pauvre.egg-info/SOURCES.txt
18 pauvre.egg-info/dependency_links.txt
19 pauvre.egg-info/entry_points.txt
20 pauvre.egg-info/not-zip-safe
21 pauvre.egg-info/requires.txt
22 pauvre.egg-info/top_level.txt
23 pauvre/lsi/Q.py
24 pauvre/lsi/T.py
25 pauvre/lsi/__init__.py
26 pauvre/lsi/helper.py
27 pauvre/lsi/lsi.py
28 pauvre/lsi/test.py
29 pauvre/tests/__init__.py
30 pauvre/tests/test_synplot.py
31 scripts/test.sh
0 [console_scripts]
1 pauvre = pauvre.pauvre_main:main
2
0 matplotlib>=2.0.2
1 biopython>=1.68
2 pandas>=0.20.1
3 numpy>=1.12.1
4 scipy
5 sklearn
0 set -ev
1
2 git clone https://github.com/wdecoster/nanotest.git
3
4 pauvre -h
5
6 pauvre marginplot -h
7 pauvre marginplot -f nanotest/reads.fastq.gz
8 pauvre marginplot -f nanotest/reads.fastq.gz -t "my title" --filt_maxlen 30000
9 pauvre marginplot -f nanotest/reads.fastq.gz -y --filt_minqual 7 --fileform svg
10
11 pauvre stats -h
12 pauvre stats -f nanotest/reads.fastq.gz
13 pauvre stats -f nanotest/reads.fastq.gz -H --filt_maxqual 12
0 [egg_info]
1 tag_build =
2 tag_date = 0
3
0 #!/usr/bin/env python3
1 # -*- coding: utf-8 -*-
2
3 # pauvre
4 # Copyright (c) 2016-2020 Darrin T. Schultz.
5 #
6 # This file is part of pauvre.
7 #
8 # pauvre is free software: you can redistribute it and/or modify
9 # it under the terms of the GNU General Public License as published by
10 # the Free Software Foundation, either version 3 of the License, or
11 # (at your option) any later version.
12 #
13 # pauvre is distributed in the hope that it will be useful,
14 # but WITHOUT ANY WARRANTY; without even the implied warranty of
15 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16 # GNU General Public License for more details.
17 #
18 # You should have received a copy of the GNU General Public License
19 # along with pauvre. If not, see <http://www.gnu.org/licenses/>.
20
21 # Tutorials on how to setup python package here:
22 # - http://python-packaging.readthedocs.io/en/latest/testing.html
23 # - https://jeffknupp.com/blog/2013/08/16/open-sourcing-a-python-project-the-right-way/
24
25 import os
26 from setuptools import setup, find_packages
27
28 version_py = os.path.join(os.path.dirname(__file__), 'pauvre', 'version.py')
29 version = open(version_py).read().strip().split('=')[-1].replace('"', '').strip()
30 print(version)
31
32
33 setup(name='pauvre',
34 requires=['python (>3.0)'],
35 version=version,
36 description='Tools for plotting Oxford Nanopore and other long-read data.',
37 long_description="""
38 'pauvre' is a package for plotting Oxford Nanopore and other long read data.
39 The name means 'poor' in French, a play on words to the oft-used 'pore' prefix
40 for similar packages. This package was designed for python 3, but it might work in
41 python 2. You can visit the gitub page for more detailed information here:
42 https://github.com/conchoecia/pauvre
43 """,
44
45 url='https://github.com/conchoecia/pauvre',
46 author='Darrin Schultz',
47 author_email='dts@ucsc.edu',
48 classifiers=[
49 'Development Status :: 2 - Pre-Alpha',
50 'License :: OSI Approved :: GNU General Public License v3 (GPLv3)',
51 'Programming Language :: Python :: 3',
52 'Programming Language :: Python :: 3.5',
53 'Operating System :: POSIX :: Linux',
54 'Topic :: Scientific/Engineering :: Bio-Informatics',
55 'Intended Audience :: Science/Research'
56 ],
57 license='GPLv3',
58 provides=['pauvre'],
59 packages=find_packages() + ['scripts'],
60 python_requires='>=3',
61 install_requires=[
62 "matplotlib >= 2.0.2",
63 "biopython >= 1.68",
64 "pandas >= 0.20.1",
65 "numpy >= 1.12.1",
66 "scipy",
67 "sklearn"
68 ],
69 entry_points={
70 'console_scripts': ['pauvre=pauvre.pauvre_main:main'],
71 },
72 zip_safe=False,
73 include_package_data=True)