Commit 8944f5916920e9b56dd61f3c8b440d37ada5bc60 - python-pauvre

New upstream version 0.1924 Andreas Tille 3 years ago

34 changed file(s) with 6109 addition(s) and 0 deletion(s). Raw diff Collapse all Expand all

-0

MANIFEST.in less more

include scripts/test.sh

+26

-0

PKG-INFO less more

	0	Metadata-Version: 1.2
	1	Name: pauvre
	2	Version: 0.1924
	3	Summary: Tools for plotting Oxford Nanopore and other long-read data.
	4	Home-page: https://github.com/conchoecia/pauvre
	5	Author: Darrin Schultz
	6	Author-email: dts@ucsc.edu
	7	License: GPLv3
	8	Description:
	9	'pauvre' is a package for plotting Oxford Nanopore and other long read data.
	10	The name means 'poor' in French, a play on words to the oft-used 'pore' prefix
	11	for similar packages. This package was designed for python 3, but it might work in
	12	python 2. You can visit the gitub page for more detailed information here:
	13	https://github.com/conchoecia/pauvre
	14
	15	Platform: UNKNOWN
	16	Classifier: Development Status :: 2 - Pre-Alpha
	17	Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
	18	Classifier: Programming Language :: Python :: 3
	19	Classifier: Programming Language :: Python :: 3.5
	20	Classifier: Operating System :: POSIX :: Linux
	21	Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
	22	Classifier: Intended Audience :: Science/Research
	23	Requires: python (>3.0)
	24	Provides: pauvre
	25	Requires-Python: >=3

+157

-0

README.md less more

	0	[![travis-ci](https://travis-ci.org/conchoecia/pauvre.svg?branch=master)](https://travis-ci.org/conchoecia/pauvre) [![DOI](https://zenodo.org/badge/112774670.svg)](https://zenodo.org/badge/latestdoi/112774670)
	1
	2
	3	## pauvre: a plotting package designed for nanopore and PacBio long reads
	4
	5	This package currently hosts four scripts for plotting and/or printing stats.
	6
	7	- `pauvre marginplot`
	8	- takes a fastq file as input and outputs a marginal histogram with a heatmap.
	9	- `pauvre stats`
	10	- Takes a fastq file as input and prints out a table of stats, including how many basepairs/reads there are for a length/mean quality cutoff.
	11	- This is also automagically called when using `pauvre marginplot`
	12	- `pauvre redwood`
	13	- I am happy to introduce the redwood plot to the world as a method
	14	of representing circular genomes. A redwood plot contains long
	15	reads as "rings" on the inside, a gene annotation
	16	"cambrium/phloem", and a RNAseq "bark". The input is `.bam` files
	17	for the long reads and RNAseq data, and a `.gff` file for the
	18	annotation. More details to follow as we document this program
	19	better...
	20	- `pauvre synteny`
	21	- Makes a synteny plot of circular genomes. Finds the most
	22	parsimonius rotation to display the synteny of all the input
	23	genomes with the fewest crossings-over. Input is one `.gff` file
	24	per circular genome and one directory of gene alignments.
	25
	26	## Updates:
	27	- 20200215 - v0.1.924 - Made some minor updates to work with python 3.7 and the latest version of pandas,
	28	- 20171130 - v0.1.86 - some changes by @wdecoster to integrate `pauvre` into [nanoplot](https://github.com/wdecoster/NanoPlot),
	29	as well as some formatting changes that may make `pauvre` work better with python2.7. Adding Travis-CI functionality.
	30	- 20171025 - v0.1.83 - added some changes to make marginplot interface
	31	with @wdecoster's [nanoPlot](https://github.com/wdecoster/NanoPlot)
	32	package, and made `pauvre stats` only output data tables for
	33	filtered reads. `pauvre stats` also now has the `--filt_maxlen`,
	34	`--filt_maxqual`, `--filt_minlen`, and `--filt_minqual` options.
	35	- 20171018 - v0.1.8 - you can now filter reads and adjust the plotting viewing window.
	36	[See below for a demonstration.](#filter-reads-and-adjust-viewing-window) I added the following options:
	37
	38	```
	39	--filt_maxlen FILT_MAXLEN
	40	This sets the max read length filter reads.
	41	--filt_maxqual FILT_MAXQUAL
	42	This sets the max mean read quality to filter reads.
	43	--filt_minlen FILT_MINLEN
	44	This sets the min read length to filter reads.
	45	--filt_minqual FILT_MINQUAL
	46	This sets the min mean read quality to filter reads.
	47	--plot_maxlen PLOT_MAXLEN
	48	Sets the maximum viewing area in the length dimension.
	49	--plot_maxqual PLOT_MAXQUAL
	50	Sets the maximum viewing area in the quality
	51	dimension.
	52	--plot_minlen PLOT_MINLEN
	53	Sets the minimum viewing area in the length dimension.
	54	--plot_minqual PLOT_MINQUAL
	55	Sets the minimum viewing area in the quality
	56	dimension.
	57	```
	58	- 20171014 - uploading information on `pauvre redwood` and `pauvre synteny` usage.
	59	- 20171012 - made `pauvre stats` more consistently produce useful histograms.
	60	`pauvre stats` now also calculates some statistics for different size ranges.
	61	- 20170529 - added automatic scaling to the input fastq file. It
	62	scales to show the highest read quality and the top 99th percentile
	63	of reads by length.
	64
	65	# Requirements
	66
	67	- You must have the following installed on your system to install this software:
	68	- python 3.x
	69	- matplotlib
	70	- biopython
	71	- pandas
	72	- pillow
	73
	74	# Installation
	75
	76	- Instructions to install on your mac or linux system. Not sure on
	77	Windows! Make sure python 3 is the active environment before
	78	installing.
	79	- `git clone https://github.com/conchoecia/pauvre.git`
	80	- `cd ./pauvre`
	81	- `pip3 install .`
	82	- Or, install with pip
	83	- `pip3 install pauvre`
	84
	85	# Usage
	86
	87	## `stats`
	88	- generate basic statistics about the fastq file. For example, if I
	89	want to know the number of bases and reads with AT LEAST a PHRED
	90	score of 5 and AT LEAST a read length of 500, run the program as below
	91	and look at the cells highlighted with `<braces>`.
	92	- `pauvre stats --fastq miniDSMN15.fastq`
	93
	94
	95	```
	96	numReads: 1000
	97	numBasepairs: 1029114
	98	meanLen: 1029.114
	99	medianLen: 875.5
	100	minLen: 11
	101	maxLen: 5337
	102	N50: 1278
	103	L50: 296
	104
	105	Basepairs >= bin by mean PHRED and length
	106	minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
	107	0 1029114 1010681 935366 429279 143948 25139 3668 2938 2000 0
	108	500 984212 <968653> 904787 421307 142003 24417 3668 2938 2000 0
	109	1000 659842 649319 616788 300948 103122 17251 2000 2000 2000 0
	110	et cetera...
	111	Number of reads >= bin by mean Phred+Len
	112	minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
	113	0 1000 969 865 366 118 22 3 2 1 0
	114	500 873 <859> 789 347 113 20 3 2 1 0
	115	1000 424 418 396 187 62 11 1 1 1 0
	116	et cetera...
	117	```
	118
	119	## `marginplot`
	120
	121	### Basic usage
	122	- automatically calls `pauvre stats` for each fastq file
	123	- Make the default plot showing the 99th percentile of longest reads
	124	- `pauvre marginplot --fastq miniDSMN15.fastq`
	125	- ![default](files/default_miniDSMN15.png)
	126	- Make a marginal histogram for ONT 2D or 1D^2 cDNA data with a
	127	lower maxlen and higher maxqual.
	128	- `pauvre marginplot --maxlen 4000 --maxqual 25 --lengthbin 50 --fileform pdf png --qualbin 0.5 --fastq miniDSMN15.fastq`
	129	- ![example1](files/miniDSMN15.png)
	130
	131	### Filter reads and adjust viewing window
	132	- Filter out reads with a mean quality less than 5, and a length
	133	less than 800. Zoom in to plot only mean quality of at least 4 and
	134	read length at least 500bp.
	135	- `pauvre marginplot -f miniDSMN15.fastq --filt_minqual 5 --filt_minlen 800 -y --plot_minlen 500 --plot_minqual 4`
	136	- ![test4](files/test4.png)
	137
	138	### Specialized Options
	139
	140	- Plot ONT 1D data with a large tail
	141	- `pauvre marginplot --maxlen 100000 --maxqual 15 --lengthbin 500 <myfile>.fastq`
	142	- Get more resolution on lengths
	143	- `pauvre marginplot --maxlen 100000 --lengthbin 5 <myfile>.fastq`
	144
	145	### Transparency
	146
	147	- Turn off transparency if you just want a white background
	148	- `pauvre marginplot --transparent False <myfile>.fastq`
	149	- Note: transparency is the default behavior
	150	- ![transparency](files/transparency.001.jpeg)
	151
	152	# Contributors
	153
	154	@conchoecia (Darrin Schultz)
	155	@mebbert (Mark Ebbert)
	156	@wdecoster (Wouter De Coster)

-0

pauvre/__init__.py less more

from pauvre.version import __version__

+211

-0

pauvre/bamparse.py less more

	0	#!/usr/bin/env python
	1	# -- coding: utf-8 --
	2
	3	# pauvre - just a pore plotting package
	4	# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
	5	# twitter @conchoecia
	6	#
	7	# This file is part of pauvre.
	8	#
	9	# pauvre is free software: you can redistribute it and/or modify
	10	# it under the terms of the GNU General Public License as published by
	11	# the Free Software Foundation, either version 3 of the License, or
	12	# (at your option) any later version.
	13	#
	14	# pauvre is distributed in the hope that it will be useful,
	15	# but WITHOUT ANY WARRANTY; without even the implied warranty of
	16	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	17	# GNU General Public License for more details.
	18	#
	19	# You should have received a copy of the GNU General Public License
	20	# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
	21	import pysam
	22	import pandas as pd
	23	import os
	24
	25	class BAMParse():
	26	"""This class reads in a sam/bam file and constructs a pandas
	27	dataframe of all the relevant information for the reads to pass on
	28	and plot.
	29	"""
	30	def __init__(self, filename, chrid = None, start = None,
	31	stop = None, doubled = None):
	32	self.filename = filename
	33	self.doubled = doubled
	34	#determine if the file is bam or sam
	35	self.filetype = os.path.splitext(self.filename)[1]
	36	#throw an error if the file is not bam or sam
	37	if self.filetype not in ['.bam']:
	38	raise Exception("""You have provided a file with an extension other than
	39	'.bam', please check your command-line arguments""")
	40	#now make sure there is an index file for the bam file
	41	if not os.path.exists("{}.bai".format(self.filename)):
	42	raise Exception("""Your .bam file is there, but it isn't indexed and
	43	there isn't a .bai file to go with it. Use
	44	'samtools index <yourfile>.bam' to fix it.""")
	45	#now open the file and just call it a sambam file
	46	filetype_dict = {'.sam': '', '.bam': 'b'}
	47	self.sambam = pysam.AlignmentFile(self.filename, "r{}".format(filetype_dict[self.filetype]))
	48	if chrid == None:
	49	self.chrid = self.sambam.references[0]
	50	else:
	51	self.chrid = chrid
	52	self.refindex = self.sambam.references.index(self.chrid)
	53	self.seqlength = self.sambam.lengths[self.refindex]
	54	self.true_seqlength = self.seqlength if not self.doubled else int(self.seqlength/2)
	55	if start == None or stop == None:
	56	self.start = 1
	57	self.stop = self.true_seqlength
	58
	59	self.features = self.parse()
	60	self.features.sort_values(by=['POS','MAPLEN'], ascending=[True, False] ,inplace=True)
	61	self.features.reset_index(inplace=True)
	62	self.features.drop('index', 1, inplace=True)
	63
	64	self.raw_depthmap = self.get_depthmap()
	65	self.features_depthmap = self.get_features_depthmap()
	66
	67	def get_depthmap(self):
	68	depthmap = [0] * (self.stop - self.start + 1)
	69	for p in self.sambam.pileup(self.chrid, self.start, self.stop):
	70	index = p.reference_pos
	71	if index >= self.true_seqlength:
	72	index -= self.true_seqlength
	73	depthmap[index] += p.nsegments
	74	return depthmap
	75
	76	def get_features_depthmap(self):
	77	"""this method builds a more accurate pileup that is
	78	based on if there is actually a mapped base at any
	79	given position or not. better for long reads and RNA"""
	80	depthmap = [0] * (self.stop - self.start + 1)
	81	print("depthmap is: {} long".format(len(depthmap)))
	82	for index, row in self.features.iterrows():
	83	thisindex = row["POS"] - self.start
	84	for thistup in row["TUPS"]:
	85	b_type = thistup[1]
	86	b_len = thistup[0]
	87	if b_type == "M":
	88	for j in range(b_len):
	89	#this is necessary to reset the index if we wrap
	90	# around to the beginning
	91	if self.doubled and thisindex == len(depthmap):
	92	thisindex = 0
	93	depthmap[thisindex] += 1
	94	thisindex += 1
	95	elif b_type in ["S", "H", "I"]:
	96	pass
	97	elif b_type in ["D", "N"]:
	98	thisindex += b_len
	99	#this is necessary to reset the index if we wrap
	100	# around to the beginning
	101	if self.doubled and thisindex >= len(depthmap):
	102	thisindex = thisindex - len(depthmap)
	103
	104	return depthmap
	105
	106	def parse(self):
	107	data = {'POS': [], 'MAPQ': [], 'TUPS': [] }
	108	for read in self.sambam.fetch(self.chrid, self.start, self.stop):
	109	data['POS'].append(read.reference_start + 1)
	110	data['TUPS'].append(self.cigar_parse(read.cigartuples))
	111	data['MAPQ'].append(read.mapq)
	112	features = pd.DataFrame.from_dict(data, orient='columns')
	113	features['ALNLEN'] = features['TUPS'].apply(self.aln_len)
	114	features['TRULEN'] = features['TUPS'].apply(self.tru_len)
	115	features['MAPLEN'] = features['TUPS'].apply(self.map_len)
	116	features['POS'] = features['POS'].apply(self.fix_pos)
	117	return features
	118
	119	def cigar_parse(self, tuples):
	120	"""
	121	arguments:
	122	<tuples> a CIGAR string tuple list in pysam format
	123
	124	purpose:
	125	This function uses the pysam cigarstring tuples format and returns
	126	a list of tuples in the internal format, [(20, 'M'), (5, "I")], et
	127	cetera. The zeroth element of each tuple is the number of bases for the
	128	CIGAR string feature. The first element of each tuple is the CIGAR
	129	string feature type.
	130
	131	There are several feature types in SAM/BAM files. See below:
	132	'M' - match
	133	'I' - insertion relative to reference
	134	'D' - deletion relative to reference
	135	'N' - skipped region from the reference
	136	'S' - soft clip, not aligned but still in sam file
	137	'H' - hard clip, not aligned and not in sam file
	138	'P' - padding (silent deletion from padded reference)
	139	'=' - sequence match
	140	'X' - sequence mismatch
	141	'B' - BAM_CBACK (I don't actually know what this is)
	142
	143	"""
	144	# I used the map values from http://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment
	145	psam_to_char = {0: 'M', 1: 'I', 2: 'D', 3: 'N', 4: 'S',
	146	5: 'H', 6: 'P', 7: '=', 8: 'X', 9: 'B'}
	147	return [(value, psam_to_char[feature]) for feature, value in tuples]
	148
	149	def aln_len(self, TUPS):
	150	"""
	151	arguments:
	152	<TUPS> a list of tuples output from the cigar_parse() function.
	153
	154	purpose:
	155	This returns the alignment length of the read to the reference.
	156	Specifically, it sums the length of all of the matches and deletions.
	157	In effect, this number is length of the region of the reference sequence to
	158	which the read maps. This number is probably the most useful for selecting
	159	reads to visualize in the mapped read plot.
	160	"""
	161	return sum([pair[0] for pair in TUPS if pair[1] not in ['S', 'H', 'I']])
	162
	163	def map_len(self, TUPS):
	164	"""
	165	arguments:
	166	<TUPS> a list of tuples output from the cigar_parse() function.
	167
	168	purpose:
	169	This function returns the map length (all matches and deletions relative to
	170	the reference), plus the unmapped 5' and 3' hard/soft clipped sequences.
	171	This number is useful if you want to visualize how much 5' and 3' sequence
	172	of a read did not map to the reference. For example, poor quality 5' and 3'
	173	tails are common in Nanopore reads.
	174	"""
	175	return sum([pair[0] for pair in TUPS if pair[1] not in ['I']])
	176
	177	def tru_len(self, TUPS):
	178	"""
	179	arguments:
	180	<TUPS> a list of tuples output from the cigar_parse() function.
	181
	182	purpose:
	183	This function returns the total length of the read, including insertions,
	184	deletions, matches, soft clips, and hard clips. This is useful for
	185	comparing to the map length or alignment length to see what percentage of
	186	the read aligned to the reference.
	187	"""
	188	return sum([pair[0] for pair in TUPS])
	189
	190	def fix_pos(self, start_index):
	191	"""
	192	arguments:
	193	an int
	194
	195	purpose:
	196	When using a doubled SAMfile, any reads that start after the first copy
	197	of the reference risk running over the plotting window, causing the program
	198	to crash. This function corrects for this issue by changing the start site
	199	of the read.
	200
	201	Note: this will probably break the program if not using a double alignment
	202	since no reads would map past half the length of the single reference
	203	"""
	204	if self.doubled:
	205	if start_index > int(self.seqlength/2):
	206	return start_index - int(self.seqlength/2) - 1
	207	else:
	208	return start_index
	209	else:
	210	return start_index

+426

-0

pauvre/browser.py less more

	0	#!/usr/bin/env python
	1	# -- coding: utf-8 --
	2
	3	# pauvre - a pore plotting package
	4	# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
	5	#
	6	# This file is part of pauvre.
	7	#
	8	# pauvre is free software: you can redistribute it and/or modify
	9	# it under the terms of the GNU General Public License as published by
	10	# the Free Software Foundation, either version 3 of the License, or
	11	# (at your option) any later version.
	12	#
	13	# pauvre is distributed in the hope that it will be useful,
	14	# but WITHOUT ANY WARRANTY; without even the implied warranty of
	15	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	16	# GNU General Public License for more details.
	17	#
	18	# You should have received a copy of the GNU General Public License
	19	# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
	20
	21	# following this tutorial to install helvetica
	22	# https://github.com/olgabot/sciencemeetproductivity.tumblr.com/blob/master/posts/2012/11/how-to-set-helvetica-as-the-default-sans-serif-font-in.md
	23	global hfont
	24	hfont = {'fontname':'Helvetica'}
	25
	26	import matplotlib
	27	matplotlib.use('Agg')
	28	import matplotlib.pyplot as plt
	29	from matplotlib.colors import LinearSegmentedColormap, Normalize
	30	import matplotlib.patches as patches
	31
	32
	33	import gffutils
	34	import pandas as pd
	35	pd.set_option('display.max_columns', 500)
	36	pd.set_option('display.width', 1000)
	37	import numpy as np
	38	import os
	39	import pauvre.rcparams as rc
	40	from pauvre.functions import GFFParse, print_images, timestamp
	41	from pauvre import gfftools
	42	from pauvre.lsi.lsi import intersection
	43	from pauvre.bamparse import BAMParse
	44	import progressbar
	45	import platform
	46	import sys
	47	import time
	48
	49	# Biopython stuff
	50	from Bio import SeqIO
	51	import Bio.SubsMat.MatrixInfo as MI
	52
	53
	54	class PlotCommand:
	55	def __init__(self, plotcmd, REF):
	56	self.ref = REF
	57	self.style_choices = []
	58	self.cmdtype = ""
	59	self.path = ""
	60	self.style = ""
	61	self.options = ""
	62	self._parse_cmd(plotcmd)
	63
	64	def _parse_cmd(self, plotcmd):
	65	chunks = plotcmd.split(":")
	66	if chunks[0] == "ref":
	67	self.cmdtype = "ref"
	68	if len(chunks) < 2:
	69	self._len_error()
	70	self.path = self.ref
	71	self.style = chunks[1]
	72	self.style_choices = ["normal", "colorful"]
	73	self._check_style_choices()
	74	if len(chunks) > 2:
	75	self.options = chunks[2].split(",")
	76	elif chunks[0] in ["bam", "peptides"]:
	77	if len(chunks) < 3:
	78	self._len_error()
	79	self.cmdtype = chunks[0]
	80	self.path = os.path.abspath(os.path.expanduser(chunks[1]))
	81	self.style = chunks[2]
	82	if self.cmdtype == "bam":
	83	self.style_choices = ["depth", "reads"]
	84	else:
	85	self.style_choices = ["depth"]
	86	self._check_style_choices()
	87	if len(chunks) > 3:
	88	self.options = chunks[3].split(",")
	89	elif chunks[0] in ["gff3"]:
	90	if len(chunks) < 2:
	91	self._len_error()
	92	self.cmdtype = chunks[0]
	93	self.path = os.path.abspath(os.path.expanduser(chunks[1]))
	94	if len(chunks) > 2:
	95	self.options = chunks[2].split(",")
	96
	97
	98	def _len_error(self):
	99	raise IOError("""You selected {} to plot,
	100	but need to specify the style at least.""".format(self.cmdtype))
	101	def _check_style_choices(self):
	102	if self.style not in self.style_choices:
	103	raise IOError("""You selected {} style for
	104	ref. You must select from {}. """.format(
	105	self.style, self.style_choices))
	106
	107	global dna_color
	108	dna_color = {"A": (81/255, 87/255, 251/255, 1),
	109	"T": (230/255, 228/255, 49/255, 1),
	110	"G": (28/255, 190/255, 32/255, 1),
	111	"C": (220/255, 10/255, 23/255, 1)}
	112
	113	#these are the line width for the different cigar string flags.
	114	# usually, only M, I, D, S, and H appear in bwa mem output
	115	global widthDict
	116	widthDict = {'M':0.45, # match
	117	'I':0.9, # insertion relative to reference
	118	'D':0.05, # deletion relative to reference
	119	'N':0.1, # skipped region from the reference
	120	'S':0.1, # soft clip, not aligned but still in sam file
	121	'H':0.1, # hard clip, not aligned and not in sam file
	122	'P':0.1, # padding (silent deletion from padded reference)
	123	'=':0.1, # sequence match
	124	'X':0.1} # sequence mismatch
	125
	126
	127	global richgrey
	128	richgrey = (60/255, 54/255, 69/255, 1)
	129
	130	def plot_ref(panel, chrid, start, stop, thiscmd):
	131	panel.set_xlim([start, stop])
	132	panel.set_ylim([-2.5, 2.5])
	133	panel.set_xticks([int(val) for val in np.linspace(start, stop, 6)])
	134	if thiscmd.style == "colorful":
	135	thisseq = ""
	136	for record in SeqIO.parse(thiscmd.ref, "fasta"):
	137	if record.id == chrid:
	138	thisseq = record.seq[start-1: stop]
	139	for i in range(len(thisseq)):
	140	left = start + i
	141	bottom = -0.5
	142	width = 1
	143	height = 1
	144	rect = patches.Rectangle((left, bottom),
	145	width, height,
	146	linewidth = 0,
	147	facecolor = dna_color[thisseq[i]] )
	148	panel.add_patch(rect)
	149	return panel
	150
	151	def safe_log10(value):
	152	try:
	153	logval = np.log10(value)
	154	except:
	155	logval = 0
	156	return logval
	157
	158	def plot_bam(panel, chrid, start, stop, thiscmd):
	159	bam = BAMParse(thiscmd.path)
	160	panel.set_xlim([start, stop])
	161	if thiscmd.style == "depth":
	162	maxdepth = max(bam.features_depthmap)
	163	maxdepthlog = safe_log10(maxdepth)
	164	if "log" in thiscmd.options:
	165	panel.set_ylim([-maxdepthlog, maxdepthlog])
	166	panel.set_yticks([int(val) for val in np.linspace(0, maxdepthlog, 2)])
	167
	168	else:
	169	panel.set_yticks([int(val) for val in np.linspace(0, maxdepth, 2)])
	170	if "c" in thiscmd.options:
	171	panel.set_ylim([-maxdepth, maxdepth])
	172	else:
	173	panel.set_ylim([0, maxdepth])
	174
	175
	176	for i in range(len(bam.features_depthmap)):
	177	left = start + i
	178	width = 1
	179	if "c" in thiscmd.options and "log" in thiscmd.options:
	180	bottom = -1 * safe_log10(bam.features_depthmap[i])
	181	height = safe_log10(bam.features_depthmap[i]) * 2
	182	elif "c" in thiscmd.options and "log" not in thiscmd.options:
	183	bottom = -bam.features_depthmap[i]
	184	height = bam.features_depthmap[i] * 2
	185	else:
	186	bottom = 0
	187	height = bam.features_depthmap[i]
	188	if height > 0:
	189	rect = patches.Rectangle((left, bottom),
	190	width, height,
	191	linewidth = 0,
	192	facecolor = richgrey )
	193	panel.add_patch(rect)
	194
	195	if thiscmd.style == "reads":
	196	#If we're plotting reads, we don't need y-axis
	197	panel.tick_params(bottom="off", labelbottom="off",
	198	left ="off", labelleft = "off")
	199	reads = bam.features.copy()
	200	panel.set_xlim([start, stop])
	201	direction = "for"
	202	if direction == 'for':
	203	bav = {"by":['POS','MAPLEN'], "asc": [True, False]}
	204	direction= 'rev'
	205	elif direction == 'rev':
	206	bav = {"by":['POS','MAPLEN'], "asc": [True, False]}
	207	direction = 'for'
	208	reads.sort_values(by=bav["by"], ascending=bav['asc'],inplace=True)
	209	reads.reset_index(drop=True, inplace=True)
	210
	211	depth_count = -1
	212	plotind = start
	213	while len(reads) > 0:
	214	#depth_count -= 1
	215	#print("len of reads is {}".format(len(reads)))
	216	potential = reads.query("POS >= {}".format(plotind))
	217	if len(potential) == 0:
	218	readsindex = 0
	219	#print("resetting plot ind from {} to {}".format(
	220	# plotind, reads.loc[readsindex, "POS"]))
	221	depth_count -= 1
	222
	223	else:
	224	readsindex = int(potential.index.values[0])
	225	#print("pos of potential is {}".format(reads.loc[readsindex, "POS"]))
	226	plotind = reads.loc[readsindex, "POS"]
	227
	228	for TUP in reads.loc[readsindex, "TUPS"]:
	229	b_type = TUP[1]
	230	b_len = TUP[0]
	231	#plotting params
	232	# left same for all.
	233	left = plotind
	234	bottom = depth_count
	235	height = widthDict[b_type]
	236	width = b_len
	237	plot = True
	238	color = richgrey
	239	if b_type in ["H", "S"]:
	240	"""We don't plot hard or sort clips - like IGV"""
	241	plot = False
	242	pass
	243	elif b_type == "M":
	244	"""just plot matches normally"""
	245	plotind += b_len
	246	elif b_type in ["D", "P", "=", "X"]:
	247	"""deletions get an especially thin line"""
	248	plotind += b_len
	249	elif b_type == "I":
	250	"""insertions get a special purple bar"""
	251	left = plotind - (b_len/2)
	252	color = (200/255, 41/255, 226/255, 0.5)
	253	elif b_type == "N":
	254	"""skips for splice junctions, line in middle"""
	255	bottom += (widthDict["M"]/2) - (widthDict["N"]/2)
	256	plotind += b_len
	257	if plot:
	258	rect = patches.Rectangle((left, bottom),
	259	width, height,
	260	linewidth = 0,
	261	facecolor = color )
	262	panel.add_patch(rect)
	263	reads.drop([readsindex], inplace=True)
	264	reads.reset_index(drop = True, inplace=True)
	265	panel.set_ylim([depth_count, 0])
	266
	267	return panel
	268
	269	def plot_gff3(panel, chrid, start, stop, thiscmd):
	270
	271	db = gffutils.create_db(thiscmd.path, ":memory:")
	272	bottom = 0
	273	genes_to_plot = [thing.id
	274	for thing in db.region(
	275	region=(chrid, start, stop),
	276	completely_within=False)
	277	if thing.featuretype == "gene" ]
	278	#print("genes to plot are: " genes_to_plot)
	279	panel.set_xlim([start, stop])
	280	# we don't need labels on one of the axes
	281	#panel.tick_params(bottom="off", labelbottom="off",
	282	# left ="off", labelleft = "off")
	283
	284
	285	ticklabels = []
	286	for geneid in genes_to_plot:
	287	plotnow = False
	288	if "id" in thiscmd.options and geneid in thiscmd.options:
	289	plotnow = True
	290	elif "id" not in thiscmd.options:
	291	plotnow = True
	292	if plotnow:
	293	ticklabels.append(geneid)
	294	if db[geneid].strand == "+":
	295	panel = gfftools._plot_left_to_right_introns_top(panel, geneid, db,
	296	bottom, text = None)
	297	bottom += 1
	298	else:
	299	raise IOError("""Plotting things on the reverse strand is
	300	not yet implemented""")
	301	#print("tick labels are", ticklabels)
	302	panel.set_ylim([0, len(ticklabels)])
	303	yticks_vals = [val for val in np.linspace(0.5, len(ticklabels) - 0.5, len(ticklabels))]
	304	panel.set_yticks(yticks_vals)
	305	print("bottom is: ", bottom)
	306	print("len tick labels is: ", len(ticklabels))
	307	print("intervals are: ", yticks_vals)
	308	panel.set_yticklabels(ticklabels)
	309
	310	return panel
	311
	312	def browser(args):
	313	rc.update_rcParams()
	314	print(args)
	315
	316	# if the user forgot to add a reference, they must add one
	317	if args.REF is None:
	318	raise IOError("You must specify the reference fasta file")
	319
	320	# if the user forgot to add the start and stop,
	321	# Print the id and the start/stop
	322	if args.CHR is None or args.START is None or args.STOP is None:
	323	print("""\n You have forgotten to specify the chromosome,
	324	the start coordinate, or the stop coordinate to plot.
	325	Try something like '-c chr1 --start 20 --stop 2000'.
	326	Here is a list of chromosome ids and their lengths
	327	from the provided reference. The minimum start coordinate
	328	is one and the maximum stop coordinate is the length of
	329	the chromosome.\n\nID\tLength""")
	330	for record in SeqIO.parse(args.REF, "fasta"):
	331	print("{}\t{}".format(record.id, len(record.seq)))
	332	sys.exit(0)
	333
	334	if args.CMD is None:
	335	raise IOError("You must specify a plotting command.")
	336
	337	# now we parse each set of commands
	338	commands = [PlotCommand(thiscmd, args.REF)
	339	for thiscmd in reversed(args.CMD)]
	340
	341	# set the figure dimensions
	342	if args.ratio:
	343	figWidth = args.ratio[0] + 1
	344	figHeight = args.ratio[1] + 1
	345	#set the panel dimensions
	346	panelWidth = args.ratio[0]
	347	panelHeight = args.ratio[1]
	348
	349	else:
	350	figWidth = 7
	351	figHeight = len(commands) + 2
	352	#set the panel dimensions
	353	panelWidth = 5
	354	# panel margin x 2 + panel height = total vertical height
	355	panelHeight = 0.8
	356	panelMargin = 0.1
	357
	358	figure = plt.figure(figsize=(figWidth,figHeight))
	359
	360	#find the margins to center the panel in figure
	361	leftMargin = (figWidth - panelWidth)/2
	362	bottomMargin = ((figHeight - panelHeight)/2) + panelMargin
	363
	364	plot_dict = {"ref": plot_ref,
	365	"bam": plot_bam,
	366	"gff3": plot_gff3
	367	#"peptides": plot_peptides
	368	}
	369
	370	panels = []
	371	for i in range(len(commands)):
	372	thiscmd = commands[i]
	373	if thiscmd.cmdtype in ["gff3", "ref", "peptides"] \
	374	or thiscmd.style == "depth" \
	375	or "narrow" in thiscmd.options:
	376	temp_panelHeight = 0.5
	377	else:
	378	temp_panelHeight = panelHeight
	379	panels.append( plt.axes([leftMargin/figWidth, #left
	380	bottomMargin/figHeight, #bottom
	381	panelWidth/figWidth, #width
	382	temp_panelHeight/figHeight]) #height
	383	)
	384	panels[i].tick_params(axis='both',which='both',\
	385	bottom='off', labelbottom='off',\
	386	left='on', labelleft='on', \
	387	right='off', labelright='off',\
	388	top='off', labeltop='off')
	389	if thiscmd.cmdtype == "ref":
	390	panels[i].tick_params(bottom='on', labelbottom='on')
	391
	392
	393
	394	#turn off some of the axes
	395	panels[i].spines["top"].set_visible(False)
	396	panels[i].spines["bottom"].set_visible(False)
	397	panels[i].spines["right"].set_visible(False)
	398	panels[i].spines["left"].set_visible(False)
	399
	400	panels[i] = plot_dict[thiscmd.cmdtype](panels[i], args.CHR,
	401	args.START, args.STOP,
	402	thiscmd)
	403
	404	bottomMargin = bottomMargin + temp_panelHeight + (2 * panelMargin)
	405
	406	# Print image(s)
	407	if args.BASENAME is None:
	408	file_base = 'browser_{}.png'.format(timestamp())
	409	else:
	410	file_base = args.BASENAME
	411	path = None
	412	if args.path:
	413	path = args.path
	414	transparent = args.transparent
	415	print_images(
	416	base_output_name=file_base,
	417	image_formats=args.fileform,
	418	dpi=args.dpi,
	419	no_timestamp = kwargs["no_timestamp"],
	420	path = path,
	421	transparent=transparent)
	422
	423
	424	def run(args):
	425	browser(args)

+478

-0

pauvre/custommargin.py less more

	0	#!/usr/bin/env python
	1	# -- coding: utf-8 --
	2
	3	# pauvre - just a pore PhD student's plotting package
	4	# Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved.
	5	#
	6	# This file is part of pauvre.
	7	#
	8	# pauvre is free software: you can redistribute it and/or modify
	9	# it under the terms of the GNU General Public License as published by
	10	# the Free Software Foundation, either version 3 of the License, or
	11	# (at your option) any later version.
	12	#
	13	# pauvre is distributed in the hope that it will be useful,
	14	# but WITHOUT ANY WARRANTY; without even the implied warranty of
	15	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	16	# GNU General Public License for more details.
	17	#
	18	# You should have received a copy of the GNU General Public License
	19	# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
	20
	21	import ast
	22	import matplotlib
	23	matplotlib.use('Agg')
	24	import matplotlib.pyplot as plt
	25	import matplotlib.patches as mplpatches
	26	from matplotlib.colors import LinearSegmentedColormap
	27	import numpy as np
	28	import pandas as pd
	29	import os.path as opath
	30	from sys import stderr
	31	from pauvre.functions import print_images
	32	from pauvre.stats import stats
	33	import pauvre.rcparams as rc
	34	import sys
	35	import logging
	36
	37	# logging
	38	logger = logging.getLogger('pauvre')
	39
	40
	41	def generate_panel(panel_left, panel_bottom, panel_width, panel_height,
	42	axis_tick_param='both', which_tick_param='both',
	43	bottom_tick_param='on', label_bottom_tick_param='on',
	44	left_tick_param='on', label_left_tick_param='on',
	45	right_tick_param='off', label_right_tick_param='off',
	46	top_tick_param='off', label_top_tick_param='off'):
	47	"""
	48	Setting default panel tick parameters. Some of these are the defaults
	49	for matplotlib anyway, but specifying them for readability. Here are
	50	options and defaults for the parameters used below:
	51
	52	axis : {'x', 'y', 'both'}; which axis to modify; default = 'both'
	53	which : {'major', 'minor', 'both'}; which ticks to modify;
	54	default = 'major'
	55	bottom, top, left, right : bool or {'on', 'off'}; ticks on or off;
	56	labelbottom, labeltop, labelleft, labelright : bool or {'on', 'off'}
	57	"""
	58
	59	# create the panel
	60	panel_rectangle = [panel_left, panel_bottom, panel_width, panel_height]
	61	panel = plt.axes(panel_rectangle)
	62
	63	# Set tick parameters
	64	panel.tick_params(axis=axis_tick_param,
	65	which=which_tick_param,
	66	bottom=bottom_tick_param,
	67	labelbottom=label_bottom_tick_param,
	68	left=left_tick_param,
	69	labelleft=label_left_tick_param,
	70	right=right_tick_param,
	71	labelright=label_right_tick_param,
	72	top=top_tick_param,
	73	labeltop=label_top_tick_param)
	74
	75	return panel
	76
	77
	78	def _generate_histogram_bin_patches(panel, bins, bin_values, horizontal=True):
	79	"""This helper method generates the histogram that is added to the panel.
	80
	81	In this case, horizontal = True applies to the mean quality histogram.
	82	So, horizontal = False only applies to the length histogram.
	83	"""
	84	l_width = 0.0
	85	f_color = (0.5, 0.5, 0.5)
	86	e_color = (0, 0, 0)
	87	if horizontal:
	88	for step in np.arange(0, len(bin_values), 1):
	89	left = bins[step]
	90	bottom = 0
	91	width = bins[step + 1] - bins[step]
	92	height = bin_values[step]
	93	hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
	94	linewidth=l_width,
	95	facecolor=f_color,
	96	edgecolor=e_color)
	97	panel.add_patch(hist_rectangle)
	98	else:
	99	for step in np.arange(0, len(bin_values), 1):
	100	left = 0
	101	bottom = bins[step]
	102	width = bin_values[step]
	103	height = bins[step + 1] - bins[step]
	104
	105	hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
	106	linewidth=l_width,
	107	facecolor=f_color,
	108	edgecolor=e_color)
	109	panel.add_patch(hist_rectangle)
	110
	111
	112	def generate_histogram(panel, data_list, min_plot_val, max_plot_val,
	113	bin_interval, hist_horizontal=True,
	114	left_spine=True, bottom_spine=True,
	115	top_spine=False, right_spine=False, x_label=None,
	116	y_label=None):
	117
	118	bins = np.arange(0, max_plot_val, bin_interval)
	119
	120	bin_values, bins2 = np.histogram(data_list, bins)
	121
	122	# hist_horizontal is used for quality
	123	if hist_horizontal:
	124	panel.set_xlim([min_plot_val, max_plot_val])
	125	panel.set_ylim([0, max(bin_values * 1.1)])
	126	# and hist_horizontal == Fale is for read length
	127	else:
	128	panel.set_xlim([0, max(bin_values * 1.1)])
	129	panel.set_ylim([min_plot_val, max_plot_val])
	130
	131	# Generate histogram bin patches, depending on whether we're plotting
	132	# vertically or horizontally
	133	_generate_histogram_bin_patches(panel, bins, bin_values, hist_horizontal)
	134
	135	panel.spines['left'].set_visible(left_spine)
	136	panel.spines['bottom'].set_visible(bottom_spine)
	137	panel.spines['top'].set_visible(top_spine)
	138	panel.spines['right'].set_visible(right_spine)
	139
	140	if y_label is not None:
	141	panel.set_ylabel(y_label)
	142	if x_label is not None:
	143	panel.set_xlabel(x_label)
	144
	145	def generate_square_map(panel, data_frame, plot_min_y, plot_min_x,
	146	plot_max_y, plot_max_x, color,
	147	xcol, ycol, **kwargs):
	148	"""This generates the heatmap panels using squares. Everything is
	149	quantized by ints.
	150	"""
	151	panel.set_xlim([plot_min_x, plot_max_x])
	152	panel.set_ylim([plot_min_y, plot_max_y])
	153	tempdf = data_frame[[xcol, ycol]]
	154	data_frame = tempdf.astype(int)
	155
	156	querystring = "{}<={} and {}<={}".format(plot_min_y, ycol, plot_min_x, xcol)
	157	print(" - Filtering squares with {}".format(querystring))
	158	square_this = data_frame.query(querystring)
	159
	160	querystring = "{}<{} and {}<{}".format(ycol, plot_max_y, xcol, plot_max_x)
	161	print(" - Filtering squares with {}".format(querystring))
	162	square_this = square_this.query(querystring)
	163
	164	counts = square_this.groupby([xcol, ycol]).size().reset_index(name='counts')
	165	for index, row in counts.iterrows():
	166	x_pos = row[xcol]
	167	y_pos = row[ycol]
	168	thiscolor = color(row["counts"]/(counts["counts"].max()))
	169	rectangle1=mplpatches.Rectangle((x_pos,y_pos),1,1,
	170	linewidth=0,\
	171	facecolor=thiscolor)
	172	panel.add_patch(rectangle1)
	173
	174	all_counts = counts["counts"]
	175	return all_counts
	176
	177	def generate_heat_map(panel, data_frame, plot_min_y, plot_min_x,
	178	plot_max_y, plot_max_x, color,
	179	xcol, ycol, **kwargs):
	180	panel.set_xlim([plot_min_x, plot_max_x])
	181	panel.set_ylim([plot_min_y, plot_max_y])
	182
	183	querystring = "{}<={} and {}<={}".format(plot_min_y, ycol, plot_min_x, xcol)
	184	print(" - Filtering hexmap with {}".format(querystring))
	185	hex_this = data_frame.query(querystring)
	186
	187	querystring = "{}<{} and {}<{}".format(ycol, plot_max_y, xcol, plot_max_x)
	188	print(" - Filtering hexmap with {}".format(querystring))
	189	hex_this = hex_this.query(querystring)
	190
	191	# This single line controls plotting the hex bins in the panel
	192	hex_vals = panel.hexbin(hex_this[xcol], hex_this[ycol], gridsize=49,
	193	linewidths=0.0, cmap=color)
	194	for each in panel.spines:
	195	panel.spines[each].set_visible(False)
	196
	197	counts = hex_vals.get_array()
	198	return counts
	199
	200	def generate_legend(panel, counts, color):
	201	# completely custom for more control
	202	panel.set_xlim([0, 1])
	203	panel.set_ylim([0, 1000])
	204	panel.set_yticks([int(x) for x in np.linspace(0, 1000, 6)])
	205	panel.set_yticklabels([int(x) for x in np.linspace(0, max(counts), 6)])
	206	for i in np.arange(0, 1001, 1):
	207	rgba = color(i / 1001)
	208	alpha = rgba[-1]
	209	facec = rgba[0:3]
	210	hist_rectangle = mplpatches.Rectangle((0, i), 1, 1,
	211	linewidth=0.0,
	212	facecolor=facec,
	213	edgecolor=(0, 0, 0),
	214	alpha=alpha)
	215	panel.add_patch(hist_rectangle)
	216	panel.spines['top'].set_visible(False)
	217	panel.spines['left'].set_visible(False)
	218	panel.spines['bottom'].set_visible(False)
	219	panel.yaxis.set_label_position("right")
	220	panel.set_ylabel('count')
	221
	222	def custommargin(df, **kwargs):
	223	rc.update_rcParams()
	224
	225	# 250, 231, 34 light yellow
	226	# 67, 1, 85
	227	# R=np.linspace(65/255,1,101)
	228	# G=np.linspace(0/255, 231/255, 101)
	229	# B=np.linspace(85/255, 34/255, 101)
	230	# R=65/255, G=0/255, B=85/255
	231	Rf = 65 / 255
	232	Bf = 85 / 255
	233	pdict = {'red': ((0.0, Rf, Rf),
	234	(1.0, Rf, Rf)),
	235	'green': ((0.0, 0.0, 0.0),
	236	(1.0, 0.0, 0.0)),
	237	'blue': ((0.0, Bf, Bf),
	238	(1.0, Bf, Bf)),
	239	'alpha': ((0.0, 0.0, 0.0),
	240	(1.0, 1.0, 1.0))
	241	}
	242	# Now we will use this example to illustrate 3 ways of
	243	# handling custom colormaps.
	244	# First, the most direct and explicit:
	245	purple1 = LinearSegmentedColormap('Purple1', pdict)
	246
	247	# set the figure dimensions
	248	fig_width = 1.61 * 3
	249	fig_height = 1 * 3
	250	fig = plt.figure(figsize=(fig_width, fig_height))
	251
	252	# set the panel dimensions
	253	heat_map_panel_width = fig_width * 0.5
	254	heat_map_panel_height = heat_map_panel_width * 0.62
	255
	256	# find the margins to center the panel in figure
	257	fig_left_margin = fig_bottom_margin = (1 / 6)
	258
	259	# lengthPanel
	260	y_panel_width = (1 / 8)
	261
	262	# the color Bar parameters
	263	legend_panel_width = (1 / 24)
	264
	265	# define padding
	266	h_padding = 0.02
	267	v_padding = 0.05
	268
	269	# Set whether to include y-axes in histograms
	270	print(" - Setting panel options.", file = sys.stderr)
	271	if kwargs["Y_AXES"]:
	272	y_bottom_spine = True
	273	y_bottom_tick = 'on'
	274	y_bottom_label = 'on'
	275	x_left_spine = True
	276	x_left_tick = 'on'
	277	x_left_label = 'on'
	278	x_y_label = 'Count'
	279	else:
	280	y_bottom_spine = False
	281	y_bottom_tick = 'off'
	282	y_bottom_label = 'off'
	283	x_left_spine = False
	284	x_left_tick = 'off'
	285	x_left_label = 'off'
	286	x_y_label = None
	287
	288	panels = []
	289
	290	# Quality histogram panel
	291	print(" - Generating the x-axis panel.", file = sys.stderr)
	292	x_panel_left = fig_left_margin + y_panel_width + h_padding
	293	x_panel_width = heat_map_panel_width / fig_width
	294	x_panel_height = y_panel_width * fig_width / fig_height
	295	x_panel = generate_panel(x_panel_left,
	296	fig_bottom_margin,
	297	x_panel_width,
	298	x_panel_height,
	299	left_tick_param=x_left_tick,
	300	label_left_tick_param=x_left_label)
	301	panels.append(x_panel)
	302
	303	# y histogram panel
	304	print(" - Generating the y-axis panel.", file = sys.stderr)
	305	y_panel_bottom = fig_bottom_margin + x_panel_height + v_padding
	306	y_panel_height = heat_map_panel_height / fig_height
	307	y_panel = generate_panel(fig_left_margin,
	308	y_panel_bottom,
	309	y_panel_width,
	310	y_panel_height,
	311	bottom_tick_param=y_bottom_tick,
	312	label_bottom_tick_param=y_bottom_label)
	313	panels.append(y_panel)
	314
	315	# Heat map panel
	316	heat_map_panel_left = fig_left_margin + y_panel_width + h_padding
	317	heat_map_panel_bottom = fig_bottom_margin + x_panel_height + v_padding
	318	print(" - Generating the heat map panel.", file = sys.stderr)
	319	heat_map_panel = generate_panel(heat_map_panel_left,
	320	heat_map_panel_bottom,
	321	heat_map_panel_width / fig_width,
	322	heat_map_panel_height / fig_height,
	323	bottom_tick_param='off',
	324	label_bottom_tick_param='off',
	325	left_tick_param='off',
	326	label_left_tick_param='off')
	327	panels.append(heat_map_panel)
	328	heat_map_panel.set_title(kwargs["title"])
	329
	330	# Legend panel
	331	print(" - Generating the legend panel.", file = sys.stderr)
	332	legend_panel_left = fig_left_margin + y_panel_width + \
	333	heat_map_panel_width / fig_width + h_padding
	334	legend_panel_bottom = fig_bottom_margin + x_panel_height + v_padding
	335	legend_panel_height = heat_map_panel_height / fig_height
	336	legend_panel = generate_panel(legend_panel_left, legend_panel_bottom,
	337	legend_panel_width, legend_panel_height,
	338	bottom_tick_param='off',
	339	label_bottom_tick_param='off',
	340	left_tick_param='off',
	341	label_left_tick_param='off',
	342	right_tick_param='on',
	343	label_right_tick_param='on')
	344	panels.append(legend_panel)
	345
	346	#
	347	# Everything above this is just to set up the panels
	348	#
	349	##################################################################
	350
	351	# Set max and min viewing window for the xaxis
	352	if kwargs["plot_max_x"]:
	353	plot_max_x = kwargs["plot_max_x"]
	354	else:
	355	if kwargs["square"]:
	356	plot_max_x = df[kwargs["xcol"]].max()
	357	plot_max_x = max(np.ceil(df[kwargs["xcol"]]))
	358	plot_min_x = kwargs["plot_min_x"]
	359
	360	# Set x bin sizes
	361	if kwargs["xbin"]:
	362	x_bin_interval = kwargs["xbin"]
	363	else:
	364	# again, this is just based on what looks good from experience
	365	x_bin_interval = 1
	366
	367	# Generate x histogram
	368	print(" - Generating the x-axis histogram.", file = sys.stderr)
	369	generate_histogram(panel = x_panel,
	370	data_list = df[kwargs['xcol']],
	371	min_plot_val = plot_min_x,
	372	max_plot_val = plot_max_x,
	373	bin_interval = x_bin_interval,
	374	hist_horizontal = True,
	375	x_label=kwargs['xcol'],
	376	y_label=x_y_label,
	377	left_spine=x_left_spine)
	378
	379	# Set max and min viewing window for the y axis
	380	if kwargs["plot_max_y"]:
	381	plot_max_y = kwargs["plot_max_y"]
	382	else:
	383	if kwargs["square"]:
	384	plot_max_y = df[kwargs["ycol"]].max()
	385	else:
	386	plot_max_y = max(np.ceil(df[kwargs["ycol"]]))
	387
	388	plot_min_y = kwargs["plot_min_y"]
	389	# Set y bin sizes
	390	if kwargs["ybin"]:
	391	y_bin_interval = kwargs["ybin"]
	392	else:
	393	y_bin_interval = 1
	394
	395	# Generate y histogram
	396	print(" - Generating the y-axis histogram.", file = sys.stderr)
	397	generate_histogram(panel = y_panel,
	398	data_list = df[kwargs['ycol']],
	399	min_plot_val = plot_min_y,
	400	max_plot_val = plot_max_y,
	401	bin_interval = y_bin_interval,
	402	hist_horizontal = False,
	403	y_label = kwargs['ycol'],
	404	bottom_spine = y_bottom_spine)
	405
	406	# Generate heat map
	407	if kwargs["square"]:
	408	print(" - Generating the square heatmap.", file = sys.stderr)
	409	counts = generate_square_map(panel = heat_map_panel,
	410	data_frame = df,
	411	plot_min_y = plot_min_y,
	412	plot_min_x = plot_min_x,
	413	plot_max_y = plot_max_y,
	414	plot_max_x = plot_max_x,
	415	color = purple1,
	416	xcol = kwargs["xcol"],
	417	ycol = kwargs["ycol"])
	418	else:
	419	print(" - Generating the heatmap.", file = sys.stderr)
	420	counts = generate_heat_map(panel = heat_map_panel,
	421	data_frame = df,
	422	plot_min_y = plot_min_y,
	423	plot_min_x = plot_min_x,
	424	plot_max_y = plot_max_y,
	425	plot_max_x = plot_max_x,
	426	color = purple1,
	427	xcol = kwargs["xcol"],
	428	ycol = kwargs["ycol"])
	429
	430	# Generate legend
	431	print(" - Generating the legend.", file = sys.stderr)
	432	generate_legend(legend_panel, counts, purple1)
	433
	434	# inform the user of the plotting window if not quiet mode
	435	#if not kwargs["QUIET"]:
	436	# print("""plotting in the following window:
	437	# {0} <= Q-score (x-axis) <= {1}
	438	# {2} <= length (y-axis) <= {3}""".format(
	439	# plot_min_x, plot_max_x, min_plot_val, max_plot_val),
	440	# file=stderr)
	441
	442	# Print image(s)
	443	if kwargs["output_base_name"] is None:
	444	file_base = "custommargin"
	445	else:
	446	file_base = kwargs["output_base_name"]
	447
	448	print(" - Saving your images", file = sys.stderr)
	449	print_images(
	450	base =file_base,
	451	image_formats=kwargs["fileform"],
	452	dpi=kwargs["dpi"],
	453	no_timestamp = kwargs["no_timestamp"],
	454	transparent= kwargs["no_transparent"])
	455
	456	def run(args):
	457	print(args)
	458	if not opath.exists(args.input_file):
	459	raise IOError("The input file does not exist: {}".format(
	460	args.input_file))
	461	df = pd.read_csv(args.input_file, header='infer', sep='\t')
	462	# make sure that the column names that were specified are actually
	463	# in the dataframe
	464	if args.xcol not in df.columns:
	465	raise IOError("""The x-column name that you specified, {}, is not in the
	466	dataframe column names: {}""".format(args.xcol, df.columns))
	467	if args.ycol not in df.columns:
	468	raise IOError("""The y-column name that you specified, {}, is not in the
	469	dataframe column names: {}""".format(args.ycol, df.columns))
	470	print(" - Successfully read csv file. Here are a few lines:",
	471	file = sys.stderr)
	472	print(df.head(), file = sys.stderr)
	473	print(" - Plotting {} on the x-axis".format(args.xcol),file=sys.stderr)
	474	print(df[args.xcol].head(), file = sys.stderr)
	475	print(" - Plotting {} on the y-axis".format(args.ycol),file=sys.stderr)
	476	print(df[args.ycol].head(), file = sys.stderr)
	477	custommargin(df=df.dropna(), **vars(args))

+347

-0

pauvre/functions.py less more

	0	#!/usr/bin/env python
	1	# -- coding: utf-8 --
	2
	3	# pauvre
	4	# Copyright (c) 2016-2020 Darrin T. Schultz.
	5	#
	6	# This file is part of pauvre.
	7	#
	8	# pauvre is free software: you can redistribute it and/or modify
	9	# it under the terms of the GNU General Public License as published by
	10	# the Free Software Foundation, either version 3 of the License, or
	11	# (at your option) any later version.
	12	#
	13	# pauvre is distributed in the hope that it will be useful,
	14	# but WITHOUT ANY WARRANTY; without even the implied warranty of
	15	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	16	# GNU General Public License for more details.
	17	#
	18	# You should have received a copy of the GNU General Public License
	19	# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
	20
	21	from Bio import SeqIO
	22	import copy
	23	import gzip
	24	import matplotlib.pyplot as plt
	25	import numpy as np
	26	import os
	27	import pandas as pd
	28	from sys import stderr
	29	import time
	30
	31
	32	# this makes opening files more robust for different platforms
	33	# currently only used in GFFParse
	34	import codecs
	35
	36	import warnings
	37
	38	def print_images(base, image_formats, dpi,
	39	transparent=False, no_timestamp = False):
	40	"""
	41	Save the plot in multiple formats, with or without transparency
	42	and with or without timestamps.
	43	"""
	44	for fmt in image_formats:
	45	if no_timestamp:
	46	out_name = "{0}.{1}".format(base, fmt)
	47	else:
	48	out_name = "{0}_{1}.{2}".format(base, timestamp(), fmt)
	49	try:
	50	if fmt == 'png':
	51	plt.savefig(out_name, dpi=dpi, transparent=transparent)
	52	else:
	53	plt.savefig(out_name, format=fmt, transparent=transparent)
	54	except PermissionError:
	55	# thanks to https://github.com/wdecoster for the suggestion
	56	print("""You don't have permission to save pauvre plots to this
	57	directory. Try changing the directory and running the script again!""")
	58
	59	class GFFParse():
	60	def __init__(self, filename, stop_codons=None, species=None):
	61	self.filename = filename
	62	self.samplename = os.path.splitext(os.path.basename(filename))[0]
	63	self.species = species
	64	self.featureDict = {"name": [],
	65	"featType": [],
	66	"start": [],
	67	"stop": [],
	68	"strand": []}
	69	gffnames = ["sequence", "source", "featType", "start", "stop", "dunno1",
	70	"strand", "dunno2", "tags"]
	71	self.features = pd.read_csv(self.filename, comment='#',
	72	sep='\t', names=gffnames)
	73	self.features['name'] = self.features['tags'].apply(self._get_name)
	74	self.features.drop('dunno1', 1, inplace=True)
	75	self.features.drop('dunno2', 1, inplace=True)
	76	self.features.reset_index(inplace=True, drop=True)
	77	# warn the user if there are CDS or gene entries not divisible by three
	78	self._check_triplets()
	79	# sort the database by start
	80	self.features.sort_values(by='start', ascending=True, inplace=True)
	81	if stop_codons:
	82	strip_codons = ['gene', 'CDS']
	83	# if the direction is forward, subtract three from the stop to bring it closer to the start
	84	self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '+'), 'stop'] =\
	85	self.features.loc[(self.features['featType'].isin(strip_codons))
	86	& (self.features['strand'] == '+'), 'stop'] - 3
	87	# if the direction is reverse, add three to the start (since the coords are flip-flopped)
	88	self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '-'), 'start'] =\
	89	self.features.loc[(self.features['featType'].isin(strip_codons))
	90	& (self.features['strand'] == '-'), 'start'] + 3
	91	self.features['center'] = self.features['start'] + \
	92	((self.features['stop'] - self.features['start']) / 2)
	93	# we need to add one since it doesn't account for the last base otherwise
	94	self.features['width'] = abs(self.features['stop'] - self.features['start']) + 1
	95	self.features['lmost'] = self.features.apply(self._determine_lmost, axis=1)
	96	self.features['rmost'] = self.features.apply(self._determine_rmost, axis=1)
	97	self.features['track'] = 0
	98	if len(self.features.loc[self.features['tags'] == "Is_circular=true", 'stop']) < 1:
	99	raise IOError("""The GFF file needs to have a tag ending in "Is_circular=true"
	100	with a region from 1 to the number of bases in the mitogenome
	101
	102	example:
	103	Bf201311 Geneious region 1 13337 . + 0 Is_circular=true
	104	""")
	105	self.seqlen = int(self.features.loc[self.features['tags'] == "Is_circular=true", 'stop'])
	106	self.features.reset_index(inplace=True, drop=True)
	107	#print("float", self.features.loc[self.features['name'] == 'COX1', 'center'])
	108	#print("float cat", len(self.features.loc[self.features['name'] == 'CAT', 'center']))
	109	# print(self.features)
	110	# print(self.seqlen)
	111
	112	def set_features(self, new_features):
	113	"""all this does is reset the features pandas dataframe"""
	114	self.features = new_features
	115
	116	def get_unique_genes(self):
	117	"""This returns a series of gene names"""
	118	plottable = self.features.query(
	119	"featType != 'tRNA' and featType != 'region' and featType != 'source'")
	120	return set(plottable['name'].unique())
	121
	122	def shuffle(self):
	123	"""
	124	this returns a list of all possible shuffles of features.
	125	A shuffle is when the frontmost bit of coding + noncoding DNA up
	126	until the next bit of coding DNA is removed and tagged on the
	127	end of the sequence. In this case this process is represented by
	128	shifting gff coordinates.
	129	"""
	130	shuffles = []
	131	# get the index of the first element
	132	# get the index of the next thing
	133	# subtract the indices of everything, then reset the ones that are below
	134	# zero
	135	done = False
	136	shuffle_features = self.features[self.features['featType'].isin(
	137	['gene', 'rRNA', 'CDS', 'tRNA'])].copy(deep=True)
	138	# we first add the shuffle features without reorganizing
	139	# print("shuffle\n",shuffle_features)
	140	add_first = copy.deepcopy(self)
	141	add_first.set_features(shuffle_features)
	142	shuffles.append(add_first)
	143	# first gene is changed with every iteration
	144	first_gene = list(shuffle_features['name'])[0]
	145	# absolute first is the first gene in the original gff file, used to determine if we are done in this while loop
	146	absolute_first = list(shuffle_features['name'])[0]
	147	while not done:
	148	# We need to prevent the case of shuffling in the middle of
	149	# overlapped genes. Do this by ensuring that the the start of
	150	# end of first gene is less than the start of the next gene.
	151	first_stop = int(shuffle_features.loc[shuffle_features['name'] == first_gene, 'stop'])
	152	next_gene = ""
	153	for next_index in range(1, len(shuffle_features)):
	154	# get the df of the next list, if len == 0, then it is a tRNA and we need to go to the next index
	155	next_gene_df = list(
	156	shuffle_features[shuffle_features['featType'].isin(['gene', 'rRNA', 'CDS'])]['name'])
	157	if len(next_gene_df) != 0:
	158	next_gene = next_gene_df[next_index]
	159	next_start = int(shuffle_features.loc[shuffle_features['name'] == next_gene, 'start'])
	160	#print("looking at {}, prev_stop is {}, start is {}".format(
	161	# next_gene, first_stop, next_start))
	162	#print(shuffle_features[shuffle_features['featType'].isin(['gene', 'rRNA', 'CDS'])])
	163	# if the gene we're looking at and the next one don't overlap, move on
	164	if first_stop < next_start:
	165	break
	166	#print("next_gene before checking for first is {}".format(next_gene))
	167	if next_gene == absolute_first:
	168	done = True
	169	break
	170	# now we can reset the first gene for the next iteration
	171	first_gene = next_gene
	172	shuffle_features = shuffle_features.copy(deep=True)
	173	# figure out where the next start point is going to be
	174	next_start = int(shuffle_features.loc[shuffle_features['name'] == next_gene, 'start'])
	175	#print('next gene: {}'.format(next_gene))
	176	shuffle_features['start'] = shuffle_features['start'] - next_start + 1
	177	shuffle_features['stop'] = shuffle_features['stop'] - next_start + 1
	178	shuffle_features['center'] = shuffle_features['center'] - next_start + 1
	179	# now correct the values that are less than 0
	180	shuffle_features.loc[shuffle_features['start'] < 1,
	181	'start'] = shuffle_features.loc[shuffle_features['start'] < 1, 'start'] + self.seqlen
	182	shuffle_features.loc[shuffle_features['stop'] < 1, 'stop'] = shuffle_features.loc[shuffle_features['stop']
	183	< 1, 'start'] + shuffle_features.loc[shuffle_features['stop'] < 1, 'width']
	184	shuffle_features['center'] = shuffle_features['start'] + \
	185	((shuffle_features['stop'] - shuffle_features['start']) / 2)
	186	shuffle_features['lmost'] = shuffle_features.apply(self._determine_lmost, axis=1)
	187	shuffle_features['rmost'] = shuffle_features.apply(self._determine_rmost, axis=1)
	188	shuffle_features.sort_values(by='start', ascending=True, inplace=True)
	189	shuffle_features.reset_index(inplace=True, drop=True)
	190	new_copy = copy.deepcopy(self)
	191	new_copy.set_features(shuffle_features)
	192	shuffles.append(new_copy)
	193	#print("len shuffles: {}".format(len(shuffles)))
	194	return shuffles
	195
	196	def couple(self, other_GFF, this_y=0, other_y=1):
	197	"""
	198	Compares this set of features to another set and generates tuples of
	199	(x,y) coordinate pairs to input into lsi
	200	"""
	201	other_features = other_GFF.features
	202	coordinates = []
	203	for thisname in self.features['name']:
	204	othermatch = other_features.loc[other_features['name'] == thisname, 'center']
	205	if len(othermatch) == 1:
	206	this_x = float(self.features.loc[self.features['name']
	207	== thisname, 'center']) # /self.seqlen
	208	other_x = float(othermatch) # /other_GFF.seqlen
	209	# lsi can't handle vertical or horizontal lines, and we don't
	210	# need them either for our comparison. Don't add if equivalent.
	211	if this_x != other_x:
	212	these_coords = ((this_x, this_y), (other_x, other_y))
	213	coordinates.append(these_coords)
	214	return coordinates
	215
	216	def _check_triplets(self):
	217	"""This method verifies that all entries of featType gene and CDS are
	218	divisible by three"""
	219	genesCDSs = self.features.query("featType == 'CDS' or featType == 'gene'")
	220	not_trips = genesCDSs.loc[((abs(genesCDSs['stop'] - genesCDSs['start']) + 1) % 3) > 0, ]
	221	if len(not_trips) > 0:
	222	print_string = ""
	223	print_string += "There are CDS and gene entries that are not divisible by three\n"
	224	print_string += str(not_trips)
	225	warnings.warn(print_string, SyntaxWarning)
	226
	227	def _get_name(self, tag_value):
	228	"""This extracts a name from a single row in 'tags' of the pandas
	229	dataframe
	230	"""
	231	try:
	232	if ";" in tag_value:
	233	name = tag_value[5:].split(';')[0]
	234	else:
	235	name = tag_value[5:].split()[0]
	236	except:
	237	name = tag_value
	238	print("Couldn't correctly parse {}".format(
	239	tag_value))
	240	return name
	241
	242	def _determine_lmost(self, row):
	243	"""Booleans don't work well for pandas dataframes, so I need to use apply
	244	"""
	245	if row['start'] < row['stop']:
	246	return row['start']
	247	else:
	248	return row['stop']
	249
	250	def _determine_rmost(self, row):
	251	"""Booleans don't work well for pandas dataframes, so I need to use apply
	252	"""
	253	if row['start'] < row['stop']:
	254	return row['stop']
	255	else:
	256	return row['start']
	257
	258
	259	def parse_fastq_length_meanqual(fastq):
	260	"""
	261	arguments:
	262	<fastq> the fastq file path. Hopefully it has been verified to exist already
	263
	264	purpose:
	265	This function parses a fastq and returns a pandas dataframe of read lengths
	266	and read meanQuals.
	267	"""
	268	# First try to open the file with the gzip package. It will crash
	269	# if the file is not gzipped, so this is an easy way to test if
	270	# the fastq file is gzipped or not.
	271	try:
	272	handle = gzip.open(fastq, "rt")
	273	length, meanQual = _fastq_parse_helper(handle)
	274	except:
	275	handle = open(fastq, "r")
	276	length, meanQual = _fastq_parse_helper(handle)
	277
	278	handle.close()
	279	df = pd.DataFrame(list(zip(length, meanQual)), columns=['length', 'meanQual'])
	280	return df
	281
	282
	283	def filter_fastq_length_meanqual(df, min_len, max_len,
	284	min_mqual, max_mqual):
	285	querystring = "length >= {0} and meanQual >= {1}".format(min_len, min_mqual)
	286	if max_len != None:
	287	querystring += " and length <= {}".format(max_len)
	288	if max_mqual != None:
	289	querystring += " and meanQual <= {}".format(max_mqual)
	290	print("Keeping reads that satisfy: {}".format(querystring), file=stderr)
	291	filtdf = df.query(querystring)
	292	#filtdf["length"] = pd.to_numeric(filtdf["length"], errors='coerce')
	293	#filtdf["meanQual"] = pd.to_numeric(filtdf["meanQual"], errors='coerce')
	294	return filtdf
	295
	296
	297	def _fastq_parse_helper(handle):
	298	length = []
	299	meanQual = []
	300	for record in SeqIO.parse(handle, "fastq"):
	301	if len(record) > 0:
	302	meanQual.append(_arithmetic_mean(record.letter_annotations["phred_quality"]))
	303	length.append(len(record))
	304	return length, meanQual
	305
	306
	307	def _geometric_mean(phred_values):
	308	"""in case I want geometric mean in the future, can calculate it like this"""
	309	# np.mean(record.letter_annotations["phred_quality"]))
	310	pass
	311
	312
	313	def _arithmetic_mean(phred_values):
	314	"""
	315	Convert Phred to 1-accuracy (error probabilities), calculate the arithmetic mean,
	316	log transform back to Phred.
	317	"""
	318	if not isinstance(phred_values, np.ndarray):
	319	phred_values = np.array(phred_values)
	320	return _erate_to_phred(np.mean(_phred_to_erate(phred_values)))
	321
	322
	323	def _phred_to_erate(phred_values):
	324	"""
	325	converts a list or numpy array of phred values to a numpy array
	326	of error rates
	327	"""
	328	if not isinstance(phred_values, np.ndarray):
	329	phred_values = np.array(phred_values)
	330	return np.power(10, (-1 * (phred_values / 10)))
	331
	332
	333	def _erate_to_phred(erate_values):
	334	"""
	335	converts a list or numpy array of error rates to a numpy array
	336	of phred values
	337	"""
	338	if not isinstance(erate_values, np.ndarray):
	339	phred_values = np.array(erate_values)
	340	return -10 * np.log10(erate_values)
	341
	342	def timestamp():
	343	"""
	344	Returns the current time in :samp:`YYYYMMDD_HHMMSS` format.
	345	"""
	346	return time.strftime("%Y%m%d_%H%M%S")

+581

-0

pauvre/gfftools.py less more

	0	#!/usr/bin/env python
	1	# -- coding: utf-8 --
	2
	3	# pauvre - a pore plotting package
	4	# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
	5	#
	6	# This file is part of pauvre.
	7	#
	8	# pauvre is free software: you can redistribute it and/or modify
	9	# it under the terms of the GNU General Public License as published by
	10	# the Free Software Foundation, either version 3 of the License, or
	11	# (at your option) any later version.
	12	#
	13	# pauvre is distributed in the hope that it will be useful,
	14	# but WITHOUT ANY WARRANTY; without even the implied warranty of
	15	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	16	# GNU General Public License for more details.
	17	#
	18	# You should have received a copy of the GNU General Public License
	19	# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
	20
	21	"""This file contains things related to parsing and plotting GFF files"""
	22
	23	import copy
	24	from matplotlib.path import Path
	25	import matplotlib.patches as patches
	26
	27	global chevron_width
	28	global arrow_width
	29	global min_text
	30	global text_cutoff
	31
	32	arrow_width = 80
	33	chevron_width = 40
	34	min_text = 550
	35	text_cutoff = 150
	36	import sys
	37
	38	global colorMap
	39	colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red',
	40	'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white',
	41	'ORF':'orange'}
	42
	43	def _plot_left_to_right_introns(panel, geneid, db, y_pos, text = None):
	44	""" plots a left to right patch with introns when there are no intervening
	45	sequences to consider. Uses a gene id and gffutils database as input.
	46	b
	47	a .-=^=-. c
	48	1__________2---/ e `---1__________2
	49	\| #lff \f d\| #lff \
	50	\| left to \3 \| left to \3
	51	\| right / \| right /
	52	5___________/4 5___________/4
	53	"""
	54	#first we need to determine the number of exons
	55	bar_thickness = 0.75
	56	#now we can start plotting the exons
	57	exonlist = list(db.children(geneid, featuretype='CDS', order_by="start"))
	58	for i in range(len(exonlist)):
	59	cds_start = exonlist[i].start
	60	cds_stop = exonlist[i].stop
	61	verts = [(cds_start, y_pos + bar_thickness), #1
	62	(cds_stop - chevron_width, y_pos + bar_thickness), #2
	63	(cds_stop, y_pos + (bar_thickness/2)), #3
	64	(cds_stop - chevron_width, y_pos), #4
	65	(cds_start, y_pos), #5
	66	(cds_start, y_pos + bar_thickness), #1
	67	]
	68	codes = [Path.MOVETO,
	69	Path.LINETO,
	70	Path.LINETO,
	71	Path.LINETO,
	72	Path.LINETO,
	73	Path.CLOSEPOLY,
	74	]
	75	path = Path(verts, codes)
	76	patch = patches.PathPatch(path, lw = 0,
	77	fc=colorMap['CDS'] )
	78	panel.add_patch(patch)
	79
	80	# we must draw the splice junction
	81	if i < len(exonlist) - 1:
	82	next_start = exonlist[i+1].start
	83	next_stop = exonlist[i+1].stop
	84	middle = cds_stop + ((next_start - cds_stop)/2)
	85
	86	verts = [(cds_stop - chevron_width, y_pos + bar_thickness), #2/a
	87	(middle, y_pos + 0.95), #b
	88	(next_start, y_pos + bar_thickness), #c
	89	(next_start, y_pos + bar_thickness - 0.05), #d
	90	(middle, y_pos + 0.95 - 0.05), #e
	91	(cds_stop - chevron_width, y_pos + bar_thickness -0.05), #f
	92	(cds_stop - chevron_width, y_pos + bar_thickness), #2/a
	93	]
	94	codes = [Path.MOVETO,
	95	Path.LINETO,
	96	Path.LINETO,
	97	Path.LINETO,
	98	Path.LINETO,
	99	Path.LINETO,
	100	Path.CLOSEPOLY,
	101	]
	102	path = Path(verts, codes)
	103	patch = patches.PathPatch(path, lw = 0,
	104	fc=colorMap['CDS'] )
	105	panel.add_patch(patch)
	106
	107	return panel
	108
	109	def _plot_left_to_right_introns_top(panel, geneid, db, y_pos, text = None):
	110	""" slightly different from the above version such thatsplice junctions
	111	are more visually explicit.
	112
	113	plots a left to right patch with introns when there are no intervening
	114	sequences to consider. Uses a gene id and gffutils database as input.
	115	b
	116	a .-=^=-. c
	117	1_____________2---/ e `---1_____________2
	118	\| #lff /f d\| #lff /
	119	\| left to / \| left to /
	120	\| right / \| right /
	121	4_________/3 4_________/3
	122	"""
	123	#first we need to determine the number of exons
	124	bar_thickness = 0.75
	125	#now we can start plotting the exons
	126	exonlist = list(db.children(geneid, featuretype='CDS', order_by="start"))
	127	for i in range(len(exonlist)):
	128	cds_start = exonlist[i].start
	129	cds_stop = exonlist[i].stop
	130	verts = [(cds_start, y_pos + bar_thickness), #1
	131	(cds_stop, y_pos + bar_thickness), #2
	132	(cds_stop - chevron_width, y_pos), #4
	133	(cds_start, y_pos), #5
	134	(cds_start, y_pos + bar_thickness), #1
	135	]
	136	codes = [Path.MOVETO,
	137	Path.LINETO,
	138	Path.LINETO,
	139	Path.LINETO,
	140	Path.CLOSEPOLY,
	141	]
	142	path = Path(verts, codes)
	143	patch = patches.PathPatch(path, lw = 0,
	144	fc=colorMap['CDS'] )
	145	panel.add_patch(patch)
	146
	147	# we must draw the splice junction
	148	if i < len(exonlist) - 1:
	149	next_start = exonlist[i+1].start
	150	next_stop = exonlist[i+1].stop
	151	middle = cds_stop + ((next_start - cds_stop)/2)
	152
	153	verts = [(cds_stop-5, y_pos + bar_thickness), #2/a
	154	(middle, y_pos + 0.95), #b
	155	(next_start, y_pos + bar_thickness), #c
	156	(next_start, y_pos + bar_thickness - 0.05), #d
	157	(middle, y_pos + 0.95 - 0.05), #e
	158	(cds_stop-5, y_pos + bar_thickness -0.05), #f
	159	(cds_stop-5, y_pos + bar_thickness), #2/a
	160	]
	161	codes = [Path.MOVETO,
	162	Path.LINETO,
	163	Path.LINETO,
	164	Path.LINETO,
	165	Path.LINETO,
	166	Path.LINETO,
	167	Path.CLOSEPOLY,
	168	]
	169	path = Path(verts, codes)
	170	patch = patches.PathPatch(path, lw = 0,
	171	fc=colorMap['CDS'] )
	172	panel.add_patch(patch)
	173
	174	return panel
	175
	176	def _plot_lff(panel, left_df, right_df, colorMap, y_pos, bar_thickness, text):
	177	""" plots a lff patch
	178	1__________2 ____________
	179	\| #lff \ \ #rff \
	180	\| left for \3 \ right for \
	181	\| forward / / forward /
	182	5___________/4 /___________/
	183	"""
	184	#if there is only one feature to plot, then just plot it
	185
	186	print("plotting lff")
	187	verts = [(left_df['start'], y_pos + bar_thickness), #1
	188	(right_df['start'] - chevron_width, y_pos + bar_thickness), #2
	189	(left_df['stop'], y_pos + (bar_thickness/2)), #3
	190	(right_df['start'] - chevron_width, y_pos), #4
	191	(left_df['start'], y_pos), #5
	192	(left_df['start'], y_pos + bar_thickness), #1
	193	]
	194	codes = [Path.MOVETO,
	195	Path.LINETO,
	196	Path.LINETO,
	197	Path.LINETO,
	198	Path.LINETO,
	199	Path.CLOSEPOLY,
	200	]
	201	path = Path(verts, codes)
	202	patch = patches.PathPatch(path, lw = 0,
	203	fc=colorMap[left_df['featType']] )
	204	text_width = left_df['width']
	205	if text and text_width >= min_text:
	206	panel = _plot_label(panel, left_df, y_pos, bar_thickness)
	207	elif text and text_width < min_text and text_width >= text_cutoff:
	208	panel = _plot_label(panel, left_df,
	209	y_pos, bar_thickness,
	210	rotate = True, arrow = True)
	211
	212	return panel, patch
	213
	214	def _plot_label(panel, df, y_pos, bar_thickness, rotate = False, arrow = False):
	215	# handles the case where a dataframe was passed
	216	fontsize = 8
	217	rotation = 0
	218	if rotate:
	219	fontsize = 5
	220	rotation = 90
	221	if len(df) == 1:
	222	x =((df.loc[0, 'stop'] - df.loc[0, 'start'])/2) + df.loc[0, 'start']
	223	y = y_pos + (bar_thickness/2)
	224	# if we need to center somewhere other than the arrow, need to adjust
	225	# for the direction of the arrow
	226	# it doesn't look good if it shifts by the whole arrow width, so only
	227	# shift by half the arrow width
	228	if arrow:
	229	if df.loc[0, 'strand'] == "+":
	230	shift_start = df.loc[0, 'start']
	231	else:
	232	shift_start = df.loc[0, 'start'] + (arrow_width/2)
	233	x =((df.loc[0, 'stop'] - (arrow_width/2) - df.loc[0, 'start'])/2) + shift_start
	234	panel.text(x, y,
	235	df.loc[0, 'name'], fontsize = fontsize,
	236	ha='center', va='center',
	237	color = 'white', family = 'monospace',
	238	zorder = 100, rotation = rotation)
	239	# and the case where a series was passed
	240	else:
	241	x = ((df['stop'] - df['start'])/2) + df['start']
	242	y = y_pos + (bar_thickness/2)
	243	if arrow:
	244	if df['strand'] == "+":
	245	shift_start = df['start']
	246	else:
	247	shift_start = df['start'] + (arrow_width/2)
	248	x =((df['stop'] - (arrow_width/2) - df['start'])/2) + shift_start
	249	panel.text(x, y,
	250	df['name'], fontsize = fontsize,
	251	ha='center', va='center',
	252	color = 'white', family = 'monospace',
	253	zorder = 100, rotation = rotation)
	254
	255	return panel
	256
	257	def _plot_rff(panel, left_df, right_df, colorMap, y_pos, bar_thickness, text):
	258	""" plots a rff patch
	259	____________ 1__________2
	260	\| #lff \ \ #rff \
	261	\| left for \ 6\ right for \3
	262	\| forward / / forward /
	263	\|___________/ /5__________/4
	264	"""
	265	#if there is only one feature to plot, then just plot it
	266
	267	print("plotting rff")
	268	verts = [(right_df['start'], y_pos + bar_thickness), #1
	269	(right_df['stop'] - arrow_width, y_pos + bar_thickness), #2
	270	(right_df['stop'], y_pos + (bar_thickness/2)), #3
	271	(right_df['stop'] - arrow_width, y_pos), #4
	272	(right_df['start'], y_pos), #5
	273	(left_df['stop'] + chevron_width, y_pos + (bar_thickness/2)), #6
	274	(right_df['start'], y_pos + bar_thickness), #1
	275	]
	276	codes = [Path.MOVETO,
	277	Path.LINETO,
	278	Path.LINETO,
	279	Path.LINETO,
	280	Path.LINETO,
	281	Path.LINETO,
	282	Path.CLOSEPOLY,
	283	]
	284	path = Path(verts, codes)
	285	patch = patches.PathPatch(path, lw = 0,
	286	fc=colorMap[right_df['featType']] )
	287	text_width = right_df['width']
	288	if text and text_width >= min_text:
	289	panel = _plot_label(panel, right_df, y_pos, bar_thickness)
	290	elif text and text_width < min_text and text_width >= text_cutoff:
	291	panel = _plot_label(panel, right_df,
	292	y_pos, bar_thickness, rotate = True)
	293	return panel, patch
	294
	295	def x_offset_gff(GFFParseobj, x_offset):
	296	"""Takes in a gff object (a gff file parsed as a pandas dataframe),
	297	and an x_offset value and shifts the start, stop, center, lmost, and rmost.
	298
	299	Returns a GFFParse object with the shifted values in GFFParse.features.
	300	"""
	301	for columnname in ['start', 'stop', 'center', 'lmost', 'rmost']:
	302	GFFParseobj.features[columnname] = GFFParseobj.features[columnname] + x_offset
	303	return GFFParseobj
	304
	305	def gffplot_horizontal(figure, panel, args, gff_object,
	306	track_width=0.2, start_y=0.1, **kwargs):
	307	"""
	308	this plots horizontal things from gff files. it was probably written for synplot,
	309	as the browser does not use this at all.
	310	"""
	311	# Because this size should be relative to the circle that it is plotted next
	312	# to, define the start_radius as the place to work from, and the width of
	313	# each track.
	314	colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red',
	315	'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white'}
	316	augment = 0
	317	bar_thickness = 0.9 * track_width
	318	# return these at the end
	319	myPatches=[]
	320	plot_order = []
	321
	322	idone = False
	323	# we need to filter out the tRNAs since those are plotted last
	324	plottable_features = gff_object.features.query("featType != 'tRNA' and featType != 'region' and featType != 'source'")
	325	plottable_features.reset_index(inplace=True, drop=True)
	326	print(plottable_features)
	327
	328	len_plottable = len(plottable_features)
	329	print('len plottable', len_plottable)
	330	# - this for loop relies on the gff features to already be sorted
	331	# - The algorithm for this loop works by starting at the 0th index of the
	332	# plottable features (i).
	333	# - It then looks to see if the next object (the jth) overlaps with the
	334	# ith element.
	335	i = 0
	336	j = 1
	337	while i < len(plottable_features):
	338	if i + j == len(plottable_features):
	339	#we have run off of the df and need to include everything from i to the end
	340	these_features = plottable_features.loc[i::,].copy(deep=True)
	341	these_features = these_features.reset_index()
	342	print(these_features)
	343	plot_order.append(these_features)
	344	i = len(plottable_features)
	345	break
	346	print(" - i,j are currently: {},{}".format(i, j))
	347	stop = plottable_features.loc[i]["stop"]
	348	start = plottable_features.loc[i+j]["start"]
	349	print("stop: {}. start: {}.".format(stop, start))
	350	if plottable_features.loc[i]["stop"] <= plottable_features.loc[i+j]["start"]:
	351	print(" - putting elements {} through (including) {} together".format(i, i+j))
	352	these_features = plottable_features.loc[i:i+j-1,].copy(deep=True)
	353	these_features = these_features.reset_index()
	354	print(these_features)
	355	plot_order.append(these_features)
	356	i += 1
	357	j = 1
	358	else:
	359	j += 1
	360
	361	#while idone == False:
	362	# print("im in the overlap-pairing while loop i={}".format(i))
	363	# # look ahead at all of the elements that overlap with the ith element
	364	# jdone = False
	365	# j = 1
	366	# this_set_minimum_index = i
	367	# this_set_maximum_index = i
	368	# while jdone == False:
	369	# print("new i= {} j={} len={}".format(i, j, len_plottable))
	370	# print("len plottable in jdone: {}".format(len_plottable))
	371	# print("plottable features in jdone:\n {}".format(plottable_features))
	372	# # first make sure that we haven't gone off the end of the dataframe
	373	# # This is an edge case where i has a jth element that overlaps with it,
	374	# # and j is the last element in the plottable features.
	375	# if i+j == len_plottable:
	376	# print("i+j == len_plottable")
	377	# # this checks for the case that i is the last element of the
	378	# # plottable features.
	379	# # In both of the above cases, we are done with both the ith and
	380	# # the jth features.
	381	# if i == len_plottable-1:
	382	# print("i == len_plottable-1")
	383
	384	# # this is the last analysis, so set idone to true
	385	# # to finish after this
	386	# idone = True
	387	# # the last one can't be in its own group, so just add it solo
	388	# these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,].copy(deep=True)
	389	# plot_order.append(these_features.reset_index(drop=True))
	390	# break
	391	# jdone = True
	392	# else:
	393	# print("i+j != len_plottable")
	394	# # if the lmost of the next gene overlaps with the rmost of
	395	# # the current one, it overlaps and couple together
	396	# if plottable_features.loc[i+j, 'lmost'] < plottable_features.loc[i, 'rmost']:
	397	# print("lmost < rmost")
	398	# # note that this feature overlaps with the current
	399	# this_set_maximum_index = i+j
	400	# # ... and we need to look at the next in line
	401	# j += 1
	402	# else:
	403	# print("lmost !< rmost")
	404	# i += 1 + (this_set_maximum_index - this_set_minimum_index)
	405	# #add all of the things that grouped together once we don't find any more groups
	406	# these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,].copy(deep=True)
	407	# plot_order.append(these_features.reset_index(drop=True))
	408	# jdone = True
	409	# print("plot order is now: {}".format(plot_order))
	410	# print("jdone: {}".format(str(jdone)))
	411
	412	for feature_set in plot_order:
	413	# plot_feature_hori handles overlapping cases as well as normal cases
	414	panel, patches = gffplot_feature_hori(figure, panel, feature_set, colorMap,
	415	start_y, bar_thickness, text = True)
	416	for each in patches:
	417	print("there are {} patches after gffplot_feature_hori".format(len(patches)))
	418	print(each)
	419	myPatches.append(each)
	420	print("length of myPatches is: {}".format(len(myPatches)))
	421
	422	# Now we add all of the tRNAs to this to plot, do it last to overlay
	423	# everything else
	424	tRNAs = gff_object.features.query("featType == 'tRNA'")
	425	tRNAs.reset_index(inplace=True, drop = True)
	426	tRNA_bar_thickness = bar_thickness * (0.8)
	427	tRNA_start_y = start_y + ((bar_thickness - tRNA_bar_thickness)/2)
	428	for i in range(0,len(tRNAs)):
	429	this_feature = tRNAs[i:i+1].copy(deep=True)
	430	this_feature.reset_index(inplace=True, drop = True)
	431	panel, patches = gffplot_feature_hori(figure, panel, this_feature, colorMap,
	432	tRNA_start_y, tRNA_bar_thickness, text = True)
	433	for patch in patches:
	434	myPatches.append(patch)
	435	print("There are {} patches at the end of gffplot_horizontal()".format(len(myPatches)))
	436	return panel, myPatches
	437
	438	def gffplot_feature_hori(figure, panel, feature_df,
	439	colorMap, y_pos, bar_thickness, text=True):
	440	"""This plots the track for a feature, and if there is something for
	441	'this_feature_overlaps_feature', then there is special processing to
	442	add the white bar and the extra slope for the chevron
	443	"""
	444	myPatches = []
	445	#if there is only one feature to plot, then just plot it
	446	if len(feature_df) == 1:
	447	#print("plotting a single thing: {} {}".format(str(feature_df['sequence']).split()[1],
	448	# str(feature_df['featType']).split()[1] ))
	449	#print(this_feature['name'], "is not overlapping")
	450	# This plots this shape: 1_________2 2_________1
	451	# \| forward \3 3/ reverse \|
	452	# \|5__________/4 \4________5\|
	453	if feature_df.loc[0,'strand'] == '+':
	454	verts = [(feature_df.loc[0, 'start'], y_pos + bar_thickness), #1
	455	(feature_df.loc[0, 'stop'] - arrow_width, y_pos + bar_thickness), #2
	456	(feature_df.loc[0, 'stop'], y_pos + (bar_thickness/2)), #3
	457	(feature_df.loc[0, 'stop'] - arrow_width, y_pos), #4
	458	(feature_df.loc[0, 'start'], y_pos), #5
	459	(feature_df.loc[0, 'start'], y_pos + bar_thickness)] #1
	460	elif feature_df.loc[0,'strand'] == '-':
	461	verts = [(feature_df.loc[0, 'stop'], y_pos + bar_thickness), #1
	462	(feature_df.loc[0, 'start'] + arrow_width, y_pos + bar_thickness), #2
	463	(feature_df.loc[0, 'start'], y_pos + (bar_thickness/2)), #3
	464	(feature_df.loc[0, 'start'] + arrow_width, y_pos), #4
	465	(feature_df.loc[0, 'stop'], y_pos), #5
	466	(feature_df.loc[0, 'stop'], y_pos + bar_thickness)] #1
	467	feat_width = feature_df.loc[0,'width']
	468	if text and feat_width >= min_text:
	469	panel = _plot_label(panel, feature_df.loc[0,],
	470	y_pos, bar_thickness)
	471	elif text and feat_width < min_text and feat_width >= text_cutoff:
	472	panel = _plot_label(panel, feature_df.loc[0,],
	473	y_pos, bar_thickness,
	474	rotate = True, arrow = True)
	475
	476	codes = [Path.MOVETO,
	477	Path.LINETO,
	478	Path.LINETO,
	479	Path.LINETO,
	480	Path.LINETO,
	481	Path.CLOSEPOLY]
	482	path = Path(verts, codes)
	483	print("normal path is: {}".format(path))
	484	# If the feature itself is smaller than the arrow, we need to take special measures to
	485	if feature_df.loc[0,'width'] <= arrow_width:
	486	path = Path([verts[i] for i in [0,2,4,5]],
	487	[codes[i] for i in [0,2,4,5]])
	488	patch = patches.PathPatch(path, lw = 0,
	489	fc=colorMap[feature_df.loc[0, 'featType']] )
	490	myPatches.append(patch)
	491	# there are four possible scenarios if there are two overlapping sequences:
	492	# ___________ ____________ ____________ ___________
	493	# \| #1 \ \ #1 \ / #2 / / #2 \|
	494	# \| both seqs \ \ both seqs \ / both seqs / / both seqs \|
	495	# \| forward / / forward / \ reverse \ \ reverse \|
	496	# \|__________/ /___________/ \___________\ \___________\|
	497	# ___________ _____________ ____________ _ _________
	498	# \| #3 \ \ #3 \| / #2 _\| #2 \
	499	# \| one seq \ \ one seq \| / one seq \|_ one seq \
	500	# \| forward \ \ reverse \| \ reverse _\| forward /
	501	# \|_____________\ \_________\| \__________\|_ ___________/
	502	#
	503	# These different scenarios can be thought of as different left/right
	504	# flanking segment types.
	505	# In the annotation #rff:
	506	# - 'r' refers to the annotation type as being on the right
	507	# - the first 'f' refers to the what element is to the left of this one.
	508	# Since it is forward the 5' end of this annotation must be a chevron
	509	# - the second 'f' refers to the right side of this element. Since it is
	510	# forward it must be a normal arrow.
	511	# being on the right
	512	#
	513	# LEFT TYPES RIGHT TYPES
	514	# ____________ ____________
	515	# \| #lff \ \ #rff \
	516	# \| left for \ \ right for \
	517	# \| forward / / forward /
	518	# \|___________/ /___________/
	519	# ___________ _____________
	520	# \| #lfr \ \ #rfr \|
	521	# \| left for \ \ right for \|
	522	# \| reverse \ \ reverse \|
	523	# \|_____________\ \_________\|
	524	# ____________ ___________
	525	# / #lrr / / #rrr \|
	526	# / left rev / / right rev \|
	527	# \ reverse \ \ reverse \|
	528	# \___________\ \___________\|
	529	# ____________ __________
	530	# / #lrf _\| _\| #rrf \
	531	# / left rev \|_ \| _ right rev \
	532	# \ forward _\| _\| forward /
	533	# \__________\| \|____________/
	534	#
	535	# To properly plot these elements, we must go through each element of the
	536	# feature_df to determine which patch type it is.
	537	elif len(feature_df) == 2:
	538	print("im in here feat len=2")
	539	for i in range(len(feature_df)):
	540	# this tests for which left type we're dealing with
	541	if i == 0:
	542	# type could be lff or lfr
	543	if feature_df.loc[i, 'strand'] == '+':
	544	if feature_df.loc[i + 1, 'strand'] == '+':
	545	# plot a lff type
	546	panel, patch = _plot_lff(panel, feature_df.iloc[i,], feature_df.iloc[i+1,],
	547	colorMap, y_pos, bar_thickness, text)
	548	myPatches.append(patch)
	549	elif feature_df.loc[i + 1, 'strand'] == '-':
	550	#plot a lfr type
	551	raise IOError("can't plot {} patches yet".format("lfr"))
	552	# or type could be lrr or lrf
	553	elif feature_df.loc[i, 'strand'] == '-':
	554	if feature_df.loc[i + 1, 'strand'] == '+':
	555	# plot a lrf type
	556	raise IOError("can't plot {} patches yet".format("lrf"))
	557	elif feature_df.loc[i + 1, 'strand'] == '-':
	558	#plot a lrr type
	559	raise IOError("can't plot {} patches yet".format("lrr"))
	560	# in this case we're only dealing with 'right type' patches
	561	elif i == len(feature_df) - 1:
	562	# type could be rff or rfr
	563	if feature_df.loc[i-1, 'strand'] == '+':
	564	if feature_df.loc[i, 'strand'] == '+':
	565	# plot a rff type
	566	panel, patch = _plot_rff(panel, feature_df.iloc[i-1,], feature_df.iloc[i,],
	567	colorMap, y_pos, bar_thickness, text)
	568	myPatches.append(patch)
	569	elif feature_df.loc[i, 'strand'] == '-':
	570	#plot a rfr type
	571	raise IOError("can't plot {} patches yet".format("rfr"))
	572	# or type could be rrr or rrf
	573	elif feature_df.loc[i-1, 'strand'] == '-':
	574	if feature_df.loc[i, 'strand'] == '+':
	575	# plot a rrf type
	576	raise IOError("can't plot {} patches yet".format("rrf"))
	577	elif feature_df.loc[i, 'strand'] == '-':
	578	#plot a rrr type
	579	raise IOError("can't plot {} patches yet".format("rrr"))
	580	return panel, myPatches

+77

-0

pauvre/lsi/Q.py less more

	0	# Binary search tree that holds status of sweep line. Only leaves hold values.
	1	# Operations for finding left and right neighbors of a query point p and finding which segments contain p.
	2	# Author: Sam Lichtenberg
	3	# Email: splichte@princeton.edu
	4	# Date: 09/02/2013
	5
	6	from pauvre.lsi.helper import *
	7
	8	ev = 0.00000001
	9
	10	class Q:
	11	def __init__(self, key, value):
	12	self.key = key
	13	self.value = value
	14	self.left = None
	15	self.right = None
	16
	17	def find(self, key):
	18	if self.key is None:
	19	return False
	20	c = compare_by_y(key, self.key)
	21	if c==0:
	22	return True
	23	elif c==-1:
	24	if self.left:
	25	self.left.find(key)
	26	else:
	27	return False
	28	else:
	29	if self.right:
	30	self.right.find(key)
	31	else:
	32	return False
	33	def insert(self, key, value):
	34	if self.key is None:
	35	self.key = key
	36	self.value = value
	37	c = compare_by_y(key, self.key)
	38	if c==0:
	39	self.value += value
	40	elif c==-1:
	41	if self.left is None:
	42	self.left = Q(key, value)
	43	else:
	44	self.left.insert(key, value)
	45	else:
	46	if self.right is None:
	47	self.right = Q(key, value)
	48	else:
	49	self.right.insert(key, value)
	50	# must return key AND value
	51	def get_and_del_min(self, parent=None):
	52	if self.left is not None:
	53	return self.left.get_and_del_min(self)
	54	else:
	55	k = self.key
	56	v = self.value
	57	if parent:
	58	parent.left = self.right
	59	# i.e. is root node
	60	else:
	61	if self.right:
	62	self.key = self.right.key
	63	self.value = self.right.value
	64	self.left = self.right.left
	65	self.right = self.right.right
	66	else:
	67	self.key = None
	68	return k,v
	69
	70	def print_tree(self):
	71	if self.left:
	72	self.left.print_tree()
	73	print(self.key)
	74	print(self.value)
	75	if self.right:
	76	self.right.print_tree()

+322

-0

pauvre/lsi/T.py less more

	0	# Binary search tree that holds status of sweep line. Only leaves hold values.
	1	# Operations for finding left and right neighbors of a query point p and finding which segments contain p.
	2	# Author: Sam Lichtenberg
	3	# Email: splichte@princeton.edu
	4	# Date: 09/02/2013
	5
	6	from pauvre.lsi.helper import *
	7
	8	ev = 0.00000001
	9
	10	class T:
	11	def __init__(self):
	12	self.root = Node(None, None, None, None)
	13	def contain_p(self, p):
	14	if self.root.value is None:
	15	return [[], []]
	16	lists = [[], []]
	17	self.root.contain_p(p, lists)
	18	return (lists[0], lists[1])
	19	def get_left_neighbor(self, p):
	20	if self.root.value is None:
	21	return None
	22	return self.root.get_left_neighbor(p)
	23	def get_right_neighbor(self, p):
	24	if self.root.value is None:
	25	return None
	26	return self.root.get_right_neighbor(p)
	27	def insert(self, key, s):
	28	if self.root.value is None:
	29	self.root.left = Node(s, None, None, self.root)
	30	self.root.value = s
	31	self.root.m = get_slope(s)
	32	else:
	33	(node, path) = self.root.find_insert_pt(key, s)
	34	if path == 'r':
	35	node.right = Node(s, None, None, node)
	36	node.right.adjust()
	37	elif path == 'l':
	38	node.left = Node(s, None, None, node)
	39	else:
	40	# this means matching Node was a leaf
	41	# need to make a new internal Node
	42	if node.compare_to_key(key) < 0 or (node.compare_to_key(key)==0 and node.compare_lower(key, s) < 1):
	43	new_internal = Node(s, None, node, node.parent)
	44	new_leaf = Node(s, None, None, new_internal)
	45	new_internal.left = new_leaf
	46	if node is node.parent.left:
	47	node.parent.left = new_internal
	48	node.adjust()
	49	else:
	50	node.parent.right = new_internal
	51	else:
	52	new_internal = Node(node.value, node, None, node.parent)
	53	new_leaf = Node(s, None, None, new_internal)
	54	new_internal.right = new_leaf
	55	if node is node.parent.left:
	56	node.parent.left = new_internal
	57	new_leaf.adjust()
	58	else:
	59	node.parent.right = new_internal
	60	node.parent = new_internal
	61
	62	def delete(self, p, s):
	63	key = p
	64	node = self.root.find_delete_pt(key, s)
	65	val = node.value
	66	if node is node.parent.left:
	67	parent = node.parent.parent
	68	if parent is None:
	69	if self.root.right is not None:
	70	if self.root.right.left or self.root.right.right:
	71	self.root = self.root.right
	72	self.root.parent = None
	73	else:
	74	self.root.left = self.root.right
	75	self.root.value = self.root.right.value
	76	self.root.m = self.root.right.m
	77	self.root.right = None
	78	else:
	79	self.root.left = None
	80	self.root.value = None
	81	elif node.parent is parent.left:
	82	parent.left = node.parent.right
	83	node.parent.right.parent = parent
	84	else:
	85	parent.right = node.parent.right
	86	node.parent.right.parent = parent
	87	else:
	88	parent = node.parent.parent
	89	if parent is None:
	90	if self.root.left:
	91	# switch properties
	92	if self.root.left.right or self.root.left.left:
	93	self.root = self.root.left
	94	self.root.parent = None
	95	else:
	96	self.root.right = None
	97	else:
	98	self.root.right = None
	99	self.root.value = None
	100	elif node.parent is parent.left:
	101	parent.left = node.parent.left
	102	node.parent.left.parent = parent
	103	farright = node.parent.left
	104	while farright.right is not None:
	105	farright = farright.right
	106	farright.adjust()
	107	else:
	108	parent.right = node.parent.left
	109	node.parent.left.parent = parent
	110	farright = node.parent.left
	111	while farright.right is not None:
	112	farright = farright.right
	113	farright.adjust()
	114	return val
	115
	116	def print_tree(self):
	117	self.root.print_tree()
	118	class Node:
	119	def __init__(self, value, left, right, parent):
	120	self.value = value # associated line segment
	121	self.left = left
	122	self.right = right
	123	self.parent = parent
	124	self.m = None
	125	if value is not None:
	126	self.m = get_slope(value)
	127
	128	# compares line segment at y-val of p to p
	129	# TODO: remove this and replace with get_x_at
	130	def compare_to_key(self, p):
	131	x0 = self.value[0][0]
	132	y0 = self.value[0][1]
	133	y1 = p[1]
	134	if self.m != 0 and self.m is not None:
	135	x1 = x0 - float(y0-y1)/self.m
	136	return compare_by_x(p, (x1, y1))
	137	else:
	138	x1 = p[0]
	139	return 0
	140
	141	def get_left_neighbor(self, p):
	142	neighbor = None
	143	n = self
	144	if n.left is None and n.right is None:
	145	return neighbor
	146	last_right = None
	147	found = False
	148	while not found:
	149	c = n.compare_to_key(p)
	150	if c < 1 and n.left:
	151	n = n.left
	152	elif c==1 and n.right:
	153	n = n.right
	154	last_right = n.parent
	155	else:
	156	found = True
	157	c = n.compare_to_key(p)
	158	if c==0:
	159	if n is n.parent.right:
	160	return n.parent
	161	else:
	162	goright = None
	163	if last_right:
	164	goright =last_right.left
	165	return self.get_lr(None, goright)[0]
	166	# n stores the highest-value in the left subtree
	167	if c==-1:
	168	goright = None
	169	if last_right:
	170	goright = last_right.left
	171	return self.get_lr(None, goright)[0]
	172	if c==1:
	173	neighbor = n
	174	return neighbor
	175
	176	def get_right_neighbor(self, p):
	177	neighbor = None
	178	n = self
	179	if n.left is None and n.right is None:
	180	return neighbor
	181	last_left = None
	182	found = False
	183	while not found:
	184	c = n.compare_to_key(p)
	185	if c==0 and n.right:
	186	n = n.right
	187	elif c < 0 and n.left:
	188	n = n.left
	189	last_left = n.parent
	190	elif c==1 and n.right:
	191	n = n.right
	192	else:
	193	found = True
	194	c = n.compare_to_key(p)
	195	# can be c==0 and n.left if at root node
	196	if c==0:
	197	if n.parent is None:
	198	return None
	199	if n is n.parent.right:
	200	goleft = None
	201	if last_left:
	202	goleft = last_left.right
	203	return self.get_lr(goleft, None)[1]
	204	else:
	205	return self.get_lr(n.parent.right, None)[1]
	206	if c==1:
	207	goleft = None
	208	if last_left:
	209	goleft = last_left.right
	210	return self.get_lr(goleft, None)[1]
	211	if c==-1:
	212	return n
	213	return neighbor
	214
	215	# travels down a single direction to get neighbors
	216	def get_lr(self, left, right):
	217	lr = [None, None]
	218	if left:
	219	while left.left:
	220	left = left.left
	221	lr[1] = left
	222	if right:
	223	while right.right:
	224	right = right.right
	225	lr[0] = right
	226	return lr
	227
	228	def contain_p(self, p, lists):
	229	c = self.compare_to_key(p)
	230	if c==0:
	231	if self.left is None and self.right is None:
	232	if compare_by_x(p, self.value[1])==0:
	233	lists[1].append(self.value)
	234	else:
	235	lists[0].append(self.value)
	236	if self.left:
	237	self.left.contain_p(p, lists)
	238	if self.right:
	239	self.right.contain_p(p, lists)
	240	elif c < 0:
	241	if self.left:
	242	self.left.contain_p(p, lists)
	243	else:
	244	if self.right:
	245	self.right.contain_p(p, lists)
	246
	247	def find_insert_pt(self, key, seg):
	248	if self.left and self.right:
	249	if self.compare_to_key(key) == 0 and self.compare_lower(key, seg)==1:
	250	return self.right.find_insert_pt(key, seg)
	251	elif self.compare_to_key(key) < 1:
	252	return self.left.find_insert_pt(key, seg)
	253	else:
	254	return self.right.find_insert_pt(key, seg)
	255	# this case only happens at root
	256	elif self.left:
	257	if self.compare_to_key(key) == 0 and self.compare_lower(key, seg)==1:
	258	return (self, 'r')
	259	elif self.compare_to_key(key) < 1:
	260	return self.left.find_insert_pt(key, seg)
	261	else:
	262	return (self, 'r')
	263	else:
	264	return (self, 'n')
	265
	266	# adjusts stored segments in inner nodes
	267	def adjust(self):
	268	value = self.value
	269	m = self.m
	270	parent = self.parent
	271	node = self
	272	# go up left as much as possible
	273	while parent and node is parent.right:
	274	node = parent
	275	parent = node.parent
	276	# parent to adjust will be on the immediate right
	277	if parent and node is parent.left:
	278	parent.value = value
	279	parent.m = m
	280
	281	def compare_lower(self, p, s2):
	282	y = p[1] - 10
	283	key = get_x_at(s2, (p[0], y))
	284	return self.compare_to_key(key)
	285
	286	# returns matching leaf node, or None if no match
	287	# when deleting, you don't delete below--you delete above! so compare lower = -1.
	288	def find_delete_pt(self, key, value):
	289	if self.left and self.right:
	290	# if equal at this pt, and this node's value is less than the seg's slightly above this pt
	291	if self.compare_to_key(key) == 0 and self.compare_lower(key, value)==-1:
	292	return self.right.find_delete_pt(key, value)
	293	if self.compare_to_key(key) < 1:
	294	return self.left.find_delete_pt(key, value)
	295	else:
	296	return self.right.find_delete_pt(key, value)
	297	elif self.left:
	298	if self.compare_to_key(key) < 1:
	299	return self.left.find_delete_pt(key, value)
	300	else:
	301	return None
	302	# is leaf
	303	else:
	304	if self.compare_to_key(key)==0 and segs_equal(self.value, value):
	305	return self
	306	else:
	307	return None
	308
	309	# also prints depth of each node
	310	def print_tree(self, l=0):
	311	l += 1
	312	if self.left:
	313	self.left.print_tree(l)
	314	if self.left or self.right:
	315	print('INTERNAL: {0}'.format(l))
	316	else:
	317	print('LEAF: {0}'.format(l))
	318	print(self)
	319	print(self.value)
	320	if self.right:
	321	self.right.print_tree(l)

-0

pauvre/lsi/__init__.py less more

(New empty file)

+94

-0

pauvre/lsi/helper.py less more

	0	# Helper functions for use in the lsi implementation.
	1
	2	ev = 0.0000001
	3	# floating-point comparison
	4	def approx_equal(a, b, tol):
	5	return abs(a - b) < tol
	6
	7	# compares x-values of two pts
	8	# used for ordering in T
	9	def compare_by_x(k1, k2):
	10	if approx_equal(k1[0], k2[0], ev):
	11	return 0
	12	elif k1[0] < k2[0]:
	13	return -1
	14	else:
	15	return 1
	16
	17	# higher y value is "less"; if y value equal, lower x value is "less"
	18	# used for ordering in Q
	19	def compare_by_y(k1, k2):
	20	if approx_equal(k1[1], k2[1], ev):
	21	if approx_equal(k1[0], k2[0], ev):
	22	return 0
	23	elif k1[0] < k2[0]:
	24	return -1
	25	else:
	26	return 1
	27	elif k1[1] > k2[1]:
	28	return -1
	29	else:
	30	return 1
	31
	32	# tests if s0 and s1 represent the same segment (i.e. pts can be in 2 different orders)
	33	def segs_equal(s0, s1):
	34	x00 = s0[0][0]
	35	y00 = s0[0][1]
	36	x01 = s0[1][0]
	37	y01 = s0[1][1]
	38	x10 = s1[0][0]
	39	y10 = s1[0][1]
	40	x11 = s1[1][0]
	41	y11 = s1[1][1]
	42	if (approx_equal(x00, x10, ev) and approx_equal(y00, y10, ev)):
	43	if (approx_equal(x01, x11, ev) and approx_equal(y01, y11, ev)):
	44	return True
	45	if (approx_equal(x00, x11, ev) and approx_equal(y00, y11, ev)):
	46	if (approx_equal(x01, x10, ev) and approx_equal(y01, y10, ev)):
	47	return True
	48	return False
	49
	50	# get m for a given seg in (p1, p2) form
	51	def get_slope(s):
	52	x0 = s[0][0]
	53	y0 = s[0][1]
	54	x1 = s[1][0]
	55	y1 = s[1][1]
	56	if (x1-x0)==0:
	57	return None
	58	else:
	59	return float(y1-y0)/(x1-x0)
	60
	61	# given a point p, return the point on s that shares p's y-val
	62	def get_x_at(s, p):
	63	m = get_slope(s)
	64	# TODO: this should check if p's x-val is octually on seg; we're assuming
	65	# for now that it would have been deleted already if not
	66	if m == 0: # horizontal segment
	67	return p
	68	# ditto; should check if y-val on seg
	69	if m is None: # vertical segment
	70	return (s[0][0], p[1])
	71	x1 = s[0][0]-(s[0][1]-p[1])/m
	72	return (x1, p[1])
	73
	74	# returns the point at which two line segments intersect, or None if no intersection.
	75	def intersect(seg1, seg2):
	76	p = seg1[0]
	77	r = (seg1[1][0]-seg1[0][0], seg1[1][1]-seg1[0][1])
	78	q = seg2[0]
	79	s = (seg2[1][0]-seg2[0][0], seg2[1][1]-seg2[0][1])
	80	denom = r[0]s[1]-r[1]s[0]
	81	if denom == 0:
	82	return None
	83	numer = float(q[0]-p[0])s[1]-(q[1]-p[1])s[0]
	84	t = numer/denom
	85	numer = float(q[0]-p[0])r[1]-(q[1]-p[1])r[0]
	86	u = numer/denom
	87	if (t < 0 or t > 1) or (u < 0 or u > 1):
	88	return None
	89	x = p[0]+t*r[0]
	90	y = p[1]+t*r[1]
	91	return (x, y)
	92
	93

+107

-0

pauvre/lsi/lsi.py less more

	0	# Implementation of the Bentley-Ottmann algorithm, described in deBerg et al, ch. 2.
	1	# See README for more information.
	2	# Author: Sam Lichtenberg
	3	# Email: splichte@princeton.edu
	4	# Date: 09/02/2013
	5
	6	from pauvre.lsi.Q import Q
	7	from pauvre.lsi.T import T
	8	from pauvre.lsi.helper import *
	9
	10	# "close enough" for floating point
	11	ev = 0.00000001
	12
	13	# how much lower to get the x of a segment, to determine which of a set of segments is the farthest right/left
	14	lower_check = 100
	15
	16	# gets the point on a segment at a lower y value.
	17	def getNextPoint(p, seg, y_lower):
	18	p1 = seg[0]
	19	p2 = seg[1]
	20	if (p1[0]-p2[0])==0:
	21	return (p[0]+10, p[1])
	22	slope = float(p1[1]-p2[1])/(p1[0]-p2[0])
	23	if slope==0:
	24	return (p1[0], p[1]-y_lower)
	25	y = p[1]-y_lower
	26	x = p1[0]-(p1[1]-y)/slope
	27	return (x, y)
	28
	29	"""
	30	for each event point:
	31	U_p = segments that have p as an upper endpoint
	32	C_p = segments that contain p
	33	L_p = segments that have p as a lower endpoint
	34	"""
	35	def handle_event_point(p, segs, q, t, intersections):
	36	rightmost = (float("-inf"), 0)
	37	rightmost_seg = None
	38	leftmost = (float("inf"), 0)
	39	leftmost_seg = None
	40
	41	U_p = segs
	42	(C_p, L_p) = t.contain_p(p)
	43	merge_all = U_p+C_p+L_p
	44	if len(merge_all) > 1:
	45	intersections[p] = []
	46	for s in merge_all:
	47	intersections[p].append(s)
	48	merge_CL = C_p+L_p
	49	merge_UC = U_p+C_p
	50	for s in merge_CL:
	51	# deletes at a point slightly above (to break ties) - where seg is located in tree
	52	# above intersection point
	53	t.delete(p, s)
	54	# put segments into T based on where they are at y-val just below p[1]
	55	for s in merge_UC:
	56	n = getNextPoint(p, s, lower_check)
	57	if n[0] > rightmost[0]:
	58	rightmost = n
	59	rightmost_seg = s
	60	if n[0] < leftmost[0]:
	61	leftmost = n
	62	leftmost_seg = s
	63	t.insert(p, s)
	64
	65	# means only L_p -> check newly-neighbored segments
	66	if len(merge_UC) == 0:
	67	neighbors = (t.get_left_neighbor(p), t.get_right_neighbor(p))
	68	if neighbors[0] and neighbors[1]:
	69	find_new_event(neighbors[0].value, neighbors[1].value, p, q)
	70
	71	# of newly inserted pts, find possible intersections to left and right
	72	else:
	73	left_neighbor = t.get_left_neighbor(p)
	74	if left_neighbor:
	75	find_new_event(left_neighbor.value, leftmost_seg, p, q)
	76	right_neighbor = t.get_right_neighbor(p)
	77	if right_neighbor:
	78	find_new_event(right_neighbor.value, rightmost_seg, p, q)
	79
	80	def find_new_event(s1, s2, p, q):
	81	i = intersect(s1, s2)
	82	if i:
	83	if compare_by_y(i, p) == 1:
	84	if not q.find(i):
	85	q.insert(i, [])
	86
	87	# segment is in ((x, y), (x, y)) form
	88	# first pt in a segment should have higher y-val - this is handled in function
	89	def intersection(S):
	90	s0 = S[0]
	91	if s0[1][1] > s0[0][1]:
	92	s0 = (s0[1], s0[0])
	93	q = Q(s0[0], [s0])
	94	q.insert(s0[1], [])
	95	intersections = {}
	96	for s in S[1:]:
	97	if s[1][1] > s[0][1]:
	98	s = (s[1], s[0])
	99	q.insert(s[0], [s])
	100	q.insert(s[1], [])
	101	t = T()
	102	while q.key:
	103	p, segs = q.get_and_del_min()
	104	handle_event_point(p, segs, q, t, intersections)
	105	return intersections
	106

+110

-0

pauvre/lsi/test.py less more

	0	# Test file for lsi.
	1	# Author: Sam Lichtenberg
	2	# Email: splichte@princeton.edu
	3	# Date: 09/02/2013
	4
	5	from lsi import intersection
	6	import random
	7	import time, sys
	8	from helper import *
	9
	10	ev = 0.00000001
	11
	12	def scale(i):
	13	return float(i)
	14
	15	use_file = None
	16	try:
	17	use_file = sys.argv[2]
	18	except:
	19	pass
	20
	21	if not use_file:
	22	S = []
	23	for i in range(int(sys.argv[1])):
	24	p1 = (scale(random.randint(0, 1000)), scale(random.randint(0, 1000)))
	25	p2 = (scale(random.randint(0, 1000)), scale(random.randint(0, 1000)))
	26	s = (p1, p2)
	27	S.append(s)
	28	f = open('input', 'w')
	29	f.write(str(S))
	30	f.close()
	31
	32	else:
	33	f = open(sys.argv[2], 'r')
	34	S = eval(f.read())
	35
	36	intersections = []
	37	seen = []
	38	vs = False
	39	hs = False
	40	es = False
	41	now = time.time()
	42	for seg1 in S:
	43	if approx_equal(seg1[0][0], seg1[1][0], ev):
	44	print 'VERTICAL SEG'
	45	print ''
	46	print ''
	47	vs = True
	48	if approx_equal(seg1[0][1], seg1[1][1], ev):
	49	print 'HORIZONTAL SEG'
	50	print ''
	51	print ''
	52	hs = True
	53	for seg2 in S:
	54	if seg1 is not seg2 and segs_equal(seg1, seg2):
	55	print 'EQUAL SEGS'
	56	print ''
	57	print ''
	58	es = True
	59	if seg1 is not seg2 and (seg2, seg1) not in seen:
	60	i = intersect(seg1, seg2)
	61	if i:
	62	intersections.append((i, [seg1, seg2]))
	63	# xpts = [seg1[0][0], seg1[1][0], seg2[0][0], seg2[1][0]]
	64	# xpts = sorted(xpts)
	65	# if (i[0] <= xpts[2] and i[0] >= xpts[1]:
	66	# intersections.append((i, [seg1, seg2]))
	67	seen.append((seg1, seg2))
	68	later = time.time()
	69	n2time = later-now
	70	print "Line sweep results:"
	71	now = time.time()
	72	lsinters = intersection(S)
	73	inters = []
	74	for k, v in lsinters.iteritems():
	75	#print '{0}: {1}'.format(k, v)
	76	inters.append(k)
	77	# inters.append(v)
	78	later = time.time()
	79	print 'TIME ELAPSED: {0}'.format(later-now)
	80	print "N^2 comparison results:"
	81	pts_seen = []
	82	highestseen = 0
	83	for i in intersections:
	84	seen_already = False
	85	seen = 0
	86	for p in pts_seen:
	87	if approx_equal(i[0][0], p[0], ev) and approx_equal(i[0][1], p[1], ev):
	88	seen += 1
	89	seen_already = True
	90	if seen > highestseen:
	91	highestseen = seen
	92	if not seen_already:
	93	pts_seen.append(i[0])
	94	in_k = False
	95	for k in inters:
	96	if approx_equal(k[0], i[0][0], ev) and approx_equal(k[1], i[0][1], ev):
	97	in_k = True
	98	if in_k == False:
	99	print 'Not in K: {0}: {1}'.format(i[0], i[1])
	100	# print i
	101	print highestseen
	102	print 'TIME ELAPSED: {0}'.format(n2time)
	103	#print 'Missing from line sweep but in N^2:'
	104	#for i in seen:
	105	# matched = False
	106	print len(lsinters)
	107	print len(pts_seen)
	108	if len(lsinters) != len(pts_seen):
	109	print 'uh oh!'

+403

-0

pauvre/marginplot.py less more

	0	#!/usr/bin/env python
	1	# -- coding: utf-8 --
	2
	3	# pauvre
	4	# Copyright (c) 2016-2020 Darrin T. Schultz.
	5	#
	6	# This file is part of pauvre.
	7	#
	8	# pauvre is free software: you can redistribute it and/or modify
	9	# it under the terms of the GNU General Public License as published by
	10	# the Free Software Foundation, either version 3 of the License, or
	11	# (at your option) any later version.
	12	#
	13	# pauvre is distributed in the hope that it will be useful,
	14	# but WITHOUT ANY WARRANTY; without even the implied warranty of
	15	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	16	# GNU General Public License for more details.
	17	#
	18	# You should have received a copy of the GNU General Public License
	19	# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
	20
	21	import ast
	22	import matplotlib
	23	matplotlib.use('Agg')
	24	import matplotlib.pyplot as plt
	25	import matplotlib.patches as mplpatches
	26	from matplotlib.colors import LinearSegmentedColormap
	27	import numpy as np
	28	import pandas as pd
	29	import os.path as opath
	30	from sys import stderr
	31	from pauvre.functions import parse_fastq_length_meanqual, print_images, filter_fastq_length_meanqual
	32	from pauvre.stats import stats
	33	import pauvre.rcparams as rc
	34	import logging
	35
	36	# logging
	37	logger = logging.getLogger('pauvre')
	38
	39
	40	def generate_panel(panel_left, panel_bottom, panel_width, panel_height,
	41	axis_tick_param='both', which_tick_param='both',
	42	bottom_tick_param=True, label_bottom_tick_param=True,
	43	left_tick_param=True, label_left_tick_param=True,
	44	right_tick_param=False, label_right_tick_param=False,
	45	top_tick_param=False, label_top_tick_param=False):
	46	"""
	47	Setting default panel tick parameters. Some of these are the defaults
	48	for matplotlib anyway, but specifying them for readability. Here are
	49	options and defaults for the parameters used below:
	50
	51	axis : {'x', 'y', 'both'}; which axis to modify; default = 'both'
	52	which : {'major', 'minor', 'both'}; which ticks to modify;
	53	default = 'major'
	54	bottom, top, left, right : bool or {True, False}; ticks on or off;
	55	labelbottom, labeltop, labelleft, labelright : bool or {True, False}
	56	"""
	57
	58	# create the panel
	59	panel_rectangle = [panel_left, panel_bottom, panel_width, panel_height]
	60	panel = plt.axes(panel_rectangle)
	61
	62	# Set tick parameters
	63	panel.tick_params(axis=axis_tick_param,
	64	which=which_tick_param,
	65	bottom=bottom_tick_param,
	66	labelbottom=label_bottom_tick_param,
	67	left=left_tick_param,
	68	labelleft=label_left_tick_param,
	69	right=right_tick_param,
	70	labelright=label_right_tick_param,
	71	top=top_tick_param,
	72	labeltop=label_top_tick_param)
	73
	74	return panel
	75
	76
	77	def _generate_histogram_bin_patches(panel, bins, bin_values, horizontal=True):
	78	"""This helper method generates the histogram that is added to the panel.
	79
	80	In this case, horizontal = True applies to the mean quality histogram.
	81	So, horizontal = False only applies to the length histogram.
	82	"""
	83	l_width = 0.0
	84	f_color = (0.5, 0.5, 0.5)
	85	e_color = (0, 0, 0)
	86	if horizontal:
	87	for step in np.arange(0, len(bin_values), 1):
	88	left = bins[step]
	89	bottom = 0
	90	width = bins[step + 1] - bins[step]
	91	height = bin_values[step]
	92	hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
	93	linewidth=l_width,
	94	facecolor=f_color,
	95	edgecolor=e_color)
	96	panel.add_patch(hist_rectangle)
	97	else:
	98	for step in np.arange(0, len(bin_values), 1):
	99	left = 0
	100	bottom = bins[step]
	101	width = bin_values[step]
	102	height = bins[step + 1] - bins[step]
	103
	104	hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
	105	linewidth=l_width,
	106	facecolor=f_color,
	107	edgecolor=e_color)
	108	panel.add_patch(hist_rectangle)
	109
	110
	111	def generate_histogram(panel, data_list, max_plot_length, min_plot_length,
	112	bin_interval, hist_horizontal=True,
	113	left_spine=True, bottom_spine=True,
	114	top_spine=False, right_spine=False, x_label=None,
	115	y_label=None):
	116
	117	bins = np.arange(0, max_plot_length, bin_interval)
	118
	119	bin_values, bins2 = np.histogram(data_list, bins)
	120
	121	# hist_horizontal is used for quality
	122	if hist_horizontal:
	123	panel.set_xlim([min_plot_length, max_plot_length])
	124	panel.set_ylim([0, max(bin_values * 1.1)])
	125	# and hist_horizontal == Fale is for read length
	126	else:
	127	panel.set_xlim([0, max(bin_values * 1.1)])
	128	panel.set_ylim([min_plot_length, max_plot_length])
	129
	130	# Generate histogram bin patches, depending on whether we're plotting
	131	# vertically or horizontally
	132	_generate_histogram_bin_patches(panel, bins, bin_values, hist_horizontal)
	133
	134	panel.spines['left'].set_visible(left_spine)
	135	panel.spines['bottom'].set_visible(bottom_spine)
	136	panel.spines['top'].set_visible(top_spine)
	137	panel.spines['right'].set_visible(right_spine)
	138
	139	if y_label is not None:
	140	panel.set_ylabel(y_label)
	141	if x_label is not None:
	142	panel.set_xlabel(x_label)
	143
	144
	145	def generate_heat_map(panel, data_frame, min_plot_length, min_plot_qual,
	146	max_plot_length, max_plot_qual, color, **kwargs):
	147	panel.set_xlim([min_plot_qual, max_plot_qual])
	148	panel.set_ylim([min_plot_length, max_plot_length])
	149
	150	if kwargs["kmerdf"]:
	151	hex_this = data_frame.query('length<{} and numks<{}'.format(
	152	max_plot_length, max_plot_qual))
	153	# This single line controls plotting the hex bins in the panel
	154	hex_vals = panel.hexbin(hex_this['numks'], hex_this['length'], gridsize=int(np.ceil(max_plot_qual/2)),
	155	linewidths=0.0, cmap=color)
	156
	157	else:
	158	hex_this = data_frame.query('length<{} and meanQual<{}'.format(
	159	max_plot_length, max_plot_qual))
	160
	161	# This single line controls plotting the hex bins in the panel
	162	hex_vals = panel.hexbin(hex_this['meanQual'], hex_this['length'], gridsize=49,
	163	linewidths=0.0, cmap=color)
	164	for each in panel.spines:
	165	panel.spines[each].set_visible(False)
	166
	167	counts = hex_vals.get_array()
	168	return counts
	169
	170
	171	def generate_legend(panel, counts, color):
	172
	173	# completely custom for more control
	174	panel.set_xlim([0, 1])
	175	panel.set_ylim([0, 1000])
	176	panel.set_yticks([int(x) for x in np.linspace(0, 1000, 6)])
	177	panel.set_yticklabels([int(x) for x in np.linspace(0, max(counts), 6)])
	178	for i in np.arange(0, 1001, 1):
	179	rgba = color(i / 1001)
	180	alpha = rgba[-1]
	181	facec = rgba[0:3]
	182	hist_rectangle = mplpatches.Rectangle((0, i), 1, 1,
	183	linewidth=0.0,
	184	facecolor=facec,
	185	edgecolor=(0, 0, 0),
	186	alpha=alpha)
	187	panel.add_patch(hist_rectangle)
	188	panel.spines['top'].set_visible(False)
	189	panel.spines['left'].set_visible(False)
	190	panel.spines['bottom'].set_visible(False)
	191	panel.yaxis.set_label_position("right")
	192	panel.set_ylabel('Number of Reads')
	193
	194
	195	def margin_plot(df, **kwargs):
	196	rc.update_rcParams()
	197
	198	# 250, 231, 34 light yellow
	199	# 67, 1, 85
	200	# R=np.linspace(65/255,1,101)
	201	# G=np.linspace(0/255, 231/255, 101)
	202	# B=np.linspace(85/255, 34/255, 101)
	203	# R=65/255, G=0/255, B=85/255
	204	Rf = 65 / 255
	205	Bf = 85 / 255
	206	pdict = {'red': ((0.0, Rf, Rf),
	207	(1.0, Rf, Rf)),
	208	'green': ((0.0, 0.0, 0.0),
	209	(1.0, 0.0, 0.0)),
	210	'blue': ((0.0, Bf, Bf),
	211	(1.0, Bf, Bf)),
	212	'alpha': ((0.0, 0.0, 0.0),
	213	(1.0, 1.0, 1.0))
	214	}
	215	# Now we will use this example to illustrate 3 ways of
	216	# handling custom colormaps.
	217	# First, the most direct and explicit:
	218	purple1 = LinearSegmentedColormap('Purple1', pdict)
	219
	220	# set the figure dimensions
	221	fig_width = 1.61 * 3
	222	fig_height = 1 * 3
	223	fig = plt.figure(figsize=(fig_width, fig_height))
	224
	225	# set the panel dimensions
	226	heat_map_panel_width = fig_width * 0.5
	227	heat_map_panel_height = heat_map_panel_width * 0.62
	228
	229	# find the margins to center the panel in figure
	230	fig_left_margin = fig_bottom_margin = (1 / 6)
	231
	232	# lengthPanel
	233	length_panel_width = (1 / 8)
	234
	235	# the color Bar parameters
	236	legend_panel_width = (1 / 24)
	237
	238	# define padding
	239	h_padding = 0.02
	240	v_padding = 0.05
	241
	242	# Set whether to include y-axes in histograms
	243	if kwargs["Y_AXES"]:
	244	length_bottom_spine = True
	245	length_bottom_tick = False
	246	length_bottom_label = True
	247	qual_left_spine = True
	248	qual_left_tick = True
	249	qual_left_label = True
	250	qual_y_label = 'Count'
	251	else:
	252	length_bottom_spine = False
	253	length_bottom_tick = False
	254	length_bottom_label = False
	255	qual_left_spine = False
	256	qual_left_tick = False
	257	qual_left_label = False
	258	qual_y_label = None
	259
	260	panels = []
	261
	262	# Quality histogram panel
	263	qual_panel_left = fig_left_margin + length_panel_width + h_padding
	264	qual_panel_width = heat_map_panel_width / fig_width
	265	qual_panel_height = length_panel_width * fig_width / fig_height
	266	qual_panel = generate_panel(qual_panel_left,
	267	fig_bottom_margin,
	268	qual_panel_width,
	269	qual_panel_height,
	270	left_tick_param=qual_left_tick,
	271	label_left_tick_param=qual_left_label)
	272	panels.append(qual_panel)
	273
	274	# Length histogram panel
	275	length_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding
	276	length_panel_height = heat_map_panel_height / fig_height
	277	length_panel = generate_panel(fig_left_margin,
	278	length_panel_bottom,
	279	length_panel_width,
	280	length_panel_height,
	281	bottom_tick_param=length_bottom_tick,
	282	label_bottom_tick_param=length_bottom_label)
	283	panels.append(length_panel)
	284
	285	# Heat map panel
	286	heat_map_panel_left = fig_left_margin + length_panel_width + h_padding
	287	heat_map_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding
	288
	289	heat_map_panel = generate_panel(heat_map_panel_left,
	290	heat_map_panel_bottom,
	291	heat_map_panel_width / fig_width,
	292	heat_map_panel_height / fig_height,
	293	bottom_tick_param=False,
	294	label_bottom_tick_param=False,
	295	left_tick_param=False,
	296	label_left_tick_param=False)
	297	panels.append(heat_map_panel)
	298	heat_map_panel.set_title(kwargs["title"])
	299
	300	# Legend panel
	301	legend_panel_left = fig_left_margin + length_panel_width + \
	302	(heat_map_panel_width / fig_width) + (h_padding * 2)
	303	legend_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding
	304	legend_panel_height = heat_map_panel_height / fig_height
	305	legend_panel = generate_panel(legend_panel_left, legend_panel_bottom,
	306	legend_panel_width, legend_panel_height,
	307	bottom_tick_param = False,
	308	label_bottom_tick_param = False,
	309	left_tick_param = False,
	310	label_left_tick_param = False,
	311	right_tick_param = True,
	312	label_right_tick_param = True)
	313	panels.append(legend_panel)
	314
	315	# Set min and max viewing window for length
	316	if kwargs["plot_maxlen"]:
	317	max_plot_length = kwargs["plot_maxlen"]
	318	else:
	319	max_plot_length = int(np.percentile(df['length'], 99))
	320	min_plot_length = kwargs["plot_minlen"]
	321
	322	# Set length bin sizes
	323	if kwargs["lengthbin"]:
	324	length_bin_interval = kwargs["lengthbin"]
	325	else:
	326	# Dividing by 80 is based on what looks good from experience
	327	length_bin_interval = int(max_plot_length / 80)
	328
	329	# length_bins = np.arange(0, max_plot_length, length_bin_interval)
	330
	331	# Set max and min viewing window for quality
	332	if kwargs["plot_maxqual"]:
	333	max_plot_qual = kwargs["plot_maxqual"]
	334	elif kwargs["kmerdf"]:
	335	max_plot_qual = np.ceil(df["numks"].median() * 2)
	336	else:
	337	max_plot_qual = max(np.ceil(df['meanQual']))
	338	min_plot_qual = kwargs["plot_minqual"]
	339
	340	# Set qual bin sizes
	341	if kwargs["qualbin"]:
	342	qual_bin_interval = kwargs["qualbin"]
	343	elif kwargs["kmerdf"]:
	344	qual_bin_interval = 1
	345	else:
	346	# again, this is just based on what looks good from experience
	347	qual_bin_interval = max_plot_qual / 85
	348	qual_bins = np.arange(0, max_plot_qual, qual_bin_interval)
	349
	350	# Generate length histogram
	351	generate_histogram(length_panel, df['length'], max_plot_length, min_plot_length,
	352	length_bin_interval, hist_horizontal=False,
	353	y_label='Read Length', bottom_spine=length_bottom_spine)
	354
	355	# Generate quality histogram
	356	if kwargs["kmerdf"]:
	357	generate_histogram(qual_panel, df['numks'], max_plot_qual, min_plot_qual,
	358	qual_bin_interval, x_label='number of kmers',
	359	y_label=qual_y_label, left_spine=qual_left_spine)
	360	else:
	361	generate_histogram(qual_panel, df['meanQual'], max_plot_qual, min_plot_qual,
	362	qual_bin_interval, x_label='Phred Quality',
	363	y_label=qual_y_label, left_spine=qual_left_spine)
	364
	365	# Generate heat map
	366	counts = generate_heat_map(heat_map_panel, df, min_plot_length, min_plot_qual,
	367	max_plot_length, max_plot_qual, purple1, kmerdf = kwargs["kmerdf"])
	368
	369	# Generate legend
	370	generate_legend(legend_panel, counts, purple1)
	371
	372	# inform the user of the plotting window if not quiet mode
	373	if not kwargs["QUIET"]:
	374	print("""plotting in the following window:
	375	{0} <= Q-score (x-axis) <= {1}
	376	{2} <= length (y-axis) <= {3}""".format(
	377	min_plot_qual, max_plot_qual, min_plot_length, max_plot_length),
	378	file=stderr)
	379	# Print image(s)
	380	if kwargs["BASENAME"] is None and not kwargs["path"] is None:
	381	file_base = kwargs["BASENAME"]
	382	elif kwargs["BASENAME"] is None:
	383	file_base = opath.splitext(opath.basename(kwargs["fastq"]))[0]
	384	else:
	385	file_base = kwargs["BASENAME"]
	386	print_images(
	387	file_base,
	388	image_formats=kwargs["fileform"],
	389	dpi=kwargs["dpi"],
	390	no_timestamp = kwargs["no_timestamp"],
	391	transparent=kwargs["TRANSPARENT"])
	392
	393	def run(args):
	394	if args.kmerdf:
	395	df = pd.read_csv(args.kmerdf, header='infer', sep='\t')
	396	df["kmers"] = df["kmers"].apply(ast.literal_eval)
	397	else:
	398	df = parse_fastq_length_meanqual(args.fastq)
	399	df = filter_fastq_length_meanqual(df, args.filt_minlen, args.filt_maxlen,
	400	args.filt_minqual, args.filt_maxqual)
	401	stats(df, args.fastq, False)
	402	margin_plot(df=df.dropna(), **vars(args))

+636

-0

pauvre/pauvre_main.py less more

	0	#!/usr/bin/env python
	1	# -- coding: utf-8 --
	2
	3	# pauvre - just a pore plotting package
	4	# Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved.
	5	#
	6	# This file is part of pauvre.
	7	#
	8	# pauvre is free software: you can redistribute it and/or modify
	9	# it under the terms of the GNU General Public License as published by
	10	# the Free Software Foundation, either version 3 of the License, or
	11	# (at your option) any later version.
	12	#
	13	# pauvre is distributed in the hope that it will be useful,
	14	# but WITHOUT ANY WARRANTY; without even the implied warranty of
	15	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	16	# GNU General Public License for more details.
	17	#
	18	# You should have received a copy of the GNU General Public License
	19	# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
	20
	21	# I modeled this code on https://github.com/arq5x/poretools/. Check it out. - DTS
	22
	23	import sys
	24	import os.path
	25	import argparse
	26
	27	# pauvre imports
	28	import pauvre.version
	29	# This class is used in argparse to expand the ~. This avoids errors caused on
	30	# some systems.
	31
	32
	33	class FullPaths(argparse.Action):
	34	"""Expand user- and relative-paths"""
	35
	36	def __call__(self, parser, namespace, values, option_string=None):
	37	setattr(namespace, self.dest,
	38	os.path.abspath(os.path.expanduser(values)))
	39
	40
	41	class FullPathsList(argparse.Action):
	42	"""Expand user- and relative-paths when a list of paths is passed to the
	43	program"""
	44
	45	def __call__(self, parser, namespace, values, option_string=None):
	46	setattr(namespace, self.dest,
	47	[os.path.abspath(os.path.expanduser(value)) for value in values])
	48
	49	def run_subtool(parser, args):
	50	if args.command == 'browser':
	51	import pauvre.browser as submodule
	52	elif args.command == 'custommargin':
	53	import pauvre.custommargin as submodule
	54	elif args.command == 'marginplot':
	55	import pauvre.marginplot as submodule
	56	elif args.command == 'redwood':
	57	import pauvre.redwood as submodule
	58	elif args.command == 'stats':
	59	import pauvre.stats as submodule
	60	elif args.command == 'synplot':
	61	import pauvre.synplot as submodule
	62	# run the chosen submodule.
	63	submodule.run(args)
	64
	65
	66	class ArgumentParserWithDefaults(argparse.ArgumentParser):
	67	def __init__(self, args, *kwargs):
	68	super(ArgumentParserWithDefaults, self).__init__(args, *kwargs)
	69	self.add_argument("-q", "--quiet", help="Do not output warnings to stderr",
	70	action="store_true",
	71	dest="QUIET")
	72
	73	def main():
	74
	75	#########################################
	76	# create the top-level parser
	77	#########################################
	78	parser = argparse.ArgumentParser(
	79	prog='pauvre', formatter_class=argparse.ArgumentDefaultsHelpFormatter)
	80	parser.add_argument("-v", "--version", help="Installed pauvre version",
	81	action="version",
	82	version="%(prog)s " + str(pauvre.version.__version__))
	83	subparsers = parser.add_subparsers(
	84	title='[sub-commands]', dest='command', parser_class=ArgumentParserWithDefaults)
	85
	86	#########################################
	87	# create the individual tool parsers
	88	#########################################
	89
	90	#############
	91	# browser
	92	#############
	93	parser_browser = subparsers.add_parser('browser',
	94	help="""an adaptable genome
	95	browser with various track types""")
	96	parser_browser.add_argument('-c', '--chromosomeid',
	97	metavar = "Chr",
	98	dest = 'CHR',
	99	type = str,
	100	help = """The fasta sequence to observe.
	101	Use the header name of the fasta file
	102	without the '>' character""")
	103	parser_browser.add_argument('--dpi',
	104	metavar='dpi',
	105	default=600,
	106	type=int,
	107	help="""Change the dpi from the default 600
	108	if you need it higher""")
	109	parser_browser.add_argument('--fileform',
	110	dest='fileform',
	111	metavar='STRING',
	112	choices=['png', 'pdf', 'eps', 'jpeg', 'jpg',
	113	'pdf', 'pgf', 'ps', 'raw', 'rgba',
	114	'svg', 'svgz', 'tif', 'tiff'],
	115	default=['png'],
	116	nargs='+',
	117	help='Which output format would you like? Def.=png')
	118	parser_browser.add_argument("--no_timestamp",
	119	action = 'store_true',
	120	help="""Turn off time stamps in the filename
	121	output.""")
	122	parser_browser.add_argument('-o', '--output-base-name',
	123	dest='BASENAME',
	124	help="""Specify a base name for the output file(
	125	s). The input file base name is the
	126	default.""")
	127	parser_browser.add_argument('--path',
	128	type=str,
	129	help="""Set an explicit filepath for the output.
	130	Only do this if you have selected one output type.""")
	131	parser_browser.add_argument('-p', '--plot_commands',
	132	dest='CMD',
	133	nargs = '+',
	134	help="""Write strings here to select what
	135	to plot. The format for each track is:
	136	<type>:<path>:<style>:<options>
	137
	138	To plot the reference, the format is:
	139	ref:<style>:<options>
	140
	141	Surround each track string with double
	142	quotes and a space between subsequent strings.
	143	"bam:mybam.bam:depth" "ref:colorful"
	144	""")
	145	parser_browser.add_argument('--ratio',
	146	nargs = '+',
	147	type = float,
	148	default=None,
	149	help="""Enter the dimensions (arbitrary units)
	150	to plot the figure. For example a figure that is
	151	seven times wider than tall is:
	152	--ratio 7 1""")
	153	parser_browser.add_argument('-r', '--reference',
	154	metavar='REF',
	155	dest = 'REF',
	156	action=FullPaths,
	157	help='The reference fasta file.')
	158	parser_browser.add_argument('--start',
	159	metavar='START',
	160	dest = 'START',
	161	type = int,
	162	help="""The start position to observe on the
	163	fasta file. Uses 1-based indexing.""")
	164	parser_browser.add_argument('--stop',
	165	metavar='STOP',
	166	dest = 'STOP',
	167	type = int,
	168	help="""The stop position to observe on the
	169	fasta file. Uses 1-based indexing.""")
	170	parser_browser.add_argument('-T', '--transparent',
	171	action='store_false',
	172	help="""Specify this option if you DON'T want a
	173	transparent background. Default is on.""")
	174	parser_browser.set_defaults(func=run_subtool)
	175
	176	################
	177	# custommargin
	178	################
	179	parser_custmar = subparsers.add_parser('custommargin',
	180	help='plot custom marginal histograms of tab-delimited files')
	181	parser_custmar.add_argument('--dpi',
	182	metavar='dpi',
	183	default=600,
	184	type=int,
	185	help="""Change the dpi from the default 600
	186	if you need it higher""")
	187	parser_custmar.add_argument('--fileform',
	188	dest='fileform',
	189	metavar='STRING',
	190	choices=['png', 'pdf', 'eps', 'jpeg', 'jpg',
	191	'pdf', 'pgf', 'ps', 'raw', 'rgba',
	192	'svg', 'svgz', 'tif', 'tiff'],
	193	default=['png'],
	194	nargs='+',
	195	help='Which output format would you like? Def.=png')
	196	parser_custmar.add_argument('-i', '--input_file',
	197	action=FullPaths,
	198	help="""A tab-separated file with a header row
	199	of column names.""")
	200	parser_custmar.add_argument('--xcol',
	201	type=str,
	202	help="""The column name of the data to plot on
	203	the x-axis""")
	204	parser_custmar.add_argument('--ycol',
	205	type=str,
	206	help="""The column name of the data to plot on
	207	the y-axis""")
	208	parser_custmar.add_argument('-n', '--no_transparent',
	209	action='store_false',
	210	help="""Specify this option if
	211	you don't want a transparent background. Default
	212	is on.""")
	213	parser_custmar.add_argument("--no_timestamp",
	214	action = 'store_true',
	215	help="""Turn off time stamps in the filename
	216	output.""")
	217	parser_custmar.add_argument('-o', '--output_base_name',
	218	default = None,
	219	help='Specify a base name for the output file('
	220	's). The input file base name is the '
	221	'default.')
	222	parser_custmar.add_argument('--plot_max_y',
	223	type=int,
	224	help="""Sets the maximum viewing area in the
	225	length dimension.""")
	226	parser_custmar.add_argument('--plot_max_x',
	227	type=float,
	228	help="""Sets the maximum viewing area in the
	229	quality dimension.""")
	230	parser_custmar.add_argument('--plot_min_y',
	231	type=int,
	232	default=0,
	233	help="""Sets the minimum viewing area in the
	234	length dimension. Default value = 0""")
	235	parser_custmar.add_argument('--plot_min_x',
	236	type=float,
	237	default=0,
	238	help="""Sets the minimum viewing area in the
	239	quality dimension. Default value = 0""")
	240	parser_custmar.add_argument('--square',
	241	action='store_true',
	242	help="""changes the hexmap into a square
	243	map quantized by ints.""")
	244	parser_custmar.add_argument('-t', '--title',
	245	type = str,
	246	help="""This sets the title for the whole plot.
	247	Use --title "Crustacean's DNA read quality"
	248	if you need single quote or apostrophe
	249	inside title.""")
	250	parser_custmar.add_argument('--ybin',
	251	type=int,
	252	help="""This sets the bin size to use for length.""")
	253	parser_custmar.add_argument('--xbin',
	254	type=float,
	255	help="""This sets the bin size to use for quality""")
	256	parser_custmar.add_argument('--add-yaxes',
	257	dest='Y_AXES',
	258	action='store_true',
	259	help='Add Y-axes to both marginal histograms.')
	260	parser_custmar.set_defaults(func=run_subtool)
	261
	262
	263	#############
	264	# marginplot
	265	#############
	266	parser_mnplot = subparsers.add_parser('marginplot',
	267	help='plot a marginal histogram of a fastq file')
	268	parser_mnplot.add_argument('--dpi',
	269	metavar='dpi',
	270	default=600,
	271	type=int,
	272	help="""Change the dpi from the default 600
	273	if you need it higher""")
	274	parser_mnplot.add_argument('-f', '--fastq',
	275	metavar='FASTQ',
	276	action=FullPaths,
	277	help='The input FASTQ file.')
	278	parser_mnplot.add_argument('--fileform',
	279	dest='fileform',
	280	metavar='STRING',
	281	choices=['png', 'pdf', 'eps', 'jpeg', 'jpg',
	282	'pdf', 'pgf', 'ps', 'raw', 'rgba',
	283	'svg', 'svgz', 'tif', 'tiff'],
	284	default=['png'],
	285	nargs='+',
	286	help='Which output format would you like? Def.=png')
	287	parser_mnplot.add_argument('--filt_maxlen',
	288	type=int,
	289	help="""This sets the max read length filter reads.""")
	290	parser_mnplot.add_argument('--filt_maxqual',
	291	type=float,
	292	help="""This sets the max mean read quality
	293	to filter reads.""")
	294	parser_mnplot.add_argument('--filt_minlen',
	295	type=int,
	296	default=0,
	297	help="""This sets the min read length to
	298	filter reads.""")
	299	parser_mnplot.add_argument('--filt_minqual',
	300	type=float,
	301	default=0,
	302	help="""This sets the min mean read quality
	303	to filter reads.""")
	304	parser_mnplot.add_argument('--kmerdf',
	305	type = str,
	306	default = None,
	307	help = """Pass the filename of a data matrix if
	308	you would like to plot read length
	309	versus number of kmers in that read. The data matrix
	310	is a tab-separated text file with columns
	311	"id length numks and kmers", where:
	312	<id> = read id
	313	<length> = the length of the read
	314	<numks> = the number of canonical kmers in the read
	315	<kmers> = a list representation of kmers ie ['GAT', 'GTA']""")
	316	parser_mnplot.add_argument('-n', '--no_transparent',
	317	dest='TRANSPARENT',
	318	action='store_false',
	319	help="""Specify this option if
	320	you don't want a transparent background. Default
	321	is on.""")
	322	parser_mnplot.add_argument("--no_timestamp",
	323	action = 'store_true',
	324	help="""Turn off time stamps in the filename
	325	output.""")
	326	parser_mnplot.add_argument('-o', '--output_base_name',
	327	dest='BASENAME',
	328	help='Specify a base name for the output file('
	329	's). The input file base name is the '
	330	'default.')
	331	parser_mnplot.add_argument('--plot_maxlen',
	332	type=int,
	333	help="""Sets the maximum viewing area in the
	334	length dimension.""")
	335	parser_mnplot.add_argument('--plot_maxqual',
	336	type=float,
	337	help="""Sets the maximum viewing area in the
	338	quality dimension.""")
	339	parser_mnplot.add_argument('--plot_minlen',
	340	type=int,
	341	default=0,
	342	help="""Sets the minimum viewing area in the
	343	length dimension.""")
	344	parser_mnplot.add_argument('--plot_minqual',
	345	type=float,
	346	default=0,
	347	help="""Sets the minimum viewing area in the
	348	quality dimension.""")
	349	parser_mnplot.add_argument('--lengthbin',
	350	metavar='LENGTHBIN',
	351	type=int,
	352	help="""This sets the bin size to use for length.""")
	353	parser_mnplot.add_argument('--qualbin',
	354	metavar='QUALBIN',
	355	type=float,
	356	help="""This sets the bin size to use for quality""")
	357	parser_mnplot.add_argument('-t', '--title',
	358	metavar='TITLE',
	359	default='Read length vs mean quality',
	360	help="""This sets the title for the whole plot.
	361	Use --title "Crustacean's DNA read quality"
	362	if you need single quote or apostrophe
	363	inside title.""")
	364	parser_mnplot.add_argument('-y', '--add-yaxes',
	365	dest='Y_AXES',
	366	action='store_true',
	367	help='Add Y-axes to both marginal histograms.')
	368	parser_mnplot.set_defaults(func=run_subtool)
	369
	370	#############
	371	# redwood
	372	#############
	373	parser_redwood = subparsers.add_parser('redwood',
	374	help='make a redwood plot from a bam file')
	375	parser_redwood.add_argument('-d', '--doubled',
	376	dest='doubled',
	377	choices=['main', 'rnaseq'],
	378	default=[],
	379	nargs='+',
	380	help="""If your fasta file was doubled to map
	381	circularly, use this flag or you may encounter
	382	plotting errors. Accepts multiple arguments.
	383	'main' is for the sam file passed with --sam,
	384	'rnaseq' is for the sam file passed with --rnaseq""")
	385	parser_redwood.add_argument('--dpi',
	386	metavar='dpi',
	387	default=600,
	388	type=int,
	389	help="""Change the dpi from the default 600
	390	if you need it higher""")
	391	parser_redwood.add_argument('--fileform',
	392	dest='fileform',
	393	metavar='STRING',
	394	choices=['png', 'pdf', 'eps', 'jpeg', 'jpg',
	395	'pdf', 'pgf', 'ps', 'raw', 'rgba',
	396	'svg', 'svgz', 'tif', 'tiff'],
	397	default=['png'],
	398	nargs='+',
	399	help='Which output format would you like? Def.=png')
	400	parser_redwood.add_argument('--gff',
	401	metavar='gff',
	402	action=FullPaths,
	403	help="""The input filepath for the gff annotation
	404	to plot""")
	405	parser_redwood.add_argument('-I', '--interlace',
	406	action='store_true',
	407	default=False,
	408	help="""Interlace the reads so the pileup plot
	409	looks better. Doesn't work currently""")
	410	parser_redwood.add_argument('-i', '--invert',
	411	action='store_true',
	412	default=False,
	413	help="""invert the image so that it looks better
	414	on a dark background. DOESN'T DO ANYTHING.""")
	415	parser_redwood.add_argument('-L', '--log',
	416	action='store_true',
	417	default=False,
	418	help="""Plot the RNAseq track with a log scale""")
	419	parser_redwood.add_argument('-M', '--main_bam',
	420	metavar='mainbam',
	421	action=FullPaths,
	422	help="""The input filepath for the bam file to
	423	plot. Ideally was plotted with a fasta file that
	424	is two copies of the mitochondrial genome
	425	concatenated. This should be long reads (ONT, PB)
	426	and will be displayed in the interior of the
	427	redwood plot.""")
	428	parser_redwood.add_argument("--no_timestamp",
	429	action = 'store_true',
	430	help="""Turn off time stamps in the filename
	431	output.""")
	432	parser_redwood.add_argument('-o', '--output-base-name',
	433	dest='BASENAME',
	434	help='Specify a base name for the output file('
	435	's). The input file base name is the '
	436	'default.')
	437	parser_redwood.add_argument('--query',
	438	dest='query',
	439	default=['ALNLEN >= 10000', 'MAPLEN < reflength'],
	440	nargs='+',
	441	help='Query your reads to change plotting options')
	442	parser_redwood.add_argument('-R', '--rnaseq_bam',
	443	metavar='rnabam',
	444	action=FullPaths,
	445	help='The input filepath for the rnaseq bam file to plot')
	446	parser_redwood.add_argument('--small_start',
	447	dest='small_start',
	448	choices=['inside', 'outside'],
	449	default='inside',
	450	help="""This determines where the shortest of the
	451	filtered reads will appear on the redwood plot:
	452	on the outside or on the inside? The default
	453	option puts the longest reads on the outside and
	454	the shortest reads on the inside.""")
	455	parser_redwood.add_argument('--sort',
	456	dest='sort',
	457	choices=['ALNLEN', 'TRULEN', 'MAPLEN', 'POS'],
	458	default='ALNLEN',
	459	help="""What value to use to sort the order in
	460	which the reads are plotted?""")
	461	parser_redwood.add_argument('--ticks',
	462	type = int,
	463	nargs = '+',
	464	default = [0, 10, 100, 1000],
	465	help="""Specify control for the number of ticks.""")
	466	parser_redwood.add_argument('-T', '--transparent',
	467	action='store_false',
	468	help="""Specify this option if you DON'T want a
	469	transparent background. Default is on.""")
	470	parser_redwood.set_defaults(func=run_subtool)
	471
	472	#############
	473	# stats
	474	#############
	475	parser_stats = subparsers.add_parser('stats',
	476	help='outputs stats from a fastq file')
	477	parser_stats.add_argument('-f', '--fastq',
	478	metavar='FASTQ',
	479	action=FullPaths,
	480	help='The input FASTQ file.')
	481	parser_stats.add_argument('-H', '--histogram',
	482	action='store_true',
	483	help="""Make a histogram of the read lengths and
	484	saves it to a new file""")
	485	parser_stats.add_argument('--filt_maxlen',
	486	type=int,
	487	help="""This sets the max read length filter reads.""")
	488	parser_stats.add_argument('--filt_maxqual',
	489	type=float,
	490	help="""This sets the max mean read quality
	491	to filter reads.""")
	492	parser_stats.add_argument('--filt_minlen',
	493	type=int,
	494	default=0,
	495	help="""This sets the min read length to
	496	filter reads.""")
	497	parser_stats.add_argument('--filt_minqual',
	498	type=float,
	499	default=0,
	500	help="""This sets the min mean read quality
	501	to filter reads.""")
	502
	503	parser_stats.set_defaults(func=run_subtool)
	504
	505	#############
	506	# synplot
	507	#############
	508	parser_synplot = subparsers.add_parser('synplot',
	509	help="""make a synteny plot from a gff
	510	file, protein alignment, and partition
	511	file""")
	512	parser_synplot.add_argument('--aln_dir',
	513	metavar='aln_dir',
	514	action=FullPaths,
	515	help="""The directory where all the fasta
	516	alignments are contained.""")
	517	parser_synplot.add_argument('--center_on',
	518	type=str,
	519	default=None,
	520	help="""Centers the plot around the gene that
	521	you pass as an argument. For example, if there
	522	is a locus called 'COI' in the gff file and in
	523	the alignments directory, center using:
	524	--center_on COI""")
	525	parser_synplot.add_argument('--dpi',
	526	metavar='dpi',
	527	default=600,
	528	type=int,
	529	help="""Change the dpi from the default 600
	530	if you need it higher""")
	531	parser_synplot.add_argument('--fileform',
	532	dest='fileform',
	533	choices=['png', 'pdf', 'eps', 'jpeg', 'jpg',
	534	'pdf', 'pgf', 'ps', 'raw', 'rgba',
	535	'svg', 'svgz', 'tif', 'tiff'],
	536	default=['png'],
	537	nargs='+',
	538	help='Which output format would you like? Def.=png')
	539	parser_synplot.add_argument('--gff_paths',
	540	metavar='gff_paths',
	541	action=FullPathsList,
	542	nargs='+',
	543	help="""The input filepath. for the gff annotation
	544	to plot. Individual filepaths separated by spaces.
	545	For example, --gff_paths sp1.gff sp2.gff sp3.gff""")
	546	parser_synplot.add_argument('--gff_labels',
	547	metavar='gff_labels',
	548	type=str,
	549	nargs='+',
	550	help="""In case the gff names and sequence names
	551	don't match, change the labels that will appear
	552	over the text.""")
	553	parser_synplot.add_argument("--no_timestamp",
	554	action = 'store_true',
	555	help="""Turn off time stamps in the filename
	556	output.""")
	557	parser_synplot.add_argument('--optimum_order',
	558	action='store_true',
	559	help="""If selected, this doesn't plot the
	560	optimum arrangement of things as they are input
	561	into gff_paths. Instead, it uses the first gff
	562	file as the top-most sequence in the plot, and
	563	reorganizes the remaining gff files to minimize
	564	the number of intersections.""")
	565	parser_synplot.add_argument('-o', '--output-basename',
	566	dest='BASENAME',
	567	help='Specify a base name for the output file('
	568	's). The input file base name is the '
	569	'default.')
	570	parser_synplot.add_argument('--ratio',
	571	nargs = '+',
	572	type = float,
	573	default=None,
	574	help="""Enter the dimensions (arbitrary units)
	575	to plot the figure. For example a figure that is
	576	seven times wider than tall is:
	577	--ratio 7 1""")
	578	parser_synplot.add_argument('--sandwich',
	579	action='store_true',
	580	default=False,
	581	help="""Put an additional copy of the first gff
	582	file on the bottom of the plot for comparison.""")
	583	parser_synplot.add_argument('--start_with_aligned_genes',
	584	action='store_true',
	585	default=False,
	586	help="""Minimizes the number of intersections
	587	but only selects combos where the first gene in
	588	each sequence is aligned.""")
	589	parser_synplot.add_argument('--stop_codons',
	590	action='store_true',
	591	default=True,
	592	help="""Performs some internal corrections if
	593	the gff annotation includes the stop
	594	codons in the coding sequences.""")
	595	parser_synplot.add_argument('-T', '--transparent',
	596	action='store_false',
	597	help="""Specify this option if you DON'T want a
	598	transparent background. Default is on.""")
	599	parser_synplot.set_defaults(func=run_subtool)
	600
	601	#######################################################
	602	# parse the args and call the selected function
	603	#######################################################
	604
	605	args = parser.parse_args()
	606
	607	# If there were no args, print the help function
	608	if len(sys.argv) == 1:
	609	parser.print_help()
	610	sys.exit(1)
	611
	612	# If there were no args, but someone selected a program,
	613	# print the program's help.
	614	commandDict = {'browser': parser_browser.print_help,
	615	'custommargin': parser_custmar.print_help,
	616	'marginplot': parser_mnplot.print_help,
	617	'redwood': parser_redwood.print_help,
	618	'stats': parser_stats.print_help,
	619	'synplot': parser_synplot.print_help}
	620
	621	if len(sys.argv) == 2:
	622	commandDict[args.command]()
	623	sys.exit(1)
	624
	625	if args.QUIET:
	626	logger.setLevel(logging.ERROR)
	627
	628	try:
	629	args.func(parser, args)
	630	except IOError as e:
	631	if e.errno != 32: # ignore SIGPIPE
	632	raise
	633
	634	if __name__ == "__main__":
	635	main()

+25

-0

pauvre/rcparams.py less more

	0	# -- coding: utf-8 --
	1
	2	from matplotlib import rcParams
	3
	4	def update_rcParams():
	5	# This mpl style is from the UCSC BME163 class.
	6	rcParams.update({
	7	'font.size' : 8.0 ,
	8	'font.sans-serif' : 'Arial' ,
	9	'xtick.major.size' : 2 ,
	10	'xtick.major.width' : 0.75 ,
	11	'xtick.labelsize' : 8.0 ,
	12	'xtick.direction' : 'out' ,
	13	'ytick.major.size' : 2 ,
	14	'ytick.major.width' : 0.75 ,
	15	'ytick.labelsize' : 8.0 ,
	16	'ytick.direction' : 'out' ,
	17	'xtick.major.pad' : 2 ,
	18	'xtick.minor.pad' : 2 ,
	19	'ytick.major.pad' : 2 ,
	20	'ytick.minor.pad' : 2 ,
	21	'savefig.dpi' : 601 ,
	22	'axes.linewidth' : 0.75 ,
	23	'text.usetex' : False ,
	24	'text.latex.unicode' : False })

+1047

-0

pauvre/redwood.py less more

	0	#!/usr/bin/env python
	1	# -- coding: utf-8 --
	2
	3	# pauvre - just a pore plotting package
	4	# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
	5	# twitter @conchoecia
	6	#
	7	# This file is part of pauvre.
	8	#
	9	# pauvre is free software: you can redistribute it and/or modify
	10	# it under the terms of the GNU General Public License as published by
	11	# the Free Software Foundation, either version 3 of the License, or
	12	# (at your option) any later version.
	13	#
	14	# pauvre is distributed in the hope that it will be useful,
	15	# but WITHOUT ANY WARRANTY; without even the implied warranty of
	16	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	17	# GNU General Public License for more details.
	18	#
	19	# You should have received a copy of the GNU General Public License
	20	# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
	21
	22	# SAM/BAM todo
	23	# i/ for loop took 269 seconds ~ 4.48 minutes
	24
	25	#things to do
	26
	27	# This doesn't make much sense. it is perfectly possible that it could
	28	# double and logbe both (in reference to how histograms are plotted
	29
	30	# - make each layer operate independently
	31	# - for each of these, make the program figure out the total length either from the GFF or from the bam file
	32	# - make the error "Your query was too stringent and no reads resulted..." not give a traceback.
	33	# - raise the proper error if the sam file has no header.
	34	# - figure out how to update rcParams every time we run a program
	35	# - Write a better docstring for how plotArc works.
	36	# - Write a docstring for seqOrder method. I don't remember what it does
	37	# - write a better function to get the alignment length of the sam/bam file
	38	# right now it opens the file twice and only gets the length the first time
	39	# - drop columns by name, not by column number (samFile.drop...)
	40	# - Here's another that needs to be fixed: samFile.drop(samFile.columns[3], axis=1, inplace=True)
	41	# - args
	42	# - get the filename from args
	43	# - set up a double-mapped mode to wrap reads around the 0th coordinate for
	44	# circular assemblies
	45	# - Make the r-dist something that the user can change.
	46	# Getting Everything on the Same Track
	47	# - Make a list of features to plot `plot_order = []` or something like that
	48	# - First, go through the GFF features and come up with all of the things that
	49	# AREN'T tRNAs that overlap. Store each set of overlaps as its own set of
	50	# features. For things with no overlap, add those to the `plot_order` alone
	51	# - Iterate through all elements of `plot_order`, if all elements are forward
	52	# (start < stop), then draw the element at the end first with no modification,
	53	# then for every subsequent element draw a white border around the arrow.
	54	# The element same element should have the start of its arrow drawn to 1
	55	# degree before the start of the next one. This will make a chevron pattern
	56	# That will show both the start and stop of both reads accurately.
	57	# - If both elements are reverse, then do the same thing, but in reverse
	58	# - If one element is in reverse, but another is forward, then make all the
	59	# elements in the set half-width since there are too many possible
	60	# combinations to code reliably.
	61
	62	import matplotlib
	63	matplotlib.use('Agg')
	64	import matplotlib.pyplot as plt
	65	import matplotlib.patches as patches
	66	from matplotlib import rcParams
	67	import platform
	68	import itertools
	69	import pandas as pd
	70	import numpy as np
	71	import scipy as sp
	72	import time
	73	import os, sys
	74	import warnings
	75
	76	# import the pauvre rcParams
	77	import pauvre.rcparams as rc
	78	from pauvre.functions import GFFParse, print_images
	79	from pauvre.bamparse import BAMParse
	80
	81	# following this tutorial to install helvetica
	82	# https://github.com/olgabot/sciencemeetproductivity.tumblr.com/blob/master/posts/2012/11/how-to-set-helvetica-as-the-default-sans-serif-font-in.md
	83	global hfont
	84	hfont = {'fontname':'Helvetica'}
	85
	86	def plotArc(start_angle, stop_angle, radius, width, **kwargs):
	87	""" write a docstring for this function"""
	88	numsegments = 100
	89	theta = np.radians(np.linspace(start_angle+90, stop_angle+90, numsegments))
	90	centerx = 0
	91	centery = 0
	92	x1 = -np.cos(theta) * (radius)
	93	y1 = np.sin(theta) * (radius)
	94	stack1 = np.column_stack([x1, y1])
	95	x2 = -np.cos(theta) * (radius + width)
	96	y2 = np.sin(theta) * (radius + width)
	97	stack2 = np.column_stack([np.flip(x2, axis=0), np.flip(y2,axis=0)])
	98	#add the first values from the first set to close the polygon
	99	np.append(stack2, [[x1[0],y1[0]]], axis=0)
	100	arcArray = np.concatenate((stack1,stack2), axis=0)
	101	return patches.Polygon(arcArray, True, **kwargs), ((x1, y1), (x2, y2))
	102
	103	def fix_query_reflength(sequence_length, queries, doubled):
	104	"""
	105	arguments:
	106	<sequence_length> This is the reference fasta length. It should be 2x the actual
	107	length of the reference since this program takes a sam file from
	108	a concatenated reference.
	109	<queries> This is a list of SQL-type query strings. This is generated
	110	from argparse.
	111
	112	purpose:
	113	This function takes in a list of queries to use for read filtering
	114	for the redwood plot. It is often not advisable to plot all mapped reads
	115	since many of them are too small relative to the reference length. Also,
	116	the point of a death star plot is to show continuity of a circular
	117	reference, so short reads aren't very helpful there either.
	118
	119	Currently, this function only recognizes the keyword argument 'reflength'.
	120	"""
	121	if not doubled:
	122	sequence_length = int(sequence_length * 2)
	123	for i in range(len(queries)):
	124	if 'reflength' in queries[i].split():
	125	queries[i] = queries[i].replace('reflength', str(int(sequence_length/2)))
	126
	127	def cust_log(base, val):
	128	try:
	129	#val = np.log(val)/np.log(base)
	130	with warnings.catch_warnings():
	131	warnings.simplefilter("ignore")
	132	val = np.log(val)
	133	except:
	134	val = 0
	135	if np.isinf(val):
	136	val = 0
	137	return val
	138
	139	def plot_histo(panel, args, angleMap, width_map, start_radius, track_radius,
	140	thisLog = False, double = False, ticks = False):
	141	"""Plots a histogram given a width map. Width map must be the true length
	142	of the circular genome"""
	143	myPatches = []
	144	if double:
	145	#plot a line in the middle if doubled to distinguish the center
	146	mid_radius = start_radius+((track_radius - (track_radius/100))/2)
	147	arc, arcArray = plotArc(start_angle=0, stop_angle=360,
	148	radius=mid_radius,
	149	width=track_radius/100, fc='yellow')
	150	myPatches.append(arc)
	151	#this is only 1/100 the width of the band
	152
	153	for i in range(len(width_map)):
	154	iStartIndex = i
	155	iStopIndex = i + 1
	156	iStartAngle = angleMap[iStartIndex]
	157	iStopAngle = angleMap[iStopIndex]
	158	# This doesn't make much sense. it is perfectly possible that it could
	159	# double and logbe both
	160	if double:
	161	logwidth = -1 * track_radius * (np.log(width_map[i]-9)/np.log(max(width_map))) * 0.495
	162
	163	log_start_radius = start_radius+((track_radius - (track_radius/100))/2)
	164	width = track_radius * (width_map[i]/max(width_map)) * 0.495
	165	normal_start_radius = start_radius+((track_radius - (track_radius/100))/2) + (track_radius/100)
	166	#first plot the log inside
	167	arc, arcArray = plotArc(start_angle=iStartAngle, stop_angle=iStopAngle,
	168	radius=log_start_radius,
	169	width=logwidth, fc='black')
	170	myPatches.append(arc)
	171	# now plot the normal outside
	172	arc, arcArray = plotArc(start_angle=iStartAngle, stop_angle=iStopAngle,
	173	radius=normal_start_radius,
	174	width=width, fc='black')
	175	else:
	176	if thisLog:
	177	base = 10
	178	width = track_radius * (cust_log(base, width_map[i])/cust_log(base, max(width_map)))
	179
	180	else:
	181	width = track_radius * (width_map[i]/max(width_map))
	182	arc, arcArray = plotArc(start_angle=iStartAngle, stop_angle=iStopAngle,
	183	radius=start_radius,
	184	width=width, fc='black')
	185	myPatches.append(arc)
	186
	187	maxx = start_radius
	188	maxy = 0
	189	kerning = 12
	190	if ticks != None and len(ticks) > 0:
	191	# plot the scalebar
	192	maxval=max(width_map)
	193	xs = []
	194	ys = []
	195	xend = start_radius
	196	value_list = ticks
	197	for value in value_list:
	198	centerAngle = 45
	199	this_radius = start_radius + \
	200	(track_radius * (cust_log(10, value)/cust_log(10, maxval)))
	201	print("this_radius", this_radius)
	202	arc, arcArray = plotArc(start_angle=centerAngle-1, stop_angle=centerAngle+1,
	203	radius=this_radius,
	204	width=track_radius/25, fc='red')
	205	xs.append(arcArray[0][0][-1])
	206	ys.append(arcArray[0][1][-1])
	207	myPatches.append(arc)
	208
	209	middle = np.mean(ys) - (kerning/2)
	210	# The tick marks should start slightly below the lowest tick
	211	# mark on the track itself and extend upward evenly with text.
	212	new_ys = [middle-kerning + (kerning * i)
	213	for i in range(len(value_list))]
	214	for i in range(len(value_list)):
	215	xlist = [xs[i], xend]
	216	ylist = [ys[i], new_ys[i]]
	217	# plots the lines between the markers and the labels
	218	panel.plot(xlist, ylist, ls='--', color='red', lw=0.75)
	219	# plots the text labels
	220	panel.text(xend, new_ys[i],
	221	value_list[i], fontsize = 12,
	222	ha='left', va='center',
	223	color = 'black', **hfont)
	224	#plots the title
	225	panel.text(start_radius, new_ys[-1]+kerning,
	226	"Read Depth ", fontsize = 10,
	227	ha='center', va='center',
	228	color = 'black', **hfont)
	229
	230
	231	return myPatches, start_radius + track_radius, panel
	232
	233	def plot_reads(args, angleMap, widthDict, samFiledf, start_radius,
	234	doubled = False, collapse = False, track_width=False,
	235	track_depth = False, thisLog = False):
	236	"""this function plots the reads in a sam file.
	237	Outputs the patches, and the final_radius
	238
	239	Author:
	240	Darrin T. Schultz (github@conchoecia)
	241	"""
	242
	243	# there should be a different width dict with log scale
	244	#if args.log and thisLog:
	245	# widthDict = {'M':0.8, # match
	246	# 'I':0.95, # insertion relative to reference
	247	# 'D':0.05, # deletion relative to reference
	248	# 'N':0.1, # skipped region from the reference
	249	# 'S':0.1, # soft clip, not aligned but still in sam file
	250	# 'H':0.1, # hard clip, not aligned and not in sam file
	251	# 'P':0.1, # padding (silent deletion from padded reference)
	252	# '=':0.1, # sequence match
	253	# 'X':0.1} # sequence mismatch
	254
	255	# clean up the df to reset the number of rows, otherwise there might be
	256	# errors in the while loop below
	257	samFiledf = samFiledf.reset_index()
	258
	259	append_radius = 0
	260	myPatches = []
	261	depth_map = []
	262	# if we need to collapse the reads so they aren't 1 read per line, keep track
	263	# of plotted depth at a position with depth_map
	264	if collapse == True:
	265	if doubled:
	266	plotted_depth_map = [0] * int((len(angleMap)/2))
	267	else:
	268	plotted_depth_map = [0] * len(angleMap)
	269
	270	# If we define a track width, then use the max read depth
	271	# of the track to figure out the read_width
	272	read_width = 1
	273	if track_width:
	274	read_width = track_width/track_depth
	275
	276	# This while loops cycles through the the list until everything is plotted
	277	# We start at the first index of the file and count down. If we plot it, then
	278	# we remove that read from the dataframe. Once we get to the last index,
	279	# we attempt to plot and afterward reset the index of the dataframe and
	280	# start from the beginning again.
	281	i = 0
	282	plotted = False
	283	#this is used to dtermine where we are during collapse
	284	current_row = 1
	285	if args.log and thisLog:
	286	current_row = 10
	287	skip = False
	288	# for printing out the progress of plotting,
	289	# get the original number of rows to make placeholders
	290	original_rownum_charcount = len(str(len(samFiledf)))
	291	direction = 'for'
	292	print(samFiledf)
	293	while len(samFiledf) > 0:
	294	stringTuples = samFiledf.loc[i, 'TUPS']
	295	mapLen = samFiledf.loc[i, 'MAPLEN']
	296	#print("i: {} and maplen: {}".format(i, mapLen))
	297	#Subtract one because BAM uses 1-based indexing but plotting uses 0.
	298	# I think I could avoid this in the future by changing the parse
	299	start_index = samFiledf.loc[i, 'POS'] - 1
	300	start_angle= angleMap[start_index]
	301	stop_index = 0
	302	stop_angle = angleMap[stop_index]
	303	if args.log and thisLog:
	304	log = 100
	305	read_width = track_width * ((cust_log(log, current_row + 1)/cust_log(log, track_depth + 1))\
	306	- (cust_log(log, current_row)/cust_log(log, track_depth + 1)))
	307	#print("log track depth: {}".format(np.log10(track_depth + 1)))
	308	#print("log current row: {}".format(np.log10(current_row + 1)))
	309	#now look in the plotted_depth_map to see if there is already a read
	310	# that overlaps with the current read we are considering.
	311	if collapse:
	312	for collapse_i in range(start_index, start_index+mapLen):
	313	#print("looking at {} and found {}".format(collapse_i, plotted_depth_map[collapse_i]))
	314	if plotted_depth_map[collapse_i] >= current_row:
	315	skip = True
	316	break
	317	#if we don't skip it, we're gonna plot it!
	318	if not skip:
	319	# Note that we are plotting something here
	320	if collapse:
	321	for collapse_i in range(start_index, start_index+mapLen):
	322	plotted_depth_map[collapse_i] = plotted_depth_map[collapse_i] + 1
	323	for tup in stringTuples:
	324	if tup[1] == 'I':
	325	#If there is an insertion, back up halfway and make plot the
	326	# insertion to visually show a "bulge" with too much sequence.
	327	# do not advance the start index to resume normal plotting
	328	# after the insertion.
	329	iStartIndex = start_index-int(tup[0]/2)
	330	iStopIndex = iStartIndex + tup[0]
	331	iStartAngle = angleMap[iStartIndex]
	332	iStopAngle = angleMap[iStopIndex]
	333	arc, arcArray = plotArc(start_angle=iStartAngle, stop_angle=iStopAngle,
	334	radius=start_radius + append_radius,
	335	width=widthDict[tup[1]] * read_width, fc='black')
	336	else:
	337	stop_index = start_index + tup[0]
	338	stop_angle = angleMap[stop_index]
	339	arc, arcArray = plotArc(start_angle=start_angle, stop_angle=stop_angle,
	340	radius=start_radius + append_radius,
	341	width=widthDict[tup[1]] * read_width, fc='black')
	342	start_index = stop_index
	343	start_angle = angleMap[start_index]
	344	myPatches.append(arc)
	345	# If we're not collapsing the reads, we just advance one every row
	346	if not collapse:
	347	append_radius += read_width
	348	plotted = True
	349	#if we've ploted something, remove that read and reset the indices
	350	# since we will stay at the current index
	351	samFiledf = samFiledf.drop(i)
	352	samFiledf = samFiledf.reset_index()
	353	samFiledf = samFiledf.drop('index', 1)
	354	print("(countdown until done) rows: {0:0>{num}} dir: {1}\r".format(
	355	len(samFiledf), direction, num= original_rownum_charcount), end="")
	356	else:
	357	#if we weren't able to plot the current read, we will advance the
	358	# index and look for another read to plot
	359	# I tried to speed up the algorithm here by making the software 'smart'
	360	# about looking forward for the next read, but it was ~50% slower
	361	i += 1
	362	skip = False
	363	# we will only reach this condition if we aren't collapsing, since all
	364	# the reads will be removed before getting to this point
	365	if i >= len(samFiledf):
	366	i = 0
	367	#Once we've gone around a complete cycle, we can jump up to start
	368	# plotting the next row
	369	append_radius += read_width
	370	current_row += 1
	371	#every time we reset, reorganize so that we're now going in the
	372	# opposite direction to avoid skewing all the reads in the forward direction
	373	if args.interlace:
	374	if direction == 'for':
	375	bav = {"by":['POS','MAPLEN'], "asc": [False, False]}
	376	direction= 'rev'
	377	elif direction == 'rev':
	378	bav = {"by":['POS','MAPLEN'], "asc": [True, False]}
	379	direction = 'for'
	380	samFiledf.sort_values(by=bav["by"], ascending=bav['asc'],inplace=True)
	381	samFiledf.reset_index(inplace=True)
	382	samFiledf.drop('index', 1, inplace=True)
	383	print("\nfinal row is {}".format(current_row))
	384	return myPatches, start_radius + append_radius
	385
	386	def redwood(args):
	387	"""This function controls all the plotting features for the redwood plot.
	388	1) determine the length of the circular fasta sequence length
	389	"""
	390	# get the plotting options specific to this program.
	391	rc.update_rcParams()
	392	# This stops numpy from printing numbers in scientific notation.
	393	np.set_printoptions(suppress=True)
	394
	395
	396	start = time.time()
	397	print(args)
	398
	399	#---------------------------------------------------------------
	400	# 1) determine the length of the circular fasta sequence length
	401	#_______________________________________________________________
	402	main_doubled = True if 'main' in args.doubled else False
	403	# It is fine that we are using a global since this will never be manipulated
	404	global sequence_length
	405	if args.main_bam:
	406	samFile = BAMParse(args.main_bam, doubled = main_doubled)
	407	sequence_length = samFile.seqlength
	408	filename = samFile.filename
	409	else:
	410	sequence_length = GFFParse(args.gff).seqlen
	411	if sequence_length == 0:
	412	raise OSError("""You have used a SAM/BAM file with no header. Please add a header to
	413	the file.""")
	414
	415	# this also needs to be changed depending on if it was a concatenated SAM
	416	# if doubled = true, then use linspace between 0,720
	417	# if doubled = false, then use linspace between 0, 360
	418	# on second thought, it might not be necessary to change this value even
	419	# for doubled sequences
	420	global angleMap
	421	if 'main' in args.doubled:
	422	angleMap = np.linspace(0,720,sequence_length)
	423	else:
	424	angleMap = np.linspace(0,360,sequence_length)
	425
	426	#these are the line width for the different cigar string flags.
	427	# usually, only M, I, D, S, and H appear in bwa mem output
	428	widthDict = {'M':0.45, # match
	429	'I':0.9, # insertion relative to reference
	430	'D':0.05, # deletion relative to reference
	431	'N':0.1, # skipped region from the reference
	432	'S':0.1, # soft clip, not aligned but still in sam file
	433	'H':0.1, # hard clip, not aligned and not in sam file
	434	'P':0.1, # padding (silent deletion from padded reference)
	435	'=':0.1, # sequence match
	436	'X':0.1} # sequence mismatch
	437
	438	##################
	439	# make the redwood plot
	440	##################
	441
	442	figWidth = 5
	443	figHeight = 5
	444
	445	# fig1 is an arbitrary name, only one figure currently
	446	fig_1 = plt.figure(figsize=(figWidth, figHeight))
	447
	448	# This is the width and height of the plot in absolute
	449	# values relative to the figWidth and figHeight.
	450	circleDiameter = 5.0
	451
	452	# Center to plot
	453	leftMargin = (figWidth - circleDiameter)/2
	454	bottomMargin = leftMargin
	455
	456	# There is one panel in this figure that contains
	457	# the concentric circles that represent reads. In addition,
	458	# the latest version of this program adds the annotation to the exterior
	459	panelCircle = plt.axes([leftMargin/figWidth, #left
	460	bottomMargin/figHeight, #bottom
	461	circleDiameter/figWidth, #width
	462	circleDiameter/figHeight #height
	463	],frameon=False)
	464	# Some of these are defaults and redundant, but are included
	465	# for readability and convenience.
	466	panelCircle.tick_params(axis='both',which='both',\
	467	bottom='off', labelbottom='off',\
	468	left='off', labelleft='off', \
	469	right='off', labelright='off',\
	470	top='off', labeltop='off')
	471
	472	# Simply the list in which patches will be stored.
	473	myPatches = []
	474
	475	# Each read occupies a width of radius = 1. The max width occupied by any
	476	# type of match to the read is 0.9 (Insertion). The line width of a match
	477	# is 0.45. Look at the `widthDict` dictionary above to see the defined
	478	# widths. I chose radius = 15 to start with because it looks decent in most
	479	# scenarios, but maybe could use some tweaking in future iterations.
	480	start_radius_dict = {1: 10,
	481	20: 15,
	482	30: 20,
	483	35: 90}
	484	if args.main_bam:
	485	print("The number of rows is: {}".format(len(samFile.features)))
	486	for key in sorted(start_radius_dict):
	487	if len(samFile.features) >= key:
	488	radius = start_radius_dict[key]
	489	print("Chose radius: {}".format(radius))
	490	else:
	491	radius = start_radius_dict[30]
	492
	493	circle_fontsize = 14
	494	panelCircle.text(0, 0, "Position\n(bp)", fontsize = circle_fontsize,
	495	ha='center', va='center',
	496	color = 'black', **hfont)
	497
	498	# now plot ticks around the interior, as well as the text
	499	if 'main' in args.doubled:
	500	real_length = int(sequence_length/2)
	501	else:
	502	real_length = sequence_length
	503	this_len = int(real_length / 1000) * 1000
	504	for i in range(0, this_len + 1000, 1000):
	505	startAngle = angleMap[i] - 0.75
	506	stopAngle = angleMap[i] + 0.75
	507	this_radius = radius * 0.93
	508	this_width = radius * 0.04
	509	arc, arcArray = plotArc(start_angle=startAngle, stop_angle=stopAngle,
	510	radius=this_radius,
	511	width=this_width, fc='black')
	512	myPatches.append(arc)
	513	# now plot text if (val/1000) % 2 == 0
	514	if (i/1000) % 2 == 0:
	515	#the 0.98 gives some float
	516	x_pos = -np.cos(np.radians(angleMap[i]+90)) * this_radius
	517	y_pos = np.sin(np.radians(angleMap[i]+90)) * this_radius
	518	rotation = 0
	519	if i == 0:
	520	y_pos = np.sin(np.radians(angleMap[i]+90)) * this_radius * 0.98
	521	rotation = 0
	522	ha = 'center'
	523	va = 'top'
	524	if (angleMap[i] > 0 and angleMap[i] < 45):
	525	x_pos = -np.cos(np.radians(angleMap[i]+90 + 1.5)) * this_radius
	526	y_pos = np.sin(np.radians(angleMap[i]+90 + 1.5)) * this_radius
	527	rotation = (90 - angleMap[i]) * 0.95
	528	ha = 'right'
	529	va = 'top'
	530	elif (angleMap[i] >= 45 and angleMap[i] < 67.5):
	531	rotation = 90 - angleMap[i]
	532	ha = 'right'
	533	va = 'top'
	534	elif (angleMap[i] >= 67.5 and angleMap[i] < 90):
	535	# subtracted an addtl degree because it looked bad otherwise
	536	x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.75)) * this_radius
	537	y_pos = np.sin(np.radians(angleMap[i]+90 - 1.75)) * this_radius
	538	rotation = 90 - angleMap[i]
	539	ha = 'right'
	540	va = 'top'
	541	elif (angleMap[i] >= 90 and angleMap[i] < 112.5):
	542	# added an addtl degree because it looked bad otherwise
	543	x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.75)) * this_radius
	544	y_pos = np.sin(np.radians(angleMap[i]+90 - 1.75)) * this_radius
	545	rotation = 90 - angleMap[i]
	546	ha = 'right'
	547	#bottom is not good
	548	va = 'center'
	549	elif (angleMap[i] >= 112.5 and angleMap[i] < 135):
	550	# added an addtl degree because it looked bad otherwise
	551	x_pos = -np.cos(np.radians(angleMap[i]+90 + 1.25)) * this_radius
	552	y_pos = np.sin(np.radians(angleMap[i]+90 + 1.25)) * this_radius
	553	rotation = 90 - angleMap[i]
	554	ha = 'right'
	555	#bottom not good
	556	#center really not good
	557	va = 'bottom'
	558	elif (angleMap[i] >= 135 and angleMap[i] < 157.5):
	559	rotation = 90 - angleMap[i]
	560	ha = 'right'
	561	va = 'bottom'
	562	elif (angleMap[i] >= 157.5 and angleMap[i] < 180):
	563	x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.0)) * this_radius
	564	y_pos = np.sin(np.radians(angleMap[i]+90 - 1.0)) * this_radius
	565	rotation = 90 - angleMap[i]
	566	ha = 'right'
	567	va = 'bottom'
	568	elif (angleMap[i] >= 180 and angleMap[i] < 202.5):
	569	# added an addtl degree because it looked bad otherwise
	570	x_pos = -np.cos(np.radians(angleMap[i]+90 + 2.0)) * this_radius
	571	y_pos = np.sin(np.radians(angleMap[i]+90 + 2.0)) * this_radius
	572	rotation = 270 - angleMap[i]
	573	ha = 'left'
	574	va = 'bottom'
	575	elif (angleMap[i] >= 202.5 and angleMap[i] < 225):
	576	x_pos = -np.cos(np.radians(angleMap[i]+90 + 2.25)) * this_radius
	577	y_pos = np.sin(np.radians(angleMap[i]+90 + 2.25)) * this_radius
	578	rotation = 270 - angleMap[i]
	579	ha = 'left'
	580	va = 'bottom'
	581	elif (angleMap[i] >= 225 and angleMap[i] < 247.5):
	582	x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.0)) * this_radius
	583	y_pos = np.sin(np.radians(angleMap[i]+90 - 1.0)) * this_radius
	584	rotation = 270 - angleMap[i]
	585	ha = 'left'
	586	va = 'bottom'
	587	elif (angleMap[i] >= 247.5 and angleMap[i] < 270):
	588	x_pos = -np.cos(np.radians(angleMap[i]+90 - 3.0)) * this_radius
	589	y_pos = np.sin(np.radians(angleMap[i]+90 - 3.0)) * this_radius
	590	rotation = 270 - angleMap[i]
	591	ha = 'left'
	592	va = 'bottom'
	593	elif (angleMap[i] >= 270 and angleMap[i] < 292.5):
	594	rotation = 270 - angleMap[i]
	595	ha = 'left'
	596	va = 'bottom'
	597	elif (angleMap[i] >= 292.5 and angleMap[i] < 315):
	598	x_pos = -np.cos(np.radians(angleMap[i]+90 + 0.25)) * this_radius
	599	y_pos = np.sin(np.radians(angleMap[i]+90 + 0.25)) * this_radius
	600	rotation = 270 - angleMap[i]
	601	ha = 'left'
	602	va = 'top'
	603	elif (angleMap[i] >= 315 and angleMap[i] < 337.5):
	604	x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.6)) * this_radius
	605	y_pos = np.sin(np.radians(angleMap[i]+90 - 1.6)) * this_radius
	606	rotation = 270 - angleMap[i]
	607	ha = 'left'
	608	va = 'top'
	609	elif (angleMap[i] >= 337.5 and angleMap[i] < 360):
	610	x_pos = -np.cos(np.radians(angleMap[i]+90 - 5.0)) * this_radius
	611	y_pos = np.sin(np.radians(angleMap[i]+90 - 5.0)) * this_radius
	612	rotation = 270 - angleMap[i]
	613	ha = 'left'
	614	va = 'top'
	615	print(" angleMap: {} value: {} rotation: {}".format(angleMap[i], i, rotation))
	616	text = "{}".format(i)
	617	if i == 0:
	618	text = "1/\n{}".format(real_length)
	619	panelCircle.text(x_pos, y_pos, text, fontsize = circle_fontsize,
	620	ha=ha, va=va, color = 'black', rotation=rotation, **hfont)
	621	# Now add a legend
	622
	623	# If the user wants to plot long reads, plot them
	624	if args.main_bam:
	625	#turn the query string into something usable,
	626	# get rid of variables from argparse
	627	# Make sure we haven't told the program to not query
	628	if 'False' not in args.query:
	629	doubled = 'main' in args.doubled
	630	fix_query_reflength(sequence_length, args.query, doubled)
	631	# string all the queries together
	632	queryString = " and ".join(args.query)
	633	print("You are using this query string to filter reads:\n'{}'".format(queryString))
	634	samFile = samFile.features.query(queryString)
	635	if len(samFile) == 0:
	636	raise IOError("""Your query was too stringent and no reads resulted. Please try
	637	again with a less stringent test. Redwood plotter
	638	exiting""")
	639
	640	# now determine how to sort the reads in the order they will be plotted
	641	if args.small_start == 'inside':
	642	ascend = True
	643	elif args.small_start == 'outside':
	644	ascend = False
	645	samFile = samFile.sort_values(by=args.sort, ascending=ascend)
	646	samFile = samFile.reset_index()
	647	# It is necessary to drop the column called index, since reset_index()
	648	# adds this.
	649	samFile = samFile.drop('index', 1)
	650
	651	#this plots the central rings from the sam file
	652	read_patches, radius = plot_reads(args, angleMap, widthDict, samFile,
	653	radius, doubled = True, collapse = False)
	654	myPatches = myPatches + read_patches
	655
	656	# if the user would like to plot the annotation, plot it now. In the future,
	657	# allow the user to select the order in which the individual elements are
	658	# plotted. Since the annotation should have a fixed proportional size to
	659	# the circle independent of the number of plotted reads, define a new
	660	# radius context.
	661	if args.gff:
	662	print("the radius at the end of annotation is: {}".format(radius))
	663	panelCircle, gff_patches, gff_radius = plot_gff(args, panelCircle, args.gff, radius)
	664	myPatches = myPatches + gff_patches
	665	radius = gff_radius
	666
	667	# it is helpful to be able to plot the RNAseq data along with the annotation.
	668	# plot that directly outside the annotation
	669	if args.rnaseq_bam:
	670	print("in RNAseq")
	671	rna_doubled = True if 'rnaseq' in args.doubled else False
	672	bamobject = BAMParse(args.rnaseq_bam, doubled = rna_doubled)
	673	samFile = bamobject.features
	674	track_width = radius * 0.15
	675	#read_patches, radius = plot_reads(args, angleMap,
	676	# widthDict, samFile, radius,
	677	# doubled = rna_doubled,
	678	# collapse = True, track_width = track_width,
	679	# track_depth = max(bamobject.raw_depthmap),
	680	# thisLog = True)
	681	radius_orig=radius
	682	read_patches, radius, panelCircle = plot_histo(panelCircle, args, angleMap,
	683	bamobject.get_depthmap(), radius,
	684	track_width,
	685	thisLog = True,
	686	ticks = args.ticks)
	687	myPatches = myPatches + read_patches
	688	myPatches.append(arc)
	689
	690	# The numseqs value is used to determine the viewing dimensions
	691	# of the circle we will plot. It is scaled with the number of sequences
	692	# that will be plotted. This value should probably be set last to
	693	# accommodate other tracks, like annotation.
	694	min_radius = int(-5 - np.ceil(radius))
	695	max_radius = int(5 + np.ceil(radius))
	696	panelCircle.set_xlim([min_radius, max_radius])
	697	panelCircle.set_ylim([min_radius, max_radius])
	698
	699	for patch in myPatches:
	700	panelCircle.add_patch(patch)
	701
	702	end = time.time()
	703	print(end - start)
	704	# Print image(s)
	705	ifargs.BASENAME is None:
	706	file_base = "redwood"
	707	else:
	708	file_base = args.BASENAME
	709	print_images(
	710	base_output_name=file_base,
	711	image_formats=args.fileform,
	712	no_timestamp = args.no_timestamp,
	713	dpi=args.dpi,
	714	transparent=args.transparent)
	715
	716	def feature_set_direction(feature_set_df):
	717	"""This function determines if the features in the dataframe passed here are
	718	all forward, all reverse, or mixed"""
	719	all_pos = all(feature_set_df['strand'] == '+')
	720	all_neg = all(feature_set_df['strand'] == '-')
	721	if all_pos:
	722	return '+'
	723	elif all_neg:
	724	return '-'
	725	else:
	726	return 'mixed'
	727
	728	def plot_feature(this_feature, colorMap, start_radius,
	729	bar_thickness, direction, this_feature_overlaps_feature):
	730	"""This plots the track for a feature, and if there is something for
	731	'this_feature_overlaps_feature', then there is special processing to
	732	add the white bar and the extra slope for the chevron
	733	"""
	734	myPatches = []
	735	if this_feature_overlaps_feature.empty:
	736	iStartAngle = angleMap[this_feature['start']]
	737	iStopAngle = angleMap[this_feature['stop']] - 2
	738	arc, arcArray = plotArc(start_angle=iStartAngle,
	739	stop_angle=iStopAngle,
	740	radius = start_radius, width=bar_thickness,
	741	fc=colorMap[this_feature['featType']])
	742	myPatches.append(arc)
	743	#this bit plots the arrow triangles for the genes.
	744	# Right now it makes each arrow only 1 degree in width and uses 100 segments to plot it.
	745	# This resolution hasn't given me any artifacts thus far
	746	tStartAngle = angleMap[this_feature['stop']]-2
	747	tStopAngle = angleMap[this_feature['stop']]
	748	angles = np.linspace(tStartAngle,tStopAngle,100)
	749	widths = np.linspace(bar_thickness,0,100)
	750	for j in range(len(angles)-1):
	751	arc, arcArray = plotArc(start_angle=tStartAngle,
	752	stop_angle=angles[j+1],
	753	radius=start_radius+(bar_thickness-widths[j+1])/2,
	754	width=widths[j+1],
	755	fc=colorMap[this_feature['featType']])
	756	myPatches.append(arc)
	757	else:
	758	# First, make a solid white bar
	759	iStartAngle = angleMap[this_feature['start']]
	760	iStopAngle = angleMap[this_feature_overlaps_feature['start']]
	761	arc, arcArray = plotArc(start_angle=iStartAngle,
	762	stop_angle=iStopAngle,
	763	radius = start_radius, width=bar_thickness,
	764	fc=colorMap['spacebar'])
	765	myPatches.append(arc)
	766	# Now, make the actual color bar
	767	iStartAngle = angleMap[this_feature['start']]
	768	iStopAngle = angleMap[this_feature_overlaps_feature['start']] - 1
	769	arc, arcArray = plotArc(start_angle=iStartAngle,
	770	stop_angle=iStopAngle,
	771	radius = start_radius, width=bar_thickness,
	772	fc=colorMap[this_feature['featType']])
	773	myPatches.append(arc)
	774	#first plot a little pink bar for the outline
	775	tStartAngle = angleMap[this_feature_overlaps_feature['start']]
	776	tStopAngle = angleMap[this_feature['stop']]+1
	777	angles = np.linspace(tStartAngle,tStopAngle,100)
	778	widths = np.linspace(bar_thickness,0,100)
	779	for j in range(len(angles)-1):
	780	arc, arcArray = plotArc(start_angle=tStartAngle,
	781	stop_angle=angles[j+1],
	782	radius=start_radius+(bar_thickness-widths[j+1])/2,
	783	width=widths[j+1],
	784	fc=colorMap['spacebar'])
	785	myPatches.append(arc)
	786	#this bit plots the arrow triangles for the genes.
	787	# Right now it makes each arrow only 1 degree in width and uses 100 segments to plot it.
	788	# This resolution hasn't given me any artifacts thus far
	789	tStartAngle = angleMap[this_feature_overlaps_feature['start']]-1
	790	tStopAngle = angleMap[this_feature['stop']]
	791	angles = np.linspace(tStartAngle,tStopAngle,100)
	792	widths = np.linspace(bar_thickness,0,100)
	793	for j in range(len(angles)-1):
	794	arc, arcArray = plotArc(start_angle=tStartAngle,
	795	stop_angle=angles[j+1],
	796	radius=start_radius+(bar_thickness-widths[j+1])/2,
	797	width=widths[j+1],
	798	fc=colorMap[this_feature['featType']])
	799	myPatches.append(arc)
	800	return myPatches
	801
	802	def get_angles(name, center_angle, kerning_angle):
	803	num_chars = len(name.strip())
	804	if num_chars == 2:
	805	start_pos = center_angle
	806	stop_pos = center_angle + kerning_angle
	807	return (start_pos, stop_pos)
	808	if num_chars % 2 == 0:
	809	start_pos = center_angle - (kerning_angle/2) - ((num_chars - 1)/2)
	810	stop_pos = center_angle + (kerning_angle/2) + ((num_chars - 1)/2)
	811	else:
	812	start_pos = center_angle - (num_chars/2)
	813	stop_pos = center_angle + (num_chars/2)
	814	angles = np.arange(start_pos, stop_pos + kerning_angle, kerning_angle)
	815	return angles
	816
	817	def plot_gff(args, panelCircle, gff_path, radius):
	818	#parse the gff file
	819	gffParser = GFFParse(gff_path)
	820
	821	# Because this size should be relative to the circle that it is plotted next
	822	# to, define the start_radius as the place to work from, and the width of
	823	# each track.
	824	start_radius = radius
	825	track_width = radius * 0.15
	826
	827	colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red',
	828	'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white',
	829	'ORF':'orange'}
	830	augment = 0
	831	bar_thickness = 0.9 * track_width
	832	# return these at the end
	833	myPatches=[]
	834	plot_order = []
	835	# this for loop relies on the gff features to already be sorted
	836	i = 0
	837	idone = False
	838	# we need to filter out the tRNAs since those are plotted last
	839	plottable_features = gffParser.features.query("featType != 'tRNA' and featType != 'region' and featType != 'source'")
	840	plottable_features = plottable_features.reset_index(drop=True)
	841	while idone == False:
	842	print("im in the overlap-pairing while loop i={}".format(i))
	843	# look ahead at all of the elements that overlap with the ith element
	844	jdone = False
	845	j = 1
	846	this_set_minimum_index = i
	847	this_set_maximum_index = i
	848	while jdone == False:
	849	print("new i= {} j={} len={}".format(i, j, len(plottable_features)))
	850	# first make sure that we haven't gone off the end of the dataframe
	851	if i+j == len(plottable_features):
	852	if i == len(plottable_features)-1:
	853	# this is the last analysis, so set idone to true
	854	# to finish after this
	855	idone = True
	856	# the last one can't be in its own group, so just add it solo
	857	these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,]
	858	plot_order.append(these_features.reset_index(drop=True))
	859	break
	860	jdone == True
	861	else:
	862	# if the lmost of the next gene overlaps with the rmost of
	863	# the current one, it overlaps and couple together
	864	if plottable_features.loc[i+j, 'lmost'] < plottable_features.loc[i, 'rmost']:
	865	# note that this feature overlaps with the current
	866	this_set_maximum_index = i+j
	867	# ... and we need to look at the next in line
	868	j += 1
	869	else:
	870	i += 1 + (this_set_maximum_index - this_set_minimum_index)
	871	#add all of the things that grouped together once we don't find any more groups
	872	these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,]
	873	plot_order.append(these_features.reset_index(drop=True))
	874	jdone = True
	875
	876	for feature_set in plot_order:
	877	print(feature_set)
	878	direction = feature_set_direction(feature_set)
	879	print("direction = {}".format(direction))
	880	if direction == '+':
	881	for i in range(len(feature_set)-1, -1,-1):
	882	print("inside the plot for loop i = {}".format(i))
	883	this_feature = feature_set.loc[i,]
	884	print("got this single feature")
	885	#For the first element, just plot it normally
	886	if i == len(feature_set) - 1:
	887	#plot the annotation
	888	print("Im in the first plot thing")
	889	patches = plot_feature(this_feature, colorMap, start_radius,
	890	bar_thickness, direction,
	891	pd.Series([]))
	892	for each in patches:
	893	myPatches.append(each)
	894	else:
	895	print("now I'm plotting the other one")
	896	overlapped_feature = feature_set.loc[i+1,]
	897	print("this is the overlapped feature")
	898	print(overlapped_feature)
	899	patches = plot_feature(this_feature, colorMap, start_radius,
	900	bar_thickness, direction,
	901	overlapped_feature)
	902	for each in patches:
	903	myPatches.append(each)
	904
	905	final_radius = start_radius + track_width
	906	# Now we add all of the tRNAs to this to plot, do it last to overlay
	907	# everything else
	908	tRNAs = gffParser.features.query("featType == 'tRNA'")
	909	tRNAs.reset_index(inplace=True, drop = True)
	910	print(tRNAs)
	911	tRNA_bar_thickness = bar_thickness * (0.8)
	912	tRNA_start_radius = start_radius + ((tRNA_bar_thickness * (0.2))/2)
	913	#tRNA_bar_thickness = bar_thickness
	914	#tRNA_start_radius = start_radius
	915	print("sequence_length = {}".format(sequence_length))
	916	angle_ranges = []
	917	for i in range(0,len(tRNAs)):
	918	this_feature = tRNAs.loc[i,]
	919	min_angle = angleMap[min(this_feature.loc['start'],this_feature.loc['stop'])]
	920	max_angle = angleMap[max(this_feature.loc['start'],this_feature.loc['stop'])]
	921	angle_ranges.append((min_angle, max_angle))
	922	patches = plot_feature(this_feature, colorMap, tRNA_start_radius,
	923	tRNA_bar_thickness, direction,
	924	pd.Series([]))
	925	for each in patches:
	926	myPatches.append(each)
	927	print("angle ranges of tRNAs")
	928	print(angle_ranges)
	929
	930	# now plot the text of the all the things. do this last
	931	# to cover the previous things we plotted
	932	for i in range(len(gffParser.features.index)):
	933	if gffParser.features.loc[i, 'featType'] not in ['region', 'source', 'tRNA']:
	934	angles = []
	935	# to plot the centers, first figure out the center of the annotated
	936	# gene by subtracting the end from the beginning
	937	name = gffParser.features.loc[i,'name']
	938	middle_position = int(gffParser.features.loc[i,'center'])
	939	# calculate the radius, because it changes depending on the track.
	940	# Remember that the radius is the position at the bottom of the track,
	941	# so we must use va=bottom.
	942	center_radius = radius + (gffParser.features.loc[i,'track'] * track_width) + (bar_thickness/2)
	943	# then, figure out the center angle of that position.
	944	# I subtract one since I want to center this for the non-arrow
	945	# part of the bar.
	946	center_angle = angleMap[middle_position]-2
	947	count = 0
	948	while count < 1:
	949	print()
	950	print(name)
	951	#figure out how many characters to plot
	952	kerning_angle = 2.5
	953	print("putting in {} into get angles with center angle = {}, kerning_angle = {}".format(name, center_angle, kerning_angle))
	954	angles = get_angles(name, center_angle, kerning_angle)
	955	char_min_angle = angles[0]-(kerning_angle / 2)
	956	char_max_angle = angles[-1] - (kerning_angle / 2)
	957	overlapping_angles = []
	958	# for every possible tRNA position, see if the minimum or
	959	# max angle lies in the range of the text
	960	for tRNA_overlap_range in angle_ranges:
	961	for tRNA_position in tRNA_overlap_range:
	962	if (char_min_angle < tRNA_position and
	963	tRNA_position < char_max_angle):
	964	#overlapping_angles contains the tRNA overlapping angles now
	965	overlapping_angles.append(tRNA_position)
	966	print("character angles {}".format(angles))
	967	print("overlapping angles {}".format(overlapping_angles))
	968	print("in while loop, center = {}".format(center_angle))
	969	if len(overlapping_angles) == 0:
	970	count += 1
	971	else:
	972	min_overlapping_angle = min(overlapping_angles)
	973	print("min_overlapping_angle = {}".format(min_overlapping_angle))
	974	max_overlapping_angle = max(overlapping_angles)
	975	print("max_overlapping_angle = {}".format(max_overlapping_angle))
	976	gene_start_angle = angleMap[gffParser.features.loc[i,'start']]
	977	gene_stop_angle = angleMap[gffParser.features.loc[i,'stop']]
	978	start_dif = abs(min_overlapping_angle - gene_start_angle)
	979	stop_dif = abs(gene_stop_angle - max_overlapping_angle)
	980	print("start_dif = {}, stop_dif = {}".format(start_dif, stop_dif))
	981	if start_dif > stop_dif:
	982	center_angle = min_overlapping_angle - ((min_overlapping_angle - gene_start_angle)/2)
	983	else:
	984	# the extra -2 is to compensate for the taper. This only
	985	# applies to the end of a gene stop. This code won't
	986	# work properly for - strand things currently
	987	center_angle = gene_stop_angle - 2 - ((gene_stop_angle - max_overlapping_angle)/2)
	988	count += 1
	989	print("center angle after assignment = {}".format(center_angle))
	990	#There is some overlap in the text and we need to
	991	# figure out the min overlap and max overlap. Then figure
	992	# out if the angle between min and start or max and end is greater
	993	# use whichever is greater as the new place to put the text, then calculate the center
	994	angles = get_angles(name, center_angle, kerning_angle)
	995	text_angle = angles[-1] - angles[0]
	996	# this prints all the characters in their positions and rotates
	997	print_count = 0
	998	for j in range(len(angles)):
	999	this_width = gffParser.features.loc[i,'width']
	1000	if not text_angle > (angleMap[this_width] - 1):
	1001	if center_angle > 95 and center_angle < 265:
	1002	rotation = (-1 * center_angle) - 180
	1003	this_char = name[len(angles) - 1 - j]
	1004	else:
	1005	rotation = -1 * center_angle
	1006	print("name: {} j: {}".format(name, j))
	1007	this_char = name[j]
	1008	this_angle = angles[j]
	1009	# now calculate the absolute x position
	1010	x_pos = -np.cos(np.radians(this_angle+90)) * center_radius
	1011	# now calculate the absolute y position
	1012	y_pos = np.sin(np.radians(this_angle+90)) * center_radius
	1013	#rotate the text if necessary
	1014	panelCircle.text(x_pos, y_pos, this_char, fontsize=10,
	1015	ha='center', va='center',
	1016	rotation = rotation,
	1017	family = 'monospace',
	1018	color = 'white')
	1019	# this block handles the case where the text is too small to put
	1020	# parallel with the mitochondrial circle, so it is perpindicular
	1021	else:
	1022	if print_count == 0:
	1023	if gffParser.features.loc[i,'strand'] == '+':
	1024	gene_start_angle = angleMap[gffParser.features.loc[i,'start']]
	1025	gene_stop_angle = angleMap[gffParser.features.loc[i,'stop']]
	1026	center_angle = gene_stop_angle - ((gene_stop_angle-gene_start_angle)/2)
	1027	# now calculate the absolute x position
	1028	x_pos = -np.cos(np.radians(center_angle+90)) * center_radius
	1029	# now calculate the absolute y position
	1030	y_pos = np.sin(np.radians(center_angle+90)) * center_radius
	1031	if (center_angle > 0 and center_angle < 180):
	1032	rotation = -1 * center_angle + 90
	1033	else:
	1034	rotation = -1 * center_angle - 90
	1035	panelCircle.text(x_pos, y_pos, name, fontsize=6,
	1036	ha='center', va='center',
	1037	rotation = rotation,
	1038	family = 'monospace',
	1039	color = 'white')
	1040
	1041	print_count += 1
	1042
	1043	return panelCircle, myPatches, final_radius
	1044
	1045	def run(args):
	1046	redwood(args)

+250

-0

pauvre/stats.py less more

	0	#!/usr/bin/env python
	1	# -- coding: utf-8 --
	2
	3	# pauvre - just a pore PhD student's plotting package
	4	# Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved.
	5	#
	6	# This file is part of pauvre.
	7	#
	8	# pauvre is free software: you can redistribute it and/or modify
	9	# it under the terms of the GNU General Public License as published by
	10	# the Free Software Foundation, either version 3 of the License, or
	11	# (at your option) any later version.
	12	#
	13	# pauvre is distributed in the hope that it will be useful,
	14	# but WITHOUT ANY WARRANTY; without even the implied warranty of
	15	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	16	# GNU General Public License for more details.
	17	#
	18	# You should have received a copy of the GNU General Public License
	19	# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
	20
	21	# TODO: make a function to nicely print out the pandas dataframes
	22
	23	"""
	24	Program: pauvre stats
	25
	26	example usage/output:
	27	- For example, if I want to know the number of bases and reads with AT LEAST
	28	PHRED score 5 and AT LEAST length of 500, I can run the program as you
	29	see below and look at the cells highlighted with <braces>.
	30	- We find that there are 968,653 basepairs contained in 859 reads that
	31	fit meanReadQuality >= Q5, readLen >= 500.
	32	> pauvre marginplot --fastq miniDSMN15.fastq
	33
	34	>numReads: 1000
	35	>numBasepairs: 1029114
	36	>meanLen: 1029.114
	37	>medianLen: 875.5
	38	>minLen: 11
	39	>maxLen: 5337
	40	>N50: 1278
	41	>L50: 296
	42
	43	> Basepairs >= bin by mean PHRED and length
	44	>minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
	45	> 0 1029114 1010681 935366 429279 143948 25139 3668 2938 2000 0
	46	> 500 984212 <968653> 904787 421307 142003 24417 3668 2938 2000 0
	47	> 1000 659842 649319 616788 300948 103122 17251 2000 2000 2000 0
	48	> et cetera...
	49	> Number of reads >= bin by mean Phred+Len
	50	>minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
	51	> 0 1000 969 865 366 118 22 3 2 1 0
	52	> 500 873 <859> 789 347 113 20 3 2 1 0
	53	> 1000 424 418 396 187 62 11 1 1 1 0
	54	> et cetera...
	55	"""
	56
	57
	58	from pauvre.functions import parse_fastq_length_meanqual
	59	import os
	60	import pandas as pd
	61	import numpy as np
	62
	63
	64	def stats(df, fastqName, histogram=False):
	65	"""
	66	arguments:
	67	<df>
	68	- a pandas dataframe with cols ['length', 'meanQual'] that contains read
	69	length and quality information
	70	<fastqName>
	71	- just the name of the fastq file. This is used for printing out headers
	72	for the data tables.
	73
	74	purpose:
	75	Calculate and output various stats about this fastq file. Currently
	76	this program reports:
	77	- Total number of reads
	78	- Total number of basepairs
	79	- mean
	80	- median
	81	- min
	82	- max
	83	- N50
	84	- A table of num basepairs filtered by min mean PHRED and length
	85	- A table of num reads filtered by the same as above ^
	86
	87	example usage/output:
	88	- For example, if I want to know the number of bases and reads with AT LEAST
	89	PHRED score 5 and AT LEAST length of 500, I can run the program as you
	90	see below and look at the cells highlighted with <braces>.
	91	- We find that there are 968,653 basepairs contained in 859 reads that
	92	fit meanReadQuality >= Q5, readLen >= 500.
	93	> pauvre marginplot --fastq miniDSMN15.fastq
	94
	95	>numReads: 1000
	96	>numBasepairs: 1029114
	97	>meanLen: 1029.114
	98	>medianLen: 875.5
	99	>minLen: 11
	100	>maxLen: 5337
	101	>N50: 1278
	102	>L50: 296
	103
	104	> Basepairs >= bin by mean PHRED and length
	105	>minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
	106	> 0 1029114 1010681 935366 429279 143948 25139 3668 2938 2000 0
	107	> 500 984212 <968653> 904787 421307 142003 24417 3668 2938 2000 0
	108	> 1000 659842 649319 616788 300948 103122 17251 2000 2000 2000 0
	109	> et cetera...
	110	> Number of reads >= bin by mean Phred+Len
	111	>minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
	112	> 0 1000 969 865 366 118 22 3 2 1 0
	113	> 500 873 <859> 789 347 113 20 3 2 1 0
	114	> 1000 424 418 396 187 62 11 1 1 1 0
	115	> et cetera...
	116	"""
	117	fastqBase = os.path.basename(fastqName)
	118
	119	analysis_sizes = [0, 1000, 5000, 10000]
	120	print_string = ""
	121	for this_size in analysis_sizes:
	122	these_lengths = df.loc[df["length"] >= this_size, "length"]
	123	if len(these_lengths) > 0:
	124	print_string += "# Fastq stats for {}, reads >= {}bp\n".format(fastqBase, this_size)
	125	print_string += "numReads: {}\n".format(len(these_lengths))
	126	print_string += "%totalNumReads: {0:.2f}\n".format((len(these_lengths) / len(df)) * 100)
	127	print_string += "numBasepairs: {}\n".format(sum(these_lengths))
	128	print_string += "%totalBasepairs: {0:.2f}\n".format(
	129	(sum(these_lengths) / sum(df["length"])) * 100)
	130	print_string += "meanLen: {}\n".format(np.mean(these_lengths))
	131	print_string += "medianLen: {}\n".format(np.median(these_lengths))
	132	print_string += "minLen: {}\n".format(min(these_lengths))
	133	maxLen = max(these_lengths)
	134	print_string += "maxLen: {}\n".format(maxLen)
	135
	136	# calculate the N50
	137	fiftypercent = 0.5 * sum(these_lengths)
	138	lenSum = 0
	139	count = 0
	140	for val in sorted(these_lengths, reverse=True):
	141	lenSum += val
	142	count += 1
	143	if lenSum >= fiftypercent:
	144	print_string += "N50: {}\n".format(int(val))
	145	print_string += "L50: {}\n".format(count)
	146	break
	147	print_string += "\n"
	148
	149	print_string += lengthQual_table(df)
	150
	151	if histogram: # now make a histogram with read lengths
	152	histogram_lengths(df["length"], fastqBase.split('.')[0])
	153	print(print_string)
	154
	155
	156	def lengthQual_table(df):
	157	"""Create a table with lengths/basepairs on columns and qualities on rows"""
	158	# This block calculates the number of length bins for this data set
	159	lengthBinList = []
	160	size_map = [(1000, 250),
	161	(10000, 500),
	162	(40000, 1000),
	163	(100000, 5000),
	164	(500000, 20000),
	165	(1000000, 50000),
	166	(10000000000, 100000)]
	167	# first, figure out where we will start the table
	168	minlen = min(df["length"])
	169	current_val = 0
	170	firstDone = False
	171	for this_max_size, this_size_step in size_map:
	172	for this_bin in range(current_val, this_max_size, this_size_step):
	173	if minlen < this_bin:
	174	if not firstDone:
	175	lengthBinList.append(prev)
	176	firstDone = True
	177	lengthBinList.append(this_bin)
	178	prev = this_bin
	179	current_val = this_max_size
	180	# now figure out the largest bin
	181	maxLen = df["length"].max()
	182	first_index_gt_maxLen = next(i for i, v in enumerate(lengthBinList) if v > maxLen) + 1
	183	lengthBinList = lengthBinList[0:first_index_gt_maxLen]
	184
	185	qualBinList = []
	186	increment_by = 1
	187	while len(qualBinList) == 0 or len(qualBinList) > 15:
	188	# now set up the bins for mean PHRED
	189	minQual = int(np.floor(min(df["meanQual"])))
	190	maxQual = int(np.ceil(max(df["meanQual"])))
	191	qualBinList = list(np.arange(minQual, maxQual + increment_by, increment_by))
	192	increment_by += 0.25
	193
	194	# now make a table of read lengths
	195	bpTots = []
	196	readnumTots = []
	197	for row in range(len(lengthBinList)):
	198	dataNums = []
	199	readNums = []
	200	for column in range(len(qualBinList)):
	201	thisQuery = df.query("length >= {} and meanQual >= {}".format(
	202	lengthBinList[row], qualBinList[column]))
	203	dataNums.append(sum(thisQuery['length']))
	204	readNums.append(len(thisQuery.index))
	205	bpTots.append(dataNums)
	206	readnumTots.append(readNums)
	207
	208	tables = {"Basepairs >= bin by mean PHRED and length": bpTots,
	209	"Number of reads >= bin by mean Phred+Len": readnumTots}
	210	print_table = ""
	211	for key in sorted(tables):
	212	# make a dataframe of our basepair distribution table
	213	dataDf = pd.DataFrame(tables[key], columns=["Q{}".format(x) for x in qualBinList])
	214	# add the min lengths as a column
	215	dataDf.insert(0, 'minLen', lengthBinList)
	216	print_table += pretty_print_table(dataDf, key)
	217	return print_table
	218
	219
	220	def histogram_lengths(length, name_prefix):
	221	"""Create a histogram of read counts per length."""
	222	counts = length.value_counts().to_frame(name="readCount")
	223	counts.index.rename('readLen', inplace=True)
	224	counts.sort_index(inplace=True)
	225	counts.to_csv("{}.hist.csv".format(name_prefix), index=True)
	226
	227
	228	def pretty_print_table(df, title):
	229	print_string = ""
	230	dataframeStr = df.to_string(index=False)
	231	# this is the char width of the whole table printed
	232	lendataframeStr = len(dataframeStr.split('\n')[0])
	233	# this is the char width of the minlen portion of the printed table
	234	minLenLen = len(dataframeStr.split()[0])
	235	blank = " " * minLenLen
	236	# center the text on this offset as the table title
	237	txtoffset = lendataframeStr - minLenLen
	238	print_string += "\n{}{:^{offset}}\n".format(
	239	blank, title, offset=txtoffset)
	240	print_string += dataframeStr + "\n"
	241	return print_string
	242
	243
	244	def run(args):
	245	"""This just opens the fastq file and passes the info to the stats() function.
	246	This is a wrapper function that is accessed by pauvre_main.
	247	Useful since we can call stats() independently from other pauvre programs."""
	248	df = parse_fastq_length_meanqual(args.fastq)
	249	stats(df, args.fastq, histogram=args.histogram)

+578

-0

pauvre/synplot.py less more

	0	#!/usr/bin/env python
	1	# -- coding: utf-8 --
	2
	3	# pauvre - just a pore plotting package
	4	# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
	5	#
	6	# This file is part of pauvre.
	7	#
	8	# pauvre is free software: you can redistribute it and/or modify
	9	# it under the terms of the GNU General Public License as published by
	10	# the Free Software Foundation, either version 3 of the License, or
	11	# (at your option) any later version.
	12	#
	13	# pauvre is distributed in the hope that it will be useful,
	14	# but WITHOUT ANY WARRANTY; without even the implied warranty of
	15	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	16	# GNU General Public License for more details.
	17	#
	18	# You should have received a copy of the GNU General Public License
	19	# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
	20
	21	# TODO
	22	# import the pauvre rcParams
	23	# Cleanup everything
	24
	25	import pandas as pd
	26	pd.set_option('display.max_columns', 500)
	27	pd.set_option('display.width', 1000)
	28	import numpy as np
	29	import os
	30	import pauvre.rcparams as rc
	31	from pauvre.functions import GFFParse, print_images, timestamp
	32	from pauvre import gfftools
	33	from pauvre.lsi.lsi import intersection
	34	import progressbar
	35	import platform
	36	import sys
	37	import time
	38	import warnings
	39
	40	# for the shuffling algorithm
	41	from itertools import product
	42
	43	# Biopython stuff
	44	from Bio import SeqIO
	45	import Bio.SubsMat.MatrixInfo as MI
	46
	47	# following this tutorial to install helvetica
	48	# https://github.com/olgabot/sciencemeetproductivity.tumblr.com/blob/master/posts/2012/11/how-to-set-helvetica-as-the-default-sans-serif-font-in.md
	49	global hfont
	50	hfont = {'fontname':'Helvetica'}
	51
	52	import matplotlib
	53	matplotlib.use('agg')
	54	import matplotlib.pyplot as plt
	55	from matplotlib.colors import LinearSegmentedColormap, Normalize
	56	import matplotlib.patches as patches
	57
	58	def shuffle_optimize_gffs(args, GFFs):
	59	"""This function takes in a list of GFF objects and reshuffles the
	60	individual files such that the resulting sequence of GFF files has
	61	the minimum number of intersections when plotting synteny
	62
	63	if args.optimum_order, then the program will find the global minimum
	64	arrangement using the first GFF file as the anchor.
	65
	66	if not args.optimum_order, then the program will find the local minimum
	67	shuffle between every input pair of GFF files to plot in the best way possible
	68	given the input order.
	69
	70	returns a list of GFF files from which the user can calculate plotting coordinates
	71	"""
	72	# we use the first-input gff as the topmost sequence,
	73	# and then find the best synteny match for the remaining sequences
	74	shuffled_gffs = []
	75	if args.optimum_order:
	76	firstgff = GFFs[0]
	77	# save the first gff file unadultered
	78	shuffled_gffs.append(firstgff)
	79	nextgffs = GFFs[1:]
	80	while len(nextgffs) > 0:
	81	obs_list = []
	82	for i in range(len(nextgffs)):
	83	# every observation will be stored here as a tuple.
	84	# zeroth element is the num intersections with the current gff
	85	# first element is the index of nextgffs
	86	# second element is the GFF object
	87	shuffles = nextgffs[i].shuffle()
	88	for k in range(len(shuffles)):
	89	shuf = shuffles[k]
	90	coords = firstgff.couple(shuf, this_y = 0, other_y = 1)
	91	num_inters = len(intersection(coords))
	92	obs_list.append((num_inters, i, shuf))
	93	#print(obs_list[-1])
	94	intersections, gffixs, shufs = zip(*obs_list)
	95	# get the index of the shuffled gff with the least number of
	96	# intersections to the current one against which we are comparing
	97	#print("intersections", intersections)
	98	selected_ix = intersections.index(min(intersections))
	99	# save this gff to shuffled gffs to use later for plotting
	100	shuffled_gffs.append(shufs[selected_ix])
	101	# remove the origin of the shuffled gff from nextgffs
	102	del nextgffs[gffixs[selected_ix]]
	103	#print("global minimum was {} intersections".format(min(intersections)))
	104	# now update the firstgff to the latest shuffled one we collected
	105	firstgff = shufs[selected_ix]
	106	# plot the gff files in the order in which you input them,
	107	# but shuffle them to the order with least intersections
	108	else:
	109	# first we need to find the best arrangement by finding the combinations
	110	# that share the most unique genes
	111	genes_series = [GFFs[i].get_unique_genes() for i in range(len(GFFs))]
	112	combinations_indices = [0]
	113	remaining_indices = list(range(1, len(GFFs)))
	114	done = False
	115	biggest_intersection_index = -1
	116	biggest_intersection_value = 0
	117	current_remaining_indices_index = 0
	118	while not done:
	119	#get the len of the intersection
	120	#print("combinations_indices: {}".format(combinations_indices))
	121	#print("current_remaining_indices_index: {}".format(current_remaining_indices_index))
	122	#print("remaining_indices[current_remaining_indices_index]: {}".format(remaining_indices[current_remaining_indices_index]))
	123	#print("genes_series[remaining_indices[current_remaining_indices_index]]: {}".format(genes_series[remaining_indices[current_remaining_indices_index]]))
	124	this_intersection_value = len(genes_series[combinations_indices[-1]] &\
	125	genes_series[remaining_indices[current_remaining_indices_index]])
	126	if this_intersection_value > biggest_intersection_value:
	127	biggest_intersection_value = this_intersection_value
	128	biggest_intersection_index = current_remaining_indices_index
	129	if current_remaining_indices_index < len(remaining_indices)-1:
	130	current_remaining_indices_index += 1
	131	else:
	132	combinations_indices.append(remaining_indices[biggest_intersection_index])
	133	del remaining_indices[biggest_intersection_index]
	134	biggest_intersection_value = 0
	135	current_remaining_indices_index = 0
	136	biggest_intersection_index = -1
	137	if len(remaining_indices) == 0:
	138	done = True
	139	# The best order of genes with the most shared genes
	140	#I don't know if this is really that useful though since many species will overlap.
	141	# In a future implementation of this program it might be necessary to do sub-sorting of this list to get the lest number of line intersections
	142	print("The best gene combination is {}".format(combinations_indices))
	143	# now we rearrange the GFFs to the best order
	144	new_GFFs = [GFFs[i] for i in combinations_indices]
	145	# If we're adding another copy of the top one, add it here before shuffling
	146	if args.sandwich:
	147	new_GFFs.append(new_GFFs[0])
	148	shuffles = [new_GFFs[i].shuffle() for i in range(len(new_GFFs))]
	149	#print([len(shuffles[i]) for i in range(len(shuffles))])
	150	cumulative_least_shuffled_value = 999999999999999999999999999999999999
	151	bar = progressbar.ProgressBar()
	152	for combination in bar(list(product(*shuffles))):
	153	num_intersections = []
	154	#I have no idea what this list comprehension does anymore.
	155	first_genes = [str(combination[i].features[combination[i].features['featType'].isin(['gene', 'rRNA', 'CDS', 'tRNA'])]['name'].head(n=1)).split()[1] for i in range(len(combination))]
	156	# skip to the next iteration if all the genes aren't the same
	157	if args.start_with_aligned_genes and len(set(first_genes)) != 1:
	158	continue
	159	for i in range(len(new_GFFs) - 1):
	160	j = i + 1
	161	#figure out the best shuffle the next sequence
	162	coords = combination[i].couple(combination[j], this_y = i, other_y = j)
	163	num_intersections.append(len(intersection(coords)))
	164	if sum(num_intersections) < cumulative_least_shuffled_value:
	165	shuffled_gffs = combination
	166	cumulative_least_shuffled_value = sum(num_intersections)
	167	print("\nnew fewest global intersections: {}".format(sum(num_intersections)))
	168	return shuffled_gffs
	169
	170	def black_colormap():
	171	zeroone = np.linspace(0, 1, 100)
	172	colorrange = [(0,0,0,x) for x in zeroone]
	173	minblosum = min(MI.blosum62.values())
	174	maxblosum = max(MI.blosum62.values())
	175	colormap = {i: colorrange[int(translate(i, minblosum, maxblosum, 0, 99))]
	176	for i in range(minblosum, maxblosum + 1, 1)}
	177	return colormap
	178
	179	def translate(value, left_min, left_max, right_min, right_max):
	180	"""This code maps values from the left range and interpolates to the
	181	corresponding range on the right. This is used to translate the amino acid
	182	substition matrix scores to a scale between 0 and 1 for making alphamaps.
	183
	184	I don't know if this works if the directionality of the ranges are swapped.
	185	IE [5, -10] mapped to [0, 1]
	186
	187	args:
	188	<value> - the value in [<left_min>:<left_max>] to scale between
	189	[<right_min>:<right_max>]
	190	<left_min> - the 'min' of the left (source) range
	191	<left_max> - the 'max' of the left (source) range
	192	<right_min> - the 'min' of the right (target) range
	193	<right_max> the 'max' of the right (target) range
	194
	195	output:
	196	the <value>(float) scaled between <right_min> and <right_max>
	197	"""
	198	# Figure out how 'wide' each range is
	199	left_span = left_max - left_min
	200	right_span = right_max - right_min
	201
	202	# Convert the left range into a 0-1 range (float)
	203	value_scaled = float(value - left_min) / float(left_span)
	204
	205	# Convert the 0-1 range into a value in the right range.
	206	return right_min + (value_scaled * right_span)
	207
	208	def _samplename_warning(samplename, filename):
	209	warnings.warn("""
	210	There is a sample in your fasta alignments that
	211	does not match the samplenames from the gff filenames. Please
	212	rename this samplename to not contain any spaces or underscores.
	213	IE for sample 'NC016', '>NC_016_-_ND6' will not work but
	214	'>NC016_-_ND6' will work.
	215
	216	Erroneous name: {}
	217	File: {}""".format(samplename, os.path.basename(filename)))
	218
	219	def _samplelength_warning(samplename, genename, featType, gfflen, alnlen):
	220	raise Warning("""The length of the protein alignment isn't the same as the
	221	length in the GFF file for the sample. Maybe you used a sequence in the
	222	alignment that is different from the annotation source? Check if the
	223	stop codons are deleted/inserted from either the GFF or alignment. The
	224	protein alignment length should be 3 less than the gff length if the
	225	stop codons were included in the gff annotation.
	226
	227	Another possibility is that the RNA that is generating this error has
	228	post-transcriptional modifications that complete the stop codon. In
	229	this case, you can fudge the stop position in the gff file (increase
	230	the value by one or two) to make the plotting script run.
	231
	232	Sample name: {}
	233	feat type: {}
	234	gene name: {}
	235	gff length: {}
	236	aln length: {}""".format(samplename, featType, genename, gfflen, alnlen))
	237
	238	def _nosample_warning(samplename, alngenename, gffnames):
	239	raise Warning("""One of the gff files doesn't contain a sequence that the
	240	alignment file indicates should be present. Either the alignment file
	241	is misnamed or the sequence name in the GFF file is not what you
	242	intended.
	243
	244	Sample name: {}
	245	aln gene name: {}
	246	gff names: {}""".format(samplename, alngenename, gffnames))
	247
	248	def get_alignments(args):
	249	"""
	250	this reads in all the alignments from the fasta directory.
	251	"""
	252	# This is a dict object with key as
	253	filelist = {os.path.splitext(x)[0]:os.path.join(os.path.abspath(args.aln_dir), x)
	254	for x in os.listdir(args.aln_dir)
	255	if os.path.splitext(x)[1]}
	256	print("file list is:")
	257	print(filelist)
	258	# one entry in seq_dict is:
	259	# {seqname: {"featType": featType,
	260	# "seqs": {samplename: seq},
	261	# "indices": {samplename: indices}}
	262	seqs_dict = {}
	263	# go through every gene in the genelist
	264	for genename in filelist:
	265	thisFeatType = ""
	266	seqs_list = {}
	267	indices_list = {}
	268	#print("We found the following samplenames: {}".format(args.samplenames), file = sys.stderr)
	269	# this block handles reading in the fasta files to interpret the alignments
	270	for record in SeqIO.parse(filelist[genename], "fasta"):
	271	# get the sample name and make sure that the sample names match
	272	samplename = record.id.replace("_", " ").split()[0]
	273	#print("Looking at sample: {}".format(samplename), file=sys.stderr)
	274	if samplename not in args.samplenames:
	275	#if there's a sequence in the fasta that we did not specify
	276	# in the command, ignore that sequence
	277	_samplename_warning(samplename, filelist[genename])
	278	else:
	279	# first, get the sample features
	280	samplegff = args.samplenames[samplename].features
	281	featType = samplegff.loc[samplegff['name'] == genename, 'featType'].to_string().split()[1]
	282	# now we determine if this is a prot alignment or a nucleotide aln
	283	if featType in ['gene', 'CDS']:
	284	final_seq = "".join([x*3 for x in record.seq])
	285	elif featType == 'rRNA':
	286	final_seq = str(record.seq)
	287	# we now need to verify that the protein sequence is
	288	# the length of the gene in the gff file. Do this by removing
	289	# gaps in the alignment
	290	gfffilt = samplegff.loc[samplegff['name'] == genename, 'width']
	291	if len(gfffilt) == 0:
	292	_nosample_warning(samplename, genename, list(samplegff['name']))
	293	gfflen = int(gfffilt)
	294	aln = final_seq.replace("-", "")
	295	alnlen = len(aln)
	296	if gfflen != alnlen:
	297	_samplelength_warning(samplename, genename, featType, gfflen, alnlen)
	298	# If we've made it this far without any errors, then incorporate the
	299	# indices for each index
	300	#print("start_index", start_index)
	301	final_indices = [-1] * len(final_seq)
	302	# up until the next for loop, here we are determining which
	303	# direction to move in. Reverse sequences decrease from the start
	304	strand = samplegff.loc[samplegff['name'] == genename, 'strand'].to_string().split()[1]
	305	if strand == '+':
	306	direction = 1
	307	start_index = int(samplegff.loc[samplegff['name'] == genename, 'start'])
	308	elif strand == '-':
	309	direction = -1
	310	start_index = int(samplegff.loc[samplegff['name'] == genename, 'stop'])
	311	for i in range(len(final_indices)):
	312	if final_seq[i] != '-':
	313	final_indices[i] = start_index
	314	start_index = start_index + (1 * direction)
	315	seqs_list[samplename] = final_seq
	316	if args.center_on:
	317	center_coord = int(args.samplenames[samplename].features.loc[args.samplenames[samplename].features['name'] == args.center_on, 'center'])
	318	indices_list[samplename] = np.array(final_indices) - center_coord
	319	else:
	320	indices_list[samplename] = final_indices
	321	thisFeatType = featType
	322	seqs_dict[genename] = {"featType": thisFeatType,
	323	"seqs": seqs_list,
	324	"indices": indices_list}
	325	return seqs_dict
	326
	327
	328	def plot_synteny(seq1, ind1, seq2, ind2, y1, y2,
	329	featType, matrix, cm, seqname):
	330	"""This function plots all the lines for each"""
	331	print("PLOTTING SYNTENY")
	332	myPatches = []
	333	colormap = {"COX1": '#c0d9ef',
	334	"L": '#e8f1df',
	335	"I": '#f7dedc',
	336	"16S": '#ff2e00',
	337	"12S": '#ffc239',
	338	"cal": '#ffff54',
	339	"COX2": "#7fce66",
	340	"ND2": "#00ae60",
	341	"COX3": "#00aeec",
	342	"ND1": "#006fbb",
	343	"*": "#ffffff",
	344	"(": "#ded9c5",
	345	"Q": "#ffc294",
	346	"?": "#b5a2c4",
	347	"ND4": "#968b5a",
	348	"ND3": "#00fc65",
	349	"ND4L": "#00dcf0",
	350	"ND6": "#ff994e",
	351	"ND5": "#dc31e6",
	352	"X": "#d8d8d8",
	353	"G": "#abdce7",
	354	"CYTB": "#ff0059"}
	355
	356	for i in range(len(seq1)):
	357	feat1 = seq1[i]
	358	feat2 = seq2[i]
	359	if feat1 != '-' and feat2 != '-':
	360	xs = []
	361	ys = []
	362	xs.append(ind1[i]) # top left
	363	ys.append(y1)
	364	xs.append(ind1[i] + 1) # top right
	365	ys.append(y1)
	366	xs.append(ind2[i] + 1) # bottom right
	367	ys.append(y2)
	368	xs.append(ind2[i]) #bottom left
	369	ys.append(y2)
	370	xs.append(ind1[i]) #top left
	371	ys.append(y1)
	372	alpha = 0.5
	373	if featType in ['CDS', 'gene']:
	374	try:
	375	val = matrix[(feat1, feat2)]
	376	except:
	377	val = matrix[(feat2, feat1)]
	378	color = cm[val]
	379	alpha = color[-1]
	380	elif featType == 'rRNA':
	381	if feat1 != feat2:
	382	alpha=0
	383	color = colormap[seqname]
	384	stack1 = np.column_stack([xs, ys])
	385	myPatches.append(patches.Polygon(stack1, closed=True,
	386	color = color,
	387	alpha = alpha,
	388	lw=0))
	389	return myPatches
	390
	391	def synplot(args):
	392	rc.update_rcParams()
	393	print(args)
	394	GFFs = []
	395	for i in range(len(args.gff_paths)):
	396	gffpath = args.gff_paths[i]
	397	species = ""
	398	if args.gff_labels:
	399	species = args.gff_labels[i]
	400	GFFs.append(GFFParse(gffpath, args.stop_codons, species))
	401
	402	# find the optimum shuffling pattern
	403	# and add a list of samplenames to the args
	404	optGFFs = shuffle_optimize_gffs(args, GFFs)
	405	# Make a sandwich for a circular comparison
	406	setattr(args, 'samplenames', {gff.samplename:gff for gff in optGFFs})
	407
	408	# now get the cms and normalize
	409	#cms, normalize = gen_colormaps()
	410	cm = black_colormap()
	411
	412	## and we get the protein alignment objects
	413	# {seqname: {"featType": featType,
	414	# "seqs": {samplename: seq},
	415	# "indices": {samplename: indices}}
	416	print("getting alignments")
	417	seqs_dict = get_alignments(args)
	418	print("done getting alignments")
	419	print("seqs_dict is:")
	420	print(seqs_dict)
	421
	422	# set the figure dimensions
	423	if args.ratio:
	424	figWidth = args.ratio[0] + 1
	425	figHeight = args.ratio[1] + 1
	426	#set the panel dimensions
	427	panelWidth = args.ratio[0]
	428	panelHeight = args.ratio[1]
	429
	430	else:
	431	figWidth = 2.5*4
	432	figHeight = 5
	433	#set the panel dimensions
	434	panelWidth = 2.5 * 3
	435	panelHeight = 1.75
	436
	437	figure = plt.figure(figsize=(figWidth,figHeight))
	438
	439
	440	#find the margins to center the panel in figure
	441	leftMargin = (figWidth - panelWidth)/2
	442	bottomMargin = ((figHeight - panelHeight)/2) + 0.25
	443
	444	panel0=plt.axes([leftMargin/figWidth, #left
	445	bottomMargin/figHeight, #bottom
	446	panelWidth/figWidth, #width
	447	panelHeight/figHeight]) #height
	448	panel0.tick_params(axis='both',which='both',\
	449	bottom='on', labelbottom='off',\
	450	left='off', labelleft='off', \
	451	right='off', labelright='off',\
	452	top='off', labeltop='off')
	453	#turn off some of the axes
	454	panel0.spines['top'].set_visible(False)
	455	panel0.spines['right'].set_visible(False)
	456	panel0.spines['left'].set_visible(False)
	457
	458	# {seqname: {"featType": featType,
	459	# "seqs": {samplename: seq},
	460	# "indices": {samplename: indices}}
	461	allPatches = []
	462	for seqname in seqs_dict:
	463	#go through in order
	464	print(seqname)
	465	for i in range(0, len(optGFFs) - 1):
	466	samplei = optGFFs[i].samplename
	467	samplej = optGFFs[i+1].samplename
	468	if samplei in seqs_dict[seqname]["seqs"].keys() and\
	469	samplej in seqs_dict[seqname]["seqs"].keys():
	470	featType = seqs_dict[seqname]["featType"]
	471	seq1 = seqs_dict[seqname]["seqs"][samplei]
	472	ind1 = seqs_dict[seqname]["indices"][samplei]
	473	seq2 = seqs_dict[seqname]["seqs"][samplej]
	474	ind2 = seqs_dict[seqname]["indices"][samplej]
	475	# this is the top one, just leave it at the actual value since
	476	# the base of the annotations start on the integer
	477	y1 = len(optGFFs) - 1 - i
	478	# this needs to be increased by the bar_thickness (0.9 * track_width in this case, or 0.09)
	479	y2 = len(optGFFs) - 2 - i
	480	myPatches = plot_synteny(seq1, ind1, seq2, ind2, y1, y2,
	481	featType, MI.blosum62, cm, seqname)
	482	for patch in myPatches:
	483	allPatches.append(patch)
	484
	485	#print("len allPatches", len(allPatches))
	486	# this bit plots the simplified lines in the centers
	487	## first we plot all the lines from the centers of matching genes.
	488	## This is temporary. Or maybe it should be a feature
	489	#verts = []
	490	#for i in range(len(optGFFs) - 1):
	491	# j = i + 1
	492	# coords = optGFFs[i].couple(optGFFs[j], this_y = len(optGFFs) - i, other_y = len(optGFFs) - i - 1)
	493	# for coord in coords:
	494	# verts.append(coord)
	495
	496	#for vert in verts:
	497	# xxyy = list(zip(*vert))
	498	# panel0.plot(xxyy[0], xxyy[1])
	499	# now we plot horizontal lines showing the length of the mitochondrial sequence
	500	maxseqlen = 0
	501	# this is a heuristic for trackwidth of what looks good in my experience
	502	track_multiplier = 0.08
	503	if args.ratio:
	504	track_width = track_multiplier * panelWidth
	505	else:
	506	#0.032 if only 3
	507	#0.062 if 6
	508	track_width = track_multiplier * panelWidth
	509	for i in range(len(optGFFs)):
	510	gff = optGFFs[i]
	511	#print(" - Plotting panels of {}".format(gff), file = sys.stderr)
	512	x_offset = 0
	513	#print(" - Detecting if centering is on.".format(gff), file = sys.stderr)
	514	if args.center_on:
	515	x_offset = -1 * int(gff.features.loc[gff.features['name'] == args.center_on, 'center'])
	516	gff = gfftools.x_offset_gff(gff, x_offset)
	517	#print(" - Centering is on.".format(gff), file = sys.stderr)
	518	#print(" - Plotting horizontal portions with gffplot_horizontal.".format(gff), file = sys.stderr)
	519	panel0, patches = gfftools.gffplot_horizontal(
	520	figure, panel0, args, gff,
	521	track_width = track_width,
	522	start_y = len(optGFFs) - i - 1 - ((0.9 * track_width)/2),
	523	x_offset = x_offset)
	524	#print("{} patches came out of gffplot_horizontal()".format(len(patches)))
	525	seq_name = gff.features['sequence'].unique()[0]
	526	if args.gff_labels:
	527	seq_name = "$\it{{{0}}}$".format(gff.species)
	528	panel0.text(0 + x_offset, len(optGFFs) - i - 1 + (0.18/2),
	529	seq_name, fontsize = 12,
	530	ha='left', va='bottom',
	531	color = 'black',
	532	zorder = 100)
	533
	534	if gff.seqlen > maxseqlen:
	535	maxseqlen = gff.seqlen
	536	xs = (1 + x_offset, gff.seqlen + x_offset)
	537	#ys = [len(optGFFs) - i - 1 + (0.09/2)]*2
	538	ys = [len(optGFFs) - i - 1]*2
	539
	540	#print(" - Plotting lines.".format(gff), file = sys.stderr)
	541	panel0.plot(xs, ys, color='black', zorder = -9)
	542	#print(" - Adding patches.".format(gff), file = sys.stderr)
	543	#print("Right before adding patches there are {} patches.".format(len(patches)))
	544	for i in range(len(patches)):
	545	patch = patches[i]
	546	allPatches.append(patch)
	547	for patch in allPatches:
	548	panel0.add_patch(patch)
	549	panel0.set_xlabel("position (bp)")
	550
	551	#panel0.set_xlim([-15000, int(np.ceil(maxseqlen/1000)*1000)])
	552	panel0.set_ylim([ 0 - ( (track_width/2) * 1.1 ),
	553	len(optGFFs) - 1 + ( (track_width/2) * 1.1 )])
	554
	555	# This removes the text labels from the plot
	556	labels = [item.get_text() for item in panel0.get_xticklabels()]
	557	empty_string_labels = ['']*len(labels)
	558	print(" - Setting tick labels.".format(gff), file = sys.stderr)
	559
	560	panel0.set_xticklabels(empty_string_labels)
	561
	562	# Print image(s)
	563	print(" - Running print_images.".format(gff), file = sys.stderr)
	564	if args.BASENAME is None:
	565	file_base = "synteny"
	566	else:
	567	file_base = args.BASENAME
	568	print_images(
	569	base=file_base,
	570	image_formats=args.fileform,
	571	no_timestamp = args.no_timestamp,
	572	dpi=args.dpi,
	573	transparent=args.transparent)
	574
	575
	576	def run(args):
	577	synplot(args)

-0

pauvre/tests/__init__.py less more

(New empty file)

+66

-0

pauvre/tests/test_synplot.py less more

	0	#!/usr/bin/env python
	1	# -- coding: utf-8 --
	2
	3	# pauvre - just a pore plotting package
	4	# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
	5	#
	6	# This file is part of pauvre.
	7	#
	8	# pauvre is free software: you can redistribute it and/or modify
	9	# it under the terms of the GNU General Public License as published by
	10	# the Free Software Foundation, either version 3 of the License, or
	11	# (at your option) any later version.
	12	#
	13	# pauvre is distributed in the hope that it will be useful,
	14	# but WITHOUT ANY WARRANTY; without even the implied warranty of
	15	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	16	# GNU General Public License for more details.
	17	#
	18	# You should have received a copy of the GNU General Public License
	19	# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
	20
	21	import os
	22	import subprocess as sp
	23	import unittest
	24
	25	class libSeq_test_case(unittest.TestCase):
	26	"""Tests that different features of synplot work correctly
	27	"""
	28	def setUp(self):
	29	# Here we must safely make the test directory if it does
	30	# not exist
	31	scriptdir = os.path.dirname(os.path.realpath(__file__))
	32	testoutputdir = os.path.join(scriptdir, "testresults")
	33	if not os.path.exists(testoutputdir):
	34	os.makedirs(testoutputdir)
	35	# now we see if the specific subdirectory for the program
	36	# we are testing exists
	37	self.thisoutdir = os.path.join(testoutputdir, "synplot")
	38	if not os.path.exists(self.thisoutdir):
	39	os.makedirs(self.thisoutdir)
	40	# now we change our working directory to the
	41	# specific subdirectory for this file
	42
	43	self.aln_dir = os.path.join(scriptdir, "testdata/alignments/")
	44	self.gff1 = os.path.join(scriptdir, "testdata/gff_files/Bf201706.gff")
	45	self.gff2 = os.path.join(scriptdir, "testdata/gff_files/JN392469.gff")
	46	self.gff3 = os.path.join(scriptdir, "testdata/gff_files/NC016117.gff")
	47
	48
	49	def test_normal_plotting_scenario(self):
	50	"""This verifies that the LibSeq class is constructed with all of the
	51	parameters that are present in the meraculous config files"""
	52	os.chdir(self.thisoutdir)
	53	thiscommand = """pauvre synplot --aln_dir {0} \
	54	--fileform pdf \
	55	--gff_paths {1} {2} {3} \
	56	--center_on COX1 \
	57	--gff_labels "B. forskalii" "P. bachei" "M. leidyi" """.format(
	58	self.aln_dir, self.gff1,
	59	self.gff2, self.gff3)
	60
	61	data = sp.Popen(thiscommand, shell = True,
	62	stdout=sp.PIPE)
	63	print("The data is:", data.communicate()[0].decode())
	64	print("The return code is: ", data.returncode)
	65	self.assertEqual(0, int(data.returncode))

-0

pauvre/version.py less more

	0	# -- coding: utf-8 --
	1
	2	__version__ = "0.1924"

+26

-0

pauvre.egg-info/PKG-INFO less more

	0	Metadata-Version: 1.2
	1	Name: pauvre
	2	Version: 0.1924
	3	Summary: Tools for plotting Oxford Nanopore and other long-read data.
	4	Home-page: https://github.com/conchoecia/pauvre
	5	Author: Darrin Schultz
	6	Author-email: dts@ucsc.edu
	7	License: GPLv3
	8	Description:
	9	'pauvre' is a package for plotting Oxford Nanopore and other long read data.
	10	The name means 'poor' in French, a play on words to the oft-used 'pore' prefix
	11	for similar packages. This package was designed for python 3, but it might work in
	12	python 2. You can visit the gitub page for more detailed information here:
	13	https://github.com/conchoecia/pauvre
	14
	15	Platform: UNKNOWN
	16	Classifier: Development Status :: 2 - Pre-Alpha
	17	Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
	18	Classifier: Programming Language :: Python :: 3
	19	Classifier: Programming Language :: Python :: 3.5
	20	Classifier: Operating System :: POSIX :: Linux
	21	Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
	22	Classifier: Intended Audience :: Science/Research
	23	Requires: python (>3.0)
	24	Provides: pauvre
	25	Requires-Python: >=3

+32

-0

pauvre.egg-info/SOURCES.txt less more

	0	MANIFEST.in
	1	README.md
	2	setup.py
	3	pauvre/__init__.py
	4	pauvre/bamparse.py
	5	pauvre/browser.py
	6	pauvre/custommargin.py
	7	pauvre/functions.py
	8	pauvre/gfftools.py
	9	pauvre/marginplot.py
	10	pauvre/pauvre_main.py
	11	pauvre/rcparams.py
	12	pauvre/redwood.py
	13	pauvre/stats.py
	14	pauvre/synplot.py
	15	pauvre/version.py
	16	pauvre.egg-info/PKG-INFO
	17	pauvre.egg-info/SOURCES.txt
	18	pauvre.egg-info/dependency_links.txt
	19	pauvre.egg-info/entry_points.txt
	20	pauvre.egg-info/not-zip-safe
	21	pauvre.egg-info/requires.txt
	22	pauvre.egg-info/top_level.txt
	23	pauvre/lsi/Q.py
	24	pauvre/lsi/T.py
	25	pauvre/lsi/__init__.py
	26	pauvre/lsi/helper.py
	27	pauvre/lsi/lsi.py
	28	pauvre/lsi/test.py
	29	pauvre/tests/__init__.py
	30	pauvre/tests/test_synplot.py
	31	scripts/test.sh⏎

-0

pauvre.egg-info/dependency_links.txt less more

-0

pauvre.egg-info/entry_points.txt less more

	0	[console_scripts]
	1	pauvre = pauvre.pauvre_main:main
	2

-0

pauvre.egg-info/not-zip-safe less more

-0

pauvre.egg-info/requires.txt less more

	0	matplotlib>=2.0.2
	1	biopython>=1.68
	2	pandas>=0.20.1
	3	numpy>=1.12.1
	4	scipy
	5	sklearn

-0

pauvre.egg-info/top_level.txt less more

	0	pauvre
	1	scripts

+14

-0

scripts/test.sh less more

	0	set -ev
	1
	2	git clone https://github.com/wdecoster/nanotest.git
	3
	4	pauvre -h
	5
	6	pauvre marginplot -h
	7	pauvre marginplot -f nanotest/reads.fastq.gz
	8	pauvre marginplot -f nanotest/reads.fastq.gz -t "my title" --filt_maxlen 30000
	9	pauvre marginplot -f nanotest/reads.fastq.gz -y --filt_minqual 7 --fileform svg
	10
	11	pauvre stats -h
	12	pauvre stats -f nanotest/reads.fastq.gz
	13	pauvre stats -f nanotest/reads.fastq.gz -H --filt_maxqual 12

-0

setup.cfg less more

	0	[egg_info]
	1	tag_build =
	2	tag_date = 0
	3

+74

-0

setup.py less more

	0	#!/usr/bin/env python3
	1	# -- coding: utf-8 --
	2
	3	# pauvre
	4	# Copyright (c) 2016-2020 Darrin T. Schultz.
	5	#
	6	# This file is part of pauvre.
	7	#
	8	# pauvre is free software: you can redistribute it and/or modify
	9	# it under the terms of the GNU General Public License as published by
	10	# the Free Software Foundation, either version 3 of the License, or
	11	# (at your option) any later version.
	12	#
	13	# pauvre is distributed in the hope that it will be useful,
	14	# but WITHOUT ANY WARRANTY; without even the implied warranty of
	15	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	16	# GNU General Public License for more details.
	17	#
	18	# You should have received a copy of the GNU General Public License
	19	# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
	20
	21	# Tutorials on how to setup python package here:
	22	# - http://python-packaging.readthedocs.io/en/latest/testing.html
	23	# - https://jeffknupp.com/blog/2013/08/16/open-sourcing-a-python-project-the-right-way/
	24
	25	import os
	26	from setuptools import setup, find_packages
	27
	28	version_py = os.path.join(os.path.dirname(__file__), 'pauvre', 'version.py')
	29	version = open(version_py).read().strip().split('=')[-1].replace('"', '').strip()
	30	print(version)
	31
	32
	33	setup(name='pauvre',
	34	requires=['python (>3.0)'],
	35	version=version,
	36	description='Tools for plotting Oxford Nanopore and other long-read data.',
	37	long_description="""
	38	'pauvre' is a package for plotting Oxford Nanopore and other long read data.
	39	The name means 'poor' in French, a play on words to the oft-used 'pore' prefix
	40	for similar packages. This package was designed for python 3, but it might work in
	41	python 2. You can visit the gitub page for more detailed information here:
	42	https://github.com/conchoecia/pauvre
	43	""",
	44
	45	url='https://github.com/conchoecia/pauvre',
	46	author='Darrin Schultz',
	47	author_email='dts@ucsc.edu',
	48	classifiers=[
	49	'Development Status :: 2 - Pre-Alpha',
	50	'License :: OSI Approved :: GNU General Public License v3 (GPLv3)',
	51	'Programming Language :: Python :: 3',
	52	'Programming Language :: Python :: 3.5',
	53	'Operating System :: POSIX :: Linux',
	54	'Topic :: Scientific/Engineering :: Bio-Informatics',
	55	'Intended Audience :: Science/Research'
	56	],
	57	license='GPLv3',
	58	provides=['pauvre'],
	59	packages=find_packages() + ['scripts'],
	60	python_requires='>=3',
	61	install_requires=[
	62	"matplotlib >= 2.0.2",
	63	"biopython >= 1.68",
	64	"pandas >= 0.20.1",
	65	"numpy >= 1.12.1",
	66	"scipy",
	67	"sklearn"
	68	],
	69	entry_points={
	70	'console_scripts': ['pauvre=pauvre.pauvre_main:main'],
	71	},
	72	zip_safe=False,
	73	include_package_data=True)