Commit 9904cd8bf255135b069354758622ae612466aa0b - porechop

New upstream version 0.2.4+dfsg Steffen Moeller 5 years ago

5 changed file(s) with 263 addition(s) and 152 deletion(s). Raw diff Collapse all Expand all

+117

-74

README.md less more

2	2	Porechop is a tool for finding and removing adapters from [Oxford Nanopore](https://nanoporetech.com/) reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads. Porechop performs thorough alignments to effectively find adapters, even at low sequence identity.
3	3
4	4	Porechop also supports demultiplexing of Nanopore reads that were barcoded with the [Native Barcoding Kit](https://store.nanoporetech.com/native-barcoding-kit-1d.html), [PCR Barcoding Kit](https://store.nanoporetech.com/pcr-barcoding-kit-96.html) or [Rapid Barcoding Kit](https://store.nanoporetech.com/rapid-barcoding-sequencing-kit.html).
	5
	6
	7	### Oct 2018 update: Porechop is officially unsupported
	8
	9	While I'm happy Porechop has so many users, it has always been a bit klugey and a pain to maintain. I don't have the time to give it the attention it deserves, so I'm going to now officially declare Porechop as abandonware (though the unanswered [issues](https://github.com/rrwick/Porechop/issues) and [pull requests](https://github.com/rrwick/Porechop/pulls) reveal that it already has been for some time). I've added a [known issues](#known-issues) section to the README to outline what I think is wrong with Porechop and how a reimplementation should look. I may someday (no promises though :stuck_out_tongue:) try to rewrite it from a blank canvas to address its faults.
	10
	11
5	12
6	13
7	14

23	30	* [Verbose output](#verbose-output)
24	31	* [Known adapters](#known-adapters)
25	32	* [Full usage](#full-usage)
	33	* [Known issues](#known-issues)
26	34	* [Acknowledgements](#acknowledgements)
27	35	* [License](#license)
28	36

90	98	__Demultiplex barcoded reads, straight from Albacore output directory:__<br>
91	99	`porechop -i albacore_dir -b output_dir`
92	100
93		__Demultiplex barcoded reads without trimming (appropriate if Nanopolish will be used):__<br>
94		`porechop -i input_reads.fastq.gz -b output_dir --untrimmed`
95
96	101	__Also works with FASTA:__<br>
97	102	`porechop -i input_reads.fasta -o output_reads.fasta`
98	103

142	147
143	148	If you run Porechop with `--discard_middle`, the reads with internal adapters will be thrown out instead of split.
144	149
145		This approach might make sense if you are trimming reads from a barcoded run, as chimeric reads may combine sequences from different bins. For example, consider this read:
146		```
147		BC01_rev - SEQUENCE_1 - SQK-NSK007_Y_Top - BC02_rev - SEQUENCE_2
148		```
149		SEQUENCE_1 belongs in the barcode 1 bin and SEQUENCE_2 belongs in the barcode 2 bin, so while we could split the read, we would end up with contamination from another bin. Throwing the read out with `--discard_middle` might be the better option.
	150	If you plan on using your reads with [Nanopolish](https://github.com/jts/nanopolish), then the `--discard_middle` option is required. This is because Nanopolish first runs [`nanopolish index`](https://github.com/jts/nanopolish#data-preprocessing) to find a one-to-one association between FASTQ reads and fast5 files. If you ran Porechop _without_ `--discard_middle`, then you could end up with multiple separate FASTQ reads which are from a single fast5, and this is incompatible with Nanopolish.
	151
	152	This option is also recommended if you are trimming reads from a demultiplexed barcoded sequencing run. This is because chimeric reads may contain two sequences from two separate barcodes, so throwing them out is the safer option to reduce cross-barcode contamination.
150	153
151	154
152	155	### Barcode demultiplexing

188	191
189	192	If Porechop is run without `-o` or `-b`, then it will output the trimmed reads to stdout and print its progress info to stderr. The output format of the reads will be FASTA/FASTQ based on the input reads, or else can be specified using `--format`.
190	193
191		The `--verbosity` option will change the output of progress info:
	194	The `--verbosity` option will change the amount of progress info:
192	195	* `--verbosity 0` gives no progress output.
193		* `--verbosity 1` (the default) gives summary info about end adapter trimming and shows all instances of middle adapter splitting.
194		* `--verbosity 2` shows sequences and is described below.
	196	* `--verbosity 1` (default) gives summary info about end adapter trimming and middle adapter splitting.
	197	* `--verbosity 2` shows the actual trimmed/split sequences for each read (described more below).
195	198	* `--verbosity 3` shows tons of data (mainly for debugging).
196	199
197	200

232	235	# Full usage
233	236
234	237	```
235		usage: porechop [-h] -i INPUT [-o OUTPUT] [--format {auto,fasta,fastq,fasta.gz,fastq.gz}] [-v VERBOSITY]
236		[-t THREADS] [--version] [-b BARCODE_DIR] [--barcode_threshold BARCODE_THRESHOLD]
	238	usage: porechop -i INPUT [-o OUTPUT] [--format {auto,fasta,fastq,fasta.gz,fastq.gz}] [-v VERBOSITY]
	239	[-t THREADS] [-b BARCODE_DIR] [--barcode_threshold BARCODE_THRESHOLD]
237	240	[--barcode_diff BARCODE_DIFF] [--require_two_barcodes] [--untrimmed]
238		[--adapter_threshold ADAPTER_THRESHOLD] [--check_reads CHECK_READS]
239		[--scoring_scheme SCORING_SCHEME] [--end_size END_SIZE] [--min_trim_size MIN_TRIM_SIZE]
240		[--extra_end_trim EXTRA_END_TRIM] [--end_threshold END_THRESHOLD] [--discard_middle]
	241	[--discard_unassigned] [--adapter_threshold ADAPTER_THRESHOLD]
	242	[--check_reads CHECK_READS] [--scoring_scheme SCORING_SCHEME] [--end_size END_SIZE]
	243	[--min_trim_size MIN_TRIM_SIZE] [--extra_end_trim EXTRA_END_TRIM]
	244	[--end_threshold END_THRESHOLD] [--no_split] [--discard_middle]
241	245	[--middle_threshold MIDDLE_THRESHOLD]
242	246	[--extra_middle_trim_good_side EXTRA_MIDDLE_TRIM_GOOD_SIDE]
243	247	[--extra_middle_trim_bad_side EXTRA_MIDDLE_TRIM_BAD_SIDE]
244		[--min_split_read_size MIN_SPLIT_READ_SIZE]
245
246		Porechop: a tool for finding adapters in Oxford Nanopore reads, trimming them from the ends and splitting
247		reads with internal adapters
248
249		optional arguments:
250		-h, --help show this help message and exit
	248	[--min_split_read_size MIN_SPLIT_READ_SIZE] [-h] [--version]
	249
	250	Porechop: a tool for finding adapters in Oxford Nanopore reads, trimming them from the ends and
	251	splitting reads with internal adapters
251	252
252	253	Main options:
253		-i INPUT, --input INPUT FASTA/FASTQ of input reads or a directory which will be recursively
254		searched for FASTQ files (required)
255		-o OUTPUT, --output OUTPUT Filename for FASTA or FASTQ of trimmed reads (if not set, trimmed
256		reads will be printed to stdout)
	254	-i INPUT, --input INPUT FASTA/FASTQ of input reads or a directory which will be
	255	recursively searched for FASTQ files (required)
	256	-o OUTPUT, --output OUTPUT Filename for FASTA or FASTQ of trimmed reads (if not set, trimmed
	257	reads will be printed to stdout)
257	258	--format {auto,fasta,fastq,fasta.gz,fastq.gz}
258		Output format for the reads - if auto, the format will be chosen based
259		on the output filename or the input read format (default: auto)
	259	Output format for the reads - if auto, the format will be chosen
	260	based on the output filename or the input read format (default:
	261	auto)
260	262	-v VERBOSITY, --verbosity VERBOSITY
261		Level of progress information: 0 = none, 1 = some, 2 = lots, 3 = full
262		- output will go to stdout if reads are saved to a file and stderr if
263		reads are printed to stdout (default: 1)
264		-t THREADS, --threads THREADS Number of threads to use for adapter alignment (default: 8)
265		--version show program's version number and exit
	263	Level of progress information: 0 = none, 1 = some, 2 = lots, 3 =
	264	full - output will go to stdout if reads are saved to a file and
	265	stderr if reads are printed to stdout (default: 1)
	266	-t THREADS, --threads THREADS Number of threads to use for adapter alignment (default: 8)
266	267
267	268	Barcode binning settings:
268	269	Control the binning of reads based on barcodes (i.e. barcode demultiplexing)
269	270
270	271	-b BARCODE_DIR, --barcode_dir BARCODE_DIR
271		Reads will be binned based on their barcode and saved to separate
272		files in this directory (incompatible with --output)
	272	Reads will be binned based on their barcode and saved to separate
	273	files in this directory (incompatible with --output)
273	274	--barcode_threshold BARCODE_THRESHOLD
274		A read must have at least this percent identity to a barcode to be
275		binned (default: 75.0)
276		--barcode_diff BARCODE_DIFF If the difference between a read's best barcode identity and its
277		second-best barcode identity is less than this value, it will not be
278		put in a barcode bin (to exclude cases which are too close to call)
279		(default: 5.0)
280		--require_two_barcodes Reads will only be put in barcode bins if they have a strong match for
281		the barcode on both their start and end (default: a read can be binned
282		with a match at its start or end)
283		--untrimmed Bin reads but do not trim the ends (appropriate if reads are to be
284		used with Nanopolish) (default: False)
	275	A read must have at least this percent identity to a barcode to be
	276	binned (default: 75.0)
	277	--barcode_diff BARCODE_DIFF If the difference between a read's best barcode identity and its
	278	second-best barcode identity is less than this value, it will not
	279	be put in a barcode bin (to exclude cases which are too close to
	280	call) (default: 5.0)
	281	--require_two_barcodes Reads will only be put in barcode bins if they have a strong match
	282	for the barcode on both their start and end (default: a read can
	283	be binned with a match at its start or end)
	284	--untrimmed Bin reads but do not trim them (default: trim the reads)
	285	--discard_unassigned Discard unassigned reads (instead of creating a "none" bin)
	286	(default: False)
285	287
286	288	Adapter search settings:
287	289	Control how the program determines which adapter sets are present
288	290
289	291	--adapter_threshold ADAPTER_THRESHOLD
290		An adapter set has to have at least this percent identity to be
291		labelled as present and trimmed off (0 to 100) (default: 90.0)
292		--check_reads CHECK_READS This many reads will be aligned to all possible adapters to determine
293		which adapter sets are present (default: 10000)
294		--scoring_scheme SCORING_SCHEME Comma-delimited string of alignment scores: match, mismatch, gap open,
295		gap extend (default: 3,-6,-5,-2)
	292	An adapter set has to have at least this percent identity to be
	293	labelled as present and trimmed off (0 to 100) (default: 90.0)
	294	--check_reads CHECK_READS This many reads will be aligned to all possible adapters to
	295	determine which adapter sets are present (default: 10000)
	296	--scoring_scheme SCORING_SCHEME
	297	Comma-delimited string of alignment scores: match, mismatch, gap
	298	open, gap extend (default: 3,-6,-5,-2)
296	299
297	300	End adapter settings:
298	301	Control the trimming of adapters from read ends
299	302
300		--end_size END_SIZE The number of base pairs at each end of the read which will be
301		searched for adapter sequences (default: 150)
302		--min_trim_size MIN_TRIM_SIZE Adapter alignments smaller than this will be ignored (default: 4)
303		--extra_end_trim EXTRA_END_TRIM This many additional bases will be removed next to adapters found at
304		the ends of reads (default: 2)
305		--end_threshold END_THRESHOLD Adapters at the ends of reads must have at least this percent identity
306		to be removed (0 to 100) (default: 75.0)
	303	--end_size END_SIZE The number of base pairs at each end of the read which will be
	304	searched for adapter sequences (default: 150)
	305	--min_trim_size MIN_TRIM_SIZE Adapter alignments smaller than this will be ignored (default: 4)
	306	--extra_end_trim EXTRA_END_TRIM
	307	This many additional bases will be removed next to adapters found
	308	at the ends of reads (default: 2)
	309	--end_threshold END_THRESHOLD Adapters at the ends of reads must have at least this percent
	310	identity to be removed (0 to 100) (default: 75.0)
307	311
308	312	Middle adapter settings:
309	313	Control the splitting of read from middle adapters
310	314
311		--no_split Skip splitting reads based on middle adapters (default: split reads
312		when an adapter is found in the middle)
313		--discard_middle Reads with middle adapters will be discarded (default: reads with
314		middle adapters are split) (this option is on by default when
315		outputting reads into barcode bins)
	315	--no_split Skip splitting reads based on middle adapters (default: split
	316	reads when an adapter is found in the middle)
	317	--discard_middle Reads with middle adapters will be discarded (default: reads with
	318	middle adapters are split) (required for reads to be used with
	319	Nanopolish, this option is on by default when outputting reads
	320	into barcode bins)
316	321	--middle_threshold MIDDLE_THRESHOLD
317		Adapters in the middle of reads must have at least this percent
318		identity to be found (0 to 100) (default: 85.0)
	322	Adapters in the middle of reads must have at least this percent
	323	identity to be found (0 to 100) (default: 85.0)
319	324	--extra_middle_trim_good_side EXTRA_MIDDLE_TRIM_GOOD_SIDE
320		This many additional bases will be removed next to middle adapters on
321		their "good" side (default: 10)
	325	This many additional bases will be removed next to middle adapters
	326	on their "good" side (default: 10)
322	327	--extra_middle_trim_bad_side EXTRA_MIDDLE_TRIM_BAD_SIDE
323		This many additional bases will be removed next to middle adapters on
324		their "bad" side (default: 100)
	328	This many additional bases will be removed next to middle adapters
	329	on their "bad" side (default: 100)
325	330	--min_split_read_size MIN_SPLIT_READ_SIZE
326		Post-split read pieces smaller than this many base pairs will not be
327		outputted (default: 1000)
328		```
	331	Post-split read pieces smaller than this many base pairs will not
	332	be outputted (default: 1000)
	333
	334	Help:
	335	-h, --help Show this help message and exit
	336	--version Show program's version number and exit
	337	```
	338
	339
	340
	341	# Known issues
	342
	343	### Adapter search
	344
	345	Porechop tries to automatically determine which adapters are present by looking at the reads, but this approach has a few issues:
	346
	347	* As the number of kits/barcodes has grown, adapter-search part of the Porechop's pipeline has become increasingly slow.
	348	* Porechop only does the adapter search on a subset of reads, which means there can be problems with non-randomly ordered read sets (e.g. all barcode 1 reads at the start of a file, followed by barcode 2 reads, etc).
	349	* Many ONT adapters share common sequence with each other, making false positive adapter finds possible.
	350
	351	A simpler solution (and in hindsight what I should have done) would be to require the kit and/or adapters from the user. E.g. `porechop --sqk-lsk109` or `porechop --start_adapt ACGCTAGCATACGT`.
	352
	353
	354	### Performance
	355
	356	Porechop uses [SeqAn](https://github.com/seqan/seqan) to perform its alignments in C++. This library is very flexible, but not as fast as some alternatives, such as [Edlib](https://github.com/Martinsos/edlib).
	357
	358	Another performance issue is that Porechop uses [ctypes](https://docs.python.org/3/library/ctypes.html) to interface with its C++ code. Function calls with ctypes can have a bit of overhead, which means that Porechop cannot use threads very efficiently (it spends too much of its time in the Python code, which is intrinsically non-parallel).
	359
	360
	361	### Barcode demultiplexing
	362
	363	I added demultiplexing to Porechop as an afterthought – it was already looking for barcodes to trim, so why not sort reads by barcodes too? This turned out to be a very useful feature, but in hindsight I think it might be simpler (and easier to maintain) if trimming and demultiplexing functionality were in separate tools.
	364
	365
	366	### Base-space problems
	367
	368	I've encountered a couple of issues where adapter sequences are not properly basecalled, resulting in inconsistent sequence. Porechop trims in base-space, so this is a somewhat intractable problem. See [issue #40](https://github.com/rrwick/Porechop/issues/40) for an example.
	369
	370	These have made me wonder if the adapter trimming should be done in signal-space instead, though that would be a more complex problem to solve. I hope that in the future ONT can integrate this kind of functionality directly into their basecallers.
	371
329	372
330	373
331	374

+46

-19

porechop/adapters.py less more

42	42	Gets the barcode name for the output files. We want a concise name, so it looks at all
43	43	options and chooses the shortest.
44	44	"""
45		possible_names = [self.name, self.start_sequence[0]]
	45	possible_names = [self.name]
	46	if self.start_sequence:
	47	possible_names.append(self.start_sequence[0])
46	48	if self.end_sequence:
47	49	possible_names.append(self.end_sequence[0])
48	50	barcode_name = sorted(possible_names, key=lambda x: len(x))[0]

75	77	start_sequence=('SQK-NSK007_Y_Top', 'AATGTACTTCGTTCAGTTACGTATTGCT'),
76	78	end_sequence=('SQK-NSK007_Y_Bottom', 'GCAATACGTAACTGAACGAAGT')),
77	79
	80
78	81	Adapter('Rapid',
79		start_sequence=('Rapid_adapter', 'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA')),
	82	start_sequence=('Rapid_adapter',
	83	'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA')),
	84
	85	Adapter('RBK004_upstream',
	86	start_sequence=('RBK004_upstream', 'AATGTACTTCGTTCAGTTACGGCTTGGGTGTTTAACC')),
	87
80	88
81	89	Adapter('SQK-MAP006',
82		start_sequence=('SQK-MAP006_Y_Top_SK63', 'GGTTGTTTCTGTTGGTGCTGATATTGCT'),
83		end_sequence= ('SQK-MAP006_Y_Bottom_SK64', 'GCAATATCAGCACCAACAGAAA')),
84
85		Adapter('SQK-MAP006 Short',
	90	start_sequence=('SQK-MAP006_Y_Top_SK63', 'GGTTGTTTCTGTTGGTGCTGATATTGCT'),
	91	end_sequence= ('SQK-MAP006_Y_Bottom_SK64', 'GCAATATCAGCACCAACAGAAA')),
	92
	93	Adapter('SQK-MAP006 short',
86	94	start_sequence=('SQK-MAP006_Short_Y_Top_LI32', 'CGGCGTCTGCTTGGGTGTTTAACCT'),
87	95	end_sequence= ('SQK-MAP006_Short_Y_Bottom_LI33', 'GGTTAAACACCCAAGCAGACGCCG')),
88	96
	97
	98	# The PCR adapters are used both in PCR DNA kits and some cDNA kits.
89	99	Adapter('PCR adapters 1',
90	100	start_sequence=('PCR_1_start', 'ACTTGCCTGTCGCTCTATCTTC'),
91		end_sequence= ('PCR_1_end', 'GAAGATAGAGCGACAGGCAAGT')),
92
93		Adapter('PCR tail 1',
94		start_sequence=('PCR_tail_1_start', 'TTAACCTTTCTGTTGGTGCTGATATTGC'),
95		end_sequence= ('PCR_tail_1_end', 'GCAATATCAGCACCAACAGAAAGGTTAA')),
96
97		Adapter('PCR tail 2',
98		start_sequence=('PCR_tail_2_start', 'TTAACCTACTTGCCTGTCGCTCTATCTTC'),
99		end_sequence= ('PCR_tail_2_end', 'GAAGATAGAGCGACAGGCAAGTAGGTTAA')),
	101	end_sequence= ('PCR_1_end', 'GAAGATAGAGCGACAGGCAAGT')),
	102
	103	Adapter('PCR adapters 2',
	104	start_sequence=('PCR_2_start', 'TTTCTGTTGGTGCTGATATTGC'),
	105	end_sequence= ('PCR_2_end', 'GCAATATCAGCACCAACAGAAA')),
	106
	107	Adapter('PCR adapters 3',
	108	start_sequence=('PCR_3_start', 'TACTTGCCTGTCGCTCTATCTTC'),
	109	end_sequence= ('PCR_3_end', 'GAAGATAGAGCGACAGGCAAGTA')),
100	110
101	111
102	112	# 1D^2 kit adapters are interesting. ONT provided the following sequences on their site:

114	124	end_sequence= ('1D2_part_2_end', 'CACCCAAGCAGACGCCAGCAATACGTAACT')),
115	125	# The middle part of the provided sequences is less common, so I've left it out of the
116	126	# adapter sequences here.
	127
	128
	129	Adapter('cDNA SSP',
	130	start_sequence=('cDNA_SSP', 'TTTCTGTTGGTGCTGATATTGCTGCCATTACGGCCGGG'),
	131	end_sequence= ('cDNA_SSP_rev', 'CCCGGCCGTAATGGCAGCAATATCAGCACCAACAGAAA')),
117	132
118	133
119	134	# Some barcoding kits (like the native barcodes) use the rev comp barcode at the start

460	475	end_sequence=('NB' + '%02d' % barcode_num + '_end', end_full_seq))
461	476
462	477
463		def make_full_rapid_barcode_adapter(barcode_num):
	478	def make_old_full_rapid_barcode_adapter(barcode_num): # applies to SQK-RBK001
464	479	barcode = [x for x in ADAPTERS if x.name == 'Barcode ' + str(barcode_num) + ' (forward)'][0]
465	480	start_barcode_seq = barcode.start_sequence[1]
466	481
467		start_full_seq = 'AATGTACTTCGTTCAGTTACGTATTGCT' + start_barcode_seq + 'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA'
468
469		return Adapter('Rapid barcoding ' + str(barcode_num) + ' (full sequence)',
	482	start_full_seq = 'AATGTACTTCGTTCAGTTACG' + 'TATTGCT' + start_barcode_seq + \
	483	'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA'
	484
	485	return Adapter('Rapid barcoding ' + str(barcode_num) + ' (full sequence, old)',
470	486	start_sequence=('RB' + '%02d' % barcode_num + '_full', start_full_seq))
	487
	488
	489	def make_new_full_rapid_barcode_adapter(barcode_num): # applies to SQK-RBK004
	490	barcode = [x for x in ADAPTERS if x.name == 'Barcode ' + str(barcode_num) + ' (forward)'][0]
	491	start_barcode_seq = barcode.start_sequence[1]
	492
	493	start_full_seq = 'AATGTACTTCGTTCAGTTACG' + 'GCTTGGGTGTTTAACC' + start_barcode_seq + \
	494	'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA'
	495
	496	return Adapter('Rapid barcoding ' + str(barcode_num) + ' (full sequence, new)',
	497	start_sequence=('RB' + '%02d' % barcode_num + '_full', start_full_seq))

+25

-8

porechop/nanopore_read.py less more

22	22	def __init__(self, name, seq, quals):
23	23	self.name = name
24	24
25		self.seq = seq
	25	self.seq = seq.upper()
	26	if self.seq.count('U') > self.seq.count('T'):
	27	self.rna = True
	28	self.seq = self.seq.replace('U', 'T')
	29	else:
	30	self.rna = False
	31
26	32	self.quals = quals
27	33	if len(quals) < len(seq):
28	34	self.quals += '+' * (len(seq) - len(quals))

95	101	seq = self.get_seq_with_start_end_adapters_trimmed()
96	102	if not seq: # Don't return empty sequences
97	103	return ''
	104	if self.rna:
	105	seq = seq.replace('T', 'U')
98	106	return ''.join(['>', self.name, '\n', add_line_breaks_to_sequence(seq, 70)])
99	107	elif discard_middle:
100	108	return ''

105	113	if not split_read_part[0]: # Don't return empty sequences
106	114	return ''
107	115	seq = add_line_breaks_to_sequence(split_read_part[0], 70)
	116	if self.rna:
	117	seq = seq.replace('T', 'U')
108	118	fasta_str += ''.join(['>', read_name, '\n', seq])
109	119	return fasta_str
110	120

118	128	quals = self.get_quals_with_start_end_adapters_trimmed()
119	129	if not seq: # Don't return empty sequences
120	130	return ''
	131	if self.rna:
	132	seq = seq.replace('T', 'U')
121	133	return ''.join(['@', self.name, '\n', seq, '\n+\n', quals, '\n'])
122	134	elif discard_middle:
123	135	return ''

125	137	fastq_str = ''
126	138	for i, split_read_part in enumerate(self.get_split_read_parts(min_split_read_size)):
127	139	read_name = add_number_to_read_name(self.name, i + 1)
128		if not split_read_part[0]: # Don't return empty sequences
	140	seq, qual = split_read_part[0], split_read_part[1]
	141	if not seq: # Don't return empty sequences
129	142	return ''
130		fastq_str += ''.join(['@', read_name, '\n', split_read_part[0], '\n+\n',
131		split_read_part[1], '\n'])
	143	if self.rna:
	144	seq = seq.replace('T', 'U')
	145	fastq_str += ''.join(['@', read_name, '\n', seq, '\n+\n', qual, '\n'])
132	146	return fastq_str
133	147
134	148	def align_adapter_set(self, adapter_set, end_size, scoring_scheme_vals):

137	151	This is not to determine where to trim the reads, but rather to figure out which adapter
138	152	sets are present in the data.
139	153	"""
140		read_seq_start = self.seq[:end_size]
141		score, _, _, _ = align_adapter(read_seq_start, adapter_set.start_sequence[1],
142		scoring_scheme_vals)
143		adapter_set.best_start_score = max(adapter_set.best_start_score, score)
	154	if adapter_set.start_sequence:
	155	read_seq_start = self.seq[:end_size]
	156	score, _, _, _ = align_adapter(read_seq_start, adapter_set.start_sequence[1],
	157	scoring_scheme_vals)
	158	adapter_set.best_start_score = max(adapter_set.best_start_score, score)
144	159	if adapter_set.end_sequence:
145	160	read_seq_end = self.seq[-end_size:]
146	161	score, _, _, _ = align_adapter(read_seq_end, adapter_set.end_sequence[1],

155	170	"""
156	171	read_seq_start = self.seq[:end_size]
157	172	for adapter in adapters:
	173	if not adapter.start_sequence:
	174	continue
158	175	full_score, partial_score, read_start, read_end = \
159	176	align_adapter(read_seq_start, adapter.start_sequence[1], scoring_scheme_vals)
160	177	if partial_score > end_threshold and read_end != end_size and \

+74

-50

porechop/porechop.py less more

23	23	from multiprocessing.dummy import Pool as ThreadPool
24	24	from collections import defaultdict
25	25	from .misc import load_fasta_or_fastq, print_table, red, bold_underline, MyHelpFormatter, int_to_str
26		from .adapters import ADAPTERS, make_full_native_barcode_adapter, make_full_rapid_barcode_adapter
	26	from .adapters import ADAPTERS, make_full_native_barcode_adapter,\
	27	make_old_full_rapid_barcode_adapter, make_new_full_rapid_barcode_adapter
27	28	from .nanopore_read import NanoporeRead
28	29	from .version import __version__
29	30

36	37	matching_sets = find_matching_adapter_sets(check_reads, args.verbosity, args.end_size,
37	38	args.scoring_scheme_vals, args.print_dest,
38	39	args.adapter_threshold, args.threads)
39		matching_sets = exclude_end_adapters_for_rapid(matching_sets)
40	40	matching_sets = fix_up_1d2_sets(matching_sets)
41		display_adapter_set_results(matching_sets, args.verbosity, args.print_dest)
42		matching_sets = add_full_barcode_adapter_sets(matching_sets)
43	41
44	42	if args.barcode_dir:
45	43	forward_or_reverse_barcodes = choose_barcoding_kit(matching_sets, args.verbosity,
46	44	args.print_dest)
47	45	else:
48	46	forward_or_reverse_barcodes = None
	47
	48	display_adapter_set_results(matching_sets, args.verbosity, args.print_dest)
	49	matching_sets = add_full_barcode_adapter_sets(matching_sets)
	50
49	51	if args.verbosity > 0:
50	52	print('\n', file=args.print_dest)
51	53

85	87	parser = argparse.ArgumentParser(description='Porechop: a tool for finding adapters in Oxford '
86	88	'Nanopore reads, trimming them from the ends and '
87	89	'splitting reads with internal adapters',
88		formatter_class=MyHelpFormatter)
	90	formatter_class=MyHelpFormatter, add_help=False)
89	91	main_group = parser.add_argument_group('Main options')
90	92	main_group.add_argument('-i', '--input', required=True,
91	93	help='FASTA/FASTQ of input reads or a directory which will be '

104	106	'a file and stderr if reads are printed to stdout')
105	107	main_group.add_argument('-t', '--threads', type=int, default=default_threads,
106	108	help='Number of threads to use for adapter alignment')
107		main_group.add_argument('--version', action='version', version=__version__)
108	109
109	110	barcode_group = parser.add_argument_group('Barcode binning settings',
110	111	'Control the binning of reads based on barcodes '

126	127	'match for the barcode on both their start and end (default: '
127	128	'a read can be binned with a match at its start or end)')
128	129	barcode_group.add_argument('--untrimmed', action='store_true',
129		help='Bin reads but do not trim them (appropriate if reads are to '
130		'be used with Nanopolish) (default: trim the reads)')
	130	help='Bin reads but do not trim them (default: trim the reads)')
131	131	barcode_group.add_argument('--discard_unassigned', action='store_true',
132	132	help='Discard unassigned reads (instead of creating a "none" bin)')
133	133

168	168	'middle)')
169	169	middle_trim_group.add_argument('--discard_middle', action='store_true',
170	170	help='Reads with middle adapters will be discarded (default: '
171		'reads with middle adapters are split) (this option is '
172		'on by default when outputting reads into barcode bins)')
173		middle_trim_group.add_argument('--middle_threshold', type=float, default=85.0,
	171	'reads with middle adapters are split) (required for '
	172	'reads to be used with Nanopolish, this option is on by '
	173	'default when outputting reads into barcode bins)')
	174	middle_trim_group.add_argument('--middle_threshold', type=float, default=90.0,
174	175	help='Adapters in the middle of reads must have at least this '
175	176	'percent identity to be found (0 to 100)')
176	177	middle_trim_group.add_argument('--extra_middle_trim_good_side', type=int, default=10,

182	183	middle_trim_group.add_argument('--min_split_read_size', type=int, default=1000,
183	184	help='Post-split read pieces smaller than this many base pairs '
184	185	'will not be outputted')
	186
	187	help_args = parser.add_argument_group('Help')
	188	help_args.add_argument('-h', '--help', action='help', default=argparse.SUPPRESS,
	189	help='Show this help message and exit')
	190	help_args.add_argument('--version', action='version', version=__version__,
	191	help="Show program's version number and exit")
185	192
186	193	args = parser.parse_args()
187	194

324	331	If the user is sorting reads by barcode bin, choose one barcode configuration (rev comp
325	332	barcodes at the start of the read or at the end of the read) and ignore the other.
326	333	"""
327		forward_barcodes = 0
328		reverse_barcodes = 0
	334	# Tally up scores for forward and reverse barcodes.
	335	forward_start_or_end, reverse_start_or_end = 0, 0
	336	forward_start_and_end, reverse_start_and_end = 0, 0
329	337	for adapter_set in adapter_sets:
330		score = adapter_set.best_start_or_end_score()
331		if 'Barcode' in adapter_set.name and '(forward)' in adapter_set.name:
332		forward_barcodes += score
333		elif 'Barcode' in adapter_set.name and '(reverse)' in adapter_set.name:
334		reverse_barcodes += score
335		if forward_barcodes > reverse_barcodes:
336		if verbosity > 0:
337		print('\nBarcodes determined to be in forward orientation', file=print_dest)
338		return 'forward'
339		elif reverse_barcodes > forward_barcodes:
340		if verbosity > 0:
341		print('\nBarcodes determined to be in reverse orientation', file=print_dest)
342		return 'reverse'
343		else:
344		return None
345
346
347		def exclude_end_adapters_for_rapid(matching_sets):
348		"""
349		Rapid reads shouldn't have end adapters, so we don't want to look for them if this seems to be
350		a rapid read set.
351		"""
352		if 'Rapid adapter' in [x.name for x in matching_sets]:
353		for s in matching_sets:
354		s.end_sequence = None
355		return matching_sets
	338	if 'barcode' in adapter_set.name.lower():
	339	if '(forward)' in adapter_set.name.lower():
	340	forward_start_or_end += adapter_set.best_start_or_end_score()
	341	forward_start_and_end += adapter_set.best_start_score
	342	forward_start_and_end += adapter_set.best_end_score
	343	elif '(reverse)' in adapter_set.name.lower():
	344	reverse_start_or_end += adapter_set.best_start_or_end_score()
	345	reverse_start_and_end += adapter_set.best_start_score
	346	reverse_start_and_end += adapter_set.best_end_score
	347
	348	if forward_start_or_end == 0 and reverse_start_or_end == 0:
	349	sys.exit('Error: no barcodes were found, so Porechop cannot perform barcode demultiplexing')
	350
	351	# If possible, make a decision using each barcode's best start OR end score.
	352	orientation = None
	353	if forward_start_or_end > reverse_start_or_end:
	354	orientation = 'forward'
	355	elif reverse_start_or_end > forward_start_or_end:
	356	orientation = 'reverse'
	357
	358	# If that didn't work (i.e. it's a tie between forward and reverse), then choose based on the
	359	# sum of both start AND end scores.
	360	elif forward_start_and_end > reverse_start_and_end:
	361	orientation = 'forward'
	362	elif reverse_start_and_end > forward_start_and_end:
	363	orientation = 'reverse'
	364
	365	if orientation is None:
	366	sys.exit('Error: Porechop could not determine barcode orientation')
	367
	368	if verbosity > 0:
	369	print('\nBarcodes determined to be in ' + orientation + ' orientation', file=print_dest)
	370	return orientation
356	371
357	372
358	373	def fix_up_1d2_sets(matching_sets):

410	425
411	426	# Rapid barcode full sequences
412	427	if all(x in matching_set_names
413		for x in ['SQK-NSK007', 'Rapid', 'Barcode ' + str(i) + ' (forward)']):
414		matching_sets.append(make_full_rapid_barcode_adapter(i))
	428	for x in ['Rapid', 'Barcode ' + str(i) + ' (forward)']):
	429	if 'RBK004_upstream' in matching_set_names:
	430	matching_sets.append(make_new_full_rapid_barcode_adapter(i))
	431	elif 'SQK-NSK007' in matching_set_names:
	432	matching_sets.append(make_old_full_rapid_barcode_adapter(i))
415	433
416	434	return matching_sets
417	435

423	441	if verbosity > 0:
424	442	print(bold_underline('Trimming adapters from read ends'),
425	443	file=print_dest)
426		name_len = max(max(len(x.start_sequence[0]) for x in matching_sets),
427		max(len(x.end_sequence[0]) if x.end_sequence else 0 for x in matching_sets))
	444	name_len = max(max(len(x.start_sequence[0])
	445	if x.start_sequence else 0 for x in matching_sets),
	446	max(len(x.end_sequence[0])
	447	if x.end_sequence else 0 for x in matching_sets))
428	448	for matching_set in matching_sets:
429		print(' ' + matching_set.start_sequence[0].rjust(name_len) + ': ' +
430		red(matching_set.start_sequence[1]), file=print_dest)
	449	if matching_set.start_sequence:
	450	print(' ' + matching_set.start_sequence[0].rjust(name_len) + ': ' +
	451	red(matching_set.start_sequence[1]), file=print_dest)
431	452	if matching_set.end_sequence:
432	453	print(' ' + matching_set.end_sequence[0].rjust(name_len) + ': ' +
433	454	red(matching_set.end_sequence[1]), file=print_dest)

518	539
519	540	adapters = []
520	541	for matching_set in matching_sets:
521		adapters.append(matching_set.start_sequence)
522		if matching_set.end_sequence and \
523		matching_set.end_sequence[1] != matching_set.start_sequence[1]:
524		adapters.append(matching_set.end_sequence)
	542	if matching_set.start_sequence:
	543	adapters.append(matching_set.start_sequence)
	544	if matching_set.end_sequence:
	545	if (not matching_set.start_sequence) or \
	546	matching_set.end_sequence[1] != matching_set.start_sequence[1]:
	547	adapters.append(matching_set.end_sequence)
525	548
526	549	start_sequence_names = set()
527	550	end_sequence_names = set()
528	551	for matching_set in matching_sets:
529		start_sequence_names.add(matching_set.start_sequence[0])
	552	if matching_set.start_sequence:
	553	start_sequence_names.add(matching_set.start_sequence[0])
530	554	if matching_set.end_sequence:
531	555	end_sequence_names.add(matching_set.end_sequence[0])
532	556

-1

porechop/version.py less more

1	1	The version is stored here in a separate file so it can exist in only one place.
2	2	http://stackoverflow.com/questions/458550
3	3	"""
4		__version__ = '0.2.3'
	4	__version__ = '0.2.4'