Codebase list porechop / 9904cd8
New upstream version 0.2.4+dfsg Steffen Moeller 5 years ago
5 changed file(s) with 263 addition(s) and 152 deletion(s). Raw diff Collapse all Expand all
22 Porechop is a tool for finding and removing adapters from [Oxford Nanopore](https://nanoporetech.com/) reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads. Porechop performs thorough alignments to effectively find adapters, even at low sequence identity.
33
44 Porechop also supports demultiplexing of Nanopore reads that were barcoded with the [Native Barcoding Kit](https://store.nanoporetech.com/native-barcoding-kit-1d.html), [PCR Barcoding Kit](https://store.nanoporetech.com/pcr-barcoding-kit-96.html) or [Rapid Barcoding Kit](https://store.nanoporetech.com/rapid-barcoding-sequencing-kit.html).
5
6
7 ### Oct 2018 update: Porechop is officially unsupported
8
9 While I'm happy Porechop has so many users, it has always been a bit klugey and a pain to maintain. I don't have the time to give it the attention it deserves, so I'm going to now officially declare Porechop as abandonware (though the unanswered [issues](https://github.com/rrwick/Porechop/issues) and [pull requests](https://github.com/rrwick/Porechop/pulls) reveal that it already has been for some time). I've added a [known issues](#known-issues) section to the README to outline what I think is wrong with Porechop and how a reimplementation should look. I may someday (no promises though :stuck_out_tongue:) try to rewrite it from a blank canvas to address its faults.
10
11
512
613
714
2330 * [Verbose output](#verbose-output)
2431 * [Known adapters](#known-adapters)
2532 * [Full usage](#full-usage)
33 * [Known issues](#known-issues)
2634 * [Acknowledgements](#acknowledgements)
2735 * [License](#license)
2836
9098 __Demultiplex barcoded reads, straight from Albacore output directory:__<br>
9199 `porechop -i albacore_dir -b output_dir`
92100
93 __Demultiplex barcoded reads without trimming (appropriate if Nanopolish will be used):__<br>
94 `porechop -i input_reads.fastq.gz -b output_dir --untrimmed`
95
96101 __Also works with FASTA:__<br>
97102 `porechop -i input_reads.fasta -o output_reads.fasta`
98103
142147
143148 If you run Porechop with `--discard_middle`, the reads with internal adapters will be thrown out instead of split.
144149
145 This approach might make sense if you are trimming reads from a barcoded run, as chimeric reads may combine sequences from different bins. For example, consider this read:
146 ```
147 BC01_rev - SEQUENCE_1 - SQK-NSK007_Y_Top - BC02_rev - SEQUENCE_2
148 ```
149 SEQUENCE_1 belongs in the barcode 1 bin and SEQUENCE_2 belongs in the barcode 2 bin, so while we could split the read, we would end up with contamination from another bin. Throwing the read out with `--discard_middle` might be the better option.
150 If you plan on using your reads with [Nanopolish](https://github.com/jts/nanopolish), then the `--discard_middle` option is required. This is because Nanopolish first runs [`nanopolish index`](https://github.com/jts/nanopolish#data-preprocessing) to find a one-to-one association between FASTQ reads and fast5 files. If you ran Porechop _without_ `--discard_middle`, then you could end up with multiple separate FASTQ reads which are from a single fast5, and this is incompatible with Nanopolish.
151
152 This option is also recommended if you are trimming reads from a demultiplexed barcoded sequencing run. This is because chimeric reads may contain two sequences from two separate barcodes, so throwing them out is the safer option to reduce cross-barcode contamination.
150153
151154
152155 ### Barcode demultiplexing
188191
189192 If Porechop is run without `-o` or `-b`, then it will output the trimmed reads to stdout and print its progress info to stderr. The output format of the reads will be FASTA/FASTQ based on the input reads, or else can be specified using `--format`.
190193
191 The `--verbosity` option will change the output of progress info:
194 The `--verbosity` option will change the amount of progress info:
192195 * `--verbosity 0` gives no progress output.
193 * `--verbosity 1` (the default) gives summary info about end adapter trimming and shows all instances of middle adapter splitting.
194 * `--verbosity 2` shows sequences and is described below.
196 * `--verbosity 1` (default) gives summary info about end adapter trimming and middle adapter splitting.
197 * `--verbosity 2` shows the actual trimmed/split sequences for each read (described more below).
195198 * `--verbosity 3` shows tons of data (mainly for debugging).
196199
197200
232235 # Full usage
233236
234237 ```
235 usage: porechop [-h] -i INPUT [-o OUTPUT] [--format {auto,fasta,fastq,fasta.gz,fastq.gz}] [-v VERBOSITY]
236 [-t THREADS] [--version] [-b BARCODE_DIR] [--barcode_threshold BARCODE_THRESHOLD]
238 usage: porechop -i INPUT [-o OUTPUT] [--format {auto,fasta,fastq,fasta.gz,fastq.gz}] [-v VERBOSITY]
239 [-t THREADS] [-b BARCODE_DIR] [--barcode_threshold BARCODE_THRESHOLD]
237240 [--barcode_diff BARCODE_DIFF] [--require_two_barcodes] [--untrimmed]
238 [--adapter_threshold ADAPTER_THRESHOLD] [--check_reads CHECK_READS]
239 [--scoring_scheme SCORING_SCHEME] [--end_size END_SIZE] [--min_trim_size MIN_TRIM_SIZE]
240 [--extra_end_trim EXTRA_END_TRIM] [--end_threshold END_THRESHOLD] [--discard_middle]
241 [--discard_unassigned] [--adapter_threshold ADAPTER_THRESHOLD]
242 [--check_reads CHECK_READS] [--scoring_scheme SCORING_SCHEME] [--end_size END_SIZE]
243 [--min_trim_size MIN_TRIM_SIZE] [--extra_end_trim EXTRA_END_TRIM]
244 [--end_threshold END_THRESHOLD] [--no_split] [--discard_middle]
241245 [--middle_threshold MIDDLE_THRESHOLD]
242246 [--extra_middle_trim_good_side EXTRA_MIDDLE_TRIM_GOOD_SIDE]
243247 [--extra_middle_trim_bad_side EXTRA_MIDDLE_TRIM_BAD_SIDE]
244 [--min_split_read_size MIN_SPLIT_READ_SIZE]
245
246 Porechop: a tool for finding adapters in Oxford Nanopore reads, trimming them from the ends and splitting
247 reads with internal adapters
248
249 optional arguments:
250 -h, --help show this help message and exit
248 [--min_split_read_size MIN_SPLIT_READ_SIZE] [-h] [--version]
249
250 Porechop: a tool for finding adapters in Oxford Nanopore reads, trimming them from the ends and
251 splitting reads with internal adapters
251252
252253 Main options:
253 -i INPUT, --input INPUT FASTA/FASTQ of input reads or a directory which will be recursively
254 searched for FASTQ files (required)
255 -o OUTPUT, --output OUTPUT Filename for FASTA or FASTQ of trimmed reads (if not set, trimmed
256 reads will be printed to stdout)
254 -i INPUT, --input INPUT FASTA/FASTQ of input reads or a directory which will be
255 recursively searched for FASTQ files (required)
256 -o OUTPUT, --output OUTPUT Filename for FASTA or FASTQ of trimmed reads (if not set, trimmed
257 reads will be printed to stdout)
257258 --format {auto,fasta,fastq,fasta.gz,fastq.gz}
258 Output format for the reads - if auto, the format will be chosen based
259 on the output filename or the input read format (default: auto)
259 Output format for the reads - if auto, the format will be chosen
260 based on the output filename or the input read format (default:
261 auto)
260262 -v VERBOSITY, --verbosity VERBOSITY
261 Level of progress information: 0 = none, 1 = some, 2 = lots, 3 = full
262 - output will go to stdout if reads are saved to a file and stderr if
263 reads are printed to stdout (default: 1)
264 -t THREADS, --threads THREADS Number of threads to use for adapter alignment (default: 8)
265 --version show program's version number and exit
263 Level of progress information: 0 = none, 1 = some, 2 = lots, 3 =
264 full - output will go to stdout if reads are saved to a file and
265 stderr if reads are printed to stdout (default: 1)
266 -t THREADS, --threads THREADS Number of threads to use for adapter alignment (default: 8)
266267
267268 Barcode binning settings:
268269 Control the binning of reads based on barcodes (i.e. barcode demultiplexing)
269270
270271 -b BARCODE_DIR, --barcode_dir BARCODE_DIR
271 Reads will be binned based on their barcode and saved to separate
272 files in this directory (incompatible with --output)
272 Reads will be binned based on their barcode and saved to separate
273 files in this directory (incompatible with --output)
273274 --barcode_threshold BARCODE_THRESHOLD
274 A read must have at least this percent identity to a barcode to be
275 binned (default: 75.0)
276 --barcode_diff BARCODE_DIFF If the difference between a read's best barcode identity and its
277 second-best barcode identity is less than this value, it will not be
278 put in a barcode bin (to exclude cases which are too close to call)
279 (default: 5.0)
280 --require_two_barcodes Reads will only be put in barcode bins if they have a strong match for
281 the barcode on both their start and end (default: a read can be binned
282 with a match at its start or end)
283 --untrimmed Bin reads but do not trim the ends (appropriate if reads are to be
284 used with Nanopolish) (default: False)
275 A read must have at least this percent identity to a barcode to be
276 binned (default: 75.0)
277 --barcode_diff BARCODE_DIFF If the difference between a read's best barcode identity and its
278 second-best barcode identity is less than this value, it will not
279 be put in a barcode bin (to exclude cases which are too close to
280 call) (default: 5.0)
281 --require_two_barcodes Reads will only be put in barcode bins if they have a strong match
282 for the barcode on both their start and end (default: a read can
283 be binned with a match at its start or end)
284 --untrimmed Bin reads but do not trim them (default: trim the reads)
285 --discard_unassigned Discard unassigned reads (instead of creating a "none" bin)
286 (default: False)
285287
286288 Adapter search settings:
287289 Control how the program determines which adapter sets are present
288290
289291 --adapter_threshold ADAPTER_THRESHOLD
290 An adapter set has to have at least this percent identity to be
291 labelled as present and trimmed off (0 to 100) (default: 90.0)
292 --check_reads CHECK_READS This many reads will be aligned to all possible adapters to determine
293 which adapter sets are present (default: 10000)
294 --scoring_scheme SCORING_SCHEME Comma-delimited string of alignment scores: match, mismatch, gap open,
295 gap extend (default: 3,-6,-5,-2)
292 An adapter set has to have at least this percent identity to be
293 labelled as present and trimmed off (0 to 100) (default: 90.0)
294 --check_reads CHECK_READS This many reads will be aligned to all possible adapters to
295 determine which adapter sets are present (default: 10000)
296 --scoring_scheme SCORING_SCHEME
297 Comma-delimited string of alignment scores: match, mismatch, gap
298 open, gap extend (default: 3,-6,-5,-2)
296299
297300 End adapter settings:
298301 Control the trimming of adapters from read ends
299302
300 --end_size END_SIZE The number of base pairs at each end of the read which will be
301 searched for adapter sequences (default: 150)
302 --min_trim_size MIN_TRIM_SIZE Adapter alignments smaller than this will be ignored (default: 4)
303 --extra_end_trim EXTRA_END_TRIM This many additional bases will be removed next to adapters found at
304 the ends of reads (default: 2)
305 --end_threshold END_THRESHOLD Adapters at the ends of reads must have at least this percent identity
306 to be removed (0 to 100) (default: 75.0)
303 --end_size END_SIZE The number of base pairs at each end of the read which will be
304 searched for adapter sequences (default: 150)
305 --min_trim_size MIN_TRIM_SIZE Adapter alignments smaller than this will be ignored (default: 4)
306 --extra_end_trim EXTRA_END_TRIM
307 This many additional bases will be removed next to adapters found
308 at the ends of reads (default: 2)
309 --end_threshold END_THRESHOLD Adapters at the ends of reads must have at least this percent
310 identity to be removed (0 to 100) (default: 75.0)
307311
308312 Middle adapter settings:
309313 Control the splitting of read from middle adapters
310314
311 --no_split Skip splitting reads based on middle adapters (default: split reads
312 when an adapter is found in the middle)
313 --discard_middle Reads with middle adapters will be discarded (default: reads with
314 middle adapters are split) (this option is on by default when
315 outputting reads into barcode bins)
315 --no_split Skip splitting reads based on middle adapters (default: split
316 reads when an adapter is found in the middle)
317 --discard_middle Reads with middle adapters will be discarded (default: reads with
318 middle adapters are split) (required for reads to be used with
319 Nanopolish, this option is on by default when outputting reads
320 into barcode bins)
316321 --middle_threshold MIDDLE_THRESHOLD
317 Adapters in the middle of reads must have at least this percent
318 identity to be found (0 to 100) (default: 85.0)
322 Adapters in the middle of reads must have at least this percent
323 identity to be found (0 to 100) (default: 85.0)
319324 --extra_middle_trim_good_side EXTRA_MIDDLE_TRIM_GOOD_SIDE
320 This many additional bases will be removed next to middle adapters on
321 their "good" side (default: 10)
325 This many additional bases will be removed next to middle adapters
326 on their "good" side (default: 10)
322327 --extra_middle_trim_bad_side EXTRA_MIDDLE_TRIM_BAD_SIDE
323 This many additional bases will be removed next to middle adapters on
324 their "bad" side (default: 100)
328 This many additional bases will be removed next to middle adapters
329 on their "bad" side (default: 100)
325330 --min_split_read_size MIN_SPLIT_READ_SIZE
326 Post-split read pieces smaller than this many base pairs will not be
327 outputted (default: 1000)
328 ```
331 Post-split read pieces smaller than this many base pairs will not
332 be outputted (default: 1000)
333
334 Help:
335 -h, --help Show this help message and exit
336 --version Show program's version number and exit
337 ```
338
339
340
341 # Known issues
342
343 ### Adapter search
344
345 Porechop tries to automatically determine which adapters are present by looking at the reads, but this approach has a few issues:
346
347 * As the number of kits/barcodes has grown, adapter-search part of the Porechop's pipeline has become increasingly slow.
348 * Porechop only does the adapter search on a subset of reads, which means there can be problems with non-randomly ordered read sets (e.g. all barcode 1 reads at the start of a file, followed by barcode 2 reads, etc).
349 * Many ONT adapters share common sequence with each other, making false positive adapter finds possible.
350
351 A simpler solution (and in hindsight what I should have done) would be to require the kit and/or adapters from the user. E.g. `porechop --sqk-lsk109` or `porechop --start_adapt ACGCTAGCATACGT`.
352
353
354 ### Performance
355
356 Porechop uses [SeqAn](https://github.com/seqan/seqan) to perform its alignments in C++. This library is very flexible, but not as fast as some alternatives, such as [Edlib](https://github.com/Martinsos/edlib).
357
358 Another performance issue is that Porechop uses [ctypes](https://docs.python.org/3/library/ctypes.html) to interface with its C++ code. Function calls with ctypes can have a bit of overhead, which means that Porechop cannot use threads very efficiently (it spends too much of its time in the Python code, which is intrinsically non-parallel).
359
360
361 ### Barcode demultiplexing
362
363 I added demultiplexing to Porechop as an afterthought – it was already looking for barcodes to trim, so why not sort reads by barcodes too? This turned out to be a very useful feature, but in hindsight I think it might be simpler (and easier to maintain) if trimming and demultiplexing functionality were in separate tools.
364
365
366 ### Base-space problems
367
368 I've encountered a couple of issues where adapter sequences are not properly basecalled, resulting in inconsistent sequence. Porechop trims in base-space, so this is a somewhat intractable problem. See [issue #40](https://github.com/rrwick/Porechop/issues/40) for an example.
369
370 These have made me wonder if the adapter trimming should be done in signal-space instead, though that would be a more complex problem to solve. I hope that in the future ONT can integrate this kind of functionality directly into their basecallers.
371
329372
330373
331374
4242 Gets the barcode name for the output files. We want a concise name, so it looks at all
4343 options and chooses the shortest.
4444 """
45 possible_names = [self.name, self.start_sequence[0]]
45 possible_names = [self.name]
46 if self.start_sequence:
47 possible_names.append(self.start_sequence[0])
4648 if self.end_sequence:
4749 possible_names.append(self.end_sequence[0])
4850 barcode_name = sorted(possible_names, key=lambda x: len(x))[0]
7577 start_sequence=('SQK-NSK007_Y_Top', 'AATGTACTTCGTTCAGTTACGTATTGCT'),
7678 end_sequence=('SQK-NSK007_Y_Bottom', 'GCAATACGTAACTGAACGAAGT')),
7779
80
7881 Adapter('Rapid',
79 start_sequence=('Rapid_adapter', 'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA')),
82 start_sequence=('Rapid_adapter',
83 'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA')),
84
85 Adapter('RBK004_upstream',
86 start_sequence=('RBK004_upstream', 'AATGTACTTCGTTCAGTTACGGCTTGGGTGTTTAACC')),
87
8088
8189 Adapter('SQK-MAP006',
82 start_sequence=('SQK-MAP006_Y_Top_SK63', 'GGTTGTTTCTGTTGGTGCTGATATTGCT'),
83 end_sequence= ('SQK-MAP006_Y_Bottom_SK64', 'GCAATATCAGCACCAACAGAAA')),
84
85 Adapter('SQK-MAP006 Short',
90 start_sequence=('SQK-MAP006_Y_Top_SK63', 'GGTTGTTTCTGTTGGTGCTGATATTGCT'),
91 end_sequence= ('SQK-MAP006_Y_Bottom_SK64', 'GCAATATCAGCACCAACAGAAA')),
92
93 Adapter('SQK-MAP006 short',
8694 start_sequence=('SQK-MAP006_Short_Y_Top_LI32', 'CGGCGTCTGCTTGGGTGTTTAACCT'),
8795 end_sequence= ('SQK-MAP006_Short_Y_Bottom_LI33', 'GGTTAAACACCCAAGCAGACGCCG')),
8896
97
98 # The PCR adapters are used both in PCR DNA kits and some cDNA kits.
8999 Adapter('PCR adapters 1',
90100 start_sequence=('PCR_1_start', 'ACTTGCCTGTCGCTCTATCTTC'),
91 end_sequence= ('PCR_1_end', 'GAAGATAGAGCGACAGGCAAGT')),
92
93 Adapter('PCR tail 1',
94 start_sequence=('PCR_tail_1_start', 'TTAACCTTTCTGTTGGTGCTGATATTGC'),
95 end_sequence= ('PCR_tail_1_end', 'GCAATATCAGCACCAACAGAAAGGTTAA')),
96
97 Adapter('PCR tail 2',
98 start_sequence=('PCR_tail_2_start', 'TTAACCTACTTGCCTGTCGCTCTATCTTC'),
99 end_sequence= ('PCR_tail_2_end', 'GAAGATAGAGCGACAGGCAAGTAGGTTAA')),
101 end_sequence= ('PCR_1_end', 'GAAGATAGAGCGACAGGCAAGT')),
102
103 Adapter('PCR adapters 2',
104 start_sequence=('PCR_2_start', 'TTTCTGTTGGTGCTGATATTGC'),
105 end_sequence= ('PCR_2_end', 'GCAATATCAGCACCAACAGAAA')),
106
107 Adapter('PCR adapters 3',
108 start_sequence=('PCR_3_start', 'TACTTGCCTGTCGCTCTATCTTC'),
109 end_sequence= ('PCR_3_end', 'GAAGATAGAGCGACAGGCAAGTA')),
100110
101111
102112 # 1D^2 kit adapters are interesting. ONT provided the following sequences on their site:
114124 end_sequence= ('1D2_part_2_end', 'CACCCAAGCAGACGCCAGCAATACGTAACT')),
115125 # The middle part of the provided sequences is less common, so I've left it out of the
116126 # adapter sequences here.
127
128
129 Adapter('cDNA SSP',
130 start_sequence=('cDNA_SSP', 'TTTCTGTTGGTGCTGATATTGCTGCCATTACGGCCGGG'),
131 end_sequence= ('cDNA_SSP_rev', 'CCCGGCCGTAATGGCAGCAATATCAGCACCAACAGAAA')),
117132
118133
119134 # Some barcoding kits (like the native barcodes) use the rev comp barcode at the start
460475 end_sequence=('NB' + '%02d' % barcode_num + '_end', end_full_seq))
461476
462477
463 def make_full_rapid_barcode_adapter(barcode_num):
478 def make_old_full_rapid_barcode_adapter(barcode_num): # applies to SQK-RBK001
464479 barcode = [x for x in ADAPTERS if x.name == 'Barcode ' + str(barcode_num) + ' (forward)'][0]
465480 start_barcode_seq = barcode.start_sequence[1]
466481
467 start_full_seq = 'AATGTACTTCGTTCAGTTACGTATTGCT' + start_barcode_seq + 'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA'
468
469 return Adapter('Rapid barcoding ' + str(barcode_num) + ' (full sequence)',
482 start_full_seq = 'AATGTACTTCGTTCAGTTACG' + 'TATTGCT' + start_barcode_seq + \
483 'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA'
484
485 return Adapter('Rapid barcoding ' + str(barcode_num) + ' (full sequence, old)',
470486 start_sequence=('RB' + '%02d' % barcode_num + '_full', start_full_seq))
487
488
489 def make_new_full_rapid_barcode_adapter(barcode_num): # applies to SQK-RBK004
490 barcode = [x for x in ADAPTERS if x.name == 'Barcode ' + str(barcode_num) + ' (forward)'][0]
491 start_barcode_seq = barcode.start_sequence[1]
492
493 start_full_seq = 'AATGTACTTCGTTCAGTTACG' + 'GCTTGGGTGTTTAACC' + start_barcode_seq + \
494 'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA'
495
496 return Adapter('Rapid barcoding ' + str(barcode_num) + ' (full sequence, new)',
497 start_sequence=('RB' + '%02d' % barcode_num + '_full', start_full_seq))
2222 def __init__(self, name, seq, quals):
2323 self.name = name
2424
25 self.seq = seq
25 self.seq = seq.upper()
26 if self.seq.count('U') > self.seq.count('T'):
27 self.rna = True
28 self.seq = self.seq.replace('U', 'T')
29 else:
30 self.rna = False
31
2632 self.quals = quals
2733 if len(quals) < len(seq):
2834 self.quals += '+' * (len(seq) - len(quals))
95101 seq = self.get_seq_with_start_end_adapters_trimmed()
96102 if not seq: # Don't return empty sequences
97103 return ''
104 if self.rna:
105 seq = seq.replace('T', 'U')
98106 return ''.join(['>', self.name, '\n', add_line_breaks_to_sequence(seq, 70)])
99107 elif discard_middle:
100108 return ''
105113 if not split_read_part[0]: # Don't return empty sequences
106114 return ''
107115 seq = add_line_breaks_to_sequence(split_read_part[0], 70)
116 if self.rna:
117 seq = seq.replace('T', 'U')
108118 fasta_str += ''.join(['>', read_name, '\n', seq])
109119 return fasta_str
110120
118128 quals = self.get_quals_with_start_end_adapters_trimmed()
119129 if not seq: # Don't return empty sequences
120130 return ''
131 if self.rna:
132 seq = seq.replace('T', 'U')
121133 return ''.join(['@', self.name, '\n', seq, '\n+\n', quals, '\n'])
122134 elif discard_middle:
123135 return ''
125137 fastq_str = ''
126138 for i, split_read_part in enumerate(self.get_split_read_parts(min_split_read_size)):
127139 read_name = add_number_to_read_name(self.name, i + 1)
128 if not split_read_part[0]: # Don't return empty sequences
140 seq, qual = split_read_part[0], split_read_part[1]
141 if not seq: # Don't return empty sequences
129142 return ''
130 fastq_str += ''.join(['@', read_name, '\n', split_read_part[0], '\n+\n',
131 split_read_part[1], '\n'])
143 if self.rna:
144 seq = seq.replace('T', 'U')
145 fastq_str += ''.join(['@', read_name, '\n', seq, '\n+\n', qual, '\n'])
132146 return fastq_str
133147
134148 def align_adapter_set(self, adapter_set, end_size, scoring_scheme_vals):
137151 This is not to determine where to trim the reads, but rather to figure out which adapter
138152 sets are present in the data.
139153 """
140 read_seq_start = self.seq[:end_size]
141 score, _, _, _ = align_adapter(read_seq_start, adapter_set.start_sequence[1],
142 scoring_scheme_vals)
143 adapter_set.best_start_score = max(adapter_set.best_start_score, score)
154 if adapter_set.start_sequence:
155 read_seq_start = self.seq[:end_size]
156 score, _, _, _ = align_adapter(read_seq_start, adapter_set.start_sequence[1],
157 scoring_scheme_vals)
158 adapter_set.best_start_score = max(adapter_set.best_start_score, score)
144159 if adapter_set.end_sequence:
145160 read_seq_end = self.seq[-end_size:]
146161 score, _, _, _ = align_adapter(read_seq_end, adapter_set.end_sequence[1],
155170 """
156171 read_seq_start = self.seq[:end_size]
157172 for adapter in adapters:
173 if not adapter.start_sequence:
174 continue
158175 full_score, partial_score, read_start, read_end = \
159176 align_adapter(read_seq_start, adapter.start_sequence[1], scoring_scheme_vals)
160177 if partial_score > end_threshold and read_end != end_size and \
2323 from multiprocessing.dummy import Pool as ThreadPool
2424 from collections import defaultdict
2525 from .misc import load_fasta_or_fastq, print_table, red, bold_underline, MyHelpFormatter, int_to_str
26 from .adapters import ADAPTERS, make_full_native_barcode_adapter, make_full_rapid_barcode_adapter
26 from .adapters import ADAPTERS, make_full_native_barcode_adapter,\
27 make_old_full_rapid_barcode_adapter, make_new_full_rapid_barcode_adapter
2728 from .nanopore_read import NanoporeRead
2829 from .version import __version__
2930
3637 matching_sets = find_matching_adapter_sets(check_reads, args.verbosity, args.end_size,
3738 args.scoring_scheme_vals, args.print_dest,
3839 args.adapter_threshold, args.threads)
39 matching_sets = exclude_end_adapters_for_rapid(matching_sets)
4040 matching_sets = fix_up_1d2_sets(matching_sets)
41 display_adapter_set_results(matching_sets, args.verbosity, args.print_dest)
42 matching_sets = add_full_barcode_adapter_sets(matching_sets)
4341
4442 if args.barcode_dir:
4543 forward_or_reverse_barcodes = choose_barcoding_kit(matching_sets, args.verbosity,
4644 args.print_dest)
4745 else:
4846 forward_or_reverse_barcodes = None
47
48 display_adapter_set_results(matching_sets, args.verbosity, args.print_dest)
49 matching_sets = add_full_barcode_adapter_sets(matching_sets)
50
4951 if args.verbosity > 0:
5052 print('\n', file=args.print_dest)
5153
8587 parser = argparse.ArgumentParser(description='Porechop: a tool for finding adapters in Oxford '
8688 'Nanopore reads, trimming them from the ends and '
8789 'splitting reads with internal adapters',
88 formatter_class=MyHelpFormatter)
90 formatter_class=MyHelpFormatter, add_help=False)
8991 main_group = parser.add_argument_group('Main options')
9092 main_group.add_argument('-i', '--input', required=True,
9193 help='FASTA/FASTQ of input reads or a directory which will be '
104106 'a file and stderr if reads are printed to stdout')
105107 main_group.add_argument('-t', '--threads', type=int, default=default_threads,
106108 help='Number of threads to use for adapter alignment')
107 main_group.add_argument('--version', action='version', version=__version__)
108109
109110 barcode_group = parser.add_argument_group('Barcode binning settings',
110111 'Control the binning of reads based on barcodes '
126127 'match for the barcode on both their start and end (default: '
127128 'a read can be binned with a match at its start or end)')
128129 barcode_group.add_argument('--untrimmed', action='store_true',
129 help='Bin reads but do not trim them (appropriate if reads are to '
130 'be used with Nanopolish) (default: trim the reads)')
130 help='Bin reads but do not trim them (default: trim the reads)')
131131 barcode_group.add_argument('--discard_unassigned', action='store_true',
132132 help='Discard unassigned reads (instead of creating a "none" bin)')
133133
168168 'middle)')
169169 middle_trim_group.add_argument('--discard_middle', action='store_true',
170170 help='Reads with middle adapters will be discarded (default: '
171 'reads with middle adapters are split) (this option is '
172 'on by default when outputting reads into barcode bins)')
173 middle_trim_group.add_argument('--middle_threshold', type=float, default=85.0,
171 'reads with middle adapters are split) (required for '
172 'reads to be used with Nanopolish, this option is on by '
173 'default when outputting reads into barcode bins)')
174 middle_trim_group.add_argument('--middle_threshold', type=float, default=90.0,
174175 help='Adapters in the middle of reads must have at least this '
175176 'percent identity to be found (0 to 100)')
176177 middle_trim_group.add_argument('--extra_middle_trim_good_side', type=int, default=10,
182183 middle_trim_group.add_argument('--min_split_read_size', type=int, default=1000,
183184 help='Post-split read pieces smaller than this many base pairs '
184185 'will not be outputted')
186
187 help_args = parser.add_argument_group('Help')
188 help_args.add_argument('-h', '--help', action='help', default=argparse.SUPPRESS,
189 help='Show this help message and exit')
190 help_args.add_argument('--version', action='version', version=__version__,
191 help="Show program's version number and exit")
185192
186193 args = parser.parse_args()
187194
324331 If the user is sorting reads by barcode bin, choose one barcode configuration (rev comp
325332 barcodes at the start of the read or at the end of the read) and ignore the other.
326333 """
327 forward_barcodes = 0
328 reverse_barcodes = 0
334 # Tally up scores for forward and reverse barcodes.
335 forward_start_or_end, reverse_start_or_end = 0, 0
336 forward_start_and_end, reverse_start_and_end = 0, 0
329337 for adapter_set in adapter_sets:
330 score = adapter_set.best_start_or_end_score()
331 if 'Barcode' in adapter_set.name and '(forward)' in adapter_set.name:
332 forward_barcodes += score
333 elif 'Barcode' in adapter_set.name and '(reverse)' in adapter_set.name:
334 reverse_barcodes += score
335 if forward_barcodes > reverse_barcodes:
336 if verbosity > 0:
337 print('\nBarcodes determined to be in forward orientation', file=print_dest)
338 return 'forward'
339 elif reverse_barcodes > forward_barcodes:
340 if verbosity > 0:
341 print('\nBarcodes determined to be in reverse orientation', file=print_dest)
342 return 'reverse'
343 else:
344 return None
345
346
347 def exclude_end_adapters_for_rapid(matching_sets):
348 """
349 Rapid reads shouldn't have end adapters, so we don't want to look for them if this seems to be
350 a rapid read set.
351 """
352 if 'Rapid adapter' in [x.name for x in matching_sets]:
353 for s in matching_sets:
354 s.end_sequence = None
355 return matching_sets
338 if 'barcode' in adapter_set.name.lower():
339 if '(forward)' in adapter_set.name.lower():
340 forward_start_or_end += adapter_set.best_start_or_end_score()
341 forward_start_and_end += adapter_set.best_start_score
342 forward_start_and_end += adapter_set.best_end_score
343 elif '(reverse)' in adapter_set.name.lower():
344 reverse_start_or_end += adapter_set.best_start_or_end_score()
345 reverse_start_and_end += adapter_set.best_start_score
346 reverse_start_and_end += adapter_set.best_end_score
347
348 if forward_start_or_end == 0 and reverse_start_or_end == 0:
349 sys.exit('Error: no barcodes were found, so Porechop cannot perform barcode demultiplexing')
350
351 # If possible, make a decision using each barcode's best start OR end score.
352 orientation = None
353 if forward_start_or_end > reverse_start_or_end:
354 orientation = 'forward'
355 elif reverse_start_or_end > forward_start_or_end:
356 orientation = 'reverse'
357
358 # If that didn't work (i.e. it's a tie between forward and reverse), then choose based on the
359 # sum of both start AND end scores.
360 elif forward_start_and_end > reverse_start_and_end:
361 orientation = 'forward'
362 elif reverse_start_and_end > forward_start_and_end:
363 orientation = 'reverse'
364
365 if orientation is None:
366 sys.exit('Error: Porechop could not determine barcode orientation')
367
368 if verbosity > 0:
369 print('\nBarcodes determined to be in ' + orientation + ' orientation', file=print_dest)
370 return orientation
356371
357372
358373 def fix_up_1d2_sets(matching_sets):
410425
411426 # Rapid barcode full sequences
412427 if all(x in matching_set_names
413 for x in ['SQK-NSK007', 'Rapid', 'Barcode ' + str(i) + ' (forward)']):
414 matching_sets.append(make_full_rapid_barcode_adapter(i))
428 for x in ['Rapid', 'Barcode ' + str(i) + ' (forward)']):
429 if 'RBK004_upstream' in matching_set_names:
430 matching_sets.append(make_new_full_rapid_barcode_adapter(i))
431 elif 'SQK-NSK007' in matching_set_names:
432 matching_sets.append(make_old_full_rapid_barcode_adapter(i))
415433
416434 return matching_sets
417435
423441 if verbosity > 0:
424442 print(bold_underline('Trimming adapters from read ends'),
425443 file=print_dest)
426 name_len = max(max(len(x.start_sequence[0]) for x in matching_sets),
427 max(len(x.end_sequence[0]) if x.end_sequence else 0 for x in matching_sets))
444 name_len = max(max(len(x.start_sequence[0])
445 if x.start_sequence else 0 for x in matching_sets),
446 max(len(x.end_sequence[0])
447 if x.end_sequence else 0 for x in matching_sets))
428448 for matching_set in matching_sets:
429 print(' ' + matching_set.start_sequence[0].rjust(name_len) + ': ' +
430 red(matching_set.start_sequence[1]), file=print_dest)
449 if matching_set.start_sequence:
450 print(' ' + matching_set.start_sequence[0].rjust(name_len) + ': ' +
451 red(matching_set.start_sequence[1]), file=print_dest)
431452 if matching_set.end_sequence:
432453 print(' ' + matching_set.end_sequence[0].rjust(name_len) + ': ' +
433454 red(matching_set.end_sequence[1]), file=print_dest)
518539
519540 adapters = []
520541 for matching_set in matching_sets:
521 adapters.append(matching_set.start_sequence)
522 if matching_set.end_sequence and \
523 matching_set.end_sequence[1] != matching_set.start_sequence[1]:
524 adapters.append(matching_set.end_sequence)
542 if matching_set.start_sequence:
543 adapters.append(matching_set.start_sequence)
544 if matching_set.end_sequence:
545 if (not matching_set.start_sequence) or \
546 matching_set.end_sequence[1] != matching_set.start_sequence[1]:
547 adapters.append(matching_set.end_sequence)
525548
526549 start_sequence_names = set()
527550 end_sequence_names = set()
528551 for matching_set in matching_sets:
529 start_sequence_names.add(matching_set.start_sequence[0])
552 if matching_set.start_sequence:
553 start_sequence_names.add(matching_set.start_sequence[0])
530554 if matching_set.end_sequence:
531555 end_sequence_names.add(matching_set.end_sequence[0])
532556
11 The version is stored here in a separate file so it can exist in only one place.
22 http://stackoverflow.com/questions/458550
33 """
4 __version__ = '0.2.3'
4 __version__ = '0.2.4'