New upstream version 0.2.4+dfsg
Steffen Moeller
5 years ago
2 | 2 | Porechop is a tool for finding and removing adapters from [Oxford Nanopore](https://nanoporetech.com/) reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads. Porechop performs thorough alignments to effectively find adapters, even at low sequence identity. |
3 | 3 | |
4 | 4 | Porechop also supports demultiplexing of Nanopore reads that were barcoded with the [Native Barcoding Kit](https://store.nanoporetech.com/native-barcoding-kit-1d.html), [PCR Barcoding Kit](https://store.nanoporetech.com/pcr-barcoding-kit-96.html) or [Rapid Barcoding Kit](https://store.nanoporetech.com/rapid-barcoding-sequencing-kit.html). |
5 | ||
6 | ||
7 | ### Oct 2018 update: Porechop is officially unsupported | |
8 | ||
9 | While I'm happy Porechop has so many users, it has always been a bit klugey and a pain to maintain. I don't have the time to give it the attention it deserves, so I'm going to now officially declare Porechop as abandonware (though the unanswered [issues](https://github.com/rrwick/Porechop/issues) and [pull requests](https://github.com/rrwick/Porechop/pulls) reveal that it already has been for some time). I've added a [known issues](#known-issues) section to the README to outline what I think is wrong with Porechop and how a reimplementation should look. I may someday (no promises though :stuck_out_tongue:) try to rewrite it from a blank canvas to address its faults. | |
10 | ||
11 | ||
5 | 12 | |
6 | 13 | |
7 | 14 | |
23 | 30 | * [Verbose output](#verbose-output) |
24 | 31 | * [Known adapters](#known-adapters) |
25 | 32 | * [Full usage](#full-usage) |
33 | * [Known issues](#known-issues) | |
26 | 34 | * [Acknowledgements](#acknowledgements) |
27 | 35 | * [License](#license) |
28 | 36 | |
90 | 98 | __Demultiplex barcoded reads, straight from Albacore output directory:__<br> |
91 | 99 | `porechop -i albacore_dir -b output_dir` |
92 | 100 | |
93 | __Demultiplex barcoded reads without trimming (appropriate if Nanopolish will be used):__<br> | |
94 | `porechop -i input_reads.fastq.gz -b output_dir --untrimmed` | |
95 | ||
96 | 101 | __Also works with FASTA:__<br> |
97 | 102 | `porechop -i input_reads.fasta -o output_reads.fasta` |
98 | 103 | |
142 | 147 | |
143 | 148 | If you run Porechop with `--discard_middle`, the reads with internal adapters will be thrown out instead of split. |
144 | 149 | |
145 | This approach might make sense if you are trimming reads from a barcoded run, as chimeric reads may combine sequences from different bins. For example, consider this read: | |
146 | ``` | |
147 | BC01_rev - SEQUENCE_1 - SQK-NSK007_Y_Top - BC02_rev - SEQUENCE_2 | |
148 | ``` | |
149 | SEQUENCE_1 belongs in the barcode 1 bin and SEQUENCE_2 belongs in the barcode 2 bin, so while we could split the read, we would end up with contamination from another bin. Throwing the read out with `--discard_middle` might be the better option. | |
150 | If you plan on using your reads with [Nanopolish](https://github.com/jts/nanopolish), then the `--discard_middle` option is required. This is because Nanopolish first runs [`nanopolish index`](https://github.com/jts/nanopolish#data-preprocessing) to find a one-to-one association between FASTQ reads and fast5 files. If you ran Porechop _without_ `--discard_middle`, then you could end up with multiple separate FASTQ reads which are from a single fast5, and this is incompatible with Nanopolish. | |
151 | ||
152 | This option is also recommended if you are trimming reads from a demultiplexed barcoded sequencing run. This is because chimeric reads may contain two sequences from two separate barcodes, so throwing them out is the safer option to reduce cross-barcode contamination. | |
150 | 153 | |
151 | 154 | |
152 | 155 | ### Barcode demultiplexing |
188 | 191 | |
189 | 192 | If Porechop is run without `-o` or `-b`, then it will output the trimmed reads to stdout and print its progress info to stderr. The output format of the reads will be FASTA/FASTQ based on the input reads, or else can be specified using `--format`. |
190 | 193 | |
191 | The `--verbosity` option will change the output of progress info: | |
194 | The `--verbosity` option will change the amount of progress info: | |
192 | 195 | * `--verbosity 0` gives no progress output. |
193 | * `--verbosity 1` (the default) gives summary info about end adapter trimming and shows all instances of middle adapter splitting. | |
194 | * `--verbosity 2` shows sequences and is described below. | |
196 | * `--verbosity 1` (default) gives summary info about end adapter trimming and middle adapter splitting. | |
197 | * `--verbosity 2` shows the actual trimmed/split sequences for each read (described more below). | |
195 | 198 | * `--verbosity 3` shows tons of data (mainly for debugging). |
196 | 199 | |
197 | 200 | |
232 | 235 | # Full usage |
233 | 236 | |
234 | 237 | ``` |
235 | usage: porechop [-h] -i INPUT [-o OUTPUT] [--format {auto,fasta,fastq,fasta.gz,fastq.gz}] [-v VERBOSITY] | |
236 | [-t THREADS] [--version] [-b BARCODE_DIR] [--barcode_threshold BARCODE_THRESHOLD] | |
238 | usage: porechop -i INPUT [-o OUTPUT] [--format {auto,fasta,fastq,fasta.gz,fastq.gz}] [-v VERBOSITY] | |
239 | [-t THREADS] [-b BARCODE_DIR] [--barcode_threshold BARCODE_THRESHOLD] | |
237 | 240 | [--barcode_diff BARCODE_DIFF] [--require_two_barcodes] [--untrimmed] |
238 | [--adapter_threshold ADAPTER_THRESHOLD] [--check_reads CHECK_READS] | |
239 | [--scoring_scheme SCORING_SCHEME] [--end_size END_SIZE] [--min_trim_size MIN_TRIM_SIZE] | |
240 | [--extra_end_trim EXTRA_END_TRIM] [--end_threshold END_THRESHOLD] [--discard_middle] | |
241 | [--discard_unassigned] [--adapter_threshold ADAPTER_THRESHOLD] | |
242 | [--check_reads CHECK_READS] [--scoring_scheme SCORING_SCHEME] [--end_size END_SIZE] | |
243 | [--min_trim_size MIN_TRIM_SIZE] [--extra_end_trim EXTRA_END_TRIM] | |
244 | [--end_threshold END_THRESHOLD] [--no_split] [--discard_middle] | |
241 | 245 | [--middle_threshold MIDDLE_THRESHOLD] |
242 | 246 | [--extra_middle_trim_good_side EXTRA_MIDDLE_TRIM_GOOD_SIDE] |
243 | 247 | [--extra_middle_trim_bad_side EXTRA_MIDDLE_TRIM_BAD_SIDE] |
244 | [--min_split_read_size MIN_SPLIT_READ_SIZE] | |
245 | ||
246 | Porechop: a tool for finding adapters in Oxford Nanopore reads, trimming them from the ends and splitting | |
247 | reads with internal adapters | |
248 | ||
249 | optional arguments: | |
250 | -h, --help show this help message and exit | |
248 | [--min_split_read_size MIN_SPLIT_READ_SIZE] [-h] [--version] | |
249 | ||
250 | Porechop: a tool for finding adapters in Oxford Nanopore reads, trimming them from the ends and | |
251 | splitting reads with internal adapters | |
251 | 252 | |
252 | 253 | Main options: |
253 | -i INPUT, --input INPUT FASTA/FASTQ of input reads or a directory which will be recursively | |
254 | searched for FASTQ files (required) | |
255 | -o OUTPUT, --output OUTPUT Filename for FASTA or FASTQ of trimmed reads (if not set, trimmed | |
256 | reads will be printed to stdout) | |
254 | -i INPUT, --input INPUT FASTA/FASTQ of input reads or a directory which will be | |
255 | recursively searched for FASTQ files (required) | |
256 | -o OUTPUT, --output OUTPUT Filename for FASTA or FASTQ of trimmed reads (if not set, trimmed | |
257 | reads will be printed to stdout) | |
257 | 258 | --format {auto,fasta,fastq,fasta.gz,fastq.gz} |
258 | Output format for the reads - if auto, the format will be chosen based | |
259 | on the output filename or the input read format (default: auto) | |
259 | Output format for the reads - if auto, the format will be chosen | |
260 | based on the output filename or the input read format (default: | |
261 | auto) | |
260 | 262 | -v VERBOSITY, --verbosity VERBOSITY |
261 | Level of progress information: 0 = none, 1 = some, 2 = lots, 3 = full | |
262 | - output will go to stdout if reads are saved to a file and stderr if | |
263 | reads are printed to stdout (default: 1) | |
264 | -t THREADS, --threads THREADS Number of threads to use for adapter alignment (default: 8) | |
265 | --version show program's version number and exit | |
263 | Level of progress information: 0 = none, 1 = some, 2 = lots, 3 = | |
264 | full - output will go to stdout if reads are saved to a file and | |
265 | stderr if reads are printed to stdout (default: 1) | |
266 | -t THREADS, --threads THREADS Number of threads to use for adapter alignment (default: 8) | |
266 | 267 | |
267 | 268 | Barcode binning settings: |
268 | 269 | Control the binning of reads based on barcodes (i.e. barcode demultiplexing) |
269 | 270 | |
270 | 271 | -b BARCODE_DIR, --barcode_dir BARCODE_DIR |
271 | Reads will be binned based on their barcode and saved to separate | |
272 | files in this directory (incompatible with --output) | |
272 | Reads will be binned based on their barcode and saved to separate | |
273 | files in this directory (incompatible with --output) | |
273 | 274 | --barcode_threshold BARCODE_THRESHOLD |
274 | A read must have at least this percent identity to a barcode to be | |
275 | binned (default: 75.0) | |
276 | --barcode_diff BARCODE_DIFF If the difference between a read's best barcode identity and its | |
277 | second-best barcode identity is less than this value, it will not be | |
278 | put in a barcode bin (to exclude cases which are too close to call) | |
279 | (default: 5.0) | |
280 | --require_two_barcodes Reads will only be put in barcode bins if they have a strong match for | |
281 | the barcode on both their start and end (default: a read can be binned | |
282 | with a match at its start or end) | |
283 | --untrimmed Bin reads but do not trim the ends (appropriate if reads are to be | |
284 | used with Nanopolish) (default: False) | |
275 | A read must have at least this percent identity to a barcode to be | |
276 | binned (default: 75.0) | |
277 | --barcode_diff BARCODE_DIFF If the difference between a read's best barcode identity and its | |
278 | second-best barcode identity is less than this value, it will not | |
279 | be put in a barcode bin (to exclude cases which are too close to | |
280 | call) (default: 5.0) | |
281 | --require_two_barcodes Reads will only be put in barcode bins if they have a strong match | |
282 | for the barcode on both their start and end (default: a read can | |
283 | be binned with a match at its start or end) | |
284 | --untrimmed Bin reads but do not trim them (default: trim the reads) | |
285 | --discard_unassigned Discard unassigned reads (instead of creating a "none" bin) | |
286 | (default: False) | |
285 | 287 | |
286 | 288 | Adapter search settings: |
287 | 289 | Control how the program determines which adapter sets are present |
288 | 290 | |
289 | 291 | --adapter_threshold ADAPTER_THRESHOLD |
290 | An adapter set has to have at least this percent identity to be | |
291 | labelled as present and trimmed off (0 to 100) (default: 90.0) | |
292 | --check_reads CHECK_READS This many reads will be aligned to all possible adapters to determine | |
293 | which adapter sets are present (default: 10000) | |
294 | --scoring_scheme SCORING_SCHEME Comma-delimited string of alignment scores: match, mismatch, gap open, | |
295 | gap extend (default: 3,-6,-5,-2) | |
292 | An adapter set has to have at least this percent identity to be | |
293 | labelled as present and trimmed off (0 to 100) (default: 90.0) | |
294 | --check_reads CHECK_READS This many reads will be aligned to all possible adapters to | |
295 | determine which adapter sets are present (default: 10000) | |
296 | --scoring_scheme SCORING_SCHEME | |
297 | Comma-delimited string of alignment scores: match, mismatch, gap | |
298 | open, gap extend (default: 3,-6,-5,-2) | |
296 | 299 | |
297 | 300 | End adapter settings: |
298 | 301 | Control the trimming of adapters from read ends |
299 | 302 | |
300 | --end_size END_SIZE The number of base pairs at each end of the read which will be | |
301 | searched for adapter sequences (default: 150) | |
302 | --min_trim_size MIN_TRIM_SIZE Adapter alignments smaller than this will be ignored (default: 4) | |
303 | --extra_end_trim EXTRA_END_TRIM This many additional bases will be removed next to adapters found at | |
304 | the ends of reads (default: 2) | |
305 | --end_threshold END_THRESHOLD Adapters at the ends of reads must have at least this percent identity | |
306 | to be removed (0 to 100) (default: 75.0) | |
303 | --end_size END_SIZE The number of base pairs at each end of the read which will be | |
304 | searched for adapter sequences (default: 150) | |
305 | --min_trim_size MIN_TRIM_SIZE Adapter alignments smaller than this will be ignored (default: 4) | |
306 | --extra_end_trim EXTRA_END_TRIM | |
307 | This many additional bases will be removed next to adapters found | |
308 | at the ends of reads (default: 2) | |
309 | --end_threshold END_THRESHOLD Adapters at the ends of reads must have at least this percent | |
310 | identity to be removed (0 to 100) (default: 75.0) | |
307 | 311 | |
308 | 312 | Middle adapter settings: |
309 | 313 | Control the splitting of read from middle adapters |
310 | 314 | |
311 | --no_split Skip splitting reads based on middle adapters (default: split reads | |
312 | when an adapter is found in the middle) | |
313 | --discard_middle Reads with middle adapters will be discarded (default: reads with | |
314 | middle adapters are split) (this option is on by default when | |
315 | outputting reads into barcode bins) | |
315 | --no_split Skip splitting reads based on middle adapters (default: split | |
316 | reads when an adapter is found in the middle) | |
317 | --discard_middle Reads with middle adapters will be discarded (default: reads with | |
318 | middle adapters are split) (required for reads to be used with | |
319 | Nanopolish, this option is on by default when outputting reads | |
320 | into barcode bins) | |
316 | 321 | --middle_threshold MIDDLE_THRESHOLD |
317 | Adapters in the middle of reads must have at least this percent | |
318 | identity to be found (0 to 100) (default: 85.0) | |
322 | Adapters in the middle of reads must have at least this percent | |
323 | identity to be found (0 to 100) (default: 85.0) | |
319 | 324 | --extra_middle_trim_good_side EXTRA_MIDDLE_TRIM_GOOD_SIDE |
320 | This many additional bases will be removed next to middle adapters on | |
321 | their "good" side (default: 10) | |
325 | This many additional bases will be removed next to middle adapters | |
326 | on their "good" side (default: 10) | |
322 | 327 | --extra_middle_trim_bad_side EXTRA_MIDDLE_TRIM_BAD_SIDE |
323 | This many additional bases will be removed next to middle adapters on | |
324 | their "bad" side (default: 100) | |
328 | This many additional bases will be removed next to middle adapters | |
329 | on their "bad" side (default: 100) | |
325 | 330 | --min_split_read_size MIN_SPLIT_READ_SIZE |
326 | Post-split read pieces smaller than this many base pairs will not be | |
327 | outputted (default: 1000) | |
328 | ``` | |
331 | Post-split read pieces smaller than this many base pairs will not | |
332 | be outputted (default: 1000) | |
333 | ||
334 | Help: | |
335 | -h, --help Show this help message and exit | |
336 | --version Show program's version number and exit | |
337 | ``` | |
338 | ||
339 | ||
340 | ||
341 | # Known issues | |
342 | ||
343 | ### Adapter search | |
344 | ||
345 | Porechop tries to automatically determine which adapters are present by looking at the reads, but this approach has a few issues: | |
346 | ||
347 | * As the number of kits/barcodes has grown, adapter-search part of the Porechop's pipeline has become increasingly slow. | |
348 | * Porechop only does the adapter search on a subset of reads, which means there can be problems with non-randomly ordered read sets (e.g. all barcode 1 reads at the start of a file, followed by barcode 2 reads, etc). | |
349 | * Many ONT adapters share common sequence with each other, making false positive adapter finds possible. | |
350 | ||
351 | A simpler solution (and in hindsight what I should have done) would be to require the kit and/or adapters from the user. E.g. `porechop --sqk-lsk109` or `porechop --start_adapt ACGCTAGCATACGT`. | |
352 | ||
353 | ||
354 | ### Performance | |
355 | ||
356 | Porechop uses [SeqAn](https://github.com/seqan/seqan) to perform its alignments in C++. This library is very flexible, but not as fast as some alternatives, such as [Edlib](https://github.com/Martinsos/edlib). | |
357 | ||
358 | Another performance issue is that Porechop uses [ctypes](https://docs.python.org/3/library/ctypes.html) to interface with its C++ code. Function calls with ctypes can have a bit of overhead, which means that Porechop cannot use threads very efficiently (it spends too much of its time in the Python code, which is intrinsically non-parallel). | |
359 | ||
360 | ||
361 | ### Barcode demultiplexing | |
362 | ||
363 | I added demultiplexing to Porechop as an afterthought – it was already looking for barcodes to trim, so why not sort reads by barcodes too? This turned out to be a very useful feature, but in hindsight I think it might be simpler (and easier to maintain) if trimming and demultiplexing functionality were in separate tools. | |
364 | ||
365 | ||
366 | ### Base-space problems | |
367 | ||
368 | I've encountered a couple of issues where adapter sequences are not properly basecalled, resulting in inconsistent sequence. Porechop trims in base-space, so this is a somewhat intractable problem. See [issue #40](https://github.com/rrwick/Porechop/issues/40) for an example. | |
369 | ||
370 | These have made me wonder if the adapter trimming should be done in signal-space instead, though that would be a more complex problem to solve. I hope that in the future ONT can integrate this kind of functionality directly into their basecallers. | |
371 | ||
329 | 372 | |
330 | 373 | |
331 | 374 |
42 | 42 | Gets the barcode name for the output files. We want a concise name, so it looks at all |
43 | 43 | options and chooses the shortest. |
44 | 44 | """ |
45 | possible_names = [self.name, self.start_sequence[0]] | |
45 | possible_names = [self.name] | |
46 | if self.start_sequence: | |
47 | possible_names.append(self.start_sequence[0]) | |
46 | 48 | if self.end_sequence: |
47 | 49 | possible_names.append(self.end_sequence[0]) |
48 | 50 | barcode_name = sorted(possible_names, key=lambda x: len(x))[0] |
75 | 77 | start_sequence=('SQK-NSK007_Y_Top', 'AATGTACTTCGTTCAGTTACGTATTGCT'), |
76 | 78 | end_sequence=('SQK-NSK007_Y_Bottom', 'GCAATACGTAACTGAACGAAGT')), |
77 | 79 | |
80 | ||
78 | 81 | Adapter('Rapid', |
79 | start_sequence=('Rapid_adapter', 'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA')), | |
82 | start_sequence=('Rapid_adapter', | |
83 | 'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA')), | |
84 | ||
85 | Adapter('RBK004_upstream', | |
86 | start_sequence=('RBK004_upstream', 'AATGTACTTCGTTCAGTTACGGCTTGGGTGTTTAACC')), | |
87 | ||
80 | 88 | |
81 | 89 | Adapter('SQK-MAP006', |
82 | start_sequence=('SQK-MAP006_Y_Top_SK63', 'GGTTGTTTCTGTTGGTGCTGATATTGCT'), | |
83 | end_sequence= ('SQK-MAP006_Y_Bottom_SK64', 'GCAATATCAGCACCAACAGAAA')), | |
84 | ||
85 | Adapter('SQK-MAP006 Short', | |
90 | start_sequence=('SQK-MAP006_Y_Top_SK63', 'GGTTGTTTCTGTTGGTGCTGATATTGCT'), | |
91 | end_sequence= ('SQK-MAP006_Y_Bottom_SK64', 'GCAATATCAGCACCAACAGAAA')), | |
92 | ||
93 | Adapter('SQK-MAP006 short', | |
86 | 94 | start_sequence=('SQK-MAP006_Short_Y_Top_LI32', 'CGGCGTCTGCTTGGGTGTTTAACCT'), |
87 | 95 | end_sequence= ('SQK-MAP006_Short_Y_Bottom_LI33', 'GGTTAAACACCCAAGCAGACGCCG')), |
88 | 96 | |
97 | ||
98 | # The PCR adapters are used both in PCR DNA kits and some cDNA kits. | |
89 | 99 | Adapter('PCR adapters 1', |
90 | 100 | start_sequence=('PCR_1_start', 'ACTTGCCTGTCGCTCTATCTTC'), |
91 | end_sequence= ('PCR_1_end', 'GAAGATAGAGCGACAGGCAAGT')), | |
92 | ||
93 | Adapter('PCR tail 1', | |
94 | start_sequence=('PCR_tail_1_start', 'TTAACCTTTCTGTTGGTGCTGATATTGC'), | |
95 | end_sequence= ('PCR_tail_1_end', 'GCAATATCAGCACCAACAGAAAGGTTAA')), | |
96 | ||
97 | Adapter('PCR tail 2', | |
98 | start_sequence=('PCR_tail_2_start', 'TTAACCTACTTGCCTGTCGCTCTATCTTC'), | |
99 | end_sequence= ('PCR_tail_2_end', 'GAAGATAGAGCGACAGGCAAGTAGGTTAA')), | |
101 | end_sequence= ('PCR_1_end', 'GAAGATAGAGCGACAGGCAAGT')), | |
102 | ||
103 | Adapter('PCR adapters 2', | |
104 | start_sequence=('PCR_2_start', 'TTTCTGTTGGTGCTGATATTGC'), | |
105 | end_sequence= ('PCR_2_end', 'GCAATATCAGCACCAACAGAAA')), | |
106 | ||
107 | Adapter('PCR adapters 3', | |
108 | start_sequence=('PCR_3_start', 'TACTTGCCTGTCGCTCTATCTTC'), | |
109 | end_sequence= ('PCR_3_end', 'GAAGATAGAGCGACAGGCAAGTA')), | |
100 | 110 | |
101 | 111 | |
102 | 112 | # 1D^2 kit adapters are interesting. ONT provided the following sequences on their site: |
114 | 124 | end_sequence= ('1D2_part_2_end', 'CACCCAAGCAGACGCCAGCAATACGTAACT')), |
115 | 125 | # The middle part of the provided sequences is less common, so I've left it out of the |
116 | 126 | # adapter sequences here. |
127 | ||
128 | ||
129 | Adapter('cDNA SSP', | |
130 | start_sequence=('cDNA_SSP', 'TTTCTGTTGGTGCTGATATTGCTGCCATTACGGCCGGG'), | |
131 | end_sequence= ('cDNA_SSP_rev', 'CCCGGCCGTAATGGCAGCAATATCAGCACCAACAGAAA')), | |
117 | 132 | |
118 | 133 | |
119 | 134 | # Some barcoding kits (like the native barcodes) use the rev comp barcode at the start |
460 | 475 | end_sequence=('NB' + '%02d' % barcode_num + '_end', end_full_seq)) |
461 | 476 | |
462 | 477 | |
463 | def make_full_rapid_barcode_adapter(barcode_num): | |
478 | def make_old_full_rapid_barcode_adapter(barcode_num): # applies to SQK-RBK001 | |
464 | 479 | barcode = [x for x in ADAPTERS if x.name == 'Barcode ' + str(barcode_num) + ' (forward)'][0] |
465 | 480 | start_barcode_seq = barcode.start_sequence[1] |
466 | 481 | |
467 | start_full_seq = 'AATGTACTTCGTTCAGTTACGTATTGCT' + start_barcode_seq + 'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA' | |
468 | ||
469 | return Adapter('Rapid barcoding ' + str(barcode_num) + ' (full sequence)', | |
482 | start_full_seq = 'AATGTACTTCGTTCAGTTACG' + 'TATTGCT' + start_barcode_seq + \ | |
483 | 'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA' | |
484 | ||
485 | return Adapter('Rapid barcoding ' + str(barcode_num) + ' (full sequence, old)', | |
470 | 486 | start_sequence=('RB' + '%02d' % barcode_num + '_full', start_full_seq)) |
487 | ||
488 | ||
489 | def make_new_full_rapid_barcode_adapter(barcode_num): # applies to SQK-RBK004 | |
490 | barcode = [x for x in ADAPTERS if x.name == 'Barcode ' + str(barcode_num) + ' (forward)'][0] | |
491 | start_barcode_seq = barcode.start_sequence[1] | |
492 | ||
493 | start_full_seq = 'AATGTACTTCGTTCAGTTACG' + 'GCTTGGGTGTTTAACC' + start_barcode_seq + \ | |
494 | 'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA' | |
495 | ||
496 | return Adapter('Rapid barcoding ' + str(barcode_num) + ' (full sequence, new)', | |
497 | start_sequence=('RB' + '%02d' % barcode_num + '_full', start_full_seq)) |
22 | 22 | def __init__(self, name, seq, quals): |
23 | 23 | self.name = name |
24 | 24 | |
25 | self.seq = seq | |
25 | self.seq = seq.upper() | |
26 | if self.seq.count('U') > self.seq.count('T'): | |
27 | self.rna = True | |
28 | self.seq = self.seq.replace('U', 'T') | |
29 | else: | |
30 | self.rna = False | |
31 | ||
26 | 32 | self.quals = quals |
27 | 33 | if len(quals) < len(seq): |
28 | 34 | self.quals += '+' * (len(seq) - len(quals)) |
95 | 101 | seq = self.get_seq_with_start_end_adapters_trimmed() |
96 | 102 | if not seq: # Don't return empty sequences |
97 | 103 | return '' |
104 | if self.rna: | |
105 | seq = seq.replace('T', 'U') | |
98 | 106 | return ''.join(['>', self.name, '\n', add_line_breaks_to_sequence(seq, 70)]) |
99 | 107 | elif discard_middle: |
100 | 108 | return '' |
105 | 113 | if not split_read_part[0]: # Don't return empty sequences |
106 | 114 | return '' |
107 | 115 | seq = add_line_breaks_to_sequence(split_read_part[0], 70) |
116 | if self.rna: | |
117 | seq = seq.replace('T', 'U') | |
108 | 118 | fasta_str += ''.join(['>', read_name, '\n', seq]) |
109 | 119 | return fasta_str |
110 | 120 | |
118 | 128 | quals = self.get_quals_with_start_end_adapters_trimmed() |
119 | 129 | if not seq: # Don't return empty sequences |
120 | 130 | return '' |
131 | if self.rna: | |
132 | seq = seq.replace('T', 'U') | |
121 | 133 | return ''.join(['@', self.name, '\n', seq, '\n+\n', quals, '\n']) |
122 | 134 | elif discard_middle: |
123 | 135 | return '' |
125 | 137 | fastq_str = '' |
126 | 138 | for i, split_read_part in enumerate(self.get_split_read_parts(min_split_read_size)): |
127 | 139 | read_name = add_number_to_read_name(self.name, i + 1) |
128 | if not split_read_part[0]: # Don't return empty sequences | |
140 | seq, qual = split_read_part[0], split_read_part[1] | |
141 | if not seq: # Don't return empty sequences | |
129 | 142 | return '' |
130 | fastq_str += ''.join(['@', read_name, '\n', split_read_part[0], '\n+\n', | |
131 | split_read_part[1], '\n']) | |
143 | if self.rna: | |
144 | seq = seq.replace('T', 'U') | |
145 | fastq_str += ''.join(['@', read_name, '\n', seq, '\n+\n', qual, '\n']) | |
132 | 146 | return fastq_str |
133 | 147 | |
134 | 148 | def align_adapter_set(self, adapter_set, end_size, scoring_scheme_vals): |
137 | 151 | This is not to determine where to trim the reads, but rather to figure out which adapter |
138 | 152 | sets are present in the data. |
139 | 153 | """ |
140 | read_seq_start = self.seq[:end_size] | |
141 | score, _, _, _ = align_adapter(read_seq_start, adapter_set.start_sequence[1], | |
142 | scoring_scheme_vals) | |
143 | adapter_set.best_start_score = max(adapter_set.best_start_score, score) | |
154 | if adapter_set.start_sequence: | |
155 | read_seq_start = self.seq[:end_size] | |
156 | score, _, _, _ = align_adapter(read_seq_start, adapter_set.start_sequence[1], | |
157 | scoring_scheme_vals) | |
158 | adapter_set.best_start_score = max(adapter_set.best_start_score, score) | |
144 | 159 | if adapter_set.end_sequence: |
145 | 160 | read_seq_end = self.seq[-end_size:] |
146 | 161 | score, _, _, _ = align_adapter(read_seq_end, adapter_set.end_sequence[1], |
155 | 170 | """ |
156 | 171 | read_seq_start = self.seq[:end_size] |
157 | 172 | for adapter in adapters: |
173 | if not adapter.start_sequence: | |
174 | continue | |
158 | 175 | full_score, partial_score, read_start, read_end = \ |
159 | 176 | align_adapter(read_seq_start, adapter.start_sequence[1], scoring_scheme_vals) |
160 | 177 | if partial_score > end_threshold and read_end != end_size and \ |
23 | 23 | from multiprocessing.dummy import Pool as ThreadPool |
24 | 24 | from collections import defaultdict |
25 | 25 | from .misc import load_fasta_or_fastq, print_table, red, bold_underline, MyHelpFormatter, int_to_str |
26 | from .adapters import ADAPTERS, make_full_native_barcode_adapter, make_full_rapid_barcode_adapter | |
26 | from .adapters import ADAPTERS, make_full_native_barcode_adapter,\ | |
27 | make_old_full_rapid_barcode_adapter, make_new_full_rapid_barcode_adapter | |
27 | 28 | from .nanopore_read import NanoporeRead |
28 | 29 | from .version import __version__ |
29 | 30 | |
36 | 37 | matching_sets = find_matching_adapter_sets(check_reads, args.verbosity, args.end_size, |
37 | 38 | args.scoring_scheme_vals, args.print_dest, |
38 | 39 | args.adapter_threshold, args.threads) |
39 | matching_sets = exclude_end_adapters_for_rapid(matching_sets) | |
40 | 40 | matching_sets = fix_up_1d2_sets(matching_sets) |
41 | display_adapter_set_results(matching_sets, args.verbosity, args.print_dest) | |
42 | matching_sets = add_full_barcode_adapter_sets(matching_sets) | |
43 | 41 | |
44 | 42 | if args.barcode_dir: |
45 | 43 | forward_or_reverse_barcodes = choose_barcoding_kit(matching_sets, args.verbosity, |
46 | 44 | args.print_dest) |
47 | 45 | else: |
48 | 46 | forward_or_reverse_barcodes = None |
47 | ||
48 | display_adapter_set_results(matching_sets, args.verbosity, args.print_dest) | |
49 | matching_sets = add_full_barcode_adapter_sets(matching_sets) | |
50 | ||
49 | 51 | if args.verbosity > 0: |
50 | 52 | print('\n', file=args.print_dest) |
51 | 53 | |
85 | 87 | parser = argparse.ArgumentParser(description='Porechop: a tool for finding adapters in Oxford ' |
86 | 88 | 'Nanopore reads, trimming them from the ends and ' |
87 | 89 | 'splitting reads with internal adapters', |
88 | formatter_class=MyHelpFormatter) | |
90 | formatter_class=MyHelpFormatter, add_help=False) | |
89 | 91 | main_group = parser.add_argument_group('Main options') |
90 | 92 | main_group.add_argument('-i', '--input', required=True, |
91 | 93 | help='FASTA/FASTQ of input reads or a directory which will be ' |
104 | 106 | 'a file and stderr if reads are printed to stdout') |
105 | 107 | main_group.add_argument('-t', '--threads', type=int, default=default_threads, |
106 | 108 | help='Number of threads to use for adapter alignment') |
107 | main_group.add_argument('--version', action='version', version=__version__) | |
108 | 109 | |
109 | 110 | barcode_group = parser.add_argument_group('Barcode binning settings', |
110 | 111 | 'Control the binning of reads based on barcodes ' |
126 | 127 | 'match for the barcode on both their start and end (default: ' |
127 | 128 | 'a read can be binned with a match at its start or end)') |
128 | 129 | barcode_group.add_argument('--untrimmed', action='store_true', |
129 | help='Bin reads but do not trim them (appropriate if reads are to ' | |
130 | 'be used with Nanopolish) (default: trim the reads)') | |
130 | help='Bin reads but do not trim them (default: trim the reads)') | |
131 | 131 | barcode_group.add_argument('--discard_unassigned', action='store_true', |
132 | 132 | help='Discard unassigned reads (instead of creating a "none" bin)') |
133 | 133 | |
168 | 168 | 'middle)') |
169 | 169 | middle_trim_group.add_argument('--discard_middle', action='store_true', |
170 | 170 | help='Reads with middle adapters will be discarded (default: ' |
171 | 'reads with middle adapters are split) (this option is ' | |
172 | 'on by default when outputting reads into barcode bins)') | |
173 | middle_trim_group.add_argument('--middle_threshold', type=float, default=85.0, | |
171 | 'reads with middle adapters are split) (required for ' | |
172 | 'reads to be used with Nanopolish, this option is on by ' | |
173 | 'default when outputting reads into barcode bins)') | |
174 | middle_trim_group.add_argument('--middle_threshold', type=float, default=90.0, | |
174 | 175 | help='Adapters in the middle of reads must have at least this ' |
175 | 176 | 'percent identity to be found (0 to 100)') |
176 | 177 | middle_trim_group.add_argument('--extra_middle_trim_good_side', type=int, default=10, |
182 | 183 | middle_trim_group.add_argument('--min_split_read_size', type=int, default=1000, |
183 | 184 | help='Post-split read pieces smaller than this many base pairs ' |
184 | 185 | 'will not be outputted') |
186 | ||
187 | help_args = parser.add_argument_group('Help') | |
188 | help_args.add_argument('-h', '--help', action='help', default=argparse.SUPPRESS, | |
189 | help='Show this help message and exit') | |
190 | help_args.add_argument('--version', action='version', version=__version__, | |
191 | help="Show program's version number and exit") | |
185 | 192 | |
186 | 193 | args = parser.parse_args() |
187 | 194 | |
324 | 331 | If the user is sorting reads by barcode bin, choose one barcode configuration (rev comp |
325 | 332 | barcodes at the start of the read or at the end of the read) and ignore the other. |
326 | 333 | """ |
327 | forward_barcodes = 0 | |
328 | reverse_barcodes = 0 | |
334 | # Tally up scores for forward and reverse barcodes. | |
335 | forward_start_or_end, reverse_start_or_end = 0, 0 | |
336 | forward_start_and_end, reverse_start_and_end = 0, 0 | |
329 | 337 | for adapter_set in adapter_sets: |
330 | score = adapter_set.best_start_or_end_score() | |
331 | if 'Barcode' in adapter_set.name and '(forward)' in adapter_set.name: | |
332 | forward_barcodes += score | |
333 | elif 'Barcode' in adapter_set.name and '(reverse)' in adapter_set.name: | |
334 | reverse_barcodes += score | |
335 | if forward_barcodes > reverse_barcodes: | |
336 | if verbosity > 0: | |
337 | print('\nBarcodes determined to be in forward orientation', file=print_dest) | |
338 | return 'forward' | |
339 | elif reverse_barcodes > forward_barcodes: | |
340 | if verbosity > 0: | |
341 | print('\nBarcodes determined to be in reverse orientation', file=print_dest) | |
342 | return 'reverse' | |
343 | else: | |
344 | return None | |
345 | ||
346 | ||
347 | def exclude_end_adapters_for_rapid(matching_sets): | |
348 | """ | |
349 | Rapid reads shouldn't have end adapters, so we don't want to look for them if this seems to be | |
350 | a rapid read set. | |
351 | """ | |
352 | if 'Rapid adapter' in [x.name for x in matching_sets]: | |
353 | for s in matching_sets: | |
354 | s.end_sequence = None | |
355 | return matching_sets | |
338 | if 'barcode' in adapter_set.name.lower(): | |
339 | if '(forward)' in adapter_set.name.lower(): | |
340 | forward_start_or_end += adapter_set.best_start_or_end_score() | |
341 | forward_start_and_end += adapter_set.best_start_score | |
342 | forward_start_and_end += adapter_set.best_end_score | |
343 | elif '(reverse)' in adapter_set.name.lower(): | |
344 | reverse_start_or_end += adapter_set.best_start_or_end_score() | |
345 | reverse_start_and_end += adapter_set.best_start_score | |
346 | reverse_start_and_end += adapter_set.best_end_score | |
347 | ||
348 | if forward_start_or_end == 0 and reverse_start_or_end == 0: | |
349 | sys.exit('Error: no barcodes were found, so Porechop cannot perform barcode demultiplexing') | |
350 | ||
351 | # If possible, make a decision using each barcode's best start OR end score. | |
352 | orientation = None | |
353 | if forward_start_or_end > reverse_start_or_end: | |
354 | orientation = 'forward' | |
355 | elif reverse_start_or_end > forward_start_or_end: | |
356 | orientation = 'reverse' | |
357 | ||
358 | # If that didn't work (i.e. it's a tie between forward and reverse), then choose based on the | |
359 | # sum of both start AND end scores. | |
360 | elif forward_start_and_end > reverse_start_and_end: | |
361 | orientation = 'forward' | |
362 | elif reverse_start_and_end > forward_start_and_end: | |
363 | orientation = 'reverse' | |
364 | ||
365 | if orientation is None: | |
366 | sys.exit('Error: Porechop could not determine barcode orientation') | |
367 | ||
368 | if verbosity > 0: | |
369 | print('\nBarcodes determined to be in ' + orientation + ' orientation', file=print_dest) | |
370 | return orientation | |
356 | 371 | |
357 | 372 | |
358 | 373 | def fix_up_1d2_sets(matching_sets): |
410 | 425 | |
411 | 426 | # Rapid barcode full sequences |
412 | 427 | if all(x in matching_set_names |
413 | for x in ['SQK-NSK007', 'Rapid', 'Barcode ' + str(i) + ' (forward)']): | |
414 | matching_sets.append(make_full_rapid_barcode_adapter(i)) | |
428 | for x in ['Rapid', 'Barcode ' + str(i) + ' (forward)']): | |
429 | if 'RBK004_upstream' in matching_set_names: | |
430 | matching_sets.append(make_new_full_rapid_barcode_adapter(i)) | |
431 | elif 'SQK-NSK007' in matching_set_names: | |
432 | matching_sets.append(make_old_full_rapid_barcode_adapter(i)) | |
415 | 433 | |
416 | 434 | return matching_sets |
417 | 435 | |
423 | 441 | if verbosity > 0: |
424 | 442 | print(bold_underline('Trimming adapters from read ends'), |
425 | 443 | file=print_dest) |
426 | name_len = max(max(len(x.start_sequence[0]) for x in matching_sets), | |
427 | max(len(x.end_sequence[0]) if x.end_sequence else 0 for x in matching_sets)) | |
444 | name_len = max(max(len(x.start_sequence[0]) | |
445 | if x.start_sequence else 0 for x in matching_sets), | |
446 | max(len(x.end_sequence[0]) | |
447 | if x.end_sequence else 0 for x in matching_sets)) | |
428 | 448 | for matching_set in matching_sets: |
429 | print(' ' + matching_set.start_sequence[0].rjust(name_len) + ': ' + | |
430 | red(matching_set.start_sequence[1]), file=print_dest) | |
449 | if matching_set.start_sequence: | |
450 | print(' ' + matching_set.start_sequence[0].rjust(name_len) + ': ' + | |
451 | red(matching_set.start_sequence[1]), file=print_dest) | |
431 | 452 | if matching_set.end_sequence: |
432 | 453 | print(' ' + matching_set.end_sequence[0].rjust(name_len) + ': ' + |
433 | 454 | red(matching_set.end_sequence[1]), file=print_dest) |
518 | 539 | |
519 | 540 | adapters = [] |
520 | 541 | for matching_set in matching_sets: |
521 | adapters.append(matching_set.start_sequence) | |
522 | if matching_set.end_sequence and \ | |
523 | matching_set.end_sequence[1] != matching_set.start_sequence[1]: | |
524 | adapters.append(matching_set.end_sequence) | |
542 | if matching_set.start_sequence: | |
543 | adapters.append(matching_set.start_sequence) | |
544 | if matching_set.end_sequence: | |
545 | if (not matching_set.start_sequence) or \ | |
546 | matching_set.end_sequence[1] != matching_set.start_sequence[1]: | |
547 | adapters.append(matching_set.end_sequence) | |
525 | 548 | |
526 | 549 | start_sequence_names = set() |
527 | 550 | end_sequence_names = set() |
528 | 551 | for matching_set in matching_sets: |
529 | start_sequence_names.add(matching_set.start_sequence[0]) | |
552 | if matching_set.start_sequence: | |
553 | start_sequence_names.add(matching_set.start_sequence[0]) | |
530 | 554 | if matching_set.end_sequence: |
531 | 555 | end_sequence_names.add(matching_set.end_sequence[0]) |
532 | 556 |