Import upstream version 1.2.0+git20210608.1.1bd0fe9
Debian Janitor
2 years ago
0 | # Compiled python modules | |
1 | *.pyc | |
2 | ||
3 | # Setuptools distribution folder | |
4 | /dist/ | |
5 | ||
6 | # Python egg metadata, regenerated from source files by setuptools | |
7 | /*.egg-info | |
8 | /*.egg | |
9 | ||
10 | # Wheel data | |
11 | build | |
12 | conda |
0 | language: python | |
1 | python: | |
2 | - "3.4" | |
3 | env: | |
4 | - PYTHONHASHSEED=0 | |
5 | install: | |
6 | - pip install . | |
7 | - pip install nose | |
8 | - pip install Sphinx | |
9 | script: "make tests" |
0 | == 1.1.0 == | |
1 | ||
2 | - fix: custom tags are not necessarily lower case | |
3 | - additional support for rGFA subset of GFA1 by setting option dialect="rgfa" | |
4 | ||
5 | == 1.0.0 == | |
6 | ||
7 | - initial release |
0 | The following contributors helped to develop gfapy. Please drop a note to | |
1 | gonnella@zbh.uni-hamburg.de if I left someone out or missed something. | |
2 | ||
3 | - Tim Weber (translation of parts of the code from Ruby to Python) | |
4 | - Stefan Kurtz (advises) | |
5 |
0 | All code of gfapy is released under the following ISC license. | |
1 | It is functionally equivalent to a two-term BSD copyright with | |
2 | language removed that is made unnecessary by the Berne convention. | |
3 | See http://openbsd.org/policy.html for more information on copyrights. | |
4 | ||
5 | Copyright (c) 2017 Giorgio Gonnella and CONTRIBUTORS | |
6 | Copyright (c) 2017 Center for Bioinformatics, University of Hamburg | |
7 | ||
8 | Permission to use, copy, modify, and distribute this software for any | |
9 | purpose with or without fee is hereby granted, provided that the above | |
10 | copyright notice and this permission notice appear in all copies. | |
11 | ||
12 | THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES | |
13 | WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF | |
14 | MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR | |
15 | ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES | |
16 | WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN | |
17 | ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF | |
18 | OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. |
0 | default: tests | |
1 | ||
2 | .PHONY: manual tests cleanup upload conda sdist wheel install | |
3 | ||
4 | PYTHON=python3 | |
5 | PIP=pip3 | |
6 | ||
7 | # Install using pip | |
8 | install: | |
9 | ${PIP} install --upgrade --user --editable . | |
10 | ||
11 | # Source distribution | |
12 | sdist: | |
13 | ${PYTHON} setup.py sdist | |
14 | ||
15 | # Pure Python Wheel | |
16 | wheel: | |
17 | ${PYTHON} setup.py bdist_wheel | |
18 | ||
19 | # Create the manual | |
20 | manual: | |
21 | cd doc && make latexpdf | |
22 | mkdir -p manual | |
23 | cp doc/_build/latex/Gfapy.pdf manual/gfapy-manual.pdf | |
24 | ||
25 | doctest: | |
26 | cd doc && make doctest | |
27 | ||
28 | unittests: | |
29 | @echo | |
30 | @echo "Running unit test suite..." | |
31 | @PYTHONHASHSEED=0 ${PYTHON} -m unittest discover | |
32 | ||
33 | tests: doctest unittests | |
34 | ||
35 | # Remove distribution files | |
36 | cleanup: | |
37 | rm -rf dist/ build/ gfapy.egg-info/ | |
38 | ||
39 | upload: tests cleanup sdist wheel | |
40 | cd dist; \ | |
41 | for file in *; do \ | |
42 | twine register $$file; \ | |
43 | twine upload $$file; \ | |
44 | done | |
45 | ||
46 | conda: | |
47 | mkdir -p conda | |
48 | cd conda; \ | |
49 | conda skeleton pypi gfapy; \ | |
50 | conda build gfapy |
0 | Metadata-Version: 1.1 | |
1 | Name: gfapy | |
2 | Version: 1.2.0 | |
3 | Summary: Library for handling data in the GFA1 and GFA2 formats | |
4 | Home-page: https://github.com/ggonnella/gfapy | |
5 | Author: Giorgio Gonnella and others (see CONTRIBUTORS) | |
6 | Author-email: gonnella@zbh.uni-hamburg.de | |
7 | License: ISC | |
8 | Description: Gfapy | |
9 | ~~~~~ | |
10 | ||
11 | |travis| |readthedocs| |latesttag| |license| |requiresio| | |
12 | ||
13 | |bioconda| |pypi| |debian| |ubuntu| | |
14 | ||
15 | .. sphinx-begin | |
16 | ||
17 | The Graphical Fragment Assembly (GFA) are formats for the representation | |
18 | of sequence graphs, including assembly, variation and splicing graphs. | |
19 | Two versions of GFA have been defined (GFA1 and GFA2) and several sequence | |
20 | analysis programs have been adopting the formats as an interchange format, | |
21 | which allow to easily combine different sequence analysis tools. | |
22 | ||
23 | This library implements the GFA1 and GFA2 specification | |
24 | described at https://github.com/GFA-spec/GFA-spec/blob/master/GFA-spec.md. | |
25 | It allows to create a Gfa object from a file in the GFA format | |
26 | or from scratch, to enumerate the graph elements (segments, links, | |
27 | containments, paths and header lines), to traverse the graph (by | |
28 | traversing all links outgoing from or incoming to a segment), to search for | |
29 | elements (e.g. which links connect two segments) and to manipulate the | |
30 | graph (e.g. to eliminate a link or a segment or to duplicate a segment | |
31 | distributing the read counts evenly on the copies). | |
32 | ||
33 | The GFA format can be easily extended by users by defining own custom | |
34 | tags and record types. In Gfapy, it is easy to write extensions modules, | |
35 | which allow to define custom record types and datatypes for the parsing | |
36 | and validation of custom fields. The custom lines can be connected, using | |
37 | references, to each other and to lines of the standard record types. | |
38 | ||
39 | Requirements | |
40 | ~~~~~~~~~~~~ | |
41 | ||
42 | Gfapy has been written for Python 3 and tested using Python version 3.7. | |
43 | It does not require any additional Python packages or other software. | |
44 | ||
45 | Installation | |
46 | ~~~~~~~~~~~~ | |
47 | ||
48 | Gfapy is distributed as a Python package and can be installed using | |
49 | the Python package manager pip, as well as conda (in the Bioconda channel). | |
50 | It is also available as a package in some Linux distributions (Debian, Ubuntu). | |
51 | ||
52 | The following command installs the current stable version from the Python | |
53 | Packages index:: | |
54 | ||
55 | pip install gfapy | |
56 | ||
57 | If you would like to install the current development version from Github, | |
58 | use the following command:: | |
59 | ||
60 | pip install -e git+https://github.com/ggonnella/gfapy.git#egg=gfapy | |
61 | ||
62 | Alternatively it is possible to install gfapy using conda. Gfapy is | |
63 | included in the Bioconda (https://bioconda.github.io/) channel:: | |
64 | ||
65 | conda install -c bioconda gfapy | |
66 | ||
67 | Usage | |
68 | ~~~~~ | |
69 | ||
70 | If you installed gfapy as described above, you can import it in your script | |
71 | using the conventional Python syntax:: | |
72 | ||
73 | >>> import gfapy | |
74 | ||
75 | Documentation | |
76 | ~~~~~~~~~~~~~ | |
77 | ||
78 | The documentation, including this introduction to Gfapy, a user manual | |
79 | and the API documentation is hosted on the ReadTheDocs server, | |
80 | at the URL http://gfapy.readthedocs.io/en/latest/ and it can be | |
81 | downloaded as PDF from the URL | |
82 | https://github.com/ggonnella/gfapy/blob/master/manual/gfapy-manual.pdf. | |
83 | ||
84 | References | |
85 | ~~~~~~~~~~ | |
86 | ||
87 | Giorgio Gonnella and Stefan Kurtz "GfaPy: a flexible and extensible software | |
88 | library for handling sequence graphs in Python", Bioinformatics (2017) btx398 | |
89 | https://doi.org/10.1093/bioinformatics/btx398 | |
90 | ||
91 | .. sphinx-end | |
92 | ||
93 | .. |travis| | |
94 | image:: https://travis-ci.com/ggonnella/gfapy.svg?branch=master | |
95 | :target: https://travis-ci.com/ggonnella/gfapy | |
96 | :alt: Travis | |
97 | ||
98 | .. |latesttag| | |
99 | image:: https://img.shields.io/github/v/tag/ggonnella/gfapy | |
100 | :target: https://github.com/ggonnella/gfapy/tags | |
101 | :alt: Latest GitHub tag | |
102 | ||
103 | .. |readthedocs| | |
104 | image:: https://readthedocs.org/projects/pip/badge/?version=stable | |
105 | :target: https://pip.pypa.io/en/stable/?badge=stable | |
106 | :alt: ReadTheDocs | |
107 | ||
108 | .. |bioconda| | |
109 | image:: https://img.shields.io/conda/vn/bioconda/gfapy | |
110 | :target: https://bioconda.github.io/recipes/gfapy/README.html | |
111 | :alt: Bioconda | |
112 | ||
113 | .. |pypi| | |
114 | image:: https://img.shields.io/pypi/v/gfapy | |
115 | :target: https://pypi.org/project/gfapy/ | |
116 | :alt: PyPI | |
117 | ||
118 | .. |debian| | |
119 | image:: https://img.shields.io/debian/v/gfapy | |
120 | :target: https://packages.debian.org/search?keywords=gfapy | |
121 | :alt: Debian | |
122 | ||
123 | .. |ubuntu| | |
124 | image:: https://img.shields.io/ubuntu/v/gfapy | |
125 | :target: https://packages.ubuntu.com/search?keywords=gfapy | |
126 | :alt: Ubuntu | |
127 | ||
128 | .. |license| | |
129 | image:: https://img.shields.io/pypi/l/gfapy | |
130 | :target: https://github.com/ggonnella/gfapy/blob/master/LICENSE.txt | |
131 | :alt: ISC License | |
132 | ||
133 | .. |requiresio| | |
134 | image:: https://requires.io/github/ggonnella/gfapy/requirements.svg?branch=master | |
135 | :target: https://requires.io/github/ggonnella/gfapy/requirements/?branch=master | |
136 | :alt: Requirements Status | |
137 | ||
138 | Keywords: bioinformatics genomics sequences GFA assembly graphs | |
139 | Platform: UNKNOWN | |
140 | Classifier: Development Status :: 5 - Production/Stable | |
141 | Classifier: Environment :: Console | |
142 | Classifier: Intended Audience :: Developers | |
143 | Classifier: Intended Audience :: End Users/Desktop | |
144 | Classifier: Intended Audience :: Science/Research | |
145 | Classifier: License :: OSI Approved :: ISC License (ISCL) | |
146 | Classifier: Operating System :: MacOS :: MacOS X | |
147 | Classifier: Operating System :: POSIX :: Linux | |
148 | Classifier: Programming Language :: Python :: 3 :: Only | |
149 | Classifier: Topic :: Scientific/Engineering :: Bio-Informatics | |
150 | Classifier: Topic :: Software Development :: Libraries |
0 | Gfapy | |
1 | ~~~~~ | |
2 | ||
3 | |travis| |readthedocs| |latesttag| |license| |requiresio| | |
4 | ||
5 | |bioconda| |pypi| |debian| |ubuntu| | |
6 | ||
7 | .. sphinx-begin | |
8 | ||
0 | 9 | The Graphical Fragment Assembly (GFA) are formats for the representation |
1 | 10 | of sequence graphs, including assembly, variation and splicing graphs. |
2 | 11 | Two versions of GFA have been defined (GFA1 and GFA2) and several sequence |
22 | 31 | Requirements |
23 | 32 | ~~~~~~~~~~~~ |
24 | 33 | |
25 | Gfapy has been written for Python 3 and tested using Python version 3.3. | |
34 | Gfapy has been written for Python 3 and tested using Python version 3.7. | |
26 | 35 | It does not require any additional Python packages or other software. |
27 | 36 | |
28 | 37 | Installation |
30 | 39 | |
31 | 40 | Gfapy is distributed as a Python package and can be installed using |
32 | 41 | the Python package manager pip, as well as conda (in the Bioconda channel). |
42 | It is also available as a package in some Linux distributions (Debian, Ubuntu). | |
33 | 43 | |
34 | 44 | The following command installs the current stable version from the Python |
35 | 45 | Packages index:: |
69 | 79 | Giorgio Gonnella and Stefan Kurtz "GfaPy: a flexible and extensible software |
70 | 80 | library for handling sequence graphs in Python", Bioinformatics (2017) btx398 |
71 | 81 | https://doi.org/10.1093/bioinformatics/btx398 |
82 | ||
83 | .. sphinx-end | |
84 | ||
85 | .. |travis| | |
86 | image:: https://travis-ci.com/ggonnella/gfapy.svg?branch=master | |
87 | :target: https://travis-ci.com/ggonnella/gfapy | |
88 | :alt: Travis | |
89 | ||
90 | .. |latesttag| | |
91 | image:: https://img.shields.io/github/v/tag/ggonnella/gfapy | |
92 | :target: https://github.com/ggonnella/gfapy/tags | |
93 | :alt: Latest GitHub tag | |
94 | ||
95 | .. |readthedocs| | |
96 | image:: https://readthedocs.org/projects/pip/badge/?version=stable | |
97 | :target: https://pip.pypa.io/en/stable/?badge=stable | |
98 | :alt: ReadTheDocs | |
99 | ||
100 | .. |bioconda| | |
101 | image:: https://img.shields.io/conda/vn/bioconda/gfapy | |
102 | :target: https://bioconda.github.io/recipes/gfapy/README.html | |
103 | :alt: Bioconda | |
104 | ||
105 | .. |pypi| | |
106 | image:: https://img.shields.io/pypi/v/gfapy | |
107 | :target: https://pypi.org/project/gfapy/ | |
108 | :alt: PyPI | |
109 | ||
110 | .. |debian| | |
111 | image:: https://img.shields.io/debian/v/gfapy | |
112 | :target: https://packages.debian.org/search?keywords=gfapy | |
113 | :alt: Debian | |
114 | ||
115 | .. |ubuntu| | |
116 | image:: https://img.shields.io/ubuntu/v/gfapy | |
117 | :target: https://packages.ubuntu.com/search?keywords=gfapy | |
118 | :alt: Ubuntu | |
119 | ||
120 | .. |license| | |
121 | image:: https://img.shields.io/pypi/l/gfapy | |
122 | :target: https://github.com/ggonnella/gfapy/blob/master/LICENSE.txt | |
123 | :alt: ISC License | |
124 | ||
125 | .. |requiresio| | |
126 | image:: https://requires.io/github/ggonnella/gfapy/requirements.svg?branch=master | |
127 | :target: https://requires.io/github/ggonnella/gfapy/requirements/?branch=master | |
128 | :alt: Requirements Status |
0 | #!/bin/bash | |
1 | ||
2 | # | |
3 | # This script is derived from rdj-spacepeak.sh in | |
4 | # the GenomeTools repository (www.genometools.org). | |
5 | # | |
6 | # (c) 2010-2017 Giorgio Gonnella, ZBH, University of Hamburg | |
7 | # | |
8 | ||
9 | sleeptime=0.1 | |
10 | ||
11 | if [ $# -eq 0 ]; then | |
12 | echo "Usage: $0 <command> [args]" | |
13 | echo | |
14 | echo "The following information is polled each $sleeptime seconds" | |
15 | echo "from /proc/[pid]/status:" | |
16 | echo | |
17 | echo " VmPeak: Peak virtual memory size." | |
18 | echo " VmSize: Virtual memory size." | |
19 | echo " VmLck: Locked memory size." | |
20 | echo " VmHWM: Peak resident set size (\"high water mark\")." | |
21 | echo " VmRSS: Resident set size." | |
22 | echo " VmData, VmStk, VmExe: Size of data, stack, and text segments." | |
23 | echo " VmLib: Shared library code size." | |
24 | echo " VmPTE: Page table entries size (since Linux 2.6.10)." | |
25 | echo | |
26 | echo "The command is run under /usr/bin/time." | |
27 | exit | |
28 | fi | |
29 | ||
30 | # code inspired by: | |
31 | # http://stackoverflow.com/questions/1080461/ | |
32 | # /peak-memory-measurement-of-long-running-process-in-linux | |
33 | function __measure_space_peak { | |
34 | types="Peak Size Lck HWM RSS Data Stk Exe Lib PTE" | |
35 | declare -A maxVm | |
36 | for vm in $types; do maxVm[$vm]=0; done | |
37 | ppid=$$ | |
38 | /usr/bin/time $@ & | |
39 | tpid=`pgrep -P ${ppid} -n -f time` | |
40 | if [[ ${tpid} -ne "" ]]; then | |
41 | pid=`pgrep -P ${tpid} -n -f $1` # $! may work here but not later | |
42 | fi | |
43 | declare -A Vm | |
44 | while [[ ${tpid} -ne "" ]]; do | |
45 | for vm in $types; do | |
46 | if [[ ${pid} -ne "" ]]; then | |
47 | Vm[$vm]=`cat /proc/${pid}/status 2> /dev/null \ | |
48 | | grep Vm${vm} | awk '{print $2}'` | |
49 | if [[ ${Vm[$vm]} -gt ${maxVm[$vm]} ]]; then | |
50 | maxVm[$vm]=${Vm[$vm]} | |
51 | fi | |
52 | fi | |
53 | done | |
54 | sleep $sleeptime | |
55 | savedtpid=${tpid} | |
56 | tpid=`pgrep -P ${ppid} -n -f time` | |
57 | done | |
58 | wait ${savedtpid} # don't wait, job is finished | |
59 | exitstatus=$? # catch the exit status of wait, the same of $@ | |
60 | echo "Memory usage for $@:" >> /dev/stderr | |
61 | for vm in $types; do echo " Vm$vm: ${maxVm[$vm]} kB" >> /dev/stderr; done | |
62 | echo "Exit status: ${exitstatus}" >> /dev/stderr | |
63 | } | |
64 | __measure_space_peak $* |
0 | #!/usr/bin/env Rscript | |
1 | # (c) Giorgio Gonnella, ZBH, Uni Hamburg, 2017 | |
2 | ||
3 | script.name = "./gfapy-plot-benchmarkdata.R" | |
4 | args <- commandArgs(trailingOnly=TRUE) | |
5 | if (is.na(args[3])) { | |
6 | cat("Usage: ",script.name, " <inputfile> <outpfx> <variable>", "\n") | |
7 | cat("variable: either 'segments' or 'connectivity'\n") | |
8 | stop("Too few command-line parameters") | |
9 | } | |
10 | infname <- args[1] | |
11 | cat("input data: ",infname,"\n") | |
12 | outpfx <- args[2] | |
13 | cat("output prefix:", outpfx, "\n") | |
14 | xvar <- args[3] | |
15 | if (xvar != 'segments' && xvar != 'connectivity') { | |
16 | stop("variable must be one of: segments, connectivity") | |
17 | } | |
18 | ||
19 | library("ggplot2") | |
20 | ||
21 | # | |
22 | # The following function is described here: | |
23 | # http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/#Helper%20functions | |
24 | # Licence: CC0 (https://creativecommons.org/publicdomain/zero/1.0/) | |
25 | # | |
26 | ## Gives count, mean, standard deviation, standard error of the mean, and | |
27 | ## confidence interval (default 95%). | |
28 | ## data: a data frame. | |
29 | ## measurevar: the name of a column that contains the var to be summariezed | |
30 | ## groupvars: a vector containing names of columns that contain grouping vars | |
31 | ## na.rm: a boolean that indicates whether to ignore NA's | |
32 | ## conf.interval: the percent range of the confidence interval (default 95%) | |
33 | summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, | |
34 | conf.interval=.95, .drop=TRUE) { | |
35 | library(plyr) | |
36 | ||
37 | # New version of length which can handle NA's: if na.rm==T, don't count them | |
38 | length2 <- function (x, na.rm=FALSE) { | |
39 | if (na.rm) sum(!is.na(x)) | |
40 | else length(x) | |
41 | } | |
42 | ||
43 | # This does the summary. For each group's data frame, return a vector with | |
44 | # N, mean, and sd | |
45 | datac <- ddply(data, groupvars, .drop=.drop, | |
46 | .fun = function(xx, col) { | |
47 | c(N = length2(xx[[col]], na.rm=na.rm), | |
48 | mean = mean (xx[[col]], na.rm=na.rm), | |
49 | sd = sd (xx[[col]], na.rm=na.rm) | |
50 | ) | |
51 | }, | |
52 | measurevar | |
53 | ) | |
54 | ||
55 | # Rename the "mean" column | |
56 | datac <- rename(datac, c("mean" = measurevar)) | |
57 | ||
58 | datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean | |
59 | ||
60 | # Confidence interval multiplier for standard error | |
61 | # Calculate t-statistic for confidence interval: | |
62 | # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1 | |
63 | ciMult <- qt(conf.interval/2 + .5, datac$N-1) | |
64 | datac$ci <- datac$se * ciMult | |
65 | ||
66 | return(datac) | |
67 | } | |
68 | ||
69 | data <- read.table(infname, header=T, sep="\t") | |
70 | ||
71 | if (xvar == "segments") { | |
72 | xvarname = "lines" | |
73 | xlab="Lines (segments 1/3; dovetails 2/3)" | |
74 | } else { | |
75 | xvarname = "mult" | |
76 | xlab="Dovetails/segment (segments=4000)" | |
77 | data[c("lines")] = (data[c("mult")]+1)*4000 | |
78 | } | |
79 | ||
80 | time.data <- summarySE(data, measurevar="time", groupvars=c(xvarname)) | |
81 | outfname = paste0(outpfx,"_time.log") | |
82 | sink(outfname) | |
83 | print(time.data) | |
84 | time.lm <- lm(time ~ lines, data=data) | |
85 | summary(time.lm) | |
86 | time.nls <- nls(time ~ b + a * lines, | |
87 | data=data, start=list(a=0,b=0), | |
88 | algorithm="port", lower=c(0,0)) | |
89 | print(time.nls) | |
90 | sink() | |
91 | ||
92 | outfname = paste0(outpfx,"_space.log") | |
93 | sink(outfname) | |
94 | space.data <- summarySE(data, measurevar="space", groupvars=c(xvarname)) | |
95 | print(space.data) | |
96 | space.lm <- lm(space ~ lines, data=data) | |
97 | summary(space.lm) | |
98 | space.nls <- nls(space ~ b + a * lines, | |
99 | data=data, start=list(a=0,b=0), | |
100 | algorithm="port", lower=c(0,0)) | |
101 | print(space.nls) | |
102 | sink() | |
103 | ||
104 | outfname = paste0(outpfx,"_time.pdf") | |
105 | pdf(outfname) | |
106 | print(ggplot(time.data, aes_string(x=xvarname, y="time")) + | |
107 | geom_errorbar(aes(ymin=time-se, ymax=time+se), width=2) + | |
108 | geom_line(size=0.2) + geom_point(size=3) + | |
109 | ylab("Total elapsed time (s)") + | |
110 | xlab(xlab)) | |
111 | outfname = paste0(outpfx,"_space.pdf") | |
112 | pdf(outfname) | |
113 | print(ggplot(space.data, aes_string(x=xvarname, y="space")) + | |
114 | geom_errorbar(aes(ymin=space-se, ymax=space+se), width=2) + | |
115 | geom_line(size=0.2) + geom_point(size=3) + | |
116 | ylab("Memory peak (MB)") + | |
117 | xlab(xlab)) | |
118 | dev.off() | |
119 |
0 | #!/usr/bin/env python3 | |
1 | """ | |
2 | Prepare the output of the convert benchmark script for the R plotting script. | |
3 | """ | |
4 | ||
5 | import argparse | |
6 | import os | |
7 | import sys | |
8 | import re | |
9 | ||
10 | op = argparse.ArgumentParser(description=__doc__) | |
11 | op.add_argument('--version', action='version', version='%(prog)s 1.0') | |
12 | op.add_argument("--mult", "-m", action="store_true", | |
13 | help="set if variable n of edges/segment") | |
14 | op.add_argument("inputfile") | |
15 | opts = op.parse_args() | |
16 | ||
17 | if not os.path.exists(opts.inputfile): | |
18 | sys.stderr.write("Input file not found: {}\n".format(opts.inputfile)) | |
19 | exit(1) | |
20 | ||
21 | with open(opts.inputfile) as inputfile: | |
22 | header = True | |
23 | if opts.mult: | |
24 | outdata = ["mult", "time", "space", "time_per_line", "space_per_line"] | |
25 | else: | |
26 | outdata = ["lines", "time", "space", "time_per_line", "space_per_line"] | |
27 | print("\t".join(outdata)) | |
28 | for line in inputfile: | |
29 | if line[:3] == "###": | |
30 | header = False | |
31 | elif not header: | |
32 | data = line.rstrip("\n\r").split("\t") | |
33 | n_segments = data[2] | |
34 | multiplier = data[3] | |
35 | n_lines = int(int(n_segments) * (1+float(multiplier))) | |
36 | elapsed = data[5] | |
37 | elapsed_match = re.compile(r'\s+(\d+):(\d+\.\d+)').match(elapsed) | |
38 | if elapsed_match: | |
39 | minutes = int(elapsed_match.groups()[0]) | |
40 | seconds = float(elapsed_match.groups()[1]) | |
41 | seconds += minutes * 60 | |
42 | else: | |
43 | elapsed_match = re.compile(r'\s+(\d+):(\d+):(\d+)').match(elapsed) | |
44 | if elapsed_match: | |
45 | hours = int(elapsed_match.groups()[0]) | |
46 | minutes = int(elapsed_match.groups()[1]) | |
47 | seconds = int(elapsed_match.groups()[2]) | |
48 | minutes += hours * 60 | |
49 | seconds += minutes * 60 | |
50 | else: | |
51 | continue | |
52 | memory = data[6] | |
53 | memory = int(re.compile(r'(\d+) kB').match(memory).groups()[0]) | |
54 | megabytes = memory / 1024 | |
55 | if opts.mult: | |
56 | outdata = [str(multiplier)] | |
57 | else: | |
58 | outdata = [str(n_lines)] | |
59 | outdata += [str(seconds),str(megabytes), | |
60 | str(seconds/n_lines), str(megabytes/n_lines)] | |
61 | print("\t".join(outdata)) |
0 | #!/bin/bash | |
1 | #$ -clear | |
2 | #$ -q 16c.q | |
3 | #$ -cwd | |
4 | #$ -V | |
5 | #$ -S /bin/bash | |
6 | #$ -o jobs_out | |
7 | #$ -j y | |
8 | ||
9 | if [ $# -ne 4 ]; then | |
10 | echo "Usage: $0 <operation> <version> <variable> <range>" > /dev/stderr | |
11 | echo " operation: (mergelinear/convert) ../bin/gfapy-<operation> <gfafile> will be called" > /dev/stderr | |
12 | echo " version: (gfa1/gfa2) gfa version" > /dev/stderr | |
13 | echo " variable: (segments/connectivity)" > /dev/stderr | |
14 | echo " range: (all/fast/slow)" > /dev/stderr | |
15 | exit 1 | |
16 | fi | |
17 | ||
18 | operation=$1 | |
19 | version=$2 | |
20 | variable=$3 | |
21 | range=$4 | |
22 | ||
23 | if [ $variable == "segments" ]; then | |
24 | if [ $range == "fast" ]; then | |
25 | nsegments="1000 2000 4000 8000 16000 32000 64000 128000" | |
26 | elif [ $range == "slow" ]; then | |
27 | nsegments="256000 512000 1024000 2048000 4096000" | |
28 | elif [ $range == "all"]; then | |
29 | nsegments="1000 2000 4000 8000 16000 32000 64000 128000 256000 512000 1024000 2048000 4096000" | |
30 | fi | |
31 | else | |
32 | nsegments=4000 | |
33 | fi | |
34 | ||
35 | if [ $variable == "connectivity" ]; then | |
36 | if [ $range == "fast" ]; then | |
37 | multipliers="2 4 8 16 32 64" | |
38 | elif [ $range == "slow" ]; then | |
39 | multipliers="128 256" | |
40 | elif [ $range == "all"]; then | |
41 | multipliers="2 4 8 16 32 64 128 256" | |
42 | fi | |
43 | else | |
44 | multipliers=2 | |
45 | fi | |
46 | ||
47 | replicate=1 | |
48 | for i in $nsegments; do | |
49 | for m in $multipliers; do | |
50 | fname="${i}_e${m}x.$replicate.${version}" | |
51 | if [ ! -e $fname ]; then | |
52 | ./gfapy-randomgraph --segments $i -g $version \ | |
53 | --dovetails-per-segment $m --with-sequence > $fname | |
54 | fi | |
55 | echo "Profiling $operation $fname ..." | |
56 | rm -f $fname.$operation.prof | |
57 | python3 -m cProfile -o $fname.$operation.prof \ | |
58 | ../bin/gfapy-$operation $fname 1> /dev/null | |
59 | done | |
60 | done |
0 | #!/usr/bin/env python3 | |
1 | """ | |
2 | Creates a random graph for testing | |
3 | """ | |
4 | ||
5 | import argparse | |
6 | import sys | |
7 | import random | |
8 | ||
9 | op = argparse.ArgumentParser(description=__doc__) | |
10 | op.add_argument("--segments", "-s", type=int, | |
11 | help="number of segments", required=True) | |
12 | op.add_argument("--slen", "-l", type=int, default=100, | |
13 | help="lenght of segments sequence") | |
14 | op.add_argument("--with-sequence", "-w", action="store_true") | |
15 | op.add_argument("--dovetails-per-segment", "-d", | |
16 | help="average number of dovetail edges per segment", | |
17 | default=2.0, type=float) | |
18 | op.add_argument('--gfa-version', "-g", default="gfa1", | |
19 | help="gfa version", choices=("gfa1", "gfa2")) | |
20 | op.add_argument('--version', action='version', version='%(prog)s 1.0') | |
21 | opts = op.parse_args() | |
22 | ||
23 | if opts.segments < 0: | |
24 | sys.stderr.write("Error: the number of segments must be "+ | |
25 | ">= 0 ({})\n".format(opts.segments)) | |
26 | exit(1) | |
27 | if opts.dovetails_per_segment < 0: | |
28 | sys.stderr.write("Error: the average number of dovetails per segment must "+ | |
29 | "be >= 0 ({})\n".format(opts.dovetails_per_segment)) | |
30 | exit(1) | |
31 | if opts.slen <= 0: | |
32 | sys.stderr.write("Error: the length of segments sequence must be > 0"+ | |
33 | " ({})\n".format(opts.slen)) | |
34 | exit(1) | |
35 | ||
36 | if opts.gfa_version == "gfa1": | |
37 | print("H\tVN:Z:1.0") | |
38 | else: | |
39 | print("H\tVN:Z:2.0") | |
40 | ||
41 | def random_sequence(slen): | |
42 | sequence = [] | |
43 | for i in range(slen): | |
44 | sequence.append(random.choice('ACGT')) | |
45 | return "".join(sequence) | |
46 | ||
47 | for i in range(opts.segments): | |
48 | if opts.with_sequence: | |
49 | sequence = random_sequence(opts.slen) | |
50 | else: | |
51 | sequence = "*" | |
52 | if opts.gfa_version == "gfa1": | |
53 | print("S\ts{}\t{}\tLN:i:{}".format(i, sequence, opts.slen)) | |
54 | else: | |
55 | print("S\ts{}\t{}\t{}".format(i, opts.slen, sequence)) | |
56 | ||
57 | n_dovetails = int(opts.segments * opts.dovetails_per_segment) | |
58 | edges = {} | |
59 | for i in range(n_dovetails): | |
60 | edge = False | |
61 | while not edge: | |
62 | s_from = random.randint(0, opts.segments-1) | |
63 | s_from_or = random.choice('+-') | |
64 | s_to = random.randint(0, opts.segments-1) | |
65 | s_to_or = random.choice('+-') | |
66 | if s_from not in edges: | |
67 | edges[s_from] = {'+': {}, '-': {}} | |
68 | if s_to not in edges[s_from][s_from_or]: | |
69 | edges[s_from][s_from_or][s_to] = {'+': False, '-': False} | |
70 | if not edges[s_from][s_from_or][s_to][s_to_or]: | |
71 | edges[s_from][s_from_or][s_to][s_to_or] = True | |
72 | edge = True | |
73 | ovlen = opts.slen//10 | |
74 | if ovlen == 0: ovlen = 1 | |
75 | cigar = "{}M".format(ovlen) | |
76 | if opts.gfa_version == "gfa1": | |
77 | print("L\ts{}\t{}\ts{}\t{}\t{}\tID:Z:e{}".format(s_from, s_from_or, s_to, | |
78 | s_to_or, cigar, i)) | |
79 | else: | |
80 | s_from_begin = opts.slen - ovlen if s_from_or == "+" else 0 | |
81 | s_from_end = "{}$".format(opts.slen) if s_from_or == "+" else ovlen | |
82 | s_to_begin = opts.slen - ovlen if s_to_or == "-" else 0 | |
83 | s_to_end = "{}$".format(opts.slen) if s_to_or == "-" else ovlen | |
84 | print("E\te{}\ts{}{}\ts{}{}\t{}\t{}\t{}\t{}\t{}".format( | |
85 | i, s_from, s_from_or, s_to, s_to_or, s_from_begin, s_from_end, | |
86 | s_to_begin, s_to_end, cigar)) |
0 | #!/usr/bin/env python3 | |
1 | """ | |
2 | Run the benchmarks necessary to reproduce the figures of Section 3 | |
3 | of the Supplementary Information of the manuscript \"Gfapy: a flexible | |
4 | and extensible software library for handling sequence graphs in Python\" | |
5 | and plots the figures using R. | |
6 | """ | |
7 | ||
8 | import argparse | |
9 | import os | |
10 | ||
11 | op = argparse.ArgumentParser(description=__doc__) | |
12 | op.add_argument("fignum", help="Figure number", type=int, | |
13 | choices=range(5,9)) | |
14 | op.add_argument("--queue", default=None, | |
15 | help="Use the specified queue of a Grid Engine cluster system "+ | |
16 | "(e.g. 16c.q). If not provided, the benchmarks are run on the "+ | |
17 | "local computer.") | |
18 | op.add_argument("--nrepl",type=int, default=3, | |
19 | help="Number of replicates (default: 3)") | |
20 | op.add_argument("--fast",action="store_true", | |
21 | help="Run only the three fastest datapoints of the benchmark") | |
22 | opts = op.parse_args() | |
23 | ||
24 | if opts.fignum == 5: | |
25 | testvar="segments" | |
26 | operation="convert" | |
27 | elif opts.fignum == 6: | |
28 | testvar="connectivity" | |
29 | operation="convert" | |
30 | elif opts.fignum == 7: | |
31 | testvar="segments" | |
32 | operation="mergelinear" | |
33 | else: # 8 | |
34 | testvar="connectivity" | |
35 | operation="mergelinear" | |
36 | ||
37 | if opts.fast: | |
38 | subset="fast" | |
39 | else: | |
40 | subset="all" | |
41 | ||
42 | run_benchmarks_args="figure{}.out {} gfa2 {} {} {}".format( | |
43 | opts.fignum, operation, testvar, subset, opts.nrepl) | |
44 | ||
45 | if not opts.queue: | |
46 | os.system("./gfapy-run-benchmarks.sh {}".format(run_benchmarks_args)) | |
47 | else: | |
48 | qsub_script_pfx=\ | |
49 | """#!/bin/bash | |
50 | #$ -clear | |
51 | #$ -q {} | |
52 | #$ -cwd | |
53 | #$ -V | |
54 | #$ -S /bin/bash | |
55 | #$ -o jobs_out | |
56 | #$ -j y | |
57 | #$ -sync y | |
58 | ||
59 | """.format(opts.queue) | |
60 | with open("gfapy-run-benchmarks.sh", "r") as input_file: | |
61 | content = input_file.read() | |
62 | with open("gfapy-run-benchmarks.qsub", "w") as output_file: | |
63 | output_file.write(qsub_script_pfx) | |
64 | output_file.write(content) | |
65 | os.system("mkdir -p jobs_out") | |
66 | os.system("qsub gfapy-run-benchmarks.qsub {}".format(run_benchmarks_args)) | |
67 | ||
68 | if testvar == "segments": | |
69 | prepareflag="" | |
70 | else: | |
71 | prepareflag="--mult" | |
72 | os.system("./gfapy-plot-preparedata.py {} figure{}.out > figure{}.dat".format( | |
73 | prepareflag, opts.fignum, opts.fignum)) | |
74 | os.system("./gfapy-plot-benchmarkdata.R figure{}.dat figure{} {}".format( | |
75 | opts.fignum, opts.fignum, testvar)) |
0 | #!/bin/bash | |
1 | ||
2 | if [ $# -ne 6 ]; then | |
3 | echo "Usage: $0 <outfile> <operation> <version> <variable> <range> <nrepl>" > /dev/stderr | |
4 | echo " outfile: will be overwritten if exists" > /dev/stderr | |
5 | echo " operation: (mergelinear/convert) ../bin/gfapy-<operation> <gfafile> will be called" > /dev/stderr | |
6 | echo " version: (gfa1/gfa2) gfa version" > /dev/stderr | |
7 | echo " variable: (segments/connectivity)" > /dev/stderr | |
8 | echo " range: (all/fast/slow)" > /dev/stderr | |
9 | echo " nrepl: (e.g. 3) number of replicates" > /dev/stderr | |
10 | exit 1 | |
11 | fi | |
12 | ||
13 | outfile=$1 | |
14 | operation=$2 | |
15 | version=$3 | |
16 | variable=$4 | |
17 | range=$5 | |
18 | nrepl=$6 | |
19 | ||
20 | if [ $variable == "segments" ]; then | |
21 | if [ $range == "fast" ]; then | |
22 | nsegments="1000 2000 4000" | |
23 | elif [ $range == "slow" ]; then | |
24 | nsegments="8000 16000 32000 64000 128000 256000 512000 1024000 2048000" | |
25 | elif [ $range == "all"]; then | |
26 | nsegments="1000 2000 4000 8000 16000 32000 64000 128000 256000 512000 1024000 2048000" | |
27 | fi | |
28 | else | |
29 | nsegments=4000 | |
30 | fi | |
31 | ||
32 | if [ $variable == "connectivity" ]; then | |
33 | if [ $range == "fast" ]; then | |
34 | multipliers="2 4 8" | |
35 | elif [ $range == "slow" ]; then | |
36 | multipliers="16 32 64 128 256" | |
37 | elif [ $range == "all"]; then | |
38 | multipliers="2 4 8 16 32 64 128 256" | |
39 | fi | |
40 | else | |
41 | multipliers=2 | |
42 | fi | |
43 | ||
44 | mkdir -p benchmark_results | |
45 | rm -f $outfile | |
46 | echo "# hostname: $HOSTNAME" > $outfile | |
47 | echo "### benchmark data:" >> $outfile | |
48 | for ((replicate=1;replicate<=nrepl;++replicate)); do | |
49 | for i in $nsegments; do | |
50 | for m in $multipliers; do | |
51 | fname="benchmark_results/${i}_e${m}x.$replicate.${version}" | |
52 | bmout="$fname.$operation.benchmark" | |
53 | rm -f $bmout | |
54 | if [ ! -e $fname ]; then | |
55 | ./gfapy-randomgraph --segments $i -g $version \ | |
56 | --dovetails-per-segment $m --with-sequence > $fname | |
57 | fi | |
58 | ./gfapy-benchmark-collectdata ../bin/gfapy-$operation $fname \ | |
59 | 1> /dev/null 2> $bmout | |
60 | elapsed=$(grep -P -o "(?<=) [^ ]*(?=elapsed)" $bmout) | |
61 | memory=$(grep -P -o "(?<=VmHWM: ).*" $bmout) | |
62 | filesize=( $(ls -ln $fname) );filesize=${filesize[4]} | |
63 | echo -e "gfapy-$operation\t$version\t$i\t$m\t$replicate\t$elapsed\t$memory\t$filesize" >> $outfile | |
64 | done | |
65 | done | |
66 | done |
0 | #!/usr/bin/env python3 | |
1 | """ | |
2 | Compare two GFA files | |
3 | ||
4 | Note: the current version is not yet functional and only checking segments. | |
5 | Work in progress. | |
6 | """ | |
7 | ||
8 | import sys | |
9 | import os | |
10 | import gfapy | |
11 | import argparse | |
12 | ||
13 | op = argparse.ArgumentParser(description=__doc__) | |
14 | op.add_argument('--version', action='version', version='%(prog)s 0.1') | |
15 | op.add_argument("filename1") | |
16 | op.add_argument("filename2") | |
17 | opts = op.parse_args() | |
18 | ||
19 | gfa1 = gfapy.Gfa.from_file(opts.filename1) | |
20 | gfa2 = gfapy.Gfa.from_file(opts.filename2) | |
21 | ||
22 | different = False | |
23 | ||
24 | if gfa1.version != gfa2.version: | |
25 | print("# different version") | |
26 | exit(1) | |
27 | else: | |
28 | for s in gfa1.segments: | |
29 | s2 = gfa2.segment(s) | |
30 | if s2 is None: | |
31 | different = True | |
32 | print("# segment {} in {} but not in {}".format(s.name, opts.filename1, opts.filename2)) | |
33 | if s.diff(s2): | |
34 | different = True | |
35 | for diff in s.diff(s2): | |
36 | print(diff) | |
37 | for s in gfa2.segments: | |
38 | s1 = gfa1.segment(s) | |
39 | if s1 is None: | |
40 | different = True | |
41 | print("# segment {} in {} but not in {}".format(s.name, opts.filename2, opts.filename1)) | |
42 | ||
43 | if different: | |
44 | exit(1) | |
45 | else: | |
46 | exit(0) |
0 | # Minimal makefile for Sphinx documentation | |
1 | # | |
2 | ||
3 | # You can set these variables from the command line. | |
4 | SPHINXOPTS = | |
5 | SPHINXBUILD = sphinx-build | |
6 | SPHINXPROJ = Gfapy | |
7 | SOURCEDIR = . | |
8 | BUILDDIR = _build | |
9 | ||
10 | # Put it first so that "make" without argument is like "make help". | |
11 | help: | |
12 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) | |
13 | ||
14 | .PHONY: help Makefile | |
15 | ||
16 | cleanup: | |
17 | rm source _build -rf | |
18 | ||
19 | # Catch-all target: route all unknown targets to Sphinx using the new | |
20 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). | |
21 | %: Makefile | |
22 | @PYTHONHASHSEED=0 $(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) |
0 | #!/usr/bin/env python3 | |
1 | # -*- coding: utf-8 -*- | |
2 | # | |
3 | # Gfapy documentation build configuration file, created by | |
4 | # sphinx-quickstart on Thu Mar 16 10:13:57 2017. | |
5 | # | |
6 | # This file is execfile()d with the current directory set to its | |
7 | # containing dir. | |
8 | # | |
9 | # Note that not all possible configuration values are present in this | |
10 | # autogenerated file. | |
11 | # | |
12 | # All configuration values have a default; values that are commented out | |
13 | # serve to show the default. | |
14 | ||
15 | # If extensions (or modules to document with autodoc) are in another directory, | |
16 | # add these directories to sys.path here. If the directory is relative to the | |
17 | # documentation root, use os.path.abspath to make it absolute, like shown here. | |
18 | # | |
19 | import os | |
20 | import sys | |
21 | sys.path.insert(0, os.path.abspath('.')) | |
22 | sys.path.insert(0, os.path.abspath('../')) | |
23 | ||
24 | # -- General configuration ------------------------------------------------ | |
25 | ||
26 | # If your documentation needs a minimal Sphinx version, state it here. | |
27 | # | |
28 | # needs_sphinx = '1.0' | |
29 | ||
30 | # Add any Sphinx extension module names here, as strings. They can be | |
31 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom | |
32 | # ones. | |
33 | extensions = [ | |
34 | 'sphinx.ext.autodoc', | |
35 | 'sphinx.ext.doctest', | |
36 | 'sphinx.ext.todo', | |
37 | 'sphinx.ext.coverage', | |
38 | 'sphinx.ext.imgmath', | |
39 | 'sphinx.ext.ifconfig', | |
40 | 'sphinx.ext.viewcode', | |
41 | 'sphinx.ext.githubpages', | |
42 | 'sphinx.ext.napoleon' | |
43 | ] | |
44 | ||
45 | # Napoleon | |
46 | napoleon_numpy_docstring = True | |
47 | napoleon_google_docstring = True | |
48 | napoleon_use_param = False | |
49 | napoleon_use_ivar = True | |
50 | ||
51 | # Default role: | |
52 | default_role = 'any' | |
53 | ||
54 | # Add any paths that contain templates here, relative to this directory. | |
55 | templates_path = ['_templates'] | |
56 | ||
57 | # The suffix(es) of source filenames. | |
58 | # You can specify multiple suffix as a list of string: | |
59 | # | |
60 | # source_suffix = ['.rst', '.md'] | |
61 | source_suffix = '.rst' | |
62 | ||
63 | # The master toctree document. | |
64 | master_doc = 'index' | |
65 | ||
66 | # General information about the project. | |
67 | project = 'Gfapy' | |
68 | copyright = '2017, Giorgio Gonnella and others (see CONTRIBUTORS)' | |
69 | author = 'Giorgio Gonnella and others (see CONTRIBUTORS)' | |
70 | ||
71 | # The version info for the project you're documenting, acts as replacement for | |
72 | # |version| and |release|, also used in various other places throughout the | |
73 | # built documents. | |
74 | # | |
75 | # The short X.Y version. | |
76 | version = '1.1' | |
77 | # The full version, including alpha/beta/rc tags. | |
78 | release = '1.1.0' | |
79 | ||
80 | # The language for content autogenerated by Sphinx. Refer to documentation | |
81 | # for a list of supported languages. | |
82 | # | |
83 | # This is also used if you do content translation via gettext catalogs. | |
84 | # Usually you set "language" from the command line for these cases. | |
85 | language = None | |
86 | ||
87 | # List of patterns, relative to source directory, that match files and | |
88 | # directories to ignore when looking for source files. | |
89 | # This patterns also effect to html_static_path and html_extra_path | |
90 | exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store'] | |
91 | ||
92 | # The name of the Pygments (syntax highlighting) style to use. | |
93 | pygments_style = 'sphinx' | |
94 | ||
95 | # If true, `todo` and `todoList` produce output, else they produce nothing. | |
96 | todo_include_todos = False | |
97 | ||
98 | # -- Options for HTML output ---------------------------------------------- | |
99 | ||
100 | # The theme to use for HTML and HTML Help pages. See the documentation for | |
101 | # a list of builtin themes. | |
102 | # | |
103 | html_theme = 'sphinx_rtd_theme' | |
104 | ||
105 | # Theme options are theme-specific and customize the look and feel of a theme | |
106 | # further. For a list of options available for each theme, see the | |
107 | # documentation. | |
108 | # | |
109 | # html_theme_options = {} | |
110 | ||
111 | # Add any paths that contain custom static files (such as style sheets) here, | |
112 | # relative to this directory. They are copied after the builtin static files, | |
113 | # so a file named "default.css" will overwrite the builtin "default.css". | |
114 | html_static_path = ['_static'] | |
115 | ||
116 | ||
117 | # -- Options for HTMLHelp output ------------------------------------------ | |
118 | ||
119 | # Output file base name for HTML help builder. | |
120 | htmlhelp_basename = 'Gfapydoc' | |
121 | ||
122 | ||
123 | # -- Options for LaTeX output --------------------------------------------- | |
124 | ||
125 | latex_elements = { | |
126 | # The paper size ('letterpaper' or 'a4paper'). | |
127 | # | |
128 | 'papersize': 'a4paper', | |
129 | ||
130 | # The font size ('10pt', '11pt' or '12pt'). | |
131 | # | |
132 | # 'pointsize': '10pt', | |
133 | ||
134 | # Additional stuff for the LaTeX preamble. | |
135 | # | |
136 | # 'preamble': '', | |
137 | ||
138 | # Latex figure (float) alignment | |
139 | # | |
140 | # 'figure_align': 'htbp', | |
141 | } | |
142 | ||
143 | # Grouping the document tree into LaTeX files. List of tuples | |
144 | # (source start file, target name, title, | |
145 | # author, documentclass [howto, manual, or own class]). | |
146 | latex_documents = [ | |
147 | (master_doc, 'Gfapy.tex', 'Gfapy Documentation', | |
148 | 'Giorgio Gonnella', 'manual'), | |
149 | ] | |
150 | ||
151 | # -- Options for manual page output --------------------------------------- | |
152 | ||
153 | # One entry per manual page. List of tuples | |
154 | # (source start file, name, description, authors, manual section). | |
155 | man_pages = [ | |
156 | (master_doc, 'gfapy', 'Gfapy Documentation', | |
157 | [author], 1) | |
158 | ] | |
159 | ||
160 | ||
161 | # -- Options for Texinfo output ------------------------------------------- | |
162 | ||
163 | # Grouping the document tree into Texinfo files. List of tuples | |
164 | # (source start file, target name, title, author, | |
165 | # dir menu entry, description, category) | |
166 | texinfo_documents = [ | |
167 | (master_doc, 'Gfapy', 'Gfapy Documentation', | |
168 | author, 'Gfapy', | |
169 | 'Python library for the Graphic Fragment Assembly (GFA) format.', | |
170 | 'Miscellaneous'), | |
171 | ] | |
172 |
0 | .. Gfapy documentation master file, created by | |
1 | sphinx-quickstart on Thu Mar 16 10:13:57 2017. | |
2 | You can adapt this file completely to your liking, but it should at least | |
3 | contain the root `toctree` directive. | |
4 | ||
5 | Gfapy documentation | |
6 | =================== | |
7 | ||
8 | .. toctree:: | |
9 | :maxdepth: 2 | |
10 | :caption: Contents: | |
11 | ||
12 | readme | |
13 | changelog | |
14 | ||
15 | tutorial/gfa | |
16 | tutorial/validation | |
17 | tutorial/positional_fields | |
18 | tutorial/placeholders | |
19 | tutorial/positions | |
20 | tutorial/alignments | |
21 | tutorial/tags | |
22 | tutorial/references | |
23 | tutorial/header | |
24 | tutorial/custom_records | |
25 | tutorial/comments | |
26 | tutorial/errors | |
27 | tutorial/graph_operations | |
28 | tutorial/rgfa | |
29 | ||
30 | Indices and tables | |
31 | ================== | |
32 | ||
33 | * :ref:`genindex` | |
34 | * :ref:`modindex` | |
35 | * :ref:`search` |
0 | .. testsetup:: * | |
1 | ||
2 | import gfapy | |
3 | from gfapy import is_placeholder, Alignment | |
4 | h = "H\tVN:Z:2.0\tTS:i:100" | |
5 | sA = "S\tA\t100\t*" | |
6 | sB = "S\tB\t100\t*" | |
7 | x = "E\tx\tA+\tB-\t0\t100$\t0\t100$\t4,2\tTS:i:50" | |
8 | gfa = gfapy.Gfa([h, sA, sB, x]) | |
9 | ||
10 | .. _alignments: | |
11 | ||
12 | Alignments | |
13 | ~~~~~~~~~~ | |
14 | ||
15 | Some GFA1 (L/C overlap, P overlaps) and GFA2 (E/F alignment) fields contain | |
16 | alignments or lists of alignments. The alignment can be left unspecified and a | |
17 | placeholder symbol ``*`` used instead. In GFA1 the alignments can be given as | |
18 | CIGAR strings, in GFA2 also as Dazzler traces. | |
19 | ||
20 | Gfapy uses three different classes for representing the content of alignment fields: | |
21 | :class:`~gfapy.alignment.cigar.CIGAR`, :class:`~gfapy.alignment.trace.Trace` | |
22 | and :class:`~gfapy.alignment.placeholder.AlignmentPlaceholder`. | |
23 | ||
24 | Creating an alignment | |
25 | ^^^^^^^^^^^^^^^^^^^^^ | |
26 | ||
27 | An alignment instance is usually created from its GFA string | |
28 | representation or from a list by using the | |
29 | :class:`gfapy.Alignment() <gfapy.alignment.alignment.Alignment>` | |
30 | constructor. | |
31 | ||
32 | .. doctest:: | |
33 | ||
34 | >>> from gfapy import Alignment | |
35 | >>> Alignment("*") | |
36 | gfapy.AlignmentPlaceholder() | |
37 | >>> Alignment("10,10,10") | |
38 | gfapy.Trace([10,10,10]) | |
39 | >>> Alignment([10,10,10]) | |
40 | gfapy.Trace([10,10,10]) | |
41 | >>> Alignment("30M2I") | |
42 | gfapy.CIGAR([gfapy.CIGAR.Operation(30,'M'), gfapy.CIGAR.Operation(2,'I')]) | |
43 | ||
44 | If the argument is an alignment object it will be returned, | |
45 | so that is always safe to call the method on a variable which can | |
46 | contain a string or an alignment instance: | |
47 | ||
48 | .. doctest:: | |
49 | ||
50 | >>> Alignment(Alignment("*")) | |
51 | gfapy.AlignmentPlaceholder() | |
52 | >>> Alignment(Alignment("10,10")) | |
53 | gfapy.Trace([10,10]) | |
54 | ||
55 | Recognizing undefined alignments | |
56 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
57 | ||
58 | The :func:`gfapy.is_placeholder() <gfapy.placeholder.is_placeholder>` method | |
59 | allows to test if an alignment field contains an undefined value (placeholder) | |
60 | instead of a defined value (CIGAR string, trace). The method accepts as | |
61 | argument either an alignment object or a string or list representation. | |
62 | ||
63 | .. doctest:: | |
64 | ||
65 | >>> from gfapy import is_placeholder, Alignment | |
66 | >>> is_placeholder(Alignment("30M")) | |
67 | False | |
68 | >>> is_placeholder(Alignment("10,10")) | |
69 | False | |
70 | >>> is_placeholder(Alignment("*")) | |
71 | True | |
72 | >>> is_placeholder("*") | |
73 | True | |
74 | >>> is_placeholder("30M") | |
75 | False | |
76 | >>> is_placeholder("10,10") | |
77 | False | |
78 | >>> is_placeholder([]) | |
79 | True | |
80 | >>> is_placeholder([10,10]) | |
81 | False | |
82 | ||
83 | Note that, as a placeholder is ``False`` in boolean context, just a | |
84 | ``if not aligment`` will also work, if alignment is an alignment object. | |
85 | But this of course, does not work, if it is a string representation. | |
86 | Therefore it is better to use the | |
87 | :func:`gfapy.is_placeholder() <gfapy.placeholder.is_placeholder>` method, | |
88 | which works in both cases. | |
89 | ||
90 | .. doctest:: | |
91 | ||
92 | >>> if not Alignment("*"): print('no alignment') | |
93 | no alignment | |
94 | >>> if is_placeholder(Alignment("*")): print('no alignment') | |
95 | no alignment | |
96 | >>> if "*": print('not a placeholder...?') | |
97 | not a placeholder...? | |
98 | >>> if is_placeholder("*"): print('really? it is a placeholder!') | |
99 | really? it is a placeholder! | |
100 | ||
101 | Reading and editing CIGARs | |
102 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
103 | ||
104 | CIGARs are represented by specialized lists, instances of the class | |
105 | :class:`~gfapy.alignment.cigar.CIGAR`, whose elements are CIGAR operations | |
106 | CIGAR operations are represented by instance of the class | |
107 | :class:`~gfapy.alignment.cigar.CIGAR.Operation`, | |
108 | and provide the properties ``length`` (length of the operation, an integer) | |
109 | and ``code`` (one-letter string which specifies the type of operation). | |
110 | Note that not all operations allowed in SAM files (for which CIGAR strings | |
111 | were first defined) are also meaningful in GFA and thus GFA2 only allows | |
112 | the operations ``M``, ``I``, ``D`` and ``P``. | |
113 | ||
114 | .. doctest:: | |
115 | ||
116 | >>> cigar = gfapy.Alignment("30M") | |
117 | >>> isinstance(cigar, list) | |
118 | True | |
119 | >>> operation = cigar[0] | |
120 | >>> type(operation) | |
121 | <class 'gfapy.alignment.cigar.CIGAR.Operation'> | |
122 | >>> operation.code | |
123 | 'M' | |
124 | >>> operation.code = 'D' | |
125 | >>> operation.length | |
126 | 30 | |
127 | >>> len(operation) | |
128 | 30 | |
129 | >>> str(operation) | |
130 | '30D' | |
131 | ||
132 | As a CIGAR instance is a list, list methods apply to it. If the array is | |
133 | emptied, its string representation will be the placeholder symbol ``*``. | |
134 | ||
135 | .. doctest:: | |
136 | ||
137 | >>> cigar = gfapy.Alignment("1I20M2D") | |
138 | >>> cigar[0].code = "M" | |
139 | >>> cigar.pop(1) | |
140 | gfapy.CIGAR.Operation(20,'M') | |
141 | >>> str(cigar) | |
142 | '1M2D' | |
143 | >>> cigar[:] = [] | |
144 | >>> str(cigar) | |
145 | '*' | |
146 | ||
147 | The validate :func:`CIGAR.validate() <gfapy.alignment.cigar.CIGAR.validate>` | |
148 | function checks if a CIGAR instance is valid. A version can be provided, as the | |
149 | CIGAR validation is version specific (as GFA2 forbids some CIGAR operations). | |
150 | ||
151 | .. doctest:: | |
152 | ||
153 | >>> cigar = gfapy.Alignment("30M10D20M5I10M") | |
154 | >>> cigar.validate() | |
155 | >>> cigar[1].code = "L" | |
156 | >>> cigar.validate() | |
157 | Traceback (most recent call last): | |
158 | ... | |
159 | gfapy.error.ValueError: | |
160 | >>> cigar = gfapy.Alignment("30M10D20M5I10M") | |
161 | >>> cigar[1].code = "X" | |
162 | >>> cigar.validate(version="gfa1") | |
163 | >>> cigar.validate(version="gfa2") | |
164 | Traceback (most recent call last): | |
165 | ... | |
166 | gfapy.error.ValueError: | |
167 | ||
168 | Reading and editing traces | |
169 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
170 | ||
171 | Traces are arrays of non-negative integers. The values are interpreted | |
172 | using a trace spacing value. If traces are used, a trace spacing value | |
173 | must be defined in a TS integer tag, either in the header, or in the | |
174 | single lines which contain traces (which takes precedence over the | |
175 | header global value). | |
176 | ||
177 | .. doctest:: | |
178 | ||
179 | >>> print(gfa) #doctest: +SKIP | |
180 | H TS:i:100 | |
181 | E x A+ B- 0 100$ 0 100$ 4,2 TS:i:50 | |
182 | ... | |
183 | >>> gfa.header.TS | |
184 | 100 | |
185 | >>> gfa.line("x").TS | |
186 | 50 | |
187 | ||
188 | Query, reference and complement | |
189 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
190 | ||
191 | CIGARs are asymmetric, i.e.\ they consider one sequence as reference and | |
192 | another sequence as query. | |
193 | ||
194 | The :func:`~gfapy.alignment.cigar.CIGAR.length_on_reference` and | |
195 | :func:`~gfapy.alignment.cigar.CIGAR.length_on_query` methods compute the length | |
196 | of the alignment on the two sequences. These methods are used by the library | |
197 | e.g. to convert GFA1 L lines to GFA2 E lines (which is only possible if CIGARs | |
198 | are provided). | |
199 | ||
200 | .. doctest:: | |
201 | ||
202 | >>> cigar = gfapy.Alignment("30M10D20M5I10M") | |
203 | >>> cigar.length_on_reference() | |
204 | 70 | |
205 | >>> cigar.length_on_query() | |
206 | 65 | |
207 | ||
208 | CIGARs are dependent on which sequence is taken as reference and which | |
209 | is taken as query. For each alignment, a complement CIGAR can be | |
210 | computed using the method | |
211 | :func:`~gfapy.alignment.cigar.CIGAR.complement`; it is the CIGAR obtained | |
212 | when the two sequences are switched. | |
213 | ||
214 | .. doctest:: | |
215 | ||
216 | >>> cigar = gfapy.Alignment("2M1D3M") | |
217 | >>> str(cigar.complement()) | |
218 | '3M1I2M' | |
219 | ||
220 | The current version of Gfapy does not provide a way to compute the | |
221 | alignment, thus the trace information can be accessed and edited, but | |
222 | not used for this purpose. Because of this there is currently no way in | |
223 | Gfapy to compute a complement trace (trace obtained when the sequences | |
224 | are switched). | |
225 | ||
226 | .. doctest:: | |
227 | ||
228 | >>> trace = gfapy.Alignment("1,2,3") | |
229 | >>> str(trace.complement()) | |
230 | '*' | |
231 | ||
232 | The complement of a placeholder is a placeholder: | |
233 | ||
234 | .. doctest:: | |
235 | ||
236 | >>> str(gfapy.Alignment("*").complement()) | |
237 | '*' |
0 | .. testsetup:: * | |
1 | ||
2 | import gfapy | |
3 | g = gfapy.Gfa() | |
4 | ||
5 | .. _comments: | |
6 | ||
7 | Comments | |
8 | -------- | |
9 | ||
10 | GFA lines starting with a ``#`` symbol are considered comments. In Gfapy | |
11 | comments are represented by instances of the class :class:`gfapy.line.Comment | |
12 | <gfapy.line.comment.comment.Comment>`. They have a similar interface to other | |
13 | line instances, with some differences, e.g. they do not support tags. | |
14 | ||
15 | The comments collection | |
16 | ~~~~~~~~~~~~~~~~~~~~~~~ | |
17 | ||
18 | The comments of a Gfa object are accessed using the :func:`Gfa.comments | |
19 | <gfapy.lines.collections.Collections.comments>` property. This is a list of | |
20 | comment line instances. The single elements can be modified, but the list | |
21 | itself is read-only. To remove a comment from the Gfa, you need to find the | |
22 | instance in the list, and call | |
23 | :func:`~gfapy.line.common.disconnection.Disconnection.disconnect` on it. To | |
24 | add a comment to a :class:`~gfapy.gfa.Gfa` instance is done similarly to other | |
25 | lines, by using the :func:`Gfa.add_line(line) | |
26 | <gfapy.lines.creators.Creators.add_line>` method. | |
27 | ||
28 | .. doctest:: | |
29 | ||
30 | >>> g.add_line("# this is a comment") #doctest: +ELLIPSIS | |
31 | >>> [str(c) for c in g.comments] | |
32 | ['# this is a comment'] | |
33 | >>> g.comments[0].disconnect() | |
34 | >>> g.comments | |
35 | [] | |
36 | ||
37 | Accessing the comment content | |
38 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
39 | ||
40 | The content of the comment line, excluding the initial ``#`` and eventual | |
41 | initial spacing characters, is included in the ``content`` field. The initial | |
42 | spacing characters can be read/changed using the ``spacer`` field. The default | |
43 | value is a single space. | |
44 | ||
45 | .. doctest:: | |
46 | ||
47 | >>> g.add_line("# this is a comment") #doctest: +ELLIPSIS | |
48 | >>> c = g.comments[-1] | |
49 | >>> c.content | |
50 | 'this is a comment' | |
51 | >>> c.spacer | |
52 | ' ' | |
53 | >>> c.spacer = '___' | |
54 | >>> str(c) | |
55 | '#___this is a comment' | |
56 | ||
57 | Tags are not supported by comment lines. If the line contains tags, | |
58 | these are nor parsed, but included in the ``content`` field. Trying to set | |
59 | tags raises exceptions. | |
60 | ||
61 | .. doctest:: | |
62 | ||
63 | >>> c = gfapy.Line("# this is not a tag\txx:i:1") | |
64 | >>> c.content | |
65 | 'this is not a tag\txx:i:1' | |
66 | >>> c.xx | |
67 | >>> c.xx = 1 | |
68 | Traceback (most recent call last): | |
69 | ... | |
70 | gfapy.error.RuntimeError: Tags of comment lines cannot be set |
0 | .. testsetup:: * | |
1 | ||
2 | import gfapy | |
3 | g = gfapy.Gfa(version = 'gfa2') | |
4 | ||
5 | .. _custom_records: | |
6 | ||
7 | Custom records | |
8 | -------------- | |
9 | ||
10 | The GFA2 specification considers each line which starts with a non-standard | |
11 | record type a custom (i.e. user- or program-specific) record. | |
12 | Gfapy allows to retrieve these records and access their data using a | |
13 | similar interface to that for the predefined record types. | |
14 | ||
15 | Retrieving, adding and deleting custom records | |
16 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
17 | ||
18 | Gfa instances have the property | |
19 | :func:`~gfapy.lines.collections.Collections.custom_records`, | |
20 | a list of all line instances with a non-standard record type. Among these, | |
21 | records of a specific record type are retrieved using the method | |
22 | :func:`Gfa.custom_records_of_type(record_type) | |
23 | <gfapy.lines.collections.Collections.custom_records_of_type>`. | |
24 | Lines are added and deleted using the same methods | |
25 | (:func:`~gfapy.lines.creators.Creators.add_line` and | |
26 | :func:`~gfapy.line.common.disconnection.Disconnection.disconnect`) as for | |
27 | other line types. | |
28 | ||
29 | .. doctest:: | |
30 | ||
31 | >>> g.add_line("X\tcustom line") #doctest: +ELLIPSIS | |
32 | >>> g.add_line("Y\tcustom line") #doctest: +ELLIPSIS | |
33 | >>> [str(line) for line in g.custom_records] #doctest: +SKIP | |
34 | ['X\tcustom line', 'Y\tcustom line'] | |
35 | >>> g.custom_record_keys) #doctest: +SKIP | |
36 | ['X', 'Y'] | |
37 | >>> [str(line) for line in g.custom_records_of_type('X')] | |
38 | ['X\tcustom line'] | |
39 | >>> g.custom_records_of_type("X")[-1].disconnect() | |
40 | >>> g.custom_records_of_type('X') | |
41 | [] | |
42 | ||
43 | Interface without extensions | |
44 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
45 | ||
46 | If no extension (see :ref:`extensions` section) has been defined to handle a | |
47 | custom record type, the interface has some limitations: the field content is | |
48 | not validated, and the field names are unknown. The generic custom record | |
49 | class is employed | |
50 | (:class:`~gfapy.line.custom_record.custom_record.CustomRecord`). | |
51 | ||
52 | As the name of the positional fields in a custom record is not known, a generic | |
53 | name ``field1``, ``field2``, ... is used. The number of positional fields is | |
54 | found by getting the length of the | |
55 | :attr:`~gfapy.line.custom_record.init.Init.positional_fieldnames` list. | |
56 | ||
57 | .. doctest:: | |
58 | ||
59 | >>> g.add_line("X\ta\tb\tcc:i:10\tdd:i:100") #doctest: +ELLIPSIS | |
60 | >>> x = g.custom_records_of_type('X')[-1] | |
61 | >>> len(x.positional_fieldnames) | |
62 | 2 | |
63 | >>> x.field1 | |
64 | 'a' | |
65 | >>> x.field2 | |
66 | 'b' | |
67 | ||
68 | Positional fields are allowed to contain any character (including non-printable | |
69 | characters and spacing characters), except tabs and newlines (as they are | |
70 | structural elements of the line). No further validation is performed. | |
71 | ||
72 | As Gfapy cannot know how many positional fields are present when parsing custom | |
73 | records, a heuristic approach is followed, to identify tags. A field resembles | |
74 | a tag if it starts with ``tn:d:`` where ``tn`` is a valid tag name and ``d`` a | |
75 | valid tag datatype (see :ref:`tags` chapter). The fields are parsed from the | |
76 | last to the first. | |
77 | ||
78 | As soon as a field is found which does not resemble a tag, all remaining fields | |
79 | are considered positionals (even if another field parsed later resembles a | |
80 | tag). Due to this, invalid tags are sometimes wrongly taken as positional | |
81 | fields (this can be avoided by writing an extension). | |
82 | ||
83 | .. doctest:: | |
84 | ||
85 | >>> g.add_line("X\ta\tb\tcc:i:10\tdd:i:100") #doctest: +ELLIPSIS | |
86 | >>> x1 = g.custom_records_of_type("X")[-1] | |
87 | >>> x1.cc | |
88 | 10 | |
89 | >>> x1.dd | |
90 | 100 | |
91 | >>> g.add_line("X\ta\tb\tcc:i:10\tdd:i:100\te") #doctest: +ELLIPSIS | |
92 | >>> x2 = g.custom_records_of_type("X")[-1] | |
93 | >>> x2.cc | |
94 | >>> x2.field3 | |
95 | 'cc:i:10' | |
96 | >>> g.add_line("Z\ta\tb\tcc:i:10\tddd:i:100") #doctest: +ELLIPSIS | |
97 | >>> x3 = g.custom_records_of_type("Z")[-1] | |
98 | >>> x3.cc | |
99 | >>> x3.field3 | |
100 | 'cc:i:10' | |
101 | >>> x3.field4 | |
102 | 'ddd:i:100' | |
103 | ||
104 | .. _extensions: | |
105 | ||
106 | Extensions | |
107 | ~~~~~~~~~~ | |
108 | ||
109 | The support for custom fields is limited, as Gfapy does not know which and how | |
110 | many fields are there and how shall they be validated. It is possible to create | |
111 | an extension of Gfapy, which defines new record types: this will allow to use | |
112 | these record types in a similar way to the built-in types. | |
113 | ||
114 | As an example, an extension will be described, which defines two record types: | |
115 | T for taxa and M for assignments of segments to taxa. For further information | |
116 | about the possible usage case for this extension, see the Supplemental | |
117 | Information to the manuscript describing Gfapy. | |
118 | ||
119 | The T records will contain a single positional field, ``tid``, a GFA2 | |
120 | identifier, and an optional UL string tag. The M records will contain three | |
121 | positional fields (all three GFA2 identifier): a name field ``mid`` (optional), | |
122 | and two references, ``tid`` to a T line and ``sid`` to an S line. The SC | |
123 | integer tag will be also defined. Here is an example of a GFA containing M and | |
124 | T lines: | |
125 | ||
126 | .. code:: | |
127 | ||
128 | S sA 1000 * | |
129 | S sB 1000 * | |
130 | M assignment1 t123 sA SC:i:40 | |
131 | M assignment2 t123 sB | |
132 | M * B12c sB SC:i:20 | |
133 | T B12c | |
134 | T t123 UL:Z:http://www.taxon123.com | |
135 | ||
136 | Writing subclasses of the :class:`~gfapy.line.line.Line` class, it is possible to | |
137 | communicate to Gfapy, how records of the M and T class shall be handled. This | |
138 | only requires to define some constants and to call the class method | |
139 | :func:`~gfapy.line.line.Line.register_extension`. | |
140 | ||
141 | The constants to define are ``RECORD TYPE``, which shall be the content | |
142 | of the record type field (e.g. ``M``); ``POSFIELDS`` shall contain an ordered | |
143 | dict, specifying the datatype for each positional field, in the order these | |
144 | fields are found in the line; ``TAGS_DATATYPE`` is a dict, specifying the | |
145 | datatype of the predefined optional tags; ``NAME_FIELD`` is a field name, | |
146 | and specifies which field contains the identifier of the line. | |
147 | For details on predefined and custom datatypes, see the next sections | |
148 | (:ref:`predefined_datatypes` and :ref:`custom_datatypes`). | |
149 | ||
150 | To handle references, :func:`~gfapy.line.line.Line.register_extension` | |
151 | can be supplied with a ``references`` parameter, a list of triples | |
152 | ``(fieldname, classname, backreferences)``. Thereby ``fieldname`` is the name | |
153 | of the field in the corresponding record containing the reference (e.g. | |
154 | ``sid``), ``classname`` is the name of the class to which the reference goes | |
155 | (e.g. ``gfa.line.segment.GFA2``), and \texttt{backreferences} is how the | |
156 | collection of backreferences shall be called, in the records to which reference | |
157 | points to (e.g. ``metagenomic_assignments``). | |
158 | ||
159 | .. code:: python | |
160 | ||
161 | from collections include OrderedDict | |
162 | ||
163 | class Taxon(gfapy.Line): | |
164 | RECORD_TYPE = "T" | |
165 | POSFIELDS = OrderedDict([("tid","identifier_gfa2")]) | |
166 | TAGS_DATATYPE = {"UL":"Z"} | |
167 | NAME_FIELD = "tid" | |
168 | ||
169 | Taxon.register_extension() | |
170 | ||
171 | class MetagenomicAssignment(gfapy.Line): | |
172 | RECORD_TYPE = "M" | |
173 | POSFIELDS = OrderedDict([("mid","optional_identifier_gfa2"), | |
174 | ("tid","identifier_gfa2"), | |
175 | ("sid","identifier_gfa2")]) | |
176 | TAGS_DATATYPE = {"SC":"i"} | |
177 | NAME_FIELD = "mid" | |
178 | ||
179 | MetagenomicAssignment.register_extension(references= | |
180 | [("sid", gfapy.line.segment.GFA2, "metagenomic_assignments"), | |
181 | ("tid", Taxon, "metagenomic_assignments")]) | |
182 | ||
183 | .. _predefined_datatypes: | |
184 | ||
185 | Predefined datatypes for extensions | |
186 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
187 | ||
188 | The datatype of fields is specified in Gfapy using classes, which provide | |
189 | functions for decoding, encoding and validating the corresponding data. | |
190 | Gfapy contains a number of datatypes which correspond to the description | |
191 | of the field content in the GFA1 and GFA2 specification. | |
192 | ||
193 | When writing extensions only the GFA2 field datatypes are generally used | |
194 | (as GFA1 does not contain custom fields). They are summarized in | |
195 | the following table: | |
196 | ||
197 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
198 | | Name | Example | Description | | |
199 | +=====================================+===============+========================================================+ | |
200 | | ``alignment_gfa2`` | ``12M1I3M`` | CIGAR string, Trace alignment or Placeholder (``*``) | | |
201 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
202 | | ``identifier_gfa2`` | ``S1`` | ID of a line | | |
203 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
204 | | ``oriented_identifier_gfa2`` | ``S1+`` | ID of a line followed by ``+`` or ``-`` | | |
205 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
206 | | ``optional_identifier_gfa2`` | ``*`` | ID of a line or Placeholder (``*``) | | |
207 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
208 | | ``identifier_list_gfa2`` | ``S1 S2`` | space separated list of line IDs | | |
209 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
210 | | ``oriented_identifier_list_gfa2`` | ``S1+ S2-`` | space separated list of line IDs plus orientations | | |
211 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
212 | | ``position_gfa2`` | ``120$`` | non-negative integer, optionally followed by ``$`` | | |
213 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
214 | | ``sequence_gfa2`` | ``ACGNNYR`` | sequence of printable chars., no whitespace | | |
215 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
216 | | ``string`` | ``a b_c;d`` | string, no tabs and newlines (Z tags) | | |
217 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
218 | | ``char`` | ``A`` | single character (A tags) | | |
219 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
220 | | ``float`` | ``1.12`` | float (f tags) | | |
221 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
222 | | ``integer`` | ``-12`` | integer (i tags) | | |
223 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
224 | | ``optional_integer`` | ``*`` | integer or placeholder | | |
225 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
226 | | ``numeric_array`` | ``c,10,3`` | array of integers or floats (B tags) | | |
227 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
228 | | ``byte_array`` | ``12F1FF`` | hexadecimal byte string (H tags) | | |
229 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
230 | | ``json`` | ``{’b’:2}`` | JSON string, no tabs and newlines (J tags) | | |
231 | +-------------------------------------+---------------+--------------------------------------------------------+ | |
232 | ||
233 | .. _custom_datatypes: | |
234 | ||
235 | Custom datatypes for extensions | |
236 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
237 | ||
238 | For custom records, one sometimes needs datatypes not yet available in the GFA | |
239 | specification. For example, a custom datatype can be defined for | |
240 | the taxon identifier used in the ``tid`` field of the T and M records: | |
241 | accordingly the taxon identifier shall be only either | |
242 | in the form ``taxon:<n>``, where ``<n>`` is a positive integer, | |
243 | or consist of letters, numbers and underscores only | |
244 | (without ``:``). | |
245 | ||
246 | To define the datatype, a class is written, which contains the following | |
247 | functions: | |
248 | ||
249 | * ``validate_encoded(string)``: validates the content of the field, | |
250 | if this is a string (e.g., the name of the T line) | |
251 | * ``validate_decoded(object)``: validates the content of the field, | |
252 | if this is not a string (e.g., a reference to a T line) | |
253 | * ``decode(string)``: validates the content of the field (a string) | |
254 | and returns the decoded content; note that references must not be resolved | |
255 | (there is no access to the Gfa instance here), thus the name of the | |
256 | T line will be returned unchanged | |
257 | * ``encode(string)``: validates the content of the field (not in string | |
258 | form) and returns the string which codes it in the GFA file (also here | |
259 | references are validated but not converted into strings) | |
260 | ||
261 | Finally the datatype is registered calling | |
262 | :func:`~gfapy.field.field.Field.register_datatype`. The code for | |
263 | the taxon ID extension is the following: | |
264 | ||
265 | .. code:: python | |
266 | ||
267 | import re | |
268 | ||
269 | class TaxonID: | |
270 | ||
271 | def validate_encoded(string): | |
272 | if not re.match(r"^taxon:(\d+)$",string) and \ | |
273 | not re.match(r"^[a-zA-Z0-9_]+$", string): | |
274 | raise gfapy.ValueError("Invalid taxon ID: {}".format(string)) | |
275 | ||
276 | def decode(string): | |
277 | TaxonID.validate_encoded(string) | |
278 | return string | |
279 | ||
280 | def validate_decoded(obj): | |
281 | if isinstance(obj,Taxon): | |
282 | TaxonID.validate_encoded(obj.name) | |
283 | else: | |
284 | raise gfapy.TypeError( | |
285 | "Invalid type for taxon ID: "+"{}".format(repr(obj))) | |
286 | ||
287 | def encode(obj): | |
288 | TaxonID.validate_decoded(obj) | |
289 | return obj | |
290 | ||
291 | gfapy.Field.register_datatype("taxon_id", TaxonID) | |
292 | ||
293 | To use the new datatype in the T and M lines defined above (:ref:`extensions`), | |
294 | the definition of the two subclasses can be changed: | |
295 | in ``POSFIELDS`` the value ``taxon_id`` shall be assigned to the key ``tid``. |
0 | .. _errors: | |
1 | ||
2 | Errors | |
3 | ------ | |
4 | ||
5 | The different types of errors defined in Gfapy are summarized in the | |
6 | following table. All exception raised in the library are subclasses of | |
7 | `Error`. Thus, ``except gfapy.Error`` can be use to catch | |
8 | all library errors. | |
9 | ||
10 | +-----------------------+-------------------------------+---------------------------------+ | |
11 | | Error | Description | Examples | | |
12 | +=======================+===============================+=================================+ | |
13 | | `VersionError` | An unknown or wrong version | "GFA0"; or GFA1 in GFA2 context | | |
14 | | | is specified or implied | | | |
15 | +-----------------------+-------------------------------+---------------------------------+ | |
16 | | `ValueError` | The value of an object is | a negative position is used | | |
17 | | | invalid | | | |
18 | +-----------------------+-------------------------------+---------------------------------+ | |
19 | | `TypeError` | The wrong type has been used | Z instead of i used for VN tag; | | |
20 | | | or specified | Hash for an i tag | | |
21 | +-----------------------+-------------------------------+---------------------------------+ | |
22 | | `FormatError` | The format of an object is | a line does not contain the | | |
23 | | | wrong | expected number of fields | | |
24 | +-----------------------+-------------------------------+---------------------------------+ | |
25 | | `NotUniqueError` | Something should be unique | duplicated tag name or line | | |
26 | | | but is not | identifier | | |
27 | +-----------------------+-------------------------------+---------------------------------+ | |
28 | | `InconsistencyError` | Pieces of information collide | length of sequence and LN tag | | |
29 | | | with each other | do not match | | |
30 | +-----------------------+-------------------------------+---------------------------------+ | |
31 | | `RuntimeError` | The user tried to do | editing from/to field in | | |
32 | | | something which is not | connected links | | |
33 | | | allowed | | | |
34 | +-----------------------+-------------------------------+---------------------------------+ | |
35 | | `ArgumentError` | Problem with the arguments of | wrong number of arguments in | | |
36 | | | a method | dynamically created method | | |
37 | +-----------------------+-------------------------------+---------------------------------+ | |
38 | | `AssertionError` | Something unexpected happened | there is a bug in the library or| | |
39 | | | | the library has been used in | | |
40 | | | | an unintended way | | |
41 | +-----------------------+-------------------------------+---------------------------------+ | |
42 |
0 | .. testsetup:: * | |
1 | ||
2 | import gfapy | |
3 | gfa = gfapy.Gfa() | |
4 | gfa1 = gfapy.Gfa() | |
5 | gfa1.add_line("H\tVN:Z:1.0") | |
6 | gfa1.add_line("# this is a comment") | |
7 | gfa1.add_line("S\t1\t*") | |
8 | gfa1.add_line("S\t2\t*") | |
9 | gfa1.add_line("S\t3\t*") | |
10 | gfa2 = gfapy.Gfa() | |
11 | gfa2.add_line("H\tVN:Z:2.0\tTS:i:100") | |
12 | gfa2.add_line("X\tcustom line") | |
13 | gfa2.add_line("Y\tcustom line") | |
14 | ||
15 | .. _gfa: | |
16 | ||
17 | The Gfa class | |
18 | ------------- | |
19 | ||
20 | The content of a GFA file is represented in Gfapy by an instance of the class | |
21 | :class:`~gfapy.gfa.Gfa`. In most cases, the Gfa instance will be constructed | |
22 | from the data contained in a GFA file, using the method | |
23 | :func:`Gfa.from_file() <gfapy.gfa.Gfa.from_file>`. | |
24 | ||
25 | Alternatively, it is possible to use the construct of the class; it takes an | |
26 | optional positional parameter, the content of a GFA file (as string, or as list | |
27 | of strings, one per line of the GFA file). If no GFA content is provided, the | |
28 | Gfa instance will be empty. | |
29 | ||
30 | .. doctest:: | |
31 | ||
32 | >>> gfa = gfapy.Gfa("H\tVN:Z:1.0\nS\tA\t*") | |
33 | >>> print(len(gfa.lines)) | |
34 | 2 | |
35 | >>> gfa = gfapy.Gfa(["H\tVN:Z:1.0", "S\tA\t*", "S\tB\t*"]) | |
36 | >>> print(len(gfa.lines)) | |
37 | 3 | |
38 | >>> gfa = gfapy.Gfa() | |
39 | >>> print(len(gfa.lines)) | |
40 | 0 | |
41 | ||
42 | The string representation of the Gfa object (which can be obtained using | |
43 | ``str()``) is the textual representation in GFA format. | |
44 | Using :func:`Gfa.to_file(filename) <gfapy.gfa.Gfa.to_file>` allows | |
45 | writing this representation to a GFA file (the content of the file is | |
46 | overwritten). | |
47 | ||
48 | .. doctest:: | |
49 | ||
50 | >>> g1 = gfapy.Gfa() | |
51 | >>> g1.append("H\tVN:Z:1.0") | |
52 | >>> g1.append("S\ta\t*") | |
53 | >>> g1.to_file("my.gfa") #doctest: +SKIP | |
54 | >>> g2 = gfapy.Gfa.from_file("my.gfa") #doctest: +SKIP | |
55 | >>> str(g1) | |
56 | 'H\tVN:Z:1.0\nS\ta\t*' | |
57 | ||
58 | ||
59 | All methods for creating a Gfa (constructor and from_file) accept | |
60 | a ``vlevel`` parameter, the validation level, | |
61 | and can assume the values 0, 1, 2 and 3. A higher value means | |
62 | more validations are performed. The :ref:`validation` chapter explains | |
63 | the meaning of the different validation levels in detail. | |
64 | The default value is 1. | |
65 | ||
66 | .. doctest:: | |
67 | ||
68 | >>> gfapy.Gfa().vlevel | |
69 | 1 | |
70 | >>> gfapy.Gfa(vlevel = 0).vlevel | |
71 | 0 | |
72 | ||
73 | A further parameter is ``version``. It can be set to ``'gfa1'``, | |
74 | ``'gfa2'`` or left to the default value (``None``). The default | |
75 | is to auto-detect the version of the GFA from the line content. | |
76 | If the version is set manually, any content not compatible to the | |
77 | specified version will trigger an exception. If the version is | |
78 | set automatically, an exception will be raised if two lines | |
79 | are found, with content incompatible to each other (e.g. a GFA1 | |
80 | segment followed by a GFA2 segment). | |
81 | ||
82 | .. doctest:: | |
83 | ||
84 | >>> g = gfapy.Gfa(version='gfa2') | |
85 | >>> g.version | |
86 | 'gfa2' | |
87 | >>> g.add_line("S\t1\t*") | |
88 | Traceback (most recent call last): | |
89 | ... | |
90 | gfapy.error.VersionError: Version: 1.0 (None) | |
91 | ... | |
92 | >>> g = gfapy.Gfa() | |
93 | >>> g.version | |
94 | >>> g.add_line("S\t1\t*") | |
95 | >>> g.version | |
96 | 'gfa1' | |
97 | >>> g.add_line("S\t1\t100\t*") | |
98 | Traceback (most recent call last): | |
99 | ... | |
100 | gfapy.error.VersionError: Version: 1.0 (None) | |
101 | ... | |
102 | ||
103 | Collections of lines | |
104 | ~~~~~~~~~~~~~~~~~~~~ | |
105 | ||
106 | The property :attr:`~gfapy.lines.collections.Collections.lines` | |
107 | of the Gfa object is a list of all the lines | |
108 | in the GFA file (including the header, which is split into single-tag | |
109 | lines). The list itself shall not be modified by the user directly (i.e. | |
110 | adding and removing lines is done using a different interface, see | |
111 | below). However the single elements of the list can be edited. | |
112 | ||
113 | .. doctest:: | |
114 | ||
115 | >>> for line in gfa.lines: print(line) | |
116 | ||
117 | For most record types, a list of the lines of the record type is available | |
118 | as a read-only property, which is named after the record type, in plural. | |
119 | ||
120 | .. doctest:: | |
121 | ||
122 | >>> [str(line) for line in gfa1.segments] | |
123 | ['S\t1\t*', 'S\t2\t*', 'S\t3\t*'] | |
124 | >>> [str(line) for line in gfa2.fragments] | |
125 | [] | |
126 | ||
127 | A particular case are edges; these are in GFA1 links and containments, while in | |
128 | GFA2 there is a unified edge record type, which also allows to represent | |
129 | internal alignments. In Gfapy, the | |
130 | :attr:`~gfapy.lines.collections.Collections.edges` property retrieves all edges | |
131 | (i.e. all E lines in GFA2, and all L and C lines in GFA1). The | |
132 | :attr:`~gfapy.lines.collections.Collections.dovetails` property is a list of | |
133 | all edges which represent dovetail overlaps (i.e. all L lines in GFA1 and a | |
134 | subset of the E lines in GFA2). The | |
135 | :attr:`~gfapy.lines.collections.Collections.containments` property is a list of | |
136 | all edges which represent containments (i.e. all C lines in GFA1 and a subset | |
137 | of the E lines in GFA2). | |
138 | ||
139 | .. doctest:: | |
140 | ||
141 | >>> gfa2.edges | |
142 | [] | |
143 | >>> gfa2.dovetails | |
144 | [] | |
145 | >>> gfa2.containments | |
146 | [] | |
147 | ||
148 | Paths are retrieved using the | |
149 | :attr:`~gfapy.lines.collections.Collections.paths` property. This list | |
150 | contains all P lines in GFA1 and all O lines in GFA2. Sets returns the list of | |
151 | all U lines in GFA2 (empty list in GFA1). | |
152 | ||
153 | .. doctest:: | |
154 | ||
155 | >>> gfa2.paths | |
156 | [] | |
157 | >>> gfa2.sets | |
158 | [] | |
159 | ||
160 | The header contain metadata in a single or multiple lines. For ease of | |
161 | access to the header information, all its tags are summarized in a | |
162 | single line instance, which is retrieved using the | |
163 | :attr:`~gfapy.lines.headers.Headers.header` property. This list | |
164 | The :ref:`header` chapter of this manual explains more in | |
165 | detail, how to work with the header object. | |
166 | ||
167 | .. doctest:: | |
168 | ||
169 | >>> gfa2.header.TS | |
170 | 100 | |
171 | ||
172 | All lines which start by the string ``#`` are comments; they are handled in | |
173 | the :ref:`comments` chapter and are retrieved using the | |
174 | :attr:`~gfapy.lines.collections.Collections.comments` property. | |
175 | ||
176 | .. doctest:: | |
177 | ||
178 | >>> [str(line) for line in gfa1.comments] | |
179 | ['# this is a comment'] | |
180 | ||
181 | Custom lines are lines of GFA2 files which start | |
182 | with a non-standard record type. Gfapy provides basic built-in support | |
183 | for accessing the information in custom lines, and allows to define | |
184 | extensions for own record types for defining more advanced | |
185 | functionality (see the :ref:`custom_records` chapter). | |
186 | ||
187 | .. doctest:: | |
188 | ||
189 | >>> [str(line) for line in gfa2.custom_records] | |
190 | ['X\tcustom line', 'Y\tcustom line'] | |
191 | >>> gfa2.custom_record_keys | |
192 | ['X', 'Y'] | |
193 | >>> [str(line) for line in gfa2.custom_records_of_type('X')] | |
194 | ['X\tcustom line'] | |
195 | ||
196 | Line identifiers | |
197 | ~~~~~~~~~~~~~~~~ | |
198 | ||
199 | Some GFA lines have a mandatory or optional identifier field: segments and | |
200 | paths in GFA1, segments, gaps, edges, paths and sets in GFA2. A line of this | |
201 | type can be retrieved by identifier, using the method | |
202 | :func:`Gfa.line(ID) <gfapy.gfa.Gfa.line>` using the identifier as argument. | |
203 | ||
204 | .. doctest:: | |
205 | ||
206 | >>> str(gfa1.line('1')) | |
207 | 'S\t1\t*' | |
208 | ||
209 | The GFA2 specification prescribes the exact namespace for the identifier | |
210 | (segments, paths, sets, edges and gaps identifier share the same namespace). | |
211 | The content of this namespace can be retrieved using the | |
212 | :attr:`~gfapy.lines.collections.Collections.names` property. | |
213 | The identifiers of single line types | |
214 | can be retrieved using the properties | |
215 | :attr:`~gfapy.lines.collections.Collections.segment_names`, | |
216 | :attr:`~gfapy.lines.collections.Collections.edge_names`, | |
217 | :attr:`~gfapy.lines.collections.Collections.gap_names`, | |
218 | :attr:`~gfapy.lines.collections.Collections.path_names` and | |
219 | :attr:`~gfapy.lines.collections.Collections.set_names`. | |
220 | ||
221 | .. doctest:: | |
222 | ||
223 | >>> g = gfapy.Gfa() | |
224 | >>> g.add_line("S\tA\t100\t*") | |
225 | >>> g.add_line("S\tB\t100\t*") | |
226 | >>> g.add_line("S\tC\t100\t*") | |
227 | >>> g.add_line("E\tb_c\tB+\tC+\t0\t10\t90\t100$\t*") | |
228 | >>> g.add_line("O\tp1\tB+ C+") | |
229 | >>> g.add_line("U\ts1\tA b_c g") | |
230 | >>> g.add_line("G\tg\tA+\tB-\t1000\t*") | |
231 | >>> g.names | |
232 | ['A', 'B', 'C', 'b_c', 'g', 'p1', 's1'] | |
233 | >>> g.segment_names | |
234 | ['A', 'B', 'C'] | |
235 | >>> g.path_names | |
236 | ['p1'] | |
237 | >>> g.edge_names | |
238 | ['b_c'] | |
239 | >>> g.gap_names | |
240 | ['g'] | |
241 | >>> g.set_names | |
242 | ['s1'] | |
243 | ||
244 | The GFA1 specification does not handle the question of the namespace of | |
245 | identifiers explicitly. However, gfapy assumes and enforces | |
246 | a single namespace for segment, path names and the values of the ID tags | |
247 | of L and C lines. The content of this namespace can be found using | |
248 | :attr:`~gfapy.lines.collections.Collections.names` property. | |
249 | The identifiers of single line types | |
250 | can be retrieved using the properties | |
251 | :attr:`~gfapy.lines.collections.Collections.segment_names`, | |
252 | :attr:`~gfapy.lines.collections.Collections.edge_names` | |
253 | (ID tags of links and containments) and | |
254 | :attr:`~gfapy.lines.collections.Collections.path_names`. | |
255 | For GFA1, the properties | |
256 | :attr:`~gfapy.lines.collections.Collections.gap_names`, | |
257 | :attr:`~gfapy.lines.collections.Collections.set_names` | |
258 | contain always empty lists. | |
259 | ||
260 | .. doctest:: | |
261 | ||
262 | >>> g = gfapy.Gfa() | |
263 | >>> g.add_line("S\tA\t*") | |
264 | >>> g.add_line("S\tB\t*") | |
265 | >>> g.add_line("S\tC\t*") | |
266 | >>> g.add_line("L\tB\t+\tC\t+\t*\tID:Z:b_c") | |
267 | >>> g.add_line("P\tp1\tB+,C+\t*") | |
268 | >>> g.names | |
269 | ['A', 'B', 'C', 'b_c', 'p1'] | |
270 | >>> g.segment_names | |
271 | ['A', 'B', 'C'] | |
272 | >>> g.path_names | |
273 | ['p1'] | |
274 | >>> g.edge_names | |
275 | ['b_c'] | |
276 | >>> g.gap_names | |
277 | [] | |
278 | >>> g.set_names | |
279 | [] | |
280 | ||
281 | Identifiers of external sequences | |
282 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
283 | ||
284 | Fragments contain identifiers which refer to external sequences | |
285 | (not contained in the GFA file). According to the specification, the | |
286 | these identifiers are not part of the same namespace as the identifier | |
287 | of the GFA lines. They can be retrieved using the | |
288 | :attr:`~gfapy.lines.collections.Collections.external_names` | |
289 | property. | |
290 | ||
291 | .. doctest:: | |
292 | ||
293 | >>> g = gfapy.Gfa() | |
294 | >>> g.add_line("S\tA\t100\t*") | |
295 | >>> g.add_line("F\tA\tread1+\t10\t30\t0\t20$\t20M") | |
296 | >>> g.external_names | |
297 | ['read1'] | |
298 | ||
299 | The method | |
300 | :func:`Gfa.fragments_for_external(external_ID) <gfapy.lines.finders.Finders.fragments_for_external>` | |
301 | retrieves all F lines with a specified external sequence identifier. | |
302 | ||
303 | .. doctest:: | |
304 | ||
305 | >>> f = g.fragments_for_external('read1') | |
306 | >>> len(f) | |
307 | 1 | |
308 | >>> str(f[0]) | |
309 | 'F\tA\tread1+\t10\t30\t0\t20$\t20M' | |
310 | ||
311 | Adding new lines | |
312 | ~~~~~~~~~~~~~~~~ | |
313 | ||
314 | New lines can be added to a Gfa instance using the | |
315 | :func:`Gfa.add_line(line) <gfapy.lines.creators.Creators.add_line>` | |
316 | method or its alias | |
317 | :func:`Gfa.append(line) <gfapy.lines.creators.Creators.append>`. | |
318 | The argument can be either a string | |
319 | describing a line with valid GFA syntax, or a :class:`~gfapy.line.line.Line` | |
320 | instance. If a string is added, a line instance is created and | |
321 | then added. | |
322 | ||
323 | .. doctest:: | |
324 | ||
325 | >>> g = gfapy.Gfa() | |
326 | >>> g.add_line("S\tA\t*") #doctest: +ELLIPSIS | |
327 | >>> g.segment_names | |
328 | ['A'] | |
329 | >>> g.append("S\tB\t*") #doctest: +ELLIPSIS | |
330 | >>> g.segment_names | |
331 | ['A', 'B'] | |
332 | ||
333 | Editing the lines | |
334 | ~~~~~~~~~~~~~~~~~ | |
335 | ||
336 | Accessing the information stored in the fields of a line instance is | |
337 | described in the :ref:`positional_fields` and :ref:`tags` chapters. | |
338 | ||
339 | In Gfapy, a line instance belonging to a Gfa instance is said | |
340 | to be *connected* to the Gfa instance. Direct editing the content of a connected | |
341 | line is only possible, for those fields which do not contain | |
342 | references to other lines. For more information on how to modify the content of | |
343 | the fields of connected line, see the :ref:`references` chapter. | |
344 | ||
345 | .. doctest:: | |
346 | ||
347 | >>> g = gfapy.Gfa() | |
348 | >>> e = gfapy.Line("E\t*\tA+\tB-\t0\t10\t90\t100$\t*") | |
349 | >>> e.sid1 = "C+" | |
350 | >>> g.add_line(e) #doctest: +ELLIPSIS | |
351 | >>> e.sid1 = "A+" | |
352 | Traceback (most recent call last): | |
353 | gfapy.error.RuntimeError: ... | |
354 | ||
355 | Removing lines | |
356 | ~~~~~~~~~~~~~~ | |
357 | ||
358 | Disconnecting a line from the Gfa instance is done using the | |
359 | :func:`Gfa.rm(line) <gfapy.lines.destructors.Destructors.rm>` method. The | |
360 | argument can be a line instance or the name of a line. | |
361 | ||
362 | In alternative, a line instance can also be disconnected using the | |
363 | `disconnect` method on it. Disconnecting a line | |
364 | may trigger other operations, such as the disconnection of other lines (see the | |
365 | :ref:`references` chapter). | |
366 | ||
367 | .. doctest:: | |
368 | ||
369 | >>> g = gfapy.Gfa() | |
370 | >>> g.add_line("S\tA\t*") #doctest: +ELLIPSIS | |
371 | >>> g.segment_names | |
372 | ['A'] | |
373 | >>> g.rm('A') #doctest: +ELLIPSIS | |
374 | >>> g.segment_names | |
375 | [] | |
376 | >>> g.append("S\tB\t*") #doctest: +ELLIPSIS | |
377 | >>> g.segment_names | |
378 | ['B'] | |
379 | >>> b = g.line('B') | |
380 | >>> b.disconnect() | |
381 | >>> g.segment_names | |
382 | [] | |
383 | ||
384 | Renaming lines | |
385 | ~~~~~~~~~~~~~~ | |
386 | ||
387 | Lines with an identifier can be renamed. This is done simply by editing | |
388 | the corresponding field (such as ``name`` or ``sid`` for a segment). | |
389 | This field is not a reference to another line and can be freely edited | |
390 | also in line instances connected to a Gfa. All references to the line | |
391 | from other lines will still be up to date, as they will refer to the | |
392 | same instance (whose name has been changed) and their string | |
393 | representation will use the new name. | |
394 | ||
395 | .. doctest:: | |
396 | ||
397 | >>> g = gfapy.Gfa() | |
398 | >>> g.add_line("S\tA\t*") #doctest: +ELLIPSIS | |
399 | >>> g.add_line("L\tA\t+\tB\t-\t*") #doctest: +ELLIPSIS | |
400 | >>> g.segment_names | |
401 | ['A', 'B'] | |
402 | >>> g.dovetails[0].from_name | |
403 | 'A' | |
404 | >>> g.segment('A').name = 'C' | |
405 | >>> g.segment_names | |
406 | ['B', 'C'] | |
407 | >>> g.dovetails[0].from_name | |
408 | 'C' |
0 | .. _graph_operations: | |
1 | ||
2 | Graph operations | |
3 | ---------------- | |
4 | ||
5 | Graph operations such as linear paths merging, multiplication of | |
6 | segments and other are provided. These operations are implemented | |
7 | in analogy to those provided by the Ruby library RGFA. As RGFA only | |
8 | handles GFA1 graphs, only dovetail overlaps are considered as | |
9 | connections. A detailed description of the operation can be | |
10 | found in Gonnella and Kurtz (2016). More information about the | |
11 | single operations are found in the method documentation of the | |
12 | submodules of `GraphOperations`. | |
13 |
0 | .. testsetup:: * | |
1 | ||
2 | import gfapy | |
3 | gfa = gfapy.Gfa() | |
4 | ||
5 | .. _header: | |
6 | ||
7 | The Header | |
8 | ---------- | |
9 | ||
10 | GFA files may contain one or multiple header lines (record type: "H"). These | |
11 | lines may be present in any part of the file, not necessarily at the beginning. | |
12 | ||
13 | Although the header may consist of multiple lines, its content refers to the | |
14 | whole file. Therefore in Gfapy the header is accessed using a single line | |
15 | instance (accessible by the :attr:`~gfapy.lines.headers.Headers.header` | |
16 | property). Header lines contain only tags. If not header line is present in the | |
17 | Gfa, then the header line object will be empty (i.e. contain no tags). | |
18 | ||
19 | Note that header lines cannot be connected to the Gfa as other lines (i.e. | |
20 | calling :meth:`~gfapy.line.common.connection.Connection.connect` on them raises | |
21 | an exception). Instead they must be merged to the existing Gfa header, using | |
22 | `add_line` on the Gfa instance. | |
23 | ||
24 | .. doctest:: | |
25 | ||
26 | >>> gfa.add_line("H\tnn:f:1.0") #doctest: +ELLIPSIS | |
27 | >>> gfa.header.nn | |
28 | 1.0 | |
29 | >>> gfapy.Line("H\tnn:f:1.0").connect(gfa) | |
30 | Traceback (most recent call last): | |
31 | ... | |
32 | gfapy.error.RuntimeError: ... | |
33 | ||
34 | Multiple definitions of the predefined header tags | |
35 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
36 | ||
37 | For the predefined tags (``VN`` and ``TS``), the presence of multiple | |
38 | values in different lines is an error, unless the value is the same in | |
39 | each instance (in which case the repeated definitions are ignored). | |
40 | ||
41 | .. doctest:: | |
42 | ||
43 | >>> gfa.add_line("H\tVN:Z:1.0") #doctest: +ELLIPSIS | |
44 | >>> gfa.add_line("H\tVN:Z:1.0") # ignored #doctest: +ELLIPSIS | |
45 | >>> gfa.add_line("H\tVN:Z:2.0") | |
46 | Traceback (most recent call last): | |
47 | ... | |
48 | gfapy.error.VersionError: ... | |
49 | ||
50 | Multiple definitions of custom header tags | |
51 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
52 | ||
53 | If the tags are present only once in the header in its entirety, the access to | |
54 | the tags is the same as for any other line (see the :ref:`tags` chapter). | |
55 | ||
56 | However, the specification does not forbid custom tags to be defined with | |
57 | different values in different header lines (which we name "multi-definition | |
58 | tags"). This particular case is handled in the next sections. | |
59 | ||
60 | Reading multi-definitions tags | |
61 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
62 | ||
63 | Reading, validating and setting the datatype of multi-definition tags is done | |
64 | using the same methods as for all other lines (see the :ref:`tags` chapter). | |
65 | However, if a tag is defined multiple times on multiple H lines, reading the | |
66 | tag will return a list of the values on the lines. This array is an instance of | |
67 | the subclass ``gfapy.FieldArray`` of list. | |
68 | ||
69 | .. doctest:: | |
70 | ||
71 | >>> gfa.add_line("H\txx:i:1") #doctest: +ELLIPSIS | |
72 | >>> gfa.add_line("H\txx:i:2") #doctest: +ELLIPSIS | |
73 | >>> gfa.add_line("H\txx:i:3") #doctest: +ELLIPSIS | |
74 | >>> gfa.header.xx | |
75 | gfapy.FieldArray('i',[1, 2, 3]) | |
76 | ||
77 | Setting tags | |
78 | ~~~~~~~~~~~~ | |
79 | ||
80 | There are two possibilities to set a tag for the header. The first is | |
81 | the normal tag interface (using ``set`` or the tag name property). The | |
82 | second is to use ``add``. The latter supports multi-definition tags, | |
83 | i.e. it adds the value to the previous ones (if any), instead of | |
84 | overwriting them. | |
85 | ||
86 | .. doctest:: | |
87 | ||
88 | >>> gfa = gfapy.Gfa() | |
89 | >>> gfa.header.xx | |
90 | >>> gfa.header.add("xx", 1) | |
91 | >>> gfa.header.xx | |
92 | 1 | |
93 | >>> gfa.header.add("xx", 2) | |
94 | >>> gfa.header.xx | |
95 | gfapy.FieldArray('i',[1, 2]) | |
96 | >>> gfa.header.set("xx", 3) | |
97 | >>> gfa.header.xx | |
98 | 3 | |
99 | ||
100 | Modifying field array values | |
101 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
102 | ||
103 | Field arrays can be modified directly (e.g. adding new values or | |
104 | removing some values). After modification, the user may check if the | |
105 | array values remain compatible with the datatype of the tag using the | |
106 | :meth:`~gfapy.line.common.validate.Validate.validate_field`` method. | |
107 | ||
108 | .. doctest:: | |
109 | ||
110 | >>> gfa = gfapy.Gfa() | |
111 | >>> gfa.header.xx = gfapy.FieldArray('i',[1,2,3]) | |
112 | >>> gfa.header.xx | |
113 | gfapy.FieldArray('i',[1, 2, 3]) | |
114 | >>> gfa.header.validate_field("xx") | |
115 | >>> gfa.header.xx.append("X") | |
116 | >>> gfa.header.validate_field("xx") | |
117 | Traceback (most recent call last): | |
118 | ... | |
119 | gfapy.error.FormatError: ... | |
120 | ||
121 | If the field array is modified using array methods which return a list | |
122 | or data of any other type, a field array must be constructed, setting | |
123 | its datatype to the value returned by calling | |
124 | :meth:`~gfapy.line.common.field_datatype.FieldDatatype.get_datatype` | |
125 | on the header. | |
126 | ||
127 | .. doctest:: | |
128 | ||
129 | >>> gfa = gfapy.Gfa() | |
130 | >>> gfa.header.xx = gfapy.FieldArray('i',[1,2,3]) | |
131 | >>> gfa.header.xx | |
132 | gfapy.FieldArray('i',[1, 2, 3]) | |
133 | >>> gfa.header.xx = gfapy.FieldArray(gfa.header.get_datatype("xx"), | |
134 | ... list(map(lambda x: x+1, gfa.header.xx))) | |
135 | >>> gfa.header.xx | |
136 | gfapy.FieldArray('i',[2, 3, 4]) | |
137 | ||
138 | String representation of the header | |
139 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
140 | ||
141 | For consistency with other line types, the string representation of the header | |
142 | is a single-line string, eventually non standard-compliant, if it contains | |
143 | multiple instances of the tag. (and when calling | |
144 | :meth:`~gfapy.line.common.writer.Writer.field_to_s` for a tag present multiple | |
145 | times, the output string will contain the instances of the tag, separated by | |
146 | tabs). | |
147 | ||
148 | However, when the Gfa is output to file or string, the header is split into | |
149 | multiple H lines with single tags, so that standard-compliant GFA is output. | |
150 | The split header can be retrieved using the | |
151 | :attr:`~gfapy.lines.headers.Headers.headers` property of the Gfa instance. | |
152 | ||
153 | .. doctest:: | |
154 | ||
155 | >>> gfa = gfapy.Gfa() | |
156 | >>> gfa.header.VN = "1.0" | |
157 | >>> gfa.header.xx = gfapy.FieldArray('i',[1,2]) | |
158 | >>> gfa.header.field_to_s("xx") | |
159 | '1\t2' | |
160 | >>> gfa.header.field_to_s("xx", tag=True) | |
161 | 'xx:i:1\txx:i:2' | |
162 | >>> str(gfa.header) | |
163 | 'H\tVN:Z:1.0\txx:i:1\txx:i:2' | |
164 | >>> [str(h) for h in gfa.headers] | |
165 | ['H\tVN:Z:1.0', 'H\txx:i:1', 'H\txx:i:2'] | |
166 | >>> str(gfa) | |
167 | 'H\tVN:Z:1.0\nH\txx:i:1\nH\txx:i:2' | |
168 |
0 | .. testsetup:: * | |
1 | ||
2 | import gfapy | |
3 | ||
4 | .. _placeholders: | |
5 | ||
6 | Placeholders | |
7 | ------------ | |
8 | ||
9 | Some positional fields may contain an undefined value S: ``sequence``; | |
10 | L/C: ``overlap``; P: ``overlaps``; E: ``eid``, ``alignment``; F: | |
11 | ``alignment``; G: ``gid``, ``var``; U/O: ``pid``. In GFA this value is | |
12 | represented by a ``*``. | |
13 | ||
14 | In Gfapy the class `Placeholder` represent the undefined value. | |
15 | ||
16 | Distinguishing placeholders | |
17 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
18 | ||
19 | The :func:`gfapy.is_placeholder() <gfapy.placeholder.is_placeholder>` method | |
20 | allows to check if a value is a placeholder; a value is a placeholder if | |
21 | it is a `Placeholder` instance, or would represent | |
22 | a placeholder in GFA (a string containing ``*``), or would be represented | |
23 | by a placeholder in GFA (e.g. an empty array). | |
24 | ||
25 | .. doctest:: | |
26 | ||
27 | >>> gfapy.is_placeholder("*") | |
28 | True | |
29 | >>> gfapy.is_placeholder("**") | |
30 | False | |
31 | >>> gfapy.is_placeholder([]) | |
32 | True | |
33 | >>> gfapy.is_placeholder(gfapy.Placeholder()) | |
34 | True | |
35 | ||
36 | Note that, as a placeholder is ``False`` in boolean context, just a | |
37 | ``if not placeholder`` will also work, if the value is an instance | |
38 | of `Placeholder`, but not always for the other cases (in particular not | |
39 | for the string representation ``*``). | |
40 | Therefore using | |
41 | :func:`gfapy.is_placeholder() <gfapy.placeholder.is_placeholder>` | |
42 | is better. | |
43 | ||
44 | .. doctest:: | |
45 | ||
46 | >>> if "*": print('* is not a placeholder') | |
47 | * is not a placeholder | |
48 | >>> if gfapy.is_placeholder("*"): print('but it represents a placeholder') | |
49 | but it represents a placeholder | |
50 | ||
51 | Compatibility methods | |
52 | ~~~~~~~~~~~~~~~~~~~~~ | |
53 | ||
54 | Some methods are defined for placeholders, which allow them to respond | |
55 | to the same methods as defined values. This allows to write generic | |
56 | code. | |
57 | ||
58 | .. doctest:: | |
59 | ||
60 | >>> placeholder = gfapy.Placeholder() | |
61 | >>> placeholder.validate() # does nothing | |
62 | >>> len(placeholder) | |
63 | 0 | |
64 | >>> placeholder[1] | |
65 | gfapy.Placeholder() | |
66 | >>> placeholder + 1 | |
67 | gfapy.Placeholder() | |
68 |
0 | .. testsetup:: * | |
1 | ||
2 | import gfapy | |
3 | gfa = gfapy.Gfa() | |
4 | ||
5 | .. _positional_fields: | |
6 | ||
7 | Positional fields | |
8 | ----------------- | |
9 | ||
10 | Most lines in GFA have positional fields (Headers are an exception). | |
11 | During parsing, if a line is encountered, which has too less or too many | |
12 | positional fields, an exception will be thrown. The correct number of | |
13 | positional fields is record type-specific. | |
14 | ||
15 | Positional fields are recognized by its position in the line. Each | |
16 | positional field has an implicit field name and datatype associated with | |
17 | it. | |
18 | ||
19 | Field names | |
20 | ~~~~~~~~~~~ | |
21 | ||
22 | The field names are derived from the specification. Lower case versions | |
23 | of the field names are used and spaces are substituted with underscores. | |
24 | In some cases, the field names were changed, as they represent keywords | |
25 | in common programming languages (``from``, ``send``). | |
26 | ||
27 | The following tables shows the field names used in Gfapy, for each kind | |
28 | of line. Headers have no positional fields. Comments and custom records | |
29 | follow particular rules, see the respective chapters (:ref:`comments` and | |
30 | :ref:`custom_records`). | |
31 | ||
32 | GFA1 field names | |
33 | ^^^^^^^^^^^^^^^^ | |
34 | ||
35 | +---------------+--------------------+---------------------+------------------+-----------------+---------------+---------------+ | |
36 | | Record Type | Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 | | |
37 | +===============+====================+=====================+==================+=================+===============+===============+ | |
38 | | Segment | ``name`` | ``sequence`` | | | | | | |
39 | +---------------+--------------------+---------------------+------------------+-----------------+---------------+---------------+ | |
40 | | Link | ``from_segment`` | ``from_orient`` | ``to_segment`` | ``to_orient`` | ``overlap`` | | | |
41 | +---------------+--------------------+---------------------+------------------+-----------------+---------------+---------------+ | |
42 | | Containment | ``from_segment`` | ``from_orient`` | ``to_segment`` | ``to_orient`` | ``pos`` | ``overlap`` | | |
43 | +---------------+--------------------+---------------------+------------------+-----------------+---------------+---------------+ | |
44 | | Path | ``path_name`` | ``segment_names`` | ``overlaps`` | | | | | |
45 | +---------------+--------------------+---------------------+------------------+-----------------+---------------+---------------+ | |
46 | ||
47 | GFA2 field names | |
48 | ^^^^^^^^^^^^^^^^ | |
49 | ||
50 | +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+ | |
51 | | Record Type | Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 | Field 7 | Field 8 | | |
52 | +===============+===========+================+================+=============+=============+=============+=================+=================+ | |
53 | | Segment | ``sid`` | ``slen`` | ``sequence`` | | | | | | | |
54 | +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+ | |
55 | | Edge | ``eid`` | ``sid1`` | ``sid2`` | ``beg1`` | ``end1`` | ``beg2`` | ``end2`` | ``alignment`` | | |
56 | +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+ | |
57 | | Fragment | ``sid`` | ``external`` | ``s_beg`` | ``s_end`` | ``f_beg`` | ``f_end`` | ``alignment`` | | | |
58 | +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+ | |
59 | | Gap | ``gid`` | ``sid1`` | ``d1`` | ``d2`` | ``sid2`` | ``disp`` | ``var`` | | | |
60 | +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+ | |
61 | | Set | ``pid`` | ``items`` | | | | | | | | |
62 | +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+ | |
63 | | Path | ``pid`` | ``items`` | | | | | | | | |
64 | +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+ | |
65 | ||
66 | Datatypes | |
67 | ~~~~~~~~~ | |
68 | ||
69 | The datatype of each positional field is described in the specification | |
70 | and cannot be changed (differently from tags). Here is a short | |
71 | description of the Python classes used to represent data for different | |
72 | datatypes. | |
73 | ||
74 | Placeholders | |
75 | ^^^^^^^^^^^^ | |
76 | ||
77 | The positional fields in GFA can never be empty. However, there are some | |
78 | fields with optional values. If a value is not specified, a placeholder | |
79 | character is used instead (``*``). Such undefined values are represented | |
80 | in Gfapy by the `Placeholder` class, which is described more in | |
81 | detail in the :ref:`placeholders` chapter. | |
82 | ||
83 | Arrays | |
84 | ^^^^^^ | |
85 | ||
86 | The ``items`` field in unordered and ordered groups and the | |
87 | ``segment_names`` and ``overlaps`` fields in paths are lists of objects | |
88 | and are represented by list instances. | |
89 | ||
90 | .. doctest:: | |
91 | ||
92 | >>> set = gfapy.Line("U\t*\t1 A 2") | |
93 | >>> type(set.items) | |
94 | <class 'list'> | |
95 | >>> gfa2_path = gfapy.Line("O\t*\tA+ B-") | |
96 | >>> type(gfa2_path.items) | |
97 | <class 'list'> | |
98 | >>> gfa1_path = gfapy.Line("P\tp1\tA+,B-\t10M,9M1D1M") | |
99 | >>> type(gfa1_path.segment_names) | |
100 | <class 'list'> | |
101 | >>> type(gfa1_path.overlaps) | |
102 | <class 'list'> | |
103 | ||
104 | Orientations | |
105 | ^^^^^^^^^^^^ | |
106 | ||
107 | Orientations are represented by strings. The ``gfapy.invert()`` method | |
108 | applied to an orientation string returns the other orientation. | |
109 | ||
110 | .. doctest:: | |
111 | ||
112 | >>> gfapy.invert("+") | |
113 | '-' | |
114 | >>> gfapy.invert("-") | |
115 | '+' | |
116 | ||
117 | Identifiers | |
118 | ^^^^^^^^^^^ | |
119 | ||
120 | The identifier of the line itself (available for S, P, E, G, U, O lines) | |
121 | can always be accessed in Gfapy using the ``name`` alias and is | |
122 | represented in Gfapy by a string. If it is optional (E, G, U, O lines) | |
123 | and not specified, it is represented by a Placeholder instance. The | |
124 | fragment identifier is also a string. | |
125 | ||
126 | Identifiers which refer to other lines are also present in some line | |
127 | types (L, C, E, G, U, O, F). These are never placeholders and in | |
128 | stand-alone lines are represented by strings. In connected lines they | |
129 | are references to the Line instances to which they refer to (see the | |
130 | :ref:`references` chapter). | |
131 | ||
132 | Oriented identifiers | |
133 | ^^^^^^^^^^^^^^^^^^^^ | |
134 | ||
135 | Oriented identifiers (e.g. ``segment_names`` in GFA1 paths) are | |
136 | represented by elements of the class ``gfapy.OrientedLine``. The | |
137 | ``segment`` method of the oriented segments returns the segment | |
138 | identifier (or segment reference in connected path lines) and the | |
139 | ``orient`` method returns the orientation string. The ``name`` method | |
140 | returns the string of the segment, even if this is a reference to a | |
141 | segment. A new oriented line can be created using the | |
142 | ``OL[line, orientation]`` method. | |
143 | ||
144 | Calling ``invert`` returns an oriented segment, with inverted | |
145 | orientation. To set the two attributes the methods ``segment=`` and | |
146 | ``orient=`` are available. | |
147 | ||
148 | Examples: | |
149 | ||
150 | .. doctest:: | |
151 | ||
152 | >>> p = gfapy.Line("P\tP1\ta+,b-\t*") | |
153 | >>> p.segment_names | |
154 | [gfapy.OrientedLine('a','+'), gfapy.OrientedLine('b','-')] | |
155 | >>> sn0 = p.segment_names[0] | |
156 | >>> sn0.line | |
157 | 'a' | |
158 | >>> sn0.name | |
159 | 'a' | |
160 | >>> sn0.orient | |
161 | '+' | |
162 | >>> sn0.invert() | |
163 | >>> sn0 | |
164 | gfapy.OrientedLine('a','-') | |
165 | >>> sn0.orient | |
166 | '-' | |
167 | >>> sn0.line = gfapy.Line('S\tX\t*') | |
168 | >>> str(sn0) | |
169 | 'X-' | |
170 | >>> sn0.name | |
171 | 'X' | |
172 | >>> sn0 = gfapy.OrientedLine(gfapy.Line('S\tY\t*'), '+') | |
173 | ||
174 | Sequences | |
175 | ^^^^^^^^^ | |
176 | ||
177 | Sequences (S field sequence) are represented by strings in Gfapy. | |
178 | Depending on the GFA version, the alphabet definition is more or less | |
179 | restrictive. The definitions are correctly applied by the validation | |
180 | methods. | |
181 | ||
182 | The method ``rc()`` is provided to compute the reverse complement of a | |
183 | nucleotidic sequence. The extended IUPAC alphabet is understood by the | |
184 | method. Applied to non nucleotidic sequences, the results will be | |
185 | meaningless: | |
186 | ||
187 | .. doctest:: | |
188 | ||
189 | >>> from gfapy.sequence import rc | |
190 | >>> rc("gcat") | |
191 | 'atgc' | |
192 | >>> rc("*") | |
193 | '*' | |
194 | >>> rc("yatc") | |
195 | 'gatr' | |
196 | >>> rc("gCat") | |
197 | 'atGc' | |
198 | >>> rc("cag", rna=True) | |
199 | 'cug' | |
200 | ||
201 | Integers and positions | |
202 | ^^^^^^^^^^^^^^^^^^^^^^ | |
203 | ||
204 | The C lines ``pos`` field and the G lines ``disp`` and ``var`` fields | |
205 | are represented by integers. The ``var`` field is optional, and thus can | |
206 | be also a placeholder. Positions are 0-based coordinates. | |
207 | ||
208 | The position fields of GFA2 E lines (``beg1, beg2, end1, end2``) and F | |
209 | lines (``s_beg, s_end, f_beg, f_end``) contain a dollar string as suffix | |
210 | if the position is equal to the segment length. For more information, | |
211 | see the :ref:`positions` chapter. | |
212 | ||
213 | Alignments | |
214 | ^^^^^^^^^^ | |
215 | ||
216 | Alignments are always optional, ie they can be placeholders. If they are | |
217 | specified they are CIGAR alignments or, only in GFA2, trace alignments. | |
218 | For more details, see the :ref:`alignments` chapter. | |
219 | ||
220 | GFA1 datatypes | |
221 | ^^^^^^^^^^^^^^ | |
222 | ||
223 | +------------------------+---------------+--------------------------------+ | |
224 | | Datatype | Record Type | Fields | | |
225 | +========================+===============+================================+ | |
226 | | Identifier | Segment | ``name`` | | |
227 | +------------------------+---------------+--------------------------------+ | |
228 | | | Path | ``path_name`` | | |
229 | +------------------------+---------------+--------------------------------+ | |
230 | | | Link | ``from_segment, to_segment`` | | |
231 | +------------------------+---------------+--------------------------------+ | |
232 | | | Containment | ``from_segment, to_segment`` | | |
233 | +------------------------+---------------+--------------------------------+ | |
234 | | [OrientedIdentifier] | Path | ``segment_names`` | | |
235 | +------------------------+---------------+--------------------------------+ | |
236 | | Orientation | Link | ``from_orient, to_orient`` | | |
237 | +------------------------+---------------+--------------------------------+ | |
238 | | | Containment | ``from_orient, to_orient`` | | |
239 | +------------------------+---------------+--------------------------------+ | |
240 | | Sequence | Segment | ``sequence`` | | |
241 | +------------------------+---------------+--------------------------------+ | |
242 | | Alignment | Link | ``overlap`` | | |
243 | +------------------------+---------------+--------------------------------+ | |
244 | | | Containment | ``overlap`` | | |
245 | +------------------------+---------------+--------------------------------+ | |
246 | | [Alignment] | Path | ``overlaps`` | | |
247 | +------------------------+---------------+--------------------------------+ | |
248 | | Position | Containment | ``pos`` | | |
249 | +------------------------+---------------+--------------------------------+ | |
250 | ||
251 | GFA2 datatypes | |
252 | ^^^^^^^^^^^^^^ | |
253 | ||
254 | +------------------------+---------------+----------------------------------+ | |
255 | | Datatype | Record Type | Fields | | |
256 | +========================+===============+==================================+ | |
257 | | Itentifier | Segment | ``sid`` | | |
258 | +------------------------+---------------+----------------------------------+ | |
259 | | | Fragment | ``sid`` | | |
260 | +------------------------+---------------+----------------------------------+ | |
261 | | OrientedIdentifier | Edge | ``sid1, sid2`` | | |
262 | +------------------------+---------------+----------------------------------+ | |
263 | | | Gap | ``sid1, sid2`` | | |
264 | +------------------------+---------------+----------------------------------+ | |
265 | | | Fragment | ``external`` | | |
266 | +------------------------+---------------+----------------------------------+ | |
267 | | OptionalIdentifier | Edge | ``eid`` | | |
268 | +------------------------+---------------+----------------------------------+ | |
269 | | | Gap | ``gid`` | | |
270 | +------------------------+---------------+----------------------------------+ | |
271 | | | U Group | ``oid`` | | |
272 | +------------------------+---------------+----------------------------------+ | |
273 | | | O Group | ``uid`` | | |
274 | +------------------------+---------------+----------------------------------+ | |
275 | | [Identifier] | U Group | ``items`` | | |
276 | +------------------------+---------------+----------------------------------+ | |
277 | | [OrientedIdentifier] | O Group | ``items`` | | |
278 | +------------------------+---------------+----------------------------------+ | |
279 | | Sequence | Segment | ``sequence`` | | |
280 | +------------------------+---------------+----------------------------------+ | |
281 | | Alignment | Edge | ``alignment`` | | |
282 | +------------------------+---------------+----------------------------------+ | |
283 | | | Fragment | ``alignment`` | | |
284 | +------------------------+---------------+----------------------------------+ | |
285 | | Position | Edge | ``beg1, end1, beg2, end2`` | | |
286 | +------------------------+---------------+----------------------------------+ | |
287 | | | Fragment | ``s_beg, s_end, f_beg, f_end`` | | |
288 | +------------------------+---------------+----------------------------------+ | |
289 | | Integer | Gap | ``disp, var`` | | |
290 | +------------------------+---------------+----------------------------------+ | |
291 | ||
292 | Reading and writing positional fields | |
293 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
294 | ||
295 | The ``positional_fieldnames`` method returns the list of the names (as | |
296 | strings) of the positional fields of a line. The positional fields can | |
297 | be read using a method on the Gfapy line object, which is called as the | |
298 | field name. Setting the value is done with an equal sign version of the | |
299 | field name method (e.g. segment.slen = 120). In alternative, the | |
300 | ``set(fieldname, value)`` and ``get(fieldname)`` methods can also be | |
301 | used. | |
302 | ||
303 | .. doctest:: | |
304 | ||
305 | >>> s_gfa1 = gfapy.Line("S\t1\t*") | |
306 | >>> s_gfa1.positional_fieldnames | |
307 | ['name', 'sequence'] | |
308 | >>> s_gfa1.name | |
309 | '1' | |
310 | >>> s_gfa1.get("name") | |
311 | '1' | |
312 | >>> s_gfa1.name = "segment2" | |
313 | >>> s_gfa1.name | |
314 | 'segment2' | |
315 | >>> s_gfa1.set('name',"3") | |
316 | >>> s_gfa1.name | |
317 | '3' | |
318 | ||
319 | When a field is read, the value is converted into an appropriate object. | |
320 | The string representation of a field can be read using the | |
321 | ``field_to_s(fieldname)`` method. | |
322 | ||
323 | .. doctest:: | |
324 | ||
325 | >>> gfa = gfapy.Gfa() | |
326 | >>> gfa.add_line("S\ts1\t*") | |
327 | >>> gfa.add_line("L\ts1\t+\ts2\t-\t*") | |
328 | >>> link = gfa.dovetails[0] | |
329 | >>> str(link.from_segment) | |
330 | 'S\ts1\t*' | |
331 | >>> link.field_to_s('from_segment') | |
332 | 's1' | |
333 | ||
334 | When setting a non-string field, the user can specify the value of a tag | |
335 | either as a Python non-string object, or as the string representation of | |
336 | the value. | |
337 | ||
338 | .. doctest:: | |
339 | ||
340 | >>> gfa = gfapy.Gfa(version='gfa1') | |
341 | >>> gfa.add_line("C\ta\t+\tb\t-\t10\t*") | |
342 | >>> c = gfa.containments[0] | |
343 | >>> c.pos | |
344 | 10 | |
345 | >>> c.pos = 1 | |
346 | >>> c.pos | |
347 | 1 | |
348 | >>> c.pos = "2" | |
349 | >>> c.pos | |
350 | 2 | |
351 | >>> c.field_to_s("pos") | |
352 | '2' | |
353 | ||
354 | Note that setting the value of reference and backreferences-related | |
355 | fields is generally not allowed, when a line instance is connected to a | |
356 | Gfa object (see the :ref:`references` chapter). | |
357 | ||
358 | .. doctest:: | |
359 | ||
360 | >>> gfa = gfapy.Gfa(version='gfa1') | |
361 | >>> l = gfapy.Line("L\ts1\t+\ts2\t-\t*") | |
362 | >>> l.from_name | |
363 | 's1' | |
364 | >>> l.from_segment = "s3" | |
365 | >>> l.from_name | |
366 | 's3' | |
367 | >>> gfa.add_line(l) | |
368 | >>> l.from_segment = "s4" | |
369 | Traceback (most recent call last): | |
370 | ... | |
371 | gfapy.error.RuntimeError: ... | |
372 | ||
373 | Validation | |
374 | ~~~~~~~~~~ | |
375 | ||
376 | The content of all positional fields must be a correctly formatted | |
377 | string according to the rules given in the GFA specifications (or a | |
378 | Python object whose string representation is a correctly formatted | |
379 | string). | |
380 | ||
381 | Depending on the validation level, more or less checks are done | |
382 | automatically (see the :ref:`validation` chapter). Not regarding which | |
383 | validation level is selected, the user can trigger a manual validation | |
384 | using the ``validate_field(fieldname)`` method for a single field, or | |
385 | using ``validate``, which does a full validation on the whole line, | |
386 | including all positional fields. | |
387 | ||
388 | .. doctest:: | |
389 | ||
390 | >>> line = gfapy.Line("H\txx:i:1") | |
391 | >>> line.validate_field("xx") | |
392 | >>> line.validate() | |
393 | ||
394 | Aliases | |
395 | ~~~~~~~ | |
396 | ||
397 | For some fields, aliases are defined, which can be used in all contexts | |
398 | where the original field name is used (i.e. as parameter of a method, | |
399 | and the same setter and getter methods defined for the original field | |
400 | name are also defined for each alias, see below). | |
401 | ||
402 | .. doctest:: | |
403 | ||
404 | >>> gfa1_path = gfapy.Line("P\tX\t1-,2+,3+\t*") | |
405 | >>> gfa1_path.name == gfa1_path.path_name | |
406 | True | |
407 | >>> edge = gfapy.Line("E\t*\tA+\tB-\t0\t10\t90\t100$\t*") | |
408 | >>> edge.eid == edge.name | |
409 | True | |
410 | >>> containment = gfapy.Line("C\tA\t+\tB\t-\t10\t*") | |
411 | >>> containment.from_segment == containment.container | |
412 | True | |
413 | >>> segment = gfapy.Line("S\t1\t*") | |
414 | >>> segment.sid == segment.name | |
415 | True | |
416 | >>> segment.sid | |
417 | '1' | |
418 | >>> segment.name = '2' | |
419 | >>> segment.sid | |
420 | '2' | |
421 | ||
422 | Name | |
423 | ^^^^ | |
424 | ||
425 | Different record types have an identifier field: segments (name in GFA1, | |
426 | sid in GFA2), paths (path\_name), edge (eid), fragment (sid), gap (gid), | |
427 | groups (pid). | |
428 | ||
429 | All these fields are aliased to ``name``. This allows the user for | |
430 | example to set the identifier of a line using the ``name=(value)`` | |
431 | method using the same syntax for different record types (segments, | |
432 | edges, paths, fragments, gaps and groups). | |
433 | ||
434 | Version-specific field names | |
435 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
436 | ||
437 | For segments the GFA1 name and the GFA2 sid are equivalent fields. For | |
438 | this reason an alias ``sid`` is defined for GFA1 segments and ``name`` | |
439 | for GFA2 segments. | |
440 | ||
441 | Crypical field names | |
442 | ^^^^^^^^^^^^^^^^^^^^ | |
443 | ||
444 | The definition of from and to for containments is somewhat cryptic. | |
445 | Therefore following aliases have been defined for containments: | |
446 | container[\_orient] for from[\_\|segment\|orient]; contained[\_orient] | |
447 | for to[\_segment\|orient]. |
0 | .. testsetup:: * | |
1 | ||
2 | import gfapy | |
3 | ||
4 | .. _positions: | |
5 | ||
6 | Positions | |
7 | --------- | |
8 | ||
9 | The only position field in GFA1 is the ``pos`` field in the C lines. | |
10 | This represents the starting position of the contained segment in the | |
11 | container segment and is 0-based. | |
12 | ||
13 | Some fields in GFA2 E lines (``beg1, beg2, end1, end2``) and F lines | |
14 | (``s_beg, s_end, f_beg, f_end``) are positions. According to the | |
15 | specification, they are 0-based and represent virtual ticks before and | |
16 | after each string in the sequence. Thus ranges are represented similarly | |
17 | to the Python range conventions: e.g. a 1-character prefix of a sequence | |
18 | will have begin 0 and end 1. | |
19 | ||
20 | Last positions in GFA2 | |
21 | ~~~~~~~~~~~~~~~~~~~~~~ | |
22 | ||
23 | The GFA2 positions must contain an additional string (``$``) appended to | |
24 | the integer, if (and only if) they are the last position in the segment | |
25 | sequence. These particular positions are represented in Gfapy as | |
26 | instances of the class :class:`~gfapy.lastpos.LastPos`. | |
27 | ||
28 | To create a lastpos instance, the constructor can be used with an | |
29 | integer, or the string representation (which must end with the dollar | |
30 | sign, otherwise an integer is returned): | |
31 | ||
32 | .. doctest:: | |
33 | ||
34 | >>> str(gfapy.LastPos(12)) | |
35 | '12$' | |
36 | >>> gfapy.LastPos("12") | |
37 | 12 | |
38 | >>> str(gfapy.LastPos("12")) | |
39 | '12' | |
40 | >>> gfapy.LastPos("12$") | |
41 | gfapy.LastPos(12) | |
42 | >>> str(gfapy.LastPos("12$")) | |
43 | '12$' | |
44 | ||
45 | Subtracting an integer from a lastpos returns a lastpos if 0 subtracted, | |
46 | an integer otherwise. This allows to do some arithmetic on positions | |
47 | without making them invalid. | |
48 | ||
49 | .. doctest:: | |
50 | ||
51 | >>> gfapy.LastPos(12) - 0 | |
52 | gfapy.LastPos(12) | |
53 | >>> gfapy.LastPos(12) - 1 | |
54 | 11 | |
55 | ||
56 | The functions :func:`~gfapy.lastpos.islastpos` and | |
57 | :func:`~gfapy.lastpos.isfirstpos` allow to | |
58 | determine if a position value is 0 (first), or the last position, using | |
59 | the same syntax for lastpos and integer instances. | |
60 | ||
61 | .. doctest:: | |
62 | ||
63 | >>> gfapy.isfirstpos(0) | |
64 | True | |
65 | >>> gfapy.islastpos(0) | |
66 | False | |
67 | >>> gfapy.isfirstpos(12) | |
68 | False | |
69 | >>> gfapy.islastpos(12) | |
70 | False | |
71 | >>> gfapy.islastpos(gfapy.LastPos("12")) | |
72 | False | |
73 | >>> gfapy.islastpos(gfapy.LastPos("12$")) | |
74 | True |
0 | .. testsetup:: * | |
1 | ||
2 | import gfapy | |
3 | gfa = gfapy.Gfa() | |
4 | ||
5 | .. _references: | |
6 | ||
7 | References | |
8 | ---------- | |
9 | ||
10 | Some fields in GFA lines contain identifiers or lists of identifiers | |
11 | (sometimes followed by orientation strings), which reference other lines | |
12 | of the GFA file. In Gfapy it is possible to follow these references and | |
13 | traverse the graph. | |
14 | ||
15 | Connecting a line to a Gfa object | |
16 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
17 | ||
18 | In stand-alone line instances, the identifiers which reference other | |
19 | lines are either strings containing the line name, pairs of strings | |
20 | (name and orientation) in a ``gfapy.OrientedLine`` object, or lists of | |
21 | lines names or ``gfapy.OrientedLine`` objects. | |
22 | ||
23 | Using the ``add_line(line)`` (alias: ``append(line)``) method of the | |
24 | ``gfapy.Gfa`` object, or the equivalent ``connect(gfa)`` method of the | |
25 | gfapy.Line instance, a line is added to a Gfa instance (this is done | |
26 | automatically when a GFA file is parsed). All strings expressing | |
27 | references are then changed into references to the corresponding line | |
28 | objects. The method ``is_connected()`` allows to determine if a line is | |
29 | connected to a gfapy instance. The read-only property ``gfa`` contains | |
30 | the ``gfapy.Gfa`` instance to which the line is connected. | |
31 | ||
32 | .. doctest:: | |
33 | ||
34 | >>> gfa = gfapy.Gfa(version='gfa1') | |
35 | >>> link = gfapy.Line("L\tA\t-\tB\t+\t20M") | |
36 | >>> link.is_connected() | |
37 | False | |
38 | >>> link.gfa is None | |
39 | True | |
40 | >>> type(link.from_segment) | |
41 | <class 'str'> | |
42 | >>> gfa.append(link) | |
43 | >>> link.is_connected() | |
44 | True | |
45 | >>> link.gfa #doctest: +ELLIPSIS | |
46 | <gfapy.gfa.Gfa object at ...> | |
47 | >>> type(link.from_segment) | |
48 | <class 'gfapy.line.segment.gfa1.GFA1'> | |
49 | ||
50 | References for each record type | |
51 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
52 | ||
53 | The following tables describes the references contained in each record | |
54 | type. The notation ``[]`` represent lists. | |
55 | ||
56 | GFA1 | |
57 | ^^^^ | |
58 | ||
59 | +---------------+-------------------+---------------------------+ | |
60 | | Record type | Fields | Type of reference | | |
61 | +===============+===================+===========================+ | |
62 | | Link | from, to | Segment | | |
63 | +---------------+-------------------+---------------------------+ | |
64 | | Containment | from, to | Segment | | |
65 | +---------------+-------------------+---------------------------+ | |
66 | | Path | segment\_names, | [OrientedLine(Segment)] | | |
67 | +---------------+-------------------+---------------------------+ | |
68 | | | links (1) | [OrientedLine(Link)] | | |
69 | +---------------+-------------------+---------------------------+ | |
70 | ||
71 | (1): paths contain information in the fields segment\_names and | |
72 | overlaps, which allow to find the identify from which they depend; these | |
73 | links can be retrieved using ``links`` (which is not a field). | |
74 | ||
75 | GFA2 | |
76 | ^^^^ | |
77 | ||
78 | +---------------+--------------+------------------------------------+ | |
79 | | Record type | Fields | Type of reference | | |
80 | +===============+==============+====================================+ | |
81 | | Edge | sid1, sid2 | Segment | | |
82 | +---------------+--------------+------------------------------------+ | |
83 | | Gap | sid1, sid2 | Segment | | |
84 | +---------------+--------------+------------------------------------+ | |
85 | | Fragment | sid | Segment | | |
86 | +---------------+--------------+------------------------------------+ | |
87 | | Set | items | [Edge/Set/Path/Segment] | | |
88 | +---------------+--------------+------------------------------------+ | |
89 | | Path | items | [OrientedLine(Edge/Set/Segment)] | | |
90 | +---------------+--------------+------------------------------------+ | |
91 | ||
92 | Backreferences for each record type | |
93 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
94 | ||
95 | When a line containing a reference to another line is connected to a Gfa | |
96 | object, backreferences to it are created in the targeted line. | |
97 | ||
98 | For each backreference collection a read-only property exist, which is | |
99 | named as the collection (e.g. ``dovetails_L`` for segments). Note that | |
100 | the reference list returned by these arrays are read-only and editing | |
101 | the references is done using other methods (see the section "Editing | |
102 | reference fields" below). | |
103 | ||
104 | .. code:: python | |
105 | ||
106 | segment.dovetails_L # => [gfapy.line.edge.Link(...), ...] | |
107 | ||
108 | The following tables describe the backreferences collections for each | |
109 | record type. | |
110 | ||
111 | GFA1 | |
112 | ^^^^ | |
113 | ||
114 | +---------------+-------------------------+ | |
115 | | Record type | Backreferences | | |
116 | +===============+=========================+ | |
117 | | Segment | dovetails\_L | | |
118 | +---------------+-------------------------+ | |
119 | | | dovetails\_R | | |
120 | +---------------+-------------------------+ | |
121 | | | edges\_to\_contained | | |
122 | +---------------+-------------------------+ | |
123 | | | edges\_to\_containers | | |
124 | +---------------+-------------------------+ | |
125 | | | paths | | |
126 | +---------------+-------------------------+ | |
127 | | Link | paths | | |
128 | +---------------+-------------------------+ | |
129 | ||
130 | GFA2 | |
131 | ^^^^ | |
132 | ||
133 | +---------------+-------------------------+--------+ | |
134 | | Record type | Backreferences | Type | | |
135 | +===============+=========================+========+ | |
136 | | Segment | dovetails\_L | E | | |
137 | +---------------+-------------------------+--------+ | |
138 | | | dovetails\_R | E | | |
139 | +---------------+-------------------------+--------+ | |
140 | | | edges\_to\_contained | E | | |
141 | +---------------+-------------------------+--------+ | |
142 | | | edges\_to\_containers | E | | |
143 | +---------------+-------------------------+--------+ | |
144 | | | internals | E | | |
145 | +---------------+-------------------------+--------+ | |
146 | | | gaps\_L | G | | |
147 | +---------------+-------------------------+--------+ | |
148 | | | gaps\_R | G | | |
149 | +---------------+-------------------------+--------+ | |
150 | | | fragments | F | | |
151 | +---------------+-------------------------+--------+ | |
152 | | | paths | O | | |
153 | +---------------+-------------------------+--------+ | |
154 | | | sets | U | | |
155 | +---------------+-------------------------+--------+ | |
156 | | Edge | paths | O | | |
157 | +---------------+-------------------------+--------+ | |
158 | | | sets | U | | |
159 | +---------------+-------------------------+--------+ | |
160 | | O Group | paths | O | | |
161 | +---------------+-------------------------+--------+ | |
162 | | | sets | U | | |
163 | +---------------+-------------------------+--------+ | |
164 | | U Group | sets | U | | |
165 | +---------------+-------------------------+--------+ | |
166 | ||
167 | Segment backreference convenience methods | |
168 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
169 | ||
170 | For segments, additional methods are available which combine in | |
171 | different way the backreferences information. The | |
172 | `dovetails_of_end` and `gaps_of_end` methods take an | |
173 | argument ``L`` or ``R`` and return the dovetails overlaps (or gaps) of the | |
174 | left or, respectively, right end of the segment sequence | |
175 | (equivalent to the segment properties ``dovetails_L``/``dovetails_R`` and | |
176 | ``gaps_L``/``gaps_R``). | |
177 | ||
178 | The segment ``containments`` property is a list of both containments where the | |
179 | segment is the container or the contained segment. The segment ``edges`` | |
180 | property is a list of all edges (dovetails, containments and internals) | |
181 | with a reference to the segment. | |
182 | ||
183 | Other methods directly compute list of segments from the edges lists | |
184 | mentioned above. The ``neighbours_L``, ``neighbours_R`` properties and | |
185 | the `neighbours` method compute the set of segment instances which are | |
186 | connected by dovetails to the segment. | |
187 | The ``containers`` and ``contained`` | |
188 | properties similarly compute the set of segment instances which, | |
189 | respectively, contains the segment, or are contained in the segment. | |
190 | ||
191 | .. doctest:: | |
192 | ||
193 | >>> gfa = gfapy.Gfa() | |
194 | >>> gfa.append('S\tA\t*') | |
195 | >>> s = gfa.segment('A') | |
196 | >>> gfa.append('S\tB\t*') | |
197 | >>> gfa.append('S\tC\t*') | |
198 | >>> gfa.append('L\tA\t-\tB\t+\t*') | |
199 | >>> gfa.append('C\tA\t+\tC\t+\t10\t*') | |
200 | >>> [str(l) for l in s.dovetails_of_end("L")] | |
201 | ['L\tA\t-\tB\t+\t*'] | |
202 | >>> s.dovetails_L == s.dovetails_of_end("L") | |
203 | True | |
204 | >>> s.gaps_of_end("R") | |
205 | [] | |
206 | >>> [str(e) for e in s.edges] | |
207 | ['L\tA\t-\tB\t+\t*', 'C\tA\t+\tC\t+\t10\t*'] | |
208 | >>> [str(n) for n in s.neighbours_L] | |
209 | ['S\tB\t*'] | |
210 | >>> s.containers | |
211 | [] | |
212 | >>> [str(c) for c in s.contained] | |
213 | ['S\tC\t*'] | |
214 | ||
215 | Multiline group definitions | |
216 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
217 | ||
218 | The GFA2 specification opens the possibility (experimental) to define | |
219 | groups on multiple lines, by using the same ID for each line defining | |
220 | the group. This is supported by gfapy. | |
221 | ||
222 | This means that if multiple `Ordered` or | |
223 | `Unordered` instances connected to a Gfa object have | |
224 | the same ``gid``, they are merged into a single instance (technically | |
225 | the last one getting added to the graph object). The items list are | |
226 | merged. | |
227 | ||
228 | The tags of multiple line defining a group shall not contradict each | |
229 | other (i.e. either are the tag names on different lines defining the | |
230 | group all different, or, if the same tag is present on different lines, | |
231 | the value and datatype must be the same, in which case the multiple | |
232 | definition will be ignored). | |
233 | ||
234 | .. doctest:: | |
235 | ||
236 | >>> gfa = gfapy.Gfa() | |
237 | >>> gfa.add_line("U\tu1\ts1 s2 s3") | |
238 | >>> [s.name for s in gfa.sets[-1].items] | |
239 | ['s1', 's2', 's3'] | |
240 | >>> gfa.add_line('U\tu1\t4 5') | |
241 | >>> [s.name for s in gfa.sets[-1].items] | |
242 | ['s1', 's2', 's3', '4', '5'] | |
243 | ||
244 | Induced set and captured path | |
245 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
246 | ||
247 | The item list in GFA2 sets and paths may not contain elements which are | |
248 | implicitly involved. For example a path may contain segments, without | |
249 | specifying the edges connecting them, if there is only one such edge. | |
250 | Alternatively a path may contain edges, without explicitly indicating the | |
251 | segments. Similarly a set may contain edges, but not the segments | |
252 | referred to in them, or contain segments which are connected by edges, | |
253 | without the edges themselves. Furthermore groups may refer to other | |
254 | groups (set to sets or paths, paths to paths only), which then | |
255 | indirectly contain references to segments and edges. | |
256 | ||
257 | Gfapy provides methods for the computation of the sets of segments and | |
258 | edges which are implied by an ordered or unordered group. Thereby all | |
259 | references to subgroups are resolved and implicit elements are added, as | |
260 | described in the specification. The computation can, therefore, only be | |
261 | applied to connected lines. For unordered groups, this computation is | |
262 | provided by the method ``induced_set()``, which returns an array of | |
263 | segment and edge instances. For ordered group, the computation is | |
264 | provided by the method ``captured_path()``, which returns a list of | |
265 | ``gfapy.OrientedLine`` instances, alternating segment and edge instances | |
266 | (and starting and ending in segments). | |
267 | ||
268 | The methods ``induced_segments_set()``, ``induced_edges_set()``, | |
269 | ``captured_segments()`` and ``captured_edges()`` return, respectively, | |
270 | the list of only segments or edges, in ordered or unordered groups. | |
271 | ||
272 | .. doctest:: | |
273 | ||
274 | >>> gfa = gfapy.Gfa() | |
275 | >>> gfa.add_line("S\ts1\t100\t*") | |
276 | >>> gfa.add_line("S\ts2\t100\t*") | |
277 | >>> gfa.add_line("S\ts3\t100\t*") | |
278 | >>> gfa.add_line("E\te1\ts1+\ts2-\t0\t10\t90\t100$\t*") | |
279 | >>> gfa.add_line("U\tu1\ts1 s2 s3") | |
280 | >>> u = gfa.sets[-1] | |
281 | >>> [l.name for l in u.induced_edges_set] | |
282 | ['e1'] | |
283 | >>> [l.name for l in u.induced_segments_set ] | |
284 | ['s1', 's2', 's3'] | |
285 | >>> [l.name for l in u.induced_set ] | |
286 | ['s1', 's2', 's3', 'e1'] | |
287 | ||
288 | Disconnecting a line from a Gfa object | |
289 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
290 | ||
291 | Lines can be disconnected using the ``rm(line)`` method of the | |
292 | ``gfapy.Gfa`` object or the ``disconnect()`` method of the line | |
293 | instance. | |
294 | ||
295 | .. doctest:: | |
296 | ||
297 | >>> gfa = gfapy.Gfa() | |
298 | >>> gfa.append('S\tsA\t*') | |
299 | >>> gfa.append('S\tsB\t*') | |
300 | >>> line = gfa.segment("sA") | |
301 | >>> gfa.segment_names | |
302 | ['sA', 'sB'] | |
303 | >>> gfa.rm(line) | |
304 | >>> gfa.segment_names | |
305 | ['sB'] | |
306 | >>> line = gfa.segment('sB') | |
307 | >>> line.disconnect() | |
308 | >>> gfa.segment_names | |
309 | [] | |
310 | ||
311 | Disconnecting a line affects other lines as well. Lines which are | |
312 | dependent on the disconnected line are disconnected as well. Any other | |
313 | reference to disconnected lines is removed as well. In the disconnected | |
314 | line, references to lines are transformed back to strings and | |
315 | backreferences are deleted. | |
316 | ||
317 | The following tables show which dependent lines are disconnected if they | |
318 | refer to a line which is being disconnected. | |
319 | ||
320 | GFA1 | |
321 | ^^^^ | |
322 | ||
323 | +---------------+---------------------------------+ | |
324 | | Record type | Dependent lines | | |
325 | +===============+=================================+ | |
326 | | Segment | links (+ paths), containments | | |
327 | +---------------+---------------------------------+ | |
328 | | Link | paths | | |
329 | +---------------+---------------------------------+ | |
330 | ||
331 | GFA2 | |
332 | ^^^^ | |
333 | ||
334 | +---------------+---------------------------------------+ | |
335 | | Record type | Dependent lines | | |
336 | +===============+=======================================+ | |
337 | | Segment | edges, gaps, fragments, sets, paths | | |
338 | +---------------+---------------------------------------+ | |
339 | | Edge | sets, paths | | |
340 | +---------------+---------------------------------------+ | |
341 | | Sets | sets, paths | | |
342 | +---------------+---------------------------------------+ | |
343 | ||
344 | Editing reference fields | |
345 | ~~~~~~~~~~~~~~~~~~~~~~~~ | |
346 | ||
347 | In connected line instances, it is not allowed to directly change the | |
348 | content of fields containing references to other lines, as this would | |
349 | make the state of the Gfa object invalid. | |
350 | ||
351 | Besides the fields containing references, some other fields are | |
352 | read-only in connected lines. Changing some of the fields would require | |
353 | moving the backreferences to other collections (position fields of edges | |
354 | and gaps, ``from_orient`` and ``to_orient`` of links). The overlaps | |
355 | field of connected links is readonly as it may be necessary to identify | |
356 | the link in paths. | |
357 | ||
358 | Renaming an element | |
359 | ^^^^^^^^^^^^^^^^^^^ | |
360 | ||
361 | The name field of a line (e.g. segment ``name``/``sid``) is not a | |
362 | reference and thus can be edited also in connected lines. When the name | |
363 | of the line is changed, no manual editing of references (e.g. from/to | |
364 | fields in links) is necessary, as all lines which refer to the line will | |
365 | still refer to the same instance. The references to the instance in the | |
366 | Gfa lines collections will be automatically updated. Also, the new name | |
367 | will be correctly used when converting to string, such as when the Gfa | |
368 | instance is written to a GFA file. | |
369 | ||
370 | Renaming a line to a name which already exists has the same effect of | |
371 | adding a line with that name. That is, in most cases, | |
372 | ``gfapy.NotUniqueError`` is raised. An exception are GFA2 sets and | |
373 | paths: in this case the line will be appended to the existing line with | |
374 | the same name (as described in "Multiline group definitions"). | |
375 | ||
376 | Adding and removing group elements | |
377 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
378 | ||
379 | Elements of GFA2 groups can be added and removed from both connected and | |
380 | non-connected lines, using the following methods. | |
381 | ||
382 | To add an item to or remove an item from an unordered group, use the | |
383 | methods ``add_item(item)`` and ``rm_item(item)``, which take as argument | |
384 | either a string (identifier) or a line instance. | |
385 | ||
386 | To append or prepend an item to an ordered group, use the methods | |
387 | ``append_item(item)`` and ``prepend_item(item)``. To remove the first or | |
388 | the last item of an ordered group use the methods ``rm_first_item()`` | |
389 | and ``rm_last_item()``. | |
390 | ||
391 | Editing read-only fields of connected lines | |
392 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
393 | ||
394 | Editing the read-only information of edges, gaps, links, containments, | |
395 | fragments and paths is more complicated. These lines shall be | |
396 | disconnected before the edit and connected again to the Gfa object after | |
397 | it. Before disconnecting a line, you should check if there are other | |
398 | lines dependent on it (see tables above). If so, you will have to | |
399 | disconnect these lines first, eventually update their fields and | |
400 | reconnect them at the end of the operation. | |
401 | ||
402 | Virtual lines | |
403 | ~~~~~~~~~~~~~ | |
404 | ||
405 | The order of the lines in GFA is not prescribed. Therefore, during | |
406 | parsing, or constructing a Gfa in memory, it is possible that a line is | |
407 | referenced to, before it is added to the Gfa instance. Whenever this | |
408 | happens, Gfapy creates a "virtual" line instance. | |
409 | ||
410 | Users do not have to handle with virtual lines, if they work with | |
411 | complete and valid GFA files. | |
412 | ||
413 | Virtual lines are similar to normal line instances, with some | |
414 | limitations (they contain only limited information and it is not allowed | |
415 | to add tags to them). To check if a line is a virtual line, one can use | |
416 | the ``virtual`` property of the line. | |
417 | ||
418 | As soon as the parser founds the real line corresponding to a previously | |
419 | introduced virtual line, the virtual line is exchanged with the real | |
420 | line and all references are corrected to point to the real line. | |
421 | ||
422 | .. doctest:: | |
423 | ||
424 | >>> g = gfapy.Gfa() | |
425 | >>> g.add_line("S\t1\t*") | |
426 | >>> g.add_line("L\t1\t+\t2\t+\t*") | |
427 | >>> l = g.dovetails[0] | |
428 | >>> g.segment("1").virtual | |
429 | False | |
430 | >>> g.segment("2").virtual | |
431 | True | |
432 | >>> l.to_segment == g.segment("2") | |
433 | True | |
434 | >>> g.segment("2").dovetails == [l] | |
435 | True | |
436 | >>> g.add_line("S\t2\t*") | |
437 | >>> g.segment("2").virtual | |
438 | False | |
439 | >>> l.to_segment == g.segment("2") | |
440 | True | |
441 | >>> g.segment("2").dovetails == [l] | |
442 | True |
0 | .. testsetup:: * | |
1 | ||
2 | import gfapy | |
3 | ||
4 | .. _rgfa: | |
5 | ||
6 | rGFA | |
7 | ---- | |
8 | ||
9 | rGFA (https://github.com/lh3/gfatools/blob/master/doc/rGFA.md) | |
10 | is a subset of GFA1, in which only particular line types (S and L) | |
11 | are allowed, and the S lines are required to contain the tags | |
12 | `SN` (of type `Z`), `SO` and `SR` (of type `i`). | |
13 | ||
14 | When working with rGFA files, it is convenient to use the `dialect="rgfa"` | |
15 | option in the constructor `Gfa()` and in | |
16 | func:`Gfa.from_file() <gfapy.gfa.Gfa.from_file>`. | |
17 | ||
18 | This ensures that additional validations are performed: GFA version must be 1, | |
19 | only rGFA-compatible lines (S,L) are allowed and that the required tags are | |
20 | required (with the correct datatype). The validations can also be executed | |
21 | manually using `Gfa.validate_rgfa() <gfapy.gfa.Gfa.validate_rgfa>`. | |
22 | ||
23 | Furthermore, the `stable_sequence_names` attribute of the GFA objects | |
24 | returns the set of stable sequence names contained in the `SN` tags | |
25 | of the segments. | |
26 | ||
27 | .. doctest:: | |
28 | ||
29 | >>> g = gfapy.Gfa("S\tS1\tCTGAA\tSN:Z:chr1\tSO:i:0\tSR:i:0", dialect="rgfa") | |
30 | >>> g.segment_names | |
31 | ['S1'] | |
32 | >>> g.stable_sequence_names | |
33 | ['chr1'] | |
34 | >>> g.add_line("S\tS2\tACG\tSN:Z:chr1\tSO:i:5\tSR:i:0") | |
35 |
0 | .. testsetup:: * | |
1 | ||
2 | import gfapy | |
3 | gfa = gfapy.Gfa() | |
4 | ||
5 | .. _tags: | |
6 | ||
7 | Tags | |
8 | ---- | |
9 | ||
10 | Each record in GFA can contain tags. Tags are fields which consist in a | |
11 | tag name, a datatype and data. The format is ``NN:T:DATA`` where ``NN`` | |
12 | is a two-letter tag name, ``T`` is a one-letter datatype string and | |
13 | ``DATA`` is a string representing the data according to the specified | |
14 | datatype. Tag names must be unique for each line, i.e. each line may | |
15 | only contain a tag once. | |
16 | ||
17 | :: | |
18 | ||
19 | # Examples of GFA tags of different datatypes: | |
20 | "aa:i:-12" | |
21 | "bb:f:1.23" | |
22 | "cc:Z:this is a string" | |
23 | "dd:A:X" | |
24 | "ee:B:c,12,3,2" | |
25 | "ff:H:122FA0" | |
26 | 'gg:J:["A","B"]' | |
27 | ||
28 | Custom tags | |
29 | ~~~~~~~~~~~ | |
30 | ||
31 | Some tags are explicitly defined in the specification (these are named | |
32 | *predefined tags* in Gfapy), and the user or an application can define | |
33 | its own custom tags. These may contain lower case letters. | |
34 | ||
35 | Custom tags are user or program specific and may of course collide with | |
36 | the tags used by other users or programs. For this reasons, if you write | |
37 | scripts which employ custom tags, you should always check that the | |
38 | values are of the correct datatype and plausible. | |
39 | ||
40 | .. doctest:: | |
41 | ||
42 | >>> line = gfapy.Line("H\txx:i:2") | |
43 | >>> if line.get_datatype("xx") != "i": | |
44 | ... raise Exception("I expected the tag xx to contain an integer!") | |
45 | >>> myvalue = line.xx | |
46 | >>> if (myvalue > 120) or (myvalue % 2 == 1): | |
47 | ... raise Exception("The value in the xx tag is not an even value <= 120") | |
48 | >>> # ... do something with myvalue | |
49 | ||
50 | Also it is good practice to allow the user of the script to change the | |
51 | name of the custom tags. For example, Gfapy employs the +or+ custom tag | |
52 | to track the original segment from which a segment in the final graph is | |
53 | derived. All methods which read or write the +or+ tag allow to specify | |
54 | an alternative tag name to use instead of +or+, for the case that this | |
55 | name collides with the custom tag of another program. | |
56 | ||
57 | .. code:: python | |
58 | ||
59 | # E.g. a method which does something with myvalue, usually stored in tag xx | |
60 | # allows the user to specify an alternative name for the tag | |
61 | def mymethod(line, mytag="xx"): | |
62 | myvalue = line.get(mytag) | |
63 | # ... | |
64 | ||
65 | Predefined tags | |
66 | ~~~~~~~~~~~~~~~ | |
67 | ||
68 | According to the GFA specifications, predefined tag names consist of either | |
69 | two upper case letters, or an upper case letter followed by a digit. | |
70 | The GFA1 specification predefines tags for each line type, while GFA2 | |
71 | only predefines tags for the header and edges. | |
72 | ||
73 | While tags with the predefined names are allowed to be added to any line, | |
74 | when they are used in the lines mentiones in the specification (e.g. `VN` | |
75 | in the header) gfapy checks that the datatype is the one prescribed by | |
76 | the specification (e.g. `VN` must be of type `Z`). It is not forbidden | |
77 | to use the same tags in other contexts, but in this case, the datatype | |
78 | restriction is not enforced. | |
79 | ||
80 | +------------+------------+-----------------------+ | |
81 | | Tag | Type | Line types | GFA version | | |
82 | +============+============+=======================+ | |
83 | | VN | Z | H | 1,2 | | |
84 | +-----+------+------------+-----------------------+ | |
85 | | TS | i | H,S | 2 | | |
86 | +-----+------+------------+-----------------------+ | |
87 | | LN | i | S | 1 | | |
88 | +-----+------+------------+-----------------------+ | |
89 | | RC | i | S,L,C | 1 | | |
90 | +-----+------+------------+-----------------------+ | |
91 | | FC | i | S,L | 1 | | |
92 | +-----+------+------------+-----------------------+ | |
93 | | KC | i | S,L | 1 | | |
94 | +-----+------+------------+-----------------------+ | |
95 | | SH | H | S | 1 | | |
96 | +-----+------+------------+-----------------------+ | |
97 | | UR | Z | S | 1 | | |
98 | +-----+------+------------+-----------------------+ | |
99 | | MQ | i | L | 1 | | |
100 | +-----+------+------------+-----------------------+ | |
101 | | NM | i | L,i | 1 | | |
102 | +-----+------+------------+-----------------------+ | |
103 | | ID | Z | L,C | 1 | | |
104 | +-----+------+------------+-----------------------+ | |
105 | ||
106 | :: | |
107 | ||
108 | "VN:Z:1.0" # VN => predefined tag | |
109 | "z5:Z:1.0" # z5 first char is downcase => custom tag | |
110 | "XX:Z:aaa" # XX upper case, but not predefined => custom tag | |
111 | ||
112 | # not forbidden, but not recommended: | |
113 | "zZ:Z:1.0" # => mixed case, first char downcase => custom tag | |
114 | "Zz:Z:1.0" # => mixed case, first char upcase => custom tag | |
115 | "vn:Z:1.0" # => same name as predefined tag, but downcase => custom tag | |
116 | ||
117 | Datatypes | |
118 | ~~~~~~~~~ | |
119 | ||
120 | The following table summarizes the datatypes available for tags: | |
121 | ||
122 | +----------+-----------------+---------------------------+----------------------+ | |
123 | | Symbol | Datatype | Example | Python class | | |
124 | +==========+=================+===========================+======================+ | |
125 | | Z | string | This is a string | str | | |
126 | +----------+-----------------+---------------------------+----------------------+ | |
127 | | i | integer | -12 | int | | |
128 | +----------+-----------------+---------------------------+----------------------+ | |
129 | | f | float | 1.2E-5 | float | | |
130 | +----------+-----------------+---------------------------+----------------------+ | |
131 | | A | char | X | str | | |
132 | +----------+-----------------+---------------------------+----------------------+ | |
133 | | J | JSON | [1,{"k1":1,"k2":2},"a"] | list/dict | | |
134 | +----------+-----------------+---------------------------+----------------------+ | |
135 | | B | numeric array | f,1.2,13E-2,0 | gfapy.NumericArray | | |
136 | +----------+-----------------+---------------------------+----------------------+ | |
137 | | H | byte array | FFAA01 | gfapy.ByteArray | | |
138 | +----------+-----------------+---------------------------+----------------------+ | |
139 | ||
140 | Validation | |
141 | ~~~~~~~~~~ | |
142 | ||
143 | The tag names must consist of a letter and a digit or two letters. | |
144 | ||
145 | :: | |
146 | ||
147 | "KC:i:1" # => OK | |
148 | "xx:i:1" # => OK | |
149 | "x1:i:1" # => OK | |
150 | "xxx:i:1" # => error: name is too long | |
151 | "x:i:1" # => error: name is too short | |
152 | "11:i:1" # => error: at least one letter must be present | |
153 | ||
154 | The datatype must be one of the datatypes specified above. For | |
155 | predefined tags, Gfapy also checks that the datatype given in the | |
156 | specification is used. | |
157 | ||
158 | :: | |
159 | ||
160 | "xx:X:1" # => error: datatype X is unknown | |
161 | "VN:i:1" # => error: VN must be of type Z | |
162 | ||
163 | The data must be a correctly formatted string for the specified datatype | |
164 | or a Python object whose string representation is a correctly formatted | |
165 | string. | |
166 | ||
167 | .. doctest:: | |
168 | ||
169 | # current value: xx:i:2 | |
170 | >>> line = gfapy.Line("S\tA\t*\txx:i:2") | |
171 | >>> line.xx = 1 | |
172 | >>> line.xx | |
173 | 1 | |
174 | >>> line.xx = "3" | |
175 | >>> line.xx | |
176 | 3 | |
177 | >>> line.xx = "A" | |
178 | >>> line.xx | |
179 | Traceback (most recent call last): | |
180 | ... | |
181 | gfapy.error.FormatError: ... | |
182 | ||
183 | Depending on the validation level, more or less checks are done | |
184 | automatically (see :ref:`validation` chapter). Per default - validation level | |
185 | (1) - validation is performed only during parsing or accessing values | |
186 | the first time, therefore the user must perform a manual validation if | |
187 | he changes values to something which is not guaranteed to be correct. To | |
188 | trigger a manual validation, the user can call the method | |
189 | ``validate_field(fieldname)`` to validate a single tag, or | |
190 | ``validate()`` to validate the whole line, including all tags. | |
191 | ||
192 | .. doctest:: | |
193 | ||
194 | >>> line = gfapy.Line("S\tA\t*\txx:i:2", vlevel = 0) | |
195 | >>> line.validate_field("xx") | |
196 | >>> line.validate() | |
197 | >>> line.xx = "A" | |
198 | >>> line.validate_field("xx") | |
199 | Traceback (most recent call last): | |
200 | ... | |
201 | gfapy.error.FormatError: ... | |
202 | >>> line.validate() | |
203 | Traceback (most recent call last): | |
204 | ... | |
205 | gfapy.error.FormatError: ... | |
206 | >>> line.xx = "3" | |
207 | >>> line.validate_field("xx") | |
208 | >>> line.validate() | |
209 | ||
210 | Reading and writing tags | |
211 | ~~~~~~~~~~~~~~~~~~~~~~~~ | |
212 | ||
213 | Tags can be read using a property on the Gfapy line object, which is | |
214 | called as the tag (e.g. line.xx). A special version of the property | |
215 | prefixed by ``try_get_`` raises an error if the tag was not available | |
216 | (e.g. ``line.try_get_LN``), while the tag property (e.g. ``line.LN``) | |
217 | would return ``None`` in this case. Setting the value is done assigning | |
218 | a value to it the tag name method (e.g. ``line.TS = 120``). In | |
219 | alternative, the ``set(fieldname, value)``, ``get(fieldname)`` and | |
220 | ``try_get(fieldname)`` methods can also be used. To remove a tag from a | |
221 | line, use the ``delete(fieldname)`` method, or set its value to | |
222 | ``None``. The ``tagnames`` property Line instances is a list of | |
223 | the names (as strings) of all defined tags for a line. | |
224 | ||
225 | ||
226 | .. doctest:: | |
227 | ||
228 | >>> line = gfapy.Line("S\tA\t*\txx:i:1", vlevel = 0) | |
229 | >>> line.xx | |
230 | 1 | |
231 | >>> line.xy is None | |
232 | True | |
233 | >>> line.try_get_xx() | |
234 | 1 | |
235 | >>> line.try_get_xy() | |
236 | Traceback (most recent call last): | |
237 | ... | |
238 | gfapy.error.NotFoundError: ... | |
239 | >>> line.get("xx") | |
240 | 1 | |
241 | >>> line.try_get("xy") | |
242 | Traceback (most recent call last): | |
243 | ... | |
244 | gfapy.error.NotFoundError: ... | |
245 | >>> line.xx = 2 | |
246 | >>> line.xx | |
247 | 2 | |
248 | >>> line.xx = "a" | |
249 | >>> line.tagnames | |
250 | ['xx'] | |
251 | >>> line.xy = 2 | |
252 | >>> line.xy | |
253 | 2 | |
254 | >>> line.set("xy", 3) | |
255 | >>> line.get("xy") | |
256 | 3 | |
257 | >>> line.tagnames | |
258 | ['xx', 'xy'] | |
259 | >>> line.delete("xy") | |
260 | 3 | |
261 | >>> line.xy is None | |
262 | True | |
263 | >>> line.xx = None | |
264 | >>> line.xx is None | |
265 | True | |
266 | >>> line.try_get("xx") | |
267 | Traceback (most recent call last): | |
268 | ... | |
269 | gfapy.error.NotFoundError: ... | |
270 | >>> line.tagnames | |
271 | [] | |
272 | ||
273 | When a tag is read, the value is converted into an appropriate object | |
274 | (see Python classes in the datatype table above). When setting a value, | |
275 | the user can specify the value of a tag either as a Python object, or as | |
276 | the string representation of the value. | |
277 | ||
278 | .. doctest:: | |
279 | ||
280 | >>> line = gfapy.Line('H\txx:i:1\txy:Z:TEXT\txz:J:["a","b"]') | |
281 | >>> line.xx | |
282 | 1 | |
283 | >>> isinstance(line.xx, int) | |
284 | True | |
285 | >>> line.xy | |
286 | 'TEXT' | |
287 | >>> isinstance(line.xy, str) | |
288 | True | |
289 | >>> line.xz | |
290 | ['a', 'b'] | |
291 | >>> isinstance(line.xz, list) | |
292 | True | |
293 | ||
294 | The string representation of a tag can be read using the | |
295 | ``field_to_s(fieldname)`` method. The default is to only output the | |
296 | content of the field. By setting \`\`tag: true\`\`\`, the entire tag is | |
297 | output (name, datatype, content, separated by colons). An exception is | |
298 | raised if the field does not exist. | |
299 | ||
300 | .. doctest:: | |
301 | ||
302 | >>> line = gfapy.Line("H\txx:i:1") | |
303 | >>> line.xx | |
304 | 1 | |
305 | >>> line.field_to_s("xx") | |
306 | '1' | |
307 | >>> line.field_to_s("xx", tag=True) | |
308 | 'xx:i:1' | |
309 | ||
310 | Datatype of custom tags | |
311 | ~~~~~~~~~~~~~~~~~~~~~~~ | |
312 | ||
313 | The datatype of an existing custom field (but not of predefined fields) | |
314 | can be changed using the ``set_datatype(fieldname, datatype)`` method. | |
315 | The current datatype specification can be read using | |
316 | ``get_datatype(fieldname)``. | |
317 | ||
318 | .. doctest:: | |
319 | ||
320 | >>> line = gfapy.Line("H\txx:i:1") | |
321 | >>> line.get_datatype("xx") | |
322 | 'i' | |
323 | >>> line.set_datatype("xx", "Z") | |
324 | >>> line.get_datatype("xx") | |
325 | 'Z' | |
326 | ||
327 | If a new custom tag is specified, Gfapy selects the correct datatype for | |
328 | it: i/f for numeric values, J/B for arrays, J for hashes and Z for | |
329 | strings and strings. If the user wants to specify a different datatype, | |
330 | he may do so by setting it with ``set_datatype()`` (this can be done | |
331 | also before assigning a value, which is necessary if full validation is | |
332 | active). | |
333 | ||
334 | .. doctest:: | |
335 | ||
336 | >>> line = gfapy.Line("H") | |
337 | >>> line.xx = "1" | |
338 | >>> line.xx | |
339 | '1' | |
340 | >>> line.set_datatype("xy", "i") | |
341 | >>> line.xy = "1" | |
342 | >>> line.xy | |
343 | 1 | |
344 | ||
345 | Arrays of numerical values | |
346 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
347 | ||
348 | ``B`` and ``H`` tags represent array with particular constraints (e.g. | |
349 | they can only contain numeric values, and in some cases the values must | |
350 | be in predefined ranges). In order to represent them correctly and allow | |
351 | for validation, Python classes have been defined for both kind of tags: | |
352 | ``gfapy.ByteArray`` for ``H`` and ``gfapy.NumericArray`` for ``B`` | |
353 | fields. | |
354 | ||
355 | Both are subclasses of list. Object of the two classes can be created by | |
356 | passing an existing list or the string representation to the class | |
357 | constructor. | |
358 | ||
359 | .. doctest:: | |
360 | ||
361 | >>> # create a byte array instance | |
362 | >>> gfapy.ByteArray([12,3,14]) | |
363 | b'\x0c\x03\x0e' | |
364 | >>> gfapy.ByteArray("A012FF") | |
365 | b'\xa0\x12\xff' | |
366 | >>> # create a numeric array instance | |
367 | >>> gfapy.NumericArray.from_string("c,12,3,14") | |
368 | [12, 3, 14] | |
369 | >>> gfapy.NumericArray([12,3,14]) | |
370 | [12, 3, 14] | |
371 | ||
372 | Instances of the classes behave as normal lists, except that they | |
373 | provide a #validate() method, which checks the constraints, and that | |
374 | their string representation is the GFA string representation of the | |
375 | field value. | |
376 | ||
377 | .. doctest:: | |
378 | ||
379 | >>> gfapy.NumericArray([12,1,"1x"]).validate() | |
380 | Traceback (most recent call last): | |
381 | ... | |
382 | gfapy.error.ValueError | |
383 | >>> str(gfapy.NumericArray([12,3,14])) | |
384 | 'C,12,3,14' | |
385 | >>> gfapy.ByteArray([12,1,"1x"]).validate() | |
386 | Traceback (most recent call last): | |
387 | ... | |
388 | gfapy.error.ValueError | |
389 | >>> str(gfapy.ByteArray([12,3,14])) | |
390 | '0C030E' | |
391 | ||
392 | For numeric values, the `compute_subtype` method allows to compute | |
393 | the subtype which will be used for the string representation. Unsigned | |
394 | subtypes are used if all values are positive. The smallest possible | |
395 | subtype range is selected. The subtype may change when the range of the | |
396 | elements changes. | |
397 | ||
398 | .. doctest:: | |
399 | ||
400 | >>> gfapy.NumericArray([12,13,14]).compute_subtype() | |
401 | 'C' | |
402 | ||
403 | Special cases: custom records, headers, comments and virtual lines. | |
404 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
405 | ||
406 | GFA2 allows custom records, introduced by record type strings other than | |
407 | the predefined ones. Gfapy uses a pragmatical approach for identifying | |
408 | tags in custom records, and tries to interpret the rightmost fields as | |
409 | tags, until the first field from the right raises an error; all | |
410 | remaining fields are treated as positional fields. | |
411 | ||
412 | :: | |
413 | ||
414 | "X a b c xx:i:12" # => xx is tag, a, b, c are positional fields | |
415 | "Y a b xx:i:12 c" # => all positional fields, as c is not a valid tag | |
416 | ||
417 | For easier access, the entire header of the GFA is summarized in a | |
418 | single line instance. A class (`FieldArray`) has been defined to | |
419 | handle the special case when multiple H lines define the same tag (see | |
420 | :ref:`header` chapter for details). | |
421 | ||
422 | Comment lines are represented by a subclass of the same class | |
423 | (`Line`) as the records. However, they cannot contain tags: the | |
424 | entire line is taken as content of the comment. See the :ref:`comments` | |
425 | chapter for more information about comments. | |
426 | ||
427 | :: | |
428 | ||
429 | "# this is not a tag: xx:i:1" # => xx is not a tag, xx:i:1 is part of the comment | |
430 | ||
431 | Virtual instances of the `Line` class (e.g. segment instances automatically | |
432 | created because of not yet resolved references found in edges) cannot be | |
433 | modified by the user, and tags cannot be specified for them. This | |
434 | includes all instances of the `Unknown` class. See the | |
435 | :ref:`references` chapter for more information about virtual lines. |
0 | .. _validation: | |
1 | ||
2 | Validation | |
3 | ---------- | |
4 | ||
5 | Different validation levels are available. They represent different | |
6 | compromises between speed and warrant of validity. The validation level | |
7 | can be specified when the :class:`~gfapy.gfa.Gfa` object is created, using the | |
8 | ``vlevel`` parameter of the constructor and of the | |
9 | `from_file` method. Four levels of validation are defined | |
10 | (0 = no validation, 1 = validation by reading, 2 = validation by reading | |
11 | and writing, 3 = continuous validation). The default validation level | |
12 | value is 1. | |
13 | ||
14 | Manual validation | |
15 | ~~~~~~~~~~~~~~~~~ | |
16 | ||
17 | Independently from the validation level chosen, the user can always check the | |
18 | value of a field calling | |
19 | :meth:`~gfapy.line.common.validate.Validate.validate_field` on the line | |
20 | instance. If no exception is raised, the field content is valid. | |
21 | ||
22 | To check if the entire content of the line is valid, the user can call | |
23 | :meth:`~gfapy.line.common.validate.Validate.validate` on the line instance. | |
24 | This will check all fields and perform cross-field validations, such as | |
25 | comparing the length of the sequence of a GFA1 segment, to the value of the LN | |
26 | tag (if present). | |
27 | ||
28 | It is also possible to validate the structure of the GFA, for example to | |
29 | check if there are unresolved references to lines. To do this, use the | |
30 | :meth:`~gfapy.gfa.Gfa.validate` of the :class:`~gfapy.gfa.Gfa` instance. | |
31 | ||
32 | No validations | |
33 | ~~~~~~~~~~~~~~ | |
34 | ||
35 | If the validation is set to 0, Gfapy will try to accept any input and | |
36 | never raise an exception. This is not always possible, and in some | |
37 | cases, an exception will still be raised, if the data is invalid. | |
38 | ||
39 | Validation when reading | |
40 | ~~~~~~~~~~~~~~~~~~~~~~~ | |
41 | ||
42 | If the validation level is set to 1 or higher, basic validations will be | |
43 | performed, such as checking the number of positional fields, the | |
44 | presence of duplicated tags, the tag datatype of predefined tags. | |
45 | Additionally, all tags will be validated, either during parsing or on | |
46 | first access. Record-type cross-field validations will also be | |
47 | performed. | |
48 | ||
49 | In other words, a validation of 1 means that Gfapy guarantees (as good | |
50 | as it can) that the GFA content read from a file is valid, and will | |
51 | raise an exception on accessing the data if not. | |
52 | ||
53 | The user is supposed to call `validate_field` after changing | |
54 | a field content to something which can be potentially invalid, or | |
55 | :meth:`~gfapy.line.common.validate.Validate.validate` if potentially | |
56 | cross-field validations could fail. | |
57 | ||
58 | Validation when writing | |
59 | ~~~~~~~~~~~~~~~~~~~~~~~ | |
60 | ||
61 | Setting the level to 2 will perform all validations described above, | |
62 | plus validate the fields content when their value is written to string. | |
63 | ||
64 | In other words, a validation of 2 means that Gfapy guarantee (as good as | |
65 | it can) that the GFA content read from a file and written to a file is | |
66 | valid and will raise an exception on accessing the data or writing to | |
67 | file if not. | |
68 | ||
69 | Continuous validation | |
70 | ~~~~~~~~~~~~~~~~~~~~~ | |
71 | ||
72 | If the validation level is set to 3, all validations for lower levels | |
73 | described above are run, plus a validation of fields contents each time | |
74 | a setter method is used. | |
75 | ||
76 | A validation of 3 means that Gfapy guarantees (as good as it can) that | |
77 | the GFA content is always valid. |
0 | 0 | import gfapy |
1 | from .alignment import Alignment | |
2 | 1 | |
3 | 2 | class AlignmentPlaceholder(gfapy.Placeholder): |
4 | 3 | """ |
19 | 19 | if isinstance(obj, gfapy.ByteArray): |
20 | 20 | return str(obj) |
21 | 21 | if isinstance(obj, list): |
22 | return str(ByteArray(obj)) | |
22 | return str(gfapy.ByteArray(obj)) | |
23 | 23 | elif isinstance(obj, str): |
24 | 24 | return obj |
25 | 25 | else: |
32 | 32 | if isinstance(obj, gfapy.ByteArray): |
33 | 33 | return str(obj) |
34 | 34 | elif isinstance(obj, list): |
35 | return str(ByteArray(obj)) | |
35 | return str(gfapy.ByteArray(obj)) | |
36 | 36 | elif isinstance(obj, str): |
37 | 37 | validate_encoded(obj) |
38 | 38 | return obj |
0 | 0 | import gfapy |
1 | 1 | import builtins |
2 | import re | |
3 | 2 | from .validator import Validator |
4 | 3 | from .parser import Parser |
5 | 4 | from .writer import Writer |
26 | 26 | if not re.match("^[!-~]+$", elem): |
27 | 27 | raise gfapy.FormatError( |
28 | 28 | "the list contains an invalid GFA2 identifier ({})\n" |
29 | .format(repr(string))+ | |
29 | .format(repr(elem))+ | |
30 | 30 | "(it contains spaces and/or non-printable characters)") |
31 | 31 | else: |
32 | 32 | raise gfapy.TypeError( |
24 | 24 | pass |
25 | 25 | elif isinstance(obj, gfapy.Line): |
26 | 26 | validate_encoded(obj.name) |
27 | elif isinstance(obj, String): | |
27 | elif isinstance(obj, str): | |
28 | 28 | validate_encoded(obj) |
29 | 29 | else: |
30 | 30 | raise gfapy.TypeError( |
48 | 48 | def encode(obj): |
49 | 49 | if isinstance(obj, gfapy.Placeholder): |
50 | 50 | return str(obj) |
51 | elif isinstance(obj, String): | |
51 | elif isinstance(obj, str): | |
52 | 52 | obj = str(obj) |
53 | 53 | elif isinstance(obj, gfapy.Line): |
54 | 54 | obj = str(obj.name) |
24 | 24 | elem.validate() |
25 | 25 | if not re.match(r"^[!-)+-<>-~][!-~]*$", elem.name): |
26 | 26 | raise gfapy.FormatError( |
27 | "#{elem.name} is not a valid GFA1 segment name\n".format(elem.name)+ | |
27 | "{} is not a valid GFA1 segment name\n".format(elem.name)+ | |
28 | 28 | "(it does not match [!-)+-<>-~][!-~]*)") |
29 | 29 | |
30 | 30 | def unsafe_encode(obj): |
23 | 23 | elem.validate() |
24 | 24 | if not re.match(r"^[!-~]+$", elem.name): |
25 | 25 | raise gfapy.FormatError( |
26 | "the list contains an invalid GFA2 identifier\n".format(elem.name)+ | |
26 | "the list contains an invalid GFA2 identifier {}\n".format(elem.name)+ | |
27 | 27 | "(it contains spaces and/or non-printable characters)") |
28 | 28 | if not elem.orient in ["+", "-"]: |
29 | 29 | raise gfapy.FormatError( |
0 | import gfapy | |
1 | 0 | from .artifacts import Artifacts |
2 | 1 | from .copy_number import CopyNumber |
3 | 2 | from .invertible_segments import InvertibleSegments |
0 | from itertools import groupby | |
1 | import gfapy | |
2 | ||
0 | 3 | class InvertibleSegments: |
1 | 4 | |
2 | 5 | def randomly_orient_invertibles(self): |
3 | 6 | ''' Selects a random orientation for all invertible segments. |
4 | 7 | |
5 | 8 | For the definition of invertible segment, see Gonnella and Kurtz (2016).''' |
6 | for s in segment_names: | |
9 | for sn in self.segment_names: | |
7 | 10 | if self._segment_same_links_both_ends(sn): |
8 | 11 | self._randomly_orient_proven_invertible_segment(sn) |
9 | 12 | |
65 | 68 | return [list(v) for k,v in groupby(sorted(links,key=sig),key=sig)] |
66 | 69 | |
67 | 70 | def _annotate_random_orientation(self, segment_name): |
68 | s = self.try_get_segment(segment_name) | |
71 | segment = self.try_get_segment(segment_name) | |
69 | 72 | n = segment.name.split("_") |
70 | 73 | pairs = 0 |
71 | 74 | pos = [1, segment.LN] |
2 | 2 | class LinearPaths: |
3 | 3 | |
4 | 4 | def linear_path(self, segment, exclude = None): |
5 | """Finnd a linear path which contains the specified segment | |
5 | """Find a linear path which contains the specified segment | |
6 | 6 | |
7 | 7 | Parameters: |
8 | 8 | segment (str, Line): the segment to analyse |
333 | 333 | merged.LN = len(merged.sequence) |
334 | 334 | elif self._vlevel > 0 and merged.LN != len(merged.sequence): |
335 | 335 | raise gfapy.InconsistencyError( |
336 | "Computed sequence length {} ".format(merged.sequence.length)+ | |
336 | "Computed sequence length {} ".format(len(merged.sequence))+ | |
337 | 337 | "and computed LN {} differ".format(merged.LN)) |
338 | 338 | if merged.length is not None: |
339 | 339 | for count_tag in ["KC", "RC", "FC"]: |
50 | 50 | raise gfapy.ArgumentError("Mulitiplication factor must be >= 0"+ |
51 | 51 | " ({} found)".format(factor)) |
52 | 52 | elif factor == 0: |
53 | if conserve_components and factor == 1 and is_cut_segment(segment): | |
53 | if conserve_components and factor == 1 and self.is_cut_segment(segment): | |
54 | 54 | return self |
55 | 55 | else: |
56 | 56 | self.rm(segment) |
97 | 97 | processed_circulars = set() |
98 | 98 | for l in segment.dovetails + segment.containments: |
99 | 99 | if l.is_circular(): |
100 | if l not in processed_circular: | |
100 | if l not in processed_circulars: | |
101 | 101 | self.__divide_counts(l, factor) |
102 | processed_circular.append(l) | |
102 | processed_circulars.add(l) | |
103 | 103 | else: |
104 | 104 | self.__divide_counts(l, factor) |
105 | 105 | |
121 | 121 | def _select_distribute_end(self, links_distribution_policy, |
122 | 122 | segment_name, factor): |
123 | 123 | if links_distribution_policy not in self.LINKS_DISTRIBUTION_POLICY: |
124 | raise gfa.ArgumentError("Unknown links distribution policy {}\n".format(\ | |
124 | raise gfapy.ArgumentError("Unknown links distribution policy {}\n".format(\ | |
125 | 125 | links_distribution_policy)+"accepted values are: {}".format(\ |
126 | 126 | ", ".join(self.LINKS_DISTRIBUTION_POLICY))) |
127 | 127 | if links_distribution_policy == "off": |
23 | 23 | count_tag=self.default["count_tag"] |
24 | 24 | if unit_length is None: |
25 | 25 | unit_length=self.default["unit_length"] |
26 | s1 = segment(segment_end1.segment) | |
27 | s2 = segment(segment_end2.segment) | |
28 | et1 = segment(segment_end1.end_type) | |
29 | et2 = segment(segment_end2.end_type) | |
26 | s1 = self.try_get_segment(segment_end1.segment) | |
27 | s2 = self.try_get_segment(segment_end2.segment) | |
28 | et1 = segment_end1.end_type | |
29 | et2 = segment_end2.end_type | |
30 | 30 | n1 = sorted(s1.neighbours(et1), key=lambda s:s.name) |
31 | 31 | n2 = sorted(s2.neighbours(et2), key=lambda s:s.name) |
32 | 32 | assert list(n1) == [os.inverted() for os in n2] |
44 | 44 | unit_length=unit_length) for s in alternatives] |
45 | 45 | alternatives.pop(coverages.index(max(coverages))) |
46 | 46 | for s in alternatives: |
47 | segment(s[0]).disconnect() | |
47 | self.try_get_segment(s[0]).disconnect() |
3 | 3 | |
4 | 4 | def _junction_junction_paths(self, sn, exclude): |
5 | 5 | retval = [] |
6 | exclude.append(sn) | |
6 | exclude.add(sn) | |
7 | 7 | s = self.segment(sn) |
8 | 8 | for dL in s.dovetails_L: |
9 | 9 | eL = dL.other_end(gfapy.SegmentEnd(s, "L")) |
76 | 76 | elif self._version == "gfa2": |
77 | 77 | mln = len(merged.sequence) |
78 | 78 | tmp_link = gfapy.line.edge.GFA2(["*",merged.name+"+", \ |
79 | last_name+("-" if is_reversed else "+"), | |
79 | last.name+("-" if is_reversed else "+"), | |
80 | 80 | str(mln - ln), "{}$".format(mln), |
81 | 81 | str(ln-1) if is_reversed else "0", # on purpose fake |
82 | 82 | "{}$".format(ln) if is_reversed else "1", # on purpose fake |
23 | 23 | oe = {} |
24 | 24 | for et in ["L", "R"]: |
25 | 25 | oe[et] = l[et][0].other_end(se[et]) |
26 | if oe[:L] == oe[:R]: | |
26 | if oe["L"] == oe["R"]: | |
27 | 27 | return |
28 | 28 | for et in ["L", "R"]: |
29 | 29 | self._delete_other_links(oe[et], se[et], |
30 | 30 | conserve_components=conserve_components) |
31 | 31 | else: |
32 | if l[:L].size == 1: | |
32 | if l["L"].size == 1: | |
33 | 33 | et = "L" |
34 | elif l[:R].size == 1: | |
34 | elif l["R"].size == 1: | |
35 | 35 | et = "R" |
36 | 36 | else: |
37 | 37 | return |
51 | 51 | removed, if their removal does not split connected components |
52 | 52 | of the graph (considering as connections only dovetail overlaps) |
53 | 53 | """ |
54 | for sn in segment_names: | |
54 | for sn in self.segment_names: | |
55 | 55 | self.enforce_segment_mandatory_links(sn, conserve_components= |
56 | 56 | conserve_components) |
57 | 57 | |
63 | 63 | Parameters: |
64 | 64 | segment (str, Line): the segment |
65 | 65 | """ |
66 | if not isinstance(segment, gfa.Line): | |
66 | if not isinstance(segment, gfapy.Line): | |
67 | 67 | segment = self.try_get_segment(segment) |
68 | 68 | for e in segment.dovetails: |
69 | 69 | if e.from_segment == e.to_segment: |
71 | 71 | |
72 | 72 | def remove_self_links(self): |
73 | 73 | """Remove all dovetail overlap of segments to themselves, if any.""" |
74 | for sn in segment_names: | |
74 | for sn in self.segment_names: | |
75 | 75 | self.remove_self_link(sn) |
25 | 25 | c[et] = set() |
26 | 26 | visited = set() |
27 | 27 | segend = link.get("from") if et == "from" else link.to |
28 | visited.append(segend.name) | |
29 | visited.append(link.other_end(segend).name) | |
28 | visited.add(segend.name) | |
29 | visited.add(link.other_end(segend).name) | |
30 | 30 | self.__traverse_component(segend, c[et], visited) |
31 | 31 | return c["from"] != c["to"] |
32 | 32 | |
49 | 49 | start_points = set() |
50 | 50 | for et in ["L", "R"]: |
51 | 51 | for l in segment.dovetails_of_end(et): |
52 | start_points.append(l.other_end(\ | |
53 | gfapy.SegmentEnd(segment_name, et)).inverted()) | |
52 | start_points.add(l.other_end(\ | |
53 | gfapy.SegmentEnd(segment.name, et)).inverted()) | |
54 | 54 | cc = [] |
55 | 55 | for start_point in start_points: |
56 | 56 | cc.append(set()) |
57 | 57 | visited = set() |
58 | visited.append(segment_name) | |
59 | traverse_component(start_point, cc[-1], visited) | |
58 | visited.add(segment.name) | |
59 | self.__traverse_component(start_point, cc[-1], visited) | |
60 | 60 | return any(c != cc[0] for c in cc) |
61 | 61 | |
62 | 62 | def segment_connected_component(self, segment, visited = None): |
79 | 79 | segment_name = segment |
80 | 80 | segment = self.segment(segment) |
81 | 81 | visited.add(segment_name) |
82 | c = [segment] | |
82 | c = set() | |
83 | c.add(segment) | |
83 | 84 | for e in ["L", "R"]: |
84 | 85 | self.__traverse_component(gfapy.SegmentEnd(segment, e), c, visited) |
85 | return c | |
86 | return list(c) | |
86 | 87 | |
87 | 88 | def connected_components(self): |
88 | 89 | """Compute the connected components of the graph. |
183 | 184 | if sn in visited: |
184 | 185 | continue |
185 | 186 | visited.add(sn) |
186 | c.append(s) | |
187 | c.add(s) | |
187 | 188 | for e in ["L","R"]: |
188 | 189 | self.__traverse_component(gfapy.SegmentEnd(s, e), c, visited) |
0 | import gfapy | |
1 | 0 | from ..line import Line |
2 | 1 | from .construction import Construction |
3 | 2 | from .tags import Tags |
0 | import gfapy | |
1 | ||
2 | 0 | class Writer: |
3 | 1 | def __str__(self): |
4 | 2 | return "#" + str(self.spacer) + str(self.content) |
298 | 298 | DynamicField(partial(get_method, k = k), |
299 | 299 | partial(set_method, k = k))) |
300 | 300 | def all_references(self): |
301 | return [ item for item in values for values in self._refs ] | |
301 | return [ item for item in [ values for values in self._refs ] ] | |
302 | 302 | |
303 | 303 | @classmethod |
304 | 304 | def register_extension(cls, references=[]): |
0 | import gfapy | |
1 | ||
2 | 0 | class DefaultRecordDefinition: |
3 | 1 | """ |
4 | 2 | Provides default values for the constants for the definition of record types. |
30 | 30 | return super().__setattr__(name, value) |
31 | 31 | else: |
32 | 32 | attr.set(self, value) |
33 | except AttributeError as err: | |
33 | except AttributeError: | |
34 | 34 | return self._set_dynamic_field(name, value) |
35 | 35 | |
36 | 36 | def _get_dynamic_field(self, name, err): |
44 | 44 | self._validate_predefined_tag_type(n, self._field_datatype(n)) |
45 | 45 | elif not self._is_valid_custom_tagname(n): |
46 | 46 | raise gfapy.FormatError("Custom tag names must consist in a letter "+ |
47 | "and a digit or two letters\nFound: {}".format(tagname)) | |
47 | "and a digit or two letters\nFound: {}".format(n)) | |
48 | 48 | |
49 | 49 | def _validate_predefined_tag_type(self, tagname, datatype): |
50 | 50 | if datatype != self.__class__.DATATYPE[tagname]: |
158 | 158 | (str, str, object) list |
159 | 159 | """ |
160 | 160 | retval = [] |
161 | for of in tagnames: | |
161 | for of in self.tagnames: | |
162 | 162 | retval.append([of, self.get_datatype(of), self.get(of)]) |
163 | 163 | return retval |
17 | 17 | raise gfapy.InconsistencyError( |
18 | 18 | "Edge: {}\n".format(str(self))+ |
19 | 19 | "Field {}: $ after ".format(fn)+ |
20 | "non-last position\n".format(str(pos))+ | |
20 | "non-last position {}\n".format(str(pos))+ | |
21 | 21 | "Segment: {}".format(str(seg))) |
8 | 8 | Thereby, tags are not considered. |
9 | 9 | """ |
10 | 10 | hash(str(self.from_end)) + \ |
11 | hash(str(to_end)) + \ | |
12 | hash(str(overlap)) + \ | |
13 | hash(str(overlap.complement())) | |
11 | hash(str(self.to_end)) + \ | |
12 | hash(str(self.overlap)) + \ | |
13 | hash(str(self.overlap.complement())) | |
14 | 14 | |
15 | 15 | def is_eql(self, other): |
16 | 16 | """ |
86 | 86 | oss = [oriented_edge.line.sid1, oriented_edge.line.sid2] |
87 | 87 | if oriented_edge.orient == "-": |
88 | 88 | for i in range(len(oss)): |
89 | oss[i].invert() | |
89 | oss[i] = oss[i].inverted() | |
90 | 90 | if len(items) > 1: |
91 | 91 | nextitem = items[1] |
92 | 92 | if isinstance(nextitem.line, gfapy.line.segment.GFA2): |
98 | 98 | oss_of_next = [nextitem.line.sid1, nextitem.line.sid2] |
99 | 99 | if oriented_edge.orient == "-": |
100 | 100 | for i in range(len(oss_of_next)): |
101 | oss_of_next[i].invert() | |
101 | oss_of_next[i] = oss_of_next[i].inverted() | |
102 | 102 | if oss[0] in oss_of_next: |
103 | 103 | oss.reverse() |
104 | 104 | # if oss_of_next have no element in common with oss an error will be |
105 | 105 | # raised in the next iteration, so does not need to be handled here |
106 | 106 | elif isinstance(nextitem.line, gfapy.line.group.Ordered): |
107 | subpath = item.line.captured_path | |
107 | subpath = nextitem.line.captured_path | |
108 | 108 | if not subpath: return# does not need to be further handled here |
109 | if item.orient == "+": | |
110 | firstsubpathsegment = supath[0] | |
109 | if nextitem.orient == "+": | |
110 | firstsubpathsegment = subpath[0] | |
111 | 111 | else: |
112 | firstsubpathsegment = supath[-1].inverted() | |
112 | firstsubpathsegment = subpath[-1].inverted() | |
113 | 113 | if firstsubpathsegment == oss[0]: |
114 | 114 | oss.reverse() |
115 | 115 | # if oss does not include in firstsubpathsegment |
128 | 128 | possible_prev = [oriented_edge.line.sid1, oriented_edge.line.sid2] |
129 | 129 | if oriented_edge.orient == "-": |
130 | 130 | for i, v in enumerate(possible_prev): |
131 | possible_prev[i].invert() | |
131 | possible_prev[i] = possible_prev[i].inverted() | |
132 | 132 | if prev_os == possible_prev[0]: |
133 | 133 | path.append(possible_prev[1]) |
134 | 134 | elif prev_os == possible_prev[1]: |
166 | 166 | "Expected element:\n"+ |
167 | 167 | " {} ({})\n".format(path[-1], path[-1].line)+ |
168 | 168 | "Current element:\n"+ |
169 | " {} ({})\n".format(segment, segment.line)) | |
169 | " {} ({})\n".format(oriented_segment, oriented_segment.line)) | |
170 | 170 | |
171 | 171 | def _check_s_to_e_contiguity(self, path, oriented_segment): |
172 | 172 | # check that segment is an extremity of path[-1] |
0 | import gfapy | |
1 | ||
0 | 2 | class References: |
1 | 3 | |
2 | 4 | def append_item(self, item): |
42 | 44 | self.items = self.items[1:] |
43 | 45 | else: |
44 | 46 | self.items[0].update_reference(self, "paths") |
45 | self._delete_reference(items[0], "items") | |
47 | self._delete_reference(self.items[0], "items") | |
46 | 48 | self.compute_induced_set() # check contiguity |
47 | 49 | |
48 | 50 | def rm_last_item(self): |
65 | 67 | if isinstance(item.line, gfapy.Line): |
66 | 68 | item.line = item.name |
67 | 69 | if append: |
68 | items.append(item) | |
70 | self.items.append(item) | |
69 | 71 | else: |
70 | items.insert(0, item) | |
72 | self.items.insert(0, item) | |
71 | 73 | |
72 | 74 | def _add_item_to_connected_group(self, item, append = True): |
73 | 75 | item.line = self.prepare_and_check_ref(item.line) |
19 | 19 | gfapy.Field._validate_gfa_field(oline.line.overlap, "alignment_gfa1") |
20 | 20 | overlaps.append(str(oline.line.overlap)) |
21 | 21 | a.append(",".join(overlaps)) |
22 | for tn in self.tagnames: | |
23 | a.append(self.field_to_s(tn, tag=True)) | |
22 | 24 | return a |
11 | 11 | a = ["O"] |
12 | 12 | a.append(self.field_to_s("path_name")) |
13 | 13 | a.append(" ".join(items)) |
14 | for tn in self.tagnames: | |
15 | a.append(self.field_to_s(tn, tag=True)) | |
14 | 16 | return a |
0 | import gfapy | |
1 | ||
0 | 2 | class References: |
1 | 3 | |
2 | 4 | def add_item(self, item): |
38 | 40 | if isinstance(item, str): |
39 | 41 | item = self._gfa.line(item) |
40 | 42 | self._check_item_included(item) |
41 | line._delete_reference(self, "sets") | |
42 | self._delete_reference(line, "items") | |
43 | item._delete_reference(self, "sets") | |
44 | self._delete_reference(item, "items") | |
43 | 45 | return None |
44 | 46 | |
45 | 47 | def _check_item_included(self, item): |
0 | 0 | from .common.construction import Construction |
1 | from .common.dynamic_fields import DynamicFields, DynamicField | |
1 | from .common.dynamic_fields import DynamicFields | |
2 | 2 | from .common.writer import Writer |
3 | 3 | from .common.version_conversion import VersionConversion |
4 | 4 | from .common.field_datatype import FieldDatatype |
11 | 11 | from .common.disconnection import Disconnection |
12 | 12 | from .common.validate import Validate |
13 | 13 | from .common.default_record_definition import DefaultRecordDefinition |
14 | ||
15 | import gfapy | |
16 | 14 | |
17 | 15 | class Line(Construction, DynamicFields, Writer, VersionConversion, |
18 | 16 | FieldDatatype, FieldData, Equivalence, Cloning, Connection, |
16 | 16 | "GFA2 requires to specify a length\n"+ |
17 | 17 | "No length information available in the GFA1 segment:\n"+ |
18 | 18 | "Segment line: {}".format(str(self))) |
19 | a = ["S", self.field_to_s("name", tag = False), str(self.try_get_length()), | |
19 | a = ["S", self.field_to_s("name", tag = False), str(length), | |
20 | 20 | self.field_to_s("sequence", tag = False)] |
21 | 21 | for fn in self.tagnames: |
22 | 22 | if fn != "LN": |
153 | 153 | elif gfa_line.record_type in ["L", "P", "C", "#"]: |
154 | 154 | gfa_line.connect(self) |
155 | 155 | else: |
156 | rt = gfa_line.record_type | |
156 | 157 | raise gfapy.AssertionError( |
157 | 158 | "Invalid record type {}. This should never happen".format(rt)) |
158 | 159 |
21 | 21 | other_end = gfapy.SegmentEnd(other_end) |
22 | 22 | s = self.try_get_segment(segment_end.segment) |
23 | 23 | for d in s.dovetails_of_end(segment_end.end_type): |
24 | if not conserve_components or not self.is_cut_link(l): | |
25 | l.disconnect() | |
24 | if not conserve_components or not self.is_cut_link(d): | |
25 | d.disconnect() | |
26 | 26 | |
27 | 27 | def _unregister_line(self, gfa_line): |
28 | 28 | self._api_private_check_gfa_line(gfa_line, "unregister_line") |
31 | 31 | raise gfapy.AssertionError("Bug found, please report\n"+ |
32 | 32 | "gfa_line: {}".format(gfa_line)) |
33 | 33 | collection = self._records[rt] |
34 | key = gfa_line | |
35 | delete_if_empty = None | |
36 | 34 | storage_key = gfa_line.__class__.STORAGE_KEY |
37 | 35 | if storage_key == "name": |
38 | 36 | name = gfa_line.name |
1 | 1 | import gfapy.line |
2 | 2 | from .collections import Collections |
3 | 3 | from .headers import Headers |
4 | from .collections import Collections | |
5 | 4 | from .creators import Creators |
6 | 5 | from .destructors import Destructors |
7 | 6 | from .finders import Finders |
207 | 207 | "Value: {1}\n"+ |
208 | 208 | "Range: {2}\n"+ |
209 | 209 | "Content: {3}").format(subtype, e, |
210 | repr(range), repr(array))) | |
210 | repr(range), repr(elems))) | |
211 | 211 | yield e |
212 | 212 | else: |
213 | 213 | yield float(e) |
0 | Metadata-Version: 1.1 | |
1 | Name: gfapy | |
2 | Version: 1.2.0 | |
3 | Summary: Library for handling data in the GFA1 and GFA2 formats | |
4 | Home-page: https://github.com/ggonnella/gfapy | |
5 | Author: Giorgio Gonnella and others (see CONTRIBUTORS) | |
6 | Author-email: gonnella@zbh.uni-hamburg.de | |
7 | License: ISC | |
8 | Description: Gfapy | |
9 | ~~~~~ | |
10 | ||
11 | |travis| |readthedocs| |latesttag| |license| |requiresio| | |
12 | ||
13 | |bioconda| |pypi| |debian| |ubuntu| | |
14 | ||
15 | .. sphinx-begin | |
16 | ||
17 | The Graphical Fragment Assembly (GFA) are formats for the representation | |
18 | of sequence graphs, including assembly, variation and splicing graphs. | |
19 | Two versions of GFA have been defined (GFA1 and GFA2) and several sequence | |
20 | analysis programs have been adopting the formats as an interchange format, | |
21 | which allow to easily combine different sequence analysis tools. | |
22 | ||
23 | This library implements the GFA1 and GFA2 specification | |
24 | described at https://github.com/GFA-spec/GFA-spec/blob/master/GFA-spec.md. | |
25 | It allows to create a Gfa object from a file in the GFA format | |
26 | or from scratch, to enumerate the graph elements (segments, links, | |
27 | containments, paths and header lines), to traverse the graph (by | |
28 | traversing all links outgoing from or incoming to a segment), to search for | |
29 | elements (e.g. which links connect two segments) and to manipulate the | |
30 | graph (e.g. to eliminate a link or a segment or to duplicate a segment | |
31 | distributing the read counts evenly on the copies). | |
32 | ||
33 | The GFA format can be easily extended by users by defining own custom | |
34 | tags and record types. In Gfapy, it is easy to write extensions modules, | |
35 | which allow to define custom record types and datatypes for the parsing | |
36 | and validation of custom fields. The custom lines can be connected, using | |
37 | references, to each other and to lines of the standard record types. | |
38 | ||
39 | Requirements | |
40 | ~~~~~~~~~~~~ | |
41 | ||
42 | Gfapy has been written for Python 3 and tested using Python version 3.7. | |
43 | It does not require any additional Python packages or other software. | |
44 | ||
45 | Installation | |
46 | ~~~~~~~~~~~~ | |
47 | ||
48 | Gfapy is distributed as a Python package and can be installed using | |
49 | the Python package manager pip, as well as conda (in the Bioconda channel). | |
50 | It is also available as a package in some Linux distributions (Debian, Ubuntu). | |
51 | ||
52 | The following command installs the current stable version from the Python | |
53 | Packages index:: | |
54 | ||
55 | pip install gfapy | |
56 | ||
57 | If you would like to install the current development version from Github, | |
58 | use the following command:: | |
59 | ||
60 | pip install -e git+https://github.com/ggonnella/gfapy.git#egg=gfapy | |
61 | ||
62 | Alternatively it is possible to install gfapy using conda. Gfapy is | |
63 | included in the Bioconda (https://bioconda.github.io/) channel:: | |
64 | ||
65 | conda install -c bioconda gfapy | |
66 | ||
67 | Usage | |
68 | ~~~~~ | |
69 | ||
70 | If you installed gfapy as described above, you can import it in your script | |
71 | using the conventional Python syntax:: | |
72 | ||
73 | >>> import gfapy | |
74 | ||
75 | Documentation | |
76 | ~~~~~~~~~~~~~ | |
77 | ||
78 | The documentation, including this introduction to Gfapy, a user manual | |
79 | and the API documentation is hosted on the ReadTheDocs server, | |
80 | at the URL http://gfapy.readthedocs.io/en/latest/ and it can be | |
81 | downloaded as PDF from the URL | |
82 | https://github.com/ggonnella/gfapy/blob/master/manual/gfapy-manual.pdf. | |
83 | ||
84 | References | |
85 | ~~~~~~~~~~ | |
86 | ||
87 | Giorgio Gonnella and Stefan Kurtz "GfaPy: a flexible and extensible software | |
88 | library for handling sequence graphs in Python", Bioinformatics (2017) btx398 | |
89 | https://doi.org/10.1093/bioinformatics/btx398 | |
90 | ||
91 | .. sphinx-end | |
92 | ||
93 | .. |travis| | |
94 | image:: https://travis-ci.com/ggonnella/gfapy.svg?branch=master | |
95 | :target: https://travis-ci.com/ggonnella/gfapy | |
96 | :alt: Travis | |
97 | ||
98 | .. |latesttag| | |
99 | image:: https://img.shields.io/github/v/tag/ggonnella/gfapy | |
100 | :target: https://github.com/ggonnella/gfapy/tags | |
101 | :alt: Latest GitHub tag | |
102 | ||
103 | .. |readthedocs| | |
104 | image:: https://readthedocs.org/projects/pip/badge/?version=stable | |
105 | :target: https://pip.pypa.io/en/stable/?badge=stable | |
106 | :alt: ReadTheDocs | |
107 | ||
108 | .. |bioconda| | |
109 | image:: https://img.shields.io/conda/vn/bioconda/gfapy | |
110 | :target: https://bioconda.github.io/recipes/gfapy/README.html | |
111 | :alt: Bioconda | |
112 | ||
113 | .. |pypi| | |
114 | image:: https://img.shields.io/pypi/v/gfapy | |
115 | :target: https://pypi.org/project/gfapy/ | |
116 | :alt: PyPI | |
117 | ||
118 | .. |debian| | |
119 | image:: https://img.shields.io/debian/v/gfapy | |
120 | :target: https://packages.debian.org/search?keywords=gfapy | |
121 | :alt: Debian | |
122 | ||
123 | .. |ubuntu| | |
124 | image:: https://img.shields.io/ubuntu/v/gfapy | |
125 | :target: https://packages.ubuntu.com/search?keywords=gfapy | |
126 | :alt: Ubuntu | |
127 | ||
128 | .. |license| | |
129 | image:: https://img.shields.io/pypi/l/gfapy | |
130 | :target: https://github.com/ggonnella/gfapy/blob/master/LICENSE.txt | |
131 | :alt: ISC License | |
132 | ||
133 | .. |requiresio| | |
134 | image:: https://requires.io/github/ggonnella/gfapy/requirements.svg?branch=master | |
135 | :target: https://requires.io/github/ggonnella/gfapy/requirements/?branch=master | |
136 | :alt: Requirements Status | |
137 | ||
138 | Keywords: bioinformatics genomics sequences GFA assembly graphs | |
139 | Platform: UNKNOWN | |
140 | Classifier: Development Status :: 5 - Production/Stable | |
141 | Classifier: Environment :: Console | |
142 | Classifier: Intended Audience :: Developers | |
143 | Classifier: Intended Audience :: End Users/Desktop | |
144 | Classifier: Intended Audience :: Science/Research | |
145 | Classifier: License :: OSI Approved :: ISC License (ISCL) | |
146 | Classifier: Operating System :: MacOS :: MacOS X | |
147 | Classifier: Operating System :: POSIX :: Linux | |
148 | Classifier: Programming Language :: Python :: 3 :: Only | |
149 | Classifier: Topic :: Scientific/Engineering :: Bio-Informatics | |
150 | Classifier: Topic :: Software Development :: Libraries |
0 | MANIFEST.in | |
1 | README.rst | |
2 | setup.cfg | |
3 | setup.py | |
4 | bin/gfapy-convert | |
5 | bin/gfapy-mergelinear | |
6 | bin/gfapy-renumber | |
7 | bin/gfapy-validate | |
8 | gfapy/__init__.py | |
9 | gfapy/byte_array.py | |
10 | gfapy/error.py | |
11 | gfapy/field_array.py | |
12 | gfapy/gfa.py | |
13 | gfapy/lastpos.py | |
14 | gfapy/logger.py | |
15 | gfapy/numeric_array.py | |
16 | gfapy/oriented_line.py | |
17 | gfapy/placeholder.py | |
18 | gfapy/rgfa.py | |
19 | gfapy/segment_end.py | |
20 | gfapy/segment_end_path.py | |
21 | gfapy/sequence.py | |
22 | gfapy/symbol_invert.py | |
23 | gfapy.egg-info/PKG-INFO | |
24 | gfapy.egg-info/SOURCES.txt | |
25 | gfapy.egg-info/dependency_links.txt | |
26 | gfapy.egg-info/not-zip-safe | |
27 | gfapy.egg-info/top_level.txt | |
28 | gfapy/alignment/__init__.py | |
29 | gfapy/alignment/alignment.py | |
30 | gfapy/alignment/cigar.py | |
31 | gfapy/alignment/placeholder.py | |
32 | gfapy/alignment/trace.py | |
33 | gfapy/field/__init__.py | |
34 | gfapy/field/alignment_gfa1.py | |
35 | gfapy/field/alignment_gfa2.py | |
36 | gfapy/field/alignment_list_gfa1.py | |
37 | gfapy/field/byte_array.py | |
38 | gfapy/field/char.py | |
39 | gfapy/field/comment.py | |
40 | gfapy/field/custom_record_type.py | |
41 | gfapy/field/field.py | |
42 | gfapy/field/float.py | |
43 | gfapy/field/generic.py | |
44 | gfapy/field/identifier_gfa2.py | |
45 | gfapy/field/identifier_list_gfa2.py | |
46 | gfapy/field/integer.py | |
47 | gfapy/field/json.py | |
48 | gfapy/field/numeric_array.py | |
49 | gfapy/field/optional_identifier_gfa2.py | |
50 | gfapy/field/optional_integer.py | |
51 | gfapy/field/orientation.py | |
52 | gfapy/field/oriented_identifier_gfa2.py | |
53 | gfapy/field/oriented_identifier_list_gfa1.py | |
54 | gfapy/field/oriented_identifier_list_gfa2.py | |
55 | gfapy/field/parser.py | |
56 | gfapy/field/path_name_gfa1.py | |
57 | gfapy/field/position_gfa1.py | |
58 | gfapy/field/position_gfa2.py | |
59 | gfapy/field/segment_name_gfa1.py | |
60 | gfapy/field/sequence_gfa1.py | |
61 | gfapy/field/sequence_gfa2.py | |
62 | gfapy/field/string.py | |
63 | gfapy/field/validator.py | |
64 | gfapy/field/writer.py | |
65 | gfapy/graph_operations/__init__.py | |
66 | gfapy/graph_operations/artifacts.py | |
67 | gfapy/graph_operations/copy_number.py | |
68 | gfapy/graph_operations/graph_operations.py | |
69 | gfapy/graph_operations/invertible_segments.py | |
70 | gfapy/graph_operations/linear_paths.py | |
71 | gfapy/graph_operations/multiplication.py | |
72 | gfapy/graph_operations/p_bubbles.py | |
73 | gfapy/graph_operations/redundant_linear_paths.py | |
74 | gfapy/graph_operations/superfluous_links.py | |
75 | gfapy/graph_operations/topology.py | |
76 | gfapy/line/__init__.py | |
77 | gfapy/line/line.py | |
78 | gfapy/line/comment/__init__.py | |
79 | gfapy/line/comment/comment.py | |
80 | gfapy/line/comment/construction.py | |
81 | gfapy/line/comment/tags.py | |
82 | gfapy/line/comment/version_conversion.py | |
83 | gfapy/line/comment/writer.py | |
84 | gfapy/line/common/__init__.py | |
85 | gfapy/line/common/cloning.py | |
86 | gfapy/line/common/connection.py | |
87 | gfapy/line/common/construction.py | |
88 | gfapy/line/common/default_record_definition.py | |
89 | gfapy/line/common/disconnection.py | |
90 | gfapy/line/common/dynamic_fields.py | |
91 | gfapy/line/common/equivalence.py | |
92 | gfapy/line/common/field_data.py | |
93 | gfapy/line/common/field_datatype.py | |
94 | gfapy/line/common/update_references.py | |
95 | gfapy/line/common/validate.py | |
96 | gfapy/line/common/version_conversion.py | |
97 | gfapy/line/common/virtual_to_real.py | |
98 | gfapy/line/common/writer.py | |
99 | gfapy/line/custom_record/__init__.py | |
100 | gfapy/line/custom_record/construction.py | |
101 | gfapy/line/custom_record/custom_record.py | |
102 | gfapy/line/edge/__init__.py | |
103 | gfapy/line/edge/edge.py | |
104 | gfapy/line/edge/common/__init__.py | |
105 | gfapy/line/edge/common/alignment_type.py | |
106 | gfapy/line/edge/common/from_to.py | |
107 | gfapy/line/edge/containment/__init__.py | |
108 | gfapy/line/edge/containment/canonical.py | |
109 | gfapy/line/edge/containment/containment.py | |
110 | gfapy/line/edge/containment/pos.py | |
111 | gfapy/line/edge/containment/to_gfa2.py | |
112 | gfapy/line/edge/gfa1/__init__.py | |
113 | gfapy/line/edge/gfa1/alignment_type.py | |
114 | gfapy/line/edge/gfa1/oriented_segments.py | |
115 | gfapy/line/edge/gfa1/other.py | |
116 | gfapy/line/edge/gfa1/references.py | |
117 | gfapy/line/edge/gfa1/to_gfa2.py | |
118 | gfapy/line/edge/gfa2/__init__.py | |
119 | gfapy/line/edge/gfa2/alignment_type.py | |
120 | gfapy/line/edge/gfa2/gfa2.py | |
121 | gfapy/line/edge/gfa2/other.py | |
122 | gfapy/line/edge/gfa2/references.py | |
123 | gfapy/line/edge/gfa2/to_gfa1.py | |
124 | gfapy/line/edge/gfa2/validation.py | |
125 | gfapy/line/edge/link/__init__.py | |
126 | gfapy/line/edge/link/canonical.py | |
127 | gfapy/line/edge/link/complement.py | |
128 | gfapy/line/edge/link/equivalence.py | |
129 | gfapy/line/edge/link/link.py | |
130 | gfapy/line/edge/link/references.py | |
131 | gfapy/line/edge/link/to_gfa2.py | |
132 | gfapy/line/fragment/__init__.py | |
133 | gfapy/line/fragment/fragment.py | |
134 | gfapy/line/fragment/references.py | |
135 | gfapy/line/fragment/validation.py | |
136 | gfapy/line/gap/__init__.py | |
137 | gfapy/line/gap/gap.py | |
138 | gfapy/line/gap/references.py | |
139 | gfapy/line/group/__init__.py | |
140 | gfapy/line/group/group.py | |
141 | gfapy/line/group/gfa2/__init__.py | |
142 | gfapy/line/group/gfa2/references.py | |
143 | gfapy/line/group/gfa2/same_id.py | |
144 | gfapy/line/group/ordered/__init__.py | |
145 | gfapy/line/group/ordered/captured_path.py | |
146 | gfapy/line/group/ordered/ordered.py | |
147 | gfapy/line/group/ordered/references.py | |
148 | gfapy/line/group/ordered/to_gfa1.py | |
149 | gfapy/line/group/path/__init__.py | |
150 | gfapy/line/group/path/captured_path.py | |
151 | gfapy/line/group/path/path.py | |
152 | gfapy/line/group/path/references.py | |
153 | gfapy/line/group/path/to_gfa2.py | |
154 | gfapy/line/group/path/topology.py | |
155 | gfapy/line/group/path/validation.py | |
156 | gfapy/line/group/unordered/__init__.py | |
157 | gfapy/line/group/unordered/induced_set.py | |
158 | gfapy/line/group/unordered/references.py | |
159 | gfapy/line/group/unordered/unordered.py | |
160 | gfapy/line/header/__init__.py | |
161 | gfapy/line/header/connection.py | |
162 | gfapy/line/header/field_data.py | |
163 | gfapy/line/header/header.py | |
164 | gfapy/line/header/multiline.py | |
165 | gfapy/line/header/version_conversion.py | |
166 | gfapy/line/segment/__init__.py | |
167 | gfapy/line/segment/coverage.py | |
168 | gfapy/line/segment/gfa1.py | |
169 | gfapy/line/segment/gfa1_to_gfa2.py | |
170 | gfapy/line/segment/gfa2.py | |
171 | gfapy/line/segment/gfa2_to_gfa1.py | |
172 | gfapy/line/segment/length_gfa1.py | |
173 | gfapy/line/segment/references.py | |
174 | gfapy/line/segment/segment.py | |
175 | gfapy/line/segment/writer_wo_sequence.py | |
176 | gfapy/line/unknown/__init__.py | |
177 | gfapy/line/unknown/unknown.py | |
178 | gfapy/lines/__init__.py | |
179 | gfapy/lines/collections.py | |
180 | gfapy/lines/creators.py | |
181 | gfapy/lines/destructors.py | |
182 | gfapy/lines/finders.py | |
183 | gfapy/lines/headers.py | |
184 | gfapy/lines/lines.py | |
185 | manual/gfapy-manual.pdf | |
186 | tests/__init__.py | |
187 | tests/extension.py | |
188 | tests/test_api_alignment.py | |
189 | tests/test_api_comments.py | |
190 | tests/test_api_custom_records.py | |
191 | tests/test_api_extensions.py | |
192 | tests/test_api_gfa1_lines.py | |
193 | tests/test_api_gfa2_lines.py | |
194 | tests/test_api_gfa_basics.py | |
195 | tests/test_api_groups_validation.py | |
196 | tests/test_api_header.py | |
197 | tests/test_api_linear_paths.py | |
198 | tests/test_api_linear_paths_extended.py | |
199 | tests/test_api_lines_collections.py | |
200 | tests/test_api_lines_creators.py | |
201 | tests/test_api_lines_destructors.py | |
202 | tests/test_api_lines_finders.py | |
203 | tests/test_api_multiplication.py | |
204 | tests/test_api_placeholders.py | |
205 | tests/test_api_positionals.py | |
206 | tests/test_api_positions.py | |
207 | tests/test_api_references_edge_gfa1.py | |
208 | tests/test_api_references_edge_gfa2.py | |
209 | tests/test_api_references_f_g_lines.py | |
210 | tests/test_api_references_groups.py | |
211 | tests/test_api_references_virtual.py | |
212 | tests/test_api_rename_lines.py | |
213 | tests/test_api_rgfa.py | |
214 | tests/test_api_tags.py | |
215 | tests/test_api_version.py | |
216 | tests/test_api_version_conversion.py | |
217 | tests/test_gfapy_alignment.py | |
218 | tests/test_gfapy_byte_array.py | |
219 | tests/test_gfapy_cigar.py | |
220 | tests/test_gfapy_line_containment.py | |
221 | tests/test_gfapy_line_edge.py | |
222 | tests/test_gfapy_line_header.py | |
223 | tests/test_gfapy_line_link.py | |
224 | tests/test_gfapy_line_path.py | |
225 | tests/test_gfapy_line_segment.py | |
226 | tests/test_gfapy_line_version.py | |
227 | tests/test_gfapy_numeric_array.py | |
228 | tests/test_gfapy_segment_references.py | |
229 | tests/test_gfapy_sequence.py | |
230 | tests/test_gfapy_trace.py | |
231 | tests/test_graphop_artifacts.py | |
232 | tests/test_graphop_copy_number.py | |
233 | tests/test_internals_field_parser.py | |
234 | tests/test_internals_field_validator.py | |
235 | tests/test_internals_field_writer.py | |
236 | tests/test_internals_tag_datatype.py | |
237 | tests/test_unit_alignment.py | |
238 | tests/test_unit_field_array.py | |
239 | tests/test_unit_gfa_lines.py | |
240 | tests/test_unit_header.py | |
241 | tests/test_unit_line.py | |
242 | tests/test_unit_line_cloning.py | |
243 | tests/test_unit_line_connection.py | |
244 | tests/test_unit_line_dynamic_fields.py | |
245 | tests/test_unit_line_equivalence.py | |
246 | tests/test_unit_lines_finders.py | |
247 | tests/test_unit_multiplication.py | |
248 | tests/test_unit_numeric_array.py | |
249 | tests/test_unit_oriented_line.py | |
250 | tests/test_unit_segment_end.py | |
251 | tests/test_unit_symbol_invert.py | |
252 | tests/test_unit_unknown.py | |
253 | tests/testdata/all_line_types.gfa1.gfa | |
254 | tests/testdata/all_line_types.gfa2.gfa | |
255 | tests/testdata/copynum.1.gfa | |
256 | tests/testdata/copynum.1.gfa2 | |
257 | tests/testdata/copynum.2.gfa | |
258 | tests/testdata/copynum.2.gfa2 | |
259 | tests/testdata/dead_ends.gfa | |
260 | tests/testdata/dead_ends.gfa2 | |
261 | tests/testdata/example1.gfa | |
262 | tests/testdata/example1.gfa2 | |
263 | tests/testdata/example_from_spec.gfa | |
264 | tests/testdata/example_from_spec.gfa2 | |
265 | tests/testdata/example_from_spec.path14.seq | |
266 | tests/testdata/example_from_spec2.gfa | |
267 | tests/testdata/example_from_spec2.gfa2 | |
268 | tests/testdata/filled.gfa1 | |
269 | tests/testdata/filled.gfa2 | |
270 | tests/testdata/gfa2_edges_classification.gfa | |
271 | tests/testdata/invalid_path.gfa2 | |
272 | tests/testdata/linear_blunt.gfa1 | |
273 | tests/testdata/linear_blunt.gfa2 | |
274 | tests/testdata/linear_merging.1.gfa | |
275 | tests/testdata/linear_merging.1.gfa2 | |
276 | tests/testdata/linear_merging.2.gfa | |
277 | tests/testdata/linear_merging.2.gfa2 | |
278 | tests/testdata/linear_merging.3.gfa | |
279 | tests/testdata/linear_merging.3.gfa2 | |
280 | tests/testdata/linear_merging.4.gfa | |
281 | tests/testdata/linear_merging.4.gfa2 | |
282 | tests/testdata/linear_merging.5.gfa | |
283 | tests/testdata/linear_merging.5.gfa2 | |
284 | tests/testdata/links_distri.l1.gfa | |
285 | tests/testdata/links_distri.l1.gfa2 | |
286 | tests/testdata/links_distri.l1.m2.gfa | |
287 | tests/testdata/links_distri.l1.m2.gfa2 | |
288 | tests/testdata/links_distri.l2.gfa | |
289 | tests/testdata/links_distri.l2.gfa2 | |
290 | tests/testdata/links_distri.l2.m2.gfa | |
291 | tests/testdata/links_distri.l2.m2.gfa2 | |
292 | tests/testdata/links_distri.l2.m2.no_ld.gfa | |
293 | tests/testdata/links_distri.l2.m2.no_ld.gfa2 | |
294 | tests/testdata/links_distri.l2.m3.gfa | |
295 | tests/testdata/links_distri.l2.m3.gfa2 | |
296 | tests/testdata/links_distri.l2.m3.no_ld.gfa | |
297 | tests/testdata/links_distri.l2.m3.no_ld.gfa2 | |
298 | tests/testdata/links_distri.l3.gfa | |
299 | tests/testdata/links_distri.l3.gfa2 | |
300 | tests/testdata/links_distri.l3.m2.gfa | |
301 | tests/testdata/links_distri.l3.m2.gfa2 | |
302 | tests/testdata/links_distri.l3.m2.no_ld.gfa | |
303 | tests/testdata/links_distri.l3.m2.no_ld.gfa2 | |
304 | tests/testdata/loop.gfa | |
305 | tests/testdata/loop.gfa2 | |
306 | tests/testdata/rgfa_example.1.gfa | |
307 | tests/testdata/rgfa_example.2.gfa | |
308 | tests/testdata/sample.gfa | |
309 | tests/testdata/sample.gfa2 | |
310 | tests/testdata/seq_to_fill.fas | |
311 | tests/testdata/spec_q1.gfa | |
312 | tests/testdata/spec_q1.gfa2 | |
313 | tests/testdata/spec_q2.gfa | |
314 | tests/testdata/spec_q2.gfa2 | |
315 | tests/testdata/spec_q2.path_circular.seq | |
316 | tests/testdata/spec_q2.path_linear.seq | |
317 | tests/testdata/spec_q3.gfa | |
318 | tests/testdata/spec_q3.gfa2 | |
319 | tests/testdata/spec_q4.gfa | |
320 | tests/testdata/spec_q4.gfa2 | |
321 | tests/testdata/spec_q4.path_more_than_circular.seq | |
322 | tests/testdata/spec_q5.gfa | |
323 | tests/testdata/spec_q5.gfa2 | |
324 | tests/testdata/spec_q6.gfa | |
325 | tests/testdata/spec_q6.gfa2 | |
326 | tests/testdata/spec_q7.gfa | |
327 | tests/testdata/spec_q7.gfa2 | |
328 | tests/testdata/to_be_filled.gfa1 | |
329 | tests/testdata/to_be_filled.gfa2 | |
330 | tests/testdata/two_components.gfa | |
331 | tests/testdata/two_components.gfa2 | |
332 | tests/testdata/unnamed_and_named_links.gfa | |
333 | tests/testdata/unnamed_link.gfa | |
334 | tests/testdata/valid_path.gfa2⏎ |
Binary diff not shown
8 | 8 | sys.exit("Sorry, only Python 3 is supported") |
9 | 9 | |
10 | 10 | setup(name='gfapy', |
11 | version='1.1.0', | |
11 | version='1.2.0', | |
12 | 12 | description='Library for handling data in the GFA1 and GFA2 formats', |
13 | 13 | long_description=readme(), |
14 | 14 | url='https://github.com/ggonnella/gfapy', |
0 | import gfapy | |
1 | import unittest | |
2 | ||
3 | class TestAPIGroupsValidation(unittest.TestCase): | |
4 | ||
5 | def test_invalid_path_gfa2(self): | |
6 | with self.assertRaises(gfapy.NotFoundError): | |
7 | g = gfapy.Gfa.from_file("tests/testdata/invalid_path.gfa2") | |
8 | ||
9 | def test_invalid_path_gfa2_vlevel0(self): | |
10 | g = gfapy.Gfa.from_file("tests/testdata/invalid_path.gfa2", vlevel = 0) | |
11 | with self.assertRaises(gfapy.NotFoundError): | |
12 | g.validate() | |
13 | ||
14 | def test_valid_path_gfa2(self): | |
15 | # nothing raised | |
16 | g = gfapy.Gfa.from_file("tests/testdata/valid_path.gfa2") |
46 | 46 | lps.add(" ".join([s.name for s in lp])) |
47 | 47 | self.assertEqual({"1 19 18", "11 9 12", "22 16 20 21 23"}, lps) |
48 | 48 | |
49 | def test_linear_path_blunt_ends(self): | |
50 | for sfx in ["gfa1", "gfa2"]: | |
51 | gfa = gfapy.Gfa.from_file("tests/testdata/linear_blunt."+"{}".format(sfx)) | |
52 | gfa.merge_linear_paths() | |
53 | self.assertEqual(1, len(gfa.segments)) | |
54 | self.assertEqual("s1_s2_s3_s4", gfa.segments[0].name) | |
55 | self.assertEqual("CTGAAACGTGGCTCACA", gfa.segments[0].sequence) |
199 | 199 | gfa1_str ='''# comment |
200 | 200 | H\tVN:Z:1.0 |
201 | 201 | S\tB\t*\tLN:i:200 |
202 | S\tC\t*\tLN:i:100 | |
202 | S\tC\t*\tLN:i:100\txy:i:1 | |
203 | 203 | S\tA\t*\tLN:i:200 |
204 | 204 | L\tA\t+\tB\t-\t100M\tID:Z:a_to_b |
205 | 205 | C\tA\t+\tC\t-\t20\t100M\tID:Z:2 |
206 | P\t1\tA+,B-\t100M''' | |
206 | P\t1\tA+,B-\t100M\txx:Z:valid\txy:i:1''' | |
207 | 207 | gfa2_str ='''# comment |
208 | 208 | H\tVN:Z:2.0 |
209 | 209 | S\tB\t200\t* |
210 | S\tC\t100\t* | |
210 | S\tC\t100\t*\txy:i:1 | |
211 | 211 | S\tA\t200\t* |
212 | 212 | E\ta_to_b\tA+\tB-\t100\t200$\t100\t200$\t100M |
213 | 213 | E\t2\tA+\tC-\t20\t120\t0\t100$\t100M |
214 | O\t1\tA+ a_to_b+ B-''' | |
214 | O\t1\tA+ a_to_b+ B-\txx:Z:valid\txy:i:1''' | |
215 | 215 | self.assertEqual(gfa2_str, str(gfapy.Gfa(gfa1_str).to_gfa2())) |
216 | 216 | self.assertEqual(gfa1_str, str(gfapy.Gfa(gfa2_str).to_gfa1())) |
0 | H VN:Z:1.0 | |
1 | S s1 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTCCG LN:i:44 | |
2 | S s2 ACGTTCGTACGTACGTACGTACGTAAACGGTAC LN:i:33 | |
3 | S s3 ACGTAAACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTTTTACGTACGTA LN:i:75 | |
4 | S s4 ACGTACGTACGTTTTTTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGGAACGTAGTACGTACGTACGTACG LN:i:80 | |
5 | S s5 * LN:i:90 xx:Z:not in the Fasta, should accept and warn | |
6 | L s1 + s2 + 5M1I2M | |
7 | L s2 + s3 + 8M1D3M | |
8 | L s3 + s4 + 16M1I9M | |
9 | L s4 - s1 + 8M1I16M |
0 | H VN:Z:2.0 | |
1 | S s1 44 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTCCG | |
2 | S s2 33 ACGTTCGTACGTACGTACGTACGTAAACGGTAC | |
3 | S s3 75 ACGTAAACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTTTTACGTACGTA | |
4 | S s4 80 ACGTACGTACGTTTTTTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGGAACGTAGTACGTACGTACGTACG | |
5 | S s5 90 * xx:Z:not in the Fasta, should accept and warn | |
6 | E 1 s1+ s2+ 37 44$ 0 8 5M1I2M | |
7 | E 2 s2+ s3+ 21 33$ 0 11 8M1D3M | |
8 | E 3 s3+ s4+ 50 75$ 0 26 16M1I9M | |
9 | E 4 s4- s1+ 0 24 0 25 8M1I16M |
0 | # File used for the collections test | |
1 | # similar but NOT equivalent to the gfa1 file! | |
2 | S 1 122 * | |
3 | S 3 29 TGCTAGCTGACTGTCGATGCTGTGTG | |
4 | E 1_to_2 1+ 2+ 110 122$ 0 12 12M | |
5 | S 5 130 * | |
6 | S 13 150 * | |
7 | E 2_to_6 2+ 6+ 0 122$ 10 132 122M | |
8 | O 14 11+ 12+ | |
9 | S 11 140 * xx:i:11 | |
10 | F 2 read1+ 0 42 12 55 * id:Z:read1_in_2 | |
11 | F 2 read2+ 45 62 0 18 * id:Z:read2_in_2 | |
12 | U 16 1 3 15 2_to_6 16sub | |
13 | H ac:Z:test2 | |
14 | # another comment | |
15 | S 12 150 * | |
16 | S 4 120 * | |
17 | H VN:Z:2.0 | |
18 | E 1_to_3 1+ 3+ 112 122$ 0 12 10M | |
19 | G 1_to_11 1+ 11- 120 * | |
20 | E 11_to_12 11+ 12+ 18 140$ 0 122 122M | |
21 | S 6 150 * | |
22 | X custom_record xx:Z:testtag | |
23 | X custom_record X2 | |
24 | G 2_to_12 2- 12+ 500 50 | |
25 | O 15 11+ 11_to_13+ 13+ xx:i:-1 | |
26 | Y another_custom_record | |
27 | U 16sub 2 3 | |
28 | S 2 120 * xx:Z:sometag | |
29 | H aa:i:12 ab:Z:test1 | |
30 | H aa:i:15 | |
31 | E 1_to_5 1+ 5+ 0 122$ 2 124 * zz:Z:tag |
0 | H VN:Z:2.0 | |
1 | H ul:Z:https://github.com/sjackman/assembly-graph/blob/master/sample.gfa | |
2 | S 1 8 CGATGCAA | |
3 | S 2 10 TGCAAAGTAC | |
4 | S 3 21 TGCAACGTATAGACTTGTCAC RC:i:4 | |
5 | S 4 7 GCATATA | |
6 | S 5 8 CGATGATA | |
7 | S 6 4 ATGA | |
8 | E * 1+ 2+ 3 9$ 0 5 5M | |
9 | E * 3+ 2+ 21$ 21$ 0 0 0M | |
10 | E * 3+ 4- 17 21$ 3 7$ 1M1D2M | |
11 | E * 4- 5+ 0 0 0 0 0M |
0 | # File used for the collections test | |
1 | # similar but NOT equivalent to the gfa1 file! | |
2 | S 1 122 * | |
3 | S 3 29 TGCTAGCTGACTGTCGATGCTGTGTG | |
4 | E 1_to_2 1+ 2+ 110 122$ 0 12 12M | |
5 | S 5 130 * | |
6 | S 13 150 * | |
7 | E 2_to_6 2+ 6+ 0 122$ 10 132 122M | |
8 | O 14 11+ 12+ | |
9 | S 11 140 * xx:i:11 | |
10 | F 3 read1+ 0 42$ 12 55 * id:Z:read1_in_3 | |
11 | F 2 read2+ 45 62 0 18 * id:Z:read2_in_2 | |
12 | U 16 1 3 15 2_to_6 16sub | |
13 | H ac:Z:test2 | |
14 | # another comment | |
15 | S 12 150 * | |
16 | S 4 120 * | |
17 | H VN:Z:2.0 | |
18 | E 1_to_3 1+ 3+ 112 122$ 0 12 10M | |
19 | G 1_to_11 1+ 11- 120 * | |
20 | E 11_to_12 11+ 12+ 18 140$ 0 122 122M | |
21 | S 6 150 * | |
22 | X custom_record xx:Z:testtag | |
23 | X custom_record X2 | |
24 | E 11_to_13 11+ 13+ 20 140$ 0 120 120M | |
25 | G 2_to_12 2- 12+ 500 50 | |
26 | O 15 11+ 11_to_13+ 13+ xx:i:-1 | |
27 | Y another_custom_record | |
28 | U 16sub 2 3 | |
29 | S 2 120 * xx:Z:sometag | |
30 | H aa:i:12 ab:Z:test1 | |
31 | H aa:i:15 | |
32 | E 1_to_5 1+ 5+ 0 122$ 2 124 * zz:Z:tag |
0 | H VN:Z:1.0 | |
1 | H ul:Z:https://github.com/sjackman/assembly-graph/blob/master/sample.gfa | |
2 | S 1 CGATGCAA LN:i:12 | |
3 | S 2 TGCAAAGTAC | |
4 | S 3 TGCAACGTATAGACTTGTCAC RC:i:4 | |
5 | S 4 GCATATA | |
6 | S 5 CGATGATA | |
7 | S 6 ATGA | |
8 | L 1 + 2 + 5M | |
9 | L 3 + 2 + 0M | |
10 | L 3 + 4 - 1M1D2M1S | |
11 | L 4 - 5 + 0M |
0 | # File used for the collections test | |
1 | S 1 * | |
2 | S 3 CGATGCTAGCTGACTGTCGATGCTGTGTG | |
3 | L 1 + 2 + 12M ID:Z:1_to_2 | |
4 | S 5 * | |
5 | S 13 * | |
6 | C 2 + 6 + 10 122M ID:Z:2_to_6 | |
7 | P 14 11+,12+ 122M | |
8 | S 11 * | |
9 | H ac:Z:test2 | |
10 | S 12 * | |
11 | S 4 * | |
12 | H VN:Z:1.0 | |
13 | L 1 + 3 + 12M ID:Z:1_to_3 | |
14 | S 6 * | |
15 | L 11 + 13 + 120M ID:Z:11_to_13 | |
16 | P 15 11+,13+ 120M | |
17 | S 2 * xx:Z:sometag | |
18 | H aa:i:12 ab:Z:test1 | |
19 | H aa:i:15 | |
20 | C 1 + 5 + 12 120M ID:Z:1_to_5 |
0 | # comment | |
1 | S 3 CGATGCTAGCTGACTGTCGATGCTGTGTG | |
2 | L 1 + 2 + 12M ID:Z:1_to_2 | |
3 | S 5 * | |
4 | S 13 * | |
5 | C 2 + 6 + 10 122M ID:Z:2_to_6 | |
6 | P 14 11+,12+ 122M | |
7 | S 11 * | |
8 | H ac:Z:test2 | |
9 | S 12 * | |
10 | S 4 * | |
11 | H VN:Z:1.0 | |
12 | L 1 + 3 + 12M ID:Z:1_to_3 | |
13 | L 11 + 12 + 122M ID:Z:11_to_12 | |
14 | S 6 * | |
15 | L 11 + 13 + 120M ID:Z:11_to_13 | |
16 | P 15 11+,13+ 120M | |
17 | S 2 * xx:Z:sometag | |
18 | H aa:i:12 ab:Z:test1 | |
19 | H aa:i:15 | |
20 | C 1 + 5 + 12 120M ID:Z:1_to_5 |
0 | # File used for the collections test | |
1 | # similar but NOT equivalent to the gfa1 file! | |
2 | S 3 29 TGCTAGCTGACTGTCGATGCTGTGTG | |
3 | E 1_to_2 1+ 2+ 110 122$ 0 12 12M | |
4 | S 5 130 * | |
5 | S 13 150 * | |
6 | E 2_to_6 2+ 6+ 0 122$ 10 132 122M | |
7 | O 14 11+ 12+ | |
8 | S 11 140 * xx:i:11 | |
9 | F 2 read1+ 0 42 12 55 * id:Z:read1_in_2 | |
10 | F 2 read2+ 45 62 0 18 * id:Z:read2_in_2 | |
11 | U 16 1 3 15 2_to_6 16sub | |
12 | H ac:Z:test2 | |
13 | # another comment | |
14 | S 12 150 * | |
15 | S 4 120 * | |
16 | H VN:Z:2.0 | |
17 | E 1_to_3 1+ 3+ 112 122$ 0 12 10M | |
18 | G 1_to_11 1+ 11- 120 * | |
19 | E 11_to_12 11+ 12+ 18 140$ 0 122 122M | |
20 | S 6 150 * | |
21 | X custom_record xx:Z:testtag | |
22 | X custom_record X2 | |
23 | E 11_to_13 11+ 13+ 20 140$ 0 120 120M | |
24 | G 2_to_12 2- 12+ 500 50 | |
25 | O 15 11+ 11_to_13+ 13+ xx:i:-1 | |
26 | Y another_custom_record | |
27 | U 16sub 2 3 | |
28 | S 2 120 * xx:Z:sometag | |
29 | H aa:i:12 ab:Z:test1 | |
30 | H aa:i:15 | |
31 | E 1_to_5 1+ 5+ 0 122$ 2 124 * zz:Z:tag |
0 | S s1 5 CTGAA | |
1 | S s2 3 ACG | |
2 | S s3 4 TGGC | |
3 | S s4 5 TGTGA | |
4 | E 1 s1+ s2+ 5$ 5$ 0 0 0M | |
5 | E 2 s2+ s3+ 3$ 3$ 0 0 0M | |
6 | E 3 s3+ s4- 4$ 4$ 5$ 5$ 0M |
0 | >s0 not in GFA, should accept and warn | |
1 | CGACAGCGATCGAGCATGCTAGCTAGTCGTAGCTAGCTAGCTA | |
2 | >s1 | |
3 | ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTCCG | |
4 | >s2 more text in description line should not disturb | |
5 | ACGTTCGTACGTACGT | |
6 | ACGTACGTAAACGGTAC | |
7 | >s3 | |
8 | ACGTAAACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTTTTACGTACGTA | |
9 | >s4 | |
10 | ACGTACGTACGTTTTTTACGTACGTACGTA | |
11 | CGTACGTACGTACGTACGTACGTACGGAAC | |
12 | GTAGTACGTACGTACGTACG |
0 | H VN:Z:1.0 | |
1 | S s1 * LN:i:44 | |
2 | S s2 * LN:i:33 | |
3 | S s3 * LN:i:75 | |
4 | S s4 * LN:i:80 | |
5 | S s5 * LN:i:90 xx:Z:not in the Fasta, should accept and warn | |
6 | L s1 + s2 + 5M1I2M | |
7 | L s2 + s3 + 8M1D3M | |
8 | L s3 + s4 + 16M1I9M | |
9 | L s4 - s1 + 8M1I16M |