Codebase list gfapy / 22ceacf
New upstream release. Debian Janitor 2 years ago
112 changed file(s) with 898 addition(s) and 4020 deletion(s). Raw diff Collapse all Expand all
+0
-13
.gitignore less more
0 # Compiled python modules
1 *.pyc
2
3 # Setuptools distribution folder
4 /dist/
5
6 # Python egg metadata, regenerated from source files by setuptools
7 /*.egg-info
8 /*.egg
9
10 # Wheel data
11 build
12 conda
+0
-10
.travis.yml less more
0 language: python
1 python:
2 - "3.4"
3 env:
4 - PYTHONHASHSEED=0
5 install:
6 - pip install .
7 - pip install nose
8 - pip install Sphinx
9 script: "make tests"
+0
-8
CHANGES.txt less more
0 == 1.1.0 ==
1
2 - fix: custom tags are not necessarily lower case
3 - additional support for rGFA subset of GFA1 by setting option dialect="rgfa"
4
5 == 1.0.0 ==
6
7 - initial release
+0
-6
CONTRIBUTORS less more
0 The following contributors helped to develop gfapy. Please drop a note to
1 gonnella@zbh.uni-hamburg.de if I left someone out or missed something.
2
3 - Tim Weber (translation of parts of the code from Ruby to Python)
4 - Stefan Kurtz (advises)
5
+0
-19
LICENSE.txt less more
0 All code of gfapy is released under the following ISC license.
1 It is functionally equivalent to a two-term BSD copyright with
2 language removed that is made unnecessary by the Berne convention.
3 See http://openbsd.org/policy.html for more information on copyrights.
4
5 Copyright (c) 2017 Giorgio Gonnella and CONTRIBUTORS
6 Copyright (c) 2017 Center for Bioinformatics, University of Hamburg
7
8 Permission to use, copy, modify, and distribute this software for any
9 purpose with or without fee is hereby granted, provided that the above
10 copyright notice and this permission notice appear in all copies.
11
12 THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
13 WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
14 MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
15 ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
16 WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
17 ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
18 OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+0
-51
Makefile less more
0 default: tests
1
2 .PHONY: manual tests cleanup upload conda sdist wheel install
3
4 PYTHON=python3
5 PIP=pip3
6
7 # Install using pip
8 install:
9 ${PIP} install --upgrade --user --editable .
10
11 # Source distribution
12 sdist:
13 ${PYTHON} setup.py sdist
14
15 # Pure Python Wheel
16 wheel:
17 ${PYTHON} setup.py bdist_wheel
18
19 # Create the manual
20 manual:
21 cd doc && make latexpdf
22 mkdir -p manual
23 cp doc/_build/latex/Gfapy.pdf manual/gfapy-manual.pdf
24
25 doctest:
26 cd doc && make doctest
27
28 unittests:
29 @echo
30 @echo "Running unit test suite..."
31 @PYTHONHASHSEED=0 ${PYTHON} -m unittest discover
32
33 tests: doctest unittests
34
35 # Remove distribution files
36 cleanup:
37 rm -rf dist/ build/ gfapy.egg-info/
38
39 upload: tests cleanup sdist wheel
40 cd dist; \
41 for file in *; do \
42 twine register $$file; \
43 twine upload $$file; \
44 done
45
46 conda:
47 mkdir -p conda
48 cd conda; \
49 conda skeleton pypi gfapy; \
50 conda build gfapy
0 Metadata-Version: 1.1
1 Name: gfapy
2 Version: 1.2.0
3 Summary: Library for handling data in the GFA1 and GFA2 formats
4 Home-page: https://github.com/ggonnella/gfapy
5 Author: Giorgio Gonnella and others (see CONTRIBUTORS)
6 Author-email: gonnella@zbh.uni-hamburg.de
7 License: ISC
8 Description: Gfapy
9 ~~~~~
10
11 |travis| |readthedocs| |latesttag| |license| |requiresio|
12
13 |bioconda| |pypi| |debian| |ubuntu|
14
15 .. sphinx-begin
16
17 The Graphical Fragment Assembly (GFA) are formats for the representation
18 of sequence graphs, including assembly, variation and splicing graphs.
19 Two versions of GFA have been defined (GFA1 and GFA2) and several sequence
20 analysis programs have been adopting the formats as an interchange format,
21 which allow to easily combine different sequence analysis tools.
22
23 This library implements the GFA1 and GFA2 specification
24 described at https://github.com/GFA-spec/GFA-spec/blob/master/GFA-spec.md.
25 It allows to create a Gfa object from a file in the GFA format
26 or from scratch, to enumerate the graph elements (segments, links,
27 containments, paths and header lines), to traverse the graph (by
28 traversing all links outgoing from or incoming to a segment), to search for
29 elements (e.g. which links connect two segments) and to manipulate the
30 graph (e.g. to eliminate a link or a segment or to duplicate a segment
31 distributing the read counts evenly on the copies).
32
33 The GFA format can be easily extended by users by defining own custom
34 tags and record types. In Gfapy, it is easy to write extensions modules,
35 which allow to define custom record types and datatypes for the parsing
36 and validation of custom fields. The custom lines can be connected, using
37 references, to each other and to lines of the standard record types.
38
39 Requirements
40 ~~~~~~~~~~~~
41
42 Gfapy has been written for Python 3 and tested using Python version 3.7.
43 It does not require any additional Python packages or other software.
44
45 Installation
46 ~~~~~~~~~~~~
47
48 Gfapy is distributed as a Python package and can be installed using
49 the Python package manager pip, as well as conda (in the Bioconda channel).
50 It is also available as a package in some Linux distributions (Debian, Ubuntu).
51
52 The following command installs the current stable version from the Python
53 Packages index::
54
55 pip install gfapy
56
57 If you would like to install the current development version from Github,
58 use the following command::
59
60 pip install -e git+https://github.com/ggonnella/gfapy.git#egg=gfapy
61
62 Alternatively it is possible to install gfapy using conda. Gfapy is
63 included in the Bioconda (https://bioconda.github.io/) channel::
64
65 conda install -c bioconda gfapy
66
67 Usage
68 ~~~~~
69
70 If you installed gfapy as described above, you can import it in your script
71 using the conventional Python syntax::
72
73 >>> import gfapy
74
75 Documentation
76 ~~~~~~~~~~~~~
77
78 The documentation, including this introduction to Gfapy, a user manual
79 and the API documentation is hosted on the ReadTheDocs server,
80 at the URL http://gfapy.readthedocs.io/en/latest/ and it can be
81 downloaded as PDF from the URL
82 https://github.com/ggonnella/gfapy/blob/master/manual/gfapy-manual.pdf.
83
84 References
85 ~~~~~~~~~~
86
87 Giorgio Gonnella and Stefan Kurtz "GfaPy: a flexible and extensible software
88 library for handling sequence graphs in Python", Bioinformatics (2017) btx398
89 https://doi.org/10.1093/bioinformatics/btx398
90
91 .. sphinx-end
92
93 .. |travis|
94 image:: https://travis-ci.com/ggonnella/gfapy.svg?branch=master
95 :target: https://travis-ci.com/ggonnella/gfapy
96 :alt: Travis
97
98 .. |latesttag|
99 image:: https://img.shields.io/github/v/tag/ggonnella/gfapy
100 :target: https://github.com/ggonnella/gfapy/tags
101 :alt: Latest GitHub tag
102
103 .. |readthedocs|
104 image:: https://readthedocs.org/projects/pip/badge/?version=stable
105 :target: https://pip.pypa.io/en/stable/?badge=stable
106 :alt: ReadTheDocs
107
108 .. |bioconda|
109 image:: https://img.shields.io/conda/vn/bioconda/gfapy
110 :target: https://bioconda.github.io/recipes/gfapy/README.html
111 :alt: Bioconda
112
113 .. |pypi|
114 image:: https://img.shields.io/pypi/v/gfapy
115 :target: https://pypi.org/project/gfapy/
116 :alt: PyPI
117
118 .. |debian|
119 image:: https://img.shields.io/debian/v/gfapy
120 :target: https://packages.debian.org/search?keywords=gfapy
121 :alt: Debian
122
123 .. |ubuntu|
124 image:: https://img.shields.io/ubuntu/v/gfapy
125 :target: https://packages.ubuntu.com/search?keywords=gfapy
126 :alt: Ubuntu
127
128 .. |license|
129 image:: https://img.shields.io/pypi/l/gfapy
130 :target: https://github.com/ggonnella/gfapy/blob/master/LICENSE.txt
131 :alt: ISC License
132
133 .. |requiresio|
134 image:: https://requires.io/github/ggonnella/gfapy/requirements.svg?branch=master
135 :target: https://requires.io/github/ggonnella/gfapy/requirements/?branch=master
136 :alt: Requirements Status
137
138 Keywords: bioinformatics genomics sequences GFA assembly graphs
139 Platform: UNKNOWN
140 Classifier: Development Status :: 5 - Production/Stable
141 Classifier: Environment :: Console
142 Classifier: Intended Audience :: Developers
143 Classifier: Intended Audience :: End Users/Desktop
144 Classifier: Intended Audience :: Science/Research
145 Classifier: License :: OSI Approved :: ISC License (ISCL)
146 Classifier: Operating System :: MacOS :: MacOS X
147 Classifier: Operating System :: POSIX :: Linux
148 Classifier: Programming Language :: Python :: 3 :: Only
149 Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
150 Classifier: Topic :: Software Development :: Libraries
0 Gfapy
1 ~~~~~
2
3 |travis| |readthedocs| |latesttag| |license| |requiresio|
4
5 |bioconda| |pypi| |debian| |ubuntu|
6
7 .. sphinx-begin
8
09 The Graphical Fragment Assembly (GFA) are formats for the representation
110 of sequence graphs, including assembly, variation and splicing graphs.
211 Two versions of GFA have been defined (GFA1 and GFA2) and several sequence
2231 Requirements
2332 ~~~~~~~~~~~~
2433
25 Gfapy has been written for Python 3 and tested using Python version 3.3.
34 Gfapy has been written for Python 3 and tested using Python version 3.7.
2635 It does not require any additional Python packages or other software.
2736
2837 Installation
3039
3140 Gfapy is distributed as a Python package and can be installed using
3241 the Python package manager pip, as well as conda (in the Bioconda channel).
42 It is also available as a package in some Linux distributions (Debian, Ubuntu).
3343
3444 The following command installs the current stable version from the Python
3545 Packages index::
6979 Giorgio Gonnella and Stefan Kurtz "GfaPy: a flexible and extensible software
7080 library for handling sequence graphs in Python", Bioinformatics (2017) btx398
7181 https://doi.org/10.1093/bioinformatics/btx398
82
83 .. sphinx-end
84
85 .. |travis|
86 image:: https://travis-ci.com/ggonnella/gfapy.svg?branch=master
87 :target: https://travis-ci.com/ggonnella/gfapy
88 :alt: Travis
89
90 .. |latesttag|
91 image:: https://img.shields.io/github/v/tag/ggonnella/gfapy
92 :target: https://github.com/ggonnella/gfapy/tags
93 :alt: Latest GitHub tag
94
95 .. |readthedocs|
96 image:: https://readthedocs.org/projects/pip/badge/?version=stable
97 :target: https://pip.pypa.io/en/stable/?badge=stable
98 :alt: ReadTheDocs
99
100 .. |bioconda|
101 image:: https://img.shields.io/conda/vn/bioconda/gfapy
102 :target: https://bioconda.github.io/recipes/gfapy/README.html
103 :alt: Bioconda
104
105 .. |pypi|
106 image:: https://img.shields.io/pypi/v/gfapy
107 :target: https://pypi.org/project/gfapy/
108 :alt: PyPI
109
110 .. |debian|
111 image:: https://img.shields.io/debian/v/gfapy
112 :target: https://packages.debian.org/search?keywords=gfapy
113 :alt: Debian
114
115 .. |ubuntu|
116 image:: https://img.shields.io/ubuntu/v/gfapy
117 :target: https://packages.ubuntu.com/search?keywords=gfapy
118 :alt: Ubuntu
119
120 .. |license|
121 image:: https://img.shields.io/pypi/l/gfapy
122 :target: https://github.com/ggonnella/gfapy/blob/master/LICENSE.txt
123 :alt: ISC License
124
125 .. |requiresio|
126 image:: https://requires.io/github/ggonnella/gfapy/requirements.svg?branch=master
127 :target: https://requires.io/github/ggonnella/gfapy/requirements/?branch=master
128 :alt: Requirements Status
+0
-3
benchmarks/.gitignore less more
0 benchmark_results*
1 jobs_out
2 figure*
+0
-65
benchmarks/gfapy-benchmark-collectdata less more
0 #!/bin/bash
1
2 #
3 # This script is derived from rdj-spacepeak.sh in
4 # the GenomeTools repository (www.genometools.org).
5 #
6 # (c) 2010-2017 Giorgio Gonnella, ZBH, University of Hamburg
7 #
8
9 sleeptime=0.1
10
11 if [ $# -eq 0 ]; then
12 echo "Usage: $0 <command> [args]"
13 echo
14 echo "The following information is polled each $sleeptime seconds"
15 echo "from /proc/[pid]/status:"
16 echo
17 echo " VmPeak: Peak virtual memory size."
18 echo " VmSize: Virtual memory size."
19 echo " VmLck: Locked memory size."
20 echo " VmHWM: Peak resident set size (\"high water mark\")."
21 echo " VmRSS: Resident set size."
22 echo " VmData, VmStk, VmExe: Size of data, stack, and text segments."
23 echo " VmLib: Shared library code size."
24 echo " VmPTE: Page table entries size (since Linux 2.6.10)."
25 echo
26 echo "The command is run under /usr/bin/time."
27 exit
28 fi
29
30 # code inspired by:
31 # http://stackoverflow.com/questions/1080461/
32 # /peak-memory-measurement-of-long-running-process-in-linux
33 function __measure_space_peak {
34 types="Peak Size Lck HWM RSS Data Stk Exe Lib PTE"
35 declare -A maxVm
36 for vm in $types; do maxVm[$vm]=0; done
37 ppid=$$
38 /usr/bin/time $@ &
39 tpid=`pgrep -P ${ppid} -n -f time`
40 if [[ ${tpid} -ne "" ]]; then
41 pid=`pgrep -P ${tpid} -n -f $1` # $! may work here but not later
42 fi
43 declare -A Vm
44 while [[ ${tpid} -ne "" ]]; do
45 for vm in $types; do
46 if [[ ${pid} -ne "" ]]; then
47 Vm[$vm]=`cat /proc/${pid}/status 2> /dev/null \
48 | grep Vm${vm} | awk '{print $2}'`
49 if [[ ${Vm[$vm]} -gt ${maxVm[$vm]} ]]; then
50 maxVm[$vm]=${Vm[$vm]}
51 fi
52 fi
53 done
54 sleep $sleeptime
55 savedtpid=${tpid}
56 tpid=`pgrep -P ${ppid} -n -f time`
57 done
58 wait ${savedtpid} # don't wait, job is finished
59 exitstatus=$? # catch the exit status of wait, the same of $@
60 echo "Memory usage for $@:" >> /dev/stderr
61 for vm in $types; do echo " Vm$vm: ${maxVm[$vm]} kB" >> /dev/stderr; done
62 echo "Exit status: ${exitstatus}" >> /dev/stderr
63 }
64 __measure_space_peak $*
+0
-120
benchmarks/gfapy-plot-benchmarkdata.R less more
0 #!/usr/bin/env Rscript
1 # (c) Giorgio Gonnella, ZBH, Uni Hamburg, 2017
2
3 script.name = "./gfapy-plot-benchmarkdata.R"
4 args <- commandArgs(trailingOnly=TRUE)
5 if (is.na(args[3])) {
6 cat("Usage: ",script.name, " <inputfile> <outpfx> <variable>", "\n")
7 cat("variable: either 'segments' or 'connectivity'\n")
8 stop("Too few command-line parameters")
9 }
10 infname <- args[1]
11 cat("input data: ",infname,"\n")
12 outpfx <- args[2]
13 cat("output prefix:", outpfx, "\n")
14 xvar <- args[3]
15 if (xvar != 'segments' && xvar != 'connectivity') {
16 stop("variable must be one of: segments, connectivity")
17 }
18
19 library("ggplot2")
20
21 #
22 # The following function is described here:
23 # http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/#Helper%20functions
24 # Licence: CC0 (https://creativecommons.org/publicdomain/zero/1.0/)
25 #
26 ## Gives count, mean, standard deviation, standard error of the mean, and
27 ## confidence interval (default 95%).
28 ## data: a data frame.
29 ## measurevar: the name of a column that contains the var to be summariezed
30 ## groupvars: a vector containing names of columns that contain grouping vars
31 ## na.rm: a boolean that indicates whether to ignore NA's
32 ## conf.interval: the percent range of the confidence interval (default 95%)
33 summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
34 conf.interval=.95, .drop=TRUE) {
35 library(plyr)
36
37 # New version of length which can handle NA's: if na.rm==T, don't count them
38 length2 <- function (x, na.rm=FALSE) {
39 if (na.rm) sum(!is.na(x))
40 else length(x)
41 }
42
43 # This does the summary. For each group's data frame, return a vector with
44 # N, mean, and sd
45 datac <- ddply(data, groupvars, .drop=.drop,
46 .fun = function(xx, col) {
47 c(N = length2(xx[[col]], na.rm=na.rm),
48 mean = mean (xx[[col]], na.rm=na.rm),
49 sd = sd (xx[[col]], na.rm=na.rm)
50 )
51 },
52 measurevar
53 )
54
55 # Rename the "mean" column
56 datac <- rename(datac, c("mean" = measurevar))
57
58 datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean
59
60 # Confidence interval multiplier for standard error
61 # Calculate t-statistic for confidence interval:
62 # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
63 ciMult <- qt(conf.interval/2 + .5, datac$N-1)
64 datac$ci <- datac$se * ciMult
65
66 return(datac)
67 }
68
69 data <- read.table(infname, header=T, sep="\t")
70
71 if (xvar == "segments") {
72 xvarname = "lines"
73 xlab="Lines (segments 1/3; dovetails 2/3)"
74 } else {
75 xvarname = "mult"
76 xlab="Dovetails/segment (segments=4000)"
77 data[c("lines")] = (data[c("mult")]+1)*4000
78 }
79
80 time.data <- summarySE(data, measurevar="time", groupvars=c(xvarname))
81 outfname = paste0(outpfx,"_time.log")
82 sink(outfname)
83 print(time.data)
84 time.lm <- lm(time ~ lines, data=data)
85 summary(time.lm)
86 time.nls <- nls(time ~ b + a * lines,
87 data=data, start=list(a=0,b=0),
88 algorithm="port", lower=c(0,0))
89 print(time.nls)
90 sink()
91
92 outfname = paste0(outpfx,"_space.log")
93 sink(outfname)
94 space.data <- summarySE(data, measurevar="space", groupvars=c(xvarname))
95 print(space.data)
96 space.lm <- lm(space ~ lines, data=data)
97 summary(space.lm)
98 space.nls <- nls(space ~ b + a * lines,
99 data=data, start=list(a=0,b=0),
100 algorithm="port", lower=c(0,0))
101 print(space.nls)
102 sink()
103
104 outfname = paste0(outpfx,"_time.pdf")
105 pdf(outfname)
106 print(ggplot(time.data, aes_string(x=xvarname, y="time")) +
107 geom_errorbar(aes(ymin=time-se, ymax=time+se), width=2) +
108 geom_line(size=0.2) + geom_point(size=3) +
109 ylab("Total elapsed time (s)") +
110 xlab(xlab))
111 outfname = paste0(outpfx,"_space.pdf")
112 pdf(outfname)
113 print(ggplot(space.data, aes_string(x=xvarname, y="space")) +
114 geom_errorbar(aes(ymin=space-se, ymax=space+se), width=2) +
115 geom_line(size=0.2) + geom_point(size=3) +
116 ylab("Memory peak (MB)") +
117 xlab(xlab))
118 dev.off()
119
+0
-62
benchmarks/gfapy-plot-preparedata.py less more
0 #!/usr/bin/env python3
1 """
2 Prepare the output of the convert benchmark script for the R plotting script.
3 """
4
5 import argparse
6 import os
7 import sys
8 import re
9
10 op = argparse.ArgumentParser(description=__doc__)
11 op.add_argument('--version', action='version', version='%(prog)s 1.0')
12 op.add_argument("--mult", "-m", action="store_true",
13 help="set if variable n of edges/segment")
14 op.add_argument("inputfile")
15 opts = op.parse_args()
16
17 if not os.path.exists(opts.inputfile):
18 sys.stderr.write("Input file not found: {}\n".format(opts.inputfile))
19 exit(1)
20
21 with open(opts.inputfile) as inputfile:
22 header = True
23 if opts.mult:
24 outdata = ["mult", "time", "space", "time_per_line", "space_per_line"]
25 else:
26 outdata = ["lines", "time", "space", "time_per_line", "space_per_line"]
27 print("\t".join(outdata))
28 for line in inputfile:
29 if line[:3] == "###":
30 header = False
31 elif not header:
32 data = line.rstrip("\n\r").split("\t")
33 n_segments = data[2]
34 multiplier = data[3]
35 n_lines = int(int(n_segments) * (1+float(multiplier)))
36 elapsed = data[5]
37 elapsed_match = re.compile(r'\s+(\d+):(\d+\.\d+)').match(elapsed)
38 if elapsed_match:
39 minutes = int(elapsed_match.groups()[0])
40 seconds = float(elapsed_match.groups()[1])
41 seconds += minutes * 60
42 else:
43 elapsed_match = re.compile(r'\s+(\d+):(\d+):(\d+)').match(elapsed)
44 if elapsed_match:
45 hours = int(elapsed_match.groups()[0])
46 minutes = int(elapsed_match.groups()[1])
47 seconds = int(elapsed_match.groups()[2])
48 minutes += hours * 60
49 seconds += minutes * 60
50 else:
51 continue
52 memory = data[6]
53 memory = int(re.compile(r'(\d+) kB').match(memory).groups()[0])
54 megabytes = memory / 1024
55 if opts.mult:
56 outdata = [str(multiplier)]
57 else:
58 outdata = [str(n_lines)]
59 outdata += [str(seconds),str(megabytes),
60 str(seconds/n_lines), str(megabytes/n_lines)]
61 print("\t".join(outdata))
+0
-61
benchmarks/gfapy-profiler.sh less more
0 #!/bin/bash
1 #$ -clear
2 #$ -q 16c.q
3 #$ -cwd
4 #$ -V
5 #$ -S /bin/bash
6 #$ -o jobs_out
7 #$ -j y
8
9 if [ $# -ne 4 ]; then
10 echo "Usage: $0 <operation> <version> <variable> <range>" > /dev/stderr
11 echo " operation: (mergelinear/convert) ../bin/gfapy-<operation> <gfafile> will be called" > /dev/stderr
12 echo " version: (gfa1/gfa2) gfa version" > /dev/stderr
13 echo " variable: (segments/connectivity)" > /dev/stderr
14 echo " range: (all/fast/slow)" > /dev/stderr
15 exit 1
16 fi
17
18 operation=$1
19 version=$2
20 variable=$3
21 range=$4
22
23 if [ $variable == "segments" ]; then
24 if [ $range == "fast" ]; then
25 nsegments="1000 2000 4000 8000 16000 32000 64000 128000"
26 elif [ $range == "slow" ]; then
27 nsegments="256000 512000 1024000 2048000 4096000"
28 elif [ $range == "all"]; then
29 nsegments="1000 2000 4000 8000 16000 32000 64000 128000 256000 512000 1024000 2048000 4096000"
30 fi
31 else
32 nsegments=4000
33 fi
34
35 if [ $variable == "connectivity" ]; then
36 if [ $range == "fast" ]; then
37 multipliers="2 4 8 16 32 64"
38 elif [ $range == "slow" ]; then
39 multipliers="128 256"
40 elif [ $range == "all"]; then
41 multipliers="2 4 8 16 32 64 128 256"
42 fi
43 else
44 multipliers=2
45 fi
46
47 replicate=1
48 for i in $nsegments; do
49 for m in $multipliers; do
50 fname="${i}_e${m}x.$replicate.${version}"
51 if [ ! -e $fname ]; then
52 ./gfapy-randomgraph --segments $i -g $version \
53 --dovetails-per-segment $m --with-sequence > $fname
54 fi
55 echo "Profiling $operation $fname ..."
56 rm -f $fname.$operation.prof
57 python3 -m cProfile -o $fname.$operation.prof \
58 ../bin/gfapy-$operation $fname 1> /dev/null
59 done
60 done
+0
-87
benchmarks/gfapy-randomgraph less more
0 #!/usr/bin/env python3
1 """
2 Creates a random graph for testing
3 """
4
5 import argparse
6 import sys
7 import random
8
9 op = argparse.ArgumentParser(description=__doc__)
10 op.add_argument("--segments", "-s", type=int,
11 help="number of segments", required=True)
12 op.add_argument("--slen", "-l", type=int, default=100,
13 help="lenght of segments sequence")
14 op.add_argument("--with-sequence", "-w", action="store_true")
15 op.add_argument("--dovetails-per-segment", "-d",
16 help="average number of dovetail edges per segment",
17 default=2.0, type=float)
18 op.add_argument('--gfa-version', "-g", default="gfa1",
19 help="gfa version", choices=("gfa1", "gfa2"))
20 op.add_argument('--version', action='version', version='%(prog)s 1.0')
21 opts = op.parse_args()
22
23 if opts.segments < 0:
24 sys.stderr.write("Error: the number of segments must be "+
25 ">= 0 ({})\n".format(opts.segments))
26 exit(1)
27 if opts.dovetails_per_segment < 0:
28 sys.stderr.write("Error: the average number of dovetails per segment must "+
29 "be >= 0 ({})\n".format(opts.dovetails_per_segment))
30 exit(1)
31 if opts.slen <= 0:
32 sys.stderr.write("Error: the length of segments sequence must be > 0"+
33 " ({})\n".format(opts.slen))
34 exit(1)
35
36 if opts.gfa_version == "gfa1":
37 print("H\tVN:Z:1.0")
38 else:
39 print("H\tVN:Z:2.0")
40
41 def random_sequence(slen):
42 sequence = []
43 for i in range(slen):
44 sequence.append(random.choice('ACGT'))
45 return "".join(sequence)
46
47 for i in range(opts.segments):
48 if opts.with_sequence:
49 sequence = random_sequence(opts.slen)
50 else:
51 sequence = "*"
52 if opts.gfa_version == "gfa1":
53 print("S\ts{}\t{}\tLN:i:{}".format(i, sequence, opts.slen))
54 else:
55 print("S\ts{}\t{}\t{}".format(i, opts.slen, sequence))
56
57 n_dovetails = int(opts.segments * opts.dovetails_per_segment)
58 edges = {}
59 for i in range(n_dovetails):
60 edge = False
61 while not edge:
62 s_from = random.randint(0, opts.segments-1)
63 s_from_or = random.choice('+-')
64 s_to = random.randint(0, opts.segments-1)
65 s_to_or = random.choice('+-')
66 if s_from not in edges:
67 edges[s_from] = {'+': {}, '-': {}}
68 if s_to not in edges[s_from][s_from_or]:
69 edges[s_from][s_from_or][s_to] = {'+': False, '-': False}
70 if not edges[s_from][s_from_or][s_to][s_to_or]:
71 edges[s_from][s_from_or][s_to][s_to_or] = True
72 edge = True
73 ovlen = opts.slen//10
74 if ovlen == 0: ovlen = 1
75 cigar = "{}M".format(ovlen)
76 if opts.gfa_version == "gfa1":
77 print("L\ts{}\t{}\ts{}\t{}\t{}\tID:Z:e{}".format(s_from, s_from_or, s_to,
78 s_to_or, cigar, i))
79 else:
80 s_from_begin = opts.slen - ovlen if s_from_or == "+" else 0
81 s_from_end = "{}$".format(opts.slen) if s_from_or == "+" else ovlen
82 s_to_begin = opts.slen - ovlen if s_to_or == "-" else 0
83 s_to_end = "{}$".format(opts.slen) if s_to_or == "-" else ovlen
84 print("E\te{}\ts{}{}\ts{}{}\t{}\t{}\t{}\t{}\t{}".format(
85 i, s_from, s_from_or, s_to, s_to_or, s_from_begin, s_from_end,
86 s_to_begin, s_to_end, cigar))
+0
-76
benchmarks/gfapy-reproduce-manuscript-figure.py less more
0 #!/usr/bin/env python3
1 """
2 Run the benchmarks necessary to reproduce the figures of Section 3
3 of the Supplementary Information of the manuscript \"Gfapy: a flexible
4 and extensible software library for handling sequence graphs in Python\"
5 and plots the figures using R.
6 """
7
8 import argparse
9 import os
10
11 op = argparse.ArgumentParser(description=__doc__)
12 op.add_argument("fignum", help="Figure number", type=int,
13 choices=range(5,9))
14 op.add_argument("--queue", default=None,
15 help="Use the specified queue of a Grid Engine cluster system "+
16 "(e.g. 16c.q). If not provided, the benchmarks are run on the "+
17 "local computer.")
18 op.add_argument("--nrepl",type=int, default=3,
19 help="Number of replicates (default: 3)")
20 op.add_argument("--fast",action="store_true",
21 help="Run only the three fastest datapoints of the benchmark")
22 opts = op.parse_args()
23
24 if opts.fignum == 5:
25 testvar="segments"
26 operation="convert"
27 elif opts.fignum == 6:
28 testvar="connectivity"
29 operation="convert"
30 elif opts.fignum == 7:
31 testvar="segments"
32 operation="mergelinear"
33 else: # 8
34 testvar="connectivity"
35 operation="mergelinear"
36
37 if opts.fast:
38 subset="fast"
39 else:
40 subset="all"
41
42 run_benchmarks_args="figure{}.out {} gfa2 {} {} {}".format(
43 opts.fignum, operation, testvar, subset, opts.nrepl)
44
45 if not opts.queue:
46 os.system("./gfapy-run-benchmarks.sh {}".format(run_benchmarks_args))
47 else:
48 qsub_script_pfx=\
49 """#!/bin/bash
50 #$ -clear
51 #$ -q {}
52 #$ -cwd
53 #$ -V
54 #$ -S /bin/bash
55 #$ -o jobs_out
56 #$ -j y
57 #$ -sync y
58
59 """.format(opts.queue)
60 with open("gfapy-run-benchmarks.sh", "r") as input_file:
61 content = input_file.read()
62 with open("gfapy-run-benchmarks.qsub", "w") as output_file:
63 output_file.write(qsub_script_pfx)
64 output_file.write(content)
65 os.system("mkdir -p jobs_out")
66 os.system("qsub gfapy-run-benchmarks.qsub {}".format(run_benchmarks_args))
67
68 if testvar == "segments":
69 prepareflag=""
70 else:
71 prepareflag="--mult"
72 os.system("./gfapy-plot-preparedata.py {} figure{}.out > figure{}.dat".format(
73 prepareflag, opts.fignum, opts.fignum))
74 os.system("./gfapy-plot-benchmarkdata.R figure{}.dat figure{} {}".format(
75 opts.fignum, opts.fignum, testvar))
+0
-67
benchmarks/gfapy-run-benchmarks.sh less more
0 #!/bin/bash
1
2 if [ $# -ne 6 ]; then
3 echo "Usage: $0 <outfile> <operation> <version> <variable> <range> <nrepl>" > /dev/stderr
4 echo " outfile: will be overwritten if exists" > /dev/stderr
5 echo " operation: (mergelinear/convert) ../bin/gfapy-<operation> <gfafile> will be called" > /dev/stderr
6 echo " version: (gfa1/gfa2) gfa version" > /dev/stderr
7 echo " variable: (segments/connectivity)" > /dev/stderr
8 echo " range: (all/fast/slow)" > /dev/stderr
9 echo " nrepl: (e.g. 3) number of replicates" > /dev/stderr
10 exit 1
11 fi
12
13 outfile=$1
14 operation=$2
15 version=$3
16 variable=$4
17 range=$5
18 nrepl=$6
19
20 if [ $variable == "segments" ]; then
21 if [ $range == "fast" ]; then
22 nsegments="1000 2000 4000"
23 elif [ $range == "slow" ]; then
24 nsegments="8000 16000 32000 64000 128000 256000 512000 1024000 2048000"
25 elif [ $range == "all"]; then
26 nsegments="1000 2000 4000 8000 16000 32000 64000 128000 256000 512000 1024000 2048000"
27 fi
28 else
29 nsegments=4000
30 fi
31
32 if [ $variable == "connectivity" ]; then
33 if [ $range == "fast" ]; then
34 multipliers="2 4 8"
35 elif [ $range == "slow" ]; then
36 multipliers="16 32 64 128 256"
37 elif [ $range == "all"]; then
38 multipliers="2 4 8 16 32 64 128 256"
39 fi
40 else
41 multipliers=2
42 fi
43
44 mkdir -p benchmark_results
45 rm -f $outfile
46 echo "# hostname: $HOSTNAME" > $outfile
47 echo "### benchmark data:" >> $outfile
48 for ((replicate=1;replicate<=nrepl;++replicate)); do
49 for i in $nsegments; do
50 for m in $multipliers; do
51 fname="benchmark_results/${i}_e${m}x.$replicate.${version}"
52 bmout="$fname.$operation.benchmark"
53 rm -f $bmout
54 if [ ! -e $fname ]; then
55 ./gfapy-randomgraph --segments $i -g $version \
56 --dovetails-per-segment $m --with-sequence > $fname
57 fi
58 ./gfapy-benchmark-collectdata ../bin/gfapy-$operation $fname \
59 1> /dev/null 2> $bmout
60 elapsed=$(grep -P -o "(?<=) [^ ]*(?=elapsed)" $bmout)
61 memory=$(grep -P -o "(?<=VmHWM: ).*" $bmout)
62 filesize=( $(ls -ln $fname) );filesize=${filesize[4]}
63 echo -e "gfapy-$operation\t$version\t$i\t$m\t$replicate\t$elapsed\t$memory\t$filesize" >> $outfile
64 done
65 done
66 done
+0
-47
bin/gfapy-diff less more
0 #!/usr/bin/env python3
1 """
2 Compare two GFA files
3
4 Note: the current version is not yet functional and only checking segments.
5 Work in progress.
6 """
7
8 import sys
9 import os
10 import gfapy
11 import argparse
12
13 op = argparse.ArgumentParser(description=__doc__)
14 op.add_argument('--version', action='version', version='%(prog)s 0.1')
15 op.add_argument("filename1")
16 op.add_argument("filename2")
17 opts = op.parse_args()
18
19 gfa1 = gfapy.Gfa.from_file(opts.filename1)
20 gfa2 = gfapy.Gfa.from_file(opts.filename2)
21
22 different = False
23
24 if gfa1.version != gfa2.version:
25 print("# different version")
26 exit(1)
27 else:
28 for s in gfa1.segments:
29 s2 = gfa2.segment(s)
30 if s2 is None:
31 different = True
32 print("# segment {} in {} but not in {}".format(s.name, opts.filename1, opts.filename2))
33 if s.diff(s2):
34 different = True
35 for diff in s.diff(s2):
36 print(diff)
37 for s in gfa2.segments:
38 s1 = gfa1.segment(s)
39 if s1 is None:
40 different = True
41 print("# segment {} in {} but not in {}".format(s.name, opts.filename2, opts.filename1))
42
43 if different:
44 exit(1)
45 else:
46 exit(0)
0 gfapy (1.2.0-1) UNRELEASED; urgency=low
1
2 * New upstream release.
3
4 -- Debian Janitor <janitor@jelmer.uk> Tue, 08 Jun 2021 08:43:01 -0000
5
06 gfapy (1.1.0+dfsg-1) unstable; urgency=medium
17
28 * New upstream release.
+0
-2
doc/.gitignore less more
0 source
1 _build
+0
-23
doc/Makefile less more
0 # Minimal makefile for Sphinx documentation
1 #
2
3 # You can set these variables from the command line.
4 SPHINXOPTS =
5 SPHINXBUILD = sphinx-build
6 SPHINXPROJ = Gfapy
7 SOURCEDIR = .
8 BUILDDIR = _build
9
10 # Put it first so that "make" without argument is like "make help".
11 help:
12 @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
13
14 .PHONY: help Makefile
15
16 cleanup:
17 rm source _build -rf
18
19 # Catch-all target: route all unknown targets to Sphinx using the new
20 # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
21 %: Makefile
22 @PYTHONHASHSEED=0 $(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+0
-4
doc/changelog.rst less more
0 Changelog
1 ---------
2 .. include:: ../CHANGES.txt
3 :literal:
+0
-173
doc/conf.py less more
0 #!/usr/bin/env python3
1 # -*- coding: utf-8 -*-
2 #
3 # Gfapy documentation build configuration file, created by
4 # sphinx-quickstart on Thu Mar 16 10:13:57 2017.
5 #
6 # This file is execfile()d with the current directory set to its
7 # containing dir.
8 #
9 # Note that not all possible configuration values are present in this
10 # autogenerated file.
11 #
12 # All configuration values have a default; values that are commented out
13 # serve to show the default.
14
15 # If extensions (or modules to document with autodoc) are in another directory,
16 # add these directories to sys.path here. If the directory is relative to the
17 # documentation root, use os.path.abspath to make it absolute, like shown here.
18 #
19 import os
20 import sys
21 sys.path.insert(0, os.path.abspath('.'))
22 sys.path.insert(0, os.path.abspath('../'))
23
24 # -- General configuration ------------------------------------------------
25
26 # If your documentation needs a minimal Sphinx version, state it here.
27 #
28 # needs_sphinx = '1.0'
29
30 # Add any Sphinx extension module names here, as strings. They can be
31 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
32 # ones.
33 extensions = [
34 'sphinx.ext.autodoc',
35 'sphinx.ext.doctest',
36 'sphinx.ext.todo',
37 'sphinx.ext.coverage',
38 'sphinx.ext.imgmath',
39 'sphinx.ext.ifconfig',
40 'sphinx.ext.viewcode',
41 'sphinx.ext.githubpages',
42 'sphinx.ext.napoleon'
43 ]
44
45 # Napoleon
46 napoleon_numpy_docstring = True
47 napoleon_google_docstring = True
48 napoleon_use_param = False
49 napoleon_use_ivar = True
50
51 # Default role:
52 default_role = 'any'
53
54 # Add any paths that contain templates here, relative to this directory.
55 templates_path = ['_templates']
56
57 # The suffix(es) of source filenames.
58 # You can specify multiple suffix as a list of string:
59 #
60 # source_suffix = ['.rst', '.md']
61 source_suffix = '.rst'
62
63 # The master toctree document.
64 master_doc = 'index'
65
66 # General information about the project.
67 project = 'Gfapy'
68 copyright = '2017, Giorgio Gonnella and others (see CONTRIBUTORS)'
69 author = 'Giorgio Gonnella and others (see CONTRIBUTORS)'
70
71 # The version info for the project you're documenting, acts as replacement for
72 # |version| and |release|, also used in various other places throughout the
73 # built documents.
74 #
75 # The short X.Y version.
76 version = '1.1'
77 # The full version, including alpha/beta/rc tags.
78 release = '1.1.0'
79
80 # The language for content autogenerated by Sphinx. Refer to documentation
81 # for a list of supported languages.
82 #
83 # This is also used if you do content translation via gettext catalogs.
84 # Usually you set "language" from the command line for these cases.
85 language = None
86
87 # List of patterns, relative to source directory, that match files and
88 # directories to ignore when looking for source files.
89 # This patterns also effect to html_static_path and html_extra_path
90 exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
91
92 # The name of the Pygments (syntax highlighting) style to use.
93 pygments_style = 'sphinx'
94
95 # If true, `todo` and `todoList` produce output, else they produce nothing.
96 todo_include_todos = False
97
98 # -- Options for HTML output ----------------------------------------------
99
100 # The theme to use for HTML and HTML Help pages. See the documentation for
101 # a list of builtin themes.
102 #
103 html_theme = 'sphinx_rtd_theme'
104
105 # Theme options are theme-specific and customize the look and feel of a theme
106 # further. For a list of options available for each theme, see the
107 # documentation.
108 #
109 # html_theme_options = {}
110
111 # Add any paths that contain custom static files (such as style sheets) here,
112 # relative to this directory. They are copied after the builtin static files,
113 # so a file named "default.css" will overwrite the builtin "default.css".
114 html_static_path = ['_static']
115
116
117 # -- Options for HTMLHelp output ------------------------------------------
118
119 # Output file base name for HTML help builder.
120 htmlhelp_basename = 'Gfapydoc'
121
122
123 # -- Options for LaTeX output ---------------------------------------------
124
125 latex_elements = {
126 # The paper size ('letterpaper' or 'a4paper').
127 #
128 'papersize': 'a4paper',
129
130 # The font size ('10pt', '11pt' or '12pt').
131 #
132 # 'pointsize': '10pt',
133
134 # Additional stuff for the LaTeX preamble.
135 #
136 # 'preamble': '',
137
138 # Latex figure (float) alignment
139 #
140 # 'figure_align': 'htbp',
141 }
142
143 # Grouping the document tree into LaTeX files. List of tuples
144 # (source start file, target name, title,
145 # author, documentclass [howto, manual, or own class]).
146 latex_documents = [
147 (master_doc, 'Gfapy.tex', 'Gfapy Documentation',
148 'Giorgio Gonnella', 'manual'),
149 ]
150
151 # -- Options for manual page output ---------------------------------------
152
153 # One entry per manual page. List of tuples
154 # (source start file, name, description, authors, manual section).
155 man_pages = [
156 (master_doc, 'gfapy', 'Gfapy Documentation',
157 [author], 1)
158 ]
159
160
161 # -- Options for Texinfo output -------------------------------------------
162
163 # Grouping the document tree into Texinfo files. List of tuples
164 # (source start file, target name, title, author,
165 # dir menu entry, description, category)
166 texinfo_documents = [
167 (master_doc, 'Gfapy', 'Gfapy Documentation',
168 author, 'Gfapy',
169 'Python library for the Graphic Fragment Assembly (GFA) format.',
170 'Miscellaneous'),
171 ]
172
+0
-36
doc/index.rst less more
0 .. Gfapy documentation master file, created by
1 sphinx-quickstart on Thu Mar 16 10:13:57 2017.
2 You can adapt this file completely to your liking, but it should at least
3 contain the root `toctree` directive.
4
5 Gfapy documentation
6 ===================
7
8 .. toctree::
9 :maxdepth: 2
10 :caption: Contents:
11
12 readme
13 changelog
14
15 tutorial/gfa
16 tutorial/validation
17 tutorial/positional_fields
18 tutorial/placeholders
19 tutorial/positions
20 tutorial/alignments
21 tutorial/tags
22 tutorial/references
23 tutorial/header
24 tutorial/custom_records
25 tutorial/comments
26 tutorial/errors
27 tutorial/graph_operations
28 tutorial/rgfa
29
30 Indices and tables
31 ==================
32
33 * :ref:`genindex`
34 * :ref:`modindex`
35 * :ref:`search`
+0
-3
doc/readme.rst less more
0 Introduction
1 ============
2 .. include:: ../README.rst
+0
-1
doc/run_apidoc.sh less more
0 sphinx-apidoc -o source/ ../gfapy
+0
-238
doc/tutorial/alignments.rst less more
0 .. testsetup:: *
1
2 import gfapy
3 from gfapy import is_placeholder, Alignment
4 h = "H\tVN:Z:2.0\tTS:i:100"
5 sA = "S\tA\t100\t*"
6 sB = "S\tB\t100\t*"
7 x = "E\tx\tA+\tB-\t0\t100$\t0\t100$\t4,2\tTS:i:50"
8 gfa = gfapy.Gfa([h, sA, sB, x])
9
10 .. _alignments:
11
12 Alignments
13 ~~~~~~~~~~
14
15 Some GFA1 (L/C overlap, P overlaps) and GFA2 (E/F alignment) fields contain
16 alignments or lists of alignments. The alignment can be left unspecified and a
17 placeholder symbol ``*`` used instead. In GFA1 the alignments can be given as
18 CIGAR strings, in GFA2 also as Dazzler traces.
19
20 Gfapy uses three different classes for representing the content of alignment fields:
21 :class:`~gfapy.alignment.cigar.CIGAR`, :class:`~gfapy.alignment.trace.Trace`
22 and :class:`~gfapy.alignment.placeholder.AlignmentPlaceholder`.
23
24 Creating an alignment
25 ^^^^^^^^^^^^^^^^^^^^^
26
27 An alignment instance is usually created from its GFA string
28 representation or from a list by using the
29 :class:`gfapy.Alignment() <gfapy.alignment.alignment.Alignment>`
30 constructor.
31
32 .. doctest::
33
34 >>> from gfapy import Alignment
35 >>> Alignment("*")
36 gfapy.AlignmentPlaceholder()
37 >>> Alignment("10,10,10")
38 gfapy.Trace([10,10,10])
39 >>> Alignment([10,10,10])
40 gfapy.Trace([10,10,10])
41 >>> Alignment("30M2I")
42 gfapy.CIGAR([gfapy.CIGAR.Operation(30,'M'), gfapy.CIGAR.Operation(2,'I')])
43
44 If the argument is an alignment object it will be returned,
45 so that is always safe to call the method on a variable which can
46 contain a string or an alignment instance:
47
48 .. doctest::
49
50 >>> Alignment(Alignment("*"))
51 gfapy.AlignmentPlaceholder()
52 >>> Alignment(Alignment("10,10"))
53 gfapy.Trace([10,10])
54
55 Recognizing undefined alignments
56 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
57
58 The :func:`gfapy.is_placeholder() <gfapy.placeholder.is_placeholder>` method
59 allows to test if an alignment field contains an undefined value (placeholder)
60 instead of a defined value (CIGAR string, trace). The method accepts as
61 argument either an alignment object or a string or list representation.
62
63 .. doctest::
64
65 >>> from gfapy import is_placeholder, Alignment
66 >>> is_placeholder(Alignment("30M"))
67 False
68 >>> is_placeholder(Alignment("10,10"))
69 False
70 >>> is_placeholder(Alignment("*"))
71 True
72 >>> is_placeholder("*")
73 True
74 >>> is_placeholder("30M")
75 False
76 >>> is_placeholder("10,10")
77 False
78 >>> is_placeholder([])
79 True
80 >>> is_placeholder([10,10])
81 False
82
83 Note that, as a placeholder is ``False`` in boolean context, just a
84 ``if not aligment`` will also work, if alignment is an alignment object.
85 But this of course, does not work, if it is a string representation.
86 Therefore it is better to use the
87 :func:`gfapy.is_placeholder() <gfapy.placeholder.is_placeholder>` method,
88 which works in both cases.
89
90 .. doctest::
91
92 >>> if not Alignment("*"): print('no alignment')
93 no alignment
94 >>> if is_placeholder(Alignment("*")): print('no alignment')
95 no alignment
96 >>> if "*": print('not a placeholder...?')
97 not a placeholder...?
98 >>> if is_placeholder("*"): print('really? it is a placeholder!')
99 really? it is a placeholder!
100
101 Reading and editing CIGARs
102 ^^^^^^^^^^^^^^^^^^^^^^^^^^
103
104 CIGARs are represented by specialized lists, instances of the class
105 :class:`~gfapy.alignment.cigar.CIGAR`, whose elements are CIGAR operations
106 CIGAR operations are represented by instance of the class
107 :class:`~gfapy.alignment.cigar.CIGAR.Operation`,
108 and provide the properties ``length`` (length of the operation, an integer)
109 and ``code`` (one-letter string which specifies the type of operation).
110 Note that not all operations allowed in SAM files (for which CIGAR strings
111 were first defined) are also meaningful in GFA and thus GFA2 only allows
112 the operations ``M``, ``I``, ``D`` and ``P``.
113
114 .. doctest::
115
116 >>> cigar = gfapy.Alignment("30M")
117 >>> isinstance(cigar, list)
118 True
119 >>> operation = cigar[0]
120 >>> type(operation)
121 <class 'gfapy.alignment.cigar.CIGAR.Operation'>
122 >>> operation.code
123 'M'
124 >>> operation.code = 'D'
125 >>> operation.length
126 30
127 >>> len(operation)
128 30
129 >>> str(operation)
130 '30D'
131
132 As a CIGAR instance is a list, list methods apply to it. If the array is
133 emptied, its string representation will be the placeholder symbol ``*``.
134
135 .. doctest::
136
137 >>> cigar = gfapy.Alignment("1I20M2D")
138 >>> cigar[0].code = "M"
139 >>> cigar.pop(1)
140 gfapy.CIGAR.Operation(20,'M')
141 >>> str(cigar)
142 '1M2D'
143 >>> cigar[:] = []
144 >>> str(cigar)
145 '*'
146
147 The validate :func:`CIGAR.validate() <gfapy.alignment.cigar.CIGAR.validate>`
148 function checks if a CIGAR instance is valid. A version can be provided, as the
149 CIGAR validation is version specific (as GFA2 forbids some CIGAR operations).
150
151 .. doctest::
152
153 >>> cigar = gfapy.Alignment("30M10D20M5I10M")
154 >>> cigar.validate()
155 >>> cigar[1].code = "L"
156 >>> cigar.validate()
157 Traceback (most recent call last):
158 ...
159 gfapy.error.ValueError:
160 >>> cigar = gfapy.Alignment("30M10D20M5I10M")
161 >>> cigar[1].code = "X"
162 >>> cigar.validate(version="gfa1")
163 >>> cigar.validate(version="gfa2")
164 Traceback (most recent call last):
165 ...
166 gfapy.error.ValueError:
167
168 Reading and editing traces
169 ^^^^^^^^^^^^^^^^^^^^^^^^^^
170
171 Traces are arrays of non-negative integers. The values are interpreted
172 using a trace spacing value. If traces are used, a trace spacing value
173 must be defined in a TS integer tag, either in the header, or in the
174 single lines which contain traces (which takes precedence over the
175 header global value).
176
177 .. doctest::
178
179 >>> print(gfa) #doctest: +SKIP
180 H TS:i:100
181 E x A+ B- 0 100$ 0 100$ 4,2 TS:i:50
182 ...
183 >>> gfa.header.TS
184 100
185 >>> gfa.line("x").TS
186 50
187
188 Query, reference and complement
189 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
190
191 CIGARs are asymmetric, i.e.\ they consider one sequence as reference and
192 another sequence as query.
193
194 The :func:`~gfapy.alignment.cigar.CIGAR.length_on_reference` and
195 :func:`~gfapy.alignment.cigar.CIGAR.length_on_query` methods compute the length
196 of the alignment on the two sequences. These methods are used by the library
197 e.g. to convert GFA1 L lines to GFA2 E lines (which is only possible if CIGARs
198 are provided).
199
200 .. doctest::
201
202 >>> cigar = gfapy.Alignment("30M10D20M5I10M")
203 >>> cigar.length_on_reference()
204 70
205 >>> cigar.length_on_query()
206 65
207
208 CIGARs are dependent on which sequence is taken as reference and which
209 is taken as query. For each alignment, a complement CIGAR can be
210 computed using the method
211 :func:`~gfapy.alignment.cigar.CIGAR.complement`; it is the CIGAR obtained
212 when the two sequences are switched.
213
214 .. doctest::
215
216 >>> cigar = gfapy.Alignment("2M1D3M")
217 >>> str(cigar.complement())
218 '3M1I2M'
219
220 The current version of Gfapy does not provide a way to compute the
221 alignment, thus the trace information can be accessed and edited, but
222 not used for this purpose. Because of this there is currently no way in
223 Gfapy to compute a complement trace (trace obtained when the sequences
224 are switched).
225
226 .. doctest::
227
228 >>> trace = gfapy.Alignment("1,2,3")
229 >>> str(trace.complement())
230 '*'
231
232 The complement of a placeholder is a placeholder:
233
234 .. doctest::
235
236 >>> str(gfapy.Alignment("*").complement())
237 '*'
+0
-71
doc/tutorial/comments.rst less more
0 .. testsetup:: *
1
2 import gfapy
3 g = gfapy.Gfa()
4
5 .. _comments:
6
7 Comments
8 --------
9
10 GFA lines starting with a ``#`` symbol are considered comments. In Gfapy
11 comments are represented by instances of the class :class:`gfapy.line.Comment
12 <gfapy.line.comment.comment.Comment>`. They have a similar interface to other
13 line instances, with some differences, e.g. they do not support tags.
14
15 The comments collection
16 ~~~~~~~~~~~~~~~~~~~~~~~
17
18 The comments of a Gfa object are accessed using the :func:`Gfa.comments
19 <gfapy.lines.collections.Collections.comments>` property. This is a list of
20 comment line instances. The single elements can be modified, but the list
21 itself is read-only. To remove a comment from the Gfa, you need to find the
22 instance in the list, and call
23 :func:`~gfapy.line.common.disconnection.Disconnection.disconnect` on it. To
24 add a comment to a :class:`~gfapy.gfa.Gfa` instance is done similarly to other
25 lines, by using the :func:`Gfa.add_line(line)
26 <gfapy.lines.creators.Creators.add_line>` method.
27
28 .. doctest::
29
30 >>> g.add_line("# this is a comment") #doctest: +ELLIPSIS
31 >>> [str(c) for c in g.comments]
32 ['# this is a comment']
33 >>> g.comments[0].disconnect()
34 >>> g.comments
35 []
36
37 Accessing the comment content
38 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
39
40 The content of the comment line, excluding the initial ``#`` and eventual
41 initial spacing characters, is included in the ``content`` field. The initial
42 spacing characters can be read/changed using the ``spacer`` field. The default
43 value is a single space.
44
45 .. doctest::
46
47 >>> g.add_line("# this is a comment") #doctest: +ELLIPSIS
48 >>> c = g.comments[-1]
49 >>> c.content
50 'this is a comment'
51 >>> c.spacer
52 ' '
53 >>> c.spacer = '___'
54 >>> str(c)
55 '#___this is a comment'
56
57 Tags are not supported by comment lines. If the line contains tags,
58 these are nor parsed, but included in the ``content`` field. Trying to set
59 tags raises exceptions.
60
61 .. doctest::
62
63 >>> c = gfapy.Line("# this is not a tag\txx:i:1")
64 >>> c.content
65 'this is not a tag\txx:i:1'
66 >>> c.xx
67 >>> c.xx = 1
68 Traceback (most recent call last):
69 ...
70 gfapy.error.RuntimeError: Tags of comment lines cannot be set
+0
-296
doc/tutorial/custom_records.rst less more
0 .. testsetup:: *
1
2 import gfapy
3 g = gfapy.Gfa(version = 'gfa2')
4
5 .. _custom_records:
6
7 Custom records
8 --------------
9
10 The GFA2 specification considers each line which starts with a non-standard
11 record type a custom (i.e. user- or program-specific) record.
12 Gfapy allows to retrieve these records and access their data using a
13 similar interface to that for the predefined record types.
14
15 Retrieving, adding and deleting custom records
16 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
17
18 Gfa instances have the property
19 :func:`~gfapy.lines.collections.Collections.custom_records`,
20 a list of all line instances with a non-standard record type. Among these,
21 records of a specific record type are retrieved using the method
22 :func:`Gfa.custom_records_of_type(record_type)
23 <gfapy.lines.collections.Collections.custom_records_of_type>`.
24 Lines are added and deleted using the same methods
25 (:func:`~gfapy.lines.creators.Creators.add_line` and
26 :func:`~gfapy.line.common.disconnection.Disconnection.disconnect`) as for
27 other line types.
28
29 .. doctest::
30
31 >>> g.add_line("X\tcustom line") #doctest: +ELLIPSIS
32 >>> g.add_line("Y\tcustom line") #doctest: +ELLIPSIS
33 >>> [str(line) for line in g.custom_records] #doctest: +SKIP
34 ['X\tcustom line', 'Y\tcustom line']
35 >>> g.custom_record_keys) #doctest: +SKIP
36 ['X', 'Y']
37 >>> [str(line) for line in g.custom_records_of_type('X')]
38 ['X\tcustom line']
39 >>> g.custom_records_of_type("X")[-1].disconnect()
40 >>> g.custom_records_of_type('X')
41 []
42
43 Interface without extensions
44 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
45
46 If no extension (see :ref:`extensions` section) has been defined to handle a
47 custom record type, the interface has some limitations: the field content is
48 not validated, and the field names are unknown. The generic custom record
49 class is employed
50 (:class:`~gfapy.line.custom_record.custom_record.CustomRecord`).
51
52 As the name of the positional fields in a custom record is not known, a generic
53 name ``field1``, ``field2``, ... is used. The number of positional fields is
54 found by getting the length of the
55 :attr:`~gfapy.line.custom_record.init.Init.positional_fieldnames` list.
56
57 .. doctest::
58
59 >>> g.add_line("X\ta\tb\tcc:i:10\tdd:i:100") #doctest: +ELLIPSIS
60 >>> x = g.custom_records_of_type('X')[-1]
61 >>> len(x.positional_fieldnames)
62 2
63 >>> x.field1
64 'a'
65 >>> x.field2
66 'b'
67
68 Positional fields are allowed to contain any character (including non-printable
69 characters and spacing characters), except tabs and newlines (as they are
70 structural elements of the line). No further validation is performed.
71
72 As Gfapy cannot know how many positional fields are present when parsing custom
73 records, a heuristic approach is followed, to identify tags. A field resembles
74 a tag if it starts with ``tn:d:`` where ``tn`` is a valid tag name and ``d`` a
75 valid tag datatype (see :ref:`tags` chapter). The fields are parsed from the
76 last to the first.
77
78 As soon as a field is found which does not resemble a tag, all remaining fields
79 are considered positionals (even if another field parsed later resembles a
80 tag). Due to this, invalid tags are sometimes wrongly taken as positional
81 fields (this can be avoided by writing an extension).
82
83 .. doctest::
84
85 >>> g.add_line("X\ta\tb\tcc:i:10\tdd:i:100") #doctest: +ELLIPSIS
86 >>> x1 = g.custom_records_of_type("X")[-1]
87 >>> x1.cc
88 10
89 >>> x1.dd
90 100
91 >>> g.add_line("X\ta\tb\tcc:i:10\tdd:i:100\te") #doctest: +ELLIPSIS
92 >>> x2 = g.custom_records_of_type("X")[-1]
93 >>> x2.cc
94 >>> x2.field3
95 'cc:i:10'
96 >>> g.add_line("Z\ta\tb\tcc:i:10\tddd:i:100") #doctest: +ELLIPSIS
97 >>> x3 = g.custom_records_of_type("Z")[-1]
98 >>> x3.cc
99 >>> x3.field3
100 'cc:i:10'
101 >>> x3.field4
102 'ddd:i:100'
103
104 .. _extensions:
105
106 Extensions
107 ~~~~~~~~~~
108
109 The support for custom fields is limited, as Gfapy does not know which and how
110 many fields are there and how shall they be validated. It is possible to create
111 an extension of Gfapy, which defines new record types: this will allow to use
112 these record types in a similar way to the built-in types.
113
114 As an example, an extension will be described, which defines two record types:
115 T for taxa and M for assignments of segments to taxa. For further information
116 about the possible usage case for this extension, see the Supplemental
117 Information to the manuscript describing Gfapy.
118
119 The T records will contain a single positional field, ``tid``, a GFA2
120 identifier, and an optional UL string tag. The M records will contain three
121 positional fields (all three GFA2 identifier): a name field ``mid`` (optional),
122 and two references, ``tid`` to a T line and ``sid`` to an S line. The SC
123 integer tag will be also defined. Here is an example of a GFA containing M and
124 T lines:
125
126 .. code::
127
128 S sA 1000 *
129 S sB 1000 *
130 M assignment1 t123 sA SC:i:40
131 M assignment2 t123 sB
132 M * B12c sB SC:i:20
133 T B12c
134 T t123 UL:Z:http://www.taxon123.com
135
136 Writing subclasses of the :class:`~gfapy.line.line.Line` class, it is possible to
137 communicate to Gfapy, how records of the M and T class shall be handled. This
138 only requires to define some constants and to call the class method
139 :func:`~gfapy.line.line.Line.register_extension`.
140
141 The constants to define are ``RECORD TYPE``, which shall be the content
142 of the record type field (e.g. ``M``); ``POSFIELDS`` shall contain an ordered
143 dict, specifying the datatype for each positional field, in the order these
144 fields are found in the line; ``TAGS_DATATYPE`` is a dict, specifying the
145 datatype of the predefined optional tags; ``NAME_FIELD`` is a field name,
146 and specifies which field contains the identifier of the line.
147 For details on predefined and custom datatypes, see the next sections
148 (:ref:`predefined_datatypes` and :ref:`custom_datatypes`).
149
150 To handle references, :func:`~gfapy.line.line.Line.register_extension`
151 can be supplied with a ``references`` parameter, a list of triples
152 ``(fieldname, classname, backreferences)``. Thereby ``fieldname`` is the name
153 of the field in the corresponding record containing the reference (e.g.
154 ``sid``), ``classname`` is the name of the class to which the reference goes
155 (e.g. ``gfa.line.segment.GFA2``), and \texttt{backreferences} is how the
156 collection of backreferences shall be called, in the records to which reference
157 points to (e.g. ``metagenomic_assignments``).
158
159 .. code:: python
160
161 from collections include OrderedDict
162
163 class Taxon(gfapy.Line):
164 RECORD_TYPE = "T"
165 POSFIELDS = OrderedDict([("tid","identifier_gfa2")])
166 TAGS_DATATYPE = {"UL":"Z"}
167 NAME_FIELD = "tid"
168
169 Taxon.register_extension()
170
171 class MetagenomicAssignment(gfapy.Line):
172 RECORD_TYPE = "M"
173 POSFIELDS = OrderedDict([("mid","optional_identifier_gfa2"),
174 ("tid","identifier_gfa2"),
175 ("sid","identifier_gfa2")])
176 TAGS_DATATYPE = {"SC":"i"}
177 NAME_FIELD = "mid"
178
179 MetagenomicAssignment.register_extension(references=
180 [("sid", gfapy.line.segment.GFA2, "metagenomic_assignments"),
181 ("tid", Taxon, "metagenomic_assignments")])
182
183 .. _predefined_datatypes:
184
185 Predefined datatypes for extensions
186 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
187
188 The datatype of fields is specified in Gfapy using classes, which provide
189 functions for decoding, encoding and validating the corresponding data.
190 Gfapy contains a number of datatypes which correspond to the description
191 of the field content in the GFA1 and GFA2 specification.
192
193 When writing extensions only the GFA2 field datatypes are generally used
194 (as GFA1 does not contain custom fields). They are summarized in
195 the following table:
196
197 +-------------------------------------+---------------+--------------------------------------------------------+
198 | Name | Example | Description |
199 +=====================================+===============+========================================================+
200 | ``alignment_gfa2`` | ``12M1I3M`` | CIGAR string, Trace alignment or Placeholder (``*``) |
201 +-------------------------------------+---------------+--------------------------------------------------------+
202 | ``identifier_gfa2`` | ``S1`` | ID of a line |
203 +-------------------------------------+---------------+--------------------------------------------------------+
204 | ``oriented_identifier_gfa2`` | ``S1+`` | ID of a line followed by ``+`` or ``-`` |
205 +-------------------------------------+---------------+--------------------------------------------------------+
206 | ``optional_identifier_gfa2`` | ``*`` | ID of a line or Placeholder (``*``) |
207 +-------------------------------------+---------------+--------------------------------------------------------+
208 | ``identifier_list_gfa2`` | ``S1 S2`` | space separated list of line IDs |
209 +-------------------------------------+---------------+--------------------------------------------------------+
210 | ``oriented_identifier_list_gfa2`` | ``S1+ S2-`` | space separated list of line IDs plus orientations |
211 +-------------------------------------+---------------+--------------------------------------------------------+
212 | ``position_gfa2`` | ``120$`` | non-negative integer, optionally followed by ``$`` |
213 +-------------------------------------+---------------+--------------------------------------------------------+
214 | ``sequence_gfa2`` | ``ACGNNYR`` | sequence of printable chars., no whitespace |
215 +-------------------------------------+---------------+--------------------------------------------------------+
216 | ``string`` | ``a b_c;d`` | string, no tabs and newlines (Z tags) |
217 +-------------------------------------+---------------+--------------------------------------------------------+
218 | ``char`` | ``A`` | single character (A tags) |
219 +-------------------------------------+---------------+--------------------------------------------------------+
220 | ``float`` | ``1.12`` | float (f tags) |
221 +-------------------------------------+---------------+--------------------------------------------------------+
222 | ``integer`` | ``-12`` | integer (i tags) |
223 +-------------------------------------+---------------+--------------------------------------------------------+
224 | ``optional_integer`` | ``*`` | integer or placeholder |
225 +-------------------------------------+---------------+--------------------------------------------------------+
226 | ``numeric_array`` | ``c,10,3`` | array of integers or floats (B tags) |
227 +-------------------------------------+---------------+--------------------------------------------------------+
228 | ``byte_array`` | ``12F1FF`` | hexadecimal byte string (H tags) |
229 +-------------------------------------+---------------+--------------------------------------------------------+
230 | ``json`` | ``{’b’:2}`` | JSON string, no tabs and newlines (J tags) |
231 +-------------------------------------+---------------+--------------------------------------------------------+
232
233 .. _custom_datatypes:
234
235 Custom datatypes for extensions
236 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
237
238 For custom records, one sometimes needs datatypes not yet available in the GFA
239 specification. For example, a custom datatype can be defined for
240 the taxon identifier used in the ``tid`` field of the T and M records:
241 accordingly the taxon identifier shall be only either
242 in the form ``taxon:<n>``, where ``<n>`` is a positive integer,
243 or consist of letters, numbers and underscores only
244 (without ``:``).
245
246 To define the datatype, a class is written, which contains the following
247 functions:
248
249 * ``validate_encoded(string)``: validates the content of the field,
250 if this is a string (e.g., the name of the T line)
251 * ``validate_decoded(object)``: validates the content of the field,
252 if this is not a string (e.g., a reference to a T line)
253 * ``decode(string)``: validates the content of the field (a string)
254 and returns the decoded content; note that references must not be resolved
255 (there is no access to the Gfa instance here), thus the name of the
256 T line will be returned unchanged
257 * ``encode(string)``: validates the content of the field (not in string
258 form) and returns the string which codes it in the GFA file (also here
259 references are validated but not converted into strings)
260
261 Finally the datatype is registered calling
262 :func:`~gfapy.field.field.Field.register_datatype`. The code for
263 the taxon ID extension is the following:
264
265 .. code:: python
266
267 import re
268
269 class TaxonID:
270
271 def validate_encoded(string):
272 if not re.match(r"^taxon:(\d+)$",string) and \
273 not re.match(r"^[a-zA-Z0-9_]+$", string):
274 raise gfapy.ValueError("Invalid taxon ID: {}".format(string))
275
276 def decode(string):
277 TaxonID.validate_encoded(string)
278 return string
279
280 def validate_decoded(obj):
281 if isinstance(obj,Taxon):
282 TaxonID.validate_encoded(obj.name)
283 else:
284 raise gfapy.TypeError(
285 "Invalid type for taxon ID: "+"{}".format(repr(obj)))
286
287 def encode(obj):
288 TaxonID.validate_decoded(obj)
289 return obj
290
291 gfapy.Field.register_datatype("taxon_id", TaxonID)
292
293 To use the new datatype in the T and M lines defined above (:ref:`extensions`),
294 the definition of the two subclasses can be changed:
295 in ``POSFIELDS`` the value ``taxon_id`` shall be assigned to the key ``tid``.
+0
-43
doc/tutorial/errors.rst less more
0 .. _errors:
1
2 Errors
3 ------
4
5 The different types of errors defined in Gfapy are summarized in the
6 following table. All exception raised in the library are subclasses of
7 `Error`. Thus, ``except gfapy.Error`` can be use to catch
8 all library errors.
9
10 +-----------------------+-------------------------------+---------------------------------+
11 | Error | Description | Examples |
12 +=======================+===============================+=================================+
13 | `VersionError` | An unknown or wrong version | "GFA0"; or GFA1 in GFA2 context |
14 | | is specified or implied | |
15 +-----------------------+-------------------------------+---------------------------------+
16 | `ValueError` | The value of an object is | a negative position is used |
17 | | invalid | |
18 +-----------------------+-------------------------------+---------------------------------+
19 | `TypeError` | The wrong type has been used | Z instead of i used for VN tag; |
20 | | or specified | Hash for an i tag |
21 +-----------------------+-------------------------------+---------------------------------+
22 | `FormatError` | The format of an object is | a line does not contain the |
23 | | wrong | expected number of fields |
24 +-----------------------+-------------------------------+---------------------------------+
25 | `NotUniqueError` | Something should be unique | duplicated tag name or line |
26 | | but is not | identifier |
27 +-----------------------+-------------------------------+---------------------------------+
28 | `InconsistencyError` | Pieces of information collide | length of sequence and LN tag |
29 | | with each other | do not match |
30 +-----------------------+-------------------------------+---------------------------------+
31 | `RuntimeError` | The user tried to do | editing from/to field in |
32 | | something which is not | connected links |
33 | | allowed | |
34 +-----------------------+-------------------------------+---------------------------------+
35 | `ArgumentError` | Problem with the arguments of | wrong number of arguments in |
36 | | a method | dynamically created method |
37 +-----------------------+-------------------------------+---------------------------------+
38 | `AssertionError` | Something unexpected happened | there is a bug in the library or|
39 | | | the library has been used in |
40 | | | an unintended way |
41 +-----------------------+-------------------------------+---------------------------------+
42
+0
-409
doc/tutorial/gfa.rst less more
0 .. testsetup:: *
1
2 import gfapy
3 gfa = gfapy.Gfa()
4 gfa1 = gfapy.Gfa()
5 gfa1.add_line("H\tVN:Z:1.0")
6 gfa1.add_line("# this is a comment")
7 gfa1.add_line("S\t1\t*")
8 gfa1.add_line("S\t2\t*")
9 gfa1.add_line("S\t3\t*")
10 gfa2 = gfapy.Gfa()
11 gfa2.add_line("H\tVN:Z:2.0\tTS:i:100")
12 gfa2.add_line("X\tcustom line")
13 gfa2.add_line("Y\tcustom line")
14
15 .. _gfa:
16
17 The Gfa class
18 -------------
19
20 The content of a GFA file is represented in Gfapy by an instance of the class
21 :class:`~gfapy.gfa.Gfa`. In most cases, the Gfa instance will be constructed
22 from the data contained in a GFA file, using the method
23 :func:`Gfa.from_file() <gfapy.gfa.Gfa.from_file>`.
24
25 Alternatively, it is possible to use the construct of the class; it takes an
26 optional positional parameter, the content of a GFA file (as string, or as list
27 of strings, one per line of the GFA file). If no GFA content is provided, the
28 Gfa instance will be empty.
29
30 .. doctest::
31
32 >>> gfa = gfapy.Gfa("H\tVN:Z:1.0\nS\tA\t*")
33 >>> print(len(gfa.lines))
34 2
35 >>> gfa = gfapy.Gfa(["H\tVN:Z:1.0", "S\tA\t*", "S\tB\t*"])
36 >>> print(len(gfa.lines))
37 3
38 >>> gfa = gfapy.Gfa()
39 >>> print(len(gfa.lines))
40 0
41
42 The string representation of the Gfa object (which can be obtained using
43 ``str()``) is the textual representation in GFA format.
44 Using :func:`Gfa.to_file(filename) <gfapy.gfa.Gfa.to_file>` allows
45 writing this representation to a GFA file (the content of the file is
46 overwritten).
47
48 .. doctest::
49
50 >>> g1 = gfapy.Gfa()
51 >>> g1.append("H\tVN:Z:1.0")
52 >>> g1.append("S\ta\t*")
53 >>> g1.to_file("my.gfa") #doctest: +SKIP
54 >>> g2 = gfapy.Gfa.from_file("my.gfa") #doctest: +SKIP
55 >>> str(g1)
56 'H\tVN:Z:1.0\nS\ta\t*'
57
58
59 All methods for creating a Gfa (constructor and from_file) accept
60 a ``vlevel`` parameter, the validation level,
61 and can assume the values 0, 1, 2 and 3. A higher value means
62 more validations are performed. The :ref:`validation` chapter explains
63 the meaning of the different validation levels in detail.
64 The default value is 1.
65
66 .. doctest::
67
68 >>> gfapy.Gfa().vlevel
69 1
70 >>> gfapy.Gfa(vlevel = 0).vlevel
71 0
72
73 A further parameter is ``version``. It can be set to ``'gfa1'``,
74 ``'gfa2'`` or left to the default value (``None``). The default
75 is to auto-detect the version of the GFA from the line content.
76 If the version is set manually, any content not compatible to the
77 specified version will trigger an exception. If the version is
78 set automatically, an exception will be raised if two lines
79 are found, with content incompatible to each other (e.g. a GFA1
80 segment followed by a GFA2 segment).
81
82 .. doctest::
83
84 >>> g = gfapy.Gfa(version='gfa2')
85 >>> g.version
86 'gfa2'
87 >>> g.add_line("S\t1\t*")
88 Traceback (most recent call last):
89 ...
90 gfapy.error.VersionError: Version: 1.0 (None)
91 ...
92 >>> g = gfapy.Gfa()
93 >>> g.version
94 >>> g.add_line("S\t1\t*")
95 >>> g.version
96 'gfa1'
97 >>> g.add_line("S\t1\t100\t*")
98 Traceback (most recent call last):
99 ...
100 gfapy.error.VersionError: Version: 1.0 (None)
101 ...
102
103 Collections of lines
104 ~~~~~~~~~~~~~~~~~~~~
105
106 The property :attr:`~gfapy.lines.collections.Collections.lines`
107 of the Gfa object is a list of all the lines
108 in the GFA file (including the header, which is split into single-tag
109 lines). The list itself shall not be modified by the user directly (i.e.
110 adding and removing lines is done using a different interface, see
111 below). However the single elements of the list can be edited.
112
113 .. doctest::
114
115 >>> for line in gfa.lines: print(line)
116
117 For most record types, a list of the lines of the record type is available
118 as a read-only property, which is named after the record type, in plural.
119
120 .. doctest::
121
122 >>> [str(line) for line in gfa1.segments]
123 ['S\t1\t*', 'S\t2\t*', 'S\t3\t*']
124 >>> [str(line) for line in gfa2.fragments]
125 []
126
127 A particular case are edges; these are in GFA1 links and containments, while in
128 GFA2 there is a unified edge record type, which also allows to represent
129 internal alignments. In Gfapy, the
130 :attr:`~gfapy.lines.collections.Collections.edges` property retrieves all edges
131 (i.e. all E lines in GFA2, and all L and C lines in GFA1). The
132 :attr:`~gfapy.lines.collections.Collections.dovetails` property is a list of
133 all edges which represent dovetail overlaps (i.e. all L lines in GFA1 and a
134 subset of the E lines in GFA2). The
135 :attr:`~gfapy.lines.collections.Collections.containments` property is a list of
136 all edges which represent containments (i.e. all C lines in GFA1 and a subset
137 of the E lines in GFA2).
138
139 .. doctest::
140
141 >>> gfa2.edges
142 []
143 >>> gfa2.dovetails
144 []
145 >>> gfa2.containments
146 []
147
148 Paths are retrieved using the
149 :attr:`~gfapy.lines.collections.Collections.paths` property. This list
150 contains all P lines in GFA1 and all O lines in GFA2. Sets returns the list of
151 all U lines in GFA2 (empty list in GFA1).
152
153 .. doctest::
154
155 >>> gfa2.paths
156 []
157 >>> gfa2.sets
158 []
159
160 The header contain metadata in a single or multiple lines. For ease of
161 access to the header information, all its tags are summarized in a
162 single line instance, which is retrieved using the
163 :attr:`~gfapy.lines.headers.Headers.header` property. This list
164 The :ref:`header` chapter of this manual explains more in
165 detail, how to work with the header object.
166
167 .. doctest::
168
169 >>> gfa2.header.TS
170 100
171
172 All lines which start by the string ``#`` are comments; they are handled in
173 the :ref:`comments` chapter and are retrieved using the
174 :attr:`~gfapy.lines.collections.Collections.comments` property.
175
176 .. doctest::
177
178 >>> [str(line) for line in gfa1.comments]
179 ['# this is a comment']
180
181 Custom lines are lines of GFA2 files which start
182 with a non-standard record type. Gfapy provides basic built-in support
183 for accessing the information in custom lines, and allows to define
184 extensions for own record types for defining more advanced
185 functionality (see the :ref:`custom_records` chapter).
186
187 .. doctest::
188
189 >>> [str(line) for line in gfa2.custom_records]
190 ['X\tcustom line', 'Y\tcustom line']
191 >>> gfa2.custom_record_keys
192 ['X', 'Y']
193 >>> [str(line) for line in gfa2.custom_records_of_type('X')]
194 ['X\tcustom line']
195
196 Line identifiers
197 ~~~~~~~~~~~~~~~~
198
199 Some GFA lines have a mandatory or optional identifier field: segments and
200 paths in GFA1, segments, gaps, edges, paths and sets in GFA2. A line of this
201 type can be retrieved by identifier, using the method
202 :func:`Gfa.line(ID) <gfapy.gfa.Gfa.line>` using the identifier as argument.
203
204 .. doctest::
205
206 >>> str(gfa1.line('1'))
207 'S\t1\t*'
208
209 The GFA2 specification prescribes the exact namespace for the identifier
210 (segments, paths, sets, edges and gaps identifier share the same namespace).
211 The content of this namespace can be retrieved using the
212 :attr:`~gfapy.lines.collections.Collections.names` property.
213 The identifiers of single line types
214 can be retrieved using the properties
215 :attr:`~gfapy.lines.collections.Collections.segment_names`,
216 :attr:`~gfapy.lines.collections.Collections.edge_names`,
217 :attr:`~gfapy.lines.collections.Collections.gap_names`,
218 :attr:`~gfapy.lines.collections.Collections.path_names` and
219 :attr:`~gfapy.lines.collections.Collections.set_names`.
220
221 .. doctest::
222
223 >>> g = gfapy.Gfa()
224 >>> g.add_line("S\tA\t100\t*")
225 >>> g.add_line("S\tB\t100\t*")
226 >>> g.add_line("S\tC\t100\t*")
227 >>> g.add_line("E\tb_c\tB+\tC+\t0\t10\t90\t100$\t*")
228 >>> g.add_line("O\tp1\tB+ C+")
229 >>> g.add_line("U\ts1\tA b_c g")
230 >>> g.add_line("G\tg\tA+\tB-\t1000\t*")
231 >>> g.names
232 ['A', 'B', 'C', 'b_c', 'g', 'p1', 's1']
233 >>> g.segment_names
234 ['A', 'B', 'C']
235 >>> g.path_names
236 ['p1']
237 >>> g.edge_names
238 ['b_c']
239 >>> g.gap_names
240 ['g']
241 >>> g.set_names
242 ['s1']
243
244 The GFA1 specification does not handle the question of the namespace of
245 identifiers explicitly. However, gfapy assumes and enforces
246 a single namespace for segment, path names and the values of the ID tags
247 of L and C lines. The content of this namespace can be found using
248 :attr:`~gfapy.lines.collections.Collections.names` property.
249 The identifiers of single line types
250 can be retrieved using the properties
251 :attr:`~gfapy.lines.collections.Collections.segment_names`,
252 :attr:`~gfapy.lines.collections.Collections.edge_names`
253 (ID tags of links and containments) and
254 :attr:`~gfapy.lines.collections.Collections.path_names`.
255 For GFA1, the properties
256 :attr:`~gfapy.lines.collections.Collections.gap_names`,
257 :attr:`~gfapy.lines.collections.Collections.set_names`
258 contain always empty lists.
259
260 .. doctest::
261
262 >>> g = gfapy.Gfa()
263 >>> g.add_line("S\tA\t*")
264 >>> g.add_line("S\tB\t*")
265 >>> g.add_line("S\tC\t*")
266 >>> g.add_line("L\tB\t+\tC\t+\t*\tID:Z:b_c")
267 >>> g.add_line("P\tp1\tB+,C+\t*")
268 >>> g.names
269 ['A', 'B', 'C', 'b_c', 'p1']
270 >>> g.segment_names
271 ['A', 'B', 'C']
272 >>> g.path_names
273 ['p1']
274 >>> g.edge_names
275 ['b_c']
276 >>> g.gap_names
277 []
278 >>> g.set_names
279 []
280
281 Identifiers of external sequences
282 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
283
284 Fragments contain identifiers which refer to external sequences
285 (not contained in the GFA file). According to the specification, the
286 these identifiers are not part of the same namespace as the identifier
287 of the GFA lines. They can be retrieved using the
288 :attr:`~gfapy.lines.collections.Collections.external_names`
289 property.
290
291 .. doctest::
292
293 >>> g = gfapy.Gfa()
294 >>> g.add_line("S\tA\t100\t*")
295 >>> g.add_line("F\tA\tread1+\t10\t30\t0\t20$\t20M")
296 >>> g.external_names
297 ['read1']
298
299 The method
300 :func:`Gfa.fragments_for_external(external_ID) <gfapy.lines.finders.Finders.fragments_for_external>`
301 retrieves all F lines with a specified external sequence identifier.
302
303 .. doctest::
304
305 >>> f = g.fragments_for_external('read1')
306 >>> len(f)
307 1
308 >>> str(f[0])
309 'F\tA\tread1+\t10\t30\t0\t20$\t20M'
310
311 Adding new lines
312 ~~~~~~~~~~~~~~~~
313
314 New lines can be added to a Gfa instance using the
315 :func:`Gfa.add_line(line) <gfapy.lines.creators.Creators.add_line>`
316 method or its alias
317 :func:`Gfa.append(line) <gfapy.lines.creators.Creators.append>`.
318 The argument can be either a string
319 describing a line with valid GFA syntax, or a :class:`~gfapy.line.line.Line`
320 instance. If a string is added, a line instance is created and
321 then added.
322
323 .. doctest::
324
325 >>> g = gfapy.Gfa()
326 >>> g.add_line("S\tA\t*") #doctest: +ELLIPSIS
327 >>> g.segment_names
328 ['A']
329 >>> g.append("S\tB\t*") #doctest: +ELLIPSIS
330 >>> g.segment_names
331 ['A', 'B']
332
333 Editing the lines
334 ~~~~~~~~~~~~~~~~~
335
336 Accessing the information stored in the fields of a line instance is
337 described in the :ref:`positional_fields` and :ref:`tags` chapters.
338
339 In Gfapy, a line instance belonging to a Gfa instance is said
340 to be *connected* to the Gfa instance. Direct editing the content of a connected
341 line is only possible, for those fields which do not contain
342 references to other lines. For more information on how to modify the content of
343 the fields of connected line, see the :ref:`references` chapter.
344
345 .. doctest::
346
347 >>> g = gfapy.Gfa()
348 >>> e = gfapy.Line("E\t*\tA+\tB-\t0\t10\t90\t100$\t*")
349 >>> e.sid1 = "C+"
350 >>> g.add_line(e) #doctest: +ELLIPSIS
351 >>> e.sid1 = "A+"
352 Traceback (most recent call last):
353 gfapy.error.RuntimeError: ...
354
355 Removing lines
356 ~~~~~~~~~~~~~~
357
358 Disconnecting a line from the Gfa instance is done using the
359 :func:`Gfa.rm(line) <gfapy.lines.destructors.Destructors.rm>` method. The
360 argument can be a line instance or the name of a line.
361
362 In alternative, a line instance can also be disconnected using the
363 `disconnect` method on it. Disconnecting a line
364 may trigger other operations, such as the disconnection of other lines (see the
365 :ref:`references` chapter).
366
367 .. doctest::
368
369 >>> g = gfapy.Gfa()
370 >>> g.add_line("S\tA\t*") #doctest: +ELLIPSIS
371 >>> g.segment_names
372 ['A']
373 >>> g.rm('A') #doctest: +ELLIPSIS
374 >>> g.segment_names
375 []
376 >>> g.append("S\tB\t*") #doctest: +ELLIPSIS
377 >>> g.segment_names
378 ['B']
379 >>> b = g.line('B')
380 >>> b.disconnect()
381 >>> g.segment_names
382 []
383
384 Renaming lines
385 ~~~~~~~~~~~~~~
386
387 Lines with an identifier can be renamed. This is done simply by editing
388 the corresponding field (such as ``name`` or ``sid`` for a segment).
389 This field is not a reference to another line and can be freely edited
390 also in line instances connected to a Gfa. All references to the line
391 from other lines will still be up to date, as they will refer to the
392 same instance (whose name has been changed) and their string
393 representation will use the new name.
394
395 .. doctest::
396
397 >>> g = gfapy.Gfa()
398 >>> g.add_line("S\tA\t*") #doctest: +ELLIPSIS
399 >>> g.add_line("L\tA\t+\tB\t-\t*") #doctest: +ELLIPSIS
400 >>> g.segment_names
401 ['A', 'B']
402 >>> g.dovetails[0].from_name
403 'A'
404 >>> g.segment('A').name = 'C'
405 >>> g.segment_names
406 ['B', 'C']
407 >>> g.dovetails[0].from_name
408 'C'
+0
-14
doc/tutorial/graph_operations.rst less more
0 .. _graph_operations:
1
2 Graph operations
3 ----------------
4
5 Graph operations such as linear paths merging, multiplication of
6 segments and other are provided. These operations are implemented
7 in analogy to those provided by the Ruby library RGFA. As RGFA only
8 handles GFA1 graphs, only dovetail overlaps are considered as
9 connections. A detailed description of the operation can be
10 found in Gonnella and Kurtz (2016). More information about the
11 single operations are found in the method documentation of the
12 submodules of `GraphOperations`.
13
+0
-169
doc/tutorial/header.rst less more
0 .. testsetup:: *
1
2 import gfapy
3 gfa = gfapy.Gfa()
4
5 .. _header:
6
7 The Header
8 ----------
9
10 GFA files may contain one or multiple header lines (record type: "H"). These
11 lines may be present in any part of the file, not necessarily at the beginning.
12
13 Although the header may consist of multiple lines, its content refers to the
14 whole file. Therefore in Gfapy the header is accessed using a single line
15 instance (accessible by the :attr:`~gfapy.lines.headers.Headers.header`
16 property). Header lines contain only tags. If not header line is present in the
17 Gfa, then the header line object will be empty (i.e. contain no tags).
18
19 Note that header lines cannot be connected to the Gfa as other lines (i.e.
20 calling :meth:`~gfapy.line.common.connection.Connection.connect` on them raises
21 an exception). Instead they must be merged to the existing Gfa header, using
22 `add_line` on the Gfa instance.
23
24 .. doctest::
25
26 >>> gfa.add_line("H\tnn:f:1.0") #doctest: +ELLIPSIS
27 >>> gfa.header.nn
28 1.0
29 >>> gfapy.Line("H\tnn:f:1.0").connect(gfa)
30 Traceback (most recent call last):
31 ...
32 gfapy.error.RuntimeError: ...
33
34 Multiple definitions of the predefined header tags
35 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
36
37 For the predefined tags (``VN`` and ``TS``), the presence of multiple
38 values in different lines is an error, unless the value is the same in
39 each instance (in which case the repeated definitions are ignored).
40
41 .. doctest::
42
43 >>> gfa.add_line("H\tVN:Z:1.0") #doctest: +ELLIPSIS
44 >>> gfa.add_line("H\tVN:Z:1.0") # ignored #doctest: +ELLIPSIS
45 >>> gfa.add_line("H\tVN:Z:2.0")
46 Traceback (most recent call last):
47 ...
48 gfapy.error.VersionError: ...
49
50 Multiple definitions of custom header tags
51 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
52
53 If the tags are present only once in the header in its entirety, the access to
54 the tags is the same as for any other line (see the :ref:`tags` chapter).
55
56 However, the specification does not forbid custom tags to be defined with
57 different values in different header lines (which we name "multi-definition
58 tags"). This particular case is handled in the next sections.
59
60 Reading multi-definitions tags
61 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
62
63 Reading, validating and setting the datatype of multi-definition tags is done
64 using the same methods as for all other lines (see the :ref:`tags` chapter).
65 However, if a tag is defined multiple times on multiple H lines, reading the
66 tag will return a list of the values on the lines. This array is an instance of
67 the subclass ``gfapy.FieldArray`` of list.
68
69 .. doctest::
70
71 >>> gfa.add_line("H\txx:i:1") #doctest: +ELLIPSIS
72 >>> gfa.add_line("H\txx:i:2") #doctest: +ELLIPSIS
73 >>> gfa.add_line("H\txx:i:3") #doctest: +ELLIPSIS
74 >>> gfa.header.xx
75 gfapy.FieldArray('i',[1, 2, 3])
76
77 Setting tags
78 ~~~~~~~~~~~~
79
80 There are two possibilities to set a tag for the header. The first is
81 the normal tag interface (using ``set`` or the tag name property). The
82 second is to use ``add``. The latter supports multi-definition tags,
83 i.e. it adds the value to the previous ones (if any), instead of
84 overwriting them.
85
86 .. doctest::
87
88 >>> gfa = gfapy.Gfa()
89 >>> gfa.header.xx
90 >>> gfa.header.add("xx", 1)
91 >>> gfa.header.xx
92 1
93 >>> gfa.header.add("xx", 2)
94 >>> gfa.header.xx
95 gfapy.FieldArray('i',[1, 2])
96 >>> gfa.header.set("xx", 3)
97 >>> gfa.header.xx
98 3
99
100 Modifying field array values
101 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
102
103 Field arrays can be modified directly (e.g. adding new values or
104 removing some values). After modification, the user may check if the
105 array values remain compatible with the datatype of the tag using the
106 :meth:`~gfapy.line.common.validate.Validate.validate_field`` method.
107
108 .. doctest::
109
110 >>> gfa = gfapy.Gfa()
111 >>> gfa.header.xx = gfapy.FieldArray('i',[1,2,3])
112 >>> gfa.header.xx
113 gfapy.FieldArray('i',[1, 2, 3])
114 >>> gfa.header.validate_field("xx")
115 >>> gfa.header.xx.append("X")
116 >>> gfa.header.validate_field("xx")
117 Traceback (most recent call last):
118 ...
119 gfapy.error.FormatError: ...
120
121 If the field array is modified using array methods which return a list
122 or data of any other type, a field array must be constructed, setting
123 its datatype to the value returned by calling
124 :meth:`~gfapy.line.common.field_datatype.FieldDatatype.get_datatype`
125 on the header.
126
127 .. doctest::
128
129 >>> gfa = gfapy.Gfa()
130 >>> gfa.header.xx = gfapy.FieldArray('i',[1,2,3])
131 >>> gfa.header.xx
132 gfapy.FieldArray('i',[1, 2, 3])
133 >>> gfa.header.xx = gfapy.FieldArray(gfa.header.get_datatype("xx"),
134 ... list(map(lambda x: x+1, gfa.header.xx)))
135 >>> gfa.header.xx
136 gfapy.FieldArray('i',[2, 3, 4])
137
138 String representation of the header
139 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
140
141 For consistency with other line types, the string representation of the header
142 is a single-line string, eventually non standard-compliant, if it contains
143 multiple instances of the tag. (and when calling
144 :meth:`~gfapy.line.common.writer.Writer.field_to_s` for a tag present multiple
145 times, the output string will contain the instances of the tag, separated by
146 tabs).
147
148 However, when the Gfa is output to file or string, the header is split into
149 multiple H lines with single tags, so that standard-compliant GFA is output.
150 The split header can be retrieved using the
151 :attr:`~gfapy.lines.headers.Headers.headers` property of the Gfa instance.
152
153 .. doctest::
154
155 >>> gfa = gfapy.Gfa()
156 >>> gfa.header.VN = "1.0"
157 >>> gfa.header.xx = gfapy.FieldArray('i',[1,2])
158 >>> gfa.header.field_to_s("xx")
159 '1\t2'
160 >>> gfa.header.field_to_s("xx", tag=True)
161 'xx:i:1\txx:i:2'
162 >>> str(gfa.header)
163 'H\tVN:Z:1.0\txx:i:1\txx:i:2'
164 >>> [str(h) for h in gfa.headers]
165 ['H\tVN:Z:1.0', 'H\txx:i:1', 'H\txx:i:2']
166 >>> str(gfa)
167 'H\tVN:Z:1.0\nH\txx:i:1\nH\txx:i:2'
168
+0
-69
doc/tutorial/placeholders.rst less more
0 .. testsetup:: *
1
2 import gfapy
3
4 .. _placeholders:
5
6 Placeholders
7 ------------
8
9 Some positional fields may contain an undefined value S: ``sequence``;
10 L/C: ``overlap``; P: ``overlaps``; E: ``eid``, ``alignment``; F:
11 ``alignment``; G: ``gid``, ``var``; U/O: ``pid``. In GFA this value is
12 represented by a ``*``.
13
14 In Gfapy the class `Placeholder` represent the undefined value.
15
16 Distinguishing placeholders
17 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
18
19 The :func:`gfapy.is_placeholder() <gfapy.placeholder.is_placeholder>` method
20 allows to check if a value is a placeholder; a value is a placeholder if
21 it is a `Placeholder` instance, or would represent
22 a placeholder in GFA (a string containing ``*``), or would be represented
23 by a placeholder in GFA (e.g. an empty array).
24
25 .. doctest::
26
27 >>> gfapy.is_placeholder("*")
28 True
29 >>> gfapy.is_placeholder("**")
30 False
31 >>> gfapy.is_placeholder([])
32 True
33 >>> gfapy.is_placeholder(gfapy.Placeholder())
34 True
35
36 Note that, as a placeholder is ``False`` in boolean context, just a
37 ``if not placeholder`` will also work, if the value is an instance
38 of `Placeholder`, but not always for the other cases (in particular not
39 for the string representation ``*``).
40 Therefore using
41 :func:`gfapy.is_placeholder() <gfapy.placeholder.is_placeholder>`
42 is better.
43
44 .. doctest::
45
46 >>> if "*": print('* is not a placeholder')
47 * is not a placeholder
48 >>> if gfapy.is_placeholder("*"): print('but it represents a placeholder')
49 but it represents a placeholder
50
51 Compatibility methods
52 ~~~~~~~~~~~~~~~~~~~~~
53
54 Some methods are defined for placeholders, which allow them to respond
55 to the same methods as defined values. This allows to write generic
56 code.
57
58 .. doctest::
59
60 >>> placeholder = gfapy.Placeholder()
61 >>> placeholder.validate() # does nothing
62 >>> len(placeholder)
63 0
64 >>> placeholder[1]
65 gfapy.Placeholder()
66 >>> placeholder + 1
67 gfapy.Placeholder()
68
+0
-448
doc/tutorial/positional_fields.rst less more
0 .. testsetup:: *
1
2 import gfapy
3 gfa = gfapy.Gfa()
4
5 .. _positional_fields:
6
7 Positional fields
8 -----------------
9
10 Most lines in GFA have positional fields (Headers are an exception).
11 During parsing, if a line is encountered, which has too less or too many
12 positional fields, an exception will be thrown. The correct number of
13 positional fields is record type-specific.
14
15 Positional fields are recognized by its position in the line. Each
16 positional field has an implicit field name and datatype associated with
17 it.
18
19 Field names
20 ~~~~~~~~~~~
21
22 The field names are derived from the specification. Lower case versions
23 of the field names are used and spaces are substituted with underscores.
24 In some cases, the field names were changed, as they represent keywords
25 in common programming languages (``from``, ``send``).
26
27 The following tables shows the field names used in Gfapy, for each kind
28 of line. Headers have no positional fields. Comments and custom records
29 follow particular rules, see the respective chapters (:ref:`comments` and
30 :ref:`custom_records`).
31
32 GFA1 field names
33 ^^^^^^^^^^^^^^^^
34
35 +---------------+--------------------+---------------------+------------------+-----------------+---------------+---------------+
36 | Record Type | Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 |
37 +===============+====================+=====================+==================+=================+===============+===============+
38 | Segment | ``name`` | ``sequence`` | | | | |
39 +---------------+--------------------+---------------------+------------------+-----------------+---------------+---------------+
40 | Link | ``from_segment`` | ``from_orient`` | ``to_segment`` | ``to_orient`` | ``overlap`` | |
41 +---------------+--------------------+---------------------+------------------+-----------------+---------------+---------------+
42 | Containment | ``from_segment`` | ``from_orient`` | ``to_segment`` | ``to_orient`` | ``pos`` | ``overlap`` |
43 +---------------+--------------------+---------------------+------------------+-----------------+---------------+---------------+
44 | Path | ``path_name`` | ``segment_names`` | ``overlaps`` | | | |
45 +---------------+--------------------+---------------------+------------------+-----------------+---------------+---------------+
46
47 GFA2 field names
48 ^^^^^^^^^^^^^^^^
49
50 +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+
51 | Record Type | Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 | Field 7 | Field 8 |
52 +===============+===========+================+================+=============+=============+=============+=================+=================+
53 | Segment | ``sid`` | ``slen`` | ``sequence`` | | | | | |
54 +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+
55 | Edge | ``eid`` | ``sid1`` | ``sid2`` | ``beg1`` | ``end1`` | ``beg2`` | ``end2`` | ``alignment`` |
56 +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+
57 | Fragment | ``sid`` | ``external`` | ``s_beg`` | ``s_end`` | ``f_beg`` | ``f_end`` | ``alignment`` | |
58 +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+
59 | Gap | ``gid`` | ``sid1`` | ``d1`` | ``d2`` | ``sid2`` | ``disp`` | ``var`` | |
60 +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+
61 | Set | ``pid`` | ``items`` | | | | | | |
62 +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+
63 | Path | ``pid`` | ``items`` | | | | | | |
64 +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+
65
66 Datatypes
67 ~~~~~~~~~
68
69 The datatype of each positional field is described in the specification
70 and cannot be changed (differently from tags). Here is a short
71 description of the Python classes used to represent data for different
72 datatypes.
73
74 Placeholders
75 ^^^^^^^^^^^^
76
77 The positional fields in GFA can never be empty. However, there are some
78 fields with optional values. If a value is not specified, a placeholder
79 character is used instead (``*``). Such undefined values are represented
80 in Gfapy by the `Placeholder` class, which is described more in
81 detail in the :ref:`placeholders` chapter.
82
83 Arrays
84 ^^^^^^
85
86 The ``items`` field in unordered and ordered groups and the
87 ``segment_names`` and ``overlaps`` fields in paths are lists of objects
88 and are represented by list instances.
89
90 .. doctest::
91
92 >>> set = gfapy.Line("U\t*\t1 A 2")
93 >>> type(set.items)
94 <class 'list'>
95 >>> gfa2_path = gfapy.Line("O\t*\tA+ B-")
96 >>> type(gfa2_path.items)
97 <class 'list'>
98 >>> gfa1_path = gfapy.Line("P\tp1\tA+,B-\t10M,9M1D1M")
99 >>> type(gfa1_path.segment_names)
100 <class 'list'>
101 >>> type(gfa1_path.overlaps)
102 <class 'list'>
103
104 Orientations
105 ^^^^^^^^^^^^
106
107 Orientations are represented by strings. The ``gfapy.invert()`` method
108 applied to an orientation string returns the other orientation.
109
110 .. doctest::
111
112 >>> gfapy.invert("+")
113 '-'
114 >>> gfapy.invert("-")
115 '+'
116
117 Identifiers
118 ^^^^^^^^^^^
119
120 The identifier of the line itself (available for S, P, E, G, U, O lines)
121 can always be accessed in Gfapy using the ``name`` alias and is
122 represented in Gfapy by a string. If it is optional (E, G, U, O lines)
123 and not specified, it is represented by a Placeholder instance. The
124 fragment identifier is also a string.
125
126 Identifiers which refer to other lines are also present in some line
127 types (L, C, E, G, U, O, F). These are never placeholders and in
128 stand-alone lines are represented by strings. In connected lines they
129 are references to the Line instances to which they refer to (see the
130 :ref:`references` chapter).
131
132 Oriented identifiers
133 ^^^^^^^^^^^^^^^^^^^^
134
135 Oriented identifiers (e.g. ``segment_names`` in GFA1 paths) are
136 represented by elements of the class ``gfapy.OrientedLine``. The
137 ``segment`` method of the oriented segments returns the segment
138 identifier (or segment reference in connected path lines) and the
139 ``orient`` method returns the orientation string. The ``name`` method
140 returns the string of the segment, even if this is a reference to a
141 segment. A new oriented line can be created using the
142 ``OL[line, orientation]`` method.
143
144 Calling ``invert`` returns an oriented segment, with inverted
145 orientation. To set the two attributes the methods ``segment=`` and
146 ``orient=`` are available.
147
148 Examples:
149
150 .. doctest::
151
152 >>> p = gfapy.Line("P\tP1\ta+,b-\t*")
153 >>> p.segment_names
154 [gfapy.OrientedLine('a','+'), gfapy.OrientedLine('b','-')]
155 >>> sn0 = p.segment_names[0]
156 >>> sn0.line
157 'a'
158 >>> sn0.name
159 'a'
160 >>> sn0.orient
161 '+'
162 >>> sn0.invert()
163 >>> sn0
164 gfapy.OrientedLine('a','-')
165 >>> sn0.orient
166 '-'
167 >>> sn0.line = gfapy.Line('S\tX\t*')
168 >>> str(sn0)
169 'X-'
170 >>> sn0.name
171 'X'
172 >>> sn0 = gfapy.OrientedLine(gfapy.Line('S\tY\t*'), '+')
173
174 Sequences
175 ^^^^^^^^^
176
177 Sequences (S field sequence) are represented by strings in Gfapy.
178 Depending on the GFA version, the alphabet definition is more or less
179 restrictive. The definitions are correctly applied by the validation
180 methods.
181
182 The method ``rc()`` is provided to compute the reverse complement of a
183 nucleotidic sequence. The extended IUPAC alphabet is understood by the
184 method. Applied to non nucleotidic sequences, the results will be
185 meaningless:
186
187 .. doctest::
188
189 >>> from gfapy.sequence import rc
190 >>> rc("gcat")
191 'atgc'
192 >>> rc("*")
193 '*'
194 >>> rc("yatc")
195 'gatr'
196 >>> rc("gCat")
197 'atGc'
198 >>> rc("cag", rna=True)
199 'cug'
200
201 Integers and positions
202 ^^^^^^^^^^^^^^^^^^^^^^
203
204 The C lines ``pos`` field and the G lines ``disp`` and ``var`` fields
205 are represented by integers. The ``var`` field is optional, and thus can
206 be also a placeholder. Positions are 0-based coordinates.
207
208 The position fields of GFA2 E lines (``beg1, beg2, end1, end2``) and F
209 lines (``s_beg, s_end, f_beg, f_end``) contain a dollar string as suffix
210 if the position is equal to the segment length. For more information,
211 see the :ref:`positions` chapter.
212
213 Alignments
214 ^^^^^^^^^^
215
216 Alignments are always optional, ie they can be placeholders. If they are
217 specified they are CIGAR alignments or, only in GFA2, trace alignments.
218 For more details, see the :ref:`alignments` chapter.
219
220 GFA1 datatypes
221 ^^^^^^^^^^^^^^
222
223 +------------------------+---------------+--------------------------------+
224 | Datatype | Record Type | Fields |
225 +========================+===============+================================+
226 | Identifier | Segment | ``name`` |
227 +------------------------+---------------+--------------------------------+
228 | | Path | ``path_name`` |
229 +------------------------+---------------+--------------------------------+
230 | | Link | ``from_segment, to_segment`` |
231 +------------------------+---------------+--------------------------------+
232 | | Containment | ``from_segment, to_segment`` |
233 +------------------------+---------------+--------------------------------+
234 | [OrientedIdentifier] | Path | ``segment_names`` |
235 +------------------------+---------------+--------------------------------+
236 | Orientation | Link | ``from_orient, to_orient`` |
237 +------------------------+---------------+--------------------------------+
238 | | Containment | ``from_orient, to_orient`` |
239 +------------------------+---------------+--------------------------------+
240 | Sequence | Segment | ``sequence`` |
241 +------------------------+---------------+--------------------------------+
242 | Alignment | Link | ``overlap`` |
243 +------------------------+---------------+--------------------------------+
244 | | Containment | ``overlap`` |
245 +------------------------+---------------+--------------------------------+
246 | [Alignment] | Path | ``overlaps`` |
247 +------------------------+---------------+--------------------------------+
248 | Position | Containment | ``pos`` |
249 +------------------------+---------------+--------------------------------+
250
251 GFA2 datatypes
252 ^^^^^^^^^^^^^^
253
254 +------------------------+---------------+----------------------------------+
255 | Datatype | Record Type | Fields |
256 +========================+===============+==================================+
257 | Itentifier | Segment | ``sid`` |
258 +------------------------+---------------+----------------------------------+
259 | | Fragment | ``sid`` |
260 +------------------------+---------------+----------------------------------+
261 | OrientedIdentifier | Edge | ``sid1, sid2`` |
262 +------------------------+---------------+----------------------------------+
263 | | Gap | ``sid1, sid2`` |
264 +------------------------+---------------+----------------------------------+
265 | | Fragment | ``external`` |
266 +------------------------+---------------+----------------------------------+
267 | OptionalIdentifier | Edge | ``eid`` |
268 +------------------------+---------------+----------------------------------+
269 | | Gap | ``gid`` |
270 +------------------------+---------------+----------------------------------+
271 | | U Group | ``oid`` |
272 +------------------------+---------------+----------------------------------+
273 | | O Group | ``uid`` |
274 +------------------------+---------------+----------------------------------+
275 | [Identifier] | U Group | ``items`` |
276 +------------------------+---------------+----------------------------------+
277 | [OrientedIdentifier] | O Group | ``items`` |
278 +------------------------+---------------+----------------------------------+
279 | Sequence | Segment | ``sequence`` |
280 +------------------------+---------------+----------------------------------+
281 | Alignment | Edge | ``alignment`` |
282 +------------------------+---------------+----------------------------------+
283 | | Fragment | ``alignment`` |
284 +------------------------+---------------+----------------------------------+
285 | Position | Edge | ``beg1, end1, beg2, end2`` |
286 +------------------------+---------------+----------------------------------+
287 | | Fragment | ``s_beg, s_end, f_beg, f_end`` |
288 +------------------------+---------------+----------------------------------+
289 | Integer | Gap | ``disp, var`` |
290 +------------------------+---------------+----------------------------------+
291
292 Reading and writing positional fields
293 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
294
295 The ``positional_fieldnames`` method returns the list of the names (as
296 strings) of the positional fields of a line. The positional fields can
297 be read using a method on the Gfapy line object, which is called as the
298 field name. Setting the value is done with an equal sign version of the
299 field name method (e.g. segment.slen = 120). In alternative, the
300 ``set(fieldname, value)`` and ``get(fieldname)`` methods can also be
301 used.
302
303 .. doctest::
304
305 >>> s_gfa1 = gfapy.Line("S\t1\t*")
306 >>> s_gfa1.positional_fieldnames
307 ['name', 'sequence']
308 >>> s_gfa1.name
309 '1'
310 >>> s_gfa1.get("name")
311 '1'
312 >>> s_gfa1.name = "segment2"
313 >>> s_gfa1.name
314 'segment2'
315 >>> s_gfa1.set('name',"3")
316 >>> s_gfa1.name
317 '3'
318
319 When a field is read, the value is converted into an appropriate object.
320 The string representation of a field can be read using the
321 ``field_to_s(fieldname)`` method.
322
323 .. doctest::
324
325 >>> gfa = gfapy.Gfa()
326 >>> gfa.add_line("S\ts1\t*")
327 >>> gfa.add_line("L\ts1\t+\ts2\t-\t*")
328 >>> link = gfa.dovetails[0]
329 >>> str(link.from_segment)
330 'S\ts1\t*'
331 >>> link.field_to_s('from_segment')
332 's1'
333
334 When setting a non-string field, the user can specify the value of a tag
335 either as a Python non-string object, or as the string representation of
336 the value.
337
338 .. doctest::
339
340 >>> gfa = gfapy.Gfa(version='gfa1')
341 >>> gfa.add_line("C\ta\t+\tb\t-\t10\t*")
342 >>> c = gfa.containments[0]
343 >>> c.pos
344 10
345 >>> c.pos = 1
346 >>> c.pos
347 1
348 >>> c.pos = "2"
349 >>> c.pos
350 2
351 >>> c.field_to_s("pos")
352 '2'
353
354 Note that setting the value of reference and backreferences-related
355 fields is generally not allowed, when a line instance is connected to a
356 Gfa object (see the :ref:`references` chapter).
357
358 .. doctest::
359
360 >>> gfa = gfapy.Gfa(version='gfa1')
361 >>> l = gfapy.Line("L\ts1\t+\ts2\t-\t*")
362 >>> l.from_name
363 's1'
364 >>> l.from_segment = "s3"
365 >>> l.from_name
366 's3'
367 >>> gfa.add_line(l)
368 >>> l.from_segment = "s4"
369 Traceback (most recent call last):
370 ...
371 gfapy.error.RuntimeError: ...
372
373 Validation
374 ~~~~~~~~~~
375
376 The content of all positional fields must be a correctly formatted
377 string according to the rules given in the GFA specifications (or a
378 Python object whose string representation is a correctly formatted
379 string).
380
381 Depending on the validation level, more or less checks are done
382 automatically (see the :ref:`validation` chapter). Not regarding which
383 validation level is selected, the user can trigger a manual validation
384 using the ``validate_field(fieldname)`` method for a single field, or
385 using ``validate``, which does a full validation on the whole line,
386 including all positional fields.
387
388 .. doctest::
389
390 >>> line = gfapy.Line("H\txx:i:1")
391 >>> line.validate_field("xx")
392 >>> line.validate()
393
394 Aliases
395 ~~~~~~~
396
397 For some fields, aliases are defined, which can be used in all contexts
398 where the original field name is used (i.e. as parameter of a method,
399 and the same setter and getter methods defined for the original field
400 name are also defined for each alias, see below).
401
402 .. doctest::
403
404 >>> gfa1_path = gfapy.Line("P\tX\t1-,2+,3+\t*")
405 >>> gfa1_path.name == gfa1_path.path_name
406 True
407 >>> edge = gfapy.Line("E\t*\tA+\tB-\t0\t10\t90\t100$\t*")
408 >>> edge.eid == edge.name
409 True
410 >>> containment = gfapy.Line("C\tA\t+\tB\t-\t10\t*")
411 >>> containment.from_segment == containment.container
412 True
413 >>> segment = gfapy.Line("S\t1\t*")
414 >>> segment.sid == segment.name
415 True
416 >>> segment.sid
417 '1'
418 >>> segment.name = '2'
419 >>> segment.sid
420 '2'
421
422 Name
423 ^^^^
424
425 Different record types have an identifier field: segments (name in GFA1,
426 sid in GFA2), paths (path\_name), edge (eid), fragment (sid), gap (gid),
427 groups (pid).
428
429 All these fields are aliased to ``name``. This allows the user for
430 example to set the identifier of a line using the ``name=(value)``
431 method using the same syntax for different record types (segments,
432 edges, paths, fragments, gaps and groups).
433
434 Version-specific field names
435 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
436
437 For segments the GFA1 name and the GFA2 sid are equivalent fields. For
438 this reason an alias ``sid`` is defined for GFA1 segments and ``name``
439 for GFA2 segments.
440
441 Crypical field names
442 ^^^^^^^^^^^^^^^^^^^^
443
444 The definition of from and to for containments is somewhat cryptic.
445 Therefore following aliases have been defined for containments:
446 container[\_orient] for from[\_\|segment\|orient]; contained[\_orient]
447 for to[\_segment\|orient].
+0
-75
doc/tutorial/positions.rst less more
0 .. testsetup:: *
1
2 import gfapy
3
4 .. _positions:
5
6 Positions
7 ---------
8
9 The only position field in GFA1 is the ``pos`` field in the C lines.
10 This represents the starting position of the contained segment in the
11 container segment and is 0-based.
12
13 Some fields in GFA2 E lines (``beg1, beg2, end1, end2``) and F lines
14 (``s_beg, s_end, f_beg, f_end``) are positions. According to the
15 specification, they are 0-based and represent virtual ticks before and
16 after each string in the sequence. Thus ranges are represented similarly
17 to the Python range conventions: e.g. a 1-character prefix of a sequence
18 will have begin 0 and end 1.
19
20 Last positions in GFA2
21 ~~~~~~~~~~~~~~~~~~~~~~
22
23 The GFA2 positions must contain an additional string (``$``) appended to
24 the integer, if (and only if) they are the last position in the segment
25 sequence. These particular positions are represented in Gfapy as
26 instances of the class :class:`~gfapy.lastpos.LastPos`.
27
28 To create a lastpos instance, the constructor can be used with an
29 integer, or the string representation (which must end with the dollar
30 sign, otherwise an integer is returned):
31
32 .. doctest::
33
34 >>> str(gfapy.LastPos(12))
35 '12$'
36 >>> gfapy.LastPos("12")
37 12
38 >>> str(gfapy.LastPos("12"))
39 '12'
40 >>> gfapy.LastPos("12$")
41 gfapy.LastPos(12)
42 >>> str(gfapy.LastPos("12$"))
43 '12$'
44
45 Subtracting an integer from a lastpos returns a lastpos if 0 subtracted,
46 an integer otherwise. This allows to do some arithmetic on positions
47 without making them invalid.
48
49 .. doctest::
50
51 >>> gfapy.LastPos(12) - 0
52 gfapy.LastPos(12)
53 >>> gfapy.LastPos(12) - 1
54 11
55
56 The functions :func:`~gfapy.lastpos.islastpos` and
57 :func:`~gfapy.lastpos.isfirstpos` allow to
58 determine if a position value is 0 (first), or the last position, using
59 the same syntax for lastpos and integer instances.
60
61 .. doctest::
62
63 >>> gfapy.isfirstpos(0)
64 True
65 >>> gfapy.islastpos(0)
66 False
67 >>> gfapy.isfirstpos(12)
68 False
69 >>> gfapy.islastpos(12)
70 False
71 >>> gfapy.islastpos(gfapy.LastPos("12"))
72 False
73 >>> gfapy.islastpos(gfapy.LastPos("12$"))
74 True
+0
-443
doc/tutorial/references.rst less more
0 .. testsetup:: *
1
2 import gfapy
3 gfa = gfapy.Gfa()
4
5 .. _references:
6
7 References
8 ----------
9
10 Some fields in GFA lines contain identifiers or lists of identifiers
11 (sometimes followed by orientation strings), which reference other lines
12 of the GFA file. In Gfapy it is possible to follow these references and
13 traverse the graph.
14
15 Connecting a line to a Gfa object
16 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
17
18 In stand-alone line instances, the identifiers which reference other
19 lines are either strings containing the line name, pairs of strings
20 (name and orientation) in a ``gfapy.OrientedLine`` object, or lists of
21 lines names or ``gfapy.OrientedLine`` objects.
22
23 Using the ``add_line(line)`` (alias: ``append(line)``) method of the
24 ``gfapy.Gfa`` object, or the equivalent ``connect(gfa)`` method of the
25 gfapy.Line instance, a line is added to a Gfa instance (this is done
26 automatically when a GFA file is parsed). All strings expressing
27 references are then changed into references to the corresponding line
28 objects. The method ``is_connected()`` allows to determine if a line is
29 connected to a gfapy instance. The read-only property ``gfa`` contains
30 the ``gfapy.Gfa`` instance to which the line is connected.
31
32 .. doctest::
33
34 >>> gfa = gfapy.Gfa(version='gfa1')
35 >>> link = gfapy.Line("L\tA\t-\tB\t+\t20M")
36 >>> link.is_connected()
37 False
38 >>> link.gfa is None
39 True
40 >>> type(link.from_segment)
41 <class 'str'>
42 >>> gfa.append(link)
43 >>> link.is_connected()
44 True
45 >>> link.gfa #doctest: +ELLIPSIS
46 <gfapy.gfa.Gfa object at ...>
47 >>> type(link.from_segment)
48 <class 'gfapy.line.segment.gfa1.GFA1'>
49
50 References for each record type
51 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
52
53 The following tables describes the references contained in each record
54 type. The notation ``[]`` represent lists.
55
56 GFA1
57 ^^^^
58
59 +---------------+-------------------+---------------------------+
60 | Record type | Fields | Type of reference |
61 +===============+===================+===========================+
62 | Link | from, to | Segment |
63 +---------------+-------------------+---------------------------+
64 | Containment | from, to | Segment |
65 +---------------+-------------------+---------------------------+
66 | Path | segment\_names, | [OrientedLine(Segment)] |
67 +---------------+-------------------+---------------------------+
68 | | links (1) | [OrientedLine(Link)] |
69 +---------------+-------------------+---------------------------+
70
71 (1): paths contain information in the fields segment\_names and
72 overlaps, which allow to find the identify from which they depend; these
73 links can be retrieved using ``links`` (which is not a field).
74
75 GFA2
76 ^^^^
77
78 +---------------+--------------+------------------------------------+
79 | Record type | Fields | Type of reference |
80 +===============+==============+====================================+
81 | Edge | sid1, sid2 | Segment |
82 +---------------+--------------+------------------------------------+
83 | Gap | sid1, sid2 | Segment |
84 +---------------+--------------+------------------------------------+
85 | Fragment | sid | Segment |
86 +---------------+--------------+------------------------------------+
87 | Set | items | [Edge/Set/Path/Segment] |
88 +---------------+--------------+------------------------------------+
89 | Path | items | [OrientedLine(Edge/Set/Segment)] |
90 +---------------+--------------+------------------------------------+
91
92 Backreferences for each record type
93 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
94
95 When a line containing a reference to another line is connected to a Gfa
96 object, backreferences to it are created in the targeted line.
97
98 For each backreference collection a read-only property exist, which is
99 named as the collection (e.g. ``dovetails_L`` for segments). Note that
100 the reference list returned by these arrays are read-only and editing
101 the references is done using other methods (see the section "Editing
102 reference fields" below).
103
104 .. code:: python
105
106 segment.dovetails_L # => [gfapy.line.edge.Link(...), ...]
107
108 The following tables describe the backreferences collections for each
109 record type.
110
111 GFA1
112 ^^^^
113
114 +---------------+-------------------------+
115 | Record type | Backreferences |
116 +===============+=========================+
117 | Segment | dovetails\_L |
118 +---------------+-------------------------+
119 | | dovetails\_R |
120 +---------------+-------------------------+
121 | | edges\_to\_contained |
122 +---------------+-------------------------+
123 | | edges\_to\_containers |
124 +---------------+-------------------------+
125 | | paths |
126 +---------------+-------------------------+
127 | Link | paths |
128 +---------------+-------------------------+
129
130 GFA2
131 ^^^^
132
133 +---------------+-------------------------+--------+
134 | Record type | Backreferences | Type |
135 +===============+=========================+========+
136 | Segment | dovetails\_L | E |
137 +---------------+-------------------------+--------+
138 | | dovetails\_R | E |
139 +---------------+-------------------------+--------+
140 | | edges\_to\_contained | E |
141 +---------------+-------------------------+--------+
142 | | edges\_to\_containers | E |
143 +---------------+-------------------------+--------+
144 | | internals | E |
145 +---------------+-------------------------+--------+
146 | | gaps\_L | G |
147 +---------------+-------------------------+--------+
148 | | gaps\_R | G |
149 +---------------+-------------------------+--------+
150 | | fragments | F |
151 +---------------+-------------------------+--------+
152 | | paths | O |
153 +---------------+-------------------------+--------+
154 | | sets | U |
155 +---------------+-------------------------+--------+
156 | Edge | paths | O |
157 +---------------+-------------------------+--------+
158 | | sets | U |
159 +---------------+-------------------------+--------+
160 | O Group | paths | O |
161 +---------------+-------------------------+--------+
162 | | sets | U |
163 +---------------+-------------------------+--------+
164 | U Group | sets | U |
165 +---------------+-------------------------+--------+
166
167 Segment backreference convenience methods
168 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
169
170 For segments, additional methods are available which combine in
171 different way the backreferences information. The
172 `dovetails_of_end` and `gaps_of_end` methods take an
173 argument ``L`` or ``R`` and return the dovetails overlaps (or gaps) of the
174 left or, respectively, right end of the segment sequence
175 (equivalent to the segment properties ``dovetails_L``/``dovetails_R`` and
176 ``gaps_L``/``gaps_R``).
177
178 The segment ``containments`` property is a list of both containments where the
179 segment is the container or the contained segment. The segment ``edges``
180 property is a list of all edges (dovetails, containments and internals)
181 with a reference to the segment.
182
183 Other methods directly compute list of segments from the edges lists
184 mentioned above. The ``neighbours_L``, ``neighbours_R`` properties and
185 the `neighbours` method compute the set of segment instances which are
186 connected by dovetails to the segment.
187 The ``containers`` and ``contained``
188 properties similarly compute the set of segment instances which,
189 respectively, contains the segment, or are contained in the segment.
190
191 .. doctest::
192
193 >>> gfa = gfapy.Gfa()
194 >>> gfa.append('S\tA\t*')
195 >>> s = gfa.segment('A')
196 >>> gfa.append('S\tB\t*')
197 >>> gfa.append('S\tC\t*')
198 >>> gfa.append('L\tA\t-\tB\t+\t*')
199 >>> gfa.append('C\tA\t+\tC\t+\t10\t*')
200 >>> [str(l) for l in s.dovetails_of_end("L")]
201 ['L\tA\t-\tB\t+\t*']
202 >>> s.dovetails_L == s.dovetails_of_end("L")
203 True
204 >>> s.gaps_of_end("R")
205 []
206 >>> [str(e) for e in s.edges]
207 ['L\tA\t-\tB\t+\t*', 'C\tA\t+\tC\t+\t10\t*']
208 >>> [str(n) for n in s.neighbours_L]
209 ['S\tB\t*']
210 >>> s.containers
211 []
212 >>> [str(c) for c in s.contained]
213 ['S\tC\t*']
214
215 Multiline group definitions
216 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
217
218 The GFA2 specification opens the possibility (experimental) to define
219 groups on multiple lines, by using the same ID for each line defining
220 the group. This is supported by gfapy.
221
222 This means that if multiple `Ordered` or
223 `Unordered` instances connected to a Gfa object have
224 the same ``gid``, they are merged into a single instance (technically
225 the last one getting added to the graph object). The items list are
226 merged.
227
228 The tags of multiple line defining a group shall not contradict each
229 other (i.e. either are the tag names on different lines defining the
230 group all different, or, if the same tag is present on different lines,
231 the value and datatype must be the same, in which case the multiple
232 definition will be ignored).
233
234 .. doctest::
235
236 >>> gfa = gfapy.Gfa()
237 >>> gfa.add_line("U\tu1\ts1 s2 s3")
238 >>> [s.name for s in gfa.sets[-1].items]
239 ['s1', 's2', 's3']
240 >>> gfa.add_line('U\tu1\t4 5')
241 >>> [s.name for s in gfa.sets[-1].items]
242 ['s1', 's2', 's3', '4', '5']
243
244 Induced set and captured path
245 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
246
247 The item list in GFA2 sets and paths may not contain elements which are
248 implicitly involved. For example a path may contain segments, without
249 specifying the edges connecting them, if there is only one such edge.
250 Alternatively a path may contain edges, without explicitly indicating the
251 segments. Similarly a set may contain edges, but not the segments
252 referred to in them, or contain segments which are connected by edges,
253 without the edges themselves. Furthermore groups may refer to other
254 groups (set to sets or paths, paths to paths only), which then
255 indirectly contain references to segments and edges.
256
257 Gfapy provides methods for the computation of the sets of segments and
258 edges which are implied by an ordered or unordered group. Thereby all
259 references to subgroups are resolved and implicit elements are added, as
260 described in the specification. The computation can, therefore, only be
261 applied to connected lines. For unordered groups, this computation is
262 provided by the method ``induced_set()``, which returns an array of
263 segment and edge instances. For ordered group, the computation is
264 provided by the method ``captured_path()``, which returns a list of
265 ``gfapy.OrientedLine`` instances, alternating segment and edge instances
266 (and starting and ending in segments).
267
268 The methods ``induced_segments_set()``, ``induced_edges_set()``,
269 ``captured_segments()`` and ``captured_edges()`` return, respectively,
270 the list of only segments or edges, in ordered or unordered groups.
271
272 .. doctest::
273
274 >>> gfa = gfapy.Gfa()
275 >>> gfa.add_line("S\ts1\t100\t*")
276 >>> gfa.add_line("S\ts2\t100\t*")
277 >>> gfa.add_line("S\ts3\t100\t*")
278 >>> gfa.add_line("E\te1\ts1+\ts2-\t0\t10\t90\t100$\t*")
279 >>> gfa.add_line("U\tu1\ts1 s2 s3")
280 >>> u = gfa.sets[-1]
281 >>> [l.name for l in u.induced_edges_set]
282 ['e1']
283 >>> [l.name for l in u.induced_segments_set ]
284 ['s1', 's2', 's3']
285 >>> [l.name for l in u.induced_set ]
286 ['s1', 's2', 's3', 'e1']
287
288 Disconnecting a line from a Gfa object
289 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
290
291 Lines can be disconnected using the ``rm(line)`` method of the
292 ``gfapy.Gfa`` object or the ``disconnect()`` method of the line
293 instance.
294
295 .. doctest::
296
297 >>> gfa = gfapy.Gfa()
298 >>> gfa.append('S\tsA\t*')
299 >>> gfa.append('S\tsB\t*')
300 >>> line = gfa.segment("sA")
301 >>> gfa.segment_names
302 ['sA', 'sB']
303 >>> gfa.rm(line)
304 >>> gfa.segment_names
305 ['sB']
306 >>> line = gfa.segment('sB')
307 >>> line.disconnect()
308 >>> gfa.segment_names
309 []
310
311 Disconnecting a line affects other lines as well. Lines which are
312 dependent on the disconnected line are disconnected as well. Any other
313 reference to disconnected lines is removed as well. In the disconnected
314 line, references to lines are transformed back to strings and
315 backreferences are deleted.
316
317 The following tables show which dependent lines are disconnected if they
318 refer to a line which is being disconnected.
319
320 GFA1
321 ^^^^
322
323 +---------------+---------------------------------+
324 | Record type | Dependent lines |
325 +===============+=================================+
326 | Segment | links (+ paths), containments |
327 +---------------+---------------------------------+
328 | Link | paths |
329 +---------------+---------------------------------+
330
331 GFA2
332 ^^^^
333
334 +---------------+---------------------------------------+
335 | Record type | Dependent lines |
336 +===============+=======================================+
337 | Segment | edges, gaps, fragments, sets, paths |
338 +---------------+---------------------------------------+
339 | Edge | sets, paths |
340 +---------------+---------------------------------------+
341 | Sets | sets, paths |
342 +---------------+---------------------------------------+
343
344 Editing reference fields
345 ~~~~~~~~~~~~~~~~~~~~~~~~
346
347 In connected line instances, it is not allowed to directly change the
348 content of fields containing references to other lines, as this would
349 make the state of the Gfa object invalid.
350
351 Besides the fields containing references, some other fields are
352 read-only in connected lines. Changing some of the fields would require
353 moving the backreferences to other collections (position fields of edges
354 and gaps, ``from_orient`` and ``to_orient`` of links). The overlaps
355 field of connected links is readonly as it may be necessary to identify
356 the link in paths.
357
358 Renaming an element
359 ^^^^^^^^^^^^^^^^^^^
360
361 The name field of a line (e.g. segment ``name``/``sid``) is not a
362 reference and thus can be edited also in connected lines. When the name
363 of the line is changed, no manual editing of references (e.g. from/to
364 fields in links) is necessary, as all lines which refer to the line will
365 still refer to the same instance. The references to the instance in the
366 Gfa lines collections will be automatically updated. Also, the new name
367 will be correctly used when converting to string, such as when the Gfa
368 instance is written to a GFA file.
369
370 Renaming a line to a name which already exists has the same effect of
371 adding a line with that name. That is, in most cases,
372 ``gfapy.NotUniqueError`` is raised. An exception are GFA2 sets and
373 paths: in this case the line will be appended to the existing line with
374 the same name (as described in "Multiline group definitions").
375
376 Adding and removing group elements
377 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
378
379 Elements of GFA2 groups can be added and removed from both connected and
380 non-connected lines, using the following methods.
381
382 To add an item to or remove an item from an unordered group, use the
383 methods ``add_item(item)`` and ``rm_item(item)``, which take as argument
384 either a string (identifier) or a line instance.
385
386 To append or prepend an item to an ordered group, use the methods
387 ``append_item(item)`` and ``prepend_item(item)``. To remove the first or
388 the last item of an ordered group use the methods ``rm_first_item()``
389 and ``rm_last_item()``.
390
391 Editing read-only fields of connected lines
392 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
393
394 Editing the read-only information of edges, gaps, links, containments,
395 fragments and paths is more complicated. These lines shall be
396 disconnected before the edit and connected again to the Gfa object after
397 it. Before disconnecting a line, you should check if there are other
398 lines dependent on it (see tables above). If so, you will have to
399 disconnect these lines first, eventually update their fields and
400 reconnect them at the end of the operation.
401
402 Virtual lines
403 ~~~~~~~~~~~~~
404
405 The order of the lines in GFA is not prescribed. Therefore, during
406 parsing, or constructing a Gfa in memory, it is possible that a line is
407 referenced to, before it is added to the Gfa instance. Whenever this
408 happens, Gfapy creates a "virtual" line instance.
409
410 Users do not have to handle with virtual lines, if they work with
411 complete and valid GFA files.
412
413 Virtual lines are similar to normal line instances, with some
414 limitations (they contain only limited information and it is not allowed
415 to add tags to them). To check if a line is a virtual line, one can use
416 the ``virtual`` property of the line.
417
418 As soon as the parser founds the real line corresponding to a previously
419 introduced virtual line, the virtual line is exchanged with the real
420 line and all references are corrected to point to the real line.
421
422 .. doctest::
423
424 >>> g = gfapy.Gfa()
425 >>> g.add_line("S\t1\t*")
426 >>> g.add_line("L\t1\t+\t2\t+\t*")
427 >>> l = g.dovetails[0]
428 >>> g.segment("1").virtual
429 False
430 >>> g.segment("2").virtual
431 True
432 >>> l.to_segment == g.segment("2")
433 True
434 >>> g.segment("2").dovetails == [l]
435 True
436 >>> g.add_line("S\t2\t*")
437 >>> g.segment("2").virtual
438 False
439 >>> l.to_segment == g.segment("2")
440 True
441 >>> g.segment("2").dovetails == [l]
442 True
+0
-36
doc/tutorial/rgfa.rst less more
0 .. testsetup:: *
1
2 import gfapy
3
4 .. _rgfa:
5
6 rGFA
7 ----
8
9 rGFA (https://github.com/lh3/gfatools/blob/master/doc/rGFA.md)
10 is a subset of GFA1, in which only particular line types (S and L)
11 are allowed, and the S lines are required to contain the tags
12 `SN` (of type `Z`), `SO` and `SR` (of type `i`).
13
14 When working with rGFA files, it is convenient to use the `dialect="rgfa"`
15 option in the constructor `Gfa()` and in
16 func:`Gfa.from_file() <gfapy.gfa.Gfa.from_file>`.
17
18 This ensures that additional validations are performed: GFA version must be 1,
19 only rGFA-compatible lines (S,L) are allowed and that the required tags are
20 required (with the correct datatype). The validations can also be executed
21 manually using `Gfa.validate_rgfa() <gfapy.gfa.Gfa.validate_rgfa>`.
22
23 Furthermore, the `stable_sequence_names` attribute of the GFA objects
24 returns the set of stable sequence names contained in the `SN` tags
25 of the segments.
26
27 .. doctest::
28
29 >>> g = gfapy.Gfa("S\tS1\tCTGAA\tSN:Z:chr1\tSO:i:0\tSR:i:0", dialect="rgfa")
30 >>> g.segment_names
31 ['S1']
32 >>> g.stable_sequence_names
33 ['chr1']
34 >>> g.add_line("S\tS2\tACG\tSN:Z:chr1\tSO:i:5\tSR:i:0")
35
+0
-436
doc/tutorial/tags.rst less more
0 .. testsetup:: *
1
2 import gfapy
3 gfa = gfapy.Gfa()
4
5 .. _tags:
6
7 Tags
8 ----
9
10 Each record in GFA can contain tags. Tags are fields which consist in a
11 tag name, a datatype and data. The format is ``NN:T:DATA`` where ``NN``
12 is a two-letter tag name, ``T`` is a one-letter datatype string and
13 ``DATA`` is a string representing the data according to the specified
14 datatype. Tag names must be unique for each line, i.e. each line may
15 only contain a tag once.
16
17 ::
18
19 # Examples of GFA tags of different datatypes:
20 "aa:i:-12"
21 "bb:f:1.23"
22 "cc:Z:this is a string"
23 "dd:A:X"
24 "ee:B:c,12,3,2"
25 "ff:H:122FA0"
26 'gg:J:["A","B"]'
27
28 Custom tags
29 ~~~~~~~~~~~
30
31 Some tags are explicitly defined in the specification (these are named
32 *predefined tags* in Gfapy), and the user or an application can define
33 its own custom tags. These may contain lower case letters.
34
35 Custom tags are user or program specific and may of course collide with
36 the tags used by other users or programs. For this reasons, if you write
37 scripts which employ custom tags, you should always check that the
38 values are of the correct datatype and plausible.
39
40 .. doctest::
41
42 >>> line = gfapy.Line("H\txx:i:2")
43 >>> if line.get_datatype("xx") != "i":
44 ... raise Exception("I expected the tag xx to contain an integer!")
45 >>> myvalue = line.xx
46 >>> if (myvalue > 120) or (myvalue % 2 == 1):
47 ... raise Exception("The value in the xx tag is not an even value <= 120")
48 >>> # ... do something with myvalue
49
50 Also it is good practice to allow the user of the script to change the
51 name of the custom tags. For example, Gfapy employs the +or+ custom tag
52 to track the original segment from which a segment in the final graph is
53 derived. All methods which read or write the +or+ tag allow to specify
54 an alternative tag name to use instead of +or+, for the case that this
55 name collides with the custom tag of another program.
56
57 .. code:: python
58
59 # E.g. a method which does something with myvalue, usually stored in tag xx
60 # allows the user to specify an alternative name for the tag
61 def mymethod(line, mytag="xx"):
62 myvalue = line.get(mytag)
63 # ...
64
65 Predefined tags
66 ~~~~~~~~~~~~~~~
67
68 According to the GFA specifications, predefined tag names consist of either
69 two upper case letters, or an upper case letter followed by a digit.
70 The GFA1 specification predefines tags for each line type, while GFA2
71 only predefines tags for the header and edges.
72
73 While tags with the predefined names are allowed to be added to any line,
74 when they are used in the lines mentiones in the specification (e.g. `VN`
75 in the header) gfapy checks that the datatype is the one prescribed by
76 the specification (e.g. `VN` must be of type `Z`). It is not forbidden
77 to use the same tags in other contexts, but in this case, the datatype
78 restriction is not enforced.
79
80 +------------+------------+-----------------------+
81 | Tag | Type | Line types | GFA version |
82 +============+============+=======================+
83 | VN | Z | H | 1,2 |
84 +-----+------+------------+-----------------------+
85 | TS | i | H,S | 2 |
86 +-----+------+------------+-----------------------+
87 | LN | i | S | 1 |
88 +-----+------+------------+-----------------------+
89 | RC | i | S,L,C | 1 |
90 +-----+------+------------+-----------------------+
91 | FC | i | S,L | 1 |
92 +-----+------+------------+-----------------------+
93 | KC | i | S,L | 1 |
94 +-----+------+------------+-----------------------+
95 | SH | H | S | 1 |
96 +-----+------+------------+-----------------------+
97 | UR | Z | S | 1 |
98 +-----+------+------------+-----------------------+
99 | MQ | i | L | 1 |
100 +-----+------+------------+-----------------------+
101 | NM | i | L,i | 1 |
102 +-----+------+------------+-----------------------+
103 | ID | Z | L,C | 1 |
104 +-----+------+------------+-----------------------+
105
106 ::
107
108 "VN:Z:1.0" # VN => predefined tag
109 "z5:Z:1.0" # z5 first char is downcase => custom tag
110 "XX:Z:aaa" # XX upper case, but not predefined => custom tag
111
112 # not forbidden, but not recommended:
113 "zZ:Z:1.0" # => mixed case, first char downcase => custom tag
114 "Zz:Z:1.0" # => mixed case, first char upcase => custom tag
115 "vn:Z:1.0" # => same name as predefined tag, but downcase => custom tag
116
117 Datatypes
118 ~~~~~~~~~
119
120 The following table summarizes the datatypes available for tags:
121
122 +----------+-----------------+---------------------------+----------------------+
123 | Symbol | Datatype | Example | Python class |
124 +==========+=================+===========================+======================+
125 | Z | string | This is a string | str |
126 +----------+-----------------+---------------------------+----------------------+
127 | i | integer | -12 | int |
128 +----------+-----------------+---------------------------+----------------------+
129 | f | float | 1.2E-5 | float |
130 +----------+-----------------+---------------------------+----------------------+
131 | A | char | X | str |
132 +----------+-----------------+---------------------------+----------------------+
133 | J | JSON | [1,{"k1":1,"k2":2},"a"] | list/dict |
134 +----------+-----------------+---------------------------+----------------------+
135 | B | numeric array | f,1.2,13E-2,0 | gfapy.NumericArray |
136 +----------+-----------------+---------------------------+----------------------+
137 | H | byte array | FFAA01 | gfapy.ByteArray |
138 +----------+-----------------+---------------------------+----------------------+
139
140 Validation
141 ~~~~~~~~~~
142
143 The tag names must consist of a letter and a digit or two letters.
144
145 ::
146
147 "KC:i:1" # => OK
148 "xx:i:1" # => OK
149 "x1:i:1" # => OK
150 "xxx:i:1" # => error: name is too long
151 "x:i:1" # => error: name is too short
152 "11:i:1" # => error: at least one letter must be present
153
154 The datatype must be one of the datatypes specified above. For
155 predefined tags, Gfapy also checks that the datatype given in the
156 specification is used.
157
158 ::
159
160 "xx:X:1" # => error: datatype X is unknown
161 "VN:i:1" # => error: VN must be of type Z
162
163 The data must be a correctly formatted string for the specified datatype
164 or a Python object whose string representation is a correctly formatted
165 string.
166
167 .. doctest::
168
169 # current value: xx:i:2
170 >>> line = gfapy.Line("S\tA\t*\txx:i:2")
171 >>> line.xx = 1
172 >>> line.xx
173 1
174 >>> line.xx = "3"
175 >>> line.xx
176 3
177 >>> line.xx = "A"
178 >>> line.xx
179 Traceback (most recent call last):
180 ...
181 gfapy.error.FormatError: ...
182
183 Depending on the validation level, more or less checks are done
184 automatically (see :ref:`validation` chapter). Per default - validation level
185 (1) - validation is performed only during parsing or accessing values
186 the first time, therefore the user must perform a manual validation if
187 he changes values to something which is not guaranteed to be correct. To
188 trigger a manual validation, the user can call the method
189 ``validate_field(fieldname)`` to validate a single tag, or
190 ``validate()`` to validate the whole line, including all tags.
191
192 .. doctest::
193
194 >>> line = gfapy.Line("S\tA\t*\txx:i:2", vlevel = 0)
195 >>> line.validate_field("xx")
196 >>> line.validate()
197 >>> line.xx = "A"
198 >>> line.validate_field("xx")
199 Traceback (most recent call last):
200 ...
201 gfapy.error.FormatError: ...
202 >>> line.validate()
203 Traceback (most recent call last):
204 ...
205 gfapy.error.FormatError: ...
206 >>> line.xx = "3"
207 >>> line.validate_field("xx")
208 >>> line.validate()
209
210 Reading and writing tags
211 ~~~~~~~~~~~~~~~~~~~~~~~~
212
213 Tags can be read using a property on the Gfapy line object, which is
214 called as the tag (e.g. line.xx). A special version of the property
215 prefixed by ``try_get_`` raises an error if the tag was not available
216 (e.g. ``line.try_get_LN``), while the tag property (e.g. ``line.LN``)
217 would return ``None`` in this case. Setting the value is done assigning
218 a value to it the tag name method (e.g. ``line.TS = 120``). In
219 alternative, the ``set(fieldname, value)``, ``get(fieldname)`` and
220 ``try_get(fieldname)`` methods can also be used. To remove a tag from a
221 line, use the ``delete(fieldname)`` method, or set its value to
222 ``None``. The ``tagnames`` property Line instances is a list of
223 the names (as strings) of all defined tags for a line.
224
225
226 .. doctest::
227
228 >>> line = gfapy.Line("S\tA\t*\txx:i:1", vlevel = 0)
229 >>> line.xx
230 1
231 >>> line.xy is None
232 True
233 >>> line.try_get_xx()
234 1
235 >>> line.try_get_xy()
236 Traceback (most recent call last):
237 ...
238 gfapy.error.NotFoundError: ...
239 >>> line.get("xx")
240 1
241 >>> line.try_get("xy")
242 Traceback (most recent call last):
243 ...
244 gfapy.error.NotFoundError: ...
245 >>> line.xx = 2
246 >>> line.xx
247 2
248 >>> line.xx = "a"
249 >>> line.tagnames
250 ['xx']
251 >>> line.xy = 2
252 >>> line.xy
253 2
254 >>> line.set("xy", 3)
255 >>> line.get("xy")
256 3
257 >>> line.tagnames
258 ['xx', 'xy']
259 >>> line.delete("xy")
260 3
261 >>> line.xy is None
262 True
263 >>> line.xx = None
264 >>> line.xx is None
265 True
266 >>> line.try_get("xx")
267 Traceback (most recent call last):
268 ...
269 gfapy.error.NotFoundError: ...
270 >>> line.tagnames
271 []
272
273 When a tag is read, the value is converted into an appropriate object
274 (see Python classes in the datatype table above). When setting a value,
275 the user can specify the value of a tag either as a Python object, or as
276 the string representation of the value.
277
278 .. doctest::
279
280 >>> line = gfapy.Line('H\txx:i:1\txy:Z:TEXT\txz:J:["a","b"]')
281 >>> line.xx
282 1
283 >>> isinstance(line.xx, int)
284 True
285 >>> line.xy
286 'TEXT'
287 >>> isinstance(line.xy, str)
288 True
289 >>> line.xz
290 ['a', 'b']
291 >>> isinstance(line.xz, list)
292 True
293
294 The string representation of a tag can be read using the
295 ``field_to_s(fieldname)`` method. The default is to only output the
296 content of the field. By setting \`\`tag: true\`\`\`, the entire tag is
297 output (name, datatype, content, separated by colons). An exception is
298 raised if the field does not exist.
299
300 .. doctest::
301
302 >>> line = gfapy.Line("H\txx:i:1")
303 >>> line.xx
304 1
305 >>> line.field_to_s("xx")
306 '1'
307 >>> line.field_to_s("xx", tag=True)
308 'xx:i:1'
309
310 Datatype of custom tags
311 ~~~~~~~~~~~~~~~~~~~~~~~
312
313 The datatype of an existing custom field (but not of predefined fields)
314 can be changed using the ``set_datatype(fieldname, datatype)`` method.
315 The current datatype specification can be read using
316 ``get_datatype(fieldname)``.
317
318 .. doctest::
319
320 >>> line = gfapy.Line("H\txx:i:1")
321 >>> line.get_datatype("xx")
322 'i'
323 >>> line.set_datatype("xx", "Z")
324 >>> line.get_datatype("xx")
325 'Z'
326
327 If a new custom tag is specified, Gfapy selects the correct datatype for
328 it: i/f for numeric values, J/B for arrays, J for hashes and Z for
329 strings and strings. If the user wants to specify a different datatype,
330 he may do so by setting it with ``set_datatype()`` (this can be done
331 also before assigning a value, which is necessary if full validation is
332 active).
333
334 .. doctest::
335
336 >>> line = gfapy.Line("H")
337 >>> line.xx = "1"
338 >>> line.xx
339 '1'
340 >>> line.set_datatype("xy", "i")
341 >>> line.xy = "1"
342 >>> line.xy
343 1
344
345 Arrays of numerical values
346 ~~~~~~~~~~~~~~~~~~~~~~~~~~
347
348 ``B`` and ``H`` tags represent array with particular constraints (e.g.
349 they can only contain numeric values, and in some cases the values must
350 be in predefined ranges). In order to represent them correctly and allow
351 for validation, Python classes have been defined for both kind of tags:
352 ``gfapy.ByteArray`` for ``H`` and ``gfapy.NumericArray`` for ``B``
353 fields.
354
355 Both are subclasses of list. Object of the two classes can be created by
356 passing an existing list or the string representation to the class
357 constructor.
358
359 .. doctest::
360
361 >>> # create a byte array instance
362 >>> gfapy.ByteArray([12,3,14])
363 b'\x0c\x03\x0e'
364 >>> gfapy.ByteArray("A012FF")
365 b'\xa0\x12\xff'
366 >>> # create a numeric array instance
367 >>> gfapy.NumericArray.from_string("c,12,3,14")
368 [12, 3, 14]
369 >>> gfapy.NumericArray([12,3,14])
370 [12, 3, 14]
371
372 Instances of the classes behave as normal lists, except that they
373 provide a #validate() method, which checks the constraints, and that
374 their string representation is the GFA string representation of the
375 field value.
376
377 .. doctest::
378
379 >>> gfapy.NumericArray([12,1,"1x"]).validate()
380 Traceback (most recent call last):
381 ...
382 gfapy.error.ValueError
383 >>> str(gfapy.NumericArray([12,3,14]))
384 'C,12,3,14'
385 >>> gfapy.ByteArray([12,1,"1x"]).validate()
386 Traceback (most recent call last):
387 ...
388 gfapy.error.ValueError
389 >>> str(gfapy.ByteArray([12,3,14]))
390 '0C030E'
391
392 For numeric values, the `compute_subtype` method allows to compute
393 the subtype which will be used for the string representation. Unsigned
394 subtypes are used if all values are positive. The smallest possible
395 subtype range is selected. The subtype may change when the range of the
396 elements changes.
397
398 .. doctest::
399
400 >>> gfapy.NumericArray([12,13,14]).compute_subtype()
401 'C'
402
403 Special cases: custom records, headers, comments and virtual lines.
404 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
405
406 GFA2 allows custom records, introduced by record type strings other than
407 the predefined ones. Gfapy uses a pragmatical approach for identifying
408 tags in custom records, and tries to interpret the rightmost fields as
409 tags, until the first field from the right raises an error; all
410 remaining fields are treated as positional fields.
411
412 ::
413
414 "X a b c xx:i:12" # => xx is tag, a, b, c are positional fields
415 "Y a b xx:i:12 c" # => all positional fields, as c is not a valid tag
416
417 For easier access, the entire header of the GFA is summarized in a
418 single line instance. A class (`FieldArray`) has been defined to
419 handle the special case when multiple H lines define the same tag (see
420 :ref:`header` chapter for details).
421
422 Comment lines are represented by a subclass of the same class
423 (`Line`) as the records. However, they cannot contain tags: the
424 entire line is taken as content of the comment. See the :ref:`comments`
425 chapter for more information about comments.
426
427 ::
428
429 "# this is not a tag: xx:i:1" # => xx is not a tag, xx:i:1 is part of the comment
430
431 Virtual instances of the `Line` class (e.g. segment instances automatically
432 created because of not yet resolved references found in edges) cannot be
433 modified by the user, and tags cannot be specified for them. This
434 includes all instances of the `Unknown` class. See the
435 :ref:`references` chapter for more information about virtual lines.
+0
-78
doc/tutorial/validation.rst less more
0 .. _validation:
1
2 Validation
3 ----------
4
5 Different validation levels are available. They represent different
6 compromises between speed and warrant of validity. The validation level
7 can be specified when the :class:`~gfapy.gfa.Gfa` object is created, using the
8 ``vlevel`` parameter of the constructor and of the
9 `from_file` method. Four levels of validation are defined
10 (0 = no validation, 1 = validation by reading, 2 = validation by reading
11 and writing, 3 = continuous validation). The default validation level
12 value is 1.
13
14 Manual validation
15 ~~~~~~~~~~~~~~~~~
16
17 Independently from the validation level chosen, the user can always check the
18 value of a field calling
19 :meth:`~gfapy.line.common.validate.Validate.validate_field` on the line
20 instance. If no exception is raised, the field content is valid.
21
22 To check if the entire content of the line is valid, the user can call
23 :meth:`~gfapy.line.common.validate.Validate.validate` on the line instance.
24 This will check all fields and perform cross-field validations, such as
25 comparing the length of the sequence of a GFA1 segment, to the value of the LN
26 tag (if present).
27
28 It is also possible to validate the structure of the GFA, for example to
29 check if there are unresolved references to lines. To do this, use the
30 :meth:`~gfapy.gfa.Gfa.validate` of the :class:`~gfapy.gfa.Gfa` instance.
31
32 No validations
33 ~~~~~~~~~~~~~~
34
35 If the validation is set to 0, Gfapy will try to accept any input and
36 never raise an exception. This is not always possible, and in some
37 cases, an exception will still be raised, if the data is invalid.
38
39 Validation when reading
40 ~~~~~~~~~~~~~~~~~~~~~~~
41
42 If the validation level is set to 1 or higher, basic validations will be
43 performed, such as checking the number of positional fields, the
44 presence of duplicated tags, the tag datatype of predefined tags.
45 Additionally, all tags will be validated, either during parsing or on
46 first access. Record-type cross-field validations will also be
47 performed.
48
49 In other words, a validation of 1 means that Gfapy guarantees (as good
50 as it can) that the GFA content read from a file is valid, and will
51 raise an exception on accessing the data if not.
52
53 The user is supposed to call `validate_field` after changing
54 a field content to something which can be potentially invalid, or
55 :meth:`~gfapy.line.common.validate.Validate.validate` if potentially
56 cross-field validations could fail.
57
58 Validation when writing
59 ~~~~~~~~~~~~~~~~~~~~~~~
60
61 Setting the level to 2 will perform all validations described above,
62 plus validate the fields content when their value is written to string.
63
64 In other words, a validation of 2 means that Gfapy guarantee (as good as
65 it can) that the GFA content read from a file and written to a file is
66 valid and will raise an exception on accessing the data or writing to
67 file if not.
68
69 Continuous validation
70 ~~~~~~~~~~~~~~~~~~~~~
71
72 If the validation level is set to 3, all validations for lower levels
73 described above are run, plus a validation of fields contents each time
74 a setter method is used.
75
76 A validation of 3 means that Gfapy guarantees (as good as it can) that
77 the GFA content is always valid.
00 import re
11 import gfapy
2 from .alignment import Alignment
32
43 class CIGAR(list):
54 """
00 import gfapy
1 from .alignment import Alignment
21
32 class AlignmentPlaceholder(gfapy.Placeholder):
43 """
00 import gfapy
1 from .alignment import Alignment
21
32 class Trace(list):
43 """Trace alignment.
1919 if isinstance(obj, gfapy.ByteArray):
2020 return str(obj)
2121 if isinstance(obj, list):
22 return str(ByteArray(obj))
22 return str(gfapy.ByteArray(obj))
2323 elif isinstance(obj, str):
2424 return obj
2525 else:
3232 if isinstance(obj, gfapy.ByteArray):
3333 return str(obj)
3434 elif isinstance(obj, list):
35 return str(ByteArray(obj))
35 return str(gfapy.ByteArray(obj))
3636 elif isinstance(obj, str):
3737 validate_encoded(obj)
3838 return obj
00 import gfapy
11 import builtins
2 import re
32 from .validator import Validator
43 from .parser import Parser
54 from .writer import Writer
2626 if not re.match("^[!-~]+$", elem):
2727 raise gfapy.FormatError(
2828 "the list contains an invalid GFA2 identifier ({})\n"
29 .format(repr(string))+
29 .format(repr(elem))+
3030 "(it contains spaces and/or non-printable characters)")
3131 else:
3232 raise gfapy.TypeError(
2424 pass
2525 elif isinstance(obj, gfapy.Line):
2626 validate_encoded(obj.name)
27 elif isinstance(obj, String):
27 elif isinstance(obj, str):
2828 validate_encoded(obj)
2929 else:
3030 raise gfapy.TypeError(
4848 def encode(obj):
4949 if isinstance(obj, gfapy.Placeholder):
5050 return str(obj)
51 elif isinstance(obj, String):
51 elif isinstance(obj, str):
5252 obj = str(obj)
5353 elif isinstance(obj, gfapy.Line):
5454 obj = str(obj.name)
2424 elem.validate()
2525 if not re.match(r"^[!-)+-<>-~][!-~]*$", elem.name):
2626 raise gfapy.FormatError(
27 "#{elem.name} is not a valid GFA1 segment name\n".format(elem.name)+
27 "{} is not a valid GFA1 segment name\n".format(elem.name)+
2828 "(it does not match [!-)+-<>-~][!-~]*)")
2929
3030 def unsafe_encode(obj):
2323 elem.validate()
2424 if not re.match(r"^[!-~]+$", elem.name):
2525 raise gfapy.FormatError(
26 "the list contains an invalid GFA2 identifier\n".format(elem.name)+
26 "the list contains an invalid GFA2 identifier {}\n".format(elem.name)+
2727 "(it contains spaces and/or non-printable characters)")
2828 if not elem.orient in ["+", "-"]:
2929 raise gfapy.FormatError(
0 import gfapy
1
20 class Artifacts:
31
42 def remove_small_components(self, minlen):
0 import gfapy
1
20 class CopyNumber:
31
42 def set_default_count_tag(self, tag):
0 import gfapy
10 from .artifacts import Artifacts
21 from .copy_number import CopyNumber
32 from .invertible_segments import InvertibleSegments
0 from itertools import groupby
1 import gfapy
2
03 class InvertibleSegments:
14
25 def randomly_orient_invertibles(self):
36 ''' Selects a random orientation for all invertible segments.
47
58 For the definition of invertible segment, see Gonnella and Kurtz (2016).'''
6 for s in segment_names:
9 for sn in self.segment_names:
710 if self._segment_same_links_both_ends(sn):
811 self._randomly_orient_proven_invertible_segment(sn)
912
6568 return [list(v) for k,v in groupby(sorted(links,key=sig),key=sig)]
6669
6770 def _annotate_random_orientation(self, segment_name):
68 s = self.try_get_segment(segment_name)
71 segment = self.try_get_segment(segment_name)
6972 n = segment.name.split("_")
7073 pairs = 0
7174 pos = [1, segment.LN]
22 class LinearPaths:
33
44 def linear_path(self, segment, exclude = None):
5 """Finnd a linear path which contains the specified segment
5 """Find a linear path which contains the specified segment
66
77 Parameters:
88 segment (str, Line): the segment to analyse
333333 merged.LN = len(merged.sequence)
334334 elif self._vlevel > 0 and merged.LN != len(merged.sequence):
335335 raise gfapy.InconsistencyError(
336 "Computed sequence length {} ".format(merged.sequence.length)+
336 "Computed sequence length {} ".format(len(merged.sequence))+
337337 "and computed LN {} differ".format(merged.LN))
338338 if merged.length is not None:
339339 for count_tag in ["KC", "RC", "FC"]:
5050 raise gfapy.ArgumentError("Mulitiplication factor must be >= 0"+
5151 " ({} found)".format(factor))
5252 elif factor == 0:
53 if conserve_components and factor == 1 and is_cut_segment(segment):
53 if conserve_components and factor == 1 and self.is_cut_segment(segment):
5454 return self
5555 else:
5656 self.rm(segment)
9797 processed_circulars = set()
9898 for l in segment.dovetails + segment.containments:
9999 if l.is_circular():
100 if l not in processed_circular:
100 if l not in processed_circulars:
101101 self.__divide_counts(l, factor)
102 processed_circular.append(l)
102 processed_circulars.add(l)
103103 else:
104104 self.__divide_counts(l, factor)
105105
121121 def _select_distribute_end(self, links_distribution_policy,
122122 segment_name, factor):
123123 if links_distribution_policy not in self.LINKS_DISTRIBUTION_POLICY:
124 raise gfa.ArgumentError("Unknown links distribution policy {}\n".format(\
124 raise gfapy.ArgumentError("Unknown links distribution policy {}\n".format(\
125125 links_distribution_policy)+"accepted values are: {}".format(\
126126 ", ".join(self.LINKS_DISTRIBUTION_POLICY)))
127127 if links_distribution_policy == "off":
2323 count_tag=self.default["count_tag"]
2424 if unit_length is None:
2525 unit_length=self.default["unit_length"]
26 s1 = segment(segment_end1.segment)
27 s2 = segment(segment_end2.segment)
28 et1 = segment(segment_end1.end_type)
29 et2 = segment(segment_end2.end_type)
26 s1 = self.try_get_segment(segment_end1.segment)
27 s2 = self.try_get_segment(segment_end2.segment)
28 et1 = segment_end1.end_type
29 et2 = segment_end2.end_type
3030 n1 = sorted(s1.neighbours(et1), key=lambda s:s.name)
3131 n2 = sorted(s2.neighbours(et2), key=lambda s:s.name)
3232 assert list(n1) == [os.inverted() for os in n2]
4444 unit_length=unit_length) for s in alternatives]
4545 alternatives.pop(coverages.index(max(coverages)))
4646 for s in alternatives:
47 segment(s[0]).disconnect()
47 self.try_get_segment(s[0]).disconnect()
33
44 def _junction_junction_paths(self, sn, exclude):
55 retval = []
6 exclude.append(sn)
6 exclude.add(sn)
77 s = self.segment(sn)
88 for dL in s.dovetails_L:
99 eL = dL.other_end(gfapy.SegmentEnd(s, "L"))
7676 elif self._version == "gfa2":
7777 mln = len(merged.sequence)
7878 tmp_link = gfapy.line.edge.GFA2(["*",merged.name+"+", \
79 last_name+("-" if is_reversed else "+"),
79 last.name+("-" if is_reversed else "+"),
8080 str(mln - ln), "{}$".format(mln),
8181 str(ln-1) if is_reversed else "0", # on purpose fake
8282 "{}$".format(ln) if is_reversed else "1", # on purpose fake
2323 oe = {}
2424 for et in ["L", "R"]:
2525 oe[et] = l[et][0].other_end(se[et])
26 if oe[:L] == oe[:R]:
26 if oe["L"] == oe["R"]:
2727 return
2828 for et in ["L", "R"]:
2929 self._delete_other_links(oe[et], se[et],
3030 conserve_components=conserve_components)
3131 else:
32 if l[:L].size == 1:
32 if l["L"].size == 1:
3333 et = "L"
34 elif l[:R].size == 1:
34 elif l["R"].size == 1:
3535 et = "R"
3636 else:
3737 return
5151 removed, if their removal does not split connected components
5252 of the graph (considering as connections only dovetail overlaps)
5353 """
54 for sn in segment_names:
54 for sn in self.segment_names:
5555 self.enforce_segment_mandatory_links(sn, conserve_components=
5656 conserve_components)
5757
6363 Parameters:
6464 segment (str, Line): the segment
6565 """
66 if not isinstance(segment, gfa.Line):
66 if not isinstance(segment, gfapy.Line):
6767 segment = self.try_get_segment(segment)
6868 for e in segment.dovetails:
6969 if e.from_segment == e.to_segment:
7171
7272 def remove_self_links(self):
7373 """Remove all dovetail overlap of segments to themselves, if any."""
74 for sn in segment_names:
74 for sn in self.segment_names:
7575 self.remove_self_link(sn)
2525 c[et] = set()
2626 visited = set()
2727 segend = link.get("from") if et == "from" else link.to
28 visited.append(segend.name)
29 visited.append(link.other_end(segend).name)
28 visited.add(segend.name)
29 visited.add(link.other_end(segend).name)
3030 self.__traverse_component(segend, c[et], visited)
3131 return c["from"] != c["to"]
3232
4949 start_points = set()
5050 for et in ["L", "R"]:
5151 for l in segment.dovetails_of_end(et):
52 start_points.append(l.other_end(\
53 gfapy.SegmentEnd(segment_name, et)).inverted())
52 start_points.add(l.other_end(\
53 gfapy.SegmentEnd(segment.name, et)).inverted())
5454 cc = []
5555 for start_point in start_points:
5656 cc.append(set())
5757 visited = set()
58 visited.append(segment_name)
59 traverse_component(start_point, cc[-1], visited)
58 visited.add(segment.name)
59 self.__traverse_component(start_point, cc[-1], visited)
6060 return any(c != cc[0] for c in cc)
6161
6262 def segment_connected_component(self, segment, visited = None):
7979 segment_name = segment
8080 segment = self.segment(segment)
8181 visited.add(segment_name)
82 c = [segment]
82 c = set()
83 c.add(segment)
8384 for e in ["L", "R"]:
8485 self.__traverse_component(gfapy.SegmentEnd(segment, e), c, visited)
85 return c
86 return list(c)
8687
8788 def connected_components(self):
8889 """Compute the connected components of the graph.
183184 if sn in visited:
184185 continue
185186 visited.add(sn)
186 c.append(s)
187 c.add(s)
187188 for e in ["L","R"]:
188189 self.__traverse_component(gfapy.SegmentEnd(s, e), c, visited)
00 import gfapy
1 import copy
21 from functools import total_ordering
32
43 @total_ordering
0 import gfapy
10 from ..line import Line
21 from .construction import Construction
32 from .tags import Tags
0 import gfapy
1
20 class Writer:
31 def __str__(self):
42 return "#" + str(self.spacer) + str(self.content)
0 import gfapy
10 import json
21 from copy import deepcopy
32
298298 DynamicField(partial(get_method, k = k),
299299 partial(set_method, k = k)))
300300 def all_references(self):
301 return [ item for item in values for values in self._refs ]
301 return [ item for item in [ values for values in self._refs ] ]
302302
303303 @classmethod
304304 def register_extension(cls, references=[]):
0 import gfapy
1
20 class DefaultRecordDefinition:
31 """
42 Provides default values for the constants for the definition of record types.
3030 return super().__setattr__(name, value)
3131 else:
3232 attr.set(self, value)
33 except AttributeError as err:
33 except AttributeError:
3434 return self._set_dynamic_field(name, value)
3535
3636 def _get_dynamic_field(self, name, err):
4444 self._validate_predefined_tag_type(n, self._field_datatype(n))
4545 elif not self._is_valid_custom_tagname(n):
4646 raise gfapy.FormatError("Custom tag names must consist in a letter "+
47 "and a digit or two letters\nFound: {}".format(tagname))
47 "and a digit or two letters\nFound: {}".format(n))
4848
4949 def _validate_predefined_tag_type(self, tagname, datatype):
5050 if datatype != self.__class__.DATATYPE[tagname]:
158158 (str, str, object) list
159159 """
160160 retval = []
161 for of in tagnames:
161 for of in self.tagnames:
162162 retval.append([of, self.get_datatype(of), self.get(of)])
163163 return retval
0 import gfapy
1
02 class ToGFA2:
13
24 @property
1717 raise gfapy.InconsistencyError(
1818 "Edge: {}\n".format(str(self))+
1919 "Field {}: $ after ".format(fn)+
20 "non-last position\n".format(str(pos))+
20 "non-last position {}\n".format(str(pos))+
2121 "Segment: {}".format(str(seg)))
88 Thereby, tags are not considered.
99 """
1010 hash(str(self.from_end)) + \
11 hash(str(to_end)) + \
12 hash(str(overlap)) + \
13 hash(str(overlap.complement()))
11 hash(str(self.to_end)) + \
12 hash(str(self.overlap)) + \
13 hash(str(self.overlap.complement()))
1414
1515 def is_eql(self, other):
1616 """
0 import gfapy
10 from .references import References
21 from ..line import Line
32
0 import gfapy
1
02 class SameID:
13
24 def _process_not_unique(self, previous):
8686 oss = [oriented_edge.line.sid1, oriented_edge.line.sid2]
8787 if oriented_edge.orient == "-":
8888 for i in range(len(oss)):
89 oss[i].invert()
89 oss[i] = oss[i].inverted()
9090 if len(items) > 1:
9191 nextitem = items[1]
9292 if isinstance(nextitem.line, gfapy.line.segment.GFA2):
9898 oss_of_next = [nextitem.line.sid1, nextitem.line.sid2]
9999 if oriented_edge.orient == "-":
100100 for i in range(len(oss_of_next)):
101 oss_of_next[i].invert()
101 oss_of_next[i] = oss_of_next[i].inverted()
102102 if oss[0] in oss_of_next:
103103 oss.reverse()
104104 # if oss_of_next have no element in common with oss an error will be
105105 # raised in the next iteration, so does not need to be handled here
106106 elif isinstance(nextitem.line, gfapy.line.group.Ordered):
107 subpath = item.line.captured_path
107 subpath = nextitem.line.captured_path
108108 if not subpath: return# does not need to be further handled here
109 if item.orient == "+":
110 firstsubpathsegment = supath[0]
109 if nextitem.orient == "+":
110 firstsubpathsegment = subpath[0]
111111 else:
112 firstsubpathsegment = supath[-1].inverted()
112 firstsubpathsegment = subpath[-1].inverted()
113113 if firstsubpathsegment == oss[0]:
114114 oss.reverse()
115115 # if oss does not include in firstsubpathsegment
128128 possible_prev = [oriented_edge.line.sid1, oriented_edge.line.sid2]
129129 if oriented_edge.orient == "-":
130130 for i, v in enumerate(possible_prev):
131 possible_prev[i].invert()
131 possible_prev[i] = possible_prev[i].inverted()
132132 if prev_os == possible_prev[0]:
133133 path.append(possible_prev[1])
134134 elif prev_os == possible_prev[1]:
166166 "Expected element:\n"+
167167 " {} ({})\n".format(path[-1], path[-1].line)+
168168 "Current element:\n"+
169 " {} ({})\n".format(segment, segment.line))
169 " {} ({})\n".format(oriented_segment, oriented_segment.line))
170170
171171 def _check_s_to_e_contiguity(self, path, oriented_segment):
172172 # check that segment is an extremity of path[-1]
0 import gfapy
1
02 class References:
13
24 def append_item(self, item):
4244 self.items = self.items[1:]
4345 else:
4446 self.items[0].update_reference(self, "paths")
45 self._delete_reference(items[0], "items")
47 self._delete_reference(self.items[0], "items")
4648 self.compute_induced_set() # check contiguity
4749
4850 def rm_last_item(self):
6567 if isinstance(item.line, gfapy.Line):
6668 item.line = item.name
6769 if append:
68 items.append(item)
70 self.items.append(item)
6971 else:
70 items.insert(0, item)
72 self.items.insert(0, item)
7173
7274 def _add_item_to_connected_group(self, item, append = True):
7375 item.line = self.prepare_and_check_ref(item.line)
1919 gfapy.Field._validate_gfa_field(oline.line.overlap, "alignment_gfa1")
2020 overlaps.append(str(oline.line.overlap))
2121 a.append(",".join(overlaps))
22 for tn in self.tagnames:
23 a.append(self.field_to_s(tn, tag=True))
2224 return a
1111 a = ["O"]
1212 a.append(self.field_to_s("path_name"))
1313 a.append(" ".join(items))
14 for tn in self.tagnames:
15 a.append(self.field_to_s(tn, tag=True))
1416 return a
0 import gfapy
1
02 class References:
13
24 def add_item(self, item):
3840 if isinstance(item, str):
3941 item = self._gfa.line(item)
4042 self._check_item_included(item)
41 line._delete_reference(self, "sets")
42 self._delete_reference(line, "items")
43 item._delete_reference(self, "sets")
44 self._delete_reference(item, "items")
4345 return None
4446
4547 def _check_item_included(self, item):
00 from .common.construction import Construction
1 from .common.dynamic_fields import DynamicFields, DynamicField
1 from .common.dynamic_fields import DynamicFields
22 from .common.writer import Writer
33 from .common.version_conversion import VersionConversion
44 from .common.field_datatype import FieldDatatype
1111 from .common.disconnection import Disconnection
1212 from .common.validate import Validate
1313 from .common.default_record_definition import DefaultRecordDefinition
14
15 import gfapy
1614
1715 class Line(Construction, DynamicFields, Writer, VersionConversion,
1816 FieldDatatype, FieldData, Equivalence, Cloning, Connection,
1616 "GFA2 requires to specify a length\n"+
1717 "No length information available in the GFA1 segment:\n"+
1818 "Segment line: {}".format(str(self)))
19 a = ["S", self.field_to_s("name", tag = False), str(self.try_get_length()),
19 a = ["S", self.field_to_s("name", tag = False), str(length),
2020 self.field_to_s("sequence", tag = False)]
2121 for fn in self.tagnames:
2222 if fn != "LN":
0 import gfapy
1
20 class Collections:
31 @property
42 def comments(self):
153153 elif gfa_line.record_type in ["L", "P", "C", "#"]:
154154 gfa_line.connect(self)
155155 else:
156 rt = gfa_line.record_type
156157 raise gfapy.AssertionError(
157158 "Invalid record type {}. This should never happen".format(rt))
158159
2121 other_end = gfapy.SegmentEnd(other_end)
2222 s = self.try_get_segment(segment_end.segment)
2323 for d in s.dovetails_of_end(segment_end.end_type):
24 if not conserve_components or not self.is_cut_link(l):
25 l.disconnect()
24 if not conserve_components or not self.is_cut_link(d):
25 d.disconnect()
2626
2727 def _unregister_line(self, gfa_line):
2828 self._api_private_check_gfa_line(gfa_line, "unregister_line")
3131 raise gfapy.AssertionError("Bug found, please report\n"+
3232 "gfa_line: {}".format(gfa_line))
3333 collection = self._records[rt]
34 key = gfa_line
35 delete_if_empty = None
3634 storage_key = gfa_line.__class__.STORAGE_KEY
3735 if storage_key == "name":
3836 name = gfa_line.name
0 import gfapy
1
20 class Headers:
31 @property
42 def header(self):
11 import gfapy.line
22 from .collections import Collections
33 from .headers import Headers
4 from .collections import Collections
54 from .creators import Creators
65 from .destructors import Destructors
76 from .finders import Finders
207207 "Value: {1}\n"+
208208 "Range: {2}\n"+
209209 "Content: {3}").format(subtype, e,
210 repr(range), repr(array)))
210 repr(range), repr(elems)))
211211 yield e
212212 else:
213213 yield float(e)
0 Metadata-Version: 1.1
1 Name: gfapy
2 Version: 1.2.0
3 Summary: Library for handling data in the GFA1 and GFA2 formats
4 Home-page: https://github.com/ggonnella/gfapy
5 Author: Giorgio Gonnella and others (see CONTRIBUTORS)
6 Author-email: gonnella@zbh.uni-hamburg.de
7 License: ISC
8 Description: Gfapy
9 ~~~~~
10
11 |travis| |readthedocs| |latesttag| |license| |requiresio|
12
13 |bioconda| |pypi| |debian| |ubuntu|
14
15 .. sphinx-begin
16
17 The Graphical Fragment Assembly (GFA) are formats for the representation
18 of sequence graphs, including assembly, variation and splicing graphs.
19 Two versions of GFA have been defined (GFA1 and GFA2) and several sequence
20 analysis programs have been adopting the formats as an interchange format,
21 which allow to easily combine different sequence analysis tools.
22
23 This library implements the GFA1 and GFA2 specification
24 described at https://github.com/GFA-spec/GFA-spec/blob/master/GFA-spec.md.
25 It allows to create a Gfa object from a file in the GFA format
26 or from scratch, to enumerate the graph elements (segments, links,
27 containments, paths and header lines), to traverse the graph (by
28 traversing all links outgoing from or incoming to a segment), to search for
29 elements (e.g. which links connect two segments) and to manipulate the
30 graph (e.g. to eliminate a link or a segment or to duplicate a segment
31 distributing the read counts evenly on the copies).
32
33 The GFA format can be easily extended by users by defining own custom
34 tags and record types. In Gfapy, it is easy to write extensions modules,
35 which allow to define custom record types and datatypes for the parsing
36 and validation of custom fields. The custom lines can be connected, using
37 references, to each other and to lines of the standard record types.
38
39 Requirements
40 ~~~~~~~~~~~~
41
42 Gfapy has been written for Python 3 and tested using Python version 3.7.
43 It does not require any additional Python packages or other software.
44
45 Installation
46 ~~~~~~~~~~~~
47
48 Gfapy is distributed as a Python package and can be installed using
49 the Python package manager pip, as well as conda (in the Bioconda channel).
50 It is also available as a package in some Linux distributions (Debian, Ubuntu).
51
52 The following command installs the current stable version from the Python
53 Packages index::
54
55 pip install gfapy
56
57 If you would like to install the current development version from Github,
58 use the following command::
59
60 pip install -e git+https://github.com/ggonnella/gfapy.git#egg=gfapy
61
62 Alternatively it is possible to install gfapy using conda. Gfapy is
63 included in the Bioconda (https://bioconda.github.io/) channel::
64
65 conda install -c bioconda gfapy
66
67 Usage
68 ~~~~~
69
70 If you installed gfapy as described above, you can import it in your script
71 using the conventional Python syntax::
72
73 >>> import gfapy
74
75 Documentation
76 ~~~~~~~~~~~~~
77
78 The documentation, including this introduction to Gfapy, a user manual
79 and the API documentation is hosted on the ReadTheDocs server,
80 at the URL http://gfapy.readthedocs.io/en/latest/ and it can be
81 downloaded as PDF from the URL
82 https://github.com/ggonnella/gfapy/blob/master/manual/gfapy-manual.pdf.
83
84 References
85 ~~~~~~~~~~
86
87 Giorgio Gonnella and Stefan Kurtz "GfaPy: a flexible and extensible software
88 library for handling sequence graphs in Python", Bioinformatics (2017) btx398
89 https://doi.org/10.1093/bioinformatics/btx398
90
91 .. sphinx-end
92
93 .. |travis|
94 image:: https://travis-ci.com/ggonnella/gfapy.svg?branch=master
95 :target: https://travis-ci.com/ggonnella/gfapy
96 :alt: Travis
97
98 .. |latesttag|
99 image:: https://img.shields.io/github/v/tag/ggonnella/gfapy
100 :target: https://github.com/ggonnella/gfapy/tags
101 :alt: Latest GitHub tag
102
103 .. |readthedocs|
104 image:: https://readthedocs.org/projects/pip/badge/?version=stable
105 :target: https://pip.pypa.io/en/stable/?badge=stable
106 :alt: ReadTheDocs
107
108 .. |bioconda|
109 image:: https://img.shields.io/conda/vn/bioconda/gfapy
110 :target: https://bioconda.github.io/recipes/gfapy/README.html
111 :alt: Bioconda
112
113 .. |pypi|
114 image:: https://img.shields.io/pypi/v/gfapy
115 :target: https://pypi.org/project/gfapy/
116 :alt: PyPI
117
118 .. |debian|
119 image:: https://img.shields.io/debian/v/gfapy
120 :target: https://packages.debian.org/search?keywords=gfapy
121 :alt: Debian
122
123 .. |ubuntu|
124 image:: https://img.shields.io/ubuntu/v/gfapy
125 :target: https://packages.ubuntu.com/search?keywords=gfapy
126 :alt: Ubuntu
127
128 .. |license|
129 image:: https://img.shields.io/pypi/l/gfapy
130 :target: https://github.com/ggonnella/gfapy/blob/master/LICENSE.txt
131 :alt: ISC License
132
133 .. |requiresio|
134 image:: https://requires.io/github/ggonnella/gfapy/requirements.svg?branch=master
135 :target: https://requires.io/github/ggonnella/gfapy/requirements/?branch=master
136 :alt: Requirements Status
137
138 Keywords: bioinformatics genomics sequences GFA assembly graphs
139 Platform: UNKNOWN
140 Classifier: Development Status :: 5 - Production/Stable
141 Classifier: Environment :: Console
142 Classifier: Intended Audience :: Developers
143 Classifier: Intended Audience :: End Users/Desktop
144 Classifier: Intended Audience :: Science/Research
145 Classifier: License :: OSI Approved :: ISC License (ISCL)
146 Classifier: Operating System :: MacOS :: MacOS X
147 Classifier: Operating System :: POSIX :: Linux
148 Classifier: Programming Language :: Python :: 3 :: Only
149 Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
150 Classifier: Topic :: Software Development :: Libraries
0 MANIFEST.in
1 README.rst
2 setup.cfg
3 setup.py
4 bin/gfapy-convert
5 bin/gfapy-mergelinear
6 bin/gfapy-renumber
7 bin/gfapy-validate
8 gfapy/__init__.py
9 gfapy/byte_array.py
10 gfapy/error.py
11 gfapy/field_array.py
12 gfapy/gfa.py
13 gfapy/lastpos.py
14 gfapy/logger.py
15 gfapy/numeric_array.py
16 gfapy/oriented_line.py
17 gfapy/placeholder.py
18 gfapy/rgfa.py
19 gfapy/segment_end.py
20 gfapy/segment_end_path.py
21 gfapy/sequence.py
22 gfapy/symbol_invert.py
23 gfapy.egg-info/PKG-INFO
24 gfapy.egg-info/SOURCES.txt
25 gfapy.egg-info/dependency_links.txt
26 gfapy.egg-info/not-zip-safe
27 gfapy.egg-info/top_level.txt
28 gfapy/alignment/__init__.py
29 gfapy/alignment/alignment.py
30 gfapy/alignment/cigar.py
31 gfapy/alignment/placeholder.py
32 gfapy/alignment/trace.py
33 gfapy/field/__init__.py
34 gfapy/field/alignment_gfa1.py
35 gfapy/field/alignment_gfa2.py
36 gfapy/field/alignment_list_gfa1.py
37 gfapy/field/byte_array.py
38 gfapy/field/char.py
39 gfapy/field/comment.py
40 gfapy/field/custom_record_type.py
41 gfapy/field/field.py
42 gfapy/field/float.py
43 gfapy/field/generic.py
44 gfapy/field/identifier_gfa2.py
45 gfapy/field/identifier_list_gfa2.py
46 gfapy/field/integer.py
47 gfapy/field/json.py
48 gfapy/field/numeric_array.py
49 gfapy/field/optional_identifier_gfa2.py
50 gfapy/field/optional_integer.py
51 gfapy/field/orientation.py
52 gfapy/field/oriented_identifier_gfa2.py
53 gfapy/field/oriented_identifier_list_gfa1.py
54 gfapy/field/oriented_identifier_list_gfa2.py
55 gfapy/field/parser.py
56 gfapy/field/path_name_gfa1.py
57 gfapy/field/position_gfa1.py
58 gfapy/field/position_gfa2.py
59 gfapy/field/segment_name_gfa1.py
60 gfapy/field/sequence_gfa1.py
61 gfapy/field/sequence_gfa2.py
62 gfapy/field/string.py
63 gfapy/field/validator.py
64 gfapy/field/writer.py
65 gfapy/graph_operations/__init__.py
66 gfapy/graph_operations/artifacts.py
67 gfapy/graph_operations/copy_number.py
68 gfapy/graph_operations/graph_operations.py
69 gfapy/graph_operations/invertible_segments.py
70 gfapy/graph_operations/linear_paths.py
71 gfapy/graph_operations/multiplication.py
72 gfapy/graph_operations/p_bubbles.py
73 gfapy/graph_operations/redundant_linear_paths.py
74 gfapy/graph_operations/superfluous_links.py
75 gfapy/graph_operations/topology.py
76 gfapy/line/__init__.py
77 gfapy/line/line.py
78 gfapy/line/comment/__init__.py
79 gfapy/line/comment/comment.py
80 gfapy/line/comment/construction.py
81 gfapy/line/comment/tags.py
82 gfapy/line/comment/version_conversion.py
83 gfapy/line/comment/writer.py
84 gfapy/line/common/__init__.py
85 gfapy/line/common/cloning.py
86 gfapy/line/common/connection.py
87 gfapy/line/common/construction.py
88 gfapy/line/common/default_record_definition.py
89 gfapy/line/common/disconnection.py
90 gfapy/line/common/dynamic_fields.py
91 gfapy/line/common/equivalence.py
92 gfapy/line/common/field_data.py
93 gfapy/line/common/field_datatype.py
94 gfapy/line/common/update_references.py
95 gfapy/line/common/validate.py
96 gfapy/line/common/version_conversion.py
97 gfapy/line/common/virtual_to_real.py
98 gfapy/line/common/writer.py
99 gfapy/line/custom_record/__init__.py
100 gfapy/line/custom_record/construction.py
101 gfapy/line/custom_record/custom_record.py
102 gfapy/line/edge/__init__.py
103 gfapy/line/edge/edge.py
104 gfapy/line/edge/common/__init__.py
105 gfapy/line/edge/common/alignment_type.py
106 gfapy/line/edge/common/from_to.py
107 gfapy/line/edge/containment/__init__.py
108 gfapy/line/edge/containment/canonical.py
109 gfapy/line/edge/containment/containment.py
110 gfapy/line/edge/containment/pos.py
111 gfapy/line/edge/containment/to_gfa2.py
112 gfapy/line/edge/gfa1/__init__.py
113 gfapy/line/edge/gfa1/alignment_type.py
114 gfapy/line/edge/gfa1/oriented_segments.py
115 gfapy/line/edge/gfa1/other.py
116 gfapy/line/edge/gfa1/references.py
117 gfapy/line/edge/gfa1/to_gfa2.py
118 gfapy/line/edge/gfa2/__init__.py
119 gfapy/line/edge/gfa2/alignment_type.py
120 gfapy/line/edge/gfa2/gfa2.py
121 gfapy/line/edge/gfa2/other.py
122 gfapy/line/edge/gfa2/references.py
123 gfapy/line/edge/gfa2/to_gfa1.py
124 gfapy/line/edge/gfa2/validation.py
125 gfapy/line/edge/link/__init__.py
126 gfapy/line/edge/link/canonical.py
127 gfapy/line/edge/link/complement.py
128 gfapy/line/edge/link/equivalence.py
129 gfapy/line/edge/link/link.py
130 gfapy/line/edge/link/references.py
131 gfapy/line/edge/link/to_gfa2.py
132 gfapy/line/fragment/__init__.py
133 gfapy/line/fragment/fragment.py
134 gfapy/line/fragment/references.py
135 gfapy/line/fragment/validation.py
136 gfapy/line/gap/__init__.py
137 gfapy/line/gap/gap.py
138 gfapy/line/gap/references.py
139 gfapy/line/group/__init__.py
140 gfapy/line/group/group.py
141 gfapy/line/group/gfa2/__init__.py
142 gfapy/line/group/gfa2/references.py
143 gfapy/line/group/gfa2/same_id.py
144 gfapy/line/group/ordered/__init__.py
145 gfapy/line/group/ordered/captured_path.py
146 gfapy/line/group/ordered/ordered.py
147 gfapy/line/group/ordered/references.py
148 gfapy/line/group/ordered/to_gfa1.py
149 gfapy/line/group/path/__init__.py
150 gfapy/line/group/path/captured_path.py
151 gfapy/line/group/path/path.py
152 gfapy/line/group/path/references.py
153 gfapy/line/group/path/to_gfa2.py
154 gfapy/line/group/path/topology.py
155 gfapy/line/group/path/validation.py
156 gfapy/line/group/unordered/__init__.py
157 gfapy/line/group/unordered/induced_set.py
158 gfapy/line/group/unordered/references.py
159 gfapy/line/group/unordered/unordered.py
160 gfapy/line/header/__init__.py
161 gfapy/line/header/connection.py
162 gfapy/line/header/field_data.py
163 gfapy/line/header/header.py
164 gfapy/line/header/multiline.py
165 gfapy/line/header/version_conversion.py
166 gfapy/line/segment/__init__.py
167 gfapy/line/segment/coverage.py
168 gfapy/line/segment/gfa1.py
169 gfapy/line/segment/gfa1_to_gfa2.py
170 gfapy/line/segment/gfa2.py
171 gfapy/line/segment/gfa2_to_gfa1.py
172 gfapy/line/segment/length_gfa1.py
173 gfapy/line/segment/references.py
174 gfapy/line/segment/segment.py
175 gfapy/line/segment/writer_wo_sequence.py
176 gfapy/line/unknown/__init__.py
177 gfapy/line/unknown/unknown.py
178 gfapy/lines/__init__.py
179 gfapy/lines/collections.py
180 gfapy/lines/creators.py
181 gfapy/lines/destructors.py
182 gfapy/lines/finders.py
183 gfapy/lines/headers.py
184 gfapy/lines/lines.py
185 manual/gfapy-manual.pdf
186 tests/__init__.py
187 tests/extension.py
188 tests/test_api_alignment.py
189 tests/test_api_comments.py
190 tests/test_api_custom_records.py
191 tests/test_api_extensions.py
192 tests/test_api_gfa1_lines.py
193 tests/test_api_gfa2_lines.py
194 tests/test_api_gfa_basics.py
195 tests/test_api_groups_validation.py
196 tests/test_api_header.py
197 tests/test_api_linear_paths.py
198 tests/test_api_linear_paths_extended.py
199 tests/test_api_lines_collections.py
200 tests/test_api_lines_creators.py
201 tests/test_api_lines_destructors.py
202 tests/test_api_lines_finders.py
203 tests/test_api_multiplication.py
204 tests/test_api_placeholders.py
205 tests/test_api_positionals.py
206 tests/test_api_positions.py
207 tests/test_api_references_edge_gfa1.py
208 tests/test_api_references_edge_gfa2.py
209 tests/test_api_references_f_g_lines.py
210 tests/test_api_references_groups.py
211 tests/test_api_references_virtual.py
212 tests/test_api_rename_lines.py
213 tests/test_api_rgfa.py
214 tests/test_api_tags.py
215 tests/test_api_version.py
216 tests/test_api_version_conversion.py
217 tests/test_gfapy_alignment.py
218 tests/test_gfapy_byte_array.py
219 tests/test_gfapy_cigar.py
220 tests/test_gfapy_line_containment.py
221 tests/test_gfapy_line_edge.py
222 tests/test_gfapy_line_header.py
223 tests/test_gfapy_line_link.py
224 tests/test_gfapy_line_path.py
225 tests/test_gfapy_line_segment.py
226 tests/test_gfapy_line_version.py
227 tests/test_gfapy_numeric_array.py
228 tests/test_gfapy_segment_references.py
229 tests/test_gfapy_sequence.py
230 tests/test_gfapy_trace.py
231 tests/test_graphop_artifacts.py
232 tests/test_graphop_copy_number.py
233 tests/test_internals_field_parser.py
234 tests/test_internals_field_validator.py
235 tests/test_internals_field_writer.py
236 tests/test_internals_tag_datatype.py
237 tests/test_unit_alignment.py
238 tests/test_unit_field_array.py
239 tests/test_unit_gfa_lines.py
240 tests/test_unit_header.py
241 tests/test_unit_line.py
242 tests/test_unit_line_cloning.py
243 tests/test_unit_line_connection.py
244 tests/test_unit_line_dynamic_fields.py
245 tests/test_unit_line_equivalence.py
246 tests/test_unit_lines_finders.py
247 tests/test_unit_multiplication.py
248 tests/test_unit_numeric_array.py
249 tests/test_unit_oriented_line.py
250 tests/test_unit_segment_end.py
251 tests/test_unit_symbol_invert.py
252 tests/test_unit_unknown.py
253 tests/testdata/all_line_types.gfa1.gfa
254 tests/testdata/all_line_types.gfa2.gfa
255 tests/testdata/copynum.1.gfa
256 tests/testdata/copynum.1.gfa2
257 tests/testdata/copynum.2.gfa
258 tests/testdata/copynum.2.gfa2
259 tests/testdata/dead_ends.gfa
260 tests/testdata/dead_ends.gfa2
261 tests/testdata/example1.gfa
262 tests/testdata/example1.gfa2
263 tests/testdata/example_from_spec.gfa
264 tests/testdata/example_from_spec.gfa2
265 tests/testdata/example_from_spec.path14.seq
266 tests/testdata/example_from_spec2.gfa
267 tests/testdata/example_from_spec2.gfa2
268 tests/testdata/filled.gfa1
269 tests/testdata/filled.gfa2
270 tests/testdata/gfa2_edges_classification.gfa
271 tests/testdata/invalid_path.gfa2
272 tests/testdata/linear_blunt.gfa1
273 tests/testdata/linear_blunt.gfa2
274 tests/testdata/linear_merging.1.gfa
275 tests/testdata/linear_merging.1.gfa2
276 tests/testdata/linear_merging.2.gfa
277 tests/testdata/linear_merging.2.gfa2
278 tests/testdata/linear_merging.3.gfa
279 tests/testdata/linear_merging.3.gfa2
280 tests/testdata/linear_merging.4.gfa
281 tests/testdata/linear_merging.4.gfa2
282 tests/testdata/linear_merging.5.gfa
283 tests/testdata/linear_merging.5.gfa2
284 tests/testdata/links_distri.l1.gfa
285 tests/testdata/links_distri.l1.gfa2
286 tests/testdata/links_distri.l1.m2.gfa
287 tests/testdata/links_distri.l1.m2.gfa2
288 tests/testdata/links_distri.l2.gfa
289 tests/testdata/links_distri.l2.gfa2
290 tests/testdata/links_distri.l2.m2.gfa
291 tests/testdata/links_distri.l2.m2.gfa2
292 tests/testdata/links_distri.l2.m2.no_ld.gfa
293 tests/testdata/links_distri.l2.m2.no_ld.gfa2
294 tests/testdata/links_distri.l2.m3.gfa
295 tests/testdata/links_distri.l2.m3.gfa2
296 tests/testdata/links_distri.l2.m3.no_ld.gfa
297 tests/testdata/links_distri.l2.m3.no_ld.gfa2
298 tests/testdata/links_distri.l3.gfa
299 tests/testdata/links_distri.l3.gfa2
300 tests/testdata/links_distri.l3.m2.gfa
301 tests/testdata/links_distri.l3.m2.gfa2
302 tests/testdata/links_distri.l3.m2.no_ld.gfa
303 tests/testdata/links_distri.l3.m2.no_ld.gfa2
304 tests/testdata/loop.gfa
305 tests/testdata/loop.gfa2
306 tests/testdata/rgfa_example.1.gfa
307 tests/testdata/rgfa_example.2.gfa
308 tests/testdata/sample.gfa
309 tests/testdata/sample.gfa2
310 tests/testdata/seq_to_fill.fas
311 tests/testdata/spec_q1.gfa
312 tests/testdata/spec_q1.gfa2
313 tests/testdata/spec_q2.gfa
314 tests/testdata/spec_q2.gfa2
315 tests/testdata/spec_q2.path_circular.seq
316 tests/testdata/spec_q2.path_linear.seq
317 tests/testdata/spec_q3.gfa
318 tests/testdata/spec_q3.gfa2
319 tests/testdata/spec_q4.gfa
320 tests/testdata/spec_q4.gfa2
321 tests/testdata/spec_q4.path_more_than_circular.seq
322 tests/testdata/spec_q5.gfa
323 tests/testdata/spec_q5.gfa2
324 tests/testdata/spec_q6.gfa
325 tests/testdata/spec_q6.gfa2
326 tests/testdata/spec_q7.gfa
327 tests/testdata/spec_q7.gfa2
328 tests/testdata/to_be_filled.gfa1
329 tests/testdata/to_be_filled.gfa2
330 tests/testdata/two_components.gfa
331 tests/testdata/two_components.gfa2
332 tests/testdata/unnamed_and_named_links.gfa
333 tests/testdata/unnamed_link.gfa
334 tests/testdata/valid_path.gfa2
Binary diff not shown
00 [bdist_wheel]
11 python-tag = py3
2
3 [egg_info]
4 tag_build =
5 tag_date = 0
6
88 sys.exit("Sorry, only Python 3 is supported")
99
1010 setup(name='gfapy',
11 version='1.1.0',
11 version='1.2.0',
1212 description='Library for handling data in the GFA1 and GFA2 formats',
1313 long_description=readme(),
1414 url='https://github.com/ggonnella/gfapy',
0 import gfapy
1 import unittest
2
3 class TestAPIGroupsValidation(unittest.TestCase):
4
5 def test_invalid_path_gfa2(self):
6 with self.assertRaises(gfapy.NotFoundError):
7 g = gfapy.Gfa.from_file("tests/testdata/invalid_path.gfa2")
8
9 def test_invalid_path_gfa2_vlevel0(self):
10 g = gfapy.Gfa.from_file("tests/testdata/invalid_path.gfa2", vlevel = 0)
11 with self.assertRaises(gfapy.NotFoundError):
12 g.validate()
13
14 def test_valid_path_gfa2(self):
15 # nothing raised
16 g = gfapy.Gfa.from_file("tests/testdata/valid_path.gfa2")
4646 lps.add(" ".join([s.name for s in lp]))
4747 self.assertEqual({"1 19 18", "11 9 12", "22 16 20 21 23"}, lps)
4848
49 def test_linear_path_blunt_ends(self):
50 for sfx in ["gfa1", "gfa2"]:
51 gfa = gfapy.Gfa.from_file("tests/testdata/linear_blunt."+"{}".format(sfx))
52 gfa.merge_linear_paths()
53 self.assertEqual(1, len(gfa.segments))
54 self.assertEqual("s1_s2_s3_s4", gfa.segments[0].name)
55 self.assertEqual("CTGAAACGTGGCTCACA", gfa.segments[0].sequence)
199199 gfa1_str ='''# comment
200200 H\tVN:Z:1.0
201201 S\tB\t*\tLN:i:200
202 S\tC\t*\tLN:i:100
202 S\tC\t*\tLN:i:100\txy:i:1
203203 S\tA\t*\tLN:i:200
204204 L\tA\t+\tB\t-\t100M\tID:Z:a_to_b
205205 C\tA\t+\tC\t-\t20\t100M\tID:Z:2
206 P\t1\tA+,B-\t100M'''
206 P\t1\tA+,B-\t100M\txx:Z:valid\txy:i:1'''
207207 gfa2_str ='''# comment
208208 H\tVN:Z:2.0
209209 S\tB\t200\t*
210 S\tC\t100\t*
210 S\tC\t100\t*\txy:i:1
211211 S\tA\t200\t*
212212 E\ta_to_b\tA+\tB-\t100\t200$\t100\t200$\t100M
213213 E\t2\tA+\tC-\t20\t120\t0\t100$\t100M
214 O\t1\tA+ a_to_b+ B-'''
214 O\t1\tA+ a_to_b+ B-\txx:Z:valid\txy:i:1'''
215215 self.assertEqual(gfa2_str, str(gfapy.Gfa(gfa1_str).to_gfa2()))
216216 self.assertEqual(gfa1_str, str(gfapy.Gfa(gfa2_str).to_gfa1()))
0 H VN:Z:1.0
1 S s1 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTCCG LN:i:44
2 S s2 ACGTTCGTACGTACGTACGTACGTAAACGGTAC LN:i:33
3 S s3 ACGTAAACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTTTTACGTACGTA LN:i:75
4 S s4 ACGTACGTACGTTTTTTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGGAACGTAGTACGTACGTACGTACG LN:i:80
5 S s5 * LN:i:90 xx:Z:not in the Fasta, should accept and warn
6 L s1 + s2 + 5M1I2M
7 L s2 + s3 + 8M1D3M
8 L s3 + s4 + 16M1I9M
9 L s4 - s1 + 8M1I16M
0 H VN:Z:2.0
1 S s1 44 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTCCG
2 S s2 33 ACGTTCGTACGTACGTACGTACGTAAACGGTAC
3 S s3 75 ACGTAAACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTTTTACGTACGTA
4 S s4 80 ACGTACGTACGTTTTTTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGGAACGTAGTACGTACGTACGTACG
5 S s5 90 * xx:Z:not in the Fasta, should accept and warn
6 E 1 s1+ s2+ 37 44$ 0 8 5M1I2M
7 E 2 s2+ s3+ 21 33$ 0 11 8M1D3M
8 E 3 s3+ s4+ 50 75$ 0 26 16M1I9M
9 E 4 s4- s1+ 0 24 0 25 8M1I16M
+0
-32
tests/testdata/invalid/edge_missing.gfa2 less more
0 # File used for the collections test
1 # similar but NOT equivalent to the gfa1 file!
2 S 1 122 *
3 S 3 29 TGCTAGCTGACTGTCGATGCTGTGTG
4 E 1_to_2 1+ 2+ 110 122$ 0 12 12M
5 S 5 130 *
6 S 13 150 *
7 E 2_to_6 2+ 6+ 0 122$ 10 132 122M
8 O 14 11+ 12+
9 S 11 140 * xx:i:11
10 F 2 read1+ 0 42 12 55 * id:Z:read1_in_2
11 F 2 read2+ 45 62 0 18 * id:Z:read2_in_2
12 U 16 1 3 15 2_to_6 16sub
13 H ac:Z:test2
14 # another comment
15 S 12 150 *
16 S 4 120 *
17 H VN:Z:2.0
18 E 1_to_3 1+ 3+ 112 122$ 0 12 10M
19 G 1_to_11 1+ 11- 120 *
20 E 11_to_12 11+ 12+ 18 140$ 0 122 122M
21 S 6 150 *
22 X custom_record xx:Z:testtag
23 X custom_record X2
24 G 2_to_12 2- 12+ 500 50
25 O 15 11+ 11_to_13+ 13+ xx:i:-1
26 Y another_custom_record
27 U 16sub 2 3
28 S 2 120 * xx:Z:sometag
29 H aa:i:12 ab:Z:test1
30 H aa:i:15
31 E 1_to_5 1+ 5+ 0 122$ 2 124 * zz:Z:tag
+0
-12
tests/testdata/invalid/edge_wrong_lastpos.gfa2 less more
0 H VN:Z:2.0
1 H ul:Z:https://github.com/sjackman/assembly-graph/blob/master/sample.gfa
2 S 1 8 CGATGCAA
3 S 2 10 TGCAAAGTAC
4 S 3 21 TGCAACGTATAGACTTGTCAC RC:i:4
5 S 4 7 GCATATA
6 S 5 8 CGATGATA
7 S 6 4 ATGA
8 E * 1+ 2+ 3 9$ 0 5 5M
9 E * 3+ 2+ 21$ 21$ 0 0 0M
10 E * 3+ 4- 17 21$ 3 7$ 1M1D2M
11 E * 4- 5+ 0 0 0 0 0M
+0
-33
tests/testdata/invalid/fragment_wrong_lastpos.gfa2 less more
0 # File used for the collections test
1 # similar but NOT equivalent to the gfa1 file!
2 S 1 122 *
3 S 3 29 TGCTAGCTGACTGTCGATGCTGTGTG
4 E 1_to_2 1+ 2+ 110 122$ 0 12 12M
5 S 5 130 *
6 S 13 150 *
7 E 2_to_6 2+ 6+ 0 122$ 10 132 122M
8 O 14 11+ 12+
9 S 11 140 * xx:i:11
10 F 3 read1+ 0 42$ 12 55 * id:Z:read1_in_3
11 F 2 read2+ 45 62 0 18 * id:Z:read2_in_2
12 U 16 1 3 15 2_to_6 16sub
13 H ac:Z:test2
14 # another comment
15 S 12 150 *
16 S 4 120 *
17 H VN:Z:2.0
18 E 1_to_3 1+ 3+ 112 122$ 0 12 10M
19 G 1_to_11 1+ 11- 120 *
20 E 11_to_12 11+ 12+ 18 140$ 0 122 122M
21 S 6 150 *
22 X custom_record xx:Z:testtag
23 X custom_record X2
24 E 11_to_13 11+ 13+ 20 140$ 0 120 120M
25 G 2_to_12 2- 12+ 500 50
26 O 15 11+ 11_to_13+ 13+ xx:i:-1
27 Y another_custom_record
28 U 16sub 2 3
29 S 2 120 * xx:Z:sometag
30 H aa:i:12 ab:Z:test1
31 H aa:i:15
32 E 1_to_5 1+ 5+ 0 122$ 2 124 * zz:Z:tag
+0
-12
tests/testdata/invalid/inconsistent_length.gfa1 less more
0 H VN:Z:1.0
1 H ul:Z:https://github.com/sjackman/assembly-graph/blob/master/sample.gfa
2 S 1 CGATGCAA LN:i:12
3 S 2 TGCAAAGTAC
4 S 3 TGCAACGTATAGACTTGTCAC RC:i:4
5 S 4 GCATATA
6 S 5 CGATGATA
7 S 6 ATGA
8 L 1 + 2 + 5M
9 L 3 + 2 + 0M
10 L 3 + 4 - 1M1D2M1S
11 L 4 - 5 + 0M
+0
-21
tests/testdata/invalid/link_missing.gfa1 less more
0 # File used for the collections test
1 S 1 *
2 S 3 CGATGCTAGCTGACTGTCGATGCTGTGTG
3 L 1 + 2 + 12M ID:Z:1_to_2
4 S 5 *
5 S 13 *
6 C 2 + 6 + 10 122M ID:Z:2_to_6
7 P 14 11+,12+ 122M
8 S 11 *
9 H ac:Z:test2
10 S 12 *
11 S 4 *
12 H VN:Z:1.0
13 L 1 + 3 + 12M ID:Z:1_to_3
14 S 6 *
15 L 11 + 13 + 120M ID:Z:11_to_13
16 P 15 11+,13+ 120M
17 S 2 * xx:Z:sometag
18 H aa:i:12 ab:Z:test1
19 H aa:i:15
20 C 1 + 5 + 12 120M ID:Z:1_to_5
+0
-21
tests/testdata/invalid/segment_missing.gfa1 less more
0 # comment
1 S 3 CGATGCTAGCTGACTGTCGATGCTGTGTG
2 L 1 + 2 + 12M ID:Z:1_to_2
3 S 5 *
4 S 13 *
5 C 2 + 6 + 10 122M ID:Z:2_to_6
6 P 14 11+,12+ 122M
7 S 11 *
8 H ac:Z:test2
9 S 12 *
10 S 4 *
11 H VN:Z:1.0
12 L 1 + 3 + 12M ID:Z:1_to_3
13 L 11 + 12 + 122M ID:Z:11_to_12
14 S 6 *
15 L 11 + 13 + 120M ID:Z:11_to_13
16 P 15 11+,13+ 120M
17 S 2 * xx:Z:sometag
18 H aa:i:12 ab:Z:test1
19 H aa:i:15
20 C 1 + 5 + 12 120M ID:Z:1_to_5
+0
-32
tests/testdata/invalid/segment_missing.gfa2 less more
0 # File used for the collections test
1 # similar but NOT equivalent to the gfa1 file!
2 S 3 29 TGCTAGCTGACTGTCGATGCTGTGTG
3 E 1_to_2 1+ 2+ 110 122$ 0 12 12M
4 S 5 130 *
5 S 13 150 *
6 E 2_to_6 2+ 6+ 0 122$ 10 132 122M
7 O 14 11+ 12+
8 S 11 140 * xx:i:11
9 F 2 read1+ 0 42 12 55 * id:Z:read1_in_2
10 F 2 read2+ 45 62 0 18 * id:Z:read2_in_2
11 U 16 1 3 15 2_to_6 16sub
12 H ac:Z:test2
13 # another comment
14 S 12 150 *
15 S 4 120 *
16 H VN:Z:2.0
17 E 1_to_3 1+ 3+ 112 122$ 0 12 10M
18 G 1_to_11 1+ 11- 120 *
19 E 11_to_12 11+ 12+ 18 140$ 0 122 122M
20 S 6 150 *
21 X custom_record xx:Z:testtag
22 X custom_record X2
23 E 11_to_13 11+ 13+ 20 140$ 0 120 120M
24 G 2_to_12 2- 12+ 500 50
25 O 15 11+ 11_to_13+ 13+ xx:i:-1
26 Y another_custom_record
27 U 16sub 2 3
28 S 2 120 * xx:Z:sometag
29 H aa:i:12 ab:Z:test1
30 H aa:i:15
31 E 1_to_5 1+ 5+ 0 122$ 2 124 * zz:Z:tag
0 H VN:Z:2.0
1 S 1 6 AGCGTA
2 S 2 6 TAACAG
3 S 3 6 GCTAGT
4 S 4 6 TCAGCG
5 O P1 1+ 2+ 3+ 5+
0 S s1 CTGAA
1 S s2 ACG
2 S s3 TGGC
3 S s4 TGTGA
4 L s1 + s2 + 0M
5 L s2 + s3 + 0M
6 L s3 + s4 - 0M
0 S s1 5 CTGAA
1 S s2 3 ACG
2 S s3 4 TGGC
3 S s4 5 TGTGA
4 E 1 s1+ s2+ 5$ 5$ 0 0 0M
5 E 2 s2+ s3+ 3$ 3$ 0 0 0M
6 E 3 s3+ s4- 4$ 4$ 5$ 5$ 0M
0 >s0 not in GFA, should accept and warn
1 CGACAGCGATCGAGCATGCTAGCTAGTCGTAGCTAGCTAGCTA
2 >s1
3 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTCCG
4 >s2 more text in description line should not disturb
5 ACGTTCGTACGTACGT
6 ACGTACGTAAACGGTAC
7 >s3
8 ACGTAAACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTTTTACGTACGTA
9 >s4
10 ACGTACGTACGTTTTTTACGTACGTACGTA
11 CGTACGTACGTACGTACGTACGTACGGAAC
12 GTAGTACGTACGTACGTACG
0 H VN:Z:1.0
1 S s1 * LN:i:44
2 S s2 * LN:i:33
3 S s3 * LN:i:75
4 S s4 * LN:i:80
5 S s5 * LN:i:90 xx:Z:not in the Fasta, should accept and warn
6 L s1 + s2 + 5M1I2M
7 L s2 + s3 + 8M1D3M
8 L s3 + s4 + 16M1I9M
9 L s4 - s1 + 8M1I16M
0 H VN:Z:2.0
1 S s1 44 *
2 S s2 33 *
3 S s3 75 *
4 S s4 80 *
5 S s5 90 * xx:Z:not in the Fasta, should accept and warn
6 E 1 s1+ s2+ 37 44$ 0 8 5M1I2M
7 E 2 s2+ s3+ 21 33$ 0 11 8M1D3M
8 E 3 s3+ s4+ 50 75$ 0 26 16M1I9M
9 E 4 s4- s1+ 0 24 0 25 8M1I16M
0 H VN:Z:2.0
1 S 1 6 AGCGTA
2 S 2 6 TAACAG
3 S 3 6 GCTAGT
4 S 4 6 TCAGCG
5 O P1 1+ 2+ 3+ 4+