Codebase list gfapy / bc39b8e
New upstream snapshot. Debian Janitor 1 year, 5 months ago
54 changed file(s) with 658 addition(s) and 4004 deletion(s).
+0
-13
.gitignore
0 # Compiled python modules
1 *.pyc
2
3 # Setuptools distribution folder
4 /dist/
5
6 # Python egg metadata, regenerated from source files by setuptools
7 /*.egg-info
8 /*.egg
9
10 # Wheel data
11 build
12 conda
+0
-13
.travis.yml
0 language: python
1 arch:
2 - amd64
3 - ppc64le
4 python:
5 - "3.7"
6 env:
7 - PYTHONHASHSEED=0
8 install:
9 - pip install .
10 - pip install nose
11 - pip install Sphinx
12 script: "make tests"
+0
-26
CHANGES.txt
0 == 1.2.3 ==
1
2 - make it possible to count input header lines correctly
3
4 == 1.2.2 ==
5
6 - remove to/from aliases for GFA1 containment/link fields
7 since `to` potentially clashes with a tag name
8
9 == 1.2.1 ==
10
11 - fixed an issue with linear path merging (issue 21)
12 - GFA1 paths can contain a single segment only (issue 22)
13
14 == 1.2.0 ==
15
16 - fixed all open issues
17
18 == 1.1.0 ==
19
20 - fix: custom tags are not necessarily lower case
21 - additional support for rGFA subset of GFA1 by setting option dialect="rgfa"
22
23 == 1.0.0 ==
24
25 - initial release
+0
-6
CONTRIBUTORS
0 The following contributors helped to develop gfapy. Please drop a note to
1 gonnella@zbh.uni-hamburg.de if I left someone out or missed something.
2
3 - Tim Weber (translation of parts of the code from Ruby to Python)
4 - Stefan Kurtz (advises)
5
+0
-51
Makefile
0 default: tests
1
2 .PHONY: manual tests cleanup upload conda sdist wheel install
3
4 PYTHON=python3
5 PIP=pip3
6
7 # Install using pip
8 install:
9 ${PIP} install --upgrade --user --editable .
10
11 # Source distribution
12 sdist:
13 ${PYTHON} setup.py sdist
14
15 # Pure Python Wheel
16 wheel:
17 ${PYTHON} setup.py bdist_wheel
18
19 # Create the manual
20 manual:
21 cd doc && make latexpdf
22 mkdir -p manual
23 cp doc/_build/latex/Gfapy.pdf manual/gfapy-manual.pdf
24
25 doctest:
26 cd doc && make doctest
27
28 unittests:
29 @echo
30 @echo "Running unit test suite..."
31 @PYTHONHASHSEED=0 ${PYTHON} -m unittest discover
32
33 tests: doctest unittests
34
35 # Remove distribution files
36 cleanup:
37 rm -rf dist/ build/ gfapy.egg-info/
38
39 upload: tests cleanup sdist wheel
40 cd dist; \
41 for file in *; do \
42 twine check $$file && \
43 twine upload $$file; \
44 done
45
46 conda:
47 mkdir -p conda
48 cd conda; \
49 conda skeleton pypi gfapy; \
50 conda build gfapy
0 Metadata-Version: 2.1
1 Name: gfapy
2 Version: 1.2.3
3 Summary: Library for handling data in the GFA1 and GFA2 formats
4 Home-page: https://github.com/ggonnella/gfapy
5 Author: Giorgio Gonnella and others (see CONTRIBUTORS)
6 Author-email: gonnella@zbh.uni-hamburg.de
7 License: ISC
8 Keywords: bioinformatics genomics sequences GFA assembly graphs
9 Classifier: Development Status :: 5 - Production/Stable
10 Classifier: Environment :: Console
11 Classifier: Intended Audience :: Developers
12 Classifier: Intended Audience :: End Users/Desktop
13 Classifier: Intended Audience :: Science/Research
14 Classifier: License :: OSI Approved :: ISC License (ISCL)
15 Classifier: Operating System :: MacOS :: MacOS X
16 Classifier: Operating System :: POSIX :: Linux
17 Classifier: Programming Language :: Python :: 3 :: Only
18 Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
19 Classifier: Topic :: Software Development :: Libraries
20 License-File: LICENSE.txt
21
22 Gfapy
23 ~~~~~
24
25 |travis| |readthedocs| |latesttag| |license|
26
27 |bioconda| |pypi| |debian| |ubuntu|
28
29 .. sphinx-begin
30
31 Graphical Fragment Assembly (GFA) is a family of formats for the representation
32 of sequence graphs, including assembly, variation and splicing graphs.
33 Two versions of GFA have been defined (GFA1 and GFA2), and several sequence
34 analysis programs have adopted them as interchange formats,
35 which makes it easy to combine different sequence analysis tools.
36
37 This library implements the GFA1 and GFA2 specifications
38 described at https://github.com/GFA-spec/GFA-spec/blob/master/GFA-spec.md.
39 It allows the user to create a Gfa object from a file in the GFA format
40 or from scratch, to enumerate the graph elements (segments, links,
41 containments, paths and header lines), to traverse the graph (by
42 traversing all links outgoing from or incoming to a segment), to search for
43 elements (e.g. which links connect two segments) and to manipulate the
44 graph (e.g. to eliminate a link or a segment or to duplicate a segment
45 distributing the read counts evenly on the copies).
46
47 The GFA format can easily be extended by users, who can define their own custom
48 tags and record types. In Gfapy, it is easy to write extension modules
49 that define custom record types and datatypes for the parsing
50 and validation of custom fields. Custom lines can be connected, using
51 references, to each other and to lines of the standard record types.
52
53 Requirements
54 ~~~~~~~~~~~~
55
56 Gfapy has been written for Python 3 and tested using Python version 3.7.
57 It does not require any additional Python packages or other software.
58
59 Installation
60 ~~~~~~~~~~~~
61
62 Gfapy is distributed as a Python package and can be installed using
63 the Python package manager pip, as well as conda (in the Bioconda channel).
64 It is also available as a package in some Linux distributions (Debian, Ubuntu).
65
66 The following command installs the current stable version from the Python
67 Packages index::
68
69 pip install gfapy
70
71 If you would like to install the current development version from GitHub,
72 use the following command::
73
74 pip install -e git+https://github.com/ggonnella/gfapy.git#egg=gfapy
75
76 Alternatively it is possible to install gfapy using conda. Gfapy is
77 included in the Bioconda (https://bioconda.github.io/) channel::
78
79 conda install -c bioconda gfapy
80
81 Usage
82 ~~~~~
83
84 If you installed gfapy as described above, you can import it in your script
85 using the conventional Python syntax::
86
87 >>> import gfapy
88
89 Documentation
90 ~~~~~~~~~~~~~
91
92 The documentation, including this introduction to Gfapy, a user manual
93 and the API documentation, is hosted on the ReadTheDocs server
94 at the URL http://gfapy.readthedocs.io/en/latest/. It can also be
95 downloaded as PDF from the URL
96 https://github.com/ggonnella/gfapy/blob/master/manual/gfapy-manual.pdf.
97
98 References
99 ~~~~~~~~~~
100
101 Giorgio Gonnella and Stefan Kurtz "GfaPy: a flexible and extensible software
102 library for handling sequence graphs in Python", Bioinformatics (2017) btx398
103 https://doi.org/10.1093/bioinformatics/btx398
104
105 .. sphinx-end
106
107 .. |travis|
108 image:: https://travis-ci.com/ggonnella/gfapy.svg?branch=master
109 :target: https://travis-ci.com/ggonnella/gfapy
110 :alt: Travis
111
112 .. |latesttag|
113 image:: https://img.shields.io/github/v/tag/ggonnella/gfapy
114 :target: https://github.com/ggonnella/gfapy/tags
115 :alt: Latest GitHub tag
116
117 .. |readthedocs|
118 image:: https://readthedocs.org/projects/pip/badge/?version=stable
119 :target: https://pip.pypa.io/en/stable/?badge=stable
120 :alt: ReadTheDocs
121
122 .. |bioconda|
123 image:: https://img.shields.io/conda/vn/bioconda/gfapy
124 :target: https://bioconda.github.io/recipes/gfapy/README.html
125 :alt: Bioconda
126
127 .. |pypi|
128 image:: https://img.shields.io/pypi/v/gfapy
129 :target: https://pypi.org/project/gfapy/
130 :alt: PyPI
131
132 .. |debian|
133 image:: https://img.shields.io/debian/v/gfapy
134 :target: https://packages.debian.org/search?keywords=gfapy
135 :alt: Debian
136
137 .. |ubuntu|
138 image:: https://img.shields.io/ubuntu/v/gfapy
139 :target: https://packages.ubuntu.com/search?keywords=gfapy
140 :alt: Ubuntu
141
142 .. |license|
143 image:: https://img.shields.io/pypi/l/gfapy
144 :target: https://github.com/ggonnella/gfapy/blob/master/LICENSE.txt
145 :alt: ISC License
146
147 .. |requiresio|
148 image:: https://requires.io/github/ggonnella/gfapy/requirements.svg?branch=master
149 :target: https://requires.io/github/ggonnella/gfapy/requirements/?branch=master
150 :alt: Requirements Status
+0
-3
benchmarks/.gitignore
0 benchmark_results*
1 jobs_out
2 figure*
+0
-65
benchmarks/gfapy-benchmark-collectdata
0 #!/bin/bash
1
2 #
3 # This script is derived from rdj-spacepeak.sh in
4 # the GenomeTools repository (www.genometools.org).
5 #
6 # (c) 2010-2017 Giorgio Gonnella, ZBH, University of Hamburg
7 #
8
9 sleeptime=0.1
10
11 if [ $# -eq 0 ]; then
12 echo "Usage: $0 <command> [args]"
13 echo
14 echo "The following information is polled each $sleeptime seconds"
15 echo "from /proc/[pid]/status:"
16 echo
17 echo " VmPeak: Peak virtual memory size."
18 echo " VmSize: Virtual memory size."
19 echo " VmLck: Locked memory size."
20 echo " VmHWM: Peak resident set size (\"high water mark\")."
21 echo " VmRSS: Resident set size."
22 echo " VmData, VmStk, VmExe: Size of data, stack, and text segments."
23 echo " VmLib: Shared library code size."
24 echo " VmPTE: Page table entries size (since Linux 2.6.10)."
25 echo
26 echo "The command is run under /usr/bin/time."
27 exit
28 fi
29
30 # code inspired by:
31 # http://stackoverflow.com/questions/1080461/
32 # /peak-memory-measurement-of-long-running-process-in-linux
33 function __measure_space_peak {
34 types="Peak Size Lck HWM RSS Data Stk Exe Lib PTE"
35 declare -A maxVm
36 for vm in $types; do maxVm[$vm]=0; done
37 ppid=$$
38 /usr/bin/time $@ &
39 tpid=`pgrep -P ${ppid} -n -f time`
40 if [[ ${tpid} -ne "" ]]; then
41 pid=`pgrep -P ${tpid} -n -f $1` # $! may work here but not later
42 fi
43 declare -A Vm
44 while [[ ${tpid} -ne "" ]]; do
45 for vm in $types; do
46 if [[ ${pid} -ne "" ]]; then
47 Vm[$vm]=`cat /proc/${pid}/status 2> /dev/null \
48 | grep Vm${vm} | awk '{print $2}'`
49 if [[ ${Vm[$vm]} -gt ${maxVm[$vm]} ]]; then
50 maxVm[$vm]=${Vm[$vm]}
51 fi
52 fi
53 done
54 sleep $sleeptime
55 savedtpid=${tpid}
56 tpid=`pgrep -P ${ppid} -n -f time`
57 done
58 wait ${savedtpid} # don't wait, job is finished
59 exitstatus=$? # catch the exit status of wait, the same of $@
60 echo "Memory usage for $@:" >> /dev/stderr
61 for vm in $types; do echo " Vm$vm: ${maxVm[$vm]} kB" >> /dev/stderr; done
62 echo "Exit status: ${exitstatus}" >> /dev/stderr
63 }
64 __measure_space_peak $*
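The polling loop above reads the `Vm*` fields of `/proc/[pid]/status` and keeps the largest value seen for each. The following is a minimal Python sketch of that bookkeeping, not part of the removed script: the function names `parse_vm_fields` and `update_peaks` and the sample status text are hypothetical, and the sketch works on a string instead of a live process so that it can be run anywhere.

```python
import re


def parse_vm_fields(status_text):
    """Extract the Vm* fields (values in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        match = re.match(r"Vm(\w+):\s+(\d+)\s+kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def update_peaks(peaks, sample):
    """Keep, for each field, the largest value observed so far."""
    for key, value in sample.items():
        if value > peaks.get(key, 0):
            peaks[key] = value
    return peaks


# Hypothetical excerpt of a status file, used only for illustration.
SAMPLE_STATUS = (
    "Name:\tpython3\n"
    "VmPeak:\t   20480 kB\n"
    "VmHWM:\t    9000 kB\n"
    "VmRSS:\t    8192 kB\n"
)

peaks = update_peaks({}, parse_vm_fields(SAMPLE_STATUS))
print(peaks)
```

In a real poller, `status_text` would be re-read from `/proc/<pid>/status` at each interval, as the shell loop does every `$sleeptime` seconds.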
+0
-120
benchmarks/gfapy-plot-benchmarkdata.R
0 #!/usr/bin/env Rscript
1 # (c) Giorgio Gonnella, ZBH, Uni Hamburg, 2017
2
3 script.name = "./gfapy-plot-benchmarkdata.R"
4 args <- commandArgs(trailingOnly=TRUE)
5 if (is.na(args[3])) {
6 cat("Usage: ",script.name, " <inputfile> <outpfx> <variable>", "\n")
7 cat("variable: either 'segments' or 'connectivity'\n")
8 stop("Too few command-line parameters")
9 }
10 infname <- args[1]
11 cat("input data: ",infname,"\n")
12 outpfx <- args[2]
13 cat("output prefix:", outpfx, "\n")
14 xvar <- args[3]
15 if (xvar != 'segments' && xvar != 'connectivity') {
16 stop("variable must be one of: segments, connectivity")
17 }
18
19 library("ggplot2")
20
21 #
22 # The following function is described here:
23 # http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/#Helper%20functions
24 # Licence: CC0 (https://creativecommons.org/publicdomain/zero/1.0/)
25 #
26 ## Gives count, mean, standard deviation, standard error of the mean, and
27 ## confidence interval (default 95%).
28 ## data: a data frame.
29 ## measurevar: the name of a column that contains the var to be summarized
30 ## groupvars: a vector containing names of columns that contain grouping vars
31 ## na.rm: a boolean that indicates whether to ignore NA's
32 ## conf.interval: the percent range of the confidence interval (default 95%)
33 summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
34 conf.interval=.95, .drop=TRUE) {
35 library(plyr)
36
37 # New version of length which can handle NA's: if na.rm==T, don't count them
38 length2 <- function (x, na.rm=FALSE) {
39 if (na.rm) sum(!is.na(x))
40 else length(x)
41 }
42
43 # This does the summary. For each group's data frame, return a vector with
44 # N, mean, and sd
45 datac <- ddply(data, groupvars, .drop=.drop,
46 .fun = function(xx, col) {
47 c(N = length2(xx[[col]], na.rm=na.rm),
48 mean = mean (xx[[col]], na.rm=na.rm),
49 sd = sd (xx[[col]], na.rm=na.rm)
50 )
51 },
52 measurevar
53 )
54
55 # Rename the "mean" column
56 datac <- rename(datac, c("mean" = measurevar))
57
58 datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean
59
60 # Confidence interval multiplier for standard error
61 # Calculate t-statistic for confidence interval:
62 # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
63 ciMult <- qt(conf.interval/2 + .5, datac$N-1)
64 datac$ci <- datac$se * ciMult
65
66 return(datac)
67 }
68
69 data <- read.table(infname, header=T, sep="\t")
70
71 if (xvar == "segments") {
72 xvarname = "lines"
73 xlab="Lines (segments 1/3; dovetails 2/3)"
74 } else {
75 xvarname = "mult"
76 xlab="Dovetails/segment (segments=4000)"
77 data[c("lines")] = (data[c("mult")]+1)*4000
78 }
79
80 time.data <- summarySE(data, measurevar="time", groupvars=c(xvarname))
81 outfname = paste0(outpfx,"_time.log")
82 sink(outfname)
83 print(time.data)
84 time.lm <- lm(time ~ lines, data=data)
85 summary(time.lm)
86 time.nls <- nls(time ~ b + a * lines,
87 data=data, start=list(a=0,b=0),
88 algorithm="port", lower=c(0,0))
89 print(time.nls)
90 sink()
91
92 outfname = paste0(outpfx,"_space.log")
93 sink(outfname)
94 space.data <- summarySE(data, measurevar="space", groupvars=c(xvarname))
95 print(space.data)
96 space.lm <- lm(space ~ lines, data=data)
97 summary(space.lm)
98 space.nls <- nls(space ~ b + a * lines,
99 data=data, start=list(a=0,b=0),
100 algorithm="port", lower=c(0,0))
101 print(space.nls)
102 sink()
103
104 outfname = paste0(outpfx,"_time.pdf")
105 pdf(outfname)
106 print(ggplot(time.data, aes_string(x=xvarname, y="time")) +
107 geom_errorbar(aes(ymin=time-se, ymax=time+se), width=2) +
108 geom_line(size=0.2) + geom_point(size=3) +
109 ylab("Total elapsed time (s)") +
110 xlab(xlab))
111 outfname = paste0(outpfx,"_space.pdf")
112 pdf(outfname)
113 print(ggplot(space.data, aes_string(x=xvarname, y="space")) +
114 geom_errorbar(aes(ymin=space-se, ymax=space+se), width=2) +
115 geom_line(size=0.2) + geom_point(size=3) +
116 ylab("Memory peak (MB)") +
117 xlab(xlab))
118 dev.off()
119
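The `summarySE` helper in the R script groups the data by one variable and reports count, mean, standard deviation and standard error for each group. A rough Python equivalent, using only the standard library, might look as follows. The function name `summarize` and the sample rows are assumptions made for illustration, and the confidence interval is left out because the standard library has no Student's t quantile function.

```python
import math
import statistics
from collections import defaultdict


def summarize(rows, measure, group):
    """Per-group N, mean, sample standard deviation and standard error."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group]].append(row[measure])
    summary = {}
    for key, values in groups.items():
        n = len(values)
        # The sample standard deviation is undefined for a single value.
        sd = statistics.stdev(values) if n > 1 else float("nan")
        summary[key] = {
            "N": n,
            "mean": statistics.mean(values),
            "sd": sd,
            "se": sd / math.sqrt(n),
        }
    return summary


# Hypothetical measurements: elapsed time for two graph sizes.
rows = [
    {"lines": 3000, "time": 1.0},
    {"lines": 3000, "time": 3.0},
    {"lines": 6000, "time": 5.0},
]
result = summarize(rows, "time", "lines")
print(result)
```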
+0
-62
benchmarks/gfapy-plot-preparedata.py
0 #!/usr/bin/env python3
1 """
2 Prepare the output of the convert benchmark script for the R plotting script.
3 """
4
5 import argparse
6 import os
7 import sys
8 import re
9
10 op = argparse.ArgumentParser(description=__doc__)
11 op.add_argument('--version', action='version', version='%(prog)s 1.0')
12 op.add_argument("--mult", "-m", action="store_true",
13 help="set if variable n of edges/segment")
14 op.add_argument("inputfile")
15 opts = op.parse_args()
16
17 if not os.path.exists(opts.inputfile):
18 sys.stderr.write("Input file not found: {}\n".format(opts.inputfile))
19 exit(1)
20
21 with open(opts.inputfile) as inputfile:
22 header = True
23 if opts.mult:
24 outdata = ["mult", "time", "space", "time_per_line", "space_per_line"]
25 else:
26 outdata = ["lines", "time", "space", "time_per_line", "space_per_line"]
27 print("\t".join(outdata))
28 for line in inputfile:
29 if line[:3] == "###":
30 header = False
31 elif not header:
32 data = line.rstrip("\n\r").split("\t")
33 n_segments = data[2]
34 multiplier = data[3]
35 n_lines = int(int(n_segments) * (1+float(multiplier)))
36 elapsed = data[5]
37 elapsed_match = re.compile(r'\s+(\d+):(\d+\.\d+)').match(elapsed)
38 if elapsed_match:
39 minutes = int(elapsed_match.groups()[0])
40 seconds = float(elapsed_match.groups()[1])
41 seconds += minutes * 60
42 else:
43 elapsed_match = re.compile(r'\s+(\d+):(\d+):(\d+)').match(elapsed)
44 if elapsed_match:
45 hours = int(elapsed_match.groups()[0])
46 minutes = int(elapsed_match.groups()[1])
47 seconds = int(elapsed_match.groups()[2])
48 minutes += hours * 60
49 seconds += minutes * 60
50 else:
51 continue
52 memory = data[6]
53 memory = int(re.compile(r'(\d+) kB').match(memory).groups()[0])
54 megabytes = memory / 1024
55 if opts.mult:
56 outdata = [str(multiplier)]
57 else:
58 outdata = [str(n_lines)]
59 outdata += [str(seconds),str(megabytes),
60 str(seconds/n_lines), str(megabytes/n_lines)]
61 print("\t".join(outdata))
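The script above converts the elapsed-time field written by `/usr/bin/time` into seconds, accepting either `m:ss.ss` or `h:mm:ss`. The same conversion can be isolated in a small function, sketched below. The name `elapsed_to_seconds` is hypothetical, and the sketch strips surrounding whitespace first, whereas the original regular expressions expect leading whitespace.

```python
import re


def elapsed_to_seconds(elapsed):
    """Convert 'm:ss.ss' or 'h:mm:ss' to seconds; return None if unparsable."""
    text = elapsed.strip()
    match = re.fullmatch(r"(\d+):(\d+\.\d+)", text)
    if match:
        return int(match.group(1)) * 60 + float(match.group(2))
    match = re.fullmatch(r"(\d+):(\d+):(\d+)", text)
    if match:
        hours, minutes, seconds = (int(part) for part in match.groups())
        return float(hours * 3600 + minutes * 60 + seconds)
    return None


print(elapsed_to_seconds(" 1:02.50"))
print(elapsed_to_seconds("1:02:03"))
```

Returning `None` for unparsable input corresponds to the `continue` branch of the original loop, which skips such lines.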
+0
-61
benchmarks/gfapy-profiler.sh
0 #!/bin/bash
1 #$ -clear
2 #$ -q 16c.q
3 #$ -cwd
4 #$ -V
5 #$ -S /bin/bash
6 #$ -o jobs_out
7 #$ -j y
8
9 if [ $# -ne 4 ]; then
10 echo "Usage: $0 <operation> <version> <variable> <range>" > /dev/stderr
11 echo " operation: (mergelinear/convert) ../bin/gfapy-<operation> <gfafile> will be called" > /dev/stderr
12 echo " version: (gfa1/gfa2) gfa version" > /dev/stderr
13 echo " variable: (segments/connectivity)" > /dev/stderr
14 echo " range: (all/fast/slow)" > /dev/stderr
15 exit 1
16 fi
17
18 operation=$1
19 version=$2
20 variable=$3
21 range=$4
22
23 if [ $variable == "segments" ]; then
24 if [ $range == "fast" ]; then
25 nsegments="1000 2000 4000 8000 16000 32000 64000 128000"
26 elif [ $range == "slow" ]; then
27 nsegments="256000 512000 1024000 2048000 4096000"
28 elif [ $range == "all" ]; then
29 nsegments="1000 2000 4000 8000 16000 32000 64000 128000 256000 512000 1024000 2048000 4096000"
30 fi
31 else
32 nsegments=4000
33 fi
34
35 if [ $variable == "connectivity" ]; then
36 if [ $range == "fast" ]; then
37 multipliers="2 4 8 16 32 64"
38 elif [ $range == "slow" ]; then
39 multipliers="128 256"
40 elif [ $range == "all" ]; then
41 multipliers="2 4 8 16 32 64 128 256"
42 fi
43 else
44 multipliers=2
45 fi
46
47 replicate=1
48 for i in $nsegments; do
49 for m in $multipliers; do
50 fname="${i}_e${m}x.$replicate.${version}"
51 if [ ! -e $fname ]; then
52 ./gfapy-randomgraph --segments $i -g $version \
53 --dovetails-per-segment $m --with-sequence > $fname
54 fi
55 echo "Profiling $operation $fname ..."
56 rm -f $fname.$operation.prof
57 python3 -m cProfile -o $fname.$operation.prof \
58 ../bin/gfapy-$operation $fname 1> /dev/null
59 done
60 done
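The profiler script writes one `.prof` file per input graph using `python3 -m cProfile -o`. Such files can be inspected with the standard `pstats` module. The sketch below is self-contained: instead of a real gfapy run it profiles a stand-in function (`busy_work`, hypothetical) and writes to a temporary file whose name only mimics the script's naming scheme.

```python
import cProfile
import os
import pstats
import tempfile


def busy_work(n):
    """Stand-in workload; the real script profiles ../bin/gfapy-<operation>."""
    return sum(i * i for i in range(n))


def profile_to_file(path):
    """Profile the stand-in workload and dump the statistics to path."""
    profiler = cProfile.Profile()
    profiler.runcall(busy_work, 10000)
    profiler.dump_stats(path)


with tempfile.TemporaryDirectory() as tmpdir:
    prof_path = os.path.join(tmpdir, "example.convert.prof")
    profile_to_file(prof_path)
    stats = pstats.Stats(prof_path)
    # Show the five entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    total_calls = stats.total_calls
```

For a file produced by the script, only the `pstats.Stats(...)` and `print_stats(...)` lines are needed, with the path of the `.prof` file in place of `prof_path`.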
+0
-87
benchmarks/gfapy-randomgraph
0 #!/usr/bin/env python3
1 """
2 Creates a random graph for testing
3 """
4
5 import argparse
6 import sys
7 import random
8
9 op = argparse.ArgumentParser(description=__doc__)
10 op.add_argument("--segments", "-s", type=int,
11 help="number of segments", required=True)
12 op.add_argument("--slen", "-l", type=int, default=100,
13 help="length of segments sequence")
14 op.add_argument("--with-sequence", "-w", action="store_true")
15 op.add_argument("--dovetails-per-segment", "-d",
16 help="average number of dovetail edges per segment",
17 default=2.0, type=float)
18 op.add_argument('--gfa-version', "-g", default="gfa1",
19 help="gfa version", choices=("gfa1", "gfa2"))
20 op.add_argument('--version', action='version', version='%(prog)s 1.0')
21 opts = op.parse_args()
22
23 if opts.segments < 0:
24 sys.stderr.write("Error: the number of segments must be "+
25 ">= 0 ({})\n".format(opts.segments))
26 exit(1)
27 if opts.dovetails_per_segment < 0:
28 sys.stderr.write("Error: the average number of dovetails per segment must "+
29 "be >= 0 ({})\n".format(opts.dovetails_per_segment))
30 exit(1)
31 if opts.slen <= 0:
32 sys.stderr.write("Error: the length of segments sequence must be > 0"+
33 " ({})\n".format(opts.slen))
34 exit(1)
35
36 if opts.gfa_version == "gfa1":
37 print("H\tVN:Z:1.0")
38 else:
39 print("H\tVN:Z:2.0")
40
41 def random_sequence(slen):
42 sequence = []
43 for i in range(slen):
44 sequence.append(random.choice('ACGT'))
45 return "".join(sequence)
46
47 for i in range(opts.segments):
48 if opts.with_sequence:
49 sequence = random_sequence(opts.slen)
50 else:
51 sequence = "*"
52 if opts.gfa_version == "gfa1":
53 print("S\ts{}\t{}\tLN:i:{}".format(i, sequence, opts.slen))
54 else:
55 print("S\ts{}\t{}\t{}".format(i, opts.slen, sequence))
56
57 n_dovetails = int(opts.segments * opts.dovetails_per_segment)
58 edges = {}
59 for i in range(n_dovetails):
60 edge = False
61 while not edge:
62 s_from = random.randint(0, opts.segments-1)
63 s_from_or = random.choice('+-')
64 s_to = random.randint(0, opts.segments-1)
65 s_to_or = random.choice('+-')
66 if s_from not in edges:
67 edges[s_from] = {'+': {}, '-': {}}
68 if s_to not in edges[s_from][s_from_or]:
69 edges[s_from][s_from_or][s_to] = {'+': False, '-': False}
70 if not edges[s_from][s_from_or][s_to][s_to_or]:
71 edges[s_from][s_from_or][s_to][s_to_or] = True
72 edge = True
73 ovlen = opts.slen//10
74 if ovlen == 0: ovlen = 1
75 cigar = "{}M".format(ovlen)
76 if opts.gfa_version == "gfa1":
77 print("L\ts{}\t{}\ts{}\t{}\t{}\tID:Z:e{}".format(s_from, s_from_or, s_to,
78 s_to_or, cigar, i))
79 else:
80 s_from_begin = opts.slen - ovlen if s_from_or == "+" else 0
81 s_from_end = "{}$".format(opts.slen) if s_from_or == "+" else ovlen
82 s_to_begin = opts.slen - ovlen if s_to_or == "-" else 0
83 s_to_end = "{}$".format(opts.slen) if s_to_or == "-" else ovlen
84 print("E\te{}\ts{}{}\ts{}{}\t{}\t{}\t{}\t{}\t{}".format(
85 i, s_from, s_from_or, s_to, s_to_or, s_from_begin, s_from_end,
86 s_to_begin, s_to_end, cigar))
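For GFA2 output, the generator places each dovetail overlap at the end of the source segment and at the start of the target segment, with the positions mirrored when a segment is in reverse orientation. The sketch below isolates that position calculation. The function name `dovetail_positions` is hypothetical; the expressions are the ones used in the script.

```python
def dovetail_positions(slen, ovlen, from_or, to_or):
    """Return (from_begin, from_end, to_begin, to_end) for a GFA2 E line.

    A position equal to the segment length carries a trailing '$',
    as in the script above.
    """
    from_begin = slen - ovlen if from_or == "+" else 0
    from_end = "{}$".format(slen) if from_or == "+" else ovlen
    to_begin = slen - ovlen if to_or == "-" else 0
    to_end = "{}$".format(slen) if to_or == "-" else ovlen
    return from_begin, from_end, to_begin, to_end


# Forward-to-forward overlap of 10 bases between two 100-base segments.
print(dovetail_positions(100, 10, "+", "+"))
# Reverse-to-reverse overlap: the coordinates are mirrored.
print(dovetail_positions(100, 10, "-", "-"))
```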
+0
-76
benchmarks/gfapy-reproduce-manuscript-figure.py
0 #!/usr/bin/env python3
1 """
2 Run the benchmarks necessary to reproduce the figures of Section 3
3 of the Supplementary Information of the manuscript \"Gfapy: a flexible
4 and extensible software library for handling sequence graphs in Python\"
5 and plots the figures using R.
6 """
7
8 import argparse
9 import os
10
11 op = argparse.ArgumentParser(description=__doc__)
12 op.add_argument("fignum", help="Figure number", type=int,
13 choices=range(5,9))
14 op.add_argument("--queue", default=None,
15 help="Use the specified queue of a Grid Engine cluster system "+
16 "(e.g. 16c.q). If not provided, the benchmarks are run on the "+
17 "local computer.")
18 op.add_argument("--nrepl",type=int, default=3,
19 help="Number of replicates (default: 3)")
20 op.add_argument("--fast",action="store_true",
21 help="Run only the three fastest datapoints of the benchmark")
22 opts = op.parse_args()
23
24 if opts.fignum == 5:
25 testvar="segments"
26 operation="convert"
27 elif opts.fignum == 6:
28 testvar="connectivity"
29 operation="convert"
30 elif opts.fignum == 7:
31 testvar="segments"
32 operation="mergelinear"
33 else: # 8
34 testvar="connectivity"
35 operation="mergelinear"
36
37 if opts.fast:
38 subset="fast"
39 else:
40 subset="all"
41
42 run_benchmarks_args="figure{}.out {} gfa2 {} {} {}".format(
43 opts.fignum, operation, testvar, subset, opts.nrepl)
44
45 if not opts.queue:
46 os.system("./gfapy-run-benchmarks.sh {}".format(run_benchmarks_args))
47 else:
48 qsub_script_pfx=\
49 """#!/bin/bash
50 #$ -clear
51 #$ -q {}
52 #$ -cwd
53 #$ -V
54 #$ -S /bin/bash
55 #$ -o jobs_out
56 #$ -j y
57 #$ -sync y
58
59 """.format(opts.queue)
60 with open("gfapy-run-benchmarks.sh", "r") as input_file:
61 content = input_file.read()
62 with open("gfapy-run-benchmarks.qsub", "w") as output_file:
63 output_file.write(qsub_script_pfx)
64 output_file.write(content)
65 os.system("mkdir -p jobs_out")
66 os.system("qsub gfapy-run-benchmarks.qsub {}".format(run_benchmarks_args))
67
68 if testvar == "segments":
69 prepareflag=""
70 else:
71 prepareflag="--mult"
72 os.system("./gfapy-plot-preparedata.py {} figure{}.out > figure{}.dat".format(
73 prepareflag, opts.fignum, opts.fignum))
74 os.system("./gfapy-plot-benchmarkdata.R figure{}.dat figure{} {}".format(
75 opts.fignum, opts.fignum, testvar))
+0
-67
benchmarks/gfapy-run-benchmarks.sh
0 #!/bin/bash
1
2 if [ $# -ne 6 ]; then
3 echo "Usage: $0 <outfile> <operation> <version> <variable> <range> <nrepl>" > /dev/stderr
4 echo " outfile: will be overwritten if exists" > /dev/stderr
5 echo " operation: (mergelinear/convert) ../bin/gfapy-<operation> <gfafile> will be called" > /dev/stderr
6 echo " version: (gfa1/gfa2) gfa version" > /dev/stderr
7 echo " variable: (segments/connectivity)" > /dev/stderr
8 echo " range: (all/fast/slow)" > /dev/stderr
9 echo " nrepl: (e.g. 3) number of replicates" > /dev/stderr
10 exit 1
11 fi
12
13 outfile=$1
14 operation=$2
15 version=$3
16 variable=$4
17 range=$5
18 nrepl=$6
19
20 if [ $variable == "segments" ]; then
21 if [ $range == "fast" ]; then
22 nsegments="1000 2000 4000"
23 elif [ $range == "slow" ]; then
24 nsegments="8000 16000 32000 64000 128000 256000 512000 1024000 2048000"
25 elif [ $range == "all" ]; then
26 nsegments="1000 2000 4000 8000 16000 32000 64000 128000 256000 512000 1024000 2048000"
27 fi
28 else
29 nsegments=4000
30 fi
31
32 if [ $variable == "connectivity" ]; then
33 if [ $range == "fast" ]; then
34 multipliers="2 4 8"
35 elif [ $range == "slow" ]; then
36 multipliers="16 32 64 128 256"
37 elif [ $range == "all" ]; then
38 multipliers="2 4 8 16 32 64 128 256"
39 fi
40 else
41 multipliers=2
42 fi
43
44 mkdir -p benchmark_results
45 rm -f $outfile
46 echo "# hostname: $HOSTNAME" > $outfile
47 echo "### benchmark data:" >> $outfile
48 for ((replicate=1;replicate<=nrepl;++replicate)); do
49 for i in $nsegments; do
50 for m in $multipliers; do
51 fname="benchmark_results/${i}_e${m}x.$replicate.${version}"
52 bmout="$fname.$operation.benchmark"
53 rm -f $bmout
54 if [ ! -e $fname ]; then
55 ./gfapy-randomgraph --segments $i -g $version \
56 --dovetails-per-segment $m --with-sequence > $fname
57 fi
58 ./gfapy-benchmark-collectdata ../bin/gfapy-$operation $fname \
59 1> /dev/null 2> $bmout
60 elapsed=$(grep -P -o "(?<=) [^ ]*(?=elapsed)" $bmout)
61 memory=$(grep -P -o "(?<=VmHWM: ).*" $bmout)
62 filesize=( $(ls -ln $fname) );filesize=${filesize[4]}
63 echo -e "gfapy-$operation\t$version\t$i\t$m\t$replicate\t$elapsed\t$memory\t$filesize" >> $outfile
64 done
65 done
66 done
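The two `grep -P` calls above pull the elapsed time and the `VmHWM` peak out of the per-run benchmark file. A Python sketch of the same extraction is given below. The function name `extract_measurements` and the sample text are assumptions: the sample imitates the default GNU `time` output followed by the lines printed by `gfapy-benchmark-collectdata`, and real output may differ.

```python
import re


def extract_measurements(benchmark_text):
    """Return (elapsed string, VmHWM in kB) found in a benchmark output file."""
    elapsed = re.search(r"(\S+)elapsed", benchmark_text)
    hwm = re.search(r"VmHWM:\s*(\d+) kB", benchmark_text)
    return (
        elapsed.group(1) if elapsed else None,
        int(hwm.group(1)) if hwm else None,
    )


# Hypothetical benchmark output, for illustration only.
SAMPLE = (
    "0.52user 0.04system 0:00.57elapsed 99%CPU "
    "(0avgtext+0avgdata 30000maxresident)k\n"
    "Memory usage for ../bin/gfapy-convert graph.gfa2:\n"
    "  VmHWM: 30000 kB\n"
    "Exit status: 0\n"
)

print(extract_measurements(SAMPLE))
```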
+0
-47
bin/gfapy-diff
0 #!/usr/bin/env python3
1 """
2 Compare two GFA files
3
4 Note: the current version is not yet fully functional and only checks segments.
5 Work in progress.
6 """
7
8 import sys
9 import os
10 import gfapy
11 import argparse
12
13 op = argparse.ArgumentParser(description=__doc__)
14 op.add_argument('--version', action='version', version='%(prog)s 0.1')
15 op.add_argument("filename1")
16 op.add_argument("filename2")
17 opts = op.parse_args()
18
19 gfa1 = gfapy.Gfa.from_file(opts.filename1)
20 gfa2 = gfapy.Gfa.from_file(opts.filename2)
21
22 different = False
23
24 if gfa1.version != gfa2.version:
25 print("# different version")
26 exit(1)
27 else:
28 for s in gfa1.segments:
29 s2 = gfa2.segment(s)
30 if s2 is None:
31 different = True
32 print("# segment {} in {} but not in {}".format(s.name, opts.filename1, opts.filename2))
33 elif s.diff(s2):
34 different = True
35 for diff in s.diff(s2):
36 print(diff)
37 for s in gfa2.segments:
38 s1 = gfa1.segment(s)
39 if s1 is None:
40 different = True
41 print("# segment {} in {} but not in {}".format(s.name, opts.filename2, opts.filename1))
42
43 if different:
44 exit(1)
45 else:
46 exit(0)
+0
-51
bin/gfapy-fillseq
0 #!/usr/bin/env python3
1 """
2 Add sequences from a Fasta file to a GFA file.
3 """
4
5 import argparse
6 import sys
7 import gfapy
8
9 op = argparse.ArgumentParser(description=__doc__)
10 op.add_argument("inputgfa")
11 op.add_argument("inputfasta")
12 op.add_argument("-q", "--quiet", action="store_true", help="silence warnings")
13 op.add_argument("-v", "--verbose", action="store_true", help="verbose output")
14 op.add_argument("-V", '--version', action='version', version='%(prog)s 0.1')
15 opts = op.parse_args()
16
17 # note when applying to the output of older versions of Canu (1.6)
18 # the following fix to the GFA VN tag is necessary:
19 # sed -i s'/VN:Z:bogart\/edges/VN:Z:1.0/' canu.contigs.gfa
20
21 g = gfapy.Gfa.from_file(opts.inputgfa)
22
23 segment = None
24 slines = []
25 with open(opts.inputfasta) as f:
26 for line in f:
27 line = line.strip()
28 if line.startswith(">"):
29 if segment:
30 segment.sequence = "".join(slines)
31 sname = line[1:].split(" ")[0]
32 if opts.verbose:
33 sys.stderr.write("Processing segment {}...\n".format(sname))
34 segment = g.segment(sname)
35 if not opts.quiet and not segment:
36 sys.stderr.write("Warning: Segment with ID {} ".format(sname)+
37 "found in Fasta but not in GFA file\n")
38 slines = []
39 else:
40 slines.append(line)
41 if segment:
42 segment.sequence = "".join(slines)
43
44 if not opts.quiet:
45 for s in g.segments:
46 if s.sequence == gfapy.Placeholder:
47 sys.stderr.write("Warning: Segment with ID {} ".format(s.name)+
48 "found in GFA but not in Fasta file\n")
49
50 print(g)
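The script above reads the Fasta file record by record and assigns each sequence to the GFA segment of the same name. The Fasta-reading part can be sketched without gfapy as a function that returns a dictionary from record name to sequence. The name `read_fasta` is hypothetical; as in the script, the record name is the header text up to the first space.

```python
import io


def read_fasta(handle):
    """Map each record name to its sequence, joining wrapped lines."""
    sequences = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                sequences[name] = "".join(chunks)
            name = line[1:].split(" ")[0]
            chunks = []
        elif line:
            chunks.append(line)
    # The last record has no following header, so store it here.
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


example = io.StringIO(">s1 first contig\nAC\nGT\n>s2\nTT\n")
print(read_fasta(example))
```

With such a dictionary, the segment update in the script reduces to a lookup by segment name, and names found in only one of the two files can be reported by comparing key sets.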
0 gfapy (1.2.3+git20220408.1.12b31da+dfsg-1) UNRELEASED; urgency=low
1
2 * New upstream snapshot.
3
4 -- Debian Janitor <janitor@jelmer.uk> Thu, 03 Nov 2022 08:41:06 -0000
5
0 6 gfapy (1.2.3+dfsg-1) unstable; urgency=medium
1 7
2 8 * New upstream release.
+0
-2
doc/.gitignore
0 source
1 _build
+0
-23
doc/Makefile
0 # Minimal makefile for Sphinx documentation
1 #
2
3 # You can set these variables from the command line.
4 SPHINXOPTS =
5 SPHINXBUILD = sphinx-build
6 SPHINXPROJ = Gfapy
7 SOURCEDIR = .
8 BUILDDIR = _build
9
10 # Put it first so that "make" without argument is like "make help".
11 help:
12 @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
13
14 .PHONY: help Makefile
15
16 cleanup:
17 rm source _build -rf
18
19 # Catch-all target: route all unknown targets to Sphinx using the new
20 # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
21 %: Makefile
22 @PYTHONHASHSEED=0 $(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+0
-4
doc/changelog.rst
0 Changelog
1 ---------
2 .. include:: ../CHANGES.txt
3 :literal:
+0
-173
doc/conf.py
0 #!/usr/bin/env python3
1 # -*- coding: utf-8 -*-
2 #
3 # Gfapy documentation build configuration file, created by
4 # sphinx-quickstart on Thu Mar 16 10:13:57 2017.
5 #
6 # This file is execfile()d with the current directory set to its
7 # containing dir.
8 #
9 # Note that not all possible configuration values are present in this
10 # autogenerated file.
11 #
12 # All configuration values have a default; values that are commented out
13 # serve to show the default.
14
15 # If extensions (or modules to document with autodoc) are in another directory,
16 # add these directories to sys.path here. If the directory is relative to the
17 # documentation root, use os.path.abspath to make it absolute, like shown here.
18 #
19 import os
20 import sys
21 sys.path.insert(0, os.path.abspath('.'))
22 sys.path.insert(0, os.path.abspath('../'))
23
24 # -- General configuration ------------------------------------------------
25
26 # If your documentation needs a minimal Sphinx version, state it here.
27 #
28 # needs_sphinx = '1.0'
29
30 # Add any Sphinx extension module names here, as strings. They can be
31 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
32 # ones.
33 extensions = [
34 'sphinx.ext.autodoc',
35 'sphinx.ext.doctest',
36 'sphinx.ext.todo',
37 'sphinx.ext.coverage',
38 'sphinx.ext.imgmath',
39 'sphinx.ext.ifconfig',
40 'sphinx.ext.viewcode',
41 'sphinx.ext.githubpages',
42 'sphinx.ext.napoleon'
43 ]
44
45 # Napoleon
46 napoleon_numpy_docstring = True
47 napoleon_google_docstring = True
48 napoleon_use_param = False
49 napoleon_use_ivar = True
50
51 # Default role:
52 default_role = 'any'
53
54 # Add any paths that contain templates here, relative to this directory.
55 templates_path = ['_templates']
56
57 # The suffix(es) of source filenames.
58 # You can specify multiple suffix as a list of string:
59 #
60 # source_suffix = ['.rst', '.md']
61 source_suffix = '.rst'
62
63 # The master toctree document.
64 master_doc = 'index'
65
66 # General information about the project.
67 project = 'Gfapy'
68 copyright = '2017--2022, Giorgio Gonnella and others (see CONTRIBUTORS)'
69 author = 'Giorgio Gonnella and others (see CONTRIBUTORS)'
70
71 # The version info for the project you're documenting, acts as replacement for
72 # |version| and |release|, also used in various other places throughout the
73 # built documents.
74 #
75 # The short X.Y version.
76 version = '1.2'
77 # The full version, including alpha/beta/rc tags.
78 release = '1.2.3'
79
80 # The language for content autogenerated by Sphinx. Refer to documentation
81 # for a list of supported languages.
82 #
83 # This is also used if you do content translation via gettext catalogs.
84 # Usually you set "language" from the command line for these cases.
85 language = None
86
87 # List of patterns, relative to source directory, that match files and
88 # directories to ignore when looking for source files.
89 # These patterns also affect html_static_path and html_extra_path
90 exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
91
92 # The name of the Pygments (syntax highlighting) style to use.
93 pygments_style = 'sphinx'
94
95 # If true, `todo` and `todoList` produce output, else they produce nothing.
96 todo_include_todos = False
97
98 # -- Options for HTML output ----------------------------------------------
99
100 # The theme to use for HTML and HTML Help pages. See the documentation for
101 # a list of builtin themes.
102 #
103 html_theme = 'sphinx_rtd_theme'
104
105 # Theme options are theme-specific and customize the look and feel of a theme
106 # further. For a list of options available for each theme, see the
107 # documentation.
108 #
109 # html_theme_options = {}
110
111 # Add any paths that contain custom static files (such as style sheets) here,
112 # relative to this directory. They are copied after the builtin static files,
113 # so a file named "default.css" will overwrite the builtin "default.css".
114 html_static_path = ['_static']
115
116
117 # -- Options for HTMLHelp output ------------------------------------------
118
119 # Output file base name for HTML help builder.
120 htmlhelp_basename = 'Gfapydoc'
121
122
123 # -- Options for LaTeX output ---------------------------------------------
124
125 latex_elements = {
126 # The paper size ('letterpaper' or 'a4paper').
127 #
128 'papersize': 'a4paper',
129
130 # The font size ('10pt', '11pt' or '12pt').
131 #
132 # 'pointsize': '10pt',
133
134 # Additional stuff for the LaTeX preamble.
135 #
136 # 'preamble': '',
137
138 # Latex figure (float) alignment
139 #
140 # 'figure_align': 'htbp',
141 }
142
143 # Grouping the document tree into LaTeX files. List of tuples
144 # (source start file, target name, title,
145 # author, documentclass [howto, manual, or own class]).
146 latex_documents = [
147 (master_doc, 'Gfapy.tex', 'Gfapy Documentation',
148 'Giorgio Gonnella', 'manual'),
149 ]
150
151 # -- Options for manual page output ---------------------------------------
152
153 # One entry per manual page. List of tuples
154 # (source start file, name, description, authors, manual section).
155 man_pages = [
156 (master_doc, 'gfapy', 'Gfapy Documentation',
157 [author], 1)
158 ]
159
160
161 # -- Options for Texinfo output -------------------------------------------
162
163 # Grouping the document tree into Texinfo files. List of tuples
164 # (source start file, target name, title, author,
165 # dir menu entry, description, category)
166 texinfo_documents = [
167 (master_doc, 'Gfapy', 'Gfapy Documentation',
168 author, 'Gfapy',
169 'Python library for the Graphical Fragment Assembly (GFA) format.',
170 'Miscellaneous'),
171 ]
172
+0
-36
doc/index.rst less more
0 .. Gfapy documentation master file, created by
1 sphinx-quickstart on Thu Mar 16 10:13:57 2017.
2 You can adapt this file completely to your liking, but it should at least
3 contain the root `toctree` directive.
4
5 Gfapy documentation
6 ===================
7
8 .. toctree::
9 :maxdepth: 2
10 :caption: Contents:
11
12 readme
13 changelog
14
15 tutorial/gfa
16 tutorial/validation
17 tutorial/positional_fields
18 tutorial/placeholders
19 tutorial/positions
20 tutorial/alignments
21 tutorial/tags
22 tutorial/references
23 tutorial/header
24 tutorial/custom_records
25 tutorial/comments
26 tutorial/errors
27 tutorial/graph_operations
28 tutorial/rgfa
29
30 Indices and tables
31 ==================
32
33 * :ref:`genindex`
34 * :ref:`modindex`
35 * :ref:`search`
+0
-5
doc/readme.rst less more
0 Introduction
1 ============
2 .. include:: ../README.rst
3 :start-after: sphinx-begin
4 :end-before: sphinx-end
+0
-1
doc/run_apidoc.sh less more
0 sphinx-apidoc -o source/ ../gfapy
+0
-238
doc/tutorial/alignments.rst less more
0 .. testsetup:: *
1
2 import gfapy
3 from gfapy import is_placeholder, Alignment
4 h = "H\tVN:Z:2.0\tTS:i:100"
5 sA = "S\tA\t100\t*"
6 sB = "S\tB\t100\t*"
7 x = "E\tx\tA+\tB-\t0\t100$\t0\t100$\t4,2\tTS:i:50"
8 gfa = gfapy.Gfa([h, sA, sB, x])
9
10 .. _alignments:
11
12 Alignments
13 ~~~~~~~~~~
14
15 Some GFA1 (L/C overlap, P overlaps) and GFA2 (E/F alignment) fields contain
16 alignments or lists of alignments. The alignment can be left unspecified and a
17 placeholder symbol ``*`` used instead. In GFA1 the alignments can be given as
18 CIGAR strings, in GFA2 also as Dazzler traces.
19
20 Gfapy uses three different classes for representing the content of alignment fields:
21 :class:`~gfapy.alignment.cigar.CIGAR`, :class:`~gfapy.alignment.trace.Trace`
22 and :class:`~gfapy.alignment.placeholder.AlignmentPlaceholder`.
23
24 Creating an alignment
25 ^^^^^^^^^^^^^^^^^^^^^
26
27 An alignment instance is usually created from its GFA string
28 representation or from a list by using the
29 :class:`gfapy.Alignment() <gfapy.alignment.alignment.Alignment>`
30 constructor.
31
32 .. doctest::
33
34 >>> from gfapy import Alignment
35 >>> Alignment("*")
36 gfapy.AlignmentPlaceholder()
37 >>> Alignment("10,10,10")
38 gfapy.Trace([10,10,10])
39 >>> Alignment([10,10,10])
40 gfapy.Trace([10,10,10])
41 >>> Alignment("30M2I")
42 gfapy.CIGAR([gfapy.CIGAR.Operation(30,'M'), gfapy.CIGAR.Operation(2,'I')])
43
44 If the argument is already an alignment object, it is returned unchanged,
45 so it is always safe to call the constructor on a variable which can
46 contain either a string or an alignment instance:
47
48 .. doctest::
49
50 >>> Alignment(Alignment("*"))
51 gfapy.AlignmentPlaceholder()
52 >>> Alignment(Alignment("10,10"))
53 gfapy.Trace([10,10])
54
55 Recognizing undefined alignments
56 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
57
58 The :func:`gfapy.is_placeholder() <gfapy.placeholder.is_placeholder>` function
59 tests whether an alignment field contains an undefined value (placeholder)
60 instead of a defined value (CIGAR string, trace). It accepts as
61 argument either an alignment object or a string or list representation.
62
63 .. doctest::
64
65 >>> from gfapy import is_placeholder, Alignment
66 >>> is_placeholder(Alignment("30M"))
67 False
68 >>> is_placeholder(Alignment("10,10"))
69 False
70 >>> is_placeholder(Alignment("*"))
71 True
72 >>> is_placeholder("*")
73 True
74 >>> is_placeholder("30M")
75 False
76 >>> is_placeholder("10,10")
77 False
78 >>> is_placeholder([])
79 True
80 >>> is_placeholder([10,10])
81 False
82
83 Note that, as a placeholder is ``False`` in boolean context, a simple
84 ``if not alignment`` also works if ``alignment`` is an alignment object.
85 This does not work, however, if it is a string representation.
86 Therefore it is better to use the
87 :func:`gfapy.is_placeholder() <gfapy.placeholder.is_placeholder>` function,
88 which works in both cases.
89
90 .. doctest::
91
92 >>> if not Alignment("*"): print('no alignment')
93 no alignment
94 >>> if is_placeholder(Alignment("*")): print('no alignment')
95 no alignment
96 >>> if "*": print('not a placeholder...?')
97 not a placeholder...?
98 >>> if is_placeholder("*"): print('really? it is a placeholder!')
99 really? it is a placeholder!
100
101 Reading and editing CIGARs
102 ^^^^^^^^^^^^^^^^^^^^^^^^^^
103
104 CIGARs are represented by specialized lists, instances of the class
105 :class:`~gfapy.alignment.cigar.CIGAR`, whose elements are CIGAR operations.
106 CIGAR operations are represented by instances of the class
107 :class:`~gfapy.alignment.cigar.CIGAR.Operation`,
108 and provide the properties ``length`` (length of the operation, an integer)
109 and ``code`` (one-letter string which specifies the type of operation).
110 Note that not all operations allowed in SAM files (for which CIGAR strings
111 were first defined) are also meaningful in GFA and thus GFA2 only allows
112 the operations ``M``, ``I``, ``D`` and ``P``.
113
114 .. doctest::
115
116 >>> cigar = gfapy.Alignment("30M")
117 >>> isinstance(cigar, list)
118 True
119 >>> operation = cigar[0]
120 >>> type(operation)
121 <class 'gfapy.alignment.cigar.CIGAR.Operation'>
122 >>> operation.code
123 'M'
124 >>> operation.code = 'D'
125 >>> operation.length
126 30
127 >>> len(operation)
128 30
129 >>> str(operation)
130 '30D'
131
132 As a CIGAR instance is a list, list methods apply to it. If the list is
133 emptied, its string representation will be the placeholder symbol ``*``.
134
135 .. doctest::
136
137 >>> cigar = gfapy.Alignment("1I20M2D")
138 >>> cigar[0].code = "M"
139 >>> cigar.pop(1)
140 gfapy.CIGAR.Operation(20,'M')
141 >>> str(cigar)
142 '1M2D'
143 >>> cigar[:] = []
144 >>> str(cigar)
145 '*'
146
147 The :func:`CIGAR.validate() <gfapy.alignment.cigar.CIGAR.validate>`
148 method checks if a CIGAR instance is valid. A version can be provided, as the
149 CIGAR validation is version specific (GFA2 forbids some CIGAR operations).
150
151 .. doctest::
152
153 >>> cigar = gfapy.Alignment("30M10D20M5I10M")
154 >>> cigar.validate()
155 >>> cigar[1].code = "L"
156 >>> cigar.validate()
157 Traceback (most recent call last):
158 ...
159 gfapy.error.ValueError:
160 >>> cigar = gfapy.Alignment("30M10D20M5I10M")
161 >>> cigar[1].code = "X"
162 >>> cigar.validate(version="gfa1")
163 >>> cigar.validate(version="gfa2")
164 Traceback (most recent call last):
165 ...
166 gfapy.error.ValueError:
167
168 Reading and editing traces
169 ^^^^^^^^^^^^^^^^^^^^^^^^^^
170
171 Traces are arrays of non-negative integers. The values are interpreted
172 using a trace spacing value. If traces are used, a trace spacing value
173 must be defined in a TS integer tag, either in the header or in the
174 individual lines which contain traces (the value in the line takes
175 precedence over the global value in the header).
176
177 .. doctest::
178
179 >>> print(gfa) #doctest: +SKIP
180 H TS:i:100
181 E x A+ B- 0 100$ 0 100$ 4,2 TS:i:50
182 ...
183 >>> gfa.header.TS
184 100
185 >>> gfa.line("x").TS
186 50
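The precedence rule described above can be illustrated with a minimal sketch in
plain Python. It does not use Gfapy: the function name
``effective_trace_spacing`` and the dictionaries standing in for the tags of
the header and of a line are assumptions made for this illustration only.

```python
def effective_trace_spacing(header_tags, line_tags):
    """Return the TS value which applies to a line, or None if none is defined.

    Both arguments are plain dicts mapping tag names to values; they are
    hypothetical stand-ins for the tags of the header and of a single line.
    """
    # A TS tag on the line itself takes precedence over the header value.
    if "TS" in line_tags:
        return line_tags["TS"]
    # Otherwise fall back to the global value of the header, if any.
    return header_tags.get("TS")


print(effective_trace_spacing({"TS": 100}, {"TS": 50}))
print(effective_trace_spacing({"TS": 100}, {}))
```

With the values of the example above, the first call gives the spacing of the
edge ``x`` and the second call the spacing of a line which has no TS tag of
its own.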
187
188 Query, reference and complement
189 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
190
191 CIGARs are asymmetric, i.e. they consider one sequence as reference and
192 another sequence as query.
193
194 The :func:`~gfapy.alignment.cigar.CIGAR.length_on_reference` and
195 :func:`~gfapy.alignment.cigar.CIGAR.length_on_query` methods compute the length
196 of the alignment on the two sequences. These methods are used by the library
197 e.g. to convert GFA1 L lines to GFA2 E lines (which is only possible if CIGARs
198 are provided).
199
200 .. doctest::
201
202 >>> cigar = gfapy.Alignment("30M10D20M5I10M")
203 >>> cigar.length_on_reference()
204 70
205 >>> cigar.length_on_query()
206 65
207
208 CIGARs are dependent on which sequence is taken as reference and which
209 is taken as query. For each alignment, a complement CIGAR can be
210 computed using the method
211 :func:`~gfapy.alignment.cigar.CIGAR.complement`; it is the CIGAR obtained
212 when the two sequences are switched.
213
214 .. doctest::
215
216 >>> cigar = gfapy.Alignment("2M1D3M")
217 >>> str(cigar.complement())
218 '3M1I2M'
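The arithmetic behind these three methods can be sketched in plain Python,
without Gfapy. The helper names below are hypothetical and the code is not the
library's implementation; it only assumes the GFA2 operation codes ``M``,
``I``, ``D`` and ``P``, and that the complement reverses the operations and
swaps ``I`` with ``D``, as the example above suggests.

```python
import re

# Hypothetical helpers for illustration; they do not mirror Gfapy internals.
_OPERATION = re.compile(r"(\d+)([MIDP])")


def parse_cigar(text):
    """Split a CIGAR string such as '30M10D' into (length, code) pairs."""
    ops = [(int(length), code) for length, code in _OPERATION.findall(text)]
    # Reject strings containing anything besides <number><code> pairs.
    if "".join(f"{length}{code}" for length, code in ops) != text:
        raise ValueError(f"invalid CIGAR string: {text!r}")
    return ops


def length_on_reference(ops):
    # M and D operations advance along the reference sequence.
    return sum(length for length, code in ops if code in "MD")


def length_on_query(ops):
    # M and I operations advance along the query sequence.
    return sum(length for length, code in ops if code in "MI")


def complement(ops):
    # Switching the two sequences reverses the operations and turns
    # insertions into deletions (and vice versa).
    swap = {"I": "D", "D": "I"}
    return [(length, swap.get(code, code)) for length, code in reversed(ops)]


def to_string(ops):
    # An empty operation list is written as the placeholder symbol.
    return "".join(f"{length}{code}" for length, code in ops) or "*"


ops = parse_cigar("30M10D20M5I10M")
print(length_on_reference(ops), length_on_query(ops))
print(to_string(complement(parse_cigar("2M1D3M"))))
```

For ``30M10D20M5I10M`` the reference length is 30+10+20+10 and the query
length is 30+20+5+10, which agrees with the values returned by the library
methods in the examples above.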
219
220 The current version of Gfapy does not provide a way to compute the
221 alignment, thus the trace information can be accessed and edited, but
222 not used for this purpose. Because of this there is currently no way in
223 Gfapy to compute a complement trace (trace obtained when the sequences
224 are switched).
225
226 .. doctest::
227
228 >>> trace = gfapy.Alignment("1,2,3")
229 >>> str(trace.complement())
230 '*'
231
232 The complement of a placeholder is a placeholder:
233
234 .. doctest::
235
236 >>> str(gfapy.Alignment("*").complement())
237 '*'
+0
-71
doc/tutorial/comments.rst less more
0 .. testsetup:: *
1
2 import gfapy
3 g = gfapy.Gfa()
4
5 .. _comments:
6
7 Comments
8 --------
9
10 GFA lines starting with a ``#`` symbol are considered comments. In Gfapy
11 comments are represented by instances of the class :class:`gfapy.line.Comment
12 <gfapy.line.comment.comment.Comment>`. They have a similar interface to other
13 line instances, with some differences, e.g. they do not support tags.
14
15 The comments collection
16 ~~~~~~~~~~~~~~~~~~~~~~~
17
18 The comments of a Gfa object are accessed using the :func:`Gfa.comments
19 <gfapy.lines.collections.Collections.comments>` property. This is a list of
20 comment line instances. The individual elements can be modified, but the list
21 itself is read-only. To remove a comment from the Gfa, you need to find the
22 instance in the list, and call
23 :func:`~gfapy.line.common.disconnection.Disconnection.disconnect` on it.
24 A comment is added to a :class:`~gfapy.gfa.Gfa` instance in the same way as other
25 lines, by using the :func:`Gfa.add_line(line)
26 <gfapy.lines.creators.Creators.add_line>` method.
27
28 .. doctest::
29
30 >>> g.add_line("# this is a comment") #doctest: +ELLIPSIS
31 >>> [str(c) for c in g.comments]
32 ['# this is a comment']
33 >>> g.comments[0].disconnect()
34 >>> g.comments
35 []
36
37 Accessing the comment content
38 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
39
40 The content of the comment line, excluding the initial ``#`` and any
41 initial spacing characters, is included in the ``content`` field. The initial
42 spacing characters can be read/changed using the ``spacer`` field. The default
43 value is a single space.
44
45 .. doctest::
46
47 >>> g.add_line("# this is a comment") #doctest: +ELLIPSIS
48 >>> c = g.comments[-1]
49 >>> c.content
50 'this is a comment'
51 >>> c.spacer
52 ' '
53 >>> c.spacer = '___'
54 >>> str(c)
55 '#___this is a comment'
56
57 Tags are not supported by comment lines. If the line contains tags,
58 these are not parsed, but included in the ``content`` field. Trying to set
59 a tag raises an exception.
60
61 .. doctest::
62
63 >>> c = gfapy.Line("# this is not a tag\txx:i:1")
64 >>> c.content
65 'this is not a tag\txx:i:1'
66 >>> c.xx
67 >>> c.xx = 1
68 Traceback (most recent call last):
69 ...
70 gfapy.error.RuntimeError: Tags of comment lines cannot be set
+0
-296
doc/tutorial/custom_records.rst less more
0 .. testsetup:: *
1
2 import gfapy
3 g = gfapy.Gfa(version = 'gfa2')
4
5 .. _custom_records:
6
7 Custom records
8 --------------
9
10 The GFA2 specification considers each line which starts with a non-standard
11 record type a custom (i.e. user- or program-specific) record.
12 Gfapy allows retrieving these records and accessing their data using a
13 similar interface to that for the predefined record types.
14
15 Retrieving, adding and deleting custom records
16 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
17
18 Gfa instances have the property
19 :func:`~gfapy.lines.collections.Collections.custom_records`,
20 a list of all line instances with a non-standard record type. Among these,
21 records of a specific record type are retrieved using the method
22 :func:`Gfa.custom_records_of_type(record_type)
23 <gfapy.lines.collections.Collections.custom_records_of_type>`.
24 Lines are added and deleted using the same methods
25 (:func:`~gfapy.lines.creators.Creators.add_line` and
26 :func:`~gfapy.line.common.disconnection.Disconnection.disconnect`) as for
27 other line types.
28
29 .. doctest::
30
31 >>> g.add_line("X\tcustom line") #doctest: +ELLIPSIS
32 >>> g.add_line("Y\tcustom line") #doctest: +ELLIPSIS
33 >>> [str(line) for line in g.custom_records] #doctest: +SKIP
34 ['X\tcustom line', 'Y\tcustom line']
35 >>> g.custom_record_keys #doctest: +SKIP
36 ['X', 'Y']
37 >>> [str(line) for line in g.custom_records_of_type('X')]
38 ['X\tcustom line']
39 >>> g.custom_records_of_type("X")[-1].disconnect()
40 >>> g.custom_records_of_type('X')
41 []
42
43 Interface without extensions
44 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
45
46 If no extension (see :ref:`extensions` section) has been defined to handle a
47 custom record type, the interface has some limitations: the field content is
48 not validated, and the field names are unknown. The generic custom record
49 class is employed
50 (:class:`~gfapy.line.custom_record.custom_record.CustomRecord`).
51
52 As the names of the positional fields in a custom record are not known, the generic
53 names ``field1``, ``field2``, ... are used. The number of positional fields is
54 found by getting the length of the
55 :attr:`~gfapy.line.custom_record.init.Init.positional_fieldnames` list.
56
57 .. doctest::
58
59 >>> g.add_line("X\ta\tb\tcc:i:10\tdd:i:100") #doctest: +ELLIPSIS
60 >>> x = g.custom_records_of_type('X')[-1]
61 >>> len(x.positional_fieldnames)
62 2
63 >>> x.field1
64 'a'
65 >>> x.field2
66 'b'
67
68 Positional fields are allowed to contain any character (including non-printable
69 characters and spacing characters), except tabs and newlines (as they are
70 structural elements of the line). No further validation is performed.
71
72 As Gfapy cannot know how many positional fields are present when parsing custom
73 records, a heuristic approach is followed, to identify tags. A field resembles
74 a tag if it starts with ``tn:d:`` where ``tn`` is a valid tag name and ``d`` a
75 valid tag datatype (see :ref:`tags` chapter). The fields are parsed from the
76 last to the first.
77
78 As soon as a field is found which does not resemble a tag, all remaining fields
79 are considered positionals (even if another field parsed later resembles a
80 tag). Due to this, invalid tags are sometimes wrongly taken as positional
81 fields (this can be avoided by writing an extension).
82
83 .. doctest::
84
85 >>> g.add_line("X\ta\tb\tcc:i:10\tdd:i:100") #doctest: +ELLIPSIS
86 >>> x1 = g.custom_records_of_type("X")[-1]
87 >>> x1.cc
88 10
89 >>> x1.dd
90 100
91 >>> g.add_line("X\ta\tb\tcc:i:10\tdd:i:100\te") #doctest: +ELLIPSIS
92 >>> x2 = g.custom_records_of_type("X")[-1]
93 >>> x2.cc
94 >>> x2.field3
95 'cc:i:10'
96 >>> g.add_line("Z\ta\tb\tcc:i:10\tddd:i:100") #doctest: +ELLIPSIS
97 >>> x3 = g.custom_records_of_type("Z")[-1]
98 >>> x3.cc
99 >>> x3.field3
100 'cc:i:10'
101 >>> x3.field4
102 'ddd:i:100'
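The right-to-left scan described above can be sketched in plain Python. This
is an illustration only, not Gfapy's parser: the function name
``split_custom_record`` is hypothetical, and the regular expression assumes
that a tag name consists of a letter followed by a letter or digit and that
the datatype is one of ``A``, ``i``, ``f``, ``Z``, ``J``, ``H`` and ``B``.

```python
import re

# Assumption: two-character tag name, then one of the tag datatype letters.
TAG_LIKE = re.compile(r"^[A-Za-z][A-Za-z0-9]:[AifZJHB]:")


def split_custom_record(line):
    """Split a custom record into (record_type, positional_fields, tags)."""
    record_type, *fields = line.split("\t")
    tags = []
    # Scan from the last field backwards; as soon as a field does not
    # resemble a tag, all remaining fields are treated as positional.
    while fields and TAG_LIKE.match(fields[-1]):
        tags.insert(0, fields.pop())
    return record_type, fields, tags


print(split_custom_record("X\ta\tb\tcc:i:10\tdd:i:100"))
print(split_custom_record("X\ta\tb\tcc:i:10\tdd:i:100\te"))
print(split_custom_record("Z\ta\tb\tcc:i:10\tddd:i:100"))
```

In the second and third line no tag is recognized: the last field (``e`` and
``ddd:i:100``, respectively) does not resemble a tag, so the scan stops
immediately and ``cc:i:10`` ends up among the positional fields, as in the
doctest above.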
103
104 .. _extensions:
105
106 Extensions
107 ~~~~~~~~~~
108
109 The support for custom fields is limited, as Gfapy does not know which and how
110 many fields there are and how they shall be validated. It is possible to create
111 an extension of Gfapy which defines new record types: this allows using
112 these record types in a similar way to the built-in types.
113
114 As an example, an extension will be described, which defines two record types:
115 T for taxa and M for assignments of segments to taxa. For further information
116 about a possible use case for this extension, see the Supplemental
117 Information to the manuscript describing Gfapy.
118
119 The T records will contain a single positional field, ``tid``, a GFA2
120 identifier, and an optional UL string tag. The M records will contain three
121 positional fields (all three GFA2 identifiers): a name field ``mid`` (optional),
122 and two references, ``tid`` to a T line and ``sid`` to an S line. The SC
123 integer tag will also be defined. Here is an example of a GFA containing M and
124 T lines:
125
126 .. code::
127
128 S sA 1000 *
129 S sB 1000 *
130 M assignment1 t123 sA SC:i:40
131 M assignment2 t123 sB
132 M * B12c sB SC:i:20
133 T B12c
134 T t123 UL:Z:http://www.taxon123.com
135
136 By writing subclasses of the :class:`~gfapy.line.line.Line` class, it is possible to
137 tell Gfapy how records of the M and T classes shall be handled. This
138 only requires defining some constants and calling the class method
139 :func:`~gfapy.line.line.Line.register_extension`.
140
141 The constants to define are ``RECORD_TYPE``, which shall be the content
142 of the record type field (e.g. ``M``); ``POSFIELDS`` shall contain an ordered
143 dict, specifying the datatype for each positional field, in the order these
144 fields are found in the line; ``TAGS_DATATYPE`` is a dict, specifying the
145 datatype of the predefined optional tags; ``NAME_FIELD`` is a field name,
146 and specifies which field contains the identifier of the line.
147 For details on predefined and custom datatypes, see the next sections
148 (:ref:`predefined_datatypes` and :ref:`custom_datatypes`).
149
150 To handle references, :func:`~gfapy.line.line.Line.register_extension`
151 can be supplied with a ``references`` parameter, a list of triples
152 ``(fieldname, classname, backreferences)``. Here, ``fieldname`` is the name
153 of the field in the corresponding record containing the reference (e.g.
154 ``sid``), ``classname`` is the name of the class to which the reference goes
155 (e.g. ``gfapy.line.segment.GFA2``), and ``backreferences`` is how the
156 collection of backreferences shall be called, in the records to which the
157 reference points (e.g. ``metagenomic_assignments``).
158
159 .. code:: python
160
161 from collections import OrderedDict
162
163 class Taxon(gfapy.Line):
164   RECORD_TYPE = "T"
165   POSFIELDS = OrderedDict([("tid","identifier_gfa2")])
166   TAGS_DATATYPE = {"UL":"Z"}
167   NAME_FIELD = "tid"
168
169 Taxon.register_extension()
170
171 class MetagenomicAssignment(gfapy.Line):
172   RECORD_TYPE = "M"
173   POSFIELDS = OrderedDict([("mid","optional_identifier_gfa2"),
174                            ("tid","identifier_gfa2"),
175                            ("sid","identifier_gfa2")])
176   TAGS_DATATYPE = {"SC":"i"}
177   NAME_FIELD = "mid"
178
179 MetagenomicAssignment.register_extension(references=
180     [("sid", gfapy.line.segment.GFA2, "metagenomic_assignments"),
181      ("tid", Taxon, "metagenomic_assignments")])
182
183 .. _predefined_datatypes:
184
185 Predefined datatypes for extensions
186 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
187
188 The datatype of fields is specified in Gfapy using classes, which provide
189 functions for decoding, encoding and validating the corresponding data.
190 Gfapy contains a number of datatypes which correspond to the description
191 of the field content in the GFA1 and GFA2 specification.
192
193 When writing extensions only the GFA2 field datatypes are generally used
194 (as GFA1 does not contain custom fields). They are summarized in
195 the following table:
196
197 +-------------------------------------+---------------+--------------------------------------------------------+
198 | Name                                | Example       | Description                                            |
199 +=====================================+===============+========================================================+
200 | ``alignment_gfa2``                  | ``12M1I3M``   | CIGAR string, Trace alignment or Placeholder (``*``)   |
201 +-------------------------------------+---------------+--------------------------------------------------------+
202 | ``identifier_gfa2``                 | ``S1``        | ID of a line                                           |
203 +-------------------------------------+---------------+--------------------------------------------------------+
204 | ``oriented_identifier_gfa2``        | ``S1+``       | ID of a line followed by ``+`` or ``-``                |
205 +-------------------------------------+---------------+--------------------------------------------------------+
206 | ``optional_identifier_gfa2``        | ``*``         | ID of a line or Placeholder (``*``)                    |
207 +-------------------------------------+---------------+--------------------------------------------------------+
208 | ``identifier_list_gfa2``            | ``S1 S2``     | space separated list of line IDs                       |
209 +-------------------------------------+---------------+--------------------------------------------------------+
210 | ``oriented_identifier_list_gfa2``   | ``S1+ S2-``   | space separated list of line IDs plus orientations     |
211 +-------------------------------------+---------------+--------------------------------------------------------+
212 | ``position_gfa2``                   | ``120$``      | non-negative integer, optionally followed by ``$``     |
213 +-------------------------------------+---------------+--------------------------------------------------------+
214 | ``sequence_gfa2``                   | ``ACGNNYR``   | sequence of printable chars., no whitespace            |
215 +-------------------------------------+---------------+--------------------------------------------------------+
216 | ``string``                          | ``a b_c;d``   | string, no tabs and newlines (Z tags)                  |
217 +-------------------------------------+---------------+--------------------------------------------------------+
218 | ``char``                            | ``A``         | single character (A tags)                              |
219 +-------------------------------------+---------------+--------------------------------------------------------+
220 | ``float``                           | ``1.12``      | float (f tags)                                         |
221 +-------------------------------------+---------------+--------------------------------------------------------+
222 | ``integer``                         | ``-12``       | integer (i tags)                                       |
223 +-------------------------------------+---------------+--------------------------------------------------------+
224 | ``optional_integer``                | ``*``         | integer or placeholder                                 |
225 +-------------------------------------+---------------+--------------------------------------------------------+
226 | ``numeric_array``                   | ``c,10,3``    | array of integers or floats (B tags)                   |
227 +-------------------------------------+---------------+--------------------------------------------------------+
228 | ``byte_array``                      | ``12F1FF``    | hexadecimal byte string (H tags)                       |
229 +-------------------------------------+---------------+--------------------------------------------------------+
230 | ``json``                            | ``{"b":2}``   | JSON string, no tabs and newlines (J tags)             |
231 +-------------------------------------+---------------+--------------------------------------------------------+
232
233 .. _custom_datatypes:
234
235 Custom datatypes for extensions
236 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
237
238 For custom records, one sometimes needs datatypes not yet available in the GFA
239 specification. For example, a custom datatype can be defined for
240 the taxon identifier used in the ``tid`` field of the T and M records:
241 the taxon identifier shall either be
242 in the form ``taxon:<n>``, where ``<n>`` is a positive integer,
243 or consist of letters, numbers and underscores only
244 (without ``:``).
245
246 To define the datatype, a class is written, which contains the following
247 functions:
248
249 * ``validate_encoded(string)``: validates the content of the field,
250 if this is a string (e.g., the name of the T line)
251 * ``validate_decoded(object)``: validates the content of the field,
252 if this is not a string (e.g., a reference to a T line)
253 * ``decode(string)``: validates the content of the field (a string)
254 and returns the decoded content; note that references must not be resolved
255 (there is no access to the Gfa instance here), thus the name of the
256 T line will be returned unchanged
257 * ``encode(object)``: validates the content of the field (not in string
258 form) and returns the string which codes it in the GFA file (also here
259 references are validated but not converted into strings)
260
261 Finally the datatype is registered calling
262 :func:`~gfapy.field.field.Field.register_datatype`. The code for
263 the taxon ID extension is the following:
264
265 .. code:: python
266
267 import re
268
269 class TaxonID:
270
271   def validate_encoded(string):
272     if not re.match(r"^taxon:(\d+)$",string) and \
273         not re.match(r"^[a-zA-Z0-9_]+$", string):
274       raise gfapy.ValueError("Invalid taxon ID: {}".format(string))
275
276   def decode(string):
277     TaxonID.validate_encoded(string)
278     return string
279
280   def validate_decoded(obj):
281     if isinstance(obj,Taxon):
282       TaxonID.validate_encoded(obj.name)
283     else:
284       raise gfapy.TypeError(
285         "Invalid type for taxon ID: "+"{}".format(repr(obj)))
286
287   def encode(obj):
288     TaxonID.validate_decoded(obj)
289     return obj
290
291 gfapy.Field.register_datatype("taxon_id", TaxonID)
292
293 To use the new datatype in the T and M lines defined above (:ref:`extensions`),
294 the definition of the two subclasses can be changed:
295 in ``POSFIELDS`` the value ``taxon_id`` shall be assigned to the key ``tid``.
+0
-43
doc/tutorial/errors.rst less more
0 .. _errors:
1
2 Errors
3 ------
4
5 The different types of errors defined in Gfapy are summarized in the
6 following table. All exceptions raised in the library are subclasses of
7 `Error`. Thus, ``except gfapy.Error`` can be used to catch
8 all library errors.
9
10 +-----------------------+-------------------------------+---------------------------------+
11 | Error                 | Description                   | Examples                        |
12 +=======================+===============================+=================================+
13 | `VersionError`        | An unknown or wrong version   | "GFA0"; or GFA1 in GFA2 context |
14 |                       | is specified or implied       |                                 |
15 +-----------------------+-------------------------------+---------------------------------+
16 | `ValueError`          | The value of an object is     | a negative position is used     |
17 |                       | invalid                       |                                 |
18 +-----------------------+-------------------------------+---------------------------------+
19 | `TypeError`           | The wrong type has been used  | Z instead of i used for VN tag; |
20 |                       | or specified                  | dict for an i tag               |
21 +-----------------------+-------------------------------+---------------------------------+
22 | `FormatError`         | The format of an object is    | a line does not contain the     |
23 |                       | wrong                         | expected number of fields       |
24 +-----------------------+-------------------------------+---------------------------------+
25 | `NotUniqueError`      | Something should be unique    | duplicated tag name or line     |
26 |                       | but is not                    | identifier                      |
27 +-----------------------+-------------------------------+---------------------------------+
28 | `InconsistencyError`  | Pieces of information collide | length of sequence and LN tag   |
29 |                       | with each other               | do not match                    |
30 +-----------------------+-------------------------------+---------------------------------+
31 | `RuntimeError`        | The user tried to do          | editing from_segment field in   |
32 |                       | something which is not        | connected links                 |
33 |                       | allowed                       |                                 |
34 +-----------------------+-------------------------------+---------------------------------+
35 | `ArgumentError`       | Problem with the arguments of | wrong number of arguments in    |
36 |                       | a method                      | dynamically created method      |
37 +-----------------------+-------------------------------+---------------------------------+
38 | `AssertionError`      | Something unexpected happened | there is a bug in the library or|
39 |                       |                               | the library has been used in    |
40 |                       |                               | an unintended way               |
41 +-----------------------+-------------------------------+---------------------------------+
42
+0
-409
doc/tutorial/gfa.rst less more
0 .. testsetup:: *
1
2 import gfapy
3 gfa = gfapy.Gfa()
4 gfa1 = gfapy.Gfa()
5 gfa1.add_line("H\tVN:Z:1.0")
6 gfa1.add_line("# this is a comment")
7 gfa1.add_line("S\t1\t*")
8 gfa1.add_line("S\t2\t*")
9 gfa1.add_line("S\t3\t*")
10 gfa2 = gfapy.Gfa()
11 gfa2.add_line("H\tVN:Z:2.0\tTS:i:100")
12 gfa2.add_line("X\tcustom line")
13 gfa2.add_line("Y\tcustom line")
14
15 .. _gfa:
16
17 The Gfa class
18 -------------
19
20 The content of a GFA file is represented in Gfapy by an instance of the class
21 :class:`~gfapy.gfa.Gfa`. In most cases, the Gfa instance will be constructed
22 from the data contained in a GFA file, using the method
23 :func:`Gfa.from_file() <gfapy.gfa.Gfa.from_file>`.
24
25 Alternatively, it is possible to use the constructor of the class; it takes an
26 optional positional parameter, the content of a GFA file (as string, or as list
27 of strings, one per line of the GFA file). If no GFA content is provided, the
28 Gfa instance will be empty.
29
30 .. doctest::
31
32 >>> gfa = gfapy.Gfa("H\tVN:Z:1.0\nS\tA\t*")
33 >>> print(len(gfa.lines))
34 2
35 >>> gfa = gfapy.Gfa(["H\tVN:Z:1.0", "S\tA\t*", "S\tB\t*"])
36 >>> print(len(gfa.lines))
37 3
38 >>> gfa = gfapy.Gfa()
39 >>> print(len(gfa.lines))
40 0
41
42 The string representation of the Gfa object (which can be obtained using
43 ``str()``) is the textual representation in GFA format.
44 Using :func:`Gfa.to_file(filename) <gfapy.gfa.Gfa.to_file>` allows
45 writing this representation to a GFA file (the content of the file is
46 overwritten).
47
48 .. doctest::
49
50 >>> g1 = gfapy.Gfa()
51 >>> g1.append("H\tVN:Z:1.0")
52 >>> g1.append("S\ta\t*")
53 >>> g1.to_file("my.gfa") #doctest: +SKIP
54 >>> g2 = gfapy.Gfa.from_file("my.gfa") #doctest: +SKIP
55 >>> str(g1)
56 'H\tVN:Z:1.0\nS\ta\t*'
57
58
59 All methods for creating a Gfa (constructor and from_file) accept
60 a ``vlevel`` parameter, the validation level,
61 which can assume the values 0, 1, 2 and 3. A higher value means
62 more validations are performed. The :ref:`validation` chapter explains
63 the meaning of the different validation levels in detail.
64 The default value is 1.
65
66 .. doctest::
67
68 >>> gfapy.Gfa().vlevel
69 1
70 >>> gfapy.Gfa(vlevel = 0).vlevel
71 0
72
73 A further parameter is ``version``. It can be set to ``'gfa1'``,
74 ``'gfa2'`` or left to the default value (``None``). The default
75 is to auto-detect the version of the GFA from the line content.
76 If the version is set manually, any content not compatible with the
77 specified version will trigger an exception. If the version is
78 detected automatically, an exception will be raised if two lines
79 are found whose content is incompatible with each other (e.g. a GFA1
80 segment followed by a GFA2 segment).
81
82 .. doctest::
83
84 >>> g = gfapy.Gfa(version='gfa2')
85 >>> g.version
86 'gfa2'
87 >>> g.add_line("S\t1\t*")
88 Traceback (most recent call last):
89 ...
90 gfapy.error.VersionError: Version: 1.0 (None)
91 ...
92 >>> g = gfapy.Gfa()
93 >>> g.version
94 >>> g.add_line("S\t1\t*")
95 >>> g.version
96 'gfa1'
97 >>> g.add_line("S\t1\t100\t*")
98 Traceback (most recent call last):
99 ...
100 gfapy.error.VersionError: Version: 1.0 (None)
101 ...
102
103 Collections of lines
104 ~~~~~~~~~~~~~~~~~~~~
105
106 The property :attr:`~gfapy.lines.collections.Collections.lines`
107 of the Gfa object is a list of all the lines
108 in the GFA file (including the header, which is split into single-tag
109 lines). The list itself shall not be modified by the user directly (i.e.
110 adding and removing lines is done using a different interface, see
111 below). However, the individual elements of the list can be edited.
112
113 .. doctest::
114
115 >>> for line in gfa.lines: print(line)
116
117 For most record types, a list of the lines of the record type is available
118 as a read-only property, which is named after the record type, in plural.
119
120 .. doctest::
121
122 >>> [str(line) for line in gfa1.segments]
123 ['S\t1\t*', 'S\t2\t*', 'S\t3\t*']
124 >>> [str(line) for line in gfa2.fragments]
125 []
126
127 Edges are a particular case: in GFA1 they are links and containments, while in
128 GFA2 there is a unified edge record type, which can also represent
129 internal alignments. In Gfapy, the
130 :attr:`~gfapy.lines.collections.Collections.edges` property retrieves all edges
131 (i.e. all E lines in GFA2, and all L and C lines in GFA1). The
132 :attr:`~gfapy.lines.collections.Collections.dovetails` property is a list of
133 all edges which represent dovetail overlaps (i.e. all L lines in GFA1 and a
134 subset of the E lines in GFA2). The
135 :attr:`~gfapy.lines.collections.Collections.containments` property is a list of
136 all edges which represent containments (i.e. all C lines in GFA1 and a subset
137 of the E lines in GFA2).
138
139 .. doctest::
140
141 >>> gfa2.edges
142 []
143 >>> gfa2.dovetails
144 []
145 >>> gfa2.containments
146 []
147
148 Paths are retrieved using the
149 :attr:`~gfapy.lines.collections.Collections.paths` property. This list
150 contains all P lines in GFA1 and all O lines in GFA2. The ``sets`` property
151 returns the list of all U lines in GFA2 (an empty list in GFA1).
152
153 .. doctest::
154
155 >>> gfa2.paths
156 []
157 >>> gfa2.sets
158 []
159
160 The header contains metadata in a single or multiple lines. For ease of
161 access to the header information, all its tags are summarized in a
162 single line instance, which is retrieved using the
163 :attr:`~gfapy.lines.headers.Headers.header` property.
164 The :ref:`header` chapter of this manual explains in more
165 detail how to work with the header object.
166
167 .. doctest::
168
169 >>> gfa2.header.TS
170 100
171
172 All lines which start with the string ``#`` are comments; they are handled in
173 the :ref:`comments` chapter and are retrieved using the
174 :attr:`~gfapy.lines.collections.Collections.comments` property.
175
176 .. doctest::
177
178 >>> [str(line) for line in gfa1.comments]
179 ['# this is a comment']
180
181 Custom lines are lines of GFA2 files which start
182 with a non-standard record type. Gfapy provides basic built-in support
183 for accessing the information in custom lines, and allows defining
184 extensions for custom record types to provide more advanced
185 functionality (see the :ref:`custom_records` chapter).
186
187 .. doctest::
188
189 >>> [str(line) for line in gfa2.custom_records]
190 ['X\tcustom line', 'Y\tcustom line']
191 >>> gfa2.custom_record_keys
192 ['X', 'Y']
193 >>> [str(line) for line in gfa2.custom_records_of_type('X')]
194 ['X\tcustom line']
195
196 Line identifiers
197 ~~~~~~~~~~~~~~~~
198
199 Some GFA lines have a mandatory or optional identifier field: segments and
200 paths in GFA1, segments, gaps, edges, paths and sets in GFA2. A line of this
201 type can be retrieved by identifier, using the method
202 :func:`Gfa.line(ID) <gfapy.gfa.Gfa.line>` with the identifier as argument.
203
204 .. doctest::
205
206 >>> str(gfa1.line('1'))
207 'S\t1\t*'
208
209 The GFA2 specification prescribes the exact namespace for the identifier
210 (segment, path, set, edge and gap identifiers share the same namespace).
211 The content of this namespace can be retrieved using the
212 :attr:`~gfapy.lines.collections.Collections.names` property.
213 The identifiers of single line types
214 can be retrieved using the properties
215 :attr:`~gfapy.lines.collections.Collections.segment_names`,
216 :attr:`~gfapy.lines.collections.Collections.edge_names`,
217 :attr:`~gfapy.lines.collections.Collections.gap_names`,
218 :attr:`~gfapy.lines.collections.Collections.path_names` and
219 :attr:`~gfapy.lines.collections.Collections.set_names`.
220
221 .. doctest::
222
223 >>> g = gfapy.Gfa()
224 >>> g.add_line("S\tA\t100\t*")
225 >>> g.add_line("S\tB\t100\t*")
226 >>> g.add_line("S\tC\t100\t*")
227 >>> g.add_line("E\tb_c\tB+\tC+\t0\t10\t90\t100$\t*")
228 >>> g.add_line("O\tp1\tB+ C+")
229 >>> g.add_line("U\ts1\tA b_c g")
230 >>> g.add_line("G\tg\tA+\tB-\t1000\t*")
231 >>> g.names
232 ['A', 'B', 'C', 'b_c', 'g', 'p1', 's1']
233 >>> g.segment_names
234 ['A', 'B', 'C']
235 >>> g.path_names
236 ['p1']
237 >>> g.edge_names
238 ['b_c']
239 >>> g.gap_names
240 ['g']
241 >>> g.set_names
242 ['s1']
243
244 The GFA1 specification does not handle the question of the namespace of
245 identifiers explicitly. However, gfapy assumes and enforces
246 a single namespace for segment, path names and the values of the ID tags
247 of L and C lines. The content of this namespace can be found using the
248 :attr:`~gfapy.lines.collections.Collections.names` property.
249 The identifiers of single line types
250 can be retrieved using the properties
251 :attr:`~gfapy.lines.collections.Collections.segment_names`,
252 :attr:`~gfapy.lines.collections.Collections.edge_names`
253 (ID tags of links and containments) and
254 :attr:`~gfapy.lines.collections.Collections.path_names`.
255 For GFA1, the properties
256 :attr:`~gfapy.lines.collections.Collections.gap_names` and
257 :attr:`~gfapy.lines.collections.Collections.set_names`
258 always contain empty lists.
259
260 .. doctest::
261
262 >>> g = gfapy.Gfa()
263 >>> g.add_line("S\tA\t*")
264 >>> g.add_line("S\tB\t*")
265 >>> g.add_line("S\tC\t*")
266 >>> g.add_line("L\tB\t+\tC\t+\t*\tID:Z:b_c")
267 >>> g.add_line("P\tp1\tB+,C+\t*")
268 >>> g.names
269 ['A', 'B', 'C', 'b_c', 'p1']
270 >>> g.segment_names
271 ['A', 'B', 'C']
272 >>> g.path_names
273 ['p1']
274 >>> g.edge_names
275 ['b_c']
276 >>> g.gap_names
277 []
278 >>> g.set_names
279 []
280
281 Identifiers of external sequences
282 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
283
284 Fragments contain identifiers which refer to external sequences
285 (not contained in the GFA file). According to the specification,
286 these identifiers are not part of the same namespace as the identifiers
287 of the GFA lines. They can be retrieved using the
288 :attr:`~gfapy.lines.collections.Collections.external_names`
289 property.
290
291 .. doctest::
292
293 >>> g = gfapy.Gfa()
294 >>> g.add_line("S\tA\t100\t*")
295 >>> g.add_line("F\tA\tread1+\t10\t30\t0\t20$\t20M")
296 >>> g.external_names
297 ['read1']
298
299 The method
300 :func:`Gfa.fragments_for_external(external_ID) <gfapy.lines.finders.Finders.fragments_for_external>`
301 retrieves all F lines with a specified external sequence identifier.
302
303 .. doctest::
304
305 >>> f = g.fragments_for_external('read1')
306 >>> len(f)
307 1
308 >>> str(f[0])
309 'F\tA\tread1+\t10\t30\t0\t20$\t20M'
310
311 Adding new lines
312 ~~~~~~~~~~~~~~~~
313
314 New lines can be added to a Gfa instance using the
315 :func:`Gfa.add_line(line) <gfapy.lines.creators.Creators.add_line>`
316 method or its alias
317 :func:`Gfa.append(line) <gfapy.lines.creators.Creators.append>`.
318 The argument can be either a string
319 describing a line with valid GFA syntax, or a :class:`~gfapy.line.line.Line`
320 instance. If a string is added, a line instance is created and
321 then added.
322
323 .. doctest::
324
325 >>> g = gfapy.Gfa()
326 >>> g.add_line("S\tA\t*") #doctest: +ELLIPSIS
327 >>> g.segment_names
328 ['A']
329 >>> g.append("S\tB\t*") #doctest: +ELLIPSIS
330 >>> g.segment_names
331 ['A', 'B']
332
333 Editing the lines
334 ~~~~~~~~~~~~~~~~~
335
336 Accessing the information stored in the fields of a line instance is
337 described in the :ref:`positional_fields` and :ref:`tags` chapters.
338
339 In Gfapy, a line instance belonging to a Gfa instance is said
340 to be *connected* to the Gfa instance. Directly editing the content of a connected
341 line is only possible for those fields which do not contain
342 references to other lines. For more information on how to modify the content of
343 the fields of a connected line, see the :ref:`references` chapter.
344
345 .. doctest::
346
347 >>> g = gfapy.Gfa()
348 >>> e = gfapy.Line("E\t*\tA+\tB-\t0\t10\t90\t100$\t*")
349 >>> e.sid1 = "C+"
350 >>> g.add_line(e) #doctest: +ELLIPSIS
351 >>> e.sid1 = "A+"
352 Traceback (most recent call last):
353 gfapy.error.RuntimeError: ...
354
355 Removing lines
356 ~~~~~~~~~~~~~~
357
358 Disconnecting a line from the Gfa instance is done using the
359 :func:`Gfa.rm(line) <gfapy.lines.destructors.Destructors.rm>` method. The
360 argument can be a line instance or the name of a line.
361
362 Alternatively, a line instance can also be disconnected using the
363 `disconnect` method on it. Disconnecting a line
364 may trigger other operations, such as the disconnection of other lines (see the
365 :ref:`references` chapter).
366
367 .. doctest::
368
369 >>> g = gfapy.Gfa()
370 >>> g.add_line("S\tA\t*") #doctest: +ELLIPSIS
371 >>> g.segment_names
372 ['A']
373 >>> g.rm('A') #doctest: +ELLIPSIS
374 >>> g.segment_names
375 []
376 >>> g.append("S\tB\t*") #doctest: +ELLIPSIS
377 >>> g.segment_names
378 ['B']
379 >>> b = g.line('B')
380 >>> b.disconnect()
381 >>> g.segment_names
382 []
383
384 Renaming lines
385 ~~~~~~~~~~~~~~
386
387 Lines with an identifier can be renamed. This is done simply by editing
388 the corresponding field (such as ``name`` or ``sid`` for a segment).
389 This field is not a reference to another line and can be freely edited
390 also in line instances connected to a Gfa. All references to the line
391 from other lines will still be up to date, as they will refer to the
392 same instance (whose name has been changed) and their string
393 representation will use the new name.
394
395 .. doctest::
396
397 >>> g = gfapy.Gfa()
398 >>> g.add_line("S\tA\t*") #doctest: +ELLIPSIS
399 >>> g.add_line("L\tA\t+\tB\t-\t*") #doctest: +ELLIPSIS
400 >>> g.segment_names
401 ['A', 'B']
402 >>> g.dovetails[0].from_name
403 'A'
404 >>> g.segment('A').name = 'C'
405 >>> g.segment_names
406 ['B', 'C']
407 >>> g.dovetails[0].from_name
408 'C'
+0
-14
doc/tutorial/graph_operations.rst
0 .. _graph_operations:
1
2 Graph operations
3 ----------------
4
5 Graph operations such as linear path merging, multiplication of
6 segments and others are provided. These operations are implemented
7 in analogy to those provided by the Ruby library RGFA. As RGFA only
8 handles GFA1 graphs, only dovetail overlaps are considered as
9 connections. A detailed description of the operations can be
10 found in Gonnella and Kurtz (2016). More information about the
11 individual operations is found in the method documentation of the
12 submodules of `GraphOperations`.
13
+0
-189
doc/tutorial/header.rst
0 .. testsetup:: *
1
2 import gfapy
3 gfa = gfapy.Gfa()
4
5 .. _header:
6
7 The Header
8 ----------
9
10 GFA files may contain one or multiple header lines (record type: "H"). These
11 lines may be present in any part of the file, not necessarily at the beginning.
12
13 Although the header may consist of multiple lines, its content refers to the
14 whole file. Therefore in Gfapy the header is accessed using a single line
15 instance (accessible by the :attr:`~gfapy.lines.headers.Headers.header`
16 property). Header lines contain only tags. If no header line is present in the
17 Gfa, then the header line object will be empty (i.e. contain no tags).
18
19 Note that header lines cannot be connected to the Gfa like other lines (i.e.
20 calling :meth:`~gfapy.line.common.connection.Connection.connect` on them raises
21 an exception). Instead they must be merged into the existing Gfa header, using
22 `add_line` on the Gfa instance.
23
24 .. doctest::
25
26 >>> gfa.add_line("H\tnn:f:1.0") #doctest: +ELLIPSIS
27 >>> gfa.header.nn
28 1.0
29 >>> gfapy.Line("H\tnn:f:1.0").connect(gfa)
30 Traceback (most recent call last):
31 ...
32 gfapy.error.RuntimeError: ...
33
34 Multiple definitions of the predefined header tags
35 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
36
37 For the predefined tags (``VN`` and ``TS``), the presence of multiple
38 values in different lines is an error, unless the value is the same in
39 each instance (in which case the repeated definitions are ignored).
40
41 .. doctest::
42
43 >>> gfa.add_line("H\tVN:Z:1.0") #doctest: +ELLIPSIS
44 >>> gfa.add_line("H\tVN:Z:1.0") # ignored #doctest: +ELLIPSIS
45 >>> gfa.add_line("H\tVN:Z:2.0")
46 Traceback (most recent call last):
47 ...
48 gfapy.error.VersionError: ...
49
50 Multiple definitions of custom header tags
51 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
52
53 If the tags are present only once in the header in its entirety, the access to
54 the tags is the same as for any other line (see the :ref:`tags` chapter).
55
56 However, the specification does not forbid custom tags to be defined with
57 different values in different header lines (which we name "multi-definition
58 tags"). This particular case is handled in the next sections.
59
60 Reading multi-definition tags
61 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
62
63 Reading, validating and setting the datatype of multi-definition tags is done
64 using the same methods as for all other lines (see the :ref:`tags` chapter).
65 However, if a tag is defined multiple times on multiple H lines, reading the
66 tag will return a list of the values on the lines. This list is an instance of
67 ``gfapy.FieldArray``, a subclass of list.
68
69 .. doctest::
70
71 >>> gfa.add_line("H\txx:i:1") #doctest: +ELLIPSIS
72 >>> gfa.add_line("H\txx:i:2") #doctest: +ELLIPSIS
73 >>> gfa.add_line("H\txx:i:3") #doctest: +ELLIPSIS
74 >>> gfa.header.xx
75 gfapy.FieldArray('i',[1, 2, 3])
76
77 Setting tags
78 ~~~~~~~~~~~~
79
80 There are two possibilities to set a tag for the header. The first is
81 the normal tag interface (using ``set`` or the tag name property). The
82 second is to use ``add``. The latter supports multi-definition tags,
83 i.e. it adds the value to the previous ones (if any), instead of
84 overwriting them.
85
86 .. doctest::
87
88 >>> gfa = gfapy.Gfa()
89 >>> gfa.header.xx
90 >>> gfa.header.add("xx", 1)
91 >>> gfa.header.xx
92 1
93 >>> gfa.header.add("xx", 2)
94 >>> gfa.header.xx
95 gfapy.FieldArray('i',[1, 2])
96 >>> gfa.header.set("xx", 3)
97 >>> gfa.header.xx
98 3
99
100 Modifying field array values
101 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
102
103 Field arrays can be modified directly (e.g. adding new values or
104 removing some values). After modification, the user may check if the
105 array values remain compatible with the datatype of the tag using the
106 :meth:`~gfapy.line.common.validate.Validate.validate_field` method.
107
108 .. doctest::
109
110 >>> gfa = gfapy.Gfa()
111 >>> gfa.header.xx = gfapy.FieldArray('i',[1,2,3])
112 >>> gfa.header.xx
113 gfapy.FieldArray('i',[1, 2, 3])
114 >>> gfa.header.validate_field("xx")
115 >>> gfa.header.xx.append("X")
116 >>> gfa.header.validate_field("xx")
117 Traceback (most recent call last):
118 ...
119 gfapy.error.FormatError: ...
120
121 If the field array is modified using array methods which return a list
122 or data of any other type, a field array must be constructed, setting
123 its datatype to the value returned by calling
124 :meth:`~gfapy.line.common.field_datatype.FieldDatatype.get_datatype`
125 on the header.
126
127 .. doctest::
128
129 >>> gfa = gfapy.Gfa()
130 >>> gfa.header.xx = gfapy.FieldArray('i',[1,2,3])
131 >>> gfa.header.xx
132 gfapy.FieldArray('i',[1, 2, 3])
133 >>> gfa.header.xx = gfapy.FieldArray(gfa.header.get_datatype("xx"),
134 ... list(map(lambda x: x+1, gfa.header.xx)))
135 >>> gfa.header.xx
136 gfapy.FieldArray('i',[2, 3, 4])
137
138 String representation of the header
139 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
140
141 For consistency with other line types, the string representation of the header
142 is a single-line string, which may not be standard-compliant if it contains
143 multiple instances of a tag (when calling
144 :meth:`~gfapy.line.common.writer.Writer.field_to_s` for a tag present multiple
145 times, the output string will contain the instances of the tag, separated by
146 tabs).
147
148 However, when the Gfa is output to file or string, the header is split into
149 multiple H lines with single tags, so that standard-compliant GFA is output.
150 The split header can be retrieved using the
151 :attr:`~gfapy.lines.headers.Headers.headers` property of the Gfa instance.
152
153 .. doctest::
154
155 >>> gfa = gfapy.Gfa()
156 >>> gfa.header.VN = "1.0"
157 >>> gfa.header.xx = gfapy.FieldArray('i',[1,2])
158 >>> gfa.header.field_to_s("xx")
159 '1\t2'
160 >>> gfa.header.field_to_s("xx", tag=True)
161 'xx:i:1\txx:i:2'
162 >>> str(gfa.header)
163 'H\tVN:Z:1.0\txx:i:1\txx:i:2'
164 >>> [str(h) for h in gfa.headers]
165 ['H\tVN:Z:1.0', 'H\txx:i:1', 'H\txx:i:2']
166 >>> str(gfa)
167 'H\tVN:Z:1.0\nH\txx:i:1\nH\txx:i:2'
168
169 Count the input header lines
170 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
171
172 Because header lines are stored differently, the number of header elements
173 is not equal to the number of header lines in the input. This is inconvenient if an
174 application wants to count the number of input lines in a file. In order to make
175 that possible, the number of input header lines is counted and can be
176 retrieved using the :attr:`~gfapy.lines.headers.Headers.n_input_header_lines`
177 property of the Gfa instance.
178
179 .. doctest::
180
181 >>> gfa = gfapy.Gfa()
182 >>> gfa.add_line("H\txx:i:1\tyy:Z:ABC") #doctest: +ELLIPSIS
183 >>> gfa.add_line("H\txy:i:2") #doctest: +ELLIPSIS
184 >>> gfa.add_line("H\tyz:i:3\tab:A:A") #doctest: +ELLIPSIS
185 >>> len(gfa.headers)
186 5
187 >>> gfa.n_input_header_lines
188 3
+0
-69
doc/tutorial/placeholders.rst
0 .. testsetup:: *
1
2 import gfapy
3
4 .. _placeholders:
5
6 Placeholders
7 ------------
8
9 Some positional fields may contain an undefined value: S: ``sequence``;
10 L/C: ``overlap``; P: ``overlaps``; E: ``eid``, ``alignment``; F:
11 ``alignment``; G: ``gid``, ``var``; U/O: ``pid``. In GFA this value is
12 represented by a ``*``.
13
14 In Gfapy the class `Placeholder` represents the undefined value.
15
16 Distinguishing placeholders
17 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
18
19 The :func:`gfapy.is_placeholder() <gfapy.placeholder.is_placeholder>` method
20 checks whether a value is a placeholder; a value is a placeholder if
21 it is a `Placeholder` instance, or would represent
22 a placeholder in GFA (a string containing ``*``), or would be represented
23 by a placeholder in GFA (e.g. an empty array).
24
25 .. doctest::
26
27 >>> gfapy.is_placeholder("*")
28 True
29 >>> gfapy.is_placeholder("**")
30 False
31 >>> gfapy.is_placeholder([])
32 True
33 >>> gfapy.is_placeholder(gfapy.Placeholder())
34 True
35
36 Note that, as a placeholder is ``False`` in boolean context, a simple
37 ``if not placeholder`` will also work if the value is an instance
38 of `Placeholder`, but not always for the other cases (in particular not
39 for the string representation ``*``).
40 Therefore using
41 :func:`gfapy.is_placeholder() <gfapy.placeholder.is_placeholder>`
42 is better.
43
44 .. doctest::
45
46 >>> if "*": print('* is not a placeholder')
47 * is not a placeholder
48 >>> if gfapy.is_placeholder("*"): print('but it represents a placeholder')
49 but it represents a placeholder
50
51 Compatibility methods
52 ~~~~~~~~~~~~~~~~~~~~~
53
54 Some methods are defined for placeholders, which allow them to respond
55 to the same methods as defined values. This makes it possible to write
56 generic code.
57
58 .. doctest::
59
60 >>> placeholder = gfapy.Placeholder()
61 >>> placeholder.validate() # does nothing
62 >>> len(placeholder)
63 0
64 >>> placeholder[1]
65 gfapy.Placeholder()
66 >>> placeholder + 1
67 gfapy.Placeholder()
68
+0
-449
doc/tutorial/positional_fields.rst
0 .. testsetup:: *
1
2 import gfapy
3 gfa = gfapy.Gfa()
4
5 .. _positional_fields:
6
7 Positional fields
8 -----------------
9
10 Most lines in GFA have positional fields (Headers are an exception).
11 During parsing, if a line is encountered which has too few or too many
12 positional fields, an exception will be thrown. The correct number of
13 positional fields is record type-specific.
14
15 Positional fields are recognized by their position in the line. Each
16 positional field has an implicit field name and datatype associated with
17 it.
18
19 Field names
20 ~~~~~~~~~~~
21
22 The field names are derived from the specification. Lower case versions
23 of the field names are used and spaces are substituted with underscores.
24 In some cases, the field names were changed, as they represent keywords
25 in common programming languages or clash with potential tag names
26 (``from``, ``to``, ``send``).
27
28 The following tables show the field names used in Gfapy, for each kind
29 of line. Headers have no positional fields. Comments and custom records
30 follow particular rules, see the respective chapters (:ref:`comments` and
31 :ref:`custom_records`).
32
33 GFA1 field names
34 ^^^^^^^^^^^^^^^^
35
36 +---------------+--------------------+---------------------+------------------+-----------------+---------------+---------------+
37 | Record Type | Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 |
38 +===============+====================+=====================+==================+=================+===============+===============+
39 | Segment | ``name`` | ``sequence`` | | | | |
40 +---------------+--------------------+---------------------+------------------+-----------------+---------------+---------------+
41 | Link | ``from_segment`` | ``from_orient`` | ``to_segment`` | ``to_orient`` | ``overlap`` | |
42 +---------------+--------------------+---------------------+------------------+-----------------+---------------+---------------+
43 | Containment | ``from_segment`` | ``from_orient`` | ``to_segment`` | ``to_orient`` | ``pos`` | ``overlap`` |
44 +---------------+--------------------+---------------------+------------------+-----------------+---------------+---------------+
45 | Path | ``path_name`` | ``segment_names`` | ``overlaps`` | | | |
46 +---------------+--------------------+---------------------+------------------+-----------------+---------------+---------------+
47
48 GFA2 field names
49 ^^^^^^^^^^^^^^^^
50
51 +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+
52 | Record Type | Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 | Field 7 | Field 8 |
53 +===============+===========+================+================+=============+=============+=============+=================+=================+
54 | Segment | ``sid`` | ``slen`` | ``sequence`` | | | | | |
55 +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+
56 | Edge | ``eid`` | ``sid1`` | ``sid2`` | ``beg1`` | ``end1`` | ``beg2`` | ``end2`` | ``alignment`` |
57 +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+
58 | Fragment | ``sid`` | ``external`` | ``s_beg`` | ``s_end`` | ``f_beg`` | ``f_end`` | ``alignment`` | |
59 +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+
60 | Gap | ``gid`` | ``sid1`` | ``d1`` | ``d2`` | ``sid2`` | ``disp`` | ``var`` | |
61 +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+
62 | Set | ``pid`` | ``items`` | | | | | | |
63 +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+
64 | Path | ``pid`` | ``items`` | | | | | | |
65 +---------------+-----------+----------------+----------------+-------------+-------------+-------------+-----------------+-----------------+
66
67 Datatypes
68 ~~~~~~~~~
69
70 The datatype of each positional field is described in the specification
71 and cannot be changed (unlike that of tags). Here is a short
72 description of the Python classes used to represent data for different
73 datatypes.
74
75 Placeholders
76 ^^^^^^^^^^^^
77
78 The positional fields in GFA can never be empty. However, there are some
79 fields with optional values. If a value is not specified, a placeholder
80 character is used instead (``*``). Such undefined values are represented
81 in Gfapy by the `Placeholder` class, which is described in more
82 detail in the :ref:`placeholders` chapter.
83
84 Arrays
85 ^^^^^^
86
87 The ``items`` field in unordered and ordered groups and the
88 ``segment_names`` and ``overlaps`` fields in paths are lists of objects
89 and are represented by list instances.
90
91 .. doctest::
92
93 >>> set = gfapy.Line("U\t*\t1 A 2")
94 >>> type(set.items)
95 <class 'list'>
96 >>> gfa2_path = gfapy.Line("O\t*\tA+ B-")
97 >>> type(gfa2_path.items)
98 <class 'list'>
99 >>> gfa1_path = gfapy.Line("P\tp1\tA+,B-\t10M,9M1D1M")
100 >>> type(gfa1_path.segment_names)
101 <class 'list'>
102 >>> type(gfa1_path.overlaps)
103 <class 'list'>
104
105 Orientations
106 ^^^^^^^^^^^^
107
108 Orientations are represented by strings. The ``gfapy.invert()`` method
109 applied to an orientation string returns the other orientation.
110
111 .. doctest::
112
113 >>> gfapy.invert("+")
114 '-'
115 >>> gfapy.invert("-")
116 '+'
117
118 Identifiers
119 ^^^^^^^^^^^
120
121 The identifier of the line itself (available for S, P, E, G, U, O lines)
122 can always be accessed in Gfapy using the ``name`` alias and is
123 represented in Gfapy by a string. If it is optional (E, G, U, O lines)
124 and not specified, it is represented by a Placeholder instance. The
125 fragment identifier is also a string.
126
127 Identifiers which refer to other lines are also present in some line
128 types (L, C, E, G, U, O, F). These are never placeholders and in
129 stand-alone lines are represented by strings. In connected lines they
130 are references to the Line instances to which they refer (see the
131 :ref:`references` chapter).
132
133 Oriented identifiers
134 ^^^^^^^^^^^^^^^^^^^^
135
136 Oriented identifiers (e.g. ``segment_names`` in GFA1 paths) are
137 represented by instances of the class ``gfapy.OrientedLine``. The
138 ``line`` attribute of an oriented line returns the segment
139 identifier (or the segment reference in connected path lines) and the
140 ``orient`` attribute returns the orientation string. The ``name`` attribute
141 returns the name of the segment as a string, even if ``line`` is a
142 reference to a segment. A new oriented line can be created using
143 ``gfapy.OrientedLine(line, orientation)``.
144
145 Calling ``invert()`` inverts the orientation of the oriented line in
146 place. The two attributes can be set by assigning to ``line`` and
147 ``orient``.
148
149 Examples:
150
151 .. doctest::
152
153 >>> p = gfapy.Line("P\tP1\ta+,b-\t*")
154 >>> p.segment_names
155 [gfapy.OrientedLine('a','+'), gfapy.OrientedLine('b','-')]
156 >>> sn0 = p.segment_names[0]
157 >>> sn0.line
158 'a'
159 >>> sn0.name
160 'a'
161 >>> sn0.orient
162 '+'
163 >>> sn0.invert()
164 >>> sn0
165 gfapy.OrientedLine('a','-')
166 >>> sn0.orient
167 '-'
168 >>> sn0.line = gfapy.Line('S\tX\t*')
169 >>> str(sn0)
170 'X-'
171 >>> sn0.name
172 'X'
173 >>> sn0 = gfapy.OrientedLine(gfapy.Line('S\tY\t*'), '+')
174
175 Sequences
176 ^^^^^^^^^
177
178 Sequences (S field sequence) are represented by strings in Gfapy.
179 Depending on the GFA version, the alphabet definition is more or less
180 restrictive. The definitions are correctly applied by the validation
181 methods.
182
183 The method ``rc()`` is provided to compute the reverse complement of a
184 nucleotide sequence. The extended IUPAC alphabet is understood by the
185 method. Applied to non-nucleotide sequences, the results will be
186 meaningless:
187
188 .. doctest::
189
190 >>> from gfapy.sequence import rc
191 >>> rc("gcat")
192 'atgc'
193 >>> rc("*")
194 '*'
195 >>> rc("yatc")
196 'gatr'
197 >>> rc("gCat")
198 'atGc'
199 >>> rc("cag", rna=True)
200 'cug'
201
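The behaviour shown in the examples above can be approximated by a short stand-alone sketch. This is not the Gfapy implementation: the function name ``reverse_complement`` and the translation tables are assumptions made for illustration, covering the IUPAC codes and leaving any other character (such as ``*``) unchanged.

```python
# Hypothetical stand-alone sketch; Gfapy's own rc() may be implemented differently.
# IUPAC codes s, w and n are their own complement, so they are left unmapped.
_DNA = str.maketrans("acgtrymkbvdhACGTRYMKBVDH",
                     "tgcayrkmvbhdTGCAYRKMVBHD")
_RNA = str.maketrans("acgurymkbvdhACGURYMKBVDH",
                     "ugcayrkmvbhdUGCAYRKMVBHD")


def reverse_complement(seq, rna=False):
    """Complement each symbol (preserving case), then reverse the string."""
    table = _RNA if rna else _DNA
    return seq.translate(table)[::-1]


print(reverse_complement("gcat"))           # atgc
print(reverse_complement("yatc"))           # gatr
print(reverse_complement("cag", rna=True))  # cug
```

Characters without an entry in the table pass through untranslated, which is why a placeholder string is returned as is.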
202 Integers and positions
203 ^^^^^^^^^^^^^^^^^^^^^^
204
205 The C lines ``pos`` field and the G lines ``disp`` and ``var`` fields
206 are represented by integers. The ``var`` field is optional, and thus can
207 also be a placeholder. Positions are 0-based coordinates.
208
209 The position fields of GFA2 E lines (``beg1, beg2, end1, end2``) and F
210 lines (``s_beg, s_end, f_beg, f_end``) carry a dollar sign (``$``) as suffix
211 if the position is equal to the segment length. For more information,
212 see the :ref:`positions` chapter.
213
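The suffix rule for GFA2 positions can be illustrated with a small stand-alone sketch. It does not use Gfapy; the helper names ``format_position`` and ``parse_position`` are hypothetical and only mirror the rule stated above, namely that a position equal to the segment length is marked as such.

```python
from typing import Tuple


def format_position(pos: int, slen: int) -> str:
    """Render a 0-based position, adding '$' when it equals the segment length."""
    if not 0 <= pos <= slen:
        raise ValueError(f"position {pos} outside segment of length {slen}")
    return f"{pos}$" if pos == slen else str(pos)


def parse_position(text: str) -> Tuple[int, bool]:
    """Split a position string into (value, is_segment_end)."""
    is_end = text.endswith("$")
    return int(text[:-1] if is_end else text), is_end


print(format_position(90, 100))   # 90
print(format_position(100, 100))  # 100$
print(parse_position("20$"))      # (20, True)
```

Keeping the marker as a separate boolean, rather than as part of the number, lets the integer value be used directly in arithmetic.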
214 Alignments
215 ^^^^^^^^^^
216
217 Alignments are always optional, i.e. they can be placeholders. If they are
218 specified they are CIGAR alignments or, only in GFA2, trace alignments.
219 For more details, see the :ref:`alignments` chapter.
220
221 GFA1 datatypes
222 ^^^^^^^^^^^^^^
223
224 +------------------------+---------------+--------------------------------+
225 | Datatype | Record Type | Fields |
226 +========================+===============+================================+
227 | Identifier | Segment | ``name`` |
228 +------------------------+---------------+--------------------------------+
229 | | Path | ``path_name`` |
230 +------------------------+---------------+--------------------------------+
231 | | Link | ``from_segment, to_segment`` |
232 +------------------------+---------------+--------------------------------+
233 | | Containment | ``from_segment, to_segment`` |
234 +------------------------+---------------+--------------------------------+
235 | [OrientedIdentifier] | Path | ``segment_names`` |
236 +------------------------+---------------+--------------------------------+
237 | Orientation | Link | ``from_orient, to_orient`` |
238 +------------------------+---------------+--------------------------------+
239 | | Containment | ``from_orient, to_orient`` |
240 +------------------------+---------------+--------------------------------+
241 | Sequence | Segment | ``sequence`` |
242 +------------------------+---------------+--------------------------------+
243 | Alignment | Link | ``overlap`` |
244 +------------------------+---------------+--------------------------------+
245 | | Containment | ``overlap`` |
246 +------------------------+---------------+--------------------------------+
247 | [Alignment] | Path | ``overlaps`` |
248 +------------------------+---------------+--------------------------------+
249 | Position | Containment | ``pos`` |
250 +------------------------+---------------+--------------------------------+
251
252 GFA2 datatypes
253 ^^^^^^^^^^^^^^
254
255 +------------------------+---------------+----------------------------------+
256 | Datatype | Record Type | Fields |
257 +========================+===============+==================================+
258 | Identifier | Segment | ``sid`` |
259 +------------------------+---------------+----------------------------------+
260 | | Fragment | ``sid`` |
261 +------------------------+---------------+----------------------------------+
262 | OrientedIdentifier | Edge | ``sid1, sid2`` |
263 +------------------------+---------------+----------------------------------+
264 | | Gap | ``sid1, sid2`` |
265 +------------------------+---------------+----------------------------------+
266 | | Fragment | ``external`` |
267 +------------------------+---------------+----------------------------------+
268 | OptionalIdentifier | Edge | ``eid`` |
269 +------------------------+---------------+----------------------------------+
270 | | Gap | ``gid`` |
271 +------------------------+---------------+----------------------------------+
272 | | U Group | ``pid`` |
273 +------------------------+---------------+----------------------------------+
274 | | O Group | ``pid`` |
275 +------------------------+---------------+----------------------------------+
276 | [Identifier] | U Group | ``items`` |
277 +------------------------+---------------+----------------------------------+
278 | [OrientedIdentifier] | O Group | ``items`` |
279 +------------------------+---------------+----------------------------------+
280 | Sequence | Segment | ``sequence`` |
281 +------------------------+---------------+----------------------------------+
282 | Alignment | Edge | ``alignment`` |
283 +------------------------+---------------+----------------------------------+
284 | | Fragment | ``alignment`` |
285 +------------------------+---------------+----------------------------------+
286 | Position | Edge | ``beg1, end1, beg2, end2`` |
287 +------------------------+---------------+----------------------------------+
288 | | Fragment | ``s_beg, s_end, f_beg, f_end`` |
289 +------------------------+---------------+----------------------------------+
290 | Integer | Gap | ``disp, var`` |
291 +------------------------+---------------+----------------------------------+
292
293 Reading and writing positional fields
294 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
295
296 The ``positional_fieldnames`` property returns the list of the names (as
297 strings) of the positional fields of a line. Each positional field can
298 be read through an attribute of the Gfapy line object named after the
299 field. The value is set by assigning to the same attribute (e.g.
300 ``segment.slen = 120``). Alternatively, the
301 ``set(fieldname, value)`` and ``get(fieldname)`` methods can also be
302 used.
303
304 .. doctest::
305
306 >>> s_gfa1 = gfapy.Line("S\t1\t*")
307 >>> s_gfa1.positional_fieldnames
308 ['name', 'sequence']
309 >>> s_gfa1.name
310 '1'
311 >>> s_gfa1.get("name")
312 '1'
313 >>> s_gfa1.name = "segment2"
314 >>> s_gfa1.name
315 'segment2'
316 >>> s_gfa1.set('name',"3")
317 >>> s_gfa1.name
318 '3'
319
320 When a field is read, the value is converted into an appropriate object.
321 The string representation of a field can be read using the
322 ``field_to_s(fieldname)`` method.
323
324 .. doctest::
325
326 >>> gfa = gfapy.Gfa()
327 >>> gfa.add_line("S\ts1\t*")
328 >>> gfa.add_line("L\ts1\t+\ts2\t-\t*")
329 >>> link = gfa.dovetails[0]
330 >>> str(link.from_segment)
331 'S\ts1\t*'
332 >>> link.field_to_s('from_segment')
333 's1'
334
335 When setting a non-string field, the user can specify the value of the
336 field either as a Python non-string object, or as the string
337 representation of the value.
338
339 .. doctest::
340
341 >>> gfa = gfapy.Gfa(version='gfa1')
342 >>> gfa.add_line("C\ta\t+\tb\t-\t10\t*")
343 >>> c = gfa.containments[0]
344 >>> c.pos
345 10
346 >>> c.pos = 1
347 >>> c.pos
348 1
349 >>> c.pos = "2"
350 >>> c.pos
351 2
352 >>> c.field_to_s("pos")
353 '2'
354
355 Note that setting the value of reference-related and backreference-related
356 fields is generally not allowed when a line instance is connected to a
357 Gfa object (see the :ref:`references` chapter).
358
359 .. doctest::
360
361 >>> gfa = gfapy.Gfa(version='gfa1')
362 >>> l = gfapy.Line("L\ts1\t+\ts2\t-\t*")
363 >>> l.from_name
364 's1'
365 >>> l.from_segment = "s3"
366 >>> l.from_name
367 's3'
368 >>> gfa.add_line(l)
369 >>> l.from_segment = "s4"
370 Traceback (most recent call last):
371 ...
372 gfapy.error.RuntimeError: ...
373
374 Validation
375 ~~~~~~~~~~
376
377 The content of all positional fields must be a correctly formatted
378 string according to the rules given in the GFA specifications (or a
379 Python object whose string representation is a correctly formatted
380 string).
381
382 Depending on the validation level, more or fewer checks are done
383 automatically (see the :ref:`validation` chapter). Regardless of which
384 validation level is selected, the user can trigger a manual validation
385 using the ``validate_field(fieldname)`` method for a single field, or
386 using ``validate``, which does a full validation on the whole line,
387 including all positional fields.
388
389 .. doctest::
390
391 >>> line = gfapy.Line("H\txx:i:1")
392 >>> line.validate_field("xx")
393 >>> line.validate()
394
395 Aliases
396 ~~~~~~~
397
398 For some fields, aliases are defined, which can be used in all contexts
399 where the original field name is used: as a parameter of a method, and
400 through the same setter and getter attributes that are defined for the
401 original field name (see below).
402
403 .. doctest::
404
405 >>> gfa1_path = gfapy.Line("P\tX\t1-,2+,3+\t*")
406 >>> gfa1_path.name == gfa1_path.path_name
407 True
408 >>> edge = gfapy.Line("E\t*\tA+\tB-\t0\t10\t90\t100$\t*")
409 >>> edge.eid == edge.name
410 True
411 >>> containment = gfapy.Line("C\tA\t+\tB\t-\t10\t*")
412 >>> containment.from_segment == containment.container
413 True
414 >>> segment = gfapy.Line("S\t1\t*")
415 >>> segment.sid == segment.name
416 True
417 >>> segment.sid
418 '1'
419 >>> segment.name = '2'
420 >>> segment.sid
421 '2'
422
423 Name
424 ^^^^
425
426 Several record types have an identifier field: segments (name in GFA1,
427 sid in GFA2), paths (path\_name), edges (eid), fragments (sid), gaps (gid),
428 groups (pid).
429
430 All these fields are aliased to ``name``. This allows the user, for
431 example, to set the identifier of a line by assigning to ``name``,
432 using the same syntax for different record types (segments,
433 edges, paths, fragments, gaps and groups).
434
435 Version-specific field names
436 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
437
438 For segments the GFA1 name and the GFA2 sid are equivalent fields. For
439 this reason an alias ``sid`` is defined for GFA1 segments and ``name``
440 for GFA2 segments.
441
442 Cryptic field names
443 ^^^^^^^^^^^^^^^^^^^^
444
445 The definition of from and to for containments is somewhat cryptic.
446 Therefore the following aliases have been defined for containments:
447 ``container`` and ``container_orient`` for ``from_segment`` and ``from_orient``;
448 ``contained`` and ``contained_orient`` for ``to_segment`` and ``to_orient``.
+0
-75
doc/tutorial/positions.rst less more
0 .. testsetup:: *
1
2 import gfapy
3
4 .. _positions:
5
6 Positions
7 ---------
8
9 The only position field in GFA1 is the ``pos`` field in the C lines.
10 This represents the starting position of the contained segment in the
11 container segment and is 0-based.
12
13 Some fields in GFA2 E lines (``beg1, beg2, end1, end2``) and F lines
14 (``s_beg, s_end, f_beg, f_end``) are positions. According to the
15 specification, they are 0-based and represent virtual ticks before and
16 after each character in the sequence. Thus ranges are represented similarly
17 to the Python range conventions: e.g. a 1-character prefix of a sequence
18 has begin 0 and end 1.
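Because the tick convention matches Python slicing, it can be illustrated with plain strings. The following is a minimal sketch; the helper ``format_gfa2_pos`` is hypothetical (it is not part of Gfapy) and only shows when the ``$`` suffix applies, namely when a position equals the segment length:

```python
def format_gfa2_pos(pos, segment_length):
    """Return the GFA2 string form of a 0-based position (hypothetical helper)."""
    if not 0 <= pos <= segment_length:
        raise ValueError("position outside of the segment")
    # The dollar sign marks a position equal to the segment length.
    return f"{pos}$" if pos == segment_length else str(pos)


sequence = "ACGTAC"

# A 1-character prefix has begin 0 and end 1, exactly like a Python slice.
beg, end = 0, 1
prefix = sequence[beg:end]

# The last two characters span ticks 4 to 6; 6 is the segment length.
suffix_range = (format_gfa2_pos(4, len(sequence)),
                format_gfa2_pos(6, len(sequence)))
```

Here ``prefix`` is ``"A"`` and ``suffix_range`` is ``("4", "6$")``.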
19
20 Last positions in GFA2
21 ~~~~~~~~~~~~~~~~~~~~~~
22
23 GFA2 positions must have a dollar sign (``$``) appended to
24 the integer if (and only if) they are the last position in the segment
25 sequence. These particular positions are represented in Gfapy as
26 instances of the class :class:`~gfapy.lastpos.LastPos`.
27
28 To create a lastpos instance, the constructor can be used with an
29 integer, or the string representation (which must end with the dollar
30 sign, otherwise an integer is returned):
31
32 .. doctest::
33
34 >>> str(gfapy.LastPos(12))
35 '12$'
36 >>> gfapy.LastPos("12")
37 12
38 >>> str(gfapy.LastPos("12"))
39 '12'
40 >>> gfapy.LastPos("12$")
41 gfapy.LastPos(12)
42 >>> str(gfapy.LastPos("12$"))
43 '12$'
44
45 Subtracting an integer from a lastpos returns a lastpos if 0 is subtracted,
46 an integer otherwise. This makes it possible to do arithmetic on positions
47 without making them invalid.
48
49 .. doctest::
50
51 >>> gfapy.LastPos(12) - 0
52 gfapy.LastPos(12)
53 >>> gfapy.LastPos(12) - 1
54 11
55
56 The functions :func:`~gfapy.lastpos.islastpos` and
57 :func:`~gfapy.lastpos.isfirstpos` determine
58 whether a position value is 0 (first) or the last position, using
59 the same syntax for lastpos and integer instances.
60
61 .. doctest::
62
63 >>> gfapy.isfirstpos(0)
64 True
65 >>> gfapy.islastpos(0)
66 False
67 >>> gfapy.isfirstpos(12)
68 False
69 >>> gfapy.islastpos(12)
70 False
71 >>> gfapy.islastpos(gfapy.LastPos("12"))
72 False
73 >>> gfapy.islastpos(gfapy.LastPos("12$"))
74 True
+0
-444
doc/tutorial/references.rst less more
0 .. testsetup:: *
1
2 import gfapy
3 gfa = gfapy.Gfa()
4
5 .. _references:
6
7 References
8 ----------
9
10 Some fields in GFA lines contain identifiers or lists of identifiers
11 (sometimes followed by orientation strings), which reference other lines
12 of the GFA file. In Gfapy it is possible to follow these references and
13 traverse the graph.
14
15 Connecting a line to a Gfa object
16 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
17
18 In stand-alone line instances, the identifiers which reference other
19 lines are either strings containing the line name, pairs of strings
20 (name and orientation) in a ``gfapy.OrientedLine`` object, or lists of
21 line names or ``gfapy.OrientedLine`` objects.
22
23 Using the ``add_line(line)`` (alias: ``append(line)``) method of the
24 ``gfapy.Gfa`` object, or the equivalent ``connect(gfa)`` method of the
25 ``gfapy.Line`` instance, a line is added to a Gfa instance (this is done
26 automatically when a GFA file is parsed). All strings expressing
27 references are then changed into references to the corresponding line
28 objects. The method ``is_connected()`` tells whether a line is
29 connected to a Gfa instance. The read-only property ``gfa`` contains
30 the ``gfapy.Gfa`` instance to which the line is connected.
31
32 .. doctest::
33
34 >>> gfa = gfapy.Gfa(version='gfa1')
35 >>> link = gfapy.Line("L\tA\t-\tB\t+\t20M")
36 >>> link.is_connected()
37 False
38 >>> link.gfa is None
39 True
40 >>> type(link.from_segment)
41 <class 'str'>
42 >>> gfa.append(link)
43 >>> link.is_connected()
44 True
45 >>> link.gfa #doctest: +ELLIPSIS
46 <gfapy.gfa.Gfa object at ...>
47 >>> type(link.from_segment)
48 <class 'gfapy.line.segment.gfa1.GFA1'>
49
50 References for each record type
51 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
52
53 The following tables describe the references contained in each record
54 type. The notation ``[]`` represents lists.
55
56 GFA1
57 ^^^^
58
59 +---------------+-------------------+---------------------------+
60 | Record type | Fields | Type of reference |
61 +===============+===================+===========================+
62 | Link | from_segment, to_segment | Segment |
63 +---------------+-------------------+---------------------------+
64 | Containment | from_segment, to_segment | Segment |
65 +---------------+-------------------+---------------------------+
66 | Path | segment\_names, | [OrientedLine(Segment)] |
67 +---------------+-------------------+---------------------------+
68 | | links (1) | [OrientedLine(Link)] |
69 +---------------+-------------------+---------------------------+
70
71 (1): paths contain information in the fields segment\_names and
72 overlaps, which makes it possible to identify the links on which they depend; these
73 links can be retrieved using ``links`` (which is not a field).
74
75 GFA2
76 ^^^^
77
78 +---------------+--------------+------------------------------------+
79 | Record type | Fields | Type of reference |
80 +===============+==============+====================================+
81 | Edge | sid1, sid2 | Segment |
82 +---------------+--------------+------------------------------------+
83 | Gap | sid1, sid2 | Segment |
84 +---------------+--------------+------------------------------------+
85 | Fragment | sid | Segment |
86 +---------------+--------------+------------------------------------+
87 | Set | items | [Edge/Set/Path/Segment] |
88 +---------------+--------------+------------------------------------+
89 | Path | items | [OrientedLine(Edge/Set/Segment)] |
90 +---------------+--------------+------------------------------------+
91
92 Backreferences for each record type
93 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
94
95 When a line containing a reference to another line is connected to a Gfa
96 object, backreferences to it are created in the targeted line.
97
98 For each backreference collection a read-only property exists, which is
99 named after the collection (e.g. ``dovetails_L`` for segments). Note that
100 the reference lists returned by these properties are read-only; editing
101 the references is done using other methods (see the section "Editing
102 reference fields" below).
103
104 .. code:: python
105
106 segment.dovetails_L # => [gfapy.line.edge.Link(...), ...]
107
108 The following tables describe the backreferences collections for each
109 record type.
110
111 GFA1
112 ^^^^
113
114 +---------------+-------------------------+
115 | Record type | Backreferences |
116 +===============+=========================+
117 | Segment | dovetails\_L |
118 +---------------+-------------------------+
119 | | dovetails\_R |
120 +---------------+-------------------------+
121 | | edges\_to\_contained |
122 +---------------+-------------------------+
123 | | edges\_to\_containers |
124 +---------------+-------------------------+
125 | | paths |
126 +---------------+-------------------------+
127 | Link | paths |
128 +---------------+-------------------------+
129
130 GFA2
131 ^^^^
132
133 +---------------+-------------------------+--------+
134 | Record type | Backreferences | Type |
135 +===============+=========================+========+
136 | Segment | dovetails\_L | E |
137 +---------------+-------------------------+--------+
138 | | dovetails\_R | E |
139 +---------------+-------------------------+--------+
140 | | edges\_to\_contained | E |
141 +---------------+-------------------------+--------+
142 | | edges\_to\_containers | E |
143 +---------------+-------------------------+--------+
144 | | internals | E |
145 +---------------+-------------------------+--------+
146 | | gaps\_L | G |
147 +---------------+-------------------------+--------+
148 | | gaps\_R | G |
149 +---------------+-------------------------+--------+
150 | | fragments | F |
151 +---------------+-------------------------+--------+
152 | | paths | O |
153 +---------------+-------------------------+--------+
154 | | sets | U |
155 +---------------+-------------------------+--------+
156 | Edge | paths | O |
157 +---------------+-------------------------+--------+
158 | | sets | U |
159 +---------------+-------------------------+--------+
160 | O Group | paths | O |
161 +---------------+-------------------------+--------+
162 | | sets | U |
163 +---------------+-------------------------+--------+
164 | U Group | sets | U |
165 +---------------+-------------------------+--------+
166
167 Segment backreference convenience methods
168 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
169
170 For segments, additional methods are available which combine the
171 backreference information in different ways. The
172 `dovetails_of_end` and `gaps_of_end` methods take an
173 argument ``L`` or ``R`` and return the dovetail overlaps (or gaps) of the
174 left or, respectively, right end of the segment sequence
175 (equivalent to the segment properties ``dovetails_L``/``dovetails_R`` and
176 ``gaps_L``/``gaps_R``).
177
178 The segment ``containments`` property is a list of all containments in which the
179 segment is either the container or the contained segment. The segment ``edges``
180 property is a list of all edges (dovetails, containments and internals)
181 with a reference to the segment.
182
183 Other methods directly compute lists of segments from the edge lists
184 mentioned above. The ``neighbours_L``, ``neighbours_R`` properties and
185 the `neighbours` method compute the set of segment instances which are
186 connected by dovetails to the segment.
187 The ``containers`` and ``contained``
188 properties similarly compute the set of segment instances which,
189 respectively, contain the segment, or are contained in the segment.
190
191 .. doctest::
192
193 >>> gfa = gfapy.Gfa()
194 >>> gfa.append('S\tA\t*')
195 >>> s = gfa.segment('A')
196 >>> gfa.append('S\tB\t*')
197 >>> gfa.append('S\tC\t*')
198 >>> gfa.append('L\tA\t-\tB\t+\t*')
199 >>> gfa.append('C\tA\t+\tC\t+\t10\t*')
200 >>> [str(l) for l in s.dovetails_of_end("L")]
201 ['L\tA\t-\tB\t+\t*']
202 >>> s.dovetails_L == s.dovetails_of_end("L")
203 True
204 >>> s.gaps_of_end("R")
205 []
206 >>> [str(e) for e in s.edges]
207 ['L\tA\t-\tB\t+\t*', 'C\tA\t+\tC\t+\t10\t*']
208 >>> [str(n) for n in s.neighbours_L]
209 ['S\tB\t*']
210 >>> s.containers
211 []
212 >>> [str(c) for c in s.contained]
213 ['S\tC\t*']
214
215 Multiline group definitions
216 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
217
218 The GFA2 specification allows (experimentally) defining
219 groups on multiple lines, by using the same ID for each line defining
220 the group. This is supported by gfapy.
221
222 This means that if multiple `Ordered` or
223 `Unordered` instances connected to a Gfa object have
224 the same ``pid``, they are merged into a single instance (technically
225 the last one added to the graph object). The item lists are
226 merged.
227
228 The tags of the multiple lines defining a group shall not contradict each
229 other: either the tag names on the different lines defining the
230 group are all different, or, if the same tag is present on different lines,
231 its value and datatype must be the same, in which case the repeated
232 definition is ignored.
233
234 .. doctest::
235
236 >>> gfa = gfapy.Gfa()
237 >>> gfa.add_line("U\tu1\ts1 s2 s3")
238 >>> [s.name for s in gfa.sets[-1].items]
239 ['s1', 's2', 's3']
240 >>> gfa.add_line('U\tu1\t4 5')
241 >>> [s.name for s in gfa.sets[-1].items]
242 ['s1', 's2', 's3', '4', '5']
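The rule for tags on multiline group definitions can be expressed in a few lines. The following is a sketch under stated assumptions, not Gfapy's code: tags are modelled as a plain dictionary mapping the tag name to a ``(datatype, value)`` pair, and ``merge_group_tags`` is a hypothetical helper:

```python
def merge_group_tags(existing, new):
    """Merge the tags of two lines that define the same group.

    Both arguments map a tag name to a (datatype, value) pair. A repeated
    tag is accepted only if datatype and value are identical.
    """
    merged = dict(existing)
    for name, typed_value in new.items():
        if name in merged and merged[name] != typed_value:
            raise ValueError(f"contradictory definitions of tag {name}")
        # New tag, or an identical repetition that changes nothing.
        merged[name] = typed_value
    return merged


first_line_tags = {"xx": ("i", 1)}
second_line_tags = {"xx": ("i", 1), "yy": ("Z", "note")}
combined = merge_group_tags(first_line_tags, second_line_tags)
```

Merging the same tag with a different value or datatype, for example ``{"xx": ("i", 2)}``, raises ``ValueError`` in this sketch.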
243
244 Induced set and captured path
245 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
246
247 The item list in GFA2 sets and paths may omit elements which are
248 implicitly involved. For example, a path may contain segments without
249 specifying the edges connecting them, if there is only one such edge.
250 Alternatively a path may contain edges, without explicitly indicating the
251 segments. Similarly a set may contain edges, but not the segments
252 referred to in them, or contain segments which are connected by edges,
253 without the edges themselves. Furthermore groups may refer to other
254 groups (set to sets or paths, paths to paths only), which then
255 indirectly contain references to segments and edges.
256
257 Gfapy provides methods for the computation of the sets of segments and
258 edges which are implied by an ordered or unordered group. Thereby all
259 references to subgroups are resolved and implicit elements are added, as
260 described in the specification. The computation can, therefore, only be
261 applied to connected lines. For unordered groups, this computation is
262 provided by the method ``induced_set()``, which returns a list of
263 segment and edge instances. For ordered groups, the computation is
264 provided by the method ``captured_path()``, which returns a list of
265 ``gfapy.OrientedLine`` instances, alternating segment and edge instances
266 (and starting and ending in segments).
267
268 The methods ``induced_segments_set()``, ``induced_edges_set()``,
269 ``captured_segments()`` and ``captured_edges()`` return, respectively,
270 the list of only segments or edges, in ordered or unordered groups.
271
272 .. doctest::
273
274 >>> gfa = gfapy.Gfa()
275 >>> gfa.add_line("S\ts1\t100\t*")
276 >>> gfa.add_line("S\ts2\t100\t*")
277 >>> gfa.add_line("S\ts3\t100\t*")
278 >>> gfa.add_line("E\te1\ts1+\ts2-\t0\t10\t90\t100$\t*")
279 >>> gfa.add_line("U\tu1\ts1 s2 s3")
280 >>> u = gfa.sets[-1]
281 >>> [l.name for l in u.induced_edges_set]
282 ['e1']
283 >>> [l.name for l in u.induced_segments_set ]
284 ['s1', 's2', 's3']
285 >>> [l.name for l in u.induced_set ]
286 ['s1', 's2', 's3', 'e1']
287
288 Disconnecting a line from a Gfa object
289 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
290
291 Lines can be disconnected using the ``rm(line)`` method of the
292 ``gfapy.Gfa`` object or the ``disconnect()`` method of the line
293 instance.
294
295 .. doctest::
296
297 >>> gfa = gfapy.Gfa()
298 >>> gfa.append('S\tsA\t*')
299 >>> gfa.append('S\tsB\t*')
300 >>> line = gfa.segment("sA")
301 >>> gfa.segment_names
302 ['sA', 'sB']
303 >>> gfa.rm(line)
304 >>> gfa.segment_names
305 ['sB']
306 >>> line = gfa.segment('sB')
307 >>> line.disconnect()
308 >>> gfa.segment_names
309 []
310
311 Disconnecting a line affects other lines as well. Lines which
312 depend on the disconnected line are disconnected too, and any other
313 reference to disconnected lines is removed. In the disconnected
314 line, references to lines are transformed back to strings and
315 backreferences are deleted.
316
317 The following tables show which dependent lines are disconnected if they
318 refer to a line which is being disconnected.
319
320 GFA1
321 ^^^^
322
323 +---------------+---------------------------------+
324 | Record type | Dependent lines |
325 +===============+=================================+
326 | Segment | links (+ paths), containments |
327 +---------------+---------------------------------+
328 | Link | paths |
329 +---------------+---------------------------------+
330
331 GFA2
332 ^^^^
333
334 +---------------+---------------------------------------+
335 | Record type | Dependent lines |
336 +===============+=======================================+
337 | Segment | edges, gaps, fragments, sets, paths |
338 +---------------+---------------------------------------+
339 | Edge | sets, paths |
340 +---------------+---------------------------------------+
341 | Sets | sets, paths |
342 +---------------+---------------------------------------+
343
344 Editing reference fields
345 ~~~~~~~~~~~~~~~~~~~~~~~~
346
347 In connected line instances, it is not allowed to directly change the
348 content of fields containing references to other lines, as this would
349 make the state of the Gfa object invalid.
350
351 Besides the fields containing references, some other fields are
352 read-only in connected lines. Changing some of the fields would require
353 moving the backreferences to other collections (position fields of edges
354 and gaps, ``from_orient`` and ``to_orient`` of links). The overlaps
355 field of connected links is read-only as it may be necessary to identify
356 the link in paths.
357
358 Renaming an element
359 ^^^^^^^^^^^^^^^^^^^
360
361 The name field of a line (e.g. segment ``name``/``sid``) is not a
362 reference and thus can also be edited in connected lines. When the name
363 of the line is changed, no manual editing of references (e.g.
364 ``from_segment``/``to_segment``
365 fields in links) is necessary, as all lines which refer to the line will
366 still refer to the same instance. The references to the instance in the
367 Gfa lines collections will be automatically updated. Also, the new name
368 will be correctly used when converting to string, such as when the Gfa
369 instance is written to a GFA file.
370
371 Renaming a line to a name which already exists has the same effect as
372 adding a line with that name. That is, in most cases,
373 ``gfapy.NotUniqueError`` is raised. The exception is GFA2 sets and
374 paths: in this case the line will be appended to the existing line with
375 the same name (as described in "Multiline group definitions").
376
377 Adding and removing group elements
378 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
379
380 Elements of GFA2 groups can be added and removed from both connected and
381 non-connected lines, using the following methods.
382
383 To add an item to or remove an item from an unordered group, use the
384 methods ``add_item(item)`` and ``rm_item(item)``, which take as argument
385 either a string (identifier) or a line instance.
386
387 To append or prepend an item to an ordered group, use the methods
388 ``append_item(item)`` and ``prepend_item(item)``. To remove the first or
389 the last item of an ordered group use the methods ``rm_first_item()``
390 and ``rm_last_item()``.
391
392 Editing read-only fields of connected lines
393 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
394
395 Editing the read-only information of edges, gaps, links, containments,
396 fragments and paths is more complicated. These lines must be
397 disconnected before the edit and connected again to the Gfa object after
398 it. Before disconnecting a line, you should check if there are other
399 lines dependent on it (see tables above). If so, you will have to
400 disconnect these lines first, update their fields if necessary, and
401 reconnect them at the end of the operation.
402
403 Virtual lines
404 ~~~~~~~~~~~~~
405
406 The order of the lines in GFA is not prescribed. Therefore, during
407 parsing, or constructing a Gfa in memory, it is possible that a line is
408 referenced before it is added to the Gfa instance. Whenever this
409 happens, Gfapy creates a "virtual" line instance.
410
411 Users do not have to deal with virtual lines if they work with
412 complete and valid GFA files.
413
414 Virtual lines are similar to normal line instances, with some
415 limitations (they contain only limited information and it is not allowed
416 to add tags to them). To check if a line is a virtual line, one can use
417 the ``virtual`` property of the line.
418
419 As soon as the parser finds the real line corresponding to a previously
420 introduced virtual line, the virtual line is replaced by the real
421 line and all references are corrected to point to the real line.
422
423 .. doctest::
424
425 >>> g = gfapy.Gfa()
426 >>> g.add_line("S\t1\t*")
427 >>> g.add_line("L\t1\t+\t2\t+\t*")
428 >>> l = g.dovetails[0]
429 >>> g.segment("1").virtual
430 False
431 >>> g.segment("2").virtual
432 True
433 >>> l.to_segment == g.segment("2")
434 True
435 >>> g.segment("2").dovetails == [l]
436 True
437 >>> g.add_line("S\t2\t*")
438 >>> g.segment("2").virtual
439 False
440 >>> l.to_segment == g.segment("2")
441 True
442 >>> g.segment("2").dovetails == [l]
443 True
+0
-36
doc/tutorial/rgfa.rst less more
0 .. testsetup:: *
1
2 import gfapy
3
4 .. _rgfa:
5
6 rGFA
7 ----
8
9 rGFA (https://github.com/lh3/gfatools/blob/master/doc/rGFA.md)
10 is a subset of GFA1, in which only particular line types (S and L)
11 are allowed, and the S lines are required to contain the tags
12 `SN` (of type `Z`), `SO` and `SR` (of type `i`).
13
14 When working with rGFA files, it is convenient to use the `dialect="rgfa"`
15 option in the constructor `Gfa()` and in
16 :func:`Gfa.from_file() <gfapy.gfa.Gfa.from_file>`.
17
18 This ensures that additional validations are performed: the GFA version must be 1,
19 only rGFA-compatible lines (S,L) are allowed, and the required tags must be
20 present (with the correct datatype). The validations can also be executed
21 manually using :func:`Gfa.validate_rgfa() <gfapy.gfa.Gfa.validate_rgfa>`.
22
23 Furthermore, the `stable_sequence_names` attribute of the GFA objects
24 returns the set of stable sequence names contained in the `SN` tags
25 of the segments.
26
27 .. doctest::
28
29 >>> g = gfapy.Gfa("S\tS1\tCTGAA\tSN:Z:chr1\tSO:i:0\tSR:i:0", dialect="rgfa")
30 >>> g.segment_names
31 ['S1']
32 >>> g.stable_sequence_names
33 ['chr1']
34 >>> g.add_line("S\tS2\tACG\tSN:Z:chr1\tSO:i:5\tSR:i:0")
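The required-tag rule for rGFA segments can also be checked on the raw text of an S line. The following is a minimal sketch that does not use Gfapy; the helper ``missing_rgfa_tags`` is hypothetical and assumes a GFA1 S line, where tags start at the fourth tab-separated column:

```python
# Tags which every rGFA segment must carry, with the expected datatype.
REQUIRED_RGFA_TAGS = {"SN": "Z", "SO": "i", "SR": "i"}


def missing_rgfa_tags(s_line):
    """Return the sorted names of required tags absent or wrongly typed."""
    fields = s_line.rstrip("\n").split("\t")
    if fields[0] != "S":
        raise ValueError("not a segment line")
    found = {}
    for field in fields[3:]:
        # A tag has the form NN:T:DATA; the data part may contain colons.
        name, datatype, _data = field.split(":", 2)
        found[name] = datatype
    return sorted(name for name, datatype in REQUIRED_RGFA_TAGS.items()
                  if found.get(name) != datatype)


ok = missing_rgfa_tags("S\tS1\tCTGAA\tSN:Z:chr1\tSO:i:0\tSR:i:0")
bad = missing_rgfa_tags("S\tS1\tCTGAA\tSN:Z:chr1\tSO:Z:0")
```

In this sketch ``ok`` is an empty list, while ``bad`` lists ``SO`` (wrong datatype) and ``SR`` (absent).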
35
+0
-436
doc/tutorial/tags.rst less more
0 .. testsetup:: *
1
2 import gfapy
3 gfa = gfapy.Gfa()
4
5 .. _tags:
6
7 Tags
8 ----
9
10 Each record in GFA can contain tags. Tags are fields which consist of a
11 tag name, a datatype and data. The format is ``NN:T:DATA`` where ``NN``
12 is a two-letter tag name, ``T`` is a one-letter datatype string and
13 ``DATA`` is a string representing the data according to the specified
14 datatype. Tag names must be unique for each line, i.e. each line may
15 only contain a tag once.
16
17 ::
18
19 # Examples of GFA tags of different datatypes:
20 "aa:i:-12"
21 "bb:f:1.23"
22 "cc:Z:this is a string"
23 "dd:A:X"
24 "ee:B:c,12,3,2"
25 "ff:H:122FA0"
26 'gg:J:["A","B"]'
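The ``NN:T:DATA`` layout can be taken apart with a regular expression. The following is a sketch for illustration only, not Gfapy's parser: ``split_tag`` and ``has_unique_tag_names`` are hypothetical helpers, and the pattern is deliberately loose (it accepts the datatype letters used in the examples above and leaves the validation of ``DATA`` aside):

```python
import re

# Two-character name, one-letter datatype, then the data (which may itself
# contain colons). The exact grammar is defined by the GFA specifications.
_TAG = re.compile(r"^([A-Za-z][A-Za-z0-9]):([AifZJHB]):(.*)$")


def split_tag(field):
    """Split a tag string into (name, datatype, data)."""
    match = _TAG.match(field)
    if match is None:
        raise ValueError(f"not a tag: {field!r}")
    return match.groups()


def has_unique_tag_names(fields):
    """Check that no tag name occurs more than once on a line."""
    names = [split_tag(field)[0] for field in fields]
    return len(names) == len(set(names))
```

For example, ``split_tag("cc:Z:this is a string")`` yields the name ``cc``, the datatype ``Z`` and the data ``this is a string``.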
27
28 Custom tags
29 ~~~~~~~~~~~
30
31 Some tags are explicitly defined in the specification (these are named
32 *predefined tags* in Gfapy), and the user or an application can define
33 its own custom tags. These may contain lower case letters.
34
35 Custom tags are user or program specific and may of course collide with
36 the tags used by other users or programs. For this reason, if you write
37 scripts which employ custom tags, you should always check that the
38 values are of the correct datatype and plausible.
39
40 .. doctest::
41
42 >>> line = gfapy.Line("H\txx:i:2")
43 >>> if line.get_datatype("xx") != "i":
44 ... raise Exception("I expected the tag xx to contain an integer!")
45 >>> myvalue = line.xx
46 >>> if (myvalue > 120) or (myvalue % 2 == 1):
47 ... raise Exception("The value in the xx tag is not an even value <= 120")
48 >>> # ... do something with myvalue
49
50 It is also good practice to allow the user of the script to change the
51 name of the custom tags. For example, Gfapy employs the ``or`` custom tag
52 to track the original segment from which a segment in the final graph is
53 derived. All methods which read or write the ``or`` tag accept
54 an alternative tag name to use instead of ``or``, in case this
55 name collides with the custom tag of another program.
56
57 .. code:: python
58
59 # E.g. a method which does something with myvalue, usually stored in tag xx
60 # allows the user to specify an alternative name for the tag
61 def mymethod(line, mytag="xx"):
62 myvalue = line.get(mytag)
63 # ...
64
65 Predefined tags
66 ~~~~~~~~~~~~~~~
67
68 According to the GFA specifications, predefined tag names consist of either
69 two upper case letters, or an upper case letter followed by a digit.
70 The GFA1 specification predefines tags for each line type, while GFA2
71 only predefines tags for the header and edges.
72
73 While tags with the predefined names are allowed to be added to any line,
74 when they are used in the lines mentioned in the specification (e.g. `VN`
75 in the header) gfapy checks that the datatype is the one prescribed by
76 the specification (e.g. `VN` must be of type `Z`). It is not forbidden
77 to use the same tags in other contexts, but in this case, the datatype
78 restriction is not enforced.
79
80 +-----+------+------------+-------------+
81 | Tag | Type | Line types | GFA version |
82 +=====+======+============+=============+
83 | VN  | Z    | H          | 1,2         |
84 +-----+------+------------+-------------+
85 | TS  | i    | H,S        | 2           |
86 +-----+------+------------+-------------+
87 | LN  | i    | S          | 1           |
88 +-----+------+------------+-------------+
89 | RC  | i    | S,L,C      | 1           |
90 +-----+------+------------+-------------+
91 | FC  | i    | S,L        | 1           |
92 +-----+------+------------+-------------+
93 | KC  | i    | S,L        | 1           |
94 +-----+------+------------+-------------+
95 | SH  | H    | S          | 1           |
96 +-----+------+------------+-------------+
97 | UR  | Z    | S          | 1           |
98 +-----+------+------------+-------------+
99 | MQ  | i    | L          | 1           |
100 +-----+------+------------+-------------+
101 | NM  | i    | L,C        | 1           |
102 +-----+------+------------+-------------+
103 | ID  | Z    | L,C        | 1           |
104 +-----+------+------------+-------------+
105
106 ::
107
108 "VN:Z:1.0" # VN => predefined tag
109 "z5:Z:1.0" # z5 first char is downcase => custom tag
110 "XX:Z:aaa" # XX upper case, but not predefined => custom tag
111
112 # not forbidden, but not recommended:
113 "zZ:Z:1.0" # => mixed case, first char downcase => custom tag
114 "Zz:Z:1.0" # => mixed case, first char upcase => custom tag
115 "vn:Z:1.0" # => same name as predefined tag, but downcase => custom tag
116
117 Datatypes
118 ~~~~~~~~~
119
120 The following table summarizes the datatypes available for tags:
121
122 +----------+-----------------+---------------------------+----------------------+
123 | Symbol | Datatype | Example | Python class |
124 +==========+=================+===========================+======================+
125 | Z | string | This is a string | str |
126 +----------+-----------------+---------------------------+----------------------+
127 | i | integer | -12 | int |
128 +----------+-----------------+---------------------------+----------------------+
129 | f | float | 1.2E-5 | float |
130 +----------+-----------------+---------------------------+----------------------+
131 | A | char | X | str |
132 +----------+-----------------+---------------------------+----------------------+
133 | J | JSON | [1,{"k1":1,"k2":2},"a"] | list/dict |
134 +----------+-----------------+---------------------------+----------------------+
135 | B | numeric array | f,1.2,13E-2,0 | gfapy.NumericArray |
136 +----------+-----------------+---------------------------+----------------------+
137 | H | byte array | FFAA01 | gfapy.ByteArray |
138 +----------+-----------------+---------------------------+----------------------+
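
The mapping in the table above can be illustrated with a small standalone sketch. It does not use Gfapy; ``parse_tag`` is a hypothetical helper written only for this example, and it covers only the scalar and JSON datatypes (``B`` and ``H`` values are left as plain text, since Gfapy represents them with dedicated classes).

```python
import json


def parse_tag(tag):
    """Split a GFA tag string into (name, datatype, value).

    Hypothetical helper for illustration; not part of the Gfapy API.
    """
    # Only the first two colons are separators: the value may contain colons.
    name, datatype, raw = tag.split(":", 2)
    if datatype == "i":
        value = int(raw)
    elif datatype == "f":
        value = float(raw)
    elif datatype == "J":
        value = json.loads(raw)
    else:
        # Z and A are kept as strings; B and H are left unparsed in this sketch.
        value = raw
    return name, datatype, value


print(parse_tag("aa:i:-12"))
print(parse_tag('gg:J:["A","B"]'))
```

In Gfapy itself this conversion happens automatically when a tag is read from a line.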
139
140 Validation
141 ~~~~~~~~~~
142
143 The tag names must consist of a letter and a digit or two letters.
144
145 ::
146
147 "KC:i:1" # => OK
148 "xx:i:1" # => OK
149 "x1:i:1" # => OK
150 "xxx:i:1" # => error: name is too long
151 "x:i:1" # => error: name is too short
152 "11:i:1" # => error: at least one letter must be present
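
The name rule above can be sketched as a regular expression. The pattern below is an assumption taken from the GFA specification (a letter followed by a letter or a digit); the check actually performed by Gfapy may differ in details, and ``is_valid_tag_name`` is a hypothetical helper.

```python
import re

# Assumed pattern (from the GFA specification): a letter, then a letter or digit.
TAG_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]")


def is_valid_tag_name(name):
    """Return True if name has the shape of a GFA tag name (illustrative only)."""
    return TAG_NAME.fullmatch(name) is not None


for name in ["KC", "xx", "x1", "xxx", "x", "11"]:
    print(name, is_valid_tag_name(name))
```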
153
154 The datatype must be one of the datatypes specified above. For
155 predefined tags, Gfapy also checks that the datatype given in the
156 specification is used.
157
158 ::
159
160 "xx:X:1" # => error: datatype X is unknown
161 "VN:i:1" # => error: VN must be of type Z
162
163 The data must be a correctly formatted string for the specified datatype
164 or a Python object whose string representation is a correctly formatted
165 string.
166
167 .. doctest::
168
169 # current value: xx:i:2
170 >>> line = gfapy.Line("S\tA\t*\txx:i:2")
171 >>> line.xx = 1
172 >>> line.xx
173 1
174 >>> line.xx = "3"
175 >>> line.xx
176 3
177 >>> line.xx = "A"
178 >>> line.xx
179 Traceback (most recent call last):
180 ...
181 gfapy.error.FormatError: ...
182
183 Depending on the validation level, more or fewer checks are done
184 automatically (see :ref:`validation` chapter). By default (validation
185 level 1), validation is performed only during parsing or when accessing
186 values for the first time; therefore the user must perform a manual
187 validation after changing a value to something which is not guaranteed to
188 be correct. To trigger a manual validation, the user can call the method
189 ``validate_field(fieldname)`` to validate a single tag, or
190 ``validate()`` to validate the whole line, including all tags.
191
192 .. doctest::
193
194 >>> line = gfapy.Line("S\tA\t*\txx:i:2", vlevel = 0)
195 >>> line.validate_field("xx")
196 >>> line.validate()
197 >>> line.xx = "A"
198 >>> line.validate_field("xx")
199 Traceback (most recent call last):
200 ...
201 gfapy.error.FormatError: ...
202 >>> line.validate()
203 Traceback (most recent call last):
204 ...
205 gfapy.error.FormatError: ...
206 >>> line.xx = "3"
207 >>> line.validate_field("xx")
208 >>> line.validate()
209
210 Reading and writing tags
211 ~~~~~~~~~~~~~~~~~~~~~~~~
212
213 Tags can be read using a property on the Gfapy line object, which is
214 named after the tag (e.g. ``line.xx``). A special version, prefixed
215 by ``try_get_`` and called as a method, raises an error if the tag is not
216 available (e.g. ``line.try_get_LN()``), while the tag property (e.g. ``line.LN``)
217 would return ``None`` in this case. A value is set by assigning
218 it to the tag property (e.g. ``line.TS = 120``).
219 Alternatively, the ``set(fieldname, value)``, ``get(fieldname)`` and
220 ``try_get(fieldname)`` methods can also be used. To remove a tag from a
221 line, use the ``delete(fieldname)`` method, or set its value to
222 ``None``. The ``tagnames`` property of Line instances is a list of
223 the names (as strings) of all defined tags for a line.
224
225
226 .. doctest::
227
228 >>> line = gfapy.Line("S\tA\t*\txx:i:1", vlevel = 0)
229 >>> line.xx
230 1
231 >>> line.xy is None
232 True
233 >>> line.try_get_xx()
234 1
235 >>> line.try_get_xy()
236 Traceback (most recent call last):
237 ...
238 gfapy.error.NotFoundError: ...
239 >>> line.get("xx")
240 1
241 >>> line.try_get("xy")
242 Traceback (most recent call last):
243 ...
244 gfapy.error.NotFoundError: ...
245 >>> line.xx = 2
246 >>> line.xx
247 2
248 >>> line.xx = "a"
249 >>> line.tagnames
250 ['xx']
251 >>> line.xy = 2
252 >>> line.xy
253 2
254 >>> line.set("xy", 3)
255 >>> line.get("xy")
256 3
257 >>> line.tagnames
258 ['xx', 'xy']
259 >>> line.delete("xy")
260 3
261 >>> line.xy is None
262 True
263 >>> line.xx = None
264 >>> line.xx is None
265 True
266 >>> line.try_get("xx")
267 Traceback (most recent call last):
268 ...
269 gfapy.error.NotFoundError: ...
270 >>> line.tagnames
271 []
272
273 When a tag is read, the value is converted into an appropriate object
274 (see Python classes in the datatype table above). When setting a value,
275 the user can specify the value of a tag either as a Python object, or as
276 the string representation of the value.
277
278 .. doctest::
279
280 >>> line = gfapy.Line('H\txx:i:1\txy:Z:TEXT\txz:J:["a","b"]')
281 >>> line.xx
282 1
283 >>> isinstance(line.xx, int)
284 True
285 >>> line.xy
286 'TEXT'
287 >>> isinstance(line.xy, str)
288 True
289 >>> line.xz
290 ['a', 'b']
291 >>> isinstance(line.xz, list)
292 True
293
294 The string representation of a tag can be read using the
295 ``field_to_s(fieldname)`` method. The default is to only output the
296 content of the field. By setting ``tag=True``, the entire tag is
297 output (name, datatype, content, separated by colons). An exception is
298 raised if the field does not exist.
299
300 .. doctest::
301
302 >>> line = gfapy.Line("H\txx:i:1")
303 >>> line.xx
304 1
305 >>> line.field_to_s("xx")
306 '1'
307 >>> line.field_to_s("xx", tag=True)
308 'xx:i:1'
309
310 Datatype of custom tags
311 ~~~~~~~~~~~~~~~~~~~~~~~
312
313 The datatype of an existing custom field (but not of predefined fields)
314 can be changed using the ``set_datatype(fieldname, datatype)`` method.
315 The current datatype specification can be read using
316 ``get_datatype(fieldname)``.
317
318 .. doctest::
319
320 >>> line = gfapy.Line("H\txx:i:1")
321 >>> line.get_datatype("xx")
322 'i'
323 >>> line.set_datatype("xx", "Z")
324 >>> line.get_datatype("xx")
325 'Z'
326
327 If a new custom tag is specified, Gfapy selects the correct datatype for
328 it: i/f for numeric values, J/B for arrays, J for dictionaries and Z for
329 strings. If the user wants to specify a different datatype,
330 they may do so by setting it with ``set_datatype()`` (this can also be
331 done before assigning a value, which is necessary if full validation is
332 active).
333
334 .. doctest::
335
336 >>> line = gfapy.Line("H")
337 >>> line.xx = "1"
338 >>> line.xx
339 '1'
340 >>> line.set_datatype("xy", "i")
341 >>> line.xy = "1"
342 >>> line.xy
343 1
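
The selection rule described above can be approximated with the following standalone sketch. It is not Gfapy's implementation: ``infer_datatype`` is a hypothetical function, and the choice of ``B`` for non-empty lists made only of integers or only of floats (and ``J`` for every other list) is an assumption made for this example.

```python
def _all_of_type(values, kind):
    """True if values is non-empty and every element has exactly this type."""
    return len(values) > 0 and all(type(v) is kind for v in values)


def infer_datatype(value):
    """Guess a GFA datatype symbol for a Python value (illustrative only)."""
    if type(value) is int:
        return "i"
    if type(value) is float:
        return "f"
    if isinstance(value, dict):
        return "J"
    if isinstance(value, (list, tuple)):
        # Assumption: homogeneous numeric lists become B, all other lists J.
        if _all_of_type(value, int) or _all_of_type(value, float):
            return "B"
        return "J"
    # Strings, and anything not handled above, fall back to Z in this sketch.
    return "Z"


print(infer_datatype(1), infer_datatype([1, 2, 3]), infer_datatype({"k": 1}))
```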
344
345 Arrays of numerical values
346 ~~~~~~~~~~~~~~~~~~~~~~~~~~
347
348 ``B`` and ``H`` tags represent arrays with particular constraints (e.g.
349 they can only contain numeric values, and in some cases the values must
350 be in predefined ranges). In order to represent them correctly and allow
351 for validation, Python classes have been defined for both kinds of tags:
352 ``gfapy.ByteArray`` for ``H`` and ``gfapy.NumericArray`` for ``B``
353 fields.
354
355 Both are subclasses of list. Objects of the two classes can be created by
356 passing an existing list or the string representation to the class
357 constructor.
358
359 .. doctest::
360
361 >>> # create a byte array instance
362 >>> gfapy.ByteArray([12,3,14])
363 b'\x0c\x03\x0e'
364 >>> gfapy.ByteArray("A012FF")
365 b'\xa0\x12\xff'
366 >>> # create a numeric array instance
367 >>> gfapy.NumericArray.from_string("c,12,3,14")
368 [12, 3, 14]
369 >>> gfapy.NumericArray([12,3,14])
370 [12, 3, 14]
371
372 Instances of the classes behave as normal lists, except that they
373 provide a ``validate()`` method, which checks the constraints, and that
374 their string representation is the GFA string representation of the
375 field value.
376
377 .. doctest::
378
379 >>> gfapy.NumericArray([12,1,"1x"]).validate()
380 Traceback (most recent call last):
381 ...
382 gfapy.error.ValueError
383 >>> str(gfapy.NumericArray([12,3,14]))
384 'C,12,3,14'
385 >>> gfapy.ByteArray([12,1,"1x"]).validate()
386 Traceback (most recent call last):
387 ...
388 gfapy.error.ValueError
389 >>> str(gfapy.ByteArray([12,3,14]))
390 '0C030E'
391
392 For numeric arrays, the `compute_subtype` method computes
393 the subtype which will be used for the string representation. Unsigned
394 subtypes are used if all values are positive. The smallest possible
395 subtype range is selected. The subtype may change when the range of the
396 elements changes.
397
398 .. doctest::
399
400 >>> gfapy.NumericArray([12,13,14]).compute_subtype()
401 'C'
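
The idea of choosing the narrowest subtype can be sketched without Gfapy as follows. The subtype letters and their ranges follow the ``B`` array convention (``c``/``C`` for 8 bits, ``s``/``S`` for 16 bits, ``i``/``I`` for 32 bits); ``integer_subtype`` is a hypothetical function, it handles integers only, and treating zero as unsigned is an assumption of this example.

```python
# (subtype letter, smallest value, largest value), ordered from narrowest to widest
UNSIGNED = [("C", 0, 2**8 - 1), ("S", 0, 2**16 - 1), ("I", 0, 2**32 - 1)]
SIGNED = [("c", -2**7, 2**7 - 1), ("s", -2**15, 2**15 - 1), ("i", -2**31, 2**31 - 1)]


def integer_subtype(values):
    """Return the narrowest integer subtype letter able to hold all values."""
    lo, hi = min(values), max(values)
    # Assumption: signed subtypes are needed only when a negative value is present.
    candidates = SIGNED if lo < 0 else UNSIGNED
    for letter, smallest, largest in candidates:
        if smallest <= lo and hi <= largest:
            return letter
    raise ValueError("values do not fit in a 32-bit integer subtype")


print(integer_subtype([12, 13, 14]))
print(integer_subtype([-1, 200]))
```

Adding a single large or negative element to a list changes the result, which is the reason the subtype may change when the range of the elements changes.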
402
403 Special cases: custom records, headers, comments and virtual lines.
404 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
405
406 GFA2 allows custom records, introduced by record type strings other than
407 the predefined ones. Gfapy uses a pragmatic approach for identifying
408 tags in custom records, and tries to interpret the rightmost fields as
409 tags, until the first field from the right raises an error; all
410 remaining fields are treated as positional fields.
411
412 ::
413
414 "X a b c xx:i:12" # => xx is tag, a, b, c are positional fields
415 "Y a b xx:i:12 c" # => all positional fields, as c is not a valid tag
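
The right-to-left scan described above can be sketched as follows. This is a simplified stand-in, not Gfapy's parser: ``split_custom_record`` is a hypothetical function, fields are assumed to be tab-separated, and a field counts as a tag when it merely has the ``name:type:value`` shape, whereas Gfapy also checks that the value can be parsed.

```python
import re

# Assumed shape of a tag: two-character name, one datatype symbol, non-empty value.
TAG = re.compile(r"[A-Za-z][A-Za-z0-9]:[AifZJHB]:.+")


def split_custom_record(line):
    """Return (record_type, positional_fields, tags) for a custom record line."""
    fields = line.split("\t")
    record_type, rest = fields[0], fields[1:]
    tags = []
    # Consume fields from the right for as long as they look like tags.
    while rest and TAG.fullmatch(rest[-1]):
        tags.insert(0, rest.pop())
    return record_type, rest, tags


print(split_custom_record("X\ta\tb\tc\txx:i:12"))
print(split_custom_record("Y\ta\tb\txx:i:12\tc"))
```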
416
417 For easier access, the entire header of the GFA is summarized in a
418 single line instance. A class (`FieldArray`) has been defined to
419 handle the special case when multiple H lines define the same tag (see
420 :ref:`header` chapter for details).
421
422 Comment lines are represented by a subclass of the same class
423 (`Line`) as the records. However, they cannot contain tags: the
424 entire line is taken as content of the comment. See the :ref:`comments`
425 chapter for more information about comments.
426
427 ::
428
429 "# this is not a tag: xx:i:1" # => xx is not a tag, xx:i:1 is part of the comment
430
431 Virtual instances of the `Line` class (e.g. segment instances automatically
432 created because of not yet resolved references found in edges) cannot be
433 modified by the user, and tags cannot be specified for them. This
434 includes all instances of the `Unknown` class. See the
435 :ref:`references` chapter for more information about virtual lines.
+0
-78
doc/tutorial/validation.rst less more
0 .. _validation:
1
2 Validation
3 ----------
4
5 Different validation levels are available. They represent different
6 compromises between speed and guarantee of validity. The validation level
7 can be specified when the :class:`~gfapy.gfa.Gfa` object is created, using the
8 ``vlevel`` parameter of the constructor and of the
9 `from_file` method. Four levels of validation are defined
10 (0 = no validation, 1 = validation by reading, 2 = validation by reading
11 and writing, 3 = continuous validation). The default validation level
12 value is 1.
13
14 Manual validation
15 ~~~~~~~~~~~~~~~~~
16
17 Regardless of the validation level chosen, the user can always check the
18 value of a field by calling
19 :meth:`~gfapy.line.common.validate.Validate.validate_field` on the line
20 instance. If no exception is raised, the field content is valid.
21
22 To check if the entire content of the line is valid, the user can call
23 :meth:`~gfapy.line.common.validate.Validate.validate` on the line instance.
24 This will check all fields and perform cross-field validations, such as
25 comparing the length of the sequence of a GFA1 segment, to the value of the LN
26 tag (if present).
27
28 It is also possible to validate the structure of the GFA, for example to
29 check if there are unresolved references to lines. To do this, use the
30 :meth:`~gfapy.gfa.Gfa.validate` method of the :class:`~gfapy.gfa.Gfa` instance.
31
32 No validations
33 ~~~~~~~~~~~~~~
34
35 If the validation level is set to 0, Gfapy will try to accept any input and
36 never raise an exception. This is not always possible, and in some
37 cases, an exception will still be raised, if the data is invalid.
38
39 Validation when reading
40 ~~~~~~~~~~~~~~~~~~~~~~~
41
42 If the validation level is set to 1 or higher, basic validations will be
43 performed, such as checking the number of positional fields, the
44 presence of duplicated tags and the datatype of predefined tags.
45 Additionally, all tags will be validated, either during parsing or on
46 first access. Record-type cross-field validations will also be
47 performed.
48
49 In other words, a validation level of 1 means that Gfapy guarantees (as well
50 as it can) that the GFA content read from a file is valid, and will
51 raise an exception on accessing the data if not.
52
53 The user is expected to call `validate_field` after changing
54 the content of a field to something which is potentially invalid, or
55 :meth:`~gfapy.line.common.validate.Validate.validate` if
56 cross-field validations could potentially fail.
57
58 Validation when writing
59 ~~~~~~~~~~~~~~~~~~~~~~~
60
61 Setting the level to 2 will perform all validations described above,
62 plus validate the content of the fields when their value is written to a string.
63
64 In other words, a validation level of 2 means that Gfapy guarantees (as well as
65 it can) that the GFA content read from a file and written to a file is
66 valid and will raise an exception on accessing the data or writing to
67 file if not.
68
69 Continuous validation
70 ~~~~~~~~~~~~~~~~~~~~~
71
72 If the validation level is set to 3, all validations for lower levels
73 described above are run, plus a validation of the field contents each time
74 a setter method is used.
75
76 A validation level of 3 means that Gfapy guarantees (as well as it can) that
77 the GFA content is always valid.
325325 merged.name = "_".join(merged.name)
326326 ortag = merged.get("or")
327327 if isinstance(ortag, list):
328 merged.set_datatype("or", "Z")
328329 merged.set("or", ",".join(ortag))
329330 if not gfapy.is_placeholder(merged.sequence):
330331 merged.sequence = "".join(merged.sequence)
3636 if gfapy.posvalue(begpos) > gfapy.posvalue(endpos):
3737 raise gfapy.ValueError(
3838 "Line: {}\n".format(str(self))+
39 "begin > end: {}$ > {}".format(gfapy.posvalue(begpos),
40 gfapy.posvalue(endpos)))
39 "begin > end: {} > {}".format(gfapy.posvalue(begpos),
40 gfapy.posvalue(endpos)))
4141 if gfapy.isfirstpos(begpos):
4242 if gfapy.isfirstpos(endpos):
4343 return ("pfx", True)
0 Metadata-Version: 2.1
1 Name: gfapy
2 Version: 1.2.3
3 Summary: Library for handling data in the GFA1 and GFA2 formats
4 Home-page: https://github.com/ggonnella/gfapy
5 Author: Giorgio Gonnella and others (see CONTRIBUTORS)
6 Author-email: gonnella@zbh.uni-hamburg.de
7 License: ISC
8 Keywords: bioinformatics genomics sequences GFA assembly graphs
9 Classifier: Development Status :: 5 - Production/Stable
10 Classifier: Environment :: Console
11 Classifier: Intended Audience :: Developers
12 Classifier: Intended Audience :: End Users/Desktop
13 Classifier: Intended Audience :: Science/Research
14 Classifier: License :: OSI Approved :: ISC License (ISCL)
15 Classifier: Operating System :: MacOS :: MacOS X
16 Classifier: Operating System :: POSIX :: Linux
17 Classifier: Programming Language :: Python :: 3 :: Only
18 Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
19 Classifier: Topic :: Software Development :: Libraries
20 License-File: LICENSE.txt
21
22 Gfapy
23 ~~~~~
24
25 |travis| |readthedocs| |latesttag| |license|
26
27 |bioconda| |pypi| |debian| |ubuntu|
28
29 .. sphinx-begin
30
31 The Graphical Fragment Assembly (GFA) formats are formats for the representation
32 of sequence graphs, including assembly, variation and splicing graphs.
33 Two versions of GFA have been defined (GFA1 and GFA2) and several sequence
34 analysis programs have adopted them as an interchange format,
35 which makes it easy to combine different sequence analysis tools.
36
37 This library implements the GFA1 and GFA2 specifications
38 described at https://github.com/GFA-spec/GFA-spec/blob/master/GFA-spec.md.
39 It allows creating a Gfa object from a file in the GFA format
40 or from scratch, to enumerate the graph elements (segments, links,
41 containments, paths and header lines), to traverse the graph (by
42 traversing all links outgoing from or incoming to a segment), to search for
43 elements (e.g. which links connect two segments) and to manipulate the
44 graph (e.g. to eliminate a link or a segment or to duplicate a segment
45 distributing the read counts evenly on the copies).
46
47 The GFA format can be easily extended by users by defining their own custom
48 tags and record types. In Gfapy, it is easy to write extension modules,
49 which allow defining custom record types and datatypes for the parsing
50 and validation of custom fields. The custom lines can be connected, using
51 references, to each other and to lines of the standard record types.
52
53 Requirements
54 ~~~~~~~~~~~~
55
56 Gfapy has been written for Python 3 and tested using Python version 3.7.
57 It does not require any additional Python packages or other software.
58
59 Installation
60 ~~~~~~~~~~~~
61
62 Gfapy is distributed as a Python package and can be installed using
63 the Python package manager pip, as well as conda (in the Bioconda channel).
64 It is also available as a package in some Linux distributions (Debian, Ubuntu).
65
66 The following command installs the current stable version from the Python
67 Packages index::
68
69 pip install gfapy
70
71 If you would like to install the current development version from Github,
72 use the following command::
73
74 pip install -e git+https://github.com/ggonnella/gfapy.git#egg=gfapy
75
76 Alternatively it is possible to install gfapy using conda. Gfapy is
77 included in the Bioconda (https://bioconda.github.io/) channel::
78
79 conda install -c bioconda gfapy
80
81 Usage
82 ~~~~~
83
84 If you installed gfapy as described above, you can import it in your script
85 using the conventional Python syntax::
86
87 >>> import gfapy
88
89 Documentation
90 ~~~~~~~~~~~~~
91
92 The documentation, including this introduction to Gfapy, a user manual
93 and the API documentation is hosted on the ReadTheDocs server,
94 at the URL http://gfapy.readthedocs.io/en/latest/ and it can be
95 downloaded as PDF from the URL
96 https://github.com/ggonnella/gfapy/blob/master/manual/gfapy-manual.pdf.
97
98 References
99 ~~~~~~~~~~
100
101 Giorgio Gonnella and Stefan Kurtz "GfaPy: a flexible and extensible software
102 library for handling sequence graphs in Python", Bioinformatics (2017) btx398
103 https://doi.org/10.1093/bioinformatics/btx398
104
105 .. sphinx-end
106
107 .. |travis|
108 image:: https://travis-ci.com/ggonnella/gfapy.svg?branch=master
109 :target: https://travis-ci.com/ggonnella/gfapy
110 :alt: Travis
111
112 .. |latesttag|
113 image:: https://img.shields.io/github/v/tag/ggonnella/gfapy
114 :target: https://github.com/ggonnella/gfapy/tags
115 :alt: Latest GitHub tag
116
117 .. |readthedocs|
118 image:: https://readthedocs.org/projects/pip/badge/?version=stable
119 :target: https://pip.pypa.io/en/stable/?badge=stable
120 :alt: ReadTheDocs
121
122 .. |bioconda|
123 image:: https://img.shields.io/conda/vn/bioconda/gfapy
124 :target: https://bioconda.github.io/recipes/gfapy/README.html
125 :alt: Bioconda
126
127 .. |pypi|
128 image:: https://img.shields.io/pypi/v/gfapy
129 :target: https://pypi.org/project/gfapy/
130 :alt: PyPI
131
132 .. |debian|
133 image:: https://img.shields.io/debian/v/gfapy
134 :target: https://packages.debian.org/search?keywords=gfapy
135 :alt: Debian
136
137 .. |ubuntu|
138 image:: https://img.shields.io/ubuntu/v/gfapy
139 :target: https://packages.ubuntu.com/search?keywords=gfapy
140 :alt: Ubuntu
141
142 .. |license|
143 image:: https://img.shields.io/pypi/l/gfapy
144 :target: https://github.com/ggonnella/gfapy/blob/master/LICENSE.txt
145 :alt: ISC License
146
147 .. |requiresio|
148 image:: https://requires.io/github/ggonnella/gfapy/requirements.svg?branch=master
149 :target: https://requires.io/github/ggonnella/gfapy/requirements/?branch=master
150 :alt: Requirements Status
0 LICENSE.txt
1 MANIFEST.in
2 README.rst
3 setup.cfg
4 setup.py
5 bin/gfapy-convert
6 bin/gfapy-mergelinear
7 bin/gfapy-renumber
8 bin/gfapy-validate
9 gfapy/__init__.py
10 gfapy/byte_array.py
11 gfapy/error.py
12 gfapy/field_array.py
13 gfapy/gfa.py
14 gfapy/lastpos.py
15 gfapy/logger.py
16 gfapy/numeric_array.py
17 gfapy/oriented_line.py
18 gfapy/placeholder.py
19 gfapy/rgfa.py
20 gfapy/segment_end.py
21 gfapy/segment_end_path.py
22 gfapy/sequence.py
23 gfapy/symbol_invert.py
24 gfapy.egg-info/PKG-INFO
25 gfapy.egg-info/SOURCES.txt
26 gfapy.egg-info/dependency_links.txt
27 gfapy.egg-info/not-zip-safe
28 gfapy.egg-info/top_level.txt
29 gfapy/alignment/__init__.py
30 gfapy/alignment/alignment.py
31 gfapy/alignment/cigar.py
32 gfapy/alignment/placeholder.py
33 gfapy/alignment/trace.py
34 gfapy/field/__init__.py
35 gfapy/field/alignment_gfa1.py
36 gfapy/field/alignment_gfa2.py
37 gfapy/field/alignment_list_gfa1.py
38 gfapy/field/byte_array.py
39 gfapy/field/char.py
40 gfapy/field/comment.py
41 gfapy/field/custom_record_type.py
42 gfapy/field/field.py
43 gfapy/field/float.py
44 gfapy/field/generic.py
45 gfapy/field/identifier_gfa2.py
46 gfapy/field/identifier_list_gfa2.py
47 gfapy/field/integer.py
48 gfapy/field/json.py
49 gfapy/field/numeric_array.py
50 gfapy/field/optional_identifier_gfa2.py
51 gfapy/field/optional_integer.py
52 gfapy/field/orientation.py
53 gfapy/field/oriented_identifier_gfa2.py
54 gfapy/field/oriented_identifier_list_gfa1.py
55 gfapy/field/oriented_identifier_list_gfa2.py
56 gfapy/field/parser.py
57 gfapy/field/path_name_gfa1.py
58 gfapy/field/position_gfa1.py
59 gfapy/field/position_gfa2.py
60 gfapy/field/segment_name_gfa1.py
61 gfapy/field/sequence_gfa1.py
62 gfapy/field/sequence_gfa2.py
63 gfapy/field/string.py
64 gfapy/field/validator.py
65 gfapy/field/writer.py
66 gfapy/graph_operations/__init__.py
67 gfapy/graph_operations/artifacts.py
68 gfapy/graph_operations/copy_number.py
69 gfapy/graph_operations/graph_operations.py
70 gfapy/graph_operations/invertible_segments.py
71 gfapy/graph_operations/linear_paths.py
72 gfapy/graph_operations/multiplication.py
73 gfapy/graph_operations/p_bubbles.py
74 gfapy/graph_operations/redundant_linear_paths.py
75 gfapy/graph_operations/superfluous_links.py
76 gfapy/graph_operations/topology.py
77 gfapy/line/__init__.py
78 gfapy/line/line.py
79 gfapy/line/comment/__init__.py
80 gfapy/line/comment/comment.py
81 gfapy/line/comment/construction.py
82 gfapy/line/comment/tags.py
83 gfapy/line/comment/version_conversion.py
84 gfapy/line/comment/writer.py
85 gfapy/line/common/__init__.py
86 gfapy/line/common/cloning.py
87 gfapy/line/common/connection.py
88 gfapy/line/common/construction.py
89 gfapy/line/common/default_record_definition.py
90 gfapy/line/common/disconnection.py
91 gfapy/line/common/dynamic_fields.py
92 gfapy/line/common/equivalence.py
93 gfapy/line/common/field_data.py
94 gfapy/line/common/field_datatype.py
95 gfapy/line/common/update_references.py
96 gfapy/line/common/validate.py
97 gfapy/line/common/version_conversion.py
98 gfapy/line/common/virtual_to_real.py
99 gfapy/line/common/writer.py
100 gfapy/line/custom_record/__init__.py
101 gfapy/line/custom_record/construction.py
102 gfapy/line/custom_record/custom_record.py
103 gfapy/line/edge/__init__.py
104 gfapy/line/edge/edge.py
105 gfapy/line/edge/common/__init__.py
106 gfapy/line/edge/common/alignment_type.py
107 gfapy/line/edge/common/from_to.py
108 gfapy/line/edge/containment/__init__.py
109 gfapy/line/edge/containment/canonical.py
110 gfapy/line/edge/containment/containment.py
111 gfapy/line/edge/containment/pos.py
112 gfapy/line/edge/containment/to_gfa2.py
113 gfapy/line/edge/gfa1/__init__.py
114 gfapy/line/edge/gfa1/alignment_type.py
115 gfapy/line/edge/gfa1/oriented_segments.py
116 gfapy/line/edge/gfa1/other.py
117 gfapy/line/edge/gfa1/references.py
118 gfapy/line/edge/gfa1/to_gfa2.py
119 gfapy/line/edge/gfa2/__init__.py
120 gfapy/line/edge/gfa2/alignment_type.py
121 gfapy/line/edge/gfa2/gfa2.py
122 gfapy/line/edge/gfa2/other.py
123 gfapy/line/edge/gfa2/references.py
124 gfapy/line/edge/gfa2/to_gfa1.py
125 gfapy/line/edge/gfa2/validation.py
126 gfapy/line/edge/link/__init__.py
127 gfapy/line/edge/link/canonical.py
128 gfapy/line/edge/link/complement.py
129 gfapy/line/edge/link/equivalence.py
130 gfapy/line/edge/link/link.py
131 gfapy/line/edge/link/references.py
132 gfapy/line/edge/link/to_gfa2.py
133 gfapy/line/fragment/__init__.py
134 gfapy/line/fragment/fragment.py
135 gfapy/line/fragment/references.py
136 gfapy/line/fragment/validation.py
137 gfapy/line/gap/__init__.py
138 gfapy/line/gap/gap.py
139 gfapy/line/gap/references.py
140 gfapy/line/group/__init__.py
141 gfapy/line/group/group.py
142 gfapy/line/group/gfa2/__init__.py
143 gfapy/line/group/gfa2/references.py
144 gfapy/line/group/gfa2/same_id.py
145 gfapy/line/group/ordered/__init__.py
146 gfapy/line/group/ordered/captured_path.py
147 gfapy/line/group/ordered/ordered.py
148 gfapy/line/group/ordered/references.py
149 gfapy/line/group/ordered/to_gfa1.py
150 gfapy/line/group/path/__init__.py
151 gfapy/line/group/path/captured_path.py
152 gfapy/line/group/path/path.py
153 gfapy/line/group/path/references.py
154 gfapy/line/group/path/to_gfa2.py
155 gfapy/line/group/path/topology.py
156 gfapy/line/group/path/validation.py
157 gfapy/line/group/unordered/__init__.py
158 gfapy/line/group/unordered/induced_set.py
159 gfapy/line/group/unordered/references.py
160 gfapy/line/group/unordered/unordered.py
161 gfapy/line/header/__init__.py
162 gfapy/line/header/connection.py
163 gfapy/line/header/field_data.py
164 gfapy/line/header/header.py
165 gfapy/line/header/multiline.py
166 gfapy/line/header/version_conversion.py
167 gfapy/line/segment/__init__.py
168 gfapy/line/segment/coverage.py
169 gfapy/line/segment/gfa1.py
170 gfapy/line/segment/gfa1_to_gfa2.py
171 gfapy/line/segment/gfa2.py
172 gfapy/line/segment/gfa2_to_gfa1.py
173 gfapy/line/segment/length_gfa1.py
174 gfapy/line/segment/references.py
175 gfapy/line/segment/segment.py
176 gfapy/line/segment/writer_wo_sequence.py
177 gfapy/line/unknown/__init__.py
178 gfapy/line/unknown/unknown.py
179 gfapy/lines/__init__.py
180 gfapy/lines/collections.py
181 gfapy/lines/creators.py
182 gfapy/lines/destructors.py
183 gfapy/lines/finders.py
184 gfapy/lines/headers.py
185 gfapy/lines/lines.py
186 manual/gfapy-manual.pdf
187 tests/__init__.py
188 tests/extension.py
189 tests/test_api_alignment.py
190 tests/test_api_comments.py
191 tests/test_api_custom_records.py
192 tests/test_api_extensions.py
193 tests/test_api_gfa1_lines.py
194 tests/test_api_gfa2_lines.py
195 tests/test_api_gfa_basics.py
196 tests/test_api_groups_validation.py
197 tests/test_api_header.py
198 tests/test_api_linear_paths.py
199 tests/test_api_linear_paths_extended.py
200 tests/test_api_lines_collections.py
201 tests/test_api_lines_creators.py
202 tests/test_api_lines_destructors.py
203 tests/test_api_lines_finders.py
204 tests/test_api_multiplication.py
205 tests/test_api_placeholders.py
206 tests/test_api_positionals.py
207 tests/test_api_positions.py
208 tests/test_api_references_edge_gfa1.py
209 tests/test_api_references_edge_gfa2.py
210 tests/test_api_references_f_g_lines.py
211 tests/test_api_references_groups.py
212 tests/test_api_references_virtual.py
213 tests/test_api_rename_lines.py
214 tests/test_api_rgfa.py
215 tests/test_api_tags.py
tests/test_api_version.py
tests/test_api_version_conversion.py
tests/test_gfapy_alignment.py
tests/test_gfapy_byte_array.py
tests/test_gfapy_cigar.py
tests/test_gfapy_line_containment.py
tests/test_gfapy_line_edge.py
tests/test_gfapy_line_header.py
tests/test_gfapy_line_link.py
tests/test_gfapy_line_path.py
tests/test_gfapy_line_segment.py
tests/test_gfapy_line_version.py
tests/test_gfapy_numeric_array.py
tests/test_gfapy_segment_references.py
tests/test_gfapy_sequence.py
tests/test_gfapy_trace.py
tests/test_graphop_artifacts.py
tests/test_graphop_copy_number.py
tests/test_internals_field_parser.py
tests/test_internals_field_validator.py
tests/test_internals_field_writer.py
tests/test_internals_tag_datatype.py
tests/test_unit_alignment.py
tests/test_unit_field_array.py
tests/test_unit_gfa_lines.py
tests/test_unit_header.py
tests/test_unit_line.py
tests/test_unit_line_cloning.py
tests/test_unit_line_connection.py
tests/test_unit_line_dynamic_fields.py
tests/test_unit_line_equivalence.py
tests/test_unit_lines_finders.py
tests/test_unit_multiplication.py
tests/test_unit_numeric_array.py
tests/test_unit_oriented_line.py
tests/test_unit_segment_end.py
tests/test_unit_symbol_invert.py
tests/test_unit_unknown.py
tests/testdata/all_line_types.gfa1.gfa
tests/testdata/all_line_types.gfa2.gfa
tests/testdata/copynum.1.gfa
tests/testdata/copynum.1.gfa2
tests/testdata/copynum.2.gfa
tests/testdata/copynum.2.gfa2
tests/testdata/dead_ends.gfa
tests/testdata/dead_ends.gfa2
tests/testdata/example1.gfa
tests/testdata/example1.gfa2
tests/testdata/example_from_spec.gfa
tests/testdata/example_from_spec.gfa2
tests/testdata/example_from_spec.path14.seq
tests/testdata/example_from_spec2.gfa
tests/testdata/example_from_spec2.gfa2
tests/testdata/filled.gfa1
tests/testdata/filled.gfa2
tests/testdata/gfa2_edges_classification.gfa
tests/testdata/invalid_path.gfa2
tests/testdata/linear_blunt.gfa1
tests/testdata/linear_blunt.gfa2
tests/testdata/linear_merging.1.gfa
tests/testdata/linear_merging.1.gfa2
tests/testdata/linear_merging.2.gfa
tests/testdata/linear_merging.2.gfa2
tests/testdata/linear_merging.3.gfa
tests/testdata/linear_merging.3.gfa2
tests/testdata/linear_merging.4.gfa
tests/testdata/linear_merging.4.gfa2
tests/testdata/linear_merging.5.gfa
tests/testdata/linear_merging.5.gfa2
tests/testdata/linear_merging.6.gfa
tests/testdata/linear_merging.6.merged.gfa
tests/testdata/links_distri.l1.gfa
tests/testdata/links_distri.l1.gfa2
tests/testdata/links_distri.l1.m2.gfa
tests/testdata/links_distri.l1.m2.gfa2
tests/testdata/links_distri.l2.gfa
tests/testdata/links_distri.l2.gfa2
tests/testdata/links_distri.l2.m2.gfa
tests/testdata/links_distri.l2.m2.gfa2
tests/testdata/links_distri.l2.m2.no_ld.gfa
tests/testdata/links_distri.l2.m2.no_ld.gfa2
tests/testdata/links_distri.l2.m3.gfa
tests/testdata/links_distri.l2.m3.gfa2
tests/testdata/links_distri.l2.m3.no_ld.gfa
tests/testdata/links_distri.l2.m3.no_ld.gfa2
tests/testdata/links_distri.l3.gfa
tests/testdata/links_distri.l3.gfa2
tests/testdata/links_distri.l3.m2.gfa
tests/testdata/links_distri.l3.m2.gfa2
tests/testdata/links_distri.l3.m2.no_ld.gfa
tests/testdata/links_distri.l3.m2.no_ld.gfa2
tests/testdata/loop.gfa
tests/testdata/loop.gfa2
tests/testdata/rgfa_example.1.gfa
tests/testdata/rgfa_example.2.gfa
tests/testdata/sample.gfa
tests/testdata/sample.gfa2
tests/testdata/seq_to_fill.fas
tests/testdata/spec_q1.gfa
tests/testdata/spec_q1.gfa2
tests/testdata/spec_q2.gfa
tests/testdata/spec_q2.gfa2
tests/testdata/spec_q2.path_circular.seq
tests/testdata/spec_q2.path_linear.seq
tests/testdata/spec_q3.gfa
tests/testdata/spec_q3.gfa2
tests/testdata/spec_q4.gfa
tests/testdata/spec_q4.gfa2
tests/testdata/spec_q4.path_more_than_circular.seq
tests/testdata/spec_q5.gfa
tests/testdata/spec_q5.gfa2
tests/testdata/spec_q6.gfa
tests/testdata/spec_q6.gfa2
tests/testdata/spec_q7.gfa
tests/testdata/spec_q7.gfa2
tests/testdata/to_be_filled.gfa1
tests/testdata/to_be_filled.gfa2
tests/testdata/two_components.gfa
tests/testdata/two_components.gfa2
tests/testdata/unnamed_and_named_links.gfa
tests/testdata/unnamed_link.gfa
tests/testdata/valid_path.gfa2
Binary diff not shown
[bdist_wheel]
python-tag = py3

[egg_info]
tag_build =
tag_date = 0

+0
-32
tests/testdata/invalid/edge_missing.gfa2
# File used for the collections test
# similar but NOT equivalent to the gfa1 file!
S 1 122 *
S 3 29 TGCTAGCTGACTGTCGATGCTGTGTG
E 1_to_2 1+ 2+ 110 122$ 0 12 12M
S 5 130 *
S 13 150 *
E 2_to_6 2+ 6+ 0 122$ 10 132 122M
O 14 11+ 12+
S 11 140 * xx:i:11
F 2 read1+ 0 42 12 55 * id:Z:read1_in_2
F 2 read2+ 45 62 0 18 * id:Z:read2_in_2
U 16 1 3 15 2_to_6 16sub
H ac:Z:test2
# another comment
S 12 150 *
S 4 120 *
H VN:Z:2.0
E 1_to_3 1+ 3+ 112 122$ 0 12 10M
G 1_to_11 1+ 11- 120 *
E 11_to_12 11+ 12+ 18 140$ 0 122 122M
S 6 150 *
X custom_record xx:Z:testtag
X custom_record X2
G 2_to_12 2- 12+ 500 50
O 15 11+ 11_to_13+ 13+ xx:i:-1
Y another_custom_record
U 16sub 2 3
S 2 120 * xx:Z:sometag
H aa:i:12 ab:Z:test1
H aa:i:15
E 1_to_5 1+ 5+ 0 122$ 2 124 * zz:Z:tag
+0
-12
tests/testdata/invalid/edge_wrong_lastpos.gfa2
H VN:Z:2.0
H ul:Z:https://github.com/sjackman/assembly-graph/blob/master/sample.gfa
S 1 8 CGATGCAA
S 2 10 TGCAAAGTAC
S 3 21 TGCAACGTATAGACTTGTCAC RC:i:4
S 4 7 GCATATA
S 5 8 CGATGATA
S 6 4 ATGA
E * 1+ 2+ 3 9$ 0 5 5M
E * 3+ 2+ 21$ 21$ 0 0 0M
E * 3+ 4- 17 21$ 3 7$ 1M1D2M
E * 4- 5+ 0 0 0 0 0M
+0
-33
tests/testdata/invalid/fragment_wrong_lastpos.gfa2
# File used for the collections test
# similar but NOT equivalent to the gfa1 file!
S 1 122 *
S 3 29 TGCTAGCTGACTGTCGATGCTGTGTG
E 1_to_2 1+ 2+ 110 122$ 0 12 12M
S 5 130 *
S 13 150 *
E 2_to_6 2+ 6+ 0 122$ 10 132 122M
O 14 11+ 12+
S 11 140 * xx:i:11
F 3 read1+ 0 42$ 12 55 * id:Z:read1_in_3
F 2 read2+ 45 62 0 18 * id:Z:read2_in_2
U 16 1 3 15 2_to_6 16sub
H ac:Z:test2
# another comment
S 12 150 *
S 4 120 *
H VN:Z:2.0
E 1_to_3 1+ 3+ 112 122$ 0 12 10M
G 1_to_11 1+ 11- 120 *
E 11_to_12 11+ 12+ 18 140$ 0 122 122M
S 6 150 *
X custom_record xx:Z:testtag
X custom_record X2
E 11_to_13 11+ 13+ 20 140$ 0 120 120M
G 2_to_12 2- 12+ 500 50
O 15 11+ 11_to_13+ 13+ xx:i:-1
Y another_custom_record
U 16sub 2 3
S 2 120 * xx:Z:sometag
H aa:i:12 ab:Z:test1
H aa:i:15
E 1_to_5 1+ 5+ 0 122$ 2 124 * zz:Z:tag
+0
-12
tests/testdata/invalid/inconsistent_length.gfa1
H VN:Z:1.0
H ul:Z:https://github.com/sjackman/assembly-graph/blob/master/sample.gfa
S 1 CGATGCAA LN:i:12
S 2 TGCAAAGTAC
S 3 TGCAACGTATAGACTTGTCAC RC:i:4
S 4 GCATATA
S 5 CGATGATA
S 6 ATGA
L 1 + 2 + 5M
L 3 + 2 + 0M
L 3 + 4 - 1M1D2M1S
L 4 - 5 + 0M
+0
-21
tests/testdata/invalid/link_missing.gfa1
# File used for the collections test
S 1 *
S 3 CGATGCTAGCTGACTGTCGATGCTGTGTG
L 1 + 2 + 12M ID:Z:1_to_2
S 5 *
S 13 *
C 2 + 6 + 10 122M ID:Z:2_to_6
P 14 11+,12+ 122M
S 11 *
H ac:Z:test2
S 12 *
S 4 *
H VN:Z:1.0
L 1 + 3 + 12M ID:Z:1_to_3
S 6 *
L 11 + 13 + 120M ID:Z:11_to_13
P 15 11+,13+ 120M
S 2 * xx:Z:sometag
H aa:i:12 ab:Z:test1
H aa:i:15
C 1 + 5 + 12 120M ID:Z:1_to_5
+0
-21
tests/testdata/invalid/segment_missing.gfa1
# comment
S 3 CGATGCTAGCTGACTGTCGATGCTGTGTG
L 1 + 2 + 12M ID:Z:1_to_2
S 5 *
S 13 *
C 2 + 6 + 10 122M ID:Z:2_to_6
P 14 11+,12+ 122M
S 11 *
H ac:Z:test2
S 12 *
S 4 *
H VN:Z:1.0
L 1 + 3 + 12M ID:Z:1_to_3
L 11 + 12 + 122M ID:Z:11_to_12
S 6 *
L 11 + 13 + 120M ID:Z:11_to_13
P 15 11+,13+ 120M
S 2 * xx:Z:sometag
H aa:i:12 ab:Z:test1
H aa:i:15
C 1 + 5 + 12 120M ID:Z:1_to_5
+0
-32
tests/testdata/invalid/segment_missing.gfa2
# File used for the collections test
# similar but NOT equivalent to the gfa1 file!
S 3 29 TGCTAGCTGACTGTCGATGCTGTGTG
E 1_to_2 1+ 2+ 110 122$ 0 12 12M
S 5 130 *
S 13 150 *
E 2_to_6 2+ 6+ 0 122$ 10 132 122M
O 14 11+ 12+
S 11 140 * xx:i:11
F 2 read1+ 0 42 12 55 * id:Z:read1_in_2
F 2 read2+ 45 62 0 18 * id:Z:read2_in_2
U 16 1 3 15 2_to_6 16sub
H ac:Z:test2
# another comment
S 12 150 *
S 4 120 *
H VN:Z:2.0
E 1_to_3 1+ 3+ 112 122$ 0 12 10M
G 1_to_11 1+ 11- 120 *
E 11_to_12 11+ 12+ 18 140$ 0 122 122M
S 6 150 *
X custom_record xx:Z:testtag
X custom_record X2
E 11_to_13 11+ 13+ 20 140$ 0 120 120M
G 2_to_12 2- 12+ 500 50
O 15 11+ 11_to_13+ 13+ xx:i:-1
Y another_custom_record
U 16sub 2 3
S 2 120 * xx:Z:sometag
H aa:i:12 ab:Z:test1
H aa:i:15
E 1_to_5 1+ 5+ 0 122$ 2 124 * zz:Z:tag