Codebase list gmap / HEAD
HEAD

Tree @HEAD (Download .tar.gz)

   1
   2
   3
   4
   5
   6
   7
   8
   9
  10
  11
  12
  13
  14
  15
  16
  17
  18
  19
  20
  21
  22
  23
  24
  25
  26
  27
  28
  29
  30
  31
  32
  33
  34
  35
  36
  37
  38
  39
  40
  41
  42
  43
  44
  45
  46
  47
  48
  49
  50
  51
  52
  53
  54
  55
  56
  57
  58
  59
  60
  61
  62
  63
  64
  65
  66
  67
  68
  69
  70
  71
  72
  73
  74
  75
  76
  77
  78
  79
  80
  81
  82
  83
  84
  85
  86
  87
  88
  89
  90
  91
  92
  93
  94
  95
  96
  97
  98
  99
 100
 101
 102
 103
 104
 105
 106
 107
 108
 109
 110
 111
 112
 113
 114
 115
 116
 117
 118
 119
 120
 121
 122
 123
 124
 125
 126
 127
 128
 129
 130
 131
 132
 133
 134
 135
 136
 137
 138
 139
 140
 141
 142
 143
 144
 145
 146
 147
 148
 149
 150
 151
 152
 153
 154
 155
 156
 157
 158
 159
 160
 161
 162
 163
 164
 165
 166
 167
 168
 169
 170
 171
 172
 173
 174
 175
 176
 177
 178
 179
 180
 181
 182
 183
 184
 185
 186
 187
 188
 189
 190
 191
 192
 193
 194
 195
 196
 197
 198
 199
 200
 201
 202
 203
 204
 205
 206
 207
 208
 209
 210
 211
 212
 213
 214
 215
 216
 217
 218
 219
 220
 221
 222
 223
 224
 225
 226
 227
 228
 229
 230
 231
 232
 233
 234
 235
 236
 237
 238
 239
 240
 241
 242
 243
 244
 245
 246
 247
 248
 249
 250
 251
 252
 253
 254
 255
 256
 257
 258
 259
 260
 261
 262
 263
 264
 265
 266
 267
 268
 269
 270
 271
 272
 273
 274
 275
 276
 277
 278
 279
 280
 281
 282
 283
 284
 285
 286
 287
 288
 289
 290
 291
 292
 293
 294
 295
 296
 297
 298
 299
 300
 301
 302
 303
 304
 305
 306
 307
 308
 309
 310
 311
 312
 313
 314
 315
 316
 317
 318
 319
 320
 321
 322
 323
 324
 325
 326
 327
 328
 329
 330
 331
 332
 333
 334
 335
 336
 337
 338
 339
 340
 341
 342
 343
 344
 345
 346
 347
 348
 349
 350
 351
 352
 353
 354
 355
 356
 357
 358
 359
 360
 361
 362
 363
 364
 365
 366
 367
 368
 369
 370
 371
 372
 373
 374
 375
 376
 377
 378
 379
 380
 381
 382
 383
 384
 385
 386
 387
 388
 389
 390
 391
 392
 393
 394
 395
 396
 397
 398
 399
 400
 401
 402
 403
 404
 405
 406
 407
 408
 409
 410
 411
 412
 413
 414
 415
 416
 417
 418
 419
 420
 421
 422
 423
 424
 425
 426
 427
 428
 429
 430
 431
 432
 433
 434
 435
 436
 437
 438
 439
 440
 441
 442
 443
 444
 445
 446
 447
 448
 449
 450
 451
 452
 453
 454
 455
 456
 457
 458
 459
 460
 461
 462
 463
 464
 465
 466
 467
 468
 469
 470
 471
 472
 473
 474
 475
 476
 477
 478
 479
 480
 481
 482
 483
 484
 485
 486
 487
 488
 489
 490
 491
 492
 493
 494
 495
 496
 497
 498
 499
 500
 501
 502
 503
 504
 505
 506
 507
 508
 509
 510
 511
 512
 513
 514
 515
 516
 517
 518
 519
 520
 521
 522
 523
 524
 525
 526
 527
 528
 529
 530
 531
 532
 533
 534
 535
 536
 537
 538
 539
 540
 541
 542
 543
 544
 545
 546
 547
 548
 549
 550
 551
 552
 553
 554
 555
 556
 557
 558
 559
 560
 561
 562
 563
 564
 565
 566
 567
 568
 569
 570
 571
 572
 573
 574
 575
 576
 577
 578
 579
 580
 581
 582
 583
 584
 585
 586
 587
 588
 589
 590
 591
 592
 593
 594
 595
 596
 597
 598
 599
 600
 601
 602
 603
 604
 605
 606
 607
 608
 609
 610
 611
 612
 613
 614
 615
 616
 617
 618
 619
 620
 621
 622
 623
 624
 625
 626
 627
 628
 629
 630
 631
 632
 633
 634
 635
 636
 637
 638
 639
 640
 641
 642
 643
 644
 645
 646
 647
 648
 649
 650
 651
 652
 653
 654
 655
 656
 657
 658
 659
 660
 661
 662
 663
 664
 665
 666
 667
 668
 669
 670
 671
 672
 673
 674
 675
 676
 677
 678
 679
 680
 681
 682
 683
 684
 685
 686
 687
 688
 689
 690
 691
 692
 693
 694
 695
 696
 697
 698
 699
 700
 701
 702
 703
 704
 705
 706
 707
 708
 709
 710
 711
 712
 713
 714
 715
 716
 717
 718
 719
 720
 721
 722
 723
 724
 725
 726
 727
 728
 729
 730
 731
 732
 733
 734
 735
 736
 737
 738
 739
 740
 741
 742
 743
 744
 745
 746
 747
 748
 749
 750
 751
 752
 753
 754
 755
 756
 757
 758
 759
 760
 761
 762
 763
 764
 765
 766
 767
 768
 769
 770
 771
 772
 773
 774
 775
 776
 777
 778
 779
 780
 781
 782
 783
 784
 785
 786
 787
 788
 789
 790
 791
 792
 793
 794
 795
 796
 797
 798
 799
 800
 801
 802
 803
 804
 805
 806
 807
 808
 809
 810
 811
 812
 813
 814
 815
 816
 817
 818
 819
 820
 821
 822
 823
 824
 825
 826
 827
 828
 829
 830
 831
 832
 833
 834
 835
 836
 837
 838
 839
 840
 841
 842
 843
 844
 845
 846
 847
 848
 849
 850
 851
 852
 853
 854
 855
 856
 857
 858
 859
 860
 861
 862
 863
 864
 865
 866
 867
 868
 869
 870
 871
 872
 873
 874
 875
 876
 877
 878
 879
 880
 881
 882
 883
 884
 885
 886
 887
 888
 889
 890
 891
 892
 893
 894
 895
 896
 897
 898
 899
 900
 901
 902
 903
 904
 905
 906
 907
 908
 909
 910
 911
 912
 913
 914
 915
 916
 917
 918
 919
 920
 921
 922
 923
 924
 925
 926
 927
 928
 929
 930
 931
 932
 933
 934
 935
 936
 937
 938
 939
 940
 941
 942
 943
 944
 945
 946
 947
 948
 949
 950
 951
 952
 953
 954
 955
 956
 957
 958
 959
 960
 961
 962
 963
 964
 965
 966
 967
 968
 969
 970
 971
 972
 973
 974
 975
 976
 977
 978
 979
 980
 981
 982
 983
 984
 985
 986
 987
 988
 989
 990
 991
 992
 993
 994
 995
 996
 997
 998
 999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
0.  Availability
============

The source code for this package is available from
http://research-pub.gene.com/gmap.  License terms are provided in the
LICENSE file.


1.  Building and installing GMAP and GSNAP
==========================================

Prerequisites: a Unix system (including Cygwin on Windows), a C
compiler, and Perl

Step 1: Set your site-specific variables by editing the file
config.site.  In particular, you should set appropriate values for
"prefix" and probably for "with_gmapdb", as explained in that file.
If you are compiling this package on a Macintosh, you may need to edit
CFLAGS to be

CFLAGS = '-O3 -m64'

since Macintosh machines will make only 32-bit executables by default.


Step 2: Build, test, and install the programs, by running the
following GNU commands

    ./configure
    make
    make check   (optional)
    make install

Note 1: Instead of editing the config.site file in step 1, you may type
everything on the command line for the ./configure script in step 2,
like this

    ./configure --prefix=/your/usr/local/path --with-gmapdb=/path/to/gmapdb

If you omit --with-gmapdb, it defaults to ${prefix}/share.  If you
omit --prefix, it defaults to /usr/local.  Note that on the command
line, it is "with-gmapdb" with a hyphen, but in a config.site file,
it is "with_gmapdb" with an underscore.


Note 2: If you want to keep your version of config.site or have
multiple versions, you can save the file to a different filename, and
then refer to it like this

    ./configure CONFIG_SITE=<config site file>


Note 3: GSNAP can read from gzip-compressed FASTA or FASTQ input
files.  This feature requires the zlib library to be present
(available from http://www.zlib.net).  The configure program will
detect the availability of zlib automatically.  However, to disable
this feature, you can add "--disable-zlib" to the ./configure command
or edit your config.site file to have the command "disable_zlib".


Note 4: GSNAP can read from bzip2-compressed FASTA or FASTQ input
files.  This feature requires the bzlib library to be present.  The
configure program will detect the availability of bzlib automatically.
However, to disable this feature, you can add "--disable-bzlib" to the
./configure command or edit your config.site file to have the command
"disable_bzlib".


2.  Possible issues during compilation
======================================

Recent versions of GMAP and GSNAP use certain techniques to achieve
increased speed.  One of these techniques is special SIMD
(single-instruction, multiple data) instructions that can perform some
computations in parallel.  There are various levels of SIMD
instructions, and we use those from the SSE2, SSE4.1, AVX2, and AVX512
instruction sets, depending on the capabilities of the computer.  Most
computers built within the last 10 years should support SSE2 at least.

However, you may run into the following issues:

Issue 1: If you have a computer cluster with different types of SIMD
capabilities, then you should compile the source code multiple times,
one on each type of computer.  If your computer has SSE2 capabilities,
the ./configure and "make install" commands will create the programs
gmap.sse2 and gsnap.sse2.  Likewise, if your computer has AVX2
capabilities, the ./configure and "make install" commands will create
gmap.avx2 and gsnap.avx2.  The gmap and gsnap programs are dispatch
programs that determine at run time what capabilities the computer has
and call the appropriate version of the program (e.g., gsnap.sse2 or
gsnap.avx2).  The ./configure and "make install" commands will also
create versions of the programs, gmap.nosimd and gsnap.nosimd, that do
not depend on SIMD.  If the gmap and gsnap dispatch programs cannot
find a version of the program that matches the given computer, then it
will run the non-SIMD version instead.

Issue 2.  If you compile the code successfully, but when you run the
program, you see something that says "Illegal instruction", then you
must be running GMAP or GSNAP on a different computer than the one
where you compiled it.  The dispatch program should not allow this to
happen, especially if you have compiled the code on a computer of each
type in your cluster, so you can report this error to the authors and
we can try to help.  In the worst case, you can compile your program
for the lowest common denominator by by providing the
--simd-level=<level> flag to ./configure.


3.  Building a GMAP/GSNAP index (one chromosome per FASTA entry)
==============================================================================

A GMAP/GSNAP "index" is a set of genomic index files, representing the
genome in a hash table format.  You use the gmap_build program to
build your own database from a FASTA file that contains the genome.
The gmap and gsnap programs use the same index.

Below I use the terms "genome" and "chromosome", but the input
sequences can be anything you wish to align to, including transcripts
or small fragments.

A genome can be considered regular or large in size.  A regular genome
has a total sequence length of up to 2^32 = 4,294,967,296 (about 4
billion) bp, while a large genome may contain up to 2^40 (about 1
trillion) bp.  However, each individual chromosome is still limited to
2^32 bp.  The gmap_build program will automatically recognize when a
genome is larger than 2^32 bp and build the index files accordingly.

To handle large genomes, the ./configure and "make install" procedures
produce the programs gmapl and gsnapl.  The gmapl and gsnapl programs
work on large genomes, while the gmap and gsnap programs work on
regular genomes.  If you try to use the wrong program on a given
database, the programs will notify you of the error and quit.

You will need to start with a set of FASTA files containing either
entire chromosomes or contigs that represent pieces of chromosomes.
For example, for the human genome, you can retrieve all of the FASTA
files under the ftp directory at the URL:

    ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_genomic.fna.gz

The Unix wget command makes it easy to retrieve files from the URL
identifier.  You can leave the file in compressed gzip format.

Note that the file above will contain not only the reference
chromosomes (1 to 22, plus X, Y, and M), but also unmapped scaffolds
and patches.  Also, the chromosomes will be labeled with GenBank
accession numbers, such as "CM000663.2", rather than "chr1".  To
convert the chromosome names, and also to limit the chromosomes in
your genome build, you should also obtain

    GCA_000001405.28_GRCh38.p13_assembly_report.txt

from the same FTP directory.  If you want to restrict your genome
build to the reference chromosomes, you can generate a conversion file
like this:

    grep assembled-molecule GCA_000001405.28_GRCh38.p13_assembly_report.txt | cut -f 5,10 > names.txt

If you want to include the unmapped scaffolds, you can do this

    grep unplaced-scaffold GCA_000001405.28_GRCh38.p13_assembly_report.txt | cut -f 5,10 >> names.txt

to append those entries to your conversion file.

Alternatively, for certain species, you can obtain your genome
sequence from Gencode at

    ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_25/GRCh38.primary_assembly.genome.fa.gz

which will have the chromosomes labeled "chr1" and so on, and also
contain only the primary assembly (reference chromosomes plus unmapped
scaffolds).

Or, you can obtain your chromosome sequence from Ensembl at

    ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

The Ensembl chromosomes are labeled without the "chr" prefix, so they
are "1", "2", "3", and so on.  If you plan on using other data sources
from NCBI, you may want to add the "chr" prefix, by generating a
names.txt file, with lines like this:

1	chr1
2	chr2

where there is a tab character between the two columns.

Finally, you can obtain your genome sequence from UCSC at

    ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/

but you will have to download your chromosomes individually.  There is
no need to gunzip them or to concatenate them, though.  These chromosomes
do have the "chr" prefix.


If your FASTA entries each contain a single chromosome, and the
accession for each chromosome is the chromosome number/letter, you can
simply run this command

    gmap_build -d <genome> [-k <kmer size>] [--names=<file>] [--limit-to-names] <fasta_files...>

which will build and install the GMAP index files in your default
GMAPDB location.  If your FASTA files are gzipped, you should add the
flag "--gunzip".  If you created a file to convert chromosome names,
you can use the --names flag to perform the conversion.  If your
conversion file is also intended to limit the chromosomes in your
genome build (which is the usual case), then you should add the flag
--limit-to-names, or else all chromosomes in the FASTA file will be
added to the build.

For the mitochondrial chromosome, which is circular, you will probably
want to allow for circular alignments by GSNAP and GMAP around the
circular origin of the chromosome.  You will need to give gmap_build
the flag "--circular=<mitochondrial chromosome name>".  If you
obtained your genome from NCBI, Gencode, or UCSC, this would be
"--circular=chrM".  For Ensembl, it would be "--circular=MT".  If your
genome has a few circular chromosomes, you can list them in a single
string, separated by commas.  If you have many circular chromosomes or
contigs, you provide a file to the --circular flag instead.  Note that
if you provide the --names conversion file, you should provide the new
name of the circular chromosome, not the old one.  (Note: This
behavior has changed starting with the 2020-10-20 release of this
package.  Previously, you specified the old name of the circular
chromosome.)

You can see the full usage of gmap_build by doing "gmap_build --help",
but here are some other useful flags.  You can control the k-mer size
for the genomic index with the -k flag, which can range from 12 to 15.
The default value for -k is 15, but this requires your machine to have
4 GB of RAM to build the indices.  If you do not have 4 GB of RAM,
then you will need to reduce the value of -k or find another machine.
Here are the RAM requirements for building various indices:

    k-mer of 12: 64 MB
    k-mer of 13: 256 MB
    k-mer of 14: 1 GB
    k-mer of 15: 4 GB

Note that the term

    <fasta_files...>

above indicates that multiple files can be listed.  The files can be
in any order, and the contigs can be in any order within these files.
By default, the gmap_build process will sort the contigs and
chromosomes into their appropriate "chrom" order.  For the human
genome, this order is 1, 2, ..., 10, 11, ..., 22, X, Y, M, followed by
all other chromosomes in numeric/alphabetical order.  If you don't
want this sort, provide the "-s none" flag to gmap_build.  Other sort
options besides "none" and "chrom" are "alpha" and "numeric-alpha".

You can type "gmap_build --help" to see the full set of options.  We
discuss some specific situations below.

You can provide information to gmap_build that certain chromosomes are
circular, with the -c or --circular flag.  The value for these flags
is a list of chromosomes, separated by commas.  The gmap_build program
will then allow .  For example, the mitochondrial genome in human beings is
circular.

If your genome files are compressed using gzip, you can simply add the
flag "--gunzip" to gmap_build.  But if your genome files require some
other type of processing, you may need to write a small script that
pipes the sequences in FASTA format to gmap_build.  You can tell
gmap_build about your script with the -E (or --fasta-pipe) flag, like
this

    gmap_build -d <genome> -E 'gunzip -c chr*.gz'
    gmap_build -d <genome> -E 'cat *.fa | ./add-chromosomal-info.pl'

You can think of the command as a Unix pipe for processing each FASTA
file before it is read by gmap_build.


4.  Running GMAP
================

To see the full set of options, type "gmap --help".  The following are
some common examples of usage.  For more examples, see the document
available at http://www.gene.com/share/gmap/paper/demo-slides.pdf

For each of the examples below, we assume that you have installed a
genome database called <genome> in your GMAPDB directory.  (If your
database is located elsewhere, you can specify the -D flag to gmap or
set the environment variable GMAPDB to point to that directory.)

* Mapping only: To map one or more cDNAs in a FASTA file onto a
  genome, run GMAP as follows:

    gmap -d <genome> <cdna_file>

* Mapping and alignment: If you want to map the cDNAs to a genome,
  and show the full alignment, provide the -A flag:

    gmap -d <genome> -A <cdna_file>


* Alignment only: To align one or more cDNAs in a FASTA file onto a
  given genomic segment (also in a FASTA file), use the -g flag
  instead of the -d flag:

    gmap -g <genome_segment> -A <cdna_file>


* Batch mode: If you have a large number of cDNAs to run, and you have
  sufficient RAM to run in batch mode, add the "-B 3", "-B 4", or "-B
  5" option.  Details for these options are provided by running "gmap
  --help".

    gmap -d <genome> -B 5 -A <cdna_file>


* Multithreaded mode: If your machine has several processors, you can
  make batch mode run even faster by specifying multiple threads with
  the -t flag:

    gmap -d <genome> -B 5 -A -t <nthreads> <cdna_file>

  Note that with multiple threads, the output results will appear in
  random order, depending on which thread finishes its computation
  first.  If you wish your output to be in the same order as the 
  input cDNA file, add the '-O' (letter O, not the number 0) flag
  to get ordered output.

  Guidelines: The -t flag specifies the number of computational
  threads.  In addition, if your machine supports threads, GMAP also
  uses one thread for reading the input query sequences, and one
  thread for printing the output results.  Therefore, the total number
  of threads will be 2 plus the number you specify.  The program will
  work optimally if it uses one thread per available processor.  If
  you specify too many threads, you can cause your computer to thrash
  and slow down.  Note that other programs running on your computer
  also need processors.


* Compressed output: If you want to store the alignment results in a
  compressed format, use the -Z flag.  You can uncompress the results
  by using the gmap_uncompress.pl program:

    gmap -d <genome> -Z <cdna_file> > x
    cat x | gmap_uncompress


Note that gmap is written for small genomes (less than 2^32 bp in
total length).  With large genomes, there is an equivalent program
called gmapl, which you should run instead of gmap.  The gmapl program
is equivalent to gmap, and is based on the same source code, but is
compiled to use 64-bit index files instead of 32-bit files.  The gmap
and gmapl programs will detect whether the genomes are the correct
size, and will exit if you try to run them on the wrong-sized genomes.



5.  Building map files
======================

This package includes an implementation of interval index trees
(IITs), which permits efficient lookup of interval information.  The
gmap program also allows you (with its -m flag) to look up pre-mapped
annotation information that overlaps your query cDNA sequence.  These
interval index trees (or map files) are built using the iit_store
program included in this package.  To build a map file, do the
following:

Step 1: Put your map information for a given genome into a map file
with the following FASTA-like format:
   
    >label coords optional_tag
    optional_annotation (which may be zero, one, or multiple lines)

For example, the label may be an EST accession, with the coords
representing its genomic position.  Labels may be duplicated if
necessary.

The coords should be of the form

    chr:position
    chr:startposition..endposition

The term "chr:position" is equivalent to "chr:position..position".  If
you want to indicate that the interval is on the minus strand or
reverse direction, then <endposition> may be less than
<startposition>.

Tags are very general and can be used for a variety of purposes.  For
example, you could 


Step 2: Run iit_store on this map file as follows

    cat <mapfile> | iit_store -o <mapname>

The program will create a file called <mapname>.iit.

Now you can retrieve this information with iit_get

    iit_get <mapname>.iit <coords>

where <coords> has the format "chr:position" or
"chr:startposition..endposition".  The iit_get program has other
capabilities, including the ability to retrieve information by label,
like this:

    iit_get <mapname>.iit <label>

More details can be found by running "iit_get --help".

If you plan to use this map information frequently, you may want to
place it with its corresponding genome for future use.  In each
GMAP/GSNAP database, there is a subdirectory for storing map files,
like this

    /path/to/gmapdb/<genome>/<genome>.maps/

(If you don't remember where your default gmap directory is, run "gmap
--version" to find it.)  If you put your <mapname>.iit file into this
maps subdirectory, you can get additional functionality.  First, you
can run the program get-genome, which is used mainly for getting
genome sequence, to get map information instead, like this

    get-genome -d <genome> -m <mapname> <coords>

Second, you can use GMAP with the -m flag to retrieve map information
that corresponds to a given cDNA sequence like this

    gmap -d <genome> -m <mapname> <cdna_file>

Finally, GMAP and the IIT utilities support the GFF3 format.  GMAP can
generate its results in GFF3 format, and iit_store can parse GFF3
files using its -G and -l flags.  More details about iit_store can be
found by doing "iit_store --help".


6.  Running GSNAP
=================

GSNAP uses the same database as GMAP does, so you will need to process
the genome using gmap_build as explained above, if you haven't done
that already.  To see the full set of options for GSNAP, type "gsnap --help".

Input to GSNAP should be either in FASTQ or FASTA format.  The FASTQ
input may include quality scores, which will then be included in SAM
output, if that output format is selected.  For single-end reads, the
FASTQ file may be piped into GSNAP, or given as its command-line
argument, like this

    cat <fastq_file> | gsnap -d <genome>

or

    gsnap -d <genome> <fastq_file>


For paired-end reads, the two corresponding FASTQ files should be
given as command-line arguments in pairs, like this

    gsnap -d <genome> <fastq_file_1> <fastq_file_2> [<fastq_file_3> <fastq_file_4>...]

A pipe cannot work since GSNAP needs to access both FASTQ files in
parallel.  The reads in FASTQ files may have varying lengths, if
desired.  Note that GSNAP can process multiple sets of paired-end
reads, by adding the files in pairs.  If you want to provide multiple
single-end files, you can either use "cat" to concatenate them into
the stdin of gsnap, like this:

    cat <fastq_file_1> [<fastq_file_2>...] | gsnap -d <genome>

or you can provide them all on the command line with the
--force-single-end flag, like this:

    gsnap -d <genome> --force-single-end <fastq_file_1> [<fastq_file_2>...]

which will process each FASTQ file one at a time as single-end reads,
and not try to pair them up.

GSNAP also has the ability to deal with files compressed with gzip, if
the configure script at compile time can find a zlib library in your
system (see Note 3 in the section above about building and installing
GMAP and GSNAP).  If so, and your files are gzipped, you can then read
in gzipped files directly like this

    gsnap --gunzip -d <genome> <fastq.gz>, or 
    gsnap --gunzip -d <genome> --force-single-end <fastq1.gz> [<fastq2.gz>...]

for single-end reads, or

    gsnap --gunzip -d <genome> <fastq_1.gz> <fastq_2.gz> [<fastq_3.gz> <fastq_4.gz>...]

for paired-end reads.

Likewise, GSNAP can handle files compressed with bzip2, if the
configure script at compile time can find a bzlib library in your
system (see Note 3 in the section above about building and installing
GMAP and GSNAP).  If so, and your files are bzip2-compressed, you can
then read in those files directly like this

    gsnap --bunzip2 -d <genome> <fastq.bz2>, or 
    gsnap --bunzip2 -d <genome> --force-single-end <fastq1.bz2> [<fastq2.bz2>...]

for single-end reads, or

    gsnap --bunzip2 -d <genome> <fastq_1.bz2> <fastq_2.bz2> [<fastq_1.bz2> <fastq_2.bz2>...]

for paired-end reads.


For FASTA format, you should include one line per read (or end of a
paired-end read).  The same FASTA file can have a mixture of
single-end and paired-end reads of varying lengths, if desired.

Single-end reads:

Each FASTA entry should contain one short read per line, like this

>Header information
AAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTA

Each short read can have a different length.  However, the entire read
needs to be on a single line, and may not wrap around multiple lines.
If it extends to a second line, GSNAP will think that the read is
paired-end.


Paired-end reads:

Each FASTA entry should contain two short reads, one per line, like
this

>Header information
AAAACATTCTCCTCCGCATAAGCCTAGTAGATTA
GGCGTAGGTAGAAGTAGAGGTTAAGGCGCGTCAG

By default, the program assumes that the second end is in the reverse
complement direction compared with the first end.  If they are in the
same direction, you may need to use the --circular-input (or -c) flag.

GSNAP and GMAP can also read an extended FASTA format that include
quality scores, which look like this

    >Header information
    AAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTA
    +
    <quality scores>

for single-end reads, or

    <Header information
    AAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTA
    +
    <quality scores>

for the second-end of a paired-end read.  In addition, GSNAP can read
an extended FASTA format for paired-end reads, like this:

    >Header information
    AAAACATTCTCCTCCGCATAAGCCTAGTAGATTA
    GGCGTAGGTAGAAGTAGAGGTTAAGGCGCGTCAG
    +
    <quality scores 1>
    <quality scores 2>

This extended FASTA format is useful if paired-end information needs
to be piped into GSNAP via stdin.


As with gmap, gsnap is written for small genomes (less than 2^32 bp in
total length).  With large genomes, there is an equivalent program
called gsnapl, which you should run instead of gsnap.  The gsnapl
program is equivalent to gmap, and is based on the same source code,
but is compiled to use 64-bit index files instead of 32-bit files.
The gsnap and gsnapl programs will detect whether the genomes are the
correct size, and will exit if you try to run them on the wrong-sized
genomes.


7.  Detecting known and novel splice sites in GSNAP
====================================================

Whereas GMAP assumes RNA-seq input by default, GSNAP assumes DNA-seq
input by default.  To tell GSNAP to find splice junctions in
individual reads, you need to provide either the "--novelsplicing=1"
(or "-N 1") flag and/or the "-s <splicing_file>" flag to GSNAP.

GSNAP can detect splices using a probabilistic model using the
--novelsplicing (or -N) flag.  You can also detect splices from a set
of splice sites that you provide, using the --splicesites (or -s)
flag.  You may specify both flags, which will report splice junctions
involving both known and novel sites.

Output for a splicing junction will look like this:

>TCCGTGACGTGGATTGGTGCTGCACCCCTCATC      1       Header
 TCCGTGACGTGGATTGgt---------------      1..16   +19:56050054..56050069  start:0..donor:0.99,splice_dist:1238,dir:sense,sub:0+0=0,label:NM_001648.KLK3.exon1/5|NM_001030047.KLK3.exon1/5|NM_001030048.KLK3.exon1/5|NM_001030049.KLK3.exon1/6|NM_001030050.KLK3.exon1/2       
,--------------agGTGCTGCACCCCTCATC      17..33  +19:56051308..56051324  acceptor:0.99..end:0,dir:sense,sub:0+0=0,label:NM_001648.KLK3.exon2/5|NM_001030047.KLK3.exon2/5|NM_001030048.KLK3.exon2/5|NM_001030049.KLK3.exon2/6|NM_001030050.KLK3.exon2/2
 
After the "donor:" or "acceptor:" splice site type, the model score
probability is given, even if the splice site is known.  For known
splice sites, the "label:" field will provide information about the
site.  If there is more than one known splice site at a genomic
position, the labels are separated by a "|" delimiter.

There are several advantages to specifying a database of known splice
sites.  First, GSNAP will then be able to detect splicing involving
atypical splice sites, that would otherwise give low scores using its
probabilistic model.  A known splice site is treated as if its model
probability is 1.0.  Second, GSNAP can find splicing involving short
exons.  Such cases have a single end aligning to three exons,
separated by two introns.  Third, GSNAP can identify splicing at the
ends of reads with greater sensitivity, even if they have short
overlaps onto the next exon.  Fourth, GSNAP can detect known long
splices, because expected splice lengths can be included with the
splice site information.

GSNAP allows for known splicing at two levels: at the level of known
splice sites and at the level of known introns.  At the site level,
GSNAP finds splicing between arbitrary combinations of donor and
acceptor splice sites, meaning that it can find alternative splicing
events.  At the intron level, GSNAP finds splicing only between the
set of given donor-acceptor pairs, so it is constrained not to find
alternative splicing events, only introns included in the given list.
For most purposes, I would recommend using known splice sites, rather
than known introns, unless you are certain that all alternative
splicing events are known are represented in your file.

GSNAP can tell the difference between known site-level and known
intron-level splicing based on the format of the input file.  To
perform known site-level splicing, you will need to create a file with
the following format:

>NM_004448.ERBB2.exon1 17:35110090..35110091 donor 6678
>NM_004448.ERBB2.exon2 17:35116768..35116769 acceptor 6678
>NM_004448.ERBB2.exon2 17:35116920..35116921 donor 1179
>NM_004448.ERBB2.exon3 17:35118099..35118100 acceptor 1179
>NM_004449.ERG.exon1 21:38955452..38955451 donor 783
>NM_004449.ERG.exon2 21:38878740..38878739 acceptor 783
>NM_004449.ERG.exon2 21:38878638..38878637 donor 360
>NM_004449.ERG.exon3 21:38869542..38869541 acceptor 360

Each line must start with a ">" character, then be followed by an
identifier, which may have duplicates and can have any format, with
the gene name or exon number shown here only as a suggestion.  Then
there should be the chromosomal coordinates which straddle the
exon-intron boundary, so one coordinate is on the exon and one is on
the intron.  (Coordinates are all 1-based, so the first character of a
chromosome is number 1.)  Finally, there should be the splice type:
"donor" or "acceptor".  You may optionally store the intron distance
at the end.  GSNAP can use this intron distance, if it is longer than
its value for --localsplicedist, to look for long introns at that
splice site.  The same splice site may have different intron distances
in the database; GSNAP will use the longest intron distance reported
in searching for long introns.
                                                                                                
Note that the chromosomal coordinates are in the sense direction.
Therefore, genes on the plus strand of the genome (like NM_004448) have
the coordinates in ascending order (e.g., 35110090..35110091).
Genes on the minus strand of the genome (like NM_004449) have the
coordinates in descending order (e.g., 38955452..38955451).
                                                                                                
On the other hand, to perform known intron-level splicing, you will need
to create a file with the following format:

>NM_004448.ERBB2.intron1 17:35110090..35116769
>NM_004448.ERBB2.intron2 17:35116920..35118100
>NM_004449.ERG.intron1 21:38955452..38878739
>NM_004449.ERG.intron2 21:38878638..38869541

Again, coordinates are 1-based, and specify the exon coordinates
surrounding the intron, with the first coordinate being from the donor
exon and the second one being from the acceptor exon.


There are several ways to help you generate these files.  First, if
you have a GTF file, you can use the included programs gtf_splicesites
and gtf_introns like this:

    cat <gtf file> | gtf_splicesites > foo.splicesites
    cat <gtf file> | gtf_introns > foo.introns

Second, if you retrieve an alignment tracks from UCSC, like this:

    ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz

if you are aligning to genome hg18, or

    ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz

if you are aligning to genome hg19, you can process this track using
the included program psl_splicesites or psl_introns, like this:

    gunzip -c refGene.txt.gz | psl_splicesites -s 1 > foo.splicesites
    gunzip -c refGene.txt.gz | psl_introns -s 1 > foo.introns

Note that alignment tracks in UCSC sometimes have an extra column on
the left.  The "-s" flag allows you to indicate how many columns
should be skipped.

Once you have built this splicesites or introns file, you process it
as a map file (see "Building map files" above), by doing

    cat foo.splicesites | iit_store -o <splicesitesfile>, or
    cat foo.introns | iit_store -o <intronsfile>

If you want to include more than one track, you can do this:

    gunzip -c refGene.txt.gz | psl_splicesites -s 1 > foo
    gunzip -c knownGene.txt.gz | psl_splicesites > bar
    cat foo bar | iit_store -o <splicesitesfile>


A third way to build a known splicesites or known introns file is
useful if you have cDNA sequences rather than an alignment track, or
if you do not trust the alignment track and prefer to use cDNA
sequences.  GMAP has an option "-f splicesites" that finds splice
sites in cDNA sequences and reports them in the correct splicesite
format.  Likewise, GMAP can build an intron file, with the option "-f
introns".

When processing known cDNA sequences, you should also run GMAP with
the "-n 1" flag, so you get the best alignment, and with the "-z
sense_force" or "-z sense_filter" flag.  The sense_force option will
help GMAP know that the introns in your cDNA sequences are in the
correct GT-AG sense, and is applicable when you have a high quality
set of cDNA sequences.  The sense_filter option will allow GMAP to try
either sense or antisense, and to filter out sequences that appear to
be antisense; this is applicable if you are uncertain about the
validity of your cDNA sequences.

Again once you have built either a known splicesites or known introns
file, you process it as a map file by doing

    cat <file> | iit_store -o <splicesitesfile>, or
    cat <file> | iit_store -o <intronsfile>
                                                                                                
which creates <splicesitesfile>.iit or <intronsfile>.iit.


Regardless of how you built <splicesitesfile>.iit or <intronsfile>.iit,
you put it in the maps subdirectory by doing
                                                                                                
    cp <splicesitesfile>.iit /path/to/gmapdb/<genome>/<genome>.maps, or
    cp <intronsfile>.iit /path/to/gmapdb/<genome>/<genome>.maps
                                                                                                
Then, you may use the file by doing this:
                                                                                                
    gsnap -d <genome> -s <splicesitesfile> <shortreads>, or
    gsnap -d <genome> -s <intronsfile> <shortreads>, or
                                                                                                

8.  SAM output format
=====================

GSNAP can generate SAM output format, by providing the "-A sam" flag
to GSNAP.  In addition, GMAP can also print its alignments in SAM
output, using the "-f samse" or "-f sampe" options, for single-end or
paired-end data.  The sampe option will generate SAM flags to indicate
whether the read is the first or second end of a pair, which requires
that you provide GMAP with an extended FASTA format having a ">" or
"<" character in the header to indicate that information.  However,
the sampe option will change only the SAM flags, and not change the
underlying alignment algorithm.  GMAP does not know how to find
concordance between paired-end reads like GSNAP does.

GSNAP provides some special SAM flags as follows:

XQ: A non-normalized mapping quality score

X2: The second best XQ score among all multimapping alignments for a
given read.  If there is only a single alignment, this value is 0.

XO: Output type.  GSNAP categorizes its alignments into output types,
as follows.  Note that the --split-output option will create separate
output files for each output type.  Alternatively, if you use
sam_sort, you should provide --split-output to that program instead
and achieve the same functionality.  (The reason for this is that
there may be situations where GSNAP assigns different output types to
the first and second ends of the reads and sam_sort needs to see
alignments from both ends together.)  In either case, the output types
have the following meanings and filename suffixes:

  NM (nomapping) (filename suffix ".nomapping"): The entire read
  (single-end or paired-end) could not be aligned.  If the
  --failed-input=FILENAME flag is specified, then these reads are also
   printed in FASTQ or FASTA format (depending on the input format) in
  the given file (plus a .1 or .2 ending for the first and second ends
  of a paired-end read).

  CU (concordant unique) (filename suffix ".concordant_uniq"): Both
  ends of a paired-end read could be aligned concordantly to a single
  position in the genome.  For a definition of concordance, see the
  section "Output types" below.

  CM (concordant multiple) (filename suffix ".concordant_mult"): Both
  ends of a paired-end read could be aligned concordantly, but to more
  than one position in the genome.

  CX (concordant multiple excess) (filename suffix
  ".concordant_mult_xs"): Multiple concordant alignments, but user
  specified --quiet-if-excessive and the number of alignments exceeds
  "-n" threshold.  If the --failed-input option is given, these reads
  are also printed in FASTA or FASTQ format in the given file.

  CT (concordant translocation) (filename suffix
  ".concordant_transloc"): Both ends of a paired-end read could be
  aligned concordantly, but one end requires a split alignment to a
  distant location, such as another chromosome, or a different strand
  on that chromosome, or a far distance on that strand.  Note that
  translocation alignments need to be printed on two separate SAM
  lines.

  CC (concordant circular) (filename suffix ".concordant_circular"):
  Both ends of a paired-end read could be aligned concordantly, but
  one or both ends require an alignment that goes around the origin of
  a circular chromosome.  Circular chromosomes are specified in the
  gmap_build step by using the -c or --circular flag, as described
  previously.  Note that circular alignments need to be printed on two
  separate SAM lines.

  PI (paired unique inversion) (filename suffix ".paired_uniq_inv"):
  Both ends of a paired-end read could be aligned uniquely, but in a
  way that indicates that a genomic inversion has occurred between the
  two ends.

  PS (paired unique scramble) (filename suffix ".paired_uniq_scr"):
  Both ends of a paired-end read could be aligned uniquely, but in a
  way that indicates that the genomic order is scrambled.  This
  typically occurs because of tandem duplications.

  PL (paired unique long) (filename suffix ".paired_uniq_long"): Both
  ends of a paired-end read could be aligned uniquely, but in a way
  that indicates that a large genomic deletion has occurred between
  the two ends.

  PC (paired unique circular) (filename suffix
  ".paired_uniq_circular"): Both ends of a paired-end read could be
  aligned uniquely, but not concordantly, representing an inversion,
  scramble, or deletion.  In addition, one or both ends of the read
  goes around the origin of a circular chromosome.

  PM (paired multiple) (filename suffix ".paired_mult"): Both
  ends of a paired-end read could be aligned near each other,
  representing an inversion, scramble, or deletion, but there are
  multiple places in the genome where an alignment is found.

  PX (paired multiple excess) (filename suffix ".paired_mult_xs"):
  Multiple paired alignments, but user specified --quiet-if-excessive
  and the number of alignments exceeds the "-n" threshold.  If the
  --failed-input option is given, these reads are also printed in
  FASTA or FASTQ format in the given file.

  HU, HM, HT, HC (halfmapping unique, halfmapping multiple,
  halfmapping translocation, and halfmapping circular, respectively)
  (filename suffixes: ".halfmapping_uniq", ".halfmapping_mult",
  ".halfmapping_transloc", ".halfmapping_circular): Same as for the
  concordant output types, except that only one end of the paired-end
  read could be aligned, and the other end could not be aligned
  anywhere in the genome.

  HX (halfmapping multiple excess) (filename suffix
  ".halfmapping_mult_xs"): Multiple halfmapping alignments, but user
  specified --quiet-if-excessive and the number of alignments exceeds
  the "-n" threshold.  If the --failed-input option is given, these
  reads are also printed in FASTA or FASTQ format in the given file.

  UU, UM, UT, UC (unpaired unique, unpaired multiple, unpaired
  translocation, and unpaired circular, respectively) (filename
  suffixes: ".unpaired_uniq", ".unpaired_mult", ".unpaired_transloc",
  ".unpaired_circular): Same as for the concordant output types,
  except that the two ends could not be aligned concordantly or even
  paired.  These "unpaired" categories are also used for single-end
  reads, since they lack a mate end that can allow for concordance,
  pairing, or halfmapping.

  UX (unpaired multiple excess) (filename suffix ".unpaired_mult_xs"):
  Multiple unpaired alignments, but user specified
  --quiet-if-excessive and the number of alignments on one end exceeds
  the "-n" threshold.  If the --failed-input option is given, these
  reads are also printed in FASTA or FASTQ format in the given file.


XB: Prints the barcode extracted from the end of the read.  Applies only
if --barcode-length is not 0.

XP: Prints the primer inferred from a paired-end read.  Applies only
if the --adapter-strip flag is specified.

XH: Prints the part of the query sequence that was hard-clipped.
Sequence is printed in plus-genomic order and replaces the "H" part of
the CIGAR string.

XI: Prints the part of the quality string that was hard-clipped.
Sequence is printed in plus-genomic order and replaces the "H" part of
the CIGAR string.

XS: Prints the strand orientation (+ or -) for a splice.  Appears only
if splicing is allowed (-N or -s flag provided), and only for reads
containing a splice.  The value "+" means the expected GT-AG, GC-AG,
or AT-AC dinucleotide pair is on the plus strand of the genome, and
"-" means the dinucleotides are on the minus strand.  If the
orientation is not obvious, because the dinucleotides do not match
GT-AG, GC-AG, AT-AC, or their complements, then the program applies a
probabilistic splice model to determine the orientation.  If the
splice sites have low probability, then the program may not be able to
determine an orientation, and the result will be printed as XS:A:?.
To prevent this flag, which cannot be handled by such programs (such
as Cufflinks), use the --force-xs-dir flag.  However, this flag will
merely change occurrences of XS:A:? arbitrarily to XS:A:+.

XA: Indicates an ambiguous splice.  If GSNAP finds two or more
possible splices at a given position, it will try to resolve the
ambiguity if possible based on the other end of the paired-end read.
If the ambiguity cannot be resolved, GSNAP will not report any of the
splices, but will report a soft clip instead.  The XA field indicates
which end or ends are ambiguous and the number of matches found on
each ambiguous end, based on the output XA:Z:i,j.  If i or j is
greater than 0, that indicates that the lower or higher chromosomal
end is ambiguous, respectively.  The value given indicates the number
of matches found in the ambiguous end.  This number may be smaller
than the number of bases soft-clipped, due to mismatches.

XC: Indicates whether the alignment crosses over the origin of a
circular chromosome.  If so, the string XC:A:+ is printed.

XT: Prints the intron dinucleotides and splice probabilities around a
distant splicing event (genomic deletion, inversion, scramble, or
translocation).

XW and XV: Printed only when SNP-tolerant alignment is enabled.  XW
provides the number of mismatches against both the reference and
alternate alleles (or the "World" population).  Therefore, these are
true mismatches.  XV provides the number of positions that are
mismatches against the reference genome, but do match the alternate
genome.  Therefore, these are known variant positions.  The sum of XW
and XV provides the number of differences relative to the reference
genome, and with the exception of indels, should equal the value of
NM.

XG: Indicates which method within GSNAP generated the alignment.  B:
GMAP alignment produced from suffix array, M: GMAP alignment produced
from GSNAP hash table method, T: terminal alignment, O: merging of
overlaps.  Absence of XG flag indicates the standard GSNAP hash table
method.  (Note: older versions of GSNAP used "PG:", but some
downstream software required all PG methods to be listed in the header
section, so we changed the field name to "XG:")


9.  GSNAP output format
========================

By default, GSNAP prints its output in a FASTA-like format, which we
developed before we incorporated the SAM format.  The default GSNAP
output has some advantages over SAM output, especially for debugging
purposes.  However, we routinely use SAM output for our own pipeline,
and it has been subject to the most testing by ourselves and by outside
users.

Here is some output from GSNAP on a paired-end read:

>GGACTGCGCACCGAACGGCAGCGACTTCCCGTAGTAGCGGTGCTCCGCGAAGACCAGTAGAGCCCCCCGCTCGGCC   1 concordant    ILLUMINA-A1CCE9_0004:1:1:1510:2090#0
 GGACTGCGCACCGAACGGCAGCGACTTCCCGTAGTAGCGct-----------------------------------   1..39   +9:139128263..139128301 start:0..acceptor:0.99,dir:antisense,splice_dist:214,sub:0+0=0,label:NM_013379.DPP7.exon4/13  segs:2,align_score:2  pair_score:5,pair_length:112
,-------------------------------------acGTGCTCCGCGAAGACCAGTAGAGCCCCCCGCTCGGCC   40..76  +9:139128516..139128552 donor:0.96..end:0,dir:antisense,splice_dist:214,sub:0+0=0,label:NM_013379.DPP7.exon3/13
 
<CTTCGCCAACAACTCGGGCTTCGTCGCGGAGCTGGCGGCCGAGCGGGGGGCTCTACTGGTCTTCGCGGAGCACCGC   1 concordant    ILLUMINA-A1CCE9_0004:1:1:1510:2090#0
 CTTCGCCAACAACTCGGcCTTCGTCGCGGAGCTGGCGGCCGAGCGGGGGGCTCTACTGGTCTTCGCGGAGCACgtg   1..73   -9:139128588..139128516 start:0..end:3,sub:3+1=4      segs:1,align_score:3  pair_score:5,pair_length:112
 
Each end of a read gets its own block, with the first read starting
with ">" and the second read for paired-end reads starting with "<".
The block starts with a header line that has in column 1, the query
sequence in its original direction (and with lower-case preserved if
any); in column 2, the number of hits for that query and if the read
is paired-end, the relationship between the ends (as discussed in the
next paragraph); and in column 3, the accession number for the query.


10.  Output types
=================

The two ends of a paired-end read can have the following
relationships: "concordant", "paired", or "unpaired".  A paired-end
read is concordant if the ends align to the same chromosome, in the
expected relative orientations, and having an inferred insert length
greater than zero and within the "--pairmax" parameter.  The inferred
insert length is the distance from the end of the first-end alignment
to the start of the second-end alignment, plus the read lengths of the
two ends.  There may be more than one concordant alignment for a given
read, and if so, the alignments for each end are reported in
corresponding order.

If a concordant relationship cannot be found, then the program will
report any paired relationships it can find.  A paired alignment
occurs when the two ends align to the same chromosome, but fail some
criterion for concordance.  There are different subtypes of paired
alignments, depending on which criterion is violated.  If the
orientations are opposite what is expected, the paired subtype is
"inversion".  If they are in the expected orientation, but the
distance is greater than the "--pairmax" parameter, then the paired
subtype is "toolong".  If they are in the expected orientation, but
the inferred insert length appears to be negative, then the paired
subtype is "scramble".  In GSNAP output, a paired subtype is shown in
a label called "pairtype", which can have the values
"pairtype:inversion", "pairtype:toolong", and "pairtype:scramble".

Otherwise, if neither a concordant nor paired alignment can be found,
then the program will align each end separately, and report the
relationship as being "unpaired".

GSNAP can find translocation splices within a single end of a read,
but it tries to be conservative about reporting them.  If there is any
alignment that does not involve such a translocation, then it will not
report the translocation.  It therefore reports translocation splices
only when no other alignment is found within the concordant, paired,
or unpaired categories.  Therefore, such results are listed in the
header as having "(transloc)" appear after the "concordant", "paired",
or "unpaired" result type.

After the query line, each of the genomic hits is shown, up to the
'-n' parameter.  If too many hits were found (more than the '-n'
parameter), the behavior depends on whether the "--quiet-if-excessive"
flag is given to GSNAP.  If not, then the first n hits will be printed
and the rest will not be printed.  If the "--quiet-if-excessive" flag
is given to GSNAP, then no hits will be printed if the number exceeds
n.

Each of the genomic hits contains one or more alignment segments,
which is a region of continuous one-to-one correspondence (matches or
mismatches) between the query and the genome.  Multiple segments occur
when the alignment contains an insertion, deletion, or splice.  The
first segment is marked by a space (" ") at the beginning of the line,
while the second and following segments are marked by a comma (",") at
the beginning of the line.  (In the current implementation of GSNAP
that allows only a single indel or splice, the number of segments is
at most two.)

The segments contain information in tab-delimited columns as follows:

Column 1: Genomic sequence with matches in capital letters, mismatches
in lower-case letters, and regions outside the segment with dashes.
For deletions in the query, the deleted genomic sequence is also
included in lower case.  For spliced reads, the two dinucleotides at
the intron ends are included in lower case.

Column 2: Range of query sequence aligned in the segment.  Coordinates
are inclusive, with the first nucleotide considered to be position 1.

Column 3: Range of genomic segment aligned, again with inclusive
coordinates, with the first nucleotide in each chromosome considered
to be position 1.  Plus and minus strands are marked with a "+" or "-"
sign.

Column 4: Segment information, delimited by commas.  The first item
reports on the ends of the segment, which can be of type "start",
"end", "ins", "del", "donor", "acceptor", or "term".  After "start"
and "end", we report the number of nucleotides clipped or trimmed from
the segment.  In our example above, "end:3" means that 3 nucleotides
should be trimmed from the end.  Trimming finds a local maximum of
matches to mismatches from the end and is computed only if the "-T"
flag is specified, and the value for "-T" limits the amount of
trimming allowed.  After "ins" and "del", we report the number of
nucleotides that were inserted or deleted in the query relative to the
genome.  After "donor" or "acceptor", we report the probability of the
splice site, based on the MaxEnt model.  The "term" label indicate a
terminal segment, where the entire read could not be aligned, but more
than half of the read could be aligned from either end.

Each segment will also show after the "sub" tag, he number of
mismatches in that segment including the part that is trimmed, if any.
If SNP-tolerant alignment is chosen, then the number of SNPs is also
shown (see details below under SNP-tolerant alignment).  Other
information may also be included with the segment information, such as
the orientation and distance of the splice or known splice labels, if
appropriate flags and information are given to GSNAP.  Splices are
marked with a splice_type, which can be "consistent", "inversion",
"scramble", or "translocation".  A "translocation" splice includes
splices on the same chromosome where the splice distance exceeds the
parameter for localsplicedist.

Column 5: Alignment or hit information, delimited by commas.  For the
first segment in a hit (the one starting with a space), this column
provides the number of segments (denoted by "segs:") and the score of
the alignment (denoted by "align_score:").

Column 6: Pair information (for paired-end reads only).  For the first
segment in a hit (with the same information repeated on both ends of a
concordant pair), this column provides the score of the pair (which is
the sum of the alignment scores) and the inferred length of the insert
(ignoring splices within each segment, but not between segments).  


11.  SNP-tolerant alignment in GSNAP
====================================

GSNAP has the ability to align to a reference space of all possible
major and minor alleles in a set of known SNPs provided by the user.
This ability is provided by the -v flag, and produces output like
this:

>ATGGTAATCCTGCTCAGTACGAGAGGAACCGCAGGA   2       Header
 ATGGTAATCCTGCTCAGTACGAGAGGAACCGCAGGt   1..36   -12:34263937..34263902  start:0..end:0,sub:1+0=1       
 ATGGTAATCCTGCTCAGTAGGAGAGGAACCCCAGGt   1..36   -21:14379300..14379265  start:0..end:0,sub:1+2=3

The notation "sub:1+0=1" indicates that the SNP-tolerant alignment has
one mismatch ("sub:1") and zero minor SNP alleles ("+0"), for a total
of one differences ("=1") relative to the reference genome with all
major alleles.  Likewise, the notation "sub:1+2=3" indicates one
SNP-tolerant mismatch, two minor SNP alleles, and 3 mismatches
relative to the reference sequence with all major alleles.

Note that by default, GSNAP shows only differences relative to both
the reference and alternate genomes in lower case.  If you want to
show differences relative to the reference genome in lower case, you
will need to provide the flag --show-refdiff, which would give the
output above as:

>ATGGTAATCCTGCTCAGTACGAGAGGAACCGCAGGA   2       Header
 ATGGTAATCCTGCTCAGTACGAGAGGAACCGCAGGt   1..36   -12:34263937..34263902  start:0..end:0,sub:1+0=1       
 ATGGTAATCCTGCTCAGTAgGAGAGGAACCcCAGGt   1..36   -21:14379300..14379265  start:0..end:0,sub:1+2=3

For SAM output format, all differences from the reference sequence are
shown in the NM and MD fields, although the alignment scoring and
mapping qualities are based on a SNP-tolerant alignment.

Before deciding to use SNP-tolerant alignment, please note the
following.  First, SNP-tolerant alignment requires more memory than
standard alignment, perhaps twice as much.  Also, with longer reads
now of 75 or more bp, GSNAP alignments are generally fine without
SNP-tolerant alignment.  Finally, if your SNPs are incorrect (and many
in the databases are), then they can lead SNP-tolerant alignment to
favor the faulty SNPs.

If you still wish to use SNP-tolerant alignment, you will need to
specify a set of known SNPs by creating a file with the following
format:

>rs62211261 21:14379270 CG
>rs62211262 21:14379281 CG

Each line must start with a ">" character, then be followed by an
identifier (which may have duplicates).  Then there should be the
chromosomal coordinate of the SNP.  (Coordinates are all 1-based, so
the first character of a chromosome is number 1.)  Finally, there
should be the two possible alleles.  (Previous versions required that
these be in alphabetical order: "AC", "AG", "AT", "CG", "CT", or "GT",
but that is no longer a requirement.)  These alleles must correspond
to the possible nucleotides on the plus strand of the genome.  If the
one of these two letters does not match the allele in the reference
sequence, that SNP will be ignored in subsequent processing as a
probable error.

GSNAP also supports the idea of a wildcard SNP.  A wildcard SNP allows
all nucleotides to match at that position, not just a given reference
and alternate allele.  It is essentially as if an "N" were recorded at
that genomic location, although the index files still keep track of
the reference allele.  To indicate that a position has a wildcard SNP,
you can indicate the genotype as "WN", where "W" is the reference
allele.  Another indication of a wildcard SNP is to provide two
separate lines at that position with the genotypes "WX" and "WY",
where "W" is the reference allele and "X" and "Y" are two different
alternate alleles.

To help you generate the file, this package includes programs that can
process known SNP data, from various formats, including dbSNP files,
VCF format, or GVF format.  One issue is that SNP databases now
contain many more variants than the common set of SNPs in the human
population.  The SNP-tolerance feature in GSNAP is not intended to
handle density packed variants, but rather more widely spaced
variants.  Therefore, it is important to limit the known SNPs to ones
that are common in the human population.

You can obtain dbSNP files from UCSC, like this

    ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/snp151Common.txt.gz

Note that there is a file called snp151.txt.gz, but this has far too
many SNPs to be useful, and possibly too densely packed to be handled
easily by the SNP tolerance feature of GSNAP.  You can process UCSC
files using our program dbsnp_iit, like this:

    gunzip -c snp151Common.txt.gz | dbsnp_iit [-w <weight>] > <snpfile>.txt, or

The option "-w <weight>" makes use of the dbSNP item weight, a value
from 1 to 3, where lower weight means higher confidence.  Items will
be included if the item weight is the given value <weight> or less.
The default value of -w is 1, which is the criterion UCSC uses to
build its ambiguous version of the genome.  To allow all item weights,
specify "-w 3".

SNPs have various types of exceptions.  Starting with snp132, these
exceptions are included in the snp file itself.  By default, dbsnp_iit
will exclude all types of exceptions.  However, if you want to include
some types of exceptions, you will need to modify the dbsnp_iit
program (written in Perl) to indicate a "1" for the type of exception
you want to include.

An alternative source for SNP data is from Ensembl, such as

    ftp://ftp.ensembl.org/pub/release-101/variation/gvf/homo_sapiens/homo_sapiens-chr*.gvf.gz

which can be handled by the gvf_iit program in this package.  However,
there appear to be too many variants to be useful for the
SNP-tolerance feature of GSNAP.  If someone knows how to limit the
SNPs from an Ensembl source, please let me know.

Another source of dbSNP in 2020 is located at:

    ftp://ftp.ncbi.nih.gov/snp/redesign/latest_release/VCF/GCF_000001405.38.gz (for GRCh38)
    ftp://ftp.ncbi.nih.gov/snp/redesign/latest_release/VCF/GCF_000001405.25.gz (for GRCh37)

and can also be processed with the vcf_iit program, like this:

    gunzip -c <file> | vcf_iit -C [-v <version>] [-n <conversion file>] > <snpfile>.txt

The -C flag indicates that only those snp variants labeled as "COMMON"
will be used, which is important as explained previously.  Although
this version of dbSNP is 154, it contains multiple historical versions
of dbSNP, so if you want a particular version, such as 150, you would
use the flag "-v 150".  The VCF file from NCBI is labeled with Refseq
accession numbers like "NC_000001.11" instead of "chr1".  To make the
appropriate name conversion, you should create another conversion file
from the NCBI accession_report.txt file, like this:

    grep -v "^#" GCA_000001405.28_GRCh38.p13_assembly_report.txt | cut -f 7,10 >> refseq-to-chr.txt

and then provide the flag "-n refseq-to-chr.txt" to vcf_iit.

Once you have built a <snpfile>.txt file using dbsnp_iit, gvf_iit, or
vcf_iit, you create an IIT file by doing this
                                                                                                
    cat <snpfile>.txt | iit_store -o <snpfile>
                                                                                                
which creates <snpfile>.iit.  If you wish to store this file for other
people to use, you can then put it in a central place:

    cp <snpfile>.iit /path/to/gmapdb/<genome>/<genome>.maps

Or you can keep it in your own directory for your own use.  Then you
need to create a reference space index and compressed genome by doing

    snpindex -d <genome> [-V <snpdirectory>] -v <snpfile> <snpfile>.iit

If you specify the [-V <snpdirectory>] option, then the resulting SNP
index files are placed in the given directory.  Otherwise, if you
don't specify this, then the SNP files are saved with the reference
index files.

If your <snpfile>.iit file is already in a <genome>.maps subdirectory,
then you can simply run

    snpindex -d <genome> [-V <snpdirectory>] -v <snpfile>

and the program will find the indicated <snpfile>.iit file.
Additional options for snpindex can be found by doing "snpindex
--help".


Finally, you can perform SNP-tolerant alignment by doing

    gsnap -d <genome> [-V <snpdirectory>] -v <snpfile> <shortreads>

You can also retrieve snp information for genomic segments by running
get-genome with the -v and -f flags.  For more details, run
"get-genome --help".


12.  Alignment of reads from bisulfite-treated DNA in GSNAP
===========================================================

GSNAP has the ability to align reads from bisulfite-treated DNA, which
converts unmethylated cytosines to uracils that appear as thymines in
reads.  GSNAP is able to identify genomic-T to read-C mismatches, and
produces output like this:

>CTACGTcGTAGACATATTGATTCAGAATTTGAAGTTTAGCTAGATCTTAc     1       Read
 C.ACG.tGTAGACATA.TGATTCAGAAT.TGAAGTTTAGCTAGA.C.TAg     1..50   sub:2   +9:1000000..1000049
 
As with all GSNAP output, the first line is the query, and the ones
afterward represent genomic segments.  The "." symbol indicates an
unmethylated C in the genome that appears as a T in the read.
Mismatches are shown by lower case characters in the genomic segment.
In position 6, we see an example of a genomic-T that appears as a
read-C, representing a mismatch.  The last position also shows a
mismatch between a genomic-G and read-C.

To process reads from bisulfite-treated DNA, you will first need to
create the necessary index files by doing

    cmetindex -d <genome>

Then, you can align the reads by doing

    gsnap -d <genome> --mode=cmet-stranded, or
    gsnap -d <genome> --mode=cmet-nonstranded

on your set of short reads.  The stranded flag works on data where
the laboratory allows only reads that go from 5' to 3' on each strand
of the genome, and excludes their reverse complement reads that would
be produced by PCR amplification.  The resulting reads should have a
preponderance of C's converted to T's.

The non-stranded flag is for the laboratory protocol that allows both
the 5'-to-3' genomic reads on each strand, as well as their reverse
complements.  The reverse complemented reads will have the C-to-T
conversion show up as a G-to-A change.  The resulting reads should
have a preponderance of T's in half the reads, and a preponderance of
A's in half the reads.



13.  RNA-editing tolerance in GSNAP
===================================

Just as GSNAP has a program cmetindex and a mode called "cmet" for
tolerance to C-to-T changes, it can be tolerant to A-to-G changes
using the program atoiindex and a mode called "atoi".  This mode is
designed to facilitate alignments that are tolerant to RNA editing
where A's are converted to I's, which appear as G's to a sequencer.

To process reads under RNA-editing tolerance, you will first need to
create the necessary index files by doing

    atoiindex -d <genome>

Then, you can align the reads by doing

    gsnap -d <genome> --mode=atoi-stranded, or 
    gsnap -d <genome> --mode=atoi-nonstranded

on your set of short reads.  As with bisulfite sequencing, the
stranded flag is for laboratory protocols that allow only the 5'-to-3'
RNA, or sense, reads, and the non-stranded flag is for laboratory
protocols that allow both sense and antisense reads.


14.  Transcriptome-guided genomic alignment
===========================================

Transcriptome-guided genomic alignment (TGGA) is a feature in GSNAP
that utilizes both a genome and a transcriptome.  In this alignment
mode, GSNAP aligns reads first against the set of transcripts.  This
alignment is generally gap-free, since splicing is not an issue.
Indels can still be present, but GSNAP knows how to identify them in
transcripts.  The advantage of aligning against a transcript is that
GSNAP can find seeds anywhere in the transcript, even if they span an
exon-exon junction.  In contrast, with standard genomic alignment, if
the end of a read is near a splice junction, there may not be enough
nucleotides on the distal end of the splice to find a seed.
Therefore, in such cases, GSNAP would have soft-clip the end of the
alignment.

In TGGA, GSNAP converts alignments to transcripts to alignments in the
genome, using information about how the entire transcript maps to the
genome.  The ends of reads can be aligned more accurately, even if
only one or two nucleotides are on the distal end of a splice.  In
addition, gap-free alignment to transcripts is fast, so TGGA is many
times faster than regular genomic alignment.  A final advantage of
TGGA is that GSNAP reports the transcripts that are consistent with
the read, which can be useful for downstream analyses that measure
gene expression by transcript.  Therefore, TGGA serves a function
similar to that of alignment-free RNA-seq tools like Kallisto or
Salmon, but also provides the genomic alignment needed for the
analysis of splicing, mutations, and polymorphisms.

To perform TGGA, in addition to the genome in FASTA format (suppose it
is called genome.fasta), you will need either a set of transcripts in
FASTA format (transcripts.fasta) or a set of transcript mappings to
the genome (transcripts.map).  The map format can be obtained from a
GFF3 or GTF file using the programs gff3_genes or gtf_genes included
with this software package, as follows:

    gff3_genes <gff3_file> > transcripts.map
    gtf_genes <gtf_file> > transcripts.map

The map file can also be generated by aligning the transcript FASTA
format against the genome using "gmap -f map_exons", but if you are
starting with a FASTA file, it is easier to let gmap_build perform
this process for you.

To build the genome and transcriptome index files, there are two
cases:

Case 1: The genome index has not yet been built.  In this case, you
can build both the genome and transcriptome indices with one command,
either

    gmap_build [-D <directory>] -d <genome_name> -c <transcriptome_name> -T transcripts.fasta genome.fasta, or

    gmap_build [-D <directory>] -d <genome_name> -c <transcriptome_name> -G transcripts.map genome.fasta

depending on whether you have a FASTA or map file.

Case 2: The genome index has already been built (as described in
section 3 above).  In this case, you do not need to build the genome
index again, just the transcriptome index, so you should omit the
genome FASTA file:

    gmap_build [-D <directory>] -d <genome_name> -c <transcriptome_name> -T transcripts.fasta, or

    gmap_build [-D <directory>] -d <genome_name> -c <transcriptome_name> -G transcripts.map

The result of these procedures will be to create index files in one
directory for the genome, and in another directory for the
transcriptome.  In addition, there will be a subdirectory
<genome_name>.transcripts with files <transcriptome_name>.*, which
contain information about how transcripts map to the genome.

After these indices are built, you can perform TGGA by doing

    gsnap [-D <directory>] -d <genome_name> -c <transcriptome_name> -N 1 -A sam <fastq files...>

In other words, the only addition is the flag "-c
<transcriptome_name>".  The "-N 1" flag is actually optional, because
the fact that you have a transcriptome tells GSNAP that your input
files are RNA-seq.