@code{(require 'xml-parse)} or @code{(require 'ssax)}

@noindent
The XML standard document referred to in this module is@*
@url{http://www.w3.org/TR/1998/REC-xml-19980210.html}.

@noindent
The present framework fully supports the XML Namespaces
Recommendation@*
@url{http://www.w3.org/TR/REC-xml-names}.

@subsection String Glue


@defun ssax:reverse-collect-str list-of-frags


Given the list of fragments (some of which are text strings),
reverse the list and concatenate adjacent text strings.  If
LIST-OF-FRAGS has zero or one element, the result of the procedure
is @code{equal?} to its argument.
@end defun
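
@noindent
As an illustration of the behavior described above, the following
sketch applies @code{ssax:reverse-collect-str} to a hand-built fragment
list.  The list contents are made up for this example.

@example
(require 'xml-parse)

;; Fragments are accumulated in reverse document order, so the
;; strings "c" and "b" are adjacent text that should be merged.
(ssax:reverse-collect-str '("c" "b" (x) "a"))
;; expected, per the description above: ("a" (x) "bc")
@end example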


@defun ssax:reverse-collect-str-drop-ws list-of-frags


Given the list of fragments (some of which are text strings),
reverse the list and concatenate adjacent text strings while
dropping "insignificant" whitespace, that is, whitespace in front
of, behind, and between elements.  The whitespace that is included in
character data is not affected.

Use this procedure to "intelligently" drop "insignificant"
whitespace in the parsed SXML.  If strict compliance with the
XML Recommendation regarding whitespace is desired, use the
@code{ssax:reverse-collect-str} procedure instead.
@end defun

@subsection Character and Token Functions

The following functions either skip, or build and return tokens,
according to inclusion or delimiting semantics.  The list of
characters to expect, include, or to break at may vary from one
invocation of a function to another.  This allows the functions to
easily parse even context-sensitive languages.

Exceptions are mentioned specifically.  The list of expected
characters (characters to skip until, or break-characters) may
include an EOF "character", which is coded as the symbol @code{*eof*}.

The input stream to parse is specified as a PORT, which is the last
argument.


@defun ssax:assert-current-char char-list string port


Reads a character from the @var{port} and looks it up in the
@var{char-list} of expected characters.  If the character read is
among the expected ones, it is returned.  Otherwise, the
procedure writes a message using @var{string} as a comment
and quits.
@end defun


@defun ssax:skip-while char-list port


Reads characters from the @var{port} and disregards them, as long as they
are mentioned in the @var{char-list}.  The first character (which may be EOF)
peeked from the stream that is @emph{not} a member of the @var{char-list} is
returned.
@end defun
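
@noindent
The following sketch shows that the character returned by
@code{ssax:skip-while} is only peeked, not consumed.  It assumes that
@code{call-with-input-string} is available through SLIB's
@code{string-port} feature; the input text is made up for this example.

@example
(require 'xml-parse)
(require 'string-port)   ; assumed source of call-with-input-string

(call-with-input-string "   abc"
  (lambda (port)
    ;; Discard leading spaces; the first non-space character is
    ;; returned and stays on the port, so read-char sees it again.
    (let ((c (ssax:skip-while '(#\space) port)))
      (list c (read-char port)))))
;; expected, per the description above: (#\a #\a)
@end example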


@defun ssax:init-buffer


Returns an initial buffer for @code{ssax:next-token*} procedures.
@code{ssax:init-buffer} may allocate a new buffer at each invocation.
@end defun


@defun ssax:next-token prefix-char-list break-char-list comment-string port


Skips any number of the prefix characters (members of the @var{prefix-char-list}), if
any, and reads the sequence of characters up to (but not including)
a break character, one of the @var{break-char-list}.

The string of characters thus read is returned.  The break character
is left on the input stream.  @var{break-char-list} may include the symbol @code{*eof*};
otherwise, EOF is fatal, generating an error message including a
specified @var{comment-string}.
@end defun
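
@noindent
A minimal sketch of the delimiting semantics, assuming
@code{call-with-input-string} from SLIB's @code{string-port} feature.
The input text and the comment string are made up for this example.

@example
(require 'xml-parse)
(require 'string-port)   ; assumed source of call-with-input-string

(call-with-input-string "   name=value"
  (lambda (port)
    ;; Skip leading spaces, then read up to (not including) #\=.
    (let ((token (ssax:next-token '(#\space) '(#\=)
                                  "reading a key" port)))
      ;; The break character is still on the port.
      (list token (peek-char port)))))
;; expected, per the description above: ("name" #\=)
@end example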

@noindent
@code{ssax:next-token-of} is similar to @code{ssax:next-token}
except that it implements inclusion rather than delimiting
semantics.


@defun ssax:next-token-of inc-charset port


Reads characters from the @var{port} that belong to the list of characters
@var{inc-charset}.  The reading stops at the first character which is not a member
of the set.  This character is left on the stream.  All the read
characters are returned in a string.


@end defun
@defun ssax:next-token-of pred port

Reads characters from the @var{port} for which @var{pred} (a procedure of
one argument) returns non-#f.  The reading stops at the first
character for which @var{pred} returns #f.  That character is left
on the stream.  All the results of evaluating @var{pred}, up to the
first #f, are returned in a string.

@var{pred} is a procedure that takes one argument (a character or
the EOF object) and returns a character or #f.  The returned
character does not have to be the same as the input argument to the
@var{pred}.  For example,

@example
(ssax:next-token-of (lambda (c)
                      (cond ((eof-object? c) #f)
                            ((char-alphabetic? c) (char-downcase c))
                            (else #f)))
                    (current-input-port))
@end example

will try to read an alphabetic token from the current input port,
and return it in lower case.
@end defun


@defun ssax:read-string len port


Reads @var{len} characters from the @var{port}, and returns them in a string.  If
EOF is encountered before @var{len} characters are read, a shorter string
will be returned.
@end defun
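
@noindent
The short-read behavior at EOF can be sketched as follows, assuming
@code{call-with-input-string} from SLIB's @code{string-port} feature.
The names @code{head} and @code{tail} are local to this example.

@example
(require 'xml-parse)
(require 'string-port)   ; assumed source of call-with-input-string

(call-with-input-string "abcdef"
  (lambda (port)
    (let* ((head (ssax:read-string 3 port))
           ;; Only three characters remain, so asking for ten
           ;; yields a shorter string.
           (tail (ssax:read-string 10 port)))
      (list head tail))))
;; expected, per the description above: ("abc" "def")
@end example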

@subsection Data Types

@table @code

@item TAG-KIND

A symbol @samp{START}, @samp{END}, @samp{PI}, @samp{DECL},
@samp{COMMENT}, @samp{CDSECT}, or @samp{ENTITY-REF} that identifies
a markup token.

@item UNRES-NAME

A name (called GI in the XML Recommendation) as given in an XML
document for a markup token: start-tag, PI target, attribute name.
If a GI is an NCName, UNRES-NAME is this NCName converted into a
Scheme symbol.  If a GI is a QName, @samp{UNRES-NAME} is a pair of
symbols: @code{(@var{PREFIX} . @var{LOCALPART})}.

@item RES-NAME

An expanded name, a resolved version of an @samp{UNRES-NAME}.  For
an element or an attribute name with a non-empty namespace URI,
@samp{RES-NAME} is a pair of symbols,
@code{(@var{URI-SYMB} . @var{LOCALPART})}.
Otherwise, it's a single symbol.

@item ELEM-CONTENT-MODEL

A symbol:
@table @samp
@item ANY
anything goes, expect an END tag.
@item EMPTY-TAG
no content, and no END-tag is coming
@item EMPTY
no content, expect the END-tag as the next token
@item PCDATA
expect character data only, and no children elements
@item MIXED
@item ELEM-CONTENT
@end table

@item URI-SYMB

A symbol representing a namespace URI -- or another symbol chosen by
the user to represent the URI.  In the former case, @code{URI-SYMB} is
created by %-quoting bad URI characters and converting the
resulting string into a symbol.

@item NAMESPACES

A list representing namespaces in effect.  An element of the list
has one of the following forms:

@table @code

@item (@var{prefix} @var{uri-symb} . @var{uri-symb}) or

@item (@var{prefix} @var{user-prefix} . @var{uri-symb})
@var{user-prefix} is a symbol chosen by the user to represent the URI.

@item (#f @var{user-prefix} . @var{uri-symb})
Specification of the user-chosen prefix and a URI-SYMBOL.

@item (*DEFAULT* @var{user-prefix} . @var{uri-symb})
Declaration of the default namespace.

@item (*DEFAULT* #f . #f)
Un-declaration of the default namespace.  This notation
represents overriding of the previous declaration.

@end table

A NAMESPACES list may contain several elements for the same @var{prefix}.
The one closest to the beginning of the list takes effect.

@item ATTLIST

An ordered collection of (@var{NAME} . @var{VALUE}) pairs, where
@var{NAME} is a RES-NAME or an UNRES-NAME.  The collection is an ADT.

@item STR-HANDLER

A procedure of three arguments: @var{string1} @var{string2}
@var{seed} returning a new @var{seed}.  The procedure is supposed to
handle a chunk of character data @var{string1} followed by a chunk
of character data @var{string2}.  @var{string2} is a short string,
often @samp{"\n"} and even @samp{""}.

@item ENTITIES
An assoc list of pairs:
@lisp
   (@var{named-entity-name} . @var{named-entity-body})
@end lisp

where @var{named-entity-name} is a symbol under which the entity was
declared, @var{named-entity-body} is either a string, or (for an
external entity) a thunk that will return an input port (from which
the entity can be read).  @var{named-entity-body} may also be #f.
This is an indication that a @var{named-entity-name} is currently
being expanded.  A reference to this @var{named-entity-name} will be
an error: violation of the WFC nonrecursion.

@item XML-TOKEN

This record represents markup, which, according to the XML
Recommendation, "takes the form of start-tags, end-tags,
empty-element tags, entity references, character references,
comments, CDATA section delimiters, document type declarations, and
processing instructions."

@table @asis
@item kind
a TAG-KIND
@item head
an UNRES-NAME.  For XML-TOKENs of kinds 'COMMENT and 'CDSECT, the
head is #f.
@end table

For example,
@example
<P>                   => kind=START,      head=P
</P>                  => kind=END,        head=P
<BR/>                 => kind=EMPTY-EL,   head=BR
<!DOCTYPE OMF ...>    => kind=DECL,       head=DOCTYPE
<?xml version="1.0"?> => kind=PI,         head=xml
&my-ent;              => kind=ENTITY-REF, head=my-ent
@end example

Character references are not represented by xml-tokens as these
references are transparently resolved into the corresponding
characters.

@item XML-DECL

The record represents a datatype of an XML document: the list of
declared elements and their attributes, declared notations, list of
replacement strings or loading procedures for parsed general
entities, etc.  Normally an XML-DECL record is created from a DTD or
an XML Schema, although it can be created and filled in in many
other ways (e.g., loaded from a file).

@table @var
@item elems
an (assoc) list of decl-elem or #f.  The latter instructs
the parser to do no validation of elements and attributes.

@item decl-elem
declaration of one element:

@code{(@var{elem-name} @var{elem-content} @var{decl-attrs})}

@var{elem-name} is an UNRES-NAME for the element.

@var{elem-content} is an ELEM-CONTENT-MODEL.

@var{decl-attrs} is an @code{ATTLIST}, of
@code{(@var{attr-name} . @var{value})} associations.

This element can declare a user procedure to handle parsing of an
element (e.g., to do a custom validation, or to build a hash of IDs
as they're encountered).

@item decl-attr
an element of an @code{ATTLIST}, declaration of one attribute:

@code{(@var{attr-name} @var{content-type} @var{use-type} @var{default-value})}

@var{attr-name} is an UNRES-NAME for the declared attribute.

@var{content-type} is a symbol: @code{CDATA}, @code{NMTOKEN},
@code{NMTOKENS}, @dots{} or a list of strings for the enumerated
type.

@var{use-type} is a symbol: @code{REQUIRED}, @code{IMPLIED}, or
@code{FIXED}.

@var{default-value} is a string for the default value, or #f if not
given.

@end table

@end table

@subsection Low-Level Parsers and Scanners

@noindent
These procedures deal with primitive lexical units (Names,
whitespaces, tags) and with pieces of more generic productions.
Most of these parsers must be called in appropriate context.  For
example, @code{ssax:complete-start-tag} must be called only when the
start-tag has been detected and its GI has been read.


@defun ssax:skip-s port


Skip the S (whitespace) production as defined by
@example
[3] S ::= (#x20 | #x09 | #x0D | #x0A)+
@end example

@code{ssax:skip-s} returns the first non-whitespace character it encounters while
scanning the @var{port}.  This character is left on the input stream.
@end defun


@defun ssax:read-ncname port


Read an NCName starting from the current position in the @var{port} and
return it as a symbol.

@example
[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':'
                 | CombiningChar | Extender
[5] Name ::= (Letter | '_' | ':') (NameChar)*
@end example

This code supports the XML Namespace Recommendation REC-xml-names,
which modifies the above productions as follows:

@example
[4] NCNameChar ::= Letter | Digit | '.' | '-' | '_'
                      | CombiningChar | Extender
[5] NCName ::= (Letter | '_') (NCNameChar)*
@end example

As the Rec-xml-names says,

@quotation
"An XML document conforms to this specification if all other tokens
[other than element types and attribute names] in the document which
are required, for XML conformance, to match the XML production for
Name, match this specification's production for NCName."
@end quotation

Element types and attribute names must match the production QName,
defined below.
@end defun


@defun ssax:read-qname port


Read a (namespace-) Qualified Name, QName, from the current position
in @var{port}; and return an UNRES-NAME.

From REC-xml-names:
@example
[6] QName ::= (Prefix ':')? LocalPart
[7] Prefix ::= NCName
[8] LocalPart ::= NCName
@end example
@end defun
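
@noindent
The two shapes of an UNRES-NAME can be sketched as follows, assuming
@code{call-with-input-string} from SLIB's @code{string-port} feature.
The element names are made up for this example.

@example
(require 'xml-parse)
(require 'string-port)   ; assumed source of call-with-input-string

;; A prefixed name yields a (PREFIX . LOCALPART) pair.
(call-with-input-string "xsl:template match"
  (lambda (port)
    (ssax:read-qname port)))
;; expected, per the description above: (xsl . template)

;; An unprefixed name yields a single symbol.
(call-with-input-string "para>"
  (lambda (port)
    (ssax:read-qname port)))
;; expected, per the description above: para
@end example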


@defun ssax:read-markup-token port


This procedure starts parsing of a markup token.  The current
position in the stream must be @samp{<}.  This procedure scans
enough of the input stream to figure out what kind of a markup token
it is seeing.  The procedure returns an XML-TOKEN structure
describing the token.  Note, generally reading of the current markup
is not finished!  In particular, no attributes of the start-tag
token are scanned.

Here's a detailed breakdown of the return values and the position in
the @var{port} when each particular value is returned:

@table @asis

@item PI-token

Only the PI-target is read.  To finish the Processing-Instruction and
disregard it, call @code{ssax:skip-pi}.  @code{ssax:read-attributes}
may be useful as well (for PIs whose content is attribute-value
pairs).

@item END-token

The end tag is read completely; the current position is right after
the terminating @samp{>} character.

@item COMMENT

The comment is read and skipped completely.  The current position is
right after @samp{-->} that terminates the comment.

@item CDSECT

The current position is right after @samp{<![CDATA[}.  Use
@code{ssax:read-cdata-body} to read the rest.

@item DECL

We have read the keyword (the one that follows @samp{<!})
identifying this declaration markup.  The current position is after
the keyword (usually a whitespace character).

@item START-token

We have read the keyword (GI) of this start tag.  No attributes are
scanned yet.  We don't know if this tag has an empty content either.
Use @code{ssax:complete-start-tag} to finish parsing of the token.

@end table
@end defun


@defun ssax:skip-pi port


The current position is inside a PI.  Skip to the end of the PI.
@end defun


@defun ssax:read-pi-body-as-string port


The current position is right after reading the PITarget.  We read
the body of the PI and return it as a string.  The port will point to
the character right after the @samp{?>} combination that terminates the PI.

@example
[16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
@end example
@end defun


@defun ssax:skip-internal-dtd port


The current position in the port is inside an internal DTD subset (e.g.,
after reading @samp{#\[} that begins an internal DTD subset).  Skip
until the @samp{]>} combination that terminates this DTD.
@end defun


@defun ssax:read-cdata-body port str-handler seed


This procedure must be called after we have read a string
@samp{<![CDATA[} that begins a CDATA section.  The current position
must be the first position of the CDATA body.  This function reads
@emph{lines} of the CDATA body and passes them to a @var{str-handler}, a character
data consumer.

@var{str-handler} is a procedure taking arguments: @var{string1}, @var{string2},
and @var{seed}.  The first @var{string1} argument to @var{str-handler} never
contains a newline; the second @var{string2} argument often will.
On the first invocation of @var{str-handler}, @var{seed} is the one passed to @code{ssax:read-cdata-body} as the
third argument.  The result of this first invocation will be passed
as the @var{seed} argument to the second invocation of the line
consumer, and so on.  The result of the last invocation of the @var{str-handler} is
returned by the @code{ssax:read-cdata-body}.  Note a similarity to the fundamental @dfn{fold}
@cindex fold
iterator.

Within a CDATA section all characters are taken at their face value,
with three exceptions:
@itemize @bullet
@item
CR, LF, and CRLF are treated as line delimiters, and passed
as a single @samp{#\newline} to @var{str-handler}

@item
@samp{]]>} combination is the end of the CDATA section.
@samp{&gt;} is treated as an embedded @samp{>} character.

@item
@samp{&lt;} and @samp{&amp;} are not specially recognized (and are
not expanded)!

@end itemize
@end defun
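
@noindent
The fold-like use of @var{str-handler} can be sketched as follows.  The
handler here simply appends both chunks to a string seed, which is one
possible choice rather than a required one.  The sketch assumes
@code{call-with-input-string} from SLIB's @code{string-port} feature,
and the CDATA text is made up for this example.

@example
(require 'xml-parse)
(require 'string-port)   ; assumed source of call-with-input-string

;; The port is positioned at the first character of the CDATA body.
(call-with-input-string "if (a < b) x = y;]]>tail"
  (lambda (port)
    (ssax:read-cdata-body
     port
     ;; STR-HANDLER: append both chunks to the accumulated seed.
     (lambda (string1 string2 seed)
       (string-append seed string1 string2))
     "")))
;; expected, per the description above: "if (a < b) x = y;"
@end example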


@defun ssax:read-char-ref port


@example
[66]  CharRef ::=  '&#' [0-9]+ ';'
                 | '&#x' [0-9a-fA-F]+ ';'
@end example

This procedure must be called after we have read @samp{&#} that
introduces a char reference.  The procedure reads this reference and
returns the corresponding char.  The current position in PORT will
be after the @samp{;} that terminates the char reference.

Faults detected:@*
WFC: XML-Spec.html#wf-Legalchar

According to Section @cite{4.1 Character and Entity References}
of the XML Recommendation:

@quotation
"[Definition: A character reference refers to a specific character
in the ISO/IEC 10646 character set, for example one not directly
accessible from available input devices.]"
@end quotation

@c Therefore, we use a @code{ucscode->char} function to convert a
@c character code into the character -- *regardless* of the current
@c character encoding of the input stream.
@end defun
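
@noindent
A minimal sketch of reading a decimal character reference, assuming
@code{call-with-input-string} from SLIB's @code{string-port} feature
and an implementation whose character codes agree with ASCII.  The
input text is made up for this example.

@example
(require 'xml-parse)
(require 'string-port)   ; assumed source of call-with-input-string

;; The port is positioned just after the "&#" of "&#65;rest".
(call-with-input-string "65;rest"
  (lambda (port)
    (let ((c (ssax:read-char-ref port)))
      ;; The terminating #\; has been consumed as well.
      (list c (read-char port)))))
;; expected under the assumptions above: (#\A #\r)
@end example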


@defun ssax:handle-parsed-entity port name entities content-handler str-handler seed


Expands and handles a parsed-entity reference.

@var{name} is a symbol, the name of the parsed entity to expand.
@c entities - see ENTITIES
@var{content-handler} is a procedure of arguments @var{port}, @var{entities}, and
@var{seed} that returns a seed.
@var{str-handler} is called if the entity in question is a pre-declared entity.

@code{ssax:handle-parsed-entity} returns the result returned by @var{content-handler} or @var{str-handler}.

Faults detected:@*
WFC: XML-Spec.html#wf-entdeclared@*
WFC: XML-Spec.html#norecursion
@end defun


@defun attlist-add attlist name-value


Add a @var{name-value} pair to the existing @var{attlist}, preserving its sorted ascending
order; and return the new list.  Return #f if a pair with the same
name already exists in @var{attlist}.
@end defun


@defun attlist-remove-top attlist


Given a non-null @var{attlist}, return a pair of values: the top and the rest.
@end defun


@defun ssax:read-attributes port entities


This procedure reads and parses a production @dfn{Attribute}.
@cindex Attribute

@example
[41] Attribute ::= Name Eq AttValue
[10] AttValue ::=  '"' ([^<&"] | Reference)* '"'
                | "'" ([^<&'] | Reference)* "'"
[25] Eq ::= S? '=' S?
@end example

The procedure returns an ATTLIST, of Name (as UNRES-NAME), Value (as
string) pairs.  On return, the current character on the @var{port} is a
non-whitespace character that is not an NCName-starting character.

Note the following rules to keep in mind when reading an
@dfn{AttValue}:
@cindex AttValue
@quotation
Before the value of an attribute is passed to the application or
checked for validity, the XML processor must normalize it as
follows:

@itemize @bullet
@item
A character reference is processed by appending the referenced
character to the attribute value.

@item
An entity reference is processed by recursively processing the
replacement text of the entity.  The named entities @samp{amp},
@samp{lt}, @samp{gt}, @samp{quot}, and @samp{apos} are pre-declared.

@item
A whitespace character (#x20, #x0D, #x0A, #x09) is processed by
appending #x20 to the normalized value, except that only a single
#x20 is appended for a "#x0D#x0A" sequence that is part of an
external parsed entity or the literal entity value of an internal
parsed entity.

@item
Other characters are processed by appending them to the normalized
value.

@end itemize

@end quotation

Faults detected:@*
WFC: XML-Spec.html#CleanAttrVals@*
WFC: XML-Spec.html#uniqattspec
@end defun


@defun ssax:resolve-name port unres-name namespaces apply-default-ns?


Convert an @var{unres-name} to a RES-NAME, given the appropriate @var{namespaces} declarations.
The last parameter, @var{apply-default-ns?}, determines if the default namespace applies
(for instance, it does not for attribute names).

Per REC-xml-names/#nsc-NSDeclared, the "xml" prefix is considered
pre-declared and bound to the namespace name
"http://www.w3.org/XML/1998/namespace".

@code{ssax:resolve-name} tests for the namespace constraints:@*
@url{http://www.w3.org/TR/REC-xml-names/#nsc-NSDeclared}
@end defun


@defun ssax:complete-start-tag tag port elems entities namespaces


Complete parsing of a start-tag markup.  @code{ssax:complete-start-tag} must be called after the
start tag token has been read.  @var{tag} is an UNRES-NAME.  @var{elems} is an
instance of the ELEMS slot of XML-DECL; it can be #f to tell the
function to do @emph{no} validation of elements and their
attributes.

@code{ssax:complete-start-tag} returns several values:
@itemize @bullet
@item ELEM-GI:
a RES-NAME.
@item ATTRIBUTES:
element's attributes, an ATTLIST of (RES-NAME . STRING) pairs.
The list does NOT include xmlns attributes.
@item NAMESPACES:
the input list of namespaces amended with namespace
(re-)declarations contained within the start-tag under parsing
@item ELEM-CONTENT-MODEL
@end itemize

On exit, the current position in @var{port} will be the first character
after @samp{>} that terminates the start-tag markup.

Faults detected:@*
VC: XML-Spec.html#enum@*
VC: XML-Spec.html#RequiredAttr@*
VC: XML-Spec.html#FixedAttr@*
VC: XML-Spec.html#ValueType@*
WFC: XML-Spec.html#uniqattspec (after namespaces prefixes are resolved)@*
VC: XML-Spec.html#elementvalid@*
WFC: REC-xml-names/#dt-NSName

@emph{Note}: although the XML Recommendation does not explicitly say it,
xmlns and xmlns: attributes don't have to be declared (although they
can be declared, to specify their default value).
@end defun


@defun ssax:read-external-id port


Parses an ExternalID production:

@example
[75] ExternalID ::= 'SYSTEM' S SystemLiteral
                  | 'PUBLIC' S PubidLiteral S SystemLiteral
[11] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
[12] PubidLiteral ::=  '"' PubidChar* '"'
                     | "'" (PubidChar - "'")* "'"
[13] PubidChar ::=  #x20 | #x0D | #x0A | [a-zA-Z0-9]
                         | [-'()+,./:=?;!*#@@$_%]
@end example

Call @code{ssax:read-external-id} when an ExternalID is expected; that is, the current
character must be either #\S or #\P, which starts a SYSTEM or PUBLIC
token, respectively.  @code{ssax:read-external-id} returns the @var{SystemLiteral} as a
string.  A @var{PubidLiteral} is disregarded if present.
@end defun
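
@noindent
Both forms of the ExternalID production can be sketched as follows,
assuming @code{call-with-input-string} from SLIB's @code{string-port}
feature.  The identifiers in the input strings are made up for this
example.

@example
(require 'xml-parse)
(require 'string-port)   ; assumed source of call-with-input-string

(call-with-input-string "SYSTEM 'chapter.dtd'>"
  (lambda (port)
    (ssax:read-external-id port)))
;; expected, per the description above: "chapter.dtd"

;; With PUBLIC, the PubidLiteral is skipped and only the
;; SystemLiteral is returned.
(call-with-input-string "PUBLIC '-//Example//DTD//EN' 'chapter.dtd'>"
  (lambda (port)
    (ssax:read-external-id port)))
;; expected, per the description above: "chapter.dtd"
@end example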

@subsection Mid-Level Parsers and Scanners

@noindent
These procedures parse productions corresponding to the whole
(document) entity or its higher-level pieces (prolog, root element,
etc).


@defun ssax:scan-misc port


Scan the Misc production in the context:

@example
[1]  document ::=  prolog element Misc*
[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
[27] Misc ::= Comment | PI |  S
@end example

Call @code{ssax:scan-misc} in the prolog or epilog contexts.  In these contexts,
whitespaces are completely ignored.  The return value from @code{ssax:scan-misc} is
either a PI-token, a DECL-token, a START token, or *EOF*.  Comments
are ignored and not reported.
@end defun


@defun ssax:read-char-data port expect-eof? str-handler iseed


Read the character content of an XML document or an XML element.

@example
[43] content ::=
(element | CharData | Reference | CDSect | PI | Comment)*
@end example

To be more precise, @code{ssax:read-char-data} reads CharData, expands CDSect and character
entities, and skips comments.  @code{ssax:read-char-data} stops at a named reference, EOF,
at the beginning of a PI, or a start/end tag.

@var{expect-eof?} is a boolean indicating if EOF is normal; i.e., the character
data may be terminated by the EOF.  EOF is normal while processing a
parsed entity.

@var{iseed} is an argument passed to the first invocation of @var{str-handler}.

@code{ssax:read-char-data} returns two results: @var{seed} and @var{token}.  The @var{seed}
is the result of the last invocation of @var{str-handler}, or the original @var{iseed} if @var{str-handler}
was never called.

@var{token} can be either an eof-object (this can happen only if @var{expect-eof?}
was #t), or:
@itemize @bullet

@item
an xml-token describing a START tag or an END-tag;
For a start token, the caller has to finish reading it.

@item
an xml-token describing the beginning of a PI.  It's up to an
application to read or skip through the rest of this PI;

@item
an xml-token describing a named entity reference.

@end itemize

CDATA sections and character references are expanded inline and
never returned.  Comments are silently disregarded.

As the XML Recommendation requires, all whitespace in character data
must be preserved.  However, a CR character (#x0D) must be
disregarded if it appears before a LF character (#x0A), or replaced
by a #x0A character otherwise.  See Secs. 2.10 and 2.11 of the XML
Recommendation.  See also the canonical XML Recommendation.
@end defun


@defun ssax:assert-token token kind gi error-cont


Make sure that @var{token} is of the anticipated @var{kind} and has the anticipated @var{gi}.  Note
that the @var{gi} argument may actually be a pair of two symbols:
the Namespace-URI or the prefix, and the localname.  If the assertion
fails, @var{error-cont} is called with three arguments: @var{token}, @var{kind}, and @var{gi}.  The
result of @var{error-cont} is returned.
@end defun

@subsection High-level Parsers

These procedures instantiate an SSAX parser.  A user can
instantiate the parser to do full validation, no validation,
or any particular validation.  The user specifies which PIs they want
to be notified about, and tells the parser what to do with the parsed
character and element data.  The latter handlers determine if the
parsing follows a SAX or a DOM model.


@defun ssax:make-pi-parser my-pi-handlers


Create a parser to parse and process one Processing Instruction (PI).

@var{my-pi-handlers} is an association list of pairs
@code{(@var{pi-tag} . @var{pi-handler})} where @var{pi-tag} is an
NCName symbol, the PI target; and @var{pi-handler} is a procedure
taking arguments @var{port}, @var{pi-tag}, and @var{seed}.

@var{pi-handler} should read the rest of the PI up to and including
the combination @samp{?>} that terminates the PI.  The handler
should return a new seed.  One of the @var{pi-tag}s may be the
symbol @code{*DEFAULT*}.  The corresponding handler will handle PIs
that no other handler will.  If the *DEFAULT* @var{pi-tag} is not
specified, @code{ssax:make-pi-parser} will assume the default handler that skips the body of
the PI.

@code{ssax:make-pi-parser} returns a procedure of arguments @var{port}, @var{pi-tag}, and
@var{seed}; that will parse the current PI according to @var{my-pi-handlers}.
@end defun
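
@noindent
A minimal sketch of a PI parser with only a @code{*DEFAULT*} handler.
The handler name @code{collect-pi}, the target @code{demo}, and the
input text are made up for this example, and
@code{call-with-input-string} is assumed to come from SLIB's
@code{string-port} feature.

@example
(require 'xml-parse)
(require 'string-port)   ; assumed source of call-with-input-string

;; Hypothetical handler: read the PI body through "?>" and push a
;; (target . body) pair onto the seed.
(define (collect-pi port pi-tag seed)
  (cons (cons pi-tag (ssax:read-pi-body-as-string port)) seed))

(define pi-parser
  (ssax:make-pi-parser (list (cons '*DEFAULT* collect-pi))))

;; The port is positioned right after the PI target "demo".
(call-with-input-string " one two?>rest"
  (lambda (port)
    (pi-parser port 'demo '())))
;; The result should be a one-element list pairing demo with the
;; PI body.
@end example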


@defun ssax:make-elem-parser my-new-level-seed my-finish-element my-char-data-handler my-pi-handlers


Create a parser to parse and process one element, including its
character content or children elements.  The parser is typically
applied to the root element of a document.

@table @asis

@item @var{my-new-level-seed}
is a procedure taking arguments:

@var{elem-gi} @var{attributes} @var{namespaces} @var{expected-content} @var{seed}

where @var{elem-gi} is a RES-NAME of the element about to be
processed.

@var{my-new-level-seed} is to generate the seed to be passed to handlers that process the
content of the element.

@item @var{my-finish-element}
is a procedure taking arguments:

@var{elem-gi} @var{attributes} @var{namespaces} @var{parent-seed} @var{seed}

@var{my-finish-element} is called when parsing of @var{elem-gi} is finished.
The @var{seed} is the result from the last content parser (or
from @var{my-new-level-seed} if the element has empty content).
@var{parent-seed} is the same seed as was passed to @var{my-new-level-seed}.
@var{my-finish-element} is to generate a seed that will be the result
of the element parser.

@item @var{my-char-data-handler}
is a STR-HANDLER as described in Data Types above.

@item @var{my-pi-handlers}
is as described for @code{ssax:make-pi-parser} above.

@end table

The generated parser is a procedure taking arguments:

@var{start-tag-head} @var{port} @var{elems} @var{entities} @var{namespaces} @var{preserve-ws?} @var{seed}

The procedure must be called after the start tag token has been
read.  @var{start-tag-head} is an UNRES-NAME from the start-element
tag.  @var{elems} is an instance of the ELEMS slot of XML-DECL.

Faults detected:@*
VC: XML-Spec.html#elementvalid@*
WFC: XML-Spec.html#GIMatch
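
The four arguments can be sketched as follows.  This is a minimal
illustration under stated assumptions, not a recommended design: the
name @code{text-collector} and the choice of seed (a reversed list of
text fragments) are hypothetical.  The example only constructs the
element parser; calling it still requires that the start tag has already
been read, as described above.

@example
(require 'xml-parse)

;; Hypothetical element parser that accumulates character data.
;; The seed is a list of strings in reverse document order.
(define text-collector
  (ssax:make-elem-parser
   ;; my-new-level-seed: children keep adding to the same list
   (lambda (elem-gi attributes namespaces expected-content seed)
     seed)
   ;; my-finish-element: hand the accumulated list to the parent
   (lambda (elem-gi attributes namespaces parent-seed seed)
     seed)
   ;; my-char-data-handler: a STR-HANDLER of string1 string2 seed
   (lambda (string1 string2 seed)
     (if (string=? string2 "")
         (cons string1 seed)
         (cons string2 (cons string1 seed))))
   ;; my-pi-handlers: empty, so every PI is skipped
   '()))
@end example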
@end defun


@defun ssax:make-parser user-handler-tag user-handler @dots{}


Create an XML parser, an instance of the XML parsing framework.
This will be a SAX, a DOM, or a specialized parser depending on the
supplied user-handlers.

@code{ssax:make-parser} takes an even number of arguments, given as tag and handler pairs: each @var{user-handler-tag} is a symbol identifying
the @var{user-handler} that follows it, which is a procedure (or an
association list in the case of @code{PROCESSING-INSTRUCTIONS}).  The tags and the signatures of
the corresponding procedures are given below.  Not all tags have to be specified;
reasonable defaults apply to those that are omitted.

@table @samp

@item DOCTYPE
handler-procedure: @var{port} @var{docname} @var{systemid} @var{internal-subset?} @var{seed}

If @var{internal-subset?} is @code{#t}, the current position in the port is
right after the @samp{[} that begins the internal DTD
subset.  The handler must finish reading this subset before it returns (or
must call @code{skip-internal-dtd} if it is not interested in
reading it).  On exit, @var{port} must be at the first symbol after
the whole DOCTYPE declaration.

The handler-procedure must generate four values:
@quotation
@var{elems} @var{entities} @var{namespaces} @var{seed}
@end quotation

@var{elems} is as defined for the ELEMS slot of XML-DECL.  It may be
@code{#f} to switch off validation.  @var{namespaces} will typically
contain @var{user-prefix}es for selected @var{uri-symb}s.  The
default handler-procedure skips the internal subset, if any, and
returns @code{(values #f '() '() seed)}.

@item UNDECL-ROOT
procedure: @var{elem-gi} @var{seed}

where @var{elem-gi} is an UNRES-NAME of the root element.  This
procedure is called when an XML document under parsing contains
@emph{no} DOCTYPE declaration.

The handler-procedure, like the DOCTYPE handler-procedure above,
must generate four values:
@quotation
@var{elems} @var{entities} @var{namespaces} @var{seed}
@end quotation

The default handler-procedure returns @code{(values #f '() '() seed)}.

@item DECL-ROOT
procedure: @var{elem-gi} @var{seed}

where @var{elem-gi} is an UNRES-NAME of the root element.  This
procedure is called when the XML document being parsed does contain
a DOCTYPE declaration.  The handler-procedure must generate a new
@var{seed} (and verify that the name of the root element matches the
doctype, if the handler so wishes).  The default handler-procedure
is the identity function.

@item NEW-LEVEL-SEED
procedure: see @var{my-new-level-seed} under @code{ssax:make-elem-parser}.

@item FINISH-ELEMENT
procedure: see @var{my-finish-element} under @code{ssax:make-elem-parser}.

@item CHAR-DATA-HANDLER
procedure: see @var{my-char-data-handler} under @code{ssax:make-elem-parser}.

@item PROCESSING-INSTRUCTIONS
An association list, as passed to @code{ssax:make-pi-parser}.
The default value is @code{'()}.

@end table

The generated parser is a procedure of arguments @var{port} and
@var{seed}.

This procedure parses the document prolog and then exits to an
element parser (created by @code{ssax:make-elem-parser}) to handle
the rest.

@example
[1]  document ::=  prolog element Misc*
[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
[27] Misc ::= Comment | PI |  S
[28] doctypedecl ::=  '<!DOCTYPE' S Name (S ExternalID)? S?
              ('[' (markupdecl | PEReference | S)* ']' S?)? '>'
[29] markupdecl ::= elementdecl | AttlistDecl
                     | EntityDecl
                     | NotationDecl | PI
                     | Comment
@end example
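
As a hedged illustration of the tag and handler pairs described above,
the following sketch builds a SAX-style parser that counts elements.
The name @code{count-elements}, the sample document, and the use of an
integer as the seed are assumptions made for this example;
@code{call-with-input-string} is assumed to come from SLIB's
@code{string-port} feature.

@example
(require 'xml-parse)
(require 'string-port)

;; Hypothetical parser: the seed is a running count of elements.
(define count-elements
  (ssax:make-parser
   'NEW-LEVEL-SEED
   ;; Pass the running count down to the element's content.
   (lambda (elem-gi attributes namespaces expected-content seed)
     seed)
   'FINISH-ELEMENT
   ;; seed is the count after the content; add this element.
   (lambda (elem-gi attributes namespaces parent-seed seed)
     (+ 1 seed))
   'CHAR-DATA-HANDLER
   ;; Character data does not affect the count.
   (lambda (string1 string2 seed)
     seed)))

;; The generated parser takes a port and an initial seed.
(call-with-input-string "<a><b/><c>text</c></a>"
  (lambda (port) (count-elements port 0)))
@end example

The DOCTYPE, UNDECL-ROOT, DECL-ROOT, and PROCESSING-INSTRUCTIONS tags
are omitted here, so their defaults apply.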
@end defun

@subsection Parsing XML to SXML


@defun ssax:xml->sxml port namespace-prefix-assig


This is an instance of the SSAX parser that returns an SXML
representation of the XML document read from @var{port}.  @var{namespace-prefix-assig} is a list
of pairs @code{(@var{user-prefix} . @var{uri-string})} that assigns
@var{user-prefix}es to the namespaces identified by the corresponding
@var{uri-string}s.  It may be an empty list.  @code{ssax:xml->sxml} returns an SXML
tree.  On return, the port is positioned at the first character after the root
element.
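
A minimal usage sketch, assuming that @code{call-with-input-string} is
provided by SLIB's @code{string-port} feature; the sample documents and
the prefix @code{xh} are hypothetical:

@example
(require 'xml-parse)
(require 'string-port)

;; Parse a small document with no namespace prefix assignments.
(call-with-input-string "<greeting lang=\"en\">Hello</greeting>"
  (lambda (port)
    (ssax:xml->sxml port '())))

;; Assign the user prefix xh to the XHTML namespace URI.
(call-with-input-string
    "<p xmlns=\"http://www.w3.org/1999/xhtml\">Hi</p>"
  (lambda (port)
    (ssax:xml->sxml port '((xh . "http://www.w3.org/1999/xhtml")))))
@end example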
@end defun