<!-- X-URL: http://www.cs.princeton.edu/~appel/modern/java/JLex/manual.html -->

<BR> <P>
<H1 ALIGN=CENTER>JLex:<BR> A lexical analyzer generator for Java<sup><small>(TM)</small></sup><BR> 
      </H1>
<P ALIGN=CENTER>
<STRONG>Elliot Berk<BR> 
Department of Computer Science, Princeton University
</STRONG></P>
<P ALIGN=CENTER>Version 1.2, May 5, 1997</P>
<P ALIGN=CENTER>Manual revision October 29, 1997</P>
<P ALIGN=CENTER>Last updated September 6, 2000 for JLex 1.2.5</P>
<P ALIGN=CENTER>(latest version can be obtained from
<A HREF="http://www.cs.princeton.edu/~appel/modern/java/JLex/">http://www.cs.princeton.edu/~appel/modern/java/JLex/</A> )</p>
<P>
<HR>
<P><H2><A NAME="SECTION00010000000000000000">Contents</A></H2>
<UL> 
<LI> <A NAME="tex2html54"
HREF="#SECTION1">1. Introduction</A>
<LI> <A NAME="tex2html55"
HREF="#SECTION2">2. JLex Specifications</A>
<UL> 
<LI> <A NAME="tex2html56"
HREF="#SECTION2.1">2.1 User Code</A>
<LI> <A NAME="tex2html57"
HREF="#SECTION2.2">2.2 JLex Directives</A>
<UL> 
<LI> <A NAME="tex2html58"
HREF="#SECTION2.2.1">2.2.1 Internal Code to Lexical Analyzer Class</A>
<LI> <A NAME="tex2html59"
HREF="#SECTION2.2.2">2.2.2 Initialization Code for Lexical Analyzer Class</A>
<LI> <A NAME="tex2html60"
HREF="#SECTION2.2.3">2.2.3 End-of-File Code for Lexical Analyzer Class</A>
<LI> <A NAME="tex2html61"
HREF="#SECTION2.2.4">2.2.4 Macro Definitions</A>
<LI> <A NAME="tex2html62"
HREF="#SECTION2.2.5">2.2.5 State Declarations</A>
<LI> <A NAME="tex2html63"
HREF="#SECTION2.2.6">2.2.6 Character Counting</A>
<LI> <A NAME="tex2html64"
HREF="#SECTION2.2.7">2.2.7 Line Counting</A>
<LI> <A NAME="tex2html65"
HREF="#SECTION2.2.8">2.2.8 Java CUP Compatibility </A>
<LI> <A NAME="tex2html66"
HREF="#SECTION2.2.9">2.2.9 Lexical Analyzer Component Titles</A>
<LI> <A NAME="tex2html67" 
HREF="#SECTION2.2.10">2.2.10 Default Token Type</A>
<LI> <A NAME="tex2html68"
HREF="#SECTION2.2.11">2.2.11 Default Token Type II: Wrapped Integer</A>
<LI> <A NAME="tex2html69"
HREF="#SECTION2.2.12">2.2.12 YYEOF on End-of-File</A>
<LI> <A NAME="tex2html70"
HREF="#SECTION2.2.13">2.2.13 Newlines and Operating System Compatibility</A>
<LI> <A NAME="tex2html71"
HREF="#SECTION2.2.14">2.2.14 Character Sets</A>
<LI> <A NAME="tex2html72"
HREF="#SECTION2.2.15">2.2.15 Character Format To and From File</A>
<LI> <A NAME="tex2html73"
HREF="#SECTION2.2.16">2.2.16 Exceptions Generated by Lexical Actions</A>
<LI> <A NAME="tex2html74"
HREF="#SECTION2.2.17">2.2.17 Specifying the Return Value on End-of-File</A>
<LI> <A NAME="tex2html74a"
HREF="#SECTION2.2.18">2.2.18 Specifying an interface to implement</A>
<LI> <A NAME="tex2html75"
HREF="#SECTION2.2.19">2.2.19 Making the Generated Class Public</A>
</UL> 
<LI> <A NAME="tex2html75"
HREF="#SECTION2.3">2.3 Regular Expression Rules</A>
<UL> 
<LI> <A NAME="tex2html76"
HREF="#SECTION2.3.1">2.3.1 Lexical States</A>
<LI> <A NAME="tex2html77"
HREF="#SECTION2.3.2">2.3.2 Regular Expressions</A>
<LI> <A NAME="tex2html78"
HREF="#SECTION2.3.3">2.3.3 Associated Actions</A>
<UL> 
<LI> <A NAME="tex2html79"
HREF="#SECTION2.3.3.1">2.3.3.1 Actions and Recursion:</A>
<LI> <A NAME="tex2html80"
HREF="#SECTION2.3.3.2">2.3.3.2 State Transitions:</A>
<LI> <A NAME="tex2html81"
HREF="#SECTION2.3.3.3">2.3.3.3 Available Lexical Values:</A>
</UL> 
</UL> 
</UL> 
<LI> <A NAME="tex2html82"
HREF="#SECTION3">3. Generated Lexical Analyzers</A>
<LI> <A NAME="tex2html83"
HREF="#SECTION4">4. Performance</A>
<LI> <A NAME="tex2html84"
HREF="#SECTION5">5. Implementation Issues</A>
<UL> 
<LI> <A NAME="tex2html85"
HREF="#SECTION5.1">5.1 Unimplemented Features</A>
<LI> <A NAME="tex2html86"
HREF="#SECTION5.2">5.2 Unicode vs Ascii</A>
<LI> <A NAME="tex2html87"
HREF="#SECTION5.3">5.3 Commas in State Lists</A>
<LI> <A NAME="tex2html88"
HREF="#SECTION5.4">5.4 Wish List of Unimplemented Features</A>
</UL> 
<LI> <A NAME="tex2html89"
HREF="#SECTION6">6. Credits and Copyrights</A>
<UL> 
<LI> <A NAME="tex2html90"
HREF="#SECTION6.1">6.1 Credits</A>
<LI> <A NAME="tex2html91"
HREF="#SECTION6.2">6.2 Copyright</A>
</UL> 
</UL>
<P>
<BR> <HR>
<BR> <P>
<H1><A NAME="SECTION1">1. Introduction</A></H1>
<P>
A lexical analyzer breaks an input stream of characters 
into tokens.  
Writing lexical analyzers by hand can be a tedious
process, so software tools have been developed to ease
this task.
<P>
Perhaps the best known such utility is Lex.  
Lex is a lexical analyzer generator for the UNIX 
operating system, targeted to the C programming language.
Lex takes a specially-formatted specification file
containing the details of a lexical analyzer. 
This tool then creates a C source file for the 
associated table-driven lexer.
<P>
The JLex utility is based upon the Lex lexical
analyzer generator model.  JLex takes a specification 
file similar to that accepted by Lex, then 
creates a Java source file for the corresponding lexical
analyzer.
<P>
<BR> <HR>
<BR> <P>
<H1><A NAME="SECTION2">2. JLex Specifications</A></H1>
<P>
A JLex input file is organized into three sections, 
separated by double-percent directives (``%%'').  
A proper JLex specification has the following format.<BR>
<I>user code</I><BR>
%%<BR>
<I>JLex directives</I><BR>
%%<BR>
<I>regular expression rules</I><BR>
The ``%%'' directives distinguish sections of the input
file and must be placed at the beginning of their line.
The remainder of the line containing the ``%%'' directives
may be discarded and should not be used to house
additional declarations or code.
<P>
The user code section - the first section of the specification
file - is copied directly into the resulting output file. 
This area of the specification provides space for the
implementation of utility classes or return types.
<P>
The JLex directives section is the second part of the input
file.  Here, macro definitions are given and state names
are declared.
<P>
The third section contains the rules of lexical analysis,
each of which consists of three parts: an optional state list,
a regular expression, and an action.
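<P>
As a rough illustration of the three-section layout, a very small specification might look like the sketch below. The macro name <I>DIGIT</I> and the return code 1 are assumptions made for this example only; the <I>%integer</I> directive and the catch-all rule are described later in this manual.

```
// user code: copied verbatim to the top of the generated file
import java.io.IOException;

%%

%integer
DIGIT = [0-9]

%%

{DIGIT}+      { return 1; }
[\ \t\r\n]+   { /* skip white space */ }
.             { System.out.println("Unmatched input: " + yytext()); }
```

Each ``%%'' sits alone at the start of its line, so nothing is lost if the rest of that line is discarded.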
<P>
<BR> <HR>
<BR> <P>
<H2><A NAME="SECTION2.1">2.1 User Code</A></H2>
<P>
User code precedes the first double-percent directive (``%%'').  
This code is copied verbatim into the lexical analyzer source
file that JLex outputs, at the top of the file.
Therefore, if the lexer source file needs to begin 
with a package declaration or with 
the importation of an external class,
the user code section should begin with 
the corresponding declaration.
This declaration will then be copied onto 
the top of the generated source file.
<P>
<BR> <HR>
<BR> <P>
<H2><A NAME="SECTION2.2">2.2 JLex Directives</A></H2>
<P>
The JLex directive section begins after the first ``%%''
and continues until the second ``%%'' delimiter. 
Each JLex directive should be contained on a single line
and should begin that line.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.1">2.2.1 Internal Code to Lexical Analyzer Class</A></H3>
<P>
The <I>%{...%}</I> directive allows the user to write
Java code to be copied into the lexical analyzer class.
This directive is used as follows.<BR>
<I>%{ </I><BR>
<I>&lt;code&gt; </I><BR>
<I>%} </I><BR>
To be properly recognized, the <I>%{ </I> and <I>%} </I> 
should each be situated at the beginning of a line.
The specified Java code in <I>&lt;code&gt;</I> will then be copied into
the lexical analyzer class created by JLex.<BR>
<I>class Yylex { </I><BR>
<I>... &lt;code&gt; ... </I><BR>
<I>} </I><BR>
This permits the declaration of variables and functions
internal to the generated lexical analyzer class.
Variable names beginning with <I>yy</I> should be
avoided, as these are reserved for use by the generated
lexical analyzer class.
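<P>
For instance, a counter and an accessor could be added to the generated class roughly as follows. The names <I>wordCount</I> and <I>getWordCount</I> are hypothetical and deliberately avoid the reserved <I>yy</I> prefix.

```
%{
  // Copied into the body of the generated lexical analyzer class.
  private int wordCount = 0;

  public int getWordCount() {
    return wordCount;
  }
%}
```

Lexical actions can then update <I>wordCount</I> directly, since they execute inside the same class.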
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.2">2.2.2 Initialization Code for Lexical Analyzer Class</A></H3>
<P>
The <I>%init{ ... %init}</I> directive allows the user to write
Java code to be copied into the constructor for the
lexical analyzer class.<BR>
<I>%init{ </I><BR>
<I>&lt;code&gt;</I><BR>
<I>%init} </I><BR>
The <I>%init{</I> and <I>%init}</I> directives
should be situated at the beginning of a line.
The specified Java code in <I>&lt;code&gt;</I> will then be copied into
the lexical analyzer class constructor.<BR>
<I>class Yylex { </I><BR>
<I>Yylex () { </I><BR>
<I>... &lt;code&gt; ... </I><BR>
<I>} </I><BR>
<I>} </I><BR>
This directive permits one-time initializations 
of the lexical analyzer class from inside its constructor.
Variable names beginning with <I>yy</I> should be
avoided, as these are reserved for use by the generated
lexical analyzer class.
<P>
The code given in the <I>%init{ ... %init}</I> directive
may potentially throw an exception, or propagate it from
another function.  To declare this exception, use
the <I>%initthrow{ ... %initthrow}</I> directive.<BR>
<I>%initthrow{ </I><BR>
<I>&lt;exception[1]&gt;</I>[<I>, &lt;exception[2]&gt;, ...</I>]<BR>
<I>%initthrow} </I><BR>
The Java code specified here will be copied 
into the declaration of the lexical analyzer 
constructor.<BR>
<I>Yylex () </I><BR>
<I>throws &lt;exception[1]&gt;</I>[<I>, &lt;exception[2]&gt;, ...</I>]<BR>
<I>{ </I><BR>
<I>... &lt;code&gt; ... </I><BR>
<I>} </I><BR>
If the Java code given in the <I>%init{ ... %init}</I> 
directive throws an exception that is not declared,
the resulting lexical analyzer source file may not compile
successfully.
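<P>
A sketch of how the two directives might be combined is given below. The field name <I>log</I> and the file name are assumptions chosen for illustration; the point is only that the checked exception raised by the initialization code is listed in <I>%initthrow{ ... %initthrow}</I>.

```
%{
  private java.io.PrintWriter log;
%}
%init{
  log = new java.io.PrintWriter(new java.io.FileWriter("lexer.log"));
%init}
%initthrow{
  java.io.IOException
%initthrow}
```

Without the <I>%initthrow</I> block, the Java compiler would be expected to reject the generated constructor, because <I>java.io.FileWriter</I> declares <I>java.io.IOException</I>.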
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.3">2.2.3 End-of-File Code for Lexical Analyzer Class</A></H3>
<P>
The <I>%eof{ ... %eof}</I> directive 
allows the user to write Java code to be 
copied into the lexical analyzer class
for execution after the end-of-file is reached.<BR>
<I>%eof{ </I><BR>
<I>&lt;code&gt;</I><BR>
<I>%eof} </I><BR>
The <I>%eof{</I> and <I>%eof}</I> directives
should be situated at the beginning of a line.
The specified Java code in <I>&lt;code&gt;</I> 
will be executed at most once, and immediately
after the end-of-file is reached for the input file
the lexical analyzer class is processing.
<P>
The code given in the <I>%eof{ ... %eof}</I> directive
may potentially throw an exception, or propagate it from
another function.  To declare this exception, use
the <I>%eofthrow{ ... %eofthrow}</I> directive.<BR>
<I>%eofthrow{ </I><BR>
<I>&lt;exception[1]&gt;</I>[<I>, &lt;exception[2]&gt;, ...</I>]<BR>
<I>%eofthrow} </I><BR>
The Java code specified here will be copied 
into the declaration of the lexical analyzer function
called to clean-up upon reaching end-of-file.<BR>
<I>private void yy_do_eof () </I><BR>
<I>throws &lt;exception[1]&gt;</I>[<I>, &lt;exception[2]&gt;, ...</I>]<BR>
<I>{ </I><BR>
<I>... &lt;code&gt; ... </I><BR>
<I>} </I><BR>
The Java code in &lt;code&gt; that makes up 
the body of this function will, in part, 
come from the code given in the
<I>%eof{ ... %eof}</I> directive.
If this code throws an exception that is not declared
using the <I>%eofthrow{ ... %eofthrow}</I> directive,
the resulting lexer may not compile successfully.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.4">2.2.4 Macro Definitions</A></H3>
<P>
Macro definitions are given in the JLex directives section 
of the specification.
Each macro definition is contained on a single line and
consists of a macro name followed by an equal sign (=), 
then by its associated definition.
The format can therefore be summarized as follows.<BR>
<I>&lt;name&gt;</I> = <I>&lt;definition&gt;</I><BR>
Non-newline white space, e.g. blanks and tabs, 
is optional between the macro name and the equal sign
and between the equal sign and the macro definition.
Each macro definition should be contained on a 
single line.
<P>
Macro names should be valid identifiers, 
e.g. sequences of letters, digits, and underscores
beginning with a letter or underscore.
<P>
Macro definitions should be valid regular expressions,
the details of which are described in another section below.
<P>
Macro definitions can contain other macro expansions,
in the standard<BR><I>{&lt;name&gt;} </I> format for macros 
within regular expressions.
However, the user should note that these expressions
are macros - not functions or nonterminals - so
mutually recursive constructs using macros are illegal.
Therefore, cycles in macro definitions will have 
unpredictable results.
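<P>
As a small example, the following definitions build one macro out of two others. The macro names are arbitrary choices for this sketch.

```
DIGIT = [0-9]
ALPHA = [A-Za-z_]
IDENT = {ALPHA}({ALPHA}|{DIGIT})*
```

Here <I>IDENT</I> refers to <I>ALPHA</I> and <I>DIGIT</I>, but neither of those refers back to <I>IDENT</I>, so the definitions contain no cycle.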
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.5">2.2.5 State Declarations</A></H3>
<P>
Lexical states are used to control when certain 
regular expressions are matched.
These are declared in the JLex directives 
in the following way.<BR>
<I>%state </I>state[0][<I>, state[1], state[2], ...</I>]<BR>
Each declaration of a series of lexical states should
be contained on a single line.
Multiple declarations can be included in the same
JLex specification, so the declaration of many states
can be broken into many declarations over multiple lines.
<P>
State names should be valid identifiers,
e.g. sequences of letters, digits, and underscores
beginning with a letter or underscore.
<P>
A single lexical state is implicitly declared by JLex.
This state is called <I>YYINITIAL</I>, and the generated
lexer begins lexical analysis in this state.
<P>
Rules of lexical analysis begin with an optional state list.
If a state list is given, the lexical rule is matched only when
the lexical analyzer is in one of the specified states.
If a state list is not given, the lexical rule is matched when
the lexical analyzer is in any state.
<P>
If a JLex specification does not make use of states,
by neither declaring states nor preceding lexical rules 
with state lists,
the resulting lexer will remain in state <I>YYINITIAL</I>
throughout execution.
Since lexical rules are not prefaced by state lists,
these rules are matched in all existing states,
including the implicitly declared state <I>YYINITIAL</I>.
Therefore, everything works as expected if states are
not used at all.
<P>
States are declared as constant integers within the generated 
lexical analyzer class.
The constant integer declared for a declared state 
has the same name as that state.
The user should be careful to avoid name conflict 
between state names and variables declared in the 
action portion of rules or elsewhere within  
the lexical analyzer class.
A convenient convention would be to declare state
names in all capitals, as a reminder that these
identifiers effectively become constants.
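<P>
The sketch below shows state declarations spread over two lines, followed by actions that refer to the resulting constants. The state names are hypothetical, and the call <I>yybegin()</I> is the state-transition function covered in the section on state transitions; state lists on rules are omitted here and are described in the section on lexical states.

```
%state COMMENT
%state STRING, CHAR_LITERAL

%%

"/*"   { yybegin(COMMENT); }
"*/"   { yybegin(YYINITIAL); }
```

Because <I>COMMENT</I> and <I>YYINITIAL</I> are integer constants of the generated class, they can be passed to <I>yybegin()</I> like any other value.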
<P>
<BR> <P> <hr>
<H3><A NAME="SECTION2.2.6">2.2.6 Character Counting</A></H3>
<P>
Character counting is turned off by default, but can be activated
with the <I>%char</I> directive.<BR>
<I>%char</I><BR>
The zero-based character index of the first character in
the matched region of text is then placed in the
integer variable <I>yychar</I>.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.7">2.2.7 Line Counting</A></H3>
<P>
Line counting is turned off by default, but can be activated
with the <I>%line</I> directive.<BR>
<I>%line</I><BR>
The zero-based line index at the beginning of the
matched region of text is then placed in the
integer variable <I>yyline</I>.
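<P>
A minimal sketch using both counters together is shown below. It assumes the <I>%integer</I> directive so that no token class needs to be defined, and it adds 1 to <I>yyline</I> because the index is zero-based.

```
%%
%line
%char
%integer
%%
[a-z]+       { System.out.println("word '" + yytext() + "' at line " + (yyline + 1) + ", offset " + yychar); }
[\ \t\r\n]+  { /* skip white space */ }
.            { /* ignore anything else */ }
```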
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.8">2.2.8 Java CUP Compatibility </A></H3>
<P>
Java CUP is a parser generator for Java originally written
by Scott Hudson of Georgia Tech University, and maintained and
extended by Frank Flannery, Dan Wang, and C. Scott Ananian.
Details of this software tool are on the World Wide Web
at<BR>
<a href="http://www.cs.princeton.edu/~appel/modern/java/CUP/">http://www.cs.princeton.edu/~appel/modern/java/CUP/</a>.<BR>
Java CUP compatibility is turned off by default, but can 
be activated with the following JLex directive.<BR>
<I>%cup</I><BR>
When given, this directive makes the generated scanner conform to the
<code>java_cup.runtime.Scanner</code> interface.  It has the same
effect as the following three directives:<BR>
<i>%implements java_cup.runtime.Scanner</i><BR>
<i>%function next_token</i><BR>
<i>%type java_cup.runtime.Symbol</i><BR>
See <a href="#SECTION2.2.9">section 2.2.9</a> and
<a href="#SECTION2.2.18">section 2.2.18</a> for more details on
these three directives, and the CUP manual for more details on using
CUP and JLex together.
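<P>
A specification intended for use with CUP might be sketched as follows. The class <I>sym</I> and its constants <I>NUMBER</I> and <I>EOF</I> are assumed to come from a CUP grammar that declares a terminal named NUMBER; they are not defined by JLex.

```
%%
%cup
%eofval{
  return new java_cup.runtime.Symbol(sym.EOF);
%eofval}
%%
[0-9]+       { return new java_cup.runtime.Symbol(sym.NUMBER, Integer.valueOf(yytext())); }
[\ \t\r\n]+  { /* skip white space */ }
.            { System.err.println("Illegal character: " + yytext()); }
```

The <I>%eofval{ ... %eofval}</I> block is included so that the parser receives an explicit end-of-file symbol rather than <I>null</I>.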
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.9">2.2.9 Lexical Analyzer Component Titles</A></H3>
<P>
The following directives can be used to change the name of
the generated lexical analyzer class, the tokenizing function, and
the token return type.  To change the name of the lexical
analyzer class from <I>Yylex</I>, use the 
<I>%class</I> directive.<BR>
<I>%class &lt;name&gt;</I><BR>
To change the name of the tokenizing function from <I>yylex</I>,
use the <I>%function</I> directive.<BR>
<I>%function &lt;name&gt;</I><BR>
To change the name of the return type of the tokenizing 
function from <I>Yytoken</I>, use the <I>%type</I> 
directive.<BR>
<I>%type &lt;name&gt;</I><BR>
If the default names are not altered using these directives,
the tokenizing function is invoked with a call to
<I>Yylex.yylex()</I>, which returns the <I>Yytoken</I> type.
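<P>
For example, the three directives could be used together as sketched below. The names <I>Scanner</I>, <I>nextToken</I>, and <I>Token</I> are hypothetical, and the <I>Token</I> class would have to be supplied by the user, for instance in the user code section.

```
%class Scanner
%function nextToken
%type Token
```

With these settings the tokenizing function would be invoked as <I>Scanner.nextToken()</I> and would return the <I>Token</I> type.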
<P>
To avoid scoping conflicts, names beginning with <I>yy</I>
are normally reserved for lexical analyzer internal functions
and variables.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.10">2.2.10 Default Token Type</A></H3>
<P>
To make the 32-bit primitive integer type <I>int</I>
the return type for the tokenizing function 
(and therefore the token type),
use the <I>%integer</I> directive.<BR>
<I>%integer</I><BR>
Under default settings, <I>Yytoken</I> is the return 
type of the tokenizing function<BR><I>Yylex.yylex()</I>,
as in the following code fragment.<BR>
<I>class Yylex { ... </I><BR>
<I>public Yytoken yylex () {</I><BR>
<I>... } </I><BR>
The <I>%integer</I> directive replaces the previous code
with a revised declaration, in which the token type 
has been changed to <I>int</I>.<BR>
<I>class Yylex { ... </I><BR>
<I>public int yylex () {</I><BR>
<I>... } </I><BR>
This declaration allows lexical actions to return 
integer codes, as in the following code fragment
from a hypothetical lexical action.<BR>
<I>{ ...</I><BR>
<I>return 7; </I><BR>
<I>... } </I>
<P>
The integer return type changes the behavior
at end-of-file.
Under default settings, objects - subclasses of the 
java.lang.Object class - are returned by <I>Yylex.yylex()</I>.
During execution of the generated lexer <I>Yylex</I>,
a special object value must be reserved for end-of-file. 
Therefore, when the end-of-file is reached 
for the processed input file (and from then onward), 
<I>Yylex.yylex()</I> returns <I>null</I>.
<P>
When <I>int</I> is the return type of <I>Yylex.yylex()</I>,
<I>null</I> can no longer be returned.  Instead,
<I>Yylex.yylex()</I> returns the value -1, corresponding
to constant integer<BR><I>Yylex.YYEOF</I>.
The <I>%integer</I> directive implies <I>%yyeof</I>; see below.
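<P>
The following Java sketch shows how calling code might loop until the end-of-file code is returned. The class <I>StubYylex</I> is a hand-written stand-in invented for this example so that the code is self-contained; a real lexer would be generated by JLex from a specification that uses <I>%integer</I>.

```java
// Stand-in for a JLex-generated class built with %integer. A real Yylex
// would be produced by JLex; this stub only replays a fixed list of codes.
class StubYylex {
    // End-of-file code; the manual gives -1 as the value of YYEOF.
    public static final int YYEOF = -1;
    private final int[] codes;
    private int next = 0;

    StubYylex(int[] codes) {
        this.codes = codes;
    }

    // Returns the next token code, then YYEOF on every call after the end.
    public int yylex() {
        return next == codes.length ? YYEOF : codes[next++];
    }
}

class IntegerDriverDemo {
    // Counts tokens until the end-of-file code is seen.
    static int countTokens(StubYylex lexer) {
        int count = 0;
        while (lexer.yylex() != StubYylex.YYEOF) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countTokens(new StubYylex(new int[] { 7, 3, 7 })));
    }
}
```

The loop compares against the named constant rather than a literal -1, so the intent stays clear to the reader.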
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.11">2.2.11 Default Token Type II: Wrapped Integer</A></H3>
<P>
To make java.lang.Integer the return type for the
tokenizing function (and therefore the token type),
use the <I>%intwrap</I> directive.<BR>
<I>%intwrap</I><BR>
Under default settings, <I>Yytoken</I> is the return 
type of the tokenizing function<BR><I>Yylex.yylex()</I>,
as in the following code fragment.<BR>
<I>class Yylex { ... </I><BR>
<I>public Yytoken yylex () {</I><BR>
<I>... } </I><BR>
The <I>%intwrap</I> directive replaces the previous code
with a revised declaration, in which the token type 
has been changed to java.lang.Integer.<BR>
<I>class Yylex { ... </I><BR>
<I>public java.lang.Integer yylex () {</I><BR>
<I>... } </I><BR>
This declaration allows lexical actions to return 
wrapped integer codes, as in the following code fragment
from a hypothetical lexical action.<BR>
<I>{ ...</I><BR>
<I>return new java.lang.Integer(0); </I><BR>
<I>... } </I>
<P>
Notice that the effect of the <I>%intwrap</I> directive can be
equivalently accomplished using the <I>%type</I>
directive, as follows.<BR>
<I>%type java.lang.Integer</I><BR>
This manually changes the name of the return type
from <I>Yylex.yylex()</I> to<BR><I>java.lang.Integer</I>.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.12">2.2.12 YYEOF on End-of-File</A></H3>
<P>
The <I>%yyeof</I> directive causes the constant
integer <I>Yylex.YYEOF</I> to be declared.  If
the <I>%integer</i> directive is present, <i>Yylex.YYEOF</i>
is returned upon end-of-file.<BR>

<I>%yyeof</I><BR>

This directive causes <i>Yylex.YYEOF</i> to be declared as
follows:<BR>
<I>public final int YYEOF = -1;</I><BR>
The <i>%integer</i> directive implies <i>%yyeof</i>.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.13">2.2.13 Newlines and Operating System Compatibility</A></H3>
<P>
In UNIX operating systems, the character code sequence 
representing a newline is the single character ``\n''.
Conversely, in DOS-based operating systems, the newline is
the two-character sequence ``\r\n''
consisting of the carriage return followed by the newline.
The <I>%notunix</I> directive results in either the carriage
return or the newline being recognized as a newline.<BR>
<I>%notunix</I><BR>
This issue of recognizing the proper sequence of characters
as a newline is important in ensuring Java platform independence.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.14">2.2.14 Character Sets</A></H3>
<P>
The default settings support an alphabet 
of character codes between 0 and 127 inclusive.
If the generated lexical analyzer receives 
an input character code that falls outside 
of these bounds, the lexer may fail.
<P>
The <I>%full</I> directive can be used 
to extend this alphabet to include
all 8-bit values.<BR>
<I>%full</I><BR>
If the <I>%full</I> directive is given, 
JLex will generate a lexical analyzer
that supports an alphabet of character codes
between 0 and 255 inclusive.
<P>
The <I>%unicode</I> directive can be used 
to extend the alphabet to include the
full 16-bit Unicode alphabet.<BR>
<I>%unicode</I><BR>
If the <I>%unicode</I> directive is given, 
JLex will generate a lexical analyzer
that supports an alphabet of character codes
between 0 and 2^16-1 inclusive.
<p>
The <i>%ignorecase</i> directive can be given to generate
case-insensitive lexers.<br>
<i>%ignorecase</i><br>
If the <i>%ignorecase</i> directive is given, JLex will expand all
character classes in a Unicode-friendly way to match upper-case,
lower-case, and title-case letters.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.15">2.2.15 Character Format To and From File</A></H3>
<P>
Under the status quo, JLex and the lexical
analyzer it generates read from and write to
ASCII text files, with byte-sized characters.
However, to support further extensions on the JLex tool,
all internal processing of characters is done
using the 16-bit Java character type,
although the full range of 16-bit values is not supported.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.16">2.2.16 Exceptions Generated by Lexical Actions</A></H3>
<P>
The code given in the action portion of 
the regular expression rules,
in section three of the JLex specification,
may potentially throw an exception, or propagate 
it from another function.  
To declare these exceptions, use
the <I>%yylexthrow{ ... %yylexthrow}</I> directive.<BR>
<I>%yylexthrow{ </I><BR>
<I>&lt;exception[1]&gt;</I>[<I>, &lt;exception[2]&gt;, ...</I>]<BR>
<I>%yylexthrow} </I><BR>
The Java code specified here will be copied 
into the declaration of the lexical analyzer 
tokenizing function <I>Yylex.yylex()</I>, as follows.<BR>
<I>public Yytoken yylex () </I><BR>
<I>throws &lt;exception[1]&gt;</I>[<I>, &lt;exception[2]&gt;, ...</I>]
<BR>
<I>{ </I><BR>
<I>... </I><BR>
<I>} </I><BR>
If the code given in the action portion of 
the regular expression rules
throws an exception that is not declared
using the <I>%yylexthrow{ ... %yylexthrow}</I> directive,
the resulting lexer may not compile successfully.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.17">2.2.17 Specifying the Return Value on End-of-File</A></H3>
<P>
The <I>%eofval{ ... %eofval}</I> directive
specifies the return value on end-of-file.
This directive allows the user to write Java code to be
copied into the lexical analyzer tokenizing 
function <I>Yylex.yylex()</I> 
for execution when the end-of-file is reached.
This code must return a value compatible with 
the type of the tokenizing function <I>Yylex.yylex()</I>.<BR>
<I>%eofval{ </I><BR>
<I>&lt;code&gt;</I><BR>
<I>%eofval} </I><BR>
The specified Java code in <I>&lt;code&gt;</I> 
determines the return value of <I>Yylex.yylex()</I>
when the end-of-file is reached for the input file
the lexical analyzer class is processing.
This will also be the value returned by <I>Yylex.yylex()</I>
each additional time this function is called
after end-of-file is initially reached,
so <I>&lt;code&gt;</I> may be executed more than once.
Finally, the <I>%eofval{</I> and <I>%eofval}</I> directives
should be situated at the beginning of a line.
<P>
An example of usage is given below.
Suppose the return value desired on end-of-file is 
<I>(new token(sym.EOF))</I> rather than 
the default value <I>null</I>.
The user adds the following declaration to the
specification file.<BR>
<I>%eofval{ </I><BR>
<I>return (new token(sym.EOF)); </I><BR>
<I>%eofval} </I><BR>
The code is then copied into <I>Yylex.yylex()</I>
at the appropriate place.<BR>
<I>public Yytoken yylex () { ... </I><BR>
<I>return (new token(sym.EOF)); </I><BR>
<I>... } </I><BR>
The value returned by <I>Yylex.yylex()</I> upon 
end-of-file and from that point onward is now
<I>(new token(sym.EOF))</I>.
<P>
<BR> <HR>
<H3><A NAME="SECTION2.2.18">2.2.18 Specifying an interface to implement</A></H3>
<P>
JLex allows the user to specify an interface which the <i>Yylex</i>
class will implement.  By adding the following declaration to the input
file:<br>
<i>%implements &lt;classname&gt;</i><br>
the user specifies that Yylex will implement <i>classname</i>.  The
generated lexer class declaration will look like:<br>
<tt>
class Yylex implements <i>classname</i> { ...
</tt>
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.2.19">2.2.19 Making the Generated Class Public</A></H3>
<P>
The <I>%public</I> directive causes the lexical analyzer class
generated by JLex to be a public class.<br>
<I>%public</I><BR>
The default behavior adds no access specifier to the generated
class, resulting in the class being visible only from the current
package.
<P>
<BR> <HR>
<BR> <P>
<H2><A NAME="SECTION2.3">2.3 Regular Expression Rules</A></H2>
<P>
The third part of the JLex specification consists
of a series of rules for breaking the input stream into tokens.
These rules specify regular expressions, then associate
these expressions with actions consisting of Java source code.
<P>
The rules have three distinct parts: 
the optional state list, the regular expression, 
and the associated action.
This format is represented as follows.<BR>
[<I>&lt;states&gt;</I>]  <I>&lt;expression&gt; { &lt;action&gt; }</I><BR>
Each part of the rule is discussed in a section below.
<P>
If more than one rule matches strings from its input,
the generated lexer resolves conflicts between rules
by greedily choosing the rule that matches the longest string.
If more than one rule matches strings of the same length,
the lexer will choose the rule that is given first in
the JLex specification.
Therefore, rules appearing earlier in the specification
are given a higher priority by the generated lexer.
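<P>
The conflict-resolution policy can be sketched in Java as follows.
This is an illustration of the policy only, not JLex's table-driven
implementation: the class <I>RuleChooser</I> is hypothetical, and
<I>java.util.regex</I> is used as a stand-in matcher (for the simple patterns
shown, its greedy match is also the longest match, which is not true of every
pattern).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RuleChooser {
    // Returns the index of the winning rule for a match starting at the
    // beginning of input, or -1 if no rule matches there.
    static int choose(String[] rules, String input) {
        int best = -1;
        int bestLen = -1;
        for (int i = 0; i < rules.length; i++) {
            Matcher m = Pattern.compile(rules[i]).matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // The strict comparison keeps the earlier rule on a tie.
            if (m.lookingAt() && m.end() > bestLen) {
                best = i;
                bestLen = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] rules = { "if", "[a-z]+" };  // keyword first, then identifier
        System.out.println(choose(rules, "if"));    // 0: same length, first rule wins
        System.out.println(choose(rules, "iffy"));  // 1: longer match wins
        System.out.println(choose(rules, "123"));   // -1: no rule matches
    }
}
```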
<P>
The rules given in a JLex specification should 
match all possible input.  
If the generated lexical analyzer receives input that
does not match any of its rules, 
an error will be raised.
<P>
Therefore, all input should be matched by at least one rule.  
This can be guaranteed by placing the following rule 
at the bottom of a JLex specification:<BR>
<I>. { java.lang.System.out.println("Unmatched input: " + yytext()); 
}</I><BR>
The dot (.), as described below, will match any input
except for the newline, so newlines must be matched by a separate rule.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.3.1">2.3.1 Lexical States</A></H3>
<P>
An optional lexical state list precedes each rule.
This list should be in the following form:<BR>
<I>&lt;</I>state[0][<I>, state[1], state[2], ...</I>]<I>&gt;</I><BR>
The outer set of brackets ([]) indicates that multiple states are optional.
The less than (&lt;) and greater than (&gt;) symbols 
represent themselves and should surround the state
list, preceding the regular expression.
The state list specifies under which initial states
the rule can be matched.
<P>
For instance, if <I>yylex()</I> is called with 
the lexer at state <I>A</I>, 
the lexer will attempt to match the input only 
against those rules that have 
<I>A</I> in their state list.
<P>
If no state list is specified for a given rule, 
the rule can be matched in all lexical states.
<P>
<BR> <HR>
<BR> <P>
<H3><A NAME="SECTION2.3.2">2.3.2 Regular Expressions</A></H3>
<P>
Regular expressions should not contain any white space,
as white space is interpreted as the end of 
the current regular expression.
There is one exception; if (non-newline) white 
space characters appear from within double quotes, 
these characters are taken to represent themselves.
For instance, &quot; &quot; is interpreted as a blank space.
<P>
The alphabet for JLex is the ASCII character set, 
meaning character codes between 0 and 127 inclusive.
<P>
The following characters are metacharacters, with 
special meanings in JLex regular expressions.<BR>
<pre><h4>? * + | ( ) ^ $ . [ ] { } &quot; \</h4></pre><br>
Otherwise, individual characters stand for themselves.
<P>
<i>ef</i> Consecutive regular expressions represent
their concatenation.
<P>
<i>e</i>|<i>f</i> The vertical bar (|) represents an option between
the regular expressions that surround it, so <i>e</i>|<i>f</i>
matches either expression <i>e</i> or <i>f</i>.
<P>
The following escape sequences are recognized and expanded:
<TABLE>
<TR>
<TD>\b</td>
<TD>Backspace</td>
</tr>
<tr>
<TD>\n</td>
<TD>newline</td>
</tr>
<tr>
<TD>\t</td>
<TD>Tab</td>
</tr>
<tr>
<TD>\f</td>
<TD>Formfeed</td>
</tr>
<tr>
<TD>\r</td>
<TD>Carriage return</td>
</tr>
<tr>
<TD>\<i>ddd</i></td>
<TD>The character code corresponding to the number formed by three octal
digits <i>ddd</i></td>
</tr>
<tr>
<TD>\x<i>dd</i></td>
<TD>The character code corresponding to the number formed by two
hexadecimal digits <i>dd</i></td>
</tr>
<tr>
<TD>\u<i>dddd</i></td>
<TD>The Unicode character code corresponding to the number formed by
four hexadecimal digits <i>dddd</i>.</td>
</tr>
<tr>
<TD>\^<i>C</i></td>
<TD>Control character</td>
</tr>
<tr>
<TD>\<i>c</i></td>
<TD>A backslash followed by any other character <i>c</i> matches itself</td>
</tr>
</table>
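<P>
As a worked example of the numeric escapes, the Java fragment below converts
the digits of an octal or hexadecimal escape into a character code.
It illustrates only the arithmetic; the class <I>EscapeCodes</I> is hypothetical
and does not show how JLex itself parses escape sequences.

```java
class EscapeCodes {
    // Character code for the three octal digits of an escape like \101.
    static int octal(String ddd) {
        return Integer.parseInt(ddd, 8);
    }

    // Character code for the hex digits of an escape like \x41 or \u0041.
    static int hex(String digits) {
        return Integer.parseInt(digits, 16);
    }

    public static void main(String[] args) {
        System.out.println((char) octal("101")); // A  (1*64 + 0*8 + 1 = 65)
        System.out.println((char) hex("41"));    // A  (4*16 + 1 = 65)
        System.out.println((char) hex("0041"));  // A
        System.out.println(hex("7F"));           // 127
    }
}
```
<P>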
$ The dollar sign ($) denotes the end of a line.
If the dollar sign ends a regular expression, the expression
is matched only at the end of a line.
<P>
. The dot (.) matches any character except the newline,
so this expression is equivalent to [^\n].
<P>
"..." Metacharacters lose their meaning within
double quotes and represent themselves.
The sequence <code>\"</code> (which represents the
single character <code>"</code>) is the only exception.

<P>
<I>{name}</I> Curly braces denote a macro expansion,
with <I>name</I> the declared name of the associated macro.
<P>
* The star (*) represents Kleene closure and matches 
zero or more repetitions of the preceding regular expression.
<P>
+ The plus (+) matches one or more repetitions of the 
preceding regular expression, so <I>e</I>+ is equivalent to <I>ee</I>*.
<P>
? The question mark (?) matches zero or one repetitions
of the preceding regular expression.
<P>
(...) Parentheses are used for grouping within regular
expressions.
<P>
[...] Square brackets denote a class of characters
and match any one character enclosed in the brackets.  If the
first character following the left bracket ([) is 
the up arrow (^),
the set is negated and the expression matches any character
except those enclosed in the brackets.  Different
metacharacter rules hold inside the brackets, with the
following expressions having special meanings:
<TABLE>
<tr>
<td><i>{name}</i></td>
<td>Macro expansion</td>
</tr>
<tr>
<td><i>a</i> - <i>b</i></td>
<td>Range of character codes from <i>a</i> to <i>b</i> to be included in
character set</td>
</tr>
<tr>
<td>&quot;...&quot;</td>
<td>All metacharacters within double quotes lose
their special meanings. The sequence <code>\"</code> (which represents the
single character <code>"</code>) is the only exception.</td>
</tr>
<tr>
<td>\</td>
<td>A metacharacter following a backslash (\) loses its special meaning</td>
</tr>
</table>

<P>
For example, [a-z] matches any lower-case letter, [^0-9] 
matches anything except a digit, and [0-9a-fA-F] matches any hexadecimal
digit. 
Inside character class brackets,
a metacharacter following a backslash loses its special meaning.
Therefore, [\-\\] matches a dash or a backslash.
Likewise ["A-Z"] matches one of the three characters A, dash, or Z.
Leading and trailing dashes in a character class also lose their
special meanings, so [+-] and [-+] do what you would expect them to
(i.e., match only '+' and '-').
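<P>
Most of these character classes behave the same way in Java's own
<I>java.util.regex</I> package, which makes them easy to try out.
The sketch below assumes that correspondence and is not a test of JLex itself;
the class <I>CharClassDemo</I> is hypothetical, and the quoted form ["A-Z"] is
left out because double quotes have no special meaning inside a
<I>java.util.regex</I> character class.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True if the whole string s is matched by the character class.
    static boolean matches(String charClass, String s) {
        return Pattern.matches(charClass, s);
    }

    public static void main(String[] args) {
        System.out.println(matches("[a-z]", "q"));         // true
        System.out.println(matches("[^0-9]", "7"));        // false
        System.out.println(matches("[0-9a-fA-F]", "F"));   // true
        // The Java literal "[\\-\\\\]" is the pattern [\-\\].
        System.out.println(matches("[\\-\\\\]", "-"));     // true
        System.out.println(matches("[\\-\\\\]", "\\"));    // true
        // Leading and trailing dashes are literal.
        System.out.println(matches("[+-]", "-"));          // true
        System.out.println(matches("[-+]", "+"));          // true
    }
}
```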
<P>
<BR> <P>
<H3><A NAME="SECTION2.3.3">2.3.3 Associated Actions</A></H3>
<P>
The action associated with a lexical rule consists
of Java code enclosed inside block-delimiting curly braces.<BR>
<I>{ action } </I><BR>
The Java code <I>action</I> is copied, as given, into 
the state-driven lexical analyzer produced by JLex.
<P>
All curly braces contained in <I>action</I> that are not part of strings or comments
should be balanced.
<P>
<BR> <HR>
<BR> <P>
<H4><A NAME="SECTION2.3.3.1">2.3.3.1 Actions and Recursion:</A></H4>
<P>
If an action does not return a value, the 
lexical analyzer will loop, searching for the next match 
from the input stream and returning the value
associated with that match.
<P>
The lexical analyzer can be made to recur explicitly
with a call to <I>yylex()</I>, as in the following 
code fragment.<BR>
<I>{ ...</I> <BR>
<I>return yylex();</I> <BR>
<I>... } </I> <BR>
This code fragment causes the lexical analyzer to recur, 
searching for the next match in the input
and returning the value associated with that match.
The same effect can be had, however, by simply
not returning from a given action.  
This results in the lexer searching for the next match,
without the additional overhead of recursion.
<P>
The preceding code fragment is an example of tail recursion,
since the recursive call comes at the end of 
the calling function's execution.
The following code fragment is an example of a recursive
call that is not tail recursive.<BR>
<I>{ ...</I> <BR>
<I>next = yylex();</I> <BR>
<I>... } </I> <BR>
Recursive actions that are not tail-recursive work
in the expected way, 
except that variables such as 
<i>yyline</i> and <i>yychar</i>
may be changed during recursion.
<P>
<BR> <HR>
<BR> <P>
<H4><A NAME="SECTION2.3.3.2">2.3.3.2 State Transitions:</A></H4>
<P>
If lexical states are declared in the JLex
directives section, transitions on these states
can be declared within the regular expression actions.
State transitions are made by the following 
function call.<BR>
<I>yybegin(state);</I><BR>
The void function <I>yybegin()</I> is passed the state 
name <I>state</I> and effects a transition to
this lexical state.
<P>
The state <I>state</I> must be declared within the JLex
directives section, or this call will result in a
compiler error in the generated source file.
The one exception to this declaration requirement is
state <I>YYINITIAL</I>, the lexical state 
implicitly declared by JLex.
The generated lexer begins lexical analysis in state
<I>YYINITIAL</I> and remains in this state until
a transition is made.
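<P>
The role of <I>yybegin()</I> can be illustrated with a small hand-written
imitation of a two-state lexer.
The class <I>CommentStripper</I> and the state <I>COMMENT</I> are hypothetical,
and the code is a sketch of the idea rather than what JLex generates: each
branch stands for a rule that is active only in the named state, and the
actions switch states with a local <I>yybegin()</I>.

```java
class CommentStripper {
    // Stand-ins for the state constants a generated lexer would define.
    static final int YYINITIAL = 0;
    static final int COMMENT = 1;

    // Lexing starts in YYINITIAL and stays there until a transition.
    private int state = YYINITIAL;

    // Mirrors the role of yybegin(): switch the current lexical state.
    private void yybegin(int newState) {
        state = newState;
    }

    // Returns input with everything between /* and */ removed.
    String strip(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            if (state == YYINITIAL) {
                if (input.startsWith("/*", i)) {
                    yybegin(COMMENT);             // rule active in YYINITIAL: "/*"
                    i += 2;
                } else {
                    out.append(input.charAt(i));  // rule active in YYINITIAL: other input
                    i++;
                }
            } else {
                if (input.startsWith("*/", i)) {
                    yybegin(YYINITIAL);           // rule active in COMMENT: "*/"
                    i += 2;
                } else {
                    i++;                          // rule active in COMMENT: discard
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(new CommentStripper().strip("a/*b*/c")); // ac
    }
}
```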
<P>
<BR> <HR>
<BR> <P>
<H4><A NAME="SECTION2.3.3.3">2.3.3.3 Available Lexical Values:</A></H4>
<P>
The following values, internal to the <I>Yylex</I> class, 
are available within the action portion of the lexical rules.

<table>
<tr>
<th align=left>Variable or Method</th>
<th align=left>Activation Directive</th>
<th align=left>Description</th>
</tr>
<tr>
<td><i>java.lang.String yytext();</i></td>
<td>Always active.</td>
<td>Matched portion of the character input stream.</td>
</tr>
<tr>
<td><i>int yychar;</i></td>
<td><i>%char</i></td>
<td>Zero-based character index of the first character in the matched
portion of the input stream</td>
</tr>
<tr>
<td><i>int yyline;</i></td>
<td><i>%line</i></td>
<td>Zero-based line number of the start of the matched portion of the
input stream</td>
</tr>

</table>
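<P>
To make the zero-based conventions concrete, the following sketch computes
the three values for each whitespace-delimited word of an input string.
The class <I>PositionDemo</I> is hypothetical and hand-written; it only mimics
the bookkeeping described in the table and is not taken from a generated lexer.

```java
class PositionDemo {
    // For each whitespace-delimited word, reports the values a rule action
    // would see: the matched text, its zero-based character index, and its
    // zero-based line number, one word per output line.
    static String report(String input) {
        StringBuilder out = new StringBuilder();
        int yyline = 0;
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\n') {
                yyline++;
                i++;
            } else if (Character.isWhitespace(c)) {
                i++;
            } else {
                int yychar = i;  // index of the first matched character
                while (i < input.length()
                        && !Character.isWhitespace(input.charAt(i))) {
                    i++;
                }
                String yytext = input.substring(yychar, i);
                out.append(yytext).append(' ')
                   .append(yychar).append(' ')
                   .append(yyline).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(report("ab cd\nef"));
        // ab 0 0
        // cd 3 0
        // ef 6 1
    }
}
```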

<P>
<BR> <HR>






<BR> <P>
<H1><A NAME="SECTION3">3. Generated Lexical Analyzers</A></H1>
<P>
JLex will take a properly-formed specification
and transform it into a Java source file for the
corresponding lexical analyzer.
<P>
The generated lexical analyzer resides in the class <I>Yylex</I>.
There are two constructors to this class, both requiring a single argument:
the input stream to be tokenized.
The input stream may either be of type <code>java.io.InputStream</code>
or <code>java.io.Reader</code> (such as <code>StringReader</code>). 
Note that the <code>java.io.Reader</code> constructor should be used
if you are generating a lexer accepting Unicode characters, as the JDK
1.0 <code>java.io.InputStream</code> class does not always read Unicode
correctly.
<P>
The access function to the lexer is <I>Yylex.yylex()</I>,
which returns the next token from the input stream.
The return type is <I>Yytoken</I> and the function is declared 
as follows.<BR>
<I>class Yylex { ... </I><BR>
<I>public Yytoken yylex () {</I><BR>
<I>... } </I><BR>
The user must declare the type of <I>Yytoken</I> and 
can accomplish this conveniently in the
first section of the JLex specification, the user 
code section.  For instance, to make <I>Yylex.yylex()</I>
return a wrapper around integers,
the user would enter the following code somewhere
preceding the first ``%%''.<BR>
<I>class Yytoken { int field; Yytoken(int f) { field=f; } } </I><BR>
Then, in the lexical actions, wrapped integers would
be returned, in something like this way.<BR>
<I>{ ...</I><BR>
<I>return new Yytoken(0); </I><BR>
<I>... } </I><BR>
Likewise, in the user code section, a class could be defined
declaring constants that correspond to each of the token types.<BR>
<I>class TokenCodes { ... </I><BR>
<I>public static final int STRING = 0; </I><BR>
<I>public static final int INTEGER = 1; </I><BR>
<I>... } </I><BR>
Then, in the lexical actions, these token codes could be 
returned.<BR>
<I>{ ...</I><BR>
<I>return new Yytoken(TokenCodes.STRING); </I><BR>
<I>... } </I><BR>
These are simplified examples; in actual use, one would probably
define a token class containing more information than an integer
code.
<P>
These examples begin to illustrate the object-oriented 
techniques a user could employ to define an arbitrarily 
complex token type to be returned by <I>Yylex.yylex()</I>.
In particular, inheritance permits the user to return more
than one token type.  If a distinct token type was needed
for strings and integers, the user could make the
following declarations.<BR>
<I>class Yytoken { ... } </I><BR>
<I>class IntegerToken extends Yytoken { ... } </I><BR>
<I>class StringToken extends Yytoken { ... } </I><BR>
Then the user could return both <I>IntegerToken</I> and
<I>StringToken</I> types from the lexical actions.
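<P>
A fuller version of these declarations might look like the sketch below.
The fields, constructors, and the <I>TokenDemo.describe()</I> helper are
assumptions added for illustration; only the three class names come from the
declarations above.

```java
// Base token type; yylex() would be declared to return this type.
class Yytoken { }

class IntegerToken extends Yytoken {
    final int value;
    IntegerToken(int value) { this.value = value; }
}

class StringToken extends Yytoken {
    final String value;
    StringToken(String value) { this.value = value; }
}

class TokenDemo {
    // Hypothetical consumer: code receiving a Yytoken can branch on the
    // concrete subclass that the lexical action returned.
    static String describe(Yytoken t) {
        if (t instanceof IntegerToken) {
            return "integer " + ((IntegerToken) t).value;
        }
        if (t instanceof StringToken) {
            return "string " + ((StringToken) t).value;
        }
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(describe(new IntegerToken(42)));  // integer 42
        System.out.println(describe(new StringToken("hi"))); // string hi
    }
}
```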
<P>
The names of the lexical analyzer class, the tokenizing function,
and its return type each may be altered using the
JLex directives.  See the section <a href="#SECTION2.2.9">2.2.9</a>
for more details.
<P>
<BR> <P> <HR>
<H1><A NAME="SECTION4">4. Performance</A></H1>
<P>
A benchmark experiment was conducted, comparing the performance
of a lexical analyzer generated by JLex to that
of a hand-written lexical analyzer.
The comparison was made for lexical analyzers 
of a simple ``toy'' programming language.
The hand-written lexical analyzer,
like the lexical analyzer generated by JLex,
was written in Java.
<P>
The experiment consisted of running each lexical analyzer 
on two source files written in the toy language, 
then measuring the time required to process these files.
Each lexical analyzer was invoked by a dummy driver also
written in Java.
<P>
The generated lexical analyzer proved to be quite quick,
as the following results show.
<table>
<tr>
<th>Size of Source File</th>
<th>JLex-Generated Lexical Analyzer: Execution Time</th>
<th>Hand-Written Lexical Analyzer: Execution Time</th>
</tr>
<tr>
<td align=middle>177 lines</td>
<td align=middle>0.42 seconds</td>
<td align=middle>0.53 seconds</td>
</tr>
<tr>
<td align=middle>897 lines</td>
<td align=middle>0.98 seconds</td>
<td align=middle>1.28 seconds</td>
</tr>
</table>

<P>
The JLex lexical analyzer soundly outperformed 
the hand-written lexer.
<P>
One of the biggest complaints about table-driven lexical analyzers
generated by programs like JLex is that these lexical analyzers
do not perform as well as hand-written ones.
Therefore, this experiment is particularly important in
demonstrating the relative speed of JLex lexical analyzers.
<P>
<BR> <HR>
<BR> <P>
<H1><A NAME="SECTION5">5. Implementation Issues</A></H1>
<P>
<BR> <HR>
<BR> <P>
<H2><A NAME="SECTION5.1">5.1 Unimplemented Features</A></H2>
<P>
The following is a (possibly incomplete) list of 
unimplemented features of JLex.
<OL>
<LI> The regular expression lookahead operator is unimplemented,
and not included in the list of special regular
expression metacharacters.<LI> The start-of-line operator (^) assumes 
the following nonstandard behavior.
A match on a regular expression that uses 
this operator will cause the newline that
precedes the match to be discarded.
</OL><BR> <HR>
<BR> <P>
<H2><A NAME="SECTION5.2">5.2 Unicode vs ASCII</A></H2>
<P>
In contrast to the 8-bit character type (char) mandated by
ANSI C, Java supports a 16-bit char and the Unicode 
character set.  Java provides a built-in String class to 
manipulate these Unicode characters.
<P>
As of version 1.2.5, JLex uses the JDK 1.1 <code>Reader</code> and
<code>Writer</code> classes to read in the JLex specification
file and write out the lexical analyzer source file.  This
means that all Unicode characters are allowed in both of these.
In order for the generated scanner to work with Unicode characters,
you must use the <code>java.io.Reader</code> constructor of the 
generated scanner, and the <code>Reader</code> you provide must
properly handle the translation from OS-native format to Unicode.
You must also specify the <i>%unicode</i> directive in the
specification; see section <A HREF="#SECTION2.2.14">2.2.14</a>.
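<P>
One way to supply such a <code>Reader</code> is to wrap the byte stream in an
<code>InputStreamReader</code> with an explicit character set, rather than
relying on the platform default.
The sketch below assumes a modern JDK (<code>StandardCharsets</code> is not part
of JDK 1.1) and UTF-8 input; the class <I>ReaderDemo</I> is hypothetical, and
the generated lexer appears only in a comment.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReaderDemo {
    // Wraps a byte stream in a Reader that decodes UTF-8 explicitly.
    static Reader utf8Reader(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    // Reads every character from the reader into a String.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 is encoded as two bytes in UTF-8.
        byte[] bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);
        Reader reader = utf8Reader(new ByteArrayInputStream(bytes));
        // A generated lexer would take the reader directly: new Yylex(reader)
        System.out.println(readAll(reader).length()); // 1
    }
}
```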
<P>
<BR> <HR>
<BR> <P>
<H2><A NAME="SECTION5.3">5.3 Commas in State Lists</A></H2>
<P>
Commas between state names in declaration lists and lexical
rules are optional.  These lists will be correctly parsed
with white space between state names and without 
comma separators.
<P>
<BR> <HR>
<BR> <P>
<H2><A NAME="SECTION5.4">5.4 Wish List of Unimplemented Features</A></H2>
<P>
The following minor features would be nice to have as part of JLex,
but have not been implemented due to their scope or their negative
impact upon performance.
<OL><LI> Detection of unbalanced braces within the comment
portion of lexical actions.<LI> Detection of cycles in macro definitions.
</OL><BR> <HR>
<BR> <P>
<H1><A NAME="SECTION6">6. Credits and Copyrights</A></H1>
<P>
<BR> <HR>
<BR> <P>
<H2><A NAME="SECTION6.1">6.1 Credits</A></H2>
<P>
The treatment of lexical analyzer generators given in
Alan Holub's <I>Compiler Design in C</I> (Prentice-Hall, 1990)
provided a starting point for my implementation.
<P>
Discussions with Professor Andrew Appel of the 
Princeton University Computer Science Department
provided guidance in the design of JLex.
<P>
Java is a trademark of Sun Microsystems Incorporated.
<P>
<BR> <HR>
<BR> <P>
<H2><A NAME="SECTION6.2">6.2 Copyright</A></H2>
<P>
JLex COPYRIGHT NOTICE, LICENSE AND DISCLAIMER.
<P>
Copyright 1996 by Elliot Joel Berk.
<P>
Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice appear in all copies and that
both the copyright notice and this permission notice and warranty
disclaimer appear in supporting documentation, and that the name of
Elliot Joel Berk not be used in advertising or publicity pertaining 
to distribution of the software without specific, written prior permission.
<P>
Elliot Joel Berk disclaims all warranties with regard to this software, including
all implied warranties of merchantability and fitness.  In no event
shall Elliot Joel Berk be liable for any special, indirect or consequential
damages or any damages whatsoever resulting from loss of use, data or
profits, whether in an action of contract, negligence or other
tortious action, arising out of or in connection with the use or
performance of this software.
<BR> <HR>
<P><ADDRESS>
<I>Frank Flannery<br>
Wed Jul 24 00:27:39 EDT 1996</i>
</ADDRESS>
</BODY>
</HTML>