cgit.sxemacs.org Git - packages/blob - xemacs-packages/semantic/doc/wisent.texi

   1 \input texinfo  @c -*-texinfo-*-
   2 @c %**start of header
   3 @setfilename wisent.info
   4 @set TITLE  Wisent Parser Development
   5 @set AUTHOR Eric M. Ludlam, David Ponce, and Richard Y. Kim
   6 @settitle @value{TITLE}
   7
   8 @c *************************************************************************
   9 @c @ Header
  10 @c *************************************************************************
  11
  12 @c Merge all indexes into a single index for now.
  13 @c We can always separate them later into two or more as needed.
  14 @syncodeindex vr cp
  15 @syncodeindex fn cp
  16 @syncodeindex ky cp
  17 @syncodeindex pg cp
  18 @syncodeindex tp cp
  19
  20 @c @footnotestyle separate
  21 @c @paragraphindent 2
  22 @c @@smallbook
  23 @c %**end of header
  24
  25 @copying
  26 This manual documents the Wisent parser generator.
  27
  28 Copyright @copyright{} 2001, 2002, 2003, 2004, 2007 David Ponce
  29
  30 Some texts are borrowed or adapted from the manual of Bison version
  31 1.35.  The text in section entitled ``Understanding the automaton'' is
  32 adapted from the section ``Understanding Your Parser'' in the manual
  33 of Bison version 1.49.
  34
  35 Copyright @copyright{} 1988, 1989, 1990, 1991, 1992, 1993, 1995, 1998,
  36 1999, 2000, 2001, 2002, 2003, 2004 Free Software Foundation, Inc.
  37
  38 @quotation
  39 Permission is granted to copy, distribute and/or modify this document
  40 under the terms of the GNU Free Documentation License, Version 1.1 or
  41 any later version published by the Free Software Foundation; with the
  42 Invariant Sections being list their titles, with the Front-Cover Texts
  43 being list, and with the Back-Cover Texts being list.  A copy of the
  44 license is included in the section entitled ``GNU Free Documentation
  45 License''.
  46 @end quotation
  47 @end copying
  48
  49 @ifinfo
  50 @dircategory Emacs
  51 @direntry
  52 * Semantic Wisent parser development: (wisent).
  53 @end direntry
  54 @end ifinfo
  55
  56 @iftex
  57 @finalout
  58 @end iftex
  59
  60 @c @setchapternewpage odd
  61 @c @setchapternewpage off
  62
  63 @ifinfo
  64 This file documents Application Development with Semantic.
  65 @emph{Infrastructure for parser based text analysis in Emacs}
  66
  67 Copyright @copyright{} 2001, 2002, 2003, 2004 @value{AUTHOR}
  68 @end ifinfo
  69
  70 @titlepage
  71 @sp 10
  72 @title @value{TITLE}
  73 @author by @value{AUTHOR}
  74 @vskip 0pt plus 1 fill
  75 Copyright @copyright{} 2001, 2002, 2003, 2004 @value{AUTHOR}
  76 @page
  77 @vskip 0pt plus 1 fill
  78 @insertcopying
  79 @end titlepage
  80 @page
  81
  82 @c MACRO inclusion
  83 @include semanticheader.texi
  84 @paragraphindent none
  85
  86
  87 @c *************************************************************************
  88 @c @ Document
  89 @c *************************************************************************
  90 @contents
  91
  92 @node top
  93 @top @value{TITLE}
  94
  95 Wisent (the European Bison ;-) is an Emacs Lisp implementation of the
  96 GNU Compiler Compiler Bison.
  97
  98 This manual describes how to use Wisent to develop grammars for
  99 programming languages, and how to use grammars to parse language
 100 source in Emacs buffers.
 101
 102 It also describes how Wisent is used with the @semantic{} tool set
 103 described in the @ref{Top, Semantic Manual, Semantic Manual, semantic}.
 104
 105 @menu
 106 * Wisent Overview::
 107 * Wisent Grammar::
 108 * Wisent Parsing::
 109 * Wisent Semantic::
 110 * GNU Free Documentation License::
 111 * Index::
 112 @end menu
 113
 114 @node Wisent Overview
 115 @chapter Wisent Overview
 116
 117 @dfn{Wisent} (the European Bison) is an implementation in Emacs Lisp
 118 of the GNU Compiler Compiler Bison. Its code is a port of the C code
 119 of GNU Bison 1.28 & 1.31.
 120
 121 For more details on the basic concepts for understanding Wisent, it is
 122 worthwhile to read the @ref{Top, Bison Manual, bison}.
 123 @ifhtml
 124 @uref{http://www.gnu.org/manual/bison/html_node/index.html}.
 125 @end ifhtml
 126
 127 Wisent can generate compilers compatible with the @semantic{} tool set.
 128 See the @ref{Top, Semantic Manual, , semantic}.
 129
 130 It benefits from these Bison features:
 131
 132 @itemize @bullet
 133 @item
 134 It uses a fast but not so space-efficient encoding for the parse
 135 tables, described in Corbett's PhD thesis from Berkeley:
 136 @quotation
 137 @cite{Static Semantics in Compiler Error Recovery}@*
 138 June 1985, Report No. UCB/CSD 85/251.
 139 @end quotation
 140
 141 @item
 142 For generating the lookahead sets, Wisent uses the well-known
 143 technique of F. DeRemer and A. Pennello they described in:
 144 @quotation
 145 @cite{Efficient Construction of LALR(1) Lookahead Sets}@*
 146 October 1982, ACM TOPLS Vol 4 No 4.
 147 @end quotation
 148
 149 @item
 150 Wisent resolves shift/reduce conflicts using operator precedence and
 151 associativity.
 152
 153 @item
 154 Parser error recovery is accomplished using rules which match the
 155 special token @code{error}.
 156 @end itemize
 157
 158 Nevertheless there are some fundamental differences between Bison and
 159 Wisent.
 160
 161 @itemize
 162 @item
 163 Wisent is intended to be used in Emacs.  It reads and produces Emacs
 164 Lisp data structures.  All the additional code used in grammars is
 165 Emacs Lisp code.
 166
 167 @item
 168 Contrary to Bison, Wisent does not generate a parser which combines
 169 Emacs Lisp code and grammar constructs.  They exist separately.
 170 Wisent reads the grammar from a Lisp data structure and then generates
 171 grammar constructs as tables.  Afterward, the derived tables can be
 172 included and byte-compiled in separate Emacs Lisp files, and be used
 173 at a later time by the Wisent's parser engine.
 174
 175 @item
 176 Wisent allows multiple start nonterminals and allows a call to the
 177 parsing function to be made for a particular start nonterminal.  For
 178 example, this is particularly useful to parse a region of an Emacs
 179 buffer.  @semantic{} heavily depends on the availability of this feature.
 180 @end itemize
 181
 182 @node Wisent Grammar
 183 @chapter Wisent Grammar
 184
 185 @cindex context-free grammar
 186 @cindex rule
 187 In order for Wisent to parse a language, it must be described by a
 188 @dfn{context-free grammar}.  That is a grammar specified as rules that
 189 can be applied regardless of context.  For more information, see
 190 @ref{Language and Grammar, , , bison}, in the Bison manual.
 191
 192 @cindex terminal
 193 @cindex nonterminal
 194 The formal grammar is formulated using @dfn{terminal} and
 195 @dfn{nonterminal} items.  Terminals can be Emacs Lisp symbols or
 196 characters, and nonterminals are symbols only.
 197
 198 @cindex token
 199 Terminals (also known as @dfn{tokens}) represent the lexical
 200 elements of the language like numbers, strings, etc..
 201
 202 For example @samp{PLUS} can represent the operator @samp{+}.
 203
 204 Nonterminal symbols are described by rules:
 205
 206 @example
 207 @group
 208 RESULT @equiv{} COMPONENTS@dots{}
 209 @end group
 210 @end example
 211
 212 @samp{RESULT} is a nonterminal that this rule describes and
 213 @samp{COMPONENTS} are various terminals and nonterminals that are put
 214 together by this rule.
 215
 216 For example, this rule:
 217
 218 @example
 219 @group
 220 exp @equiv{} exp PLUS exp
 221 @end group
 222 @end example
 223
 224 Says that two groupings of type @samp{exp}, with a @samp{PLUS} token
 225 in between, can be combined into a larger grouping of type @samp{exp}.
 226
 227 @menu
 228 * Grammar format::
 229 * Example::
 230 * Compiling a grammar::
 231 * Conflicts::
 232 @end menu
 233
 234 @node Grammar format, Example, Wisent Grammar, Wisent Grammar
 235 @comment  node-name,  next,  previous,  up
 236 @section Grammar format
 237
 238 @cindex grammar format
 239 To be acceptable by Wisent a context-free grammar must respect a
 240 particular format.  That is, must be represented as an Emacs Lisp list
 241 of the form:
 242
 243 @code{(@var{terminals} @var{assocs} . @var{non-terminals})}
 244
 245 @table @var
 246 @item terminals
 247 Is the list of terminal symbols used in the grammar.
 248
 249 @cindex associativity
 250 @item assocs
 251 Specify the associativity of @var{terminals}.  It is @code{nil} when
 252 there is no associativity defined, or an alist of
 253 @w{@code{(@var{assoc-type} . @var{assoc-value})}} elements.
 254
 255 @var{assoc-type} must be one of the @code{default-prec},
 256 @code{nonassoc}, @code{left} or @code{right} symbols.  When
 257 @var{assoc-type} is @code{default-prec}, @var{assoc-value} must be
 258 @code{nil} or @code{t} (the default).  Otherwise it is a list of
 259 tokens which must have been previously declared in @var{terminals}.
 260
 261 For details, see @ref{Contextual Precedence, , , bison}, in the
 262 Bison manual.
 263
 264 @item non-terminals
 265 Is the list of nonterminal definitions.  Each definition has the form:
 266
 267 @code{(@var{nonterm} . @var{rules})}
 268
 269 Where @var{nonterm} is the nonterminal symbol defined and
 270 @var{rules} the list of rules that describe this nonterminal.  Each
 271 rule is a list:
 272
 273 @code{(@var{components} [@var{precedence}] [@var{action}])}
 274
 275 Where:
 276
 277 @table @var
 278 @item components
 279 Is a list of various terminals and nonterminals that are put together
 280 by this rule.
 281
 282 For example,
 283
 284 @example
 285 @group
 286 (exp ((exp ?+ exp))          ;; exp: exp '+' exp
 287      )                       ;;    ;
 288 @end group
 289 @end example
 290
 291 Says that two groupings of type @samp{exp}, with a @samp{+} token in
 292 between, can be combined into a larger grouping of type @samp{exp}.
 293
 294 @cindex grammar coding conventions
 295 By convention, a nonterminal symbol should be in lower case, such as
 296 @samp{exp}, @samp{stmt} or @samp{declaration}.  Terminal symbols
 297 should be upper case to distinguish them from nonterminals: for
 298 example, @samp{INTEGER}, @samp{IDENTIFIER}, @samp{IF} or
 299 @samp{RETURN}.  A terminal symbol that represents a particular keyword
 300 in the language is conventionally the same as that keyword converted
 301 to upper case.  The terminal symbol @code{error} is reserved for error
 302 recovery.
 303
 304 @cindex middle-rule actions
 305 Scattered among the components can be @dfn{middle-rule} actions.
 306 Usually only @var{action} is provided (@pxref{action}).
 307
 308 If @var{components} in a rule is @code{nil}, it means that the rule
 309 can match the empty string.  For example, here is how to define a
 310 comma-separated sequence of zero or more @samp{exp} groupings:
 311
 312 @example
 313 @group
 314 (expseq  (nil)               ;; expseq: ;; empty
 315          ((expseq1))         ;;       | expseq1
 316          )                   ;;       ;
 317
 318 (expseq1 ((exp))             ;; expseq1: exp
 319          ((expseq1 ?, exp))  ;;        | expseq1 ',' exp
 320          )                   ;;        ;
 321 @end group
 322 @end example
 323
 324 @cindex precedence level
 325 @item precedence
 326 Assign the rule the precedence of the given terminal item, overriding
 327 the precedence that would be deduced for it, that is the one of the
 328 last terminal in it.  Notice that only terminals declared in
 329 @var{assocs} have a precedence level.  The altered rule precedence
 330 then affects how conflicts involving that rule are resolved.
 331
 332 @var{precedence} is an optional vector of one terminal item.
 333
 334 Here is how @var{precedence} solves the problem of unary minus.
 335 First, declare a precedence for a fictitious terminal symbol named
 336 @code{UMINUS}.  There are no tokens of this type, but the symbol
 337 serves to stand for its precedence:
 338
 339 @example
 340 @dots{}
 341 ((default-prec t) ;; This is the default
 342  (left '+' '-')
 343  (left '*')
 344  (left UMINUS))
 345 @end example
 346
 347 Now the precedence of @code{UMINUS} can be used in specific rules:
 348
 349 @example
 350 @group
 351 (exp    @dots{}                  ;; exp:    @dots{}
 352          ((exp ?- exp))      ;;         | exp '-' exp
 353         @dots{}                  ;;         @dots{}
 354          ((?- exp) [UMINUS]) ;;         | '-' exp %prec UMINUS
 355         @dots{}                  ;;         @dots{}
 356         )                    ;;         ;
 357 @end group
 358 @end example
 359
 360 If you forget to append @code{[UMINUS]} to the rule for unary minus,
 361 Wisent silently assumes that minus has its usual precedence.  This
 362 kind of problem can be tricky to debug, since one typically discovers
 363 the mistake only by testing the code.
 364
 365 Using @code{(default-prec nil)} declaration makes it easier to
 366 discover this kind of problem systematically.  It causes rules that
 367 lack a @var{precedence} modifier to have no precedence, even if the
 368 last terminal symbol mentioned in their components has a declared
 369 precedence.
 370
 371 If @code{(default-prec nil)} is in effect, you must specify
 372 @var{precedence} for all rules that participate in precedence conflict
 373 resolution.  Then you will see any shift/reduce conflict until you
 374 tell Wisent how to resolve it, either by changing your grammar or by
 375 adding an explicit precedence.  This will probably add declarations to
 376 the grammar, but it helps to protect against incorrect rule
 377 precedences.
 378
 379 The effect of @code{(default-prec nil)} can be reversed by giving
 380 @code{(default-prec t)}, which is the default.
 381
 382 For more details, see @ref{Contextual Precedence, , , bison}, in the
 383 Bison manual.
 384
 385 It is important to understand that @var{assocs} declarations defines
 386 associativity but also assign a precedence level to terminals.  All
 387 terminals declared in the same @code{left}, @code{right} or
 388 @code{nonassoc} association get the same precedence level.  The
 389 precedence level is increased at each new association.
 390
 391 On the other hand, @var{precedence} explicitly assign the precedence
 392 level of the given terminal to a rule.
 393
 394 @cindex semantic actions
 395 @item @anchor{action}action
 396 An action is an optional Emacs Lisp function call, like this:
 397
 398 @code{(identity $1)}
 399
 400 The result of an action determines the semantic value of a rule.
 401
 402 From an implementation standpoint, the function call will be embedded
 403 in a lambda expression, and several useful local variables will be
 404 defined:
 405
 406 @table @code
 407 @vindex $N
 408 @item $@var{n}
 409 Where @var{n} is a positive integer.  Like in Bison, the value of
 410 @code{$@var{n}} is the semantic value of the @var{n}th element of
 411 @var{components}, starting from 1.  It can be of any Lisp data
 412 type.
 413
 414 @vindex $region@var{n}
 415 @item $regionN
 416 Where @var{n} is a positive integer.  For each @code{$@var{n}}
 417 variable defined there is a corresponding @code{$region@var{n}}
 418 variable.  Its value is a pair @code{(@var{start-pos} .
 419 @var{end-pos})} that represent the start and end positions (in the
 420 lexical input stream) of the @code{$@var{n}} value.  It can be
 421 @code{nil} when the component positions are not available, like for an
 422 empty string component for example.
 423
 424 @vindex $region
 425 @item $region
 426 Its value is the leftmost and rightmost positions of input data
 427 matched by all @var{components} in the rule.  This is a pair
 428 @code{(@var{leftmost-pos} .  @var{rightmost-pos})}.  It can be
 429 @code{nil} when components positions are not available.
 430
 431 @vindex $nterm
 432 @item $nterm
 433 This variable is initialized with the nonterminal symbol
 434 (@var{nonterm}) the rule belongs to.  It could be useful to improve
 435 error reporting or debugging.  It is also used to automatically
 436 provide incremental re-parse entry points for @semantic{} tags
 437 (@pxref{Wisent Semantic}).
 438
 439 @vindex $action
 440 @item $action
 441 The value of @code{$action} is the symbolic name of the current
 442 semantic action (@pxref{Debugging actions}).
 443 @end table
 444
 445 When an action is not specified a default value is supplied, it is
 446 @code{(identity $1)}.  This means that the default semantic value of a
 447 rule is the value of its first component.  Excepted for a rule
 448 matching the empty string, for which the default action is to return
 449 @code{nil}.
 450 @end table
 451 @end table
 452
 453 @node Example, Compiling a grammar, Grammar format, Wisent Grammar
 454 @comment  node-name,  next,  previous,  up
 455 @section Example
 456
 457 @cindex grammar example
 458 Here is an example to parse simple infix arithmetic expressions.  See
 459 @ref{Infix Calc, , , bison}, in the Bison manual for details.
 460
 461 @lisp
 462 @group
 463 '(
 464   ;; Terminals
 465   (NUM)
 466
 467   ;; Terminal associativity & precedence
 468   ((nonassoc ?=)
 469    (left ?- ?+)
 470    (left ?* ?/)
 471    (left NEG)
 472    (right ?^))
 473
 474   ;; Rules
 475   (input
 476    ((line))
 477    ((input line)
 478     (format "%s %s" $1 $2))
 479    )
 480
 481   (line
 482    ((?;)
 483     (progn ";"))
 484    ((exp ?;)
 485     (format "%s;" $1))
 486    ((error ?;)
 487     (progn "Error;")))
 488    )
 489
 490   (exp
 491    ((NUM)
 492     (string-to-number $1))
 493    ((exp ?= exp)
 494     (= $1 $3))
 495    ((exp ?+ exp)
 496     (+ $1 $3))
 497    ((exp ?- exp)
 498     (- $1 $3))
 499    ((exp ?* exp)
 500     (* $1 $3))
 501    ((exp ?/ exp)
 502     (/ $1 $3))
 503    ((?- exp) [NEG]
 504     (- $2))
 505    ((exp ?^ exp)
 506     (expt $1 $3))
 507    ((?\( exp ?\))
 508     (progn $2))
 509    )
 510   )
 511 @end group
 512 @end lisp
 513
 514 In the bison-like @dfn{WY} format (@pxref{Wisent Semantic}) the
 515 grammar looks like this:
 516
 517 @example
 518 @group
 519 %token <number> NUM
 520
 521 %nonassoc '=' ;; comparison
 522 %left '-' '+'
 523 %left '*' '/'
 524 %left NEG     ;; negation--unary minus
 525 %right '^'    ;; exponentiation
 526
 527 %%
 528
 529 input:
 530     line
 531   | input line
 532     (format "%s %s" $1 $2)
 533   ;
 534
 535 line:
 536     ';'
 537     @{";"@}
 538   | exp ';'
 539     (format "%s;" $1)
 540   | error ';'
 541     @{"Error;"@}
 542   ;
 543
 544 exp:
 545     NUM
 546     (string-to-number $1)
 547   | exp '=' exp
 548     (= $1 $3)
 549   | exp '+' exp
 550     (+ $1 $3)
 551   | exp '-' exp
 552     (- $1 $3)
 553   | exp '*' exp
 554     (* $1 $3)
 555   | exp '/' exp
 556     (/ $1 $3)
 557   | '-' exp %prec NEG
 558     (- $2)
 559   | exp '^' exp
 560     (expt $1 $3)
 561   | '(' exp ')'
 562     @{$2@}
 563   ;
 564
 565 %%
 566 @end group
 567 @end example
 568
 569 @node Compiling a grammar, Conflicts, Example, Wisent Grammar
 570 @comment  node-name,  next,  previous,  up
 571 @section Compiling a grammar
 572
 573 @cindex automaton
 574 After providing a context-free grammar in a suitable format, it must
 575 be translated into a set of tables (an @dfn{automaton}) that will be
 576 used to derive the parser.  Like Bison, Wisent translates grammars that
 577 must be @dfn{LALR(1)}.
 578
 579 @cindex LALR(1) grammar
 580 @cindex look-ahead token
 581 A grammar is @acronym{LALR(1)} if it is possible to tell how to parse
 582 any portion of an input string with just a single token of look-ahead:
 583 the @dfn{look-ahead token}.  See @ref{Language and Grammar, , ,
 584 bison}, in the Bison manual for more information.
 585
 586 @cindex grammar compilation
 587 Grammar translation (compilation) is achieved by the function:
 588
 589 @cindex compiling a grammar
 590 @vindex wisent-single-start-flag
 591 @findex wisent-compile-grammar
 592 @defun wisent-compile-grammar grammar &optional start-list
 593 Compile @var{grammar} and return an @acronym{LALR(1)} automaton.
 594
 595 Optional argument @var{start-list} is a list of start symbols
 596 (nonterminals).  If @code{nil} the first nonterminal defined in the
 597 grammar is the default start symbol.  If @var{start-list} contains
 598 only one element, it defines the start symbol.  If @var{start-list}
 599 contains more than one element, all are defined as potential start
 600 symbols, unless @code{wisent-single-start-flag} is non-@code{nil}.  In
 601 that case the first element of @var{start-list} defines the start
 602 symbol and others are ignored.
 603
 604 The @acronym{LALR(1)} automaton is a vector of the form:
 605
 606 @code{[@var{actions gotos starts functions}]}
 607
 608 @table @var
 609 @item actions
 610 A state/token matrix telling the parser what to do at every state
 611 based on the current look-ahead token.  That is shift, reduce, accept
 612 or error.  See also @ref{Wisent Parsing}.
 613
 614 @item gotos
 615 A state/nonterminal matrix telling the parser the next state to go to
 616 after reducing with each rule.
 617
 618 @item starts
 619 An alist which maps the allowed start symbols (nonterminals) to
 620 lexical tokens that will be first shifted into the parser stack.
 621
 622 @item functions
 623 An obarray of semantic action symbols.  A semantic action is actually
 624 an Emacs Lisp function (lambda expression).
 625 @end table
 626 @end defun
 627
 628 @node Conflicts, , Compiling a grammar, Wisent Grammar
 629 @comment  node-name,  next,  previous,  up
 630 @section Conflicts
 631
 632 Normally, a grammar should produce an automaton where at each state
 633 the parser has only one action to do (@pxref{Wisent Parsing}).
 634
 635 @cindex ambiguous grammar
 636 In certain cases, a grammar can produce an automaton where, at some
 637 states, there are more than one action possible.  Such a grammar is
 638 @dfn{ambiguous}, and generates @dfn{conflicts}.
 639
 640 @cindex deterministic automaton
 641 The parser can't be driven by an automaton which isn't completely
 642 @dfn{deterministic}, that is which contains conflicts.  It is
 643 necessary to resolve the conflicts to eliminate them.  Wisent resolves
 644 conflicts like Bison does.
 645
 646 @cindex grammar conflicts
 647 @cindex conflicts resolution
 648 There are two sorts of conflicts:
 649
 650 @table @dfn
 651 @cindex shift/reduce conflicts
 652 @item shift/reduce conflicts
 653 When either a shift or a reduction would be valid at the same state.
 654
 655 Such conflicts are resolved by choosing to shift, unless otherwise
 656 directed by operator precedence declarations.
 657 See @ref{Shift/Reduce , , , bison}, in the Bison manual for more
 658 information.
 659
 660 @cindex reduce/reduce conflicts
 661 @item reduce/reduce conflicts
 662 That occurs if there are two or more rules that apply to the same
 663 sequence of input.  This usually indicates a serious error in the
 664 grammar.
 665
 666 Such conflicts are resolved by choosing to use the rule that appears
 667 first in the grammar, but it is very risky to rely on this.  Every
 668 reduce/reduce conflict must be studied and usually eliminated.  See
 669 @ref{Reduce/Reduce , , , bison}, in the Bison manual for more
 670 information.
 671 @end table
 672
 673 @menu
 674 * Grammar Debugging::
 675 * Understanding the automaton::
 676 @end menu
 677
 678 @node Grammar Debugging
 679 @subsection Grammar debugging
 680
 681 @cindex grammar debugging
 682 @cindex grammar verbose description
 683 To help writing a new grammar, @code{wisent-compile-grammar} can
 684 produce a verbose report containing a detailed description of the
 685 grammar and parser (equivalent to what Bison reports with the
 686 @option{--verbose} option).
 687
 688 To enable the verbose report you can set to non-@code{nil} the
 689 variable:
 690
 691 @vindex wisent-verbose-flag
 692 @deffn Option wisent-verbose-flag
 693 non-@code{nil} means to report verbose information on generated parser.
 694 @end deffn
 695
 696 Or interactively use the command:
 697
 698 @findex wisent-toggle-verbose-flag
 699 @deffn Command wisent-toggle-verbose-flag
 700 Toggle whether to report verbose information on generated parser.
 701 @end deffn
 702
 703 The verbose report is printed in the temporary buffer
 704 @code{*wisent-log*} when running interactively, or in file
 705 @file{wisent.output} when running in batch mode.  Different
 706 reports are separated from each other by a line like this:
 707
 708 @example
 709 @group
 710 *** Wisent @var{source-file} - 2002-06-27 17:33
 711 @end group
 712 @end example
 713
 714 where @var{source-file} is the name of the Emacs Lisp file from which
 715 the grammar was read.  See @ref{Understanding the automaton}, for
 716 details on the verbose report.
 717
 718 @table @strong
 719 @item Please Note
 720 To help debugging the grammar compiler itself, you can set this
 721 variable to print the content of some internal data structures:
 722
 723 @vindex wisent-debug-flag
 724 @defvar wisent-debug-flag
 725 non-@code{nil} means enable some debug stuff.
 726 @end defvar
 727 @end table
 728
 729 @node Understanding the automaton
 730 @subsection Understanding the automaton
 731
 732 @cindex understanding the automaton
 733 This section (took from the manual of Bison 1.49) describes how to use
 734 the verbose report printed by @code{wisent-compile-grammar} to
 735 understand the generated automaton, to tune or fix a grammar.
 736
 737 We will use the following example:
 738
 739 @example
 740 @group
 741 (let ((wisent-verbose-flag t)) ;; Print a verbose report!
 742   (wisent-compile-grammar
 743    '((NUM STR)                          ; %token NUM STR
 744
 745      ((left ?+ ?-)                      ; %left '+' '-';
 746       (left ?*))                        ; %left '*'
 747
 748      (exp                               ; exp:
 749       ((exp ?+ exp))                    ;    exp '+' exp
 750       ((exp ?- exp))                    ;  | exp '-' exp
 751       ((exp ?* exp))                    ;  | exp '*' exp
 752       ((exp ?/ exp))                    ;  | exp '/' exp
 753       ((NUM))                           ;  | NUM
 754       )                                 ;  ;
 755
 756      (useless                           ; useless:
 757       ((STR))                           ;    STR
 758       )                                 ;  ;
 759      )
 760    'nil)                                ; no %start declarations
 761   )
 762 @end group
 763 @end example
 764
 765 When evaluating the above expression, grammar compilation first issues
 766 the following two clear messages:
 767
 768 @example
 769 @group
 770 Grammar contains 1 useless nonterminals and 1 useless rules
 771 Grammar contains 7 shift/reduce conflicts
 772 @end group
 773 @end example
 774
 775 The @samp{*wisent-log*} buffer details things!
 776
 777 The first section reports conflicts that were solved using precedence
 778 and/or associativity:
 779
 780 @example
 781 @group
 782 Conflict in state 7 between rule 1 and token '+' resolved as reduce.
 783 Conflict in state 7 between rule 1 and token '-' resolved as reduce.
 784 Conflict in state 7 between rule 1 and token '*' resolved as shift.
 785 Conflict in state 8 between rule 2 and token '+' resolved as reduce.
 786 Conflict in state 8 between rule 2 and token '-' resolved as reduce.
 787 Conflict in state 8 between rule 2 and token '*' resolved as shift.
 788 Conflict in state 9 between rule 3 and token '+' resolved as reduce.
 789 Conflict in state 9 between rule 3 and token '-' resolved as reduce.
 790 Conflict in state 9 between rule 3 and token '*' resolved as reduce.
 791 @end group
 792 @end example
 793
 794 The next section reports useless tokens, nonterminal and rules (note
 795 that useless tokens might be used by the scanner):
 796
 797 @example
 798 @group
 799 Useless nonterminals:
 800
 801    useless
 802
 803
 804 Terminals which are not used:
 805
 806    STR
 807
 808
 809 Useless rules:
 810
 811 #6     useless: STR;
 812 @end group
 813 @end example
 814
 815 The next section lists states that still have conflicts:
 816
 817 @example
 818 @group
 819 State 7 contains 1 shift/reduce conflict.
 820 State 8 contains 1 shift/reduce conflict.
 821 State 9 contains 1 shift/reduce conflict.
 822 State 10 contains 4 shift/reduce conflicts.
 823 @end group
 824 @end example
 825
 826 The next section reproduces the grammar used:
 827
 828 @example
 829 @group
 830 Grammar
 831
 832   Number, Rule
 833   1       exp -> exp '+' exp
 834   2       exp -> exp '-' exp
 835   3       exp -> exp '*' exp
 836   4       exp -> exp '/' exp
 837   5       exp -> NUM
 838 @end group
 839 @end example
 840
 841 And reports the uses of the symbols:
 842
 843 @example
 844 @group
 845 Terminals, with rules where they appear
 846
 847 $EOI (-1)
 848 error (1)
 849 NUM (2) 5
 850 STR (3) 6
 851 '+' (4) 1
 852 '-' (5) 2
 853 '*' (6) 3
 854 '/' (7) 4
 855
 856
 857 Nonterminals, with rules where they appear
 858
 859 exp (8)
 860     on left: 1 2 3 4 5, on right: 1 2 3 4
 861 @end group
 862 @end example
 863
 864 The report then details the automaton itself, describing each state
 865 with it set of @dfn{items}, also known as @dfn{pointed rules}.  Each
 866 item is a production rule together with a point (marked by @samp{.})
 867 that the input cursor.
 868
 869 @example
 870 @group
 871 state 0
 872
 873     NUM shift, and go to state 1
 874
 875     exp go to state 2
 876 @end group
 877 @end example
 878
 879 State 0 corresponds to being at the very beginning of the parsing, in
 880 the initial rule, right before the start symbol (@samp{exp}).  When
 881 the parser returns to this state right after having reduced a rule
 882 that produced an @samp{exp}, it jumps to state 2.  If there is no such
 883 transition on a nonterminal symbol, and the lookahead is a @samp{NUM},
 884 then this token is shifted on the parse stack, and the control flow
 885 jumps to state 1.  Any other lookahead triggers a parse error.
 886
 887 In the state 1...
 888
 889 @example
 890 @group
 891 state 1
 892
 893     exp  ->  NUM .   (rule 5)
 894
 895     $default    reduce using rule 5 (exp)
 896 @end group
 897 @end example
 898
 899 the rule 5, @samp{exp: NUM;}, is completed.  Whatever the lookahead
 900 (@samp{$default}), the parser will reduce it.  If it was coming from
 901 state 0, then, after this reduction it will return to state 0, and
 902 will jump to state 2 (@samp{exp: go to state 2}).
 903
 904 @example
 905 @group
 906 state 2
 907
 908     exp  ->  exp . '+' exp   (rule 1)
 909     exp  ->  exp . '-' exp   (rule 2)
 910     exp  ->  exp . '*' exp   (rule 3)
 911     exp  ->  exp . '/' exp   (rule 4)
 912
 913     $EOI        shift, and go to state 11
 914     '+' shift, and go to state 3
 915     '-' shift, and go to state 4
 916     '*' shift, and go to state 5
 917     '/' shift, and go to state 6
 918 @end group
 919 @end example
 920
 921 In state 2, the automaton can only shift a symbol.  For instance,
 922 because of the item @samp{exp -> exp . '+' exp}, if the lookahead if
 923 @samp{+}, it will be shifted on the parse stack, and the automaton
 924 control will jump to state 3, corresponding to the item
 925 @samp{exp -> exp . '+' exp}:
 926
 927 @example
 928 @group
 929 state 3
 930
 931     exp  ->  exp '+' . exp   (rule 1)
 932
 933     NUM shift, and go to state 1
 934
 935     exp go to state 7
 936 @end group
 937 @end example
 938
 939 Since there is no default action, any other token than those listed
 940 above will trigger a parse error.
 941
 942 The interpretation of states 4 to 6 is straightforward:
 943
 944 @example
 945 @group
 946 state 4
 947
 948     exp  ->  exp '-' . exp   (rule 2)
 949
 950     NUM shift, and go to state 1
 951
 952     exp go to state 8
 953
 954
 955
 956 state 5
 957
 958     exp  ->  exp '*' . exp   (rule 3)
 959
 960     NUM shift, and go to state 1
 961
 962     exp go to state 9
 963
 964
 965
 966 state 6
 967
 968     exp  ->  exp '/' . exp   (rule 4)
 969
 970     NUM shift, and go to state 1
 971
 972     exp go to state 10
 973 @end group
 974 @end example
 975
 976 As was announced in beginning of the report, @samp{State 7 contains 1
 977 shift/reduce conflict.}:
 978
 979 @example
 980 @group
 981 state 7
 982
 983     exp  ->  exp . '+' exp   (rule 1)
 984     exp  ->  exp '+' exp .   (rule 1)
 985     exp  ->  exp . '-' exp   (rule 2)
 986     exp  ->  exp . '*' exp   (rule 3)
 987     exp  ->  exp . '/' exp   (rule 4)
 988
 989     '*' shift, and go to state 5
 990     '/' shift, and go to state 6
 991
 992     '/' [reduce using rule 1 (exp)]
 993     $default    reduce using rule 1 (exp)
 994 @end group
 995 @end example
 996
 997 Indeed, there are two actions associated to the lookahead @samp{/}:
 998 either shifting (and going to state 6), or reducing rule 1.  The
 999 conflict means that either the grammar is ambiguous, or the parser
1000 lacks information to make the right decision.  Indeed the grammar is
1001 ambiguous, as, since we did not specify the precedence of @samp{/},
1002 the sentence @samp{NUM + NUM / NUM} can be parsed as @samp{NUM + (NUM
1003 / NUM)}, which corresponds to shifting @samp{/}, or as @samp{(NUM +
1004 NUM) / NUM}, which corresponds to reducing rule 1.
1005
1006 Because in @acronym{LALR(1)} parsing a single decision can be made,
1007 Wisent arbitrarily chose to disable the reduction, see
1008 @ref{Conflicts}.  Discarded actions are reported in between square
1009 brackets.
1010
1011 Note that all the previous states had a single possible action: either
1012 shifting the next token and going to the corresponding state, or
1013 reducing a single rule.  In the other cases, i.e., when shifting
1014 @emph{and} reducing is possible or when @emph{several} reductions are
1015 possible, the lookahead is required to select the action.  State 7 is
1016 one such state: if the lookahead is @samp{*} or @samp{/} then the
1017 action is shifting, otherwise the action is reducing rule 1.  In other
1018 words, the first two items, corresponding to rule 1, are not eligible
1019 when the lookahead is @samp{*}, since we specified that @samp{*} has
1020 higher precedence that @samp{+}.  More generally, some items are
1021 eligible only with some set of possible lookaheads.
1022
1023 States 8 to 10 are similar:
1024
1025 @example
1026 @group
1027 state 8
1028
1029     exp  ->  exp . '+' exp   (rule 1)
1030     exp  ->  exp . '-' exp   (rule 2)
1031     exp  ->  exp '-' exp .   (rule 2)
1032     exp  ->  exp . '*' exp   (rule 3)
1033     exp  ->  exp . '/' exp   (rule 4)
1034
1035     '*' shift, and go to state 5
1036     '/' shift, and go to state 6
1037
1038     '/' [reduce using rule 2 (exp)]
1039     $default    reduce using rule 2 (exp)
1040
1041
1042
1043 state 9
1044
1045     exp  ->  exp . '+' exp   (rule 1)
1046     exp  ->  exp . '-' exp   (rule 2)
1047     exp  ->  exp . '*' exp   (rule 3)
1048     exp  ->  exp '*' exp .   (rule 3)
1049     exp  ->  exp . '/' exp   (rule 4)
1050
1051     '/' shift, and go to state 6
1052
1053     '/' [reduce using rule 3 (exp)]
1054     $default    reduce using rule 3 (exp)
1055
1056
1057
1058 state 10
1059
1060     exp  ->  exp . '+' exp   (rule 1)
1061     exp  ->  exp . '-' exp   (rule 2)
1062     exp  ->  exp . '*' exp   (rule 3)
1063     exp  ->  exp . '/' exp   (rule 4)
1064     exp  ->  exp '/' exp .   (rule 4)
1065
1066     '+' shift, and go to state 3
1067     '-' shift, and go to state 4
1068     '*' shift, and go to state 5
1069     '/' shift, and go to state 6
1070
1071     '+' [reduce using rule 4 (exp)]
1072     '-' [reduce using rule 4 (exp)]
1073     '*' [reduce using rule 4 (exp)]
1074     '/' [reduce using rule 4 (exp)]
1075     $default    reduce using rule 4 (exp)
1076 @end group
1077 @end example
1078
1079 Observe that state 10 contains conflicts due to the lack of precedence
1080 of @samp{/} wrt @samp{+}, @samp{-}, and @samp{*}, but also because the
1081 associativity of @samp{/} is not specified.
1082
1083 Finally, the state 11 (plus 12) is named the @dfn{final state}, or the
1084 @dfn{accepting state}:
1085
1086 @example
1087 @group
1088 state 11
1089
1090     $EOI        shift, and go to state 12
1091
1092
1093
1094 state 12
1095
1096     $default    accept
1097 @end group
1098 @end example
1099
1100 The end of input is shifted @samp{$EOI shift,} and the parser exits
1101 successfully (@samp{go to state 12}, that terminates).
1102
1103 @node Wisent Parsing
1104 @chapter Wisent Parsing
1105
1106 @cindex bottom-up parser
1107 @cindex shift-reduce parser
1108 The Wisent's parser is what is called a @dfn{bottom-up} or
1109 @dfn{shift-reduce} parser which repeatedly:
1110
1111 @table @dfn
1112 @cindex shift
1113 @item shift
1114 That is pushes the value of the last lexical token read (the
1115 look-ahead token) into a value stack, and reads a new one.
1116
1117 @cindex reduce
1118 @item reduce
1119 That is replaces a nonterminal by its semantic value.  The values of
1120 the components which form the right hand side of a rule are popped
1121 from the value stack and reduced by the semantic action of this rule.
1122 The result is pushed back on top of value stack.
1123 @end table
1124
1125 The parser will stop on:
1126
1127 @table @dfn
1128 @cindex accept
1129 @item accept
1130 When all input has been successfully parsed.  The semantic value of
1131 the start nonterminal is on top of the value stack.
1132
1133 @cindex syntax error
1134 @item error
1135 When a syntax error (an unexpected token in input) has been detected.
1136 At this point the parser issues an error message and either stops or
1137 calls a recovery routine to try to resume parsing.
1138 @end table
1139
1140 @cindex table-driven parser
1141 The above elementary actions are driven by the @acronym{LALR(1)}
1142 automaton built by @code{wisent-compile-grammar} from a context-free
1143 grammar.
1144
1145 The Wisent's parser is entered by calling the function:
1146
1147 @findex wisent-parse
1148 @defun wisent-parse automaton lexer &optional error start
1149 Parse input using the automaton specified in @var{automaton}.
1150
1151 @table @var
1152 @item automaton
1153 Is an @acronym{LALR(1)} automaton generated by
1154 @code{wisent-compile-grammar} (@pxref{Wisent Grammar}).
1155
1156 @item lexer
1157 Is a function with no argument called by the parser to obtain the next
1158 terminal (token) in input (@pxref{Writing a lexer}).
1159
1160 @item error
1161 Is an optional reporting function called when a parse error occurs.
1162 It receives a message string to report.  It defaults to the function
1163 @code{wisent-message} (@pxref{Report errors}).
1164
1165 @item start
1166 Specify the start symbol (nonterminal) used by the parser as its goal.
1167 It defaults to the start symbol defined in the grammar
1168 (@pxref{Wisent Grammar}).
1169 @end table
1170 @end defun
1171
1172 The following two normal hooks permit to do some useful processing
1173 respectively before to start parsing, and after the parser terminated.
1174
1175 @vindex wisent-pre-parse-hook
1176 @defvar wisent-pre-parse-hook
1177 Normal hook run just before entering the @var{LR} parser engine.
1178 @end defvar
1179
1180 @vindex wisent-post-parse-hook
1181 @defvar wisent-post-parse-hook
1182 Normal hook run just after the @var{LR} parser engine terminated.
1183 @end defvar
1184
1185 @menu
1186 * Writing a lexer::
1187 * Actions goodies::
1188 * Report errors::
1189 * Error recovery::
1190 * Debugging actions::
1191 @end menu
1192
1193 @node Writing a lexer
1194 @section What the parser must receive
1195
1196 It is important to understand that the parser does not parse
1197 characters, but lexical tokens, and does not know anything about
1198 characters in text streams!
1199
1200 @cindex lexical analysis
1201 @cindex lexer
1202 @cindex scanner
1203 Reading input data to produce lexical tokens is performed by a lexer
1204 (also called a scanner) in a lexical analysis step, before the syntax
1205 analysis step performed by the parser.  The parser automatically calls
1206 the lexer when it needs the next token to parse.
1207
1208 @cindex lexical tokens
1209 A Wisent's lexer is an Emacs Lisp function with no argument.  It must
1210 return a valid lexical token of the form:
1211
1212 @code{(@var{token-class value} [@var{start} . @var{end}])}
1213
1214 @table @var
1215 @item token-class
1216 Is a category of lexical token identifying a terminal as specified in
1217 the grammar (@pxref{Wisent Grammar}).  It can be a symbol or a character
1218 literal.
1219
1220 @item value
1221 Is the value of the lexical token.  It can be of any valid Emacs Lisp
1222 data type.
1223
1224 @item start
1225 @itemx end
1226 Are the optionals beginning and end positions of @var{value} in the
1227 input stream.
1228 @end table
1229
1230 When there are no more tokens to read the lexer must return the token
1231 @code{(list wisent-eoi-term)} to each request.
1232
1233 @vindex wisent-eoi-term
1234 @defvar wisent-eoi-term
1235 Predefined constant, End-Of-Input terminal symbol.
1236 @end defvar
1237
1238 @code{wisent-lex} is an example of a lexer that reads lexical tokens
1239 produced by a @semantic{} lexer, and translates them into lexical tokens
1240 suitable to the Wisent parser.  See also @ref{Wisent Lex}.
1241
1242 To call the lexer in a semantic action use the function
1243 @code{wisent-lexer}.  See also @ref{Actions goodies}.
1244
1245 @node Actions goodies
1246 @section Variables and macros useful in grammar actions.
1247
1248 @vindex wisent-input
1249 @defvar wisent-input
1250 The last token read.
1251 This variable only has meaning in the scope of @code{wisent-parse}.
1252 @end defvar
1253
1254 @findex wisent-lexer
1255 @defun wisent-lexer
1256 Obtain the next terminal in input.
1257 @end defun
1258
1259 @findex wisent-region
1260 @defun wisent-region &rest positions
1261 Return the start/end positions of the region including
1262 @var{positions}.  Each element of @var{positions} is a pair
1263 @w{@code{(@var{start-pos} .  @var{end-pos})}} or @code{nil}.  The
1264 returned value is the pair @w{@code{(@var{min-start-pos} .
1265 @var{max-end-pos})}} or @code{nil} if no @var{positions} are
1266 available.
1267 @end defun
1268
1269 @node Report errors
1270 @section The error reporting function
1271
1272 @cindex error reporting
1273 When the parser encounters a syntax error it calls a user-defined
1274 function.  It must be an Emacs Lisp function with one argument: a
1275 string containing the message to report.
1276
1277 By default the parser uses this function to report error messages:
1278
1279 @findex wisent-message
1280 @defun wisent-message string &rest args
1281 Print a one-line message if @code{wisent-parse-verbose-flag} is set.
1282 Pass @var{string} and @var{args} arguments to @dfn{message}.
1283 @end defun
1284
1285 @table @strong
1286 @item Please Note:
1287 @code{wisent-message} uses the following function to print lexical
1288 tokens:
1289
1290 @defun wisent-token-to-string token
1291 Return a printed representation of lexical token @var{token}.
1292 @end defun
1293
1294 The general printed form of a lexical token is:
1295
1296 @w{@code{@var{token}(@var{value})@@@var{location}}}
1297 @end table
1298
1299 To control the verbosity of the parser you can set to non-@code{nil}
1300 this variable:
1301
1302 @vindex wisent-parse-verbose-flag
1303 @deffn Option wisent-parse-verbose-flag
1304 non-@code{nil} means to issue more messages while parsing.
1305 @end deffn
1306
1307 Or interactively use the command:
1308
1309 @findex wisent-parse-toggle-verbose-flag
1310 @deffn Command wisent-parse-toggle-verbose-flag
1311 Toggle whether to issue more messages while parsing.
1312 @end deffn
1313
1314 When the error reporting function is entered the variable
1315 @code{wisent-input} contains the unexpected token as returned by the
1316 lexer.
1317
1318 The error reporting function can be called from a semantic action too
1319 using the special macro @code{wisent-error}.  When called from a
1320 semantic action entered by error recovery (@pxref{Error recovery}) the
1321 value of the variable @code{wisent-recovering} is non-@code{nil}.
1322
1323 @node Error recovery
1324 @section Error recovery
1325
1326 @cindex error recovery
1327 The error recovery mechanism of the Wisent's parser conforms to the
1328 one Bison uses.  See @ref{Error Recovery, , , bison}, in the Bison
1329 manual for details.
1330
1331 @cindex error token
1332 To recover from a syntax error you must write rules to recognize the
1333 special token @code{error}.  This is a terminal symbol that is
1334 automatically defined and reserved for error handling.
1335
1336 When the parser encounters a syntax error, it pops the state stack
1337 until it finds a state that allows shifting the @code{error} token.
1338 After it has been shifted, if the old look-ahead token is not
1339 acceptable to be shifted next, the parser reads tokens and discards
1340 them until it finds a token which is acceptable.
1341
1342 @cindex error recovery strategy
1343 Strategies for error recovery depend on the choice of error rules in
1344 the grammar.  A simple and useful strategy is simply to skip the rest
1345 of the current statement if an error is detected:
1346
1347 @example
1348 @group
1349 (stmnt (( error ?; )) ;; on error, skip until ';' is read
1350        )
1351 @end group
1352 @end example
1353
1354 It is also useful to recover to the matching close-delimiter of an
1355 opening-delimiter that has already been parsed:
1356
1357 @example
1358 @group
1359 (primary (( ?@{ expr  ?@} ))
1360          (( ?@{ error ?@} ))
1361          @dots{}
1362          )
1363 @end group
1364 @end example
1365
1366 @cindex error recovery actions
1367 Note that error recovery rules may have actions, just as any other
1368 rules can.  Here are some predefined hooks, variables, functions or
1369 macros, useful in such actions:
1370
1371 @vindex wisent-nerrs
1372 @defvar wisent-nerrs
1373 The number of parse errors encountered so far.
1374 @end defvar
1375
1376 @vindex wisent-recovering
1377 @defvar wisent-recovering
1378 non-@code{nil} means that the parser is recovering.
1379 This variable only has meaning in the scope of @code{wisent-parse}.
1380 @end defvar
1381
1382 @findex wisent-error
1383 @defun wisent-error msg
1384 Call the user supplied error reporting function with message
1385 @var{msg} (@pxref{Report errors}).
1386
1387 For an example of use, @xref{wisent-skip-token}.
1388 @end defun
1389
1390 @findex wisent-errok
1391 @defun wisent-errok
1392 Resume generating error messages immediately for subsequent syntax
1393 errors.
1394
1395 The parser suppress error message for syntax errors that happens
1396 shortly after the first, until three consecutive input tokens have
1397 been successfully shifted.
1398
1399 Calling @code{wisent-errok} in an action, make error messages resume
1400 immediately.  No error messages will be suppressed if you call it in
1401 an error rule's action.
1402
1403 For an example of use, @xref{wisent-skip-token}.
1404 @end defun
1405
1406 @findex wisent-clearin
1407 @defun wisent-clearin
1408 Discard the current lookahead token.
1409 This will cause a new lexical token to be read.
1410
1411 In an error rule's action the previous lookahead token is reanalyzed
1412 immediately.  @code{wisent-clearin} may be called to clear this token.
1413
1414 For example, suppose that on a parse error, an error handling routine
1415 is called that advances the input stream to some point where parsing
1416 should once again commence.  The next symbol returned by the lexical
1417 scanner is probably correct.  The previous lookahead token ought to
1418 be discarded with @code{wisent-clearin}.
1419
1420 For an example of use, @xref{wisent-skip-token}.
1421 @end defun
1422
1423 @findex wisent-abort
1424 @defun wisent-abort
1425 Abort parsing and save the lookahead token.
1426 @end defun
1427
1428 @findex wisent-set-region
1429 @defun wisent-set-region start end
1430 Change the region of text matched by the current nonterminal.
1431 @var{start} and @var{end} are respectively the beginning and end
1432 positions of the region occupied by the group of components associated
1433 to this nonterminal.  If @var{start} or @var{end} values are not a
1434 valid positions the region is set to @code{nil}.
1435
1436 For an example of use, @xref{wisent-skip-token}.
1437 @end defun
1438
1439 @vindex wisent-discarding-token-functions
1440 @defvar wisent-discarding-token-functions
1441 List of functions to be called when discarding a lexical token.
1442 These functions receive the lexical token discarded.
1443 When the parser encounters unexpected tokens, it can discards them,
1444 based on what directed by error recovery rules.  Either when the
1445 parser reads tokens until one is found that can be shifted, or when an
1446 semantic action calls the function @code{wisent-skip-token} or
1447 @code{wisent-skip-block}.
1448 For language specific hooks, make sure you define this as a local
1449 hook.
1450
1451 For example, in @semantic{}, this hook is set to the function
1452 @code{wisent-collect-unmatched-syntax} to collect unmatched lexical
1453 tokens (@pxref{Useful functions}).
1454 @end defvar
1455
1456 @findex wisent-skip-token
1457 @defun wisent-skip-token
1458 @anchor{wisent-skip-token}
1459 Skip the lookahead token in order to resume parsing.
1460 Return nil.
1461 Must be used in error recovery semantic actions.
1462
1463 It typically looks like this:
1464
1465 @lisp
1466 @group
1467 (wisent-message "%s: skip %s" $action
1468                 (wisent-token-to-string wisent-input))
1469 (run-hook-with-args
1470  'wisent-discarding-token-functions wisent-input)
1471 (wisent-clearin)
1472 (wisent-errok)))
1473 @end group
1474 @end lisp
1475 @end defun
1476
1477 @findex wisent-skip-block
1478 @defun wisent-skip-block
1479 Safely skip a block in order to resume parsing.
1480 Return nil.
1481 Must be used in error recovery semantic actions.
1482
1483 A block is data between an open-delimiter (syntax class @code{(}) and
1484 a matching close-delimiter (syntax class @code{)}):
1485
1486 @example
1487 @group
1488 (a parenthesized block)
1489 [a block between brackets]
1490 @{a block between braces@}
1491 @end group
1492 @end example
1493
1494 The following example uses @code{wisent-skip-block} to safely skip a
1495 block delimited by @samp{LBRACE} (@code{@{}) and @samp{RBRACE}
1496 (@code{@}}) tokens, when a syntax error occurs in
1497 @samp{other-components}:
1498
1499 @example
1500 @group
1501 (block ((LBRACE other-components RBRACE))
1502        ((LBRACE RBRACE))
1503        ((LBRACE error)
1504         (wisent-skip-block))
1505        )
1506 @end group
1507 @end example
1508 @end defun
1509
1510 @node Debugging actions
1511 @section Debugging semantic actions
1512
1513 @cindex semantic action symbols
1514 Each semantic action is represented by a symbol interned in an
1515 @dfn{obarray} that is part of the @acronym{LALR(1)} automaton
1516 (@pxref{Compiling a grammar}).  @code{symbol-function} on a semantic
1517 action symbol return the semantic action lambda expression.
1518
1519 A semantic action symbol name has the form
1520 @code{@var{nonterminal}:@var{index}}, where @var{nonterminal} is the
1521 name of the nonterminal symbol the action belongs to, and @var{index}
1522 is an action sequence number within the scope of @var{nonterminal}.
1523 For example, this nonterminal definition:
1524
1525 @example
1526 @group
1527 input:
1528    line                     [@code{input:0}]
1529  | input line
1530    (format "%s %s" $1 $2)   [@code{input:1}]
1531  ;
1532 @end group
1533 @end example
1534
1535 Will produce two semantic actions, and associated symbols:
1536
1537 @table @code
1538 @item input:0
1539 A default action that returns @code{$1}.
1540
1541 @item input:1
1542 That returns @code{(format "%s %s" $1 $2)}.
1543 @end table
1544
1545 @cindex debugging semantic actions
1546 Debugging uses the Lisp debugger to investigate what is happening
1547 during execution of semantic actions.
1548 Three commands are available to debug semantic actions.  They receive
1549 two arguments:
1550
1551 @itemize @bullet
1552 @item The automaton that contains the semantic action.
1553
1554 @item The semantic action symbol.
1555 @end itemize
1556
1557 @findex wisent-debug-on-entry
1558 @deffn Command wisent-debug-on-entry automaton function
1559 Request @var{automaton}'s @var{function} to invoke debugger each time it is called.
1560 @var{function} must be a semantic action symbol that exists in @var{automaton}.
1561 @end deffn
1562
1563 @findex wisent-cancel-debug-on-entry
1564 @deffn Command wisent-cancel-debug-on-entry automaton function
1565 Undo effect of @code{wisent-debug-on-entry} on @var{automaton}'s @var{function}.
1566 @var{function} must be a semantic action symbol that exists in @var{automaton}.
1567 @end deffn
1568
1569 @findex wisent-debug-show-entry
1570 @deffn Command wisent-debug-show-entry automaton function
1571 Show the source of @var{automaton}'s semantic action @var{function}.
1572 @var{function} must be a semantic action symbol that exists in @var{automaton}.
1573 @end deffn
1574
1575 @node Wisent Semantic
1576 @chapter How to use Wisent with Semantic
1577
1578 @cindex tags
1579 This section presents how the Wisent's parser can be used to produce
1580 @dfn{tags} for the @semantic{} tool set.
1581
1582 @semantic{} tags form a hierarchy of Emacs Lisp data structures that
1583 describes a program in a way independent of programming languages.
1584 Tags map program declarations, like functions, methods, variables,
1585 data types, classes, includes, grammar rules, etc..
1586
1587 @cindex WY grammar format
1588 To use the Wisent parser with @semantic{} you have to define
1589 your grammar in @dfn{WY} form, a grammar format very close
1590 to the one used by Bison.
1591
1592 Please @inforef{top, Semantic Grammar Framework Manual, grammar-fw}
1593 for more information on @semantic{} grammars.
1594
1595 @menu
1596 * Grammar styles::
1597 * Wisent Lex::
1598 @end menu
1599
1600 @node Grammar styles
1601 @section Grammar styles
1602
1603 @cindex grammar styles
1604 @semantic{} parsing heavily depends on how you wrote the grammar.
1605 There are mainly two styles to write a Wisent's grammar intended to be
1606 used with the @semantic{} tool set: the @dfn{Iterative style} and the
1607 @dfn{Bison style}.  Each one has pros and cons, and in certain cases
1608 it can be worth a mix of the two styles!
1609
1610 @menu
1611 * Iterative style::
1612 * Bison style::
1613 * Mixed style::
1614 * Start nonterminals::
1615 * Useful functions::
1616 @end menu
1617
1618 @node Iterative style, Bison style, Grammar styles, Grammar styles
1619 @subsection Iterative style
1620
1621 @cindex grammar iterative style
1622 The @dfn{iterative style} is the preferred style to use with @semantic{}.
1623 It relies on an iterative parser back-end mechanism which parses start
1624 nonterminals one at a time and automagically skips unexpected lexical
1625 tokens in input.
1626
1627 Compared to rule-based iterative functions (@pxref{Bison style}),
1628 iterative parsers are better in that they can handle obscure errors
1629 more cleanly.
1630
1631 @cindex raw tag
1632 Each start nonterminal must produces a @dfn{raw tag} by calling a
1633 @code{TAG}-like grammar macro with appropriate parameters.  See also
1634 @ref{Start nonterminals}.
1635
1636 @cindex expanded tag
1637 Then, each parsing iteration automatically translates a raw tag into
1638 @dfn{expanded tags}, updating the raw tag structure with internal
1639 properties and buffer related data.
1640
1641 After parsing completes, it results in a tree of expanded tags.
1642
1643 The following example is a snippet of the iterative style Java grammar
1644 provided in the @semantic{} distribution in the file
1645 @file{wisent-java-tags.wy}.
1646
1647 @example
1648 @group
1649 @dots{}
1650 ;; Alternate entry points
1651 ;;    - Needed by partial re-parse
1652 %start formal_parameter
1653 @dots{}
1654 ;;    - Needed by EXPANDFULL clauses
1655 %start formal_parameters
1656 @dots{}
1657
1658 formal_parameter_list
1659   : PAREN_BLOCK
1660     (EXPANDFULL $1 formal_parameters)
1661   ;
1662
1663 formal_parameters
1664   : LPAREN
1665     ()
1666   | RPAREN
1667     ()
1668   | formal_parameter COMMA
1669   | formal_parameter RPAREN
1670   ;
1671
1672 formal_parameter
1673   : formal_parameter_modifier_opt type variable_declarator_id
1674     (VARIABLE-TAG $3 $2 nil :typemodifiers $1)
1675   ;
1676 @end group
1677 @end example
1678
1679 @findex EXPANDFULL
1680 It shows the use of the @code{EXPANDFULL} grammar macro to parse a
1681 @samp{PAREN_BLOCK} which contains a @samp{formal_parameter_list}.
1682 @code{EXPANDFULL} tells to recursively parse @samp{formal_parameters}
1683 inside @samp{PAREN_BLOCK}.  The parser iterates until it digested all
1684 available input data inside the @samp{PAREN_BLOCK}, trying to match
1685 any of the @samp{formal_parameters} rules:
1686
1687 @itemize
1688 @item @samp{LPAREN}
1689
1690 @item @samp{RPAREN}
1691
1692 @item @samp{formal_parameter COMMA}
1693
1694 @item @samp{formal_parameter RPAREN}
1695 @end itemize
1696
1697 At each iteration it will return a @samp{formal_parameter} raw tag,
1698 or @code{nil} to skip unwanted (single @samp{LPAREN} or @samp{RPAREN}
1699 for example) or unexpected input data.  Those raw tags will be
1700 automatically expanded by the iterative back-end parser.
1701
1702 @node Bison style
1703 @subsection Bison style
1704
1705 @cindex grammar bison style
1706 What we call the @dfn{Bison style} is the traditional style of Bison's
1707 grammars.  Compared to iterative style, it is not straightforward to
1708 use grammars written in Bison style in @semantic{}.  Mainly because such
1709 grammars are designed to parse the whole input data in one pass, and
1710 don't use the iterative parser back-end mechanism (@pxref{Iterative
1711 style}).  With Bison style the parser is called once to parse the
1712 grammar start nonterminal.
1713
1714 The following example is a snippet of the Bison style Java grammar
1715 provided in the @semantic{} distribution in the file
1716 @file{wisent-java.wy}.
1717
1718 @example
1719 @group
1720 %start formal_parameter
1721 @dots{}
1722
1723 formal_parameter_list
1724   : formal_parameter_list COMMA formal_parameter
1725     (cons $3 $1)
1726   | formal_parameter
1727     (list $1)
1728   ;
1729
1730 formal_parameter
1731   : formal_parameter_modifier_opt type variable_declarator_id
1732     (EXPANDTAG
1733      (VARIABLE-TAG $3 $2 :typemodifiers $1)
1734      )
1735   ;
1736 @end group
1737 @end example
1738
1739 The first consequence is that syntax errors are not automatically
1740 handled by @semantic{}.  Thus, it is necessary to explicitly handle
1741 them at the grammar level, providing error recovery rules to skip
1742 unexpected input data.
1743
1744 The second consequence is that the iterative parser can't do automatic
1745 tag expansion, except for the start nonterminal value.  It is
1746 necessary to explicitly expand tags from concerned semantic actions by
1747 calling the grammar macro @code{EXPANDTAG} with a raw tag as
1748 parameter.  See also @ref{Start nonterminals}, for incremental
1749 re-parse considerations.
1750
1751 @node Mixed style
1752 @subsection Mixed style
1753
1754 @cindex grammar mixed style
1755 @example
1756 @group
1757 %start grammar
1758 ;; Reparse
1759 %start prologue epilogue declaration nonterminal rule
1760 @dots{}
1761
1762 %%
1763
1764 grammar:
1765     prologue
1766   | epilogue
1767   | declaration
1768   | nonterminal
1769   | PERCENT_PERCENT
1770   ;
1771 @dots{}
1772
1773 nonterminal:
1774     SYMBOL COLON rules SEMI
1775     (TAG $1 'nonterminal :children $3)
1776   ;
1777
1778 rules:
1779     lifo_rules
1780     (apply 'nconc (nreverse $1))
1781   ;
1782
1783 lifo_rules:
1784     lifo_rules OR rule
1785     (cons $3 $1)
1786   | rule
1787     (list $1)
1788   ;
1789
1790 rule:
1791     rhs
1792     (let* ((rhs $1)
1793            name type comps prec action elt)
1794       @dots{}
1795       (EXPANDTAG
1796        (TAG name 'rule :type type :value comps :prec prec :expr action)
1797        ))
1798   ;
1799 @end group
1800 @end example
1801
1802 This example shows how iterative and Bison styles can be combined in
1803 the same grammar to obtain a good compromise between grammar
1804 complexity and an efficient parsing strategy in an interactive
1805 environment.
1806
1807 @samp{nonterminal} is parsed using iterative style via the main
1808 @samp{grammar} rule.  The semantic action uses the @code{TAG} macro to
1809 produce a raw tag, automagically expanded by @semantic{}.
1810
1811 But @samp{rules} part is parsed in Bison style! Why?
1812
1813 Rule delimiters are the colon (@code{:}), that follows the nonterminal
1814 name, and a final semicolon (@code{;}).  Unfortunately these
1815 delimiters are not @code{open-paren}/@code{close-paren} type, and the
1816 Emacs' syntactic analyzer can't easily isolate data between them to
1817 produce a @samp{RULES_PART} parenthesis-block-like lexical token.
1818 Consequently it is not possible to use @code{EXPANDFULL} to iterate in
1819 @samp{RULES_PART}, like this:
1820
1821 @example
1822 @group
1823 nonterminal:
1824     SYMBOL COLON rules SEMI
1825     (TAG $1 'nonterminal :children $3)
1826   ;
1827
1828 rules:
1829     RULES_PART  ;; @strong{Map a parenthesis-block-like lexical token}
1830     (EXPANDFULL $1 'rules)
1831   ;
1832
1833 rules:
1834     COLON
1835     ()
1836     OR
1837     ()
1838     SEMI
1839     ()
1840     rhs
1841     rhs
1842     (let* ((rhs $1)
1843            name type comps prec action elt)
1844       @dots{}
1845       (TAG name 'rule :type type :value comps :prec prec :expr action)
1846       )
1847   ;
1848 @end group
1849 @end example
1850
1851 In such cases, when it is difficult for Emacs to obtain
1852 parenthesis-block-like lexical tokens, the best solution is to use the
1853 traditional Bison style with error recovery!
1854
1855 In some extreme cases, it can also be convenient to extend the lexer,
1856 to deliver new lexical tokens, to simplify the grammar.
1857
1858 @node Start nonterminals
1859 @subsection Start nonterminals
1860
1861 @cindex start nonterminals
1862 @cindex @code{reparse-symbol} property
1863 When you write a grammar for @semantic{}, it is important to carefully
1864 indicate the start nonterminals.  Each one defines an entry point in
1865 the grammar, and after parsing its semantic value is returned to the
1866 back-end iterative engine.  Consequently:
1867
1868 @strong{The semantic value of a start nonterminal must be a produced
1869 by a TAG like grammar macro}.
1870
1871 Start nonterminals are declared by @code{%start} statements.  When
1872 nothing is specified the first nonterminal that appears in the grammar
1873 is the start nonterminal.
1874
1875 Generally, the following nonterminals must be declared as start
1876 symbols:
1877
1878 @itemize @bullet
1879 @item The main grammar entry point
1880 @quotation
1881 Of course!
1882 @end quotation
1883
1884 @item nonterminals passed to @code{EXPAND}/@code{EXPANDFULL}
1885 @quotation
1886 These grammar macros recursively parse a part of input data, based on
1887 rules of the given nonterminal.
1888
1889 For example, the following will parse @samp{PAREN_BLOCK} data using
1890 the @samp{formal_parameters} rules:
1891
1892 @example
1893 @group
1894 formal_parameter_list
1895   : PAREN_BLOCK
1896     (EXPANDFULL $1 formal_parameters)
1897   ;
1898 @end group
1899 @end example
1900
1901 The semantic value of @samp{formal_parameters} becomes the value of
1902 the @code{EXPANDFULL} expression.  It is a list of @semantic{} tags
1903 spliced in the tags tree.
1904
1905 Because the automaton must know that @samp{formal_parameters} is a
1906 start symbol, you must declare it like this:
1907
1908 @example
1909 @group
1910 %start formal_parameters
1911 @end group
1912 @end example
1913 @end quotation
1914 @end itemize
1915
1916 @cindex incremental re-parse
1917 @cindex reparse-symbol
1918 The @code{EXPANDFULL} macro has a side effect it is important to know,
1919 related to the incremental re-parse mechanism of @semantic{}: the
1920 nonterminal symbol parameter passed to @code{EXPANDFULL} also becomes
1921 the @code{reparse-symbol} property of the tag returned by the
1922 @code{EXPANDFULL} expression.
1923
1924 When buffer's data mapped by a tag is modified, @semantic{}
1925 schedules an incremental re-parse of that data, using the tag's
1926 @code{reparse-symbol} property as start nonterminal.
1927
1928 @strong{The rules associated to such start symbols must be carefully
1929 reviewed to ensure that the incremental parser will work!}
1930
1931 Things are a little bit different when the grammar is written in Bison
1932 style.
1933
1934 @strong{The @code{reparse-symbol} property is set to the nonterminal
1935 symbol the rule that explicitly uses @code{EXPANDTAG} belongs to.}
1936
1937 For example:
1938
1939 @example
1940 @group
1941 rule:
1942     rhs
1943     (let* ((rhs $1)
1944            name type comps prec action elt)
1945       @dots{}
1946       (EXPANDTAG
1947        (TAG name 'rule :type type :value comps :prec prec :expr action)
1948        ))
1949   ;
1950 @end group
1951 @end example
1952
1953 Set the @code{reparse-symbol} property of the expanded tag to
1954 @samp{rule}.  A important consequence is that:
1955
1956 @strong{Every nonterminal having any rule that calls @code{EXPANDTAG}
1957 in a semantic action, should be declared as a start symbol!}
1958
1959 @node Useful functions
1960 @subsection Useful functions
1961
1962 Here is a description of some predefined functions it might be useful
1963 to know when writing new code to use Wisent in @semantic{}:
1964
1965 @findex wisent-collect-unmatched-syntax
1966 @defun wisent-collect-unmatched-syntax input
1967 Add @var{input} lexical token to the cache of unmatched tokens, in
1968 variable @code{semantic-unmatched-syntax-cache}.
1969
1970 See implementation of the function @code{wisent-skip-token} in
1971 @ref{Error recovery}, for an example of use.
1972 @end defun
1973
1974 @node Wisent Lex
1975 @section The Wisent Lex lexer
1976
1977 @findex semantic-lex
1978 The lexical analysis step of @semantic{} is performed by the general
1979 function @code{semantic-lex}.  For more information, @inforef{Writing
1980 Lexers, ,semantic-langdev}.
1981
1982 @code{semantic-lex} produces lexical tokens of the form:
1983
1984 @example
1985 @group
1986 @code{(@var{token-class start} . @var{end})}
1987 @end group
1988 @end example
1989
1990 @table @var
1991 @item token-class
1992 Is a symbol that identifies a lexical token class, like @code{symbol},
1993 @code{string}, @code{number}, or @code{PAREN_BLOCK}.
1994
1995 @item start
1996 @itemx end
1997 Are the start and end positions of mapped data in the input buffer.
1998 @end table
1999
2000 The Wisent's parser doesn't depend on the nature of analyzed input
2001 stream (buffer, string, etc.), and requires that lexical tokens have a
2002 different form (@pxref{Writing a lexer}):
2003
2004 @example
2005 @group
2006 @code{(@var{token-class value} [@var{start} . @var{end}])}
2007 @end group
2008 @end example
2009
2010 @cindex lexical token mapping
2011 @code{wisent-lex} is the default Wisent's lexer used in @semantic{}.
2012
2013 @vindex wisent-lex-istream
2014 @findex wisent-lex
2015 @defun wisent-lex
2016 Return the next available lexical token in Wisent's form.
2017
2018 The variable @code{wisent-lex-istream} contains the list of lexical
2019 tokens produced by @code{semantic-lex}.  Pop the next token available
2020 and convert it to a form suitable for the Wisent's parser.
2021 @end defun
2022
2023 Mapping of lexical tokens as produced by @code{semantic-lex} into
2024 equivalent Wisent lexical tokens is straightforward:
2025
2026 @example
2027 @group
2028 (@var{token-class start} . @var{end})
2029      @result{} (@var{token-class value start} . @var{end})
2030 @end group
2031 @end example
2032
2033 @var{value} is the input @code{buffer-substring} from @var{start} to
2034 @var{end}.
2035
2036 @node GNU Free Documentation License
2037 @appendix GNU Free Documentation License
2038
2039 @include fdl.texi
2040
2041 @node Index
2042 @unnumbered Index
2043 @printindex cp
2044
2045 @iftex
2046 @contents
2047 @summarycontents
2048 @end iftex
2049
2050 @bye
2051
2052 @c Following comments are for the benefit of ispell.
2053
2054 @c  LocalWords:  Wisent automagically wisent Wisent's LALR obarray