
3 The CppCC input grammar file

The input of CppCC consists of a single file containing a description of the grammar for which it will generate the scanner and the parser. Hence, the grammar file will contain at least two sections: one describing the scanner's accepted language and one describing the parser's. Additionally, one can provide a number of options that will affect the code generation process and a ``token customization section''. Each of these sections is introduced by a keyword at the ``top level'' of the grammar file. Their syntax has been kept as C++-like as possible, in order to give the user a good approximation of what CppCC will generate out of it.

To be more precise, the grammar file BNF production is:

<grammar_file> -> [ <options_section> ] 
                  <token_customization_section> 
                  <lexical_section> <syntax_section>

The options section applies to the overall CppCC operation, while the other three sections refer to a specific part of the code to be generated. As an example, the token customization section whose syntax is:

<token_customization_section> -> [ <c_block> ] "TOKEN" <c_token_class_id> 
                                 [ <inheritance> ] <c_block>

lets you specify the name of the class to be generated, and also make it inherit from one or more classes, if the project into which the parser will be integrated requires so. As an example, this feature can be used to switch from a handwritten parser to a CppCC-generated one in an existing project where it is desirable to preserve the interface between the parser and the rest of the code; this can easily be achieved by making the generated class(es) inherit the old interfaces. Moreover, within each section, handwritten method and field declarations that will be merged into the final generated source can be freely added. Continuing the parser replacement example: after making the new parser inherit the old one's interface, one can implement its methods in terms of the ones provided by the generated parser. The easiest way is to place these wrapper methods inside the parser's section itself.

Each of the token customization, lexical and syntax sections can optionally be preceded by a block of C++ code. This code will be placed before the code that is generated from the section's contents, and therefore it is a good place for adding #include lines for headers that will be used by the user code inside the section, typedefs, etc.


3.1 The options section

This section contains various switches that control the code generation algorithm. Options are specified as key = value pairs. The syntax of this section is:

<options_section> -> "OPTIONS" "{" ( <option_name> "=" <option_value> ) * "}"

The possible options are:

DEBUG_PARSER
tells CppCC to add debugging code into the parser source. The default value is "false". When enabled, the generated parser class will contain extra methods for tracing its functioning at runtime;
DEBUG_SCANNER
same as DEBUG_PARSER, but affects the scanner's class;
CASE_SENSITIVE
causes CppCC to add code to the scanner's class for dealing with case-insensitive languages. The default value is ``true''. If set to ``false'', the generated scanner will make no difference between upper and lower case letters.
USE_EXCEPTIONS
if set to ``true'', the generated code will use the exception mechanism for error handling (See Sec. 3.5). If not, no exception-related code will be generated. Defaults to ``true''.
DEFAULT_LOOKAHEAD
specifies how many tokens the parser should inspect when choosing which expansion to take next, when the current token alone is not enough to make the choice. The default value is 1, which makes the parser pure LL(1).
TOKENS_SPAN_EOF
specifies what the scanner should do when a file ends before a token has been completely matched. If this option is false, a scan exception will occur; otherwise the scanner will attempt to silently switch to another input stream and continue scanning. The default value of this option is false.
COUNT_COLUMNS
tells CppCC to generate extra code in the scanner in order to also keep track of the starting/ending column for each token. The default value is "true".
SHARP_LINES
tells CppCC whether to insert #line directives into the generated code. The directives indicate to the compiler/debugger the places where the user's code was inserted into the input file. The default value is ``true''.
PROFILING_FILE
specifies the name of a profile file to be used when generating the scanner (See Annex A ).
STRING_CLASS
specifies a class to be used instead of std::string for representing the token images (See Sec. 4.1.1)
NAMESPACE_NAME
specifies an alternate name for the namespace enclosing the CppCC-generated class declarations (See Sec. 4)
SRC_EXTENSION, HDR_EXTENSION
allow the use of alternate extensions when writing the generated source files (See Sec. 4).


3.1.0.0.1 Example:

If we need a parser that will treat upper/lower case letters the same and we don't want to deal with the LOOKAHEAD issues, the OPTIONS section would be the one shown in Fig. 2.

Figure 2: Sample options section.
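A sketch of such an options section, following the option descriptions above (the exact quoting of option values is an assumption based on the defaults quoted in the option list):

    OPTIONS
    {
        CASE_SENSITIVE    = "false"
        DEFAULT_LOOKAHEAD = 2147483647
    }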

By setting the default lookahead to a very large value, we ensured that the generated parser will always try out a whole expansion when not sure about what to do next. The choice of INT_MAX is not accidental: any fairly large value would work just as well, but INT_MAX is also the largest lookahead value the parser could ever use.


3.2 The token customization section

The token customization section allows you to specify the token class's name and inheritance, and to add C++ code to be included in the generated class.

Its syntax is:

<token_customization_section> -> [ <c_block> ] 
                                 "TOKEN" <c_token_class_id> [ <inheritance> ] 
                                 <c_block>

As can be seen above, at least the token's class name followed by a (possibly empty) C++ block must be present.

The code block that precedes the TOKEN keyword, if present, will be inserted verbatim (without the enclosing braces) into the scanner's class header file. It is a good place for #include lines or other fragments of code that need to appear before any other user code in the generated sources.

If the inheritance list is present, then each of the parent classes must provide a default constructor. This is needed because the token class's constructors are generated by CppCC, which does not know what constructors should be called for each of the parent classes; the simplest solution is to let the C++ compiler generate calls to their default parameterless constructors.

The C++ block following the token's class name can contain any extra application-specific fields and methods. Because they are inlined into the token's class, the user-defined methods will have access to all the token object's internal fields. The following fields of the token's class are accessible (See Sec. 4.1 for details):

int length;
 
The length of the token's image.
cppcc::Position bPos, ePos;
 
Beginning and end location of the token relative to the beginning of the input stream;
int id;
 
the token's id.
The string image of the token can be obtained by calling the std::string& image () method of the token class, which returns a reference to an internal field that contains the token's image. This field can be modified in any way by the user's code.


3.2.0.0.1 Example:

Suppose an application containing a parser for some language foo requires a way of finding out whether a token spans multiple lines. An implementation using the user-code feature of the token customization section is given in Fig. 3.

Figure 3: Adding a custom method into a token section.
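A sketch of such a token customization section; the method relies on the bPos and ePos fields described above, while the class name FooToken and the ln member of cppcc::Position are assumptions based on the examples elsewhere in this text:

    TOKEN FooToken
    {
        public:
            // True if the token's image spans more than one input line.
            bool spansMultipleLines ()
            {
                return bPos.ln != ePos.ln;
            }
    }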


3.3 The lexical section

The lexical section contains the scanner's specification. As for the token customization section, a name for the generated class is always required. The syntax is:

<lexical_section> -> [ <c_block> ] 
                     "SCANNER" <c_scanner_class_id> [ <inheritance> ] 
                     "{" <scanner_decl> * "}"

Just like in the token customization section, a first block of code is provided for placing #include lines or other C++ code that needs to appear at the beginning of the generated file.

If an inheritance is present in this section, each of the parent classes must provide a default constructor that will be called by the generated scanner class' constructors. Inside the body of the lexical section multiple <scanner_decl> items can appear. These can be either token declarations or user code:

<scanner_decl> ->  <c_block> 
                 | [ <lexical_states_list> ] <token_kind> "{" <token_decl> * "}" 
                 | "SPECIAL" "{" <special_token_decl> * "}" 
<token_kind> -> "TOKEN" | "SKIP" | "MORE" | "KEYWORD"

Token declarations specify the language that is accepted by the scanner. A CppCC generated scanner supports five kinds of tokens:


3.3.1 Non-special tokens

The regular tokens are those declared using the reserved keyword "TOKEN". Such tokens will be returned by successive calls to the scanning method, as they are recognized in the input stream.

Unlike regular tokens, ``SKIP'' tokens, although recognized as valid language constructs in the input stream, are never returned to the parser but simply ignored. This kind of token declaration is useful when the scanner should throw away tokens such as comments and whitespace. They are declared using the reserved word SKIP. Note, however, that any associated token actions are still executed (including the commonTokenAction hook, if defined).

The "MORE" tokens are also never returned by the scanner's routine to the parser, but are nevertheless ignored. A token may be flagged as "MORE" when it is merely the prefix for a longer token. After it has recognized such a token, the scanner will continue to match characters until a non-MORE token is found, then return a single token having as its image the concatenation of the images of all the previously matched MORE tokens and the last token of some other kind. Of course the same result could have been achieved by using a single long regular expression that matched the same concatenation, but using MORE tokens allow user code to be executed during the match (after each MORE token a user action can be placed).

The "KEYWORD" tokens are very similar to regular tokens, except that the scanner will avoid making the token's image available to the user code7. This can result in a performance improvement for the scanner.

The rest of the token declaration syntax is the same for all four of these kinds:

<token_decl> ->  "<#" <c_token_id> ":" <token_regexp> ">" 
               | "<" <c_token_id> ":" <token_regexp> ">" [ <c_block> ]

All the tokens in the above category have an associated lexical state set, a symbolic name, a regular expression and an (optional) user action. Each of these will be discussed below.

Lexical states provide a simple mechanism for switching between sets of accepted input tokens while the scanner is running. Each such set of tokens is called a lexical state. While in a lexical state x, the scanner will only accept those tokens that have x in their lexical state list. If a token is to be recognized in any lexical state, its lexical state list should contain only the special wildcard marker * which stands for ``any state''. If no states are specified in a token's declaration, its state list will implicitly contain a single state called START. As the name implies, the START state is also the initial state of the scanner, and if no other states ever occur within the lexical section, it is the only lexical state of the scanner. The token declaration syntax allows a single lexical state list to apply to multiple tokens.

The symbolic name (or simply the name) of a token is the name by which the token will be known to the parser and user actions. It is specified by the <c_token_id> inside a token declaration. For each symbolic name, a unique integer constant will be created in the generated code. For each Token object returned by the scanning method, the id field (See Sec. 4) will be set to the constant value associated with the token's symbolic name.

A special case is that of macro tokens, whose symbolic names are preceded by a ``#'' sign. These are not real tokens and cannot be used inside the syntactic section. Their sole purpose is to act as ``regular expression macro definitions'': such tokens' names can appear inside the regular expression definitions of other tokens, where CppCC will replace them with their own definition (See Example 4).

The regular expression describes the class of strings that match the token. When the scanning routine successfully matches a string that conforms to a token's regular expression, it returns a new Token object with the corresponding id. The complete regular expression grammar can be found in Sec. B. The regular expression can be any of the following:


3.3.1.0.1 Example:

Suppose we have a simple language called Foo that only contains expressions. The sample lexical section given in Fig. 4 is a possible description of the scanner for such a language. It only accepts decimal numbers and identifiers as operands, and the usual arithmetic operators, and is able to recognize and skip C++-style one line comments.

Figure 4: A simple lexical section.
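A sketch of such a lexical section; the token names DECIMAL_NUMBER and IDENTIFIER are used by later examples, while the exact regular expression notation (character classes, macro references) is an assumption here, the authoritative grammar being the one in Sec. B:

    SCANNER FooScanner
    {
        SKIP
        {
            < WHITESPACE : (" " | "\t" | "\r" | "\n")+ >
            < LINE_COMMENT : "//" (~["\n"])* >
        }

        TOKEN
        {
            <# DIGIT : ["0"-"9"] >
            <# LETTER : ["a"-"z", "A"-"Z", "_"] >
            < DECIMAL_NUMBER : (<DIGIT>)+ >
            < IDENTIFIER : <LETTER> (<LETTER> | <DIGIT>)* >
            < PLUS : "+" >      < MINUS : "-" >
            < STAR : "*" >      < SLASH : "/" >
            < LPAREN : "(" >    < RPAREN : ")" >
        }
    }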

The token action associated to a token is a block of C++ code that will get executed every time the scanner matches that token. This code has access to all the scanner class's internal data, including the currently matched token object, which can be modified by this code if so needed. The (possibly) modified token object will be returned as the result of the current call to the scanning method. The code in a token action can use the following variables and functions:

<TokenClass> *token;
 
The just-matched token. The id, image and position fields of the token object are set by the scanner before entering the token action code and are not modified by it afterwards. Thus, any modifications of these fields will be preserved and seen by the parser.
Position bPos, ePos;
 
The scanner's internal position fields. They have the same values as the token's fields with the same names.
Additionally to these fields, all the methods from the scanner's public interface are also accessible (see Sec. 4.2.1).


3.3.1.0.2 Example:

Let's modify Example 3.3.1.0.3 so that each time a DECIMAL_NUMBER is matched on the input, the scanner also determines the numeric value of the number. This value will be stored in a field we are going to add to the token class. The code is shown in Fig. 5.

Figure 5: Token action that computes a token's semantic value.
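A sketch of such a token action; the value field is a user-added field (its name and type are assumptions), and the DIGIT macro is the one from the lexical section example:

    // Inside the token customization section: add a field to hold
    // the numeric value (hypothetical field name).
    TOKEN FooToken
    {
        public:
            int value;
    }

    // Inside the lexical section: compute the value when matching.
    TOKEN
    {
        < DECIMAL_NUMBER : (<DIGIT>)+ >
        {
            token->value = atoi (token->image ().c_str ());
        }
    }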


3.3.2 Special tokens

If a particular application needs to reserve a number of tokens for user-defined purposes, then these tokens can be declared as SPECIAL. No regular expressions are associated with them, and they are treated as nonexistent by the token matching routine; their creation must be done by user code. However, such tokens can be freely used inside the syntactic section just like normal tokens. This kind of token was introduced to facilitate the implementation of ``semantic tokens''. A simple way to implement this mechanism with CppCC is to declare the various semantic tokens as SPECIAL and add user actions that will create them when needed. For instance, consider a language where we have to distinguish between function names and variable names. From the scanner's perspective, both of these are just identifiers that can be matched by a regular expression like letter (letter | digit)*. This is what the scanning method would normally return. But we can associate a user action that, each time a name token is matched, performs a symbol table lookup, finds the symbol with the given name, then changes the token's id from name to, say, function_name or variable_name. The modified token object is then returned to the parser.


3.3.2.0.1 Example:

Suppose the language in Example 3.3.1.0.3 needs to differentiate between function and variable identifiers. This requires some user code that does the symbol table lookup and changes IDENTIFIER tokens to either FUNCTION_IDENTIFIER or VARIABLE_IDENTIFIER. A possible way of doing this is shown in Fig. 6. The example code only shows what needs to be modified in the lexical specification shown in Fig. 4. The extra code consists of a user action associated with the IDENTIFIER token that does the replacements mentioned above, and an extra field added to the token's class to store its ``semantic information'', i.e. a reference to the symbol table entry associated with the identifier.

Figure 6: A lexer specification that makes use of special tokens to represent semantic information.
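A sketch of the additions; the exact form of a <special_token_decl>, the symbol table types (symtab, Symbol) and the symbol field added to the token class are all assumptions:

    SPECIAL
    {
        < FUNCTION_IDENTIFIER >
        < VARIABLE_IDENTIFIER >
    }

    TOKEN
    {
        < IDENTIFIER : <LETTER> (<LETTER> | <DIGIT>)* >
        {
            // Hypothetical application-specific symbol table lookup.
            Symbol *s = symtab.lookup (token->image ());
            token->symbol = s;  // user-added field storing semantic info
            token->id = (s != NULL && s->isFunction ())
                        ? FUNCTION_IDENTIFIER : VARIABLE_IDENTIFIER;
        }
    }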


3.3.3 The user code

C++ code can be freely intermixed with token declarations. Any block of C++ code found inside the lexical section (it can contain any methods or fields) will be inserted into the generated scanner class. From within this code, access to the scanner's internal fields and methods is permitted, as from any normal member function.

Several methods have a special meaning if defined in a code item inside the scanner's section:

void commonTokenAction (<TokenClassName> &token);
 
This method, if defined, will be called after a token has been successfully matched on the input stream and its fields have been set up, but before it is returned to the parser. The argument is the matched token; its type is a reference to an instance of the class denoted by the token class name specified in the token customization section.

The commonTokenAction is also the last point where one can cause the scanner to throw away the current token by calling the rejectToken method. If such a call is made, the token will never be returned and scanning will continue as if this was a SKIP token.

bool wrap();
 
This method, if defined, is called each time the scanner encounters the ``end of file'' condition while reading from the current input stream. An implementation of this method can switch to another input stream by calling the scanner's switchToStream method with a newly created input stream object as argument (see Sec. 4.2.1 for a complete list of stream switching methods). If a new input stream was opened, the method must return true; the scanner will then reset the stream location pointers and continue (i.e. the next token will be matched from the new input stream, having its start position at line 1, column 1). If it returns false, depending on the current scanning state, one of the following actions is taken:
bool onScanError (const ScanException &ex);
 
See Section 3.5.3.
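As an illustration of the wrap hook, here is a sketch of user code inside the scanner's section that chains a second input file after the first one ends; the file name, the argument type accepted by switchToStream and the helper field are all assumptions:

    {
        bool secondFileDone;

        bool wrap ()
        {
            if (!secondFileDone)
                {
                    secondFileDone = true;
                    // Hypothetical: open the next input file and keep scanning.
                    switchToStream (new std::ifstream ("second.foo"));
                    return true;
                }
            return false;  // no more input; normal end-of-file handling
        }
    }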


3.4 The syntax section

This section contains the parser's specification. The syntax is:

<syntax_section> -> [ <c_block> ]  
                    "PARSER" <c_parser_class_id> [ <inheritance> ] 
                    "{" <parser_decl> * "}"

As for the token and scanner classes, a first block of code and an inheritance list can be specified for the parser class. If present, the first block of code is merged into the generated header file of the parser's class, before the parser class itself.

If an inheritance is given, each parent class must have a parameterless constructor that will be called by the generated parser class's constructors. Inside the syntax section, two kinds of constructs can be added: user code and EBNF productions. We will first discuss the EBNF productions, then the user code.

The CppCC productions are written in a code-like fashion: the syntax of a production resembles that of a C++ function. It has a return type, arguments, a name and a body. The name is actually the name of the described nonterminal, the one that usually appears on the left hand side of a BNF production. The body contains, among other things, the actual right hand side of the production. CppCC will generate a method inside the parser's class for each production found in the input grammar. The method will have exactly the same signature as the one specified in its production within the grammar file. CppCC parsers have no default start symbol. The parsing process starts by calling one of these generated methods from somewhere inside the rest of your project's code, and ends when the called method returns. The syntax for specifying productions is (for the complete syntax see Sec. B):

<production> -> "(" <c_type_decl> ")" <c_nonterminal_id> "(" <c_formal_args_list> ")"  
                 [ <throw_clause> ] 
                "{" <c_block> <or_list> "}"

The body of a production is a mixture of BNF expressions and blocks of C++ code. The user-supplied C++ code will be inserted into the generated methods at the appropriate places so that it is executed each time the parsing algorithm has reached the point where the code appeared inside the EBNF specification.

A special case is a production that only contains the first block of code and no expansions. For these, CppCC will assume that the production's code will somehow match the nonterminal it is intended for (such ``black box productions'' can be used for writing fast routines that match comments or strings by accessing the input stream directly, or to implement error recovery rules; see Sec. 3.5).


3.4.1 BNF expansions

The grammar of BNF expressions (or expansions) accepted by CppCC allows the "|", "*","+" and concatenation operators. Precedence can be changed by grouping expansions with round parentheses "(" and ")". An expansion can be any of the following:


3.4.1.0.1 Example

In Example 3.3.1.0.3 we introduced a language consisting of arithmetic expressions. Fig. 7 contains the syntax section for that language; it corresponds to the lexical section given in Fig. 4. The parser accepts as its input any number (possibly zero) of expressions. Note that EXPRESSION, TERM and FACTOR all return a value of type int, which is not used in any way within this example. After user actions are introduced later in this section, this example will be enhanced to make use of the return values.

Figure 7: A simple syntax section
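A sketch of such a syntax section, following the <production> syntax given above; the nonterminal names come from the text, while the empty initial code blocks and the exact operator grouping are assumptions:

    PARSER FooParser
    {
        (void) INPUT ()
        {
            {}
            ( EXPRESSION () )*
        }

        (int) EXPRESSION ()
        {
            {}
            TERM () ( ( <PLUS> | <MINUS> ) TERM () )*
        }

        (int) TERM ()
        {
            {}
            FACTOR () ( ( <STAR> | <SLASH> ) FACTOR () )*
        }

        (int) FACTOR ()
        {
            {}
              <DECIMAL_NUMBER>
            | <IDENTIFIER>
            | <LPAREN> EXPRESSION () <RPAREN>
        }
    }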


3.4.2 User actions

The parser in Fig. 7 is a bit useless: it only recognizes the syntactic constructs but does not do anything with them. For that, one must add pieces of code at various points that get executed along with the parsing process: the user actions. Code can be added before and/or after each BNF expansion. The code added before an expression will be executed before the parsing algorithm selects that expression for expansion (if the expression is not selected at all, the code will not be executed; this can happen if the expression is an alternative inside a choice expression). The code added after an expression will be executed after the expression has been expanded successfully. Inside a block of user code, a variable called token is guaranteed to point to the most recently matched token, and can be used in any suitable way. Note that this pointer will change its value as soon as a new token is retrieved from the scanner; therefore, saving a token must be done by creating a copy of the token object the variable points to, not by just saving the pointer. Also, it is most certainly a bad idea to delete the token object from a user action. As an example, the EBNF construction

<IDENTIFIER> { cout << "Found an identifier at line " << token->bPos.ln << endl; }

will print something like ``Found an identifier at line 10'', if the file given as input to the parser had an identifier on line 10.

The productions' syntax allows a special block of code to be placed at the very start of the body. This block is guaranteed to be inserted at the very start of the generated method and to have the same declarative context as the method itself; it is thus a good place to declare local variables that will be used later inside the user actions.

The semantic value of a production is also computed by user actions. At any point inside a user action, a value of an appropriate type (i.e. compatible with the declared return type of the production's method) can be returned with a plain C++ return statement as the semantic value of that production. Note that, because the user code is inserted textually inside the method's body, the return statement will be executed as such, causing the method to terminate and return that value with no further actions. For instance, an EBNF construction like

<IDENTIFIER> {return something; } <IDENTIFIER>

will most probably not produce the expected result, i.e. the parser will never attempt to match two IDENTIFIERs but only the first one, after which the return statement is taken. It is therefore advisable to place return statement(s) only inside the user code that gets executed after the whole right hand side of the production has been expanded (unless, of course, some particular reason imposes a different approach; for example, if at some point the user decides it is useless to continue parsing at the current level, they can always interrupt it with a return statement carrying some failure code).

Retrieving the semantic value of a nonterminal is possible by using a special syntax inside the EBNF expression. Its form is <variable> = NONTERMINAL(...), e.g. x = FACTOR(). This indicates that the return value of the method that parses a FACTOR will be stored in the variable x.


3.4.2.0.1 Example

We will enhance the parser in Fig. 7 with user actions that will evaluate each expression and print its value to the standard output. For this, we assume the scanner stores the value of a DECIMAL_NUMBER inside the token as explained in Example 3.3.1.0.4. The syntax section of the enhanced parser is shown in Fig. 8.

Figure 8: A parser with user actions that evaluate expressions.
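A sketch of such an enhanced production; it assumes the value field added to the token class in the earlier token action example, and declares its local variables in the special first block described above:

    (int) EXPRESSION ()
    {
        { int val, t; }
        val = TERM ()
        (   <PLUS>  t = TERM () { val += t; }
          | <MINUS> t = TERM () { val -= t; }
        )*
        { cout << "value: " << val << endl; return val; }
    }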


3.4.3 Local lookahead

In Section 1 we said that the parsers generated by CppCC are capable of handling LL(k) grammars. However, an LL(k > 1) parser is notably slow. Moreover, usual grammars are designed so that they are ``mostly'' LL(1), with only small areas being LL(k > 1). There are two common solutions to overcome such situations: left-factoring the non-LL(1) productions and increasing the lookahead depth. Although left factoring has the advantage that it converts a non-LL(1) grammar to LL(1), it introduces new productions which have no semantic significance, as they are only the common prefixes of the productions that were subject to factorization. This usually complicates semantic information handling (such as parse-tree building).


3.4.3.0.1 Example:

Let's consider two productions that have a common prefix, such as:

A -> $ \alpha$ X
B -> $ \alpha$ Y,

where $ \alpha$ has a length of at least two tokens. A grammar containing these two productions is not LL(1). If we apply left factoring, we obtain something like:

A      -> A_OR_B X 
B      -> A_OR_B Y 
A_OR_B -> $ \alpha$,

where A_OR_B is a newly introduced nonterminal that expands to the common part of A and B. Now, if we need to create the parse tree based on the new grammar, we always have to add a new node for A_OR_B. The problems arise when we do not know the exact semantics of an A_OR_B until we can decide whether it is part of an A or a B. In this case, we have to store the data in some neutral way and postpone its interpretation until we can decide what it actually represents. This will most probably result in rather obscure user actions that are difficult to read and maintain.

For such situations, where breaking down a production into two or more parts is not desirable, the other solution can be used, i.e. locally increasing the lookahead. The rest of the grammar will still be LL(1), so there will be no overall performance penalty. CppCC accepts three kinds of local lookahead hints, which can be combined:

fixed lookahead:
its effect is to tell the parser to inspect up to a specified number of tokens before deciding to choose a certain expansion.
syntactic lookahead:
there are cases when the number of lookahead tokens cannot be predicted. In this case, the parser can try to match a whole expansion, and choose the associated ``real expansion'' if the match succeeds. Note that if the expansion contains nonterminals, the user actions inside the productions of those nonterminals are never executed. However, user actions followed by an exclamation mark ``!'' will be executed even during lookahead.
semantic lookahead:
there are also cases when the decision whether to choose a certain expansion is based on some semantic information. One can specify a boolean expression that tells the parser what to do. Note that unlike user actions, semantic lookahead (although it is also user code) is always evaluated while the parser is performing the lookahead algorithm (more specifically, if expansion A has a LOOKAHEAD(B()) and B's production itself contains a semantic lookahead (this actually holds for all kinds of lookahead hints), then that semantic lookahead is always taken into consideration while performing the LOOKAHEAD(B())).
A lookahead hint can precede any expansion in a CppCC production that appears at a non-LL(1) choice point. The syntax is:

LOOKAHEAD(fixed_lookahead, syntactic_lookahead, semantic_lookahead) <some_expansion>

The effect is that if the lookahead succeeds, the expansion that follows it will be taken. Any of the fixed_lookahead, syntactic_lookahead and semantic_lookahead parameters can be omitted (at least one must be present, and if they are combined they must appear in this order, separated by commas). If more than one kind of lookahead parameter is given, the lookahead will succeed only if all of them succeed (their combination is the AND of all of them). However, the fixed lookahead always takes precedence over the syntactic one; that is, when both are present, the parser will look ahead for the expansion provided as a syntactic lookahead, but only until it reaches the number of tokens specified as the fixed lookahead.
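A sketch of a choice point that needs such a hint; the token names COLON and SEMICOLON are hypothetical, and the point illustrated is that both alternatives can start with <IDENTIFIER>, so a second token must be inspected:

    (void) STATEMENT ()
    {
        {}
          LOOKAHEAD(2) <IDENTIFIER> <COLON> STATEMENT ()  // a labeled statement
        | EXPRESSION () <SEMICOLON>                       // e.g. "x + 1;"
    }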


3.4.3.0.2 Example:

We will discuss several of the most common situations where each kind of lookahead is needed:

3.4.4 The user code

Inside the syntax section, one can add any C++ code that should be inserted into the generated parser class (i.e. class fields and methods of any kind). User code must be placed in a block surrounded by balanced curly braces, like this: { public: myMethod(...) { ..... }}. Methods added here have unrestricted access to the parser class's internal fields. Several methods, if added inside such a block of code, have a special meaning:

bool onParseError (const ParseException &pex);
 
The parser's common error handler, see Section 3.5.4.


3.5 Error recovery

CppCC is very flexible regarding the error-handling approach. It offers support for all the error recovery techniques offered by other similar tools. Either of them (or the two combined) can be used, according to one's taste. Basically, there are two ways of implementing a customizable error handling mechanism:

The use of exceptions is particularly well suited to a recursive descent parser, as they allow the user to deal with errors at whatever syntactic level is most convenient (for instance, in a C parser, a good place to catch an exception that occurred due to a parse error could be just after the call to a FUNCTION_DEFINITION nonterminal; after reporting the error, the parsing process can continue by skipping to the next function and resuming from there). The idea is that a catch clause implicitly sets a recovery point in the parser. However, if an application requires it (or one simply prefers), CppCC can generate code that does not use exceptions; errors are then dealt with by the various error handlers described below. The use of exceptions in the generated code is controlled by the USE_EXCEPTIONS option (see Sec. 3.1). If this option is set to false, the use of exceptions is disabled and the generated code will be ``exceptions clean''. Figure 12 shows the exceptions used by the generated code.

Figure 12: Exception hierarchy used inside an exception-enabled parser.
\includegraphics{pics/cppcc_except.eps}

Usually, a parser has to deal with three kinds of errors: I/O errors, scanning errors, and parse errors. A fourth kind, semantic errors, can also be (partially) handled here. Each of these is discussed below, along with how it can be dealt with in a CppCC-generated front-end, in two situations: when exceptions are enabled and when they are disabled.


3.5.1 I/O errors

From CppCC's perspective, the only I/O errors that need to be dealt with are those that arise as the scanner reads an input stream. All other I/O errors can only be caused by user code, and it is therefore the user's responsibility to provide a mechanism for handling them.

Two handlers are available for I/O errors: the scanner's default error handler and a user-defined one. To define a custom error handler, all that is needed is to add a special method to one of the blocks of C++ code in the lexical section. The prototype of this method must be:

bool onIOError (ScanException &s);

After each input operation, the scanner performs the algorithm shown in Fig. 13.

Figure 13: I/O error handling algorithm performed by the scanning routine.

If an error has occurred, the scanner will always call the user error handler if one has been provided. This handler can attempt to fix the problem in some way, in which case it can ask the scanner to try to resume its operation by returning true as the result of the onIOError method. If a user handler is not available, or it was unable to fix the problem, the default action is taken. What the default action is depends on whether the use of exceptions was enabled.

If exceptions were disabled, the scanner will simply abort(). If exceptions were enabled, a ScanException object is thrown. With exceptions enabled, all the methods generated by CppCC that may (directly or indirectly) call the scanning method (including the parser's methods) have the ScanException class in their throw clauses, thus allowing the exception to go up the stack until it reaches a user's catch clause for it (this may happen at the very top of the stack, where the parsing process was started, or inside any EBNF production, as will be explained below).
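As a self-contained sketch of the handler protocol, consider the following: ScanException here is a minimal stand-in for the class CppCC generates, and the retry policy is purely illustrative.

```cpp
#include <string>
#include <utility>

// Minimal stand-in for the ScanException class that CppCC generates.
struct ScanException {
    std::string msg;
    explicit ScanException(std::string m) : msg(std::move(m)) {}
};

// User-defined I/O error handler. Returning true asks the scanner to
// retry the failed input operation; returning false triggers the default
// action (throw the ScanException, or abort() if exceptions are disabled).
bool onIOError(ScanException &s) {
    static int retries = 0;           // s.msg describes the failed operation
    if (retries < 3) {
        ++retries;                    // e.g. a transient error: retry a few times
        return true;
    }
    return false;                     // unrecoverable: let the default action run
}
```

The handler keeps its own state here only to illustrate a bounded-retry policy; a real handler would inspect the stream or the exception contents instead.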


3.5.2 Example

Fig. 14 shows how to add I/O error handling to the Foo parser presented in Example 3.4.1.0.6. We use both kinds of error handling: an error handler that attempts to recover from the error, and a catch clause that deals with the situations in which recovery was not possible. The try-catch block surrounds the call to the parser's start symbol method because, as said above, exceptions propagate up the stack back to the user code if the I/O error handler returns false.

Figure 14: Handling I/O errors using both an error handler and exceptions.


3.5.3 Lexical errors

Lexical errors occur if one of the following situations apply:

If no user handler is defined, the default action taken by the scanner is either to throw a ScanException (if exceptions were enabled) or to simply abort. To define a user handler, a special method must be added inside a user code block in the lexical section. The prototype of this method must be:

bool onScanError (ScanException &ex);

As can be seen, even if exceptions are disabled, the exception object (which contains information about the error that occurred) is still available, because it is passed as an argument to the error handler. The algorithm performed by the lexer when an error occurs is shown in Fig. 15.

Figure 15: Lexical error handling algorithm performed by the scanning routine.

Just as for onIOError, the return value of onScanError indicates whether the scanner should attempt to resume its operation. An implementation of onScanError can try to fix the error situation (e.g., if the error was caused by a character that could not be matched in the scanner's current DFA state, it can simply skip that character from the input stream and print a warning that some characters have been skipped), then return true in order to make the scanner try to resume its normal operation. If the return value is false, the scanner assumes the error is unrecoverable and takes the default error handling action, which implicitly terminates the current call of the scanning method.
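A sketch of such a handler, placed in a code block of the lexical section, might look like this (the warning message is illustrative, and the actual character-skipping mechanism depends on the scanner interface):

```cpp
{ public:
    bool onScanError (ScanException &ex)
    {
        // Attempt recovery: warn that input is being skipped, then ask
        // the scanner to resume. Returning false instead would trigger
        // the default action (throw the ScanException or abort).
        std::cerr << "warning: skipping unmatched input" << std::endl;
        return true;
    }
}
```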


3.5.3.0.1 Example:

Let's enhance the parser from Example 3.5.2 by adding lexical error handling. Fig. 16 shows what needs to be added: another code block in the lexical section that contains the lexical error handler, and an extra catch clause in the main function. The error handling routine first attempts to fix the error in some way (it may, for instance, try to discard some characters from the input stream) and, if that succeeds, it tells the scanner to continue. If recovery is not possible, it just returns false and lets the scanner abort its operation. In this latter case, a ScanException will be thrown and caught by the catch clause in the main function after it propagates through the current call stack.

Figure 16: Handling lexical and I/O errors using both an error handler and exceptions


3.5.4 Syntactic errors

Due to the fact that the parser is made up of many functions that call each other recursively, handling syntactic errors is a more complex matter. A generic error handler like those provided for I/O and lexical errors is of limited use here, as each syntactic error occurs in a certain context and needs to be dealt with separately. However, such an error handler is available, and it can be used to execute common actions when an error occurs, such as reporting it (or aborting the parsing process if no other error recovery method is implemented). The user error handler must be inserted into a code block inside the syntax section, and has the prototype:

bool onParseError (const ParseException &pex);

The return value has the same meaning as for the lexical and I/O error handlers: true means the parser should try to resume from where it left off, and false means that further error-handling actions must be taken. If exceptions are not enabled, the only thing left to do is to abort(). With exceptions enabled, the created ParseException object is thrown up the call stack. The error handling algorithm is shown in Fig. 17.
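As a sketch, such a handler placed in a code block of the syntax section might simply report the error before letting the default action take place (the reporting logic shown is illustrative):

```cpp
{ public:
    bool onParseError (const ParseException &pex)
    {
        // Common action: report the error. Returning false lets the
        // default action proceed (throw the ParseException or abort).
        std::cerr << "parse error" << std::endl;
        return false;
    }
}
```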

Figure 17: Syntax error handling algorithm performed by the parser.

The production grammar of CppCC allows exceptions thrown while parsing according to a certain expansion to be caught and some user-provided code to be executed, in a manner very close to the C++ try-catch semantics. The general form is shown in Fig. 18.

Figure 18: General form of a catch chain inside a production.

Generally, each expansion can be followed by a list of catch clauses. These catch exceptions thrown at any recursion level inside that expansion (if no other catch clause for the same exception(s) is reached first). In order for this mechanism to work, when exceptions are enabled, all the methods generated from CppCC productions have at least the exception classes mentioned above in their throw clauses. Any I/O, lexical, parsing or user-specific exception can be caught and dealt with inside any CppCC production. If such an exception is never caught inside a production, it causes the parsing process to terminate abnormally, and the exception will either be caught in the user code that invoked the parser or cause the whole program to abort by calling the standard handler.
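For instance, following the general form of Fig. 18, a production body could recover locally around a single nonterminal (a hypothetical sketch; the exact syntax is the one given in Fig. 18):

```cpp
// Parse a function definition, but recover locally if a parse error
// occurs anywhere inside it (at any recursion depth):
FUNCTION_DEFINITION()
catch (ParseException &pex) {
    // report the error, then skip tokens up to a recovery point
}
```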

An alternative solution for handling errors locally is to add error-handling expansions as code-only productions (see Sec. 3.4). Suppose the parser reaches a point where the current expansion is (A() | B() | skip_to_C()) C(). If the current lookahead matches neither an A nor a B at the choice point, then skip_to_C() is chosen. This can be a code-only production that displays an error message and consumes tokens until a point from which recovery is possible, such as the beginning of a C().
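A hypothetical sketch of such a code-only production follows (the production syntax is schematic, per Sec. 3.4; the token-consuming step is described only in a comment because it depends on the scanner interface):

```cpp
void skip_to_C()
{
    {
        // report the error, then discard tokens until one that can
        // begin a C() is seen, so that parsing can resume with C()
        std::cerr << "syntax error: skipping to the next C" << std::endl;
    }
}
```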

3.5.5 Semantic errors

CppCC offers support for user-defined error handling, such as semantic checks during parsing. Let's assume we want to write a parser for a language where the atoms of an expression can be, among other things, identifiers. A semantic check here would be to verify whether a variable with that identifier was declared before being used inside an expression. If the check fails, the user code that performed it can choose to throw an exception, something like a SymbolNotFoundException. Because this exception is thrown from inside user code located in a production, if it is to be caught at another level of the grammar, it must appear in the throw clause of the method that will be generated from that production. For this reason, CppCC grammars can specify a custom throw clause for each production, to which a user can add his own exceptions. If exceptions are enabled, the throw clause will of course contain the standard CppCC exceptions besides these custom ones.
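As a schematic sketch (the production header syntax is illustrative, not taken from the CppCC grammar reference), declaring the user exception in a production's throw clause lets it propagate to catch clauses at other grammar levels:

```cpp
// A production whose semantic check may throw SymbolNotFoundException
// declares that exception in its custom throw clause:
void FACTOR() throw (SymbolNotFoundException)
{
    // expansion, including the user code performing the symbol lookup
    // that may throw SymbolNotFoundException
}
```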


3.5.5.0.1 Example

Fig. 19 shows how the parser for our Foo language (see Fig. 7) can be enhanced to do error recovery and semantic checks. The FACTOR nonterminal can now also be an <IDENTIFIER>; when this happens, the identifier is looked up in the symbol table, and if it doesn't exist, a SymbolNotFoundException is thrown. The error recovery strategy implemented here consists of simply skipping the token that caused the error. The next time EXPRESSION() is called from INPUT_FILE(), it will attempt to continue parsing starting with the next token after the one that was discarded.

Figure 19: Syntactic and semantic error recovery.


Alec Panovici 2003-02-01