Connecting Ragel to Bison in C++

The flow of information from an input source, through a lexical analyzer (lexer) and on into a parser can be thought of as a pipeline. The ‘ideal’ way to represent this in C++ is to model the stages as classes and to construct the pipeline with instances conforming to a pattern that looks something like this:

Input input(/* file name or buffer */);
Lexer lexer(input);
Parser parser(lexer);
parser.parse();

This scheme can be achieved using tools such as ANTLR and Boost.Spirit. Newer versions of Bison support the generation of C++ meeting the requirement for the Parser class above. In this article I will show how to construct the Lexer class using the Ragel scanner generator such that it properly supports the required Bison interface. I shall present a complete worked example including all header files, dependencies and build details.

Calculator

The parser in this article implements the classic infix calculator as found in the Bison manual and countless other texts covering the basics of parsing. I choose to describe a calculator because it is straightforward, well documented and does not, therefore, distract from the main purpose of the article.

I have chosen to omit the description of an input source class in favour of a much simpler alternative which parses the command-line arguments directly. This means that the completed program operates such that

$ calc 2+2 3*3 6/3

will produce the output

4
9
2

Parser

A Bison grammar input file (Parser.yy in this example) takes the following outline form:

preamble (directives, token definitions etc)
%%
grammar productions
%%
postamble (supporting code)

The preamble is divided into subsections to support the narrative flow of the article.

Definitions

At the top of the file is specified the skeleton to use (C++, as contained in file lalr1.cc), the required Bison version, the name of the namespace to contain the generated parser class and the name of the class itself:

%skeleton "lalr1.cc"
%require "2.5"
%defines
%define namespace "Calc"
%define parser_class_name "Parser"

Header Inclusion

Next appears a directive that allows the inclusion of code into the header file that Bison will generate. This code must a) define a forward-reference for the Lexer class, b) include any required headers and c) define YYSTYPE, the type that carries semantic values.

%code requires {
  namespace Calc {class Lexer;}
  #include <cmath>
  #include <iostream>
  #define YYSTYPE double
}

Parameters

To make the pipeline shown at the top of the article it is necessary to instruct Bison to construct a parser class that takes a reference to a lexer object as a parameter. Bison provides the %parse-param directive for this purpose:

%parse-param {Calc::Lexer& lexer}

Body Inclusion

Similar to the need to supply code to appear in the header file generated by Bison, there is a need to supply code to appear in the generated code file:

%code {
  #include "Lexer.hh"
  #define yylex lexer.lex
}

During compilation of the generated code it is necessary to have the full defintion of the Lexer class rather than just a forward-reference.

One key part of the whole article is the definition of ‘yylex’ above. Bison (in C++ mode) expects to call yylex() repeatedly in order to obtain tokens (and optionally semantic values) from the input source. This definition connects this call to the instance of the Lexer class that will held by the Parser class.

Token Definitions

With organisation stuff out of the way the substance of the parser can now be considered, starting with the definition of the various tokens that will be supplied by the lexer:

%token END     0 "end of file"
%token LPAREN 40 "("
%token RPAREN 41 ")"
%token MUL    42 "*"
%token PLUS   43 "+"
%token MINUS  45 "-"
%token DIV    47 "/"
%token POW    94 "^"
%token NUM

As will be seen below, I have chosen to express the grammar using these symbolic forms rather than directly writing ‘”/”‘ for instance.

Precedences

The final part of the parser preamble is the necessary statement of operator precedence:

%left  PLUS MINUS
%left  MUL DIV
%left  NEG
%right POW

Grammar

Now we come to the infix calculator grammar, which should be familiar to anybody who has looked through the similar example in the Bison manual.

%%

input
    : /* empty */
    | exp {std::cout << $1 << '\n';}
    ;

exp
    : NUM {$$ = $1;}
    | exp PLUS exp {$$ = $1 + $3;}
    | exp MINUS exp {$$ = $1 - $3;}
    | exp MUL exp {$$ = $1 * $3;}
    | exp DIV exp {$$ = $1 / $3;}
    | MINUS exp %prec NEG {$$ = -$2;}
    | exp POW exp {$$ = pow($1, $3);}
    | LPAREN exp RPAREN {$$ = $2;}
    ;

Remember that this parser has a simplified input scheme so there is no need to worry about the handling of input lines.

Supporting Code

Bison-generated C++ parsers expect the user to supply an error handling function. Bison will place the definition of this function into the ‘private’ section of the generated parser class in the header file. Complex error handling is possible but this simple example just reports to the standard error stream:

%%

void
Calc::Parser::error
(const location_type& loc, const std::string& msg)
{
  std::cerr << loc << ": " << msg << '\n';
}

Lexer

At this point a Bison grammar has been defined, including tokens and all necessary supporting stuff. The definition of a lexer can now be made, starting with a header file.

Declaration

The lexer is declared as class Lexer in file Lexer.hh. I have chosen to put the lexer into the same namespace as the parser. The Lexer class constructor takes pointers to the start and end of the buffer to be tokenized. The lex() function supplies the interface required by Bison; that is, when called it returns a token number and can optionally write to the semantic value. Bison manages its own token stack – no work is needed here. The private part of the class contains the data required for the functioning of the generated Ragel machine. Please consult the Ragel user guide (section 5.1) for more details.

#include "Parser.hh"

namespace Calc
{
  class Lexer
  {
  public:
    Lexer(char const*, char const*);

    Parser::token_type lex(Parser::semantic_type*);

  private:
    char const* p;
    char const* const pe;
    char const* const eof;
    int cs;
    char const* ts;
    char const* te;
    int act;
  };
}

General

The Ragel source for the lexer is contained in file Lexer.rl.

Ragel takes a different approach to its input files to that of Bison: the file is assumed to contain statements in the target language and a special construct is used to break into Ragel directives, including the specification of the scanner itself. This means that the file dives straight in to C++ with the #include for the lexer’s own header file and those of other headers required for the implementation:

#include "Lexer.hh"
#include <cstdlib>
#include <string>

Machine

Next up is the definition of the lexer machine itself. I am not going to go into detail about the construction of the lexer itself: this is more than adequately described in the Ragel manual.

%%{
  machine Lexer;
  alphtype char;
  write data;
    
  intlit = digit+;
    
  fltexp = [Ee] '-'? intlit;
    
  fltdot = '.' digit+ fltexp?;
    
  number = (intlit (fltdot | fltexp)? );
    
  ws = [ \t\n];
    
  main := |*
    
    number
      => {ret = Parser::token::NUM;
          *val = strtod(std::string(ts, te).c_str(), 0);
          fbreak;};
    
    '('
      => {ret = Parser::token::LPAREN; fbreak;};
    ')'
      => {ret = Parser::token::RPAREN; fbreak;};
    '+'
      => {ret = Parser::token::PLUS; fbreak;};
    '-'
      => {ret = Parser::token::MINUS; fbreak;};
    '*'
      => {ret = Parser::token::MUL; fbreak;};
    '/'
      => {ret = Parser::token::DIV; fbreak;};
    '^'
      => {ret = Parser::token::POW; fbreak;};
    
    ws;
    
  *|;
}%%

Some things should be noted here, the first being the use of ‘ret’ which is the return value from the lex() function that is defined below.

Second, note the use of ‘fbreak’ to cause Ragel to generate code that breaks out of the scanning loop, returning control to the lex() function and the Bison parser. It should also be noted that the ‘ws’ alternative does not return a token and does not break out of the scanning loop with the desired result that whitespace is never seen by the parser.

Finally, the ‘number’ alternative manipulates the ‘val’ parameter, this being the semantic value of the token and taking the form of a pointer to a double (as described above: YYSTYPE).

Constructor

Following the definition of the scanner the file reverts to being C++ source and resumes with the definition of the constructor for the Lexer class.

Calc::Lexer::Lexer
(char const* p_, char const* pe_)
  : p(p_)
  , pe(pe_)
  , eof(pe_)
{
  %% write init;
}

The constructor must manually initialize the buffer pointers but then uses a Ragel directive to cause the generation of code to initialise the remaining object data.

lex()

The final part of the lexer is the lex() function itself which turns out to be surprising simple. The ‘ret’ return variable is defined and initialised with the END token so that empty input buffers will be handled correctly.

The rest of the function will be generated by Ragel from the machine definition presented above. The form of this code depends on how Ragel is invoked.

Calc::Parser::token_type
Calc::Lexer::lex
(Parser::semantic_type* val)
{
  Parser::token_type ret = Parser::token::END;
  %% write exec;
  return ret;
}

main()

The last piece of code needed for the calculator is the main() function which is found in file main.cc. The body of main() consists of a loop over the command-line arguments, passing each in turn as a buffer to an instance of the Lexer class, which itself is given to an instance of the Parser class. The parser instance parse() function is called to perform the actual parsing operation, printing results as they are calculated.

#include "Lexer.hh" // includes Parser.hh
#include <cstring>  // for strlen()

int
main
(int argc, char const* argv[])
{
  for(int arg = 1; arg < argc; ++arg) {
    Calc::Lexer lexer(argv[arg], argv[arg] + strlen(argv[arg]));
    Calc::Parser parser(lexer);

    parser.parse();
  }

  return 0;
}

Building

The Makefile is straightforward if you come from a traditional command-line-oriented Unix-y background. Explicit dependencies are provided to require the execution of Bison and Ragel on their respective inputs before any attempt to compile their outputs!

CXX = g++
LNK = $(CXX)
BISON = bison
RAGEL = ragel

all: calc

clean:
    $(RM) *~ *.o calc
    $(RM) Parser.hh Parser.cc location.hh position.hh stack.hh Lexer.cc

calc: Lexer.o Parser.o main.o
    $(LNK) $^ -o $@

%.o: %.cc Makefile
    $(CXX) -c -o $@ $<

Parser.cc: Parser.yy
    $(BISON) -o $@ $<

Parser.yy: Lexer.hh

Lexer.cc: Lexer.rl Parser.cc
    $(RAGEL) -C -o $@ $<

Conclusion

This article presents nothing new; Bison has been around for decades and Ragel for a number of years. Both are very well documented. What is presented above is a synthesis of the available information, showing how to bring these two tools together such that developers can take advantage of the powerful features of Ragel while staying with the established Bison parsing system.

The key steps involved can be summarised thus:

  • Make a forward-declaration of the lexer class available to the parser,
  • Use %parse-param to cause the parser to take and hold a lexer instance (by reference),
  • Define yylex to forward to some appropriately defined function of the referenced lexer instance.

With these three points satisfied everything else is just details.

Advertisements

4 thoughts on “Connecting Ragel to Bison in C++

  1. Good read! But I have a question regarding something I’ve struggled to wrap my head about the last days. I notice that you put your %% write data; directive outside of a class. I’ve been trying to encapsulate my entire scanner in a class but this little bit still eludes me. I’ve been trying to put it into the class definition (in the header file), as it outputs among other things “static const int [machine]_start = 1. But ragel doesn’t seem to like to work on the header file if I have %%{ machine lexer; write data }%% in it. Eventhough the Ragel user guide mentions that you can put it inside a class.

    I guess what I lose at the moment is that I cannot have several instances of my lexer at the same time. Which at the moment is not of any greater importance.

    • Fredrik, thank you for your question. I can report that it is possible to make Ragel work on header files as well as C++ source. The full explanation probably constitutes another post to this blog because it is too complicated for a comment such as this. But, a hint is that you have to separate the Ragel machine into three files: the pure definition of the machine and two files that generate the header and body respectively. The latter two files must include the first using [%% include Machine “Machine.rl”;] or something like it.
      Having said all that, I have to ask why you would want to do this, given that it doesn’t work properly for table-based lexers and the rest of the stuff generated by [%% write data;] can be generated in the main code for strictly local use without causing encapsulation problems.
      Can you explain why you need to have more than one lexer /in the same file/?

  2. Webkit uses two well known parser generators – Flex for creating a lexer and Bison for creating a parser (you might run into them with the names Lex and Yacc). Flex input is a file containing regular expression definitions of the tokens. Bison’s input is the language syntax rules in BNF format.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s