dccc is a tool I made because I didn't like yacc.
Basically, dccc is a language that combines powerful grammar specification with AST definition in a very concise and readable way.
To install, just put the file "dccc" in a location where your PATH environment variable can find it.
Type "dccc --help" to get a description of the command line parameters.
token INT: int prec PLUS MINUS: left, TIMES DIV: left expr ::= Int: INT | Plus: expr, PLUS, expr | Neg: MINUS, expr | Minus: expr, MINUS, expr | Times: expr, TIMES, expr | Div: expr, DIV, expr | Expr: LEFTPAR, expr, RIGHTPARSave it as "expr.dg" and compile it with the following command:
dccc expr.dg --yacc expr.mly --tree ast.mlThis creates a file "expr.mly", which can be compiled with ocamlyacc, and a file "ast.ml", which contains the AST declaration. The yacc file doesn't look very pretty, but the AST file is interesting:
type loc = Lexing.position * Lexing.position type 'a located = { loc : loc; node : 'a } and expr = expr_node located and expr_node = [ `Expr of expr | `Div of (expr * expr) | `Times of (expr * expr) | `Minus of (expr * expr) | `Neg of expr | `Plus of (expr * expr) | `Int of int ]As you see, every expressions are located. Polymorphic variants are used instead of variants. This allows the name space to be more flexible, but it might be a bit harder to read type errors. just write type annotations for your functions using the definitions of this AST file to avoid this problem.
my_rule: ... auxilliary_rule ... auxilliary_rule: TOKEN auxilliary_rule { $1::$2 } | { [] }The semantic actions might be different, but in OCaml, using list here is very natural. With dccc, you can just write:
my_rule: ... TOKEN* ...This is much less verbose and tedious.
In fact, a token in dccc can be a complex expression. Here is a table describing the possible tokens. The type "token" refers to the associated type of the token, which is built recursively from this table (see later for remarks on types).
Token | Description | Type |
---|---|---|
identifier | A terminal or a non-terminal. | (associated type) |
(token) | token | |
token1, token2 | Sequence of tokens. | token1 * token2 |
{token} | Token group. | token |
#token | Ignore the value of this token, even if it has an associated type different than unit. | unit |
token1 | token2 | Non-determinism. token1 and token2 must have compatible types. | token1 |
Label: token | Labeled token. The token is labeled with the polymorphic variant label `Label. | [> `Label of token] |
token? | Optional token: it appears 0 or 1 time. | token option |
token* | Repeat token an arbitrary number of times, 0 included. | token list |
token+ | Repeat token an arbitrary number of times, at least 1 time. | token list |
token1!token2 | Repeat token1 an arbitrary number of times, separating the occurences using token2. | token1 list |
prec identifier | The token indicates a precedence and is inserted as a "%prec" in the yacc file. | unit |
The default type associated to a terminal is unit, but you can change it with the construction "token":
token token_name: token_typeThis changes the type of token_name to token_type. Moreover, the semantic actions and the AST will be modified so that the AST contain the value associated to the terminal.
If you don't want the value of a terminal to be stored into the AST, you can use the operator #.
If you really need a type of "a * (b * c)", you can use the construction {}:
my_rule ::= a, { b, c } bc ::= b, cThe type "unit" is discarded in tuples:
token INT: int my_rule ::= INT, COMMA, INTThe type of the non-terminal "my_rule" will be "int * int", because the non-terminal "COMMA", with no associated type, has the type "unit".
rule ::= A { ... } | B { ... } | C { ... }In dccc, this example is written like this:
rule ::= A | B | CIt looks pretty much the same, doesn't it? But actually, "A | B | C" is just one token, which is the non-deterministic token "A | (B | C)" which can reduces either to "A" or "B | C", which itself can reduce to "B" or "C".
In dcc, one can use this | construction anywhere:
rule ::= A, (B | C), DIn yacc, this example must be written using an auxilliary non-terminal:
rule ::= A bc D { ... } bc ::= B C { ... }In dccc, because every token returns a value thanks to the automatically generated semantic actions, the tokens in a | construction must have the same type. Labels are used to turn a token into a polymorphic variant. This is very useful to use two tokens of different types in a | construction. Here are some examples:
token INT: int rule ::= INT, DOT | INT, SEMICOLONHere, the grammar allows both "15." and "15;". They are treated equally: only the value 15 is seen in the AST, which doesn't make the difference between the two choices.
token INT: int rule ::= Dot: INT, DOT | Semi: INT, SEMICOLONThis is almost the same grammar, but the choices are labelled. Thus, in the AST, the non-terminal "rule" has type "[> `Dot of int | `Semi of int]", and one can make a difference between "15." (which is represented in the AST as "`Dot 15") and "15;" (which is represented as "`Semi 15").
token INT: int token STRING: string rule ::= INT | STRINGThis is not valid, because tokens INT and STRING have different types, namely int and string. To make this work, we use labels:
token INT: int token STRING: string rule ::= Int: INT | String: STRING
prec STAR: nonassoc, VERT COLON: left, COMMA: right, SHARP PLUS: nonassocTokens STAR and PLUS are non associative. Tokens VERT and COLON are left-associative. Token COMMA is right-associative. Token STAR has the lowest precedence, while PLUS has the highest.
token INT: int token ID: string prec VERT COLON COMMA BANG: left, SHARP STAR PLUS QUESTIONMARK: nonassoc grammar ::= ( GRule: ID, COLONCOLONEQUAL, tok | GToken: TOKEN, ID, COLON, ID | GPrec: PREC, (ID+, COLON, ID)!COMMA)*, EOF tok ::= TIdent: ID | TIgnore: SHARP, tok | TSeq: tok, COMMA, tok | TBranch: tok, VERT, tok | TStar: tok, STAR | TPlus: tok, PLUS | TOption: tok, QUESTIONMARK | TBang: tok, BANG, tok | TLabeled: ID, COLON, tok | TSub: LEFTACC, tok, RIGHTACC | TPrec: PREC, ID | TTok: LEFTPAR, tok, RIGHTPARThe tokens (non-terminal tok) are easy to understand. The non-terminal grammar, which describes an entire file, makes good use of the constructions *, !, and +.
Save this file as "dccc.dg" and compile it using:
dccc dccc.dg --yacc parser.mly --tree ast.mlNow you can compile the file "parser.mly" using ocamlyacc:
ocamlyacc parser.mlyYour parser can now be compiled using ocamlc:
ocamlc -c ast.ml parser.mli parser.mlNow you just have to write a lexer and you'll be able to parse grammars using Parser.grammar, or tokens using Parser.tok. For more information about writing lexers and an example of a main file that uses a parser, see the OCaml documentation on the INRIA's website.