dccc is a tool I made because I didn't like yacc.
Basically, dccc is a language that combines powerful grammar specification with AST definition in a very concise and readable way.
To install, just put the file "dccc" in a location where your PATH environment variable can find it.
Type "dccc --help" to get a description of the command line parameters.
token INT: int prec PLUS MINUS: left, TIMES DIV: left expr ::= Int: INT | Plus: expr, PLUS, expr | Neg: MINUS, expr | Minus: expr, MINUS, expr | Times: expr, TIMES, expr | Div: expr, DIV, expr | Expr: LEFTPAR, expr, RIGHTPARSave it as "expr.dg" and compile it with the following command:
dccc expr.dg --yacc expr.mly --tree ast.mlThis creates a file "expr.mly", which can be compiled with ocamlyacc, and a file "ast.ml", which contains the AST declaration. The yacc file doesn't look very pretty, but the AST file is interesting:
type loc = Lexing.position * Lexing.position
type 'a located = { loc : loc; node : 'a }
and expr = expr_node located
and expr_node = [
`Expr of expr
| `Div of (expr * expr)
| `Times of (expr * expr)
| `Minus of (expr * expr)
| `Neg of expr
| `Plus of (expr * expr)
| `Int of int ]
As you see, every expressions are located. Polymorphic variants are used instead of variants. This allows the name space to be more flexible, but it might be a bit harder to read type errors. just write type annotations for your functions using the definitions of this AST file to avoid this problem.
my_rule: ... auxilliary_rule ...
auxilliary_rule:
TOKEN auxilliary_rule { $1::$2 }
| { [] }
The semantic actions might be different, but in OCaml, using list here is very natural. With dccc, you can just write:
my_rule: ... TOKEN* ...This is much less verbose and tedious.
In fact, a token in dccc can be a complex expression. Here is a table describing the possible tokens. The type "token" refers to the associated type of the token, which is built recursively from this table (see later for remarks on types).
| Token | Description | Type |
|---|---|---|
| identifier | A terminal or a non-terminal. | (associated type) |
| (token) | token | |
| token1, token2 | Sequence of tokens. | token1 * token2 |
| {token} | Token group. | token |
| #token | Ignore the value of this token, even if it has an associated type different than unit. | unit |
| token1 | token2 | Non-determinism. token1 and token2 must have compatible types. | token1 |
| Label: token | Labeled token. The token is labeled with the polymorphic variant label `Label. | [> `Label of token] |
| token? | Optional token: it appears 0 or 1 time. | token option |
| token* | Repeat token an arbitrary number of times, 0 included. | token list |
| token+ | Repeat token an arbitrary number of times, at least 1 time. | token list |
| token1!token2 | Repeat token1 an arbitrary number of times, separating the occurences using token2. | token1 list |
| prec identifier | The token indicates a precedence and is inserted as a "%prec" in the yacc file. | unit |
The default type associated to a terminal is unit, but you can change it with the construction "token":
token token_name: token_typeThis changes the type of token_name to token_type. Moreover, the semantic actions and the AST will be modified so that the AST contain the value associated to the terminal.
If you don't want the value of a terminal to be stored into the AST, you can use the operator #.
If you really need a type of "a * (b * c)", you can use the construction {}:
my_rule ::= a, { b, c }
bc ::= b, c
The type "unit" is discarded in tuples:
token INT: int my_rule ::= INT, COMMA, INTThe type of the non-terminal "my_rule" will be "int * int", because the non-terminal "COMMA", with no associated type, has the type "unit".
rule ::= A { ... } | B { ... } | C { ... }
In dccc, this example is written like this:
rule ::= A | B | CIt looks pretty much the same, doesn't it? But actually, "A | B | C" is just one token, which is the non-deterministic token "A | (B | C)" which can reduces either to "A" or "B | C", which itself can reduce to "B" or "C".
In dcc, one can use this | construction anywhere:
rule ::= A, (B | C), DIn yacc, this example must be written using an auxilliary non-terminal:
rule ::= A bc D { ... }
bc ::= B C { ... }
In dccc, because every token returns a value thanks to the automatically generated semantic actions, the tokens in a | construction must have the same type. Labels are used to turn a token into a polymorphic variant. This is very useful to use two tokens of different types in a | construction. Here are some examples:
token INT: int rule ::= INT, DOT | INT, SEMICOLONHere, the grammar allows both "15." and "15;". They are treated equally: only the value 15 is seen in the AST, which doesn't make the difference between the two choices.
token INT: int rule ::= Dot: INT, DOT | Semi: INT, SEMICOLONThis is almost the same grammar, but the choices are labelled. Thus, in the AST, the non-terminal "rule" has type "[> `Dot of int | `Semi of int]", and one can make a difference between "15." (which is represented in the AST as "`Dot 15") and "15;" (which is represented as "`Semi 15").
token INT: int token STRING: string rule ::= INT | STRINGThis is not valid, because tokens INT and STRING have different types, namely int and string. To make this work, we use labels:
token INT: int token STRING: string rule ::= Int: INT | String: STRING
prec STAR: nonassoc, VERT COLON: left, COMMA: right, SHARP PLUS: nonassocTokens STAR and PLUS are non associative. Tokens VERT and COLON are left-associative. Token COMMA is right-associative. Token STAR has the lowest precedence, while PLUS has the highest.
token INT: int token ID: string prec VERT COLON COMMA BANG: left, SHARP STAR PLUS QUESTIONMARK: nonassoc grammar ::= ( GRule: ID, COLONCOLONEQUAL, tok | GToken: TOKEN, ID, COLON, ID | GPrec: PREC, (ID+, COLON, ID)!COMMA)*, EOF tok ::= TIdent: ID | TIgnore: SHARP, tok | TSeq: tok, COMMA, tok | TBranch: tok, VERT, tok | TStar: tok, STAR | TPlus: tok, PLUS | TOption: tok, QUESTIONMARK | TBang: tok, BANG, tok | TLabeled: ID, COLON, tok | TSub: LEFTACC, tok, RIGHTACC | TPrec: PREC, ID | TTok: LEFTPAR, tok, RIGHTPARThe tokens (non-terminal tok) are easy to understand. The non-terminal grammar, which describes an entire file, makes good use of the constructions *, !, and +.
Save this file as "dccc.dg" and compile it using:
dccc dccc.dg --yacc parser.mly --tree ast.mlNow you can compile the file "parser.mly" using ocamlyacc:
ocamlyacc parser.mlyYour parser can now be compiled using ocamlc:
ocamlc -c ast.ml parser.mli parser.mlNow you just have to write a lexer and you'll be able to parse grammars using Parser.grammar, or tokens using Parser.tok. For more information about writing lexers and an example of a main file that uses a parser, see the OCaml documentation on the INRIA's website.