Neotoma - Super powerful parsing for erlang
21 April 2010
Erlang strings are painful
Oh it’s so true. The pain is super apparent, especially when trying to parse configuration files. The traditional way to parse a configuration file that is not in the erlang format can be pretty hard to do. For instance, for beehive, the application configuration template looks like:
# Config file
# For example, a rack app
bundle: echo "Bundle java stuff"
start: /bin/rackstart.sh
# etc. etc.
Originally, this was parsed in lex/yacc and consumed in c++ (shudder). The code for that is available buried deep within the history of babysitter
A traditional parser would look something like this:
-module (config_parser).
-export ([file/1]).
-define (SEPARATOR, $:).
file(Filename) ->
{ok, Fd} = file:open(Filename, [read]),
io:setopts(Fd, [binary]),
for_each_line(Fd, fun parse/3, 1, []).
for_each_line(Device, Proc, Count, Accum) ->
case io:get_line(Device, "") of
eof -> file:close(Device), Accum;
Line ->
NewAccum = Proc(Line, Count, Accum),
for_each_line(Device, Proc, Count + 1, NewAccum)
end.
parse(Line, Count, Acc) ->
[peg_parse(Line, Count)|Acc].
% START PEG PARSING
peg_parse(Line, Count) ->
{Field, X} = parse_line(Line, [], []),
case Field of
comment -> ok;
X -> io:format("[~p] ~p:~p~n", [Count, Field, X])
end.
% Top
% Strip comments
parse_line(<<$#, _Rest/binary>>, [], _Acc) -> {comment, []};
% Is this the field?
parse_line(<<?SEPARATOR, Rest/binary>>, _Field, Acc) -> parse_line(Rest, lists:reverse(Acc), []);
parse_line(<<Char, Rest/binary>>, Field, Acc) -> parse_line(Rest, Field, [Char|Acc]);
parse_line(<<>>, Field, Acc) -> {Field, lists:reverse(Acc)};
parse_line(<<$\n>>, Field, Acc) -> {Field, lists:reverse(Acc)}.
I’ll only touch on the basics of what that is here (so if you want to skip it, just go to the next section).
Basically we open a file descriptor to the file and tell it to read in the binary format (a little faster and less work on the vm). For every line, we go through character by character and examine based on the position and context that the character is in and store the value in the context where it appears. Later we’ll come back (notice where the io:format is?) and store it in some meaningful way. This is just a demo. If there is enough interest, I can finish it and post it here. Otherwise, I won’t spend more time on it as there is a better way.
Introducing Neotoma
Neotoma, a project by Sean Cribbs that makes PEG parsing in erlang easy. It’s a nifty tool that generates an unambiguous parser that generates a parse tree. Don’t try to use this to create a parser to examine natural languages though, it’s not an CFG (context free grammar) parser.
There aren’t too many resources available yet through google.com, so after some head scratching and pm’ing with Sean Cribbs, the author, I was able to sketch the parser to dig out the parse tree for the grammar in a ridiculously little amount of code.
Before we get into the PEG grammar, there are a few basic pieces of terminology that you need to know.
Terminal symbols
There are two disjoint sets of items called terminal symbols. These are basic pieces (or atoms) of a grammar. There are two types of terminals, the non-terminal and the terminal symbol. They must be, by necessity disjoint sets of symbols for a valid grammar. The reason will become apparent shortly.
- non-terminal symbol - Symbol representing a ‘variable’ in a grammar. These are symbols that can be replaced by other elements of the grammar.
- terminal symbol - Symbols that cannot be broken down any further, but that can be consumed by non-terminals.
Let’s look at an example. These are two non-terminals that describe a signed integer in the BNF form of a grammar:
<integer> ::= ['-'] <digit> {<digit>}
<digit> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
The integer and the digit are the non-terminals, while the 0-9 and - are the terminal symbols.
PEG grammars
A “parsing expression grammar”, or PEG is an formal description of an analytic language model that describes a set of rules for recognizing strings and their context in the language. For instance, in the English language a proper declarative sentence requires a subject and a predicate
My name is Mary.
<subject> = My name
<predicate> = is Mary
Although you should never use a PEG to parse natural language (use a Backus-Naur Form), the corresponding PEG would look something like (incomplete, but only used for example):
% Psst, a Neotoma PEG for the sentence "My name is Mary"
declarative_sentence <- predicate noun ".";
predicate <- "My name";
noun <- "is Mary";
This is a VERY basic PEG parser. What does all this mean? Well…
Introducing Neotoma
Let’s look at the simplest possible grammar for describing the exact same thing, a signed integer.
rule <- Decimal ~;
Decimal <- [0-9]+ `list_to_integer(Node)`;
That’s it? Yep. This might look a little confusing, but don’t worry, I’ll explain what’s happening here. First, look at the `Decimal` rule. That says that any digit 0 through 9 will be consumed by this rule. The stuff in the “ will be what happens to the matched number. We’ll get back to that soon. The `rule` non-terminal will ‘take’ the Decimal rule in. This won’t do anything interesting yet, other than say that the rule will contain an integer. Let’s change that:
additive <- multitative "+" additive / multitative ~;
multitative <- primary "*" multitative / primary ~;
primary <- "(" additive ")" / decimal ~;
decimal <- [0-9]+ `list_to_integer(Node)`;
This shouldn’t be too hard to understand from the rule above, but there are two new things in these rules. First, the stuff that happens in between the “ is the semantic part of the parse tree. Sean calls this the ‘transformation’ of the grammar. What it does is stuff all the matches of the rules into a variable called `Node`. From there, you can do basically what you want with what the transformation returns (so long as the rules that use it understand it). If instead of putting stuff in “ at the end of a rule definition, you put a `~`, this says return the Node variable untouched.
The second new introduction in this set of rules is the `/`. This an `or` statement. It’s a precedence-based ‘or’. So, for our rules above, the primary terminal can either be an additive surrounded by parenthesis OR a decimal. Note, I said that precedence is denoted here. The first item in the list is the first item matched. It’s always a good idea (even the author Sean Cribbs suggests) to try to match the longest rule possible. Precedence is important.
Alright, so remember the very incomplete code example from above that didn’t do anything yet, that looks ugly and is exactly 37 lines long? Well, here is a very complete Neotoma version that can parse the entire file:
% This is the PEG compiler for babysitter configuration files
config_element <- ws? elem_list ws? `list_utils:merge_proplists(Node)`;
elem_list <- head:elem tail:(ws? elem)* `
case Node of
[] -> [];
[""] -> [];
_ ->
Head = proplists:get_value(head, Node),
Tail = [R || [_,R] <- proplists:get_value(tail, Node)],
[Head|Tail]
end
`;
elem <- hook_elem / action_elem ~;
hook_elem <- action '.' (bef / aft) ':' [ \t]* string (crlf / !.) `
{lists:nth(1, Node),
{lists:nth(3, Node), lists:flatten(lists:nth(6, Node))}
}
`;
action_elem <- action ':' [ \t]* string (crlf / !.) `
{lists:nth(1, Node),
{command, lists:flatten(lists:nth(4, Node))}
}
`;
action <- "bundle" / "mount" / "run" / "unmount" / "cleanup" `erlang:list_to_atom(lists:flatten(Node))`;
ws <- (comment / space)* `{}`;
% Atoms
bef <- "before" `pre`;
aft <- "after" `post`;
string <- bracketed_string / nonbracketed_string ~;
nonbracketed_string <- (!crlf .)* ~;
bracketed_string <- '{' str:(!'}' (.))* '}' `proplists:get_value(str, Node)`;
comment <- '#' (!crlf .)* crlf ~;
space <- [ \t\n\s\r] ~;
crlf <- "\r\n" / "\n" ~;
That’s it! Looks like a lot, but it’s not as bad once you start looking at it. I’ll leave this as an exercise to look through it for now. Some hints that I picked up to keep in mind either from Sean Cribbs or from working with the code for a while.
- (crlf / !.) at the end of a line means either a newline (note the crlf rule at the bottom) or the EOF character at the end of the file
- The ’!’ means NOT, it’s a negative look-ahead character, so for example, the nonbracketed_string can be zero or more of any character except the crlf
- The ‘string’ rule can be either a string or the stuff inside of {}s. Notice that the bracketed_string comes first in the string rule, this is why precedence matters. Try it the other way
- You can assign ‘variables’ to the matches. For instance, look at the bracketed_string rule. The tag is: ‘str:’. This will put the tuple: {str, Value} into the Node variable, so you can pull it out later.
To actually use this, make sure the neotoma.beam is in your code path (or use the -p(a/z) switch to load it) and type:
neotoma:file("file.peg").
If all the syntax is correct, neotoma will generate a parser with the two exported functions `parse/1` and `file/1`, which you can compile and use at your leisure.
To get a copy of the code discussed in this tutorial in full, clone it from this repository here: http://github.com/auser/neotoma_template.
Some quick links before I go:
* Neotoma source * Video introduction * Parsing Expression Grammar Wikipedia * Google group
Thanks and I hope this helps you figure out Neotoma. Don’t hesitate to ask.
Comments
blog comments powered by Disqus