Transforming Syntax
Or: how to write the easy part of a compiler
This article has moved to my current blog. Please look there for the most up to date version. The new version includes many corrections and improvements.
This post will cover this stackoverflow question on how to translate an expression to a C-like language by user2516265. The other answers suggested regexes or treating it as Perl code *shudder*. This post aims to cover:
- parsing with Marpa::R2,
- AST manipulation,
- optimization passes,
- compilation, and
- Perl OO.
It won't cover parser theory etc. or provide a copy&paste-ready solution. The estimated size of the final code will be around
Note: All runnable code snippets should be prefixed by
use strict; use warnings; use 5.010;
.
Original question
I am new to the perl scripting. I am writing script to read excel file and put in text file in C programing syntax.
So I excel sheet I have string like below:
If ((Myvalue.xyz == 1) Or (Frmae_1.signal_1 == 1)) Then a = 1 Else a = 0;This I have to convert into:
a = (((Myvalue.xyz == 1) || (Frmae_1.signal_1 == 1))?1:0)How this can be handled in perl?
Short introduction to Marpa
Various parser exists for Perl. Some of these are ancient tech, some are solid but slow, some are awesome. My favourite Parsers are:
- Regexp::Grammars
- Provides many powerful and expressive tools. Retains the full power of Perl regexes.
- Parser::MGC
- Makes it trivial to write an object oriented top-down parser. However, the resulting code is unneccessarily verbose.
- Marpa::R2
- This is a wrapper around libmarpa. It is written in C, and uses Early parsing instead of top-down techniques. This makes it fairly easy to write arbitrary BNFm which Marpa will happily use. While it is strictly less powerful than regexes, Marpa provides excellent interfaces and good debugging help that makes it worth the extra code.
Here, we will use the Scanless Interface for Marpa, which adds lexing capabilities to the BNF. To create a new grammar, we have to
$grammar = Marpa::R2::Scanless::G->new({ %options })
one of these options is the BNF source, which has to be passed as a reference.
Writing the BNF
We want to write a grammar that is able to correctly parse the example data
If ((Myvalue.xyz == 1) Or (Frmae_1.signal_1 == 1)) Then a = 1
Else a = 0;
We see that the program is made from statements that are separated by semicolons. We can express this with the BNF rule
StatementList ::= <Expression>+ separator => <op semicolon>
<op semicolon> ~ ';' # Tokens are declared with '~'
An expression can be an If
/Then
/Else
, a paranthesized expression, a binary operator, an identifier, or a numeric literal. We should choose a sensible precedence for all these possibilities. E.g. paranthesization and literals should have very high precedence. The conditional should be lower. For the binary operators, I have choosen the precedence
==
equality test=
assignmentOr
logical operator
Marpa Scanless allows us to write simple alternatives with |
, and ||
to provide prioritized rules (this works roughly like |
in regexes with respect to backtracking). The BNF for the expression would be:
Expression ::=
('(') Expression (')') assoc => group # tokens enclosed in parens are matched, but their values discarded
| Number
|| Ident
|| Expression '==' Expression
|| Expression '=' Expression
|| Expression 'Or' Expression
|| Conditional
(This isn't exactly good style, but it is good enough)
The other referenced rules are:
Conditional ::=
('If') Expression ('Then') Expression
| ('If') Expression ('Then') Expression ('Else') Expression
Ident ~ ident
Number ~ <number int> | <number rat>
word ~ [\w]+ # Perl charclasses can be used for tokens
ident ~ word | ident '.' word # allow complex identifiers
<number int> ~ [\d]+
<number rat> ~ <number int> '.' <number int>
We can indicate Marpa to start matching at the “top” with the pseudorule :start ::= StatementList
, and can allow the lexer to skip whitespace between tokens via
:discard ~ ws
ws ~ [\s]+
In the next section, we take a short look at object oriented Programming with Perl and will then circle back to see how Marpa interacts with that.
Intermission: Perl OO
An object is a thingy which we can ask to do things. Depending on what kind of thing it is, it will either throw an error, or fulfill our request. This can simplify code because we don't care how they work on the inside, or where they know from how to perform the task.
Perl uses class-based OO. We can declare a class with the package
keyword. All sub
s in that package can then be used as methods. We can inherit methods from another package via use parent 'Parent::Class'
. When a sub is used as a method, the object it was invoked on is passed as the first argument. This invocant is usually called $self
in proud Smalltalk tradition.
The invocant may also be a string that holds the name of the package. That is, Perl makes no distinction between class methods and instance methods.
We can create a new instance of a class by bless
ing a reference into that class. Any reference will do. Usually, hash references are used, but here we will focus on array references. For demonstration purposes, here is a small example to show blessing and inheritance. In this example, we create unneccessarily complex classes to calculate the statistical mean and variance from a data set.
# Call the `new` method on class `Data::Variance`:
my $object = Data::Variance->new(1.1, 1.9, 2.1);
# Call the `variance` method on the object:
say "Var: ", $object->variance;
# Call the inherited `mean` method on the object:
say "Mean:", $object->mean;
# Define the `Data::Mean` class which provides the `mean` method
package Data::Mean;
sub new {
# The invocant for `new` should be the class name
my ($class, @data) = @_;
# bless a reference to the data into the provided class
my $self = bless \@data => $class;
# return the freshly baked object
return $self;
}
# calculate the mean
sub mean {
my ($self) = @_;
my $sum = 0;
$sum += $_ for @$self;
return $sum / @$self;
}
# Define the `Data::Variance` class which inherits from `Data::Mean`:
package Data::Variance;
use parent -norequire, 'Data::Mean'; # specify inheritance.
# The `-norequire` option is used because the parent is in the same file
# `new` is inherited
# `mean` is inherited
# calculate the variance, bootstrapping it with the `mean`
sub variance {
my ($self) = @_;
my $mean = $self->mean;
return Data::Mean->new(map { ($_ - $mean)**2 } @$self)->mean; # nifty math
}
Output:
Var: 0.186666666666667
Mean:1.7
In that example, the reference is blessed in a new
method. This is considered good style, but we can bless any reference anywhere into any class (the class doesn't even have to exist). This is what we'll let Marpa do in the next step.
Marpa's bless
adverb
We can let Marpa take the array of matched values, and bless them into a class of our choosing. We can do so by augmenting a rule with the bless
adverb, e.g.
StatementList ::= <Expression>+ separator => <op semicolon> bless => Block
But we also have to specify how these values are obtained. We can do so by setting default attributes like
:default ::= action => [values]
We could have also written an action ourselves that takes the matched values as arguments and returns some data structure. The above default action is similar to sub { return [@_] }
.
We can specify a class root for the bless
adverb. If we construct the grammar with the bless_package
option, e.g. bless_package => 'Ast'
, then the StatementList
would not be blessed into the Block
class but into Ast::Block
.
Similar classes are added to the other rules.
Obtaining a parse
This is the general workflow to obtain a parse from a Marpa::R2::Scannless::G
Grammar object:
Create a recognizer for that grammar:
my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });
Read the input string:
$recce->read(\$input);
Look at the value of the parse. The parse failed if it is
undef
. Otherwise, it is a reference to out AST:my $value = $recce->value; defined $value or die "Parse failed"; my $ast = $$value;
I have prepared a gist with fully runnable code that parses the example input, and prints out the resulting data structure. The output is shown below (although I improved the intendation):
$VAR1 = bless( [
[
bless( [
bless( [
bless( [
bless( ['Myvalue.xyz'], 'Ast::Var' ),
'==',
bless( ['1'], 'Ast::Literal' )
], 'Ast::Binop' ),
'Or',
bless( [
bless( ['Frmae_1.signal_1'], 'Ast::Var' ),
'==',
bless( ['1'], 'Ast::Literal' )
], 'Ast::Binop' )
], 'Ast::Binop' ),
bless( [
bless( ['a'], 'Ast::Var' ),
'=',
bless( ['1'], 'Ast::Literal' )
], 'Ast::Binop' ),
bless( [
bless( ['a'], 'Ast::Var' ),
'=',
bless( ['0'], 'Ast::Literal' )
], 'Ast::Binop' )
], 'Ast::Cond' )
]
], 'Ast::Block' );
The above data structure is equivalent to this tree:
Block
|
[ ]
|
/----------------------- Cond ---------\
/ | \
Binop Binop Binop
/ | \ / | \ / | \
Binop 'Or' \-- Binop Var '=' Literal Var '=' Literal
/ | \ / | \ | | | |
Var '==' Literal Var '==' Literal 'a' '1' 'a' '0'
| | | |
'Myvalue.xyz' '1' 'Frmae_1.signal_1' '1'
We can see that the Data::Dumper
output is much harder to read than the tree. This is mainly because it contains too much superfluous information.
Working with the AST
Before we can start working with the syntax tree Marpa creates for us, we should take a moment to specify inheritance, and write a few accessor method so that we don't have to hardcode indices of the underlying arrayrefs. This is fairly boring:
package Ast; # a parent class to define common methods
# a constructor for convenience.
# Marpa won't use this but rather `bless` directly
sub new {
my ($class, @args) = @_;
bless \@args => $class;
}
package Ast::Binop;
use parent -norequire, 'Ast';
sub l { shift()->[0] }
sub op { shift()->[1] }
sub r { shift()->[2] }
package Ast::Var;
use parent -norequire, 'Ast';
sub name { shift()->[0] }
package Ast::Cond;
use parent -norequire, 'Ast';
sub cond { shift()->[0] }
sub then { shift()->[1] }
sub else { shift()->[2] }
package Ast::Block;
use parent -norequire, 'Ast';
sub contents { shift()->[0] }
# comes in handly later on
sub new {
my ($class, @items) = @_;
$class->SUPER::new(\@items);
}
package Ast::Literal;
use parent -norequire, 'Ast';
sub val { shift()->[0] }
We will now look at a few ways to transform the AST.
Transformation: cloning
It can sometimes be useful to make a deep copy of a data structure. This is fairly easy for each AST node. To copy an AST node, it
- clones all child nodes, or copies the values if they aren't a regular node (we call such nodes a terminal node).
- blesses an array reference containing those copies into the correct class.
To make things a bit easier, we start by defining a childs
method that just gives the children nodes:
package Ast;
...;
sub childs { @{ shift() } }
...;
package Ast::Block;
...;
sub childs { @{ shift()->contents } }
To determine whether a given child node is a terminal node, we ask it if it can perform the clone
method. We can do so with the universal can
method. I.e. $node->can("clone")
does this check.
However, it is a fatal error to call a method on an unblessed reference (something that isn't an object). Therefore, we first do a check whether the $node
is an object at all. The blessed
method from Scalar::Util
comes in handy here.
Therefore, we can perform cloning like
package Ast;
use Scalar::Util qw/blessed/;
...;
sub clone {
my ($self) = @_;
my @childs = map { blessed($_) && $_->can("clone") ? $_->clone : $_ } $self->childs;
ref($self)->new(@childs);
# ref $self is the class of $self.
# Remember that we can call methods on strings
# that contain the name of a class.
}
Note: This technique assumes that the given data structure is a tree alright, and doesn't contain loops (that would be a cyclic graph, not a tree).
I have prepared a gist that runs this code on the above Data::Dumper
output.
Transformation: AST dump (aka. compilation)
As mentioned above, the Data::Dumper
output is hard to read. Therefore, we will create a method that prettyprints a part of the AST. The output will generally look like
Ast::Something(
childnode_A(
...
)
'some literal'
${a.variable}
)
Our prettyprint
method will take an optional argument that indicates the intendation level. First, we add the general implementation:
package Ast;
...;
sub prettyprint {
my ($self, $indent) = @_;
$indent //= 0; # initialize $indent if no value passed
$indent++; # increment indent level
my $items = join "\n", # concatenate items with newline
map { " "x$indent . $_ } # pad with intendation
map { blessed($_) && $_->can("prettyprint")
? $_->prettyprint($indent)
: $_
} $self->childs;
my $type = ref $self;
return "$type(\n" . $items . " )";
}
For Ast::Literal
and Ast::Var
, we provide special implementations:
package Ast::Literal;
...;
sub prettyprint {
my $self = shift;
my $val = $self->val // return "undef";
return qq("$val");
}
package Ast::Var;
...;
sub prettyprint {
my $self = shift;
'${' . $self->name . '}';
}
I have made a gist that adds prettyprinting. When the prettyprint
method is called on the dumped AST, we get the output
Ast::Block(
Ast::Cond(
Ast::Binop(
Ast::Binop(
${Myvalue.xyz}
==
"1" )
Or
Ast::Binop(
${Frmae_1.signal_1}
==
"1" ) )
Ast::Binop(
${a}
=
"1" )
Ast::Binop(
${a}
=
"0" ) ) )
What I have done here was defining a language, and then compiling the AST to that language. I can easily add further implementations that change how the AST is compiled. E.g, we could add an if-then-else for the conditional like
package Ast::Cond;
...;
sub prettyprint {
my ($self, $indent) = @_;
$indent //= 0;
my ($cond, $then, $else) =
map { Scalar::Util::blessed($_) && $_->can("prettyprint")
? $_->prettyprint($indent)
: $_
} $self->childs;
return "if $cond\n"
. " "x$indent . "then $then\n"
. " "x$indent . "else $else";
}
This would produce the output
Ast::Block(
if Ast::Binop(
Ast::Binop(
${Myvalue.xyz}
==
"1" )
Or
Ast::Binop(
${Frmae_1.signal_1}
==
"1" ) )
then Ast::Binop(
${a}
=
"1" )
else Ast::Binop(
${a}
=
"0" ) )
(see the gist for the complete code)
When you compile the AST to the C-like target language, you will probably use similar transformations.
Transformation: AST modification (aka. optimization)
Arguably more difficult than flattening the AST to some representation is modifying the structure of the syntax tree. This is due mostly to Perl's type system which doesn't lend itself to these tasks. (Generally, strongly typed functional languages like Ocaml or Haskell tend to excel here due to very expressive pattern matching).
In your stackoverflow question, you asked how to factor out the common assignment from the two branches of the conditional. To do that, we have to assert
- That both branches are binops.
- That the operator of both binops is the assignment operator
=
. - That the lvalue of both assignments is a variable.
- That this variable goes by the same name.
Then we can factor out the assignment, and will assign the result of the conditional. Expressed as Perl code, given a branch $branch
, we want the condition
$branch->isa("Ast::Binop") && $branch->op eq "=" && $branch->l->isa("Ast::Var")
to be true. The 4th condition can be expressed as $then->l->name eq $else->l->name
. The complete simplify
method for $cond
would then be:
package Ast::Cond;
...;
sub simplify {
my $self = shift;
my ($cond, $then, $else) = @$self;
if (
not( grep not($_->isa('Ast::Binop') && $_->op eq '=' && $_->l->isa('Ast::Var')), $then, $else) and
$then->l->name eq $else->l->name
) {
return Ast::Binop->new(
$then->l->clone,
'=',
Ast::Cond->new($cond->simplify, $then->r->simplify, $else->r->simplify),
);
}
# else: just do what would have been done by default
return $self->SUPER::simplify;
}
The not grep not ...
is an application of De Morgan's Law:
All other nodes inherit a simplify
from Ast
that does
sub simplify {
my $self = shift;
my @childs = map {blessed($_) && $_->can("simplify") ? $_->simplify : $_} $self->childs;
ref($self)->new(@childs);
}
In other words, a simplify
on an AST that does not contain a conditional where the common subexpression is found has the same result as a clone
on the AST.
If the simplify
method is invoked on our AST, we get the following output:
Ast::Block(
Ast::Binop(
${a}
=
if Ast::Binop(
Ast::Binop(
${Myvalue.xyz}
==
"1" )
Or
Ast::Binop(
${Frmae_1.signal_1}
==
"1" ) )
then "1"
else "0" ) )
Closing remarks
Creating an AST and compiling it down to some representation may seem somewhat hard, but it isn't rocket science either. In the above examples, simplicity and code size were a constraint. In a real-world application, it would be helpful to define more AST node types. There is no such thing as a Binop
on a semantic level, instead I'd use Ast::Logical::Or
, Ast::Assign
, Ast::Numeric::Eq
and so on. Also, numeric literals should have a different type than string literals etc.
This would also make the final compilation to the C-like language easier. To make that step, one would implement another method that recurses through the tree, as clone
, simplify
and pretyprint
do. However, there is no general solution, so there isn't much inheritance going on here. Especially the implementation of compile
for the Ast
class should probably throw an error.
Because the AST is unambiguous, there is no concept of precedence present in a tree. Therefore, it might be helpful to surround each output subexpression with parens, even when a human programmer knows that they aren't needed.
The AST Marpa produces and the AST we actually want may sometimes be slightly different. To do the translation, custom actions can be helpful. The Marpa docs cover this to some extent.
The grammar shown was tailored to match the example input, but not much more. Most important of all, more operators would have to be implemented with the correct precedence and association rules. Many operators are left-associatove:
$x + $y + $z == (($x + $y) + $z)
But assignment is right-associative:
($x = $y = $z) == ($x = ($y = $z))
Associativity can be specified by the assoc
adverb in the Marpa grammar, e.g.
Expression ::=
...
|| Expression ('=') Expression assoc => right bless => Assign
...
Some operators like comparision aren't associative in Perl, instead
$x <= $y <= $z
is a syntax error (Perl5 has no chaining comparision operators).