Migration of the YACC grammar for PHP to SDF

Last summer, Eric Bouwers started working on infrastructure for PHP program transformation and analysis, sponsored by the Google Summer of Code. He did an excellent job, thanks to his expertise in PHP and his thorough knowledge of Stratego/XT. To enjoy all the language engineering support in Stratego/XT, Eric developed a PHP grammar in SDF, the grammar formalism that is usually applied in Stratego/XT projects. Unfortunately it proved to be very difficult to get the grammar of PHP right.

PHP precedence problems

PHP features many operators, and the precedence of the operators is somewhat unusual and challenging for a grammar formalism. For example, PHP allows the weak binding assignment operator as an argument of the binary, strong binding && operator:

  if ($is_upload && $file = fopen($fname, 'w')) { 

The same holds for the unary, strong binding ! operator:

  if(!$foo = getenv('BAR')){

A similar precedence rule for the include operator allows an include to occur as the argument of the strong binding @ operator:

  @include_once 'Var/Renderer/' . $mode . '.php'

Precedence rule recovery

The most serious problem was to find out what the exact precedence rules of PHP operators are. The syntax of PHP is defined by a YACC grammar, which has a notion of precedence declarations that is heavily used by the PHP grammar. Unfortunately, for more complex grammars it is far from clear what the exact effect of the precedence declarations are. The precedence declarations are only used for conflict resolution in the parse table generator, so if there is no conflict, then the precedence declarations do not actually have any effect on a particular combination of operators. That's why we developed support for recovering precedence rules from YACC grammars, which I already wrote about in a previous blog. Based on these tools, we now have a very precise specification of the precedence rules of PHP.

The next step in the process of getting a perfect PHP grammar was to actually use this specification to develop very precise precedence declarations for the SDF grammar of PHP. However, the precedence rule specification involves about 1650 rules, so migrating these precedence rules to SDF precedence declarations by hand is not really an option. Fortunately, all the ingredients are actually there to generate SDF priority declarations from the precedence rules that we recover from the YACC grammar.

Argument-specific priorities

Thanks to two new features of SDF, these precedence rules can be translated directly to SDF. The first feature is argument-specific priorities. In the past, SDF only allowed priority declarations between productions. For example, the SDF priority

  E "*" E -> E > E "+" E -> E

defines that the production for the + operator cannot be applied to produce any of the E arguments of the production for the * operator, hence the production for the addition operator cannot be applied on the left-hand side or right-hand side of the multiplication operator. This priority implies that the multiplication operator binds stronger than the addition operator. This single SDF priority corresponds to the following two precedence rules in the grammar formalism independent notation we are using in the Stratego/XT Grammar Engineering Tools:

  <E -> <E -> E + E> * E>
  <E -> E * <E -> E + E>>

For many languages precedence rules are different for arguments of the same production. That's why we us the more specific representation of a precedence rules in our grammar engineering tools. Fortunately, SDF now supports argument-specific priorities as well. These argument-specific priorities are just plain numbers that indicate to which arguments of a production the priority applies. For example, the following SDF priority forbids the assignment operator only at the left-most and the right-most E of the conditional operator:

  E "?" E ":" E -> E <0,4> > E "=" E

This corresponds to the following precedence rules:

  <E -> <E -> E = E> ? E : E>
  <E -> E ? E : <E -> E = E>>

Non-transitive priorities

The second new SDF feature that is required for expressing the PHP precedence rules is non-transitive priorities. Before the introduction of this feature, all SDF priorities where transitively closed. For example, if there are two separate priorities

  "!" E -> E   > E "+" E -> E
  E "+" E -> E > V "=" E -> E

then by the transitive closure of priorities this would imply the priority

  "!" E -> E >  V "=" E -> E

This transitive closure feature is useful in most cases, but for some languages (such as PHP) the precedence rules are in fact not transitively closed, which makes the definition of these rules in SDF slightly problematic. For this reason, SDF now also features non-transitive priorities, using a dot before the > of the priority:

  "!" E -> E .> E "+" E -> E

Non-transitive priorities will not be included in the transitive closure, which gives you very precise control over the precedence rules.

Precedence rule migration

Thanks to the position-specific, non-transitive priorities of SDF, the precedence rules that we recover from the YACC grammar for PHP can now be mapped directly to SDF priority declarations. The two precedence rules mentioned earlier:

  <E -> <E -> E + E> * E>
  <E -> E * <E -> E + E>>

now translate directly to SDF priorities:

  E * E -> E <0> .> E + E -> E
  E * E -> E <2> .> E + E -> E

The migration of the recovered YACC precedence rules results in about 1650 of these SDF priorities, but thanks to the fully automatic migration this huge number of priorities is not really a problem. The resulting PHP syntax definition immediately solved all the known issues with the PHP syntax definition, which shows that this migration was most reliable and successful.


There is a lot of interesting work left to be done. First, it would be interesting to develop a more formal grammar for PHP, similar to the grammars of the C, Java, and C# specifications. These specifications all encode the precedence rules of the operators in the production rules, by introducing non-terminals for all the precedence levels. It should not be too difficult to automatically determine such an encoding from the precedence rules we recover. This would result in a formal specification of the PHP syntax, which will benefit many other parser generators. One of the remarkable things we found out is that the unary - operator has the same precedence as the binary - (usually it binds stronger), which results in -1 * 3 being parsed as -(1 * 3). We have not been able to find an example where this strange precedence rule results in unexpected behaviour, but for the development of a solid parser is it essential that such precedence rules are defined precisely.

Second, it would be good to try to minimize the number of generated SDF priorities by determining a priority declaration that you can actually oversee as a human. This would involve finding out where the transitive closure feature of SDF priorities can be used to remove redundant priority declarations.

Third, it would great to integrate the precedence rule migration in a tool that completely migrates a YACC/FLEX grammar to SDF. For this, we need tools to parse and understand a FLEX specification and extend the existing support for precedence rule migration to other YACC productions.

Clearly, there is lots of interesting (and useful!) grammar engineering work to do in this direction!

No comments: