Tuesday, April 03, 2007

LDTA'07 slides on Grammar Engineering Tools

The slides of our presentation of the LDTA'07 paper Grammar Engineering Support for Precedence Rule Recovery and Compatibility Checking are now available online. The slides are a spectacular demonstration of LaTeX masochism, so please take a look ;) . There are a few bonus slides after the conclusion that I wasn't able to show during the 30-minute version of the talk.

Migration of the YACC grammar for PHP to SDF

Last summer, Eric Bouwers started working on infrastructure for PHP program transformation and analysis, sponsored by the Google Summer of Code. He did an excellent job, thanks to his expertise in PHP and his thorough knowledge of Stratego/XT. To enjoy all the language engineering support in Stratego/XT, Eric developed a PHP grammar in SDF, the grammar formalism that is usually applied in Stratego/XT projects. Unfortunately, it proved to be very difficult to get the grammar of PHP right.

PHP precedence problems

PHP features many operators, and the precedence of these operators is somewhat unusual and challenging for a grammar formalism. For example, PHP allows the weakly binding assignment operator as an argument of the strongly binding binary && operator:

  if ($is_upload && $file = fopen($fname, 'w')) { 
    ...
  }

The same holds for the strongly binding unary ! operator:

  if(!$foo = getenv('BAR')){
    ...
  }

A similar precedence rule for the include operator allows an include to occur as the argument of the strongly binding @ operator:

  @include_once 'Var/Renderer/' . $mode . '.php'

Precedence rule recovery

The most serious problem was to find out what the exact precedence rules of the PHP operators are. The syntax of PHP is defined by a YACC grammar, which has a notion of precedence declarations that is heavily used by the PHP grammar. Unfortunately, for more complex grammars it is far from clear what the exact effect of the precedence declarations is. The precedence declarations are only used for conflict resolution in the parse table generator, so if there is no conflict, the precedence declarations have no effect on a particular combination of operators. That's why we developed support for recovering precedence rules from YACC grammars, which I already wrote about in a previous blog post. Based on these tools, we now have a very precise specification of the precedence rules of PHP.
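
To make this concrete: a YACC precedence section is a list of declarations, ordered from weakest to strongest binding, along these lines (a simplified sketch with illustrative token names, not the actual PHP declarations):

```
%right '='     /* weakest: assignment, right-associative */
%left  AND_OP  /* "&&" (token name illustrative)         */
%left  '+' '-'
%left  '*' '/'
%right '!'     /* strongest: unary not                   */
```

Each line introduces one precedence level, but the declarations are only consulted when the parse table generator hits a conflict, which is exactly why their effective meaning is so hard to read off from the grammar text.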

The next step in the process of getting a perfect PHP grammar was to actually use this specification to develop very precise precedence declarations for the SDF grammar of PHP. However, the precedence rule specification involves about 1650 rules, so migrating these precedence rules to SDF precedence declarations by hand is not really an option. Fortunately, all the ingredients are actually there to generate SDF priority declarations from the precedence rules that we recover from the YACC grammar.

Argument-specific priorities

Thanks to two new features of SDF, these precedence rules can be translated directly to SDF. The first feature is argument-specific priorities. In the past, SDF only allowed priority declarations between productions. For example, the SDF priority

  E "*" E -> E > E "+" E -> E

defines that the production for the + operator cannot be applied to produce any of the E arguments of the production for the * operator; hence, the production for the addition operator cannot be applied on the left-hand side or right-hand side of the multiplication operator. This priority implies that the multiplication operator binds more strongly than the addition operator. This single SDF priority corresponds to the following two precedence rules in the grammar-formalism-independent notation we use in the Stratego/XT Grammar Engineering Tools:

  <E -> <E -> E + E> * E>
  <E -> E * <E -> E + E>>

For many languages, the precedence rules differ between arguments of the same production. That's why we use this more specific representation of precedence rules in our grammar engineering tools. Fortunately, SDF now supports argument-specific priorities as well. These argument-specific priorities are plain numbers that indicate to which arguments of a production the priority applies. For example, the following SDF priority forbids the assignment operator only at the left-most and the right-most E of the conditional operator:

  E "?" E ":" E -> E <0,4> > E "=" E

This corresponds to the following precedence rules:

  <E -> <E -> E = E> ? E : E>
  <E -> E ? E : <E -> E = E>>

Non-transitive priorities

The second new SDF feature that is required for expressing the PHP precedence rules is non-transitive priorities. Before the introduction of this feature, all SDF priorities were transitively closed. For example, if there are two separate priorities

  "!" E -> E   > E "+" E -> E
  E "+" E -> E > V "=" E -> E

then by the transitive closure of priorities this would imply the priority

  "!" E -> E >  V "=" E -> E

This transitive closure feature is useful in most cases, but for some languages (such as PHP) the precedence rules are in fact not transitively closed, which makes the definition of these rules in SDF slightly problematic. For this reason, SDF now also features non-transitive priorities, using a dot before the > of the priority:

  "!" E -> E .> E "+" E -> E

Non-transitive priorities will not be included in the transitive closure, which gives you very precise control over the precedence rules.

Precedence rule migration

Thanks to the position-specific, non-transitive priorities of SDF, the precedence rules that we recover from the YACC grammar for PHP can now be mapped directly to SDF priority declarations. The two precedence rules mentioned earlier:

  <E -> <E -> E + E> * E>
  <E -> E * <E -> E + E>>

now translate directly to SDF priorities:

  E * E -> E <0> .> E + E -> E
  E * E -> E <2> .> E + E -> E

The migration of the recovered YACC precedence rules results in about 1650 of these SDF priorities, but thanks to the fully automatic migration this huge number of priorities is not really a problem. The resulting syntax definition immediately solved all the known issues with the PHP grammar, which shows that the migration was both reliable and successful.

Future

There is a lot of interesting work left to be done. First, it would be interesting to develop a more formal grammar for PHP, similar to the grammars in the C, Java, and C# specifications. These specifications all encode the precedence rules of the operators in the production rules, by introducing non-terminals for all the precedence levels. It should not be too difficult to derive such an encoding automatically from the precedence rules we recover. This would result in a formal specification of the PHP syntax, from which many other parser implementations could benefit. One of the remarkable things we found out is that the unary - operator has the same precedence as the binary - (usually it binds more strongly), which results in -1 * 3 being parsed as -(1 * 3). We have not been able to find an example where this strange precedence rule results in unexpected behaviour, but for the development of a solid parser it is essential that such precedence rules are defined precisely.
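
The encoding used by those specifications introduces one non-terminal per precedence level. For a tiny expression language it looks roughly like this (non-terminal names are illustrative, in the style of the Java grammar):

```
AdditiveExpression       ::= AdditiveExpression "+" MultiplicativeExpression
                           | MultiplicativeExpression
MultiplicativeExpression ::= MultiplicativeExpression "*" UnaryExpression
                           | UnaryExpression
UnaryExpression          ::= "-" UnaryExpression
                           | PrimaryExpression
```

Because an AdditiveExpression can only appear as an argument of "*" by going through a parenthesized PrimaryExpression, the precedence rules are encoded directly in the shape of the grammar.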

Second, it would be good to try to minimize the number of generated SDF priorities by determining a priority declaration that you can actually oversee as a human. This would involve finding out where the transitive closure feature of SDF priorities can be used to remove redundant priority declarations.

Third, it would be great to integrate the precedence rule migration into a tool that completely migrates a YACC/FLEX grammar to SDF. For this, we need tools to parse and understand a FLEX specification, and we need to extend the existing support for precedence rule migration to other YACC productions.

Clearly, there is lots of interesting (and useful!) grammar engineering work to do in this direction!

Thursday, March 01, 2007

x86-64 support for Stratego/XT!

Today is the day that Stratego/XT supports 64-bit processors! Stratego/XT supports x86-64 from release 0.17M3pre16744 onwards, and the sdf2-bundle from release 2.4pre212034 onwards. The releases are available from our new Nix buildfarm at the TU Delft.

Some history

About 6 years ago, various people started to complain about the lack of 64-bit processor support. At the time, most complaints came from our very own Unix geek Armijn Hemel, mostly because of his passion for Sun and those strange UltraSparc machines. However, much like Unix geeks themselves, 64-bit systems were rather uncommon at the time, and the requests we got were about the more obscure processors, like Sun's UltraSparc and Intel's IA-64 (Itanium).

The 64-bit issues were never solved because (1) we never had a decent 64-bit machine at our disposal, (2) users with 64-bit systems were uncommon, and (3) most of the issues were actually not Stratego/XT issues, but problems in the ATerm library, which is not maintained by the Stratego/XT developers.

Some first steps

However, it is no longer possible to ignore 64-bit systems: Intel and AMD both sell 64-bit processors for consumers these days. Several users of Stratego/XT already have x86-64 machines, and the only reason they don't complain en masse is that there is always the option to compile in 32-bit mode (using gcc -m32).

At the TU Delft (the new Stratego/XT headquarters), we now have an amazing buildfarm with some real, dedicated hardware bought specifically for the purpose of building software. At the moment, all our build machines (except for the Mac Minis) have x86-64 processors, so the lack of 64-bit machines is no longer an excuse.

The ATerm library now also enjoys a few more contributors. Last summer, Eelco Dolstra from Utrecht University created the first complete 64-bit patch for the ATerm library (Meta-Environment issue 606), simply because his Nix package management system uses the ATerm library and portability of Nix is important. Furthermore, Erik Scheffers from the Eindhoven University of Technology has done an excellent job on the development of ATerm branches that support GCC 4.x and 64-bit machines.

The final step .. uh, steps

As a result, it was now feasible to fully support x86-64 systems. The only thing left for me to do was to use all the right patches and branches and enable an x86-64 build in our buildfarm. At least, that's what I thought ... Well, if you know computer scientists, then you'll also know that they are always far too optimistic.

In the end, it took me a day or four to get everything working. This is rather stressful work, I must say. Debugging code that is affected by a mixture of 32-bit assumptions and aliasing bugs introduced by GCC optimizations is definitely not much fun. You can stare at C code for as long as you like, but if the actual code being executed is completely different, then this won't help much. In the end, this little project resulted in quite a few new issues:

  • STR-701 is a bug caused by casting a pointer to an integer in the address strategy of libstratego-lib, which returns an integer representation of the address of an ATerm. The Stratego Library has had this strategy for a long time, and indeed the most natural representation of an address is an integer datatype. Unfortunately, ATerm integers are fixed-size, 32-bit integers, so they cannot represent a 64-bit pointer. The new representation is a string, which is acceptable for most applications of address.
  • Meta-Environment issue 720 is related to GCC optimizations based on strict alias analysis. In this case, the optimization seems to be applied only in the x86-64 backend of GCC, while the underlying problem is in fact architecture independent.

    The code that raises this bug implements efficient memory allocation by allocating blocks of objects rather than individual ones. The available objects are encoded in a linked list with only a next field. This next field is used for the actual data of the object, as well as for the link to the next available object. The objects are character classes, named CC_Class, which is a typedef for an array of longs. Roughly, the invalid code for adding a node to the linked list looks like this:

    struct CC_Node {
      struct CC_Node *next;
    };
    
    static struct CC_Node *free_nodes = NULL;
    
    void add_node(CC_Class* c) {
      struct CC_Node *node = (struct CC_Node *) c;
      node->next = free_nodes;
      free_nodes = node;
    }
    

    The problem with this efficient linked list is that the same memory location is accessed through pointers of different types, in this case a pointer to a CC_Node struct and a pointer to a CC_Class. Hence, the code creates aliases of different types, which is invalid in C (see for example this nice introduction to strict aliasing). In this case, C compilers are allowed to assume that the two variables do not alias, which enables a whole bunch of optimizations that are invalid if they do in fact alias.

    The solution for this is to use a C union, which explicitly informs the compiler that a certain memory location is accessed through two different types. Using a union, the above code translates to:

    union CC_Node {
      CC_Class *cc;
      CC_Class **next;
    };
    
    static union CC_Node free_node = {NULL};
    
    void add_node(CC_Class* c) {
      union CC_Node node; /* two views of the same pointer */
      node.cc = c;
      *(node.next) = free_node.cc;
      free_node.cc = node.cc;
    }
    

    Sidenote: I'm not really a C union expert, and I'm not 100% sure whether in this case a union is necessary for a CC_Class* and CC_Class** or CC_Class and CC_Class*. The union I've chosen solves the bug, but I should figure out what the exact solution should be. Feedback is welcome.

  • Meta-Environment issue 718 is related to the previous bug. The problem here is that the same memory locations are accessed through a generic datatype (ATerm) as well as pointers to more specific structs, which again leads to strict aliasing problems. This time, the issue has been solved in a more ad-hoc way by declaring a variable as volatile. This solves the issue for now, but a more fundamental solution (probably a union) is necessary here as well.
  • STR-705 adds checks for the sizes of various types to the Stratego/XT build system, called Auto/XT. These checks are necessary for the header files of the ATerm library, which determine the characteristics of the platform based on the sizes of longs, integers, and void pointers, which have to be defined as macros. (This feature is under discussion in Meta-Environment issue 606: it does not play very well with cross compilation, or with compilation in 32-bit mode on a 64-bit platform.) The ATerm library we are using at the moment is the 64-bit-fixes branch, which has been developed by Eelco Dolstra and Erik Scheffers.

    The new macro XT_C_TYPE_CHARACTERISTICS checks the sizes and defines the macros that are required by these headers. The macro XT_SETUP invokes the XT_C_TYPE_CHARACTERISTICS macro, so all packages based on Stratego/XT will automatically support the 64-bit branch of the ATerm library.

  • STR-703 is related to the previous issues. In packages based on the GNU Autotools and Auto/XT, the C code is compiled by the Automake-based build system, not by the Stratego compiler itself (which only produces the C code). In this case, the XT_C_TYPE_CHARACTERISTICS macro takes care of the required defines. However, the Stratego compiler can also be used as a standalone compiler, where strc invokes the C compiler itself. In this case, strc needs to pass the macro definitions to the C compiler.
  • STR-704 drops the use of autoheader in the stratego-libraries. Autoheader replaces the command-line definition of macros with a generated config.h. This generated file used to be installed as stratego-config.h, but this header file is no longer necessary: there is no configuration option in this file that is still needed as part of the Stratego/XT installation. The mechanism of config.h installation is rather fragile (some macro definitions have to be removed), so if it is not necessary anymore, then why not drop it ...

    The relation to x86-64 support is that several C files in the stratego-libraries package did not correctly include the generated config.h before aterm2.h. This breaks on x86-64 systems because aterm2.h requires the aforementioned macro definitions.
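
For comparison, the size checks that XT_C_TYPE_CHARACTERISTICS performs can be expressed with the standard Autoconf macro AC_CHECK_SIZEOF, roughly as follows (a sketch of the mechanism, not the actual Auto/XT implementation):

```
AC_CHECK_SIZEOF([int])
AC_CHECK_SIZEOF([long])
AC_CHECK_SIZEOF([void *])
```

Each check defines a macro (SIZEOF_INT, SIZEOF_LONG, SIZEOF_VOID_P) from which the platform characteristics needed by the ATerm headers can be derived.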

Short version

The net result of this operation is that we now support x86-64 systems. And this time we will keep supporting 64-bit processors, whatever it takes.

It would be fun to check now if UltraSparc and IA-64 machines work out of the box, but I don't have access to any of these. If you have one, I would love to know if it works.

Wednesday, February 14, 2007

Base access in the C# specification

In a previous post, I discussed a bug in the Java Language Specification on super field access of protected fields. If you haven't read it yet, I suggest giving it a read before you continue with this post. Thanks to a discussion with Alex Buckley (the new maintainer of the Java Language Specification), there is now a proposal to fix this bug in an elegant way. I'll report later on the solution and on the nice discussion of its relation to super field accesses in bytecode.

However, first I would like to illustrate the risk of reuse. While writing on issues in the Java Language Specification, I figured that the C# specification probably has the same issue. After all, C# features the same details of protected access. Consider the following two C# classes:

  class A {
    protected int secret;
  }

  class B : A {
    public void f(A a) {
      a.secret = 5;
    }
  }

Due to the details of protected access, this example won't compile. The Mono C# compiler clearly explains the problem:

  A.cs(17,5): error CS1540: Cannot access protected 
  member  `A.secret' via a qualifier of type `A'. The 
  qualifier must be of type `B' or derived from it

Of course, C# also supports access to fields of base classes (aka super classes). Indeed, checking the C# specification reveals that the definition of base access is exactly the same as that of super field access in Java. In Section 14.5.8 of the C# Language Specification (ECMA-334), the semantics of a base access expression is defined in the following way:

"At compile-time, base-access expressions of the form base.I and base[E] are evaluated exactly as if they were written ((B)this).I and ((B)this)[E], where B is the base class of the class or struct in which the construct occurs."

Compare this definition to the Java Language Specification:

"Suppose that a field access expression super.name appears within class C, and the immediate super class of C is class S. Then super.name is treated exactly as if it had been the expression ((S)this).name."

The good thing about this reuse is that I can reuse the examples of my previous post as well. Consider the following two C# classes that compile without any problem:

  class A {
    protected int secret;
  }

  class B : A {
    public void f() {
      base.secret = 5;
    }
  }

Next, consider the derivative of this example, where class B has been modified to refer to the field secret using (A) this, which, according to the specification, is exactly the same as a reference through base.

  class B : A {
    public void f() {
      ((A) this).secret = 5;
    }
  }

Similar to Java, this class won't compile, due to the details of protected access in C#. Again, the Mono C# compiler explains the issue:

  A.cs(13,7): error CS1540: Cannot access protected 
  member `A.secret' via a qualifier of type `A'. The 
  qualifier must be of type `B' or derived from it

This example shows that for C# the two expressions base.secret and ((A) this).secret are not evaluated in the same way, so the previously reported problem in the Java Language Specification also applies to the C# specification.

Now I have to figure out how to report issues in the C# specification ...

New Arrivals

Yesterday was the day of two new arrivals:
  • Zef Hemel is joining the MoDSE project at the TU Delft. Welcome Zef!
  • Shan Shan Huang has started blogging about her work on MJ, a very cool code generation solution. Yannis, Shan Shan, and David have been searching for a sweet spot in safe code generation during the last few years, and this sounds like a very interesting approach. Go and read their ECOOP paper!

This post was unusually short. I'm sorry for that, but I had to mention this short list ;)

Saturday, February 10, 2007

Informal specifications are not so super

I'm trying to get back into the good habit of blogging about our work. I'm not very fond of dumping random links or remarks, so the most challenging part of blogging is finding a good topic to write a decent story about. This time, I fall back to a topic I actually worked on about two years ago, but that is still most relevant. At that time, I was actively developing a type checker for Java, as part of the Dryad project. This story is about a bug in the Java Language Specification that, for whatever bizarre reason, has never been reported (afaik).

Super field access

Java supports access to fields of super classes using the super keyword, even if this field is hidden by a declaration of another field with the same name. For example, the following sample will print super.x = 1 and x = 2.

  class S {
    int x = 1;
  }

  class C extends S {
    int x = 2;

    void print() {
      System.out.println("super.x = " + super.x + " and x = " + x);
    }

    public static void main(String[] ps) {
      new C().print(); 
    }
  }

To allow access from inner classes to hidden fields of enclosing instances, Java also supports qualified super field accesses. In this case, the super keyword is prefixed with the name of a lexically enclosing class. This feature is related to the qualified this expression, which allows you to refer to an enclosing instance.

Current specification

We all have a reasonable, though informal, idea of what the semantics of this language feature is. Of course, for a real specification the semantics has to be defined more precisely. For example, two things that need to be defined are the type of such an expression and whether the field is accessible at all. The specification concisely defines the semantics of this language feature by forwarding the semantic rules to existing, more basic language features. For super.name, the JLS specifies:

"Suppose that a field access expression super.name appears within class C, and the immediate super class of C is class S. Then super.name is treated exactly as if it had been the expression ((S)this).name."

So, in the example I gave, super.x would be exactly equivalent to ((S)this).x. Obviously, the emphasis on exactly is deliberate. Why would they use this word? Does this suggest that there is also a notion of being treated almost exactly in the same way? ;) .

For qualified field access, the specification is almost the same, but this time using a qualified this instead of this.

"Suppose that a field access expression T.super.name appears within class C, and the immediate super class of the class denoted by T is a class whose fully qualified name is S. Then T.super.name is treated exactly as if it had been the expression ((S)T.this).name."

This specification looks very reasonable, considering that for a field access only the compile-time type of the subject expression is used to determine which field is to be used. By casting the subject expression (this) to the right type, the expected field of the super class is accessed.
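
That a field access is resolved against the compile-time type of its subject expression is easy to demonstrate: a cast is enough to select a hidden field. A small self-contained example (class names are mine, not from the specification):

```java
class S {
    int x = 1;
}

class C extends S {
    int x = 2; // hides S.x

    int viaCast() {
        return ((S) this).x; // subject has compile-time type S: reads S.x
    }

    int direct() {
        return this.x; // subject has compile-time type C: reads C.x
    }
}

public class FieldBindingDemo {
    public static void main(String[] args) {
        C c = new C();
        System.out.println("((S) this).x = " + c.viaCast()); // prints 1
        System.out.println("this.x = " + c.direct());        // prints 2
    }
}
```

The same object thus exposes a different field depending only on the static type through which it is accessed, which is exactly what makes the cast-based definition of super.name plausible at first sight.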

Oops

Of course, it is always nice to have your type checker as compact as possible, so I was very happy with this specification. I could just forward everything related to super field accesses to the corresponding expression with a cast and a this expression. The Dryad typing rules looked something like this:

  attributes:
    |[ super.x ]| -> <attributes> |[ ((reft) this).x ]|
    where
      reft is the superclass of the current class

  attributes:
    |[ cname.super.x ]| ->  <attributes> |[ ((reft) cname.this).x ]|
    where
      reft is the superclass of the class cname

This implementation looks very attractive, but ... it didn't work. The reason is that super.name is in fact not exactly the same as ((S)this).name, due to the details of protected access, which I've previously written about on my blog. I'm not going to redo that, so let me just give an example (based on the example in the previous post) where this assumed equivalence breaks down. First, the following two classes are valid and compile without any problems:

  package a;
  public class A {
    protected int secret;
  }

  package b;
  public class B extends a.A {
    void f() {
      super.secret = 5;
    }
  }

Next, let's now change the assignment to the expression ((a.A) this).secret, which is equivalent to super.secret according to the specification.

  package b;

  public class B extends a.A {
    void f() {
      ((a.A) this).secret = 5;
    }
  }

Unfortunately, this won't compile, due to the details of protected access:

b/B.java:5: secret has protected access in a.A
    ((a.A) this).secret = 5;
    ^
1 error

This example shows that the two expressions are not treated in the same way, so this looks like a problem in the Java Language Specification to me. Also, this shows that in the semantics of languages like Java, the devil is really in the details. What surprises me is that nobody has mentioned this before. Several major Java compilers have been implemented, right? Shouldn't the programmers responsible for these compilers have encountered this problem?

Java Virtual Machine

Another interesting thing is how the Java Virtual Machine specification deals with this. There is no special bytecode operator for accessing fields of super classes: all field assignments are performed using the putfield operator. Assuming that the source compiler would ignore the protected access problem, the two unequal examples I just gave would compile to exactly the same bytecode. So how can the JVM report an error about an illegal access for the expression ((a.A) this).secret? Well, it turns out that it doesn't.
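
To see why, consider what both field assignments look like after compilation. Ignoring the access check, super.secret = 5 and ((a.A) this).secret = 5 in class b.B compile to the same instruction sequence, roughly (a hand-written sketch, not verified javap output):

```
aload_0                    ; push this
iconst_5                   ; push the constant 5
putfield a/A.secret : I    ; assign the field secret declared in a.A
```

Since the bytecode carries no trace of whether the source said super or a cast, any difference in treatment has to come from the verifier, not from the instructions themselves.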

We can show this by first making secret public, compiling B, then making secret protected, and recompiling only A. This works like a charm: doing this trick for the following example prints secret = 5.

  package a;
  public class A {
    protected int secret;
    public void print() {
      System.out.println("secret = " + secret);
    }
  }

  package b;
  public class B extends a.A {
    void f() {
      ((a.A) this).secret = 5;
    }
    public static void main(String[] ps) {
      B b = new B();
      b.f();
      b.print();
    }
  }

However, if this were allowed in bytecode in general, then the security vulnerability that was fixed with the details of protected access would actually only give source-level protection. Obviously, that would be no protection at all: you can safely assume that attackers are capable of writing bytecode. So let's make the example a bit more adventurous by passing the subject expression to the f method:

  package b;
  public class B extends a.A {
    void f(a.A a) {
      a.secret = 5;
    }
    public static void main(String[] ps) {
      B b = new B();
      b.f(b);
      b.print();
    }
  }

This time, the verifier reports an error:

  Exception in thread "main" java.lang.VerifyError:
    (class: b/B, method: f signature: (La/A;)V)
    Bad access to protected data

This error report is correct, so apparently the verifier does check for illegal protected access. In the first case, it was just a bit more liberal than the source language. The question is: how is this specified in the Java Virtual Machine specification? My first impression was that there might be some special handling of accesses to this. However, this would require the verifier to trace which local variables might have the value of this, which is rather unlikely. Then, Dick Eimers (who did lots of scary bytecode stuff for his master's thesis) pointed me to a paper that covers exactly this subject: Checking Access to Protected Members in the Java Virtual Machine by Alessandro Coglio. Strangely enough, this paper is not cited anywhere, while I think that its discussion of this issue is pretty good.

It turns out that the difference in accessibility between super field accesses and ordinary field accesses is handled implicitly, thanks to the type inference used by the Java Virtual Machine. The inferred type of the operand of the field access will be more specific than the type in the corresponding source code, which makes the access to the protected field valid in bytecode. I don't think that this implicit handling of the observed difference is a very good idea.

Tuesday, February 06, 2007

Our take on injection attacks

If you haven't been hiding under some really impressive rock for the last decade, then you probably know that injection attacks are a major issue in web applications. The problem of SQL injection is well-known, but you see similar issues everywhere: SQL, Shell, XML, HTML, LDAP search filters, XPath, XQuery, and a whole series of enterprisey query languages, such as HQL, JDOQL, EJBQL, OQL are all potential candidates for injection attacks. Just search for any of these languages together with the term injection and observe the horror. Recently, it has also become more popular to mix a program written in Java with scripts, usually something like JavaScript, Ruby or Groovy. If you include user input in the script, then this is yet another vector of attack.

Solutions?

Of course, it is possible to just advise programmers to properly escape all user input, which prevents most injection attacks. However, that's like telling people to do their own memory management or to do the dishes every day (which is a particular problem I have). In other words: you won't get it right.

Most of the research on injection attacks has focused on finding injection problems in existing source code using static and/or runtime analysis. Usually, this results in tools that check for injection attacks for specific languages (e.g. SQL) in specific host languages (e.g. PHP). This is very important and useful work, since it can easily be applied to detect or prevent injection attacks in existing code bases. However, at some point we just fundamentally need to reconsider the way we program. Why just fight the symptoms if you can fix the problem?

So that's what we've done in our latest work, called StringBorg. I'm not going to claim that all your injection problems will be over tomorrow, but at least I think that what we propose here gives us some perspective on solving these issues once and for all in a few years. The solution we propose is to use syntax embeddings of the guest languages (SQL, LDAP, Shell, XPath, JavaScript) in the host language (PHP, Java) and let the system do all the proper escaping and positive checking of user input.

Examples

The paper I'll mention later explains all the technical details, and I cannot redo that in a better way in a blog, so I'll just give a bunch of examples that illustrate how it works.

SQL

The first example is an embedding of SQL in Java. This example illustrates how you can insert strings in SQL queries and compose SQL queries at runtime. The first code fragment is the classic, vulnerable, way of composing SQL queries using string concatenation.

  String s = "'; DROP TABLE Users; --";
  String e = "username = \'" + s + "\'";
  String q = "SELECT password FROM Users WHERE " + e;
  System.out.println(q);

Clearly, if the string s was provided by the user, then this would result in an injection attack: the final query is SELECT password FROM Users WHERE username = ''; DROP TABLE Users; --'. Bad luck, the Users table is gone! (or maybe you can thank your database administrator).

With StringBorg, you can introduce some kind of literal syntax for SQL. The SQL code is written between the quotation symbols <|...|>. SQL code or strings can be inserted in another SQL query using the syntax ${...}. The example would be written in StringBorg as:

  String s = "'; DROP TABLE Users; --";
  SQL e = <| username = ${s} |>;
  SQL q = <| SELECT password FROM Users WHERE ${e} |>;
  System.out.println(q.toString());

This will result in the correct query, SELECT password FROM Users WHERE username = '''; DROP TABLE Users; --', where the single quotes have been escaped by StringBorg according to the rules of the SQL standard (the exact escaping rules depend on the SQL dialect). Not only does the StringBorg solution solve the injection problem, it is also much prettier! This example also shows that it is not required to know the full SQL query at compile-time, for example the actual condition e could be different for two branches of an if statement, or could even be constructed in a while statement.
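
For the curious: the SQL-standard escaping rule involved here boils down to doubling single quotes inside string literals. Here is a minimal sketch of that rule in plain Java; this is not StringBorg's actual implementation, the SqlEscape class and its methods are made up for illustration, and real dialects add further rules (e.g. backslash handling in MySQL):

```java
// Minimal illustration of SQL-standard string escaping: a single quote
// inside a string literal is escaped by doubling it.
public class SqlEscape {
    // Escape a value for use inside a single-quoted SQL string literal.
    public static String escape(String value) {
        return value.replace("'", "''");
    }

    // Build a quoted literal from an arbitrary user-supplied string.
    public static String literal(String value) {
        return "'" + escape(value) + "'";
    }

    public static void main(String[] args) {
        String s = "'; DROP TABLE Users; --";
        // The attack payload becomes inert data inside the literal:
        // SELECT password FROM Users WHERE username = '''; DROP TABLE Users; --'
        System.out.println("SELECT password FROM Users WHERE username = " + literal(s));
    }
}
```

The point of StringBorg is precisely that you never write this escaping code yourself: the system applies it automatically at every `${...}` insertion point, based on the declared escaping rules of the guest language.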

The nice thing about StringBorg is that the SQL support is not restricted to a specific language, in this case Java. For PHP, you can do exactly the same thing:

  $s = "'; DROP TABLE Users; --";
  $e = <| username = ${$s} |>;
  $q = <| SELECT password FROM Users WHERE ${$e} |>;
  echo $q->toString(), "\n";

LDAP

Using user input in LDAP search filters has very similar injection problems. First a basic example, where there is no problem with the user input:

  String name = "Babs Jensen";
  LDAP q = (| (cn=$(name)) |);
  System.out.println(q.toString());

The resulting LDAP filter will be (cn=Babs Jensen), which is what you would expect. If the string has the value Babs (Jensen), then the parentheses need to be escaped. Indeed, StringBorg will produce the filter (cn=Babs \28Jensen\29). This input might have been an accident, but of course we can easily turn this into a real injection attempt by using the string *. Again, StringBorg will properly escape this, resulting in the query (cn=\2a).
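
The escaping at work here follows the LDAP filter rules (RFC 2254/4515): the special characters are replaced by a backslash followed by their two-digit hex code. A minimal sketch of that rule, again with a made-up class name and not StringBorg's actual implementation:

```java
// Minimal illustration of LDAP search-filter value escaping:
// the characters * ( ) \ and NUL are replaced by a backslash
// followed by their two-digit hexadecimal code.
public class LdapEscape {
    public static String escape(String value) {
        StringBuilder out = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '*':  out.append("\\2a"); break;
                case '(':  out.append("\\28"); break;
                case ')':  out.append("\\29"); break;
                case '\\': out.append("\\5c"); break;
                case '\0': out.append("\\00"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints (cn=Babs \28Jensen\29)
        System.out.println("(cn=" + escape("Babs (Jensen)") + ")");
        // Prints (cn=\2a)
        System.out.println("(cn=" + escape("*") + ")");
    }
}
```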

Shell

Programs that invoke shell commands can be vulnerable to injection attacks as well (as the TWiki developers and users have learned the hard way). Similar to the other examples, StringBorg introduces a syntax for constructing shell commands and takes care of escaping strings:

  Shell cmd = <| /bin/echo svn cat http://x -r ${s} |>;
  System.out.println(cmd.toString());

If s has the values bravo, foo bar, * and ; echo pwn3d! respectively, then the resulting commands are:

  /bin/echo svn cat http://x -r bravo
  /bin/echo svn cat http://x -r foo\ bar
  /bin/echo svn cat http://x -r \*
  /bin/echo svn cat http://x -r \;\ echo\ pwn3d\!
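
The escaping shown above can be approximated by backslash-escaping every character outside a small safe set, so that the shell treats the whole string as a single literal word. A sketch of that idea in Java, with a made-up class name; this is not StringBorg's actual implementation, and the exact safe set is an assumption on my part:

```java
// Minimal illustration of backslash-escaping a string so that a POSIX
// shell treats it as one literal word: every character outside a small
// "safe" set (letters, digits, _ - . /) gets a backslash prefix.
public class ShellEscape {
    public static String escape(String word) {
        StringBuilder out = new StringBuilder();
        for (char c : word.toCharArray()) {
            boolean safe = Character.isLetterOrDigit(c)
                || c == '_' || c == '-' || c == '.' || c == '/';
            if (!safe) out.append('\\');
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Reproduces the four commands shown above, e.g.
        // /bin/echo svn cat http://x -r \;\ echo\ pwn3d\!
        for (String s : new String[] { "bravo", "foo bar", "*", "; echo pwn3d!" }) {
            System.out.println("/bin/echo svn cat http://x -r " + escape(s));
        }
    }
}
```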

JavaScript

Not only does StringBorg prevent injection attacks, it also makes composing SQL, XQuery, JavaScript, etc. more attractive: you don't have to concatenate all those nasty strings anymore. For example, the following example, taken from an article on the new Java scripting support, is just plain ugly:

  jsEngine.eval(
    "function printNames1(namesList) {" +
    "  var x;" +
    "  var names = namesList.toArray();" +
    "  for(x in names) {" +
    "    println(names[x]);" +
    "  }" +
    "}" +

    "function addName(namesList, name) {" +
    "  namesList.add(name);" +
    "}"
  );

whereas this looks quite reasonable:

  jsEngine.eval(|[
    function printNames1(namesList) {
      var x;
      var names = namesList.toArray();
      for(x in names) {
        println(names[x]);
      }
    }

    function addName(namesList, name) {
      namesList.add(name);
    }
  ]|);

Of course, this would be easy to fix by introducing multi-line string literals in Java, but in addition to the nicer syntax, you get protection against injection attacks and compile-time syntactic checking of the code for free!

Generic, generic, generic

Now, if you are familiar with our work, then this solution won't really surprise you, since we have been working on syntax embeddings for some time now (although in different application areas, such as meta-programming). However, this work is quite a fundamental step towards making these syntax embeddings easier to use by ordinary programmers. First, the system now supports ambiguities, which have always been the weak point of our code generation work: if you don't support ambiguities, then the programmer needs to be familiar with the details of the grammar of the guest language, which you really don't want. Fortunately, this is now a technical detail that you can simply forget about! Second, no meta-programming is required at all to add a new guest language (e.g. XPath) to the system. All you need to do is define the syntax of the language, define the syntax of the embedding, and optionally define escaping rules for strings, and you're all set. Thus, compared to our previous work on MetaBorg (OOPSLA '04), there is no need to implement a mapping from the syntax of the guest language to code in the host language.

This is a pretty amazing property: basically, this means that you can just use languages as libraries. You can just pick the languages you want to use in a source file and that's it! No difficult meta-programming stuff, no program transformation, no rewrite rules and strategies, no limitations. In fact, this even goes beyond libraries: libraries are always language specific (for example for Java or PHP), but the implementation of support for a guest language (e.g. SQL) is language independent. This means that if some person or company implements support for a guest language (e.g. SQL) then all host languages (Java, PHP, etc) are immediately supported.

Future?

The paper we wrote about this is titled "Preventing Injection Attacks with Syntax Embeddings: A Host and Guest Language Independent Approach" and is now available as a technical report. Last week, we submitted this paper to the USENIX Security Symposium. We won't know whether the paper is accepted until April 4, but I would be flabbergasted if it got rejected ;) . Our prototype implementation, called StringBorg, is available as well. I'm looking forward to your feedback and opinions. I'll add some examples to the webpage later this week, so make sure to come back!

Peter Ahé already has a general solution for embedding foreign languages on his wish list (as opposed to an XML specific solution), so could this actually be happening in the near future?

Sunday, February 04, 2007

Some random thoughts on the complexity of syntax

For some reason I was invited to join the JSR-308 mailing list, which is a public mailing list discussing JSR-308. I've reported some problems with the Java grammar of the Java Language Specification in the past, so maybe I'm now on some list of people that might be interested. I'm not sure if I will contribute to the discussion, but at least lurking has been quite interesting. If you're not familiar with the proposal, JSR-308 has been started to allow annotations in more places in a Java program. The title of the JSR is "Annotations on Java Types", but the current discussion seems to interpret the goal of the JSR a bit more ambitiously, since there is a lot of talk going on about annotations on statements, and even on expressions. I don't have a particularly strong opinion on this, but it's interesting to observe how the members of the list are approaching this change in the language.

Neal Gafter seems to represent the realistic camp in the discussion (not to call his opinion conservative). Neal was once the main person responsible for the Java Compiler at Sun, so you can safely assume that he knows what he's talking about. Together with Joshua Bloch, he is now mainly responsible for the position of Google in these Java matters. Last week, he sent another interesting message to the list: Can we agree on our goals?. As I mentioned, I don't have a very strong opinion on what the goal of the JSR should be, but Neal raised a point about syntax that reminded me again of some thoughts on syntax that have been lingering in my mind for some time now. Neal wrote:

"I think the right way to design a language with a general annotation facility is to support (or at least consider supporting) a way of annotating every semantically meaningful nonterminal. Doing that requires a language design with a very simple syntax. Java isn't syntactically simple, and I don't think there is anything we can do to make it so. If we wanted to take this approach with Java we'd have to come up with a syntactic solution for every construct that we want to be annotatable. Given the shape of the Java grammar, that solution would probably be a different special case for every thing we might want to annotate."

Whether you like it or not, this is a most valid concern. The interesting point about annotations is that they are a language feature that applies in a completely different way to existing language constructs. Adding an expression, a statement, or some modifier to the language is not difficult to do, since this adds only an alternative to the existing structure of the language. Annotations, on the other hand, do not add just another alternative, but crosscut (sorry, I couldn't avoid the term) the language. If you are an annotation guy, then you want to have them everywhere, since you essentially want to add information to arbitrary language constructs. Now this is quite a problem if you have a lot of language constructs: not alternative language constructs, but distinct kinds of language constructs (of course known as nonterminals to grammar people). This would be trivial to do in a language that has only a few language constructs, such as Lisp or Scheme, or even in model-based languages.

This makes you wonder what a good language syntax is. Should adding such a crosscutting language feature be easy? Conceptually, it is beyond any doubt attractive to have a limited number of language constructs, but on the other hand it is very convenient that Java has this natural syntax for things like modifiers, return types, formal parameters, formal type parameters, throws clauses, array initializers, multiple variable declarations on the same line, and so on. If you want to add annotations to all these different language constructs, then you basically have to break their abstraction, which suddenly makes them look unnatural, since it becomes clear that a syntactical construct that used to be easy to read has some explicit semantic meaning. That is, the entire structure of the language is exposed in this way. It is no longer possible to read a program while abstracting over all the details of the language. Also, for several locations it is very unclear to the reader what an annotation refers to. For example, the current draft specification states that

"There is no need for new syntax for annotations on return types, because Java already permits an annotation to appear before a method return type. Currently, such annotations are interpreted as on the method declaration — for example, the @Deprecated annotation indicates that the method is deprecated. The person who defines the annotation decides whether an annotation that appears before the return value applies to the method declaration or to the return type."

Clearly, there is a problem in this case, since an annotation in the header of the method could refer to several things. The reason for this is the syntactical conciseness of the language for method declarations: you don't have to identify every part explicitly, so if you want to annotate just one part, then you have a problem. Moving that decision to the declaration side of the annotation is not an attractive solution; for example, there will be annotations that are applicable to both declarations and types.

This all brings us to the question of how to determine whether the syntax of a programming language is simple. Is that really just some subjective idea, or is it possible to determine this semi-automatically with more objective methods? I assume that the answer depends on the way the language is applied. For example, in program transformation it is rather inconvenient to have all kinds of optional clauses for a language construct. This reminds me of a post by Nick Sieger, who applied a visualization tool to some grammars. For some reason, this post was very popular and was discussed all over the web, including Lambda the Ultimate and LWN. However, most people seemed to agree that the visualizations did not tell much about the complexity of the languages. Indeed, the most visible aspects of the pictures are the encodings that had to be applied to make the grammar unambiguous or to make it fit in the grammar formalism that was used. For example, the encoding of precedence rules for expressions makes the graph look pretty, but conceptually this is just a single expression. As a first guess, I would expect that some balance between the number of nodes and edges would be a better measure: lots of edges to a single node means that the nonterminal is allowed in a lot of places, which is probably good for the orthogonality of the language (more people have been claiming this in the discussion about these visualizations).

But well, this makes you wonder if there has been any research on this. The only work I'm familiar with is SdfMetz, a metrics tool for SDF developed by Joost Visser and Tiago Alves. SDF grammars are usually closer to the intended design of a language than LR or LL grammars, so if you are interested in the complexity of the syntax of a language, then using SDF grammars sounds like a good idea. SdfMetz supports quite an interesting list of metrics. Some are rather obvious (counting productions, etc.), but there are also some more complex metrics. I'm quite sure that (combinations of) these metrics can give some indication of the complexity of a language. Unfortunately, the work on SdfMetz was not mentioned at all in the discussion of these visualizations. Why is it that a quick-and-dirty blog post is discussed all over the web while solid research does not get mentioned? Clearly, the SdfMetz researchers should just post a few fancy pictures to achieve instant fame ;) . Back to the question of what good syntax is: until now they have mostly focused on the facts (see their paper Metrication of SDF Grammars), and have not done much work on interpreting the metrics they have collected. It would be interesting if somebody would start doing this.

Joost Visser and Tiago Alves will be presenting SdfMetz at LDTA 2007, the Workshop on Language Descriptions, Tools and Applications (program now available!). As I mentioned in my previous post, we will be presenting our work on precedence rule recovery and compatibility checking there as well. So, if you are in the neighbourhood (or maybe even visiting ETAPS), then make sure to drop by if you are interested!

Another thing that finally seems to be getting some well-deserved attention is ambiguity analysis. It strikes me that the people on the JSR-308 list approach this rather informally, by just guessing what might be ambiguous or not. It should be much easier to play with a language and determine how to introduce a new feature in an unambiguous way. Sylvain Schmitz will be presenting a Bison-based ambiguity detection tool at LDTA, so that should be interesting to learn about. The paper is already online, but I haven't read it yet. Maybe I'll report about it later.

Thursday, January 25, 2007

Grammar Engineering, I'm loving it

I think that the most attractive things to work on as a researcher are the problems that you actually encounter yourself. Obviously, it is an option to try to encounter such problems by studying or writing code that you are not directly interested in, but it is much more fun if you can work on issues in code that you write out of your own interest.

This is what we've done in our latest paper, titled "Grammar Engineering Support for Precedence Rule Recovery and Compatibility Checking". Most of our work involves syntax definitions, for example to provide general support for the implementation of program transformations, and also for our research on embedding and composing languages (for various applications). One of the problems we encounter is that the conversion of a grammar from one grammar formalism to another is rather unreliable. For example, if you need to convert a grammar from YACC to SDF, then you basically have no idea whether the two grammars are compatible. For programs in general this is understandable, since imperative source code is very difficult to compare. But if you have more or less declarative specifications of grammars, how is it possible that you cannot compare them at all?

As a first step towards supporting grammar compatibility checking, we have implemented a tool that compares the precedence rules of grammars. A very simple example of a precedence rule is that 1 + 2 * 3 should be parsed as 1 + (2 * 3). The precedence rules of a grammar might look like a trivial property at first sight, but it is actually rather difficult for a human to understand what the precedence rules of a YACC or SDF grammar are. This tool has already been very successful for comparing existing C grammars written in YACC and SDF, and for deriving the exact precedence rules of PHP, which has quite a bizarre expression language.
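
To illustrate why these rules are hard to read off a grammar, here is a tiny, purely illustrative YACC fragment (not taken from any of the grammars mentioned here). The %left declarations give * a higher precedence than + simply because it is declared later, so 1 + 2 * 3 parses as 1 + (2 * 3). In a realistic grammar, whether such a declaration has any effect at all depends on whether it actually resolves a conflict in the parse table, which is exactly what makes manual inspection so unreliable.

```yacc
%token NUM
%left '+'   /* declared first: lower precedence, left-associative */
%left '*'   /* declared later: higher precedence, left-associative */
%%
expr : expr '+' expr   /* shift/reduce conflicts here are resolved */
     | expr '*' expr   /* by the %left declarations above          */
     | NUM
     ;
```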

The paper has been accepted for LDTA 2007, the Workshop on Language Descriptions, Tools and Applications, which is an excellent place for this subject. We will present our work at this workshop at the end of March. A draft version of the paper is available from the publication list at my homepage. The implementation is available as part of the Stratego/XT Grammar Engineering Tools. The website includes a bunch of examples. In the future, we hope to provide more tools to assist with the maintenance, testing, conversion, and analysis of grammars. In fact, Stratego/XT itself already contains some interesting tools for this, most prominently the grammar unit-testing tool parse-unit.

Now I think of it, it's probably a bad idea as a vegetarian to paraphrase a campaign by McDonald's ...