2007-02-04

Some random thoughts on the complexity of syntax

For some reason I was invited to join the public mailing list discussing JSR-308. I've reported some problems with the Java grammar of the Java Language Specification in the past, so maybe I'm now on some list of people who might be interested. I'm not sure if I will contribute to the discussion, but at least lurking has been quite interesting. If you're not familiar with the proposal, JSR-308 was started to allow annotations in more places in a Java program. The title of the JSR is "Annotations on Java Types", but the current discussion seems to interpret the goal of the JSR a bit more ambitiously, since there is a lot of talk going on about annotations on statements, and even on expressions. I don't have a particularly strong opinion on this, but it's interesting to observe how the members of the list are approaching this change in the language.
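To make that concrete, here is a rough sketch of the kind of positions being discussed. The annotation names are made up, and the type-use annotation target is precisely what the JSR would have to introduce, so don't expect today's compilers to accept this; it is an illustration of the proposed syntax, nothing more:

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Target;
    import java.util.List;

    // Made-up annotations; a TYPE_USE target is what the JSR needs
    // in order to allow annotations on arbitrary uses of types.
    @Target(ElementType.TYPE_USE)
    @interface NonNull {}

    @Target(ElementType.TYPE_USE)
    @interface Readonly {}

    class TypeAnnotationPositions {
        // On a type argument:
        List<@NonNull String> names;

        // On the type of a local variable, and on the type in a cast:
        void alias(Object o) {
            @Readonly Object r = (@Readonly Object) o;
        }
    }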

Neal Gafter seems to represent the realistic camp in the discussion (not to call his opinion conservative). Neal was once the main person responsible for the Java compiler at Sun, so you can safely assume that he knows what he's talking about. Together with Joshua Bloch, he is now mainly responsible for Google's position in these Java matters. Last week, he sent another interesting message to the list: "Can we agree on our goals?". As I mentioned, I don't have a very strong opinion on what the goal of the JSR should be, but Neal raised a point about syntax that reminded me of some thoughts that have been lingering in my mind for quite some time. Neal wrote:

"I think the right way to design a language with a general annotation facility is to support (or at least consider supporting) a way of annotating every semantically meaningful nonterminal. Doing that requires a language design with a very simple syntax. Java isn't syntactically simple, and I don't think there is anything we can do it make it so. If we wanted to take this approach with Java we'd have to come up with a syntactic solution for every construct that we want to be annotatable. Given the shape of the Java grammar, that solution would probably be a different special case for every thing we might want to annotate."

Whether you like it or not, this is a most valid concern. The interesting point about annotations is that they are a language feature that applies in a completely different way to existing language constructs. Adding an expression, a statement, or some modifier to the language is not difficult, since this only adds an alternative to the existing structure of the language. Annotations, on the other hand, do not add just another alternative, but crosscut (sorry, I couldn't avoid the term) the language. If you are an annotation guy, then you want to have them everywhere, since you essentially want to add information to arbitrary language constructs. Now this is quite a problem if you have a lot of language constructs: not alternative language constructs, but distinct kinds of language constructs (of course known as nonterminals to grammar people). This would be trivial to do in a language that does not have that many language constructs, such as Lisp or Scheme, and even in model-based languages.
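Even the annotations Java has had since version 5 already touch quite a few distinct nonterminals. A small illustration, with a made-up annotation (this part is valid Java today):

    // Each commented position is a different nonterminal in the Java
    // grammar; the annotation feature has to be threaded through all
    // of them separately.
    @interface Note { String value(); }

    @Note("on a class declaration")
    class Everywhere {

        @Note("on a field declaration")
        int field;

        @Note("on a constructor declaration")
        Everywhere() {}

        @Note("on a method declaration")
        void method(@Note("on a formal parameter") int x) {
            @Note("on a local variable declaration")
            int local = x;
        }
    }

JSR-308 extends this list with all uses of types, and possibly statements and expressions, which multiplies the number of grammar positions the feature has to reach.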

This makes you wonder what a good language syntax is. Should adding such a crosscutting language feature be easy? Conceptually, it is beyond any doubt attractive to have a limited number of language constructs, but on the other hand it is very convenient that Java has its natural syntax for things like modifiers, return types, formal parameters, formal type parameters, throws clauses, array initializers, multiple variable declarations on the same line, and so on. If you want to add annotations to all these different language constructs, then you basically have to break their abstraction, which suddenly makes them look unnatural, since it becomes clear that a syntactic construct that used to be easy to read has some explicit semantic meaning. That is, the entire structure of the language is exposed in this way. It is no longer possible to read a program while abstracting over all the details of the language. Also, for several locations it is unclear to the reader what an annotation refers to. For example, the current draft specification states:

"There is no need for new syntax for annotations on return types, because Java already permits an annotation to appear before a method return type. Currently, such annotations are interpreted as on the method declaration — for example, the @Deprecated annotation indicates that the method is deprecated. The person who defines the annotation decides whether an annotation that appears before the return value applies to the method declaration or to the return type.

Clearly, there is a problem in this case, since an annotation in the header of the method could refer to several things. The reason for this is the syntactic conciseness of the language for method declarations: you don't have to identify every part explicitly, so if you want to annotate just one part, then you have a problem. Moving that decision to the declaration side of the annotation is not an attractive solution; for example, there will be annotations that are applicable to both declarations and types.
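To see how much of that conciseness is involved, spread a single method header over several lines and label its parts. The comments name the distinct grammar positions, and the annotation shows exactly the ambiguity the draft describes. The example is plain Java 5 with made-up content:

    import java.io.IOException;
    import java.util.List;

    class Concise {
        // Does @Deprecated deprecate the method, or the returned List<T>?
        // Syntactically, both readings sit in the same position.
        @Deprecated
        public static                      // modifiers
        <T extends Comparable<T>>          // formal type parameter
        List<T>                            // return type
        sort(T[] input)                    // formal parameter
        throws IOException {               // throws clause
            int[] a = {1, 2}, b = {};      // array initializers, and two
                                           // variable declarators on one line
            return null;                   // body elided
        }
    }

Every commented part is a separate nonterminal that an "annotate anything" feature would have to reach explicitly.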

This all brings us to the question of how to determine whether the syntax of a programming language is simple. Is that really just a subjective judgement, or is it possible to determine this semi-automatically with more objective methods? I assume that the answer depends on the way the language is applied. For example, in program transformation it is rather inconvenient to have all kinds of optional clauses for a language construct. This reminds me of a post by Nick Sieger, who applied a visualization tool to some grammars. For some reason, this post was very popular and was discussed all over the web, including Lambda the Ultimate and LWN. However, most people seemed to agree that the visualizations did not tell much about the complexity of the languages. Indeed, the most visible aspects of the pictures are the encodings that had to be applied to the actual grammar to make it unambiguous or to fit the grammar formalism that was used. For example, the encoding of precedence rules for expressions makes the graph look pretty, but conceptually this is just a single expression construct. As a first guess, I would expect that some balance between the number of nodes and edges would be a better measure: lots of edges to a single node means that the nonterminal is allowed in a lot of places, which is probably good for the orthogonality of the language (more people have been making this claim in the discussion about these visualizations).
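As a toy version of that guess (my own sketch, unrelated to any existing tool): represent a grammar as a set of productions and count, for each nonterminal, how often it occurs on a right-hand side. The grammar fragment below is made up:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy metric: for each nonterminal, count how often it occurs on a
    // right-hand side. A high count means the construct is allowed in
    // many places. Nonterminals start with an uppercase letter.
    public class GrammarInDegree {
        public static void main(String[] args) {
            // Each entry is one production; the key names the defined
            // nonterminal plus an index, the value is the right-hand side.
            Map<String, List<String>> productions = new HashMap<String, List<String>>();
            productions.put("Stm/1", Arrays.asList("if", "Expr", "Stm"));
            productions.put("Stm/2", Arrays.asList("Expr", ";"));
            productions.put("Stm/3", Arrays.asList("{", "Stm", "}"));
            productions.put("Expr/1", Arrays.asList("Expr", "+", "Expr"));
            productions.put("Expr/2", Arrays.asList("id"));

            Map<String, Integer> uses = new HashMap<String, Integer>();
            for (List<String> rhs : productions.values())
                for (String symbol : rhs)
                    if (Character.isUpperCase(symbol.charAt(0))) {
                        Integer n = uses.get(symbol);
                        uses.put(symbol, n == null ? 1 : n + 1);
                    }
            System.out.println(uses); // prints {Expr=4, Stm=2} (order may vary)
        }
    }

A real tool obviously computes far more interesting numbers than this, but even such a crude count separates a workhorse nonterminal like Expr from more isolated ones.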

But well, this makes you wonder if there has been any research on this. The only work I'm familiar with is SdfMetz, a metrics tool for SDF developed by Joost Visser and Tiago Alves. SDF grammars are usually closer to the intended design of a language than LR or LL grammars, so if you are interested in the complexity of the syntax of a language, then using SDF grammars sounds like a good idea. SdfMetz supports quite an interesting list of metrics. Some are rather obvious (counting productions, etc.), but there are also some more complex metrics. I'm quite sure that (combinations of) these metrics can give some indication of the complexity of a language. Unfortunately, the work on SdfMetz was not mentioned at all in the discussion of these visualizations. Why is it that a quick and dirty blog post is discussed all over the web while solid research does not get mentioned? Clearly, the SdfMetz researchers should just post a few fancy pictures to achieve instant fame ;). Back to the question of what makes a good syntax: until now they have mostly focused on the facts (see their paper Metrication of SDF Grammars) and have not done much work on interpreting the metrics they have collected. It would be interesting if somebody started doing this.

Joost Visser and Tiago Alves will be presenting SdfMetz at LDTA 2007, the Workshop on Language Descriptions, Tools and Applications (program now available!). As I mentioned in my previous post, we will be presenting our work on precedence rule recovery and compatibility checking there as well. So, if you are in the neighbourhood (or maybe even visiting ETAPS), then make sure to drop by if you are interested!

Another thing that finally seems to be getting some well-deserved attention is ambiguity analysis. It strikes me that the people on the JSR-308 list approach this rather informally, by just guessing what might be ambiguous or not. It should be much easier to play with a language and determine how to introduce a new feature in an unambiguous way. Sylvain Schmitz will be presenting a Bison-based ambiguity detection tool at LDTA, so that should be interesting to learn about. The paper is already online, but I haven't read it yet. Maybe I'll report on it later.
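Just to make the idea concrete, here is the naive approach, which has nothing to do with that Bison-based tool: count the parse trees of a specific sentence. This can show that a grammar is ambiguous, but never that it is not:

    import java.util.HashMap;
    import java.util.Map;

    // Brute-force ambiguity evidence for a toy grammar in Chomsky normal
    // form: a CYK-style table that counts derivations instead of merely
    // recording whether one exists. More than one derivation of the start
    // symbol over the whole input means the sentence is ambiguous.
    public class AmbiguityCheck {
        public static void main(String[] args) {
            // The classic ambiguous expression grammar, in CNF:
            //   E -> E R | "id",   R -> P E,   P -> "+"
            Map<String, String> terminalRules = new HashMap<String, String>();
            terminalRules.put("id", "E");
            terminalRules.put("+", "P");
            String[][] binaryRules = { { "E", "E", "R" }, { "R", "P", "E" } };

            String[] input = { "id", "+", "id", "+", "id" };
            int n = input.length;

            // table[i][j] maps each nonterminal to its number of
            // derivations for the token span input[i..j] (inclusive).
            @SuppressWarnings("unchecked")
            Map<String, Long>[][] table = new Map[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    table[i][j] = new HashMap<String, Long>();
            for (int i = 0; i < n; i++)
                table[i][i].put(terminalRules.get(input[i]), 1L);

            for (int len = 2; len <= n; len++)
                for (int i = 0; i + len - 1 < n; i++) {
                    int j = i + len - 1;
                    for (int k = i; k < j; k++)          // split point
                        for (String[] rule : binaryRules) {
                            Long left = table[i][k].get(rule[1]);
                            Long right = table[k + 1][j].get(rule[2]);
                            if (left != null && right != null) {
                                Long old = table[i][j].get(rule[0]);
                                table[i][j].put(rule[0],
                                    (old == null ? 0 : old) + left * right);
                            }
                        }
                }

            // Prints 2: "id + id + id" parses both left- and right-nested.
            System.out.println(table[0][n - 1].get("E"));
        }
    }

Counting parses sentence by sentence obviously does not scale; the point of dedicated tools is to answer the question for the grammar as a whole.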
