2008-12-04

Why the JVM Spec defines checkcast for interface types

I'm working on the specification of pointer analysis for Java using Datalog. Basically, a pointer analysis computes for each variable in a program the set of objects it may point to at run-time.

For this purpose I need to express parts of the JVM Spec in Datalog as well. As a simple example, the following Datalog rules define when a class is a subclass of another class.

/**
 * JVM Spec:
 * - A class A is a subclass of a class C if A is a direct 
 *   subclass of C
 */
Subclass(?c, ?a) <-
  DirectSubclass[?a] = ?c.

/**
 * JVM Spec:
 * - A class A is a subclass of a class C if there is a direct
 *   subclass B of C and class A is a subclass of B
 */
Subclass(?c, ?a) <-
  Subclass(?b, ?a),
  DirectSubclass[?b] = ?c.

As you can see, this is remarkably close to the original specification (quoted in comments). You can clearly see the relationship between the spec and the code, even if you are not familiar with Datalog.

Recently, I was working on the specification of the checkcast instruction. This instruction performs the run-time check if an object can be cast to some type. The JVM Spec for checkcast first defines some variables:

The following rules are used to determine whether an objectref that is not null can be cast to the resolved type: if S is the class of the object referred to by objectref and T is the resolved class, array, or interface type, checkcast determines whether objectref can be cast to type T as follows:

So, this basically says that we're checking the cast (T) S.

The first rule for this cast is straightforward:

If S is an ordinary (nonarray) class, then:
  • If T is a class type, then S must be the same class as T, or a subclass of T.
  • If T is an interface type, then S must implement interface T.
Well, if you're somewhat familiar with Java, or object-oriented programming, then this part is obvious. Again, the specification in Datalog is easy:
CheckCast(?s, ?s) <-
  ClassType(?s).

CheckCast(?s, ?t) <-
  Subclass(?t, ?s).

CheckCast(?s, ?t) <-
  ClassType(?s),
  Superinterface(?t, ?s).
However, the next alternative in the specification is confusing:
If S is an interface type, then:
  • If T is a class type, then T must be Object.
  • If T is an interface type, then T must be the same interface as S or a superinterface of S.

The specification is crystal clear, but how can S ever be an interface type? S is the type of the object that is being cast, and how can an object ever have a run-time type that is an interface? Of course, the static type of an expression can be an interface, but we're talking about the run-time here!

I searched the web, which only resulted in a few hits. There was one question on a Sun forum years ago, where the one answer didn't make a lot of sense.

It turns out that this is indeed an `impossible' case. The reason why this item is in the specification, is because checkcast is recursively defined for arrays:

If S is a class representing the array type SC[], that is, an array of components of type SC, then:
  • ...
  • If T is an array type TC[], that is, an array of components of type TC, then one of the following must be true:
    • ...
    • TC and SC are reference types, and type SC can be cast to TC by recursive application of these rules.

So, if you have an object of type List[] that is cast to an Collection[], then the rules for checkcast get recursively invoked for the types S = List and T = Collection. Notice that List is an interface, but an object can have type List[] at run-time. If have not verified this with the JVM Spec maintainers, but as far as I can see, this is the only reason why the rule for interface types is there.

Just to show a little bit more of my specifications, here is the rule for the array case I just quoted from the JVM Spec:

CheckCast(?s, ?t) <-
  ComponentType[?s] = ?sc,
  ComponentType[?t] = ?tc,
  ReferenceType(?sc),
  ReferenceType(?tc),
  CheckCast(?sc, ?tc).

Isn't it beautiful how this exactly corresponds to the formal specification?

Unfortunately, even formal specifications can have errors, so I also specified a large testsuite that checks the specifications with concrete code. Here are some of the tests for CheckCast.

test Casting to self
  using database tests/hello/Empty.jar
  assert
    CheckCast("java.lang.Integer", "java.lang.Integer")

test Casting to superclasses
  using database tests/hello/Empty.jar
  assert
    CheckCast("java.lang.Integer", "java.lang.Number")
    CheckCast("java.lang.Integer", "java.lang.Object")

test Cast ArrayList to various superinterfaces
  using database tests/hello/Arrays.jar
  assert
    CheckCast("java.util.ArrayList", "java.util.List")
    CheckCast("java.util.ArrayList", "java.util.Collection")
    CheckCast("java.util.ArrayList", "java.io.Serializable")

test Cast class[] to implemented interface[]
  using database tests/hello/Arrays.jar
  assert
    CheckCast("java.util.ArrayList[]", "java.util.List[]")
    CheckCast("java.lang.Integer[]", "java.io.Serializable[]")

test Cast interface[] to superinterface[]
  using database tests/hello/Arrays.jar
  assert
    CheckCast("java.util.List[]", "java.util.Collection[]")

The tests are specified in a little domain-specific language for unit-testing Datalog that I implemented, initially for IRIS and later for LogicBlox. This tool is similar to parse-unit, a tool I wrote earlier for testing parsers in Stratego/XT. The concise syntax of a test encourages you to write a lot of tests. Domain-specific languages rock for this purpose!

2008-06-30

New Conference on Software Language Engineering

This year is special. There is a new and exciting conference: the International Conference on Software Language Engineering (SLE). The deadline for submission of papers is July 14th, which is coming up soon! Before I start raving about the topics covered by this conference, here is the disclaimer: I'm on the program committee of this conference, and as such I believe it's my duty to advertise the conference.

Anyway, if done right, this conference has the potential to become a major and prestigious conference. The conference fills a clear gap: the topics of software language engineering do not exactly fit in major programming language conferences like OOPSLA, PLDI, POPL, and ECOOP. Nor do they fit exactly in the area of compiler construction (CC). CC does typically not accept more engineering or methodology-oriented papers. For OOPSLA and ECOOP the work more or less has to be in the context of object-oriented programming, for POPL it immediately has to be a principle (whatever that is), and for PLDI there are usually just a few slots available for papers that don't do something with memory management, garbage collection, program analysis, or concurrency. Personally, I've been pretty successful at getting papers in the area of software language engineering accepted at OOPSLA, but a full conference devoted to this topic is much better!

Another reason why I think that this conference has a lot of potential is that if I look at the list of topics of interest in the call for papers, then I can only think of one summary: everything that's fun! I'm convinced I'm not the only one who thinks these topics are fun. When talking to colleagues, I notice again and again most of us just love languages. The engineering of those languages is an issue for almost all computer scientists and many programmers in industry, and this conference will be the most obvious target for papers about this!

Also, the formalisms and used for the specification and implementation of (domain-specific) languages are still very much an open research topic. Standardization of languages is still far from perfect, as discussed by many posts on this blog. Also, new language implementation techniques are being proposed all the time, and extensible compilers for developing language extensions are more popular than ever. Not to mention the increasing interest in using domain-specific languages to help solve the software development problems we're facing.

Earlier in this post I wrote that this conference has major potential if done right. There are few risks. First, the conference has been started by two relatively small communities: ATEM and LDTA. I think the conference should attract a much larger community than the union of those two communities. I hope lots of people outside of the ATEM and LDTA communities will consider to submit a paper. Second, this year the conference is co-located with MODELS. Many programming language people are slightly allergic to model-driven engineering. I hope they will realize that this conference is not specifically a model-driven conference. Finally, the whole setup of the conference should be international and varied. I'm sorry to say that at this point I'm not entirely happy with the choice of keynote speakers. This nothing personal: I respect both keynote speakers, but the particular combination of the two speakers is a bit unfortunate. First, they are both Dutch. Second, neither of them is extremely well-known in the communities of OOPSLA, PLDI, or ECOOP. I hope that this will not affect the potential of this interesting conference.

Now go work on your submission!

2008-01-20

Ph.D. Thesis: Exercises in Free Syntax

It has been awfully quiet here, I'm sorry about that. There are a few reasons for that. The first one is that I assembled my PhD thesis from my publications. This took quite some time and energy, but the result is great! My dissertation Exercises Free Syntax is available online. If you are interested in having dead tree version, just let me know!

I will defend my thesis tomorrow, January 21 (see the Dutch announcement). It's weird to realize that tomorrow is the accumulation of 4 years of working intensely!

For the library I created an English abstract. To give you an idea what the thesis is about, let me quote it here:

In modern software development the use of multiple software languages to constitute a single application is ubiquitous. Despite the omnipresent use of combinations of languages, the principles and techniques for using languages together are ad-hoc, unfriendly to programmers, and result in a poor level of integration. We work towards a principled and generic solution to language extension by studying the applicability of modular syntax definition, scannerless parsing, generalized parsing algorithms, and program transformations.

We describe MetaBorg, a method for providing concrete syntax for domain abstractions to application programmers. Since object-oriented languages are designed for extensibility and reuse, the language constructs are often sufficient for expressing domain abstractions at the semantic level. However, they do not provide the right abstractions at the syntactic level. The MetaBorg method consists of embedding domain-specific languages in a general purpose host language and assimilating the embedded domain code into the surrounding host code. Instead of extending the implementation of the host language, the assimilation phase implements domain abstractions in terms of existing APIs leaving the host language undisturbed.

We present a solution to injection vulnerabilities. Software written in one language often needs to construct sentences in another language, such as SQL queries, XML output, or shell command invocations. This is almost always done using unhygienic string manipulation. A client can then supply specially crafted input that causes the constructed sentence to be interpreted in an unintended way, leading to an injection attack. We describe a more natural style of programming that yields code that is impervious to injections by construction. Our approach embeds the grammars of the guest languages into that of the host language and automatically generates code that maps the embedded language to constructs in the host language that reconstruct the embedded sentences, adding escaping functions where appropriate.

We study AspectJ as a typical example of a language conglomerate, i.e. a language composed of a number of separate languages with different syntactic styles. We show that the combination of the lexical syntax leads to considerable complexity in the lexical states to be processed. We show how scannerless parsing elegantly addresses this. We present the design of a modular, extensible, and formal definition of the lexical and context-free aspects of the AspectJ syntax. We introduce grammar mixins, which allows the declarative definition of keyword policies and combination of extensions.

We introduce separate compilation of grammars to enable deployment of languages as plugins to a compiler. Current extensible compilers focus on source-level extensibility, which requires users to compile the compiler with a specific configuration of extensions. A compound parser needs to be generated for every combination. We introduce an algorithm for parse table composition to support separate compilation of grammars to parse table components. Parse table components can be composed (linked) efficiently at runtime, i.e. just before parsing. For realistic language combination scenarios involving grammars for real languages, our parse table composition algorithm is an order of magnitude faster than computation of the parse table for the combined grammars, making online language composition feasible.

Also, they asked me for a Dutch, non-technical summary for news websites. For my Dutch readers:

We presenteren een verzameling van methoden en technieken om programmeertalen te combineren. Onze methoden maken het bijvoorbeeld mogelijk om in een programmeertaal die ontworpen is voor algemene doeleinden een subtaal te gebruiken die beter aansluit bij het domain van een bepaald onderdeel van een applicatie. Hierdoor kan een programmeur op een duidelijkere en compactere wijze een aspect van de software implementeren.

Op basis van dezelfde technieken presenteren we een methode die programmeurs beschermt tegen fouten die de oorzaak zijn van het meest voorkomende beveiligingsprobleem, een zogenaamde injectie aanval. Door op een iets andere wijze te programmeren, heeft de programmeur de garantie dat de software niet gevoelig is voor dergelijke aanvallen. In tegenstelling tot eerder voorgestelde oplossingen geeft onze methode absolute garanties, is eenvoudiger voor de programmeur, en kan gebruikt worden voor alle gevallen waarin injectie aanvallen kunnen voorkomen (bijvoorbeeld niet specifiek voor de taal SQL).

Tot slot maken onze technieken het mogelijk om de syntaxis van sommige programmeertalen duidelijker en formeler te definieren. Sommige moderne programmeertalen zijn eigenlijk een samensmelting van verschillende subtalen (zogenaamde taalagglomeraten). Van dergelijke talen was het tot nu toe onduidelijk hoe de syntaxis precies geformuleerd kon worden, wat voor standaardisering en compatibiliteit noodzakelijk is.