2008-01-20

Ph.D. Thesis: Exercises in Free Syntax

It has been awfully quiet here, I'm sorry about that. There are a few reasons for that. The first one is that I assembled my PhD thesis from my publications. This took quite some time and energy, but the result is great! My dissertation Exercises Free Syntax is available online. If you are interested in having dead tree version, just let me know!

I will defend my thesis tomorrow, January 21 (see the Dutch announcement). It's weird to realize that tomorrow is the accumulation of 4 years of working intensely!

For the library I created an English abstract. To give you an idea what the thesis is about, let me quote it here:

In modern software development the use of multiple software languages to constitute a single application is ubiquitous. Despite the omnipresent use of combinations of languages, the principles and techniques for using languages together are ad-hoc, unfriendly to programmers, and result in a poor level of integration. We work towards a principled and generic solution to language extension by studying the applicability of modular syntax definition, scannerless parsing, generalized parsing algorithms, and program transformations.

We describe MetaBorg, a method for providing concrete syntax for domain abstractions to application programmers. Since object-oriented languages are designed for extensibility and reuse, the language constructs are often sufficient for expressing domain abstractions at the semantic level. However, they do not provide the right abstractions at the syntactic level. The MetaBorg method consists of embedding domain-specific languages in a general purpose host language and assimilating the embedded domain code into the surrounding host code. Instead of extending the implementation of the host language, the assimilation phase implements domain abstractions in terms of existing APIs leaving the host language undisturbed.

We present a solution to injection vulnerabilities. Software written in one language often needs to construct sentences in another language, such as SQL queries, XML output, or shell command invocations. This is almost always done using unhygienic string manipulation. A client can then supply specially crafted input that causes the constructed sentence to be interpreted in an unintended way, leading to an injection attack. We describe a more natural style of programming that yields code that is impervious to injections by construction. Our approach embeds the grammars of the guest languages into that of the host language and automatically generates code that maps the embedded language to constructs in the host language that reconstruct the embedded sentences, adding escaping functions where appropriate.

We study AspectJ as a typical example of a language conglomerate, i.e. a language composed of a number of separate languages with different syntactic styles. We show that the combination of the lexical syntax leads to considerable complexity in the lexical states to be processed. We show how scannerless parsing elegantly addresses this. We present the design of a modular, extensible, and formal definition of the lexical and context-free aspects of the AspectJ syntax. We introduce grammar mixins, which allows the declarative definition of keyword policies and combination of extensions.

We introduce separate compilation of grammars to enable deployment of languages as plugins to a compiler. Current extensible compilers focus on source-level extensibility, which requires users to compile the compiler with a specific configuration of extensions. A compound parser needs to be generated for every combination. We introduce an algorithm for parse table composition to support separate compilation of grammars to parse table components. Parse table components can be composed (linked) efficiently at runtime, i.e. just before parsing. For realistic language combination scenarios involving grammars for real languages, our parse table composition algorithm is an order of magnitude faster than computation of the parse table for the combined grammars, making online language composition feasible.

Also, they asked me for a Dutch, non-technical summary for news websites. For my Dutch readers:

We presenteren een verzameling van methoden en technieken om programmeertalen te combineren. Onze methoden maken het bijvoorbeeld mogelijk om in een programmeertaal die ontworpen is voor algemene doeleinden een subtaal te gebruiken die beter aansluit bij het domain van een bepaald onderdeel van een applicatie. Hierdoor kan een programmeur op een duidelijkere en compactere wijze een aspect van de software implementeren.

Op basis van dezelfde technieken presenteren we een methode die programmeurs beschermt tegen fouten die de oorzaak zijn van het meest voorkomende beveiligingsprobleem, een zogenaamde injectie aanval. Door op een iets andere wijze te programmeren, heeft de programmeur de garantie dat de software niet gevoelig is voor dergelijke aanvallen. In tegenstelling tot eerder voorgestelde oplossingen geeft onze methode absolute garanties, is eenvoudiger voor de programmeur, en kan gebruikt worden voor alle gevallen waarin injectie aanvallen kunnen voorkomen (bijvoorbeeld niet specifiek voor de taal SQL).

Tot slot maken onze technieken het mogelijk om de syntaxis van sommige programmeertalen duidelijker en formeler te definieren. Sommige moderne programmeertalen zijn eigenlijk een samensmelting van verschillende subtalen (zogenaamde taalagglomeraten). Van dergelijke talen was het tot nu toe onduidelijk hoe de syntaxis precies geformuleerd kon worden, wat voor standaardisering en compatibiliteit noodzakelijk is.