Sunday, March 03, 2013

Mercurial Cherry-Picking: Branches vs Clones

At LogicBlox we're using Mercurial, and generally we are quite happy with it (before Mercurial we used CVS and later Subversion, so that wasn't too much of a competition for Mercurial). Mercurial offers quite a few alternative methods for managing diverging development though, which can be a bit confusing. In the early days we mostly used clones, except for the branches of work of individual developers. After some debates on the confusion of having two different methods in use, we switched from clones to branches.

This weekend I was doing a merge of two branches, and accidentally introduced a problem that was caused by earlier cherry-picking of changesets between the two branches. I thought this really should not have been possible, so I decided to sanity-check my understanding of Mercurial branches. It turns out that cherry-picking and Mercurial branches really do not work well together.

The example I used is as follows:


This seems fairly typical: at some point you branch for a specific version. During your development it turns out that you need a changeset (revision 1) on your default branch, so you cherry-pick that revision. Later you need to merge all 1.0 development into default, so you do a merge. Now what happens?


To setup the branches, we reuse this shell function in the following examples.

function init
{
  # initialize a repository
  hg init test1
  cd test1
  echo "foo" > file.txt
  hg add file.txt
  hg commit -m "first commit"

  # make a branch and make two consecutive changes
  hg branch 1.0
  echo "bar" >> file.txt
  hg commit -m "bar"
  echo "fred" >> file.txt
  hg commit -m "fred"
}


Option 1: Manually Making Changes

With this option the developer adds the line 'bar' manually to both branches. You would expect a merge conflict here, because the system really has no information that these changes are correlated, and perhaps should not assume that they are.

This scenario can be executed as follows:

function sample1
{
  init
  hg up default
  echo "bar" >> file.txt
  hg commit -m "bar"
  hg merge 1.0 || echo "conflict expected"
}

You will see that Hg nicely reports the merge failure:

merging file.txt failed!
0 files updated, 0 files merged, 0 files removed, 1 files unresolve

Of course, applying a diff using diff/patch will result in the same result, because it does not matter how you modify the files.


Option 2: Mercurial Import/Export

The second option is to export the changeset using hg export, and import it using hg import. The hope here would be that Mercurial would correctly remember the origin of the changeset. To my surprise, it does not.

function sample2
{
  init

  hg export 1 > bar.diff
  hg up default
  hg import bar.diff

  hg merge 1.0 || echo "conflict??"
}

Result:
merging file.txt failed!
0 files updated, 0 files merged, 0 files removed, 1 files unresolve


Option 3: Transplant/Graft

Puzzled by this result, let's try graft (which is roughly the 2.0 implementation of transplant)

function sample3
{
  init
  hg up default
  hg graft 1
  hg merge 1.0 || echo "conflict???"
}

Result:
merging file.txt failed!
0 files updated, 0 files merged, 0 files removed, 1 files unresolve

Option 4: Clones

People who regularly work with Mercurial will of course immediately see that this is not a problem with clones. In fact, this is how you do distributed development at all. Just for the sake it, here is an example:

function sample4
{
  hg init test1
  cd test1
  echo "foo" > file.txt
  hg add file.txt
  hg commit -m "first commit"

  cd ..
  hg clone test1 test2

  cd test2
  echo "bar" >> file.txt
  hg commit -m "bar"
  echo "fred" >> file.txt
  hg commit -m "fred"
  hg export 1 > ../bar.diff

  cd ../test1
  hg import ../bar.diff
  hg pull ../test2
  hg update
}

Result:

1 files updated, 0 files merged, 0 files removed, 0 files unresolved

I'm not really a complete Mercurial guru, so I might be missing something here, but clearly this demonstrates that clones work better for this purpose of cherry-picking + merging. If anybody with deeper understanding of Mercurial branches can present a working example that would be great!

Thursday, December 04, 2008

Why the JVM Spec defines checkcast for interface types

I'm working on the specification of pointer analysis for Java using Datalog. Basically, a pointer analysis computes for each variable in a program the set of objects it may point to at run-time.

For this purpose I need to express parts of the JVM Spec in Datalog as well. As a simple example, the following Datalog rules define when a class is a subclass of another class.

/**
 * JVM Spec:
 * - A class A is a subclass of a class C if A is a direct 
 *   subclass of C
 */
Subclass(?c, ?a) <-
  DirectSubclass[?a] = ?c.

/**
 * JVM Spec:
 * - A class A is a subclass of a class C if there is a direct
 *   subclass B of C and class A is a subclass of B
 */
Subclass(?c, ?a) <-
  Subclass(?b, ?a),
  DirectSubclass[?b] = ?c.

As you can see, this is remarkably close to the original specification (quoted in comments). You can clearly see the relationship between the spec and the code, even if you are not familiar with Datalog.

Recently, I was working on the specification of the checkcast instruction. This instruction performs the run-time check if an object can be cast to some type. The JVM Spec for checkcast first defines some variables:

The following rules are used to determine whether an objectref that is not null can be cast to the resolved type: if S is the class of the object referred to by objectref and T is the resolved class, array, or interface type, checkcast determines whether objectref can be cast to type T as follows:

So, this basically says that we're checking the cast (T) S.

The first rule for this cast is straightforward:

If S is an ordinary (nonarray) class, then:
  • If T is a class type, then S must be the same class as T, or a subclass of T.
  • If T is an interface type, then S must implement interface T.
Well, if you're somewhat familiar with Java, or object-oriented programming, then this part is obvious. Again, the specification in Datalog is easy:
CheckCast(?s, ?s) <-
  ClassType(?s).

CheckCast(?s, ?t) <-
  Subclass(?t, ?s).

CheckCast(?s, ?t) <-
  ClassType(?s),
  Superinterface(?t, ?s).
However, the next alternative in the specification is confusing:
If S is an interface type, then:
  • If T is a class type, then T must be Object.
  • If T is an interface type, then T must be the same interface as S or a superinterface of S.

The specification is crystal clear, but how can S ever be an interface type? S is the type of the object that is being cast, and how can an object ever have a run-time type that is an interface? Of course, the static type of an expression can be an interface, but we're talking about the run-time here!

I searched the web, which only resulted in a few hits. There was one question on a Sun forum years ago, where the one answer didn't make a lot of sense.

It turns out that this is indeed an `impossible' case. The reason why this item is in the specification, is because checkcast is recursively defined for arrays:

If S is a class representing the array type SC[], that is, an array of components of type SC, then:
  • ...
  • If T is an array type TC[], that is, an array of components of type TC, then one of the following must be true:
    • ...
    • TC and SC are reference types, and type SC can be cast to TC by recursive application of these rules.

So, if you have an object of type List[] that is cast to an Collection[], then the rules for checkcast get recursively invoked for the types S = List and T = Collection. Notice that List is an interface, but an object can have type List[] at run-time. If have not verified this with the JVM Spec maintainers, but as far as I can see, this is the only reason why the rule for interface types is there.

Just to show a little bit more of my specifications, here is the rule for the array case I just quoted from the JVM Spec:

CheckCast(?s, ?t) <-
  ComponentType[?s] = ?sc,
  ComponentType[?t] = ?tc,
  ReferenceType(?sc),
  ReferenceType(?tc),
  CheckCast(?sc, ?tc).

Isn't it beautiful how this exactly corresponds to the formal specification?

Unfortunately, even formal specifications can have errors, so I also specified a large testsuite that checks the specifications with concrete code. Here are some of the tests for CheckCast.

test Casting to self
  using database tests/hello/Empty.jar
  assert
    CheckCast("java.lang.Integer", "java.lang.Integer")

test Casting to superclasses
  using database tests/hello/Empty.jar
  assert
    CheckCast("java.lang.Integer", "java.lang.Number")
    CheckCast("java.lang.Integer", "java.lang.Object")

test Cast ArrayList to various superinterfaces
  using database tests/hello/Arrays.jar
  assert
    CheckCast("java.util.ArrayList", "java.util.List")
    CheckCast("java.util.ArrayList", "java.util.Collection")
    CheckCast("java.util.ArrayList", "java.io.Serializable")

test Cast class[] to implemented interface[]
  using database tests/hello/Arrays.jar
  assert
    CheckCast("java.util.ArrayList[]", "java.util.List[]")
    CheckCast("java.lang.Integer[]", "java.io.Serializable[]")

test Cast interface[] to superinterface[]
  using database tests/hello/Arrays.jar
  assert
    CheckCast("java.util.List[]", "java.util.Collection[]")

The tests are specified in a little domain-specific language for unit-testing Datalog that I implemented, initially for IRIS and later for LogicBlox. This tool is similar to parse-unit, a tool I wrote earlier for testing parsers in Stratego/XT. The concise syntax of a test encourages you to write a lot of tests. Domain-specific languages rock for this purpose!

Tuesday, July 01, 2008

New Conference on Software Language Engineering

This year is special. There is a new and exciting conference: the International Conference on Software Language Engineering (SLE). The deadline for submission of papers is July 14th, which is coming up soon! Before I start raving about the topics covered by this conference, here is the disclaimer: I'm on the program committee of this conference, and as such I believe it's my duty to advertise the conference.

Anyway, if done right, this conference has the potential to become a major and prestigious conference. The conference fills a clear gap: the topics of software language engineering do not exactly fit in major programming language conferences like OOPSLA, PLDI, POPL, and ECOOP. Nor do they fit exactly in the area of compiler construction (CC). CC does typically not accept more engineering or methodology-oriented papers. For OOPSLA and ECOOP the work more or less has to be in the context of object-oriented programming, for POPL it immediately has to be a principle (whatever that is), and for PLDI there are usually just a few slots available for papers that don't do something with memory management, garbage collection, program analysis, or concurrency. Personally, I've been pretty successful at getting papers in the area of software language engineering accepted at OOPSLA, but a full conference devoted to this topic is much better!

Another reason why I think that this conference has a lot of potential is that if I look at the list of topics of interest in the call for papers, then I can only think of one summary: everything that's fun! I'm convinced I'm not the only one who thinks these topics are fun. When talking to colleagues, I notice again and again most of us just love languages. The engineering of those languages is an issue for almost all computer scientists and many programmers in industry, and this conference will be the most obvious target for papers about this!

Also, the formalisms and used for the specification and implementation of (domain-specific) languages are still very much an open research topic. Standardization of languages is still far from perfect, as discussed by many posts on this blog. Also, new language implementation techniques are being proposed all the time, and extensible compilers for developing language extensions are more popular than ever. Not to mention the increasing interest in using domain-specific languages to help solve the software development problems we're facing.

Earlier in this post I wrote that this conference has major potential if done right. There are few risks. First, the conference has been started by two relatively small communities: ATEM and LDTA. I think the conference should attract a much larger community than the union of those two communities. I hope lots of people outside of the ATEM and LDTA communities will consider to submit a paper. Second, this year the conference is co-located with MODELS. Many programming language people are slightly allergic to model-driven engineering. I hope they will realize that this conference is not specifically a model-driven conference. Finally, the whole setup of the conference should be international and varied. I'm sorry to say that at this point I'm not entirely happy with the choice of keynote speakers. This nothing personal: I respect both keynote speakers, but the particular combination of the two speakers is a bit unfortunate. First, they are both Dutch. Second, neither of them is extremely well-known in the communities of OOPSLA, PLDI, or ECOOP. I hope that this will not affect the potential of this interesting conference.

Now go work on your submission!

Tuesday, June 03, 2008

Dubious Conferences: How do they threat people?

I just got a call for papers for the International e-Conference on Computer Science 2008 (IeCCS 2008). The IeCCS conference organizers and committee members are one of a kind! The submission deadline for papers is the 20th of June. Notification of acceptance is 25th of June. That's 5 days for reviewing. The camera ready deadline is 27th of June. That's 2 days for revising your paper. You've got to love the efficiency of these people! The 2007 edition of this conference has a program of 11 pages of accepted papers. When I review papers, I rarely get more than 2 done per day. If the IeCCS committee want to accept a similar number of papers this year, then they'd better make sure to get enough coffee (or tea, as advised in my Ph.D. thesis).

Now, if you are a computer science researcher such emails are hardly surprising. I delete several of them every week. Everybody is aware of conferences with questionable reviewing practices (see the SCIgen paper generator). What surprised me about the IeCCS call for papers is that there is actually a researcher on the committee who I vaguely know from when I was a student. So, I searched the web a bit to see how obvious the evidence is that the reviewing practices of IeCCS are questionable. Interestingly, I could find only one reference that mentions IeCCS as a conference where you'd better not submit to. It's an interesting presentation of somebody at the PSU.

It seems that lists of conferences with a dubious reputation (also known as fake conferences) are impossible to keep up. I've seen a few lists in the past, but they've all disappeared. What interests me is why those lists are taken down. The most well known list, by Arlindo Oliveira, was taken down after receiving threats by conference organizers. I've never quite understood that: how serious can such a threat be? Maybe they'll publish a random paper with my name? They'll put me on the program committee next year?

So well, here we go. Let's see what happens.

Notice: IeCCS is not fake. It is very real!

Sunday, January 20, 2008

Ph.D. Thesis: Exercises in Free Syntax

It has been awfully quiet here, I'm sorry about that. There are a few reasons for that. The first one is that I assembled my PhD thesis from my publications. This took quite some time and energy, but the result is great! My dissertation Exercises Free Syntax is available online. If you are interested in having dead tree version, just let me know!

I will defend my thesis tomorrow, January 21 (see the Dutch announcement). It's weird to realize that tomorrow is the accumulation of 4 years of working intensely!

For the library I created an English abstract. To give you an idea what the thesis is about, let me quote it here:

In modern software development the use of multiple software languages to constitute a single application is ubiquitous. Despite the omnipresent use of combinations of languages, the principles and techniques for using languages together are ad-hoc, unfriendly to programmers, and result in a poor level of integration. We work towards a principled and generic solution to language extension by studying the applicability of modular syntax definition, scannerless parsing, generalized parsing algorithms, and program transformations.

We describe MetaBorg, a method for providing concrete syntax for domain abstractions to application programmers. Since object-oriented languages are designed for extensibility and reuse, the language constructs are often sufficient for expressing domain abstractions at the semantic level. However, they do not provide the right abstractions at the syntactic level. The MetaBorg method consists of embedding domain-specific languages in a general purpose host language and assimilating the embedded domain code into the surrounding host code. Instead of extending the implementation of the host language, the assimilation phase implements domain abstractions in terms of existing APIs leaving the host language undisturbed.

We present a solution to injection vulnerabilities. Software written in one language often needs to construct sentences in another language, such as SQL queries, XML output, or shell command invocations. This is almost always done using unhygienic string manipulation. A client can then supply specially crafted input that causes the constructed sentence to be interpreted in an unintended way, leading to an injection attack. We describe a more natural style of programming that yields code that is impervious to injections by construction. Our approach embeds the grammars of the guest languages into that of the host language and automatically generates code that maps the embedded language to constructs in the host language that reconstruct the embedded sentences, adding escaping functions where appropriate.

We study AspectJ as a typical example of a language conglomerate, i.e. a language composed of a number of separate languages with different syntactic styles. We show that the combination of the lexical syntax leads to considerable complexity in the lexical states to be processed. We show how scannerless parsing elegantly addresses this. We present the design of a modular, extensible, and formal definition of the lexical and context-free aspects of the AspectJ syntax. We introduce grammar mixins, which allows the declarative definition of keyword policies and combination of extensions.

We introduce separate compilation of grammars to enable deployment of languages as plugins to a compiler. Current extensible compilers focus on source-level extensibility, which requires users to compile the compiler with a specific configuration of extensions. A compound parser needs to be generated for every combination. We introduce an algorithm for parse table composition to support separate compilation of grammars to parse table components. Parse table components can be composed (linked) efficiently at runtime, i.e. just before parsing. For realistic language combination scenarios involving grammars for real languages, our parse table composition algorithm is an order of magnitude faster than computation of the parse table for the combined grammars, making online language composition feasible.

Also, they asked me for a Dutch, non-technical summary for news websites. For my Dutch readers:

We presenteren een verzameling van methoden en technieken om programmeertalen te combineren. Onze methoden maken het bijvoorbeeld mogelijk om in een programmeertaal die ontworpen is voor algemene doeleinden een subtaal te gebruiken die beter aansluit bij het domain van een bepaald onderdeel van een applicatie. Hierdoor kan een programmeur op een duidelijkere en compactere wijze een aspect van de software implementeren.

Op basis van dezelfde technieken presenteren we een methode die programmeurs beschermt tegen fouten die de oorzaak zijn van het meest voorkomende beveiligingsprobleem, een zogenaamde injectie aanval. Door op een iets andere wijze te programmeren, heeft de programmeur de garantie dat de software niet gevoelig is voor dergelijke aanvallen. In tegenstelling tot eerder voorgestelde oplossingen geeft onze methode absolute garanties, is eenvoudiger voor de programmeur, en kan gebruikt worden voor alle gevallen waarin injectie aanvallen kunnen voorkomen (bijvoorbeeld niet specifiek voor de taal SQL).

Tot slot maken onze technieken het mogelijk om de syntaxis van sommige programmeertalen duidelijker en formeler te definieren. Sommige moderne programmeertalen zijn eigenlijk een samensmelting van verschillende subtalen (zogenaamde taalagglomeraten). Van dergelijke talen was het tot nu toe onduidelijk hoe de syntaxis precies geformuleerd kon worden, wat voor standaardisering en compatibiliteit noodzakelijk is.

Tuesday, April 03, 2007

LDTA'07 slides on Grammar Engineering Tools

The slides of our presentation of the LDTA'07 paper Grammar Engineering Support for Precedence Rule Recovery and Compatibility Checking are now available online. The slides are a spectacular demonstration of latex masochism, so please take a look ;) . There are few bonus slides after the conclusion that I wasn't able to show during the 30-minutes version of the talk.

Migration of the YACC grammar for PHP to SDF

Last summer, Eric Bouwers started working on infrastructure for PHP program transformation and analysis, sponsored by the Google Summer of Code. He did an excellent job, thanks to his expertise in PHP and his thorough knowledge of Stratego/XT. To enjoy all the language engineering support in Stratego/XT, Eric developed a PHP grammar in SDF, the grammar formalism that is usually applied in Stratego/XT projects. Unfortunately it proved to be very difficult to get the grammar of PHP right.

PHP precedence problems

PHP features many operators, and the precedence of the operators is somewhat unusual and challenging for a grammar formalism. For example, PHP allows the weak binding assignment operator as an argument of the binary, strong binding && operator:

  if ($is_upload && $file = fopen($fname, 'w')) { 
    ...
  }

The same holds for the unary, strong binding ! operator:

  if(!$foo = getenv('BAR')){
    ...
  }

A similar precedence rule for the include operator allows an include to occur as the argument of the strong binding @ operator:

  @include_once 'Var/Renderer/' . $mode . '.php'

Precedence rule recovery

The most serious problem was to find out what the exact precedence rules of PHP operators are. The syntax of PHP is defined by a YACC grammar, which has a notion of precedence declarations that is heavily used by the PHP grammar. Unfortunately, for more complex grammars it is far from clear what the exact effect of the precedence declarations are. The precedence declarations are only used for conflict resolution in the parse table generator, so if there is no conflict, then the precedence declarations do not actually have any effect on a particular combination of operators. That's why we developed support for recovering precedence rules from YACC grammars, which I already wrote about in a previous blog. Based on these tools, we now have a very precise specification of the precedence rules of PHP.

The next step in the process of getting a perfect PHP grammar was to actually use this specification to develop very precise precedence declarations for the SDF grammar of PHP. However, the precedence rule specification involves about 1650 rules, so migrating these precedence rules to SDF precedence declarations by hand is not really an option. Fortunately, all the ingredients are actually there to generate SDF priority declarations from the precedence rules that we recover from the YACC grammar.

Argument-specific priorities

Thanks to two new features of SDF, these precedence rules can be translated directly to SDF. The first feature is argument-specific priorities. In the past, SDF only allowed priority declarations between productions. For example, the SDF priority

  E "*" E -> E > E "+" E -> E

defines that the production for the + operator cannot be applied to produce any of the E arguments of the production for the * operator, hence the production for the addition operator cannot be applied on the left-hand side or right-hand side of the multiplication operator. This priority implies that the multiplication operator binds stronger than the addition operator. This single SDF priority corresponds to the following two precedence rules in the grammar formalism independent notation we are using in the Stratego/XT Grammar Engineering Tools:

  <E -> <E -> E + E> * E>
  <E -> E * <E -> E + E>>

For many languages precedence rules are different for arguments of the same production. That's why we us the more specific representation of a precedence rules in our grammar engineering tools. Fortunately, SDF now supports argument-specific priorities as well. These argument-specific priorities are just plain numbers that indicate to which arguments of a production the priority applies. For example, the following SDF priority forbids the assignment operator only at the left-most and the right-most E of the conditional operator:

  E "?" E ":" E -> E <0,4> > E "=" E

This corresponds to the following precedence rules:

  <E -> <E -> E = E> ? E : E>
  <E -> E ? E : <E -> E = E>>

Non-transitive priorities

The second new SDF feature that is required for expressing the PHP precedence rules is non-transitive priorities. Before the introduction of this feature, all SDF priorities where transitively closed. For example, if there are two separate priorities

  "!" E -> E   > E "+" E -> E
  E "+" E -> E > V "=" E -> E

then by the transitive closure of priorities this would imply the priority

  "!" E -> E >  V "=" E -> E

This transitive closure feature is useful in most cases, but for some languages (such as PHP) the precedence rules are in fact not transitively closed, which makes the definition of these rules in SDF slightly problematic. For this reason, SDF now also features non-transitive priorities, using a dot before the > of the priority:

  "!" E -> E .> E "+" E -> E

Non-transitive priorities will not be included in the transitive closure, which gives you very precise control over the precedence rules.

Precedence rule migration

Thanks to the position-specific, non-transitive priorities of SDF, the precedence rules that we recover from the YACC grammar for PHP can now be mapped directly to SDF priority declarations. The two precedence rules mentioned earlier:

  <E -> <E -> E + E> * E>
  <E -> E * <E -> E + E>>

now translate directly to SDF priorities:

  E * E -> E <0> .> E + E -> E
  E * E -> E <2> .> E + E -> E

The migration of the recovered YACC precedence rules results in about 1650 of these SDF priorities, but thanks to the fully automatic migration this huge number of priorities is not really a problem. The resulting PHP syntax definition immediately solved all the known issues with the PHP syntax definition, which shows that this migration was most reliable and successful.

Future

There is a lot of interesting work left to be done. First, it would be interesting to develop a more formal grammar for PHP, similar to the grammars of the C, Java, and C# specifications. These specifications all encode the precedence rules of the operators in the production rules, by introducing non-terminals for all the precedence levels. It should not be too difficult to automatically determine such an encoding from the precedence rules we recover. This would result in a formal specification of the PHP syntax, which will benefit many other parser generators. One of the remarkable things we found out is that the unary - operator has the same precedence as the binary - (usually it binds stronger), which results in -1 * 3 being parsed as -(1 * 3). We have not been able to find an example where this strange precedence rule results in unexpected behaviour, but for the development of a solid parser is it essential that such precedence rules are defined precisely.

Second, it would be good to try to minimize the number of generated SDF priorities by determining a priority declaration that you can actually oversee as a human. This would involve finding out where the transitive closure feature of SDF priorities can be used to remove redundant priority declarations.

Third, it would great to integrate the precedence rule migration in a tool that completely migrates a YACC/FLEX grammar to SDF. For this, we need tools to parse and understand a FLEX specification and extend the existing support for precedence rule migration to other YACC productions.

Clearly, there is lots of interesting (and useful!) grammar engineering work to do in this direction!

Thursday, March 01, 2007

x86-64 support for Stratego/XT!

Today is the day that Stratego/XT supports 64-bit processors! Stratego/XT supports x86-64 from release 0.17M3pre16744 (or later), the sdf2-bundle from release 2.4pre212034 (or later). The releases are available from our new Nix buildfarm at the TU Delft.

Some history

About 6 years ago, various people started to complain about the lack of 64-bit processor support. At the time, most complaints came from our very own Unix geek Armijn Hemel, mostly because of his passion for Sun and these strange UltraSparc machines. However, similar to the limited distribution of Unix geeks, 64-bit systems were rather uncommon at the time. The requests we got were about more obscure processors, like Sun's UltraSparc and Intel's IA-64 (Itanium).

The 64-bit issues were never solved because (1) we never had a decent 64-bit machine at our disposal, (2) users with 64-bit system were uncommon, and (3) most of the issues were actually not Stratego/XT issues, but problems in the ATerm library, which is not maintained by the Stratego/XT developers.

Some first steps

However, it is not possible anymore to ignore 64 bit systems: Intel and AMD both sell 64-bit processors for consumers these days. Several users of Stratego/XT already have x86-64 machines, and the only reason why they don't complain en masse is that there is always the option to compile in 32-bit mode (using gcc -m32).

At the TU Delft (the new Stratego/XT headquarters), we now have an amazing buildfarm with some real, dedicated hardware bought specifically for the purpose of building software. At the moment, all our build machines (except for the Mac Minis) have x86-64 processors, so the lack of 64-bit machines is no longer an excuse.

Also, the ATerm library now enjoys a few more contributors. Last summer, Eelco Dolstra from Utrecht University created the first complete 64-bit patch for the ATerm library (Meta-Environment issue 606), simply because his Nix package management system uses the ATerm library and portability of Nix is important. Also, Erik Scheffers from the Eindhoven University of Technology has done an excellent job on the development of ATerm branches that support GCC 4.x and 64-bit machines.

The final step .. uh, steps

As a result, it was now feasible to fully support x86-64 systems. The only thing left for me to do was to use all the right patches and branches and enable an x86-64 build in our buildfarm. At least, that's what I thought ... Well, if you know computer scientists, then you'll also know that they are always far too optimistic.

In the end, it took me a day or four to get everything working. This is rather stressful work, I must say. Debugging code that is affected by a mixture of 32-bit assumptions and aliasing bugs introduced by GCC optimizations is definitely not much fun. You can stare at C code for as long as you like, but if the actual code being executed is completely different, then this won't help much. In the end, this little project resulted in quite a few new issues:

  • STR-701 is a bug that was raised by casting a pointer to an integer in the address strategy of libstratego-lib, which returns an integer representation of the address of an ATerm. The Stratego Library has had this strategy for a long time, and indeed the most natural representation of an address is an integer datatype. Unfortunately, ATerm integers are fixed size, 32-bit integers, hence it cannot be used to represent a pointer of 64 bits. The new representation is a string, which is acceptable for most of the applications of address.
  • Meta-Environment issue 720 is related to GCC optimizations based on strict alias analysis. In this case, the optimization seems to be applied only in the x86-64 backend of GCC, while the underlying problem is in fact architecture independent.

    The code that raises this bug applies efficient memory allocation by allocating blocks of objects rather than individual ones. The available objects are encoded efficiently in a linked list, with only a next field. This next field is used for the actual data of the object, as well as the link to next available object. The objects are character classes, having the name CC_Class, which is a typedef for an array of longs. Roughly, the invalid code for adding a node to the linked list looks like this:

    struct CC_Node {
      struct CC_Node *next;
    };
    
    static struct CC_Node *free_nodes = NULL;
    
    void add_node(CC_Class* c) {
      struct CC_Node *node = (struct CC_Node *) c;
      node->next = free_nodes;
      free_nodes = node;
    }
    

    The problem with this efficient linked list is that the same memory location is accessed through pointers of different types, in this case a pointer to a CC_Node struct and a pointer to a CC_Class. Hence, the code creates aliases of different types, which is invalid in C (see for example this nice introduction to strict aliasing). In this case, C compilers are allowed to assume that the two variables do not alias, which enables a whole bunch of optimizations that are invalid if they do in fact alias.

    The solution for this is to use a C union, which explicitly informs the compiler that a certain memory location is accessed through two different types. Using a union, the above code translates to:

    union CC_Node {
      CC_Class *cc;
      CC_Class **next;
    };
    
    static union CC_Node free_node = {NULL};
    
    void add_node(CC_Class* c) {
      node.cc = c;
      *(node.next) = free_node.cc;
      free_node.cc = node.cc;
    }
    

    Sidenote: I'm not really a C union expert, and I'm not 100% sure whether in this case a union is necessary for a CC_Class* and CC_Class** or CC_Class and CC_Class*. The union I've chosen solves the bug, but I should figure out what the exact solution should be. Feedback is welcome.

  • Meta-Environment issue 718 is related to the previous bug. The problem here is that the same memory locations are accessed through a generic datatype (ATerm) as well as pointers to more specific structs, which again leads to strict aliasing problems. This time, the issue has been solved in a more ad-hoc way by declaring a variable as volatile. This solves the issue for now, but a more fundamental solution (probably a union) is necessary here as well.
  • STR-705 adds some checks for the size of various types to the Stratego/XT build system, called Auto/XT. These checks are necessary for the header files of the ATerm library, which determine the characteristics of the platform based on the size of longs, integers, and void pointers, which are defined as macros (a feature that is under discussion in Meta-Environment issue 606: it does not play very well with cross compilation and compilation in 32-bit mode on a 64-bit platform). The ATerm library we are using at the moment is the branch 64-bit-fixes, which has been developed by Eelco Dolstra and Erik Scheffers.

    The new macro XT_C_TYPE_CHARACTERISTICS checks the sizes and defines the macros that are required by these headers. The macro XT_SETUP invokes the XT_C_TYPE_CHARACTERISTICS macro, so all packages based on Stratego/XT will automatically support the 64-bit branch of the ATerm library.

  • STR-703 is related to the previous issues. In packages based on the GNU Autotools and Auto/XT, the C code is compiled by the Automake-based build system, not by the Stratego compiler itself (which only produces the C code). In this case, the XT_C_TYPE_CHARACTERISTICS takes care of the required defines. However, the Stratego compiler can also be used as a standalone compiler, where strc invokes the C compiler itself. In this case, strc needs to pass the definitions of macros to the C compiler.
  • STR-704 drops the use of autoheader in stratego-libraries. Autoheader replaces the command-line definition of macros to a generated config.h. This generated file used to be installed as stratego-config.h, but this header file is no longer necessary: there is no configuration option in this file that is still necessary as part of the Stratego/XT installation. The mechanism of config.h installation is rather fragile (some macro definitions have to be removed), so if it is not necessary anymore, then why not drop it ...

    The relation to x86-64 support is that several C files in the stratego-libraries package did not correctly include the generated config.h before aterm2.h. This breaks on x86-64 systems because aterm2.h requires the aforementioned macro definitions.

Short version

The net result of this operation is that we now support x86-64 systems. And this time we will keep supporting 64-bit processors, whatever it takes.

It would be fun to check now if UltraSparc and IA-64 machines work out of the box, but I don't have access to any of these. If you have one, I would love to know if it works.