2005-08-02

Lifting Member Classes from Generic Classes

I've been working of Java generics and member classes during the last few weeks. In particular, I had to find out how the additional information on Java generics is exactly represented in bytecode attributes of generic classes and methods (aka generic signatures). I was surprised by the way member classes of generic classes are compiled and I'm worried about the consequences of this for future updates of the JVM specification. That's what this entry is about.

First, something about the relation between member classes and lambda lifting. The Java language supports member classes, but Java bytecode does not. Therefore, Java compilers have to lift member classes to top-level classes, a transformation that is comparable to lambda lifting (see for example the paper Lambda-Lifting in Quadratic Time).

Member classes are compiled to ordinary top-level classes where the constructor takes an extra argument for the instance of the enclosing class. For example, the constructor a member class Bar of class Foo will get an additional argument of type Foo for the enclosing instance of a Bar object. Constructors of local classes (classes declared in a method) also take additional arguments for the local variables that it uses from its enclosing method. This process of lifting classes (that have lexical scope) is very similar to lifting nested functions in lambda lifting: all local variables that are used in the nested class become explicit arguments to make the nested class scope insensitive. After that, the class can simply be lifted out of its original scope to the top-level. An essential property of the class (or function) after lifting is that the nested class (or function) no longer directly refers to variables of the original scope of the nested class or function.

In Java 5.0, parameterized types and methods (aka generics) have been introduced. In combination with member classes, this raises the question how type variables should be handled when lifting member classes. From the source code point of view, this is pretty obvious:

class Foo<A> {
  class Bar {
    A get() { ... }
  }
}

If the class Bar is lifted, then its constructor gets an additional parameter for the enclosing Foo instance. This Foo instance is parameterized using a type variable A, so the lifted class Bar should also be parameterized with a type: the type parameter of its enclosing instance. This lifting of type parameters is comparable to the lifting of parameters for normal variables. So, the result of source-level lifting the Bar class could be:

class Foo<A> {
}

class Bar<A> {
  private final Foo<A> _enclosing;

  public Bar(Foo<A> enclosing) {
    _enclosing = enclosing;
  }

  A get() { ... }
}

Indeed, the Eclipse implementation of the refactoring "Move Member Type to New File" adds the type parameter to the lifted class (thumbs up for the generics support in Eclipse!).

So, what happens in Java bytecode? Should the lifted class have a type parameter? Should the lifted class be a valid class, generically speaking? (of course, a JVM is currently not required to understand generics related information in bytecode).

Well, the lifted class does not have type parameter and, generically speaking, it is not a valid class. Let's take a look at the bytecode, represented in a structured way as an aterm, produced by a tool called class2aterm (I use ... to leave out some details and // to explain what the code means. The full aterm is available here)

$ class2aterm -i Foo\$Bar.class --parse-sig | pp-aterm
ClassFile(
  ...
  // field for the enclosing Foo instance.
  Field(
    AccessFlags([Final, Synthetic])
  , Name("this$0")
  , FieldDescriptor(ObjectType("Foo"))
  , Attributes([])
  )
  ...

  // constructor, taking a Foo argument.
  Method(
    AccessFlags([])
  , Name("<init>")
  , MethodDescriptor([ObjectType("Foo")], Void)
  , Attributes([])
  )
  ...

  // get method with a generic signature
  Method(
    AccessFlags([])
  , Name("get")
  , MethodDescriptor([], ObjectType("java.lang.Object"))
  , Attributes(
      [ MethodSignature(
          TypeParams([])
        , Params([])
        , Returns(TypeVar(Id("A")))
        , Throws([])
        )
      ]
    )
  )
  ...

  // attributes of the class Bar
  Attributes(
    [ SourceFile("Foo.java")
    , InnerClasses( ... )
    ]
  )
  ...
)

This disassembled class file reveals some interesting details about the way nested classes are lifted:

  • The lifted class Bar is not parameterized: it has no ClassSignature attributed, which should be there if the class takes formal type parameters.
  • The field for the enclosing class does not have a parameterized type. Its type is the raw type Foo!
  • The constructor of Bar (the method name <init>) has no generic signature and takes a raw type Foo as an argument.
  • The get method does have a generic signature, which describes that the method returns a type variable A.

Of course, all the information of the original source can be reconstructed by a tool that knows about member classes and generics. But, to a tool that only knows about generics, this code would be considered incorrect. Hence, if the virtual machine would support generics in the future (which is an option explicitly left open), then this code would be incorrect! The type variable mentioned in the generic signature of the get method is not in scope. Hence, the JVM would be required to have knowledge of inner classes as well as generics to be able to find out what type parameter this type variable refers to. Unless, of course, the bytecode format is changed, which will still make it impossible to run code compiled to the current bytecode format under the new JVM, which has always been a important requirement for Sun when working on extensions of the Java platform (language and virtual machine).

Furthermore, the type variable in the signature of the get method is not qualified. Every single name in Java bytecode is fully qualified, which is very useful for tools that need to work on bytecode: they don't have to name analysis to find out to what construct a name refers. Type variables are not qualified, which complicates the analysis that has to be performed by a tool that operates on bytecode. Not only can this type variable refer to type parameters of arbitrary enclosing classes, it could also refer to type parameters of enclosing generic methods (for local classes or member classes in local classes).

The fact that type variables in bytecode are not qualified is already quite annoying without considering member classes. In the Java language, it is allowed to redeclare type variables. For example:

class Foo<A> {
  <A> void foo(A x) {
  }
}

In this example the type parameter A of the foo method is a different type parameter then the A parameter of the class Foo. This basically means that a bytecode processing tool with knowledge of generics has to do name analysis, which is definitely not something that is desirable for a bytecode format. Introducing canonical, fully qualified names for type variables would solve this.

As you might know, I'm working on semantic analysis for Java in the context of the Stratego/XT project. My goal is to make it possible to define program transformations in Stratego at the semantic level: in program transformations consider the actual meaning of names, types of expressions, and so on, without requiring the programmer to redo the semantic analysis, which is quite complex for a 'real' language like Java. Obviously, I have decided to qualify type variables. For example, the parameter A of the method foo in class Foo in the last example is represented as:

Param(
  []
, TypeVar(
    MethodName(TypeName(PackageName([]), Id("Foo")), Id("foo"))
  , Id("A")
  )
, Id("x")
)

The MethodName is the qualifier of the type variable in this example. This qualifier makes it immediately clear that the type variable refers to the type parameter of the method foo.

I don't know if this would have been fixed (maybe I see this completely wrong), but still it's a pity that I wasn't able to give feedback on this before JSR14 was finished. At that time, I was still working on the syntactic part of my Java transformation project (which is now available as Java Front). I gave some feedback on the syntax of generics, annotations and enumerations (mostly typos and minor bugs), but that's about it. For reducing the number of possible problems, I think that it would be very useful if new language features, such as generics, are also implemented with alternative techniques and tools. For example, I was able to give some feedback on the syntax of Java, because I was implementing a parser by creating a declarative syntax definition in SDF, a modular syntax definition formalism that integrates lexical and context-free syntax. These unconventional approaches might in general result in valuable feedback on proposals for new language features.

2005-06-03

Understanding a Problem

This week I've been reviewing the solutions of assignments submitted by student of our program transformation course. One of the things that strikes me again and again is how hard it is for students to get a grasp of the problem that they have to solve in an assignment.

In the last few installments of our program transformation course, the students had to develop a program instrumentation that traces the number of calls for every callee/caller pair for one of the assignments (we usually have about 10 assignments). Last year, we just described the problem in the assignment. The solutions of the students were ok, but not really exciting. They forgot to handle all kinds of cases, and some solutions even didn't terminate for some input programs.

This year, I included a set of tests in the assignment, which illustrate most (but not all) of the problems in this program transformation. Surprise, surprise: the students suddenly were able to handle all the issues illustrated by the testsuite that I provided. However, obviously I did not give the students all tests (evil grin). Indeed, several solutions could not handle the tests that I did not provide.

Most students don't write test. Worse, if they do test, then they create a single file and modify the test to check a new situation that they might have discovered. In this way they don't build up a nice testsuite. The important part of testing is that they can be repeated automatically, not that you run a test once! I'm trying very hard to convince students to test their code properly, but they don't seem to understand the need for it, so you typically get questions like "Is 10 tests enough?". I'm afraid that this is not a problem specific to students.

This might not be very surprising to you, but I am surprised how clear the results of this small 'experiment' are (I did not do this experiment on purpose). I wonder what the implications of this should be for education. Clearly, having a grasp of a problem is the most important part of the solution.

2005-04-24

Generics: The Importance of Wildcards

or "Why type erasure is not such a bad thing" or "Why generics in C# are not that good"

Last Friday, I read an article that has been on my to do list way too long: Adding Wildcards to the Java Programming Language. I've seen wildcards in Java; I've used wildcards in Java; and I've even read the Java Language Specification on wildcards, but did yet not get the essence of wildcards.

This paper makes the need for wildcards very clear and explains why the work on wildcards and parameterized types is novel. Unfortunately, generics are often ridiculed by functional programmers. They claim that their type systems have been more expressive for decades. Fortunately, this paper clearly explains the issues of introducing generics in an object oriented setting.

The problem with basic parameterized types is subtyping. For example, although Integer is a subclass of Number, a List<Integer> is not a subtype of a List<Number>. Hence, if a method requires a List<Number> as an argument, then you cannot pass a list of List<Integer> to it.

Why is a List<Integer> not a subclass of List<Number>? Well, this is related to covariance and contravariance. A type declaration is covariant if it allowed to be more specific. For example, return types are covariant. A method that is declared to return a Number can be overridden to be more specific and return an Integer. On the other hand, parameter types are contravariant: a method that is declared to accept a Integer argument can be implemented in a more general way by allowing all Numbers.

The problem with type parameters is that they are used in method parameters as well as return types. Hence, they are restricted to the intersection of covariance and contravariance: invariance. Thus, type parameters are invariant and a List<Integer> is not a subclass of List<Number>.

If Java would be restricted to basic parameterized types and methods, then it is quite difficult to come up with a good signature for a method that works on a List that contains any Number. In fact, you cannot even declare the type of lists with abitrary numbers! Allowing arbitrary numbers a generic method with a dummy type parameter for the 'real' Number, i.e.

      <T> void doSomething(List<T extends Number>) { ... }

This works, but it gets quite mind-boggling if the types get more complex ("The more interesting your types get, the less fun it is to write them down!" -- Benjamin C. Pierce). These types are not only hard to write down: it does not even work in all cases. For example, you cannot declare a field that contains arbitrary Numbers, since you cannot introduce a dummy type variable for a field.

Wildcards are a language feature that make it a bit more fun to write down these types. Actually, it is a language feature that is necessary to write down these types, since the dummy type variable is just a workaround and uses the type of the method to declare the type of the argument. The field example shows that you cannot write down the type itself. Wildcards are based on the notion of use-site variance. Using wildcards, you can declare that your list is covariant: List<? extends Number> or contravariant: List<? super Number>. For more details, read the paper!

Unfortunately, C# will not support wildcards or a similar mechanism. The implementation strategy does not allow the introduction of wildcards (generics are implemented in the runtime instead of by type erasure). This is a bit surprising, since the implementation strategy is often claimed to be superior. What disappoints me is that the designers of C# are not willing to admit that subtyping is an issue and that wildcards are a solution. See the weblog of Eric Gunnerson: Puzzling through Erasure II and the section on wildcards in JavaOne: Day One.

2005-04-19

Java Surprise 3: The Return of the Class

First of all: good news! The final version of The Java Language Specification, Third Edition is now available online! The specification has been improved considerably since the latest draft. Gilad Bracha seems to be responsible for the bulk of the work, which is a tough job. I think that the result is pretty good, although I'm afraid that I will keep bothering him with comments and requests for clarification ;).

Now back to the issue of this post. First of all: I have no idea how well-known the issue in this post is. I didn't know it, but it might actually be quite well-known. I have some references to previous discussions on this issue at the end of the post.

First, I want to say something about what influences the return type of a method in Java. Before Java 1.5, the return type of a method was just the plain return type specified in the method declaration. In other words, the return type did not depend on anything.

Java 1.5 introduces parameterized types and generic methods. The return type of a method can now also include type variables. This makes the return type dependent on the values of the type variables that occur in it. The type variables can have two different scopes: the class of the method or just the method itself, which makes it a generic method. So, the actual return type of a method now also depends on the value of these type variables.

However, there is one method in the Java library that does not return what it declares to return and needs another dependency. Indeed, there is an additional factor that influences the return type of this method.

The method I'm talking about is Object.getClass(), which returns the class of an object. In Java 1.5, Class itself is parameterized with the type that it represents. For example, the Class for String is Class<String>. The question is: what should the type parameter of the Class returned by Object.getClass() be? Well, at the declaration of the method we basically know nothing, and that is indeed the declared return type: a wildcard (unknown type) with a very general bounds: the type must extend Object.

   public final Class<? extends Object> getClass()

However, let's take a look at a piece of code where getClass is invoked. Assuming that getClass returns what it claims to return, we cannot declare c to be of a more specific type, for example Class<List>. We must declare it with a very general value for the type parameter: a wildcard.

   List<String> list = ...;
   Class<?> c = list.getClass();

This is unfortunate, since we actually know more about the type parameter of Class. We know at the invocation site that it is a List, but of course we cannot declare that in the return type of getClass in this way! So, we would like to let the return type of getClass dependent on the static type of the Object on which the method is invoked. In this way, the variable c could be declared to be of type Class<List>.

The developers of the Java specification decided to make the return type of this method a special case. That is, the Java Language Specification defines that an invocation of the getClass method must be treated in special way. In other words, the return type of the method is different from the one declared on the source code. The bounds of the Class returned by Object.getClass() is changed by the specification to the static type of the expression on which the method getClass is invoked. This is a useful feature, but it is a pity that this return type cannot be declared!

This post is getting way too long, but I would like to relate this to the implicit this argument of methods in object-oriented languages. For ordinary method arguments, you can declare types, which might include type variables. These type variables can influence the return type of the method. This is more or less what we want, but now we need this for our implicit this argument. I'm not sure if a solution in this direction is more attractive, but there is some link ... Are there more methods whose return type we would like to dependent on the static type of the object at the invocation site? If so, then this should not be supported by the language itself. Unfortunately, I cannot think of an example at the moment ;) .

There is even more to tell about this getClass method, since the type parameter of the Class is not the static type of the subject expression, but the erased variant of it. Maybe I'll make that a future post ...

Some references to related discussions:

2005-04-06

Java Surprise 2: Motivation

In the previous posts I showed that the priority of a cast to a reference type is different from the cast to a primitive type. Martijn Vermaat asked me why the designers of the Java language made this decision. Of course, they have good reasons for design decision, but still the decision is questionable, especially now we have autoboxing.

Let's take a look at this example from the original post:

$ echo "(Integer) - 2" | parse-java -s Expr | aterm2xml --implicit
<Minus>
  <ExprName><Id>Integer</Id></ExprName>
  <Lit><Deci>2</Deci></Lit>
</Minus>

If no priorities where defined in the Java language, then this expression would be ambiguous. I can illustrate this by parsing the same expression using a Java grammar that does not declare priorities. I'm using the SGLR parser for this, which is capable of producing a parse forest (multiple parse trees) if an input is ambiguous. The alternatives are represented by an amb element with 2 or more children.

$ "(Integer) - 2" | sglri -p JavaAmb.tbl | aterm2xml --implicit
<amb>
  <Minus>
    <ExprName>
      <Id>Integer</Id>
    </ExprName>
    <Lit>
      <Deci>2</Deci>
    </Lit>
  </Minus>
  <CastRef>
    <ClassOrInterfaceType>
      <TypeName>
        <Id>Integer</Id>
      </TypeName>
    </ClassOrInterfaceType>
    <Minus>
      <Lit>
        <Deci>2</Deci>
      </Lit>
    </Minus>
  </CastRef>
</amb>

This clearly shows that the input is ambiguous: the first alternative is the binary operator (which is the alternative chosen by the Java language) and the other alternative is a cast to a reference type.

However, the cast to an int is not ambiguous, since int is a reserved keyword, thus forbidden as an identifier. So, for this input there is only a single parse option, even in the ambiguous version of Java.

$ echo "(int) - 2" | sglri -p JavaAmb.tbl | aterm2xml --implicit
<CastPrim>
  <Int/>
  <Minus>
    <Lit><Deci>2</Deci></Lit>
  </Minus>
</CastPrim>

The ambiguity in the first example has to be resolved. So, what should the language designer do? Prefer the cast, or prefer the binary minus? Well, that decision is not very hard: in the first example, the (Integer) is a parenthesized expression, where the expression is the variable Integer. If we ignore this actual value (since it is quite distracting), then the structure of the expression is ( Expression ) - Expression. You will recognize the need for this pattern, since the expression (a * b) - c has exactly the same structure!

The cast to a primitive type does not have the ambiguity problem, since all primitives types are keywords and all keywords are forbidden as identifiers. So, there is no reason to disallow this a primitive cast at this location and for this reason the language designers changed the priority of the primitive cast.

Are there alternatives? Yes, there are, but they are not very attractive either. First, a parenthesized expression name could be forbidden. Using parentheses for a plain identifier (or a qualified name) does not make a lot of sense. Another option is disallow casts to primitive types at this location. This can be annoying, but it makes things more clear and consistent.

Of course, having two different production rules for casts is not attractive. It's just a single language construct, so it should be defined by a single production as well. I wonder what the language designers would have done if autoboxing was already included in the first version of Java, since autoboxing makes this distinction between a reference cast and a primitive cast visible.

2005-04-05

Java Surprise 2: Another Example

While browsing through the micro testsuites of Java-front, I was remembered of another typical example:

int x = (int) ++y;
int x = (Integer) ++y;

The first statement is allowed. The second is not.

(See the first post on Java Surprise 2 for an explanation)

Java Surprise 2: Cast Priority

I promised a Java Surprise series. Thanks to my full-time job this promise is not hard to remember: every few days there is a fresh surprise for me ;) . The second surprise in this series is actually one a discovered last summer, so I'm cheating a bit. If you know me in real life, then I've probably already bothered you with this one.

First of all: please take a seat. Are you sitting comfortably? Excellent. Did you know that the syntactical priority of a cast to a primitive type is different from the cast to a reference type? Well, it is. Most likely, you will never encounter this, but it is not hard to find an example that will surprise you.

You are probably familiar with autoboxing in Java 1.5. In short, autoboxing can convert primitive types (such as int) to reference types (such as Integer) for you if necessary. Hence, you can assign an int to an Integer and you can also cast an int to an Integer. Some (correct) statements:

Integer x = 3;
Integer y = (Integer) 3;
int z = (Integer) 3;

I'm going to abuse your familiarity with autoboxing to show how weird it can be that the priority of primitive casts is different from reference casts. The following program is a correct program that includes a (redundant) cast to an int.

public class JavaSurprise2 {
  public static void main(String[] ps) {
    int y = (int) - 2;
    System.out.println(String.valueOf(y));
  }
}

Compile and run:

martin@logistico:~/tmp> javac JavaSurprise2.java
martin@logistico:~/tmp> java JavaSurprise2
-2

Well, that looks great. Now, let's replace the int with an Integer.

public class JavaSurprise2 {
  public static void main(String[] ps) {
    int y = (Integer) - 2;
    System.out.println(String.valueOf(y));
  }
}

Compile ...

martin@logistico:~/tmp> javac JavaSurprise2.java
JavaSurprise2.java:4: cannot find symbol
symbol  : variable Integer
location: class JavaSurprise2
    int y = (Integer) - 2;
             ^
JavaSurprise2.java:4: illegal start of type
    int y = (Integer) - 2;
            ^
2 errors

What the heck? Cannot find symbol? Let's give it a symbol ...

public class JavaSurprise2 {
  public static void main(String[] ps) {
    int Integer = 3;
    int y = (Integer) - 2;
    System.out.println(String.valueOf(y));
  }
}

Compile and run ...

martin@logistico:~/tmp> javac JavaSurprise2.java
martin@logistico:~/tmp> java JavaSurprise2
1

So, what happens? Of course the compiler is right. As I said in the beginning, the priority of a cast to a primitive type is different from a reference type. Because of the priorities defined in the Java Language Specification, the int example is parsed as a cast. However, the Integer version is parsed to an expression name: an expression that can be referred to using a name (aka variable). The Java compiler will never come back to this decision to make it a cast anyway: syntactical choices are always committed.

I can illustrate these different parses using Java-front, a package that provides a Java parser that is generated from a declarative syntax definition for Java in SDF (yes, I'm the developer: marketing intended ;) )

martin@logistico:~/tmp> echo "(Integer) - 2" | parse-java -s Expr
Minus(ExprName(Id("Integer")),Lit(Deci("2")))

martin@logistico:~/tmp> echo "(int) - 2" | parse-java -s Expr
CastPrim(Int,Minus(Lit(Deci("2"))))

Or in terms of XML:

martin@logistico:~/tmp> echo "(Integer) - 2"
    | parse-java -s Expr | aterm2xml --implicit
<Minus>
  <ExprName><Id>Integer</Id></ExprName>
  <Lit><Deci>2</Deci></Lit>
</Minus>

martin@logistico:~/tmp> echo "(int) - 2"
    | parse-java -s Expr | aterm2xml --implicit
<CastPrim>
  <Int/>
  <Minus>
    <Lit><Deci>2</Deci></Lit>
  </Minus>
</CastPrim>

Surprised? You'd better be!

2005-03-28

Java Surprise 1: Overloading and Inner Classes

As some of you might know, I'm working on implementing components of a Java compiler in Stratego. Obviously, I have to study the Java Language Specification in great detail for that. I had the impression that I knew a lot a about the Java language, but I still learn a lot of new details. Some of these details are funny, some are not. I've already encountered a lot of these causes and I'll try to blog about them from now.

My first post in this series is about this fragment:

class Foo {
  void f(String s) {}
 
  class Bar {
    void f(int x) {}

    class Fred {
      void g() { f("aaa"); }  
    }
  }
}

Did you know that you cannot overload the method f in this way?

The reason for this is that the specification separates method invocation in a few phases. The first compile-time phases determines the class to search for the method to invoke. For a plain method invocation (just and identifier), the JLS specifies that the class to search for methods is the innermost type declaration that has a method of that name. In this case, this will be the class Bar. Hence, the later phases that handle method overloading will not consider the method f that takes a String argument.

I don't think I've encountered this issue during my Java programming. Did you?