2005-08-02

Lifting Member Classes from Generic Classes

I've been working on Java generics and member classes during the last few weeks. In particular, I had to find out exactly how the additional information on Java generics is represented in the bytecode attributes of generic classes and methods (aka generic signatures). I was surprised by the way member classes of generic classes are compiled, and I'm worried about the consequences of this for future updates of the JVM specification. That's what this entry is about.

First, something about the relation between member classes and lambda lifting. The Java language supports member classes, but Java bytecode does not. Therefore, Java compilers have to lift member classes to top-level classes, a transformation that is comparable to lambda lifting (see for example the paper Lambda-Lifting in Quadratic Time).

Member classes are compiled to ordinary top-level classes where the constructor takes an extra argument for the instance of the enclosing class. For example, the constructor of a member class Bar of class Foo will get an additional argument of type Foo for the enclosing instance of a Bar object. Constructors of local classes (classes declared in a method) also take additional arguments for the local variables that they use from their enclosing method. This process of lifting classes (that have lexical scope) is very similar to lifting nested functions in lambda lifting: all local variables that are used in the nested class become explicit arguments, which makes the nested class insensitive to its scope. After that, the class can simply be lifted out of its original scope to the top level. An essential property of the class (or function) after lifting is that it no longer directly refers to variables of its original scope.
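To make the analogy concrete, here is a minimal hand-written sketch of the lifting described above. It is not what a compiler literally emits: the names LiftingSketch, LiftedBar, LiftedLocal, base, adder and offset are made up for illustration, and static nested classes stand in for top-level classes so that everything fits in one file.

```java
import java.util.function.IntSupplier;

class LiftingSketch {
    // Original form: Bar is a member class, Local is a local class.
    static class Foo {
        int base = 10;

        class Bar {
            int plus(int n) { return base + n; } // implicitly reads Foo.this.base
        }

        IntSupplier adder(final int offset) {
            class Local implements IntSupplier {
                // uses the enclosing instance and the local variable offset
                public int getAsInt() { return base + offset; }
            }
            return new Local();
        }
    }

    // Lifted form of Bar: the enclosing instance is an explicit constructor argument.
    static class LiftedBar {
        private final Foo enclosing;
        LiftedBar(Foo enclosing) { this.enclosing = enclosing; }
        int plus(int n) { return enclosing.base + n; }
    }

    // Lifted form of Local: the captured local variable is an explicit argument too.
    static class LiftedLocal implements IntSupplier {
        private final Foo enclosing;
        private final int offset;
        LiftedLocal(Foo enclosing, int offset) {
            this.enclosing = enclosing;
            this.offset = offset;
        }
        public int getAsInt() { return enclosing.base + offset; }
    }

    public static void main(String[] args) {
        Foo foo = new Foo();
        System.out.println(foo.new Bar().plus(5));                // 15
        System.out.println(new LiftedBar(foo).plus(5));           // 15
        System.out.println(foo.adder(3).getAsInt());              // 13
        System.out.println(new LiftedLocal(foo, 3).getAsInt());   // 13
    }
}
```

After lifting, neither LiftedBar nor LiftedLocal refers to anything from its original scope: everything it needs arrives through the constructor.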

In Java 5.0, parameterized types and methods (aka generics) have been introduced. In combination with member classes, this raises the question of how type variables should be handled when lifting member classes. From the source code point of view, this is pretty obvious:

class Foo<A> {
  class Bar {
    A get() { ... }
  }
}

If the class Bar is lifted, then its constructor gets an additional parameter for the enclosing Foo instance. This Foo instance is parameterized using a type variable A, so the lifted class Bar should also be parameterized with a type: the type parameter of its enclosing instance. This lifting of type parameters is comparable to the lifting of parameters for normal variables. So, the result of lifting the Bar class at the source level could be:

class Foo<A> {
}

class Bar<A> {
  private final Foo<A> _enclosing;

  public Bar(Foo<A> enclosing) {
    _enclosing = enclosing;
  }

  A get() { ... }
}

Indeed, the Eclipse implementation of the refactoring "Move Member Type to New File" adds the type parameter to the lifted class (thumbs up for the generics support in Eclipse!).

So, what happens in Java bytecode? Should the lifted class have a type parameter? Should the lifted class be a valid class, generically speaking? (Of course, a JVM is currently not required to understand generics-related information in bytecode.)

Well, the lifted class does not have a type parameter and, generically speaking, it is not a valid class. Let's take a look at the bytecode, represented in a structured way as an aterm, produced by a tool called class2aterm. (I use ... to leave out some details and // to explain what the code means. The full aterm is available here.)

$ class2aterm -i Foo\$Bar.class --parse-sig | pp-aterm
ClassFile(
  ...
  // field for the enclosing Foo instance.
  Field(
    AccessFlags([Final, Synthetic])
  , Name("this$0")
  , FieldDescriptor(ObjectType("Foo"))
  , Attributes([])
  )
  ...

  // constructor, taking a Foo argument.
  Method(
    AccessFlags([])
  , Name("<init>")
  , MethodDescriptor([ObjectType("Foo")], Void)
  , Attributes([])
  )
  ...

  // get method with a generic signature
  Method(
    AccessFlags([])
  , Name("get")
  , MethodDescriptor([], ObjectType("java.lang.Object"))
  , Attributes(
      [ MethodSignature(
          TypeParams([])
        , Params([])
        , Returns(TypeVar(Id("A")))
        , Throws([])
        )
      ]
    )
  )
  ...

  // attributes of the class Bar
  Attributes(
    [ SourceFile("Foo.java")
    , InnerClasses( ... )
    ]
  )
  ...
)

This disassembled class file reveals some interesting details about the way nested classes are lifted:

  • The lifted class Bar is not parameterized: it has no ClassSignature attribute, which should be there if the class takes formal type parameters.
  • The field for the enclosing class does not have a parameterized type. Its type is the raw type Foo!
  • The constructor of Bar (the method named <init>) has no generic signature and takes a raw type Foo as an argument.
  • The get method does have a generic signature, which describes that the method returns a type variable A.

Of course, all the information of the original source can be reconstructed by a tool that knows about both member classes and generics. But a tool that only knows about generics would consider this code incorrect. Hence, if the virtual machine were to support generics in the future (an option that is explicitly left open), this code would be incorrect: the type variable mentioned in the generic signature of the get method is not in scope. The JVM would need knowledge of inner classes as well as generics to find out which type parameter this type variable refers to. The alternative is to change the bytecode format, but then code compiled to the current format could no longer run on the new JVM, and that kind of backward compatibility has always been an important requirement for Sun when working on extensions of the Java platform (language and virtual machine).
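The observations above can be checked from within Java itself using java.lang.reflect, which is one example of a tool that knows about both member classes and generics. The following is a small sketch, not part of the original experiment: SignatureProbe and the field value are hypothetical names, and Foo is nested in a wrapper class (rather than being top-level as in the example above) only to keep the file self-contained.

```java
import java.lang.reflect.Method;
import java.lang.reflect.TypeVariable;

class SignatureProbe {
    static class Foo<A> {
        A value; // hypothetical field, so that Bar really uses its enclosing instance

        class Bar {
            A get() { return value; }
        }
    }

    // Looks up Bar.get() reflectively; failure would mean the sketch itself is broken.
    static Method getMethod() {
        try {
            return Foo.Bar.class.getDeclaredMethod("get");
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    // Number of formal type parameters recorded for Bar itself.
    static int barTypeParameterCount() {
        return Foo.Bar.class.getTypeParameters().length;
    }

    // The type variable named in the generic signature of Bar.get().
    static TypeVariable<?> returnTypeVariable() {
        return (TypeVariable<?>) getMethod().getGenericReturnType();
    }

    public static void main(String[] args) {
        TypeVariable<?> a = returnTypeVariable();
        System.out.println("type parameters of Bar: " + barTypeParameterCount());
        System.out.println("erased return type:     " + getMethod().getReturnType());
        System.out.println("generic return type:    " + a.getName());
        System.out.println("declared by:            " + a.getGenericDeclaration());
    }
}
```

Bar reports no type parameters of its own, while the type variable A returned by get is attributed to Foo. Reflection can only produce that answer by walking from Bar to its enclosing class, which is the kind of inner-class knowledge discussed above.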

Furthermore, the type variable in the signature of the get method is not qualified. Every single name in Java bytecode is fully qualified, which is very useful for tools that need to work on bytecode: they don't have to do name analysis to find out to what construct a name refers. Type variables are not qualified, which complicates the analysis that has to be performed by a tool that operates on bytecode. Not only can this type variable refer to type parameters of arbitrary enclosing classes, it could also refer to type parameters of enclosing generic methods (for local classes or member classes in local classes).

The fact that type variables in bytecode are not qualified is already quite annoying without considering member classes. In the Java language, it is allowed to redeclare type variables. For example:

class Foo<A> {
  <A> void foo(A x) {
  }
}

In this example the type parameter A of the foo method is a different type parameter than the A parameter of the class Foo. This basically means that a bytecode processing tool with knowledge of generics has to do name analysis, which is definitely not something that is desirable for a bytecode format. Introducing canonical, fully qualified names for type variables would solve this.
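As a quick illustration of the redeclaration, the sketch below asks java.lang.reflect for both type variables. ShadowingProbe and its helper methods are hypothetical names, and Foo is again nested in a wrapper class only to keep the file self-contained.

```java
import java.lang.reflect.Method;
import java.lang.reflect.TypeVariable;

class ShadowingProbe {
    static class Foo<A> {
        <A> void foo(A x) { } // this A hides the class-level A
    }

    // The type parameter A declared by the class Foo.
    static TypeVariable<?> classVariable() {
        return Foo.class.getTypeParameters()[0];
    }

    // The type parameter A declared by the method foo.
    static TypeVariable<?> methodVariable() {
        try {
            // the parameter type A erases to Object
            Method foo = Foo.class.getDeclaredMethod("foo", Object.class);
            return foo.getTypeParameters()[0];
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TypeVariable<?> c = classVariable();
        TypeVariable<?> m = methodVariable();
        System.out.println("same name:     " + c.getName().equals(m.getName()));
        System.out.println("same variable: " + c.equals(m));
        System.out.println("class-level A declared by:  " + c.getGenericDeclaration());
        System.out.println("method-level A declared by: " + m.getGenericDeclaration());
    }
}
```

The two variables share the name A but have different declarations, so the name alone does not identify which one a signature refers to.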

As you might know, I'm working on semantic analysis for Java in the context of the Stratego/XT project. My goal is to make it possible to define program transformations in Stratego at the semantic level: transformations that consider the actual meaning of names, the types of expressions, and so on, without requiring the programmer to redo the semantic analysis, which is quite complex for a 'real' language like Java. Obviously, I have decided to qualify type variables. For example, the parameter x of type A of the method foo in class Foo in the last example is represented as:

Param(
  []
, TypeVar(
    MethodName(TypeName(PackageName([]), Id("Foo")), Id("foo"))
  , Id("A")
  )
, Id("x")
)

The MethodName is the qualifier of the type variable in this example. This qualifier makes it immediately clear that the type variable refers to the type parameter of the method foo.
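The same idea can be sketched in plain Java on top of java.lang.reflect. This is only an illustration of qualifying a type variable by its declaration, not the Stratego representation shown above: QualifiedTypeVars, qualify, and the "#" separator are made up for this example, and a real scheme would also need the method descriptor to tell overloaded methods apart.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.GenericDeclaration;
import java.lang.reflect.Method;
import java.lang.reflect.TypeVariable;

class QualifiedTypeVars {
    static class Foo<A> {
        <A> void foo(A x) { }
    }

    // Builds a name for a type variable that includes the construct declaring it.
    static String qualify(TypeVariable<?> tv) {
        GenericDeclaration d = tv.getGenericDeclaration();
        if (d instanceof Class) {
            return ((Class<?>) d).getName() + "#" + tv.getName();
        }
        if (d instanceof Method) {
            Method m = (Method) d;
            return m.getDeclaringClass().getName() + "." + m.getName() + "#" + tv.getName();
        }
        if (d instanceof Constructor) {
            Constructor<?> c = (Constructor<?>) d;
            return c.getDeclaringClass().getName() + ".<init>#" + tv.getName();
        }
        throw new IllegalArgumentException("unknown kind of declaration: " + d);
    }

    // Qualified name of the class-level A.
    static String classLevel() {
        return qualify(Foo.class.getTypeParameters()[0]);
    }

    // Qualified name of the A declared by the method foo.
    static String methodLevel() {
        try {
            Method foo = Foo.class.getDeclaredMethod("foo", Object.class);
            return qualify(foo.getTypeParameters()[0]);
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(classLevel());  // QualifiedTypeVars$Foo#A
        System.out.println(methodLevel()); // QualifiedTypeVars$Foo.foo#A
    }
}
```

With a qualifier attached, the two A's from the earlier example get different names, so a consumer no longer has to search enclosing scopes to tell them apart.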

I don't know whether this would have been fixed (maybe I have this completely wrong), but it's still a pity that I wasn't able to give feedback on this before JSR14 was finished. At that time, I was still working on the syntactic part of my Java transformation project (which is now available as Java Front). I gave some feedback on the syntax of generics, annotations and enumerations (mostly typos and minor bugs), but that's about it. To reduce the number of possible problems, I think it would be very useful if new language features, such as generics, were also implemented with alternative techniques and tools. For example, I was able to give some feedback on the syntax of Java because I was implementing a parser by creating a declarative syntax definition in SDF, a modular syntax definition formalism that integrates lexical and context-free syntax. Such unconventional approaches might in general result in valuable feedback on proposals for new language features.
