2005-04-06

Java Surprise 2: Motivation

In the previous posts I showed that a cast to a reference type has a different priority than a cast to a primitive type. Martijn Vermaat asked me why the designers of the Java language made this decision. Of course, they had good reasons for this design decision, but it is still questionable, especially now that we have autoboxing.

Let's take a look at this example from the original post:

$ echo "(Integer) - 2" | parse-java -s Expr | aterm2xml --implicit
<Minus>
  <ExprName><Id>Integer</Id></ExprName>
  <Lit><Deci>2</Deci></Lit>
</Minus>

If no priorities were defined in the Java language, then this expression would be ambiguous. I can illustrate this by parsing the same expression using a Java grammar that does not declare priorities. I'm using the SGLR parser for this, which is capable of producing a parse forest (multiple parse trees) if an input is ambiguous. The alternatives are represented by an amb element with 2 or more children.

$ echo "(Integer) - 2" | sglri -p JavaAmb.tbl | aterm2xml --implicit
<amb>
  <Minus>
    <ExprName>
      <Id>Integer</Id>
    </ExprName>
    <Lit>
      <Deci>2</Deci>
    </Lit>
  </Minus>
  <CastRef>
    <ClassOrInterfaceType>
      <TypeName>
        <Id>Integer</Id>
      </TypeName>
    </ClassOrInterfaceType>
    <Minus>
      <Lit>
        <Deci>2</Deci>
      </Lit>
    </Minus>
  </CastRef>
</amb>

This clearly shows that the input is ambiguous: the first alternative is the binary operator (which is the alternative chosen by the Java language) and the other alternative is a cast to a reference type.

However, the cast to an int is not ambiguous, since int is a reserved keyword, thus forbidden as an identifier. So, for this input there is only a single parse option, even in the ambiguous version of Java.

$ echo "(int) - 2" | sglri -p JavaAmb.tbl | aterm2xml --implicit
<CastPrim>
  <Int/>
  <Minus>
    <Lit><Deci>2</Deci></Lit>
  </Minus>
</CastPrim>

The ambiguity in the first example has to be resolved. So, what should the language designer do? Prefer the cast, or prefer the binary minus? Well, that decision is not very hard: in the first example, the (Integer) is a parenthesized expression, where the expression is the variable Integer. If we ignore this actual value (since it is quite distracting), then the structure of the expression is ( Expression ) - Expression. You will recognize the need for this pattern, since the expression (a * b) - c has exactly the same structure!

The cast to a primitive type does not have the ambiguity problem, since all primitive types are keywords and all keywords are forbidden as identifiers. So, there is no reason to disallow a primitive cast at this location, and for this reason the language designers changed the priority of the primitive cast.
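
To make the difference concrete, here is a minimal Java sketch. The class name CastSurprise and its method names are made up for this illustration; the only thing it relies on is that a parameter may legally be named Integer, in which case (Integer) is a parenthesized variable and the minus is binary.

```java
// Hypothetical illustration; class and method names are not from the original post.
class CastSurprise {

    // The parameter is deliberately named Integer. In an expression, the
    // simple name Integer now refers to this variable, so "(Integer) - 2"
    // is "( Expression ) - Expression": a binary subtraction.
    static int parenthesizedMinus(int Integer) {
        return (Integer) - 2;
    }

    // int is a keyword, so "(int)" can only be a cast. The minus is the
    // unary operator applied to the literal 2.
    static int primitiveCast() {
        return (int) - 2;
    }

    public static void main(String[] args) {
        System.out.println(parenthesizedMinus(10)); // prints 8
        System.out.println(primitiveCast());        // prints -2
    }
}
```

Both methods contain the same token shape, ( Name ) - 2, yet only the second one is a cast.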

Are there alternatives? Yes, there are, but they are not very attractive either. First, a parenthesized expression name could be forbidden. Using parentheses for a plain identifier (or a qualified name) does not make a lot of sense. Another option is to disallow casts to primitive types at this location. This can be annoying, but it makes things clearer and more consistent.

Of course, having two different production rules for casts is not attractive. It's just a single language construct, so it should be defined by a single production as well. I wonder what the language designers would have done if autoboxing was already included in the first version of Java, since autoboxing makes this distinction between a reference cast and a primitive cast visible.
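
With autoboxing (Java 5 and later), the distinction shows up as soon as you swap int for Integer in a cast. The following is a small sketch, again with made-up class and method names; the commented-out line is the one that the compiler rejects when no variable named Integer is in scope, because it is read as a subtraction.

```java
// Hypothetical illustration; class and method names are not from the original post.
class BoxedCast {

    // Primitive cast followed by unary minus: accepted.
    static int primitiveCast() {
        return (int) -2;
    }

    // The "obvious" boxed variant is parsed as (Integer) minus 2, i.e. a
    // subtraction from a variable named Integer, so it does not compile here:
    //
    //     return (Integer) -2;
    //
    // Parenthesizing the operand removes the leading minus from the cast
    // operand position, so this is a reference cast that boxes the int.
    static Integer referenceCast() {
        return (Integer) (-2);
    }

    public static void main(String[] args) {
        System.out.println(primitiveCast());  // prints -2
        System.out.println(referenceCast());  // prints -2
    }
}
```

So the workaround is a pair of extra parentheses around the operand, which is exactly the kind of asymmetry between the two cast forms discussed above.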

4 comments:

Anonymous said...

Seems fair to me, thanks for the insight!

Intuitively, one would regard (Integer) as the nasty part of this expression.

But of course, also the dual role of the minus sign in the Java grammar is key for this behaviour.

I sometimes wonder why languages use distinct notation for unary and binary (infix) minus (i.e. "hmmm, I guess the designer just chose easy parsing here"). But once situations like this one come into play, they seem like a pretty good case for such a decision.

Anonymous said...

Hmmm. I'm surprised that you're surprised. This is a well-known ambiguity in C and C++. There are others, e.g.

fred(-george)
(fred)(-george)

could be a function call.

It is resolved pragmatically by determining whether the "simple name" is a "type", not by priority. The production is identical for both primitive types and user-defined types.
