Abstract Syntax Trees

Last revised 14:00 Friday October 26, 2001 (most recent change in blue)

Original work item: "Exposing the AST API."

Related work item: "Improved Java code manipulation support, address JDOM limitations; JDOM doesn't preserve markers and isn't document aware; JDOM finer grained update support (e.g. change method name, type name); buffer contents is duplicated in Java UI document and needs to be manually synchronized."

Background

Dirk's wish list with regard to the AST support:

Summary from Sept. 10-11 meetings

Dirk travelled to SNZ to discuss refactoring requirements and possible solutions with Philippe M. and Jeem.

Some of the forms of solutions discussed, but ultimately abandoned:

In the end, we agreed on AST plus bindings: AST will either extend or replace JDOM. In the latter case, JDOM would be deprecated.

AST will exist along side Java model.

API Design Issue: AST Node Types - Classes, interface, or both

There are on the order of 87 node types for Java ASTs. Bindings will add some more. There are a couple of way this can be mapped to Java types.

(1) Use org.w3c.DOM interfaces as API. Provide a private concrete implementation of the DOM.

Pro: Very small, and standard API for read-write documents.
Con: API is not convenient for clients.
Con: API is not amenable to exposing bindings and other non-structural information.

(2) Concrete API class per AST node type.

Pro: API as small as possible.
Pro: Client can create nodes directly.

Con: Cannot easily hide implementation details; e.g. representation and mechanism for reassembling compilation unit text after editing; lazy binding creation.

Clients who create node from scratch only ever need basic constructors (the nodes they create do not have source positions, bindings, or other decorations). On the other hand, the parser needs to remember more info including fine-grained source positions.

(3) API interface per AST node type, along with node factory methods. Provide a private concrete implementation. Allow clients to reimplement node types (from scratch) and supply a factory.

Like JDOM (except JDOM does not permit client to reimplement) and org.w3c.dom.

Pro: API as small as possible.
Pro: Easy to tailor different kinds of representation: read-write vs. read-only ASTs; raw ASTs vs. AST+bindings.

Con:  Hidden concrete implementation classes takes more space.
Con: Using factory methods is a bit less direct than using constructors.

We will use API interfaces for bindings, and exposed API classes for AST nodes.

API Design Issue: Statement vs. Statement Lists

For structured statements, like while, the child statement is grammatically a single statement. However, since a block is one type of statement, it is possible to have a list of statements underneath. There are options for rendering this:

(1) Child is a single statement.

(Like current compiler's internal ASTs.)

Pro: As per the Java grammar.
Con: A client wanting to insert an additional statement into the child must be prepared to replace by a block if there isn't one.

(2) Child is a list of statements.

(Like the .net codeDOM.)

Pro: More convenient for clients that edit ASTs. Uniform mechanism for inserting and removing statements from child.
Con: Muddies scopes (enclosing braces around statements introduce a scope and make declarations syntactically legal).

We will go with (1) and stick closely to the grammar.

Usage

There are a couple different usage scenarios for ASTs:

Source Construct Normalization

Source Positions

Unicode Escapes

Comments

Whitespace

AST Parent Backpointer

Multiple ASTs

Structural Equality

Syntactic Correctness of Parser-built ASTs

Syntactic Correctness of Non-parser-built ASTs

Deep Constructs

Editing Protocol

User Data Field

Lists of Members

Serialization

Node types

The AST node types are based on the standard grammar for the Java language given in the JLS2.

Every AST node belongs to a single AST instance. (In DOM terminology, the AST is the document and the AST nodes are the elements). The AST instance can serve as a factory for creating new nodes. Nodes point to their owning AST (fixed for the node's lifetime). The AST points directly to the root node (a compilation unit).

The AST node types do not define their own notion of equality; they just inherit the object identity based implementation from Object.

Note: Grammar rules (in comments) are expressed in the Pascal-style extended BNF used in section 18 of JLS2. We use C# style property declarations as a convenient abbreviation for a standard matched pair of get and set methods.

public class AST
    public AST();
    public property CompilationUnit root;
    public void loadFromSource(char[] source);
    public void setOptions(...);
    public char[] serialize();

public abstract class ASTNode
    protected ASTNode(AST ast);
    public AST getOwner();
    public property int[] startPositions;
    public property int[] lengths;
    public property boolean isWholeLine;
    public property Object clientData;
    public ASTNode getParent();
    ... other protocol common to all AST node types

Names

As explained in JLS2 section 6.5, the grammar does not allow names to be resolved more finely than the following 6 categories by syntactic means alone:

PackageName:
            Identifier
            PackageName . Identifier
TypeName:
            Identifier
            PackageOrTypeName . Identifier

ExpressionName:
            Identifier
            AmbiguousName . Identifier

MethodName:
            Identifier
            AmbiguousName . Identifier

PackageOrTypeName:
            Identifier
            PackageOrTypeName . Identifier

AmbiguousName:
            Identifier
            AmbiguousName . Identifier

Given that names cannot be resolved definitively to a package, type, field, or variable at AST node construction time, an open question is how much of the categorization that could be done should be reflected in the AST. More categories means more information flow from the parser to the AST client; on the other hand, a variety of categories is not necessarily convenient for clients. For example, in import a.b.c the name is a TypeName whereas in import a.b.c.* the name a.b.c is a PackageOrTypeName. If the name category was to be reflected in the type of the AST nodes, the client would need to know to create the appropriate type of name nodes when editing the AST.

Proposal: Use two AST node types for names: simple names, and qualified names. Qualified names are expressed recursively, to facilitate clients discovering how the qualifier part of a name resolves. Use these for everything but MethodName; for MethodName, which can appear only in a method invocation expression, separate the selector identifier from any preceding qualifier.

(Note: The current internal AST nodes go beyond making the simple/qualified distinction: they also have simple & qualified type names (classes SimpleTypeReference and QualifiedTypeReference) in additional to simple & qualified named (classes SimpleNameReference and QualifiedNameReference).)

// Name:
//            SimpleName
//            QualifiedName
// SimpleName:
//            Identifier
// QualifiedName:
//            Name . Identifier
public interface IName // "marker" interface
    public IBinding resolvedBinding();  // optional

public class SimpleName extends ASTNode implements IName, IExpression
    public SimpleName(AST ast);
    public property char[] identifier;

public class QualifiedName extends ASTNode implements IName, IExpression
    public QualifiedName(AST ast);
    public property IName qualifier;
    public property char[] identifier;

Compilation Units and Major Declarations

// CompilationUnit:
//            [package Identifier { . Identifier } ;  ]
//                {ImportDeclaration}
//                {TypeDeclaration | ;}
public class CompilationUnit extends ASTNode
    public CompilationUnit(AST ast);
    public property Name packageName; // optional
    public property ImportDeclaration[] imports;
    public property TypeDeclaration[] types;

// ImportDeclaration:
//          import Identifier { . Identifier } [ .  *   ] ;
public class ImportDeclaration extends ASTNode
    public ImportDeclaration(AST ast);
    public property Name importName;
    public property boolean onDemand;
    public IBinding resolveBinding();

// TypeDeclaration:
//           {Modifier} class Identifier
//                [extends Type]
//                [implements Type { , Type}]
//            { {ClassBodyDeclaration | ; } }
//           {Modifier} interface Identifier
//                [extends Type { , Type}]
//            { {InterfaceBodyDeclaration | ; } }
// Modifier:
//          public
//          protected
//          private
//          static
//          abstract
//          final
//          native
//          synchronized
//          transient
//          volatile
//          strictfp
// ClassBodyDeclaration:
//            MethodDeclaration
//            ConstructorDeclaration
//            FieldDeclaration
//            ClassDeclaration
//            TypeDeclaration
//            Initializer
// InterfaceBodyDeclaration:
//            MethodDeclaration
//            FieldDeclaration
//            TypeDeclaration
public class TypeDeclaration extends ASTNode implements IStatement, IMember
    public TypeDeclaration(AST ast);
    public property int modifiers;
    public property char[] name;
    public property Name superclass; // optional
    public property Name[] superInterfaces;
    public property IMember[] members;
    public property char[][] javadocComment; // optional
    // convenience methods
    public FieldDeclaration[] getFields; // includes constants; excludes initializers
    public AbstractMethodDeclaration[] getMethods; // includes constructors
    public TypeDeclaration[] getTypes;
    public ITypeBinding resolveBinding();

// MethodDeclaration:
//            {Modifier} (Type | void) Identifier (
//               [FormalParameter { , FormalParameter}] )  {[ ]}
//               [throws QualifiedIdentifierList] ( MethodBody | ; )
// ConstructorDeclaration:
//            {Modifier} Identifier ( [FormalParameter { , FormalParameter}] )
//               [throws QualifiedIdentifierList] MethodBody
// FormalParameter:
//            [final] Type Identifier {[ ]}
public abstract class AbstractMethodDeclaration extends ASTNode implements IMember
    protected AbstractMethodDeclaration(AST ast);
    public property int modifiers;
    public property char[] selector;
    public property FormalParameter[] parameters;
    public property Name[] thrownExceptions;
    public property char[][] javadocComment; // optional
    public property Block body; // optional
    public IMethodBinding resolveBinding();

public class MethodDeclaration extends AbstractMethodDeclaration
    public MethodDeclaration(AST ast);
    public property Type returnType; // includes void

public class ConstructorDeclaration extends AbstractMethodDeclaration
    public ConstructorDeclaration(AST ast);

// FieldDeclaration:
//            {Modifier} Type Identifier {[ ]} [ = Expression]
//                { , Identifier {[ ]} [ = Expression] }
public class FieldDeclaration extends ASTNode implements IMember
    public AbstractMethodDeclaration(AST ast);
    public property int modifiers;
    public property char[] name;
    public property Type type;
    public property char[][] javadocComment; // optional
    public property IExpression initializer; // optional
    public IFieldBinding resolveBinding();

// Initializer:
//            [static] Block
public final class Initializer extends ASTNode implements IMember
    public Initializer(AST ast);
    public property int modifiers;
    public property Block body;

// LocalVariableDeclaration:
//            [final] Type Identifier {[]} [ = Expression ]
//                { , Identifier {[]} [ = Expression] } ;
public class LocalVariableDeclaration extends ASTNode implements IStatement
    public LocalVariableDeclaration(AST ast);
    public property int modifiers;
    public property char[] name;
    public property Type type;
    public property IExpression initializer; // optional
    public ILocalVariableBinding resolveBinding();

Types

The Type node (= TypeReference) represents a reference to a base type, a named type, or an array thereof.

// Type:
//            (BasicType | TypeName ) {[]}
// BasicType:
//          byte
//          short
//          char
//          int
//          long
//          float
//          double
//          boolean
public class Type extends ASTNode implements IExpression
    public Type (AST ast);
    public property int baseType; // either
    public property Name typeName; // or
    public property int dimensions;
    public IBinding resolvedType();

Statements

There is a different AST node type for each different kind of statement. Use a "marker" interface (IStatement) to bring all constructs that can appear within a block (nonterminal BlockStatement, which includes local variable and type declarations).

// Block:
//          { BlockStatement }
// BlockStatement :
//          LocalVariableDeclaration
//          TypeDeclaration
//          [Identifier : ] Statement
//Statement:
//          Block
//          if (Expression ) Statement [else Statement]
//          for ( ForInitOpt  ;   [Expression] ;   ForUpdateOpt ) Statement
//          while ( Expression ) Statement
//          do Statement while ( Expression );
//          try Block [Catches] [ finally Block ]
//          switch ( Expression ) { SwitchBlockStatementGroups }
//          synchronized ( Expression ) Block
//          return [Expression] ;
//          throw Expression   ;
//          break [Identifier]  ;
//          continue [Identifier]  ;
//          ;
//          ExpressionStatement
//          Identifier : Statement
public interface IStatement  // "marker" interface

public class Block extends ASTNode implements IStatement
    public Block(AST ast);
    public property IStatement[] statements;
public class IfStatement extends ASTNode implements IStatement
    public IfStatement(AST ast);
    public property IExpression test;
    public property IStatement thenPart;
    public property IStatement elsePart;  // optional
public class WhileStatement extends ASTNode implements IStatement
    ...
public class ForStatement extends ASTNode implements IStatement
    ...
public class DoStatement extends ASTNode implements IStatement
    ...
public class TryStatement extends ASTNode implements IStatement
    ...
public class SwitchStatement extends ASTNode implements IStatement
    ...
public class SynchronizedStatement extends ASTNode implements IStatement
    ...
public class ReturnStatement extends ASTNode implements IStatement
    ...
public class ThrowStatement extends ASTNode implements IStatement
    ...
public class BreakStatement extends ASTNode implements IStatement
    ...
public class ContinueStatement extends ASTNode implements IStatement
    ...
public class NullStatement extends ASTNode implements IStatement
    ...
public class LabeledStatement extends ASTNode implements IStatement
    ...
public class AssertStatement extends ASTNode implements IStatement
    ...

Expression Statements

Certain types of expressions can also appear as statements. The ExpressionStatement node wraps an expression up as a statement. The source range for the ExpressionStatement includes the trailing semicolon.

public class ExpressionStatement extends ASTNode implements IStatement
    public ExpressionStatement(AST ast);
    public property IExpression expression;

Expressions

There is a different AST node type for each different kind of expression. Use a "marker" interface (IExpression) to bring all constructs that can appear as expressions.

(Many details TBD).

// Expression:
//           Identifier
//           ArrayAllocationExpression
//           StringLiteral
//           FloatingPointLiteral
//           BooleanLiteral
//           CharacterLiteral
//           StringLiteral
//           NullLiteral
//           ( Type | void ) . class
//           [ ClassName . ] this
//          ( Expression )
//           [ Expression . ] new Type (
//               [ Expression { , Expression } ] ) [ ClassBody ]
//           Expression . Identifier
//           [ ClassName . ] super . Identifier
//           MethodName ( [ Expression { , Expression } ] )
//           Expression . Identifier ( [ Expression { , Expression } ] )
//           [ ClassName . ] super . Identifier (
//               [ Expression { , Expression } ] )
//           Expression [ Expression ]
//           Expression InfixOperator Expression
//           Expression instanceof Type
//           Expression PostfixOperator
//           PrefixOperator Expression
//          ( Type ) Expression
//           Expression ? Expression : Expression
//           Expression AssignmentOperator Expression
//     ArrayInitializer
public interface IExpression  // "marker" interface
    public IBinding resolvedType();  // optional

// ArrayAllocationExpression:
//          new PrimitiveType [ Expression ] { [ Expression ] } { [ ] }
//          new TypeName [ Expression ] { [ Expression ] } { [ ] }
//          new PrimitiveType [ ] { [] } ArrayInitializer
//          new TypeName [ ] { [] } ArrayInitializer
public class ArrayAllocationExpression
    extends ASTNode implements IExpression
    public ArrayAllocationExpression(AST ast);
    public property Type type;
    public property Expression[] dimensions; // optional
    public property Expression arrayInitializer; // optional

public class StringLiteral extends ASTNode implements IExpression
    public StringLiteral(AST ast);
    public property String value;

public class CastExpression extends ASTNode implements IExpression
    public CastExpression(AST ast);
    public property Type type;
    public property IExpression value;

public class InfixExpression extends ASTNode implements IExpression
    public InfixExpression(AST ast);
    public property int infixOperator;
    public property IExpression leftOperand;
    public property IExpression rightOperand;
    public property IExpression[] extendedOperands;  // L op R op R2 op R3...

Bindings

The "world of bindings" is an integrated picture of the structure of the program as seen from the compiler's point of view. The bindings correspond to named entities (packages, types, fields, methods, local variables).

Clients navigate from AST nodes into the world of bindings to discover things like:

Once in the world of bindings, the client can navigate the web of bindings: Some of the navigations that are not supported (quite intentionally): There are no links from the world of bindings back to the world of ASTs.

Other things dealt with in the world of bindings:

Other issues:

Existing Binding classes

To give an idea of the scope of the existing binding infrastructure, below is a dump of the type hierarchy of the compiler's binding classes from package rg.eclipse.jdt.internal.compiler.lookup.

public abstract class Binding
    public abstract class TypeBinding
        public final class ArrayBinding
        public final class BaseTypeBinding
        public abstract class ReferenceBinding
            public class SourceTypeBinding
                public class NestedTypeBinding
                    public final class LocalTypeBinding
                    public final class MemberTypeBinding
            public class ProblemReferenceBinding
            public class UnresolvedReferenceBinding
    public class PackageBinding
        public class ProblemPackageBinding
    public abstract class VariableBinding
        public class LocalVariableBinding
            public class SyntheticArgumentBinding
        public class FieldBinding
            public class SyntheticFieldBinding
            public class ProblemFieldBinding
    public class MethodBinding
        public class ProblemMethodBinding
        public class SyntheticAccessMethodBinding
    public class ImportBinding
    public class ProblemBinding

Binding API

The existing binding classes are not immediately suitable for exposing as a binding API.

However, the Java builder does have an API for the built "image", in package org.eclipse.jdt.internal.core.builder. (This API is a hold-over from Leapfrog era, and is not exposed in the Eclipse code base). This API was designed to expose the same kind of integrated picture of the structure of the program as seen from the compiler's point of view. This API has a detailed specification that does not expose implementation details, so the proposal is to use it as the basis for the new binding API.

Re-purposing this API would entail:

Below is a dump of the relevant interfaces from package org.eclipse.jdt.internal.core.builder. Unnecessary protocol has been omitted. (Note that NotPresentException is an unchecked exception, and would not be required.)

public interface IHandle
    int K_JAVA_IMAGE = 1;
    int K_JAVA_PACKAGE = 2;
    int K_JAVA_TYPE = 3;
    int K_JAVA_FIELD = 4;
    int K_JAVA_METHOD = 5;
    int K_JAVA_CONSTRUCTOR = 6;
    boolean equals(Object obj);
    int hashCode();
    boolean isFictional() throws NotPresentException;
    boolean isPresent();
    int kind();

public interface IMember extends IHandle
    IType getDeclaringClass();
    int getModifiers() throws NotPresentException;
    String getName();
    boolean isBinary() throws NotPresentException;
    boolean isDeprecated() throws NotPresentException;
    boolean isSynthetic() throws NotPresentException;

public interface IPackage extends IHandle
    boolean equals(Object obj);
    IType getClassHandle(String name);
    String getName();
    boolean isUnnamed();

public interface IType extends IMember
    boolean equals(Object obj);
    IType getArrayHandle();
    IType getComponentType();
    IConstructor getConstructorHandle(IType[] parameterTypes);
    IType[] getDeclaredClasses() throws NotPresentException;
    IConstructor[] getDeclaredConstructors() throws NotPresentException;
    IField[] getDeclaredFields() throws NotPresentException;
    IMethod[] getDeclaredMethods() throws NotPresentException;
    int getDeclaredModifiers() throws NotPresentException;
    String getDeclaredName() throws NotPresentException;
    IType getDeclaringClass() throws NotPresentException;
    IField getFieldHandle(String name);
    IType[] getInterfaces() throws NotPresentException;
    IMethod getMethodHandle(String name, IType[] parameterTypes);
    int getModifiers() throws NotPresentException;
    String getName();
    IPackage getPackage();
    String getSimpleName();
    IType getSuperclass() throws NotPresentException;
    boolean isAnonymous() throws NotPresentException;
    boolean isArray();
    boolean isBinary() throws NotPresentException;
    boolean isClass() throws NotPresentException;
    boolean isDeprecated() throws NotPresentException;
    boolean isInnerClass() throws NotPresentException;
    boolean isInterface() throws NotPresentException;
    boolean isLocal() throws NotPresentException;
    boolean isPackageMember() throws NotPresentException;
    boolean isPresent();
    boolean isPrimitive();
    boolean isSynthetic() throws NotPresentException;
    boolean isTopLevel() throws NotPresentException;

public interface IMethod extends IMember
    boolean equals(Object obj);
    IType[] getExceptionTypes() throws NotPresentException;
    IType[] getParameterTypes();
    IType getReturnType() throws NotPresentException;
    boolean isPresent();

public interface IConstructor extends IMember
    boolean equals(Object obj);
    IType[] getExceptionTypes() throws NotPresentException;
    IType[] getParameterTypes();
    boolean isPresent();

public interface IField extends IMember
    boolean equals(Object obj);
    IType getType() throws NotPresentException;
    boolean isPresent();

In this vein, the interface for local variables would look something like:

public interface IVariable extends IHandle
    boolean equals(Object obj);
    IType getDeclaringClass();
    int getModifiers() throws NotPresentException;
    String getName();
    boolean isSynthetic() throws NotPresentException;
    IType getType() throws NotPresentException;
    boolean isPresent();

Also will need to add:

Document History

18:30 Thursday September 27, 2001 - incorporated first round comments from PM and DB.
10:45 Monday October 1, 2001 - incorporated comments from DB.
10:45 Tuesday October 2, 2001 - clarify handing of ExpressionStatement.
14:00 Friday October 26, 2001 - add subtree structural equality.