Abstract Syntax Trees
Last revised 14:00 Friday October 26, 2001 (most
recent change in blue)
Original work item: "Exposing the AST API."
Related work item: "Improved Java code manipulation support, address
JDOM limitations; JDOM doesn't preserve markers and isn't document aware;
JDOM finer grained update support (e.g. change method name, type name);
buffer contents is duplicated in Java UI document and needs to be manually
synchronized."
Background
- refactoring is a key customer (Dirk B.)
- for 2.0: have good support in place so that refactoring can be made an
open API; would need corresponding core APIs for abstract syntax trees
- current refactoring appoach uses AST visitor which runs post-resolve on
designated compilation unit
- visitor collects info (parent stack) including source positions
- uses name environment (bindings, scopes) to validate/locate
- all changes are formulated as reified undoable edits in a source buffer
- a batch of changes are made at a time
- refactoring does not reason hypothetically (changes are undoable)
- JDOM is not used by refactoring, but potentially could be (JDOM is used
inside Java Model)
Dirk's wish list with regard to the AST support:
-
consistent positions in AST (e.g. sourceStart and sourceEnd should always
cover the whole node).
-
comment handling in AST. Some node contain preceding comments other don't
-
Parent pointer in AST node.
-
Data pointer in AST node.
-
Expression should provide getType method (cached value of resolveType(Scope)).
Summary from Sept. 10-11 meetings
Dirk travelled to SNZ to discuss refactoring requirements and possible
solutions with Philippe M. and Jeem.
Some of the forms of solutions discussed, but ultimately abandoned:
-
A vanilla DOM.
-
too limiting: difficult to provide for pre-computing bindings.
-
clumsy for clients to use without AST node types represented by different
Java types
-
AST plus resolves info in form of Java model elements.
-
Java model methods are unsuitable canonical representation for resolved
methods because parameter types are as per source.
-
It is hard to map a type back to a Java element (need to remember package
fragment).
-
Would need additional Java model elements to represent variable declarations.
-
Would need special Java element implementations for local types, methods,
and fields.
In the end, we agreed on AST plus bindings:
-
AST is simple tree of typed nodes as determined by parser.
-
A different Java type for each AST node type.
-
TBD: are node types API classes, API interfaces, or both?
-
Simple case: no bindings.
-
Basic AST Lifecycle: client requests AST; compilation unit is parsed and
nodes are created; root is handed back to client; AST is garbage collected
after client lets go of all AST nodes. AST is not affected if underlying
compilation unit is changed (i.e., eager parse, with no residual dependence
on state of file in workspace).
-
Any syntax errors detected need to be reported to client. Problems are
opaque: source character position plus human-readable message. Clients
needs to determine whether an AST is only partial (e.g., busted flags).
-
Predicatable trees with simple correspondence to source code (parser should
not optimize ASTs).
-
Reliable source position information for all nodes.
-
Scratch "data" field on each AST node for client to record an Object.
-
Navigate AST upwards as well as downwards. Parent link.
-
AST walker for convenient traversal.
-
ASTs can be read-only or read-write.
-
Read-write ASTs can be modified in terms of ASTs. AST node factories (no
Java parser required). Cut/copy/paste.
-
Cut/copy/paste between separate ASTs requires fragment cloning to preserve
independence of storage.
-
Client may be interested in cut/copy/pasting comments too.
-
New nodes would not carry source positions.
-
Read-write AST can be serialized back to compilation unit char[].
-
Preserve existing comments, whitespace, and use of \u.
-
Control whitespace and use of \u for insertions.
-
Provide edit map for updating markers.
-
Resolved ASTs: basic AST annotated with bindings (non-syntactic information).
-
Binding information is derived from AST plus the Java model (relative to
some project).
-
Availability/validity of bindings ceases when Java model changes.
-
Client must request up front that bindings be created.
-
Certain AST nodes get decorated with bindings.
-
Client should have ways to communicate up front what kind of bindings are
required and where.
-
Availability/validity of bindings for a read-write AST ceases when it is
first modified.
-
Bindings (non-syntactic information) includes things such as the following:
-
Resolved names - which type, field, method, or local variable does a AST
name refernce resolve to.
-
Resolved types - what is the resolved type of an AST expression or type
reference.
-
Resolved supertypes - what are the resolved supertypes of a resolved type.
-
Resolved declarations - what resolved type, field, method, or local variable
does an AST declaration map to.
-
Thrown exceptions - what are the resolved types of the exceptions thrown
by a given expression or statement.
-
Resolved members - what are the resolved members of a resolved type.
-
Problems also should be reported with resolving. Problems are opaque: source
character position plus human-readable message.
-
Space for bindings storage is significant; increases monotonically as more
bindings are accessed.
-
Space for bindings is for lifetime of the AST.
-
Advanced AST Lifecycle: client requests AST with bindings; compilation
unit is parsed, nodes created, and perhaps some bindings created and annotated
on nodes; root is handed back to client; AST is garbage collected after
client lets go of all AST nodes and bindings. AST itself is not affected
if underlying compilation unit is changed (i.e., eager parse, with no residual
dependence on state of file in workspace). Bindings may become stale or
invalid if workspace changes (i.e., possibly lazy and incremental construction
of bindings using Java model).
-
Bindings from two ASTs are not generally comparable.
-
For bindings with stable global names, API provides
strings that can be compared between ASTs.
AST will either extend or replace JDOM. In the latter case, JDOM would
be deprecated.
AST will exist along side Java model.
API Design Issue: AST Node Types - Classes, interface, or both
There are on the order of 87 node types for Java ASTs. Bindings will add
some more. There are a couple of way this can be mapped to Java types.
(1) Use org.w3c.DOM interfaces as API. Provide a private concrete implementation
of the DOM.
Pro: Very small, and standard API for read-write documents.
Con: API is not convenient for clients.
Con: API is not amenable to exposing bindings and other non-structural
information.
(2) Concrete API class per AST node type.
Pro: API as small as possible.
Pro: Client can create nodes directly.
Con: Cannot easily hide implementation details; e.g. representation
and mechanism for reassembling compilation unit text after editing; lazy
binding creation.
Clients who create node from scratch only ever need basic constructors
(the nodes they create do not have source positions, bindings, or other
decorations). On the other hand, the parser needs to remember more info
including fine-grained source positions.
(3) API interface per AST node type, along with node factory methods.
Provide a private concrete implementation. Allow clients to reimplement
node types (from scratch) and supply a factory.
Like JDOM (except JDOM does not permit client to reimplement) and org.w3c.dom.
Pro: API as small as possible.
Pro: Easy to tailor different kinds of representation: read-write vs.
read-only ASTs; raw ASTs vs. AST+bindings.
Con: Hidden concrete implementation classes takes more space.
Con: Using factory methods is a bit less direct than using constructors.
We will use API interfaces for bindings, and exposed API classes for
AST nodes.
API Design Issue: Statement vs. Statement Lists
For structured statements, like while, the child statement is grammatically
a single statement. However, since a block is one type of statement, it
is possible to have a list of statements underneath. There are options
for rendering this:
(1) Child is a single statement.
(Like current compiler's internal ASTs.)
Pro: As per the Java grammar.
Con: A client wanting to insert an additional statement into the child
must be prepared to replace by a block if there isn't one.
(2) Child is a list of statements.
(Like the .net codeDOM.)
Pro: More convenient for clients that edit ASTs. Uniform mechanism for
inserting and removing statements from child.
Con: Muddies scopes (enclosing braces around statements introduce a
scope and make declarations syntactically legal).
We will go with (1) and stick closely to the grammar.
Usage
There are a couple different usage scenarios for ASTs:
-
Analyze an existing compilation unit to discover syntactic structure.
-
Discover relationship between syntactic structure and original text.
-
Discover relationship between syntactic structure and resolved world.
-
Create a new compilation unit from scratch.
-
Edit an existing compilation unit.
Source Construct Normalization
-
Most syntactic constructions are rendered in one and only one way.
-
When this is not the case, the AST construction is "lossy".
-
Some forms cannot be distinguised in input (if one cares).
-
Some forms cannot be produced in output.
-
Copying the construct normalizes it.
-
Example: Modifier order
-
final static public int X = 1;
-
public static final int X = 1; // preferred form
-
Example: Compound variable declarations
-
int i = 1, j = 2;
-
int i = 1; int j = 2; // preferred form
-
Example: Array type declarators
-
int[] x[];
-
int[][] x; // preferred form
-
Example: Short ifs
-
if (a) f(); else ;
-
if (a) f(); // preferred form
-
Can only be done for syntactic nuances that are have no semantic import.
-
Normalization is generally acceptable where unimportant syntactic nuances
are involved.
-
Normal form should follow JLS recommendations and Java coding standards.
-
Note that parentheses and blocks are important to user and should not be
normalized.
Source Positions
-
When AST is obtained by parsing a text string, exposing source ranges for
nodes allows clients to navigate back into original string; e.g., for making
text editor selections.
-
AST supports only character-oriented position information; mapping character
positions to lines are handled elsewhere (e.g., text editor).
-
Source ranges are irrelevant for nodes created by other means.
-
Source ranges give original position in original string.
-
Editing the AST does not alter positions or anything clever.
-
Most constructs occupy contiguous character positions, or ranges.
-
Ranges are represented by 0-based start position and length.
-
Start position begins at first significant character of construct corresponding
to AST node.
-
First significant character.
-
Does not include leading whitespace.
-
Does not include preceding comment (except the javadoc comment preceding
a declaration, or the comment preceding a statement - see below).
-
End position includes last significant character of construct corresponding
to AST node.
-
Last significant character.
-
Includes trailing terminators that are part of construct; e.g., include
trailing semicolon at end of local variable declaration.
-
Does not include separators; e.g., exclude trailing comma in parameter
list.
-
Does not include trailing whitespace.
-
Does not include trailing comment.
-
Statement end-of-line comments are not encompassed by statement.
-
System.out.println("hello"); // $non-nls$
-
Embedded comments are encompassed if they occur before
end position.
-
System.out.println("hello") /* comment */;
-
Some node types would have source ranges for significant contiguous subconstructs
not readily gleanable from source ranges of the subnodes.
-
Additional source ranges would be specified for each node type.
-
E.g., method declaration has additional source range for the method name
and for the method declaration excluding its javadoc comment.
-
Use start and length arrays rather than proliferate API methods for additional
source ranges.
Unicode Escapes
-
Original source text might contain Unicode escapes (JLS 3.2, 3.3).
-
E.g., void\u0040\u005a(); declares a method named Z.
-
Scanner removes all Unicode escapes and returns a Unicode token stream.
-
Newly created AST nodes are "post" Unicode escaping.
-
Output options:
-
Preserve existing Unicode escapes (default); remove all existing Unicode
escapes.
-
Do not introduce Unicode escapes (default); introduce Unicode escapes for
characters in a specified set (e.g., all non-ASCII).
-
Initial implementation: support default behavior only.
Comments
-
Comments are problematic for ASTs; these lexical items are normally filtered
out of token stream.
-
Comments are significant to user.
-
Editing an existing compilation unit should generally preserve existing
comments.
-
Should be able to include comments for newly created subtrees.
-
Copying a subtree from one place to another should include relevant comments.
-
Most common forms of comments:
-
Javadoc comments - on one or more lines preceding field, method, and type
declarations.
-
Boilerplace comments (copyright notices) - one or more lines preceding
the package declaration, or between the package declaration and first import
or type declaration.
-
Statement comments - one or more lines between statements in a block.
-
Statement end-of-line comments.
-
VA/ST experience: not worth bending over backwards to accomodate all comments.
-
Determined clients can rescan original string to get at all comments.
-
Expose high value comments:
-
Javadoc comments - treat as attribute of the field, method, and type declarations.
-
Clients can extract Javadoc attributes (including @deprecated).
-
Clients can create declarations with Javadoc.
-
Statement comments within blocks
-
Approach 1: Treat as pseudo-statements with a special AST node type.
-
Pro: Clients can include comments with blocks.
-
Con: Only works for comments within genuine blocks. E.g., can't handle
-
if (test)
-
// should not happen
-
throw new RuntimeException();
-
Would work better if we were using statement lists in more places.
-
Approach 2: Treat as a property of following statment node.
-
Pro: Clients can include comments before any statement.
-
Con: Does not handle trailing comments in blocks. E.g.,
-
{
-
throw new RuntimeException();
-
// can't reach here
-
}
-
Recommend approach 2 since it covers most cases.
-
Boilerplate comments would not be exposed, but would be preserved through
edit and output.
Whitespace
-
Whitespace (JLS 3.6) includes ASCII SP, HT, and FF characters, and line
terminators.
-
Like comments, whitespace is significant to user.
-
Editing an existing compilation unit should generally preserve whitespace.
-
Whitespace for newly created subtrees automatically generated to produce
output that follows common conventions and blends in with surrounding text
(use the same leading whitespace).
-
Copying a subtree from one place to another should should generally preserve
whitespace.
AST Parent Backpointer
-
Each AST node will carry a backpointer to its parent node.
-
ASTNode.getParent() returns ASTNode
-
This permits clients to traverse ASTs in upward as well as downward direction.
-
Bidirectional links must be maintained during editing.
-
Deletion API must unlink child from parent.
-
Insertion API must link child to parent.
-
To preserve treeness, automatically clone child subtree if child already
has parent.
-
Replace API must unlink old child before inserting new child.
-
Parent backlinks means that hanging on to any node in an AST instance
will prevent any part of the AST instance from being garbage collected.
Multiple ASTs
-
Muliple ASTs can exist side by side (and ASTs are potentially for same
compilation unit).
-
Allow insertion of nodes from one AST into another AST.
-
Automatically clones child subtree (forgetting source positions and binding
decorations).
-
Ensure memory representation of ASTs remain completely independent.
Structural Equality
-
Structural equality predicate on AST nodes.
-
Isomorphic subtrees.
-
Belonging to same or different AST.
-
Considers structural info only; ignores source positions,
bindings, etc.
-
Named something other than "equals" to avoid having
to reimplement hashCode too.
Syntactic Correctness of Parser-built ASTs
-
For ASTs built by a Java parser, there are issues of syntactic correctness.
-
Syntactic correctness is judged by the Syntactic Grammar (as defined in
JLS2 section 2.3).
-
Java parser must guarantee to produce a faithful AST for any syntactically
correct compilation unit.
-
Java parser may also build ASTs for syntactically incorrect compilation
units.
-
Complicant Java compilers must reject syntactically incorrect compilation
units.
-
What principle do we apply to Java parsers and the ASTs they return?
-
Real Java parsers are invariably more liberal than the Syntactic Grammar,
and rely on post-parse checks to report errors for any syntactically incorrect
constructs that makes it past the parser.
-
E.g., conflicting modifiers: public private
-
E.g., field declared with no initializer occurs in an interface
-
E.g., void foo() [];
-
In the current Eclipse compiler, many of these checks are done in the course
of type and name resolution. If client only wants AST, we want to avoid
doing expensive name and type analysis.
-
Approach 1: Guarantee that no ASTs are built for syntactically incorrect
compilation units.
-
You do not get an AST at all for compilation units with syntax errors.
-
Pro: Client can trust parser to distinguish syntactically correct from
incorrect.
-
Con: Client cannot manipulate syntactically incorrect compilation units
at all.
-
Con: Requires post-parse check to detect residual syntax errors.
-
Approach 2: Provide no guarantee about the ASTs for syntactically incorrect
compilation units.
-
You might not get a useful AST at all.
-
You might get an AST that had pieces missing; e.g., a malformed method
was excised
-
You might get an AST that is incoherent or self-contradictory; e.g., a
transient class!?
-
Pro: Maximum flexibility for implementation.
-
Pro: Client can get useful ASTs for some syntactically incorrect programs.
-
Con: Client cannot trust parser to distinguish syntactically correct from
incorrect.
-
Approach 3: Guarantee that the client examining the resulting AST has some
way to determine whether the compilation units is incorrect.
-
Priniciple: Syntactic errors must not be suppressed.
-
AST nodes could carry flags indicating certain syntax problem; e.g., duplicate
modifiers public public
-
A bit on root node could say "unspecified syntax errors".
-
Could be special AST nodes types indicating major problems; e.g., bogus
method body
-
Could be representable configurations of AST node types that are recognizable
as syntactially incorrect; e.g., conflicting modifiers public private;
missing field initializer in interface
-
Pro: Client can trust parser to not hide any syntax errors that are in
the source.
-
Pro: Client can get useful ASTs for syntactically incorrect programs.
-
Con: Client must do extra work to determine whether there are syntax errors.
-
Con: Extra work to include this information if no client really cares about
the difference between syntactically correct and incorrect.
-
The first approach is too harsh. It is much more reasonable, and interesting,
to be able to work with some syntactically incorrect compilation units.
-
The second approach feels reasonable if clients never care whether the
source is syntactically correct or not.
-
The third approach feels reasonable if some clients would care whether
the source is syntactically correct or not.
-
The principle difference between the second and third appoaches is that
the former sanctions quietly suppressing syntax errors whereas the latter
precludes it.
-
The nature of the AST nodes inherently makes room to express a wide range
of syntactically malformed programs.
-
An extra flag per node for "unspecified syntax errors" should cover the
bases.
-
The existing compiler's ASTs already carry enough information to enable
the compiler to do thorough post-parse detecting of residual syntax errors.
-
Therefore the third approach is within easy reach.
-
The third approach gives clients more than the second approach.
-
Recommendation: we adopt the third approach.
Syntactic Correctness of Non-parser-built ASTs
-
ASTs do not just come from a parser.
-
They can be created from scratch.
-
A parser-build AST can be edited.
-
These ASTs will need to be serialized to a source compilation unit (why
else would they exist?).
-
What kinds of support and guarantees are in place to ensure that such a
suitable source compilation unit can be generated?
-
Basic guarantee: any AST that could have come from parsing a syntactically
correct compilation unit will serialize to a compilation unit that is
-
(a) syntactically correct
-
(b) strongly semantically equivalent to the original compilation unit.
-
and possibly (c) normalized; that is, parse(serialize(x)) is isomorphic
to x
-
There are likely many ways to get ASTs that do not correspond to any syntactically
correct compilation unit.
-
E.g., use illegal identifiers ("1abc" or "try" or "//").
-
E.g., use illegal modifier combinations with modifier bit masks.
-
Post-screening the AST for syntactic correctness would be misguided.
-
Should just go ahead and generate the obvious, syntactically incorrect,
compilation unit.
-
More importantly: ensure semantic equivalence.
-
Operator precedence creates issues:
-
E.g., given AST for expression v1*v2, replace v1 node
by expression v1+v3.
-
Naive serialization yields v1+v3*v2 which is not semantically
equivalent to the AST.
-
Result should be (v1+v3)*v2.
-
Parentheses may need to be introduced during serialization.
-
Nested if statement creates issues:
-
E.g., given AST for statement if (a) f(); else g();, replace f();
by if (b) h();
-
Naive serialization yields if (a) if (b) h(); else g();
-
Result should be if (a) if (b) h(); else ; else g();
-
Extra verbiage may need to be introduced during serialization.
Deep Constructs
-
Some programs involve impossibly deep constructs.
-
Multi-line string concatenation expressions are the main offender.
-
For example, "Line 1\\n"+"Line 2\\n"+...+"Line 5000"
-
Runtime stacks blow when recursing over deep ASTs.
-
AST node types should be designed to keep trees reasonably shallow for
reasonably typical programs.
-
Introduce N-ary operator expression node type to deal with multi-line string
concatenation expressions.
-
N.B. Current compiler performs compile-time concatenations during parse
phase to deal with this problem.
Editing Protocol
-
What general form should the editing API take?
-
Setters on receiver to manipulate its children (parent never affected)
-
E.g., whileStatement.setCondition(newExpression)
-
Use null for optional children
-
Treat lists as an array-valued property.
-
E.g., block.getStatements() returns Statement[]
-
E.g., block.setStatements(Statement[] statements)
-
Use empty list for no children (rather than null)
-
Alternative approach for lists: use Collection-like protocol
-
E.g., block.addStatement(pos, newChildStatement)
-
E.g., block.removeStatement(oldChildStatement)
-
Con: Increased number of methods on API; bad when a node type has several
list properties.
-
Alternative approach for delete/replace: use parent backpointers to implement
generic delete and replace operation which affect the receiver's relationship
to its parent
-
E.g., oldChildStatement.delete()
-
Con: semantics of deletion ugly when node occurs outside of any list
User Data Field
-
Each AST node has a user data slot reserved for client use.
-
ASTNode.getClientData() returns Object
-
ASTNode.setClientData(Object data)
-
The initial value is null.
-
Client may use for decorations, or whatever.
-
AST nodes created by parser carry no data initially.
-
AST nodes created explicitly carry no data initially.
-
Even read-only ASTs have read-write data slots.
-
Cloning an AST node creates a new node (does not copy or clone data).
Lists of Members
-
List of field, method, and type members of a type declaration.
-
This list is syntactically and semantically heterogenous.
-
No syntactic constraints on number and order.
-
Order is significant to user.
-
Within field declarations, relative order is semantically significant.
-
Standard practices:
-
Place field declarations before member methods and types.
-
Place types before methods.
-
Option (1): expose separate lists for field, methods, and types.
-
Pro: This is way internal AST works.
-
Pro: Convenient for clients to locate member fields, methods, and types.
-
Con: Not flexible for editing; editing will mangle member order.
-
Option (2): expose a single list of members
-
Pro: parser does not normalize; client controls order of members.
-
Con: More work for clients to locate member fields, methods, and types.
-
Option (3): expose a single list of members, with extra getters for locating
member fields, methods, and types.
-
Pro: Combines advantage of (2) with convenience of (1).
-
Recommended approach: (3).
-
For class declarations, treat initializers and constructors as members.
-
Lump instance and static initializers in with field declarations.
-
Lump constructor declarations in with method declarations.
Serialization
-
Clients of read-write ASTs will generally want to serialize to a Java compilation
unit.
-
Serialization via simple AST tree walk.
-
Straightforward.
-
Introduce line breaks and whitespace to make it look pretty.
-
Or post-process it with the Java formatter.
-
If AST originated by parsing, the result is likely unacceptable to user:
-
Completely reformatted.
-
Constructs are normalized.
-
Some comments may have be lost.
-
Could be provided by API that makes use of regular AST API only.
-
Could be written by clients.
-
Serialization via source reconstruction.
-
Only applicable to ASTs initially constructed by parser.
-
Use source position information in modified AST to reconstruct compilation
unit.
-
Retain passages of original text corresponding to unchanged AST trees.
-
Generates new text only where required.
-
Produce a result that a user will recognize and accept.
-
Preserve formatting wherever possible.
-
Preserve source construct normalization wherever possible.
-
Preserve arbitrarily-placed comments wherever possible.
-
Requires retaining the original compilation unit, and likely recording
additional information in nodes to allow reconstruction.
-
This is the way the current JDOM implementation works.
-
Could be provided by API that has privileged access to AST nodes and parser-recorded
information.
-
Should also return a list of edit instructions so that markers can be adjusted,
etc.
-
Clients would have a hard time doing this themselves.
-
Recommend deferring implementation of serializer
that does source reconstruction.
-
In interim, refactoring can apply edits to original
compilation unit text directly.
Node types
The AST node types are based on the standard grammar for the Java language
given in the JLS2.
Every AST node belongs to a single AST instance. (In DOM terminology,
the AST is the document and the AST nodes are the elements). The AST instance
can serve as a factory for creating new nodes. Nodes point to their owning
AST (fixed for the node's lifetime). The AST points directly to the root
node (a compilation unit).
The AST node types do not define their own notion of equality; they
just inherit the object identity based implementation from Object.
Note: Grammar rules (in comments) are expressed in the Pascal-style
extended BNF used in section 18 of JLS2. We use C# style property
declarations as a convenient abbreviation for a standard matched pair of
get and set methods.
public class AST
public AST();
public property CompilationUnit root;
public void loadFromSource(char[] source);
public void setOptions(...);
public char[] serialize();
public abstract class ASTNode
protected ASTNode(AST ast);
public AST getOwner();
public property int[] startPositions;
public property int[] lengths;
public property boolean isWholeLine;
public property Object clientData;
public ASTNode getParent();
... other protocol common to all AST node types
Names
As explained in JLS2 section 6.5, the grammar does not allow names to be
resolved more finely than the following 6 categories by syntactic means
alone:
PackageName:
Identifier
PackageName . Identifier
TypeName:
Identifier
PackageOrTypeName . Identifier
ExpressionName:
Identifier
AmbiguousName . Identifier
MethodName:
Identifier
AmbiguousName . Identifier
PackageOrTypeName:
Identifier
PackageOrTypeName . Identifier
AmbiguousName:
Identifier
AmbiguousName . Identifier
Given that names cannot be resolved definitively to a package, type,
field, or variable at AST node construction time, an open question is how
much of the categorization that could be done should be reflected in the
AST. More categories means more information flow from the parser to the
AST client; on the other hand, a variety of categories is not necessarily
convenient for clients. For example, in import a.b.c the name
is a TypeName whereas in import a.b.c.* the name a.b.c
is a PackageOrTypeName. If the name category was to be reflected
in the type of the AST nodes, the client would need to know to create the
appropriate type of name nodes when editing the AST.
Proposal: Use two AST node types for names: simple names, and qualified
names. Qualified names are expressed recursively, to facilitate clients
discovering how the qualifier part of a name resolves. Use these for everything
but MethodName; for MethodName, which can appear only
in a method invocation expression, separate the selector identifier from
any preceding qualifier.
(Note: The current internal AST nodes go beyond making the simple/qualified
distinction: they also have simple & qualified type names (classes
SimpleTypeReference
and QualifiedTypeReference) in additional to simple & qualified
named (classes SimpleNameReference and
QualifiedNameReference).)
// Name:
//
SimpleName
//
QualifiedName
// SimpleName:
//
Identifier
// QualifiedName:
//
Name . Identifier
public interface IName // "marker" interface
public IBinding resolvedBinding(); //
optional
public class SimpleName extends ASTNode implements IName, IExpression
public SimpleName(AST ast);
public property char[] identifier;
public class QualifiedName extends ASTNode implements IName, IExpression
public QualifiedName(AST ast);
public property IName qualifier;
public property char[] identifier;
Compilation Units and Major Declarations
// CompilationUnit:
//
[package Identifier { . Identifier } ;
]
//
{ImportDeclaration}
//
{TypeDeclaration | ;}
public class CompilationUnit extends ASTNode
public CompilationUnit(AST ast);
public property Name packageName; // optional
public property ImportDeclaration[] imports;
public property TypeDeclaration[] types;
// ImportDeclaration:
// import
Identifier { . Identifier } [ . *
]
;
public class ImportDeclaration extends ASTNode
public ImportDeclaration(AST ast);
public property Name importName;
public property boolean onDemand;
public IBinding resolveBinding();
// TypeDeclaration:
//
{Modifier} class Identifier
//
[extends Type]
//
[implements Type { , Type}]
//
{ {ClassBodyDeclaration | ; } }
//
{Modifier} interface Identifier
//
[extends Type { , Type}]
//
{ {InterfaceBodyDeclaration | ; } }
// Modifier:
// public
// protected
// private
// static
// abstract
// final
// native
// synchronized
// transient
// volatile
// strictfp
// ClassBodyDeclaration:
//
MethodDeclaration
//
ConstructorDeclaration
//
FieldDeclaration
//
ClassDeclaration
//
TypeDeclaration
//
Initializer
// InterfaceBodyDeclaration:
//
MethodDeclaration
//
FieldDeclaration
//
TypeDeclaration
public class TypeDeclaration extends ASTNode implements IStatement,
IMember
public TypeDeclaration(AST ast);
public property int modifiers;
public property char[] name;
public property Name superclass; // optional
public property Name[] superInterfaces;
public property IMember[] members;
public property char[][] javadocComment; //
optional
// convenience methods
public FieldDeclaration[] getFields; // includes
constants; excludes initializers
public AbstractMethodDeclaration[] getMethods;
// includes constructors
public TypeDeclaration[] getTypes;
public ITypeBinding resolveBinding();
// MethodDeclaration:
//
{Modifier} (Type | void) Identifier (
//
[FormalParameter { , FormalParameter}] )
{[ ]}
//
[throws QualifiedIdentifierList] ( MethodBody | ;
)
// ConstructorDeclaration:
//
{Modifier} Identifier ( [FormalParameter { ,
FormalParameter}] )
//
[throws QualifiedIdentifierList] MethodBody
// FormalParameter:
//
[final] Type Identifier {[ ]}
public abstract class AbstractMethodDeclaration extends ASTNode
implements IMember
protected AbstractMethodDeclaration(AST ast);
public property int modifiers;
public property char[] selector;
public property FormalParameter[] parameters;
public property Name[] thrownExceptions;
public property char[][] javadocComment; //
optional
public property Block body; // optional
public IMethodBinding resolveBinding();
public class MethodDeclaration extends AbstractMethodDeclaration
public MethodDeclaration(AST ast);
public property Type returnType; // includes
void
public class ConstructorDeclaration extends AbstractMethodDeclaration
public ConstructorDeclaration(AST ast);
// FieldDeclaration:
//
{Modifier} Type Identifier {[ ]} [ =
Expression]
//
{ , Identifier {[ ]} [ =
Expression] }
public class FieldDeclaration extends ASTNode implements IMember
public AbstractMethodDeclaration(AST ast);
public property int modifiers;
public property char[] name;
public property Type type;
public property char[][] javadocComment; //
optional
public property IExpression initializer; //
optional
public IFieldBinding resolveBinding();
// Initializer:
//
[static] Block
public final class Initializer extends ASTNode implements IMember
public Initializer(AST ast);
public property int modifiers;
public property Block body;
// LocalVariableDeclaration:
//
[final] Type Identifier {[]} [ =
Expression ]
//
{ , Identifier {[]} [ = Expression]
} ;
public class LocalVariableDeclaration extends ASTNode implements
IStatement
public LocalVariableDeclaration(AST ast);
public property int modifiers;
public property char[] name;
public property Type type;
public property IExpression initializer; //
optional
public ILocalVariableBinding resolveBinding();
Types
The Type node (= TypeReference) represents a reference to a base type,
a named type, or an array thereof.
// Type:
//
(BasicType | TypeName ) {[]}
// BasicType:
// byte
// short
// char
// int
// long
// float
// double
// boolean
public class Type extends ASTNode implements IExpression
public Type (AST ast);
public property int baseType; // either
public property Name typeName; // or
public property int dimensions;
public IBinding resolvedType();
Statements
There is a different AST node type for each different kind of statement.
Use a "marker" interface (IStatement) to bring all constructs
that can appear within a block (nonterminal BlockStatement, which
includes local variable and type declarations).
// Block:
// {
BlockStatement }
// BlockStatement :
// LocalVariableDeclaration
// TypeDeclaration
// [Identifier
:
] Statement
//Statement:
// Block
// if
(Expression ) Statement [else Statement]
// for
( ForInitOpt ; [Expression]
;
ForUpdateOpt ) Statement
// while
( Expression ) Statement
// do
Statement while ( Expression
);
// try
Block [Catches] [ finally Block ]
// switch
( Expression ) { SwitchBlockStatementGroups
}
// synchronized
( Expression ) Block
// return
[Expression] ;
// throw
Expression ;
// break
[Identifier] ;
// continue
[Identifier] ;
// ;
// ExpressionStatement
// Identifier
:
Statement
public interface IStatement // "marker" interface
public class Block extends ASTNode implements IStatement
public Block(AST ast);
public property IStatement[] statements;
public class IfStatement extends ASTNode implements IStatement
public IfStatement(AST ast);
public property IExpression test;
public property IStatement thenPart;
public property IStatement elsePart; //
optional
public class WhileStatement extends ASTNode implements IStatement
...
public class ForStatement extends ASTNode implements IStatement
...
public class DoStatement extends ASTNode implements IStatement
...
public class TryStatement extends ASTNode implements IStatement
...
public class SwitchStatement extends ASTNode implements IStatement
...
public class SynchronizedStatement extends ASTNode implements IStatement
...
public class ReturnStatement extends ASTNode implements IStatement
...
public class ThrowStatement extends ASTNode implements IStatement
...
public class BreakStatement extends ASTNode implements IStatement
...
public class ContinueStatement extends ASTNode implements IStatement
...
public class NullStatement extends ASTNode implements IStatement
...
public class LabeledStatement extends ASTNode implements IStatement
...
public class AssertStatement extends ASTNode implements IStatement
...
Expression Statements
Certain types of expressions can also appear as statements.
The ExpressionStatement node wraps an expression up as a statement. The
source range for the ExpressionStatement includes the trailing semicolon.
public class ExpressionStatement extends ASTNode
implements IStatement
public ExpressionStatement(AST
ast);
public property IExpression
expression;
Expressions
There is a different AST node type for each different kind of expression.
Use a "marker" interface (IExpression) to bring all constructs
that can appear as expressions.
(Many details TBD).
// Expression:
//
Identifier
//
ArrayAllocationExpression
//
StringLiteral
//
FloatingPointLiteral
//
BooleanLiteral
//
CharacterLiteral
//
StringLiteral
//
NullLiteral
//
( Type | void ) . class
//
[ ClassName . ] this
// (
Expression )
//
[ Expression . ] new Type (
//
[ Expression { , Expression } ] ) [ ClassBody
]
//
Expression . Identifier
//
[ ClassName . ] super . Identifier
//
MethodName ( [ Expression { , Expression } ] )
//
Expression . Identifier ( [ Expression { ,
Expression } ] )
//
[ ClassName . ] super . Identifier (
//
[ Expression { , Expression } ] )
//
Expression [ Expression ]
//
Expression InfixOperator Expression
//
Expression instanceof Type
//
Expression PostfixOperator
//
PrefixOperator Expression
// (
Type ) Expression
//
Expression ? Expression : Expression
//
Expression AssignmentOperator Expression
// ArrayInitializer
public interface IExpression // "marker" interface
public IBinding resolvedType(); // optional
// ArrayAllocationExpression:
// new
PrimitiveType [ Expression ] { [
Expression ] } { [ ] }
// new
TypeName [ Expression ] {
[ Expression
]
} { [ ] }
// new
PrimitiveType [ ] { [] } ArrayInitializer
// new
TypeName [ ] { [] } ArrayInitializer
public class ArrayAllocationExpression
extends ASTNode implements IExpression
public ArrayAllocationExpression(AST ast);
public property Type type;
public property Expression[] dimensions; //
optional
public property Expression arrayInitializer;
// optional
public class StringLiteral extends ASTNode implements IExpression
public StringLiteral(AST ast);
public property String value;
public class CastExpression extends ASTNode implements IExpression
public CastExpression(AST ast);
public property Type type;
public property IExpression value;
public class InfixExpression extends ASTNode implements IExpression
public InfixExpression(AST ast);
public property int infixOperator;
public property IExpression leftOperand;
public property IExpression rightOperand;
public property IExpression[] extendedOperands;
// L op R op R2 op R3...
Bindings
The "world of bindings" is an integrated picture of the structure of the
program as seen from the compiler's point of view. The bindings correspond
to named entities (packages, types, fields, methods, local variables).
Clients navigate from AST nodes into the world of bindings to discover
things like:
-
the entity an identifier resolves to
-
the resolved type of an expression node
-
the resolved binding of a declaration node
-
others?
Once in the world of bindings, the client can navigate the web of bindings:
-
from array type to its component type, and vice versa
-
from field or variable to its declared type
-
from method to its parameter and return types
-
from type to its constructors and its declared method, field, and type
members
-
from constructor, method, or field to its declaring type
-
from nested type to its enclosing type
-
from type to declaring package
-
from type to its supertypes (but, significantly, not to its subtypes)
-
directly to the binding for any base type (int, float, char, etc.)
-
directly to the binding for a handful of well-known types (java.lang.Object,
etc.)
Some of the navigations that are not supported (quite intentionally):
-
from package to its (known) types - very expensive
-
from package to one of its types by name - very expensive
-
from type to its (known) subtypes - very expensive
-
from type or method to the local types it encloses - binding for local
types are only of interest to those with the enclosing type's AST in their
hand
-
from method to the variables declared within it - binding for variables
are only of interest to those with the method's AST in their hand
There are no links from the world of bindings back to the world of ASTs.
Other things dealt with in the world of bindings:
-
synthetic entities stemming from default constructors, abstract method
copy-down from interfaces, and inner class emulation
-
missing bindings for entities that are required (mentioned by name) but
were not found
-
type hierachy circularities
-
internal inconsistencies
Other issues:
-
Compile-time-computed values for constants (public static final fields
with compile-time computable values)
Existing Binding classes
To give an idea of the scope of the existing binding infrastructure, below
is a dump of the type hierarchy of the compiler's binding classes from
package rg.eclipse.jdt.internal.compiler.lookup.
public abstract class Binding
public abstract class TypeBinding
public final class ArrayBinding
public final class BaseTypeBinding
public abstract class
ReferenceBinding
public class SourceTypeBinding
public class NestedTypeBinding
public final class LocalTypeBinding
public final class MemberTypeBinding
public class ProblemReferenceBinding
public class UnresolvedReferenceBinding
public class PackageBinding
public class ProblemPackageBinding
public abstract class VariableBinding
public class LocalVariableBinding
public class SyntheticArgumentBinding
public class FieldBinding
public class SyntheticFieldBinding
public class ProblemFieldBinding
public class MethodBinding
public class ProblemMethodBinding
public class SyntheticAccessMethodBinding
public class ImportBinding
public class ProblemBinding
Binding API
The existing binding classes are not immediately suitable for exposing
as a binding API.
However, the Java builder does have an API for the built "image", in
package org.eclipse.jdt.internal.core.builder. (This API is a
hold-over from Leapfrog era, and is not exposed in the Eclipse code base).
This API was designed to expose the same kind of integrated picture of
the structure of the program as seen from the compiler's point of view.
This API has a detailed specification that does not expose implementation
details, so the proposal is to use it as the basis for the new binding
API.
Re-purposing this API would entail:
-
introducing entities for local variables
-
removing protocol for navigations that are not supported (e.g., from package
to its known types)
-
removing unneeded protocol; including states, non-state-specific handles,
deltas, report cards, dependency graph, package references
Below is a dump of the relevant interfaces from package org.eclipse.jdt.internal.core.builder.
Unnecessary protocol has been omitted. (Note that NotPresentException is
an unchecked exception, and would not be required.)
public interface IHandle
int K_JAVA_IMAGE = 1;
int K_JAVA_PACKAGE = 2;
int K_JAVA_TYPE = 3;
int K_JAVA_FIELD = 4;
int K_JAVA_METHOD = 5;
int K_JAVA_CONSTRUCTOR = 6;
boolean equals(Object obj);
int hashCode();
boolean isFictional() throws NotPresentException;
boolean isPresent();
int kind();
public interface IMember extends IHandle
IType getDeclaringClass();
int getModifiers() throws NotPresentException;
String getName();
boolean isBinary() throws NotPresentException;
boolean isDeprecated() throws NotPresentException;
boolean isSynthetic() throws NotPresentException;
public interface IPackage extends IHandle
boolean equals(Object obj);
IType getClassHandle(String name);
String getName();
boolean isUnnamed();
public interface IType extends IMember
boolean equals(Object obj);
IType getArrayHandle();
IType getComponentType();
IConstructor getConstructorHandle(IType[] parameterTypes);
IType[] getDeclaredClasses() throws NotPresentException;
IConstructor[] getDeclaredConstructors() throws
NotPresentException;
IField[] getDeclaredFields() throws NotPresentException;
IMethod[] getDeclaredMethods() throws NotPresentException;
int getDeclaredModifiers() throws NotPresentException;
String getDeclaredName() throws NotPresentException;
IType getDeclaringClass() throws NotPresentException;
IField getFieldHandle(String name);
IType[] getInterfaces() throws NotPresentException;
IMethod getMethodHandle(String name, IType[]
parameterTypes);
int getModifiers() throws NotPresentException;
String getName();
IPackage getPackage();
String getSimpleName();
IType getSuperclass() throws NotPresentException;
boolean isAnonymous() throws NotPresentException;
boolean isArray();
boolean isBinary() throws NotPresentException;
boolean isClass() throws NotPresentException;
boolean isDeprecated() throws NotPresentException;
boolean isInnerClass() throws NotPresentException;
boolean isInterface() throws NotPresentException;
boolean isLocal() throws NotPresentException;
boolean isPackageMember() throws NotPresentException;
boolean isPresent();
boolean isPrimitive();
boolean isSynthetic() throws NotPresentException;
boolean isTopLevel() throws NotPresentException;
public interface IMethod extends IMember
boolean equals(Object obj);
IType[] getExceptionTypes() throws NotPresentException;
IType[] getParameterTypes();
IType getReturnType() throws NotPresentException;
boolean isPresent();
public interface IConstructor extends IMember
boolean equals(Object obj);
IType[] getExceptionTypes() throws NotPresentException;
IType[] getParameterTypes();
boolean isPresent();
public interface IField extends IMember
boolean equals(Object obj);
IType getType() throws NotPresentException;
boolean isPresent();
In this vein, the interface for local variables would look something
like:
public interface IVariable extends IHandle
boolean equals(Object obj);
IType getDeclaringClass();
int getModifiers() throws NotPresentException;
String getName();
boolean isSynthetic() throws NotPresentException;
IType getType() throws NotPresentException;
boolean isPresent();
Also will need to add:
-
Pseudo-bindings for base types: boolean, int, float, etc.
-
Access to well-known java.lang bindings: Object, String, Throwable, Exception,
RuntimeException, Error, Class.
Document History
18:30 Thursday September 27, 2001 - incorporated first round comments from
PM and DB.
10:45 Monday October 1, 2001 - incorporated comments
from DB.
10:45 Tuesday October 2, 2001 - clarify handing
of ExpressionStatement.
14:00 Friday October 26, 2001 - add subtree structural
equality.