
Bytecode Names for Extended Groovy Names

(This is a proposal for discussion, not yet a decision. – JRose)

The names of Groovy classes, methods, fields, and properties can be arbitrary 16-bit unicode strings. This permission allows Groovy objects to directly model structures, such as XML trees or file systems, that contain names which are not valid Java identifiers. Such a permission for "foreign" or "escaped" identifiers has often proven useful in scripting languages which allow them, such as Lisp or the Unix shells.

The syntax for defining and using extended identifiers in Groovy is TBD. The syntax may simply consist of a permission to use a single-quoted string wherever a defined or qualified class or member name is allowed, or wherever a builder element name is allowed. We do not expect to allow local variables, or simple variable references, to have spellings which extend beyond the set of Java identifiers.

This raises the question of how such names are represented in places they might be visible to Java programs. Chiefly, this is in the classfile format itself, colloquially called the "bytecodes". Extended Groovy names are represented in the bytecodes by means of a name mangling scheme described here.

A valid Java identifier that contains no dollar signs '$' is represented directly by itself in the bytecodes, and there is no translation required at any point in the system.
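As a rough illustration of this fast path, the check below tests whether a name is made up only of Java identifier characters and contains no dollar sign. The class and method names are hypothetical, and the test is purely character-based (so a keyword such as 'int' passes); it is a sketch of the rule stated above, not part of any existing Groovy API.

```java
final class PlainNames {
    private PlainNames() {}

    /**
     * Returns true if the name starts with a Java identifier start character,
     * contains only Java identifier part characters, and has no '$'.
     * Such a name would be used unchanged as its own bytecode name.
     */
    static boolean isDollarFreeIdentifier(String name) {
        if (name.isEmpty() || !Character.isJavaIdentifierStart(name.charAt(0))) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            // '$' is a legal identifier character, but it triggers the escape rules.
            if (c == '$' || !Character.isJavaIdentifierPart(c)) {
                return false;
            }
        }
        return true;
    }
}
```

A name that fails this test is not necessarily changed by the mangling; it merely has to go through the escape rules described in the rest of this page.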

The Groovy runtime itself performs no translation of identifiers, but (like the Java Reflection API) uses 16-bit unicode strings uniformly to represent named objects. However, when resolving a unicode string to a Java class or one of its members, a translation takes place, just before querying the Java Reflection API, from the Groovy name to the bytecode name sought reflectively.

We first describe the translation from bytecode names to extended Groovy names (unicode strings), which is a simple substring replacement.

All identifier substrings matching the following grammar for 'unicode_escape' (and only they) are taken to represent, and are replaced by, single unicode characters:

unicode escape grammar

Within the context of a bytecode name, matching of this grammar is greedy, so that a longer matching substring is preferred over a shorter match.

Note that every such substring begins with a dollar '$' followed by a decimal digit. Each one consists of a dollar '$' followed by a hexadecimal numeral, possibly followed by the letter 'X'.

Note also that every 16-bit number has exactly one hexadecimal representation in this grammar. The stop 'X' is used only when the next character is an uppercase hexadecimal digit or an 'X', to prevent mis-parsing. The average length of these substrings will be slightly over 3 characters for ASCII punctuation, and 4-5 characters for extended punctuation.

These strings are called "identifier unicode escapes".

Substrings which contain dollar "$", even when it is followed by a digit, but which do not match the above grammar for identifier unicode escapes, are legal in bytecode names, and are not translated.

An identifier unicode escape directly names a single 16-bit unicode character (or perhaps a 16-bit surrogate code, one half of a supplementary code point). It is an error if the character is in fact a valid Java identifier character (letter or digit or '_' or '$'). The bytecode names for Groovy identifiers never use such codes, even though the identifier unicode escape grammar allows them to be expressed.

A so-called "identifier null escape" in a bytecode name represents a null string in the corresponding Groovy name. It is the substring "$0X". (This will be used at the beginning of a bytecode name, to prevent the name from beginning with a digit, or to interrupt a part of a Groovy name that looks like an escape substring.)

The combined grammar for "identifier escapes" is therefore:

'identifier escape grammar'

A bytecode name is unambiguously mapped to a Groovy name by removing each identifier null escape, and replacing each identifier unicode escape by the corresponding 16-bit unicode character (or surrogate code).
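The decoding direction can be sketched as a single regular-expression pass. The pattern below is an ASSUMED grammar, inferred only from the examples on this page, and should not be read as the grammar of the proposal itself: numerals are uppercase hexadecimal with at least two digits and no redundant leading zeros, a single '0' is prefixed when the first significant digit is a letter, and the stop 'X' is accepted only after numerals with fewer than four significant digits. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BytecodeNameDecoder {
    private BytecodeNameDecoder() {}

    // ASSUMED grammar (not taken from the proposal text):
    //   group 1: null escape "$0X"
    //   group 2: numeral with four significant hex digits, no stop allowed
    //   group 3: shorter numeral, optionally followed by the stop 'X'
    private static final Pattern ESCAPE = Pattern.compile(
        "\\$(?:(0X)"
        + "|(0[A-F][0-9A-F]{3}|[1-9][0-9A-F]{3})"
        + "|(0[A-F][0-9A-F]{1,2}|[1-9][0-9A-F]{1,2}|0[0-9A-F])X?)");

    /** Maps a bytecode name to the Groovy name it represents. */
    static String toGroovyName(String bytecodeName) {
        Matcher m = ESCAPE.matcher(bytecodeName);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(bytecodeName, last, m.start());
            if (m.group(1) == null) {            // unicode escape; a null escape adds nothing
                String numeral = m.group(2) != null ? m.group(2) : m.group(3);
                char c = (char) Integer.parseInt(numeral, 16);
                if (Character.isJavaIdentifierPart(c)) {
                    // The page treats escapes for identifier characters as errors.
                    throw new IllegalArgumentException(
                        "escape $" + numeral + " encodes a Java identifier character");
                }
                out.append(c);
            }
            last = m.end();
        }
        out.append(bytecodeName, last, bytecodeName.length());
        return out.toString();
    }
}
```

Under these assumptions, `toGroovyName("$2AX9")` yields `*9` and `toGroovyName("A$$0X42")` yields `A$42`, while a dollar sign that does not start a match (as in `this$0`) is copied through unchanged. This sketch does not check that the input is the unique canonical spelling of its Groovy name.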

A Groovy name string is mapped back to a bytecode name which represents it by the following steps:

  1. The name string is scanned for any identifier escapes. Each one that is found is interrupted by following its leading dollar '$' by the null escape, as in '$$0X'.
  2. If the name string begins with a Java digit, the null escape '$0X' is prepended.
  3. Every character in the name string which is not a Java identifier character (and every surrogate) is replaced by an identifier unicode escape which contains the hexadecimal numeral of that character (or surrogate).
  4. If there is a choice of two such escapes, the one ending in a stop 'X' is chosen if and only if the following character in the name string is an uppercase hexadecimal digit, or the uppercase letter 'X'.

The resulting string is the bytecode name for the original Groovy name string. It is uniquely determined. (The reverse mapping, from bytecode names to Groovy names, is not one-to-one, because the stop 'X' and the null escape are optional delimiters: several bytecode spellings can decode to the same Groovy name. The forward mapping therefore requires extra care to ensure uniqueness. Failure to attend to this can create bugs in which Groovy methods do not link properly.)
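The four steps above might be sketched as follows. As with any code against this proposal, the escape format is an ASSUMPTION inferred from the examples on this page rather than the proposal's own grammar: uppercase hexadecimal, at least two digits, a '0' prefix when the first digit is a letter, and a "choice of two escapes" only for characters below 0x1000 (numerals with fewer than four significant digits). The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BytecodeNameEncoder {
    private BytecodeNameEncoder() {}

    // ASSUMED: under the inferred grammar, every escape starts with '$' followed
    // by "0X" or by a decimal digit and an uppercase hex digit.
    private static final Pattern ESCAPE_START =
        Pattern.compile("\\$(?=0X|[0-9][0-9A-F])");

    /** Maps a Groovy name to its bytecode name. */
    static String toBytecodeName(String groovyName) {
        // Step 1: interrupt accidental escapes, turning "$42" into "$$0X42".
        String s = ESCAPE_START.matcher(groovyName)
                               .replaceAll(Matcher.quoteReplacement("$$0X"));
        StringBuilder out = new StringBuilder();
        // Step 2: a leading digit is shielded by the null escape.
        if (!s.isEmpty() && Character.isDigit(s.charAt(0))) {
            out.append("$0X");
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isJavaIdentifierPart(c)) {
                out.append(c);
                continue;
            }
            // Step 3: escape non-identifier characters (surrogates included).
            out.append('$').append(numeral(c));
            // Step 4: add the stop 'X' only where a choice exists (assumed: c < 0x1000)
            // and the next character is an uppercase hex digit or 'X'.
            if (c < 0x1000 && i + 1 < s.length() && needsStop(s.charAt(i + 1))) {
                out.append('X');
            }
        }
        return out.toString();
    }

    private static String numeral(char c) {
        String hex = String.format("%02X", (int) c);
        return hex.charAt(0) >= 'A' ? "0" + hex : hex;   // must begin with a decimal digit
    }

    private static boolean needsStop(char next) {
        return (next >= '0' && next <= '9') || (next >= 'A' && next <= 'F') || next == 'X';
    }
}
```

With these assumptions, `toBytecodeName("*9")` gives `$2AX9`, `toBytecodeName("2x4")` gives `$0X2x4`, and `toBytecodeName("<init>")` gives `$3Cinit$3E`. Doing step 1 before step 3 is safe here because the inserted text "$0X" consists only of Java identifier characters and is therefore never escaped again.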

Bytecode names which are not the unique mapping of their corresponding unicode strings are in error. The Groovy system will never ask a classfile for them, and may silently drop them from a reflective query. This applies to bytecode names which accidentally contain null escapes, or unicode escapes which encode valid Java identifier characters.

Note that a Groovy name can be a Groovy or a Java keyword, such as 'int'. Such names are not changed, but rather used directly in the bytecodes.

Note: Java identifier characters are precisely defined as any instance of the production 'JavaLetterOrDigit' in the JLS2 (Java Language Specification, second edition).

It is rare (though not impossible) for a bytecode name from a Java program to contain identifier unicode escapes. The design of these escapes is intended to make rare the need for the null escape.

Examples:

Groovy name     Bytecode name    Notes

foo             (same)
foo12           (same)
foo_bar         (same)
foo$bar         (same)
int             (same)
?????           (same)           (Greek APHTH)
this$0          (same)
A$BA            (same)
A$1             (same)
A$42            A$$0X42
(none)          A$42             (0x42 is 'B')
(none)          x$79z            (0x79 is 'y')
http-equiv      http$2Dequiv
2x4             $0X2x4
A%              A$25
*               $2A
*9              $2AX9
<<              $3C$3C
+=              $2B$3D
©               $0A9             (COPYRIGHT SIGN)
A©A             A$0A9XA
X©X             X$0A9XX
Z©Z             Z$0A9Z
⊗               $2297            (CIRCLED TIMES)
X⊗X             X$2297X
꒿               $0A4BF           (YI RADICAL CIP)

Note that the manglings used for nested classes do not usually produce identifier unicode escapes by accident. Rare examples of such an accident would be 'this$21' and 'Foo$21', which map to Groovy names 'this!' and 'Foo!'.

Note that Groovy, as a language, does not directly recognize internal '<init>' and '<clinit>' methods. An attempt to define or call methods by these names first maps them from Groovy names to bytecode names, which are '$3Cinit$3E' and '$3Cclinit$3E', respectively. These names are not likely to match anything loaded in the VM, unless there is some scheme afoot which maps whole XML elements to Groovy names.

The latest Java VMs allow fairly arbitrary bytecode names for classes, fields, and methods, even though the Java language allows only a limited alphabet for identifiers. For example, a Java VM may well accept classes or methods with names like "int", or "123", or "@&#".

However, in order to support translation from Groovy to Java source code, it is desirable that Groovy names, when they depart from the restrictions in legal Java identifiers, be mapped back to legal Java identifiers.

(Though I'm not married to the above set of patterns, I think they are reasonably easy to parse, compact, and separate from Java usages. – JRose)
