Chapter 3

Lexical Structure

The organization of this chapter parallels the chapter on Lexical Structure in the Java Language Specification (second edition), which begins as follows:

This chapter specifies the lexical structure of the Java programming language.

Programs are written in Unicode (§3.1, JLS), but lexical translations are provided (§3.2, JLS) so that Unicode escapes (§3.3, JLS) can be used to include any Unicode character using only ASCII characters. Line terminators are defined (§3.4, JLS) to support the different conventions of existing host systems while maintaining consistent line numbers.

The Unicode characters resulting from the lexical translations are reduced to a sequence of input elements (§3.5, JLS), which are white space (§3.6, JLS), comments (§3.7, JLS), and tokens. The tokens are the identifiers (§3.8, JLS), keywords (§3.9, JLS), literals (§3.10, JLS), separators (§3.11, JLS), and operators (§3.12, JLS) of the syntactic grammar.

3.1 Unicode

(Cf. JLS. §3.1.)

TO DO

3.2 Lexical Translations

(Cf. JLS. §3.2.)

TO DO

3.3 Unicode Escapes

(Cf. JLS. §3.3.)

TO DO

3.4 Line Terminators

(Cf. JLS. §3.4.)

In Groovy, many line terminators are syntactically significant. As a result, the lexical grammar differs from Java's. Line terminators are not part of whitespace, but (outside of comments and strings) they are classified as Newline tokens.

ISSUE: Shall we define the significance of newlines in the token stream, or operationally in the grammar? In the token stream, we can say they are insignificant after separators and operators. In the grammar, we can just swallow optional newlines inside certain productions, as in Expr: Expr Op LineTerminator* Expr.

Provisionally, do it in the grammar, as /bin/sh does. This lets us keep "x++" as a complete statement. But we also want to ignore newlines nested inside parentheses, which means we want an inherited grammar attribute for newline suppression inside parens and brackets.

3.5 Input Elements and Tokens

(Cf. JLS. §3.5.)

Add Token: SignificantNewline.

3.6 White Space

(Cf. JLS. §3.6.)

Remove WhiteSpace: LineTerminator, since significant newlines are tokens in their own right.

A significant newline is a token which in Java would be a group of whitespace tokens, but which contains at least one uncommented line terminator. In an end-of-line comment, the terminating newline is counted as uncommented. (Newlines in traditional C-style comments are not significant.)

This grammar implies that one or more consecutive uncommented line terminators, possibly separated by whitespace and comments, count as a single significant newline token.

Generally speaking, significant newline tokens are equivalent to the semicolon separator token, wherever the latter is acceptable. Unlike Java (but like Pascal and the scripting languages sh and awk), Groovy treats both semicolons and significant newlines as statement separators, not terminators. A statement just before an enclosing right bracket is therefore complete with or without a final semicolon or newline.

The grammar is organized so that significant newline tokens are ignored after prefix and infix operator tokens. They are also ignored if they occur directly within round or square brackets, but not directly within curly brackets. These rules provide for easy continuation of long statements or expressions onto multiple lines, without a need to explicitly escape the intermediate line terminators.
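The rules above can be illustrated with a short script. This is a sketch only: it assumes the behavior of released Groovy versions, which may not match this draft in every detail, and all variable names are made up for illustration.

```groovy
// Semicolons and newlines both separate statements.
def a = 1; def b = 2
def c = 3

// A newline after an infix operator does not end the statement.
def sum = a +
          b +
          c

// Newlines directly inside round or square brackets are ignored.
def list = [
    a,
    b,
    c
]
def largest = Math.max(
    a,
    c
)

// The last statement before a closing curly bracket needs no separator.
def twice = { n -> n * 2 }

assert sum == 6
assert list == [1, 2, 3]
assert largest == 3
assert twice(4) == 8
```

Note that the continuation lines end with the operator. A line that begins with an infix operator would follow a significant newline and so would start a new statement under these rules.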

3.7 Comments

(Cf. JLS. §3.7.)

TO DO

3.8 Identifiers

(Cf. JLS. §3.8.)

Groovy identifiers differ from Java identifiers in that the ASCII dollar character '$' is not a legal identifier character. This restriction applies in practice only to the spelling of unqualified names, since Groovy provides a way to use any Unicode string whatever as a member name or command name.

(The dollar sign is sometimes used internally by Groovy to mangle non-Java identifiers which must be converted to Java names. For this reason, it would be confusing to allow unescaped dollar signs as Groovy identifier constituents.)
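As a sketch of the "any Unicode string as a member name" escape hatch, the following assumes that the quoted-member form `object."name"` of released Groovy versions is the mechanism intended here. Released versions do not necessarily enforce this draft's dollar-sign rule, so the sketch avoids bare identifiers containing '$'. The map and its keys are hypothetical.

```groovy
def settings = [:]

// A quoted member name may contain characters that are illegal in identifiers.
settings."max-size" = 10

// Single quotes keep the dollar sign from being read as a substitution.
settings.'$cost' = 5

assert settings."max-size" == 10
assert settings['$cost'] == 5
assert settings.keySet() as List == ['max-size', '$cost']
```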

3.9 Keywords

(Cf. JLS. §3.9.)

The following words are keywords in Groovy but not in Java:

def

map

any

TO DO

The following words are keywords in Java and Groovy, but are currently illegal in Groovy:

do

strictfp

const

goto

3.10 Literals

(Cf. JLS. §3.10.)

TO DO

3.10.1 Integer Literals

(Cf. JLS. §3.10.1.)

The production IntegerTypeSuffix: g G is added, allowing BigInteger constants.
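A minimal sketch of the suffix in use, assuming the behavior of released Groovy versions (variable names are illustrative):

```groovy
def small = 42G
def huge = 123456789012345678901234567890g   // far beyond the range of long

assert small instanceof BigInteger
assert huge instanceof BigInteger
assert huge * 10G == 1234567890123456789012345678900G
```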

ISSUE: 123i allowed? Other literal syntaxes?

3.10.2 Floating-Point Literals

(Cf. JLS. §3.10.2.)

The production FloatTypeSuffix: g G is added, allowing BigDecimal constants.
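A minimal sketch of the same suffix on a literal with a fractional part, again assuming the behavior of released Groovy versions (variable names are illustrative):

```groovy
def rate = 1.5G
def tiny = 0.001g

assert rate instanceof BigDecimal
assert tiny instanceof BigDecimal

// BigDecimal arithmetic is exact, unlike binary floating point.
assert rate + 0.25G == 1.75G
```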

3.10.3 Boolean Literals

(Cf. JLS. §3.10.3.)

(No change.)

3.10.4 Character Literals

(Cf. JLS. §3.10.4.)

Groovy has no CharacterLiteral token. All literals with character data in them denote strings. Constant strings of unit length serve in the place of character literals, since they coerce properly to character constants.
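The coercion can be sketched as follows, assuming the behavior of released Groovy versions. A quoted single character is a String until a char type is requested, either by a typed declaration or by an explicit `as char`.

```groovy
def s = 'A'        // a String of unit length, not a char
char c = 'A'       // coerced to a character by the declared type

assert s instanceof String
assert c instanceof Character
assert c == ('A' as char)
assert Character.isUpperCase('A' as char)
```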

3.10.5 String Literals

(Cf. JLS. §3.10.5.)

Groovy string literals have a syntax inspired by other scripting languages. A string literal may be delimited by either single or double quotes. Double-quoted string literals may incorporate substring substitution expressions, while single-quoted string literals are always constants.

If a double-quoted string contains an unescaped dollar sign, it is more properly called a string constructor, since it evaluates to a non-constant string, whose contents depend on expressions following the dollar signs.

Independently, the quote marks may be tripled, allowing the string to span multiple lines. If the quote marks are used singly, the string may not contain a line terminator.

Regardless of the spelling of a LineTerminator found inside a string literal or constructor, it is always equivalent to an escaped newline '\n'.
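The quoting rules above can be sketched as follows, assuming the behavior of released Groovy versions, where a string constructor evaluates to a `GString`. Variable names are illustrative.

```groovy
def name = 'world'

def constant = 'hello $name'   // single quotes: always a constant
def built = "hello $name"      // double quotes: a string constructor

assert constant == 'hello $name'
assert built == 'hello world'
assert built instanceof GString

// Tripled quote marks allow the string to span lines.
def multi = """line one
line two"""
def plain = '''first
second'''

// The embedded line terminator reads as '\n' however the source file spells it.
assert multi == 'line one\nline two'
assert plain == 'first\nsecond'
```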

ISSUE: Why use curly braces? Makes for irregular variant of block syntax, forcing us to specify exceptions in various places. We should use round braces instead. Fundamentally, GString parameters are expressions, not blocks, and their syntax should reflect this.

String constructors are recognized lexically as a complex of GStringSeparators and other tokens, according to the grammar of GStringLexicalForm. After whitespace is removed, the resulting token sequence is parsed according to this syntax:

Identifiers and dots after a dollar sign are parsed eagerly, even if some of their characters could also be validly parsed as string characters.
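A sketch of eager parsing, assuming the `$name` and `${...}` forms of released Groovy versions (the brace form is the one questioned in the ISSUE above). The map `user` and the variable `file` are hypothetical.

```groovy
def user = [name: 'Ada']
def file = 'report'

// The dotted path after the dollar sign is consumed as far as it can go,
// so this reads the member `name` and then appends "!".
assert "$user.name!" == 'Ada!'

// Braces end the expression explicitly. Without them, "$file.txt" would be
// read as a reference to a member `txt` of `file`.
assert "${file}.txt" == 'report.txt'
```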

Reference: http://archive.groovy.codehaus.org/jsr/threads/iakbeiefedohmiddhked

3.10.6 Escape Sequences for Character and String Literals

(Cf. JLS. §3.10.6.)

(Same as in Java.)

In double-quoted strings, the escape sequence \$ is legal, and stands for the ASCII dollar character. (The dollar does not introduce a GStringValue.)
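A short sketch of the escape, assuming the behavior of released Groovy versions (variable names are illustrative):

```groovy
def price = 5

def literal = "cost: \$price"     // escaped dollar: no substitution occurs
def mixed = "cost: \$${price}"    // escaped dollar followed by a real substitution

assert literal == 'cost: $price'
assert literal instanceof String
assert mixed == 'cost: $5'
```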

3.10.7 The Null Literal

(Cf. JLS. §3.10.7.)

TO DO

3.11 Separators

(Cf. JLS. §3.11.)

TO DO

3.12 Operators

(Cf. JLS. §3.12.)

TO DO


The original of this specification is at http://docs.codehaus.org/display/GroovyJSR.
