Currently, the character encoding for source files needs to be configured individually for every plugin that processes source files. In this context, a source file is a plain text file that, unlike an XML file, lacks an intrinsic means to specify its encoding. Java source files are the most prominent example of such text files. Velocity templates, BeanShell scripts and APT documents are further examples.

Life would become easier if there were a dedicated POM element, addressable as ${project.build.sourceEncoding}, which could be used to specify the encoding once for the entire project. Every plugin could use it as the default value of its own encoding parameter.
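As a rough sketch, a plugin parameter that picks up the project-wide value could look like the following, using the javadoc-tag annotations of Maven 2 plugins. The class name and the ${encoding} user property are assumptions made for illustration. A real plugin would extend org.apache.maven.plugin.AbstractMojo, which is left out here so the sketch compiles without dependencies.

```java
/**
 * Hypothetical mojo-like class (name invented for this sketch).
 * A real plugin would extend org.apache.maven.plugin.AbstractMojo.
 */
final class ExampleSourceProcessor {

    /**
     * The encoding of the source files. Maven injects the value at runtime;
     * the "encoding" user property is an assumption, the default value is
     * the expression proposed on this page.
     *
     * @parameter expression="${encoding}" default-value="${project.build.sourceEncoding}"
     */
    private String encoding;

    /** Returns the injected encoding, or null when nothing was configured. */
    String getEncoding() {
        return encoding;
    }
}
```

Outside of Maven nothing injects the field, so it stays null until a plugin-side fallback supplies a value.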

Adding this element to the POM structure (a sourceEncoding element under build, as the expression suggests) can only happen in Maven 2.1.

For Maven 2.0.x, the value can be defined as an equivalent property named project.build.sourceEncoding in the properties section of the POM.

Thus plugins could immediately be modified to use the ${project.build.sourceEncoding} expression, whatever Maven version is used.

Motivation

Why bother with file encoding at all? Well, a file encoding (aka charset) is required to solve the following discrepancy: A file stored on disk or transmitted via network is merely a stream of bytes/octets. In contrast, text is a stream of characters. However, a character is not a byte.

To further illustrate this, just consider the Unicode standard chosen for a Java String. Unicode defines more than 65,000 characters, which obviously cannot be mapped to a single byte each. Hence, one needs a reversible transformation that defines how to map a character to bytes and vice versa. This transformation is called a file/character encoding.

Now, there are different encodings, each potentially yielding different bytes for the same character. For example, the common encoding ASCII maps the character 'A' to the byte with the hex code 0x41. The same character is mapped to the byte 0xC1 when using the encoding EBCDIC. Another example is the character 'ü' (small letter u with umlaut), which maps to the single byte 0xFC when using ISO-8859-1 but to the two-byte sequence 0xC3 0xBC when using UTF-8.
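The byte values mentioned above can be checked with a few lines of Java. This is only a small sketch, and the class and method names are made up for illustration. EBCDIC is left out because a matching charset is not guaranteed to be present on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingBytes {

    /** Encodes the text with the given charset and renders the bytes as hex. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to an unsigned value so that 0xFC is not printed as a negative number.
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00FC is the character 'ü'; the escape keeps this file pure ASCII.
        System.out.println(hex("A", StandardCharsets.US_ASCII));        // 0x41
        System.out.println(hex("\u00FC", StandardCharsets.ISO_8859_1)); // 0xFC
        System.out.println(hex("\u00FC", StandardCharsets.UTF_8));      // 0xC3 0xBC
    }
}
```

Writing 'ü' as a Unicode escape avoids the very problem this page describes: the example compiles identically regardless of the encoding the compiler assumes for the file.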

It should be clear by now that encoding a character with one encoding and later on decoding it with a different encoding can corrupt the character. To avoid such errors, it is crucial that all developers of a project have agreed to use the same encoding when editing the project sources and running the build.
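A minimal sketch of such a corruption, with illustrative names: the text is written as UTF-8 by one developer and read back as ISO-8859-1 by another.

```java
import java.nio.charset.StandardCharsets;

final class EncodingMismatch {

    /** Encodes with UTF-8, then (wrongly) decodes the same bytes as ISO-8859-1. */
    static String writeUtf8ReadLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00FC"; // 'ü'
        String decoded = writeUtf8ReadLatin1(original);
        // The two UTF-8 bytes 0xC3 0xBC are decoded as two separate Latin-1 characters.
        System.out.println(decoded.length());         // 2
        System.out.println(decoded.equals(original)); // false
        // Plain ASCII text survives, which is why such bugs often go unnoticed.
        System.out.println(writeUtf8ReadLatin1("A").equals("A")); // true
    }
}
```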

Default Value

Without a default value for the source encoding, each local machine's detected platform encoding is used, which is not ideal for build reproducibility. Setting a static default value consistently across every Maven plugin will improve build reproducibility.

Note: who is affected? Teams with a non-uniform configuration know they have to explicitly choose an encoding when using non-ASCII characters. A "non-uniform configuration" comes either from people in multiple countries using different character sets, or from people working on different operating systems (Unix systems tend to use UTF-8 as platform encoding, whereas Windows stays with some sort of ISO-like regional encoding). For teams with a uniform configuration, there is no immediate problem with using the local machine's detected platform encoding: it "simply works". Problems arise when someone with a different configuration tries to build and the detected encoding of that machine differs from the (implicit) expected one. This is the so-called "build reproducibility" problem.

Proposed default value: ISO-8859-1, which must be supported by every JVM (see java.nio.charset.Charset) and is already the default value for some plugins (the majority of plugins use the platform encoding as default value instead).
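The difference between the machine-dependent platform encoding and a fixed default can be inspected with a short program. This is a sketch with an invented class name and constant, not code taken from any plugin.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PlatformVersusFixed {

    /** The default proposed on this page; every JVM is required to support it. */
    static final Charset PROPOSED_DEFAULT = StandardCharsets.ISO_8859_1;

    public static void main(String[] args) {
        // Varies from machine to machine, hence a poor basis for reproducible builds.
        System.out.println("platform encoding: " + Charset.defaultCharset().name());
        // Identical everywhere.
        System.out.println("proposed default:  " + PROPOSED_DEFAULT.name());
        System.out.println("supported:         " + Charset.isSupported("ISO-8859-1"));
    }
}
```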

Note: Using a fixed default value for the encoding instead of the locally detected platform encoding will break builds that rely on a local machine's platform encoding other than the proposed Latin-1 but did not lock this down in the POM. This problem is of limited impact for reporting plugins (it will "only" lead to some garbage in reports), but will really break produced artifacts when build plugins are involved (compiler, resources, modello, plugin, invoker, shitty).
It is assumed that those builds:

  1. are not the vast majority
  2. are easy to fix by setting the new property (the platform encoding has been added to the "mvn -v" output to help choose the value, see MNG-3509)

As such, the general benefit of out-of-the-box reproducibility outweighs the cost of breaking those builds.

Note 2: a less proactive strategy would have been to leave the default encoding at the platform encoding, but to provide an enforcer rule that helps developers detect this as a weak point and encourages them to set an explicit fixed value in their build.

Every plugin has to code a check that falls back to the default value when no encoding has been configured.
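Such a check might be sketched as follows. The helper class, its method and the constant are hypothetical names chosen for this example; an actual plugin would more likely apply the same test directly to its encoding parameter at the start of its execution.

```java
final class SourceEncoding {

    /** The default value proposed on this page. */
    static final String DEFAULT_ENCODING = "ISO-8859-1";

    /**
     * Returns the configured encoding, or the fixed default when the
     * parameter was left unset (null) or is blank.
     */
    static String effectiveEncoding(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return DEFAULT_ENCODING;
        }
        return configured;
    }

    public static void main(String[] args) {
        System.out.println(effectiveEncoding(null));    // ISO-8859-1
        System.out.println(effectiveEncoding("UTF-8")); // UTF-8
    }
}
```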

This default value can also be coded in the POM model for 2.1.x (as the default value of the encoding attribute) and in the super POM for Maven 2.0.x. This change is only for clarity, however: without it, the check coded in every plugin will still turn a null value into the chosen default value.

Code Spots to Review for Proper Encoding Handling

The following classes and/or methods indicate usage of the JVM's default encoding and hence should be reviewed:

  • FileReader
  • FileWriter
  • InputStreamReader(InputStream)
  • OutputStreamWriter(OutputStream)
  • ReaderFactory.newPlatformReader()
  • WriterFactory.newPlatformWriter()
  • FileUtils.fileRead(String)
  • FileUtils.fileRead(File)
  • FileUtils.fileWrite(String, String)
  • FileUtils.fileAppend(String, String)
  • IOUtils.toString(InputStream)
  • IOUtils.toString(InputStream, int)
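For the JDK classes in this list, the usual remedy is to switch to the overloads that take an explicit charset. The sketch below shows one way to do this for reading and writing files. The class and method names are illustrative only, and the try-with-resources syntax assumes a more recent Java version than the ones in use when Maven 2.0.x was released.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

final class ExplicitEncodingIo {

    /** Reads the whole file, decoding with the named encoding instead of the JVM default. */
    static String read(File file, String encoding) throws IOException {
        // new FileReader(file) would silently use the platform encoding here.
        try (Reader reader = new InputStreamReader(
                new FileInputStream(file), Charset.forName(encoding))) {
            StringBuilder sb = new StringBuilder();
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
            return sb.toString();
        }
    }

    /** Writes the text, encoding with the named encoding instead of the JVM default. */
    static void write(File file, String text, String encoding) throws IOException {
        // new FileWriter(file) would silently use the platform encoding here.
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file), Charset.forName(encoding))) {
            writer.write(text);
        }
    }
}
```

Passing the value resolved from ${project.build.sourceEncoding} as the encoding argument makes the result independent of the machine that runs the build.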

Plugins to Modify

Build plugins (compiler, resources, modello, plugin, invoker, shitty) deserve the most attention, since the impact of the change is more critical to the built artifact than it is for reporting plugins.

Affected Apache plugins:

  • maven-changes-plugin (velocity template processing): MCHANGES-71
  • maven-compiler-plugin (source processing): MCOMPILER-70, done in 2.1-SNAPSHOT
  • maven-invoker-plugin (beanshell script evaluation): MINVOKER-30, done in 1.2-SNAPSHOT
  • maven-javadoc-plugin (source processing): MJAVADOC-182, done in 2.5-SNAPSHOT
  • maven-jxr-plugin (source processing): JXR-60, done in 2.2-SNAPSHOT
  • maven-plugin-plugin (javadoc extraction, java source generation): MPLUGIN-101, MPLUGIN-100
  • maven-pmd-plugin (source analysis): MPMD-76, done in 2.4-SNAPSHOT
  • maven-resources-plugin (contents filtering): MRESOURCES-57, done in 2.3-SNAPSHOT
  • maven-site-plugin (apt sources): MSITE-314, done in 2.0-beta-7-SNAPSHOT

Affected Codehaus plugins:

  • modello-maven-plugin/modello-core (java source generation)
  • plexus-maven-plugin (javadoc extraction)
  • shitty-maven-plugin (groovy script evaluation)
  • taglist-maven-plugin (javadoc extraction)

References

Please see [0] for the related thread from the mailing list, [1] for some further descriptions and [2] for a similar feature request in JIRA.

[0] http://www.nabble.com/POM-Element-for-Source-File-Encoding-to14930345s177.html

[1] http://www.nabble.com/Re%3A-Maven-and-File-Encoding-p16301958s177.html

[2] MNG-2216
