
Currently, the character encoding for source files needs to be configured individually for each and every plugin that processes source files. In this context, source file refers to some plain text file that - unlike an XML file - lacks intrinsic means to specify the employed file encoding. Java source files are the most prominent example of such text files. Velocity templates, BeanShell scripts and APT documents are further examples. This proposal does not apply to XML files as their encoding can be determined from the file itself, see XML encoding for further information.

Life would become easier if there was a dedicated POM element, accessible through an expression like ${project.build.sourceEncoding}, which could be used to specify the encoding once for the entire project. Every plugin could then use it as the default value of its encoding parameter.

Adding this element to the POM structure can only happen in Maven 3.x (tracked as issue MNG-2216).

For Maven 2.x, the value can be defined as an equivalent property, i.e. a property named project.build.sourceEncoding in the properties section of the POM.

Thus plugins could immediately be modified to use the ${project.build.sourceEncoding} expression, whichever Maven version is used.

Motivation

Why bother with file encoding at all? Well, a file encoding (aka charset) is required to solve the following discrepancy: A file stored on disk or transmitted via network is merely a stream of bytes/octets. In contrast, text is a stream of characters. However, a character is not a byte.

To further illustrate this, just consider the Unicode standard chosen for a Java String. Unicode defines more than 65,000 characters, which obviously cannot be mapped to a single byte each. Hence, one needs a reversible transformation that defines how to map a character to bytes and vice versa. This transformation is called a file/character encoding.

Now, there are different encodings, each potentially yielding different bytes for the same character. For example, the common encoding ASCII maps the character 'A' to the byte with the hex code 0x41. The same character is mapped to the byte 0xC1 when using the encoding EBCDIC. Another example is the character 'ü' (small letter u with umlaut), which maps to the single byte 0xFC when using ISO-8859-1 but to the two-byte sequence 0xC3 0xBC when using UTF-8.
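The byte values quoted above can be checked with a few lines of Java. This is only a minimal sketch: the class name EncodingBytesDemo and its hex helper are made up for illustration, and it assumes Java 7 or later for StandardCharsets. EBCDIC is left out because the corresponding charsets are not guaranteed to be present in every JRE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Maven or any plugin.
final class EncodingBytesDemo {

    /** Returns the bytes that text yields under charset, formatted like "0xC3 0xBC". */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so that bytes >= 0x80 are not printed as negative numbers.
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The letter u with umlaut is written as a Unicode escape so that this
        // source file does not depend on its own file encoding.
        String uUmlaut = "\u00FC";
        System.out.println(hex("A", StandardCharsets.US_ASCII));        // 0x41
        System.out.println(hex(uUmlaut, StandardCharsets.ISO_8859_1));  // 0xFC
        System.out.println(hex(uUmlaut, StandardCharsets.UTF_8));       // 0xC3 0xBC
    }
}
```

Note that the same character yields byte sequences of different length, which is why the encoding has to be known before the bytes can be interpreted.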

It should be clear by now that encoding a character with one encoding and later on decoding it with a different encoding can corrupt the character. To avoid such errors, it is crucial that all developers of a project have agreed to use the same encoding when editing the project sources and running the build.
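To make the corruption concrete, the following sketch encodes a string as UTF-8 and then decodes the bytes as ISO-8859-1. The class and method names are hypothetical and exist only for this illustration; Java 7 or later is assumed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Maven or any plugin.
final class MismatchedDecodingDemo {

    /** Encodes text as UTF-8, then (wrongly) decodes the bytes as ISO-8859-1. */
    static String encodeUtf8DecodeLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Encodes and decodes with the same charset, which is lossless for UTF-8. */
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00FC"; // small letter u with umlaut
        System.out.println(original.equals(roundTripUtf8(original)));          // true
        System.out.println(original.equals(encodeUtf8DecodeLatin1(original))); // false
        // The two UTF-8 bytes 0xC3 0xBC are read as two separate Latin-1 characters.
        System.out.println(encodeUtf8DecodeLatin1(original).length());         // 2
    }
}
```

The same effect occurs in a build whenever one tool writes a file with one encoding and another tool, or another developer's machine, reads it with a different one.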

Default Value

As shown by a user poll on the mailing list and the numerous comments on this article, this proposal has been revised: Plugins should use the platform default encoding if no explicit file encoding has been provided in the plugin configuration.

Since usage of the platform encoding yields platform-dependent and hence potentially irreproducible builds, plugins should output a warning to inform the user about this threat, e.g.:

[WARNING] Using platform encoding (Cp1252 actually) to copy filtered resources, i.e. build is platform dependent!

This way, users can smoothly update their POMs to follow best practices.
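The fallback behavior described in this section could look roughly like the sketch below. It is not taken from any actual plugin: the class SourceEncodingResolver, its resolve method, the task parameter and the Consumer used for logging are all assumptions made for illustration. A real plugin would receive the configured value through its encoding parameter and would log through the Maven plugin logger.

```java
import java.nio.charset.Charset;
import java.util.function.Consumer;

// Hypothetical helper, not part of Maven or any plugin.
final class SourceEncodingResolver {

    /**
     * Returns the charset for the configured encoding name. If none is configured,
     * falls back to the platform encoding and reports a warning through warn.
     */
    static Charset resolve(String configured, String task, Consumer<String> warn) {
        if (configured == null || configured.trim().isEmpty()) {
            Charset platform = Charset.defaultCharset();
            warn.accept("Using platform encoding (" + platform.name()
                    + " actually) to " + task + ", i.e. build is platform dependent!");
            return platform;
        }
        // Throws an unchecked exception if the name is illegal or unsupported.
        return Charset.forName(configured.trim());
    }

    public static void main(String[] args) {
        Consumer<String> warn = msg -> System.out.println("[WARNING] " + msg);
        // Explicit configuration: no warning is printed.
        System.out.println(resolve("UTF-8", "copy filtered resources", warn).name());
        // No configuration: a warning is printed and the platform encoding is used.
        System.out.println(resolve(null, "copy filtered resources", warn).name());
    }
}
```

Failing fast on an unknown encoding name, rather than silently falling back, keeps a typo in the POM from turning into a platform-dependent build.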

Code Spots to Review for Proper Encoding Handling

The following classes and/or methods indicate usage of the JVM's default encoding and hence should be reviewed:

  • String(byte[])
  • String.getBytes()
  • FileReader
  • FileWriter
  • PrintWriter(File) (new in JDK 5)
  • PrintWriter(OutputStream)
  • InputStreamReader(InputStream)
  • OutputStreamWriter(OutputStream)
  • ReaderFactory.newPlatformReader()
  • WriterFactory.newPlatformWriter()
  • FileUtils.fileRead(String)
  • FileUtils.fileRead(File)
  • FileUtils.fileWrite(String, String)
  • FileUtils.fileAppend(String, String)
  • IOUtils.toString(InputStream)
  • IOUtils.toString(InputStream, int)
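For the JDK entries in this list, the usual fix is to switch to the overload that takes an explicit encoding. The sketch below shows this for file reading and writing; the class ExplicitEncodingIo and its methods are hypothetical names, and Java 8 or later is assumed. The library methods in the list (ReaderFactory, WriterFactory, FileUtils, IOUtils) are not covered here.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Maven or any plugin.
final class ExplicitEncodingIo {

    /** Writes text with an explicit charset instead of using FileWriter. */
    static void write(File file, String text, Charset charset) throws IOException {
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(file), charset)) {
            writer.write(text);
        }
    }

    /** Reads text with an explicit charset instead of using FileReader. */
    static String read(File file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new FileInputStream(file), charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    /** Writes text to a temporary file and reads it back with the same charset. */
    static String roundTrip(String text, Charset charset) {
        try {
            File file = File.createTempFile("encoding-demo", ".txt");
            try {
                write(file, text, charset);
                return read(file, charset);
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "gr\u00FC\u00DF"; // contains u with umlaut and sharp s
        System.out.println(text.equals(roundTrip(text, StandardCharsets.UTF_8)));    // true
        // US-ASCII cannot represent the two non-ASCII characters, so they are lost.
        System.out.println(text.equals(roundTrip(text, StandardCharsets.US_ASCII))); // false
    }
}
```

The same pattern applies to String(byte[]) and String.getBytes(): both have overloads that accept a charset, as used in the sketch for the round trip.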

Plugins to Modify

Build plugins are highlighted, since the impact of the change is more critical to the built artifact than reporting plugins. 

Affected Apache plugins:

  • maven-changes-plugin (Velocity template for announcement): MCHANGES-71, done in 2.1
  • maven-checkstyle-plugin (source analysis): MCHECKSTYLE-95, done in 2.2
  • maven-compiler-plugin (source processing): MCOMPILER-70, done in 2.1
  • maven-invoker-plugin (BeanShell script evaluation): MINVOKER-30, done in 1.2
  • maven-javadoc-plugin (source processing): MJAVADOC-182, done in 2.5
  • maven-jxr-plugin (source processing): JXR-60, done in 2.2
  • maven-plugin-plugin (Javadoc extraction, Java source generation): MPLUGIN-101, MPLUGIN-100, done in 2.5
  • maven-pmd-plugin (source analysis): MPMD-76, done in 2.4
  • maven-resources-plugin (contents filtering): MRESOURCES-57, done in 2.3
  • maven-site-plugin (APT sources): MSITE-314, done in 2.0-beta-7

Affected Codehaus plugins:

  • findbugs-maven-plugin: (no Jira issue), done in 2.2
  • jalopy-maven-plugin: MOJO-1138, done in 1.0-alpha-2-SNAPSHOT
  • javancss-maven-plugin: MJNCSS-31
  • modello-maven-plugin/modello-core (Java source generation): MODELLO-109, done in 1.0-alpha-19
  • native2ascii-maven-plugin
  • plexus-component-metadata (formerly plexus-maven-plugin) (Javadoc extraction): PLX-371, done in 1.0-beta-3.0.4
  • shitty-maven-plugin (Groovy script evaluation)
  • simian-maven-plugin
  • taglist-maven-plugin (Javadoc extraction): MTAGLIST-27, done in 2.3

References

Please see [0] for the related thread from the mailing list, [1] for some further descriptions and [2] for a similar feature request in JIRA. Also note a related proposal for the output encoding of reports [3].

[0] http://www.nabble.com/POM-Element-for-Source-File-Encoding-to14930345s177.html

[1] http://www.nabble.com/Re%3A-Maven-and-File-Encoding-p16301958s177.html

[2] MNG-2216

[3] Reporting Encoding Configuration


83 Comments

  1. Oh no. Please, please, pretty please don't make the default encoding iso-8859-1.

    You seem to think that builds that rely on platform encoding "are negligible in number", but I disagree. Go to any Japanese software development shop, and their source code includes comments and literal strings in Japanese (I mean, what else do you expect?) Go to any Chinese, Korean, Thai, Vietnamese software shop, and I bet the same is true. Those files are not in iso-8859-1 encoding — they are in the platform default encoding.

    The implication of what you are suggesting is that all the Maven projects used in those places will break when this change gets integrated.

    For a build not to be reproducible, you'd have to have source code that uses characters outside ASCII, and you'd have to have different build machines that use different non-ASCII encodings, like shift-jis and big5. Now that is the situation that is negligible, and in such an environment, a build already breaks with the current version of Maven. So you got the trade-off analysis wrong.

    Once again, please don't make such a huge compatibility breaking change. If you don't believe me, please talk to some Asian developers, who know a thing or two about encoding and character set before making a decision.

  2. The topic was discussed on the Maven dev list: your opinion would have been really useful at that time (no Asian developer replied)...

    Yes, we know that setting a static default encoding will break some builds, whatever the value chosen: no more detection of developers' platform encoding.

    To me, this drawback can be ok because:

    1. the build will break when changing plugin version: the release notes of every plugin incorporating the change can explain it and the corresponding workaround
    2. the workaround is really simple: just a single property to set in the pom, which will be used by each and every plugin without thinking about it any more

    I suppose Asian developers will know what value they'll need to set. Or they can even enforce platform encoding by setting the property to ${file.encoding}.

    Last point: before the change, there were already some plugins with ISO-8859-1 default encoding, which had to be configured if using another encoding, in each plugin. After the change, the configuration is to be done only once.

  3. You are making two mistakes.


    First, you greatly underestimate the number of such projects. That is somewhat understandable because they happen in places you don't see, but that doesn't mean they don't exist (just look at places like SourceForge.jp or independent project site like this.) This affects a lot of projects, because it changes the behavior of the javac plugin, which everyone uses.


    Second, you greatly overestimate the danger to "build reproducibility" of choosing the platform encoding as the default. In fact, I'm a bit puzzled by this, because the main driver for this change is to have a single place to configure encoding for all plugins (which I agree is useful), and I have yet to hear any complaints from actual users here or elsewhere that the choice of platform encoding is hurting them. So where did this idea come from?

    As I wrote in my first post, for such a build reproducibility to become an issue, you first need a project that uses a non-Western encoding (say shift-jis for Japanese) for source files, and then that needs to be built on machines that use another incompatible non-Western encoding (say big5 for Chinese.) This just doesn't happen all that often, for the same reason you don't have Japanese Windows in your office.


    So to sum it up, you are trying to solve a problem that doesn't exist, and end up inflicting pain on everyone using non-Western encodings. Hence my earlier statement that you got the trade-off analysis all wrong.

    But really, I'm just begging you. It appears that the change has not yet made it into the releases. So it's still possible to avert the problem. Please, please change the default back to platform encoding, before it's too late.

  4. I just changed the "are neglectable in number" qualifier to "are not the vast majority": this should help take some heat out of the debate.

    The hard thing is to estimate how platform encoding is a problem or a feature. I'll add a section on this topic, please contribute.

  5. "I have yet to hear any complaints from actual users here or elsewhere that the choice of platform encoding is hurting them. So where did this idea come from?"

    I am not saying that there is a majority of people complaining (otherwise the POM would have been designed differently right from the start). The issue is really subtle because most communities can get away with the platform default encoding. Furthermore, this issue has much of this "works for me" attitude: It is usually outsiders/minorities that have to suffer. So here are two JIRA tickets I can quickly offer to demonstrate actual reported needs: MTAGLIST-27, MANTTASKS-14.

    "for such a build reproducibility to become an issue ... This just doesn't happen all that often"

    I define correctness (including correct build output) as something like "works always" and not as "works quite often". Is this attitude wrong?

    "you are trying to solve a problem that doesn't exist"

    I can understand your arguments about the number of affected projects (although I had never thought that Asians do not lock down the employed file encoding) but I absolutely disagree about the problem being "non-existent". Among other things, a reproducible build should enable everybody in the world to check out a project from SCM and invoke "mvn <goal>" to successfully build a project regardless

    • of his operating system and
    • of his locale

    Now, if I were to check out one of those projects that entirely rely on the platform's default encoding and are developed using an Asian encoding, would I be able to build that on my Western machine? That's just what I consider a bad state that should be fixed. One of Java's dreams is platform-independence and isn't that a nice goal for the build, too? In particular, a build tool like Maven that explicitly promotes best practices should not encourage platform-dependence, IMHO.

    Platform encoding might have worked well in former days when we all worked behind closed walls, but things are becoming more international now and collaborative development requires explicit project conventions about source encoding. Having said that, I am relieved to see that the project Seasar you mentioned in your post has already locked down file encoding to UTF-8.

  6. "I have yet to hear any complaints from actual users here or elsewhere that the choice of platform encoding is hurting them. So where did this idea come from?"

    Also, we might just ask Hervé why he isn't spelling his own name correctly in the class javadoc of the code he contributes but is using the unaccented form "Herve". More than a decade after the invention of Unicode, I personally have no understanding why people are required to garble their names.

  7. +1 on having unified encoding setting. It's obviously a good thing.

    -100 on the proposed default value. It breaks a lot of existing builds. Please use ${file.encoding} instead.

    If you want to encourage platform-independent builds, add a WARN message which tells users that Maven is running with platform encoding. I don't think we need to introduce a backward-incompatible change for it.

  8. Alright, I just started a poll on the user list to verify your rating.

  9. Benjamin,

    Can you accept votes in this page's comment area? That would be easier for people who want to vote but are not on the Maven users list. Most Japanese Maven users are not on the list (we are discussing a non-English problem on an English-only list!).

  10. "Can you accept votes in this page's comment area?"

    Yes, of course, comments reported here will count. To account for your objections was exactly the reason why I started an (apparently vivid) discussion on the user mailing list.

  11. +1 for Kousuke's proposal. 

    Use the platform encoding and keep the current behavior for compatibility. Stop breaking a lot of Japanese Maven projects.

  12. +1 on (a).

    To avoid breaking builds, I hope you keep using the platform encoding.

  13. +1 on (a).

    I hope you keep using the platform encoding.

  14. +1 on (a).

    For compatibility.

  15. +1 on (a)

    It's an important option for us in Asian countries. Asian users might not send you feedback, but they depend on your great artifact, Maven.

  16. +1 on (a).

    IMHO, compatibility is sometimes more important than other reasons. 

  17. Today, nearly every program uses platform encoding (current codepage on Windows, current locale on UNIX variants).
    Changing this behavior is really strange, and sounds very old-fashioned (like programs from the pre-1990 era).
    So,
    +1 for a)

  18. +1 on (a).

    Please consider countries using multibyte languages.

  19. +1 on (a) proposal.

    Think about the world with multibyte langs, and compatibility!!

  20. I propose the default value is UTF-8, because UTF-8 is suitable for every platform.

  21. +1 on (a).

    I hope you keep using the platform encoding.

  22. Never make iso-8859-1 a default, please. Possibly UTF-8, but compatibility should be considered with priority.

  23. +1 for Kousuke's proposal.

    For Japanese developers, a default value of ISO-8859-1 will cause so many problems.


  24. +1 on (a) .

    It is necessary to abolish ISO-8859-[.].

    ISO-8859-[.] has little power of expression, and will not be what is wanted in the future.

    Persisting with ISO-8859-[.] obstructs the development of software.

  25. Alright, alright, I guess it's clear by now (wink)

    I just closed the poll on the user list and will now update the proposal to use the platform encoding.

  26. We are glad you have accepted our opinion.

    For developers using languages other than English, it was a really serious problem (sad)