POM Element for Source File Encoding

Currently, the character encoding for source files needs to be configured individually for each and every plugin that processes source files. In this context, a source file is a plain text file that, unlike an XML file, lacks intrinsic means to declare its own encoding. Java source files are the most prominent example of such text files. Velocity templates, BeanShell scripts and APT documents are further examples.

Life would become easier if there were a dedicated POM element, referenced as ${project.build.sourceEncoding}, which could be used to specify the encoding once for the entire project. Every plugin could use it as its default value:

/**
 * @parameter expression="${encoding}" default-value="${project.build.sourceEncoding}"
 */
private String encoding;

Adding this element to the POM structure can only happen in Maven 3.x:

<project>
  ...
  <build>
    <!-- NOTE: This is just a vision for the future, it's not yet implemented -->
    <sourceEncoding>UTF-8</sourceEncoding>
    ...
  </build>
  ...
</project>

For Maven 2.x, the value can be defined as an equivalent property:

<project>
  ...
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    ...
  </properties>
  ...
</project>

Thus plugins could immediately be modified to use the ${project.build.sourceEncoding} expression, whichever Maven version is used.

Motivation

Why bother with file encoding at all? Well, a file encoding (aka charset) is required to solve the following discrepancy: A file stored on disk or transmitted via network is merely a stream of bytes/octets. In contrast, text is a stream of characters. However, a character is not a byte.

To further illustrate this, just consider the Unicode standard chosen for a Java String. Unicode defines more than 65,000 characters, which obviously cannot each be mapped to a single byte. Hence, one needs a reversible transformation that defines how to map characters to bytes and vice versa. This transformation is called a file or character encoding.

Now, there are different encodings, each potentially yielding different bytes for the same character. For example, the common encoding ASCII will map the character 'A' to the byte with the hex code 0x41. The same character is mapped to the byte 0xC1 when using the encoding EBCDIC. Another example is the character 'ü' (small letter u with umlaut) that maps to the single byte 0xFC when using ISO-8859-1 but maps to the two byte sequence 0xC3 0xBC when using UTF-8.
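The byte values quoted above can be checked with a few lines of Java. This is only a minimal sketch: the class name EncodingBytesDemo and the helper toHex are made up for illustration, and the code assumes Java 7 or later for StandardCharsets. EBCDIC is left out because its charsets are not guaranteed to be available on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Maven or any plugin.
public class EncodingBytesDemo {

    /** Formats the bytes of text in the given charset as hex, e.g. "0xC3 0xBC". */
    public static String toHex(String text, Charset charset) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            // Mask to 0..255 so that bytes >= 0x80 are not printed as negative ints.
            hex.append(String.format("0x%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // The umlaut is written as a Unicode escape so that this file itself
        // does not depend on the encoding it is saved in.
        String umlaut = "\u00FC";
        System.out.println(toHex("A", StandardCharsets.US_ASCII));      // 0x41
        System.out.println(toHex(umlaut, StandardCharsets.ISO_8859_1)); // 0xFC
        System.out.println(toHex(umlaut, StandardCharsets.UTF_8));      // 0xC3 0xBC
    }
}
```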

It should be clear by now that encoding a character with one encoding and later decoding it with a different encoding can corrupt the character. To avoid such errors, it is crucial that all developers of a project agree on the same encoding for editing the project sources and running the build.
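The corruption is easy to reproduce. The following sketch (class and method names are hypothetical, Java 7 or later assumed) writes the character 'ü' as UTF-8 and then reads the bytes back as ISO-8859-1, roughly what happens when one developer saves a file as UTF-8 and another builds it with an ISO-8859-1 setting.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Maven or any plugin.
public class EncodingMismatchDemo {

    /** Encodes text as UTF-8, then (wrongly) decodes the bytes as ISO-8859-1. */
    public static String writeUtf8ReadLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Encodes and decodes with the same encoding, which is lossless. */
    public static String writeUtf8ReadUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00FC"; // small letter u with umlaut

        String garbled = writeUtf8ReadLatin1(original);
        // The two UTF-8 bytes 0xC3 0xBC are each turned into one character.
        System.out.println(garbled.length());                 // 2
        System.out.println(garbled.equals("\u00C3\u00BC"));   // true

        System.out.println(writeUtf8ReadUtf8(original).equals(original)); // true
    }
}
```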

Default Value

Following a user poll on the mailing list and the numerous comments on this article, this proposal has been revised: Plugins should use the platform default encoding if no explicit file encoding has been provided in the plugin configuration.

Since usage of the platform encoding yields platform-dependent and hence potentially irreproducible builds, plugins should output a warning to inform the user about this threat. This way, users can smoothly update their POMs to follow best practices.
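The fallback rule could be sketched as follows. The class SourceEncodingResolver, its method, and the wording of the warning are assumptions made for this illustration, not existing Maven API. The sketch assumes Java 8 or later for Consumer; inside a mojo, the warning callback would typically delegate to getLog().warn(...).

```java
import java.nio.charset.Charset;
import java.util.function.Consumer;

// Hypothetical helper, not part of Maven or any plugin.
public class SourceEncodingResolver {

    /**
     * Returns the configured encoding if there is one; otherwise falls back
     * to the platform default and reports a warning through the callback.
     */
    public static String resolve(String configuredEncoding, Consumer<String> warn) {
        if (configuredEncoding != null && !configuredEncoding.trim().isEmpty()) {
            return configuredEncoding.trim();
        }
        String platformEncoding = Charset.defaultCharset().name();
        warn.accept("File encoding has not been set, using platform encoding "
                + platformEncoding + ", i.e. the build is platform dependent!");
        return platformEncoding;
    }

    public static void main(String[] args) {
        // Explicit value: returned as-is, no warning.
        System.out.println(resolve("UTF-8", System.err::println));
        // No value: platform default plus a warning on stderr.
        System.out.println(resolve(null, System.err::println));
    }
}
```

Keeping the warning in a callback keeps the helper independent of any particular logging API.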

Code Spots to Review for Proper Encoding Handling

The following classes and/or methods indicate usage of the JVM's default encoding and hence should be reviewed:

  • String(byte[])
  • String.getBytes()
  • FileReader
  • FileWriter
  • PrintWriter(File) (new in JDK 5)
  • PrintWriter(OutputStream)
  • InputStreamReader(InputStream)
  • OutputStreamWriter(OutputStream)
  • ReaderFactory.newPlatformReader()
  • WriterFactory.newPlatformWriter()
  • FileUtils.fileRead(String)
  • FileUtils.fileRead(File)
  • FileUtils.fileWrite(String, String)
  • FileUtils.fileAppend(String, String)
  • IOUtils.toString(InputStream)
  • IOUtils.toString(InputStream, int)
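For the JDK classes in this list, the usual fix is to switch to an overload or wrapper that takes the encoding explicitly. The sketch below uses only JDK classes and assumes Java 7 or later for try-with-resources. The class ExplicitEncodingIO is hypothetical, and the encoding argument stands for whatever value the plugin has resolved.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper, not part of Maven or any plugin.
public class ExplicitEncodingIO {

    /** Replacement for new FileWriter(file), which uses the JVM default encoding. */
    public static void write(File file, String text, String encoding) throws IOException {
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(file), encoding)) {
            writer.write(text);
        }
    }

    /** Replacement for new FileReader(file), which uses the JVM default encoding. */
    public static String read(File file, String encoding) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new FileInputStream(file), encoding)) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("encoding-demo", ".txt");
        file.deleteOnExit();

        String umlaut = "\u00FC";
        write(file, umlaut, "UTF-8");
        System.out.println(file.length());                      // 2 (bytes 0xC3 0xBC)
        System.out.println(read(file, "UTF-8").equals(umlaut)); // true

        // The in-memory equivalents of String(byte[]) and String.getBytes()
        // are the overloads that take a charset name.
        byte[] bytes = umlaut.getBytes("ISO-8859-1");
        System.out.println(new String(bytes, "ISO-8859-1").equals(umlaut)); // true
    }
}
```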

Plugins to Modify

Build plugins are highlighted, since the impact of the change is more critical to the built artifact than reporting plugins. 

Affected Apache plugins:

  • maven-changes-plugin (velocity template for announcement): MCHANGES-71, done in 2.1
  • maven-checkstyle-plugin (source analysis): MCHECKSTYLE-95, done in 2.2
  • maven-compiler-plugin (source processing): MCOMPILER-70, done in 2.1-SNAPSHOT
  • maven-invoker-plugin (beanshell script evaluation): MINVOKER-30, done in 1.2
  • maven-javadoc-plugin (source processing): MJAVADOC-182, done in 2.5
  • maven-jxr-plugin (source processing): JXR-60, done in 2.2-SNAPSHOT
  • maven-plugin-plugin (javadoc extraction, java source generation): MPLUGIN-101, MPLUGIN-100, done in 2.5
  • maven-pmd-plugin (source analysis): MPMD-76, done in 2.4
  • maven-resources-plugin (contents filtering): MRESOURCES-57, done in 2.3
  • maven-site-plugin (apt sources): MSITE-314, done in 2.0-beta-7

Affected Codehaus plugins:

  • jalopy-maven-plugin: MOJO-1138, done in 1.0-alpha-2-SNAPSHOT
  • javancss-maven-plugin: MJNCSS-31
  • modello-maven-plugin/modello-core (java source generation): MODELLO-109, done in 1.0-alpha-19
  • native2ascii-maven-plugin
  • plexus-component-metadata (formerly plexus-maven-plugin) (javadoc extraction): PLX-371, done in 1.0-beta-3.0.4
  • shitty-maven-plugin (groovy script evaluation)
  • simian-maven-plugin
  • taglist-maven-plugin (javadoc extraction): MTAGLIST-27, done in 2.3

References

Please see [0] for the related thread from the mailing list, [1] for some further descriptions and [2] for a similar feature request in JIRA. Also note a related proposal for the output encoding of reports [3].

[0] http://www.nabble.com/POM-Element-for-Source-File-Encoding-to14930345s177.html

[1] http://www.nabble.com/Re%3A-Maven-and-File-Encoding-p16301958s177.html

[2] MNG-2216

[3] Reporting Encoding Configuration

  1. Apr 25, 2008

    Kohsuke Kawaguchi says:

    Oh no. Please, please, pretty please don't make the default encoding iso-8859-1.

    You seem to think that builds that rely on platform encoding "are negligible in number", but I disagree. Go to any Japanese software development shop, and their source code includes comments and literal strings in Japanese (I mean, what else do you expect?) Go to any Chinese, Korean, Thai, Vietnamese software shop, and I bet the same is true. Those files are not in iso-8859-1 encoding — they are in the platform default encoding.

    The implication of what you are suggesting is that all the Maven projects used in those places will break when this change gets integrated.

    For a build not to be reproducible, you'd have to have source code that uses characters outside ASCII, and you'd have to have different build machines that use different non-ASCII encoding, like shift-jis and big5. Now that is the situation negligible, and in such an environment, a build already breaks with the current version of Maven. So you got the trade-off analysis wrong.

    Once again, please don't make such a huge compatibility breaking change. If you don't believe me, please talk to some Asian developers, who know a thing or two about encoding and character set before making a decision.

  2. Apr 25, 2008

    Hervé Boutemy says:

    The topic was discussed on the Maven dev list: your opinion would have been really useful at that time (no Asian developer replied)...

    Yes, we know that setting a static default encoding will break some builds, whatever the value chosen: no more detection of developers' platform encoding.

    To me, this drawback can be ok because:

    1. the build will break when changing plugin version: release notes of every plugin incorporating the change can explain it and corresponding workaround
    2. the workaround is really simple: just a single property to set in the pom, which will be used by each and every plugin without thinking at it any more

    I suppose Asian developers will know what value they'll need to set. Or they can even enforce platform encoding with

    <project.build.sourceEncoding>${file.encoding}</project.build.sourceEncoding>

    Last point: before the change, there were already some plugins with ISO-8859-1 default encoding, which had to be configured if using another encoding, in each plugin. After the change, the configuration is to be done only once.

  3. Apr 25, 2008

    Kohsuke Kawaguchi says:

    You are making two mistakes.


    First, you greatly underestimate the number of such projects. That is somewhat understandable because they happen in places you don't see, but that doesn't mean they don't exist (just look at places like SourceForge.jp or independent project site like this.) This affects a lot of projects, because it changes the behavior of the javac plugin, which everyone uses.


    Second, you greatly overestimate the danger of "build reproducibility" of choosing the platform encoding as the default. In fact, I'm bit puzzled by this, because the main driver for this change is to have a single place to configure encoding for all plugins (which I agree as useful), and I have yet to hear any complaints from actual users here or elsewhere that the choice of platform encoding is hurting them. So where did this idea come from?

    As I wrote in my first post, for such a build reproducibility to become an issue, you first need a project that uses a non-Western encoding (say shift-jis for Japanese) for source files, and then that needs to be built on machines that use another incompatible non-Western encoding (say big5 for Chinese.) This just doesn't happen all that often, for the same reason you don't have Japanese Windows in your office.


    So to sum it up, you are trying to solve a problem that doesn't exist, and ends up inflicting a pain for everyone in non-Western encodings. Hence my earlier statement that you got the trade-off analysis all wrong.

    But really, I'm just begging you. It appears that the change has not yet made it into the releases. So it's still possible to avert the problem. Please, please change the default back to platform encoding, before it's too late.

  4. Apr 26, 2008

    Hervé Boutemy says:

    I just changed "are neglectable in number" qualifier to "are not the vast majority": this will help depassionate the debate.

    The hard thing is to estimate how platform encoding is a problem or a feature. I'll add a section on this topic, please contribute.

  5. Apr 27, 2008

    Benjamin Bentmann says:

    "I have yet to hear any complaints from actual users here or elsewhere that the choice of platform encoding is hurting them. So where did this idea come from?"

    I am not saying that there is a majority of people complaining (otherwise the POM would have been designed differently right from the start). The issue is really subtle because most communities can come away with the platform default encoding. Furthermore, this issue has much of this "works for me" attitude: It are usually outsiders/minorities that have to suffer. So here are two JIRA tickets I can quickly offer to demonstrate actual reported needs: MTAGLIST-27, MANTTASKS-14.

    "for such a build reproducibility to become an issue [...] This just doesn't happen all that often"

    I define correctness (including correct build output) as something like "works always" and not as "works quite often". Is this attitude wrong?

    "you are trying to solve a problem that doesn't exist"

    I can understand your arguments about the number of affected projects (although I had never thought that Asians do not lock down the employed file encoding) but I absolutely disagree about the problem being "non-existent". Among others, a reproducible build should enable everybody on the world to checkout a project from SCM and invoke "mvn <goal>" to successfully build a project regardless

    • of his operating system and
    • of his locale

    Now, if I were to checkout one of those projects that entirely rely on the platform's default encoding and are developed using an Asian encoding, would I be able to build that on my Western machine? That's just what I consider a bad state that should be fixed. One of Java's dreams is platform-independence and isn't that a nice goal for the build, too? In particular, a build tool like Maven that explicitly promotes best practices should not encourage platform-dependence, IMHO.

    Platform encoding might have worked well in the former days when we all worked behind closed walls but things are moving more international now and collaborative development requires explicit project conventions about source encoding. Having said that, I am eased to see that the project Seasar you mentioned in your post already locked down file encoding to UTF-8.

  6. Apr 27, 2008

    Benjamin Bentmann says:

    "I have yet to hear any complaints from actual users here or elsewhere that the choice of platform encoding is hurting them. So where did this idea come from?"

    Also, we might just ask Hervé why he isn't spelling his own name correctly in the class javadoc of the code he contributes but is using the unaccented form "Herve". More than a decade after the invention of Unicode, I personally have no understanding why people are required to garble their names.

  7. Apr 28, 2008

    Takayoshi Kimura says:

    +1 on having unified encoding setting. It's obviously a good thing.

    -100 on the proposed default value. It breaks a lot of existing builds. Please use ${file.encoding} instead.

    If you want to encourage platform independent build, add a WARN message which tells users maven is running with platform encoding. I don't think we need to introduce backward incompatibility change for it.

  8. Apr 29, 2008

    Benjamin Bentmann says:

    Alright, I just started a poll on the user list to verify your rating.

  9. Apr 29, 2008

    Takayoshi Kimura says:

    Benjamin,

     Can you accept votes on this page comment area? This way is easier for people who want to vote but not on the Maven users list. Most of Japanese maven users are not on the list (we are discussing other-than-English problem on English only list!).

  10. Apr 29, 2008

    Benjamin Bentmann says:

    "Can you accept votes on this page comment area?"

    Yes, of course, comments reported here will count. To account for your objections was exactly the reason why I started an (apparently vivid) discussion on the user mailing list.

  11. Apr 29, 2008

    Koichi Kobayashi says:

    +1 on (a).

  12. Apr 29, 2008

    Takashi Okamoto says:

    +1 for Kousuke's proposal. 

    Use the platform encoding and keep the current behavior for compatibility. Stop breaking a lot of Japanese Maven projects.

  13. Apr 29, 2008

    Yohji Nihonyanagi says:

    +1 on (a).

  14. Apr 29, 2008

    Hiroyuki Oonaka says:

    +1 on (a).

    To avoid breaking build, I hope keeping using platform encoding.

  15. Apr 29, 2008

    yone098 says:

    +1 on (a)

  16. Apr 29, 2008

    cactusman says:

    +1 on (a).

  17. Apr 29, 2008

    Koji Suga says:

    +1 on (a).

  18. Apr 29, 2008

    Kenichi Dewa says:

    +1 on (a).

  19. Apr 29, 2008

    masanobuimai says:

    +1 on (a).

  20. Apr 29, 2008

    Takuto Wada says:

    +1 on (a).

  21. Apr 29, 2008

    MIYAMOTO Daisuke says:

    +1 on (a)

  22. Apr 29, 2008

    Osamu Goto says:

    +1 on (a).

  23. Apr 29, 2008

    leecom says:

    +1 on (a).

  24. Apr 30, 2008

    kazunori satok says:

    +1 on (a).

  25. Apr 30, 2008

    nasobeme says:

    +1 on (a).

  26. Apr 30, 2008

    hiroyuki iwanaga says:

    +1 on (a).

  27. Apr 30, 2008

    HONMA Hirotaka says:

    +1 on (a).

  28. Apr 30, 2008

    Rikiya Yamamoto says:

    +1 on (a).

  29. Apr 30, 2008

    Takeshi Kawajiri says:

    +1 on (a).

  30. Apr 30, 2008

    Mitsuhiro Okamoto says:

    +1 on (a).

  31. Apr 30, 2008

    Shinobu Watanabe says:

    +1 on (a).

    I hope keeping using platform encoding.

  32. Apr 30, 2008

    Takayuki says:

    +1 on (a).

    For compatibility.

  33. Apr 30, 2008

    Ryuzo Yamamoto says:

    +1 on (a).

  34. Apr 30, 2008

    Shigeaki Wakizaka says:

    +1 on (a)

  35. Apr 30, 2008

    takayuki okazaki says:

    +1 on (a)

    This is an important option for us in Asian countries. Asian users might not send you feedback, but they depend upon your great artifact, Maven.

  36. Apr 30, 2008

    Takayuki Kaneko says:

    +1 on (a).

    IMHO, compatibility is sometimes more important than other reasons. 

  37. Apr 30, 2008

    Nobukazu Ishigaki says:

    +1 on (a).

  38. Apr 30, 2008

    SODA Noriyuki says:

    Today, nearly every program uses platform encoding (current codepage on Windows, current locale on UNIX variants).
    Changing this behavior is really strange, and sounds very old-fashioned (programs like pre-1990 age).
    So,
    +1 for a)

  39. Apr 30, 2008

    Tatsuya Shimura says:

    +1 on (a)

  40. Apr 30, 2008

    Tomohito Ozaki says:

    +1 on (a).

    Please consider countries using multibyte language.

  41. Apr 30, 2008

    KATOH Yasufumi says:

    +1 on (a).

  42. Apr 30, 2008

    Shinpei Ohtani says:

    +1 on (a) proposal.

    Think about the world with multibyte langs, and compatibility!!

  43. Apr 30, 2008

    Suetoshi Urabe says:

    +1 on (a)

  44. Apr 30, 2008

    mckenzy says:

    +1 on (a).

  45. Apr 30, 2008

    Masahide Takeda says:

    +1 on (a).

  46. Apr 30, 2008

    nobeans says:

    +1 on (a).

  47. Apr 30, 2008

    kubota keisen says:

    +1 on (a).

  48. Apr 30, 2008

    NONAKA Kimihiro says:

    +1 on (a)

  49. Apr 30, 2008

    Yasuo Higa says:

    I propose the default value is UTF-8, because UTF-8 is suitable for every platform.

  50. Apr 30, 2008

    jkato says:

    +1 on (a)

  51. Apr 30, 2008

    onozaty says:

    +1 on (a)

  52. Apr 30, 2008

    ITO Yoshiichi says:

    +1 on (a).

    I hope keeping using platform encoding.

  53. Apr 30, 2008

    Hiroshi Kajikawa says:

    +1 on (a)

  54. Apr 30, 2008

    Toshiya Kobayashi says:

    +1 on (a)

  55. Apr 30, 2008

    Munenori TAKEI says:

    +1 on (a).

  56. Apr 30, 2008

    Watanabe says:

    +1 on (a)

  57. Apr 30, 2008

    Horiuchi Hiroki says:

    +1 on (a)

  58. Apr 30, 2008

    NISHIMOTO Keisuke says:

    +1 on (a)

  59. Apr 30, 2008

    Jun Funakura says:

    +1 on (a).

  60. Apr 30, 2008

    Shinya Ogino says:

    Never make iso-8859-1 a default, please. Possibly UTF-8, but compatibility should be considered with priority.

  61. May 01, 2008

    IZUNO Tadashi says:

    +1 on (a).

  62. May 01, 2008

    Shinji Ichikawa says:

    +1 on (a).

  63. May 01, 2008

    Satoru Okamoto says:

    +1 on (a).

  64. May 01, 2008

    taichi says:

    +1 on (a).

  65. May 01, 2008

    Takeshi Matsuba says:

    +1 on (a).

  66. May 01, 2008

    ITO Sho says:

    +1 on (a).

  67. May 01, 2008

    hsmt says:

    +1 on (a).

  68. May 01, 2008

    sudo says:

    +1 on (a).

  69. May 01, 2008

    Ryuji Furuya says:

    +1 on (a).

  70. May 01, 2008

    Kanji Yokota says:

    +1 on (a).

  71. May 01, 2008

    Masanobu Shimura says:

    +1 for Kousuke's proposal.

    For japanese developer, default value of ISO-8859-1 will cause so much problems.

  72. May 01, 2008

    UEHARA Junji says:

    +1 on (a).

  73. May 01, 2008

    Takkenoko says:

    +1 on (a).

  74. May 01, 2008

    Mitsutoshi NAKANO says:

    +1 on (a) .

    It is necessary to abolish ISO-8859-[.] .

    ISO-8859-[.] is scarce power of expression, and will not be an intention in the future .

    Persisting in ISO-8859-[.] obstructs the development of software .

  75. May 01, 2008

    iteng says:

    +1 on (a).

  76. May 01, 2008

    Wataru Nakamura says:

    +1 on (a).

  77. May 01, 2008

    Benjamin Bentmann says:

    Alright, alright, I guess it's clear by now

    I just closed the poll on the user list and will now update the proposal to use the platform encoding.

  78. May 01, 2008

    Masakazu Matsushita says:

    We are glad you have accepted our opinion.

    For developers using a language other than English, it was a really serious problem.

  79. May 01, 2008

    Masahiro Nagafusa says:

    +1 on (a).

  80. May 02, 2008

    hajimeni says:

    +1 on (a).

  81. May 02, 2008

    Hirotaka Ueki says:

    +1 on (a).

  82. May 02, 2008

    Tsuyoshi Yamamoto says:

    +1 on (a).

  83. May 06, 2008

    calico catnap says:

    +1 on (a).