Currently, the character encoding for source files needs to be configured individually for every plugin that processes source files. In this context, a source file is a plain text file that - unlike an XML file - lacks intrinsic means to specify the employed file encoding. Java source files are the most prominent example of such text files. Velocity templates, BeanShell scripts and APT documents are further examples. This proposal does not apply to XML files as their encoding can be determined from the file itself, see XML encoding for further information.
Life would become easier if there were a dedicated POM element, referenced as ${project.build.sourceEncoding}, that specifies the encoding once for the entire project. Every plugin could use it as the default value of its own encoding parameter.
Adding this element to the POM structure can only happen in Maven 3.x (tracked as issue MNG-2216).
For Maven 2.x, the value can be defined as an equivalent property, i.e. a project.build.sourceEncoding entry in the <properties> section of the POM.
Thus plugins could immediately be modified to use the ${project.build.sourceEncoding} expression, whichever Maven version is used.
Motivation
Why bother with file encoding at all? Well, a file encoding (aka charset) is required to solve the following discrepancy: A file stored on disk or transmitted via network is merely a stream of bytes/octets. In contrast, text is a stream of characters. However, a character is not a byte.
To further illustrate this, just consider the Unicode standard chosen for a Java String. Unicode defines more than 65,000 characters, which obviously cannot each be mapped to a single byte. Hence, one needs a reversible transformation that defines how to map a character to bytes and vice versa. This transformation is called a file/character encoding.
Now, there are different encodings, each potentially yielding different bytes for the same character. For example, the common encoding ASCII will map the character 'A' to the byte with the hex code 0x41. The same character is mapped to the byte 0xC1 when using the encoding EBCDIC. Another example is the character 'ü' (small letter u with umlaut) that maps to the single byte 0xFC when using ISO-8859-1 but maps to the two byte sequence 0xC3 0xBC when using UTF-8.
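The byte values quoted above can be checked with a few lines of Java. This is only a minimal sketch: the class and method names (EncodingBytesDemo, hex) are made up for illustration, and it assumes Java 7 or later for StandardCharsets. EBCDIC is left out because the availability of EBCDIC charsets depends on the JRE in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBytesDemo {

    /** Encodes the text with the given charset and renders each byte as hex, e.g. "0xC3 0xBC". */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so that bytes above 0x7F are not printed as negative ints.
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String uUmlaut = "\u00FC"; // small letter u with umlaut, written as an escape on purpose
        System.out.println(hex("A", StandardCharsets.US_ASCII));       // prints 0x41
        System.out.println(hex(uUmlaut, StandardCharsets.ISO_8859_1)); // prints 0xFC
        System.out.println(hex(uUmlaut, StandardCharsets.UTF_8));      // prints 0xC3 0xBC
    }
}
```

The character is written as a Unicode escape so that the demo itself does not depend on the encoding used to compile it.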
It should be clear by now that encoding a character with one encoding and later on decoding it with a different encoding can corrupt the character. To avoid such errors, it is crucial that all developers of a project have agreed to use the same encoding when editing the project sources and running the build.
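The corruption described here is easy to reproduce. The following sketch (hypothetical class and method names, Java 7 or later assumed) writes a character with one encoding and reads it back with another, which is roughly what happens when one developer saves a file as UTF-8 and another builds it with ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    /** Simulates saving text with one encoding and loading it with another. */
    static String transcode(String text, Charset writeCharset, Charset readCharset) {
        byte[] onDisk = text.getBytes(writeCharset); // what ends up in the file
        return new String(onDisk, readCharset);      // what the reader believes it says
    }

    public static void main(String[] args) {
        String original = "\u00FC"; // small letter u with umlaut

        String intact = transcode(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String garbled = transcode(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(original.equals(intact));  // prints true
        System.out.println(original.equals(garbled)); // prints false
        // The two UTF-8 bytes 0xC3 0xBC are decoded as two separate Latin-1 characters.
        System.out.println(garbled.length());         // prints 2
    }
}
```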
Default Value
Following a user poll on the mailing list and the numerous comments on this article, this proposal has been revised: Plugins should use the platform default encoding if no explicit file encoding has been provided in the plugin configuration.
Since usage of the platform encoding yields platform-dependent and hence potentially irreproducible builds, plugins should output a warning to inform the user about this threat, e.g.:
[WARNING] Using platform encoding (Cp1252 actually) to copy filtered resources, i.e. build is platform dependent!
This way, users can smoothly update their POMs to follow best practices.
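One way a plugin might implement this fallback is sketched below. This is not code from any actual Maven plugin: the class SourceEncodingResolver and its methods are hypothetical, and a real mojo would presumably report through its own logger rather than System.err. The warning text merely imitates the example message above.

```java
import java.nio.charset.Charset;

public class SourceEncodingResolver {

    /** Returns true if the user supplied a non-blank encoding name. */
    static boolean isConfigured(String encoding) {
        return encoding != null && !encoding.trim().isEmpty();
    }

    /**
     * Resolves the charset to use for source files: the configured one if present,
     * otherwise the platform default, accompanied by a warning.
     */
    static Charset resolve(String configuredEncoding) {
        if (isConfigured(configuredEncoding)) {
            // Throws an unchecked exception for illegal or unsupported charset names.
            return Charset.forName(configuredEncoding.trim());
        }
        Charset platform = Charset.defaultCharset();
        System.err.println("[WARNING] Using platform encoding (" + platform.name()
                + " actually) to read source files, i.e. build is platform dependent!");
        return platform;
    }

    public static void main(String[] args) {
        System.out.println(resolve("UTF-8")); // prints UTF-8
        System.out.println(resolve(null));    // warns, then prints the platform charset
    }
}
```

Failing fast on an unknown charset name, rather than silently falling back, keeps a typo in the POM from turning into a platform-dependent build.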
Code Spots to Review for Proper Encoding Handling
The following classes and/or methods indicate usage of the JVM's default encoding and hence should be reviewed:
- String(byte[])
- String.getBytes()
- FileReader
- FileWriter
- PrintWriter(File) (new in JDK 5)
- PrintWriter(OutputStream) (new in JDK 5)
- InputStreamReader(InputStream)
- OutputStreamWriter(OutputStream)
- ReaderFactory.newPlatformReader()
- WriterFactory.newPlatformWriter()
- FileUtils.fileRead(String)
- FileUtils.fileRead(File)
- FileUtils.fileWrite(String, String)
- FileUtils.fileAppend(String, String)
- IOUtils.toString(InputStream)
- IOUtils.toString(InputStream, int)
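For the JDK entries in this list, the usual remedy is to switch to the overload or wrapper that takes an explicit charset. The sketch below shows that pattern using only JDK classes; the class and method names are invented for illustration, Java 7 or later is assumed, and it does not cover the plexus-utils helpers listed above.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitEncodingIo {

    /** Replacement for FileWriter: wraps the stream in a writer with an explicit charset. */
    static void write(File file, String text, Charset charset) throws IOException {
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(file), charset)) {
            writer.write(text);
        }
    }

    /** Replacement for FileReader: wraps the stream in a reader with an explicit charset. */
    static String read(File file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new FileInputStream(file), charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    /** Writes text to a temporary file and reads it back, both with the same charset. */
    static String roundTrip(String text, Charset charset) {
        try {
            File tmp = File.createTempFile("encoding-demo", ".txt");
            try {
                write(tmp, text, charset);
                return read(tmp, charset);
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = "gr\u00FC\u00DFen";
        // String.getBytes() and String(byte[]) likewise have charset-taking overloads.
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(text.equals(new String(bytes, StandardCharsets.UTF_8))); // prints true
        System.out.println(text.equals(roundTrip(text, StandardCharsets.UTF_8)));   // prints true
    }
}
```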
Plugins to Modify
Build plugins are highlighted, since the impact of the change is more critical to the built artifact than reporting plugins.
Affected Apache plugins:
- maven-changes-plugin (velocity template for announcement): MCHANGES-71, done in 2.1
- maven-checkstyle-plugin (source analysis): MCHECKSTYLE-95, done in 2.2
- maven-compiler-plugin (source processing): MCOMPILER-70, done in 2.1
- maven-invoker-plugin (beanshell script evaluation): MINVOKER-30, done in 1.2
- maven-javadoc-plugin (source processing): MJAVADOC-182, done in 2.5
- maven-jxr-plugin (source processing): JXR-60, done in 2.2
- maven-plugin-plugin (javadoc extraction, java source generation): MPLUGIN-101, MPLUGIN-100, done in 2.5
- maven-pmd-plugin (source analysis): MPMD-76, done in 2.4
- maven-resources-plugin (contents filtering): MRESOURCES-57, done in 2.3
- maven-site-plugin (apt sources): MSITE-314, done in 2.0-beta-7
Affected Codehaus plugins:
- findbugs-maven-plugin: (no Jira issue), done in 2.2
- jalopy-maven-plugin: MOJO-1138, done in 1.0-alpha-2-SNAPSHOT
- javancss-maven-plugin: MJNCSS-31
- modello-maven-plugin/modello-core (java source generation): MODELLO-109, done in 1.0-alpha-19
- native2ascii-maven-plugin
- plexus-component-metadata (formerly plexus-maven-plugin) (javadoc extraction): PLX-371, done in 1.0-beta-3.0.4
- shitty-maven-plugin (groovy script evaluation)
- simian-maven-plugin
- taglist-maven-plugin (javadoc extraction): MTAGLIST-27, done in 2.3
References
Please see [0] for the related thread from the mailing list, [1] for some further descriptions and [2] for a similar feature request in JIRA. Also note a related proposal for the output encoding of reports [3].
[0] http://www.nabble.com/POM-Element-for-Source-File-Encoding-to14930345s177.html
[1] http://www.nabble.com/Re%3A-Maven-and-File-Encoding-p16301958s177.html
[2] MNG-2216

83 Comments
Apr 25, 2008
Kohsuke Kawaguchi
Oh no. Please, please, pretty please don't make the default encoding iso-8859-1.
You seem to think that builds that rely on platform encoding "are negligible in number", but I disagree. Go to any Japanese software development shop, and their source code includes comments and literal strings in Japanese (I mean, what else do you expect?) Go to any Chinese, Korean, Thai, Vietnamese software shop, and I bet the same is true. Those files are not in iso-8859-1 encoding — they are in the platform default encoding.
The implication of what you are suggesting is that all the Maven projects used in those places will break when this change gets integrated.
For a build not to be reproducible, you'd have to have source code that uses characters outside ASCII, and you'd have to have different build machines that use different non-ASCII encodings, like shift-jis and big5. Now that is the negligible situation, and in such an environment, a build already breaks with the current version of Maven. So you got the trade-off analysis wrong.
Once again, please don't make such a huge compatibility-breaking change. If you don't believe me, please talk to some Asian developers, who know a thing or two about encodings and character sets, before making a decision.
Apr 25, 2008
Hervé Boutemy
The topic was discussed on the Maven dev list: your opinion would have been really useful at that time (no Asian developer replied)...
Yes, we know that setting a static default encoding will break some builds, whatever the value chosen: no more detection of developers' platform encoding.
To me, this drawback can be ok because:
I suppose Asian developers will know what value they'll need to set. Or they can even enforce platform encoding with
Last point: before the change, there were already some plugins with ISO-8859-1 default encoding, which had to be configured if using another encoding, in each plugin. After the change, the configuration is to be done only once.
Apr 25, 2008
Kohsuke Kawaguchi
You are making two mistakes.
First, you greatly underestimate the number of such projects. That is somewhat understandable because they happen in places you don't see, but that doesn't mean they don't exist (just look at places like SourceForge.jp or independent project sites like this.) This affects a lot of projects, because it changes the behavior of the javac plugin, which everyone uses.
Second, you greatly overestimate the danger to "build reproducibility" of choosing the platform encoding as the default. In fact, I'm a bit puzzled by this, because the main driver for this change is to have a single place to configure encoding for all plugins (which I agree is useful), and I have yet to hear any complaints from actual users here or elsewhere that the choice of platform encoding is hurting them. So where did this idea come from?
As I wrote in my first post, for such a build reproducibility to become an issue, you first need a project that uses a non-Western encoding (say shift-jis for Japanese) for source files, and then that needs to be built on machines that use another incompatible non-Western encoding (say big5 for Chinese.) This just doesn't happen all that often, for the same reason you don't have Japanese Windows in your office.
So to sum it up, you are trying to solve a problem that doesn't exist, and end up inflicting pain on everyone using non-Western encodings. Hence my earlier statement that you got the trade-off analysis all wrong.
But really, I'm just begging you. It appears that the change has not yet made it into the releases. So it's still possible to avert the problem. Please, please change the default back to platform encoding, before it's too late.
Apr 26, 2008
Hervé Boutemy
I just changed the "are neglectable in number" qualifier to "are not the vast majority": this will help take the heat out of the debate.
The hard thing is to estimate whether platform encoding is a problem or a feature. I'll add a section on this topic, please contribute.
Apr 27, 2008
Benjamin Bentmann
I am not saying that there is a majority of people complaining (otherwise the POM would have been designed differently right from the start). The issue is really subtle because most communities can get away with the platform default encoding. Furthermore, this issue has much of this "works for me" attitude: It is usually outsiders/minorities who have to suffer. So here are two JIRA tickets I can quickly offer to demonstrate actual reported needs: MTAGLIST-27, MANTTASKS-14.
I define correctness (including correct build output) as something like "works always" and not as "works quite often". Is this attitude wrong?
I can understand your arguments about the number of affected projects (although I had never thought that Asians do not lock down the employed file encoding) but I absolutely disagree about the problem being "non-existent". Among other things, a reproducible build should enable everybody in the world to check out a project from SCM and invoke "mvn <goal>" to successfully build a project regardless of the platform.
Now, if I were to check out one of those projects that entirely rely on the platform's default encoding and are developed using an Asian encoding, would I be able to build that on my Western machine? That's just what I consider a bad state that should be fixed. One of Java's dreams is platform-independence and isn't that a nice goal for the build, too? In particular, a build tool like Maven that explicitly promotes best practices should not encourage platform-dependence, IMHO.
Platform encoding might have worked well in former days when we all worked behind closed walls, but things are becoming more international now and collaborative development requires explicit project conventions about source encoding. Having said that, I am relieved to see that the Seasar project you mentioned in your post has already locked down its file encoding to UTF-8.
Apr 27, 2008
Benjamin Bentmann
Also, we might just ask Hervé why he isn't spelling his own name correctly in the class javadoc of the code he contributes but is using the unaccented form "Herve". More than a decade after the invention of Unicode, I personally have no understanding why people are required to garble their names.
Apr 28, 2008
Takayoshi Kimura
+1 on having unified encoding setting. It's obviously a good thing.
-100 on the proposed default value. It breaks a lot of existing builds. Please use ${file.encoding} instead.
If you want to encourage platform-independent builds, add a WARN message which tells users that Maven is running with the platform encoding. I don't think we need to introduce a backward-incompatible change for it.
Apr 29, 2008
Benjamin Bentmann
Alright, I just started a poll on the user list to verify your rating.
Apr 29, 2008
Takayoshi Kimura
Benjamin,
Can you accept votes in this page's comment area? That would be easier for people who want to vote but are not on the Maven users list. Most Japanese Maven users are not on the list (we are discussing a non-English problem on an English-only list!).
Apr 29, 2008
Benjamin Bentmann
Yes, of course, comments reported here will count. Taking your objections into account was exactly the reason why I started an (apparently lively) discussion on the user mailing list.
Apr 29, 2008
Koichi Kobayashi
+1 on (a).
Apr 29, 2008
Takashi Okamoto
+1 for Kohsuke's proposal.
Use the platform encoding and keep the current behavior for compatibility. Stop breaking a lot of Japanese Maven projects.
Apr 29, 2008
Yohji Nihonyanagi
+1 on (a).
Apr 29, 2008
Hiroyuki Oonaka
+1 on (a).
To avoid breaking builds, I hope Maven keeps using the platform encoding.
Apr 29, 2008
yone098
+1 on (a)
Apr 29, 2008
cactusman
+1 on (a).
Apr 29, 2008
Koji Suga
+1 on (a).
Apr 29, 2008
Kenichi Dewa
+1 on (a).
Apr 29, 2008
masanobuimai
+1 on (a).
Apr 29, 2008
Takuto Wada
+1 on (a).
Apr 29, 2008
MIYAMOTO Daisuke
+1 on (a)
Apr 29, 2008
Osamu Goto
+1 on (a).
Apr 29, 2008
leecom
+1 on (a).
Apr 30, 2008
kazunori satok
+1 on (a).
Apr 30, 2008
nasobeme
+1 on (a).
Apr 30, 2008
hiroyuki iwanaga
+1 on (a).
Apr 30, 2008
HONMA Hirotaka
+1 on (a).
Apr 30, 2008
Rikiya Yamamoto
+1 on (a).
Apr 30, 2008
Takeshi Kawajiri
+1 on (a).
Apr 30, 2008
Mitsuhiro Okamoto
+1 on (a).
Apr 30, 2008
Shinobu Watanabe
+1 on (a).
I hope Maven keeps using the platform encoding.
Apr 30, 2008
Takayuki
+1 on (a).
For compatibility.
Apr 30, 2008
Ryuzo Yamamoto
+1 on (a).
Apr 30, 2008
Shigeaki Wakizaka
+1 on (a)
Apr 30, 2008
takayuki okazaki
+1 on (a)
It's an important option for us in Asian countries. Asian users might not send you feedback, but they depend on your great artifact, Maven.
Apr 30, 2008
Takayuki Kaneko
+1 on (a).
IMHO, compatibility is sometimes more important than other reasons.
Apr 30, 2008
Nobukazu Ishigaki
+1 on (a).
Apr 30, 2008
SODA Noriyuki
Today, nearly every program uses platform encoding (current codepage on Windows, current locale on UNIX variants).
Changing this behavior is really strange, and sounds very old-fashioned (like programs from before 1990).
So,
+1 for a)
Apr 30, 2008
Tatsuya Shimura
+1 on (a)
Apr 30, 2008
Tomohito Ozaki
+1 on (a).
Please consider countries that use multibyte languages.
Apr 30, 2008
KATOH Yasufumi
+1 on (a).
Apr 30, 2008
Shinpei Ohtani
+1 on (a) proposal.
Think about the world of multibyte languages, and compatibility!!
Apr 30, 2008
Suetoshi Urabe
+1 on (a)
Apr 30, 2008
mckenzy
+1 on (a).
Apr 30, 2008
Masahide Takeda
+1 on (a).
Apr 30, 2008
nobeans
+1 on (a).
Apr 30, 2008
kubota keisen
+1 on (a).
Apr 30, 2008
NONAKA Kimihiro
+1 on (a)
Apr 30, 2008
Yasuo Higa
I propose UTF-8 as the default value, because UTF-8 is suitable for every platform.
Apr 30, 2008
jkato
+1 on (a)
Apr 30, 2008
onozaty
+1 on (a)
Apr 30, 2008
ITO Yoshiichi
+1 on (a).
I hope Maven keeps using the platform encoding.
Apr 30, 2008
Hiroshi Kajikawa
+1 on (a)
Apr 30, 2008
Toshiya Kobayashi
+1 on (a)
Apr 30, 2008
Munenori TAKEI
+1 on (a).
Apr 30, 2008
Watanabe
+1 on (a)
Apr 30, 2008
Horiuchi Hiroki
+1 on (a)
Apr 30, 2008
NISHIMOTO Keisuke
+1 on (a)
Apr 30, 2008
Jun Funakura
+1 on (a).
Apr 30, 2008
Shinya Ogino
Never make iso-8859-1 a default, please. Possibly UTF-8, but compatibility should be considered with priority.
May 01, 2008
IZUNO Tadashi
+1 on (a).
May 01, 2008
Shinji Ichikawa
+1 on (a).
May 01, 2008
Satoru Okamoto
+1 on (a).
May 01, 2008
taichi
+1 on (a).
May 01, 2008
Takeshi Matsuba
+1 on (a).
May 01, 2008
ITO Sho
+1 on (a).
May 01, 2008
hsmt
+1 on (a).
May 01, 2008
sudo
+1 on (a).
May 01, 2008
Ryuji Furuya
+1 on (a).
May 01, 2008
Kanji Yokota
+1 on (a).
May 01, 2008
Masanobu Shimura
+1 for Kohsuke's proposal.
For Japanese developers, a default value of ISO-8859-1 will cause a lot of problems.
May 01, 2008
UEHARA Junji
+1 on (a).
May 01, 2008
Takkenoko
+1 on (a).
May 01, 2008
Mitsutoshi NAKANO
+1 on (a).
It is necessary to abolish ISO-8859-[.].
ISO-8859-[.] has too little expressive power and will not be what people want in the future.
Persisting with ISO-8859-[.] obstructs the development of software.
May 01, 2008
iteng
+1 on (a).
May 01, 2008
Wataru Nakamura
+1 on (a).
May 01, 2008
Benjamin Bentmann
Alright, alright, I guess it's clear by now.
I just closed the poll on the user list and will now update the proposal to use the platform encoding.
May 01, 2008
Masakazu Matsushita
We are glad you have accepted our opinion.
For developers using languages other than English, it was a really serious problem.
May 01, 2008
Masahiro Nagafusa
+1 on (a).
May 02, 2008
hajimeni
+1 on (a).
May 02, 2008
Hirotaka Ueki
+1 on (a).
May 02, 2008
Tsuyoshi Yamamoto
+1 on (a).
May 06, 2008
calico catnap
+1 on (a).