XML encoding: pom.xml, site.xml, ...
Up to and including Maven 2.0.7, XML encoding support is buggy:
- XML streams are read with the platform encoding, which causes problems with non-ASCII characters on ASCII-based platforms, and with every character on non-ASCII platforms (Z/OS with EBCDIC),
- XML streams are converted to a String (with the platform encoding), and the resulting String is reworked a lot (interpolation, ...) before being parsed by an XML parser,
- even if XML streams were passed directly to the XML parser, MXParser, the parser used by Maven, does not handle encoding itself.
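The first problem can be reproduced with the JDK alone. The snippet below is only an illustration: the class name PlatformEncodingDemo and the sample document are made up for this page, and ISO-8859-1 stands in for "a platform encoding that differs from the one declared in the XML prolog".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Maven or plexus-utils.
class PlatformEncodingDemo {

    // Decodes raw bytes the way a Reader created with the given encoding would.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>Herv\u00e9</name>";
        byte[] onDisk = xml.getBytes(StandardCharsets.UTF_8);

        // Decoding with the encoding declared in the prolog gives back the original text.
        String good = decode(onDisk, StandardCharsets.UTF_8);
        // Decoding with another encoding (ISO-8859-1 as a stand-in for the platform
        // encoding) turns the single accented character into two wrong characters.
        String bad = decode(onDisk, StandardCharsets.ISO_8859_1);

        System.out.println(good.equals(xml));           // same text
        System.out.println(bad.length() - xml.length()); // one extra character
    }
}
```

On an ASCII-based platform only the accented character is damaged; on an EBCDIC platform the same mismatch affects the markup characters as well.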
Changing the parser, and then the interpolation code, is a big task.
Detecting the XML encoding when the Reader is created is much less intrusive. It won't change many things in the code: only the Reader instantiation. All other code (particularly interpolation) can remain the same.
- [PLXUTILS-11]: add XmlReader to plexus-utils, DONE in plexus-utils 1.4.5 as XmlStreamReader with ReaderFactory
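To give an idea of what XML encoding detection at Reader creation involves, here is a simplified sketch using only the JDK. It is not the plexus-utils implementation: the class name XmlEncodingSniffer and its detect method are hypothetical, and the sketch only looks at a byte order mark and at the encoding attribute of the XML declaration, falling back to UTF-8. It assumes the declaration is readable as ASCII, so it does not cover EBCDIC or UTF-16 without a BOM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, a simplified illustration of XML encoding detection.
class XmlEncodingSniffer {

    // Matches the encoding pseudo-attribute of an XML declaration at the start of the text.
    private static final Pattern ENCODING = Pattern.compile(
        "^<\\?xml[^>]*?\\sencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Guesses the encoding from the first bytes of an XML stream.
    static Charset detect(byte[] head) {
        // 1. A byte order mark wins over everything else.
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // 2. Otherwise read the declaration byte-for-byte (ISO-8859-1 never fails
        //    to decode) and look for encoding="...".
        String prolog = new String(head, 0, Math.min(head.length, 200),
                                   StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prolog);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }

        // 3. No BOM, no usable declaration: UTF-8 is the XML default.
        return StandardCharsets.UTF_8;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] declared = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a/>"
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(declared));
        System.out.println(detect("<a/>".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

Once the encoding is known, the stream can be wrapped in an InputStreamReader with that encoding, and the rest of the code keeps working on a plain Reader.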
Integration Level 1: detect XML encoding for user-written XML files
These files really need good XML encoding support, since users need accented and other local characters (Japanese, Greek, Cyrillic, ...).
- [MODELLO-92]: use XmlStreamReader to read Modello .mdo files and update the misc. generators, DONE in modello 1.0-alpha-17
- [MNG-2254]: use XmlStreamReader to read pom.xml, settings.xml and profiles.xml, DONE in Maven 2.0.8-SNAPSHOT
- [MANTTASKS-79]: add XML encoding detection support for pom.xml and settings.xml in Maven Ant Tasks, DONE in 2.0.8-SNAPSHOT
- [MINSTALL-44]: add XML encoding support when reading/writing POM files in install plugin, DONE in 2.3-SNAPSHOT
- [MDEPLOY-66]: add XML encoding support when reading/writing POM files in deploy plugin
- [MRELEASE-87]: Poms are written with wrong encodings
- [MSITE-239]: use XmlStreamReader to read site.xml, DONE in maven-site-plugin 2.0-beta-6-SNAPSHOT
- [DOXIA-133]: XML encoding detection for xdoc, docbook, fml and xhtml files, DONE in doxia 1.0-alpha-9 and doxia-site 1.0-alpha-9
Integration Level 2: detect XML encoding for internal XML files
These files shouldn't really need special characters, since they are technical descriptors (plexus.xml and so on). But this change is useful on non-ASCII platforms (Z/OS with EBCDIC), where even simple ASCII characters can't be read with the platform encoding.
- [PLX-343]: use XmlStreamReader in plexus-container-default to load internal XML configuration files, DONE in 1.0-alpha-30
- [MANTTASKS-14]: make Maven Ant Tasks work on Z/OS
- TODO: use XmlStreamReader class wherever an XML stream has to be changed into a String/Reader
- TODO: check correct encoding when XML data are written to a stream through a Writer, using XmlStreamWriter if necessary
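The writing side of the last TODO can be sketched as follows. This is only an illustration of the idea (encode the bytes with the encoding that the XML declaration announces), written with the JDK alone; the class XmlDeclaredEncodingWriter and its methods are hypothetical names, not the XmlStreamWriter API.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: writes XML text using the encoding named in its own declaration.
class XmlDeclaredEncodingWriter {

    private static final Pattern ENCODING = Pattern.compile(
        "^<\\?xml[^>]*?\\sencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Returns the declared encoding, or UTF-8 (the XML default) when none is declared.
    // Charset.forName throws UnsupportedCharsetException for an unknown encoding name.
    static Charset declaredCharset(String xml) {
        Matcher m = ENCODING.matcher(xml);
        return m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
    }

    // Writes the document so that the bytes on the stream match the declaration.
    static void write(String xml, OutputStream out) throws IOException {
        Writer writer = new OutputStreamWriter(out, declaredCharset(xml));
        writer.write(xml);
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>Herv\u00e9</name>";
        System.out.println(declaredCharset(xml));
        write(xml, System.out);
    }
}
```

With this approach the declaration and the bytes cannot disagree, which is exactly the kind of mismatch that writing through a platform-encoding Writer can produce.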
new FileReader/Writer(File) vs Reader/WriterFactory.newPlatformReader/Writer(File)
new FileReader(File) and new FileWriter(File) use the platform encoding. The Java API documentation is explicit about this fact (if you read it carefully: look at the class description, not the constructor comments), but this is not obvious when using the API: developers tend to forget that they are choosing an encoding when they call these constructors.
After you have replaced your FileReader/Writer constructor with Reader/WriterFactory.newPlatformReader/Writer, which is explicit about the encoding choice, you understand that if the file being read or written is XML, the platform encoding is the wrong choice: you need XML encoding detection, which is the purpose of XmlStreamReader and XmlStreamWriter.
Integrating XML encoding detection in Maven plugins
A lot of Maven plugins read and write XML files, and they currently do so with the platform encoding (i.e. new FileReader/Writer): the change to Reader/WriterFactory.newPlatformReader/Writer should be done.
But there is a problem with Maven versions earlier than 2.0.6: in Maven 2.0.5 and earlier, the plexus-utils version is forced by Maven Core and cannot be overridden by a plugin. MNG-2892 (released in Maven 2.0.6) fixed this limitation, so Maven 2.0.6 is a prerequisite for fixing plugins.
What can be done?
- In maven-site-plugin, the XML encoding classes from plexus-utils were copied into the plugin's sources (MSITE-242 tracks their removal): this plugin reads a lot of XML files and strongly needs encoding support, so this otherwise bad solution was really the best one. But copying these classes into every plugin would not be good.
- A light solution is to replace new FileReader( File ) with new InputStreamReader( new FileInputStream( File ), "utf-8" ): if XML encoding detection is not supported, at least reading the file with the default XML encoding, UTF-8, is both more powerful and more coherent (the lack of detection is then a missing feature, not a bug).
- Another solution would be to have XML encoding classes in another library than plexus-utils...
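The light solution above only needs the JDK. A minimal sketch, assuming the file really is UTF-8 (no detection is attempted); the class name Utf8XmlFileReader and its helper methods are hypothetical, and StandardCharsets.UTF_8 is used in place of the "utf-8" string to avoid the checked UnsupportedEncodingException.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper: reads an XML file as UTF-8 instead of the platform encoding.
class Utf8XmlFileReader {

    // Replacement for "new FileReader( file )" with an explicit encoding.
    static Reader open(File file) throws IOException {
        return new BufferedReader(
            new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8));
    }

    // Reads the whole file into a String through the UTF-8 Reader.
    static String readAll(File file) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = open(file)) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>Herv\u00e9</name>";
        File file = Files.createTempFile("pom", ".xml").toFile();
        try {
            Files.write(file.toPath(), xml.getBytes(StandardCharsets.UTF_8));
            // The result no longer depends on the platform encoding.
            System.out.println(readAll(file).equals(xml));
        } finally {
            file.delete();
        }
    }
}
```

A file declaring another encoding (ISO-8859-1, UTF-16, ...) is still read incorrectly with this approach, which is why it is only a fallback until real detection can be used.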
XML files should ideally be marked as "text/xml" (in Subversion, through the svn:mime-type property) to let svn and other tools know that XML encoding detection should be used.
Quick tests with viewvc 1.0.3 showed that such a mark did not change anything: a UTF-16 XML file was still considered binary, and no diff was provided.