XML encoding: pom.xml, site.xml, ... 

Up to and including Maven 2.0.7, XML encoding support is buggy:

  • XML streams are read with the platform encoding, which causes problems with non-ASCII characters on ASCII-based platforms, and with every character on non-ASCII platforms (Z/OS with EBCDIC),
  • XML streams are converted to a String (with the platform encoding), and the resulting String is reworked heavily (interpolation...) before being parsed by an XML parser,
  • even if XML streams were passed directly to the XML parser, the MXParser used by Maven does not support encoding detection itself...
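
The first point can be illustrated with a minimal sketch that uses only the JDK. The class name, the sample document, and the choice of ISO-8859-1 as the "platform" charset are assumptions made for illustration; none of this is Maven code. It decodes the same UTF-8 bytes once with the right charset and once with a different one, as a platform-encoding Reader would on a machine whose default charset is not UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PlatformEncodingDemo {

    // Hypothetical sample: a UTF-8 XML fragment containing one non-ASCII character (e acute).
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>Herv\u00e9</name>";

    /** Decodes bytes the way a platform-encoding Reader would if the platform used the given charset. */
    static String decodeAs(byte[] bytes, Charset platformCharset) {
        return new String(bytes, platformCharset);
    }

    public static void main(String[] args) {
        byte[] onDisk = SAMPLE.getBytes(StandardCharsets.UTF_8);
        // Same charset as the file: the text survives the round trip (prints true).
        System.out.println(decodeAs(onDisk, StandardCharsets.UTF_8).equals(SAMPLE));
        // Different charset: the two UTF-8 bytes of the accented letter become two characters (prints false).
        System.out.println(decodeAs(onDisk, StandardCharsets.ISO_8859_1).equals(SAMPLE));
    }
}
```

The ASCII part of the document is unaffected in this case, which is why the bug goes unnoticed until a file contains accented or other non-ASCII characters.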

Changing the parser, and then the interpolation code, would be a big task. 

Solution: use the XmlReader class from Rome to detect the encoding of XML streams, as defined in the XML specification

This changes very little in the code: only the Reader instantiation. All other code (in particular interpolation) can remain the same.
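
To give an idea of what such detection involves, here is a deliberately simplified sketch using only the JDK. It is not the Rome or plexus-utils implementation: the class and method names are hypothetical, and it only handles a byte order mark and an XML declaration in an ASCII-compatible encoding, falling back to UTF-8. EBCDIC and UTF-16 without a BOM, which the real detection would need to cover, are left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlEncodingSniffer {

    // Matches the encoding pseudo-attribute of an XML declaration at the very start of the document.
    private static final Pattern ENCODING = Pattern.compile(
        "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    /** Guesses the charset from a BOM, then from the XML declaration, else assumes UTF-8. */
    static Charset detect(byte[] xml) {
        if (startsWith(xml, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(xml, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(xml, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        // No BOM: read the first bytes as ISO-8859-1 (one byte per char) to look at the declaration.
        String head = new String(xml, 0, Math.min(xml.length, 256), StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        return m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
    }

    /** Decodes the document with the detected charset and drops a leading BOM character if present. */
    static String decode(byte[] xml) {
        String text = new String(xml, detect(xml));
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Because the result is an ordinary String (or, in the real classes, an ordinary Reader), code that consumes the characters afterwards, such as interpolation, does not need to know that detection happened.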

Note: the corresponding XmlStreamWriter and WriterFactory have been added in plexus-utils 1.4.4, and XmlReader has been renamed to XmlStreamReader for consistency

Integration Level 1: detect XML encoding for user-written XML files

These files really need good XML encoding support, since users need accents and other local characters (Japanese, Greek, Cyrillic, ...)

  • [MODELLO-92]: use XmlStreamReader to read Modello .mdo files and update the misc. generators, DONE in modello 1.0-alpha-17
  • [MNG-2254]: use XmlStreamReader to read pom.xml, settings.xml and profiles.xml, DONE in Maven 2.0.8-SNAPSHOT
  • [MANTTASKS-79]: add XML encoding detection support for pom.xml and settings.xml in Maven Ant Tasks, DONE in 2.0.8-SNAPSHOT
  • [MRELEASE-87]: Poms are written with wrong encodings
  • [MSITE-239]: use XmlStreamReader to read site.xml, DONE in maven-site-plugin 2.0-beta-6-SNAPSHOT
  • [DOXIA-133]: XML encoding detection for xdoc, docbook, fml and xhtml files, DONE in doxia 1.0-alpha-9 and doxia-site 1.0-alpha-9

Integration Level 2: detect XML encoding for internal XML files

These files shouldn't really need special characters, since they are technical descriptors (plexus.xml and so on). But this change is useful for non-ASCII platforms (Z/OS with EBCDIC), where even simple ASCII characters can't be read with the platform encoding.

  • [PLX-343]: use XmlStreamReader in plexus-container-default to load internal XML configuration files, DONE in 1.0-alpha-30
  • [MANTTASKS-14]: make Maven Ant Tasks work on Z/OS
  • TODO: use XmlStreamReader class wherever an XML stream has to be changed into a String/Reader
  • TODO: check correct encoding when XML data are written to a stream through a Writer, using XmlStreamWriter if necessary
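
The writing side mentioned in the last TODO can be sketched in the same spirit. This is an assumption-laden illustration using only the JDK, not the XmlStreamWriter code from plexus-utils: the class and method names are hypothetical. The idea is to encode the bytes with the charset that the document's own XML declaration announces, and to fall back to UTF-8 when there is none.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlEncodingWriterSketch {

    // Matches the encoding pseudo-attribute of an XML declaration at the very start of the text.
    private static final Pattern ENCODING = Pattern.compile(
        "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    /** Returns the charset named in the XML declaration, or UTF-8 if none is declared. */
    static Charset declaredCharset(String xml) {
        Matcher m = ENCODING.matcher(xml);
        return m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
    }

    /** Encodes the document so that the bytes agree with its own declaration. */
    static byte[] encode(String xml) {
        return xml.getBytes(declaredCharset(xml));
    }
}
```

Writing the same String through a platform-encoding Writer would instead produce bytes that may contradict the declaration, which is the kind of problem reported in MRELEASE-87.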

Technical Notes 

new FileReader/Writer(File) vs Reader/WriterFactory.newPlatformReader/Writer(File)

When using the new FileReader(File) or new FileWriter(File) API, the platform encoding is used for conversion between bytes and characters.

The Java API documentation is explicit about this (if you read it carefully: look at the class description, not the constructor comments), but it is not obvious when using the API: developers tend to forget that they are choosing an encoding when they call these constructors.

The ReaderFactory.newPlatformReader(File) and WriterFactory.newPlatformWriter(File) APIs simply call the previous API, but when using them, the encoding choice is explicit.

Once you have replaced your FileReader/FileWriter constructors with this API, which is explicit about the encoding choice, it becomes clear that if the file being read or written is XML, the platform encoding is the wrong choice: you need XML encoding detection, which is the purpose of ReaderFactory.newXmlReader(File) and WriterFactory.newXmlWriter(File)...
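
The hidden encoding choice can be made visible with a small JDK-only sketch. The class below is a hypothetical stand-in, not the plexus-utils ReaderFactory: its newPlatformReader spells out the default charset that new FileReader(File) picks up implicitly, so both methods decode a file in the same way.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

final class PlatformReaderSketch {

    /** The usual idiom: no charset in sight, yet the platform default is used. */
    static Reader implicitPlatformReader(File file) throws IOException {
        return new FileReader(file);
    }

    /** Same behaviour, but the encoding choice is written out in the code. */
    static Reader newPlatformReader(File file) throws IOException {
        return new InputStreamReader(new FileInputStream(file), Charset.defaultCharset());
    }

    /** Reads everything from the reader and closes it. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = reader) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

Reading the same file through both methods gives identical text on any platform. Whether that text is correct depends entirely on whether the file happens to be in the platform encoding, which is the reason XML files need detection instead.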

Subversion properties

XML files should ideally be marked as "text/xml" to let svn and other tools know that XML encoding detection should be used:

 svn propset svn:mime-type text/xml *.xml *.mdo *.fml *.xhtml

Quick tests with viewvc 1.0.3 showed that such a mark did not change anything: a UTF-16 XML file was still treated as binary, and no diff was provided.
