
Purpose of the library

The purpose of this library is to "guess" the encoding of a file and to return a reader configured to use the guessed encoding. The library recognizes the following Unicode encoding variants:

  • UTF-8
  • UTF-16LE - Little Endian
  • UTF-16BE - Big Endian
  • UTF-32

If no Unicode encoding is recognized, the file is assumed to use an 8-bit encoding. If its content is not plain US-ASCII, the platform's default 8-bit encoding is assumed, whatever that happens to be. The library cannot distinguish between different 8-bit encodings: that would require statistical analysis, n-grams, or similar techniques specific to the language of the text, and the library does not support them.
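This page does not say how the guess is made, so the following is only a minimal sketch of the decision order described above, not the library's actual code or API. It assumes that Unicode variants are recognized by their byte order mark (BOM) or, failing that, by a strict UTF-8 decode of the whole content. The class name `EncodingGuessSketch` and the method `guess` are hypothetical, and the sketch relies on the `UTF-32` charset being available in the running JVM (it is in OpenJDK, but the Java specification does not require it).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of the library described on this page.
public class EncodingGuessSketch {

    /** Guesses a charset from the complete content of a file. */
    public static Charset guess(byte[] bytes) {
        // 1. Byte order marks. UTF-32 is tested before UTF-16 because the
        //    UTF-32LE mark (FF FE 00 00) begins with the UTF-16LE mark (FF FE).
        if (startsWith(bytes, 0x00, 0x00, 0xFE, 0xFF)
                || startsWith(bytes, 0xFF, 0xFE, 0x00, 0x00)) {
            // Assumption: the JVM provides UTF-32, whose decoder reads the BOM itself.
            return Charset.forName("UTF-32");
        }
        if (startsWith(bytes, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(bytes, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(bytes, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // 2. No BOM and only 7-bit bytes: plain US-ASCII.
        boolean sevenBitOnly = true;
        for (byte b : bytes) {
            if (b < 0) { // high bit set
                sevenBitOnly = false;
                break;
            }
        }
        if (sevenBitOnly) return StandardCharsets.US_ASCII;

        // 3. No BOM but high bytes present: accept UTF-8 only if the whole
        //    content decodes without error; otherwise fall back to the
        //    platform default, as the text above describes.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return Charset.defaultCharset();
        }
    }

    private static boolean startsWith(byte[] bytes, int... prefix) {
        if (bytes.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((bytes[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

The returned `Charset` can then be passed to `new InputStreamReader(stream, charset)` to obtain a reader for the file. Step 3 is also where the limitation noted above shows up: any content that is neither ASCII nor valid UTF-8 gets the platform default, whether or not that is the encoding the file was written in.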

License

This library is released under the Apache License, Version 2.0.

Useful links 

The jars are also available in Maven's repository.

Origins

At a previous job, I was developing with IntelliJ IDEA, from JetBrains. It's certainly the best IDE around, and a real pleasure to develop with. During the summer of 2002, I came across an issue with file encodings. At work, one of our concerns was localisation/internationalisation: we were developing applications that were i18n/l10n aware. We kept our Java source files encoded in ISO Latin-1 and our XML files encoded in UTF-8 (mainly because they contained language-specific content). At that time, IDEA could read a file in a specified encoding, but it could not detect which encoding a file had been written in. And as shit happens sometimes ;-) I totally messed up a very important XML file...

I then realized that this happened because IDEA was not able to guess the encoding. Encoding issues are critical when dealing with l10n/i18n, which is why I filed some feature requests with the IDEA developers. I wrote two simple classes to show them that guessing an encoding was very easy, and I granted them the right to include (and modify) my source code in IDEA. That's what they did, and since then all IDEA fans can open their files without worrying about messing them up. (Who hasn't seen weird boxes or question marks in a mangled file?)
