Internet standards support for international character sets was historically weak, is now improving, but is still not perfect. The current standards situation is confusing, and has led to false expectations about what can be reliably achieved using today's browsers, protocols and servers.
Jetty fully supports and implements the current relevant standards and specifications, but this alone is not sufficient to make working with international characters easy or reliable in all situations. This FAQ explains the current standards, provides hints and tips for building and decoding internationalised web pages (including ones with dynamic data) and explains how Jetty has anticipated the probable future direction of the standards.
The intended readership is people developing Servlet applications and their associated web pages. A basic knowledge of Java, HTML and of hexadecimal notation is assumed.
Unless otherwise stated, the information below applies to all current (August 2002) standards-conformant web servers, not just to Jetty.
There are four groups whose standards and specifications affect character handling in web pages: the IETF (which defines HTTP and the URL syntax), the W3C (which defines HTML and XML), the Unicode Consortium, and ISO (responsible for the ISO-8859 series of 8-bit character sets and for ISO-10646).
A single byte (8 bits) has the capacity to represent up to 256 characters. There are several widely-used encodings which give the US-ASCII characters their normal values in the range 0-127, and a selection of other characters are assigned code values in the range 128-255. In many web browsers the encoding to use can either be specified by the web page designer or selected by the user. Some of these encodings are proprietary, others specified by consortia or international standards bodies.
The first approaches to supporting international characters involved selecting one of these 8-bit character sets, and attempting to ensure that the web page source, the browser, and any server using data from that browser were using the same character encoding.
There is a default character set, ISO-8859-1, which supports most western European languages, and is currently the official 'default' content encoding for content carried by HTTP (but not for data included in URLs - see below). This is also the default for all Jetty versions except 4.0.x. Alternative encodings may be specified by defining the HTTP Content-Type header of the page to be something like "text/html; charset=ISO-8859-4". From a Servlet you can use response.setContentType("text/html; charset=ISO-8859-4");. Pages can then be composed using the desired literal characters (e.g. accented letters from Europe or, with a different encoding selected, Japanese characters). This mechanism can be unreliable; the browser's user can select the encoding to be applied, which may be different from that intended by the servlet designer.
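As a minimal sketch (the servlet name and page content here are illustrative, not from the original FAQ), a servlet declares the charset before obtaining its writer:

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class LatinPageServlet extends HttpServlet {
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // Set the content type (and charset) BEFORE calling getWriter(),
            // so the container selects the right encoder for the output.
            response.setContentType("text/html; charset=ISO-8859-4");
            // "Tere p\u00e4evast" (Estonian) uses a character present in ISO-8859-4.
            response.getWriter().println("<html><body>Tere p\u00e4evast!</body></html>");
        }
    }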
Today the Internet is converging on a single, common encoding - Unicode - which can represent all known written languages, as well as a wide range of symbols (e.g. mathematical symbols and decorative marks). Unicode requires a 16-bit integer value to represent each character. By design, the 95 printable US-ASCII characters have the same code values in Unicode; US-ASCII is a subset of Unicode. Most modern browsers can decode and display a wide range of the characters represented by Unicode - but it would be rare to find a browser capable of displaying all the Unicode characters.
Unicode is the only character encoding used in XML and is now the default in HTML, XHTML and in most Java implementations.
The Internet transmits data in groups of 8 bits, which the IETF usually call 'octets', but everyone else calls 'bytes'. When larger values have to be sent, such as the 16 bits needed for Unicode and some other international character encodings, there has to be a convention on how the 16 bits are packed into one or more octets. There are two standards commonly used to encode Unicode: UTF-8 and UTF-16.
UTF-16 is the 'brute-force' encoding for data transmission. The 16 bits representing the character value are placed in adjacent octets. Variants of UTF-16 place the octets in different orders.
UTF-8 is more common, and is recommended for most purposes. Characters with values less than 0080 (hexadecimal notation) are transmitted as a single octet whose value is that of the character. Characters with values in the range 0080 to 07FF are encoded as two octets, whilst the (infrequently-used) Unicode characters with values between 0800 and FFFF are encoded into three octets. This encoding has the really useful property that a sequence of (7-bit) US-ASCII characters sent as bytes and then sent as Unicode UTF-8 octets produces identical octet streams - a US-ASCII byte stream is a valid UTF-8 octet stream and represents the same printing characters. This is not the case if other characters are being sent, or if UTF-16 is in use.
As well as having US-ASCII compatibility, UTF-8 is preferred because, in the majority of situations, it results in shorter messages than UTF-16.
Note that, when UTF-8 is specified, it not only defines the way in which the character code values are packed into octets, it also implicitly defines the character encoding in use as Unicode.
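A short sketch of both encodings, using the modern java.nio.charset API (the example strings are arbitrary):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class EncodingDemo {
        public static void main(String[] args) {
            // A pure US-ASCII string produces identical octets in ASCII and UTF-8.
            byte[] ascii = "Hello".getBytes(StandardCharsets.US_ASCII);
            byte[] utf8  = "Hello".getBytes(StandardCharsets.UTF_8);
            System.out.println(Arrays.equals(ascii, utf8)); // true

            // U+00FC (u-umlaut) lies in the range 0080-07FF, so UTF-8 needs two
            // octets. (& 0xFF converts Java's signed byte to its unsigned value.)
            byte[] b = "\u00fc".getBytes(StandardCharsets.UTF_8);
            System.out.printf("%02X %02X%n", b[0] & 0xFF, b[1] & 0xFF); // C3 BC

            // UTF-16 (big-endian variant) simply lays out the 16-bit value.
            byte[] b16 = "\u00fc".getBytes(StandardCharsets.UTF_16BE);
            System.out.printf("%02X %02X%n", b16[0] & 0xFF, b16[1] & 0xFF); // 00 FC
        }
    }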
There is an international standard - ISO-10646 - which defines an identical character set to Unicode - but omits much essential additional information. For most purposes refer to Unicode rather than ISO-10646.
There are two places in which HTTP requests (from browsers to web servers) may include character data: in the URL itself (where form data goes with the GET method) and in the content of the request (where form data goes with the POST method).
Wherever possible, a POST method should be used when international characters are involved.
This is because the browser sends an HTTP Content-Type header which can help the web server determine the encoding of the content. The Content-Type header will tell the server the MIME-type encoding of the content (usually application/x-www-form-urlencoded) and also can optionally include the character encoding of the content, e.g.:

    Content-Type: application/x-www-form-urlencoded; charset=UTF-8
If both the MIME-type and the charset encoding information are sent in the POST HTTP header, the server can correctly decode the content.
Unfortunately, many browsers do not bother to send the charset information, leaving the web server to guess the correct encoding. For this reason, the Servlet API provides the ServletRequest.setCharacterEncoding(String) method to allow the webapp developer to control the decoding of the form content.
Jetty-6 uses a default of UTF-8 if no overriding character encoding is set on a request.
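A sketch of a POST handler along these lines (the form field name "name" is hypothetical):

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class FormServlet extends HttpServlet {
        protected void doPost(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // If the browser did not send a charset in its Content-Type header,
            // tell the container what to assume. This must happen before the
            // first getParameter() call, which triggers decoding of the content.
            if (request.getCharacterEncoding() == null) {
                request.setCharacterEncoding("UTF-8");
            }
            String name = request.getParameter("name");
            response.setContentType("text/html; charset=UTF-8");
            response.getWriter().println("<html><body>Hello " + name + "</body></html>");
        }
    }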
A typical URL looks like:

    http://www.example.com/some/path/page.html

When a form is sent, using the default GET method, the data values from the form are included in the URL, e.g.:

    http://www.example.com/servlet/SearchPerson?name=Smith

It is important to note that only a very restricted sub-set of the US-ASCII printing characters are permitted in URLs.
For example, name=Dürst (with an umlaut) is illegal. It might work with some browser/server combinations (and might even deliver the expected value), but it should never be used.
The HTTP specification provides an 'escape' mechanism, which permits arbitrary octet values to be included in the URL. The three characters %HH - where HH are hexadecimal digits - insert the specified octet value into the URL. This has to be used for the US-ASCII characters which may not appear literally in URLs, and can be used for other octet values.
It is a common fallacy that this permits international characters to be reliably transmitted. This is wrong.
This is because the %HH escape permits the transmission of a sequence of octets, but has nothing to say about what character encoding is in use.
There is no provision in the HTTP protocol for the sender (the browser) to tell the receiver (the web server) what encoding has been used in the URI, and none of the specifications related to HTTP/HTML define a default encoding.
Thus, although any desired octet sequence can be placed in a URL, none of the standards tell the web server how to interpret that octet sequence.
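The ambiguity is easy to demonstrate. In the sketch below, the same name produces two different octet sequences depending on which encoding the sender happened to apply, and nothing in the resulting URL records which one was used:

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    public class EscapeDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String name = "D\u00fcrst";
            // The %HH escapes carry octets, not characters:
            System.out.println(URLEncoder.encode(name, "UTF-8"));      // D%C3%BCrst
            System.out.println(URLEncoder.encode(name, "ISO-8859-1")); // D%FCrst
            // A server receiving "D%FCrst" cannot tell from the URL alone
            // whether %FC is ISO-8859-1 u-umlaut or part of some other sequence.
        }
    }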
The designers of web servers with Servlet APIs currently have a problem. They are presented with an octet stream of unspecified encoding, and yet have to deliver a Java String (a sequence of decoded characters) to the Servlet API.
Due to the lack of a standard, different browsers took different approaches to the character encoding used. Some use the encoding of the page and some use UTF-8. Some drafts were prepared by various standards bodies suggesting that UTF-8 would become the standard encoding. Older versions of Jetty (e.g. the 4.0.x series) used UTF-8 as the default in anticipation of a standard being adopted. As a standard was not forthcoming, Jetty-4.1.x reverted to a default encoding of ISO-8859-1.
The W3C's HTML standard now recommends the use of UTF-8: http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars and accordingly the Jetty-6 series uses a default of UTF-8.
If UTF-8 is not correct for your environment, you may use one of two Jetty-specific methods to set the charset encoding of the query string in GET requests:
Set the system property org.mortbay.util.URI.charset to the encoding you want to use.
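The property must be set before Jetty decodes any request, so it is typically supplied on the command line (java -Dorg.mortbay.util.URI.charset=ISO-8859-1 ...) or, for an embedded server, at the very start of main(). A minimal sketch, assuming embedded Jetty with startup code you control:

    public class ServerStartup {
        public static void main(String[] args) throws Exception {
            // Must run before any request URI is parsed; ISO-8859-1 is just
            // an example value - use whatever matches your clients.
            System.setProperty("org.mortbay.util.URI.charset", "ISO-8859-1");
            // ... construct and start the Jetty server here ...
        }
    }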
There are many ways in which international characters can be displayed or placed into browsers for inclusion in HTTP requests. Some examples are: a literal character (such as ü) in the page source; a named character entity such as &uuml;; and a numeric character reference such as &#252; (decimal) or &#xFC; (hexadecimal).
It is also possible to manipulate document text using the DOM APIs.
It is believed that, in all the above examples, all modern browsers (those supporting HTML 4) will treat the &...; encoding as representing Unicode characters. Earlier ones may not understand this encoding.
The first example, with the literal ü, should only be used if the character encoding can be relied upon, and if support for 'legacy' browsers (those not understanding the &...; encoding) is essential.
It is, of course, possible for users to enter characters using <input...> and <textarea...> elements via the operating system. Text can come from keyboards, and also from 'cut and paste' mechanisms. It appears that most browsers use their current (user-selectable) 'Encoding' setting (e.g. in MSIE: View..Encoding) to encode such characters. After the user has selected the encoding to use, it appears that many browsers will transmit the data characters in the request in that locally-defined encoding, rather than the one specified with the page.
The only reliable, standards-supported way to handle international characters in a browser- and server-neutral way appears to be: serve the page with an explicitly declared character encoding, express international characters in it using the &...; encoding, submit the form with the POST method, and have the servlet call setCharacterEncoding() with the same encoding before reading any parameters.
It is sometimes suggested that all forms can and should be submitted using the above POST method. There is, in fact, a valid need to use the default GET method.
To appreciate this need one must consider carefully a significant difference between submitting a form using POST and GET. When using GET the data values from the form are appended to the URI and form part of the visible 'address' seen at the top of the browser. It is also this entire string that is saved if the page is bookmarked by the browser, and that may be moved using 'cut-and-paste' into an email, another web page etc.
It is possible that the dynamic data from the form is an integral part of the semantics of the 'page address' to be stored.
The address may be part of a map; one of the data values from the map may define the town on which the map view is to be centered - and this name may only be expressible in, say, Thai characters. The town name may have been entered or selected by the user - it was not a 'literal' in the HTML defining the page or in the URL.
Another common need is to 'bookmark', or send by email, the request string from a search engine request which has non-ASCII characters in the search.
There is not yet any standards-based way of constructing this dynamically-defined URL - there is no direct way to be certain what character encodings have been applied in constructing the URI-with-query string that the browser generates.
A work-around which has been suggested is to provide additional, known text fields alongside the unknown text. In the example above, the form with the dynamically-defined town name could also have a hidden field into which the generated page inserts 'known' text from the same character set (using the &...; encoding). When the request is eventually received by a servlet the octet contents of the known field are inspected (typically by using request.getQueryString()). If the characters of the 'tracer' field are the same as those injected into the page when it was generated (and the character code values encompass those of the unknown town) then there is an assumption that the encoding used was Unicode and that the town name as present in Java is accurate.
If the actual encoding can be deduced from the 'tracer' octets, the Servlet API request.setCharacterEncoding() can be called (before calling any of the .getParameter() methods) to tell the web server which encoding to assume when decoding the query.
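A minimal sketch of this check, assuming a hidden field named tracer whose generated value was written as &#252; (u-umlaut), and assuming (as described above) that the container honours setCharacterEncoding() when decoding the query string; the field name and candidate encodings are illustrative:

    import java.io.UnsupportedEncodingException;
    import javax.servlet.http.HttpServletRequest;

    public class TracerCheck {
        // Deduce the query encoding from the raw octets of the 'tracer' field.
        static String deduceEncoding(HttpServletRequest request) {
            String rawQuery = request.getQueryString(); // still %HH-escaped
            if (rawQuery == null) return null;
            // A robust version would compare %HH escapes case-insensitively.
            if (rawQuery.contains("tracer=%C3%BC")) return "UTF-8";      // octets C3 BC
            if (rawQuery.contains("tracer=%FC"))    return "ISO-8859-1"; // octet FC
            return null; // unknown: fall back to the container's default
        }

        static String getTown(HttpServletRequest request) throws UnsupportedEncodingException {
            String encoding = deduceEncoding(request);
            if (encoding != null) {
                // Must be called before any getParameter() call.
                request.setCharacterEncoding(encoding);
            }
            return request.getParameter("town");
        }
    }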
There is an obvious potential flaw in this 'tracer' technique - the browser may represent &...;-specified 'tracer' text with its Unicode values, yet may use the local keyboard/operating system encoding for locally-entered data. The author is not aware of any conclusive knowledge or evidence in this area.
Jetty configuration files use XML. If international characters need to be included in these files, the standard XML character encoding method should be used. Where the character has a defined abbreviation (such as &uuml; for u-umlaut), that should be used. In other cases the hexadecimal form of the character's Unicode code value should be used. For example &#x391; defines the Greek capital Alpha. Use of the decimal form (&#913;) seems now to be unfashionable in W3C circles.
It is to be hoped that something like the IRI scheme described in the Internet Draft will evolve into a standard that will be adopted by suppliers of web servers and browsers. It will probably also need changes to HTTP and/or the use of internationalised versions of the http and https protocols. As currently drafted, such a scheme would not, on its own, solve the problem of dynamic data derived from form GET submissions. This will require changes to HTML4 or, more likely, extensions to a future version of XHTML.
The whole area of form data handling may be radically improved if the XForms programme is successful. This has defined an XML-based approach to forms and associated data and event handling, and uses Unicode throughout. The XForms 1.0 specification is currently (August 2002) at 'last call working draft' status, and a number of experimental implementations, some using browser applets or plug-ins, have been announced.
Neither of these likely developments will improve the handling of international characters by today's browsers, so designers of web services for the 'open' market seem likely to have to work within today's constraints for a long time.
Anyone interested in the full complexity of handling international characters and languages might like to read the W3C's Character Model (currently a working draft) and follow the W3C's Internationalization Activity.
Originally contributed by Chris Hayes.