The Unicode Workflow
Tuesday, 11 April 2006 |
Page 4 of 5

Output

Web browser

All modern web browsers support UTF-8. But you have to tell the browser that you are outputting UTF-8 data, because its default character encoding is likely set to something else.

In plain HTML pages, add this meta tag to the header:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

In scripted pages, add the meta tag and send this HTTP header:

    Content-type: text/html; charset=utf-8

In PHP you would write:

    header("Content-type: text/html; charset=utf-8");

If you have to output ISO-8859-1 (Latin-1) to the browser, for whatever reason, use HTML entities to encode the 'special' characters. PHP's built-in htmlentities() function converts 'special' characters to HTML entities. Before PHP 5.4, htmlentities() assumes that the string to convert is ISO-8859-1 (Latin-1) encoded unless you specify otherwise, so don't forget to tell PHP that your original string is UTF-8 encoded by setting the third argument to 'UTF-8'. E.g.

    $latin1String = htmlentities($utf8String, ENT_COMPAT, 'UTF-8');

Flash

Flash MX and higher expect UTF-8 input, whether you use the XML() or the LoadVars() objects, unless someone has put System.useCodepage = true; somewhere in the code.

Mail clients

The Internet Message Format standard (RFC 2822) expects mail to be sent in 7bit US-ASCII encoding, which has no accented characters. This standard is extended by the MIME standard, which allows for other character sets and media types. This chapter is a little harder than the previous ones, but don't panic: every decent scripting language has classes that handle the details for you. See the MIME::* Perl modules or the Mail_Mime PHP PEAR class. You just need to know what you're doing, hence all the theory.

Mail subject

According to RFC 2822, header field names MUST be composed of printable US-ASCII characters (i.e., characters that have values between 33 and 126, inclusive), and field bodies, including the subject text, may only contain US-ASCII characters.
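As a rough illustration of that rule, the sketch below checks whether a header value consists only of printable US-ASCII characters and spaces. The function name needsHeaderEncoding() is hypothetical, not part of PHP or of any mail library, and the check is deliberately simple.

```php
<?php
// Hypothetical helper (an assumption for illustration): returns true when
// $value contains any byte outside the printable US-ASCII range 32-126,
// meaning it cannot go into a mail header as-is.
function needsHeaderEncoding(string $value): bool
{
    // No /u modifier: the pattern is matched byte by byte, so every byte
    // of a multi-byte UTF-8 character counts as non-ASCII.
    return preg_match('/[^\x20-\x7E]/', $value) === 1;
}

var_dump(needsHeaderEncoding('Hello world'));              // bool(false)
var_dump(needsHeaderEncoding("Voil\xC3\xA0 une message")); // bool(true)
```

A subject for which this returns false can be sent unchanged; anything else has to be encoded first.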
So if you want a subject with accents, you must encode it from your original character set into US-ASCII. There are two ways to do this: Quoted-Printable or Base64.

The Quoted-Printable encoding allows any 8bit character to be represented as "=XX", where XX is the hexadecimal value of the character. E.g. "Voilà une message" in the ISO-8859-1 (Latin-1) character set is encoded as "Voil=E0=20une=20message". In mail headers, SPACE characters may be represented as "_", e.g. "Voil=E0_une_message", which is more readable. "Voilà une message" in the UTF-8 character set is encoded as "Voil=C3=A0_une_message". As you can see, the "à" character is encoded as the two-byte sequence "=C3=A0".

Base64 encoding takes 3 bytes of the original string and converts them to 4 characters from the set A-Z, a-z, 0-9, + and /. The = is used for padding. E.g. the ISO-8859-1 string "Voilà une message" is encoded as "Vm9pbOAgdW5lIG1lc3NhZ2U=".

I would suggest the Quoted-Printable encoding for subjects that contain mainly ASCII characters. In case the mail client fails to decode the subject correctly, you'll still have an idea what the subject is about. Base64 is unreadable (for most of us).

Now we have an encoded subject, but our mail reader won't know that. So we need to tell it by formatting our subject as follows:

    "=?" charset "?" encoding "?" encoded-text "?="

where charset is the original character set and encoding is either "Q" for Quoted-Printable or "B" for Base64.
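The format above can be sketched in PHP roughly as follows. The function name encodedWord() is hypothetical, and the set of characters left unescaped in the "Q" branch is a conservative assumption (letters, digits and ! * + - /). The sketch builds a single encoded word only and does not split long subjects.

```php
<?php
// Hypothetical helper (an assumption for illustration): wraps $text, which
// must already be encoded in $charset, in one encoded word.
// $encoding is 'Q' (Quoted-Printable style) or 'B' (Base64).
function encodedWord(string $text, string $charset, string $encoding): string
{
    if ($encoding === 'B') {
        $encoded = base64_encode($text);
    } else {
        // Work byte by byte (no /u modifier): spaces become "_", every
        // other byte outside the safe set becomes "=XX" in uppercase hex.
        $encoded = preg_replace_callback(
            '/[^A-Za-z0-9!*+\/-]/',
            function (array $m): string {
                return $m[0] === ' ' ? '_' : sprintf('=%02X', ord($m[0]));
            },
            $text
        );
    }
    return '=?' . $charset . '?' . $encoding . '?' . $encoded . '?=';
}

echo encodedWord("Voil\xC3\xA0 une message", 'UTF-8', 'Q'), "\n";
// =?UTF-8?Q?Voil=C3=A0_une_message?=
echo encodedWord("Voil\xE0 une message", 'ISO-8859-1', 'B'), "\n";
// =?ISO-8859-1?B?Vm9pbOAgdW5lIG1lc3NhZ2U=?=
```

In real code, PHP's mb_encode_mimeheader() or iconv_mime_encode() are probably the better choice, since they also take care of folding long headers.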
For example, the subject containing the Quoted-Printable ISO-8859-1 string "Voilà une message" is written as:

    Subject: =?ISO-8859-1?Q?Voil=E0_une_message?=

The Base64 version of the ISO-8859-1 string is:

    Subject: =?ISO-8859-1?B?Vm9pbOAgdW5lIG1lc3NhZ2U=?=

The Quoted-Printable version of the UTF-8 string is:

    Subject: =?UTF-8?Q?Voil=C3=A0_une_message?=

The Base64 version of the UTF-8 string is:

    Subject: =?UTF-8?B?Vm9pbMOgIHVuZSBtZXNzYWdl?=

Also note that a single encoded word may not contain more than 75 characters, including the charset, encoding and delimiters. If you have a longer subject, read RFC 1522 (MIME Part Two) or its successor, RFC 2047. It's great bedtime reading ;-)

Mail Body

To use characters other than US-ASCII in the mail body, the message must comply with the MIME standard. This is easier than the subject: you just add a couple of headers. For plain text messages, you can output your UTF-8 data without any conversion after adding these headers:

    Mime-Version: 1.0
    Content-Type: text/plain; charset=UTF-8
    Content-Transfer-Encoding: 8bit

HTML emails are similar:

    Mime-Version: 1.0
    Content-Type: text/html; charset=UTF-8
    Content-Transfer-Encoding: 8bit

If (you think) your end user's email client doesn't support UTF-8, you could encode the message body into ISO-8859-1 (Latin-1) and use HTML entities for 'special' characters. The headers would be:

    Mime-Version: 1.0
    Content-Type: text/html; charset=ISO-8859-1
    Content-Transfer-Encoding: 8bit

How Hotmail handles UTF-8 data is not clear to me. It seems to use a combination of the user's language setting and the e-mail's encoding to determine which character set to use. If your language setting is English, it will display the mail as ISO-8859-1 (Latin-1) or UTF-8, according to the mail's encoding. But if you set your language to Japanese, it will display the message as SHIFT_JIS, regardless of the mail's encoding.
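The mail body headers above could be assembled in PHP along these lines. This is a minimal sketch: mimeHeaders() is a hypothetical helper, and $to, $subject and $utf8Body in the commented-out call are placeholders rather than values from this article.

```php
<?php
// Hypothetical helper (an assumption for illustration): returns the three
// MIME headers for a body sent without conversion, joined with CRLF.
function mimeHeaders(string $contentType = 'text/plain', string $charset = 'UTF-8'): string
{
    $lines = [
        'MIME-Version: 1.0',
        'Content-Type: ' . $contentType . '; charset=' . $charset,
        'Content-Transfer-Encoding: 8bit',
    ];
    // Mail headers are separated by CRLF; double quotes are needed so that
    // PHP interprets \r\n as control characters.
    return implode("\r\n", $lines);
}

$headers = mimeHeaders('text/html');
// mail($to, $subject, $utf8Body, $headers); // placeholders, not run here
```

Calling mimeHeaders('text/html', 'ISO-8859-1') gives the header set for the Latin-1 fallback described above.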
Comments
pretty good advice though i don't like and don't recommend using html entities. good reading. thanks.
ps: i still get a kick out of how hard php makes i18n work, coldfusion is a lot cleaner when it comes to this sort of thing.
Posted by PaulH, Whose homepage is http://www.sustainableGIS.com/blog/cfg11n/ on Wednesday, 12 April 2006 at 6:18
Personally, I'd amend the first step of your workflow to read:
* Whenever data is entered in your application, convert it to UTF-8 if it's not, and then normalize the data to your preferred Unicode normalization form.
Posted by AndrewC, on Friday, 19 May 2006 at 5:03
Great article!
I'm a rusty - very rusty web dev and I had never had to import dynamic text to Flash... Since my content was in French... I had loads of trouble with accents. The Unicode Workflow fixed my problems. Thank you!
Posted by Chris, on Saturday, 03 June 2006 at 11:36
Thanks a LOT for this tutorial! It was great reading and it helped a lot.
Posted by Gregor, on Tuesday, 20 June 2006 at 10:37
You have written a wonderful article about Unicode.
Posted by bari, on Tuesday, 01 August 2006 at 2:48
Yep, unfortunately there are still browsers in use which do not support Unicode. It is really tricky if you are working with asp/asp.net. It does things which you can't control and that's why it sucks so badly.
Posted by anonymous email, Whose homepage is http://www.anonymousspeech.com on Friday, 27 October 2006 at 11:47
Wonderful article, very in-depth also! I still tend to prefer to use html entities in html files and such, but your article fills many gaps in my knowledge of text encoding. This truly is a reference.
You still might want to add though, that PHP files should be saved without a byte order mark (BOM), because the BOM is sent as output before any headers and can create problems with script-generated headers.
Posted by Bram Esposito, Whose homepage is http://www.patpitiee.be on Friday, 10 November 2006 at 5:21
Thank you very much for your help
Posted by valery, on Tuesday, 19 December 2006 at 1:06
Hotmail and yahoo utf-8 problem solved...:
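$char = 'UTF-8';
// Take the first label of the recipient's domain, e.g. "hotmail".
$e = explode('@', $toAddress);
$e = $e[1];
$e = explode('.', $e);
$e = $e[0];
$e = strtolower($e);
if ($e == 'hotmail' || $e == 'yahoo') {
    // Fall back to Latin-1 for these providers.
    // Note: utf8_decode() is deprecated as of PHP 8.2.
    $fromName = utf8_decode($fromName);
    $subject = utf8_decode($subject);
    $message = utf8_decode($message);
    $char = 'ISO-8859-1';
}
// Double quotes are required so that \r\n and $char are interpreted.
$headers = "MIME-Version: 1.0\r\n";
$headers .= "Content-type: text/html; charset=$char\r\n";
$headers .= 'From: ' . $fromName . ' ';
mail($toAddress, $subject, $message, $headers);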
Posted by ff, Whose homepage is http://shoppingP.com on Monday, 25 December 2006 at 9:14
Thanks very much. I have been looking for precisely this information. And it ain't easy to find in any tongue.
Posted by Mark Solomon, Whose homepage is http://hanged.man.tripod.com/majorarcanum on Wednesday, 11 April 2007 at 11:27
thanks for the info - good stuff!
Posted by tag hag, on Monday, 30 April 2007 at 6:11
Really helpful. I found the function at http://uk2.php.net/manual/en/function.imap-8bit.php#61216 worked fine for sending email subjects, but one change needed making:
Change:
$sLine = implode( '=' . chr(13).chr(10), $aMatch[0] ); // add soft crlf's
to
$sLine = '=?utf-8?Q?' . implode( "?=\r\n\t=?utf-8?Q?", $aMatch[0] ) . '?=';
Posted by Rob, on Friday, 04 May 2007 at 10:28
Very nice concept. It did not go too far into depth, but was very well explained. A lot better explained than what you will find on mirc, that's for sure.
Posted by Tyler Dewitt, Whose homepage is http://www.dewittsmedia.com on Sunday, 21 October 2007 at 2:37
When I was converting PHP to ASP, I found
utf8_decode($aUsers[$i]).
What's the replacement for PHP's utf8_decode() function in ASP? I searched all over the web, but I couldn't find a solution. Could you help me?
Posted by srikanth dhondi, on Thursday, 14 February 2008 at 4:17
©2005 MosCom