The Unicode Workflow
Tuesday, 11 April 2006

Input

On the input side we want to check which character set our input is encoded in. If it's not UTF-8, we should convert it to UTF-8.

Web forms

Since most modern web browsers support UTF-8, you can get your form input in UTF-8 format. If you're using HTML, set your HTML editor to save files in UTF-8 format and add this tag:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
If you're generating your form with a scripting language, add this header:
"Content-type: text/html; charset=utf-8"
In PHP you would write:
header("Content-type: text/html; charset=utf-8");
and add the meta tag to your generated HTML. The browser does not tell you which encoding it used for the form data, but if you send out your page in UTF-8, the form data will be posted back in UTF-8 (most of the time).
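
Putting the header and the meta tag together, a generated form page might look like the sketch below. The function name render_form_page(), the field name and the action URL are made up for illustration, and the accept-charset attribute is an extra hint to the browser that goes beyond the recipe above.

```php
<?php
// Hypothetical helper (not from the article): returns a small form page as a string.
// The PHP file itself must be saved as UTF-8 for the charset declaration to be true.
function render_form_page($action)
{
    // Escape the action URL so it is safe inside an HTML attribute.
    $safeAction = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');

    return '<html><head>' . "\n"
        . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />' . "\n"
        . '</head><body>' . "\n"
        . '<form method="post" action="' . $safeAction . '" accept-charset="UTF-8">' . "\n"
        . '<input type="text" name="comment" />' . "\n"
        . '<input type="submit" value="Send" />' . "\n"
        . '</form>' . "\n"
        . '</body></html>' . "\n";
}

// Usage: send the HTTP header first, then the page.
// header('Content-type: text/html; charset=utf-8');
// echo render_form_page('/submit.php');
```

Building the page as a string keeps the header() call separate from the output, so the header can always be sent before the first byte of HTML.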

Now what if the user changes the page encoding, or the browser doesn't support Unicode? (We actually have a client who still uses Internet Explorer 4 on an old Macintosh, and the data he submits is in MacRoman encoding.) So we need to detect which charset we're dealing with.
Add a hidden field to your form containing characters that are encoded differently in each of the character sets we want to check:
<input type="hidden" name="charset_check" value="&auml;&reg;"/>
When we get this value back, the browser has translated the HTML entities to characters and we can check their binary value. Here is a PHP example:

$test = isset($_POST['charset_check']) ? $_POST['charset_check'] : ''; // our hidden field
$hex = bin2hex($test); // the submitted bytes as a hex string
if ($hex === 'c3a4c2ae') { // UTF-8
    $charset = 'utf8';
} elseif ($hex === 'e4ae') { // ISO-8859-1 (=Latin-1) or Windows-1252
    $charset = 'ISO-8859-1';
} elseif ($hex === '8aa8') { // MacRoman
    $charset = 'MacRoman';
} else {
    $charset = 'Unknown';
}
You can test this script here. Or view the source code.

To convert the non-UTF-8 data, you can use PHP's utf8_encode() function, if you're dealing with the ISO-8859-1 character set. For other character sets use the iconv library. In Perl you can use the Encode module.
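
As a rough sketch of that conversion step, the function below takes the charset label produced by the detection code and hands the text to iconv. The function name to_utf8() is hypothetical, and the encoding name 'MACINTOSH' is an assumption about what the local iconv build calls MacRoman; check the names your installation accepts.

```php
<?php
// Sketch: convert submitted text to UTF-8 based on a detected charset label.
// The labels ('utf8', 'ISO-8859-1', 'MacRoman') match the detection snippet;
// the iconv encoding names on the right are assumptions about the local iconv build.
function to_utf8($text, $charset)
{
    // Map our own labels to iconv encoding names.
    $iconvNames = array(
        'ISO-8859-1' => 'ISO-8859-1',
        'MacRoman'   => 'MACINTOSH',
    );

    if ($charset === 'utf8') {
        return $text; // already UTF-8, nothing to do
    }
    if (!isset($iconvNames[$charset])) {
        return false; // unknown charset: let the caller decide what to do
    }
    // iconv() returns the converted string, or false on failure.
    return iconv($iconvNames[$charset], 'UTF-8', $text);
}

// Usage:
// $comment = to_utf8($_POST['comment'], $charset);
```

Returning false for an unknown charset leaves the decision (reject the input, warn the user, assume a default) to the calling code.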

If you don't want to go through all this trouble, you can use a regular expression to check if the input is in UTF-8 or not. If it's not, warn the user. This regular expression can be found on the W3C website.
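
A minimal sketch of such a check in PHP follows. The pattern is modelled on the W3C expression as I remember it, so compare it with the published version before relying on it; the function name looks_like_utf8() is made up for this example.

```php
<?php
// Returns true if $bytes is well-formed UTF-8, false otherwise.
// Text-oriented: tab, LF and CR are the only control characters accepted.
function looks_like_utf8($bytes)
{
    $pattern = '/\A(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\z/x';

    // preg_match() returns 1 on a match, 0 on no match and false on error.
    return preg_match($pattern, $bytes) === 1;
}

// Usage:
// if (!looks_like_utf8($_POST['comment'])) { /* warn the user */ }
```

On very large inputs the PCRE limits may make preg_match() fail, which this sketch reports as "not UTF-8", so it is best suited to form fields rather than whole files.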

ASP (Microsoft's Active Server Pages) is a different story. I am not an ASP developer myself, but my friend Stijn Matthyssen of the company Weblogik is one. We spent a couple of evenings trying to apply the Unicode Workflow in ASP. We soon gave up on classic ASP, because it just didn't work.
ASP.NET handles UTF-8 transparently. Input and output are UTF-8 by default. And although the native encoding of MS Access and MS SQL Server is UTF-16, the .NET framework takes care of the conversion. But because the .NET framework puts all posted form data into String objects before it reaches your code, you can't detect what the original encoding was. (I guess you could parse the raw query string, but if you're going to do that, what's the point of using a framework?)
If you post a form in an encoding that the .NET framework doesn't expect, all the String objects will be empty.

Flash

Macromedia Flash MX (Flash 6) and above use UTF-8 as their native text format, unless someone has put System.useCodepage = true; somewhere in the code. So data coming from a Flash form doesn't need any conversion.

CSV, Tab separated files

Most spreadsheets and databases have a charset export option. If yours doesn't, open the file in a decent text editor such as UltraEdit or BBEdit and convert it to UTF-8.

If users can upload CSV files, use W3C's regular expression to check for valid UTF-8 data. If it's not valid, assume your local charset, which is ISO-8859-1 (Latin-1) or Windows-1252 in English speaking countries and Western Europe.
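
The fallback described above could be sketched like this. To keep the example short, the validity check uses preg_match() with the /u modifier as a stand-in for the W3C expression; the function name csv_bytes_to_utf8() and the upload field name are hypothetical.

```php
<?php
// Sketch: normalise the contents of an uploaded CSV file to UTF-8.
// $fallback is the charset we assume when the data is not valid UTF-8.
function csv_bytes_to_utf8($bytes, $fallback = 'Windows-1252')
{
    // Stand-in validity check: with the /u modifier preg_match() refuses
    // subjects that are not valid UTF-8 and returns false instead of 1.
    if (preg_match('//u', $bytes) === 1) {
        return $bytes;
    }

    // Windows-1252 leaves a few byte values undefined, so iconv() may fail.
    $converted = @iconv($fallback, 'UTF-8', $bytes);
    if ($converted === false) {
        // ISO-8859-1 defines every byte value, so it serves as a last resort.
        $converted = iconv('ISO-8859-1', 'UTF-8', $bytes);
    }
    return $converted;
}

// Usage (the form field name "csv" is hypothetical):
// $utf8 = csv_bytes_to_utf8(file_get_contents($_FILES['csv']['tmp_name']));
```

Data that is already valid UTF-8 passes through untouched, so the function can be applied to every upload without checking first.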

Microsoft Excel up to version 7 (Excel 95) stores strings as 8-bit values. Excel 97, 2000 and XP store strings in UTF-16 format. You can parse an Excel sheet with the Spreadsheet::ParseExcel Perl module and convert it to UTF-8 if needed. I don't know of any decent PHP classes that read Excel files. If you do, please let me know.
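
Once you have the raw bytes of a UTF-16 string, the conversion itself is a single iconv call. The sketch below assumes the bytes are little-endian UTF-16 without a byte order mark; how you get them out of the Excel file is left open, and the function name is made up for this example.

```php
<?php
// Sketch: convert a UTF-16 string (assumed little-endian, no byte order mark)
// to UTF-8. Returns false if iconv() cannot convert the input.
function utf16le_to_utf8($bytes)
{
    return iconv('UTF-16LE', 'UTF-8', $bytes);
}

// Usage:
// $cellText = utf16le_to_utf8($rawCellBytes);
```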



Comments
pretty good advice though i don't like and don't recommend using html entities. good reading. thanks.

ps: i still get a kick out of how hard php makes i18n work, coldfusion is a lot cleaner when it comes to this sort of thing.

  Posted by PaulH, Whose homepage is http://www.sustainableGIS.com/blog/cfg11n/ on Wednesday, 12 April 2006 at 6:18

Personally, I'd amend the first step of your workflow to read:

* Whenever data is entered in your application, convert it to UTF-8 if it's not, and then normalize the data to your preferred Unicode normalization form.

  Posted by AndrewC, on Friday, 19 May 2006 at 5:03

Great article!

I'm a rusty - very rusty web dev and I had never had to import dynamic text to Flash... Since my content was in French... I had loads of trouble with accents. The Unicode Workflow fixed my problems. Thank you!

  Posted by Chris, on Saturday, 03 June 2006 at 11:36

Thanks a LOT for this tutorial! It was great reading and it helped a lot.
  Posted by Gregor, on Tuesday, 20 June 2006 at 10:37

u have writen wounderfull articale about unicode..
  Posted by bari, on Tuesday, 01 August 2006 at 2:48

Yep, unfortunatelly there are still browsers running which do not support unicode. It is really tricky if you are working with asp/asp.net. It do things which you cant controll and that why it sucks so badly.
  Posted by anonymous email, Whose homepage is http://www.anonymousspeech.com on Friday, 27 October 2006 at 11:47

Wonderfull article, very in-depth also! I still tend to prefer to use html entities in html files and such, but your article fills many gaps in my knowledge of text encoding. This truly is a reference.


You still might want to add though, that PHP files should be saved without a byte order mark, because it is sent as output before any headers and can create problems with script generated headers.

  Posted by Bram Esposito, Whose homepage is http://www.patpitiee.be on Friday, 10 November 2006 at 5:21

Thank you very much for your help
  Posted by valery, on Tuesday, 19 December 2006 at 1:06

Hotmail and yahoo utf-8 problem solved...:

$char='UTF-8';
$e= explode('@',$toAddress);
$e=$e[1];
$e= explode('.',$e);
$e=$e[0];
$e=strtolower($e);
if($e=='hotmail' || $e=='yahoo'){
   $fromName=utf8_decode($fromName);
   $subject=utf8_decode($subject);
   $message=utf8_decode($message);
   $char='ISO-8859-1';
}

$headers = "MIME-Version: 1.0\r\n";
$headers .= "Content-type: text/html; charset=$char\r\n";
$headers .= 'From: '.$fromName.' ';

mail($toAddress, $subject, $message, $headers);

  Posted by ff, Whose homepage is http://shoppingP.com on Monday, 25 December 2006 at 9:14

Thanks very much. I have been looking for precisely this information. And it ain't easy to find in any tongue.
  Posted by Mark Solomon, Whose homepage is http://hanged.man.tripod.com/majorarcanum on Wednesday, 11 April 2007 at 11:27

thanks for the info - good stuff!
  Posted by tag hag, on Monday, 30 April 2007 at 6:11

Really helpful. I found the function at http://uk2.php.net/manual/en/function.imap-8bit.php#61216 worked fine for sending email subjects, but one change needed making:

Change:
$sLine = implode( '=' . chr(13).chr(10), $aMatch[0] ); // add soft crlf's

to
$sLine = '=?utf-8?Q?'.implode( '?=rnt=?utf-8?Q?' . chr(13).chr(10), $aMatch[0] ).'?=';

  Posted by Rob, on Friday, 04 May 2007 at 10:28

Very nice concept it did not get to into depth, but was very well explained. A lot better explained then what you will find on mirc thats for sure.
  Posted by Tyler Dewitt, Whose homepage is http://www.dewittsmedia.com on Sunday, 21 October 2007 at 2:37

When I was converting php to asp, I've found

utf8_decode($aUsers[$i]).

Wot's the php utf8_decode() func replacement in asp ???????? I searched all the web, but I couldn't find my solution..! could u help me...

  Posted by srikanth dhondi, on Thursday, 14 February 2008 at 4:17

