Johan van Mol .org
The Unicode Workflow
Tuesday, 11 April 2006
Article Index
The Unicode Workflow
Input
Application
Output
Sample code
I was tired of finding myself utf8_encode()'ing and utf8_decode()'ing all over the place when dealing with multi-lingual Flash and HTML sites. My solution is the Unicode Workflow, a big word for a couple of simple rules which will save you a lot of headaches.

Introduction

I happen to live in a country in which about 60% of the population speaks Dutch and the other 40% French. Sorry, I forgot our 1% German-speaking friends, but most of the websites my company builds are Dutch and French and sometimes English and German.
We usually start developing the Dutch site, a language that, like English, has hardly any accents, and near the end of the project we start implementing the French text (lots of accents!). Then we notice all kinds of funny characters and wrong accents in the French text and we find ourselves utf8_encode()'ing and utf8_decode()'ing all over the place.
None of this applies if your app is built on a single platform and outputs only HTML; ASCII text and HTML entities will do just fine. But if you get your input in different languages from web forms, databases, CSV files and Excel sheets written on Windows, Linux or Macintosh, and you have to output XML to Flash, HTML to browsers and plain text to mail clients, you'll run into problems if you don't have a strategy.

Here is an example of what you may end up with:

  • Latin-1 (ISO-8859-1): the standard US/Western Europe character set
  • CP1252: the Windows codepage, differs slightly from Latin-1
  • MacRoman: the US/Western Europe character set on MacOS 9 and below
  • UTF-8: the 8-bit encoding of the Unicode character set

[Figure: Mixed encodings]

Unicode

The Unicode standard (more info at www.unicode.org) defines a single character set that aims to cover the characters of all the world's writing systems. So instead of converting from one codepage into another when switching platforms or languages, you can use one character set on all platforms and for all languages.
All modern operating systems, browsers and mail clients support Unicode. The XML standard even requires every XML parser to understand UTF-8 and UTF-16.
Unicode comes in 3 encodings: UTF-8, UTF-16 and UTF-32, but for web and XML applications UTF-8 is the usual choice. UTF-8 is backwards compatible with ASCII. Simply put: 'regular' characters (with no accents) are stored as a single byte with the same value as their ASCII counterparts, and 'special' characters take 2, 3 or 4 bytes. If you open a UTF-8 file in a program that can't handle UTF-8, the 'regular' characters will be readable and each 'special' character will be displayed as 2 or more funny characters (e.g. "é" will be displayed as "Ã©").
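
To make the byte counts concrete, here is a minimal PHP sketch. It assumes PHP 7 or later with the mbstring extension enabled; the sample characters and array keys are my own picks, not from the article. It prints how many bytes a few characters take in UTF-8 and reproduces the "funny characters" effect by deliberately reading the UTF-8 bytes of "é" as Latin-1.

```php
<?php
// Byte sequences are written as hex escapes so the example does not
// depend on the encoding this file happens to be saved in.
$samples = [
    'e'       => "e",                 // U+0065, plain ASCII
    'e-acute' => "\xC3\xA9",          // U+00E9
    'euro'    => "\xE2\x82\xAC",      // U+20AC
    'g-clef'  => "\xF0\x9D\x84\x9E",  // U+1D11E
];

foreach ($samples as $name => $utf8) {
    // strlen() counts bytes, mb_strlen() counts characters.
    printf("%-8s %d byte(s), %d character(s)\n",
        $name, strlen($utf8), mb_strlen($utf8, 'UTF-8'));
}

// A program that assumes Latin-1 sees the two bytes of "é" as two
// separate characters. Converting "from Latin-1" simulates that misreading.
$misread = mb_convert_encoding("\xC3\xA9", 'UTF-8', 'ISO-8859-1');
echo bin2hex($misread), "\n"; // c383c2a9, the UTF-8 bytes of "Ã©"
```

The ASCII letter takes one byte, while the other three take 2, 3 and 4 bytes respectively, each still counting as a single character.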

The Unicode Workflow

To avoid these encoding problems, I suggest a simple set of rules, which I call the Unicode Workflow (actually it's a UTF-8 workflow).
The rules are:

  • Whenever data is entered in your application, convert it to UTF-8 if it's not.
  • Keep all data in your application in UTF-8 format.
  • Whenever you must output non-UTF-8 data, convert it at the last moment.
[Figure: UTF-8 encoding]
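
The three rules above could be sketched in PHP roughly as follows. This is only an illustration under assumptions: it needs PHP 7 or later with the mbstring extension, the helper names to_utf8() and from_utf8() are hypothetical, and the caller is assumed to know the encoding of the incoming data. Note that mb_check_encoding() only tests whether the bytes form valid UTF-8; some Latin-1 byte sequences are also valid UTF-8, so knowing the real source encoding is always safer than guessing.

```php
<?php
// Hypothetical helpers, named here for illustration only.

// Rule 1: convert incoming data to UTF-8 unless it already is valid UTF-8.
function to_utf8(string $text, string $sourceEncoding): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $sourceEncoding);
}

// Rule 3: convert away from UTF-8 only at the moment of output.
function from_utf8(string $text, string $targetEncoding): string
{
    return mb_convert_encoding($text, $targetEncoding, 'UTF-8');
}

// Input: "café" as it might arrive from a Latin-1 source (0xE9 is "é").
$fromForm = "caf\xE9";
$name = to_utf8($fromForm, 'ISO-8859-1');

// Rule 2: everything inside the application stays UTF-8.
$greeting = 'Bonjour ' . $name;

// Output: a consumer that can only handle Latin-1 gets a converted copy.
$forLegacyClient = from_utf8($greeting, 'ISO-8859-1');

echo bin2hex($name), "\n"; // 636166c3a9
```

Keeping the conversions at the edges means the rest of the code never has to ask which encoding a string is in.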

 



Comments
pretty good advice though i don't like and don't recommend using html entities. good reading. thanks.

ps: i still get a kick out of how hard php makes i18n work, coldfusion is a lot cleaner when it comes to this sort of thing.

  Posted by PaulH, Whose homepage is http://www.sustainableGIS.com/blog/cfg11n/ on Wednesday, 12 April 2006 at 6:18

Personally, I'd amend the first step of your workflow to read:

* Whenever data is entered in your application, convert it to UTF-8 if it's not, and then normalize the data to your preferred Unicode normalization form.

  Posted by AndrewC, on Friday, 19 May 2006 at 5:03

Great article!

I'm a rusty - very rusty web dev and I had never had to import dynamic text to Flash... Since my content was in French... I had loads of trouble with accents. The Unicode Workflow fixed my problems. Thank you!

  Posted by Chris, on Saturday, 03 June 2006 at 11:36

Thanks a LOT for this tutorial! It was great reading and it helped a lot.
  Posted by Gregor, on Tuesday, 20 June 2006 at 10:37

You have written a wonderful article about Unicode.
  Posted by bari, on Tuesday, 01 August 2006 at 2:48

Yep, unfortunately there are still browsers in use which do not support Unicode. It is really tricky if you are working with ASP/ASP.NET. It does things which you can't control, and that's why it sucks so badly.
  Posted by anonymous email, Whose homepage is http://www.anonymousspeech.com on Friday, 27 October 2006 at 11:47

Wonderful article, very in-depth also! I still tend to prefer to use html entities in html files and such, but your article fills many gaps in my knowledge of text encoding. This truly is a reference.

You still might want to add, though, that PHP files should be saved without a byte order mark, because the BOM is sent to the browser as output before anything else and can create problems with script-generated headers.

  Posted by Bram Esposito, Whose homepage is http://www.patpitiee.be on Friday, 10 November 2006 at 5:21

Thank you very much for your help
  Posted by valery, on Tuesday, 19 December 2006 at 1:06

Hotmail and yahoo utf-8 problem solved...:

$char = 'UTF-8';
// Take the first label of the recipient's domain, e.g. "hotmail"
$e = explode('@', $toAddress);
$e = $e[1];
$e = explode('.', $e);
$e = $e[0];
$e = strtolower($e);
if ($e == 'hotmail' || $e == 'yahoo') {
   // Fall back to Latin-1 for these providers
   $fromName = utf8_decode($fromName);
   $subject = utf8_decode($subject);
   $message = utf8_decode($message);
   $char = 'ISO-8859-1';
}

// Double quotes so that \r\n and $char are interpreted
$headers = "MIME-Version: 1.0\r\n";
$headers .= "Content-type: text/html; charset=$char\r\n";
$headers .= 'From: '.$fromName.' ';

mail($toAddress, $subject, $message, $headers);

  Posted by ff, Whose homepage is http://shoppingP.com on Monday, 25 December 2006 at 9:14

Thanks very much. I have been looking for precisely this information. And it ain't easy to find in any tongue.
  Posted by Mark Solomon, Whose homepage is http://hanged.man.tripod.com/majorarcanum on Wednesday, 11 April 2007 at 11:27

thanks for the info - good stuff!
  Posted by tag hag, on Monday, 30 April 2007 at 6:11

Really helpful. I found the function at http://uk2.php.net/manual/en/function.imap-8bit.php#61216 worked fine for sending email subjects, but one change needed making:

Change:
$sLine = implode( '=' . chr(13).chr(10), $aMatch[0] ); // add soft crlf's

to
$sLine = '=?utf-8?Q?'.implode( "?=\r\n\t=?utf-8?Q?" . chr(13).chr(10), $aMatch[0] ).'?=';

  Posted by Rob, on Friday, 04 May 2007 at 10:28

Very nice concept. It did not go too far into depth, but it was very well explained. A lot better explained than what you will find on mIRC, that's for sure.
  Posted by Tyler Dewitt, Whose homepage is http://www.dewittsmedia.com on Sunday, 21 October 2007 at 2:37

When I was converting PHP to ASP, I found

utf8_decode($aUsers[$i]).

What's the replacement for PHP's utf8_decode() function in ASP? I searched all over the web, but I couldn't find a solution. Could you help me?

  Posted by srikanth dhondi, on Thursday, 14 February 2008 at 4:17

