The Unicode Workflow
Tuesday, 11 April 2006

Application

At this point, all text data coming from external sources is UTF-8 encoded. But the application itself may also contain hard-coded text data, and we need to make sure that this data is UTF-8 encoded too.

Scripts, HTML and text files
Hard-coded string literals in your code take on the character set of your script file, so set your editor to save files as UTF-8. If your script files are saved in ISO-8859-1, your string literals will be ISO-8859-1 too.
If you save your HTML pages or templates in ISO-8859-1 and use HTML entities for 'special' characters, everything will work fine. But to stay consistent with The Unicode Workflow and avoid mistakes, you should save your HTML files as UTF-8. The same goes for plain text files or XML files that contain data to be output.
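
If you are not sure how an existing file was saved, one rough check is to see whether its bytes decode cleanly as UTF-8. The sketch below is written in Python purely for illustration; the helper name is_valid_utf8 is an assumption, not part of any library, and the sample strings stand in for file contents.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same text as two differently configured editors would save it.
saved_as_utf8 = "café".encode("utf-8")         # b'caf\xc3\xa9'
saved_as_latin1 = "café".encode("iso-8859-1")  # b'caf\xe9'

print(is_valid_utf8(saved_as_utf8))    # True
print(is_valid_utf8(saved_as_latin1))  # False
```

For a real file you would pass in the result of open(path, "rb").read(). A file that contains only ASCII characters passes this check whatever encoding the editor was set to, so the check only says something once the file contains non-ASCII characters.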

In every decent editor you can set the encoding. A few examples:
  • Dreamweaver:
    general setting: Preferences > New Document > Default Encoding
    or per document: Modify > Page Properties > Document Encoding
  • Zend Studio: Preferences > Editing > Encoding
  • BBEdit: per document, in one of the menus in the document window
  • UltraEdit: File > Conversions > ASCII to UTF-8
  • HomeSite: Options > Settings > File Settings: check the "enable non-ANSI file encoding" checkbox; you can then select UTF-8 from the pull-down menu in the save dialog box
  • SubEthaEdit: Format > File Encodings

Databases
In MySQL, data is saved in VARCHAR, TEXT and BLOB fields exactly as you put it in. In other words, if you insert a UTF-8 encoded string, it will be saved as a UTF-8 encoded string, no matter how you set up the encoding of your table or columns. The encoding of your database, table or column only affects operations such as sorting and full-text searching. Input and output are not affected by this setting.
Before MySQL 4.1 there was no UTF-8 option, but your UTF-8 data would still be stored byte for byte without a problem. MySQL would simply treat a two-byte UTF-8 character as a sequence of two 8-bit characters.
Since sorting and searching are probably among the main reasons to use a database, you'll have to tell MySQL that you're working with UTF-8 encoded text.
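
The difference between characters and bytes is easy to see outside the database. The following Python sketch is only an illustration: ISO-8859-1 stands in here for "a character set with one byte per character", which is an assumption about how such a server would view the data, not a description of MySQL internals.

```python
# One character, "é" (U+00E9), as the application sees it.
text = "é"
utf8_bytes = text.encode("utf-8")

print(len(text))        # 1 character
print(len(utf8_bytes))  # 2 bytes: 0xC3 0xA9

# A server with no notion of UTF-8 treats each byte as its own 8-bit
# character. ISO-8859-1 stands in for such a single-byte character set.
as_single_byte_chars = utf8_bytes.decode("iso-8859-1")
print(len(as_single_byte_chars))  # 2
print(as_single_byte_chars)       # Ã©

# The bytes themselves are untouched, so reading them back as UTF-8 works.
round_trip = as_single_byte_chars.encode("iso-8859-1").decode("utf-8")
print(round_trip == text)  # True
```

Storage and retrieval survive because the bytes never change, but anything that counts, compares or sorts characters sees two characters where there is really one.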

You can create fields with the UTF-8 character set with this SQL statement:
CREATE TABLE myTable (
    -- an AUTO_INCREMENT column must be defined as a key
    id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    name VARCHAR(255) CHARACTER SET utf8
);

If you have an existing MySQL database in a non-UTF-8 character set, you can change it to UTF-8 with or without converting your existing data.
If you alter a column from the server's default character set to UTF-8, MySQL will convert the data to UTF-8.
ALTER TABLE myTable MODIFY name VARCHAR(255) CHARACTER SET utf8;
This is a useful feature for converting ISO-XXXX encoded databases to UTF-8, but it may not be what you want, for example if the column already holds UTF-8 data (from a Flash application, say) and you simply never set its character set to UTF-8.
In that case you can trick MySQL by first altering the column to a binary type and then altering it to UTF-8. No conversion takes place between binary data and text data. Use VARBINARY rather than BINARY here, because fixed-length BINARY values are padded to the full column length.
ALTER TABLE myTable MODIFY name VARBINARY(255);
ALTER TABLE myTable MODIFY name VARCHAR(255) CHARACTER SET utf8;
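
To see why the detour through a binary type matters, here is a small Python sketch of the two cases. It only models the byte-level effect and makes some assumptions: the function name convert_latin1_to_utf8 is hypothetical, and ISO-8859-1 stands in for the server's old default character set.

```python
def convert_latin1_to_utf8(stored: bytes) -> bytes:
    """Model a character set conversion: read every stored byte as an
    ISO-8859-1 character and re-encode the result as UTF-8."""
    return stored.decode("iso-8859-1").encode("utf-8")


# Case 1: the column really holds ISO-8859-1 data. Converting is correct.
latin1_stored = "café".encode("iso-8859-1")
case1 = convert_latin1_to_utf8(latin1_stored).decode("utf-8")
print(case1)  # café

# Case 2: the column is labelled ISO-8859-1 but already holds UTF-8 bytes.
utf8_stored = "café".encode("utf-8")

# Converting directly encodes the data a second time and garbles it.
case2_converted = convert_latin1_to_utf8(utf8_stored).decode("utf-8")
print(case2_converted)  # cafÃ©

# Going through a binary type leaves the bytes alone; only the label changes.
case2_relabelled = utf8_stored.decode("utf-8")
print(case2_relabelled)  # café
```

The garbled result in the second case is the typical symptom of double-encoded data, so it is worth checking a few rows with accented characters before and after altering a column.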

As said before, MS Access and MS SQL Server use UTF-16 as their native format, but ASP.NET handles this transparently.



Comments
pretty good advice though i don't like and don't recommend using html entities. good reading. thanks.

ps: i still get a kick out of how hard php makes i18n work, coldfusion is a lot cleaner when it comes to this sort of thing.

  Posted by PaulH, Whose homepage is http://www.sustainableGIS.com/blog/cfg11n/ on Wednesday, 12 April 2006 at 6:18

Personally, I'd amend the first step of your workflow to read:

* Whenever data is entered in your application, convert it to UTF-8 if it's not, and then normalize the data to your preferred Unicode normalization form.

  Posted by AndrewC, on Friday, 19 May 2006 at 5:03

Great article!

I'm a rusty - very rusty web dev and I had never had to import dynamic text to Flash... Since my content was in French... I had loads of trouble with accents. The Unicode Workflow fixed my problems. Thank you!

  Posted by Chris, on Saturday, 03 June 2006 at 11:36

Thanks a LOT for this tutorial! It was great reading and it helped a lot.
  Posted by Gregor, on Tuesday, 20 June 2006 at 10:37

You have written a wonderful article about Unicode.
  Posted by bari, on Tuesday, 01 August 2006 at 2:48

Yep, unfortunately there are still browsers running which do not support Unicode. It is really tricky if you are working with ASP/ASP.NET. It does things which you can't control, and that's why it sucks so badly.
  Posted by anonymous email, Whose homepage is http://www.anonymousspeech.com on Friday, 27 October 2006 at 11:47

Wonderful article, very in-depth also! I still tend to prefer to use HTML entities in HTML files and such, but your article fills many gaps in my knowledge of text encoding. This truly is a reference.


You still might want to add, though, that PHP files should be saved without a byte order mark, because the BOM is sent as output before any headers and can create problems with script-generated headers.

  Posted by Bram Esposito, Whose homepage is http://www.patpitiee.be on Friday, 10 November 2006 at 5:21

Thank you very much for your help
  Posted by valery, on Tuesday, 19 December 2006 at 1:06

Hotmail and yahoo utf-8 problem solved...:

$char='UTF-8';
$e= explode('@',$toAddress);
$e=$e[1];
$e= explode('.',$e);
$e=$e[0];
$e=strtolower($e);
if($e=='hotmail' || $e=='yahoo'){
   $fromName=utf8_decode($fromName);
   $subject=utf8_decode($subject);
   $message=utf8_decode($message);
   $char='ISO-8859-1';
}

$headers = "MIME-Version: 1.0\r\n";
$headers .= "Content-type: text/html; charset=$char\r\n";
$headers .= 'From: '.$fromName.' ';

mail($toAddress, $subject, $message, $headers);

  Posted by ff, Whose homepage is http://shoppingP.com on Monday, 25 December 2006 at 9:14

Thanks very much. I have been looking for precisely this information. And it ain't easy to find in any tongue.
  Posted by Mark Solomon, Whose homepage is http://hanged.man.tripod.com/majorarcanum on Wednesday, 11 April 2007 at 11:27

thanks for the info - good stuff!
  Posted by tag hag, on Monday, 30 April 2007 at 6:11

Really helpful. I found the function at http://uk2.php.net/manual/en/function.imap-8bit.php#61216 worked fine for sending email subjects, but one change needed making:

Change:
$sLine = implode( '=' . chr(13).chr(10), $aMatch[0] ); // add soft crlf's

to
$sLine = '=?utf-8?Q?' . implode( "?=\r\n\t=?utf-8?Q?", $aMatch[0] ) . '?=';

  Posted by Rob, on Friday, 04 May 2007 at 10:28

Very nice concept. It did not go too far into depth, but it was very well explained. A lot better explained than what you will find on mIRC, that's for sure.
  Posted by Tyler Dewitt, Whose homepage is http://www.dewittsmedia.com on Sunday, 21 October 2007 at 2:37

When I was converting PHP to ASP, I found

utf8_decode($aUsers[$i]).

What's the replacement for PHP's utf8_decode() function in ASP? I searched all over the web, but I couldn't find a solution. Could you help me?

  Posted by srikanth dhondi, on Thursday, 14 February 2008 at 4:17

