Character Sets / Character Encoding Issues

Introduction

Let’s first define some terms to make it easier to understand the following sections (taken from the book XML Internationalization and Localization). See also the introductory WIKI page on i18n.

A character is the smallest component of written language that has a semantic value. Examples of characters are letters, ideographs (e.g. Chinese characters), punctuation marks, digits etc.

A character set is a group of characters without associated numerical values. An example of a character set is the Latin alphabet or the Cyrillic alphabet.

Coded character sets are character sets in which each character is associated with a scalar value: a code point. For example, in ASCII, the uppercase letter “A” has the value 65. Examples of coded character sets are ASCII and Unicode. A coded character set is meant to be encoded, i.e. converted into a digital representation so that the characters can be serialized in files, databases, or strings. This is done through a character encoding scheme or encoding. The encoding method maps each character value to a given sequence of bytes.

In many cases, the encoding is just a direct projection of the scalar values, and there is no real distinction between the coded character set and its serialized representation. For example, in ISO 8859-1 (Latin 1), the character “A” (code point 65) is encoded as a byte 0x41 (i.e. 65). In other cases, the encoding method is more complex. For example, in UTF-8, an encoding of Unicode, the character “á” (225) is encoded as two bytes: 0xC3 and 0xA1.

Unicode and its encodings

For Unicode (also called Universal Character Set or UCS), a coded character set developed by the Unicode Consortium, there are several possible encodings: UTF-8, UTF-16, and UTF-32. Of these, UTF-8 is most relevant for a web application.

UTF-8

UTF-8 is a multibyte 8-bit encoding in which each Unicode scalar value is mapped to a sequence of one to four bytes. One of the main advantages of UTF-8 is its compatibility with ASCII. If no extended characters are present, there is no difference between a document encoded in ASCII and one encoded in UTF-8.
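
As a rough illustration of the one-to-four byte mapping, the following sketch prints the encoded bytes of four sample code points. It assumes PHP 7 or later (for the "\u{...}" escape syntax); the sample code points and the $samples variable are chosen here for illustration only.

```php
<?php
// One-character UTF-8 strings, written with Unicode escapes so the
// encoding of this script file does not matter.
$samples = [
    'U+0041'  => "\u{41}",     // Latin capital A (plain ASCII)
    'U+00E1'  => "\u{E1}",     // a with acute
    'U+20AC'  => "\u{20AC}",   // euro sign
    'U+1F600' => "\u{1F600}",  // a code point outside the Basic Multilingual Plane
];

foreach ($samples as $label => $char) {
    // strlen() counts bytes; bin2hex() shows which bytes were produced.
    printf("%-8s %d byte(s): %s\n", $label, strlen($char), bin2hex($char));
}
```

The first line of output shows a single byte, 41, which is exactly the ASCII encoding of “A”; the second shows c3a1, the two bytes quoted in the introduction.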

One thing to take into consideration when using UTF-8 with PHP is that characters are represented with a varying number of bytes. Some PHP functions do not take this into account and will not work as expected (more on this below).

See also utf-8

PHP and character sets

This page is going to assume you’ve done a little reading and absorbed some paranoia about the issue of character sets and character encoding in web applications. If you haven’t, try here;

“When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.”

“Darn near impossible” is perhaps too extreme but, certainly in PHP, if you simply “accept the defaults” you probably will end up with all kinds of strange characters and question marks the moment anyone outside the US or Western Europe submits some content to your site.

This page won’t rehash existing discussions. Suffice to say you should be thinking in terms of Unicode, the grand unified solution to all character issues and, in particular, UTF-8, a specific encoding of Unicode and the best solution for PHP applications.

Everybody Gets it wrong

Just so you don’t get the idea that only “serious programmers” can understand the problem, and as a taster for the type of problems you can have, right now (i.e. they may fix it later) on IBM’s new PHP Blog @ developerworks, here’s what you see if you right click > Page Info in Firefox;

Firefox says it regards the character encoding as being ISO-8859-1. That’s actually coming from an HTTP header - if you click on the “Headers” tab you see;

Content-Type: text/html;charset=ISO-8859-1

Meanwhile amongst the HTML meta tags (scroll down past the whitespace) though you find;

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

Now that’s not a train smash (yet) but it should raise the flag that something isn’t quite right. The meta tag will be ignored by browsers so content will be regarded as being encoded as ISO-8859-1, thanks to the HTTP header.

This raises the question - how is the content on the blog actually encoded? If that Content-Type: text/html;charset=ISO-8859-1 header is also turning up in a form that writers on the blog use to submit content, it will probably mean the content being stored will have been encoded as ISO-8859-1. If that’s the case, the real problem will rear its head in the blog’s RSS feed, which currently does not specify the charset with an HTTP header - just that it’s XML;

Content-Type: text/xml

...but does declare UTF-8 as the encoding in the XML content itself;

<?xml version="1.0" encoding="UTF-8"?>

Anyone subscribed to this feed is going to see some weird characters appearing, should the blog contain anything but pure ASCII characters, because there’s a very good chance the content is actually stored as ISO-8859-1, the guess here being that the “back end” content admin page (containing a form for adding content) is also telling the browser it’s ISO-8859-1.

Hopefully, by the time you’ve read this document, you’ll understand what exactly is going wrong here and why.

PHP's Problem with Character Encoding

The basic problem PHP has with character encoding is that it has a very simple notion of what a character is: one character equals one byte. To be more precise, most of PHP’s string-related functionality (see common_problem_areas_with_utf-8 for further details) makes this assumption, but to support a wide range of characters (or all characters, ever, as Unicode does), you need more than one byte to represent a character.

An example in code: in his i18n Survival Guide, Sam Ruby recommends using the string Iñtërnâtiônàlizætiøn for testing. Counting by eye, you can see it contains 20 characters;

Iñtërnâtiônàlizætiøn
12345678901234567890

But counted with PHP’s strlen function...

<?php
echo strlen('Iñtërnâtiônàlizætiøn');
?>

PHP will report 27 characters. That’s because the string, encoded as UTF-8, contains multi-byte characters which PHP’s strlen function will count as being multiple characters.

Life gets even more interesting if you run the following;

<?php
header('Content-Type: text/plain; charset=ISO-8859-1');
 
$str = 'Iñtërnâtiônàlizætiøn';
 
$out = '';
$pos = '';
for($i = 0, $j = 1; $i < strlen($str); $i++, $j++) {
    $out .= $str[$i];
    if ( $j == 10 ) $j = 0;
    $pos .= $j;
}
 
echo $out."\n".$pos;
?>

You should see something like;

IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n
123456789012345678901234567

Which gives you an idea of what PHP’s string related functionality actually “sees” when working with this string.

The bottom line is all those string functions you’ve happily littered all over your code, plus a bunch of other stuff like your use of regular expressions, are now in doubt. Is there a character set issue lurking in there, ready to spray strange characters all over your content? The good news is it’s really not a big jump to being able to support any and all characters, so long as you make use of UTF-8.

One important point (and more good news) which may not be obvious is PHP doesn’t attempt to convert / massage the contents of strings. Even though its string capabilities don’t “understand” anything other than 1 character = 1 byte, PHP won’t “mess” with the encoding, leaving it “as is”. That means, for example;

$some_utf8 = $_POST['comment'];
 
echo 'Foo '.$some_utf8.' bar'; # note this is VERY bad security - XSS!
 
$utf8_words = array('Iñtërnâtiônàlizætiøn', 'foo', 'Iñtërnâtiônàlizætiøn');
$utf8_words = implode(' ',$utf8_words);
 
$utf8_string = 'Iñtërnâtiônàlizætiøn';
 
print_r(explode('i',$utf8_string));

None of the above will “damage” or alter the character encoding. PHP just passes the strings through blindly.

One more interesting example;

$utf8_string = 'Iñtërnâtiônàlizætiøn';
 
print_r(explode('à',$utf8_string));

Although it’s passing the string à as the separator to explode, because well formed UTF-8 has the property that every sequence is unique, there’s no chance the bytes of à will be mistaken for part of another character, so we can safely explode the string using it.

What may also be a little confusing is PHP scripts themselves can contain more or less any sort of encoding - the PHP parser is generally fine with this, although you need to be careful when it comes to the byte order mark (BOM) - see Unicode, WordPress, Panther Server and BBEdit: UTF-8 with or without BOM
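
For content you read into PHP (rather than the script file itself), a leading BOM can be dropped before further processing. The following is a minimal sketch; the function name strip_utf8_bom is made up for this page and is not part of PHP.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 byte order mark
// (the three bytes EF BB BF), if one is present.
function strip_utf8_bom(string $text): string
{
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        return substr($text, 3);
    }
    return $text;
}
```

A BOM inside a PHP script file is a different matter: it sits before the opening <?php tag, so PHP sends it as output, which is why saving scripts without a BOM is the usual advice.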

But what about mbstring, iconv etc.?

Yep, there are PHP extensions to help with character encoding issues but (if you use a shared host, you’ve probably already got that sinking feeling) they’re not enabled by default in PHP4. Two of particular note;

  • iconv: The iconv extension became a default part of PHP5 but it doesn’t offer you a magic wand that will make all problems go away. It probably has most value either when migrating old content to UTF-8 or when interfacing with systems that can’t deliver US-ASCII, ISO-8859-1 or UTF-8, such as an RSS feed encoded in BIG5 that your PHP script reads.
  • mbstring: The mbstring extension is potentially a magic wand, as it provides a mechanism to override a large number of PHP’s string functions. Bad news is it’s not available by default in PHP. Third-hand reports say it used to be pretty unstable but in the last year or so has stabilized (more detail appreciated).

It may be you can take advantage of these extensions in your own environment but if you’re writing software for other people to install for themselves, that makes them bad news.
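
If you do distribute code to environments you don’t control, one common pattern is to probe for the extensions at runtime and fall back to something always present. The sketch below does this for a character count; count_utf8_chars is a hypothetical name, and the fallback order is just one reasonable choice.

```php
<?php
// Hypothetical helper: count characters in a UTF-8 string using whichever
// facility happens to be available in this PHP installation.
function count_utf8_chars(string $str): int
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($str, 'UTF-8');
    }
    if (function_exists('iconv_strlen')) {
        $n = iconv_strlen($str, 'UTF-8');
        if ($n !== false) {
            return $n;
        }
    }
    // PCRE fallback: with /u, "." matches one whole UTF-8 character.
    // preg_match_all() returns false on malformed input; in that case
    // this sketch simply reports the byte count.
    $n = preg_match_all('/./us', $str);
    return $n === false ? strlen($str) : $n;
}
```

The three branches can disagree on malformed input, so a helper like this is best combined with a well formedness check on anything coming from outside.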

And PHP 6?

Then all our problems magically vanish ;-) Specifically PHP 6 should have native understanding of Unicode and default to UTF-8 for output as well as a bunch of other stuff, building on the International_Components_for_Unicode project.

Strategy for Handling Character Encoding in PHP Applications

So what do you do when the tools you have for the job (PHP) don’t provide the facilities you need? You make it someone else’s problem. In fact something else’s - the web browser’s. Firefox and IE (and no doubt Konqueror/Safari as well but can’t speak first hand) have excellent support for many different character sets, the most important being UTF-8. All you have to do is tell them “everything is UTF-8” and your problem goes away (well almost).

Why UTF-8?

What makes UTF-8 special is, first, that it’s an encoding of Unicode and, second, that it’s backwards compatible with ASCII. From here;

Character codes less than 128 (effectively, the ASCII repertoire) are presented “as such”, using one octet for each code (character) All other codes are presented, according to a relatively complicated method, so that one code (character) is presented as a sequence of two to six octets, each of which is in the range 128 - 255. This means that in a sequence of octets, octets in the range 0 - 127 (”bytes with most significant bit set to 0”) directly represent ASCII characters, whereas octets in the range 128 - 255 (”bytes with most significant bit set to 1”) are to be interpreted as really encoded presentations of characters.

There are some important consequences of this;

  1. if you have some text encoded only as ASCII, you can immediately declare it as UTF-8 without needing to convert it
  2. there’s zero likelihood, when doing things like searching a UTF-8 string with PHP’s string functions, that anything that’s not an ASCII character could be mistaken for an ASCII character. So strpos($utf_string,'<'); won’t mistake any other characters, split into their bytes, as being a < character.
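
The second point can be checked with a few lines of PHP. This is only a sketch: it assumes the script file is saved as UTF-8, and the sample string is made up for the purpose.

```php
<?php
// Assumes this script file is saved as UTF-8.
$utf8 = 'Iñtërnâtiônàlizætiøn <b>bold</b>';

// The byte-oriented strpos() still finds the real "<": no byte of a
// multibyte character can equal an ASCII byte.
$pos = strpos($utf8, '<');
echo substr($utf8, $pos), "\n";

// Check the claim directly: every byte of the non-ASCII characters
// has its most significant bit set (value 128 or above).
$allHigh = true;
foreach (str_split('ñëâôàæø') as $byte) {
    if (ord($byte) < 0x80) {
        $allHigh = false;
    }
}
var_dump($allHigh);
```

Note that $pos is a byte offset, not a character offset; that is fine as long as it is only handed back to other byte-oriented functions such as substr.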

If you’re not sure which characters ASCII represents, try here

A further special feature of UTF-8 is that in well formed UTF-8, no character can be mistaken for another. Put another way, if you have a character that takes four bytes to represent and chop off the last two bytes of that sequence, it cannot be mistaken for another character. Each sequence of bytes starts with an identifier byte using a value which only appears in identifier bytes. It’s easiest to see by examining this table.

Note the “well formed” above. You might also have badly formed UTF-8 and it may be important, in some instances, to check for this. This PHP library is probably the best way to check, being strict and fast. More on validation below.
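
The identifier-byte idea can be sketched as a small classifier. The function name utf8_byte_role and its return convention are invented here for illustration; the byte ranges are those of UTF-8 as restricted to Unicode (sequences of at most four bytes).

```php
<?php
// Hypothetical helper: classify one byte of a UTF-8 stream.
// Returns the length of the sequence that a lead byte announces,
// 0 for a continuation byte, or -1 for a byte that never occurs
// in well formed UTF-8.
function utf8_byte_role(string $byte): int
{
    $b = ord($byte);
    if ($b < 0x80) {
        return 1;                       // 0xxxxxxx: ASCII, stands alone
    }
    if ($b < 0xC0) {
        return 0;                       // 10xxxxxx: continuation byte
    }
    if ($b >= 0xC2 && $b <= 0xDF) {
        return 2;                       // 110xxxxx: starts a 2-byte sequence
    }
    if ($b >= 0xE0 && $b <= 0xEF) {
        return 3;                       // 1110xxxx: starts a 3-byte sequence
    }
    if ($b >= 0xF0 && $b <= 0xF4) {
        return 4;                       // 11110xxx: starts a 4-byte sequence
    }
    return -1;                          // 0xC0, 0xC1, 0xF5 to 0xFF
}
```

With a classifier like this, a four-byte character cut down to its first two bytes is easy to spot: the lead byte announces four bytes but only one continuation byte follows.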

Practical Issues

Declaring UTF-8

If you’re starting development on a new application / site and currently have no content stored (that might be encoded in something other than ASCII or UTF-8), using UTF-8 is just a matter of informing browsers correctly. OK there’s more to it than that, depending on what you’re actually going to do with data you get from a browser, e.g. parsing it, but the first step is letting browsers know.

That can be done by sending the following HTTP header;

// Setting the Content-Type header with charset
header('Content-Type: text/html; charset=utf-8');

Note: the value charset should be case insensitive - browsers shouldn’t care.

An alternative (which, for overkill, you might want to use as well), is the HTML meta tag equivalent to the Content-Type HTTP header;

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Note: you should place this meta tag as early as possible in the <head/> section of the document, in particular before the <title/> tag - if not the browser may decide the charset for you (have not confirmed this point but seen it discussed before - anyone have a relevant link?).

Otherwise, it’s worth examining section 5.2.2 Specifying the character encoding of the HTML 4.01 spec.

One further place (for sake of overkill if you’re using the above HTTP header and meta tag), where it’s a good idea to specify a charset is in forms with the accept-charset attribute e.g.;

<form accept-charset="utf-8">

Despite the similar name, this attribute is not the same thing as the HTTP Accept-Charset request header. On a form it states which character encodings the server is prepared to accept for the submitted data, so the browser encodes the form submission accordingly. Technically accept-charset is a statement about the server but in practice it guides the browser.

A relevant link from Microsoft;

ACCEPTCHARSET Attribute:

If the this attribute is not specified, the form will be submitted in the character encoding specified for the document. If the form includes characters outside the character set specified for the document, Microsoft Internet Explorer will attempt to determine an appropriate character set. If an appropriate character set cannot be determined, then the characters outside of the character set will be encoded as an HTML numeric character reference. (...) If the user enters characters that are not in the character set of the document containing the form, the UTF-8 character set will be used. UTF-8 is the preferred format for multilingual text.

Also an interesting comment (first on the list) via Sam Ruby’s blog on Character Encoding and HTML Forms;

The best way to control how the web browser will send back data is to use the accept-charset attribute on the form element. Without that attribute, all kinds of weird things can happen (eg. if the user forces the browser to use a non-default character encoding to display the page, the form might get submitted in that encoding).

Sam also makes the point that if you don’t declare the encoding, you lose the knowledge of what you’re actually being sent, leading you to a position where you have to “guess” what the incoming charset is.
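
Putting the three declarations together, a page might look roughly like the sketch below. The function render_comment_page, the form field name and the /comment.php target are all placeholders invented for this example.

```php
<?php
// Hypothetical page skeleton combining the HTTP header, the meta tag
// and the accept-charset attribute.
function render_comment_page(string $action): string
{
    // Escape the attribute value; tell htmlspecialchars it is UTF-8.
    $action = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');

    return '<html><head>' . "\n"
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . "\n"
        . '<title>Comment</title>' . "\n"
        . '</head><body>' . "\n"
        . '<form method="post" action="' . $action . '" accept-charset="utf-8">' . "\n"
        . '<textarea name="comment"></textarea>' . "\n"
        . '<input type="submit">' . "\n"
        . '</form></body></html>' . "\n";
}

// The HTTP header has to go out before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}
echo render_comment_page('/comment.php');
```

The meta tag comes before the title, as suggested above, and all three declarations name the same encoding, which avoids the kind of mismatch seen in the blog example earlier.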

Code issues with multibyte UTF-8 characters

Now you’ve instructed browsers that you’re only using UTF-8, the next issue will be making sure that the string operations you perform in your code will behave correctly when given multibyte UTF-8 characters. This doesn’t mean you need to throw out all use of PHP’s native string functions, regular expressions and otherwise, but that you need to consider where PHP needs to understand what a multibyte character is.

In general, if we call the string you are searching for the needle and the string you are searching in the haystack, you will need to worry when the needle could contain multibyte (non-ASCII) characters. This is expanded on under common problems, but here are a couple of examples.

Checking a String’s Length

Let’s say you want to validate someone’s first name, checking that it contains no more than 10 characters (characters, not bytes). Normally you might do so like;

if ( strlen($firstname) > 10 ) {
   die($firstname . ' is too long');
}

Now a Russian name like Aleksandra, written in the Latin alphabet, is exactly ten characters long. Written in the Cyrillic alphabet it looks like Александра (still ten characters), but PHP’s strlen function sees it as containing 20 bytes - so the above length test will wrongly reject it.

To handle this correctly, we have to turn to PHP’s PCRE extension and the /u pattern modifier. The test then becomes;

if ( !preg_match('/^\w{0,10}$/u', $firstname) ) {
   die($firstname . ' is too long');
}

The \w meta-character will match word characters (letters, numbers and underscores). Using the /u pattern modifier, the notion of what is a letter, to the PCRE extension, is extended to UTF-8 characters.
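
A slightly stricter variant, shown here only as a sketch, uses the Unicode property \p{L} (any letter) instead of \w, so digits and underscores are rejected as well. It assumes PCRE was built with Unicode property support (the usual case) and that the script file is saved as UTF-8; is_short_name is a made-up name.

```php
<?php
// Assumes this script file is saved as UTF-8.
// Hypothetical validator: between 1 and 10 letters, from any script.
function is_short_name(string $name): bool
{
    // \p{L} matches any Unicode letter; /u makes PCRE treat both the
    // pattern and the subject as UTF-8, so {1,10} counts characters.
    return preg_match('/^\p{L}{1,10}$/u', $name) === 1;
}

var_dump(strlen('Александра'));          // int(20): bytes
var_dump(is_short_name('Александра'));   // bool(true): ten characters
var_dump(is_short_name('Александрина')); // bool(false): twelve characters
```

Real names also contain spaces, hyphens and apostrophes, so a production check would need a wider character class than this.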

Converting Upper Case / Lower Case

The normal way to convert between upper and lower case is using PHP’s strtoupper and strtolower. The problem there though is they rely on the current locale setting of the server as mentioned in the manual;

‘alphabetic’ is determined by the current locale. This means that in i.e. the default “C” locale, characters such as umlaut-A (Ä) will not be converted.

The concept of a character’s “case” only exists in some alphabets such as Latin, Greek, Cyrillic, Armenian and archaic Georgian - it does not exist in the Chinese writing system, for example. See Unicode Standard Annex #21: Case Mappings.

This means there is a finite number of characters that you might need to convert from upper to lowercase and vice versa. A lookup table of upper / lower case character mapping can be found in the source of PHP’s mbstring extension here.

It’s possible to implement this in PHP, as done in DokuWiki here - have a look at the utf8_strtolower and utf8_strtoupper functions (note also that they attempt to use the mbstring extension first, as implementing these in PHP is slow).
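
The “mbstring first, lookup table second” idea can be sketched as follows. This is not the DokuWiki code: utf8_lower_sketch is a made-up name and its four-entry table is deliberately tiny, where a real table would cover every cased character. The sketch assumes the script file is saved as UTF-8 and that strtolower only touches ASCII letters (true in the “C” locale, and in PHP 8 generally).

```php
<?php
// Assumes this script file is saved as UTF-8.
// Hypothetical helper, not the DokuWiki implementation.
function utf8_lower_sketch(string $str): string
{
    if (function_exists('mb_strtolower')) {
        return mb_strtolower($str, 'UTF-8');
    }
    // Illustrative fallback with a deliberately tiny lookup table.
    static $map = ['Ä' => 'ä', 'Ö' => 'ö', 'Ü' => 'ü', 'Ñ' => 'ñ'];
    // strtolower() handles A-Z; strtr() then maps the listed characters.
    return strtr(strtolower($str), $map);
}

echo utf8_lower_sketch('ÄPFEL Ñ'), "\n"; // äpfel ñ
```

Because strtr with an array works on whole byte sequences, the multibyte keys in the table are replaced safely without any character-aware string functions.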

Making room for more data

One issue with UTF-8 is it’s difficult to predict how many bytes will be needed. Characters can be represented by up to four bytes so a field in a database would probably need to have its size increased by a factor of four - that may even force you to use BLOB fields where previously you’d used a VARCHAR.

Converting Existing Content

If you have existing content, it will need to be converted to UTF-8.

Note: it’s strongly recommended to migrate all content to UTF-8 offline rather than attempting to manage multiple character sets at “runtime”, converting between the two. Aside from the fact PHP has poor support for this unless you have iconv available, it’s also a recipe for disaster if your application loses track of what encoding is being used where.

If you’re running an English-only site and you’re 99% sure that the content contains only ASCII characters, you might just bite the bullet, redeclare the site as UTF-8 and manually edit any problem characters as you find them.

If your site contains a lot of non-English content, there’s a good chance you’ll need to convert the content to UTF-8.

First you need to find out what character set you’re currently using (Firefox: right-click > Page Info is a good way to find out). If you’ve never thought about it, there’s a fairly good chance it’s encoded as ISO-8859-1.

See Sam Ruby’s utf-8 musings.

Converting ISO-8859-1

Many HTML editors add a content-type meta tag that explicitly declares ISO-8859-1, so often people are using it without even being aware.

Generally it’s easy to convert using PHP’s utf8_encode function (which explicitly converts ISO-8859-1 to UTF-8 - nothing else!).

You may also want to convert numeric HTML character references as these will only be understood by a browser (TODO: more detail here needed)

Also be aware of the Windows cp1252 code page, which is similar to ISO-8859-1 but not 100% identical. See comments on utf8_encode page (need to add examples / problems where this is an issue).
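
As a sketch of the difference, the snippet below converts the same bytes with iconv, once as ISO-8859-1 and once as cp1252. It assumes the iconv extension is available and that the underlying iconv library knows the name CP1252 (common implementations do). Note also that utf8_encode and utf8_decode are deprecated as of PHP 8.2, so on current PHP versions iconv or mb_convert_encoding is the safer route.

```php
<?php
// Assumes the iconv extension is available.
// "\xE9" is e-acute in both ISO-8859-1 and Windows cp1252.
$latin1 = "caf\xE9";
$utf8   = iconv('ISO-8859-1', 'UTF-8', $latin1);
echo bin2hex($utf8), "\n"; // 636166c3a9

// Byte 0x80 is one place where the two differ: a control character in
// ISO-8859-1, the euro sign in cp1252.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', "\x80");
$asCp1252 = iconv('CP1252', 'UTF-8', "\x80");
echo bin2hex($asLatin1), ' vs ', bin2hex($asCp1252), "\n"; // c280 vs e282ac
```

Content typed on Windows and labelled ISO-8859-1 is often really cp1252, which is where “smart quotes” and euro signs tend to go missing after a naive conversion.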

Converting other Character Sets

(TODO) Use iconv - see Useful Tools as well

Interfacing with Systems using other Charsets

So long as you have declared your pages / forms correctly as UTF-8, web browsers (modern ones at least) should provide no real issue (although you should still check input for well formedness).

The problem is what if you interface with other applications / data sources than a web browser (e.g. an RSS feed).

Generating Output

When building interfaces on your site for other applications to access, such as an RSS feed (i.e. when you’re outputing something for non-browser consumption) you need to make sure you’re declaring character sets e.g.

header('Content-Type: text/xml; charset=utf-8');
echo '<?xml version="1.0" encoding="utf-8"?>';
// etc.

You should also avoid passing strings through htmlentities - few applications other than browsers understand HTML entities.
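
One way to escape text for XML output without producing HTML entities is htmlspecialchars with the ENT_XML1 flag (available since PHP 5.4). The sketch below builds a minimal feed fragment; build_feed is a hypothetical helper and the result is not a complete RSS channel.

```php
<?php
// Hypothetical helper: build a tiny RSS-like document from UTF-8 strings.
function build_feed(string $title, string $description): string
{
    // ENT_XML1 escapes only the characters XML itself reserves;
    // everything else stays as literal UTF-8.
    $esc = function (string $s): string {
        return htmlspecialchars($s, ENT_QUOTES | ENT_XML1, 'UTF-8');
    };

    return '<?xml version="1.0" encoding="utf-8"?>' . "\n"
        . '<rss version="2.0"><channel>'
        . '<title>' . $esc($title) . '</title>'
        . '<description>' . $esc($description) . '</description>'
        . '</channel></rss>' . "\n";
}

// Declare the same charset in the HTTP header as in the XML declaration.
if (!headers_sent()) {
    header('Content-Type: text/xml; charset=utf-8');
}
echo build_feed("I\u{F1}t\u{EB}rn\u{E2}ti\u{F4}n\u{E0}liz\u{E6}ti\u{F8}n", 'Tom & Jerry');
```

The ampersand comes out as &amp;amp; while the accented characters are sent as plain UTF-8 bytes, which any XML parser can read.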

Otherwise, if you’re generating XML, you should read HOWTO Avoid Being Called a Bozo When Producing XML.

Processing Input

When handling input from sources other than a browser (e.g. parsing an RSS feed or accepting a Pingback), there are two main issues to consider;

  • Determining the input character encoding
  • Converting it to UTF-8 (if necessary)

In addition, different effort is required, depending on whether your PHP app is acting as a client or server to the input.

Specific to XML, you need to be careful with the SAX parser - see XML Extension (SAX).

RSS (Client)

Hopefully the source of the input has declared the character encoding. If you’re reading a remote RSS feed, your HTTP client needs to examine the Content-Type header delivered by the remote server. You might also (if XML) want to examine the opening XML declaration and hunt for the encoding attribute. If there is no charset declaration anywhere, the safest thing to do is not to use it - it is possible to detect character sets but no one has done a good job of this in PHP - if desperate, you might consider passing it to a Python script via a shell command, and use the Universal Encoding Detector (http://chardet.feedparser.org/). Once you know what the content is declared as, you should be OK to pass the string and the declaration to iconv() - it will fail if the declaration is lying or the character set is not supported.
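
The procedure just described might be sketched like this. Both function names are invented for this page, the regular expressions are simplified, and the sketch assumes the iconv extension is present; a leading BOM or a non-ASCII-compatible encoding such as UTF-16 would need extra handling.

```php
<?php
// Hypothetical helper: find the declared encoding of a fetched feed.
// $contentType is the raw Content-Type header value (or null if absent).
function declared_charset(?string $contentType, string $body): ?string
{
    // 1. A charset in the HTTP header takes precedence.
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    // 2. Otherwise look at the XML declaration at the start of the body.
    if (preg_match('/^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._:-]+)["\']/', $body, $m)) {
        return strtoupper($m[1]);
    }
    return null; // nothing declared anywhere
}

// Hypothetical helper: return the body as UTF-8, or null if that is not possible.
function feed_to_utf8(?string $contentType, string $body): ?string
{
    $charset = declared_charset($contentType, $body);
    if ($charset === null) {
        return null;            // safest thing to do is not to use it
    }
    if ($charset === 'UTF-8') {
        return $body;           // still worth checking for well formedness
    }
    // iconv() returns false if the charset is unknown or the input
    // does not match the declaration.
    $converted = @iconv($charset, 'UTF-8', $body);
    return $converted === false ? null : $converted;
}
```

For example, feed_to_utf8('text/xml; charset=ISO-8859-1', $body) converts from Latin 1, while a feed with no declaration at all yields null and is rejected.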

If you’re insisting on PHP, you’re probably best off using Magpie RSS which has some support for this problem. More generally, it’s probably better to use Universal Feed Parser (a Python project which is light years ahead and has automatic encoding detection built in) running as a cron job.

Pingbacks (Server)

If you’re accepting a Pingback or similar, you need to check what the client is declaring the input as - the apache_request_headers() function and similar will allow you to examine the incoming Content-Type header.

Otherwise the rest of the discussion for RSS applies

Checking UTF-8 for Well Formedness

About the most rigorous but also performant way to test whether a string is well formed UTF-8 in PHP is by using this UTF-8 to Code Point Array Converter in PHP - returns FALSE when you convert from UTF-8 to Unicode code points, if the UTF-8 is not well formed.

Another approach is to use the regular expression here http://www.w3.org/International/questions/qa-forms-utf-8 with preg_match. Have found this to be a lot slower than the previous solution; noticeably so if you have a large document. Note also that the way it’s defined, it regards most non-printable ASCII characters as invalid which may or may not be what you want.

A third and very fast way to do a quick check is using preg_match with the /u modifier. If preg_match (or similar PCRE functions) are given badly formed UTF-8 when using the /u modifier, they simply fail quietly without matching anything. That means you can use a function like this to test;

function utf8_compliant($str) {
    if ( strlen($str) == 0 ) {
        return TRUE;
    }
    // If even just the first character can be matched, when the /u
    // modifier is used, then it's valid UTF-8. If the UTF-8 is somehow
    // invalid, nothing at all will match, even if the string contains
    // some valid sequences
    return (preg_match('/^.{1}/us',$str,$ar) == 1);
}

Some words of warning.

  1. UTF-8 allows for five and six byte sequences and PCRE uses that as the definition of UTF-8. But 5/6 byte sequences are not supported by Unicode. That means the above test might pass such a sequence but it would not actually represent any Unicode character - see this comment in the PHP manual for more detail plus some well formed vs. badly formed strings to test with.
  2. Whether the utf8_compliant() function shown above works properly seems to depend on exactly which version of PHP (or perhaps PCRE) you are using; it cannot be trusted for input validation. It is a bad choice for checking that your string is correctly formed UTF-8 to prevent an invalid multibyte sequence from being used in an addslashes based SQL injection attack. In particular, for this invalid UTF-8 sequence (that addslashes will helpfully turn into an unescaped single quote) utf8_compliant(chr(0xbf).chr(0x27)) == true. Ouch. Fast yes, accurate no. (test fails with PHP4.4.2/PHP5.1.2 and PCRE6.4; presumably successful test by original author with unknown versions).

A fourth approach can be found in this manual entry - have yet to test / benchmark it.
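
On reasonably recent PHP versions a strict check can be sketched with mbstring’s mb_check_encoding, falling back to PCRE. The name is_wellformed_utf8 is made up here, and given the version-dependent behaviour described in the warnings above, it is worth running the test strings against your own installation before relying on it.

```php
<?php
// Hypothetical helper: check that $str is well formed UTF-8.
function is_wellformed_utf8(string $str): bool
{
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($str, 'UTF-8');
    }
    // An empty pattern with /u makes PCRE validate the whole subject;
    // preg_match() returns false (not 0) when the subject is malformed.
    return preg_match('//u', $str) === 1;
}

var_dump(is_wellformed_utf8("caf\xC3\xA9")); // valid two-byte sequence
var_dump(is_wellformed_utf8("\xBF\x27"));    // the addslashes sequence from the warning above
```

Unlike the single-character test, the empty pattern does not stop at the first character, so a bad sequence anywhere in the string is reported.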

Common Problem Areas with UTF-8

(TODO) Common code situations...

Entities

PHP provides two functions, htmlspecialchars and htmlentities, which tend to get liberally scattered around code, for translating certain characters to HTML entities. It also provides the function html_entity_decode to go in the opposite direction, from entity to character. Behind the functions are some lookup tables for different character sets, about which some information can be obtained using get_html_translation_table (but see below!).

In general, given that you’re switching to UTF-8, you no longer need to use HTML entities other than the “special five” which could cause a parser problems, because the characters can be represented directly in UTF-8. The “special five”, which could trip an HTML / XML parser are;

  • & (ampersand) entity: &amp;
  • " (double quote) entity: &quot;
  • ' (single quote) entity: &apos;
  • < (less than) entity: &lt;
  • > (greater than) entity: &gt;

PHP does support more entities than this and understands their corresponding character representation in a number of common character sets (but not all character sets). In particular the translation from entity to UTF-8 characters seems to have been broken up until PHP 5.

In general, the safe rule is don’t output anything but the “special five” entities (or use anything but those five “internally” within your application). Entities will then only be an issue if you’re consuming data from an external source which is using them.

htmlspecialchars

htmlspecialchars provides a tool to help with generating HTML and XML markup, to make sure that characters like >, which could be mistaken for part of the markup, are converted to an entity like &gt; for display, as described in the above section.

Unlike many of PHP’s string functions, htmlspecialchars has some awareness of character encodings and, by default, assumes text it is given to escape is encoded as ISO-8859-1. Technically you can probably get away with passing it UTF-8 encoded text without any problems, because there shouldn’t be anything which it could mistake for the characters it is trying to match, but it’s probably smarter to tell it the character set, using its third argument, e.g.;

$html = htmlspecialchars($utf8_string, ENT_COMPAT, 'UTF-8');
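
A short sketch of what that call does with mixed input (the sample string is made up, and the script file is assumed to be saved as UTF-8):

```php
<?php
// Assumes this script file is saved as UTF-8.
$comment = '<b>Iñtërnâtiônàlizætiøn</b> & "quotes"';

// Only the markup-significant characters are replaced; the multibyte
// characters pass through untouched.
$html = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');
echo $html, "\n";
// &lt;b&gt;Iñtërnâtiônàlizætiøn&lt;/b&gt; &amp; &quot;quotes&quot;
```

One thing to be aware of: when told the input is UTF-8, htmlspecialchars returns an empty string for badly formed input unless the ENT_SUBSTITUTE or ENT_IGNORE flag is given (PHP 5.4 and later).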

To reverse htmlspecialchars, you are better off rolling your own function - see below.

htmlentities

htmlentities allows translation of a further range of characters (in addition to the markup characters translated by htmlspecialchars) into their equivalent HTML entities. The original reason for having entities in HTML was to allow browsers with, say, only support for ASCII encoding to be able to display further, useful, characters.

You can get an idea of the characters that PHP’s htmlentities function would translate like this (note get_html_translation_table cannot be told what charset to use - see below).

<?php
echo '<pre>';
print_r(array_map('htmlspecialchars',get_html_translation_table(HTML_ENTITIES)));
echo '</pre>';
?>

With modern web browsers and widespread support for UTF-8, you don’t need htmlentities because all of these characters can be represented directly in UTF-8. More importantly, in general, only browsers support HTML’s special characters - a normal text editor, for example, is unaware of HTML entities. Depending on what you’re doing, using htmlentities may reduce the ability of other systems to “consume” your content.

Also (not confirmed but sounds reasonable - from anon comment here), character entities (stuff like &raquo; or &mdash;) do not work when a document is served as application/xhtml+xml (unless you define them). You can still get away with the numeric form though.

html_entity_decode

The html_entity_decode function is intended to convert HTML entities back into “normal” characters. Depending on the character set you tell it to use, it looks up an HTML entity it finds in some text and returns the corresponding character from a lookup table. The character set you specify as this function’s third argument means both the character set of the text you give html_entity_decode to parse and the character set into which to decode the entities.

Support for UTF-8 seems to have been broken for this function until PHP 5 - see here and here.

Generally speaking you’re probably better off avoiding it unless you’re forced to consume some external data source which contains entities other than the special five (above). That also means you may be better off rolling your own function to reverse htmlspecialchars, because html_entity_decode will translate more than just the “special five”.
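
A “roll your own” reverse for just the special five could look like the sketch below; unescape_special_five is a made-up name. PHP 5.1 and later also ship htmlspecialchars_decode, which covers the same ground and is worth checking first.

```php
<?php
// Hypothetical helper: undo only the "special five" entities
// (plus &#039;, which htmlspecialchars emits for single quotes).
function unescape_special_five(string $text): string
{
    // strtr() with an array scans the input once and never re-examines
    // its own output, so "&amp;lt;" becomes "&lt;" and not "<".
    return strtr($text, [
        '&amp;'  => '&',
        '&quot;' => '"',
        '&apos;' => "'",
        '&#039;' => "'",
        '&lt;'   => '<',
        '&gt;'   => '>',
    ]);
}
```

Other entities, such as &eacute;, are left exactly as they were, which is the point of not using html_entity_decode here.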

get_html_translation_table

This function returns an array with characters as keys and their corresponding HTML entities as values. It looks like this function will always provide the characters encoded as ISO-8859-1, from looking at the get_html_translation_table source and the determine_charset function it relies on.

Perhaps a future PHP version will see get_html_translation_table provide a third argument to switch the charset.

Further Information

If you do need to get into translating to and from anything but the “special five” entities, you should get familiar with what the relevant functions really do internally, by looking at standard/html.c.

Length Operations

As has already been discussed, because PHP’s basic string functions regard 1 byte as 1 character, using a function like strlen on a multibyte string (like UTF-8) will tell you the number of bytes in the string, not the number of characters.

To count the number of characters, there’s a nice hack via the utf8_decode function (mentioned in the comment by “chernyshevsky at hotmail dot com” on the strlen page);

function utf8_strlen($string){
    return strlen(utf8_decode($string));
}

Now the utf8_encode and utf8_decode function outputs are only for translating between ISO-8859-1 and UTF-8 (the function names are a little misleading) but when going from UTF-8 to ISO-8859-1, any UTF-8 character that utf8_decode doesn’t know how to handle will be replaced by a single ? character of one byte. In effect that means all characters which are multiple bytes are “crunched” into single byte characters. From there strlen tells the “truth” about the number of characters in the string.
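
Since utf8_encode and utf8_decode are deprecated as of PHP 8.2, here is a sketch of an alternative that gives the same count for well formed input without them. It relies on the structure of UTF-8 itself: every character has exactly one byte that is not a continuation byte. The name utf8_strlen_bytes is invented for this page.

```php
<?php
// Hypothetical alternative to the utf8_decode() trick.
function utf8_strlen_bytes(string $str): int
{
    // Continuation bytes lie in the range 0x80 to 0xBF. Counting every
    // byte outside that range counts one byte per character (its ASCII
    // byte or its lead byte). No /u modifier: this works on raw bytes.
    return (int) preg_match_all('/[^\x80-\xBF]/', $str);
}
```

Like the utf8_decode version, this assumes the input is well formed; on badly formed input the count is meaningless rather than an error.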

If you want to see the internal implementation, look here which calls xml_utf8_decode - seems to do a safe job of parsing UTF-8.

(TODO) More to come on stuff like substr

Case Conversions

(TODO)

Validation

(TODO) Examples of the http://www.php.net/pcre /u pattern modifier and highlight the \w metacharacter

Sorting

(TODO)

Searching

(TODO)

Security Concerns

(TODO) Issues like spoofing / phishing etc.

UTF-7 risks - see Google XSS Example

Case Study: DokuWiki

This is an attempt at describing the switch to UTF-8 in DokuWiki, from the memory of someone who was indirectly involved. Right now it’s a short overview.

DokuWiki is a PHP wiki which stores all wiki pages in files. Originally it defaulted to ISO-8859-1 while supporting other character sets depending on what language you specified in the DokuWiki configuration.

The problem with this approach is that a given wiki installation could only support a single character set (effectively meaning a small group of languages). It also introduced a whole bunch of headaches, like the behaviour of these functions (see the utf-8 page) in conjunction with a server’s locale settings, and perhaps the need for character set detection and iconv for any wiki content from sources other than a web browser.

The decision was taken to move DokuWiki to “all UTF-8”: all wiki pages would be encoded as UTF-8. This pushed 90% of problems onto the browser (modern browsers have generally excellent support for UTF-8) and allowed a single wiki to support many different character sets and thereby languages (see these examples).

The remaining 10% of problems included:

  • the need to migrate existing wikis and their content to UTF-8. Given 99% of users were using ISO-8859-1, a conversion helper was written to help them migrate. Migration of content had to be performed all in one go by end users.
  • the need to implement some UTF-8 aware functions such as utf8_strtolower() for users without the mbstring extension installed.
  • the need for further functions (like utf8_strip_specials) to help with converting UTF-8 to ASCII for wiki page names
  • checking whether input is valid UTF-8 (utf8_check)
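The last item, checking whether input is valid UTF-8, can be sketched with PCRE’s /u modifier. This is not DokuWiki’s utf8_check implementation; is_valid_utf8 is a hypothetical name, and the approach relies on preg_match refusing subjects that are not well-formed UTF-8 when /u is set.

```php
<?php
// Hypothetical helper, not DokuWiki's utf8_check().
function is_valid_utf8($string) {
    // With the /u modifier PCRE first validates the subject as UTF-8;
    // preg_match() returns false (not 0) when that validation fails.
    // The empty pattern matches any valid string, including "".
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // "é" as the UTF-8 bytes 0xC3 0xA9
var_dump(is_valid_utf8("caf\xE9"));     // "é" as a lone ISO-8859-1 byte
```

The first call reports true and the second false, which is the kind of check you would run on input before storing it in an “all UTF-8” wiki.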

PHP Extensions / Functionality for Character Encoding

The short summary of this section is: for PHP < v6, you need mbstring and iconv available.

The mbstring extension

Manual: http://www.php.net/mbstring

Provides multibyte-aware implementations of some of the most common PHP string functions, the POSIX extended regex extension and the mail function. These are either accessible via their own namespace (i.e. functions beginning mb_*) or can be used to “overload” the normal PHP implementations, giving you half a chance (expect to have additional work to do) of making an application support a character set other than the one it was designed for.

The mbstring extension supports many different character sets, most importantly UTF-8. It also allows for conversion between character sets and implements some level of encoding detection (no idea how effective this is though).

The mbstring extension is not part of the default PHP distribution - if you need it and are using a web hosting service, make sure your provider has compiled it into PHP. Common Linux distributions (like Debian) package PHP with mbstring.
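Since you cannot count on mbstring being there, code that has to run on arbitrary hosts can test for it at runtime. The sketch below is one way to do that; char_count is a made-up wrapper, and the fallback assumes the input is well-formed UTF-8.

```php
<?php
// Hypothetical wrapper: prefer mbstring, fall back to a byte-level count.
function char_count($string) {
    if (function_exists('mb_strlen')) {
        // Always name the encoding rather than relying on the configured default.
        return mb_strlen($string, 'UTF-8');
    }
    // Fallback: count the bytes that do not continue a multibyte sequence.
    return preg_match_all('/[^\x80-\xBF]/', $string, $unused);
}

echo extension_loaded('mbstring') ? "mbstring available\n" : "mbstring missing\n";
echo char_count("Gr\xC3\xBC\xC3\x9Fe"), "\n"; // "Grüße": 7 bytes, 5 characters
```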

The iconv extension

Manual: http://www.php.net/iconv

The main purpose of the iconv extension is converting between different character sets. Generally it would be best applied to input sources other than web browsers (e.g. when you’re aggregating RSS feeds encoded in different character sets) and is probably the most effective tool PHP has for character set conversion.
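A minimal sketch of such a conversion, assuming the source encoding is already known to be ISO-8859-1 (for a feed you would take it from the XML declaration or the HTTP headers):

```php
<?php
// "é" as a single ISO-8859-1 byte, e.g. from a feed declared as Latin 1.
$latin1 = "caf\xE9";

// iconv(from, to, string) returns the converted string, or false on failure.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

echo bin2hex($latin1), "\n"; // 636166e9
echo bin2hex($utf8), "\n";   // 636166c3a9
```

The hex dumps show the single byte 0xE9 becoming the two-byte UTF-8 sequence 0xC3 0xA9.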

From PHP 5+, the iconv extension also comes with implementations of some common string functions but, judging from crude benchmarks, these are much slower than mbstring or other approaches, at least when working with UTF-8. This seems to be because iconv carefully checks for badly formed UTF-8.

Also from PHP 5+, iconv became a default part of the PHP distribution. For PHP versions <= 4, make sure your host has installed it.

The GNU Recode extension

Manual: http://www.php.net/recode

Essentially does the same thing as iconv: converting strings to other character sets. The general feeling is that you are better off using iconv: recode doesn’t get much use, is not available on Windows and causes issues with some more popular extensions.

utf8_encode() & utf8_decode()

Manual: http://www.php.net/utf8_encode and http://www.php.net/utf8_decode

The names of these two functions are slightly misleading - they are specifically for use in converting between ISO-8859-1 and UTF-8 - nothing more, nothing less.

They are packaged with PHP’s SAX parser and could be regarded as “legacy” from the days when 99% of web pages were encoded as ISO-8859-1.

They can be useful in some instances though. For example, utf8_decode() has the effect of “squashing” each multibyte UTF-8 sequence into a single byte (whether it “recognizes” a target ISO-8859-1 character or not) and is very fast. That means you can implement a UTF-8 aware strlen function like:

function utf8_strlen($str) {
    return strlen(utf8_decode($str));
}

PHP 6 and ICU

PHP 6 will be using IBM’s ICU libraries to provide native support for character sets. This is, in general, very good news and brings PHP on a par with Java in this area.

There’s some information here - at this time it’s not entirely clear what it will end up looking like - if you need advance warning, keep an eye on the i18n and internals mailing lists.

Summary

(TODO) List of key things to think about

Useful Reads

General

PHP Specific

Useful Tools

PHP Specific

Editors with UTF-8 Support

When editing content outside of a (decent) browser, make sure to use an editor with UTF-8 support (i.e. not notepad!)

  • Simredo - simple text editor, useful for creating and viewing text encoded in different encodings.
  • SciTE – excellent Open Source cross platform editor: make sure you set the properties value code.page=65001 to make it use UTF-8
  • Jedit – Java based editor with UTF-8 support
  • VIM – see Using UTF-8
  • EMACS – see GNU Emacs and UTF-8 locale
  • TEA – a GTK2 based editor for GNU/Linux
  • Notepad2 – a very good notepad replacement for Windows
  • http://people.w3.org/rishida/utilities – some online tools to help / learn about Unicode

Related Wiki Pages

  • i18n - main page
  • utf-8 - PHP functions and UTF-8
  • mysql - UTF-8 and MySQL
1) ISO-8859-1 is basically everything you need to write English and “Western European” languages and is very commonly used on the web still, despite UTF-8.
2) Note you should be using a text editor capable of encoding PHP source files as UTF-8 - see useful_tools
3) we’re talking loose definitions here for humans to grasp - PHP’s internal string representations are ultimately “zeros and ones”
4) there are exceptions to this of course. PHP’s string functions are “generally safe”, depending on what you’re doing. You need to be careful with strtoupper and strtolower, for example, which are “locale aware” and could mistake UTF-8 characters for those in the current locale. Also the \w meta character in the PCRE regular expression extension is locale dependent unless the /u modifier is used - see what references determine_charset
5) the modern browsers all do a good job with UTF-8 and support many other character sets as well - they can be more or less trusted to get it right
6) note some input is needed here on how it does this - assume that it regards any multibyte character it is not aware of as being a letter character - that probably means \w will match a chess character like the queen: ♛
7) PHP’s htmlspecialchars function outputs &#039; instead of &apos;, apparently because IE seems to have trouble with the latter
8) not confirmed!!
9) probably a very oversimplified or even wrong explanation, so be warned
 
php/i18n/charsets.txt · Last modified: 2006/06/13 13:42 by 137.222.40.78 (stuartp)
 