Handling UTF-8 with PHP

This page is intended as a reference for functionality PHP provides which can either help with handling UTF-8 or should be regarded as a risk when used in conjunction with UTF-8 encoded strings. Further information can be found on the i18n and charsets pages.

Note that this page applies to PHP versions before 6; PHP 6 is expected to have native support for Unicode / UTF-8.

UTF-8 "Dangerous" PHP Functionality

The following functions / functionality in PHP may pose issues when used in conjunction with UTF-8, depending on what you’re doing.

Important Note the words “it depends” are critical to bear in mind here - do not blindly replace all use of these functions without understanding why you’re doing so. Remember ASCII (aka US-ASCII or ASCII7) is a subset of UTF-8 and that UTF-8 has been designed so that no character sequence in a well formed UTF-8 string can be mistaken as a sub-sequence of another, longer character. These two facts will often mean you can survive with PHP’s own string functions depending on the exact nature of what you are doing with them - see the strpos discussion below. Blindly replacing all uses “just in case” is likely to lead to apps which run like lame dogs.

Note on Locales the discussion below could be read to suggest “locales are evil”, which would be to misunderstand the problem.

If you’re writing code for yourself, to be used on a server you control, locales could be made to work if your server has locales installed which support UTF-8. That would mean functions like strtolower behave correctly.

But this is no use if you’re writing applications which will be installed by third parties (like these for example) because it’s system specific (it’s not even just OS specific). If the default system locale does not support UTF-8, in theory your application could change the locale “on the fly” using setlocale, but in practice that requires two things: that there is a locale available on the system which supports UTF-8 (not guaranteed) and that the correct locale identifier string can be found (there are definitely differences between Windows and *Nix locale identifiers, and even amongst the Unixes I believe there are variations e.g. FreeBSD). What’s more, you can’t rely on users to be able to change the locale correctly to suit your application’s needs - on a shared host they probably won’t be able to change the locale for the user that Apache is running with. Bottom line - locales are not the way to go for applications intended to be “write once, run anywhere”.

Note on well formedness the term “well formed UTF-8” appears frequently here. See checking_utf-8_for_well_formedness for details of how to check for well formedness. The point there is you should check UTF-8 strings for well formedness when using functions like explode (see below) which will work with UTF-8 so long as it is well formed.
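
As a rough illustration of such a check, the sketch below leans on the PCRE /u modifier, which makes preg_match fail on a subject that is not valid UTF-8. The helper name is_well_formed_utf8 is hypothetical, and exactly which sequences are rejected depends on the PCRE version bundled with your PHP (see the notes on /u below), so treat it as a starting point rather than a definitive validator.

```php
<?php
// Hypothetical helper (not part of PHP): true if $str is well formed UTF-8.
function is_well_formed_utf8($str)
{
    // With /u, preg_match() returns false (not 0) when the subject is not
    // valid UTF-8; the empty pattern matches any valid subject.
    return preg_match('//u', $str) === 1;
}

var_dump(is_well_formed_utf8('Iñtërnâtiônàlizætiøn')); // well formed
var_dump(is_well_formed_utf8("I\xC3"));                 // truncated 2 byte sequence
```

Running the check once, at the point where a string enters your application, is usually enough to let you use functions like explode on it afterwards.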

Note that you can find “UTF-8 aware” implementations of many of these functions under CVS here.

The PCRE Extension

Official docs at http://www.php.net/pcre.

/i (PCRE_CASELESS) pattern modifier

Unless the /u modifier is used as well, this modifier picks up its understanding of upper and lowercase from the server’s locale. Depending on what you’re doing, this may result in false matches which in turn lead to corrupt UTF-8 strings.

/u (PCRE_UTF8) pattern modifier

  • Official documentation: PCRE pattern modifiers
  • Risk: low
  • Impact: matches 5 and 6 byte sequences which are not Unicode

UTF-8 allows for 5 and 6 byte character sequences but these have no meaning in Unicode (i.e. there are no displayable characters for these sequences). This might lead to “junk” in a web page (browsers would display a ?). See this PHP manual comment.

\w \W \b \B meta characters

The \w means “word character”, the meaning of which is loaded from the server’s current locale. From the manual;

A “word” character is any letter or digit or the underscore character, that is, any character which can be part of a Perl “word”. The definition of letters and digits is controlled by PCRE’s character tables, and may vary if locale-specific matching is taking place. For example, in the “fr” (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

Depending on what you are doing and the strings involved, this may mean \w makes a match across two UTF-8 sequences, leading to corrupted (badly formed) UTF-8 strings.

Similar applies to \W (non-word meta char), \b (word boundary) and \B (non-word boundary) all of which pick up their meaning from the server’s locale settings.

Using the /u pattern modifier prevents words from being mangled but instead PCRE skips strings of characters with code values greater than 127. Therefore, \w will not match a multibyte (non-lower ascii) word at all (but also won’t return portions of it).
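
One way around this is to spell out what a “word character” means using Unicode properties instead of \w. The sketch below assumes the PCRE library in your PHP build was compiled with Unicode property support (\p{...}), which is not guaranteed on every installation; the character class chosen here (letters, numbers and underscore) is only an approximation of \w.

```php
<?php
$text = 'Iñtërnâtiônàlizætiøn is fun';

// \p{L} = any Unicode letter, \p{N} = any Unicode number.
// The /u modifier is required for \p{...} to work on UTF-8 subjects.
preg_match_all('/[\p{L}\p{N}_]+/u', $text, $matches);

print_r($matches[0]); // three whole words, multibyte letters included
```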

The String Extension

Official docs at http://www.php.net/strings.

htmlentities

  • Official documentation: htmlentities
  • Risk: high
  • Impact: could corrupt a UTF-8 string

Rumour - although this function (claims) to have UTF-8 support, bug reports claim it’s broken at least until PHP 5.

Using it on a UTF-8 string with the wrong charset would, very likely, result in corruption / junk output.

Otherwise, when using UTF-8, you don’t need entities - see Common Problem Areas with UTF-8.

html_entity_decode

  • Official documentation: html_entity_decode
  • Risk: high
  • Impact: could corrupt a UTF-8 string

Highly suspect - see comments on htmlentities above - this does the reverse.

htmlspecialchars

  • Official documentation: htmlspecialchars
  • Risk: low
  • Impact: in theory (not confirmed) should not damage a UTF-8 string

htmlspecialchars should (not confirmed) do the right thing by default (without the third argument specifying UTF-8) if it is given a well formed UTF-8 string because the characters it replaces are all within the ASCII 7 range.

That said you can also explicitly tell it to expect UTF-8 like;

$html = htmlspecialchars($utf8_string, ENT_COMPAT, 'UTF-8');

sprintf

  • Official documentation: sprintf
  • FIXME TODO

Yet to investigate in detail. The x and X type specifiers are probably an issue. See also this manual comment - have not verified.

FIXME TODO printf, sscanf, fscanf and vsprintf - same arguments probably apply

str_ireplace

  • Official documentation: str_ireplace
  • Risk: high
  • Impact: could corrupt a UTF-8 string

str_ireplace() relies on the server’s locale setting to convert all characters to lower case. If the locale setting is something other than ASCII or UTF-8, it may mistakenly match UTF-8 sub-sequences with characters in the locale and, while replacing, corrupt the string. Certainly it cannot be relied upon to understand what “uppercase” and “lowercase” mean in UTF-8, unless the locale explicitly supports UTF-8.

str_split

  • Official documentation: str_split
  • Risk: high
  • Impact: could corrupt a UTF-8 string

str_split() breaks up a string given a length argument. The length is a length in bytes not characters. That means it could break a multibyte UTF-8 sequence into invalid parts.

That said, if you know for sure that a given UTF-8 string contains, say, only 2 byte sequences, you might reasonably want to use str_split to break it up into single character sequences.
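
For strings with mixed sequence lengths, a character-wise split can be sketched with preg_split and the /u modifier. The helper name utf8_str_split is hypothetical, and the input is assumed to be well formed UTF-8 (preg_split returns false otherwise).

```php
<?php
// Hypothetical helper: split a well formed UTF-8 string into characters.
function utf8_str_split($str)
{
    // With /u the empty pattern matches between characters, not between
    // bytes; PREG_SPLIT_NO_EMPTY drops the empty pieces at both ends.
    return preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
}

$chars = utf8_str_split('Iñtër');
print_r($chars); // five elements, one per character
```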

strcasecmp

  • Official documentation: strcasecmp
  • Risk: medium
  • Impact: results cannot be trusted

strcasecmp() internally converts the two strings it is comparing to lowercase, based on the server locale settings. As such, it cannot be relied upon to be able to convert appropriate multibyte characters in UTF-8 to lowercase and, depending on the actual locale, may have internally corrupted the UTF-8 strings it is comparing, having falsely matched byte sequences. It won’t actually damage the UTF-8 string but the result of the comparison cannot be trusted.

That said, if two given UTF-8 strings are known to contain only characters in the ASCII 7 range, strcasecmp() could be used to compare them successfully, irrespective of the locale setting.

strcspn

  • Official documentation: strcspn
  • Risk: medium
  • Impact: results cannot be trusted

strcspn() will return a length in bytes not characters, which may not always be what you require.

Also if the mask you provide it contains multibyte characters, these will be split, internally, into their component bytes, perhaps meaning results which are not semantically true - 10xxxxxx bytes in a sequence could be matched 1).

stristr

  • Official documentation: stristr
  • Risk: high
  • Impact: could return a corrupt UTF-8 string

stristr internally converts characters to lower case using the server’s locale when determining the substring to return, so the result may be a corrupted UTF-8 string and the matching will be unpredictable (locale dependent).

strlen

  • Official documentation: strlen
  • Risk: low
  • Impact: results in bytes not characters

strlen simply counts the number of bytes in a string, not the number of characters. This means that for a UTF-8 string containing multibyte characters, the integer it returns is larger than the number of characters in the string.

Note that this may not always be a problem - see the strpos discussion below for an example where working in bytes not characters produces expected results.
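
Where a character count really is needed, one possible sketch is to count matches of “.” under the /u modifier. The helper name utf8_strlen is hypothetical and the input is assumed to be well formed UTF-8.

```php
<?php
// Hypothetical helper: number of characters in a well formed UTF-8 string.
function utf8_strlen($str)
{
    // "." with /u matches one whole character; /s lets it match newlines.
    // preg_match_all() returns the number of matches (false on bad UTF-8).
    return preg_match_all('/./su', $str, $unused);
}

$str = 'Iñtërnâtiônàlizætiøn';
echo strlen($str), ' bytes, ', utf8_strlen($str), " characters\n";
// 27 bytes, 20 characters
```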

strpos

  • Official documentation: strpos
  • Risk: low
  • Impact: results in bytes not characters

strpos will behave correctly with well formed UTF-8 but the result it returns will be in bytes not characters, which may or may not be what you desire, depending on what you want to do with that result.

You would be able to use the result in conjunction with substr for example (remember each UTF-8 sequence is unique) but if you want to validate a string in some manner, based on character length not byte length, strpos may not be semantically correct.

Consider the following example;

<?php
header ('Content-type: text/html; charset=utf-8');
$haystack = 'Iñtërnâtiônàlizætiøn';
$needle = 'ô';
 
$pos = strpos($haystack, $needle);
 
print "Position in bytes is $pos<br>";
 
$substr = substr($haystack, 0, $pos);
 
print "Substr: $substr<br>";

This will display;

  Position in bytes is 12
  Substr: Iñtërnâti

The point being it “works” despite the fact the string is UTF-8 - there’s no need to replace the use of substr or strpos in this case.

By contrast, pulling out an arbitrary substring which happens to cut a 2 byte UTF-8 sequence breaks the string;

<?php
header ('Content-type: text/html; charset=utf-8');
 
$haystack = 'Iñtërnâtiônàlizætiøn';
 
$substr = substr($haystack, 0, 13); // A length of 13 bytes stops in the middle of the ô char
 
print "Substr: $substr<br>";

$substr now contains badly formed UTF-8 and your browser should display something weird as a result (probably a ?).
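
If you need the first N characters rather than the first N bytes, a regular expression with /u can do the counting. This is only a sketch: the helper name utf8_left is hypothetical, the input is assumed to be well formed UTF-8, and PCRE limits how large the repeat count may be, so it suits short prefixes.

```php
<?php
// Hypothetical helper: the first $length characters of a UTF-8 string.
function utf8_left($str, $length)
{
    $length = max(0, (int) $length);
    // ".{0,N}" with /u consumes up to N whole characters, never part of one.
    if (preg_match('/^.{0,' . $length . '}/su', $str, $match) !== 1) {
        return false; // badly formed UTF-8 or a PCRE error
    }
    return $match[0];
}

echo utf8_left('Iñtërnâtiônàlizætiøn', 10); // Iñtërnâtiô
```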

strrev

  • Official documentation: strrev
  • Risk: high
  • Impact: could return a corrupt UTF-8 string

strrev first has to split a string into an array of bytes then reverse their order - this would corrupt multibyte characters in a UTF-8 string.

Note you could still use strrev() if you know that a given UTF-8 string only contains characters in the ASCII 7 range.
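
A character-wise reverse can be sketched by splitting on character boundaries first. The helper name utf8_strrev is hypothetical; note it reverses code points, so combining characters would still end up attached to the wrong base character.

```php
<?php
// Hypothetical helper: reverse a well formed UTF-8 string by character.
function utf8_strrev($str)
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return false; // badly formed UTF-8
    }
    return implode('', array_reverse($chars));
}

echo utf8_strrev('Iñtër'); // rëtñI
```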

strrpos

  • Official documentation: strrpos
  • Risk: low
  • Impact: results in bytes not characters

strrpos will return an answer in bytes not characters. See strpos above for more info.

strspn

  • Official documentation: strspn
  • Risk: low
  • Impact: results in bytes not characters

strspn will return an answer in bytes not characters - See strpos above for more info - similar arguments apply

strtolower

  • Official documentation: strtolower
  • Risk: high
  • Impact: could return a corrupt UTF-8 string

strtolower uses the servers locale setting to understand the meaning of “uppercase” and “lowercase”. Depending on the locale character set, this could mean it falsely matches parts of a UTF-8 string with sequences in the character set it thinks it’s using - the result would be “corrupt” UTF-8.

Otherwise strtolower would fail to be able to understand the meaning of “uppercase” and “lowercase” in UTF-8 if the locale does not support UTF-8 (your locale might be US-ASCII, in which case strtolower won’t corrupt the UTF-8 but also won’t convert uppercase multibyte UTF-8 characters to their lowercase equivalent).
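
One locale-independent approach is sketched below. It assumes the optional mbstring extension may or may not be present: where it is, mb_strtolower is told explicitly that the string is UTF-8; where it is not, the fallback only lowercases A-Z, which leaves multibyte characters untouched rather than corrupting them. The helper name utf8_lower is hypothetical.

```php
<?php
// Hypothetical helper: lowercase a UTF-8 string without relying on the locale.
function utf8_lower($str)
{
    if (function_exists('mb_strtolower')) {
        // mbstring is an optional extension; the encoding is passed explicitly.
        return mb_strtolower($str, 'UTF-8');
    }
    // Fallback: translate ASCII letters only. Every byte of a multibyte
    // UTF-8 sequence is >= 0x80, so none of them can be touched here.
    return strtr($str, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
}

echo utf8_lower('IÑTËR');
```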

strtoupper

  • Official documentation: strtoupper
  • Risk: high
  • Impact: could return a corrupt UTF-8 string

See notes on strtolower above.

substr

  • Official documentation: substr
  • Risk: medium to high
  • Impact: accepts arguments in bytes positions not characters - could corrupt a UTF-8 string

If used in an arbitrary manner to chop off part of a string, it could potentially split UTF-8 sequences resulting in corruption. At the same time if used in conjunction with functions like strpos (see notes above), would be able to extract a portion of a UTF-8 string without corrupting it, although you’ll be passing it arguments in terms of byte positions not character positions.

substr_replace

  • Official documentation: substr_replace
  • Risk: medium to high
  • Impact: accepts arguments in bytes positions not characters - could corrupt a UTF-8 string

If arbitrary start and length arguments are supplied, could corrupt a UTF-8 string. Otherwise could be used in some instances when working with relative UTF-8 character positions - see notes on substr above.

trim, ltrim, rtrim

  • Official documentation: trim, ltrim, rtrim
  • Risk: low
  • Impact: could corrupt a UTF-8 string if second (optional) charlist arg is used

Used in the “default” manner (without the second charlist argument) these functions are safe to use on a UTF-8 string, because the whitespace characters they are searching for are all in the ASCII 7 range.

If the 2nd argument is used, to extend the list of characters these functions attempt to trim, and multibyte (non-ASCII7) characters are in the 2nd argument, then there is a risk of corrupting the returned subject string. This is because (l/r)trim will split the charlist into its component bytes, and bytes in a multibyte sequence of the form 10xxxxxx 2) could be trimmed from other multibyte sequences in the subject string. This most obviously happens when trimming from the right hand side of the string (trim and rtrim), but ltrim can do the same damage if a character in the charlist shares its lead byte with the first character of the subject, since removing that lead byte leaves an orphaned 10xxxxxx byte behind.
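
The byte-wise behaviour can be demonstrated with characters that happen to share a byte, followed by a character-aware alternative built on preg_replace with /u. The helper name utf8_rtrim is hypothetical and assumes a non-empty charlist of well formed UTF-8.

```php
<?php
// "é" is C3 A9 and "©" is C2 A9: they share the trailing byte A9.
echo bin2hex(rtrim("caf\xC3\xA9", "\xC2\xA9")), "\n"; // 636166c3 - orphaned lead byte

// "é" is C3 A9 and "ã" is C3 A3: they share the lead byte C3.
echo bin2hex(ltrim("\xC3\xA9a", "\xC3\xA3")), "\n";   // a961 - orphaned trailing byte

// Hypothetical helper: right trim whole characters, never single bytes.
function utf8_rtrim($str, $charlist)
{
    // preg_quote() escapes anything special inside the character class;
    // /D makes sure "$" only matches at the very end of the subject.
    $class = preg_quote($charlist, '/');
    return preg_replace('/[' . $class . ']+$/uD', '', $str);
}

echo utf8_rtrim('café©©', '©'), "\n"; // café
```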

ucfirst

  • Official documentation: ucfirst
  • Risk: high
  • Impact: could return a corrupt UTF-8 string

See notes to strtolower above

ucwords

  • Official documentation: ucwords
  • Risk: high
  • Impact: could return a corrupt UTF-8 string

See notes to strtolower above

wordwrap

  • Official documentation: wordwrap
  • Risk: medium to high
  • Impact: could return a corrupt UTF-8 string

If the fourth “cut” argument is used, could split a UTF-8 sequence, resulting in corruption.

To be confirmed - what is the meaning of a “word” to this function. Is it the same as ucwords;

The definition of a word is any string of characters that is immediately after a whitespace (These are: space, form-feed, newline, carriage return, horizontal tab, and vertical tab).

If that is correct, wordwrap will only be dangerous if the cut argument is used.

Array Extension

Official docs at http://www.php.net/array.

FIXME - needs to become an explicit list of functions. Just a description right now.

The main issue related to arrays is sorting and (thankfully) this will be non-critical to most applications.

Functions like sort, when sorting alphanumerically, will lack the knowledge to know how to sort multi byte UTF-8 characters in a manner which is semantically correct. sort will still sort ASCII 7 characters correctly (semantically correct) but will only be able to sort multibyte UTF-8 characters based on their byte-by-byte values.

Because of UTF-8’s design, this will mean, after a sort, ASCII 7 characters will be at one end of a range while 4 byte sequences are at the other, with 2 and 3 byte sequences in between.
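
A small demonstration of that byte-by-byte ordering (the word list is made up for illustration):

```php
<?php
$words = array('zebra', 'école', 'apple');

// SORT_STRING compares the strings byte by byte; "é" starts with byte
// 0xC3, which is greater than any ASCII letter, so it sorts after "z".
sort($words, SORT_STRING);

print_r($words); // apple, zebra, école
```

The strings themselves are not damaged; only the order is not what a French speaker, say, would expect.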

Mail Functions

FIXME - mail and UTF-8 - content type headers? base64 encoding?

As mentioned at UTF-8 (compared to UTF-7);

> UTF-8 requires the transmission system to be eight-bit clean. In the case of e-mail this means it has to be further encoded using quoted printable or base64.


There seem to be two approaches (at least for the body of the email - ignoring subject / headers). If you want to send plain text, you have to encode that body with something like base64_encode. Alternatively, you could “attach” an HTML body, which then only needs to have the correct charset declaration.
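
A minimal sketch of the plain text approach follows. The helper name utf8_mail_parts and the address in the comment are hypothetical; the header names are the standard MIME ones, and the subject uses a MIME “encoded-word”. Long subjects would need splitting into several encoded-words, which this sketch does not attempt.

```php
<?php
// Hypothetical helper: prepare subject, body and headers for mail().
function utf8_mail_parts($subject, $body)
{
    return array(
        // Encoded-word form so the subject survives 7 bit transport
        'subject' => '=?UTF-8?B?' . base64_encode($subject) . '?=',
        // base64 body, wrapped at 76 characters per line by chunk_split()
        'body'    => chunk_split(base64_encode($body)),
        'headers' => "MIME-Version: 1.0\r\n"
                   . "Content-Type: text/plain; charset=UTF-8\r\n"
                   . "Content-Transfer-Encoding: base64",
    );
}

$parts = utf8_mail_parts('Iñtërnâtiônàlizætiøn', "Grüße\n");
// mail('someone@example.com', $parts['subject'], $parts['body'], $parts['headers']);
print_r($parts);
```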

Variables Handling

serialize / unserialize

  • Official documentation: serialize, unserialize
  • Risk: low
  • Impact: problem when using these for stuff like RPC / data exchange with external systems

Sometimes people use this functionality as a manner to talk to PHP from other languages. The serialized encoding embeds string lengths (in bytes!) into the encoded string. External languages / environments may have different understandings of string lengths.
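
For example, the length embedded below is 7 (bytes), although the string has 5 characters:

```php
<?php
$str = 'Iñtër'; // 5 characters, 7 bytes in UTF-8

echo serialize($str); // s:7:"Iñtër";
```

An external system which counts characters when reading this back would look for the closing quote in the wrong place.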

var_dump / debug_zval_dump

Just a potential debugging “gotcha” - if a web page is encoded as UTF-8, you may only see 3 characters, for example, while these functions report, say, 5 as the string length.

XML Extension (SAX)

Official docs at http://www.php.net/xml.

The SAX parser (officially) 3) supports three encodings: ISO-8859-1, US-ASCII and UTF-8 - see here. It distinguishes between source encoding (the encoding of an XML document it is parsing) and target encoding - the encoding of strings passed to your SAX callback functions.

The source encoding is either passed explicitly to xml_parser_create or (since PHP 5) determined automatically from the charset declaration in the XML document. If no source encoding is specified, PHP defaults to ISO-8859-1 (perhaps a design flaw - would have been smarter to default to UTF-8). If the source encoding contains byte sequences PHP doesn’t understand, it will raise an error e.g. the XML_ERROR_UNKNOWN_ENCODING or XML_ERROR_INCORRECT_ENCODING error codes.

The target encoding can be controlled with the xml_parser_set_option function. Any incoming characters outside the range of the target encoding are replaced with a question mark. That means if the source encoding is UTF-8 and the target encoding is US-ASCII, multibyte UTF-8 characters will be replaced with a question mark.
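
A sketch of setting both encodings explicitly, so nothing depends on the defaults. It assumes the xml extension is available and uses a closure for the callback, which needs PHP 5.3 or later; the sample document is made up.

```php
<?php
$xml  = '<p>Iñtërnâtiônàlizætiøn</p>';
$text = '';

$parser = xml_parser_create('UTF-8'); // source encoding
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');

// Character data is handed to the callback in the target encoding,
// possibly in several chunks, so append rather than assign.
xml_set_character_data_handler($parser, function ($parser, $data) use (&$text) {
    $text .= $data;
});

if (!xml_parse($parser, $xml, true)) {
    echo xml_error_string(xml_get_error_code($parser)), "\n";
}

echo $text, "\n";
```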

Note that the XML SAX extension should (not confirmed) spot badly formed UTF-8 in the source encoding. Also its definition of what is UTF-8 covers only sequences within the Unicode range (unlike the PCRE extension) - i.e. it doesn’t regard 5 and 6 byte sequences as being UTF-8.

See PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss. See also Magpie RSS 0.7+ which implements a work around for detecting / converting other character sets (currently in the rss.parse.inc file).

XML DOM Extension

Both the PHP4 and PHP5 XML DOM extensions use UTF-8 as their internal encoding. This means that they mostly get it right; however, there is one major GOTCHA, since they expect input strings to be UTF-8 encoded. If you use ISO-8859-1 as your internal encoding (which you most likely do), this means that each and every string that you pass to the DOM API should be encoded with utf8_encode. It’s important to realize that you have to do this regardless of which encoding the document is output in. Annoying to say the least, but at least it’s consistent.

utf8_encode and utf8_decode

  • Official documentation: utf8_encode, utf8_decode
  • Risk: medium
  • Impact: will result in corrupt UTF-8 if used incorrectly - they are used to convert only between UTF-8 and ISO-8859-1 - use on any other charset (excepting ASCII-7) would result in junk / lost characters

These functions are designed to convert between ISO-8859-1 and UTF-8 (nothing more, nothing less). In particular, note that older versions of IE / Win98 used CP1252 (a Windows encoding similar to, but not the same as, ISO-8859-1), which these functions do not handle. See this manual entry.

Some links

  • utf8encode utf-8 encodes HTML unicode entities (&#NNNN;).
  • utf8ToUnicodeEntities decodes utf-8 encoded strings into HTML unicode entities (&#NNNN;) or javascript ones (%uNNNN) .

URL Functions

Is it a good idea to use UTF-8 in URLs (security issues / mapping to filesystem / DB primary keys etc.)?

urlencode, rawurlencode

  • Official documentation: urlencode, rawurlencode
  • Risk: low
  • Impact: encoding a string that has previously been utf-8 encoded is generally safe - it’ll appear as a multibyte sequence rather than RFC-1738 conforming %uNNNN entities. The multibyte sequence will present correctly on a page declared to be encoded with the UTF-8 charset. However, a utf-8 encoder other than utf8_encode should be used to convert unicode entities to a utf-8 encoded string.
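
To illustrate what “appears as a multibyte sequence” means, each byte of the UTF-8 sequence is percent-encoded separately:

```php
<?php
$encoded = rawurlencode('ñ'); // "ñ" is the two bytes C3 B1 in UTF-8
echo $encoded, "\n";          // %C3%B1

$decoded = rawurldecode($encoded);
var_dump($decoded === 'ñ');   // the round trip gives back the same bytes
```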

urldecode, rawurldecode

  • Official documentation: urldecode, rawurldecode
  • Risk: medium
  • Impact: incoming unicode strings will be mangled


GD Extension

Official docs at http://www.php.net/gd.

FIXME Stuff todo here. In particular functions like imagettftext. Guessing it will depend largely on what the GD font you are using is able to support.


Otherwise suspect Gallery v2 has this nailed these days - need to look

exif extension

Official docs at http://www.php.net/exif.

FIXME Stuff to research here - what are the issues in reading exif data - are exotic charsets used? etc.


UTF-8 Safe Functionality

Special mentions for stuff which may be “surprisingly” safe with UTF-8. Note if “well formedness” is mentioned, it may mean you should be checking the strings for well formedness before using these functions.

explode

  • Official documentation: explode
  • Risk: none

So long as all arguments used are well formed UTF-8, no problems.

This works because every complete character sequence in a UTF-8 string is unique (cannot be mistaken as part of a longer sequence)
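
For instance, splitting on a multibyte delimiter leaves both pieces as well formed UTF-8:

```php
<?php
$parts = explode('ô', 'Iñtërnâtiônàlizætiøn');

print_r($parts); // "Iñtërnâti" and "nàlizætiøn"
```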

str_replace

So long as all arguments used are well formed UTF-8, no problems.

This works because every complete character sequence in a UTF-8 string is unique (cannot be mistaken as part of a longer sequence).

1) see table here UTF-8
2) referring to the table here UTF-8
3) PHP5 uses libxml2 which supports more encodings - rumour has it (not confirmed) that creating the parser like xml_parser_create(''); will allow it to support more than just the three official character sets, auto-detecting from the charset declaration
 
php/i18n/utf-8.txt · Last modified: 2006/04/21 09:44 by 80.243.116.251 (troelskn)
 