PHP Unicode Regular Expressions for Form Validation

Validating user-submitted form data is necessary, and it is often done against ASCII (English) characters only. But as more websites cater to an international market, many need to accept non-ASCII Unicode characters such as the letter é in the word extérieures. Since regular expressions are often used to compare user-submitted data against accepted patterns, these regexes need to match Unicode characters in order to support non-ASCII input. Here’s an example conversion from an ASCII regex to a Unicode one that also matches À through ă.

ASCII: ^([a-zA-Z0-9]+)$
UNICODE: ^([\u0030-\u0039\u0041-\u005a\u0061-\u007a\u00c0-\u0103]+)$

where
u0030 = 0
u0039 = 9
u0041 = A
u005a = Z
u0061 = a
u007a = z
u00c0 = À
u0103 = ă

You can look up Unicode characters and their code points at http://www.fileformat.info/info/unicode/char/005a/index.htm
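
If you would rather check a code point from PHP itself, the short sketch below prints the same table as above. It assumes the mbstring extension is loaded (mb_ord and mb_chr are available from PHP 7.2) and that the script file is saved as UTF-8; the list of characters is just the set from this post.

```php
<?php
// Assumption: mbstring is enabled and this file is encoded as UTF-8.
foreach (['0', '9', 'A', 'Z', 'a', 'z', 'À', 'ă'] as $char) {
    // mb_ord returns the code point as an integer; print it as 4 hex digits.
    printf("u%04x = %s\n", mb_ord($char, 'UTF-8'), $char);
}

// The reverse direction: from a code point back to the character.
echo mb_chr(0x00e9, 'UTF-8'), "\n"; // é
```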

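If you're using PHP to perform the regex matching, note that PCRE does not support the \uXXXX escape. Use the \x hex escape instead (with braces, \x{...}, for code points above ff) together with the /u modifier, e.g.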

preg_match('/^[\x30-\x39\x41-\x5a\x61-\x7a\x{c0}-\x{103}]+$/u', $str);
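
To see the pattern at work, here is a minimal sketch that wraps it in a function. The name isValidName is hypothetical and only used for illustration; the character class is the one from this post, written with \x{...} throughout for consistency.

```php
<?php
// Hypothetical helper; not part of PHP or any library.
function isValidName(string $str): bool
{
    // The /u modifier makes PCRE treat both the pattern and the subject as UTF-8,
    // so \x{c0}-\x{103} is a range of code points rather than of bytes.
    $pattern = '/^[\x{30}-\x{39}\x{41}-\x{5a}\x{61}-\x{7a}\x{c0}-\x{103}]+$/u';

    // preg_match returns 1 on a match, 0 on no match, and false on error
    // (for example, when $str is not valid UTF-8).
    return preg_match($pattern, $str) === 1;
}

var_dump(isValidName('extérieures')); // bool(true)
var_dump(isValidName('abc 123'));     // bool(false): the space is not in the class
var_dump(isValidName("\xff"));        // bool(false): not valid UTF-8
```

Keep in mind that the range from u00c0 to u0103 is a contiguous block of code points, not a list of letters: it also contains × (u00d7) and ÷ (u00f7), so this class accepts those two signs as well.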

Some useful references follow: