Sep 02 2005

I’ve seen several bug reports lately complaining that the compiler barfs on Unicode identifiers in certain contexts.  In all cases, the bug reports were in error because the source files were not Unicode encoded text.

Here’s the complicated rule:

If you want to represent Unicode characters in a source file, the source file needs to be saved in a format that can represent Unicode.

Seems obvious, no?

Plain ASCII text won’t cut it.  UTF-8 or UCS-2 is what you need.  Byte values above 127 in non-Unicode source files are interpreted based on the user’s current locale and charset/code page, so the same bytes can mean different characters depending on how your environment is set up at compile time.  We’ve tolerated this kind of ambiguity in string constants for years, but when adding support for Unicode identifiers to the Delphi language I chose to disallow that ambiguity.  The Delphi compiler will accept Unicode alphanumeric characters outside the traditional Pascal identifier range of ['a'..'z','A'..'Z','0'..'9','_'] if and only if the source file is encoded in Unicode.
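To make the ambiguity concrete, here is a small sketch.  It is written in Python rather than Delphi, purely as a neutral scripting tool, and the three code pages are examples picked for illustration, not anything the compiler specifically checks.  It decodes the same single byte under each code page and prints the resulting code point:

```python
# One byte above 127, as it might sit in a non-Unicode source file.
raw = b"\xe4"

# The same byte read under three different legacy code pages.
decoded = {cp: raw.decode(cp) for cp in ("cp1252", "cp1251", "cp437")}

for code_page, char in decoded.items():
    # Print code points rather than glyphs so the output is console-safe.
    print(f"{code_page}: U+{ord(char):04X}")

# In UTF-8 there is no such guesswork: U+00E4 always has one byte sequence.
utf8_bytes = "\u00e4".encode("utf-8")
print(utf8_bytes.hex())
```

Each code page yields a different character for the same byte, which is exactly the situation an identifier cannot tolerate.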

How can you tell if a source file is Unicode encoded?  Look for the invisible Byte Order Mark at the start of the file.  In UCS-2 big-endian encoding, the BOM bytes are $FE $FF; in little-endian, $FF $FE.  In UTF-8 encoding, the BOM is $EF $BB $BF.  (The BOM character, U+FEFF, is a zero-width non-printing character.  Its byte-swapped form, U+FFFE, is not a valid Unicode character, which is what makes the byte order detectable.)  Or, just open the file in Notepad (on NT systems) and see what text format it displays in the File: Save As dialog.
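If you would rather check from a script than from Notepad, the BOM test is only a few lines.  This is a minimal sketch in Python, not anything the Delphi compiler exposes; the function names `detect_bom` and `detect_file_bom` are made up for illustration, and it deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UCS-2 little-endian one.

```python
import codecs

# BOM byte sequences, taken from Python's standard codecs module.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)

def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

def detect_file_bom(path):
    """Read only the first few bytes of a file and check them for a BOM."""
    with open(path, "rb") as f:
        return detect_bom(f.read(4))
```

A result of `None` for something like `detect_file_bom("MyUnit.pas")` (a hypothetical file name) means the file carries no BOM, so any bytes above 127 in it are open to code page interpretation.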

How can you convert a source file to a Unicode encoding?  Open it in the Delphi 2005 IDE, right click and select File Format, and then select the character encoding format you want.  I recommend UTF-8.  UTF-8 is compact and still (mostly) readable with ASCII tools.  If you choose UTF-8, your source file will be viewable in any old ASCII text viewer if you can ignore the three “garbage” characters at the start of the file.  If you use characters above the ASCII range in your source file (such as umlaut chars), those are stored as multi-byte sequences and will not be readable by a plain ASCII viewer.  They will of course be readable by a more modern UTF-8 / Unicode aware editor such as Notepad or the Delphi IDE.
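The IDE is the easy route.  For a batch of files, a rough script equivalent might look like the sketch below.  It is in Python, the helper names are hypothetical, and it assumes you already know which code page the file was saved in; the `cp1252` default is only a guess suitable for Western European Windows setups.  Guess the wrong code page and the conversion will silently produce the wrong characters.

```python
import codecs

def to_utf8_with_bom(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode legacy code page bytes as UTF-8 with a leading BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return data  # already UTF-8 with a BOM; leave it alone
    text = data.decode(source_encoding)
    # The "utf-8-sig" codec writes the three BOM bytes before the text.
    return text.encode("utf-8-sig")

def convert_file(path, source_encoding="cp1252"):
    """Convert a file in place.  Keep a backup; this overwrites the original."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(to_utf8_with_bom(data, source_encoding))
```

Pure ASCII content comes through byte-for-byte unchanged apart from the three BOM bytes in front, which is why UTF-8 stays friendly to older tools.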