Sep 03 2004

I’ve just checked in Delphi compiler support for parsing source code identifiers that contain Unicode extended characters. Chances are good that this will appear in the next release of the Delphi products.

Why?

Why should we allow Unicode characters in Delphi identifiers? Isn’t ASCII good enough?

The first reason is that programmers want to express their ideas in a programming language using terms and symbols that are most familiar to them. Language reflects and drives the structure of our thought processes. If a non-English speaker has to go through an English filter before they can render their ideas in code, the code is that much farther away from representing their actual idea or intent.

The second (and more compelling) reason is that Unicode is ubiquitous in modern system architectures. Not becoming ubiquitous – it already is. Both Java and .NET default to representing all string data (including identifiers and metadata) in Unicode. Win32 has supported Unicode since its inception, but Win95's failures in this area and its enigmatic market success discouraged developers from pursuing Unicode in the Windows realm.

Unicode identifier support is needed in Delphi for .NET as a matter of completeness. Without it, there is a portion of the .NET metadata lexical space that you simply cannot access from Delphi source. We can argue all day about how big or insignificant that portion is, but you can’t escape the fact that there is a hole in your coverage.

I have a bad habit of referring to Unicode and UTF8 interchangeably. Unicode is the character set. UTF8 is a manner of encoding (compressing) Unicode characters into bytes.

What/How

Unicode chars in identifiers are accepted only when the source file is UTF8 or Unicode encoded. Placing an Ö (o-umlaut) in the middle of an identifier in plain old ASCII source will fail to compile, even if your locale and codepage are set to German.

High-ASCII chars for European languages aren’t so much of an issue as multibyte encodings for Asian languages. Locale-based multibyte character encodings often extend trail byte values into the low ASCII-7 range, making it very difficult to tell if an arbitrary byte pulled at random from the text file is a stand-alone character, a lead byte or a trail byte. UTF8 encoding sets the high bit for lead and trail bytes, so there is never any ambiguity about whether a given byte is a normal ASCII char or part of a multibyte sequence.
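The difference is easy to see in a few lines. This sketch (my own illustration, not Borland's code) classifies bytes the way a UTF8-aware scanner can, and contrasts it with a legacy multibyte encoding such as Shift-JIS, whose trail bytes can land in the ASCII-7 range:

```python
def utf8_byte_kind(b: int) -> str:
    """Classify a single byte pulled at random from UTF-8 text."""
    if b < 0x80:
        return "ascii"   # plain ASCII-7; never part of a multibyte sequence
    if b <= 0xBF:
        return "trail"   # continuation byte: 10xxxxxx
    return "lead"        # lead byte: 11xxxxxx

# Every non-ASCII byte in UTF-8 has the high bit set:
text = "Straße".encode("utf-8")            # b'Stra\xc3\x9fe'
kinds = [utf8_byte_kind(b) for b in text]  # ... 'lead', 'trail' for ß

# In Shift-JIS, by contrast, the katakana ソ encodes as 0x83 0x5C,
# and 0x5C is also the ASCII backslash – hence the ambiguity:
sjis = "ソ".encode("shift_jis")
```

In UTF8, one glance at a byte's high bit tells you whether you are inside a multibyte sequence; in a locale encoding you may have to rewind to the start of the line to know.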

Just to give your browser fits: So, you could see a Cyrillic Я (“ya”), Hebrew ש (“shin”), Arabic ك (“kaf”), Japanese あ (Hiragana “A”), or Hangul 한 in a Delphi Unicode identifier – or any combination of all of them!

The compiler is still tuned for handling byte-sized ASCII-7 characters. ASCII-7 chars are unaffected by UTF8 encoding, which is a large part of what makes UTF8 so appealing. UTF8 provides the compactness of ASCII-7 for text that consists mostly of Latin characters and the breadth and precision of Unicode as needed.

The compiler’s scanner, hash function, and case insensitive name compare routines process ASCII-7 chars as before, but now also detect when the high bit is set in a char as it goes through the pipeline. At a convenient checkpoint, the routine will realize a high bit went by and will shift gears from fast ASCII to a slower more precise Unicode equivalent routine. The hash function and name compare continue to be case insensitive, even with Unicode identifiers.
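As a rough illustration of that gear-shifting idea (the function name, hash, and structure here are my own invention, not the compiler's actual routines), a hot loop can fold ASCII case cheaply while merely noting high bits, then redo the work with full Unicode case folding only when a high bit actually went by:

```python
def hash_ident(ident_bytes: bytes) -> int:
    """Case-insensitive identifier hash with an ASCII fast path."""
    saw_high_bit = False
    h = 0
    for b in ident_bytes:
        saw_high_bit |= b >= 0x80     # cheap check, stays in the hot loop
        if 0x41 <= b <= 0x5A:         # fast ASCII fold: 'A'..'Z' -> 'a'..'z'
            b += 0x20
        h = (h * 31 + b) & 0xFFFFFFFF
    if saw_high_bit:
        # Checkpoint: a high bit went by, so shift gears to the slower,
        # precise Unicode case folding and rehash.
        folded = ident_bytes.decode("utf-8").casefold().encode("utf-8")
        h = 0
        for b in folded:
            h = (h * 31 + b) & 0xFFFFFFFF
    return h

# Case insensitivity holds for plain ASCII and Unicode identifiers alike:
assert hash_ident(b"Name") == hash_ident(b"NAME")
assert hash_ident("Größe".encode()) == hash_ident("GRÖSSE".encode())
```

Pure-ASCII identifiers never touch the slow path, which is the whole point: the common case pays only for one extra compare per byte.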

I can’t make Unicode text scanning as fast as ASCII, but I can make it so that supporting the occasional Unicode character does not drag the whole system down to Unicode speed. Even if Japanese or Chinese Delphi developers start writing most of their code using Unicode identifiers, the ASCII-7 chars in their source files (white space, punctuation, operator symbols) will still outnumber Unicode chars by an order of magnitude.

The performance hit to source files that do not contain Unicode identifiers should be between zero and 1%. I actually found a way to make the hash function a little faster so it should balance out to a net zero.

To the compiler, the Unicode characters are just an opaque payload. It doesn’t know what they are, just that they have high bits set and it should continue consuming them until it finds a whitespace or other char that qualifies as terminating an identifier. In the present implementation, the compiler doesn’t perform any analysis or validation of the Unicode char stream. Unicode punctuation characters (above ASCII-7) will be considered part of the identifier just like any other Unicode char. That will probably change in the future – next week or next product. It’s not critical to enabling access to Unicode idents available elsewhere in the ecosystem.

Note that the Delphi IDE normalizes all source files to UTF8 in its editor buffers and files that it feeds to the internal compiler, so when you compile from the IDE you should always have the ability to use Unicode identifiers. If you choose to save that source file in a non-Unicode format on disk, then you will probably have trouble compiling the Unicode identifiers in that file using the command line compiler.

Oh, I also added support for the command line compiler to handle Unicode source files. That was a few months ago. Delphi 8 handled UTF8 encoded sources, but not UCS2 encoded sources. The command line compiler now supports UTF8, UCS2 big endian and little endian, and locale-encoded sources. UCS4 encodings are recognized and rejected. UCS2 encoded files are converted to UTF8 internally before they are seen by the scanner, so expect your build times to be a little longer if you standardize all your source on UCS2 instead of UTF8.
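The detection logic can be sketched like this (the function names are mine; the byte signatures are the standard Unicode BOMs). Note that the 4-byte UCS4 signatures must be checked before the 2-byte UCS2 ones, since the UCS4 little-endian BOM begins with the UCS2 little-endian BOM:

```python
def sniff_encoding(data: bytes) -> str:
    """Identify a source file's encoding from its Byte Order Mark."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    # Check 4-byte UCS4 BOMs before the 2-byte UCS2 BOMs they contain.
    if data.startswith(b"\xff\xfe\x00\x00") or data.startswith(b"\x00\x00\xfe\xff"):
        raise ValueError("UCS4 sources are recognized but rejected")
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"   # UCS2 little endian
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"   # UCS2 big endian
    return "locale"          # no BOM: assume the current locale charset

def to_utf8(data: bytes) -> bytes:
    """Normalize UCS2 sources to UTF-8 before the scanner sees them."""
    enc = sniff_encoding(data)
    if enc.startswith("utf-16"):
        return data[2:].decode(enc).encode("utf-8")  # strip BOM, transcode
    if enc == "utf-8":
        return data[3:]                              # strip BOM
    return data  # locale-encoded text is handled by a different path
```

The UCS2-to-UTF8 transcoding step is also where the build-time cost mentioned above comes from.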

A Byte Order Mark at the start of the source file is mandatory for the compiler to recognize your character encoding. Source without a BOM is assumed to be in the current locale charset.

Side note: Tagawa added a charset command line switch so that you can specify in which locale charset a source file should be compiled. Very handy when you need to compile non-Unicode sources provided by your German colleagues in the same makefile as compiling non-Unicode sources from your French colleagues. (p.s. Don’t mention the war!)

Unicode identifiers may not appear in the published section of a class type. This is to minimize the impact on VCL components and third party code of “funny characters” showing up in the RTTI.

Though the primary push for this feature came from Delphi for .NET, there is little technical reason not to support Unicode identifiers in Win32 as well. It would be more work to turn it off in Win32.

Action Items

So what does this mean for you? If you write Delphi code that scans/parses Delphi source files, or deal with Delphi source identifiers (excluding RTTI), you should review your code for assumptions about the character content of an identifier. The traditional Delphi rule for identifiers is: starts with a..z or underscore, followed by a..z, 0..9, or underscore. With Unicode identifiers, the rule changes only slightly: an identifier may begin with a Unicode alphabetic character or underscore, followed by Unicode alphanumeric characters or underscores. (Never mind that what I just checked in doesn’t enforce those rules yet. You’re better off assuming it will than assuming it won’t.)
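That rule can be sketched as follows (a hypothetical checker of my own, taking “alphabetic” and “alphanumeric” in the usual Unicode character-class sense; the shipped compiler's exact classification may differ):

```python
def is_valid_ident(name: str) -> bool:
    """Check a name against the Unicode-era identifier rule:
    starts with a Unicode alphabetic char or underscore, followed by
    Unicode alphanumeric chars or underscores."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    if first != "_" and not first.isalpha():
        return False
    return all(c == "_" or c.isalnum() for c in rest)

assert is_valid_ident("Größe2")       # high-ASCII European chars: fine
assert is_valid_ident("_総計")         # leading underscore, CJK chars: fine
assert not is_valid_ident("2fast")    # digits still can't lead
assert not is_valid_ident("a。b")      # Unicode punctuation rejected
```

The punctuation case is exactly the one the current check-in does not yet enforce, so treat it as rejected when writing your own scanner.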

If you maintain a Delphi source scanner, this would also be a good time to review your support for UTF8, UCS2-BE and UCS2-LE source encodings.

I’ll be showing some amusing tangents of using Unicode identifiers in Delphi source at my BorCon sessions Sept 12-14. (almost next week! gah!)
