Re: variable length characters
- Posted by DerekParnell (admin) Jun 04, 2013
If I may be permitted to add a little historical context here.
When Microsoft started using 16-bit characters, they were using an encoding called UCS-2, the forerunner of UTF-16. This was all before Win2000. At that time, every character defined by the Unicode Consortium could fit into a 16-bit integer. The Consortium subsequently expanded the defined character sets beyond that range, so Microsoft adopted the UTF-16 standard. UTF-16 encodes a character as either a single 16-bit unit or a pair of 16-bit units (32 bits in total), but Microsoft only provide fonts for the characters that fit into a single 16-bit unit.
I believe this means that, in theory, Word could read any UTF-16 encoded file but only display those characters that are defined in the Basic Multilingual Plane (BMP). Approximately 97% of all the characters the Consortium has defined fit into that Plane, but there are now 16 more planes available. Only a small fraction of those non-BMP planes contain defined characters so far, so Microsoft haven't got a lot to be concerned about just yet. Their fonts don't yet support Egyptian Hieroglyphs, Phoenician, Gothic, Mahjong Tiles, etc. ... though maybe they should think about the Emoticon characters.
So in summary, Microsoft technically supports UTF-16 (and thus both 16-bit and 32-bit characters), but for practical purposes their applications (and fonts) support a (large) subset: just those characters that fit into 16 bits.
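To make the 16-bit/32-bit distinction concrete, here is a minimal sketch in plain C (my own illustration, not anything Microsoft- or Euphoria-specific) of how UTF-16 turns a character from outside the BMP, U+1F600 from the Emoticons block, into the two 16-bit surrogate units it is actually stored as:

```c
#include <stdio.h>

int main(void)
{
    /* U+1F600 (GRINNING FACE) lives in plane 1, outside the BMP. */
    unsigned int cp = 0x1F600;

    /* UTF-16 encodes code points above U+FFFF as a surrogate pair:
       subtract 0x10000, then split the remaining 20 bits in two. */
    unsigned int v  = cp - 0x10000;
    unsigned int hi = 0xD800 + (v >> 10);     /* high (lead) surrogate  */
    unsigned int lo = 0xDC00 + (v & 0x3FF);   /* low (trail) surrogate  */

    printf("U+%04X -> 0x%04X 0x%04X\n", cp, hi, lo);  /* 0xD83D 0xDE00 */
    return 0;
}
```

Any character in the BMP is stored as a single 16-bit unit; only the supplementary planes need this pairing.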
We are quite welcome to design our own fonts and special-use characters while coding for Windows platforms, but we must take care to use the appropriate API facilities to make them play nicely with Windows.
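On Windows, the "appropriate API facilities" for UTF-16 text are the wide-character ("W") entry points. A rough sketch, reusing the U+1F600 example above (whether the character actually renders as anything sensible depends on the installed fonts):

```c
#include <windows.h>

int main(void)
{
    /* A wide (UTF-16) string literal: "Hi " followed by U+1F600
       written out as its surrogate pair 0xD83D 0xDE00. */
    const wchar_t msg[] = L"Hi \xD83D\xDE00";

    /* Use the W (wide) variant of the API so the string is treated
       as UTF-16 rather than the current ANSI code page. */
    MessageBoxW(NULL, msg, L"UTF-16 test", MB_OK);
    return 0;
}
```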
I would still recommend that Euphoria use a third-party Unicode library, such as IBM's ICU, to provide Unicode support rather than develop 'native' Euphoria solutions to the text and language processing issues that arise.
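For the curious, ICU's C API (ICU4C) already handles the unit-versus-character bookkeeping that a native solution would have to reimplement. A small sketch, assuming ICU4C is installed and linked against its common library (icuuc):

```c
#include <stdio.h>
#include <unicode/ustring.h>   /* ICU4C string utilities */

int main(void)
{
    /* "Hi " plus U+1F600 as a surrogate pair: five UTF-16 units
       but only four user-visible characters (code points). */
    UChar s[] = { 'H', 'i', ' ', 0xD83D, 0xDE00, 0 };

    printf("UTF-16 units: %d, code points: %d\n",
           (int)u_strlen(s), (int)u_countChar32(s, -1));
    return 0;
}
```

Here u_strlen counts 16-bit units while u_countChar32 counts actual characters, which is exactly the distinction discussed above.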