1. variable length characters

Forked from Re: euphoria text processing

DerekParnell said...

Both UTF-8 and UTF-16 use variable length characters.

Agreed. UTF-16 is a variable-length encoding. Even M$ agrees that you sometimes need a 32-bit value (two 16-bit code units, known as a surrogate pair) to represent a character in UTF-16: http://msdn.microsoft.com/en-us/library/windows/desktop/dd374069(v=vs.85).aspx
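To make the surrogate pair point concrete, here is a rough sketch in Python (not Euphoria, and not taken from any library discussed in this thread) of how a single code point maps to one or two 16-bit code units. The function name utf16_code_units is made up for illustration.

```python
def utf16_code_units(ch):
    """Return the 16-bit code units that encode one character in UTF-16.

    Hypothetical helper for illustration only.
    """
    cp = ord(ch)
    if cp < 0x10000:
        # Code points in the Basic Multilingual Plane fit in one 16-bit unit.
        return [cp]
    # Anything above U+FFFF is split into a high and a low surrogate.
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)    # top 10 bits
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits
    return [high, low]


# 'A' needs one code unit; U+1F600 (an emoticon) needs two.
print([hex(u) for u in utf16_code_units("A")])
print([hex(u) for u in utf16_code_units("\U0001F600")])
```

So any code that assumes "one 16-bit value = one character" breaks as soon as a character outside the BMP turns up.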


2. Re: variable length characters

If I may be permitted to add a small historical context here.

When Microsoft started using 16-bit characters, they were using an encoding called UCS-2, which was the forerunner of UTF-16. This was all before Win2000. At that time, all the character sets defined by the Unicode Consortium could fit into 16-bit integers. The consortium subsequently expanded the defined character sets, so Microsoft adopted the UTF-16 standard. UTF-16 has 16-bit and 32-bit characters but Microsoft only provide fonts for the 16-bit alphabets.

I believe this means that, in theory, Word could read any UTF-16 encoded file but only display those characters that are defined in the Basic Multilingual Plane (BMP). Approximately 97% of all the defined Consortium character sets fit into that plane, but there are now 16 more planes available. Only a small fraction of the non-BMP planes contain defined characters so far, so Microsoft haven't got a lot to be concerned about just yet. Their fonts don't yet support Egyptian Hieroglyphs, Phoenician, Gothic, Mahjong Tiles, etc ... though maybe they should think about the Emoticon characters.

So in summary, Microsoft technically supports UTF-16 (and thus 16- and 32-bit characters) but for practical purposes, their applications (and fonts) support a (large) subset instead, just those that fit into 16 bits.
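As a quick illustration of the BMP versus non-BMP split described above, here is a small Python sketch (Python rather than Euphoria, purely for illustration). The helper names plane and non_bmp_chars are my own invention, not part of any existing API.

```python
def plane(ch):
    """Return the Unicode plane number (0..16) of a single character.

    Plane 0 is the Basic Multilingual Plane (BMP).
    """
    return ord(ch) >> 16


def non_bmp_chars(text):
    """Return the characters in text that do not fit in a single 16-bit value."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


# A Latin letter, a Gothic letter (U+10330) and a Mahjong tile (U+1F000).
sample = "A\U00010330\U0001F000"
for ch in sample:
    print(hex(ord(ch)), "plane", plane(ch))
print(len(non_bmp_chars(sample)), "characters need a surrogate pair in UTF-16")
```

A check like this is one way to tell whether a given piece of text stays inside the 16-bit subset or not.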

We are quite welcome to design our own fonts and special use characters while coding for Windows platforms, but we must take care to use the appropriate API facilities to make them play nicely with Windows.

I would still recommend that Euphoria use a third-party Unicode library such as IBM's to provide Unicode support rather than develop 'native' Euphoria solutions to the text and language processing issues that arise.


3. Re: variable length characters

DerekParnell said...

I would still recommend that Euphoria use a third-party Unicode library such as IBM's to provide Unicode support rather than develop 'native' Euphoria solutions to the text and language processing issues that arise.

Agreed. Using a third-party library would mean less development time (for us), better development support (as we could reach out to the larger community behind that library on any Unicode-related issues), and fewer headaches trying to cover all the bases (right-to-left text entry, rare or rarely used character sets, variant characters, et al).

