Re: euphoria text processing
- Posted by DerekParnell (admin) Jun 03, 2013
The simple fact is that Euphoria has a 4 byte character as DEFAULT.
This is nitpicking, I know, but using the word "default" here might imply that there are alternatives. There are none. Euphoria only stores characters as integers. An integer can hold any valid Unicode value (code point, 0 to #10FFFF), and each one takes up 4 bytes in RAM.
Just to bring you current, Microsoft has a 2-byte character as default. So, in theory, Euphoria is fully ready for a total Unicode syntax based on the 16-bit character used by Microsoft. It is known as 16-bit LE (Little Endian). Microsoft chose it that way starting with Windows XP.
Firstly, just to make it very clear, the MICROSOFT default is UTF16LE encoding for strings in RAM and files, but this is not the WINDOWS default. Windows has no default, as it fully supports both ANSI encoding and UTF16 encoding; it's up to the application to choose which to use. In Microsoft's case, their applications (e.g. Word, Excel, Powerpoint, cmd.exe, etc.) choose to use UTF16 encoding of text strings. The 'little endian' choice was influenced by the native way that Intel chips store 16-bit integers in RAM. Had Microsoft designed their PCs around non-Intel chips (e.g. Motorola, as Apple did) then they probably would have used 'big endian' encoding.
In C or C plus plus, there is a methodology based on the W prefix signifying wide characters or 16-bit characters. However, I don't think anybody has made an effort to allow this way of compilation.
Unless I'm misunderstanding you, the "W" suffix is actually a Windows convention for naming their API functions and has nothing to do with C or C++. In the C/C++ language, there is a way of specifying 'wide' characters, and that is to use the 'L' prefix, e.g. wchar_t wszStr[] = L"1a1g";
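To make the 'L' prefix concrete, here is a minimal C sketch built around the same literal. The function name check_wide_literal is made up for this illustration. Note that the size of wchar_t is up to the compiler: it is typically 2 bytes with Windows compilers and 4 bytes with most Unix-like ones, so the code avoids assuming either.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* The L prefix makes this a wide string literal: an array of wchar_t
   rather than an array of char. */
static const wchar_t wszStr[] = L"1a1g";

/* Hypothetical helper, for illustration only: checks the shape of the
   wide literal without assuming a particular sizeof(wchar_t). */
void check_wide_literal(void)
{
    /* Four characters, not counting the terminating L'\0'. */
    assert(wcslen(wszStr) == 4);

    /* The array holds five wchar_t elements, including the terminator. */
    assert(sizeof wszStr == 5 * sizeof(wchar_t));
}
```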
You can quite easily write your own functions using the Peek/Poke facilities of Euphoria to create the full 16-bit LE text, and write little functions to select a character or number of characters, rotate, etc., using your own small functions. Alternatively you can modify the actual Euphoria low level functions in C language (or C plus). Whatever you do, you will have to create two levels of text functions. If you happen to want the extended Chinese then you can use the 4-byte per character setup similar to what I suggested above.
peek() and poke() would only be needed for converting text created by non-Euphoria applications into Euphoria sequences. Once text is in a sequence, the standard Euphoria functions can be used. However, when dealing with natural-language text, you need to be mindful of the many weird and wonderful ways that we humans have developed spoken and written text.
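As a rough idea of what such a conversion involves, here is a sketch in C rather than Euphoria. It reads raw UTF16LE bytes from memory and produces one integer per character, which is the shape the data would have in a Euphoria sequence. The function name utf16le_to_codepoints is hypothetical, and replacing unpaired surrogates with U+FFFD is a choice made for this example, not something either encoding requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a UTF-16LE byte buffer into code points, one 32-bit integer
   per character. 'out' must have room for nbytes / 2 entries.
   Returns the number of code points written. A trailing odd byte is
   ignored; unpaired surrogates become U+FFFD (assumption of this sketch). */
size_t utf16le_to_codepoints(const unsigned char *bytes, size_t nbytes,
                             uint32_t *out)
{
    size_t i = 0;
    size_t n = 0;

    assert(bytes != NULL && out != NULL);

    while (i + 1 < nbytes) {
        /* Little endian: low byte first, then high byte. */
        uint32_t unit = (uint32_t)bytes[i] | ((uint32_t)bytes[i + 1] << 8);
        i += 2;

        if (unit >= 0xD800 && unit <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            if (i + 1 < nbytes) {
                uint32_t low = (uint32_t)bytes[i]
                             | ((uint32_t)bytes[i + 1] << 8);
                if (low >= 0xDC00 && low <= 0xDFFF) {
                    i += 2;
                    out[n++] = 0x10000
                             + ((unit - 0xD800) << 10)
                             + (low - 0xDC00);
                    continue;
                }
            }
            out[n++] = 0xFFFD;          /* unpaired high surrogate */
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            out[n++] = 0xFFFD;          /* stray low surrogate */
        } else {
            out[n++] = unit;            /* ordinary 16-bit character */
        }
    }
    return n;
}
```

In Euphoria the byte reads would be done with peek() on the buffer's address, and the results appended to a sequence instead of written to an array.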
For example, in German a double 's' is regarded as a single character for sorting and comparison, but can be written as either two 's' characters or the special character 'ß'. Then there is the double 'L' in Spanish, which is sometimes regarded as a single character and sometimes not, depending on the context. Consider the Thai language, where some vowels are written before the consonant though sorted as if written after it. And some vowels are written below, some above, some on either side, and some around their consonants. Arabic and Hebrew (and some others) are written Right-To-Left.
In general, once you leave the simple world of ANSI/ASCII, text processing has complications!
The other popular system which the Linux enthusiasts use and the Web has almost adopted is UTF-8. While it is a neat system for the web, the creation (by you) of the new functions similar to what I suggested above would be more difficult. The reason for this is that UTF-8 for a majority of international characters creates anything from one to 5 bytes per character, and selection and extraction of text is a bit more complicated. On the other hand, you can access a vast number of pre-defined character manipulation libraries for UTF-8 and interface to them if you know how to connect with C language.
Both UTF8 and UTF16 use variable length characters. If you need speed in text processing, it might be better to convert text to UTF32 first.
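A minimal sketch of that conversion, in C, might look like the following. It assumes the modern UTF-8 definition (RFC 3629), in which a character takes at most 4 bytes. The function name utf8_to_utf32 is hypothetical, malformed input is replaced by U+FFFD as a choice of this example, and overlong forms and encoded surrogates are deliberately not checked, so treat it as an illustration rather than a validating decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert UTF-8 bytes to UTF-32 code points. 'out' must have room for
   nbytes entries (the worst case is pure ASCII). Returns the number of
   code points written. Malformed sequences yield U+FFFD; overlong
   forms and encoded surrogates are NOT detected in this sketch. */
size_t utf8_to_utf32(const unsigned char *s, size_t nbytes, uint32_t *out)
{
    size_t i = 0;
    size_t n = 0;

    assert(s != NULL && out != NULL);

    while (i < nbytes) {
        unsigned char b = s[i];
        uint32_t cp;
        size_t extra;   /* continuation bytes expected after the lead */
        size_t k;

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else {
            /* Not a valid lead byte: emit a replacement and move on. */
            out[n++] = 0xFFFD;
            i++;
            continue;
        }
        i++;

        /* Each continuation byte has the form 10xxxxxx and adds 6 bits. */
        for (k = 0; k < extra; k++) {
            if (i >= nbytes || (s[i] & 0xC0) != 0x80)
                break;
            cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
            i++;
        }

        out[n++] = (k == extra && cp <= 0x10FFFF) ? cp : 0xFFFD;
    }
    return n;
}
```

Once the text is in this form, the third character is simply out[2], with no scanning from the start of the string, which is where the speed advantage comes from.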
Forked into: variable length characters