Re: Euphoria and Unicode
- Posted by DerekParnell (admin) Oct 24, 2008
Wayyyl...
I'm pedantic; sorry, but you'll just have to adjust any expectations you have of me.
By "process and maintain w/o error" and preservation I mean:
- Assigning and transporting wide string char data retrieved from other in- and out-of-process data objects such as DDE if available, ODBC data sets and Euphoria keyboard input;
By "WIDE" I assume you mean UTF-16 encoding.
- Assigning is just copying numeric values around. That's ok.
- Transporting(?) I guess means moving text sequences to/from external storage. Not so straightforward. See below...
- manipulating those strings with standard functions like Trim/Mid/Replace/[=, <>, like];
Manipulating is fine, except when it's based on the values within the string. So trim() is a problem because it only trims whitespace, and so far it only knows about ASCII whitespace. Unicode whitespace is a superset of ASCII whitespace. There are rare times when subscripting UTF-16 will fail, mainly when dealing with some Chinese ideographs, because these can take 32 bits (a pair of 16-bit code units) rather than 16 bits to encode.
Comparisons between UTF-16 strings are not easy. Equality tests are okay, but anything based on collating order is a problem. Euphoria only does ASCII. It can't tell if A-grave is lower or higher than A-acute, for example, and this usually depends on the language anyway, independent of the UTF-16 encoding. That is, different languages collate the same characters in different orders.
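To make the two problems concrete, here is a rough sketch in Python rather than Euphoria, purely for illustration. The names trim_units and utf16_units are made up for this example and are not part of any library. It trims a list of numeric values using Unicode's idea of whitespace, and shows one ideograph turning into two 16-bit values.

```python
import struct

def trim_units(units):
    """Trim leading/trailing whitespace from a list of character values,
    using Unicode whitespace rules rather than ASCII only."""
    start, end = 0, len(units)
    while start < end and chr(units[start]).isspace():
        start += 1
    while end > start and chr(units[end - 1]).isspace():
        end -= 1
    return units[start:end]

def utf16_units(text):
    """Return the 16-bit code units UTF-16 uses for the given text."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

# IDEOGRAPHIC SPACE, 'A', NO-BREAK SPACE: an ASCII-only trim leaves both spaces.
print(trim_units([0x3000, 0x0041, 0x00A0]))

# U+20000 is a single ideograph but needs two code units (a surrogate pair),
# so subscripting element 1 of the UTF-16 form yields only half a character.
print([hex(u) for u in utf16_units("\U00020000")])
```

Anything that subscripts or slices UTF-16 data would need to check for values in the range 0xD800 to 0xDFFF before cutting.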
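A small Python sketch (again only for illustration, not Euphoria) of why ordering by raw character value goes wrong. The toy_key function is a hypothetical stand-in I made up: it compares base letters first and ignores accents, which is NOT real collation for any particular language. Real collation needs locale-specific rules.

```python
import unicodedata

def toy_key(word):
    """Toy sort key: the word with accents stripped, then the original
    word as a tie-breaker. An illustration, not a real collation."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base, word)

words = ["Zebra", "\u00c0b", "Apple"]  # "\u00c0" is A-grave

# Sorting by raw character value puts A-grave (0xC0) after 'Z' (0x5A).
by_value = sorted(words)

# Sorting by the toy key groups A-grave with the other A words.
by_toy_key = sorted(words, key=toy_key)

print(by_value)
print(by_toy_key)
```

Equality is untouched by any of this; it is only the "less than / greater than" tests that need language knowledge.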
- and, if not by reference, passing/retrieving full width strings as parameters...
This is not so easy. There is no built-in way to convert a Euphoria text sequence (which is stored as an array of 30-bit values) to a RAM array of 16-bit values. The function to do this isn't difficult and someone can show you how.
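As a rough idea of what such a function has to do, here is a sketch in Python rather than Euphoria. The name pack_utf16 is made up for this example. It assumes each element is a Unicode code point in the range 0 to 0x10FFFF, that the receiving routine wants little-endian 16-bit values (as on Windows), and that it expects a terminating zero.

```python
import struct

def pack_utf16(values):
    """Pack a list of code points into a little-endian buffer of
    16-bit units, ending with a zero terminator."""
    units = []
    for v in values:
        if v > 0xFFFF:
            # Outside the 16-bit range: split into a surrogate pair.
            v -= 0x10000
            units.append(0xD800 + (v >> 10))
            units.append(0xDC00 + (v & 0x3FF))
        else:
            units.append(v)
    units.append(0)  # assumption: the callee expects a NUL terminator
    return struct.pack("<%dH" % len(units), *units)

buf = pack_utf16([0x48, 0x69])  # "Hi"
print(buf)
```

In Euphoria the equivalent would allocate a block of memory and poke the low and high byte of each value into it, then pass the address to the external routine.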
...all without stripping any foreign language info from the tuple when placing/returning it to whatever data store I choose.
"tuple"? Do you mean sequence? What and how is "foreign language info" stored in the sequence? Euphoria reads and writes bytes. Each byte read in occupies a sequence element. If the file is stored in UTF-16 format (with or without BOM), you will have to have a special read/write routine to convert bytes read in to UTF-16 values. Likewise, to write UTF-16 values you will need to have a special routine to convert them to a byte stream.
Does that clarify? I'm sure there are many other ways of getting this across and I'll be glad to re-phrase for anyone to understand. I'm good with words.
Another way is, "If I build/use an international app with Euphoria, will terrorists put a contract out on me because I didn't use the right 'n'?"
Yes, I believe they will.