Historical UnicodeSupport, Revision 9
Unicode support has been missing from Euphoria since the beginning. All strings are implemented as sequences of integers, where each integer may hold values up to 2^30, but operations like puts, printf, sprintf and dir use only the lowest 8 bits of each integer, so this capacity of Euphoria strings goes unused.
This page describes possible ways of adding Unicode support to Euphoria.
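The loss can be seen with plain integer arithmetic. The following is a minimal sketch: #3020 is just the sample character used later on this page, and the and_bits mask only illustrates what an 8-bit output routine effectively keeps. It is not the actual source of puts.

```euphoria
-- A string whose last element does not fit in 8 bits.
sequence s = "face: " & #3020

-- What an 8-bit output routine effectively keeps of each element.
sequence written = and_bits(s, #FF)

-- The ASCII part survives, but #3020 has been reduced to #20 (a space).
? written[$]   -- prints 32
```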
Source file
Euphoria only accepts ASCII source files. UTF-16 or UCS-2 files are rejected outright because of the embedded null (0) bytes, which the scanner treats as illegal characters. UTF-8 is widely used because it maintains compatibility with ASCII, but unfortunately the bytes 128-255 are reserved for the (strange) shrouding that was used during the commercial days of Eu.
Embedding characters > 255 in strings
No escape sequence is provided for writing strings with characters whose values are larger than 255. If source files could be encoded in Unicode, the escape sequence would not be needed. Until then, an escape sequence should be provided to ease writing. For example, "This is my face: \u3020" instead of the current "This is my face: " & #3020, which is troublesome to write.
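Until the scanner supports such an escape, it can be emulated at run time. The sketch below is only an illustration: expand_u_escapes and hex_digit are hypothetical names, not part of Euphoria, and the helper assumes exactly four hex digits after \u.

```euphoria
-- Value of one hex digit, or -1 if ch is not a hex digit.
function hex_digit(integer ch)
    if ch >= '0' and ch <= '9' then
        return ch - '0'
    elsif ch >= 'A' and ch <= 'F' then
        return ch - 'A' + 10
    elsif ch >= 'a' and ch <= 'f' then
        return ch - 'a' + 10
    end if
    return -1
end function

-- Hypothetical helper: replace each backslash, 'u', four-hex-digit group
-- in s with the single integer it names. Anything else is copied as-is.
function expand_u_escapes(sequence s)
    sequence result = {}
    integer i = 1
    integer code, digit, ok
    while i <= length(s) do
        ok = 0
        if s[i] = '\\' and i + 5 <= length(s) and s[i + 1] = 'u' then
            ok = 1
            code = 0
            for j = i + 2 to i + 5 do
                digit = hex_digit(s[j])
                if digit < 0 then
                    ok = 0
                    exit
                end if
                code = code * 16 + digit
            end for
        end if
        if ok then
            result &= code
            i += 6
        else
            result &= s[i]
            i += 1
        end if
    end while
    return result
end function

-- "\\u" in Euphoria source is a literal backslash followed by 'u'.
sequence face = expand_u_escapes("This is my face: \\u3020")
-- face now equals "This is my face: " & #3020
```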
Unicode encoding format
The original UCS-4 design allowed code points up to 2^31 (the Unicode standard now stops at U+10FFFF), but using that whole range is highly impractical. A 16-bit encoding like UTF-16 is used natively in Windows NT and Java, since 16 bits cover almost all the regularly used characters in the world. Therefore Euphoria should use the lowest 16 bits of the integer to represent strings, no more and no less.
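A single 16-bit unit covers only code points up to #FFFF. UTF-16 represents anything above that as a pair of surrogate units. A minimal sketch of that split follows, assuming cp is a valid code point. The name utf16_units is hypothetical.

```euphoria
-- Hypothetical helper: the UTF-16 code units for one code point.
function utf16_units(integer cp)
    integer v
    if cp < #10000 then
        return {cp}                      -- fits in a single 16-bit unit
    end if
    v = cp - #10000                      -- 20 bits remain
    return {#D800 + floor(v / #400),     -- high surrogate: top 10 bits
            #DC00 + and_bits(v, #3FF)}   -- low surrogate: bottom 10 bits
end function

sequence one = utf16_units(#3020)    -- {#3020}
sequence two = utf16_units(#1F600)   -- {#D83D, #DE00}
```

A string that stores only the lowest 16 bits per element would have to carry such pairs as two separate elements, so length() and indexing would count units rather than characters.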
On the other hand, most of the routines will work with UTF-8 natively on Linux with little to no modification. This makes adding native support for UTF-8 a lot easier, and without the headaches involved in adding support for surrogate characters.
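One way to read this: if a sequence of code points is first converted to UTF-8 bytes, the existing 8-bit routines can write it unchanged. The following is a sketch under that assumption. utf8_encode_char and utf8_encode are hypothetical names, and the input is assumed to hold valid code points no larger than #10FFFF.

```euphoria
-- Encode one code point (0..#10FFFF) as a sequence of UTF-8 bytes.
function utf8_encode_char(integer cp)
    if cp < #80 then
        return {cp}
    elsif cp < #800 then
        return {#C0 + floor(cp / #40),
                #80 + and_bits(cp, #3F)}
    elsif cp < #10000 then
        return {#E0 + floor(cp / #1000),
                #80 + and_bits(floor(cp / #40), #3F),
                #80 + and_bits(cp, #3F)}
    else
        return {#F0 + floor(cp / #40000),
                #80 + and_bits(floor(cp / #1000), #3F),
                #80 + and_bits(floor(cp / #40), #3F),
                #80 + and_bits(cp, #3F)}
    end if
end function

-- Encode a whole string of code points as UTF-8 bytes.
function utf8_encode(sequence s)
    sequence bytes = {}
    for i = 1 to length(s) do
        bytes &= utf8_encode_char(s[i])
    end for
    return bytes
end function

-- Every element of the result is below 256, so puts can write it as-is.
sequence bytes = utf8_encode("This is my face: " & #3020)
puts(1, bytes & '\n')
```

Here #3020 becomes the three bytes #E3, #80, #A0, while the ASCII part of the string is left untouched.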
Routines (built-in and standard library) that need to be changed
Strings
- trim, trim_head, trim_tail - Also remove characters other than "\r\n\t " that are considered whitespace
- lower, upper - Correctly convert characters other than 'A'..'Z' and 'a'..'z'
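As an illustration of the trim change, the sketch below trims every code point that carries the Unicode White_Space property. is_unicode_space and unicode_trim are hypothetical names. A real implementation would more likely extend the existing trim routines than add new ones.

```euphoria
-- True (1) for code points with the Unicode White_Space property.
function is_unicode_space(integer ch)
    return (ch >= #09 and ch <= #0D)
        or ch = #20 or ch = #85 or ch = #A0 or ch = #1680
        or (ch >= #2000 and ch <= #200A)
        or ch = #2028 or ch = #2029 or ch = #202F
        or ch = #205F or ch = #3000
end function

-- Hypothetical helper: remove Unicode whitespace from both ends of s.
function unicode_trim(sequence s)
    integer first = 1
    integer last = length(s)
    while first <= last and is_unicode_space(s[first]) do
        first += 1
    end while
    while last >= first and is_unicode_space(s[last]) do
        last -= 1
    end while
    return s[first..last]
end function

-- #3000 is the ideographic space, #A0 the no-break space.
sequence trimmed = unicode_trim({#3000} & "hello" & {#A0, '\n'})
-- trimmed now equals "hello"
```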
Input/output
This section still needs several decisions. For open, the encoding should be specified in the ''mode'' parameter. If not specified, it defaults to ASCII, which means that only the lowest 8 bits are read/written. The syntax of the ''mode'' parameter becomes {r,w,a,u}{b,}{;encoding={utf8,utf16,ascii},}.
The above is wrong. ASCII is only 7 bits, but Euphoria allows 8 bits to be used for each character in the file name. UTF-8 filenames already work.
Note that
For example
fn = open("unicode.txt", "r;encoding=utf8")
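To make the proposed ''mode'' syntax concrete, here is a sketch of how such a string could be split into its parts. parse_mode and KNOWN_ENCODINGS are hypothetical. Nothing like this exists in open today, and the ASCII default simply follows the proposal above.

```euphoria
constant KNOWN_ENCODINGS = {"ascii", "utf8", "utf16"}

-- Hypothetical helper: split a mode string into {file_mode, encoding}.
-- Returns -1 if the part after ';' is not "encoding=" plus a known name.
function parse_mode(sequence mode)
    sequence prefix = "encoding="
    sequence file_mode, rest, encoding
    integer semi = find(';', mode)
    if semi = 0 then
        return {mode, "ascii"}      -- no encoding given: default to ASCII
    end if
    file_mode = mode[1..semi - 1]
    rest = mode[semi + 1..$]
    if length(rest) <= length(prefix)
    or not equal(rest[1..length(prefix)], prefix) then
        return -1
    end if
    encoding = rest[length(prefix) + 1..$]
    if find(encoding, KNOWN_ENCODINGS) = 0 then
        return -1
    end if
    return {file_mode, encoding}
end function

object parsed = parse_mode("r;encoding=utf8")
-- parsed now equals {"r", "utf8"}
```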
- read_file, read_lines - Define the encoding to read the file as.
- write_file, write_lines - Same as above.
- print, pretty_print, ?, sprint, printf, sprintf, puts, gets, getc, get - Accept (read and write) the lower 16 bits of each character, not just 8 bits, unless the file is opened using ASCII encoding.
- current_dir, chdir, dir - If the directory or file has non-ASCII characters, return them correctly
Operating System
- command_line, getenv - Unicode command lines and environment variables are possible; return them correctly.
- system, system_exec - Execute Unicode-named programs and pass Unicode parameters.
- abbreviate_path, canonical_path - Filesystem routines will need to be adjusted for the various slash characters that exist in Unicode.
Crashing
- crash_file, crash_message, crash - Accept Unicode filenames and messages.
Dynamic library
- open_dll - Enable Unicode filenames.
- message_box - Change implementation from MessageBoxA to MessageBoxW
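For message_box, the change amounts to passing 16-bit, zero-terminated strings to the wide entry point. The sketch below shows the idea and makes several assumptions: it targets Windows and Euphoria 4 (std/dll.e, std/machine.e and poke2), every element of text and title is taken to fit in 16 bits, and message_box_w is a hypothetical name rather than the library's implementation.

```euphoria
include std/dll.e      -- open_dll, define_c_func, C_POINTER, C_UINT, C_INT
include std/machine.e  -- allocate, free

-- Hypothetical helper: show a message box using MessageBoxW.
-- Returns -1 if user32.dll or the routine is unavailable.
function message_box_w(sequence text, sequence title)
    atom user32, ptext, ptitle
    integer id, result
    user32 = open_dll("user32.dll")
    if user32 = 0 then
        return -1
    end if
    id = define_c_func(user32, "MessageBoxW",
                       {C_POINTER, C_POINTER, C_POINTER, C_UINT}, C_INT)
    if id = -1 then
        return -1
    end if
    -- Two bytes per character, plus a terminating zero unit.
    ptext = allocate(2 * (length(text) + 1))
    ptitle = allocate(2 * (length(title) + 1))
    poke2(ptext, text & 0)
    poke2(ptitle, title & 0)
    result = c_func(id, {0, ptext, ptitle, 0})   -- no owner window, MB_OK
    free(ptext)
    free(ptitle)
    return result
end function
```

A call such as message_box_w("This is my face: " & #3020, "Unicode") would then display the character instead of its lowest 8 bits.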
Encodings supported
- utf16 (native)
- utf8 (because it is widely supported and does not depend on codepages)
- ascii (for compatibility purposes)
- Last modified Mar 07, 2011 by SDPringle