UnicodeSupport
Unicode support has been missing from Euphoria since the beginning. All strings are implemented as sequences of integers, where each integer may hold values up to 2^30, but operations such as puts, printf, sprintf and dir use only the lowest 8 bits of each integer, so this capacity of Euphoria strings goes unused.
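The effect of using only the lowest 8 bits can be sketched as follows. This is Python used purely for illustration, not Euphoria code, and the helper name low_byte_output is hypothetical; it assumes the output routine simply masks each element to one byte.

```python
def low_byte_output(seq):
    """Mimic an output routine that keeps only the low 8 bits of each element.

    Assumption: the truncation behaves like a bitwise AND with 0xFF.
    """
    return bytes(c & 0xFF for c in seq)


# A "string" held as a sequence of code points: 'A', Greek lambda, euro sign.
text = [0x41, 0x3BB, 0x20AC]

# Only 'A' survives; the other two elements collapse to unrelated bytes.
print(low_byte_output(text))
```

Under that assumption, any two characters whose code points share the same low byte become indistinguishable on output.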
This page describes the possible ways of putting Unicode support in Euphoria.
Source file
Euphoria only accepts ASCII source files. UTF-16 or UCS-2 files are rejected outright because they contain embedded null (0) bytes, which the scanner treats as an illegal character. UTF-8 encoding is widely used since it maintains compatibility with ASCII, but unfortunately the bytes 128-255 are reserved for the (strange) shrouding that was used during the commercial days of Eu.
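A small Python sketch (illustration only; the helper names are hypothetical) shows why each encoding runs into the scanner: UTF-16 puts a 0 byte next to every ASCII character, while UTF-8 leaves ASCII untouched but encodes everything else with bytes in the 128-255 range.

```python
def has_null_bytes(data):
    """True if the byte string contains at least one 0 byte."""
    return 0 in data


def high_bytes(data):
    """Return the bytes in the range 128-255."""
    return [b for b in data if b >= 128]


source = 'puts(1, "hi")'

# UTF-16 pads every ASCII character with a 0 byte.
print(has_null_bytes(source.encode("utf-16-le")))

# UTF-8 is byte-for-byte identical to ASCII for ASCII-only text.
print(source.encode("utf-8") == source.encode("ascii"))

# A non-ASCII character in UTF-8 consists only of bytes >= 128.
print(high_bytes("é".encode("utf-8")))
```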
Unicode encoding format
Unicode defines code points up to 2^20+2^16-1 (#10_FFFF), but most of these code points are unused in modern languages or redundant. For this reason UTF-8 and UTF-16 are often favored over UTF-32 to save space in files. UTF-8 is used in our internal regex library, and the routines will work with UTF-8 natively on Linux with little to no modification. UTF-16 is used in the Windows and Java APIs.
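The arithmetic and the file-size argument can be checked with a short Python sketch (illustration only; encoded_sizes is a hypothetical helper):

```python
# Highest Unicode code point, written as in the text above.
MAX_CODE_POINT = 2**20 + 2**16 - 1
print(hex(MAX_CODE_POINT))


def encoded_sizes(ch):
    """Bytes needed to store one character in each encoding (no BOM)."""
    return {
        "utf8": len(ch.encode("utf-8")),
        "utf16": len(ch.encode("utf-16-le")),
        "utf32": len(ch.encode("utf-32-le")),
    }


# ASCII letter, accented Latin letter, euro sign, and a character above #FFFF.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(hex(ord(ch)), encoded_sizes(ch))
```

In a file, UTF-32 always costs 4 bytes per character, while UTF-8 and UTF-16 only reach 4 bytes for characters above #FFFF.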
On the other hand, it is always at least as easy to manipulate characters in UTF-32 as in UTF-16 or UTF-8 (and easier for characters greater than #FFFF), and Euphoria sequences use at least 4 bytes per member no matter what API you use. And since UTF-8 and UTF-16 need more than one code unit to represent some characters, they do not save space in sequences; they use more space than UTF-32.
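To make the space argument concrete, here is a Python sketch (illustration only) that counts how many sequence elements each encoding would need. It assumes one code unit per sequence element and exactly 4 bytes per element; code_units and BYTES_PER_ELEMENT are names invented for this example.

```python
BYTES_PER_ELEMENT = 4  # assumption: the minimum cost of one sequence member


def code_units(text, encoding, unit_size):
    """Number of code units (one per sequence element) needed for text."""
    return len(text.encode(encoding)) // unit_size


text = "a\u20ac\U0001F600"  # 'a', euro sign, and one character above #FFFF

for name, encoding, unit_size in [
    ("utf8", "utf-8", 1),
    ("utf16", "utf-16-le", 2),
    ("utf32", "utf-32-le", 4),
]:
    n = code_units(text, encoding, unit_size)
    print(name, n, "elements,", n * BYTES_PER_ELEMENT, "bytes")
```

Under these assumptions UTF-32 never needs more elements than the other two encodings, because it always uses exactly one element per character.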
Routines (built-in and standard library) that need to be changed
Strings
- trim, trim_head, trim_tail - Also remove characters that are considered whitespace other than "\r\n\t "
- lower, upper - Correctly convert characters other than 'A'..'Z' and 'a'..'z'
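The difference between ASCII-only and Unicode-aware behaviour can be sketched in Python (illustration only). The ascii_trim and ascii_lower helpers are hypothetical stand-ins for routines that only know about "\r\n\t " and 'A'..'Z'; Python's own str.strip and str.lower serve as the Unicode-aware reference.

```python
def ascii_trim(s):
    """Trim only the four ASCII whitespace characters."""
    return s.strip("\r\n\t ")


def ascii_lower(s):
    """Lower-case only 'A'..'Z', leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


padded = "\u3000title\u00a0"  # ideographic space and no-break space
print(repr(ascii_trim(padded)))  # the non-ASCII spaces are left in place
print(repr(padded.strip()))      # Unicode-aware trim removes them

word = "\u00c9COLE"              # starts with E-acute (capital)
print(ascii_lower(word))         # the accented capital is not converted
print(word.lower())              # Unicode-aware conversion handles it
```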
Input/ output
This section still needs quite a few decisions. For open, the encoding should be specified in the ''mode'' parameter. If not specified, it should default to ASCII, which means that only the lowest 7 bits are read/written. The syntax of the ''mode'' parameter becomes {r,w,a,u}{b,}{;encoding={utf8,utf16,ascii},}.
Note that
For example
fn = open("unicode.txt", "r;encoding=utf8")
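One way the proposed ''mode'' syntax could be parsed is sketched below in Python (illustration only, not an actual Euphoria implementation). The function name parse_mode and its return shape are assumptions; the grammar and the ASCII default are taken from the description above.

```python
import re

# {r,w,a,u}{b,}{;encoding={utf8,utf16,ascii},}
MODE_RE = re.compile(r"([rwau])(b?)(?:;encoding=(utf8|utf16|ascii))?")


def parse_mode(mode):
    """Split a mode string into (access, binary, encoding).

    The encoding defaults to "ascii" when no ;encoding= part is given.
    """
    m = MODE_RE.fullmatch(mode)
    if m is None:
        raise ValueError("invalid mode: %r" % mode)
    access, binary, encoding = m.groups()
    return access, binary == "b", encoding or "ascii"


print(parse_mode("r;encoding=utf8"))
print(parse_mode("wb"))
```

Keeping the encoding inside the mode string means existing calls such as open(name, "r") would keep their current meaning.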
- read_file, read_lines - Define the encoding to read the file as.
- write_file, write_lines - Same as above.
- print, pretty_print, ?, sprint, printf, sprintf, puts, gets, getc, get - Accept (read and write) the lower 16 bits of each string element, not just the lower 8 bits, unless the file is opened using ASCII encoding.
- current_dir, chdir, dir - If the directory or file name has non-ASCII characters, return them correctly.
What about providing new parameters at the end of these functions to specify encodings and have Euphoria convert characters to the appropriate byte streams and vice versa? I know that UTF-8 consoles on Linux are easy to work with... Windows might be more difficult, since I think we'd have to start linking to the Wide versions of the API instead of the ANSI versions.
Operating System
- command_line, getenv - Unicode command lines and environment variables are possible; return them correctly.
- system, system_exec - Execute programs with Unicode names and Unicode parameters.
- abbreviate_path, canonical_path - Filesystem routines will need to be adjusted for the various slash characters that exist in Unicode.
Crashing
- crash_file, crash_message, crash - Accept Unicode filenames and messages.
Dynamic library
- open_dll - Enable Unicode filenames.
- message_box - Change implementation from MessageBoxA to MessageBoxW
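Switching to the wide (W) entry points means passing null-terminated UTF-16 strings instead of 8-bit ones. The conversion from a sequence of code points to such a buffer can be sketched in Python (illustration only; to_wide_buffer is a hypothetical helper, and the real change would live in the Euphoria runtime or library code):

```python
def to_wide_buffer(code_points):
    """Encode code points as little-endian UTF-16 plus a 16-bit null terminator."""
    text = "".join(chr(cp) for cp in code_points)
    return text.encode("utf-16-le") + b"\x00\x00"


# Two ASCII characters: each becomes one 16-bit unit.
print(to_wide_buffer([0x48, 0x69]))

# A character above #FFFF becomes a surrogate pair (two 16-bit units).
print(to_wide_buffer([0x1F600]))
```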
Encodings supported
- utf16 (native) (what is native about utf16? -mattlewis)
- utf8 (because it is widely supported and does not depend on codepages)
- ascii (for compatibility purposes)
- utf32 ?