UnicodeSupport

Unicode support has been missing from Euphoria since the beginning. Strings are implemented as sequences of integers, and each integer can hold values up to 2^30 - 1, but operations such as puts, printf, sprintf and dir use only the lowest 8 bits of each integer, so the extra range available in a Euphoria string goes unused.
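To make the mismatch concrete, here is a minimal sketch of what happens when only the lowest 8 bits of each element are used. The helper low_bytes is hypothetical and only imitates the truncation described above; it is not a library routine.

```euphoria
-- Hypothetical helper: keep only the lowest 8 bits of each element,
-- imitating the truncation described above.
function low_bytes(sequence s)
    sequence result
    result = repeat(0, length(s))
    for i = 1 to length(s) do
        result[i] = and_bits(s[i], #FF)
    end for
    return result
end function

sequence text
text = {'A', #20AC, 'B'}   -- 'A', EURO SIGN (U+20AC), 'B'

? text             -- the sequence itself holds all three code points
? low_bytes(text)  -- U+20AC collapses to #AC, so the character is lost
```

The sequence stores the code point without trouble; the information is lost only when the value is narrowed to a byte.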

This page describes the possible ways of putting Unicode support in Euphoria.

Source file

Euphoria accepts only ASCII source files. UTF-16 and UCS-2 files are rejected outright because they contain embedded null (0) bytes, which the scanner treats as illegal characters. UTF-8 is widely used because it maintains compatibility with ASCII, but unfortunately bytes 128-255 are reserved for the (strange) shrouding scheme that was used during the commercial days of Eu.

Unicode encoding format

Unicode defines characters up to 2^20 + 2^16 - 1 (#10_FFFF), but most of these code points are unused in modern languages or redundant. For this reason UTF-8 and UTF-16 are often favored over UTF-32 to save space in files. UTF-8 is used in our internal regex library, and the routines will work with UTF-8 natively on Linux with little to no modification. UTF-16 is used in the Windows and Java APIs.
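As an illustration of how UTF-8 packs code points into bytes, the sketch below encodes a sequence of code points using only built-in routines. The name utf8_encode is hypothetical, and the function assumes valid input: it does not reject surrogates (#D800..#DFFF) or values above #10FFFF.

```euphoria
-- Encode a sequence of Unicode code points as a sequence of UTF-8 bytes.
-- Sketch only: invalid code points are not rejected.
function utf8_encode(sequence codepoints)
    sequence bytes
    integer c
    bytes = {}
    for i = 1 to length(codepoints) do
        c = codepoints[i]
        if c < #80 then
            -- 1 byte: 0xxxxxxx
            bytes &= c
        elsif c < #800 then
            -- 2 bytes: 110xxxxx 10xxxxxx
            bytes &= {#C0 + floor(c / #40),
                      #80 + and_bits(c, #3F)}
        elsif c < #10000 then
            -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            bytes &= {#E0 + floor(c / #1000),
                      #80 + and_bits(floor(c / #40), #3F),
                      #80 + and_bits(c, #3F)}
        else
            -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            bytes &= {#F0 + floor(c / #40000),
                      #80 + and_bits(floor(c / #1000), #3F),
                      #80 + and_bits(floor(c / #40), #3F),
                      #80 + and_bits(c, #3F)}
        end if
    end for
    return bytes
end function
```

For example, utf8_encode({#20AC}) yields the three bytes #E2, #82, #AC, while plain ASCII input comes back unchanged.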

On the other hand, it is always at least as easy to manipulate characters in UTF-32 as in UTF-16 or UTF-8 (and easier for characters greater than #FFFF), and Euphoria sequences use at least 4 bytes per member no matter what API you use. Since UTF-8 and UTF-16 need more than one code unit to represent some characters, they do not save space in sequences; they use more space than UTF-32.
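The space argument can be checked by counting how many sequence elements each encoding form needs for the same text. This is a rough sketch; utf8_units, utf16_units and element_counts are hypothetical names, and the input is assumed to be a sequence of valid code points.

```euphoria
-- Number of UTF-8 code units (bytes) needed for code point c.
function utf8_units(integer c)
    if c < #80 then
        return 1
    elsif c < #800 then
        return 2
    elsif c < #10000 then
        return 3
    else
        return 4
    end if
end function

-- Number of UTF-16 code units needed for code point c.
function utf16_units(integer c)
    if c < #10000 then
        return 1
    else
        return 2   -- surrogate pair
    end if
end function

-- Element counts for text (a sequence of code points),
-- returned as {utf8, utf16, utf32}.
function element_counts(sequence text)
    integer n8, n16
    n8 = 0
    n16 = 0
    for i = 1 to length(text) do
        n8 += utf8_units(text[i])
        n16 += utf16_units(text[i])
    end for
    return {n8, n16, length(text)}
end function
```

For the text {'a', #E9, #20AC, #1F600} this gives 10 elements in UTF-8, 5 in UTF-16 and 4 in UTF-32. Since every element occupies at least 4 bytes, the UTF-32 form is the smallest in memory.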

Routines (built-in and standard library) that need to be changed

Strings

  • trim, trim_head, trim_tail - Also remove characters that are considered whitespace beyond "\r\n\t "
  • lower, upper - Correctly convert characters other than 'A'..'Z' and 'a'..'z'
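A Unicode-aware trim could look roughly like the sketch below. The routine name utrim is hypothetical, and the whitespace set is an assumption: it lists the code points that carry the Unicode White_Space property, which may be more or less than the final design wants. Case conversion is not covered here because it needs full mapping tables.

```euphoria
-- Assumed whitespace set: code points with the Unicode White_Space property.
constant UNICODE_WS = {#09, #0A, #0B, #0C, #0D, #20, #85, #A0, #1680,
    #2000, #2001, #2002, #2003, #2004, #2005, #2006, #2007, #2008,
    #2009, #200A, #2028, #2029, #202F, #205F, #3000}

-- Hypothetical Unicode-aware trim; text is a sequence of code points.
function utrim(sequence text)
    integer first, last
    first = 1
    last = length(text)
    -- skip leading whitespace
    while first <= last and find(text[first], UNICODE_WS) do
        first += 1
    end while
    -- skip trailing whitespace
    while last >= first and find(text[last], UNICODE_WS) do
        last -= 1
    end while
    return text[first..last]
end function
```

With this set, utrim({#3000, 'a', 'b', #A0}) returns "ab", where the current trim would leave the ideographic space and the no-break space in place.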

Input/output

This section still needs several decisions. For open, the encoding should be specified in the ''mode'' parameter. If it is not specified, it should default to ASCII, which means that only the lowest 7 bits are read or written. The syntax of the ''mode'' parameter becomes {r,w,a,u}{b,}{;encoding={utf8,utf16,ascii},}.

Note that

For example

fn = open("unicode.txt", "r;encoding=utf8") 
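One way open might split such a mode string is sketched below. This is only an illustration of the proposed syntax: parse_mode, ENC_KEY and ENCODINGS are hypothetical names, and returning -1 for an unknown encoding is an assumption modeled on how open reports failure.

```euphoria
constant ENC_KEY = ";encoding="
constant ENCODINGS = {"utf8", "utf16", "ascii"}

-- Split a proposed mode string such as "rb;encoding=utf8" into
-- {base_mode, encoding}. The encoding defaults to "ascii".
-- Returns -1 if the encoding name is not recognized.
function parse_mode(sequence mode)
    integer pos
    sequence enc
    pos = match(ENC_KEY, mode)
    if pos = 0 then
        return {mode, "ascii"}
    end if
    enc = mode[pos + length(ENC_KEY)..length(mode)]
    if not find(enc, ENCODINGS) then
        return -1
    end if
    return {mode[1..pos - 1], enc}
end function
```

Here parse_mode("r;encoding=utf8") gives {"r", "utf8"}, and a plain "wb" gives {"wb", "ascii"}, so existing code keeps its current behavior.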

What about providing new parameters at the end of these functions to specify encodings, and having Euphoria convert characters to the appropriate byte streams and vice versa? I know that UTF-8 consoles on Linux are easy to work with. Windows might be more difficult, since I think we'd have to start linking to the Wide versions of the API instead of the ANSI versions.

Operating System

Crashing

crash_file, crash_message, crash - Accept Unicode filenames and messages.

Dynamic library

  • open_dll - Enable Unicode filenames.
  • message_box - Change implementation from MessageBoxA to MessageBoxW
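Switching to the wide ("W") functions means passing UTF-16 text with a zero terminator instead of 8-bit text. A minimal sketch of the conversion step follows. The name utf16_z is hypothetical, the input is assumed to be a sequence of valid code points, and writing the units into memory as 16-bit values for the actual call is not shown.

```euphoria
-- Convert code points to UTF-16 code units, ending with a 0 terminator
-- as the Windows wide-character functions expect.
function utf16_z(sequence codepoints)
    sequence units
    integer c
    units = {}
    for i = 1 to length(codepoints) do
        c = codepoints[i]
        if c < #10000 then
            -- Basic Multilingual Plane: one code unit
            units &= c
        else
            -- above #FFFF: high and low surrogate
            c -= #10000
            units &= {#D800 + floor(c / #400),
                      #DC00 + and_bits(c, #3FF)}
        end if
    end for
    return units & 0
end function
```

For example, utf16_z({'A', #1F600}) returns {#41, #D83D, #DE00, 0}.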

Encodings supported

  • utf16 (native) (what is native about utf16? -mattlewis)
  • utf8 (because it is widely supported and does not depend on codepages)
  • ascii (for compatibility purposes)
  • utf32 ?
