Re: Euphoria vs The Other Guys --- and RTFM
- Posted by DerekParnell (admin) May 03, 2014
In my first look at text data I was locked into the old idea that one byte was one character; that makes indexing a character in a string very easy. UTF-8 is a variable-length encoding, so indexing individual characters in Euphoria is no longer fun. In a Python3 string: x = "▒∆ Hello", printing x[1] produces ∆, which in UTF-8 is the bytes {226,136,134}. How is Euphoria going to evolve to get a similar convenience?
It's easy. A Unicode string in Euphoria is just a sequence where one element = one code point, and each element is a 32-bit integer. In other words, Unicode strings in Euphoria are held in a sequence using UTF-32 encoding. We would have functions that convert to and from other UTF encodings. So in your example above, the resulting sequence would be {9618, 8710, 32, 72, 101, 108, 108, 111}, and indexing a single element gives you a whole character (8710 is ∆).
Of course, I'm speaking about future functionality, as Euphoria currently doesn't support Unicode source text.
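To make the "one element = one code point" idea concrete, here is a rough sketch in Python (only because the question used Python). The helper names utf8_to_code_points and code_points_to_utf8 are made up for illustration; they are not Euphoria library routines, just a stand-in for the kind of conversion functions described above.

```python
def utf8_to_code_points(data):
    """Decode UTF-8 bytes into a list with one integer per code point."""
    return [ord(ch) for ch in data.decode("utf-8")]


def code_points_to_utf8(points):
    """Encode a list of code points back into UTF-8 bytes."""
    return "".join(chr(p) for p in points).encode("utf-8")


# "\u2592\u2206 Hello" is the string from the question, written with escapes
# so the example does not depend on the source file's encoding.
text = "\u2592\u2206 Hello"
raw = text.encode("utf-8")

points = utf8_to_code_points(raw)
print(points)      # [9618, 8710, 32, 72, 101, 108, 108, 111]
print(points[1])   # 8710, i.e. U+2206; Python indexes from 0

# Converting a single element back gives the variable-length UTF-8 bytes.
print(list(code_points_to_utf8([points[1]])))  # [226, 136, 134]
```

Once the text is held this way, indexing is a plain element lookup again; the variable-length part only matters at the point where you convert to or from UTF-8.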