Re: UTF-8
- Posted by Vinoba Mar 10, 2011
Euphoria atoms are stored as either 31-bit signed integers or doubles. So some 32-bit numbers (anything greater than 2^30-1) are stored as doubles, and so are less efficient. I'm not familiar enough with UTF-32 to know how much this affects anything. Of course, 64-bit Euphoria (already working in the 4.1 implementation) will use 63-bit signed integers and extended precision floating point numbers. ..... Matt
I think 31 bits is OK, but I will look at it again and report back in detail. As a quick comment, the absence of the highest bit in 4 bytes might only affect some (hopefully minor) East Asian languages. Of course, with 63 or 64 bits we will be able to accommodate all the Planetary and many of the trans-universe languages.
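A quick way to check the 31-bit question is to compare the limit Matt quotes (2^30-1) with the highest code point Unicode defines, U+10FFFF. The sketch below does only that arithmetic. The names EU_MAX_INT, fits_in_eu_integer() and bits_needed() are made up for illustration and are not part of Euphoria or any library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest value a 32-bit Euphoria build stores as an integer, as quoted
   in the thread: 2^30 - 1. (Name is hypothetical.) */
#define EU_MAX_INT 0x3FFFFFFFu

/* Highest code point defined by Unicode: U+10FFFF. */
#define UNICODE_MAX 0x10FFFFu

/* Hypothetical helper: true if a UTF-32 value fits in the integer range
   and so would not need to be stored as a double. */
bool fits_in_eu_integer(uint32_t code_point)
{
    return code_point <= EU_MAX_INT;
}

/* Number of bits needed to represent a value (0 needs 0 bits). */
int bits_needed(uint32_t value)
{
    int bits = 0;
    while (value != 0) {
        bits++;
        value >>= 1;
    }
    return bits;
}
```

With these, fits_in_eu_integer(UNICODE_MAX) is true and bits_needed(UNICODE_MAX) is 21, so every valid code point should stay in the integer range with room to spare. Only values above 2^30-1, which are not valid Unicode code points anyway, would fall over into doubles.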
Derek has done some work on Unicode routines for Euphoria (mostly standard library stuff, IIRC). There's a unicode branch in the repo if you're interested in taking a look. This may or may not make it into 4.1.
From what I did with wxEuphoria, I think it's mainly I/O stuff that needs updating. For instance, the built-in sprint() coerces things to C chars, so UTF-8 seems to work with it, but not UTF-16. For wxEuphoria, I had to write my own w_sprintf(), which was based on Euphoria's sprintf(), but using wxWidgets-style characters. Matt
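To illustrate why coercion to C chars is harmless for UTF-8 but not for UTF-16, here is a minimal sketch of the two encodings in C. It says nothing about how Euphoria's own routines are written. utf8_encode() and utf16_encode() are hypothetical helper names, not functions from Euphoria, wxWidgets or the C library.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8. Returns the number of bytes written
   (1 to 4), or 0 for a surrogate or a value above U+10FFFF.
   Every unit produced is at most 0xFF, so it fits in a C char. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not valid code points */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Encode one code point as UTF-16. Returns the number of 16-bit units
   written (1 or 2), or 0 if invalid. Units can be as large as 0xFFFF,
   so squeezing them into a C char throws away the high byte. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {
        cp -= 0x10000;
        out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
        return 2;
    }
    return 0;
}
```

For the euro sign (U+20AC), utf8_encode() gives the three bytes E2 82 AC, each of which survives a cast to char. utf16_encode() gives the single unit 0x20AC, which becomes 0xAC once it is cut down to 8 bits.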
I was wondering if you have looked at Microsoft's intermediate solution (TCHAR), and now WCHAR and things like ...MessageW() etc., as a good migration solution.
I will try to look at the Unicode branch you mentioned above, and see what goodies you have for me there. I want more than a kid gets going Halloweening!