Re: Interpreter Mod We Can All Get Behind
- Posted by Matt Lewis <matthewwalkerlewis at ?mai?.com> Nov 21, 2007
CChris wrote:
>
> Also U"<unicode string>" would be desirable too. Some variations on U would
> give more control over the unicode encoding that's desired. For instance U for
> UTF8, uU for UTF16LE, Uu for UTF16BE, uUU for UTF32LE and UUu for UTF32BE?
> Just suggestions.

I think that you'd probably want to stick with (internally) the equivalent of wide characters (wchar), which are 4 bytes each. That aligns pretty naturally with the sequence. The only encoding issues would be in the files themselves, at which point you're correct that we probably need some way to identify how things are encoded in the file.

I'd suggest UTF8 as the best way to encode Euphoria source, since most of the characters will be from ASCII, making UTF8 the most efficient. But if the scanner is UTF8 enabled, is there any real need to identify which strings are or are not unicode? Using sequences, there's no need to distinguish between char widths as with C/C++.

This all assumes that the interpreter is capable of dealing with unicode. wxEuphoria now handles it pretty seamlessly, although, of course, if you use any funky characters, your strings will look kind of funny:
string = "This uses a unicode character: " & 2015
If we decide to go with UTF8 as the standard, we'll need to have a library (presumably in eu, for the front end) that is capable of decoding UTF8. And then, of course, there will be all sorts of decisions about how to handle puts/printf/etc.

But if we go with straight wide chars, then it might actually make things a bit simpler, like it did with wxEuphoria (because I don't have to cast a long to a char). Not sure how all this would affect DOS, however.

Matt
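
For a sense of what the UTF8 decoding library mentioned above might involve, here is a minimal sketch in plain Euphoria. The name utf8_decode is made up for illustration and is not an existing library routine. It assumes the input is a sequence of byte values and returns one code point per element, which is the "wide char in a sequence" representation discussed above. Malformed input becomes #FFFD; overlong forms and surrogates are not checked.

```euphoria
-- Hypothetical helper, not part of any existing Euphoria library.
-- Converts a sequence of UTF-8 bytes into a sequence of code points.
function utf8_decode(sequence bytes)
    sequence result
    integer i, b, extra
    atom cp

    result = {}
    i = 1
    while i <= length(bytes) do
        b = bytes[i]
        i += 1
        if b < #80 then                     -- 0xxxxxxx: plain ASCII
            cp = b
            extra = 0
        elsif b >= #C0 and b < #E0 then     -- 110xxxxx: 1 continuation byte
            cp = b - #C0
            extra = 1
        elsif b >= #E0 and b < #F0 then     -- 1110xxxx: 2 continuation bytes
            cp = b - #E0
            extra = 2
        elsif b >= #F0 and b < #F8 then     -- 11110xxx: 3 continuation bytes
            cp = b - #F0
            extra = 3
        else                                -- stray continuation or invalid lead
            cp = #FFFD
            extra = 0
        end if
        while extra > 0 do
            if i > length(bytes) then       -- input ended mid-character
                cp = #FFFD
                exit
            end if
            b = bytes[i]
            if b < #80 or b >= #C0 then     -- not a 10xxxxxx continuation byte
                cp = #FFFD
                exit                        -- leave b to be re-read as a lead byte
            end if
            cp = cp * 64 + (b - #80)        -- shift in the low 6 bits
            i += 1
            extra -= 1
        end while
        result = append(result, cp)
    end while
    return result
end function

sequence s
s = utf8_decode({#41, #C3, #A9, #E2, #80, #95})
-- s is now {#41, #E9, #2015}
```

Pure ASCII input comes back unchanged, so existing one-byte strings would pass through such a routine as they are.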