OpenEuphoria: Forum: UTF-8

1. UTF-8

Posted by mattlewis (admin) Mar 07, 2011

Recently, I was contacted by someone asking about using wxEuphoria to display Russian text. wxEuphoria uses Unicode by default (assuming that you're linking it against a Unicode build of wxWidgets).

The example text I received was UTF-8 encoded. Using wxEuphoria's from_utf8() routine, I was able to convert it to wxWidgets' "native" Unicode representation (which I believe is UTF-16), and it displayed just fine for me. In the past, I've done similar things on Windows (though with Hangul, not Cyrillic, characters), and I had to make sure I was using a font that supported the code points I was using.
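A minimal sketch of that kind of conversion, written in Python rather than Euphoria. It is not wxEuphoria's from_utf8(); the function name `utf8_to_utf16le` and the sample text are hypothetical, and it assumes the target representation is little-endian UTF-16 without a byte-order mark.

```python
def utf8_to_utf16le(raw: bytes) -> bytes:
    """Decode UTF-8 bytes and re-encode them as UTF-16 little-endian (no BOM)."""
    return raw.decode("utf-8").encode("utf-16-le")


if __name__ == "__main__":
    sample = "Привет".encode("utf-8")  # illustrative Russian text, not the text from the post
    wide = utf8_to_utf16le(sample)
    # Both lengths are byte counts; the number of characters is the same in each encoding.
    print(len(sample), len(wide))
```

Whether the result displays correctly still depends on the font, as noted above.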

That's all interesting, but the other (and main) reason I bring this up is that the UTF-8 text actually displayed correctly in my console. This was on Linux, and my environment is set to UTF-8, so it's not terribly surprising, but it's still interesting that Euphoria can deal with UTF-8 output and input (at least, as far as some simple gets(0) testing showed). Some googling suggests that it's possible to display UTF-8 in a Windows console if you change your code page (and use an appropriate font).

Matt


2. Re: UTF-8

Posted by Vinoba Mar 09, 2011

UTF-8 is the way Linux likes to handle things, and so does the Internet. For all the older "languages" it is therefore tempting to stick to 8 bits and use Windows code pages.

Microsoft's internal default is LE16 (16-bit little-endian). They chose that (I think) because Intel CPUs handle 16-bit words that way in memory. wxWidgets has also chosen LE16. wxWidgets 2.9 is totally 16-bit Unicode. Qt is going the Unicode route, and of course VC is already there.

I would urge you to take the LE16 route. It would require some rewriting of the basic code, but Euphoria has a tremendous advantage over other languages because you never created a string type. Therefore creating a true Unicode character string type and converting everything to 16-bit Unicode will not be too difficult.

For me, 16-bit Unicode still causes problems with Indic characters, as Indic languages are syllabic. The decision makers in India accepted an ANSI-type character set instead of asking for a full syllabic character set of some 6000 characters for each of the Indic languages. Had they asked for the latter, the collation algorithms would be much easier to handle.

I hope that Euphoria takes the 16 (or 32) bit Unicode route, even in the next 4.# version. Look at AutoIT 3.6; they changed to 16-bit Unicode in version 3.4 (http://www.autoitscript.com/). Some of the good BASICs are also changing to 16-bit Unicode, e.g. RealBasic, PowerBasic, PureBasic.


3. Re: UTF-8

Posted by Vinoba Mar 09, 2011

Duplicate post removed


4. Re: UTF-8

Posted by LarryMiller Mar 10, 2011

UTF-8 is the most popular and probably the best Unicode encoding for files. It is 100% compatible with standard ASCII and is more space efficient than UTF-16 for most European languages. This is not the case for most Asian languages.
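The size trade-off can be checked directly. This is a small Python sketch (the helper name `encoded_sizes` and the sample strings are hypothetical) that counts the bytes each encoding needs for the same text.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes needed to store text in each encoding (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    for sample in ("Hello", "Привет", "한글"):
        print(sample, encoded_sizes(sample))
```

For Latin text UTF-8 is half the size of UTF-16. For Cyrillic the two come out equal, and for Hangul UTF-8 needs three bytes per character against two for UTF-16.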

While great for files, UTF-8 is not very easy to work with internally. UTF-16, or even UTF-32, is much easier to work with as an internal format. UTF-32 is a more natural fit for Euphoria which is already using 32 bits to store characters.
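To illustrate why a variable-width encoding is awkward internally, here is a Python sketch with hypothetical function names. Finding the n-th character of UTF-8 data means scanning from the start, while in UTF-32 it is a fixed offset.

```python
def nth_char_utf8(raw: bytes, n: int) -> str:
    """Return the n-th code point (0-based) of UTF-8 data by scanning from the start."""
    if n < 0:
        raise IndexError(n)
    count = -1
    start = None
    for i, byte in enumerate(raw):
        if byte & 0xC0 != 0x80:  # not a continuation byte, so a new character starts here
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return raw[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return raw[start:].decode("utf-8")


def nth_char_utf32(raw: bytes, n: int) -> str:
    """Fixed width: the n-th code point always sits at byte offset 4 * n."""
    chunk = raw[4 * n:4 * n + 4]
    if n < 0 or len(chunk) != 4:
        raise IndexError(n)
    return chunk.decode("utf-32-le")


if __name__ == "__main__":
    text = "aЖ한"  # 1-, 2- and 3-byte characters in UTF-8
    print(nth_char_utf8(text.encode("utf-8"), 2))
    print(nth_char_utf32(text.encode("utf-32-le"), 2))
```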

I too would like to see full Unicode support in Euphoria. But there is much to be considered so I would not expect this to happen anytime soon. It would have been easier if Euphoria had been designed for Unicode from the beginning. But when Euphoria first came out Unicode was more of a dream than a reality.


5. Re: UTF-8

Posted by Vinoba Mar 10, 2011

LarryMiller said...

UTF-8 is the most popular and probably the best Unicode encoding for files. It is 100% compatible with standard ASCII and is more space efficient than UTF-16 for most European languages. This is not the case for most Asian languages.

It is "popular" only because it was the first one out of the joint world-level effort. It is holding its place only because it is one byte for most European characters. Even there, try to mix characters from two European languages and you will hit a big wall. This is the problem they are facing within the European Union. The juggling of "codepages" required is too time-inefficient. It is not efficient for most Asian languages. That incidentally happens to be the place where 75% of the world's population lives and where maximum growth is occurring. Unicode 16 is also 100% compatible with ASCII. You just ignore the 0 in the higher-order byte of the word. If ASCII could live for half a century by having to deliberately drop the highest bit of a byte, it can live equally comfortably by dropping the high-order byte of a word.
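The "drop the high-order byte" idea can be sketched in Python (the function name is hypothetical, and little-endian byte order is assumed). It works for pure ASCII text, but it is a conversion step: the UTF-16 bytes are not the ASCII bytes, whereas ASCII text encoded as UTF-8 is byte-for-byte unchanged.

```python
def utf16le_to_ascii(raw: bytes) -> bytes:
    """Drop the high-order byte of each 16-bit unit; valid only for pure ASCII content."""
    if len(raw) % 2:
        raise ValueError("odd number of bytes")
    low, high = raw[0::2], raw[1::2]
    if any(high) or any(b > 0x7F for b in low):
        raise ValueError("not pure ASCII stored as UTF-16LE")
    return low


if __name__ == "__main__":
    wide = "Hi".encode("utf-16-le")
    print(wide)                      # each ASCII byte is followed by a zero byte
    print(utf16le_to_ascii(wide))    # the original ASCII bytes
    print("Hi".encode("utf-8"))      # UTF-8 needs no conversion for ASCII
```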

LarryMiller said...

While great for files, UTF-8 is not very easy to work with internally. UTF-16, or even UTF-32, is much easier to work with as an internal format. UTF-32 is a more natural fit for Euphoria which is already using 32 bits to store characters.

To the best of my knowledge, file names are in ANSI or in 16-bit or 32-bit Unicode. I do not know of them being in UTF-8. I missed out on 32-bit storage of characters in Euphoria. If that is the case, the implementation of 32-bit Unicode should be very easy and will put Euphoria well ahead of most BASICs (and Clipper, Harbour, Lua, Agena, AutoIT) again, and will be a great tool for international trade software. I would appreciate some pointers/samples regarding the 32-bit storage of characters in Euphoria. I am 100% comfortable with Assembly/Machine language, so a couple of hints is all I need to look at the storage.

I did NOT see that (32 bit storage) in the Unicode conversion software somebody has written for Euphoria.

LarryMiller said...

I too would like to see full Unicode support in Euphoria. But there is much to be considered so I would not expect this to happen anytime soon. It would have been easier if Euphoria had been designed for Unicode from the beginning. But when Euphoria first came out Unicode was more of a dream than a reality.

Please see my comments above regarding the existing 32-bit storage in Euphoria. Coupled with the fact that Euphoria does not have a string type, it would seem to me that Euphoria is already 80% of the way towards being a 32-bit Unicode language. There is still a problem with 16-bit Unicode for Chinese and even for (collation of) Indic characters. This would be less of an issue with 32-bit Unicode. When you look at ever-growing databases in all languages, and the searches needed, and also look at the huge expansion of storage capacities, 32 bit seems to be the choice. Euphoria's 32-bit storage and the cheap multi-terabyte storage devices seem to be made for each other at a personal computing level.
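One concrete form of the 16-bit problem for Chinese, which may be what is alluded to here, is that characters outside the Basic Multilingual Plane need two 16-bit units (a surrogate pair) in UTF-16 but still only one unit in UTF-32. A Python sketch, with a hypothetical helper name:

```python
def utf16_code_units(ch: str) -> list:
    """Return the 16-bit code units that UTF-16 uses for the given text."""
    raw = ch.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


if __name__ == "__main__":
    common = "中"            # U+4E2D, inside the Basic Multilingual Plane
    rare = "\U00020000"      # U+20000, a CJK Extension B ideograph
    print([hex(u) for u in utf16_code_units(common)])
    print([hex(u) for u in utf16_code_units(rare)])
    print(len(rare.encode("utf-32-le")) // 4, "UTF-32 unit")
```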


6. Re: UTF-8

Posted by mattlewis (admin) Mar 10, 2011

Vinoba said...

LarryMiller said...

While great for files, UTF-8 is not very easy to work with internally. UTF-16, or even UTF-32, is much easier to work with as an internal format. UTF-32 is a more natural fit for Euphoria which is already using 32 bits to store characters.

To the best of my knowledge, file names are in ANSI or in 16-bit or 32-bit Unicode. I do not know of them being in UTF-8. I missed out on 32-bit storage of characters in Euphoria. If that is the case, the implementation of 32-bit Unicode should be very easy and will put Euphoria well ahead of most BASICs (and Clipper, Harbour, Lua, Agena, AutoIT) again, and will be a great tool for international trade software. I would appreciate some pointers/samples regarding the 32-bit storage of characters in Euphoria. I am 100% comfortable with Assembly/Machine language, so a couple of hints is all I need to look at the storage.

I did NOT see that (32 bit storage) in the Unicode conversion software somebody has written for Euphoria.

Euphoria atoms are stored as either 31-bit signed integers or doubles. So some 32-bit numbers (anything greater than 2³⁰-1) are stored as doubles, and so are less efficient. I'm not familiar enough with UTF-32 to know how much this affects anything. Of course, 64-bit Euphoria (already working in the 4.1 implementation) will use 63-bit signed integers and extended-precision floating point numbers.

Derek has done some work on Unicode routines for Euphoria (mostly standard library stuff, IIRC). There's a unicode branch in the repo if you're interested in taking a look. This may or may not make it into 4.1.

From what I did with wxEuphoria, I think it's mainly I/O stuff that needs updating. For instance, the built-in sprint() coerces things to C chars, so UTF-8 seems to work with it, but not UTF-16. For wxEuphoria, I had to write my own w_sprintf(), which was based on Euphoria's sprintf(), but using wxWidgets-style characters.
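A rough Python model of why byte-oriented formatting can pass UTF-8 through but not UTF-16. It assumes that coercion to a C char keeps only the low 8 bits of each value; that is an assumption about the behaviour described here, not a statement about Euphoria's actual source, and the function name is hypothetical.

```python
def coerce_to_c_chars(values) -> bytes:
    """Keep only the low 8 bits of each value (assumed model of a cast to unsigned char)."""
    return bytes(v & 0xFF for v in values)


if __name__ == "__main__":
    utf8_bytes = list("Ж".encode("utf-8"))   # two values, each already within 0..255
    utf16_units = [0x0416]                   # the single UTF-16 code unit for the same letter
    print(coerce_to_c_chars(utf8_bytes) == "Ж".encode("utf-8"))  # bytes survive unchanged
    print(coerce_to_c_chars(utf16_units))                        # the high byte is lost
```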

Matt


7. Re: UTF-8

Posted by Vinoba Mar 10, 2011

mattlewis said...

Euphoria atoms are stored as either 31-bit signed integers, or doubles. So some 32-bit numbers (anything greater than 2³⁰-1) are stored as doubles, and so less efficient. I'm not familiar enough with UTF-32 to know how much this affects anything. Of course, 64-bit euphoria (already working in the 4.1 implementation) will use 63-bit signed integers and extended precision floating point numbers. ..... Matt

I think 31 bits is OK, but I will look at it again and report back in detail. As a quick comment, the absence of the higher-most bit in 4 bytes might only affect some (hopefully minor) East Asian languages. Of course with 63 or 64 bits we will be able to accommodate all the Planetary and many of the trans-universe languages.

mattlewis said...

Derek has done some work on Unicode routines for Euphoria (mostly standard library stuff, IIRC). There's a unicode branch in the repo if you're interested in taking a look. This may or may not make it into 4.1.

From what I did with wxEuphoria, I think it's mainly I/O stuff that needs updating. For instance, the built-in sprint() coerces things to C chars, so UTF-8 seems to work with it, but not UTF-16. For wxEuphoria, I had to write my own w_sprintf(), which was based on Euphoria's sprintf(), but using wxWidgets-style characters. Matt

I was wondering if you have looked at Microsoft's intermediate solution (tchar) and now wchar and things like ...MessageW() etc, as a good migration solution.

I will try to look at the Unicode branch you mentioned above, and see what goodies you have for me there. I want more than a kid gets going Halloweening!


8. Re: UTF-8

Posted by ArthurCrump Mar 10, 2011

The current Unicode specification fits into 21 bits.
UTF-32 characters are in the range 0-#10FFFF
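A quick check of that arithmetic against the 31-bit integer limit mentioned earlier in the thread, as a Python sketch. The constant and function names are hypothetical; the value 2³⁰-1 is the limit quoted above.

```python
MAX_CODE_POINT = 0x10FFFF      # highest Unicode code point
EU_MAX_INTEGER = 2**30 - 1     # largest 31-bit signed integer, as quoted in the thread


def fits_in_31_bit_integer(code_point: int) -> bool:
    """True if the value is non-negative and within the 31-bit signed integer range."""
    return 0 <= code_point <= EU_MAX_INTEGER


if __name__ == "__main__":
    print(MAX_CODE_POINT.bit_length(), "bits needed for the highest code point")
    print(fits_in_31_bit_integer(MAX_CODE_POINT))
```

So every code point fits in the integer form with room to spare, and promotion to a double should never be triggered by a valid UTF-32 value.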


9. Re: UTF-8

Posted by mattlewis (admin) Mar 10, 2011

Vinoba said...

I think 31 bits is OK, but I will look at it again and report back in detail. As a quick comment, the absence of the higher-most bit in 4 bytes might only affect some (hopefully minor) East Asian languages. Of course with 63 or 64 bits we will be able to accommodate all the Planetary and many of the trans-universe languages.

To be clear, inside a sequence (or a declared atom), the promotion from integer to double is automatic and mostly transparent.

Vinoba said...

I was wondering if you have looked at Microsoft's intermediate solution (tchar) and now wchar and things like ...MessageW() etc, as a good migration solution.

I will try to look at the Unicode branch you mentioned above, and see what goodies you have for me there. I want more than a kid gets going Halloweening!

I have used some of Microsoft's Unicode stuff when using COM, since you pretty much have to. My library does all of the conversions for you, but it would be simple enough to do them yourself, especially now that we have things like poke2() built in.
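A sketch of what doing the conversion by hand amounts to, with Python standing in for Euphoria. It assumes poke2() stores each value as a two-byte word, that the byte order is little-endian (as on x86), and that the wide string is terminated by a zero unit; `utf16_units` and `poke_wide_string` are hypothetical names.

```python
import struct


def utf16_units(text: str) -> list:
    """Return the 16-bit code units of text as integers."""
    raw = text.encode("utf-16-le")
    return [unit for (unit,) in struct.iter_unpack("<H", raw)]


def poke_wide_string(text: str) -> bytes:
    """Build a zero-terminated wide-character buffer, one 2-byte store per code unit."""
    buf = bytearray()
    for unit in utf16_units(text) + [0]:
        buf += struct.pack("<H", unit)  # analogous to poking one 2-byte value
    return bytes(buf)


if __name__ == "__main__":
    print(poke_wide_string("Hi"))
```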

At one point in the past, I actually did play around with win32lib, changing from the ANSI to the wide-character routines, and displayed non-ASCII text. From the coding standpoint, it was pretty straightforward.

Matt


10. Re: UTF-8

Posted by Vinoba Mar 10, 2011

ArthurCrump said...

The current Unicode specification fits into 21 bits.
UTF-32 characters are in the range 0-#10FFFF

Thanks for jogging my memory. When I was looking for "personal space" for better collation of 6000-8000 characters in one of the Indic languages, I had decided upon using the E000 area. Then I vaguely remember that I was also looking at the pages above #10FFFF to do the same, i.e. to find 12 times 6000 character space, and I stopped because I fell ill.
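For reference, Unicode sets aside three Private Use ranges, and values above #10FFFF are not valid code points. A Python sketch of the space arithmetic follows; the dictionary keys and helper names are hypothetical, and the 6000 and 12 × 6000 figures are the ones mentioned above.

```python
# Private Use ranges defined by Unicode (first and last code point, inclusive).
PRIVATE_USE_RANGES = {
    "BMP": (0xE000, 0xF8FF),
    "Plane 15": (0xF0000, 0xFFFFD),
    "Plane 16": (0x100000, 0x10FFFD),
}


def range_size(first: int, last: int) -> int:
    """Number of code points in an inclusive range."""
    return last - first + 1


def ranges_that_fit(needed: int) -> list:
    """Names of the single Private Use ranges that can hold `needed` code points."""
    return [name for name, (first, last) in PRIVATE_USE_RANGES.items()
            if range_size(first, last) >= needed]


if __name__ == "__main__":
    for name, (first, last) in PRIVATE_USE_RANGES.items():
        print(name, range_size(first, last))
    print(ranges_that_fit(6000))        # one language's syllable set
    print(ranges_that_fit(12 * 6000))   # twelve such sets in a single range
```

On these numbers one set of 6000 fits in the E000 area, while twelve sets would have to be split across the two supplementary Private Use planes.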

Yes, 31 bit integer Euphoria can definitely cope with the "32 bit Unicode" and my Indic OEM extensions after 21 bits. I hope I can keep good health to do this. I will look more closely at the Unicode branch as suggested by Matt Lewis, and see if I can cope with it and/or improve upon it.
