Re: UTF-8
- Posted by Vinoba Mar 10, 2011
UTF-8 is the most popular and probably the best Unicode encoding for files. It is 100% compatible with standard ASCII and is more space efficient than UTF-16 for most European languages. This is not the case for most Asian languages.
It is "popular" only because it was the first one to come out of the joint worldwide effort. It is holding its place only because it needs one byte for most European characters. Even there, try to mix characters from two European languages and you will hit a big wall. This is the problem they are facing within the European Union. The juggling of "codepages" required is too time-consuming. It is not efficient for most Asian languages. Asia incidentally happens to be the place where 75% of the world's population lives and where maximum growth is occurring. 16-bit Unicode (UTF-16) is also 100% compatible with ASCII: you just ignore the 0 in the high-order byte of the word. If ASCII could live for half a century by having to deliberately drop the highest bit of a byte, it can live equally comfortably by dropping the high-order byte of a word.
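For anyone who wants to check the size and high-byte claims on their own text, here is a minimal Python sketch (Python, not Euphoria). The sample strings and the helper names `encoded_sizes` and `ascii_from_utf16le` are my own illustrative assumptions, not anything from an existing library. It counts bytes per encoding and shows that ASCII text stored as little-endian UTF-16 carries a zero in every high-order byte.

```python
# Sample strings are illustrative assumptions, written with escapes so the
# exact code points are unambiguous.
SAMPLES = {
    "English": "hello",
    "German": "Gr\u00fc\u00dfe",
    "Hindi": "\u0928\u092e\u0938\u094d\u0924\u0947",
    "Chinese": "\u4f60\u597d",
}


def encoded_sizes(text):
    """Return (UTF-8, UTF-16, UTF-32) byte counts, with no byte-order mark."""
    return tuple(
        len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")
    )


def ascii_from_utf16le(data):
    """Recover ASCII bytes from UTF-16-LE data by dropping each high byte.

    Raises ValueError unless every 16-bit unit is below 0x80, because only
    then does dropping the high byte give back plain ASCII.
    """
    low, high = data[0::2], data[1::2]
    if any(high) or any(b > 0x7F for b in low):
        raise ValueError("data is not pure ASCII stored as UTF-16-LE")
    return low


for name, text in SAMPLES.items():
    print(name, encoded_sizes(text))

print(ascii_from_utf16le("hello".encode("utf-16-le")))
```

Swap in your own strings to see how the three byte counts compare for the languages you care about.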
While great for files, UTF-8 is not very easy to work with internally. UTF-16, or even UTF-32, is much easier to work with as an internal format. UTF-32 is a more natural fit for Euphoria which is already using 32 bits to store characters.
To the best of my knowledge, file names are in ANSI or in 16-bit or 32-bit Unicode. I do not know of them being in UTF-8. I missed out on the 32-bit storage of characters in Euphoria. If that is the case, the implementation of 32-bit Unicode should be very easy and will put Euphoria well ahead of most BASICs (and Clipper, Harbour, Lua, Agena, AutoIt) again, and will make it a great tool for international trade software. I would appreciate some pointers/samples regarding the 32-bit storage of characters in Euphoria. I am 100% comfortable with Assembly/Machine language, so a couple of hints are all I need to look at the storage.
I did NOT see that (32-bit storage) in the Unicode conversion software somebody has written for Euphoria.
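To make concrete what "one 32-bit value per character" would look like, here is a rough Python sketch (Python only, as an analogy; it says nothing about how Euphoria lays out its sequences in memory). The function names `to_code_points` and `from_code_points` are hypothetical. The idea is simply that UTF-8 is used for the file and a flat list of integer code points is used internally.

```python
def to_code_points(utf8_bytes):
    """Decode UTF-8 bytes into a list of integer code points."""
    return [ord(ch) for ch in utf8_bytes.decode("utf-8")]


def from_code_points(code_points):
    """Encode a list of integer code points back into UTF-8 bytes."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


# Four characters needing 1, 2, 3 and 4 bytes in UTF-8 respectively.
data = "A\u00e9\u0928\U00020000".encode("utf-8")
points = to_code_points(data)

print(len(data))    # 10 bytes on disk
print(points)       # [65, 233, 2344, 131072]
print(from_code_points(points) == data)
```

With one integer per character, indexing and slicing work on whole characters no matter how many bytes each one took in the file.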
I too would like to see full Unicode support in Euphoria. But there is much to be considered so I would not expect this to happen anytime soon. It would have been easier if Euphoria had been designed for Unicode from the beginning. But when Euphoria first came out Unicode was more of a dream than a reality.
Please see my comments above regarding the existing 32-bit storage in Euphoria. Coupled with the fact that Euphoria does not have a string type, it would seem to me that Euphoria is already 80% of the way towards being a 32-bit Unicode language. There is still a problem with 16-bit Unicode for Chinese and even for (collation of) Indic characters. This would be less of an issue with 32-bit Unicode. When you look at ever-growing databases in all languages, and the searches needed, and also look at the huge expansion of storage capacities, 32-bit seems to be the choice. Euphoria's 32-bit storage and the cheap multi-terabyte storage devices seem to be made for each other at a personal computing level.
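One concrete way to see the 16-bit limitation for Chinese is a character outside the first 65,536 code points. The Python sketch below (again Python, not Euphoria) uses U+20000, the first ideograph of CJK Extension B, as my chosen example; the helper names `utf16_units` and `utf32_units` are hypothetical. It shows that UTF-16 needs two 16-bit units (a surrogate pair) for it, while UTF-32 still needs exactly one unit.

```python
import struct


def utf16_units(ch):
    """Number of 16-bit code units needed to store ch in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def utf32_units(ch):
    """Number of 32-bit code units needed to store ch in UTF-32."""
    return len(ch.encode("utf-32-le")) // 4


common = "\u4f60"        # an ideograph inside the 16-bit range
rare = "\U00020000"      # an ideograph outside the 16-bit range

print(utf16_units(common), utf32_units(common))   # 1 1
print(utf16_units(rare), utf32_units(rare))       # 2 1

# The two 16-bit values that UTF-16 actually stores for U+20000.
high, low = struct.unpack("<2H", rare.encode("utf-16-le"))
print(hex(high), hex(low))                        # 0xd840 0xdc00
```

Code that treats every 16-bit unit as one character will split such an ideograph in half, which is the kind of trouble 32-bit storage avoids.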