1. EuGTK and UTF-8
- Posted by jimcbrown (admin) Dec 16, 2012
- 1624 views
Forked from Re: GUI with RTL and BiDi support
I would be surprised if GTK and WxEuphoria text widgets were not equally capable of left-to-right and Unicode already. Their wrappers are more complete and more widely used, if user feedback is any indication.
GTK: I have already commented that, on paper at least and based on test 117 given by the GTK wrapper author, you should be able to enter UTF-8 text in a Label.
Agreed. I tried it and it works.
The methodology recommended by him, and I agree there, is to use copy/paste (C/P) from a UTF-8 word processor.
This is not necessary. I've tested EuGTK (release from Feb 2010) with GTK 2 and with SCIM 1.4.5 on Linux/GNU.
I can't speak for RTL text entry as I don't use any language with it, but entering hanzi and hangul the conventional way works fine with EuGTK.
2. Re: EuGTK and UTF-8
- Posted by EUWX Dec 16, 2012
- 1668 views
You have to understand something more about the syllabic characters and their composition and usage on computers. For a long time we were restricted to a 128 character set, which was later expanded to a 256 character set. This essentially is based on using ONE BYTE per character. Keyboards do not have 256 keys, and various schemes were devised to enter the 256 variations of glyphs for a given language font.
Based on these restrictions and the need to use their own languages, most major languages created their own alternative glyphs and methods of joining BUT ALWAYS WITHIN THE 256 GLYPH availability. I mentioned Hindi Saral font, which is one of those simplifications.
Although the Chinese have need for many more glyphs, the Hanzi font you mentioned is available as a one byte font
http://www.fontpalace.com/font-download/Hanzi-Kaishu/
With the advent of Unicode, the Indians (via their government) accepted 128 glyphs per Indian language, and they reside at about 0980 hex onwards. Although Indian languages are truly syllabic, and there is a need for reserving about 4000 spaces per language, there is no demand from India for creating separate glyph pages. The Chinese did ask for and in fact they created an extension from 64K Unicode Glyphs to a 21 bit Unicode to accommodate these.
The Korean Unicode block for Hangul is a 256-character block
Unicode Hangul Jamo (U+1100–U+11FF) The Hangul Jamo Unicode Block contains the possible Leading, Middle, and Trailing parts of a Hangul block in the following ranges:
• U+1100–U+1112: the basic 18 initial consonants plus 1 silent consonant (NG)
• U+1113–U+1159: other initial complex and ancient consonants
• U+115A–U+115E: Reserved
• U+115F: choseong (initial) filler
• U+1160: jungseong (middle) filler
• U+1161–U+1175: the basic 21 vowels and diphthongs
• U+1176–U+11A2: other and ancient vowel combinations
• U+11A3–U+11A7: Reserved
• U+11A8–U+11C2: the basic 27 trailing consonant combinations
• U+11C3–U+11F9: other trailing consonant combinations
• U+11FA–U+11FF: Reserved
If you notice, you can use it in software that accesses a character in one byte, provided the 256 glyphs are attached to one particular font page.
When I talk about Unicode availability, I mean the ability to put any one of the characters, at least as available from Microsoft in their Arial Unicode MS font, within one word or phrase. That can only happen if the software allows at least a 16 bit usage per character (and now 21 bit usage). The alternative adopted by Microsoft with the introduction of Windows XP was to use the 16 bit character set internally. The European software developers even today are still thinking of code pages of 256 characters, as exemplified by your post regarding Hanzi and Hangul.
In the original thread, I tried to encourage the questioner to post the name of the RTL language he wants to use, because I am pretty certain that for the major RTL group, I can find the 256 character ANSI font and enable him to use any of the current non-Microsoft languages including Euphoria.
EuGTK will work, but only by using the one byte font.
AS IT STANDS WITH EUPHORIA, EUGTK WILL NOT WORK USING THE FULL TWO BYTE STORAGE.
However, the Euphoria wrapper can probably be modified for two byte storage of Unicode characters if it is an absolute requirement, such as in a multilingual composition requiring sorting. RTL is a further complication/problem which can be easily tested by him using a byte font in his chosen right-to-left language.
UTF-8 will create problems associated with selection and extraction from within a phrase, because each "foreign" character can have anything from one to 6 bytes per character!
3. Re: EuGTK and UTF-8
- Posted by jimcbrown (admin) Dec 16, 2012
- 1643 views
You have to understand something more about the syllabic characters and their composition and usage on computers. For a long time we were restricted to a 128 character set, which was later expanded to a 256 character set. This essentially is based on using ONE BYTE per character. Keyboards do not have 256 keys, and various schemes were devised to enter the 256 variations of glyphs for a given language font.
Agreed.
Based on these restrictions and the need to use their own languages, most major languages created their own alternative glyphs and methods of joining BUT ALWAYS WITHIN THE 256 GLYPH availability. I mentioned Hindi Saral font, which is one of those simplifications.
Not true. Even simplified hanzi can't fit into 256 characters. Notice that GB2312 was encoded in two bytes (16 bits) with EUC-CN. And of course, GB2312 is a very incomplete set of hanzi (though enough for the simplest of everyday uses).
Although the Chinese have need for many more glyphs, the Hanzi font you mentioned is available as a one byte font
http://www.fontpalace.com/font-download/Hanzi-Kaishu/
I did not mention that font. Looking at the character map, it looks like one of those joke fonts that map Roman letters to random hanzi.
Those fonts can be fun, but even a 5 year old child needs more hanzi than provided by the character map. It's impractical for everyday use.
The Chinese did ask for and in fact they created an extension from 64K Unicode Glyphs to a 21 bit Unicode to accommodate these.
And even this was not enough to represent all hanzi.
http://unicode.org/reports/tr28/tr28-3.html#13_7_variation_selectors
The Korean Unicode block for Hangul is a 256-character block
Unicode Hangul Jamo (U+1100–U+11FF)
Hangul Jamo just covers the individual letters. However, Unicode embeds Hangul syllabic blocks as individual characters as well.
Hangul defines 10 vowels and 14 consonants. Counting all possible CV and CVC combinations leads to a minimum of 2100 syllabic blocks.
14*10 + 14*10*14 = 140 + 1960 = 2100
Actually, there are even more, since some words encode extra consonants beyond the simple CV and CVC combination. E.g. 없어요 (eopseoyo - to not have / to not exist).
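As a rough check on those counts, here is a minimal sketch in Python (used purely for illustration; it is not Euphoria and not code from this thread). It assumes the standard composition arithmetic for the precomposed syllable range starting at U+AC00: 19 initial consonants, 21 vowels, and 27 finals plus "no final". The function name and constants are my own labels.

```python
# Precomposed Hangul syllables start at U+AC00 and are laid out as
# (initial, vowel, final) triples in a fixed order.
S_BASE = 0xAC00
L_COUNT = 19   # initial consonants (choseong)
V_COUNT = 21   # vowels (jungseong)
T_COUNT = 28   # 27 final consonants (jongseong) + "no final"

def compose_syllable(l_index, v_index, t_index=0):
    """Return the precomposed syllable for zero-based jamo indices."""
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT
            and 0 <= t_index < T_COUNT):
        raise ValueError("jamo index out of range")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

# Total number of syllable blocks this scheme can address.
TOTAL_SYLLABLES = L_COUNT * V_COUNT * T_COUNT

# The first syllable of the example word: initial index 11, vowel index 4,
# final index 18 (a two-consonant final cluster).
eops = compose_syllable(11, 4, 18)
```

The total comes out well above the 2100 lower bound because the final position also allows consonant clusters, as in the first syllable of the example word.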
The Hangul Jamo Unicode Block contains the possible Leading, Middle, and Trailing parts of a Hangul block in the following ranges:
U+1100–U+1112: the basic 18 initial consonants plus 1 silent consonant (NG)
U+1113–U+1159: other initial complex and ancient consonants
U+115A–U+115E: Reserved
U+115F: choseong (initial) filler
U+1160: jungseong (middle) filler
U+1161–U+1175: the basic 21 vowels and diphthongs
U+1176–U+11A2: other and ancient vowel combinations
U+11A3–U+11A7: Reserved
U+11A8–U+11C2: the basic 27 trailing consonant combinations
U+11C3–U+11F9: other trailing consonant combinations
U+11FA–U+11FF: Reserved
You missed out on the U+3130 - U+318F range, not to mention the entire syllabic block range at U+AC00 - U+D7AF.
If you notice, you can use it in software that accesses a character in one byte, provided the 256 glyphs are attached to one particular font page.
You are correct. As noted in http://www.decodeunicode.org/en/hangul_syllables there was significant debate over whether or not it was necessary to encode all these syllables.
In theory, one could scrap the hangul syllable block, use an automated process to convert all existing text written using hangul syllable characters (virtually all of them) into text using only Hangul combining Jamo, and upgrade all word processors and other text manipulation tools to do this.
Considering the unique difficulties of getting Hangul Jamo to combine into syllabic blocks correctly (this requires working in two dimensions, unlike the one-dimensional problem posed by RTL text entry for a language like Arabic), I'd be wary of changing an approach that has worked for decades. At the very least, I'd want to see a fully functional (and open) working example.
On the other hand, I suspect the real reason that Unicode embedded the Hangul syllables was for backwards compatibility with the Johab set (which in turn was probably influenced by the desire to include several thousand hanja in the character set).
Anyways, my point: Hanzi can not be represented in 256 characters. While Hangul could be, it'd be very difficult to do so - and since Koreans often include Hanja (more or less the same thing as Hanzi) in their writing, necessitating a multibyte character set anyways, there exists a strong disincentive to do so.
When I talk about Unicode availability, I mean the ability to put any one of the characters, at least as available from Microsoft in their Arial Unicode MS font, within one word or phrase.
I'm going by GNU Unifont.
That can only happen if the software allows at least a 16 bit usage per character (and now 21 bit usage). The alternative adopted by Microsoft with the introduction of Windows XP was to use the 16 bit character set internally.
Agreed. I personally prefer UTF-8 (a strange text encoding format that can use anywhere from 7 bits to 6 bytes per Unicode character) to UTF-16, but either gets the job done. UCS-2 is no longer enough.
The European software developers even today are still thinking of code pages of 256 characters, as exemplified by your post regarding Hanzi and Hangul.
I don't see how my post refers to this. I was referring to the opposite problem - 256 characters simply isn't enough for some languages/character sets.
In the original thread, I tried to encourage the questioner to post the name of the RTL language he wants to use, because I am pretty certain that for the major RTL group, I can find the 256 character ANSI font and enable him to use any of the current non-Microsoft languages including Euphoria.
Sure, depending on the OP's needs, that might be enough.
But will the OP be able to mix English with that RTL language? Having to constantly switch code pages is a pain.
The goal of Unicode is to be able to represent all languages. It's nice to be able to mix Korean and Mandarin and English (and Arabic and Hindi) all in the same document.
EuGTK will work, but only by using the one byte font.
AS IT STANDS WITH EUPHORIA, EUGTK WILL NOT WORK USING THE FULL TWO BYTE STORAGE.
This is flat out wrong. I tested EuGTK demos test7.ex (single line text entry) and test48.ex (a simple text editor) with Hangul syllabic block characters and with Hanzi (both of which require at least two bytes per character in any character set), and it worked fine. Of course, I used UTF-8 (which is backwards compatible with ASCII at the binary level, unlike UTF-16).
However, the Euphoria wrapper can probably be modified for two byte storage of Unicode characters if it is an absolute requirement, such as in a multilingual composition requiring sorting. RTL is a further complication/problem which can be easily tested by him using a byte font in his chosen right-to-left language.
UTF-8 will create problems associated with selection and extraction from within a phrase, because each "foreign" character can have anything from one to 6 bytes per character!
Of course, GTK uses UTF-8 internally already, so by leveraging the support for UTF-8 in Glib, all these hard problems have already been solved!
4. Re: EuGTK and UTF-8
- Posted by EUWX Dec 16, 2012
- 1518 views
EuGTK will work, but only by using the one byte font.
AS IT STANDS WITH EUPHORIA, EUGTK WILL NOT WORK USING THE FULL TWO BYTE STORAGE.
This is flat out wrong. I tested EuGTK demos test7.ex (single line text entry) and test48.ex (a simple text editor) with Hangul syllabic block characters and with Hanzi (both of which require at least two bytes per character in any character set), and it worked fine. Of course, I used UTF-8 (which is backwards compatible with ASCII at the binary level, unlike UTF-16).
1. Kindly look at and provide the Peeks of one of these strings.
2. Try extracting the third and 4th syllable of a string, like extracting "mc" from "jimcbrown" in your EuGTK created Chinese text, using Euphoria and then extracting 6th and 7th similar to extracting "ro" from "jimcbrown".
There are other aspects of your posts reminiscent of the year 2000-2003, and I will address them later as I get time.
Whenever I come across Unicode and UTF-8, theorizing seems to be the order of the day. Outside of simple nice text under UTF-8 on the net like this:
"जैसा ये लिखा है",
a major part of it is vapourware.
5. Re: EuGTK and UTF-8
- Posted by jimcbrown (admin) Dec 16, 2012
- 1607 views
1. Kindly look at and provide the Peeks of one of these strings.
Here you go: {228,189,160,229,165,189,233,169,172,239,188,159}
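For anyone who wants to check those bytes without EuGTK at hand, here is a minimal sketch in Python (for illustration only; it is not the Euphoria code used in the test). It decodes the peeked values as UTF-8 and reports the code point and encoded length of each character. The variable names are my own.

```python
# The byte values exactly as peeked from memory above.
peeked = [228, 189, 160, 229, 165, 189, 233, 169, 172, 239, 188, 159]

# Decode the raw bytes as UTF-8 into a string of characters.
text = bytes(peeked).decode("utf-8")

# One entry per character: its code point and how many bytes it occupies.
code_points = [f"U+{ord(ch):04X}" for ch in text]
byte_lengths = [len(ch.encode("utf-8")) for ch in text]

print(len(peeked), "bytes ->", len(text), "characters")
print(code_points, byte_lengths)
```

Twelve bytes decode to four characters of three bytes each, which is why byte-level subscripting cannot be used to pick characters out of such a string.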
2. Try extracting the third and 4th syllable of a string, like extracting "mc" from "jimcbrown" in your EuGTK created Chinese text, using Euphoria and then extracting 6th and 7th similar to extracting "ro" from "jimcbrown".
Hmm. I know how to do that (using http://www.gtk.org/api/2.6/glib/glib-Unicode-Manipulation.html#g-utf8-find-next-char ) but the ancient version of EuGTK that I just happened to have lying around when I decided to do these tests doesn't seem to have that wrapped.
Still, even if I have to wrap glib by hand myself, glib makes it a lot easier. Heck, they even make it easy to compare utf-8 strings! http://www.gtk.org/api/2.6/glib/glib-Unicode-Manipulation.html#g-utf8-collate
On the other hand, I could simply wrap http://www.gtk.org/api/2.6/glib/glib-Unicode-Manipulation.html#g-utf8-to-ucs4-fast and http://www.gtk.org/api/2.6/glib/glib-Unicode-Manipulation.html#g-ucs4-to-utf8 and then do any character comparison or substring extraction via ucs4.
Not only does glib make it easy, but it gives you choices!
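To make the UCS-4 route concrete, here is a sketch of the idea in Python (illustration only). Python's own decode/encode stand in for the GLib conversion calls, which is an assumption on my part, and `utf8_substring` is a hypothetical helper name. It uses 1-based inclusive positions, as a Euphoria slice would.

```python
def utf8_substring(data: bytes, first: int, last: int) -> bytes:
    """Extract characters first..last (1-based, inclusive) from UTF-8 bytes."""
    # Step 1: UTF-8 bytes -> one integer per character (the UCS-4 view).
    code_points = [ord(ch) for ch in data.decode("utf-8")]
    # Step 2: slice by character position, not by byte position.
    piece = code_points[first - 1:last]
    # Step 3: convert the slice back to UTF-8 bytes.
    return "".join(chr(cp) for cp in piece).encode("utf-8")

# The ASCII case from the question, then the same positions on hanzi bytes.
ascii_name = b"jimcbrown"
hanzi = bytes([228, 189, 160, 229, 165, 189, 233, 169, 172, 239, 188, 159])

print(utf8_substring(ascii_name, 3, 4))        # characters 3 and 4
print(list(utf8_substring(hanzi, 3, 4)))       # six bytes: two whole characters
```

Once the text is a sequence of one integer per character, extraction is an ordinary slice; only the conversion at each end needs to know about UTF-8.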
Not that I have anything against wxWidgets, mind you. I haven't looked as closely into wxWidgets, but I'm sure that the unicode version makes things just as easy as Glib does. Heck, you can build wxWidgets on top of GTK and have the best of both worlds!
There are other aspects of your posts reminiscent of the year 2000-2003, and I will address them later as I get time.
Perhaps you could refresh my memory and point out a few specific examples from that time? As you address these other aspects of my more recent posts, of course.
Whenever I come across Unicode and UTF-8, theorizing seems to be the order of the day. Outside of simple nice text under UTF-8 on the net like this:
"जैसा ये लिखा है",
a major part of it is vapourware.
Again, I shall point you to http://www.gtk.org/api/2.6/glib/glib-Unicode-Manipulation.html
6. Re: EuGTK and UTF-8
- Posted by EUWX Dec 16, 2012
- 1589 views
Whenever I come across Unicode and UTF-8, theorizing seems to be the order of the day. Outside of simple nice text under UTF-8 on the net like this:
"जैसा ये लिखा है",
a major part of it is vapourware.
Again, I shall point you to http://www.gtk.org/api/2.6/glib/glib-Unicode-Manipulation.html
The question is not whether GTK or Glib can do something. Can EUGTK do something? - that is the question. When I talked about 2000-2003, that was the time when people were changing over from Win 98 with its codepages to real 16 bit Unicode. There are countless examples of presumption of real Unicode, when actually it was just a codepage or specific font related solution.
I can categorically state that using EUGTK, as is currently done in Euphoria, you cannot do extraction, rotation of characters, cutting down of strings, etc. You need to write more code and more wrappers, or use another piece of software to do that.
Kindly remember that when I talk about programming I am NOT talking about C or C++; I am talking about application development languages. Most application development programmers are vaguely familiar with C++ and cannot use it but, like me, are comfortable with BASIC and EUPHORIA. Even there, we would rather work with fully integrated GUI tools in the language. For a world audience we would like a fully developed language that can do string manipulation in Unicode, and a search compatible with syllabic languages. That is why, incidentally, Hadoop is taking off in database work: it fundamentally stores a data field as a number of bytes in multiples of 4 bytes, which, by the way, is the storage method of Euphoria, though Euphoria is not developed enough to reach those levels.
I am concerned that your attitude is always that of defending the existing rather than looking at the weaknesses and wanting to address these weaknesses, or of recognising the need for progress. The fact that I have rejected Freebasic, QB4, Python Schema, etc in favour of Euphoria is because of the strengths of Euphoria, but that does not mean I should ignore or attempt to hide the weaknesses under the hood of a big C
7. Re: EuGTK and UTF-8
- Posted by jimcbrown (admin) Dec 16, 2012
- 1686 views
The question is not whether GTK or Glib can do something. Can EUGTK do something? - that is the question.
I can categorically state that using EUGTK, as is currently done in Euphoria, you cannot do extraction, rotation of characters, cutting down of strings, etc. You need to write more code and more wrappers, or use another piece of software to do that.
That's a good point. I haven't tried the latest version of EuGTK, so I don't know if that has been added since.
When I talked about 2000-2003, that was the time when people were changing over from Win 98 with its codepages to real 16 bit Unicode. There are countless examples of presumption of real Unicode, when actually it was just a codepage or specific font related solution.
But - I didn't use 98 at that time. I was already on UTF-8 supporting Linux/GNU. I don't see what any of that has to do with me.
Kindly remember that when I talk about programming I am NOT talking about C or C++; I am talking about application development languages. Most application development programmers are vaguely familiar with C++ and cannot use it but, like me, are comfortable with BASIC and EUPHORIA. Even there, we would rather work with fully integrated GUI tools in the language. For a world audience we would like a fully developed language that can do string manipulation in Unicode, and a search compatible with syllabic languages. That is why, incidentally, Hadoop is taking off in database work: it fundamentally stores a data field as a number of bytes in multiples of 4 bytes, which, by the way, is the storage method of Euphoria, though Euphoria is not developed enough to reach those levels.
I am concerned that your attitude is always that of defending the existing rather than looking at the weaknesses and wanting to address these weaknesses, or of recognising the need for progress. The fact that I have rejected Freebasic, QB4, Python Schema, etc in favour of Euphoria is because of the strengths of Euphoria, but that does not mean I should ignore or attempt to hide the weaknesses under the hood of a big C
You are absolutely right. To that end, I've wrapped the Glib functions for the ucs4/utf8 conversions.
This wrapper can be used independently, but it goes well with EuGTK.
The wrapper is available here: http://openeuphoria.org/pastey/177.wc
A modified test7.ex (SLE demo) that uses the new wrapper (call it gunicode.e ) is here: http://openeuphoria.org/pastey/178.wc
The modified demo grabs a substring of the Unicode text pasted in and displays just that part when you hit the Quit button.
This took, half an hour? If no one has wrapped it before, I'd guess that this is because no one has needed it before. Not that we shouldn't have this - now I can confidently say that a coder writing only in Euphoria code (and using the right wrappers of course) can do Unicode manipulation as well as simply inputting and displaying Unicode text.
8. Re: EuGTK and UTF-8
- Posted by EUWX Dec 16, 2012
- 1582 views
Thanks for the two pasteys. It is cool, as the younger generation says.
The whole conversation started with the client looking for RTL language support. And in my first response I said this:
"Euphoria has a 4 bytes/per character and has availability of several Peek/Poke operations with memory. Therefore, you should be able to integrate your needs using 16 bit Unicode and Peek/Poke operations. "
In your pasteys you have used these features in the gunicode.e file and seem to have achieved this with UTF-8.
You should let him see these pasteys in the thread he started and ask him to test on RTL. He is also looking for BiDi (bidirectional) language support. He would therefore need to test your Euphoria code with some bidirectional text.
As far as I am concerned, Microsoft's approach to Unicode, with a 2 byte character is the most rational approach. When memory was not easy to obtain we started with 7 bit ASCII, changing soon to 8 bit ANSI. With the introduction of Unicode, covering the whole world's characters, using 16 bits per character was logical. Now with 21 bits per character incorporating extended Chinese or whatever it is called, Euphoria using the full 4 byte character WHICH IT ALREADY HAS is the most logical approach. Some work in C (unfortunately I don't know C or how to wrap, etc.) should enable proper extraction of sub-strings of LE16 and LE32. In fact, a proper group of library functions to do conversions between different formats and extractions of substrings is desirable.
The client has brought a new problem. Whether or not GTK has BiDi, it is doable in Euphoria by C language programmers using their own code and/or wrapping an existing library. The world extends beyond the Suez Canal, and China is probably the richest country in the world.
9. Re: EuGTK and UTF-8
- Posted by jimcbrown (admin) Dec 17, 2012
- 1480 views
The whole conversation started with the client looking for RTL language support. And in my first response I said this:
"Euphoria has a 4 bytes/per character and has availability of several Peek/Poke operations with memory. Therefore, you should be able to integrate your needs using 16 bit Unicode and Peek/Poke operations. "
In your pasteys you have used these features in the gunicode.e file and seem to have achieved this with UTF-8.
You should let him see these pasteys in the thread he started
The OP of the other thread is following this one as well. I don't see the need to cross post.
and ask him to test on RTL. He is also looking for BiDi (bidirectional) language support. He would therefore need to test your Euphoria code with some bidirectional text.
Agreed.
As far as I am concerned, Microsoft's approach to Unicode, with a 2 byte character is the most rational approach.
I disagree. Originally UCS-2 was used, which can't represent the whole of Unicode. UTF-16 fixes this by essentially becoming a multibyte text encoding format (some characters take up two bytes, others use surrogate pairs which take up 4 bytes).
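The four-byte case in UTF-16 is handled by what the Unicode Standard calls a surrogate pair: two 16-bit units that together carry one code point above U+FFFF. A minimal sketch of the arithmetic in Python (illustration only; `to_surrogate_pair` is a hypothetical helper name, not part of any library mentioned here):

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into its two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

# U+20000 is the first code point of the extended hanzi area.
high, low = to_surrogate_pair(0x20000)

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = "\U00020000".encode("utf-16-be")
```

Anything at or below U+FFFF stays a single 16-bit unit, which is the "two bytes if you're lucky" case.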
I prefer UTF-8 and UCS-4 to UTF-16 but for opposite reasons. UTF-8 is so backwards compatible with older programs that you don't need to update your code at all to get the simplest operations to work. (You do need to make changes to support slightly higher operations like subscripting or sorting multibyte characters, but not for simple text entry or displaying of text).
Now with 21 bits per character incorporating extended Chinese or whatever it is called, Euphoria using the full 4 byte character WHICH IT ALREADY HAS is the most logical approach.
Agreed. This is also the reason why I favor UCS-4, even though it lacks that simple backwards compatibility that UTF-8 provides.
UTF-16 seems like the worst of both worlds - it lacks backwards compatibility (code needs to be updated to handle null characters correctly, for example) and it doesn't fit as elegantly as UCS-4 does with the Euphorian integer.
Another reason to favor UCS-4 is that it is platform agnostic - UTF-8 seems more tied to Unix-land (which it was originally developed for) and UTF-16 is, as you stated, more closely embraced by M$.
The world extends beyond the Suez Canal, and China is probably the richest country in the world.
The US has a larger GDP. If you're willing to count the European Union, the EU also has a larger GDP.
10. Re: EuGTK and UTF-8
- Posted by mattlewis (admin) Dec 17, 2012
- 1497 views
As far as I am concerned, Microsoft's approach to Unicode, with a 2 byte character is the most rational approach.
I disagree. Originally UCS-2 was used, which can't represent the whole of Unicode. UTF-16 fixes this by essentially becoming a multibyte text encoding format (some characters take up two bytes, others use surrogate pairs which take up 4 bytes).
I prefer UTF-8 and UCS-4 to UTF-16 but for opposite reasons. UTF-8 is so backwards compatible with older programs that you don't need to update your code at all to get the simplest operations to work. (You do need to make changes to support slightly higher operations like subscripting or sorting multibyte characters, but not for simple text entry or displaying of text).
The key factor would seem to be which languages you normally use. Personally, I only use English, so UTF-8 works very well for me, and for normal purposes, I don't even need to be aware that I'm using Unicode at all, since it generally looks the same as ASCII. If I used a different language with a different alphabet, then I'd probably prefer UTF-16 or UTF-32.
The world extends beyond the Suez Canal, and China is probably the richest country in the world.
The US has a larger GDP. If you're willing to count the European Union, the EU also has a larger GDP.
Soon the GDP of China may surpass that of the US (I don't take this as a given, however, they have a lot of challenges ahead of them), but the people will still be poorer. Their growth is basically them playing catch up with the rest of the developed world, and most of their population has a long way to go before that happens.
Matt
11. Re: EuGTK and UTF-8
- Posted by jimcbrown (admin) Dec 17, 2012
- 1514 views
The key factor would seem to be which languages you normally use. Personally, I only use English, so UTF-8 works very well for me, and for normal purposes, I don't even need to be aware that I'm using Unicode at all, since it generally looks the same as ASCII. If I used a different language with a different alphabet, then I'd probably prefer UTF-16 or UTF-32.
The only connection I can see here is the size of text files. Under UTF-8, you'd need 3 bytes per character for most hanzi or hangul (4 for characters outside the Basic Multilingual Plane) - but UCS-4 would always require 4 bytes, and UTF-16 would require only 2 bytes if you're lucky (but still 4 at most).
Of course, disk is cheap these days. This isn't a major concern for me.
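The sizes are easy to measure directly. Here is a small Python sketch (illustration only, not from any wrapper discussed here; `encoded_sizes` is a hypothetical helper). Under UTF-8 as currently standardised, hanzi and hangul in the Basic Multilingual Plane come out at 3 bytes and the maximum for any character is 4, rather than the 6 that the original design allowed.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the encoded length in bytes of one character per encoding."""
    # The -le variants are used so that no byte order mark is counted.
    return {enc: len(ch.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

samples = {
    "ASCII letter": "A",
    "hanzi (BMP)": "\u4f60",
    "hangul syllable": "\ud55c",
    "extended hanzi": "\U00020000",
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

So for mostly-CJK text UTF-16 is the most compact of the three, UTF-8 costs about half as much again, and UCS-4 (UTF-32) costs double, while for ASCII-heavy text UTF-8 wins.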
The world extends beyond the Suez Canal, and China is probably the richest country in the world.
The US has a larger GDP. If you're willing to count the European Union, the EU also has a larger GDP.
Soon the GDP of China may surpass that of the US (I don't take this as a given, however, they have a lot of challenges ahead of them), but the people will still be poorer. Their growth is basically them playing catch up with the rest of the developed world, and most of their population has a long way to go before that happens.
Matt
Agreed. http://seeingredinchina.com/2011/02/22/chinas-gdp-doesnt-mean-what-you-think-it-does/