Re: EuGTK and UTF-8
- Posted by jimcbrown (admin) Dec 16, 2012
- 1645 views
You have to understand something more about the syllabic characters, their composition, and their usage on computers. For a long time we were restricted to a 128-character set, which was later expanded to a 256-character set. This is essentially based on using ONE BYTE per character. Keyboards do not have 256 keys, and various schemes were devised to enter the 256 variations of glyphs for a given language font.
Agreed.
Based on these restrictions and the need to use their own languages, most major languages created their own alternative glyphs and methods of joining, BUT ALWAYS WITHIN THE 256-GLYPH limit. I mentioned the Hindi Saral font, which is one of those simplifications.
Not true. Even simplified hanzi can't fit into 256 characters. Notice that GB2312 was encoded in two bytes (16 bits) with EUC-CN. And of course, GB2312 is a very incomplete set of hanzi (though enough for the simplest of everyday uses).
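For illustration, here's a minimal sketch of the EUC-CN scheme: each GB2312 character is identified by a row and cell (each 1..94), and 0xA0 is added to both bytes so they stay out of the ASCII range. The row/cell example is just one I'm reasonably sure of:

```c
/* Minimal sketch: how EUC-CN packs a GB2312 code point (row/cell,
 * each 1..94) into two bytes, both with the high bit set so they
 * can't be confused with ASCII. Assumes valid input. */
#include <stdio.h>

static void gb2312_to_euc(int row, int cell, unsigned char out[2])
{
    out[0] = (unsigned char)(0xA0 + row);   /* lead byte:  0xA1..0xFE */
    out[1] = (unsigned char)(0xA0 + cell);  /* trail byte: 0xA1..0xFE */
}

int main(void)
{
    unsigned char b[2];
    gb2312_to_euc(54, 48, b);          /* row 54, cell 48 is 中 (zhong) */
    printf("%02X %02X\n", b[0], b[1]); /* prints D6 D0 */
    return 0;
}
```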
Although the Chinese need many more glyphs, the Hanzi font you mentioned is available as a one-byte font:
http://www.fontpalace.com/font-download/Hanzi-Kaishu/
I did not mention that font. Looking at the character map, it looks like one of those joke fonts that map Roman letters to random hanzi.
Those fonts can be fun, but even a 5 year old child needs more hanzi than provided by the character map. It's impractical for everyday use.
The Chinese did ask for, and in fact helped create, an extension from the 64K Unicode glyphs to 21-bit Unicode to accommodate these.
And even this was not enough to represent all hanzi.
http://unicode.org/reports/tr28/tr28-3.html#13_7_variation_selectors
The Korean Unicode block for Hangul is a 256-character block:
Unicode Hangul Jamo (U+1100 - U+11FF)
Hangul Jamo just covers the individual letters. However, Unicode embeds Hangul syllable blocks as individual characters as well.
Hangul defines 10 basic vowels and 14 basic consonants. Counting all possible CV and CVC combinations leads to a minimum of 2100 syllable blocks.
14*10 + 14*10*14 = 140 + 1960 = 2100
Actually, there are even more, since some words encode extra consonants beyond the simple CV and CVC combinations. E.g. 없어요 (eopseoyo - to not have / to not exist).
The Hangul Jamo Unicode Block contains the possible Leading, Middle, and Trailing parts of a Hangul block in the following ranges:
U+1100 - U+1112: the basic 18 initial consonants plus 1 silent consonant (NG)
U+1113 - U+1159: other initial complex and ancient consonants
U+115A - U+115E: Reserved
U+115F: choseong (initial) filler
U+1160: jungseong (middle) filler
U+1161 - U+1175: the basic 21 vowels and diphthongs
U+1176 - U+11A2: other and ancient vowel combinations
U+11A3 - U+11A7: Reserved
U+11A8 - U+11C2: the basic 27 trailing consonant combinations
U+11C3 - U+11F9: other trailing consonant combinations
U+11FA - U+11FF: Reserved
You missed out on the U+3130 - U+318F range, not to mention the entire syllabic block range at U+AC00 - U+D7AF.
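Those precomposed syllables are pure arithmetic over the jamo indices; this is the Hangul composition algorithm from the Unicode standard. A minimal sketch:

```c
/* A sketch of the arithmetic behind the U+AC00..U+D7A3 syllable block
 * (the Unicode standard's Hangul composition algorithm): every
 * precomposed syllable is computed from its jamo indices. */
#include <stdio.h>

#define SBASE 0xAC00  /* first precomposed syllable */
#define VCOUNT 21     /* medial vowels (U+1161..U+1175) */
#define TCOUNT 28     /* trailing consonants, including "none" (0) */

/* l: leading consonant index 0..18, v: vowel index 0..20,
 * t: trailing index 0..27 (0 = no trailing consonant) */
static unsigned compose_syllable(int l, int v, int t)
{
    return SBASE + (l * VCOUNT + v) * TCOUNT + t;
}

int main(void)
{
    /* l=11 (ieung), v=0 (a), t=4 (nieun) composes U+C548, the
     * syllable 안 ("an"); 19*21*28 = 11172 syllables in total. */
    printf("U+%04X\n", compose_syllable(11, 0, 4));
    printf("total: %d\n", 19 * VCOUNT * TCOUNT);
    return 0;
}
```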
If you notice, you can use it in software that can only access a character in one byte, provided the 256 glyphs are attached to one particular font code page.
You are correct. As noted in http://www.decodeunicode.org/en/hangul_syllables there was significant debate over whether or not it was necessary to encode all these syllables.
In theory, one could scrap the Hangul syllable block, use an automated process to convert all existing text written using Hangul syllable characters (virtually all of it) into text using only combining Hangul Jamo, and upgrade all word processors and other text manipulation tools to do this.
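The conversion itself is mechanical: decomposition is just the inverse of the arithmetic behind the precomposed block (again, the Unicode standard's Hangul algorithm). A minimal sketch:

```c
/* A sketch of that automated conversion step: decomposing a
 * precomposed syllable back into conjoining jamo, the inverse of
 * the composition arithmetic. */
#include <stdio.h>

#define SBASE 0xAC00
#define LBASE 0x1100  /* choseong (leading) block */
#define VBASE 0x1161  /* jungseong (vowel) block  */
#define TBASE 0x11A7  /* jongseong block minus one */
#define VCOUNT 21
#define TCOUNT 28

static void decompose(unsigned s, unsigned jamo[3])
{
    unsigned i = s - SBASE;                             /* 0..11171 */
    jamo[0] = LBASE + i / (VCOUNT * TCOUNT);            /* leading consonant */
    jamo[1] = VBASE + (i % (VCOUNT * TCOUNT)) / TCOUNT; /* vowel */
    jamo[2] = (i % TCOUNT) ? TBASE + i % TCOUNT : 0;    /* trailing, if any */
}

int main(void)
{
    unsigned j[3];
    decompose(0xC548, j);  /* the syllable 안 ("an") */
    printf("U+%04X U+%04X U+%04X\n", j[0], j[1], j[2]); /* 110B 1161 11AB */
    return 0;
}
```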
Considering the unique difficulties of getting Hangul Jamo to combine into syllable blocks correctly (this requires working in two dimensions, unlike the one-dimensional problem posed by RTL text entry for a language like Arabic), I'd be wary of changing an approach that has worked for decades. At the very least, I'd want to see a fully functional (and open) working example.
On the other hand, I suspect the real reason that Unicode embedded the Hangul syllables was for backwards compatibility with the Johab set (which in turn was probably influenced by the desire to include several thousand hanja in the character set).
Anyways, my point: Hanzi cannot be represented in 256 characters. While Hangul could be, it'd be very difficult to do so - and since Koreans often include Hanja (more or less the same thing as Hanzi) in their writing, necessitating a multibyte character set anyways, there exists a strong disincentive to do so.
When I talk about Unicode availability, I mean the ability to put any one of the characters, at least as available from Microsoft in their Arial Unicode MS font, within one word or phrase.
I'm going by GNU Unifont.
That can only happen if the software allows at least 16-bit usage per character (and now 21-bit usage). The alternative Microsoft adopted with the introduction of Windows XP was to use the 16-bit character set internally.
Agreed. I personally prefer UTF-8 (a variable-width encoding that can use anywhere from one byte - just 7 significant bits - to six bytes per Unicode character, though RFC 3629 now caps it at four) to UTF-16, but either gets the job done. UCS-2 is no longer enough.
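A minimal encoder sketch makes the byte-length rules concrete (this covers the RFC 3629 subset of one to four bytes):

```c
/* A minimal UTF-8 encoder sketch. Hangul and most hanzi land in the
 * three-byte range; supplementary-plane characters take four bytes.
 * Returns the number of bytes written. */
#include <stdio.h>

static int utf8_encode(unsigned cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* ASCII: 1 byte */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {              /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {            /* 3 bytes: hangul, most hanzi */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else {                              /* 4 bytes: supplementary planes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
}

int main(void)
{
    unsigned char buf[4];
    int n = utf8_encode(0xC548, buf);  /* the hangul syllable 안 */
    for (int i = 0; i < n; i++) printf("%02X ", buf[i]);
    printf("\n");                      /* prints EC 95 88 */
    return 0;
}
```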
Even today, European software developers are still thinking in terms of 256-character code pages, as exemplified by your post regarding Hanzi and Hangul.
I don't see how my post refers to this. I was referring to the opposite problem - 256 characters simply isn't enough for some languages/character sets.
In the original thread, I tried to encourage the questioner to post the name of the RTL language he wants to use, because I am pretty certain that for the major RTL group, I can find the 256-character ANSI font and enable him to use any of the current non-Microsoft languages, including Euphoria.
Sure, depending on the OP's needs, that might be enough.
But will the OP be able to mix English with that RTL language? Having to constantly switch code pages is a pain.
The goal of Unicode is to be able to represent all languages. It's nice to be able to mix Korean and Mandarin and English (and Arabic and Hindi) all in the same document.
EuGTK will work, but only by using a one-byte font.
AS IT STANDS WITH EUPHORIA, EUGTK WILL NOT WORK USING THE FULL TWO BYTE STORAGE.
This is flat out wrong. I tested EuGTK demos test7.ex (single line text entry) and test48.ex (a simple text editor) with Hangul syllable block characters and with Hanzi (both of which require a minimum of two bytes per character in any character set), and it worked fine. Of course, I used UTF-8 (which is backwards compatible with ASCII at the binary level, unlike UTF-16).
However, the Euphoria wrapper can probably be modified for two-byte storage of Unicode characters if it is an absolute requirement, such as in a multilingual composition requiring sorting. RTL is a further complication/problem which he can easily test using a one-byte font in his chosen right-to-left language.
UTF-8 will create problems associated with selection and extraction from within a phrase, because each "foreign" character can have anything from one to 6 bytes per character!
Of course, GTK uses UTF-8 internally already, so by leveraging the support for UTF-8 in Glib, all these hard problems have already been solved!
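For instance, GLib addresses text by character offset rather than byte offset, so selection and extraction never need to care how many bytes each character takes. A short sketch (assuming the usual glib-2.0 pkg-config setup):

```c
/* What "already solved" means in practice: GLib's UTF-8 API works in
 * character offsets, not byte offsets.
 * Build with: gcc demo.c $(pkg-config --cflags --libs glib-2.0) */
#include <glib.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const gchar *s = "abc없어요";  /* 1-byte and 3-byte characters mixed */

    /* character count, not byte count: 6 chars, 12 bytes */
    printf("chars: %ld, bytes: %lu\n",
           g_utf8_strlen(s, -1), (unsigned long)strlen(s));

    /* extract characters 3..5 ("없어요") regardless of byte widths */
    gchar *start = g_utf8_offset_to_pointer(s, 3);
    gchar *end   = g_utf8_offset_to_pointer(s, 6);
    gchar *slice = g_strndup(start, end - start);
    printf("slice: %s\n", slice);
    g_free(slice);
    return 0;
}
```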