Re: EuGTK and UTF-8
- Posted by EUWX Dec 16, 2012
- 1669 views
Forked from Re: GUI with RTL and BiDi support
I would be surprised if GTK and WxEuphoria text widgets were not equally capable of left-to-right and Unicode already. Their wrappers are more complete and more widely used, if user feedback is any indication.
GTK: I have already commented that, on paper at least and based on test 117 given by the GTK wrapper author, you should be able to enter UTF-8 text in a Label.
Agreed. I tried it and it works.
The method he recommends, and I agree with him there, is to copy and paste from a UTF-8 word processor.
This is not necessary. I've tested EuGTK (release from Feb 2010) with GTK 2 and with SCIM 1.4.5 on Linux/GNU.
I can't speak for RTL text entry as I don't use any language with it, but entering hanzi and hangul the conventional way works fine with EuGTK.
You have to understand something more about syllabic characters and their composition and usage on computers. For a long time we were restricted to a 128-character set, which was later expanded to a 256-character set. This is essentially based on using ONE BYTE per character. Keyboards do not have 256 keys, so various schemes were devised to enter the 256 variations of glyphs for a given language font.
Based on these restrictions and the need to use their own languages, most major languages created their own alternative glyphs and methods of joining BUT ALWAYS WITHIN THE 256 GLYPH availability. I mentioned the Hindi Saral font, which is one of those simplifications.
Although the Chinese need many more glyphs, the Hanzi font you mentioned is available as a one-byte font:
http://www.fontpalace.com/font-download/Hanzi-Kaishu/
With the advent of Unicode, the Indians (via their government) accepted 128 glyphs per Indian language, and these reside from about 0900 hex onwards. Although Indian languages are truly syllabic, and there is a need to reserve about 4000 code points per language, there is no demand from India for creating separate glyph pages. The Chinese did ask for more, and in fact they created an extension from the 64K Unicode glyphs to a 21-bit Unicode to accommodate these.
The Korean Unicode block for Hangul Jamo is a 256-character block.
Unicode Hangul Jamo (U+1100–U+11FF): the Hangul Jamo Unicode block contains the possible leading, middle, and trailing parts of a Hangul syllable block in the following ranges:
• U+1100–U+1112: the basic 18 initial consonants plus 1 silent consonant (NG)
• U+1113–U+1159: other initial complex and ancient consonants
• U+115A–U+115E: Reserved
• U+115F: choseong (initial) filler
• U+1160: jungseong (middle) filler
• U+1161–U+1175: the basic 21 vowels and diphthongs
• U+1176–U+11A2: other and ancient vowel combinations
• U+11A3–U+11A7: Reserved
• U+11A8–U+11C2: the basic 27 trailing consonant combinations
• U+11C3–U+11F9: other trailing consonant combinations
• U+11FA–U+11FF: Reserved
Notice that software which accesses each character as one byte can still use this block, provided the 256 glyphs are attached to one particular page of a font.
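To illustrate the one-byte idea, here is a minimal Python sketch (Python is chosen only for illustration, not Euphoria). It treats the Hangul Jamo block U+1100–U+11FF as a single 256-entry page, so each character is stored as its one-byte offset into that page. The function names `jamo_to_byte`, `byte_to_jamo` and `jamo_role` are hypothetical helpers, not part of EuGTK or any library, and the sketch assumes the program already knows which page the bytes belong to.

```python
# Hangul Jamo block as listed above: U+1100 through U+11FF (256 code points).
JAMO_BASE = 0x1100
JAMO_LAST = 0x11FF


def jamo_to_byte(ch: str) -> int:
    """Return the 0..255 offset of a Hangul Jamo character within its block."""
    cp = ord(ch)
    if not JAMO_BASE <= cp <= JAMO_LAST:
        raise ValueError(f"U+{cp:04X} is outside the Hangul Jamo block")
    return cp - JAMO_BASE


def byte_to_jamo(offset: int) -> str:
    """Inverse of jamo_to_byte: map a one-byte offset back to the character."""
    if not 0 <= offset <= 0xFF:
        raise ValueError("offset must fit in one byte")
    return chr(JAMO_BASE + offset)


def jamo_role(ch: str) -> str:
    """Classify a jamo using only the 'basic' ranges from the list above."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x1112:
        return "basic initial consonant"
    if 0x1161 <= cp <= 0x1175:
        return "basic vowel"
    if 0x11A8 <= cp <= 0x11C2:
        return "basic trailing consonant"
    return "other"


if __name__ == "__main__":
    for ch in ("\u1100", "\u1161", "\u11a8"):
        print(f"U+{ord(ch):04X} -> byte {jamo_to_byte(ch):#04x} ({jamo_role(ch)})")
```

The offset only identifies a character once the page is known, which is why this scheme cannot mix several scripts in one string without extra bookkeeping.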
When I talk about Unicode availability, I mean the ability to put any of the characters, at least those available from Microsoft in their Arial Unicode MS font, within one word or phrase. That can only happen if the software allows at least 16 bits per character (and now 21 bits). Microsoft's approach, introduced with Windows XP, was to use a 16-bit character set internally. European software developers, even today, still think in terms of 256-character code pages, as exemplified by your post regarding hanzi and hangul.
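The difference between 16-bit and 21-bit storage can be seen in a short Python sketch (Python is used only for illustration). It counts how many 16-bit units UTF-16 needs for a few sample characters. The helper name `utf16_units` and the sample characters are my own choices for the example.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to store one character in UTF-16."""
    # utf-16-le writes no byte-order mark, so the length is exactly 2 bytes per unit.
    return len(ch.encode("utf-16-le")) // 2


samples = {
    "Latin A": "A",                    # U+0041
    "Hangul jamo": "\u1100",           # U+1100
    "CJK ideograph": "\u4e2d",         # U+4E2D
    "CJK Extension B": "\U00020000",   # U+20000, beyond the 16-bit range
}

if __name__ == "__main__":
    for name, ch in samples.items():
        print(f"{name}: U+{ord(ch):04X} needs {utf16_units(ch)} 16-bit unit(s)")
```

Characters up to U+FFFF fit in one 16-bit unit. Anything above that, such as the CJK extension character, needs two units, so even 16-bit storage is not strictly one unit per character.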
In the original thread, I tried to encourage the questioner to post the name of the RTL language he wants to use, because I am pretty certain that for the major RTL group I can find a 256-character ANSI font and enable him to use any of the current non-Microsoft languages, including Euphoria.
EuGTK will work, but only by using the one byte font.
AS IT STANDS WITH EUPHORIA, EUGTK WILL NOT WORK USING THE FULL TWO BYTE STORAGE.
However, the Euphoria wrapper can probably be modified for two-byte storage of Unicode characters if that is an absolute requirement, such as in a multilingual composition requiring sorting. RTL is a further complication, which he can easily test using a one-byte font in his chosen right-to-left language.
UTF-8 will create problems with selecting and extracting text from within a phrase, because each "foreign" character can take anywhere from one to four bytes (up to six in the original UTF-8 design)!
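The selection problem can be sketched in Python (used only for illustration, and assuming standard UTF-8 as defined in RFC 3629, where a character takes at most four bytes). The helpers `utf8_len` and `char_slice_utf8` are hypothetical names for this example. Slicing the raw bytes by position can cut a character in half, while decoding first and slicing by character is safe.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def char_slice_utf8(data: bytes, start: int, stop: int) -> bytes:
    """Extract characters [start, stop) from UTF-8 bytes by decoding first."""
    return data.decode("utf-8")[start:stop].encode("utf-8")


# One character each of 1, 2, 3 and 4 bytes in UTF-8.
phrase = "A\u00e9\u4e2d\U00020000"
encoded = phrase.encode("utf-8")

if __name__ == "__main__":
    print([utf8_len(ch) for ch in phrase])
    # Byte index 2 falls in the middle of the two-byte character U+00E9.
    print(encoded[2:3])
    # Character index 2 is the whole three-byte character U+4E2D.
    print(char_slice_utf8(encoded, 2, 3))
```

A wrapper that stores strings as raw UTF-8 bytes would need something like the second approach for selection and extraction to stay on character boundaries.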