Re: EuGTK and UTF-8
- Posted by EUWX Dec 16, 2012
Whenever I come across Unicode and UTF-8, theorizing seems to be the order of the day. Outside of simple, nice text in UTF-8 on the net like this:
"जैसा ये लिखा है",
a major part of it is vapourware.
Again, I shall point you to http://www.gtk.org/api/2.6/glib/glib-Unicode-Manipulation.html
The question is not whether GTK or Glib can do something. Can EUGTK do something? - that is the question.
I can categorically state that using EUGTK, as is currently done in Euphoria, you cannot do extraction, rotation of characters, cutting down of strings, etc. You need to write more code and more wrappers, or use another piece of software to do that.
That's a good point. I haven't tried the latest version of EuGTK, so I don't know if that has been added since.
When I talked about 2000-2003, that was the time when people were changing over from Win 98 with its codepages to real 16-bit Unicode. There are countless examples of something presumed to be real Unicode when it was actually just a codepage or a font-specific solution.
But - I didn't use 98 at that time. I was already on UTF-8 supporting Linux/GNU. I don't see what any of that has to do with me.
Kindly remember that when I talk about programming I am NOT talking about C or C++; I am talking about application development languages. Most application development programmers are vaguely familiar with C++ and cannot use it, but, like me, are comfortable with BASIC and Euphoria. Even there, we would rather work with fully integrated GUI tools in the language. For a world audience we would like a fully developed language that can do string manipulation in Unicode, and a search compatible with syllabic languages. That is why, incidentally, Hadoop is taking off in database work: it fundamentally stores a data field as a number of bytes in multiples of 4 bytes, which, by the way, is the storage method of Euphoria, though Euphoria is not developed enough to reach those levels.
I am concerned that your attitude is always that of defending the existing rather than looking at the weaknesses and wanting to address these weaknesses, or of recognising the need for progress. The fact that I have rejected Freebasic, QB4, Python Schema, etc in favour of Euphoria is because of the strengths of Euphoria, but that does not mean I should ignore or attempt to hide the weaknesses under the hood of a big C
You are absolutely right. To that end, I've wrapped the Glib functions for the ucs4/utf8 conversions.
This wrapper can be used independently, but it goes well with EuGTK.
The wrapper is available here: http://openeuphoria.org/pastey/177.wc
A modified test7.ex (SLE demo) that uses the new wrapper (call it gunicode.e ) is here: http://openeuphoria.org/pastey/178.wc
The modified demo grabs a substring of the Unicode text pasted in and displays just that part when you hit the Quit button.
This took, half an hour? If no one has wrapped it before, I'd guess that this is because no one has needed it before. Not that we shouldn't have this - now I can confidently say that a coder writing only in Euphoria code (and using the right wrappers of course) can do Unicode manipulation as well as simply inputting and displaying Unicode text.
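For readers who want to see the idea behind the ucs4/utf8 round trip without opening the pasteys, here is a rough sketch in Python. It is not the code from gunicode.e; the function names are made up for illustration, and Python's built-in codec stands in for the GLib conversion calls. The point it tries to show is that slicing UTF-8 bytes directly can cut a character in half, while converting to one integer per character first makes substring extraction a plain slice.

```python
def utf8_to_ucs4(data):
    """Decode UTF-8 bytes into a list of code points (one integer per character).
    Hypothetical helper name; it mirrors the idea of a utf8 -> ucs4 conversion."""
    return [ord(ch) for ch in data.decode("utf-8")]


def ucs4_to_utf8(code_points):
    """Encode a list of code points back into UTF-8 bytes."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


def utf8_substring(data, start, length):
    """Take `length` characters starting at character index `start` (0-based)."""
    code_points = utf8_to_ucs4(data)
    return ucs4_to_utf8(code_points[start:start + length])


sample = "जैसा ये लिखा है".encode("utf-8")

# Slicing the raw bytes at an arbitrary offset is not safe:
try:
    sample[:4].decode("utf-8")
    byte_slice_ok = True
except UnicodeDecodeError:
    byte_slice_ok = False  # the 4th byte starts a character that was cut off

# Slicing after conversion works on whole characters:
first_word = utf8_substring(sample, 0, 4).decode("utf-8")
print(byte_slice_ok, first_word)
```

One caveat: this works per code point, not per syllable. In Devanagari a vowel sign is a separate code point from its consonant, so a slice boundary can still separate the two. Handling that would need grapheme or syllable rules on top of the conversion.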
Thanks for the two pasteys. It is cool, as the younger generation says.
The whole conversation started with the client looking for RTL language support. And in my first response I said this:
"Euphoria has a 4 bytes/per character and has availability of several Peek/Poke operations with memory. Therefore, you should be able to integrate your needs using 16 bit Unicode and Peek/Poke operations. "
In your pasteys you have used these features in the gunicode.e file and seem to have achieved this with UTF-8.
You should let him see these pasteys in the thread he started and ask him to test on RTL. He is also looking for BiDi (bidirectional) language support. He would therefore need to test your Euphoria code with some bidirectional text.
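As a starting point for picking test strings, the sketch below (Python, with a made-up function name) classifies a string as left-to-right, right-to-left, or mixed using the bidirectional class that the Unicode character database assigns to each character. It only inspects characters; it does not reorder anything and is not an implementation of the Unicode Bidirectional Algorithm, so it says nothing about whether a given toolkit displays the text correctly.

```python
import unicodedata

# Strong directional classes from the Unicode character database:
# "L" is left-to-right, "R" is right-to-left (e.g. Hebrew),
# "AL" is right-to-left Arabic letter.
LTR_CLASSES = {"L"}
RTL_CLASSES = {"R", "AL"}


def direction_summary(text):
    """Return "ltr", "rtl", "mixed", or "neutral" for a string.
    Hypothetical helper for choosing test data, not a rendering test."""
    classes = [unicodedata.bidirectional(ch) for ch in text]
    has_ltr = any(c in LTR_CLASSES for c in classes)
    has_rtl = any(c in RTL_CLASSES for c in classes)
    if has_ltr and has_rtl:
        return "mixed"
    if has_rtl:
        return "rtl"
    if has_ltr:
        return "ltr"
    return "neutral"  # only digits, spaces, punctuation, or empty


print(direction_summary("abc \u05e9\u05dc\u05d5\u05dd"))
```

A string that comes back as "mixed" is the kind worth pasting into the demo, since that is where substring extraction and display order can disagree.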
As far as I am concerned, Microsoft's approach to Unicode, with a 2-byte character, is the most rational approach. When memory was not easy to obtain we started with 7-bit ASCII, changing soon to 8-bit ANSI. With the introduction of Unicode, covering the whole world's characters with 16 bits per character was logical. Now, with 21 bits per character incorporating extended Chinese or whatever it is called, Euphoria using the full 4-byte character, WHICH IT ALREADY HAS, is the most logical approach. Some work in C (unfortunately I don't know C or how to wrap it, etc.) should enable proper extraction of substrings of LE16 and LE32. In fact, a proper group of library functions to do conversions between different formats and extractions of substrings is desirable.
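To make the width argument concrete, here is a small Python sketch (the function names are invented for illustration, and "LE16"/"LE32" are taken to mean little-endian UTF-16 and UTF-32). It reports how many bytes one character takes in each encoding, and shows substring extraction for both. With 4 bytes per character the extraction is simple arithmetic; with 16-bit units, a character above U+FFFF occupies two units (a surrogate pair), so the extraction has to walk the units.

```python
import struct


def encoded_sizes(ch):
    """Bytes needed for one character in UTF-8, UTF-16LE and UTF-32LE."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


def utf32le_substring(data, start, length):
    """Fixed width: character i always lives at byte offset 4 * i."""
    return data[4 * start:4 * (start + length)]


def utf16le_substring(data, start, length):
    """Variable width: group surrogate pairs so each character stays whole."""
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    chars = []
    i = 0
    while i < len(units):
        # A high surrogate (0xD800-0xDBFF) is followed by its low surrogate.
        is_pair = 0xD800 <= units[i] <= 0xDBFF and i + 1 < len(units)
        width = 2 if is_pair else 1
        chars.append(units[i:i + width])
        i += width
    picked = [u for ch in chars[start:start + length] for u in ch]
    return struct.pack("<%dH" % len(picked), *picked)


text = "a\U00020000b"  # U+20000 is a CJK Extension B character
print(encoded_sizes("a"), encoded_sizes("\u091c"), encoded_sizes("\U00020000"))
print(utf32le_substring(text.encode("utf-32-le"), 1, 1).decode("utf-32-le") == "\U00020000")
print(utf16le_substring(text.encode("utf-16-le"), 1, 2).decode("utf-16-le") == "\U00020000b")
```

This is in line with the point about Euphoria's 4-byte storage: one integer per character keeps indexing trivial, whereas the 16-bit form needs the extra pair handling for the extended ranges.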
The client has brought a new problem. Whether or not GTK has BiDi, it is doable in Euphoria using C, with C language programmers using their own code and/or wrapping an existing library. The world extends beyond the Suez Canal, and China is probably the richest country in the world.