1. UTF-8 encoding vs UTF-32

I have read that Euphoria's native encoding is UTF-32. I am not very familiar with the different encodings, so I have to ask: will a text encoded and saved with Euphoria's UTF-32 be readable in a UTF-8 environment? Also, is Euphoria able to read, process and save texts encoded with UTF-8?

Personally I don't have any particular requirements, apart from support for non-English languages. I am only concerned that using different encodings may cause compatibility problems.


2. Re: UTF-8 encoding vs UTF-32

Nevla said...

I have read that Euphoria's native encoding is UTF-32. I am not very familiar with the different encodings, so I have to ask: will a text encoded and saved with Euphoria's UTF-32 be readable in a UTF-8 environment? Also, is Euphoria able to read, process and save texts encoded with UTF-8?

Personally I don't have any particular requirements, apart from support for non-English languages. I am only concerned that using different encodings may cause compatibility problems.

At the present time Euphoria doesn't natively support Unicode in any form, so it doesn't have a native encoding. Euphoria applications can use Unicode, but you must do most of the work yourself. UTF-32 would be the easiest encoding to use internally. UTF-8 and UTF-32 are totally different and incompatible encodings. Euphoria can read and write any arbitrary data, including UTF-8, but doesn't understand Unicode at all.

A future version of Euphoria may provide Unicode support.


3. Re: UTF-8 encoding vs UTF-32

LarryMiller said...

Euphoria applications can use Unicode, but you must do most of the work yourself.

I already know that I wouldn't be able to do any advanced coding myself, let alone with encodings (I know almost nothing about them). I was attracted to Euphoria because it is a simple language and especially user friendly for hobby programmers like me. I certainly would not want to get involved with highly technical programming issues.

Is there any library that allows me to interpret and to convert the different encodings so I can use them in Euphoria, possibly without much hassle?

LarryMiller said...

UTF-32 would be the easiest encoding to use internally.

Sorry, my knowledge is quite limited, what do you mean by this? In what respect, exactly, would it be the easiest encoding?

LarryMiller said...

UTF-8 and UTF-32 are totally different and incompatible encodings.

What is the standard encoding that is used by Euphoria's GUIs? For instance, what encoding is used with EuGTK? Ultimately, my choice of an encoding will probably depend on this alone.


4. Re: UTF-8 encoding vs UTF-32

UTF-8 and UTF-32 are very different.
UTF-8 is currently the standard in Linux and on the internet. That should be an incentive for Euphoria to go the UTF-8 route.
A 4-byte store of integers is standard for Euphoria, so in theory implementing a 32-bit character as the standard for character sequences should not be difficult. However, I have tried to access the actual pointer to an integer sequence so that I could reach individual 4-byte characters.
A sequence of bytes treated as four-byte words, accessed with peek and poke, would work.
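
As a rough illustration of the peek/poke idea, here is a minimal sketch. It assumes Euphoria 4.x, where allocate() and free() come from std/machine.e and poke4()/peek4u() are built in. The sample data and variable names are made up for the example.

```euphoria
include std/machine.e  -- allocate() and free() (assumes Euphoria 4.x)

-- Hypothetical sample data: the code points of "我不知道".
sequence codepoints = {#6211, #4E0D, #77E5, #9053}

-- Reserve 4 bytes per character and store each code point as one 32-bit word.
atom buf = allocate(4 * length(codepoints))
poke4(buf, codepoints)

-- Read the third character straight from memory (byte offsets start at 0).
atom third = peek4u(buf + 4 * (3 - 1))

-- Read the whole buffer back as a sequence of 32-bit values.
sequence back = peek4u({buf, length(codepoints)})

free(buf)
printf(1, "third = #%x, round trip ok = %d\n", {third, equal(back, codepoints)})
```

Note that an ordinary Euphoria sequence already gives the same per-character access without any manual memory handling, so this is mainly useful when passing text to C routines.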
It would be easy for a C programmer to implement 4 distinct types of sequences, viz:
1. ASCII/ANSI, one byte per character. For God's sake, get away from this. More than 75% of the population of the world's languages cannot use this.
2. UTF-8. There are developments already existing within Euphoria in this area.
3. 16-bit characters as mentioned in the Unicode standard and used by Microsoft internally in Windows XP, 7 and 8 since about 2002. These are a very good step ahead and easy to implement in Euphoria.
4. 32-bit characters as mentioned in the extended Unicode standard, mainly to accommodate the full range of Chinese characters. This is currently 21 bits only and should be easy to incorporate into Euphoria as a 4th type of character sequence.


5. Re: UTF-8 encoding vs UTF-32

I know very little about utf-8 (or 16, etc), but the following works properly with EuGTK:

object str =  
    U"" & " Katakana BO\n" & 
    U"" & " Khmer NYO\n" & 
    U"" & " Laotian pho sung\n" & 
    U"" & " Greek Psi \n" & 
    U"" & " Geometric shape\n" &  
    U"" & " Face screaming\n" & 
    U"" & "?" 
     
include GtkEngine.e 
 
constant win = create(GtkWindow) 
connect(win,"destroy","Quit") 
 
constant lbl = create(GtkLabel) 
set(lbl,"markup",str) 
set(lbl,"font","24") 
add(win,lbl) 
 
show_all(win) 
main() 

ボ Katakana BO 
ញ Khmer NYO 
ຜ Laotian pho sung 
ψ Greek Psi  
◍ Geometric shape 
😱 Face screaming 

Given a text editor that can handle the desired character set, you can enter the text directly, without having to use the U"" formats. AFAIK, GTK can handle text inputs using whatever keyboard/language setup you may have. String functions: comparisons, sorting, etc. are most likely not going to work.


6. Re: UTF-8 encoding vs UTF-32

Irv, your answer seems to imply that as long as I use the correct code for a character, both Euphoria and EuGTK will be happy with it, and display it correctly. Am I getting this right?


7. Re: UTF-8 encoding vs UTF-32

Nevla said...

Irv, your answer seems to imply that as long as I use the correct code for a character, both Euphoria and EuGTK will be happy with it, and display it correctly. Am I getting this right?

It seems so. I just entered a new customer to my Euphoria/EuGTK accounts receivable program, and added a purchase for Mr. 我希望這將工作 ("I hope this will work" translated by Google).

Displays properly, and eu 'find' also seems to work, as it can select Mr. I Hope's transactions from all the others - including Mr.我不知道 ("I don't know")

Whether sorting works, I couldn't tell, as I don't know Chinese sorting rules. At least, it didn't break.

Note: this particular program uses Eu's serialize functions to save and load data files. write_lines() also works, I haven't tried others, but probably printf() wouldn't do.

MR. JENNINGS YOUNG 
MR. JEROLD YOUNG 
MR. BOB ZOBEL 
我不知道 
我希望這將工作 

Apparently, "I don't know" comes after Z :)

Edit: correction: printf() does work, displaying the names correctly on the terminal and in a customer printout.

Note: Punjabi also works, as does Hebrew - editing automatically goes into RTL mode when Hebrew text is pasted in.


8. Re: UTF-8 encoding vs UTF-32

Thank you again, Irv, for the time you took to try it out. I really appreciate it.

So, to sum up this thread, it would seem that as long as the foreign language text is displayed and handled properly by EuGtk, I should not even worry about which encoding Euphoria is actually using. I just go with the "Euphoria encoding", whatever that may be.

It seems that the UTF-8 vs UTF-16 vs UTF-32 question has become irrelevant...


9. Re: UTF-8 encoding vs UTF-32

Nevla said...

Thank you again, Irv, for the time you took to try it out. I really appreciate it.

So, to sum up this thread, it would seem that as long as the foreign language text is displayed and handled properly by EuGtk, I should not even worry about which encoding Euphoria is actually using. I just go with the "Euphoria encoding", whatever that may be.

It seems that the UTF-8 vs UTF-16 vs UTF-32 question has become irrelevant...

It would, perhaps, be nice if Eu had a UTF variable type, so that the 1 to 4 bytes which make up a UTF-8 character could be packed into one Eu atom. There'd be less wasted space that way, and things like sorting might be easier.

Since apparently a UTF-8 character can be sent as 1, 2, 3 or 4 bytes, figuring out how to sort those strings is going to be tricky. Having every character 32 bits long (even though Unicode seems only to need 21 bits) would probably make that task easier.

Also, length() doesn't give the results you'd expect when used with UTF strings. For example, "I don't know" 我不知道 is {230,136,145,228,184,141,231,159,165,233,129,147} or 12 'bytes'
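
To get a character count rather than a byte count, one possible approach is to skip UTF-8 continuation bytes. This is only a sketch: utf8_length() is a made-up helper, not part of the standard library, and it assumes the input is well-formed UTF-8.

```euphoria
-- utf8_length() is a hypothetical helper, not a standard library routine.
-- Continuation bytes have the bit pattern 10xxxxxx, so every byte that
-- does NOT match that pattern starts a new character.
function utf8_length(sequence bytes)
    integer count = 0
    for i = 1 to length(bytes) do
        if and_bits(bytes[i], #C0) != #80 then
            count += 1
        end if
    end for
    return count
end function

-- "我不知道" as UTF-8 bytes
sequence idk = {230,136,145,228,184,141,231,159,165,233,129,147}
printf(1, "bytes: %d, characters: %d\n", {length(idk), utf8_length(idk)})
-- prints: bytes: 12, characters: 4
```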


10. Re: UTF-8 encoding vs UTF-32

I should point out that there are better ways to specify UTF in your source code than the example I gave previously:

object soccer = "Fu\xc3\x9fb\xc3\xa4lle" -- utf-8 
-- or: "Fu\u00c3\u009fb\u00c3\u00a4lle"  -- unicode 
-- or: simply type it in or copy any special chars from your character map program, etc. 
 
include GtkEngine.e 
 
constant win = create(GtkWindow) 
    set(win,"border width",10) 
    set(win,"title"," UTF Test ") 
    set(win,"icon","fonts") 
    connect(win,"destroy","Quit") 
 
constant lbl = create(GtkLabel) 
    set(lbl,"text",soccer & " isn't Football!") 
    set(lbl,"font","16") 
    add(win,lbl) 
 
show_all(win) 
main() 

11. Re: UTF-8 encoding vs UTF-32

Nevla said...

I have read that Euphoria's native encoding is UTF-32.

  • Euphoria does not use Unicode at all.
  • If by the term "native encoding" you mean what does it allow source code to be encoded in, then the answer is no. Euphoria source code files must be in ASCII encoding.
  • Euphoria's sequences are a 'good fit' to be used for UTF-32 encoding because there would be a one-to-one correspondence between code points and atoms, and absolutely nothing in Euphoria would need to be changed with respect to using sequence to store UTF-32 strings in RAM.
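
The one-to-one correspondence can be seen in a small sketch like the following, which assumes the text is already held as a sequence of code points (the values are the code points of "我不知道"):

```euphoria
-- Each element is one Unicode code point, so ordinary sequence
-- operations work per character.
sequence s = {#6211, #4E0D, #77E5, #9053}

? length(s)                       -- 4: one element per character
? equal(s[2..3], {#4E0D, #77E5})  -- 1: a slice never splits a character
? find(#9053, s)                  -- 4: position of the last character
```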

Nevla said...

I am not very familiar with the different encodings, so I have to ask: will a text encoded and saved with Euphoria's UTF-32 be readable in a UTF-8 environment?

No, it would not. And that has nothing to do with Euphoria. It doesn't matter how a UTF-32 text file was created, because if a program reading it expected UTF-8 then it would fail. UTF-32 and UTF-8 are not interchangeable. In UTF-32, characters are all 4 bytes long, whereas in UTF-8, characters can be 1, 2, 3 or 4 bytes long - in the same string.
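
To make the variable length concrete, here is a minimal sketch of how a single code point maps to UTF-8 bytes. utf8_encode_char() is a made-up name, not a library routine, and it assumes a valid code point (0 to #10FFFF, no surrogates).

```euphoria
-- utf8_encode_char() is a hypothetical helper, not a standard library routine.
-- It assumes cp is a valid code point and does no validation.
function utf8_encode_char(atom cp)
    if cp < #80 then            -- up to 7 bits: 1 byte
        return {cp}
    elsif cp < #800 then        -- up to 11 bits: 2 bytes
        return {#C0 + floor(cp / #40),
                #80 + and_bits(cp, #3F)}
    elsif cp < #10000 then      -- up to 16 bits: 3 bytes
        return {#E0 + floor(cp / #1000),
                #80 + and_bits(floor(cp / #40), #3F),
                #80 + and_bits(cp, #3F)}
    else                        -- up to 21 bits: 4 bytes
        return {#F0 + floor(cp / #40000),
                #80 + and_bits(floor(cp / #1000), #3F),
                #80 + and_bits(floor(cp / #40), #3F),
                #80 + and_bits(cp, #3F)}
    end if
end function

? utf8_encode_char('A')      -- 1 byte:  {65}
? utf8_encode_char(#03C8)    -- 2 bytes: {207,136}          Greek psi
? utf8_encode_char(#30DC)    -- 3 bytes: {227,131,156}      Katakana BO
? utf8_encode_char(#1F631)   -- 4 bytes: {240,159,152,177}  face screaming
```

In UTF-32 each of those four characters would occupy exactly 4 bytes.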

Nevla said...

Also, is Euphoria able to read, process and save texts encoded with UTF-8?

Technically yes, except that the required functions to do this are not currently in the standard library. They have been written, just not added to the library yet.

Nevla said...

Personally I don't have any particular requirements, apart from support for non-English languages. I am only concerned that using different encodings may cause compatibility problems.

Yes, using different encodings can cause problems.

My 'rule-of-thumb' is to use UTF-32 internally, meaning that strings in RAM are stored as UTF-32 and all processing (comparing, sorting, searching, manipulating, etc...) of these strings is done using UTF-32, and use UTF-8 or UTF-16 externally, depending on what is going to read the text once it has left the program. UTF-8 for text files is good because it uses less disk space, and UTF-8 is suitable for User Interface text in Linux systems, and UTF-16 is suitable for UI text in Windows systems. For text being sent over the Internet, you probably should use UTF-8 but it depends on what the receiving system expects.


12. Re: UTF-8 encoding vs UTF-32

DerekParnell said...
  • If by the term "native encoding" you mean what does it allow source code to be encoded in, then the answer is no. Euphoria source code files must be in ASCII encoding.

Or UTF-8.....


13. Re: UTF-8 encoding vs UTF-32

DerekParnell said...

Euphoria's sequences are a 'good fit' to be used for UTF-32 encoding because there would be a one-to-one correspondence between code points and atoms, and absolutely nothing in Euphoria would need to be changed with respect to using sequence to store UTF-32 strings in RAM.

...

My 'rule-of-thumb' is to use UTF-32 internally, meaning that strings in RAM are stored as UTF-32 and all processing (comparing, sorting, searching, manipulating, etc...) of these strings is done using UTF-32

So, practically speaking, what library would I need to use in order to use UTF-32 internally? Or is it the case that I don't even need to use one while working internally? (because of the natural compatibility that you just mentioned above)


14. Re: UTF-8 encoding vs UTF-32

jimcbrown said...
DerekParnell said...
  • If by the term "native encoding" you mean what does it allow source code to be encoded in, then the answer is no. Euphoria source code files must be in ASCII encoding.

Or UTF-8.....

Ok ... I was confusing Unicode with encoding. Euphoria doesn't allow identifiers to be anything other than ASCII but string literals can be anything. And thus UTF-8 encoding is okay.


15. Re: UTF-8 encoding vs UTF-32

Nevla said...

So, practically speaking, what library would I need to use in order to use UTF-32 internally? Or is it the case that I don't even need to use one while working internally? (because of the natural compatibility that you just mentioned above)

Depends. If you're serious about doing Unicode right then you'd need a library to do just about anything with strings, including sorting, comparing, case conversions, searching, etc ... it's just that using UTF-32 internally makes some of this easier when working with sub-strings.

When converting between UTF encodings, for example getting a UTF-8 text file into a UTF-32 sequence or sending a UTF-32 sequence to a UI field for display, you'll need to use library routines.
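
For a feel of what such a conversion routine does, here is a minimal sketch of turning UTF-8 bytes into a sequence of code points. utf8_decode() is a made-up name, not an existing library routine, and it assumes well-formed input; a real library would also validate the bytes.

```euphoria
-- utf8_decode() is a hypothetical helper, not a standard library routine.
-- It assumes well-formed UTF-8 and performs no validation.
function utf8_decode(sequence bytes)
    sequence out = {}
    integer i = 1
    integer b, extra
    atom cp
    while i <= length(bytes) do
        b = bytes[i]
        if b < #80 then          -- 0xxxxxxx: single byte
            cp = b
            extra = 0
        elsif b < #E0 then       -- 110xxxxx: one continuation byte
            cp = and_bits(b, #1F)
            extra = 1
        elsif b < #F0 then       -- 1110xxxx: two continuation bytes
            cp = and_bits(b, #0F)
            extra = 2
        else                     -- 11110xxx: three continuation bytes
            cp = and_bits(b, #07)
            extra = 3
        end if
        -- Each continuation byte contributes its low 6 bits.
        for k = 1 to extra do
            cp = cp * #40 + and_bits(bytes[i + k], #3F)
        end for
        out = append(out, cp)
        i += extra + 1
    end while
    return out
end function

-- "我不知道" as UTF-8 bytes becomes four code points.
sequence text = utf8_decode({230,136,145,228,184,141,231,159,165,233,129,147})
? length(text)   -- 4
```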


16. Re: UTF-8 encoding vs UTF-32

Here is what got me confused in the first place:

http://progopedia.com/language/euphoria/

This progopedia page states that Euphoria's encoding is UTF-32. It should be corrected.


17. Re: UTF-8 encoding vs UTF-32

Nevla said...

Here is what got me confused in the first place:

http://progopedia.com/language/euphoria/

This progopedia page states that Euphoria's encoding is UTF-32. It should be corrected.

Well technically it is correct, but it sort of implies that Euphoria also understands Unicode, which it does not. A clarification might be better.

I'd forgotten about this site.


18. Re: UTF-8 encoding vs UTF-32

Euphoria's sequences give it the ability to store UTF-32 character strings and perform some basic functions with them. But real-world applications need more. There comes a time when an application must accept Unicode data from outside itself and be able to present it to the outside world. Euphoria has no native ability to do any of those things. Because of its storage inefficiency, UTF-32 is rarely used for files. At the present time UTF-8 is the most commonly used file encoding, so some conversions will be necessary.

There are Euphoria include files available (not part of the official distribution) that can do some of these things. To do serious Unicode programming you will need some understanding of the more common Unicode encodings. The system functions available in Linux and Windows will likely be necessary as well.
