Re: UTF-8 encoding vs UTF-32

Nevla said...

I have read that Euphoria's native encoding is UTF-32.

  • Euphoria does not use Unicode at all.
  • If by "native encoding" you mean the encoding that source code may be written in, then the answer is no. Euphoria source code files must be in ASCII encoding.
  • Euphoria's sequences are a 'good fit' for UTF-32 encoding because there would be a one-to-one correspondence between code points and atoms, and absolutely nothing in Euphoria would need to be changed in order to use sequences to store UTF-32 strings in RAM.
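
To make the one-to-one correspondence concrete, here is a small sketch. It is written in Python rather than Euphoria, purely as an illustration; the list of integers stands in for a Euphoria sequence of atoms, and the variable names are my own.

```python
# One integer (code point) per character, which is what UTF-32 amounts to in RAM.
text = "na\u00efve \u20ac"          # "naive" with i-diaeresis, a space, the euro sign

code_points = [ord(ch) for ch in text]   # stand-in for a sequence of atoms
print(code_points)

# Serialised as UTF-32 (little-endian, no byte order mark),
# every code point occupies exactly 4 bytes.
utf32 = text.encode("utf-32-le")
print(len(code_points), "code points,", len(utf32), "bytes")

# Going back from integers to text loses nothing.
restored = "".join(chr(cp) for cp in code_points)
print(restored == text)
```

Indexing, slicing and comparing such a list works on whole characters, which is why this representation is convenient for processing.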
Nevla said...

I am not very familiar with the different encodings, so I have to ask: will a text encoded and saved with Euphoria's UTF-32 be readable in a UTF-8 environment?

No, it would not, and that has nothing to do with Euphoria. It doesn't matter how a UTF-32 text file was created: if the program reading it expects UTF-8, it will fail. UTF-32 and UTF-8 are not interchangeable. In UTF-32, every character is exactly 4 bytes long, whereas in UTF-8 a character can be 1, 2, 3 or 4 bytes long, and these lengths can be mixed within the same string.
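
A quick way to see the difference in byte lengths is the sketch below. It uses Python rather than Euphoria, only because Python has both encoders built in; the sample characters are arbitrary choices of mine.

```python
# Four characters that need 1, 2, 3 and 4 bytes respectively in UTF-8.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
utf32_lengths = [len(ch.encode("utf-32-le")) for ch in samples]

for ch, n8, n32 in zip(samples, utf8_lengths, utf32_lengths):
    print(f"U+{ord(ch):04X}: UTF-8 uses {n8} byte(s), UTF-32 uses {n32} bytes")

# Feeding UTF-32 bytes to a UTF-8 decoder does not work.
data = "\u20ac".encode("utf-32-le")   # the euro sign as 4 UTF-32 bytes
try:
    data.decode("utf-8")
    misread_ok = True
except UnicodeDecodeError:
    misread_ok = False
print("UTF-32 bytes accepted as UTF-8:", misread_ok)
```

Note that some UTF-32 data does get past a UTF-8 decoder without an error (plain ASCII letters, for example), but the result is then padded with NUL characters, so it is still not usable text.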

Nevla said...

Also, is Euphoria able to read, process and save texts encoded with UTF-8?

Technically yes, except that the required functions to do this are not currently in the standard library. They have been written, just not added to the library yet.

Nevla said...

Personally I don't have any particular requirements, apart from the support of non-English languages. I am only concerned that using different encodings may cause compatibility problems.

Yes, using different encodings can cause problems.

My 'rule-of-thumb' is to use UTF-32 internally and UTF-8 or UTF-16 externally. Internally means that strings in RAM are stored as UTF-32 and all processing of these strings (comparing, sorting, searching, manipulating, etc.) is done using UTF-32. Externally, the choice depends on what is going to read the text once it has left the program. UTF-8 is good for text files because it usually uses less disk space, and it is also suitable for User Interface text on Linux systems, whereas UTF-16 is suitable for UI text on Windows systems. For text being sent over the Internet, you should probably use UTF-8, but it depends on what the receiving system expects.
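
As a rough sketch of that rule of thumb: decode once when text enters the program, work on code points, and encode once when it leaves. The example is in Python rather than Euphoria, and the helper names `decode_input` and `encode_output` are hypothetical, invented just for this illustration.

```python
def decode_input(raw, encoding="utf-8"):
    """Turn external bytes into a list of code points (the internal form)."""
    return [ord(ch) for ch in raw.decode(encoding)]

def encode_output(code_points, encoding="utf-8"):
    """Turn the internal list of code points back into external bytes."""
    return "".join(chr(cp) for cp in code_points).encode(encoding)

original = "Z\u00fcrich \u20ac5"
raw_in = original.encode("utf-8")        # pretend this was read from a file

points = decode_input(raw_in)            # one integer per character
reversed_points = points[::-1]           # processing happens on code points

out_utf8 = encode_output(reversed_points, "utf-8")    # e.g. for a text file
out_utf16 = encode_output(points, "utf-16-le")        # e.g. for a Windows API

print(len(raw_in), "bytes in,", len(points), "code points inside")
print(len(out_utf8), "bytes as UTF-8,", len(out_utf16), "bytes as UTF-16")
```

Reversing the raw UTF-8 bytes directly would have scrambled the multi-byte characters; reversing the code points does not, which is the point of doing the processing on the internal form.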
