Re: UTF-8 encoding vs UTF-32
- Posted by DerekParnell (admin) Aug 10, 2014
I have read that Euphoria's native encoding is UTF-32.
- Euphoria does not use Unicode at all.
- If by "native encoding" you mean the encoding that source code may be written in, then the answer is no. Euphoria source code files must be in ASCII encoding.
- Euphoria's sequences are a 'good fit' for UTF-32 encoding because there would be a one-to-one correspondence between code points and atoms, and absolutely nothing in Euphoria would need to be changed in order to use sequences to store UTF-32 strings in RAM.
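To make the one-to-one correspondence concrete, here is a minimal sketch. It is written in Python rather than Euphoria, purely for illustration: a Python list of integers stands in for a Euphoria sequence of atoms, and the helper names `to_code_points` and `from_code_points` are hypothetical.

```python
import struct

def to_code_points(text):
    """Return one integer per Unicode code point (like a sequence of atoms)."""
    return [ord(ch) for ch in text]

def from_code_points(points):
    """Rebuild the text from a list of code point integers."""
    return "".join(chr(p) for p in points)

sample = "Zo\u00eb \u20ac"          # 'Z', 'o', e-diaeresis, space, euro sign
points = to_code_points(sample)

# UTF-32 stores every code point as one 4-byte unit, so unpacking the
# encoded bytes as 32-bit integers yields the same list of numbers.
utf32 = sample.encode("utf-32-le")  # little-endian, no byte order mark
units = list(struct.unpack("<%dI" % (len(utf32) // 4), utf32))

print(points)
print(units == points)
```

Because each element is a whole code point, indexing, slicing and length all work per character, with no multi-byte bookkeeping.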
I am not very familiar with the different encodings, so I have to ask: will a text encoded and saved with Euphoria's UTF-32 be readable in a UTF-8 environment?
No, it would not. And that has nothing to do with Euphoria. It doesn't matter how a UTF-32 text file was created: if the program reading it expects UTF-8, it will fail. UTF-32 and UTF-8 are not interchangeable. In UTF-32, every character is exactly 4 bytes long, whereas in UTF-8 a character can be 1, 2, 3 or 4 bytes long, and different lengths can occur in the same string.
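The difference in byte lengths is easy to check. The following is a small sketch in Python (used here only for illustration, not Euphoria); the sample characters and the helper name `read_as_utf8` are my own choices.

```python
# One sample character for each possible UTF-8 length.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
utf32_lengths = {ch: len(ch.encode("utf-32-le")) for ch in samples}

for ch in samples:
    print("U+%04X: UTF-8 %d byte(s), UTF-32 %d bytes"
          % (ord(ch), utf8_lengths[ch], utf32_lengths[ch]))

def read_as_utf8(data):
    """Try to decode bytes as UTF-8; return None if they are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None

# UTF-32 bytes handed to a UTF-8 reader: non-ASCII text is rejected outright,
# and pure ASCII text comes back padded with NUL characters.
rejected = read_as_utf8("caf\u00e9".encode("utf-32-le"))
padded = read_as_utf8("abc".encode("utf-32-le"))
print(rejected, repr(padded))
```

Either way, the reader does not get back the text that was written.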
Also, is Euphoria able to read, process and save texts encoded with UTF-8?
Technically yes, except that the required functions to do this are not currently in the standard library. They have been written, just not added to the library yet.
Personally I don't have any particular requirements, apart from support for non-English languages. I am only concerned that using different encodings may cause compatibility problems.
Yes, using different encodings can cause problems.
My 'rule-of-thumb' is to use UTF-32 internally and UTF-8 or UTF-16 externally. Internally means that strings in RAM are stored as UTF-32 and all processing of these strings (comparing, sorting, searching, manipulating, etc.) is done using UTF-32. Externally, the choice depends on what is going to read the text once it has left the program. UTF-8 is good for text files because it usually uses less disk space, and it is suitable for user interface text on Linux systems; UTF-16 is suitable for UI text on Windows systems. For text being sent over the Internet, you should probably use UTF-8, but it depends on what the receiving system expects.
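As a rough sketch of that rule of thumb: decode at the point of input, work on code points, and encode again at the point of output. The example is in Python for illustration only (not Euphoria), and the function names `decode_input` and `encode_output` are hypothetical.

```python
def decode_input(raw, encoding="utf-8"):
    """Turn external bytes into the internal form: a list of code points."""
    return [ord(ch) for ch in raw.decode(encoding)]

def encode_output(points, encoding="utf-8"):
    """Turn the internal list of code points back into external bytes."""
    return "".join(chr(p) for p in points).encode(encoding)

# Text arriving from outside, e.g. read from a UTF-8 file.
raw_in = "na\u00efve caf\u00e9".encode("utf-8")

# Internal processing is done on whole code points.
points = decode_input(raw_in)
position = points.index(0xE9)   # search for U+00E9 by value

# Choose the external encoding to suit whatever reads the text next.
for_text_file = encode_output(points, "utf-8")
for_windows_ui = encode_output(points, "utf-16-le")

print(len(points), position, len(for_text_file), len(for_windows_ui))
```

The design point is that the encoding is decided only at the edges of the program; everything in between sees one code point per element.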