1. unicode and puts
- Posted by Frank Dowling <frank at frankied.com> Jun 06, 2007
I'm having a problem outputting Unicode strings. It's no drama doing this with files, but it gets hairy when serving a page from my server.

I ran into a situation where the page is in Russian, with Lithuanian and English as alternate language options for viewing the same page, so I decided to make the whole page straight Unicode. That's when I ran into the problem below.

I've never had this problem with UTF-8, probably because I only used it for the odd character, such as the Maori 'a' in New Zealand English, which has no all-zero bytes in it. (Maybe UTF-8 doesn't encode with leading or trailing zeros, I don't know, but it's irrelevant to Unicode.)

Where s is the Unicode string "Hello World", the following statement:
puts(1,s)
will output only "H" to the screen, because puts() aborts when the null character is reached. Putting two bytes (1 character) at a time will work, obviously for both Big and Little Endian, but there is a performance hit, and it adds another layer between the output of your page and Euphoria's puts() procedure.

The only way I can see to get around it is implementing the two-byte chunking technique, with an alias of puts (which I already use for my web CMS framework). Is there a *really quick and easy* way to get around this?

Also, as someone who uses Euphoria primarily for CGI, the last release was the most directly significant of all of them for me because of the "include" changes. I can now relax and spread my files out a bit :P
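For readers following along, here is a minimal sketch of why the output stops after "H". It is written in Python rather than Euphoria so it is easy to run, and it assumes that "straight Unicode" in the post means 16-bit code units (UTF-16). The helper `cut_at_nul` is hypothetical: it only imitates a writer that stops at the first zero byte and is not Euphoria's actual puts().

```python
# Assumption: "straight Unicode" in the post means UTF-16 (two bytes per character).
text = "Hello World"

utf16_le = text.encode("utf-16-le")  # little-endian, no byte order mark
utf16_be = text.encode("utf-16-be")  # big-endian, no byte order mark
utf8 = text.encode("utf-8")


def cut_at_nul(data: bytes) -> bytes:
    """Imitate a writer that stops at the first zero byte (hypothetical helper)."""
    return data.split(b"\x00", 1)[0]


print(list(utf16_le[:4]))    # [72, 0, 101, 0]: every second byte is zero
print(cut_at_nul(utf16_le))  # b'H'
print(cut_at_nul(utf16_be))  # b'': the zero byte comes first
print(0 in utf8)             # False

# The Maori a with a macron (U+0101) is two non-zero bytes in UTF-8.
print("\u0101".encode("utf-8"))  # b'\xc4\x81'
```

UTF-8 produces a zero byte only for the character U+0000 itself, which would explain why the problem never showed up with UTF-8 pages.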
2. Re: unicode and puts
- Posted by Craig Welch <euphoriah at cwelch.org> Jun 06, 2007
- Last edited Jun 07, 2007
FD(censored) wrote:
> The only way I can see to get around it is implementing the two bytes chunking
> technique, with an alias of puts (which I already use for my web CMS Framework.)
>
> Is there a *really quick and easy* way to get around this?

Ah, Euphoria and Unicode. Don't you just love it?

I've taken a different approach. I use HTML encoding for all of my Unicode. To demonstrate, go to http://www.wazu.jp/hosting/pricing.exu

Click on 'price this plan', and if you haven't changed any of the default numbers, it will verify it and add the ordering section to the page. In any of the 'name' fields, enter your non-Latin characters. Russian, Japanese, whatever. Do *not* click that you've agreed to the terms and conditions. That way, the page will fail with an error message, but your entered characters will be re-displayed.

That's the key ... how they were re-displayed.

1) The page is 'charset=utf-8'. That means that no matter how you enter the characters, the browser will POST them back to the CGI program as UTF-8. Let's say that {229,147,169} is input. That's how it's stored in the database.
2) The UTF-8 character (1 to 4 bytes) is converted to a hex number (its Unicode number). This example would become 54E9.
3) The hex is turned into decimal. The above example would be 21737.
4) The number is turned into its HTML representation, &#21737; (which the browser displays as 哩).
5) Each character in the page buffer is replaced as above, and the buffer written out.
6) <SHAMELESS PLUG> Then go to http://www.wazu.jp/ to get some Unicode fonts to test with </SHAMELESS PLUG>

HTH.

-- Craig

PS You should turn off directory listing.
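The conversion in steps 2 to 4 of the reply above can be sketched as follows. This is an illustration in Python rather than Euphoria, and it is not the code behind the linked page. The names `decode_utf8` and `to_html` are made up for the example, and the decoder assumes well-formed UTF-8 input and does no validation.

```python
def decode_utf8(raw):
    """Yield Unicode code points from well-formed UTF-8 bytes (no validation)."""
    i = 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:      # 0xxxxxxx: one byte, plain ASCII
            cp, extra = b, 0
        elif b < 0xE0:    # 110xxxxx: lead byte of a two-byte character
            cp, extra = b & 0x1F, 1
        elif b < 0xF0:    # 1110xxxx: lead byte of a three-byte character
            cp, extra = b & 0x0F, 2
        else:             # 11110xxx: lead byte of a four-byte character
            cp, extra = b & 0x07, 3
        for cont in raw[i + 1 : i + 1 + extra]:
            cp = (cp << 6) | (cont & 0x3F)  # each continuation byte adds 6 bits
        yield cp
        i += 1 + extra


def to_html(raw):
    """Keep ASCII as is; write every other character as a decimal &#N; reference."""
    return "".join(chr(cp) if cp < 0x80 else f"&#{cp};" for cp in decode_utf8(raw))


sample = bytes([229, 147, 169])                # the bytes from step 1
print(format(next(decode_utf8(sample)), "X"))  # 54E9
print(to_html(b"Hi " + sample))                # Hi &#21737;
```

Because the output is pure ASCII, it contains no zero bytes and can be written with a single puts()-style call, which sidesteps the null-character problem from the original question.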