1. unicode and puts

I'm having a problem outputting Unicode strings. It's no drama doing this
with files, but it gets hairy for me when serving a page from my server.

I ran into a situation where the page was in Russian, with Lithuanian and
English as alternate language options for viewing the same page, so I decided
to make the whole page straight Unicode (two bytes per character), and that is
where the trouble started.
I've never had this problem with UTF-8, probably because I only used it for the
odd character such as the Maori 'a' in New Zealand English, which has no all-zero
bytes in it (maybe UTF-8 doesn't encode with leading or trailing zeros, I don't
know, but it's irrelevant to straight Unicode).


Where s is the Unicode string "Hello World", the following statement:
puts(1,s)


will output "H" to the screen, because puts() aborts when the null
character is reached.
Putting two bytes (one character) at a time will work, obviously for both big-
and little-endian, but there is a performance hit, and it adds another layer
between the output of your page and Euphoria's puts() procedure.
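
To make the zero bytes concrete, here is a minimal sketch that builds the two-byte little-endian form of a plain string. to_utf16le is a made-up helper name, not a library routine, and it assumes every character is below #10000 (no surrogate pairs).

```euphoria
-- Sketch only: to_utf16le is a hypothetical helper, not part of Euphoria.
-- Assumes each element of text is a code point below #10000.
function to_utf16le(sequence text)
    sequence bytes
    integer c
    bytes = {}
    for i = 1 to length(text) do
        c = text[i]
        -- low byte first, then high byte (little-endian)
        bytes = bytes & and_bits(c, #FF) & floor(c / 256)
    end for
    return bytes
end function

? to_utf16le("Hello")   -- prints {72,0,101,0,108,0,108,0,111,0}
```

For ASCII text the high byte of every character is 0, so the second byte of the string is already a null, which matches the "H" and nothing else described above.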

The only way I can see to get around it is implementing the two bytes chunking
technique, with an alias of puts (which I already use for my web CMS Framework.)
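
A rough sketch of what such a chunking wrapper might look like. The name uputs is hypothetical, and the sketch assumes s holds two-byte characters so its length is even. Whether each two-byte call really gets a zero byte through depends on how puts() treats it, which is taken on trust from the description above rather than verified here.

```euphoria
-- Sketch only: uputs is a hypothetical name for the puts() alias.
-- Assumes s is a sequence of bytes, two per character (even length).
-- A trailing odd byte, if any, is silently ignored.
procedure uputs(integer fn, sequence s)
    for i = 1 to length(s) - 1 by 2 do
        -- hand puts() one character (two bytes) at a time
        puts(fn, s[i..i+1])
    end for
end procedure

uputs(1, {72,0,105,0})   -- "Hi" as two-byte little-endian characters
```

The cost is one puts() call per character instead of one per string, which is the performance hit mentioned above.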

Is there a *really quick and easy* way to get around this?

Also, as someone who uses Euphoria primarily for CGI, the last release for me
was the most directly significant of all of them because of the "include"
changes. I can now relax and spread my files out a bit :P


2. Re: unicode and puts

FD(censored) wrote:
> The only way I can see to get around it is implementing the two bytes chunking
> technique, with an alias of puts (which I already use for my web CMS Framework.)
>
> Is there a *really quick and easy* way to get around this?

Ah, Euphoria and Unicode. Don't you just love it?

I've taken a different approach. I use HTML encoding for all of my Unicode.

To demonstrate, go to http://www.wazu.jp/hosting/pricing.exu

Click on 'price this plan', and if you haven't changed any of the 
default numbers, it will verify it and add the ordering section to the page.

In any of the 'name' fields, enter your non-Latin characters. Russian,
Japanese, whatever. Do *not* click that you've agreed to the terms and
conditions. That way, the page will fail with an error message, but your
entered characters will be re-displayed. That's the key ... how they
were re-displayed.

1) The page is 'charset=utf-8'. That means that no matter how you enter
the characters, the browser will POST them back to the CGI program as UTF-8.
    Let's say that  {229,147,169} is input. That's how it's stored in
the database.

2) The UTF-8 (1 - 4 bytes) character is converted to a hex number (its
Unicode number).
     This example would become 54E9.

3) The hex is turned into decimal. The above example would be 21737.

4) The number is turned into its HTML representation, &#21737; (which the
browser displays as 哩).

5) Each character in the page buffer is replaced as above, and the
buffer written out.

6) <SHAMELESS PLUG> Then go to http://www.wazu.jp/ to get some Unicode
fonts to test with </SHAMELESS PLUG>
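
Steps 1 to 5 above can be sketched roughly as follows. This is not the code behind the page, just an illustration: utf8_to_entities is a made-up name, the input is assumed to be well-formed UTF-8 (no validation is done), and the hex and decimal steps collapse into one number because they are only two ways of writing the same code point.

```euphoria
-- Sketch only: utf8_to_entities is a hypothetical name.
-- Assumes s is well-formed UTF-8; malformed input is not checked.
function utf8_to_entities(sequence s)
    sequence out
    integer i, b, extra
    atom cp
    out = {}
    i = 1
    while i <= length(s) do
        b = s[i]
        if b < #80 then
            -- single byte: plain ASCII, copied through unchanged
            out = out & b
            i = i + 1
        else
            if b >= #F0 then        -- lead byte of a 4-byte character
                cp = and_bits(b, #07)
                extra = 3
            elsif b >= #E0 then     -- lead byte of a 3-byte character
                cp = and_bits(b, #0F)
                extra = 2
            else                    -- lead byte of a 2-byte character
                cp = and_bits(b, #1F)
                extra = 1
            end if
            -- each continuation byte contributes its low 6 bits
            for k = 1 to extra do
                cp = cp * 64 + and_bits(s[i + k], #3F)
            end for
            -- decimal numeric character reference
            out = out & sprintf("&#%d;", cp)
            i = i + extra + 1
        end if
    end while
    return out
end function

puts(1, utf8_to_entities({229,147,169}))   -- prints &#21737;
```

For the example bytes: 229 gives 5, then 5*64 + 19 = 339, then 339*64 + 41 = 21737, which is #54E9. The output is pure ASCII, so puts() never sees a zero byte.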

HTH.

-- 
Craig

PS You should turn off directory listing.
