Re: Profiling Under Windows & String Size


The Unicode UTF-8 and UTF-16 encodings use variable-length characters.
In UTF-8, the Roman letters and the rest of the ASCII characters take a
single byte each, but characters from other languages can take 2 to 6 bytes
per character. This comfortably takes care of the various Chinese,
Japanese, Korean, etc. character sets, and for English text it is space
efficient. UTF-8 is supported by Windows, IE5, XML and most database
systems (Oracle, Progress, Informix, etc.).
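
As a rough illustration of the variable-length point, here is a small Python
sketch (Python rather than Euphoria, purely for convenience) that counts the
bytes each encoding needs for a few sample characters. The helper name
encoded_sizes and the choice of sample characters are assumptions made for
this example only.

```python
# Sample characters, written as escapes so the source stays plain ASCII:
# U+0041 (Latin A), U+00E9 (e with acute), U+4E2D (a CJK ideograph),
# U+1F600 (a character outside the 16-bit range).
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    # "utf-16-le" is used so that no byte-order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} byte(s)")
```

Note that Python follows the current definition of UTF-8 (RFC 3629), which
stops at 4 bytes per character; the 5- and 6-byte forms belong to the earlier
definition and will not show up here.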

To quote from the Unicode FAQ:
"Q: What is UTF-16?
Unicode was originally designed as a pure 16-bit encoding, aimed at
representing all modern scripts. (Ancient scripts were to be represented
with private-use characters.) Over time, and especially after the addition
of over 14,500 composite characters for compatibility with legacy sets, it
became clear that 16-bits were not sufficient for the user community. Out
of this arose UTF-16.

UTF-16 allows access to 63K characters as single Unicode 16-bit units. It
can access an additional 1M characters by a mechanism known as surrogate
pairs.
"

see http://www.unicode.org for more details.
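
To make the surrogate pair mechanism from the FAQ a bit more concrete, here is
a minimal Python sketch of the arithmetic as I understand it. The function
name to_surrogate_pair is made up for this example, and the code point
U+1F600 is just an arbitrary value above the 16-bit range.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low

# 65,536 sixteen-bit values minus the 2,048 reserved as surrogates
single_unit_values = 0x10000 - 2048
# 1,024 high surrogates combined with 1,024 low surrogates
extra_code_points = 1024 * 1024

high, low = to_surrogate_pair(0x1F600)
print(f"single units: {single_unit_values}, extra via pairs: {extra_code_points}")
print(f"U+1F600 -> {high:04X} {low:04X}")
```

The two counts line up with the FAQ's "63K" and "1M" figures: 63,488 values
are usable as single 16-bit units, and the pairs add 1,048,576 more.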



cheers,
Derek Parnell
dparnell at bigpond.net.au
Melbourne, Australia
-----Original Message-----
From: Roderick Jackson <rjackson at CSIWEB.COM>
To: EUPHORIA at LISTSERV.MUOHIO.EDU <EUPHORIA at LISTSERV.MUOHIO.EDU>
Date: Thursday, August 5 1999 23:41
Subject: Re: Profiling Under Windows & String Size


Kat wrote:
<snip>
>> Assume that this new byte scheme BLOATS Euphoria into the
>> 800 KB realm.  So now as Rob says,  You are WASTING 80 KB more
>> than you were before.  Assume you have written a program that
>> under the old scheme abused 16,000 KB of memory.  But, using
>> the new byte scheme you could store the same amount of data
>> using only 4,000 KB.  This frees up 12,000 KB of memory.
>> When you are freeing up 12,000 KB who cares about the 80 KB
>> wasted?
>
>While you try to lock Euphoria into 8 bits per character, i'd like to
>refer everyone else to http://www.unicode.org/ , where the standard is
>16 bits, for the reasons i gave earlier.

Hmmm... while the advantages of Rob's current scheme are very valid,
I think I'd like to state here that I'm not sure he would be better off
putting stock in the Unicode "standard".

According to other postings on this list, Unicode is tragically
insufficient for its most notable goal: handling global character sets.
Norm's Chinese characters alone (for his Eu project) number--what, around
45,000? And then Japanese takes the same number... already the 65,536
characters of Unicode are blown away, by only two languages. Apparently,
Unicode simply CANNOT do what it is trying to, at least not without severe
compromise. That being the case, why try to make use of it? If anything, a
3-byte scheme (or 4-byte) would make more sense (Super Unicode?), but then
those of us with languages that do just fine under ASCII might start to
balk. ;-)


Rod
