Re: Profiling Under Windows & String Size
- Posted by Derek Parnell <dparnell at BIGPOND.NET.AU> Aug 06, 1999
- 474 views
The Unicode UTF-8 and UTF-16 encodings use variable-length characters. In UTF-8, all the Roman characters and the other ASCII characters take a single byte, while characters from other language character sets take 2 to 6 bytes each. This comfortably takes care of the various Chinese, Japanese, Korean, etc. character sets, and for English text it is space efficient. UTF-8 is supported by Windows, IE5, XML, and most database systems (Oracle, Progress, Informix, etc.).

To quote from the Unicode FAQ:

"Q: What is UTF-16?
Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16. UTF-16 allows access to 63K characters as single Unicode 16-bit units. It can access an additional 1M characters by a mechanism known as surrogate pairs."

See http://www.unicode.org for more details.

cheers,
Derek Parnell
dparnell at bigpond.net.au
Melbourne, Australia

-----Original Message-----
From: Roderick Jackson <rjackson at CSIWEB.COM>
To: EUPHORIA at LISTSERV.MUOHIO.EDU <EUPHORIA at LISTSERV.MUOHIO.EDU>
Date: Thursday, August 5 1999 23:41
Subject: Re: Profiling Under Windows & String Size

Kat wrote:

<snip>
>> Assume that this new byte scheme BLOATS Euphoria into the
>> 800 KB realm. So now as Rob says, You are WASTING 80 KB more
>> than you were before. Assume you have written a program that
>> under the old scheme abused 16,000 KB of memory. But, using
>> the new byte scheme you could store the same amount of data
>> using only 4,000 KB. This frees up 12,000 KB of memory.
>> When you are freeing up 12,000 KB who cares about the 80 KB
>> wasted?
>
>While you try to lock Euphoria into 8 bits per character, i'd like to refer
>everyone else to http://www.unicode.org/ , where the standard is 16 bits,
>for the reasons i gave earlier.

Hmmm... while the advantages of Rob's current scheme are very valid, I think I'd like to state here that I'm not sure he would be better off putting stock in the Unicode "standard".

According to other postings on this list, Unicode is tragically insufficient for its most notable goal: handling global character sets. Norm's Chinese characters alone (for his Eu project) number, what, around 45,000? And then Japanese takes the same number... already the 65,536 characters of Unicode are blown away, by only two languages. Apparently, Unicode simply CANNOT do what it is trying to, at least not without severe compromise. That being the case, why try to make use of it?

If anything, a 3-byte scheme (or 4-byte) would make more sense (Super Unicode?), but then those of us with languages that do just fine under ASCII might start to balk.

Rod
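
For anyone who wants to check the character sizes discussed in this thread, here is a small Python sketch. It is only an illustration, not anything from Euphoria or from the Unicode FAQ: the sample characters and the helper names `encoded_sizes` and `surrogate_pair` are made up for this example. It reports how many bytes a character takes in UTF-8, how many 16-bit units it takes in UTF-16, and how a character beyond the 16-bit range splits into a surrogate pair. Note that the UTF-8 definition current in 1999 allowed sequences of up to 6 bytes, while the later RFC 3629 restricts UTF-8 to at most 4 bytes, which is what Python implements.

```python
# Illustrative sketch: the sample characters and helper names are assumptions
# chosen for this example, not part of any library.

# Each surrogate half carries 10 bits, so pairs address 1024 * 1024 characters
# (the "additional 1M characters" mentioned in the FAQ quote).
SURROGATE_CAPACITY = 0x400 * 0x400


def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 16-bit unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    # "utf-16-le" omits the byte-order mark, so every 2 bytes is one unit.
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


if __name__ == "__main__":
    # ASCII letter, accented Latin letter, a Chinese character, and a
    # character outside the 16-bit range.
    for ch in ["A", "\u00e9", "\u4e2d", "\U00010400"]:
        print(f"U+{ord(ch):04X}", encoded_sizes(ch))
    print([hex(unit) for unit in surrogate_pair(0x10400)])
```

The ASCII letter costs one byte in UTF-8, the Chinese character costs three, and only the character outside the 16-bit range needs two UTF-16 units.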