Re: Status of Edix

I was not clear in my first reply.

Unicode is a character set, and UTF-8 is a method of representing that character set in 8-bit bytes. The representation of a single character can be 1 to 4 bytes long, because there are currently 136,755 characters covering 139 modern and historic scripts.
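To make the variable length concrete, here is a small sketch in Python (used purely for illustration, not Euphoria). The helper name `utf8_length` and the sample code points are hypothetical choices; it relies only on Python's built-in UTF-8 encoder.

```python
# Sample code points of increasing size (chosen only for illustration).
SAMPLES = [
    0x0041,   # LATIN CAPITAL LETTER A
    0x00E9,   # LATIN SMALL LETTER E WITH ACUTE
    0x0905,   # DEVANAGARI LETTER A
    0x1F600,  # a character outside the Basic Multilingual Plane
]


def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses to encode one code point."""
    return len(chr(code_point).encode("utf-8"))


for cp in SAMPLES:
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```

The four samples need 1, 2, 3 and 4 bytes respectively, which is why a byte offset into UTF-8 text says little about the character position.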

The original Unicode characters could each be represented by precisely 2 bytes. Writing software to reach the nth character in a sequence was therefore very simple, IF you represented them in 2-byte fields, which is what Microsoft does internally and in their text files. However, when UTF-8 is used to represent them, the nth character position cannot be calculated or even guesstimated; you have to crawl along the bytes to find the nth character. The 2-byte representation was good for the original extent (64K characters) but not enough for the extended characters. UTF-16 covers those with pairs of 2-byte units (surrogate pairs), so it too is no longer strictly fixed-width.
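The "crawl along" step can be sketched like this, again in Python for illustration only. The function name `utf8_offset_of_nth` is hypothetical, and the sketch assumes the input is valid UTF-8. It uses the fact that continuation bytes always have the bit pattern 10xxxxxx, so any other byte starts a new character.

```python
def utf8_offset_of_nth(data: bytes, n: int) -> int:
    """Return the byte offset where the n-th character (0-based) starts.

    Assumes `data` is valid UTF-8. The whole prefix must be scanned,
    because each character may occupy a different number of bytes.
    """
    if n < 0:
        raise IndexError("character index must not be negative")
    seen = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:      # not a continuation byte: new character
            if seen == n:
                return offset
            seen += 1
    raise IndexError("character index out of range")


# Five characters occupying 1, 3, 1, 4 and 1 bytes.
encoded = "a\u0905b\U0001F600c".encode("utf-8")
print(utf8_offset_of_nth(encoded, 3))
```

The cost grows with n, since every byte before the target has to be inspected.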

UTF-32 is a 32-bit representation of each individual character, so you can arrive at the nth character in a sequence by calculating its exact position in bytes. In a search algorithm this fast access can give speeds that easily outweigh the bulkiness of 32 bits per character.
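For comparison, a sketch of the direct calculation, in Python for illustration. The function name `utf32_nth` is hypothetical, and the sketch assumes little-endian UTF-32 data without a byte-order mark.

```python
def utf32_nth(data: bytes, n: int) -> int:
    """Return the code point of the n-th character (0-based) in UTF-32LE data."""
    if n < 0:
        raise IndexError("character index must not be negative")
    start = 4 * n                      # every character is exactly 4 bytes
    chunk = data[start:start + 4]
    if len(chunk) != 4:
        raise IndexError("character index out of range")
    return int.from_bytes(chunk, "little")


text = "a\u0905b\U0001F600c"
data = text.encode("utf-32-le")        # the "-le" codec writes no byte-order mark
print(len(data), hex(utf32_nth(data, 3)))
```

No scanning is involved: the lookup is a single multiplication and a 4-byte read, at the price of 20 bytes for these five characters.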

Euphoria has a 4-byte integer as standard. So let us say you create a character set in the Private Use Area, hex E000–F8FF. It is easy to create a Euphoria type that allows only integers in that range and call it “MyPrivateChar”. That is exactly what I am doing in my work with characters discussed in another thread here. In fact, I have created separate types:

“MyPrivateChar”: E000–F8FF

“MyPrivateHindi”: E000–E1FF

“MyPrivateGuj”: E200–E3FF

“MyPrivateBengali”: E400–E5FF

“MyPrivateKannada”: E600–E7FF

And so on for all the Indic languages.
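The idea behind these range-restricted types can be sketched as simple predicates. This is Python, not Euphoria, and it is not the actual code from my work; the helper `make_range_type` and the lower-case names are hypothetical stand-ins for the types listed above.

```python
def make_range_type(low: int, high: int):
    """Build a predicate accepting only integers in the range low..high.

    Similar in spirit to a user-defined type that validates its values.
    """
    def check(value) -> bool:
        return isinstance(value, int) and low <= value <= high
    return check


# Hypothetical stand-ins for the types listed above.
my_private_char = make_range_type(0xE000, 0xF8FF)
my_private_hindi = make_range_type(0xE000, 0xE1FF)
my_private_guj = make_range_type(0xE200, 0xE3FF)
my_private_bengali = make_range_type(0xE400, 0xE5FF)
my_private_kannada = make_range_type(0xE600, 0xE7FF)

print(my_private_hindi(0xE0FF), my_private_hindi(0xE200))
```

Each script gets a block of 512 code points, and every script-specific value also passes the wider `my_private_char` check.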

Actually, three private use areas are currently defined: one in the Basic Multilingual Plane (U+E000–U+F8FF), and one each in, and nearly covering, planes 15 and 16 (U+F0000–U+FFFFD, U+100000–U+10FFFD). All these areas are accessible with the 32-bit integer values of Euphoria, hence my use of wxEuphoria for my work.
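A quick sketch of a check covering all three areas, in Python for illustration. The names `PRIVATE_USE_AREAS` and `is_private_use` are hypothetical; the ranges are the ones quoted above.

```python
# (low, high) bounds of the three private use areas quoted above.
PRIVATE_USE_AREAS = (
    (0xE000, 0xF8FF),        # Basic Multilingual Plane
    (0xF0000, 0xFFFFD),      # plane 15
    (0x100000, 0x10FFFD),    # plane 16
)


def is_private_use(code_point: int) -> bool:
    """Return True if the code point lies in any private use area."""
    return any(low <= code_point <= high for low, high in PRIVATE_USE_AREAS)


for low, high in PRIVATE_USE_AREAS:
    size = high - low + 1
    print(f"U+{low:04X}..U+{high:04X}: {size} code points")

# The largest private-use value fits comfortably in a 32-bit integer.
print(0x10FFFD < 2**31)
```

The plane 15 and 16 areas stop at ...FFFD because the last two code points of every plane are reserved as noncharacters.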
