OpenEuphoria: Forum: (rant) unicode-text

Posted by _tom (admin) Oct 12, 2017

explorations; definitions; observations

Text

character "letter, digit, ... "
string "sequence of characters"

Plain-text:

character --> atom --> single quote ' delimiter: 'a'
string --> sequence --> double quote " delimiter: "hello"
output: ? --> numbers; puts --> text

For utf8 unicode-text:

utf8 character --> sequence --> double quote " delimiter: "ß"
string --> sequence --> double quote " delimiter: "hello"
output: ? --> numbers; puts --> text

For utf32 unicode-text:

utf32 character --> atom
string --> sequence
output: ? --> numbers; puts --> text

For Phix utf8 unicode-text:

utf8 character --> string --> "a" or "ß"
utf8 string --> string --> "hello"
output: ? s --> text; puts --> text; print --> numbers
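
The utf8 and utf32 rows above can be cross-checked outside Euphoria. A minimal sketch in Python (used only for illustration; `views` is a hypothetical helper name) that shows one character as a single code-point and as its UTF-8 byte sequence:

```python
def views(ch):
    """Return (code_point, utf8_bytes) for a single character."""
    return ord(ch), list(ch.encode("utf-8"))

for ch in ("a", "\u00df"):          # "\u00df" is ß
    print(ch, views(ch))
# a (97, [97])
# ß (223, [195, 159])
```

The plain-text character is one number either way; ß is one number as a code-point but two numbers as UTF-8, which is why a utf8 character needs a sequence.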

Unicode

Unicode "assigns a unique number (code-point) to each character". Properly, Unicode is the standard maintained by the Unicode Consortium: www.unicode.org. In common usage, "unicode" means unicode-text.

code-point "unique number for each character, usually written in hexadecimal"
plain-text "code-points #0 to #7F" (up to 127 decimal)
unicode-text "code-points #0 to #10FFFF" (up to 1,114,111 decimal)
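
A small Python sketch of these two ranges (illustration only; the constant and function names are hypothetical):

```python
MAX_PLAIN = 0x7F        # last plain-text code-point
MAX_UNICODE = 0x10FFFF  # last unicode code-point

def is_plain_text(text):
    """True when every character fits in the plain-text range."""
    return all(ord(ch) <= MAX_PLAIN for ch in text)

print(MAX_PLAIN, MAX_UNICODE)        # 127 1114111
print(is_plain_text("hello"))        # True
print(is_plain_text("Stra\u00dfe"))  # False, ß is code-point #DF
```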

Example:

character 'a' code-point #61 (hex), U+0061 (outside of OE), "\x61" (escaped notation), 97 (decimal)

Example:

? 'a' 
    --> 97 
? "\x61" 
    --> {97} 
? "\u0061" 
    --> {97} 
? "\U00000061" 
    --> {97} 
 
include std/convert.e 
 
? int_to_bits( 97 ) 
--> {1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0} 
	-- note: int_to_bits() lists the least significant bit first 
 
? int_to_bits( 97, 8 ) 
--> {1,0,0,0,0,1,1,0} 
 
? "\b10000110" 
--> {134} 
	-- 'fail': bits typed in int_to_bits() order 
 
? "\b01100001"  
--> {97} 
	-- 'success': bits typed most significant first
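
The bit-order mismatch in the example above can be reproduced in a few lines of Python. This is only a sketch: `bits_lsb_first` is a hypothetical stand-in that lists the least significant bit first, as the int_to_bits output shown above does.

```python
def bits_lsb_first(value, nbits=8):
    """Return the low nbits of value as a list, least significant bit first."""
    return [(value >> i) & 1 for i in range(nbits)]

bits = bits_lsb_first(97)
as_listed = "".join(str(b) for b in bits)   # digits in list order
reversed_digits = as_listed[::-1]           # most significant bit first

print(bits)                     # [1, 0, 0, 0, 0, 1, 1, 0]
print(int(as_listed, 2))        # 134
print(int(reversed_digits, 2))  # 97
```

A binary literal is read most significant bit first, so the list has to be reversed before it is typed in as a literal.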

Endian

Little Endian "least significant digits written first."
Big Endian "most significant digits written first."

The natural way to write decimal numbers is little endian.

In human languages that write right to left, decimal numbers are written in little endian format.
In languages that write left to right, decimal numbers are written in big endian format.

Writing numbers backwards has resulted in Western children not understanding arithmetic and mathematics. These children become adults that do not understand arithmetic and mathematics.

Trivia

The term endian comes from Jonathan Swift's Gulliver's Travels, where two factions fight over which end of a boiled egg to break.

UTF

UTF "Unicode Transformation Format"

UTF32 "is the code-point; the definition of a character"
UTF16 "Microsoft format"
UTF8 "standard format for text"

Example:

include std/convert.e 
	 
-- the character 'a' 
? int_to_bits( #61 )   
--> {1,0,0,0,0,1,1,0, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0} 
--   |                |                |                | 4-bytes                      
 
-- note: least significant bit first

For plain text, three of the four bytes are unused bits.
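
The same padding shows up at the byte level. A Python sketch (illustration only; `utf32_bytes` is a hypothetical helper, and the "utf-32-le" codec is chosen so the least significant byte comes first and no byte-order mark is added):

```python
def utf32_bytes(ch):
    """UTF-32 bytes of one character, least significant byte first, no BOM."""
    return list(ch.encode("utf-32-le"))

print(utf32_bytes("a"))           # [97, 0, 0, 0]
print(utf32_bytes("\u00df"))      # [223, 0, 0, 0]   (ß)
print(utf32_bytes("\U0001F600"))  # [0, 246, 1, 0]   (code-point #1F600)
```

Even a code-point well above the plain-text range leaves the top byte at zero, because #10FFFF needs only 21 bits.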

UTF32

The definition of text.

character --> atom or integer
string --> sequence
for plain-text UTF32 is the same as UTF8

Literal text

unlikely that text will be in UTF32 format
Geany runs as utf8 but can load|save UTF32 text
(unix) terminal, puts, printf are UTF8

Simple processing

direct indexing and traversal
length is number of characters
find to search for character in a string
mutable
insert code-point into string using escaped notation
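
A Python sketch of the same operations on a list of code-points, which is the closest Python analogue to a UTF32 sequence. It is an illustration under that assumption, not Euphoria code; note that Python indexes from 0 where Euphoria indexes from 1.

```python
text = [ord(ch) for ch in "Stra\u00dfe"]   # Straße, one integer per character

print(len(text))         # 6, the number of characters
print(text[4])           # 223, direct indexing reaches ß
print(text.index(0xDF))  # 4, searching for a single code-point

# Mutable: replace the one ß with two code-points for "ss".
text[4:5] = [ord("s"), ord("s")]
print("".join(chr(cp) for cp in text))   # Strasse
```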

UTF16

Combines the disadvantages of UTF32 and UTF8
Microsoft products
good luck
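
One reading of "combines the disadvantages": UTF16 spends two bytes even on plain text, yet a character can still need more than one unit. A Python sketch (illustration only; `utf16_units` is a hypothetical helper) that lists the 16-bit units of a string:

```python
import struct

def utf16_units(text):
    """Return the 16-bit code units of text (little-endian, no BOM)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

print(utf16_units("a"))                                # [97]
print([hex(u) for u in utf16_units("\U0001F600")])     # ['0xd83d', '0xde00']
```

The second character has a code-point above #FFFF, so it is stored as a pair of units, and length no longer equals the number of characters.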

UTF8

Motivation: saving bits is more important than convenience in processing

plain-text character --> atom or integer
unicode-text character --> sequence
string --> sequence
for plain-text UTF8 is the same as UTF32
see also Phix string data-type which corresponds directly to UTF8

Literal text

most text is in UTF8 format
default for Geany
(unix) terminal; puts ; printf are UTF8
most applications: internet, webbrowser, editors, LibreOffice

More difficult processing

match to search for character
no direct indexing or traversal
length does not return number of characters
mutable
consider transformation of UTF8 to UTF32 for processing
escaped notation does not insert UTF8 character into a string
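
The suggested transformation of UTF8 to UTF32 can be sketched as follows. This is a minimal Python illustration, not a library routine: `utf8_to_code_points` is a hypothetical name, and the sketch assumes well-formed input and does no validation.

```python
def utf8_to_code_points(data):
    """Decode well-formed UTF-8 bytes into a list of code-points."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # 0xxxxxxx: plain text, 1 byte
            cp, extra = b, 0
        elif b < 0xE0:               # 110xxxxx: 2-byte character
            cp, extra = b & 0x1F, 1
        elif b < 0xF0:               # 1110xxxx: 3-byte character
            cp, extra = b & 0x0F, 2
        else:                        # 11110xxx: 4-byte character
            cp, extra = b & 0x07, 3
        for k in range(1, extra + 1):
            # each continuation byte (10xxxxxx) adds 6 more bits
            cp = (cp << 6) | (data[i + k] & 0x3F)
        out.append(cp)
        i += extra + 1
    return out

utf8 = "Stra\u00dfe".encode("utf-8")   # Straße
print(len(utf8))                       # 7 bytes
print(len(utf8_to_code_points(utf8)))  # 6 characters
```

After the transformation, length, indexing and single-character search behave as they do for UTF32.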

Storage Required

minimal for ram use
- Phix string < UTF32 sequence < UTF8 sequence

minimal for disk use
- UTF8 file < UTF32 file
- zipping can significantly compress text files

minimal for internet streaming
- UTF8 < UTF32
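
A Python sketch that counts characters and encoded bytes for a short word (illustration only; `sizes` is a hypothetical helper). The byte counts apply to files and streams. For RAM, the relevant figure is the number of elements, on the assumption that every element of a Euphoria sequence costs the same regardless of its value.

```python
def sizes(text):
    """Return (characters, utf8_bytes, utf32_bytes) for text."""
    return (len(text),
            len(text.encode("utf-8")),
            len(text.encode("utf-32-le")))   # no byte-order mark

print(sizes("ete"))            # (3, 3, 12)
print(sizes("\u00e9t\u00e9"))  # (3, 5, 12)   été
```

As bytes, UTF8 never exceeds UTF32. As sequence elements, été needs 5 in UTF8 but only 3 in UTF32.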

Sort | Collate

sorting "arrange items based on numerical value"
- Sorting plain-text or unicode-text does not always look right, since the assignment of code-points to characters was somewhat arbitrary.
collating "arrange items based on language specific alphabetic order"
- collating requires a custom algorithm
- some call this natural sorting
- see the Unicode Collation Algorithm
- see Rosetta natural sorting
- see archive nat sort

Chasing Nanoseconds

In Voyage au Centre de la Terre by Jules Verne, about 2.5% of the text is outside plain-text (code-points above #7F). The speed advantage of sorting all of the words as utf32 rather than as utf8 is barely measurable.

_tom
