(rant) unicode-text


explorations; definitions; observations


Text

  • character "letter, digit, ... "
  • string "sequence of characters"

Plain-text:

  • character --> atom --> single quote ' delimiter: 'a'
  • string --> sequence --> double quote " delimiter: "hello"
  • output: ? --> numbers; puts --> text

For utf8 unicode-text:

  • utf8 character --> sequence --> double quote " delimiter: "ß"
  • string --> sequence --> double quote " delimiter: "hello"
  • output: ? --> numbers; puts --> text

For utf32 unicode-text:

  • utf32 character --> atom
  • string --> sequence
  • output: ? --> numbers; puts --> text

For Phix utf8 unicode-text:

  • utf8 character --> string --> "a" or "ß"
  • utf8 string --> string --> "hello"
  • output: ? s --> text; puts --> text; print --> numbers
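
The OpenEuphoria utf8 case above can be checked with a few lines. This is only a sketch: it assumes the source file is saved as UTF8 and that the terminal displays UTF8.

```euphoria
-- Assumption: this source file is saved as UTF8.
sequence plain = "a"     -- one byte: {97}
sequence sharp = "ß"     -- one utf8 character, two bytes: {195,159}

printf( 1, "%d\n", length( plain ) )            -- 1
printf( 1, "%d\n", length( sharp ) )            -- 2: bytes, not characters
printf( 1, "%d %d\n", sharp )                   -- 195 159
puts( 1, sharp & "\n" )                         -- text, on a UTF8 terminal
```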

Unicode

Unicode "assigns a unique number (code-point) to each character". Properly, Unicode is the consortium that sets standards: www.unicode.org. In common usage, "unicode" means unicode-text.

  • code-point "unique hexadecimal number for each character"
  • plain-text "code-points #0 to #7F" (up to 127 decimal)
  • unicode-text "code-points #0 to #10FFFF" (up to 1,114,111 decimal)

Example:

character 'a' code-point #61 (hex), U+0061 (outside of OE), "\x61" (escaped notation), 97 (decimal)

Example:

? 'a' 
    --> 97 
? "\x61" 
    --> {97} 
? "\u0061" 
    --> {97} 
? "\U00000061" 
    --> {97} 
 
include std/convert.e 
 
-- int_to_bits returns the least significant bit first (little endian) 
? int_to_bits( 97 ) 
-- {1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0} 
 
? int_to_bits( 97, 8 ) 
-- {1,0,0,0,0,1,1,0} 
 
-- a binary literal is written most significant bit first (big endian) 
? "\b10000110" 
--> {134} 
	-- 'fail': bits copied in int_to_bits order 
 
? "\b01100001"  
--> {97} 
	-- 'success': most significant bit written first 
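
The bit order in the example above can be checked by a round trip. A minimal sketch, using only int_to_bits and bits_to_int from std/convert.e and reverse from std/sequence.e:

```euphoria
include std/convert.e
include std/sequence.e

sequence bits = int_to_bits( 97, 8 )
-- bits is {1,0,0,0,0,1,1,0}: least significant bit first

-- bits_to_int expects the same order, so the round trip gives 97
printf( 1, "%d\n", bits_to_int( bits ) )

-- reversed, the bits read the way a binary literal is written: 01100001
sequence written = reverse( bits )
for i = 1 to length( written ) do
    printf( 1, "%d", written[i] )
end for
puts( 1, "\n" )
```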

Endian

  • Little Endian "least significant digits written first."
  • Big Endian "most significant digits written first."

The natural way to write decimal numbers is little endian.

  • In human languages that write right to left, decimal numbers are read in little endian order.
  • In languages that write left to right, decimal numbers are read in big endian order.
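
Byte order can be seen the same way as bit order. A small sketch, assuming the default four-byte form of int_to_bytes from std/convert.e, which returns the least significant byte first:

```euphoria
include std/convert.e

-- #12345678 split into four bytes, least significant byte first
sequence bytes = int_to_bytes( #12345678 )
printf( 1, "%d %d %d %d\n", bytes )            -- 120 86 52 18 (#78 #56 #34 #12)

-- bytes_to_int reads the same order back
printf( 1, "%d\n", bytes_to_int( bytes ) )     -- 305419896 (#12345678)
```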

Writing numbers backwards has resulted in Western children not understanding arithmetic and mathematics. These children become adults that do not understand arithmetic and mathematics.

Trivia

Endian refers to Jonathan Swift's Gulliver's Travels, where the Big-Endians and Little-Endians disagree over which end of an egg to break.

UTF

UTF "Unicode Transformation Format"

  • UTF32 "is the code-point; the definition of a character"
  • UTF16 "Microsoft format"
  • UTF8 "standard format for text"

Example:

include std/convert.e 
	 
-- the character 'a' 
? int_to_bits( #61 )   
--> {1,0,0,0,0,1,1,0, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0} 
--   |                |                |                | 4-bytes                      
 
-- note: least significant bit first (little endian) 
 

For plain text, three of the four bytes are unused bits.

UTF32

The definition of text.

  • character --> atom or integer
  • string --> sequence
  • for plain-text, the UTF32 values are the same as the UTF8 values

Literal text

  • unlikely that text will be in UTF32 format
  • Geany runs as utf8 but can load|save UTF32 text
  • (unix) terminal, puts, printf are UTF8

Simple processing

  • direct indexing and traversal
  • length is number of characters
  • find to search for character in a string
  • mutable
  • insert code-point into string using escaped notation
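
The points above can be tried on a short word. This is a sketch only; the word "größe" and the variable names are illustrative, and the last line assumes that the \u escape inserts the code-point itself, as the list describes.

```euphoria
-- "größe" as UTF32: one code-point per element
sequence s = { 'g', 'r', #F6, #DF, 'e' }

printf( 1, "%d\n", length( s ) )       -- 5: number of characters
printf( 1, "%d\n", find( #DF, s ) )    -- 4: ß is the fourth character
printf( 1, "%d\n", s[3] )              -- 246: direct indexing gives ö (#F6)

-- the same text written with escaped notation (assumption: one element per escape)
sequence t = "gr\u00F6\u00DFe"
printf( 1, "%d\n", equal( s, t ) )

s[3] = 'o'                             -- mutable: replace ö with o
```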

UTF16

  • Combines the disadvantages of UTF32 and UTF8
  • Microsoft products
  • good luck

UTF8

Motivation: saving bits is more important than convenience in processing

  • plain-text character --> atom or integer
  • unicode-text character --> sequence
  • string --> sequence
  • for plain-text, the UTF8 values are the same as the UTF32 values
  • see also Phix string data-type which corresponds directly to UTF8

Literal text

  • most text is in UTF8 format
  • default for Geany
  • (unix) terminal; puts ; printf are UTF8
  • most applications: internet, webbrowser, editors, LibreOffice

More difficult processing

  • match to search for character
  • no direct indexing or traversal
  • length does not return number of characters
  • mutable
  • consider transformation of UTF8 to UTF32 for processing
  • escaped notation does not insert UTF8 character into a string
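
The suggested transformation of UTF8 to UTF32 might look like the sketch below. The function utf8_to_utf32 is a hypothetical helper written for this post, not a library routine. It assumes well-formed UTF8 and does no validation, so truncated or invalid input will fail.

```euphoria
-- Hypothetical helper: decode a sequence of UTF8 bytes into code-points.
-- Assumes well-formed input; there is no error checking.
function utf8_to_utf32( sequence s )
    sequence result = {}
    integer i = 1
    integer b, n
    atom cp
    while i <= length( s ) do
        b = s[i]
        if b < #80 then          -- 1 byte:  0xxxxxxx
            cp = b
            n = 0
        elsif b < #E0 then       -- 2 bytes: 110xxxxx
            cp = b - #C0
            n = 1
        elsif b < #F0 then       -- 3 bytes: 1110xxxx
            cp = b - #E0
            n = 2
        else                     -- 4 bytes: 11110xxx
            cp = b - #F0
            n = 3
        end if
        for k = 1 to n do
            i += 1
            cp = cp * 64 + ( s[i] - #80 )   -- continuation byte: 10xxxxxx
        end for
        result = append( result, cp )
        i += 1
    end while
    return result
end function

sequence u8 = "größe"        -- assumes this source file is saved as UTF8
sequence u32 = utf8_to_utf32( u8 )

printf( 1, "%d bytes, %d characters\n", { length( u8 ), length( u32 ) } )  -- 7 bytes, 5 characters
printf( 1, "%d\n", match( "ß", u8 ) )   -- 5: a byte position, not a character position
printf( 1, "%d\n", find( #DF, u32 ) )   -- 4: the character position
```

After the conversion, length, find and indexing behave as they do for plain-text.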

Storage Required

  • minimal for ram use
    • Phix string < UTF32 sequence < UTF8 sequence
  • minimal for disk use
    • UTF8 file < UTF32 file
    • zipping can significantly compress text files
  • minimal for internet streaming
    • UTF8 < UTF32
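
To put numbers on the streaming comparison, count the bytes each encoding needs for the same word. A rough sketch, assuming the source file is saved as UTF8 and that UTF32 uses four bytes per code-point:

```euphoria
sequence u8  = "größe"                        -- UTF8 bytes (source saved as UTF8)
sequence u32 = { 'g', 'r', #F6, #DF, 'e' }    -- the same word as code-points

printf( 1, "UTF8:  %d bytes\n", length( u8 ) )        -- 7: one byte per element
printf( 1, "UTF32: %d bytes\n", 4 * length( u32 ) )   -- 20: four bytes per code-point
```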

Sort | Collate

  • sorting "arrange items based on numerical value"
    • Sorting plain-text or unicode-text does not always look right, since the assignment of code-points to characters was somewhat arbitrary.
  • collating "arrange items based on language specific alphabetic order"
    • collating requires a custom algorithm
    • some call this natural sorting
    • see Unicode collating algorithm
    • see Rosetta natural sorting
    • see archive nat sort
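
The difference between sorting and collating shows up even in plain-text. The sketch below uses sort and custom_sort from std/sort.e. The comparison function collate_cmp is a hypothetical stand-in that only ignores letter case; it is not the Unicode collating algorithm.

```euphoria
include std/sort.e
include std/text.e

-- print a list of words on one line
procedure show( sequence list )
    for i = 1 to length( list ) do
        puts( 1, list[i] & " " )
    end for
    puts( 1, "\n" )
end procedure

-- hypothetical collating rule: compare words ignoring case
function collate_cmp( sequence a, sequence b )
    return compare( lower( a ), lower( b ) )
end function

sequence words = { "banana", "Cherry", "apple" }

-- sorting: by numerical value, so 'C' (67) comes before 'a' (97)
show( sort( words ) )                                       -- Cherry apple banana

-- collating: alphabetic order regardless of case
show( custom_sort( routine_id( "collate_cmp" ), words ) )   -- apple banana Cherry
```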

Chasing Nanoseconds

In "Voyage au Centre de la Terre" by Jules Verne, about 2.5% of the text is outside the plain-text range. It is barely possible to show that sorting all of the words as utf32 is faster than sorting them as utf8.

_tom
