1. (rant) unicode-text

explorations; definitions; observations


Text

  • character "letter, digit, ... "
  • string "sequence of characters"

Plain-text:

  • character --> atom --> single quote ' delimiter: 'a'
  • string --> sequence --> double quote " delimiter: "hello"
  • output: ? --> numbers; puts --> text

For utf8 unicode-text:

  • utf8 character --> sequence --> double quote " delimiter: ""
  • string --> sequence --> double quote " delimiter: "hello"
  • output: ? --> numbers; puts --> text

For utf32 unicode-text:

  • utf32 character --> atom
  • string --> sequence
  • output: ? --> numbers; puts --> text

For Phix utf8 unicode-text:

  • utf8 character --> string --> string --> "a" or ""
  • utf8 string --> string --> "hello"
  • output: ? s --> text; puts --> text; print --> numbers

Unicode

Unicode "assigns a unique number (code-point) to each character". Properly, Unicode is the standard published by the Unicode Consortium: www.unicode.org. In common usage: unicode-text.

  • code-point "unique hexadecimal number for each character"
  • plain-text "code-points #0 to #7F" (up to 127 decimal)
  • unicode-text "code-points #0 to #10FFFF" (up to 1,114,111 decimal, roughly 10^6)

Example:

character 'a': code-point #61 (hex), U+0061 (outside of OE), "\x61" (escaped notation), 97 (decimal)

Example:

? 'a' 
    --> 97 
? "\x61" 
    --> {97} 
? "\u0061" 
    --> {97} 
? "\U00000061" 
    --> {97} 
 
	-- bit order: int_to_bits() returns the least significant bit first 
 
include std/convert.e 
 
? int_to_bits( 97 ) 
-- {1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0} 
 
? int_to_bits( 97, 8 ) 
-- {1,0,0,0,0,1,1,0} 
	-- note: least significant bit first 
 
? "\b10000110" 
--> {134} 
	-- 'fail': the int_to_bits order copied as-is; \b expects the most significant bit first 
 
? "\b01100001"  
--> {97} 
	-- 'success': written with the most significant bit first 

Endian

  • Little Endian "least significant digit written first."
  • Big Endian "most significant digit written first."

The natural way to write decimal numbers is little endian.

  • In human languages that write right to left, decimal numbers read in little endian order (the reader meets the least significant digit first).
  • In languages that write left to right, decimal numbers are written in big endian format.
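
Byte order can be seen directly with int_to_bytes() from std/convert.e, which returns the least significant byte first. This is only a minimal sketch; the loop that builds the opposite order is there for illustration and the variable names are made up.

```euphoria
include std/convert.e

-- four bytes of #12345678, least significant byte first
sequence lsb_first = int_to_bytes(#12345678)

-- build the opposite order by walking the sequence backwards
sequence msb_first = {}
for i = length(lsb_first) to 1 by -1 do
    msb_first &= lsb_first[i]
end for

-- lsb_first is {#78,#56,#34,#12}
-- msb_first is {#12,#34,#56,#78}
? lsb_first
? msb_first
```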

Writing numbers backwards has resulted in Western children not understanding arithmetic and mathematics. These children become adults that do not understand arithmetic and mathematics.

Trivia

Endian comes from Jonathan Swift's Gulliver's Travels, where the Big-Endians and Little-Endians quarrel over which end of a boiled egg to break.

UTF

UTF "Unicode Transformation Format"

  • UTF32 "is the code-point; the definition of a character"
  • UTF16 "Microsoft format"
  • UTF8 "standard format for text"

Example:

include std/convert.e 
	 
-- the character 'a' 
? int_to_bits( #61 )   
--> {1,0,0,0,0,1,1,0, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0} 
--   |                |                |                | 4-bytes                      
 
-- note: least significant bit first 
 

For plain-text, three of the four bytes are unused (all bits zero).
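
For contrast with the fixed four bytes of UTF32, here is a rough sketch of how one code-point becomes one to four UTF8 bytes. The name utf8_encode() is made up for this note (it is not a library routine), and the sketch does not reject invalid code-points such as surrogates.

```euphoria
-- Sketch: encode one code-point as a sequence of UTF8 byte values.
-- utf8_encode() is an illustrative name, not part of the standard library.
function utf8_encode(integer cp)
    if cp < #80 then
        -- plain-text: 0xxxxxxx, one byte
        return {cp}
    elsif cp < #800 then
        -- 110xxxxx 10xxxxxx
        return {#C0 + floor(cp / #40),
                #80 + and_bits(cp, #3F)}
    elsif cp < #10000 then
        -- 1110xxxx 10xxxxxx 10xxxxxx
        return {#E0 + floor(cp / #1000),
                #80 + and_bits(floor(cp / #40), #3F),
                #80 + and_bits(cp, #3F)}
    else
        -- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return {#F0 + floor(cp / #40000),
                #80 + and_bits(floor(cp / #1000), #3F),
                #80 + and_bits(floor(cp / #40), #3F),
                #80 + and_bits(cp, #3F)}
    end if
end function

sequence a    = utf8_encode(#61)     -- 'a': {#61}
sequence e    = utf8_encode(#E9)     -- e with acute accent: {#C3,#A9}
sequence euro = utf8_encode(#20AC)   -- euro sign: {#E2,#82,#AC}
? a
? e
? euro
```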

UTF32

The definition of text.

  • character --> atom or integer
  • string --> sequence
  • for plain-text, UTF32 and UTF8 give the same numeric values

Literal text

  • unlikely that text will be in UTF32 format
  • Geany runs as utf8 but can load|save UTF32 text
  • (unix) terminal, puts, printf are UTF8

Simple processing

  • direct indexing and traversal
  • length is number of characters
  • find to search for character in a string
  • mutable
  • insert code-point into string using escaped notation
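
The points in this list can be tried with a few lines. This is a sketch only; the word and the variable name are arbitrary, and the code-point is appended with & so it does not depend on any particular escape.

```euphoria
-- UTF32 as a sequence: one element per character
sequence s = "caf" & #E9    -- #E9 is e with acute accent

? length(s)                 -- 4: length is the number of characters
? find(#E9, s)              -- 4: find locates a single character
s[4] = 'e'                  -- direct indexing; the sequence is mutable
puts(1, s & '\n')           -- now plain-text: cafe
```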

UTF16

  • Combines the disadvantages of UTF32 and UTF8
  • Microsoft products
  • good lucks

UTF8

Motivation: saving bits is more important than convenience in processing

  • plain-text character --> atom or integer
  • unicode-text character --> sequence
  • string --> sequence
  • for plain-text, UTF8 and UTF32 give the same numeric values
  • see also Phix string data-type which corresponds directly to UTF8

Literal text

  • most text is in UTF8 format
  • default for Geany
  • (unix) terminal; puts ; printf are UTF8
  • most applications: internet, webbrowser, editors, LibreOffice

More difficult processing

  • match to search for character
  • no direct indexing or traversal
  • length does not return number of characters
  • mutable
  • consider transformation of UTF8 to UTF32 for processing
  • escaped notation does not insert UTF8 character into a string
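
The suggested transformation from UTF8 to UTF32 could look something like the sketch below. The name utf8_to_utf32() is invented for this note, and the routine assumes well-formed input (no validation of continuation bytes, overlong forms, or truncated sequences).

```euphoria
-- Sketch: transform UTF8 byte values into a sequence of code-points.
-- utf8_to_utf32() is an illustrative name; input is assumed well formed.
function utf8_to_utf32(sequence bytes)
    sequence cps = {}
    integer i = 1
    integer b, n, cp
    while i <= length(bytes) do
        b = bytes[i]
        if b < #80 then        -- 0xxxxxxx: plain-text, no continuation bytes
            cp = b
            n = 0
        elsif b < #E0 then     -- 110xxxxx: one continuation byte
            cp = and_bits(b, #1F)
            n = 1
        elsif b < #F0 then     -- 1110xxxx: two continuation bytes
            cp = and_bits(b, #0F)
            n = 2
        else                   -- 11110xxx: three continuation bytes
            cp = and_bits(b, #07)
            n = 3
        end if
        for k = 1 to n do      -- each 10xxxxxx byte contributes six bits
            cp = cp * #40 + and_bits(bytes[i + k], #3F)
        end for
        cps = append(cps, cp)
        i += n + 1
    end while
    return cps
end function

sequence u8  = {#63, #61, #66, #C3, #A9}   -- "café" as UTF8 byte values
sequence u32 = utf8_to_utf32(u8)           -- {#63,#61,#66,#E9}

? length(u8)               -- 5: counts bytes, not characters
? length(u32)              -- 4: counts characters
? match({#C3, #A9}, u8)    -- 4: in UTF8 a character is searched as a byte run
```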

Storage Required

  • minimal for ram use
    • Phix string < UTF32 sequence < UTF8 sequence
  • minimal for disk use
    • UTF8 file < UTF32 file
    • zipping can significantly compress text files
  • minimal for internet streaming
    • UTF8 < UTF32
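
For files and streams the sizes can be estimated from the code-points alone. A small sketch, with utf8_size() as a made-up helper name and an arbitrary sample word:

```euphoria
-- Sketch: bytes needed on disk or on the wire, per encoding.
-- utf8_size() is an illustrative name, not a library routine.
function utf8_size(integer cp)
    if cp < #80 then
        return 1
    elsif cp < #800 then
        return 2
    elsif cp < #10000 then
        return 3
    else
        return 4
    end if
end function

sequence text = "caf" & #E9     -- four code-points
integer utf8_bytes = 0
for i = 1 to length(text) do
    utf8_bytes += utf8_size(text[i])
end for
integer utf32_bytes = 4 * length(text)

printf(1, "UTF8: %d bytes, UTF32: %d bytes\n", {utf8_bytes, utf32_bytes})
-- UTF8: 5 bytes, UTF32: 16 bytes
```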

Sort | Collate

  • sorting "arrange items based on numerical value"
    • Sorting plain-text or unicode-text does not always look right--since assignment of code-points to characters was somewhat arbitrary.
  • collating "arrange items based on language specific alphabetic order"
    • collating requires a custom algorithm
    • some call this natural sorting
    • see Unicode collating algorithm
    • see Rosetta natural sorting
    • see archive nat sort
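
The difference between sorting by value and a custom order shows up even in plain-text. The sketch below uses sort() and custom_sort() from std/sort.e with lower() from std/text.e; fold_compare() is a made-up name, and case folding is only a stand-in for real collating, not the Unicode collating algorithm.

```euphoria
include std/sort.e
include std/text.e

-- compare two words ignoring case (a crude stand-in for collating)
function fold_compare(object a, object b)
    return compare(lower(a), lower(b))
end function

sequence words = {"banana", "Cherry", "apple"}

sequence by_value = sort(words)
-- {"Cherry","apple","banana"}: 'C' (#43) sorts before 'a' (#61)

sequence folded = custom_sort(routine_id("fold_compare"), words)
-- {"apple","banana","Cherry"}
```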

Chasing Nanoseconds

In Voyage au Centre de la Terre by Jules Verne, about 2.5% of the text is unicode. It is barely possible to show that sorting all of the words as utf32 is faster than sorting as utf8.

_tom


2. Re: (rant) unicode-text

_tom said...

UTF16

  • Combines the disadvantages of UTF32 and UTF8
  • Microsoft products
  • (windows) console displays utf16 text

Том, really, the Windows console displays DOS OEM text,
so, for example, in Russian, it is good old CP866,
a pure Russian DOS code page, without any Unicode.

Please try these articles:
https://en.wikipedia.org/wiki/UTF-7
https://en.wikipedia.org/wiki/Unicode#UTF
https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

Regards


kinz


3. Re: (rant) unicode-text

Thanks Igor,

Microsoft is for really smart people only--that excludes me. Microsoft is for people with lots of patience only--that excludes me. I booted up my Win10 netbook and concluded that Windows makes the computer unusable. (It was marginally usable with Win7).

Does Windows force you to use the console as a strictly DOS computer?

What is the easy way to set the code page in a console?

How do you work with unicode-text on a Windows computer?

I will need lots of help with Windows.

I upgraded my notes on UTF16 to "good lucks".

_tom


4. Re: (rant) unicode-text

_tom said...

Thanks Igor,

Microsoft is for really smart people only--that excludes me. Microsoft is for people with lots of patience only--that excludes me. I booted up my Win10 netbook and concluded that Windows makes the computer unusable. (It was marginally usable with Win7).

Does Windows force you to use the console as a strictly DOS computer?

What is the easy way to set the code page in a console?

How do you work with unicode-text on a Windows computer?

I will need lots of help with Windows.

I upgraded my notes on UTF16 to "good lucks".

_tom

I encountered an issue in my work recently related to this subject: Pasting text from the clipboard could capture weird byte values depending on which application the text was copied from - even if it was exactly the same text. I had been using CF_TEXT in the Windows function call but only CF_OEMTEXT ensures that just the bare text is returned. BTW, no console was involved.

Spock


5. Re: (rant) unicode-text

_tom said...

I concluded that Windows makes the computer unusable.

I installed a copy of Ubuntu last year and got terribly frustrated with it - took me over half an hour to figure out how to get a console/terminal window to appear. Of course I immediately pinned it to that launcher bar wotchamacallit, but if someone unpinned it then no doubt I'd be completely lost again. It is the things that we are not used to that are strange.

_tom said...

How do you work with unicode-text on a Windows computer?

In a GUI application, always.

Pete


6. Re: (rant) unicode-text

The statement "Windows 10 makes my netbook unusable" is a literal (non hyperbole) statement. It runs so slowly that I think it is broken. Bootup and shutdown takes 3.25 minutes. It is about 2 minutes before the file manager starts responding. It is very easy to get 100% cpu utilization; then the computer just hangs and gets hot. Windows just did an update which cost almost 20 minutes of downtime. Mint Linux has never treated me this poorly.

I went to a small town and asked a Sears associate "where is the automotive department?" The reply was "where it has always been." It turns out that in this town the automotive department is in a separate building--everyone in town knows this. You can't compare operating systems based on how long it takes to learn* where a particular feature is hidden.

* (Maybe you can...)
In a (unix) terminal there is a dropdown menu that lets me set a traditional code page (such as Igor uses). In a console? Microsoft documentation offers info on editing the registry; looks like there is an API function that can also do this. Microsoft did not tell me how to do a one time code-page change to demo Russian characters. Windows is harder than it looks.

Eventually, I did find a Microsoft statement that utf is not supported in a console. Just surprised that Windows has made no progress in keeping up with standards.

With Mint, a right-click gives me access to a terminal. There are multiple workspaces that make multi-tasking much easier. Linux is dramatically faster than Windows. After installing Mint you have a complete computer; after installing Windows you have to figure out how to get and install software. Lots of pragmatic reasons why Linux is easier than Windows.

_tom


7. Re: (rant) unicode-text

For the documentation a new example:

Example

You can use any text encoding format because atom|sequence data-types are number based. Legacy code-pages provide 256 characters. The first 128 (0 to 127) are plain-text (same as utf) and the rest are code-page specific. The (windows) console only outputs text in a code-page format; a (unix) terminal can be optionally set to a code-page.

-- (unix) terminal dropdown menu: Terminal | Set Character Encoding | Add or remove | ... 
-- (windows) console: edit the registry 
 
-- set code page to CP866 for `Cyrillic/Russian` 
 
sequence chr = {} 
 
for c = 32 to 255 do   -- a code-page byte is at most 255 
    chr = append(chr, c ) 
end for 
     
puts(1, chr ) 
 
-->  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ 
--  [\]^_`abcdefghijklmnopqrstuvwxyz{|}~АБВГДЕЖЗИЙКЛМНОПРСТУФХ 
--  ЦЧШЩЪЫЬЭЮЯабвгдежзийклмноп░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨   
--  ╤╥╙╘╒╓╫╪┘┌█▄▌▐▀рстуфхцчшщъыьэюяЁёЄєЇїЎў∙√№■ 
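
Because the values are just numbers, a code-page byte can also be mapped to its Unicode code-point. The sketch below covers only the Cyrillic letters of CP866; cp866_to_unicode() is a made-up name, and everything else above 127 (box drawing and symbols) is returned as the replacement character #FFFD rather than mapped.

```euphoria
-- Sketch: map a CP866 byte to a Unicode code-point (Cyrillic letters only).
-- cp866_to_unicode() is an illustrative name, not a library routine.
function cp866_to_unicode(integer c)
    if c < #80 then
        return c                  -- plain-text is unchanged
    elsif c <= #AF then
        return c - #80 + #410     -- #80..#AF: capital letters, then small a..pe
    elsif c >= #E0 and c <= #EF then
        return c - #E0 + #440     -- #E0..#EF: small er..ya
    elsif c = #F0 then
        return #401               -- capital io
    elsif c = #F1 then
        return #451               -- small io
    else
        return #FFFD              -- not handled in this sketch
    end if
end function

sequence cp866 = {#8F, #E0, #A8, #A2, #A5, #E2}   -- "Привет" as CP866 bytes
sequence cps = {}
for i = 1 to length(cp866) do
    cps &= cp866_to_unicode(cp866[i])
end for
-- cps is {#41F,#440,#438,#432,#435,#442}
? cps
```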
 

8. Re: (rant) unicode-text

_tom said...

The statement "Windows 10 makes my netbook unusable" is a literal (non hyperbole) statement. It runs so slowly that I think it is broken. Bootup and shutdown takes 3.25 minutes. It is about 2 minutes before the file manager starts responding. It is very easy to get 100% cpu utilization; then the computer just hangs and gets hot. Windows just did an update which cost almost 20 minutes of downtime. Mint Linux has never treated me this poorly.

I hate Windows 10. I refuse to use it. It is a trojan horse and spyware, and many people were tricked into "upgrading" to it using literal malware tactics. I don't mind the new start menu, and a few other improvements, but i refuse to be tempted into using such an invasion of privacy. I recently built a Ryzen R7 1700 system, which is "unsupported hardware" for Windows 7. Microsoft Update literally self-destructed once it realized it was running on a modern CPU that is "supposed" to be running windows 10 only, so i had to uninstall a certain update. But Microsoft lies, because Windows 7 runs fine on Ryzen, even with an NVMe drive (which was quite a pain to get the drivers, because Microsoft really doesn't want you to know that NVMe does in fact work in Windows 7.) I really hate Microsoft, but I'm forced to keep using Windows for several reasons. At least Windows 7 works well enough for my needs.

_tom said...

I went to a small town and asked a Sears associate "where is the automotive department?" The reply was "where it has always been." It turns out that in this town the automotive department is in a separate building--everyone in town knows this. You can't compare operating systems based on how long it takes to learn* where a particular feature is hidden.

* (Maybe you can...)
In a (unix) terminal there is a dropdown menu that lets me set a traditional code page (such as Igor uses). In a console? Microsoft documentation offers info on editing the registry; looks like there is an API function that can also do this. Microsoft did not tell me how to do a one time code-page change to demo Russian characters. Windows is harder than it looks.

Eventually, I did find a Microsoft statement that utf is not supported in a console. Just surprised that Windows has made no progress in keeping up with standards.

I noticed that Japanese or Chinese characters in file names don't work in the Windows console, but they work fine in a linux console. As much as i dislike the non-intuitiveness of console commands, I have to say that for several reasons, command line is MUCH better in linux.

_tom said...

With Mint, a right-click gives me access to a terminal. There are multiple workspaces that make multi-tasking much easier. Linux is dramatically faster than Windows. After installing Mint you have a complete computer; after installing Windows you have to figure out how to get and install software. Lots of pragmatic reasons why Linux is easier than Windows.

I wish i could use linux on my main PC, but i need windows for a few things. Linux Mint is one of my favorite distros for a GUI workstation. Solus is pretty cool, too. I always have a hard time deciding which desktop environment to use, though. XFCE is my favorite lightweight environment, but others have more features and more modern style.

Even though i use Windows for my main workstation, i always like to have a Debian server running on my lan. Currently, i have a cheap Mini-ITX quad-core AMD APU with 4 sata ports, which has been a great low-power file/sql/http server.
