1. (rant) unicode-text
- Posted by _tom (admin) Oct 12, 2017
- 2909 views
- Last edited Oct 16, 2017
explorations; definitions; observations
Text
- character "letter, digit, ... "
- string "sequence of characters"
Plain-text:
- character --> atom --> single quote ' delimiter: 'a'
- string --> sequence --> double quote " delimiter: "hello"
- output: ? --> numbers; puts --> text
For utf8 unicode-text:
- utf8 character --> sequence --> double quote " delimiter: "ß"
- string --> sequence --> double quote " delimiter: "hello"
- output: ? --> numbers; puts --> text
For utf32 unicode-text:
- utf32 character --> atom
- string --> sequence
- output: ? --> numbers; puts --> text
For Phix utf8 unicode-text:
- utf8 character --> string --> "a" or "ß"
- utf8 string --> string --> "hello"
- output: ? s --> text; puts --> text; print --> numbers
Unicode
Unicode "assigns a unique number (code-point) to each character". Properly, Unicode is the consortium that sets the standards: www.unicode.org. In common usage, "unicode" means unicode-text.
- code-point "unique hexadecimal number for each character"
- plain-text "code-points #0 to #7F" (up to 127 decimal)
- unicode-text "code-points #0 to #10FFFF" (up to 1,114,111 decimal)
Example:
character 'a' code-point #61 (hex), U+0061 (outside of OE), "\x61" (escaped notation), 97 (decimal)
Example:
? 'a' --> 97
? "\x61" --> {97}
? "\u0061" --> {97}
? "\U00000061" --> {97}
-- int_to_bits lists the least significant bit first
include std/convert.e
? int_to_bits( 97 ) -- {1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
? int_to_bits( 97, 8 ) -- {1,0,0,0,0,1,1,0}
-- note: a binary literal is written most significant bit first
? "\b10000110" --> {134} -- 'fail': bits copied in int_to_bits order
? "\b01100001" --> {97} -- 'success': bits reversed, most significant first
Endian
- Little Endian "least significant digit written first."
- Big Endian "most significant digit written first."
The natural way to write decimal numbers is little endian.
- In human languages that write right to left, decimal numbers read in little endian order.
- In languages that write left to right, decimal numbers read in big endian order.
Writing numbers backwards has resulted in Western children not understanding arithmetic and mathematics. These children become adults that do not understand arithmetic and mathematics.
Trivia
The term "endian" comes from Jonathan Swift's Gulliver's Travels.
UTF
UTF "Unicode Transformation Format"
- UTF32 "is the code-point; the definition of a character"
- UTF16 "Microsoft format"
- UTF8 "standard format for text"
Example:
include std/convert.e
-- the character 'a'
? int_to_bits( #61 )
--> {1,0,0,0,0,1,1,0, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0}
--  |                 |                |                |   4 bytes
-- note: least significant bit first
For plain-text, three of the four bytes are unused bits.
UTF32
The definition of text.
- character --> atom or integer
- string --> sequence
- for plain-text UTF32 is the same as UTF8
Literal text
- unlikely that text will be in UTF32 format
- Geany runs as utf8 but can load|save UTF32 text
- (unix) terminal, puts, printf are UTF8
Simple processing
- direct indexing and traversal
- length is number of characters
- find to search for character in a string
- mutable
- insert code-point into string using escaped notation
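A small sketch of what "simple processing" looks like when every element of the sequence is one code-point. The word and the variable name are just for illustration; #DF is the code-point for ß.

```euphoria
-- UTF32 style: one sequence element per character
sequence s = {'G', 'r', 'o', #DF, 'e'}   -- code-points for "Große"

? length(s)       -- 5, the number of characters
? s[4]            -- 223, direct indexing gives the code-point of ß
? find(#DF, s)    -- 4, find locates a single character

s[4] = 's'        -- mutable: replace one character in place
puts(1, s & "\n") -- Grose
```

Nothing here needs a library; the ordinary sequence routines are enough.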
UTF16
- Combines the disadvantages of UTF32 and UTF8
- Microsoft products
- good lucks
UTF8
Motivation: saving bits is more important than convenience in processing
- plain-text character --> atom or integer
- unicode-text character --> sequence
- string --> sequence
- for plain-text UTF8 is the same as UTF32
- see also Phix string data-type which corresponds directly to UTF8
Literal text
- most text is in UTF8 format
- default for Geany
- (unix) terminal; puts ; printf are UTF8
- most applications: internet, webbrowser, editors, LibreOffice
More difficult processing
- match to search for character
- no direct indexing or traversal
- length does not return number of characters
- mutable
- consider transformation of UTF8 to UTF32 for processing
- escaped notation does not insert UTF8 character into a string
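To illustrate "consider transformation of UTF8 to UTF32 for processing", here is a minimal sketch of a decoder. The name utf8_to_codepoints is made up for this example (it is not a library routine), and it assumes the input is well-formed UTF8; there is no error checking.

```euphoria
-- Sketch: turn a sequence of UTF8 bytes into a sequence of code-points.
-- utf8_to_codepoints is a hypothetical helper; assumes well-formed input.
function utf8_to_codepoints(sequence bytes)
    sequence result = {}
    integer i = 1
    integer b, cp, extra
    while i <= length(bytes) do
        b = bytes[i]
        if b < #80 then          -- 0xxxxxxx: plain-text, one byte
            cp = b
            extra = 0
        elsif b < #E0 then       -- 110xxxxx: lead byte of a two-byte character
            cp = and_bits(b, #1F)
            extra = 1
        elsif b < #F0 then       -- 1110xxxx: lead byte of a three-byte character
            cp = and_bits(b, #0F)
            extra = 2
        else                     -- 11110xxx: lead byte of a four-byte character
            cp = and_bits(b, #07)
            extra = 3
        end if
        for k = 1 to extra do
            -- each continuation byte (10xxxxxx) contributes six more bits
            cp = cp * 64 + and_bits(bytes[i + k], #3F)
        end for
        result = append(result, cp)
        i += extra + 1
    end while
    return result
end function

sequence s = {#C3, #9F, 'a'}   -- UTF8 bytes for "ßa"
? length(s)                    -- 3, bytes rather than characters
? utf8_to_codepoints(s)        -- {223,97}
```

After the conversion, length, indexing and find work per character again.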
Storage Required
- minimal for ram use
- Phix string < UTF32 sequence < UTF8 sequence
- minimal for disk use
- UTF8 file < UTF32 file
- zipping can significantly compress text files
- minimal for internet streaming
- UTF8 < UTF32
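A rough way to compare byte counts for a file or a stream. This is only a sketch: utf8_size is a made-up helper name, and the sample word is arbitrary.

```euphoria
-- hypothetical helper: number of bytes one code-point needs in UTF8
function utf8_size(integer cp)
    if cp < #80 then
        return 1
    elsif cp < #800 then
        return 2
    elsif cp < #10000 then
        return 3
    else
        return 4
    end if
end function

sequence text = {'G', 'r', 'o', #DF, 'e'}   -- code-points for "Große"
integer utf8_bytes = 0
for i = 1 to length(text) do
    utf8_bytes += utf8_size(text[i])
end for

-- UTF32 always uses four bytes per character
printf(1, "UTF8: %d bytes, UTF32: %d bytes\n", {utf8_bytes, 4 * length(text)})
-- UTF8: 6 bytes, UTF32: 20 bytes
```

These are on-disk byte counts; how much RAM a sequence takes depends on the interpreter's element size, which is a separate question.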
Sort | Collate
- sorting "arrange items based on numerical value"
- Sorting plain-text or unicode-text does not always look right--since assignment of code-points to characters was somewhat arbitrary.
- collating "arrange items based on language specific alphabetic order"
- collating requires a custom algorithm
- some call this natural sorting
- see Unicode collating algorithm
- see Rosetta natural sorting
- see archive nat sort
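A quick sketch of the difference between sorting and collating. The case-folding comparison below is only a crude stand-in for real collation (it is not the Unicode collating algorithm), and compare_nocase is a name invented for the example.

```euphoria
include std/sort.e
include std/text.e

-- crude stand-in for collation: compare words with case folded away
function compare_nocase(object a, object b)
    return compare(lower(a), lower(b))
end function

sequence words = {"apple", "Zebra", "banana"}

-- sorting: numerical value decides, and 'Z' (90) comes before 'a' (97)
sequence sorted = sort(words)
for i = 1 to length(sorted) do
    puts(1, sorted[i] & "\n")
end for
-- Zebra
-- apple
-- banana

-- "collating": a custom comparison gives the order a reader expects
sequence collated = custom_sort(routine_id("compare_nocase"), words)
for i = 1 to length(collated) do
    puts(1, collated[i] & "\n")
end for
-- apple
-- banana
-- Zebra
```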
Chasing Nanoseconds
In Voyage au Centre de la Terre by Jules Verne, about 2.5% of the text is unicode. It is barely possible to show that sorting all of the words as utf32 is faster than sorting as utf8.
_tom
2. Re: (rant) unicode-text
- Posted by kinz Oct 15, 2017
- 2781 views
Tom, really, the Windows console displays DOS OEM text,
so, for example, in Russian, it is good old CP866,
a pure Russian DOS code page, without any Unicode.
Try please these articles:
https://en.wikipedia.org/wiki/UTF-7
https://en.wikipedia.org/wiki/Unicode#UTF
https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
Regards
kinz
3. Re: (rant) unicode-text
- Posted by _tom (admin) Oct 16, 2017
- 2709 views
Thanks Igor,
Microsoft is for really smart people only--that excludes me. Microsoft is for people with lots of patience only--that excludes me. I booted up my Win10 netbook and concluded that Windows makes the computer unusable. (It was marginally usable with Win7).
Does Windows force you to use the console as a strictly DOS computer?
What is the easy way to set the code page in a console?
How do you work with unicode-text on a Windows computer?
I will need lots of help with Windows.
I upgraded my notes on UTF16 to "good lucks".
_tom
4. Re: (rant) unicode-text
- Posted by Spock Oct 16, 2017
- 2671 views
I encountered an issue in my work recently related to this subject: Pasting text from the clipboard could capture weird byte values depending on which application the text was copied from - even if it was exactly the same text. I had been using CF_TEXT in the Windows function call but only CF_OEMTEXT ensures that just the bare text is returned. BTW, no console was involved.
Spock
5. Re: (rant) unicode-text
- Posted by petelomax Oct 17, 2017
- 2653 views
_tom said: I concluded that Windows makes the computer unusable.
I installed a copy of Ubuntu last year and got terribly frustrated with it - took me over half an hour to figure out how to get a console/terminal window to appear. Of course I immediately pinned it to that launcher bar wotchamacallit, but if someone unpinned it then no doubt I'd be completely lost again. It is the things that we are not used to that are strange.
_tom said: How do you work with unicode-text on a Windows computer?
In a GUI application, always.
Pete
6. Re: (rant) unicode-text
- Posted by _tom (admin) Oct 17, 2017
- 2655 views
The statement "Windows 10 makes my netbook unusable" is a literal (non hyperbole) statement. It runs so slowly that I think it is broken. Bootup and shutdown takes 3.25 minutes. It is about 2 minutes before the file manager starts responding. It is very easy to get 100% cpu utilization; then the computer just hangs and gets hot. Windows just did an update which cost almost 20 minutes of downtime. Mint Linux has never treated me this poorly.
I went to a small town and asked a Sears associate "where is the automotive department?" The reply was "where it has always been." It turns out that in this town the automotive department is in a separate building--everyone in town knows this. You can't compare operating systems based on how long it takes to learn where a particular feature is hidden.
In a (unix) terminal there is a dropdown menu that lets me set a traditional code page (such as Igor uses). In a console? Microsoft documentation offers info on editing the registry; looks like there is an API function that can also do this. Microsoft did not tell me how to do a one time code-page change to demo Russian characters. Windows is harder than it looks.
Eventually, I did find a Microsoft statement that utf is not supported in a console. Just surprised that Windows has made no progress in keeping up with standards.
With Mint, a right-click gives me access to a terminal. There are multiple workspaces that make multi-tasking much easier. Linux is dramatically faster than Windows. After installing Mint you have a complete computer; after installing Windows you have to figure out how to get and install software. Lots of pragmatic reasons why Linux is easier than Windows.
_tom
7. Re: (rant) unicode-text
- Posted by _tom (admin) Oct 17, 2017
- 2645 views
For the documentation a new example:
Example
You can use any text encoding format because atom|sequence data-types are number based. Legacy code-pages provide 256 characters. The first 128 (codes 0 to 127) are plain-text (same as utf) and the rest are code-page specific. The (windows) console only outputs text in a code-page format; a (unix) terminal can be optionally set to a code-page.
-- (unix) terminal dropdown menu: Terminal | Set Character Encoding | Add or remove | ...
-- (windows) console: edit the registry
-- set code page to CP866 for `Cyrillic/Russian`
sequence chr = {}
-- collect every printable code from space (32) up to the last byte value (255)
for c = 32 to 255 do
    chr = append(chr, c )
end for
puts(1, chr )
-->  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ
-- [\]^_`abcdefghijklmnopqrstuvwxyz{|}~АБВГДЕЖЗИЙКЛМНОПРСТУФХ
-- ЦЧШЩЪЫЬЭЮЯабвгдежзийклмноп░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨
-- ╤╥╙╘╒╓╫╪┘┌█▄▌▐▀рстуфхцчшщъыьэюяЁёЄєЇїЎў°∙·√№¤■
8. Re: (rant) unicode-text
- Posted by ryanj Oct 19, 2017
- 2573 views
_tom said: The statement "Windows 10 makes my netbook unusable" is a literal (non hyperbole) statement. It runs so slowly that I think it is broken. Bootup and shutdown takes 3.25 minutes. It is about 2 minutes before the file manager starts responding. It is very easy to get 100% cpu utilization; then the computer just hangs and gets hot. Windows just did an update which cost almost 20 minutes of downtime. Mint Linux has never treated me this poorly.
I hate Windows 10. I refuse to use it. It is a trojan horse and spyware, and many people were tricked into "upgrading" to it using literal malware tactics. I don't mind the new start menu, and a few other improvements, but I refuse to be tempted into using such an invasion of privacy. I recently built a Ryzen R7 1700 system, which is "unsupported hardware" for Windows 7. Microsoft Update literally self-destructed once it realized it was running on a modern CPU that is "supposed" to be running Windows 10 only, so I had to uninstall a certain update. But Microsoft lies, because Windows 7 runs fine on Ryzen, even with an NVMe drive (which was quite a pain to get the drivers for, because Microsoft really doesn't want you to know that NVMe does in fact work in Windows 7.) I really hate Microsoft, but I'm forced to keep using Windows for several reasons. At least Windows 7 works well enough for my needs.
_tom said: I went to a small town and asked a Sears associate "where is the automotive department?" The reply was "where it has always been." It turns out that in this town the automotive department is in a separate building--everyone in town knows this. You can't compare operating systems based on how long it takes to learn where a particular feature is hidden.
_tom said: In a (unix) terminal there is a dropdown menu that lets me set a traditional code page (such as Igor uses). In a console? Microsoft documentation offers info on editing the registry; looks like there is an API function that can also do this. Microsoft did not tell me how to do a one time code-page change to demo Russian characters. Windows is harder than it looks.
_tom said: Eventually, I did find a Microsoft statement that utf is not supported in a console. Just surprised that Windows has made no progress in keeping up with standards.
I noticed that Japanese or Chinese characters in file names don't work in the Windows console, but they work fine in a Linux console. As much as I dislike the non-intuitiveness of console commands, I have to say that for several reasons, the command line is MUCH better in Linux.
_tom said: With Mint, a right-click gives me access to a terminal. There are multiple workspaces that make multi-tasking much easier. Linux is dramatically faster than Windows. After installing Mint you have a complete computer; after installing Windows you have to figure out how to get and install software. Lots of pragmatic reasons why Linux is easier than Windows.
I wish I could use Linux on my main PC, but I need Windows for a few things. Linux Mint is one of my favorite distros for a GUI workstation. Solus is pretty cool, too. I always have a hard time deciding which desktop environment to use, though. XFCE is my favorite lightweight environment, but others have more features and more modern style.
Even though I use Windows for my main workstation, I always like to have a Debian server running on my LAN. Currently, I have a cheap Mini-ITX quad-core AMD APU with 4 SATA ports, which has been a great low-power file/sql/http server.
9. Re: (rant) unicode-text
- Posted by kinz Nov 01, 2017
- 2304 views
_tom said: Does Windows force you to use the console as a strictly DOS computer?
No, it doesn't.
_tom said: What is the easy way to set the code page in a console?
It sets up automatically on installation of your local version of Windows.
_tom said: How do you work with unicode-text on a Windows computer?
Using Notebook, HippoEDIT, OpenOffice Writer, FAR. No problem.
Regards
kinz
10. Re: (rant) unicode-text
- Posted by kinz Nov 01, 2017
- 2324 views
- Last edited Nov 05, 2017
Sorry, this was the doubled message #9.