NewDocs_Strings_One

Text Data

Text is "data composed of characters or strings of characters."

A character is "a symbol used to represent human language and written communication." Unicode is "the standard for encoding, representing, and handling text representing most of the world's writing systems." A code point is "a unique number that identifies a Unicode value for a character."

Code point values are not used to represent text directly; instead code points are encoded into a binary form that is actually used by computer software. An encoding is "a particular standard used to represent a Unicode code point in binary form."

UTF stands for "Unicode Transformation Format." The UTF-8 format "encodes Unicode code points into eight-bit (one byte) values; from one to four bytes are needed to represent a single code point." UTF-8 is a variable length encoding because code points are not represented by a fixed number of bytes. UTF-8 is used by internet web pages and unix computer systems, and it is the de facto standard for Unicode encoding. Windows and OSX use UTF-16 internally in many of their native interfaces; UTF-16 is not compatible with UTF-8.
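To make the "one to four bytes" rule concrete, here is a minimal sketch of how a code point could be turned into UTF-8 byte values. The function name utf8_encode is hypothetical (it is not a standard library routine), and the sketch assumes the argument is a valid code point; it does no error checking.

```euphoria
-- utf8_encode is a hypothetical helper, not part of the standard library.
-- It returns the UTF-8 byte values for one Unicode code point.
function utf8_encode(integer cp)
    if cp < #80 then
        -- one byte: identical to the ASCII value
        return {cp}
    elsif cp < #800 then
        -- two bytes: 110xxxxx 10xxxxxx
        return {#C0 + floor(cp / #40),
                #80 + and_bits(cp, #3F)}
    elsif cp < #10000 then
        -- three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return {#E0 + floor(cp / #1000),
                #80 + and_bits(floor(cp / #40), #3F),
                #80 + and_bits(cp, #3F)}
    else
        -- four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return {#F0 + floor(cp / #40000),
                #80 + and_bits(floor(cp / #1000), #3F),
                #80 + and_bits(floor(cp / #40), #3F),
                #80 + and_bits(cp, #3F)}
    end if
end function

print(1, utf8_encode('A'))      -- {65}
puts(1, '\n')
print(1, utf8_encode(#20AC))    -- {226,130,172} (the Euro symbol)
puts(1, '\n')
```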

Plain text is written using characters "limited to one byte UTF-8 encodings in the range 0 to 127." These encodings are identical to legacy ASCII values. Computer languages, like Euphoria, are often written in plain text. If you limit yourself to plain text then a character is "one byte" and a string is "a sequence of one byte characters."

A grapheme is "an idealized (or abstract) representation of a Unicode character." A glyph is "what you actually see when a Unicode character is displayed." The distinction is you can think about the one grapheme which is the letter 'A' or you can recognize the same letter as part of a family of symbols produced by handwriting, printers, screens, and artist's renderings.

Run the demonstration program /euphoria/demo/ascii.ex to display a chart of integer values and glyphs for each character. On a unix system the chart includes values from 0 to 127. On a windows system you will see characters that go up to 254; these extra characters are called extended ASCII. There are conflicting standards that describe which characters are included in the extended ASCII range. This is why plain text includes just the original ASCII characters and why the one byte UTF-8 encodings cover only the original ASCII characters.

Euphoria does not have a character or string data-type; the built-in data-types are used to represent all text based data.

You can assign a one byte encoded character to an integer or atom variable. These values include plain text, legacy ASCII, and one byte UTF-8 encodings.

An arbitrary UTF-8 character encoding must be assigned to an object because some character values are a sequence of two to four bytes.

A string is "a sequence of byte values; it is always a flat sequence."

A string containing plain text (legacy ASCII) character values is simple. Each item is represented by a small integer value. Each item represents just one character (one Unicode code point). The length of a sequence is identical to the number of characters; indexing individual characters is direct and easy. The routines that work with plain text characters and strings are simple.

A string containing UTF-8 character values is no longer simple. A string is still a flat sequence of byte values. Because of variable length encoding you can no longer assume that any one item represents a character. The length of the sequence could be longer than the number of characters (Unicode code points). To index an individual character you first need an algorithm that finds the boundaries of the variable length encoded characters. Not all routines will recognize UTF-8 encoding. The string:upper function works on plain text but does not yet work with UTF-8 encoded text. The standard library regular expression routines work equally well with plain and UTF-8 encoded text.
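The difference between the length of the sequence and the number of code points can be sketched as follows. The function name utf8_length is hypothetical, and the sketch assumes the input is well-formed UTF-8. It relies on the fact that every continuation byte of a multi-byte encoding has the bit pattern 10xxxxxx, so counting the items that are not continuation bytes counts the code points.

```euphoria
-- utf8_length is a hypothetical helper, not part of the standard library.
-- It counts code points in a well-formed UTF-8 encoded string.
function utf8_length(sequence s)
    integer count = 0
    for i = 1 to length(s) do
        -- skip continuation bytes (bit pattern 10xxxxxx)
        if and_bits(s[i], #C0) != #80 then
            count += 1
        end if
    end for
    return count
end function

-- the UTF-8 byte values of the Euro symbol, written as numbers
sequence euro = {226,130,172}
printf(1, "%d items, %d code point\n", {length(euro), utf8_length(euro)})
    --> 3 items, 1 code point
```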

Text Input

A literal text value is written as an assignment of a glyph or glyphs to a variable. Euphoria always encodes each glyph as a number (or sequence of numbers). In effect the identity of the "glyph" is lost after the encoding; Euphoria does not have character and string data-types. The conversion of glyphs to numbers is automatic.

Euphoria uses a few characters in a special way for its own syntax:

Symbol  Name            Integer  Hexadecimal
'       apostrophe      39       27
"       quotation mark  34       22
\       backslash       92       5C
`       grave-accent    96       60

Care is needed to distinguish these characters when used for Euphoria syntax and when used for text data.

Character

A plain text character is "a character encoded in one byte ASCII or one byte UTF-8 encoding." A plain text character can be assigned to an integer (atom, or object). The single plain character glyph is written enclosed between apostrophe ' delimiters. The delimiters are just syntax and are not part of the character:

integer chr = 'x' 
    -- chr is the letter x 
    -- encoded as the integer 120 
     
atom c = '*' 
    -- c is the symbol * 
    -- encoded as the integer 42 
     

Assignment of other text characters can be operating system specific.

When assigning a character using windows:

  • A windows console will recognize extended ASCII characters.
  • A windows console will not recognize UTF-8 encoded characters.

When assigning a character using unix:

  • A unix terminal will not recognize extended ASCII characters.
  • A unix terminal will recognize all UTF-8 encoded characters.

You must remember that a UTF-8 encoding can be from one to four bytes long. That means that characters, other than those of plain text, are sequences and must be assigned to a sequence or object variable.

object chr = "€" 
    -- chr is the Euro symbol € 
    -- chr is the sequence {226,130,172} 

Using apostrophe delimiters with a UTF-8 encoded character produces an error:

object foo = '€' 
    --> error 
    -- character constant is missing a closing ' 
    -- object foo = '€' 
    --                ^ 

String

A literal string is written as a list of glyphs enclosed by quotation mark " delimiters. The delimiters are just syntax and are not part of the string.

If a string is composed of plain text characters:

sequence s = "Riki is a mongoose." 
    --> the string is 
    -- Riki is a mongoose. 

The encoded string becomes:

{82'R',105'i',107'k',105'i',32' ',105'i',115's',32' ',97'a',32' ',109'm', 
111'o',110'n',103'g',111'o',111'o',115's',101'e',46'.'} 

This output was produced by the pretty:pretty_print procedure. It has a feature that shows the item value decorated by the character grapheme. The decorated 82'R' is the encoded character 82.

All strings are always just a sequence of integers:

{82,105,107,105,32,105,115,32,97,32,109,111,110,103,111,111,115,101,46} 

Each plain text character is converted to a single integer value.

With plain text strings:

  • The length of the string is equal to the number of code points.
  • There is a 1:1 correspondence of items and code points.
  • The result is the same on windows and unix.
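
These properties of plain text strings can be checked directly. This is a small sketch that uses only the built-in length, printf, and puts routines:

```euphoria
sequence s = "Riki is a mongoose."

-- one item per character
printf(1, "%d\n", {length(s)})   --> 19

-- indexing gives the encoded character directly
printf(1, "%d\n", {s[1]})        --> 82
puts(1, s[1])                    --> R
puts(1, '\n')
```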

If a string includes a UTF-8 encoded character:

sequence str = "Is there a €2 banknote?" 

On a unix terminal the string is output in a decorated style:

{73'I',115's',32' ',116't',104'h',101'e',114'r',101'e',32' ',97'a',32' ', 
226,130,172,50'2',32' ',98'b',97'a',110'n',107'k',110'n',111'o',116't',101'e', 
63'?'}    

The actual string is encoded as:

{73,115,32,116,104,101,114,101,32,97,32,226,130,172,50,32,98,97,110,107,110,111,116,101,63}

Each plain text character has one integer value, while the Euro symbol appears within the sequence as the three byte UTF-8 encoding 226,130,172.

With UTF-8 encoding:

  • The length of the sequence may be longer than the number of code points.
  • You can no longer depend on a 1:1 indexing of items and code points.
  • A windows console will not recognize UTF-8 encoded strings.
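
One way to recover character-by-character indexing is to split the flat byte sequence into one sub-sequence per code point. The following is a minimal sketch; the function name utf8_split is hypothetical and the input is assumed to be well-formed UTF-8.

```euphoria
-- utf8_split is a hypothetical helper, not part of the standard library.
-- It groups the bytes of a well-formed UTF-8 string by code point.
function utf8_split(sequence s)
    sequence chars = {}
    for i = 1 to length(s) do
        if and_bits(s[i], #C0) = #80 and length(chars) > 0 then
            -- continuation byte: belongs to the current character
            chars[$] &= s[i]
        else
            -- lead byte or plain text byte: starts a new character
            chars = append(chars, {s[i]})
        end if
    end for
    return chars
end function

-- the bytes of "€2", written as numbers
sequence parts = utf8_split({226,130,172,50})

printf(1, "%d characters\n", {length(parts)})   --> 2 characters
print(1, parts[1])                              --> {226,130,172}
puts(1, '\n')
print(1, parts[2])                              --> {50}
puts(1, '\n')
```

After splitting, parts[n] is the encoding of the n-th code point, at the cost of no longer having a flat string.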

String Literal

You can assign text to a variable using the same syntax common to many languages:

  • You can write one line of text enclosed by single quotation mark " delimiters:

-- a simple string 
sequence greet = "Hello Euphoria \n" 

  • You can write several lines of text enclosed by triple quotation mark """ delimiters:

-- a fancy string 
sequence cowsay =  
""" 
< Hello, bovine Euphoria!  > 
 -----------------------  
        \   ^__^ 
         \  (oo)\_______ 
            (__)\       )\/\ 
                ||----w | 
                ||     || 
""" 

  • You can explore the many flexible ways of assigning text data:

Assignment  Style                  Enclosing  Example
Direct      integer                { }        {104,101,108,108,111}
                                              { 'h','e','l','l','o'}
            binary                 b" "       b""
            hexadecimal            x" "       x""
Cooked      apostrophe             ' '        '*'
                                              '\n'
            single quotation mark  " "        "hello \n"
Raw         triple quotation mark  """ """    """hello"""
            grave-accent           ` `        `hello`

Regardless of the input style you choose, the result is always the same; you cannot tell how the string was input. Euphoria always automatically converts text data into encoded numbers, and a string is always a flat sequence of integers. Pick the input method that seems convenient at the moment.
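
As a quick check of this claim, the same five characters can be entered in several styles and compared with the built-in equal function. This is only a sketch; the variable names are arbitrary.

```euphoria
sequence as_integers = {104,101,108,108,111}
sequence as_glyphs   = {'h','e','l','l','o'}
sequence as_hex      = x"68 65 6C 6C 6F"
sequence as_cooked   = "hello"
sequence as_raw      = `hello`

if equal(as_integers, as_glyphs)
and equal(as_glyphs, as_hex)
and equal(as_hex, as_cooked)
and equal(as_cooked, as_raw) then
    puts(1, "all five are the same sequence\n")
end if
```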

ASCII chart (column = high hexadecimal digit, row = low hexadecimal digit; escaped notation is shown beside the character):

     0        1        2      3   4   5      6   7
0    NUL \0            SP     0   @   P      `   p
1                      !      1   A   Q      a   q
2                      " \"   2   B   R      b   r
3                      #      3   C   S      c   s
4                      $      4   D   T      d   t
5                      %      5   E   U      e   u
6                      &      6   F   V      f   v
7                      ' \'   7   G   W      g   w
8                      (      8   H   X      h   x
9    TAB \t            )      9   I   Y      i   y
A    NL \n             *      :   J   Z      j   z
B             ESC \e   +      ;   K   [      k   {
C                      ,      <   L   \ \\   l   |
D    CR \r             -      =   M   ]      m   }
E                      .      >   N   ^      n   ~
F                      /      ?   O   _      o   DEL

Note that a few characters require escaped notation (like \n) when written in a cooked string. In a raw string they are entered directly from the keyboard.
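
The difference between cooked and raw strings shows up in the length of the result. In this small sketch the same four keystrokes are entered both ways:

```euphoria
-- cooked: \n is converted to a single newline character (10)
sequence cooked = "a\nb"

-- raw: the backslash and the letter n are kept as two characters
sequence raw = `a\nb`

printf(1, "cooked has %d items\n", {length(cooked)})   --> cooked has 3 items
printf(1, "raw has %d items\n", {length(raw)})         --> raw has 4 items
```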

In this ASCII chart, columns 0 and 1 are control or nonprinting characters. A control character is "intended for the control of hardware; there is no assigned grapheme or glyph." Note that most of the available control characters are no longer used. A nonprinting character is "a synonym for a control character emphasizing that nothing is printed for that character."

The whitespace characters are TAB, NL, CR, and SP. A whitespace character "outputs empty space; ironically to print 'nothing' you still have to print 'some' character."

Direct String

In a direct string assignment "all characters are entered as numbers immediately."

You can create a direct string by writing each item explicitly in numerical form:

-- as glyph 
puts(1, {'H','e','l','l','o','!'} ) 
 
-- as integer 
puts(1, {72,101,108,108,111,33} ) 
 
-- as binary 
puts(1, { 0b01001000, 0b01100101, 0b01101100, 0b01101100, 0b01101111, 0b00100001 } ) 
 
-- as octal 
puts(1, {0t110, 0t145, 0t154, 0t154, 0t157, 0t41} )  
 
-- as hexadecimal 
puts(1, { 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x21 } ) 
 
    ------> all display the same thing 
     
        ------> Hello! 

A binary string is "a shortcut notation for writing a sequence of binary values."

If the sequence is { 0b01001000, 0b01100101, 0b01101100, 0b01101100, 0b01101111, 0b00100001 } then a binary string is easier to type:

b"01001000 01100101 01101100 01101100 01101111 00100001" 

You can display a binary string as a sequence:

? b"01001000 01100101 01101100 01101100 01101111 00100001" 
    --> {72,101,108,108,111,33} 

or as a string:

puts(1, b"01001000 01100101 01101100 01101100 01101111 00100001" ) 
    --> Hello! 

A binary string:

  • Is written enclosed by opening b" and closing " delimiters.
  • The binary digits are 0 and 1 and may use underscore _ as a spacer.
  • The characters space, tab, newline, or carriage return separate binary numbers from each other.

Note that for binary strings:

  • Underscores can be used anywhere to help readability.
  • Strings can be written over several lines.

A hexadecimal string is "a shortcut notation for writing a sequence of hexadecimal values."

If the sequence is { 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x21 } then a hexadecimal string is easier to type:

x"48 65 6C 6C 6F 21"

Again, this is the string Hello!

A hexadecimal string:

  • Is written enclosed by opening x" and closing " delimiters.
  • The hexadecimal digits are the characters 0..9, A..F, a..f and may use underscore _ as a spacer.
  • The characters space, tab, newline, or carriage return separate hexadecimal numbers from each other.
  • Every two digits is one hexadecimal character value.

Note that for hexadecimal strings:

  • Underscores can be used anywhere to help readability.
  • Strings can be written over several lines.
  • Hexadecimal strings can be written without any spacers at all.

Each pair of hexadecimal digits is a one byte value. That means that you can leave all spacers out of a hexadecimal string: x"48 65 6C 6C 6F 21" is the same as x"48656C6C6F21":

puts(1, x"48656C6C6F21" ) 
    --> Hello! 