String

Character and String

Euphoria is easy to use for programming with text-based data.

All of the usual Euphoria routines and operations work on text just like they work on numbers. There are also libraries of routines designed to make working with text easy: string-centric routines and regular expressions.

Character                    String
'd' '3' '-' '&' 'f' 'G'      "Hello world"
                             "this is a string"

Input                        Output
gets()                       puts()

Character

A character is a single symbol, such as a letter, digit, punctuation mark, or dingbat, that we use for written communication.

An individual character may be written using single quote ' delimiters:

'a'    'A'    '['    '#' 

They may be assigned to either an integer or atom:

atom char = 'a' 
integer pound = '#' 

There is no special "character data-type" in Euphoria. The standard ASCII chart assigns a number to each character. These number values are used in Euphoria to represent characters.

Euphoria converts all character values to their numeric equivalents; only number values are stored:

? 'a' 
    -- 97 appears, not 'a' 
? 'A' 
    -- 65 appears, not 'A' 

It is easy to display character values using puts():

puts(1, 'a' ) 
    -- a    <-- appears         
puts(1, 'A' ) 
    -- A    <-- appears 

There is no automatic way to distinguish the value 97 intended to be the number 'ninety-seven' and the value 'a' intended to be the character a. All values are numbers.

include std/console.e 
display( 'a' ) 
display( 97  ) 
    -- output for both examples is: 
    -- 97 
    -- 97 

Therefore 'B' is just a notation that is equivalent to typing 66. There are no "characters" in Euphoria, just numbers (atoms).

Values representing characters may be manipulated and operated on just like any other numerical value, because they are numerical values.
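As a minimal sketch of that idea, ordinary arithmetic can be applied to character values. The variable names below are made up for illustration, and the example assumes plain ASCII letters:

```euphoria
integer ch = 'a'                      -- stored as the number 97
integer next_ch = ch + 1              -- 98, the code for 'b'
integer upper_ch = ch - ('a' - 'A')   -- 65, the code for 'A'

puts(1, next_ch)     -- b
puts(1, upper_ch)    -- A
puts(1, '\n')
```

Subtracting the distance between 'a' and 'A' works only because the ASCII chart places the lowercase and uppercase letters in the same order.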

Character atoms combine to make string sequences. Both examples represent the same Euphoria sequence:

{ 'H','e','l','l','o',' ','W','o','r','l','d' } 
"Hello World" 

Escaped Characters

Special characters may be entered using a back-slash:

Code Meaning
\n newline
\r carriage return
\t tab
\\ backslash
\" double quote
\' single quote
\0 null
\e escape
\E escape
\bd..d A binary-coded value: the \b is followed by one or more binary digits. Inside strings, use a space character to mark the end of a binary value.
\xhh A 2-hex-digit value: "\x5F" ==> {95}
\uhhhh A 4-hex-digit value: "\u2A7C" ==> {10876}
\Uhhhhhhhh An 8-hex-digit value: "\U8123FEDC" ==> {2166619868}

For example, "Hello, World!\n", or '\\'. The Euphoria editor displays character strings in green.

Sometimes the special characters are described as "non-printing characters" because, while they control the layout of a display, nothing appears when they are output.

Note that you can use the underscore character '_' inside the \b, \x, \u, and \U values to aid readability:

\U8123_FEDC   -- as written using spacer _  
{2166619868}  -- value as stored 

String

A string is a sequence of character values. There is no special "string data-type" in Euphoria.

A string sequence may be written using double-quote " delimiters:

"ABCDEFG" 

A string is just like any other sequence in Euphoria. Each character in a string is stored as its numerical value. Strings may be manipulated and operated on the same way as all other sequences in Euphoria.

The string "ABCDEFG" is entirely equivalent to the sequence:

{65, 66, 67, 68, 69, 70, 71} 

puts(1, "ABCDEFG" ) 
   -- ABCDEFG  <-- appears on output 
print(1, "ABCDEFG" ) 
   -- {65,66,67,68,69,70,71} <-- appears on output 

A quoted string is really just a convenient notation that saves you from having to type in all the ASCII codes. It follows that "" is equivalent to {}. Both represent the sequence of zero length, also known as the empty sequence. As a matter of programming style, it is natural to use "" to suggest a zero length sequence of characters, and {} to suggest some other kind of sequence.
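The equivalence can be checked directly with the built-in equal() and length() routines. This is only an illustrative sketch; the variable name word is made up for the example:

```euphoria
sequence word = "ABCDEFG"

? equal(word, {65, 66, 67, 68, 69, 70, 71})   -- 1 (true): the same sequence
? length(word)                                -- 7
? equal("", {})                               -- 1 (true): both are the empty sequence
? length("")                                  -- 0
```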

An individual character is an atom. It must be entered using single quotes. There is a difference between an individual character (which is an atom), and a character string of length one (which is a sequence):

'B'   -- equivalent to the atom 66 -- the ASCII code for B 
"B"   -- equivalent to the sequence {66} 

Keep in mind that an atom is not equivalent to a one-element sequence containing the same value, although there are a few built-in routines that choose to treat them similarly.
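A short sketch of the difference, using the built-in atom(), sequence(), and equal() routines. The variable names are hypothetical:

```euphoria
object one_char = 'B'     -- the atom 66
object one_string = "B"   -- the sequence {66}

? atom(one_char)                   -- 1
? sequence(one_char)               -- 0
? atom(one_string)                 -- 0
? sequence(one_string)             -- 1
? length(one_string)               -- 1
? equal(one_char, one_string)      -- 0: an atom never equals a sequence
? equal(one_char, one_string[1])   -- 1: the first element is the atom 66
```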

Some routines are able to make an intelligent guess as to whether a sequence is intended to be a string as opposed to a numerical sequence:

include std/console.e 

? "Hello World" 
    -- {72,101,108,108,111,32,87,111,114,108,100}  <-- appears 
    -- recognize that all sequences are numeric 

display( "Hello World" ) 
    -- Hello World  <-- appears 
    -- string appears as expected 

display( {72,101,108,108,111,32,87,111,114,108,100} ) 
    -- Hello World  <-- appears 
    -- appears as a string since all element values are character values 

display( {72,101,108,108,111,32,87,111,114,108,100.1} ) 
    -- {72,101,108,108,111,32,87,111,114,108,100.1}  <-- appears 
    -- last element in the sequence is not a character value, 
    -- so the sequence is output as a numerical sequence 

Hint
Escaped characters may be written directly into a string, for characters not available on the keyboard, and to control the layout of the string:

puts(1, "This sentence\nis displayed\nover three lines" ) 
-- 
-- the '\n' escaped character creates line breaks 
--  
--This sentence 
--is displayed 
--over three lines 

In a real and practical program it is possible to input, create, manipulate, and then finally output strings, all without ever having to consider their numerical basis. Euphoria lets you think in terms of 'values' rather than specialized 'data-types'. String values can be manipulated just like any other sequence. This generic quality of Euphoria makes programming simple and easy.
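A rough sketch of such a program follows. It assumes the trim() and upper() routines from the standard library file std/text.e; the helper make_greeting() is a hypothetical name invented for this example:

```euphoria
include std/text.e     -- trim() and upper(), assumed from the standard library

-- hypothetical helper: build a greeting from a raw input line
function make_greeting(sequence raw_line)
    sequence name = trim(raw_line)   -- strip surrounding whitespace, including '\n'
    return "Hello, " & upper(name) & "!\n"
end function

puts(1, "What is your name? ")
object line = gets(0)    -- a string, or the atom -1 at end of input
if sequence(line) then
    puts(1, make_greeting(line))
end if
```

Nothing in the code refers to character codes, even though every value involved is a sequence of numbers.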




Character Strings and Individual Characters

A string in Euphoria is just a sequence of characters. That means that individual characters may be indexed and manipulated just like the elements of any other sequence.
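For instance, subscripts and slices work on a string exactly as they do on a numerical sequence. The sketch below uses a made-up variable s:

```euphoria
sequence s = "Hello World"

puts(1, s[1])          -- H      (one element, an atom)
puts(1, '\n')
puts(1, s[7..11])      -- World  (a slice, itself a sequence)
puts(1, '\n')

s[1] = 'J'             -- replace a single character
puts(1, s & "\n")      -- Jello World

s[7..11] = "Earth"     -- replace a slice with a string of the same length
puts(1, s & "\n")      -- Jello Earth
```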

To make working with strings easy, there are a variety of ways to enter string values into a sequence:

Delimiter       Notation                   Example
Left   Right
"      "        double-quotes              "ABCDEFG"
`      `        back-quotes                `ABCDEFG`
"""    """      three double-quotes        """ABCDEFG"""
b"     "        binary byte strings        b"1001 00110110 0110_0111 1_0101_1010"  -- ==> {#9,#36,#67,#15A}
x"     "        hexadecimal byte strings   x"65 66 67 AE"  -- ==> {#65,#66,#67,#AE}

The rules for double-quote strings

  1. They begin and end with a double-quote " character
  2. They cannot contain a literal double-quote
  3. They must be only on a single line
  4. They cannot contain the TAB character
  5. If they contain the back-slash '\' character, that character must immediately be followed by one of the special escape codes. The back-slash and escape code will be replaced by the appropriate single character equivalent.

If you need to include double-quote, end-of-line, back-slash, or TAB characters inside a double-quoted string, you need to enter them using the escape codes.

Examples:

"Bill said\n\t\"This is a back-slash \\ character\".\n" 

When displayed, this should look like:

Bill said 
    "This is a back-slash \ character". 

The rules for raw strings

  1. Enclose with three double-quotes """...""" or back-quotes `...`
  2. The resulting string will never have any carriage-return characters in it.
  3. If the resulting string begins with a new-line, the initial new-line is removed and any trailing new-line is also removed.
  4. A special form is used to automatically remove leading whitespace from the source code text. You might code this form to align the source text for ease of reading. If the first line after the raw string start token begins with one or more underscore characters, the number of consecutive underscores signifies the maximum number of whitespace characters that will be removed from each line of the raw string text. The underscores represent an assumed left margin width. Note, these leading underscores do not form part of the raw string text.

Examples:

-- No leading underscores and no leading whitespace 
` 
Bill said 
    "This is a back-slash \ character". 
` 

When displayed, this should look like:

Bill said 
    "This is a back-slash \ character". 

-- No leading underscores but leading whitespace 
` 
   Bill said 
      "This is a back-slash \ character". 
` 

When displayed, this should look like:

   Bill said 
      "This is a back-slash \ character". 

-- Leading underscores and leading whitespace 
` 
_____Bill said 
         "This is a back-slash \ character". 
` 

When displayed, this should look like:

Bill said 
    "This is a back-slash \ character". 

Extended string literals are useful when the string contains new-lines, tabs, or back-slash characters because they do not have to be entered in the special manner. The back-quote form can be used when the string literal contains a set of three double-quote characters, and the triple-quote form can be used when the string literal contains back-quote characters. If a literal contains both a back-quote and a set of three double-quotes, you will need to concatenate two literals.

object TQ, BQ, QQ 
TQ = `This text contains """ for some reason.` 
BQ = """This text contains a back quote ` for some reason.""" 
QQ = """This text contains a back quote ` """ & `and """ for some reason.` 

The rules for binary strings

  1. They begin with the pair b" and end with a double-quote " character
  2. They can only contain binary digits (0-1), and space, underscore, tab, newline, carriage-return. Anything else is invalid.
  3. An underscore is simply ignored, as if it was never there. It is used to aid readability.
  4. Each set of contiguous binary digits represents a single sequence element
  5. They can span multiple lines
  6. The non-digits are treated as punctuation and used to delimit individual values.

Examples:

b"1 10 11_0100 01010110_01111000" == {0x01, 0x02, 0x34, 0x5678} 

The rules for hexadecimal strings

  1. They begin with the pair x" and end with a double-quote " character
  2. They can only contain hexadecimal digits (0-9 A-F a-f), and space, underscore, tab, newline, carriage-return. Anything else is invalid.
  3. An underscore is simply ignored, as if it was never there. It is used to aid readability.
  4. Each pair of contiguous hex digits represents a single sequence element with a value from 0 to 255
  5. They can span multiple lines
  6. The non-digits are treated as punctuation and used to delimit individual values.

Examples:

x"1 2 34 5678_AbC" == {0x01, 0x02, 0x34, 0x56, 0x78, 0xAB, 0x0C} 

When you put too many hex characters together they are split up appropriately for you:

x"656667AE"  -- 8-bit  ==> {#65,#66,#67,#AE} 