8.20 Regular Expressions

8.20.1 Introduction

Regular expressions in Euphoria are based on the PCRE (Perl Compatible Regular Expressions) library created by Philip Hazel.

This document details the Euphoria interface to regular expressions rather than regular expression syntax, which is a complex subject that many books have been written about. Here are a few good resources online that can help while learning regular expressions.

8.20.2 General Use

Many functions take an optional options parameter. This parameter can be a single option constant (see Option Constants), multiple option constants or'ed together into a single atom, or a sequence of options, in which case the function takes care of or'ing them together correctly. Options are named like their C equivalents with the 'PCRE_' prefix stripped off; namespaces disambiguate the symbols, so the prefix is not needed.
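As a rough sketch of the three accepted forms (the pattern and subject string are made up for illustration), each of the last two calls to new below should compile the same case-insensitive, multi-line pattern:

```euphoria
include std/regex.e as re

-- 1. A single option constant.
object r_one = re:new("^hello", re:CASELESS)

-- 2. Several constants or'ed together into one atom (or_bits is a built-in).
object r_bits = re:new("^hello", or_bits(re:CASELESS, re:MULTILINE))

-- 3. A sequence of constants; the library or's them together for you.
object r_seq = re:new("^hello", {re:CASELESS, re:MULTILINE})

-- Forms 2 and 3 are meant to be equivalent, so both should report a match
-- on the second line of the subject.
integer bits_hit = re:has_match(r_bits, "first line\nHELLO")
integer seq_hit  = re:has_match(r_seq,  "first line\nHELLO")
printf(1, "%d %d\n", {bits_hit, seq_hit})
```

The sequence form is usually the easiest to read when more than two options are combined.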

All strings passed into this library must be either 8-bit-per-character strings or UTF8 strings, which use multiple bytes to encode UNICODE characters. You can use UTF8 encoded UNICODE strings when you pass the UTF8 option.

8.20.3 Option Constants

8.20.3.1 Compile Time and Match Time

When a regular expression object is created via new, we also say it gets "compiled." The options you may use for this are called "compile time" option constants. Once the regular expression is created, you can use the other functions that take this regular expression and a string. These routines' options are called "match time" option constants. To not set any options at all, do not supply the options argument or supply DEFAULT.

Compile Time Option Constants

The only options that may be set at "compile time" (that is, passed to new) are ANCHORED, AUTO_CALLOUT, BSR_ANYCRLF, BSR_UNICODE, CASELESS, DEFAULT, DOLLAR_ENDONLY, DOTALL, DUPNAMES, EXTENDED, EXTRA, FIRSTLINE, MULTILINE, NEWLINE_CR, NEWLINE_LF, NEWLINE_CRLF, NEWLINE_ANY, NEWLINE_ANYCRLF, NO_AUTO_CAPTURE, NO_UTF8_CHECK, UNGREEDY, and UTF8.

Match Time Option Constants

Options that may be set at "match time" are ANCHORED, NEWLINE_CR, NEWLINE_LF, NEWLINE_CRLF, NEWLINE_ANY, NEWLINE_ANYCRLF, NOTBOL, NOTEOL, NOTEMPTY, and NO_UTF8_CHECK. Routines that take match time option constants match, split or replace a regular expression against some string.
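The split between the two groups can be illustrated with a minimal sketch (the pattern and subject are invented for the example): CASELESS is given to new, while ANCHORED is given to find.

```euphoria
include std/regex.e as re

-- CASELESS is a compile time option, so it goes to new.
constant re_word = re:new("[a-z]+", re:CASELESS)

-- No match time options: the search may succeed anywhere in the string.
object anywhere = re:find(re_word, "123 ABC")

-- ANCHORED at match time: the match must begin at the starting position.
object anchored = re:find(re_word, "123 ABC", 1, re:ANCHORED)

-- find returns a sequence on success and an atom otherwise.
printf(1, "%d %d\n", {sequence(anywhere), sequence(anchored)})
```

The first search should succeed and the second should fail, because the subject begins with digits.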

8.20.3.2 ANCHORED

public constant ANCHORED

Forces a match to start only at the first position where the search is asked to begin. In C, this is called PCRE_ANCHORED. This is passed to all routines including new.

8.20.3.3 AUTO_CALLOUT

public constant AUTO_CALLOUT

In C, this is called PCRE_AUTO_CALLOUT. To get the functionality of this flag in EUPHORIA, you can use find_replace_callback without passing this option. This is passed to new.

8.20.3.4 BSR_ANYCRLF

public constant BSR_ANYCRLF

With this option only ASCII new line sequences are recognized as newlines. Other UNICODE newline sequences (encoded as UTF8) are not recognized as an end of line marker. This is passed to all routines including new.

8.20.3.5 BSR_UNICODE

public constant BSR_UNICODE

With this option any UNICODE new line sequence is recognized as a newline. The UNICODE will have to be encoded as UTF8, however. This is passed to all routines including new.

8.20.3.6 CASELESS

public constant CASELESS

This will make your regular expression matches case insensitive. With this flag for example, [a-z] is the same as [A-Za-z]. This is passed to new.

8.20.3.7 DEFAULT

public constant DEFAULT

This is a value used for not setting any flags at all. This can be passed to all routines including new.

8.20.3.8 DFA_SHORTEST

public constant DFA_SHORTEST

This is NOT used by any standard library routine.

8.20.3.9 DFA_RESTART

public constant DFA_RESTART

This is NOT used by any standard library routine.

8.20.3.10 DOLLAR_ENDONLY

public constant DOLLAR_ENDONLY

If this bit is set, a dollar sign metacharacter in the pattern matches only at the end of the subject string. Without this option, a dollar sign also matches immediately before a newline at the end of the string (but not before any other newlines). Thus you must include the newline character in the pattern before the dollar sign if you want to match a line that contains a newline character. The DOLLAR_ENDONLY option is ignored if MULTILINE is set. There is no way to set this option within a pattern. This is passed to new.

8.20.3.11 DOTALL

public constant DOTALL

With this option the '.' character also matches a newline sequence. This is passed to new.
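A small sketch of the difference, assuming the default newline sequence is LF (the pattern and subject are illustrative only):

```euphoria
include std/regex.e as re

constant subject = "a\nb"
constant re_plain  = re:new("a.b")
constant re_dotall = re:new("a.b", re:DOTALL)

-- Without DOTALL, '.' does not match the newline between 'a' and 'b'.
integer plain_hit  = re:has_match(re_plain,  subject)
-- With DOTALL, '.' also matches the newline.
integer dotall_hit = re:has_match(re_dotall, subject)
printf(1, "%d %d\n", {plain_hit, dotall_hit})
```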

8.20.3.12 DUPNAMES

public constant DUPNAMES

Allow duplicate names for named subpatterns. Since there is no way to access named subpatterns this flag has no effect. This is passed to new.

8.20.3.13 EXTENDED

public constant EXTENDED

Whitespace and characters beginning with a hash mark to the end of the line in the pattern will be ignored when searching except when the whitespace or hash is escaped or in a character class. This is passed to new.
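For illustration, here is a sketch of a commented pattern written with EXTENDED; the date format and variable names are assumptions made for the example, not part of the library:

```euphoria
include std/regex.e as re

-- With EXTENDED, the layout whitespace and the "# ..." comments are ignored.
constant re_date = re:new(`
    (\d{4})   # year
    -         # literal separator
    (\d{2})   # month
`, re:EXTENDED)

-- matches returns the full match followed by each captured group.
object parts = re:matches(re_date, "released 2011-05")
? parts
```

Here parts should hold the strings "2011-05", "2011" and "05".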

8.20.3.14 EXTRA

public constant EXTRA

When an alphanumeric character that follows a backslash (\) has no special meaning, an error is generated. This is passed to new.

8.20.3.15 FIRSTLINE

public constant FIRSTLINE

If FIRSTLINE is set, the match must happen before or at the first newline in the subject (though it may continue over the newline). In C, this is called PCRE_FIRSTLINE. This is passed to new.

8.20.3.16 MULTILINE

public constant MULTILINE

When MULTILINE is set, the "start of line" and "end of line" constructs match immediately following or immediately before internal newlines in the subject string, respectively, as well as at the very start and end. This is passed to new.
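The effect on the "start of line" construct can be sketched by counting matches with and without the option (the subject text is made up for the example):

```euphoria
include std/regex.e as re

constant subject_text = "one\ntwo\nthree"
constant re_single = re:new("^[a-z]+")
constant re_multi  = re:new("^[a-z]+", re:MULTILINE)

-- Without MULTILINE, ^ matches only at the very start of the subject.
integer single_count = length(re:find_all(re_single, subject_text))
-- With MULTILINE, ^ also matches right after each internal newline.
integer multi_count  = length(re:find_all(re_multi, subject_text))
printf(1, "%d %d\n", {single_count, multi_count})
```

The first count should be 1 and the second 3, one per line of the subject.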

8.20.3.17 NEWLINE_CR

public constant NEWLINE_CR

Sets CR as the NEWLINE sequence. The NEWLINE sequence will match $ when MULTILINE is set. This is passed to all routines including new.

8.20.3.18 NEWLINE_LF

public constant NEWLINE_LF

Sets LF as the NEWLINE sequence. The NEWLINE sequence will match $ when MULTILINE is set. This is passed to all routines including new.

8.20.3.19 NEWLINE_CRLF

public constant NEWLINE_CRLF

Sets CRLF as the NEWLINE sequence. The NEWLINE sequence will match $ when MULTILINE is set. This is passed to all routines including new.

8.20.3.20 NEWLINE_ANY

public constant NEWLINE_ANY

Sets ANY newline sequence as the NEWLINE sequence including those from UNICODE when UTF8 is also set. The string will have to be encoded as UTF8, however. The NEWLINE sequence will match $ when MULTILINE is set. This is passed to all routines including new.

8.20.3.21 NEWLINE_ANYCRLF

public constant NEWLINE_ANYCRLF

Sets ANY newline sequence from ASCII. The NEWLINE sequence will match $ when MULTILINE is set. This is passed to all routines including new.

8.20.3.22 NOTBOL

public constant NOTBOL

This indicates that the beginning of the passed string does NOT start at the Beginning Of a Line (NOTBOL), so a caret symbol (^) in the original pattern will not match the beginning of the string. This is used by routines other than new.

8.20.3.23 NOTEOL

public constant NOTEOL

This indicates that the end of the passed string does NOT end at the End Of a Line (NOTEOL), so a dollar sign ($) in the original pattern will not match the end of the string. This is used by routines other than new.

8.20.3.24 NO_AUTO_CAPTURE

public constant NO_AUTO_CAPTURE

Disables capturing subpatterns except when the subpatterns are named. This is passed to new.

8.20.3.25 NO_UTF8_CHECK

public constant NO_UTF8_CHECK

Turns off checking for the validity of your UTF8 string. Use this with caution: with this option, an invalid UTF8 string could crash your program. Only use this if you know the string is a valid UTF8 string. This is passed to all routines including new.

8.20.3.26 NOTEMPTY

public constant NOTEMPTY

Here matches of empty strings will not be allowed. In C, this is PCRE_NOTEMPTY. The pattern: `A*a*` will match "AAAA", "aaaa", and "Aaaa" but not "". This is used by routines other than new.
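A minimal sketch of the pattern above in use; the subject "123" is an assumption chosen because it contains neither 'A' nor 'a', so the only possible match is an empty one:

```euphoria
include std/regex.e as re

constant re_as = re:new("A*a*")

-- By default the pattern is allowed to match the empty string.
object empty_allowed = re:find(re_as, "123")
-- With NOTEMPTY the empty match is rejected, so nothing matches.
object empty_refused = re:find(re_as, "123", 1, re:NOTEMPTY)

-- find returns a sequence on success and an atom otherwise.
printf(1, "%d %d\n", {sequence(empty_allowed), sequence(empty_refused)})
```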

8.20.3.27 PARTIAL

public constant PARTIAL

This option has no effect on whether a match will occur or not. However, it does affect the error code generated by find in the event of a failure: if, for some pattern re and two strings s1 and s2, find( re, s1 & s2 ) would return a match but both find( re, s1 ) and find( re, s2 ) would not, then find( re, s1, 1, PARTIAL ) will return ERROR_PARTIAL rather than ERROR_NOMATCH. We say s1 has a partial match of re.

Note that find( re, s2, 1, PARTIAL ) will return ERROR_NOMATCH. In C, this constant is called PCRE_PARTIAL.

8.20.3.28 STRING_OFFSETS

public constant STRING_OFFSETS

This is used by matches and all_matches.

8.20.3.29 UNGREEDY

public constant UNGREEDY

This modifier sets the pattern such that quantifiers are not greedy by default, but become greedy if followed by a question mark.

This is passed to new.
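As an illustrative sketch (the tag-like subject string is invented for the example), the same pattern captures different text depending on UNGREEDY:

```euphoria
include std/regex.e as re

constant tags = "<a><b>"
constant re_greedy   = re:new("<.+>")
constant re_ungreedy = re:new("<.+>", re:UNGREEDY)

-- Greedy by default: .+ runs to the last '>' in the subject.
object greedy_hit   = re:matches(re_greedy, tags)
-- UNGREEDY: .+ stops at the first '>' it can.
object ungreedy_hit = re:matches(re_ungreedy, tags)
? greedy_hit
? ungreedy_hit
```

The first result should contain "<a><b>" and the second "<a>".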

8.20.3.30 UTF8

public constant UTF8

Makes strings passed in to be interpreted as a UTF8 encoded string. This is passed to new.

8.20.4 Error Constants

Error constants differ from their C equivalents as they do not have PCRE_ prepended to each name.

8.20.4.1 ERROR_NOMATCH

include std/regex.e
namespace regex
public constant ERROR_NOMATCH

There was no match found.

8.20.4.2 ERROR_NULL

include std/regex.e
namespace regex
public constant ERROR_NULL

There was an internal error in the EUPHORIA wrapper (std/regex.e in the standard include directory or be_regex.c in the EUPHORIA source).

8.20.4.3 ERROR_BADOPTION

include std/regex.e
namespace regex
public constant ERROR_BADOPTION

There was an internal error in the EUPHORIA wrapper (std/regex.e in the standard include directory or be_regex.c in the EUPHORIA source).

8.20.4.4 ERROR_BADMAGIC

include std/regex.e
namespace regex
public constant ERROR_BADMAGIC

The pattern passed is not a value returned from new.

8.20.4.5 ERROR_UNKNOWN_OPCODE

include std/regex.e
namespace regex
public constant ERROR_UNKNOWN_OPCODE

An internal error occurred either in the PCRE library EUPHORIA uses or in its wrapper.

8.20.4.6 ERROR_UNKNOWN_NODE

include std/regex.e
namespace regex
public constant ERROR_UNKNOWN_NODE

An internal error occurred either in the PCRE library EUPHORIA uses or in its wrapper.

8.20.4.7 ERROR_NOMEMORY

include std/regex.e
namespace regex
public constant ERROR_NOMEMORY

Out of memory.

8.20.4.8 ERROR_NOSUBSTRING

include std/regex.e
namespace regex
public constant ERROR_NOSUBSTRING

The wrapper or the PCRE backend didn't preallocate enough capturing groups for this pattern.

8.20.4.9 ERROR_MATCHLIMIT

include std/regex.e
namespace regex
public constant ERROR_MATCHLIMIT

Too many matches encountered.

8.20.4.10 ERROR_CALLOUT

include std/regex.e
namespace regex
public constant ERROR_CALLOUT

Not applicable to our implementation.

8.20.4.11 ERROR_BADUTF8

include std/regex.e
namespace regex
public constant ERROR_BADUTF8

The subject or pattern is not valid UTF8 but it was specified as such with UTF8.

8.20.4.12 ERROR_BADUTF8_OFFSET

include std/regex.e
namespace regex
public constant ERROR_BADUTF8_OFFSET

The offset specified doesn't start on a UTF8 character boundary but it was specified as UTF8 with UTF8.

8.20.4.13 ERROR_PARTIAL

include std/regex.e
namespace regex
public constant ERROR_PARTIAL

Pattern didn't match, but there is a partial match. See PARTIAL.

8.20.4.14 ERROR_BADPARTIAL

include std/regex.e
namespace regex
public constant ERROR_BADPARTIAL

PCRE backend doesn't support partial matching for this pattern.

8.20.4.15 ERROR_INTERNAL

include std/regex.e
namespace regex
public constant ERROR_INTERNAL

8.20.4.16 ERROR_BADCOUNT

include std/regex.e
namespace regex
public constant ERROR_BADCOUNT

The size parameter to find is less than minus 1.

8.20.4.17 ERROR_DFA_UITEM

include std/regex.e
namespace regex
public constant ERROR_DFA_UITEM

Not applicable to our implementation: the PCRE wrapper doesn't use DFA routines.

8.20.4.18 ERROR_DFA_UCOND

include std/regex.e
namespace regex
public constant ERROR_DFA_UCOND

Not applicable to our implementation: the PCRE wrapper doesn't use DFA routines.

8.20.4.19 ERROR_DFA_UMLIMIT

include std/regex.e
namespace regex
public constant ERROR_DFA_UMLIMIT

Not applicable to our implementation: the PCRE wrapper doesn't use DFA routines.

8.20.4.20 ERROR_DFA_WSSIZE

include std/regex.e
namespace regex
public constant ERROR_DFA_WSSIZE

Not applicable to our implementation: the PCRE wrapper doesn't use DFA routines.

8.20.4.21 ERROR_DFA_RECURSE

include std/regex.e
namespace regex
public constant ERROR_DFA_RECURSE

Not applicable to our implementation: the PCRE wrapper doesn't use DFA routines.

8.20.4.22 ERROR_RECURSIONLIMIT

include std/regex.e
namespace regex
public constant ERROR_RECURSIONLIMIT

Too much recursion used for match.

8.20.4.23 ERROR_NULLWSLIMIT

include std/regex.e
namespace regex
public constant ERROR_NULLWSLIMIT

This error isn't in the source code.

8.20.4.24 ERROR_BADNEWLINE

include std/regex.e
namespace regex
public constant ERROR_BADNEWLINE

Both BSR_UNICODE and BSR_ANYCRLF options were specified. These options are contradictory.

8.20.4.25 error_names

include std/regex.e
namespace regex
public constant error_names

8.20.5 Create/Destroy

8.20.5.1 regex

include std/regex.e
namespace regex
public type regex(object o)

Regular expression type

8.20.5.2 option_spec

include std/regex.e
namespace regex
public type option_spec(object o)

Regular expression option specification type

Although the functions do not use this type (they return an error instead), you can use this to check if your routine is receiving something sane.

8.20.5.3 option_spec_to_string

include std/regex.e
namespace regex
public function option_spec_to_string(option_spec o)

Converts an option spec to a string.

This can be useful for debugging what options were passed in. Without it you have to convert a number to hex and lookup the constants in the source code.

8.20.5.4 error_to_string

include std/regex.e
namespace regex
public function error_to_string(integer i)

Converts a regex error to a string.

This can be useful for debugging and even as something rough to give to the user in case of a regex failure. It's preferable to a number.

See Also:

error_message

8.20.5.5 new

include std/regex.e
namespace regex
public function new(string pattern, option_spec options = DEFAULT)

Return an allocated regular expression

Parameters:
  1. pattern : a sequence representing a human readable regular expression
  2. options : defaults to DEFAULT. See Compile Time Option Constants.
Returns:

A regex, which other regular expression routines can work on, or an atom to indicate an error. If an error occurs, you can call error_message to get a detailed error message.

Comments:

This is the only routine that accepts a human readable regular expression. The string is compiled and a regex is returned. Analyzing and compiling a regular expression is a costly operation and should not be done more often than necessary. For instance, if your application frequently looks for an email address among text, you should create the regular expression as a constant accessible to your source code and any files that may use it. That way, the regular expression is analyzed and compiled only once per run of your application.

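-- Bad Example: the pattern is recompiled on every pass through the loop
include std/regex.e as re

object line = gets(0) -- read lines from standard input
while sequence(line) do
    re:regex proper_name = re:new("[A-Z][a-z]+ [A-Z][a-z]+")
    -- has_match returns 1 or 0, which is safe to use as a condition
    if re:has_match(proper_name, line) then
        -- code
    end if
    line = gets(0)
end while

-- Good Example: the pattern is compiled once, as a constant
include std/regex.e as re
constant re_proper_name = re:new("[A-Z][a-z]+ [A-Z][a-z]+")

object line = gets(0) -- read lines from standard input
while sequence(line) do
    if re:has_match(re_proper_name, line) then
        -- code
    end if
    line = gets(0)
end while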
Example 1:
include std/regex.e as re
re:regex number = re:new("[0-9]+")
Note:

For simple matches, the built-in Euphoria routine eu:match and the library routine wildcard:is_match are often easier to use and a little faster. Regular expressions are faster for complex searching/matching.

See Also:

error_message, find, find_all

8.20.5.6 error_message

include std/regex.e
namespace regex
public function error_message(object re)

If new returns an atom, this function will return a text error message as to the reason.

Parameters:
  1. re: Regular expression to get the error message from
Returns:

An atom (0) when no error message exists, otherwise a sequence describing the error.

Example 1:
include std/regex.e
object r = regex:new("[A-Z[a-z]*")
if atom(r) then
  printf(1, "Regex failed to compile: %s\n", { regex:error_message(r) })
end if

8.20.6 Utility Routines

8.20.6.1 escape

include std/regex.e
namespace regex
public function escape(string s)

Escape special regular expression characters that may be entered into a search string from user input.

Notes:
Special regex characters are:

. \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
Parameters:
  1. s: string sequence to escape
Returns:

An escaped sequence representing s.

Example 1:
include std/regex.e as re
sequence search_s = re:escape("Payroll is $***15.00")
-- search_s = "Payroll is \\$\\*\\*\\*15\\.00"
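The typical use is to build a pattern from text that should be matched literally. A minimal sketch, where user_text stands in for hypothetical user input:

```euphoria
include std/regex.e as re

sequence user_text = "1+1=2"   -- imagine this came from the user
constant haystack = "so 1+1=2 holds"

-- Unescaped, '+' is a quantifier, so the pattern looks for "11=2", "111=2", ...
integer raw_hit = re:has_match(re:new(user_text), haystack)
-- Escaped, every character of user_text is matched literally.
integer escaped_hit = re:has_match(re:new(re:escape(user_text)), haystack)
printf(1, "%d %d\n", {raw_hit, escaped_hit})
```

Only the escaped pattern should find the text in haystack.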

8.20.6.2 get_ovector_size

include std/regex.e
namespace regex
public function get_ovector_size(regex ex, integer maxsize = 0)

Returns the number of capturing subpatterns (the ovector size) for a regex.

Parameters:
  1. ex : a regex
  2. maxsize : optional maximum number of named groups to get data from
Returns:

An integer

8.20.7 Match

8.20.7.1 find

include std/regex.e
namespace regex
public function find(regex re, string haystack, integer from = 1,
        option_spec options = DEFAULT,
        integer size = get_ovector_size(re, 30))

Return the first match of re in haystack. You can optionally start at the position from.

Parameters:
  1. re : a regex for a subject to be matched against
  2. haystack : a string in which to search
  3. from : an integer setting the starting position to begin searching from. Defaults to 1
  4. options : defaults to DEFAULT. See Match Time Option Constants. The only options that may be set when calling find are ANCHORED, NEWLINE_CR, NEWLINE_LF, NEWLINE_CRLF, NEWLINE_ANY, NEWLINE_ANYCRLF, NOTBOL, NOTEOL, NOTEMPTY, and NO_UTF8_CHECK. options can be any match time option, a sequence of valid options, or a value that comes from using or_bits on any two valid option values.
  5. size : internal (how large an array the C backend should allocate). Defaults to 90; in rare cases this number may need to be increased in order to accommodate complex regular expressions.
Returns:

An object, which is either an atom of 0, meaning nothing matched or a sequence of index pairs. These index pairs may be fewer than the number of groups specified. These index pairs may be the invalid index pair {0,0}.

The first pair is the starting and ending indices of the sub-string that matches the expression. This pair may be followed by indices of the groups. The groups are subexpressions in the regular expression surrounded by parentheses ().

Now, it is possible to get a match without having all of the groups match. This can happen when there is a quantifier after a group. For example: '([01])*' or '([01])?'. In this case, the returned sequence of pairs will be missing the last group indices for which there is no match. However, if the missing group is followed by a group that *does* match, {0,0} will be used as a place holder. You can ensure your groups match when your expression matches by keeping quantifiers inside your groups. For example, use '([01]?)' instead of '([01])?'.

Example 1:
include std/regex.e as re
re:regex r = re:new("([A-Za-z]+) ([0-9]+)") -- John 20 or Jane 45
object result = re:find(r, "John 20")

-- The return value will be:
-- {
--    { 1, 7 }, -- Total match
--    { 1, 4 }, -- First grouping "John" ([A-Za-z]+)
--    { 6, 7 }  -- Second grouping "20" ([0-9]+)
-- }
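The behaviour of unmatched groups described in the comments for find can be explored with a sketch like the following; the two patterns are invented for the example, and the output is only printed so it can be inspected rather than assumed:

```euphoria
include std/regex.e as re

constant re_lead  = re:new("(a)?(b)")  -- the optional group comes first
constant re_trail = re:new("(b)(a)?")  -- the optional group comes last

-- The unmatched leading group is followed by a group that does match,
-- so a place holder pair is expected in its position.
object lead = re:find(re_lead, "b")
-- The unmatched trailing group is expected to be left out altogether.
object trail = re:find(re_trail, "b")

? lead
? trail
printf(1, "%d %d\n", {length(lead), length(trail)})
```

The first result should therefore have three index pairs and the second only two.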

8.20.7.2 find_all

include std/regex.e
namespace regex
public function find_all(regex re, string haystack, integer from = 1,
        option_spec options = DEFAULT,
        integer size = get_ovector_size(re, 30))

Return all matches of re in haystack optionally starting at the sequence position from.

Parameters:
  1. re : a regex for a subject to be matched against
  2. haystack : a string in which to search
  3. from : an integer setting the starting position to begin searching from. Defaults to 1
  4. options : defaults to DEFAULT. See Match Time Option Constants.
Returns:

A sequence containing, for each match, the sequence that find would return. If there are no matches, an empty sequence is returned. Please see find for a detailed description of each member of the return sequence.

Example 1:
include std/regex.e as re
constant re_number = re:new("[0-9]+")
object matches = re:find_all(re_number, "10 20 30")

-- matches is:
-- {
--     {{1, 2}},
--     {{4, 5}},
--     {{7, 8}}
-- }

8.20.7.3 has_match

include std/regex.e
namespace regex
public function has_match(regex re, string haystack, integer from = 1,
        option_spec options = DEFAULT)

Determine if re matches any portion of haystack.

Parameters:
  1. re : a regex for a subject to be matched against
  2. haystack : a string in which to search
  3. from : an integer setting the starting position to begin searching from. Defaults to 1
  4. options : defaults to DEFAULT. See Match Time Option Constants. options can be any match time option or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.
Returns:

An atom, 1 if re matches any portion of haystack or 0 if not.

8.20.7.4 is_match

include std/regex.e
namespace regex
public function is_match(regex re, string haystack, integer from = 1,
        option_spec options = DEFAULT)

Determine if the entire haystack matches re.

Parameters:
  1. re : a regex for a subject to be matched against
  2. haystack : a string in which to search
  3. from : an integer setting the starting position to begin searching from. Defaults to 1
  4. options : defaults to DEFAULT. See Match Time Option Constants. options can be any match time option or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.
Returns:

An atom, 1 if re matches the entire haystack or 0 if not.
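The difference between has_match and is_match can be seen in a short sketch (the pattern and subjects are illustrative only):

```euphoria
include std/regex.e as re

constant re_digits = re:new("[0-9]+")

-- Digits occur somewhere in the string, so has_match succeeds.
integer part_has = re:has_match(re_digits, "abc 123")
-- The string as a whole is not just digits, so is_match fails.
integer part_is  = re:is_match(re_digits, "abc 123")
-- Here the entire string is matched by the pattern.
integer whole_is = re:is_match(re_digits, "123")
printf(1, "%d %d %d\n", {part_has, part_is, whole_is})
```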

8.20.7.5 matches

include std/regex.e
namespace regex
public function matches(regex re, string haystack, integer from = 1,
        option_spec options = DEFAULT)

Get the matched text only.

Parameters:
  1. re : a regex for a subject to be matched against
  2. haystack : a string in which to search
  3. from : an integer setting the starting position to begin searching from. Defaults to 1
  4. options : defaults to DEFAULT. See Match Time Option Constants. options can be any match time option or STRING_OFFSETS or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.
Returns:

Returns a sequence of strings, the first being the entire match and subsequent items being each of the captured groups, or ERROR_NOMATCH if there is no match. The size of the sequence is the number of groups in the expression plus one (for the entire match).

If options contains the bit STRING_OFFSETS, then the result is different. For each item, a sequence is returned containing the matched text, the starting index in haystack and the ending index in haystack.

Example 1:
include std/regex.e as re
constant re_name = re:new("([A-Z][a-z]+) ([A-Z][a-z]+)")

object matches = re:matches(re_name, "John Doe and Jane Doe")
-- matches is:
-- {
--   "John Doe", -- full match data
--   "John",     -- first group
--   "Doe"       -- second group
-- }

matches = re:matches(re_name, "John Doe and Jane Doe", 1, re:STRING_OFFSETS)
-- matches is:
-- {
--   { "John Doe", 1, 8 }, -- full match data
--   { "John",     1, 4 }, -- first group
--   { "Doe",      6, 8 }  -- second group
-- }
See Also:

all_matches

8.20.7.6 all_matches

include std/regex.e
namespace regex
public function all_matches(regex re, string haystack, integer from = 1,
        option_spec options = DEFAULT)

Get the text of all matches

Parameters:
  1. re : a regex for a subject to be matched against
  2. haystack : a string in which to search
  3. from : an integer setting the starting position to begin searching from. Defaults to 1
  4. options : options, defaults to DEFAULT. See Match Time Option Constants. options can be any match time option or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.
Returns:

Returns ERROR_NOMATCH if there are no matches, or a sequence of sequences of strings if there is at least one match. In each member sequence of the returned sequence, the first string is the entire match and subsequent items are each of the captured groups. The size of each member sequence is the number of groups in the expression plus one (for the entire match). In other words, each member of the return value has the same structure as the value returned by matches.

If options contains the bit STRING_OFFSETS, then the result is different. In each member sequence, instead of each member being a string each member is itself a sequence containing the matched text, the starting index in haystack and the ending index in haystack.

Example 1:
include std/regex.e as re
constant re_name = re:new("([A-Z][a-z]+) ([A-Z][a-z]+)")

object matches = re:all_matches(re_name, "John Doe and Jane Doe")
-- matches is:
-- {
--   {             -- first match
--     "John Doe", -- full match data
--     "John",     -- first group
--     "Doe"       -- second group
--   },
--   {             -- second match
--     "Jane Doe", -- full match data
--     "Jane",     -- first group
--     "Doe"       -- second group
--   }
-- }

matches = re:all_matches(re_name, "John Doe and Jane Doe", , re:STRING_OFFSETS)
-- matches is:
-- {
--   {                         -- first match
--     { "John Doe",  1,  8 }, -- full match data
--     { "John",      1,  4 }, -- first group
--     { "Doe",       6,  8 }  -- second group
--   },
--   {                         -- second match
--     { "Jane Doe", 14, 21 }, -- full match data
--     { "Jane",     14, 17 }, -- first group
--     { "Doe",      19, 21 }  -- second group
--   }
-- }
See Also:

matches

8.20.8 Splitting

8.20.8.1 split

include std/regex.e
namespace regex
public function split(regex re, string text, integer from = 1, option_spec options = DEFAULT)

Split a string based on a regex as a delimiter.

Parameters:
  1. re : a regex which will be used for matching
  2. text : a string to split
  3. from : optional start position
  4. options : options, defaults to DEFAULT. See Match Time Option Constants. options can be any match time option or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.
Returns:

A sequence of string values split at the delimiter and if no delimiters were matched this sequence will be a one member sequence equal to {text}.

Example 1:
include std/regex.e as re
regex comma_space_re = re:new(`,\s`)
sequence data = re:split(comma_space_re, 
                         "euphoria programming, source code, reference data")
-- data is
-- {
--   "euphoria programming",
--   "source code",
--   "reference data"
-- }

8.20.8.2 split_limit

include std/regex.e
namespace regex
public function split_limit(regex re, string text, integer limit = 0, integer from = 1,
        option_spec options = DEFAULT)

8.20.9 Replacement

8.20.9.1 find_replace

include std/regex.e
namespace regex
public function find_replace(regex ex, string text, sequence replacement, integer from = 1,
        option_spec options = DEFAULT)

Replaces all matches of a regex with the replacement text.

Parameters:
  1. ex : a regex which will be used for matching
  2. text : a string on which search and replace will apply
  3. replacement : a string, used to replace each of the full matches
  4. from : optional start position
  5. options : options, defaults to DEFAULT. See Match Time Option Constants. options can be any match time option or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.
Returns:

A sequence, the modified text. If there is no match with ex, the return value will be the same as text when it was passed in.

Special replacement operators:
  • \ -- Causes the next character to lose its special meaning.
  • \n -- Inserts a 0x0A (LF) character.
  • \r -- Inserts a 0x0D (CR) character.
  • \t -- Inserts a 0x09 (TAB) character.
  • \1 to \9 -- Recalls stored substrings from registers (\1, \2, \3, to \9).
  • \0 -- Recalls entire matched pattern.
  • \u -- Convert next character to uppercase
  • \l -- Convert next character to lowercase
  • \U -- Convert to uppercase till \E or \e
  • \L -- Convert to lowercase till \E or \e
  • \E or \e -- Terminate a \U or \L conversion
Example 1:
include std/regex.e
regex r = new(`([A-Za-z]+)\.([A-Za-z]+)`)
sequence details = find_replace(r, "hello.txt", 
                                        `Filename: \U\1\e Extension: \U\2\e`)
-- details = "Filename: HELLO Extension: TXT"

8.20.9.2 find_replace_limit

include std/regex.e
namespace regex
public function find_replace_limit(regex ex, string text, sequence replacement,
        integer limit, integer from = 1, option_spec options = DEFAULT)

Replaces up to limit matches of ex in text except when limit is 0. When limit is 0, this routine replaces all of the matches.

This function is identical to find_replace except it allows you to limit the number of replacements to perform. Please see the documentation for find_replace for all the details.

Parameters:
  1. ex : a regex which will be used for matching
  2. text : a string on which search and replace will apply
  3. replacement : a string, used to replace each of the full matches
  4. limit : the number of matches to process
  5. from : optional start position
  6. options : options, defaults to DEFAULT. See Match Time Option Constants. options can be any match time option or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.
Returns:

A sequence, the modified text.

See Also:

find_replace
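A minimal sketch of how limit behaves, using a made-up pattern and subject:

```euphoria
include std/regex.e as re

constant re_a = re:new("a")

-- limit = 2: only the first two matches are replaced.
sequence first_two = re:find_replace_limit(re_a, "banana", "o", 2)
-- limit = 0: every match is replaced, as with find_replace.
sequence all_of_them = re:find_replace_limit(re_a, "banana", "o", 0)

printf(1, "%s %s\n", {first_two, all_of_them})
```

This should print "bonona bonono".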

8.20.9.3 find_replace_callback

include std/regex.e
namespace regex
public function find_replace_callback(regex ex, string text, integer rid, integer limit = 0,
        integer from = 1, option_spec options = DEFAULT)

When limit is positive, this routine replaces up to limit matches of ex in text with the result of the user defined callback, rid. When limit is 0, it replaces all matches of ex in text with the result of this callback.

The callback should take one sequence. The first member of this sequence will be a string representing the entire match and the subsequent members, if they exist, will be strings for the captured groups within the regular expression.

Parameters:
  1. ex : a regex which will be used for matching
  2. text : a string on which search and replace will apply
  3. rid : routine id to execute for each match
  4. limit : the number of matches to process
  5. from : optional start position
  6. options : options, defaults to DEFAULT. See Match Time Option Constants. options can be any match time option or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.

The function rid must take one sequence parameter. It needs to accept a sequence of strings and return a string. For each match, the function is passed a sequence of strings: the first string is the entire match and the subsequent strings are for the capturing groups. If a match succeeds with groups that don't exist, that place will contain a 0. If the sub-group does exist, the place will contain the matching string for that group.

Returns:

A sequence, the modified text.

Examples:
include std/regex.e as re
include std/text.e
function my_convert(sequence params)
    switch params[1] do
        case "1" then
            return "one "
        case "2" then
            return "two "
        case else
            return "unknown "
    end switch
end function

regex r = re:new(`\d`)
sequence result = re:find_replace_callback(r, "125",routine_id("my_convert"))
-- result = "one two unknown "


integer missing_data_flag = 0
-- the space is optional so that a single name such as "Mary" still matches
regex r2 = re:new(`[A-Z][a-z]+ ?([A-Z][a-z]+)?`)
function my_toupper( sequence params)
      -- here params[2] may be 0.
      return upper( params[1] )
end function

result = find_replace_callback(r2, "John Doe", routine_id("my_toupper"))
-- params[2] is "Doe"
-- result = "JOHN DOE"
printf(1, "result=%s\n", {result} )
result = find_replace_callback(r2, "Mary", routine_id("my_toupper"))
-- result = "MARY"