8.20 Regular Expressions
8.20.1 Introduction
Regular expressions in Euphoria are based on the PCRE (Perl Compatible Regular Expressions) library created by Philip Hazel.
This document details the Euphoria interface to regular expressions, not regular expression syntax itself. The syntax is a complex subject on which many books have been written. The following online resources can help while learning regular expressions.
- EUForum Article
- Perl Regular Expressions Man Page
- Regular Expression Library (user supplied regular expressions for just about any task).
- Wikipedia Regular Expression Article
- Man page of PCRE in HTML
8.20.2 General Use
Many functions take an optional options parameter. This parameter can be a single option constant (see Option Constants), multiple option constants or'ed together into a single atom, or a sequence of options, in which case the function takes care of or'ing them together correctly. Options are named like their C equivalents with the 'PCRE_' prefix stripped off. Namespaces disambiguate the symbols, so the prefix is not needed.
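The three forms of the options parameter can be sketched as follows. The pattern and the subject string are illustrative only, and or_bits is the built-in Euphoria routine rather than part of this library.

```euphoria
include std/regex.e as re

-- A single option constant.
object single   = re:new("hello", re:CASELESS)
-- Two options or'ed together into one atom.
object combined = re:new("hello", or_bits(re:CASELESS, re:MULTILINE))
-- The same two options as a sequence; the library or's them together.
object listed   = re:new("hello", {re:CASELESS, re:MULTILINE})

if re:regex(single) and re:regex(combined) and re:regex(listed) then
    printf(1, "%d %d %d\n", {
        re:has_match(single, "Say HELLO"),
        re:has_match(combined, "Say HELLO"),
        re:has_match(listed, "Say HELLO")
    })
    -- prints: 1 1 1
end if
```

The sequence form is usually the most readable when more than two options are combined.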
All strings passed into this library must be either 8-bit per character strings or UTF8 strings, which use multiple bytes to encode UNICODE characters. You can use UTF8 encoded UNICODE strings when you pass the UTF8 option.
8.20.3 Option Constants
8.20.3.1 Compile Time and Match Time
When a regular expression object is created via new, we also say it gets "compiled." The options you may use for this are called "compile time" option constants. Once the regular expression is created, you can pass it, together with a string, to the other functions. The options of those routines are called "match time" option constants. To set no options at all, either omit the options argument or supply DEFAULT.
Compile Time Option Constants
The only options that may be set at "compile time" (that is, passed to new) are ANCHORED, AUTO_CALLOUT, BSR_ANYCRLF, BSR_UNICODE, CASELESS, DEFAULT, DOLLAR_ENDONLY, DOTALL, DUPNAMES, EXTENDED, EXTRA, FIRSTLINE, MULTILINE, NEWLINE_CR, NEWLINE_LF, NEWLINE_CRLF, NEWLINE_ANY, NEWLINE_ANYCRLF, NO_AUTO_CAPTURE, NO_UTF8_CHECK, UNGREEDY, and UTF8.
Match Time Option Constants
Options that may be set at "match time" are ANCHORED, NEWLINE_CR, NEWLINE_LF, NEWLINE_CRLF, NEWLINE_ANY, NEWLINE_ANYCRLF, NOTBOL, NOTEOL, NOTEMPTY, and NO_UTF8_CHECK. Routines that take match time option constants are those that match, split or replace a regular expression against some string.
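As a minimal sketch of the difference, the example below passes one compile time option to new and one match time option to has_match. The pattern and subject are made up for illustration.

```euphoria
include std/regex.e as re

-- Compile time: CASELESS is given to new and becomes part of the regex.
constant re_word = re:new("^euphoria", re:CASELESS)

-- Match time: NOTBOL is given per call and affects that call only.
? re:has_match(re_word, "Euphoria 4")               -- 1
? re:has_match(re_word, "Euphoria 4", 1, re:NOTBOL) -- 0, ^ may not match at the start
```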
8.20.3.2 ANCHORED
public constant ANCHORED
Forces a match to be attempted only at the first position where the search is asked to start. In C, this is called PCRE_ANCHORED. This is passed to all routines including new.
8.20.3.3 AUTO_CALLOUT
public constant AUTO_CALLOUT
In C, this is called PCRE_AUTO_CALLOUT. To get the functionality of this flag in EUPHORIA, you can use find_replace_callback without passing this option. This is passed to new.
8.20.3.4 BSR_ANYCRLF
public constant BSR_ANYCRLF
With this option only ASCII new line sequences are recognized as newlines. Other UNICODE newline sequences (encoded as UTF8) are not recognized as an end of line marker. This is passed to all routines including new.
8.20.3.5 BSR_UNICODE
public constant BSR_UNICODE
With this option any UNICODE new line sequence is recognized as a newline. The UNICODE will have to be encoded as UTF8, however. This is passed to all routines including new.
8.20.3.6 CASELESS
public constant CASELESS
This will make your regular expression matches case insensitive. With this flag for example, [a-z] is the same as [A-Za-z]. This is passed to new.
8.20.3.7 DEFAULT
public constant DEFAULT
This value is used to set no flags at all. It can be passed to all routines including new.
8.20.3.8 DFA_SHORTEST
public constant DFA_SHORTEST
This is NOT used by any standard library routine.
8.20.3.9 DFA_RESTART
public constant DFA_RESTART
This is NOT used by any standard library routine.
8.20.3.10 DOLLAR_ENDONLY
public constant DOLLAR_ENDONLY
If this bit is set, a dollar sign metacharacter in the pattern matches only at the end of the subject string. Without this option, a dollar sign also matches immediately before a newline at the end of the string (but not before any other newlines). Thus you must include the newline character in the pattern before the dollar sign if you want to match a line that contains a newline character. The DOLLAR_ENDONLY option is ignored if MULTILINE is set. There is no way to set this option within a pattern. This is passed to new.
8.20.3.11 DOTALL
public constant DOTALL
With this option the '.' character also matches a newline sequence. This is passed to new.
8.20.3.12 DUPNAMES
public constant DUPNAMES
Allow duplicate names for named subpatterns. Since there is no way to access named subpatterns this flag has no effect. This is passed to new.
8.20.3.13 EXTENDED
public constant EXTENDED
Whitespace in the pattern, and any text from a hash mark (#) to the end of the line, is ignored when searching, except when the whitespace or hash mark is escaped or inside a character class. This is passed to new.
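A small sketch of an EXTENDED pattern follows. The pattern itself is hypothetical; it only shows that the literal spaces and the trailing comment are not part of what is matched.

```euphoria
include std/regex.e as re

-- With EXTENDED, the spaces and the trailing "# ..." comment are ignored;
-- \s* is still needed to match real whitespace in the subject.
constant re_range = re:new(`(\d+) \s* - \s* (\d+)  # low - high`, re:EXTENDED)

? re:has_match(re_range, "10-20")   -- 1
? re:has_match(re_range, "10 - 20") -- 1
```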
8.20.3.14 EXTRA
public constant EXTRA
When a backslash (\) is followed by an alphanumeric character that has no special meaning, an error is generated. This is passed to new.
8.20.3.15 FIRSTLINE
public constant FIRSTLINE
If FIRSTLINE is set, the match must start before or at the first newline in the subject (though it may continue over the newline). In C, this is called PCRE_FIRSTLINE. This is passed to new.
8.20.3.16 MULTILINE
public constant MULTILINE
When MULTILINE is set, the "start of line" and "end of line" constructs match immediately following or immediately before internal newlines in the subject string, respectively, as well as at the very start and end. This is passed to new.
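The effect can be sketched as below. NEWLINE_LF is spelled out so that the example does not rely on the library's default newline convention; the subject text is illustrative.

```euphoria
include std/regex.e as re

constant subject = "alpha\nbeta\ngamma"

constant re_plain = re:new("^beta$", re:NEWLINE_LF)
constant re_multi = re:new("^beta$", {re:MULTILINE, re:NEWLINE_LF})

? re:has_match(re_plain, subject) -- 0: ^ and $ anchor only at the ends of subject
? re:has_match(re_multi, subject) -- 1: ^ and $ also anchor around "beta"
```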
8.20.3.17 NEWLINE_CR
public constant NEWLINE_CR
Sets CR as the NEWLINE sequence. The NEWLINE sequence will match $ when MULTILINE is set. This is passed to all routines including new.
8.20.3.18 NEWLINE_LF
public constant NEWLINE_LF
Sets LF as the NEWLINE sequence. The NEWLINE sequence will match $ when MULTILINE is set. This is passed to all routines including new.
8.20.3.19 NEWLINE_CRLF
public constant NEWLINE_CRLF
Sets CRLF as the NEWLINE sequence. The NEWLINE sequence will match $ when MULTILINE is set. This is passed to all routines including new.
8.20.3.20 NEWLINE_ANY
public constant NEWLINE_ANY
Sets ANY newline sequence as the NEWLINE sequence including those from UNICODE when UTF8 is also set. The string will have to be encoded as UTF8, however. The NEWLINE sequence will match $ when MULTILINE is set. This is passed to all routines including new.
8.20.3.21 NEWLINE_ANYCRLF
public constant NEWLINE_ANYCRLF
Sets ANY newline sequence from ASCII. The NEWLINE sequence will match $ when MULTILINE is set. This is passed to all routines including new.
8.20.3.22 NOTBOL
public constant NOTBOL
This indicates that the beginning of the passed string does NOT start at the Beginning Of a Line (NOTBOL), so a caret symbol (^) in the original pattern will not match the beginning of the string. This is used by routines other than new.
8.20.3.23 NOTEOL
public constant NOTEOL
This indicates that the end of the passed string does NOT end at the End Of a Line (NOTEOL), so a dollar sign ($) in the original pattern will not match the end of the string. This is used by routines other than new.
8.20.3.24 NO_AUTO_CAPTURE
public constant NO_AUTO_CAPTURE
Disables capturing subpatterns except when the subpatterns are named. This is passed to new.
8.20.3.25 NO_UTF8_CHECK
public constant NO_UTF8_CHECK
Turns off checking for the validity of your UTF8 string. Use this with caution: an invalid UTF8 string with this option could crash your program. Only use this if you know the string is a valid UTF8 string. This is passed to all routines including new.
8.20.3.26 NOTEMPTY
public constant NOTEMPTY
With this option, a match of the empty string is not allowed. In C, this is PCRE_NOTEMPTY. For example, the pattern `A*a*` will match "AAAA", "aaaa", and "Aaaa" but not "". This is used by routines other than new.
8.20.3.27 PARTIAL
public constant PARTIAL
This option has no effect on whether a match will occur or not. However, it does affect the error code generated by find in the event of a failure. If, for some pattern re and two strings s1 and s2, find( re, s1 & s2 ) would return a match but both find( re, s1 ) and find( re, s2 ) would not, then find( re, s1, 1, PARTIAL ) will return ERROR_PARTIAL rather than ERROR_NOMATCH. We say s1 has a partial match of re.
Note that find( re, s2, 1, PARTIAL ) will return ERROR_NOMATCH. In C, this constant is called PCRE_PARTIAL.
8.20.3.28 STRING_OFFSETS
public constant STRING_OFFSETS
This is used by matches and all_matches.
8.20.3.29 UNGREEDY
public constant UNGREEDY
This modifier sets the pattern such that quantifiers are not greedy by default, but become greedy if followed by a question mark. This is passed to new.
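The following sketch contrasts the default greedy behaviour with UNGREEDY on the same pattern. The pattern and subject are illustrative only.

```euphoria
include std/regex.e as re

constant re_greedy = re:new("<.+>")
constant re_lazy   = re:new("<.+>", re:UNGREEDY)

-- re:matches returns the matched text; element 1 is the whole match.
object greedy = re:matches(re_greedy, "<a><b>")
object lazy   = re:matches(re_lazy, "<a><b>")

if sequence(greedy) and sequence(lazy) then
    printf(1, "%s\n%s\n", {greedy[1], lazy[1]})
    -- prints:
    -- <a><b>
    -- <a>
end if
```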
8.20.3.30 UTF8
public constant UTF8
Makes strings passed in to be interpreted as a UTF8 encoded string. This is passed to new.
8.20.4 Error Constants
Error constants differ from their C equivalents as they do not have PCRE_ prepended to each name.
8.20.4.1 ERROR_NOMATCH
include std/regex.e namespace regex public constant ERROR_NOMATCH
There was no match found.
8.20.4.2 ERROR_NULL
include std/regex.e namespace regex public constant ERROR_NULL
There was an internal error in the EUPHORIA wrapper (std/regex.e in the standard include directory or be_regex.c in the EUPHORIA source).
8.20.4.3 ERROR_BADOPTION
include std/regex.e namespace regex public constant ERROR_BADOPTION
There was an internal error in the EUPHORIA wrapper (std/regex.e in the standard include directory or be_regex.c in the EUPHORIA source).
8.20.4.4 ERROR_BADMAGIC
include std/regex.e namespace regex public constant ERROR_BADMAGIC
The pattern passed is not a value returned from new.
8.20.4.5 ERROR_UNKNOWN_OPCODE
include std/regex.e namespace regex public constant ERROR_UNKNOWN_OPCODE
An internal error occurred, either in the PCRE library EUPHORIA uses or in its wrapper.
8.20.4.6 ERROR_UNKNOWN_NODE
include std/regex.e namespace regex public constant ERROR_UNKNOWN_NODE
An internal error occurred, either in the PCRE library EUPHORIA uses or in its wrapper.
8.20.4.7 ERROR_NOMEMORY
include std/regex.e namespace regex public constant ERROR_NOMEMORY
Out of memory.
8.20.4.8 ERROR_NOSUBSTRING
include std/regex.e namespace regex public constant ERROR_NOSUBSTRING
The wrapper or the PCRE backend didn't preallocate enough capturing groups for this pattern.
8.20.4.9 ERROR_MATCHLIMIT
include std/regex.e namespace regex public constant ERROR_MATCHLIMIT
Too many matches encountered.
8.20.4.10 ERROR_CALLOUT
include std/regex.e namespace regex public constant ERROR_CALLOUT
Not applicable to our implementation.
8.20.4.11 ERROR_BADUTF8
include std/regex.e namespace regex public constant ERROR_BADUTF8
The subject or pattern is not valid UTF8 but it was specified as such with UTF8.
8.20.4.12 ERROR_BADUTF8_OFFSET
include std/regex.e namespace regex public constant ERROR_BADUTF8_OFFSET
The offset specified doesn't start on a UTF8 character boundary, although the string was specified as UTF8 with the UTF8 option.
8.20.4.13 ERROR_PARTIAL
include std/regex.e namespace regex public constant ERROR_PARTIAL
Pattern didn't match, but there is a partial match. See PARTIAL.
8.20.4.14 ERROR_BADPARTIAL
include std/regex.e namespace regex public constant ERROR_BADPARTIAL
PCRE backend doesn't support partial matching for this pattern.
8.20.4.15 ERROR_INTERNAL
include std/regex.e namespace regex public constant ERROR_INTERNAL
8.20.4.16 ERROR_BADCOUNT
include std/regex.e namespace regex public constant ERROR_BADCOUNT
The size parameter to find is less than -1.
8.20.4.17 ERROR_DFA_UITEM
include std/regex.e namespace regex public constant ERROR_DFA_UITEM
Not applicable to our implementation: The PCRE wrapper doesn't use DFA routines
8.20.4.18 ERROR_DFA_UCOND
include std/regex.e namespace regex public constant ERROR_DFA_UCOND
Not applicable to our implementation: The PCRE wrapper doesn't use DFA routines
8.20.4.19 ERROR_DFA_UMLIMIT
include std/regex.e namespace regex public constant ERROR_DFA_UMLIMIT
Not applicable to our implementation: The PCRE wrapper doesn't use DFA routines
8.20.4.20 ERROR_DFA_WSSIZE
include std/regex.e namespace regex public constant ERROR_DFA_WSSIZE
Not applicable to our implementation: The PCRE wrapper doesn't use DFA routines
8.20.4.21 ERROR_DFA_RECURSE
include std/regex.e namespace regex public constant ERROR_DFA_RECURSE
Not applicable to our implementation: The PCRE wrapper doesn't use DFA routines
8.20.4.22 ERROR_RECURSIONLIMIT
include std/regex.e namespace regex public constant ERROR_RECURSIONLIMIT
Too much recursion used for match.
8.20.4.23 ERROR_NULLWSLIMIT
include std/regex.e namespace regex public constant ERROR_NULLWSLIMIT
This error isn't in the source code.
8.20.4.24 ERROR_BADNEWLINE
include std/regex.e namespace regex public constant ERROR_BADNEWLINE
Both the BSR_UNICODE and BSR_ANYCRLF options were specified. These options are contradictory.
8.20.4.25 error_names
include std/regex.e namespace regex public constant error_names
8.20.5 Create/Destroy
8.20.5.1 regex
include std/regex.e namespace regex public type regex(object o)
Regular expression type
8.20.5.2 option_spec
include std/regex.e namespace regex public type option_spec(object o)
Regular expression option specification type
Although the functions do not use this type (they return an error instead), you can use this to check if your routine is receiving something sane.
8.20.5.3 option_spec_to_string
include std/regex.e namespace regex public function option_spec_to_string(option_spec o)
Converts an option spec to a string.
This can be useful for debugging what options were passed in. Without it you have to convert a number to hex and lookup the constants in the source code.
8.20.5.4 error_to_string
include std/regex.e namespace regex public function error_to_string(integer i)
Converts a regex error to a string.
This can be useful for debugging, and even as something rough to show the user in case of a regex failure. It's preferable to a number.
8.20.5.5 new
include std/regex.e namespace regex public function new(string pattern, option_spec options = DEFAULT)
Returns an allocated regular expression.
Parameters:
- pattern : a sequence representing a human readable regular expression
- options : defaults to DEFAULT. See Compile Time Option Constants.
Returns:
A regex, which other regular expression routines can work on, or an atom to indicate an error. In case of an error, you can call error_message to get a detailed error message.
Comments:
This is the only routine that accepts a human readable regular expression. The string is compiled and a regex is returned. Analyzing and compiling a regular expression is a costly operation and should not be done more often than necessary. For instance, if your application frequently looks for an email address in text, you should create the regular expression as a constant accessible to your source code and any files that may use it. The regular expression is then analyzed and compiled only once per run of your application.
-- Bad Example
include std/regex.e as re

while sequence(line) do
    -- the pattern is recompiled on every iteration
    re:regex proper_name = re:new("[A-Z][a-z]+ [A-Z][a-z]+")
    if re:has_match(proper_name, line) then
        -- code
    end if
end while

-- Good Example
include std/regex.e as re

-- the pattern is compiled once
constant re_proper_name = re:new("[A-Z][a-z]+ [A-Z][a-z]+")
while sequence(line) do
    if re:has_match(re_proper_name, line) then
        -- code
    end if
end while
Example 1:
include std/regex.e as re
re:regex number = re:new("[0-9]+")
Note:
For simple matches, the built-in Euphoria routine eu:match and the library routine wildcard:is_match are often easier to use and a little faster. Regular expressions are faster for complex searching and matching.
8.20.5.6 error_message
include std/regex.e namespace regex public function error_message(object re)
If new returns an atom, this function will return a text error message as to the reason.
Parameters:
- re: Regular expression to get the error message from
Returns:
An atom (0) when no error message exists, otherwise a sequence describing the error.
Example 1:
include std/regex.e
object r = regex:new("[A-Z[a-z]*")
if atom(r) then
    printf(1, "Regex failed to compile: %s\n", { regex:error_message(r) })
end if
8.20.6 Utility Routines
8.20.6.1 escape
include std/regex.e namespace regex public function escape(string s)
Escape special regular expression characters that may be entered into a search string from user input.
Notes:
Special regex characters are:
. \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
Parameters:
- s: string sequence to escape
Returns:
An escaped sequence representing s.
Example 1:
include std/regex.e as re
sequence search_s = re:escape("Payroll is $***15.00")
-- search_s = "Payroll is \\$\\*\\*\\*15\\.00"
8.20.6.2 get_ovector_size
include std/regex.e namespace regex public function get_ovector_size(regex ex, integer maxsize = 0)
Returns the number of capturing subpatterns (the ovector size) for a regex.
Parameters:
- ex : a regex
- maxsize : optional maximum number of named groups to get data from
Returns:
An integer
8.20.7 Match
8.20.7.1 find
include std/regex.e namespace regex public function find(regex re, string haystack, integer from = 1, option_spec options = DEFAULT, integer size = get_ovector_size(re, 30))
Return the first match of re in haystack. You can optionally start at the position from.
Parameters:
- re : a regex for a subject to be matched against
- haystack : a string in which to search
- from : an integer setting the starting position to begin searching from. Defaults to 1
- options : defaults to DEFAULT. See Match Time Option Constants. The only options that may be set when calling find are ANCHORED, NEWLINE_CR, NEWLINE_LF, NEWLINE_CRLF, NEWLINE_ANY, NEWLINE_ANYCRLF, NOTBOL, NOTEOL, NOTEMPTY, and NO_UTF8_CHECK. options can be any match time option or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.
- size : internal (how large an array the C backend should allocate). Defaults to 90. In rare cases this number may need to be increased in order to accommodate complex regular expressions.
Returns:
An object, which is either an atom of 0, meaning nothing matched or a sequence of index pairs. These index pairs may be fewer than the number of groups specified. These index pairs may be the invalid index pair {0,0}.
The first pair is the starting and ending indices of the sub-string that matches the expression. This pair may be followed by the indices of the groups. The groups are subexpressions in the regular expression surrounded by parentheses ().
It is possible to get a match without having all of the groups match. This can happen when there is a quantifier after a group, for example '([01])*' or '([01])?'. In this case, the returned sequence of pairs will be missing the last group indices for which there is no match. However, if the missing group is followed by a group that *does* match, {0,0} will be used as a place holder. You can ensure your groups match whenever your expression matches by keeping quantifiers inside your groups.
For example use: '([01]?)' instead of '([01])?'
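A hedged sketch of how calling code might guard against a missing or placeholder group pair before using it. The patterns and variable names are illustrative, and the exact pairs returned are not asserted here.

```euphoria
include std/regex.e as re

-- Quantifier outside the group: the group itself may be absent.
constant re_outside = re:new("([01])?x")
-- Quantifier inside the group: the group always takes part, possibly empty.
constant re_inside  = re:new("([01]?)x")

object outside = re:find(re_outside, "x")
object inside  = re:find(re_inside, "x")

-- Check that a group pair exists and is valid before slicing with it.
if sequence(outside) then
    if length(outside) < 2 or equal(outside[2], {0, 0}) then
        puts(1, "group 1 did not take part in the match\n")
    end if
end if

if sequence(inside) then
    printf(1, "pairs returned with the quantifier inside: %d\n", length(inside))
end if
```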
Example 1:
include std/regex.e as re
re:regex r = re:new("([A-Za-z]+) ([0-9]+)") -- John 20 or Jane 45
object result = re:find(r, "John 20")
-- The return value will be:
-- {
--    { 1, 7 }, -- Total match
--    { 1, 4 }, -- First grouping "John" ([A-Za-z]+)
--    { 6, 7 }  -- Second grouping "20" ([0-9]+)
-- }
8.20.7.2 find_all
include std/regex.e namespace regex public function find_all(regex re, string haystack, integer from = 1, option_spec options = DEFAULT, integer size = get_ovector_size(re, 30))
Return all matches of re in haystack optionally starting at the sequence position from.
Parameters:
- re : a regex for a subject to be matched against
- haystack : a string in which to search
- from : an integer setting the starting position to begin searching from. Defaults to 1
- options : defaults to DEFAULT. See Match Time Option Constants.
Returns:
A sequence of the sequences that find would return, or an empty sequence if there are no matches. Please see find for a detailed description of each member of the return sequence.
Example 1:
include std/regex.e as re
constant re_number = re:new("[0-9]+")
object matches = re:find_all(re_number, "10 20 30")
-- matches is:
-- {
--    {{1, 2}},
--    {{4, 5}},
--    {{7, 8}}
-- }
8.20.7.3 has_match
include std/regex.e namespace regex public function has_match(regex re, string haystack, integer from = 1, option_spec options = DEFAULT)
Determine if re matches any portion of haystack.
Parameters:
- re : a regex for a subject to be matched against
- haystack : a string in which to search
- from : an integer setting the starting position to begin searching from. Defaults to 1
- options : defaults to DEFAULT. See Match Time Option Constants. options can be any match time option or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.
Returns:
An atom, 1 if re matches any portion of haystack or 0 if not.
8.20.7.4 is_match
include std/regex.e namespace regex public function is_match(regex re, string haystack, integer from = 1, option_spec options = DEFAULT)
Determine if the entire haystack matches re.
Parameters:
- re : a regex for a subject to be matched against
- haystack : a string in which to search
- from : an integer setting the starting position to begin searching from. Defaults to 1
- options : defaults to DEFAULT. See Match Time Option Constants. options can be any match time option or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.
Returns:
An atom, 1 if re matches the entire haystack or 0 if not.
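The difference between has_match and is_match can be sketched with a single pattern. The pattern and subjects are illustrative.

```euphoria
include std/regex.e as re

constant re_digits = re:new("[0-9]+")

? re:has_match(re_digits, "abc 123") -- 1: "123" is found inside the text
? re:is_match(re_digits, "abc 123")  -- 0: the whole text is not all digits
? re:is_match(re_digits, "123")      -- 1: the whole text matches
```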
8.20.7.5 matches
include std/regex.e namespace regex public function matches(regex re, string haystack, integer from = 1, option_spec options = DEFAULT)
Get the matched text only.
Parameters:
- re : a regex for a subject to be matched against
- haystack : a string in which to search
- from : an integer setting the starting position to begin searching from. Defaults to 1
- options : defaults to DEFAULT. See Match Time Option Constants. options can be any match time option or STRING_OFFSETS or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.
Returns:
Returns a sequence of strings, the first being the entire match and subsequent items being each of the captured groups, or ERROR_NOMATCH if there is no match. The size of the sequence is the number of groups in the expression plus one (for the entire match).
If options contains the bit STRING_OFFSETS, then the result is different. For each item, a sequence is returned containing the matched text, the starting index in haystack and the ending index in haystack.
Example 1:
include std/regex.e as re
constant re_name = re:new("([A-Z][a-z]+) ([A-Z][a-z]+)")

object matches = re:matches(re_name, "John Doe and Jane Doe")
-- matches is:
-- {
--   "John Doe", -- full match data
--   "John",     -- first group
--   "Doe"       -- second group
-- }

matches = re:matches(re_name, "John Doe and Jane Doe", 1, re:STRING_OFFSETS)
-- matches is:
-- {
--   { "John Doe", 1, 8 }, -- full match data
--   { "John",     1, 4 }, -- first group
--   { "Doe",      6, 8 }  -- second group
-- }
8.20.7.6 all_matches
include std/regex.e namespace regex public function all_matches(regex re, string haystack, integer from = 1, option_spec options = DEFAULT)
Get the text of all matches.
Parameters:
- re : a regex for a subject to be matched against
- haystack : a string in which to search
- from : an integer setting the starting position to begin searching from. Defaults to 1
- options : options, defaults to DEFAULT. See Match Time Option Constants. options can be any match time option or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.
Returns:
Returns ERROR_NOMATCH if there are no matches, or a sequence of sequences of strings if there is at least one match. In each member sequence of the returned sequence, the first string is the entire match and the subsequent items are the captured groups. The size of each member sequence is the number of groups in the expression plus one (for the entire match). In other words, each member of the return value has the same structure as the value returned by matches.
If options contains the bit STRING_OFFSETS, then the result is different. In each member sequence, instead of each member being a string each member is itself a sequence containing the matched text, the starting index in haystack and the ending index in haystack.
Example 1:
include std/regex.e as re
constant re_name = re:new("([A-Z][a-z]+) ([A-Z][a-z]+)")

object matches = re:all_matches(re_name, "John Doe and Jane Doe")
-- matches is:
-- {
--   {             -- first match
--     "John Doe", -- full match data
--     "John",     -- first group
--     "Doe"       -- second group
--   },
--   {             -- second match
--     "Jane Doe", -- full match data
--     "Jane",     -- first group
--     "Doe"       -- second group
--   }
-- }

matches = re:all_matches(re_name, "John Doe and Jane Doe", , re:STRING_OFFSETS)
-- matches is:
-- {
--   {                         -- first match
--     { "John Doe", 1, 8 },   -- full match data
--     { "John",     1, 4 },   -- first group
--     { "Doe",      6, 8 }    -- second group
--   },
--   {                         -- second match
--     { "Jane Doe", 14, 21 }, -- full match data
--     { "Jane",     14, 17 }, -- first group
--     { "Doe",      19, 21 }  -- second group
--   }
-- }
8.20.8 Splitting
8.20.8.1 split
include std/regex.e namespace regex public function split(regex re, string text, integer from = 1, option_spec options = DEFAULT)
Split a string, using a regex as the delimiter.
Parameters:
- re : a regex which will be used for matching
- text : a string to split
- from : optional start position
- options : options, defaults to DEFAULT. See Match Time Option Constants. options can be any match time option or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.
Returns:
A sequence of string values split at the delimiter. If no delimiters were matched, this sequence will be a one member sequence equal to {text}.
Example 1:
include std/regex.e as re
regex comma_space_re = re:new(`,\s`)
sequence data = re:split(comma_space_re, "euphoria programming, source code, reference data")
-- data is
-- {
--   "euphoria programming",
--   "source code",
--   "reference data"
-- }
8.20.8.2 split_limit
include std/regex.e namespace regex public function split_limit(regex re, string text, integer limit = 0, integer from = 1, option_spec options = DEFAULT)
8.20.9 Replacement
8.20.9.1 find_replace
include std/regex.e namespace regex public function find_replace(regex ex, string text, sequence replacement, integer from = 1, option_spec options = DEFAULT)
Replaces all matches of a regex with the replacement text.
Parameters:
- ex : a regex which will be used for matching
- text : a string on which search and replace will apply
- replacement : a string, used to replace each of the full matches
- from : optional start position
- options : options, defaults to DEFAULT. See Match Time Option Constants. options can be any match time option or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.
Returns:
A sequence, the modified text. If there is no match with ex, the return value will be the same as text when it was passed in.
Special replacement operators:
- \ -- Causes the next character to lose its special meaning.
- \n -- Inserts a 0x0A (LF) character.
- \r -- Inserts a 0x0D (CR) character.
- \t -- Inserts a 0x09 (TAB) character.
- \1 to \9 -- Recalls stored substrings from registers (\1, \2, \3, to \9).
- \0 -- Recalls entire matched pattern.
- \u -- Convert next character to uppercase
- \l -- Convert next character to lowercase
- \U -- Convert to uppercase till \E or \e
- \L -- Convert to lowercase till \E or \e
- \E or \e -- Terminate a \U or \L conversion
Example 1:
include std/regex.e
regex r = new(`([A-Za-z]+)\.([A-Za-z]+)`)
sequence details = find_replace(r, "hello.txt", `Filename: \U\1\e Extension: \U\2\e`)
-- details = "Filename: HELLO Extension: TXT"
8.20.9.2 find_replace_limit
include std/regex.e namespace regex public function find_replace_limit(regex ex, string text, sequence replacement, integer limit, integer from = 1, option_spec options = DEFAULT)
Replaces up to limit matches of ex in text except when limit is 0. When limit is 0, this routine replaces all of the matches.
This function is identical to find_replace except it allows you to limit the number of replacements to perform. Please see the documentation for find_replace for all the details.
Parameters:
- ex : a regex which will be used for matching
- text : a string on which search and replace will apply
- replacement : a string, used to replace each of the full matches
- limit : the number of matches to process
- from : optional start position
- options : options, defaults to DEFAULT. See Match Time Option Constants. options can be any match time option or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.
Returns:
A sequence, the modified text.
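A minimal sketch of a limited replacement, assuming a simple digit pattern and a made-up subject string:

```euphoria
include std/regex.e as re

constant re_digit = re:new(`\d`)

-- Replace only the first two digits; later matches are left untouched.
sequence masked = re:find_replace_limit(re_digit, "a1b2c3", "#", 2)
puts(1, masked & "\n")
-- masked = "a#b#c3"
```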
8.20.9.3 find_replace_callback
include std/regex.e namespace regex public function find_replace_callback(regex ex, string text, integer rid, integer limit = 0, integer from = 1, option_spec options = DEFAULT)
When limit is positive, this routine replaces up to limit matches of ex in text with the result of the user defined callback, rid, and when limit is 0, replaces all matches of ex in text with the result of this user defined callback, rid.
The callback should take one sequence. The first member of this sequence will be a string representing the entire match, and the subsequent members, if they exist, will be strings for the captured groups within the regular expression.
Parameters:
- ex : a regex which will be used for matching
- text : a string on which search and replace will apply
- rid : routine id to execute for each match
- limit : the number of matches to process
- from : optional start position
- options : options, defaults to DEFAULT. See Match Time Option Constants. options can be any match time option or a sequence of valid options or it can be a value that comes from using or_bits on any two valid option values.
The function rid must take one sequence parameter: it needs to accept a sequence of strings and return a string. For each match, the function will be passed a sequence of strings. The first string is the entire match and the subsequent strings are for the capturing groups. If a match succeeds with groups that don't exist, that place will contain a 0. If the sub-group does exist, the place will contain the matching string for that group.
Returns:
A sequence, the modified text.
Examples:
include std/regex.e as re
include std/text.e

function my_convert(sequence params)
    switch params[1] do
        case "1" then
            return "one "
        case "2" then
            return "two "
        case else
            return "unknown "
    end switch
end function

regex r = re:new(`\d`)
sequence result = re:find_replace_callback(r, "125", routine_id("my_convert"))
-- result = "one two unknown "

integer missing_data_flag = 0
regex r2 = re:new(`[A-Z][a-z]+ ([A-Z][a-z]+)?`)

function my_toupper(sequence params)
    -- here params[2] may be 0.
    return upper(params[1])
end function

result = re:find_replace_callback(r2, "John Doe", routine_id("my_toupper"))
-- params[2] is "Doe"
-- result = "JOHN DOE"
printf(1, "result=%s\n", {result})

-- the pattern requires the space after the first name
result = re:find_replace_callback(r2, "Mary ", routine_id("my_toupper"))
-- result = "MARY "