OpenEuphoria: Euphoria v4.0

This document details the Euphoria interface to regular expressions rather than regular expression syntax itself, which is a complex subject that many books have been written about. The following online resources can help while learning regular expressions.

EUForum Article
Perl Regular Expressions Man Page
Regular Expression Library (user supplied regular expressions for just about any task).
Wikipedia Regular Expression Article
Man page of PCRE in HTML

8.20.2 General Use

Many functions take an optional options parameter. This parameter can be a single option constant (see Option Constants), multiple option constants or'ed together into a single atom, or a sequence of options, in which case the function takes care of or'ing them together correctly. Options are named like their C equivalents with the 'PCRE_' prefix stripped off. Namespaces disambiguate symbols, so this prefix is not needed.
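The three forms of the options parameter can be sketched as follows. This is a minimal illustration, not taken from the library sources: it assumes the regex namespace shown under Error Constants and the has_match routine from std/regex.e, which this section does not describe.

```euphoria
include std/regex.e

-- One option constant on its own.
object r1 = regex:new("hello", regex:CASELESS)

-- Two option constants or'ed together into a single atom.
object r2 = regex:new("hello.world", or_bits(regex:CASELESS, regex:DOTALL))

-- The same two options as a sequence; the library or's them together.
object r3 = regex:new("hello.world", {regex:CASELESS, regex:DOTALL})

-- r2 and r3 were built with equivalent options, so they are expected
-- to give the same answer for the same subject string.
printf(1, "%d %d\n", {
    regex:has_match(r2, "HELLO\nWORLD"),
    regex:has_match(r3, "HELLO\nWORLD")
})
```

The sequence form is usually easier to read when more than two options are combined.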

All strings passed into this library must be either 8-bit-per-character strings or UTF8 strings, which use multiple bytes to encode UNICODE characters. You can use UTF8 encoded UNICODE strings when you pass the UTF8 option.

8.20.3 Option Constants

8.20.3.1 Compile Time and Match Time

When a regular expression object is created via new, we also say it gets "compiled." The options you may use for this are called "compile time" option constants. Once the regular expression is created, you can pass it, together with a string, to the other functions. The options for these routines are called "match time" option constants. To set no options at all, either omit the options argument or supply DEFAULT.

Compile Time Option Constants

The only options that may be set at "compile time" (that is, passed to new) are ANCHORED, AUTO_CALLOUT, BSR_ANYCRLF, BSR_UNICODE, CASELESS, DEFAULT, DOLLAR_ENDONLY, DOTALL, DUPNAMES, EXTENDED, EXTRA, FIRSTLINE, MULTILINE, NEWLINE_CR, NEWLINE_LF, NEWLINE_CRLF, NEWLINE_ANY, NEWLINE_ANYCRLF, NO_AUTO_CAPTURE, NO_UTF8_CHECK, UNGREEDY, and UTF8.

Match Time Option Constants

Options that may be set at "match time" are ANCHORED, NEWLINE_CR, NEWLINE_LF, NEWLINE_CRLF, NEWLINE_ANY, NEWLINE_ANYCRLF, NOTBOL, NOTEOL, NOTEMPTY, and NO_UTF8_CHECK. Routines that take match time option constants match, split or replace a regular expression against some string.
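As a rough sketch of the distinction, the example below passes a compile time option to new and a match time option to a matching routine. It assumes the has_match routine from std/regex.e and that its third argument is the starting index in the subject string; neither is described in this section.

```euphoria
include std/regex.e

-- CASELESS is a compile time option, so it is passed to new.
object r = regex:new("^world", regex:CASELESS)

-- No match time options: ^ may match at the start of the subject.
printf(1, "%d\n", {regex:has_match(r, "World peace")})

-- NOTBOL is a match time option, so it is passed to the matching routine.
-- The start of the subject no longer counts as a beginning of line,
-- so ^ has nowhere to match.
printf(1, "%d\n", {regex:has_match(r, "World peace", 1, regex:NOTBOL)})
```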

8.20.3.2 ANCHORED

public constant ANCHORED

Forces a match to begin exactly at the position where the search starts, rather than at any later position in the string. In C, this is called PCRE_ANCHORED. This is passed to all routines including new.
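A small sketch of the effect, assuming has_match from std/regex.e takes the starting index as its third argument and match time options as its fourth:

```euphoria
include std/regex.e

object r = regex:new("world")

-- Unanchored: the search may move forward until it finds "world".
printf(1, "%d\n", {regex:has_match(r, "hello world")})

-- Anchored at index 1: the subject does not begin with "world".
printf(1, "%d\n", {regex:has_match(r, "hello world", 1, regex:ANCHORED)})

-- Anchored at index 7, which is where "world" begins.
printf(1, "%d\n", {regex:has_match(r, "hello world", 7, regex:ANCHORED)})
```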

8.20.3.3 AUTO_CALLOUT

public constant AUTO_CALLOUT

In C, this is called PCRE_AUTO_CALLOUT. To get the functionality of this flag in EUPHORIA, you can use find_replace_callback without passing this option. This is passed to new.

8.20.3.4 BSR_ANYCRLF

public constant BSR_ANYCRLF

With this option only ASCII new line sequences are recognized as newlines. Other UNICODE newline sequences (encoded as UTF8) are not recognized as an end of line marker. This is passed to all routines including new.

8.20.3.5 BSR_UNICODE

public constant BSR_UNICODE

With this option any UNICODE new line sequence is recognized as a newline. The UNICODE will have to be encoded as UTF8, however. This is passed to all routines including new.

8.20.3.6 CASELESS

public constant CASELESS

This makes your regular expression matches case insensitive. With this flag, for example, [a-z] is the same as [A-Za-z]. This is passed to new.
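For illustration only, the following compares the same pattern compiled with and without CASELESS. The has_match routine is assumed from std/regex.e.

```euphoria
include std/regex.e

object plain  = regex:new("euphoria")
object nocase = regex:new("euphoria", regex:CASELESS)

-- The subject is upper case, so only the CASELESS pattern should match.
printf(1, "%d %d\n", {
    regex:has_match(plain,  "EUPHORIA 4.0"),
    regex:has_match(nocase, "EUPHORIA 4.0")
})
```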

8.20.3.7 DEFAULT

public constant DEFAULT

This value is used when no flags are to be set at all. This can be passed to all routines including new.

8.20.3.8 DFA_SHORTEST

public constant DFA_SHORTEST

This is NOT used by any standard library routine.

8.20.3.9 DFA_RESTART

public constant DFA_RESTART

This is NOT used by any standard library routine.

8.20.3.10 DOLLAR_ENDONLY

public constant DOLLAR_ENDONLY

If this bit is set, a dollar sign metacharacter in the pattern matches only at the end of the subject string. Without this option, a dollar sign also matches immediately before a newline at the end of the string (but not before any other newlines). Thus, with this option set, you must include the newline character in the pattern before the dollar sign if you want to match a line that contains a newline character. The DOLLAR_ENDONLY option is ignored if MULTILINE is set. There is no way to set this option within a pattern. This is passed to new.
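The difference shows up when the subject ends with a newline. This is a sketch under the assumption that has_match from std/regex.e reports whether any match exists:

```euphoria
include std/regex.e

object plain  = regex:new("end$")
object strict = regex:new("end$", regex:DOLLAR_ENDONLY)

-- The subject ends with a newline character.
sequence subject = "the end\n"

-- plain:  $ may match just before the final newline.
-- strict: $ may match only at the very end, which comes after the newline.
printf(1, "%d %d\n", {
    regex:has_match(plain,  subject),
    regex:has_match(strict, subject)
})
```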

8.20.3.11 DOTALL

public constant DOTALL

With this option the '.' character also matches a newline sequence. This is passed to new.
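A minimal sketch, assuming has_match from std/regex.e and a subject that uses LF as its newline:

```euphoria
include std/regex.e

object plain  = regex:new("a.b")
object dotall = regex:new("a.b", regex:DOTALL)

-- The only character between 'a' and 'b' is a newline, so the plain
-- pattern is not expected to match while the DOTALL pattern is.
printf(1, "%d %d\n", {
    regex:has_match(plain,  "a\nb"),
    regex:has_match(dotall, "a\nb")
})
```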

8.20.3.12 DUPNAMES

public constant DUPNAMES

Allow duplicate names for named subpatterns. Since there is no way to access named subpatterns this flag has no effect. This is passed to new.

8.20.3.13 EXTENDED

public constant EXTENDED

Whitespace in the pattern, and any characters from a hash mark (#) to the end of the line, are ignored when searching, except when the whitespace or hash mark is escaped or inside a character class. This is passed to new.
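This allows a pattern to carry its own comments. The sketch below assumes has_match from std/regex.e; the pattern text itself is only an illustration.

```euphoria
include std/regex.e

-- With EXTENDED, the spaces and the trailing "# ..." comment are ignored,
-- so the pattern is effectively just [0-9]+
object commented = regex:new("[0-9]+  # one or more digits", regex:EXTENDED)

-- Without EXTENDED, the spaces, the hash mark and the words after it
-- are all part of the pattern and must appear literally in the subject.
object literal = regex:new("[0-9]+  # one or more digits")

printf(1, "%d %d\n", {
    regex:has_match(commented, "order 42"),
    regex:has_match(literal,   "order 42")
})
```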

8.20.3.14 EXTRA

public constant EXTRA

When a backslash (\) is followed by an alphanumeric character that has no special meaning, an error is generated. This is passed to new.

8.20.3.15 FIRSTLINE

public constant FIRSTLINE

If FIRSTLINE is set, the match must happen before or at the first newline in the subject (though it may continue over the newline). This is passed to new.

8.20.3.16 MULTILINE

public constant MULTILINE

When MULTILINE is set, the "start of line" and "end of line" constructs match immediately following or immediately before internal newlines in the subject string, respectively, as well as at the very start and end. This is passed to new.
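As a sketch, the pattern below can only match a whole line in the middle of the subject when MULTILINE is set. It assumes has_match from std/regex.e and LF newlines.

```euphoria
include std/regex.e

object plain = regex:new("^second$")
object multi = regex:new("^second$", regex:MULTILINE)

sequence subject = "first\nsecond\nthird"

-- plain: ^ and $ refer to the start and end of the whole subject.
-- multi: ^ and $ may also match around the internal newlines.
printf(1, "%d %d\n", {
    regex:has_match(plain, subject),
    regex:has_match(multi, subject)
})
```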

8.20.3.17 NEWLINE_CR

public constant NEWLINE_CR

Sets CR as the NEWLINE sequence. When MULTILINE is set, $ matches immediately before the NEWLINE sequence. This is passed to all routines including new.

8.20.3.18 NEWLINE_LF

public constant NEWLINE_LF

Sets LF as the NEWLINE sequence. When MULTILINE is set, $ matches immediately before the NEWLINE sequence. This is passed to all routines including new.

8.20.3.19 NEWLINE_CRLF

public constant NEWLINE_CRLF

Sets CRLF as the NEWLINE sequence. When MULTILINE is set, $ matches immediately before the NEWLINE sequence. This is passed to all routines including new.

8.20.3.20 NEWLINE_ANY

public constant NEWLINE_ANY

Sets any newline sequence as the NEWLINE sequence, including those from UNICODE when UTF8 is also set. The string will have to be encoded as UTF8, however. When MULTILINE is set, $ matches immediately before the NEWLINE sequence. This is passed to all routines including new.

8.20.3.21 NEWLINE_ANYCRLF

public constant NEWLINE_ANYCRLF

Sets any ASCII newline sequence (CR, LF, or CRLF) as the NEWLINE sequence. When MULTILINE is set, $ matches immediately before the NEWLINE sequence. This is passed to all routines including new.

8.20.3.22 NOTBOL

public constant NOTBOL

This indicates that the beginning of the passed string is NOT the Beginning Of a Line (NOTBOL), so a caret symbol (^) in the original pattern will not match the beginning of the string. This is used by routines other than new.

8.20.3.23 NOTEOL

public constant NOTEOL

This indicates that the end of the passed string is NOT the End Of a Line (NOTEOL), so a dollar sign ($) in the original pattern will not match the end of the string. This is used by routines other than new.

8.20.3.24 NO_AUTO_CAPTURE

public constant NO_AUTO_CAPTURE

Disables capturing subpatterns except when the subpatterns are named. This is passed to new.

8.20.3.25 NO_UTF8_CHECK

public constant NO_UTF8_CHECK

Turns off checking for the validity of your UTF8 string. Use this with caution: an invalid UTF8 string with this option could crash your program. Only use this if you know the string is a valid UTF8 string. This is passed to all routines including new.

8.20.3.26 NOTEMPTY

public constant NOTEMPTY

With this option, matches of empty strings are not allowed. In C, this is PCRE_NOTEMPTY. The pattern `A*a*` will match "AAAA", "aaaa", and "Aaaa" but not "". This is used by routines other than new.
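The same pattern can be used to sketch the option in code. The subject "bbb" is chosen here because the only possible match against it is an empty one; has_match is assumed from std/regex.e.

```euphoria
include std/regex.e

object r = regex:new("A*a*")

-- Without NOTEMPTY, the pattern matches the empty string at the start
-- of "bbb", because both A* and a* may match zero characters.
printf(1, "%d\n", {regex:has_match(r, "bbb")})

-- With NOTEMPTY, empty matches are rejected, and "bbb" contains
-- no 'A' or 'a' for a non-empty match.
printf(1, "%d\n", {regex:has_match(r, "bbb", 1, regex:NOTEMPTY)})
```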

8.20.3.27 PARTIAL

public constant PARTIAL

This option has no effect on whether a match will occur or not. However, it does affect the error code generated by find in the event of a failure. If, for some pattern re and two strings s1 and s2, find( re, s1 & s2 ) would return a match but both find( re, s1 ) and find( re, s2 ) would not, then find( re, s1, 1, PARTIAL ) will return ERROR_PARTIAL rather than ERROR_NOMATCH. We say s1 has a partial match of re.

Note that find( re, s2, 1, PARTIAL ) will return ERROR_NOMATCH. In C, this constant is called PCRE_PARTIAL.
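The description above might be exercised as follows. This is a sketch only: the pattern and the strings s1 and s2 are made up for illustration, and the code relies on find returning a sequence on success and the error constants on failure, as this section describes.

```euphoria
include std/regex.e

object r = regex:new("abcdef")
sequence s1 = "abc"
sequence s2 = "def"

-- The concatenation matches, but neither half matches on its own.
object whole = regex:find(r, s1 & s2)

-- With PARTIAL, the failure on s1 is reported as a partial match.
object part = regex:find(r, s1, 1, regex:PARTIAL)

if sequence(whole) and equal(part, regex:ERROR_PARTIAL) then
    puts(1, "s1 has a partial match of the pattern\n")
end if
```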

8.20.3.28 STRING_OFFSETS

public constant STRING_OFFSETS

This is used by matches and all_matches.

8.20.3.29 UNGREEDY

public constant UNGREEDY

This modifier sets the pattern such that quantifiers are not greedy by default, but become greedy if followed by a question mark. This is passed to new.
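To make the effect of UNGREEDY concrete, the sketch below compares what the same pattern captures with and without it. It assumes the matches routine from std/regex.e returns a sequence whose first element is the full matched text.

```euphoria
include std/regex.e

object greedy = regex:new("<.+>")
object lazy   = regex:new("<.+>", regex:UNGREEDY)

sequence subject = "<a><b>"

object m_greedy = regex:matches(greedy, subject)
object m_lazy   = regex:matches(lazy, subject)

if sequence(m_greedy) and sequence(m_lazy) then
    -- Greedy: .+ takes as much as possible, up to the last '>'.
    -- Ungreedy: .+ takes as little as possible, up to the first '>'.
    printf(1, "%s %s\n", {m_greedy[1], m_lazy[1]})
end if
```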

8.20.3.30 UTF8

public constant UTF8

Causes strings passed in to be interpreted as UTF8 encoded strings. This is passed to new.

8.20.4 Error Constants

Error constants differ from their C equivalents as they do not have PCRE_ prepended to each name.

8.20.4.1 ERROR_NOMATCH

include std/regex.e
namespace regex
public constant ERROR_NOMATCH

There was no match found.
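A caller might guard against a failed match along these lines. This is a hedged sketch: it assumes the matches routine from std/regex.e returns an atom such as ERROR_NOMATCH when nothing matches and a sequence of strings otherwise, so it tests only for an atom result rather than for one specific error value.

```euphoria
include std/regex.e

object r = regex:new("[0-9]+")
object m = regex:matches(r, "no digits here")

if atom(m) then
    -- An atom result signals failure, for example ERROR_NOMATCH.
    puts(1, "no match\n")
else
    printf(1, "matched %s\n", {m[1]})
end if
```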

8.20.4.2 ERROR_NULL

include std/regex.e
namespace regex
public constant ERROR_NULL

There was an internal error in the EUPHORIA wrapper (std/regex.e in the standard include directory or be_regex.c in the EUPHORIA source).

8.20.4.3 ERROR_BADOPTION

include std/regex.e
namespace regex
public constant ERROR_BADOPTION

There was an internal error in the EUPHORIA wrapper (std/regex.e in the standard include directory or be_regex.c in the EUPHORIA source).

8.20.4.4 ERROR_BADMAGIC

include std/regex.e
namespace regex
public constant ERROR_BADMAGIC

The pattern passed is not a value returned from new.

8.20.4.5 ERROR_UNKNOWN_OPCODE

include std/regex.e
namespace regex
public constant ERROR_UNKNOWN_OPCODE

An internal error occurred, either in the PCRE library EUPHORIA uses or in its wrapper.

8.20.4.6 ERROR_UNKNOWN_NODE

include std/regex.e
namespace regex
public constant ERROR_UNKNOWN_NODE

An internal error occurred, either in the PCRE library EUPHORIA uses or in its wrapper.

8.20.4.7 ERROR_NOMEMORY

include std/regex.e
namespace regex
public constant ERROR_NOMEMORY

Out of memory.

8.20.4.8 ERROR_NOSUBSTRING

include std/regex.e
namespace regex
public constant ERROR_NOSUBSTRING

The wrapper or the PCRE backend didn't preallocate enough capturing groups for this pattern.

8.20.4.9 ERROR_MATCHLIMIT

include std/regex.e
namespace regex
public constant ERROR_MATCHLIMIT

Too many matches encountered.

8.20.4.10 ERROR_CALLOUT

include std/regex.e
namespace regex
public constant ERROR_CALLOUT

Not applicable to our implementation.

8.20.4.11 ERROR_BADUTF8

include std/regex.e
namespace regex
public constant ERROR_BADUTF8

The subject or pattern is not valid UTF8 but it was specified as such with UTF8.

8.20.4.12 ERROR_BADUTF8_OFFSET

include std/regex.e
namespace regex
public constant ERROR_BADUTF8_OFFSET

The offset specified doesn't start on a UTF8 character boundary but it was specified as UTF8 with UTF8.

8.20.4.13 ERROR_PARTIAL

include std/regex.e
namespace regex
public constant ERROR_PARTIAL

Pattern didn't match, but there is a partial match. See PARTIAL.

8.20.4.14 ERROR_BADPARTIAL

include std/regex.e
namespace regex
public constant ERROR_BADPARTIAL

PCRE backend doesn't support partial matching for this pattern.

8.20.4.15 ERROR_INTERNAL

include std/regex.e
namespace regex
public constant ERROR_INTERNAL

8.20.4.16 ERROR_BADCOUNT

include std/regex.e
namespace regex
public constant ERROR_BADCOUNT

The size parameter to find is less than -1.