OpenEuphoria: Wiki Diff

Wiki Diff Searching, revision #1 to tip

= About this miniguide

The following is a rough draft of a Miniguide~--a draft
that can use some help.

If you spot something that is wrong or
confusing, write a comment right into this wiki page.

For example, if a paragraph or table does not make sense
it would help if you:
* just say "doesn't make sense"
* add a comment "try this..."

Your comments will be used to help re-write this miniguide.

Comments from the forum have already been valuable. I
see areas where improvement is needed.

Some concerns I have after writing this miniguide are:
* possibly too much content in too small an article
* I was adding {{{&& && &&}}} to tables in an effort to
adjust table column spacing
* possibly a separate guide is needed to emphasize native
Euphoria functions, and isolate regex functions to this guide

\\_tom

----

== Intro

Searching is about finding a //needle// (say a specific word)
in a //haystack//, which is a string of text~--that string could be just a sentence or a library of books.

A "regular expression" (regex for short) is a way of describing the needle. The simplest regex 'needle' is the
word written out exactly. The interesting regex 'needles'
can specify all kinds of details...capitalization, location of the word, alternatives to the word, allowed spellings...tremendous power and flexibility is available.

Regex searching is a big topic. Several references will help:

* Perl Regular Expressions Man Page
http://perldoc.perl.org/perlre.html
* Regular Expression Library (user supplied regular expressions for just about any task).
http://regexlib.com/
* Wikipedia Regular Expression Article
http://en.wikipedia.org/wiki/Regular_expression
* Man page of PCRE in HTML
http://www.slabihoud.de/software/archives/pcrecompat.html
* a helpful website on regular expressions
http://www.regular-expressions.info/tutorial.html

\\\\

----

= Searching Miniguide

A primer on searching for content in strings using Euphoria.

== About Searching

Searching is often directed at a string of text. If the text string is
##"Hello World!"##, then a search could be to locate the text ##World##
within that string.

* ##Hello **World**!##

\\Searching may produce a variety of results~:
* just report that a search succeeds
** true or false
* report on the location of the search
** first index~: 7
** extent of matching slice~: {7,11}
* report on what was matched
** //World//
* replace the match with something else
** replace ##World## with ##Euphoria##
** ##"Hello Euphoria!"##
* split the string (remove the match and split the sequence)
** ##{"Hello ", "!"}##

\\What is being searched may be stated~:
* as a literal
** ##World##
* as a wildcard
** ##?orld##
* as a regex
** ##W(.*)d##

\\This Miniguide is an introduction to the functions useful in searching. The Miniguide
search table, and regex cheatsheet will help in
deciding which function to use.

== Simple Word Matching

Given the string ##"Hello World!"## the objective is to search and locate the
text ##"World"## within that string.

The needle, "thing to be found," is ##"World"##. The haystack, "where to look," is
##"Hello World!"##. Or more compactly: locate "World" in "Hello **World**!".

When using the Euphoria ##eu:match()## function, think of ##"World"## as being
a **slice** of the string ##"Hello World!"##.

When using the regex module ##re:find()## or ##re:matches()##, think of ##"World"## as being
a **pattern** (a //regex// pattern) that must be matched against the string
##"Hello World!"##.

In this example the slice viewpoint and regex viewpoint are identical. Both represent a
**literal** search for text within string "Hello World!".

The Euphoria code that solves the problem is:

|= regex | |= eu:match |
|<eucode>
include std/regex.e as re
regex r = re:new( "World" )
? re:find( r, "Hello World!" )
-- {
-- {7,11}
-- }
</eucode>| | <eucode>
? eu:match( "World", "Hello World!" )
-- 7
</eucode> | |
| <eucode>
? re:matches( r, "Hello World!" )
-- {
-- "World"
-- } </eucode> | | |

\\Note: ##match()## is a built-in Euphoria function, so it normally does not need a
namespace qualifier. It is written as ##eu:match()## in this Miniguide to make
it explicitly different from the regular expression functions, which use ##re:## as a
namespace qualifier.

For simple literal matching, Euphoria routines are quicker and easier to use
than regex routines. However, the Euphoria ##eu:match()## function is limited to
literal searches. The regex functions, ##re:find()## and ##re:matches()##, can
be extended to search for complex and clever patterns. This Miniguide will
emphasize regex searching.

The information obtained from a search may be used in various ways:

The result can be used in a conditional:

|= regex | |= eu:match |
| <eucode>
if sequence( re:find( r, "Hello World!" ) ) then
puts(1, "It matches \n" )
end if
</eucode> | | <eucode>
if match( "World", "Hello World!" ) then
puts(1, "It matches \n" )
end if
</eucode> | |
| <eucode>
if atom( re:find( r, "Hello World!" ) ) then
puts(1, "It does not match \n" )
end if
</eucode> | | |
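
\\When only a yes/no answer is needed, the ##sequence()## / ##atom()## test can be wrapped up in a single call. The following is a minimal sketch, assuming the installed ##std/regex.e## provides ##re:has_match()## (it is not used elsewhere in this Miniguide, so check your version):

<eucode>
include std/regex.e as re

regex r = re:new( "World" )

-- has_match() is true when the pattern is found anywhere in the haystack
if re:has_match( r, "Hello World!" ) then
    puts(1, "It matches \n" )
else
    puts(1, "It does not match \n" )
end if
</eucode>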

\\The regex search pattern could have been a variable:

|= regex | | Euphoria slice |
| <eucode>
sequence greeting = "World"
r = re:new( greeting )
if sequence( re:find( r, "Hello World!" ) ) then
puts(1, "It matches \n" )
end if
</eucode> | | |

== Wildcard ? *

Wildcard characters allow a needle string to be written as a pattern that
should be matched instead of a literal form.

The **wildcard** style follows the same rules as defined by operating systems
when directories and filenames are expressed.

The wildcard //(// **?** //)// means substitute any one character. The
wildcard //(// ** * ** //)// means substitute any number of characters.

| wildcard needle | : | haystack match |
|"?og" | : | "**dog**" |

<eucode>
include std/wildcard.e as wild
? wild:is_match( "?og", "dog" )
-- 1

? wild:is_match( "?og", "cat" )
-- 0

? wild:is_match( "*ot", "parrot" )
-- 1
</eucode>

Wildcard matching is good for groups of filenames.
It can be used for general matching, but recognize the limitations:

* the matches with ##wild:is_match()## are case sensitive
* the needle must match the __entire__ haystack
* a simple trick allows a "slice" to be matched; just add a ** * ** before
and after the text of the slice being matched
* literal characters * and ? can not be matched

<eucode>
? wild:is_match( "*dog*", "hounddog" )
-- 1
</eucode>

\\The needle is still matching the entire haystack; but effectively it is the slice
that is matched.

The other wildcard function is designed for actual filename matching:

<eucode>
? wild:wildcard_file( `?:\data`, `c:\DATA` )
--
-- returns 1 on Windows systems
-- returns 0 on Unix systems
</eucode>

\\The wildcards ** ? * ** follow the conventions used by the operating system in use.
Thus matches under //Windows// are not case sensitive, but matches under //Unix// are
case sensitive.

Wildcards add a bit of power to matching operations, but nothing compared to what a
regex is capable of.

== Regex Matching

A **regular expression** is a __pattern matching language__. It is best to think of a regex
as a small, terse, mini-language that is used from within Euphoria to perform
matches on strings. A regex is a language unto itself. To use it one must learn
the "regex way" of writing and expressing patterns.

This Miniguide will introduce the use of regular expressions. To fully appreciate
what is possible, read the Wikipedia article on regular expressions, and then
get a book on the subject.

There are many tutorials and articles on regular expressions on the internet.

The Euphoria regex functions use the PCRE library written by Philip Hazel.
PCRE means "Perl Compatible Regular Expressions." There are other regex styles
(such as GREP and Emacs-style), but Perl-style is used by many programming languages and is
thus widely documented.

== Writing Strings

A regex needle is written as a text string. Because it is a mini-language, a few rules
must be followed before any meaningful regex patterns can be written.

There are two fundamental ways to write a text string in Euphoria using:
* string delimiter ##**"**##
** "World"
* raw delimiter ##** ` **## or ##** """ **##
** `World`
** """World"""

\\If a **"** //string// delimiter is used, then the string may contain **escaped characters**
that are preceded with a **\** escape character. This allows "non-printing" characters such
as tab '\t' and newline '\n' to be part of the string.
Values can also be coded in
octal format \033 , or hexadecimal format \x1B.
A literal **\** must be
written as **{{{\\}}}** and a literal **"** must be written as **\"** .

If a **`** or **"""** //raw// delimiter is used then all characters are literal
as-is~--there are no escaped characters.

A search requires a //needle// string and a //haystack// string to be written.

When using the ##eu:match()## function, just use the same string delimiter style
for both the needle and haystack.

When using the regex.e include module, the needle string has special rules
(and special exemptions) that make writing these strings interesting.

Some //regex needle// characters may not be used exactly as written:

<eucode>

{ } [ ] ( ) ^ $ . | * + ? \ -- special metacharacters

- ] \ ^ $ -- special characters

\s \S \w \W \d \D -- special escaped characters
</eucode>

\\All of these "special" characters are part of the regex language and are
written into a needle string in order to describe a regex-pattern that
will be used for searching.

If the string delimiter **"** is used to write a needle, then the escaped
characters used by Euphoria and the special characters used by the regex pattern
will conflict. It is possible to escape special characters, escape escaped
characters and ultimately write a needle that will work. The resulting needle
takes effort to write and looks like a mess.

If a raw string delimiter, **`** or **"""** is used, then the regex needle string is much
easier to write and easier to understand.

Part of the charm in writing a regular expression is knowing the rules for
when a metacharacter is part of the language~--and has "meta meaning"~--and when it is just
itself~--a regular character.

To write a metacharacter in a regex needle and give it its plain literal meaning, it must
be "escaped" using the **\** character.

<eucode>
-- needle metacharacters must be escaped to revert to literal meaning
\{ \} \[ \] ... \? \\
</eucode>
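
\\When the text to be matched literally is held in a variable, escaping each metacharacter by hand is error prone. The following is a minimal sketch, assuming the installed ##std/regex.e## provides ##re:escape()##; the variable name ##literal## is only illustrative:

<eucode>
include std/regex.e as re

-- the text C:\WIN, written with the string delimiter
sequence literal = "C:\\WIN"

-- escape() places a \ in front of the regex metacharacters,
-- so the needle is treated as plain text
regex r = re:new( re:escape( literal ) )

if sequence( re:find( r, "C:\\WIN32" ) ) then
    puts(1, "It matches \n" )
end if
</eucode>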

\\The special metacharacter delimiters **[ ] ** are used to define a character
class. For example ##[aeiou]## is the class of vowels in the English language.
Any one character from this list would produce a valid match. The class delimiters
add a few more rules to follow~:
* metacharacters inside a [ ] class are just themselves
* a **-** inside a [ ] class is //special// and used to
indicate a range of characters
** [0-9] means [0123456789]
** meaning any digit
* a **^** as the __first__ character in a [ ] class
denotes the //negated character class//
** meaning everything matches __but__ the characters within the class
** [123] will match any of 1, 2, 3
** [^123] will match any character __but__ 1, 2, 3
* a **^**, if not the first character, is just its literal self

When written in a haystack string, metacharacters are no longer special, but are literal
characters.

However, the **\** is a metacharacter that is shared by both Euphoria and
the regex language. To write a literal **\** //escape// character it __may__ have to be
"escaped" by writing it twice. There are //two// levels of escaping that must be considered!

* when writing the haystack string
* when writing the needle string

\\Example: literal escape \ character in a string:

|= string delimiter | " | ` | """ |
| literal | \ | \ | \ |
| haystack string | {{{"\\"}}} | {{{`\`}}} | {{{"""\"""}}} |
| needle string | {{{"\\\\"}}} | {{{`\\`}}} | {{{"""\\"""}}} |

\\A string delimiter, itself, can not be part of a string:

|= string delimiter | " | ` | """ |
| literal | " | ` | """ |
| within a string | \" | no | no |

\\A practical example:

|= |=String delimiter " |
|**needle** \\ **haystack**| <eucode>
r = re:new( "C:\\\\WIN" )
? re:find( r, "C:\\WIN32" )
</eucode> |

\\
|= |=Raw delimiter ` | |=Raw delimiter """ |
|**needle** \\ **haystack**| <eucode>
r = re:new( `C:\\WIN` )
? re:find( r, `C:\WIN32` )
</eucode> | | <eucode>
r = re:new( """C:\\WIN""" )
? re:find( r, """C:\WIN32""" )
</eucode> |

\\There are a variety of non-printing ASCII characters that are written
preceded by the \ escape character. Common examples are: \t for tab,
\n for newline, and \r for carriage return. Values can also be coded in
octal format \033 , or hexadecimal format \x1B.

There will only be a valid match if the needle pattern exactly matches
a slice of the haystack string:

|= regex | |= eu:match |
| <eucode>
? re:find( re:new("world"), "Hello World!" )
-- an atom is returned
-- no match (the search is case sensitive)
</eucode> | | <eucode>
? eu:match( "world", "Hello World!" )
-- 0
-- no match
</eucode> |
| <eucode>
? re:find( re:new("World "), "Hello World!" )
-- an atom is returned
-- no match (no space follows World)
</eucode> | | <eucode>
? eu:match( "World ", "Hello World!" )
-- 0
-- no match
</eucode> |
| <eucode>
? re:find( re:new("o W" ), "Hello World!" )
-- {
--   {5,7}
-- }
-- match located
</eucode> | | <eucode>
? eu:match( "o W", "Hello World!" )
-- 5
-- match located
</eucode> |

\\Note: Regex examples in the literature are often written as if for the
Perl language. The **/** is a common delimiter used in Perl to write
a regex string. A Perl regex needle that looks like ##/Hello/##,
is ##`Hello`## in Euphoria. Perl examples require a literal **/** to be written
in escaped form. The Perl regex ##/Good\/bye/## is ##`Good/bye`## in Euphoria.

== ^ Anchored Matches $

If a regex pattern can be found __anywhere__ in the haystack string, then a valid match
is reported. It is possible to specify //where// the match must occur, by using the **anchor**
metacharacters //(// **^** //)// and //(// **$** //)// . The anchor **^** means the match
must occur at the beginning of the string. The anchor **$** means the match must
occur at the end of the string, or before the newline **\n** at the end of that string.

<eucode>
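? re:find( re:new( "keeper" ), "housekeeper" )        -- matches
? re:find( re:new( "^keeper" ), "housekeeper" )       -- no match, keeper is not at the start
? re:find( re:new( "keeper$" ), "housekeeper" )       -- matches at the end
? re:find( re:new( "keeper$"), "housekeeper\n" )      -- matches before the final newline
? re:find( re:new( "^housekeeper$" ), "housekeeper" ) -- matches the whole string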
</eucode>

\\A related idea is to specify the index from which the matching must begin.
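
A minimal sketch of that idea, assuming ##re:find()## accepts a starting index as an optional third argument (as documented for Euphoria 4.0); the haystack is only illustrative:

<eucode>
include std/regex.e as re

regex r = re:new( "cat" )
sequence haystack = "cat and cat"

-- the search begins at index 1, so the first cat is found
? re:find( r, haystack )

-- the search begins at index 5, so the first cat is skipped
? re:find( r, haystack, 5 )
</eucode>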

== Using [ ] Character Classes

If only literal strings are used in searching, then there is little advantage in using a
regex module. A **character class** allows for a set of possible characters that
may be used in a match, rather than just the explicit (literal) characters found
when a regex is written out as just a word. Character classes are denoted by
//(// **[ ]** //)// delimiters; inside the brackets a list of permitted characters is
written out.

|= re:find() | |= eu:match() |
|<eucode>
? re:find( re:new( "cat" ), "cat" ) -- matches cat
</eucode>| | <eucode>
? eu:match( "cat", "cat" )
-- 1
-- only literal match possible
</eucode> |

<eucode>
? re:find( re:new( "[bcr]at"), "bat" ) -- matches bat

? re:find( re:new( "[cab]"), "abc" ) -- matches a
</eucode>

In the last example, the first valid match from the character class is ##'a'##. It does
not matter which order the characters are written within the character class, thus
##'c'## does not have to be the first match.

A character class may be used to create a regex pattern that allows for case-insensitive
matching.

<eucode>
? re:find( re:new( "[yY][eE][sS]"), "yeS" )

? re:find( re:new( "yes", CASELESS ), "yEs" )
</eucode>

The easier way to achieve the same match is to add a **modifier**, ##CASELESS##, to
the ##re:new()## function.

Both ordinary and special characters may be included in a character class. The
special characters ##** - ] \ ^ $ **## must be matched using an escape.

<eucode>
? re:find( re:new( """[\]c]def""" ), "]def" ) -- both ]def and cdef will match

sequence x = "bcr"
? re:find( re:new( "[" & x & "]at" ), "rat" )

? re:find( re:new( """[\$""" & x & "]at"), "$at" )

? re:find( re:new( """[\\$x]at"""), """\at""" )

</eucode>

Note that Perl has a simple way of embedding a variable into a string using the $x
notation. Euphoria has no such variable substitution, so the Perl examples that show
how $ may be confused with a variable are not relevant here.

Euphoria always uses **&** for
string concatenation.

The special character //(// **-** //)// is the range operator within a **[ ]** character
class. The long
forms of ##[0123456789]## or ##[abc...xyz]## become compact ##[0-9]## or ##[a-z]##.

There is a special rule for including a literal **-** character: when it is the
first or last character in a class, it is considered to be an ordinary (literal)
character.

The special character //(// **^** //)// when in the //first// position of a character class
denotes
a **negated character class**. That means any character //but// those listed in the
character class will match.

<eucode>
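? re:find( re:new( "[^a]at" ), "aat" )   -- no match, the character before "at" is an a
? re:find( re:new( "[^0-9]" ), "house" ) -- matches h, the first non-digit
? re:find( re:new( "[a^]at" ), "^at" )   -- matches, here ^ is a literal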
</eucode>

There are abbreviations for the most common character classes:

|= abbreviation |= class |= description |
| \d | [0-9] | a digit |
| \s | [\ \t\r\n\f] | a whitespace character |
| \w | [0-9a-zA-Z_] | a word character |
| \D | [^\d] | negated \d, any character but a digit |
| \S | [^\s] | negated \s, any non-whitespace character |
| \W | [^\w] | negated \w, any non-word character |
| . | | any character, except \n |

The abbreviations ##** \d \s \w \D \S \W **## can be used both inside and outside of a
character class.

<eucode>
? re:find( re:new( """\d\d:\d\d:\d\d"""), "12:30:48 the time format" )
? re:find( re:new( """[\d\s]"""), "any digit or whitespace" )
? re:find( re:new("""\w\W\w""" ), "matches a word char, followed by non word, followed by word char" )
? re:find( re:new("..rt"), "matches any two chars followed by rt" )
? re:find( re:new("""end\."""), "matches end." )
? re:find( re:new("end[.]"), "same thing, matches end." )
</eucode>

Notice that when abbreviations are used, the **"""** triple delimiter should be used.

The **word anchor** ##\b## matches the boundary between a word character and a
non-word character ##\w\W## or ##\W\w## .

<eucode>
x = "Housecat concatenates house and cat"
? re:find( re:new("""\bcat"""), x )   -- matches only the final cat, the one that starts a word
? re:find( re:new("""cat\b"""), x )   -- matches the cat that ends Housecat
? re:find( re:new("""\bcat\b"""), x ) -- matches only the final cat, a whole word
</eucode>

== Matching this | that --Alternatives

Different character strings may be matched using the **alternation**
metacharacter //(// **|** //)//. If ##dog## or ##cat## are to match, write the regex pattern
as "dog|cat". At each point in the haystack string, first "dog" will be tried as a
match; if it fails then "cat" will be tried as a match. If both fail, then the next
character of the haystack string starts the matching tests again.

<eucode>
? re:find( re:new("cat|dog|bird"), "cats and dogs -- matches cat" )
? re:find( re:new("dog|cat|bird"), "cats and dogs -- matches cat" )
</eucode>

In the second example "dog" may be first in the list, but "cat" is still the first
pattern that matches, because it occurs earlier in the haystack string.

<eucode>
? re:find( re:new("c|ca|cat|cats"), "cats -- matches c" )
? re:find( re:new("cats|cat|ca|c"), "cats -- matches cats" )
</eucode>

At any given character position in the haystack string, the first alternative that fully
matches will be the match that succeeds. In these examples the first alternative does
have a valid match, so no more matching tests are required.

== ( Grouping Things ) and ( Hierarchical (Matching))

The grouping metacharacters //(// ** ( )** //)// allow a part of the regex pattern to be
treated as a single unit: the parentheses surround that portion of the pattern.
Thus the regex pattern ##"house(cat|keeper)"## means that ##"housecat"## or
##"housekeeper"## will succeed as matches.

<eucode>
? re:find( re:new("(a|b)b"), "ab bb -- both will match" )
? re:find( re:new("(^a|b)c"), "ac at start of string, or bc anywhere" )
? re:find( re:new("house(cat|)"), "house or housecat will match" )
? re:find( re:new("house(cat(s|)|)"), "housecats or housecat or house will match" )
? re:find( re:new("""(19|20|)\d\d"""), """matches the null alternative ()\d\d because 20\d\d can not match""" )
</eucode>

== Extracting Matches

The grouping metacharacters **( )** can be used to "extract" the matching slice for each
grouping. First the slice for the overall match is presented, then each grouping is
presented.

<eucode>
? re:find( re:new("""(\d\d):(\d\d):(\d\d)"""), "34:12:15 -- matching hh:mm:ss time format" )
-- {
--   {1,8},
--   {1,2},
--   {4,5},
--   {7,8}
-- }
</eucode>
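
When the matched text itself is wanted rather than the index pairs, ##re:matches()## (introduced in Simple Word Matching) can be used with the same pattern. A minimal sketch; the variable name ##m## and the output labels are only illustrative:

<eucode>
include std/regex.e as re

regex r = re:new( `(\d\d):(\d\d):(\d\d)` )
object m = re:matches( r, "34:12:15" )

if sequence( m ) then
    -- m[1] is the overall match; m[2], m[3], m[4] are the three groupings
    printf(1, "hours=%s minutes=%s seconds=%s\n", { m[2], m[3], m[4] } )
end if
</eucode>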

When groupings in a regex pattern are nested, then they are shown starting with the
leftmost opening parenthesis, followed by the next opening parenthesis:

<eucode>
r = re:new( `(ab(cd|ef)((gi)|j))` )
--           1  2      34
</eucode>

A related feature is **backreferences**, which are written as ##\1##, ##\2##, and so on. A backreference
refers to the text matched by an earlier grouping, and may be used inside the definition of a regex pattern.

<eucode>
? re:find( re:new("""(\w\w\w)\s\1"""), "the the is located in a string" )
</eucode>

TOM above seems to work, but need some more understanding

Note: ##$1##, ##$2##, ... are Perl-specific variables that hold the extracted
groupings. They do not exist in Euphoria. The ##\1##, ##\2##, ... backreferences are internal to
the regex pattern, and are therefore available in Euphoria.

== Matching Repetitions ? * + { }

The quantifier metacharacters //(// ** ? * + { }** //)// allow a portion of the regex pattern
to be
repeated (a specified number of times) and still provide a successful match.

The repetition meanings are:

|= meta |= regex example |= meaning |
| ? | `a?` | matches **a** 1 or 0 times |
| * | `a*` | matches **a** 0 or more times -- any number |
| + | `a+` | matches **a** 1 or more times -- at least once |
| {n,m} | `a{n,m}` | matches **a** at least n, but not more than m times |
| {n,} | `a{n,}` | matches **a** at least n times |
| {n} | `a{n}` | matches **a** exactly n times |

<eucode>
? re:find( re:new( `[a-z]+\s+\d*` ), "beer 12 -- lower case word, some space, and any number of digits" )

? re:find( re:new( `(\w+)\s+\1` ), "match match doubled words of arbitrary length" )

? re:find( re:new( `\d{2,4}` ), "20311 02 1914 --at least 2 but not more than 4 digits" )

? re:find( re:new( `\d{4}|\d{2}` ), "123 02 2010 -- better strategy since 3 digit dates excluded" )
</eucode>

These quantifiers are known as **greedy**, in that they will try to match __as much__ of the
haystack string as possible.

<eucode>
? re:find( re:new( `^(.*)(at)(.*)$` ), "the cat in the hat" )

--{
-- {1,18},
-- {1,16},
-- {17,18},
-- {19,18}
--}
</eucode>

This regex matches the haystack string. The first group is ##"the cat in the h"##, the
second group is ##"at"##, and the third group is the empty string (reported as the empty slice ##{19,18}##).

The first //(// ** .* ** //)// quantifier (being greedy) grabs as much of the string as
possible, as long
as the regex still matches the haystack string. The second .* matches nothing, because there is
no string left for it to work on.
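
To see what each group grabbed, the index pairs can be used to slice the haystack. A minimal sketch; the names ##haystack## and ##m## are only illustrative, and it assumes every group took part in the match (true for this pattern):

<eucode>
include std/regex.e as re

sequence haystack = "the cat in the hat"
object m = re:find( re:new( `^(.*)(at)(.*)$` ), haystack )

if sequence( m ) then
    for i = 1 to length( m ) do
        -- m[1] is the overall match, the rest are the groups;
        -- an index pair such as {19,18} slices out an empty string
        printf(1, "%d: [%s]\n", { i - 1, haystack[ m[i][1] .. m[i][2] ] } )
    end for
end if
</eucode>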

== Matching Efficiency

The ##re:new()## function is said to "compile" a regex text pattern into an internal
format used for actual matching. This "compiling" has a time cost associated with it.

When a regex is created, the PCRE engine allocates some
memory for it. Euphoria performs the required memory cleanup
for you.
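
Because of the compile cost, a pattern that is used repeatedly is best created once and then reused. A minimal sketch of that habit; the names ##digits## and ##lines## and the sample data are only illustrative:

<eucode>
include std/regex.e as re

-- compile the pattern once, outside the loop
regex digits = re:new( `\d+` )

sequence lines = { "room 101", "no number here", "route 66" }

for i = 1 to length( lines ) do
    -- the compiled regex is reused for every haystack
    if sequence( re:find( digits, lines[i] ) ) then
        printf(1, "%s contains digits\n", { lines[i] } )
    end if
end for
</eucode>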

== Find and Replace

A regex may be located in a string and then the contents replaced with new
text. Use the ##re:find_replace()## function:

<eucode>
-- a literal replacement

sequence x = "Time to feed the cat!" -- Time to feed the cat!
regex r = re:new( "cat" )
x = re:find_replace( r, x, "hacker" ) -- Time to feed the hacker!
</eucode>

<eucode>
-- replacement is a backreference

sequence y = "'quoted words'" -- 'quoted words'
regex r = re:new( `'(.*)'$` )
y = re:find_replace( r, y, `\1` ) -- quoted words

-- the single quotes are stripped from the string
</eucode>

<eucode>
-- all matches are replaced

x = "I batted 4 for 4" -- I batted 4 for 4
r = re:new( `4` )
x = re:find_replace( r, x, "four" ) -- I batted four for four
</eucode>

\\
<eucode>
-- specify the number of replacements

x = "I batted 4 for 4" -- I batted 4 for 4
r = re:new( `4` )
x = re:find_replace_limit( r, x, "four", 1 ) -- I batted four for 4
</eucode>

== { "Splitting", "Text" }

A string may be split into individual words. A regex needle is used to locate
suitable splitting locations. The located needle is "removed" from the string
and used to "split" the string into words. For example the regex needle ##`\s+`##
represents the space between words. The ##re:split()## function can then produce a sequence
of words:

<eucode>
include std/console.e -- provides display()
sequence x = "Calvin and Hobbes"

regex r = re:new( `\s+` )
x = re:split( r, x )
display( x )
--{
-- "Calvin",
-- "and",
-- "Hobbes"
--}
</eucode>

The elements of the new sequence are sometimes called **tokens**, because the
splitting process has more uses than just finding words in a sentence.

\\Data from a spreadsheet may be saved in a comma-delimited format. The
regex ##`,\s*`## will locate commas and whitespace, and allow a
comma delimited string to be split:

<eucode>

x = "1.618,2.718, 3.142"
r = re:new( `,\s*` )
x = re:split(r,x)
display( x )
--{
-- "1.618",
-- "2.718",
-- "3.142"
--}
</eucode>

The regex ##`/`## will parse a directory listing:

<eucode>
x = `/usr/bin`
r = re:new( `/` )
x = re:split(r,x)
display(x)
--{
-- "",
-- "usr",
-- "bin"
--}
</eucode>

Since the first character of ##x## matched the regex, ##re:split()## prepends an empty
initial element to the output sequence.

The RDS archives contain a few more options for splitting strings of text.
For example, the file ##strtok-2-1.zip## contains some useful functions related to
splitting. Warning: this file was written for Euphoria V3~--before using it, copy the
file "Strtok-v-2-1.e" to "strtok.e". Then, using a text editor, rename the
variable "loop" to "iloop" and rename "case" to "xcase". (This is to accommodate the
new keywords found in Euphoria V4.0.)

|=regex | | strtok |
| <eucode>
include std/regex.e as re
</eucode> | | <eucode>
include strtok.e </eucode> |
|<eucode>
object x = "1.618,2.718, 3.142"
r = re:new( `,\s*` )
x = re:split(r,x)
display( x )
--{
-- "1.618",
-- "2.718",
-- "3.142"
--}
</eucode> | | <eucode>
object x = "1.618,2.718, 3.142"
x = parse( x, `, `)
display( x )
--{
-- "1.618",
-- "2.718",
-- "3.142"
--}
</eucode> |

The "strtok" functions may be faster and easier to use than regex splitting.
~~Comparing the regex way with the wildcard way:~~

~~|= regex | | wildcard |~~
~~|<eucode>~~
~~include std/regex.e as re | | <eucode>~~
~~include std/wildcard.e as wild~~
~~</eucode>|~~
~~| regex r | | |~~
~~|<eucode>~~
~~r = re:new( `.og` )~~
~~? re:find(r, "dog" )~~
~~</eucode> | | <eucode>~~
~~? wild:is_match( "?og", "dog" )~~
~~</eucode>|~~
~~|<eucode>~~
~~? re:find(r, "cat")~~
~~</eucode>| |<eucode>~~
~~? wild:is_match( "?og", "cat" )~~
~~</eucode> |~~
~~|<eucode>~~
~~r = re:new( `*ot` )~~
~~? re:find(r, "parrot" )~~
~~</eucode> || <eucode>~~
~~? wild:is_match( "*ot", "parrot" ) |~~

~~\\The meaning of **?** in a regex and in a wildcard is __not__~~
~~the same. This is a long standing historical difference:~~

~~|= meaning&& && && && && |= regex&& && && && |= wildcard |~~
~~| exactly one|. | ? |~~
~~| zero or more | * | * |~~
~~| ---- | ---- | ---- |~~
~~| optional | ? | |~~

~~\\The regex **?** is not a wildcard itself, but it is used to turn the~~
~~character just before it into a "kind of wildcard." If the regex ##`.`## means exactly~~
~~one character, then the regex ##`.?`## means one character or no character~--thus~~
~~meaning **optional**.~~
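\\A minimal sketch of the optional **?**, assuming the ##re## namespace used elsewhere in this guide. The needle ##`colou?r`## is only an illustration; ##re:has_match()## is used because it simply reports 1 when the needle is found anywhere in the haystack and 0 otherwise.

<eucode>
include std/regex.e as re

-- the u is optional, so both spellings match
regex r = re:new( `colou?r` )

? re:has_match( r, "color" )
-- 1
? re:has_match( r, "colour" )
-- 1

-- the o after "col" is not optional, so this fails
? re:has_match( r, "colr" )
-- 0
</eucode>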

~~Wildcard matching is good for groups of filenames.~~
~~It can be used for general matching, but recognize the limitations:~~

* the matches with ##wild:is_match()## are case sensitive
* the needle must match the __entire__ haystack
* a simple trick allows a "slice" to be matched; just add a ** * ** before
~~and after the text of the slice being matched~~
* literal characters * and ? cannot be matched

~~<eucode>~~
~~? wild:is_match( "*dog*", "hounddog" )~~
~~-- 1~~
~~</eucode>~~

~~\\The needle is still matching the entire haystack; but effectively it is the slice~~
~~that is matched.~~

~~The other wildcard function is designed for actual filename matching:~~

~~<eucode>~~
~~? wild:wildcard_file( `?:\data`, `c:\DATA` )~~
--
~~-- returns 1 on Windows systems~~
~~-- returns 0 on Unix systems~~
~~</eucode>~~

~~\\The wildcards ** ? * ** follow the conventions used by the operating system in use.~~
~~Thus matches under //Windows// are not case sensitive, but matches under //Unix// are~~
~~case sensitive.~~

~~Wildcards add a bit of power to matching operations, but nothing compared to what a~~
~~regex is capable of.~~

~~== Regex Matching~~

~~A **regular expression** is a __pattern matching language__. It is best to think of a regex~~
~~as a small, terse, mini-language that is used from within Euphoria to perform~~
~~matches on strings. A regex is a language unto itself. To use it one must learn~~
~~the "regex way" of writing and expressing patterns.~~

~~This Miniguide will introduce the use of regular expressions. To fully appreciate~~
~~what is possible, read the Wikipedia article on regular expressions, and then~~
~~get a book on the subject.~~

~~There are many tutorials and articles on regular expressions on the internet.~~

~~The Euphoria regex functions use the PCRE library written by Philip Hazel.~~
~~PCRE means "Perl Compatible Regular Expressions." There are other regex libraries~~
~~(such as GREP and Emacs-style) but Perl-style is used by many programming languages and~~
~~is thus widely documented.~~

~~== Writing Strings~~

~~A regex needle is written as a text string. Because it is a mini-language, a few rules~~
~~must be followed before any meaningful regex patterns can be written.~~

~~There are two fundamental ways to write a text string in Euphoria using:~~
* string delimiter ##**"**##
** "World"
* raw delimiter ##** ` **## or ##** """ **##
** `World`
** """World"""

~~\\If a **"** //string// delimiter is used, then the string may contain **escaped characters**~~
~~that are preceded by a **\** escape character. This allows "non-printing" characters such~~
~~as tab '\t' and newline '\n' to be part of the string.~~
~~Values can also be coded in~~
~~octal format \033, or hexadecimal format \x1B.~~
~~A literal **\** must be~~
~~written as **{{{\\}}}** and a literal **"** must be written as **\"** .~~

~~If a **`** or **"""** //raw// delimiter is used then all characters are literal~~
~~as-is~--there are no escaped characters.~~

~~A search requires a //needle// string and a //haystack// string to be written.~~

~~When using the ##eu:match()## function, just use the same string delimiter style~~
~~for both the needle and haystack.~~

~~When using the regex.e include module, the needle string has special rules~~
~~(and special exemptions) that make writing these strings interesting.~~

~~Some //regex needle// characters may not be used exactly as written:~~

~~<eucode>~~

~~{ } [ ] ( ) ^ $ . | * + ? \ -- special metacharacters~~

~~- ] \ ^ $ -- special characters~~

~~\s \S \w \W \d \D -- special escaped characters~~
~~</eucode>~~

~~\\All of these "special" characters are part of the regex language and are~~
~~written into a needle string in order to describe a regex-pattern that~~
~~will be used for searching.~~

~~If the string delimiter **"** is used to write a needle, then the escaped~~
~~characters used by Euphoria and the special characters used by the regex pattern~~
~~will conflict. It is possible to escape special characters, escape escaped~~
~~characters and ultimately write a needle that will work. The resulting needle~~
~~takes effort to write and looks like a mess.~~

~~If a raw string delimiter, **`** or **"""**, is used, then the regex needle string is much~~
~~easier to write and easier to understand.~~
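\\The difference can be sketched by writing one needle both ways. This is only an illustration (the needle and haystack are made up for the example): the pattern is "digits, one whitespace character, word characters", and ##re:has_match()## reports 1 when the needle is found.

<eucode>
include std/regex.e as re

-- the same regex needle, written with each delimiter style
regex r1 = re:new( "\\d+\\s\\w+" ) -- string delimiter: every \ is doubled
regex r2 = re:new( `\d+\s\w+` )    -- raw delimiter: written as-is

? re:has_match( r1, "10 apples" )
-- 1
? re:has_match( r2, "10 apples" )
-- 1
</eucode>

Both needles reach the regex library as the same text; only the amount of escaping typed into the Euphoria source differs.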

~~Part of the charm in writing a regular expression is knowing the rules for~~
~~when a metacharacter is part of the language~--and has "meta meaning"~--and when it is just~~
~~itself~--a regular character.~~

~~To write a metacharacter in a regex needle and give it its plain literal meaning, it must~~
~~be "escaped" using the **\** character.~~

~~<eucode>~~
~~-- needle metacharacters must be escaped to revert to literal meaning~~
~~\{ \} \[ \] ... \? \\~~
~~</eucode>~~

~~\\The special metacharacter delimiters **[ ] ** are used to define a character~~
~~class. For example ##[aeiou]## is the class of vowels in the English language.~~
~~Any one character from this list would produce a valid match. The class delimiters~~
~~add a few more rules to follow~:~~
* metacharacters inside a [ ] class are just themselves
* a **-** inside a [ ] class is //special// and used to
~~indicate a range of characters~~
** [0-9] means [0123456789]
** meaning any digit
* a **^** as the __first__ character in a [ ] class
~~denotes the //negated character class//~~
** meaning everything matches __but__ the characters within the class
** [123] will match any of 1, 2, 3
** [^123] will match any character __but__ 1, 2, 3
* a **^**, if not the first character, is just its literal self

~~When written in a haystack string, metacharacters are no longer special, but are literal~~
~~characters.~~

~~However, the **\** is a metacharacter that is shared by both Euphoria and~~
~~the regex language. To write a literal **\** `//escape//` character it __may__ have to be~~
~~"escaped" by writing it twice. There are //two// levels of escaping that must be considered!~~

* when writing the haystack string
* when writing the needle string

~~\\Example: literal escape \ character in a string:~~

~~|= string delimiter | " | ` | """ |~~
~~| literal | \ | \ | \ |~~
~~| haystack string | {{{"\\"}}} | {{{`\`}}} | {{{"""\"""}}} |~~
~~| needle string | {{{"\\\\"}}} | {{{`\\`}}} | {{{"""\\"""}}} |~~

~~\\A string delimiter, itself, cannot be part of a string:~~

~~|= string delimiter | " | ` | """ |~~
~~| literal | " | ` | """ |~~
~~| within a string | \" | no | no |~~

~~\\A practical example:~~

~~|= |=String delimiter " |~~
~~|**needle** \\ **haystack**| <eucode>~~
~~r = re:new( "C:\\\\WIN" )~~
~~? re:find( r, "C:\\WIN32" )~~
~~</eucode> |~~

\\
~~|= |=Raw delimiter ` | |=Raw delimiter """ |~~
~~|**needle** \\ **haystack**| <eucode>~~
~~r = re:new( `C:\\WIN` )~~
~~? re:find( r, `C:\WIN32` )~~
~~</eucode> | | <eucode>~~
~~r = re:new( """C:\\WIN""" )~~
~~? re:find( r, """C:\WIN32""" )~~
~~</eucode> |~~

~~\\There are a variety of non-printing ASCII characters that are written~~
~~preceded with the \ , or escape character. Common examples are: \t for tab,~~
~~\n for newline, and \r for carriage return. Values can also be coded in~~
~~octal format \033, or hexadecimal format \x1B.~~

~~There will only be a valid match if the needle pattern exactly matches~~
~~a slice of the haystack string:~~

~~|= regex | | eu:match |~~
~~| <eucode>~~
~~? re:find( re:new("world"), "Hello World!" )~~
~~-- no match~~
~~</eucode> | | <eucode>~~
~~? eu:match( "world", "Hello World!" )~~
~~-- 0~~
~~-- no match~~
~~</eucode> |~~
~~| <eucode>~~
~~? re:find( re:new("World "), "Hello World!" )~~
~~-- no match~~
~~</eucode> | | <eucode>~~
~~? eu:match( "World ", "Hello World!" )~~
~~-- 0~~
~~-- no match~~
~~</eucode> |~~
~~| <eucode>~~
~~? re:find( re:new("o W" ), "Hello World!" )~~
~~-- {{5,7}}~~
~~-- match located~~
~~</eucode> | | <eucode>~~
~~? eu:match( "o W", "Hello World!" )~~
~~-- 5~~
~~-- match located~~
~~</eucode> |~~

~~\\Note: Regex examples in the literature are often written as if for the~~
~~Perl language. The **/** is a common delimiter used in Perl to write~~
~~a regex string. A Perl regex needle that looks like ##/Hello/##~~
~~is ##`Hello`## in Euphoria. Perl examples require a literal **/** to be written~~
~~in escaped form. The Perl regex ##/Good\/bye/## is ##`Good/bye`## in Euphoria.~~

~~== ^ Anchored Matches $~~

~~If a regex pattern can be found __anywhere__ in the haystack string, then a valid match~~
is reported. It is possible to specify //where// the match must occur, by using the **anchor**
~~metacharacters //(// **^** //)// and //(// **$** //)//. The anchor **^** means the match~~
~~must occur at the beginning of the string. The anchor **$** means the match must~~
~~occur at the end of the string, or before the newline **\n** at the end of that string.~~

~~<eucode>~~
~~? re:find( re:new( "keeper" ), "housekeeper" )~~
~~? re:find( re:new( "^keeper" ), "housekeeper" )~~
~~? re:find( re:new( "keeper$" ), "housekeeper" )~~
~~? re:find( re:new( "keeper$"), "housekeeper\n" )~~
~~? re:find( re:new( "^housekeeper$" ), "housekeeper" )~~
~~</eucode>~~

~~\\A related idea is to specify the index from which the matching must begin.~~
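\\A small sketch of that idea, assuming the optional third argument of ##re:find()## is the index at which searching begins (the haystack is invented for the example):

<eucode>
include std/regex.e as re

regex r = re:new( "cat" )
sequence haystack = "cat and cat"

-- default: searching begins at index 1, so the first cat is found
? re:find( r, haystack )

-- searching begins at index 2, so only the second cat can be found
? re:find( r, haystack, 2 )
</eucode>

Unlike an anchor, a starting index does not force the match to begin at that position; it only rules out matches that begin earlier.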

~~== Using [ ] Character Classes~~

~~If only literal strings are being searched for, then there is little advantage in using a~~
~~regex module. A **character class** allows for a set of possible characters that~~
~~may be used in a match, rather than just the explicit (literal) characters found~~
~~when a regex is written out as just a word. Character classes are denoted by~~
~~//(// **[ ]** //)// delimiters, while inside the brackets a list of permitted characters is~~
~~written out.~~

~~|= re:find() | | eu:match() |~~
~~|<eucode>~~
~~? re:find( re:new( "cat" ), "cat" ) -- matches cat~~
~~</eucode>| | <eucode>~~
~~? eu:match( "cat", "cat" )~~
~~-- 1~~
~~-- only literal match possible~~
~~</eucode> |~~

~~<eucode>~~
~~? re:find( re:new( "[bcr]at"), "bat" ) -- matches bat~~

~~? re:find( re:new( "[cab]"), "abc" ) -- matches a~~
~~</eucode>~~

~~In the last example, the first valid match from the character class is ##'a'##. It does~~
~~not matter which order the characters are written within the character class, thus~~
~~##'c'## does not have to be the first match.~~

~~A character class may be used to create a regex pattern that allows for case-insensitive~~
~~matching.~~

~~<eucode>~~
~~? re:find( re:new( "[yY][eE][sS]"), "yeS" )~~

~~? re:find( re:new( "yes", CASELESS ), "yEs" )~~
~~</eucode>~~

~~The easier way to achieve the same match is to add a **modifier**, ##CASELESS##, to~~
~~the ##re:new()## function.~~

~~Both ordinary and special characters may be included in a character class. The~~
~~special characters ##** - ] \ ^ $ **## must be matched using an escape.~~

~~<eucode>~~
~~? re:find( re:new( """[\]c]def""" ), "]def" ) -- both ]def and cdef will match~~

~~sequence x = "bcr"~~
~~? re:find( re:new( "[" & x & "]at" ), "rat" )~~

~~? re:find( re:new( """[\$""" & x & "]at"), "$at" )~~

~~? re:find( re:new( """[\\$x]at"""), """\at""" )~~

~~</eucode>~~

~~Notice that PERL has a simple way of embedding a variable into a string using a $x~~
~~notation. This kind of example has limited value. Therefore examples that show~~
~~how $ may be confused with this variable substitution are not interesting.~~

~~Euphoria will always use **&** for~~
~~string concatenation.~~

~~The special character //(// **-** //)// is the range operator within a **[ ]** character~~
~~class. The long~~
~~forms of ##[0123456789]## or ##[abc...xyz]## become compact ##[0-9]## or ##[a-z]##.~~

~~There is a special rule for including a literal **-** special character: when it is the~~
~~first or last character in a class, it is considered to be an ordinary (literal)~~
~~character.~~

~~The special character //(// **^** //)//, when in the //first// position of a character class,~~
~~denotes~~
~~a **negated character class**. That means any character //but// those listed in the~~
~~character class will match.~~

~~<eucode>~~
~~? re:find( re:new( "[^a]at" ), "aat" )~~
~~? re:find( re:new( "[^0-9]" ), "house" )~~
~~? re:find( re:new( "[a^]at" ), "^at" )~~
~~</eucode>~~

~~There are abbreviations for the most common character classes:~~

~~|= abbreviation | class |= description |~~
~~| \d | [0-9] | a digit |~~
~~| \s | [\ \t\r\n\f] | a whitespace character |~~
~~| \w | [0-9a-zA-Z_] | a word character |~~
~~| \D | [^\d] | negated \d, any but a digit |~~
~~| \S | [^\s] | any non-whitespace |~~
~~| \W | [^\w] | any non-word |~~
~~| . | | any character, except \n |~~

~~The abbreviations ##** \d \s \w \D \S \W **## can be used both inside and outside of a~~
~~character class:~~

~~<eucode>~~
~~? re:find( re:new( """\d\d:\d\d:\d\d"""), "12:30:48 the time format" )~~
~~? re:find( re:new( """[\d\s]"""), "any digit or whitespace" )~~
~~? re:find( re:new("""\w\W\w""" ), "matches a word char, followed by nonword, followed by word char" )~~
~~? re:find( re:new("..rt"), "matches any two chars followed by rt" )~~
~~? re:find( re:new("""end\."""), "matches end." )~~
~~? re:find( re:new("end[.]"), "same thing, matches end." )~~
~~</eucode>~~

~~Notice that when abbreviations are used, a raw delimiter such as **"""** should be used.~~

~~The **word anchor** ##\b## matches the boundary between a word character and a~~
~~non-word character ##\w\W## or ##\W\w## .~~

~~<eucode>~~
~~sequence x = "Housecat catenates house and cat"~~
~~? re:find( re:new("""\bcat"""), x ) -- in catenates matches cat~~
~~? re:find( re:new("""cat\b"""), x ) -- in Housecat matches cat~~
~~? re:find( re:new("""\bcat\b"""), x ) -- matches cat at the end of the string~~
~~</eucode>~~

~~== Matching this | that --Alternatives~~

Different character strings may be matched using the **alternation**
~~metacharacter //(// **|** //)//. If ##dog## or ##cat## are to match, write the regex pattern~~
~~as "dog|cat". At each point in the haystack string, first "dog" will be tried as a~~
~~match; if it fails then "cat" will be tried as a match. If both fail, then the next~~
~~character of the haystack string starts the matching tests again.~~

~~<eucode>~~
~~? re:find( re:new("cat|dog|bird"), "cats and dogs -- matches cat" )~~
~~? re:find( re:new("dog|cat|bird"), "cats and dogs -- matches cat" )~~
~~</eucode>~~

~~In the second example, "dog", may be first in the list, but "cat" is still the first~~
~~pattern that matches.~~

~~<eucode>~~
~~? re:find( re:new("c|ca|cat|cats"), "cats -- matches c" )~~
~~? re:find( re:new("cats|cat|ca|c"), "cats -- matches cats" )~~
~~</eucode>~~

~~At any given character position in the haystack string, the first alternative that fully~~
~~matches will be the match that succeeds. In these examples the first alternative does~~
~~have a valid match, so no more matching tests are required.~~

~~== ( Grouping Things ) and ( Hierarchical (Matching))~~

~~The grouping metacharacters //(// ** ( )** //)// allow a part of the regex pattern to be~~
~~treated as a~~
~~single unit. The parentheses surround the portion of the regex that will be considered as~~
~~a single unit. Thus the regex pattern ##"house(cat|keeper)"## means that ##"housecat"## or~~
~~##"housekeeper"## will succeed as matches.~~

~~<eucode>~~
~~? re:find( re:new("(a|b)b"), "ab bb -- both will match" )~~
~~? re:find( re:new("(^a|b)c"), "ac at start of string, or bc anywhere" )~~
~~? re:find( re:new("house(cat|)"), "house or housecat will match" )~~
~~? re:find( re:new("house(cat(s|)|)"), "housecats or housecat or house will match" )~~
~~? re:find( re:new("""(19|20|)\d\d"""), "20" ) -- matches the null alternative ()\d\d because 20\d\d cannot match~~
~~</eucode>~~

~~== Extracting Matches~~

~~The grouping metacharacters **( )** can be used to "extract" the matching slice for each~~
~~grouping. First the slice for the overall match is presented, then each grouping is~~
~~presented.~~

~~<eucode>~~
~~? re:find( re:new("""(\d\d):(\d\d):(\d\d)"""), "34:12:15 -- matching hh:mm:ss time format" )~~
~~--{~~
~~-- {1,8},~~
~~-- {1,2},~~
~~-- {4,5},~~
~~-- {7,8}~~
~~--}~~
~~</eucode>~~
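\\The index pairs can be turned back into text by slicing the haystack. The following is a minimal sketch under that assumption; the variable names are hypothetical and the haystack is shortened to just the time value.

<eucode>
include std/regex.e as re

sequence haystack = "34:12:15"
object m = re:find( re:new( `(\d\d):(\d\d):(\d\d)` ), haystack )

-- a sequence result means a match: one {start, end} pair per slice
if sequence( m ) then
    for i = 1 to length( m ) do
        -- m[i][1] is the start index, m[i][2] is the end index
        puts( 1, haystack[ m[i][1] .. m[i][2] ] & "\n" )
    end for
end if
-- prints:
-- 34:12:15
-- 34
-- 12
-- 15
</eucode>

The ##re:matches()## function in std/regex.e is meant to return the matched text directly, which may be more convenient when the indexes themselves are not needed.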

~~When groupings in a regex pattern are nested, then they are shown starting with the~~
~~leftmost opening parenthesis, followed by the next opening parenthesis:~~

~~<eucode>~~
~~r = re:new( `(ab(cd|ef)((gi)|j))` )~~

~~--           1  2      34~~
~~</eucode>~~

~~A related feature is **backreferences**, which are written as ##\1, \2, ...## A backreference~~
~~refers to the text matched by an earlier grouping and may be used inside the definition of a regex pattern.~~

~~<eucode>~~
~~? re:find( re:new("""(\w\w\w)\s\1"""), "the the is located in a string" )~~
~~</eucode>~~

~~TOM above seems to work, but need some more understanding~~

~~NOTE $1 $2 ... are variables that are extracted and work as PERL specific~~
~~variables. They do not exist in Euphoria. The \1 \2 ... are internal to~~
~~the regex pattern, therefore are available in Euphoria.~~

~~== Matching Repetitions ? * + { }~~

~~The quantifier metacharacters //(// ** ? * + { }** //)// allow a portion of the regex pattern~~
~~to be~~
~~repeated (a specified number of times) and still provide a successful match.~~

~~The repetition meanings are:~~

~~|= meta |= regex example |= meaning |~~
~~| ? | `a?` | matches **a** 1 or 0 times |~~
~~| * | `a*` | matches **a** 0 or more times -- any number |~~
~~| + | `a+` | matches **a** 1 or more times -- at least once |~~
~~| { } | `a{n,m}` | at least n, but not more than m times |~~
~~| { } | `a{n,}` | at least n, or more times |~~
~~| { } | `a{n}` | exactly n times |~~

~~<eucode>~~
~~? re:find( re:new( `[a-z]+\s+\d*` ), "beer 12 -- lower case word, some space, and any number of digits" )~~

~~? re:find( re:new( `(\w+)\s+\1` ), "match match doubled words of arbitrary length" )~~

~~? re:find( re:new( `\d{2,4}` ), "20311 02 1914 --at least 2 but not more than 4 digits" )~~

~~? re:find( re:new( `\d{4}|\d{2}` ), "123 02 2010 -- better strategy since 3 digit dates excluded" )~~
~~</eucode>~~

~~These quantifiers are known as **greedy**, in that they will try to match __as much__ of the~~
~~haystack string as possible.~~

~~<eucode>~~
~~? re:find( re:new( `^(.*)(at)(.*)$` ), "the cat in the hat" )~~

~~--{~~
~~-- {1,18},~~
~~-- {1,16},~~
~~-- {17,18},~~
~~-- {19,18}~~
~~--}~~
~~</eucode>~~

~~This regex matches the haystack string. The first group is ##"the cat in the h"##, the~~
~~second group is ##"at"##, and the third group is empty.~~

~~The first //(// ** .* ** //)// quantifier (being greedy) grabs as much of the string as~~
~~possible, as long~~
~~as the regex still matches the haystack string. The second .* matches nothing, because there is~~
~~no string left for it to work on.~~

~~== Matching Efficiency~~

~~The ##re:new()## function is said to "compile" a regex text pattern into an internal~~
~~format used for actual matching. This "compiling" has a time cost associated with it.~~

~~A regex consumes memory. This memory is allocated by the C-code of the PCRE library~~
~~which means that it may be necessary to manually free this memory using the~~
~~Euphoria ##delete()## function.~~
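\\One way to keep the compile cost down is to call ##re:new()## once and reuse the result. The sketch below assumes that pattern; the names ##digits## and ##lines## are hypothetical, and ##re:has_match()## is used to report whether the needle occurs.

<eucode>
include std/regex.e as re

-- compiled once, outside the loop
regex digits = re:new( `\d+` )

sequence lines = { "room 101", "no number", "call 555 now" }

for i = 1 to length( lines ) do
    -- the compiled regex is reused for every haystack
    if re:has_match( digits, lines[i] ) then
        puts( 1, lines[i] & "\n" )
    end if
end for
-- prints:
-- room 101
-- call 555 now
</eucode>

Calling ##re:new()## inside the loop would give the same output, but the pattern would be compiled again on every pass.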

~~== Find and Replace~~

~~A regex may be located in a string and then the contents replaced with new~~
~~text. Use the ##find_replace()## function:~~

~~<eucode>~~
~~-- a literal replacement~~

~~sequence x = "Time to feed the cat!" -- Time to feed the cat!~~
~~regex r = re:new( "cat" )~~
~~x = find_replace( r, x, "hacker" ) -- Time to feed the hacker!~~
~~</eucode>~~

~~<eucode>~~
~~-- replacement is a backreference~~

~~sequence y = "'quoted words'" -- 'quoted words'~~
~~regex r = re:new( `'(.*)'$` )~~
~~y = find_replace( r, y, `\1` ) -- quoted words~~

~~-- the single quotes are stripped from the string~~
~~</eucode>~~

~~<eucode>~~
~~-- all matches are replaced~~

~~x = "I batted 4 for 4" -- I batted 4 for 4~~
~~r = re:new( `4` )~~
~~x = re:find_replace( r, x, "four" ) -- I batted four for four~~
~~</eucode>~~

\\
~~<eucode>~~
~~-- specify the number of replacements~~

~~x = "I batted 4 for 4" -- I batted 4 for 4~~
~~r = re:new( `4` )~~
~~x = re:find_replace_limit( r, x, "four", 1 ) -- I batted four for 4~~
~~</eucode>~~

~~== { "Splitting", "Text" }~~

~~A string may be split into individual words. A regex needle is used to locate~~
~~suitable splitting locations. The located needle is "removed" from the string~~
~~and used to "split" the string into words. For example the regex needle ##`\s+`##~~
~~represents the space between words. The ##re:split()## function can then produce a sequence~~
~~of words:~~

~~<eucode>~~
~~sequence x = "Calvin and Hobbes"~~

~~regex r = re:new( `\s+` )~~
~~x = re:split( r, x )~~
~~display( x )~~
~~--{~~
~~-- "Calvin",~~
~~-- "and",~~
~~-- "Hobbes"~~
~~--}~~
~~</eucode>~~

~~The elements of the new sequence are sometimes called **tokens**, because the~~
~~splitting process has more uses than just finding words in a sentence.~~

~~\\Data from a spreadsheet may be saved in a comma-delimited format. The~~
~~regex ##`,\s*`## will locate commas and whitespace, and allow a~~
~~comma delimited string to be split:~~

~~<eucode>~~

~~x = "1.618,2.718, 3.142"~~
~~r = re:new( `,\s*` )~~
~~x = re:split(r,x)~~
~~display( x )~~
~~--{~~
~~-- "1.618",~~
~~-- "2.718",~~
~~-- "3.142"~~
~~--}~~
~~</eucode>~~

~~The regex ##`/`## will parse a directory listing:~~

~~<eucode>~~
~~x = `/usr/bin`~~
~~r = re:new( `/` )~~
~~x = re:split(r,x)~~
~~display(x)~~
~~--{~~
~~-- "",~~
~~-- "usr",~~
~~-- "bin"~~
~~--}~~
~~</eucode>~~

~~Since the first character of ##x## matched the regex, ##re:split()## prepends an empty~~
~~initial element to the output sequence.~~

~~The RDS archives contain a few more options for splitting strings of text.~~
~~For example, the file ##strtok-2-1.zip## contains some useful functions related to~~
~~splitting. Warning: this file was written for Euphoria V3~--before using it, copy the~~
~~file "Strtok-v-2-1.e" to "strtok.e". Then, using a text editor, search for the~~
~~variable "loop" and replace it with "iloop", and rename "case" to "xcase". (This is to accommodate the~~
~~new keywords found in Euphoria V4.0.)~~

~~|=regex | | strtok |~~
~~| <eucode>~~
~~include std/regex.e as re~~
~~</eucode> | | <eucode>~~
~~include strtok.e~~
~~</eucode> |~~
~~|<eucode>~~
~~object x = "1.618,2.718, 3.142"~~
~~r = re:new( `,\s*` )~~
~~x = re:split(r,x)~~
~~display( x )~~
~~--{~~
~~-- "1.618",~~
~~-- "2.718",~~
~~-- "3.142"~~
~~--}~~
~~</eucode> | | <eucode>~~
~~object x = "1.618,2.718, 3.142"~~
~~x = parse( x, `, `)~~
~~display( x )~~
~~--{~~
~~-- "1.618",~~
~~-- "2.718",~~
~~-- "3.142"~~
~~--}~~
~~</eucode> |~~

~~The "strtok" functions may be faster and easier to use than regex splitting~~
for some applications.
