Searching

About this miniguide

The following is a rough draft of a Miniguide--a draft that can use some help.

If you spot something wrong or confusing, write a comment right into this wiki page.

For example, if a paragraph or table does not make sense it would help if you:

  • just say "doesn't make sense"
  • add a comment "try this..."

Your comments will be used to help re-write this miniguide.

Comments from the forum have already been valuable. I see areas where improvement is needed.

Some concerns I have after writing this miniguide are:

  • possibly too much content packed into too small an article
  • I was adding && && && to tables in an effort to adjust table column spacing
  • possibly a separate guide is needed to emphasize native Euphoria functions, and isolate regex functions to this guide


_tom


Intro

Searching is about finding a needle (say a specific word) in a haystack, which is a string of text--that string could be just a sentence or a library of books.

A "regular expression" (regex for short) is a way of describing the needle. The simplest regex 'needle' is the word written out exactly. The interesting regex 'needles' can specify all kinds of details...capitalization, location of the word, alternatives to the word, allowed spellings...tremendous power and flexibility is available.

Regex searching is a big topic. Several references will help:




Searching Miniguide

A primer on searching for content in strings using Euphoria.

About Searching

Searching is often directed at a string of text. If the text string is "Hello World!", then a search could be to locate the text World within that string.

  • Hello World!


Searching may produce a variety of results:

  • just report that a search succeeds
    • true or false
  • report on the location of the search
    • first index: 7
    • extent of matching slice: {7,11}
  • report on what was matched
    • World
  • replace the match with something else
    • replace World with Euphoria
    • "Hello Euphoria!"
  • split the string (remove the match and split the sequence)
    • {"Hello ", "!"}


What is being searched may be stated:

  • as a literal
    • World
  • as a wildcard
    • ?orld
  • as a regex
    • W(.*)d


This Miniguide is an introduction to the functions useful in searching. The Miniguide search table, and regex cheatsheet will help in deciding which function to use.

Simple Word Matching

Given the string "Hello World!" the objective is to search and locate the text "World" within that string.

The needle, "thing to be found," is "World". The haystack, "where to look," is "Hello World!". Or more compactly: locate "World" in "Hello World!".

When using the Euphoria eu:match() function, think of "World" as being a slice of the string "Hello World!".

When using the regex module re:find() or re:matches(), think of "World" as being a pattern (a regex pattern) that must be matched against the string "Hello World!".

In this example the slice viewpoint and regex viewpoint are identical. Both represent a literal search for text within string "Hello World!".

The Euphoria code that solves the problem is:

-- regex
include std/regex.e as re
regex r = re:new( "World" )
? re:find( r, "Hello World!" )
-- {
--   {7,11}
-- }

? re:matches( r, "Hello World!" )
-- {
--   "World"
-- }

-- eu:match
? eu:match( "World", "Hello World!" )
-- 7


Note: match() is a built-in Euphoria function; it normally does not need a namespace qualifier. It is written as eu:match() in this Miniguide to make it explicitly different from the regular expression functions, which use re: as a namespace qualifier.

For simple literal matching, Euphoria routines are quicker and easier to use than regex routines. However, the Euphoria eu:match() function is limited to literal searches. The regex functions, re:find() and re:matches(), can be extended to search for complex and clever patterns. This Miniguide will emphasize regex searching.

The information obtained from a search may be used in various ways:

The result can be used in a conditional:

-- regex: a sequence is returned when there is a match
if sequence( re:find( r, "Hello World!" ) ) then
    puts(1, "It matches \n" )
end if

-- regex: an atom is returned when there is no match
if atom( re:find( r, "Hello World!" ) ) then
    puts(1, "It does not match \n" )
end if

-- eu:match
if match( "World", "Hello World!" ) then
    puts(1, "It matches \n" )
end if


The regex search pattern could have been a variable:

-- regex
sequence greeting = "World"
r = re:new( greeting )
if sequence( re:find( r, "Hello World!" ) ) then
    puts(1, "It matches \n" )
end if

Wildcard ? *

Wildcard characters allow a needle string to be written as a pattern that should be matched instead of a literal form.

The wildcard style follows the same rules as defined by operating systems when directories and filenames are expressed.

The wildcard ? means substitute any one character. The wildcard * means substitute any number of characters.

include std/wildcard.e as wild

-- wildcard needle "?og", haystack "dog": match
? wild:is_match( "?og", "dog" )
-- 1

? wild:is_match( "?og", "cat" )
-- 0

? wild:is_match( "*ot", "parrot" )
-- 1

Wildcard matching is good for groups of filenames. It can be used for general matching, but recognize the limitations:

  • the matches with wild:is_match() are case sensitive
  • the needle must match the entire haystack
  • a simple trick allows a "slice" to be matched; just add a * before and after the text of the slice being matched
  • literal characters * and ? can not be matched
? wild:is_match( "*dog*", "hounddog" ) 
-- 1 


The needle is still matching the entire haystack; but effectively it is the slice that is matched.
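
The case-sensitivity limitation can be worked around by folding both strings to the same case before matching. The following is a minimal sketch, not a library feature: is_match_caseless() is a hypothetical helper name, and it assumes lower() from std/text.e.

```euphoria
include std/wildcard.e as wild
include std/text.e

-- hypothetical helper: fold needle and haystack to lower case,
-- then apply the ordinary case-sensitive wildcard match
function is_match_caseless( sequence pattern, sequence text )
    return wild:is_match( lower( pattern ), lower( text ) )
end function

? is_match_caseless( "*DOG*", "HoundDog" )   -- 1
? wild:is_match( "*DOG*", "HoundDog" )       -- 0, case differs
```

Folding the case changes only the comparison; the original strings are left untouched.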

The other wildcard function is designed for actual filename matching:

? wild:wildcard_file( `?:\data`, `c:\DATA` ) 
-- 
-- returns 1 on Windows systems 
-- returns 0 on Unix systems 


The wildcards ? * follow the conventions used by the operating system in use. Thus matches under Windows are not case sensitive, but matches under Unix are case sensitive.

Wildcards add a bit of power to matching operations, but nothing compared to what a regex is capable of.

Regex Matching

A regular expression is a pattern matching language. It is best to think of a regex as a small, terse mini-language that is used from within Euphoria to perform matches on strings. A regex is a language unto itself. To use it one must learn the "regex way" of writing and expressing patterns.

This Miniguide will introduce the use of regular expressions. To fully appreciate what is possible, read the Wikipedia article on regular expressions, and then get a book on the subject.

There are many tutorials and articles on regular expressions on the internet.

The Euphoria regex functions use the PCRE library written by Philip Hazel. PCRE means "Perl Compatible Regular Expressions." There are other regex libraries (such as GREP and Emacs-style) but Perl-style is used by many programming languages and is thus widely documented.

Writing Strings

A regex needle is written as a text string. Because it is a mini-language, a few rules must be followed before any meaningful regex patterns can be written.

There are two fundamental ways to write a text string in Euphoria using:

  • string delimiter "
    • "World"
  • raw delimiter ` or """
    • `World`
    • """World"""


If a " string delimiter is used, then the string may contain escaped characters that are preceded with a \ escape character. This allows "non-printing" characters such as tab '\t' and newline '\n' to be part of the string. Values can also be coded in octal format \033 , or hexadecimal format \x1B. A literal \ must be written as \\ and a literal " must be written as \" .

If a ` or """ raw delimiter is used then all characters are literal as-is--there are no escaped characters.

A search requires a needle string and a haystack string to be written.

When using the eu:match() function, just use the same string delimiter style for both the needle and haystack.

When using the regex.e include module, the needle string has special rules (and special exemptions) that make writing these strings interesting.

Some regex needle characters may not be used exactly as written:

 
{ } [ ] ( ) ^ $ . | * + ? \      -- special metacharacters  
 
 - ] \ ^ $                       -- special characters 
 
\s   \S     \w   \W     \d  \D   -- special escaped characters 


All of these "special" characters are part of the regex language and are written into a needle string in order to describe a regex-pattern that will be used for searching.

If the string delimiter " is used to write a needle, then the escaped characters used by Euphoria and the special characters used by the regex pattern will conflict. It is possible to escape special characters, escape escaped characters and ultimately write a needle that will work. The resulting needle takes effort to write and looks like a mess.

If a raw string delimiter, ` or """ is used, then the regex needle string is much easier to write and easier to understand.

Part of the charm in writing a regular expression is knowing the rules for when a metacharacter is part of the language and has "meta meaning", and when it is just itself--a regular character.

To write a metacharacter in a regex needle, and give it its plain literal meaning, it must be "escaped" using the \ character.

-- needle metacharacters must be escaped to revert to literal meaning 
\{  \}  \[  \]  ... \?  \\ 
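
The escaping rule can be applied mechanically when the needle text comes from a variable. The following is a minimal sketch under stated assumptions: escape_meta() and the META constant are hypothetical names introduced here for illustration, and the standard library may already offer an equivalent routine.

```euphoria
include std/regex.e as re

-- the metacharacters listed above, written as a raw string
constant META = `{}[]()^$.|*+?\`

-- hypothetical helper: put a backslash in front of every metacharacter
function escape_meta( sequence text )
    sequence result = ""
    for i = 1 to length( text ) do
        if eu:find( text[i], META ) then
            result &= '\\'
        end if
        result &= text[i]
    end for
    return result
end function

regex r = re:new( escape_meta( "3.14" ) )   -- the pattern becomes 3\.14
? sequence( re:find( r, "pi is 3.14" ) )    -- 1, the literal dot matches
? sequence( re:find( r, "pi is 3x14" ) )    -- 0, the dot is no longer "any character"
```

The built-in find() is written as eu:find() here for the same reason eu:match() is used elsewhere: the regex module defines its own find().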


The special metacharacter delimiters [ ] are used to define a character class. For example [aeiou] is the class of vowels in the English language. Any one character from this list would produce a valid match. The class delimiters add a few more rules to follow:

  • metacharacters inside a [ ] class are just themselves
  • a - inside a [ ] class is special and used to indicate a range of characters
    • [0-9] means [0123456789]
    • meaning any digit
  • a ^ as the first character in a [ ] class denotes the negated character class
    • meaning everything matches but the characters within the class
    • [123] will match any of 1, 2, 3
    • [^123] will match any character but 1, 2, 3
  • a ^, if not the first character, is just its literal self

When written in a haystack string, metacharacters are no longer special, but are literal characters.

However, the \ is a metacharacter that is shared by both Euphoria and the regex language. To write a literal \ escape character it may have to be "escaped" by writing it twice. There are two levels of escaping that must be considered!

  • when writing the haystack string
  • when writing the needle string


Example: literal escape \ character in a string:

string delimiter     "          `        """
literal              \          \        \
haystack string      "\\"       `\`      """\"""
needle string        "\\\\"     `\\`     """\\"""


A string delimiter, itself, can not be part of a string:

string delimiter     "          `        """
literal              "          `        """
within a string      \"         no       no


A practical example:

-- string delimiter "
r = re:new( "C:\\\\WIN" )        -- needle
? re:find( r, "C:\\WIN32" )      -- haystack

-- raw delimiter `
r = re:new( `C:\\WIN` )          -- needle
? re:find( r, `C:\WIN32` )       -- haystack

-- raw delimiter """
r = re:new( """C:\\WIN""" )      -- needle
? re:find( r, """C:\WIN32""" )   -- haystack


There are a variety of non-printing ASCII characters that are written preceded with the \ , or escape character. Common examples are: \t for tab, \n for newline, and \r for carriage return. Values can also be coded in octal format \033 , or hexadecimal format \x1B.

There will only be a valid match if the needle pattern exactly matches a slice of the haystack string:

-- each regex call is followed by its eu:match equivalent

? re:find( re:new( "world" ), "Hello World!" )
-- returns an atom
-- no match (the case differs)

? eu:match( "world", "Hello World!" )
-- 0
-- no match

? re:find( re:new( "World " ), "Hello World!" )
-- returns an atom
-- no match (there is no space after World)

? eu:match( "World ", "Hello World!" )
-- 0
-- no match

? re:find( re:new( "o W" ), "Hello World!" )
-- {
--   {5,7}
-- }
-- match located

? eu:match( "o W", "Hello World!" )
-- 5
-- match located


Note: Regex examples in the literature are often written as if for the Perl language. The / is a common delimiter used in Perl to write a regex string. A Perl regex needle that looks like /Hello/ is `Hello` in Euphoria. Perl examples require a literal / to be written in escaped form. The Perl regex /Good\/bye/ is `Good/bye` in Euphoria.

^ Anchored Matches $

If a regex pattern can be found anywhere in the haystack string, then a valid match is reported. It is possible to specify where the match must occur by using the anchor metacharacters ^ and $ . The anchor ^ means the match must occur at the beginning of the string. The anchor $ means the match must occur at the end of the string, or before the newline \n at the end of that string.

? re:find( re:new( "keeper" ),  "housekeeper" )         -- matches
? re:find( re:new( "^keeper" ), "housekeeper" )         -- does not match
? re:find( re:new( "keeper$" ), "housekeeper" )         -- matches
? re:find( re:new( "keeper$" ), "housekeeper\n" )       -- matches
? re:find( re:new( "^housekeeper$" ), "housekeeper" )   -- matches


A related idea is to specify the index from which the matching must begin.
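
A minimal sketch of the starting-index idea. It assumes that the optional third argument of both re:find() and eu:match() is the index at which the search begins; check the library documentation for your Euphoria version.

```euphoria
include std/regex.e as re

sequence haystack = "the cat in the hat"
regex r = re:new( "at" )

-- search from the beginning: the "at" inside "cat" is located first
? re:find( r, haystack )

-- assumed third argument: start searching at index 8, past "cat",
-- so the "at" inside "hat" is located instead
? re:find( r, haystack, 8 )

-- the built-in match() takes a starting index in the same position
? eu:match( "at", haystack, 8 )
```

The reported indexes still count from the start of the haystack, not from the starting index.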

Using [ ] Character Classes

If only literal strings are searched for, then there is little advantage in using the regex module. A character class allows a set of possible characters at one position of a match, rather than just the explicit (literal) characters found when a regex is written out as a word. Character classes are denoted by [ ] delimiters; inside the brackets a list of permitted characters is written out.

-- re:find()
? re:find( re:new( "cat" ), "cat" )        -- matches cat
? re:find( re:new( "[bcr]at" ), "bat" )    -- matches bat
? re:find( re:new( "[cab]" ), "abc" )      -- matches a

-- eu:match()
? eu:match( "cat", "cat" )
-- 1
-- only literal match possible

In the last example, the first valid match from the character class is 'a'. It does not matter which order the characters are written within the character class, thus 'c' does not have to be the first match.

A character class may be used to create a regex pattern that allows for case-insensitive matching.

? re:find( re:new( "[yY][eE][sS]"), "yeS" ) 
 
? re:find( re:new( "yes", CASELESS ), "yEs" ) 

The easier way to achieve the same match is to add a modifier, CASELESS, to the re:new() function.

Both ordinary and special characters may be included in a character class. The special characters - ] \ ^ $ must be matched using an escape.

? re:find( re:new( """[\]c]def""" ), "]def" ) -- both ]def and cdef will match 
 
sequence x = "bcr" 
? re:find( re:new( "[" & x & "]at" ), "rat" ) 
 
? re:find( re:new( """[\$""" & x & "]at"), "$at" ) 
 
? re:find( re:new( """[\\$x]at"""), """\at""" ) 
 

Note that Perl has a simple way of embedding a variable into a string using the $x notation. Euphoria has no such substitution, so Perl examples that show how $ may be confused with variable substitution are of limited value here.

Euphoria will always use & for string concatenation.

The special character - is the range operator within a [ ] character class. The long forms [0123456789] or [abc...xyz] become the compact [0-9] or [a-z].

There is a special rule for including a literal - character: when it is the first or last character in a class, it is considered to be an ordinary (literal) character.

The special character ^ , when in the first position of a character class, denotes a negated character class. That means any character but those listed in the character class will match.

? re:find( re:new( "[^a]at" ), "aat" )     -- does not match
? re:find( re:new( "[^0-9]" ), "house" )   -- matches h
? re:find( re:new( "[a^]at" ), "^at" )     -- matches ^at

There are abbreviations for the most common character classes:

abbreviation   class            description
\d             [0-9]            a digit
\s             [\ \t\r\n\f]     a whitespace character
\w             [0-9a-zA-Z_]     a word character
\D             [^\d]            negated \d, any character but a digit
\S             [^\s]            negated \s, any non-whitespace character
\W             [^\w]            negated \w, any non-word character
.                               any character, except \n

The abbreviations \d \s \w \D \S \W can be used both inside and outside of a character class.

? re:find( re:new( """\d\d:\d\d:\d\d""" ), "12:30:48 the time format" )
? re:find( re:new( """[\d\s]""" ), "any digit or whitespace" )
? re:find( re:new( """\w\W\w""" ), "matches a word char, followed by non word, followed by word char" )
? re:find( re:new( "..rt" ), "matches any two chars followed by rt" )
? re:find( re:new( """end\.""" ), "matches end." )
? re:find( re:new( "end[.]" ), "same thing, matches end." )

Notice that when abbreviations are used, the """ triple delimiter should be used.

The word anchor \b matches the boundary between a word character and a non-word character \w\W or \W\w .

x = "Housecat catenates house and cat"
? re:find( re:new( """\bcat""" ), x )     -- matches cat in catenates
? re:find( re:new( """cat\b""" ), x )     -- matches cat in Housecat
? re:find( re:new( """\bcat\b""" ), x )   -- matches cat at the end of the string

Matching this | that --Alternatives

Different character strings may be matched using the alternation metacharacter | . If dog or cat are to match, write the regex pattern as "dog|cat". At each point in the haystack string, first "dog" will be tried as a match; if it fails then "cat" will be tried as a match. If both fail, then the next character of the haystack string starts the matching tests again.

? re:find( re:new("cat|dog|bird"), "cats and dogs -- matches cat" ) 
? re:find( re:new("dog|cat|bird"), "cats and dogs -- matches cat" ) 

In the second example, "dog", may be first in the list, but "cat" is still the first pattern that matches.

? re:find( re:new("c|ca|cat|cats"), "cats -- matches c" ) 
? re:find( re:new("cats|cat|ca|c"), "cats -- matches cats" ) 

At any given character position in the haystack string, the first alternative that fully matches will be the match that succeeds. In these examples the first alternative has a valid match, so no more matching tests are required.

( Grouping Things ) and ( Hierarchical (Matching))

The grouping metacharacters ( ) allow a part of the regex pattern to be treated as a single unit. The parentheses surround the portion of the regex that is grouped. Thus the regex pattern "house(cat|keeper)" means that "housecat" or "housekeeper" will succeed as matches.

? re:find( re:new("(a|b)b"), "ab bb -- both will match" )
? re:find( re:new("(^a|b)c"), "ac at start of string, or bc anywhere" )
? re:find( re:new("house(cat|)"), "house or housecat will match" )
? re:find( re:new("house(cat(s|)|)"), "housecats or housecat or house will match" )
? re:find( re:new("""(19|20|)\d\d"""), "20" )
-- matches the null alternative ()\d\d, because 20\d\d cannot match

Extracting Matches

The grouping metacharacters ( ) can be used to "extract" the matching slice for each grouping. First the slice for the overall match is presented, then each grouping is presented.

? re:find( re:new("""(\d\d):(\d\d):(\d\d)"""), "34:12:15 -- matching hh:mm:ss time format" )
-- {
--   {1,8},
--   {1,2},
--   {4,5},
--   {7,8}
-- }
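
The index pairs can be turned back into text by slicing the haystack. The sketch below is illustrative only; it relies on re:find() returning {first, last} pairs as shown above, and on re:matches() returning the same slices as strings.

```euphoria
include std/regex.e as re

sequence haystack = "34:12:15"
regex r = re:new( `(\d\d):(\d\d):(\d\d)` )

object pairs = re:find( r, haystack )
if sequence( pairs ) then
    for i = 1 to length( pairs ) do
        -- each pair is {first index, last index} into the haystack
        puts( 1, haystack[ pairs[i][1] .. pairs[i][2] ] & "\n" )
    end for
end if

-- re:matches() delivers the overall match and each grouping as strings
object m = re:matches( r, haystack )
if sequence( m ) then
    puts( 1, "hours: " & m[2] & "\n" )
end if
```

Element 1 is always the overall match, so the first grouping is element 2.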

When groupings in a regex pattern are nested, then they are shown starting with the leftmost opening parenthesis, followed by the next opening parenthesis:

r = re:new( `(ab(cd|ef)((gi)|j))` )
--           1  2      34

A related feature is the backreference, written as \1 \2 ... A backreference matches again the text that an earlier grouping has already matched, and may be used inside the definition of a regex pattern.

? re:find( re:new("""(\w\w\w)\s\1"""), "the the is located in a string" ) 

TOM: the above seems to work, but needs more explanation.

NOTE: $1 $2 ... are Perl-specific variables that hold the extracted matches. They do not exist in Euphoria. The \1 \2 ... are internal to the regex pattern, and are therefore available in Euphoria.

Matching Repetitions ? * + { }

The quantifier metacharacters ? * + { } allow a portion of the regex pattern to be repeated (a specified number of times) and still provide a successful match.

The repetition meanings are:

meta     regex example   meaning
?        `a?`            matches a 1 or 0 times
*        `a*`            matches a 0 or more times -- any number
+        `a+`            matches a 1 or more times -- at least once
{n,m}    `a{n,m}`        at least n, but not more than m times
{n,}     `a{n,}`         at least n times
{n}      `a{n}`          exactly n times

? re:find( re:new( `[a-z]+\s+\d*` ), "beer 12 -- lower case word, some space, and any number of digits" )

? re:find( re:new( `(\w+)\s+\1` ), "match match doubled words of arbitrary length" )

? re:find( re:new( `\d{2,4}` ), "20311 02 1914 --at least 2 but not more than 4 digits" )

? re:find( re:new( `\d{4}|\d{2}` ), "123 02 2010 -- better strategy since 3 digit dates excluded" )

These quantifiers are known as greedy, in that they will try to match as much of the haystack string as possible.

? re:find( re:new( `^(.*)(at)(.*)$` ), "the cat in the hat" ) 
 
--{ 
--  {1,18}, 
--  {1,16}, 
--  {17,18}, 
--  {19,18} 
--} 

This regex matches the haystack string. The first group is "the cat in the h", the second group is "at", and the third group is empty: the pair {19,18} denotes an empty slice.

The first .* quantifier (being greedy) grabs as much of the string as possible, as long as the regex still matches the haystack string. The second .* matches nothing, because there is no string left for it to work on.

Matching Efficiency

The re:new() function is said to "compile" a regex text pattern into an internal format used for actual matching. This "compiling" has a time cost associated with it.

When a regex is created, the PCRE engine allocates some memory for it. Euphoria does the required memory cleanup for you.
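
Because compiling has a cost, a pattern that is used repeatedly is best compiled once and then reused. A minimal sketch, with made-up sample data and a hypothetical variable name time_re:

```euphoria
include std/regex.e as re

sequence lines = { "12:30:48 start", "no time here", "23:59:59 stop" }

-- compile the pattern once, before the loop
regex time_re = re:new( `\d\d:\d\d:\d\d` )

integer hits = 0
for i = 1 to length( lines ) do
    -- the compiled regex is reused for every line
    if sequence( re:find( time_re, lines[i] ) ) then
        hits += 1
    end if
end for

printf( 1, "%d lines contain a time\n", hits )
```

Calling re:new() inside the loop would give the same answer, but would pay the compile cost on every pass.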

Find and Replace

A regex may be located in a string and then the contents replaced with new text. Use the re:find_replace() function:

-- a literal replacement

sequence x = "Time to feed the cat!"       -- Time to feed the cat!
regex r = re:new( "cat" )
x = re:find_replace( r, x, "hacker" )      -- Time to feed the hacker!

-- replacement is a backreference

sequence y = "'quoted words'"         -- 'quoted words'
r = re:new( `'(.*)'$` )
y = re:find_replace( r, y, `\1` )     --  quoted words

-- the single quotes are stripped from the string

-- all matches are replaced

x = "I batted 4 for 4"                    -- I batted 4 for 4
r = re:new( `4` )
x = re:find_replace( r, x, "four" )       -- I batted four for four

-- specify the number of replacements

x = "I batted 4 for 4"                       -- I batted 4 for 4
r = re:new( `4` )
x = re:find_replace_limit( r, x, "four", 1 ) -- I batted four for 4
{ "Splitting", "Text" }

A string may be split into individual words. A regex needle is used to locate suitable splitting locations. The located needle is "removed" from the string and used to "split" the string into words. For example the regex needle `\s+` represents the space between words. The re:split() function can then produce a sequence of words:

include std/console.e   -- for display()

sequence x = "Calvin and Hobbes"

regex r = re:new( `\s+` )
x = re:split( r, x )
display( x )
--{
--  "Calvin",
--  "and",
--  "Hobbes"
--}

The elements of the new sequence are sometimes called tokens, because the splitting process has more uses than just finding words in a sentence.


Data from a spreadsheet may be saved in a comma-delimited format. The regex `,\s*` will locate commas and any whitespace that follows, and allow a comma-delimited string to be split:

 
x = "1.618,2.718,   3.142" 
r = re:new( `,\s*` ) 
x = re:split(r,x) 
display( x ) 
--{ 
--  "1.618", 
--  "2.718", 
--  "3.142" 
--} 

The regex `/` will split a directory path:

x = `/usr/bin`
r = re:new( `/` )
x = re:split(r,x)
display(x)
--{
--  "",
--  "usr",
--  "bin"
--}

Since the first character of x matched the regex, re:split() prepends an empty initial element to the output sequence.
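
If the empty elements are not wanted, they can be filtered out after the split. A minimal sketch; the variable names parts and tokens are made up for this example:

```euphoria
include std/regex.e as re

sequence parts = re:split( re:new( `/` ), `/usr/bin` )

-- keep only the elements that are not empty
sequence tokens = {}
for i = 1 to length( parts ) do
    if length( parts[i] ) > 0 then
        tokens = append( tokens, parts[i] )
    end if
end for

-- tokens now holds "usr" and "bin"
printf( 1, "%d tokens\n", length( tokens ) )
```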

The RDS archives contain a few more options for splitting strings of text. For example, the file strtok-2-1.zip contains some useful functions related to splitting. Warning: this file was written for Euphoria V3--before using it, copy the file "Strtok-v-2-1.e" to "strtok.e". Then, using a text editor, search and replace the variable "loop" with "iloop" and rename "case" to "xcase". (This is to accommodate the new keywords found in Euphoria V4.0)

regex strtok

include std/regex.e as re

include strtok.e

object x = "1.618,2.718,   3.142"
r = re:new( `,\s*` )
x = re:split(r,x)
display( x )
--{
--  "1.618",
--  "2.718",
--  "3.142"

    
