
EuGuide: Regular Expressions

A regular expression is a way of describing, in a cleverly coded way, a fragment of text that you may search for in a larger string of text. It is all about pattern matching. Regular expressions form a coding system, and learning them is akin to learning a new language.

First draft

  • (c) copyright Tom Ciplijauskas March 2009

Simple Match: Euphoria

To locate the position of a particular fragment of text in a string you would normally write the following:

? match("pho", "Euphoria") 
   --             | 
   --             pho    ---the match you want 
           
   -- 3   ---output from match() 
     
   ---the matched pattern starts at index 3    

Where "pho" is the pattern, "Euphoria" is the string you are searching, and 3 is the result of your search. Searches of this kind are best done with match().

Simple Match: Regular Expressions

The previous example with match() looks like this when expressed using regular expressions:

include std/regex.e as re 
regex r = re:new( "pho" ) 
? re:find( r, "Euphoria" ) 
   --            |               
   --            pho    ---the match you want    
 
   -- { 
   --  {3, 5}    
   -- }        ---output from re:find() 
   
   ---matched pattern starts at index 3 and ends at index 5 
 
re:free( r )  

The fragment "pho" is a text pattern that is a literal match for the three characters found in the search string "Euphoria". A blank space (invisible as it is) is also a valid character.

Demo program

Experimenting with regular expressions is easier if you have a demonstration program. Try the following code as a starting point:

include std/console.e 
include std/graphics.e 
 
   clear_screen() 
 
puts(1, "Test out regular expressions.\n\n" & 
         "Enter string values without \" delimiters\n" & 
         "(do not use #/ / notation)\n\n" ) 
 
sequence haystack = prompt_string( "enter the haystack... " ) 
sequence needle  = prompt_string( "enter the needle..... " ) 
 
include std/regex.e as re 
   regex r = re:new( needle ) 
   object result = re:find_all( r, haystack ) 
 
   position(15,1) puts(1, haystack ) 
   text_color( YELLOW ) bk_color( BLUE ) 
 
sequence slice 
for i=1 to length( result ) do 
   slice = result[i][1] 
   for k=slice[1] to slice[2] do 
      position(15, k ) 
      puts(1, haystack[ k ] ) 
      end for    
   end for 
 
   text_color( BLACK ) bk_color( BRIGHT_WHITE ) 
position( 20, 1 ) puts(1, "results: " ) print(1, result )   
puts(1, "\n\npatterns matched: " ) ? length( result ) 
 
puts(1, "\n...done...\n" ) 

Simple Find: Euphoria

The Euphoria find() will search for a "needle" object as an element of a Euphoria "haystack" object.

Example 1: 
location = find(11, {5, 8, 11, 2, 3}) 
-- location is set to 3 
  
Example 2: 
names = {"fred", "rob", "george", "mary", ""} 
location = find("mary", names) 
-- location is set to 4 

Think of find() as an extension that goes beyond finding a slice out of a string.

Using regex.e

  • To prevent identifier name clashes we write:
    • include std/regex.e as re
    • The choice re is common because it suggests "regular expression." The library routines then become re:new() instead of just new().
  • Creating a regex involves a pair of routines: new() and free().
    • By analogy think of the open() and close() pair you use with files.
    • We write a regex as a text string, but must "compile" it into an encoded value--hence the variable "r" and the "regex-type." The new() function creates the compiled form we need.
    • The compiled regex is then used in a variety of searching routines, such as the find() function.
    • When done, you free() the memory used by the compiled regex. You may now create another compiled regex.

Metacharacters

Regular expressions are about exotic text patterns that you may use for searching. You must learn the rules of regular expressions before I can show you what can be done. I will often use regex as shorthand for "regular expression."

Some characters have special meaning when used in regular expressions. They are analogous to keywords in Euphoria, and are called metacharacters.

\ . ^ $ * +  ?  @ #  ( ) [ - ] { , }  

Metacharacters are used to define and control how a regular expression is created and used. They are words of the regex language.

When you see one of them, you must reason out what special purpose it is serving.

If you need to use a metacharacter in its literal sense, that character must be "escaped" first. The backslash( \ )is the escape character. When placed before a metacharacter, the metacharacter loses its special meaning.

For example * has special meaning, but \* is just a star.

If you escape the escape symbol, \\ , its special meaning is lost and it is just a backslash.
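As a minimal sketch of escaping in practice, the following searches for a literal star. It assumes re:find() returns a sequence of { start, end } slices on success and an atom on failure; the variable names and the sample string "2*3" are made up for illustration.

```euphoria
include std/regex.e as re

-- "\\*" in Euphoria source is the two-character regex \* (a literal star).
regex star = re:new( "\\*" )
object hit = re:find( star, "2*3" )

if sequence( hit ) then
    -- hit[1] is the {start, end} slice of the whole match
    printf( 1, "star found at index %d\n", { hit[1][1] } )
else
    puts( 1, "no star found\n" )
end if
```

Testing the result with sequence() avoids depending on which atom value signals a failed search.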

The escape \ is also used to create special meaning. For example, the letter d on its own is just a letter. But combined with an escape, the d loses its normal meaning and gains a special one: \d is shorthand for "any decimal digit."

Another caution, the meaning of a metacharacter may depend on the location where it is used. For example the ^ changes meaning if it's inside or outside square brackets: ^[ ] is very different from [^ ] .

Writing style

A regex is coded using metacharacters and regular characters.

[a-z]+i[a-z]* 

The coded regex is then written as a Euphoria string. A simple regex just needs( " )delimiters and is used as an argument to the re:new() function:

regex sample = re:new( "[a-z]+i[a-z]*" ) 

Here is an alternative way of writing the same thing using #/ and / as delimiters:

regex sample = re:new( #/[a-z]+i[a-z]*/ ) 

A regex can become messy very quickly if there are backslashes. Consider this regex:

\* 

To search for a literal * you need to escape it. A \ is also an escape inside a Euphoria string--the Euphoria escape and regex escape are in competition. The consequence is you must add an extra \ to write the regex as a string:

regex sample = re:new( "\\*" ) 

The alternative writing style lets you write a regex without doubling up on the escape characters:

regex sample = re:new( #/\*/ ) 

You will see the style using( / )forward slash delimiters used to describe regex patterns in outside references such as books and the web:

/[a-z]+i[a-z]*/ 

Euphoria wildcard matching

The Euphoria wildcard match lets you create fancier searches.

The plain Euphoria( ? )represents any one character, while the plain Euphoria( * )represents several characters.

For example:

include std/wildcard.e 
integer i 
i = wildcard_match("A?B*", "AQBXXYY") 
? i 
   -- i is 1 (TRUE) 
 
i = wildcard_match("*xyz*", "AAAbbbxyz") 
? i 
   -- i is 1 (TRUE) 
 
i = wildcard_match("A*B*C", "a111b222c") 
? i 
   -- i is 0 (FALSE) because upper/lower case doesn't match 

You cannot search for a literal ? or * using wildcard_match(), because it always treats them as wildcards; match() treats every character literally.

If your searches are simple, then use match() and wildcard_match().

Regex "wildcard" matching

The advantage of a regex is that you can do a lot more than is possible with a simple search.

Warning: when using a regex, the meaning of ? and * is not the same as in plain Euphoria.

The vocabulary for a "wildcard" regex is:

metacharacter   meaning
.               any character
.*              any character 0 or more times
.+              any character 1 or more times
.?              any character 0 or 1 times


Using a regex for the previous examples would look like:

include std/regex.e as re 
regex r 
 
r = re:new( "A.B" ) 
? re:find( r, "AQBXXYY" ) 
   --          |          {  
   --          AQB          {1, 3} 
   --                     }  
 
r = re:new( ".*xyz.*" ) 
? re:find( r, "AAAbbbxyz" ) 
   --          |             { 
   --          AAAbbbxyz       {1,9} 
   --                        } 
 
r = re:new( "A.*B.*C" ) 
? re:find( r, "a111b222c" ) 
   --         | 
   --        -1 

The( . )is working like a wildcard. The( * +? )are quantifiers acting on the( . )letting you be very specific about what will be matched.

Using quantifiers does have its gotchas, since what is matched may not be what you expect at first glance.

The Regex Routines

A regex search is like finding a "needle" in a "haystack." Or, a slice out of a string.

Sometimes it is enough to know that a "needle" exists within the "haystack."

  • The function is_match() returns true ( 1 ) if your regex "needle" matches the entire "haystack" string, and false ( 0 ) otherwise.
  • The function has_match() returns true ( 1 ) if your regex "needle" matches a portion of the "haystack" string, and false ( 0 ) otherwise.
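The difference between the two routines can be sketched as follows, assuming both take the compiled regex followed by the haystack, as the other routines in this guide do. The sample strings are made up for illustration.

```euphoria
include std/regex.e as re

regex dog = re:new( "dog" )

-- has_match(): the pattern may match any portion of the haystack
? re:has_match( dog, "cats and dogs" )   -- 1
? re:has_match( dog, "cats and birds" )  -- 0

-- is_match(): the pattern must account for the entire haystack
? re:is_match( dog, "dog" )              -- 1
? re:is_match( dog, "cats and dogs" )    -- 0
```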

At other times you want to find the "needle" and all details about the search. The "needle" is now described by a slice that is part of the search string. A "needle" slice is a sequence pair with start and end indices: { start, end } .

  • The function find() returns false ( 0 ) if a search fails, or the first slice that matches the regular expression.
  • The function find_all() returns false ( 0 ) if a search fails, or all slices that match the regular expression.

The output of the "find" functions may contain even more information. Complex regular expressions may be composed of simpler regular expressions. The output of find() or find_all() will also include the simple slices that were matched as part of matching the large complex regex.
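A short sketch of the extra slices, using round brackets to split a pattern into two simpler parts. The pattern and the sample string "room 42" are made up for illustration, and the code assumes re:find() lists the slice for the whole match first, followed by one slice per bracketed part.

```euphoria
include std/regex.e as re

-- Two bracketed parts: ([a-z]+) for a word and ([0-9]+) for a number.
regex r = re:new( "([a-z]+) ([0-9]+)" )
object result = re:find( r, "room 42" )

if sequence( result ) then
    for i = 1 to length( result ) do
        -- result[1] is the whole match; later entries are the parts, in order
        printf( 1, "slice %d: {%d,%d}\n", { i, result[i][1], result[i][2] } )
    end for
end if
```

Here the whole match covers indices 1 to 7, the word covers 1 to 4, and the number covers 6 to 7.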

First match

The function has_match() will tell you if a match exists.

The function find() returns the first valid match, scanning from the left.

include std/regex.e as re 
regex r = re:new( "dog" ) 
 
? re:has_match( r, "cats and dogs and cats and dogs and dogs" ) 
   -- 1 
 
? re:find( r, "cats and dogs and cats and dogs and dogs" ) 
  --                    |                                   { 
  --                    dog                                   {10,12} 
  --                                                        } 

All matches

The find_all() function returns every match in the string, scanning from the left.

include std/regex.e as re 
regex r = re:new( "dog" ) 
? re:find_all( r, "cats and dogs and cats and dogs and dogs" ) 
   --                       |                 |        |        { 
   --                       dog               |        |         {10,12}, 
   --                                         dog      |         {28,30}, 
   --                                                  dog       {37,39} 
   --                                                           } 

^ $ search anchors

You may use anchor metacharacters to specify that your match must be in a specific location within the search string.

The metacharacter( ^ )is an anchor that requires the match to start at the beginning of the string.

The metacharacter( $ )is an anchor that requires the match to finish at the end of the string.

include std/regex.e as re 
regex r  
 
r = re:new( "cat" ) ---no anchors  
? re:find( r, "dog chases cat chasing another cat" )   
   --                     |                          { 
   --                     cat                          {12,14} 
   --                                                } 
 
r = re:new( "^cat" ) ---anchor start of string   
? re:find( r, "dog chases cat chasing another cat" ) 
   --         | 
   --        -1 
 
r = re:new( "^dog" ) ---anchor start of string    
? re:find( r, "dog chases cat chasing another cat" ) 
   --          |                                     { 
   --          dog                                     { 1, 3 } 
   --                                                } 
 
 
r = re:new( "cat$" ) ---anchor end of string 
? re:find( r, "dog chases cat chasing another cat" ) 
   --                                           |    { 
   --                                         cat      {32,34} 
   --                                                } 
 
r = re:new( "cat$" ) --- anchor end of string         
? re:find( r, "dog chases cat chasing another cat or maybe a mouse" ) 
   --      | 
   --     -1 

You may use ^ and $ together to force a regex to match an entire string.

include std/regex.e as re 
regex r = re:new( "^.+$" ) 
? re:find( r, "Match this entire string, please." ) 
   --          |                               |    { 
   --                                                 {1,33} 
   --                                               } 
 

\b word anchor

The metacharacter( \b )is used to anchor a search using the boundary of a word:

 
include std/regex.e as re 
 
regex r = re:new( "cat" ) --- can be anywhere 
 
? re:find_all( r, "dogchasescatchasinganothercatormouse" ) 
   --                    |                |           { 
   --                    |                |             { 
   --                    cat              |               {10,12} 
   --                                     |             }, 
   --                                     |             { 
   --                                     cat             {27,29} 
   --                                                   } 
   --                                                 } 

r = re:new( "\\bcat" ) --- must start a word 
 
? re:find_all( r, "dogchasescatchasinganothercatormouse" ) 
   --         | 
   --        -1 
 
? re:find( r, "dogchases catchasinganothercatormouse" ) 
   --                    |                              {    
   --                    cat                             {11,13} 
   --                                                   } 

r = re:new( "cat\\b" )  --- must end a word 
 
? re:find_all( r, "dogchases catchasinganothercatormouse" ) 
   --             |                 
   --             0 
 
? re:find_all( r, "dogchases catchasinganothercat ormouse" ) 
   --                                           |           { 
   --                                         cat            {28,30} 
   --                                                       } 

r = re:new( "\\bcat\\b" ) --- must be a complete word 
 
? re:find_all( r, "dogchasescatchasinganothercatormouse" ) 
   --             | 
   --            {} 
 
? re:find_all( r, "dogchases cat chasinganothercatormouse" ) 
   --                        | |                            { 
   --                        | |                             { 
   --                        cat                              {11,13} 
   --                                                        }  
   --                                                       } 
 
r = re:new( #/\bcat\b/ ) 
? re:find_all( r, "dogchases cat chasinganothercatormouse" ) 
   -- {                      | | 
   --  { 
   --     {11,13} 
   --  } 
   -- } 
   --- same as above but with alternative notation 

\B word anchor

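The metacharacter( \B )is the opposite of \b: it requires that the position not be a word boundary. The pattern \Bcat therefore matches cat only where it does not start a word:

include std/regex.e as re 
regex r = re:new( #/\Bcat/ ) 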
? re:find( r, "dogchases catchasinganothercatormouse cat" ) 
   --                                     |                 { 
   --                                     cat                {28,30} 
   --                                                       } 

[ ] range

  • user defined
  • predefined range

Square brackets( [ ] )are used to describe a class of characters.

Between the brackets you list the characters that belong to the class. They may be listed individually: [abc] . They may also be listed as part of a range of characters: [a-c] . In this case the range starts at 'a', continues on as indicated by the metacharacter( - )dash separator, and ends at 'c'.

The meaning of [a-c] is that any one character in the range may result in a proper match.

include std/regex.e as re 
regex r = re:new( "[aeiou]" )                        -- find a vowel 
? re:find_all( r, "The simple, powerful language" ) 
--                   |  |   |   | |  |   |  || |   { 
--                   e  |   |   | |  |   |  || |     { {3,3}   }, 
--                      i   |   | |  |   |  || |     { {6,6}   }, 
--                          e   | |  |   |  || |     { {10,10} }, 
--                              o |  |   |  || |     { {14,14} }, 
--                                e  |   |  || |     { {16,16} }, 
--                                   u   |  || |     { {19,19} }, 
--                                       a  || |     { {23,23} }, 
--                                          u| |     { {26,26} }, 
--                                           a |     { {27,27} }, 
--                                             e     { {29,29} } 
--                                                 } 

[^ ] range complement

When a range starts with the metacharacter pair [^ then everything not in the class is included. The ^ is a range complement operator.

If [aeiou] are all of the vowels, then [^aeiou] means all characters that are not vowels.

include std/regex.e as re 
regex r = re:new( "[^aeiou]" ) 
? re:find_all( r, "The simple, powerful language" ) 

Units and Groups

The regex coding system operates on one unit of the regular expression at a time. Any single ordinary character is easy to recognize as being a unit.

An escaped metacharacter, such as \$ is also a unit. It is a unit composed of two symbols.

It is easy for a unit to "get lost" in a regular expression:

abc\$ 

In a regex you may use round brackets:( ( ) )to group symbols. You may also nest brackets.

This example is identical in meaning to the one above, but it emphasizes that \$ is a single unit:

-- haystack string: abc$defg 
-- regular expression: abc(\$) 
 
--matches abc$ 

The brackets make it clear that you are searching for a literal $ and not using the $ as an anchor.

Brackets are used to create elaborate groups of symbols in exotic regex codes.

When brackets are used, the group automatically also becomes a named group. A named group gets an identifier. That identifier can be used as a constant in the rest of the regular expression, saving a lot of typing. The section on named groups will explain how this works.

Brackets are metacharacters themselves, so you must escape them if you need to use brackets in their literal meaning.
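The contrast between grouping brackets and literal brackets can be sketched like this. The patterns and sample strings are made up for illustration; the backslashes are doubled only because the pattern is written as a Euphoria string.

```euphoria
include std/regex.e as re

regex grouped = re:new( "f(x)" )       -- brackets group: matches the text "fx"
regex literal = re:new( "f\\(x\\)" )   -- escaped brackets: matches the text "f(x)"

? re:has_match( grouped, "fx = 2" )    -- 1
? re:has_match( grouped, "f(x) = 2" )  -- 0, the haystack never contains "fx"
? re:has_match( literal, "f(x) = 2" )  -- 1
? re:has_match( literal, "fx = 2" )    -- 0
```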

Matching metacharacters

When you see a metacharacter you must consider what special meaning it may have.

Inside a class [ ] , metacharacters are, mostly, just characters (no special meaning). Of course, there are gotchas!

regex r = re:new( "[rds^@]" ) 

The class includes the characters 'r', 'd', 's', '^', and '@' .

regex r = re:new( "[^rds@]" ) 

When you see [^ ] it means that none of the characters that follow the ^ are to be matched. That is to say, any other character will match. The ^ just after the [ has a special metacharacter meaning. The actual characters in the class are the complement of the characters listed.

If you want [ or ] to be part of your character class, you need a special notation to include these metacharacters in your class. The( \ )backslash is known as the "escape" character. It is used to turn a metacharacter into an ordinary character.

regex r = re:new( "[rds\\[]" ) 

The characters are now 'r', 'd', 's', and '[' . The bracket has been escaped from its metacharacter meaning. The backslash is doubled because the pattern is written as a Euphoria string.

To include a '\' in the regex you are creating, you must also "escape" the backslash. That means you must write \\ to represent one backslash.

BUT! The '\' is also an escape character in a Euphoria sequence that you will be searching. The two escape characters are working against each other!

include std/regex.e as re 
regex r = re:new( "\\\\" ) 
object result = re:find( r, "Eup\\horia" ) 
? result 
 
   -- { 
   --    {4,4} 
   -- } 

To include one backslash in a string you need \\ (escaping in the string). For each \ in a regular expression you also need \\ (escaping the regular expression). The painful result is that you need \\\\ to match \\ which is in reality just the \ character.

Metacharacter shortcuts and escaped characters

The \ backslash is also used to code commonly used classes of characters.

shortcut   class meaning                                    long form
\d         any decimal digit character                      [0-9]
\D         any non-digit character                          [^0-9]
\s         any whitespace character                         [ \t\n\r]
\S         any non-whitespace character                     [^ \t\n\r]
\w         any alphanumeric character or underscore         [a-zA-Z0-9_]
\W         any character that is not one of the above       [^a-zA-Z0-9_]


It may be hard to notice (depending on the font you are using) that a blank character ( ' ' ) is included in the class of whitespace characters. A blank space is significant when writing a regular expression.

Notice that '\t', '\n' , ... are recognized as "escaped" characters that represent the non-printing characters like TAB and NEWLINE.

include std/regex.e as re 
regex r = re:new( "\\n" ) 
object result = re:find( r, "Eup\\horia\n" ) 
? result 
   -- { 
   --    {10,10} 
   -- } 

In a Euphoria string \n is a single character. Therefore in a regular expression you only need \\n to search for it.

If you encounter a regex shortcut, then it is considered to be a single character. Shortcuts can be entered into a class and may be part of a larger class definition.

regex r = re:new( "[\\s.!;,:?]" ) 

The regex for whitespace is now extended with punctuation markers.
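A shortcut can also be combined with a quantifier. The sketch below pulls the runs of digits out of a string; the sample text and variable names are made up for illustration, and it assumes re:find_all() returns one entry per match with the { start, end } slice of the whole match in first position, as the demo program above does.

```euphoria
include std/regex.e as re

sequence haystack = "call 555 1234"   -- hypothetical sample text

-- The regex \d+ means "one or more digits"; the backslash is doubled
-- because the pattern is written as a Euphoria string.
regex digits = re:new( "\\d+" )
object result = re:find_all( digits, haystack )

sequence slice
if sequence( result ) then
    for i = 1 to length( result ) do
        slice = result[i][1]                            -- {start, end} of match i
        puts( 1, haystack[ slice[1] .. slice[2] ] & "\n" )
    end for
end if
```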

| Alternation

? as an optional

The metacharacter( | )is the alternation symbol. It allows one of two choices to be included in the regex match.

include std/regex.e as re 
regex r = re:new( "a|b" ) -- a or b 
? re:find_all( r, "xxxxaxxxxxbxxxxxx" ) 
   --                  |     |         { 
   --                  a     |            {5,5}, 
   --                        b            {11,11} 
   --                                  } 

Thus 'a' or 'b' will result in a valid match.

Groups are also recognized:

include std/regex.e as re 
 
regex r = re:new( "sink|swim" )  
 
? re:find_all( r, "I am afraid I will sink if I try to swim" ) 
   --                                 |                |   { 
   --                                 |                |     { 
   --                                 sink             |        {20,23} -- sink 
   --                                                  |      }, 
   --                                                  |      { 
   --                                                  swim     {37,40} -- swim 
   --                                                          } 
   --                                                        } 

Repeated patterns

Regular expressions are about patterns formed from characters. A simple pattern occurs when a unit or group is repeated several times within the string being searched.

The ( + ) metacharacter is used to indicate that a pattern can be matched one or more times. The + "quantifier" applies to the unit or group to its left.

include std/regex.e as re 
regex r = re:new( "a+" ) 
 
? re:find_all( r, "cat" ) -- one 
   --               |     { 
   --               a      {2,2}   
   --                     } 
 
 
? re:find_all( r, "caat" ) -- two 
   --               |      { 
   --               aa       {2,3}    
   --                      } 
 
? re:find_all( r, "caaaatatonic" ) -- many 
   --               |    |         { 
   --               aaaa |          {2,5}, 
   --                    a          {7,7} 
   --                              } 

The( * )metacharacter is used to indicate that a pattern can be matched zero or more times. The * "quantifier" applies to the unit or group to its left.

include std/regex.e as re 
regex r = re:new( "a*" ) 
 
? re:find_all( r, "cat" ) -- one 
   --               |     { 
   --               |      {1,0},      
   --               a      {2,2}, 
   --                      {3,2} 
   --                     } 

Surprise! You may have expected only one match, but three matches are reported. The first and last matches are "zero length" matches indicated by {1,0} and {3,2} which are "illegal" Euphoria indexes. The * allows for zero or more matches, and it is true that 'c' is a zero match for 'a'.


The( ? )metacharacter is used to indicate that a pattern can be matched zero or one times. The ? "quantifier" applies to the unit or group to its left.

include std/regex.e as re 
regex r = re:new( "a?" ) 
 
? re:find_all( r, "cat" ) -- one 
   --               |     { 
   --               |      {1,0}, 
   --               a      {2,2},    -- a 
   --                      {3,2} 
   --                     } 
  
? re:find_all( r, "caat" ) -- two 
   --               ||     { 
   --               ||       {1,0}, 
   --               a|       {2,2},     -- a 
   --                a       {3,3},     -- a 
   --                        {4,3} 
   --                       }   
 
? re:find_all( r, "caaaatatonic" ) -- many 
   --                              { 
   --                               {1,0}, 
   --                               {2,2},     -- a 
   --                               {3,3},     -- a 
   --                               {4,4},     -- a 
   --                               {5,5},     -- a 
   --                               {6,5}, 
   --                               {7,7},     -- a 
   --                               {8,7}, 
   --                               {9,8}, 
   --                               {10,9}, 
   --                               {11,10}, 
   --                               {12,11} 
   -- }  

The quantifiers may be used to create complex regex searches.

include std/regex.e as re 
regex r = re:new( "A+B*C?D" ) 
 
? re:find( r, "AAAD" ) 
   --          |       { 
   --          AAAD    {1,4} 
   --                   } 
 
? re:find( r, "ABBBBCD" ) 
   --          |         { 
   --          ABBBBCD    {1,7} 
   --                    } 
 
? re:find( r, "BBBCD" ) 
   --         | 
   --         0 
 
? re:find( r, "ABCCD" ) 
   --         | 
   --         0 
 
? re:find( r, "AAABBC" ) 
   --         | 
   --         0 

Extent of matching: Greedy, Non-Greedy

left to right searching, and backing up to find match

problems in getting too much, too little
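A minimal sketch of the "too much" problem, assuming std/regex.e accepts the Perl-style non-greedy form +? (an assumption; it is not covered elsewhere in this guide). The tag-like sample string is made up for illustration.

```euphoria
include std/regex.e as re

sequence haystack = "<b>bold</b>"

regex greedy = re:new( "<.+>" )    -- .+ takes as much as it can
regex lazy   = re:new( "<.+?>" )   -- .+? takes as little as it can

? re:find( greedy, haystack )  -- {{1,11}}: the whole string
? re:find( lazy, haystack )    -- {{1,3}}: just <b>
```

The greedy form first runs to the end of the string and then backs up until a > fits, so it returns far more than the first tag.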

Numeric quantifiers

The metacharacters( { , } )are used to specify numeric quantifiers.

If there is only one number, such as {2}, then an exact match is required:

include std/regex.e as re 
regex r = re:new( "a{2} b{3} c{4}" ) 
 
? re:find_all( r, "aa bbb cccc" ) 
--                 |         |   { 
--                 |         |     { 
--                 aa bbb cccc       {1,11} 
--                                 } 
--                               } 
 
? re:find_all( r, "aaa bb cccccc" ) 
               --{} 

A numeric quantifier can represent a specific range of numbers { minimum, maximu
