Re: strings, mirc-Eu


I've started some work on perl-like pattern matching. I don't know if I'll
have it done in a timely manner, so here's the basic rundown of how it works
(in theory), in case anyone finds it interesting enough to continue with.
The basic pattern matcher has been written, so I know the approach is
feasible.

Here's a basic Perl pattern:

   [a-z]

This says that the input must match exactly one character in the range
"a" through "z". That gets coded into a Euphoria sequence like so:

   { IS, 1, 1, "az" }

The first element specifies that the range must match (as opposed to NOT
matching). Elements 2 and 3 give the minimum and maximum number of times the
match is to take place. The final element is the range string, read as
pairs of low and high characters. The pattern:

   [a-zA-Z_]

specifies that the character must be within a-z or A-Z, or be '_'. Since
the range string is read in pairs, a single character is written twice.
That gets coded as:

   { IS, 1, 1, "azAZ__" }
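
As a rough sketch of how a range string might be tested against one
character: this assumes the string is read as consecutive low/high pairs,
so a single character such as '_' appears twice (as in "azAZ09__"). The
name in_ranges is made up for illustration and is not necessarily how the
actual matcher does it.

```euphoria
-- Hypothetical helper: returns 1 if ch falls inside any low/high pair
-- of the range string, 0 otherwise.
function in_ranges(integer ch, sequence ranges)
    for i = 1 to length(ranges) - 1 by 2 do
        if ch >= ranges[i] and ch <= ranges[i + 1] then
            return 1
        end if
    end for
    return 0
end function

? in_ranges('q', "azAZ__")   -- prints 1
? in_ranges('_', "azAZ__")   -- prints 1
? in_ranges('5', "azAZ__")   -- prints 0
```

An IS test would succeed when this returns 1, and an IS_NOT test when it
returns 0.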

The modifiers ? (zero or one), * (zero or more), + (one or more) and
{min,max} (at least min, but no more than max) get coded as:

   { .. 0, 1, .. }      -- ? in perl; zero or one
   { .. 0, MANY, .. }   -- * in perl; zero or more
   { .. 1, MANY, .. }   -- + in perl; one or more
   { .. min, max, .. }  -- {min,max} in perl; at least min, no more than max

Incidentally, the constant MANY is merely -1. As an example, here's the
specification for a string of alphanumeric characters:

   [a-zA-Z0-9_]+

It gets coded as:

   { IS, 1, -1, "azAZ09__" }
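
The mapping from modifier to counts could be sketched like this. The
function name quantifier_counts is hypothetical, and treating any other
character as "exactly once" is an assumption; a real parser would also
have to handle the {min,max} form.

```euphoria
constant MANY = -1   -- the value given in this post

-- Hypothetical helper: map a Perl modifier character to { min, max }.
-- Any other character is taken to mean "exactly once" (an assumption).
function quantifier_counts(integer q)
    if q = '?' then
        return { 0, 1 }
    elsif q = '*' then
        return { 0, MANY }
    elsif q = '+' then
        return { 1, MANY }
    end if
    return { 1, 1 }
end function

? quantifier_counts('?')   -- prints {0,1}
? quantifier_counts('+')   -- prints {1,-1}
```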

There are only a couple more operators needed to complete the code. The
STRING operator matches against a given string:

   { STRING, <string> }


AND and OR allow tests to be strung together. A simple example:

   fee|fie|foe

gets coded:

   { OR,
      { STRING, "fee" },
      { STRING, "fie" },
      { STRING, "foe" }
   }

More practically, here's a specification for a Euphoria word. It begins with
an alpha character, and is optionally followed by any number of alpha,
numeric, or underscore characters:

   [a-zA-Z][a-zA-Z0-9_]*

The coding is:

   { AND,
      { IS, 1, 1, "azAZ" },
      { IS, 0, MANY, "azAZ09__" }
   }
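
For anyone who wants to experiment, here is a minimal sketch of a matcher
for the codes described so far. It is only one possible reading of the
scheme: the numeric values of IS, IS_NOT, STRING, AND and OR are assumed
(only MANY = -1 is given above), the names in_ranges and match_item are
made up, and the matching is greedy with no backtracking, so a pattern
like [a-z]*z would fail even on input that Perl accepts.

```euphoria
-- Operator codes. The values are assumptions, except MANY = -1.
constant IS = 1, IS_NOT = 2, STRING = 3, AND = 4, OR = 5, MANY = -1

-- Returns 1 if ch falls inside any low/high pair of the range string.
function in_ranges(integer ch, sequence ranges)
    for i = 1 to length(ranges) - 1 by 2 do
        if ch >= ranges[i] and ch <= ranges[i + 1] then
            return 1
        end if
    end for
    return 0
end function

-- Match one coded pattern against text, starting at index pos.
-- Returns the index just past the match, or 0 on failure.
function match_item(sequence item, sequence text, integer pos)
    integer op, count, after
    op = item[1]
    if op = STRING then
        after = pos + length(item[2])
        if after - 1 <= length(text) then
            if compare(text[pos..after - 1], item[2]) = 0 then
                return after
            end if
        end if
        return 0
    elsif op = AND then
        -- every sub-pattern must match, one after the other
        for i = 2 to length(item) do
            pos = match_item(item[i], text, pos)
            if pos = 0 then
                return 0
            end if
        end for
        return pos
    elsif op = OR then
        -- the first sub-pattern that matches wins
        for i = 2 to length(item) do
            after = match_item(item[i], text, pos)
            if after != 0 then
                return after
            end if
        end for
        return 0
    end if
    -- IS or IS_NOT: take as many characters as allowed, then check min
    count = 0
    while pos + count <= length(text)
          and (item[3] = MANY or count < item[3]) do
        if in_ranges(text[pos + count], item[4]) != (op = IS) then
            exit
        end if
        count = count + 1
    end while
    if count < item[2] then
        return 0
    end if
    return pos + count
end function

constant word = { AND,
                  { IS, 1, 1, "azAZ" },
                  { IS, 0, MANY, "azAZ09__" } }

? match_item(word, "foo_1 = 2", 1)   -- prints 6
? match_item(word, "9lives", 1)      -- prints 0
```

Returning the index just past the match (rather than a true/false flag)
lets AND feed the result of one test straight into the next.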

I haven't forgotten about the parenthesis operators; I just haven't coded
them up yet. Externally, they simply get coded as {OPEN} and {CLOSE}. For
example:

   (\d+) (\d+)

would be coded as:

   { AND,
      { OPEN },
      { IS, 1, MANY, "09" },
      { CLOSE },
      { OPEN },
      { IS, 1, MANY, "09" },
      { CLOSE }
   }

Since I haven't actually coded the parenthesis handlers (although I *think*
I know how to do it), I won't pretend to explain them.

Here are some small 'real life' examples (parenthesis coding omitted):

Perl expression:

   /^Subject: (.*)/

Euphoria coding:

   { AND,
      { STRING, "Subject:" },
      { IS_NOT, 1, MANY, "\n\n" }
   }

Perl expression:

   /^Date: (\d+) (\w+) (\d+) (\d+): (\d+) (.*)$/

Euphoria coding:

   { AND,
      { STRING, "Date:" },
      { IS, 1, MANY, "09" },
      { IS, 1, MANY, "azAZ09__" },
      { IS, 1, MANY, "09" },
      { IS, 1, MANY, "09" },
      { STRING, ":" },
      { IS, 1, MANY, "09" },
      { IS_NOT, 0, MANY, "\n\n" }
   }

I think this shows that an implementation of perl-like pattern matching in
Euphoria is certainly feasible. I've already coded up a simple pattern
matcher, *without* parenthesis assignment (I'm working on it!). Writing a
parser to convert Perl patterns into code is a bit more difficult.
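
To give a feel for the parsing side, here is a rough sketch that converts
just a bracket class such as [a-zA-Z_] into the coded form. The function
name class_to_item and the values of IS and IS_NOT are assumptions, and it
deliberately ignores escapes, a literal ']' and a trailing '-'. Treating a
leading ^ as IS_NOT is how Perl negates a class.

```euphoria
constant IS = 1, IS_NOT = 2   -- numeric values are assumptions

-- Hypothetical helper: convert a bracket class such as "[a-zA-Z_]"
-- into { IS or IS_NOT, 1, 1, ranges }.
function class_to_item(sequence pat)
    sequence ranges
    integer op, i, last
    ranges = ""
    op = IS
    i = 2                      -- skip the opening '['
    last = length(pat) - 1     -- stop before the closing ']'
    if i <= last and pat[i] = '^' then
        op = IS_NOT
        i = i + 1
    end if
    while i <= last do
        if i + 2 <= last and pat[i + 1] = '-' then
            ranges = ranges & pat[i] & pat[i + 2]   -- low-high pair
            i = i + 3
        else
            ranges = ranges & pat[i] & pat[i]       -- single char, doubled
            i = i + 1
        end if
    end while
    return { op, 1, 1, ranges }
end function

sequence item
item = class_to_item("[a-zA-Z_]")
puts(1, item[4] & '\n')   -- prints azAZ__
```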

Is anyone interested in this sort of thing?

Thanks.

-- David Cuny
