Re: strings, mirc-Eu
- Posted by "Cuny, David at DSS" <David.Cuny at DSS.CA.GOV> Dec 28, 1999
I've started some work on perl-like pattern matching. I don't know if I'll have it done in a timely manner, so here's the basic rundown of how it works (in theory), in case anyone finds it interesting enough to continue with. The basic pattern matcher has been written, so I know the approach is feasible.

Here's a simple perl pattern:

    [a-z]

This says that the character needs to match once (and only once) a character in the range of "a" through "z". That gets coded into a Euphoria sequence like so:

    { IS, 1, 1, "az" }

The first argument specifies that the range must match (as opposed to NOT matching). Arguments 2-3 specify the minimum and maximum number of times this match is to take place. The final argument is the range string, read as pairs of low and high characters.

The pattern:

    [a-zA-Z_]

specifies that the character must be within a-z or A-Z or '_'. That gets coded as:

    { IS, 1, 1, "azAZ__" }

The modifiers ? (zero or one), * (zero or more), + (one or more) and {min,max} (at least min, but no more than max) get coded as:

    { .. 0, 1, .. }        -- ? in perl; zero or one
    { .. 0, MANY, .. }     -- * in perl; zero or more
    { .. 1, MANY, .. }     -- + in perl; one or more
    { .. min, max, .. }    -- {min,max} in perl; at least min but no more than max

Incidentally, the constant MANY is merely -1. As an example, here's the specification for a string of alphanumeric characters:

    [a-zA-Z0-9_]+

It gets coded as:

    { IS, 1, -1, "azAZ09__" }

There are only a couple more operators needed to complete the code. The STRING operator matches against a given string:

    { STRING, <string> }

AND and OR allow tests to be strung together. A simple example:

    fee|fie|foe

gets coded:

    { OR, { STRING, "fee" }, { STRING, "fie" }, { STRING, "foe" } }

More practically, here's a specification for a Euphoria word. It begins with an alpha character, optionally followed by any number of alpha, numeric, or underscore characters:

    [a-zA-Z][a-zA-Z0-9_]*

The coding is:

    { AND,
        { IS, 1, 1, "azAZ" },
        { IS, 0, MANY, "azAZ09__" } }

I haven't forgotten about the parenthesis operators; I just haven't coded them up yet. Externally, they simply get coded as {OPEN} and {CLOSE}. For example:

    (\d+) (\d+)

would be coded as:

    { AND,
        { OPEN }, { IS, 1, MANY, "09" }, { CLOSE },
        { OPEN }, { IS, 1, MANY, "09" }, { CLOSE } }

Since I haven't actually coded the parenthesis handlers (although I *think* I know how to do it), I won't pretend to explain them.

Here are some small 'real life' examples (parenthesis coding omitted):

Perl expression:

    /^Subject: (.*)/

Euphoria coding:

    { AND,
        { STRING, "Subject:" },
        { IS_NOT, 0, MANY, "\n\n" } }

Perl expression:

    /^Date: (\d+) (\w+) (\d+) (\d+): (\d+) (.*)$/

Euphoria coding:

    { AND,
        { STRING, "Date:" },
        { IS, 1, MANY, "09" },
        { IS, 1, MANY, "azAZ09__" },
        { IS, 1, MANY, "09" },
        { IS, 1, MANY, "09" },
        { STRING, ":" },
        { IS, 1, MANY, "09" },
        { IS_NOT, 0, MANY, "\n\n" } }

I think this shows that an implementation of perl-like pattern matching in Euphoria is certainly feasible. I've already coded up a simple pattern matcher, *without* parenthesis assignment (I'm working on it!). Writing a parser to convert Perl patterns into code is a bit more difficult.

Is anyone interested in this sort of thing?

Thanks.

-- David Cuny
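
A minimal sketch of how sequences like the ones above could be interpreted, assuming a greedy matcher with no backtracking. This is not the matcher mentioned in the post, which was not shown. The numeric values of the operator constants, FAIL, in_ranges() and match_at() are hypothetical names chosen for illustration. Range strings are assumed to be low/high pairs, so a single character is written twice (as in "__"). OPEN, CLOSE and the ^ and $ anchors are left out.

```euphoria
-- Sketch only. IS, IS_NOT, STRING, AND, OR and MANY are the post's names;
-- their values (apart from MANY = -1), FAIL, in_ranges() and match_at()
-- are assumptions made for this example.
constant IS = 1, IS_NOT = 2, STRING = 3, AND = 4, OR = 5
constant MANY = -1
constant FAIL = 0     -- string indexes start at 1, so 0 is never a valid result

-- Return 1 if character c lies inside any low/high pair of ranges.
function in_ranges(integer c, sequence ranges)
    for i = 1 to length(ranges) - 1 by 2 do
        if c >= ranges[i] and c <= ranges[i + 1] then
            return 1
        end if
    end for
    return 0
end function

-- Try to match pat against s starting at index pos.
-- Returns the index just past the match, or FAIL.
function match_at(sequence pat, sequence s, integer pos)
    integer op, count, stop, result
    op = pat[1]
    if op = IS or op = IS_NOT then
        -- greedy: consume as many characters as allowed, never give any back
        count = 0
        while pos <= length(s) and (pat[3] = MANY or count < pat[3]) do
            if in_ranges(s[pos], pat[4]) != (op = IS) then
                exit
            end if
            pos = pos + 1
            count = count + 1
        end while
        if count < pat[2] then
            return FAIL
        end if
        return pos
    elsif op = STRING then
        stop = pos + length(pat[2]) - 1
        if stop > length(s) then
            return FAIL
        end if
        if compare(s[pos..stop], pat[2]) != 0 then
            return FAIL
        end if
        return stop + 1
    elsif op = AND then
        -- every sub-pattern must match, one after another
        for i = 2 to length(pat) do
            pos = match_at(pat[i], s, pos)
            if pos = FAIL then
                return FAIL
            end if
        end for
        return pos
    elsif op = OR then
        -- the first alternative that matches wins
        for i = 2 to length(pat) do
            result = match_at(pat[i], s, pos)
            if result != FAIL then
                return result
            end if
        end for
    end if
    return FAIL
end function

-- The Euphoria word pattern from the post
constant word = { AND, { IS, 1, 1, "azAZ" }, { IS, 0, MANY, "azAZ09__" } }
printf(1, "%d\n", match_at(word, "foo_1 = 2", 1))   -- prints 6
```

Because the IS branch never gives characters back, a pattern such as { AND, { IS, 0, MANY, "az" }, { STRING, "a" } } cannot match "aa" in this sketch; handling that case would need backtracking.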