Re: Looking for the best "Parse"...

----- Original Message -----
From: <euman at bellsouth.net>
To: "EUforum" <EUforum at topica.com>
Subject: Re: Looking for the best "Parse"...


> I don't think these multi-run tests are any good......
>
> What I'm saying is that oftentimes the second pass will be 5 times faster
> than the first.
>
> Anyone want to comment on this?

I have noticed this also. It has to do with garbage collection processes. To
compensate for this phenomenon, I have styled the tests using fake calls to
the routines before timing them, which seems to trick the garbage
collection. To demonstrate, I swapped the order of the routines around and
still get the same sort of results.
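
A minimal sketch of that kind of test, for anyone who wants to try it. The
names work() and time_it() are hypothetical stand-ins, not the routines from
this thread, and whether garbage collection is really the cause is only my
guess above. The idea is simply to make one untimed "fake" pass before the
timed ones.

--------------
-- Hypothetical stand-in for the routine being benchmarked
function work(sequence s)
    return s & s
end function

-- Time 'reps' calls of work() and return the elapsed seconds
function time_it(integer reps)
    atom t0
    sequence junk
    t0 = time()
    for i = 1 to reps do
        junk = work("Some sample text")
    end for
    return time() - t0
end function

atom warmup, pass1, pass2
warmup = time_it(1000)    -- fake pass, result discarded
pass1 = time_it(100000)   -- timed passes
pass2 = time_it(100000)
printf(1, "pass 1: %g s, pass 2: %g s\n", {pass1, pass2})
--------------

Swapping the order of the timed passes is a quick way to check that the
difference is not just an effect of which one runs first.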

However, I have been reminded never to write code while having breakfast
and getting the kids ready for club tennis competitions - there is a
major mistake in the routine I sent out. It doesn't return any text after
the last match! Oops. Below is the corrected routine.

--------------
include wildcard.e  -- lower(); in Euphoria 4.x this lives in std/text.e

-- Assumption: All sequences contain ASCII text.
-- That is, each element is an integer from 0 to 255.
global function dparse(sequence text, sequence esc, sequence replace)
    sequence t, ltext
    integer pos,  -- position of esc in text
            epos, -- end of the copied text in t
            spos, -- position in t to copy to
            ppos  -- position in text just past the previous match

    -- Nothing to do for an empty escape string
    if length(esc) = 0 then
        return text
    end if

    -- Do a case-insensitive compare
    esc = lower(esc)
    ltext = lower(text)
    -- Find the first match
    pos = match(esc, ltext)
    -- If no matches, return the original text
    if pos = 0 then
        return text
    end if

    -- Create a sequence long enough to hold the result
    ppos = length(text) + floor(1 + (length(replace)*length(text)/length(esc)))
    t = repeat(-1, ppos)

    -- Initialise the position holders (Eu sequences are indexed from 1)
    epos = 0
    spos = 1
    ppos = 1

    -- Keep making replacements until no more matches
    while pos do
        -- Copy the original (not lower-cased) text between the previous
        -- match and the current one, if there is any
        if pos > ppos then
            epos = spos + pos - ppos - 1
            t[spos .. epos] = text[ppos .. pos-1]
        end if

        -- Skip over the escape text
        ppos = pos + length(esc)

        -- Hide the whole match by changing it to a non-text value, so the
        -- next search can neither find it again nor overlap with it.
        ltext[pos .. ppos-1] = -1

        -- Copy the replacement text into the result
        spos = epos + 1
        epos = epos + length(replace)
        t[spos .. epos] = replace

        -- Skip over the replacement text in the result area
        spos = epos + 1

        -- Look for the next match
        pos = match(esc, ltext)
    end while

    -- Copy any left over text
    if ppos <= length(text) then
        pos = length(text)
        epos = spos + pos - ppos
        t[spos .. epos] = text[ppos .. pos]
    end if
    -- Return only the copied text, as there could be some residual
    -- junk in the result area.
    return t[1..epos]
end function

--------------

Some more observations, if I may.

The recursive solutions presented so far have a problem when the replacement
text contains the escape text: they go into a never-ending loop.

They choke on this sort of thing:
   res = parse("one$Atwo", "$A", "three$Afour")
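
For comparison, here is a minimal single-pass sketch that cannot loop on that
input, because it only ever searches the part of the text it has not yet
consumed and never rescans the replacement. The name replace_all() is
hypothetical, and unlike dparse() this version is case-sensitive and makes no
attempt to be fast.

--------------
function replace_all(sequence text, sequence esc, sequence rep)
    sequence result
    integer pos

    -- An empty escape string can never be replaced
    if length(esc) = 0 then
        return text
    end if

    result = ""
    pos = match(esc, text)
    while pos do
        -- Keep the text before the match, then append the replacement
        result = result & text[1 .. pos-1] & rep
        -- Drop everything up to and including the match; only the
        -- remaining tail is searched on the next iteration
        text = text[pos + length(esc) .. length(text)]
        pos = match(esc, text)
    end while

    -- Whatever is left contains no more matches
    return result & text
end function

sequence res
res = replace_all("one$Atwo", "$A", "three$Afour")
puts(1, res & "\n")   -- prints onethree$Afourtwo
--------------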

Secondly, these sorts of routines would go a lot faster and be coded a lot
more simply if RDS provided variations of match() and find() that allowed
one to specify the starting offset.  Such as ...
   pos = match_from(",", "abc,def,ghi,klm", 5)
would start at the fifth element and return 8.
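
In the meantime, the behaviour (though not the speed) can be sketched in user
code. This is only an illustration of the semantics I have in mind, not
anything RDS provides: it builds match_from() on top of match() by slicing off
the tail, which copies the tail and so gives none of the speed benefit a
built-in version could.

--------------
-- Hypothetical user-level match_from(): search haystack for needle,
-- starting at index 'start'. Returns the index of the match in the
-- whole of haystack, or 0 if there is none.
function match_from(sequence needle, sequence haystack, integer start)
    integer pos

    if start < 1 then
        start = 1
    end if
    if length(needle) = 0 or start > length(haystack) then
        return 0
    end if

    -- Search only the tail, then convert back to a full-sequence index
    pos = match(needle, haystack[start .. length(haystack)])
    if pos then
        pos = pos + start - 1
    end if
    return pos
end function

integer pos
pos = match_from(",", "abc,def,ghi,klm", 5)
printf(1, "%d\n", pos)   -- prints 8
--------------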

Finally, it would be nice if RDS could treat zero-length slices as a NOP,
regardless of what the actual start and end pos values are.
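
Until then, a small wrapper is one way to get that behaviour when reading a
slice. The name safe_slice() is hypothetical, and this sketch only covers
fetching a slice, not assigning to one.

--------------
-- Hypothetical helper: return s[first .. last], but treat any
-- zero-length (or negative-length) request as an empty sequence
-- without looking at the actual index values.
function safe_slice(sequence s, integer first, integer last)
    if last < first then
        return {}
    end if
    return s[first .. last]
end function

sequence part
part = safe_slice("abc", 0, -1)   -- plain "abc"[0..-1] would be an error
printf(1, "%d\n", length(part))   -- prints 0
--------------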


---
Derek.
