Re: Parsing Challenge


c.k.lester wrote:
> 
> Kat wrote:
> > 
> > getxml() in strtok.
> 
> How would I handle the divs with class and id attributes? Is something like
> the following valid?
> 
> s = getxml("<div","/",0)

The short answer is "no", mostly because getxml() was written when these
tags had no parameter fields like <div id="doc3" class="yui-t2">. You don't
care about the contents between <div*> and </div> anyhow, and I don't want to
stop everything right now and recode it (possibly breaking code that works as
is) to handle params, so...

So the answer to the problem becomes: use parses() in strtok.e v3, which
isn't released yet (but see below). With it, do

result = parses(yourhtmlfile,"<div")

and the start of each nested sequence in result ({"","","",etc}) will be the
parameters of the <div*> tag, plus extra stuff you can ignore. I do this for
finding "href" all day now.

For your example, the first seq will be:
" id="doc3" class="yui-t2">\n   "
the 2nd one will be:
" id="hd">\n       "
and so on. Use a for loop from 1 to length(result) and, for each loop value, get

params = result[loop][1..match(">",result[loop])-1] 

This way, you can specify which tag name (<div*, <table*, etc.) you want the
param list for, as well as which occurrence, result[x]. If the file does not
start with the tag you are looking for, then the first param list for that tag
starts at result[2].
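
Since parses() isn't released, here is a minimal self-contained sketch of the same idea. The names split_on() and tag_params() are made up for illustration and are not part of strtok.e; split_on() is only a stand-in that splits on a single tag string. It also guards against pieces that contain no ">", where the slice above would fail.

```euphoria
-- Hypothetical stand-in for parses(): split text at every occurrence of tag.
-- The first element is whatever precedes the first tag (possibly empty).
function split_on(sequence text, sequence tag)
    sequence pieces
    integer place
    pieces = {}
    place = match(tag, text)
    while place do
        pieces = append(pieces, text[1..place-1])
        text = text[place+length(tag)..length(text)]
        place = match(tag, text)
    end while
    return append(pieces, text)
end function

-- Collect the parameter text of every occurrence of tag.
function tag_params(sequence text, sequence tag)
    sequence result, params
    integer gt
    result = split_on(text, tag)
    params = {}
    for i = 2 to length(result) do      -- result[1] precedes the first tag
        gt = match(">", result[i])
        if gt then                      -- skip pieces with no closing '>'
            params = append(params, result[i][1..gt-1])
        end if
    end for
    return params
end function

sequence html, found
html = "<div id=\"doc3\" class=\"yui-t2\">\n   <div id=\"hd\">\n   </div></div>"
found = tag_params(html, "<div")
for i = 1 to length(found) do
    puts(1, found[i] & "\n")
end for
```

In this sketch the piece before the first tag is always result[1], even when it is empty, so the loop starts at 2. The real parses() may differ on that point.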

global function find_alls(sequence sep, sequence text) -- Kat July 9 2005
-- return a list of {index, match} pairs for every place a separator in sep is found in text
sequence list, found
atom place, step, whatlen -- place must be atom for "while" line

trace(1)

   list = {}
   for loop = 1 to length(sep) do
      place = match(sep[loop],text)
      while place do
        found = text[place..place+length(sep[loop])-1]
        list = append(list,{place,found})
        text[place] = ' '
        place = match(sep[loop],text)
      end while
   end for

   list = strtoksort(list)

   return list
end function
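
The scanning trick in find_alls() can be tried on its own. What follows is a sketch only: all_hits() is a made-up name, it assumes the separators are passed as a list of strings, and it leaves out the final strtoksort() step because that routine isn't shown here.

```euphoria
-- Sketch: record {position, separator} for every hit, blanking the first
-- character of each hit so the next match() call moves past it.
-- Assumes no separator starts with a space; blanking would then change
-- nothing and the while loop would never end.
function all_hits(sequence seps, sequence text)
    sequence hits
    integer place
    hits = {}
    for i = 1 to length(seps) do
        place = match(seps[i], text)
        while place do
            hits = append(hits, {place, seps[i]})
            text[place] = ' '
            place = match(seps[i], text)
        end while
    end for
    return hits
end function

sequence demo
demo = all_hits({"<div", "<p"}, "<div><p>x</p><div>")
-- demo is grouped by separator, not yet ordered by position
```

The hits come back grouped by separator, which is presumably why find_alls() sorts the list before returning it.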



------------------------------------------------------------------

global function parses(sequence s, sequence c) -- Kat July 9 2005
atom lenc, index
sequence parsed, where
object junk
atom keep , ci

if equal(s,"") then return "" end if
if equal(c,"") then return "" end if
keep = 0
ci = 0
if sequence(c) and length(c) = 2 then
  if match("i",c[1]) then
    ci = 1
  end if
  if match("k",c[1]) then
    keep = 1
  end if
  if ci or keep then
    c = c[2]
  end if
end if

if (length(c) >= length(s)) then return "" end if

parsed = {}
lenc = length(c)

if ci
  then where = find_allsci(c,s)
  else where = find_alls(c,s)
end if

if keep then
  if where[1][1] = 1
    then lastkeep = 2
    else lastkeep = 1
  end if
end if

where = {{1,""}} & where & {{length(s)+1,""}}
trace(1)

if equal(where,"") then return{s} end if

if keep then
  for loop = 2 to length(where)-1 do
    lastseparators = lastseparators & {where[loop][2]}
  end for
  for loop = 1 to length(where)-1 do
    parsed = parsed & {
      s[where[loop][1]+length(where[loop][2])..where[loop+1][1]-1] }
    --& {where[loop+1][2]}
    --if loop = length(where)-2 then exit end if
  end for
else
  for loop = 1 to length(where)-1 do
    parsed &=
      {s[where[loop][1]+length(where[loop][2])..where[loop+1][1]-1]}
  end for
end if

return parsed
end function --parses()


Add the global vars (lastkeep, lastseparators) and the routines not shown here
(find_allsci(), strtoksort()) as needed, and remove the trace(1) calls, etc.

Yep, July 9, 2005. I've been holding onto it ever since, not wanting to get
flamed for releasing it.

Kat
