Re: Parsing Challenge
c.k.lester wrote:
>
> Kat wrote:
> >
> > getxml() in strtok.
>
> How would I handle the divs with class and id attributes? Is something like
> the following valid?
>
>   s = getxml("<div","/",0)

The short answer is "no", mostly because getxml() was written when these
tags had no parameter fields like <div id="doc3" class="yui-t2">. You don't
care about the contents of <div*<div> anyhow, and I don't want to stop everything
right now and recode it (possibly breaking code that works as is) to handle
params, so...

The answer to the problem becomes: use parses() in strtok.e v3, which
isn't released yet (but see below). In it, do

    result = parses(yourhtmlfile,"<div")

and the start of each nested sequence in result ({"","","",etc}) will be the
parameters of the <div*> tag, plus extra stuff you can ignore. I do this for
finding "href" all day now.

For your example, the first seq will be:
    " id="doc3" class="yui-t2">\n "
the 2nd one will be:
    " id="hd">\n "
etc., and you loop with "for loop = 1 to length(result)" to get
    params = result[loop][1..match(">",result[loop])-1]

This way, you can spec which tag name, <div*, <table*, <etc* that you want the
param list for, as well as which one, result[x]. If the file does not start with
the tag you are looking for, then obviously the first param list for that tag
starts at result[2].
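
As a rough sketch of that loop: the snippet below does not call parses() at all. It assumes result already holds the two chunks quoted above (typed in by hand as a stand-in), and the names all_params and gt are made up for illustration. It only shows the slice up to the first ">", with a guard for chunks that have no ">" in them.

```euphoria
-- Hypothetical stand-in for the value parses(yourhtmlfile,"<div") is
-- described as returning; the two literals are the chunks quoted above.
sequence result, all_params
integer gt

result = {
    " id=\"doc3\" class=\"yui-t2\">\n ",
    " id=\"hd\">\n "
}
all_params = {}

for i = 1 to length(result) do
    gt = match(">", result[i])
    if gt then  -- skip any chunk that has no closing '>'
        -- keep everything before the first '>'
        all_params = append(all_params, result[i][1..gt-1])
    end if
end for

-- print each parameter list on its own line
for i = 1 to length(all_params) do
    puts(1, all_params[i] & "\n")
end for
```

The guard matters because match() returns 0 when ">" is not found, and a slice ending at -1 would then fail.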

global function find_alls(sequence sep, sequence text) -- Kat July 9 2005
    -- return a list of {index, match} pairs, one for each place where
    -- one of the separators in sep is found in text
    sequence list, found
    atom place, step, whatlen -- place must be atom for "while" line

    trace(1)
    list = {}

    for loop = 1 to length(sep) do
        place = match(sep[loop],text)
        while place do
            found = text[place..place+length(sep[loop])-1]
            list = append(list,{place,found})
            text[place] = ' ' -- blank the first char so this hit is not matched again
            place = match(sep[loop],text)
        end while
    end for

    list = strtoksort(list)
    return list
end function
------------------------------------------------------------------
global function parses(sequence s, sequence c) -- Kat July 9 2005
    -- lastkeep and lastseparators are globals you need to declare;
    -- find_allsci() is not shown here
    atom lenc, index
    sequence parsed, where
    object junk
    atom keep, ci

    if equal(s,"") then return "" end if
    if equal(c,"") then return "" end if

    keep = 0
    ci = 0

    -- c may be {flags, separators}: "i" = case insensitive, "k" = keep separators
    if sequence(c) and length(c) = 2 then
        if match("i",c[1]) then
            ci = 1
        end if
        if match("k",c[1]) then
            keep = 1
        end if
        if ci or keep then
            c = c[2]
        end if
    end if

    if (length(c) >= length(s)) then return "" end if

    parsed = {}
    lenc = length(c)

    if ci then
        where = find_allsci(c,s)
    else
        where = find_alls(c,s)
    end if

    -- no separators found: return the whole string
    -- (checked here, before where[1] is indexed and before the sentinels are added)
    if equal(where,"") then return {s} end if

    if keep then
        if where[1][1] = 1 then
            lastkeep = 2
        else
            lastkeep = 1
        end if
    end if

    -- add sentinels for the start and end of s
    where = {{1,""}} & where & {{length(s)+1,""}}
    trace(1)

    if keep then
        for loop = 2 to length(where)-1 do
            lastseparators = lastseparators & {where[loop][2]}
        end for
        for loop = 1 to length(where)-1 do
            parsed = parsed & {
                s[where[loop][1]+length(where[loop][2])..where[loop+1][1]-1] }
            --& {where[loop+1][2]}
            --if loop = length(where)-2 then exit end if
        end for
    else
        for loop = 1 to length(where)-1 do
            parsed &=
                {s[where[loop][1]+length(where[loop][2])..where[loop+1][1]-1]}
        end for
    end if

    return parsed
end function --parses()

Add the global vars and such as needed, and remove the trace(1), etc.

Yep, July 9, 2005. I've been holding onto it ever since then, not wanting to
get flamed for releasing it.

Kat