Re: Parsing Challenge
- Posted by Kat <KAT12 at ?oosahs?net> May 06, 2008
c.k.lester wrote:
>
> Kat wrote:
> >
> > getxml() in strtok.
>
> How would I handle the divs with class and id attributes? Is something like
> the following valid?
>
> s = getxml("<div","/",0)

The short answer is "no", but mostly because getxml() was written when these tags had no parameter fields like <div id="doc3" class="yui-t2">. You don't care about the contents of <div*<div> anyhow, and I don't want to stop everything right now and recode it to handle params (possibly breaking code that works as is).

So the answer to the problem becomes: use parses() in strtok.e v3, which isn't released (but see below). With it, do

result = parses(yourhtmlfile,"<div")

and the start of each nested sequence in result ({"","","",etc}) will be the parameters of the <div*> tag, plus extra stuff you can ignore. I do this for finding "href" all day now. For your example, the first seq will be:

" id="doc3" class="yui-t2">\n "

the 2nd one will be:

" id="hd">\n "

etc. Then use a for loop from 1 to length(result) to get

params = result[loop][1..match(">",result[loop])-1]

This way, you can spec which tag name (<div*, <table*, <etc*) you want the param list for, as well as which occurrence, result[x]. If the file does not start with the tag you are looking for, then obviously the first param list for that tag starts at result[2].
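The loop described above might look roughly like this. It is only a sketch: result is filled in by hand with the two example sequences quoted above rather than by calling parses(), the variable name gt is made up for illustration, and the check for a missing ">" is an added precaution that is not part of the original description.

```euphoria
-- Sketch only: "result" is hand-written here to stand in for what
-- parses(yourhtmlfile,"<div") is described as returning.
sequence result, params
integer gt

result = {" id=\"doc3\" class=\"yui-t2\">\n ",
          " id=\"hd\">\n "}

for loop = 1 to length(result) do
    gt = match(">", result[loop])   -- position of the tag's closing ">"
    if gt then                      -- skip pieces that contain no ">"
        params = result[loop][1..gt-1]
        puts(1, params & '\n')      -- the raw attribute text of this tag
    end if
end for
```

For the two pieces above, params comes out as ` id="doc3" class="yui-t2"` and then ` id="hd"`, each with its leading space still attached.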
global function find_alls(sequence sep, sequence text)
-- Kat July 9 2005
-- return a list of all indexes where what is found in s
sequence list, found
atom place, step, whatlen -- place must be atom for "while" line

trace(1)

list = {}
for loop = 1 to length(sep) do
    place = match(sep[loop],text)
    while place do
        found = text[place..place+length(sep[loop])-1]
        list = append(list,{place,found})
        text[place] = ' '
        place = match(sep[loop],text)
    end while
end for

list = strtoksort(list)
return list
end function

------------------------------------------------------------------

global function parses(sequence s, sequence c)
-- Kat July 9 2005
atom lenc, index
sequence parsed, where
object junk
atom keep , ci

if equal(s,"") then return "" end if
if equal(c,"") then return "" end if

keep = 0
ci = 0
if sequence(c) and length(c) = 2 then
    if match("i",c[1]) then ci = 1 end if
    if match("k",c[1]) then keep = 1 end if
    if ci or keep then c = c[2] end if
end if

if (length(c) >= length(s)) then return "" end if

parsed = {}
lenc = length(c)

if ci
    then where = find_allsci(c,s)
    else where = find_alls(c,s)
end if

if keep then
    if where[1][1] = 1
        then lastkeep = 2
        else lastkeep = 1
    end if
end if

where = {{1,""}} & where & {{length(s)+1,""}}

trace(1)

if equal(where,"") then return{s} end if

if keep then
    for loop = 2 to length(where)-1 do
        lastseparators = lastseparators & {where[loop][2]}
    end for
    for loop = 1 to length(where)-1 do
        parsed = parsed & { s[where[loop][1]+length(where[loop][2])..where[loop+1][1]-1] } --& {where[loop+1][2]}
        --if loop = length(where)-2 then exit end if
    end for
else
    for loop = 1 to length(where)-1 do
        parsed &= {s[where[loop][1]+length(where[loop][2])..where[loop+1][1]-1]}
    end for
end if

return parsed
end function --parses()
Add the global vars and such as needed, and remove the trace(1), etc.

Yep, July 9, 2005. I've been holding onto it ever since then, not wanting to get flamed for releasing it.

Kat