1. Parsing Challenge
- Posted by c.k.lester <euphoric at ck?ester?com> May 05, 2008
- 620 views
- Last edited May 06, 2008
A call for help. I have the following HTML code:
<div id="doc3" class="yui-t2">
  <div id="hd">
    <h1>YUI: CSS Grid Builder</h1>
  </div>
  <div id="bd">
    <div id="yui-main">
      <div class="yui-b">
        <div class="yui-g">
          <!-- YOUR DATA GOES HERE -->
        </div>
      </div>
    </div>
    <div class="yui-b">
      <!-- YOUR NAVIGATION GOES HERE -->
    </div>
  </div>
  <div id="ft">Footer is here.</div>
</div>

I need to parse (extract) the class and ID of each DIV so I can create database records based on the structure. I'm pretty sure there's a lib already available to help me do this in the shortest possible time, right? :)

If not, maybe somebody with excellent parser coding skills can give a nice guy a little hand. or two.

What I want to be able to do is take HTML code generated by the YUI Grids Builder and paste it into BBCMF so it will build the page for me/you/us.

Thanks!
2. Re: Parsing Challenge
- Posted by Jeremy Cowgar <jeremy at c?wgar?com> May 05, 2008
- 634 views
- Last edited May 06, 2008
Here's quick code w/little error checking, but it works fine on your example HTML.
include sequence.e
include search.e
include file.e

-- Return the value of attribute attr in the tag starting at start,
-- or "" if that tag does not carry the attribute.
function get_attr(sequence data, sequence attr, integer start)
    integer stop, max_stop

    max_stop = find_from('>', data, start)
    start = match_from(attr & "=", data, start)
    if start = 0 or start > max_stop then
        return ""
    end if

    stop = find_any_from(" >", data, start)
    return trim(data[start+length(attr)+1..stop-1], "\"")
end function

sequence data, matches
data = read_file("ex.html")
matches = match_all("<div", data, 1)

for i = 1 to length(matches) do
    puts(1, " Id: " & get_attr(data, "id", matches[i]) & "\n")
    puts(1, "Class: " & get_attr(data, "class", matches[i]) & "\n\n")
end for

It requires the new standard library, but you can grab search.e, sequence.e and file.e from http://rapideuphoria.svn.sourceforge.net/svnroot/rapideuphoria/trunk/include/

--
Jeremy Cowgar
http://jeremy.cowgar.com
3. Re: Parsing Challenge
- Posted by Bernie Ryan <xotron at b?uefrog.?om> May 05, 2008
- 599 views
- Last edited May 06, 2008
c.k.lester wrote:
>
> A call for help. I have the following HTML code:
>
> <div id="doc3" class="yui-t2">
> <div id="hd">
> <h1>YUI: CSS Grid Builder</h1>
> </div>
> <div id="bd">
> <div id="yui-main">
> <div class="yui-b">
> <div class="yui-g">
> <!-- YOUR DATA GOES HERE -->
> </div>
> </div>
> </div>
> <div class="yui-b">
> <!-- YOUR NAVIGATION GOES HERE -->
> </div>
> </div>
> <div id="ft">Footer is here.</div>
> </div>
>
> I need to parse (extract) the class and ID of each DIV so I can create
> database records based on the structure. I'm pretty sure there's a lib
> already available to help me do this in the shortest possible time, right? :)
>
> If not, maybe somebody with excellent parser coding skills can give a nice
> guy a little hand. or two.
>
> What I want to be able to do is take HTML code generated by the YUI Grids
> Builder and paste it into BBCMF so it will build the page for me/you/us.
>

CK:

Download the mixedlib.e library from the archive; it has functions for parsing. It's listed under "Utility Library C-functions". Then you can try this code, which could be written in many different ways because the library is very flexible. Note that I am using just one function here, strtok, which is the same as the one used in C.

include mixedlib.e

integer fn
fn = open("Doc.htm", "rb")
if fn = -1 then
    puts(1, "Cannot open Doc.htm\n")
    abort(1)
end if

-- Read the whole file into one buffer.
sequence buffer
buffer = {}
object line
line = gets(fn)
while sequence(line) do
    buffer &= line
    line = gets(fn)
end while
close(fn)

-- Tokenize on spaces and '>' and print every token holding id= or class=.
atom token
sequence s
token = strtok(sz(buffer), sz(" >"))
while token do
    s = str2seq(token)
    if match("id=", s) or match("class=", s) then
        puts(1, s & "\n")
    end if
    token = strtok(0, sz(" >"))
end while

Bernie

My files in archive: WMOTOR, XMOTOR, W32ENGIN, MIXEDLIB, EU_ENGIN, WIN32ERU, WIN32API
Can be downloaded here: http://www.rapideuphoria.com/cgi-bin/asearch.exu?dos=on&win=on&lnx=on&gen=on&keywords=bernie+ryan
4. Re: Parsing Challenge
- Posted by Kat <KAT12 at coosahs.n??> May 05, 2008
- 617 views
- Last edited May 06, 2008
c.k.lester wrote:
>
> A call for help. I have the following HTML code:
>
> <div id="doc3" class="yui-t2">
> <div id="hd">
> <h1>YUI: CSS Grid Builder</h1>
> </div>
> <div id="bd">
> <div id="yui-main">
> <div class="yui-b">
> <div class="yui-g">
> <!-- YOUR DATA GOES HERE -->
> </div>
> </div>
> </div>
> <div class="yui-b">
> <!-- YOUR NAVIGATION GOES HERE -->
> </div>
> </div>
> <div id="ft">Footer is here.</div>
> </div>
>
> I need to parse (extract) the class and ID of each DIV so I can create
> database records based on the structure. I'm pretty sure there's a lib
> already available to help me do this in the shortest possible time, right? :)
>
> If not, maybe somebody with excellent parser coding skills can give a nice
> guy a little hand. or two.
>
> What I want to be able to do is take HTML code generated by the YUI Grids
> Builder and paste it into BBCMF so it will build the page for me/you/us.
>
> Thanks!

getxml() in strtok.

Kat
5. Re: Parsing Challenge
- Posted by c.k.lester <euphoric at ?klester.?om> May 06, 2008
- 603 views
Kat wrote:
>
> getxml() in strtok.

How would I handle the divs with class and id attributes? Is something like the following valid?

s = getxml("<div","/",0)
6. Re: Parsing Challenge
- Posted by c.k.lester <euphoric at ckle?ter.co?> May 06, 2008
- 587 views
Jeremy Cowgar wrote:
>
> Here's quick code w/little error checking, but it works fine on your example
> HTML.

Jeremy, using your code, how would I determine the hierarchy of the tags (parents and children)?
7. Re: Parsing Challenge
- Posted by c.k.lester <euphoric at c??ester.com> May 06, 2008
- 576 views
Bernie Ryan wrote:
>
> Download mixedlib.e library in archive it has function for
> parsing.

Bernie, same question for you. :) How would I determine the hierarchy of the tags (parents, children)? I need to know the nesting in order to build a database of the child/parent structure.
8. Re: Parsing Challenge
- Posted by Jeremy Cowgar <jeremy at cow??r.com> May 06, 2008
- 631 views
c.k.lester wrote:
>
> Jeremy Cowgar wrote:
> >
> > Here's quick code w/little error checking, but it works fine on your example
> > HTML.
>
> Jeremy, using your code, how would I determine the hierarchy of the tags
> (parents and children)?

You didn't tell us you needed that. You may want to check out stack.e in 4.0: push onto the stack when a new div is hit, and pop off the stack when a /div is hit.

--
Jeremy Cowgar
http://jeremy.cowgar.com
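The push/pop idea can be sketched roughly as follows. This is only an illustration, not Jeremy's code: it uses a plain sequence as the stack instead of stack.e so that it depends on nothing but built-ins, and the names starts_at, tag_text and div_tree are made up for the example. It assumes well-formed HTML in which every <div ...> has a matching </div>, and it does not skip comments or tell <div apart from a longer tag name that begins with the same letters.

```euphoria
-- Sketch only: a plain sequence stands in for stack.e.
-- All function names here are hypothetical, chosen for this example.

-- Return 1 if needle occurs in data starting exactly at position pos.
function starts_at(sequence needle, sequence data, integer pos)
    integer stop_at
    stop_at = pos + length(needle) - 1
    if stop_at > length(data) then
        return 0
    end if
    return equal(data[pos..stop_at], needle)
end function

-- Return the opening tag that starts at pos, up to and including '>'
-- (or up to the end of data if no '>' follows).
function tag_text(sequence data, integer pos)
    integer stop
    stop = pos
    while stop < length(data) and data[stop] != '>' do
        stop += 1
    end while
    return data[pos..stop]
end function

-- Return one {div_number, parent_number, tag_text} entry per <div>.
-- A parent_number of 0 means the div is at the top level.
function div_tree(sequence data)
    sequence open_divs, result
    integer parent
    open_divs = {}   -- numbers of the divs that are currently open
    result = {}
    for i = 1 to length(data) do
        if starts_at("</div", data, i) then
            -- closing tag: pop the innermost open div
            if length(open_divs) > 0 then
                open_divs = open_divs[1..length(open_divs)-1]
            end if
        elsif starts_at("<div", data, i) then
            -- opening tag: its parent is whatever is on top of the stack
            if length(open_divs) > 0 then
                parent = open_divs[length(open_divs)]
            else
                parent = 0
            end if
            result = append(result, {length(result)+1, parent, tag_text(data, i)})
            open_divs = append(open_divs, length(result))
        end if
    end for
    return result
end function

sequence html, tree
html = "<div id=\"a\"><div id=\"b\"></div><div class=\"c\"></div></div>"
tree = div_tree(html)
for i = 1 to length(tree) do
    printf(1, "div %d (parent %d): %s\n", tree[i])
end for
```

Each entry pairs a div with the number of its enclosing div, which is the child/parent relation needed for the database records; the id and class values can then be pulled out of the tag text with something like get_attr() from earlier in the thread.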
9. Re: Parsing Challenge
- Posted by Shawn Pringle <shawn.pringle at ?mail.com> May 06, 2008
- 615 views
The better question is how can you include and call perl modules from EUPHORIA? It probably is less work than reimplementing one or two perl modules in EUPHORIA.

Shawn Pringle
10. Re: Parsing Challenge
- Posted by Kat <KAT12 at ?oosahs?net> May 06, 2008
- 603 views
c.k.lester wrote:
>
> Kat wrote:
> >
> > getxml() in strtok.
>
> How would I handle the divs with class and id attributes? Is something like
> the following valid?
>
> s = getxml("<div","/",0)

The short answer is "no", but mostly because getxml() was written when these tags had no parameter fields like <div id="doc3" class="yui-t2">. And you don't care about the contents of <div*<div> anyhow, and I don't want to stop everything right now and recode it (possibly breaking code that works as is) to handle params, so...

The resulting answer to the problem becomes: use parses() in strtok.e v3, which isn't released (but see below). In it, do

result = parses(yourhtmlfile,"<div")

and the start of each nested sequence in result ({"","","",etc}) will be the parameters of the <div*> tag, plus extra stuff you can ignore. I do this for finding "href" all day now. For your example, the first seq will be:

" id="doc3" class="yui-t2">\n "

the 2nd one will be:

" id="hd">\n "

etc., and you use a for loop = 1 to length(result) to get

params = result[loop][1..match(">",result[loop])-1]

This way, you can spec which tag name (<div*, <table*, <etc*) you want the param list for, as well as which one, result[x]. If the file does not start with the tag you are looking for, then obviously the first param list for that tag starts at result[2].

global function find_alls(sequence sep, sequence text)
-- Kat July 9 2005
-- return a list of all indexes where what is found in s
sequence list, found
atom place, step, whatlen -- place must be atom for "while" line

trace(1)

list = {}
for loop = 1 to length(sep) do
    place = match(sep[loop],text)
    while place do
        found = text[place..place+length(sep[loop])-1]
        list = append(list,{place,found})
        text[place] = ' '
        place = match(sep[loop],text)
    end while
end for

list = strtoksort(list)
return list
end function

------------------------------------------------------------------

global function parses(sequence s, sequence c)
-- Kat July 9 2005
atom lenc, index
sequence parsed, where
object junk
atom keep , ci

if equal(s,"") then return "" end if
if equal(c,"") then return "" end if

keep = 0
ci = 0
if sequence(c) and length(c) = 2 then
    if match("i",c[1]) then ci = 1 end if
    if match("k",c[1]) then keep = 1 end if
    if ci or keep then c = c[2] end if
end if

if (length(c) >= length(s)) then return "" end if

parsed = {}
lenc = length(c)

if ci then
    where = find_allsci(c,s)
else
    where = find_alls(c,s)
end if

if keep then
    if where[1][1] = 1 then
        lastkeep = 2
    else
        lastkeep = 1
    end if
end if

where = {{1,""}} & where & {{length(s)+1,""}}

trace(1)

if equal(where,"") then return{s} end if

if keep then
    for loop = 2 to length(where)-1 do
        lastseparators = lastseparators & {where[loop][2]}
    end for
    for loop = 1 to length(where)-1 do
        parsed = parsed & { s[where[loop][1]+length(where[loop][2])..where[loop+1][1]-1] } --& {where[loop+1][2]}
        --if loop = length(where)-2 then exit end if
    end for
else
    for loop = 1 to length(where)-1 do
        parsed &= {s[where[loop][1]+length(where[loop][2])..where[loop+1][1]-1]}
    end for
end if

return parsed
end function --parses()
Add the global vars and such as needed, and remove the trace(1), etc.. Yep, July 9, 2005. I been holding onto it, not wanting to get flamed for releasing it ever since then. Kat
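For readers without strtok.e v3, the split-then-slice approach described above can be tried with a small stand-in. This is a rough sketch using only built-ins, not Kat's parses(): split_on and tag_params are hypothetical names, the splitter takes a single literal separator, and it has none of the "i"/"k" options. Unlike the description above, the first piece here is always whatever precedes the first tag (possibly empty), so the parameter lists always start at the second piece.

```euphoria
-- Sketch only: a hypothetical stand-in for parses(), limited to one
-- literal, non-empty separator.
function split_on(sequence text, sequence sep)
    sequence pieces
    integer place
    pieces = {}
    place = match(sep, text)
    while place do
        -- keep what precedes the separator, then continue after it
        pieces = append(pieces, text[1..place-1])
        text = text[place+length(sep)..length(text)]
        place = match(sep, text)
    end while
    return append(pieces, text)
end function

-- Return the parameter text of every occurrence of tag, for example
-- tag = "<div". Each result is the text between the tag name and '>'.
function tag_params(sequence html, sequence tag)
    sequence pieces, params
    integer gt
    pieces = split_on(html, tag)
    params = {}
    -- pieces[1] is the text before the first tag, so start at 2
    for i = 2 to length(pieces) do
        gt = find('>', pieces[i])
        if gt then
            params = append(params, pieces[i][1..gt-1])
        end if
    end for
    return params
end function

sequence found
found = tag_params("<div id=\"doc3\" class=\"yui-t2\"><div id=\"hd\"></div></div>", "<div")
for i = 1 to length(found) do
    puts(1, found[i] & "\n")
end for
```

Closing tags are not picked up because "</div>" does not contain the separator "<div". As with the original, this gives the flat list of parameters only; the nesting still has to be tracked separately.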