OpenEuphoria: Forum: Parsing Challenge

1. Parsing Challenge

Posted by c.k.lester <euphoric at ck?ester?com> May 05, 2008
620 views
Last edited May 06, 2008

A call for help. I have the following HTML code:

<div id="doc3" class="yui-t2">
	<div id="hd">
		<h1>YUI: CSS Grid Builder</h1>
	</div>
	<div id="bd">
		<div id="yui-main">
			<div class="yui-b">
				<div class="yui-g">
					<!-- YOUR DATA GOES HERE -->
				</div>
			</div>
		</div>
		<div class="yui-b">
			<!-- YOUR NAVIGATION GOES HERE -->
		</div>
	</div>
	<div id="ft">Footer is here.</div>
</div>


I need to parse (extract) the class and ID of each DIV so I can create
database records based on the structure. I'm pretty sure there's a lib
already available to help me do this in the shortest possible time, right? :)

If not, maybe somebody with excellent parser coding skills can give a nice
guy a little hand. or two.

What I want to be able to do is take HTML code generated by the YUI Grids
Builder and paste it into BBCMF so it will build the page for me/you/us.

Thanks!


2. Re: Parsing Challenge

Posted by Jeremy Cowgar <jeremy at c?wgar?com> May 05, 2008
633 views
Last edited May 06, 2008

Here's quick code w/little error checking, but it works fine on your example
HTML.

include sequence.e
include search.e
include file.e

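function get_attr(sequence data, sequence attr, integer start)
    -- Return the value of attribute attr in the tag that begins at start,
    -- or "" if that tag does not have the attribute.
    integer stop, max_stop

    max_stop = find_from('>', data, start)       -- end of this tag
    start = match_from(attr & "=", data, start)
    if start = 0 or start > max_stop then
        return ""
    end if

    -- The value ends at the next space or '>', so values containing
    -- spaces are cut short.
    stop = find_any_from(" >", data, start)

    return trim(data[start+length(attr)+1..stop-1], "\"")
end function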

sequence data, matches
data = read_file("ex.html")

matches = match_all("<div", data, 1)

for i = 1 to length(matches) do
    puts(1, "   Id: " & get_attr(data, "id", matches[i]) & "\n")
    puts(1, "Class: " & get_attr(data, "class", matches[i]) & "\n\n")
end for


It requires the new standard library, but you can grab search.e, sequence.e and
file.e from

http://rapideuphoria.svn.sourceforge.net/svnroot/rapideuphoria/trunk/include/

--
Jeremy Cowgar
http://jeremy.cowgar.com


3. Re: Parsing Challenge

Posted by Bernie Ryan <xotron at b?uefrog.?om> May 05, 2008
599 views
Last edited May 06, 2008

c.k.lester wrote:
> 
> A call for help. I have the following HTML code:
> 
> <div id="doc3" class="yui-t2">
> 	<div id="hd">
> 		<h1>YUI: CSS Grid Builder</h1>
> 	</div>
> 	<div id="bd">
> 		<div id="yui-main">
> 			<div class="yui-b">
> 				<div class="yui-g">
> 					<!-- YOUR DATA GOES HERE -->
> 				</div>
> 			</div>
> 		</div>
> 		<div class="yui-b">
> 			<!-- YOUR NAVIGATION GOES HERE -->
> 		</div>
> 	</div>
> 	<div id="ft">Footer is here.</div>
> </div>
> 
> I need to parse (extract) the class and ID of each DIV so I can create
> database records based on the structure. I'm pretty sure there's a lib
> already available to help me do this in the shortest possible time, right? :)
> 
> If not, maybe somebody with excellent parser coding skills can give a nice
> guy a little hand. or two.
> 
> What I want to be able to do is take HTML code generated by the YUI Grids
> Builder and paste it into BBCMF so it will build the page for me/you/us.
> 

CK:

Download the mixedlib.e library from the archive; it has functions for
parsing. It's listed under Utility Library C-functions.

Then you can try this code, which could be written in many different ways;
the library is very flexible.

Note that I am using just one function here, strtok, which is the same as
the one used in C.

include mixedlib.e

-- Read the whole file into one buffer.
integer fn
sequence buffer
object line
fn = open("Doc.htm", "rb")
buffer = {}
line = gets(fn)
while sequence(line) do
    buffer &= line
    line = gets(fn)
end while
close(fn)

-- Split on spaces and '>' with strtok, as in C, and print the
-- tokens that contain an id or class attribute.
atom token
sequence s
token = strtok(sz(buffer), sz(" >"))
while token do
    s = str2seq(token)
    if match("id=", s) or match("class=", s) then
        puts(1, s & "\n")
    end if
    token = strtok(0, sz(" >"))
end while




Bernie

My files in archive:
WMOTOR, XMOTOR, W32ENGIN, MIXEDLIB, EU_ENGIN, WIN32ERU, WIN32API 

Can be downloaded here:
http://www.rapideuphoria.com/cgi-bin/asearch.exu?dos=on&win=on&lnx=on&gen=on&keywords=bernie+ryan


4. Re: Parsing Challenge

Posted by Kat <KAT12 at coosahs.n??> May 05, 2008
617 views
Last edited May 06, 2008

c.k.lester wrote:
> 
> A call for help. I have the following HTML code:
> 
> <div id="doc3" class="yui-t2">
> 	<div id="hd">
> 		<h1>YUI: CSS Grid Builder</h1>
> 	</div>
> 	<div id="bd">
> 		<div id="yui-main">
> 			<div class="yui-b">
> 				<div class="yui-g">
> 					<!-- YOUR DATA GOES HERE -->
> 				</div>
> 			</div>
> 		</div>
> 		<div class="yui-b">
> 			<!-- YOUR NAVIGATION GOES HERE -->
> 		</div>
> 	</div>
> 	<div id="ft">Footer is here.</div>
> </div>
> 
> I need to parse (extract) the class and ID of each DIV so I can create
> database records based on the structure. I'm pretty sure there's a lib
> already available to help me do this in the shortest possible time, right? :)
> 
> If not, maybe somebody with excellent parser coding skills can give a nice
> guy a little hand. or two.
> 
> What I want to be able to do is take HTML code generated by the YUI Grids
> Builder and paste it into BBCMF so it will build the page for me/you/us.
> 
> Thanks!

getxml() in strtok.

Kat


5. Re: Parsing Challenge

Posted by c.k.lester <euphoric at ?klester.?om> May 06, 2008
603 views

Kat wrote:
> 
> getxml() in strtok.

How would I handle the divs with class and id attributes? Is something like
the following valid?

s = getxml("<div","/",0)


6. Re: Parsing Challenge

Posted by c.k.lester <euphoric at ckle?ter.co?> May 06, 2008
586 views

Jeremy Cowgar wrote:
> 
> Here's quick code w/little error checking, but it works fine on your example
> HTML.

Jeremy, using your code, how would I determine the hierarchy of the tags
(parents and children)?


7. Re: Parsing Challenge

Posted by c.k.lester <euphoric at c??ester.com> May 06, 2008
575 views

Bernie Ryan wrote:
> 
> Download mixedlib.e library in archive it has function for
> parsing.

Bernie, same question for you. :)

How would I determine the hierarchy of the tags (parents, children)? I need
to know the nesting in order to build a database of the child/parent
structure.


8. Re: Parsing Challenge

Posted by Jeremy Cowgar <jeremy at cow??r.com> May 06, 2008
631 views

c.k.lester wrote:
> 
> Jeremy Cowgar wrote:
> > 
> > Here's quick code w/little error checking, but it works fine on your example
> > HTML.
> 
> Jeremy, using your code, how would I determine the hierarchy of the tags
> (parents and children)?

You didn't tell us you needed that. You may want to check out stack.e in 4.0:
push onto the stack when a new div is hit, and pop off the stack when a /div
is hit.
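
The push/pop idea can be sketched without stack.e by using a plain sequence as the stack. This is only a rough illustration: the names starts_with_at and div_parents are made up for this example, the code assumes every <div has a matching </div>, and it would be fooled by other tags that begin with "<div" or by div tags inside comments.

```euphoria
-- Hypothetical helper: true if tag occurs in data starting exactly at pos.
function starts_with_at(sequence data, sequence tag, integer pos)
    if pos + length(tag) - 1 > length(data) then
        return 0
    end if
    return equal(data[pos..pos+length(tag)-1], tag)
end function

-- Hypothetical helper: number the DIVs in document order and return
-- {{div_number, parent_number}, ...}; a parent of 0 means top level.
function div_parents(sequence data)
    sequence stack, result
    integer count, parent
    stack = {}
    result = {}
    count = 0
    for i = 1 to length(data) do
        if starts_with_at(data, "<div", i) then
            count += 1
            if length(stack) then
                parent = stack[length(stack)]      -- innermost open div
            else
                parent = 0
            end if
            result = append(result, {count, parent})
            stack = append(stack, count)           -- push
        elsif starts_with_at(data, "</div", i) then
            if length(stack) then
                stack = stack[1..length(stack)-1]  -- pop
            end if
        end if
    end for
    return result
end function

-- Small made-up sample, not the YUI page itself.
sequence tree
tree = div_parents("<div id=\"a\"><div id=\"b\"></div></div>")
for i = 1 to length(tree) do
    printf(1, "div %d -> parent %d\n", tree[i])
end for
```

Since this scan and match_all("<div", data, 1) both walk the text from start to end, entry N here should line up with matches[N], so the id and class found by get_attr could be stored next to the parent number.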

--
Jeremy Cowgar
http://jeremy.cowgar.com


9. Re: Parsing Challenge

Posted by Shawn Pringle <shawn.pringle at ?mail.com> May 06, 2008
615 views

The better question is: how can you include and call Perl modules
from EUPHORIA? It is probably less work than reimplementing one or
two Perl modules in EUPHORIA.

Shawn Pringle


10. Re: Parsing Challenge

Posted by Kat <KAT12 at ?oosahs?net> May 06, 2008
603 views

c.k.lester wrote:
> 
> Kat wrote:
> > 
> > getxml() in strtok.
> 
> How would I handle the divs with class and id attributes? Is something like
> the following valid?
> 
> s = getxml("<div","/",0)

The short answer is "no", mostly because getxml() was written when these
tags had no parameter fields like <div id="doc3" class="yui-t2">. You don't
care about the contents of <div*<div> anyhow, and I don't want to stop everything
right now and recode it to handle params (possibly breaking code that works
as is), so...

The answer to the problem becomes: use parses() in strtok.e v3, which
isn't released (but see below). In it, do

result = parses(yourhtmlfile,"<div")

and the start of each nested sequence in result ({"","","",etc}) will be the
parameters of the <div*> tag, plus extra stuff you can ignore. I do this for
finding "href" all day now.

For your example, the first seq will be:
" id="doc3" class="yui-t2">\n   "
the 2nd one will be:
" id="hd">\n       "
and so on. Then use a for loop from 1 to length(result) to get

params = result[loop][1..match(">",result[loop])-1] 

This way, you can specify which tag name (<div*, <table*, <etc*) you want the
param list for, as well as which occurrence, result[x]. If the file does not
start with the tag you are looking for, then obviously the first param list for
that tag starts at result[2].
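
As a rough sketch of that loop: the sequence below is hand-written stand-in data shaped like the pieces described above, not real output from parses(), and tag_params is a made-up helper name. The loop variable is called i because newer Euphoria versions reserve the word loop.

```euphoria
-- Hypothetical helper: everything before the first '>' in one piece,
-- or "" when the piece has no '>' or starts with it.
function tag_params(sequence piece)
    integer gt
    gt = match(">", piece)
    if gt > 1 then
        return piece[1..gt-1]
    end if
    return ""
end function

-- Stand-in for result = parses(yourhtmlfile,"<div")
sequence result, params
result = {
    " id=\"doc3\" class=\"yui-t2\">\n\t",
    " id=\"hd\">\n\t\t<h1>YUI: CSS Grid Builder</h1>\n\t</div>\n\t",
    ">\n"    -- a bare <div> with no attributes
}

for i = 1 to length(result) do
    params = tag_params(result[i])
    if length(params) then
        puts(1, params & "\n")
    end if
end for
```

Any text that comes before the first <div would also go through this helper, which is one more reason to start the loop at result[2] when the file does not begin with the tag.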

global function find_alls(sequence sep, sequence text) -- Kat July 9 2005
-- return a list of {index, matched text} pairs, one for each place an
-- element of sep is found in text
sequence list, found
atom place, step, whatlen -- place must be atom for "while" line

trace(1)

   list = {}
   for loop = 1 to length(sep) do
      place = match(sep[loop],text)
      while place do
        found = text[place..place+length(sep[loop])-1]
        list = append(list,{place,found})
        text[place] = ' '
        place = match(sep[loop],text)
      end while
   end for

   list = strtoksort(list)

   return list
end function



------------------------------------------------------------------

global function parses(sequence s, sequence c) -- Kat July 9 2005
atom lenc, index
sequence parsed, where
object junk
atom keep , ci

if equal(s,"") then return "" end if
if equal(c,"") then return "" end if
keep = 0
ci = 0
if sequence(c) and length(c) = 2 then
  if match("i",c[1]) then
    ci = 1
  end if
  if match("k",c[1]) then
    keep = 1
  end if
  if ci or keep then
    c = c[2]
  end if
end if

if (length(c) >= length(s)) then return "" end if

parsed = {}
lenc = length(c)

if ci
  then where = find_allsci(c,s)
  else where = find_alls(c,s)
end if

if keep then
  if where[1][1] = 1
    then lastkeep = 2
    else lastkeep = 1
  end if
end if

where = {{1,""}} & where & {{length(s)+1,""}}
trace(1)

if equal(where,"") then return{s} end if

if keep
  then  for loop = 2 to length(where)-1 do
          lastseparators = lastseparators & {where[loop][2]}
        end for
        for loop = 1 to length(where)-1 do
          parsed = parsed & {
            s[where[loop][1]+length(where[loop][2])..where[loop+1][1]-1] }
          --& {where[loop+1][2]}
          --if loop = length(where)-2 then exit end if
        end for

  else  for loop = 1 to length(where)-1 do
          parsed &=
            {s[where[loop][1]+length(where[loop][2])..where[loop+1][1]-1]}
        end for
end if

return parsed
end function --parses()


Add the global vars and such as needed, and remove the trace(1), etc.

Yep, July 9, 2005. I've been holding onto it ever since, not wanting to get
flamed for releasing it.

Kat
