1. parsing HTML

Can I parse HTML pages with Thomas Parslow's XML library? Will it work,
given that HTML syntax is not as strict as XML? For example, a <P> tag
doesn't need a closing </P> tag.
That XML library would be ideal because it lets you pass it XML data in
parts, and I am not reading the whole HTML page at once but in pieces.

Or is there some other library for parsing HTML pages?
For now I need to get the title of the page and extract all the links on it;
something else might come up later.

Tone Škoda


2. Re: parsing HTML

On 2 Aug 2002, at 17:51, 10963508 at europeonline.com wrote:

> 
> Can I parse HTML pages with Thomas Parslow's XML library? Will it work,
> given that HTML syntax is not as strict as XML? For example, a <P> tag
> doesn't need a closing </P> tag. That XML library would be ideal because it
> lets you pass it XML data in parts, and I am not reading the whole HTML page
> at once but in pieces.
> 
> Or is there some other library for parsing HTML pages?

There is this in strtok (any version):

global function getxml(sequence record, sequence starttag, sequence endtag, integer tagnum)

You don't have to fill in all the parameters; see the comments in the .ew
file, since I didn't document it in the readme. I have used this, pretty much
unchanged, since I first wrote it in 1992 or so in Turbo Pascal.

Kat
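
For readers without strtok at hand, the general idea of pulling out the text between a start tag and an end tag can be sketched as below. This is not Kat's getxml: the behaviour is only guessed from the parameter names in the signature, the sketch is in Python rather than Euphoria, and the function name `text_between` is hypothetical. The real routine in the .ew file may well differ, for example in how it treats omitted parameters or letter case.

```python
def text_between(record, starttag, endtag, tagnum=1):
    """Return the text between the tagnum-th starttag and the next endtag.

    Hypothetical helper: the semantics are assumed, not taken from strtok.
    Matching ignores ASCII letter case; returns "" when no such pair exists.
    """
    lowered = record.lower()
    start_l, end_l = starttag.lower(), endtag.lower()
    pos = 0
    # Step over the first tagnum occurrences of the start tag.
    for _ in range(tagnum):
        found = lowered.find(start_l, pos)
        if found == -1:
            return ""
        pos = found + len(start_l)
    # The wanted text runs up to the next end tag after that point.
    end = lowered.find(end_l, pos)
    if end == -1:
        return ""
    return record[pos:end]


page = ('<html><head><TITLE>My page</TITLE></head><body><P>one<P>'
        '<a href="a.html">A</a> <a href="b.html">B</a></body></html>')
title = text_between(page, "<title>", "</title>")
second_link = text_between(page, '<a href="', '"', 2)
```

Plain string matching like this tolerates unclosed <P> tags because it never checks nesting, but it needs a complete start/end pair in the text it is given, so input that arrives in pieces has to be buffered first.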


3. Re: parsing HTML

Thanks, I'm going to take a look at it now.

----- Original Message -----
From: "Kat" <gertie at PELL.NET>
To: "EUforum" <EUforum at topica.com>
Subject: Re: parsing HTML


> On 2 Aug 2002, at 17:51, 10963508 at europeonline.com wrote:
>
> > Can I parse HTML pages with Thomas Parslow's XML library? Will it work,
> > given that HTML syntax is not as strict as XML? For example, a <P> tag
> > doesn't need a closing </P> tag. That XML library would be ideal because
> > it lets you pass it XML data in parts, and I am not reading the whole
> > HTML page at once but in pieces.
> >
> > Or is there some other library for parsing HTML pages?
>
> There is this in strtok (any version):
>
> global function getxml(sequence record, sequence starttag, sequence endtag, integer tagnum)
>
> You don't have to fill in all the parameters; see the comments in the .ew
> file, since I didn't document it in the readme. I have used this, pretty
> much unchanged, since I first wrote it in 1992 or so in Turbo Pascal.
>
> Kat


4. Re: parsing HTML

> Can I parse HTML pages with Thomas Parslow's XML library? Will it work,
> given that HTML syntax is not as strict as XML? For example, a <P> tag
> doesn't need a closing </P> tag.
> That XML library would be ideal because it lets you pass it XML data in
> parts, and I am not reading the whole HTML page at once but in pieces.

> Or is there some other library for parsing HTML pages?
> For now I need to get the title of the page and extract all the links on it;
> something else might come up later.

> Tone Škoda

Hi,

My library would not really be very useful for that: it expects
conformant XML and returns an error if the input is not.

Thomas Parslow (PatRat)
E-Mail/Jabber: tom at almostobsolete.net
ICQ: 26359483
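
To illustrate both the conformance problem and the original requirement (title plus all links, with the page arriving in pieces), here is a minimal sketch in Python. It is not Euphoria and has nothing to do with Thomas's library: it assumes Python's standard `xml.etree.ElementTree` as a stand-in for a strict XML parser and `html.parser.HTMLParser` as a tolerant parser that accepts input in chunks. The class name `TitleAndLinks` and the sample chunks are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Collects the page title and every href found on <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased, so <TITLE> and HREF match.
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may be delivered in more than one call, so append.
        if self._in_title:
            self.title += data


# The page arrives in arbitrary pieces; the breaks fall inside tags and text.
chunks = [
    "<html><head><TI",
    "TLE>My pa",
    "ge</TITLE></head><body><P>one<P>two <a hr",
    'ef="a.html">A</a> <A HREF="b.html">B</A></body></html>',
]

parser = TitleAndLinks()
for chunk in chunks:
    parser.feed(chunk)
parser.close()

# A strict XML parser rejects the same input because <P> is never closed.
try:
    ET.fromstring("".join(chunks))
    strict_ok = True
except ET.ParseError:
    strict_ok = False
```

The tolerant parser keeps any incomplete tag in its own buffer until the next `feed` call, which is what makes reading the page piece by piece workable without reassembling it first.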

