Re: [OT]Where's everyone from? Contest


On 8 Aug 2004, at 12:00, EUforum at topica.com wrote:

> On 7 Aug 2004, at 7:32, irv mullins wrote:
> 
> > 
> > posted by: irv mullins <irvm at ellijay.com>
> > 
> > irv mullins wrote:
> > 
> > > Perhaps the program would be more of a challenge, and more useful, 
> > > if it could access an existing database on the web. That way, it 
> > > could immediately be used by others. For example, a business could 
> > > visualize their customer base, a web-ring could chart their members, 
> > > etc. A well-done program like that would certainly get some attention
> > > on SourceForge or Slashdot, etc. Perhaps even a writeup in a magazine,
> > > which would be good for Euphoria (and RDS). 
> > 
> > Replying to my own message, http://worldatlas.com/aatlas/imageg.htm
> > has the needed info for almost everywhere in the world. Writing code 
> > to access that site and extract the needed info would be an interesting 
> > task.
> 
> It would be as trivial as mining APOD (because of the pic links) and 
> pantheon.org (for the data), which i have done.

And as trivial as what i have been doing for a month now. On one topic of 
interest, i started a month ago and have grabbed several hundred megabytes of 
web pages (64 megs from one domain, and i just copied over 4 full zip disks 
for someone on another topic), using 3 different remote proxies (for assorted 
reasons), and have been running at 100% cpu 24-7 for the last 2 weeks doing 
the data extraction. I am at about 65 megabytes of *extracted/munged* data 
from *one* of 5 domains now, and am only on 'E'. One list of urls alone is 32 
megabytes (300,000+ urls), another is 25 megabytes (245,000+ urls). Two 
other url lists look like they will grow past those sizes. It will probably be 
just another case of "oh you have all that data, but i need only one line of 
it, and since you have it already, i will go look for it myself."

Data mining is trivial, even using an existing online source in real time. 
(Tiggr used Babelfish about 1998-99, but the results were often too weird to 
understand, and lag was horrible.) The catch is that people think your bot is 
broken because of internet lag (the Mars roving bots get better bandwidth 
from Mars to Earth than i can get), and they disparage it frequently and 
deeply, and then they begin abusing it.
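
As a rough illustration of the url-list building step mentioned above, here is a minimal sketch in Python. It is not the code actually used for any of the mining described in this thread. The names `LinkCollector` and `extract_links` are made up for the example, and it assumes the page has already been fetched into a string. It uses only the standard library (`html.parser` and `urllib.parse`).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect absolute urls from <a href> and <img src> attributes.

    Hypothetical helper for illustration only.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # urls in the order first seen
        self._seen = set()   # used to skip duplicates

    def handle_starttag(self, tag, attrs):
        # Only two tag/attribute pairs matter here: page links and pic links.
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative urls against the page url, then drop
                # any "#fragment" so the same page is not listed twice.
                url, _fragment = urldefrag(urljoin(self.base_url, value))
                if url not in self._seen:
                    self._seen.add(url)
                    self.links.append(url)


def extract_links(html, base_url):
    """Return the de-duplicated absolute urls found in one page."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Feeding each fetched page through `extract_links` and appending the results to a file is one way to end up with url lists like the ones described above. The fetching itself (proxies, retries, coping with lag) is left out on purpose.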

Kat
