RE: 20gigabyte string searching


On 1 Oct 2004, at 23:56, Ricardo Forno wrote:

> 
> 
> It depends on how many times you are going to search for a substring.
> If you only need to search once, then the best way is to read the data in
> pieces into memory and perform the search on each piece.
> The same holds if you are only going to perform the search a few (how
> many?) times.
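Something like this block-by-block scan, I take it -- read a chunk, carry over
the last 99 bytes or so, so a match that straddles two blocks isn't missed.
Just a rough sketch (get_bytes() is the routine from the old get.e; the file
name, block size and function name are only placeholders):

include get.e                  -- get_bytes()

constant BLOCK = 1024 * 1024   -- read 1 MB at a time

function file_has(sequence fname, sequence needle)
    integer fn
    sequence chunk, carry
    fn = open(fname, "rb")
    if fn = -1 then
        return -1              -- could not open the file
    end if
    carry = {}
    while 1 do
        chunk = get_bytes(fn, BLOCK)
        if length(chunk) = 0 then
            exit               -- end of file
        end if
        chunk = carry & chunk
        if match(needle, chunk) then
            close(fn)
            return 1           -- found it
        end if
        -- keep the last length(needle)-1 bytes so a match straddling
        -- two blocks is still caught on the next pass
        if length(chunk) > length(needle) then
            carry = chunk[length(chunk) - length(needle) + 2 .. length(chunk)]
        else
            carry = chunk
        end if
    end while
    close(fn)
    return 0                   -- not found
end function

? file_has("bigfile.txt", "http://www.astronomynow.com")

That only answers the search-once case, though.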

Basically, I need to know whether I already have a url before I add it, so it 
doesn't go in as a duplicate (there's a sketch of one way to track that further 
down).

> The same applies if you modify (append to, etc.) the huge file between a few
> searches.
> In any other case, there are some techniques that will help (sorting,
> hashing), but it all depends on the fine details of the search (is the
> length of the substring constant? is the file partitioned into records (it
> seems it is, according to your description)? are these records all of the
> same length? etc.)

Any way I can think of, I have doubts whether Eu can do it in my lifetime 
on one computer of any MHz rating.

> Please give us a sample of the file and the substring, and the
> above-mentioned details.

Ok, here's a leftover from something else:

http://www.wolfe.net/stroetl/index.html
http://astroamerica.com
http://www.spirit.satelnet.org/spirit/astro/overview.html
http://www.the-ultimate.com/space/astro.htm
http://www.astrologycom.com
http://www.dorrit.as.utexas.edu
http://www.nfra.nl
http://www.astronet.com
http://www.astronline.co.uk
http://www.marvel.stsci.edu/net-resources.html
http://astronomylinks.com
http://www.kalmback.com/astro/astronomy.html
http://www.astronomynow.com
http://www.skypub.com/links/astroweb.html
http://www.stsci.edu/net-resources.html

Now picture 1.5 million to 15 million urls coming in from the internet as I 
build the list of urls, and I don't want to duplicate any. (Btw, I've managed 
to handle lists of 350,000 urls before.) Then I have to tag them as <done> once 
I do them. I cannot delete any, because when another url (the substring) comes 
in to be added, I might not catch it as a duplicate listing. Plus, if I have 
<done> it, I want to have thought ahead and stored the page locally, so I also 
want the location where I stored it. I expect I can build up a new list like 
that over 100,000 times. Someone else said it was definitely going to be 
150,000 times, and they suggested 500,000 times. I know it sounds too 
fantastic, but I am looking for something that won't be on any one webpage. 
I can have the urls, but not the pages; I have to go get those myself. 
Hopefully, in the course of going through such a list, I can do some 
short-circuiting of the amount of work, and of the list size, based on webpage 
content. It's a needle-in-the-haystack search.
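
One way I can picture holding that: one record per url -- {url, done flag,
local path} -- spread over hash buckets, so the duplicate check only compares
against a handful of entries instead of the whole list. A rough in-memory
sketch (all the names are made up, and at 15 million urls it may well not fit
in RAM, so the buckets would really have to live on disk):

constant NBUCKETS = 262144            -- number of buckets (tunable)
constant URL = 1, DONE = 2, PATH = 3  -- fields of one record

sequence buckets
buckets = repeat({}, NBUCKETS)

function hash_url(sequence url)
    atom h
    h = 0
    for i = 1 to length(url) do
        h = remainder(h * 31 + url[i], NBUCKETS)
    end for
    return h + 1                      -- buckets are 1-based
end function

-- returns 1 if the url was new (and is now listed), 0 if it was a duplicate
function add_url(sequence url)
    integer b
    b = hash_url(url)
    for i = 1 to length(buckets[b]) do
        if equal(buckets[b][i][URL], url) then
            return 0                  -- already have it
        end if
    end for
    buckets[b] = append(buckets[b], {url, 0, ""})
    return 1
end function

-- mark a url <done> and remember where the page was stored locally
procedure mark_done(sequence url, sequence local_path)
    integer b
    b = hash_url(url)
    for i = 1 to length(buckets[b]) do
        if equal(buckets[b][i][URL], url) then
            buckets[b][i][DONE] = 1
            buckets[b][i][PATH] = local_path
            return
        end if
    end for
end procedure

add_url() gets called for every incoming url, and mark_done() fills in the
local path once a page has actually been fetched.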

Kat

> Regards.
> ----- Original Message -----
> From: Kat <gertie at visionsix.com>
> To: <EUforum at topica.com>
> Sent: Friday, October 01, 2004 9:10 PM
> Subject: 20gigabyte string searching
> 
> 
> > Serious question: what's the best way to search twenty gigabytes of text
> > for a 100-byte substring, using Euphoria? Keep in mind Eu will blow up 20
> > gigs to 80 in RAM, and each copy made to munge it is another 80 gigs, so
> > loading the file into a sequence isn't possible. I foresee tons of disk
> > thrashing, as I don't have gigs of RAM lying around..
> >
> > The twenty gigs can be formed like this:
> > {"string-1","string-2",...,"string-n"}
> > where each string-x is 20 to 120 chars long; most will be in the
> > 100-character neighborhood. Chances of them being sorted are low, as I
> > don't see how Eu could sort them in my lifetime.
> > Or they can be like this:
> > {string-1\nstring-2\nstring-3\n...string-n}
> >
> > I have a list of 150 million string-xs lying here, and don't know the
> > best way to put them together to solve the search problem. Flat sequence
> > with separators, or nested sequence? Having a parallel sequence of what
> > was found would be terribly nice, but it would be equally huge in count
> > (although it shouldn't be as big absolutely).
> >
> > Kat
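
On the flat-versus-nested question quoted above: with a flat newline-separated
buffer the builtin match() hunts for the substring directly, while a nested
sequence of strings gets checked with find(), which compares whole elements.
Roughly (urls borrowed from the sample list):

sequence flat, nested
integer hit, slot

-- flat layout: one long byte string with '\n' separators,
-- scanned with the builtin match()
flat = "http://astroamerica.com\n" & "http://www.astronomynow.com\n"
hit = match("http://www.astronomynow.com\n", flat)   -- start index, or 0

-- nested layout: one url per element,
-- checked with find(), which compares whole elements
nested = {"http://astroamerica.com", "http://www.astronomynow.com"}
slot = find("http://www.astronomynow.com", nested)   -- element index, or 0

The flat form is a bit leaner in memory (no per-element overhead), while the
nested form is easier to tag, sort or index per url.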

