RE: 20gigabyte string searching


Kat wrote:

> Now picture 1.5 million to 15 million urls coming in from the internet,
> as i build the list of the urls, and i don't wanna duplicate any. (btw,
> i've managed to handle lists of 350,000 urls before.) Then, i haveto tag
> them as <done> once i do them. I cannot delete any, because then when
> another url is to be added (the substring), i might not have it as a
> duplicate listing. Plus, if i have <done> it, i want to have thought
> ahead and stored the page locally, so i want the location where i stored
> it at. I expect i can build up a new list like that over 100,000 times.
> Someone else said it was definitely going to be 150,000 times, and they
> suggested 500,000 times. I know, it sounds too fantastic, but i am
> looking for something that won't be on one webpage anywhere. I can have
> the urls, but not the pages, i haveto go get those myself. Hopefully, in
> the course of going thru such a list, i can do some shortcircuiting of
> the amount, the list size, based on webpage content. It's a needle in
> the haystack search.
>
> Kat

It may be that I don't understand the problem, but why not simply use a 
hash table to index the sequential position of each URL in the file? An 
array can hold the actual disk position plus the "done" flag. For 15 
million entries the whole index arrangement could occupy several hundred 
megabytes of RAM: difficult, but doable if the resources are there.
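To make that concrete, here is a rough sketch in Python, purely as an 
illustration (the one-URL-per-line file layout and all the names are my 
own assumptions, not anything Kat described):

    import zlib

    def url_hash(url):
        # Any reasonable string hash works; crc32 is just an example.
        return zlib.crc32(url.encode("utf-8"))

    def build_index(path):
        # In-RAM index over a flat file of URLs, one per line.
        # Only the hash value and the byte offset of each line are kept
        # in memory; the URL text itself stays on disk.
        index = {}              # hash value -> list of [offset, done_flag]
        with open(path, "rb") as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                url = line.rstrip(b"\r\n").decode("utf-8")
                index.setdefault(url_hash(url), []).append([offset, False])
        return index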

A hash computation will retrieve either a single index or a small 
sequence of indexes. The index values will need to be translated into 
disk positions, and those strings actually read in again to confirm the 
search. Depending on the quality of the hash table, there might be, say, 
2 or 3 disk movements per URL search.
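Continuing the same sketch, a lookup translates the candidate entries 
back into disk positions and re-reads the stored strings to confirm the 
match (again, only an illustration of the idea):

    def lookup(index, f, url):
        # f is the URL file opened in binary mode ("rb").
        # Each candidate entry is confirmed by seeking to its offset and
        # comparing the stored string; a few seeks per search is the
        # expected cost when a hash value has collisions.
        for entry in index.get(url_hash(url), []):
            offset, done = entry
            f.seek(offset)
            stored = f.readline().rstrip(b"\r\n").decode("utf-8")
            if stored == url:
                return entry    # caller can set entry[1] = True once <done>
        return None             # not present: safe to add as a new URL

Marking a URL as <done> is then just a matter of flipping the flag in the 
entry the lookup returns; nothing on disk has to move.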

Mike

PS: This idea seems similar to the one Pete posted but the index is 
managed in RAM so performance should be much better (at the expense of 
huge amounts of RAM).
