RE: 20gigabyte string searching
- Posted by Mike <vulcan at win.co.nz> Oct 02, 2004
Kat wrote:
> Now picture 1.5 million to 15 million URLs coming in from the internet. As I build
> the list of the URLs, I don't want to duplicate any. (BTW, I've managed to handle
> lists of 350,000 URLs before.) Then I have to tag them as <done> once I do them.
> I cannot delete any, because then when another URL is to be added (the substring),
> I might not have it as a duplicate listing. Plus, if I have <done> it, I want to have
> thought ahead and stored the page locally, so I want the location where I stored it.
> I expect I can build up a new list like that over 100,000 times. Someone else said it
> was definitely going to be 150,000 times, and they suggested 500,000 times. I know,
> it sounds too fantastic, but I am looking for something that won't be on any one
> webpage. I can have the URLs, but not the pages; I have to go get those myself.
> Hopefully, in the course of going through such a list, I can do some short-circuiting
> of the amount, the list size, based on webpage content. It's a needle-in-the-haystack
> search.
>
> Kat

It may be that I don't understand the problem, but why not simply use a hash table to index the sequential position of each URL in the file? An array can hold the actual disk position plus the "done" flag. For 15 million entries the whole index arrangement could occupy several hundred megabytes of RAM: difficult, but doable if the resources are there.

A hash computation will retrieve either a single index or a small sequence of indexes. The index values then need to be translated into disk positions, and those strings actually read in again to confirm the search. Depending on the hash table quality there might be, say, 2 or 3 disk movements per URL search. (A rough sketch of this arrangement in code follows at the end of this post.)

Mike

PS: This idea seems similar to the one Pete posted, but here the index is managed in RAM, so performance should be much better (at the expense of huge amounts of RAM).
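For what it's worth, here is a minimal sketch of that arrangement in Python. It assumes the URLs sit in a plain text file, one per line; the names UrlIndex and urls.txt, and the choice of CRC-32 as the hash, are illustrative assumptions rather than anything from the thread. Only the hash buckets, byte offsets and "done" flags live in RAM; the URL strings stay on disk and are re-read to confirm each hit.

    # Sketch only: in-RAM hash index over a file of newline-delimited URLs.
    import zlib

    class UrlIndex:
        def __init__(self, path):
            self.f = open(path, "rb")
            self.buckets = {}   # 32-bit hash -> list of entry indexes (collisions share a bucket)
            self.offsets = []   # entry index -> byte position of the URL in the file
            self.done = []      # entry index -> "done" flag
            self._build()

        def _hash(self, url):
            # Any reasonably uniform hash will do; CRC-32 is just convenient here.
            return zlib.crc32(url.encode("utf-8"))

        def _build(self):
            # One sequential pass over the file to build the index.
            pos = 0
            while True:
                line = self.f.readline()
                if not line:
                    break
                url = line.rstrip(b"\r\n").decode("utf-8")
                idx = len(self.offsets)
                self.offsets.append(pos)
                self.done.append(False)
                self.buckets.setdefault(self._hash(url), []).append(idx)
                pos += len(line)

        def _read_url(self, idx):
            # One disk movement: seek to the stored offset and re-read the string.
            self.f.seek(self.offsets[idx])
            return self.f.readline().rstrip(b"\r\n").decode("utf-8")

        def find(self, url):
            # Hash -> candidate indexes -> confirm against the file itself.
            for idx in self.buckets.get(self._hash(url), []):
                if self._read_url(idx) == url:
                    return idx
            return None

        def mark_done(self, url):
            idx = self.find(url)
            if idx is not None:
                self.done[idx] = True
            return idx

    # Example use (file name is hypothetical):
    index = UrlIndex("urls.txt")
    if index.find("http://example.com/") is None:
        print("new URL, not in the list yet")
    index.mark_done("http://example.com/")

One caveat on the RAM estimate: Python dicts and lists carry far more overhead than a few hundred megabytes for 15 million entries, so a real implementation would want compact arrays of hashes and offsets (or a lower-level language) to get close to the figures above.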