Re: 20gigabyte string searching
- Posted by Pete Lomax <petelomax at blueyonder.co.uk> Oct 02, 2004
On Fri, 1 Oct 2004 23:38:00 -0500, Kat <gertie at visionsix.com> wrote:

>Basically, you need to know if you already have the url before you add it as a
>duplicate.
<snip>
>> Please give us a sample of the file and the substring, and the above
>> mentioned details.
>
>Ok, here's a leftover from something else:
>
>http://www.wolfe.net/stroetl/index.html
>http://astroamerica.com
<snip>

If I wanted to minimize disk space and memory usage, this is what I would do:

1) I would store all the urls in a plain text file, with the rule that each line added must begin with "http://". The first six characters of that prefix ("http:/") serve as an "end of chain" marker, much as 0 is traditionally used; otherwise I overwrite those 6 characters with a link to the next line in the chain. This url file is "add only", for now.

2) I would hash the url, turning it into a single integer between 0 and 1048575. For this I would use jiri babor's hash() function [just that, ignoring the table-handling part] and probably one of Juergen's bit operators to shift the 32-bit result from hash() down to the 20-bit key I need.

3) I would build a hash lookup file, again holding 6-character keys into the main url text file, and initialise every entry to the empty marker "http:/". At 6 bytes per entry for 1048576 keys, the hash lookup file is 6MB.

4) For lookup keys, I would use a base 64 encoding (0-9, A-Z, a-z), which limits the url file to approx 68G, or base 94 ('!' to '~', but not '/'), which limits the url file to approx 689G. My base64 encoding contribution could be used/adapted for this. Note that neither alphabet contains ':' or '/', so a 6-byte link can never be mistaken for the "http:/" marker.

As an example, suppose hash("http://www.spirit.satelnet.org") returns 968961102, which I shift right to get 946251. I seek(946251*6) and read 6 bytes from the hash file; if they are "http:/" then the bucket is empty and the url needs to be added, in which case I where() the url-text file, encode that offset, and overwrite the 6-byte entry in the hash file. If it is not "http:/" then I decode it, seek, and gets() a line from the url-text file.
If that line matches, of course, I don't want to add this url again; otherwise I repeat the "http://" check on the start of the line. If I get to the end of the chain, I overwrite the start of the last entry with a 6-byte encoded link to the line I am about to add.

Basically, with that amount of data I would hold none of it in memory whatsoever; if there is any caching to be done, let the OS handle it.

Regards,
Pete