Re: 20gigabyte string searching


On Fri, 1 Oct 2004 23:38:00 -0500, Kat <gertie at visionsix.com> wrote:

>Basically, you need to know if you already have the url before you add it as a 
>duplicate.
<snip>
>> Please give us a sample of the file and the substring, and the above
>> mentioned details.
>
>Ok, here's a leftover from something else:
>
>http://www.wolfe.net/stroetl/index.html
>http://astroamerica.com
<snip>

If I wanted to minimize disk space and memory usage, this is what I
would do.
1) I would store all the urls in a plain text file, with the rule that
each line added must begin with "http://". The first 6 characters of
that prefix ("http:/") double as an "end of chain" marker, much as 0
is traditionally used; when a line gains a successor, I overwrite
those 6 characters with a link to the next line. This url file is
"add only", for now.
2) I would hash the url, turning it into a single positive integer
between 1 and 1048576. For this I would use jiri babor's hash()
function [just that, ignoring the table-handling part] and probably
use one of Juergen's bit operators to shift the 32-bit result from
hash() down to the 20-bit key I need.
3) I would build a hash lookup file, again using a 6-character lookup
key into the main url text file, and initialise every slot with
"http:/". The hash lookup file will be 6MB, for 1048576 keys.
4) For lookup keys, I would use a base 64 encoding (0-9, A-Z, a-z
give 62 symbols, so two more are needed), which limits the url file
to approx 68G (64^6 bytes), or base 94 (! to ~, but not /; strictly
93 symbols, approx 647G). Excluding / guarantees an encoded key can
never be mistaken for the "http:/" marker. My base64 encoding
contribution could be used/adapted for this.
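The key derivation in step 2) and the offset encoding in step 4) can
be sketched as follows. This is Python for illustration rather than
Euphoria; jiri babor's hash() is not reproduced here, so zlib.crc32
stands in for it, and the alphabet is padded with two assumed filler
symbols ('+' and '-') to reach 64:

```python
import zlib

HASH_BITS = 20                # 2**20 = 1048576 buckets
KEY_LEN = 6                   # characters per lookup key
# 0-9, A-Z, a-z give 62 symbols; '+' and '-' are assumed fillers to reach 64.
# Neither ':' nor '/' appears, so a key can never look like "http:/".
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+-"

def bucket(url: str) -> int:
    """Stand-in for hash(): squeeze a url into a 20-bit bucket number."""
    h = zlib.crc32(url.encode()) & 0xFFFFFFFF   # 32-bit hash
    return h >> (32 - HASH_BITS)                # shift down to 20 bits

def encode_offset(offset: int) -> str:
    """Fixed 6-character base-64 encoding of a file offset (max 64**6 - 1)."""
    chars = []
    for _ in range(KEY_LEN):
        chars.append(ALPHABET[offset % 64])
        offset //= 64
    return "".join(reversed(chars))

def decode_offset(key: str) -> int:
    """Inverse of encode_offset."""
    value = 0
    for c in key:
        value = value * 64 + ALPHABET.index(c)
    return value
```

The fixed 6-character width is what makes the in-place overwrite of a
line's "http:/" prefix possible.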

As an example, suppose hash("http://www.spirit.satelnet.org")
returns 968961102, which I shift right to get 946251.

I seek(946251*6) and read 6 bytes from the hash file. If they are
"http:/" the bucket is empty and the url needs to be added, in which
case I where() the url-text file, encode that offset, and overwrite
the 6-byte entry in the hash file. If they are not "http:/" I decode
them, seek, and gets() a line from the url-text file. If it matches,
of course, I don't want to add this url again; otherwise I repeat the
"http:/" check on the start of that line and follow the chain. If I
get to the end of the chain, I overwrite the start of the last entry
with a 6-byte encoded link to the line I am about to add.
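The whole lookup/insert walk can be put together as a toy sketch,
again in Python rather than Euphoria: the table is shrunk to 8 bits
so the demo files stay tiny, crc32 stands in for hash(), and the
post's seek/where/gets map onto Python's seek/tell/readline:

```python
import os
import tempfile
import zlib

BITS = 8                     # toy table: 256 buckets (the post uses 20 bits)
KEY = 6                      # width of one link/offset entry
EMPTY = b"http:/"            # first 6 bytes of "http://" mark "end of chain"
ALPHABET = b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+-"

def bucket(url):
    """Stand-in for hash() + bit shift: a bucket number in 0..255."""
    return (zlib.crc32(url) & 0xFFFFFFFF) >> (32 - BITS)

def enc(n):
    """Fixed-width base-64 key; never contains ':' or '/'."""
    out = bytearray()
    for _ in range(KEY):
        out.append(ALPHABET[n % 64])
        n //= 64
    return bytes(reversed(out))

def dec(key):
    n = 0
    for c in key:
        n = n * 64 + ALPHABET.index(c)
    return n

def add_url(hashf, urlf, url):
    """Follow the chain; return False for a duplicate, True if added."""
    hashf.seek(bucket(url) * KEY)
    entry = hashf.read(KEY)
    patch = (hashf, bucket(url) * KEY)    # where the new link will be written
    while entry != EMPTY:
        off = dec(entry)
        urlf.seek(off)
        line = urlf.readline().rstrip(b"\n")
        # A chained line's first 6 bytes may hold the next link, but the
        # stored url is always recoverable as "http:/" + the rest.
        if EMPTY + line[KEY:] == url:
            return False                  # already have it
        patch, entry = (urlf, off), line[:KEY]
    urlf.seek(0, os.SEEK_END)             # where() the url file
    new_off = urlf.tell()
    urlf.write(url + b"\n")
    f, pos = patch                        # empty bucket, or end of chain:
    f.seek(pos)                           # overwrite the slot / the "http:/"
    f.write(enc(new_off))
    return True

# Fresh files: hash table pre-filled with "http:/" in every slot.
hashf = tempfile.TemporaryFile()
urlf = tempfile.TemporaryFile()
hashf.write(EMPTY * (1 << BITS))
```

add_url(hashf, urlf, b"http://astroamerica.com") returns True the
first time and False on a repeat, whether or not two urls happen to
collide in the same bucket.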

Basically, with that amount of data I would hold none of it in memory
whatsoever; if there is any caching to be done, let the OS handle it.

Regards,
Pete
