Re: 20gigabyte string searching


Kat wrote:
> 
> Serious question: what's the best way to search twenty gigabytes of text for
> a 100-byte substring, using Euphoria? Keep in mind Eu will blow up 20 gigs to
> 80 in RAM, and each copy made to munge it is another 80 gigs, so loading
> the file into a sequence isn't possible. I foresee tons of disk thrashing,
> as I don't have gigs of RAM lying around.
> 
> The twenty gigs can be formed like this:
> {"string-1","string-2",...,"string-n"}
> where each string-x is 20 to 120 chars long; most will be in the
> 100-character neighborhood. The chances of them being sorted are low, as I
> don't see how Eu could sort them in my lifetime.
> or they can be like this:
> {string-1\nstring-2\nstring-3\n...string-n}
> 
> I have a list of 150 million string-xs lying here, and don't know the best
> way to put them together to solve the search problem. Flat sequence with
> separators, or nested sequence? Having a parallel sequence of what was
> found would be terribly nice, but it would be equally huge in count
> (although it shouldn't be as big in absolute terms).

I guess I don't understand your situation, but why do you need to have
the whole 20 gigs in RAM at once? I just did a test, and I can do a
brute-force string search over 20 gigs in about 70 minutes, using no
more than a meg of RAM.
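
Something along these lines would do it. This is just a minimal sketch of
that kind of low-memory streaming search, not my actual test code; the
chunk size and the find_in_file name are illustrative:

constant CHUNK = 1024 * 1024   -- bytes per read; tune to taste

function find_in_file(sequence fname, sequence pattern)
    integer fn, c, pos
    atom base          -- file offset of buf[1]; atom, since 20 gigs
                       -- exceeds Euphoria's integer range
    sequence buf
    fn = open(fname, "rb")
    if fn = -1 then
        return -1      -- couldn't open the file
    end if
    base = 0
    buf = ""
    while 1 do
        -- top the buffer up to CHUNK bytes
        while length(buf) < CHUNK do
            c = getc(fn)
            if c = -1 then
                exit   -- end of file
            end if
            buf = append(buf, c)
        end while
        pos = match(pattern, buf)
        if pos then
            close(fn)
            return base + pos   -- 1-based offset of the first hit
        end if
        if length(buf) < CHUNK then
            exit                -- EOF reached, no match
        end if
        -- keep the last length(pattern)-1 bytes so a match that
        -- straddles the chunk boundary is still found
        base = base + length(buf) - length(pattern) + 1
        buf = buf[length(buf) - length(pattern) + 2 .. length(buf)]
    end while
    close(fn)
    return 0                    -- not found
end function

The trick is the overlap: carrying the tail of each chunk forward means a
match sitting across a chunk boundary is still seen, and memory use stays
at roughly one chunk no matter how big the file is.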

Anyhow, if you've got the disk space, you can index things a lot. This is
basic stuff for databases. You might need about 40-60 gigs to index the
words, but it will speed up searching 100-1000 times.
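
The one-pass build could look something like this sketch (build_index, the
tab-separated word/offset format, and the ASCII-only word test are all
assumptions, not a worked-out design): scan the big file once, write each
word and its byte offset to a side file, sort that side file externally,
and then a binary search over the sorted index replaces the 20-gig scan.

function is_word_char(integer c)
    return (c >= 'a' and c <= 'z') or (c >= 'A' and c <= 'Z')
end function

procedure build_index(sequence fname, sequence idxname)
    integer fin, fout, c
    atom pos, start    -- atoms: byte offsets can exceed integer range
    sequence word
    fin = open(fname, "rb")
    fout = open(idxname, "w")
    pos = 0
    start = 0
    word = ""
    while 1 do
        c = getc(fin)
        if c != -1 and is_word_char(c) then
            if length(word) = 0 then
                start = pos      -- this word starts here
            end if
            word = append(word, c)
        elsif length(word) then
            -- emit "word<TAB>offset" for later external sorting
            printf(fout, "%s\t%d\n", {word, start})
            word = ""
        end if
        if c = -1 then
            exit                 -- end of file
        end if
        pos = pos + 1
    end while
    close(fin)
    close(fout)
end procedure

The sort over the side file is a one-time cost; after that, each lookup
touches a handful of disk pages instead of the whole 20 gigs.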

-- 
Derek Parnell
Melbourne, Australia
