Re: 20gigabyte string searching
Kat wrote:
>
> Serious question: what's the best way to search twenty gigabytes of text for
> a 100byte substring, using Euphoria? Keep in mind Eu will blow up 20gigs to
> 80 in ram, and each copy made to munge it is another 80 gigs, so loading
> the file into a sequence isn't possible. I foresee tons of disk thrashing, as
> i don't have gigs of ram laying around..
>
> The two gigs can be formed like this:
> {"string-1","string-2"...."string-n"}
> where each string-x is 20 to 120 chars long, most will be in the 100
> character neighborhood. Chances of them being sorted is low, as i don't see
> how Eu can be used to sort them in my lifetime.
> or they can be like this:
> {string-1\nstring-2\nstring3\n...string-n}
>
> I have a list of 150 million string-xs laying here, and don't know the best
> way to put them together to solve the search problem. Flat sequence with
> separators, or nested sequence? Having a parallel sequence of what was
> found would be terribly nice, but would be equally huge in count (altho
> shouldn't be as big absolutely).
I guess I don't understand your situation, but why do you need to have
the whole 20 gigs in RAM at once? I just ran a test, and I can do a
brute-force string search over 20 gigs in about 70 minutes using no more
than a meg of RAM.
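
To illustrate the idea, here is a minimal sketch of a chunked brute-force
search. It is written in Python rather than Euphoria, and it is not the test
program mentioned above; the function name find_in_file and the 1 MB chunk
size are assumptions chosen for illustration. The file is read one chunk at a
time, and the last len(needle) - 1 bytes of each window are carried over so a
match that straddles a chunk boundary is not missed.

```python
def find_in_file(path, needle, chunk_size=1 << 20):
    """Return the byte offset of the first match of needle in the file, or -1.

    Hypothetical helper: memory use stays near chunk_size + len(needle),
    regardless of how large the file is.
    """
    if not needle:
        return 0
    overlap = len(needle) - 1   # bytes carried across a chunk boundary
    tail = b""                  # leftover bytes from the previous window
    base = 0                    # file offset of the first byte of the window
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return -1       # reached end of file without a match
            window = tail + chunk
            hit = window.find(needle)
            if hit != -1:
                return base + hit
            # Keep only the bytes that could still begin a match.
            keep = window[-overlap:] if overlap else b""
            base += len(window) - len(keep)
            tail = keep
```

The same loop should carry over to Euphoria using its binary file reads; only
the buffer handling differs.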
Anyhow, if you've got the disk space, you can index the data heavily. This
is basic stuff for databases. You might need about 40-60 gigs to index the
words, but it will speed up searching by a factor of 100-1000.
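
As a rough sketch of what word indexing means here, the Python below maps each
word to the byte offsets of the lines that contain it, assuming the
newline-separated layout from the question. The names build_word_index and
lookup are hypothetical, and the index is held in an in-memory dict only to
keep the example short; for 20 gigs of text the index would have to live on
disk, for example in a database.

```python
from collections import defaultdict

def build_word_index(path):
    """Map each whitespace-separated word to the offsets of lines containing it."""
    index = defaultdict(list)
    with open(path, "rb") as f:
        offset = 0
        for line in f:
            # set() so a word repeated within a line is recorded once.
            for word in set(line.split()):
                index[word].append(offset)
            offset += len(line)
    return index

def lookup(path, index, word):
    """Return the lines containing word, seeking straight to each indexed offset."""
    results = []
    with open(path, "rb") as f:
        for offset in index.get(word, []):
            f.seek(offset)
            results.append(f.readline().rstrip(b"\n"))
    return results
```

A lookup then touches only the lines listed for that word instead of scanning
the whole file, which is where the speedup comes from.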
--
Derek Parnell
Melbourne, Australia