Re: 20gigabyte string searching
- Posted by Derek Parnell <ddparnell at bigpond.com> Oct 02, 2004
Kat wrote:
> Serious question: what's the best way to search twenty gigabytes of text
> for a 100-byte substring, using Euphoria? Keep in mind Eu will blow up 20
> gigs to 80 in RAM, and each copy made to munge it is another 80 gigs, so
> loading the file into a sequence isn't possible. I foresee tons of disk
> thrashing, as I don't have gigs of RAM lying around.
>
> The twenty gigs can be formed like this:
> {"string-1","string-2",...,"string-n"}
> where each string-x is 20 to 120 chars long; most will be in the
> 100-character neighborhood. Chances of them being sorted are low, as I
> don't see how Eu can be used to sort them in my lifetime.
> Or they can be like this:
> {string-1\nstring-2\nstring-3\n...string-n}
>
> I have a list of 150 million string-xs lying here, and don't know the
> best way to put them together to solve the search problem. Flat sequence
> with separators, or nested sequence? Having a parallel sequence of what
> was found would be terribly nice, but it would be equally huge in count
> (although it shouldn't be as big in absolute terms).

I guess I don't understand your situation, but why do you need to have the
whole 20 gigs in RAM at once? I just did a test and I can do a brute-force
string search over 20 gigs in about 70 minutes, using no more than a meg of
RAM (along the lines of the first sketch at the end of this message).

Anyhow, if you've got the disk space, you can index things a lot. This is
basic stuff for databases. You might need about 40-60 gigs to index the
words, but it will speed up searching 100-1000 times (see the second sketch
below).

-- 
Derek Parnell
Melbourne, Australia
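PS: for anyone wanting to try it, here's a minimal sketch of a chunked
brute-force search (not the exact code from my test). The file name
"strings.dat" and the routine name find_in_file are made up for the
example; get_bytes() comes from the standard get.e and match() is built in.
RAM use stays near the buffer size, because only a pattern-sized tail is
carried over between chunks.

    include get.e   -- get_bytes()

    constant CHUNK = 1024 * 1024  -- 1MB read buffer

    function find_in_file(sequence fname, sequence pattern)
        integer fh, pos
        atom offset   -- bytes consumed before the start of buf
        sequence buf, chunk

        fh = open(fname, "rb")
        if fh = -1 then
            return -1   -- could not open the file
        end if

        buf = {}
        offset = 0
        while 1 do
            chunk = get_bytes(fh, CHUNK)
            buf = buf & chunk
            pos = match(pattern, buf)
            if pos then
                close(fh)
                return offset + pos   -- 1-based position in the file
            end if
            if length(chunk) = 0 then
                exit   -- end of file, no match found
            end if
            if length(buf) >= length(pattern) then
                -- keep a (pattern-1)-byte tail so a match straddling
                -- the chunk boundary is still found on the next pass
                offset = offset + length(buf) - (length(pattern) - 1)
                buf = buf[length(buf) - length(pattern) + 2 .. length(buf)]
            end if
        end while
        close(fh)
        return 0   -- not found
    end function

    ? find_in_file("strings.dat", "some 100-byte needle")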
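PPS: and a rough sketch of the word-indexing idea, using the EDS database
(database.e) that ships with Euphoria. The database name "words.edb" and
the setup/add_word/lookup_word routines are my own for illustration; each
distinct word becomes a key whose data is the list of file offsets of the
strings containing it, so a search only has to visit those offsets instead
of scanning all 20 gigs.

    include database.e   -- EDS: db_create(), db_insert(), db_find_key(), ...

    procedure setup()
        if db_create("words.edb", DB_LOCK_NO) != DB_OK then
            puts(2, "couldn't create words.edb\n")
            abort(1)
        end if
        if db_create_table("words") != DB_OK then
            puts(2, "couldn't create the words table\n")
            abort(1)
        end if
    end procedure

    -- record that a string containing 'word' starts at file offset 'offset'
    procedure add_word(sequence word, atom offset)
        integer rec
        rec = db_find_key(word)
        if rec < 0 then
            -- new word: start its offset list
            if db_insert(word, {offset}) != DB_OK then
                puts(2, "insert failed\n")
            end if
        else
            -- known word: append this offset to its list
            db_replace_data(rec, append(db_record_data(rec), offset))
        end if
    end procedure

    -- return the list of file offsets for 'word', or {} if unseen
    function lookup_word(sequence word)
        integer rec
        rec = db_find_key(word)
        if rec < 0 then
            return {}
        end if
        return db_record_data(rec)
    end function

Searching then means splitting the query into words, intersecting their
offset lists, and reading only those few spots in the big file.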