RE: 20gigabyte string searching
- Posted by "Ricardo M. Forno" <rforno at uyuyuy.com> Oct 02, 2004
It depends on how many times you are going to search for a substring. If you need to search only once, the best way is to read the data into memory piece by piece and search each piece in turn. The same holds if you are going to search only a few (how many?) times, or if you modify (append to, etc.) the huge file between a few searches.

In any other case, there are techniques that will help (sorting, hashing), but it all depends on the fine details of the search. Is the length of the substring constant? Is the file partitioned into records (it seems it is, according to your description)? Are these records all of the same length?

Please give us a sample of the file and the substring, along with the details mentioned above.

Regards.

----- Original Message -----
From: Kat <gertie at visionsix.com>
To: <EUforum at topica.com>
Sent: Friday, October 01, 2004 9:10 PM
Subject: 20gigabyte string searching

> Serious question: what's the best way to search twenty gigabytes of text for
> a 100byte substring, using Euphoria? Keep in mind Eu will blow up 20gigs to
> 80 in ram, and each copy made to munge it is another 80 gigs, so loading
> the file into a sequence isn't possible. I foresee tons of disk thrashing, as i
> don't have gigs of ram laying around..
>
> The two gigs can be formed like this:
> {"string-1","string-2"...."string-n"}
> where each string-x is 20 to 120 chars long, most will be in the 100
> character neighborhood. Chances of them being sorted is low, as i don't see
> how Eu can be used to sort them in my lifetime.
> or they can be like this:
> {string-1\nstring-2\nstring3\n...string-n}
>
> I have a list of 150 million string-xs laying here, and don't know the best way
> to put them together so solve the search problem. Flat sequence with
> separators, or nested sequence? Having a parallel sequence of what was
> found would be terribly nice, but would be equally huge in count (altho
> shouldn't be as big absolutely).
>
> Kat
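
For the search-once case in the reply above (read the file by pieces and search each piece), here is a rough sketch of the idea. It is written in Python rather than Euphoria, purely to show the logic. The names `find_in_file` and `chunk_size` are made up for illustration, and the 1 MiB default is an arbitrary choice. The one detail worth noting is that consecutive pieces have to overlap by the substring length minus one byte, otherwise a match that straddles two pieces would be missed. The sketch reports only the first match.

```python
def find_in_file(path, needle, chunk_size=1 << 20):
    """Return the byte offset of the first occurrence of `needle` (bytes)
    in the file at `path`, or -1 if it does not occur.

    Hypothetical helper for illustration only.
    """
    if not needle:
        return 0
    overlap = len(needle) - 1   # bytes carried over between pieces
    with open(path, "rb") as f:
        tail = b""              # end of the previous window
        base = 0                # file offset of the first byte of `tail`
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return -1       # end of file, nothing found
            window = tail + chunk
            pos = window.find(needle)
            if pos != -1:
                return base + pos
            # Keep the last len(needle) - 1 bytes so that a match
            # spanning the boundary is seen in the next window.
            keep = min(overlap, len(window))
            tail = window[len(window) - keep:]
            base += len(window) - keep
```

With a 100-byte substring, memory use stays at roughly `chunk_size` plus 99 bytes no matter how large the file is, and the file is read in a single sequential pass.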