RE: 20gigabyte string searching


It depends on how many times you are going to search for a substring.
If you need to search only once, the best way is to read the file into
memory piece by piece and search each piece in turn.
The same holds if you are going to search only a few (how many?) times,
or if the huge file is modified (appended to, etc.) between searches.
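
A minimal sketch of the piece-by-piece search, written in Python rather
than Euphoria purely for illustration. The function name, the chunk size
and the sample data are my own assumptions, not anything from your setup.
The one detail that matters is carrying the last len(needle) - 1 bytes of
each piece over to the next one, so that a match straddling two pieces is
not missed.

```python
import io

def find_in_stream(stream, needle, chunk_size=1 << 20):
    """Return the byte offset of the first occurrence of needle, or -1.

    Only one chunk plus a small overlap is held in memory at a time.
    """
    if not needle:
        return 0
    overlap = len(needle) - 1   # bytes carried over between chunks
    tail = b""                  # end of the previous window
    base = 0                    # stream offset of the first byte of tail
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1           # end of data, nothing found
        window = tail + chunk
        pos = window.find(needle)
        if pos != -1:
            return base + pos
        keep = min(overlap, len(window))
        tail = window[len(window) - keep:]
        base += len(window) - keep

if __name__ == "__main__":
    # Hypothetical sample data; a real run would pass open(path, "rb").
    sample = io.BytesIO(b"string-1\nstring-2\nstring-3\n")
    print(find_in_stream(sample, b"string-3", chunk_size=8))
```

Memory use stays at roughly one chunk regardless of the file size, so
the cost of a search is one sequential pass over the disk.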
In any other case, there are some techniques that will help (sorting,
hashing), but it all depends on the fine details of the search: is the
length of the substring constant? Is the file partitioned into records (it
seems so, according to your description)? Are these records all of the
same length? And so on.
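
To show what hashing could look like, here is a rough sketch, again in
Python for illustration only. It assumes two things you have not
confirmed: that the file holds one record per line, and that you always
search for a whole record rather than a fragment of one. Under those
assumptions, one pass builds a table from a hash of each record to the
byte offsets where it starts, and later searches only read the few
candidate records. The names build_index and lookup are made up for this
example, and for 150 million records the table itself would have to live
on disk rather than in a dictionary.

```python
import io
import zlib

def build_index(stream):
    """Map the CRC32 of each line-terminated record to its start offsets."""
    index = {}
    offset = 0
    for line in stream:                  # binary streams iterate by line
        record = line.rstrip(b"\n")
        index.setdefault(zlib.crc32(record), []).append(offset)
        offset += len(line)
    return index

def lookup(stream, index, record):
    """Return the offsets of records equal to `record`."""
    hits = []
    for offset in index.get(zlib.crc32(record), []):
        stream.seek(offset)
        # Different records can share a CRC32, so compare the real bytes.
        if stream.readline().rstrip(b"\n") == record:
            hits.append(offset)
    return hits

if __name__ == "__main__":
    # Hypothetical sample data standing in for the big file.
    data = io.BytesIO(b"string-1\nstring-2\nstring-1\n")
    idx = build_index(data)
    print(lookup(data, idx, b"string-1"))
```

This does nothing for a search that can start in the middle of a record,
which is why the details asked for above matter.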
Please give us a sample of the file and of the substring, along with the
details mentioned above.
Regards.
----- Original Message -----
From: Kat <gertie at visionsix.com>
To: <EUforum at topica.com>
Sent: Friday, October 01, 2004 9:10 PM
Subject: 20gigabyte string searching


>
>
> Serious question: what's the best way to search twenty gigabytes of text for
> a 100byte substring, using Euphoria? Keep in mind Eu will blow up 20gigs to
> 80 in ram, and each copy made to munge it is another 80 gigs, so loading
> the file into a sequence isn't possible. I foresee tons of disk thrashing, as i
> don't have gigs of ram laying around..
>
> The twenty gigs can be formed like this:
> {"string-1","string-2"...."string-n"}
> where each string-x is 20 to 120 chars long, most will be in the 100
> character neighborhood. Chances of them being sorted is low, as i don't see
> how Eu can be used to sort them in my lifetime.
> or they can be like this:
> {string-1\nstring-2\nstring3\n...string-n}
>
> I have a list of 150 million string-xs laying here, and don't know the best way
> to put them together to solve the search problem. Flat sequence with
> separators, or nested sequence? Having a parallel sequence of what was
> found would be terribly nice, but would be equally huge in count (altho
> shouldn't be as big absolutely).
>
> Kat
>
>
>
>
