Re: 20gigabyte string searching


Kat wrote:
> 
> Serious question: what's the best way to search twenty gigabytes of text for
> a 100-byte substring, using Euphoria? Keep in mind Eu will blow up 20 gigs to
> 80 in RAM, and each copy made to munge it is another 80 gigs, so loading
> the file into a sequence isn't possible. I foresee tons of disk thrashing, as
> I don't have gigs of RAM lying around.
> 
> The twenty gigs can be formed like this:
> {"string-1","string-2"...."string-n"}
> where each string-x is 20 to 120 chars long; most will be in the
> 100-character neighborhood. Chances of them being sorted are low, as I don't
> see how Eu can be used to sort them in my lifetime.
> or they can be like this:
> {string-1\nstring-2\nstring3\n...string-n}
> 
> I have a list of 150 million string-xs lying here, and don't know the best
> way to put them together to solve the search problem. Flat sequence with
> separators, or nested sequence? Having a parallel sequence of what was
> found would be terribly nice, but it would be equally huge in count (although
> it shouldn't be as big absolutely).
> 
> Kat
> 
> 
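Kat's constraint (the file can't live in RAM) points at scanning it in fixed-size chunks, keeping a small overlap so a match straddling a chunk boundary isn't missed. Here is a minimal sketch in Python rather than Euphoria, since the technique is language-neutral; the function name and chunk size are illustrative:

```python
def find_in_file(path, needle, chunk_size=16 * 1024 * 1024):
    """Search a large file for a non-empty byte substring without
    loading it whole.

    Reads fixed-size chunks, carrying len(needle)-1 bytes of overlap
    forward so a match spanning a chunk boundary is still found.
    Returns the absolute offset of the first match, or -1.
    """
    overlap = len(needle) - 1
    pos = 0      # absolute offset of the start of the current window
    tail = b""   # carry-over bytes from the previous chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return -1
            window = tail + chunk
            hit = window.find(needle)
            if hit != -1:
                return pos + hit
            # keep only the last `overlap` bytes for the next round
            pos += len(window) - overlap
            tail = window[len(window) - overlap:] if overlap else b""
```

Only one window is ever in memory at a time, regardless of file size, so a 20 GB file costs no more RAM than the chunk size.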
Bear with me as I think; picture what I am trying to say.
Advanced users, forgive my lack of technical terms or knowledge.

If the original file is too big or time-consuming to sort (or keep sorted),
then a hash table could be used instead, and would be easier to maintain.
Someone else mentioned this too.
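As a rough illustration of the hash-table idea, assuming Kat's second, newline-separated layout: a Python dict stands in for the hash table here, and the names are made up.

```python
def build_index(path):
    """Build an in-memory hash index: string -> byte offset of its record.

    Assumes one record per line, i.e. {string-1\nstring-2\n...}.
    The dict is the hash table; exact-match lookups then need no
    sorting at all. For 150 million records the index itself would be
    large, so this is only a sketch of the principle.
    """
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            # key: the record text; value: where it starts in the file
            index[line.rstrip(b"\n")] = offset
            offset += len(line)
    return index
```

A lookup such as `build_index("urls.txt").get(b"some string")` returns the record's byte offset, or None if it is absent.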

A unique number for each record could be a year/time (a Julian date, but how
would you calculate it?), or the birthdate (creation date?) of the URL, which
is unique. But how do you connect the two together in an index when only the
human-readable URL is known? One way is to build a key similar to your own
internet address: a set of four numbers generated by four different lists
(multi-dimensional).

How do you translate a human-readable URL into a set of numbers?
A URL is punctuated with slashes and periods, which can be used to break it
down into lists. Each list would contain only the data for that particular
position in the URL, plus the generated address data.
Searching three or four lists might be faster than sorting/inserting into the
original database.

url = www.xxxx\yyyy\zzzz, read as url = www.dim1\dim2\dim3
xxxx = list 1   -- the actual URL data is the key for this list
yyyy = list 2
zzzz = list 3

For each URL in the database:
list 1 contains the URL info (across all URLs in the database) at that
position (xxxx) in the URL;
list 2 is for the yyyy data, and list 3 is for the zzzz data.
Using the URL's data, search list 1 for the corresponding record code, then
do the same for each remaining list.
The built-up address would look like 04.06.10 (note the leading zeros).
This points to the record in the hash table that holds the needed data
(a record pointer to the original database record).
In these smaller lists, the key is the URL info and the data is the record
number of that URL info within the list.
These smaller lists can be sorted.
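The whole scheme might be sketched like this, again in Python rather than Euphoria. Splitting on '.' and '/', the two passes, and the two-digit zero padding are all my assumptions, not rudy's exact design:

```python
import re

def build_positional_index(urls):
    """One small sorted list per URL position, plus a composite-key table.

    Each URL is split on '.', '/' and '\\' into components. list n holds
    the distinct values seen at position n; a component's index in its
    sorted list is one number of the composite key. The key (e.g.
    '04.06.10') then maps to the record number in a hash table.
    """
    def components(url):
        return [c for c in re.split(r"[./\\]", url) if c]

    # first pass: collect the distinct values at each position
    position_sets = []
    for url in urls:
        for i, part in enumerate(components(url)):
            if i >= len(position_sets):
                position_sets.append(set())
            position_sets[i].add(part)
    position_lists = [sorted(s) for s in position_sets]

    # second pass: build the composite key for each record
    keys = {}
    for recno, url in enumerate(urls):
        nums = [position_lists[i].index(part)
                for i, part in enumerate(components(url))]
        keys[".".join("%02d" % n for n in nums)] = recno
    return position_lists, keys
```

To look up a URL later, split it the same way, find each component in its sorted per-position list (a binary search in a real implementation), join the numbers into the key, and fetch the record number from `keys`.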

I hope I'm clear.

rudy

