1. 20gigabyte string searching
- Posted by "Kat" <gertie at visionsix.com> Oct 02, 2004
- 492 views
Serious question: what's the best way to search twenty gigabytes of text for a 100-byte substring, using Euphoria? Keep in mind Eu will blow up 20 gigs to 80 in RAM, and each copy made to munge it is another 80 gigs, so loading the file into a sequence isn't possible. I foresee tons of disk thrashing, as I don't have gigs of RAM lying around.

The twenty gigs can be formed like this:
{"string-1","string-2"...."string-n"}
where each string-x is 20 to 120 chars long; most will be in the 100-character neighborhood. Chances of them being sorted are low, as I don't see how Eu can be used to sort them in my lifetime. Or they can be like this:
{string-1\nstring-2\nstring-3\n...string-n}

I have a list of 150 million string-xs lying here, and don't know the best way to put them together to solve the search problem. Flat sequence with separators, or nested sequence? Having a parallel sequence of what was found would be terribly nice, but would be equally huge in count (although it shouldn't be as big absolutely).

Kat
2. Re: 20gigabyte string searching
- Posted by cklester <cklester at yahoo.com> Oct 02, 2004
- 465 views
Kat wrote:
>
> Serious question: what's the best way to search twenty gigabytes of text for
> a 100-byte substring, using Euphoria?

Would it be easier to create an index (like a b-tree or something), then use that for searching?

-=ck
"Programming in a state of EUPHORIA."
http://www.cklester.com/euphoria/
3. Re: 20gigabyte string searching
- Posted by "Kat" <gertie at visionsix.com> Oct 02, 2004
- 439 views
On 1 Oct 2004, at 18:21, cklester wrote:

> posted by: cklester <cklester at yahoo.com>
>
> Kat wrote:
> >
> > Serious question: what's the best way to search twenty gigabytes of text for
> > a 100-byte substring, using Euphoria?
>
> Would it be easier to create an index (like a b-tree or something),
> then use that for searching?

Sure! Where? 20 gigs in an index in an Eu sequence is 80 gigs.

Kat
4. Re: 20gigabyte string searching
- Posted by cklester <cklester at yahoo.com> Oct 02, 2004
- 477 views
People wrote:
> > > Serious question: what's the best way to search twenty gigabytes of text
> > > for a 100-byte substring, using Euphoria?
> > Would it be easier to create an index (like a b-tree or something),
> > then use that for searching?
> Sure! Where?
> 20 gigs in an index in an Eu sequence is 80 gigs.

I mean like this:

{ "This", "Is an example", "Might not be", "Accurate" }

Index would be:

{
  {"A", {4}},
  {"B", {}},
  ...
  {"I", {2}},
  ...
  {"M", {3}},
  ...
  {"T", {1}}
}

So if you search for "Might not be," it searches the index first for 'M', finds all instances in the list where the searchable items start with 'M' and finds one at 3, then just tests the rest.

Or, even an index like this:

{ "T", "IAE", "MNB", "A" }

So if I search for "is an example," it looks for the first-letter-of-each-word pattern of "IAE" and finds it.

Either one of these methods will reduce the required storage for a regular index of words, plus it will speed up the searches significantly. I think. I could be wrong. It's late. :)

-=ck
"Programming in a state of EUPHORIA."
http://www.cklester.com/euphoria/
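A minimal sketch of ck's first-letter index, for illustration only: it is not from his post, the helper names are invented, and it keeps both the string list and the index in RAM, which is exactly the constraint Kat raised.

-- Sketch only: first-letter index over an in-memory list of strings.
function letter_slot(object s)
-- map a string to a bucket 1..26 by its first letter, or 0 if none
    integer c
    if atom(s) or length(s) = 0 then
        return 0
    end if
    c = s[1]
    if c >= 'a' and c <= 'z' then
        c -= 32                         -- fold to upper case
    end if
    if c >= 'A' and c <= 'Z' then
        return c - 'A' + 1
    end if
    return 0
end function

function build_index(sequence strings)
    sequence index
    integer slot
    index = repeat({}, 26)              -- one bucket of positions per letter
    for i = 1 to length(strings) do
        slot = letter_slot(strings[i])
        if slot then
            index[slot] = append(index[slot], i)
        end if
    end for
    return index
end function

function lookup(sequence strings, sequence index, sequence key)
    sequence bucket
    integer slot
    slot = letter_slot(key)
    if slot = 0 then
        return 0
    end if
    bucket = index[slot]                -- candidate positions only
    for i = 1 to length(bucket) do
        if equal(strings[bucket[i]], key) then
            return bucket[i]            -- position in the full list
        end if
    end for
    return 0                            -- not found
end function

The point is that only the strings in one letter's bucket ever get compared in full, rather than all 150 million.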
5. Re: 20gigabyte string searching
- Posted by "Kat" <gertie at visionsix.com> Oct 02, 2004
- 442 views
On 1 Oct 2004, at 19:15, cklester wrote:

> posted by: cklester <cklester at yahoo.com>
>
> People wrote:
> > > > Serious question: what's the best way to search twenty gigabytes of text
> > > > for a 100-byte substring, using Euphoria?
> > > Would it be easier to create an index (like a b-tree or something),
> > > then use that for searching?
> > Sure! Where?
> > 20 gigs in an index in an Eu sequence is 80 gigs.
>
> I mean like this:
>
> { "This", "Is an example", "Might not be", "Accurate" }
>
> Index would be:
>
> {
>   {"A", {4}},
>   {"B", {}},
>   ...
>   {"I", {2}},
>   ...
>   {"M", {3}},
>   ...
>   {"T", {1}}
> }
>
> So if you search for "Might not be," it searches the index first for 'M',
> finds all instances in the list where the searchable items start with 'M'
> and finds one at 3, then just tests the rest.
>
> Or, even an index like this:
>
> { "T", "IAE", "MNB", "A" }
>
> So if I search for "is an example," it looks for the first-letter-of-each-
> word pattern of "IAE" and finds it.
>
> Either one of these methods will reduce the required storage for a
> regular index of words, plus it will speed up the searches significantly.
> I think. I could be wrong. It's late. :)

Ok, you got the index, where's the 20-gig file you are indexing into?

Kat
6. Re: 20gigabyte string searching
- Posted by Derek Parnell <ddparnell at bigpond.com> Oct 02, 2004
- 478 views
Kat wrote:
>
> Serious question: what's the best way to search twenty gigabytes of text for
> a 100-byte substring, using Euphoria? Keep in mind Eu will blow up 20 gigs to
> 80 in RAM, and each copy made to munge it is another 80 gigs, so loading
> the file into a sequence isn't possible. I foresee tons of disk thrashing, as
> I don't have gigs of RAM lying around.
>
> The twenty gigs can be formed like this:
> {"string-1","string-2"...."string-n"}
> where each string-x is 20 to 120 chars long; most will be in the 100-character
> neighborhood. Chances of them being sorted are low, as I don't see
> how Eu can be used to sort them in my lifetime. Or they can be like this:
> {string-1\nstring-2\nstring-3\n...string-n}
>
> I have a list of 150 million string-xs lying here, and don't know the best way
> to put them together to solve the search problem. Flat sequence with
> separators, or nested sequence? Having a parallel sequence of what was
> found would be terribly nice, but would be equally huge in count (although it
> shouldn't be as big absolutely).

I guess I don't understand your situation, but why do you need to have the whole 20Gigs in RAM at once? I just did a test and I can do a brute-force string search over 20Gigs in about 70 minutes, using no more than a meg of RAM.

Anyhow, if you've got the disk space you can index things a lot. This is basic stuff for databases. You might need about 40-60Gigs to index the words, but it will speed up searching 100-1000 times.

-- 
Derek Parnell
Melbourne, Australia
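Derek's test program was not posted; the sketch below shows one way a low-memory brute-force scan could look. The 64K buffer size, the byte-at-a-time refill and the overlap handling are assumptions, not his code.

-- Sketch only: stream the file through a fixed buffer, keeping the last
-- length(key)-1 bytes of each chunk so a match spanning a buffer boundary
-- is not missed.
constant CHUNK = 65536

function file_find(sequence fname, sequence key)
    integer fn, c
    atom base                              -- file offset of buf[1], zero-based
    sequence buf
    fn = open(fname, "rb")
    if fn = -1 then
        return -1                          -- could not open the file
    end if
    buf = {}
    base = 0
    while 1 do
        while length(buf) < CHUNK do       -- refill the buffer
            c = getc(fn)
            if c = -1 then
                exit                       -- end of file
            end if
            buf &= c
        end while
        if length(buf) < length(key) then
            exit                           -- nothing left to match against
        end if
        c = match(key, buf)
        if c then
            close(fn)
            return base + c                -- 1-based position in the file
        end if
        -- slide: keep only the tail that could start a boundary match
        base += length(buf) - (length(key) - 1)
        buf = buf[length(buf)-length(key)+2 .. length(buf)]
    end while
    close(fn)
    return 0                               -- not found
end function

Memory use stays at roughly one buffer no matter how large the file is; the byte-at-a-time getc() refill is the obvious thing to speed up in a real version.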
7. Re: 20gigabyte string searching
- Posted by Pete Lomax <petelomax at blueyonder.co.uk> Oct 02, 2004
- 443 views
On Fri, 1 Oct 2004 23:38:00 -0500, Kat <gertie at visionsix.com> wrote:

> Basically, you need to know if you already have the url before you add it as a
> duplicate.
<snip>
> > Please give us a sample of the file and the substring, and the above
> > mentioned details.
>
> Ok, here's a leftover from something else:
>
> http://www.wolfe.net/stroetl/index.html
> http://astroamerica.com
<snip>

If I wanted to minimize disk space and memory usage, this is what I would do:

1) I would store all the urls in a plain text file, with the rule that each line added must begin with "http://". I would use that specific value as an "end of chain" marker, much like 0 is traditionally used; otherwise I will overwrite those 6 characters with a link to the next line. This url file is "add only", for now.

2) I would hash the url, turning it into a single positive integer between 1 and 1048576. For this I would use jiri babor's hash() function [just that, ignoring the table handling part] and probably use one of Jeurgen's bit operators to shift the 32-bit result from hash() down to the 20-bit key I need.

3) I would build a hash lookup file, again using a 6-character lookup key into my main url text file, and initialise it with "http://"'s. The hash lookup file will be 6MB, for 1048576 keys.

4) For lookup keys, I would use a base 64 encoding (0-9, A-Z, a-z), which limits the url file to approx 68G, or base 94 (! to ~, but not /), which limits the url file to approx 689G. My base64 encoding contribution could be used/adapted for this.

As an example, suppose hash("http://www.spirit.satelnet.org") returns 968961102, which I shift right to get 946251. I seek(946251*6) and read 6 bytes from the hash file; if it is "http://" then the bucket is empty and the url needs to be added, in which case I where() the url-text file, encode it, and overwrite the 6-byte entry in the hash file. If it is not "http://" then I decode it, seek, and gets() a line from the url-text file. If it matches, of course, I don't want to add this url again; otherwise I repeat the "http://" check on the start of the line. If I get to the end of the chain, then I have to overwrite the start of the last entry with a 6-byte encoded link to the line I am about to add.

Basically, with that amount of data I would hold none of it in memory whatsoever; if there is any caching to be done, let the OS handle it.

Regards,
Pete
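A rough sketch of the bucket arithmetic Pete describes. jiri's hash() and the base-64 link encoding are not reproduced here, so a simple stand-in hash is used and the helper names are invented; only the 2^20-bucket, 6-byte-record layout is the point.

-- Sketch of the fixed-record hash-file probe (stand-in hash, invented names).
include file.e                  -- seek()

constant BUCKETS = 1048576,     -- 2^20 buckets
         RECSIZE = 6            -- 6 bytes per hash-file entry

function url_bucket(sequence url)
-- stand-in for jiri's hash(): fold the characters into 1..BUCKETS
    atom h
    h = 0
    for i = 1 to length(url) do
        h = remainder(h * 31 + url[i], BUCKETS)
    end for
    return h + 1
end function

function read_bucket(integer hashfn, integer bucket)
-- hashfn: hash file opened with open(name, "ub")
-- return the 6-byte entry for this bucket, or -1 on failure
    sequence rec
    integer c
    if seek(hashfn, (bucket-1) * RECSIZE) then   -- seek() returns 0 on success
        return -1
    end if
    rec = {}
    for i = 1 to RECSIZE do
        c = getc(hashfn)
        if c = -1 then
            return -1
        end if
        rec &= c
    end for
    return rec
end function

An entry that still holds Pete's end-of-chain marker means the bucket is empty; anything else decodes to a line offset in the url file, which is gets()'d and compared, following the chain exactly as he describes.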
8. Re: 20gigabyte string searching
- Posted by Terry Constant <EUforum at terryconstant.com> Oct 02, 2004
- 447 views
Kat, One other thought. You mentioned that 20 gb of text becomes 80 gb as a sequence. If size is a major issue, consider using regular expressions. If the data you are searching remains as a single string of text data, you will not have the size increase you mention. Karl Bochert submitted regex.zip 2002-04-23. I have used his routines some and they are quite usable. In short, using regular expressions on single large strings could help with the size issues. Terry Constant
9. Re: 20gigabyte string searching
- Posted by rudy toews <rltoews at ilos.net> Oct 03, 2004
- 458 views
Kat wrote:
>
> Serious question: what's the best way to search twenty gigabytes of text for
> a 100-byte substring, using Euphoria? Keep in mind Eu will blow up 20 gigs to
> 80 in RAM, and each copy made to munge it is another 80 gigs, so loading
> the file into a sequence isn't possible. I foresee tons of disk thrashing, as
> I don't have gigs of RAM lying around.
>
> The twenty gigs can be formed like this:
> {"string-1","string-2"...."string-n"}
> where each string-x is 20 to 120 chars long; most will be in the 100-character
> neighborhood. Chances of them being sorted are low, as I don't see
> how Eu can be used to sort them in my lifetime. Or they can be like this:
> {string-1\nstring-2\nstring-3\n...string-n}
>
> I have a list of 150 million string-xs lying here, and don't know the best way
> to put them together to solve the search problem. Flat sequence with
> separators, or nested sequence? Having a parallel sequence of what was
> found would be terribly nice, but would be equally huge in count (although it
> shouldn't be as big absolutely).
>
> Kat

Bear with me as I think; picture what I am trying to say. Advanced users, forgive my lack of tech terms or knowledge.

If the original file is too big or time-consuming to sort (or keep sorted), then a hash table could be used and would be easier to maintain in sorted order. Someone else mentioned this too. A unique number for each record could be a year/time (a Julian date, but how to calculate it?), or the birth date (creation date?) of the url, which is unique; but how to connect the two together in an index when only the human-readable url is known?

Build a key similar to your own internet address: a set of 4 numbers generated by 4 different lists (multi-dimensional). How to translate a human-readable url into a set of numbers? A url is punctuated with slashes or periods, which could be used to break the url down into lists. Each list would contain only the data for that particular position in the url, plus the generated address data. Searching 3 or 4 lists might be faster than sorting/inserting into the original database.

url = www.xxxx\yyyy\zzzz, read as url = www.dim1\dim2\dim3
xxxx = list 1 -- the actual url data is the key for this list
yyyy = list 2
zzzz = list 3

For each url in the database: list 1 contains the url info (of all urls in the database) at that position (xxxx) in the url, list 2 is for the yyyy url data, and list 3 is for the zzzz data.

Using the data, search list 1 for the corresponding record code, and do the same for each list. The built-up address would look like 04.06.10 (with leading zeros); this points to the record in the hash table that contains the needed data (a record pointer to the original database record). The key in these smaller lists is the url info; the data is the record number of this url info inside the list. These smaller lists can be sorted.

I hope I'm clear.

rudy
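One possible reading of rudy's scheme, sketched only to make the idea concrete; the splitting rules and helper names are guesses, not his code. Each url is split on its separators, each component is looked up in a per-position list, and the positions are joined into a composite key such as "04.06.10".

-- Sketch only: positional component lookup, one list per url position.
function split_url(sequence url)
    sequence parts, part
    integer c
    parts = {}
    part = ""
    for i = 1 to length(url) do
        c = url[i]
        if c = '.' or c = '/' or c = '\\' then
            if length(part) then
                parts = append(parts, part)
                part = ""
            end if
        else
            part &= c
        end if
    end for
    if length(part) then
        parts = append(parts, part)
    end if
    return parts
end function

function composite_key(sequence url, sequence lists)
-- lists[p] is the list of components already seen at position p
    sequence parts, key
    integer ix
    parts = split_url(url)
    key = ""
    for p = 1 to length(parts) do
        if p <= length(lists) then
            ix = find(parts[p], lists[p])   -- 0 means "not seen yet"
        else
            ix = 0
        end if
        key &= sprintf("%02d", ix)
        if p < length(parts) then
            key &= '.'
        end if
    end for
    return key
end function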
10. Re: 20gigabyte string searching
- Posted by Pete Lomax <petelomax at blueyonder.co.uk> Oct 03, 2004
- 440 views
On Sat, 2 Oct 2004 22:26:26 +0000, Mike <vulcan at win.co.nz> wrote:

> Depending on the hash table quality there might be, say, 2 or 3
> disk movements per URL search.
>
> Mike
>
> PS: This idea seems similar to the one Pete posted but the index is
> managed in RAM so performance should be much better (at the expense of
> huge amounts of RAM).

I thought about that. With a hash table of a million entries on disk, and 15 million urls, if the hash function does its job, the average bucket size will be 15. Either way (15 or 2), local disk accesses will probably be faster than reading a 100-byte url from the interweb.

Pete
11. Re: 20gigabyte string searching
- Posted by "Unkmar" <L3Euphoria at bellsouth.net> Oct 03, 2004
- 459 views
----- Original Message -----
From: "Pete Lomax"
Sent: Sunday, October 03, 2004 5:38 AM
Subject: Re: 20gigabyte string searching

> On Sat, 2 Oct 2004 22:26:26 +0000, Mike <vulcan at win.co.nz> wrote:
>
> > Depending on the hash table quality there might be, say, 2 or 3
> > disk movements per URL search.
> >
> > Mike
> >
> > PS: This idea seems similar to the one Pete posted but the index is
> > managed in RAM so performance should be much better (at the expense of
> > huge amounts of RAM).
>
> I thought about that, with a hash table of a million entries on disk,
> and 15 million urls, if the hash function does its job, the average
> bucket size will be 15. Either way, (15 or 2) local disk accesses will
> probably be faster than reading a 100-byte url from the interweb.
>
> Pete

What would be the total size of the hash table? Do you have to create the entire hash table in RAM, or can it be built to a file as you go along?

Kat, I once devised a sorting solution for such large lists. I've never posted the process or exposed it to anyone. Shortly after I came up with it, I assumed it to be so simple in design that everyone must already know how to do it. I will post the idea later. I have other things to tend to. :/

unkmar
12. Re: 20gigabyte string searching
- Posted by jiri babor <jbabor at paradise.net.nz> Oct 04, 2004
- 465 views
Kat, my Euphoria is a bit rusty, I have not touched it for almost two years, but you might be interested in the following code. Let me know if you are, in principle, because a number of substantial optimizations are still possible. jiri
-- kat.ex : version 1.00
-- jiri babor
-- jbabor at paradise.net.nz
-- 04-10-04

include file.e                  -- dir
include misc.e                  -- sprint()

function search(sequence s, object key)
-- binary search of sorted sequence
-- return {flag, index}, where
-- if flag = 1 ... key found at index
-- if flag = 0 ... key not found, index is insertion point
    integer c,hi,lo,mid

    lo = 1
    hi = length(s)
    while lo <= hi do
        mid = floor((lo + hi) / 2)
        c = compare(key, s[mid])
        if c < 0 then           -- key < s[mid]
            hi = mid - 1
        elsif c > 0 then        -- key > s[mid]
            lo = mid + 1
        else                    -- key = s[mid]
            return {1,mid}
        end if
    end while
    return {0,lo}
end function

function pword(integer n)
-- pseudo word: random ASCII string
    sequence w

    w = {}
    for j = 1 to rand(n) do
        w &= rand(26)+96
    end for
    return w
end function

procedure hoard(sequence path, integer rid, sequence params, integer maxn)
-- path:   data directory
-- rid:    routine_id of function returning next address
-- params: parameter sequence of rid
-- maxn:   maximum number of addresses held in a single data file
    object a,x
    sequence data,last,lasts,name,names,newname,r
    integer count,f,f1,f2,hits,m,m1,n,r2

    count = 0                   -- address counter
    hits = 0                    -- address collision counter

    if path[length(path)] != '\\' then path &= '\\' end if
    x = dir(path)
    if atom(x) then
        puts(1,"Error: data directory does not exist. Aborting...\n")
        abort(1)
    else
        names = {}
        lasts = {}
        f1 = open(path & "names","r")
        f2 = open(path & "lasts","r")
        if f1 != -1 then        -- initialized database
            while 1 do
                x = gets(f1)
                if atom(x) then exit end if
                x = x[1..length(x)-1]       -- strip '\n'
                names = append(names, x)
            end while
            close(f1)
            lasts = {}
            while 1 do
                x = gets(f2)
                if atom(x) then exit end if
                x = x[1..length(x)-1]       -- strip '\n'
                lasts = append(lasts, x)
            end while
            close(f2)
        end if
        n = length(names)
    end if

    -- main loop
    while get_key() != 27 do    -- esc
        a = call_func(rid,params)
        count += 1
        r = search(lasts, a)
        r2 = r[2]
        if not r[1] then        -- not found among last addresses
            if r2 > n then      -- address > last address of last data file
                if n = 0 then   -- no data files yet
                    n = 1
                    name = "1"  -- new file name
                    f = open(path & name, "w")
                    puts(f, a & '\n')
                    close(f)
                    names = append(names, name)
                    lasts = append(lasts, a)
                else            -- append the last data file
                    name = names[n]
                    f = open(path & name, "a")
                    puts(f, a & '\n')
                    close(f)
                    lasts[n] = a
                end if
            else
                name = names[r2]
                data = {}
                f = open(path & name, "r")
                while 1 do
                    x = gets(f)
                    if atom(x) then exit end if
                    x = x[1..length(x)-1]   -- strip '\n'
                    data = append(data, x)
                end while
                close(f)
                r = search(data, a)
                if r[1] = 0 then            -- not found: insert
                    data = data[1..r[2]-1] & {a} & data[r[2]..length(data)]
                    -- write it out
                    m = length(data)
                    if m > maxn then        -- subdivide
                        m1 = floor(m/2)
                        newname = sprint(n+1)
                        names = names[1..r2-1] & {newname} & names[r2..n]
                        last = data[m1]
                        lasts = lasts[1..r2-1] & {last} & lasts[r2..n]
                        n += 1
                        f = open(path & newname, "w")
                        for i = 1 to m1 do
                            puts(f, data[i] & '\n')
                        end for
                        close(f)
                        data = data[m1+1..m]
                    end if
                    f = open(path & name, "w")
                    for i = 1 to length(data) do
                        puts(f, data[i] & '\n')
                    end for
                    close(f)
                else
                    hits += 1
                end if
            end if
        else
            hits += 1
        end if
    end while

    -- save control sequences
    f1 = open(path & "names", "w")
    for i = 1 to length(names) do
        puts(f1, names[i] & '\n')
    end for
    close(f1)
    f2 = open(path & "lasts", "w")
    for i = 1 to length(lasts) do
        puts(f2, lasts[i] & '\n')
    end for
    close(f2)

    printf(1, "%d addresses\n", count)
    printf(1, "%d collisions\n", hits)
end procedure

hoard("c:\\data", routine_id("pword"), {4}, 100)