1. RE: 20gigabyte string searching
- Posted by "Ricardo M. Forno" <rforno at uyuyuy.com> Oct 02, 2004
- 470 views
It depends on how many times you are going to search for a substring. If you need to search only once, the best way is to read the data into storage in pieces and perform the search on each piece. This also holds if you are only going to perform the search a few (how many?) times, or if you modify (append to, etc.) the huge file between a few searches.

In any other case, there are techniques that will help (sorting, hashing), but it all depends on the fine details of the search: is the length of the substring constant? Is the file partitioned into records (it seems it is, according to your description)? Are these records all of the same length? Etc.

Please give us a sample of the file and the substring, and the above-mentioned details.

Regards.

----- Original Message -----
From: Kat <gertie at visionsix.com>
To: <EUforum at topica.com>
Sent: Friday, October 01, 2004 9:10 PM
Subject: 20gigabyte string searching

> Serious question: what's the best way to search twenty gigabytes of text for
> a 100-byte substring, using Euphoria? Keep in mind Eu will blow up 20 gigs to
> 80 in RAM, and each copy made to munge it is another 80 gigs, so loading
> the file into a sequence isn't possible. I foresee tons of disk thrashing, as i
> don't have gigs of RAM laying around..
>
> The twenty gigs can be formed like this:
> {"string-1","string-2"...."string-n"}
> where each string-x is 20 to 120 chars long, most will be in the 100
> character neighborhood. Chances of them being sorted is low, as i don't see
> how Eu can be used to sort them in my lifetime.
> Or they can be like this:
> {string-1\nstring-2\nstring3\n...string-n}
>
> I have a list of 150 million string-xs laying here, and don't know the best way
> to put them together to solve the search problem. Flat sequence with
> separators, or nested sequence? Having a parallel sequence of what was
> found would be terribly nice, but would be equally huge in count (altho
> shouldn't be as big absolutely).
>
> Kat
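A minimal Euphoria sketch (not from the thread) of the read-the-data-by-pieces approach, assuming the 20 GB file uses Kat's second layout (one string per line, separated by \n); BIG_FILE and TARGET are placeholder names:

-- Stream the big file one line at a time and report every line that
-- contains the target substring. Only the current line is held in RAM.
constant BIG_FILE = "strings.txt"             -- placeholder file name
constant TARGET   = "some 100-byte substring" -- placeholder search string

integer fn, line_no
object line

fn = open(BIG_FILE, "r")
if fn = -1 then
    puts(2, "cannot open " & BIG_FILE & "\n")
    abort(1)
end if

line_no = 0
line = gets(fn)                  -- gets() returns -1 at end of file
while sequence(line) do
    line_no = line_no + 1
    if match(TARGET, line) then
        printf(1, "found on line %d\n", {line_no})
    end if
    line = gets(fn)
end while
close(fn)

This makes one sequential pass over the disk per search, which is about the best that can be done for a one-off search; the later posts in the thread are about avoiding repeated full passes.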
2. RE: 20gigabyte string searching
- Posted by "Kat" <gertie at visionsix.com> Oct 02, 2004
- 458 views
On 1 Oct 2004, at 23:56, Ricardo Forno wrote:

> It depends on how many times you are going to search for a substring.
> If you need to search only once, then the best way is to read the data by
> pieces into storage and perform the search on them.
> This holds true also if you are going to perform the search a few (how
> many?) times.

Basically, you need to know if you already have the url before you add it as a duplicate.

> The same in case you modify (append, etc.) the huge file between a few
> searches.
> In any other case, there are some techniques that will help (sorting,
> hashing), but it all depends on the fine details of the search (is the
> length of the substring constant? is the file partitioned into records (it
> seems it is, according to your description)? are these records all of the
> same length? etc.)

In any way i can think of, i have doubts whether Eu can do it in my lifetime on one computer of any MHz rating.

> Please give us a sample of the file and the substring, and the above
> mentioned details.

Ok, here's a leftover from something else:

http://www.wolfe.net/stroetl/index.html
http://astroamerica.com
http://www.spirit.satelnet.org/spirit/astro/overview.html
http://www.the-ultimate.com/space/astro.htm
http://www.astrologycom.com
http://www.dorrit.as.utexas.edu
http://www.nfra.nl
http://www.astronet.com
http://www.astronline.co.uk
http://www.marvel.stsci.edu/net-resources.html
http://astronomylinks.com
http://www.kalmback.com/astro/astronomy.html
http://www.astronomynow.com
http://www.skypub.com/links/astroweb.html
http://www.stsci.edu/net-resources.html

Now picture 1.5 million to 15 million urls coming in from the internet, as i build the list of the urls, and i don't wanna duplicate any. (btw, i've managed to handle lists of 350,000 urls before.) Then, i have to tag them as <done> once i do them. I cannot delete any, because then when another url is to be added (the substring), i might not have it as a duplicate listing. Plus, if i have <done> it, i want to have thought ahead and stored the page locally, so i want the location where i stored it. I expect i can build up a new list like that over 100,000 times. Someone else said it was definitely going to be 150,000 times, and they suggested 500,000 times. I know, it sounds too fantastic, but i am looking for something that won't be on one webpage anywhere. I can have the urls, but not the pages; i have to go get those myself. Hopefully, in the course of going thru such a list, i can do some short-circuiting of the amount, the list size, based on webpage content. It's a needle-in-the-haystack search.

Kat

> Regards.
> ----- Original Message -----
> From: Kat <gertie at visionsix.com>
> To: <EUforum at topica.com>
> Sent: Friday, October 01, 2004 9:10 PM
> Subject: 20gigabyte string searching
>
> > Serious question: what's the best way to search twenty gigabytes of text for
> > a 100-byte substring, using Euphoria? Keep in mind Eu will blow up 20 gigs to
> > 80 in RAM, and each copy made to munge it is another 80 gigs, so loading
> > the file into a sequence isn't possible. I foresee tons of disk thrashing, as i
> > don't have gigs of RAM laying around..
> >
> > The twenty gigs can be formed like this:
> > {"string-1","string-2"...."string-n"}
> > where each string-x is 20 to 120 chars long, most will be in the 100
> > character neighborhood. Chances of them being sorted is low, as i don't see
> > how Eu can be used to sort them in my lifetime.
> > Or they can be like this:
> > {string-1\nstring-2\nstring3\n...string-n}
> >
> > I have a list of 150 million string-xs laying here, and don't know the best way
> > to put them together to solve the search problem. Flat sequence with
> > separators, or nested sequence? Having a parallel sequence of what was
> > found would be terribly nice, but would be equally huge in count (altho
> > shouldn't be as big absolutely).
> >
> > Kat
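A minimal sketch (not from the thread) of the duplicate check Kat describes, assuming the URL list is kept one URL per line in a flat file; the file name argument and the function name add_if_new are illustrative only:

-- Return 1 if url is already in list_file, otherwise append it and return 0.
-- The list is scanned from disk, so nothing large is held in RAM, but every
-- lookup costs a full sequential pass over the file.
function add_if_new(sequence list_file, sequence url)
    integer fn
    object line
    fn = open(list_file, "r")
    if fn != -1 then
        line = gets(fn)
        while sequence(line) do
            -- gets() keeps the trailing newline, so trim it before comparing
            if length(line) > 0 and line[length(line)] = '\n' then
                line = line[1..length(line)-1]
            end if
            if equal(line, url) then
                close(fn)
                return 1                 -- already have this url
            end if
            line = gets(fn)
        end while
        close(fn)
    end if
    fn = open(list_file, "a")            -- new url: append it to the list
    puts(fn, url & "\n")
    close(fn)
    return 0
end function

At 15 million URLs that full pass per lookup is exactly the cost the later replies (per-letter bucket files, an in-RAM hash index) try to avoid.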
3. RE: 20gigabyte string searching
- Posted by Terry Constant <EUforum at terryconstant.com> Oct 02, 2004
- 451 views
Kat,

I am just replying in general to the things about your searching, not to any one particular post. My idea/question is: could you use multiple text files for your lists?

As you gather the lists, put them into files such as a.txt, b.txt, ..., z.txt (or whatever naming scheme works). Possibly(?) sort each file as you add to it (you could keep file sizes down by breaking into smaller chunks if needed, such as ..., mm.txt, mz.txt, ...). But even if you do not sort each file, you at least have a rough sort by first letter, which makes the size of an individual search smaller. The obvious advantage of sorting the individual files is that you can use binary search algorithms.

Let your search algorithm select the file to read. Thus, you do not have to search through 20 GB of text, and some of your problems go away. Further, much of the CPU time is spent incrementally as the text is added to the separate files.

I am sure that I don't fully grasp what you are doing, so I just offer this idea in case it is of some use.

Terry Constant
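A minimal sketch of this bucketed-files idea, assuming URLs are bucketed by the first character after an "http://" prefix and that any one bucket file is small enough to load and sort in RAM; the naming scheme and function names are illustrative only:

include sort.e                  -- for sort()

-- Pick a bucket file from the first character after any "http://" prefix,
-- e.g. "http://www.nfra.nl" -> "w.txt".
function bucket_name(sequence url)
    sequence key
    key = url
    if length(key) > 7 and equal(key[1..7], "http://") then
        key = key[8..length(key)]
    end if
    if length(key) = 0 then
        return "other.txt"
    end if
    return key[1..1] & ".txt"
end function

-- Load one bucket file into RAM and sort it so it can be binary searched.
function load_bucket(sequence file_name)
    integer fn
    object line
    sequence list
    list = {}
    fn = open(file_name, "r")
    if fn = -1 then
        return list
    end if
    line = gets(fn)
    while sequence(line) do
        if length(line) > 0 and line[length(line)] = '\n' then
            line = line[1..length(line)-1]
        end if
        list = append(list, line)
        line = gets(fn)
    end while
    close(fn)
    return sort(list)
end function

-- Standard binary search over a sorted sequence of strings.
-- Returns the index of url in sorted_list, or 0 if it is not present.
function binary_find(sequence sorted_list, sequence url)
    integer lo, hi, mid, c
    lo = 1
    hi = length(sorted_list)
    while lo <= hi do
        mid = floor((lo + hi) / 2)
        c = compare(url, sorted_list[mid])
        if c = 0 then
            return mid
        elsif c < 0 then
            hi = mid - 1
        else
            lo = mid + 1
        end if
    end while
    return 0
end function

A lookup is then binary_find(load_bucket(bucket_name(url)), url); in practice the loaded, sorted bucket would be cached between lookups rather than re-read every time.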
4. RE: 20gigabyte string searching
- Posted by Mike <vulcan at win.co.nz> Oct 02, 2004
- 468 views
- Last edited Oct 03, 2004
Kat wrote:

> Now picture 1.5 million to 15 million urls coming in from the internet, as i build
> the list of the urls, and i don't wanna duplicate any. (btw, i've managed to
> handle lists of 350,000 urls before.) Then, i have to tag them as <done> once
> i do them. I cannot delete any, because then when another url is to be added
> (the substring), i might not have it as a duplicate listing. Plus, if i have
> <done> it, i want to have thought ahead and stored the page locally, so i
> want the location where i stored it. I expect i can build up a new list like
> that over 100,000 times. Someone else said it was definitely going to be
> 150,000 times, and they suggested 500,000 times. I know, it sounds too
> fantastic, but i am looking for something that won't be on one webpage
> anywhere. I can have the urls, but not the pages; i have to go get those
> myself. Hopefully, in the course of going thru such a list, i can do some
> short-circuiting of the amount, the list size, based on webpage content. It's a
> needle-in-the-haystack search.
>
> Kat

It may be that I don't understand the problem, but why not simply use a hash table to index the sequential position of each URL in the file? An array can hold the actual disk position plus the "done" flag. For 15 million entries the whole index arrangement could occupy several hundred megabytes of RAM, difficult but doable if the resources are there.

A hash computation will retrieve either a single index or a small sequence of indexes. The index values will need to be translated into disk positions, and those strings actually read in again to confirm the search. Depending on the hash table quality there might be, say, 2 or 3 disk movements per URL search.

Mike

PS: This idea seems similar to the one Pete posted, but the index is managed in RAM so performance should be much better (at the expense of huge amounts of RAM).
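A minimal sketch of the in-RAM hash index Mike describes, assuming the URL list is an append-only file with one URL per line; the hash function, table size, and the {disk_position, done_flag} record layout are illustrative assumptions, not anything specified in the thread:

include file.e                  -- for seek() and where()

constant TABLE_SIZE = 3145739   -- a prime; tune to the expected URL count

-- buckets[h] holds a sequence of {disk_position, done_flag} pairs
sequence buckets
buckets = repeat({}, TABLE_SIZE)

-- Simple string hash, reduced every step so it stays in integer range.
function hash(sequence url)
    atom h
    h = 0
    for i = 1 to length(url) do
        h = remainder(h * 31 + url[i], TABLE_SIZE)
    end for
    return h + 1                -- Euphoria sequences are 1-based
end function

-- Read back the URL stored at a given file position, to confirm a hash hit.
function read_url_at(integer fn, atom pos)
    object line
    if seek(fn, pos) != 0 then  -- seek() reports 0 on success
        return -1
    end if
    line = gets(fn)
    if sequence(line) and length(line) > 0 and line[length(line)] = '\n' then
        line = line[1..length(line)-1]
    end if
    return line
end function

-- Look up url via the index; return its {disk_position, done_flag} entry,
-- or 0 if it has never been seen. fn is the URL file opened for reading.
function lookup(integer fn, sequence url)
    integer h
    sequence entry
    h = hash(url)
    for i = 1 to length(buckets[h]) do
        entry = buckets[h][i]
        -- roughly one disk movement per candidate in this bucket
        if equal(read_url_at(fn, entry[1]), url) then
            return entry
        end if
    end for
    return 0
end function

-- Append a new url to the file and record its disk position in the index.
-- fn_out is the same URL file opened in append mode.
procedure add_url(integer fn_out, sequence url)
    atom pos
    integer h
    pos = where(fn_out)         -- position the new line will start at
    puts(fn_out, url & "\n")
    h = hash(url)
    buckets[h] = append(buckets[h], {pos, 0})   -- 0 = not yet <done>
end procedure

Marking a URL <done> is then just updating the flag in its entry; only positions and flags live in RAM, and the per-entry overhead of small Euphoria sequences is roughly what pushes 15 million entries into the several-hundred-megabyte range Mike estimates.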