RE: 20gigabyte string searching
- Posted by "Kat" <gertie at visionsix.com> Oct 02, 2004
On 1 Oct 2004, at 23:56, Ricardo Forno wrote:

> It depends on how many times you are going to search for a substring.
> If you need to search only once, then the best way is to read the data
> by pieces into storage and perform the search on them.
> This holds true also if you are going to perform the search a few (how
> many?) times.

Basically, I need to know whether I already have each URL before I add
it, so I don't store duplicates.

> The same in case you modify (append, etc.) the huge file between a few
> searches.
> In any other case, there are some techniques that will help (sorting,
> hashing), but it all depends on the fine details of the search (is the
> length of the substring constant? is the file partitioned into records
> (it seems it is, according to your description)? are these records all
> of the same length? etc.)

Any way I can think of, I have doubts whether Eu can do it in my
lifetime on one computer of any MHz rating.

> Please give us a sample of the file and the substring, and the above
> mentioned details.

OK, here's a leftover from something else:

http://www.wolfe.net/stroetl/index.html
http://astroamerica.com
http://www.spirit.satelnet.org/spirit/astro/overview.html
http://www.the-ultimate.com/space/astro.htm
http://www.astrologycom.com
http://www.dorrit.as.utexas.edu
http://www.nfra.nl
http://www.astronet.com
http://www.astronline.co.uk
http://www.marvel.stsci.edu/net-resources.html
http://astronomylinks.com
http://www.kalmback.com/astro/astronomy.html
http://www.astronomynow.com
http://www.skypub.com/links/astroweb.html
http://www.stsci.edu/net-resources.html

Now picture 1.5 million to 15 million URLs coming in from the internet
as I build the list, and I don't want to duplicate any. (BTW, I've
managed lists of 350,000 URLs before.) Then I have to tag them as
<done> once I do them. I cannot delete any, because when another URL
arrives to be added (the substring), I might then fail to catch it as a
duplicate. Plus, once I have <done> one, I want to have thought ahead
and stored the page locally, so I also want the location where I stored
it. I expect to build a new list like that over 100,000 times. Someone
else said it would definitely be 150,000 times, and suggested as many
as 500,000. I know it sounds fantastic, but I am looking for something
that won't be on any one webpage anywhere. I can get the URLs, but not
the pages; I have to go fetch those myself. Hopefully, in the course of
going through such a list, I can short-circuit the list size based on
webpage content. It's a needle-in-a-haystack search.
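For the hashing idea above, here is a minimal sketch of the direction I
have in mind: a bucketed hash table, so each new URL is only compared
against one short chain instead of the whole list. The bucket count,
the hash function, and the {url, done-flag, local-path} record layout
are all just guesses to make the idea concrete, not anything tested at
this scale:

-- hypothetical sketch: URL dedup via a bucketed hash table
constant NBUCKETS = 65536      -- a guess; more buckets, shorter chains

sequence buckets
buckets = repeat({}, NBUCKETS)

function hash(sequence url)
    atom h
    h = 0
    for i = 1 to length(url) do
        h = remainder(h * 31 + url[i], NBUCKETS)
    end for
    return h + 1               -- Eu sequences are 1-based
end function

-- returns 1 if url was new (and is now stored), 0 if it was a duplicate
function add_url(sequence url)
    integer b
    b = hash(url)
    for i = 1 to length(buckets[b]) do
        if equal(buckets[b][i][1], url) then
            return 0           -- already listed
        end if
    end for
    -- record: {url, done flag, local path once the page is saved}
    buckets[b] = append(buckets[b], {url, 0, ""})
    return 1
end function

At 15 million 100-char URLs this still wants more RAM than I have (Eu
stores four bytes per character), but the same bucketing can be pushed
to disk, one small file per bucket, so a lookup only reads one short
file instead of all 20 gigs.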
Kat

> Regards.
> ----- Original Message -----
> From: Kat <gertie at visionsix.com>
> To: <EUforum at topica.com>
> Sent: Friday, October 01, 2004 9:10 PM
> Subject: 20gigabyte string searching
>
> > Serious question: what's the best way to search twenty gigabytes of
> > text for a 100-byte substring, using Euphoria? Keep in mind Eu will
> > blow up 20 gigs to 80 in RAM, and each copy made to munge it is
> > another 80 gigs, so loading the file into a sequence isn't possible.
> > I foresee tons of disk thrashing, as I don't have gigs of RAM lying
> > around.
> >
> > The twenty gigs can be formed like this:
> > {"string-1","string-2",...,"string-n"}
> > where each string-x is 20 to 120 chars long; most will be in the
> > 100-character neighborhood. Chances of them being sorted are low, as
> > I don't see how Eu can be used to sort them in my lifetime.
> > Or they can be like this:
> > {string-1\nstring-2\nstring-3\n...string-n}
> >
> > I have a list of 150 million string-xs lying here, and don't know
> > the best way to put them together to solve the search problem. Flat
> > sequence with separators, or nested sequence? Having a parallel
> > sequence of what was found would be terribly nice, but it would be
> > equally huge in count (although it shouldn't be as big absolutely).
> >
> > Kat
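For the search-once case, with the flat newline-separated form above,
Ricardo's read-by-pieces idea can work at line granularity: gets()
returns one line per call, so only one URL is ever in RAM at a time. A
rough sketch, with a hypothetical file name:

-- rough sketch: one linear pass over the flat newline-separated list
-- returns 1 if url is already in the file, 0 if not, -1 if open fails
function url_in_file(sequence fname, sequence url)
    integer fd
    object line
    fd = open(fname, "r")
    if fd = -1 then
        return -1
    end if
    line = gets(fd)
    while sequence(line) do        -- gets() returns -1 (an atom) at EOF
        if line[length(line)] = '\n' then
            line = line[1..length(line)-1]   -- strip trailing newline
        end if
        if equal(line, url) then
            close(fd)
            return 1               -- duplicate found
        end if
        line = gets(fd)
    end while
    close(fd)
    return 0                       -- safe to append
end function

At 20 gigs that is still one full disk pass per lookup, which is fine
once but not 15 million times; that is where sorting the file or the
hash buckets sketched earlier would earn their keep.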