1. RE: 20gigabyte string searching
- Posted by "Ricardo M. Forno" <rforno at uyuyuy.com> Oct 02, 2004
- 470 views
It depends on how many times you are going to search for a substring. If you need to search only once, the best way is to read the data into storage in pieces and perform the search on each piece. This also holds if you are only going to perform the search a few (how many?) times, or if you modify (append to, etc.) the huge file between a few searches.

In any other case, there are techniques that will help (sorting, hashing), but it all depends on the fine details of the search: is the length of the substring constant? Is the file partitioned into records (it seems it is, according to your description)? Are these records all of the same length? Etc.

Please give us a sample of the file and the substring, and the above-mentioned details.

Regards.

----- Original Message -----
From: Kat <gertie at visionsix.com>
To: <EUforum at topica.com>
Sent: Friday, October 01, 2004 9:10 PM
Subject: 20gigabyte string searching

> Serious question: what's the best way to search twenty gigabytes of text for
> a 100-byte substring, using Euphoria? Keep in mind Eu will blow up 20 gigs to
> 80 in RAM, and each copy made to munge it is another 80 gigs, so loading
> the file into a sequence isn't possible. I foresee tons of disk thrashing, as i
> don't have gigs of RAM laying around..
>
> The twenty gigs can be formed like this:
> {"string-1","string-2"...."string-n"}
> where each string-x is 20 to 120 chars long, most will be in the 100
> character neighborhood. Chances of them being sorted is low, as i don't see
> how Eu can be used to sort them in my lifetime.
> Or they can be like this:
> {string-1\nstring-2\nstring3\n...string-n}
>
> I have a list of 150 million string-xs laying here, and don't know the best way
> to put them together to solve the search problem. Flat sequence with
> separators, or nested sequence? Having a parallel sequence of what was
> found would be terribly nice, but would be equally huge in count (altho
> shouldn't be as big absolutely).
>
> Kat
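A minimal Euphoria sketch (not from the thread) of the read-the-data-by-pieces approach, assuming the 20 GB file uses Kat's second layout (one string per line, separated by \n); BIG_FILE and TARGET are placeholder names:

-- Stream the big file one line at a time and report every line that
-- contains the target substring. Only the current line is held in RAM.
constant BIG_FILE = "strings.txt"             -- placeholder file name
constant TARGET   = "some 100-byte substring" -- placeholder search string

integer fn, line_no
object line

fn = open(BIG_FILE, "r")
if fn = -1 then
    puts(2, "cannot open " & BIG_FILE & "\n")
    abort(1)
end if

line_no = 0
line = gets(fn)                  -- gets() returns -1 at end of file
while sequence(line) do
    line_no = line_no + 1
    if match(TARGET, line) then
        printf(1, "found on line %d\n", {line_no})
    end if
    line = gets(fn)
end while
close(fn)

This makes one sequential pass over the disk per search, which is about the best that can be done for a one-off search; the later posts in the thread are about avoiding repeated full passes.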
2. RE: 20gigabyte string searching
- Posted by "Kat" <gertie at visionsix.com> Oct 02, 2004
- 458 views
On 1 Oct 2004, at 23:56, Ricardo Forno wrote:

> It depends on how many times you are going to search for a substring.
> If you need to search only once, then the best way is to read the data by
> pieces into storage and perform the search on them.
> This holds true also if you are going to perform the search a few (how
> many?) times.

Basically, you need to know if you already have the url before you add it as a duplicate.

> The same in case you modify (append, etc.) the huge file between a few
> searches.
> In any other case, there are some techniques that will help (sorting,
> hashing), but it all depends on the fine details of the search (is the
> length of the substring constant? is the file partitioned into records (it
> seems it is, according to your description)? are these records all of the
> same length? etc.)

In any way i can think of, i have doubts whether Eu can do it in my lifetime on one computer of any MHz rating.

> Please give us a sample of the file and the substring, and the above
> mentioned details.

Ok, here's a leftover from something else:

http://www.wolfe.net/stroetl/index.html
http://astroamerica.com
http://www.spirit.satelnet.org/spirit/astro/overview.html
http://www.the-ultimate.com/space/astro.htm
http://www.astrologycom.com
http://www.dorrit.as.utexas.edu
http://www.nfra.nl
http://www.astronet.com
http://www.astronline.co.uk
http://www.marvel.stsci.edu/net-resources.html
http://astronomylinks.com
http://www.kalmback.com/astro/astronomy.html
http://www.astronomynow.com
http://www.skypub.com/links/astroweb.html
http://www.stsci.edu/net-resources.html

Now picture 1.5 million to 15 million urls coming in from the internet, as i build the list of the urls, and i don't wanna duplicate any. (btw, i've managed to handle lists of 350,000 urls before.) Then, i have to tag them as <done> once i do them. I cannot delete any, because then when another url is to be added (the substring), i might not have it as a duplicate listing. Plus, if i have <done> it, i want to have thought ahead and stored the page locally, so i want the location where i stored it. I expect i can build up a new list like that over 100,000 times. Someone else said it was definitely going to be 150,000 times, and they suggested 500,000 times. I know, it sounds too fantastic, but i am looking for something that won't be on one webpage anywhere. I can have the urls, but not the pages; i have to go get those myself. Hopefully, in the course of going thru such a list, i can do some short-circuiting of the amount, the list size, based on webpage content. It's a needle-in-the-haystack search.

Kat

> Regards.
> ----- Original Message -----
> From: Kat <gertie at visionsix.com>
> To: <EUforum at topica.com>
> Sent: Friday, October 01, 2004 9:10 PM
> Subject: 20gigabyte string searching
>
> > Serious question: what's the best way to search twenty gigabytes of text for
> > a 100-byte substring, using Euphoria? Keep in mind Eu will blow up 20 gigs to
> > 80 in RAM, and each copy made to munge it is another 80 gigs, so loading
> > the file into a sequence isn't possible. I foresee tons of disk thrashing, as i
> > don't have gigs of RAM laying around..
> >
> > The twenty gigs can be formed like this:
> > {"string-1","string-2"...."string-n"}
> > where each string-x is 20 to 120 chars long, most will be in the 100
> > character neighborhood. Chances of them being sorted is low, as i don't see
> > how Eu can be used to sort them in my lifetime.
> > Or they can be like this:
> > {string-1\nstring-2\nstring3\n...string-n}
> >
> > I have a list of 150 million string-xs laying here, and don't know the best way
> > to put them together to solve the search problem. Flat sequence with
> > separators, or nested sequence? Having a parallel sequence of what was
> > found would be terribly nice, but would be equally huge in count (altho
> > shouldn't be as big absolutely).
> >
> > Kat
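A minimal sketch (not from the thread) of the duplicate check Kat describes, assuming the URL list is kept one URL per line in a flat file; the file name argument and the function name add_if_new are illustrative only:

-- Return 1 if url is already in list_file, otherwise append it and return 0.
-- The list is scanned from disk, so nothing large is held in RAM, but every
-- lookup costs a full sequential pass over the file.
function add_if_new(sequence list_file, sequence url)
    integer fn
    object line
    fn = open(list_file, "r")
    if fn != -1 then
        line = gets(fn)
        while sequence(line) do
            -- gets() keeps the trailing newline, so trim it before comparing
            if length(line) > 0 and line[length(line)] = '\n' then
                line = line[1..length(line)-1]
            end if
            if equal(line, url) then
                close(fn)
                return 1                 -- already have this url
            end if
            line = gets(fn)
        end while
        close(fn)
    end if
    fn = open(list_file, "a")            -- new url: append it to the list
    puts(fn, url & "\n")
    close(fn)
    return 0
end function

At 15 million URLs that full pass per lookup is exactly the cost the later replies (per-letter bucket files, an in-RAM hash index) try to avoid.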
3. RE: 20gigabyte string searching
- Posted by Terry Constant <EUforum at terryconstant.com> Oct 02, 2004
- 451 views
Kat,

I am just replying in general to the things about your searching, not to any one particular post. My idea/question is: could you use multiple text files for your lists?

As you gather the lists, put them into files such as a.txt, b.txt, ..., z.txt (or whatever naming scheme works). Possibly(?) sort each file as you add to it (you could keep file sizes down by breaking into smaller chunks if needed, such as ..., mm.txt, mz.txt, ...). But even if you do not sort each file, you at least have a rough sort by first letter, which makes the size of an individual search smaller. The obvious advantage of sorting the individual files is that you can use binary search algorithms.

Let your search algorithm select the file to read. Thus, you do not have to search through 20 GB of text, and some of your problems go away. Further, much of the CPU time is spent incrementally as the text is added to the separate files.

I am sure that I don't fully grasp what you are doing, so I just offer this idea in case it is of some use.

Terry Constant
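A minimal sketch of this bucketed-files idea, assuming URLs are bucketed by the first character after an "http://" prefix and that any one bucket file is small enough to load and sort in RAM; the naming scheme and function names are illustrative only:

include sort.e                  -- for sort()

-- Pick a bucket file from the first character after any "http://" prefix,
-- e.g. "http://www.nfra.nl" -> "w.txt".
function bucket_name(sequence url)
    sequence key
    key = url
    if length(key) > 7 and equal(key[1..7], "http://") then
        key = key[8..length(key)]
    end if
    if length(key) = 0 then
        return "other.txt"
    end if
    return key[1..1] & ".txt"
end function

-- Load one bucket file into RAM and sort it so it can be binary searched.
function load_bucket(sequence file_name)
    integer fn
    object line
    sequence list
    list = {}
    fn = open(file_name, "r")
    if fn = -1 then
        return list
    end if
    line = gets(fn)
    while sequence(line) do
        if length(line) > 0 and line[length(line)] = '\n' then
            line = line[1..length(line)-1]
        end if
        list = append(list, line)
        line = gets(fn)
    end while
    close(fn)
    return sort(list)
end function

-- Standard binary search over a sorted sequence of strings.
-- Returns the index of url in sorted_list, or 0 if it is not present.
function binary_find(sequence sorted_list, sequence url)
    integer lo, hi, mid, c
    lo = 1
    hi = length(sorted_list)
    while lo <= hi do
        mid = floor((lo + hi) / 2)
        c = compare(url, sorted_list[mid])
        if c = 0 then
            return mid
        elsif c < 0 then
            hi = mid - 1
        else
            lo = mid + 1
        end if
    end while
    return 0
end function

A lookup is then binary_find(load_bucket(bucket_name(url)), url); in practice the loaded, sorted bucket would be cached between lookups rather than re-read every time.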
4. RE: 20gigabyte string searching
- Posted by Mike <vulcan at win.co.nz> Oct 02, 2004
- 468 views
- Last edited Oct 03, 2004
Kat wrote:

> Now picture 1.5 million to 15 million urls coming in from the internet, as i build
> the list of the urls, and i don't wanna duplicate any. (btw, i've managed to
> handle lists of 350,000 urls before.) Then, i have to tag them as <done> once
> i do them. I cannot delete any, because then when another url is to be added
> (the substring), i might not have it as a duplicate listing. Plus, if i have
> <done> it, i want to have thought ahead and stored the page locally, so i
> want the location where i stored it. I expect i can build up a new list like
> that over 100,000 times. Someone else said it was definitely going to be
> 150,000 times, and they suggested 500,000 times. I know, it sounds too
> fantastic, but i am looking for something that won't be on one webpage
> anywhere. I can have the urls, but not the pages; i have to go get those
> myself. Hopefully, in the course of going thru such a list, i can do some
> short-circuiting of the amount, the list size, based on webpage content. It's a
> needle-in-the-haystack search.
>
> Kat

It may be that I don't understand the problem, but why not simply use a hash table to index the sequential position of each URL in the file? An array can hold the actual disk position plus the "done" flag. For 15 million entries the whole index arrangement could occupy several hundred megabytes of RAM, difficult but doable if the resources are there.

A hash computation will retrieve either a single index or a small sequence of indexes. The index values will need to be translated into disk positions, and those strings actually read in again to confirm the search. Depending on the hash table quality there might be, say, 2 or 3 disk movements per URL search.

Mike

PS: This idea seems similar to the one Pete posted, but the index is managed in RAM so performance should be much better (at the expense of huge amounts of RAM).
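A minimal sketch of the in-RAM hash index Mike describes, assuming the URL list is an append-only file with one URL per line; the hash function, table size, and the {disk_position, done_flag} record layout are illustrative assumptions, not anything specified in the thread:

include file.e                  -- for seek() and where()

constant TABLE_SIZE = 3145739   -- a prime; tune to the expected URL count

-- buckets[h] holds a sequence of {disk_position, done_flag} pairs
sequence buckets
buckets = repeat({}, TABLE_SIZE)

-- Simple string hash, reduced every step so it stays in integer range.
function hash(sequence url)
    atom h
    h = 0
    for i = 1 to length(url) do
        h = remainder(h * 31 + url[i], TABLE_SIZE)
    end for
    return h + 1                -- Euphoria sequences are 1-based
end function

-- Read back the URL stored at a given file position, to confirm a hash hit.
function read_url_at(integer fn, atom pos)
    object line
    if seek(fn, pos) != 0 then  -- seek() reports 0 on success
        return -1
    end if
    line = gets(fn)
    if sequence(line) and length(line) > 0 and line[length(line)] = '\n' then
        line = line[1..length(line)-1]
    end if
    return line
end function

-- Look up url via the index; return its {disk_position, done_flag} entry,
-- or 0 if it has never been seen. fn is the URL file opened for reading.
function lookup(integer fn, sequence url)
    integer h
    sequence entry
    h = hash(url)
    for i = 1 to length(buckets[h]) do
        entry = buckets[h][i]
        -- roughly one disk movement per candidate in this bucket
        if equal(read_url_at(fn, entry[1]), url) then
            return entry
        end if
    end for
    return 0
end function

-- Append a new url to the file and record its disk position in the index.
-- fn_out is the same URL file opened in append mode.
procedure add_url(integer fn_out, sequence url)
    atom pos
    integer h
    pos = where(fn_out)         -- position the new line will start at
    puts(fn_out, url & "\n")
    h = hash(url)
    buckets[h] = append(buckets[h], {pos, 0})   -- 0 = not yet <done>
end procedure

Marking a URL <done> is then just updating the flag in its entry; only positions and flags live in RAM, and the per-entry overhead of small Euphoria sequences is roughly what pushes 15 million entries into the several-hundred-megabyte range Mike estimates.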