1. 20gigabyte string searching

Serious question: what's the best way to search twenty gigabytes of text for 
a 100-byte substring, using Euphoria? Keep in mind Eu will blow up 20 gigs to 
80 in RAM, and each copy made to munge it is another 80 gigs, so loading 
the file into a sequence isn't possible. I foresee tons of disk thrashing, as i 
don't have gigs of RAM lying around..

The twenty gigs can be formed like this:
{"string-1","string-2"...."string-n"}
where each string-x is 20 to 120 chars long; most will be in the 100-character 
neighborhood. Chances of them being sorted are low, as i don't see 
how Eu can be used to sort them in my lifetime.
or they can be like this:
{string-1\nstring-2\nstring-3\n...string-n}

I have a list of 150 million string-xs lying here, and don't know the best way 
to put them together to solve the search problem. Flat sequence with 
separators, or nested sequence? Having a parallel sequence of what was 
found would be terribly nice, but would be equally huge in count (altho 
it shouldn't be as big in absolute size).

Kat


2. Re: 20gigabyte string searching

Kat wrote:
> 
> Serious question: what's the best way to search twenty gigabytes of text for 
> a 100-byte substring, using Euphoria?

Would it be easier to create an index (like a b-tree or something),
then use that for searching?

-=ck
"Programming in a state of EUPHORIA."
http://www.cklester.com/euphoria/


3. Re: 20gigabyte string searching

On 1 Oct 2004, at 18:21, cklester wrote:

> 
> 
> posted by: cklester <cklester at yahoo.com>
> 
> Kat wrote:
> > 
> > Serious question: what's the best way to search twenty gigabytes of text
> > for a 100-byte substring, using Euphoria?
> 
> Would it be easier to create an index (like a b-tree or something),
> then use that for searching?

Sure! Where? 
20 gigs in an index in an Eu sequence is 80 gigs.

Kat


4. Re: 20gigabyte string searching

People wrote:

> > > Serious question: what's the best way to search twenty gigabytes of text
> > > for a 100-byte substring, using Euphoria?
> > Would it be easier to create an index (like a b-tree or something),
> > then use that for searching?
> Sure! Where? 
> 20 gigs in an index in an Eu sequence is 80 gigs.

I mean like this:

{ "This", "Is an example", "Might not be", "Accurate" }

Index would be:

{
 {"A", {4}},
 {"B", {}},
...
 {"I", {2}},
...
 {"M", {3}},
...
 {"T", {1}}
}

So if you search for "Might not be," it searches the index first for 'M',
finds all instances in the list where the searchable items start with 'M'
and finds one at 3, then just tests the rest.

Or, even an index like this:

{ "T", "IAE", "MNB", "A" }

So if I search for "is an example," it looks for the first-letter-of-each-
word pattern of "IAE" and finds it.

Either one of these methods will reduce the required storage for a
regular index of words, plus it will speed up the searches significantly.
I think. I could be wrong. It's late. :)
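
Roughly what I have in mind, as a quick untested sketch (the names
build_index and find_in_index are just made up, and it assumes every
entry starts with a letter):

include wildcard.e                      -- upper()

function build_index(sequence list)
    -- one bucket per letter 'A'..'Z', each holding positions into list
    sequence index
    integer c
    index = repeat({}, 26)
    for i = 1 to length(list) do
        c = upper(list[i][1])
        if c >= 'A' and c <= 'Z' then
            index[c - 'A' + 1] &= i
        end if
    end for
    return index
end function

function find_in_index(sequence list, sequence index, sequence key)
    -- look only in the bucket for the key's first letter,
    -- then test the rest of each candidate
    sequence bucket
    bucket = index[upper(key[1]) - 'A' + 1]
    for i = 1 to length(bucket) do
        if compare(list[bucket[i]], key) = 0 then
            return bucket[i]            -- position in the original list
        end if
    end for
    return 0                            -- not found
end function

constant strings = { "This", "Is an example", "Might not be", "Accurate" }
? find_in_index(strings, build_index(strings), "Might not be")   -- prints 3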

-=ck
"Programming in a state of EUPHORIA."
http://www.cklester.com/euphoria/


5. Re: 20gigabyte string searching

On 1 Oct 2004, at 19:15, cklester wrote:

> 
> 
> posted by: cklester <cklester at yahoo.com>
> 
> People wrote:
> 
> > > > Serious question: what's the best way to search twenty gigabytes of text
> > > > for a 100-byte substring, using Euphoria?
> > > Would it be easier to create an index (like a b-tree or something),
> > > then use that for searching?
> > Sure! Where? 
> > 20 gigs in an index in an Eu sequence is 80 gigs.
> 
> I mean like this:
> 
> { "This", "Is an example", "Might not be", "Accurate" }
> 
> Index would be:
> 
> {
>  {"A", {4}},
>  {"B", {}},
> ...
>  {"I", {2}},
> ...
>  {"M", {3}},
> ...
>  {"T", {1}}
> }
> 
> So if you search for "Might not be," it searches the index first for 'M',
> finds all instances in the list where the searchable items start with 'M'
> and finds one at 3, then just tests the rest.
> 
> Or, even an index like this:
> 
> { "T", "IAE", "MNB", "A" }
> 
> So if I search for "is an example," it looks for the first-letter-of-each-
> word pattern of "IAE" and finds it.
> 
> Either one of these methods will reduce the required storage for a
> regular index of words, plus it will speed up the searches significantly.
> I think. I could be wrong. It's late. :)

Ok, you got the index; where's the 20-gig file you are indexing into?

Kat


6. Re: 20gigabyte string searching

Kat wrote:
> 
> Serious question: what's the best way to search twenty gigabytes of text for 
> a 100byte substring, using Euphoria? Keep in mind Eu will blow up 20gigs to 
> 80 in ram, and each copy made to munge it is another 80 gigs, so loading 
> the file into a sequence isn't possible. I foresee tons of disk thrashing, as
> i
> don't have gigs of ram laying around..
> 
> The two gigs can be formed like this:
> {"string-1","string-2"...."string-n"}
> where each string-x is 20 to 120 chars long, most will be in the 100 
> character neighborhood. Chances of them being sorted is low, as i don't see 
> how Eu can be used to sort them in my lifetime.
> or they can be like this:
> {string-1\nstring-2\nstring3\n...string-n}
> 
> I have a list of 150 million string-xs laying here, and don't know the best
> way
> to put them together so solve the search problem. Flat sequence with 
> separators, or nested sequence? Having a parallel sequence of what was 
> found would be terribly nice, but would be equally huge in count (altho 
> shouldn't be as big absolutely).

I guess I don't understand your situation, but why do you need to have 
the whole 20 gigs in RAM at once? I just did a test and I can do a brute-
force string search over 20 gigs in about 70 minutes, using no more
than a meg of RAM.
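
A chunked scan along these lines is all it takes (a rough, untested sketch
from memory, not the exact code I ran; the chunk size is arbitrary):

include get.e                           -- get_bytes()

constant CHUNK = 64*1024                -- read 64K at a time; size is arbitrary

function scan_file(sequence filename, sequence key)
    integer fn, hit
    atom offset
    sequence buf, chunk
    fn = open(filename, "rb")
    if fn = -1 then
        return -1                       -- couldn't open the file
    end if
    buf = {}
    offset = 0                          -- bytes discarded before buf[1]
    while 1 do
        chunk = get_bytes(fn, CHUNK)
        if length(chunk) = 0 and length(buf) < length(key) then
            exit                        -- nothing left that could match
        end if
        buf &= chunk
        hit = match(key, buf)
        if hit then
            close(fn)
            return offset + hit         -- byte position of the match
        end if
        if length(chunk) = 0 then
            exit
        end if
        if length(buf) > length(key) then
            -- keep only the tail that could start a match spanning chunks
            offset += length(buf) - (length(key) - 1)
            buf = buf[length(buf) - (length(key) - 2) .. length(buf)]
        end if
    end while
    close(fn)
    return 0                            -- not found
end function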

Anyhow, if you've got the disk space you can index things a lot. This is
basic stuff for databases. You might need about 40-60 gigs to index the
words, but it will speed up searching 100-1000 times.

-- 
Derek Parnell
Melbourne, Australia


7. Re: 20gigabyte string searching

On Fri, 1 Oct 2004 23:38:00 -0500, Kat <gertie at visionsix.com> wrote:

>Basically, you need to know if you already have the url before you add it as a 
>duplicate.
<snip>
>> Please give us a sample of the file and the substring, and the above
>> mentioned details.
>
>Ok, here's a leftover from something else:
>
>http://www.wolfe.net/stroetl/index.html
>http://astroamerica.com
<snip>

If I wanted to minimize disk space and memory usage, this is what I
would do:
1) I would store all the urls in a plain text file, with the rule that
each line added must begin with "http://". I would use that specific
value as an "end of chain" marker, much like 0 is traditionally used;
otherwise I would overwrite those 6 characters with a link to the next
line. This url file is "add only", for now.
2) I would hash the url, turning it into a single positive integer
between 1 and 1048576. For this I would use jiri babor's hash()
function [just that, ignoring the table handling part] and probably
use one of Jeurgen's bit operators to shift the 32-bit result from
hash() down to the 20-bit key I need.
3) I would build a hash lookup file, again using a 6-character lookup
key to my main url text file, and initialise it with "http://"'s.
The hash lookup file will be 6MB, for 1048576 keys.
4) For lookup keys, I would use a base 64 encoding (0-9,A-Z,a-z) which
limits the url file to approx 68G, or base 94 (! to ~, but not /),
which limits the url file to approx 689G. My base64 encoding
contribution could be used/adapted for this.

As an example, suppose hash("http://www.spirit.satelnet.org")
returns 968961102, which I shift right to get 946251.

I seek(946251*6) and read 6 bytes from the hash file. If it is
"http://" then the bucket is empty and the url needs to be added, in
which case I where() the url-text file, encode it and overwrite the
6-byte entry in the hash file. If it is not "http://" then I decode
it, seek, and gets() a line from the url-text file. If it matches, of
course, I don't want to add this url again; otherwise I repeat the
"http://" check on the start of the line. If I get to the end of the
chain, then I have to overwrite the start of the last entry with a
6-byte encoded link to the line I am about to add.

Basically, with that amount of data I would hold none of it in memory
whatsoever; if there is any caching to be done, let the OS handle it.
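
Untested, but the lookup half of the above would go something like this.
toy_hash() and the B64 alphabet are just stand-ins for jiri's hash() and a
real base-64 encoding, and I'm treating the first 6 bytes of "http://" as
the 6-byte marker/link field at the start of each line:

include file.e                          -- seek()

constant MARKER = "http:/"              -- first 6 bytes of an untouched line
constant B64 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/"

function toy_hash(sequence url)         -- stand-in for hash() plus the shift
    atom h
    h = 0
    for i = 1 to length(url) do
        h = remainder(h * 31 + url[i], 1048576)
    end for
    return h                            -- 20-bit key, 0..1048575
end function

function decode6(sequence s)            -- 6 base-64 chars -> url-file offset
    atom n
    n = 0
    for i = 1 to 6 do
        n = n * 64 + find(s[i], B64) - 1
    end for
    return n
end function

function have_url(integer hashfn, integer urlfn, sequence url)
    -- return 1 if url is already in the url-text file, 0 if it needs adding
    sequence entry, line
    integer void
    void = seek(hashfn, toy_hash(url) * 6)
    entry = {}
    for i = 1 to 6 do
        entry &= getc(hashfn)
    end for
    while 1 do
        if equal(entry, MARKER) then
            return 0                    -- empty bucket, or end of chain
        end if
        void = seek(urlfn, decode6(entry))
        line = gets(urlfn)
        line = line[1..length(line)-1]  -- strip '\n'
        if equal(line, url) then
            return 1                    -- duplicate, don't add it again
        end if
        entry = line[1..6]              -- overwritten prefix = link to next line
    end while
end function

The add path is the mirror image: where() the url file, base-64 encode that
position, and write those 6 bytes over the end-of-chain entry.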

Regards,
Pete


8. Re: 20gigabyte string searching

Kat,

One other thought. You mentioned that 20 gb of text becomes 80 gb
as a sequence. If size is a major issue, consider using regular
expressions. If the data you are searching remains as a single string
of text data, you will not have the size increase you mention.

Karl Bochert submitted regex.zip on 2002-04-23. I have used his routines
some and they are quite usable.

In short, using regular expressions on single large strings could 
help with the size issues.


Terry Constant


9. Re: 20gigabyte string searching

Kat wrote:
> 
> Serious question: what's the best way to search twenty gigabytes of text for 
> a 100-byte substring, using Euphoria? Keep in mind Eu will blow up 20 gigs to 
> 80 in RAM, and each copy made to munge it is another 80 gigs, so loading 
> the file into a sequence isn't possible. I foresee tons of disk thrashing, as i 
> don't have gigs of RAM lying around..
> 
> The twenty gigs can be formed like this:
> {"string-1","string-2"...."string-n"}
> where each string-x is 20 to 120 chars long; most will be in the 100-character 
> neighborhood. Chances of them being sorted are low, as i don't see 
> how Eu can be used to sort them in my lifetime.
> or they can be like this:
> {string-1\nstring-2\nstring-3\n...string-n}
> 
> I have a list of 150 million string-xs lying here, and don't know the best way 
> to put them together to solve the search problem. Flat sequence with 
> separators, or nested sequence? Having a parallel sequence of what was 
> found would be terribly nice, but would be equally huge in count (altho 
> it shouldn't be as big in absolute size).
> 
> Kat
> 
> 
Bear with me as i think; picture what i am trying to say.
Advanced users, forgive me for my lack of tech terms or knowledge.

If the original file is too big or time-consuming to sort (or keep sorted),
then a hash table could be used, and it would be easier to sort and maintain.
Someone else mentioned this too.

A unique number for each record could be a year/time (a julian date, but how to
calculate it?), or the birthdate (creation date?) of the url, which is unique.
But how to connect the two together in an index when only the human-readable
url is known?
Build a key similar to your own internet address: a set of 4 numbers generated
by 4 different lists (multi-dimensional).

How to translate a human-readable url into a set of numbers?
A url is punctuated with slashes and periods, which could be used to break the
url down into lists.
The lists would contain only the data for that particular position in the url
and the generated address data.
Searching 3 or 4 lists might be faster than sorting/inserting into the original
database.

url = www.xxxx\yyyy\zzzz, read as url = www.dim1\dim2\dim3
xxxx = list 1   --- the actual url data is the key for this list
yyyy = list 2
zzzz = list 3

For each url in the database:
list 1 is the list containing the url info (of all urls in the database) at that
position (xxxx) in the url.
List 2 is for the yyyy url data; list 3 is for the zzzz data in the url.
Using the data, search list 1 for the corresponding record code.
Do the same for each list.
The built-up address would look like 04.06.10 (note the use of leading zeros).
This will point to the record in the hash table that contains the needed
data (a record pointer to the original database record).
The key in these smaller lists is the url info; the data is the record number
of this url info inside the list.
These smaller lists can be sorted.
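
Something like this untested little sketch is all i mean for the splitting
step (split_url is just a made-up name):

function split_url(sequence url)
    -- break a url into the pieces that would key each per-position list
    sequence parts, part
    integer c
    parts = {}
    part = {}
    for i = 1 to length(url) do
        c = url[i]
        if c = '/' or c = '\\' or c = '.' then
            if length(part) then
                parts = append(parts, part)
                part = {}
            end if
        else
            part &= c
        end if
    end for
    if length(part) then
        parts = append(parts, part)
    end if
    return parts
end function

-- split_url("www.xxxx/yyyy/zzzz") gives {"www","xxxx","yyyy","zzzz"};
-- each piece would then be looked up in (or added to) its own sorted list,
-- and the list positions joined into a key like "04.06.10"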

i hope i'm clear

rudy


10. Re: 20gigabyte string searching

On Sat,  2 Oct 2004 22:26:26 +0000, Mike <vulcan at win.co.nz> wrote:

>Depending on the hash table quality there might be, say, 2 or 3 
>disk movements per URL search.
>
>Mike
>
>PS: This idea seems similar to the one Pete posted but the index is 
>managed in RAM so performance should be much better (at the expense of 
>huge amounts of RAM).

I thought about that. With a hash table of a million entries on disk
and 15 million urls, if the hash function does its job, the average
bucket size will be 15. Either way, (15 or 2) local disk accesses will
probably be faster than reading a 100-byte url from the interweb.

Pete


11. Re: 20gigabyte string searching

----- Original Message ----- 
From: "Pete Lomax"
Sent: Sunday, October 03, 2004 5:38 AM
Subject: Re: 20gigabyte string searching


> 
> 
> On Sat,  2 Oct 2004 22:26:26 +0000, Mike <vulcan at win.co.nz> wrote:
> 
> >Depending on the hash table quality there might be, say, 2 or 3 
> >disk movements per URL search.
> >
> >Mike
> >
> >PS: This idea seems similar to the one Pete posted but the index is 
> >managed in RAM so performance should be much better (at the expense of 
> >huge amounts of RAM).
> 
> I thought about that, with a hash table of a million entries on disk,
> and 15 million urls, if the hash function does its job, the average
> bucket size will be 15. Either way, (15 or 2) local disk accesses will
> probably be faster than reading a 100-byte url from the interweb.
> 
> Pete
> 

What would be the total size of the hash table?
Do you have to create the entire hash table in RAM or can it be built
to file as you go along?

Kat, I once devised a sorting solution for such large lists.  I've never
posted the process or exposed it to anyone.  Shortly after I came up with
it, I assumed it to be so simple in design that everyone must already know
how to do it.  I will post the idea later.  I have other things to tend to. :/

    unkmar


12. Re: 20gigabyte string searching

Kat, my Euphoria is a bit rusty (I have not touched it for almost two years),
but you might be interested in the following code.
Let me know if you are, in principle, because a number of substantial
optimizations are still possible.

jiri

--  kat.ex : version 1.00
--  jiri babor
--  jbabor at paradise.net.nz
--  04-10-04

include file.e                          -- dir()
include misc.e                          -- sprint()

function search(sequence s, object key)
    --  binary search of sorted sequence
    --  return {flag, index}, where
    --      if flag = 1 ... key found at index 
    --      if flag = 0 ... key not found, index is insertion point

    integer c,hi,lo,mid

    lo = 1
    hi = length(s)

    while lo <= hi do
        mid = floor((lo + hi) / 2)
        c = compare(key, s[mid])
        if    c < 0 then                -- key < s[mid]
            hi = mid - 1
        elsif c > 0 then                -- key > s[mid]
            lo = mid + 1
        else                            -- key = s[mid]
            return {1,mid}
        end if
    end while

    return {0,lo}
end function

function pword(integer n)               -- pseudo word: random ASCII string 
    sequence w

    w = {}
    for j = 1 to rand(n) do
        w &= rand(26)+96
    end for
    return w
end function

procedure hoard(sequence path, integer rid, sequence params, integer maxn)
    --  path: data directory
    --  rid: routine_id of function returning next address
    --  params: parameter sequence of rid
    --  maxn: maximum number of addresses held in a single file

    object a,x
    sequence data,last,lasts,name,names,newname,r
    integer count,f,f1,f2,hits,m,m1,n,r2

    count = 0                           -- address counter
    hits = 0                            -- address collision counter

    if path[length(path)] != '\\' then
        path &= '\\'
    end if

    x = dir(path)
    if atom(x) then
        puts(1,"Error: data directory does not exist. Aborting...\n")
        abort(1)
    else
        names = {}
        lasts = {}

        f1 = open(path & "names","r")
        f2 = open(path & "lasts","r")

        if f1 != -1 then                -- initialized database
            while 1 do
                x = gets(f1)
                if atom(x) then exit end if
                x = x[1..length(x)-1]   -- strip '\n'
                names = append(names, x)
            end while
            close(f1)

            lasts = {}
            while 1 do
                x = gets(f2)
                if atom(x) then exit end if
                x = x[1..length(x)-1]   -- strip '\n'
                lasts = append(lasts, x)
            end while
            close(f2)
        end if
        n = length(names)
    end if

    -- main loop
    while get_key() != 27 do            -- esc
        a = call_func(rid,params)
        count += 1
        r = search(lasts, a)
        r2 = r[2]
        if not r[1] then                -- not found among last addresses
            if r2 > n then              -- address > last address of last data file
                if n = 0 then           -- no data files yet
                    n = 1
                    name = "1"          -- new file name
                    f = open(path & name, "w")
                    puts(f, a & '\n')
                    close(f)
                    names = append(names, name)
                    lasts = append(lasts, a)
                else                    -- append the last data file
                    name = names[n]
                    f = open(path & name, "a")
                    puts(f, a & '\n')
                    close(f)
                    lasts[n] = a
                end if
            else
                name = names[r2]
                data = {}
                f = open(path & name, "r")
                while 1 do
                    x = gets(f)
                    if atom(x) then exit end if
                    x = x[1..length(x)-1]   -- strip '\n'
                    data = append(data, x)
                end while
                close(f)
                r = search(data, a)
                if r[1] = 0 then        -- not found: insert
                    data = data[1..r[2]-1] & {a} & data[r[2]..length(data)]
                    -- write it out
                    m = length(data)
                    if m > maxn then    -- subdivide
                        m1 = floor(m/2)
                        newname = sprint(n+1)
                        names = names[1..r2-1] & {newname} & names[r2..n]
                        last = data[m1]
                        lasts = lasts[1..r2-1] & {last} & lasts[r2..n]
                        n += 1
                        f = open(path & newname, "w")
                        for i = 1 to m1 do
                            puts(f, data[i] & '\n')
                        end for
                        close(f)
                        data = data[m1+1..m]
                    end if
                    f = open(path & name, "w")
                    for i = 1 to length(data) do
                        puts(f, data[i] & '\n')
                    end for
                    close(f)
                else
                    hits += 1
                end if
            end if
        else
            hits += 1
        end if
    end while

    -- save control sequences
    f1 = open(path & "names", "w")
    for i = 1 to length(names) do
        puts(f1, names[i] & '\n')
    end for
    close(f1)
    f2 = open(path & "lasts", "w")
    for i = 1 to length(lasts) do
        puts(f2, lasts[i] & '\n')
    end for
    close(f2)

    printf(1, "%d addresses\n", count)
    printf(1, "%d collisions\n", hits)
end procedure

hoard("c:\\data", routine_id("pword"), {4}, 100)

