1. 20gigabyte string searching
Serious question: what's the best way to search twenty gigabytes of text for
a 100-byte substring, using Euphoria? Keep in mind Eu will blow up 20 gigs to
80 in RAM, and each copy made to munge it is another 80 gigs, so loading
the file into a sequence isn't possible. I foresee tons of disk thrashing, as I
don't have gigs of RAM lying around.
The twenty gigs can be formed like this:
{"string-1","string-2",...,"string-n"}
where each string-x is 20 to 120 chars long; most will be in the 100-
character neighborhood. Chances of them being sorted are low, as I don't see
how Eu can be used to sort them in my lifetime.
Or they can be like this:
{string-1\nstring-2\nstring-3\n...string-n}
I have a list of 150 million string-xs lying here, and don't know the best way
to put them together to solve the search problem. Flat sequence with
separators, or nested sequence? Having a parallel sequence of what was
found would be terribly nice, but it would be equally huge in count (although
it shouldn't be as big absolutely).
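Just to make the two layouts concrete, a tiny made-up example of each:

-- nested: one sequence element per string
sequence nested
nested = {"string-1", "string-2", "string-3"}

-- flat: one long string with '\n' separators
sequence flat
flat = "string-1\nstring-2\nstring-3\n"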
Kat
2. Re: 20gigabyte string searching
Kat wrote:
>
> Serious question: what's the best way to search twenty gigabytes of text for
> a 100byte substring, using Euphoria?
Would it be easier to create an index (like a b-tree or something),
then use that for searching?
-=ck
"Programming in a state of EUPHORIA."
http://www.cklester.com/euphoria/
3. Re: 20gigabyte string searching
On 1 Oct 2004, at 18:21, cklester wrote:
>
>
> posted by: cklester <cklester at yahoo.com>
>
> Kat wrote:
> >
> > Serious question: what's the best way to search twenty gigabytes of text for
> > a
> > 100byte substring, using Euphoria?
>
> Would it be easier to create an index (like a b-tree or something),
> then use that for searching?
Sure! Where?
20 gigs in an index in an Eu sequence is 80 gigs.
Kat
4. Re: 20gigabyte string searching
People wrote:
> > > Serious question: what's the best way to search twenty gigabytes of text
> > > for a
> > > 100byte substring, using Euphoria?
> > Would it be easier to create an index (like a b-tree or something),
> > then use that for searching?
> Sure! Where?
> 20gig in a index in an Eu sequence is 80gigs.
I mean like this:
{ "This", "Is an example", "Might not be", "Accurate" }
Index would be:
{
{"A", {4}},
{"B", {}},
...
{"I", {2}},
...
{"M", {3}},
...
{"T", {1}}
}
So if you search for "Might not be", it searches the index first for 'M',
finds all instances in the list where the searchable items start with 'M',
finds one at 3, then just tests the rest.
Or, even an index like this:
{ "T", "IAE", "MNB", "A" }
So if I search for "is an example," it looks for the first-letter-of-each-
word pattern of "IAE" and finds it.
Either one of these methods will reduce the required storage for a
regular index of words, plus it will speed up the searches significantly.
I think. I could be wrong. It's late. :)
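Something like this, maybe (just a rough sketch; build_index() and lookup()
are made-up names, it only buckets on the first letter, and it assumes every
item starts with a letter):

include wildcard.e  -- upper()

constant items = {"This", "Is an example", "Might not be", "Accurate"}

-- build one bucket of item positions per first letter, 'A'..'Z'
function build_index(sequence list)
    sequence index
    integer c
    index = repeat({}, 26)
    for i = 1 to length(list) do
        c = upper(list[i][1]) - 'A' + 1
        if c >= 1 and c <= 26 then
            index[c] = append(index[c], i)
        end if
    end for
    return index
end function

-- compare the key only against items whose first letter matches
function lookup(sequence list, sequence index, sequence key)
    sequence bucket
    integer c
    c = upper(key[1]) - 'A' + 1
    bucket = index[c]
    for i = 1 to length(bucket) do
        if compare(list[bucket[i]], key) = 0 then
            return bucket[i]
        end if
    end for
    return 0    -- not found
end function

sequence index
index = build_index(items)
? lookup(items, index, "Might not be")  -- prints 3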
-=ck
"Programming in a state of EUPHORIA."
http://www.cklester.com/euphoria/
5. Re: 20gigabyte string searching
On 1 Oct 2004, at 19:15, cklester wrote:
>
>
> posted by: cklester <cklester at yahoo.com>
>
> People wrote:
>
> > > > Serious question: what's the best way to search twenty gigabytes of text
> > > > for a 100byte substring, using Euphoria?
> > > Would it be easier to create an index (like a b-tree or something),
> > > then use that for searching?
> > Sure! Where?
> > 20gig in a index in an Eu sequence is 80gigs.
>
> I mean like this:
>
> { "This", "Is an example", "Might not be", "Accurate" }
>
> Index would be:
>
> {
> {"A", {4}},
> {"B", {}},
> ...
> {"I", {2}},
> ...
> {"M", {3}},
> ...
> {"T", {1}}
> }
>
> So if you search for "Might not be," it searches the index first for 'M',
> finds all instances in the list where the searchable items start with 'M'
> and finds one at 3, then just test the rest.
>
> Or, even an index like this:
>
> { "T", "IAE", "MNB", "A" }
>
> So if I search for "is an example," it looks for the first-letter-of-each-
> word pattern of "IAE" and finds it.
>
> Either one of these methods will reduce the required storage for a
> regular index of words, plus it will speed up the searches significantly.
> I think. I could be wrong. It's late. :)
OK, you've got the index; where's the 20-gig file you are indexing into?
Kat
6. Re: 20gigabyte string searching
Kat wrote:
>
> Serious question: what's the best way to search twenty gigabytes of text for
> a 100byte substring, using Euphoria? Keep in mind Eu will blow up 20gigs to
> 80 in ram, and each copy made to munge it is another 80 gigs, so loading
> the file into a sequence isn't possible. I foresee tons of disk thrashing, as
> i
> don't have gigs of ram laying around..
>
> The twenty gigs can be formed like this:
> {"string-1","string-2"...."string-n"}
> where each string-x is 20 to 120 chars long, most will be in the 100
> character neighborhood. Chances of them being sorted is low, as i don't see
> how Eu can be used to sort them in my lifetime.
> or they can be like this:
> {string-1\nstring-2\nstring3\n...string-n}
>
> I have a list of 150 million string-xs laying here, and don't know the best
> way
> to put them together so solve the search problem. Flat sequence with
> separators, or nested sequence? Having a parallel sequence of what was
> found would be terribly nice, but would be equally huge in count (altho
> shouldn't be as big absolutely).
I guess I don't understand your situation, but why do you need to have
the whole 20Gigs in RAM at once? I just did a test and I can do a brute-
force string search over 20Gigs in about 70 minutes, using no more
than a meg of RAM.
Anyhow, if you've got the disk space you can index things a lot. This is
basic stuff for databases. You might need about 40-60Gigs to index the
words, but it will speed up searching 100-1000 times.
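A brute-force pass needs nothing more than reading the file a line at a
time; something like this sketch (scan_file() and the file name are made
up, and it assumes one string per line):

-- read the file one line at a time and match() each line;
-- RAM use stays at roughly one line's worth
function scan_file(sequence fname, sequence needle)
    integer fn, line
    object x
    fn = open(fname, "r")
    if fn = -1 then
        return -1          -- could not open
    end if
    line = 0
    while 1 do
        x = gets(fn)
        if atom(x) then
            exit           -- end of file
        end if
        line += 1
        if match(needle, x) then
            close(fn)
            return line    -- first line containing the substring
        end if
    end while
    close(fn)
    return 0               -- not found
end function

? scan_file("strings.txt", "some 100-byte substring")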
--
Derek Parnell
Melbourne, Australia
7. Re: 20gigabyte string searching
On Fri, 1 Oct 2004 23:38:00 -0500, Kat <gertie at visionsix.com> wrote:
>Basically, you need to know if you already have the url before you add it as a
>duplicate.
<snip>
>> Please give us a sample of the file and the substring, and the above
>> mentioned details.
>
>Ok, here's a leftover from something else:
>
>http://www.wolfe.net/stroetl/index.html
>http://astroamerica.com
<snip>
If I wanted to minimize disk space and memory usage, this is what I
would do.
1) I would store all the urls in a plain text file, with the rule that
each line added must begin with "http://". I would use that specific
value as an "end of chain" marker, much like 0 is traditionally used;
otherwise I will overwrite those 6 characters with a link to the next
line. This url file is "add only", for now.
2) I would hash the url, turning it into a single positive integer
between 1 and 1048576. For this I would use jiri babor's hash()
function [just that, ignoring the table handling part] and probably
use one of Jeurgen's bit operators to shift the 32-bit result from
hash() down to the 20-bit key I need.
3) I would build a hash lookup file, again using a 6-character lookup
key to my main url text file, and initialise it with "http://"'s.
The hash lookup file will be 6MB, for 1048576 keys.
4) For lookup keys, I would use a base 64 encoding (0-9,A-Z,a-z) which
limits the url file to approx 68G, or base 94 (! to ~, but not /),
which limits the url file to approx 689G. My base64 encoding
contribution could be used/adapted for this.
As an example, suppose hash("http://www.spirit.satelnet.org")
returns 968961102, which I shift right to get 946251.
I seek(946251*6) and read 6 bytes from the hash file; if it is
"http://" then the bucket is empty and the url needs to be added, in
which case I where() the url-text file, encode it and overwrite the
6-byte entry in the hash file. If it is not "http://" then I decode
it, seek, and gets() a line from the url-text file. If it matches, of
course, I don't want to add this url again; otherwise I repeat the
"http://" check on the start of the line. If I get to the end of the
chain, then I have to overwrite the start of the last entry with a
6-byte encoded link to the line I am about to add.
Basically, with that amount of data I would hold none of it in memory
whatsoever; if there is any caching to be done, let the OS handle it.
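Very roughly, the lookup step might look something like this. Everything
named here (hash20(), encode6(), decode6(), seen(), the file handles, the
6-byte "http:/" marker) is my own stand-in for jiri's hash(), a real base-64
routine and the real layout, and chain-following is left out:

include file.e              -- seek(), where()

constant SLOT = 6
constant EMPTY = "http:/"   -- 6-byte "bucket is empty" marker
integer void

function hash20(sequence s)
-- stand-in 20-bit hash; jiri's hash() would go here
    atom h
    h = 0
    for i = 1 to length(s) do
        h = remainder(h * 31 + s[i], 1048576)
    end for
    return h
end function

function encode6(atom pos)
-- url-file offset -> 6 printable bytes (crude base 64, sketch only)
    sequence r
    r = repeat(0, SLOT)
    for i = SLOT to 1 by -1 do
        r[i] = '0' + remainder(pos, 64)
        pos = floor(pos / 64)
    end for
    return r
end function

function decode6(sequence key)
-- 6 printable bytes -> url-file offset
    atom pos
    pos = 0
    for i = 1 to SLOT do
        pos = pos * 64 + key[i] - '0'
    end for
    return pos
end function

function seen(sequence url, integer fh, integer fu)
-- fh: hash file opened "ub", pre-filled with 1048576 copies of EMPTY
-- fu: url text file opened "ub"
-- returns 1 if url was already stored, 0 if it was just added
    sequence slot
    object line
    atom pos
    void = seek(fh, hash20(url) * SLOT)
    slot = repeat(0, SLOT)
    for i = 1 to SLOT do
        slot[i] = getc(fh)
    end for
    if equal(slot, EMPTY) then      -- empty bucket: append url, point slot at it
        void = seek(fu, -1)         -- end of url file
        pos = where(fu)
        puts(fu, url & '\n')
        void = seek(fh, hash20(url) * SLOT)
        puts(fh, encode6(pos))
        return 0
    end if
    void = seek(fu, decode6(slot))  -- follow the slot to the stored url
    line = gets(fu)
    return equal(line, url & '\n')
end function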
Regards,
Pete
8. Re: 20gigabyte string searching
Kat,
One other thought. You mentioned that 20 GB of text becomes 80 GB
as a sequence. If size is a major issue, consider using regular
expressions. If the data you are searching remains as a single string
of text data, you will not have the size increase you mention.
Karl Bochert submitted regex.zip on 2002-04-23. I have used his routines
some, and they are quite usable.
In short, using regular expressions on single large strings could
help with the size issues.
Terry Constant
9. Re: 20gigabyte string searching
Kat wrote:
>
> Serious question: what's the best way to search twenty gigabytes of text for
> a 100byte substring, using Euphoria? Keep in mind Eu will blow up 20gigs to
> 80 in ram, and each copy made to munge it is another 80 gigs, so loading
> the file into a sequence isn't possible. I foresee tons of disk thrashing, as
> i
> don't have gigs of ram laying around..
>
> The twenty gigs can be formed like this:
> {"string-1","string-2"...."string-n"}
> where each string-x is 20 to 120 chars long, most will be in the 100
> character neighborhood. Chances of them being sorted is low, as i don't see
> how Eu can be used to sort them in my lifetime.
> or they can be like this:
> {string-1\nstring-2\nstring3\n...string-n}
>
> I have a list of 150 million string-xs laying here, and don't know the best
> way
> to put them together so solve the search problem. Flat sequence with
> separators, or nested sequence? Having a parallel sequence of what was
> found would be terribly nice, but would be equally huge in count (altho
> shouldn't be as big absolutely).
>
> Kat
>
>
Bear with me as I think; picture what I am trying to say.
Advanced users, forgive me for the lack of technical terms or knowledge.
If the original file is too big or time-consuming to sort (and keep sorted),
then a hash table could be used instead, and kept sorted more easily.
Someone else mentioned this too.
A unique number for each record could be a year/time (a Julian date, but how to
calculate it?), or the birthdate (creation date?) of the url is unique, but how to
connect the two together in an index when only the human-readable url is known?
Build a key similar to your own internet address: a set of 4 numbers generated
by 4 different lists (multi-dimensional).
How to translate a human-readable url into a set of numbers?
A url is punctuated with slashes or periods, which could be used to break down
the url into lists.
The lists would contain only the data for that particular position in the url and
the generated address data.
Searching 3 or 4 lists might be faster than sorting/inserting into the original
database.
url = www.xxxx\yyyy\zzzz, read as url = www.dim1\dim2\dim3
xxxx = list 1 -- the actual url data is the key for this list
yyyy = list 2
zzzz = list 3
For each url in the database:
list 1 is the list containing the url info (of all urls in the database) at that
position (xxxx) in the url;
list 2 is for the yyyy url data; list 3 is for the zzzz data in the url.
Using the data, search list 1 for the corresponding record code,
then do the same for each list.
The built-up address would look like 04.06.10 (with leading zeros).
This will point to the record in the hash table that contains the needed
data (a record pointer to the original database record).
The key in these smaller lists is the url info; the data is the record number
inside this list of this url info.
These smaller lists can be sorted.
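A little sketch of what I mean (the lists and names here are made up, and a
binary search would replace find() once the lists are kept sorted):

-- break a url on '.' and '/' (or '\') into its parts
function split_url(sequence url)
    sequence parts, part
    integer c
    parts = {}
    part = {}
    for i = 1 to length(url) do
        c = url[i]
        if c = '.' or c = '/' or c = '\\' then
            if length(part) then
                parts = append(parts, part)
                part = {}
            end if
        else
            part &= c
        end if
    end for
    if length(part) then
        parts = append(parts, part)
    end if
    return parts
end function

-- one sorted list per position; the record number of each part in its
-- own list builds up the address into the hash table
sequence lists, parts, key
integer p
lists = {{"astroamerica", "wolfe", "www"},   -- parts seen at position 1
         {"stroetl", "wolfe", "www"},        -- parts seen at position 2
         {"com", "net", "org"}}              -- parts seen at position 3
parts = split_url("www.wolfe.net")           -- {"www", "wolfe", "net"}
key = ""
for i = 1 to length(parts) do
    p = find(parts[i], lists[i])             -- record number in list i
    key &= sprintf("%02d.", p)
end for
-- key is now "03.02.02." -- the built-up address, with leading zeros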
I hope I'm clear.
rudy
10. Re: 20gigabyte string searching
On Sat, 2 Oct 2004 22:26:26 +0000, Mike <vulcan at win.co.nz> wrote:
>Depending on the hash table quality there might be, say, 2 or 3
>disk movements per URL search.
>
>Mike
>
>PS: This idea seems similar to the one Pete posted but the index is
>managed in RAM so performance should be much better (at the expense of
>huge amounts of RAM).
I thought about that, with a hash table of a million entries on disk,
and 15 million urls, if the hash function does its job, the average
bucket size will be 15. Either way, (15 or 2) local disk accesses will
probably be faster than reading a 100-byte url from the interweb.
Pete
11. Re: 20gigabyte string searching
----- Original Message -----
From: "Pete Lomax"
Sent: Sunday, October 03, 2004 5:38 AM
Subject: Re: 20gigabyte string searching
>
>
> On Sat, 2 Oct 2004 22:26:26 +0000, Mike <vulcan at win.co.nz> wrote:
>
> >Depending on the hash table quality there might be, say, 2 or 3
> >disk movements per URL search.
> >
> >Mike
> >
> >PS: This idea seems similar to the one Pete posted but the index is
> >managed in RAM so performance should be much better (at the expense of
> >huge amounts of RAM).
>
> I thought about that, with a hash table of a million entries on disk,
> and 15 million urls, if the hash function does its job, the average
> bucket size will be 15. Either way, (15 or 2) local disk accesses will
> probably be faster than reading a 100-byte url from the interweb.
>
> Pete
>
What would be the total size of the hash table?
Do you have to create the entire hash table in RAM, or can it be built
to a file as you go along?
Kat, I once devised a sorting solution for such large lists. I've never
posted the process or exposed it to anyone. Shortly after I came up with
it, I assumed it to be so simple in design that everyone must already know
how to do it. I will post the idea later; I have other things to tend to. :/
unkmar
12. Re: 20gigabyte string searching
Kat, my Euphoria is a bit rusty, I have not touched it for almost two years, but
you might be interested in the following code.
Let me know if you are, in principle, because a number of substantial
optimizations are still possible.
jiri
-- kat.ex : version 1.00
-- jiri babor
-- jbabor at paradise.net.nz
-- 04-10-04
include file.e -- dir()
include misc.e -- sprint()
function search(sequence s, object key)
-- binary search of sorted sequence
-- return {flag, index}, where
--   if flag = 1 ... key found at index
--   if flag = 0 ... key not found, index is insertion point
    integer c,hi,lo,mid

    lo = 1
    hi = length(s)
    while lo <= hi do
        mid = floor((lo + hi) / 2)
        c = compare(key, s[mid])
        if c < 0 then       -- key < s[mid]
            hi = mid - 1
        elsif c > 0 then    -- key > s[mid]
            lo = mid + 1
        else                -- key = s[mid]
            return {1,mid}
        end if
    end while
    return {0,lo}
end function
function pword(integer n) -- pseudo word: random ASCII string
    sequence w

    w = {}
    for j = 1 to rand(n) do
        w &= rand(26)+96
    end for
    return w
end function
procedure hoard(sequence path, integer rid, sequence params, integer maxn)
-- path:   data directory
-- rid:    routine_id of function returning next address
-- params: parameter sequence of rid
-- maxn:   maximum number of addresses held in a single file
    object a,x
    sequence data,last,lasts,name,names,newname,r
    integer count,f,f1,f2,hits,m,m1,n,r2

    count = 0   -- address counter
    hits = 0    -- address collision counter
    if path[length(path)] != '\\' then
        path &= '\\'
    end if
    x = dir(path)
    if atom(x) then
        puts(1,"Error: data directory does not exist. Aborting...\n")
        abort(1)
    else
        names = {}
        lasts = {}
        f1 = open(path & "names","r")
        f2 = open(path & "lasts","r")
        if f1 != -1 then                    -- initialized database
            while 1 do
                x = gets(f1)
                if atom(x) then exit end if
                x = x[1..length(x)-1]       -- strip '\n'
                names = append(names, x)
            end while
            close(f1)
            lasts = {}
            while 1 do
                x = gets(f2)
                if atom(x) then exit end if
                x = x[1..length(x)-1]       -- strip '\n'
                lasts = append(lasts, x)
            end while
            close(f2)
        end if
        n = length(names)
    end if

    -- main loop
    while get_key() != 27 do                -- esc
        a = call_func(rid,params)
        count += 1
        r = search(lasts, a)
        r2 = r[2]
        if not r[1] then                    -- not found among last addresses
            if r2 > n then                  -- address > last address of last data file
                if n = 0 then               -- no data files yet
                    n = 1
                    name = "1"              -- new file name
                    f = open(path & name, "w")
                    puts(f, a & '\n')
                    close(f)
                    names = append(names, name)
                    lasts = append(lasts, a)
                else                        -- append to the last data file
                    name = names[n]
                    f = open(path & name, "a")
                    puts(f, a & '\n')
                    close(f)
                    lasts[n] = a
                end if
            else
                name = names[r2]
                data = {}
                f = open(path & name, "r")
                while 1 do
                    x = gets(f)
                    if atom(x) then exit end if
                    x = x[1..length(x)-1]   -- strip '\n'
                    data = append(data, x)
                end while
                close(f)
                r = search(data, a)
                if r[1] = 0 then            -- not found: insert
                    data = data[1..r[2]-1] & {a} & data[r[2]..length(data)]
                    -- write it out
                    m = length(data)
                    if m > maxn then        -- subdivide
                        m1 = floor(m/2)
                        newname = sprint(n+1)
                        names = names[1..r2-1] & {newname} & names[r2..n]
                        last = data[m1]
                        lasts = lasts[1..r2-1] & {last} & lasts[r2..n]
                        n += 1
                        f = open(path & newname, "w")
                        for i = 1 to m1 do
                            puts(f, data[i] & '\n')
                        end for
                        close(f)
                        data = data[m1+1..m]
                    end if
                    f = open(path & name, "w")
                    for i = 1 to length(data) do
                        puts(f, data[i] & '\n')
                    end for
                    close(f)
                else
                    hits += 1
                end if
            end if
        else
            hits += 1
        end if
    end while

    -- save control sequences
    f1 = open(path & "names", "w")
    for i = 1 to length(names) do
        puts(f1, names[i] & '\n')
    end for
    close(f1)
    f2 = open(path & "lasts", "w")
    for i = 1 to length(lasts) do
        puts(f2, lasts[i] & '\n')
    end for
    close(f2)
    printf(1, "%d addresses\n", count)
    printf(1, "%d collisions\n", hits)
end procedure

hoard("c:\\data", routine_id("pword"), {4}, 100)