1. Constructing sequences

Dear Euphorians

Overall, I'm very happy with the performance of Euphoria (both 
interpreted and compiled from translated C), but I've noticed
that constructing sequences seems to be pretty slow by comparison
with accessing them once they are constructed.

When I add 100,000 smaller sequences (my data "records") to a storage
sequence, it takes me nearly 40 seconds either interpreted or
compiled, to complete. Once this 100,000 record sequence is built
however, I can access the various "records" and/or their "fields"
at blazing speeds. I get no speed up on the compiled version for
the construction of the sequence, but I do get some speedup for the
already very rapid access to the built sequence.

I have even tried making the sequence to be constructed global, and
using a procedure to append to it, rather than a function which has
to pass a copy of the sequence. It made no discernible difference.

I have read the performance tips about using the optimized form
'append' etc. etc., but if any of you out there have any experience
of improving the speed of creation of large sequences, I'd be very
grateful for any tips or pointers (no pun intended).

Best

Gordon

new topic     » topic index » view message » categorize

2. Re: Constructing sequences

Gordon Webster wrote:
> 
> Dear Euphorians
> 
> Overall, I'm very happy with the performance of Euphoria (both 
> interpreted and compiled from translated C), but I've noticed
> that constructing sequences seems to be pretty slow by comparison
> with accessing them once they are constructed.
> 
> When I add 100,000 smaller sequences (my data "records") to a storage
> sequence, it takes me nearly 40 seconds either interpreted or
> compiled, to complete. Once this 100,000 record sequence is built
> however, I can access the various "records" and/or their "fields"
> at blazing speeds. I get no speed up on the compiled version for
> the construction of the sequence, but I do get some speedup for the
> already very rapid access to the built sequence.
> 
> I have even tried making the sequence to be constructed global, and
> using a procedure to append to it, rather than a function which has
> to pass a copy of the sequence. It made no discernible difference.
> 
> I have read the performance tips about using the optimized form
> 'append' etc. etc., but if any of you out there have any experience
> of improving the speed of creation of large sequences, I'd be very
> grateful for any tips or pointers (no pun intended).
> 
> Best
 
Gordon:
 
I have found that thing slow way down when using many appends
is version 2.5 Eu. If you can define the complete large sequence
all at once then that will be faster without appends.

Have you thought of using a database to store the information
your keeping in the small sequences instead of adding them to
a large sequence. 

Bernie

My files in archive:
w32engin.ew mixedlib.e eu_engin.e win32eru.exw

Can be downloaded here:
http://www.rapideuphoria.com/cgi-bin/asearch.exu?dos=on&win=on&lnx=on&gen=on&keywords=bernie+ryan

new topic     » goto parent     » topic index » view message » categorize

3. Re: Constructing sequences

Bernie Ryan wrote:
> 
> Gordon Webster wrote:
> > 
> > Dear Euphorians
> > 
> > Overall, I'm very happy with the performance of Euphoria (both 
> > interpreted and compiled from translated C), but I've noticed
> > that constructing sequences seems to be pretty slow by comparison
> > with accessing them once they are constructed.
> > 
> > When I add 100,000 smaller sequences (my data "records") to a storage
> > sequence, it takes me nearly 40 seconds either interpreted or
> > compiled, to complete. Once this 100,000 record sequence is built
> > however, I can access the various "records" and/or their "fields"
> > at blazing speeds. I get no speed up on the compiled version for
> > the construction of the sequence, but I do get some speedup for the
> > already very rapid access to the built sequence.
> > 
> > I have even tried making the sequence to be constructed global, and
> > using a procedure to append to it, rather than a function which has
> > to pass a copy of the sequence. It made no discernible difference.
> > 
> > I have read the performance tips about using the optimized form
> > 'append' etc. etc., but if any of you out there have any experience
> > of improving the speed of creation of large sequences, I'd be very
> > grateful for any tips or pointers (no pun intended).
> > 
> > Best
>  
> Gordon:
>  
> I have found that thing slow way down when using many appends
> is version 2.5 Eu. If you can define the complete large sequence
> all at once then that will be faster without appends.
> 
> Have you thought of using a database to store the information
> your keeping in the small sequences instead of adding them to
> a large sequence. 
> 
> Bernie
> 
> My files in archive:
> w32engin.ew mixedlib.e eu_engin.e win32eru.exw
> 
> Can be downloaded here:
> <a
> href="http://www.rapideuphoria.com/cgi-bin/asearch.exu?dos=on&win=on&lnx=on&gen=on&keywords=bernie+ryan">http://www.rapideuphoria.com/cgi-bin/asearch.exu?dos=on&win=on&lnx=on&gen=on&keywords=bernie+ryan</a>
> 

Thanks for the reply Bernie

I am reading the data records in from a file so I don't see an
obvious way I could build the whole sequence at once. Wouldn't
I have to temporariliy store the records in another sequence,
incurring the same 'append' overhead anyway?

The other aspect of this is that parts of the data records are
going to be used for heavy numerical computation and I need to
be able to construct numerical vectors and matrices really fast,
using certain fields of my data records. I didn't consider using
a database for this, since my data records are very simple and
well defined and I was worried that accessing fields from a db
would be a lot slower than accessing atoms from a sequence. I'm
really shooting for "lean and fast" here since I may have as many
as a million records to process and I want to keep the memory
overhead down as well.

Perhaps I should be thinking of using a db. Would it bloat my
memory overhead much, or slow things down compared to using
raw sequences?

It seems that once I've built my raw sequence, I can access it very
very rapidly indeed, but if I don't use a db, is it possible to
speed up the construction of sequences in memory?

Best

Gordon

new topic     » goto parent     » topic index » view message » categorize

4. Re: Constructing sequences

Gordon Webster wrote:
> 
> Dear Euphorians
> 
> Overall, I'm very happy with the performance of Euphoria (both 
> interpreted and compiled from translated C), but I've noticed
> that constructing sequences seems to be pretty slow by comparison
> with accessing them once they are constructed.
> 
> When I add 100,000 smaller sequences (my data "records") to a storage
> sequence, it takes me nearly 40 seconds either interpreted or
> compiled, to complete. Once this 100,000 record sequence is built
> however, I can access the various "records" and/or their "fields"
> at blazing speeds. I get no speed up on the compiled version for
> the construction of the sequence, but I do get some speedup for the
> already very rapid access to the built sequence.

How are you adding the smaller sequences to the larger sequence?  If you
continue to simply append sequences to another, you will be forcing Euphoria
to reallocate memory many times (it does allocate more than it needs, but
you'll still be wasting a lot of time doing this).  If you know how many
(or even about how many) you can get a big speedup.  Try, for instance
(pseudocode):
storage_sequence = repeat( 0, 100000 )
while reading_data do
  -- read data...
  storage_sequence[ix] = new_data
  ix += 1
end while

You should see a big increase in speed.  You can also increase the size by
some fixed (or changing) size.  Either way, I think you'll find a nice
speed increase.

Matt Lewis

new topic     » goto parent     » topic index » view message » categorize

5. Re: Constructing sequences

Matt Lewis wrote:
> 
> Gordon Webster wrote:
> > 
> > Dear Euphorians
> > 
> > Overall, I'm very happy with the performance of Euphoria (both 
> > interpreted and compiled from translated C), but I've noticed
> > that constructing sequences seems to be pretty slow by comparison
> > with accessing them once they are constructed.
> > 
> > When I add 100,000 smaller sequences (my data "records") to a storage
> > sequence, it takes me nearly 40 seconds either interpreted or
> > compiled, to complete. Once this 100,000 record sequence is built
> > however, I can access the various "records" and/or their "fields"
> > at blazing speeds. I get no speed up on the compiled version for
> > the construction of the sequence, but I do get some speedup for the
> > already very rapid access to the built sequence.
> 
> How are you adding the smaller sequences to the larger sequence?  If you
> continue to simply append sequences to another, you will be forcing Euphoria
> to reallocate memory many times (it does allocate more than it needs, but
> you'll still be wasting a lot of time doing this).  If you know how many
> (or even about how many) you can get a big speedup.  Try, for instance
> (pseudocode):
> }}}
<eucode>
> storage_sequence = repeat( 0, 100000 )
> while reading_data do
>   -- read data...
>   storage_sequence[ix] = new_data
>   ix += 1
> end while
> <font color="#330033"></eucode>
{{{
</font>
> You should see a big increase in speed.  You can also increase the size by
> some fixed (or changing) size.  Either way, I think you'll find a nice
> speed increase.
> 
> Matt Lewis
> 


Aaaaaaaaaaah! Thanks Matt

It's all the individual memory allocations that take the time on the
'append' operation. Using your method of pre-allocating memory in blocks,
my 40 second sequence build comes down to 0.05s - an 800x speedup!

BTW: just for fun, when I compiled this, the time was too small to be
displayed by the time() function for 100,000 records added.

I will allocate and expand my sequences in blocks from now on.

Thanks for the help - I really appreciate it!

Best

Gordon

new topic     » goto parent     » topic index » view message » categorize

6. Re: Constructing sequences

Gordon Webster wrote:

[snip]

> I am reading the data records in from a file so I don't see an
> obvious way I could build the whole sequence at once. Wouldn't
> I have to temporariliy store the records in another sequence,
> incurring the same 'append' overhead anyway?

[snip]

Gordon, try please:

-- file aa.ex
include file.e

function Get_Big_File(object name)
  object file
     file = dir(name)
       file = file[1][3]
         file = repeat(0, file)
           name = open(name, "rb")
             for i=1 to length(file) do
                file[i] = getc(name)
             end for
             close(name)
  return file
end function

puts(1, Get_Big_File("aa.ex"))
-- end of file aa.ex


Just copy/paste this code into NotePad,
save it as aa.ex, and then run it from
ed.ex.

Good Luck!

Regards,
Igor Kachan
kinz at peterlink.ru

new topic     » goto parent     » topic index » view message » categorize

7. Re: Constructing sequences

Gordon Webster wrote:
> Aaaaaaaaaaah! Thanks Matt
> 
> It's all the individual memory allocations that take the time on the
> 'append' operation. Using your method of pre-allocating memory in blocks,
> my 40 second sequence build comes down to 0.05s - an 800x speedup!
> 
> BTW: just for fun, when I compiled this, the time was too small to be
> displayed by the time() function for 100,000 records added.
> 
> I will allocate and expand my sequences in blocks from now on.

That sounds like a good solution, however
I did a test on my Pentium-4 1.8GHz machine...

integer fn
object line
sequence buffer

-- create file
fn = open("bigfile.txt", "wb")
for i = 1 to 100000 do
    puts(fn, 'A' + rand(repeat(100, 19 + rand(11))))
    puts(fn, '\n')
end for
close(fn) -- file is about 2.5Mb

-- read file
fn = open("bigfile.txt", "rb") -- 100,000 lines of 20 to 30 random chars

atom t
t = time()

buffer = {}
while 1 do
    line = gets(fn)
    if atom(line) then
	exit
    end if
    buffer = append(buffer, line)
end while

close(fn)

? time() - t -- 4.06 seconds, 1.8GHz Pentium-4

? length(buffer) -- 100,000


With 2.5, ex.exe took 4.06 seconds. 
With 2.4, ex.exe took 4.34 seconds.

Maybe your machine is much slower, or maybe you
were not appending to a simple variable, like "buffer",
but rather to a subscripted variable. The latter case
is not optimized as well.

Regards,
   Rob Craig
   Rapid Deployment Software
   http://www.RapidEuphoria.com

new topic     » goto parent     » topic index » view message » categorize

8. Re: Constructing sequences

Robert Craig wrote:
> 
> Gordon Webster wrote:
> > Aaaaaaaaaaah! Thanks Matt
> > 
> > It's all the individual memory allocations that take the time on the
> > 'append' operation. Using your method of pre-allocating memory in blocks,
> > my 40 second sequence build comes down to 0.05s - an 800x speedup!
> > 
> > BTW: just for fun, when I compiled this, the time was too small to be
> > displayed by the time() function for 100,000 records added.
> > 
> > I will allocate and expand my sequences in blocks from now on.
> 
> That sounds like a good solution, however
> I did a test on my Pentium-4 1.8GHz machine...
> 
> }}}
<eucode>
> integer fn
> object line
> sequence buffer
> 
> -- create file
> fn = open("bigfile.txt", "wb")
> for i = 1 to 100000 do
>     puts(fn, 'A' + rand(repeat(100, 19 + rand(11))))
>     puts(fn, '\n')
> end for
> close(fn) -- file is about 2.5Mb
> 
> -- read file
> fn = open("bigfile.txt", "rb") -- 100,000 lines of 20 to 30 random chars
> 
> atom t
> t = time()
> 
> buffer = {}
> while 1 do
>     line = gets(fn)
>     if atom(line) then
> 	exit
>     end if
>     buffer = append(buffer, line)
> end while
> 
> close(fn)
> 
> ? time() - t -- 4.06 seconds, 1.8GHz Pentium-4
> 
> ? length(buffer) -- 100,000
> <font color="#330033"></eucode>
{{{
</font>
> 
> With 2.5, ex.exe took 4.06 seconds. 
> With 2.4, ex.exe took 4.34 seconds.
> 
> Maybe your machine is much slower, or maybe you
> were not appending to a simple variable, like "buffer",
> but rather to a subscripted variable. The latter case
> is not optimized as well.
> 
> Regards,
>    Rob Craig
>    Rapid Deployment Software
>    <a href="http://www.RapidEuphoria.com">http://www.RapidEuphoria.com</a>
> 


Dear Rob,

You are absolutely right, I was writing to a subscripted variable.

I had created a sequence whose first few variables are the sequence
'config', and whose last variable is another sequence destined to hold
the actual data records.

In effect was 'appending' to mysequence[$]. I didn't realise that there
was such a difference between subscripted and unsubscripted variables
for this purpose (and my machine is pretty fast - a 3.2GHz Pentium).
Thanks for the tip. This is something I might never have discovered
for myself.

This list is a terrific resource and I have had some great help here,
for which I'm very grateful. Coding in Euphoria is certainly a pleasure.
I love how lean, clean and readable my code is and (an important feature
for me) how fast it runs ...

... when you do things the "right" way, as I'm learning today smile

I know that you have a "Performance Tips" section in the online docs,
but it might be worth compiling some more of this kind of advice into it.
Trawling through 9 years worth of the EuForum is not so easy - the search
is lighning quick, but wading through the output can be a slog. It would
really be a shame if a newcomer to Euphoria like myself, were to try out
the language without being aware of these things, erroneously conclude
that Perl/Python etc. (pick your favorite interpreter) was faster, and go
looking elsewhere for speedy code.

Best

Gordon

new topic     » goto parent     » topic index » view message » categorize

9. Re: Constructing sequences

Robert Craig wrote:
> With 2.5, ex.exe took 4.06 seconds. 

Another 'instant' speed up is use exwc.exe when running on a Windows platform. I
used the same code as Robert here and on my Pentium 3, 550MHz machine I got 0.45
seconds using exwc.exe

-- 
Derek Parnell
Melbourne, Australia
irc://irc.sorcery.net:9000/euphoria

new topic     » goto parent     » topic index » view message » categorize

10. Re: Constructing sequences

Hi Rob...
There is something wrong...

0.13s exw test.exw
0.11s exwc test.exw
2.54s ex test.ex  (!!!!)

Pentium 2.4Ghz win XP.

Rubens


At 17:20 27/6/2005, you wrote:
>
>
>posted by: Derek Parnell <ddparnell at bigpond.com>
>
>Robert Craig wrote:
> > With 2.5, ex.exe took 4.06 seconds.
>
>Another 'instant' speed up is use exwc.exe when running on a Windows 
>platform. I used the same code as Robert here and on my Pentium 3, 
>550MHz machine I got 0.45 seconds using exwc.exe
>
>--
>Derek Parnell
>Melbourne, Australia
>irc://irc.sorcery.net:9000/euphoria
>
>
>

new topic     » goto parent     » topic index » view message » categorize

11. Re: Constructing sequences

Rubens wrote:
> Hi Rob...
> There is something wrong...
> 
> 0.13s exw test.exw
> 0.11s exwc test.exw
> 2.54s ex test.ex  (!!!!)
> 
> Pentium 2.4Ghz win XP.
> 
> Rubens
> 
> 
> At 17:20 27/6/2005, you wrote:
> >
> >
> >posted by: Derek Parnell <ddparnell at bigpond.com>
> >
> >Robert Craig wrote:
> > > With 2.5, ex.exe took 4.06 seconds.
> >
> >Another 'instant' speed up is use exwc.exe when running on a Windows 
> >platform. I used the same code as Robert here and on my Pentium 3, 
> >550MHz machine I got 0.45 seconds using exwc.exe

I get:
     ex.exe 4.06 seconds
   exwc.exe 0.16 seconds

yet ex and exwc are roughly equal on the sieve8k.exw benchmark.

I think the reason for this is that I'm using
WATCOM's storage allocator for ex.exe, but I'm using Microsoft's
allocator for exw.exe and exwc.exe.

I think WATCOM's allocator scans through all the (up to 100000)
allocated blocks, looking for one that has been marked free, 
before it gives up and allocates a new block. Most other
allocators maintain a free list, so they would know instantly
that there are no free blocks, and they would allocate a new block.
WATCOM is faster at freeing blocks, and generally works ok when 
there's a mix of allocates and frees happening. It's just the 
case of thousands of allocates in a row with no frees 
where it falls down.

All of this shows that storage allocation algorithms
can have a big, sometimes huge, impact on performance.

Regards,
   Rob Craig
   Rapid Deployment Software
   http://www.RapidEuphoria.com

new topic     » goto parent     » topic index » view message » categorize

12. Re: Constructing sequences

> 0.13s exw test.exw
> 0.11s exwc test.exw
> 2.54s ex test.ex  (!!!!)

LOL  :)  There's nothing wrong - that's just Windows XP DOS emulation
rearing its ugly face again. Since it's fully emulating DOS, there's
*a lot* of overhead with memory and such. Try to use exwc for any
command-line programs, unless you need to specifically use DOS for
some reason.

~Greg

new topic     » goto parent     » topic index » view message » categorize

13. Re: Constructing sequences

Greg Haberek wrote:
> > 0.13s exw test.exw
> > 0.11s exwc test.exw
> > 2.54s ex test.ex  (!!!!)
> 
> LOL  :)  There's nothing wrong - that's just Windows XP DOS emulation
> rearing its ugly face again. Since it's fully emulating DOS, there's
> *a lot* of overhead with memory and such. Try to use exwc for any
> command-line programs, unless you need to specifically use DOS for
> some reason.

If I comment out the "append" line in my example,
then I get 0.11 (instead of 4.06) for ex, and 0.05 for exwc,
so the I/O is a bit faster when using exwc, but it's
the append that is dominating the time for ex.

Regards,
   Rob Craig
   Rapid Deployment Software
   http://www.RapidEuphoria.com

new topic     » goto parent     » topic index » view message » categorize

14. Re: Constructing sequences

On 27 Jun 2005, at 7:43, Gordon Webster wrote:

> 
> 
> posted by: Gordon Webster <gwalias-bb at yahoo.com>
> 
> Dear Euphorians
> 
> Overall, I'm very happy with the performance of Euphoria (both 
> interpreted and compiled from translated C), but I've noticed
> that constructing sequences seems to be pretty slow by comparison
> with accessing them once they are constructed.
> 
> When I add 100,000 smaller sequences (my data "records") to a storage
> sequence, it takes me nearly 40 seconds either interpreted or
> compiled, to complete. Once this 100,000 record sequence is built
> however, I can access the various "records" and/or their "fields"
> at blazing speeds. I get no speed up on the compiled version for
> the construction of the sequence, but I do get some speedup for the
> already very rapid access to the built sequence.
> 
> I have even tried making the sequence to be constructed global, and
> using a procedure to append to it, rather than a function which has
> to pass a copy of the sequence. It made no discernible difference.
> 
> I have read the performance tips about using the optimized form
> 'append' etc. etc., but if any of you out there have any experience
> of improving the speed of creation of large sequences, I'd be very
> grateful for any tips or pointers (no pun intended).

See the last email thread about getf(). It's now extremely fast at building a 
sequence in memory from a harddrive file. HTH.

Kat

new topic     » goto parent     » topic index » view message » categorize

15. Re: Constructing sequences

Allright,

Considering someone posted something about file i/o (Including Kat), about this,
and the overwhelming evidence that pre-allocation is best way to go, here should
be an extreamly fast version of loading a file into Memory (AKA Sequence):

function fast_read(sequence fname)
    sequence file_block
    integer fh, size
    fh = open(fname,"rb")
    if seek(fh,-1) then end if
    size = where(fh)
    if seek(fh,0) then end if
    file_block = repeat(0,size)
    for i = 1 to size do
        file_block[i] = getc(fh)
    end for
    close(fh)
    return file_block
end function


This code is untested, as I formulated it out in my head, but it should prove
the point that it will be faster.  I use seek() instead of dir(), cause
sometimes, dir() Rely's on the Operating System to get the size, and that can
prove to be slow.  seek() is faster, cause you open the file allready, or have it
open, seek() is easier to reset the file pointers, rewind, and such, and I have
yet to see an error occur from seek() where it goes to the end of the file, then
back to the first byte in the file.  And that's from countless usages that I have
used this method for.

Mario Steele
http://enchantedblade.trilake.net
Attaining World Dominiation, one byte at a time...

new topic     » goto parent     » topic index » view message » categorize

16. Re: Constructing sequences

Mario Steele wrote:

> posted by: Mario Steele <eumario at trilake.net>
> 
> Allright,
> 
> Considering someone posted something about file i/o (Including Kat),
> about this, and the overwhelming evidence that pre-allocation is best
> way to go, here should be an extreamly fast version of loading a file
> into Memory (AKA Sequence):
> 
> }}}
<eucode>
> function fast_read(sequence fname)
>     sequence file_block
>     integer fh, size
>     fh = open(fname,"rb")
>     if seek(fh,-1) then end if
>     size = where(fh)
>     if seek(fh,0) then end if
>     file_block = repeat(0,size)
>     for i = 1 to size do
>         file_block[i] = getc(fh)
>     end for
>     close(fh)
>     return file_block
> end function
> </eucode>
{{{

> 
> This code is untested, as I formulated it out in my head, but it should
> prove the point that it will be faster.  I use seek() instead of dir(),
> cause sometimes, dir() Rely's on the Operating System to get the size,
> and that can prove to be slow.  seek() is faster, cause you open the file
> allready, or have it open, seek() is easier to reset the file pointers,
> rewind, and such, and I have yet to see an error occur from seek() where
> it goes to the end of the file, then back to the first byte in the file.
> And that's from countless usages that I have used this method for.

You are right, I think, about pre-allocation.
So the question is - what method is fastest for
determining of the file size to read file into
already prepared sequence.

I have tested the following code:

--- file cc.e
sequence F
integer fn

procedure a(sequence name)
       F=machine_func(22, name) -- dir 
       F=repeat(0, F[1][3])
       fn = open(name, "rb")
	 close(fn)
end procedure

procedure b(sequence name)
       fn = open(name,"rb")
	if machine_func(19,{fn,-1}) then end if   --seek
	   F=repeat(0, machine_func(20,fn))       --where
	if machine_func(19,{fn,0}) then end if    --seek
	   close(fn)
end procedure

atom T
T=time()
for i=1 to 5000 do
    a("cc.e")
end for
? time() - T

T=time()
for i=1 to 5000 do
    b("cc.e")
end for
? time() - T
-- end of file cc.e


My results on P166MMX/Win95 64M RAM.
------------
DOS Window, ex.exe:

a() 7.8  7.74  7.75  7.69  7.8
b() 4.61 4.62  4.61  4.62  4.67

Plane DOS32 session:

a() 2.75  2.8  2.86  2.75  2.75
b() 2.14  2.14 2.14  2.14  2.2

Windows console, exw.exe:

a()  4.4   4.51  4.68  4.39  4.49
b()  3.67  3.68  3.67  3.68  3.73
----------- 

So, the fastest way is seek/where/seek
with direct calls of needed machine_func().

But dir() is not that bad too, I think.
Adds just a small fraction of one millisecond
to the process, but gives very clear code.

Regards,
Igor Kachan
kinz at peterlink.ru

new topic     » goto parent     » topic index » view message » categorize

17. Re: Constructing sequences

In WIN32 you could just use GetFileSizeEx or GetFileSize, exported
from kernel32.dll. But I don't know how fast it is.
Dan

On 6/28/05, Igor Kachan <kinz at peterlink.ru> wrote:
>
>
> Mario Steele wrote:
>
> > posted by: Mario Steele <eumario at trilake.net>
> >
> > Allright,
> >
> > Considering someone posted something about file i/o (Including Kat),
> > about this, and the overwhelming evidence that pre-allocation is best
> > way to go, here should be an extreamly fast version of loading a file
> > into Memory (AKA Sequence):
> >
> > }}}
<eucode>
> > function fast_read(sequence fname)
> >     sequence file_block
> >     integer fh, size
> >     fh = open(fname,"rb")
> >     if seek(fh,-1) then end if
> >     size = where(fh)
> >     if seek(fh,0) then end if
> >     file_block = repeat(0,size)
> >     for i = 1 to size do
> >         file_block[i] = getc(fh)
> >     end for
> >     close(fh)
> >     return file_block
> > end function
> > </eucode>
{{{

> >
> > This code is untested, as I formulated it out in my head, but it should
> > prove the point that it will be faster.  I use seek() instead of dir(),
> > cause sometimes, dir() Rely's on the Operating System to get the size,
> > and that can prove to be slow.  seek() is faster, cause you open the fi=
le
> > allready, or have it open, seek() is easier to reset the file pointers,
> > rewind, and such, and I have yet to see an error occur from seek() wher=
e
> > it goes to the end of the file, then back to the first byte in the file=
.
> > And that's from countless usages that I have used this method for.
>
> You are right, I think, about pre-allocation.
> So the question is - what method is fastest for
> determining of the file size to read file into
> already prepared sequence.
>
> I have tested the following code:
>
> }}}
<eucode>
> --- file cc.e
> sequence F
> integer fn
>=20
> procedure a(sequence name)
>       F=machine_func(22, name) -- dir
>       F=repeat(0, F[1][3])
>       fn = open(name, "rb")
>         close(fn)
> end procedure
>=20
> procedure b(sequence name)
>       fn = open(name,"rb")
>        if machine_func(19,{fn,-1}) then end if   --seek
>           F=repeat(0, machine_func(20,fn))       --where
>        if machine_func(19,{fn,0}) then end if    --seek
>           close(fn)
> end procedure
>=20
> atom T
> T=time()
> for i=1 to 5000 do
>    a("cc.e")
> end for
> ? time() - T
>=20
> T=time()
> for i=1 to 5000 do
>    b("cc.e")
> end for
> ? time() - T
> -- end of file cc.e
>=20
> </eucode>
{{{

>
> My results on P166MMX/Win95 64M RAM.
> ------------
> DOS Window, ex.exe:
>
> a() 7.8  7.74  7.75  7.69  7.8
> b() 4.61 4.62  4.61  4.62  4.67
>
> Plane DOS32 session:
>
> a() 2.75  2.8  2.86  2.75  2.75
> b() 2.14  2.14 2.14  2.14  2.2
>
> Windows console, exw.exe:
>
> a()  4.4   4.51  4.68  4.39  4.49
> b()  3.67  3.68  3.67  3.68  3.73
> -----------
>
> So, the fastest way is seek/where/seek
> with direct calls of needed machine_func().
>
> But dir() is not that bad too, I think.
> Adds just a small fraction of one millisecond
> to the process, but gives very clear code.
>
> Regards,
> Igor Kachan
> kinz at peterlink.ru
>
>
>
>
>
>
>

new topic     » goto parent     » topic index » view message » categorize

18. Re: Constructing sequences

Mario Steele wrote:
> 
> Allright,
> 
> Considering someone posted something about file i/o (Including Kat), about
> this, and
> the overwhelming evidence that pre-allocation is best way to go, here should
> be an
> extreamly fast version of loading a file into Memory (AKA Sequence):

It seems that a lot of people have missed the original poster's point. Or maybe
its just me blink But the way I read it is that they were trying to load each LINE
of a text file into its own sub-sequence, rather than load the entire file into
one single sequence. Thus a file with twenty lines would create twenty
sub-sequences.

Rob Craig's code is about the best so far for doing that, though I'm sure it
could still be tweaked a bit more.

-- 
Derek Parnell
Melbourne, Australia
irc://irc.sorcery.net:9000/euphoria

new topic     » goto parent     » topic index » view message » categorize

19. Re: Constructing sequences

On Tue, 28 Jun 2005 20:05:31 +0400, Igor Kachan <kinz at peterlink.ru>
wrote:

Please don't do this:
F=machine_func(22, name) -- dir
       ...
	if machine_func(19,{fn,-1}) then end if   --seek
	   F=repeat(0, machine_func(20,fn))       --where
	if machine_func(19,{fn,0}) then end if    --seek

It costs absolutely nothing to define
constant M_SEEK  = 19,
	 M_WHERE = 20,
	 M_DIR   = 22,

And hence (at the very least) code:
F=machine_func(M_DIR, name)
	...
	if machine_func(M_SEEK,{fn,-1}) then end if
	   F=repeat(0, machine_func(M_WHERE,fn))
	if machine_func(M_SEEK,{fn,0}) then end if

If you can show a (run-time) difference, I'll eat my hat, and even the
saving of a routine call is highly questionable, as of course it goes
without saying that
F=dir(name)
	...
	void = seek(fn,-1)
	F=repeat(0,where(fn))
	void = seek(fn,0)

is far, far, far more readable (given that I have only copied relevant
lines so the above does not entirely cohere), and, again, in a
practical real-world example if you can show any measurable gain,
I guess I'll eat my other hat.

>But dir() is not that bad too, I think.
>Adds just a small fraction of one millisecond
>to the process, but gives very clear code.
I wholeheartedly agree. Your example does show an overhead of calling
dir() 5000 times, but very few real-world programs will do that.

Use code which is readable, even when you need it to be fast.

Regards,
Pete

new topic     » goto parent     » topic index » view message » categorize

20. Re: Constructing sequences

Pete Lomax wrote:

> On Tue, 28 Jun 2005 20:05:31 +0400, Igor Kachan <kinz at peterlink.ru>
> wrote:
 
 Please don't do this:

> }}}
<eucode>
>            F=machine_func(22, name) -- dir 
>        ...
> 	if machine_func(19,{fn,-1}) then end if   --seek
> 	   F=repeat(0, machine_func(20,fn))       --where
> 	if machine_func(19,{fn,0}) then end if    --seek
> </eucode>
{{{


Pete, my goal was to post a short, but yet understandable,
solution. But I can not suggest this form for the real
programming in a big complicated program too.

So, I'm fully agreed with your point, but in more hard form:
almost never do that above, boys & girls!  smile

> It costs absolutely nothing to define
> }}}
<eucode>
> constant M_SEEK  = 19,
> 	 M_WHERE = 20,
> 	 M_DIR   = 22,
> </eucode>
{{{

> And hence (at the very least) code:
> }}}
<eucode>
> 	F=machine_func(M_DIR, name)
> 	...
> 	if machine_func(M_SEEK,{fn,-1}) then end if
> 	   F=repeat(0, machine_func(M_WHERE,fn))
> 	if machine_func(M_SEEK,{fn,0}) then end if
> </eucode>
{{{


Yes, Pete, you are right, of course.
But it costs some extra space and some extra efforts.
These constants are not global in file.e library
to just include them.
And Russian laziness, indolens and sloth are world-wide
known, as Oblomov, I'm just lazy too, sorry, sorry, sorry. smile     

> If you can show a (run-time) difference, I'll eat my hat, and even the
> saving of a routine call is highly questionable, as of course it goes
> without saying that
> }}}
<eucode>
> 	F=dir(name)
> 	...
> 	void = seek(fn,-1)
> 	F=repeat(0,where(fn))
> 	void = seek(fn,0)
> </eucode>
{{{

> is far, far, far more readable (given that I have only copied relevant
> lines so the above does not entirely cohere), and, again, in a
> practical real-world example if you can show any measurable gain,
> I guess I'll eat my other hat.

I'm ready to eat yours hats, do not worry, you are absolutelly
right here, I think. 
 
> >But dir() is not that bad too, I think.
> >Adds just a small fraction of one millisecond
> >to the process, but gives very clear code.
> I wholeheartedly agree. Your example does show an overhead of calling
> dir() 5000 times, but very few real-world programs will do that.

Yes, of course.

> Use code which is readable, even when you need it to be fast.

Yes, but sometimes you may need some extremely fast and extremely
compact Euphoria code, Pete.

> Regards,
> Pete

Regards,
Igor Kachan
kinz at peterlink.ru

new topic     » goto parent     » topic index » view message » categorize

21. Re: Constructing sequences

On 29 Jun 2005, at 0:55, Pete Lomax wrote:

> 
> 
> On Tue, 28 Jun 2005 20:05:31 +0400, Igor Kachan <kinz at peterlink.ru>
> wrote:

<snip>
 
> >But dir() is not that bad too, I think.
> >Adds just a small fraction of one millisecond
> >to the process, but gives very clear code.
> I wholeheartedly agree. Your example does show an overhead of calling
> dir() 5000 times, but very few real-world programs will do that.

Wow, i seem to be still writing lots of rare code then! Can i get a goto?

Kat

new topic     » goto parent     » topic index » view message » categorize

22. Re: Constructing sequences

Derek Parnell wrote:

> 
> Mario Steele wrote:
> > 
> > Allright,
> > 
> > Considering someone posted something about file i/o (Including Kat),
about this, and
> > the overwhelming evidence that pre-allocation is best way to go, here
should be an
> > extreamly fast version of loading a file into Memory (AKA Sequence):
> 
> It seems that a lot of people have missed the original poster's point.
> Or maybe its just me blink But the way I read it is that they were trying
> to load each LINE of a text file into its own sub-sequence, rather than
> load the entire file into one single sequence. Thus a file with twenty
> lines would create twenty sub-sequences.
> 
> Rob Craig's code is about the best so far for doing that, though I'm sure
> it could still be tweaked a bit more.

What to me, I try to answer the following Gordon's
question:

GW>> I am reading the data records in from a file so I don't see an
GW>> obvious way I could build the whole sequence at once. Wouldn't
GW>> I have to temporariliy store the records in another sequence,
GW>> incurring the same 'append' overhead anyway?

Really, the way with the *file size* is absent in the original
RDS documents, as far as I can see, so it is not *obvious way*,
I think.

But these ways exist.

So, any one can place a big file into single sequence at once
and do all he/she wants with that sequence - scan, find CR/LF,
make lines, records, parse them etc etc.

Mario takes this point in his attention too, I think.

Matt's approach is the simplest one - just create
a sequence with some extra elements to place a big
file without risk to go away from the sequence space
while reading a file.

Regards,
Igor Kachan
kinz at peterlink.ru

new topic     » goto parent     » topic index » view message » categorize

23. Re: Constructing sequences

Kat wrote:

> On 29 Jun 2005, at 0:55, Pete Lomax wrote:
>
>> On Tue, 28 Jun 2005 20:05:31 +0400, Igor Kachan wrote:
>
> <snip>
>
>>> But dir() is not that bad too, I think.
>>> Adds just a small fraction of one millisecond
>>> to the process, but gives very clear code.
>> I wholeheartedly agree. Your example does show an overhead of calling
>> dir() 5000 times, but very few real-world programs will do that.
>
> Wow, i seem to be still writing lots of rare code then! Can i get a goto?

I couldn't imagine it previously, but obviously any and every post
(regardless of its topic) *can* be used to write a reply that deals
with "goto". Interesting.

Regards,
   Juergen

new topic     » goto parent     » topic index » view message » categorize

24. Re: Constructing sequences

This is similar to I problem I had. It seems best to avoid using some kinds 
of very large data structures in memory and use clever file manipulation or 
the fast Euphoria Database System.
Try running the code below, incrementing MEMORY constant: 1000, 10000, 100000.
You can also have Task Manager open and watch Mem Usage.
Here is what I got, notice that the parsing functions do not return the same
(By the way has anyone a simple trick to get rid of the return at the
end og fthe string and the last comma):

mySplit (my own)
0.06
1000
0.87
10000
87.93
100000
"abc","abc","abc","abc","abc","abc","abc","abc","abc","abc","","abc",

Kparse (kparse.e)
0.06
1000
3.46
10000
384.81
100000
"abc","abc","abc","abc","abc","abc","abc","abc","abc","abc","","abc
",

parse (Strtok-v2-1.e)
0.05
1000
3.51
10000
385.47
100000
"abc","abc","abc","abc","abc","abc","abc","abc","abc","abc","abc
",

And here is the code. Notice my way of naming variables, o for objects, s for
strings,
n for integers, l for lists, t for tables or trees (lists containing lists). Any
comment?
Also it is the append function that evaluates the parsing function so:
	l_line = mySplit(o_line, TAB)
	t_buffer = append(t_buffer, l_line)
is same as:
	t_buffer = append(t_buffer, mySplit(o_line, TAB))
You can uncomment all functions and it is the last one that runs.

include kparse.e
include Strtok-v2-1.e
include file.e
include types.e

without warning

sequence s_line, l_line, t_buffer
integer n_file
object o_line
atom t
constant TRUE = 1
constant FALSE = 0
constant TAB = "\t"
constant MEMORY = 1000


function mySplit(string s_input, object o_limiter)
	sequence l_return, s_return, s_limiter, s_tail
	integer n_start, n_stop
	
	if atom(o_limiter) then
		s_limiter = {o_limiter}
	else
		s_limiter = o_limiter
	end if
	l_return = {}
	n_start = 1
	s_tail = s_input
	while TRUE do
		n_stop = match(s_limiter, s_tail)
		if n_stop then
			s_return = s_input[n_start..n_stop-1]
			l_return = append(l_return, s_return)
			s_tail = s_tail[n_stop + length(s_limiter)..$]
		else
			if length(s_tail) > 1 then
				l_return = append(l_return, s_tail[n_start..$-1])
			end if
			exit
		end if
	end while
	return l_return
end function

-- create file
n_file = open("big.txt", "wb")
s_line = "abc\tabc\tabc\tabc\tabc\tabc\tabc\tabc\tabc\tabc\t\tabc"
for i = 1 to MEMORY do
	puts(n_file, s_line)
	puts(n_file, "\n")
end for
close(n_file)

t = time()
n_file = open("big.txt", "rb")
t_buffer = {}
while 1 do
	o_line = gets(n_file)
	if atom(o_line) then
		exit
	end if
	--l_line = Kparse(o_line, TAB)
	--l_line = parse(o_line, TAB)
	l_line = mySplit(o_line, TAB)
	t_buffer = append(t_buffer, l_line)
end while
close(n_file)
? time() - t
? length(t_buffer)
? length(t_buffer[$-1])
for i = 1 to length(t_buffer[$-1]) do
	puts(1,"\"" & t_buffer[$-1][i] & "\",")
end for


new topic     » goto parent     » topic index » view message » categorize

25. Re: Constructing sequences

Sorry, an old version of my function got there. Here it is hopefully OK:

function mySplit(string s_input, object o_limiter)
	sequence l_return, s_return, s_limiter, s_tail
	integer n_start, n_stop
	
	if atom(o_limiter) then
		s_limiter = {o_limiter}
	else
		s_limiter = o_limiter
	end if
	l_return = {}
	n_start = 1
	s_tail = s_input
	while TRUE do
		n_stop = match(s_limiter, s_tail)
		if n_stop then
s_return = s_tail[n_start..n_stop-1]

			l_return = append(l_return, s_return)
			s_tail = s_tail[n_stop + length(s_limiter)..$]
		else
			if length(s_tail) > 1 then
				l_return = append(l_return, s_tail[n_start..$-1])
			end if
			exit
		end if
	end while
	return l_return
end function

new topic     » goto parent     » topic index » view message » categorize

26. Re: Constructing sequences

Haflidi Asgrimsson wrote:
> 
> This is similar to I problem I had. It seems best to avoid using some kinds 
> of very large data structures in memory and use clever file manipulation or 
> the fast Euphoria Database System.
> Try running the code below, incrementing MEMORY constant: 1000, 10000, 100000.
> You can also have Task Manager open and watch Mem Usage.
> Here is what I got, notice that the parsing functions do not return the same
> (By the way has anyone a simple trick to get rid of the return at the
> end og fthe string and the last comma):
> 

I remember that Kat added a keep parameter into string-tok , which needs to be
specified
in the call. that's why the output is different. 

since Kparse builds a flat bookmark list of delimiter positions 
before it generates the actual list, it would be 
easy to read the length of the bookmark list and preallocate
the parsed output list.

there's still the overhead of appending to the bookmark list itself.



--"ask about our layaway plan".
--

new topic     » goto parent     » topic index » view message » categorize

27. Re: Constructing sequences

Maybe I’m digressing from the point of the original post but, for me, it
raises the question, what are the general rules for fast loading and saving
of very complex or huge structures and what level of sub-sequences can be
addressed efficiently?

Here’s an example with pre allocation (structure only) and appending to a
subscripted variable which is only twice as slow at loading from file as
Robert’s optimised routine loading to a simple buffer sequence (it’s part of
a simple spell checker I use).

For speed, it appears that pre-defining the structure before reading and
breaking the structure down into simple units before saving is important.

include file.e
include misc.e
atom handle

-- create pseudo dictionary file - legible version of Robert's code
handle = open("bigfile.txt", "w")
for i = 1 to 100000 do
puts(handle, 'a' + rand(repeat(25, 19 + rand(11))))
puts(handle, '\n')
end for
close(handle)

--key for pseudo dictionary hash - code intended for proper lower case words
function dic_hash_key(sequence key_word)
integer i, j, k
i = key_word[1] - 96
if length(key_word) > 2 then--most tested at this stage
j = key_word[2] - 96
k = key_word[3] - 96
return {i, j, k}
elsif length(key_word) > 1 then--very few tested at or after this stage
j = key_word[2] - 96
return {i, j, 26}
else return {i, 26, 26}
end if
end function

--read
function read_list_to_hash(sequence file)
object filed_line
sequence hash, key
handle = open(file, "r")
hash = repeat(repeat(repeat({}, 26), 26), 26)
while 1 do
filed_line = gets(handle)
if sequence(filed_line) then
filed_line = filed_line[1..length(filed_line) -1]
key = dic_hash_key(filed_line)
hash[key[1]][key[2]][key[3]]=append(hash[key[1]][key[2]][key[3]],filed_line)
else exit
end if
end while
close(handle)
return hash
end function

--save
procedure save_hash_to_list(sequence file, sequence hash)
handle = open(current_dir() & file, "w")
for h = 1 to length(hash) do
for i = 1 to length(hash[h]) do
for j = 1 to length(hash[h][i]) do
for k = 1 to length(hash[h][i][j]) do
puts(handle, hash[h][i][j][k] & 10)
end for
end for
end for
end for
close(handle)
end procedure

sequence dictionary_hash
atom t

--load dictionary
t = time()
dictionary_hash = read_list_to_hash("bigfile.txt")
?time()-t

--save dictionary
t = time()
save_hash_to_list("bigfile.txt", dictionary_hash)
?time()-t
sleep(3)

new topic     » goto parent     » topic index » view message » categorize

Search



Quick Links

User menu

Not signed in.

Misc Menu