1. internal storage

The manual states:

 Performance Note:
    Does this mean that all atoms are stored in memory as 8-byte 
    floating-point numbers? No. The Euphoria interpreter usually 
    stores integer-valued atoms as machine integers (4 bytes) to 
    save space and improve execution speed. When fractional results 
    occur or numbers get too big, conversion to floating-point happens 
    automatically.

My question is: are string sequences then stored as 4-byte atoms or as 
1-byte atoms?

This is quite important to the app I am porting. I am parsing about
4 Meg of character data, and the process performs much better if I can
parse the character data to an intermediate state in memory, which means
holding all the data in memory at once. I could always write the
intermediate state to temp files, but that makes the process run 1000%
slower (on windoze). I can live with 4 Meg of memory consumption for this
process, but I am not so keen on 16 Meg of memory consumption.

Jim


2. Re: internal storage

Jim Hendricks wrote:
> 
> The manual states:
> 
>  Performance Note:
>     Does this mean that all atoms are stored in memory as 8-byte 
>     floating-point numbers? No. The Euphoria interpreter usually 
>     stores integer-valued atoms as machine integers (4 bytes) to 
>     save space and improve execution speed. When fractional results 
>     occur or numbers get too big, conversion to floating-point happens 
>     automatically.
> 
> My question is are string sequences then stored as 4 byte atoms or as 
> 1 byte atoms?
> 
> This is quite important to the app I am porting since I am parsing about
> 4 Meg of character data and the process performs much better if I can
> parse the character data to an intermediate state in memory which causes
> having all the data in memory at once. I could always go with an
> intermediate state to temp files, but this causes the process to run 1000%
> slower( on windoze ). I can live with requiring 4 Meg of memory consumption
> for this process, but not so keen on requiring 16 Meg of memory consumption.

All 'characters' are stored as 4-byte integers and not stored as single 
bytes.

If you have large text strings in RAM, you might need to rethink your 
design, or let the operating system deal with the virtual memory.

Another consideration is that each sequence is stored in contiguous RAM,
so rather than keeping one giant sequence containing all your
text, it is better to break it up into a sequence of sequences.

E.G. 

sequence theFile
theFile = {
  "line one", 
  "line two", 
  "line three"
   }

This is four sequences: one for the file and one for each line. So it
takes up four independent RAM blocks. This can affect RAM paging and
swapping for very large sequences.
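
As a rough sketch of how such a sequence of sequences could be built from
a file: the routine below reads one line at a time with gets() and appends
each line as its own sequence. The name read_lines is made up for this
example, and the code assumes plain 8-bit text.

```euphoria
-- Hypothetical helper (not part of any library): read a text file into
-- a sequence of lines, one sequence per line.
function read_lines(sequence path)
    integer fn
    object line
    sequence lines

    lines = {}
    fn = open(path, "r")
    if fn = -1 then
        return -1              -- could not open the file
    end if
    while 1 do
        line = gets(fn)
        if atom(line) then     -- gets() returns -1 at end of file
            exit
        end if
        lines = append(lines, line)
    end while
    close(fn)
    return lines
end function
```

Each line keeps its trailing '\n', and the caller should check for the -1
result before treating the return value as a sequence.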

-- 
Derek Parnell
Melbourne, Australia


3. Re: internal storage

Derek Parnell wrote:
> 
> Jim Hendricks wrote:
> > 
> > The manual states:
> > 
> >  Performance Note:
> >     Does this mean that all atoms are stored in memory as 8-byte 
> >     floating-point numbers? No. The Euphoria interpreter usually 
> >     stores integer-valued atoms as machine integers (4 bytes) to 
> >     save space and improve execution speed. When fractional results 
> >     occur or numbers get too big, conversion to floating-point happens 
> >     automatically.
> > 
> > My question is are string sequences then stored as 4 byte atoms or as 
> > 1 byte atoms?
> > 
> > This is quite important to the app I am porting since I am parsing about
> > 4 Meg of character data and the process performs much better if I can
> > parse the character data to an intermediate state in memory which causes
> > having all the data in memory at once. I could always go with an
> > intermediate state to temp files, but this causes the process to run 1000%
> > slower( on windoze ). I can live with requiring 4 Meg of memory consumption
> > for this process, but not so keen on requiring 16 Meg of memory consumption.
> 
> All 'characters' are stored as 4-byte integers and not stored as single 
> bytes.
This DEFINITELY should be improved in a new version of Euphoria.
There are many times when I use allocated memory to get around
this problem.
Euphoria 2.5 (or 2.6 if it would take too long) should use 1 byte
instead of 4 bytes whenever possible.
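
A minimal sketch of that allocated-memory workaround, assuming 8-bit text
and the standard allocate(), poke(), peek() and free() routines. The names
store_bytes and fetch_bytes are invented for this example.

```euphoria
include machine.e   -- allocate() and free()

-- Hypothetical helpers: keep 8-bit text in raw memory, one byte per
-- character, instead of in a sequence.
function store_bytes(sequence text)
    atom addr
    addr = allocate(length(text))
    poke(addr, text)           -- each element is written as a single byte
    return addr
end function

function fetch_bytes(atom addr, integer len)
    return peek({addr, len})   -- back to an ordinary sequence of integers
end function

atom buf
sequence s
s = "line one"
buf = store_bytes(s)
puts(1, fetch_bytes(buf, length(s)) & '\n')
free(buf)
```

The caller has to remember the length of each block and free() it when
done, so this trades convenience for memory.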

> 
> If you have large text strings in RAM, you might need to rethink your 
> design, or let the operating system deal with the virtual memory.
> 
> Another consideration is that each sequence is stored in contiguous RAM
> so rather than have a giant single sequence containing all your
> text, it is better to break it up into a sequence of sequences.
> 
> E.G. 
> 
> sequence theFile
> theFile = {
>   "line one", 
>   "line two", 
>   "line three"
>    }
> 
> This is four sequences: one for the file and one for each line. So it
> takes up four independent RAM blocks. This can affect RAM paging and
> swapping for very large sequences.
> 
> -- 
> Derek Parnell
> Melbourne, Australia
>


4. Re: internal storage

CoJaBo wrote:
> 
> Derek Parnell wrote:

[snip]

> > All 'characters' are stored as 4-byte integers and not stored as single 
> > bytes.
>
> This DEFINITELY should be improved in a new version of Euphoria.
> There are many times when I use allocated memory to get around
> this problem.
> Euphoria 2.5 (or 2.6 if it would take too long) should use 1 byte
> instead of 4 bytes whenever possible.

On the other hand, Euphoria's choice of 30-bit characters makes Unicode
very, very easy to implement. Encoding in UTF-32 is a one-to-one
mapping for most characters and only a small number would need to be
stored in atoms.

At the risk of complicating Euphoria, there may be a case to argue for
a native UTF-8 character string. This would mean that English text
would use 8-bit characters, and most European languages would average
around 8-10 bits per character, though the East Asian languages would
more than likely average 16-20 bits per character. Microsoft have 
decided to store Unicode strings as UTF-16 encoding which means that
most languages in the world use about 2 bytes per character.

Of course, you could do roll-your-own 'packed' string type for Euphoria
sequences at the cost of slower execution speed. 

But there can't be many applications where keeping all the text in RAM
simultaneously is actually a performance boost. Most
applications would only be dealing with a subset of the text at any one
time. I don't think Google keeps all its cached pages in RAM ;-)

<anecdote>
I once wrote a tiny text editor (4KB of assembler) in which the text was
never stored in RAM, just the disk address of each line. It ran so fast
on an Intel 8088 that people didn't notice it was continually going
out to disk to read text in.
</anecdote>

-- 
Derek Parnell
Melbourne, Australia


5. Re: internal storage

Derek Parnell wrote:
> 
> Jim Hendricks wrote:
> > 
> > The manual states:
> > 
> >  Performance Note:
> >     Does this mean that all atoms are stored in memory as 8-byte 
> >     floating-point numbers? No. The Euphoria interpreter usually 
> >     stores integer-valued atoms as machine integers (4 bytes) to 
> >     save space and improve execution speed. When fractional results 
> >     occur or numbers get too big, conversion to floating-point happens 
> >     automatically.
> > 
> > My question is are string sequences then stored as 4 byte atoms or as 
> > 1 byte atoms?
> > 
> > This is quite important to the app I am porting since I am parsing about
> > 4 Meg of character data and the process performs much better if I can
> > parse the character data to an intermediate state in memory which causes
> > having all the data in memory at once. I could always go with an
> > intermediate state to temp files, but this causes the process to run 1000%
> > slower( on windoze ). I can live with requiring 4 Meg of memory consumption
> > for this process, but not so keen on requiring 16 Meg of memory consumption.
> 
> All 'characters' are stored as 4-byte integers and not stored as single 
> bytes.
> 
> If you have large text strings in RAM, you might need to rethink your 
> design, or let the operating system deal with the virtual memory.
> 
> Another consideration is that each sequence is stored in contiguous RAM
> so rather than have a giant single sequence containing all your
> text, it is better to break it up into a sequence of sequences.
> 
> E.G. 
> 
> sequence theFile
> theFile = {
>   "line one", 
>   "line two", 
>   "line three"
>    }
> 
> This is four sequences: one for the file and one for each line. So it
> takes up four independent RAM blocks. This can affect RAM paging and
> swapping for very large sequences.
> 
> -- 
> Derek Parnell
> Melbourne, Australia
Thanks, I will have to go with temp files then and take the performance
hit. It's not a key function of the application, so performance is not
critical; it's just annoying to take a process that runs in a few seconds
in memory and turn it into something that runs in a minute or more via
temp files. (I may have been exaggerating a little with my 1000% slowdown,
but it sure feels that way.) I know about this slowdown because the app is
running fine in Java: I first wrote it against memory, didn't like how
much memory was consumed, went to a temp file and saw the slowdown. Java
Strings use 2-byte chars, but I think char and byte still take 4 bytes
internally in Java, because I dropped Strings in favor of byte arrays and
didn't see any improvement in memory usage.

I am already using sequences of sequences because it lets me structure
the intermediate data in a very navigable way, so it's good to hear that
the decision had some hidden benefits and that I managed, by accident, to
dodge a performance bullet.


6. Re: internal storage

Derek Parnell wrote:
> 
> CoJaBo wrote:
> > 
> > Derek Parnell wrote:
> 
> [snip]
> 
> > > All 'characters' are stored as 4-byte integers and not stored as single 
> > > bytes.
> >
> > This DEFINITELY should be improved in a new version of Euphoria.
> > There are many times when I use allocated memory to get around
> > this problem.
> > Euphoria 2.5 (or 2.6 if it would take too long) should use 1 byte
> > instead of 4 bytes whenever possible.
> 
> On the other hand, Euphoria's choice of 30-bit characters makes Unicode
> very, very easy to implement. Encoding in UTF-32 is a one-to-one
> mapping for most characters and only a small number would need to be
> stored in atoms.
> 
> At the risk of complicating Euphoria, there may be a case to argue for
> a native UTF-8 character string. This would mean that English text
> would use 8-bit characters, and most European languages would average
> around 8-10 bits per character, though the East Asian languages would
> more than likely average 16-20 bits per character. Microsoft have 
> decided to store Unicode strings as UTF-16 encoding which means that
> most languages in the world use about 2 bytes per character.
I'm one of those "limited mindset" Americans who avoid UTF. Yeah, I know,
it's ignoring the whole non-English-speaking world (which is pretty big),
but I keep my hands full just catering to the English-only crowd.

> 
> Of course, you could do roll-your-own 'packed' string type for Euphoria
> sequences at the cost of slower execution speed. 
> 
> But there can't be many applications where keeping all the text in RAM
> simultaneously is actually a performance boost. Most
> applications would only be dealing with a subset of the text at any one
> time. I don't think Google keeps all its cached pages in RAM ;-)
I agree. My particular case is processing a tag markup file (similar to
XML) to convert it to a proprietary file storage format for use in an
application. There are certain pieces of information that I need to
know about the data before writing the final file storage format, and they
are not available until all the input data has been processed. By going all
RAM, all the data is read in, with some processing and formatting happening
along the way. Once I have the final data from reading the whole file, I
can then write out the final file, with the remaining processing happening
while writing the data out. With the temp file approach I have to read the
data in while writing the partially processed data out to temp files. Once
the full data is read and I have the final key pieces, I have to read the
temp files, do the final processing and write out to the final files.
The problem is that reading the temp files requires some jumping around
within them, which adds further performance delays on top of the fact that
I'm now reading and writing files. The cost of jumping around in memory
is negligible.

I'm toying now with the thought of a multi-pass process which would read
the input file through once to build/obtain the meta info necessary for
final assembly, then do a second pass read to do the read/process/write in
one shot without the need for an intermediate store in RAM or in temp files.
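
A sketch of what such a two-pass conversion might look like. The "meta
info" here is just a line count written as a header, which is only a
stand-in for whatever the real storage format needs; the name convert and
the output layout are assumptions made for this example.

```euphoria
-- Hypothetical two-pass conversion. Pass 1 gathers metadata (here only
-- a line count); pass 2 rereads the input and writes the final file.
function convert(sequence in_path, sequence out_path)
    integer fin, fout, count
    object line

    -- Pass 1: scan the input and collect the metadata only.
    fin = open(in_path, "r")
    if fin = -1 then
        return 0
    end if
    count = 0
    while 1 do
        line = gets(fin)
        if atom(line) then     -- -1 means end of file
            exit
        end if
        count = count + 1
    end while
    close(fin)

    -- Pass 2: read, process and write in one shot.
    fin = open(in_path, "r")
    if fin = -1 then
        return 0
    end if
    fout = open(out_path, "w")
    if fout = -1 then
        close(fin)
        return 0
    end if
    printf(fout, "%d\n", count)   -- metadata goes out first
    while 1 do
        line = gets(fin)
        if atom(line) then
            exit
        end if
        puts(fout, line)          -- real per-line processing would go here
    end while
    close(fin)
    close(fout)
    return 1
end function
```

Only one line is held in memory at a time, at the price of reading the
input file twice.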

> <anecdote>
> I once wrote a tiny text editor (4KB of assembler) in which the text was
> never stored in RAM, just the disk address of each line. It ran so fast
> on an Intel 8088 that people didn't notice it was continually going
> out to disk to read text in.
> </anecdote>
> 
> -- 
> Derek Parnell
> Melbourne, Australia
>


7. Re: internal storage

Jim Hendricks wrote:
<snip>
> My question is are string sequences then stored as 4 byte atoms or as 
> 1 byte atoms?
<snip>
Yes, string sequences, as stated, are really just numeric sequences. 
{65,66,67,3265} is internally the same as "ABC" & 3265
You can sort of obscure plain text passwords in source files by writing 
them in sequence format 

 i.e. pwd = {325,330,335}/5 

There are two workarounds that I tried because I had to split 24 megabyte 
revenue usage reports apart, creating a new subreport for each 
revenue department key found in the page headers. This would run very
slowly on the old Win 95 machines it had to be run on, constantly chugging
through virtual memory. 

The first was to pack three atoms into one integer internally
(the library is in the archives),
so incompress({65,66,67}) would return {656667}.

That also runs pretty slowly with really big sequences.

The faster way was to create a 'bucket' routine to 
handle the file input, which allows me to set a limit on
the amount of text held in memory, 100,000 lines or so.

The splitter routine does not read the file directly,
it reads from a buffer sequence until it reaches the end of the bucket,
and calls the bucket fill procedure again.
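
A minimal sketch of the bucket idea, not the original routine. The names
BUCKET_SIZE, open_bucket, fill_bucket and next_line are invented for this
example, and the limit of 100,000 lines is taken from the description
above.

```euphoria
-- Hypothetical "bucket" reader: hold at most BUCKET_SIZE lines in memory.
constant BUCKET_SIZE = 100000

sequence bucket
integer bucket_pos, bucket_fn
bucket = {}
bucket_pos = 1
bucket_fn = -1

-- Open the input file; returns 1 on success, 0 on failure.
function open_bucket(sequence path)
    bucket_fn = open(path, "r")
    bucket = {}
    bucket_pos = 1
    return bucket_fn != -1
end function

-- Refill the bucket with the next BUCKET_SIZE lines (or fewer at the end).
procedure fill_bucket()
    object line
    bucket = {}
    bucket_pos = 1
    while length(bucket) < BUCKET_SIZE do
        line = gets(bucket_fn)
        if atom(line) then     -- end of file
            exit
        end if
        bucket = append(bucket, line)
    end while
end procedure

-- Returns the next line, or -1 when the input is exhausted.
function next_line()
    object line
    if bucket_pos > length(bucket) then
        fill_bucket()
        if length(bucket) = 0 then
            return -1
        end if
    end if
    line = bucket[bucket_pos]
    bucket_pos = bucket_pos + 1
    return line
end function
```

The splitter would call open_bucket() once and then next_line() until it
gets -1, so it never touches the file directly. Closing bucket_fn
afterwards is left to the caller.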


8. Re: internal storage

Michael Raley wrote:
> 
> 
> Jim Hendricks wrote:
> <snip>
> > My question is are string sequences then stored as 4 byte atoms or as 
> > 1 byte atoms?
> <snip>
> Yes, string sequences as stated are really just numeric sequences. 
> {65,66,67,3265} is internally the same as "ABC" & 3265
> You can sort of obscure plain text passwords in source files by writing 
> them in sequence format 
> 
>  i.e. pwd = {325,330,335}/5 
> 
> There are two workarounds that I tried because I had to split 24 megabyte 
> revenue usage reports apart, creating a new subreport for each 
> revenue department key found in the page headers. This would run very slow
> on the old win 95 machines it had to be run on, constantly chugging through 
> virtual memory. 
> 
> The first was to try to internally pack three atoms into one integer 
> (the library is in the archives) 
> so incompress({65,66,67}) would return {656667}
Yes, that was suggested by someone else, but if I were to go to all the
work of packing/unpacking, I would stuff 4 chars per atom, unless there's
some limitation on using all 32 bits of an integer atom.
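
On that limitation: in 32-bit Euphoria the integer type covers only 31
bits (-1073741824 to +1073741823), so four 8-bit chars will not always fit
in an integer, while three (24 bits) always do. A rough sketch of
base-256 packing under that assumption; pack3 and unpack3 are made-up
names, and the chars are assumed to be in the range 0 to 255.

```euphoria
-- Hypothetical helpers: pack up to three 8-bit characters into one
-- Euphoria integer, and unpack them again.
function pack3(sequence chars)
    integer n
    n = 0
    for i = 1 to length(chars) do
        n = n * 256 + chars[i]     -- shift left one byte, add next char
    end for
    return n
end function

function unpack3(integer n, integer len)
    sequence chars
    chars = repeat(0, len)
    for i = len to 1 by -1 do
        chars[i] = remainder(n, 256)   -- lowest byte is the last char
        n = floor(n / 256)
    end for
    return chars
end function
```

The caller has to keep track of how many chars each packed value holds,
since a leading zero byte cannot be told apart from a shorter string.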
 
> which runs pretty slow too with really big sequences.
> 
> The faster way was to create a 'bucket' routine to 
> handle the file input, which allows me to set a limit on
> the amount of text held in memory, 100,000 lines or so.
> 
> The splitter routine does not read the file directly,
> it reads from a buffer sequence until it reaches the end of the bucket,
> and calls the bucket fill procedure again.
Yes, this approach is used in many apps as buffered IO. It rightly 
assumes that IO is the performance bottleneck and that reading many bytes 
takes only slightly longer than reading 1 byte, since the performance hit
is in clearing the channel, positioning the head, etc. This is where 
buffering on a cluster boundary gives the best performance kick, since
clusters are contiguous on the HDD and can therefore be read in 1 pass. Of
course, the same situation exists for writes, only worse, since a write is
a more costly operation than a read.

As I stated in a previous post, my app stores all the data in memory 
primarily because some of the information I need to properly process the
data is not known until the whole file has been read. I may do well to go
with a multipass process in which the first pass obtains the metadata
necessary to process the file and the second pass does the actual
processing.

Jim


9. Re: internal storage

On 20 Sep 2004, at 5:54, Jim Hendricks wrote:

> 
> 
> posted by: Jim Hendricks <jim at bizcomputinginc.com>
> 
> Michael Raley wrote:
> > 
> > 
> > Jim Hendricks wrote:
> > <snip>
> > > My question is are string sequences then stored as 4 byte atoms or as 
> > > 1 byte atoms?
> > <snip>
> > Yes, string sequences as stated are really just numeric sequences. 
> > {65,66,67,3265} is internally the same as "ABC" & 3265
> > You can sort of obscure plain text passwords in source files by writing 
> > them in sequence format 
> > 
> >  i.e. pwd = {325,330,335}/5 
> > 
> > There are two workarounds that I tried because I had to split 24 megabyte
> > revenue usage reports apart, creating a new subreport for each revenue
> > department key found in the page headers. This would run very slow on the
> > old win 95 machines it had to be run on, constantly chugging through
> > virtual memory. 

Did you try faster hardware under the win95? I recently spent an hr
optimising a function, and trimmed over an hour of runtime out of it (went
from 20 minutes to under a second, using indexes extensively and NOT
resizing the sequences).

> > The first was to try to internally pack three atoms into one integer 
> > (the library is in the archives) 
> > so incompress({65,66,67}) would return {656667}
> Yes, that was suggested by someone else, but if I were to go with all the
> work of packing/unpacking I would stuff 4 chars per atom unless there's some
> limitation on use of all 32 bits of an integer atom.

Several people have written assorted memory storage files for your task.
Derek wrote one, Euman wrote at least one this year, Allen Robnett did
something, Jordah did something along these lines, and I wouldn't be
surprised if Jiri did too. Try Euman's for speed, or Derek's for
versatility, or Jordah's. Euman's uses a windoze C call to do fast pattern
searches in the memory block. It's plenty easy to drop blocks of the raw
memory into a sequence to do Eu operations on them.

Kat
