1. internal storage
- Posted by Jim Hendricks <jim at bizcomputinginc.com> Sep 19, 2004
- 437 views
- Last edited Sep 20, 2004
The manual states:

Performance Note:
Does this mean that all atoms are stored in memory as 8-byte floating-point numbers? No. The Euphoria interpreter usually stores integer-valued atoms as machine integers (4 bytes) to save space and improve execution speed. When fractional results occur or numbers get too big, conversion to floating-point happens automatically.

My question is: are string sequences then stored as 4-byte atoms or as 1-byte atoms?

This is quite important to the app I am porting, since I am parsing about 4 Meg of character data and the process performs much better if I can parse the character data to an intermediate state in memory, which means having all the data in memory at once. I could always go with an intermediate state in temp files, but this causes the process to run 1000% slower (on windoze). I can live with requiring 4 Meg of memory consumption for this process, but I'm not so keen on requiring 16 Meg of memory consumption.

Jim
2. Re: internal storage
- Posted by Derek Parnell <ddparnell at bigpond.com> Sep 20, 2004
- 435 views
Jim Hendricks wrote:
> The manual states:
>
> Performance Note:
> Does this mean that all atoms are stored in memory as 8-byte
> floating-point numbers? No. The Euphoria interpreter usually
> stores integer-valued atoms as machine integers (4 bytes) to
> save space and improve execution speed. When fractional results
> occur or numbers get too big, conversion to floating-point happens
> automatically.
>
> My question is are string sequences then stored as 4 byte atoms or as
> 1 byte atoms?
>
> This is quite important to the app I am porting since I am parsing about
> 4 Meg of character data and the process performs much better if I can
> parse the character data to an intermediate state in memory which causes
> having all the data in memory at once. I could always go with an
> intermediate state to temp files, but this causes the process to run 1000%
> slower( on windoze ). I can live with requiring 4 Meg of memory consumption
> for this process, but not so keen on requiring 16 Meg of memory consumption.

All 'characters' are stored as 4-byte integers, not as single bytes.

If you have large text strings in RAM, you might need to rethink your design, or let the operating system deal with the virtual memory.

Another consideration is that each sequence is stored in contiguous RAM, so rather than have a giant single sequence containing all your text, it is better to break it up into a sequence of sequences. E.g.

    sequence theFile
    theFile = {
        "line one",
        "line two",
        "line three"
    }

This is four sequences: one for the file and another for each line, so it takes up four independent RAM blocks. This can affect RAM paging and swapping for very large sequences.

--
Derek Parnell
Melbourne, Australia
3. Re: internal storage
- Posted by CoJaBo <cojabo at suscom.net> Sep 20, 2004
- 423 views
Derek Parnell wrote:
> Jim Hendricks wrote:
> > My question is are string sequences then stored as 4 byte atoms or as
> > 1 byte atoms?
> [snip]
> All 'characters' are stored as 4-byte integers and not stored as single
> bytes.

This DEFINITELY should be improved in a new version of Euphoria. There are many times where I use allocated memory to get around this problem. Euphoria 2.5 (or 2.6, if it would take too long) should use 1 byte instead of 4 bytes whenever possible.

> If you have large text strings in RAM, you might need to rethink your
> design, or let the operating system deal with the virtual memory.
> [snip]
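The allocated-memory workaround mentioned above can be sketched roughly as follows. This is a minimal illustration, assuming the `allocate()`, `poke()`, `peek()` and `free()` routines of Euphoria 2.x (`machine.e`); the variable names are made up for the example and it is not CoJaBo's actual code. Note that `poke()` keeps only the low 8 bits of each element, so this only suits single-byte character data.

```euphoria
include machine.e   -- allocate() and free()

sequence text, restored
atom addr
integer n

text = "line one"
n = length(text)

addr = allocate(n)          -- n bytes of raw memory, one per character
if addr = 0 then
    puts(1, "allocation failed\n")
    abort(1)
end if

poke(addr, text)            -- stores the low 8 bits of each element

-- Later: copy the bytes back into an ordinary sequence to work on them.
restored = peek({addr, n})
free(addr)

if equal(restored, text) then
    puts(1, "round trip ok\n")
end if
```

The trade-off is that the text is no longer a sequence while it sits in raw memory, so every Euphoria operation on it needs a `peek()` first.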
4. Re: internal storage
- Posted by Derek Parnell <ddparnell at bigpond.com> Sep 20, 2004
- 455 views
CoJaBo wrote:
> Derek Parnell wrote:
> [snip]
> > All 'characters' are stored as 4-byte integers and not stored as single
> > bytes.
>
> This DEFINITELY should be improved in a new version of Euphoria.
> There are many times where I use allocated memory to get around
> this problem.
> Euphoria 2.5 (or 2.6 if it would take too long) should use 1-byte
> instead of 4-byte whenever possible.

On the other hand, Euphoria's choice of 30-bit characters makes Unicode very, very easy to implement. Encoding in UTF-32 is a one-to-one mapping for most characters and only a small number would need to be stored in atoms.

At the risk of complicating Euphoria, there may be a case to argue for a native UTF-8 character string. This would mean that English text would use 8-bit characters, and most European languages would average around 8-10 bits per character, though the East Asian languages would more than likely average 16-20 bits per character. Microsoft have decided to store Unicode strings as UTF-16 encoding, which means that most languages in the world use about 2 bytes per character.

Of course, you could roll your own 'packed' string type for Euphoria sequences at the cost of slower execution speed.

But there can't be many applications where the need for all text to be simultaneously stored in RAM is actually a performance boost. Most applications would only be dealing with a subset of the text at any one time. I don't think Google keeps all its cached pages in RAM.

<anecdote>
I once wrote a tiny text editor (4KB of assembler) in which the text was never stored in RAM, just the disk address of each line. It ran so fast on an Intel 8088 that people didn't notice it was continually going out to disk to read text in.
</anecdote>

--
Derek Parnell
Melbourne, Australia
5. Re: internal storage
- Posted by Jim Hendricks <jim at bizcomputinginc.com> Sep 20, 2004
- 426 views
Derek Parnell wrote:
> All 'characters' are stored as 4-byte integers and not stored as single
> bytes.
> [snip]
> Another consideration is that each sequence is stored in contiguous RAM
> so rather than have a giant single sequence containing all your
> text, it is better to break it up into a sequence of sequences.
> [snip]

Thanks, I will have to go with temp files then and take the performance hit. It's not a key function of the application, so performance is not important; it's just annoying to take a process that runs in a few seconds in memory and turn it into something that runs in a minute or more via temp files. (I may have been exaggerating a little with my 1000% slowdown, but it sure feels that way.)

I know about this slowdown because this app is running fine in Java: I first wrote it against memory, didn't like how much memory was consumed, went to a temp file and saw the slowdown. Java Strings use 2-byte chars, but I think char and byte still take 4 bytes internally with Java, because I canned using Strings and went with byte arrays and didn't see any improvement in memory usage.

I am already using sequences of sequences because it allows me to structure the intermediate data in a very navigable way, so I guess it's good to hear that the decision had some hidden benefits and that I managed by accident to dodge a performance bullet.
6. Re: internal storage
- Posted by Jim Hendricks <jim at bizcomputinginc.com> Sep 20, 2004
- 425 views
Derek Parnell wrote:
> [snip]
> At the risk of complicating Euphoria, there may be a case to argue for
> a native UTF-8 character string. This would mean that English text
> would use 8-bit characters, and most European languages would average
> around 8-10 bits per character, though the East Asian languages would
> more than likely average 16-20 bits per character. Microsoft have
> decided to store Unicode strings as UTF-16 encoding which means that
> most languages in the world use about 2 bytes per character.

I'm one of those "limited mindset" Americans who avoids UTF. Yeah, I know, it's ignoring the whole non-English-speaking world (which is pretty big), but I keep my hands full just catering to the English-only crowd.

> Of course, you could do roll-your-own 'packed' string type for Euphoria
> sequences at the cost of slower execution speed.
>
> But there can't be many applications where the need for all text to
> be simultaneously stored in RAM is actually a performance boost. Most
> applications would only be dealing with a subset of the text at any one
> time. I don't think Google keeps all its cached pages in RAM

I agree. My particular case is processing a tag markup file (similar to XML) to convert it to a proprietary file storage format for use in an application. There are certain pieces of information that I need to know about the data before writing the final file storage format, and they are not available until all the input data has been processed.

By going all RAM, all the data is read in, with some processing and formatting happening along the way. Once I have the final data from reading the whole file, I can then write out the final file, with the remaining processing happening while writing the data out.

With the temp file approach I have to read the data in while writing the partially processed data out to temp files. Once the full data is read and I have the final key pieces, I have to read the temp files, do the final processing and write out to the final files. The problem is that the temp file reading requires some jumping around within the temp files, and therefore leads to additional performance delays beyond the fact that I'm now reading and writing files. Jumping around in memory costs next to nothing.

I'm toying now with the thought of a multi-pass process which would read the input file through once to build/obtain the meta info necessary for final assembly, then do a second pass to do the read/process/write in one shot without the need for an intermediate store in RAM or in temp files.

> [snip]
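The multi-pass idea above might look something like the sketch below. It is only an illustration under stated assumptions: the "meta info" is taken to be a simple line count written as a header, and the routine names `count_lines()` and `convert()` are hypothetical. The real markup format and output layout are not described in the thread.

```euphoria
-- Pass 1: gather the metadata (here, hypothetically, a line count).
function count_lines(sequence path)
    integer fn, n
    object line
    fn = open(path, "r")
    if fn = -1 then
        return -1
    end if
    n = 0
    while 1 do
        line = gets(fn)
        if atom(line) then      -- gets() returns -1 at end of file
            exit
        end if
        n += 1
    end while
    close(fn)
    return n
end function

-- Pass 2: read, process and write in one shot, one line in memory at a time.
procedure convert(sequence in_path, sequence out_path)
    integer total, fin, fout
    object line
    total = count_lines(in_path)
    if total = -1 then
        return
    end if
    fin = open(in_path, "r")
    if fin = -1 then
        return
    end if
    fout = open(out_path, "w")
    if fout = -1 then
        close(fin)
        return
    end if
    printf(fout, "%d\n", total) -- header that depends on the whole input
    while 1 do
        line = gets(fin)
        if atom(line) then
            exit
        end if
        puts(fout, line)        -- real per-line processing would go here
    end while
    close(fin)
    close(fout)
end procedure
```

The input is read twice, but both reads are sequential, so there is no jumping around inside temp files and no need to hold the whole data set in RAM.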
7. Re: internal storage
- Posted by Michael Raley <thinkways at yahoo.com> Sep 20, 2004
- 445 views
Jim Hendricks wrote:
<snip>
> My question is are string sequences then stored as 4 byte atoms or as
> 1 byte atoms?
<snip>

Yes, string sequences as stated are really just numeric sequences. {65,66,67,3265} is internally the same as "ABC" & 3265.

You can sorta obscure plain text passwords in source files by writing them in sequence format, i.e.

    pwd = {325,330,335}/5

There are two workarounds that I tried because I had to split 24 megabyte revenue usage reports apart, creating a new subreport for each revenue department key found in the page headers. This would run very slow on the old win 95 machines it had to be run on, constantly chugging through virtual memory.

The first was to try to internally pack three atoms into one integer (the library is in the archives), so incompress({65,66,67}) would return {656667}, which runs pretty slow too with really big sequences.

The faster way was to create a 'bucket' routine to handle the file input, which allows me to set a limit on the amount of text held in memory, 100,000 lines or so. The splitter routine does not read the file directly; it reads from a buffer sequence until it reaches the end of the bucket, and then calls the bucket fill procedure again.
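A 'bucket' reader along the lines described above could be sketched as follows. This is not Michael's library code; the names `BUCKET_LINES`, `fill_bucket()`, `next_line()` and `count_via_bucket()` are hypothetical, and the limit is an arbitrary value chosen for the example.

```euphoria
constant BUCKET_LINES = 1000    -- hypothetical cap on lines held in memory

integer fn, pos
sequence bucket

-- Refill the bucket with up to BUCKET_LINES lines from the open file.
procedure fill_bucket()
    object line
    bucket = {}
    pos = 1
    while length(bucket) < BUCKET_LINES do
        line = gets(fn)
        if atom(line) then      -- gets() returns -1 at end of file
            exit
        end if
        bucket = append(bucket, line)
    end while
end procedure

-- Return the next line, or -1 once the file is exhausted.
function next_line()
    if pos > length(bucket) then
        fill_bucket()
        if length(bucket) = 0 then
            return -1
        end if
    end if
    pos += 1
    return bucket[pos - 1]
end function

-- Example consumer: count the lines of a file through the bucket.
function count_via_bucket(sequence path)
    integer n
    object line
    fn = open(path, "r")
    if fn = -1 then
        return -1
    end if
    bucket = {}
    pos = 1
    n = 0
    while 1 do
        line = next_line()
        if atom(line) then
            exit
        end if
        n += 1
    end while
    close(fn)
    return n
end function
```

The consumer only ever calls `next_line()`, so the amount of text in memory is bounded by the bucket size regardless of how large the input file is.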
8. Re: internal storage
- Posted by Jim Hendricks <jim at bizcomputinginc.com> Sep 20, 2004
- 425 views
Michael Raley wrote:
> [snip]
> The first was to try to internally pack three atoms into one integer
> (the library is in the archives)
> so incompress({65,66,67}) would return {656667}

Yes, that was suggested by someone else, but if I were to go to all the work of packing/unpacking I would stuff 4 chars per atom, unless there's some limitation on use of all 32 bits of an integer atom.

> which runs pretty slow too with really big sequences.
>
> The faster way was to create a 'bucket' routine to
> handle the file input, which allows me to set a limit on
> the amount of text held in memory, 100,000 lines or so.
>
> The splitter routine does not read the file directly,
> it reads from a buffer sequence until it reaches the end of the bucket,
> and calls the bucket fill procedure again.

Yes, this approach is used in many apps as buffered IO. It rightly assumes that IO is the performance bottleneck and that reading many bytes takes only slightly longer than reading 1 byte, since the performance hit is in clearing the channel, positioning the head, etc. This is where buffering on a cluster boundary gives the best performance kick, since clusters are contiguous on the HDD and can therefore be read in 1 pass. Of course the same situation exists for writes, only worse, since a write is a more costly operation than a read.

As I stated in a previous post, my app stores all the data in memory primarily because some of the information I need to properly process the data is not known until the whole file has been read. I may do well to go with a multipass process whereby my first pass obtains the metadata info necessary to process the file and then the second pass does the actual processing.

Jim
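On the question of 3 versus 4 chars per element: in the 32-bit interpreters of this era, the Euphoria manual gives the integer range as -1073741824 to +1073741823 (31 bits), so a value using all 32 bits would fall outside it and be held as a floating-point atom instead. The sketch below packs three 8-bit characters per element with base-256 arithmetic. The names `pack3()` and `unpack3()` are hypothetical and this is not the archive library mentioned above; input whose length is not a multiple of 3 is padded with zero bytes, which `unpack3()` returns as part of the result.

```euphoria
-- Pack three 8-bit characters into each element. The largest packed
-- value is #FFFFFF, which fits in a 31-bit Euphoria integer.
function pack3(sequence s)
    sequence packed
    integer n
    while remainder(length(s), 3) != 0 do
        s &= 0                          -- pad with zero bytes
    end while
    n = floor(length(s) / 3)
    packed = repeat(0, n)
    for i = 1 to n do
        packed[i] = s[3*i-2] * #10000 + s[3*i-1] * #100 + s[3*i]
    end for
    return packed
end function

-- Reverse of pack3(); any padding zeros are returned as well.
function unpack3(sequence packed)
    sequence s
    integer v
    s = repeat(0, 3 * length(packed))
    for i = 1 to length(packed) do
        v = packed[i]
        s[3*i-2] = floor(v / #10000)
        s[3*i-1] = remainder(floor(v / #100), #100)
        s[3*i]   = remainder(v, #100)
    end for
    return s
end function

? pack3("ABCDEF")               -- two elements instead of six
? integer(#FFFFFF)              -- is the largest 3-byte value an integer?
? integer(65 * #1000000)        -- is 'A' shifted into a fourth byte one?
```

The two `integer()` checks let you confirm on your own interpreter whether a fourth byte still fits; 65 * #1000000 is 1090519040, which is above the documented 31-bit limit.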
9. Re: internal storage
- Posted by "Kat" <gertie at visionsix.com> Sep 20, 2004
- 419 views
On 20 Sep 2004, at 5:54, Jim Hendricks wrote:
> Michael Raley wrote:
> > [snip]
> > There are two workarounds that I tried because I had to split 24 megabyte
> > revenue usage reports apart, creating a new subreport for each revenue
> > department key found in the page headers. This would run very slow on the
> > old win 95 machines it had to be run on, constantly chugging through
> > virtual memory.

Did you try faster hardware under the win95? I recently spent an hour optimising a function, and trimmed over an hour of runtime out of it (went from 20 minutes to under a second, using indexes extensively and NOT resizing the sequences).

> > The first was to try to internally pack three atoms into one integer
> > (the library is in the archives)
> > so incompress({65,66,67}) would return {656667}
> Yes, that was suggested by someone else, but if I were to go with all the
> work of packing/unpacking I would stuff 4 chars per atom unless there's some
> limitation on use of all 32 bits of an integer atom.

Several people have written assorted memory storage files for your task. Derek wrote one, Euman wrote at least one this year, Allen Robnett did something, Jordah did something along these lines, and I wouldn't be surprised if Jiri did too. Try Euman's for speed, or Derek's for versatility, or Jordah's. Euman's uses a windoze C call to do fast pattern searches in the memory block. It's plenty easy to drop blocks of the raw memory into a sequence to do Eu operations on them.

Kat
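One reading of "using indexes extensively and NOT resizing the sequences" is to create the result at its final length with `repeat()` and fill it by subscript, rather than growing it one element at a time. The two hypothetical functions below build the same sequence both ways; Kat's actual code is not shown in the thread, and how much the difference matters would need to be timed on the data in question.

```euphoria
-- Grows the result one element at a time.
function squares_append(integer n)
    sequence s
    s = {}
    for i = 1 to n do
        s = append(s, i * i)
    end for
    return s
end function

-- Creates the result at full length once, then fills it by index.
function squares_indexed(integer n)
    sequence s
    s = repeat(0, n)
    for i = 1 to n do
        s[i] = i * i
    end for
    return s
end function

if equal(squares_append(1000), squares_indexed(1000)) then
    puts(1, "same result\n")
end if
```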