Re: internal storage
- Posted by Jim Hendricks <jim at bizcomputinginc.com> Sep 20, 2004
Derek Parnell wrote:
> CoJaBo wrote:
> > Derek Parnell wrote:
> > [snip]
> > > All 'characters' are stored as 4-byte integers and not stored as
> > > single bytes.
> >
> > This DEFINITELY should be improved in a new version of Euphoria.
> > There are many times where I use allocated memory to get around
> > this problem.
> > Euphoria 2.5 (or 2.6 if it would take too long) should use 1 byte
> > instead of 4 bytes whenever possible.
>
> On the other hand, Euphoria's choice of 30-bit characters makes Unicode
> very, very easy to implement. Encoding in UTF-32 is a one-to-one
> mapping for most characters and only a small number would need to be
> stored in atoms.
>
> At the risk of complicating Euphoria, there may be a case to argue for
> a native UTF-8 character string. This would mean that English text
> would use 8-bit characters, and most European languages would average
> around 8-10 bits per character, though the East Asian languages would
> more than likely average 16-20 bits per character. Microsoft have
> decided to store Unicode strings as UTF-16 encoding, which means that
> most languages in the world use about 2 bytes per character.

I'm one of those "limited mindset" Americans who avoids UTF. Yeah, I
know, it's ignoring the whole non-English-speaking world (which is
pretty big), but I keep my hands full just catering to the English-only
crowd.

> Of course, you could do a roll-your-own 'packed' string type for
> Euphoria sequences at the cost of slower execution speed.
>
> But there can't be many applications where the need for all text to
> be simultaneously stored in RAM is actually a performance boost. Most
> applications would only be dealing with a subset of the text at any
> one time. I don't think Google keeps all its cached pages in RAM.

I agree. My particular case is processing a tag-markup file (similar to
XML) to convert it to a proprietary file storage format for use in an
application. There are certain pieces of information I need to know
about the data before writing the final file storage format, and they
are not available until all the input data has been processed.

By going all-RAM, all the data is read in, with some processing and
formatting happening along the way. Once I have the final data from
reading the whole file, I can then write out the final file, with the
remaining processing happening while writing the data out.

With the temp-file approach, I have to read the data in while writing
the partially processed data out to temp files. Once the full data is
read and I have the final key pieces, I have to read the temp files, do
the final processing, and write out to the final files. The problem is
that reading the temp files requires some jumping around within them,
which adds performance delays on top of the fact that I'm now reading
and writing files. Jumping around in memory is negligible.

I'm toying now with the thought of a multi-pass process, which would
read the input file through once to build/obtain the meta info
necessary for final assembly, then do a second-pass read to do the
read/process/write in one shot, without the need for an intermediate
store in RAM or in temp files. (A rough sketch of that two-pass shape
is at the end of this post.)

> <anecdote>
> I once wrote a tiny text editor (4KB of assembler) in which the text
> was never stored in RAM, just the disk address of each line. It ran so
> fast on an Intel 8088 that people didn't notice it was continually
> going out to disk to read text in.
> </anecdote>
>
> --
> Derek Parnell
> Melbourne, Australia
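
For anyone who hasn't tried it, the allocated-memory workaround CoJaBo
mentions boils down to poking the text into a 1-byte-per-character
buffer and peeking it back out when needed. An untested sketch (the
routine names are mine; allocate/poke/peek/free come from machine.e):

include machine.e

function pack_string(sequence s)
    -- copy a text sequence into a 1-byte-per-character buffer
    atom addr
    addr = allocate(length(s))
    poke(addr, s)             -- poke() writes the low byte of each element
    return {addr, length(s)}
end function

function unpack_string(sequence ps)
    -- read the bytes back out into a normal (4-byte-per-char) sequence
    return peek({ps[1], ps[2]})
end function

sequence ps
ps = pack_string("hello world")
puts(1, unpack_string(ps))    -- prints "hello world"
free(ps[1])

You pay for the 75% space saving with a poke/peek on every access,
which is the "slower execution speed" Derek is talking about.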
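
And for the curious, the UTF-8 encoding Derek describes is mechanical
enough to sketch in a few lines. Untested, and the function name is
mine; it expands one code point into its byte sequence:

function to_utf8(atom c)
    -- expand one Unicode code point into its UTF-8 bytes
    if c < #80 then
        return {c}                              -- plain ASCII: 1 byte
    elsif c < #800 then
        return {#C0 + floor(c / #40),
                #80 + remainder(c, #40)}        -- 2 bytes
    elsif c < #10000 then
        return {#E0 + floor(c / #1000),
                #80 + remainder(floor(c / #40), #40),
                #80 + remainder(c, #40)}        -- 3 bytes
    else
        return {#F0 + floor(c / #40000),
                #80 + remainder(floor(c / #1000), #40),
                #80 + remainder(floor(c / #40), #40),
                #80 + remainder(c, #40)}        -- 4 bytes
    end if
end function

? to_utf8(#20AC)    -- the euro sign: {226,130,172}

So English text really does stay at 1 byte per character, exactly as
Derek says.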
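
Finally, the two-pass shape I mentioned above. Again an untested
sketch; the file names are made up and the record counting is just a
stand-in for the real meta gathering and processing:

integer infile, outfile, records
object line

-- pass 1: scan the input once, keeping only the meta info
records = 0
infile = open("input.tag", "r")
while 1 do
    line = gets(infile)
    if atom(line) then
        exit                  -- gets() returns -1 (an atom) at end of file
    end if
    records = records + 1     -- stand-in for the real meta gathering
end while
close(infile)

-- pass 2: the header info is known now, so re-read and write the
-- final file in one shot, with no RAM image and no temp files
infile = open("input.tag", "r")
outfile = open("output.dat", "wb")
printf(outfile, "records=%d\n", {records})
while 1 do
    line = gets(infile)
    if atom(line) then
        exit
    end if
    puts(outfile, line)       -- stand-in for the real processing/writing
end while
close(infile)
close(outfile)

The cost is reading the input twice, but both reads are sequential,
which beats jumping around in temp files.

Jim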