Re: internal storage

Derek Parnell wrote:
> 
> CoJaBo wrote:
> > 
> > Derek Parnell wrote:
> 
> [snip]
> 
> > > All 'characters' are stored as 4-byte integers and not stored as single 
> > > bytes.
> >
> > This DEFINITELY should be improved in a new version of Euphoria.
> > There are many times where I use allocated memory to get around
> > this problem.
> > Euphoria 2.5 (or 2.6 if it would take too long) should use 1 byte
> > instead of 4 bytes whenever possible.
> 
> On the other hand, Euphoria's choice of 30-bit characters makes Unicode
> very, very easy to implement. Encoding in UTF-32 is a one-to-one
> mapping for most characters and only a small number would need to be
> stored in atoms.
> 
> At the risk of complicating Euphoria, there may be a case to argue for
> a native UTF-8 character string. This would mean that English text
> would use 8-bit characters, and most European languages would average
> around 8-10 bits per character, though the East Asian languages would
> more than likely average 16-20 bits per character. Microsoft have 
> decided to store Unicode strings as UTF-16 encoding which means that
> most languages in the world use about 2 bytes per character.
I'm one of those "limited mindset" Americans who avoids UTF.  Yeah, I know,
it's ignoring the whole non-English-speaking world (which is pretty big),
but I keep my hands full just catering to the English-only crowd.
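
For the English-only case, the allocated-memory trick CoJaBo mentions boils
down to poking the text into a raw buffer and peeking it back out when you
need it.  A rough sketch using the machine.e routines (the routine names
here are just made up for illustration):

include machine.e   -- allocate(), free(), poke(), peek()

-- Store ASCII text as raw bytes (1 byte per char) instead of as a
-- Euphoria sequence (4 bytes per char).
function pack_string(sequence s)
    atom addr
    addr = allocate(length(s))
    poke(addr, s)              -- poke() stores the low byte of each element
    return {addr, length(s)}
end function

function unpack_string(sequence packed)
    return peek({packed[1], packed[2]})   -- back to a normal sequence
end function

sequence ps
ps = pack_string("Hello, world!")
puts(1, unpack_string(ps) & '\n')
free(ps[1])

You pay for the poke/peek on every access, though, which is the slower
execution Derek mentions for any roll-your-own packed string.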

> 
> Of course, you could do roll-your-own 'packed' string type for Euphoria
> sequences at the cost of slower execution speed. 
> 
> But there can't be many applications where the need for all text to 
> be simultaneously stored in RAM is actually a performance boost. Most
> applications would only be dealing with a subset of the text at any one
> time. I don't think Google keeps all its cached pages in RAM.
I agree.  My particular case is processing a tag markup file (similar to
XML) to convert it to a proprietary file storage format for use in an
application.  There are certain pieces of information I need to know about
the data before writing the final file storage format, and they aren't
available until all the input data has been processed.  By going all RAM,
all the data is read in with some processing and formatting happening along
the way.  Once I have the final data from reading the whole file, I can
write out the final file with the remaining processing happening as the
data goes out.  With the temp-file approach I have to read the data in
while writing the partially processed data out to temp files.  Once the
full data is read and I have the final key pieces, I have to read the temp
files back, do the final processing, and write out to the final files.  The
problem is that reading the temp files requires some jumping around within
them, which adds performance delays beyond the fact that I'm now reading
and writing files.  Jumping around in memory is negligible.

I'm now toying with the idea of a multi-pass process: read the input file
through once to build the meta info needed for final assembly, then make a
second pass that does the read/process/write in one shot, without needing
an intermediate store in RAM or in temp files.
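
Roughly what I have in mind, as a skeleton only (gather_meta() and
emit_record() are just stand-ins for the real processing, and the file
names are made up):

constant INPUT_FILE = "data.tag", OUTPUT_FILE = "data.dat"

function gather_meta(sequence meta, sequence line)
    -- placeholder: e.g. just collect line lengths
    return append(meta, length(line))
end function

function emit_record(sequence meta, sequence line)
    -- placeholder: the real version would use meta to build the record
    return line
end function

procedure two_pass()
    integer fin, fout
    object line
    sequence meta
    meta = {}

    -- pass 1: read the whole input just to collect the meta info
    fin = open(INPUT_FILE, "r")
    line = gets(fin)
    while sequence(line) do
        meta = gather_meta(meta, line)
        line = gets(fin)
    end while
    close(fin)

    -- pass 2: re-read, process and write the final format in one shot
    fin = open(INPUT_FILE, "r")
    fout = open(OUTPUT_FILE, "wb")
    line = gets(fin)
    while sequence(line) do
        puts(fout, emit_record(meta, line))
        line = gets(fin)
    end while
    close(fin)
    close(fout)
end procedure

two_pass()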

> <anecdote>
> I once wrote a tiny text editor (4KB of assembler) in which the text was
> never stored in RAM, just the disk address of each line. It ran so fast
> on an Intel-8088 that people didn't notice it was continually going
> out to disk to read text in.
> </anecdote>
> 
> -- 
> Derek Parnell
> Melbourne, Australia
>
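
The same idea in Euphoria only needs seek() and where() from file.e: keep a
sequence of start-of-line offsets and pull each line off disk as it's
needed.  A rough sketch (the file name is just an example):

include file.e      -- seek(), where()

-- Build an index of start-of-line offsets, then fetch any line on demand
-- without keeping the text itself in RAM.
function index_lines(integer fn)
    sequence offsets
    object line
    atom pos
    offsets = {}
    pos = where(fn)             -- offset of the next line
    line = gets(fn)
    while sequence(line) do
        offsets = append(offsets, pos)
        pos = where(fn)
        line = gets(fn)
    end while
    return offsets
end function

function fetch_line(integer fn, sequence offsets, integer n)
    if seek(fn, offsets[n]) then
        return -1               -- seek failed
    end if
    return gets(fn)
end function

integer f
sequence idx
f = open("bigfile.txt", "r")
idx = index_lines(f)
puts(1, fetch_line(f, idx, 1))   -- first line, read straight from disk
close(f)

Only the offsets ever live in memory; the text stays on disk until it is
actually needed.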
