1. Sequences and long files

A couple weeks ago, someone posted a question about how to handle
super-huge database files (which wouldn't take too long these days),
being able to access them in a reasonable amount of time, and still be
able to use Euphoria sequences allowing dynamic memory allocation.
These two ideas almost seem to be an antithesis.  Random access of files
requires fixed record lengths, but variable length records preclude
random access.

So, I started thinking (a dangerous pastime, I know :).  I tend to
live in theory, so I'll leave the reality to the smart guys on the list
(i.e., I'm not going to attempt to code this off the top of my head!).
What makes sense is having two databases: the first would be a sorted
version of the complete data, sorted from most recent to oldest (the
most recent data is more likely to be needed first, except in a
warehouse situation).  Then the database could be read sequentially,
say 1000 records at a time held in memory.  This is common sense and is
usually what happens anyway.  The slow part is when this file needs
updating.  Instead of writing changes to the main database file (which
means rewriting the whole database file every time), changes would be
kept in the second database file, a much smaller file whose access time
is minuscule.  Then, during an idle time like waiting for user input,
copy the changes file into the main database file and save the whole
thing.  This means adding a time check in the wait_for_input routine for
about 3 minutes, but only if the change file exists.
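
A rough sketch of the two-file scheme, untested against a real database. The file name, the 180-second threshold, and the routine names log_change, pending_changes and merge_due are all made up for illustration; only open(), print(), get(), time() and friends are standard Euphoria.

```euphoria
include get.e   -- get(), GET_SUCCESS

-- Hypothetical names and values, chosen only for this sketch.
constant CHANGES_FILE = "changes.dat"
constant IDLE_SECONDS = 180

atom last_input          -- time of the most recent user input
last_input = time()

-- Append one changed record to the small changes file.
procedure log_change(sequence record)
    integer fn
    fn = open(CHANGES_FILE, "a")
    if fn = -1 then
        puts(2, "cannot open changes file\n")
        return
    end if
    print(fn, record)
    puts(fn, '\n')       -- whitespace so get() can separate the records
    close(fn)
end procedure

-- Read back every pending change, oldest first.
function pending_changes()
    integer fn
    sequence result, item
    result = {}
    fn = open(CHANGES_FILE, "r")
    if fn = -1 then
        return result    -- no changes file, nothing pending
    end if
    while 1 do
        item = get(fn)
        if item[1] != GET_SUCCESS then
            exit
        end if
        result = append(result, item[2])
    end while
    close(fn)
    return result
end function

-- True when the user has been idle long enough and a changes file exists.
function merge_due()
    integer fn
    if time() - last_input < IDLE_SECONDS then
        return 0
    end if
    fn = open(CHANGES_FILE, "r")
    if fn = -1 then
        return 0
    end if
    close(fn)
    return 1
end function
```

The input loop would set last_input on every keypress and call merge_due() while waiting. Folding pending_changes() into the main file and then deleting the changes file is left out, since it depends on how the main file is laid out.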

Hence, variable length records, mostly rapid access time, and not too
difficult to program, I would think.  Now let me move out of the way and
find my fire coat....

Serving Jesus Christ,
Michael J. Sabal
mjs at osa.att.ne.jp
http://home.att.ne.jp/gold/mjs/


2. Re: Sequences and long files

At 09:13 AM 5/7/98 +-900, Michael J. Sabal wrote:
>A couple weeks ago, someone posted a question about how to handle
>super-huge database files
<snip>
>, changes would be kept in
>the second database file, a much smaller file whose access time is
>minuscule.  Then, during an idle time like waiting for user input, copy
>the changes file into the main database file and save the whole thing.
>This means adding a time check in the wait_for_input routine for like 3
>minutes, but only if the change file exists.
>
>Hence, variable length records, mostly rapid access time, and not too
>difficult to program, I would think.

This idea would work. There is a difficulty with this, however.
If something goes wrong before or while the sequence is
written to disk, you've lost a LOT of data.
Writing one record at a time sort of cuts down on the
damage you can do.

There's also the problem of Euphoria's native sequence
format: a 300 meg data file would probably take 2 or 3
times that much disk space.
If s = {"Now is the time",123}, -- 20 bytes more or less --
it takes 63 bytes to store on disk with a print(fn,s)
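
The 63 bytes can be checked by hand: print() writes every character of the string as a decimal code with a comma after it, so 15 characters turn into 41 digits plus 14 commas before the braces and the 123 are even counted. A small sketch, assuming sprint() from misc.e, which returns the same text that print() would write to the file:

```euphoria
include misc.e  -- sprint()

sequence s, text
s = {"Now is the time", 123}
text = sprint(s)

puts(1, text & '\n')
-- {{78,111,119,32,105,115,32,116,104,101,32,116,105,109,101},123}

printf(1, "%d bytes\n", length(text))
-- 63 bytes

-- the same record written with the string in quotes
printf(1, "%d bytes\n", length("{\"Now is the time\",123}"))
-- 23 bytes
```

So for string-heavy data most of the overhead comes from spelling out the character codes, not from the braces.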

Irv


3. Re: Sequences and long files

>So, I started thinking (a dangerous pastime, I know :).  I tend to live
>in theory, so I'll leave the reality to the smart guys on the list
>(i.e., I'm not going to attempt to code this off the top of my head!).
>What makes sense is having two databases: the first would be a sorted
>version of the complete data, sorted from most recent to oldest (the
>most recent data is more likely to be needed first, except in a
>warehouse situation).  Then the database could be read, say 1000
>records at a time held in memory, sequentially.  This is common sense
>and is usually what happens anyway.  The slow part is when this file
>needs updating.  Instead of writing changes to the main database file
>(which means writing the database file every time), changes would be
>kept in the second database file, a much smaller file whose access time
>is minuscule.  Then, during an idle time like waiting for user input,
>copy the changes file into the main database file and save the whole
>thing.  This means adding a time check in the wait_for_input routine
>for like 3 minutes, but only if the change file exists.


Following your idea, we can make a "smart" database engine that learns how
it's being used. Think of voice or handwriting recognition. The program
constantly corrects its probabilistic estimates of how data is accessed
(and written).

Sounds extremely interesting, and a bit complex to do (fuzzy logic?). It
looks like a group project. Volunteers?
Project segmentation:
    - Physical storage and retrieval
    - Logical storage and retrieval (FAST indexing)
    - Fuzzy logic
    - User interface
    - Documentation
    and finally... extensive beta testing (requires large amounts of data
and processing).

Regards,
    Daniel Bertein
    daber at pair.com


4. Re: Sequences and long files

>There's also the problem of Euphoria's native sequence
>format: a 300 meg data file would probably take 2 or 3
>times that much disk space.
>If s = {"Now is the time",123}, -- 20 bytes more or less --
>it takes 63 bytes to store on disk with a print(fn,s)


It takes about 47 bytes with EDOM2 without compression, of which 27 are
default overhead to make sure it's an EDOM2 file, to say it has no
compression, and to point out which default function handler we use.

The other 20 bytes hold both the structure of this sequence and its
contents.
If parts of the structure follow a known pattern, you could use a
structure handler to leave them out and save even more space. And of
course you could filter the data through your own compression routines, but
that would slow the process down a bit.

BTW, saving is a bit slower than using the print() routine, but
loading an edo file is a lot faster than using the get() routine
(of course, because the get() routine is coded in native Euphoria,
unlike the print() routine, which is built-in).

Ralf Nieuwenhuijsen
nieuwen at xs4all.nl


5. Re: Sequences and long files

From:    Irv

>There's also the problem of Euphoria's native sequence
>format: a 300 meg data file would probably take 2 or 3
>times that much disk space.
>If s = {"Now is the time",123}, -- 20 bytes more or less --
>it takes 63 bytes to store on disk with a print(fn,s)

     A few more instructions keep the format readable by get() but save a
good deal of disk space.  Use printf to print the strings as strings and
actually print out the quotes, braces, and commas.

puts( fn, "{" )
for whatever = 1 to NumberOfRecords do
   -- assumes record has been set to record number whatever by this point
   puts( fn, "{" )
   for count = 1 to length( record ) do
      data = record[count]
      if sequence( data ) then
         -- braces around data so printf treats it as one string value
         printf( fn, "\"%s\"", {data} )
      else
         printf( fn, "%d", data )
      end if
      if count != length( record ) then
         puts( fn, "," )
      end if
   end for
   puts( fn, "}" )
   if whatever != NumberOfRecords then
      puts( fn, "," )
   end if
end for
puts( fn, "}" )

     That's just my spur of the moment typing...it could probably be made
smaller and/or made into a general purpose recursive function.  Of course,
if you don't know that all your data is either strings or numbers then you'd
also have to check for sequences inside of sequences and recurse down
properly.  You could of course use some sort of compression on the actual
strings themselves before you save it too, if they're big..
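
For what it's worth, the general purpose recursive version might look something like the sketch below. The names is_string and write_object are made up, and the "%.15g" format for non-integers is an assumption: it keeps 15 significant digits, which will not reproduce every float exactly, and very large or very small values come out in exponent form, so the round trip through get() is worth checking on your own data.

```euphoria
-- True if s can safely be written inside double quotes: non-empty, and
-- every element a printable ASCII code other than quote or backslash.
function is_string(sequence s)
    object c
    if length(s) = 0 then
        return 0
    end if
    for i = 1 to length(s) do
        c = s[i]
        if not integer(c) then
            return 0
        end if
        if c < 32 or c > 126 or c = '\"' or c = '\\' then
            return 0
        end if
    end for
    return 1
end function

-- Write any Euphoria object to fn as text, recursing into nested sequences.
procedure write_object(integer fn, object x)
    if integer(x) then
        printf(fn, "%d", x)
    elsif atom(x) then
        printf(fn, "%.15g", x)          -- assumed precision, see above
    elsif is_string(x) then
        puts(fn, '\"' & x & '\"')
    else
        puts(fn, "{")
        for i = 1 to length(x) do
            if i > 1 then
                puts(fn, ",")
            end if
            write_object(fn, x[i])
        end for
        puts(fn, "}")
    end if
end procedure

integer fn
fn = open("records.txt", "w")           -- hypothetical file name
if fn != -1 then
    write_object(fn, {{"Now is the time", 123}, {"Euphoria", 2.5}})
    close(fn)
end if
-- records.txt now holds: {{"Now is the time",123},{"Euphoria",2.5}}
```

One side effect: a sequence of small integers such as {65,66,67} is written as "ABC". That is the same value in Euphoria, so get() hands back an identical sequence.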

     I still like the idea of an index at the end of the file better,
though.  That way only one record and the index are kept in RAM.  You just
have to rewrite the file every x edits or whatever.  Simpler than keeping
track of idle time, I'd think.  But that would work too.
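
A minimal sketch of the index idea, assuming where() and seek() from file.e and files opened in binary mode so the byte offsets stay reliable. The routine names and the file name are invented, and the offsets are kept in memory here; writing them out after the last record, and finding them again on startup, is left out.

```euphoria
include file.e  -- where(), seek()
include get.e   -- get(), GET_SUCCESS

-- Write each record on its own line and remember where it starts.
function write_records(integer fn, sequence records)
    sequence offsets
    offsets = {}
    for i = 1 to length(records) do
        offsets = append(offsets, where(fn))
        print(fn, records[i])
        puts(fn, '\n')
    end for
    return offsets
end function

-- Fetch record number n by seeking straight to its offset.
-- Returns {} if the seek or the read fails.
function read_record(integer fn, sequence offsets, integer n)
    sequence item
    if seek(fn, offsets[n]) != 0 then
        return {}
    end if
    item = get(fn)
    if item[1] != GET_SUCCESS then
        return {}
    end if
    return item[2]
end function

integer fn
sequence offsets

fn = open("data.txt", "wb")             -- hypothetical file name
offsets = write_records(fn, {{"first", 1}, {"second", 2.5}, {"third", 3}})
close(fn)

fn = open("data.txt", "rb")
? read_record(fn, offsets, 2)           -- only record 2 is read from disk
close(fn)
```

Since each record is found by its byte offset, the records can be any length and still be reached without reading the ones before them.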


6. Re: Sequences and long files

At 11:52 PM 5/7/98 -0400, you wrote:

>>There's also the problem of Euphoria's native sequence
>>format: a 300 meg data file would probably take 2 or 3
>>times that much disk space.
>>If s = {"Now is the time",123}, -- 20 bytes more or less --
>>it takes 63 bytes to store on disk with a print(fn,s)
>
>     A few more instructions keeps the format readable by get() but saves a
>good deal of disk space.  Use printf to print the strings as strings and
>actually print out the quotes, braces, and commas.
>
>puts( fn, "{" )
>for whatever = 1 to NumberOfRecords do
>   puts( fn, "{" )
>   for count = 1 to length( record ) do
>      data = record[count]
>      if sequence( data ) then
>               printf( fn, "\"%s\"", data )
>      else
>               printf( fn, "%d", data )
>      end if
>      if count != length( record ) then
>         puts( fn, "," )
>      end if
>   end for
>   puts( fn, "}" )
>if whatever != NumberOfRecords then
>   puts( fn, "," )
>end if
>end for
>puts( fn, "}" )
>
>     That's just my spur of the moment typing...it could probably be made
>smaller and/or made into a general purpose recursive function.  Of course,
>if you don't know that all your data is either strings or numbers then you'd
>also have to check for sequences inside of sequences and recurse down
>properly.

If you use the %d to print numbers, you'll lose the decimals.
You can use %f to save the number accurately (more or less),
but that takes a lot of space. If there are more numbers than
strings in your data, it will take MORE space than the native format. Arrgh!

Irv


7. Re: Sequences and long files

>>puts( fn, "{" )
>>for whatever = 1 to NumberOfRecords do
>>   puts( fn, "{" )
>>   for count = 1 to length( record ) do
>>      data = record[count]
>>      if sequence( data ) then
>>               printf( fn, "\"%s\"", data )
>>      else
>>               printf( fn, "%d", data )
>>      end if
>>      if count != length( record ) then
>>         puts( fn, "," )
>>      end if
>>   end for
>>   puts( fn, "}" )
>>if whatever != NumberOfRecords then
>>   puts( fn, "," )
>>end if
>>end for
>>puts( fn, "}" )
>
>If you use the %d to print numbers, you'll lose the decimals.
>You can use %f to save the number accurately (more or less),
>but that takes a lot of space. If there are more numbers than
>strings in your data, it will take MORE space than the native format.
>Arrgh!

Untested, but should work, and if it doesn't, you get the idea:

if sequence( data ) then
    -- Print string (braces make printf treat data as a single value)
    printf( fn, "\"%s\"", {data} )
elsif integer(data) then
    -- Print integer number
    printf( fn, "%d", data )
else
    -- Handle float.
    -- Assuming 16 decimal places is the most precision worth keeping
    num_string = sprintf("%.16f", data)
    -- Scan backwards for the last character that isn't a zero
    for i = length(num_string) to find('.', num_string) by -1 do
        if num_string[i] != '0' then
            -- i is the last significant digit; turn the number of
            -- decimal places needed into a string
            num_string = sprintf("%d", i-find('.', num_string))
            exit
        end if
    end for
    -- Use that count as the precision to cut down on space used:
    printf( fn, "%." & num_string & "f", data )
end if



