1. Sequences and long files
- Posted by Michael Sabal <mjs at OSA.ATT.NE.JP> May 07, 1998
- 675 views
A couple weeks ago, someone posted a question about how to handle super-huge database files (which wouldn't take too long these days), being able to access them in a reasonable amount of time, and still be able to use Euphoria sequences allowing dynamic memory allocation. These two goals seem almost antithetical. Random access of files requires fixed record lengths, but variable length records preclude random access.

So, I started thinking (a dangerous pastime, I know :). I tend to live in theory, so I'll leave the reality to the smart guys on the list (i.e., I'm not going to attempt to code this off the top of my head!). What makes sense is having two databases: the first would be a sorted version of the complete data, sorted from most recent to oldest (the most recent data is more likely to be needed first, except in a warehouse situation). Then the database could be read sequentially, say 1000 records at a time held in memory. This is common sense and is usually what happens anyway. The slow part is when this file needs updating. Instead of writing changes to the main database file (which means rewriting the database file every time), changes would be kept in the second database file, a much smaller file whose access time is minuscule. Then, during an idle time like waiting for user input, copy the changes file into the main database file and save the whole thing. This means adding a time check in the wait_for_input routine for something like 3 minutes, but only if the change file exists.

Hence, variable length records, mostly rapid access time, and not too difficult to program, I would think. Now let me move out of the way and find my fire coat....

Serving Jesus Christ,
Michael J. Sabal
mjs at osa.att.ne.jp
http://home.att.ne.jp/gold/mjs/
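For readers who want something concrete, the change-file half of the scheme above might be sketched like this. It is only a sketch under assumptions: the file name, the `IDLE_SECONDS` constant, and the routine names are all made up for illustration, and the merge into the main database is left out entirely. It uses only built-in routines (`open`, `close`, `print`, `puts`, `time`).

```euphoria
constant CHANGE_FILE = "changes.dat"   -- hypothetical file name
constant IDLE_SECONDS = 180            -- the "like 3 minutes" from the post

atom last_input                        -- time() of the most recent user input
last_input = time()                    -- the input loop would update this

-- Append one changed record to the small change file.
procedure log_change(sequence record)
    integer fn
    fn = open(CHANGE_FILE, "a")
    if fn = -1 then
        puts(2, "cannot open change file\n")
        return
    end if
    print(fn, record)                  -- one record per line, readable by get()
    puts(fn, '\n')
    close(fn)
end procedure

-- Returns 1 if a change file exists, 0 otherwise.
function changes_pending()
    integer fn
    fn = open(CHANGE_FILE, "r")
    if fn = -1 then
        return 0
    end if
    close(fn)
    return 1
end function

-- Returns 1 when changes are waiting and the user has been idle long enough.
function merge_is_due()
    return changes_pending() and time() - last_input >= IDLE_SECONDS
end function
```

A wait_for_input loop would call `merge_is_due()` between key polls and, when it returns 1, rewrite the main file and delete the change file.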
2. Re: Sequences and long files
- Posted by Irv <irv at ELLIJAY.COM> May 06, 1998
- 672 views
- Last edited May 07, 1998
At 09:13 AM 5/7/98 +-900, Michael J. Sabal wrote:
>A couple weeks ago, someone posted a question about how to handle
>super-huge database files <snip> , changes would be kept in
>the second database file, a much smaller file whose access time is
>minuscule. Then, during an idle time like waiting for user input, copy
>the changes file into the main database file and save the whole thing.
>This means adding a time check in the wait_for_input routine for like 3
>minutes, but only if the change file exists.
>
>Hence, variable length records, mostly rapid access time, and not too
>difficult to program, I would think.

This idea would work. There is a difficulty with it, however. If something goes wrong before or while the sequence is written to disk, you've lost a LOT of data. Writing one record at a time cuts down on the damage you can do.

There's also the problem of Euphoria's native sequence format: a 300 meg data file would probably take 2 or 3 times that much disk space.

If s = {"Now is the time",123}, -- 20 bytes more or less --
it takes 63 bytes to store on disk with a print(fn,s)

Irv
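The 63-byte figure can be checked by printing the same sequence to the screen instead of to a file. This is a small illustration using only the built-in `print` and `puts`; file number 1 is standard output.

```euphoria
sequence s
s = {"Now is the time", 123}

-- print() writes every character of the string as a decimal number:
-- {{78,111,119,32,105,115,32,116,104,101,32,116,105,109,101},123}
print(1, s)
puts(1, '\n')
```

Counting it out: the 15 character codes take 41 digits, plus 14 commas and 2 braces for the inner sequence (57), plus ",123" (4) and the outer braces (2), for 63 characters in all.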
3. Re: Sequences and long files
- Posted by Daniel Berstein <daber at PAIR.COM> May 07, 1998
- 654 views
- Last edited May 08, 1998
>So, I started thinking (a dangerous pastime, I know :). I tend to live in theory, so I'll leave the reality to the
>smart guys on the list (i.e., I'm not going to attempt to code this off the top of my head!). What makes
>sense is having two databases: the first would be a sorted version of the complete data, sorted from most
>recent to oldest (the most recent data is more likely to be needed first, except in a warehouse situation).
>Then the database could be read, say 1000 records at a time held in memory, sequentially. This is
>common sense and is usually what happens anyway. The slow part is when this file needs updating.
>Instead of writing changes to the main database file (which means writing the database file every time),
>changes would be kept in the second database file, a much smaller file whose access time is minuscule.
>Then, during an idle time like waiting for user input, copy the changes file into the main database file and
>save the whole thing. This means adding a time check in the wait_for_input routine for like 3 minutes, but
>only if the change file exists.

Following your idea, we could make a "smart" database engine that learns how it's being used. Think of voice or handwriting recognition. The program constantly corrects its probabilistic estimates of how data is accessed (and written).

Sounds extremely interesting, and a bit complex to do (fuzzy logic?). It seems like work for a group project. Volunteers?

Project segmentation:
- Physical storing and retrieving
- Logical storing and retrieving (FAST indexing)
- Fuzzy logic
- User interface
- Documentation
and finally... extensive beta testing (requires large amounts of data and processing).

Regards,
Daniel Bertein
daber at pair.com
4. Re: Sequences and long files
- Posted by Ralf Nieuwenhuijsen <nieuwen at XS4ALL.NL> May 07, 1998
- 661 views
>There's also the problem of Euphoria's native sequence
>format: a 300 meg data file would probably take 2 or 3
>times that much disk space.
>If s = {"Now is the time",123}, -- 20 bytes more or less --
>it takes 63 bytes to store on disk with a print(fn,s)

It takes about 47 bytes with EDOM2 without compression, of which 27 are default overhead to identify it as an EDOM2 file, to say it has no compression, and to point out which default function handler we use. The other 20 bytes hold both the structure of this sequence and its contents. If some parts of the structure follow a pre-known pattern, you could use a structure handler and save even more space. And of course you could filter the data through your own compression routines, but that would slow the process down a bit.

BTW, saving is a bit slower than using the print() routine, but loading an edo file goes a lot faster than using the get() routine (of course, because the get() routine is coded in native Euphoria, unlike the print() routine, which is built in).

Ralf Nieuwenhuijsen
nieuwen at xs4all.nl
5. Re: Sequences and long files
- Posted by Falkon <Falkn13 at IBM.NET> May 07, 1998
- 661 views
- Last edited May 08, 1998
From: Irv
>There's also the problem of Euphoria's native sequence
>format: a 300 meg data file would probably take 2 or 3
>times that much disk space.
>If s = {"Now is the time",123}, -- 20 bytes more or less --
>it takes 63 bytes to store on disk with a print(fn,s)

A few more instructions keep the format readable by get() but save a good deal of disk space. Use printf to print the strings as strings and actually print out the quotes, braces, and commas.

puts( fn, "{" )
for whatever = 1 to NumberOfRecords do
    -- record is assumed to hold record number `whatever` here
    puts( fn, "{" )
    for count = 1 to length( record ) do
        data = record[count]
        if sequence( data ) then
            -- wrap in {} so printf treats the whole string as one %s value
            printf( fn, "\"%s\"", {data} )
        else
            printf( fn, "%d", data )
        end if
        if count != length( record ) then
            puts( fn, "," )
        end if
    end for
    puts( fn, "}" )
    if whatever != NumberOfRecords then
        puts( fn, "," )
    end if
end for
puts( fn, "}" )

That's just my spur of the moment typing...it could probably be made smaller and/or made into a general purpose recursive function. Of course, if you don't know that all your data is either strings or numbers then you'd also have to check for sequences inside of sequences and recurse down properly.

You could of course use some sort of compression on the actual strings themselves before you save them too, if they're big.

I still like the idea of an index at the end of the file better, though, keeping only one record and the index in RAM. You'd just have to rewrite the file every x edits or whatever. Simpler than keeping track of idle time, I'd think. But that would work too.
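The "general purpose recursive function" mentioned above might look something like the following. Treat it as a sketch, not a tested library routine: the names `is_plain_string` and `write_compact` are invented here, and non-integer atoms are simply handed to the built-in print() instead of being formatted specially.

```euphoria
-- Returns 1 if s can be written safely between double quotes:
-- every element is a printable ASCII code other than " and \.
function is_plain_string(sequence s)
    object c
    for i = 1 to length(s) do
        c = s[i]
        if not integer(c) then
            return 0
        end if
        if c < 32 or c > 126 or c = '\"' or c = '\\' then
            return 0
        end if
    end for
    return 1
end function

-- Writes x to file fn in a form get() can read back,
-- using quoted strings wherever that is safe.
procedure write_compact(integer fn, object x)
    if atom(x) then
        if integer(x) then
            printf(fn, "%d", x)
        else
            print(fn, x)               -- let the built-in handle fractions
        end if
    elsif is_plain_string(x) then
        puts(fn, "\"" & x & "\"")
    else
        puts(fn, '{')
        for i = 1 to length(x) do
            if i > 1 then
                puts(fn, ',')
            end if
            write_compact(fn, x[i])    -- recurse into nested sequences
        end for
        puts(fn, '}')
    end if
end procedure

-- prints {{"Now is the time",123},{"abc",{1,2}}}
write_compact(1, {{"Now is the time", 123}, {"abc", {1, 2}}})
puts(1, '\n')
```

Written this way, the record {"Now is the time",123} takes 23 bytes instead of 63. A sequence of small integers that happen to be printable character codes comes out as a quoted string, which get() reads back as the same sequence.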
6. Re: Sequences and long files
- Posted by Irv <irv at ELLIJAY.COM> May 08, 1998
- 653 views
At 11:52 PM 5/7/98 -0400, you wrote:
>>There's also the problem of Euphoria's native sequence
>>format: a 300 meg data file would probably take 2 or 3
>>times that much disk space.
>>If s = {"Now is the time",123}, -- 20 bytes more or less --
>>it takes 63 bytes to store on disk with a print(fn,s)
>
> A few more instructions keeps the format readable by get() but saves a
>good deal of disk space. Use printf to print the strings as strings and
>actually print out the quotes, braces, and commas.
>
>puts( fn, "{" )
>for whatever = 1 to NumberOfRecords do
>    puts( fn, "{" )
>    for count = 1 to length( record ) do
>        data = record[count]
>        if sequence( data ) then
>            printf( fn, "\"%s\"", data )
>        else
>            printf( fn, "%d", data )
>        end if
>        if count != length( record ) then
>            puts( fn, "," )
>        end if
>    end for
>    puts( fn, "}" )
>    if whatever != NumberOfRecords then
>        puts( fn, "," )
>    end if
>end for
>puts( fn, "}" )
>
> That's just my spur of the moment typing...it could probably be made
>smaller and/or made into a general purpose recursive function. Of course,
>if you don't know that all your data is either strings or numbers then you'd
>also have to check for sequences inside of sequences and recurse down
>properly.

If you use the %d to print numbers, you'll lose the decimals. You can use %f to save the number accurately (more or less), but that takes a lot of space. If there are more numbers than strings in your data, it will take MORE space than the native format. Arrgh!

Irv
7. Re: Sequences and long files
- Posted by Robert B Pilkington <bpilkington at JUNO.COM> May 08, 1998
- 684 views
>>puts( fn, "{" )
>>for whatever = 1 to NumberOfRecords do
>>    puts( fn, "{" )
>>    for count = 1 to length( record ) do
>>        data = record[count]
>>        if sequence( data ) then
>>            printf( fn, "\"%s\"", data )
>>        else
>>            printf( fn, "%d", data )
>>        end if
>>        if count != length( record ) then
>>            puts( fn, "," )
>>        end if
>>    end for
>>    puts( fn, "}" )
>>    if whatever != NumberOfRecords then
>>        puts( fn, "," )
>>    end if
>>end for
>>puts( fn, "}" )
>
>If you use the %d to print numbers, you'll lose the decimals.
>You can use %f to save the number accurately (more or less),
>but that takes a lot of space. If there are more numbers than
>strings in your data, it will take MORE space than the native format.
>Arrgh!

Untested, but should work, and if it doesn't, you get the idea:

if sequence( data ) then
    -- Print sequence (wrapped in {} so printf sees one %s value)
    printf( fn, "\"%s\"", {data} )
elsif integer(data) then
    -- Print integer number
    printf( fn, "%d", data )
else
    -- Handle float.
    -- Assuming .16 is the best decimal precision
    num_string = sprintf("%.16f", data)
    -- Search backwards for the last character that isn't a zero
    for i = length(num_string) to find('.', num_string) by -1 do
        if num_string[i] != '0' then
            -- Okay, we have the total precision, convert to sequence
            num_string = sprintf("%d", i-find('.', num_string))
            exit
        end if
    end for
    -- Use the num_string sequence to cut down on space used:
    printf( fn, "%." & num_string & "f", data )
end if