1. EDS database speed questions
- Posted by AlanO Aug 24, 2008
- 1206 views
Hello! I have a Euphoria database of 2.2 million records. The key is 9 chars, the data is 67. In my app, each record the user chooses launches between 1 and 300 searches against the database, by key. Disk space is not important, speed is. The users' PCs will be assumed to be running Windows, with 1 GB RAM and no other RAM-hungry apps running.
Does anyone have measurements or practical experience where they can advise on:
a) Turning off compression/decompression in database.e: what will this gain me in statements dropped (i.e. measurable speed)?
b) Creating my own index entry after every X records (the 'pick any number between 1 and 100' method): at what value of X will this be faster than a simple get by index, if at all? Knowing the total number of records in advance can help in choosing the value of X.
I originally had everything running in sequences, not a database, but the PC ran out of memory when the record count jumped from 150k to 2.2 million. The app does sorts too!
This is my first really commercial app; the client has already paid $2000 for the old version but needs it to support 2.2 million records. (My company got the $$, not me :( Anything created using their equipment/time/technology/ideas is theirs, even though it's my idea, so I can't offer any $ rewards for assistance.)
Regards Alan
2. Re: EDS database speed questions
- Posted by AlanO Aug 24, 2008
- 1212 views
To summarise: is there a faster way to find a database record by key than the binary search used in database.e's db_find_key() function, if you know you have 2.2 million records?
Regards Alan
3. Re: EDS database speed questions
- Posted by ChrisB (moderator) Aug 24, 2008
- 1202 views
Hi
At that sort of level of database, shouldn't you be considering SQL? Let the engine do the sorting and collating of data. It may take a little time to convert the data, but the maintenance advantages would outweigh this. Speed would be comparable, if not better.
Options - eusql, sqlite, mysql.
Chris
4. Re: EDS database speed questions
- Posted by AlanO Aug 24, 2008
- 1198 views
The app was punted as not needing any external app/DLL; I'd hate to have to tell the client that's no longer the case, SQL or otherwise.
How about if I created my own index of every, say, 100th record in a (memory) sequence? Do a binary search against that, then use the results as the hi/lo starting points in the big database. Disk access is always slower than memory access. Another approach would be to take advantage of the 512-byte physical disk sector: I would assume that when fetching a record, the nearby records would also be in memory and so would not incur a speed hit. Design the search accordingly.
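To make the idea concrete, here is an untested sketch of such a sparse index; all the names here are invented, and nothing comes from database.e:
---------------------------------------------------------------
-- Sketch of the sparse-index idea: keep every Xth key in a small
-- sorted in-memory sequence, binary-search that, and return a
-- narrow lo/hi record range to search on disk.

constant X = 100        -- index every 100th record

sequence sparse_keys    -- every Xth key, in sorted order
sequence sparse_recno   -- record number of each sampled key

function find_range(sequence key)
-- returns {lo, hi} record numbers bounding where key can live
    integer lo, hi, mid, c
    lo = 1
    hi = length(sparse_keys)
    while lo <= hi do
        mid = floor((lo + hi) / 2)
        c = compare(key, sparse_keys[mid])
        if c = 0 then
            return {sparse_recno[mid], sparse_recno[mid]}
        elsif c < 0 then
            hi = mid - 1
        else
            lo = mid + 1
        end if
    end while
    -- missed: key falls between samples hi and lo,
    -- so the two are at most X records apart on disk
    if hi < 1 then
        return {1, sparse_recno[1]}
    elsif lo > length(sparse_keys) then
        -- clamp the upper bound to the table size in real code
        return {sparse_recno[length(sparse_recno)],
                sparse_recno[length(sparse_recno)] + X - 1}
    end if
    return {sparse_recno[hi], sparse_recno[lo]}
end function
---------------------------------------------------------------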
Loading the database is also not a "one-off". The data changes as a result of both my app and external issues, and so will be reloaded once a day, so I'd like to keep the load time down too.
Regards Alan
5. Re: EDS database speed questions
- Posted by ChrisB (moderator) Aug 24, 2008
- 1191 views
Hi
Of course you have your own programming issues, but I would still say that with a database that size, adding SQLite (as a DLL) adds only about half a megabyte to the program size, results in a much more secure database with rollback features in the event of a crash, allows use of industry-standard SQL statements, allows exceptionally quick maintenance and updates, uses B-tree searches so is lightning fast, and uses an engine embedded in devices used every day (your phone, for instance).
As far as trumping no use of external programs goes: were you expecting 2.2 million records at the first program consultation? Wouldn't the client rather have transparent speed and reliability at the cost of a tiny, zero-setup DLL installed into the program's directory?
I dunno, can you tell I'm an sqlite fan?
I don't know your database structure, but using every 100th record sounds like an additional overhead, and may be heading towards a hash table type of thing.
Chris
6. Re: EDS database speed questions
- Posted by DerekParnell (admin) Aug 24, 2008
- 1196 views
With 2.2 million records, EDS needs to do a maximum of 45 disk seeks to find any record. This is on par with B-Tree accessing.
Compression on a 67-element sequence is negligible, but I'll consider an option for a non-compressed DB as a future EDS enhancement.
Is it possible for you to split the table into sub-tables? For example, if the keys always began with a letter, you could have 26 sub-tables (approx 85,000 records per table), needing 35 disk seeks in the worst case. This would average out at around 75% of the original search cost.
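To illustrate with the actual database.e calls (the "tapes_" table-naming scheme here is invented):
---------------------------------------------------------------
include database.e

-- Route each lookup to one of 26 sub-tables keyed on the first
-- character of the key, so each binary search covers roughly
-- 85,000 records instead of 2.2 million.

function find_in_subtable(sequence key)
    integer rec
    if db_select_table("tapes_" & key[1]) != DB_OK then
        return -1   -- no such sub-table
    end if
    rec = db_find_key(key)
    if rec < 0 then
        return -1   -- not found (db_find_key returns a negative slot)
    end if
    return rec      -- record number; fetch with db_record_data(rec)
end function
---------------------------------------------------------------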
Version 4 of EDS has a number of speed improvements, but I haven't benchmarked them yet as I have one or two more things to improve. One of these is an option to load an EDS database into RAM to avoid the disk I/O overhead. That would work best for a READ-ONLY database, but it is not going to be restricted to that.
If RAM is not a constraint, you could load the DB into a map. In your case you would need approximately 650 MB of RAM to hold the map. This would improve DB access by at least 20 times, but you'd have to benchmark it to be sure.
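For what it's worth, a rough, untested sketch of the load-into-a-map idea against the Euphoria 4 std/map.e API that was taking shape at the time; treat the new/put/get calls as assumptions if your snapshot differs:
---------------------------------------------------------------
include std/eds.e
include std/map.e as map

-- Load every key/data pair of the currently selected table into
-- a hash map, then answer all lookups from RAM.
function load_into_map()
    object m
    integer n
    m = map:new()
    n = db_table_size()
    for i = 1 to n do
        map:put(m, db_record_key(i), db_record_data(i))
    end for
    return m
end function

-- usage: object data  data = map:get(m, key, -1)  -- -1 when absent
---------------------------------------------------------------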
7. Re: EDS database speed questions
- Posted by AlanO Aug 24, 2008
- 1196 views
Thanks Derek! Using your 45 max (I'll assume 43 average), my app will make 10,750 disk accesses on average each time the user hits Go. At 20 milliseconds per access, that's 215 seconds of waiting just for disk access! Hmmm... poke the index into memory and get assembly to do a binary search of the index. The memory needed is 19.8 MB, so that's doable, and nothing could be quicker. That should result in about 250 disk accesses at 20 milliseconds each: 5 seconds.
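For illustration, the same idea in plain Euphoria rather than assembly (untested; names invented): poke the 2.2 million sorted 9-byte keys end to end into one allocated block and binary-search it with peek(). That is about 22 in-RAM probes per search (2^22 > 2,200,000), leaving disk access only for the records themselves.
---------------------------------------------------------------
include machine.e   -- allocate()

atom base       -- start of the key block (2,200,000 * 9 = 19.8 MB)
integer nkeys   -- number of keys stored

procedure build_index(sequence keys)
-- keys: the sorted key list, each element exactly 9 bytes
    nkeys = length(keys)
    base = allocate(nkeys * 9)
    for i = 1 to nkeys do
        poke(base + (i-1)*9, keys[i])
    end for
end procedure

function find_key(sequence key)
-- returns the record number in the sorted key file, or 0
    integer lo, hi, mid, c
    lo = 1
    hi = nkeys
    while lo <= hi do
        mid = floor((lo + hi) / 2)
        c = compare(key, peek({base + (mid-1)*9, 9}))
        if c = 0 then
            return mid
        elsif c < 0 then
            hi = mid - 1
        else
            lo = mid + 1
        end if
    end while
    return 0
end function
---------------------------------------------------------------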
Is there a quicker method? Chris, I'll still get around to learning SQL, but I doubt that would be quicker than the memory/assembly method, would it?
Derek, could there be a memory / read-only / performance option for EDS database indexes in the future?
Thanks guys Alan
8. Re: EDS database speed questions
- Posted by ChrisB (moderator) Aug 24, 2008
- 1199 views
Hi
That's certainly impressive performance! Go for it.
Chris
9. Re: EDS database speed questions
- Posted by ghaberek (admin) Aug 25, 2008
- 1168 views
> Is it possible for you to split the table into sub-tables? For example, if the keys always began with a letter, you could have 26 sub-tables (approx 85,000 records per table), needing 35 disk seeks in the worst case. This would average out at around 75% of the original search cost.
I'm gonna suggest this as well. 85k records per table seems far more manageable and easier to search than 2.2M records.
-Greg
10. Re: EDS database speed questions
- Posted by Spock Aug 28, 2008
- 1141 views
> Thanks Derek! Using your 45 max (I'll assume 43 average), my app will make 10,750 disk accesses on average each time the user hits Go. At 20 milliseconds per access, that's 215 seconds of waiting just for disk access! Hmmm... poke the index into memory and get assembly to do a binary search of the index. The memory needed is 19.8 MB, so that's doable, and nothing could be quicker. That should result in about 250 disk accesses at 20 milliseconds each: 5 seconds.
> Is there a quicker method? Chris, I'll still get around to learning SQL, but I doubt that would be quicker than the memory/assembly method, would it?
> Derek, could there be a memory / read-only / performance option for EDS database indexes in the future?
> Thanks guys Alan
Hi Alan,
I have no experience programming "proper" databases and I'm not quite sure what the program is supposed to do but, FWIW, this is what I'd do:
1) Manage a plain file instead of using database.e; you can always convert it later if you need to expand it, but then you'll have to accept much-reduced performance.
2) Whenever the program loads (or the records are updated), it goes through all the records and creates a linear list of indices in RAM.
3) Each index entry in the list consists of the key compressed into 8 bytes. Since each key is 9 chars it should be easy enough to do said compressing, as 9 x 7 bits = 63 bits, which will fit into 8 bytes.
4) Whenever a search is made for the records associated with a particular key, the program compresses the search key and then performs a linear search over the entire index, building a sequence of the disk positions of the matching records. With 2.2 million records, the index table will be 17.6 megabytes long.
Even without resorting to ASM the whole search process should be very fast.
5) The disk positions gleaned from searching the index will already be in ascending order so loading the actual records using seek() will be very swift too.
6) If the chars that make up each key have a limited range, say all uppercase or all lowercase (or even alphanumeric), then the compression factor could be improved such that the compressed key fits into 2 Eu integers, which opens the door to even better performance, using a sequence to host the indices rather than some allocated RAM (see the sketch after this list).
7) If the number of unique keys is modest, step 3 could be altered so that, using a hash table, the compressed key is replaced by, say, a client number which could be a 4-byte or even 2-byte integer. The size of the index table would halve (or quarter to a modest 4.4 MB if in RAM).
8) If duplicate searches are often made for the same key (Hmmm, that does seem unlikely on a single PC) a small buffer of the last, say, 100 searches could be added to improve performance yet further.
9) If you are really serious about file performance in Windows you could use the API routines to perform all the open/close/seek/read functions. Expect the speed to at least double - unless the Eu 4 developers have already incorporated this stuff. Actually, the bulk of the time will be spent gathering the data so this part of the operation really ought to be optimized.
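For what it's worth, a sketch of the packing. The full 63-bit version of point 3 would overflow Euphoria's 53-bit exact-integer range if accumulated in a single atom, so this shows the point-6 variant for an alphanumeric alphabet (36 symbols plus blank = 37); all names are invented:
---------------------------------------------------------------
-- Pack a 9-char alphanumeric key into two Eu integers:
--   4 chars -> value < 37^4 = 1,874,161   (fits a 31-bit integer)
--   5 chars -> value < 37^5 = 69,343,957  (fits a 31-bit integer)
-- char_code() keeps ASCII order, so compare() on the packed
-- {a,b} pairs gives the same ordering as the original keys.

function char_code(integer c)
    if c >= '0' and c <= '9' then
        return 1 + c - '0'
    elsif c >= 'A' and c <= 'Z' then
        return 11 + c - 'A'
    end if
    return 0    -- treat anything else as blank
end function

function pack_key(sequence key)
-- key is exactly 9 chars
    integer a, b
    a = 0
    b = 0
    for i = 1 to 4 do
        a = a * 37 + char_code(key[i])
    end for
    for i = 5 to 9 do
        b = b * 37 + char_code(key[i])
    end for
    return {a, b}
end function
---------------------------------------------------------------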
Spock
11. Re: EDS database speed questions
- Posted by AlanO Sep 01, 2008
- 1108 views
OK, here is the background:
There is a database of mainframe tapes. The tape names are all a fixed 6-char length, but typically in "ranges", where a range might be the first 2 chars alpha and the remaining 4 chars numeric. The tapes themselves can contain multiple files; each file has a filename, file number, write date, time, etc., quite a few fields. When a mainframe writes to tape, it is allowed to write to any tape that's part of the allowed range, where the particular tape is in "scratch" status. Scratch status means that the tape no longer contains data that has to be retained. What has to be scratched, and when, is the responsibility of the tape management software, of which there are a few vendors (IBM, CA, BMC). Now, when a file that's being written cannot fit onto the current tape, it overflows to another scratch tape. So any tape could potentially be multi-file and/or overflowed across tapes (called multi-volume in IBM speak). The tape database should have pointers for the multi-volume files: if file 17 on, say, tape X00020 overflowed onto tapes X01056 and X00055, then each of these tapes should have pointers indicating what the previous and next tapes are. In theory, these pointers should be correct! But they are not, and the customer is not going to fix them just to make my program's life (and mine) easier.
My program generates mainframe code (called JCL) to copy these tape files. The user clicks on a few tapes, and my program generates the JCL to copy all the files on those tapes. My program has to follow the next/previous pointers (called chains) and not allow a copy if there is a fault in the chain. Doing so could be a disaster: imagine mixing two different years' payroll into a single output file, and what happens when that gets read into an application? I also keep logs and generate updates for the tape database, so that the user knows what was copied, and to where. The total number of tape files is 2.2 million.
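For what it's worth, the chain check described above might look roughly like this; the {previous, next} link table and all names are invented for illustration, not taken from Alan's app:
---------------------------------------------------------------
-- Walk a multi-volume chain from its first tape and refuse the
-- copy on any fault: a missing tape, a back-pointer that
-- disagrees, or a chain that loops.

function chain_ok(sequence vols, sequence links, sequence first)
-- vols  : list of 6-char tape names
-- links : links[i] = {prev, next} for vols[i], "" at either end
-- first : tape name where the chain starts
    sequence prev, cur
    integer i, hops
    prev = ""
    cur = first
    hops = 0
    while length(cur) = 6 do
        hops = hops + 1
        if hops > 10000 then
            return 0    -- a looping chain is also a fault
        end if
        i = find(cur, vols)
        if i = 0 then
            return 0    -- pointer leads to an unknown tape
        end if
        if compare(links[i][1], prev) != 0 then
            return 0    -- back-pointer disagrees with the chain
        end if
        prev = cur
        cur = links[i][2]   -- "" when this was the last volume
    end while
    return 1    -- chain is consistent: safe to generate the JCL
end function
---------------------------------------------------------------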
When I load the input file (it's a listing from the tape manager), I sort it first by tape name and file number. So finding the first file is easy... but since the next tape in a chain could be anywhere in the listing, it's a true random-access problem.
BTW, the need for all this is IBM dropping support for old (small-capacity) tapes. The customers need to roll up thousands of small files onto single new tapes. The new tapes can have 1 terabyte (1024 gigabytes) of capacity. They are also expensive, so the customer wants to fit as much data on a tape as he can. The support is being dropped in December, so there is a time factor.
So far my program is 3000 lines exactly, excluding win32lib etc. A lot of that code is (user and data) error checking.
I initially had the input file as a big sequence, but at 2.2 million records of 67 bytes each, I ran out of memory! My program does work, but obviously I want to make it as fast as possible.
If you are still awake, thanks ;))
Regards Alan
12. Re: EDS database speed questions
- Posted by Spock Sep 01, 2008
- 1107 views
- Last edited Sep 02, 2008
> OK, here is the background:
> There is a database of mainframe tapes. The tape names are all a fixed 6-char length, but typically in "ranges", where a range might be the first 2 chars alpha and the remaining 4 chars numeric. The tapes themselves can contain multiple files; each file has a filename, file number, write date, time, etc., quite a few fields. When a mainframe writes to tape, it is allowed to write to any tape that's part of the allowed range, where the particular tape is in "scratch" status. Scratch status means that the tape no longer contains data that has to be retained. What has to be scratched, and when, is the responsibility of the tape management software, of which there are a few vendors (IBM, CA, BMC). Now, when a file that's being written cannot fit onto the current tape, it overflows to another scratch tape. So any tape could potentially be multi-file and/or overflowed across tapes (called multi-volume in IBM speak). The tape database should have pointers for the multi-volume files: if file 17 on, say, tape X00020 overflowed onto tapes X01056 and X00055, then each of these tapes should have pointers indicating what the previous and next tapes are. In theory, these pointers should be correct! But they are not, and the customer is not going to fix them just to make my program's life (and mine) easier.
> My program generates mainframe code (called JCL) to copy these tape files. The user clicks on a few tapes, and my program generates the JCL to copy all the files on those tapes. My program has to follow the next/previous pointers (called chains) and not allow a copy if there is a fault in the chain. Doing so could be a disaster: imagine mixing two different years' payroll into a single output file, and what happens when that gets read into an application? I also keep logs and generate updates for the tape database, so that the user knows what was copied, and to where. The total number of tape files is 2.2 million.
> When I load the input file (it's a listing from the tape manager), I sort it first by tape name and file number. So finding the first file is easy... but since the next tape in a chain could be anywhere in the listing, it's a true random-access problem.
> BTW, the need for all this is IBM dropping support for old (small-capacity) tapes. The customers need to roll up thousands of small files onto single new tapes. The new tapes can have 1 terabyte (1024 gigabytes) of capacity. They are also expensive, so the customer wants to fit as much data on a tape as he can. The support is being dropped in December, so there is a time factor.
> So far my program is 3000 lines exactly, excluding win32lib etc. A lot of that code is (user and data) error checking.
> I initially had the input file as a big sequence, but at 2.2 million records of 67 bytes each, I ran out of memory! My program does work, but obviously I want to make it as fast as possible.
> If you are still awake, thanks ;))
> Regards Alan
Alan,
They are not paying you enough!
Now, what is the precise structure of each record and the actions that are performed on the big sequence holding all these records?
Spock
13. Re: EDS database speed questions
- Posted by AlanO Sep 02, 2008
- 1105 views
Here is the procedure that reads the data in, after each line has been validated:
---------------------------------------------------------------
procedure prep_data()
-- reformat data for the database
    sequence s1, s2, s3, s4, s5, s7, s8, s9
    sequence Dslabel, Volser, Dsname, Recfm, Lrecl
    sequence Firstvol, Dscsize, Volseq, Volsnum, Media
    sequence Dsexpdt, Credt, Cretm, Crejbn, Dusrdata
    sequence Newdsn, Blksize
    integer i1, i2
    integer dslabel, lrecl, blksize, volsnum
    atom a1, dscsize

    i1 = 0
    cleaned_data = {}
    ctt_index = {}
    -- whole_list = {}
    cbv = {"junk first record"} -- total bytes by volume

    while i1 < length(ctt_data) do
        i1 = i1 + 1
        s1 = ctt_data[i1]
        s1 = drop_cr(s1)
        -- ensure every record is the same length
        while length(s1) < 243 do
            s1 = s1 & " "
        end while

        -- DSLABEL
        Dslabel = s1[3..7] -- support 5 digits, not 3
        s2 = value(Dslabel)
        dslabel = s2[2]

        -- VOLSER
        Volser = s1[12..17]

        -- DSNAME
        Dsname = s1[22..65]
        Dsname = drop_trailing_blanks(Dsname)

        -- RECFM with following spaces
        Recfm = s1[70..72]
        Recfm = drop_trailing_blanks(Recfm)

        -- LRECL
        Lrecl = s1[81..87]
        Lrecl = drop_trailing_blanks(Lrecl)
        s2 = value(Lrecl)
        lrecl = s2[2]

        -- BLKSIZE
        Blksize = s1[92..98]
        Blksize = drop_trailing_blanks(Blksize)
        s2 = value(Blksize)
        blksize = s2[2]

        -- FIRSTVOL
        Firstvol = s1[103..108]
        if compare(Firstvol, "      ") = 0 then
            Firstvol = Volser
        end if

        -- DSCSIZE; field is suffixed with an M or K or G
        Dscsize = s1[113..120]
        Dscsize = drop_trailing_blanks(Dscsize)
        i2 = length(Dscsize)
        s2 = Dscsize[i2..i2]
        s3 = Dscsize[1..i2-1]
        s4 = value(s3)
        a1 = s4[2]
        if compare(s2, "K") = 0 then a1 = a1 * 1000 end if
        if compare(s2, "M") = 0 then a1 = a1 * 1000000 end if
        if compare(s2, "G") = 0 then a1 = a1 * 1000000000 end if
        dscsize = a1

        -- VOLSEQ
        Volseq = s1[130..135]

        -- VOLSNUM
        Volsnum = s1[146..149]
        s2 = value(Volsnum)
        volsnum = s2[2]

        -- MEDIA
        Media = s1[154..161]
        Media = drop_trailing_blanks(Media)
        if length(Media) < 2 then Media = "*blank*" end if

        -- DSEXPDT
        Dsexpdt = s1[166..173]

        -- CREDT
        Credt = s1[180..187]
        if compare(Credt, highdate) = 1 then
            highdate = Credt -- keep highest date found for licence check
        end if

        -- CRETM
        Cretm = s1[192..196]

        -- CREJBN
        Crejbn = s1[204..211]
        Crejbn = drop_cr(Crejbn) -- might be a trailing "\n"
        Crejbn = drop_trailing_blanks(Crejbn)
        if length(Crejbn) < 2 then Crejbn = "*blank* " end if

        -- DUSRDATA may be blank, and the record is then truncated
        if length(s1) < 229 then
            Dusrdata = "                     "
        else
            Dusrdata = s1[223..243]
        end if

        Newdsn = Dsname -- default

        s5 = {}
        s5 = append(s5, Volser)   --  1  6 char tape name
        s5 = append(s5, dslabel)  --  2  5 char filenumber
        s5 = append(s5, Dsname)   --  3  44 char filename
        s5 = append(s5, Recfm)    --  4  3 char record format
        s5 = append(s5, lrecl)    --  5  integer, record length
        s5 = append(s5, blksize)  --  6  integer, block size
        s5 = append(s5, Firstvol) --  7  6 char, first tape in chain
        s5 = append(s5, dscsize)  --  8  atom, bytes in file
        s5 = append(s5, Volseq)   --  9  6 char, volume sequence number
        s5 = append(s5, volsnum)  -- 10  integer, number of tapes in chain
        s5 = append(s5, Media)    -- 11  8 char, tape type
        s5 = append(s5, Dsexpdt)  -- 12  8 char, file expiry type
        s5 = append(s5, Credt)    -- 13  8 char, file creation date
        s5 = append(s5, Cretm)    -- 14  5 char, file creation time
        s5 = append(s5, Crejbn)   -- 15  8 char, file creation jobname
        s5 = append(s5, Dusrdata) -- 16  20 char, user freeform data
        s5 = append(s5, Newdsn)   -- 17  44 char, new filename
        s5 = append(s5, i1)       -- 18  integer, this list counter

        -- check that the copy is not already done...
        -- the "ARCH2VTS" is inserted by executing the copy process,
        -- by an optional JCL step
        if compare("ARCH2VTS", Dusrdata[1..8]) = 0 then
            done_list = append(done_list, s5)
            done_vols = done_vols + 1
            done_bytes = done_bytes + dscsize
        else
            cleaned_data = append(cleaned_data, s5)
            -- keep track of files and bytes per volume
            if dslabel = 1 then
                -- cbv keeps totals of files, bytes per volume
                -- save prev
                cbv = append(cbv, track_data) -- total bytes by volume
                track_data = {}
                track_data = append(track_data, Volser)
                track_data = append(track_data, dslabel)
                track_data = append(track_data, dscsize)
            else
                track_data[2] = dslabel
                track_data[3] = track_data[3] + dscsize
            end if
            -- construct index record too
            s7 = sprintf("%07d", {i1}) -- fixed width; handles over 1 million records
            s8 = Volser & Dslabel & Volseq & Volsnum & s7 & Dsname
            ctt_index = append(ctt_index, s8)
        end if
    end while

    ctt_data = {} -- drop data not needed anymore

    -- sort ctt_index by volser, label, volseq
    s9 = sort(ctt_index)
    ctt_index = s9

    -- save final record
    cbv = append(cbv, track_data)
end procedure
----------------------------------------------------------------
The proc reads in the sequence ctt_data and creates ctt_index. Each ctt_index record is much shorter than its ctt_data counterpart. My program does lookups against this, then gets the rest of the fields when required via the record counter stored in field 18 (which is also embedded in each index entry).
But anyway, I think you have helped already: if I have fixed-length records, then seek() against a file of the ctt_data records is the way to go!
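For what it's worth, a minimal sketch of that seek() approach, assuming the raw 243-byte padded lines are written back out to a fixed-length binary file so that record n starts at byte (n-1)*243; fetch_record() is an invented name:
---------------------------------------------------------------
include file.e  -- seek()
include get.e   -- get_bytes()

constant RECLEN = 243   -- each cleaned input line is padded to 243 bytes

-- Fetch record number recno (1-based) from a fixed-length file.
-- fn is a handle from open(name, "rb").
function fetch_record(integer fn, integer recno)
    if seek(fn, (recno - 1) * RECLEN) != 0 then
        return -1   -- seek failed
    end if
    return get_bytes(fn, RECLEN)
end function
---------------------------------------------------------------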
Regards Alan
14. Re: EDS database speed questions
- Posted by bryanso Sep 02, 2008
- 1092 views
> With 2.2 million records, EDS needs to do a maximum of 45 disk seeks to find any record. This is on par with B-Tree accessing.
Derek, this cannot be right. Perhaps on par with binary-tree accessing, yes. But a B-Tree / B+ tree certainly needs far fewer disk accesses, because multiple keys are kept per block. I'm guessing MySQL can probably get to a record in 5 disk accesses.
Insertion of new records is going to give you performance issues if you don't go with an external RDBMS like MySQL. But maybe your data set is static.
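To put rough, back-of-envelope numbers on the difference: a binary search touches one key per probe, so 2.2 million records need ceil(log2(2,200,000)) = 22 probes (about 45 seeks if each probe costs two, as in EDS). A B+ tree node sized to a 4 KB disk block might hold on the order of 100 keys, so the fan-out is about 100 and the height is ceil(log100(2,200,000)) = 4: roughly 3 or 4 block reads per lookup, with the top levels usually cached. The exact figures depend on key and block sizes, but that is the shape of the argument.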
Bryan
15. Re: EDS database speed questions
- Posted by DerekParnell (admin) Sep 02, 2008
- 1074 views
> With 2.2 million records, EDS needs to do a maximum of 45 disk seeks to find any record. This is on par with B-Tree accessing.
> Derek, this cannot be right. Perhaps on par with binary-tree accessing, yes. But a B-Tree / B+ tree certainly needs far fewer disk accesses, because multiple keys are kept per block. I'm guessing MySQL can probably get to a record in 5 disk accesses.
Thanks Bryan. Yes, I did mean a fully balanced binary tree and not a B+ tree.
Coincidentally, I've started writing a B+ tree library as an alternative to EDS, using VSAM-like files that support multiple indexes per table.