1. EDS database speed questions
- Posted by AlanO Aug 24, 2008
- 1206 views
Hello! I have a Euphoria database of 2.2 million records. The key is 9 chars, the data is 67. In my app, each record the user chooses launches between 1 and 300 searches against the database, by key. Disk space is not important, speed is. The users' PCs will be assumed to be running Windows, with 1 GB RAM and no other RAM-hungry apps running.
Does anyone have measurements or practical experience where they can advise on:
a) Turning off compression/decompression in database.e: what will this gain me in statements dropped (i.e. measurable speed)?
b) Creating my own index entry after every X records (the 'pick any number between 1 and 100' method): at what value of X will this be faster than a simple get by index, if at all? Knowing the total number of records in advance can help in choosing the value of X.
I originally had everything running in sequences, not a database, but the PC ran out of memory when the record count jumped from 150k to 2.2 million. The app does sorts too!
This is my first really commercial app; the client has already paid $2000 for the old version but needs it to support 2.2 million records. (My company got the $$, not me :( Anything created using their equipment/time/technology/ideas is theirs, even though it's my idea, so I can't offer any $ rewards for assistance.)
Regards Alan
2. Re: EDS database speed questions
- Posted by AlanO Aug 24, 2008
- 1212 views
To summarise: is there a faster way to find a database record by key than the binary search used in database.e's db_find_key() function, if you know you have 2.2 million records?
Regards Alan
3. Re: EDS database speed questions
- Posted by ChrisB (moderator) Aug 24, 2008
- 1202 views
Hi
At that sort of level of database, shouldn't you be considering SQL? Let the engine do the sorting and collating of data. It may take a little time to convert the data, but the maintenance advantages would outweigh this. Speed would be comparable, if not better.
Options - eusql, sqlite, mysql.
Chris
4. Re: EDS database speed questions
- Posted by AlanO Aug 24, 2008
- 1198 views
The app was punted as not needing any external app/DLL; I'd hate to have to tell the client that's no longer the case, SQL or otherwise.
How about if I created my own index of every, say, 100th record in a (memory) sequence? Do a binary search against that, then use the results as the hi/lo starting points in the big database. Disk access is always slower than memory access. Another approach would be to take advantage of the 512-byte physical disk sector: I would assume that when fetching a record, the nearby records would also be in memory and so would not incur a speed hit. Design the search accordingly.
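To make the idea concrete, here is an untested sketch of such a sparse index; all the names here are invented, and nothing comes from database.e:
---------------------------------------------------------------
-- Sketch of the sparse-index idea: keep every Xth key in a small
-- sorted in-memory sequence, binary-search that, and return a
-- narrow lo/hi record range to search on disk.

constant X = 100        -- index every 100th record

sequence sparse_keys    -- every Xth key, in sorted order
sequence sparse_recno   -- record number of each sampled key

function find_range(sequence key)
-- returns {lo, hi} record numbers bounding where key can live
    integer lo, hi, mid, c
    lo = 1
    hi = length(sparse_keys)
    while lo <= hi do
        mid = floor((lo + hi) / 2)
        c = compare(key, sparse_keys[mid])
        if c = 0 then
            return {sparse_recno[mid], sparse_recno[mid]}
        elsif c < 0 then
            hi = mid - 1
        else
            lo = mid + 1
        end if
    end while
    -- missed: key falls between samples hi and lo,
    -- so the two are at most X records apart on disk
    if hi < 1 then
        return {1, sparse_recno[1]}
    elsif lo > length(sparse_keys) then
        -- clamp the upper bound to the table size in real code
        return {sparse_recno[length(sparse_recno)],
                sparse_recno[length(sparse_recno)] + X - 1}
    end if
    return {sparse_recno[hi], sparse_recno[lo]}
end function
---------------------------------------------------------------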
Loading the database is also not a "one-off". The data changes as a result of both my app and external issues, and so will be reloaded once a day, so I'd like to keep the load time down too.
Regards Alan
5. Re: EDS database speed questions
- Posted by ChrisB (moderator) Aug 24, 2008
- 1191 views
Hi
Of course you have your own programming issues, but I would still say that with a database that size, adding SQLite (as a DLL) adds only about half a megabyte to the program size, results in a much more secure database with rollback features in the event of a crash, allows use of industry-standard SQL statements, allows exceptionally quick maintenance and updates, uses B-tree searches so is lightning fast, and uses an engine embedded in devices used every day (your phone, for instance).
As far as trumping no use of external programs goes: were you expecting 2.2 million records at the first program consultation? Wouldn't the client rather have transparent speed and reliability at the cost of a tiny, zero-setup DLL installed into the program's directory?
I dunno, can you tell I'm an sqlite fan?
I don't know your database structure, but using every 100th record sounds like an additional overhead, and may be heading towards a hash table type of thing.
Chris
6. Re: EDS database speed questions
- Posted by DerekParnell (admin) Aug 24, 2008
- 1196 views
With 2.2 million records, EDS needs to do a maximum of 45 disk seeks to find any record. This is on par with B-Tree accessing.
Compression on a 67-element sequence is negligible, but I'll consider an option for a non-compressed DB as a future EDS enhancement.
Is it possible for you to split the table into sub-tables? For example, if the keys always began with a letter, you could have 26 sub-tables (approx 85,000 records per table), needing 35 disk seeks in the worst case. This would average out at around 75% of the original search cost.
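To illustrate with the actual database.e calls (the "tapes_" table-naming scheme here is invented):
---------------------------------------------------------------
include database.e

-- Route each lookup to one of 26 sub-tables keyed on the first
-- character of the key, so each binary search covers roughly
-- 85,000 records instead of 2.2 million.

function find_in_subtable(sequence key)
    integer rec
    if db_select_table("tapes_" & key[1]) != DB_OK then
        return -1   -- no such sub-table
    end if
    rec = db_find_key(key)
    if rec < 0 then
        return -1   -- not found (db_find_key returns a negative slot)
    end if
    return rec      -- record number; fetch with db_record_data(rec)
end function
---------------------------------------------------------------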
Version 4 of EDS has a number of speed improvements, but I haven't benchmarked them yet as I have one or two more things to improve. One of these is an option to load an EDS database into RAM to avoid the disk I/O overhead. That would work best for a READ-ONLY database, but it is not going to be restricted to that.
If RAM is not a constraint, you could load the DB into a map. In your case you would need approximately 650 MB of RAM to hold the map. This would improve DB access by at least 20 times, but you'd have to benchmark it to be sure.
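For what it's worth, a rough, untested sketch of the load-into-a-map idea against the Euphoria 4 std/map.e API that was taking shape at the time; treat the new/put/get calls as assumptions if your snapshot differs:
---------------------------------------------------------------
include std/eds.e
include std/map.e as map

-- Load every key/data pair of the currently selected table into
-- a hash map, then answer all lookups from RAM.
function load_into_map()
    object m
    integer n
    m = map:new()
    n = db_table_size()
    for i = 1 to n do
        map:put(m, db_record_key(i), db_record_data(i))
    end for
    return m
end function

-- usage: object data  data = map:get(m, key, -1)  -- -1 when absent
---------------------------------------------------------------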
7. Re: EDS database speed questions
- Posted by AlanO Aug 24, 2008
- 1196 views
Thanks Derek! Using your 45 max (I'll assume 43 average), my app will make 10,750 disk accesses on average each time the user hits Go. At 20 milliseconds per access, that's 215 seconds of waiting just for disk access! Hmmm... poke the index into memory and get assembly to do a binary search of the index. The memory needed is 19.8 MB, so that's doable, and nothing could be quicker. That should result in about 250 disk accesses at 20 milliseconds each: 5 seconds.
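For illustration, the same idea in plain Euphoria rather than assembly (untested; names invented): poke the 2.2 million sorted 9-byte keys end to end into one allocated block and binary-search it with peek(). That is about 22 in-RAM probes per search (2^22 > 2,200,000), leaving disk access only for the records themselves.
---------------------------------------------------------------
include machine.e   -- allocate()

atom base       -- start of the key block (2,200,000 * 9 = 19.8 MB)
integer nkeys   -- number of keys stored

procedure build_index(sequence keys)
-- keys: the sorted key list, each element exactly 9 bytes
    nkeys = length(keys)
    base = allocate(nkeys * 9)
    for i = 1 to nkeys do
        poke(base + (i-1)*9, keys[i])
    end for
end procedure

function find_key(sequence key)
-- returns the record number in the sorted key file, or 0
    integer lo, hi, mid, c
    lo = 1
    hi = nkeys
    while lo <= hi do
        mid = floor((lo + hi) / 2)
        c = compare(key, peek({base + (mid-1)*9, 9}))
        if c = 0 then
            return mid
        elsif c < 0 then
            hi = mid - 1
        else
            lo = mid + 1
        end if
    end while
    return 0
end function
---------------------------------------------------------------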
Is there a quicker method? Chris, I'll still get around to learning SQL, but I doubt that would be quicker than the memory/assembly method, would it?
Derek, could there be a memory / read-only / performance option for EDS database indexes in the future?
Thanks guys Alan
8. Re: EDS database speed questions
- Posted by ChrisB (moderator) Aug 24, 2008
- 1199 views
Hi
That's certainly impressive performance! Go for it.
Chris
9. Re: EDS database speed questions
- Posted by ghaberek (admin) Aug 25, 2008
- 1168 views
> Is it possible for you to split the table into sub-tables? For example, if the keys always began with a letter, you could have 26 sub-tables (approx 85,000 records per table), needing 35 disk seeks in the worst case. This would average out at around 75% of the original search cost.
I'm gonna suggest this as well. 85k records per table seems far more manageable and easier to search than 2.2M records.
-Greg
10. Re: EDS database speed questions
- Posted by Spock Aug 28, 2008
- 1141 views
> Thanks Derek! Using your 45 max (I'll assume 43 average), my app will make 10,750 disk accesses on average each time the user hits Go. At 20 milliseconds per access, that's 215 seconds of waiting just for disk access! Hmmm... poke the index into memory and get assembly to do a binary search of the index. The memory needed is 19.8 MB, so that's doable, and nothing could be quicker. That should result in about 250 disk accesses at 20 milliseconds each: 5 seconds.
> Is there a quicker method? Chris, I'll still get around to learning SQL, but I doubt that would be quicker than the memory/assembly method, would it?
> Derek, could there be a memory / read-only / performance option for EDS database indexes in the future?
> Thanks guys Alan
Hi Alan,
I have no experience programming "proper" databases and I'm not quite sure what the program is supposed to do but, FWIW, this is what I'd do:
1) Manage a plain file instead of using database.e; you can always convert it later if you need to expand it, but then you'll have to accept much-reduced performance.
2) Whenever the program loads (or the records are updated), it goes through all the records and creates a linear list of indices in RAM.
3) Each index entry in the list consists of the key compressed into 8 bytes. Since each key is 9 chars it should be easy enough to do said compressing, as 9 x 7 bits = 63 bits, which will fit into 8 bytes.
4) Whenever a search is made for the records associated with a particular key, the program compresses the search key and then performs a linear search over the entire index, building a sequence of the disk positions of the matching records. With 2.2 million records, the index table will be 17.6 megabytes long.
Even without resorting to ASM the whole search process should be very fast.
5) The disk positions gleaned from searching the index will already be in ascending order so loading the actual records using seek() will be very swift too.
6) If the chars that make up each key have a limited range, say all uppercase or all lowercase (or even alphanumeric), then the compression factor could be improved such that the compressed key fits into 2 Eu integers, which opens the door to even better performance, using a sequence to host the indices rather than some allocated RAM (see the sketch after this list).
7) If the number of unique keys is modest, step 3 could be altered so that, using a hash table, the compressed key is replaced by, say, a client number which could be a 4-byte or even 2-byte integer. The size of the index table would halve (or quarter to a modest 4.4 MB if in RAM).
8) If duplicate searches are often made for the same key (Hmmm, that does seem unlikely on a single PC) a small buffer of the last, say, 100 searches could be added to improve performance yet further.
9) If you are really serious about file performance in Windows you could use the API routines to perform all the open/close/seek/read functions. Expect the speed to at least double - unless the Eu 4 developers have already incorporated this stuff. Actually, the bulk of the time will be spent gathering the data so this part of the operation really ought to be optimized.
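For what it's worth, a sketch of the packing. The full 63-bit version of point 3 would overflow Euphoria's 53-bit exact-integer range if accumulated in a single atom, so this shows the point-6 variant for an alphanumeric alphabet (36 symbols plus blank = 37); all names are invented:
---------------------------------------------------------------
-- Pack a 9-char alphanumeric key into two Eu integers:
--   4 chars -> value < 37^4 = 1,874,161   (fits a 31-bit integer)
--   5 chars -> value < 37^5 = 69,343,957  (fits a 31-bit integer)
-- char_code() keeps ASCII order, so compare() on the packed
-- {a,b} pairs gives the same ordering as the original keys.

function char_code(integer c)
    if c >= '0' and c <= '9' then
        return 1 + c - '0'
    elsif c >= 'A' and c <= 'Z' then
        return 11 + c - 'A'
    end if
    return 0    -- treat anything else as blank
end function

function pack_key(sequence key)
-- key is exactly 9 chars
    integer a, b
    a = 0
    b = 0
    for i = 1 to 4 do
        a = a * 37 + char_code(key[i])
    end for
    for i = 5 to 9 do
        b = b * 37 + char_code(key[i])
    end for
    return {a, b}
end function
---------------------------------------------------------------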
Spock
11. Re: EDS database speed questions
- Posted by AlanO Sep 01, 2008
- 1108 views
OK, here is the background:
There is a database of mainframe tapes. The tape names are all a fixed 6-char length, but typically in "ranges", where a range might be the first 2 chars alpha and the remaining 4 chars numeric. The tapes themselves can contain multiple files; each file has a filename, file number, write date, time, etc., quite a few fields. When a mainframe writes to tape, it is allowed to write to any tape that's part of the allowed range, where the particular tape is in "scratch" status. Scratch status means that the tape no longer contains data that has to be retained. What has to be scratched, and when, is the responsibility of the tape management software, of which there are a few vendors (IBM, CA, BMC). Now, when a file that's being written cannot fit onto the current tape, it overflows to another scratch tape. So any tape could potentially be multi-file and/or overflowed across tapes (called multi-volume in IBM speak). The tape database should have pointers for the multi-volume files: if file 17 on, say, tape X00020 overflowed onto tapes X01056 and X00055, then each of these tapes should have pointers indicating what the previous and next tapes are. In theory, these pointers should be correct! But they are not, and the customer is not going to fix them just to make my program's life (and mine) easier.
My program generates mainframe code (called JCL) to copy these tape files. The user clicks on a few tapes, and my program generates the JCL to copy all the files on those tapes. My program has to follow the next/previous pointers (called chains) and not allow a copy if there is a fault in the chain. Doing so could be a disaster: imagine mixing two different years' payroll into a single output file, and what happens when that gets read into an application? I also keep logs and generate updates for the tape database, so that the user knows what was copied, and to where. The total number of tape files is 2.2 million.
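For what it's worth, the chain check described above might look roughly like this; the {previous, next} link table and all names are invented for illustration, not taken from Alan's app:
---------------------------------------------------------------
-- Walk a multi-volume chain from its first tape and refuse the
-- copy on any fault: a missing tape, a back-pointer that
-- disagrees, or a chain that loops.

function chain_ok(sequence vols, sequence links, sequence first)
-- vols  : list of 6-char tape names
-- links : links[i] = {prev, next} for vols[i], "" at either end
-- first : tape name where the chain starts
    sequence prev, cur
    integer i, hops
    prev = ""
    cur = first
    hops = 0
    while length(cur) = 6 do
        hops = hops + 1
        if hops > 10000 then
            return 0    -- a looping chain is also a fault
        end if
        i = find(cur, vols)
        if i = 0 then
            return 0    -- pointer leads to an unknown tape
        end if
        if compare(links[i][1], prev) != 0 then
            return 0    -- back-pointer disagrees with the chain
        end if
        prev = cur
        cur = links[i][2]   -- "" when this was the last volume
    end while
    return 1    -- chain is consistent: safe to generate the JCL
end function
---------------------------------------------------------------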
When I load the input file (it's a listing from the tape manager), I sort it first by tape name and file number. So finding the first file is easy... but since the next tape in a chain could be anywhere in the listing, it's a true random-access problem.
BTW, the need for all this is IBM dropping support for old (small-capacity) tapes. The customers need to roll up thousands of small files onto single new tapes. The new tapes can have 1 terabyte (1024 gigabytes) of capacity. They are also expensive, so the customer wants to fit as much data on a tape as he can. The support is being dropped in December, so there is a time factor.
So far my program is 3000 lines exactly, excluding win32lib etc. A lot of that code is (user and data) error checking.
I initially had the input file as a big sequence, but at 2.2 million records of 67 bytes each, I ran out of memory! My program does work, but obviously I want to make it as fast as possible.
If you are still awake, thanks ;))
Regards Alan
12. Re: EDS database speed questions
- Posted by Spock Sep 01, 2008
- 1107 views
- Last edited Sep 02, 2008
> OK, here is the background:
> There is a database of mainframe tapes. The tape names are all a fixed 6-char length, but typically in "ranges", where a range might be the first 2 chars alpha and the remaining 4 chars numeric. The tapes themselves can contain multiple files; each file has a filename, file number, write date, time, etc., quite a few fields. When a mainframe writes to tape, it is allowed to write to any tape that's part of the allowed range, where the particular tape is in "scratch" status. Scratch status means that the tape no longer contains data that has to be retained. What has to be scratched, and when, is the responsibility of the tape management software, of which there are a few vendors (IBM, CA, BMC). Now, when a file that's being written cannot fit onto the current tape, it overflows to another scratch tape. So any tape could potentially be multi-file and/or overflowed across tapes (called multi-volume in IBM speak). The tape database should have pointers for the multi-volume files: if file 17 on, say, tape X00020 overflowed onto tapes X01056 and X00055, then each of these tapes should have pointers indicating what the previous and next tapes are. In theory, these pointers should be correct! But they are not, and the customer is not going to fix them just to make my program's life (and mine) easier.
> My program generates mainframe code (called JCL) to copy these tape files. The user clicks on a few tapes, and my program generates the JCL to copy all the files on those tapes. My program has to follow the next/previous pointers (called chains) and not allow a copy if there is a fault in the chain. Doing so could be a disaster: imagine mixing two different years' payroll into a single output file, and what happens when that gets read into an application? I also keep logs and generate updates for the tape database, so that the user knows what was copied, and to where. The total number of tape files is 2.2 million.
> When I load the input file (it's a listing from the tape manager), I sort it first by tape name and file number. So finding the first file is easy... but since the next tape in a chain could be anywhere in the listing, it's a true random-access problem.
> BTW, the need for all this is IBM dropping support for old (small-capacity) tapes. The customers need to roll up thousands of small files onto single new tapes. The new tapes can have 1 terabyte (1024 gigabytes) of capacity. They are also expensive, so the customer wants to fit as much data on a tape as he can. The support is being dropped in December, so there is a time factor.
> So far my program is 3000 lines exactly, excluding win32lib etc. A lot of that code is (user and data) error checking.
> I initially had the input file as a big sequence, but at 2.2 million records of 67 bytes each, I ran out of memory! My program does work, but obviously I want to make it as fast as possible.
> If you are still awake, thanks ;))
> Regards Alan
Alan,
They are not paying you enough!
Now, what is the precise structure of each record and the actions that are performed on the big sequence holding all these records?
Spock
13. Re: EDS database speed questions
- Posted by AlanO Sep 02, 2008
- 1105 views
Here is the procedure that reads the data in, after each line has been validated:
---------------------------------------------------------------
procedure prep_data()
-- reformat data for the database
    sequence s1, s2, s3, s4, s5, s7, s8, s9
    sequence Dslabel, Volser, Dsname, Recfm, Lrecl
    sequence Firstvol, Dscsize, Volseq, Volsnum, Media
    sequence Dsexpdt, Credt, Cretm, Crejbn, Dusrdata
    sequence Newdsn, Blksize
    integer i1, i2
    integer dslabel, lrecl, blksize, volsnum
    atom a1, dscsize

    i1 = 0
    cleaned_data = {}
    ctt_index = {}
    -- whole_list = {}
    cbv = {"junk first record"} -- total bytes by volume

    while i1 < length(ctt_data) do
        i1 = i1 + 1
        s1 = ctt_data[i1]
        s1 = drop_cr(s1)
        -- ensure every record is the same length
        while length(s1) < 243 do
            s1 = s1 & " "
        end while

        -- DSLABEL
        Dslabel = s1[3..7] -- support 5 digits, not 3
        s2 = value(Dslabel)
        dslabel = s2[2]

        -- VOLSER
        Volser = s1[12..17]

        -- DSNAME
        Dsname = s1[22..65]
        Dsname = drop_trailing_blanks(Dsname)

        -- RECFM with following spaces
        Recfm = s1[70..72]
        Recfm = drop_trailing_blanks(Recfm)

        -- LRECL
        Lrecl = s1[81..87]
        Lrecl = drop_trailing_blanks(Lrecl)
        s2 = value(Lrecl)
        lrecl = s2[2]

        -- BLKSIZE
        Blksize = s1[92..98]
        Blksize = drop_trailing_blanks(Blksize)
        s2 = value(Blksize)
        blksize = s2[2]

        -- FIRSTVOL
        Firstvol = s1[103..108]
        if compare(Firstvol, "      ") = 0 then
            Firstvol = Volser
        end if

        -- DSCSIZE; field is suffixed with an M or K or G
        Dscsize = s1[113..120]
        Dscsize = drop_trailing_blanks(Dscsize)
        i2 = length(Dscsize)
        s2 = Dscsize[i2..i2]
        s3 = Dscsize[1..i2-1]
        s4 = value(s3)
        a1 = s4[2]
        if compare(s2, "K") = 0 then a1 = a1 * 1000 end if
        if compare(s2, "M") = 0 then a1 = a1 * 1000000 end if
        if compare(s2, "G") = 0 then a1 = a1 * 1000000000 end if
        dscsize = a1

        -- VOLSEQ
        Volseq = s1[130..135]

        -- VOLSNUM
        Volsnum = s1[146..149]
        s2 = value(Volsnum)
        volsnum = s2[2]

        -- MEDIA
        Media = s1[154..161]
        Media = drop_trailing_blanks(Media)
        if length(Media) < 2 then Media = "*blank*" end if

        -- DSEXPDT
        Dsexpdt = s1[166..173]

        -- CREDT
        Credt = s1[180..187]
        if compare(Credt, highdate) = 1 then
            highdate = Credt -- keep highest date found for licence check
        end if

        -- CRETM
        Cretm = s1[192..196]

        -- CREJBN
        Crejbn = s1[204..211]
        Crejbn = drop_cr(Crejbn) -- might be a trailing "\n"
        Crejbn = drop_trailing_blanks(Crejbn)
        if length(Crejbn) < 2 then Crejbn = "*blank* " end if

        -- DUSRDATA may be blank, and the record is then truncated
        if length(s1) < 229 then
            Dusrdata = "                     "
        else
            Dusrdata = s1[223..243]
        end if

        Newdsn = Dsname -- default

        s5 = {}
        s5 = append(s5, Volser)   --  1  6 char tape name
        s5 = append(s5, dslabel)  --  2  5 char filenumber
        s5 = append(s5, Dsname)   --  3  44 char filename
        s5 = append(s5, Recfm)    --  4  3 char record format
        s5 = append(s5, lrecl)    --  5  integer, record length
        s5 = append(s5, blksize)  --  6  integer, block size
        s5 = append(s5, Firstvol) --  7  6 char, first tape in chain
        s5 = append(s5, dscsize)  --  8  atom, bytes in file
        s5 = append(s5, Volseq)   --  9  6 char, volume sequence number
        s5 = append(s5, volsnum)  -- 10  integer, number of tapes in chain
        s5 = append(s5, Media)    -- 11  8 char, tape type
        s5 = append(s5, Dsexpdt)  -- 12  8 char, file expiry type
        s5 = append(s5, Credt)    -- 13  8 char, file creation date
        s5 = append(s5, Cretm)    -- 14  5 char, file creation time
        s5 = append(s5, Crejbn)   -- 15  8 char, file creation jobname
        s5 = append(s5, Dusrdata) -- 16  20 char, user freeform data
        s5 = append(s5, Newdsn)   -- 17  44 char, new filename
        s5 = append(s5, i1)       -- 18  integer, this list counter

        -- check that the copy is not already done...
        -- the "ARCH2VTS" is inserted by executing the copy process,
        -- by an optional JCL step
        if compare("ARCH2VTS", Dusrdata[1..8]) = 0 then
            done_list = append(done_list, s5)
            done_vols = done_vols + 1
            done_bytes = done_bytes + dscsize
        else
            cleaned_data = append(cleaned_data, s5)
            -- keep track of files and bytes per volume
            if dslabel = 1 then
                -- cbv keeps totals of files, bytes per volume
                -- save prev
                cbv = append(cbv, track_data) -- total bytes by volume
                track_data = {}
                track_data = append(track_data, Volser)
                track_data = append(track_data, dslabel)
                track_data = append(track_data, dscsize)
            else
                track_data[2] = dslabel
                track_data[3] = track_data[3] + dscsize
            end if
            -- construct index record too
            s7 = sprintf("%07d", {i1}) -- fixed width; handles over 1 million records
            s8 = Volser & Dslabel & Volseq & Volsnum & s7 & Dsname
            ctt_index = append(ctt_index, s8)
        end if
    end while

    ctt_data = {} -- drop data not needed anymore

    -- sort ctt_index by volser, label, volseq
    s9 = sort(ctt_index)
    ctt_index = s9

    -- save final record
    cbv = append(cbv, track_data)
end procedure
----------------------------------------------------------------
The proc reads in the sequence ctt_data and creates ctt_index. Each ctt_index record is much shorter than its ctt_data counterpart. My program does lookups against this, then gets the rest of the fields when required via the record counter stored in field 18 (which is also embedded in each index entry).
But anyway, I think you have helped already: if I have fixed-length records, then seek() against a file of the ctt_data records is the way to go!
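For what it's worth, a minimal sketch of that seek() approach, assuming the raw 243-byte padded lines are written back out to a fixed-length binary file so that record n starts at byte (n-1)*243; fetch_record() is an invented name:
---------------------------------------------------------------
include file.e  -- seek()
include get.e   -- get_bytes()

constant RECLEN = 243   -- each cleaned input line is padded to 243 bytes

-- Fetch record number recno (1-based) from a fixed-length file.
-- fn is a handle from open(name, "rb").
function fetch_record(integer fn, integer recno)
    if seek(fn, (recno - 1) * RECLEN) != 0 then
        return -1   -- seek failed
    end if
    return get_bytes(fn, RECLEN)
end function
---------------------------------------------------------------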
Regards Alan
14. Re: EDS database speed questions
- Posted by bryanso Sep 02, 2008
- 1092 views
> With 2.2 million records, EDS needs to do a maximum of 45 disk seeks to find any record. This is on par with B-Tree accessing.
Derek, this cannot be right. Perhaps on par with binary-tree accessing, yes. But a B-Tree / B+ tree certainly needs far fewer disk accesses, because multiple keys are kept per block. I'm guessing MySQL can probably get to a record in 5 disk accesses.
Insertion of new records is going to give you performance issues if you don't go with an external RDBMS like MySQL. But maybe your data set is static.
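To put rough, back-of-envelope numbers on the difference: a binary search touches one key per probe, so 2.2 million records need ceil(log2(2,200,000)) = 22 probes (about 45 seeks if each probe costs two, as in EDS). A B+ tree node sized to a 4 KB disk block might hold on the order of 100 keys, so the fan-out is about 100 and the height is ceil(log100(2,200,000)) = 4: roughly 3 or 4 block reads per lookup, with the top levels usually cached. The exact figures depend on key and block sizes, but that is the shape of the argument.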
Bryan
15. Re: EDS database speed questions
- Posted by DerekParnell (admin) Sep 02, 2008
- 1074 views
> With 2.2 million records, EDS needs to do a maximum of 45 disk seeks to find any record. This is on par with B-Tree accessing.
> Derek, this cannot be right. Perhaps on par with binary-tree accessing, yes. But a B-Tree / B+ tree certainly needs far fewer disk accesses, because multiple keys are kept per block. I'm guessing MySQL can probably get to a record in 5 disk accesses.
Thanks Bryan. Yes, I did mean a fully balanced binary tree and not a B+ tree.
Coincidentally, I've started writing a B+ tree library as an alternative to EDS, using VSAM-like files that support multiple indexes per table.