Re: Is Euphoria OK for a large Database?
- Posted by Ralf Nieuwenhuijsen <nieuwen at XS4ALL.NL> Dec 18, 1997
>> I am interested in writing a large (chess) database program. I am
>> considering using Euphoria as it seems relatively straightforward. I do
>> not have any great programming experience and would like some advice...

Well, Euphoria is one of those languages with great support, through this
wonderful list-serv.

>> The number of games in a database can be quite high(up to around 1
>> million!!)

Again, Euphoria is a very nice choice: you'll (almost) never get an
out-of-memory error, nor are sequences limited in size in any way. Only
when both the normal memory (base memory + extended memory) and the HD are
full do you get an out-of-memory crash. (That is: it will abort your
program and tell the user it's out of memory. However, I might be wrong; I
never tried it, nor was I able to acquire that much memory.) And of course
Euphoria automatically caches disk operations for speed, so people won't
need to run smartdisk to gain that extra speed.

>> but usually come in smaller packets of around 500 to 2000 (say). A
>> useful database could easily be constructed with only 50000 games in it.
>> Sorting and/or searching will be required.

Easy. In Euphoria you could write (and some have) very generic sorting
and/or searching routines that work with any type of data and can use any
comparison routine you like. (With version 2.0 you can pass your own custom
routine ID for the comparison.)

>> Question 1. Is Euphoria up to the task?
>yes. Euphoria can handle that kind of structure without breaking a sweat.
>If you got the ram, it'll handle the data.

Or the HD, but of course slower... (Why do you think Win95 is so slow?
Because it uses more RAM than we have.)

>> Question 2. I read in an associated website that Euphoria stored
>> sequences in a way that used a lot of diskspace.
>
>There are ways around that.
>You can write it out as a string of ascii
>characters (or whatever) with "parsing characters" to save space, and a
>number of people have created compression methods which should work well
>with your data.

Or get my EDOM: you can save a sequence with any type of data in it, of any
size, to disk compressed. But it is kind of slow. It uses my old EDO (which
converts a sequence with all its data to an *efficient* binary sequence,
thus 1's and 0's) and Daniel Berstein's compression routines. It saves
*almost* as well as zipping, especially if you consider that it also has to
save the complete structure of the sequence and its datatypes in there.

>>Can the data be stored in a reasonably efficient manner?
>>Would I use sequences for such a large database?
>
>That's the only way to do it. Sequences are wonderful things. You could
>have a sequence called Moves such that Moves[1] is everything for game 1,
>Moves[2] be game 2, etc. Each element in Moves would be a sequence
>containing all the data pertaining to it. So Moves[45][2] would be the
>black player's name for game 45. length(Moves[i]) is the number of lines
>required for game i. Indexing in and retrieving the info you need is a
>snap.

And then:

    -- begin..
    include edom.e

    if not edo_save("my_db.edo", my_sequence) then
        puts(1, "Unable to save!\n")
        abort(1)
    end if
    -- End of program...

Easy, huh? To load you could use:

    my_sequence = edo_load("my_db.edo")

It's very easy to use, and it compresses very well (thanks to Daniel's
routines). But it's slow, very slow, for very big amounts of data. In that
case you have to wait for EDOM2, which I am currently working on. I finally
finished my EupBit library, which will be used to quickly write out bits
and characters at binary positions. It now also supports custom routine IDs
to handle the input and output, so you could give it input from a memory
buffer, or output to one, or you could write it out compressed or
encrypted.
EDOM2 won't compress your data; it will save it very efficiently, making a
lot of assumptions, handling scopes and exceptions. It's nearly done, with
still a few algorithms needed. (It doesn't use compression, but because it
uses EupBit you could set special routine IDs to routines that compress or
encrypt the data. So you have influence over the way the data is written
out.)

>> Question 3. Euphoria seems slightly confusing with regard
>> to strings/sequences(see above?). Any
>> suggestions as to a nice simple method of reading in the
>> aforementioned textfiles?
>
>There are ways to do it. If you have the data in an ascii text format,
>you should be able to read the file in directly.

Euphoria might be confusing because you have learned *bad* stuff, as most
programming languages teach it. QBasic will tell you there is a difference
between a character, an integer and a floating point. You could get a very
nice tutorial about Euphoria, by David Gay. (See the link under "other
sites related to Euphoria" at the Official Euphoria Page.)

Here is the difference, briefly.

An atom is a value (not a number). A character on the screen is a graphical
representation of a value in memory. So a character is a value also, and a
floating-point number is a value also. They are all values. And all values
can be read or written as a number, or a character, or whatever you read
in.

When you type 'C' in your program it is equal to the value 67, because the
ASCII table of DOS has the C at position 67. (Windows has this too, but the
last 128 characters of the DOS table are different under Windows.)

A file, too, is a large amount of values. All those values are from 0-255,
just like characters. Whenever a character is expected but you give a value
that is higher than 255, Euphoria will cut off the higher bits, so the
device still gets the byte it needs. So writing a 'C' to the screen does
the same as writing 67 + 256, or 'C' + 256.

A sequence is the way you wish to structure/order all your values.
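Before going further into sequences, here is a minimal sketch of the byte
cut-off described above. It assumes Euphoria 2.0-style declarations and
that file number 1 is the screen; the variable name c is only there for
illustration.

```euphoria
-- Sketch only: puts() sends the low 8 bits of each value,
-- which is the same as remainder(value, 256).
integer c
c = 'C'                            -- the same as c = 67
puts(1, c)                         -- writes the byte 67
puts(1, c + 256)                   -- 323, cut down to the byte 67 again
puts(1, remainder(c + 256, 256))   -- the cut-off done by hand
puts(1, '\n')
```

All three calls should put the same character on the screen, since adding
256 does not change the low 8 bits of a value.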
As you probably know, an object can be either a sequence or an atom. A
sequence is a list of objects. Each of those objects can thus be an atom
(value) or a sequence (structure). So a sequence can contain other
sequences, and those sequences can contain other sequences, etc. Just like
a directory structure, or a tree (with branches).

Some routines only work with values, some only with values within a certain
range, some only with sequences, and some with both. puts is an example of
the last kind. It can write out a value, or a whole list of values (a
sequence, thus). But puts will generate an error if you give it a sequence
that contains other sequences. puts will cut each atom to a value within
0-255; you can calculate this value yourself with remainder(value, 256).

Some routines, and all comparison and arithmetic operators, are recursive.
They do the same action to every member of the sequence, and thus also to
every member of every sub-sequence. Example with the '+' operator:

    sequence s1, s2
    s1 = { "This is a string, eh.. a sequence", 3, 4, { 3, 4, { 4, 5, "", {} } } }
    s1 = s1 + 3
    print(1, s1) -- Will print out s1 and its structure

Now all values and characters are 3 more than they were in the beginning.

Also note: Euphoria sees this as a sequence also: "Euphoria". That's
because it's just a sequence containing characters (thus values):
'E', 'u', 'p' ....

Arithmetic and comparisons also work with 2 sequences (only when they are
of the same length). Example:

    s1 = { 1, 2, { 10, 9, 8 } }
    s2 = { 0, { 1, 2, 3 }, 3 }
    print(1, s1 + s2)

Now it will make a new sequence where every element (value in the ordered
list/sequence) is added to the element of the other sequence.
So the sequence printed will be:

    { 1 + 0, 2 + { 1, 2, 3 }, { 10, 9, 8 } + 3 }

But of course the values will already be calculated, thus:

    { 1, {3,4,5}, {13,12,11} }

The same works with comparison:

    print(1, s1 = s2)

The sequence will then be evaluated:

    { 1 = 0, { 2 = 1, 2 = 2, 2 = 3 }, { 10 = 3, 9 = 3, 8 = 3 } }

And that will make:

    { 0, {0,1,0}, {0,0,0} }

Where 0 of course means false, and 1 means true. And the = means compare,
not assign, unless it is the first = of an assignment statement:

    s1 = ( s1 = s2 )

This will assign the { 0, {0,1,0}, {0,0,0} } to s1. I'm not sure, but I
think the parentheses are needed.

BUT NOTE: this is the syntax of an if statement:

    if true/false then
        -- code
    end if

So, this *will* work:

    if 1 = 1 then
        -- code
    end if

But this *won't* work:

    if {1,0,1} = 1 then
        -- code
    end if

The if-statement may only get a value, and not a structure filled with
values..!

Now that you know how file I/O in Euphoria basically works, and you know
what sequences are like, you need to know how to cut, modify and replace
sequences:

    s1 = s2[2]

Will assign the second element of s2 to s1. (Only a pointer is copied, not
the whole sequence, until it becomes necessary.)

    s1 = s2[2][1]

Will generate an error, because the 1st element of the 2nd element of s2 is
a value, and s1 is declared as a sequence.

    s1 = s2[2][1..2]

Will assign elements 1 through 2 of the 2nd element of s2 to s1.

The elements are indexed from 1 to their length. (Try length(s2) to get the
length during run-time.)

    This is not allowed: [4..1]
    This is not allowed: [0..4]  -- Element zero doesn't exist; a slice must start at 1 or higher
    This is not allowed: [-1..4]
    This is allowed:     [1..1]  -- Returns a sequence with one element
    This is allowed:     [4..3]  -- Returns a zero-element sequence like {} or "" (doesn't matter)
    This is not allowed: [6..3]

It may be confusing why some are allowed and some are not, but when you
write a lot of algorithms you'll find out that an algorithm works because
of these flexibilities, and that it wouldn't have worked right if some of
the stuff that isn't allowed now had been allowed. I consider this to be
very elegant.

I hope this mini-tutorial helped...

Ralf