Re: Inner loop compiler
- Posted by OthelloVivaldi at hotmail.com Sep 07, 2002
Hello, Mr. E! :D

Thanks for your reply. I've been so busy in the real world these days, but now I finally have plenty of time to reply. This longish post answers several of your recent postings, OK?

> Euphoria Win and Linux both have a call_back function for writing your
> own call back routines. I've been able to call these from machine code,
> so you can still do the hard stuff in Euphoria. But it would limit you
> to exu or exw.
> -- call.exw -- demonstrates use of call near instruction

Great, that snippet looks very useful, compiler or no compiler!

> The BOUND instruction would work nicely for checking index values in
> sequences. Just set up a nice interrupt handler for out-of-range errors
> and you're good to go.

I was preparing to use cmp index, lowest / jb error / cmp index, highest / ja error... BOUND seems a much better alternative.

> Peek, poke and call - each can be done with a single asm instruction ...
> for integers though.

However, it would be a problem if people found themselves 'forced' to use peek and poke on a regular basis. That would compromise safety. One possibility is to adopt a safe version of the C pointer type. Such a pointer could either address an array that was safely allocated by Euphoria, or an area of memory (say, BIOS video memory) that the programmer would want to access. Unlike in C, the pointer would not be allowed to be misaligned with its type or moved out of its own bounds. E.g.:

-- Integer 'video_mem' is allowed to point to any byte in the mode 19 screen.
BYTE POINTER video_mem 0xA0000, 64000

> > Basically, this leaves us with statements for program flow,
> I have some eucode that generates machine code branches automatically
> from a sequence of conditions, and correctly uses the right sized
> (signed byte or signed doubleword) jump instruction.

Great, that code would make an eventual direct-to-machine-code compiler easier to write...
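To make the safety contract concrete, here is a minimal sketch in Python of what such a bounds-checked BYTE POINTER would guarantee. The class name and the bytearray standing in for raw memory are my own inventions for illustration, not Euphoria features:

```python
class BytePointer:
    """A byte pointer locked to the region it was declared over,
    in the spirit of: BYTE POINTER video_mem 0xA0000, 64000."""

    def __init__(self, base, length):
        self.base = base              # first valid address (for show only)
        self.length = length          # size of the region in bytes
        self.mem = bytearray(length)  # stand-in for the real memory

    def _check(self, offset):
        # The pointer may never be moved out of its own bounds.
        if not 0 <= offset < self.length:
            raise IndexError("pointer moved out of its own bounds")

    def read(self, offset):
        self._check(offset)
        return self.mem[offset]

    def write(self, offset, value):
        self._check(offset)
        self.mem[offset] = value & 0xFF  # a BYTE pointer truncates to 8 bits
```

So `BytePointer(0xA0000, 64000)` could reach any byte of the mode 19 screen, but `write(64000, 0)` would trap instead of scribbling past the end, which is exactly the safety that raw peek/poke lacks.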
> > ...and arrays that must be compatible with Euphoria sequences.
> Has anyone reverse-engineered Rob's sequence format from the
> Euphoria-to-C translator yet? I guess that information wouldn't be that
> useful anyway... since we can't get the address of a sequence by any
> means I know of. Someone might be able to poke around in memory and
> find some global sequence index and then locate other sequences from
> there... Manipulating sequences would also have to be required to
> emulate the translator conventions on reference counts and whatnot.
>
> OR... you could just write your own sequence engine in asm and use that.

My fault, "compatible with Euphoria sequences" was a very misleading and clumsy expression. All I had in mind was that it should be easy to peek and poke sequences into and out of static memory. For example, Euphoria can only read and write 32-bit ints and 32- or 64-bit floats, so any other types of data in the arrays would be incompatible.

Actually, I was thinking in the opposite direction of Euphoria's arbitrarily structured sequences. Sequence trees with recursive pointers are great at the high level, but in the inner loop I'd like to sacrifice this kind of flexibility and instead go for fast, direct indexing. Something like arrays of (preferably) power-of-2 aligned data, e.g. int a[17][512], would be great. Of course, sometimes arrays like [600][800] might also be nice (to save memory), which could hopefully be arranged with a few lea operations. Your own mode 19 pixel(x,y) in asm.ex is an example.

> Asm.e also assembles directly to memory, which should be an option for
> the compiler too. Generating functions at run-time is always nice.

Agreed. Now to your next posting:

> Here are some more ideas I had on what to implement first in a simple
> compiler:
>
> Functions and Procedures.

I guess you are right.
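To show why the power-of-2 widths matter, here is a small Python sketch (function names are mine) of the index arithmetic the compiler would emit. A power-of-2 row width reduces to a single shift, while mode 19's 320-byte rows decompose into the classic lea/shl pair:

```python
# Power-of-2 row width: offset is one shift and one add (a single lea on x86).
def offset_pow2(x, y):
    # for something like int a[rows][512]:
    return (y << 9) + x          # y*512 + x

# Non-power-of-2 width, e.g. mode 19's 320 bytes per row.
# y*320 = (y + y*4) * 64, which x86 can do as:
#     lea eax, [eax + eax*4]    ; y*5
#     shl eax, 6                ; *64
def offset_mode19(x, y):
    return ((y + (y << 2)) << 6) + x   # y*320 + x
```

For an awkward width like 800, the same trick still works (800 = 25 * 32), so even the memory-saving [600][800] layout costs only a couple of lea/shl operations per index, not a full multiply.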
One thing I would like to maintain is that the code should fit well into the code cache (I think it's something like 4, 8 or 16 KB). Simply in-lining all 'calls' would quickly result in too much code. So subs are necessary.

> Parameters are passed on the stack, using
> push instructions. Space for local variables is created by moving the
> stack pointer. Parameters and local variables are accessed by reading
> memory at the stack pointer + an offset. Recursion should be allowed -
> that's why we have a stack.

But what happens if a program runs out of stack memory at run-time?

> Function results should be returned in the
> eax register. No registers need to be saved before calling subroutines,
> but a subroutine may modify any register except the stack pointer.
> (Ideas for later implementation: allow complex calling conventions like
> passing parameters in registers, store local variables in registers,
> etc)

You see, I think it might be possible to do without the stacked parameters for calls that are internal to the code currently being compiled. Parameters would not be necessary if the compiler knows where it has all its data at any given time. For example, if a variable FOO is currently in either a register or a memory position, let's call it R/M_X, it would be a waste to move it to EAX and then make the call. One could just call the procedure right away, and the compiler would know that FOO is currently at R/M_X (say, it could be in EBP, or [EBP], or [FOO_PTR]). I will write more about this below.

> Global Variables. Global variables are simply a pointer to someplace
> allocated on the heap. (Variables can be optimized to be stored as
> registers later.)

Actually, I first imagined everything to be global... :)

> Data Types. Let's stick with 32-bit integers for now. (We can add
> floating point and arrays later.) (We could also probably implement
> arrays just using peek/poke.)
Yes, let's wait with the floats. But simple arrays, with at most 2 dimensions and only power-of-2 width, could more easily be implemented at an early stage.

> Simple Assignment and Arithmetic. Assignments will be initially
> limited to "x = y" where y is another variable or a literal value.
> Arithmetic can only be done on existing variables, in the form "x += y".
> (We can add more arithmetic operations later.)

Concurred. For x86 this would surely make sense. I just thought that since Eu was available for both Linux and BSD, maybe it also currently ran on other computers. But Bernie says that Rob Craig is 'hand coding' some of his routines; does this mean Eu runs only on x86?

> If and While statements. Each statement may have only one
> comparison. Comparisons can have the form "x < y" where x or y may be
> a variable or a literal value. (For loops can be implemented using
> while loops. From what I've heard, the first versions of Euphoria
> didn't have for loops for that reason.)

For some time I thought GOTO would be easier to compile, but I guess WHILE is actually easier to deal with. IF and WHILE are sufficient for my needs.

> Now, for the intermediate representation, a tree structure would
> probably be appropriate. The parser would take the text representation
> of the program (the source code) and convert it to an intermediate
> representation. This intermediate representation is where the
> optimization would take place, but we'll leave that out for now. Then
> the code generation would use the optimized intermediate representation
> to create the actual machine code.

Ok, Pete. What you are describing (I snipped away the rest) is a very direct and simple approach. It would compile in no time, it would run quite fast (at least much faster than Euphoria), and it basically solves my original request.
Moreover, it looks easy to debug, so there is reason to believe it would quickly get stable. Even though I'd like to attempt a more complex solution later, I think it would be a good idea to complete this compiler first and see what kind of problems we encounter.

> Comments? Suggestions? Clarifications?
>
> Or "shut up man and start coding"?
>
> Pete

Sure, go ahead, but give me/us a chance to help! Maybe the job could be cut up into pieces somehow, so we can exploit the principle of parallelism? Anyhow, my humble programming services are at your command :)

Finally, I'd like to give a general outline of my plans for a compiler based on the A-star algorithm. I hope to be able to write a much better and more detailed description later.

The compiler could be compared to a chess engine. The compiler generates lists of compiled statements, a chess engine lists of chess moves. Both programs relate not just to one list, but to huge trees of lists branching out from a root.

A chess engine cannot make an illegal move (unless it's buggy). Likewise, the compiler is unable to produce a statement or list of statements that could result in an error. It would always create a program that correctly executes the source code.

A chess engine must know 'everything' relevant to the game, including such things as which rooks have moved, so as to play by the correct castling rules. But it cannot limit itself to perfect knowledge about the current situation on the board. In order to think ahead, it must search as far as it can through a huge branching tree, in which it must have the same perfect knowledge about every possible node. Otherwise, it couldn't plan its moves correctly.

The compiler also prowls through a search tree, and must know 'everything' about each node on that tree - that is, about as much as a human programmer would.
Perhaps the most important information the compiler has is WHERE the different variables are located (is 'foo' in memory or in one of the registers?). Mostly, it will not be aware of the values in these variables, but sometimes it will, and then it can exploit that. It's the same with flags: they are either known to be 'set' or 'cleared', or more usually 'undefined'.

One useful way to exploit knowledge of contents is when at some point you need to compile something like Frodo = 0. Before setting out to free up a register and zero it to hold this value, the compiler could first see whether it already has an unused register with a value known to be zero. That's not unthinkable; maybe a loop counter has recently ended up with that value. Now the compiler could simply use that register right away, without any code.

Still, the most important thing about the compiler is its ability to 'juggle' registers and variables in memory, so as to minimize the need for constant swapping out of registers.

Basically, the compiler could be described in terms of 4 concepts:

1. A code generator.
2. A cost evaluator.
3. A priority queue.
4. A heuristic estimation.

1. The code generator has the task of producing a correct piece of assembly code that will do what a single source statement expects. How this piece of code will look depends on many factors, all of which can be deduced from the list of code prior to this statement. If the statement was foo -= bar, then the code would have to take into account the whereabouts of these two variables. If 'bar' is in memory and 'foo' happens to be in ecx, then one correct piece of code could be: sub ecx, [bar_ptr].

There may of course be many other ways to accomplish the same thing. The value of 'bar' could for example be loaded into another register (the contents of which would then have to be written out to memory), and thereafter the two regs could be subtracted.
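The idea of "generate every correct alternative, choose none" can be sketched in Python. The Loc tuple, register list and function name are assumptions of mine; the point is that the generator enumerates all valid sequences for foo -= bar given the current variable locations, and leaves the choice to the cost evaluator:

```python
from collections import namedtuple

# Where a variable currently lives: kind is "reg" or "mem".
Loc = namedtuple("Loc", "kind name")

def gen_candidates(dest, src, free_regs=("eax", "edx")):
    """Enumerate correct x86 sequences for 'dest -= src'.
    Every candidate is correct; none is judged here."""
    cands = []
    if dest.kind == "reg" and src.kind == "mem":
        # Direct: subtract straight from memory.
        cands.append([f"sub {dest.name}, [{src.name}]"])
        # Or load src into a scratch register first - pricier now,
        # but it leaves src in a register for later statements.
        for r in free_regs:
            cands.append([f"mov {r}, [{src.name}]",
                          f"sub {dest.name}, {r}"])
    elif dest.kind == "reg" and src.kind == "reg":
        cands.append([f"sub {dest.name}, {src.name}"])
    return cands
```

With 'foo' in ecx and 'bar' in memory, this yields the direct `sub ecx, [bar_ptr]` plus one load-then-subtract variant per free register, mirroring the alternatives described above.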
So far, this could seem like a less optimal solution, simply wasting some clocks doing approximately the same thing. But it may turn out that the most expensive version of the statement later proves to be the best. It's like in chess: sometimes you can sacrifice a knight or even a queen, and find that this move improves your situation several moves later.

The code generator should therefore not worry about how much each statement costs - just generate the set of correct ways to resolve a statement at a given point in the body of the code.

2. The cost evaluator sums up the 'cost' of the generated assembly statements. Cost would be measured in clocks. The execution cost of an XOR REG, REG might be 2 clocks. On top of that comes instruction fetching and decoding. It must be possible to look up (more or less) correct cost values for most instructions and instruction combinations by fine-reading the appropriate literature. Certain things would still be unpredictable, of course, so the average time a processor will spend down a pipe after a mispredicted jump must simply be guesstimated (one must still attempt to figure out reasonable penalties for this).

The compiler would provide the best results when these costs are as close to reality as possible. In chess too, it is important to hint the engine in the right direction (the best moves) by giving the 'best' values to the different pieces. Often these are: 1 for a pawn, 3 for a knight, 3.25 for a bishop, 5 for a rook, 9 or 10 for a queen, and 'inf' for the king. One can play around with these values, of course, and see if the engine would play better chess if the rook was worth 8 points instead. It probably would play worse.

3. The priority queue has the mission of always knowing which node among the compiler's lists of programs is the 'best'. That is, the one with the least clock-cost in relation to how many statements it has completed.
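A cost evaluator along these lines is just a table lookup plus a guessed branch penalty. The clock numbers below are placeholders of mine, not real timings for any particular CPU; in practice they would come from the literature, as described above:

```python
# Rough per-instruction clock estimates (placeholders, not real timings).
CLOCKS = {"mov": 1, "add": 1, "sub": 1, "xor": 1, "lea": 1, "cmp": 1}

MISPREDICT_PENALTY = 12   # guesstimated average pipeline-flush cost

def cost(seq):
    """Sum the estimated clock cost of a list of asm statements."""
    total = 0.0
    for ins in seq:
        op = ins.split()[0]
        total += CLOCKS.get(op, 2)   # default guess for ops not in the table
        if op.startswith("j") and op != "jmp":
            # Conditional jump: charge the misprediction penalty,
            # assuming (say) half of them mispredict on average.
            total += MISPREDICT_PENALTY / 2
    return total
```

Feeding the candidates from the code generator through this gives the numbers the priority queue ranks by: the direct `sub ecx, [bar_ptr]` scores 1, the load-then-subtract variant 2, and a conditional jump carries the guessed misprediction surcharge.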
This is as simple as 'keeping the minimum'. A data structure known as a 'heap' is often used for this, but there may be other alternatives that are better for this application. I'd expect that optimizing the priority queue means optimizing the compiler, since it's here most of the compile time will be spent.

4. At any point in the development of the assembly code, a 'heuristic' is produced that attempts to guess how 'good' the current node is in comparison to all other possible nodes. This guess is the current cost of the node in the tree (which is a hard fact) plus an estimate of how much is left before the final source statement is reached. The latter estimate can be multiplied by a floating-point factor to make the engine search either in 'breadth', meaning more brute force, or 'depth', meaning 'take some chances, advance quickly towards the goal'.

When this heuristic factor is set to 1, a perfectly optimized* compilation would be the result, but the engine might spend quite some time, even on small programs (compilation time would increase exponentially with the number of statements in the program). As the heuristic factor is incremented, the resulting code will be less optimized, but compilation time would be drastically reduced.

The beauty of A-star is that one can always diddle this number to get a search time which balances one's needs. Just increment the number to trade run-time efficiency against compile time!

* Perfect optimization means that, given only the statement building blocks that the code generator can produce, and given that the clock estimates for all these code snippets are correct, it would be impossible to compile the program to more effective assembly. No kidding - and before you burst out in protest, please make sure you understood the two premises in that sentence... :)

Yours Sincerely,
Barbarella
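Pulling points 3 and 4 together: what is described here is weighted A-star over partial compilations. A generic Python sketch (the function shape and toy state representation are mine; in the real compiler a node would carry the register/memory map and emitted code), with the heuristic factor as the tunable w:

```python
import heapq
import itertools

def astar_compile(start, expand, h, w=1.0):
    """Weighted A* over partial compilations.
    start:   the empty program (root node).
    expand:  yields (step_cost, child) pairs - the code generator's
             correct alternatives, priced by the cost evaluator.
    h:       estimate of clocks left until the final source statement;
             h(node) == 0 means everything is compiled.
    w:       the heuristic factor - 1.0 searches for the optimum,
             larger values trade code quality for compile time."""
    tie = itertools.count()   # tie-breaker so equal-priority nodes never compare
    frontier = [(w * h(start), next(tie), 0, start)]
    while frontier:
        _, _, g, node = heapq.heappop(frontier)   # 'keeping the minimum'
        if h(node) == 0:
            return g, node                        # g = total clock cost
        for step_cost, child in expand(node):
            g2 = g + step_cost
            heapq.heappush(frontier, (g2 + w * h(child), next(tie), g2, child))
    return None
```

The heap pop is exactly the priority queue of point 3, and the `g2 + w * h(child)` priority is the point-4 heuristic: hard cost so far plus the weighted guess of what remains. Diddling w is the breadth-versus-depth knob described above.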