Re: Inner loop compiler

Hello, Mr. E! :D

Thanks for your reply; I've been so busy in the real world these days, but
now I've finally got plenty of time to reply. This LONGISH post answers
several of your recent postings, ok?



> Euphoria Win and Linux both have a call_back function for writing your
> own call back routines.  I've been able to call these from machine code,
> so you can still do the hard stuff in Euphoria.  But it would limit you
> to exu or exw.

> -- call.exw -- demonstrates use of call near instruction

Great, that snippet looks very useful, compiler or no compiler!

> The BOUND instruction would work nicely for checking index values in
> sequences.  Just set up a nice interrupt handler for out-of-range errors
> and you're good to go.

I was preparing to use cmp index, lowest / jb error / cmp index, highest /
ja error... BOUND seems a much better alternative.
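
For anyone following along, a hedged sketch of the difference (NASM-style
syntax; 'lowest', 'highest' and 'limits' are labels I made up):

    ; the two-compare version:
    cmp eax, [lowest]         ; eax holds the index
    jb  bounds_error
    cmp eax, [highest]
    ja  bounds_error

    ; the BOUND version, with the two limits stored as adjacent dwords
    ; (lower, then upper, both inclusive); note BOUND compares signed:
    bound eax, [limits]       ; out of range -> interrupt 5 (#BR)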

> Peek, poke and call - each can be done with a single asm instruction ...
> for integers though.

However, it would be a problem if people found themselves 'forced' to use
peek and poke on a regular basis. That would compromise safety. One
possibility is to adopt a safe version of the C pointer type. Such a
pointer could either address an array that was safely allocated by
Euphoria, or an area of memory (say, BIOS video memory) that the
programmer would want to access. Unlike in C, the pointer would not be
allowed to be misaligned with its type or moved out of its own bounds.
E.g.:

    -- Integer 'video_mem' is allowed to point to any byte in the mode 19 screen.
    BYTE POINTER video_mem 0xA0000, 64000
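
A hedged sketch of what a checked write through such a pointer might
compile to (hypothetical labels; flat 32-bit addressing assumed):

    ; video_mem may only address [0xA0000, 0xA0000+64000)
    cmp eax, 0xA0000          ; eax holds the current pointer value
    jb  bounds_error
    cmp eax, 0xA0000+63999
    ja  bounds_error
    mov [eax], dl             ; the actual one-byte poke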


> >     Basically, this leaves us with statements for program flow,
> I have some eucode that generates machine code branches automatically
> from a sequence of conditions, and correctly uses the right-sized
> (signed byte or signed doubleword) jump instruction.

Great, that code would make an eventual direct-to-machine-code compiler
easier to write...

> > ...and arrays that must be compatible with Euphoria sequences.

> Has anyone reverse-engineered Rob's sequence format from the
> Euphoria-to-C translator yet? I guess that information wouldn't be that
> useful anyway.. since we can't get the address of a sequence by any
> means I know of.  Someone might be able to poke around in memory and
> find some global sequence index and then locate other sequences from
> there...  Manipulating sequences would also have to be required to
> emulate the translator conventions on reference counts and whatnot.
>
> OR... you could just write your own sequence engine in asm and use that.

My fault, "compatible with Euphoria sequences" was a very misleading and
clumsy expression. All I had in mind was that it should be easy to peek
and poke sequences into and out of static memory. For example, Euphoria
can only read and write 32-bit ints and 32- or 64-bit floats, so any
other types of data in the arrays would be incompatible.

Actually, I was thinking in the opposite direction of Euphoria's
arbitrarily structured sequences. Sequence trees with recursive pointers
are great at a high level, but in the inner loop I'd like to sacrifice
this kind of flexibility and instead go for fast, direct indexing.
Something like arrays of (preferably) power-of-2 aligned data, e.g. int
a[17][512], would be great. Of course, sometimes arrays like [600][800]
might also be nice (to save memory), which could hopefully be arranged
with a few lea operations. Your own mode 19 pixel(x,y) in asm.ex is an
example.
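
To make the lea trick concrete, here is a hedged sketch of mode 19 pixel
addressing (320 bytes per row, so not a power of 2; NASM-style, flat
32-bit addressing and a made-up register allocation assumed):

    ; offset = y*320 + x, with y in ebx and x in ecx
    lea eax, [ebx+ebx*4]      ; eax = y*5
    shl eax, 6                ; eax = y*320  (5*64 = 320)
    add eax, ecx              ; eax = y*320 + x
    mov dl, [0xA0000+eax]     ; fetch the pixel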

> Asm.e also assembles directly to memory, which should be an option for
> the compiler too.  Generating functions at run-time is always nice.

Agreed. Now to your next posting:

> Here are some more ideas I had on what to implement first in a simple
> compiler:
>
>   Functions and Procedures.

I guess you are right. One thing I would like to maintain is that the
code should fit well into the code cache (I think it's like 4, 8 or
16 KB). Simply in-lining all 'calls' would quickly result in too much
code. So subs are necessary.

> Parameters are passed on the stack, using
> push instructions.  Space for local variables is created by moving the
> stack pointer.  Parameters and local variables are accessed by reading
> memory at the stack pointer + an offset.  Recursion should be allowed -
> that's why we have a stack.

But what happens if a program runs out of stack memory at run-time?

> Function results should be returned in the
> eax register.  No registers need to be saved before calling subroutines,
> but a subroutine may modify any register except the stack pointer.
> (Ideas for later implementation: allow complex calling conventions like
> passing parameters in registers, store local variables in registers,
> etc)

You see, I think it might be possible to do without the stacked
parameters for calls that are internal to the code currently being
compiled. Parameters would not be necessary if the compiler knows where
it has all its data at any given time. For example, if a variable FOO is
currently in either a register or a memory position, let's call it R/M_X,
it would be a waste to move it to EAX and then make the call. One could
just call the procedure right away, and the compiler would know that FOO
is currently at R/M_X (say, it could be in EBP, or [EBP], or [FOO_PTR]).
I will write more about this below.
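
A hedged sketch of the contrast (hypothetical names and register
allocation, NASM-style):

    ; stack convention: every call site pushes, the callee unpacks
    push dword [foo_ptr]
    call some_proc            ; some_proc reads FOO at [esp+4]
    add  esp, 4

    ; 'the compiler knows where FOO is' convention: if FOO already sits
    ; in ebp here, and some_proc was compiled expecting it there, the
    ; call needs no parameter code at all
    call some_proc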

>   Global Variables.  Global variables are simply a pointer to someplace
> allocated on the heap.  (Variables can be optimized to be stored as
> registers later.)

Actually, I first imagined everything to be global... :)

>   Data Types.  Let's stick with 32-bit integers for now.  (We can add
> floating point and arrays later.)  (We could also probably implement
> arrays just using peek/poke.)

Yes, let's wait with the floats. But simple arrays, with at most 2
dimensions and only power-of-2 widths, could more easily be implemented
at an early stage.
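
A hedged sketch of why the power-of-2 width pays off (NASM-style,
made-up register allocation):

    ; element address for int a[17][512]: base + (i*512 + j)*4
    mov eax, esi              ; esi holds i
    shl eax, 9                ; i*512 - a shift, since 512 = 2^9
    add eax, edi              ; i*512 + j  (edi holds j)
    mov eax, [ebx+eax*4]      ; ebx holds the base; scale by 4 for dwords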

> Simple Assignment and Arithmetic.  Assignments will be initially
> limited to "x = y" where y is another variable or a literal value.
> Arithmetic can only be done on existing variables, in the form "x += y".
> (We can add more arithmetic operations later.)

Concurred. For x86 this would sure make sense. I just thought that since
Eu was available for both Linux and BSD, maybe it also currently ran on
other computers. But Bernie says that Rob Craig is 'hand coding' some of
his routines; does this mean Eu runs only on x86?
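
On x86, at least, both statement forms you quoted above compile down to
almost nothing; a hedged sketch (hypothetical variable locations):

    ; x = y, with both variables in memory
    mov eax, [y_ptr]
    mov [x_ptr], eax

    ; x += y, with x currently held in a register
    add ecx, [y_ptr]          ; ecx holds x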

> If and While statements.  Each statement may have only one
> comparison.  Comparisons can have the form "x < y" where x or y may be
> a variable or a literal value.  (For loops can be implemented using
> while loops.  From what I've heard, the first versions of Euphoria
> didn't have for loops for that reason.)

For some time I thought GOTO would be easier to compile, but I guess
WHILE is actually easier to deal with. IF and WHILE are sufficient for my
needs.

> Now, for the intermediate representation, a tree structure would
> probably be appropriate.  The parser would take the text representation
> of the program (the source code) and convert it to an intermediate
> representation.  This intermediate representation is where the
> optimization would take place, but we'll leave that out for now.  Then
> the code generation would use the optimized intermediate representation
> to create the actual machine code.

Ok, Pete. What you are describing (I snipped away the rest) is a very
direct and simple approach. It would compile in no time, it would run
quite fast (at least much faster than Euphoria), and it basically solves
my original request. Moreover, it looks easy to debug, so there is reason
to believe it would quickly get stable. Even though I'd like to attempt a
more complex solution later, I think it would be a good idea to complete
this compiler first, and see what kind of problems we encounter.
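
Just to picture the tree idea with a made-up example: a statement like
x += y would presumably parse into a tiny three-node tree, a '+=' node
with the two leaves 'x' and 'y':

        +=
       /  \
      x    y

The code generation would then walk such trees, emitting machine code
for each node.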

> Comments?  Suggestions?  Clarifications?
>
> Or "shut up man and start coding"?  grin
>
> Pete

Sure, go ahead, but give me/us a chance to help! Maybe the job could be
cut up into pieces somehow, so we can exploit the principle of
parallelism? Anyhow, my humble programming services are at your command :)




Finally I'd like to give a general outline of my plans for a compiler
based on the a-star algorithm. I hope to be able to write a much better
and more detailed description later.

The compiler could be compared to a chess engine. The compiler generates
lists of compiled statements, a chess engine lists of chess moves. And
neither program relates to just one list, but to huge trees of lists
branching out from a root.

A chess engine cannot make an illegal move (unless it's buggy).
Likewise, the compiler is unable to produce a statement or list of
statements that could result in an error. It would always create a
program that would correctly execute the source code.

A chess engine must know 'everything' relevant to the game, including
such things as which rooks have moved, so as to play by the correct
castling rules. But it cannot limit itself to perfect knowledge about the
current situation on the board. In order to think ahead, it must search
as far as it can through a huge branching tree, in which it must have the
same perfect knowledge about every possible node. Otherwise, it couldn't
plan its moves correctly.

The compiler also prowls through a search tree, and must know
'everything' about each node on that tree; that is, about as much as a
human programmer would. Perhaps the most important information the
compiler has is WHERE the different variables are located (is 'foo' in
memory or in one of the registers?). Mostly it will not be aware of the
values in these variables, but sometimes it will, and then it can exploit
that. It's the same with flags: they are either known to be 'set',
'cleared', or, more usually, 'undefined'.

One useful way to exploit knowledge of contents is when at some point you
need to compile something like Frodo = 0. Before setting out to free up a
register and zero it to hold this value, the compiler could first see
whether it already has an unused register with a value known to be zero.
That's not unthinkable; maybe a loop counter has recently ended up with
that value. Now the compiler could simply use that register right away,
without emitting any code.
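
A hedged sketch (made-up register allocation):

    ; naive compilation of Frodo = 0
    xor eax, eax              ; free a register and zero it
    mov [frodo_ptr], eax

    ; if ecx is free and already known to hold zero, the store alone
    ; suffices - or, if Frodo can live in a register for now, the
    ; compiler just relabels ecx as Frodo and emits nothing at all
    mov [frodo_ptr], ecx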

Still, the most important thing about the compiler is its ability to
'juggle' registers and variables in memory, so as to minimize the need
for constantly swapping registers out to memory.

Basically, the compiler could be described in terms of 4 concepts.

 1. A code generator.
 2. A cost evaluator.
 3. A priority queue.
 4. A heuristic estimation.

1. The code generator has the task of producing a correct piece of
assembly code that will do what a single source statement expects. What
this piece of code will look like depends on many factors, all of which
can be deduced from the list of code prior to this statement. If the
statement was foo -= bar, then the code would have to take into account
the whereabouts of these two variables. If 'bar' is in memory and 'foo'
happens to be in ecx, then one correct piece of code could be: sub ecx,
[bar_ptr].

There may of course be many other ways to accomplish the same thing. The
value of 'bar' could for example be loaded into another register (the
contents of which would then have to be written out to memory), and
thereafter the subtraction could be done register to register. So far,
this could seem like a less optimal solution, simply wasting some clocks
doing approximately the same thing.
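
A hedged sketch of that costlier alternative (hypothetical allocation
again):

    ; the same 'foo -= bar', the long way round
    mov [spill_ptr], edx      ; write edx's current contents out to memory
    mov edx, [bar_ptr]        ; load bar into the freed register
    sub ecx, edx              ; foo -= bar, register to register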

But the most expensive version of the statement may later turn out to be
the best. It's like in chess: sometimes you can sacrifice a knight or
even a queen, and find that this move improves your situation several
moves later.

The code generator should therefore not worry about how much each
statement costs; it should just generate the set of correct ways to
resolve a statement at a given point in the body of the code.


2. The cost evaluator sums up the 'cost' of the generated assembly
statements. Cost would be measured in clocks. The execution cost of an
XOR REG, REG might be 2 clocks; on top of that comes instruction fetching
and decoding. It must be possible to look up (more or less) correct cost
values for most instructions and instruction combinations by fine-reading
the appropriate literature. Certain things would still be unpredictable,
of course, so the average time a processor will spend down a pipe after a
mispredicted jump must simply be guesstimated (one must still attempt to
figure out reasonable penalties for this).

The compiler would provide the best results when these costs are as
close to reality as possible. In chess, too, it is important to point the
engine in the right direction (the best moves) by giving the 'best'
values to the different pieces. Often these are: 1 for a pawn, 3 for a
knight, 3.25 for a bishop, 5 for a rook, 9 or 10 for a queen, and 'inf'
for the king. One can play around with these values, of course, and see
whether the engine would play better chess if the rook were worth 8
points instead. It would probably play worse.


3. The priority queue has the mission of always knowing which node among
the compiler's lists of programs is the 'best'; that is, the one with the
least clock-cost in relation to how many statements it has completed.
This is as simple as 'keeping the minimum'. A data structure known as a
'heap' is often used for this, but there may be other alternatives that
are better for this application. I'd expect that optimizing the priority
queue means optimizing the compiler, since this is where most of the
compile time will be spent.


4. At any point in the development of the assembly code, a 'heuristic'
is produced that attempts to guess how 'good' the current node is in
comparison to all other possible nodes. This guess is the current cost
for the node in the tree (which is a hard fact) plus an estimate of how
much is left before the final source statement is reached. The latter
estimate can be multiplied by a floating point factor to make the engine
search either in 'breadth', meaning more brute force, or in 'depth',
meaning 'take some chances, advance quickly towards the goal'.
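
In the usual a-star notation (my labels; nothing here beyond the
standard weighted formulation):

    f(node) = g(node) + w * h(node)

where g is the clock-cost of the code compiled so far (the hard fact), h
is the estimated cost of the statements still left, and w is the floating
point factor: w = 1 searches optimally, while w > 1 trades optimality for
compilation speed.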

When this heuristic factor is set to 1, a perfectly optimized*
compilation would be the result, but the engine might spend quite some
time even on small programs (compilation time would increase
exponentially with the number of statements in the program). As the
heuristic factor is increased, the resulting code will be less optimized,
but compilation time would be drastically reduced.

The beauty of a-star is that one can always diddle this number to get a
search time that balances one's needs. Just increase the number to trade
run-time efficiency for compile time!

* Perfect optimization means that, given only the statement building
blocks that the code generator can produce, and given that the clock
estimates for all these code snippets are correct, it would be impossible
to compile the program to more efficient assembly. No kidding, and before
you burst out in protest, please make sure you understand the two
premises in that sentence... :)




Yours Sincerely,
Barbarella
