1. Brain-scratcher: New example of the delay

Okay, I'm hoping I can get someone else to reproduce this and offer some help with it. I've reproduced it on a second computer, but the parameters I had to use were slightly different, so I'm not sure exactly what all affects it.

Again, this seems to be an issue only with exw.exe. EX.EXE runs it a little slower, but it seems consistent, and not as slow as when exw.exe runs it slowly.

For this test, I've used the following. I think if you play around with it, you can reproduce it under whatever environment you normally use, but just so you know mine:
Filename: C:\b.exw
Euphoria version: exw.exe from euphoria_40a3.exe
Method of running:
- Copy exw.exe in root folder (C:\exw.exe)
- Open command prompt and navigate to the root directory (cd\)
- Enter this command: exw.exw b.exw
File format: ANSI text, CR-LF line endings, no extra lines or spaces at the end of lines.

Here's the code:

atom x = machine_func(16, 1000) 
poke(x, 195) 
for i = 1 to 8000000 do 
poke(x + 311, 0) 
call(x) 
end for 
? {} 

Code explanation:
- I start by allocating a 1000 byte buffer.
- I poke in the simplest machine-code program (RET).
- I run a couple lines of code a bunch of times to make the time evident.
- I use the print directive "?" to call up a console window, so you know when the code is finished.
- The program immediately ends, so you'll only see it flash on screen briefly.

Outcome:
- If it behaves like my computer, this program will run for about 2-3 seconds. You'll see the console flash on screen as soon as it's done.
- If the console appears immediately, replace "311" with a smaller number and try again. I'm sure you'll get it to delay before that number reaches zero.
- Now increase "331" to something higher. Before you get to 999, your program should finish immediately rather than delaying.

Check for strange behavior:
- By making half-way changes and running it repeatedly, you'll probably find some point at which it switches from slow to fast. The first fast number is always even. So, for example, you might find that 631 is slow and 632 is fast. Now comes the fun part:
- Convert your first fast number to hexadecimal, with Windows calculator or something. (e.g. 632 = #278)
- In your code, replace 632 with #278, and run it again.
- Do the same with the last slow number. (e.g. 631 = #277)
- You'll now find that instead of one number running fast and the other slow, they now either both run slow or both run fast, even though you're using the same numbers, just written differently. The cutoff point has now changed. If both fast, lower your hex values till you find the slow mode. If both fast, raise your hex value till you find the fast mode.
- Keep playing around with it. Change the numbers in loop, e.g. "1" to "0" or even "#1". Change the value of the number poked during the loop. Any little change like that should change the cutoff point of the offset where slow becomes fast.

Please tell me if you can reproduce this!

new topic     » topic index » view message » categorize

2. Re: Brain-scratcher: New example of the delay

is there some way to automate the testing? script or batch file maybe with a pause between tests.

have you tried translation and binding?

new topic     » goto parent     » topic index » view message » categorize

3. Re: Brain-scratcher: New example of the delay

CladStone said...

Okay, I'm hoping I can get someone else to reproduce this and offer some help with it. I've reproduced it on a second computer, but the parameters I had to use were slightly different, so I'm not sure exactly what all affects it.

Again, this seems to be an issue only with exw.exe. EX.EXE runs it a little slower, but it seems consistent, and not as slow as when exw.exe runs it slowly.

I'm able to reproduce this on Linux, using r2165. The magic number for me is 32. Anything less than that, and I get the delay. I believe this has something to do with the processor cache and how the OS deals with the memory. I suspect that there is a threshold at which it flushes the changes or something (I'm really pushing the limit of my knowledge here).

The bottom line is that I think there's nothing special about this code from a euphoria standpoint.

Matt

new topic     » goto parent     » topic index » view message » categorize

4. Re: Brain-scratcher: New example of the delay

Sure, you could come up with a way to automate testing. I thought about it, but didn't get that far.

Regarding translation and/or binding: A capital suggestion; thank you. It took me a little while since I'd never done that before and didn't have a C compiler installed, but I've now tried both.

Binding:
I reproduced the problem using "bindw" to create a few .exe files. One runs fast with the value "x + 230" on line 4. Another runs slow with the value "x + 150". (I didn't take the time to track down the exact "cutoff" point I mentioned earlier - I just wanted to proof that something that ought not make a difference does.) The slow .exe is 2 bytes smaller, and has 29 other different bytes (at least 9 of these are merely shifted), whereas another fast one (using "x + 311") is the same size as the "x + 230" and only different by 4 bytes.

Translation:
The only difference I could produce in the generated C code was between "x + 2" (or higher) and "x + 1", but it only renamed 2 variables and changed a couple lines to make use of the unique number that "1" is, e.g. it used:

if (_11 > MAXINT)

vs.

if ((long)((unsigned long)_12 + (unsigned long)HIGH_BITS) >= 0)

and

_11 = binary_op(PLUS, 1, _1x_135);

vs.

_12 = NewDouble(DBL_PTR(_1x_135)->dbl + (double)2);

So it seems the problem was reproduced in binding, but not in translating. If you like, I can try to post those .exe files somewhere. I'm not really sure how to reverse engineer them and tell exactly what's going on, though I might try anyway.

new topic     » goto parent     » topic index » view message » categorize

5. Re: Brain-scratcher: New example of the delay

Thanks for reproducing this for me. Yes, a processor caching issue was my first thought when I came across the delay, but now that I've seen that they bind differently, that seems to defy the idea that that's ALL it is.

new topic     » goto parent     » topic index » view message » categorize

6. Re: Brain-scratcher: New example of the delay

I tried compiling my C code to see what it did, but got a compile error on line 41 of main-.c, so I'm not sure the generated .exe even does what it is supposed to.

new topic     » goto parent     » topic index » view message » categorize

7. Re: Brain-scratcher: New example of the delay

CladStone said...

Thanks for reproducing this for me. Yes, a processor caching issue was my first thought when I came across the delay, but now that I've seen that they bind differently, that seems to defy the idea that that's ALL it is.

I am not so sure about this. The slightly different IL in a bound executable may easily account for this.

new topic     » goto parent     » topic index » view message » categorize

8. Re: Brain-scratcher: New example of the delay

See http://en.wikipedia.org/wiki/Self-modifying_code in particular the last sentence of this:

Interaction of cache and self-modifying code
In some cases short sections of self-modifying code executes more slowly on modern processors. This is because a modern processor will usually try to keep blocks of code in its cache memory. Each time the program rewrites a part of itself, the rewritten part must be loaded into the cache again, which results in a slight delay, if the modified codelet shares the same cache line with the modifying code, as is the case when the modified memory address is located within a few bytes to the one of the modifying code.

The cache invalidation issue on modern processors usually means that self-modifying code would still be faster only when the modification will occur rarely, such as in the case of a state switching inside an inner loop.

Most modern processors load the machine code before they execute it, which means that if an instruction that is too near the instruction pointer is modified, the processor will not notice, but instead execute the code as it was before it was modified. See Prefetch Input Queue (PIQ). PC processors have to handle self-modifying code correctly for backwards compatibility reasons but they are far from efficient at doing so[citation needed].

new topic     » goto parent     » topic index » view message » categorize

9. Re: Brain-scratcher: New example of the delay

CladStone said...

Thanks for reproducing this for me. Yes, a processor caching issue was my first thought when I came across the delay, but now that I've seen that they bind differently, that seems to defy the idea that that's ALL it is.

I suspect it may have to do with where the memory is allocated, and how close to a page boundary it is. The magic values seem to change, but the effect seems consistent, whether translated or bound or interpreted. It really does look like something going on with OS/processor memory management.

Is there some real situation where this issue is causing problems? I don't think there's anything that euphoria can do about this.

Matt

new topic     » goto parent     » topic index » view message » categorize

10. Re: Brain-scratcher: New example of the delay

jimcbrown said...

I am not so sure about this. The slightly different IL in a bound executable may easily account for this.

What's an IL? I don't understand why the code binds differently.

But yes, after a few more tests, I'm willing to concede that this IS a processor caching issue, and that the state is not captured in the bound .exe file. (I found one more like 150 that ran fast, and then I got my 150 to run fast.)

My big question now is about how to avoid the issue. If you'll notice from my original example, there were 2 allocations. The full code was not intended to modify the machine code routine at all. Must I pad any memory block I use for machine code routines with a certain amount of unused memory? If so, do I pad on both ends, or just after? And by how much do I need to pad to make sure I don't run into the issue?

new topic     » goto parent     » topic index » view message » categorize

11. Re: Brain-scratcher: New example of the delay

This is an interesting situation and may be worth further investigation. But I don't think the results would be very useful. Windows, along with any other system that currently runs Euphoria, is not a real-time operating system. There are no guarantees regarding the performance of a particular block of code. There are many factors involved, relating to both the CPU and OS. The results may appear to be consistent with a particular system in the short term but this cannot be relied upon. I don't think this is a Euphoria issue, except that an interpreter adds an extra layer of complexity to an already complex system. I think it would be a mistake to attempt to optimize code based on these experimental results.

new topic     » goto parent     » topic index » view message » categorize

12. Re: Brain-scratcher: New example of the delay

CladStone said...

Sure, you could come up with a way to automate testing.

re the #hex anomaly, I was thinking a slight delay may be caused while interpreted due to the # char being used to signal the start of the new multiline string in eu4. but that wouldn't apply if this was also seen in eu3. I would expect the problem to disappear when translated.

for testing, look at eutest in eu4. you can generate a program or series of programs programatically, write them to disk and run eutest. I believe it can time individual tests and produce a report. it will run any tests in a directory starting with the name t_*. or any named on the commandline.

new topic     » goto parent     » topic index » view message » categorize

13. Re: Brain-scratcher: New example of the delay

CladStone said...

My big question now is about how to avoid the issue. If you'll notice from my original example, there were 2 allocations. The full code was not intended to modify the machine code routine at all. Must I pad any memory block I use for machine code routines with a certain amount of unused memory? If so, do I pad on both ends, or just after? And by how much do I need to pad to make sure I don't run into the issue?

I suspect the best you could do is allocate an nK block up front for all asm code, where by nK I mean 1K, 4K, 32K, or whatever you end up needing. I doubt there is any way for the underlying OS to allocate 'executable' and 'non-executable' requests in separate non-cache-clashing areas...?

new topic     » goto parent     » topic index » view message » categorize

Search



Quick Links

User menu

Not signed in.

Misc Menu