### 1. call_func is expensive

Using call_func to select the desired "compare" function in a sorting algorithm looks very expensive.

Basic sorting algorithms an be used for a variety of uses if you choose the correct "compare" function.

```-- compare "if" vs "call_func"

atom t = time()
integer void

integer count = 1_000_000

puts(1, "compare inside an `if` branch\n\n" )
-- use "if" to select a suitable compare function

t = time()
for i=1 to count do
if 1 then
void = compare( 4, 3)
end if
end for
? time() - t

function kustom( object left, object right )
return compare( left, right )
end function
integer rid = routine_id( "kustom" )

puts(1, "\ncall_func inside an 'if` branch\n\n" )
-- use "call_func" to select a suitable compare function

t = time()
for i=1 to count do
if 1 then
void = call_func( rid, {4, 3} )
end if
end for
? time() - t
```

Some results I get; comparative time in seconds.

OE interpret OE compiled p interpret p -c compiled
if 0.01 0 0.01 0
call 0.05 0.02 0.3 0.26

The application for call_func would be to select sorting direction (up, down); sort (numbers) or collate (alphanumeric); user custom sort.

It looks like a sort API requires separate functions for each category of sorting--if we wish to maintain the idea that Euphoria is "fast." Is the expense of one "sort" function worth it?

_tom

### 2. Re: call_func is expensive // Phix much slower

_tom said...

Using call_func to select the desired "compare" function in a sorting algorithm looks very expensive.

Basic sorting algorithms an be used for a variety of uses if you choose the correct "compare" function.

```-- compare "if" vs "call_func"

atom t = time()
integer void

integer count = 1_000_000

puts(1, "compare inside an `if` branch\n\n" )
-- use "if" to select a suitable compare function

t = time()
for i=1 to count do
if 1 then
void = compare( 4, 3)
end if
end for
? time() - t

function kustom( object left, object right )
return compare( left, right )
end function
integer rid = routine_id( "kustom" )

puts(1, "\ncall_func inside an 'if` branch\n\n" )
-- use "call_func" to select a suitable compare function

t = time()
for i=1 to count do
if 1 then
void = call_func( rid, {4, 3} )
end if
end for
? time() - t
```

Some results I get; comparative time in seconds.

OE interpret OE compiled p interpret p -c compiled
if 0.01 0 0.01 0
call 0.05 0.02 0.3 0.26

The application for call_func would be to select sorting direction (up, down); sort (numbers) or collate (alphanumeric); user custom sort.

It looks like a sort API requires separate functions for each category of sorting--if we wish to maintain the idea that Euphoria is "fast." Is the expense of one "sort" function worth it?

_tom

The figures might be more indicative if the loop count were higher (eg, 100 million) as some timings might get swamped by the timer resolution. In any case, indirect calls were always going to be slower than something else. I imagine that custom sorts would be kinda rare and maybe not so important? In my day to day work the typical sort comparisons are only for strings or ints or fp values - which are all handled by compare(). I only ever used custom_sort() in a few rare cases where that API didn't handle the specific semantics, eg, sorting an index via another index via a data source.

Spock

### 3. Re: call_func is expensive // Phix much slower

Spock said...
_tom said...

...

It looks like a sort API requires separate functions for each category of sorting--if we wish to maintain the idea that Euphoria is "fast." Is the expense of one "sort" function worth it?

_tom

The figures might be more indicative if the loop count were higher (eg, 100 million) as some timings might get swamped by the timer resolution. In any case, indirect calls were always going to be slower than something else. I imagine that custom sorts would be kinda rare and maybe not so important? In my day to day work the typical sort comparisons are only for strings or ints or fp values - which are all handled by compare(). I only ever used custom_sort() in a few rare cases where that API didn't handle the specific semantics, eg, sorting an index via another index via a data source.

Spock

I have an idea for enhancing the performance of Msort. If it works well we might be a bit more relaxed about using the lower performance (but increased flexibility) from custom_sort(). Gimme a few days..

Spock

### 4. Re: call_func is expensive // Phix much slower

Spock said...

I have an idea for enhancing the performance of Msort. If it works well we might be a bit more relaxed about using the lower performance (but increased flexibility) from custom_sort(). Gimme a few days..

Spock

As is, Msort is fast and practical. The simplest way to get performance is to compile, which is the easy way to get (roughly) twice the speed.

But, Euphoria has a legacy of speed; it will be fun to see how things can get faster.

_tom

### 5. Re: call_func is expensive // Phix much slower

_tom said...

Phix much slower

I'll take a look

I had imagined a new sort that would default to using compare directly - I certainly wouldn't want to be forced to define a custom compare function for a standard sort either.

### 6. Re: call_func is expensive // Phix much slower

_tom said...
Spock said...

I have an idea for enhancing the performance of Msort. If it works well we might be a bit more relaxed about using the lower performance (but increased flexibility) from custom_sort(). Gimme a few days..

Spock

As is, Msort is fast and practical. The simplest way to get performance is to compile, which is the easy way to get (roughly) twice the speed.

But, Euphoria has a legacy of speed; it will be fun to see how things can get faster.

_tom

Well, I had some initial success. I get up to 20% improvement over msort. With some tweaking of constants maybe could extend that to 30%. Here is the algorithm in code. It won't run as is but gives the idea:

```
include bfind.e

function m2sort(seq s)

-- early out?
if ~s < 200 then return insertion_sort(s) end if

-- make index
seq index = {}
for i = 1 to ~s by 20 do
index &= { s.i }
end for
index = m2sort(index)

-- make the buffer
seq store = repeat( {}, ~index )

-- store all items in buffer
for i = 1 to ~s do
obj x = s.i
int n = bfind(x, index)
store.n &= { x }
end for

-- final sort on all items
s = {}
for i = 1 to ~store do
s &= m2sort( store.i )
end for

-- exit
return s

end function

```

Basically it samples the data and uses that to coarsely place the data into buckets. And then does a final sort of each bucket. BTW, the tilde symbol represents length(). And I have dot notation. For me it's an interesting little side-project. When some more substantial results come through I'll update msort.zip on the contributions page.

Spock

### 7. Re: call_func is expensive

I found a chance to look at this, but sad to say it ain't good news.

```            rds Eu 2.4:     phix:       eui 4.1.0
compare:        0.05        0.05        0.09
kustom:         0.20        0.57        0.18
call_func:      0.64        2.0         0.50
```

So, regarding phix vs eu, the builtin is fine, standard call is 3x, call_func is 4x...
(The 3x-ish ratio between kustom and call_func within each is about the same across all three and not discussed further)

Now, taking the headline 4x of call_func first:
in builtins/pcallfunc.e the hll function call_common is used, in theory all that code could be rewritten from hll to low-level assembly, and perhaps that could get us from 4x to 3x, but there is a fair chance that doing so would not make even the slightest dent; I suspect the real problem is elsewhere. (while I was there, I made a small improvement to call_proc, that oddly enough made no difference at all to call_func)

More shocking, and probably 75% of the 4x issue anyway, is the 3x difference between a plain hll call in Phix and OpenEuphoria, made stranger by the fact that overall performance otherwise seems pretty much on a par.
I found a few bits of old debug code but removing them made little difference.
My instinct says that the Phix code is suffering at least two (extra) branch mispredictions and corresponding pipeline flushes.
I quickly tried a few things that made absolutely no difference.

Perhaps the compiler could sort locals by type, integer last, to minimise the dealloc loop inside opRetf.
Perhaps integer-only routines could use an optimised variant of opRetf with no dealloc loop....
Otherwise I'm at a bit of a loss (regarding any other way to improve opFrame/opRetf in builtins/VM/pStack.e).

In summary: I cannot see a quick fix for this.
The more subtle 3x is probably more significant that the headline-making 4x.
At least from a Phix perspective, use of call_func in any sort function should indeed be minimised, at least for now.

Pete

### 8. Re: call_func is expensive

petelomax said...

In summary: I cannot see a quick fix for this.
The more subtle 3x is probably more significant that the headline-making 4x.
At least from a Phix perspective, use of call_func in any sort function should indeed be minimised, at least for now.

Pete

I find that one can write a "test" that can show that call_function has a penalty, but not horrible one. Just fuss with different loops and you can generate "better" results.

Benchmarking is inexact...

_tom