### 1. Natural sort comparison

Forked from Re: sort constants confusing

_tom said...

A demo of the new sorting include is: http://openeuphoria.org/pastey/293.wc

It provides stable sorting, column sorting, index sorting, along with number and natural order options. All sorts are in the up direction.

After changing a few identifiers this works on OE and Phix.

_tom

Thanks for merging that. I have some questions about the Natural Sort comparator..

Because this is implemented at the compare level there would be quite a lot of overhead at each iteration computing 'natural' values for each target pair prior to the actual comparison. This has got to be quite costly but I'm not sure what the best way of solving it is. I suppose if the data set were always smallish, eg, n < 200 then it doesn't really matter.

Another question is: What would happen if we were sorting a mixture of atoms and strings? The code presented would fail in its fundamental purpose as atoms and sequences would end up being clumped (albeit in nicely sorted clumps).

Lastly, what about things like whitespace, decimal points etc.. the definition of what constitues 'naturalness' is a bit fuzzy. I mean, shouldn't we be also monocasing the text as well?

Spock

### 2. Re: Natural sort comparison

I have been thinking of this stuff as Mike Sort (with obvious inference to some competitive sorting algorithm). Since I didn't invent it I wait for input.

Sorting API must consist of:

• data: integer|atom|string|object
• stability: TRUE|FALSE
• direction: UP|DN
• order: arithmetic|natural|custom
• column: 0|1|2|...
• index sorting

The API question is:

• a universal sort with five (or more) arguments
• sort with the most "popular" arguments
• lots of specialized sort routines with minimal argument lists

This is a teaching opportunity to show me how to design a library. Time to hear from Pete about what he expects in a sorting library.

The Natural Order algorithm is by Andy Serpa. "Natural Order" is sometimes called "Lexical Order" which is the alternative to "arithmetic order." The existing algo does not do everything:

```The algorithm only takes into account "unbroken" integers within the
character strings
it makes no allowance for decimal points (although it
will still work correctly as long as there are an equal number of digits
to the right of the decimal point in each number being compared).
```

I have not thought about unicode and case sensitivity (monocasting; seems like an ugly word) variations in a custom sort.

In the pastey I left out a second sorting algo that is optimized for objects that are only string sequences--making it about twice as fast. In the original he also uses custom_sort(ID) as a way of executing the sorting.

Since using routine_id is expensive, is it worth writing the sorting comparison directly into the sorting algorithm?

_tom

### 3. Re: Natural sort comparison

_tom said...

The Natural Order algorithm is by Andy Serpa. "Natural Order" is sometimes called "Lexical Order" which is the alternative to "arithmetic order." The existing algo does not do everything..

I have not thought about unicode and case sensitivity (monocasting; seems like an ugly word) variations in a custom sort.

In the pastey I left out a second sorting algo that is optimized for objects that are only string sequences--making it about twice as fast. In the original he also uses custom_sort(ID) as a way of executing the sorting.

Since using routine_id is expensive, is it worth writing the sorting comparison directly into the sorting algorithm?

_tom

If performance were really needed there could be another way:

From the original data 'S' create a key sequence 'A' to be used for the comparison [user accesses whatever functions they like in naturalsort.e]
Create an index array 'B' same length as A, eg, {1,2,3..}
Sort B via A using sort_index() [if that happens to exist in the API]
remap S using B

I personally don't like the idea of putting relatively rare cases into a general purpose API when the user could deal with them in the code.

Spock

### 4. Re: Natural sort comparison

My preferred sort API might be (this is all draft, there may be niggles)

```    sequence res = sort(sequence s, sequence options={})
```

where s is the data to be sorted and options is an even-length sequence with odd elements being recognised literal strings and even elements actual settings, eg

```    ?sort(s,{"direction",-1,"compare",routine_id("my_compare")})
```

The available options would include at least:

even element odd element
"compare" routine_id("my_compare") where my_compare is a function that accepts two elements of s
"direction" +/-1 (look ma, no names to argue over)
"preproc" routine_id("my_preprocessor"), a function that accepts (idx, si), or a column set as per "return"
"return" -2, or {0..length(s[i])} ie: all, or a set of columns (aka retset)

As noted by Mike, "preproc" could be significantly more efficient than "compare"; it would be an error to have both.
In fact we might drop "compare" and only offer "preproc", and I now think we should.

This interface could/should effectively make custom_sort obsolete (albeit fiddly to convert compare->preproc).

I would use -2 to indicate "preproc not set", rather than -1 which could conflict with/hide errors from routine_id failures.

A preproc/retset element of 0 means "index".

I would also make it thread-safe, meaning you cannot have the file-level SortType/Column/Keys/CompareRtn.

The innards might look a bit like this:

```
keys = s
if preproc!=-2 then
for i=1 to length(keys) do
if integer(preproc) then
keys[i] = call_func(preproc,{i,keys[i]})
else
keys[i] = get_columns(keys[i],i,preproc)
end if
end for
end if
keys = actual_sort(keys)
if retset!=-2 then
for i=1 to length(keys) do
keys[i] = get_columns(keys[i],i,retset)
end for
end if
return keys     -- **NB**
```

Note that if [parts of] the original s need to be returned, preproc must include them in keys, and "return" specified to extract them from and throw away the other working storage that formed part of keys[i].

A stable sort could be achieved by incorporating the index into keys.

I haven't attempted any of this at all, just a few ideas.

Pete

### 5. Re: Natural sort comparison

Am I right in thinking that the existing sort is perfectly stable - what is not stable is custom_sort, because you don't get any indexes with which to resolve ties.

### 6. Re: Natural sort comparison

_tom said...

The Natural Order algorithm is by Andy Serpa. The existing algo does not do everything:

```The algorithm only takes into account "unbroken" integers within the
character strings
it makes no allowance for decimal points (although it
will still work correctly as long as there are an equal number of digits
to the right of the decimal point in each number being compared).
```

I have not thought about unicode and case sensitivity (monocasting; seems like an ugly word) variations in a custom sort.

You're gonna hate me: http://rosettacode.org/wiki/Natural_sorting#Phix It might need enhancing for the 1.009 vs 1.1 case, with some prev='.' tests or similar

Pete

### 7. Re: Natural sort comparison

petelomax said...

Am I right in thinking that the existing sort is perfectly stable - what is not stable is custom_sort, because you don't get any indexes with which to resolve ties.

"custom_sort" happens to lack an option for choice of columns

If you have just one column then all sorting algorithms give you the same answer. If items get swapped or not, you can not tell. Stability does not matter.

When you have two (or more columns) and you sort using a column a the key, then you can tell the difference. Stable sorting becomes important.

Using Andy's natural sort order: I got the same result with Mike_sort and std/sort.e; the items have no columns but have an "internal" order.

_tom

### 8. Re: Natural sort comparison

petelomax said...
_tom said...

The Natural Order algorithm is by Andy Serpa. The existing algo does not do everything:

```The algorithm only takes into account "unbroken" integers within the
character strings
it makes no allowance for decimal points (although it
will still work correctly as long as there are an equal number of digits
to the right of the decimal point in each number being compared).
```

I have not thought about unicode and case sensitivity (monocasting; seems like an ugly word) variations in a custom sort.

You're gonna hate me: http://rosettacode.org/wiki/Natural_sorting#Phix It might need enhancing for the 1.009 vs 1.1 case, with some prev='.' tests or similar

Pete

You use a syntax trick in Phix; no cut and paste for OE.

A ligature is "two characters combined into one."

hence o and e becomes 

_tom

### 9. Re: Natural sort comparison

_tom said...
petelomax said...

Am I right in thinking that the existing sort is perfectly stable - what is not stable is custom_sort, because you don't get any indexes with which to resolve ties.

"custom_sort" happens to lack an option for choice of columns

If you have just one column then all sorting algorithms give you the same answer. If items get swapped or not, you can not tell. Stability does not matter.

When you have two (or more columns) and you sort using a column a the key, then you can tell the difference. Stable sorting becomes important.

Using Andy's natural sort order: I got the same result with Mike_sort and std/sort.e; the items have no columns but have an "internal" order.

_tom

It's not necessary for custom_sort to pass any column ordering since the User's compare routine would take this into account. Except for sorting one-dimensional data the choice of algorithm will impact stability. Indexes can be used to convert any unstable sort into a stable one but I think it's easier to just use a naturally stable sort in the first place. Please correct me if I'm wrong but AFAIK the existing sort in \std is based on Shell sort so not stable.

Msort is basically Merge sort supplemented by Insertion sort. This is more-or-less the core of Timsort (I personally think Tim Peters went way overboard trying to squeeze out every last drop of performance). In my own tests Msort easily beats Qsort in any use case. It's fast and always correct so please feel free to use it how you want.

Spock

### 10. Re: Natural sort comparison

I found a way to improve quick_sort for the all-1s case, along with a minor speedup for the swap:

```function quick_sort(sequence s)
sequence qstack
integer first, last, stackptr, I, J
--object tmp, pivot
object pivot, ti, tj

qstack = repeat(0,floor((length(s)/5)+10))    -- create a stack

first = 1
last = length(s)
stackptr = 0
while 1 do
while first<last do
--maybe?:
--          if last-first<=100 then
--              s[first..last] = insertion_sort(s[first..last])
--              exit
--          end if
pivot = s[floor((last+first)/2)]
I = first
J = last
while 1 do
--              while s[I]<pivot do
while 1 do
ti = s[I]
if ti>=pivot then exit end if
I += 1
end while
--              while s[J]>pivot do
while 1 do
tj = s[J]
if tj<=pivot then exit end if
J -= 1
end while
if I>J then exit end if
if I<J then
if ti=tj then
{I,J} = {J+1,I-1}
exit
end if
--                  tmp = s[I]
--                  s[I] = s[J]
--                  s[J] = tmp
s[I] = tj
s[J] = ti
end if
I += 1
J -= 1
if I>J then exit end if
end while
if I<last then
qstack[stackptr+1] = I
qstack[stackptr+2] = last
stackptr += 2
end if
last = J
end while
if stackptr=0 then exit end if
stackptr -= 2
first = qstack[stackptr+1]
last = qstack[stackptr+2]
end while
return s
end function
```

Which makes it the clear winner across the board in my "Compare_sorting_algorithms.exw".

Given the performance boost that qstack gives to quick_sort, I wonder whether a similar technique could be applied to merge_sort, likewise turning it from a recursive algorithm to an iterative one, and whether merges could be performed in-situ. I jotted down some outline thoughts:

A stack of length log2 might be enough? Or even ceil(log2(n))-6 if we farm out length<=64 to insertion_sort.

Consider the merge operation of merge_sort, being performed in-situ.
(Note that <done> may in fact be a segment awaiting a future merge with merge(A,B))

```xxxxxx,a1,a2,a3,a4,a5,b1,b2,b3,b4,b5
<done> <----- A ----> <----- B ---->
```

Now, if we pick a1, then all that happens is that A shrinks. A' and B remain sorted.
If we pick b1, then a1 must move into that slot. At this stage the rest of B remains sorted, whereas A becomes (not for long) a still-sorted-but-now-circular list.

At some point, however, the circular list aspect of A breaks down (assuming we would rather not physically re-order it), eg:

```xxxxx,b1,b2,b3,a4,a5,a1,a2,a3,b4,b5
<--- done ---> <----- A ----> <-B->
```

(pick a1)

```xxxxx,b1,b2,b3,a1,a5,a4,a2,a3,b4,b5
<----- done ----> <--- A ---> <-B->
```

The obvious solution is to maintain A as a linked list, which can be done using a parallel sequence links, initially the same size as the sequence x being sorted.
There may be opportunities to preset that links sequence during some prior merge operation (but probably not if we palm off short sequences to insertion_sort) - then again maintaining it for the B is pointless.

Lastly, if the merge terminates because A became empty, then any remainder of B is already in place, whereas if B becomes empty, then A must be reordered, using the same linked list, obviously with null-ops for any head(A)==A.

Pete

### 11. Re: Natural sort comparison

petelomax said...

I found a way to improve quick_sort for the all-1s case, along with a minor speedup for the swap: .... Which makes it the clear winner across the board in my "Compare_sorting_algorithms.exw".

Given the performance boost that qstack gives to quick_sort, I wonder whether a similar technique could be applied to merge_sort, likewise turning it from a recursive algorithm to an iterative one, and whether merges could be performed in-situ. I jotted down some outline thoughts...

Pete

Your enhanced Quick sort is definitely faster (but won't be stable). I had already started some work on an in-place Merge sort but got a bit distracted looking at Library sort (gapped Insertion sort) which might be better than Merge sort in some ways. It is like a Bucket sort which clumps values into discrete sorted groups. I want to try something along the same lines but leave the final sort till the end. If each [unsorted] clump were kept under a minimum size (you suggest n=64) Insertion sort would be the ideal final pass.

Spock

### 12. Re: Natural sort comparison

Here it is

```enum SPLIT, MERGE

function pihism_sort(sequence x)
--
-- Pete's Iterative Half-In-Situ Merge Sort.
--
integer first = 1, mid, last = length(x), adx,
op = SPLIT,
qslen = 2*ceil(log(last)/log(2)+1),
qdx=0
sequence qsop = repeat(0,qslen),
qsf = repeat(0,qslen),
qsm = repeat(0,qslen),
qsl = repeat(0,qslen),
a
object ai, xmid

while 1 do  -- until qstack is empty

if op=SPLIT then
--
-- just push further split/merge pair(s) onto the stack...
--
while first<last do
mid = floor((first+last)/2)
qdx += 1
qsop[qdx] = MERGE
qsf[qdx] = first
qsm[qdx] = mid
qsl[qdx] = last
if mid+1<last then
qdx += 1
qsop[qdx] = SPLIT
qsf[qdx] = mid+1
qsl[qdx] = last
end if
if first=mid then exit end if
last = mid
end while

else -- op=MERGE
--
-- first, leave any parts of the first half already in place.
--
mid += 1
xmid = x[mid]
while first<mid and x[first]<=xmid do
first += 1
end while
--
-- then merge until the first half is empty.
-- any remainder of the second half will already be in place.
--
if first<mid then
a = x[first..mid-1]
ai = a
while 1 do
if mid>last
or compare(ai,xmid)<=0 then
x[first] = ai
first += 1
if adx>length(a) then exit end if
else
x[first] = xmid
first += 1
mid += 1
if mid<=last then
xmid = x[mid]
end if
end if
end while
end if
end if
if qdx=0 then exit end if
op = qsop[qdx]
first = qsf[qdx]
mid = qsm[qdx]
last = qsl[qdx]
qdx -= 1
end while
return x
end function

sequence s,r,t
for i=1 to 100 do
--for i=1 to 1000000 by 10000 do  -- slow!
--?i
t = tagset(i)
s = pihism_sort(t)
if s!=t then ?9/0 end if
s = reverse(t)
r = pihism_sort(s)
if r!=t then ?9/0 end if
s = shuffle(t)
r = pihism_sort(s)
if r!=t then ?9/0 end if
end for

constant LIM = 1000000-0
t = shuffle(tagset(LIM))
atom t0 = time()
s = pihism_sort(t)
atom t1 = time()
r = sort(t)
atom t2 = time()
?{t1-t0,":",t2-t1}

?"ok"
{} = wait_key()
```

and the medals awarded are:

```                  ones    sorted  random  reverse
pihism_sort      gold    gold    silver  bronze
quick_sort       silver  silver  gold    gold
builtin_sort     bronze  bronze  bronze  silver
```

Not much in it though. I'm very pleased with that result.

Pete

PS: I tested this, integer-only, on RDS Eu/OpenEuphoria, and can post that version (f06.exw) if required.

PPS: Obviously 100% in-situ with a doubly-linked list was a completely insane idea, but 50% works well.

### 13. Re: Natural sort comparison

petelomax said...

.. and the medals awarded are:

```                  ones    sorted  random  reverse
pihism_sort      gold    gold    silver  bronze
quick_sort       silver  silver  gold    gold
builtin_sort     bronze  bronze  bronze  silver
```

Not much in it though. I'm very pleased with that result.

Pete

So am I. The only sorting benchmark that should matter the most is for random strings. Testing it shows Msort losing to quick_sort but easily beating pihism_sort. I'd call this round a draw..

Spock