1. Natural sort comparison
- Posted by Spock Jul 19, 2017
- 2380 views
Forked from Re: sort constants confusing
A demo of the new sorting include is: http://openeuphoria.org/pastey/293.wc
It provides stable sorting, column sorting, index sorting, along with number and natural order options. All sorts are in the up direction.
After changing a few identifiers this works on OE and Phix.
_tom
Thanks for merging that. I have some questions about the Natural Sort comparator..
Because this is implemented at the compare level there would be quite a lot of overhead at each iteration computing 'natural' values for each target pair prior to the actual comparison. This has got to be quite costly but I'm not sure what the best way of solving it is. I suppose if the data set were always smallish, eg, n < 200 then it doesn't really matter.
Another question is: What would happen if we were sorting a mixture of atoms and strings? The code presented would fail in its fundamental purpose as atoms and sequences would end up being clumped (albeit in nicely sorted clumps).
Lastly, what about things like whitespace, decimal points etc.. the definition of what constitues 'naturalness' is a bit fuzzy. I mean, shouldn't we be also monocasing the text as well?
Spock
2. Re: Natural sort comparison
- Posted by _tom (admin) Jul 20, 2017
- 2316 views
I have been thinking of this stuff as Mike Sort (with obvious inference to some competitive sorting algorithm). Since I didn't invent it I wait for input.
Sorting API must consist of:
- data: integer|atom|string|object
- stability: TRUE|FALSE
- direction: UP|DN
- order: arithmetic|natural|custom
- column: 0|1|2|...
- index sorting
The API question is:
- a universal sort with five (or more) arguments
- sort with the most "popular" arguments
- lots of specialized sort routines with minimal argument lists
This is a teaching opportunity to show me how to design a library. Time to hear from Pete about what he expects in a sorting library.
The Natural Order algorithm is by Andy Serpa. "Natural Order" is sometimes called "Lexical Order" which is the alternative to "arithmetic order." The existing algo does not do everything:
The algorithm only takes into account "unbroken" integers within the character strings it makes no allowance for decimal points (although it will still work correctly as long as there are an equal number of digits to the right of the decimal point in each number being compared).
I have not thought about unicode and case sensitivity (monocasting; seems like an ugly word) variations in a custom sort.
In the pastey I left out a second sorting algo that is optimized for objects that are only string sequences--making it about twice as fast. In the original he also uses custom_sort(ID) as a way of executing the sorting.
Since using routine_id is expensive, is it worth writing the sorting comparison directly into the sorting algorithm?
_tom
3. Re: Natural sort comparison
- Posted by Spock Jul 20, 2017
- 2317 views
The Natural Order algorithm is by Andy Serpa. "Natural Order" is sometimes called "Lexical Order" which is the alternative to "arithmetic order." The existing algo does not do everything..
I have not thought about unicode and case sensitivity (monocasting; seems like an ugly word) variations in a custom sort.
In the pastey I left out a second sorting algo that is optimized for objects that are only string sequences--making it about twice as fast. In the original he also uses custom_sort(ID) as a way of executing the sorting.
Since using routine_id is expensive, is it worth writing the sorting comparison directly into the sorting algorithm?
_tom
If performance were really needed there could be another way:
From the original data 'S' create a key sequence 'A' to be used for the comparison [user accesses whatever functions they like in naturalsort.e]
Create an index array 'B' same length as A, eg, {1,2,3..}
Sort B via A using sort_index() [if that happens to exist in the API]
remap S using B
I personally don't like the idea of putting relatively rare cases into a general purpose API when the user could deal with them in the code.
Spock
4. Re: Natural sort comparison
- Posted by petelomax Jul 21, 2017
- 2304 views
My preferred sort API might be (this is all draft, there may be niggles)
sequence res = sort(sequence s, sequence options={})
where s is the data to be sorted and options is an even-length sequence with odd elements being recognised literal strings and even elements actual settings, eg
?sort(s,{"direction",-1,"compare",routine_id("my_compare")})
The available options would include at least:
even element | odd element |
---|---|
"compare" | routine_id("my_compare") where my_compare is a function that accepts two elements of s |
"direction" | +/-1 (look ma, no names to argue over) |
"preproc" | routine_id("my_preprocessor"), a function that accepts (idx, si), or a column set as per "return" |
"return" | -2, or {0..length(s[i])} ie: all, or a set of columns (aka retset) |
As noted by Mike, "preproc" could be significantly more efficient than "compare"; it would be an error to have both.
In fact we might drop "compare" and only offer "preproc", and I now think we should.
This interface could/should effectively make custom_sort obsolete (albeit fiddly to convert compare->preproc).
I would use -2 to indicate "preproc not set", rather than -1 which could conflict with/hide errors from routine_id failures.
A preproc/retset element of 0 means "index".
I would also make it thread-safe, meaning you cannot have the file-level SortType/Column/Keys/CompareRtn.
The innards might look a bit like this:
keys = s if preproc!=-2 then for i=1 to length(keys) do if integer(preproc) then keys[i] = call_func(preproc,{i,keys[i]}) else keys[i] = get_columns(keys[i],i,preproc) end if end for end if keys = actual_sort(keys) if retset!=-2 then for i=1 to length(keys) do keys[i] = get_columns(keys[i],i,retset) end for end if return keys -- **NB**
Note that if [parts of] the original s need to be returned, preproc must include them in keys, and "return" specified to extract them from and throw away the other working storage that formed part of keys[i].
A stable sort could be achieved by incorporating the index into keys.
I haven't attempted any of this at all, just a few ideas.
Pete
5. Re: Natural sort comparison
- Posted by petelomax Jul 21, 2017
- 2282 views
Am I right in thinking that the existing sort is perfectly stable - what is not stable is custom_sort, because you don't get any indexes with which to resolve ties.
6. Re: Natural sort comparison
- Posted by petelomax Jul 21, 2017
- 2298 views
The Natural Order algorithm is by Andy Serpa. The existing algo does not do everything:
The algorithm only takes into account "unbroken" integers within the character strings it makes no allowance for decimal points (although it will still work correctly as long as there are an equal number of digits to the right of the decimal point in each number being compared).
I have not thought about unicode and case sensitivity (monocasting; seems like an ugly word) variations in a custom sort.
You're gonna hate me: http://rosettacode.org/wiki/Natural_sorting#Phix
It might need enhancing for the 1.009 vs 1.1 case, with some prev='.' tests or similar
Pete
7. Re: Natural sort comparison
- Posted by _tom (admin) Jul 21, 2017
- 2265 views
Am I right in thinking that the existing sort is perfectly stable - what is not stable is custom_sort, because you don't get any indexes with which to resolve ties.
"custom_sort" happens to lack an option for choice of columns
If you have just one column then all sorting algorithms give you the same answer. If items get swapped or not, you can not tell. Stability does not matter.
When you have two (or more columns) and you sort using a column a the key, then you can tell the difference. Stable sorting becomes important.
Using Andy's natural sort order: I got the same result with Mike_sort and std/sort.e; the items have no columns but have an "internal" order.
_tom
8. Re: Natural sort comparison
- Posted by _tom (admin) Jul 21, 2017
- 2284 views
The Natural Order algorithm is by Andy Serpa. The existing algo does not do everything:
The algorithm only takes into account "unbroken" integers within the character strings it makes no allowance for decimal points (although it will still work correctly as long as there are an equal number of digits to the right of the decimal point in each number being compared).
I have not thought about unicode and case sensitivity (monocasting; seems like an ugly word) variations in a custom sort.
You're gonna hate me: http://rosettacode.org/wiki/Natural_sorting#Phix
It might need enhancing for the 1.009 vs 1.1 case, with some prev='.' tests or similar
Pete
You use a syntax trick in Phix; no cut and paste for OE.
A ligature is "two characters combined into one."
hence o and e becomes œ
_tom
9. Re: Natural sort comparison
- Posted by Spock Jul 23, 2017
- 2257 views
Am I right in thinking that the existing sort is perfectly stable - what is not stable is custom_sort, because you don't get any indexes with which to resolve ties.
"custom_sort" happens to lack an option for choice of columns
If you have just one column then all sorting algorithms give you the same answer. If items get swapped or not, you can not tell. Stability does not matter.
When you have two (or more columns) and you sort using a column a the key, then you can tell the difference. Stable sorting becomes important.
Using Andy's natural sort order: I got the same result with Mike_sort and std/sort.e; the items have no columns but have an "internal" order.
_tom
It's not necessary for custom_sort to pass any column ordering since the User's compare routine would take this into account. Except for sorting one-dimensional data the choice of algorithm will impact stability. Indexes can be used to convert any unstable sort into a stable one but I think it's easier to just use a naturally stable sort in the first place. Please correct me if I'm wrong but AFAIK the existing sort in \std is based on Shell sort so not stable.
Msort is basically Merge sort supplemented by Insertion sort. This is more-or-less the core of Timsort (I personally think Tim Peters went way overboard trying to squeeze out every last drop of performance). In my own tests Msort easily beats Qsort in any use case. It's fast and always correct so please feel free to use it how you want.
Spock
10. Re: Natural sort comparison
- Posted by petelomax Aug 11, 2017
- 2158 views
I found a way to improve quick_sort for the all-1s case, along with a minor speedup for the swap:
function quick_sort(sequence s) sequence qstack integer first, last, stackptr, I, J --object tmp, pivot object pivot, ti, tj qstack = repeat(0,floor((length(s)/5)+10)) -- create a stack first = 1 last = length(s) stackptr = 0 while 1 do while first<last do --maybe?: -- if last-first<=100 then -- s[first..last] = insertion_sort(s[first..last]) -- exit -- end if pivot = s[floor((last+first)/2)] I = first J = last while 1 do -- while s[I]<pivot do while 1 do ti = s[I] if ti>=pivot then exit end if I += 1 end while -- while s[J]>pivot do while 1 do tj = s[J] if tj<=pivot then exit end if J -= 1 end while if I>J then exit end if if I<J then -- added 6/8/17: if ti=tj then {I,J} = {J+1,I-1} exit end if -- tmp = s[I] -- s[I] = s[J] -- s[J] = tmp s[I] = tj s[J] = ti end if I += 1 J -= 1 if I>J then exit end if end while if I<last then qstack[stackptr+1] = I qstack[stackptr+2] = last stackptr += 2 end if last = J end while if stackptr=0 then exit end if stackptr -= 2 first = qstack[stackptr+1] last = qstack[stackptr+2] end while return s end function
Which makes it the clear winner across the board in my "Compare_sorting_algorithms.exw".
Given the performance boost that qstack gives to quick_sort, I wonder whether a similar technique could be applied to merge_sort, likewise turning it from a recursive algorithm to an iterative one, and whether merges could be performed in-situ. I jotted down some outline thoughts:
A stack of length log2 might be enough? Or even ceil(log2(n))-6 if we farm out length<=64 to insertion_sort.
Consider the merge operation of merge_sort, being performed in-situ.
(Note that <done> may in fact be a segment awaiting a future merge with merge(A,B))
xxxxxx,a1,a2,a3,a4,a5,b1,b2,b3,b4,b5 <done> <----- A ----> <----- B ---->
Now, if we pick a1, then all that happens is that A shrinks. A' and B remain sorted.
If we pick b1, then a1 must move into that slot. At this stage the rest of B remains sorted, whereas A becomes (not for long) a still-sorted-but-now-circular list.
At some point, however, the circular list aspect of A breaks down (assuming we would rather not physically re-order it), eg:
xxxxx,b1,b2,b3,a4,a5,a1,a2,a3,b4,b5 <--- done ---> <----- A ----> <-B->
(pick a1)
xxxxx,b1,b2,b3,a1,a5,a4,a2,a3,b4,b5 <----- done ----> <--- A ---> <-B->
The obvious solution is to maintain A as a linked list, which can be done using a parallel sequence links, initially the same size as the sequence x being sorted.
There may be opportunities to preset that links sequence during some prior merge operation (but probably not if we palm off short sequences to insertion_sort) - then again maintaining it for the B is pointless.
Lastly, if the merge terminates because A became empty, then any remainder of B is already in place, whereas if B becomes empty, then A must be reordered, using the same linked list, obviously with null-ops for any head(A)==A[1].
Pete
11. Re: Natural sort comparison
- Posted by Spock Aug 13, 2017
- 2072 views
I found a way to improve quick_sort for the all-1s case, along with a minor speedup for the swap: .... Which makes it the clear winner across the board in my "Compare_sorting_algorithms.exw".
Given the performance boost that qstack gives to quick_sort, I wonder whether a similar technique could be applied to merge_sort, likewise turning it from a recursive algorithm to an iterative one, and whether merges could be performed in-situ. I jotted down some outline thoughts...
Pete
Your enhanced Quick sort is definitely faster (but won't be stable). I had already started some work on an in-place Merge sort but got a bit distracted looking at Library sort (gapped Insertion sort) which might be better than Merge sort in some ways. It is like a Bucket sort which clumps values into discrete sorted groups. I want to try something along the same lines but leave the final sort till the end. If each [unsorted] clump were kept under a minimum size (you suggest n=64) Insertion sort would be the ideal final pass.
Spock
12. Re: Natural sort comparison
- Posted by petelomax Aug 14, 2017
- 2005 views
Here it is
enum SPLIT, MERGE function pihism_sort(sequence x) -- -- Pete's Iterative Half-In-Situ Merge Sort. -- integer first = 1, mid, last = length(x), adx, op = SPLIT, qslen = 2*ceil(log(last)/log(2)+1), qdx=0 sequence qsop = repeat(0,qslen), qsf = repeat(0,qslen), qsm = repeat(0,qslen), qsl = repeat(0,qslen), a object ai, xmid while 1 do -- until qstack is empty if op=SPLIT then -- -- just push further split/merge pair(s) onto the stack... -- while first<last do mid = floor((first+last)/2) qdx += 1 qsop[qdx] = MERGE qsf[qdx] = first qsm[qdx] = mid qsl[qdx] = last if mid+1<last then qdx += 1 qsop[qdx] = SPLIT qsf[qdx] = mid+1 qsl[qdx] = last end if if first=mid then exit end if last = mid end while else -- op=MERGE -- -- first, leave any parts of the first half already in place. -- mid += 1 xmid = x[mid] while first<mid and x[first]<=xmid do first += 1 end while -- -- then merge until the first half is empty. -- any remainder of the second half will already be in place. -- if first<mid then a = x[first..mid-1] ai = a[1] adx = 1 while 1 do if mid>last or compare(ai,xmid)<=0 then x[first] = ai first += 1 adx += 1 if adx>length(a) then exit end if ai = a[adx] else x[first] = xmid first += 1 mid += 1 if mid<=last then xmid = x[mid] end if end if end while end if end if if qdx=0 then exit end if op = qsop[qdx] first = qsf[qdx] mid = qsm[qdx] last = qsl[qdx] qdx -= 1 end while return x end function sequence s,r,t for i=1 to 100 do --for i=1 to 1000000 by 10000 do -- slow! --?i t = tagset(i) s = pihism_sort(t) if s!=t then ?9/0 end if s = reverse(t) r = pihism_sort(s) if r!=t then ?9/0 end if s = shuffle(t) r = pihism_sort(s) if r!=t then ?9/0 end if end for constant LIM = 1000000-0 t = shuffle(tagset(LIM)) atom t0 = time() s = pihism_sort(t) atom t1 = time() r = sort(t) atom t2 = time() ?{t1-t0,":",t2-t1} ?"ok" {} = wait_key()
and the medals awarded are:
ones sorted random reverse pihism_sort gold gold silver bronze quick_sort silver silver gold gold builtin_sort bronze bronze bronze silver
Not much in it though. I'm very pleased with that result.
Pete
PS: I tested this, integer-only, on RDS Eu/OpenEuphoria, and can post that version (f06.exw) if required.
PPS: Obviously 100% in-situ with a doubly-linked list was a completely insane idea, but 50% works well.
13. Re: Natural sort comparison
- Posted by Spock Aug 15, 2017
- 1977 views
.. and the medals awarded are:
ones sorted random reverse pihism_sort gold gold silver bronze quick_sort silver silver gold gold builtin_sort bronze bronze bronze silver
Not much in it though. I'm very pleased with that result.
Pete
So am I. The only sorting benchmark that should matter the most is for random strings. Testing it shows Msort losing to quick_sort but easily beating pihism_sort. I'd call this round a draw..
Spock