1. Natural sort comparison

Forked from Re: sort constants confusing

_tom said...

A demo of the new sorting include is: http://openeuphoria.org/pastey/293.wc

It provides stable sorting, column sorting, index sorting, along with number and natural order options. All sorts are in the up direction.

After changing a few identifiers this works on OE and Phix.

_tom

Thanks for merging that. I have some questions about the Natural Sort comparator..

Because this is implemented at the compare level there would be quite a lot of overhead at each iteration computing 'natural' values for each target pair prior to the actual comparison. This has got to be quite costly but I'm not sure what the best way of solving it is. I suppose if the data set were always smallish, eg, n < 200 then it doesn't really matter.

Another question is: What would happen if we were sorting a mixture of atoms and strings? The code presented would fail in its fundamental purpose as atoms and sequences would end up being clumped (albeit in nicely sorted clumps).

Lastly, what about things like whitespace, decimal points etc.. the definition of what constitues 'naturalness' is a bit fuzzy. I mean, shouldn't we be also monocasing the text as well?

Spock

new topic     » topic index » view message » categorize

2. Re: Natural sort comparison

I have been thinking of this stuff as Mike Sort (with obvious inference to some competitive sorting algorithm). Since I didn't invent it I wait for input.


Sorting API must consist of:

  • data: integer|atom|string|object
  • stability: TRUE|FALSE
  • direction: UP|DN
  • order: arithmetic|natural|custom
  • column: 0|1|2|...
  • index sorting

The API question is:

  • a universal sort with five (or more) arguments
  • sort with the most "popular" arguments
  • lots of specialized sort routines with minimal argument lists

This is a teaching opportunity to show me how to design a library. Time to hear from Pete about what he expects in a sorting library.


The Natural Order algorithm is by Andy Serpa. "Natural Order" is sometimes called "Lexical Order" which is the alternative to "arithmetic order." The existing algo does not do everything:

The algorithm only takes into account "unbroken" integers within the 
character strings  
it makes no allowance for decimal points (although it 
will still work correctly as long as there are an equal number of digits 
to the right of the decimal point in each number being compared). 

I have not thought about unicode and case sensitivity (monocasting; seems like an ugly word) variations in a custom sort.

In the pastey I left out a second sorting algo that is optimized for objects that are only string sequences--making it about twice as fast. In the original he also uses custom_sort(ID) as a way of executing the sorting.

Since using routine_id is expensive, is it worth writing the sorting comparison directly into the sorting algorithm?

_tom

new topic     » goto parent     » topic index » view message » categorize

3. Re: Natural sort comparison

_tom said...

The Natural Order algorithm is by Andy Serpa. "Natural Order" is sometimes called "Lexical Order" which is the alternative to "arithmetic order." The existing algo does not do everything..

I have not thought about unicode and case sensitivity (monocasting; seems like an ugly word) variations in a custom sort.

In the pastey I left out a second sorting algo that is optimized for objects that are only string sequences--making it about twice as fast. In the original he also uses custom_sort(ID) as a way of executing the sorting.

Since using routine_id is expensive, is it worth writing the sorting comparison directly into the sorting algorithm?

_tom

If performance were really needed there could be another way:

From the original data 'S' create a key sequence 'A' to be used for the comparison [user accesses whatever functions they like in naturalsort.e]
Create an index array 'B' same length as A, eg, {1,2,3..}
Sort B via A using sort_index() [if that happens to exist in the API]
remap S using B

I personally don't like the idea of putting relatively rare cases into a general purpose API when the user could deal with them in the code.

Spock

new topic     » goto parent     » topic index » view message » categorize

4. Re: Natural sort comparison

My preferred sort API might be (this is all draft, there may be niggles)

    sequence res = sort(sequence s, sequence options={}) 

where s is the data to be sorted and options is an even-length sequence with odd elements being recognised literal strings and even elements actual settings, eg

    ?sort(s,{"direction",-1,"compare",routine_id("my_compare")}) 

The available options would include at least:

even element odd element
"compare" routine_id("my_compare") where my_compare is a function that accepts two elements of s
"direction" +/-1 (look ma, no names to argue over)
"preproc" routine_id("my_preprocessor"), a function that accepts (idx, si), or a column set as per "return"
"return" -2, or {0..length(s[i])} ie: all, or a set of columns (aka retset)

As noted by Mike, "preproc" could be significantly more efficient than "compare"; it would be an error to have both.
In fact we might drop "compare" and only offer "preproc", and I now think we should.

This interface could/should effectively make custom_sort obsolete (albeit fiddly to convert compare->preproc).

I would use -2 to indicate "preproc not set", rather than -1 which could conflict with/hide errors from routine_id failures.

A preproc/retset element of 0 means "index".

I would also make it thread-safe, meaning you cannot have the file-level SortType/Column/Keys/CompareRtn.

The innards might look a bit like this:

 
    keys = s 
    if preproc!=-2 then 
        for i=1 to length(keys) do 
            if integer(preproc) then 
                keys[i] = call_func(preproc,{i,keys[i]}) 
            else 
                keys[i] = get_columns(keys[i],i,preproc) 
            end if 
        end for 
    end if 
    keys = actual_sort(keys) 
    if retset!=-2 then 
        for i=1 to length(keys) do 
            keys[i] = get_columns(keys[i],i,retset) 
        end for 
    end if 
    return keys     -- **NB** 

Note that if [parts of] the original s need to be returned, preproc must include them in keys, and "return" specified to extract them from and throw away the other working storage that formed part of keys[i].

A stable sort could be achieved by incorporating the index into keys.

I haven't attempted any of this at all, just a few ideas.

Pete

new topic     » goto parent     » topic index » view message » categorize

5. Re: Natural sort comparison

Am I right in thinking that the existing sort is perfectly stable - what is not stable is custom_sort, because you don't get any indexes with which to resolve ties.

new topic     » goto parent     » topic index » view message » categorize

6. Re: Natural sort comparison

_tom said...

The Natural Order algorithm is by Andy Serpa. The existing algo does not do everything:

The algorithm only takes into account "unbroken" integers within the 
character strings  
it makes no allowance for decimal points (although it 
will still work correctly as long as there are an equal number of digits 
to the right of the decimal point in each number being compared). 

I have not thought about unicode and case sensitivity (monocasting; seems like an ugly word) variations in a custom sort.

You're gonna hate me: http://rosettacode.org/wiki/Natural_sorting#Phix smile

It might need enhancing for the 1.009 vs 1.1 case, with some prev='.' tests or similar

Pete

new topic     » goto parent     » topic index » view message » categorize

7. Re: Natural sort comparison

petelomax said...

Am I right in thinking that the existing sort is perfectly stable - what is not stable is custom_sort, because you don't get any indexes with which to resolve ties.

"custom_sort" happens to lack an option for choice of columns

If you have just one column then all sorting algorithms give you the same answer. If items get swapped or not, you can not tell. Stability does not matter.

When you have two (or more columns) and you sort using a column a the key, then you can tell the difference. Stable sorting becomes important.

Using Andy's natural sort order: I got the same result with Mike_sort and std/sort.e; the items have no columns but have an "internal" order.

_tom

new topic     » goto parent     » topic index » view message » categorize

8. Re: Natural sort comparison

petelomax said...
_tom said...

The Natural Order algorithm is by Andy Serpa. The existing algo does not do everything:

The algorithm only takes into account "unbroken" integers within the 
character strings  
it makes no allowance for decimal points (although it 
will still work correctly as long as there are an equal number of digits 
to the right of the decimal point in each number being compared). 

I have not thought about unicode and case sensitivity (monocasting; seems like an ugly word) variations in a custom sort.

You're gonna hate me: http://rosettacode.org/wiki/Natural_sorting#Phix smile

It might need enhancing for the 1.009 vs 1.1 case, with some prev='.' tests or similar

Pete

You use a syntax trick in Phix; no cut and paste for OE.

A ligature is "two characters combined into one."

hence o and e becomes

_tom

new topic     » goto parent     » topic index » view message » categorize

9. Re: Natural sort comparison

_tom said...
petelomax said...

Am I right in thinking that the existing sort is perfectly stable - what is not stable is custom_sort, because you don't get any indexes with which to resolve ties.

"custom_sort" happens to lack an option for choice of columns

If you have just one column then all sorting algorithms give you the same answer. If items get swapped or not, you can not tell. Stability does not matter.

When you have two (or more columns) and you sort using a column a the key, then you can tell the difference. Stable sorting becomes important.

Using Andy's natural sort order: I got the same result with Mike_sort and std/sort.e; the items have no columns but have an "internal" order.

_tom

It's not necessary for custom_sort to pass any column ordering since the User's compare routine would take this into account. Except for sorting one-dimensional data the choice of algorithm will impact stability. Indexes can be used to convert any unstable sort into a stable one but I think it's easier to just use a naturally stable sort in the first place. Please correct me if I'm wrong but AFAIK the existing sort in \std is based on Shell sort so not stable.

Msort is basically Merge sort supplemented by Insertion sort. This is more-or-less the core of Timsort (I personally think Tim Peters went way overboard trying to squeeze out every last drop of performance). In my own tests Msort easily beats Qsort in any use case. It's fast and always correct so please feel free to use it how you want.

Spock

new topic     » goto parent     » topic index » view message » categorize

10. Re: Natural sort comparison

I found a way to improve quick_sort for the all-1s case, along with a minor speedup for the swap:

function quick_sort(sequence s) 
sequence qstack 
integer first, last, stackptr, I, J 
--object tmp, pivot 
object pivot, ti, tj 
 
    qstack = repeat(0,floor((length(s)/5)+10))    -- create a stack 
 
    first = 1 
    last = length(s) 
    stackptr = 0 
    while 1 do 
        while first<last do 
--maybe?: 
--          if last-first<=100 then 
--              s[first..last] = insertion_sort(s[first..last]) 
--              exit 
--          end if 
            pivot = s[floor((last+first)/2)] 
            I = first 
            J = last 
            while 1 do 
--              while s[I]<pivot do 
                while 1 do 
                    ti = s[I] 
                    if ti>=pivot then exit end if 
                    I += 1 
                end while 
--              while s[J]>pivot do 
                while 1 do 
                    tj = s[J] 
                    if tj<=pivot then exit end if 
                    J -= 1 
                end while 
                if I>J then exit end if 
                if I<J then 
                    -- added 6/8/17: 
                    if ti=tj then 
                        {I,J} = {J+1,I-1} 
                        exit 
                    end if 
--                  tmp = s[I] 
--                  s[I] = s[J] 
--                  s[J] = tmp 
                    s[I] = tj 
                    s[J] = ti 
                end if 
                I += 1 
                J -= 1 
                if I>J then exit end if 
            end while 
            if I<last then 
                qstack[stackptr+1] = I 
                qstack[stackptr+2] = last 
                stackptr += 2 
            end if 
            last = J 
        end while 
        if stackptr=0 then exit end if 
        stackptr -= 2 
        first = qstack[stackptr+1] 
        last = qstack[stackptr+2] 
    end while 
    return s 
end function 

Which makes it the clear winner across the board in my "Compare_sorting_algorithms.exw".

Given the performance boost that qstack gives to quick_sort, I wonder whether a similar technique could be applied to merge_sort, likewise turning it from a recursive algorithm to an iterative one, and whether merges could be performed in-situ. I jotted down some outline thoughts:

A stack of length log2 might be enough? Or even ceil(log2(n))-6 if we farm out length<=64 to insertion_sort.

Consider the merge operation of merge_sort, being performed in-situ.
(Note that <done> may in fact be a segment awaiting a future merge with merge(A,B))

xxxxxx,a1,a2,a3,a4,a5,b1,b2,b3,b4,b5 
<done> <----- A ----> <----- B ----> 

Now, if we pick a1, then all that happens is that A shrinks. A' and B remain sorted.
If we pick b1, then a1 must move into that slot. At this stage the rest of B remains sorted, whereas A becomes (not for long) a still-sorted-but-now-circular list.

At some point, however, the circular list aspect of A breaks down (assuming we would rather not physically re-order it), eg:

xxxxx,b1,b2,b3,a4,a5,a1,a2,a3,b4,b5 
<--- done ---> <----- A ----> <-B-> 

(pick a1)

xxxxx,b1,b2,b3,a1,a5,a4,a2,a3,b4,b5 
<----- done ----> <--- A ---> <-B-> 

The obvious solution is to maintain A as a linked list, which can be done using a parallel sequence links, initially the same size as the sequence x being sorted.
There may be opportunities to preset that links sequence during some prior merge operation (but probably not if we palm off short sequences to insertion_sort) - then again maintaining it for the B is pointless.

Lastly, if the merge terminates because A became empty, then any remainder of B is already in place, whereas if B becomes empty, then A must be reordered, using the same linked list, obviously with null-ops for any head(A)==A[1].

Pete

new topic     » goto parent     » topic index » view message » categorize

11. Re: Natural sort comparison

petelomax said...

I found a way to improve quick_sort for the all-1s case, along with a minor speedup for the swap: .... Which makes it the clear winner across the board in my "Compare_sorting_algorithms.exw".

Given the performance boost that qstack gives to quick_sort, I wonder whether a similar technique could be applied to merge_sort, likewise turning it from a recursive algorithm to an iterative one, and whether merges could be performed in-situ. I jotted down some outline thoughts...

Pete

Your enhanced Quick sort is definitely faster (but won't be stable). I had already started some work on an in-place Merge sort but got a bit distracted looking at Library sort (gapped Insertion sort) which might be better than Merge sort in some ways. It is like a Bucket sort which clumps values into discrete sorted groups. I want to try something along the same lines but leave the final sort till the end. If each [unsorted] clump were kept under a minimum size (you suggest n=64) Insertion sort would be the ideal final pass.

Spock

new topic     » goto parent     » topic index » view message » categorize

12. Re: Natural sort comparison

Here it is

enum SPLIT, MERGE 
 
function pihism_sort(sequence x) 
-- 
-- Pete's Iterative Half-In-Situ Merge Sort. 
-- 
integer first = 1, mid, last = length(x), adx, 
        op = SPLIT,  
        qslen = 2*ceil(log(last)/log(2)+1), 
        qdx=0 
sequence qsop = repeat(0,qslen), 
         qsf = repeat(0,qslen), 
         qsm = repeat(0,qslen), 
         qsl = repeat(0,qslen), 
         a 
object ai, xmid 
 
    while 1 do  -- until qstack is empty 
 
        if op=SPLIT then 
            -- 
            -- just push further split/merge pair(s) onto the stack... 
            -- 
            while first<last do 
                mid = floor((first+last)/2) 
                qdx += 1  
                qsop[qdx] = MERGE 
                qsf[qdx] = first 
                qsm[qdx] = mid 
                qsl[qdx] = last 
                if mid+1<last then 
                    qdx += 1 
                    qsop[qdx] = SPLIT 
                    qsf[qdx] = mid+1 
                    qsl[qdx] = last 
                end if 
                if first=mid then exit end if 
                last = mid 
            end while 
 
        else -- op=MERGE                             
            -- 
            -- first, leave any parts of the first half already in place. 
            -- 
            mid += 1 
            xmid = x[mid] 
            while first<mid and x[first]<=xmid do 
                first += 1 
            end while 
            -- 
            -- then merge until the first half is empty. 
            -- any remainder of the second half will already be in place. 
            -- 
            if first<mid then 
                a = x[first..mid-1] 
                ai = a[1] 
                adx = 1 
                while 1 do 
                    if mid>last 
                    or compare(ai,xmid)<=0 then 
                        x[first] = ai 
                        first += 1 
                        adx += 1 
                        if adx>length(a) then exit end if 
                        ai = a[adx] 
                    else 
                        x[first] = xmid 
                        first += 1 
                        mid += 1 
                        if mid<=last then 
                            xmid = x[mid] 
                        end if 
                    end if 
                end while 
            end if 
        end if 
        if qdx=0 then exit end if 
        op = qsop[qdx] 
        first = qsf[qdx] 
        mid = qsm[qdx] 
        last = qsl[qdx] 
        qdx -= 1 
    end while 
    return x 
end function 
 
sequence s,r,t 
for i=1 to 100 do 
--for i=1 to 1000000 by 10000 do  -- slow! 
--?i 
    t = tagset(i) 
    s = pihism_sort(t) 
    if s!=t then ?9/0 end if 
    s = reverse(t) 
    r = pihism_sort(s) 
    if r!=t then ?9/0 end if 
    s = shuffle(t) 
    r = pihism_sort(s) 
    if r!=t then ?9/0 end if 
end for 
 
constant LIM = 1000000-0 
t = shuffle(tagset(LIM)) 
atom t0 = time() 
s = pihism_sort(t) 
atom t1 = time() 
r = sort(t) 
atom t2 = time() 
?{t1-t0,":",t2-t1} 
     
?"ok" 
{} = wait_key() 

and the medals awarded are:

                  ones    sorted  random  reverse 
  pihism_sort      gold    gold    silver  bronze 
  quick_sort       silver  silver  gold    gold 
  builtin_sort     bronze  bronze  bronze  silver 

Not much in it though. I'm very pleased with that result.

Pete

PS: I tested this, integer-only, on RDS Eu/OpenEuphoria, and can post that version (f06.exw) if required.

PPS: Obviously 100% in-situ with a doubly-linked list was a completely insane idea, but 50% works well.

new topic     » goto parent     » topic index » view message » categorize

13. Re: Natural sort comparison

petelomax said...

.. and the medals awarded are:

                  ones    sorted  random  reverse 
  pihism_sort      gold    gold    silver  bronze 
  quick_sort       silver  silver  gold    gold 
  builtin_sort     bronze  bronze  bronze  silver 

Not much in it though. I'm very pleased with that result.

Pete

So am I. The only sorting benchmark that should matter the most is for random strings. Testing it shows Msort losing to quick_sort but easily beating pihism_sort. I'd call this round a draw..

Spock

new topic     » goto parent     » topic index » view message » categorize

Search



Quick Links

User menu

Not signed in.

Misc Menu