Euphoria Ticket #745: Ranked Sort routine

Add a new routine to the sort library that returns a data set sorted based on weighted rankings rather than field values. The data is sorted using the sum of the rank values of the selected columns rather than the actual field values.

Given a sequence S of elements, define rank(O, P) as the relative position that object O would have if sequence S would be sorted by predicate P. So the rank is a number between 1 and n (n being the number of elements in the sequence.) For example, if we sort { {40,5}, {10,9}, {30,1}, {20,6} } using "O[i]<O[i+1]" as a predicate, the rank of {20,6} is 2.

Now say we have multiple predicates, P1, P2, ..., Pk and each predicate has a relative weighting (W1, W2, ..., Wk) against the other predicates. The weighted rank score for a given predicate i is Wi * rank(O,Pi).

We want to sort elements in sequence S in increasing order of sum(W1 * rank(O, P1), ... ,Wk * rank(O, Pk)).

A practical application can by seen in selecting better performing company stocks using a range of indicators e.g. P/E, P/book value, dividends etc. etc. The sort can order stocks to get a set of "best of breed" stocks that are among the top ones in a variety of indicators. Relative importance can be assigned to each indicator.

Here is a suggested implementation.

include std/sort.e 
include std/math.e 
 
public function rank_sort(sequence data, object collist, integer resulttype = 1) 
    sequence Result 
    integer j 
    integer k 
    sequence weights 
 
    -- Normalize input arguments. 
    if atom(collist) then 
        collist = {collist} 
    end if 
 
    -- Seperate weights from columns to make it easier later. 
    weights = repeat(1, length(collist)) 
    for i = 1 to length(collist) do 
        if sequence(collist[i]) then 
            weights[i] = collist[i][2] 
            collist[i] = collist[i][1] 
        end if 
    end for 
     
    -- Append temporary fields to data: original index, final rank 
    -- Original index is there to enable us to reference original data 
    --   elements from a sorted version of them. 
    for i = 1 to length(data) do 
        data[i] &= {i,0} 
    end for 
     
    -- Now sort for each column in the input argument. 
    for c = 1 to length(collist) do 
        -- Sort the data on the current column 
        Result = sort_columns(data, {collist[c]}) 
         
        j  = 1  -- starting rank 
        k = abs(collist[c]) -- in case we were doing an inverted sort on this column. 
         
        -- Apply rankings to data elements.  
        -- Elements with same field values are given equal rank. 
        for i = 1 to length(Result) do 
            if not equal(Result[j][k], Result[i][k]) then 
                -- Different field value so rank is updated. 
                j = i 
            end if 
             
            -- accumulate weighted rank into original data element. 
            data[Result[i][$-1]][$] += (j * weights[c]) 
        end for 
    end for 
 
    -- Finally, sort the data on the accumulated rankings. 
    Result = sort_columns(data, {length(data[1])}) 
 
 
    -- Determine which parts of the data to return.      
    if resulttype = 1 then 
        -- Return sorted data 
        k = length(Result[1]) 
        j = k - 1 
    else 
        -- Return data index and rank scores 
        k = length(Result[1]) - 2 
        j = 1 
    end if 
     
    -- Strip temporary fields before returning result. 
    for i = 1 to length(Result) do 
        Result[i] = remove(Result[i], j, k) 
    end for 
     
    return Result 
end function 

And a unit test.

include std/unittest.e 
sequence data = { 
  --  Name,     Height, Hair color, Hair Len,  Shoe size -- 
    {"Alan",    180,    "black",    15,          7}, 
    {"Barry",   170,    "red",      10,         11}, 
    {"Colin",   180,    "grey",      9,         10}, 
    {"Dennis",  201,    "blonde",    5,          8}, 
    {"Edward",  176,    "brown",     1,         11}, 
    {"Frank",   180,    "blonde",    3,          9}, 
    {"George",  201,    "grey",     10,         10}, 
    {"Harold",  176,    "black",    12,          9}, 
    $    
    } 
 
sequence expected = { 
    {"George",   201,    "grey",     10,         10}, -- very tall, large shoe 
    {"Edward",   176,    "brown",     1,         11}, -- very short hair, large shoe 
    {"Colin",    180,    "grey",      9,         10}, -- large shoe 
    {"Frank",    180,    "blonde",    3,          9}, -- short hair 
    {"Dennis",   201,    "blonde",    5,          8}, -- small shoe (negates tallness) 
    {"Barry",    170,    "red",      10,         11}, -- large shoe (negates shortness) 
    {"Harold",   176,    "black",    12,          9}, -- larger shoe, shorter hair than bottom item 
    {"Alan",     180,    "black",    15,          7}, -- very small shoe, very long hair 
    $ 
    } 
 
-- Sort based on ... 
--    height (tall -> short): weighting = 1 
--    hair length(short -> long): weighting = 0.5 
--    shoe size(big -> small): weighting = 1 
-- such that height and shoe size have the same importance 
-- and hair length is only half as important. 
constant HEIGHT = 2, HAIRLEN = 4, SHOESIZE = 5 
test_equal ("rank sort", expected, rank_sort(data, {-HEIGHT, {HAIRLEN,0.5}, -SHOESIZE}, 1)) 
 

Acknowledgements: Thanks to Andrei Alexandrescu for the idea and function specification.

Details

Type: Feature Request Severity: Normal Category: Library Routine
Assigned To: DerekParnell Status: New Reported Release:
Fixed in SVN #: View VCS: none Milestone:

1. Comment by useless_ Jan 28, 2012

I do not fully understand this function, but it looks similar to sorttok() or sortntok() in strtok.e. Perhaps a reason for using this function is needed.

useless

2. Comment by DerekParnell Jan 28, 2012

I'm not aware of the functionality of the two routines you cite, however let me explain a bit further about this proposed function.

Taking the example data and sort shown in the initial proposal. The idea here is that I'm looking for the three people who are taller, with larger shoe sizes and to a degree have shorter hair. (I'm not sure why I need this but bear with me, ok?).

This is not saying I'm looking for the tallest people who also have largest shoe size and the shortest hair. (The relative sizes are when compared with the entire sequence-population). Now let's look at the data ...

George and Dennis are obviously the tallest in the group, but Dennis has quite a small shoe size. Also, George has longish hair ... so how do we compare these two people for criteria fit? Now, Edward is very short but he has very short hair and very large shoes, so how does that fit in with my needs?

Then there is Alan, Colin and Frank at all the same medium height. We can just look at their shoe sizes to order them, right? But Frank has much shorter hair than the other two ... hmmm. Is that really going to affect things much?

Finally there's Barry, Edward and Harold. All are pretty short so you'd think they wouldn't get picked. But both Barry and Edward have the largest shoes of all of them.

This proposed function takes all those things into consideration, based on the caller's perceived importance for each field, and outputs the list sorted accordingly.

So it turns out that the top three people are George (very tall with big feet that negates him having long hair), Edward (Largest shoes and shortest hair that negates his short stature), and Colin (Tallish with large shoes and shortish hair).

3. Comment by useless_ Jan 28, 2012

Thanks for the clear explaination, Derek. I was stumbling over "rank", i think i would have called it "alt-normalisation" or something. The strtok sort functions sort a field, then sort inside each field, in a nested fashion, in the order specified. This new function is different, and is easier for what it does.

I'd like to make a suggestion:

Add an optional number of the top results, so when sort-ranking 100000's of items, the entire sorted db isn't returned if only the top one or two is desired. But this may not be faster, or less of a memory hog, i don't know.

useless

4. Comment by DerekParnell Jan 28, 2012

Kat said...

Add an optional number of the top results, so when sort-ranking 100000's of items, the entire sorted db isn't returned if only the top one or two is desired. But this may not be faster, or less of a memory hog, i don't know.

It doesn't make it faster or reduce memory (so much), and we can't predict what the caller wants to use the sorted data for. It might be the top X, or maybe the median value they are after, or to count the number of blondes that are not in the top group.

So it would be simpler and more strategic to leave it as is. You might notice that you can get it to return a list of data's ranking scores instead of the data itself. You can use that return to look up elements in the original data.

Using the example above ...

rank_sort(data, {-HEIGHT, {HAIRLEN,0.5}, -SHOESIZE}, 0) 

returns ...

{ 
  {7,  6.5}, 
  {5,  7.5}, 
  {3,  8  }, 
  {6,  9  }, 
  {4,  9.5}, 
  {2, 11.5}, 
  {8, 14.5}, 
  {1, 15  } 
} 

Each returned element has the index into the original data and the element's ranking score. This allows selection of elements based on relative rankings, eg. the 25% quartile or one standard deviation around the average rank score, etc ...

Search



Quick Links

User menu

Not signed in.

Misc Menu