Euphoria
Ticket #745:
Ranked Sort routine

Reported by
DerekParnell
Jan 28, 2012
Add a new routine to the sort library that returns a data set sorted based on weighted rankings rather than field values. The data is sorted using the sum of the rank values of the selected columns rather than the actual field values.
Given a sequence S of elements, define rank(O, P) as the relative position that object O would have if sequence S would be sorted by predicate P. So the rank is a number between 1 and n (n being the number of elements in the sequence.) For example, if we sort { {40,5}, {10,9}, {30,1}, {20,6} } using "O[i]<O[i+1]" as a predicate, the rank of {20,6} is 2.
Now say we have multiple predicates, P1, P2, ..., Pk and each predicate has a relative weighting (W1, W2, ..., Wk) against the other predicates. The weighted rank score for a given predicate i is Wi * rank(O,Pi).
We want to sort elements in sequence S in increasing order of sum(W1 * rank(O, P1), ... ,Wk * rank(O, Pk)).
A practical application can by seen in selecting better performing company stocks using a range of indicators e.g. P/E, P/book value, dividends etc. etc. The sort can order stocks to get a set of "best of breed" stocks that are among the top ones in a variety of indicators. Relative importance can be assigned to each indicator.
Here is a suggested implementation.
include std/sort.e
include std/math.e
public function rank_sort(sequence data, object collist, integer resulttype = 1)
sequence Result
integer j
integer k
sequence weights
 Normalize input arguments.
if atom(collist) then
collist = {collist}
end if
 Seperate weights from columns to make it easier later.
weights = repeat(1, length(collist))
for i = 1 to length(collist) do
if sequence(collist[i]) then
weights[i] = collist[i][2]
collist[i] = collist[i][1]
end if
end for
 Append temporary fields to data: original index, final rank
 Original index is there to enable us to reference original data
 elements from a sorted version of them.
for i = 1 to length(data) do
data[i] &= {i,0}
end for
 Now sort for each column in the input argument.
for c = 1 to length(collist) do
 Sort the data on the current column
Result = sort_columns(data, {collist[c]})
j = 1  starting rank
k = abs(collist[c])  in case we were doing an inverted sort on this column.
 Apply rankings to data elements.
 Elements with same field values are given equal rank.
for i = 1 to length(Result) do
if not equal(Result[j][k], Result[i][k]) then
 Different field value so rank is updated.
j = i
end if
 accumulate weighted rank into original data element.
data[Result[i][$1]][$] += (j * weights[c])
end for
end for
 Finally, sort the data on the accumulated rankings.
Result = sort_columns(data, {length(data[1])})
 Determine which parts of the data to return.
if resulttype = 1 then
 Return sorted data
k = length(Result[1])
j = k  1
else
 Return data index and rank scores
k = length(Result[1])  2
j = 1
end if
 Strip temporary fields before returning result.
for i = 1 to length(Result) do
Result[i] = remove(Result[i], j, k)
end for
return Result
end function
And a unit test.
include std/unittest.e
sequence data = {
 Name, Height, Hair color, Hair Len, Shoe size 
{"Alan", 180, "black", 15, 7},
{"Barry", 170, "red", 10, 11},
{"Colin", 180, "grey", 9, 10},
{"Dennis", 201, "blonde", 5, 8},
{"Edward", 176, "brown", 1, 11},
{"Frank", 180, "blonde", 3, 9},
{"George", 201, "grey", 10, 10},
{"Harold", 176, "black", 12, 9},
$
}
sequence expected = {
{"George", 201, "grey", 10, 10},  very tall, large shoe
{"Edward", 176, "brown", 1, 11},  very short hair, large shoe
{"Colin", 180, "grey", 9, 10},  large shoe
{"Frank", 180, "blonde", 3, 9},  short hair
{"Dennis", 201, "blonde", 5, 8},  small shoe (negates tallness)
{"Barry", 170, "red", 10, 11},  large shoe (negates shortness)
{"Harold", 176, "black", 12, 9},  larger shoe, shorter hair than bottom item
{"Alan", 180, "black", 15, 7},  very small shoe, very long hair
$
}
 Sort based on ...
 height (tall > short): weighting = 1
 hair length(short > long): weighting = 0.5
 shoe size(big > small): weighting = 1
 such that height and shoe size have the same importance
 and hair length is only half as important.
constant HEIGHT = 2, HAIRLEN = 4, SHOESIZE = 5
test_equal ("rank sort", expected, rank_sort(data, {HEIGHT, {HAIRLEN,0.5}, SHOESIZE}, 1))
Acknowledgements: Thanks to Andrei Alexandrescu for the idea and function specification.
Details
1. Comment by useless_
Jan 28, 2012
I do not fully understand this function, but it looks similar to sorttok() or sortntok() in strtok.e. Perhaps a reason for using this function is needed.
useless
2. Comment by DerekParnell
Jan 28, 2012
I'm not aware of the functionality of the two routines you cite, however let me explain a bit further about this proposed function.
Taking the example data and sort shown in the initial proposal. The idea here is that I'm looking for the three people who are taller, with larger shoe sizes and to a degree have shorter hair. (I'm not sure why I need this but bear with me, ok?).
This is not saying I'm looking for the tallest people who also have largest shoe size and the shortest hair. (The relative sizes are when compared with the entire sequencepopulation). Now let's look at the data ...
George and Dennis are obviously the tallest in the group, but Dennis has quite a small shoe size. Also, George has longish hair ... so how do we compare these two people for criteria fit? Now, Edward is very short but he has very short hair and very large shoes, so how does that fit in with my needs?
Then there is Alan, Colin and Frank at all the same medium height. We can just look at their shoe sizes to order them, right? But Frank has much shorter hair than the other two ... hmmm. Is that really going to affect things much?
Finally there's Barry, Edward and Harold. All are pretty short so you'd think they wouldn't get picked. But both Barry and Edward have the largest shoes of all of them.
This proposed function takes all those things into consideration, based on the caller's perceived importance for each field, and outputs the list sorted accordingly.
So it turns out that the top three people are George (very tall with big feet that negates him having long hair), Edward (Largest shoes and shortest hair that negates his short stature), and Colin (Tallish with large shoes and shortish hair).
3. Comment by useless_
Jan 28, 2012
Thanks for the clear explaination, Derek. I was stumbling over "rank", i think i would have called it "altnormalisation" or something. The strtok sort functions sort a field, then sort inside each field, in a nested fashion, in the order specified. This new function is different, and is easier for what it does.
I'd like to make a suggestion:
Add an optional number of the top results, so when sortranking 100000's of items, the entire sorted db isn't returned if only the top one or two is desired. But this may not be faster, or less of a memory hog, i don't know.
useless
4. Comment by DerekParnell
Jan 28, 2012
Kat said...
Add an optional number of the top results, so when sortranking 100000's of items, the entire sorted db isn't returned if only the top one or two is desired. But this may not be faster, or less of a memory hog, i don't know.
It doesn't make it faster or reduce memory (so much), and we can't predict what the caller wants to use the sorted data for. It might be the top X, or maybe the median value they are after, or to count the number of blondes that are not in the top group.
So it would be simpler and more strategic to leave it as is. You might notice that you can get it to return a list of data's ranking scores instead of the data itself. You can use that return to look up elements in the original data.
Using the example above ...
rank_sort(data, {HEIGHT, {HAIRLEN,0.5}, SHOESIZE}, 0)
returns ...
{
{7, 6.5},
{5, 7.5},
{3, 8 },
{6, 9 },
{4, 9.5},
{2, 11.5},
{8, 14.5},
{1, 15 }
}
Each returned element has the index into the original data and the element's ranking score. This allows selection of elements based on relative rankings, eg. the 25% quartile or one standard deviation around the average rank score, etc ...