Re: fast integer sort

new topic     » goto parent     » topic index » view thread      » older message » newer message

Mike wrote:
> Ok. What do I do now. Maybe I should study bucket_sort() and
> try to make it faster (Yeah, Sure!).
> Surprisingly, it worked. Of course, if Rob had really wanted to he
> could easily have optimized his code.

I think you over-optimized your routine - it doesn't sort anymore.

> Recently, I had a look at the integer sorting function bucket_sort() which
> is in c:\euphoria\demo\allsorts.ex to see if I could use that to make
> a fast
> character string sorting routine. *Big Mistake*. Although the routine did
> work , there was a speed difference (compared to sort() ) of
> about tenfold, in Rob's favour. Ouch!

It *is* possible to adapt bucket sort to efficiently handle strings.
Art Adamson came up this idea a long time ago (See Archive).
I recently wrote my own bucket sort for strings to see how fast I could
make it. With a bit of tuning I came up with something twice as fast as
the standard Euphoria sort() in sort.e. Art's was also faster than sort()
as I recall. Here's my code:

-- Sort strings using "buckets"
-- Based on an idea by Art Adamson

include sort.e
include machine.e
tick_rate(200)

constant NSTRINGS = 20000  -- number of random strings to sort
constant STRLEN = 8        -- length of strings
constant NBUCKETS = floor(NSTRINGS/10)  -- number of buckets to use
constant MAX = 255*256+255+1  -- maximum value for 2 characters
constant SCALE = NBUCKETS/MAX

sequence strings, result
sequence bucket
strings = rand(repeat(repeat(256, STRLEN), NSTRINGS))-1
bucket = repeat({}, NBUCKETS)

function string_sort(sequence strings)
-- "bucket sort" for strings
    integer c1, c2, b

    -- put each string into the appropriate bucket
    for i = 1 to length(strings) do
        c1 = strings[i][1]
        c2 = strings[i][2]
        b = 1+floor((c1*256+c2)*SCALE)
        bucket[b] = append(bucket[b], strings[i])
    end for

    -- sort each bucket, and concatenate the buckets
    result = {}
    for i = 1 to length(bucket) do
        result = result & sort(bucket[i]) -- use regular sort
    end for
    return result
end function

procedure stats()
-- sanity check on result
    printf(1, "length of result: %d\n", length(result))
    puts(1, "first string: ")
    ? result[1]
    puts(1, "last string: ")
    ? result[NSTRINGS]
    puts(1, '\n')
end procedure

atom t
t = time()
result = sort(strings)
printf(1, "regular sort: %.2f\n", time()-t)
stats()

t = time()
result = string_sort(strings)
printf(1, "bucket sort: %.2f\n", time()-t)
stats()

The basic idea is that it's faster to sort a bunch of small
sequences, than it is to sort one big sequence.

I originally tried it with 256 buckets, where each string
is placed into a bucket based on the value of its first
character. That wasn't as fast as using the first
two characters, which implies 256*256=65536 buckets.
However 65536 is  a lot of buckets if you don't have many
strings, so I scale it to NSTRINGS/10 which seems to run
faster anyway.

Note: this was only tested on *random* strings.
If the strings were not random, bucket sort would not
do as well, since there would likely be a very uneven distribution
of strings across the buckets.

Regards,
     Rob Craig
     Rapid Deployment Software
     http://members.aol.com/FilesEu/

new topic     » goto parent     » topic index » view thread      » older message » newer message

Search



Quick Links

User menu

Not signed in.

Misc Menu