Re: fast integer sort
- Posted by Robert Craig <rds at EMAIL.MSN.COM> Dec 30, 1998
- 342 views
Mike wrote: > Ok. What do I do now. Maybe I should study bucket_sort() and > try to make it faster (Yeah, Sure!). > Surprisingly, it worked. Of course, if Rob had really wanted to he > could easily have optimized his code. I think you over-optimized your routine - it doesn't sort anymore. > Recently, I had a look at the integer sorting function bucket_sort() which > is in c:\euphoria\demo\allsorts.ex to see if I could use that to make > a fast > character string sorting routine. *Big Mistake*. Although the routine did > work , there was a speed difference (compared to sort() ) of > about tenfold, in Rob's favour. Ouch! It *is* possible to adapt bucket sort to efficiently handle strings. Art Adamson came up this idea a long time ago (See Archive). I recently wrote my own bucket sort for strings to see how fast I could make it. With a bit of tuning I came up with something twice as fast as the standard Euphoria sort() in sort.e. Art's was also faster than sort() as I recall. Here's my code: -- Sort strings using "buckets" -- Based on an idea by Art Adamson include sort.e include machine.e tick_rate(200) constant NSTRINGS = 20000 -- number of random strings to sort constant STRLEN = 8 -- length of strings constant NBUCKETS = floor(NSTRINGS/10) -- number of buckets to use constant MAX = 255*256+255+1 -- maximum value for 2 characters constant SCALE = NBUCKETS/MAX sequence strings, result sequence bucket strings = rand(repeat(repeat(256, STRLEN), NSTRINGS))-1 bucket = repeat({}, NBUCKETS) function string_sort(sequence strings) -- "bucket sort" for strings integer c1, c2, b -- put each string into the appropriate bucket for i = 1 to length(strings) do c1 = strings[i][1] c2 = strings[i][2] b = 1+floor((c1*256+c2)*SCALE) bucket[b] = append(bucket[b], strings[i]) end for -- sort each bucket, and concatenate the buckets result = {} for i = 1 to length(bucket) do result = result & sort(bucket[i]) -- use regular sort end for return result end function procedure stats() -- sanity check on result printf(1, "length of result: %d\n", length(result)) puts(1, "first string: ") ? result[1] puts(1, "last string: ") ? result[NSTRINGS] puts(1, '\n') end procedure atom t t = time() result = sort(strings) printf(1, "regular sort: %.2f\n", time()-t) stats() t = time() result = string_sort(strings) printf(1, "bucket sort: %.2f\n", time()-t) stats() The basic idea is that it's faster to sort a bunch of small sequences, than it is to sort one big sequence. I originally tried it with 256 buckets, where each string is placed into a bucket based on the value of its first character. That wasn't as fast as using the first two characters, which implies 256*256=65536 buckets. However 65536 is a lot of buckets if you don't have many strings, so I scale it to NSTRINGS/10 which seems to run faster anyway. Note: this was only tested on *random* strings. If the strings were not random, bucket sort would not do as well, since there would likely be a very uneven distribution of strings across the buckets. Regards, Rob Craig Rapid Deployment Software http://members.aol.com/FilesEu/