Re: External sorting
- Posted by Derek Parnell <ddparnell at bigpond.com> Jun 12, 2003
- 371 views
On Thu, 12 Jun 2003 12:23:32 +1000, <mistertrik at hotmail.com> wrote:

> What about a bucket sort? Open n files for writing, where n is the number
> of buckets you have, and as you read in your data to be sorted, stick it
> in the appropriate bucket. Then, open an output file, and go through each
> bucket in order, sort the contents of that bucket, and append it to the
> output file.
>
> As far as what number you choose 'n' to be, just pick it so that the size
> of each bucket when full will not be more than the amount of memory you
> want to use at a time.
>
> IMO this manner of sorting would require a lot less file I/O than the
> merging sort described earlier... what do you think?

One of the issues with this method is that the bucket files are of varying
length, and any one of them might end up larger than RAM can cope with. Of
course, you could deal with this recursively - that is, for each bucket
file that is too big, reprocess it into smaller buckets.

Another issue is working out what 'n' should be. This is normally done by
knowing the range of values that the key can have, or by knowing how many
records need sorting. This is not always easy. Consider sorting the
telephone book entries of a large city. The range of names is huge. You
might like to split on the first 2/3/4 letters of a name, but even that
can give very large bucket files.

The idea, however, can work for some situations.

--
cheers,
Derek Parnell
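[Editor's note: the bucket approach discussed above, including the recursive re-bucketing of oversized buckets, can be sketched roughly as follows. This is a minimal illustration, not code from the thread; it assumes records are one integer per line, buckets are split on value ranges, and a made-up MAX_RECORDS constant stands in for "how many records fit in RAM".]

```python
import os
import tempfile

MAX_RECORDS = 1000  # stand-in for "how many records fit in memory"

def external_bucket_sort(in_path, out_path, lo, hi, n_buckets=10):
    """Sort integer records (one per line, all in [lo, hi]) from
    in_path into out_path using range-based bucket files."""
    width = max(1, (hi - lo + 1) // n_buckets + 1)  # values per bucket
    buckets = [tempfile.NamedTemporaryFile("w+", delete=False)
               for _ in range(n_buckets)]
    counts = [0] * n_buckets

    # Pass 1: distribute each record into its bucket file.
    with open(in_path) as f:
        for line in f:
            i = min((int(line) - lo) // width, n_buckets - 1)
            buckets[i].write(line)
            counts[i] += 1

    # Pass 2: emit buckets in order; small ones sort in memory,
    # oversized ones are re-bucketed recursively (Derek's fix for
    # buckets that turn out larger than RAM).
    with open(out_path, "w") as out:
        for i, b in enumerate(buckets):
            b.flush()
            b_lo = lo + i * width
            b_hi = min(b_lo + width - 1, hi)
            if counts[i] <= MAX_RECORDS or b_lo == b_hi:
                # fits in memory (or all keys equal, so no split helps)
                b.seek(0)
                for v in sorted(int(line) for line in b):
                    out.write(f"{v}\n")
            else:
                sub_out = b.name + ".sorted"
                external_bucket_sort(b.name, sub_out, b_lo, b_hi, n_buckets)
                with open(sub_out) as s:
                    out.write(s.read())
                os.remove(sub_out)
            b.close()
            os.remove(b.name)
```

Note the `b_lo == b_hi` guard: if a bucket's key range cannot be split any further, recursing again would loop forever, so that degenerate bucket is handled in memory regardless of size.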