1. External sorting
Has anyone written anything to do external sorting? That is, sorting
data in a file where the file is too large to fit entirely in memory?
[writing out to a new file in sorted order.] Ricardo Forno suggests in
genfunc.e that his NWay sorts could be adapted to this task, but I'm too
dumb to determine how one would do it.
Anyone?
2. Re: External sorting
At 11:23 PM 6/10/03 +0000, you wrote:
>
>
>Has anyone written anything to do external sorting? That is, sorting
>data in a file where the file is too large to fit entirely in memory?
>[writing out to a new file in sorted order.] Ricardo Forno suggests in
>genfunc.e that his NWay sorts could be adapted to this task, but I'm too
>dumb to determine how one would do it.
>
>Anyone?
Andy,
If you aren't wedded to Euphoria for this, I've been using a
16-bit DOS app for this called RPS, written by the late Robert Pirko, for
many years. It's rock solid, very fast and extremely versatile. I chose
it after examining a lot of others. I'm sure it's still around on the net
but if you can't find it, I'll be happy to send it.
Bob
3. Re: External sorting
On Tue, 10 Jun 2003 23:23:12 +0000, Andy Serpa <ac at onehorseshy.com> wrote:
>
>
> Has anyone written anything to do external sorting? That is, sorting
> data in a file where the file is too large to fit entirely in memory?
> [writing out to a new file in sorted order.] Ricardo Forno suggests in
> genfunc.e that his NWay sorts could be adapted to this task, but I'm too
> dumb to determine how one would do it.
>
> Anyone?
>
This takes me back to my IBM mainframe days.
The general idea is to take into RAM as much as you dare, sort that and
output it to a sort work file. You repeat this, creating as many sort work
files as required until all the original (unsorted) data has been
processed. This means you end up with a set of sorted files. Next, you take
two (or more) files at a time and merge them, creating either the final
file or additional work files. You repeat merging the work files until you
have one completely sorted file.
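To make the first phase concrete, here is a minimal Euphoria sketch. It
assumes the records are lines of text sorted as whole lines; the CHUNK
size and the sortwk*.tmp file names are made up for illustration:

include sort.e  -- Euphoria's built-in in-memory sort()

constant CHUNK = 50000  -- lines per run; tune to how much RAM you dare use

-- Phase 1: read the unsorted file a chunk at a time, sort each chunk
-- in RAM, and write it to its own sort work file. Returns the list of
-- work file names for the merge phase.
function make_runs(sequence inname)
    integer fin, fout, nruns
    sequence lines, names, outname
    object line
    fin = open(inname, "r")
    names = {}
    nruns = 0
    line = gets(fin)
    while sequence(line) do  -- gets() returns an atom (-1) at end of file
        lines = {}
        while sequence(line) and length(lines) < CHUNK do
            lines = append(lines, line)
            line = gets(fin)
        end while
        lines = sort(lines)  -- in-memory sort of this chunk
        nruns = nruns + 1
        outname = sprintf("sortwk%d.tmp", nruns)
        fout = open(outname, "w")
        for i = 1 to length(lines) do
            puts(fout, lines[i])
        end for
        close(fout)
        names = append(names, outname)
    end while
    close(fin)
    return names
end function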
There are a few optimisations possible, but this is the general idea.
To diagram this...
[UNSORTED] --> (ram sorting) --> sortwk1, sortwk2, ...sortwkn
[sortwk1] + [sortwk2] --> (merge phase) --> sortmg1
[sortmg1] + [sortwk3] --> (merge phase) --> sortmg2
. . . [sortmgn] + [sortwkn] --> (merge phase) --> final
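And a matching sketch of one merge phase from the diagram, under the same
line-oriented assumptions; merge_pair() and its file-name arguments are
illustrative only:

-- Merge two sorted work files into one. Applied repeatedly, this
-- reduces the set of work files to a single fully sorted output.
procedure merge_pair(sequence name1, sequence name2, sequence outname)
    integer f1, f2, fout
    object a, b
    f1 = open(name1, "r")
    f2 = open(name2, "r")
    fout = open(outname, "w")
    a = gets(f1)
    b = gets(f2)
    while sequence(a) and sequence(b) do
        if compare(a, b) <= 0 then  -- emit the smaller line, refill from that file
            puts(fout, a)
            a = gets(f1)
        else
            puts(fout, b)
            b = gets(f2)
        end if
    end while
    while sequence(a) do  -- drain whichever input still has lines left
        puts(fout, a)
        a = gets(f1)
    end while
    while sequence(b) do
        puts(fout, b)
        b = gets(f2)
    end while
    close(f1)
    close(f2)
    close(fout)
end procedure

Each merge reads its two inputs strictly sequentially and holds only two
lines in RAM at a time, which is why the scheme works no matter how large
the files are.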
--
cheers,
Derek Parnell
4. Re: External sorting
On Thu, 12 Jun 2003 12:23:32 +1000, <mistertrik at hotmail.com> wrote:
>
>
> What about a bucket sort? Open n files for writing, where n is the number
> of buckets you have, and as you read in your data to be sorted, stick it
> in the appropriate bucket. Then, open an output file, and go through each
> bucket in order, sort the contents of that bucket, and append it to the
> output file.
>
> As far as what number you choose 'n' to be, just pick it so that the size
> of each bucket when full will not be more than the amount of memory you
> want to use at a time.
>
> IMO this manner of sorting would require a lot less file I/O than the
> merging sort described earlier... what do you think?
One of the issues with this method is that the bucket files are of varying
length, and any one of them might end up larger than RAM can cope with. Of
course, you could deal with this recursively - that is, for each bucket
file that is too big, reprocess it into smaller buckets.
Another issue is working out what 'n' should be. This is normally done by
knowing the range of values that the key can have, or by knowing how many
records need sorting. This is not always easy. Consider sorting the
telephone book entries of a large city. The range of names is huge. You
might like to split on the first 2/3/4 letters of a name but even that can
give very large bucket files.
The idea, however, can work for some situations.
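For what it's worth, here is a rough Euphoria sketch of the distribution
step, keying on the first letter of each line. The 27-bucket scheme and
the bucket*.tmp names are assumptions; a real key function depends on the
data, and any oversized bucket would need the recursive reprocessing
mentioned above:

constant NBUCKETS = 27  -- 'a'..'z' plus one catch-all bucket

-- map a line to a bucket number based on its first character
function bucket_of(sequence line)
    integer c
    c = line[1]
    if c >= 'A' and c <= 'Z' then
        c = c + ('a' - 'A')  -- fold upper case to lower
    end if
    if c < 'a' or c > 'z' then
        return NBUCKETS      -- digits, punctuation etc. go to the catch-all
    end if
    return c - 'a' + 1
end function

-- Write every line of the input into one of NBUCKETS bucket files.
-- Each bucket is then (hopefully) small enough to sort in RAM and
-- append to the output in bucket order.
procedure distribute(sequence inname)
    integer fin
    sequence fb
    object line
    fb = repeat(0, NBUCKETS)
    for i = 1 to NBUCKETS do
        fb[i] = open(sprintf("bucket%d.tmp", i), "w")
    end for
    fin = open(inname, "r")
    line = gets(fin)
    while sequence(line) do
        puts(fb[bucket_of(line)], line)
        line = gets(fin)
    end while
    close(fin)
    for i = 1 to NBUCKETS do
        close(fb[i])
    end for
end procedure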
--
cheers,
Derek Parnell