1. External sorting

Has anyone written anything to do external sorting?  That is, sorting 
data in a file where the file is too large to fit entirely in memory? 
[writing out to a new file in sorted order.]  Ricardo Forno suggests in 
genfunc.e that his NWay sorts could be adapted to this task, but I'm too 
dumb to determine how one would do it.

Anyone?


2. Re: External sorting


At 11:23 PM 6/10/03 +0000, you wrote:

>
>
>Has anyone written anything to do external sorting?  That is, sorting
>data in a file where the file is too large to fit entirely in memory?
>[writing out to a new file in sorted order.]  Ricardo Forno suggests in
>genfunc.e that his NWay sorts could be adapted to this task, but I'm too
>dumb to determine how one would do it.
>
>Anyone?

Andy,

         If you aren't wedded to Euphoria for this, I've been using a 
16-bit DOS app for this called RPS, written by the late Robert Pirko, for 
many years.  It's rock solid, very fast and extremely versatile.  I chose 
it after examining a lot of others.  I'm sure it's still around on the net 
but if you can't find it, I'll be happy to send it.

                         Bob



3. Re: External sorting

On Tue, 10 Jun 2003 23:23:12 +0000, Andy Serpa <ac at onehorseshy.com> wrote:

>
>
> Has anyone written anything to do external sorting?  That is, sorting 
> data in a file where the file is too large to fit entirely in memory? 
> [writing out to a new file in sorted order.]  Ricardo Forno suggests in 
> genfunc.e that his NWay sorts could be adapted to this task, but I'm too 
> dumb to determine how one would do it.
>
> Anyone?
>

This takes me back to my IBM mainframe days.

The general idea is to take into RAM as much as you dare, sort that and 
output it to a sort work file. You repeat this, creating as many sort work 
files as required until all the original (unsorted) data has been 
processed. This means you end up with a set of sorted files. Next, you take 
two (or more) files at a time and merge them, creating either the final 
file or additional work files. You repeat merging the work files until you 
have one completely sorted file.

There are a few possible optimisations, but this is the general idea.

To diagram this...

  [UNSORTED] --> (ram sorting) --> sortwk1, sortwk2, ...sortwkn

  [sortwk1] + [sortwk2] --> (merge phase) --> sortmg1
  [sortmg1] + [sortwk3] --> (merge phase) --> sortmg2
  . . . [sortmgn] + [sortwkn] --> (merge phase) --> final
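
The two phases above can be sketched in code. This is only an
illustration in Python (not Euphoria, and not anyone's posted code):
it assumes the records are text lines sorted by their whole content,
and that `max_lines` lines comfortably fit in RAM. The run files stand
in for sortwk1..sortwkn, and the merge phase uses a k-way merge over
all runs at once instead of merging them two at a time.

```python
import heapq
import os
import tempfile

def external_sort(in_path, out_path, max_lines=100_000):
    """Sort a large text file: sorted runs in RAM, then a k-way merge."""
    run_paths = []
    # Phase 1: read as much as we dare, sort it, write a sort work file.
    with open(in_path) as src:
        while True:
            chunk = [line for _, line in zip(range(max_lines), src)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)
    # Phase 2: merge all run files into the final file.
    # heapq.merge streams lazily, so only one line per run is in RAM.
    runs = [open(p) for p in run_paths]
    try:
        with open(out_path, "w") as dst:
            dst.writelines(heapq.merge(*runs))
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)
```

Merging all runs in one pass works as long as you can hold one open
file handle per run; the pairwise merging described above trades extra
I/O for a fixed number of open files.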

-- 
cheers,
Derek Parnell


4. Re: External sorting

On Thu, 12 Jun 2003 12:23:32 +1000, <mistertrik at hotmail.com> wrote:

>
>
> What about a bucket sort? Open n files for writing, where n is the number 
> of buckets you have, and as you read in your data to be sorted, stick it 
> in the appropriate bucket. Then, open an output file, and go through each 
> bucket in order, sort the contents of that bucket, and append it to the 
> output file.
>
> As far as what number you choose 'n' to be, just pick it so that the size 
> of each bucket when full will not be more than the amount of memory you 
> want to use at a time.
>
> IMO this manner of sorting would require a lot less file I/O than the 
> merging sort described earlier... what do you think?

One of the issues with this method is that the bucket files are of varying 
length, and any one of them might end up larger than RAM can cope 
with. Of course, you could deal with this recursively - that is, for 
each bucket file that is too big, reprocess it into smaller buckets.

Another issue is working out what 'n' should be. This is normally done by 
knowing the range of values that the key can have, or by knowing how many 
records need sorting. This is not always easy. Consider sorting the 
telephone book entries of a large city. The range of names is huge. You 
might like to split on the first 2/3/4 letters of a name but even that can 
give very large bucket files.

The idea, however, can work for some situations.
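
For situations where it does work, the bucket approach can be sketched
as follows. This is a hedged Python illustration, not posted code: it
assumes line records keyed on their first byte, and it picks bucket
boundaries by carving the 0-255 byte range into `n_buckets` ordered
slices - exactly the "split on the first letter" idea above, with the
same caveat that skewed keys can still give oversized buckets.

```python
import os

def bucket_of(line, n_buckets):
    """Map the first byte of a line into one of n_buckets ordered ranges."""
    first = line[0] if line else "\0"
    return min(ord(first) * n_buckets // 256, n_buckets - 1)

def bucket_sort_file(in_path, out_path, n_buckets=16):
    """Partition lines into bucket files by key range, then sort each
    bucket in memory and append it to the output in bucket order."""
    paths = [f"bucket{i:03d}.tmp" for i in range(n_buckets)]
    buckets = [open(p, "w") for p in paths]
    try:
        # One pass over the unsorted data: route each line to its bucket.
        with open(in_path) as src:
            for line in src:
                buckets[bucket_of(line, n_buckets)].write(line)
    finally:
        for b in buckets:
            b.close()
    # Buckets are disjoint, ordered key ranges, so sorting each bucket
    # and concatenating them in order yields a fully sorted file.
    with open(out_path, "w") as dst:
        for p in paths:
            with open(p) as b:
                lines = b.readlines()
            lines.sort()
            dst.writelines(lines)
            os.remove(p)
```

Note the I/O saving claimed above: each record is read twice and
written twice in total, versus potentially many merge passes - but only
if every bucket fits in memory, which is the hard part to guarantee.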

-- 

cheers,
Derek Parnell
