Re: Optimizing basic operations
- Posted by DerekParnell (admin) Aug 03, 2013
- 5781 views
I'm also not sure if I understand what you're trying but it sounds like you want to know the top 'X' scores from a potentially huge set of data records. Here is one way of doing that does not involve much sorting at all. I've tested with a billion records and it takes about two minutes and does around 1500 'sorts' (binary search insertions actually).
constant MAXSCORE = 100_000_000 -- Largest key possible constant SAMPLESIZE = 1_000_000_000 -- The number of samples to check constant TOPCOUNT = 100 -- Only this amount are retained. sequence Tops sequence NewItem -- The first item in SList is the new item and it is to be placed inside -- SList such that the sequence remains sorted in ascending order. -- The requirement is that SList[2..$] is already sorted. -- It returns the input with the new item in the correct place. function insert_one(sequence SList) sequence NewData integer NewScore integer min integer max integer pos NewData = SList[1] -- Copy the new data NewScore = NewData[1] -- Get the new key now for performance -- Initialize the searching indexes. pos = 2 min = 2 max = length(SList) while min <= max do pos = floor((min + max) / 2) switch eu:compare(NewScore, SList[pos][1]) do case 0 then -- The new score already exists in the list pos -= 1 min = max + 1 -- Force loop to end. case -1 then pos -= 1 max = pos case 1 then min = pos + 1 end switch end while SList[1 .. pos-1] = SList[2 .. pos] -- Shift existing items down SList[pos] = NewData -- Put new data in right spot. return SList end function --- Perform simulation -- integer Lowest Tops = { {rand(MAXSCORE), "some data" , 0} } -- Get first score record. Lowest = Tops[1][1] -- For performance, get the lowest value so far. for i = 1 to SAMPLESIZE do NewItem = {rand(MAXSCORE), "some data" , i} -- Get next score record if NewItem[1] <= Lowest then -- The new score is lower than any I know about so far. if length(Tops) = TOPCOUNT then -- But since it's lower than the top 'X' scores, I don't need it. continue -- Discard new item. This is by far the most common case. end if -- The new score becomes the list's new lowest score. Tops = prepend(Tops, NewItem) else -- The new score is large than the lowest so it needs to go into the -- top 'X' scores. if length(Tops) < TOPCOUNT then -- As the list isn't full yet, I can just add it in. Tops = prepend(Tops, NewItem) else -- The list is full so I need to remove the current lowest and -- insert the new data. Tops[1] = NewItem end if Tops = insert_one(Tops) -- This rearranges the list so that the -- first item is moved to the correct -- sorted position. end if Lowest = Tops[1][1] -- For performance, get the new lowest value so far. end for