Re: Optimizing basic operations

new topic     » goto parent     » topic index » view thread      » older message » newer message

I'm also not sure if I understand what you're trying but it sounds like you want to know the top 'X' scores from a potentially huge set of data records. Here is one way of doing that does not involve much sorting at all. I've tested with a billion records and it takes about two minutes and does around 1500 'sorts' (binary search insertions actually).

 
constant MAXSCORE   = 100_000_000   -- Largest key possible 
constant SAMPLESIZE = 1_000_000_000 -- The number of samples to check 
constant TOPCOUNT   = 100           -- Only this amount are retained. 
 
sequence Tops 
sequence NewItem 
 
 
-- The first item in SList is the new item and it is to be placed inside 
-- SList such that the sequence remains sorted in ascending order. 
-- The requirement is that SList[2..$] is already sorted. 
-- It returns the input with the new item in the correct place. 
 
function insert_one(sequence SList) 
    sequence NewData 
    integer  NewScore 
    integer  min 
    integer  max 
    integer  pos 
 
    NewData = SList[1] -- Copy the new data 
    NewScore = NewData[1] -- Get the new key now for performance 
     
    -- Initialize the searching indexes. 
    pos = 2 
    min = 2 
    max = length(SList) 
     
    while min <= max do 
        pos = floor((min + max) / 2) 
         
        switch eu:compare(NewScore,  SList[pos][1]) do 
         
            case 0 then -- The new score already exists in the list 
                pos -= 1 
                min = max + 1   -- Force loop to end. 
         
            case -1 then 
                pos -= 1 
                max = pos 
                 
            case 1 then 
                min = pos + 1 
        end switch 
    end while 
 
    SList[1 .. pos-1] = SList[2 .. pos] -- Shift existing items down 
    SList[pos] = NewData    -- Put new data in right spot. 
 
    return SList 
end function 
 
--- Perform simulation --  
 
integer Lowest 
Tops = { {rand(MAXSCORE), "some data" , 0} } -- Get first score record. 
Lowest = Tops[1][1] -- For performance, get the lowest value so far. 
 
for i = 1 to SAMPLESIZE do 
    NewItem = {rand(MAXSCORE), "some data" , i} -- Get next score record 
     
    if NewItem[1] <= Lowest then 
        -- The new score is lower than any I know about so far. 
        if length(Tops) = TOPCOUNT then 
            -- But since it's lower than the top 'X' scores, I don't need it. 
            continue    -- Discard new item. This is by far the most common case. 
        end if 
        -- The new score becomes the list's new lowest score. 
        Tops = prepend(Tops, NewItem) 
    else 
        -- The new score is large than the lowest so it needs to go into the 
        -- top 'X' scores. 
        if length(Tops) < TOPCOUNT then 
            -- As the list isn't full yet, I can just add it in. 
            Tops = prepend(Tops, NewItem) 
        else 
            -- The list is full so I need to remove the current lowest and 
            -- insert the new data. 
            Tops[1] = NewItem 
        end if 
         
        Tops = insert_one(Tops) -- This rearranges the list so that the 
                                -- first item is moved to the correct 
                                -- sorted position. 
    end if 
     
    Lowest = Tops[1][1] -- For performance, get the new lowest value so far. 
end for 
new topic     » goto parent     » topic index » view thread      » older message » newer message

Search



Quick Links

User menu

Not signed in.

Misc Menu