The fastest programming language

new topic » topic index » view message » categorize

2. Re: Primes | A Software Drag Race - Dave's Garage

Posted by petelomax Jul 14, 2021
1483 views
Last edited Jul 16, 2021

Results so far collected on this machine, slowest first:

   Python: 14 (version 1, in 10 seconds so 7 really) 
     Wren: 68 
    Julia: 110 (version1) 
    Julia: 111 (version2) 
     Phix: 675 (hll init and main loop, aka "if true then") 
FreeBASIC: 1054 (prime_boolean, the 8of30 one gave incorrect results) 
   Python: 2270 (version 2) 
       Go: 3064 (solution 1) 
   Python: 3524 (solution 3, unfaithful) 
     Phix: 4081 (hll init and asm main loop) 
       Go: 4233 (solution 2) 
     Phix: 5419 (asm init and main loop)

Still not sure about the Docker thing, let's see if anyone else can improve things here first anyway.

~~Actually, I'm pretty sure allocate()+rep movsd/q would be faster, and of course a single peek() afterwards both gets the easier-to-handle (binary) string and is not timed.~~ update: tried that, no gain, in fact it was slower[?].

-- 
-- demo/prime_drag_race.exw 
-- ======================== 
-- 
?"begin" 
--constant limit = 100_000_000 
--constant limit = 10_000_000 
constant limit = 1_000_000 
--constant limit = 100_000 
--constant limit = 10_000 
--constant limit = 1_000 
--constant limit = 100 
--constant limit = 10 
 
constant prime_counts = new_dict({{10, 4}, 
                                  {100, 25}, 
                                  {1000, 168}, 
                                  {10000, 1229}, 
                                  {100000, 9592}, 
                                  {1000000, 78498}, 
                                  {10000000, 664579}, 
                                  {100000000, 5761455}}) 
 
function sieve(integer n) 
    integer bytes = floor((n+15)/16)    -- (round up to whole no of odd-bit bytes) 
                                        -- (eg a limit of 10 ==> 1..15, in 1 byte) 
    string res = repeat('\0',bytes+iff(machine_bits()=32?11:23)) 
    -- initialise with every 3rd bit set (strike 3/9/15 etc) 
    --  done in blocks of 12/24 bytes, hence padding above 
--  if true then                        -- use slow hll version 
    if false then                       -- use fast asm version 
        for i=1 to length(res) by 3 do 
            res[i]   = 0b1001_0010 
            res[i+1] = 0b0010_0100 
            res[i+2] = 0b0100_1001 
        end for 
    else 
        -- For the asm version of the above I used a "block size" of 12/24, 
        -- simply because that fits comfortably in the available registers. 
        #ilASM{ 
            [32] 
                mov edi,[res] 
                mov ecx,[bytes] 
                shl edi,2 
                mov eax,0b1001_0010_0100_1001_0010_0100_1001_0010 
                mov edx,0b0010_0100_1001_0010_0100_1001_0010_0100 
                mov esi,0b0100_1001_0010_0100_1001_0010_0100_1001 
              @@: 
                mov [edi],eax 
                mov [edi+4],edx 
                mov [edi+8],esi 
                lea edi,[edi+12] 
                sub ecx,12 
                jg @b 
            [64] 
                -- Here we use a bit of bit-fiddling in preference to lots 
                -- of massive literal constants, because it is easy enough. 
                -- Of course you could do much the same in the 32-bit code. 
                -- Note we should avoid setting the low-dword of rax to a 
                -- signed value, since that fills the high-dword with 1s. 
                mov rax,0b0100_1001_0010_0100_1001_0010_0100_1001 
                mov rdi,[res] 
                mov rdx,rax 
                shl rax,1 
                mov rcx,[bytes] 
                shl rdx,34 
                shl rdi,2 
                add rax,rdx 
                lea rdx,[rbx+rax*4+1] 
                lea rsi,[rbx+rax*2] 
              @@: 
                mov [rdi],rax 
                mov [rdi+8],rdx 
                mov [rdi+16],rsi 
                lea rdi,[rdi+24] 
                sub rcx,24 
                jg @b 
              } 
    end if 
    -- then clear 3rd and strike 1st bit too. 
    res[1] = 0b1001_0001 
    -- 
    -- now strike 25,35,45...       5^2, {+5*2} 
    --       then 49,63,77...       7^2, {+7*2} 
    --       then 121,143,165...    11^2, {+11*2} 
    --       then 169,195,221...    13^2, {+13*2} 
    --       then 289,323,357...    17^2, {+17*2} 
    -- 
    integer p = 5,  -- (first prime to test) 
            p2 = p*p, -- (squared) 
            pb = 2,   -- (bit index for 5) 
            mask = 0b0100  -- (mask for 5) 
    while p2<=n do 
--      if true then                        -- use slow hll version 
        if false then                       -- use fast asm version 
            integer bdx = floor(pb/8)+1, 
                    byte = res[bdx] 
            if and_bits(byte,mask)=0 then 
                for k2=p2 to n by p*2 do 
                    bdx = floor(k2/16)+1 
                    byte = res[bdx] 
                    integer kl = remainder(k2,16) 
                    integer mask2 = power(2,floor(kl/2)) 
                    byte = or_bits(byte,mask2) 
                    res[bdx] = byte 
                end for 
            end if 
        else 
            -- Obviously this would be much harder to write without the above 
            -- to refer to and check against, likewise having a printable kl 
            -- etc in the above loop helps (vs. doing the calculation inline). 
            -- Most users bother not with this sort of thing, for good reason. 
            -- In truth, #ilASM{} is for self-hosting, not really performance, 
            -- and there never was any remote intention for it be easy to use. 
            #ilASM{ 
                    [32] 
                        mov ecx,[pb] 
                        mov esi,[res] 
                        shr ecx,3                   -- bdx := floor(bp/8) 
                        mov edx,[mask] 
                        mov eax,[p2]                -- k2 
--                      test byte[esi*4+ecx],dl     -- byte,mask -- (needs 1.0.1) 
                        test [esi*4+ecx],edx        -- byte,mask -- (works too) 
                        jne :nextp 
                          ::by2p 
                            mov ecx,eax             -- kl := k2 
                            mov edi,eax             -- bdx := k2 
                            and ecx,15              -- kl := remainder(k2,16) 
                            mov edx,1               -- mask2 := 1 
                            shr ecx,1               -- kl := kl/2 
                            shr edi,4               -- bdx := floor(k2/16) 
                            shl edx,cl              -- mask2 := power(2,floor(kl/2)) 
                            mov ecx,[p] 
                            or byte[esi*4+edi],dl 
                            lea eax,[eax+ecx*2]     -- k2 += p*2 
                            cmp eax,[n] 
                            jle :by2p 
                    [64] 
                        mov rcx,[pb] 
                        mov rsi,[res] 
                        shr rcx,3                   -- bdx := floor(bp/8) 
                        mov rdx,[mask] 
                        mov rax,[p2]                -- k2 
--                      test byte[rsi*4+rcx],dl     -- byte,mask -- (needs 1.0.1) 
                        test [rsi*4+rcx],rdx        -- byte,mask -- (works too) 
                        jne :nextp 
                          ::by2p 
                            mov rcx,rax             -- kl := k2 
                            mov rdi,rax             -- bdx := k2 
                            and rcx,15              -- kl := remainder(k2,16) 
                            mov dl,1                -- mask := 1 
                            shr rcx,1               -- kl := kl/2 
                            shr rdi,4               -- bdx := floor(k2/16) 
                            shl dl,cl               -- mask := power(2,floor(kl/2)) 
                            mov rcx,[p] 
                            or byte[rsi*4+rdi],dl 
                            lea rax,[rax+rcx*2]     -- k2 += p*2 
                            cmp rax,[n] 
                            jle :by2p 
                    [] 
                      ::nextp 
                  } 
        end if 
        p += 2 
        p2 = p*p 
        pb += 1 
        mask *= 2 
        if mask>#FF then mask /= #100 end if 
    end while 
    res = res[1..bytes] 
    return res -- (or return {res,bytes,n} should that be needed/desired) 
end function 
 
atom t5 = time()+5 
--atom t5 = time()-.05  -- (useful for verifying different sizes) 
integer passes = 0 
string res 
while true do 
    res = sieve(limit) 
    passes += 1 
    if time()>t5 then exit end if 
end while 
atom actual_time = time()-t5+5 
 
integer prime_count = 1 -- (no bit for 2 so count it automatically) 
--sequence primes = repeat(0,limit) -- (neither helps nor hinders) 
--primes[2] = 1                                 -- "" 
integer p = 1 
for i=1 to length(res) do 
    integer ri = res[i] 
    for j=1 to 8 do 
        if even(ri) then 
            if p>limit then exit end if 
            prime_count += 1 
--          primes[p] = 1                       -- "" 
        end if 
        ri = floor(ri/2) 
        p += 2 
    end for 
end for 
puts(1,"\n") 
integer verify = getd(limit,prime_counts) 
if prime_count==verify then 
--  printf(1,"%d primes < %d (verified)\n",{prime_count,limit}) 
    printf(1,"phix;%d;%f;1;algorithm=base,faithful=yes,bits=1\n", {passes, actual_time}) 
--  phix;5419;5.015000;1;algorithm=base,faithful=yes,bits=1 
else 
    printf(1,"**ERROR** %d primes(<%d) should be %d\n",{prime_count,limit,verify}) 
end if 
{} = wait_key()

recap: the above is creating an odd-bit-only sieve bitmap, so res[1] contains the bits for {15,13,11,9,7,5,3,1}, res[2] the bits for {31,29,27,25,23,21,18,17}, etc. (8 bits per byte)

aside: If you don't like the #ilASM{} in the above (who would) you can just delete it and still join in, albeit trying to improve the 675 instead of the 5419.

new topic » goto parent » topic index » view message » categorize

3. Re: Primes | A Software Drag Race - Dave's Garage

Posted by petelomax Jul 16, 2021
1333 views

I've had another idea, or rather extended one I had above. Instead of setting every third bit I initialised 24 bits aka 3 bytes and replicated that. The 24 is lcm(3,8), though since we are dealing with primes we can just multiply them and since we are dealing with bytes we can just forget the 8. We could do much the same for 5: initialise lcm(5,24) = 120 bits or 15 bytes and replicate that. Likewise lcm(7,120) = 840 bits or 105 bytes. As we extend by another (p-1) blocks we can copy the prior bits, setting any new bits as we go. So the early passes for small primes can terminate early and rely on higher primes to replicate their bit pattern. Obviously the main loop guarantees the sieve always gets properly filled all the way to the end.

As ever there is a catch or in this case two. First, you need a pristine (copy of) earlier bytes and I suggest you can do that best by setting, for eg multiples of 5, bytes 4..15 first and only then bytes 1..3. Second, for the small primes {3,5,7,11,13} (or do I mean just {3,5,7}?) you will never skip a byte but you may set >1 bit in the same byte, and you need to avoid re-fetching the pristine byte when you do so, whereas when setting multiples of 17 and up you may skip bytes but will never touch the same byte twice, so for that I suggest breaking the main loop into two phases, one for <=13 and one for >=17. (erm, it might be <=7 and >=11 since we're doing odd-bits only)

It would greatly surprise me if non-one else has ever had the same idea on such an age-old task, or maybe it is just a subtly different way to implement a 3-5-7 wheel.

UPDATE: Doh. The pristine/replicate bytes want bits 3,5,7 etc set whereas the final result obviously does not. So it would need to be kept separately and beyond a certain size the overhead of updating two copies would start to outweigh any possible savings... Then again, looking at the numbers:

1 * 3 = 3 
3 * 5 = 15 
15 * 7 = 105 
105 * 11 = 1,155 
1155 * 13 = 15,015 
15015 * 17 = 255,255 
255255 * 19 = 4,849,845 
4849845 * 23 = 111,546,435 
111546435 * 29 = 3,234,846,615 
3234846615 * 31 = 100,280,245,065

A 1-million sieve would easily be filled on the 6th pass (the length of res for 1,000,000 is 62,500), after/during which we can leave any such copy well alone.
In the critical timed case the size of the copy need not exceed 25% of the result, and in fact "pristine pristine" could theoretically fit in just two bytes[??].

OpenEuphoria

1. Primes | A Software Drag Race - Dave's Garage

The fastest programming language

2. Re: Primes | A Software Drag Race - Dave's Garage

3. Re: Primes | A Software Drag Race - Dave's Garage

Search

Include:

Quick Links

User menu

Misc Menu