1. Compressed Linetables

Just thought I'd post a fairly simple mini challenge for anyone who fancies it:

I use a LineTab to map machine code offsets to a source code line number. It is a simple flat sequence of integers, with negative values representing a "skip" (most will be -1..-25, but it should cope with -1073741824 or thereabouts) and positive values representing an ever-increasing offset (an upper limit of 1073741823 would be more than enough). Some of these are quite large, one just spotted was #334 (820) entries long, so it is about time that I considered compressing them somehow. Speed is not a significant issue. Here is that 820-entry-long one as an example:

    {-637,0,-23,5,6,-2,12,13,14,-71,20,-78,21,27,29,-2,31,35,37,39,44,46,-1,48,53,58,60,66,68,71,74,75,81,91,93,98, 
     100,-1,110,116,117,-3,123,-1,124,-42,132,-11,133,136,142,146,152,158,161,-1,163,169,172,175,178,180,183,185, 
     192,203,209,220,226,227,-1,228,-3,234,238,242,245,248,-1,252,-55,258,-4,263,264,266,269,275,-1,281,283,285,287, 
     293,298,302,306,-1,308,310,-2,312,314,-1,318,321,323,326,332,338,340,-33,342,-2,344,-5,347,350,353,357,359,361, 
     365,367,369,-1,372,376,-1,378,381,383,385,-1,388,392,395,397,401,405,-33,407,-4,412,413,416,420,423,426,430,437, 
     443,447,450,452,454,458,460,-1,463,-21,467,-14,470,473,477,481,-3,485,486,487,488,493,497,-23,499,501,-6,502,504, 
     509,-1,514,521,523,525,532,533,534,535,-7,536,541,542,543,544,-1,545,550,552,-1,559,562,564,567,569,-41,573,-9, 
     574,575,-10,576,-1,577,-4,583,584,590,-26,592,-1,594,-1,599,607,-11,613,619,622,628,635,637,644,646,647,-1,648, 
     -24,654,-1,659,661,-2,667,671,675,679,683,-1,687,-17,693,-3,695,699,701,703,710,716,-8,720,-3,722,723,-1,724,-1, 
     725,731,737,740,745,-1,750,752,754,756,759,-1,761,762,-23,763,767,769,775,782,784,790,792,795,798,-4,800,-1,806, 
     810,812,814,819,821,828,829,-1,830,-3,836,841,-8,843,-2,848,-45,854,-3,856,862,866,-7,869,-3,874,-1,880,881,882, 
     889,900,906,913,919,920,-1,921,-20,927,931,-4,933,-2,935,938,942,-7,946,-9,952,953,954,956,-1,958,961,964,970, 
     976,979,981,987,993,999,1002,1008,-19,1009,-3,1014,1016,1019,-6,1022,-3,1028,1032,-1,1035,1038,1040,1045,-19, 
     1050,1056,1057,1058,1061,1068,1079,1085,1092,1098,1099,1100,-1,1101,-22,1107,1111,1116,1118,-1,1120,1124,1127,-1, 
     1131,-2,1135,-20,1141,1142,1143,1145,1147,1150,-13,1151,-7,1157,-1,1158,1164,-2,1167,1171,1174,1177,-1,1180,-13, 
     1183,-3,1188,1189,1190,1191,1192,1199,1206,1212,1219,1225,1232,1238,1239,-1,1240,-32,1246,1253,1259,1266,1273,-8, 
     1277,-12,1283,1287,1289,-19,1295,-1,1300,1301,1302,1303,-12,1304,-5,1309,1312,1315,1317,1319,1323,1325,1327,-1, 
     1330,1337,1343,1347,-1,1349,1352,1354,1356,-32,1359,1363,-3,1366,1369,1371,1373,1376,1379,1385,1389,1392,1395, 
     1399,1405,1409,1413,-41,1417,-3,1419,1425,1428,1431,1434,1440,-12,1446,-7,1447,-1,1450,1456,1462,1468,1474,1477, 
     1482,-1,1487,-1,1489,1491,1493,1496,-2,1498,1504,1507,-24,1513,1515,1516,-4,1517,-11,1523,1524,1527,-4,1530,-2, 
     1532,1533,1534,1535,1536,1537,1538,1543,-10,1546,-3,1547,1554,1560,1567,1574,1581,1587,1594,1600,1601,-1,1602,-22, 
     1608,1612,1615,1618,1621,1624,1627,1630,-10,1633,-10,1639,-1,1642,-7,1643,-1,1645,1651,1652,1655,1660,1662,1668, 
     1672,1674,1677,1680,1683,1690,1694,1701,-32,1702,-7,1703,-1,1705,1711,1712,1716,1721,1723,1729,1731,1735,1738, 
     1741,1744,1747,1754,1758,1765,-35,1766,-17,1767,-1,1768,1771,1774,1777,1780,1782,1785,1788,1791,1793,-1,1795,1796, 
     1798,1800,1802,1804,1806,1808,1814,1816,1821,1823,-1,1828,-38,1829,-3,1830,1836,1838,1842,1847,1849,1852,1854, 
     1857,1858,-1,1859,1862,1864,1869,-1,1871,-1,1875,1880,1885,1887,1893,1896,1899,1905,1908,1915,1917,1920,-38,1921, 
     -15,1922,1925,1932,1934,-1,1935,-1,1942,-1,1945,-4,1948,1949,-23,1952,1956,-2,1958,1962,-1,1968,1970,1975,1980, 
     -13,1981,-3,1983,1987,-1,1993,1996,1999,-10,2004,-4,2005,2009,-1,2015,-1,2022,-1,2025,-1,2027,-14,2030,-1,2032, 
     2035,2038,2041,2044,2047,2048,2053,2054,2056,2058,2060,2062,-23,2064,2067,2069,2075,2077,2082,2085,2087,-12,2094, 
     -1,2096,2097,2104,-1,2107,-2,2110,-2,2112,2113,-16,2120,-3,2124,-3,2126,2130,-1,2136,2137,2140,2145,2150,-17,2151, 
     -2,2153,-3,2156,-3,2158,2162,-1,2168,-2,2169,2172,2175,2180,-19,2181,-4,2186,2190,-1,2196,2199,2202,-9,2204,-143, 
     2209,2214} 
What that means:
    offset 0 corresponds to line 638        (ie -(-637)+1)                      aka -LineTab[1]+(LineTab[2]>=0) 
    offset 5 corresponds to line 662        (ie -(-637)+1-(-23)+1)              aka -[1]+([2]>=0)+-[3]+([4]>=0) 
    offset 6 corresponds to line 663        (ie       ""          +1)           aka         "" +([5]>=0) 
    offset 12 corresponds to line 666...    (ie         ""           -(-2)+1)   aka         "" +-[6]+([7]>=0) 
To convert an offset to a source line number (which I only need to do when the app crashes) I simply start at the top and add up line numbers until I get to the required offset, not that you really needed to understand that. It is reasonably compact, but 3280 bytes could clearly be improved on, something nearer to 900 should be quite possible, and I would be mightily impressed if someone could jam that lot into 100 or so bytes!

A full-on LZW or whatever approach feels like overkill and could quite probably become intermittently error-prone. The compress() routine from database.e seems like a reasonable starting point (or something entirely different if you fancy), but that "ever-increasing offset" looks like a few "reset base" in the middle or suchlike should help even more. Assuming you need one every 256 bytes, a finger in the air says that saves ~820*0.6*3-50=1426 bytes. Equally (again, not trying to push a solution on anyone) I could see that -637 held as {-128,-128,-128,-128,-125}. But that's enough of my daft half-baked ideas!

The challenge, should anyone be at all interested, is to pack that into the shortest sequence of bytes, and of course unpack said bytes to obtain the original flawlessly.

Bonus points: convert an offset into a line number without uncompressing.

new topic     » topic index » view message » categorize

2. Re: Compressed Linetables

This reminds me of some interview question of how to pack a list of 1 million sorted numbers that range from 1 to 10 million into the smallest possible data structure. The solution is to store the difference between each adjacent number using arithmetic coding. I wonder if a similar approach could be used here, with two lists: offsets and line numbers.

new topic     » goto parent     » topic index » view message » categorize

3. Re: Compressed Linetables

PeteE said...

The solution is to store the difference between each adjacent number using arithmetic coding.

Oh yeah, I could easily store the number of bytes each line generates instead of the offset. 0 would become invalid, and the percentage of -128..127 numbers should hit the pretty high 90s. Nice idea, thanks.

PeteE said...

I wonder if a similar approach could be used here, with two lists: offsets and line numbers.

Not sure I get that part

new topic     » goto parent     » topic index » view message » categorize

4. Re: Compressed Linetables

petelomax said...
PeteE said...

I wonder if a similar approach could be used here, with two lists: offsets and line numbers.

Not sure I get that part

I didn't think using negative numbers for line number offsets would work very well with arithmetic coding, so I was thinking two lists of positive numbers would be better. I've always wanted to implement arithmetic coding, so I'll have a stab at it when I get home from work.

new topic     » goto parent     » topic index » view message » categorize

5. Re: Compressed Linetables

PeteE said...

I didn't think using negative numbers for line number offsets would work very well with arithmetic coding, so I was thinking two lists of positive numbers would be better. I've always wanted to implement arithmetic coding, so I'll have a stab at it when I get home from work.

As long as you are aware that the above table is

   {skip,code, 
    skip,code,code, 
    skip,code,code,code, 
    skip,code, 
    skip,code,code,code, 
    skip,... 
There's no pattern, other than someone's coding/spacing/commenting style, plus any lines the compiler could completely optimise away. There may, in fact, not be any skips at all, but if there are present, they are never consecutive but instead always separated by one or more code offsets(/sizes). If you split it into two lists it might be difficult if not impossible to recombine them correctly.

Pete

new topic     » goto parent     » topic index » view message » categorize

6. Re: Compressed Linetables

petelomax said...

As long as you are aware that the above table is

   {skip,code, 
    skip,code,code, 
    skip,code,code,code, 
    skip,code, 
    skip,code,code,code, 
    skip,... 
There's no pattern, other than someone's coding/spacing/commenting style, plus any lines the compiler could completely optimise away. There may, in fact, not be any skips at all, but if there are present, they are never consecutive but instead always separated by one or more code offsets(/sizes). If you split it into two lists it might be difficult if not impossible to recombine them correctly.

I understand, but you could also write it this way, given that multiple code offsets mapped to the same line would be the same as a skip of zero, resulting in this:

   {skip,code, 
    skip,code,0,code, 
    skip,code,0,code,0,code, 
    skip,code, 
    skip,code,0,code,0,code, 
    skip,... 
which could then be split into two lists of the same length. But doing it this way is just an implementation detail that doesn't really matter to your challenge. You just want to see an encoder that takes your table and outputs some compressed data that later be decoded and queried to get a line number given a machine code offset.

I've been working on the arithmetic coding, and I'm starting to believe it's way overkill, but at least it's been a learning experience. I'll post it when I have the decoder working.

new topic     » goto parent     » topic index » view message » categorize

7. Re: Compressed Linetables

Here's a first attempt:

 
-- use bit precision that fit euphoria's integer type 
constant 
    bit_precision = 30, 
    limit = power(2, bit_precision) - 1, 
    bit1 = power(2, bit_precision - 1) 
 
-- sum a sequence 
function sum(sequence s) 
    atom result 
    result = 0 
    for i = 1 to length(s) do 
        result += s[i] 
    end for 
    return result 
end function 
 
-- generate look-up-table of weights, normalized to bit precision 
function generate_lut(sequence s) 
    sequence lut 
    integer x 
    atom sum_s 
    -- scale weights from 0 to bit limit 
    sum_s = sum(s) 
    lut = repeat(0, length(s)) 
    -- change independent weights to increasing ranges 
    x = 0 
    for i = 1 to length(lut) do 
        lut[i] = x 
        x += floor(s[i] * limit / sum_s) 
    end for 
    return lut 
end function 
 
-- add an integer x to a binary sequence bits 
function badd(sequence bits, atom x) 
    integer c, s 
    atom j 
    j = 1 
    c = 0 
    for i = length(bits) to 1 by -1 do 
        s = bits[i] + (and_bits(x, j) != 0) + c 
        c = s > 1 
        bits[i] = and_bits(s, 1) 
        j += j 
    end for 
    if c then 
        printf(1, "badd overflow! x=%d bits=", {x}) ? bits 
    end if 
    return bits 
end function 
 
-- return an integer representing the binary difference of x and y 
function bdif(sequence x, sequence y) 
    integer c, s 
    atom j, result 
    j = 1 
    c = 0 
    result = 0 
    for i = length(x) to 1 by -1 do 
        if i <= length(y) then 
            s = y[i] 
        else 
            s = 0 
        end if 
        s -= x[i] + c 
        c = s < 0 
        result += j * and_bits(s, 1) 
        j += j 
    end for 
    return result * (1 - 2 * c) 
end function 
 
-- binary search between l range d for a symbol in lut that matches code 
function bsearch(sequence l, integer d, sequence code, sequence lut) 
    integer a, b, mid, x 
    sequence tmp 
 
    a = 1 
    b = length(lut) - 1 
 
    while a != b do 
        mid = floor((a + b + 1) / 2) 
        x = lut[mid] 
        tmp = badd(l, floor(x * d / (limit+1))) 
        if bdif(tmp, code) < 0 then 
            b = mid - 1 
        else 
            a = mid 
        end if 
    end while 
    return a 
end function 
 
-- arithmetic encode a list of symbols, given a weighted symbol table 
function arith_encode(sequence list, sequence syms, sequence weights) 
    sequence lut, result, u, l 
    integer sym, a, b, d 
 
    lut = generate_lut(weights) & limit 
    l = repeat(0, bit_precision) 
    u = repeat(1, bit_precision) 
    result = {} 
 
    for i = 1 to length(list) do 
        -- get the symbol index from the list 
        sym = find(list[i], syms) 
        if sym = 0 then 
            puts(1, "symbol not found: ") ? list[i] 
            return 0 
        end if 
 
        -- find the difference of the upper and lower range 
        d = bdif(l, u) 
 
        -- scale up d so that we have more precision for bounds 
        while d < bit1 and length(l) < bit_precision do 
            d += d 
            l &= 0 
        end while 
        --printf(1, "sym=%d d=%d\n", {sym, d}) 
 
        -- get the symbol bounds from look-up-table 
        a = lut[sym] 
        b = lut[sym+1] 
 
        -- scale the symbol bounds by the difference 
        a = floor(a * d / (limit+1)) 
        b = floor(b * d / (limit+1)) 
 
        -- create the new range from the symbol bounds 
        u = badd(l, b) 
        l = badd(l, a) 
 
        -- last item? 
        if i = length(list) then 
            -- figure out how many bits we need to flush 
            u = badd(l, floor((b - a) / 2)) 
            for j = 1 to length(l) do 
                if l[j] != u[j] then 
                    l[j] = u[j] 
                    if j < length(l) then 
                        l[j+1] = 1 - u[j+1] 
                    end if 
                    exit 
                end if 
            end for 
        end if 
 
        -- leading bits that are the same won't change, 
        -- and can be moved to the result. 
        while length(l) and l[1] = u[1] do 
            result &= l[1] 
            l = l[2..$] 
            u = u[2..$] 
        end while 
        -- TODO: this will output bytes instead of bits 
        while 0 and length(l) > 8 and equal(l[1..8], u[1..8]) do 
            result &= l[1] + l[2]*2 + l[3]*4 + l[4]*8 + 
                l[5]*16 + l[6]*32 + l[7]*64 + l[8]*128 
            l = l[9..$] 
            u = u[9..$] 
        end while 
 
    end for 
 
    return result 
end function 
 
-- arithmetic decode a coded sequence, given a weighted symbol table 
function arith_decode(sequence code, sequence syms, sequence weights) 
    sequence lut, result, u, l 
    integer sym, a, b, len, d 
 
    lut = generate_lut(weights) & limit 
    l = repeat(0, bit_precision) 
    u = repeat(1, bit_precision) 
if 0 then 
    -- convert bytes to bits 
    result = repeat(0, 8 * length(code)) 
    for i = 1 to length(code) do 
        d = 1 
        for j = 7 to 0 by -1 do 
            result[i*8-j] = and_bits(code[i], d) != 0 
            d += d 
        end for 
    end for 
    code = result 
    ? code 
end if 
    result = {} 
 
    len = length(code) - 1 
 
    while len do 
 
        -- find the difference of the upper and lower range 
        d = bdif(l, u) 
 
        -- scale up d so that we have more precision for bounds 
        while d < bit1 and length(l) < bit_precision do 
            d += d 
            l &= 0 
        end while 
 
        -- do a binary search of the lut to find the next symbol index 
        sym = bsearch(l, d, code, lut) 
 
        -- save the symbol in the results 
        result = append(result, syms[sym]) 
        --printf(1, "sym=%d d=%d\n", {sym, d}) 
 
        -- get the symbol bounds from look-up-table 
        a = lut[sym] 
        b = lut[sym+1] 
 
        -- scale the symbol bounds by the difference 
        a = floor(a * d / (limit+1)) 
        b = floor(b * d / (limit+1)) 
 
        -- create the new range from the symbol bounds 
        u = badd(l, b) 
        l = badd(l, a) 
 
        -- leading bits that are the same won't change, 
        -- and can be removed. 
        while l[1] = u[1] do 
            code = code[2..$] 
            l = l[2..$] 
            u = u[2..$] 
            len -= 1 
        end while 
 
    end while 
 
    return result 
end function 
 
 
sequence code, table, symbols, weights 
 
 
-- symbol mapping: 
-- 0..61 maps to integers 0..61 
-- 62 means add 62 to the next symbol value 
-- 63 means to make the next symbol negative 
symbols = repeat(0, 64) 
for i = 1 to length(symbols) do 
    symbols[i] = i-1 
end for 
 
-- weights are loosely based on the histogram of the table below 
weights = {10,2000,2000,2000,1000,1000,1000,500,50,50,50,50,50,50,50,50,50,50,10, 
10,10,10,10,10,10,5,5,5,5,5,5,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,100,2000} 
 
 
function compress_linetable(sequence table) 
    sequence s 
    integer x, y 
    s = {} 
    y = 0 
     
    -- convert the table to a list of symbols 0..63 
    for i = 1 to length(table) do 
        x = table[i] 
        if x < 0 then 
            s &= 63 -- 63 means negative 
            x = -x 
        else 
            -- calculate delta for offsets 
            x -= y 
            y = table[i] 
        end if 
        while x >= 62 do 
            s &= 62 
            x -= 62 
        end while 
        s &= x 
    end for 
 
    return arith_encode(s, symbols, weights) 
end function 
 
function decompress_linetable(sequence code) 
    sequence table, s 
    integer x, y, neg 
 
    x = 0 
    y = 0 
    neg = 0 
    table = {} 
 
    s = arith_decode(code, symbols, weights) 
 
    for i = 1 to length(s) do 
        if s[i] = 63 then 
            neg = 1 
        elsif s[i] = 62 then 
            x += 62 
        else 
            x += s[i] 
            if neg then 
                table &= -x 
                neg = 0 
            else 
                y += x 
                table &= y 
            end if 
            x = 0 
        end if 
    end for 
 
    return table 
end function 
 
 
 
table= 
    {-637,0,-23,5,6,-2,12,13,14,-71,20,-78,21,27,29,-2,31,35,37,39,44,46,-1,48,53,58,60,66,68,71,74,75,81,91,93,98, 
     100,-1,110,116,117,-3,123,-1,124,-42,132,-11,133,136,142,146,152,158,161,-1,163,169,172,175,178,180,183,185,  
     192,203,209,220,226,227,-1,228,-3,234,238,242,245,248,-1,252,-55,258,-4,263,264,266,269,275,-1,281,283,285,287,  
     293,298,302,306,-1,308,310,-2,312,314,-1,318,321,323,326,332,338,340,-33,342,-2,344,-5,347,350,353,357,359,361,  
     365,367,369,-1,372,376,-1,378,381,383,385,-1,388,392,395,397,401,405,-33,407,-4,412,413,416,420,423,426,430,437,  
     443,447,450,452,454,458,460,-1,463,-21,467,-14,470,473,477,481,-3,485,486,487,488,493,497,-23,499,501,-6,502,504,  
     509,-1,514,521,523,525,532,533,534,535,-7,536,541,542,543,544,-1,545,550,552,-1,559,562,564,567,569,-41,573,-9,  
     574,575,-10,576,-1,577,-4,583,584,590,-26,592,-1,594,-1,599,607,-11,613,619,622,628,635,637,644,646,647,-1,648,  
     -24,654,-1,659,661,-2,667,671,675,679,683,-1,687,-17,693,-3,695,699,701,703,710,716,-8,720,-3,722,723,-1,724,-1,  
     725,731,737,740,745,-1,750,752,754,756,759,-1,761,762,-23,763,767,769,775,782,784,790,792,795,798,-4,800,-1,806,  
     810,812,814,819,821,828,829,-1,830,-3,836,841,-8,843,-2,848,-45,854,-3,856,862,866,-7,869,-3,874,-1,880,881,882,  
     889,900,906,913,919,920,-1,921,-20,927,931,-4,933,-2,935,938,942,-7,946,-9,952,953,954,956,-1,958,961,964,970,  
     976,979,981,987,993,999,1002,1008,-19,1009,-3,1014,1016,1019,-6,1022,-3,1028,1032,-1,1035,1038,1040,1045,-19,  
     1050,1056,1057,1058,1061,1068,1079,1085,1092,1098,1099,1100,-1,1101,-22,1107,1111,1116,1118,-1,1120,1124,1127,-1,  
     1131,-2,1135,-20,1141,1142,1143,1145,1147,1150,-13,1151,-7,1157,-1,1158,1164,-2,1167,1171,1174,1177,-1,1180,-13,  
     1183,-3,1188,1189,1190,1191,1192,1199,1206,1212,1219,1225,1232,1238,1239,-1,1240,-32,1246,1253,1259,1266,1273,-8,  
     1277,-12,1283,1287,1289,-19,1295,-1,1300,1301,1302,1303,-12,1304,-5,1309,1312,1315,1317,1319,1323,1325,1327,-1,  
     1330,1337,1343,1347,-1,1349,1352,1354,1356,-32,1359,1363,-3,1366,1369,1371,1373,1376,1379,1385,1389,1392,1395,  
     1399,1405,1409,1413,-41,1417,-3,1419,1425,1428,1431,1434,1440,-12,1446,-7,1447,-1,1450,1456,1462,1468,1474,1477,  
     1482,-1,1487,-1,1489,1491,1493,1496,-2,1498,1504,1507,-24,1513,1515,1516,-4,1517,-11,1523,1524,1527,-4,1530,-2,  
     1532,1533,1534,1535,1536,1537,1538,1543,-10,1546,-3,1547,1554,1560,1567,1574,1581,1587,1594,1600,1601,-1,1602,-22,  
     1608,1612,1615,1618,1621,1624,1627,1630,-10,1633,-10,1639,-1,1642,-7,1643,-1,1645,1651,1652,1655,1660,1662,1668,  
     1672,1674,1677,1680,1683,1690,1694,1701,-32,1702,-7,1703,-1,1705,1711,1712,1716,1721,1723,1729,1731,1735,1738,  
     1741,1744,1747,1754,1758,1765,-35,1766,-17,1767,-1,1768,1771,1774,1777,1780,1782,1785,1788,1791,1793,-1,1795,1796,  
     1798,1800,1802,1804,1806,1808,1814,1816,1821,1823,-1,1828,-38,1829,-3,1830,1836,1838,1842,1847,1849,1852,1854,  
     1857,1858,-1,1859,1862,1864,1869,-1,1871,-1,1875,1880,1885,1887,1893,1896,1899,1905,1908,1915,1917,1920,-38,1921,  
     -15,1922,1925,1932,1934,-1,1935,-1,1942,-1,1945,-4,1948,1949,-23,1952,1956,-2,1958,1962,-1,1968,1970,1975,1980,  
     -13,1981,-3,1983,1987,-1,1993,1996,1999,-10,2004,-4,2005,2009,-1,2015,-1,2022,-1,2025,-1,2027,-14,2030,-1,2032,  
     2035,2038,2041,2044,2047,2048,2053,2054,2056,2058,2060,2062,-23,2064,2067,2069,2075,2077,2082,2085,2087,-12,2094,  
     -1,2096,2097,2104,-1,2107,-2,2110,-2,2112,2113,-16,2120,-3,2124,-3,2126,2130,-1,2136,2137,2140,2145,2150,-17,2151,  
     -2,2153,-3,2156,-3,2158,2162,-1,2168,-2,2169,2172,2175,2180,-19,2181,-4,2186,2190,-1,2196,2199,2202,-9,2204,-143,  
     2209,2214}  
 
 
code = compress_linetable(table) 
 
table = decompress_linetable(code) 
? table 
 
printf(1, "%d entries -> %d bits\n", {length(table), length(code)}) 
printf(1, "%g bits per entry\n", {length(code) / length(table)}) 
 

Outputs:

821 entries -> 3549 bits 
4.32278 bits per entry 

There's still a problem with the decoder - it doesn't know where to stop, so there ends up being an extra entry in the decoded table. I haven't decided whether to encode the length of the output into the coded bits, or to use a special symbol to indicate when to stop decoding.

new topic     » goto parent     » topic index » view message » categorize

8. Re: Compressed Linetables

Looking good.

FYI, since I am using 32-bit Phix, it needed the following tweaks:

in generate_lut():

--/**/lut = sq_floor(sq_div(sq_mul(s,limit),sum(s)))    --/*    -- Phix 
    lut = floor(s * limit / sum(s))                     --*/    -- OpenEu 

and (also needed on 32-bit OE) in badd():

atom j 

in bdif():

atom j, result 

After that, it was fine.

I had whipped up two small routines:

function offsets_to_bytes_generated(sequence offsets, integer totalsize) 
integer last = 0 
    for i=1 to length(offsets) do 
        if offsets[i]>=0 then 
            if last>0 then 
                offsets[last] = offsets[i]-offsets[last] 
            end if 
            last = i 
        end if 
    end for 
    if last then 
        offsets[last] = totalsize-offsets[last] 
    end if 
    return offsets 
end function 
 
function bytes_generated_to_offsets(sequence bytes_generated) 
integer base = 0, tmp 
    for i=1 to length(bytes_generated) do 
        if bytes_generated[i]>0 then 
            tmp = base 
            base += bytes_generated[i] 
            bytes_generated[i] = tmp 
        end if 
    end for 
    return bytes_generated 
end function 

so in compress_linetable() I removed "calculate delta for offsets", and in decompress_linetable() I replaced y += x table &= y with table &= x. That gave me perfect results, without the extra entry:

820 entries -> 3546 bits 
4.32439 bits per entry 
I will have to study this stuff some more, and can see it taking a while to sink in properly.
Pete

Edit: forgot to mention, I used 2222 as a "fake" totalsize.

new topic     » goto parent     » topic index » view message » categorize

9. Re: Compressed Linetables

Okay, I'm back from a holiday weekend, so I will try to explain arithmetic coding. (Trying to explain things helps my own understanding usually.)

The whole idea is to compress a list of symbols into a number between 0 and 1. The information is stored in the fractional part of the number, all the digits after the decimal point. It also takes into account the likelyhood of a symbol appearing in the list, more likely symbols use smaller data and less likely symbols use larger data. Lets start with an example:

Suppose we have 4 symbols: " " (space) "I", "am", "Sam". And their likelihood (or weights) of appearing in a list: 40%, 30%, 20%, 10%. So spaces are really common, "I" is less common, "am" is half that of "space" and "Sam" is least likely.

A sample list to encode is: {"I", " ", "am", " ", "Sam", " ", "I", " ", "am"}


For encoding, convert the likelihoods to probabilities: .4, .3, .2, .1 (so that their sum is exactly 1) and then convert to it to ranges for each symbol:
0 to .4: " "
.4 to .7: "I"
.7 to .9: "am"
.9 to 1: "Sam"

On a number line it looks like this:

0              .4         .7     .9   1 
.--------------.----------.------.----. 
      " "           "I"     "am"  "Sam" 

We start with the range [0, 1] and subdivide it by the range for each symbol in the list.

The first symbol is "I" so the new range is just [.4, .7].

The next is space, so we need to take its range [0, .4] and scale it between the current range. To do this, find the difference between the upper and lower bounds of the current range, multiply the symbol range by the difference, then add the lower bound of the current range:
The difference between [.4, .7] is .3.
Multiply [0, .4] * .3 is [0, .12]
Add [0, .12] + .4 is the new range [.4, .52]

The next is "am" [.7, .9]
The difference between [.4, .52] is .12
Multiply [.7, .9] * .12 is [.084, .108]
Add [.084, .108] + .4 is the new range [.484, .508]

The next is space [0, .4]
The difference between [.484, .508] is .024
Multiply [0, .4] * .024 is [0, .0096]
Add [0, .0096] + .484 is the new range [.484, .4936]

Here the first digits of our range match, so we can be certain that the result of encoding will start with .4 and we could remove it from our calculations, but we'll leave it in for now to see how the numbers keep changing.

The next is "Sam" [.9, 1]
The difference between [.484, .4936] is .0096
Multiply [.9, 1] * .0096 is [.00864, .0096]
Add [.00864, .0096] + .484 is the new range [.49264, .4936]

Now we have two matching digits!

The next is space [0, .4]
The difference between [.49264, .4936] is .00096
Multiply [0, .4] * .00096 is [0, .000384]
Add [0, .000384] + .49264 is the new range [.49264, .493024]

The next is "I" [.4, .7]
The difference between [.49264, .493024] is .000384
Multiply [.4, .7] * .000384 is [.0001536, .0002688]
Add [.0001536, .0002688] + .49264 is the new range [.4927936, .4929088]

The next is space [0, .4]
The difference between [4927936, .4929088] is .0001152
Multiply [0, .4] * .0001152 is [0, .00004608]
Add [0, .00004608] + .4927936 is the new range [.4927936, .49283968]

The next is "am" [.7, .9]
The difference between [.4927936, .49283968] is .00004608
Multiply [.7, .9] * .00004608 is [.000032256, .000041472]
Add [.000032256, .000041472] + .4927936 is the new range [.492825856, .492835072]

And so the encoded list of symbols is any number between [.492825856, .492835072]. But we only want one number as the result, so we'll just average them together: .492830464, it's within in the range, so it will work for decoding. We could also probably get by with fewer digits, probably .492830 should be enough for the decoder to give us the correct result, we'll check it later.

Now you can see that the numbers are getting more and more digits after the decimal point after each symbol is encoded, and there is a limit to the number of digits you can store in a 64-bit floating point double, so using atoms for the calculation after a certain number of digits won't work. Instead we work in binary representing the fractional part: for example .101010010000... that we can keep expanding as much as needed.


Now for decoding, we'll keep using decimal for illustration purposes. Taking our result number from encoding, we now try to guess (or something smarter) which symbols get us to that number, resampling the search space each time we find a symbol.

So for the first step we have our number line:

0              .4         .7     .9   1 
.--------------.----------.------.----. 
      " "           "I"     "am"  "Sam" 

We can look on there and see that .492830464 falls in the range [.4, .7] so we know that the first symbol must be "I". To figure out the next scale we find the difference of the current range, multiply it by the probabilities and add it to the lower bound of the current range:

The difference between [.4, .7] is .3.
Multiply [0, .4, .7, .9, 1] * .3 is [0, .12, .21, .27, .3]
Add [0, .12, .21, .27, .3] + .4 is [.4, .52, .61, .67, .7]

.4             .52        .61    .67  .7 
.--------------.----------.------.----. 
      " "           "I"     "am"  "Sam" 

We see that .492830464 falls in the range [.4, .52] so the next symbol is a space. Now recalculating the scale for every symbol is a lot of unnecessary computation, so instead using a binary search reduces that by a lot. We would start at the middle symbol "I" or "am", compute the value for that symbol, and repeat the binary search to the left or right of the middle symbol based on the comparison with the target number. Anyway, we'll keep on doing the whole scale for illustration.

The difference between [.4, .52] is .12.
Multiply [0, .4, .7, .9, 1] * .12 is [0, .048, .084, .108, .12]
Add [0, .048, .084, .108, .12] + .4 is [.4, .448, .484, .508, .52]

.4             .448       .484   .508 .52 
.--------------.----------.------.----. 
      " "           "I"     "am"  "Sam" 

The new range is [.484, .508], and the next symbol is "am".

The difference between [.484, .508] is .024
Multiply [0, .4, .7, .9, 1] * .024 is [0, .0096, .0168, .0216, .024]
Add that + .484 is [.484, .4936, .5008, .5056, .508]

The new range is [.484, .4936] and the symbol is space.

The difference between [.484, .4936] is .0096
Multiply [0, .4, .7, .9, 1] * .0096 is [0, .00384, .00672, .00864, .0096]
Add that + .484 is [0.484, .48784, .49072, .49264, .4936]

The new range is [.49264, .4936] and the symbol is "Sam".

The difference between [.49264, .4936] is .00096
Multiply [0, .4, .7, .9, 1] * .00096 is [0, .000384, .000672, .000864, .00096]
Add that + .49264 is [.49264, .493024, .493312, .493504, .4936]

The new range is [.49264, .493024] and the symbol is space.

The difference between [.49264, .493024] is .000384
Multiply [0, .4, .7, .9, 1] * .000384 is [0, .0001536, .0002688, .0003456, .000384]
Add that + .49264 is [.49264, .4927936, .4929088, .4929856, .493024]

The new range is [4927936, .4929088] and the symbol is "I".

The difference between [4927936, .4929088] is .0001152
Multiply [0, .4, .7, .9, 1] * .0001152 is [0, .00004608, .00008064, .00010368, .0001152]
Add that + .4927936 is [.4927936, .49283968, .49287424, .49289728, .4929088]

The new range is [.4927936, .49283968] and the symbol is space.

The difference between [.4927936, .49283968] is .00004608
Multiply [0, .4, .7, .9, 1] * .00004608 is [0, .000018432,.000032256,.000041472,.00004608]
Add that + .4927936 is [.4927936, .492812032, .492825856, .492835072, .49283968]

The new range is [.492825856, .492835072] and the final symbol is "am". As you go along you can see that the range for each matches the range calculated during the encoding process, so that you can see that the decoder is narrowing down the same list of symbols from just one number. Also, if we had used the shorter encoding .492830, it would still have given us the same final symbol.

Fortunately we knew how many symbols were in the original list, so we know we can stop now. But if you end up decoding an arbitrary list of symbols where you don't know the length of the original list, what happens if you keep going?

The difference between [.492825856, .492835072] is .000009216
Multiply [0, .4, .7, .9, 1] * .000009216 is [0,.0000036864,.0000064512,.0000082944,.000009216]
Add that + .492825856 is [0.492825856,0.4928295424,0.4928323072,0.4928341504,0.492835072]

The new range is [0.4928295424,0.4928323072] and the symbol is "I"??? We could keep going and generating symbols indefinitely, given we have infinite precision for the range. We don't want that, so we could encode a special symbol that means "end of input", or encode the first symbol of the length of the input list, somehow.

new topic     » goto parent     » topic index » view message » categorize

Search



Quick Links

User menu

Not signed in.

Misc Menu