Re: Compressed Linetables
- Posted by PeteE Sep 02, 2014
- 1324 views
Okay, I'm back from a holiday weekend, so I will try to explain arithmetic coding. (Trying to explain things helps my own understanding usually.)
The whole idea is to compress a list of symbols into a number between 0 and 1. The information is stored in the fractional part of the number, all the digits after the decimal point. It also takes into account the likelyhood of a symbol appearing in the list, more likely symbols use smaller data and less likely symbols use larger data. Lets start with an example:
Suppose we have 4 symbols: " " (space) "I", "am", "Sam". And their likelihood (or weights) of appearing in a list: 40%, 30%, 20%, 10%. So spaces are really common, "I" is less common, "am" is half that of "space" and "Sam" is least likely.
A sample list to encode is: {"I", " ", "am", " ", "Sam", " ", "I", " ", "am"}
For encoding, convert the likelihoods to probabilities: .4, .3, .2, .1 (so that their sum is exactly 1) and then convert to it to ranges for each symbol:
0 to .4: " "
.4 to .7: "I"
.7 to .9: "am"
.9 to 1: "Sam"
On a number line it looks like this:
0 .4 .7 .9 1 .--------------.----------.------.----. " " "I" "am" "Sam"
We start with the range [0, 1] and subdivide it by the range for each symbol in the list.
The first symbol is "I" so the new range is just [.4, .7].
The next is space, so we need to take its range [0, .4] and scale it between the current range. To do this, find the difference between the upper and lower bounds of the current range, multiply the symbol range by the difference, then add the lower bound of the current range:
The difference between [.4, .7] is .3.
Multiply [0, .4] * .3 is [0, .12]
Add [0, .12] + .4 is the new range [.4, .52]
The next is "am" [.7, .9]
The difference between [.4, .52] is .12
Multiply [.7, .9] * .12 is [.084, .108]
Add [.084, .108] + .4 is the new range [.484, .508]
The next is space [0, .4]
The difference between [.484, .508] is .024
Multiply [0, .4] * .024 is [0, .0096]
Add [0, .0096] + .484 is the new range [.484, .4936]
Here the first digits of our range match, so we can be certain that the result of encoding will start with .4 and we could remove it from our calculations, but we'll leave it in for now to see how the numbers keep changing.
The next is "Sam" [.9, 1]
The difference between [.484, .4936] is .0096
Multiply [.9, 1] * .0096 is [.00864, .0096]
Add [.00864, .0096] + .484 is the new range [.49264, .4936]
Now we have two matching digits!
The next is space [0, .4]
The difference between [.49264, .4936] is .00096
Multiply [0, .4] * .00096 is [0, .000384]
Add [0, .000384] + .49264 is the new range [.49264, .493024]
The next is "I" [.4, .7]
The difference between [.49264, .493024] is .000384
Multiply [.4, .7] * .000384 is [.0001536, .0002688]
Add [.0001536, .0002688] + .49264 is the new range [.4927936, .4929088]
The next is space [0, .4]
The difference between [4927936, .4929088] is .0001152
Multiply [0, .4] * .0001152 is [0, .00004608]
Add [0, .00004608] + .4927936 is the new range [.4927936, .49283968]
The next is "am" [.7, .9]
The difference between [.4927936, .49283968] is .00004608
Multiply [.7, .9] * .00004608 is [.000032256, .000041472]
Add [.000032256, .000041472] + .4927936 is the new range [.492825856, .492835072]
And so the encoded list of symbols is any number between [.492825856, .492835072]. But we only want one number as the result, so we'll just average them together: .492830464, it's within in the range, so it will work for decoding. We could also probably get by with fewer digits, probably .492830 should be enough for the decoder to give us the correct result, we'll check it later.
Now you can see that the numbers are getting more and more digits after the decimal point after each symbol is encoded, and there is a limit to the number of digits you can store in a 64-bit floating point double, so using atoms for the calculation after a certain number of digits won't work. Instead we work in binary representing the fractional part: for example .101010010000... that we can keep expanding as much as needed.
Now for decoding, we'll keep using decimal for illustration purposes. Taking our result number from encoding, we now try to guess (or something smarter) which symbols get us to that number, resampling the search space each time we find a symbol.
So for the first step we have our number line:
0 .4 .7 .9 1 .--------------.----------.------.----. " " "I" "am" "Sam"
We can look on there and see that .492830464 falls in the range [.4, .7] so we know that the first symbol must be "I". To figure out the next scale we find the difference of the current range, multiply it by the probabilities and add it to the lower bound of the current range:
The difference between [.4, .7] is .3.
Multiply [0, .4, .7, .9, 1] * .3 is [0, .12, .21, .27, .3]
Add [0, .12, .21, .27, .3] + .4 is [.4, .52, .61, .67, .7]
.4 .52 .61 .67 .7 .--------------.----------.------.----. " " "I" "am" "Sam"
We see that .492830464 falls in the range [.4, .52] so the next symbol is a space. Now recalculating the scale for every symbol is a lot of unnecessary computation, so instead using a binary search reduces that by a lot. We would start at the middle symbol "I" or "am", compute the value for that symbol, and repeat the binary search to the left or right of the middle symbol based on the comparison with the target number. Anyway, we'll keep on doing the whole scale for illustration.
The difference between [.4, .52] is .12.
Multiply [0, .4, .7, .9, 1] * .12 is [0, .048, .084, .108, .12]
Add [0, .048, .084, .108, .12] + .4 is [.4, .448, .484, .508, .52]
.4 .448 .484 .508 .52 .--------------.----------.------.----. " " "I" "am" "Sam"
The new range is [.484, .508], and the next symbol is "am".
The difference between [.484, .508] is .024
Multiply [0, .4, .7, .9, 1] * .024 is [0, .0096, .0168, .0216, .024]
Add that + .484 is [.484, .4936, .5008, .5056, .508]
The new range is [.484, .4936] and the symbol is space.
The difference between [.484, .4936] is .0096
Multiply [0, .4, .7, .9, 1] * .0096 is [0, .00384, .00672, .00864, .0096]
Add that + .484 is [0.484, .48784, .49072, .49264, .4936]
The new range is [.49264, .4936] and the symbol is "Sam".
The difference between [.49264, .4936] is .00096
Multiply [0, .4, .7, .9, 1] * .00096 is [0, .000384, .000672, .000864, .00096]
Add that + .49264 is [.49264, .493024, .493312, .493504, .4936]
The new range is [.49264, .493024] and the symbol is space.
The difference between [.49264, .493024] is .000384
Multiply [0, .4, .7, .9, 1] * .000384 is [0, .0001536, .0002688, .0003456, .000384]
Add that + .49264 is [.49264, .4927936, .4929088, .4929856, .493024]
The new range is [4927936, .4929088] and the symbol is "I".
The difference between [4927936, .4929088] is .0001152
Multiply [0, .4, .7, .9, 1] * .0001152 is [0, .00004608, .00008064, .00010368, .0001152]
Add that + .4927936 is [.4927936, .49283968, .49287424, .49289728, .4929088]
The new range is [.4927936, .49283968] and the symbol is space.
The difference between [.4927936, .49283968] is .00004608
Multiply [0, .4, .7, .9, 1] * .00004608 is [0, .000018432,.000032256,.000041472,.00004608]
Add that + .4927936 is [.4927936, .492812032, .492825856, .492835072, .49283968]
The new range is [.492825856, .492835072] and the final symbol is "am". As you go along you can see that the range for each matches the range calculated during the encoding process, so that you can see that the decoder is narrowing down the same list of symbols from just one number. Also, if we had used the shorter encoding .492830, it would still have given us the same final symbol.
Fortunately we knew how many symbols were in the original list, so we know we can stop now. But if you end up decoding an arbitrary list of symbols where you don't know the length of the original list, what happens if you keep going?
The difference between [.492825856, .492835072] is .000009216
Multiply [0, .4, .7, .9, 1] * .000009216 is [0,.0000036864,.0000064512,.0000082944,.000009216]
Add that + .492825856 is [0.492825856,0.4928295424,0.4928323072,0.4928341504,0.492835072]
The new range is [0.4928295424,0.4928323072] and the symbol is "I"??? We could keep going and generating symbols indefinitely, given we have infinite precision for the range. We don't want that, so we could encode a special symbol that means "end of input", or encode the first symbol of the length of the input list, somehow.