b e k o p r
1 3 2 2 1 1

Ben Tanen ยท COMP-150

This is a visual explanation and exploration of adaptive Huffman coding and how it compares to traditional static Huffman coding. Specifically, we will focus on how our encoding trees might differ when using adaptive vs. static Huffman.

First, we will explore how traditional Huffman coding builds its encoding tree for a specific string, in this case "bookkeeper". We will then do the same for adaptive Huffman coding using the FGK algorithm and compare these two trees. At the end, you can further explore how static vs. Huffman coding trees compare through other strings.

### What is Huffman Coding?

Before we get started, let's quickly discuss what exactly Huffman coding is. When we transmit information, we generally need to convert some sort of data (text, pictures, etc.) into binary. To do this, we assign codes to help us distinguish between different pieces of data.

For example, if we had the string "abca", we might assign codes like: $a$ = 00, $b$ = 01, $c$ = 10. This would make it so that our binary encoding of "abca" is "00 01 10 00".

But what if we wanted to encode "aabaacaa"? If we used our original encoding where we use two bits for each character, we would be treating $a$ and $c$ with equal importance, even though $a$ appears much more.

Wouldn't it be more efficient if we used fewer bits for $a$ and more bits for $c$? This is where Huffman coding comes in.

Huffman coding is a lossless data compression algorithm that assigns variable-length codes based on the frequencies of our input characters.

In order to determine what code to assign to each character, we will first build a binary tree that will organize our characters based on frequency.

### Building a Huffman Tree

As an example, let's encode the string "bookkeeper". Before we can start encoding, we will build our Huffman tree for this string, which will in turn show us what binary encoding we will use for each character.

To start, we need to count the frequency for each character in our string and store these frequencies in a table.

We will use this table to add nodes and edges that will build up our tree.

First, we start by adding leaf nodes for the two characters that occur the least. In this case, we have a tie between $b$, $p$, and $r$, so we'll just arbitrarily choose $p$ and $r$.

When we add in our leaves for $p$ and $r$, we will attach them to a parent node for a new pseudo-character "pr". This pseudo-character represents occurrences of $p$ or $r$ so it's frequency is equal to the frequency of $p$ plus the frequency of $r$.

We will also update our table to include our new pseudo-character. We can get this by simply merging the columns of $p$ and $r$.

With a now reduced table, we can repeat this process again for our updated values.

As we can see, $b$ has the lowest frequency in our table so we'll use that. For the second lowest frequency, there is a tie between $k$, $o$, and "pr", so we can again pick arbitrarily. Let's use our pseudo-character "pr".

Since we don't have a leaf node for $b$ yet, we will have to add that into our tree.

Then, as we did before, we'll attach our $b$ node and our "pr" node to a parent node for a new pseudo-character "bpr".

Finally, we'll update our table to reflect our new pseudo-character.

It looks like our tree is coming along, but it doesn't quite have everything yet.

To keep going, we can repeat this process again...

...and again

...and again

...until our table is only left with one value, a pseudo-character containing all of our original characters. This means we're done building our Huffman tree!

So how do we use this tree to assign codes?

Given our Huffman tree, to determine the binary code that we will use for any particular character, we can simply walk from the root to our character's leaf node, taking note of when we traverse left and when we traverse right.

As we walk from root to leaf, we will denote a left traversal with "0" and a right traversal with a "1".

For example, say we wanted to find the encoding for $p$, which only occurs once in "bookkeeper".

In our walk from root to the $p$ leaf, we go left, right, right, and left again. This means we will use four bits to encode $p$ as "0110".

What about our encoding of a more frequently used character like $k$?

For our walk to $k$, we traverse right then left. This means we will only use two bits to encode $k$ as "10".

It seems like our tree works - hooray for efficiency!

If we do this for all of our characters, we get our full binary encoding scheme. Let's compare this new Huffman scheme against a naive encoding scheme where we just arbitrarily assign binary codes.

 char freq old code new code $b$ $e$ $k$ $o$ $p$ $r$ 1 3 2 2 1 1 000 001 010 011 100 101 010 11 10 11 0110 0111

Using the naive scheme, encoding "bookkeeper" would take 30 bits. Using our Huffman scheme, we only use 25 bits to encode, which is a roughly a 17% improvement!

A small thing to note: as we were building our tree, when choosing our two least frequent characters in our table, we repeatedly had ties between three or more characters. When this happened, we would choose two of our tied elements arbitrarily.

By doing this, we can see that our arbitrary choice will change our tree. This means we can actually get multiple different trees from the same input string. For example, we could have initially chosen to start with $b$ and $r$ instead of $p$ and $r$. If we had done this, we would get a very similar tree but the $b$ and $p$ nodes would have been swapped.

While these trees might differ in their arrangement and shape, they are all valid Huffman trees. Since the algorithm is based on frequencies, this means that it doesn't matter if we assign a three-bit code to $b$ and a four-bit code to $p$ or vice versa. The tree and resulting encoding scheme will still result in the same efficiency improvement.

So how does this tree and this encoding compare to the one produced using adaptive Huffman coding? Keep scrolling to find out!

bookkeeper

While traditional Huffman coding is very useful, we can sometimes be limited by the fact that we need to know what data we are going to be encoding before we can start encoding. This might work in some scenarios, but there are many other applications where this is impractical or impossible.

For example, if we wanted to transmit a live video stream, we could not possibly know exactly what is going to be transmitted before hand.

As an alternative, we can use adaptive Huffman coding.

With adaptive Huffman coding, our purpose and goal is identical to traditional Huffman coding - we want to build a tree that will give us an optimal binary encoding scheme. The major distinction is that we will not pre-process our input before we start encoding it. Instead, we will be building a tree on the fly as we read in our input.

We will be using the FGK (Faller-Gallager-Knuth) Algorithm.

As we did with traditional Huffman coding, we will build our FGK tree with leaves for characters and interior nodes for pseudo-characters. As with static Huffman, an interior node's frequency will be equal to the sum of the frequencies of its children.

However, there are two major differences between traditional Huffman tree and the FGK tree.

First, our FGK tree must satisfy the sibling property. In order to do this, our tree must meet the following conditions:

1. All nodes in our tree (except for the root) must have a sibling.
2. The nodes can be listed (from left to right, bottom to top) in order of increasing value.

Take this tree as an example.

First of all, each node (except for the root) has a sibling, so the tree meets the first condition. Next, we can see that the values of our nodes increase as we look from left to right and bottom to top in our tree.

Thus, we can see this tree satisfies the sibling property.

Now consider this tree as another example.

While every node has a sibling, we can see the nodes appear out of order (since $6 \not < 4$). Thus we get a conflict of the sibling property

However, if we simply swap our conflicting nodes (and their subtrees), we now satisfy the sibling property.

The ability to swap conflicting nodes to maintain the sibling property will come in handy when building our FGK tree.

The second major difference from traditional Huffman trees to FGK trees is our use of a null node.

In our traditional Huffman tree, we build our tree from the bottom up (starting with the leaves and building up to our root) using our frequency table.

For adaptive Huffman coding, we are reading our input and building our tree at the same time (without first counting frequencies). As a result, we must build our FGK top down (starting with a root and build down to our leaves). We use our null node as a sibling for new character nodes we will add. This way we will still maintain the sibling property.

Let's see how all of this looks while building a FGK tree for "bookkeeper".

Since we are building our tree on the fly, we will update our tree based on the character we read in. If the character is new, we'll add it to our tree using the null node. If we've seen the character already, we'll simply update its frequency in our tree.

Let's consider our first input character, $b$.

This is our first $b$ so we will have to add it into our tree. We can make our $b$ node siblings with the null node and then make them both children of a new interior node.

After inserting each new node, we must make sure that we don't violate the sibling property.

Since we will always insert new leaves as a sibling of the null node, we know that we will never violate the first condition of the sibling property.

However, we will need to make sure that our nodes are always in the right order (increasing in value from left to right and bottom to top). To do this, we will incrementally walk up the tree and update the parent chain between our new leaf and the root. If we ever encounter conflicting nodes, we can swap them.

After inserting just $b$, we can see our parent chain is already updated and we satisfy the sibling property. With that we can continue reading in our input.

Our next character is going to be $o$, which we haven't seen yet. Therefore, we'll again need to add a new node to our tree.

Now that we have a leaf for $o$ and an interior node with a value equal to the null node's frequency plus the $o$ node's frequency ($0 + 1 = 1$), we just need to walk up and update the parent chain.

In this case, the parent chain just includes the root so there are no conflicts. We can just update the root to properly reflect its children's values ($1 + 1 = 2$).

Onto the next character, which happens to be another $o$ and our first duplicate. We'll first just update the $o$ node's frequency in our tree and update the rest of the tree to make sure it still satisfies the sibling property.

When we update the frequency of $o$, we get our first conflict. Our nodes are suppose to be in order from left to right and bottom to top, but the frequency of $o$ is now greater than the frequency of $b$. Before we can continue, we must rectify this by swapping the two nodes.

Now that we've swapped the two nodes, our tree again satisfies the sibling property, so we can continue updating the rest of the tree. In this case, we simply had to update the root's value, so we are good to go.

Time for the next character...

This is our first time reading $k$ so we'll simply insert it into the tree as we did for $b$ and $o$.

We'll again have to update our parent chain. With the current tree arrangement, we should be able to just increment the nodes in our chain and move onto the next letter of input.

We didn't run into any conflicts with updating our parent chain so let's read in our next character, which is another $k$.

As we did with $o$, we will first update the value on our existing $k$ node and update the parent chain as necessary. If we run into a conflict, we might have to make a swap or two.

It looks like there is a conflict between $k$ (with a frequency of 2) and $b$ (with a frequency of 1) so we'll have to swap them. We can then keep updating our parent chain.

After swapping $b$ and $k$, we can then update the first node in our parent chain. However, this leads to another conflict. In order to resolve this, we'll need to swap our left subtree with our right subtree.

Once we make the swap and update our last node in the chain (the root), we can read in the next character.

As we continue to read from our input, we'll continue inserting new nodes, updating our existing nodes, and making swaps when necessary.

Let's breeze through the rest of our input and see what our tree looks like at the end.

We have our first $e$ so we insert a node for $e$ - easy!

Just make sure to update our parent chain. No conflicts, so the ball keeps rolling...

(We just adjusted the tree positioning / spacing in preparation for more insertions - nothing related to our algorithm).

Another $e$ so we update our existing node. We get a conflict so it's time to swap.

After swapping our nodes and updating our parent, we get another conflict. Time for another swap!

We update the rest of the parent chain and we're good to go.

A new character, so time for a new leaf. As always, we also have to update the parent chain.

It doesn't look like we have any conflicts so we can read our next character.

Our third $e$ - after updating our existing $e$ node, we get a conflict, which means it's swapping time.

We always have to update that parent chain...

Time for our last input character, $r$. This is our first $r$ so we insert a new node.

Let's see if we run into any conflicts when updating our parent chain for the last time.

We do get a conflict, but we know what to do.

Another conflict, another swap...

We now update our parent chain all the way to the root and...

... we're done!

We now have a completed Huffman tree without having to pre-process the input. As with our normal Huffman trees, we can use this tree to figure out which binary codes we should assign to which characters.

Want to see how our traditional Huffman tree and our FGK tree compare? Scroll down (or click here) to check out the trees side-by-side. You can also see how our trees compare when we use different word!