C++ 存储哈夫曼树的有效方式
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/759707/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Efficient way of storing Huffman tree
提问by X-Istence
I am writing a Huffman encoding/decoding tool and am looking for an efficient way to store the Huffman tree that is created to store inside of the output file.
我正在编写一个霍夫曼编码/解码工具,并且正在寻找一种有效的方法来存储为存储在输出文件中而创建的霍夫曼树。
Currently there are two different versions I am implementing.
目前我正在实施两个不同的版本。
- This one reads the entire file into memory character by character and builds a frequency table for the whole document. This would only require outputting the tree once, and thus efficiency is not that big of a concern, other than if the input file is small.
- The other method I am using is to read a chunk of data, about 64 kilobyte in size and run the frequency analysis over that, create a tree and encode it. However, in this case before every chunk I will need to output my frequency tree so that the decoder is able to re-build its tree and properly decode the encoded file. This is where the efficiency does come into place since I want to save as much space as possible.
- 这个是将整个文件逐个字符读入内存,并为整个文件建立一个频率表。这只需要输出一次树,因此效率不是那么重要,除非输入文件很小。
- 我使用的另一种方法是读取大约 64 KB 大小的数据块并对其运行频率分析,创建一棵树并对其进行编码。但是,在这种情况下,在每个块之前,我需要输出我的频率树,以便解码器能够重新构建其树并正确解码编码文件。这是效率确实发挥作用的地方,因为我想尽可能多地节省空间。
In my searches so far I have not found a good way of storing the tree in as little space as possible, I am hoping the StackOverflow community can help me find a good solution!
到目前为止,在我的搜索中,我还没有找到在尽可能小的空间内存储树的好方法,我希望 StackOverflow 社区可以帮助我找到一个好的解决方案!
回答by Lasse V. Karlsen
Since you already have to implement code to handle a bit-wise layer on top of your byte-organized stream/file, here's my proposal.
由于您已经必须实现代码来处理字节组织流/文件之上的逐位层,因此这是我的建议。
Do not store the actual frequencies, they're not needed for decoding. You do, however, need the actual tree.
不要存储实际频率,它们不需要解码。但是,您确实需要实际的树。
So for each node, starting at root:
所以对于每个节点,从根开始:
- If leaf-node: Output 1-bit + N-bit character/byte
- If not leaf-node, output 0-bit. Then encode both child nodes (left first then right) the same way
- 如果叶子节点:输出 1 位 + N 位字符/字节
- 如果不是叶节点,则输出 0 位。然后以相同的方式对两个子节点(先左后右)进行编码
To read, do this:
要阅读,请执行以下操作:
- Read bit. If 1, then read N-bit character/byte, return new node around it with no children
- If bit was 0, decode left and right child-nodes the same way, and return new node around them with those children, but no value
- 读点。如果为 1,则读取 N 位字符/字节,返回其周围没有子节点的新节点
- 如果位为 0,以相同的方式解码左右子节点,并返回带有这些子节点的新节点,但没有值
A leaf-node is basically any node that doesn't have children.
叶节点基本上是任何没有子节点的节点。
With this approach, you can calculate the exact size of your output before writing it, to figure out if the gains are enough to justify the effort. This assumes you have a dictionary of key/value pairs that contains the frequency of each character, where frequency is the actual number of occurrences.
使用这种方法,您可以在写入之前计算输出的确切大小,以确定收益是否足以证明付出的努力。这假设您有一个包含每个字符出现频率的键/值对字典,其中频率是实际出现的次数。
Pseudo-code for calculation:
计算伪代码:
Tree-size = 10 * NUMBER_OF_CHARACTERS - 1
Encoded-size = Sum(for each char,freq in table: freq * len(PATH(char)))
The tree-size calculation takes the leaf and non-leaf nodes into account, and there's one less inline node than there are characters.
树大小计算考虑了叶节点和非叶节点,并且内联节点比字符少一个。
SIZE_OF_ONE_CHARACTER would be number of bits, and those two would give you the number of bits total that my approach for the tree + the encoded data will occupy.
SIZE_OF_ONE_CHARACTER 将是位数,这两个将为您提供我对树的方法 + 编码数据将占用的总位数。
PATH(c) is a function/table that would yield the bit-path from root down to that character in the tree.
PATH(c) 是一个函数/表,它将产生从根到树中该字符的位路径。
Here's a C#-looking pseudo-code to do it, which assumes one character is just a simple byte.
这是一个看起来像 C# 的伪代码,它假设一个字符只是一个简单的字节。
void EncodeNode(Node node, BitWriter writer)
{
if (node.IsLeafNode)
{
writer.WriteBit(1);
writer.WriteByte(node.Value);
}
else
{
writer.WriteBit(0);
EncodeNode(node.LeftChild, writer);
EncodeNode(node.Right, writer);
}
}
To read it back in:
将其读回:
Node ReadNode(BitReader reader)
{
if (reader.ReadBit() == 1)
{
return new Node(reader.ReadByte(), null, null);
}
else
{
Node leftChild = ReadNode(reader);
Node rightChild = ReadNode(reader);
return new Node(0, leftChild, rightChild);
}
}
An example (simplified, use properties, etc.) Node implementation:
一个示例(简化、使用属性等)节点实现:
public class Node
{
public Byte Value;
public Node LeftChild;
public Node RightChild;
public Node(Byte value, Node leftChild, Node rightChild)
{
Value = value;
LeftChild = leftChild;
RightChild = rightChild;
}
public Boolean IsLeafNode
{
get
{
return LeftChild == null;
}
}
}
Here's a sample output from a specific example.
这是来自特定示例的示例输出。
Input: AAAAAABCCCCCCDDEEEEE
输入:AAAAAABCCCCCCDDEEEEE
Frequencies:
频率:
- A: 6
- B: 1
- C: 6
- D: 2
- E: 5
- 答:6
- 乙:1
- 中:6
- D: 2
- 电子:5
Each character is just 8 bits, so the size of the tree will be 10 * 5 - 1 = 49 bits.
每个字符只有 8 位,因此树的大小将为 10 * 5 - 1 = 49 位。
The tree could look like this:
这棵树可能看起来像这样:
20
----------
| 8
| -------
12 | 3
----- | -----
A C E B D
6 6 5 1 2
So the paths to each character is as follows (0 is left, 1 is right):
所以每个字符的路径如下(0为左,1为右):
- A: 00
- B: 110
- C: 01
- D: 111
- E: 10
- 答:00
- 乙:110
- 丙:01
- D: 111
- 电子:10
So to calculate the output size:
所以要计算输出大小:
- A: 6 occurrences * 2 bits = 12 bits
- B: 1 occurrence * 3 bits = 3 bits
- C: 6 occurrences * 2 bits = 12 bits
- D: 2 occurrences * 3 bits = 6 bits
- E: 5 occurrences * 2 bits = 10 bits
- A: 6 次 * 2 位 = 12 位
- B:1 次出现 * 3 位 = 3 位
- C: 6 次 * 2 位 = 12 位
- D:出现 2 次 * 3 位 = 6 位
- E:出现 5 次 * 2 位 = 10 位
Sum of encoded bytes is 12+3+12+6+10 = 43 bits
编码字节总和为 12+3+12+6+10 = 43 位
Add that to the 49 bits from the tree, and the output will be 92 bits, or 12 bytes. Compare that to the 20 * 8 bytes necessary to store the original 20 characters unencoded, you'll save 8 bytes.
将其添加到树中的 49 位,输出将为 92 位或 12 个字节。与存储未编码的原始 20 个字符所需的 20 * 8 个字节相比,您将节省 8 个字节。
The final output, including the tree to begin with, is as follows. Each character in the stream (A-E) is encoded as 8 bits, whereas 0 and 1 is just a single bit. The space in the stream is just to separate the tree from the encoded data and does not take up any space in the final output.
最终输出,包括开始的树,如下所示。流 (AE) 中的每个字符都被编码为 8 位,而 0 和 1 只是一个位。流中的空间只是将树与编码数据分开,不占用最终输出中的任何空间。
001A1C01E01B1D 0000000000001100101010101011111111010101010
For the concrete example you have in the comments, AABCDEF, you will get this:
对于您在评论中的具体示例 AABCDEF,您将得到以下信息:
Input: AABCDEF
输入:AABCDEF
Frequencies:
频率:
- A: 2
- B: 1
- C: 1
- D: 1
- E: 1
- F: 1
- A2
- 乙:1
- 丙:1
- D: 1
- 电子:1
- 女:1
Tree:
树:
7
-------------
| 4
| ---------
3 2 2
----- ----- -----
A B C D E F
2 1 1 1 1 1
Paths:
路径:
- A: 00
- B: 01
- C: 100
- D: 101
- E: 110
- F: 111
- 答:00
- 乙:01
- 丙:100
- D: 101
- 电子:110
- 外:111
Tree: 001A1B001C1D01E1F = 59 bits
Data: 000001100101110111 = 18 bits
Sum: 59 + 18 = 77 bits = 10 bytes
树:001A1B001C1D01E1F = 59 位
数据:000001100101110111 = 18 位
总和:59 + 18 = 77 位 = 10 字节
Since the original was 7 characters of 8 bits = 56, you will have too much overhead of such small pieces of data.
由于原始是 8 位的 7 个字符 = 56,因此您将有太多的此类小数据的开销。
回答by Ezran
If you have enough control over the tree generation, you could make it do a canonical tree (the same way DEFLATEdoes, for example), which basically means you create rules to resolve any ambiguous situations when building the tree. Then, like DEFLATE, all you actually have to store are the lengths of the codes for each character.
如果您对树的生成有足够的控制权,您可以让它做一个规范树(例如,DEFLATE的方式相同),这基本上意味着您在构建树时创建规则来解决任何不明确的情况。然后,就像 DEFLATE 一样,您实际需要存储的只是每个字符的代码长度。
That is, if you had the tree/codes Lasse mentioned above:
也就是说,如果你有上面提到的树/代码 Lasse:
- A: 00
- B: 110
- C: 01
- D: 111
- E: 10
- 答:00
- 乙:110
- 丙:01
- D: 111
- 电子:10
Then you could store those as: 2, 3, 2, 3, 2
然后你可以将它们存储为:2, 3, 2, 3, 2
And that's actually enough information to regenerate the huffman table, assuming you're always using the same character set -- say, ASCII. (Which means you couldn't skip letters -- you'd have to list a code length for each one, even if it's zero.)
这实际上足以重新生成霍夫曼表,假设您总是使用相同的字符集——比如,ASCII。(这意味着你不能跳过字母——你必须为每个字母列出一个代码长度,即使它是零。)
If you also put a limitation on the bit lengths (say, 7 bits), you could store each of these numbers using short binary strings. So 2,3,2,3,2 becomes 010 011 010 011 010 -- Which fits in 2 bytes.
如果您还限制了位长(例如 7 位),您可以使用短二进制字符串存储这些数字中的每一个。所以 2,3,2,3,2 变成了 010 011 010 011 010 —— 适合 2 个字节。
If you want to get reallycrazy, you could do what DEFLATE does, and make another huffman table of the lengths of these codes, and store its code lengths beforehand. Especially since they add extra codes for "insert zero N times in a row" to shorten things further.
如果你想变得非常疯狂,你可以做 DEFLATE 所做的,并为这些代码的长度制作另一个霍夫曼表,并预先存储它的代码长度。特别是因为他们为“连续插入零 N 次”添加了额外的代码以进一步缩短事情的时间。
The RFC for DEFLATE isn't too bad, if you're already familiar with huffman coding: http://www.ietf.org/rfc/rfc1951.txt
如果您已经熟悉霍夫曼编码,则 DEFLATE 的 RFC 还不错:http://www.ietf.org/rfc/rfc1951.txt
回答by Sam Hasler
branches are 0 leaves are 1. Traverse the tree depth first to get its "shape"
分支是 0 叶是 1。首先遍历树的深度以获得它的“形状”
e.g. the shape for this tree
0 - 0 - 1 (A)
| \- 1 (E)
\
0 - 1 (C)
\- 0 - 1 (B)
\- 1 (D)
would be 001101011
Follow that with the bits for the characters in the same depth first order AECBD (when reading you'll know how many characters to expect from the shape of the tree). Then output the codes for the message. You then have a long series of bits that you can divide up into characters for output.
遵循相同深度一阶 AECBD 中字符的位(阅读时您会知道从树的形状中期望多少个字符)。然后输出消息的代码。然后,您可以将一长串位分成多个字符进行输出。
If you are chunking it, you could test that storing the tree for the next chuck is as efficient as just reusing the tree for the previous chunk and have the tree shape being "1" as an indicator to just reuse the tree from the previous chunk.
如果对它进行分块,则可以测试为下一个块存储树与将树重用于前一个块一样有效,并且将树形状设为“1”作为仅重用前一个块中的树的指示符.
回答by unwind
The tree is generally created from a frequency table of the bytes. So store that table, or just the bytes themselves sorted by frequency, and re-create the tree on the fly. This of course assumes that you're building the tree to represent single bytes, not larger blocks.
树通常是从字节的频率表中创建的。因此,存储该表,或仅存储按频率排序的字节本身,并即时重新创建树。这当然假设您正在构建树来表示单个字节,而不是更大的块。
UPDATE: As pointed out by j_random_hacker in a comment, you actually can't do this: you need the frequency values themselves. They are combined and "bubble" upwards as you build the tree. This pagedescribes the way a tree is built from the frequency table. As a bonus, it also saves this answer from being deleted by mentioning a way to save out the tree:
更新:正如 j_random_hacker 在评论中指出的那样,您实际上不能这样做:您需要频率值本身。当您构建树时,它们会组合在一起并向上“冒泡”。此页面描述了从频率表构建树的方式。作为奖励,它还通过提及一种保存树的方法来避免删除此答案:
The easiest way to output the huffman tree itself is to, starting at the root, dump first the left hand side then the right hand side. For each node you output a 0, for each leaf you output a 1 followed by N bits representing the value.
输出霍夫曼树本身的最简单方法是,从根开始,先转储左侧,然后转储右侧。对于每个节点,您输出一个 0,对于每个叶子,您输出一个 1,后跟 N 位代表值。
回答by verdy_p
A better approach
更好的方法
Tree:
树:
7
-------------
| 4
| ---------
3 2 2
----- ----- -----
A B C D E F
2 1 1 1 1 1 : frequencies
2 2 3 3 3 3 : tree depth (encoding bits)
Now just derive this table:
现在只需导出此表:
depth number of codes
----- ---------------
2 2 [A B]
3 4 [C D E F]
You don't need to use the same binary tree, just keep the computed tree depth i.e. the number of encoding bits. So just keep the vector of uncompressed values [A B C D E F] ordered by tree depth, use relative indexes instead to this separate vector. Now recreate the aligned bit patterns for each depth:
您不需要使用相同的二叉树,只需保留计算的树深度,即编码位数。所以只需保持未压缩值的向量 [ABCDEF] 按树深度排序,使用相对索引代替这个单独的向量。现在为每个深度重新创建对齐的位模式:
depth number of codes
----- ---------------
2 2 [00x 01x]
3 4 [100 101 110 111]
What you immediately see is that only the first bit pattern in each row is significant. You get the following lookup table:
您立即看到的是,每行中只有第一位模式是重要的。您将获得以下查找表:
first pattern depth first index
------------- ----- -----------
000 2 0
100 3 2
This LUT has a very small size (even if your Huffman codes can be 32-bit long, it will only contain 32 rows), and in fact the first pattern is always null, you can ignore it completely when performing a binary search of patterns in it (here only 1 pattern will need to be compared to know if the bit depth is 2 or 3 and get the first index at which the associated data is stored in the vector). In our example you'll need to perform a fast binary search of input patterns in a search space of 31 values at most, i.e. a maximum of 5 integer compares. These 31 compare routines can be optimized in 31 codes to avoid all loops and having to manage states when browing the integer binary lookup tree. All this table fits in small fixed length (the LUT just needs 31 rows atmost for Huffman codes not longer than 32 bits, and the 2 other columns above will fill at most 32 rows).
这个 LUT 的大小非常小(即使你的 Huffman 代码可以是 32 位长,它也只会包含 32 行),而且实际上第一个模式总是空的,你可以在执行模式的二分搜索时完全忽略它在其中(这里只需要比较 1 个模式即可知道位深度是 2 还是 3 并获取相关数据存储在向量中的第一个索引)。在我们的示例中,您需要在最多 31 个值的搜索空间中对输入模式执行快速二分搜索,即最多 5 个整数比较。这 31 个比较例程可以在 31 个代码中进行优化,以避免在浏览整数二叉查找树时出现所有循环和必须管理状态。所有这些表都适合小的固定长度(对于不长于 32 位的霍夫曼代码,LUT 最多只需要 31 行,
In other words the LUT above requires 31 ints of 32-bit size each, 32 bytes to store the bit depth values: but you can avoid it this by implying the depth column (and the first row for depth 1):
换句话说,上面的 LUT 需要 31 个 32 位大小的整数,32 个字节来存储位深度值:但是您可以通过暗示深度列(以及深度 1 的第一行)来避免这种情况:
first pattern (depth) first index
------------- ------- -----------
(000) (1) (0)
000 (2) 0
100 (3) 2
000 (4) 6
000 (5) 6
... ... ...
000 (32) 6
So your LUT contains [000, 100, 000(30times)]. To search in it you must find the position where the input bits pattern are between two patterns: it must be lower than the pattern at the next position in this LUT but still higher than or equal to the pattern in the current position (if both positions contain the same pattern, the current row will not match, the input pattern fits below). You'll then divide and conquer, and will use 5 tests at most (the binary search requires a single code with 5 embedded if/then/else nested levels, it has 32 branches, the branch reached indicates directly the bit depth that does not need to be stored; you perform then a single directly indexed lookup to the second table for returning the first index; you derive additively the final index in the vector of decoded values).
所以你的 LUT 包含 [000, 100, 000(30times)]。要在其中搜索,您必须找到输入位模式在两个模式之间的位置:它必须低于此 LUT 中下一个位置的模式,但仍高于或等于当前位置的模式(如果两个位置包含相同的模式,当前行将不匹配,输入模式适合下面)。然后你将分而治之,最多使用5个测试(二分查找需要一个带有5个嵌套if/then/else嵌套级别的代码,它有32个分支,到达的分支直接表示不需要存储;然后对第二个表执行单个直接索引查找以返回第一个索引;您在解码值的向量中附加地推导出最终索引)。
Once you get a position in the lookup table (search in the 1st column), you immediately have the number of bits to take from the input and then the start index to the vector. The bit depth you get can be used to derive directly the adjusted index position, by basic bitmasking after substracting the first index.
一旦您在查找表中获得一个位置(在第一列中搜索),您将立即获得要从输入中获取的位数,然后是向量的起始索引。通过减去第一个索引后的基本位掩码,您获得的位深度可用于直接导出调整后的索引位置。
In summary: never store linked binary trees, and you don't need any loop to perform thelookup which just requires 5 nested ifs comparing patterns at fixed positions in a table of 31 patterns, and a table of 31 ints containing the start offset within the vector of decoded values (in the first branch of the nested if/then/else tests, the start offset to the vector is implied, it is always zero; it is also the most frequent branch that will be taken as it matches the shortest code which is for the most frequent decoded values).
总之:永远不要存储链接的二叉树,并且您不需要任何循环来执行查找,它只需要 5 个嵌套的 ifs 比较 31 个模式表中固定位置的模式,以及一个包含 31 个整数的表,其中包含起始偏移量解码值的向量(在嵌套的 if/then/else 测试的第一个分支中,隐含了向量的起始偏移量,它始终为零;它也是最常见的分支,因为它匹配最短的代码这是最频繁的解码值)。