C语言 C中的霍夫曼编码

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/20018426/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-02 10:17:58  来源:igfitidea点击:

Huffman encoding in C

ctreehuffman-code

提问by Whizzil

I am trying to write a module which assigns huffman encoded words to the input symbols, but the given codes differ from what they should look like.

我正在尝试编写一个模块,该模块将霍夫曼编码的单词分配给输入符号,但给定的代码与它们的外观不同。

For example, if I run it with following symbol probabilities:

例如,如果我使用以下符号概率运行它:

(1st column: probabilities; 2nd column: my huffman codes; 3rd column: correct huffman codes)

(第一列:概率;第二列:我的霍夫曼代码;第三列:正确的霍夫曼代码)

0,25 --> 01 --> 10

0,25 --> 01 --> 10

0,15 --> 101 --> 111

0,15 --> 101 --> 111

0,15 --> 110 --> 110

0,15 --> 110 --> 110

0,1 --> 1111 --> 010

0,1 --> 1111 --> 010

0,1 --> 000 --> 001

0,1 --> 000 --> 001

0,05 --> 0010 --> 0110

0,05 --> 0010 --> 0110

0,05 --> 0011 --> 0001

0,05 --> 0011 --> 0001

0,05 --> 1000 --> 0000

0,05 --> 1000 --> 0000

0,05 --> 1001 --> 01111

0,05 --> 1001 --> 01111

0,05 --> 1110 --> 01110

0,05 --> 1110 --> 01110

I think the problem might be caused in my function for generating huffman codes, since strcat()function's behaviour was initially not good for my idea, so I combined it with strcat(). Not sure if it is good that way tho.

我认为问题可能出在我生成霍夫曼代码的函数中,因为strcat()函数的行为最初对我的想法不利,所以我将它与strcat()结合起来。不知道这样是否好。

I am providing you with two functions responsible for codes assign, build_huffman_tree()and generate_huffman_tree(), hopefully you can help me out with this, and point out where the problem could be.

我为您提供了两个负责代码分配的函数,build_huffman_tree()generate_huffman_tree(),希望您能帮我解决这个问题,并指出问题所在。

Generate guffman tree:

生成古夫曼树:

void generate_huffman_tree(node *n, char *code){
if(n->left== NULL && n->right== NULL){
    SYMBOLS[code_counter] = n->symbol; // this 3 lines just store current code, not important
    CODES[code_counter] = strdup(code);
    code_counter += 1;
}
if(n->left!= NULL){
    char temp[100];
    strcpy(temp, code);
    strcat(temp, "0");
    generate_huffman_tree(n->left, temp);
}
if(n->right!= NULL){
    char temp[100];
    strcpy(temp, code);
    strcat(temp, "1");
    generate_huffman_tree(n->right, temp);
}

Build Huffman tree:

构建霍夫曼树:

node *build_huffman_tree(double *probabilities){

int num_of_nodes = NUM_SYMBOLS;
int num = NUM_SYMBOLS;

// 1) Initialization: Create new node for every probability
node *leafs = (node*) malloc(num_of_nodes*sizeof(node));
int i;
for(i=0; i<num_of_nodes; i+=1){
    node c;
    c.probability= *(probability+ i);
    c.symbol= *(SYMBOLS + i);
    c.left= NULL;
    c.right= NULL;
    *(leafs+i) = c;
}

node *root= (node*) malloc(sizeof(node)); // Root node which will be returned

while(num_of_nodes> 1){

    // 2) Find 2 nodes with lowest probabilities
    node *min_n1= (node*)malloc(sizeof(node));
    node *min_n2 = (node*)malloc(sizeof(node));

    *min_n1 = *find_min_node(leafs, num, min_n1);
    leafs = remove_node(leafs, min_n1, num); 
    num -= 1;

    *min_n2= *find_min_node(leafs, num, min_n2);
    leafs = remove_node(leafs, min_n2, num);
    num -= 1;

    // 3) Create parent node, and assign 2 min nodes as its children
            // add parent node to leafs, while its children have been removed from leafs
    node *new_node = (node*) malloc(sizeof(node));
    new_node->probabilty= min_n1->probability + min_n2->probability;
    new_node->left= min_n1;
    new_node->right= min_n2;

    leafs = add_node(leafs, new_node, num);
    num += 1;

    num_of_nodes -= 1;

    root = new_node;
}

return root;

I have tested functions for finding 2 min nodes, removing and adding nodes to leafs structure, and it is proven to work fine, so I guess the problem should be something about this here.

我已经测试了用于查找 2 分钟节点、删除节点并将节点添加到叶子结构的功能,并且证明它可以正常工作,所以我想问题应该在这里。

回答by Mark Adler

I didn't look at your source code, but there's nothing wrong with the Huffman code you generated. There is also nothing wrong with what you are calling "correct huffman codes". There is more than one valid Huffman code possible with that set of probabilities. If you take the sum of the probabilities times the bit lengths for both Huffman codes, you will find that those sums are exactly the same. Both Huffman codes are optimal, even though they're different.

我没有查看您的源代码,但是您生成的霍夫曼代码没有任何问题。您所说的“正确的霍夫曼代码”也没有错。对于这组概率,可能有不止一个有效的霍夫曼代码。如果将两个霍夫曼码的概率之和乘以位长,您会发现这些和完全相同。两个霍夫曼代码都是最优的,即使它们不同。

The way this happens is that when you look for the two lowest frequencies, there is more than one choice. Depending on which choice you make, you will get a different tree.

发生这种情况的方式是,当您寻找两个最低频率时,有不止一种选择。根据您做出的选择,您将获得不同的树。

回答by Mohamed Ennahdi El Idrissi

This code below is an implementation of Mark Allen Weiss's Algorithm. Give it a try!

下面的代码是 Mark Allen Weiss 算法的实现。试一试!

It offers routines similar to yours, in addition to a function that displays the result according to the previously constituted codes for each letter.

它提供了与您类似的例程,此外还有一个根据每个字母先前构成的代码显示结果的功能。

The compiler used is MinGW 2.95 (C-Free 4.0).

使用的编译器是 MinGW 2.95 (C-Free 4.0)。

Prerequisites:

先决条件:

An input file with a text (any, but remember, it deals with alphabet characters only, no punctuation, no space, no numbers). The constant IN_PATH is the one you should modify to point at the right location of your text file to run the program successfully.

带有文本的输入文件(任何,但请记住,它只处理字母字符,没有标点符号,没有空格,没有数字)。常量 IN_PATH 是您应该修改以指向文本文件的正确位置以成功运行程序的常量。

The image shows a sample text, the letters proportions and the result of huffman code interpretation (letters separated by one space).

图像显示了示例文本、字母比例和霍夫曼代码解释的结果(字母由一个空格分隔)。

Good luck!

祝你好运!

//*******************************************************************
// File:            HuffmanEncoding - Tree.c
// Author(s):       Mohamed Ennahdi El Idrissi
// Date:            14-Aug-2012
// 
// Input Files:     in.txt
// Output Files:    out.txt
// Description:     CSC 2302 - <Data Structures>
//                             <Struct, Array, File I/O, Recursion, Pointers, Binary Tree>
//                  This program covers the Huffman Encoding concept.
//                  We first read a file, from we which we count the number of characters, and then reckon the frequency
//                  of each letter individually. Each letter's frequency is stored in a node with its respective character.
//                  This node is stored in an array of 26 elements (element 0 -> 'A', element 1 -> 'B'...element 25 -> 'Z').
//                  Each element is a pointer, and each pointer is supposed to be a root of a tree (sub tree).
//                  After processing all characters of the text (read from a file), we end up with an array with
//                  25 NULL elements. The only element that is not NULL is the root of the tree that gathers the different
//                  nodes of each letter.
//                  Deducing the encoding of each letter if performed with intermediary of the prefix traversal.
//                  To summarize, the pseudo-code is:
//                      - Initialize the letters array
//                      - Read file
//                          - Increment each letter frequency + compute the number of characters in the file
//                          - Store in the array's node the frequency of each letter
//                      - Compute the number (N) of involved characters (Sometimes, texts don't include all letters. In our case 'Q' and 'Z' are absent).
//                      - Loop N times
//                          - find Minimum and second minimum
//                          - create a new node, its left child contains the minimum and the right child contains the second minimum
//                          - minimum position points on the new node, and the second minimum's array position points on NULL
//                      - Browse the array till the unique non NULL element is encountered
//                          - invoke prefix traversal function
//                              - build the encoding of each character
//                              - display the letter and its characteristics when found.
//                      - Finally, read the output file to interpret its content
//                          - if root contains a character (A - Z), display character
//                          - else, if the current character is '0', browse the left leaf
//                          - else, if the current character is '1', browse the right leaf
//                          
//*******************************************************************
#include <stdio.h>

#define NBR_OF_LETTERS 26

#define LEFT 'L'
#define RIGHT 'R'

#define CODE_SIZE 128

#define TYPED_ALLOC(type) (type *) malloc( sizeof(type) )
#define BYTE_SIZE 8

#define IN_PATH "./files/in.txt"
#define OUT_PATH "./files/out.txt"

typedef struct tree_node_s {
    float frequency;
    char c;
    char code[CODE_SIZE];
    struct tree_node_s *left;
    struct tree_node_s *right;
} tree_node_t;

tree_node_t *arr[NBR_OF_LETTERS], *letters[NBR_OF_LETTERS];


void findMinAndSecondMin(tree_node_t **, float *, int *, float *, int *);
void printTree(tree_node_t *);
void interpret(char *, int *, tree_node_t *);
void printTree(tree_node_t *);
void encode(tree_node_t *, tree_node_t **, char, short, char*);

/*
 *
 */
int main() {
    char str[CODE_SIZE];
    int fileReadingVerdict;
    int i, j, k, index, n;
    float min, secondMin;
    int minIndex, secondMinIndex;
    int numberOfCharacters = 0;
    tree_node_t *tree;
    FILE *in = fopen(IN_PATH, "r");
    FILE *out;
    if ( in == NULL ) {
        printf("\nFile not found");
        return 0;
    } else {
        /*
         *  Begin: Array Initialization
         */
        for (i = 'A'; i <= 'Z'; i++) {
            index = i - 'A';
            arr[index] = NULL;
        }
        /*
         *  End:    Array Initialization
         */
        numberOfCharacters = 0;
        fileReadingVerdict = fgets(str, CODE_SIZE, in) != NULL;
        while(!feof(in) || fileReadingVerdict) {
            n = strlen(str);
            printf("\n%s", str);
            for (i = 0; i < n ; i ++ ) {
                str[i] = toupper(str[i]);
                if (str[i] >= 'A' && str[i] <= 'Z') {
                    numberOfCharacters ++;
                    index = str[i] - 'A';
                    if (arr[index] == NULL) {
                        arr[index] = TYPED_ALLOC(tree_node_t);// malloc(sizeof(tree_node_t));
                        arr[index]->c = str[i];
                        arr[index]->frequency = 1;
                        arr[index]->left = arr[index]->right = NULL;
                    } else {
                        arr[index]->frequency += 1;
                    }
                }
            }
            if (fileReadingVerdict) {
                fileReadingVerdict = fgets(str, CODE_SIZE, in) != NULL;
            }
        }
    }   
    fclose(in);

    for ( i = 0, n = 0 ; i < NBR_OF_LETTERS ; i ++ ) {
        letters[i] = arr[i];
        if (arr[i] != NULL) {
            arr[i]->frequency /= numberOfCharacters;    // Computing the frequency.
            n ++;                                       // n is the number of involved letters which is going to be consumed in the do while loop's condition
        }
    }

    j = 1;
    do {
        findMinAndSecondMin(arr, &min, &minIndex, &secondMin, &secondMinIndex);

        if (minIndex != -1 && secondMinIndex != -1 && minIndex != secondMinIndex) {
            tree_node_t *temp;
            tree = TYPED_ALLOC(tree_node_t);// malloc(sizeof(tree_node_t));
            tree->frequency = arr[minIndex]->frequency + arr[secondMinIndex]->frequency;
            tree->c = j;
            tree->left = arr[minIndex];
            temp = TYPED_ALLOC(tree_node_t);// malloc(sizeof(tree_node_t));
            temp->c = arr[secondMinIndex]->c;
            temp->frequency = arr[secondMinIndex]->frequency;
            temp->left = arr[secondMinIndex]->left;
            temp->right = arr[secondMinIndex]->right;
            tree->right = temp;

            arr[minIndex] = tree;

            arr[secondMinIndex] = NULL;
        } 
        j ++;
    } while( j < n );

    for ( i = 0 ; i < NBR_OF_LETTERS ; i ++ ) {
        if (arr[i] != NULL)  {
            char code[CODE_SIZE];
            strcpy(code, "");
            encode(tree = arr[i], letters, 0, 0, code);
            puts("\nSuccessful encoding");
            printTree(arr[i]);
            break;
        }
    }
    in = fopen(IN_PATH, "r");
    out = fopen(OUT_PATH, "w");
    fileReadingVerdict = fgets(str, CODE_SIZE, in) != NULL;
    while(!feof(in) || fileReadingVerdict) {
        n = strlen(str);
        for (i = 0; i < n ; i ++ ) {
            str[i] = toupper(str[i]);
            if (str[i] >= 'A' && str[i] <= 'Z') {
                index = str[i] - 'A';
                fputs(letters[index]->code, out);
            }
        }
        if (fileReadingVerdict) {
            fileReadingVerdict = fgets(str, CODE_SIZE, in) != NULL;
        }
    }

    fclose(in);
    fclose(out);

    printf("\nFile size (only letters) of the input file:   %d bits", numberOfCharacters * BYTE_SIZE);

    out = fopen(OUT_PATH, "r");
    fileReadingVerdict = fgets(str, CODE_SIZE, out) != NULL;
    numberOfCharacters = 0;
    while(!feof(out) || fileReadingVerdict) {
        numberOfCharacters += strlen(str);
        if (fileReadingVerdict) {
            fileReadingVerdict = fgets(str, CODE_SIZE, out) != NULL;
        }
    }
    fclose(out);    

    printf("\nFile size of the output file: %d bits", numberOfCharacters);

    printf("\nInterpreting output file:\n");
    out = fopen(OUT_PATH, "r");
    fileReadingVerdict = fgets(str, CODE_SIZE, out) != NULL;
    while(!feof(out) || fileReadingVerdict) {
        n = strlen(str);
        i = 0 ;
        while(i < n) {
            interpret(str, &i, tree);
        }
        if (fileReadingVerdict) {
            fileReadingVerdict = fgets(str, CODE_SIZE, out) != NULL;
        }
    }
    fclose(out);

    puts("\n");
    return 0;
}
/*
 *
 */
void encode(tree_node_t *node, tree_node_t **letters, char direction, short level, char* code) {
    int n;
    if ( node != NULL ) {
        if ((n = strlen(code)) < level) {
            if (direction == RIGHT) {
                strcat(code, "1");
            } else {
                if (direction == LEFT) {
                    strcat(code, "0");
                }
            }
        } else {
            if (n >= level) {
                code[n - (n - level) - 1] = 0;
                if (direction == RIGHT) {
                    strcat(code, "1");
                } else {
                    if (direction == LEFT) {
                        strcat(code, "0");
                    }
                }   
            }
        }
        if (node->c >= 'A' && node->c <= 'Z') {
            strcpy(node->code, code);
            strcpy(letters[node->c - 'A']->code, code);
        }
        encode(node->left, letters, LEFT, level + 1, code);
        encode(node->right, letters, RIGHT, level + 1, code);
    }
}

void printTree(tree_node_t *node) {
    int n;
    if ( node != NULL ) {
        if (node->c >= 'A' && node->c <= 'Z') {
            printf("\t%c - frequency: %.10f\tencoding: %s\n", node->c, node->frequency, node->code);
        }
        printTree(node->left);
        printTree(node->right);
    }
}

/*
 *  Begin:  Minimum and second minimum
 */
void findMinAndSecondMin(tree_node_t *arr[], float *min, int *minIndex, float *secondMin, int *secondMinIndex) {
    int i, k;
    k = 0;
    *minIndex = -1;
    /*
     * Skipping all the NULL elements.
     */     
    while (k < NBR_OF_LETTERS && arr[k] == NULL) k++;

    *minIndex = k;
    *min = arr[k]->frequency;

    for ( i = k ; i < NBR_OF_LETTERS; i ++ ) {
        if ( arr[i] != NULL && arr[i]->frequency < *min ) {
            *min = arr[i]->frequency;
            *minIndex = i;
        }
    }

    k = 0;
    *secondMinIndex = -1;
    /*
     * Skipping all the NULL elements.
     */         
    while ((k < NBR_OF_LETTERS && arr[k] == NULL) || (k == *minIndex && arr[k] != NULL)) k++;

    *secondMin = arr[k]->frequency;
    *secondMinIndex = k;

    if (k == *minIndex) k ++;

    for ( i = k ; i < NBR_OF_LETTERS; i ++ ) {
        if ( arr[i] != NULL && arr[i]->frequency < *secondMin && i != *minIndex ) {
            *secondMin = arr[i]->frequency;
            *secondMinIndex = i;
        }
    }
    /*
     *  End:    Minimum and second minimum
     */
}

void interpret(char *str, int *index, tree_node_t *tree) {
    int n = strlen(str);
    if (tree->c >= 'A' && tree->c <= 'Z') {
        printf("%c ", tree->c);
        return ;
    } else {
        if ( *index < n ) {
            if (str[*index] == '0') {
                (*index) ++;
                interpret(str, index, tree->left);
            } else {
                if (str[*index] == '1') {
                    (*index) ++;
                    interpret(str, index, tree->right);
                }
            }
        }
    }
}

enter image description here

在此处输入图片说明