How to compress a String in Java?
Source: http://stackoverflow.com/questions/3649485/
This page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by user421851
I use GZIPOutputStream or ZipOutputStream to compress a String (my string.length() is less than 20), but the compressed result is longer than the original string.
On another site, some friends said this is because my original string is too short, and that GZIPOutputStream is better suited to compressing longer strings.
So, can somebody help me compress a String?
My function looks like this:
String compress(String original) throws Exception {
}
Update:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// ZipUtil
public class ZipUtil {
    public static String compress(String str) throws IOException {
        if (str == null || str.length() == 0) {
            return str;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the GZIP stream writes the trailer, so close it before reading the buffer.
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(str.getBytes(StandardCharsets.UTF_8));
        }
        // ISO-8859-1 maps every byte to exactly one char, so no compressed byte is lost.
        return out.toString("ISO-8859-1");
    }

    public static void main(String[] args) throws IOException {
        String string = "admin";
        System.out.println("after compress:");
        System.out.println(ZipUtil.compress(string));
    }
}
The result is:
Accepted answer by JesperE
Compression algorithms almost always have some form of space overhead, which means that they are only effective when compressing data which is sufficiently large that the overhead is smaller than the amount of saved space.
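To make the overhead concrete, here is a minimal sketch that compares the byte count of a short string before and after GZIP. The class name GzipOverheadDemo and its gzip helper are made up for illustration. The GZIP container alone adds a 10-byte header and an 8-byte trailer, so a 5-byte input cannot come out smaller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of the original question.
final class GzipOverheadDemo {
    // Returns the raw GZIP bytes for the given text.
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String original = "admin";
        byte[] compressed = gzip(original);
        System.out.println("original bytes:   " + original.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("compressed bytes: " + compressed.length);
    }
}
```

Running it should show the compressed size well above the original 5 bytes, which matches what the question observed.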
Compressing a string which is only 20 characters long is not easy, and it is not always possible. If the string contains repetition, Huffman coding or simple run-length encoding might be able to compress it, but probably not by very much.
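As a rough illustration of run-length encoding, the sketch below writes each run as a count followed by the character. The class RunLengthSketch is hypothetical, and it assumes the input contains no digits (otherwise the output would be ambiguous to decode).

```java
// Hypothetical sketch of run-length encoding; assumes the input has no digit characters.
final class RunLengthSketch {
    static String encode(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int run = 1;
            // Extend the run while the following characters are identical.
            while (i + run < input.length() && input.charAt(i + run) == c) {
                run++;
            }
            out.append(run).append(c);
            i += run;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("aaaaaaaaaabbbbbbbbbb")); // 20 chars shrink to "10a10b"
        System.out.println(encode("admin"));                // no repetition: grows to "1a1d1m1i1n"
    }
}
```

The second call shows why this only helps when there is repetition: a string without runs doubles in length.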
Answer by Matthew Flaschen
Your friend is correct. Both gzip and ZIP are based on DEFLATE. This is a general purpose algorithm, and is not intended for encoding small strings.
If you need this, a possible solution is a custom pair of encoding and decoding HashMap<String, String> instances. This allows a simple one-to-one mapping:
HashMap<String, String> toCompressed, toUncompressed;
String compressed = toCompressed.get(uncompressed);
// ...
String uncompressed = toUncompressed.get(compressed);
Clearly, this requires setup, and is only practical for a small number of strings.
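One way the setup could look is sketched below. The class CodebookSketch, its register method, and the choice of a base-36 counter for the short codes are all assumptions made for illustration; the answer only specifies the two maps.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical codebook: every known string gets a short code, and both maps stay in sync.
final class CodebookSketch {
    private final Map<String, String> toCompressed = new HashMap<>();
    private final Map<String, String> toUncompressed = new HashMap<>();

    // Registers a known string and assigns the next free code (a base-36 counter).
    void register(String value) {
        if (toCompressed.containsKey(value)) {
            return;
        }
        String code = Integer.toString(toCompressed.size(), 36);
        toCompressed.put(value, code);
        toUncompressed.put(code, value);
    }

    // Returns null when the string was never registered.
    String compress(String value) {
        return toCompressed.get(value);
    }

    String decompress(String code) {
        return toUncompressed.get(code);
    }

    public static void main(String[] args) {
        CodebookSketch codebook = new CodebookSketch();
        codebook.register("administrator");
        codebook.register("guest");
        String code = codebook.compress("administrator");
        System.out.println(code + " -> " + codebook.decompress(code));
    }
}
```

Both sides need the same codebook, and strings that were never registered cannot be encoded at all.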
Answer by Noel M
Huffman coding might help, but only if a few characters occur very frequently in your small String.
Answer by Jon Freedman
When you create a String, you can think of it as a list of chars. This means that each character in your String must be able to take any possible value of char. From the Sun docs:
char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
If you only need to support a reduced set of characters, you can write a simple compression algorithm, which is analogous to binary->decimal->hex radix conversion. You go from 65,536 (or however many characters your target system supports) to 26 (alphabetical) / 36 (alphanumeric) etc.
I've used this trick a few times, for example encoding timestamps as text (target 36 +, source 10) - just make sure you have plenty of unit tests!
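The timestamp case can be sketched with the JDK's built-in radix support. The class RadixSketch and the sample timestamp are made up for illustration; Long.toString and Long.parseLong with a radix argument are standard.

```java
// Hypothetical sketch: a decimal timestamp re-encoded in base 36 (digits 0-9 plus a-z).
final class RadixSketch {
    static String encode(long value) {
        return Long.toString(value, Character.MAX_RADIX); // MAX_RADIX is 36
    }

    static long decode(String text) {
        return Long.parseLong(text, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        long timestamp = 1283731200000L; // illustrative value: 13 decimal digits
        String shortForm = encode(timestamp);
        System.out.println(shortForm + " (" + shortForm.length() + " chars)");
        System.out.println(decode(shortForm) == timestamp);
    }
}
```

The 13-digit decimal value fits in 8 base-36 characters, because each output character carries more information than a decimal digit.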
Answer by YoK
You don't see any compression for your String because you need at least a couple of hundred bytes for GZIPOutputStream or ZipOutputStream to achieve real compression. Your String is too small. (I don't understand why you need compression for it.)
See the conclusion of this article:
The article also shows how to compress and decompress data on the fly in order to reduce network traffic and improve the performance of your client/server applications. Compressing data on the fly, however, improves the performance of client/server applications only when the objects being compressed are more than a couple of hundred bytes. You would not be able to observe improvement in performance if the objects being compressed and transferred are simple String objects, for example.
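The size threshold can be checked with a small experiment. The sketch below gzips the same 20-character phrase repeated a growing number of times; the class GzipBreakEvenSketch, its helpers, and the sample phrase are assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical experiment: how the gzipped size changes as the input grows.
final class GzipBreakEvenSketch {
    static int gzipSize(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }

    // Builds the unit repeated the given number of times (avoids String.repeat for older JDKs).
    static String repeat(String unit, int times) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < times; i++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String unit = "the quick brown fox "; // 20 characters
        for (int times : new int[] {1, 5, 25}) {
            String text = repeat(unit, times);
            System.out.println(text.length() + " bytes in, " + gzipSize(text) + " bytes out");
        }
    }
}
```

The single 20-character input comes out larger than it went in, while the 500-character input shrinks. Repeated text is a best case, so real data will usually need more input before it breaks even.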
Answer by Benoit Courtine
The ZIP algorithm (DEFLATE) combines LZ77, a dictionary coder closely related to LZW, with Huffman trees. You can use one of these kinds of algorithm separately.
The compression is based on 2 factors:
- the repetition of substrings in your original string (LZW): if there are many repetitions, the compression will be efficient. This algorithm performs well on long plain text, since words are often repeated
- the frequency of each character in the compressed string (Huffman): the more unbalanced the distribution of characters, the more efficient the compression
In your case, you should try the LZW algorithm only. In its basic form, the string can be compressed without adding any metadata, which is probably better for compressing short strings.
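A basic LZW encoder can be sketched as follows. The class LzwSketch is hypothetical, it assumes every input character is below 256, and it returns the codes as a list of integers rather than packing them into bits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical basic LZW encoder; assumes all characters are in the range 0-255.
final class LzwSketch {
    static List<Integer> compress(String input) {
        // The dictionary starts with one entry per single character.
        Map<String, Integer> dictionary = new HashMap<>();
        for (int i = 0; i < 256; i++) {
            dictionary.put(String.valueOf((char) i), i);
        }
        List<Integer> codes = new ArrayList<>();
        String current = "";
        for (char c : input.toCharArray()) {
            if (c > 255) {
                throw new IllegalArgumentException("character outside 0-255: " + c);
            }
            String extended = current + c;
            if (dictionary.containsKey(extended)) {
                current = extended; // keep growing the match
            } else {
                codes.add(dictionary.get(current));
                dictionary.put(extended, dictionary.size()); // learn the new substring
                current = String.valueOf(c);
            }
        }
        if (!current.isEmpty()) {
            codes.add(dictionary.get(current));
        }
        return codes;
    }

    public static void main(String[] args) {
        System.out.println(compress("TOBEORNOTTOBEORTOBEORNOT").size() + " codes for 24 chars");
        System.out.println(compress("admin").size() + " codes for 5 chars");
    }
}
```

The repetitive 24-character input needs only 16 codes, but "admin" still needs 5 codes for 5 characters. Since each code needs at least 9 bits once the dictionary grows past 256 entries, a short string without repeats does not shrink.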
For the Huffman algorithm, the coding tree has to be sent with the compressed text. So, for a small text, the result can be larger than the original text, because of the tree.
Answer by Tom Anderson
Huffman encoding is a sensible option here. Gzip and friends do this, but the way they work is to build a Huffman tree for the input, send that, then send the data encoded with the tree. If the tree is large relative to the data, there may be no saving in size.
However, it is possible to avoid sending a tree: instead, you arrange for the sender and receiver to already have one. It can't be built specifically for every string, but you can have a single global tree used to encode all strings. If you build it from the same language as the input strings (English or whatever), you should still get good compression, although not as good as with a custom tree for every input.
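A minimal sketch of the shared-tree idea follows. The class SharedHuffmanSketch is hypothetical: it builds one tree from a sample text that sender and receiver are assumed to agree on in advance, and it represents bits as a String of '0' and '1' for readability instead of packing them. Ties are broken by creation order so both sides derive the same tree.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Hypothetical sketch: one Huffman tree built from an agreed sample text, reused for every string.
final class SharedHuffmanSketch {
    private static final class Node {
        final Character symbol; // null for internal nodes
        final int weight;
        final int id;           // creation order, used only to break ties deterministically
        final Node left;
        final Node right;

        Node(Character symbol, int weight, int id, Node left, Node right) {
            this.symbol = symbol;
            this.weight = weight;
            this.id = id;
            this.left = left;
            this.right = right;
        }
    }

    private final Map<Character, String> codes = new HashMap<>();
    private final Node root;

    SharedHuffmanSketch(String sampleText) {
        TreeMap<Character, Integer> counts = new TreeMap<>();
        for (char c : sampleText.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        if (counts.size() < 2) {
            throw new IllegalArgumentException("sample needs at least two distinct symbols");
        }
        PriorityQueue<Node> queue = new PriorityQueue<>((a, b) ->
                a.weight != b.weight ? Integer.compare(a.weight, b.weight) : Integer.compare(a.id, b.id));
        int nextId = 0;
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), nextId++, null, null));
        }
        // Repeatedly merge the two lightest nodes until a single tree remains.
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(null, a.weight + b.weight, nextId++, a, b));
        }
        root = queue.poll();
        assign(root, "");
    }

    private void assign(Node node, String prefix) {
        if (node.symbol != null) {
            codes.put(node.symbol, prefix);
            return;
        }
        assign(node.left, prefix + "0");
        assign(node.right, prefix + "1");
    }

    String encode(String text) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            String code = codes.get(c);
            if (code == null) {
                throw new IllegalArgumentException("symbol not in shared tree: " + c);
            }
            bits.append(code);
        }
        return bits.toString();
    }

    String decode(String bits) {
        StringBuilder out = new StringBuilder();
        Node node = root;
        for (char bit : bits.toCharArray()) {
            node = (bit == '0') ? node.left : node.right;
            if (node.symbol != null) { // reached a leaf: emit it and restart at the root
                out.append(node.symbol);
                node = root;
            }
        }
        return out.toString();
    }
}
```

A possible use would be `new SharedHuffmanSketch(sample).encode("admin")`, with the same sample text on the decoding side. Any character missing from the sample cannot be encoded, so the sample should cover the whole expected alphabet.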
Answer by Arne Deutsch
If the passwords are more or less "random", you are out of luck: you will not be able to get a significant reduction in size.
But: why do you need to compress the passwords? Maybe what you need is not compression, but some sort of hash value? If you just need to check whether a name matches a given password, you don't need to save the password; you can save a hash of it instead. To check whether a typed-in password matches a given name, you build the hash value the same way and compare it to the saved hash. As a hash (Object.hashCode()) is an int, you will be able to store all 20 password hashes in 80 bytes.
Answer by live-love
Take a look at the Huffman algorithm.
https://codereview.stackexchange.com/questions/44473/huffman-code-implementation
The idea is that each character is replaced with a sequence of bits whose length depends on the character's frequency in the text (the more frequent, the shorter the sequence).
You can read your entire text and build a table of codes, for example:
Symbol  Code
a       0
s       10
e       110
m       111
The algorithm builds a symbol tree based on the text input. The greater the variety of characters, the worse the compression will be.
But depending on your text, it could be effective.
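Using the example table from this answer (a=0, s=10, e=110, m=111), encoding and decoding can be sketched like this. The class CodeTableSketch is hypothetical, and bits are kept as a String of '0' and '1' for readability rather than packed into bytes.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch using the answer's example code table; handles only the symbols a, s, e, m.
final class CodeTableSketch {
    private static final Map<Character, String> CODES = new LinkedHashMap<>();
    private static final Map<String, Character> SYMBOLS = new HashMap<>();

    static {
        CODES.put('a', "0");
        CODES.put('s', "10");
        CODES.put('e', "110");
        CODES.put('m', "111");
        for (Map.Entry<Character, String> e : CODES.entrySet()) {
            SYMBOLS.put(e.getValue(), e.getKey());
        }
    }

    static String encode(String text) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            String code = CODES.get(c);
            if (code == null) {
                throw new IllegalArgumentException("no code for symbol: " + c);
            }
            bits.append(code);
        }
        return bits.toString();
    }

    // No code is a prefix of another, so the first match while reading bits is the right one.
    static String decode(String bits) {
        StringBuilder out = new StringBuilder();
        StringBuilder pending = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            pending.append(bit);
            Character symbol = SYMBOLS.get(pending.toString());
            if (symbol != null) {
                out.append(symbol);
                pending.setLength(0);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String bits = encode("same");
        System.out.println(bits + " (" + bits.length() + " bits)"); // 100111110 (9 bits)
        System.out.println(decode(bits));                           // same
    }
}
```

Here "same" takes 9 bits instead of 32 bits at one byte per character, but that count ignores the cost of storing or transmitting the table itself.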
Answer by rghome
If you know that your strings are mostly ASCII you could convert them to UTF-8.
byte[] bytes = string.getBytes("UTF-8");
This may reduce the memory size by about 50%, since each ASCII character takes one byte in UTF-8 instead of the two bytes of a Java char. However, you get a byte array out, not a String. If you are writing it to a file, though, that should not be a problem.
To convert back to a String:
private final Charset UTF8_CHARSET = Charset.forName("UTF-8");
...
String s = new String(bytes, UTF8_CHARSET);
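The round trip and the size difference can be checked with a short sketch. The class Utf8SizeSketch is hypothetical, and it uses StandardCharsets.UTF_8 (available since Java 7) in place of Charset.forName("UTF-8"), which is a substitution made here and not part of the answer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: UTF-8 byte count versus two bytes per Java char.
final class Utf8SizeSketch {
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "admin";
        byte[] bytes = toUtf8(text);
        System.out.println("as chars: " + (text.length() * 2) + " bytes"); // 10
        System.out.println("as UTF-8: " + bytes.length + " bytes");        // 5
        System.out.println(fromUtf8(bytes).equals(text));                  // true
    }
}
```

The saving only holds for ASCII-heavy text: characters outside ASCII take two or more bytes in UTF-8.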