Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, link the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/5534070/
Effectively compress strings of 10-1000 characters in Java?
Asked by sanity
I need to compress strings (written in a known but variable language) of anywhere from 10 to 1000 characters into individual UDP packets.
What compression algorithms available in Java are well suited to this task?
Are there maybe open source Java libraries available to do this?
Answered by
"It depends".
I would start with just the primary candidates: LZMA ("7-zip"), deflate (direct; zlib: deflate + small wrapper; gzip: deflate + slightly larger wrapper; zip: deflate + even larger wrapper), bzip2 (I doubt this would be that good here, as it works best with a relatively large window), perhaps even one of the other LZ* branches like LZS, which has an RFC for IP payload compression, but...
...run some analysis based upon the actual data and compression/throughput using several different approaches. Java has both GZIPOutputStream ("deflate in gzip wrapper") and DeflaterOutputStream ("plain deflate", recommended over the gzip or zip "wrappers") as standard, and there are LZMA Java implementations (you just need the compressor, not the container), so these should all be trivial to mock up.
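To make that comparison concrete, here is a minimal sketch measuring deflate vs. gzip output sizes with the JDK classes mentioned above (the sample string is just an illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

public class CompressionComparison {
    // Plain deflate via DeflaterOutputStream (small zlib wrapper by default).
    static byte[] deflate(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(bos)) {
            dos.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    // Deflate inside the larger gzip wrapper (10-byte header + 8-byte trailer).
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gos = new GZIPOutputStream(bos)) {
            gos.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String sample = "the quick brown fox jumps over the lazy dog, the quick brown fox";
        System.out.println("original: " + sample.getBytes(StandardCharsets.UTF_8).length
                + ", deflate: " + deflate(sample).length
                + ", gzip: " + gzip(sample).length);
    }
}
```

For packet-sized strings the wrapper difference alone is visible in the output sizes, which is exactly why measuring on real data matters.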
If there is regularity between the packets then it is possible this could be utilized -- e.g. build cache mappings, Huffman tables, or just modify the "window" of one of the other algorithms -- but packet loss and "de-compressibility" likely need to be accounted for. Going down this route, though, adds far more complexity. More ideas for helping out the compressor may be found at SO: How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?
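A rough sketch of the setDictionary idea from the linked question, using the JDK's Deflater/Inflater. The dictionary contents here are hypothetical stand-ins for strings that recur across packets, and both endpoints must share the exact same bytes:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionary {
    // Hypothetical shared dictionary: substrings that frequently appear in packets.
    static final byte[] DICT =
            "\"device_id\":\"temperature\":\"humidity\":\"status\":".getBytes(StandardCharsets.UTF_8);

    static byte[] compress(String s) {
        Deflater def = new Deflater();
        def.setDictionary(DICT);             // must be set before compressing any input
        def.setInput(s.getBytes(StandardCharsets.UTF_8));
        def.finish();
        byte[] buf = new byte[1024];
        int n = def.deflate(buf);
        def.end();
        return Arrays.copyOf(buf, n);
    }

    static String decompress(byte[] data) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(data);
        byte[] out = new byte[4096];
        int n = inf.inflate(out);            // returns 0 with needsDictionary() == true first
        if (inf.needsDictionary()) {
            inf.setDictionary(DICT);
            n = inf.inflate(out);
        }
        inf.end();
        return new String(out, 0, n, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws DataFormatException {
        String msg = "{\"device_id\":42,\"temperature\":21.5,\"humidity\":0.4}";
        byte[] packed = compress(msg);
        System.out.println(msg.length() + " -> " + packed.length
                + " bytes; round-trip ok: " + msg.equals(decompress(packed)));
    }
}
```

If the dictionary ever changes, every receiver must be updated in lockstep, which is part of the extra complexity the answer warns about.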
Also, the protocol should likely have a simple "fall back" of zero compression, because some [especially small random] data might not be practically compressible or might "compress" to a larger size (zlib actually has this guard, but it also has the "wrapper overhead", so it would be better encoded separately for very small data). The overhead of the "wrapper" for the compressed data -- such as gzip or zip -- also needs to be taken into account at such small sizes. This is especially important to consider for string data of less than ~100 characters.
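The fall-back could look something like this sketch, which prefixes a one-byte token and sends the raw bytes whenever deflate does not actually shrink the payload (the method and constant names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class MaybeCompress {
    static final byte RAW = 0, COMPRESSED = 1;

    // Compress, then keep whichever payload (raw or compressed) is smaller.
    static byte[] encode(String s) throws IOException {
        byte[] raw = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(bos)) {
            dos.write(raw);
        }
        byte[] packed = bos.toByteArray();
        boolean useCompressed = packed.length < raw.length;
        byte[] body = useCompressed ? packed : raw;
        byte[] out = new byte[body.length + 1];
        out[0] = useCompressed ? COMPRESSED : RAW;   // one-byte token
        System.arraycopy(body, 0, out, 1, body.length);
        return out;
    }

    static String decode(byte[] packet) throws IOException {
        if (packet[0] == RAW) {
            return new String(packet, 1, packet.length - 1, StandardCharsets.UTF_8);
        }
        ByteArrayInputStream in = new ByteArrayInputStream(packet, 1, packet.length - 1);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (InflaterInputStream iis = new InflaterInputStream(in)) {
            byte[] buf = new byte[512];
            int n;
            while ((n = iis.read(buf)) > 0) bos.write(buf, 0, n);
        }
        return bos.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        System.out.println("\"abc\" token: " + encode("abc")[0]);  // tiny input stays raw
    }
}
```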
Happy coding.
Another thing to consider is the encoding used to shove the characters into the output stream. I would first start with UTF-8, but that may not always be ideal.
See SO: Best compression algorithm for short text strings, which suggests SMAZ, but I do not know how well this algorithm would transfer to unicode / binary.
Also consider that not all deflate (or other format) implementations are created equal. I am not privy to how Java's standard deflate compares to a 3rd party's (say JZlib) in terms of efficiency for small data, but consider Compressing Small Payloads [.NET], which shows rather negative numbers for "the same compression" format. The article also ends nicely:
...it's usually most beneficial to compress anyway, and determine which payload (the compressed or the uncompressed one) has the smallest size and include a small token to indicate whether decompression is required.
My final conclusion: always test using real-world data and measure the benefits, or you might be in for a little surprise in the end!
Happy coding. For real this time.
Answered by MeBigFatGuy
The simplest thing to do would be to layer a GZIPOutputStream on top of a ByteArrayOutputStream, as that is built into the JDK, using
ByteArrayOutputStream baos = new ByteArrayOutputStream();
GZIPOutputStream zos = new GZIPOutputStream(baos);
zos.write(someText.getBytes(StandardCharsets.UTF_8)); // pick an explicit charset
zos.finish();
zos.close();
byte[] udpBuffer = baos.toByteArray();
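On the receiving side, the same JDK classes reverse the process; here is a minimal round-trip sketch (the method names are my own):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    // Sender side: string -> gzip-compressed bytes, ready for a UDP buffer.
    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream zos = new GZIPOutputStream(baos)) {
            zos.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return baos.toByteArray();
    }

    // Receiver side: gzip-compressed bytes -> original string.
    static String decompress(byte[] udpBuffer) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream zis = new GZIPInputStream(new ByteArrayInputStream(udpBuffer))) {
            byte[] buf = new byte[512];
            int n;
            while ((n = zis.read(buf)) > 0) out.write(buf, 0, n);
        }
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        System.out.println(decompress(compress("example payload")));
    }
}
```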
There may be other algorithms that do a better job, but I'd try this first to see if it fits your needs, as it doesn't require any extra jars and does a pretty good job.
Answered by Peter Lawrey
Most standard compression algorithms don't work so well with small amounts of data. Often there is a header and a checksum, and it takes time for the compression to warm up, i.e. build a data dictionary based on the data it has seen.
For this reason you can find that
出于这个原因,你可以发现
- small packets may end up smaller than, or the same size as, the uncompressed ones.
- a simple application/protocol-specific compression is better.
- you have to provide a prebuilt data dictionary to the compression algorithm and strip out the headers as much as possible.
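For the third point, one way to strip wrapper overhead in Java is raw deflate via the `nowrap` flag, which drops the 2-byte zlib header and 4-byte Adler-32 trailer. A sketch (the helper names are illustrative; the caller must know an upper bound on the decompressed size):

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class RawDeflate {
    // nowrap = true omits the zlib header and checksum, saving 6 bytes per packet.
    static byte[] compress(byte[] input) {
        Deflater def = new Deflater(Deflater.BEST_COMPRESSION, true);
        def.setInput(input);
        def.finish();
        byte[] buf = new byte[input.length * 2 + 64];
        int n = def.deflate(buf);
        def.end();
        return Arrays.copyOf(buf, n);
    }

    static byte[] decompress(byte[] data, int maxLen) throws DataFormatException {
        Inflater inf = new Inflater(true);   // must match the nowrap flag
        // Per the Inflater javadoc, nowrap mode needs an extra "dummy" input byte.
        inf.setInput(Arrays.copyOf(data, data.length + 1));
        byte[] out = new byte[maxLen];
        int n = inf.inflate(out);
        inf.end();
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".getBytes();
        byte[] packed = compress(input);
        System.out.println(input.length + " -> " + packed.length + " bytes (no wrapper)");
    }
}
```

Losing the Adler-32 checksum is usually acceptable here because UDP already carries its own checksum.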
I usually go with the second option for small data packets.
Answered by Dmitry Bryliuk
A good compression algorithm for short strings/URLs is an LZW implementation; it is in Java and can easily be ported for client-side GWT: https://code.google.com/p/lzwj/source/browse/src/main/java/by/dev/madhead/lzwj/compress/LZW.java
Some remarks:
- use a 9-bit code word length for small strings (though you may experiment to see which works better). The compression ratio ranges from 1 (very small strings; the compressed output is no larger than the original) down to 0.5 (larger strings).
- in the case of client-side GWT, for other code word lengths it was necessary to adjust input/output processing to work per byte, to avoid bugs when buffering the bit sequence into a long, which is emulated for JS.
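For reference, the core of LZW itself is small. Here is a minimal sketch of the algorithm the linked library implements, emitting integer codes (the library additionally packs these into 9-bit words, which is not shown here):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Lzw {
    // Classic LZW; codes 0-255 are the single-character entries.
    static List<Integer> compress(String text) {
        Map<String, Integer> dict = new HashMap<>();
        for (int i = 0; i < 256; i++) dict.put(String.valueOf((char) i), i);
        int nextCode = 256;
        List<Integer> out = new ArrayList<>();
        String w = "";
        for (char c : text.toCharArray()) {
            String wc = w + c;
            if (dict.containsKey(wc)) {
                w = wc;                      // extend the current match
            } else {
                out.add(dict.get(w));        // emit code for the longest match
                dict.put(wc, nextCode++);    // grow the dictionary as we go
                w = String.valueOf(c);
            }
        }
        if (!w.isEmpty()) out.add(dict.get(w));
        return out;
    }

    static String decompress(List<Integer> codes) {
        Map<Integer, String> dict = new HashMap<>();
        for (int i = 0; i < 256; i++) dict.put(i, String.valueOf((char) i));
        int nextCode = 256;
        String w = dict.get(codes.get(0));
        StringBuilder sb = new StringBuilder(w);
        for (int i = 1; i < codes.size(); i++) {
            int k = codes.get(i);
            // The "code not yet in dictionary" case is the classic LZW corner case.
            String entry = dict.containsKey(k) ? dict.get(k) : w + w.charAt(0);
            sb.append(entry);
            dict.put(nextCode++, w + entry.charAt(0));
            w = entry;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Integer> codes = compress("TOBEORNOTTOBEORTOBEORNOT");
        System.out.println(codes.size() + " codes; round-trip: " + decompress(codes));
    }
}
```

With a fixed 9-bit packing, dictionary growth is capped at 512 entries, which is plenty for strings of 10-1000 characters.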
I'm using it for encoding complex URL parameters in client-side GWT, together with Base64 encoding and AutoBean serialization to JSON.
upd: a Base64 implementation is here: http://www.source-code.biz/base64coder/java. You have to change it to make it URL-safe, i.e. change the following characters: