Notice: this page is adapted from a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1082285/

Time: 2020-09-06 12:36:02  Source: igfitidea

Best compression algorithm for XML?

Tags: xml, algorithm, text, compression, zip

Asked by Aethex

I barely know a thing about compression, so bear with me (this is probably a stupid and painfully obvious question).

So let's say I have an XML file with a few tags.

<verylongtagnumberone>
  <verylongtagnumbertwo>
    text
  </verylongtagnumbertwo>
</verylongtagnumberone>

Now let's say I have a bunch of these very long tags, with many attributes, across multiple XML files. I need to compress them to the smallest size possible. The best way would be to use an XML-specific algorithm that assigns individual tags pseudonyms like vlt1 or vlt2. However, this wouldn't be as 'open' an approach as I'm going for, and I want to use a common algorithm like DEFLATE or LZ. It would also help if the archive were a .zip file.
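
To get a feel for how much hand-assigned pseudonyms would actually buy on top of a general-purpose algorithm, one could measure it directly. The sketch below is only an illustration: `make_doc` is a hypothetical helper that builds a synthetic document, and the element count of 200 is an arbitrary assumption. It uses Python's standard `zlib` module, which produces a DEFLATE stream.

```python
import zlib

def make_doc(outer: str, inner: str, n: int = 200) -> bytes:
    """Build a small synthetic XML document with n repeated elements."""
    rows = "".join(
        f"<{outer}><{inner}>text {i}</{inner}></{outer}>\n" for i in range(n)
    )
    return f"<root>\n{rows}</root>\n".encode("utf-8")

# Same structure, once with the long tag names and once with pseudonyms.
long_doc = make_doc("verylongtagnumberone", "verylongtagnumbertwo")
short_doc = make_doc("vlt1", "vlt2")

for label, doc in (("long tags", long_doc), ("short tags", short_doc)):
    packed = zlib.compress(doc, 9)  # level 9 = best compression
    print(f"{label}: {len(doc)} bytes raw, {len(packed)} bytes compressed")
```

Because DEFLATE replaces repeated strings with short back-references, the gap between the two variants is typically much smaller after compression than before; running this on real documents is the only way to know how much it matters for them.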

Since I'm dealing with plain text (no binary files like images), I'd like an algorithm that suits plain text. Which one produces the smallest file size (lossless algorithms are preferred)?

By the way, the scenario is this: I am creating a standard for documents, like ODF or MS Office XML, that contain XML files, packaged in a .zip.
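
For the packaging scenario described here, a minimal sketch using Python's standard `zipfile` module might look like the following. The part names `content.xml` and `meta.xml` and the helper names `pack_document` / `unpack_document` are assumptions made for the example, not part of any standard mentioned in the question.

```python
import io
import zipfile

def pack_document(parts: dict[str, str]) -> bytes:
    """Write each XML part into an in-memory .zip using DEFLATE."""
    buf = io.BytesIO()
    with zipfile.ZipFile(
        buf, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9
    ) as zf:
        for name, xml_text in parts.items():
            zf.writestr(name, xml_text.encode("utf-8"))
    return buf.getvalue()

def unpack_document(blob: bytes) -> dict[str, str]:
    """Read every part of the archive back into a name -> text mapping."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return {name: zf.read(name).decode("utf-8") for name in zf.namelist()}

# Hypothetical document parts, for illustration only.
parts = {
    "content.xml": "<verylongtagnumberone><verylongtagnumbertwo>text"
                   "</verylongtagnumbertwo></verylongtagnumberone>",
    "meta.xml": "<meta><title>Example</title></meta>",
}
blob = pack_document(parts)
restored = unpack_document(blob)
print(len(blob), "bytes in archive;", sorted(restored))
```

The resulting bytes form an ordinary .zip archive, so users can open it with any standard zip tool.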

EDIT: The 'encryption' thing was a typo; it should have been 'compression'.

Accepted answer by ivan_ivanovich_ivanoff

There is a W3C (not-yet-released) standard named EXI (Efficient XML Interchange).

It should become THE data format for compressing XML data in the future (it is claimed to be the last necessary binary format). Being optimized for XML, it compresses XML far more efficiently than any conventional compression algorithm.

With EXI, you can operate on compressed XML data on the fly (without the need to uncompress or re-compress it).

EXI = (XML + XMLSchema) as binary.

And here is an open-source implementation (I don't know whether it's stable yet):
Exificient

Answer by ivan_ivanovich_ivanoff

Another alternative to "compress" XML would be FI (Fast Infoset).

XML stored as FI would contain every tag and attribute name only once; all other occurrences reference the first one, thus saving space.
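
The "store each name once, then refer to it" idea can be sketched as a simple name table. This is a toy illustration of the principle only, not the real Fast Infoset wire format; `encode_names` and `decode_names` are hypothetical helpers invented for the example.

```python
def encode_names(names: list[str]) -> tuple[list[str], list[int]]:
    """Replace each name with an index into a table of first occurrences."""
    table: list[str] = []          # each distinct name, stored once
    index: dict[str, int] = {}     # name -> position in the table
    refs: list[int] = []           # one small integer per occurrence
    for name in names:
        if name not in index:
            index[name] = len(table)
            table.append(name)
        refs.append(index[name])
    return table, refs

def decode_names(table: list[str], refs: list[int]) -> list[str]:
    """Rebuild the original sequence of names from the table and indices."""
    return [table[i] for i in refs]

names = [
    "verylongtagnumberone",
    "verylongtagnumbertwo",
    "verylongtagnumbertwo",
    "verylongtagnumberone",
]
table, refs = encode_names(names)
print(table)  # the two distinct names, in order of first appearance
print(refs)   # one index per occurrence
```

Every repeated occurrence costs only a small integer instead of the full name, which is where the space saving comes from.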

See:

Very good article on java.sun.com, and of course
the Wikipedia entry

Compared with EXI, from the compression point of view, Fast Infoset (being structured plaintext) is less efficient.

Another important difference is that FI is a mature standard with many implementations.
One of them: Fast Infoset Project @ dev.java.net

Answer by sendbits

Yes, *.zip is best in practice. The gory details are in this USENIX paper, which shows that "optimal" compressors are not worth the computational cost and that domain-specific compressors don't beat zip [on average].

Disclaimer: I wrote that paper, which has been cited 60+ times according to Google.

Answer by Mizipzor

It seems like you're more interested in compression than in encryption. Is that the case? If so, this might prove an interesting read, even though it is not an exact solution.

Answer by Pete Kirkham

By the way, the scenario is this: I am creating a standard for documents, like ODF or MS Office XML, that contain XML files, packaged in a .zip.

then I'd suggest you use .zip compression, or your users will get confused.

Answer by Pete Kirkham

I hope I understood correctly what you need to do... The first thing I would like to say is that there are no good or bad compression algorithms for text: zip, bzip, gzip, rar, and 7zip are all good enough to compress anything that has low entropy, i.e. a large file with a small character set. If I had to choose, I would pick 7zip first, rar second, and zip third. But the difference is very small, so you should try whatever is easiest for you.

Second, I could not understand what you are trying to encrypt. Suppose it is an XML file: then you should first compress it using your favourite compression algorithm and then encrypt it using your favourite encryption algorithm. In most cases any modern algorithm, implemented for instance in PGP, will be secure enough for anything. Hope that helps.
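
Rather than taking any ranking on faith, the candidates can be compared on a sample of the actual data. The sketch below uses only Python's standard library, so it covers DEFLATE (`zlib`, the algorithm behind zip and gzip), bzip2 (`bz2`), and LZMA (`lzma`, the algorithm family used by 7zip); rar has no standard-library module and is left out. The synthetic `sample` is an assumption for the example and should be replaced with real documents.

```python
import bz2
import lzma
import zlib

# Synthetic, highly repetitive sample; substitute real XML for a fair test.
sample = (
    "<verylongtagnumberone><verylongtagnumbertwo>text"
    "</verylongtagnumbertwo></verylongtagnumberone>\n" * 500
).encode("utf-8")

codecs = {
    "zlib (DEFLATE)": (zlib.compress, zlib.decompress),
    "bz2 (bzip2)": (bz2.compress, bz2.decompress),
    "lzma (LZMA)": (lzma.compress, lzma.decompress),
}

sizes: dict[str, int] = {}
for name, (compress, decompress) in codecs.items():
    packed = compress(sample)
    assert decompress(packed) == sample  # all three are lossless
    sizes[name] = len(packed)
    print(f"{name}: {len(sample)} -> {len(packed)} bytes")
```

Which codec wins, and by how much, depends on the input, so the numbers from a toy sample like this one should not be generalized.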

Answer by Zepplock

Your alternatives are:

  • Use a webserver that supports gzip compression. It'll automatically compress all outgoing HTML. There's a small CPU penalty, though.
  • Use something like JSON. It'll drastically reduce the size of the message.
  • There's also binary XML, but I have not tried it myself.
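
The first two alternatives can be compared with a quick measurement. This is a rough sketch under stated assumptions: the `records` data and the element names are made up for the example, and `gzip.compress` stands in for what a gzip-enabled web server would do to the response body.

```python
import gzip
import json

# Hypothetical payload: 100 small records.
records = [{"id": i, "name": f"item{i}"} for i in range(100)]

xml_text = "<records>" + "".join(
    f"<record><id>{r['id']}</id><name>{r['name']}</name></record>"
    for r in records
) + "</records>"
json_text = json.dumps(records, separators=(",", ":"))  # compact JSON

for label, text in (("XML", xml_text), ("JSON", json_text)):
    raw = text.encode("utf-8")
    packed = gzip.compress(raw)
    print(f"{label}: {len(raw)} bytes raw, {len(packed)} bytes gzipped")
```

For this payload the JSON form is smaller before compression; how much of that advantage remains after gzip is worth checking on real messages before switching formats.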

Answer by user1496062

None of the default ones are ideal for XML, but you will still get good results since there is a lot of repetition.

Because XML contains a lot of repeats (tags, >), you want each of these to cost less than one bit, which calls for some form of arithmetic coding rather than Huffman coding. So rar / 7zip should be significantly better in theory. These algorithms offer high compression, so they are slower. Ideally you'd want a simple compressor with an arithmetic encoder (which for XML would be fast and give high compression).
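
The "less than a bit" point can be made concrete with a little arithmetic. A Huffman code gives every symbol a whole number of bits, so even the most common symbol costs at least 1 bit, whereas arithmetic coding can approach the information content of -log2(p) bits for a symbol of probability p. The probabilities below are arbitrary illustrative values, not measurements of real XML.

```python
import math

def ideal_bits(p: float) -> float:
    """Information content, in bits, of a symbol with probability p."""
    return -math.log2(p)

def binary_entropy(p: float) -> float:
    """Average bits per symbol for a two-symbol source (p and 1 - p)."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.5, 0.75, 0.9, 0.99):
    print(
        f"p={p}: ideal cost {ideal_bits(p):.3f} bits, "
        f"two-symbol entropy {binary_entropy(p):.3f} bits/symbol, "
        f"Huffman spends at least 1 bit"
    )
```

The more skewed the distribution, the further the ideal cost drops below the 1-bit floor of a Huffman code, which is the gap an arithmetic coder can exploit.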
