How efficient is the encoding/decoding algorithm of BASE64 class in Java?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/6355704/

Time: 2020-10-30 15:31:55  Source: igfitidea

Tags: java, encoding, base64, apache-commons-codec, string-decoding

Asked by Subhadip Pal

I am about to use an algorithm to encode a variable-length but very long String field retrieved from an XML file; the encoded data should then be persisted in the database.

Later, when I receive a second file, I need to fetch the previously stored encoded data from the database, decode it, and compare it with the new data to check for duplicates.

I tried the org.apache.commons.codec.binary.Base64 class, which has two methods:

  1. encodeBase64(byte[] barray)
  2. decodeBase64(String str)

These work perfectly well and solve my problem. But it converts a 55-character string to just a 6-character String.

So I wonder if there is any case where this algorithm encodes two very large Strings that differ by only one character (for example) into the same encoded byte array.

I do not know much about the Base64 class, but if anyone can help me out it would be really helpful.

If you can suggest any other algorithm that turns a large String into a short, fixed-length one and serves my purpose, I will be happy to use it.

Thanks in advance.

Answered by johnstok

Not very efficient.

Also, using sun.misc classes makes the application non-portable.

Check out the following performance comparisons from MiGBase64:

[Image: MiGBase64 performance comparison chart]

"So I wonder if there is any case where this algorithm encodes two very large Strings that differ by only one character (for example) into the same encoded byte array."

Base64 isn't a hashing algorithm; it's an encoding and must therefore be bi-directional. Collisions therefore cannot occur; otherwise decoding would be ambiguous. Base64 is designed to represent arbitrary binary data as an ASCII string. Encoding a Unicode string as Base64 will often increase the number of code points required, since Unicode characters can need multiple bytes each and Base64 emits four characters for every three bytes. The Base64 representation of a Unicode string will vary depending on the character encoding (UTF-8, UTF-16) used. For example:

Base64( UTF8( "test" ) ) => "dGVzdA=="
Base64( UTF16( "test" ) ) => "/v8AdABlAHMAdA=="
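
The two results above can be reproduced with a minimal sketch like the following. It assumes Java 8 or later and uses the JDK's own java.util.Base64 rather than the commons-codec class from the question; the class and method names (Base64CharsetDemo, base64Utf8, and so on) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class; not part of the original answer.
class Base64CharsetDemo {

    /** Converts the text to UTF-8 bytes, then Base64-encodes those bytes. */
    static String base64Utf8(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Same as above but with UTF-16. Java's "UTF-16" charset writes a
     * big-endian byte-order mark (FE FF) first, which is why the output
     * starts with "/v8A".
     */
    static String base64Utf16(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_16));
    }

    /** Decodes Base64 back to text; the charset must match the one used to encode. */
    static String decodeUtf8(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(base64Utf8("test"));             // dGVzdA==
        System.out.println(base64Utf16("test"));            // /v8AdABlAHMAdA==
        System.out.println(decodeUtf8(base64Utf8("test"))); // test
    }
}
```

Because decoding recovers the exact input bytes, two different inputs can never produce the same Base64 output.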


Solution 1

Use lossless compression

GZip( UTF8( "test" ) )

Here you convert the string to a byte array and use lossless compression to reduce the number of bytes you have to store. You can vary the character encoding and the compression algorithm to reduce the byte count, depending on the Strings you will be storing (i.e. if they are mostly ASCII, then UTF-8 will probably be best).

Pros: no collisions; the original string can be recovered
Cons: the number of bytes required to store the value is variable, and larger than with a hash
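
As a rough sketch of Solution 1, the JDK's GZIP streams can do the compression and the recovery. The class name GzipStringCodec and its methods are invented for this example, and UTF-8 is an assumption about the input text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; not part of the original answer.
class GzipStringCodec {

    /** GZip( UTF8( text ) ): returns the compressed bytes to store. */
    static byte[] compress(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIP stream writes the trailer, so read the buffer only afterwards.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reverses compress(): recovers the original string from the stored bytes. */
    static String decompress(byte[] compressed) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] stored = compress("test");
        System.out.println(stored.length + " bytes stored");
        System.out.println(decompress(stored));
    }
}
```

Note that the GZIP header and trailer add a fixed overhead, so very short strings such as "test" come out larger than the input; the saving only shows up on long or repetitive text.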

Solution 2

Use a hashing algorithm

SHA256( UTF8( "test" ) )

Here you convert the string to a fixed-length sequence of bytes with a hashing function. Hashing is uni-directional and, by its nature, collisions are possible. However, based on the profile and number of Strings that you expect to process, you can select a hash function that minimises the likelihood of collisions.

Pros: the number of bytes required to store the value is fixed and small
Cons: collisions are possible; the original string cannot be recovered
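
Solution 2 might look like the sketch below, using the JDK's MessageDigest. The class and method names are hypothetical, and rendering the digest as hex is just one convenient way to store it in a text column.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; not part of the original answer.
class Sha256Fingerprint {

    /** SHA256( UTF8( text ) ), rendered as 64 lowercase hex characters. */
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // Always 64 hex characters, regardless of input length.
        System.out.println(sha256Hex("test"));
        // A duplicate check compares fingerprints instead of full strings.
        System.out.println(sha256Hex("test").equals(sha256Hex("tesu")));
    }
}
```

For the duplicate check in the question, the stored fingerprint would be compared with the fingerprint of the incoming field; a match indicates a duplicate with overwhelming probability, but the original text cannot be read back from it.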

Answered by Andrzej Doyle

I just saw your comment; it seems you're actually looking for compression rather than hashing, as I initially thought. In that case, though, you won't be able to get fixed-length output for arbitrary input (think about it: an infinite number of inputs cannot map bijectively to a finite number of outputs), so I hope that wasn't a strong requirement.

In any case, the performance of your chosen compression algorithm will depend on the characteristics of the input text. In the absence of further information, DEFLATE compression (as used by the Zip input streams, IIRC) is a good general-purpose algorithm to start with, and at least to use as a basis for comparison. For ease of implementation, though, you can use the Deflater class built into the JDK, which uses ZLib compression.
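
A minimal round trip with the JDK's Deflater and Inflater could look like this. It is only a sketch: the wrapper class DeflaterStringCodec, the UTF-8 charset, and the BEST_COMPRESSION level are assumptions chosen for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper; not part of the original answer.
class DeflaterStringCodec {

    /** Compresses the UTF-8 bytes of the text into a zlib-wrapped DEFLATE stream. */
    static byte[] deflate(String text) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
            deflater.finish(); // no more input will follow
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(chunk);
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native resources
        }
    }

    /** Reverses deflate(): recovers the original text. */
    static String inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                // Guard against looping forever on truncated or dictionary-dependent data.
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Incomplete compressed data");
                }
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid zlib data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] stored = deflate("test");
        System.out.println(inflate(stored));
    }
}
```

The zlib wrapper used here has a smaller header than GZIP, which helps a little on short inputs, but the output length still varies with the input.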

If your input strings have particular patterns, then different compression algorithms may be more or less efficient. In one respect it doesn't matter which one you use, if you don't intend the compressed data to be read by any other processes - so long as you can compress and decompress yourself, it'll be transparent to your clients.

These other questions may be of interest:
