Original source: http://stackoverflow.com/questions/1443158/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Binary Data in JSON String. Something better than Base64
Asked by dmeister
The JSON format natively doesn't support binary data. The binary data has to be escaped so that it can be placed into a string element (i.e. zero or more Unicode chars in double quotes using backslash escapes) in JSON.
An obvious method to escape binary data is to use Base64. However, Base64 has a high processing overhead. Also it expands 3 bytes into 4 characters which leads to an increased data size by around 33%.
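The 3-bytes-into-4-characters expansion can be checked directly. This is a minimal sketch using only the Python standard library; the payload and its size are arbitrary choices for illustration, not something from the question.

```python
import base64
import os

payload = os.urandom(3000)           # arbitrary binary data, length divisible by 3
encoded = base64.b64encode(payload)  # ASCII-only bytes, safe inside a JSON string

# Every 3 input bytes become 4 output characters.
ratio = len(encoded) / len(payload)
print(len(payload), len(encoded), f"{ratio:.3f}")
```

When the input length is not a multiple of 3, the final group is padded with `=`, so the ratio for small inputs is slightly above 4/3.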
One use case for this is the v0.8 draft of the CDMI cloud storage API specification. You create data objects via a REST-Webservice using JSON, e.g.
PUT /MyContainer/BinaryObject HTTP/1.1
Host: cloud.example.com
Accept: application/vnd.org.snia.cdmi.dataobject+json
Content-Type: application/vnd.org.snia.cdmi.dataobject+json
X-CDMI-Specification-Version: 1.0

{
    "mimetype" : "application/octet-stream",
    "metadata" : [ ],
    "value" : "TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5IGhpcyByZWFzb24sIGJ1dCBieSB0aGlz
IHNpbmd1bGFyIHBhc3Npb24gZnJvbSBvdGhlciBhbmltYWxzLCB3aGljaCBpcyBhIGx1c3Qgb2Yg
dGhlIG1pbmQsIHRoYXQgYnkgYSBwZXJzZXZlcmFuY2Ugb2YgZGVsaWdodCBpbiB0aGUgY29udGlu
dWVkIGFuZCBpbmRlZmF0aWdhYmxlIGdlbmVyYXRpb24gb2Yga25vd2xlZGdlLCBleGNlZWRzIHRo
ZSBzaG9ydCB2ZWhlbWVuY2Ugb2YgYW55IGNhcm5hbCBwbGVhc3VyZS4="
}
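For reference, a body shaped like the request above can be produced and read back with a few lines of Python. This is only a sketch: the field names are copied from the example, while `to_json_document` and `from_json_document` are hypothetical helper names and not part of any CDMI client library.

```python
import base64
import json


def to_json_document(payload: bytes) -> str:
    """Wrap raw bytes in a JSON object shaped like the request body above."""
    return json.dumps({
        "mimetype": "application/octet-stream",
        "metadata": [],
        "value": base64.b64encode(payload).decode("ascii"),
    })


def from_json_document(document: str) -> bytes:
    """Recover the raw bytes from the "value" field."""
    return base64.b64decode(json.loads(document)["value"])


original = bytes(range(256))  # every possible byte value once
document = to_json_document(original)
restored = from_json_document(document)
print(restored == original)
```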
Are there better ways and standard methods to encode binary data into JSON strings?
Accepted answer by hobbs
There are 94 Unicode characters which can be represented as one byte according to the JSON spec (if your JSON is transmitted as UTF-8). With that in mind, I think the best you can do space-wise is base85, which represents four bytes as five characters. However, this is only a 7% improvement over base64, it's more expensive to compute, and implementations are less common than for base64, so it's probably not a win.
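To put numbers on the base85 comparison, here is a small sketch using Python's standard `base64` module. It assumes the `b85encode` variant, whose alphabet contains neither `"` nor `\`; the classic Ascii85 alphabet (`a85encode`) includes both and would need extra escaping inside a JSON string.

```python
import base64
import json

payload = bytes(range(240)) * 5  # 1200 bytes, divisible by both 3 and 4

b64 = base64.b64encode(payload).decode("ascii")  # 3 bytes -> 4 chars
b85 = base64.b85encode(payload).decode("ascii")  # 4 bytes -> 5 chars

# No character in the b85 alphabet needs a JSON escape,
# so json.dumps only adds the two surrounding quotes.
quoted = json.dumps(b85)

saving = 1 - len(b85) / len(b64)
print(len(b64), len(b85), f"{saving:.2%}")
```

On this input the saving over base64 is a little over 6%, in line with the roughly 7% figure above.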
You could also simply map every input byte to the corresponding character in U+0000-U+00FF, then do the minimum encoding required by the JSON standard to pass those characters; the advantage here is that the required decoding is nil beyond builtin functions, but the space efficiency is bad -- a 105% expansion (if all input bytes are equally likely) vs. 25% for base85 or 33% for base64.
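The byte-to-code-point mapping can be tried out with nothing but built-ins. In this sketch `encode_latin1` and `decode_latin1` are hypothetical names, and Python's `json.dumps(..., ensure_ascii=False)` is assumed to stand in for "the minimum encoding required by the JSON standard" (it escapes only `"`, `\` and the control characters below U+0020).

```python
import json


def encode_latin1(payload: bytes) -> str:
    # Map each byte to the code point with the same value (U+0000-U+00FF),
    # then let json.dumps add the escapes that JSON requires.
    return json.dumps(payload.decode("latin-1"), ensure_ascii=False)


def decode_latin1(document: str) -> bytes:
    return json.loads(document).encode("latin-1")


every_byte = bytes(range(256))  # each byte value exactly once
document = encode_latin1(every_byte)

# Size on the wire as UTF-8, not counting the two surrounding quotes.
wire_size = len(document.encode("utf-8")) - 2
expansion = wire_size / len(every_byte) - 1
print(wire_size, f"{expansion:.0%}")
```

With every byte value appearing once, the size on the wire is a little more than double the input, which matches the 105% figure.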
Final verdict: base64 wins, in my opinion, on the grounds that it's common, easy, and not bad enough to warrant replacement.
Answer by Ælex
I ran into the same problem, and thought I'd share a solution: multipart/form-data.
By sending a multipart form, you first send your JSON metadata as a string, and then separately send the raw binary data (images, WAVs, etc.) as parts indexed by the Content-Disposition name.
Here's a nice tutorial on how to do this in obj-c, and here is a blog article that explains how to partition the string data with the form boundary, and separate it from the binary data.
The only change you really need to do is on the server side; you will have to capture your meta-data which should reference the POST'ed binary data appropriately (by using a Content-Disposition boundary).
Granted it requires additional work on the server side, but if you are sending many images or large images, this is worth it. Combine this with gzip compression if you want.
IMHO sending base64 encoded data is a hack; the RFC multipart/form-data was created for issues such as this: sending binary data in combination with text or meta-data.
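As a rough illustration of the multipart approach, the sketch below assembles a two-part body by hand: one part carrying the JSON metadata, one carrying the raw bytes. The function name, the part name "metadata", and the field and file names are all assumptions made for this example; in practice an HTTP client library would normally build the body for you.

```python
import json
import uuid


def build_multipart(metadata: dict, field: str, filename: str, payload: bytes):
    """Return (content_type, body) for a two-part multipart/form-data request."""
    # A random boundary; it must not occur inside the payload itself.
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="metadata"\r\n'
        "Content-Type: application/json\r\n\r\n"
        f"{json.dumps(metadata)}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    # The payload goes in untouched: no base64, no escaping.
    return f"multipart/form-data; boundary={boundary}", head + payload + tail


raw = bytes(range(256))
content_type, body = build_multipart({"binary_field": "image"}, "image", "photo.bin", raw)
print(content_type, len(body) - len(raw))
```

The overhead is a fixed few hundred bytes of headers and boundaries, regardless of how large the binary part is.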
Answer by chmike
The problem with UTF-8 is that it is not the most space-efficient encoding. Also, some random binary byte sequences are invalid UTF-8, so you can't just interpret a random binary byte sequence as UTF-8 data. The benefit of this constraint on the UTF-8 encoding is that it makes it robust: it is possible to locate where multi-byte chars start and end from whatever byte we start looking at.
As a consequence, while a byte value in the range [0..127] needs only one byte in UTF-8 encoding, a byte value in the range [128..255] requires 2 bytes! Worse than that, in JSON, control chars, " and \ are not allowed to appear unescaped in a string. So the binary data would require some transformation to be properly encoded.
Let's see. If we assume uniformly distributed random byte values in our binary data then, on average, half of the bytes would be encoded in one byte and the other half in two bytes. The UTF-8 encoded binary data would be 150% of the initial size.
Base64 encoding grows only to 133% of the initial size. So Base64 encoding is more efficient.
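The 150% versus 133% comparison can be reproduced with a quick check. This sketch uses each byte value exactly once as a stand-in for uniformly distributed data, and it deliberately ignores the extra JSON escaping of control chars, " and \ mentioned earlier.

```python
import base64

every_byte = bytes(range(256))  # stand-in for uniformly distributed data

# Treat each byte as the code point of the same value, then encode as UTF-8:
# values 0..127 take one byte, values 128..255 take two.
as_utf8 = every_byte.decode("latin-1").encode("utf-8")
as_base64 = base64.b64encode(every_byte)

utf8_ratio = len(as_utf8) / len(every_byte)
base64_ratio = len(as_base64) / len(every_byte)
print(f"{utf8_ratio:.0%} {base64_ratio:.0%}")
```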
What about using another Base encoding? In UTF-8, encoding the 128 ASCII values is the most space efficient. In 8 bits you can store 7 bits. So if we cut the binary data into 7-bit chunks and store each chunk in one byte of a UTF-8 encoded string, the encoded data would grow only to 114% of the initial size. Better than Base64. Unfortunately we can't use this easy trick because JSON doesn't allow some ASCII chars. The 33 control characters of ASCII ([0..31] and 127) and the " and \ must be excluded. This leaves us only 128-35 = 93 chars.
So in theory we could define a Base93 encoding which would grow the encoded size to 8/log2(93) = 8*log10(2)/log10(93) = 122%. But a Base93 encoding would not be as convenient as a Base64 encoding. Base64 requires cutting the input byte sequence into 6-bit chunks, for which simple bitwise operations work well. Besides, 133% is not much more than 122%.
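The percentages in this argument all come from the same formula, 8/log2(N) for an alphabet of N one-byte characters. A short sketch to reproduce them; `expansion` is just an illustrative helper name.

```python
import math


def expansion(alphabet_size: int) -> float:
    """Output bytes per input byte when each one-byte character
    carries log2(alphabet_size) bits of payload."""
    return 8 / math.log2(alphabet_size)


for name, size in [("7-bit ASCII", 128), ("Base93", 93), ("Base85", 85), ("Base64", 64)]:
    print(f"{name}: {expansion(size):.0%}")
```

These are theoretical lower bounds for each alphabet size. A practical encoder works on fixed-size groups, so for example base85 as usually implemented lands at exactly 5/4 = 125%.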
This is why I came independently to the common conclusion that Base64 is indeed the best choice to encode binary data in JSON. My answer presents a justification for it. I agree it isn't very attractive from the performance point of view, but consider also the benefit of using JSON with its human-readable string representation, which is easy to manipulate in all programming languages.
If performance is critical, then a pure binary encoding should be considered as a replacement for JSON. But with JSON my conclusion is that Base64 is the best.
Answer by DarcyThomas
BSON (Binary JSON) may work for you. http://en.wikipedia.org/wiki/BSON
Edit: FYI the .NET library json.net supports reading and writing bson if you are looking for some C# server side love.
Answer by andrej
If you deal with bandwidth problems, try compressing the data on the client side first, then base64 it.
A nice example of such magic is at http://jszip.stuartk.co.uk/ and more discussion of this topic is at JavaScript implementation of Gzip
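The compress-then-encode idea looks roughly like this in Python. The sketch uses `zlib` from the standard library rather than the JavaScript library linked above, and `pack` / `unpack` are hypothetical helper names.

```python
import base64
import json
import zlib


def pack(payload: bytes) -> str:
    """Compress first, then base64 the compressed bytes for use in JSON."""
    return base64.b64encode(zlib.compress(payload)).decode("ascii")


def unpack(text: str) -> bytes:
    """Reverse of pack: base64-decode, then decompress."""
    return zlib.decompress(base64.b64decode(text))


compressible = b"binary data in JSON " * 200  # highly repetitive sample input
document = json.dumps({"value": pack(compressible)})
restored = unpack(json.loads(document)["value"])
print(len(compressible), len(document), restored == compressible)
```

How much this helps depends on the data: repetitive input like the sample shrinks dramatically, while data that is already compressed (JPEG, zip archives, and so on) gains little or nothing and still pays the base64 expansion.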
Answer by richardtallent
yEnc might work for you:
http://en.wikipedia.org/wiki/Yenc
"yEnc is a binary-to-text encoding scheme for transferring binary files in [text]. It reduces the overhead over previous US-ASCII-based encoding methods by using an 8-bit Extended ASCII encoding method. yEnc's overhead is often (if each byte value appears approximately with the same frequency on average) as little as 1–2%, compared to 33%–40% overhead for 6-bit encoding methods like uuencode and Base64. ... By 2003 yEnc became the de facto standard encoding system for binary files on Usenet."
However, yEnc is an 8-bit encoding, so storing it in a JSON string has the same problems as storing the original binary data; doing it the naïve way means about a 100% expansion, which is worse than base64.
Answer by StaxMan
While it is true that base64 has a ~33% expansion rate, it is not necessarily true that the processing overhead is significantly more than this: it really depends on the JSON library/toolkit you are using. Encoding and decoding are simple, straightforward operations, and they can even be optimized with respect to character encoding (as JSON only supports UTF-8/16/32) -- base64 characters are always single-byte for JSON String entries. For example, on the Java platform there are libraries that can do the job rather efficiently, so that the overhead is mostly due to the expanded size.
I agree with two earlier answers:
- base64 is a simple, commonly used standard, so you are unlikely to find something better specifically for use with JSON (base-85 is used by PostScript etc., but the benefits are at best marginal when you think about it)
- compression before encoding (and after decoding) may make lots of sense, depending on the data you use
Answer by Stefano Fratini
Smile format: it's very fast to encode and decode, and compact
Speed comparison (java based but meaningful nevertheless): https://github.com/eishay/jvm-serializers/wiki/
Also, it's an extension to JSON that allows you to skip base64 encoding for byte arrays
Smile encoded strings can be gzipped when space is critical
Answer by a paid nerd
(Edit 7 years later: Google Gears is gone. Ignore this answer.)
The Google Gears team ran into the lack-of-binary-data-types problem and has attempted to address it:
JavaScript has a built-in data type for text strings, but nothing for binary data. The Blob object attempts to address this limitation.
Maybe you can weave that in somehow.
Answer by jsoverson
Since you're looking for the ability to shoehorn binary data into a strictly text-based and very limited format, I think Base64's overhead is minimal compared to the convenience you're expecting to maintain with JSON. If processing power and throughput are a concern, then you'd probably need to reconsider your file formats.

