Handling Unicode sequences in PostgreSQL

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31671634/

Date: 2020-10-21 01:59:06 · Source: igfitidea

Handling Unicode sequences in postgresql

Tags: json, postgresql, unicode

Asked by Lix

I have some JSON data stored in a JSON (not JSONB) column in my PostgreSQL database (9.4.1). Some of these JSON structures contain Unicode escape sequences in their attribute values. For example:

{"client_id": 1, "device_name": "FooBar\ufffd\u0000\ufffd\u000f\ufffd" }

When I try to query this JSON column (even if I'm not directly trying to access the device_name attribute), I get the following error:

ERROR: unsupported Unicode escape sequence
Detail: \u0000 cannot be converted to text.

You can reproduce this error by executing the following command on a PostgreSQL server:

select '{"client_id": 1, "device_name": "FooBar\ufffd\u0000\ufffd\u000f\ufffd" }'::json->>'client_id'

The error makes sense to me - there is simply no way to represent the Unicode sequence NULL in a textual result.

Is there any way for me to query the same JSON data without having to perform "sanitization" on the incoming data? These JSON structures change regularly, so scanning a specific attribute (device_name in this case) would not be a good solution, since there could easily be other attributes that might hold similar data.



After some more investigation, it seems that this behavior is new in version 9.4.1, as mentioned in the changelog:

...Therefore \u0000 will now also be rejected in json values when conversion to de-escaped form is required. This change does not break the ability to store \u0000 in json columns so long as no processing is done on the values...

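
The distinction the changelog draws can be demonstrated directly (a sketch of the 9.4.1+ behavior; the error is the same one quoted above):

-- Storing and retrieving the raw json value works: no de-escaping takes place.
select '{"device_name": "FooBar\u0000"}'::json;

-- Extracting a field as text forces de-escaping of the whole value and fails:
select '{"device_name": "FooBar\u0000"}'::json ->> 'device_name';
-- ERROR:  unsupported Unicode escape sequence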
Was this really the intention? Is a downgrade to pre-9.4.1 a viable option here?



As a side note, this property is taken from the name of the client's mobile device - it's the user that entered this text into the device. How on earth did a user insert NULL and REPLACEMENT CHARACTER values?!

Answered by Patrick

\u0000 is the one Unicode code point which is not valid in a string. I see no other way than to sanitize the string.

Since json is just a string in a specific format, you can use the standard string functions, without worrying about the JSON structure. A one-line sanitizer to remove the code point would be:

-- The backslash is doubled so the regex matches the literal six-character
-- escape sequence \u0000 rather than being read as a regex Unicode escape.
SELECT (regexp_replace(the_string::text, '\\u0000', '', 'g'))::json;

But you can also insert any character of your liking, which would be useful if the zero code point is used as some form of delimiter.

Note also the subtle difference between what is stored in the database and how it is presented to the user. You can store the code point in a JSON string, but you have to pre-process it to some other character before processing the value as a json data type.

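
Applied to the example from the question, the sanitizing step might look like this (devices and data are hypothetical table and column names, not from the original answer):

-- Strip the escape sequence from the stored text, then re-cast and extract.
select regexp_replace(data::text, '\\u0000', '', 'g')::json ->> 'device_name'
from devices;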
Answered by Hendrik

The solution by Patrick didn't work out of the box for me; an error was still thrown regardless. I then researched a little more and was able to write a small custom function that fixed the issue for me.

First I could reproduce the error by writing:

select json '{ "a":  "null \u0000 escape" }' ->> 'a' as fails

Then I added a custom function which I used in my query:

CREATE OR REPLACE FUNCTION null_if_invalid_string(json_input JSON, record_id UUID)
  RETURNS JSON AS $$
DECLARE json_value TEXT DEFAULT NULL;  -- ->> returns text, so declare as TEXT
BEGIN
  BEGIN
    -- Extracting any key with ->> forces the whole value to be de-escaped,
    -- so this raises an error if the JSON contains \u0000 anywhere.
    json_value := json_input ->> 'location';
  EXCEPTION WHEN OTHERS THEN
    RAISE NOTICE 'Invalid json value: "%". Returning NULL.', record_id;
    RETURN NULL;
  END;
  RETURN json_input;
END;
$$ LANGUAGE plpgsql;

To call the function, do the following. You should not receive an error:

select null_if_invalid_string('{ "a":  "null \u0000 escape" }', id) from my_table

Whereas this should return the json as expected:

select null_if_invalid_string('{ "a":  "null" }', id) from my_table
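
Since the function above probes a hard-coded key, one possible variant (a sketch, not from the original answer) validates the whole document instead by casting through jsonb, which de-escapes everything and rejects \u0000 outright:

CREATE OR REPLACE FUNCTION json_or_null(json_input JSON)
  RETURNS JSON AS $$
BEGIN
  -- The cast to jsonb processes every value, raising on \u0000.
  PERFORM json_input::jsonb;
  RETURN json_input;
EXCEPTION WHEN OTHERS THEN
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;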

Answered by rubo77

If you don't want those null-byte results, just add:

-- Cast to text (LIKE is not defined for json) and double the backslash,
-- because a single backslash is the LIKE escape character.
AND json::text NOT LIKE '%\\u0000%'

in your WHERE clause.

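
A complete query using this filter might look like the following (devices, data and client_id are hypothetical names assumed for illustration):

-- Skip any row whose raw JSON text contains the \u0000 escape sequence.
select data ->> 'device_name'
from devices
where client_id = 1
  and data::text not like '%\\u0000%';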