postgresql Redshift/Postgres: how can I ignore rows that generate errors? (Invalid JSON in json_extract_path_text)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/25317707/

Date: 2020-10-21 01:34:58  Source: igfitidea

Redshift/Postgres: how can I ignore rows that generate errors? (Invalid JSON in json_extract_path_text)

postgresql amazon-redshift

Asked by Kevin S

I'm trying to run a query in redshift where I'm selecting using json_extract_path_text. Unfortunately, some of the JSON entries in this database column are invalid.


What happens: When the query hits an invalid JSON value, it stops with a "JSON parsing error".


What I want: Ignore any rows with invalid JSON in that column, but return every row whose JSON can be parsed.


Why I can't make it do what I want: I don't think I understand error handling in Redshift/Postgres. It should be possible to simply skip any rows that generate errors, but I tried entering EXEC SQL WHENEVER SQLERROR CONTINUE (based on the Postgres docs) and got a "syntax error at or near SQLERROR".


Answered by dvmlls

Create a python UDF:


create or replace function f_json_ok(js varchar(65535)) 
returns boolean
immutable
as $$
    if js is None: 
        return None

    import json
    try:
        json.loads(js)
        return True
    except:
        return False
$$ language plpythonu;
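The body of that UDF is plain Python, so its core check can be sanity-checked locally before creating the function. A minimal sketch (the local name is_json_ok is just a stand-in for the UDF):

```python
import json

def is_json_ok(js):
    """Mirror of the UDF body: None passes through as None,
    parseable JSON returns True, anything else returns False."""
    if js is None:
        return None
    try:
        json.loads(js)
        return True
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False

print(is_json_ok('{"Key": "DesiredValue"}'))  # True
print(is_json_ok('not json'))                 # False
print(is_json_ok(None))                       # None
```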

Use it like so:


select *
from schema.table
where 'DesiredValue' = 
    case 
        when f_json_ok(json_column) then json_extract_path_text(json_column, 'Key') 
        else 'nope' 
    end 

Answered by David Wolever

Edit: it seems like Redshift only supports Python UDFs, so this answer will not work. I'm going to leave this answer here for posterity (and in the event someone finds this who isn't using Redshift).


Potentially relevant: here is a plpgsql function which will try to decode JSON and return a default value if that fails:


CREATE OR REPLACE FUNCTION safe_json(i text, fallback json) RETURNS json AS $$
BEGIN
    RETURN i::json;
EXCEPTION
    WHEN others THEN
        RETURN fallback;
END;
$$ LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT;

Then you can use it like this:


SELECT
    …
FROM (
    SELECT safe_json(my_text, '{"error": "invalid JSON"}'::json) AS my_json
    FROM my_table
) as x

To guarantee that you'll always have valid JSON.


Answered by semicircle21

Update: The UDF solution seems perfect. At the time I wrote this, that answer wasn't there; this one just covers some workarounds.


Although json_extract_path_text can't ignore errors, Redshift's COPY has a MAXERROR parameter.


So, you can use something like this instead:


COPY raw_json FROM 's3://data-source' 
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
JSON 's3://json_path.json'
MAXERROR 1000;

The next pitfall is in the json_path.json file: you can't use a $ to specify the root element:


{
    "jsonpaths": [
        "$['_id']",
        "$['type']",
        "$" <--------------- this will fail.
    ]
}

So, it is convenient to have a "top-level" element containing the other fields, like this ($['data'] is then everything on your record):


{
    "data": {
        "id": 1
        ...
    }
}
{
    "data": {
        "id": 2
        ...
    }
}

If you can't change the source format, Redshift's UNLOAD will help:


UNLOAD ('select_statement')
TO 's3://object_path_prefix'

It's easy to use the select_statement to concatenate: '{ "data" : ' + old string + ' }'...

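That wrapping step can be sketched in Python, assuming the source is newline-delimited JSON records; wrap_record is a hypothetical helper, not part of Redshift:

```python
import json

def wrap_record(line):
    # Wrap one raw JSON record under a top-level "data" key,
    # matching the $['data'] jsonpaths workaround above.
    return '{ "data" : ' + line.strip() + ' }'

wrapped = wrap_record('{"id": 1}\n')
print(wrapped)                      # { "data" : {"id": 1} }
print(json.loads(wrapped)["data"])  # {'id': 1}
```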

Then, Redshift rocks again!


Answered by harmic

I assume the JSON data is actually stored in a TEXT column rather than a JSON column (otherwise you would not have been able to store non-JSON in there in the first place).


If there is some pattern to the data that would allow you to make a regex that detects the valid rows, or the invalid ones, then you could use a CASE statement. For example:


SELECT CASE
    WHEN mycol !~ 'not_json' THEN json_extract_path_text(mycol, ....)
    ELSE NULL
END AS mystuff
...

replacing not_json with a regex that detects the non-JSON formatted values.


This may or may not be practical depending on the format of your data.

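As a rough illustration, suppose the known-bad rows share a recognizable marker (here a hypothetical run of question marks, as in the web-log example in a later answer). The regex pre-filter can be prototyped locally before writing the CASE expression:

```python
import re

# Hypothetical marker: rows containing two or more consecutive '?'
# characters are assumed to be garbled, non-JSON values.
NOT_JSON = re.compile(r'\?{2,}')

def looks_valid(row):
    # Keep a row only if the bad-marker regex does not match it.
    return NOT_JSON.search(row) is None

print(looks_valid('{"id": 1}'))    # True
print(looks_valid('????garbage'))  # False
```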

According to the answers on this question, it is apparently possible to completely verify arbitrary JSON data using some regex implementations, but alas not the one used by postgresql.


Answered by Iain

Redshift is missing a lot of Postgres functions such as error handling.


The way I handle this is:


  1. Use CREATE TABLE AS to create a 'fixup' table with the JSON field and whatever the key is on the main table you're trying to query. Make sure you set the DISTKEY and SORTKEY to your JSON field.

  2. Add two columns to my fixup table: valid_json (BOOLEAN) and extract_test (VARCHAR)

  3. Try to UPDATE extract_test with some text from the JSON field using JSON_EXTRACT_PATH_TEXT.

  4. Use the errors from that to spot common characters which are screwing up the JSON. If I'm importing from web log data, I might find ???? or something similar

  5. Use UPDATE table SET valid_json = false for JSON fields with that value

  6. Finally, change the json fields in my original table using UPDATE c SET json_field = NULL FROM fixup_table f WHERE original_table.id = f.id AND f.valid_json = FALSE


It's still manual, but far quicker than fixing line by line on a big table, and by using the right DISTKEY/SORTKEY on your fixup table you can make the queries run quickly.


Answered by Octavian

You can use the following function:


CREATE OR REPLACE FUNCTION isValidJSONv2(i varchar(MAX)) RETURNS int stable AS $CODE$
import json
import sys
try:
    if i is None:
        return 0
    json_object = json.loads(i)
    return 1
except:
    return 0
$CODE$ language plpythonu;

The problem remains that if you still use the JSON parsing functions in the SELECT, the error is still thrown. You would have to filter the valid from the invalid JSONs into different tables. I have posted the issue here: https://forums.aws.amazon.com/thread.jspa?threadID=232468


Answered by Puneet

Redshift now supports passing a boolean argument that allows you to treat invalid JSON as null:


select json_extract_path_text('invalid', 'path', true)


returns null


https://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_PATH_TEXT.html

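For intuition, the null-on-invalid behavior can be loosely emulated outside the database. This is a sketch, not Redshift's implementation, and it simplifies edge cases such as missing path elements:

```python
import json

def extract_path_text(js, *path, null_if_invalid=False):
    # Loose emulation of json_extract_path_text: walk the path and,
    # when the flag is set, return None (SQL NULL) instead of raising.
    try:
        value = json.loads(js)
        for key in path:
            value = value[key]
        return str(value)
    except (ValueError, KeyError, TypeError):
        if null_if_invalid:
            return None
        raise

print(extract_path_text('invalid', 'path', null_if_invalid=True))  # None
print(extract_path_text('{"path": "x"}', 'path'))                  # x
```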