使用 HIVE 解析 json 数组

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/24447428/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-03 17:23:16  来源:igfitidea点击:

Parse json arrays using HIVE

arraysjsonhive

提问by marlieg

I have many json arrays stored in a table (jt) that looks like this:

我在一个表 (jt) 中存储了许多 json 数组,如下所示:

[{"ts":1403781896,"id":14,"log":"show"},{"ts":1403781896,"id":14,"log":"start"}]
[{"ts":1403781911,"id":14,"log":"press"},{"ts":1403781911,"id":14,"log":"press"}]

Each array is a record.

每个数组都是一个记录。

I would like to parse this table in order to get a new table (logs) with 3 fields: ts, id, log. I tried to use the get_json_object method, but it seems that method is not compatible with json arrays because I only get null values.

我想解析这个表以获得一个包含 3 个字段的新表(日志):ts、id、log。我尝试使用 get_json_object 方法,但似乎该方法与 json 数组不兼容,因为我只获取空值。

This is the code I have tested:

这是我测试过的代码:

CREATE TABLE logs AS 
SELECT get_json_object(jt.value, '$.ts') AS ts, 
get_json_object(jt.value, '$.id') AS id,
get_json_object(jt.value, '$.log') AS log
FROM jt;

I tried to use other functions but they seem really complicated. Thank you! :)

我尝试使用其他功能,但它们看起来很复杂。谢谢!:)

Update! I solved my issue by performing a regexp:

更新!我通过执行正则表达式解决了我的问题:

CREATE TABLE jt_reg AS
select regexp_replace(regexp_replace(value,'\}\,\{','\}\\n\{'),'\[|\]','') as valuereg  from jt;


CREATE TABLE logs AS 
SELECT get_json_object(jt_reg.valuereg, '$.ts') AS ts, 
get_json_object(jt_reg.valuereg, '$.id') AS id,
get_json_object(jt_reg.valuereg, '$.log') AS log
FROM ams_json_reg;

回答by vijay kumar

Use explode() function

使用 explode()函数

 hive (default)> CREATE TABLE logs AS
                  >   SELECT get_json_object(single_json_table.single_json, '$.ts') AS ts,
                  >   get_json_object(single_json_table.single_json, '$.id') AS id,
                  >   get_json_object(single_json_table.single_json, '$.log') AS log
                  >   FROM
                  >     (SELECT explode(json_array_col) as single_json FROM jt) single_json_table ;

Automatically selecting local only mode for query
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator

hive (default)> select * from logs;
OK
ts      id      log
1403781896      14      show
1403781896      14      start
1403781911      14      press
1403781911      14      press
Time taken: 0.118 seconds, Fetched: 4 row(s)
hive (default)>

where json_array_col is column in jt which holds your array of jsons.

其中 json_array_col 是 jt 中的列,其中包含您的 jsons 数组。

hive (default)> select json_array_col from jt;
json_array_col
["{"ts":1403781896,"id":14,"log":"show"}","{"ts":1403781896,"id":14,"log":"start"}"]
["{"ts":1403781911,"id":14,"log":"press"}","{"ts":1403781911,"id":14,"log":"press"}"]

回答by ryan

I just ran into this problem, with the JSON array stored as a string in the hive table.

我刚刚遇到了这个问题,JSON 数组作为字符串存储在 hive 表中。

The solution is a bit hacky and ugly, but it works and doesn't require serdes or external UDFs

该解决方案有点笨拙和丑陋,但它有效并且不需要 serdes 或外部 UDF

SELECT 
       get_json_object(single_json_table.single_json, '$.ts') AS ts,
       get_json_object(single_json_table.single_json, '$.id') AS id,
       get_json_object(single_json_table.single_json, '$.log') AS log
FROM ( SELECT explode (
  split(regexp_replace(substr(json_array_col, 2, length(json_array_col)-2),
            '"}","', '"}",,,,"'), ',,,,')
      ) FROM src_table) single_json_table;

I broke the lines up so that it would be a little easier to read. I'm using substr() to strip the first and last characters, removing [ and ] . I'm then using regex_replace to match the separator between records in the json array and adding or changing the separator to be something unique that can then be used easily with split() to turn the string into a hive array of json objects which can then be used with explode() as described in the previous solution.

我把行拆开,这样会更容易阅读。我正在使用 substr() 去除第一个和最后一个字符,删除 [ 和 ] 。然后我使用 regex_replace 来匹配 json 数组中记录之间的分隔符,并将分隔符添加或更改为独特的东西,然后可以轻松地与 split() 一起使用将字符串转换为 json 对象的 hive 数组,然后如上一个解决方案中所述,与explode() 一起使用。

Note, the separator regex used here ( "}"," ) wouldn't work with the original data set...the regex would have to be ( "},\{" ) and the replacement would then need to be "},,,,{" eg..

请注意,此处使用的分隔符正则表达式 ( "}"," ) 不适用于原始数据集...正则表达式必须为 ( "},\{" ),然后替换为 "} ,,,,{" 例如..

  split(regexp_replace(substr(json_array_col, 2, length(json_array_col)-2),
            '"},\{"', '"},,,,{"'), ',,,,')

回答by lingjue8848

because get_json_object doesn't support json array string, so you can concat to a json object, like this:

因为 get_json_object 不支持 json 数组字符串,所以你可以连接到一个 json 对象,像这样:

SELECT 
    get_json_object(concat(concat('{"root":', jt.value), '}'), '$.root')
FROM jt;