使用 HIVE 解析 json 数组
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/24447428/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Parse json arrays using HIVE
提问by marlieg
I have many json arrays stored in a table (jt) that looks like this:
我在一个表 (jt) 中存储了许多 json 数组,如下所示:
[{"ts":1403781896,"id":14,"log":"show"},{"ts":1403781896,"id":14,"log":"start"}]
[{"ts":1403781911,"id":14,"log":"press"},{"ts":1403781911,"id":14,"log":"press"}]
Each array is a record.
每个数组都是一个记录。
I would like to parse this table in order to get a new table (logs) with 3 fields: ts, id, log. I tried to use the get_json_object method, but it seems that method is not compatible with json arrays because I only get null values.
我想解析这个表以获得一个包含 3 个字段的新表(日志):ts、id、log。我尝试使用 get_json_object 方法,但似乎该方法与 json 数组不兼容,因为我只获取空值。
This is the code I have tested:
这是我测试过的代码:
CREATE TABLE logs AS
SELECT get_json_object(jt.value, '$.ts') AS ts,
get_json_object(jt.value, '$.id') AS id,
get_json_object(jt.value, '$.log') AS log
FROM jt;
I tried to use other functions but they seem really complicated. Thank you! :)
我尝试使用其他功能,但它们看起来很复杂。谢谢!:)
Update! I solved my issue by performing a regexp:
更新!我通过执行正则表达式解决了我的问题:
CREATE TABLE jt_reg AS
select regexp_replace(regexp_replace(value,'\}\,\{','\}\\n\{'),'\[|\]','') as valuereg from jt;
CREATE TABLE logs AS
SELECT get_json_object(jt_reg.valuereg, '$.ts') AS ts,
get_json_object(jt_reg.valuereg, '$.id') AS id,
get_json_object(jt_reg.valuereg, '$.log') AS log
FROM ams_json_reg;
回答by vijay kumar
Use explode() function
使用 explode()函数
hive (default)> CREATE TABLE logs AS
> SELECT get_json_object(single_json_table.single_json, '$.ts') AS ts,
> get_json_object(single_json_table.single_json, '$.id') AS id,
> get_json_object(single_json_table.single_json, '$.log') AS log
> FROM
> (SELECT explode(json_array_col) as single_json FROM jt) single_json_table ;
Automatically selecting local only mode for query
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
hive (default)> select * from logs;
OK
ts id log
1403781896 14 show
1403781896 14 start
1403781911 14 press
1403781911 14 press
Time taken: 0.118 seconds, Fetched: 4 row(s)
hive (default)>
where json_array_col is column in jt which holds your array of jsons.
其中 json_array_col 是 jt 中的列,其中包含您的 jsons 数组。
hive (default)> select json_array_col from jt;
json_array_col
["{"ts":1403781896,"id":14,"log":"show"}","{"ts":1403781896,"id":14,"log":"start"}"]
["{"ts":1403781911,"id":14,"log":"press"}","{"ts":1403781911,"id":14,"log":"press"}"]
回答by ryan
I just ran into this problem, with the JSON array stored as a string in the hive table.
我刚刚遇到了这个问题,JSON 数组作为字符串存储在 hive 表中。
The solution is a bit hacky and ugly, but it works and doesn't require serdes or external UDFs
该解决方案有点笨拙和丑陋,但它有效并且不需要 serdes 或外部 UDF
SELECT
get_json_object(single_json_table.single_json, '$.ts') AS ts,
get_json_object(single_json_table.single_json, '$.id') AS id,
get_json_object(single_json_table.single_json, '$.log') AS log
FROM ( SELECT explode (
split(regexp_replace(substr(json_array_col, 2, length(json_array_col)-2),
'"}","', '"}",,,,"'), ',,,,')
) FROM src_table) single_json_table;
I broke the lines up so that it would be a little easier to read. I'm using substr() to strip the first and last characters, removing [ and ] . I'm then using regex_replace to match the separator between records in the json array and adding or changing the separator to be something unique that can then be used easily with split() to turn the string into a hive array of json objects which can then be used with explode() as described in the previous solution.
我把行拆开,这样会更容易阅读。我正在使用 substr() 去除第一个和最后一个字符,删除 [ 和 ] 。然后我使用 regex_replace 来匹配 json 数组中记录之间的分隔符,并将分隔符添加或更改为独特的东西,然后可以轻松地与 split() 一起使用将字符串转换为 json 对象的 hive 数组,然后如上一个解决方案中所述,与explode() 一起使用。
Note, the separator regex used here ( "}"," ) wouldn't work with the original data set...the regex would have to be ( "},\{" ) and the replacement would then need to be "},,,,{" eg..
请注意,此处使用的分隔符正则表达式 ( "}"," ) 不适用于原始数据集...正则表达式必须为 ( "},\{" ),然后替换为 "} ,,,,{" 例如..
split(regexp_replace(substr(json_array_col, 2, length(json_array_col)-2),
'"},\{"', '"},,,,{"'), ',,,,')
回答by lingjue8848
because get_json_object doesn't support json array string, so you can concat to a json object, like this:
因为 get_json_object 不支持 json 数组字符串,所以你可以连接到一个 json 对象,像这样:
SELECT
get_json_object(concat(concat('{"root":', jt.value), '}'), '$.root')
FROM jt;

