Hive: parsing JSON
Source: http://stackoverflow.com/questions/12645634/
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the source address, and attribute it to the original authors (not me): StackOverFlow
Asked by Don P
I am trying to get some values out of nested JSON for millions of rows (5 TB+ table). What is the most efficient way to do this?
Here is an example:
{"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}}
I need these values out of the above JSON:
Country   Page   impressions_s   impressions_o
-------   ----   -------------   -------------
US        227    10              10
There is Hive's json_tuple function, but I am not sure whether it is the best function for this. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-getjsonobject
Answered by Wemerson Cesar
You can use get_json_object:
select get_json_object(fieldname, '$.country'),
get_json_object(fieldname, '$.data.ad.impressions.s') from ...
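As a rough sketch of how this looks for the sample document in the question, the query below pulls all four values with get_json_object. The table name hive_parsing_json_table and its single string column json are assumptions for illustration; substitute your own table and column.

```sql
-- Assumes a table hive_parsing_json_table with one STRING column `json`
-- holding one JSON document per row (hypothetical names).
SELECT
  get_json_object(t.json, '$.country')               AS country,
  get_json_object(t.json, '$.page')                  AS page,
  -- In the sample document the counters sit under data.ad.impressions.
  get_json_object(t.json, '$.data.ad.impressions.s') AS impressions_s,
  get_json_object(t.json, '$.data.ad.impressions.o') AS impressions_o
FROM hive_parsing_json_table t;
```

get_json_object returns strings, so wrap a call in CAST(... AS INT) if you need numeric columns. The Hive manual describes json_tuple as more efficient than repeated get_json_object calls when several keys are read from the same JSON string.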
You will get better performance with json_tuple, but I found a how-to for getting values out of JSON nested inside JSON. To format your table you can use something like this:
from table t lateral view
explode( split(regexp_replace(get_json_object(ln, '$.data.ad.s'), '\\[|\\]', ''), ',' ) ) tb1 as s
The code above will turn your array into a column.
For more, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
I hope this helps.
Answered by Sanjiv
Here is something you can try quickly. I would suggest using a JSON SerDe.
nano /tmp/hive-parsing-json.json
{"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}}
Create the base table:
hive > CREATE TABLE hive_parsing_json_table ( json string );
Load the JSON file into the table:
hive > LOAD DATA LOCAL INPATH '/tmp/hive-parsing-json.json' INTO TABLE hive_parsing_json_table;
Query the table:
hive > select v1.Country, v1.Page, v4.impressions_s, v4.impressions_o
from hive_parsing_json_table hpjp
LATERAL VIEW json_tuple(hpjp.json, 'country', 'page', 'data') v1
as Country, Page, data
LATERAL VIEW json_tuple(v1.data, 'ad') v2
as Ad
LATERAL VIEW json_tuple(v2.Ad, 'impressions') v3
as Impressions
LATERAL VIEW json_tuple(v3.Impressions, 's' , 'o') v4
as impressions_s,impressions_o;
Output:
v1.country v1.page v4.impressions_s v4.impressions_o
US 227 10 10
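The chain of lateral views above can probably be shortened. As a sketch against the same hive_parsing_json_table, json_tuple can extract the top-level keys once and get_json_object can then reach into the nested data string. This is an illustrative variant, not part of the original answer.

```sql
-- json_tuple returns the nested "data" object as a JSON string,
-- so get_json_object can address the values inside it by path.
SELECT v1.country,
       v1.page,
       get_json_object(v1.data, '$.ad.impressions.s') AS impressions_s,
       get_json_object(v1.data, '$.ad.impressions.o') AS impressions_o
FROM hive_parsing_json_table hpjp
LATERAL VIEW json_tuple(hpjp.json, 'country', 'page', 'data') v1
  AS country, page, data;
```

Both functions return strings; add a CAST if the counters need to be integers. Whether this is faster than the four chained json_tuple calls depends on the data, so it is worth timing both on a sample.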
Answered by Hemantha Kumara M S
Using Hive's native JSON SerDe ('org.apache.hive.hcatalog.data.JsonSerDe') you can do this. Here are the steps:
ADD JAR /path/to/hive-hcatalog-core.jar;
Create a table as below:
CREATE TABLE json_serde_nestedjson (
country string,
page int,
data struct < ad: struct < impressions: struct < s:int, o:int > > >
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
Then load the data (stored in a file):
LOAD DATA LOCAL INPATH '/tmp/nested.json' INTO TABLE json_serde_nestedjson;
Then get the required data using:
SELECT country, page, data.ad.impressions.s, data.ad.impressions.o
FROM json_serde_nestedjson;
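If the flattened columns are queried often, one option is to wrap the query above in a view so that callers do not need to know the struct layout. This is only a sketch; the view name json_serde_flat is hypothetical.

```sql
-- Hypothetical view exposing the nested struct fields as flat columns.
CREATE VIEW IF NOT EXISTS json_serde_flat AS
SELECT country,
       page,
       data.ad.impressions.s AS impressions_s,
       data.ad.impressions.o AS impressions_o
FROM json_serde_nestedjson;

-- Usage: the view reads like an ordinary four-column table.
SELECT country, page, impressions_s, impressions_o
FROM json_serde_flat;
```

With the SerDe table, the struct fields already carry the int types declared in the table definition, so no casting is needed here.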
Answered by HuntingCheetah
Implementing a SerDe to parse your JSON data is a better approach for your case.
A tutorial on how to implement a SerDe for parsing JSON can be found here:
http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
You can use the following sample SerDe implementation as well

