Original question: http://stackoverflow.com/questions/11479247/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
How do you make a HIVE table out of JSON data?
Asked by nickponline
I want to create a Hive table out of some nested JSON data and run queries on it. Is this even possible?
I've gotten as far as uploading the JSON file to S3 and launching an EMR instance, but I don't know what to type in the Hive console to turn the JSON file into a Hive table.
Does anyone have an example command to get me started? I can't find anything useful with Google.
Accepted answer by seedhead
You'll need to use a JSON serde in order for Hive to map your JSON to the columns in your table.
A really good example showing how to do this is here:
http://aws.amazon.com/articles/2855
Unfortunately the JSON serde supplied doesn't handle nested JSON very well so you might need to flatten your JSON in order to use it.
Here's an example of the correct syntax from the article:
create external table impressions (
requestBeginTime string, requestEndTime string, hostname string
)
partitioned by (
dt string
)
row format
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties (
'paths'='requestBeginTime, requestEndTime, hostname'
)
location 's3://my.bucket/' ;
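If you do end up flattening nested JSON before pointing a SerDe like this at it, a small preprocessing script is one way to go. This is only a minimal sketch in Python, not something from the article: the function names, the "_" separator and the sample record are all my own assumptions, and it only expands nested objects (arrays are left untouched).

```python
import json

def flatten(obj, parent_key="", sep="_"):
    """Collapse nested dicts into a single-level dict with joined keys."""
    flat = {}
    for key, value in obj.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, prefixing their keys
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars and lists are kept as-is; only objects are expanded
            flat[full_key] = value
    return flat

def flatten_line(line):
    """Flatten one JSON record and return it as a single-line JSON string."""
    return json.dumps(flatten(json.loads(line)))

# Hypothetical record, not taken from the article
record = '{"hostname": "h1", "request": {"beginTime": "100", "endTime": "200"}}'
print(flatten_line(record))
```

For the hypothetical record above this prints {"hostname": "h1", "request_beginTime": "100", "request_endTime": "200"}, so the flattened keys can be listed directly as columns.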
Answer by Mike Repass
It's actually not necessary to use the JSON SerDe. There is a great blog post here (I'm not affiliated with the author in any way):
http://pkghosh.wordpress.com/2012/05/06/hive-plays-well-with-json/
It outlines a strategy that uses the built-in function json_tuple to parse the JSON at query time (NOT at table-definition time):
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-json_tuple
So basically, your table schema just loads each line as a single 'string' column, and you then extract the relevant JSON fields as needed on a per-query basis. For example, this query from that blog post:
SELECT b.blogID, c.email
FROM comments a
LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b AS blogID, contact
LATERAL VIEW json_tuple(b.contact, 'email', 'website') c AS email, website
WHERE b.blogID='64FY4D0B28';
In my humble experience, this has proven more reliable (I encountered various cryptic issues dealing with the JSON serdes, especially with nested objects).
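To make it clearer what the two chained json_tuple calls in that query are doing, here is a rough Python model of the idea. It is a sketch of my understanding of the behaviour, not Hive's implementation: the Python json_tuple below, the emails_for_blog helper and the sample rows are all hypothetical. The key point is that a nested object comes back as JSON text, which is why the second call can parse the contact value again.

```python
import json

def json_tuple(json_str, *keys):
    """Return one string (or None) per requested top-level key."""
    try:
        obj = json.loads(json_str) if json_str is not None else None
    except ValueError:
        obj = None
    if not isinstance(obj, dict):
        # Unparseable or missing input: every requested field is None
        return tuple(None for _ in keys)
    out = []
    for key in keys:
        value = obj.get(key)
        if value is None:
            out.append(None)
        elif isinstance(value, str):
            out.append(value)
        else:
            # Nested objects, arrays and numbers are returned as JSON text
            out.append(json.dumps(value))
    return tuple(out)

def emails_for_blog(lines, blog_id):
    """Model of the query: one raw JSON string per row, parsed at query time."""
    for line in lines:
        b_blog_id, b_contact = json_tuple(line, "blogID", "contact")
        c_email, _c_website = json_tuple(b_contact, "email", "website")
        if b_blog_id == blog_id:
            yield b_blog_id, c_email

# Hypothetical rows standing in for the single string column of the table
rows = [
    '{"blogID": "64FY4D0B28", "contact": {"email": "a@example.com", "website": "example.com"}}',
    '{"blogID": "OTHER", "contact": {"email": "b@example.com"}}',
]
print(list(emails_for_blog(rows, "64FY4D0B28")))
```

Nothing about the JSON layout is fixed in the table definition here, so a malformed row only affects the queries that touch it.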
Answer by otto
I just had to solve the same problem, and none of the JSON SerDes linked to so far seemed good enough. Amazon's might be good, but I can't find the source for it anywhere (does anyone have a link?).
HCatalog's built-in JsonSerDe is working for me, even though I'm not actually using HCatalog anywhere else.
To use HCatalog's JsonSerDe, add the hcatalog-core .jar to Hive's auxpath and create your hive table:
$ hive --auxpath /path/to/hcatalog-core.jar
hive (default)>
create table my_table(...)
ROW FORMAT SERDE
'org.apache.hcatalog.data.JsonSerDe'
...
;
I wrote a post here with more details
Answer by Heapify
Hive 0.12 and later ships a JsonSerDe in hcatalog-core that will serialize and deserialize your JSON data. So, all you need to do is create an external table like the following example:
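-- Hive has no "long" type, so the numeric column is declared as bigint.
-- timestamp is a reserved word in newer Hive releases, hence the backticks.
CREATE EXTERNAL TABLE json_table (
username string,
tweet string,
`timestamp` bigint)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION
'hdfs://data/some-folder-in-hdfs';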
The corresponding JSON data file should look like the following example:
{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }
{"username":"BlizzardCS","tweet":"Works as intended. Terran is IMBA.","timestamp": 1366154481 }
Answer by Rok Kralj
Generating SerDe schema from .json file
If your .json file is big, it might be tedious to write the schema by hand. If so, you can use this handy tool to generate it automatically.
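To give an idea of what generating the schema involves, here is a minimal Python sketch that infers Hive column types from a single sample record. It is not the tool mentioned above, just an illustration under my own assumptions: the function names are hypothetical, integers map to bigint, arrays are assumed homogeneous (empty ones fall back to array<string>), nulls are treated as string, and an empty nested object would need handling by hand.

```python
import json

def hive_type(value):
    """Map a parsed JSON value to a Hive type name."""
    # bool must be tested before int: bool is a subclass of int in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        fields = ",".join(f"{k}:{hive_type(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    if isinstance(value, list):
        # Assumes a homogeneous array and looks only at the first element
        return f"array<{hive_type(value[0]) if value else 'string'}>"
    return "string"  # strings and nulls

def columns_for(sample_line):
    """Build the column list of a CREATE TABLE statement from one record."""
    record = json.loads(sample_line)
    return ",\n".join(f"  `{name}` {hive_type(v)}" for name, v in record.items())

sample = '{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }'
print(columns_for(sample))
```

A single record can easily miss optional fields, so it is worth checking the result against more than one line of the real data.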
Answer by davidemm
JSON processing capabilities are now available in Hive out-of-the-box.
Hive 4.0.0 and later
CREATE TABLE ... STORED AS JSONFILE
Each JSON object must be flattened to fit on one line (newline characters are not supported). These objects are not part of a formal JSON array.
{"firstName":"John","lastName":"Smith","Age":21}
{"firstName":"Jane","lastName":"Harding","Age":18}
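If your data currently sits in a pretty-printed file or a JSON array, it has to be rewritten into that one-object-per-line layout first. Here is a minimal Python sketch of that conversion. The function name and the sample input are my own assumptions, and it reads the whole document into memory, so it only suits files that fit in RAM.

```python
import json

def to_json_lines(text):
    """Convert a JSON document (one object or an array of objects) to one object per line."""
    data = json.loads(text)
    records = data if isinstance(data, list) else [data]
    # json.dumps without indent emits no raw newlines; newlines inside
    # string values are written as the two-character escape sequence
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

# Hypothetical pretty-printed input wrapped in an array
pretty = """[
  {"firstName": "John", "lastName": "Smith", "Age": 21},
  {"firstName": "Jane", "lastName": "Harding", "Age": 18}
]"""
print(to_json_lines(pretty))
```

For this input the output is the same two lines as in the example above.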

