原文地址: http://stackoverflow.com/questions/5013003/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How do I parse JSON in Pig?
Asked by Eric Lubow
I have a lot of gzipped log files in S3 that contain three types of log lines: b, c, and i. Types i and c are both single-level JSON:
{"this":"that","test":"4"}
Type b is deeply nested JSON. I came across this gist talking about compiling a jar to make this work. Since my Java skills are less than stellar, I didn't really know where to go from there.
{"this":{"foo":"bar","baz":{"test":"me"},"total":"5"}}
Since types i and c are not always in the same order, specifying everything in the GENERATE regex is difficult. Is handling JSON (in a gzipped file) possible with Pig? I am using whichever version of Pig comes built into an Amazon Elastic MapReduce instance.
This boils down to two questions: 1) Can I parse JSON with Pig (and if so, how)? 2) If I can parse JSON (from a gzipped logfile), can I parse nested JSON objects?
Accepted answer by Eric Lubow
After a lot of workarounds and working through things, I was able to get this done. I did a write-up on my blog about how to do this. It is available here: http://eric.lubow.org/2011/hadoop/pig-queries-parsing-json-on-amazons-elastic-map-reduce-using-s3-data/
Answered by Thejas Nair
Pig 0.10 comes with the built-in JsonStorage and JsonLoader().
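As a minimal sketch of how the built-in pair fits together (the file paths are hypothetical; the schema string matches the single-level records from the question):

```pig
-- Load single-level records like {"this":"that","test":"4"};
-- the built-in JsonLoader needs each field named in a schema string.
A = LOAD 'data.json' USING JsonLoader('this:chararray, test:chararray');
B = FOREACH A GENERATE test;
-- Write results back out as JSON with the matching built-in storer.
STORE B INTO 'output_dir' USING JsonStorage();
```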
Answered by Eran Kampf
Pig comes with a JSON loader. To load, you use:
A = LOAD 'data.json'
    USING PigJsonLoader();
To store, you can use:
STORE A INTO 'output.json'
    USING PigJsonLoader();
However, I'm not sure it supports gzipped data.
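For what it's worth, Hadoop's input formats decompress files with recognized extensions (such as .gz) on the fly, so loaders built on them usually read gzipped input transparently. A hedged sketch (the bucket path is hypothetical):

```pig
-- .gz files are decompressed by Hadoop's compression codecs before the
-- loader sees the lines, so gzipped logs typically need no extra handling.
logs = LOAD 's3://my-bucket/logs/*.gz'
       USING JsonLoader('this:chararray, test:chararray');
```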
Answered by A B
Please try this: https://github.com/a-b/elephant-bird
Answered by Reddevil
We can do it by using JsonLoader, but we have to specify the schema for the JSON data, or else it may raise an error. Just follow the link below:
http://joshualande.com/read-write-json-apache-pig/
We can also do it by creating a UDF to parse it.
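One way to sketch such a UDF is in Python, since Pig can register Python UDFs via Jython. The function name and dotted-path convention below are hypothetical illustrations, not part of any answer above; in a real deployment the function would carry Pig's @outputSchema decorator and be registered from a script file.

```python
import json

def extract_field(json_line, path):
    """Walk a nested JSON object along a dotted path, e.g. 'this.baz.test'.

    In Pig this would be decorated with @outputSchema('value:chararray')
    and registered with something like:
      REGISTER 'udf.py' USING jython AS judf;   -- names hypothetical
    """
    obj = json.loads(json_line)
    for key in path.split('.'):
        obj = obj[key]
    return obj

# Using the nested sample record from the question:
print(extract_field('{"this":{"foo":"bar","baz":{"test":"me"},"total":"5"}}',
                    'this.baz.test'))  # prints "me"
```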
Answered by Shrikant
You can try using the Twitter elephant-bird JSON loader; it handles JSON data dynamically. But you have to be very precise with the schema.
api_data = LOAD 'file name' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
Answered by rahulbmv
I have seen the usage of Twitter's elephant-bird increase a lot, and it is quickly becoming the go-to library for JSON parsing in Pig.
Example:
DEFINE TwitterJsonLoader com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true');
JsonInput = LOAD 'input_path' USING TwitterJsonLoader() AS (entity: map[]);
InputObjects = FOREACH JsonInput GENERATE (map[]) entity#'Object' AS JsonObject;
InputIds = FOREACH InputObjects GENERATE JsonObject#'id' AS id;

