原文地址: http://stackoverflow.com/questions/5013003/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How do I parse JSON in Pig?
Asked by Eric Lubow
I have a lot of gzipped log files in S3 that contain three types of log lines: b, c, and i. Types i and c are both single-level JSON:
{"this":"that","test":"4"}
Type b is deeply nested JSON. I came across this gist talking about compiling a jar to make this work. Since my Java skills are less than stellar, I didn't really know where to go from there.
{"this":{"foo":"bar","baz":{"test":"me"},"total":"5"}}
Since types i and c are not always in the same order, specifying everything in the GENERATE regex is difficult. Is handling JSON (in a gzipped file) possible with Pig? I am using whichever version of Pig comes built into an Amazon Elastic MapReduce instance.
This boils down to two questions: 1) Can I parse JSON with Pig (and if so, how)? 2) If I can parse JSON (from a gzipped logfile), can I parse nested JSON objects?
Accepted answer by Eric Lubow
After a lot of workarounds and working through things, I was able to get this done. I did a write-up on my blog about how to do this. It is available here: http://eric.lubow.org/2011/hadoop/pig-queries-parsing-json-on-amazons-elastic-map-reduce-using-s3-data/
Answered by Thejas Nair
Pig 0.10 comes with the built-in JsonStorage and JsonLoader().
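As a minimal sketch of how the built-in pair fits together (the file paths are hypothetical; the schema string matches the single-level records from the question):

```pig
-- Load single-level records like {"this":"that","test":"4"};
-- the built-in JsonLoader needs each field named in a schema string.
A = LOAD 'data.json' USING JsonLoader('this:chararray, test:chararray');
B = FOREACH A GENERATE test;
-- Write results back out as JSON with the matching built-in storer.
STORE B INTO 'output_dir' USING JsonStorage();
```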
Answered by Eran Kampf
Pig comes with a JSON loader. To load, you use:
A = LOAD 'data.json'
    USING PigJsonLoader();
To store, you can use:
STORE A INTO 'output.json'
    USING PigJsonLoader();
However, I'm not sure it supports gzipped data.
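For what it's worth, Hadoop's input formats decompress files with recognized extensions (such as .gz) on the fly, so loaders built on them usually read gzipped input transparently. A hedged sketch (the bucket path is hypothetical):

```pig
-- .gz files are decompressed by Hadoop's compression codecs before the
-- loader sees the lines, so gzipped logs typically need no extra handling.
logs = LOAD 's3://my-bucket/logs/*.gz'
       USING JsonLoader('this:chararray, test:chararray');
```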
Answered by A B
Please try this: https://github.com/a-b/elephant-bird
Answered by Reddevil
We can do it by using JsonLoader, but we have to specify the schema for the JSON data, or else it may raise an error. Just follow the link below:
http://joshualande.com/read-write-json-apache-pig/
We can also do it by creating a UDF to parse it.
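One way to sketch such a UDF is in Python, since Pig can register Python UDFs via Jython. The function name and dotted-path convention below are hypothetical illustrations, not part of any answer above; in a real deployment the function would carry Pig's @outputSchema decorator and be registered from a script file.

```python
import json

def extract_field(json_line, path):
    """Walk a nested JSON object along a dotted path, e.g. 'this.baz.test'.

    In Pig this would be decorated with @outputSchema('value:chararray')
    and registered with something like:
      REGISTER 'udf.py' USING jython AS judf;   -- names hypothetical
    """
    obj = json.loads(json_line)
    for key in path.split('.'):
        obj = obj[key]
    return obj

# Using the nested sample record from the question:
print(extract_field('{"this":{"foo":"bar","baz":{"test":"me"},"total":"5"}}',
                    'this.baz.test'))  # prints "me"
```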
Answered by Shrikant
You can try using the Twitter elephant-bird JSON loader; it handles JSON data dynamically. But you have to be very precise with the schema.
api_data = LOAD 'file name' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
Answered by rahulbmv
I have seen the usage of Twitter's elephant-bird increase a lot, and it is quickly becoming the go-to library for JSON parsing in Pig.
Example:
DEFINE TwitterJsonLoader com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true');
JsonInput = LOAD 'input_path' USING TwitterJsonLoader() AS (entity: map[]);
InputObjects = FOREACH JsonInput GENERATE (map[]) entity#'Object' AS JsonObject;
InputIds = FOREACH InputObjects GENERATE JsonObject#'id' AS id;

