Read multiline JSON in Apache Spark

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA license, link the original, and attribute it to the original authors (not me): StackOverflow

Original: http://stackoverflow.com/questions/38545850/
Asked by Finkelson
I was trying to use a JSON file as a small DB. After creating a template table on DataFrame I queried it with SQL and got an exception. Here is my code:
val df = sqlCtx.read.json("/path/to/user.json")
df.registerTempTable("user_tt")
val info = sqlCtx.sql("SELECT name FROM user_tt")
info.show()
df.printSchema() result:
root
|-- _corrupt_record: string (nullable = true)
My JSON file:
{
"id": 1,
"name": "Morty",
"age": 21
}
Exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];
How can I fix it?
UPD
_corrupt_record is:
+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "id": 1,|
| "name": "Morty",|
| "age": 21|
| }|
+--------------------+
UPD2
It's weird, but when I rewrite my JSON as a one-liner, everything works fine.
{"id": 1, "name": "Morty", "age": 21}
So the problem is the newlines.
UPD3
I found the following sentence in the docs:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
It isn't convenient to keep JSON in such a format. Is there any workaround to get rid of the multi-line structure of JSON, or to convert it into a one-liner?
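One workaround (for Spark versions without multi-line support) is to pre-process the files outside Spark: parse each pretty-printed JSON document and re-serialize it as a single line. A minimal sketch in Python, not part of any answer below:

```python
import json

def to_json_line(text: str) -> str:
    """Collapse one pretty-printed JSON document into a single line."""
    return json.dumps(json.loads(text), separators=(", ", ": "))

pretty = """{
  "id": 1,
  "name": "Morty",
  "age": 21
}"""
print(to_json_line(pretty))  # {"id": 1, "name": "Morty", "age": 21}
```

Running this over every document and writing the results one per line produces a JSONL file that `sqlCtx.read.json` can load directly.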
Answered by zero323
Spark >= 2.2
Spark 2.2 introduced the multiLine option (initially named wholeFile), which can be used to load JSON (not JSONL) files:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
See:
- SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).
- SPARK-20980 - Rename the option wholeFile to multiLine for JSON and CSV.
Spark < 2.2
Well, using JSONL-formatted data may be inconvenient, but I will argue that this is not an issue with the API but with the format itself. JSON is simply not designed to be processed in parallel in distributed systems.
It provides no schema and, without making some very specific assumptions about its formatting and shape, it is almost impossible to correctly identify top-level documents. Arguably this is the worst possible format to use in systems like Apache Spark. It is also quite tricky and typically impractical to write valid JSON in distributed systems.
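To illustrate the point (a sketch with made-up records, not from the original answer): JSON Lines can be split on newlines and each piece parsed independently, which is exactly what Spark does per partition, while a pretty-printed document cannot.

```python
import json

# JSON Lines: each line is a self-contained document, so a reader
# (or each Spark partition) can parse lines independently.
jsonl = '{"id": 1, "name": "Morty", "age": 21}\n{"id": 2, "name": "Rick", "age": 70}'
records = [json.loads(line) for line in jsonl.splitlines()]

# A pretty-printed document spans several lines; splitting it on
# newlines yields fragments that are not valid JSON on their own.
pretty = '{\n  "id": 1,\n  "name": "Morty",\n  "age": 21\n}'
fragments = pretty.splitlines()
try:
    json.loads(fragments[0])  # just "{" - not parseable by itself
except json.JSONDecodeError:
    pass
```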
That being said, if individual files are valid JSON documents (either single document or an array of documents) you can always try wholeTextFiles:
spark.read.json(sc.wholeTextFiles("/path/to/user.json").values())
Answered by Dan Coates
Just to add to zero323's answer: the option in Spark 2.2+ to read multi-line JSON was renamed to multiLine (see the Spark documentation here).
Therefore, the correct syntax is now:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
This happened in https://issues.apache.org/jira/browse/SPARK-20980.

