Read multiline JSON in Apache Spark

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA license, link the original, and attribute it to the original authors (not me): StackOverflow

Original: http://stackoverflow.com/questions/38545850/
Asked by Finkelson
I was trying to use a JSON file as a small DB. After creating a template table on DataFrame I queried it with SQL and got an exception. Here is my code:
val df = sqlCtx.read.json("/path/to/user.json")
df.registerTempTable("user_tt")
val info = sqlCtx.sql("SELECT name FROM user_tt")
info.show()
df.printSchema() result:
root
|-- _corrupt_record: string (nullable = true)
My JSON file:
{
"id": 1,
"name": "Morty",
"age": 21
}
Exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];
How can I fix it?
UPD
_corrupt_record is:
+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "id": 1,|
| "name": "Morty",|
| "age": 21|
| }|
+--------------------+
UPD2
It's weird, but when I rewrite my JSON as a one-liner, everything works fine.
{"id": 1, "name": "Morty", "age": 21}
So the problem is the newlines.
UPD3
I found the following sentence in the docs:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
It isn't convenient to keep JSON in such a format. Is there any workaround to get rid of the multi-line structure of JSON, or to convert it into a one-liner?
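One workaround (for Spark versions without multi-line support) is to pre-process the files outside Spark: parse each pretty-printed JSON document and re-serialize it as a single line. A minimal sketch in Python, not part of any answer below:

```python
import json

def to_json_line(text: str) -> str:
    """Collapse one pretty-printed JSON document into a single line."""
    return json.dumps(json.loads(text), separators=(", ", ": "))

pretty = """{
  "id": 1,
  "name": "Morty",
  "age": 21
}"""
print(to_json_line(pretty))  # {"id": 1, "name": "Morty", "age": 21}
```

Running this over every document and writing the results one per line produces a JSONL file that `sqlCtx.read.json` can load directly.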
Answered by zero323
Spark >= 2.2
Spark 2.2 introduced the multiLine option (initially named wholeFile), which can be used to load JSON (not JSONL) files:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
See:
- SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).
- SPARK-20980 - Rename the option wholeFile to multiLine for JSON and CSV.
Spark < 2.2
Well, using JSONL-formatted data may be inconvenient, but I will argue that this is not an issue with the API but with the format itself. JSON is simply not designed to be processed in parallel in distributed systems.
It provides no schema and, without making some very specific assumptions about its formatting and shape, it is almost impossible to correctly identify top-level documents. Arguably this is the worst possible format to use in systems like Apache Spark. It is also quite tricky and typically impractical to write valid JSON in distributed systems.
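To illustrate the point (a sketch with made-up records, not from the original answer): JSON Lines can be split on newlines and each piece parsed independently, which is exactly what Spark does per partition, while a pretty-printed document cannot.

```python
import json

# JSON Lines: each line is a self-contained document, so a reader
# (or each Spark partition) can parse lines independently.
jsonl = '{"id": 1, "name": "Morty", "age": 21}\n{"id": 2, "name": "Rick", "age": 70}'
records = [json.loads(line) for line in jsonl.splitlines()]

# A pretty-printed document spans several lines; splitting it on
# newlines yields fragments that are not valid JSON on their own.
pretty = '{\n  "id": 1,\n  "name": "Morty",\n  "age": 21\n}'
fragments = pretty.splitlines()
try:
    json.loads(fragments[0])  # just "{" - not parseable by itself
except json.JSONDecodeError:
    pass
```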
That being said, if individual files are valid JSON documents (either single document or an array of documents) you can always try wholeTextFiles:
spark.read.json(sc.wholeTextFiles("/path/to/user.json").values())
Answered by Dan Coates
Just to add to zero323's answer: the option in Spark 2.2+ to read multi-line JSON was renamed to multiLine (see the Spark documentation here).
Therefore, the correct syntax is now:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
This happened in https://issues.apache.org/jira/browse/SPARK-20980.

