Scala: Read all Parquet files saved in a folder via Spark

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/43039254/

Read all Parquet files saved in a folder via Spark

scala, apache-spark, apache-spark-sql

Asked by himanshuIIITian

I have a folder containing Parquet files. Something like this:

scala> val df = sc.parallelize(List(1,2,3,4)).toDF()
df: org.apache.spark.sql.DataFrame = [value: int]

scala> df.write.parquet("/tmp/test/df/1.parquet")

scala> val df = sc.parallelize(List(5,6,7,8)).toDF()
df: org.apache.spark.sql.DataFrame = [value: int]

scala> df.write.parquet("/tmp/test/df/2.parquet")

After saving the DataFrames, when I go to read all the Parquet files in the df folder, it gives me an error.

scala> val read = spark.read.parquet("/tmp/test/df")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun.apply(DataSource.scala:189)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun.apply(DataSource.scala:189)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
  ... 48 elided

I know I can read Parquet files by giving the full path, but it would be better if there were a way to read all the Parquet files in a folder.

Answered by eliasah

Spark doesn't write/read parquet the way you think it does.

It uses the Hadoop library to write/read partitioned Parquet files.

Thus your first Parquet file is under the path /tmp/test/df/1.parquet/, where 1.parquet is a directory. This means that when reading Parquet you need to provide the path to your Parquet directory, or the path to the file itself if it's a single file.

val df = spark.read.parquet("/tmp/test/df/1.parquet/")
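
To make the point concrete, here is a small sketch (assuming the paths from the question and an active spark session) that lists what was actually written under 1.parquet via the Hadoop FileSystem API:

import org.apache.hadoop.fs.{FileSystem, Path}

// 1.parquet is a directory holding a _SUCCESS marker and one or more part files
// (the exact part-file names will differ from run to run).
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/tmp/test/df/1.parquet"))
  .foreach(status => println(status.getPath.getName))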

I advise you to read the official documentation for more details. [cf. SQL Programming Guide - Parquet Files]

EDIT:

You must be looking for something like this:

scala> sqlContext.range(1,100).write.save("/tmp/test/df/1.parquet")

scala> sqlContext.range(100,500).write.save("/tmp/test/df/2.parquet")

scala> val df = sqlContext.read.load("/tmp/test/df/*")
// df: org.apache.spark.sql.DataFrame = [id: bigint]

scala> df.show(3)
// +---+
// | id|
// +---+
// |400|
// |401|
// |402|
// +---+
// only showing top 3 rows

scala> df.count
// res3: Long = 499

You can also use wildcards in your file path URIs.
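
For instance, a glob that matches both output directories (a sketch, assuming the layout produced by the writes above):

scala> val dfGlob = sqlContext.read.load("/tmp/test/df/*.parquet")
// dfGlob: org.apache.spark.sql.DataFrame = [id: bigint]

scala> dfGlob.count
// res: Long = 499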

And you can provide multiple file paths as follows:

scala> val df2 = sqlContext.read.load("/tmp/test/df/1.parquet","/tmp/test/df/2.parquet")
// df2: org.apache.spark.sql.DataFrame = [id: bigint]

scala> df2.count
// res5: Long = 499

Answered by ktheitroadalo

The paths you wrote to, /tmp/test/df/1.parquet and /tmp/test/df/2.parquet, are not output files; they are output directories. So you can read the Parquet data with:

val data = spark.read.parquet("/tmp/test/df/1.parquet/")

Answered by Ihor Konovalenko

You can write data into a folder without creating separate Spark "files" (in fact folders) such as 1.parquet, 2.parquet, etc. If you don't set a file name but only a path, Spark will put the files into the folder as real files (not folders) and name those files automatically.

df1.write.partitionBy("countryCode").format("parquet").mode("overwrite").save("/tmp/data1/")
df2.write.partitionBy("countryCode").format("parquet").mode("append").save("/tmp/data1/")
df3.write.partitionBy("countryCode").format("parquet").mode("append").save("/tmp/data1/")

Further, we can read data from all files in the data folder:

val df = spark.read.format("parquet").load("/tmp/data1/")
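
Since the data was partitioned by countryCode, that column is recovered as a normal column in the loaded DataFrame, and a filter on it only reads the matching countryCode=... sub-directories (partition pruning). A short follow-up sketch; the partition value "US" here is made up:

// The partitioning column reappears in the schema of the loaded DataFrame
df.printSchema()

// "US" is a hypothetical value; the read only touches /tmp/data1/countryCode=US/
val dfUs = df.filter(df("countryCode") === "US")
dfUs.show(5)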