Scala: reading a DataFrame from a partitioned parquet file

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/33650421/

Reading DataFrame from partitioned parquet file

Tags: scala, apache-spark, parquet, spark-dataframe

Asked by WoodChopper

How can I read a partitioned parquet file as a dataframe, with a condition on the partitions?

This works fine:

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*")

Partitions exist for day=1 to day=30. Is it possible to read something like (day = 5 to 6) or day=5,day=6?

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=??/*")

If I put * it gives me data for all 30 days, and that is too big.

Answered by Glennie Helles Sindholt

sqlContext.read.parquet can take multiple paths as input. If you want just day=5 and day=6, you can simply add two paths, like:

val dataframe = sqlContext
      .read.parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/", 
                    "file:///your/path/data=jDD/year=2015/month=10/day=6/")

If you have folders under day=X, like say country=XX, country will automatically be added as a column in the dataframe.
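
For illustration, here is a minimal sketch of that behaviour (the country=XX layout and the US/DE values below are hypothetical, not from the original question):

// Hypothetical layout with an extra partition level below day:
//   .../data=jDD/year=2015/month=10/day=5/country=US/part-00000.parquet
//   .../data=jDD/year=2015/month=10/day=5/country=DE/part-00000.parquet
val df = sqlContext.read
  .parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/")

df.printSchema()                           // schema now contains a "country" column
df.filter(df("country") === "US").show()   // filter on it like any other column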

EDIT: As of Spark 1.6 one needs to provide a "basePath" option in order for Spark to generate the partition columns automatically. In Spark 1.6.x the above would have to be re-written like this to create a dataframe with the columns "data", "year", "month" and "day":

val dataframe = sqlContext
  .read
  .option("basePath", "file:///your/path/")
  .parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/",
           "file:///your/path/data=jDD/year=2015/month=10/day=6/")

Answered by Neelesh Sambhajiche

If you want to read multiple days, for example day = 5 and day = 6, and want to specify the range in the path itself, wildcards can be used:

val dataframe = sqlContext
  .read
  .parquet("file:///your/path/data=jDD/year=2015/month=10/day={5,6}/*")

Wildcards can also be used to specify a range of days. Note that a glob character class such as [5-9] matches a single character, so it cannot match the two-digit value 10; an explicit alternation covers the whole range:

val dataframe = sqlContext
  .read
  .parquet("file:///your/path/data=jDD/year=2015/month=10/day={5,6,7,8,9,10}/*")

This matches all days from 5 to 10.

Answered by Kiran N

You need to provide the mergeSchema = true option, as shown below (this is from Spark 1.6.0):

val dataframe = sqlContext.read.option("mergeSchema", "true").parquet("file:///your/path/data=jDD")

This will read all the parquet files into the dataframe and also create the year, month and day columns in it.
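
Once the whole base path is loaded this way, the original "day = 5 to 6" requirement becomes a plain filter on the generated day column, so Spark should only scan the matching partitions. A minimal sketch, assuming the dataframe above:

// days 5 and 6 only, via partition-column predicates
val daysFiveToSix = dataframe.filter(dataframe("day") >= 5 && dataframe("day") <= 6)
daysFiveToSix.show()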

Ref: https://spark.apache.org/docs/1.6.0/sql-programming-guide.html#schema-merging
