Scala: partition a Spark dataframe based on column value?
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must likewise follow the CC BY-SA license, note the original URL, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/44964619/
Partition a spark dataframe based on column value?
Asked by jdk2588
I have a dataframe from a SQL source which looks like:
User(id: Long, fname: String, lname: String, country: String)
[1, Fname1, Lname1, Belarus]
[2, Fname2, Lname2, Belgium]
[3, Fname3, Lname3, Austria]
[4, Fname4, Lname4, Australia]
I want to partition and write this data into CSV files where each partition is based on the initial letter of the country, so Belarus and Belgium should be in one output file, and Austria and Australia in another.
Accepted answer by ktheitroadalo
Here is what you can do:
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for the $ column syntax and toDF (spark is the SparkSession)

// Create a dataframe with demo data
val df = spark.sparkContext.parallelize(Seq(
  (1, "Fname1", "Lname1", "Belarus"),
  (2, "Fname2", "Lname2", "Belgium"),
  (3, "Fname3", "Lname3", "Austria"),
  (4, "Fname4", "Lname4", "Australia")
)).toDF("id", "fname", "lname", "country")

// Create a new column holding the first letter of the country
val result = df.withColumn("countryFirst", split($"country", "")(0))

// Save the data, partitioned by the first letter of the country
result.write.partitionBy("countryFirst").format("com.databricks.spark.csv").save("outputpath")
Edited: You can also use substring, which can improve performance, as suggested by Raphel:
substring(Column str, int pos, int len): returns the substring starting at pos with length len when str is String type, or the slice of the byte array starting at pos with length len when str is Binary type.
val result = df.withColumn("firstCountry", substring($"country",1,1))
and then use partitionBy with write.
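Putting the substring variant together, the write step would look like this (a sketch assuming the same SparkSession and df as in the snippet above):

```scala
import org.apache.spark.sql.functions.substring
import spark.implicits._

// Create the partition column with substring(start is 1-based) instead of split
val result = df.withColumn("firstCountry", substring($"country", 1, 1))

// Write one output directory per first letter, e.g. firstCountry=A, firstCountry=B
result.write.partitionBy("firstCountry").format("com.databricks.spark.csv").save("outputpath")
```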
Hope this solves your problem!
Answered by Shaido - Reinstate Monica
One alternative to solve this problem would be to first create a column containing only the first letter of each country. Having done this step, you could use partitionBy to save each partition to separate files.
dataFrame.write.partitionBy("column").format("com.databricks.spark.csv").save("/path/to/dir/")
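An end-to-end sketch of this answer's approach (the column name firstLetter is illustrative, not from the original answer):

```scala
import org.apache.spark.sql.functions.substring
import spark.implicits._

// Step 1: add a column with just the first letter of each country
val withFirst = dataFrame.withColumn("firstLetter", substring($"country", 1, 1))

// Step 2: each distinct value of firstLetter becomes its own output subdirectory
withFirst.write.partitionBy("firstLetter").format("com.databricks.spark.csv").save("/path/to/dir/")
```

Note that partitionBy writes the partition column as a directory name (e.g. firstLetter=B/) rather than as a column inside the CSV files.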

