Scala: partition a Spark dataframe based on column value?
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must likewise follow the CC BY-SA license, note the original URL, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/44964619/
Partition a spark dataframe based on column value?
Asked by jdk2588
I have a dataframe from a SQL source which looks like:
User(id: Long, fname: String, lname: String, country: String)
[1, Fname1, Lname1, Belarus]
[2, Fname2, Lname2, Belgium]
[3, Fname3, Lname3, Austria]
[4, Fname4, Lname4, Australia]
I want to partition and write this data into CSV files where each partition is based on the initial letter of the country, so Belarus and Belgium should be in one output file, and Austria and Australia in another.
Accepted answer by ktheitroadalo
Here is what you can do:
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for the $ column syntax and toDF (spark is the SparkSession)

// Create a dataframe with demo data
val df = spark.sparkContext.parallelize(Seq(
  (1, "Fname1", "Lname1", "Belarus"),
  (2, "Fname2", "Lname2", "Belgium"),
  (3, "Fname3", "Lname3", "Austria"),
  (4, "Fname4", "Lname4", "Australia")
)).toDF("id", "fname", "lname", "country")

// Create a new column holding the first letter of the country
val result = df.withColumn("countryFirst", split($"country", "")(0))

// Save the data, partitioned by the first letter of the country
result.write.partitionBy("countryFirst").format("com.databricks.spark.csv").save("outputpath")
Edited: You can also use substring, which can improve performance, as suggested by Raphel:
substring(Column str, int pos, int len): returns the substring starting at pos with length len when str is String type, or the slice of the byte array starting at pos with length len when str is Binary type.
val result = df.withColumn("firstCountry", substring($"country",1,1))
and then use partitionBy with write.
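Putting the substring variant together, the write step would look like this (a sketch assuming the same SparkSession and df as in the snippet above):

```scala
import org.apache.spark.sql.functions.substring
import spark.implicits._

// Create the partition column with substring(start is 1-based) instead of split
val result = df.withColumn("firstCountry", substring($"country", 1, 1))

// Write one output directory per first letter, e.g. firstCountry=A, firstCountry=B
result.write.partitionBy("firstCountry").format("com.databricks.spark.csv").save("outputpath")
```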
Hope this solves your problem!
Answered by Shaido - Reinstate Monica
One alternative to solve this problem would be to first create a column containing only the first letter of each country. Having done this step, you could use partitionBy to save each partition to separate files.
dataFrame.write.partitionBy("column").format("com.databricks.spark.csv").save("/path/to/dir/")
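An end-to-end sketch of this answer's approach (the column name firstLetter is illustrative, not from the original answer):

```scala
import org.apache.spark.sql.functions.substring
import spark.implicits._

// Step 1: add a column with just the first letter of each country
val withFirst = dataFrame.withColumn("firstLetter", substring($"country", 1, 1))

// Step 2: each distinct value of firstLetter becomes its own output subdirectory
withFirst.write.partitionBy("firstLetter").format("com.databricks.spark.csv").save("/path/to/dir/")
```

Note that partitionBy writes the partition column as a directory name (e.g. firstLetter=B/) rather than as a column inside the CSV files.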

