Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/31669308/
How to split a dataframe into dataframes with same column values?
Asked by user1735076
Using Scala, how can I split a DataFrame into multiple DataFrames (be it an array or a collection) with the same column value? For example, I want to split the following DataFrame:
ID Rate State
1 24 AL
2 35 MN
3 46 FL
4 34 AL
5 78 MN
6 99 FL
to:
data set 1
ID Rate State
1 24 AL
4 34 AL
data set 2
ID Rate State
2 35 MN
5 78 MN
data set 3
ID Rate State
3 46 FL
6 99 FL
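For reference, a minimal sketch that builds this example DataFrame (assuming a local SparkSession; the names here are only illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("split-by-state").master("local[*]").getOrCreate()
import spark.implicits._

// sample data matching the table above
val df = Seq(
  (1, 24, "AL"), (2, 35, "MN"), (3, 46, "FL"),
  (4, 34, "AL"), (5, 78, "MN"), (6, 99, "FL")
).toDF("ID", "Rate", "State")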
Answered by zero323
You can collect the unique state values and simply map over the resulting array:
val states = df.select("State").distinct.collect.flatMap(_.toSeq)
val byStateArray = states.map(state => df.where($"State" <=> state))
or to a map:
val byStateMap = states
.map(state => (state -> df.where($"State" <=> state)))
.toMap
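A quick usage sketch, assuming the example DataFrame from the question: look up one split, or walk over all of them.

// look up the split for a single state (None if that state is absent)
byStateMap.get("AL").foreach(_.show())

// or inspect every split; note that each count() triggers its own job
byStateMap.foreach { case (state, stateDF) =>
  println(s"$state: ${stateDF.count()} rows")
}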
The same thing in Python:
from itertools import chain
from pyspark.sql.functions import col, lit
states = chain(*df.select("state").distinct().collect())
# PySpark 2.3 and later
# In 2.2 and before, col("state") == state
# should give the same outcome, ignoring NULLs
# if NULLs are important:
# (lit(state).isNull() & col("state").isNull()) | (col("state") == state)
df_by_state = {state: df.where(col("state").eqNullSafe(state))
               for state in states}
The obvious problem here is that it requires a full data scan for each level, so it is an expensive operation. If you're looking for a way to just split the output, see also How do I split an RDD into two or more RDDs?
In particular, you can write a Dataset partitioned by the column of interest:
val path: String = ???
df.write.partitionBy("State").parquet(path)
and read back if needed:
// Depends on partition pruning
for { state <- states } yield spark.read.parquet(path).where($"State" === state)
// or explicitly read the partition
for { state <- states } yield spark.read.parquet(s"$path/State=$state")
Depending on the size of the data, the number of split levels, and the storage and persistence level of the input, it might be faster or slower than multiple filters.
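If the repeated scans are the main concern, one hedged option is to cache the input once so that each per-state filter reads from the cached data instead of rescanning the source (assuming it fits within the chosen storage level):

import org.apache.spark.storage.StorageLevel

// cache the input once; the per-state filters then reuse the cached data
df.persist(StorageLevel.MEMORY_AND_DISK)
val cachedByState = states.map(state => state -> df.where($"State" <=> state)).toMap

// ... work with the per-state DataFrames here ...

df.unpersist() // release the cache once the splits are no longer needed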
Answered by ashK
It is very simple (if the Spark version is 2) if you register the DataFrame as a temporary table.
df1.createOrReplaceTempView("df1")
And now you can run the queries:
var df2 = spark.sql("select * from df1 where state = 'FL'")
var df3 = spark.sql("select * from df1 where state = 'MN'")
var df4 = spark.sql("select * from df1 where state = 'AL'")
Now you have df2, df3, and df4. If you want to have them as local lists, you can use:
df2.collect()
df3.collect()
or even the map/filter functions. Please refer to https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes
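As a rough illustration of working with the collected rows (assuming the Int/Int/String schema of the example data, and results small enough to fit on the driver):

// collect() returns an Array[Row] on the driver; only do this for small results
val flRows = df2.collect()
flRows.foreach(row => println(row.getInt(0) + " " + row.getInt(1) + " " + row.getString(2)))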
Ash
Answered by Ruthika jawar
You can use:
df.registerTempTable("table1") // register the DataFrame so it can be queried by name
var stateDF = df.select("state").distinct() // to get the states in a DataFrame
val states = stateDF.rdd.map(x => x(0)).collect.toList // to get the states in a list
for (state <- states) // loop over each state
{
  var finalDF = sqlContext.sql("select * from table1 where state = '" + state + "'")
  // note: finalDF is overwritten on every iteration; see the sketch below to keep all results
}
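A small follow-up sketch, under the same assumptions, that keeps every per-state result instead of overwriting finalDF on each pass:

// build a map from each state value to its filtered DataFrame
val dfPerState = states.map { s =>
  s -> sqlContext.sql("select * from table1 where state = '" + s + "'")
}.toMap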

