Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/31669308/
How to split a dataframe into dataframes with same column values?
Asked by user1735076
Using Scala, how can I split a DataFrame into multiple DataFrames (be it an array or a collection) with the same column value? For example, I want to split the following DataFrame:
ID Rate State
1 24 AL
2 35 MN
3 46 FL
4 34 AL
5 78 MN
6 99 FL
to:
data set 1
ID Rate State
1 24 AL
4 34 AL
data set 2
ID Rate State
2 35 MN
5 78 MN
data set 3
ID Rate State
3 46 FL
6 99 FL
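For reference, a minimal sketch that builds this example DataFrame (assuming a local SparkSession; the names here are only illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("split-by-state").master("local[*]").getOrCreate()
import spark.implicits._

// sample data matching the table above
val df = Seq(
  (1, 24, "AL"), (2, 35, "MN"), (3, 46, "FL"),
  (4, 34, "AL"), (5, 78, "MN"), (6, 99, "FL")
).toDF("ID", "Rate", "State")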
Answered by zero323
You can collect the unique state values and simply map over the resulting array:
val states = df.select("State").distinct.collect.flatMap(_.toSeq)
val byStateArray = states.map(state => df.where($"State" <=> state))
or to a map:
val byStateMap = states
.map(state => (state -> df.where($"State" <=> state)))
.toMap
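A quick usage sketch, assuming the example DataFrame from the question: look up one split, or walk over all of them.

// look up the split for a single state (None if that state is absent)
byStateMap.get("AL").foreach(_.show())

// or inspect every split; note that each count() triggers its own job
byStateMap.foreach { case (state, stateDF) =>
  println(s"$state: ${stateDF.count()} rows")
}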
The same thing in Python:
from itertools import chain
from pyspark.sql.functions import col, lit
states = chain(*df.select("state").distinct().collect())
# PySpark 2.3 and later
# In 2.2 and before, col("state") == state
# should give the same outcome, ignoring NULLs
# if NULLs are important:
# (lit(state).isNull() & col("state").isNull()) | (col("state") == state)
df_by_state = {state: df.where(col("state").eqNullSafe(state))
               for state in states}
The obvious problem here is that it requires a full data scan for each level, so it is an expensive operation. If you're looking for a way to just split the output, see also How do I split an RDD into two or more RDDs?
In particular, you can write a Dataset partitioned by the column of interest:
val path: String = ???
df.write.partitionBy("State").parquet(path)
and read back if needed:
// Depends on partition pruning
for { state <- states } yield spark.read.parquet(path).where($"State" === state)
// or explicitly read the partition
for { state <- states } yield spark.read.parquet(s"$path/State=$state")
Depending on the size of the data, the number of split levels, and the storage and persistence level of the input, it might be faster or slower than multiple filters.
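If the repeated scans are the main concern, one hedged option is to cache the input once so that each per-state filter reads from the cached data instead of rescanning the source (assuming it fits within the chosen storage level):

import org.apache.spark.storage.StorageLevel

// cache the input once; the per-state filters then reuse the cached data
df.persist(StorageLevel.MEMORY_AND_DISK)
val cachedByState = states.map(state => state -> df.where($"State" <=> state)).toMap

// ... work with the per-state DataFrames here ...

df.unpersist() // release the cache once the splits are no longer needed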
Answered by ashK
It is very simple (if the Spark version is 2) if you register the DataFrame as a temporary table.
df1.createOrReplaceTempView("df1")
And now you can run the queries:
var df2 = spark.sql("select * from df1 where state = 'FL'")
var df3 = spark.sql("select * from df1 where state = 'MN'")
var df4 = spark.sql("select * from df1 where state = 'AL'")
Now you have df2, df3, and df4. If you want to have them as local lists, you can use:
df2.collect()
df3.collect()
or even the map/filter functions. Please refer to https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes
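As a rough illustration of working with the collected rows (assuming the Int/Int/String schema of the example data, and results small enough to fit on the driver):

// collect() returns an Array[Row] on the driver; only do this for small results
val flRows = df2.collect()
flRows.foreach(row => println(row.getInt(0) + " " + row.getInt(1) + " " + row.getString(2)))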
Ash
Answered by Ruthika jawar
You can use:
df.registerTempTable("table1") // register the DataFrame so it can be queried by name
var stateDF = df.select("state").distinct() // to get the states in a DataFrame
val states = stateDF.rdd.map(x => x(0)).collect.toList // to get the states in a list
for (state <- states) // loop over each state
{
  var finalDF = sqlContext.sql("select * from table1 where state = '" + state + "'")
  // note: finalDF is overwritten on every iteration; see the sketch below to keep all results
}
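A small follow-up sketch, under the same assumptions, that keeps every per-state result instead of overwriting finalDF on each pass:

// build a map from each state value to its filtered DataFrame
val dfPerState = states.map { s =>
  s -> sqlContext.sql("select * from table1 where state = '" + s + "'")
}.toMap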

