pyspark : Convert DataFrame to RDD[string]
Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original link: http://stackoverflow.com/questions/35457927/
Asked by Toren
I'd like to convert a pyspark.sql.dataframe.DataFrame to pyspark.rdd.RDD[String].

I converted a DataFrame df to an RDD data:
data = df.rdd
type (data)
## pyspark.rdd.RDD
The new RDD data contains Row objects:
first = data.first()
type(first)
## pyspark.sql.types.Row
data.first()
Row(_c0=u'aaa', _c1=u'bbb', _c2=u'ccc', _c3=u'ddd')
I'd like to convert each Row to a list of String, like the example below:
u'aaa',u'bbb',u'ccc',u'ddd'
Thanks
Accepted answer by zero323
A PySpark Row is just a tuple and can be used as such. All you need here is a simple map (or flatMap if you want to flatten the rows as well) with list:
data.map(list)
or, if you expect different types:
data.map(lambda row: [str(c) for c in row])
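The question ultimately asks for RDD[String], i.e. one string per record rather than a list of strings. A sketch of the kind of function you could pass to map() for that; "row" stands in for a tuple-like pyspark Row, and the comma separator is an assumption:

```python
# Join the stringified columns of one tuple-like record into a single
# string, which is what an RDD[String] element would look like.
def row_to_string(row, sep=","):
    return sep.join(str(c) for c in row)

print(row_to_string(("aaa", "bbb", "ccc", "ddd")))  # aaa,bbb,ccc,ddd
```

In Spark this would be used as data.map(row_to_string), or inlined as a lambda.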
Answered by alofgran
The accepted answer is old. Since Spark 2.0, you must explicitly state that you're converting to an RDD by adding .rdd to the statement. Therefore, the equivalent of this Spark 1.x statement:
data.map(list)
should now be:
data.rdd.map(list)
in Spark 2.0. This relates to the accepted answer in this post.
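The API change can be sketched with toy stand-in classes (not real Spark types, purely an illustration of why .rdd is now required before map):

```python
# Toy stand-ins illustrating the Spark 2.x change: the DataFrame no longer
# exposes map() itself, so you reach the rows through its .rdd attribute.
class ToyRDD:
    def __init__(self, rows):
        self._rows = rows

    def map(self, f):
        # Real Spark applies f lazily per record; eager here for brevity.
        return ToyRDD([f(r) for r in self._rows])

    def collect(self):
        return self._rows

class ToyDataFrame:
    def __init__(self, rows):
        self.rdd = ToyRDD(rows)  # like Spark 2.x, map lives on .rdd

df = ToyDataFrame([("aaa", "bbb"), ("ccc", "ddd")])
print(df.rdd.map(list).collect())  # [['aaa', 'bbb'], ['ccc', 'ddd']]
```

Calling df.map(list) on the toy DataFrame would fail with AttributeError, mirroring the error you see on a real Spark 2.x DataFrame.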