Python pyspark: Convert DataFrame to RDD[string]
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/35457927/
pyspark : Convert DataFrame to RDD[string]
Asked by Toren
I'd like to convert a pyspark.sql.dataframe.DataFrame to a pyspark.rdd.RDD[String].
I converted a DataFrame df to an RDD data:
data = df.rdd
type (data)
## pyspark.rdd.RDD
The new RDD data contains Row objects:
first = data.first()
type(first)
## pyspark.sql.types.Row
data.first()
Row(_c0=u'aaa', _c1=u'bbb', _c2=u'ccc', _c3=u'ddd')
I'd like to convert each Row to a list of Strings, like the example below:
u'aaa',u'bbb',u'ccc',u'ddd'
Thanks
Accepted answer by zero323
A PySpark Row is just a tuple and can be used as such. All you need here is a simple map (or flatMap if you want to flatten the rows as well) with list:
data.map(list)
or, if you expect different types:
data.map(lambda row: [str(c) for c in row])
Answered by alofgran
The accepted answer is old. As of Spark 2.0, you must now explicitly state that you're converting to an RDD by adding .rdd to the statement. Therefore, the equivalent of this statement in Spark 1.x:
data.map(list)
Should now be:
现在应该是:
data.rdd.map(list)
in Spark 2.0. This relates to the accepted answer in this post.

