
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/35457927/

Date: 2020-08-19 16:28:12  Source: igfitidea

pyspark : Convert DataFrame to RDD[string]

Tags: python, apache-spark, dataframe, pyspark, apache-spark-sql

Asked by Toren

I'd like to convert a pyspark.sql.dataframe.DataFrame to a pyspark.rdd.RDD[String].

I converted a DataFrame df to an RDD data:

data = df.rdd
type(data)
## pyspark.rdd.RDD 

The new RDD data contains Row objects:

first = data.first()
type(first)
## pyspark.sql.types.Row

data.first()
Row(_c0=u'aaa', _c1=u'bbb', _c2=u'ccc', _c3=u'ddd')
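Because Row subclasses Python's built-in tuple, each row already supports ordinary tuple operations (indexing, iteration, unpacking). A minimal pure-Python sketch, using a plain tuple as a stand-in for the Row above so it runs without a Spark session:

```python
# Plain tuple standing in for Row(_c0=u'aaa', _c1=u'bbb', _c2=u'ccc', _c3=u'ddd');
# pyspark.sql.Row subclasses tuple, so positional access behaves the same way.
row = (u'aaa', u'bbb', u'ccc', u'ddd')

print(row[0])     # aaa
print(len(row))   # 4
print(list(row))  # ['aaa', 'bbb', 'ccc', 'ddd']
```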

I'd like to convert each Row to a list of String, like the example below:

u'aaa',u'bbb',u'ccc',u'ddd'

Thanks


Accepted answer by zero323

A PySpark Row is just a tuple and can be used as such. All you need here is a simple map (or flatMap, if you want to flatten the rows as well) with list:

data.map(list)

or, if you expect different types:

data.map(lambda row: [str(c) for c in row])
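A pure-Python sketch of what each of the two map calls does to a single row (a plain tuple stands in for Row, since Row subclasses tuple; the mixed-type column values are assumptions for illustration):

```python
# Stand-in for a Row with mixed column types.
row = (u'aaa', 1, 2.5)

# What data.map(list) does per row: keep the original types.
as_list = list(row)
print(as_list)      # ['aaa', 1, 2.5]

# What data.map(lambda row: [str(c) for c in row]) does per row:
# coerce every column to a string.
as_strings = [str(c) for c in row]
print(as_strings)   # ['aaa', '1', '2.5']
```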

Answered by alofgran

The accepted answer is old. As of Spark 2.0, you must explicitly state that you're converting to an RDD by adding .rdd to the statement. Therefore, the equivalent of this Spark 1.0 statement:

data.map(list)

Should now be:

现在应该是:

data.rdd.map(list)

in Spark 2.0. This is related to the accepted answer in this post.