Convert spark DataFrame column to python list
Source: http://stackoverflow.com/questions/38610559/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must attribute it to the original authors on StackOverflow and cite the source address above.
Asked by a.moussa
I am working with a dataframe that has two columns, mvv and count.
+---+-----+
|mvv|count|
+---+-----+
|  1|    5|
|  2|    9|
|  3|    3|
|  4|    1|
+---+-----+
I would like to obtain two lists, one containing the mvv values and one containing the count values. Something like:
mvv = [1,2,3,4]
count = [5,9,3,1]
So I tried the following code. The first line should return a Python list of rows, and I wanted to see the first value:
mvv_list = mvv_count_df.select('mvv').collect()
firstvalue = mvv_list[0].getInt(0)
But I get an error message on the second line:
AttributeError: getInt
Answered by Thiago Baldim
Here is why your approach does not work. First, you are trying to get an integer from a Row with getInt, which the Python Row class does not provide. The output of your collect looks like this:
>>> mvv_list = mvv_count_df.select('mvv').collect()
>>> mvv_list[0]
Out: Row(mvv=1)
If you access the field by name, like this:
>>> firstvalue = mvv_list[0].mvv
>>> firstvalue
Out: 1
you will get the mvv value. If you want all the values of the column as a list, you can do something like this:
>>> mvv_array = [int(row.mvv) for row in mvv_list]
>>> mvv_array
Out: [1,2,3,4]
But if you try the same for the other column, you get:
>>> mvv_count = [int(row.count) for row in mvv_count_df.collect()]
Out: TypeError: int() argument must be a string or a number, not 'builtin_function_or_method'
This happens because count is a built-in method of Row (inherited from tuple), and the column has the same name, so row.count returns the method instead of the column value. A workaround is to rename the count column to _count:
>>> renamed_df = mvv_count_df.selectExpr("mvv as mvv", "count as _count")
>>> mvv_count = [int(row._count) for row in renamed_df.collect()]
But this workaround is not needed, as you can access the column using dictionary-style syntax:
>>> mvv_array = [int(row['mvv']) for row in mvv_count_df.collect()]
>>> mvv_count = [int(row['count']) for row in mvv_count_df.collect()]
And it will finally work!
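The name clash can be reproduced without Spark. The sketch below uses FakeRow, a hypothetical and heavily simplified stand-in for the Row class (an assumption for illustration only, not the real implementation): a tuple subclass that looks up fields by name. Because tuple already defines count, attribute access finds the method first, while key access reaches the field.

```python
class FakeRow(tuple):
    """Simplified stand-in for a Spark Row (hypothetical, illustration only)."""

    def __new__(cls, **fields):
        row = super().__new__(cls, fields.values())
        row._names = list(fields)  # remember the field order
        return row

    def __getitem__(self, key):
        # Key access by field name, otherwise plain tuple indexing.
        if isinstance(key, str):
            return super().__getitem__(self._names.index(key))
        return super().__getitem__(key)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so tuple
        # methods such as count and index are never routed through here.
        if name == "_names":
            raise AttributeError(name)
        try:
            return self[name]
        except ValueError:
            raise AttributeError(name)


row = FakeRow(mvv=1, count=5)
print(row.mvv)              # 1
print(callable(row.count))  # True: this is tuple.count, not the field
print(row["count"])         # 5
```

In this stand-in, row.getInt raises AttributeError as well, since no field or tuple method has that name.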
Answered by Neo
The following one-liner gives the list you want.
mvv = mvv_count_df.select("mvv").rdd.flatMap(lambda x: x).collect()
Answered by Muhammad Raihan Muhaimin
This will give you all the elements as a list.
mvv_list = list(
mvv_count_df.select('mvv').toPandas()['mvv']
)
Answered by Itachi
The following code will help you:
mvv_count_df.select('mvv').rdd.map(lambda row : row[0]).collect()
Answered by luminousmen
On my data I got these benchmarks:
>>> data.select(col).rdd.flatMap(lambda x: x).collect()
0.52 sec
>>> [row[col] for row in data.collect()]
0.271 sec
>>> list(data.select(col).toPandas()[col])
0.427 sec
The result is the same in all three cases.
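The answer does not say how the timings were taken. One possible way to measure such calls is sketched below; time_call is a hypothetical helper, and the workload is a plain Python stand-in so the snippet runs without Spark. Real numbers will depend on the data size and the cluster.

```python
import time


def time_call(fn, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` calls of fn()."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload (assumption). With a real DataFrame you would pass
# something like: lambda: [row[col] for row in data.collect()]
elapsed = time_call(lambda: [x * 2 for x in range(1000)])
print(f"{elapsed:.6f} sec")
```

Taking the best of several runs reduces noise from one-off effects such as startup or caching, though it will not remove them entirely.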
Answered by anirban sen
If you get the error below:
AttributeError: 'list' object has no attribute 'collect'
This code will solve the issue, because collect() is called only once, on the DataFrame, and the resulting list is then iterated directly:
mvv_list = mvv_count_df.select('mvv').collect()
mvv_array = [int(i.mvv) for i in mvv_list]
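To avoid this error regardless of whether the rows have already been collected, a small helper can accept both cases. This is a minimal sketch: column_values is a hypothetical name, and the demo uses plain dicts as stand-in rows so it runs without Spark. It relies only on rows supporting key access such as row['mvv'].

```python
def column_values(rows_or_df, column):
    """Return the values of `column` as a plain Python list.

    Accepts either an object with a .collect() method (such as a DataFrame)
    or an already collected list of rows.
    """
    if hasattr(rows_or_df, "collect"):
        rows = rows_or_df.collect()
    else:
        rows = rows_or_df
    return [row[column] for row in rows]


# Stand-in rows (assumption): dicts support the same row['name'] access.
collected = [{"mvv": 1, "count": 5}, {"mvv": 2, "count": 9}]
print(column_values(collected, "mvv"))    # [1, 2]
print(column_values(collected, "count"))  # [5, 9]
```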