Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/38610559/

Time: 2020-08-19 21:11:36

Convert spark DataFrame column to python list

Tags: python, apache-spark, pyspark, spark-dataframe

Asked by a.moussa

I am working with a DataFrame that has two columns, mvv and count.

+---+-----+
|mvv|count|
+---+-----+
|  1|    5|
|  2|    9|
|  3|    3|
|  4|    1|
+---+-----+

I would like to obtain two lists, one containing the mvv values and one containing the count values. Something like:

mvv = [1,2,3,4]
count = [5,9,3,1]

So I tried the following code. The first line should return a Python list of Row objects, and I wanted to read the first value:

mvv_list = mvv_count_df.select('mvv').collect()
firstvalue = mvv_list[0].getInt(0)

But I get an error message with the second line:

AttributeError: getInt

Answered by Thiago Baldim

Here is why your approach does not work. First, you are trying to get an integer from a Row, and the PySpark Row has no getInt method. The output of your collect looks like this:

>>> mvv_list = mvv_count_df.select('mvv').collect()
>>> mvv_list[0]
Out: Row(mvv=1)

If you access the field by name, like this:

>>> firstvalue = mvv_list[0].mvv
>>> firstvalue
Out: 1

You will get the mvv value. If you want all the values of the column as a list, you can do something like this:

>>> mvv_array = [int(row.mvv) for row in mvv_list]
>>> mvv_array
Out: [1,2,3,4]

But if you try the same for the other column, you get:

>>> mvv_count = [int(row.count) for row in mvv_count_df.collect()]
Out: TypeError: int() argument must be a string or a number, not 'builtin_function_or_method'

This happens because count is a built-in method of Python tuples, and a Row is a tuple subclass, so row.count resolves to that method instead of the column with the same name. One workaround is to rename the count column to _count:

>>> renamed_df = mvv_count_df.selectExpr("mvv as mvv", "count as _count")
>>> mvv_count = [int(row._count) for row in renamed_df.collect()]

But this workaround is not needed, as you can access the column using the dictionary syntax:

>>> mvv_array = [int(row['mvv']) for row in mvv_count_df.collect()]
>>> mvv_count = [int(row['count']) for row in mvv_count_df.collect()]

And it will finally work!
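
The name clash described above can be reproduced without Spark. The following is a minimal sketch in plain Python; FakeRow is a hypothetical stand-in written for this illustration, not the real pyspark.sql.Row, and it only assumes that a row is a tuple subclass whose fields can be read by attribute or by string key.

```python
class FakeRow(tuple):
    """Hypothetical stand-in for a Spark Row: a tuple with named fields."""

    def __new__(cls, **fields):
        row = super().__new__(cls, fields.values())
        row._names = list(fields)
        return row

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails,
        # so real tuple methods such as count() are found first.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[self._names.index(name)]
        except ValueError:
            raise AttributeError(name)

    def __getitem__(self, key):
        # String keys look up a field by name; integers index the tuple.
        if isinstance(key, str):
            return super().__getitem__(self._names.index(key))
        return super().__getitem__(key)


row = FakeRow(mvv=1, count=5)
print(row.mvv)              # field value, found through __getattr__
print(callable(row.count))  # True: this is the tuple method, not the field
print(row["count"])         # field value, found through __getitem__
```

Attribute access is convenient but can be shadowed by any method the row type already defines, which is why key access is the safer choice for column names like count or index.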

Answered by Neo

The following one-liner gives the list you want.

mvv = mvv_count_df.select("mvv").rdd.flatMap(lambda x: x).collect()

Answered by Muhammad Raihan Muhaimin

This will give you all the elements as a list.

mvv_list = list(
    mvv_count_df.select('mvv').toPandas()['mvv']
)

Answered by Itachi

The following code will help you:

mvv_count_df.select('mvv').rdd.map(lambda row : row[0]).collect()
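
The flatMap(lambda x: x) and map(lambda row: row[0]) variants agree only because a single column was selected. The sketch below imitates both on local data without Spark; FakeRow (a namedtuple) and the flat_map helper are hypothetical stand-ins for illustration, assuming only that a row iterates over its field values as a tuple does.

```python
from collections import namedtuple
from itertools import chain

# Hypothetical stand-in for a two-column Spark Row.
FakeRow = namedtuple("FakeRow", ["mvv", "count"])
rows = [FakeRow(1, 5), FakeRow(2, 9), FakeRow(3, 3), FakeRow(4, 1)]


def flat_map(func, items):
    """Local analogue of flatMap: apply func, then flatten one level."""
    return list(chain.from_iterable(func(item) for item in items))


# Imitates select('mvv'): every row carries a single field.
single_column = [(row.mvv,) for row in rows]

flat_single = flat_map(lambda x: x, single_column)
mapped_single = [row[0] for row in single_column]

# Without the select, flattening interleaves the values of both columns.
flat_both = flat_map(lambda x: x, rows)

print(flat_single)    # [1, 2, 3, 4]
print(mapped_single)  # [1, 2, 3, 4]
print(flat_both)      # [1, 5, 2, 9, 3, 3, 4, 1]
```

So the flattening form depends on the preceding select to narrow the rows to one column, while indexing with row[0] picks the first field regardless of how many columns there are.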

Answered by luminousmen

On my data I got these benchmarks:

>>> data.select(col).rdd.flatMap(lambda x: x).collect()

0.52 sec

>>> [row[col] for row in data.collect()]

0.271 sec

>>> list(data.select(col).toPandas()[col])

0.427 sec

The result is the same
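
The answer does not say how the timings were taken. One possible way to make such a comparison repeatable is a small helper built on the standard timeit module. This is only a sketch: time_call, the local rows list, and the two candidate functions are hypothetical placeholders that stand in for the Spark expressions above, so the numbers it prints say nothing about Spark itself.

```python
import timeit
from operator import itemgetter


def time_call(func, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


# Placeholder data and candidates; swap in the Spark expressions to compare them.
rows = [(i, i * 2) for i in range(100_000)]
candidates = {
    "comprehension": lambda: [row[0] for row in rows],
    "itemgetter": lambda: list(map(itemgetter(0), rows)),
}

timings = {name: time_call(func) for name, func in candidates.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} sec")
```

Taking the minimum over several runs reduces noise from one-off effects such as warm-up. With Spark, results also depend heavily on data size, partitioning, and caching, so timings measured on one dataset may not carry over to another.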

Answered by anirban sen

If you get the error below:

AttributeError: 'list' object has no attribute 'collect'

it means collect() was called on a list that had already been collected. This code will solve the issue:

mvv_list = mvv_count_df.select('mvv').collect()

mvv_array = [int(i.mvv) for i in mvv_list]