Convert a Spark DataFrame column to a Python list

Question

Asked by a.moussa

I am working with a DataFrame that has two columns, mvv and count.

+---+-----+
|mvv|count|
+---+-----+
| 1 |  5  |
| 2 |  9  |
| 3 |  3  |
| 4 |  1  |
+---+-----+

I would like to obtain two lists, one containing the mvv values and one containing the count values. Something like:

mvv = [1,2,3,4]
count = [5,9,3,1]

So I tried the following code. The first line should return a Python list of Row objects, and I wanted to see the first value:

mvv_list = mvv_count_df.select('mvv').collect()
firstvalue = mvv_list[0].getInt(0)

But I get an error message on the second line:

AttributeError: getInt

Answer 1

Answered by Thiago Baldim

Here is why your approach does not work. First, you are trying to call getInt on a Row object, which has no such method in PySpark. The output of your collect looks like this:

>>> mvv_list = mvv_count_df.select('mvv').collect()
>>> mvv_list[0]
Out: Row(mvv=1)

If you access the field by name, like this:

>>> firstvalue = mvv_list[0].mvv
>>> firstvalue
Out: 1

You will get the mvv value. If you want all the values of the column as a list, you can do something like this:

>>> mvv_array = [int(row.mvv) for row in mvv_count_df.collect()]
>>> mvv_array
Out: [1,2,3,4]

But if you try the same for the other column, you get:

>>> mvv_count = [int(row.count) for row in mvv_count_df.collect()]
Out: TypeError: int() argument must be a string or a number, not 'builtin_function_or_method'

This happens because count is a built-in method of Row (which is a tuple subclass), and the column has the same name, so row.count returns the method rather than the column value. A workaround is to rename the column from count to _count:

>>> mvv_list = mvv_count_df.selectExpr("mvv as mvv", "count as _count")
>>> mvv_count = [int(row._count) for row in mvv_list.collect()]

But this workaround is not needed, as you can access the column using the dictionary syntax:

>>> mvv_array = [int(row['mvv']) for row in mvv_count_df.collect()]
>>> mvv_count = [int(row['count']) for row in mvv_count_df.collect()]

And it will finally work!
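
The name clash can be reproduced without Spark. The sketch below uses a hypothetical FakeRow class, which is not PySpark code. It only assumes, as described above, that a Row is a tuple subclass whose field names are consulted after normal attribute lookup has failed:

```python
class FakeRow(tuple):
    """Hypothetical stand-in for a Row: a tuple whose fields also have names."""

    def __new__(cls, **fields):
        row = super().__new__(cls, fields.values())
        row._names = list(fields)
        return row

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so inherited tuple
        # methods such as count() and index() are always found first.
        if name.startswith("_") or name not in self._names:
            raise AttributeError(name)
        return tuple.__getitem__(self, self._names.index(name))

    def __getitem__(self, key):
        # Accept a field name as well as a position.
        if isinstance(key, str):
            return tuple.__getitem__(self, self._names.index(key))
        return tuple.__getitem__(self, key)


row = FakeRow(mvv=1, count=5)
print(row.mvv)              # the field value
print(callable(row.count))  # True: this is tuple.count, not the field
print(row["count"])         # the field value, looked up by name
```

Bracket access never goes through method lookup, which is why row['count'] is the safer form for any column whose name matches a tuple method (count or index).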

Answer 2

Answered by Neo

The following one-liner gives the list you want.

mvv = mvv_count_df.select("mvv").rdd.flatMap(lambda x: x).collect()
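
To see why flatMap(lambda x: x) produces a flat list, it may help to look at a local equivalent. The sketch below uses no Spark at all. It stands in plain one-element tuples for the rows (an assumption for illustration) and uses itertools.chain.from_iterable to play the role of flatMap:

```python
from itertools import chain

# Stand-in for the rows that select("mvv") would yield:
# each row behaves like a one-element tuple.
rows = [(1,), (2,), (3,), (4,)]

# flatMap(lambda x: x) iterates over every row and concatenates the pieces,
# which is what chain.from_iterable does for local iterables.
mvv = list(chain.from_iterable(rows))

# Taking the first field of each row gives the same list when only one
# column was selected.
mvv_via_index = [row[0] for row in rows]

print(mvv)
print(mvv_via_index)
```

If more than one column were selected, the flattening approach would interleave the values of all columns, while indexing would still return only the first column.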

Answer 3

Answered by Muhammad Raihan Muhaimin

This will give you all the elements as a list.

mvv_list = list(
    mvv_count_df.select('mvv').toPandas()['mvv']
)

Answer 4

Answered by Itachi

The following code will help you:

mvv_count_df.select('mvv').rdd.map(lambda row : row[0]).collect()

Answer 5

Answered by luminousmen

On my data I got these benchmarks:

>>> data.select(col).rdd.flatMap(lambda x: x).collect()

0.52 sec

>>> [row[col] for row in data.collect()]

0.271 sec

>>> list(data.select(col).toPandas()[col])

0.427 sec

The result is the same
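
The answer does not say how the timings were taken. One possible way to reproduce such a comparison is sketched below. The time_call helper is hypothetical and the workload is a plain Python list rather than a DataFrame, so the numbers it prints say nothing about Spark itself:

```python
import time


def time_call(fn, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload. With Spark one would pass the expression to measure,
# for example: lambda: [row[col] for row in data.collect()]
rows = [(i, i * 2) for i in range(100_000)]
elapsed = time_call(lambda: [row[0] for row in rows])
print(f"{elapsed:.4f} sec")
```

Taking the best of several runs reduces the influence of one-off costs such as warm-up, which can otherwise dominate a single measurement.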

Answer 6

Answered by anirban sen

If you get the error below, it is because .collect() was called on the list that an earlier collect() had already returned:

AttributeError: 'list' object has no attribute 'collect'

This code will solve the issue:

mvv_list = mvv_count_df.select('mvv').collect()

mvv_array = [int(i.mvv) for i in mvv_list]
