pandas spark - Converting a DataFrame to a list with better performance

Question

Asked by Yakov

I need to convert a column of a Spark DataFrame to a list, to use later with matplotlib:

df.toPandas()[col_name].values.tolist()

This seems to carry a high performance overhead: the operation takes around 18 seconds. Is there another way to do this, or a way to improve the performance?

Answer 1

Answered by P.Panayotov

You can do it this way:

>>> [list(row) for row in df.collect()]

Example:
>>> d = [['Alice', 1], ['Bob', 2]]
>>> df = spark.createDataFrame(d, ['name', 'age'])
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
|  Bob|  2|
+-----+---+
>>> to_list = [list(row) for row in df.collect()]
>>> print(to_list)
Result: [['Alice', 1], ['Bob', 2]] (on Python 2 the strings are shown as u'Alice' and u'Bob')

Answer 2

Answered by zero323

If you really need a local list, there is not much you can do here, but one improvement is to collect only a single column rather than the whole DataFrame:

df.select(col_name).rdd.flatMap(lambda x: x).collect()
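
As a minimal sketch of the same idea, the helper below selects the column before collecting, so only that column is sent to the driver. The name `column_to_list` is hypothetical (it is not from the answer), and the code assumes `df` is a PySpark DataFrame whose `collect()` returns `Row` objects, which can be indexed like tuples.

```python
def column_to_list(df, col_name):
    """Return the values of one DataFrame column as a flat Python list.

    Hypothetical helper: `df` is assumed to be a PySpark DataFrame.
    """
    # select() keeps only the requested column, so collect() ships
    # one field per row to the driver instead of the whole row.
    rows = df.select(col_name).collect()
    # Each Row holds a single field here; index 0 extracts its value.
    return [row[0] for row in rows]
```

With the DataFrame from Answer 1, `column_to_list(df, 'age')` would return `[1, 2]`, a plain list that can be passed to matplotlib.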

Answer 3

Answered by Artem Osipov

You can use toLocalIterator to save memory. The iterator consumes as much memory as the largest partition of the DataFrame. If you only need to use the result once, an iterator is a perfect fit.

d = [['Bender', 12], ['Flex', 123],['Fry', 1234]]
df = spark.createDataFrame(d, ['name', 'value'])
df.show()
+------+-----+
|  name|value|
+------+-----+
|Bender|   12|
|  Flex|  123|
|   Fry| 1234|
+------+-----+
values = [row.value for row in df.toLocalIterator()]

print(values)
# Output: [12, 123, 1234]
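
If the values are only needed once, or only a bounded number of them will be plotted, the iterator can be consumed lazily instead of being turned into a full list. A rough sketch, assuming `df` is a PySpark DataFrame; `iter_column` and `first_values` are hypothetical helper names, not part of Spark:

```python
from itertools import islice


def iter_column(df, col_name):
    """Lazily yield the values of one column (hypothetical helper)."""
    # toLocalIterator() brings rows to the driver one partition at a time,
    # and select() keeps those rows down to the single requested column.
    for row in df.select(col_name).toLocalIterator():
        yield row[0]


def first_values(df, col_name, limit):
    """Materialize at most `limit` values of the column as a list."""
    return list(islice(iter_column(df, col_name), limit))
```

Calling `first_values(df, 'value', 2)` stops pulling rows once two values have been read, so the driver never holds the whole column at once.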

Also, the toPandas() method should only be used if the resulting pandas DataFrame is expected to be small, because all the data is loaded into the driver's memory.
