Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/35364133/
spark - Converting dataframe to list improving performance
Asked by Yakov
I need to convert a column of a Spark DataFrame to a list, to use later with matplotlib:
df.toPandas()[col_name].values.tolist()
This operation seems to carry a high performance overhead: it takes around 18 seconds. Is there another way to do this, or a way to improve the performance?
Answered by P.Panayotov
You can do it this way:
>>> [list(row) for row in df.collect()]
Example:
>>> d = [['Alice', 1], ['Bob', 2]]
>>> df = spark.createDataFrame(d, ['name', 'age'])
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
|  Bob|  2|
+-----+---+
>>> to_list = [list(row) for row in df.collect()]
>>> print(to_list)
Result: [[u'Alice', 1], [u'Bob', 2]] (the u prefix appears under Python 2 only)
Answered by zero323
If you really need a local list, there is not much you can do here, but one improvement is to collect only a single column rather than the whole DataFrame:
# On Spark 2.0 and later, flatMap is only available on the underlying RDD, hence .rdd
df.select(col_name).rdd.flatMap(lambda x: x).collect()
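For readers who would rather stay in the DataFrame API, the same idea can be sketched without going through the RDD. This is only an illustration: `column_to_list` is a hypothetical helper name, not part of PySpark, and `df` is assumed to be an existing Spark DataFrame.

```python
def column_to_list(df, col_name):
    """Collect a single column of a Spark DataFrame into a plain Python list.

    `column_to_list` is a hypothetical helper, not part of PySpark.
    """
    # select() narrows the DataFrame to one column before collect(),
    # so only that column's values are sent to the driver.
    rows = df.select(col_name).collect()
    # Each Row holds exactly one field here, so row[0] is the value itself.
    return [row[0] for row in rows]
```

With the name/age DataFrame from the earlier answer, `column_to_list(df, 'age')` would return `[1, 2]`, a plain list that can be handed to matplotlib.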
Answered by Artem Osipov
You can use an iterator, toLocalIterator, to save memory. The iterator will consume as much memory as the largest partition in the DataFrame. If you need to use the result only once, the iterator is a perfect fit.
>>> d = [['Bender', 12], ['Flex', 123], ['Fry', 1234]]
>>> df = spark.createDataFrame(d, ['name', 'value'])
>>> df.show()
+------+-----+
|  name|value|
+------+-----+
|Bender|   12|
|  Flex|  123|
|   Fry| 1234|
+------+-----+
>>> values = [row.value for row in df.toLocalIterator()]
>>> print(values)
[12, 123, 1234]
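If the values are only needed once, it may not be necessary to build the full list at all. The sketch below is an illustration under stated assumptions: `iter_column_chunks` and `chunk_size` are hypothetical names, and `df` is assumed to be a Spark DataFrame. It reads one column through toLocalIterator and yields it in fixed-size batches.

```python
from itertools import islice


def iter_column_chunks(df, col_name, chunk_size=1000):
    """Yield the values of one column in lists of at most `chunk_size` items.

    `iter_column_chunks` is a hypothetical helper, not part of PySpark.
    """
    # toLocalIterator() fetches the data partition by partition
    # instead of pulling everything to the driver at once.
    rows = iter(df.select(col_name).toLocalIterator())
    while True:
        chunk = [row[0] for row in islice(rows, chunk_size)]
        if not chunk:
            return  # iterator exhausted
        yield chunk
```

Each chunk can then be processed and discarded, for example by updating a running histogram, so the driver holds only the current batch plus whatever Spark buffers for the current partition.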
Also, the toPandas() method should only be used if the resulting pandas DataFrame is expected to be small, because all the data is loaded into the driver's memory.
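One way to keep the pandas result small is to trim the data on the Spark side before converting. The following is a minimal sketch, assuming `df` is a Spark DataFrame; `to_small_pandas` and `max_rows` are hypothetical names introduced here for illustration.

```python
def to_small_pandas(df, columns, max_rows=10000):
    """Convert at most `max_rows` rows of the chosen columns to pandas.

    `to_small_pandas` is a hypothetical helper, not part of PySpark.
    """
    # Trim columns and rows on the Spark side first, so that toPandas()
    # only has to move a bounded amount of data to the driver.
    return df.select(*columns).limit(max_rows).toPandas()
```

Note that limit() keeps an arbitrary subset of rows rather than a random sample, so this suits a quick look at the data more than a statistically representative plot.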