pyspark's flatMap in pandas

Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/31080258/

Date: 2020-09-13 23:32:44 · Source: igfitidea

Tags: pandas, pyspark

Asked by GeauxEric

Is there an operation in pandas that does the same as flatMap in pyspark?

flatMap example:

>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]

So far I can think of apply followed by itertools.chain, but I am wondering if there is a one-step solution.
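As a sketch of that two-step idea (using the same nested-lists frame as the answers below), the apply/itertools.chain route looks like:

```python
import itertools

import pandas as pd

# A column whose cells are lists -- the shape flatMap would consume
df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

# Flatten all the lists into a single Series in one pass
flat = pd.Series(list(itertools.chain.from_iterable(df['x'])))
print(flat.tolist())  # [1, 2, 3, 4, 5]
```

Note that since pandas 0.25, `df['x'].explode()` does this in one step, keeping the original index.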

Answered by santon

There's a hack. I often do something like

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
Out[3]:
0     1
1     3
2     2
3     4
4   NaN
5     5
dtype: float64

The NaN appears because the intermediate object creates a MultiIndex, but for a lot of things you can just drop that:

In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
Out[4]:
0    1
1    3
2    2
3    4
5    5
dtype: float64

This trick uses all pandas code, so I would expect it to be reasonably efficient, though it may not cope well with lists of very different sizes.
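One follow-up detail (my addition, not part of the answer): the intermediate NaN forces a float dtype, so after dropna you may want to cast back:

```python
import pandas as pd

df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
flat = df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
# dropna leaves float64 behind; cast back to the original integer dtype
flat = flat.astype(int)
print(flat.tolist())  # [1, 3, 2, 4, 5]
```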

Answered by nikita

There are three steps to solve this question:

import pandas as pd
df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
df_new = df['x'].apply(pd.Series).unstack().reset_index().dropna()
# 'level_1' holds the original row index; column 0 holds the flattened values
df_new[['level_1', 0]]

Result ('level_1' is the original row index, column 0 the flattened values):

   level_1    0
0        0  1.0
1        1  3.0
2        0  2.0
3        1  4.0
5        1  5.0

Answered by MRocklin

I suspect that the answer is "no, not efficiently."

Pandas isn't built for nested data like this. I suspect that the case you're considering in Pandas looks a bit like the following:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

In [3]: df
Out[3]: 
           x
0     [1, 2]
1  [3, 4, 5]

And that you want something like the following:

    x
0   1
0   2
1   3
1   4
1   5

It is far more typical to normalize your data in Python before you send it to Pandas. If Pandas did do this then it would probably only be able to operate at slow Python speeds rather than fast C speeds.
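A minimal sketch of that kind of normalization (the column and index names here are my own, not from the answer): build the flat rows in plain Python first, then hand them to Pandas:

```python
import pandas as pd

data = {'x': [[1, 2], [3, 4, 5]]}

# Repeat each row's index once per element, flatMap-style
rows = [(i, v) for i, lst in enumerate(data['x']) for v in lst]
flat = pd.DataFrame(rows, columns=['idx', 'x']).set_index('idx')
print(flat['x'].tolist())  # [1, 2, 3, 4, 5]
```

This reproduces the repeated-index shape shown above without any intermediate NaN or dtype change.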

Generally one does a bit of munging of data before one uses tabular computation.