pyspark's flatMap in pandas

Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/31080258/

Date: 2020-09-13 23:32:44 · Source: igfitidea

Tags: pandas, pyspark

Asked by GeauxEric

Is there an operation in pandas that does the same as flatMap in pyspark?

flatMap example:

>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]

So far I can think of apply followed by itertools.chain, but I am wondering if there is a one-step solution.
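As a sketch of that two-step idea (using the same nested-lists frame as the answers below), the apply/itertools.chain route looks like:

```python
import itertools

import pandas as pd

# A column whose cells are lists -- the shape flatMap would consume
df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

# Flatten all the lists into a single Series in one pass
flat = pd.Series(list(itertools.chain.from_iterable(df['x'])))
print(flat.tolist())  # [1, 2, 3, 4, 5]
```

Note that since pandas 0.25, `df['x'].explode()` does this in one step, keeping the original index.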

Answered by santon

There's a hack. I often do something like

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
Out[3]:
0     1
1     3
2     2
3     4
4   NaN
5     5
dtype: float64

The NaN appears because the intermediate object creates a MultiIndex, but for a lot of things you can just drop that:

In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
Out[4]:
0    1
1    3
2    2
3    4
5    5
dtype: float64

This trick uses all pandas code, so I would expect it to be reasonably efficient, though it may not cope well with lists of very different sizes.
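One follow-up detail (my addition, not part of the answer): the intermediate NaN forces a float dtype, so after dropna you may want to cast back:

```python
import pandas as pd

df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
flat = df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
# dropna leaves float64 behind; cast back to the original integer dtype
flat = flat.astype(int)
print(flat.tolist())  # [1, 3, 2, 4, 5]
```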

Answered by nikita

There are three steps to solve this question:

import pandas as pd
df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
df_new = df['x'].apply(pd.Series).unstack().reset_index().dropna()
# 'level_1' holds the original row index; column 0 holds the flattened values
df_new[['level_1', 0]]

Result ('level_1' is the original row index, column 0 the flattened values):

   level_1    0
0        0  1.0
1        1  3.0
2        0  2.0
3        1  4.0
5        1  5.0

Answered by MRocklin

I suspect that the answer is "no, not efficiently."

Pandas isn't built for nested data like this. I suspect that the case you're considering in Pandas looks a bit like the following:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

In [3]: df
Out[3]: 
           x
0     [1, 2]
1  [3, 4, 5]

And that you want something like the following:

    x
0   1
0   2
1   3
1   4
1   5

It is far more typical to normalize your data in Python before you send it to Pandas. If Pandas did do this then it would probably only be able to operate at slow Python speeds rather than fast C speeds.
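A minimal sketch of that kind of normalization (the column and index names here are my own, not from the answer): build the flat rows in plain Python first, then hand them to Pandas:

```python
import pandas as pd

data = {'x': [[1, 2], [3, 4, 5]]}

# Repeat each row's index once per element, flatMap-style
rows = [(i, v) for i, lst in enumerate(data['x']) for v in lst]
flat = pd.DataFrame(rows, columns=['idx', 'x']).set_index('idx')
print(flat['x'].tolist())  # [1, 2, 3, 4, 5]
```

This reproduces the repeated-index shape shown above without any intermediate NaN or dtype change.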

Generally one does a bit of munging of data before one uses tabular computation.