pandas: iterate over rows and expand a pandas dataframe

Question

Asked by bowlby

I have a pandas dataframe with a column containing either single values or lists of values (of unequal length). I want to 'expand' the rows, so that each value in a list becomes a single value in the column. An example says it all:

dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
 u'location': ['Amsterdam', ['Berlin','Paris'], ['Antwerp','Barcelona','Pisa'] ]})

                     location   name
0                   Amsterdam    Tom
1             [Berlin, Paris]    Jim
2  [Antwerp, Barcelona, Pisa]  Claus

I want to turn this into:

dfOut = pd.DataFrame({u'name': ['Tom', 'Jim', 'Jim', 'Claus','Claus','Claus'],
u'location': ['Amsterdam', 'Berlin','Paris', 'Antwerp','Barcelona','Pisa']})

    location   name
0  Amsterdam    Tom
1     Berlin    Jim
2      Paris    Jim
3    Antwerp  Claus
4  Barcelona  Claus
5       Pisa  Claus

I first tried using apply, but as far as I know it's not possible to return multiple Series. iterrows seems to be the trick, but the code below gives me an empty dataframe...

def duplicator(series):
    if type(series['location']) == list:
        for location in series['location']:
            subSeries = series
            subSeries['location'] = location
            dfOut.append(subSeries)
    else:
        dfOut.append(series)

for index, row in dfIn.iterrows():
    duplicator(row)

Answer 1

Accepted answer by unutbu

If you return a series whose index is a list of locations, then dfIn.apply will collate those series into a table:

import pandas as pd
dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
                     u'location': ['Amsterdam', ['Berlin','Paris'],
                                   ['Antwerp','Barcelona','Pisa'] ]})

def expand(row):
    locations = row['location'] if isinstance(row['location'], list) else [row['location']]
    s = pd.Series(row['name'], index=list(set(locations)))
    return s

In [156]: dfIn.apply(expand, axis=1)
Out[156]: 
  Amsterdam Antwerp Barcelona Berlin Paris   Pisa
0       Tom     NaN       NaN    NaN   NaN    NaN
1       NaN     NaN       NaN    Jim   Jim    NaN
2       NaN   Claus     Claus    NaN   NaN  Claus

You can then stack this DataFrame to obtain:

In [157]: dfIn.apply(expand, axis=1).stack()
Out[157]: 
0  Amsterdam      Tom
1  Berlin         Jim
   Paris          Jim
2  Antwerp      Claus
   Barcelona    Claus
   Pisa         Claus
dtype: object

This is a Series, while you want a DataFrame. A little massaging with reset_index gives you the desired result:

dfOut = dfIn.apply(expand, axis=1).stack()
dfOut = dfOut.to_frame().reset_index(level=1, drop=False)
dfOut.columns = ['location', 'name']
dfOut.reset_index(drop=True, inplace=True)
print(dfOut)

yields

    location   name
0  Amsterdam    Tom
1     Berlin    Jim
2      Paris    Jim
3    Antwerp  Claus
4  Barcelona  Claus
5       Pisa  Claus
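
As for why the loop in the question leaves dfOut empty: DataFrame.append never modified a frame in place. It returned a new frame, so each result in the loop is discarded (the method was deprecated in pandas 1.4 and removed in 2.0). In addition, subSeries = series only binds a second name to the same row object, so every assignment overwrites the shared row. A minimal sketch of a working iterrows version follows, assuming it is acceptable to collect the pieces in a list and build the frame once at the end. The reworked duplicator generator and the records name are illustrative, not from the original post.

```python
import pandas as pd

dfIn = pd.DataFrame({'name': ['Tom', 'Jim', 'Claus'],
                     'location': ['Amsterdam', ['Berlin', 'Paris'],
                                  ['Antwerp', 'Barcelona', 'Pisa']]})

def duplicator(row):
    """Yield one {'location', 'name'} dict per location in the row."""
    loc = row['location']
    locations = loc if isinstance(loc, list) else [loc]
    for location in locations:
        # Build a fresh dict each time instead of mutating the shared row
        yield {'location': location, 'name': row['name']}

# Collect the pieces in a plain list, then construct the frame in one go
records = [rec for _, row in dfIn.iterrows() for rec in duplicator(row)]
dfOut = pd.DataFrame(records, columns=['location', 'name'])
print(dfOut)
```

On pandas 0.25 or later, the built-in DataFrame.explode covers this case as well: dfIn.explode('location').reset_index(drop=True) should produce the same rows in a single call.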

Answer 2

Answer by MorganM

Not as interesting or fancy a use of pandas, but this works:

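import numpy as np
import pandas as pd

# Wrap scalar locations in 1-element arrays so every entry is iterable
dfIn.loc[:, 'location'] = dfIn.location.apply(np.atleast_1d)
# Flatten all per-row location arrays into one long array
all_locations = np.hstack(dfIn.location)
# Repeat each name once per location in its row
all_names = np.hstack([[n]*len(l) for n, l in dfIn[['name', 'location']].values])
dfOut = pd.DataFrame({'location':all_locations, 'name':all_names})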

It's about 40x faster than the apply/stack/reset_index approach. As far as I can tell, that ratio holds at pretty much all dataframe sizes (I didn't test how it scales with the size of the lists in each row). If you can guarantee that all location entries are already iterables, you can remove the atleast_1d call, which gives about another 20% speedup.
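
The ratio will depend on your data and library versions, so it is worth measuring yourself. Below is a rough sketch of such a comparison using timeit. The make_frame helper, the via_apply_stack and via_hstack wrappers, and the made-up test data are all assumptions for illustration, and the explicit dropna() is added only so that the result stays the same on pandas versions where stack() keeps NaN cells.

```python
import timeit

import numpy as np
import pandas as pd

def make_frame(n):
    """Build a hypothetical test frame with n rows of mixed scalar/list locations."""
    cycle = ['Amsterdam', ['Berlin', 'Paris'], ['Antwerp', 'Barcelona', 'Pisa']]
    return pd.DataFrame({'name': ['person%d' % i for i in range(n)],
                         'location': [cycle[i % 3] for i in range(n)]})

def expand(row):
    loc = row['location']
    locations = loc if isinstance(loc, list) else [loc]
    return pd.Series(row['name'], index=list(set(locations)))

def via_apply_stack(df):
    # Older pandas drops NaN cells inside stack(); dropna() is then a no-op
    out = df.apply(expand, axis=1).stack().dropna()
    out = out.to_frame().reset_index(level=1, drop=False)
    out.columns = ['location', 'name']
    return out.reset_index(drop=True)

def via_hstack(df):
    # Works on a copy of the column, so the input frame is left untouched
    locations = df['location'].apply(np.atleast_1d)
    all_locations = np.hstack(locations.tolist())
    all_names = np.hstack([[n] * len(l) for n, l in zip(df['name'], locations)])
    return pd.DataFrame({'location': all_locations, 'name': all_names})

df = make_frame(300)
for func in (via_apply_stack, via_hstack):
    seconds = timeit.timeit(lambda: func(df), number=3) / 3
    print('%-16s %.4f s per call' % (func.__name__, seconds))
```

Change the argument to make_frame to see how the gap behaves at other sizes. The two functions may order the locations within a row differently, so compare their results after sorting.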
