pandas 迭代行并扩展熊猫数据框

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/26068021/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 22:30:36  来源:igfitidea点击:

Iterate over rows and expand pandas dataframe

pythonloopspandas

提问by bowlby

I have pandas dataframe with a column containing values or lists of values (of unequal length). I want to 'expand' the rows, so each value in the list becomes single value in column. An example says it all:

我有一个包含值或值列表(长度不等)的列的Pandas数据框。我想“扩展”行,因此列表中的每个值都成为列中的单个值。一个例子说明了一切:

dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
 u'location': ['Amsterdam', ['Berlin','Paris'], ['Antwerp','Barcelona','Pisa'] ]})

    location     name
0   Amsterdam   Tom
1   [Berlin, Paris] Jim
2   [Antwerp, Barcelona, Pisa]  Claus

I want to turn into:

我想变成:

dfOut = pd.DataFrame({u'name': ['Tom', 'Jim', 'Jim', 'Claus','Claus','Claus'],
u'location': ['Amsterdam', 'Berlin','Paris', 'Antwerp','Barcelona','Pisa']})

    location     name
0   Amsterdam   Tom
1   Berlin   Jim
2   Paris   Jim
3   Antwerp Claus
4   Barcelona   Claus
5   Pisa    Claus

I first tried using apply but it's not possible to return multiple Series as far as I know. iterrows seems to be the trick. But the code below gives me an empty dataframe...

我首先尝试使用 apply 但据我所知不可能返回多个系列。iterrows 似乎是诀窍。但是下面的代码给了我一个空的数据框......

def duplicator(series):
    if type(series['location']) == list:
        for location in series['location']:
            subSeries = series
            subSeries['location'] = location
            dfOut.append(subSeries)
    else:
        dfOut.append(series)

for index, row in dfIn.iterrows():
    duplicator(row)

采纳答案by unutbu

If you return a series whose indexis a list of locations, then dfIn.applywill collate those series into a table:

如果您返回一个index包含位置列表的dfIn.apply系列,那么会将这些系列整理到一个表格中:

import pandas as pd
dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
                     u'location': ['Amsterdam', ['Berlin','Paris'],
                                   ['Antwerp','Barcelona','Pisa'] ]})

def expand(row):
    locations = row['location'] if isinstance(row['location'], list) else [row['location']]
    s = pd.Series(row['name'], index=list(set(locations)))
    return s

In [156]: dfIn.apply(expand, axis=1)
Out[156]: 
  Amsterdam Antwerp Barcelona Berlin Paris   Pisa
0       Tom     NaN       NaN    NaN   NaN    NaN
1       NaN     NaN       NaN    Jim   Jim    NaN
2       NaN   Claus     Claus    NaN   NaN  Claus

You can then stack this DataFrame to obtain:

然后,您可以堆叠此 DataFrame 以获得:

In [157]: dfIn.apply(expand, axis=1).stack()
Out[157]: 
0  Amsterdam      Tom
1  Berlin         Jim
   Paris          Jim
2  Antwerp      Claus
   Barcelona    Claus
   Pisa         Claus
dtype: object

This is a Series, while you want a DataFrame. A little massaging with reset_indexgives you the desired result:

这是一个系列,而您需要一个 DataFrame。稍微按摩一下reset_index就能达到您想要的效果:

dfOut = dfIn.apply(expand, axis=1).stack()
dfOut = dfOut.to_frame().reset_index(level=1, drop=False)
dfOut.columns = ['location', 'name']
dfOut.reset_index(drop=True, inplace=True)
print(dfOut)

yields

产量

    location   name
0  Amsterdam    Tom
1     Berlin    Jim
2      Paris    Jim
3  Amsterdam  Claus
4    Antwerp  Claus
5  Barcelona  Claus

回答by MorganM

Not as much interesting/fancy pandas usage, but this works:

没有那么多有趣/花哨的Pandas用法,但这有效:

import numpy as np
dfIn.loc[:, 'location'] = dfIn.location.apply(np.atleast_1d)
all_locations = np.hstack(dfIn.location)
all_names = np.hstack([[n]*len(l) for n, l in dfIn[['name', 'location']].values])
dfOut = pd.DataFrame({'location':all_locations, 'name':all_names})

It's about 40x faster than the apply/stack/reindex approach. As far as I can tell, that ratio holds at pretty much all dataframe sizes (didn't test how it scales with the size of the lists in each row). If you can guarantee that all locationentries are already iterables, you can remove the atleast_1dcall, which gives about another 20% speedup.

它比应用/堆栈/重新索引方法快大约 40 倍。据我所知,该比率几乎适用于所有数据帧大小(没有测试它如何随每行列表的大小缩放)。如果您可以保证所有location条目都已经是可迭代的,则可以删除该atleast_1d调用,这又可以提高 20% 的速度。