Original source: http://stackoverflow.com/questions/21160134/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Flatten a column with value of type list while duplicating the other column's value accordingly in Pandas
Asked by Yu Shen
Dear power Pandas experts:
I'm trying to implement a function that flattens a DataFrame column whose elements may be lists. For each row where that column holds a list, all other columns should be duplicated once per list element, and the designated column should hold one element of the list in each resulting row.
The following illustrates my requirements:
from pandas import DataFrame

input = DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})
   A       B
0  1  [a, b]
1  2       c
expected = DataFrame({'A': [1, 1, 2], 'B': ['a', 'b', 'c']}, index=[0, 0, 1])
   A  B
0  1  a
0  1  b
1  2  c
I feel that there might be an elegant solution/concept for it, but I'm struggling.
Here is my attempt, which does not work yet.
def flattenColumn(df, column):
    '''column is a string of the column's name.
    for each value of the column's element (which might be a list), duplicate the rest of columns at the corresponding row with the (each) value.
    '''
    def duplicate_if_needed(row):
        return concat([concat([row.drop(column, axis = 1), DataFrame({column: each})], axis = 1) for each in row[column][0]])
    return df.groupby(df.index).transform(duplicate_if_needed)
In recognition of alko's help, here is my trivial generalization of the solution to deal with more than 2 columns in a dataframe:
import pandas

def flattenColumn(input, column):
    '''
    column is a string of the column's name.
    for each value of the column's element (which might be a list),
    duplicate the rest of columns at the corresponding row with the (each) value.
    '''
    # Pair every element of the nested column with its original row label.
    # Series.items() replaces iteritems(), which newer pandas no longer provides.
    column_flat = pandas.DataFrame(
        [
            [i, c_flattened]
            for i, y in input[column].apply(list).items()
            for c_flattened in y
        ],
        columns=['I', column]
    )
    column_flat = column_flat.set_index('I')
    # Drop the nested column, then merge the flattened values back on the index.
    return (
        input.drop(columns=column)
        .merge(column_flat, left_index=True, right_index=True)
    )
The only limitation at the moment is that the column order changes: the flattened column ends up rightmost instead of in its original position. That should be feasible to fix.
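One possible way to keep the original column order is to reselect the columns after the merge. The sketch below is only an illustration under assumptions: the name flatten_column_keep_order and the extra column C are hypothetical, and the DataFrame is assumed not to already contain a column named I.

```python
import pandas as pd

def flatten_column_keep_order(df, column):
    # Hypothetical helper: same approach as flattenColumn, plus a final reselect.
    column_flat = pd.DataFrame(
        [[i, value]
         for i, values in df[column].apply(list).items()
         for value in values],
        columns=['I', column],
    ).set_index('I')
    merged = df.drop(columns=column).merge(
        column_flat, left_index=True, right_index=True
    )
    # Reselect the columns in their original order.
    return merged[list(df.columns)]

# Illustrative data: the nested column B sits between A and C.
df = pd.DataFrame({'A': [1, 2], 'B': [['a', 'b'], ['c']], 'C': [10, 20]})
result = flatten_column_keep_order(df, 'B')
```

Here result should list its columns as A, B, C again, with B flattened in place.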
Answered by alko
I guess the easiest way to flatten a list of lists is pure Python code, as this object type is not well suited for pandas or numpy. So you can do it with, for example:
>>> b_flat = pd.DataFrame([[i, x]
...                        for i, y in input['B'].apply(list).items()
...                        for x in y], columns=list('IB'))
>>> b_flat = b_flat.set_index('I')
With column B flattened, you can merge it back:
>>> input[['A']].merge(b_flat, left_index=True, right_index=True)
   A  B
0  1  a
0  1  b
1  2  c
[3 rows x 2 columns]
If you want the index to be recreated, rather than kept as in your expected result, you can add .reset_index(drop=True) to the last command.
Answered by Ian Gow
It is surprising that there isn't a more "native" solution. Putting the answer from @alko into a function is easy enough:
import pandas as pd

def unnest(df, col, reset_index=False):
    # Pair every element of the nested column with its original row label.
    col_flat = pd.DataFrame([[i, x]
                             for i, y in df[col].apply(list).items()
                             for x in y], columns=['I', col])
    col_flat = col_flat.set_index('I')
    # Drop the nested column, then merge the flattened values back on the index.
    df = df.drop(columns=col)
    df = df.merge(col_flat, left_index=True, right_index=True)
    if reset_index:
        df = df.reset_index(drop=True)
    return df
Then simply:
input = pd.DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})
expected = unnest(input, 'B')
It would be nice to allow unnesting multiple columns at once, and to handle the possibility of a nested column named I, which would break this code because I is used as the temporary index column.
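For what it's worth, newer pandas releases (0.25 and later) do ship a built-in for this, DataFrame.explode. A minimal sketch on the question's example, assuming such a version is installed (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})

# explode repeats the other columns once per list element and keeps the
# original row labels; a scalar entry such as 'c' is left as a single row.
flat = df.explode('B')

# Optionally renumber the rows instead of keeping the repeated labels.
flat_reset = flat.reset_index(drop=True)
```

Since no temporary index column is involved, the problem with a column named I does not come up with this approach.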
Answered by yaiir
A slightly simpler and more readable solution than the ones above, which worked for me:
out = []
for n, row in df.iterrows():
    for item in row['B']:
        row['flat_B'] = item
        out.append(row.copy())
flattened_df = pd.DataFrame(out)
Answered by lynochka
How about:
input = pd.DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})
input[['A', 'B']].set_index(['A'])['B'].apply(pd.Series).stack().reset_index(level=1, drop=True).reset_index().rename(columns={0:'B'})
Out[1]:
A B
0 1 a
1 1 b
2 2 c
Answered by Wanlie
You can also manipulate the lists first and then create a new DataFrame. For example:
import pandas as pd

input = pd.DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})
listA = input.A.tolist()
# wrap non-list entries of B so that every entry is a list
listB = [ele if isinstance(ele, list) else [ele] for ele in input.B.tolist()]
count_sublist_len = [len(ele) for ele in listB]
# create similar list for A: repeat each value once per element of the matching B entry
new_listA = [count_sublist_len[i] * [listA[i]] for i in range(len(listA))]
# flatten them
f_A = [item for sublist in new_listA for item in sublist]
f_B = [item for sublist in listB for item in sublist]
df_new = pd.DataFrame({'A': f_A, 'B': f_B})
Answered by Martin
Basically the same as what yaiir did, but using a list comprehension in a nice function:
def flatten_col(df: pd.DataFrame, col_from: str, col_to: str) -> pd.DataFrame:
    return pd.DataFrame([row.copy().set_value(col_to, x)
                         for i, row in df.iterrows()
                         for x in row[col_from]]) \
        .reset_index(drop=True)
where col_from is the column containing the lists and col_to is the name of the new column with the split list values.
Use as flatten_col(input, 'B', 'B') in your example.
The benefit of this method is that it copies along all other columns as well (unlike some other solutions). However, it does use the deprecated set_value method.
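As far as I know, set_value was later removed from pandas altogether (in 1.0), so the function above fails on current versions. A rough equivalent that assigns into a copy of each row instead might look like the sketch below; it keeps the same signature but trades the list comprehension for explicit loops.

```python
import pandas as pd

def flatten_col(df: pd.DataFrame, col_from: str, col_to: str) -> pd.DataFrame:
    rows = []
    for _, row in df.iterrows():
        for x in row[col_from]:
            # Copy the row so every list element gets its own record.
            new_row = row.copy()
            new_row[col_to] = x
            rows.append(new_row)
    return pd.DataFrame(rows).reset_index(drop=True)

input = pd.DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})
flattened = flatten_col(input, 'B', 'B')
```

Note that a plain string entry such as 'c' is iterated character by character, which only gives the intended result for single-character strings, as in the question's example.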
Answered by yoav_aaa
One-liner: apply the pd.DataFrame constructor to each entry, concatenate the results, and join them to the original.
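my_df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [(1, 2), (1, 2), (2, 3)]})
# Each piece is indexed by its source row label so that the join lines up; 'c_flat' is an illustrative column name.
my_df.join(pd.concat([pd.DataFrame({'c_flat': list(x)}, index=[i] * len(x)) for i, x in my_df['c'].items()]))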

