How to explode a list inside a DataFrame cell into separate rows
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/32468402/
Asked by SpicyClubSauce
I'm looking to turn a pandas cell containing a list into rows for each of those values.
So, take this:
If I'd like to unpack and stack the values in the nearest_neighbors column so that each value would be a row within each opponent index, how would I best go about this? Are there pandas methods that are meant for operations like this?
Accepted answer by Alexander
In the code below, I first reset the index to make the row iteration easier.
I create a list of lists where each element of the outer list is a row of the target DataFrame and each element of the inner list is one of the columns. This nested list is ultimately passed to the DataFrame constructor to create the desired DataFrame.
I use a lambda function together with list iteration to create a row for each element of nearest_neighbors, paired with the relevant name and opponent.
Finally, I create a new DataFrame from this list (using the original column names and setting the index back to name and opponent).
import pandas as pd

df = (pd.DataFrame({'name': ['A.J. Price'] * 3,
                    'opponent': ['76ers', 'blazers', 'bobcats'],
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
      .set_index(['name', 'opponent']))

>>> df
                                                   nearest_neighbors
name       opponent
A.J. Price 76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           bobcats   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]

# Move name and opponent from the index into ordinary columns
df.reset_index(inplace=True)
rows = []
# Append one [name, opponent, neighbor] row per element of each list
_ = df.apply(lambda row: [rows.append([row['name'], row['opponent'], nn])
                          for nn in row.nearest_neighbors], axis=1)
df_new = pd.DataFrame(rows, columns=df.columns).set_index(['name', 'opponent'])

>>> df_new
                     nearest_neighbors
name       opponent
A.J. Price 76ers           Zach LaVine
           76ers            Jeremy Lin
           76ers         Nate Robinson
           76ers                 Isaia
           blazers         Zach LaVine
           blazers          Jeremy Lin
           blazers       Nate Robinson
           blazers               Isaia
           bobcats         Zach LaVine
           bobcats          Jeremy Lin
           bobcats       Nate Robinson
           bobcats               Isaia
EDIT JUNE 2017
An alternative method is as follows:
>>> # df is the original frame, still indexed by name and opponent
>>> (pd.melt(df.nearest_neighbors.apply(pd.Series).reset_index(),
...          id_vars=['name', 'opponent'],
...          value_name='nearest_neighbors')
...  .set_index(['name', 'opponent'])
...  .drop('variable', axis=1)
...  .dropna()
...  .sort_index()
... )
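To make the role of each step in the melt chain visible, here is a minimal sketch that runs the same chain on a small ragged example. The data and the intermediate names (wide, long, result) are made up for illustration. It assumes lists of unequal length, which is the case where the dropna() step matters.

```python
import pandas as pd

# Hypothetical ragged data: the two lists have different lengths
df = pd.DataFrame(
    {
        "name": ["A.J. Price", "A.J. Price"],
        "opponent": ["76ers", "blazers"],
        "nearest_neighbors": [["Zach LaVine", "Jeremy Lin", "Nate Robinson"], ["Isaia"]],
    }
).set_index(["name", "opponent"])

# Step 1: one column per list position; shorter lists are padded with NaN
wide = df["nearest_neighbors"].apply(pd.Series)

# Step 2: unpivot the positional columns back into rows
long = pd.melt(
    wide.reset_index(),
    id_vars=["name", "opponent"],
    value_name="nearest_neighbors",
)

# Step 3: drop the helper column and the NaN padding, then restore the index
result = (
    long.set_index(["name", "opponent"])
    .drop("variable", axis=1)
    .dropna()
    .sort_index()
)
print(result)
```

Without dropna(), the padding introduced in step 1 would survive as extra rows with missing neighbors.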
Answer by maxymoo
I think this is a really good question. In Hive you would use EXPLODE, and I think there is a case to be made that pandas should include this functionality by default. I would probably explode the list column with a nested generator comprehension like this:
# df is indexed by (name, opponent), so i is a (name, opponent) tuple
pd.DataFrame({
        "name": i[0],
        "opponent": i[1],
        "nearest_neighbor": neighbour
    }
    for i, row in df.iterrows() for neighbour in row.nearest_neighbors
).set_index(["name", "opponent"])
Answer by Philipp Schwarz
Nicer alternative solution with apply(pd.Series):
import pandas as pd

df = pd.DataFrame({'listcol': [[1, 2, 3], [4, 5, 6]]})

# expand df.listcol into its own dataframe
tags = df['listcol'].apply(pd.Series)

# rename each new column to listcol_<position>
tags = tags.rename(columns=lambda x: 'listcol_' + str(x))

# join the tags dataframe back to the original dataframe
df = pd.concat([df, tags], axis=1)
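Note that this answer spreads each list across new columns rather than new rows. As a hedged variation, the same wide frame can be built straight from the Python lists with the DataFrame constructor, which avoids creating one pd.Series per row. The name wide is made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({"listcol": [[1, 2, 3], [4, 5, 6]]})

# Build the wide frame directly from the lists, reusing the original index
tags = pd.DataFrame(df["listcol"].tolist(), index=df.index).add_prefix("listcol_")

# Join the positional columns back onto the original frame
wide = pd.concat([df, tags], axis=1)
print(wide)
```

This is likely faster on large frames, though I have not benchmarked it against the version above.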
Answer by 13Herger
Similar to Hive's EXPLODE functionality:
import copy

import pandas


def pandas_explode(df, column_to_explode):
    """
    Similar to Hive's EXPLODE function, take a column with iterable elements, and flatten the iterable to one element
    per observation in the output table

    :param df: A dataframe to explode
    :type df: pandas.DataFrame
    :param column_to_explode: Name of the column whose iterables should be flattened
    :type column_to_explode: str
    :return: An exploded data frame
    :rtype: pandas.DataFrame
    """
    # Create a list of new observations
    new_observations = list()

    # Iterate through existing observations
    for row in df.to_dict(orient='records'):

        # Take out the exploding iterable
        explode_values = row[column_to_explode]
        del row[column_to_explode]

        # Create a new observation for every entry in the exploding iterable & add all of the other columns
        for explode_value in explode_values:

            # Deep copy existing observation
            new_observation = copy.deepcopy(row)

            # Add one (newly flattened) value from exploding iterable
            new_observation[column_to_explode] = explode_value

            # Add to the list of new observations
            new_observations.append(new_observation)

    # Create a DataFrame
    return_df = pandas.DataFrame(new_observations)

    # Return
    return return_df
Answer by Zero
Use apply(pd.Series) and stack, then reset_index and to_frame:
In [1803]: (df.nearest_neighbors.apply(pd.Series)
              .stack()
              .reset_index(level=2, drop=True)
              .to_frame('nearest_neighbors'))
Out[1803]:
                     nearest_neighbors
name       opponent
A.J. Price 76ers           Zach LaVine
           76ers            Jeremy Lin
           76ers         Nate Robinson
           76ers                 Isaia
           blazers         Zach LaVine
           blazers          Jeremy Lin
           blazers       Nate Robinson
           blazers               Isaia
           bobcats         Zach LaVine
           bobcats          Jeremy Lin
           bobcats       Nate Robinson
           bobcats               Isaia
Details
In [1804]: df
Out[1804]:
                                                   nearest_neighbors
name       opponent
A.J. Price 76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           bobcats   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
Answer by Sinan Ozel
Here is a potential optimization for larger dataframes. It runs faster when many rows share the same value in the "exploding" field. (The larger the dataframe is relative to the number of unique values in that field, the better this code will perform.)
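import pandas as pd


def lateral_explode(dataframe, fieldname):
    temp_fieldname = fieldname + '_made_tuple_'
    # Tuples are hashable, so they can be de-duplicated and used as a merge key.
    # assign() returns a new frame, so the caller's dataframe is not modified.
    dataframe = dataframe.assign(**{temp_fieldname: dataframe[fieldname].apply(tuple)})
    list_of_dataframes = []
    # Explode each distinct tuple only once
    for values in dataframe[temp_fieldname].unique().tolist():
        list_of_dataframes.append(pd.DataFrame({
            temp_fieldname: [values] * len(values),
            fieldname: list(values),
        }))
    # Drop the original list column, then merge the exploded values back in
    dataframe = (dataframe.drop(columns=[fieldname])
                 .merge(pd.concat(list_of_dataframes), how='left', on=temp_fieldname))
    del dataframe[temp_fieldname]
    return dataframe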
Answer by Oleg
The fastest method I found so far is extending the DataFrame with .iloc and assigning back the flattened target column.
Given the usual input (replicated a bit):
import pandas as pd

df = (pd.DataFrame({'name': ['A.J. Price'] * 3,
                    'opponent': ['76ers', 'blazers', 'bobcats'],
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
      .set_index(['name', 'opponent']))
df = pd.concat([df]*10)

df
Out[3]:
                                                   nearest_neighbors
name       opponent
A.J. Price 76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           bobcats   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
...
Given the following suggested alternatives:
import numpy as np

col_target = 'nearest_neighbors'

def extend_iloc():
    # Flatten columns of lists
    col_flat = [item for sublist in df[col_target] for item in sublist]
    # Row numbers to repeat
    lens = df[col_target].apply(len)
    vals = range(df.shape[0])
    ilocations = np.repeat(vals, lens)
    # Replicate rows and add flattened column of lists
    cols = [i for i,c in enumerate(df.columns) if c != col_target]
    new_df = df.iloc[ilocations, cols].copy()
    new_df[col_target] = col_flat
    return new_df

def melt():
    return (pd.melt(df[col_target].apply(pd.Series).reset_index(),
                    id_vars=['name', 'opponent'],
                    value_name=col_target)
            .set_index(['name', 'opponent'])
            .drop('variable', axis=1)
            .dropna()
            .sort_index())

def stack_unstack():
    return (df[col_target].apply(pd.Series)
            .stack()
            .reset_index(level=2, drop=True)
            .to_frame(col_target))
I find that extend_iloc() is the fastest:
%timeit extend_iloc()
3.11 ms ± 544 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit melt()
22.5 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit stack_unstack()
11.5 ms ± 410 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
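The repeat-and-assign idea behind extend_iloc() can be sketched on a frame with ragged lists and one extra column, so that the effect of the per-row lengths is visible. The helper name repeat_and_flatten and the games column are made up for this illustration. It takes the frame as an argument instead of reading a global, but otherwise follows the same steps.

```python
from itertools import chain

import numpy as np
import pandas as pd


def repeat_and_flatten(df, col):
    """Hypothetical helper: one output row per list element in df[col]."""
    lens = df[col].map(len).to_numpy()
    # Positional row numbers, each repeated once per list element
    positions = np.repeat(np.arange(len(df)), lens)
    # Copy the other columns for the repeated rows, then attach the flat values
    out = df.drop(columns=col).iloc[positions].copy()
    out[col] = list(chain.from_iterable(df[col]))
    return out


df = pd.DataFrame(
    {
        "name": ["A.J. Price"] * 3,
        "opponent": ["76ers", "blazers", "bobcats"],
        "games": [3, 1, 2],  # made-up extra column
        "nearest_neighbors": [
            ["Zach LaVine", "Jeremy Lin"],
            ["Nate Robinson"],
            ["Isaia", "Jeremy Lin", "Zach LaVine"],
        ],
    }
).set_index(["name", "opponent"])

flat = repeat_and_flatten(df, "nearest_neighbors")
print(flat)
```

A plausible reason this is quick is that it never builds a pd.Series per row: the row positions are repeated in NumPy and the lists are flattened in plain Python.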
Answer by Brian Atwood
Extending Oleg's .iloc answer to automatically flatten all list-columns:
import numpy as np


def extend_iloc(df):
    cols_to_flatten = [colname for colname in df.columns if
                       isinstance(df.iloc[0][colname], list)]
    # Row numbers to repeat
    lens = df[cols_to_flatten[0]].apply(len)
    vals = range(df.shape[0])
    ilocations = np.repeat(vals, lens)
    # Replicate rows and add flattened column of lists
    with_idxs = [(i, c) for (i, c) in enumerate(df.columns) if c not in cols_to_flatten]
    # zip() is not subscriptable in Python 3, so collect the positions directly
    col_idxs = [i for (i, c) in with_idxs]
    new_df = df.iloc[ilocations, col_idxs].copy()

    # Flatten columns of lists
    for col_target in cols_to_flatten:
        col_flat = [item for sublist in df[col_target] for item in sublist]
        new_df[col_target] = col_flat

    return new_df
This assumes that, within each row, every list-column holds a list of the same length.
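For comparison, pandas 1.3 and later accept a list of columns in DataFrame.explode, which expands several list-columns in lockstep under the same equal-length assumption. The sketch below uses made-up column names and data, and it will not run on older pandas versions.

```python
import pandas as pd

# Hypothetical frame with two list-columns whose lengths match row by row
df = pd.DataFrame(
    {
        "team": ["x", "y"],
        "players": [["a", "b"], ["c"]],
        "scores": [[1, 2], [3]],
    }
)

# Requires pandas >= 1.3: both list-columns are expanded together
flat = df.explode(["players", "scores"])
print(flat)
```

If the lists in a row differ in length, this call raises a ValueError rather than misaligning the values.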
Answer by Briford Wylie
So all of these answers are good but I wanted something ^really simple^ so here's my contribution:
def explode(series):
    return pd.Series([x for _list in series for x in _list])
That's it. Just use this when you want a new series where the lists are 'exploded'. Here's an example where we do value_counts() on taco choices :)
In [1]: my_df = pd.DataFrame(pd.Series([['a','b','c'],['b','c'],['c']]), columns=['tacos'])
In [2]: my_df.head()
Out[2]:
       tacos
0  [a, b, c]
1     [b, c]
2        [c]
In [3]: explode(my_df['tacos']).value_counts()
Out[3]:
c    3
b    2
a    1
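One thing to be aware of: the series returned by explode() above gets a fresh 0..n-1 index, so the link back to the original rows is lost. That is fine for value_counts(). If the row labels are needed, a small variation can repeat each label once per list element. The name explode_keep_index is made up for this sketch.

```python
import pandas as pd


def explode_keep_index(series):
    """Hypothetical variant: flatten the lists but keep each source row's label."""
    lengths = series.map(len)
    flat = [x for _list in series for x in _list]
    # Repeat every index label once per element of its list
    return pd.Series(flat, index=series.index.repeat(lengths), name=series.name)


tacos = pd.Series([["a", "b", "c"], ["b", "c"], ["c"]], name="tacos")
exploded = explode_keep_index(tacos)
print(exploded)
```

Here the index of the result is 0, 0, 0, 1, 1, 2, so each value can still be traced to the row it came from.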
Answer by joelostblom
Exploding a list-like column has been simplified significantly in pandas 0.25 with the addition of the explode() method:
df = (pd.DataFrame({'name': ['A.J. Price'] * 3,
                    'opponent': ['76ers', 'blazers', 'bobcats'],
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
      .set_index(['name', 'opponent']))

df.explode('nearest_neighbors')
Out:
                     nearest_neighbors
name       opponent
A.J. Price 76ers           Zach LaVine
           76ers            Jeremy Lin
           76ers         Nate Robinson
           76ers                 Isaia
           blazers         Zach LaVine
           blazers          Jeremy Lin
           blazers       Nate Robinson
           blazers               Isaia
           bobcats         Zach LaVine
           bobcats          Jeremy Lin
           bobcats       Nate Robinson
           bobcats               Isaia
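Two edge cases are worth checking when using explode(): empty lists and cells that are not lists at all. The sketch below uses made-up data to show both, along with the ignore_index option, which as far as I know was added in pandas 1.1.

```python
import pandas as pd

# Hypothetical data: a normal list, an empty list, and a plain string
df = pd.DataFrame(
    {
        "opponent": ["76ers", "blazers", "bobcats"],
        "nearest_neighbors": [["Zach LaVine", "Jeremy Lin"], [], "Isaia"],
    }
)

# The empty list becomes a single NaN row; the plain string is left unchanged
exploded = df.explode("nearest_neighbors")
print(exploded)

# ignore_index=True (pandas >= 1.1) renumbers the rows 0..n-1
renumbered = df.explode("nearest_neighbors", ignore_index=True)
print(renumbered)
```

By default the original index labels are repeated, so the first result has index 0, 0, 1, 2, which makes it easy to join back to the source frame.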