Original question: http://stackoverflow.com/questions/38651008/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Splitting multiple columns into rows in pandas dataframe
Asked by ctan
I have a pandas dataframe as follows:
ticker  account      value     date
aa      assets       100,200   20121231, 20131231
bb      liabilities  50, 150   20141231, 20131231
I would like to split df['value'] and df['date'] so that the dataframe looks like this:
ticker account value date
aa assets 100 20121231
aa assets 200 20131231
bb liabilities 50 20141231
bb liabilities 150 20131231
Would greatly appreciate any help.
Answered by jezrael
You can first split the columns, create a Series with stack, and remove whitespace with strip:
s1 = df.value.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
s2 = df.date.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
Then concat both Series into df1:
df1 = pd.concat([s1,s2], axis=1, keys=['value','date'])
Remove the old columns value and date, and join:
print (df.drop(['value','date'], axis=1).join(df1).reset_index(drop=True))
ticker account value date
0 aa assets 100 20121231
1 aa assets 200 20131231
2 bb liabilities 50 20141231
3 bb liabilities 150 20131231
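The three steps above can be put together as one runnable sketch. The sample frame is rebuilt from the question, and the helper name split_to_rows is an assumption made for illustration, not part of the original answer.

```python
import pandas as pd

# Sample frame rebuilt from the question; note the spaces after some commas
df = pd.DataFrame({
    'ticker': ['aa', 'bb'],
    'account': ['assets', 'liabilities'],
    'value': ['100,200', '50, 150'],
    'date': ['20121231, 20131231', '20141231, 20131231'],
})

def split_to_rows(frame, col):
    # Hypothetical helper wrapping the chain from the answer:
    # split -> one column per piece, stack -> one row per piece,
    # strip -> drop stray spaces, reset_index -> keep only the original row label
    return (frame[col].str.split(',', expand=True)
                      .stack()
                      .str.strip()
                      .reset_index(level=1, drop=True))

# Both Series share the same repeated row labels, so they line up side by side
df1 = pd.concat([split_to_rows(df, 'value'), split_to_rows(df, 'date')],
                axis=1, keys=['value', 'date'])

# Joining on the index repeats ticker/account once per split piece
result = df.drop(['value', 'date'], axis=1).join(df1).reset_index(drop=True)
print(result)
```

This relies on every row having the same number of pieces in value and in date; rows where the counts differ would need separate handling.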
Answered by piRSquared
I'm noticing this question a lot. That is, how do I split a column that holds a list into multiple rows? I've seen it called exploding.
So I wrote a function that will do it.
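import numpy as np
import pandas as pd

def explode(df, columns):
    # Repeat each index label once per list element in the first column
    idx = np.repeat(df.index, df[columns[0]].str.len())
    # reindex_axis was removed from pandas; reindex selects the same rows
    a = df.T.reindex(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    # join on the index tolerates the repeated labels in p
    return df.drop(columns, axis=1).join(p).reset_index(drop=True)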
But before we can use it, we need lists (or other iterables) in a column.
Setup
df = pd.DataFrame([['aa', 'assets', '100,200', '20121231,20131231'],
                   ['bb', 'liabilities', '50,50', '20141231,20131231']],
                  columns=['ticker', 'account', 'value', 'date'])
df
Split the value and date columns:
df.value = df.value.str.split(',')
df.date = df.date.str.split(',')
df
Now we could explode on either column or both, one after the other.
Solution
explode(df, ['value','date'])
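For readers on a recent pandas: the library now ships a built-in DataFrame.explode (added in 0.25), and since version 1.3 it accepts a list of columns. The following is a minimal sketch of that built-in, not the function defined in this answer. It assumes pandas 1.3 or later and that, within each row, the lists in the exploded columns have the same length.

```python
import pandas as pd

df = pd.DataFrame([['aa', 'assets', '100,200', '20121231,20131231'],
                   ['bb', 'liabilities', '50,50', '20141231,20131231']],
                  columns=['ticker', 'account', 'value', 'date'])

# Turn the comma-separated strings into lists first
df['value'] = df['value'].str.split(',')
df['date'] = df['date'].str.split(',')

# Explode both list columns together; each original row repeats once per element
result = df.explode(['value', 'date']).reset_index(drop=True)
print(result)
```

The built-in keeps the original row label on every exploded row, which is why reset_index(drop=True) is applied at the end to get a clean 0..n-1 index.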
Timing
I removed strip from @jezrael's timing because I could not effectively add it to mine. Stripping is a necessary step for this question, as OP has spaces in the strings after the commas. I was aiming to provide a generic way to explode a column that already has iterables in it, and I think I've accomplished that.
code
def get_df(n=1):
    return pd.DataFrame([['aa', 'assets', '100,200,200', '20121231,20131231,20131231'],
                         ['bb', 'liabilities', '50,50', '20141231,20131231']] * n,
                        columns=['ticker', 'account', 'value', 'date'])
small 2 row sample
medium 200 row sample
large 2,000,000 row sample
Answered by titipata
I wrote an explode function based on the previous answers. It might be useful for anyone who wants to grab and use it quickly.
def explode(df, cols, split_on=','):
    """
    Explode dataframe on the given columns, split on the given delimiter
    """
    cols_sep = list(set(df.columns) - set(cols))
    df_cols = df[cols_sep]
    explode_len = df[cols[0]].str.split(split_on).map(len)
    repeat_list = []
    # as_matrix() was removed from pandas; to_numpy() returns the same array
    for r, e in zip(df_cols.to_numpy(), explode_len):
        repeat_list.extend([list(r)]*e)
    df_repeat = pd.DataFrame(repeat_list, columns=cols_sep)
    df_explode = pd.concat([df[col].str.split(split_on, expand=True).stack().str.strip().reset_index(drop=True)
                            for col in cols], axis=1)
    df_explode.columns = cols
    return pd.concat((df_repeat, df_explode), axis=1)
Example from @piRSquared:
df = pd.DataFrame([['aa', 'assets', '100,200', '20121231,20131231'],
                   ['bb', 'liabilities', '50,50', '20141231,20131231']],
                  columns=['ticker', 'account', 'value', 'date'])
explode(df, ['value', 'date'])
output
+-----------+------+-----+--------+
| account|ticker|value| date|
+-----------+------+-----+--------+
| assets| aa| 100|20121231|
| assets| aa| 200|20131231|
|liabilities| bb| 50|20141231|
|liabilities| bb| 50|20131231|
+-----------+------+-----+--------+
Answered by Caro
Because I'm too new, I'm not allowed to write a comment, so I'm writing an "answer" instead.
@titipata your answer worked really well, but in my opinion there is a small "mistake" in your code that I'm not able to find myself.
I worked with the example from this question and changed just the values.
df = pd.DataFrame([['title1', 'publisher1', '1.1,1.2', '1'],
                   ['title2', 'publisher2', '2', '2.1,2.2']],
                  columns=['titel', 'publisher', 'print', 'electronic'])
explode(df, ['print', 'electronic'])
publisher titel print electronic
0 publisher1 title1 1.1 1
1 publisher1 title1 1.2 2.1
2 publisher2 title2 2 2.2
As you can see, in the column 'electronic', row 1 should contain the value '1' and not '2.1'.
Because of that, the whole dataset would change. I hope someone can help me find a solution for this.
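One possible way around the misalignment, offered as a sketch rather than a definitive fix. titipata's explode stacks each column separately and then renumbers the pieces with reset_index(drop=True), so pieces are paired by their overall position instead of by the row they came from. The hypothetical helper explode_rows below (its name and the choice to pad with a missing value are assumptions, not part of any answer) splits row by row and pads the shorter list.

```python
from itertools import zip_longest

import pandas as pd

def explode_rows(df, cols, split_on=','):
    """Hypothetical helper: split `cols` row by row, padding shorter lists with None."""
    keep = [c for c in df.columns if c not in cols]
    records = []
    for _, row in df.iterrows():
        # Split every target column of this row and trim stray spaces
        pieces = [[p.strip() for p in str(row[c]).split(split_on)] for c in cols]
        # zip_longest pairs pieces within the row and fills gaps with None
        for combo in zip_longest(*pieces):
            record = {c: row[c] for c in keep}
            record.update(zip(cols, combo))
            records.append(record)
    return pd.DataFrame(records, columns=keep + list(cols))

df = pd.DataFrame([['title1', 'publisher1', '1.1,1.2', '1'],
                   ['title2', 'publisher2', '2', '2.1,2.2']],
                  columns=['titel', 'publisher', 'print', 'electronic'])
out = explode_rows(df, ['print', 'electronic'])
print(out)
```

With this data the sketch yields four rows: title1 keeps '1' next to '1.1' and gets a missing electronic value next to '1.2', while title2 pairs '2' with '2.1' and gets a missing print value next to '2.2'. If repeating the last value is preferred over a missing one, a forward fill within each original row would be one option. The row-by-row loop favors clarity over speed, so it may be slow on large frames.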