Splitting multiple columns into rows in a pandas DataFrame

Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/38651008/

Date: 2020-09-14 01:41:50  Source: igfitidea

Splitting multiple columns into rows in pandas dataframe

Tags: pandas, join, dataframe, split, multiple-columns

Asked by ctan

I have a pandas dataframe as follows:

ticker    account      value         date
aa       assets       100,200       20121231, 20131231
bb       liabilities  50, 150       20141231, 20131231

I would like to split df['value'] and df['date'] so that the dataframe looks like this:

ticker    account      value         date
aa       assets       100           20121231
aa       assets       200           20131231 
bb       liabilities  50            20141231
bb       liabilities  150           20131231

Would greatly appreciate any help.

Answered by jezrael

You can first split the columns, create a Series from each with stack, and remove the whitespace with strip:

s1 = df.value.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
s2 = df.date.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)

Then concat both Series into df1:

df1 = pd.concat([s1,s2], axis=1, keys=['value','date'])

Remove the old columns value and date, then join:

print (df.drop(['value','date'], axis=1).join(df1).reset_index(drop=True))
  ticker      account value      date
0     aa       assets   100  20121231
1     aa       assets   200  20131231
2     bb  liabilities    50  20141231
3     bb  liabilities   150  20131231
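
For anyone who wants to run the steps above end to end, here is a minimal sketch that wraps them in one function. The helper name split_to_rows and the sample frame are my own assumptions built from the question's data, not part of the answer. It also concatenates the stacked pieces before dropping the inner index level, which is a small reordering of the steps rather than the exact code above.

```python
import pandas as pd

# Sample frame rebuilt from the question (assumed string columns).
df = pd.DataFrame({
    "ticker": ["aa", "bb"],
    "account": ["assets", "liabilities"],
    "value": ["100,200", "50, 150"],
    "date": ["20121231, 20131231", "20141231, 20131231"],
})

def split_to_rows(frame, columns, sep=","):
    """Hypothetical helper: one output row per separated item."""
    # Split each column, stack the pieces into a Series indexed by
    # (original row, position), and strip surrounding whitespace.
    stacked = {
        col: frame[col].str.split(sep, expand=True).stack().str.strip()
        for col in columns
    }
    # Line the pieces up side by side, then drop the position level so the
    # remaining index holds the original row labels.
    wide = pd.concat(stacked, axis=1).reset_index(level=1, drop=True)
    # Join back onto the untouched columns; each original row is repeated
    # once per piece.
    return frame.drop(columns, axis=1).join(wide).reset_index(drop=True)

result = split_to_rows(df, ["value", "date"])
print(result)
```

Note that the resulting value and date columns still hold strings; converting them to numbers or dates would be a separate step.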

Answered by piRSquared

I'm noticing this question a lot. That is, how do I split this column that has a list into multiple rows? I've seen it called exploding. Here are some links:

So I wrote a function that will do it.

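import numpy as np
import pandas as pd

def explode(df, columns):
    # Repeat each index label once per element of the first list column
    idx = np.repeat(df.index, df[columns[0]].str.len())
    # reindex_axis was removed in pandas 1.0; reindex is its replacement
    a = df.T.reindex(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    # join aligns the repeated index labels in p with the original rows
    return df.drop(columns, axis=1).join(p).reset_index(drop=True)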

But before we can use it, we need lists (or other iterables) in a column.

Setup

df = pd.DataFrame([['aa', 'assets',      '100,200', '20121231,20131231'],
                   ['bb', 'liabilities', '50,50',   '20141231,20131231']],
                  columns=['ticker', 'account', 'value', 'date'])

df


Split the value and date columns:

df.value = df.value.str.split(',')
df.date = df.date.str.split(',')

df


Now we could explode on either column or both, one after the other.

Solution

explode(df, ['value','date'])
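
As a point of comparison, newer pandas versions ship a built-in DataFrame.explode. Below is a minimal sketch, assuming pandas 1.3 or later (where explode accepts a list of columns) and assuming the lists in each row have matching lengths across the exploded columns. It is an alternative to the hand-written function above, not the same implementation.

```python
import pandas as pd

df = pd.DataFrame([['aa', 'assets',      '100,200', '20121231,20131231'],
                   ['bb', 'liabilities', '50,50',   '20141231,20131231']],
                  columns=['ticker', 'account', 'value', 'date'])

# Turn the comma-separated strings into lists first.
df['value'] = df['value'].str.split(',')
df['date'] = df['date'].str.split(',')

# Explode both list columns together; ignore_index renumbers the rows.
exploded = df.explode(['value', 'date'], ignore_index=True)
print(exploded)
```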




Timing

I removed strip from @jezrael's timing because I could not effectively add it to mine. For this question strip is a necessary step, since the OP has spaces after the commas in the strings. My aim was to provide a generic way to explode a column that already contains iterables, and I think I've accomplished that.

code

def get_df(n=1):
    return pd.DataFrame([['aa', 'assets',      '100,200,200', '20121231,20131231,20131231'],
                         ['bb', 'liabilities', '50,50',   '20141231,20131231']] * n,
                        columns=['ticker', 'account', 'value', 'date'])
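
To reproduce this kind of comparison yourself, something like the following could be used. It is only a rough sketch: the split_stack function is my own stand-in for the split/stack approach (without strip), the sample sizes are kept small so it runs quickly, and no particular timings are implied.

```python
import timeit
import pandas as pd

def get_df(n=1):
    return pd.DataFrame([['aa', 'assets',      '100,200,200', '20121231,20131231,20131231'],
                         ['bb', 'liabilities', '50,50',       '20141231,20131231']] * n,
                        columns=['ticker', 'account', 'value', 'date'])

def split_stack(df, cols=('value', 'date')):
    """Hypothetical stand-in for the split/stack approach, without strip."""
    cols = list(cols)
    # dropna keeps the result identical whether or not stack drops the
    # padding produced by rows with fewer items.
    parts = [df[c].str.split(',', expand=True).stack().dropna() for c in cols]
    wide = pd.concat(parts, axis=1, keys=cols).reset_index(level=1, drop=True)
    return df.drop(cols, axis=1).join(wide).reset_index(drop=True)

timings = {}
for n in (1, 100):
    sample = get_df(n)
    runs = 5
    # Average seconds per call, keyed by the number of input rows.
    timings[len(sample)] = timeit.timeit(lambda: split_stack(sample), number=runs) / runs

for rows, seconds in timings.items():
    print(f"{rows} rows: {seconds:.6f} s per call")
```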

[Images: timing results for the small (2 rows), medium (200 rows), and large (2,000,000 rows) samples]

Answered by titipata

I wrote an explode function based on the previous answers. It might be useful for anyone who wants to grab it and use it quickly.

import pandas as pd

def explode(df, cols, split_on=','):
    """
    Explode dataframe on the given columns, split on the given delimiter
    """
    # Columns that are kept as-is (a set difference does not preserve column order)
    cols_sep = list(set(df.columns) - set(cols))
    df_cols = df[cols_sep]
    # Number of pieces per row, taken from the first column to split
    explode_len = df[cols[0]].str.split(split_on).map(len)
    repeat_list = []
    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    for r, e in zip(df_cols.to_numpy(), explode_len):
        repeat_list.extend([list(r)]*e)
    df_repeat = pd.DataFrame(repeat_list, columns=cols_sep)
    df_explode = pd.concat([df[col].str.split(split_on, expand=True).stack().str.strip().reset_index(drop=True)
                            for col in cols], axis=1)
    df_explode.columns = cols
    return pd.concat((df_repeat, df_explode), axis=1)

Example from @piRSquared:

df = pd.DataFrame([['aa', 'assets', '100,200', '20121231,20131231'],
                   ['bb', 'liabilities', '50,50', '20141231,20131231']],
                  columns=['ticker', 'account', 'value', 'date'])
explode(df, ['value', 'date'])

output

+-----------+------+-----+--------+
|    account|ticker|value|    date|
+-----------+------+-----+--------+
|     assets|    aa|  100|20121231|
|     assets|    aa|  200|20131231|
|liabilities|    bb|   50|20141231|
|liabilities|    bb|   50|20131231|
+-----------+------+-----+--------+

Answered by Caro

Because I'm too new, I'm not allowed to write a comment, so I'm writing an "answer" instead.

@titipata your answer worked really well, but in my opinion there is a small "mistake" in your code that I'm not able to find myself.

I worked with the example from this question and changed only the values.

df = pd.DataFrame([['title1', 'publisher1', '1.1,1.2', '1'],
               ['title2', 'publisher2', '2', '2.1,2.2']],
              columns=['titel', 'publisher', 'print', 'electronic'])

explode(df, ['print', 'electronic'])

    publisher   titel   print   electronic
0   publisher1  title1  1.1     1
1   publisher1  title1  1.2     2.1
2   publisher2  title2  2       2.2

As you can see, row 1 of the 'electronic' column should contain the value '1', not '2.1'.

Because of that, the whole dataset gets shifted. I hope someone can help me find a solution for this.
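
A likely cause, as far as I can tell: the function stacks each column separately and then calls reset_index(drop=True), which throws away the original row labels, so the pieces are lined up purely by position. That only works when every column has the same number of items in each row. Below is a minimal sketch of one way around it, assuming that missing items should simply show up as NaN. The name explode_aligned is hypothetical and this is not a fix endorsed by the answer's author.

```python
import pandas as pd

def explode_aligned(df, cols, split_on=","):
    """Hypothetical variant that keeps pieces attached to their source row."""
    # Keep the (original row, position) index that stack produces instead of
    # resetting it; dropna removes the padding from shorter rows.
    pieces = {
        col: df[col].str.split(split_on, expand=True).stack().str.strip().dropna()
        for col in cols
    }
    # Align the columns on (row, position); a row with fewer items in one
    # column gets NaN there instead of borrowing from the next row.
    exploded = pd.concat(pieces, axis=1).sort_index()
    exploded = exploded.reset_index(level=1, drop=True)
    return df.drop(columns=cols).join(exploded).reset_index(drop=True)

df = pd.DataFrame([['title1', 'publisher1', '1.1,1.2', '1'],
                   ['title2', 'publisher2', '2', '2.1,2.2']],
                  columns=['titel', 'publisher', 'print', 'electronic'])

result = explode_aligned(df, ['print', 'electronic'])
print(result)
```

With this sample, '1' stays with title1 and '2.1' and '2.2' stay with title2, while the shorter side of each row is filled with NaN.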