Python: how to split a column of tuples in a pandas dataframe?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/29550414/

Date: 2020-08-19 04:43:06  Source: igfitidea

how to split column of tuples in pandas dataframe?

python, numpy, pandas, dataframe, tuples

Asked by Donbeo

I have a pandas dataframe (this is only a small piece of it):


>>> d1
   y norm test  y norm train  len(y_train)  len(y_test)  \
0    64.904368    116.151232          1645          549   
1    70.852681    112.639876          1645          549   

                                    SVR RBF  \
0   (35.652207342877873, 22.95533537448393)   
1  (39.563683797747622, 27.382483096332511)   

                                        LCV  \
0  (19.365430594452338, 13.880062435173587)   
1  (19.099614489458364, 14.018867136617146)   

                                   RIDGE CV  \
0  (4.2907610988480362, 12.416745648065584)   
1    (4.18864306788194, 12.980833914392477)   

                                         RF  \
0   (9.9484841581029428, 16.46902345373697)   
1  (10.139848213735391, 16.282141345406522)   

                                           GB  \
0  (0.012816232716538605, 15.950164822266007)   
1  (0.012814519804493328, 15.305745202851712)   

                                             ET DATA  
0  (0.00034337162272515505, 16.284800366214057)  j2m  
1  (0.00024811554516431878, 15.556506191784194)  j2m  
>>> 

I want to split all the columns that contain tuples. For example, I want to replace the column LCV with the columns LCV-a and LCV-b.


How can I do that?


Accepted answer by joris

You can do this by doing pd.DataFrame(col.tolist()) on that column:


In [2]: df = pd.DataFrame({'a':[1,2], 'b':[(1,2), (3,4)]})                                                                                                                      

In [3]: df                                                                                                                                                                      
Out[3]: 
   a       b
0  1  (1, 2)
1  2  (3, 4)

In [4]: df['b'].tolist()                                                                                                                                                        
Out[4]: [(1, 2), (3, 4)]

In [5]: pd.DataFrame(df['b'].tolist(), index=df.index)                                                                                                                                          
Out[5]: 
   0  1
0  1  2
1  3  4

In [6]: df[['b1', 'b2']] = pd.DataFrame(df['b'].tolist(), index=df.index)                                                                                                                       

In [7]: df                                                                                                                                                                      
Out[7]: 
   a       b  b1  b2
0  1  (1, 2)   1   2
1  2  (3, 4)   3   4

Note: in an earlier version, this answer recommended using df['b'].apply(pd.Series) instead of pd.DataFrame(df['b'].tolist(), index=df.index). That works as well (because it turns each tuple into a Series, which is then seen as a row of a dataframe), but it is slower and uses more memory than the tolist version, as noted by the other answers here (thanks to @denfromufa).
I updated this answer to make sure the most visible answer has the best solution.

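Since the original question asks to split every tuple column, the accepted pattern can be applied in a loop over all such columns. The sketch below is an illustrative assumption: the sample data, the -a/-b suffix scheme, and the isinstance check are made up to mirror the question, not taken from the answer itself.

```python
import pandas as pd

# Toy frame mimicking the question's layout (values are made up)
df = pd.DataFrame({
    'DATA': ['j2m', 'j2m'],
    'LCV': [(19.37, 13.88), (19.10, 14.02)],
    'RF': [(9.95, 16.47), (10.14, 16.28)],
})

suffixes = 'ab'
for col in list(df.columns):
    # Only split columns whose every value is a tuple
    if df[col].map(lambda v: isinstance(v, tuple)).all():
        parts = pd.DataFrame(df[col].tolist(), index=df.index)
        parts.columns = [f'{col}-{suffixes[i]}' for i in parts.columns]
        df = df.drop(columns=col).join(parts)
```

After the loop, LCV and RF are replaced by LCV-a/LCV-b and RF-a/RF-b, while non-tuple columns like DATA are left alone.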

Answer by denfromufa

On much larger datasets, I found that .apply() is a few orders of magnitude slower than pd.DataFrame(df['b'].values.tolist(), index=df.index).


This performance issue was closed in GitHub, although I do not agree with this decision:


https://github.com/pandas-dev/pandas/issues/11615


EDIT: based on this answer: https://stackoverflow.com/a/44196843/2230844

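A rough way to reproduce the gap is a sketch using time.perf_counter instead of %timeit; the row count here is an arbitrary assumption, and exact ratios will vary by machine and pandas version.

```python
import time

import pandas as pd

df = pd.DataFrame({'b': [(1, 2)] * 10_000})

start = time.perf_counter()
fast = pd.DataFrame(df['b'].values.tolist(), index=df.index)
t_tolist = time.perf_counter() - start

start = time.perf_counter()
slow = df['b'].apply(pd.Series)
t_apply = time.perf_counter() - start

# Both produce the same frame; tolist is typically orders of magnitude faster
print(f'tolist: {t_tolist:.4f}s  apply: {t_apply:.4f}s')
```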

Answer by Mike

I know this is from a while ago, but a caveat of the second solution:


pd.DataFrame(df['b'].values.tolist())

is that it will explicitly discard the index, and add in a default sequential index, whereas the accepted answer


apply(pd.Series)

will not, since the result of apply will retain the row index. While the order is initially retained from the original array, pandas will try to match the indices from the two dataframes.


This can be very important if you are trying to set the rows into a numerically indexed array: pandas will automatically try to match the index of the new array to the old, causing some distortion in the ordering.


A better hybrid solution would be to set the index of the original dataframe onto the new, i.e.


pd.DataFrame(df['b'].values.tolist(), index=df.index)

Which will retain the speed of using the second method while ensuring the order and indexing is retained on the result.

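A minimal sketch of this caveat (the index values 10 and 20 are chosen arbitrarily): without index=df.index, the helper frame gets a fresh RangeIndex, and assignment then aligns on mismatched labels.

```python
import pandas as pd

df = pd.DataFrame({'b': [(1, 2), (3, 4)]}, index=[10, 20])

bad = pd.DataFrame(df['b'].values.tolist())                   # index 0, 1
good = pd.DataFrame(df['b'].values.tolist(), index=df.index)  # index 10, 20

# Assigning `bad` would align on the mismatched labels and yield all-NaN columns;
# `good` shares df's index, so the values land where expected.
df[['b1', 'b2']] = good
```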

Answer by Jinhua Wang

I think a simpler way is:


>>> import pandas as pd
>>> df = pd.DataFrame({'a':[1,2], 'b':[(1,2), (3,4)]}) 
>>> df
   a       b
0  1  (1, 2)
1  2  (3, 4)
>>> df['b_a']=df['b'].str[0]
>>> df['b_b']=df['b'].str[1]
>>> df
   a       b  b_a  b_b
0  1  (1, 2)    1    2
1  2  (3, 4)    3    4
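The .str positional indexing works here because the tuples themselves support indexing, so the same pattern extends to longer tuples. A small sketch with made-up three-element data:

```python
import pandas as pd

df = pd.DataFrame({'b': [(1, 2, 3), (4, 5, 6)]})

# .str[i] indexes into each element, whether it is a string or a tuple
for i, suffix in enumerate('abc'):
    df[f'b_{suffix}'] = df['b'].str[i]
```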

Answer by piRSquared

The str accessor that is available on pandas.Series objects of dtype == object is actually an iterable.


Assume a pandas.DataFrame df:


df = pd.DataFrame(dict(col=[*zip('abcdefghij', range(10, 101, 10))]))

df

        col
0   (a, 10)
1   (b, 20)
2   (c, 30)
3   (d, 40)
4   (e, 50)
5   (f, 60)
6   (g, 70)
7   (h, 80)
8   (i, 90)
9  (j, 100)

We can test if it is an iterable


from collections.abc import Iterable  # collections.Iterable was removed in Python 3.10

isinstance(df.col.str, Iterable)

True

We can then assign from it like we do other iterables:


var0, var1 = 'xy'
print(var0, var1)

x y

Simplest solution


So in one line we can assign both columns


df['a'], df['b'] = df.col.str

df

        col  a    b
0   (a, 10)  a   10
1   (b, 20)  b   20
2   (c, 30)  c   30
3   (d, 40)  d   40
4   (e, 50)  e   50
5   (f, 60)  f   60
6   (g, 70)  g   70
7   (h, 80)  h   80
8   (i, 90)  i   90
9  (j, 100)  j  100


Faster solution


Only slightly more complicated, we can use zip to create a similar iterable


df['c'], df['d'] = zip(*df.col)

df

        col  a    b  c    d
0   (a, 10)  a   10  a   10
1   (b, 20)  b   20  b   20
2   (c, 30)  c   30  c   30
3   (d, 40)  d   40  d   40
4   (e, 50)  e   50  e   50
5   (f, 60)  f   60  f   60
6   (g, 70)  g   70  g   70
7   (h, 80)  h   80  h   80
8   (i, 90)  i   90  i   90
9  (j, 100)  j  100  j  100


Inline


Meaning, don't mutate the existing df.
This works because assign takes keyword arguments where the keywords are the new (or existing) column names and the values will be the values of the new column. You can use a dictionary, unpack it with **, and have it act as the keyword arguments. So this is a clever way of assigning new columns named 'g' and 'h', which are the first and second items of the df.col.str iterable, respectively.


df.assign(**dict(zip('gh', df.col.str)))

        col  g    h
0   (a, 10)  a   10
1   (b, 20)  b   20
2   (c, 30)  c   30
3   (d, 40)  d   40
4   (e, 50)  e   50
5   (f, 60)  f   60
6   (g, 70)  g   70
7   (h, 80)  h   80
8   (i, 90)  i   90
9  (j, 100)  j  100


My version of the list approach


With a modern list comprehension and variable unpacking.
Note: also inline, using join


df.join(pd.DataFrame([*df.col], df.index, [*'ef']))

        col  e    f
0   (a, 10)  a   10
1   (b, 20)  b   20
2   (c, 30)  c   30
3   (d, 40)  d   40
4   (e, 50)  e   50
5   (f, 60)  f   60
6   (g, 70)  g   70
7   (h, 80)  h   80
8   (i, 90)  i   90
9  (j, 100)  j  100

The mutating version would be

变异版本将是

df[['e', 'f']] = pd.DataFrame([*df.col], df.index)


Naive Time Test


Short dataframe

Use one defined above


%timeit df.assign(**dict(zip('gh', df.col.str)))
%timeit df.assign(**dict(zip('gh', zip(*df.col))))
%timeit df.join(pd.DataFrame([*df.col], df.index, [*'gh']))

1.16 ms ± 21.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
635 μs ± 18.7 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
795 μs ± 42.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Long dataframe

10^3 times bigger


df = pd.concat([df] * 1000, ignore_index=True)

%timeit df.assign(**dict(zip('gh', df.col.str)))
%timeit df.assign(**dict(zip('gh', zip(*df.col))))
%timeit df.join(pd.DataFrame([*df.col], df.index, [*'gh']))

11.4 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 41.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.33 ms ± 35.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)