How to split a column of tuples in a pandas DataFrame?
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
原文地址 (original): http://stackoverflow.com/questions/29550414/
how to split column of tuples in pandas dataframe?
Asked by Donbeo
I have a pandas dataframe (this is only a small piece of it):
>>> d1
y norm test y norm train len(y_train) len(y_test) \
0 64.904368 116.151232 1645 549
1 70.852681 112.639876 1645 549
SVR RBF \
0 (35.652207342877873, 22.95533537448393)
1 (39.563683797747622, 27.382483096332511)
LCV \
0 (19.365430594452338, 13.880062435173587)
1 (19.099614489458364, 14.018867136617146)
RIDGE CV \
0 (4.2907610988480362, 12.416745648065584)
1 (4.18864306788194, 12.980833914392477)
RF \
0 (9.9484841581029428, 16.46902345373697)
1 (10.139848213735391, 16.282141345406522)
GB \
0 (0.012816232716538605, 15.950164822266007)
1 (0.012814519804493328, 15.305745202851712)
ET DATA
0 (0.00034337162272515505, 16.284800366214057) j2m
1 (0.00024811554516431878, 15.556506191784194) j2m
>>>
I want to split all the columns that contain tuples. For example, I want to replace the column LCV with the columns LCV-a and LCV-b.
How can I do that?
Accepted answer by joris
You can do this by calling pd.DataFrame(col.tolist()) on that column:
In [2]: df = pd.DataFrame({'a':[1,2], 'b':[(1,2), (3,4)]})
In [3]: df
Out[3]:
a b
0 1 (1, 2)
1 2 (3, 4)
In [4]: df['b'].tolist()
Out[4]: [(1, 2), (3, 4)]
In [5]: pd.DataFrame(df['b'].tolist(), index=df.index)
Out[5]:
0 1
0 1 2
1 3 4
In [6]: df[['b1', 'b2']] = pd.DataFrame(df['b'].tolist(), index=df.index)
In [7]: df
Out[7]:
a b b1 b2
0 1 (1, 2) 1 2
1 2 (3, 4) 3 4
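Applied to the dataframe from the question, the same pattern splits every tuple column in a loop (a sketch: the numbers below are made-up stand-ins for the d1 values shown above, and the LCV-a / LCV-b naming follows the question's request):

```python
import pandas as pd

# Stand-in for the question's d1, with two of its tuple columns
d1 = pd.DataFrame({'LCV': [(19.37, 13.88), (19.10, 14.02)],
                   'RF': [(9.95, 16.47), (10.14, 16.28)]})

# Split each tuple column into '<name>-a' / '<name>-b', then drop the original
for col in ['LCV', 'RF']:
    d1[[col + '-a', col + '-b']] = pd.DataFrame(d1[col].tolist(), index=d1.index)
    d1 = d1.drop(columns=col)

print(d1.columns.tolist())  # ['LCV-a', 'LCV-b', 'RF-a', 'RF-b']
```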
Note: an earlier version of this answer recommended using df['b'].apply(pd.Series) instead of pd.DataFrame(df['b'].tolist(), index=df.index). That works as well (because it turns each tuple into a Series, which is then treated as a row of a dataframe), but it is slower and uses more memory than the tolist version, as noted in the other answers here (thanks to @denfromufa). I updated this answer to make sure the most visible answer has the best solution.
Answered by denfromufa
On much larger datasets, I found that .apply() is a few orders of magnitude slower than pd.DataFrame(df['b'].values.tolist(), index=df.index).
This performance issue was closed on GitHub, although I do not agree with that decision:
https://github.com/pandas-dev/pandas/issues/11615
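A minimal benchmark sketch of that difference (the dataframe here is synthetic; absolute timings depend on machine and pandas version, only the relative gap matters):

```python
import timeit

import pandas as pd

# Synthetic column of 20,000 tuples
df = pd.DataFrame({'b': [(i, i + 1) for i in range(20_000)]})

# Time one run of each approach
t_apply = timeit.timeit(lambda: df['b'].apply(pd.Series), number=1)
t_tolist = timeit.timeit(
    lambda: pd.DataFrame(df['b'].values.tolist(), index=df.index), number=1)

print(f"apply:  {t_apply:.4f}s")
print(f"tolist: {t_tolist:.4f}s")  # typically far faster than apply
```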
EDIT: based on this answer: https://stackoverflow.com/a/44196843/2230844
Answered by Mike
I know this is from a while ago, but a caveat of the second solution,
pd.DataFrame(df['b'].values.tolist())
is that it will explicitly discard the index and add a default sequential index, whereas the accepted answer,
apply(pd.Series)
will not, since the result of apply retains the row index. While the order is initially preserved from the original array, pandas will try to match the indices of the two dataframes.
This can be very important if you are trying to set the rows into a numerically indexed array: pandas will automatically try to match the index of the new array to the old, which can distort the ordering.
A better hybrid solution is to set the index of the original dataframe onto the new one, i.e.
pd.DataFrame(df['b'].values.tolist(), index=df.index)
which retains the speed of the second method while ensuring that order and indexing are preserved in the result.
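A small sketch of that pitfall (a hypothetical two-row frame with a non-default index):

```python
import pandas as pd

df = pd.DataFrame({'b': [(1, 2), (3, 4)]}, index=[10, 20])

# Without index=..., the new frame gets a default 0..n-1 index,
# so assignment aligns 0/1 against 10/20 and produces all NaN
df[['x', 'y']] = pd.DataFrame(df['b'].values.tolist())
print(df[['x', 'y']].isna().all().all())  # True

# Passing the original index keeps the rows aligned
df[['x', 'y']] = pd.DataFrame(df['b'].values.tolist(), index=df.index)
print(df.loc[10, 'x'], df.loc[20, 'y'])  # 1 4
```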
Answered by Jinhua Wang
I think a simpler way is:
>>> import pandas as pd
>>> df = pd.DataFrame({'a':[1,2], 'b':[(1,2), (3,4)]})
>>> df
a b
0 1 (1, 2)
1 2 (3, 4)
>>> df['b_a']=df['b'].str[0]
>>> df['b_b']=df['b'].str[1]
>>> df
a b b_a b_b
0 1 (1, 2) 1 2
1 2 (3, 4) 3 4
Answered by piRSquared
The str accessor that is available to pandas.Series objects of dtype == object is actually an iterable.
Assume a pandas.DataFrame df:
df = pd.DataFrame(dict(col=[*zip('abcdefghij', range(10, 101, 10))]))
df
col
0 (a, 10)
1 (b, 20)
2 (c, 30)
3 (d, 40)
4 (e, 50)
5 (f, 60)
6 (g, 70)
7 (h, 80)
8 (i, 90)
9 (j, 100)
We can test whether it is an iterable:
from collections.abc import Iterable  # 'from collections import Iterable' was removed in Python 3.10
isinstance(df.col.str, Iterable)
True
We can then assign from it like we do with other iterables:
var0, var1 = 'xy'
print(var0, var1)
x y
Simplest solution
So in one line we can assign both columns:
df['a'], df['b'] = df.col.str
df
col a b
0 (a, 10) a 10
1 (b, 20) b 20
2 (c, 30) c 30
3 (d, 40) d 40
4 (e, 50) e 50
5 (f, 60) f 60
6 (g, 70) g 70
7 (h, 80) h 80
8 (i, 90) i 90
9 (j, 100) j 100
Faster solution
Only slightly more complicated, we can use zip to create a similar iterable:
df['c'], df['d'] = zip(*df.col)
df
col a b c d
0 (a, 10) a 10 a 10
1 (b, 20) b 20 b 20
2 (c, 30) c 30 c 30
3 (d, 40) d 40 d 40
4 (e, 50) e 50 e 50
5 (f, 60) f 60 f 60
6 (g, 70) g 70 g 70
7 (h, 80) h 80 h 80
8 (i, 90) i 90 i 90
9 (j, 100) j 100 j 100
Inline
Meaning, don't mutate the existing df.
This works because assign takes keyword arguments, where the keywords are the new (or existing) column names and the values are the values of the new column. You can use a dictionary, unpack it with **, and have it act as the keyword arguments. So this is a clever way of assigning a new column named 'g' that is the first item in the df.col.str iterable and 'h' that is the second item in the df.col.str iterable.
df.assign(**dict(zip('gh', df.col.str)))
col g h
0 (a, 10) a 10
1 (b, 20) b 20
2 (c, 30) c 30
3 (d, 40) d 40
4 (e, 50) e 50
5 (f, 60) f 60
6 (g, 70) g 70
7 (h, 80) h 80
8 (i, 90) i 90
9 (j, 100) j 100
My version of the list approach
With a modern list comprehension and variable unpacking.
Note: also inline, using join.
df.join(pd.DataFrame([*df.col], df.index, [*'ef']))
col e f
0 (a, 10) a 10
1 (b, 20) b 20
2 (c, 30) c 30
3 (d, 40) d 40
4 (e, 50) e 50
5 (f, 60) f 60
6 (g, 70) g 70
7 (h, 80) h 80
8 (i, 90) i 90
9 (j, 100) j 100
The mutating version would be
df[['e', 'f']] = pd.DataFrame([*df.col], df.index)
Naive Time Test
Short dataframe
Use the one defined above
%timeit df.assign(**dict(zip('gh', df.col.str)))
%timeit df.assign(**dict(zip('gh', zip(*df.col))))
%timeit df.join(pd.DataFrame([*df.col], df.index, [*'gh']))
1.16 ms ± 21.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
635 μs ± 18.7 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
795 μs ± 42.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Long dataframe
10^3 times bigger
df = pd.concat([df] * 1000, ignore_index=True)
%timeit df.assign(**dict(zip('gh', df.col.str)))
%timeit df.assign(**dict(zip('gh', zip(*df.col))))
%timeit df.join(pd.DataFrame([*df.col], df.index, [*'gh']))
11.4 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 41.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.33 ms ± 35.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)