How to unpack a Series of tuples in Pandas?
Original question: http://stackoverflow.com/questions/22799300/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by mwaskom
Sometimes I end up with a series of tuples/lists when using Pandas. This is common when, for example, doing a group-by and passing a function that has multiple return values:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(dict(x=np.random.randn(100),
                       y=np.repeat(list("abcd"), 25)))
out = df.groupby("y").x.apply(stats.ttest_1samp, 0)
print(out)
y
a (1.3066417476, 0.203717485506)
b (0.0801133382517, 0.936811414675)
c (1.55784329113, 0.132360504653)
d (0.267999459642, 0.790989680709)
dtype: object
What is the correct way to "unpack" this structure so that I get a DataFrame with two columns?
A related question is how I can unpack either this structure or the resulting dataframe into two Series/array objects. This almost works:
t, p = zip(*out)
but then t is
(array(1.3066417475999257),
array(0.08011333825171714),
array(1.557843291126335),
array(0.267999459641651))
and one needs to take the extra step of squeezing it.
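One way to skip the squeezing step is to stack the tuples into a single 2-D float array first and unpack its transpose. The following is only a sketch: the Series below is a hand-built stand-in for `out` (values copied from the printed output above, plain floats rather than 0-d arrays), and `dtype=float` is an assumption about what the tuples contain.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `out`: a Series of (statistic, p-value) tuples
out = pd.Series(
    [(1.3066417476, 0.203717485506),
     (0.0801133382517, 0.936811414675),
     (1.55784329113, 0.132360504653),
     (0.267999459642, 0.790989680709)],
    index=pd.Index(list("abcd"), name="y"),
)

# Stack the tuples into a (4, 2) float array, then transpose to (2, 4)
# so that unpacking yields one flat 1-D array per tuple position.
t, p = np.array(out.tolist(), dtype=float).T

print(t.shape, p.shape)
```

Both `t` and `p` come out as plain 1-D arrays, so no extra squeeze is needed afterwards.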
Accepted answer by Siraj S.
Maybe this is the most straightforward (and, I guess, the most Pythonic):
out = out.apply(pd.Series)
If you want to rename the columns to something more meaningful, then:
out.columns=['Kstats','Pvalue']
If you do not want the default name for the index:
out.index.name=None
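Put together, the steps in this answer might look like the sketch below. Note that `apply(pd.Series)` returns a new DataFrame rather than modifying `out`, so the result has to be assigned before renaming. The input Series is a hand-built stand-in for the group-by result, and the names `unpacked`, `statistic`, and `pvalue` are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for the group-by result: a Series of 2-tuples
out = pd.Series(
    [(1.3066417476, 0.203717485506),
     (0.0801133382517, 0.936811414675),
     (1.55784329113, 0.132360504653),
     (0.267999459642, 0.790989680709)],
    index=pd.Index(list("abcd"), name="y"),
)

# Each tuple becomes one row; the tuple positions become columns 0 and 1
unpacked = out.apply(pd.Series)

# Rename the columns and drop the index name
unpacked.columns = ["statistic", "pvalue"]
unpacked.index.name = None

print(unpacked)
```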
Answer by behzad.nouri
maybe:
>>> pd.DataFrame(out.tolist(), columns=['out-1','out-2'], index=out.index)
                   out-1     out-2
y
a    -1.9153853424536496  0.067433
b      1.277561889173181  0.213624
c   0.062021492729736116  0.951059
d     0.3036745009819999  0.763993

[4 rows x 2 columns]
Answer by CT Zhu
I believe you want this:
df=pd.DataFrame(out.tolist())
df.columns=['KS-stat', 'P-value']
result:
     KS-stat         P-value
0   -2.12978778869   0.043643
1    3.50655433879   0.001813
2   -1.2221274198    0.233527
3   -0.977154419818  0.338240
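The result above has a fresh 0..3 index, so the group labels are lost. A possible variant, sketched here under the assumption that the labels should be kept, passes `index=out.index` and then pulls each column out as its own Series, which also covers the second half of the question. The input Series and the column names `statistic` and `pvalue` are illustrative stand-ins.

```python
import pandas as pd

# Hypothetical stand-in for the group-by result: a Series of 2-tuples
out = pd.Series(
    [(1.3066417476, 0.203717485506),
     (0.0801133382517, 0.936811414675),
     (1.55784329113, 0.132360504653),
     (0.267999459642, 0.790989680709)],
    index=pd.Index(list("abcd"), name="y"),
)

# Build the DataFrame from the list of tuples, reusing the original index
df = pd.DataFrame(out.tolist(), columns=["statistic", "pvalue"], index=out.index)

# Each column is a float Series indexed by group label
t, p = df["statistic"], df["pvalue"]

print(df)
```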
Answer by Jeremy Z
I ran into a similar problem. The two ways I found to solve it are exactly the answers of @CT ZHU and @Siraj S.
Here is some supplementary information you might be interested in: I compared the two approaches and found that @CT ZHU's performs much faster as the size of the input grows.
Example:
# Python 3
import time
from statistics import mean

import pandas as pd

df_a = pd.DataFrame({'a': range(1000), 'b': range(1000)})

# function to test
def func1(x):
    c = str(x) * 3
    d = int(x) + 100
    return c, d

# Siraj S's way
time_difference = []
for i in range(100):
    start = time.time()
    df_b = df_a['b'].apply(lambda x: func1(x)).apply(pd.Series)
    end = time.time()
    time_difference.append(end - start)

print(mean(time_difference))
# 0.14907703161239624

# CT ZHU's way
time_difference = []
for i in range(100):
    start = time.time()
    df_b = pd.DataFrame(df_a['b'].apply(lambda x: func1(x)).tolist())
    end = time.time()
    time_difference.append(end - start)

print(mean(time_difference))
# 0.0014058423042297363
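The same comparison can be sketched with the standard-library `timeit` module. This version differs from the benchmark above in one respect: the tuples are computed once up front, so only the unpacking step is timed. The helper names `via_apply_series` and `via_tolist` and the repeat count of 5 are illustrative choices, and the absolute timings will vary by machine and pandas version.

```python
import timeit

import pandas as pd


def func1(x):
    # Same test function as above: returns a (str, int) tuple
    return str(x) * 3, int(x) + 100


# Build the Series of tuples once so that only the unpacking is timed
tuples = pd.Series(range(1000)).apply(func1)


def via_apply_series():
    # Siraj S.'s approach: one pd.Series constructed per row
    return tuples.apply(pd.Series)


def via_tolist():
    # CT Zhu's approach: a single DataFrame built from a list of tuples
    return pd.DataFrame(tuples.tolist(), index=tuples.index)


runs = 5
t_apply = timeit.timeit(via_apply_series, number=runs) / runs
t_tolist = timeit.timeit(via_tolist, number=runs) / runs

print(f"apply(pd.Series): {t_apply:.5f} s per call")
print(f"tolist():         {t_tolist:.5f} s per call")
```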
PS: Please forgive my ugly code.