Original question: http://stackoverflow.com/questions/18589821/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Create a Series from a Pandas DataFrame by choosing an element from different columns on each row
Asked by Brian
My goal is to create a Series from a Pandas DataFrame by choosing an element from different columns on each row.
For example, I have the following DataFrame:
In [171]: pred[:10]
Out[171]:
                     0  1  2
Timestamp
2010-12-21 00:00:00  0  0  1
2010-12-20 00:00:00  1  1  1
2010-12-17 00:00:00  1  1  1
2010-12-16 00:00:00  0  0  1
2010-12-15 00:00:00  1  1  1
2010-12-14 00:00:00  1  1  1
2010-12-13 00:00:00  0  0  1
2010-12-10 00:00:00  1  1  1
2010-12-09 00:00:00  1  1  1
2010-12-08 00:00:00  0  0  1
And, I have the following series:
In [172]: useProb[:10]
Out[172]:
Timestamp
2010-12-21 00:00:00 1
2010-12-20 00:00:00 2
2010-12-17 00:00:00 1
2010-12-16 00:00:00 2
2010-12-15 00:00:00 2
2010-12-14 00:00:00 2
2010-12-13 00:00:00 0
2010-12-10 00:00:00 2
2010-12-09 00:00:00 2
2010-12-08 00:00:00 0
I would like to create a new series, usePred, that takes the values from pred, based on the column information in useProb to return the following:
In [172]: usePred[:10]
Out[172]:
Timestamp
2010-12-21 00:00:00 0
2010-12-20 00:00:00 1
2010-12-17 00:00:00 1
2010-12-16 00:00:00 1
2010-12-15 00:00:00 1
2010-12-14 00:00:00 1
2010-12-13 00:00:00 0
2010-12-10 00:00:00 1
2010-12-09 00:00:00 1
2010-12-08 00:00:00 0
This last step is where I fail. I've tried things like:
usePred = pd.DataFrame(index = pred.index)
for row in usePred:
    usePred['PREDS'].ix[row] = pred.ix[row, useProb[row]]
And, I've tried:
usePred['PREDS'] = pred.iloc[:,useProb]
I googled and searched Stack Overflow for hours, but can't seem to solve the problem.
Answered by Andy Hayden
One solution could be to use get_dummies (which should be more efficient than apply):
In [11]: (pd.get_dummies(useProb) * pred).sum(axis=1)
Out[11]:
Timestamp
2010-12-21 00:00:00 0
2010-12-20 00:00:00 1
2010-12-17 00:00:00 1
2010-12-16 00:00:00 1
2010-12-15 00:00:00 1
2010-12-14 00:00:00 1
2010-12-13 00:00:00 0
2010-12-10 00:00:00 1
2010-12-09 00:00:00 1
2010-12-08 00:00:00 0
dtype: float64
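To make the mechanics of the get_dummies approach concrete, here is a minimal self-contained sketch. The small pred and useProb below are hypothetical stand-ins built from the first four rows of the question's data, and the reindex step is an addition (not part of the answer above) that covers the case where some column is never selected by any row.

```python
import pandas as pd

# Hypothetical stand-ins for the question's pred and useProb (first four rows).
idx = pd.DatetimeIndex(["2010-12-21", "2010-12-20", "2010-12-17", "2010-12-16"],
                       name="Timestamp")
pred = pd.DataFrame([[0, 0, 1], [1, 1, 1], [1, 1, 1], [0, 0, 1]],
                    index=idx, columns=[0, 1, 2])
useProb = pd.Series([1, 2, 1, 2], index=idx)

# One-hot encode the chosen column for each row. Column 0 is never chosen
# here, so get_dummies omits it; reindex restores it as an all-zero column.
mask = pd.get_dummies(useProb, dtype=int).reindex(columns=pred.columns,
                                                  fill_value=0)

# Multiplying zeroes out every unselected cell, and the row sum keeps the pick.
usePred = (mask * pred).sum(axis=1)
print(usePred)
```

The multiplication aligns on both index and column labels, so this only works as intended when the values in useProb are the same labels as pred's columns.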
You could use an apply with a couple of locs:
In [21]: pred.apply(lambda row: row.loc[useProb.loc[row.name]], axis=1)
Out[21]:
Timestamp
2010-12-21 00:00:00 0
2010-12-20 00:00:00 1
2010-12-17 00:00:00 1
2010-12-16 00:00:00 1
2010-12-15 00:00:00 1
2010-12-14 00:00:00 1
2010-12-13 00:00:00 0
2010-12-10 00:00:00 1
2010-12-09 00:00:00 1
2010-12-08 00:00:00 0
dtype: int64
The trick is that you can access each row's index label via its name attribute.
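As a rough illustration of the name trick, the sketch below spells the lambda out as a named function. The data is hypothetical (distinct values chosen so that the selected column is visible in the result), and pick is an illustrative helper name, not something from the answer.

```python
import pandas as pd

# Hypothetical data: each cell encodes its own row and column.
idx = pd.DatetimeIndex(["2010-12-21", "2010-12-20", "2010-12-17"],
                       name="Timestamp")
pred = pd.DataFrame([[10, 11, 12], [20, 21, 22], [30, 31, 32]],
                    index=idx, columns=[0, 1, 2])
useProb = pd.Series([1, 2, 0], index=idx)

def pick(row):
    # With axis=1, apply passes each row as a Series whose .name is the
    # row's index label, which lets us look up the matching useProb entry.
    col = useProb.loc[row.name]
    return row.loc[col]

usePred = pred.apply(pick, axis=1)
print(usePred)
```

This calls a Python function once per row, so it is likely to be slower than a vectorized approach on large frames.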
Answered by unutbu
Here is another way to do it using DataFrame.lookup:
pred.lookup(row_labels=pred.index,
            col_labels=pred.columns[useProb['0']])
It seems to be exactly what you need, except that care must be taken to supply values which are labels. For example, if pred.columns are strings and the useProb['0'] values are integers, then we could use
pred.columns[useProb['0']]
so that the values passed to the col_labels parameter are proper label values.
For example,
import io
import pandas as pd

# Columns are separated by two or more spaces; the timestamps contain a single space.
content = io.StringIO('''\
Timestamp            0  1  2
2010-12-21 00:00:00  0  0  1
2010-12-20 00:00:00  1  1  1
2010-12-17 00:00:00  1  1  1
2010-12-16 00:00:00  0  0  1
2010-12-15 00:00:00  1  1  1
2010-12-14 00:00:00  1  1  1
2010-12-13 00:00:00  0  0  1
2010-12-10 00:00:00  1  1  1
2010-12-09 00:00:00  1  1  1
2010-12-08 00:00:00  0  0  1''')
# A regex separator requires the Python parsing engine.
pred = pd.read_table(content, sep=r'\s{2,}', engine='python',
                     parse_dates=True, index_col=[0])
content = io.StringIO('''\
Timestamp            0
2010-12-21 00:00:00  1
2010-12-20 00:00:00  2
2010-12-17 00:00:00  1
2010-12-16 00:00:00  2
2010-12-15 00:00:00  2
2010-12-14 00:00:00  2
2010-12-13 00:00:00  0
2010-12-10 00:00:00  2
2010-12-09 00:00:00  2
2010-12-08 00:00:00  0''')
useProb = pd.read_table(content, sep=r'\s{2,}', engine='python',
                        parse_dates=True, index_col=[0])
# DataFrame.lookup was removed in pandas 2.0, so this call needs an older version.
print(pd.Series(pred.lookup(row_labels=pred.index,
                            col_labels=pred.columns[useProb['0']]),
                index=pred.index))
yields
Timestamp
2010-12-21 0
2010-12-20 1
2010-12-17 1
2010-12-16 1
2010-12-15 1
2010-12-14 1
2010-12-13 0
2010-12-10 1
2010-12-09 1
2010-12-08 0
dtype: int64
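DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so on current versions the same row-and-column-label selection can be sketched with NumPy integer indexing instead. This is a substitute technique rather than the call used in the answer; the data below is hypothetical and mirrors the string-column, integer-value situation described above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: string column labels, integer selectors in useProb['0'].
idx = pd.DatetimeIndex(["2010-12-21", "2010-12-20", "2010-12-17"],
                       name="Timestamp")
pred = pd.DataFrame([[10, 11, 12], [20, 21, 22], [30, 31, 32]],
                    index=idx, columns=["0", "1", "2"])
useProb = pd.DataFrame({"0": [1, 2, 0]}, index=idx)

# Turn the integer selectors into column labels, as in the answer above,
# then translate those labels into integer positions within pred.columns.
col_labels = pred.columns[useProb["0"].to_numpy()]
col_pos = pred.columns.get_indexer(col_labels)
row_pos = np.arange(len(pred))

# Pairwise (row, column) selection on the underlying array.
usePred = pd.Series(pred.to_numpy()[row_pos, col_pos], index=pred.index)
print(usePred)
```

This assumes useProb and pred share the same row order; if they might not, aligning useProb with useProb.reindex(pred.index) first would be a sensible precaution.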

