Notice: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/41337477/
Select non-null rows from a specific column in a DataFrame and take a sub-selection of other columns
Asked by EdChum
I have a DataFrame with several columns, and I selected some of them to create a variable like this:
xtrain = df[['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]
From these columns I want to drop all rows for which the Survive column in the main DataFrame is NaN.
Answered by EdChum
You can build a boolean mask from notnull() on the 'Survive' column, pass it to your df, and select the columns of interest:
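In [2]:
import numpy as np
import pandas as pd

# make some data
df = pd.DataFrame(np.random.randn(5,7), columns=['Survive', 'Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title'])
# set one 'Survive' value to NaN with a single .loc assignment
df.loc[2, 'Survive'] = np.nan
df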
Out[2]:
    Survive       Age      Fare  Group_Size      deck    Pclass     Title
0  1.174206 -0.056846  0.454437    0.496695  1.401509 -2.078731 -1.024832
1  0.036843  1.060134  0.770625   -0.114912  0.118991 -0.317909  0.061022
2       NaN -0.132394 -0.236904   -0.324087  0.570660  0.758084 -0.176421
3 -2.145934 -0.020003 -0.777785    0.835467  1.498284 -1.371325  0.661991
4 -0.197144 -0.089806 -0.706548    1.621260  1.754292  0.725897  0.860482
Now pass the mask to loc to take only the non-NaN rows:
In [3]:
xtrain = df.loc[df['Survive'].notnull(), ['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]
xtrain
Out[3]:
        Age      Fare  Group_Size      deck    Pclass     Title
0 -0.056846  0.454437    0.496695  1.401509 -2.078731 -1.024832
1  1.060134  0.770625   -0.114912  0.118991 -0.317909  0.061022
3 -0.020003 -0.777785    0.835467  1.498284 -1.371325  0.661991
4 -0.089806 -0.706548    1.621260  1.754292  0.725897  0.860482
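Because the data above is random, the exact numbers differ on every run. As a reproducible sketch of the same mask-then-select idea, the snippet below uses a small hand-written frame; the values and the reduced column set are made up for illustration and are not from the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "Survive": [1.0, 0.0, np.nan, 1.0],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Boolean Series: True where 'Survive' has a value, False where it is NaN.
mask = df["Survive"].notnull()

# A single .loc call filters rows by the mask and selects the columns of interest.
xtrain = df.loc[mask, ["Age", "Fare"]]
print(xtrain)
```

The row whose Survive value is NaN (index 2 here) is left out, and the remaining rows keep their original index labels.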
Answered by piRSquared
Two alternatives because... well, why not?
Both drop the NaN rows prior to column slicing. That's two calls rather than EdChum's one.
one
df.dropna(subset=['Survive'])[
['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]
two
df.query('Survive == Survive')[
['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]
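The query version relies on the fact that NaN never compares equal to itself, so 'Survive == Survive' is False exactly on the rows where Survive is missing. A minimal check, assuming a small made-up frame (the values and the shortened column list are illustrative only), suggests that both alternatives give the same result as the boolean mask passed to loc:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "Survive": [1.0, np.nan, 0.0],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})
cols = ["Age", "Fare"]

# EdChum's approach: one .loc call with a row mask and a column list.
via_mask = df.loc[df["Survive"].notnull(), cols]

# Alternative one: drop rows with NaN in 'Survive', then slice the columns.
via_dropna = df.dropna(subset=["Survive"])[cols]

# Alternative two: NaN != NaN, so this keeps only rows where 'Survive' is set.
via_query = df.query("Survive == Survive")[cols]

print(via_mask.equals(via_dropna), via_mask.equals(via_query))
```

Which form to use is mostly a matter of taste: the loc version does the row filter and column selection in one indexing step, while dropna states the intent most directly.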