Python: select non-null rows from a specific column in a DataFrame and take a sub-selection of other columns

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/41337477/

Time: 2020-08-20 00:50:02  Source: igfitidea

Select non-null rows from a specific column in a DataFrame and take a sub-selection of other columns

python pandas

Asked by EdChum

I have a DataFrame with several columns, and I chose some of them to create a variable like this: xtrain = df[['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]. From these columns I want to drop all rows where the Survive column in the main DataFrame is NaN.

Answered by EdChum

You can pass a boolean mask to your df based on notnull() of the 'Survive' column and select the columns of interest:

In [2]:
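# make some data
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,7), columns= ['Survive', 'Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ])
# set one 'Survive' value to NaN (a single .loc assignment avoids chained indexing)
df.loc[2, 'Survive'] = np.nan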
df
Out[2]:
    Survive       Age      Fare  Group_Size      deck    Pclass     Title
0  1.174206 -0.056846  0.454437    0.496695  1.401509 -2.078731 -1.024832
1  0.036843  1.060134  0.770625   -0.114912  0.118991 -0.317909  0.061022
2       NaN -0.132394 -0.236904   -0.324087  0.570660  0.758084 -0.176421
3 -2.145934 -0.020003 -0.777785    0.835467  1.498284 -1.371325  0.661991
4 -0.197144 -0.089806 -0.706548    1.621260  1.754292  0.725897  0.860482

Now pass a mask to loc to take only the non-NaN rows:

In [3]:
xtrain = df.loc[df['Survive'].notnull(), ['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]
xtrain

Out[3]:
        Age      Fare  Group_Size      deck    Pclass     Title
0 -0.056846  0.454437    0.496695  1.401509 -2.078731 -1.024832
1  1.060134  0.770625   -0.114912  0.118991 -0.317909  0.061022
3 -0.020003 -0.777785    0.835467  1.498284 -1.371325  0.661991
4 -0.089806 -0.706548    1.621260  1.754292  0.725897  0.860482
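
To make the mask step concrete, here is a minimal, self-contained sketch on a tiny hand-made frame. The values and the shortened column list are made up for illustration and are not the asker's data; notna() is used here, which is an alias of notnull().

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the asker's dataset
df = pd.DataFrame({
    'Survive': [1.0, 0.0, np.nan, 1.0],
    'Age': [22, 38, 26, 35],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# Boolean Series: True where 'Survive' is present, False where it is NaN
mask = df['Survive'].notna()
print(mask.tolist())  # [True, True, False, True]

# Row selection (mask) and column selection (list) in a single .loc call
xtrain = df.loc[mask, ['Age', 'Fare']]
print(xtrain)
```

The row with index 2 is dropped, and the remaining rows keep their original index labels (0, 1, 3), just as in the output above.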

Answered by piRSquared

Two alternatives because... well, why not?
Both drop NaN prior to column slicing. That's two calls rather than EdChum's one call.

one

df.dropna(subset=['Survive'])[
    ['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]

two

df.query('Survive == Survive')[
    ['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]
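
The query variant relies on the fact that NaN never compares equal to itself, so Survive == Survive is False exactly on the rows where Survive is missing. Below is a small sketch, on made-up data with a shortened column list (both are assumptions for illustration), that checks whether the two alternatives select the same rows as the loc mask:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the asker's dataset
df = pd.DataFrame({
    'Survive': [1.0, np.nan, 0.0],
    'Age': [22, 38, 26],
    'Fare': [7.25, 71.28, 7.92],
})
cols = ['Age', 'Fare']

# NaN is not equal to itself, which is what the query expression exploits
print(np.nan == np.nan)  # False

via_loc = df.loc[df['Survive'].notna(), cols]      # mask and columns in one call
via_dropna = df.dropna(subset=['Survive'])[cols]   # drop rows, then slice columns
via_query = df.query('Survive == Survive')[cols]   # filter rows, then slice columns

# All three should hold the same rows, columns and values
print(via_loc.equals(via_dropna) and via_loc.equals(via_query))
```

Which form to use is mostly a matter of taste; dropna(subset=...) states the intent most directly, while the loc form does the row and column selection in one step.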