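pandas: delete rows based on nulls in certain columns
Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/42125131/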
Delete row based on nulls in certain columns (pandas)
Asked by gesingle
I know how to drop a row from a DataFrame containing all nulls OR a single null but can you drop a row based on the nulls for a specified set of columns?
For example, say I am working with data containing geographical info (city, latitude, and longitude) in addition to numerous other fields. I want to keep the rows that at a minimum contain a value for city OR for lat and long but drop rows that have null values for all three.
I am having trouble finding functionality for this in pandas documentation. Any guidance would be appreciated.
Answered by Gene Burinsky
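You can use DataFrame.dropna, but instead of using how='all', you can combine subset=[...] with the thresh parameter, which sets the minimum number of non-NA values a row must have in order to be kept. In the city, long/lat example, thresh=2 gives the desired result on the sample data: rows with fewer than two non-null values among the three columns are dropped. Note that this also drops a row that has only a city, so it is not exactly the "city OR (lat AND long)" rule. Using the great data example set up by MaxU, we would do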
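import pandas as pd

## get the data (copy MaxU's sample table to the clipboard first)
df = pd.read_clipboard()
## remove undesired rows: keep rows with at least 2 non-null values among the three columns
df.dropna(axis=0, subset=['city', 'longitude', 'latitude'], thresh=2)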
This yields:
In [5]: df.dropna(axis=0, subset=['city', 'longitude', 'latitude'], thresh=2)
Out[5]:
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
3  NaN   11.1111    33.3330  1  2
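To make the behaviour of thresh concrete, here is a minimal, self-contained sketch. It rebuilds MaxU's sample frame by hand instead of reading the clipboard, and the names geo_cols, at_least_two and not_all_null are my own. It relies on thresh being the minimum number of non-NA values a row needs in order to be kept, and compares that with how='all' on the same columns.

```python
import numpy as np
import pandas as pd

# Sample data copied from MaxU's answer
df = pd.DataFrame({
    "city": ["aaa", "bbb", np.nan, np.nan, np.nan],
    "latitude": [11.1111, np.nan, np.nan, 11.1111, np.nan],
    "longitude": [np.nan, 22.2222, np.nan, 33.333, 44.444],
    "a": [1, 5, 3, 1, 1],
    "b": [2, 6, 4, 2, 1],
})

geo_cols = ["city", "latitude", "longitude"]  # illustrative name

# Keep rows with at least 2 non-null values among the three columns
at_least_two = df.dropna(subset=geo_cols, thresh=2)

# Drop rows only when all three columns are null
not_all_null = df.dropna(subset=geo_cols, how="all")

print(at_least_two.index.tolist())  # [0, 1, 3]
print(not_all_null.index.tolist())  # [0, 1, 3, 4]
```

The two calls differ on row 4, which has a longitude but neither a city nor a latitude.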
Answered by MaxU
Try this:
In [25]: df
Out[25]:
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
2  NaN       NaN        NaN  3  4
3  NaN   11.1111    33.3330  1  2
4  NaN       NaN    44.4440  1  1
In [26]: df.query("city == city or (latitude == latitude and longitude == longitude)")
Out[26]:
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
3  NaN   11.1111    33.3330  1  2
If I understand the OP correctly, the row with index 4 must be dropped, because its coordinates are not both non-null. So dropna() won't work "properly" in this case:
In [62]: df.dropna(subset=['city','latitude','longitude'], how='all')
Out[62]:
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
3  NaN   11.1111    33.3330  1  2
4  NaN       NaN    44.4440  1  1   # this row should be dropped...
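The query string works because NaN never compares equal to itself, so "col == col" is True only where col is non-null. A small sketch of that idea, rebuilding the sample frame by hand (the variable names self_equal and kept are mine):

```python
import numpy as np
import pandas as pd

# Sample data copied from the frame shown above
df = pd.DataFrame({
    "city": ["aaa", "bbb", np.nan, np.nan, np.nan],
    "latitude": [11.1111, np.nan, np.nan, 11.1111, np.nan],
    "longitude": [np.nan, 22.2222, np.nan, 33.333, 44.444],
    "a": [1, 5, 3, 1, 1],
    "b": [2, 6, 4, 2, 1],
})

# Comparing a column with itself is False exactly where the value is NaN
self_equal = df["city"] == df["city"]
print(self_equal.tolist())  # [True, True, False, False, False]

# The same trick inside query(): keep rows with a city, or with both coordinates
kept = df.query("city == city or (latitude == latitude and longitude == longitude)")
print(kept.index.tolist())  # [0, 1, 3]
```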
Answered by Boud
dropna has a parameter to apply the tests only on a subset of columns:
dropna(axis=0, how='all', subset=[your three columns in this list])
Answered by piRSquared
Using a boolean mask and a clever dot product (this is for @Boud)
subset = ['city', 'latitude', 'longitude']
df[df[subset].notnull().dot([2, 1, 1]).ge(2)]
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
3  NaN   11.1111    33.3330  1  2
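The weights do the work here: a present city scores 2 and each present coordinate scores 1, so a row reaches a score of 2 only if it has a city or has both coordinates. The sketch below spells out the intermediate steps on the sample data; the names weights, present and score are mine, not part of the answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from MaxU's answer
df = pd.DataFrame({
    "city": ["aaa", "bbb", np.nan, np.nan, np.nan],
    "latitude": [11.1111, np.nan, np.nan, 11.1111, np.nan],
    "longitude": [np.nan, 22.2222, np.nan, 33.333, 44.444],
    "a": [1, 5, 3, 1, 1],
    "b": [2, 6, 4, 2, 1],
})

subset = ["city", "latitude", "longitude"]
weights = [2, 1, 1]  # city counts double, each coordinate counts once

present = df[subset].notnull()  # boolean frame: True where a value exists
score = present.dot(weights)    # weighted count of present values per row
print(score.tolist())           # [3, 3, 0, 2, 1]

kept = df[score.ge(2)]
print(kept.index.tolist())      # [0, 1, 3]
```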
Answered by Jimmy C
You can perform selection by exploiting the bitwise operators.
import pandas as pd

## create example data
df = pd.DataFrame({'City': ['Gothenburg', None, None], 'Long': [None, 1, 1], 'Lat': [1, None, 1]})
## bitwise/logical operators: True where City is present, or both Lat and Long are present
~df.City.isnull() | (~df.Lat.isnull() & ~df.Long.isnull())
0     True
1    False
2     True
dtype: bool
## subset using above statement
df[~df.City.isnull() | (~df.Lat.isnull() & ~df.Long.isnull())]
         City  Lat  Long
0  Gothenburg  1.0   NaN
2        None  1.0   1.0
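As a possible variation, the same mask can be written with notna(), which avoids the repeated negation. This is only a sketch on the example data above; the names keep and result are mine.

```python
import pandas as pd

# Same example data as in the answer
df = pd.DataFrame({
    "City": ["Gothenburg", None, None],
    "Long": [None, 1, 1],
    "Lat": [1, None, 1],
})

# Keep a row if City is present, or if both Lat and Long are present
keep = df["City"].notna() | (df["Lat"].notna() & df["Long"].notna())
print(keep.tolist())  # [True, False, True]

result = df[keep]
print(result.index.tolist())  # [0, 2]
```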