Pandas: DataFrame filtering using groupby and a function
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/17950835/
Asked by Matthijs
Using Python 3.3 and Pandas 0.10
I have a DataFrame that is built by concatenating multiple CSV files. First, I filter out all values in the Name column that contain a certain string. The result looks something like this (shortened for brevity; there are actually more columns):
Name    ID
'A'     1
'B'     2
'C'     3
'C'     3
'E'     4
'F'     4
...     ...
Now my issue is that I want to remove a special case of 'duplicate' values. I want to remove all ID duplicates (the entire row, actually) where the Name values mapped to that ID are not the same. In the example above I would like to keep the rows with ID 1, 2 and 3. Where ID=4, the Name values are unequal and I want to remove those rows.
I tried to use the following line of code (based on the suggestion here: Python Pandas: remove entries based on the number of occurrences).
Code:
df[df.groupby('ID').apply(lambda g: len({x for x in g['Name']})) == 1]
However that gives me the error:
ValueError: Item wrong length 51906 instead of 109565!
Edit:
Instead of using apply() I have also tried using transform(); however, that gives me the error: AttributeError: 'int' object has no attribute 'ndim'. An explanation of why the error differs between the two functions would be very much appreciated!
Also, I want to keep all rows where ID = 3 in the above example.
Thanks in advance, Matthijs
Answered by Dan Allan
Instead of length len, I think you want to consider the number of unique values of Name in each group. Use nunique(), and check out this neat recipe for filtering groups.
df[df.groupby('ID').Name.transform(lambda x: x.nunique() == 1).astype('bool')]
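To see why a per-group result cannot be used directly as a row mask, here is a minimal sketch on the question's sample data. The frame is rebuilt by hand for illustration, and the string form transform("nunique") is a variation assumed to be available in a reasonably recent pandas, not the exact call from the answer.

```python
import pandas as pd

# Sample data rebuilt from the question (illustrative only).
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

# One value per group: indexed by ID, so it has 4 entries and
# cannot serve as a boolean mask for the 6 rows of df.
per_group = df.groupby("ID")["Name"].nunique()

# One value per row: transform broadcasts each group's result
# back onto the rows of that group, so the lengths line up.
per_row = df.groupby("ID")["Name"].transform("nunique")

# Keep only rows whose ID maps to a single distinct Name.
kept = df[per_row == 1]
print(kept)
```

The mismatch between the number of groups and the number of rows is the same kind of length mismatch that the "Item wrong length" error in the question reports.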
If you upgrade to pandas 0.12, you can use the new filter method on groups, which makes this more succinct and straightforward.
df.groupby('ID').filter(lambda x: x.Name.nunique() == 1)
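As a rough illustration, the filter call can be tried on the question's sample data. The frame below is rebuilt by hand and is an assumption about the real data, which has more columns.

```python
import pandas as pd

# Sample data rebuilt from the question (illustrative only).
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

# filter calls the function once per group and keeps every row of
# the groups for which it returns True; other groups are dropped whole.
result = df.groupby("ID").filter(lambda g: g["Name"].nunique() == 1)
print(result)
```

Both rows with ID 3 survive because their Name values agree, while the two ID 4 rows are dropped together.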
A general remark: Sometimes, of course, you do want to know the length of the group, but I find that size is a safer choice than len, which has been troublesome for me in some cases.
Answered by Andy Hayden
You could first drop the duplicates:
In [11]: df = df.drop_duplicates()
In [12]: df
Out[12]:
  Name ID
0    A  1
1    B  2
2    C  3
4    E  4
5    F  4
Then groupby ID and only consider those groups with one element:
In [13]: g = df.groupby('ID')
In [14]: size = (g.size() == 1)
In [15]: size
Out[15]:
ID
1      True
2      True
3      True
4     False
dtype: bool
In [16]: size[size].index
Out[16]: Int64Index([1, 2, 3], dtype=int64)
In [17]: df['ID'].isin(size[size].index)
Out[17]:
0     True
1     True
2     True
4    False
5    False
Name: ID, dtype: bool
Then use this as a boolean index:
In [18]: df[df['ID'].isin(size[size].index)]
Out[18]:
  Name ID
0    A  1
1    B  2
2    C  3
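The session above can be condensed into a short script. This is a sketch on hand-built sample data. The last step, which applies the same ID list to the original frame so that both ID 3 rows are kept (as the question's edit requests), is an added variation and not part of the answer as written.

```python
import pandas as pd

# Sample data rebuilt from the question (illustrative only).
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

# Drop fully identical rows, so an ID with one distinct Name has one row left.
deduped = df.drop_duplicates()

# IDs that still have exactly one row after deduplication.
sizes = deduped.groupby("ID").size()
good_ids = sizes[sizes == 1].index

# Result as in the answer: one row per surviving ID.
deduped_result = deduped[deduped["ID"].isin(good_ids)]

# Variation (assumption): mask the original frame to keep repeated rows too.
full_result = df[df["ID"].isin(good_ids)]
print(full_result)
```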

