Pandas: Filtering a DataFrame using groupby and a function

Question

Asked by Matthijs

Using Python 3.3 and Pandas 0.10

I have a DataFrame that is built by concatenating multiple CSV files. First, I filter out all values in the Name column that contain a certain string. The result looks something like this (shortened for brevity; there are actually more columns):

Name    ID
'A'     1
'B'     2
'C'     3
'C'     3
'E'     4
'F'     4
...     ...

Now my issue is that I want to remove a special case of 'duplicate' values. I want to remove all ID duplicates (the entire row, actually) where the corresponding Name values mapped to this ID are not similar. In the example above I would like to keep the rows with ID 1, 2 and 3. Where ID=4, the Name values are unequal and I want to remove those rows.

I tried to use the following line of code (based on the suggestion here: Python Pandas: remove entries based on the number of occurrences).

Code:

df[df.groupby('ID').apply(lambda g: len({x for x in g['Name']})) == 1]

However that gives me the error: ValueError: Item wrong length 51906 instead of 109565!

Edit:

Instead of using apply() I have also tried using transform(); however, that gives me the error: AttributeError: 'int' object has no attribute 'ndim'. An explanation of why the error differs per function would be very much appreciated!

Also, I want to keep all rows where ID = 3 in the above example.

Thanks in advance, Matthijs

Answer 1

Answered by Dan Allan

Instead of the length (len), I think you want to consider the number of unique values of Name in each group. Use nunique(), and check out this neat recipe for filtering groups.

df[df.groupby('ID').Name.transform(lambda x: x.nunique() == 1).astype('bool')]
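
A minimal sketch of why the transform-based mask fits the frame while the apply-based one from the question does not. It assumes a recent pandas version and rebuilds the six-row sample from the question; the variable names are illustrative only. An aggregation yields one value per group, which is presumably the source of the "Item wrong length" error, whereas transform broadcasts the per-group value back onto every row.

```python
import pandas as pd

# Sample data reconstructed from the question (illustrative only).
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

# An aggregation returns one value per group, indexed by ID (4 entries here),
# so it cannot serve directly as a row mask for the 6-row frame.
per_group = df.groupby("ID")["Name"].nunique()

# transform broadcasts the per-group result back onto the original rows.
per_row = df.groupby("ID")["Name"].transform("nunique")

# Keep only rows whose ID maps to exactly one distinct Name.
result = df[per_row == 1]
print(result)
```

Because per_row shares the index of df, the comparison produces a boolean Series that can be used for indexing without any realignment.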

If you upgrade to pandas 0.12, you can use the new filter method on groups, which makes this more succinct and straightforward.

df.groupby('ID').filter(lambda x: x.Name.nunique() == 1)
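
The filter call above can be tried on the sample data from the question. This is a small self-contained sketch assuming pandas 0.12 or later; the DataFrame contents are reconstructed from the question and the name kept is illustrative.

```python
import pandas as pd

# Sample data reconstructed from the question (illustrative only).
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

# filter keeps every row of each group for which the function returns True.
kept = df.groupby("ID").filter(lambda g: g["Name"].nunique() == 1)
print(kept)
```

Both rows with ID 3 survive, since they share a single Name, while the two rows with ID 4 are dropped.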

A general remark: Sometimes, of course, you do want to know the length of the group, but I find that size is a safer choice than len, which has been troublesome for me in some cases.
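
The answer does not say what went wrong with len. One difference that is easy to trip over, shown here as a hedged sketch on the question's sample data, is that size() on the grouped object reports rows per group, while len() on the same object reports the number of groups.

```python
import pandas as pd

# Sample data reconstructed from the question (illustrative only).
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})
grouped = df.groupby("ID")

# size() returns a Series indexed by ID holding the row count of each group.
group_sizes = grouped.size()

# len() on the GroupBy object itself returns the number of groups.
n_groups = len(grouped)

print(group_sizes)
print(n_groups)
```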

Answer 2

Answered by Andy Hayden

You could first drop the duplicates:

In [11]: df = df.drop_duplicates()

In [12]: df
Out[12]:
  Name ID
0    A  1
1    B  2
2    C  3
4    E  4
5    F  4

Then groupby ID and only consider those with one element:

In [13]: g = df.groupby('ID')

In [14]: size = (g.size() == 1)

In [15]: size
Out[15]:
ID
1      True
2      True
3      True
4     False
dtype: bool

In [16]: size[size].index
Out[16]: Int64Index([1, 2, 3], dtype=int64)

In [17]: df['ID'].isin(size[size].index)
Out[17]:
0     True
1     True
2     True
4    False
5    False
Name: ID, dtype: bool

Then boolean index by this:

In [18]: df[df['ID'].isin(size[size].index)]
Out[18]:
  Name ID
0    A  1
1    B  2
2    C  3
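
Note that the output above contains only one row with ID 3, because the frame was overwritten by drop_duplicates(), whereas the question asks to keep all rows where ID = 3. A possible variant, sketched here under the assumption that the original frame is still available, computes the valid IDs on a deduplicated copy and applies the mask to the original. The names deduped and good_ids are illustrative.

```python
import pandas as pd

# Sample data reconstructed from the question (illustrative only).
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

# Work on a deduplicated copy so the original rows are left untouched.
deduped = df.drop_duplicates()

# After deduplication, an ID with exactly one row has exactly one Name.
sizes = deduped.groupby("ID").size()
good_ids = sizes[sizes == 1].index

# Apply the mask to the original frame to keep repeated rows such as ID 3.
result = df[df["ID"].isin(good_ids)]
print(result)
```

With more columns than Name and ID, drop_duplicates would need a subset argument such as subset=["Name", "ID"] for this reasoning to hold.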
