Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/28859284/

Time: 2020-09-13 23:00:49  Source: igfitidea

pandas - keep only True values after groupby a DataFrame

python pandas

Asked by Fabio Lamanna

I've been working on a DataFrame with User_IDs, DateTime objects and other information, like the following extract:

User_ID;Latitude;Longitude;Datetime
222583401;41.4020375;2.1478710;2014-07-06 20:49:20
287280509;41.3671346;2.0793115;2013-01-30 09:25:47
329757763;41.5453577;2.1175164;2012-09-25 08:40:59
189757330;41.5844998;2.5621569;2013-10-01 11:55:20
624921653;41.5931846;2.3030671;2013-07-09 20:12:20
414673119;41.5550136;2.0965829;2014-02-24 20:15:30
414673119;41.5550136;2.0975829;2014-02-24 20:16:30
414673119;41.5550136;2.0985829;2014-02-24 20:17:30

I've grouped Users with:

g = df.groupby(['User_ID','Datetime'])

and then checked which users have more than one Datetime entry:

df = df.groupby('User_ID')['Datetime'].apply(lambda g: len(g)>1)

I obtained the following boolean Series:

User_ID
189757330    False
222583401    False
287280509    False
329757763    False
414673119     True
624921653    False
Name: Datetime, dtype: bool

This is fine for identifying which User_ID values have a True mask value. Now I would like to keep only the rows whose User_ID is associated with a True value and write them out, for instance with DataFrame.to_csv. The expected DataFrame would contain only the User_ID values with more than one Datetime entry:

User_ID;Latitude;Longitude;Datetime
414673119;41.5550136;2.0965829;2014-02-24 20:15:30
414673119;41.5550136;2.0975829;2014-02-24 20:16:30
414673119;41.5550136;2.0985829;2014-02-24 20:17:30

How can I access the boolean value for each User_ID? Thanks for your kind help.

Answered by EdChum

Assign the result of df.groupby('User_ID')['Datetime'].apply(lambda g: len(g)>1) to a variable so you can use it as a boolean mask on itself, then pass the index of the remaining entries to isin to filter your original df:

In [366]:

users = df.groupby('User_ID')['Datetime'].apply(lambda g: len(g)>1)
users

Out[366]:
User_ID
189757330    False
222583401    False
287280509    False
329757763    False
414673119     True
624921653    False
Name: Datetime, dtype: bool

In [367]:   
users[users]

Out[367]:
User_ID
414673119    True
Name: Datetime, dtype: bool

In [368]:
users[users].index

Out[368]:
Int64Index([414673119], dtype='int64')

In [361]:
df[df['User_ID'].isin(users[users].index)]

Out[361]:
     User_ID   Latitude  Longitude            Datetime
5  414673119  41.555014   2.096583 2014-02-24 20:15:30
6  414673119  41.555014   2.097583 2014-02-24 20:16:30
7  414673119  41.555014   2.098583 2014-02-24 20:17:30

You can then call to_csv on the result as normal.
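The steps above can be put together as a minimal end-to-end sketch. It assumes the data arrives as semicolon-separated text like the extract in the question; the names raw, repeated_ids, result and buffer are illustrative, and the CSV is written to an in-memory buffer rather than to a file.

```python
import io

import pandas as pd

# Sample rows copied from the question, semicolon-separated.
raw = """User_ID;Latitude;Longitude;Datetime
222583401;41.4020375;2.1478710;2014-07-06 20:49:20
287280509;41.3671346;2.0793115;2013-01-30 09:25:47
329757763;41.5453577;2.1175164;2012-09-25 08:40:59
189757330;41.5844998;2.5621569;2013-10-01 11:55:20
624921653;41.5931846;2.3030671;2013-07-09 20:12:20
414673119;41.5550136;2.0965829;2014-02-24 20:15:30
414673119;41.5550136;2.0975829;2014-02-24 20:16:30
414673119;41.5550136;2.0985829;2014-02-24 20:17:30
"""
df = pd.read_csv(io.StringIO(raw), sep=";", parse_dates=["Datetime"])

# Boolean Series indexed by User_ID: True where the user has more than one row.
users = df.groupby("User_ID")["Datetime"].apply(lambda g: len(g) > 1)

# users[users] keeps only the True entries; its index holds the matching IDs.
repeated_ids = users[users].index

# Keep the original rows whose User_ID is among those IDs.
result = df[df["User_ID"].isin(repeated_ids)]

# Write with the same ';' separator as the input; index=False omits row labels.
buffer = io.StringIO()
result.to_csv(buffer, sep=";", index=False)
csv_text = buffer.getvalue()
```

Passing a file path to to_csv instead of the buffer would write the same content to disk.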

Answered by acushner

First, make sure you have no duplicate entries:

df = df.drop_duplicates()

Then, count the Datetime entries for each User_ID:

counts = df.groupby('User_ID').Datetime.count()

Finally, keep the rows whose User_ID has a count greater than 1:

df[df.User_ID.isin(counts[counts > 1].index)]
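As a rough illustration of why the drop_duplicates step matters, the sketch below runs the three steps on a few rows from the question plus one exact duplicate row that was added here purely for illustration (it is not in the original data). The variable names are hypothetical. The last line shows groupby(...).filter as an alternative that is not part of the answer above; it should select the same rows.

```python
import pandas as pd

# Rows taken from the question, plus one exact duplicate (row 2) added
# only to illustrate the effect of drop_duplicates.
df = pd.DataFrame({
    "User_ID": [222583401, 624921653, 624921653,
                414673119, 414673119, 414673119],
    "Datetime": pd.to_datetime([
        "2014-07-06 20:49:20",
        "2013-07-09 20:12:20",
        "2013-07-09 20:12:20",  # illustrative duplicate of row 1
        "2014-02-24 20:15:30",
        "2014-02-24 20:16:30",
        "2014-02-24 20:17:30",
    ]),
})

# Without de-duplication, the repeated row makes 624921653 look like
# a user with two distinct entries.
raw_counts = df.groupby("User_ID")["Datetime"].count()
naive = df[df["User_ID"].isin(raw_counts[raw_counts > 1].index)]

# The answer's three steps: de-duplicate, count per user, filter with isin.
deduped = df.drop_duplicates()
counts = deduped.groupby("User_ID")["Datetime"].count()
kept = deduped[deduped["User_ID"].isin(counts[counts > 1].index)]

# Alternative (not from the answer): let groupby.filter drop small groups.
alt = deduped.groupby("User_ID").filter(lambda g: len(g) > 1)
```

Note that count() ignores missing values, so a user whose Datetime entries are mostly NaT could fall below the threshold; size() counts rows regardless of missing values.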