Original source: http://stackoverflow.com/questions/43300146/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Filter Pandas df by count of a column and write data
Asked by andrewryanx
I have a data set of geo-located social media posts that I am trying to filter by a user_id frequency greater than 1 (users who posted 2 or more times). I would like to filter this so I can further clean the trajectory data I'm creating.
Sample code:
import numpy as np
import pandas as pd

# Import data
data = pd.read_csv('path', delimiter=',', engine='python')
# print(len(data), "rows")
# print(data)
# Create DataFrame
df = pd.DataFrame(data, columns=['user_id','timestamp','latitude','longitude'])
# print(data.head())
# Get a list of unique user_id values
uniqueIds = np.unique(data['user_id'].values)
# Get the ordered (by timestamp) coordinates for each user_id
# (selecting two columns needs a list inside the brackets: [['latitude','longitude']])
output = [[id,data.loc[data['user_id']==id].sort_values(by='timestamp')[['latitude','longitude']].values.tolist()] for id in uniqueIds]
# Save outputs
outputs = pd.DataFrame(output)
# print(outputs)
outputs.to_csv('path', index=False, header=False)
I tried using df[].value_counts() to get a count of user_id, and then passing >1 in the line output = [[......data['user_id']==id>1]..... However, that did not work. Is it possible to add the frequency of user_id as an additional argument to the code and extract information for only those users?
Sample data:
user_id, timestamp, latitude, longitude
478134225, 3/12/2017 9:04, 38.8940974, -77.0276216
478103585, 3/12/2017 9:04, 38.882584, -77.1124701
478073193, 3/12/2017 9:07, 39.00027849, -77.09480086
476194185, 3/12/2017 9:14, 38.8048355, -77.0469214
476162349, 3/12/2017 9:16, 38.8940974, -77.0276216
478073193, 3/12/2017 9:05, 38.8549, -76.8752
477899275, 3/12/2017 9:08, 38.90181532, -77.03733586
477452890, 3/12/2017 9:08, 38.96117237, -76.95561893
478073193, 3/12/2017 9:05, 38.7188716, -77.1542684
Answered by Scott Boston
Let's try groupby followed by the filter method, which returns only the records whose group evaluates as True in the filter function. In this case, return True for groups (user_ids) that have more than one record.
df.groupby('user_id').filter(lambda x: x['user_id'].count()>1)
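To see what the filter call keeps, here is a minimal sketch that runs it on the sample rows from the question. Loading the rows from an in-memory string and passing skipinitialspace=True (so the blank after each comma does not end up in the column names) are my assumptions, not part of the original answer.

```python
import io
import pandas as pd

# Sample rows copied from the question. skipinitialspace=True drops the
# blank that follows each comma (an assumption about how the file is laid out).
csv_text = """user_id, timestamp, latitude, longitude
478134225, 3/12/2017 9:04, 38.8940974, -77.0276216
478103585, 3/12/2017 9:04, 38.882584, -77.1124701
478073193, 3/12/2017 9:07, 39.00027849, -77.09480086
476194185, 3/12/2017 9:14, 38.8048355, -77.0469214
476162349, 3/12/2017 9:16, 38.8940974, -77.0276216
478073193, 3/12/2017 9:05, 38.8549, -76.8752
477899275, 3/12/2017 9:08, 38.90181532, -77.03733586
477452890, 3/12/2017 9:08, 38.96117237, -76.95561893
478073193, 3/12/2017 9:05, 38.7188716, -77.1542684
"""
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# Keep only the rows whose user_id group has more than one record.
repeat_users = df.groupby('user_id').filter(lambda x: x['user_id'].count() > 1)
print(repeat_users)
```

Only user 478073193 posts more than once in the sample, so the three rows for that user are the ones that remain, in their original order.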
A more efficient statement using transform and boolean indexing.
df[df.groupby('user_id')['user_id'].transform('count') > 1]
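The transform-based mask can be combined with the trajectory step from the question. The following is a rough sketch under a few assumptions of mine: the timestamps are month/day/year, a subset of the question's sample rows is loaded from a string, and the CSV text is returned rather than written to the elided 'path'.

```python
import io
import pandas as pd

# A subset of the question's sample rows (assumption: month/day/year timestamps).
csv_text = """user_id, timestamp, latitude, longitude
478134225, 3/12/2017 9:04, 38.8940974, -77.0276216
478073193, 3/12/2017 9:07, 39.00027849, -77.09480086
476194185, 3/12/2017 9:14, 38.8048355, -77.0469214
478073193, 3/12/2017 9:05, 38.8549, -76.8752
478073193, 3/12/2017 9:05, 38.7188716, -77.1542684
"""
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# Boolean mask: each row gets the size of its user_id group, compared with 1.
mask = df.groupby('user_id')['user_id'].transform('count') > 1
repeat_users = df[mask].copy()

# Parse the timestamps so that sorting is chronological, not alphabetical.
repeat_users['timestamp'] = pd.to_datetime(
    repeat_users['timestamp'], format='%m/%d/%Y %H:%M')

# One entry per user: the time-ordered list of [latitude, longitude] pairs.
# A stable sort keeps rows with equal timestamps in their original order.
ordered = repeat_users.sort_values('timestamp', kind='stable')
trajectories = [
    [uid, grp[['latitude', 'longitude']].values.tolist()]
    for uid, grp in ordered.groupby('user_id')
]

# to_csv() without a path returns the CSV text; pass a file path to write to disk.
csv_out = pd.DataFrame(trajectories).to_csv(index=False, header=False)
print(csv_out)
```

Iterating over the groupby avoids re-scanning the whole frame once per user_id, which is what the list comprehension in the question does.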