Filter a Pandas df by count of a column and write data

Question

Asked by andrewryanx

I have a data set of geo-located social media posts that I am trying to filter by the frequency of user_id greater than 1 (users who posted 2 or more times). I would like to filter this so I can further clean the trajectory data I'm creating.

Sample code:

# Import libraries and data
import numpy as np
import pandas as pd

data = pd.read_csv('path', delimiter=',', engine='python')
#print len(data),"rows"
#print data

# Create DataFrame
df = pd.DataFrame(data, columns=['user_id','timestamp','latitude','longitude'])
#print data.head()

# Get a list of unique user_id values
uniqueIds = np.unique(data['user_id'].values)

# Get the ordered (by timestamp) coordinates for each user_id
output = [[id,data.loc[data['user_id']==id].sort_values(by='timestamp')[['latitude','longitude']].values.tolist()] for id in uniqueIds]

# Save outputs
outputs = pd.DataFrame(output)
#print outputs
outputs.to_csv('path', index=False, header=False)

I tried using df[].value_counts() to get a count of user_id, and then passing >1 in the line output = [[......data['user_id']==id>1]..... However, that did not work. Is it possible to add the frequency of user_id as an additional argument to the code and extract information for only those users?

Sample data:

user_id, timestamp, latitude, longitude
478134225, 3/12/2017 9:04, 38.8940974, -77.0276216
478103585, 3/12/2017 9:04, 38.882584, -77.1124701
478073193, 3/12/2017 9:07, 39.00027849, -77.09480086
476194185, 3/12/2017 9:14, 38.8048355, -77.0469214
476162349, 3/12/2017 9:16, 38.8940974, -77.0276216
478073193, 3/12/2017 9:05, 38.8549, -76.8752
477899275, 3/12/2017 9:08, 38.90181532, -77.03733586
477452890, 3/12/2017 9:08, 38.96117237, -76.95561893
478073193, 3/12/2017 9:05, 38.7188716, -77.1542684

Answer 1

Answered by Scott Boston

Let's try groupby followed by the filter method, which returns only the records whose group evaluates as True in the filter function. In this case, return True for groups (user_id values) that have more than one record.

df.groupby('user_id').filter(lambda x: x['user_id'].count()>1)
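As a rough illustration, the sketch below applies this filter to the sample rows from the question. The frame is built inline rather than read from a file (an assumption made to keep the example self-contained), and the timestamp column is left out because the filter does not need it.

```python
import pandas as pd

# Sample rows from the question, built inline (assumption: timestamps omitted)
df = pd.DataFrame({
    'user_id': [478134225, 478103585, 478073193, 476194185, 476162349,
                478073193, 477899275, 477452890, 478073193],
    'latitude': [38.8940974, 38.882584, 39.00027849, 38.8048355, 38.8940974,
                 38.8549, 38.90181532, 38.96117237, 38.7188716],
    'longitude': [-77.0276216, -77.1124701, -77.09480086, -77.0469214,
                  -77.0276216, -76.8752, -77.03733586, -76.95561893,
                  -77.1542684],
})

# Keep only the rows whose user_id group contains more than one record
repeat_posters = df.groupby('user_id').filter(lambda g: len(g) > 1)

print(repeat_posters)
```

In this sample only user 478073193 posted more than once, so the three rows for that user are the ones that remain, with their original index labels.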

A more efficient statement uses transform and boolean indexing:

df[df.groupby('user_id')['user_id'].transform('count') > 1]
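To tie this back to the question's goal, here is a minimal sketch (not the original poster's exact code) that filters with transform and then builds the timestamp-ordered coordinate lists per user. The inline sample frame, the timestamp format string, the stable sort for tied timestamps, and the in-memory buffer standing in for the output file are all assumptions made for illustration.

```python
import io
import pandas as pd

# Inline subset of the question's sample data (assumption: built by hand)
df = pd.DataFrame({
    'user_id': [478134225, 478073193, 476194185, 478073193, 478073193],
    'timestamp': ['3/12/2017 9:04', '3/12/2017 9:07', '3/12/2017 9:14',
                  '3/12/2017 9:05', '3/12/2017 9:05'],
    'latitude': [38.8940974, 39.00027849, 38.8048355, 38.8549, 38.7188716],
    'longitude': [-77.0276216, -77.09480086, -77.0469214, -76.8752,
                  -77.1542684],
})

# Parse timestamps so sorting is chronological (assumed month/day/year format)
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%m/%d/%Y %H:%M')

# Boolean mask: True for rows whose user_id appears more than once
mask = df.groupby('user_id')['user_id'].transform('count') > 1
repeat_posters = df[mask]

# Sort once (stable, so tied timestamps keep their input order),
# then collect [latitude, longitude] pairs per user
ordered = repeat_posters.sort_values('timestamp', kind='mergesort')
output = [
    [uid, group[['latitude', 'longitude']].values.tolist()]
    for uid, group in ordered.groupby('user_id')
]

# Write to an in-memory buffer; a real file path would go here instead
buffer = io.StringIO()
pd.DataFrame(output).to_csv(buffer, index=False, header=False)
```

Sorting once and iterating over the groups avoids rescanning the whole frame for every user_id, which is what the list comprehension in the question does.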
