Python pandas: count things

Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you reuse or share it, you must keep the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/12867178/

pandas: count things

python, pandas

Asked by Mike Dewar

In the following, male_trips is a big pandas data frame and stations is a small pandas data frame. For each station id I'd like to know how many male trips took place. The following does the job, but takes a long time:

mc = [ sum( male_trips['start_station_id'] == id ) for id in stations['id'] ]

how should I go about this instead?
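
For concreteness, here is a minimal sketch of the setup using small invented frames (the real male_trips and stations are of course much larger), so the slow approach above can be run as-is:

import pandas as pd

# Invented toy data standing in for the real frames.
male_trips = pd.DataFrame({'start_station_id': [1, 2, 2, 3, 3, 3]})
stations = pd.DataFrame({'id': [1, 2, 3, 4]})

# The list comprehension scans all of male_trips once per station,
# which is why it is slow on a big frame.
mc = [sum(male_trips['start_station_id'] == id) for id in stations['id']]
print(mc)  # [1, 2, 3, 0]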


Update! So there were two main approaches: groupby() followed by size(), and the simpler .value_counts(). I did a quick timeit, and the groupby approach wins by quite a large margin! Here is the code:

from timeit import Timer
setup = "import pandas; male_trips=pandas.load('maletrips')"
a  = "male_trips.start_station_id.value_counts()"
b = "male_trips.groupby('start_station_id').size()"
Timer(a,setup).timeit(100)
Timer(b,setup).timeit(100)

and here is the result:

In [4]: Timer(a,setup).timeit(100) # <- this is value_counts
Out[4]: 9.709594964981079

In [5]: Timer(b,setup).timeit(100) # <- this is groupby / size
Out[5]: 1.5574288368225098
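
pandas.load was removed from pandas long ago, so the benchmark above no longer runs as written on current versions. A rough modern re-creation, using invented synthetic data instead of the original maletrips file, might look like this (relative timings can differ on newer pandas releases):

import timeit

setup = """
import numpy as np
import pandas as pd
# Synthetic stand-in for the original trips data: 1,000,000 rows, 500 stations.
male_trips = pd.DataFrame(
    {'start_station_id': np.random.randint(0, 500, size=1_000_000)}
)
"""
a = "male_trips.start_station_id.value_counts()"
b = "male_trips.groupby('start_station_id').size()"
print(timeit.timeit(a, setup, number=100))  # value_counts
print(timeit.timeit(b, setup, number=100))  # groupby / size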

Note that, at this speed, for exploring data, typing value_counts is marginally quicker to write and there is less to remember!

Accepted answer by Dani Arribas-Bel

I'd do it like Vishal, but instead of sum() I'd use size() to get a count of the number of rows allocated to each group of 'start_station_id'. So:

df = male_trips.groupby('start_station_id').size()
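
To go from that grouped result back to one count per station id (including stations with no trips at all), the Series can be reindexed onto stations['id']; a small sketch, assuming frames shaped like the ones in the question:

counts = male_trips.groupby('start_station_id').size()
# Stations that never appear in male_trips get a count of 0.
stations['male_trip_count'] = counts.reindex(stations['id'], fill_value=0).values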

Answer by vgoklani

how long would this take:

df = male_trips.groupby('start_station_id').sum()
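
Note that .sum() adds up the remaining numeric columns within each group, while .size() counts rows, so they answer different questions unless every row contributes exactly 1. A tiny illustration with an invented duration column:

import pandas as pd

male_trips = pd.DataFrame({'start_station_id': [1, 1, 2],
                           'duration': [10, 20, 30]})

male_trips.groupby('start_station_id').size()  # row counts: station 1 -> 2, station 2 -> 1
male_trips.groupby('start_station_id').sum()   # duration sums: 30 and 30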

Answer by Arthur G

edit: after seeing in the answer above that isin and value_counts exist (and that value_counts even comes with its own entry in pandas.core.algorithms, and also that isin isn't simply np.in1d), I updated the three methods below

male_trips.start_station_id[male_trips.start_station_id.isin(station.id)].value_counts()

You could also do an inner join on stations.id: pd.merge(male_trips, station, left_on='start_station_id', right_on='id') followed by value_counts. Or:

male_trips.set_index('start_station_id', inplace=True)
station.set_index('id', inplace=True)
male_trips.ix[male_trips.index.intersection(station.index)].reset_index().start_station_id.value_counts()
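
.ix has since been removed from pandas, so on current versions the last variant needs .loc; a rough modern equivalent (a sketch, not the original answer's code) would be:

male_trips = male_trips.set_index('start_station_id')
station = station.set_index('id')
common = male_trips.index.intersection(station.index)
male_trips.loc[common].reset_index()['start_station_id'].value_counts()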

If you have the time, I'd be interested in how these perform differently on a huge DataFrame.

Answer by ely

My answer below works in Pandas 0.7.3. Not sure about the new releases.

This is what the pandas.Series.value_counts method is for:

count_series = male_trips.start_station_id.value_counts()

It should be straightforward to then inspect count_series based on the values in stations['id']. However, if you insist on only considering those values, you could do the following:

count_series = (
                male_trips[male_trips.start_station_id.isin(stations.id.values)]
                    .start_station_id
                    .value_counts()
               )

and this will only give counts for station IDs actually found in stations.id.
