How to count unique records by two columns in pandas?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/47023541/



Tags: python, pandas, dataframe, group-by

Asked by GhostKU

I have a dataframe in pandas:

In [10]: df
Out[10]:
    col_a    col_b  col_c  col_d
0  France    Paris      3      4
1      UK    Londo      4      5
2      US  Chicago      5      6
3      UK  Bristol      3      3
4      US    Paris      8      9
5      US   London     44      4
6      US  Chicago     12      4

I need to count unique cities. I can count unique countries:

In [11]: df['col_a'].nunique()
Out[11]: 3

and I can try to count unique cities:

In [12]: df['col_b'].nunique()
Out[12]: 5

but it is wrong, because Paris in the US and Paris in France are different cities. So right now I'm doing it like this:

In [13]: df['col_a_b'] = df['col_a'] + ' - ' + df['col_b']

In [14]: df
Out[14]:
    col_a    col_b  col_c  col_d         col_a_b
0  France    Paris      3      4  France - Paris
1      UK    Londo      4      5      UK - Londo
2      US  Chicago      5      6    US - Chicago
3      UK  Bristol      3      3    UK - Bristol
4      US    Paris      8      9      US - Paris
5      US   London     44      4     US - London
6      US  Chicago     12      4    US - Chicago

In [15]: df['col_a_b'].nunique()
Out[15]: 6

Maybe there is a better way, without creating an additional column?

Answered by YOBEN_S

By using ngroups

df.groupby(['col_a', 'col_b']).ngroups
Out[101]: 6
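
(A note from me, not part of the original answer: ngroups is an attribute of the pandas GroupBy object, so the count is obtained without iterating over or materializing the groups. A minimal self-contained sketch using the question's data:)

import pandas as pd

df = pd.DataFrame({
    'col_a': ['France', 'UK', 'US', 'UK', 'US', 'US', 'US'],
    'col_b': ['Paris', 'Londo', 'Chicago', 'Bristol', 'Paris', 'London', 'Chicago'],
})

# Each distinct (country, city) pair forms one group; rows with NaN keys
# would be dropped by default.
print(df.groupby(['col_a', 'col_b']).ngroups)  # 6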

Or using set

len(set(zip(df['col_a'],df['col_b'])))
Out[106]: 6

Answered by Psidom

You can select col_a and col_b, drop the duplicates, then check the shape/len of the resulting data frame:

df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 6

len(df[['col_a', 'col_b']].drop_duplicates())
# 6


Because groupby ignores NaNs, and may unnecessarily invoke a sorting process, choose which method to use accordingly if you have NaNs in the columns:
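
(A side note from me, not in the original answer: if the default sorting of group keys is a concern, groupby accepts sort=False; the group count is unchanged. A small sketch with hypothetical data:)

import pandas as pd

df = pd.DataFrame({'col_a': ['US', 'US', 'UK'], 'col_b': ['Paris', 'Chicago', 'Bristol']})

# sort=False skips sorting the group keys; ngroups is the same either way.
print(df.groupby(['col_a', 'col_b'], sort=False).ngroups)  # 3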

Consider the following data frame:

import numpy as np
import pandas as pd

# pd.np is deprecated (removed in pandas 2.0); use numpy directly.
df = pd.DataFrame({
    'col_a': [1, 2, 2, np.nan, 1, 4],
    'col_b': [2, 2, 3, np.nan, 2, np.nan]
})

print(df)

#   col_a  col_b
#0    1.0    2.0
#1    2.0    2.0
#2    2.0    3.0
#3    NaN    NaN
#4    1.0    2.0
#5    4.0    NaN

Timing:

df = pd.concat([df] * 1000)

%timeit df.groupby(['col_a', 'col_b']).ngroups
# 1000 loops, best of 3: 625 μs per loop

%timeit len(df[['col_a', 'col_b']].drop_duplicates())
# 1000 loops, best of 3: 1.02 ms per loop

%timeit df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 1000 loops, best of 3: 1.01 ms per loop    

%timeit len(set(zip(df['col_a'],df['col_b'])))
# 10 loops, best of 3: 56 ms per loop

%timeit len(df.groupby(['col_a', 'col_b']))
# 1 loop, best of 3: 260 ms per loop

Result:

df.groupby(['col_a', 'col_b']).ngroups
# 3

len(df[['col_a', 'col_b']].drop_duplicates())
# 5

df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 5

len(set(zip(df['col_a'],df['col_b'])))
# 2003

len(df.groupby(['col_a', 'col_b']))
# 2003

So the difference:

Option 1:

df.groupby(['col_a', 'col_b']).ngroups

is fast, and it excludes rows that contain NaNs.

Option 2 & 3:

len(df[['col_a', 'col_b']].drop_duplicates())
df[['col_a', 'col_b']].drop_duplicates().shape[0]

are reasonably fast; they treat NaN as a regular value, so rows containing NaNs are counted (and duplicate NaN rows collapse into one).

Option 4 & 5:

len(set(zip(df['col_a'],df['col_b']))) 
len(df.groupby(['col_a', 'col_b'])) 

are slow, and they follow the logic that numpy.nan == numpy.nan is False, so different (nan, nan) rows are considered distinct from each other.
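
(My addition, not part of the original answer: on pandas 1.1+, groupby also accepts dropna=False, which keeps NaN keys as a group of their own and makes the groupby count agree with drop_duplicates.)

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_a': [1, 2, 2, np.nan, 1, 4],
    'col_b': [2, 2, 3, np.nan, 2, np.nan]
})

# NaN keys are kept, so the count matches drop_duplicates: 5 unique pairs.
print(df.groupby(['col_a', 'col_b'], dropna=False).ngroups)  # 5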

Answered by MaxU

In [105]: len(df.groupby(['col_a', 'col_b']))
Out[105]: 6

Answered by Anuj

Try this: I'm basically subtracting the number of duplicated rows from the number of rows in df. This assumes we are grouping on all the categorical columns in the df.

df.shape[0] - df[['col_a','col_b']].duplicated().sum()

774 μs ± 603 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
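
(A quick sanity check of the arithmetic, my sketch rather than part of the answer: duplicated() marks every repeat after the first occurrence, so the row count minus its sum is the number of distinct pairs.)

import pandas as pd

df = pd.DataFrame({
    'col_a': ['US', 'US', 'UK'],
    'col_b': ['Paris', 'Paris', 'Bristol'],
})

# duplicated() is [False, True, False]; 3 rows - 1 duplicate = 2 unique pairs.
assert df.shape[0] - df[['col_a', 'col_b']].duplicated().sum() == 2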