pandas: Check if all values in a dataframe column are the same
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/54405704/
Asked by HelloToEarth
I want a quick and easy way to check whether all values in the counts column of a dataframe are the same:
In:
import pandas as pd
d = {'names': ['Jim', 'Ted', 'Mal', 'Ted'], 'counts': [3, 4, 3, 3]}
df = pd.DataFrame(data=d)
df
Out:
  names  counts
0   Jim       3
1   Ted       4
2   Mal       3
3   Ted       3
I just want a simple condition: if all counts have the same value, then print('True').
Is there a fast way to do this?
Answer by yatu
An efficient way to do this is to compare the first value with the rest, and then use all:
def is_unique(s):
    a = s.to_numpy()  # s.values (pandas<0.24)
    return (a[0] == a).all()

is_unique(df['counts'])
# False
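Two edge cases are worth keeping in mind with the first-versus-rest comparison: an empty Series has no first element, and NaN never compares equal to itself. The sketch below is a hypothetical variant (the name is_constant and its equal_nan parameter are not part of the answer) that assumes an empty Series should count as constant and can optionally treat an all-NaN column as constant.

```python
import numpy as np
import pandas as pd

def is_constant(s, equal_nan=False):
    """Hypothetical variant of is_unique with explicit edge-case handling."""
    a = s.to_numpy()
    if a.size == 0:
        # Assumption: an empty Series counts as constant.
        return True
    if (a[0] == a).all():
        return True
    # NaN != NaN, so an all-NaN column fails the comparison above.
    return bool(equal_nan and s.isna().all())

print(is_constant(pd.Series([3, 4, 3, 3])))                      # False
print(is_constant(pd.Series([3, 3, 3])))                         # True
print(is_constant(pd.Series([np.nan, np.nan])))                  # False
print(is_constant(pd.Series([np.nan, np.nan]), equal_nan=True))  # True
```

Whether an empty or all-NaN column should count as "all the same" depends on the use case, so both choices here are only one possible convention.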
For an entire dataframe
To perform the same check on an entire dataframe, we can extend the above by setting axis=0 in all:
def unique_cols(df):
    a = df.to_numpy()  # df.values (pandas<0.24)
    return (a[0] == a).all(0)
For the shared example, we'd get:
unique_cols(df)
# array([False, False])
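The boolean array has one entry per column, in column order, so it can be used to index df.columns directly. A minimal sketch, assuming a non-empty dataframe; the helper name constant_column_names and the extra flag column are made up for illustration:

```python
import pandas as pd

def constant_column_names(df):
    """Return the labels of columns whose values all equal their first row."""
    a = df.to_numpy()
    mask = (a[0] == a).all(axis=0)  # one boolean per column
    return df.columns[mask].tolist()

example = pd.DataFrame({'names': ['Jim', 'Ted', 'Mal', 'Ted'],
                        'counts': [3, 4, 3, 3],
                        'flag': [1, 1, 1, 1]})  # 'flag' is an illustrative constant column
print(constant_column_names(example))
# ['flag']
```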
Here's a benchmark of the above method compared with some other approaches, such as using nunique (for a pd.Series):
import numpy as np
import pandas as pd
import perfplot

s_num = pd.Series(np.random.randint(0, 1_000, 1_100_000))

perfplot.show(
    setup=lambda n: s_num.iloc[:int(n)],
    kernels=[
        lambda s: s.nunique() == 1,
        lambda s: is_unique(s)
    ],
    labels=['nunique', 'first_vs_rest'],
    n_range=[2**k for k in range(0, 20)],
    xlabel='N'
)
And below are the timings for a pd.DataFrame. Let's also compare with a numba approach, which is especially useful here since we can short-circuit as soon as we see a value in a given column that differs from its first value (note: the numba approach will only work with numerical data):
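import numpy as np
from numba import njit

@njit
def unique_cols_nb(a):
    n_cols = a.shape[1]
    out = np.zeros(n_cols, dtype=np.int32)
    for i in range(n_cols):
        init = a[0, i]
        for j in a[1:, i]:
            if j != init:
                # Found a differing value: stop scanning this column
                break
        else:
            # for/else: runs only if the loop finished without a break
            out[i] = 1
    return out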
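When numba is not available, a similar early exit can be approximated in plain NumPy by scanning a column in chunks and stopping at the first chunk that contains a differing value. This is only a sketch and was not part of the benchmark: the name is_constant_chunked and the default chunk_size are arbitrary illustrative choices, and an empty array is assumed to count as constant.

```python
import numpy as np

def is_constant_chunked(a, chunk_size=4096):
    """Early-exit constancy check on a 1-D array, without numba."""
    if a.size == 0:
        return True  # assumption: empty input counts as constant
    first = a[0]
    for start in range(0, a.size, chunk_size):
        # Stop at the first chunk holding a value that differs from a[0].
        if not (a[start:start + chunk_size] == first).all():
            return False
    return True

col = np.zeros(10_000)
print(is_constant_chunked(col))   # True
col[5_000] = 1.0
print(is_constant_chunked(col))   # False
```

Each chunk is still compared in vectorized form, so the early exit only pays off when a differing value appears well before the end of a long column.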
If we compare the three methods:
df = pd.DataFrame(np.concatenate([np.random.randint(0, 1_000, (500_000, 200)),
                                  np.zeros((500_000, 10))], axis=1))

perfplot.show(
    setup=lambda n: df.iloc[:int(n), :],
    kernels=[
        lambda df: (df.nunique(0) == 1).values,
        lambda df: unique_cols_nb(df.values).astype(bool),
        lambda df: unique_cols(df)
    ],
    labels=['nunique', 'unique_cols_nb', 'unique_cols'],
    n_range=[2**k for k in range(0, 20)],
    xlabel='N'
)
Answer by YOBEN_S
Update using np.unique:
len(np.unique(df.counts))==1
False
Or
len(set(df.counts.tolist()))==1
Or
df.counts.eq(df.counts.iloc[0]).all()
False
Or
df.counts.std()==0
False
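These checks do not all treat missing values the same way, which matters once a column contains NaN. A small sketch on a made-up two-element Series, comparing the nunique check used in the benchmark with the first-element comparison:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])  # illustrative data, not from the question

# nunique() drops NaN by default, so only the value 1.0 is counted.
by_nunique = s.nunique() == 1
# Passing dropna=False counts NaN as a value of its own.
by_nunique_keep_nan = s.nunique(dropna=False) == 1
# Comparing against the first element: NaN never equals 1.0.
by_first = bool(s.eq(s.iloc[0]).all())

print(by_nunique, by_nunique_keep_nan, by_first)
# True False False
```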
Answer by Michel de Ruiter
I think nunique does much more work than necessary. Iteration can stop at the first difference. This simple and generic solution uses itertools:
import itertools

def all_equal(iterable):
    "Returns True if all elements are equal to each other"
    g = itertools.groupby(iterable)
    return next(g, True) and not next(g, False)

all_equal(df.counts)
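The function relies on itertools.groupby yielding one (key, group) pair per run of consecutive equal items: if a second pair exists, at least two different values were seen. A short illustration on plain lists, repeating the function so that the snippet runs on its own:

```python
import itertools

def all_equal(iterable):
    "Returns True if all elements are equal to each other"
    g = itertools.groupby(iterable)
    return next(g, True) and not next(g, False)

# One key per run of consecutive equal items.
runs = [key for key, _ in itertools.groupby([3, 4, 3, 3])]
print(runs)                      # [3, 4, 3]
print(all_equal([3, 4, 3, 3]))   # False
print(all_equal([3, 3, 3]))      # True
print(all_equal([]))             # True (an empty iterable has no differing pair)
```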
One can even use this to find all columns with constant contents in one go:
constant_columns = df.columns[df.apply(all_equal)]
A slightly more readable but less performant alternative:
df.counts.min() == df.counts.max()
Pass skipna=False to min() and max() here if NaN values should not be ignored.
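The effect of skipna is easiest to see on a column that is constant apart from a missing value. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, 3.0, np.nan])  # illustrative data, not from the question

# Default skipna=True: NaN is ignored, so the column looks constant.
ignoring_nan = s.min() == s.max()
# skipna=False: min and max are both NaN, and NaN == NaN is False.
strict = s.min(skipna=False) == s.max(skipna=False)

print(ignoring_nan, strict)
# True False
```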