pandas unique values multiple columns
Original question: http://stackoverflow.com/questions/26977076/
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Asked by user2333196
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
                   'Col3': np.random.random(5)})
What is the best way to return the unique values of 'Col1' and 'Col2'?
The desired output is
'Bob', 'Joe', 'Bill', 'Mary', 'Steve'
Accepted answer by Alex Riley
pd.unique returns the unique values from an input array, or from a DataFrame column or index.
The input to this function needs to be one-dimensional, so multiple columns will need to be combined. The simplest way is to select the columns you want and then view the values in a flattened NumPy array. The whole operation looks like this:
>>> pd.unique(df[['Col1', 'Col2']].values.ravel('K'))
array(['Bob', 'Joe', 'Bill', 'Mary', 'Steve'], dtype=object)
Note that ravel() is an array method that returns a view (if possible) of a multidimensional array. The argument 'K' tells the method to flatten the array in the order the elements are stored in memory (pandas typically stores underlying arrays in Fortran-contiguous order; columns before rows). This can be significantly faster than using the method's default 'C' order.
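To make the effect of the order argument concrete, here is a minimal sketch using a small NumPy array that is explicitly created in Fortran-contiguous order. The array `arr` is a made-up example, not data from the question, and whether a given DataFrame's underlying array is laid out this way depends on pandas internals.

```python
import numpy as np

# Hypothetical 3x2 array stored column by column (Fortran-contiguous) in memory
arr = np.asfortranarray([[1, 2], [3, 4], [5, 6]])

c_order = arr.ravel()     # default 'C' order: row by row, which needs a copy here
k_order = arr.ravel('K')  # 'K' order: follows the memory layout, so a view suffices

print(c_order)  # [1 2 3 4 5 6]
print(k_order)  # [1 3 5 2 4 6]
print(np.shares_memory(arr, c_order))  # False
print(np.shares_memory(arr, k_order))  # True
```

Avoiding that copy is one plausible reason the 'K' variant can be faster on large arrays.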
An alternative way is to select the columns and pass them to np.unique:
>>> np.unique(df[['Col1', 'Col2']].values)
array(['Bill', 'Bob', 'Joe', 'Mary', 'Steve'], dtype=object)
There is no need to use ravel() here, as np.unique handles multidimensional arrays. Even so, this is likely to be slower than pd.unique, because np.unique uses a sort-based algorithm rather than a hash table to identify unique values.
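The two algorithms also differ in the order of their results, which is visible in the outputs above. A small sketch, using a made-up array named `values` purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical input with repeated names
values = np.array(['Joe', 'Bob', 'Joe', 'Bill', 'Bob'], dtype=object)

by_hash = pd.unique(values)  # hash-based: keeps the order of first appearance
by_sort = np.unique(values)  # sort-based: returns the values sorted

print(list(by_hash))  # ['Joe', 'Bob', 'Bill']
print(list(by_sort))  # ['Bill', 'Bob', 'Joe']
```

If sorted output is wanted anyway, sorting the (usually much smaller) result of pd.unique afterwards is one option.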
The difference in speed is significant for larger DataFrames (especially if there are only a handful of unique values):
>>> df1 = pd.concat([df]*100000, ignore_index=True) # DataFrame with 500000 rows
>>> %timeit np.unique(df1[['Col1', 'Col2']].values)
1 loop, best of 3: 1.12 s per loop
>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel('K'))
10 loops, best of 3: 38.9 ms per loop
>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel()) # ravel using C order
10 loops, best of 3: 49.9 ms per loop
Answer by Jerome Montino
Non-pandas solution: using set().
import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
                   'Col3': np.random.random(5)})
print(df)
# Series.append was removed in pandas 2.0; pd.concat combines the two columns instead
print(set(pd.concat([df.Col1, df.Col2]).values))
Output (the Col3 values are random and the order of the set may vary):
   Col1   Col2      Col3
0   Bob    Joe  0.201079
1   Joe  Steve  0.703279
2  Bill    Bob  0.722724
3  Mary    Bob  0.093912
4   Joe  Steve  0.766027
{'Steve', 'Bob', 'Bill', 'Joe', 'Mary'}
Answer by Mike
I have set up a DataFrame with a few simple strings in its columns:
>>> df
a b
0 a g
1 b h
2 d a
3 e e
You can concatenate the columns you are interested in and call the unique function:
>>> pandas.concat([df['a'], df['b']]).unique()
array(['a', 'b', 'd', 'e', 'g', 'h'], dtype=object)
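The same idea extends to any number of columns. The sketch below wraps it in a helper; the function name `unique_across_columns` and the variable `example` are illustrative choices, not part of pandas or of the original answer.

```python
import pandas as pd

def unique_across_columns(frame, columns):
    """Return the unique values found in the given columns, in order of first appearance."""
    # Stack the selected columns end to end into a single Series
    stacked = pd.concat([frame[col] for col in columns], ignore_index=True)
    return stacked.unique()

# Data taken from the question (Col3 omitted since it is not needed here)
example = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                        'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve']})

result = unique_across_columns(example, ['Col1', 'Col2'])
print(list(result))  # ['Bob', 'Joe', 'Bill', 'Mary', 'Steve']
```

Because the columns are concatenated one after another, values from earlier columns appear first in the result.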
Answer by James Little
In [5]: set(df.Col1).union(set(df.Col2))
Out[5]: {'Bill', 'Bob', 'Joe', 'Mary', 'Steve'}
Or:
set(df.Col1) | set(df.Col2)
Answer by erikreed
An updated solution using numpy v1.13+ requires specifying the axis in np.unique if using multiple columns, otherwise the array is implicitly flattened. Note that axis=0 returns the unique rows (combinations of column values) rather than the unique individual values.
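import numpy as np
# np.unique with an axis does not support object dtype, so cast the strings first
np.unique(df[['Col1', 'Col2']].values.astype(str), axis=0)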
This change was introduced Nov 2016: https://github.com/numpy/numpy/commit/1f764dbff7c496d6636dc0430f083ada9ff4e4be
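As a rough illustration of the difference between the flattened and the axis-based call, here is a sketch on a plain string array. The array `pairs` simply mirrors Col1 and Col2 from the question and is built by hand so that it has a string dtype rather than object dtype.

```python
import numpy as np

# Col1/Col2 pairs from the question, as a fixed-width string array
pairs = np.array([['Bob', 'Joe'],
                  ['Joe', 'Steve'],
                  ['Bill', 'Bob'],
                  ['Mary', 'Bob'],
                  ['Joe', 'Steve']])

flat_unique = np.unique(pairs)         # no axis: the array is flattened first
row_unique = np.unique(pairs, axis=0)  # axis=0: unique rows, duplicates removed

print(flat_unique.tolist())  # ['Bill', 'Bob', 'Joe', 'Mary', 'Steve']
print(row_unique.shape)      # (4, 2)
```

For the question as asked (unique individual names), the call without an axis is the one that matches the desired output.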
Answer by Lisle
For those of us who love all things pandas, apply, and of course lambda functions:
df['Col3'] = df[['Col1', 'Col2']].apply(lambda x: ''.join(x), axis=1)
Answer by smishra
list(set(df[['Col1', 'Col2']].values.reshape((1,-1)).tolist()[0]))  # .values replaces as_matrix(), which was removed in pandas 1.0
The output will be ['Mary', 'Joe', 'Steve', 'Bob', 'Bill'] (the order may vary, since sets are unordered).
Answer by muon
Here's another way:
import numpy as np
# Select the columns of interest first; df.values alone would also include Col3
set(np.concatenate(df[['Col1', 'Col2']].values))

