Python pandas: unique values across multiple columns

Question

Asked by user2333196

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
                   'Col3': np.random.random(5)})

What is the best way to return the unique values of 'Col1' and 'Col2'?

The desired output is

'Bob', 'Joe', 'Bill', 'Mary', 'Steve'

Answer 1

Accepted answer by Alex Riley

pd.unique returns the unique values from an input array, DataFrame column, or index.

The input to this function needs to be one-dimensional, so multiple columns will need to be combined. The simplest way is to select the columns you want and then view the values in a flattened NumPy array. The whole operation looks like this:

>>> pd.unique(df[['Col1', 'Col2']].values.ravel('K'))
array(['Bob', 'Joe', 'Bill', 'Mary', 'Steve'], dtype=object)

Note that ravel() is an array method that returns a flattened view (if possible) of a multidimensional array. The argument 'K' tells the method to flatten the array in the order the elements are stored in memory (pandas typically stores underlying arrays in Fortran-contiguous order; columns before rows). This can be significantly faster than using the method's default 'C' order.
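As a minimal sketch of what the order argument changes, the snippet below builds a small Fortran-contiguous object array by hand (an assumption made so the layout is explicit rather than left to pandas internals) and flattens it both ways. The array name `arr` is illustrative only.

```python
import numpy as np

# A small 3x2 array stored column-by-column in memory (Fortran order).
# This layout is forced here for illustration; it is not read from pandas.
arr = np.asfortranarray(np.array([['Bob', 'Joe'],
                                  ['Joe', 'Steve'],
                                  ['Bill', 'Bob']], dtype=object))

# 'K' follows the memory layout, so the first column comes out first.
memory_order = arr.ravel('K').tolist()

# 'C' walks the array row by row, whatever the memory layout is.
row_order = arr.ravel('C').tolist()

print(memory_order)  # ['Bob', 'Joe', 'Bill', 'Joe', 'Steve', 'Bob']
print(row_order)     # ['Bob', 'Joe', 'Joe', 'Steve', 'Bill', 'Bob']
```

Because pd.unique reports values in order of first appearance, the flattening order also decides the order of the result.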

An alternative way is to select the columns and pass them to np.unique:

>>> np.unique(df[['Col1', 'Col2']].values)
array(['Bill', 'Bob', 'Joe', 'Mary', 'Steve'], dtype=object)

There is no need to use ravel() here as np.unique handles multidimensional arrays. Even so, this is likely to be slower than pd.unique as it uses a sort-based algorithm rather than a hashtable to identify unique values (which is also why its result comes back sorted).
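The two approaches also differ in the order of the result, which a short sketch can make concrete. It uses the question's data without Col3, and flattens with order='F' (an assumption chosen here so the column-by-column order does not depend on how pandas happens to lay out memory).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve']})

values = df[['Col1', 'Col2']].to_numpy()

# Sort-based: the result is in sorted order.
sorted_unique = np.unique(values).tolist()

# Hash-based: the result keeps the order of first appearance.
# order='F' reads Col1 top to bottom, then Col2.
appearance_unique = pd.unique(values.ravel(order='F')).tolist()

print(sorted_unique)      # ['Bill', 'Bob', 'Joe', 'Mary', 'Steve']
print(appearance_unique)  # ['Bob', 'Joe', 'Bill', 'Mary', 'Steve']
```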

The difference in speed is significant for larger DataFrames (especially if there are only a handful of unique values):

>>> df1 = pd.concat([df]*100000, ignore_index=True) # DataFrame with 500000 rows
>>> %timeit np.unique(df1[['Col1', 'Col2']].values)
1 loop, best of 3: 1.12 s per loop

>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel('K'))
10 loops, best of 3: 38.9 ms per loop

>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel()) # ravel using C order
10 loops, best of 3: 49.9 ms per loop
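%timeit is only available inside IPython. For a plain script, a rough equivalent could look like the sketch below, which uses the standard timeit module. The frame size (100,000 rows) and the function names are assumptions for illustration, and unlike the timings above, the column selection is done once outside the timed functions. Absolute numbers will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve']})

# Smaller than the 500,000 rows used above so the script finishes quickly.
big = pd.concat([df] * 20000, ignore_index=True)
values = big[['Col1', 'Col2']].to_numpy()


def with_np_unique():
    # Sort-based; handles the 2-D array directly.
    return np.unique(values)


def with_pd_unique():
    # Hash-based; needs a 1-D input, flattened in memory order.
    return pd.unique(values.ravel('K'))


for fn in (with_np_unique, with_pd_unique):
    # Best of three single runs, reported in milliseconds.
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms")
```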

Answer 2

Answered by Jerome Montino

Non-pandas solution: using set().

import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
                   'Col3': np.random.random(5)})

print(df)

# Series.append was removed in pandas 2.0; pd.concat does the same job
print(set(pd.concat([df.Col1, df.Col2]).values))

Output:

   Col1   Col2      Col3
0   Bob    Joe  0.201079
1   Joe  Steve  0.703279
2  Bill    Bob  0.722724
3  Mary    Bob  0.093912
4   Joe  Steve  0.766027
{'Steve', 'Bob', 'Bill', 'Joe', 'Mary'}

Answer 3

Answered by Mike

I have set up a DataFrame with a few simple strings in its columns:

>>> df
   a  b
0  a  g
1  b  h
2  d  a
3  e  e

You can concatenate the columns you are interested in and call the unique method:

>>> pandas.concat([df['a'], df['b']]).unique()
array(['a', 'b', 'd', 'e', 'g', 'h'], dtype=object)
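The same idea extends to any number of columns. The sketch below wraps it in a small helper; `unique_across_columns` is a hypothetical name introduced here for illustration, not a pandas function.

```python
import pandas as pd


def unique_across_columns(frame, columns):
    """Return the unique values found in the given columns, in order of
    first appearance (first column top to bottom, then the next one).

    Hypothetical helper for illustration; not part of pandas.
    """
    stacked = pd.concat([frame[col] for col in columns], ignore_index=True)
    return stacked.unique().tolist()


df = pd.DataFrame({'a': ['a', 'b', 'd', 'e'],
                   'b': ['g', 'h', 'a', 'e']})

result = unique_across_columns(df, ['a', 'b'])
print(result)  # ['a', 'b', 'd', 'e', 'g', 'h']
```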

Answer 4

Answered by James Little

In [5]: set(df.Col1).union(set(df.Col2))
Out[5]: {'Bill', 'Bob', 'Joe', 'Mary', 'Steve'}

Or:

set(df.Col1) | set(df.Col2)
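Sets do not keep any particular order. If the order of first appearance matters, as in the question's desired output, one plain-Python variant is to use dict keys, which preserve insertion order in Python 3.7+. This is a sketch of a swapped-in technique, not what the answer above does.

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve']})

# dict.fromkeys drops duplicates while keeping the first occurrence of each
# value; chain walks Col1 top to bottom, then Col2.
ordered_unique = list(dict.fromkeys(chain(df.Col1, df.Col2)))

print(ordered_unique)  # ['Bob', 'Joe', 'Bill', 'Mary', 'Steve']
```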

Answer 5

Answered by erikreed

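An updated note for numpy v1.13+: np.unique accepts an axis argument. With multiple columns, axis=0 returns the unique rows (the distinct Col1/Col2 pairs); without it the array is implicitly flattened, which yields the unique values across both columns.

import numpy as np

# unique rows; cast to str first because np.unique may reject axis= for object-dtype arrays
np.unique(df[['Col1', 'Col2']].to_numpy().astype(str), axis=0)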

This change was introduced Nov 2016: https://github.com/numpy/numpy/commit/1f764dbff7c496d6636dc0430f083ada9ff4e4be
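To make the difference between the two calls concrete, here is a small sketch on the question's data. The cast to a fixed-width string dtype is an assumption made so that axis=0 works; object-dtype arrays are typically rejected when an axis is given.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve']})

# Fixed-width string array instead of object dtype.
values = df[['Col1', 'Col2']].to_numpy().astype(str)

# axis=0: unique rows, i.e. distinct (Col1, Col2) pairs.
unique_rows = np.unique(values, axis=0)

# No axis: the array is flattened, giving unique values across both columns.
unique_values = np.unique(values)

print(unique_rows.shape)       # (4, 2), since ('Joe', 'Steve') appears twice
print(unique_values.tolist())  # ['Bill', 'Bob', 'Joe', 'Mary', 'Steve']
```

Only the second call answers the original question; the first is useful when the distinct combinations of the two columns are wanted.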

Answer 6

Answered by Lisle

For those of us who love all things pandas, apply, and of course lambda functions:

df['Col3'] = df[['Col1', 'Col2']].apply(lambda x: ''.join(x), axis=1)

Answer 7

Answered by smishra

list(set(df[['Col1', 'Col2']].to_numpy().reshape((1,-1)).tolist()[0]))

The output will be ['Mary', 'Joe', 'Steve', 'Bob', 'Bill'] (the order is arbitrary, since sets are unordered; as_matrix() was removed in pandas 1.0, so to_numpy() is used here instead).

Answer 8

Answered by muon

Here's another way:

import numpy as np
# select the columns of interest first, otherwise the floats in Col3 are included too
set(np.concatenate(df[['Col1', 'Col2']].values))
