Pandas：按行从 DataFrame 的特定列中选择值

Question

提问by Jason Sanchez

Given a DataFrame with multiple columns, how do we select values from specific columns by row to create a new Series?

给定一个具有多列的 DataFrame，我们如何从特定的列中逐行选择值来创建一个新的系列？

df = pd.DataFrame({"A":[1,2,3,4], 
                   "B":[10,20,30,40], 
                   "C":[100,200,300,400]})
columns_to_select = ["B", "A", "A", "C"]

Goal: [10, 2, 3, 400]

目标： [10, 2, 3, 400]

One method that works is to use an apply statement.

一种有效的方法是使用 apply 语句。

df["cols"] = columns_to_select
df.apply(lambda x: x[x.cols], axis=1)

Unfortunately, this is not a vectorized operation and takes a long time on a large dataset. Any ideas would be appreciated.

不幸的是，这不是矢量化操作，在大型数据集上需要很长时间。任何想法，将不胜感激。

Answer 1

回答by MaxU

Pandas approach:

Pandas方法：

In [22]: df['new'] = df.lookup(df.index, columns_to_select)

In [23]: df
Out[23]:
   A   B    C  new
0  1  10  100   10
1  2  20  200    2
2  3  30  300    3
3  4  40  400  400

Answer 2

回答by Divakar

NumPy way

NumPy 方式

Here's a vectorized NumPy way using advanced indexing-

这是使用的矢量化 NumPy 方式advanced indexing-

# Extract array data
In [10]: a = df.values

# Get integer based column IDs
In [11]: col_idx = np.searchsorted(df.columns, columns_to_select)

# Use NumPy's advanced indexing to extract relevant elem per row
In [12]: a[np.arange(len(col_idx)), col_idx]
Out[12]: array([ 10,   2,   3, 400])

If column names of dfare not sorted, we need to use sorterargument with np.searchsorted. The code to extract col_idxfor such a generic dfwould be :

如果的列名df未排序，我们需要使用sorter带有的参数np.searchsorted。col_idx为这种泛型提取的代码df是：

# https://stackoverflow.com/a/38489403/ @Divakar
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols,query_cols,sorter=sidx)]

So, col_idxwould be obtained like so -

所以，col_idx会像这样获得 -

col_idx = column_index(df, columns_to_select)

Further optimization

进一步优化

Profiling it revealed that the bottleneck was processing strings with np.searchsorted, the usual NumPy weakness of not being so great with strings. So, to overcome that and using the special case scenario of column names being single letters, we could quickly convert those to numerals and then feed those to searchsortedfor much faster processing.

对其进行分析表明瓶颈在于处理字符串np.searchsorted，这是 NumPy 的常见弱点，即对字符串不太好。因此，为了克服这个问题，并使用列名称为单个字母的特殊情况，我们可以快速将它们转换为数字，然后将它们提供给以searchsorted进行更快的处理。

Thus, an optimized version of getting the integer based column IDs, for the case where the column names are single letters and sorted, would be -

因此，对于列名是单个字母并已排序的情况，获取基于整数的列 ID 的优化版本将是 -

def column_index_singlechar_sorted(df, query_cols):
    c0 = np.fromstring(''.join(df.columns), dtype=np.uint8)
    c1 = np.fromstring(''.join(query_cols), dtype=np.uint8)
    return np.searchsorted(c0, c1)

This, gives us a modified version of the solution, like so -

这为我们提供了解决方案的修改版本，如下所示 -

a = df.values
col_idx = column_index_singlechar_sorted(df, columns_to_select)
out = pd.Series(a[np.arange(len(col_idx)), col_idx])

Timings -

时间 -

In [149]: # Setup df with 26 uppercase column letters and many rows
     ...: import string
     ...: df = pd.DataFrame(np.random.randint(0,9,(1000000,26)))
     ...: s = list(string.uppercase[:df.shape[1]])
     ...: df.columns = s
     ...: idx = np.random.randint(0,df.shape[1],len(df))
     ...: columns_to_select = np.take(s, idx).tolist()

# With df.lookup from @MaxU's soln
In [150]: %timeit pd.Series(df.lookup(df.index, columns_to_select))
10 loops, best of 3: 76.7 ms per loop

# With proposed one from this soln
In [151]: %%timeit
     ...: a = df.values
     ...: col_idx = column_index_singlechar_sorted(df, columns_to_select)
     ...: out = pd.Series(a[np.arange(len(col_idx)), col_idx])
10 loops, best of 3: 59 ms per loop

Given that df.lookupsolves for a generic case, that's a probably a better choice, but the other possible optimizations as shown in this post could be handy as well!

鉴于它df.lookup解决了一般情况，这可能是一个更好的选择，但本文中显示的其他可能的优化也可能很方便！

Pandas：按行从 DataFrame 的特定列中选择值

提问by Jason Sanchez

回答by MaxU

回答by Divakar

相关推荐

最近更新

标签

Pandas：按行从 DataFrame 的特定列中选择值

提问by Jason Sanchez

回答by MaxU

回答by Divakar

相关推荐

如何重新排列/重新排序（不一定排序）pandas 数据帧索引？

pandas zip 参数 #1 在循环 zip 时必须支持迭代错误

Pandas - 具有基于行索引的条件的 lambda 函数

Pandas 按工作日分组 (M/T/W/T/F/S/S)

相关推荐

最近更新

标签