Pandas:按行从 DataFrame 的特定列中选择值

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/47997566/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 04:58:14  来源:igfitidea点击:

Pandas: Select values from specific columns of a DataFrame by row

pythonpandasnumpyindexing

提问by Jason Sanchez

Given a DataFrame with multiple columns, how do we select values from specific columns by row to create a new Series?

给定一个具有多列的 DataFrame,我们如何从特定的列中逐行选择值来创建一个新的系列?

df = pd.DataFrame({"A":[1,2,3,4], 
                   "B":[10,20,30,40], 
                   "C":[100,200,300,400]})
columns_to_select = ["B", "A", "A", "C"]

Goal: [10, 2, 3, 400]

目标: [10, 2, 3, 400]

One method that works is to use an apply statement.

一种有效的方法是使用 apply 语句。

df["cols"] = columns_to_select
df.apply(lambda x: x[x.cols], axis=1)

Unfortunately, this is not a vectorized operation and takes a long time on a large dataset. Any ideas would be appreciated.

不幸的是,这不是矢量化操作,在大型数据集上需要很长时间。任何想法,将不胜感激。

回答by MaxU

Pandas approach:

Pandas方法

In [22]: df['new'] = df.lookup(df.index, columns_to_select)

In [23]: df
Out[23]:
   A   B    C  new
0  1  10  100   10
1  2  20  200    2
2  3  30  300    3
3  4  40  400  400

回答by Divakar

NumPy way

NumPy 方式

Here's a vectorized NumPy way using advanced indexing-

这是使用的矢量化 NumPy 方式advanced indexing-

# Extract array data
In [10]: a = df.values

# Get integer based column IDs
In [11]: col_idx = np.searchsorted(df.columns, columns_to_select)

# Use NumPy's advanced indexing to extract relevant elem per row
In [12]: a[np.arange(len(col_idx)), col_idx]
Out[12]: array([ 10,   2,   3, 400])

If column names of dfare not sorted, we need to use sorterargument with np.searchsorted. The code to extract col_idxfor such a generic dfwould be :

如果 的列名df未排序,我们需要使用sorter带有 的参数np.searchsortedcol_idx为这种泛型提取的代码df是:

# https://stackoverflow.com/a/38489403/ @Divakar
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols,query_cols,sorter=sidx)]

So, col_idxwould be obtained like so -

所以,col_idx会像这样获得 -

col_idx = column_index(df, columns_to_select)

Further optimization

进一步优化

Profiling it revealed that the bottleneck was processing strings with np.searchsorted, the usual NumPy weakness of not being so great with strings. So, to overcome that and using the special case scenario of column names being single letters, we could quickly convert those to numerals and then feed those to searchsortedfor much faster processing.

对其进行分析表明瓶颈在于处理字符串np.searchsorted,这是 NumPy 的常见弱点,即对字符串不太好。因此,为了克服这个问题,并使用列名称为单个字母的特殊情况,我们可以快速将它们转换为数字,然后将它们提供给以searchsorted进行更快的处理。

Thus, an optimized version of getting the integer based column IDs, for the case where the column names are single letters and sorted, would be -

因此,对于列名是单个字母并已排序的情况,获取基于整数的列 ID 的优化版本将是 -

def column_index_singlechar_sorted(df, query_cols):
    c0 = np.fromstring(''.join(df.columns), dtype=np.uint8)
    c1 = np.fromstring(''.join(query_cols), dtype=np.uint8)
    return np.searchsorted(c0, c1)

This, gives us a modified version of the solution, like so -

这为我们提供了解决方案的修改版本,如下所示 -

a = df.values
col_idx = column_index_singlechar_sorted(df, columns_to_select)
out = pd.Series(a[np.arange(len(col_idx)), col_idx])

Timings -

时间 -

In [149]: # Setup df with 26 uppercase column letters and many rows
     ...: import string
     ...: df = pd.DataFrame(np.random.randint(0,9,(1000000,26)))
     ...: s = list(string.uppercase[:df.shape[1]])
     ...: df.columns = s
     ...: idx = np.random.randint(0,df.shape[1],len(df))
     ...: columns_to_select = np.take(s, idx).tolist()

# With df.lookup from @MaxU's soln
In [150]: %timeit pd.Series(df.lookup(df.index, columns_to_select))
10 loops, best of 3: 76.7 ms per loop

# With proposed one from this soln
In [151]: %%timeit
     ...: a = df.values
     ...: col_idx = column_index_singlechar_sorted(df, columns_to_select)
     ...: out = pd.Series(a[np.arange(len(col_idx)), col_idx])
10 loops, best of 3: 59 ms per loop

Given that df.lookupsolves for a generic case, that's a probably a better choice, but the other possible optimizations as shown in this post could be handy as well!

鉴于它df.lookup解决了一般情况,这可能是一个更好的选择,但本文中显示的其他可能的优化也可能很方便!