Efficiently find matching rows (based on content) in a pandas DataFrame

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/25028113/

Date: 2020-09-13 22:18:23 | Source: igfitidea


Tags: python, pandas, dataframe

Asked by sanguineturtle

I am writing some tests and I am using pandas DataFrames to house a large dataset ~(600,000 x 10). I have extracted 10 random rows from the source data (using Stata) and now I want to write a test to see if those rows are in the DataFrame in my test suite.


As a small example


import numpy as np
import pandas as pd

np.random.seed(2)
raw_data = pd.DataFrame(np.random.rand(5, 3), columns=['one', 'two', 'three'])
random_sample = raw_data.iloc[1]  # .ix is deprecated; .iloc selects by position

Here raw_data is:


[image: screenshot of the raw_data DataFrame]

And random_sample is derived to guarantee a match and is:


[image: screenshot of the random_sample Series]

Currently I have written:


for idx, row in raw_data.iterrows():
    if random_sample.equals(row):
        print("match")
        break

This works, but on the large dataset it is very slow. Is there a more efficient way to check whether an entire row is contained in the DataFrame?


BTW: my example also needs to be able to compare np.NaN equality, which is why I am using the equals() method.

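To make the NaN point concrete, a minimal sketch (not from the original question) of why a plain `==` comparison misses NaN while `equals()` does not:

```python
import numpy as np
import pandas as pd

# NaN never compares equal to itself with ==, so an element-wise
# comparison alone can miss rows that contain NaN.
print(np.nan == np.nan)  # False

s1 = pd.Series([1.0, np.nan])
s2 = pd.Series([1.0, np.nan])
print((s1 == s2).all())  # False: the NaN positions compare unequal
print(s1.equals(s2))     # True: equals() treats aligned NaNs as equal
```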

Answered by DSM

equals doesn't seem to broadcast, but we can always do the equality comparison manually:


>>> df = pd.DataFrame(np.random.rand(600000, 10))
>>> sample = df.iloc[-1]
>>> %timeit df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
1 loops, best of 3: 231 ms per loop
>>> df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
              0         1         2         3         4         5         6  \
599999  0.07832  0.064828  0.502513  0.851816  0.976464  0.761231  0.275242   

               7        8         9  
599999  0.426393  0.91632  0.569807  

which is much faster than the iterative version for me (which takes > 30s.)

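If only a yes/no containment test is needed (rather than the matching rows themselves), the same mask reduces with `.any()` — a small sketch along the lines of the answer above, on a smaller frame:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.rand(1000, 10))
sample = df.iloc[-1]

# Row-wise mask: a cell matches if the values are equal or both NaN.
mask = ((df == sample) | (df.isnull() & sample.isnull())).all(axis=1)
print(mask.any())  # True: the sample row is present in df
```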

But since we have lots of rows and relatively few columns, we could loop over the columns, and in the typical case probably cut down substantially on the number of rows to be looked at. For example, something like


def finder(df, row):
    # Filter column by column: each pass keeps only the rows that match
    # `row` in that column, treating NaN as equal to NaN.
    for col in df:
        df = df.loc[(df[col] == row[col]) | (df[col].isnull() & pd.isnull(row[col]))]
    return df

gives me


>>> %timeit finder(df, sample)
10 loops, best of 3: 35.2 ms per loop

which is roughly an order of magnitude faster, because after the first column there's only one row left.


(I think I once had a much slicker way to do this but for the life of me I can't remember it now.)


Answered by sanguineturtle

The best I have come up with is to take a filtering approach which seems to work quite well and prevents a lot of comparisons when the dataset is large:


tmp = raw_data
for idx, val in random_sample.items():  # .iteritems() is deprecated in recent pandas
    try:
        if np.isnan(val):
            continue  # skip NaN values: == would never match them
    except TypeError:
        pass  # non-numeric values cannot be passed to np.isnan
    tmp = tmp[tmp[idx] == val]
if len(tmp) == 1:
    print("match")

Note: this is actually slower for the small example above, but on a large dataset it is ~9 times faster than the basic iteration.

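A hypothetical alternative not mentioned in the answers: pandas `merge` matches NaN join keys against NaN join keys, so a merge with `indicator=True` also satisfies the NaN-equality requirement. A sketch, assuming all columns are used as join keys:

```python
import numpy as np
import pandas as pd

np.random.seed(2)
raw_data = pd.DataFrame(np.random.rand(5, 3), columns=['one', 'two', 'three'])
random_sample = raw_data.iloc[1]

# Merge on all columns; rows present in both frames are flagged 'both'.
merged = raw_data.merge(random_sample.to_frame().T, how='left', indicator=True)
print((merged['_merge'] == 'both').any())  # True: the sample row was found
```

This is a single vectorized call, but the column-by-column filter above may still be faster when the first column already eliminates most rows.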