Efficiently find matching rows (based on content) in a pandas DataFrame

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/25028113/

Date: 2020-09-13 22:18:23 | Source: igfitidea


Tags: python, pandas, dataframe

Asked by sanguineturtle

I am writing some tests and I am using pandas DataFrames to house a large dataset ~(600,000 x 10). I have extracted 10 random rows from the source data (using Stata) and now I want to write a test to see if those rows are in the DataFrame in my test suite.


As a small example


import numpy as np
import pandas as pd

np.random.seed(2)
raw_data = pd.DataFrame(np.random.rand(5, 3), columns=['one', 'two', 'three'])
random_sample = raw_data.iloc[1]  # .ix is deprecated; .iloc selects by position

Here raw_data is:


[image: screenshot of the raw_data DataFrame]

And random_sample is derived to guarantee a match and is:


[image: screenshot of the random_sample Series]

Currently I have written:


for idx, row in raw_data.iterrows():
    if random_sample.equals(row):
        print("match")
        break

This works, but on the large dataset it is very slow. Is there a more efficient way to check whether an entire row is contained in the DataFrame?


BTW: my example also needs to be able to compare np.NaN equality, which is why I am using the equals() method.

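To make the NaN point concrete, a minimal sketch (not from the original question) of why a plain `==` comparison misses NaN while `equals()` does not:

```python
import numpy as np
import pandas as pd

# NaN never compares equal to itself with ==, so an element-wise
# comparison alone can miss rows that contain NaN.
print(np.nan == np.nan)  # False

s1 = pd.Series([1.0, np.nan])
s2 = pd.Series([1.0, np.nan])
print((s1 == s2).all())  # False: the NaN positions compare unequal
print(s1.equals(s2))     # True: equals() treats aligned NaNs as equal
```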

Answered by DSM

equals doesn't seem to broadcast, but we can always do the equality comparison manually:


>>> df = pd.DataFrame(np.random.rand(600000, 10))
>>> sample = df.iloc[-1]
>>> %timeit df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
1 loops, best of 3: 231 ms per loop
>>> df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
              0         1         2         3         4         5         6  \
599999  0.07832  0.064828  0.502513  0.851816  0.976464  0.761231  0.275242   

               7        8         9  
599999  0.426393  0.91632  0.569807  

which is much faster than the iterative version for me (which takes > 30s.)

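If only a yes/no containment test is needed (rather than the matching rows themselves), the same mask reduces with `.any()` — a small sketch along the lines of the answer above, on a smaller frame:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.rand(1000, 10))
sample = df.iloc[-1]

# Row-wise mask: a cell matches if the values are equal or both NaN.
mask = ((df == sample) | (df.isnull() & sample.isnull())).all(axis=1)
print(mask.any())  # True: the sample row is present in df
```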

But since we have lots of rows and relatively few columns, we could loop over the columns, and in the typical case probably cut down substantially on the number of rows to be looked at. For example, something like


def finder(df, row):
    # Filter column by column: each pass keeps only the rows that match
    # `row` in that column, treating NaN as equal to NaN.
    for col in df:
        df = df.loc[(df[col] == row[col]) | (df[col].isnull() & pd.isnull(row[col]))]
    return df

gives me


>>> %timeit finder(df, sample)
10 loops, best of 3: 35.2 ms per loop

which is roughly an order of magnitude faster, because after the first column there's only one row left.


(I think I once had a much slicker way to do this but for the life of me I can't remember it now.)


Answered by sanguineturtle

The best I have come up with is to take a filtering approach which seems to work quite well and prevents a lot of comparisons when the dataset is large:


tmp = raw_data
for idx, val in random_sample.items():  # .iteritems() is deprecated in recent pandas
    try:
        if np.isnan(val):
            continue  # skip NaN values: == would never match them
    except TypeError:
        pass  # non-numeric values cannot be passed to np.isnan
    tmp = tmp[tmp[idx] == val]
if len(tmp) == 1:
    print("match")

Note: this is actually slower for the small example above, but on a large dataset it is ~9 times faster than the basic iteration.

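A hypothetical alternative not mentioned in the answers: pandas `merge` matches NaN join keys against NaN join keys, so a merge with `indicator=True` also satisfies the NaN-equality requirement. A sketch, assuming all columns are used as join keys:

```python
import numpy as np
import pandas as pd

np.random.seed(2)
raw_data = pd.DataFrame(np.random.rand(5, 3), columns=['one', 'two', 'three'])
random_sample = raw_data.iloc[1]

# Merge on all columns; rows present in both frames are flagged 'both'.
merged = raw_data.merge(random_sample.to_frame().T, how='left', indicator=True)
print((merged['_merge'] == 'both').any())  # True: the sample row was found
```

This is a single vectorized call, but the column-by-column filter above may still be faster when the first column already eliminates most rows.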