Disclaimer: this page is a Chinese-English bilingual translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/37557131/

Time: 2020-08-19 19:33:21  Source: igfitidea

Python Pandas Only Compare Identically Labeled DataFrame Objects

Tags: python, pandas, numpy

Asked by David Crook

I tried all the solutions here: Pandas "Can only compare identically-labeled DataFrame objects" error

None of them worked for me. Here's what I've got: two data frames. One is a set of financial data that already exists in the system; the other is a new set, some of which may or may not already exist in the system. I need to find the difference and add the rows that don't exist yet.

Here is the code:

import pandas as pd
import numpy as np
from azure.storage.blob import AppendBlobService, PublicAccess, ContentSettings
from io import StringIO

dataUrl = "http://ichart.finance.yahoo.com/table.csv?s=MSFT"
blobUrlBase = "https://pyjobs.blob.core.windows.net/"
data = pd.read_csv(dataUrl)

abs = AppendBlobService(account_name='pyjobs', account_key='***')
abs.create_container("stocks", public_access = PublicAccess.Container)
abs.append_blob_from_text('stocks', 'msft', data[:25].to_csv(index=False))
existing = pd.read_csv(StringIO(abs.get_blob_to_text('stocks', 'msft').content))

ne = (data != existing).any(1)

The failing code is the final line. I was following an article on determining differences between data frames.

I checked the dtypes on all columns; they appear to be the same. I also compared the output side by side, sorted the axis and the indices, dropped the indices, etc. I still get that bloody error.

Here is the first row of existing and of data:

>>> existing[:1]
         Date       Open   High    Low  Close    Volume  Adj Close
0  2016-05-27  51.919998  52.32  51.77  52.32  17653700      52.32
>>> data[:1]
         Date       Open   High    Low  Close    Volume  Adj Close
0  2016-05-27  51.919998  52.32  51.77  52.32  17653700      52.32

Here is the exact error I receive:

>>> ne = (data != existing).any(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1169, in f
    return self._compare_frame(other, func, str_rep)
  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3571, in _compare_frame
    raise ValueError('Can only compare identically-labeled '
ValueError: Can only compare identically-labeled DataFrame objects
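
For reference, the error can be reproduced with two small frames whose row labels differ. This is only a sketch with made-up data: `full` and `subset` are hypothetical stand-ins for `data` (every downloaded row) and `existing` (the first 25 rows), on the assumption that the length mismatch is what trips the comparison here.

```python
import pandas as pd

# Hypothetical stand-ins for `data` (all rows) and `existing` (a subset).
full = pd.DataFrame({"Close": [52.32, 51.89, 52.12]})
subset = full[:2]

# The row labels differ: 0..2 versus 0..1.
print(full.index.equals(subset.index))

try:
    (full != subset).any(axis=1)
    raised = False
except ValueError as exc:
    # The != operator refuses to compare frames with different labels.
    raised = True
    print(exc)
```

Checking `index.equals()` and `columns.equals()` on the two frames is a quick way to see which axis is mismatched before comparing.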

Accepted answer by piRSquared

In order to get around this, you want to compare the underlying numpy arrays.

import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'], index=['One', 'Two'])
df2 = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'], index=['one', 'two'])


df1.values == df2.values

array([[ True,  True],
       [ True,  True]], dtype=bool)
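
Note that comparing `.values` ignores the labels entirely and compares by position, so it is only meaningful when both frames have the same shape and row order. A minimal sketch under that assumption, reusing the frames above with one changed cell (the `differs` name is illustrative):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"], index=["One", "Two"])
df2 = pd.DataFrame([[1, 2], [3, 5]], columns=["a", "b"], index=["one", "two"])

# Positional comparison only makes sense for equally shaped frames.
assert df1.shape == df2.shape

# Row-wise flag: True where any cell in that row differs.
differs = (df1.values != df2.values).any(axis=1)
print(differs)
```

If the shapes differ, as they do for `data` and `existing` in the question, the frames have to be trimmed or aligned first.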

Answer by danger89

If you want to compare two DataFrames, check out the flexible comparison methods in pandas, such as .eq(), .ne(), .gt() and more (equal, not equal, greater than).

Example:

df['new_col'] = df.gt(df_1)

http://pandas.pydata.org/pandas-docs/stable/basics.html#flexible-comparisons
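
As far as I understand the flexible methods, they align the two frames on their labels instead of raising, and a cell that exists in only one frame is compared against NaN. A small sketch with made-up frames (`left` and `right` are hypothetical names):

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3]})
right = pd.DataFrame({"A": [1, 2]})

# .ne() aligns on index and columns; row 2 is missing from `right`,
# so it is compared against NaN and counts as "not equal".
changed = left.ne(right).any(axis=1)
print(changed)
```

Rows present in only one frame show up as changed, which may or may not be what you want.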

Answer by David Crook

I replicated the problem with some fake data to reach my end goal of removing duplicates. Note that this does not answer the original question; it solves what I was actually trying to do when I ran into the error.

import numpy as np
import pandas as pd

b = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                  'B': ['B4', 'B5', 'B6', 'B7'],
                  'C': ['C4', 'C5', 'C6', 'C7'],
                  'D': ['D4', 'D5', 'D6', 'D7']},
                 index=[4, 5, 6, 7])

c = pd.DataFrame({'A': ['A7', 'A8', 'A9', 'A10', 'A11'],
                  'B': ['B7', 'B8', 'B9', 'B10', 'B11'],
                  'C': ['C7', 'C8', 'C9', 'C10', 'C11'],
                  'D': ['D7', 'D8', 'D9', 'D10', 'D11']},
                 index=[7, 8, 9, 10, 11])

# Stack both frames, then keep the first row for each distinct value in column "A".
result = pd.concat([b, c])
idx = np.unique(result["A"], return_index=True)[1]
# DataFrame.sort() was removed from pandas; sort_index() orders the rows by index label.
result.iloc[idx].sort_index()
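
For the stated goal (adding only the rows that are not already stored), pandas also has built-in tools that may be simpler than going through `np.unique`. A hedged sketch, assuming both frames share the same columns; the small `existing` and `incoming` frames below are invented for illustration:

```python
import pandas as pd

existing = pd.DataFrame({"A": ["A6", "A7"], "B": ["B6", "B7"]}, index=[6, 7])
incoming = pd.DataFrame({"A": ["A7", "A8"], "B": ["B7", "B8"]}, index=[7, 8])

# Option 1: stack both frames and keep the first row for each value of "A".
combined = pd.concat([existing, incoming]).drop_duplicates(subset="A")

# Option 2: anti-join. Rows tagged "left_only" are in `incoming` but not in `existing`.
# merge() joins on all shared columns by default and resets the index.
tagged = incoming.merge(existing, how="left", indicator=True)
new_rows = tagged.loc[tagged["_merge"] == "left_only", ["A", "B"]]

print(combined)
print(new_rows)
```

Option 2 gives just the rows to append, which is closer to "find the difference and add what doesn't exist" than deduplicating after the fact.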