Python Pandas: Can only compare identically-labeled DataFrame objects

Question

Asked by David Crook

I tried all the solutions here: Pandas "Can only compare identically-labeled DataFrame objects" error

None of them worked for me. Here's what I've got. I have two data frames. One is a set of financial data that already exists in the system; the other contains rows that may or may not exist in the system yet. I need to find the difference and add the rows that don't exist.

Here is the code:

import pandas as pd
import numpy as np
from azure.storage.blob import AppendBlobService, PublicAccess, ContentSettings
from io import StringIO

dataUrl = "http://ichart.finance.yahoo.com/table.csv?s=MSFT"
blobUrlBase = "https://pyjobs.blob.core.windows.net/"
data = pd.read_csv(dataUrl)

abs = AppendBlobService(account_name='pyjobs', account_key='***')
abs.create_container("stocks", public_access = PublicAccess.Container)
abs.append_blob_from_text('stocks', 'msft', data[:25].to_csv(index=False))
existing = pd.read_csv(StringIO(abs.get_blob_to_text('stocks', 'msft').content))

ne = (data != existing).any(1)

The failing code is the final line. I was working through an article on determining differences between data frames.

I checked the dtypes on all columns, and they appear to be the same. I also did a side-by-side output, sorted the axis, sorted the indices, dropped the indices, etc. I still get that bloody error.

Here is the first row of existing and of data:

>>> existing[:1]
         Date       Open   High    Low  Close    Volume  Adj Close
0  2016-05-27  51.919998  52.32  51.77  52.32  17653700      52.32
>>> data[:1]
         Date       Open   High    Low  Close    Volume  Adj Close
0  2016-05-27  51.919998  52.32  51.77  52.32  17653700      52.32

Here is the exact error I receive:

>>> ne = (data != existing).any(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1169, in f
    return self._compare_frame(other, func, str_rep)
  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3571, in _compare_frame
    raise ValueError('Can only compare identically-labeled '
ValueError: Can only compare identically-labeled DataFrame objects

Answer 1

Accepted answer by piRSquared

In order to get around this, you want to compare the underlying numpy arrays.

import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'], index=['One', 'Two'])
df2 = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'], index=['one', 'two'])


df1.values == df2.values

array([[ True,  True],
       [ True,  True]], dtype=bool)
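
As a minimal sketch of why this works: `==` and `!=` between two DataFrames require identical index and column labels, while `.values` drops the labels and compares by position. A positional comparison only makes sense when both frames have the same shape, so the helper `values_differ` below (a hypothetical name, not part of pandas) checks that first. The frames are made-up data with one cell changed.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'], index=['One', 'Two'])
df2 = pd.DataFrame([[1, 2], [3, 5]], columns=['a', 'b'], index=['one', 'two'])


def values_differ(left, right):
    """Hypothetical helper: per-row flag, True where any cell differs by position."""
    if left.shape != right.shape:
        raise ValueError(f"shape mismatch: {left.shape} vs {right.shape}")
    return (left.values != right.values).any(axis=1)


# The label-based operator raises because the labels are not identical.
try:
    df1 != df2
    raised = False
except ValueError:
    raised = True

# The positional comparison ignores labels entirely.
row_changed = values_differ(df1, df2)
print(raised, row_changed)
```

In the question, `existing` was written from `data[:25]`, so the two frames likely differ in length as well as in labels; in that case the shape check above would fail and the frames would need to be trimmed or matched by key before comparing.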

Answer 2

Answered by danger89

If you want to compare two DataFrames, check out the flexible comparison methods in pandas, such as .eq(), .ne(), .gt() and more (equal, not equal, greater than).

Example:

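# Element-wise "greater than"; the result is a boolean DataFrame,
# so it is stored in its own variable rather than in a single column
mask = df.gt(df_1)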

http://pandas.pydata.org/pandas-docs/stable/basics.html#flexible-comparisons
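
A small sketch of how the method form behaves when the labels do not match, using made-up frames `left` and `right`. Unlike the `!=` operator, `.ne()` and `.gt()` align both frames on the union of their labels instead of raising; cells missing on one side become NaN, and NaN never compares as equal or greater.

```python
import pandas as pd

left = pd.DataFrame({'x': [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({'x': [1, 5]}, index=[0, 1])

# `left != right` would raise ValueError here because the indexes differ.
# The method form aligns on the union of labels, so row 2 (absent from
# `right`) is compared against NaN.
not_equal = left.ne(right)
greater = left.gt(right)

print(not_equal['x'].tolist())  # [False, True, True]
print(greater['x'].tolist())    # [False, False, False]
```

Because alignment is by label, this approach suits frames that share a meaningful index; for frames whose rows merely happen to be in the same order, a positional comparison may be closer to what is wanted.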

Answer 3

Answered by David Crook

I reproduced the situation with some fake data to reach my actual goal, which was removing duplicates. Note that this does not answer the original question; it answers what I was trying to do when I ran into the error.

import numpy as np
import pandas as pd

# First frame (index 4-7)
b = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                  'B': ['B4', 'B5', 'B6', 'B7'],
                  'C': ['C4', 'C5', 'C6', 'C7'],
                  'D': ['D4', 'D5', 'D6', 'D7']},
                 index=[4, 5, 6, 7])

# Second frame (index 7-11); row 7 repeats the last row of b
c = pd.DataFrame({'A': ['A7', 'A8', 'A9', 'A10', 'A11'],
                  'B': ['B7', 'B8', 'B9', 'B10', 'B11'],
                  'C': ['C7', 'C8', 'C9', 'C10', 'C11'],
                  'D': ['D7', 'D8', 'D9', 'D10', 'D11']},
                 index=[7, 8, 9, 10, 11])

# Stack both frames, then keep the first occurrence of each value in column A
result = pd.concat([b, c])
idx = np.unique(result["A"], return_index=True)[1]
# DataFrame.sort() was removed from pandas; sort_index() restores index order
result.iloc[idx].sort_index()
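
For comparison, here is a sketch of the same goal using only pandas calls, under the assumption that a row counts as a duplicate when every column matches. The frame names `existing` and `incoming` and their contents are illustrative. `drop_duplicates()` gives the combined, de-duplicated table, and a left merge with `indicator=True` isolates the rows that are new.

```python
import pandas as pd

existing = pd.DataFrame({'A': ['A6', 'A7'], 'B': ['B6', 'B7']})
incoming = pd.DataFrame({'A': ['A7', 'A8'], 'B': ['B7', 'B8']})

# Stack both frames, then keep the first occurrence of each identical row.
combined = pd.concat([existing, incoming], ignore_index=True).drop_duplicates()

# Anti-join: rows of `incoming` with no exact match in `existing`.
# With no `on` argument, merge joins on all shared columns (A and B here).
merged = incoming.merge(existing, how='left', indicator=True)
new_rows = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')

print(combined)
print(new_rows)
```

The anti-join form may be the better fit when only the missing rows need to be appended to storage, since it never touches rows that are already there.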
