Python Pandas Only Compare Identically Labeled DataFrame Objects
Note: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/37557131/
Asked by David Crook
I tried all the solutions here: Pandas "Can only compare identically-labeled DataFrame objects" error
None of them worked for me. Here's what I've got: I have two data frames. One is a set of financial data that already exists in the system; the other is a set in which some rows may or may not already exist in the system. I need to find the difference and add the rows that don't exist yet.
Here is the code:
import pandas as pd
import numpy as np
from azure.storage.blob import AppendBlobService, PublicAccess, ContentSettings
from io import StringIO
dataUrl = "http://ichart.finance.yahoo.com/table.csv?s=MSFT"
blobUrlBase = "https://pyjobs.blob.core.windows.net/"
data = pd.read_csv(dataUrl)
abs = AppendBlobService(account_name='pyjobs', account_key='***')
abs.create_container("stocks", public_access = PublicAccess.Container)
abs.append_blob_from_text('stocks', 'msft', data[:25].to_csv(index=False))
existing = pd.read_csv(StringIO(abs.get_blob_to_text('stocks', 'msft').content))
ne = (data != existing).any(1)
The failing code is the final line. I was following an article on determining the differences between data frames.
I checked the dtypes of all columns, and they appear to be the same. I also compared the output side by side, sorted the axes, sorted the indices, dropped the indices, and so on. I still get that bloody error.
Here is the first row of existing and of data:
>>> existing[:1]
Date Open High Low Close Volume Adj Close
0 2016-05-27 51.919998 52.32 51.77 52.32 17653700 52.32
>>> data[:1]
Date Open High Low Close Volume Adj Close
0 2016-05-27 51.919998 52.32 51.77 52.32 17653700 52.32
Here is the exact error I receive:
>>> ne = (data != existing).any(1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1169, in f
return self._compare_frame(other, func, str_rep)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3571, in _compare_frame
raise ValueError('Can only compare identically-labeled '
ValueError: Can only compare identically-labeled DataFrame objects
Accepted answer by piRSquared
To get around this, compare the underlying numpy arrays.
import pandas as pd
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'], index=['One', 'Two'])
df2 = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'], index=['one', 'two'])
df1.values == df2.values
array([[ True, True],
[ True, True]], dtype=bool)
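A minimal sketch of why the operator form fails and what the array comparison assumes. The second frame here has one value changed (5 instead of 4) purely for illustration. Note that comparing `.values` ignores labels entirely, so it only makes sense when both frames have the same shape; in the question, `existing` holds 25 rows while `data` holds the full download, so the shapes would need to be reconciled first.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'], index=['One', 'Two'])
# Illustrative frame: different labels, and one differing value (5 vs 4)
df2 = pd.DataFrame([[1, 2], [3, 5]], columns=['a', 'b'], index=['one', 'two'])

# The != operator insists on identical index and column labels
try:
    df1 != df2
    raised = False
except ValueError:
    raised = True

# Comparing the raw arrays ignores labels, but requires equal shapes
if df1.shape == df2.shape:
    differs = (df1.values != df2.values).any(axis=1)

print(raised)   # whether the labelled comparison raised
print(differs)  # one boolean per row: does any column differ?
```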
Answered by danger89
If you want to compare two DataFrames, check out the flexible comparison methods in pandas, such as .eq(), .ne(), .gt() and more (equal, not equal, greater than).
Example:
df['new_col'] = df.gt(df_1)
http://pandas.pydata.org/pandas-docs/stable/basics.html#flexible-comparisons
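As a rough illustration of how the flexible methods differ from the operators: `.ne()` aligns the two frames on their labels instead of raising, so rows present in only one frame are compared against missing values. The frames `left` and `right` below are made-up examples, not data from the question.

```python
import pandas as pd

# Hypothetical frames with different lengths (and therefore different indexes)
left = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({'A': [1, 9]}, index=[0, 1])

# .ne() aligns on index and columns; row 2 exists only in `left`,
# so it is compared against NaN and reported as "not equal"
mask = left.ne(right)
rows_that_differ = mask.any(axis=1)

print(rows_that_differ)
```

Whether treating unmatched rows as "different" is what you want depends on the use case; for the question's goal of finding rows that are not yet stored, it happens to be convenient.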
Answered by David Crook
I replicated the problem with some fake data to reach my actual goal, which was removing duplicates. Note that this is not an answer to the original question; it is the solution to what I was trying to do when I ran into the error.
import numpy as np
import pandas as pd

# Two frames that share one row (index 7, A == 'A7')
b = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                  'B': ['B4', 'B5', 'B6', 'B7'],
                  'C': ['C4', 'C5', 'C6', 'C7'],
                  'D': ['D4', 'D5', 'D6', 'D7']},
                 index=[4, 5, 6, 7])
c = pd.DataFrame({'A': ['A7', 'A8', 'A9', 'A10', 'A11'],
                  'B': ['B7', 'B8', 'B9', 'B10', 'B11'],
                  'C': ['C7', 'C8', 'C9', 'C10', 'C11'],
                  'D': ['D7', 'D8', 'D9', 'D10', 'D11']},
                 index=[7, 8, 9, 10, 11])
result = pd.concat([b, c])
# Positions of the first occurrence of each unique value in column A
idx = np.unique(result["A"], return_index=True)[1]
# DataFrame.sort() is gone from current pandas; sort_index() restores index order
result.iloc[idx].sort_index()
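For the goal stated in the question (finding the rows of the new data that are not yet stored), one possible alternative is a left merge with `indicator=True`. This is only a sketch: the frames below are tiny stand-ins with invented prices, and it assumes the `Date` column alone identifies a row.

```python
import pandas as pd

# Stand-in data with invented values; `Date` is assumed to be a unique key
existing = pd.DataFrame({'Date': ['2016-05-27', '2016-05-26'],
                         'Close': [52.32, 51.89]})
incoming = pd.DataFrame({'Date': ['2016-05-27', '2016-05-26', '2016-05-25'],
                         'Close': [52.32, 51.89, 52.12]})

# A left merge with indicator=True adds a `_merge` column that marks
# each incoming row as 'both' (already stored) or 'left_only' (new)
merged = incoming.merge(existing[['Date']], on='Date', how='left', indicator=True)
missing = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')

print(missing)
```

The rows in `missing` are the ones that would still need to be appended to the stored data.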