每次为 Pandas DataFrame 获取相同的哈希值
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/31567401/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Get the same hash value for a Pandas DataFrame each time
提问by mkurnikov
My goal is to get a unique hash value for a DataFrame, which I obtain from a .csv file. The whole point is to get the same hash each time I call hash() on it.
我的目标是为 DataFrame 获取唯一的哈希值。我从 .csv 文件中获取它。重点是每次我调用 hash() 时都获得相同的哈希值。
My idea was to create the function
我的想法是创建这个函数
def _get_array_hash(arr):
    arr_hashable = arr.values
    arr_hashable.flags.writeable = False
    hash_ = hash(arr_hashable.data)
    return hash_
which takes the underlying numpy array, sets it to an immutable state, and hashes its buffer.
它获取底层 numpy 数组,将其设置为不可变状态,并对其缓冲区进行哈希。
INLINE UPD.
内联UPD。
As of 08.11.2016, this version of the function doesn't work anymore. Instead, you should use
自 08.11.2016 起,此版本的函数不再有效。你应该改用
hash(df.values.tobytes())
See the comments under "Most efficient property to hash for numpy array".
END OF INLINE UPD.
内联 UPD 结束。
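A minimal sketch of the updated approach (the DataFrame here is illustrative): hashing the raw bytes of the underlying numpy array is repeatable within one interpreter session, but note that Python 3 salts the built-in hash() for bytes per process (see PYTHONHASHSEED), so the value can change between runs.

```python
import pandas as pd

df = pd.DataFrame({'A': [0], 'B': [1]})

# Hash the raw buffer of the underlying numpy array.
h1 = hash(df.values.tobytes())
h2 = hash(df.values.tobytes())

# Identical within one interpreter run; a new process may yield a
# different value because hash() on bytes is salted per process.
```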
It works for a regular pandas DataFrame:
它适用于常规的 Pandas DataFrame:
In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})
In [13]: _get_array_hash(data)
Out[13]: -5522125492475424165
In [14]: _get_array_hash(data)
Out[14]: -5522125492475424165
But then I try to apply it to DataFrame obtained from a .csv file:
但后来我尝试将其应用于从 .csv 文件获得的 DataFrame:
In [15]: fpath = 'foo/bar.csv'
In [16]: data_from_file = pd.read_csv(fpath)
In [17]: _get_array_hash(data_from_file)
Out[17]: 6997017925422497085
In [18]: _get_array_hash(data_from_file)
Out[18]: -7524466731745902730
Can somebody explain to me how that's possible?
有人可以解释一下,这怎么可能?
I can create a new DataFrame out of it, like
我可以从中创建一个新的 DataFrame,例如
new_data = pd.DataFrame(data=data_from_file.values,
                        columns=data_from_file.columns,
                        index=data_from_file.index)
and it works again
它再次起作用
In [25]: _get_array_hash(new_data)
Out[25]: -3546154109803008241
In [26]: _get_array_hash(new_data)
Out[26]: -3546154109803008241
But my goal is to preserve the same hash value for a dataframe across application launches in order to retrieve some value from cache.
但我的目标是在应用程序启动时为数据帧保留相同的哈希值,以便从缓存中检索某些值。
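The stated goal (a hash stable across application launches) rules out the built-in hash(), which Python 3 salts per process. A hedged sketch of a restart-stable alternative, assuming a purely numeric frame (object-dtype columns would need to be serialized first, since tobytes() on an object array captures pointers, not values):

```python
import hashlib
import pandas as pd

# A cryptographic digest of the raw buffer is not salted, so it
# survives interpreter restarts (for frames of fixed-size dtypes).
df = pd.DataFrame({'A': [0], 'B': [1]})
digest = hashlib.sha256(df.values.tobytes()).hexdigest()
```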
回答by Jonathan Stray
As of Pandas 0.20.1, you can use the little-known (and poorly documented) hash_pandas_object (source code), which was recently made public in pandas.util. It returns one hash value for each row of the dataframe (and works on Series etc. too).
从 Pandas 0.20.1 开始,你可以使用鲜为人知(且文档不全)的 hash_pandas_object(源代码),它最近在 pandas.util 中公开。它为数据帧的每一行返回一个哈希值(也适用于 Series 等)。
import pandas as pd
import numpy as np
np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
df = pd.DataFrame(arr)
print(df)
# 0 1 2 3
# 0 42 foo 42 42
# 1 foo foo 42 bar
# 2 42 42 42 42
from pandas.util import hash_pandas_object
h = hash_pandas_object(df)
print(h)
# 0 5559921529589760079
# 1 16825627446701693880
# 2 7171023939017372657
# dtype: uint64
You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.
如果您想要所有行的整体哈希值,您总是可以执行 hash_pandas_object(df).sum()。
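A small sketch of that reduction (the frame here is illustrative). One design note: sum() is insensitive to row order, since by default each row's hash also covers its index label, so shuffling rows while keeping their labels yields the same set of per-row hashes and hence the same total.

```python
import pandas as pd
from pandas.util import hash_pandas_object

df = pd.DataFrame({'A': [0, 1], 'B': [2, 3]})

row_hashes = hash_pandas_object(df)  # one uint64 per row
total = row_hashes.sum()             # single combined value for the frame
```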
回答by eMMe
I had a similar problem: I needed to check whether a dataframe had changed, and I solved it by hashing the msgpack serialization string. This seems stable across repeated reloads of the same data.
我遇到了类似的问题:检查数据帧是否已更改。我通过对 msgpack 序列化字符串进行哈希来解决它。在多次重新加载相同数据时,这似乎很稳定。
import pandas as pd
import hashlib
DATA_FILE = 'data.json'
data1 = pd.read_json(DATA_FILE)
data2 = pd.read_json(DATA_FILE)
assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()
assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()
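Note that to_msgpack was deprecated in pandas 0.25 and removed in 1.0, so on current versions the same idea needs a different serializer. A sketch of the equivalent using CSV text (an assumption for illustration; any deterministic serialization would do):

```python
import hashlib
import pandas as pd

# Stand-in for to_msgpack on modern pandas: hash a deterministic
# text serialization of the frame instead of the raw value buffer.
df = pd.DataFrame({'A': [0], 'B': [1]})
digest1 = hashlib.md5(df.to_csv().encode()).hexdigest()
digest2 = hashlib.md5(df.to_csv().encode()).hexdigest()
```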

