Disclaimer: this page is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/31567401/

Date: 2020-09-13 23:39:49

Get the same hash value for a Pandas DataFrame each time

python, pandas

Asked by mkurnikov

My goal is to get a unique hash value for a DataFrame, which I obtain from a .csv file. The whole point is to get the same hash each time I call hash() on it.

My idea was to create the function

def _get_array_hash(arr):
    arr_hashable = arr.values
    arr_hashable.flags.writeable = False
    hash_ = hash(arr_hashable.data)
    return hash_

which accesses the underlying numpy array, sets it to an immutable state, and hashes its buffer.

INLINE UPD.


As of 08.11.2016, this version of the function no longer works. Instead, you should use

hash(df.values.tobytes())

See the comments on Most efficient property to hash for numpy array.

END OF INLINE UPD.


It works for a regular pandas DataFrame:

In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})

In [13]: _get_array_hash(data)
Out[13]: -5522125492475424165

In [14]: _get_array_hash(data)
Out[14]: -5522125492475424165 

But then I try to apply it to DataFrame obtained from a .csv file:


In [15]: fpath = 'foo/bar.csv'

In [16]: data_from_file = pd.read_csv(fpath)

In [17]: _get_array_hash(data_from_file)
Out[17]: 6997017925422497085

In [18]: _get_array_hash(data_from_file)
Out[18]: -7524466731745902730

Can somebody explain to me how that is possible?

I can create a new DataFrame out of it, like

new_data = pd.DataFrame(data=data_from_file.values, 
            columns=data_from_file.columns, 
            index=data_from_file.index)

and it works again:

In [25]: _get_array_hash(new_data)
Out[25]: -3546154109803008241

In [26]: _get_array_hash(new_data)
Out[26]: -3546154109803008241

But my goal is to preserve the same hash value for a DataFrame across application launches, in order to retrieve some value from a cache.
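A caveat worth flagging here (my note, not part of the original question): Python's built-in hash() salts str/bytes hashing per process (controlled by PYTHONHASHSEED), so hash(df.values.tobytes()) will generally differ between application launches even for identical data. For cross-launch caching, a hashlib digest is stable. A minimal sketch, assuming a frame with uniform numeric dtypes (the helper name is mine):

```python
import hashlib

import pandas as pd


def stable_df_digest(df: pd.DataFrame) -> str:
    """Hex digest that is stable across Python processes.

    Unlike the built-in hash(), hashlib is not salted per process,
    so the digest survives application restarts.
    """
    h = hashlib.sha256()
    # Fold in the column labels so renaming a column changes the digest.
    h.update(",".join(map(str, df.columns)).encode("utf-8"))
    # Caution: for object-dtype columns, .tobytes() serializes object
    # pointers, not values -- this sketch assumes numeric dtypes.
    h.update(df.values.tobytes())
    return h.hexdigest()


df = pd.DataFrame({'A': [0], 'B': [1]})
assert stable_df_digest(df) == stable_df_digest(df.copy())
```

Whether to include the index or dtypes in the digest depends on what "same DataFrame" should mean for the cache.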

Answered by Jonathan Stray

As of Pandas 0.20.1, you can use the little-known (and poorly documented) hash_pandas_object (source code), which was recently made public in pandas.util. It returns one hash value for each row of the DataFrame (and works on Series etc. too).

import pandas as pd
import numpy as np

np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
df = pd.DataFrame(arr)

print(df)
#      0    1   2    3
# 0   42  foo  42   42
# 1  foo  foo  42  bar
# 2   42   42  42   42

from pandas.util import hash_pandas_object
h = hash_pandas_object(df)

print(h)
# 0     5559921529589760079
# 1    16825627446701693880
# 2     7171023939017372657
# dtype: uint64

You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.
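One side note (my addition, not part of the original answer): summing the row hashes is order-insensitive, so two frames holding the same rows in a different order can produce the same sum. If row order should matter for the cache key, one sketch is to feed the raw per-row hash bytes into hashlib instead:

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def df_digest(df: pd.DataFrame) -> str:
    # hash_pandas_object returns one uint64 per row; hashing the raw
    # byte stream keeps the result sensitive to row order, unlike .sum().
    row_hashes = hash_pandas_object(df)
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()


df = pd.DataFrame({'A': ['foo', 'bar'], 'B': [1, 2]})
assert df_digest(df) == df_digest(df.copy())
```

As a bonus, the sha256 digest avoids the uint64 wraparound that a plain .sum() can hit on large frames.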

Answered by eMMe

I had a similar problem: checking whether a DataFrame has changed. I solved it by hashing the msgpack serialization string. This seems stable across different reloads of the same data.

import pandas as pd
import hashlib
DATA_FILE = 'data.json'

data1 = pd.read_json(DATA_FILE)
data2 = pd.read_json(DATA_FILE)

assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()
assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()
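An update worth noting (mine, not the answerer's): DataFrame.to_msgpack was deprecated in pandas 0.25 and removed in 1.0, so the snippet above no longer runs on modern pandas. The same serialize-then-digest idea still works with another deterministic serialization, e.g. to_json; a sketch (the helper name is mine):

```python
import hashlib

import pandas as pd


def json_digest(df: pd.DataFrame) -> str:
    # to_json produces the same string for the same data, so its md5
    # digest is stable across reloads -- the same idea as the msgpack
    # version above, minus the removed API.
    return hashlib.md5(df.to_json().encode("utf-8")).hexdigest()


df1 = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df2 = df1.copy()
assert json_digest(df1) == json_digest(df2)
```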

Answered by uut

Joblib provides a hashing function optimized for objects containing numpy arrays (e.g. pandas DataFrames).

import joblib
joblib.hash(df)