Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/25757042/

Create hash value for each row of data with selected columns in dataframe in python pandas

Tags: python, hash, pandas

Asked by lokheart

I asked a similar question for R about creating a hash value for each row of data. I know that I can use something like hashlib.md5(b'Hello World').hexdigest() to hash a string, but how about a row in a dataframe?

Update 01

I have drafted my code as below:

for index, row in course_staff_df.iterrows():
    # md5 needs bytes on Python 3, so encode the string form of the selected values
    temp_df.loc[index, 'hash'] = hashlib.md5(str(row[['cola', 'colb']].values).encode('utf-8')).hexdigest()

This doesn't seem very Pythonic to me. Is there a better solution?

Answered by cwharland

Or simply:

df.apply(lambda x: hash(tuple(x)), axis = 1)

As an example:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(3,5))
print(df)
print(df.apply(lambda x: hash(tuple(x)), axis=1))

     0         1         2         3         4
0  0.728046  0.542013  0.672425  0.374253  0.718211
1  0.875581  0.512513  0.826147  0.748880  0.835621
2  0.451142  0.178005  0.002384  0.060760  0.098650

0    5024405147753823273
1    -798936807792898628
2   -8745618293760919309
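
The question asks about hashing only selected columns. A minimal sketch of the same idea restricted to a column subset might look like the following; the frame and the column names cola, colb and colc are made up for illustration and do not come from the answer.

```python
import pandas as pd

# Hypothetical frame; 'cola', 'colb' and 'colc' are illustrative names.
df = pd.DataFrame({
    'cola': [1, 2, 1],
    'colb': ['x', 'y', 'x'],
    'colc': [0.1, 0.2, 0.3],
})

# Select the columns first, then hash each row as a tuple.
selected = ['cola', 'colb']
df['hash'] = df[selected].apply(lambda row: hash(tuple(row)), axis=1)

# Rows 0 and 2 agree on the selected columns, so they share a hash.
print(df)
```

Because colc is left out of the selection, it has no effect on the result. The values come from the built-in hash(), so they should only be compared within a single Python process.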

Answered by Aaron Hall

Create hash value for each row of data with selected columns in dataframe in python pandas

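These hash()-based solutions are only stable for the life of the Python process: Python salts str and bytes hashes per process unless PYTHONHASHSEED is set, so the values should not be stored and compared across runs.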

If order matters, one method would be to coerce the row (a Series object) to a tuple:

>>> hash(tuple(df.iloc[1]))  # df.irow() was removed from pandas; iloc is the replacement
-4901655572611365671

This demonstrates order matters for tuple hashing:

>>> hash((1,2,3))
2528502973977326415
>>> hash((3,2,1))
5050909583595644743

Doing this for every row and appending the result as a column looks like this:

>>> df = df.drop(columns='hash') # lose the old hash
>>> df['hash'] = pd.Series((hash(tuple(row)) for _, row in df.iterrows()))
>>> df
           y  x0                 hash
0  11.624345  10 -7519341396217622291
1  10.388244  11 -6224388738743104050
2  11.471828  12 -4278475798199948732
3  11.927031  13 -1086800262788974363
4  14.865408  14  4065918964297112768
5  12.698461  15  8870116070367064431
6  17.744812  16 -2001582243795030948
7  16.238793  17  4683560048732242225
8  18.319039  18 -4288960467160144170
9  18.750630  19  7149535252257157079

[10 rows x 3 columns]

If order does not matter, use the hash of frozensets instead of tuples:

>>> hash(frozenset((3,2,1)))
-272375401224217160
>>> hash(frozenset((1,2,3)))
-272375401224217160
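
One caveat worth illustrating: a frozenset drops repeated values as well as order, so rows that differ only in how often a value occurs hash the same. The sketch below shows this with plain tuples standing in for rows, and uses a sorted tuple as one possible order-insensitive alternative. The sorted-tuple variant is an addition for illustration, not part of the original answer, and it assumes the values in a row can be compared with each other.

```python
# A frozenset ignores order, but it also collapses repeated values.
a = (1, 1, 2)
b = (1, 2, 2)
same_as_set = hash(frozenset(a)) == hash(frozenset(b))
print(same_as_set)  # True: both rows reduce to the set {1, 2}

# Sorting first ignores order while keeping the duplicates.
sorted_a = hash(tuple(sorted(a)))
sorted_a_shuffled = hash(tuple(sorted((2, 1, 1))))
print(sorted_a == sorted_a_shuffled)  # True: same values, different order
```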

Avoid summing the hashes of all of the elements in the row, as this could be cryptographically insecure and lead to hashes that fall outside the range of the original.

(You could use modulo to constrain the range, but this amounts to rolling your own hash function, and the best practice is not to.)

You can also make permanent, cryptographic-quality hashes, for example with sha256, using the hashlib module.
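
As a rough sketch of the hashlib route, the example below builds one SHA-256 digest per row from selected columns. The frame, the column names and the helper row_sha256 are hypothetical, and joining the string form of each value with '|' is only one possible encoding; it assumes the separator never appears inside the values.

```python
import hashlib
import pandas as pd

# Hypothetical frame and column names, for illustration only.
df = pd.DataFrame({'cola': [1, 2, 1], 'colb': ['x', 'y', 'x']})

def row_sha256(row):
    # Join the string form of each value with a separator, then hash the bytes.
    # Assumption: '|' never occurs inside the values themselves.
    text = '|'.join(str(value) for value in row)
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

df['hash'] = df[['cola', 'colb']].apply(row_sha256, axis=1)
print(df)
```

Unlike hash(), these digests do not depend on the running process, so they stay the same across runs as long as the string form of the values does not change.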

There is some discussion of the API for cryptographic hash functions in PEP 452.

Thanks to users Jamie Marshal and Discrete Lizard for their comments.

Answered by Neal Fultz

This is now available in pandas.util.hash_pandas_object:

pandas.util.hash_pandas_object(df)
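
To hash only selected columns with this function, one option is to pass a column subset. The sketch below uses a made-up frame and column names; index=False is a parameter of hash_pandas_object that keeps the row labels out of the hash.

```python
import pandas as pd

# Hypothetical frame; the column names are illustrative.
df = pd.DataFrame({
    'cola': [1, 2, 1],
    'colb': ['x', 'y', 'x'],
    'colc': [0.1, 0.2, 0.3],
})

# index=False leaves the row labels out of the hash, so only the values count.
df['hash'] = pd.util.hash_pandas_object(df[['cola', 'colb']], index=False)
print(df)
```

The result is a Series of unsigned 64-bit integers, one per row, aligned with the original index.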

Answered by Wesley Batista

I came up with this adaptation of the code provided in the question:

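import hashlib

new_df2 = df.copy()
key_combination = ['col1', 'col2', 'col3', 'col4']
# Join the key columns with '-' (str() makes non-string values joinable) and use the SHA-1 hex digest as the index
new_df2.index = list(map(lambda x: hashlib.sha1('-'.join([str(col_value) for col_value in x]).encode('utf-8')).hexdigest(), new_df2[key_combination].values))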