pandas 将数据帧行转换为 Python 集

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/47545052/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 04:50:05  来源:igfitidea点击:

Convert dataframe rows to Python set

pythonpandasdataframesetsimilarity

提问by user46543

I have this dataset:

我有这个数据集:

import pandas as pd
import itertools

A = ['A','B','C']
M = ['1','2','3']
F = ['plus','minus','square']

df = pd.DataFrame(list(itertools.product(A,M,F)), columns=['A','M','F'])
print(df)

The example output is like this:

示例输出是这样的:

   A  M       F
0   A  1    plus
1   A  1   minus
2   A  1  square
3   A  2    plus
4   A  2   minus
5   A  2  square

I want to pairwise comparison (jaccard similarity) of each row from this data frame, for example, comparing

我想对这个数据框中的每一行进行成对比较(jaccard 相似度),例如,比较

A 1 plusand A 2 squareand get the similarity value between those both set.

A 1 plusA 2 square获得这两个集合之间的相似度值。

I have wrote a jaccard function:

我写了一个 jaccard 函数:

def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

Which is only work on set because I used intersection

这只能在片场工作,因为我用过 intersection

I want the output like this (this expected result value is just random number):

我想要这样的输出(这个预期结果值只是随机数):

    0     1     2     3     45
0  1.00  0.43  0.61  0.55  0.46
1  0.43  1.00  0.52  0.56  0.49
2  0.61  0.52  1.00  0.48  0.53
3  0.55  0.56  0.48  1.00  0.49
45  0.46  0.49  0.53  0.49  1.00

What is the best way to get the result of pairwise metrics?

获得成对度量结果的最佳方法是什么?

Thank you,

谢谢,

回答by cs95

You could get rid of the nested apply by vectorizing your function. First, get all pair-wise combinations and pass it to a vectorized version of your function -

您可以通过向量化您的函数来摆脱嵌套的应用程序。首先,获取所有成对组合并将其传递给函数的矢量化版本 -

def jaccard_similarity_score(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

i = df.apply(frozenset, 1).to_frame()
j = i.assign(foo=1)
k = j.merge(j, on='foo').drop('foo', 1)
k.columns = ['A', 'B']

fnc = np.vectorize(jaccard_similarity_score)
y = fnc(k['A'], k['B']).reshape(len(df), -1)
y
array([[ 1. ,  0.5,  0.5,  0.5,  0.2,  0.2],
       [ 0.5,  1. ,  0.5,  0.2,  0.5,  0.2],
       [ 0.5,  0.5,  1. ,  0.2,  0.2,  0.5],
       [ 0.5,  0.2,  0.2,  1. ,  0.5,  0.5],
       [ 0.2,  0.5,  0.2,  0.5,  1. ,  0.5],
       [ 0.2,  0.2,  0.5,  0.5,  0.5,  1. ]])

This is already faster, but let's see if we can get evenfaster.

这已经是快,但让我们看看我们是否能够得到甚至更快。



Using senderle's fast cartesian_product-

使用 senderle 的快速cartesian_product-

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = numpy.result_type(*arrays)
    arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(numpy.ix_(*arrays)):
        arr[...,i] = a
    return arr.reshape(-1, la)  


i = df.apply(frozenset, 1).values
j = cartesian_product(i, i)
y = fnc(j[:, 0], j[:, 1]).reshape(-1, len(df))

y

array([[ 1. ,  0.5,  0.5,  0.5,  0.2,  0.2],
       [ 0.5,  1. ,  0.5,  0.2,  0.5,  0.2],
       [ 0.5,  0.5,  1. ,  0.2,  0.2,  0.5],
       [ 0.5,  0.2,  0.2,  1. ,  0.5,  0.5],
       [ 0.2,  0.5,  0.2,  0.5,  1. ,  0.5],
       [ 0.2,  0.2,  0.5,  0.5,  0.5,  1. ]])

回答by Sebastian

A full implementation of what you want can be found here:

可以在此处找到您想要的完整实现:

series_set = df.apply(frozenset, axis=1)
new_df = series_set.apply(lambda a: series_set.apply(lambda b: jaccard(a,b)))