pandas: convert DataFrame rows to Python sets

Question

Asked by user46543

I have this dataset:

import pandas as pd
import itertools

A = ['A','B','C']
M = ['1','2','3']
F = ['plus','minus','square']

df = pd.DataFrame(list(itertools.product(A,M,F)), columns=['A','M','F'])
print(df)

The example output is like this:

   A  M       F
0   A  1    plus
1   A  1   minus
2   A  1  square
3   A  2    plus
4   A  2   minus
5   A  2  square

I want to do a pairwise comparison (Jaccard similarity) of every pair of rows in this data frame, for example comparing A 1 plus with A 2 square, and get the similarity value between those two sets.

I have written a jaccard function:

def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

It only works on sets, because it uses intersection.
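As a quick sanity check of the function above, here is a minimal sketch that turns each row into a frozenset and compares the two rows mentioned in the question. The name row_sets is only illustrative, and the row-to-set conversion via df.apply(frozenset, axis=1) is borrowed from the answers below.

```python
import itertools
import pandas as pd

A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']
df = pd.DataFrame(list(itertools.product(A, M, F)), columns=['A', 'M', 'F'])

def jaccard(a, b):
    # |a ∩ b| / |a ∪ b|, with the union size obtained by inclusion-exclusion
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

# One frozenset per row (row_sets is a hypothetical name for this sketch)
row_sets = df.apply(frozenset, axis=1)

# Row 0 is (A, 1, plus) and row 5 is (A, 2, square): they share only 'A',
# and together contain five distinct values
score = jaccard(row_sets.loc[0], row_sets.loc[5])
print(score)  # 0.2
```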

I want output like this (the values shown here are just random numbers):

      0     1     2     3     4
0  1.00  0.43  0.61  0.55  0.46
1  0.43  1.00  0.52  0.56  0.49
2  0.61  0.52  1.00  0.48  0.53
3  0.55  0.56  0.48  1.00  0.49
4  0.46  0.49  0.53  0.49  1.00

What is the best way to get the result of pairwise metrics?

Thank you,

Answer 1

Answered by cs95

You could get rid of the nested apply by vectorizing your function. First, get all pairwise combinations of rows, then pass them to a vectorized version of your function:

import numpy as np

def jaccard_similarity_score(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

# One frozenset per row, then a cross join of the rows with themselves
i = df.apply(frozenset, axis=1).to_frame()
j = i.assign(foo=1)
k = j.merge(j, on='foo').drop(columns='foo')
k.columns = ['A', 'B']

# np.vectorize applies the function to every (A, B) pair
fnc = np.vectorize(jaccard_similarity_score)
y = fnc(k['A'], k['B']).reshape(len(df), -1)

y
array([[ 1. ,  0.5,  0.5,  0.5,  0.2,  0.2],
       [ 0.5,  1. ,  0.5,  0.2,  0.5,  0.2],
       [ 0.5,  0.5,  1. ,  0.2,  0.2,  0.5],
       [ 0.5,  0.2,  0.2,  1. ,  0.5,  0.5],
       [ 0.2,  0.5,  0.2,  0.5,  1. ,  0.5],
       [ 0.2,  0.2,  0.5,  0.5,  0.5,  1. ]])
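On pandas 1.2 or later, the dummy foo column is not needed, because merge accepts how='cross'. The following is a minimal sketch of the same idea rather than the answerer's code. It assumes the 6x6 output above was computed on the first six rows of the question's data, so it uses df.head(6); the names left and pairs are illustrative.

```python
import itertools
import numpy as np
import pandas as pd

A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']
# Assumption: only the first six rows, to match the 6x6 output shown above
df = pd.DataFrame(list(itertools.product(A, M, F)), columns=['A', 'M', 'F']).head(6)

def jaccard_similarity_score(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

# One frozenset per row, stored in a single-column frame named 'A'
left = df.apply(frozenset, axis=1).to_frame(name='A')

# Cross join: every row set paired with every row set (requires pandas >= 1.2)
pairs = left.merge(left.rename(columns={'A': 'B'}), how='cross')

fnc = np.vectorize(jaccard_similarity_score)
y = fnc(pairs['A'], pairs['B']).reshape(len(df), -1)
print(y)
```

The cross join keeps the left-hand rows in order, so reshaping to len(df) rows gives the same row-by-row layout as the merge on a dummy key.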

This is already faster, but let's see if we can get even faster.

Using senderle's fast cartesian_product:

import numpy as np

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)

# All pairs of row sets; fnc is the vectorized function defined earlier
i = df.apply(frozenset, axis=1).values
j = cartesian_product(i, i)
y = fnc(j[:, 0], j[:, 1]).reshape(-1, len(df))

y

array([[ 1. ,  0.5,  0.5,  0.5,  0.2,  0.2],
       [ 0.5,  1. ,  0.5,  0.2,  0.5,  0.2],
       [ 0.5,  0.5,  1. ,  0.2,  0.2,  0.5],
       [ 0.5,  0.2,  0.2,  1. ,  0.5,  0.5],
       [ 0.2,  0.5,  0.2,  0.5,  1. ,  0.5],
       [ 0.2,  0.2,  0.5,  0.5,  0.5,  1. ]])
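Since np.vectorize still calls the Python function once per pair, another option is to avoid per-pair calls entirely. The sketch below is a different technique from the one in this answer: it encodes each row set as a 0/1 membership vector and gets every intersection size from one matrix product. It assumes the first six rows of the question's data, and the names vocab, X, inter, sizes, union and sim are illustrative.

```python
import itertools
import numpy as np
import pandas as pd

A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']
# Assumption: only the first six rows, to match the 6x6 output shown above
df = pd.DataFrame(list(itertools.product(A, M, F)), columns=['A', 'M', 'F']).head(6)

# One frozenset per row, as in the answers
row_sets = df.apply(frozenset, axis=1)

# Membership matrix: one row per DataFrame row, one column per distinct value
vocab = sorted(set().union(*row_sets))
X = np.array([[v in s for v in vocab] for s in row_sets], dtype=int)

inter = X @ X.T                                   # |a ∩ b| for every pair
sizes = X.sum(axis=1)                             # |a| for every row
union = sizes[:, None] + sizes[None, :] - inter   # |a ∪ b| by inclusion-exclusion
sim = inter / union
print(sim)
```

This relies on every row set being non-empty, so the union size is never zero. For the six rows used here, the first row of sim is 1, 0.5, 0.5, 0.5, 0.2, 0.2, the same as in the arrays above.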

Answer 2

Answered by Sebastian

A full implementation of what you want can be found here:

series_set = df.apply(frozenset, axis=1)
new_df = series_set.apply(lambda a: series_set.apply(lambda b: jaccard(a,b)))
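For reference, here is a self-contained sketch of how the two lines above might be run end to end. It assumes the jaccard function from the question and, to keep the output small, only the first six rows of the data. The rounding step is an addition meant to mimic the two-decimal layout the question asks for.

```python
import itertools
import pandas as pd

A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']
# Assumption: first six rows only, to keep the printed matrix small
df = pd.DataFrame(list(itertools.product(A, M, F)), columns=['A', 'M', 'F']).head(6)

def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

series_set = df.apply(frozenset, axis=1)

# The inner apply returns a Series for each row set, so the outer apply
# assembles those Series into a square DataFrame indexed like df
new_df = series_set.apply(lambda a: series_set.apply(lambda b: jaccard(a, b)))

print(new_df.round(2))
```

The result is a DataFrame whose row and column labels are the original row indices, which is the layout requested in the question.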
