Can Pandas DataFrame efficiently calculate PMI (Pointwise Mutual Information)?
Original source: http://stackoverflow.com/questions/35850582/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by jfive
I've looked around and, surprisingly, haven't found an easy-to-use framework or existing code for calculating Pointwise Mutual Information (Wiki PMI), despite libraries like Scikit-learn offering a metric for overall Mutual Information (by histogram). This is in the context of Python and Pandas!
My problem:
I have a DataFrame with a series of [x,y] examples in each row and wish to calculate a series of PMI values as per the formula (or a simpler one):
PMI(x, y) = log( p(x,y) / (p(x) * p(y)) )
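For a quick sanity check of the formula, here is a tiny worked example with made-up probabilities (illustrative only, not taken from the data below):

import numpy as np

# hypothetical probabilities, for illustration only
p_x, p_y, p_xy = 0.5, 0.4, 0.3
pmi = np.log(p_xy / (p_x * p_y))   # log(0.3 / 0.2) = log(1.5)
print(pmi)                         # ~0.405: x and y co-occur more often than if they were independent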
So far my approach is:
import numpy as np
import pandas as pd

def pmi_func(df, x, y):
    df['freq_x'] = df.groupby(x).transform('count')
    df['freq_y'] = df.groupby(y).transform('count')
    df['freq_x_y'] = df.groupby([x, y]).transform('count')
    df['pmi'] = np.log( df['freq_x_y'] / (df['freq_x'] * df['freq_y']) )
Would this give a valid and/or efficient computation?
Sample I/O:
x  y  PMI
0  0  0.176
0  0  0.176
0  1  0
Answered by Zero
I would add three bits.
def pmi(dff, x, y):
    df = dff.copy()
    df['f_x'] = df.groupby(x)[x].transform('count')
    df['f_y'] = df.groupby(y)[y].transform('count')
    df['f_xy'] = df.groupby([x, y])[x].transform('count')
    df['pmi'] = np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y']))
    return df
- df.groupby(x)[x].transform('count') and df.groupby(y)[y].transform('count') should be used so that only the count is returned.
- np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y'])) so that probabilities are used: with N = len(df.index), p(x) = f_x / N, p(y) = f_y / N and p(x,y) = f_xy / N, so p(x,y) / (p(x) * p(y)) = N * f_xy / (f_x * f_y).
- Work on a copy of the dataframe, rather than modifying the input dataframe.
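A minimal usage sketch of the pmi function above (the column names and values are made up for illustration; the resulting PMI values depend entirely on the data):

import numpy as np
import pandas as pd

# hypothetical toy data, just to show the calling convention
toy = pd.DataFrame({'x': [0, 0, 0, 1, 1, 2],
                    'y': [0, 0, 1, 1, 1, 0]})

result = pmi(toy, 'x', 'y')
print(result[['x', 'y', 'pmi']])
# `toy` itself is left unchanged, because pmi() works on a copy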
Answered by jfive
Solution (with SKlearn KDE alternative as well):
Please comment for review
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity

# pmi function (count-based)
def pmi_func(df, x, y):
    # select the grouped column so that only a count Series is returned (see answer above)
    freq_x = df.groupby(x)[x].transform('count')
    freq_y = df.groupby(y)[y].transform('count')
    freq_x_y = df.groupby([x, y])[x].transform('count')
    df['pmi'] = np.log( len(df.index) * (freq_x_y / (freq_x * freq_y)) )

# pmi with kernel density estimation
def kernel_pmi_func(df, x, y):
    # reshape data
    x = np.array(df[x])
    y = np.array(df[y])
    x_y = np.stack((x, y), axis=-1)

    # kernel density estimation
    kde_x = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(x[:, np.newaxis])
    kde_y = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(y[:, np.newaxis])
    kde_x_y = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(x_y)

    # score: score_samples returns log-density, so exponentiate to get densities
    p_x = pd.Series(np.exp(kde_x.score_samples(x[:, np.newaxis])))
    p_y = pd.Series(np.exp(kde_y.score_samples(y[:, np.newaxis])))
    p_x_y = pd.Series(np.exp(kde_x_y.score_samples(x_y)))

    df['pmi'] = np.log( p_x_y / (p_x * p_y) )
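A minimal sketch of calling both variants (the data here are made up for illustration; the KDE variant is aimed at continuous variables, so the two approaches will generally not produce identical values):

import numpy as np
import pandas as pd

# hypothetical toy data for illustration only
df = pd.DataFrame({'x': np.random.randint(0, 3, size=100),
                   'y': np.random.randint(0, 2, size=100)})

pmi_func(df, 'x', 'y')          # count-based PMI, writes df['pmi']
print(df['pmi'].head())

kernel_pmi_func(df, 'x', 'y')   # KDE-based PMI, overwrites df['pmi']
print(df['pmi'].head())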