Can Pandas DataFrame efficiently calculate PMI (Pointwise Mutual Information)?
Original source: http://stackoverflow.com/questions/35850582/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by jfive
I've looked around and, surprisingly, haven't found an easy-to-use framework or existing code for calculating Pointwise Mutual Information (Wiki PMI), despite libraries like Scikit-learn offering a metric for overall Mutual Information (by histogram). This is in the context of Python and Pandas!
My problem:
I have a DataFrame with a series of [x,y] examples in each row and wish to calculate a series of PMI values as per the formula (or a simpler one):
PMI(x, y) = log( p(x,y) / (p(x) * p(y)) )
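For a quick sanity check of the formula, here is a tiny worked example with made-up probabilities (illustrative only, not taken from the data below):

import numpy as np

# hypothetical probabilities, for illustration only
p_x, p_y, p_xy = 0.5, 0.4, 0.3
pmi = np.log(p_xy / (p_x * p_y))   # log(0.3 / 0.2) = log(1.5)
print(pmi)                         # ~0.405: x and y co-occur more often than if they were independent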
So far my approach is:
import numpy as np
import pandas as pd

def pmi_func(df, x, y):
    df['freq_x'] = df.groupby(x).transform('count')
    df['freq_y'] = df.groupby(y).transform('count')
    df['freq_x_y'] = df.groupby([x, y]).transform('count')
    df['pmi'] = np.log( df['freq_x_y'] / (df['freq_x'] * df['freq_y']) )
Would this give a valid and/or efficient computation?
Sample I/O:
x  y  PMI
0  0  0.176
0  0  0.176
0  1  0
Answered by Zero
I would add three bits.
def pmi(dff, x, y):
    df = dff.copy()
    df['f_x'] = df.groupby(x)[x].transform('count')
    df['f_y'] = df.groupby(y)[y].transform('count')
    df['f_xy'] = df.groupby([x, y])[x].transform('count')
    df['pmi'] = np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y']))
    return df
- df.groupby(x)[x].transform('count') and df.groupby(y)[y].transform('count') should be used so that only the count is returned.
- np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y'])) so that probabilities are used: with N = len(df.index), p(x) = f_x / N, p(y) = f_y / N and p(x,y) = f_xy / N, so p(x,y) / (p(x) * p(y)) = N * f_xy / (f_x * f_y).
- Work on a copy of the dataframe, rather than modifying the input dataframe.
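A minimal usage sketch of the pmi function above (the column names and values are made up for illustration; the resulting PMI values depend entirely on the data):

import numpy as np
import pandas as pd

# hypothetical toy data, just to show the calling convention
toy = pd.DataFrame({'x': [0, 0, 0, 1, 1, 2],
                    'y': [0, 0, 1, 1, 1, 0]})

result = pmi(toy, 'x', 'y')
print(result[['x', 'y', 'pmi']])
# `toy` itself is left unchanged, because pmi() works on a copy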
Answered by jfive
Solution (with SKlearn KDE alternative as well):
Please comment for review
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity

# pmi function (count-based)
def pmi_func(df, x, y):
    # select the grouped column so that only a count Series is returned (see answer above)
    freq_x = df.groupby(x)[x].transform('count')
    freq_y = df.groupby(y)[y].transform('count')
    freq_x_y = df.groupby([x, y])[x].transform('count')
    df['pmi'] = np.log( len(df.index) * (freq_x_y / (freq_x * freq_y)) )

# pmi with kernel density estimation
def kernel_pmi_func(df, x, y):
    # reshape data
    x = np.array(df[x])
    y = np.array(df[y])
    x_y = np.stack((x, y), axis=-1)

    # kernel density estimation
    kde_x = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(x[:, np.newaxis])
    kde_y = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(y[:, np.newaxis])
    kde_x_y = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(x_y)

    # score: score_samples returns log-density, so exponentiate to get densities
    p_x = pd.Series(np.exp(kde_x.score_samples(x[:, np.newaxis])))
    p_y = pd.Series(np.exp(kde_y.score_samples(y[:, np.newaxis])))
    p_x_y = pd.Series(np.exp(kde_x_y.score_samples(x_y)))

    df['pmi'] = np.log( p_x_y / (p_x * p_y) )
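A minimal sketch of calling both variants (the data here are made up for illustration; the KDE variant is aimed at continuous variables, so the two approaches will generally not produce identical values):

import numpy as np
import pandas as pd

# hypothetical toy data for illustration only
df = pd.DataFrame({'x': np.random.randint(0, 3, size=100),
                   'y': np.random.randint(0, 2, size=100)})

pmi_func(df, 'x', 'y')          # count-based PMI, writes df['pmi']
print(df['pmi'].head())

kernel_pmi_func(df, 'x', 'y')   # KDE-based PMI, overwrites df['pmi']
print(df['pmi'].head())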