Merging Pandas DataFrames on a float-valued column

Question

Asked by Megan

I have two data frames that I am trying to merge.

Dataframe A:

    col1    col2    sub    grade
0   1       34.32   x       a 
1   1       34.32   x       b
2   1       34.33   y       c
3   2       10.14   z       b
4   3       33.01   z       a

Dataframe B:

    col1    col2    group   ID
0   1       34.32   t       z 
1   1       54.32   s       w
2   1       34.33   r       z
3   2       10.14   q       z
4   3       33.01   q       e

I want to merge on col1 and col2. I've been using pd.merge with the following syntax:

pd.merge(A, B, how = 'outer', on = ['col1', 'col2'])

However, I think I am running into issues joining on the float values of col2 since many rows are being dropped. Is there any way to use np.isclose to match the values of col2? When I reference the index of a particular value of col2 in either dataframe, the value has many more decimal places than what is displayed in the dataframe.
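To illustrate the symptom described above, here is a minimal sketch with made-up values (the `1e-9` offset is an assumption chosen for the demo, not taken from the real data). The two col2 values print the same at two decimals, yet an exact-equality merge finds no match, while np.isclose treats them as equal:

```python
import numpy as np
import pandas as pd

# Hypothetical one-row frames: the col2 values differ only far past the
# decimals that pandas displays by default.
A = pd.DataFrame({"col1": [1], "col2": [34.32], "sub": ["x"]})
B = pd.DataFrame({"col1": [1], "col2": [34.32 + 1e-9], "group": ["t"]})

# pd.merge compares float keys for exact equality, so nothing matches.
exact = pd.merge(A, B, how="inner", on=["col1", "col2"])
print(len(exact))  # 0

# np.isclose compares within a tolerance (default rtol=1e-05, atol=1e-08).
close = np.isclose(A["col2"].to_numpy(), B["col2"].to_numpy())
print(close.all())  # True
```

With how='outer' the unmatched rows are not dropped but come back as separate rows padded with NaN, which can look like missing matches.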

I would like the result to be:

    col1   col2   sub   grade   group    ID
0   1      34.32  x     a       t        z
1   1      34.32  x     b       s        w
2   1      54.32  s     w       NaN      NaN
3   1      34.33  y     c       r        z
4   2      10.14  z     b       q        z
5   3      33.01  z     a       q        e

Answer 1

Accepted answer by jezrael

You can use a little hack: multiply the float column by some constant like 100, 1000, ..., convert the column to int, merge, and finally divide by the constant:

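import numpy as np
import pandas as pd

N = 100  # scale factor: 10 ** (number of decimal places to keep)
# thank you koalo for comment
# round before casting: astype(int) alone would truncate 3431.9999... to 3431
A['col2'] = np.round(A['col2'] * N).astype(int)
B['col2'] = np.round(B['col2'] * N).astype(int)
df = pd.merge(A, B, how='outer', on=['col1', 'col2'])
df['col2'] = df['col2'] / N  # back to the original scale
print(df)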
   col1   col2  sub grade group ID
0     1  34.32    x     a     t  z
1     1  34.32    x     b     t  z
2     1  34.33    y     c     r  z
3     2  10.14    z     b     q  z
4     3  33.01    z     a     q  e
5     1  54.32  NaN   NaN     s  w
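The same idea can be wrapped in a small helper that leaves the input frames untouched. This is only a sketch: `merge_on_rounded` and its temporary key column are hypothetical names, the `1e-9` noise added to B is made up to mimic the problem, and it assumes the float column has no NaN values (the integer cast would fail on them).

```python
import numpy as np
import pandas as pd

def merge_on_rounded(left, right, float_col, other_keys, decimals=2, how="outer"):
    """Merge on an integer key derived from float_col (hypothetical helper)."""
    scale = 10 ** decimals
    key = "_" + float_col + "_key"  # temporary column name, assumed unused
    # assign() returns copies, so the caller's frames are not modified.
    left = left.assign(**{key: np.round(left[float_col] * scale).astype("int64")})
    right = right.assign(**{key: np.round(right[float_col] * scale).astype("int64")})
    merged = pd.merge(
        left.drop(columns=float_col),
        right.drop(columns=float_col),
        how=how,
        on=list(other_keys) + [key],
    )
    # Rebuild the float column from the integer key, then drop the key.
    merged[float_col] = merged[key] / scale
    return merged.drop(columns=key)

A = pd.DataFrame({"col1": [1, 1, 1, 2, 3],
                  "col2": [34.32, 34.32, 34.33, 10.14, 33.01],
                  "sub": list("xxyzz"), "grade": list("abcba")})
B = pd.DataFrame({"col1": [1, 1, 1, 2, 3],
                  "col2": [34.32, 54.32, 34.33, 10.14, 33.01],
                  "group": list("tsrqq"), "ID": list("zwzze")})
B["col2"] = B["col2"] + 1e-9  # made-up noise that breaks an exact merge

df = merge_on_rounded(A, B, "col2", ["col1"])
print(df)
```

Note that in this sketch the rebuilt col2 ends up as the last column. Values within half a unit of the last kept decimal land on the same key, so `decimals` should match the precision at which the data are actually meaningful.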

Answer 2

Answer by Sesquipedalism

I had a similar problem where I needed to identify matching rows with thousands of float columns and no identifier. This case is difficult because values can vary slightly due to rounding.

In this case, I used scipy.spatial.distance.cosine to get the cosine similarity between rows.

from scipy import spatial

threshold = 0.99999
# cosine() returns the cosine distance, so 1 - distance is the similarity
similarity = 1 - spatial.distance.cosine(row1, row2)

if similarity >= threshold:
    pass  # it's a match
else:
    pass  # loop and check another row pair

This won't work if you have duplicate or very similar rows, but when you have a large number of float columns and not too many rows, it works well.
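The pairwise loop can also be written in vectorized form. The sketch below swaps in plain NumPy for scipy.spatial.distance.cosine and compares every row of one array against every row of another; `matching_row_pairs` and the sample arrays are hypothetical, and it assumes no row is all zeros (normalization would divide by zero).

```python
import numpy as np

def matching_row_pairs(X, Y, threshold=0.99999):
    """Return (i, j) pairs where X[i] and Y[j] have cosine similarity >= threshold."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Normalize each row to unit length; assumes no all-zero rows.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    # similarity[i, j] is the cosine of the angle between X[i] and Y[j].
    similarity = Xn @ Yn.T
    i, j = np.nonzero(similarity >= threshold)
    return list(zip(i.tolist(), j.tolist()))

# Made-up rows: Y[0] is X[1] with a tiny rounding difference, Y[1] equals X[0].
X = np.array([[1.0, 2.0, 3.0], [4.0, 0.5, 1.0]])
Y = np.array([[4.0, 0.5, 1.0000001], [1.0, 2.0, 3.0], [0.0, 1.0, 0.0]])

pairs = matching_row_pairs(X, Y)
print(pairs)  # [(0, 1), (1, 0)]
```

Keep in mind that cosine similarity ignores magnitude, so a row and a scaled copy of it (for example [1, 2, 3] and [2, 4, 6]) also count as a match. The similarity matrix has one entry per row pair, which is why this fits the "not too many rows" case.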