Merge pandas DataFrame on column of float values
Note: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/41135371/
Asked by Megan
I have two data frames that I am trying to merge.
Dataframe A:
   col1   col2 sub grade
0     1  34.32   x     a
1     1  34.32   x     b
2     1  34.33   y     c
3     2  10.14   z     b
4     3  33.01   z     a
Dataframe B:
   col1   col2 group ID
0     1  34.32     t  z
1     1  54.32     s  w
2     1  34.33     r  z
3     2  10.14     q  z
4     3  33.01     q  e
I want to merge on col1 and col2. I've been using pd.merge with the following syntax:
pd.merge(A, B, how = 'outer', on = ['col1', 'col2'])
However, I think I am running into issues joining on the float values of col2 since many rows are being dropped. Is there any way to use np.isclose to match the values of col2? When I reference the index of a particular value of col2 in either dataframe, the value has many more decimal places than what is displayed in the dataframe.
I would like the result to be:
   col1   col2 sub grade group   ID
0     1  34.32   x     a     t    z
1     1  34.32   x     b     s    w
2     1  54.32   s     w   NaN  NaN
3     1  34.33   y     c     r    z
4     2  10.14   z     b     q    z
5     3  33.01   z     a     q    e
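To illustrate the symptom described in the question, here is a minimal sketch with made-up data (the frames and the 0.1 + 0.2 value are assumptions, not the asker's data). It shows two values that print identically but fail an exact-match merge, and one possible way to apply np.isclose: pair rows on col1 only, then keep the pairs whose col2 values are close.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 0.1 + 0.2 is displayed as 0.3 at pandas' default
# precision, but it is stored as 0.30000000000000004.
A = pd.DataFrame({"col1": [1], "col2": [0.1 + 0.2], "sub": ["x"]})
B = pd.DataFrame({"col1": [1], "col2": [0.3], "group": ["t"]})

# Exact-match merge: the float keys differ in their last bits, so nothing joins.
exact = pd.merge(A, B, how="inner", on=["col1", "col2"])
print(len(exact))  # 0

# Pair rows on col1 only, then filter the pairs with np.isclose on col2.
pairs = pd.merge(A, B, on="col1", suffixes=("", "_B"))
close = pairs[np.isclose(pairs["col2"], pairs["col2_B"])].drop(columns="col2_B")
print(close)
```

Note that this sketch has inner-join semantics, not the outer join the question asks for, and the intermediate pairing can grow large when many rows share the same col1 value.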
Accepted answer by jezrael
You can use a little hack: multiply the float columns by some constant such as 100 or 1000, convert the column to int, merge, and finally divide by the constant:
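import numpy as np
import pandas as pd

N = 100  # scale factor: 10 ** (number of decimal places to keep)
# round before casting so a value like 3431.9999... is not truncated to 3431
# (thank you koalo for the comment)
A['col2'] = np.round(A['col2'] * N).astype(int)
B['col2'] = np.round(B['col2'] * N).astype(int)
df = pd.merge(A, B, how='outer', on=['col1', 'col2'])
# restore the original scale after merging
df['col2'] = df['col2'] / N
print(df)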
   col1   col2  sub grade group ID
0     1  34.32    x     a     t  z
1     1  34.32    x     b     t  z
2     1  34.33    y     c     r  z
3     2  10.14    z     b     q  z
4     3  33.01    z     a     q  e
5     1  54.32  NaN   NaN     s  w
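For anyone who wants to try the scaling approach end to end, the following is a self-contained sketch. It rebuilds the two frames from the question and adds a 1e-12 offset to B's col2 as an assumed stand-in for the hidden float noise. The helper name merge_on_scaled is hypothetical, and it works on copies rather than overwriting A and B.

```python
import numpy as np
import pandas as pd

# Sample frames from the question.
A = pd.DataFrame({
    "col1": [1, 1, 1, 2, 3],
    "col2": [34.32, 34.32, 34.33, 10.14, 33.01],
    "sub": list("xxyzz"),
    "grade": list("abcba"),
})
B = pd.DataFrame({
    "col1": [1, 1, 1, 2, 3],
    "col2": [34.32, 54.32, 34.33, 10.14, 33.01],
    "group": list("tsrqq"),
    "ID": list("zwzze"),
})
# Assumption: simulate float noise that is invisible at display precision.
B["col2"] = B["col2"] + 1e-12


def merge_on_scaled(left, right, n=100):
    """Outer-merge on col1 and col2 after scaling col2 to integers (hypothetical helper)."""
    left = left.assign(col2=np.round(left["col2"] * n).astype(int))
    right = right.assign(col2=np.round(right["col2"] * n).astype(int))
    out = pd.merge(left, right, how="outer", on=["col1", "col2"])
    out["col2"] = out["col2"] / n  # back to the original scale
    return out


naive = pd.merge(A, B, how="outer", on=["col1", "col2"])
fixed = merge_on_scaled(A, B)
print(len(naive), len(fixed))  # 10 6
```

With the noise in place, the naive outer merge matches nothing and returns all 10 rows separately, while the scaled merge returns the 6 rows shown in the answer. The constant n has to match the precision at which the values should be considered equal; values that differ within that precision will be merged together.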
Answer by Sesquipedalism
I had a similar problem where I needed to identify matching rows with thousands of float columns and no identifier. This case is difficult because values can vary slightly due to rounding.
In this case, I used scipy.spatial.distance.cosine to get the cosine similarity between rows.
from scipy.spatial import distance

threshold = 0.99999
# cosine() returns a distance, so 1 - distance is the cosine similarity
similarity = 1 - distance.cosine(row1, row2)
if similarity >= threshold:
    pass  # it's a match
else:
    pass  # loop and check another row pair
This won't work if you have duplicate or very similar rows, but when you have a large number of float columns and not too many rows, it works well.
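The loop that the snippet above hints at could look roughly like the sketch below. The function name find_matches and the random test data are assumptions for illustration, not part of the original answer. It compares every row of one array against every row of another and records the first pair that clears the threshold.

```python
import numpy as np
from scipy.spatial import distance


def find_matches(rows_a, rows_b, threshold=0.99999):
    """Return (i, j) index pairs whose cosine similarity meets the threshold."""
    matches = []
    for i, row_a in enumerate(rows_a):
        for j, row_b in enumerate(rows_b):
            # cosine() is a distance; subtract from 1 to get similarity
            similarity = 1 - distance.cosine(row_a, row_b)
            if similarity >= threshold:
                matches.append((i, j))
                break  # stop at the first match for this row
    return matches


# Hypothetical data: 3 rows of 1000 float columns, and a copy in reverse
# row order with tiny rounding-like noise added.
rng = np.random.default_rng(0)
rows_a = rng.normal(size=(3, 1000))
rows_b = rows_a[::-1] + rng.normal(scale=1e-9, size=(3, 1000))

print(find_matches(rows_a, rows_b))  # [(0, 2), (1, 1), (2, 0)]
```

The nested loop is quadratic in the number of rows, which is consistent with the answer's caveat about not having too many rows. Cosine similarity also ignores magnitude, so a row and a scaled copy of it would count as a match.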