pandas: case-insensitive dataframe.merge

Question

Asked by EMC

I am struggling with the easiest way to do a case insensitive merge in pandas. Is there a way to do it right on the merge? Do I need to use (?i) or a regex with ignorecase? In my code snippet below I am joining some Countries where it may be "United States" in one file and "UNITED STATES" in another and I just want to take the case out of the equation. Thank you!

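import pandas as pd
import csv
import sys

# Directories containing the two CSV files, passed on the command line
env_path = sys.argv[1]
map_path = sys.argv[2]

# Backslashes are escaped: "\a" in a plain string literal is the bell character, not a path separator
df_address = pd.read_csv(env_path + "\\address.csv")
df_CountryMapping = pd.read_csv(map_path + "\\CountryMapping.csv")

# Case-sensitive merge: "United States" will not match "UNITED STATES"
df_merged = df_address.merge(df_CountryMapping, left_on="Country", right_on="NAME", how="left")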

....

Answer 1

Answered by Shashank Agarwal

Lowercase the values in the two columns that will be used for the merge, and then merge on the lowercased columns:

df_address['country_lower'] = df_address['Country'].str.lower()
df_CountryMapping['name_lower'] = df_CountryMapping['NAME'].str.lower()
df_merged = df_address.merge(df_CountryMapping, left_on="country_lower", right_on="name_lower", how="left")
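
The helper columns end up in the merged result, which may not be wanted. Below is a minimal sketch of the same approach on toy data, with the helper columns dropped afterwards. The sample rows and the "CODE" column are assumptions for illustration, not taken from the original files.

```python
import pandas as pd

# Toy data standing in for address.csv and CountryMapping.csv (assumed)
df_address = pd.DataFrame({"Country": ["United States", "canada"]})
df_CountryMapping = pd.DataFrame({"NAME": ["UNITED STATES", "Canada"],
                                  "CODE": ["US", "CA"]})

# Helper columns holding lowercased copies of the merge keys
df_address["country_lower"] = df_address["Country"].str.lower()
df_CountryMapping["name_lower"] = df_CountryMapping["NAME"].str.lower()

df_merged = df_address.merge(df_CountryMapping, left_on="country_lower",
                             right_on="name_lower", how="left")

# Drop the helper columns so only the original columns remain
df_merged = df_merged.drop(columns=["country_lower", "name_lower"])
print(df_merged)
```

The original "Country" and "NAME" values keep their casing; only the temporary keys are lowercased.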

Answer 2

Answered by Uri Goren

I suggest lowercasing the column names after reading the files:

df_address.columns = [c.lower() for c in df_address.columns]
df_CountryMapping.columns = [c.lower() for c in df_CountryMapping.columns]

Then lowercase the values in the merge columns:

df_address['country'] = df_address['country'].str.lower()
df_CountryMapping['name'] = df_CountryMapping['name'].str.lower()

Only then do the merge:

df_merged = df_address.merge(df_CountryMapping, left_on="country", right_on="name", how="left")
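
Overwriting the columns this way discards the original casing of the values. If that casing needs to survive, one possible variation is to wrap the steps in a small function that builds the lowercase key on copies. The helper name `merge_ci`, the key name `_merge_key`, and the sample data are all hypothetical; nothing like this is built into pandas.

```python
import pandas as pd


def merge_ci(left, right, left_on, right_on, how="left"):
    """Merge two frames on string columns, ignoring case.

    Hypothetical helper, not part of pandas. The inputs are not modified.
    """
    key = "_merge_key"  # assumed not to clash with an existing column
    # assign() returns a new frame, so the callers' frames stay untouched
    left_keyed = left.assign(**{key: left[left_on].str.lower()})
    right_keyed = right.assign(**{key: right[right_on].str.lower()})
    return left_keyed.merge(right_keyed, on=key, how=how).drop(columns=key)


# Toy data (assumed) to exercise the helper
addresses = pd.DataFrame({"Country": ["United States", "FRANCE"]})
mapping = pd.DataFrame({"NAME": ["UNITED STATES", "France"],
                        "CODE": ["US", "FR"]})

merged = merge_ci(addresses, mapping, left_on="Country", right_on="NAME")
print(merged)
```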

Answer 3

Answered by mway

One solution would be to convert the column names of both data frames to lowercase (note that this normalizes only the column labels, not the values being compared). Something like this:

df_address = pd.read_csv(env_path + "\\address.csv")
df_CountryMapping = pd.read_csv(map_path + "\\CountryMapping.csv")

df_address.rename(columns=lambda x: x.lower(), inplace=True)
df_CountryMapping.rename(columns=lambda x: x.lower(), inplace=True)

df_merged = df_address.merge(df_CountryMapping, left_on="country", right_on="name", how="left")
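
Renaming columns changes the labels but leaves the cell values as they were. A quick way to check whether any rows still fail to match is the `indicator=True` option of `merge`. The sketch below uses assumed toy data and a hypothetical "CODE" column.

```python
import pandas as pd

# Toy data (assumed): one value differs only by case, one matches exactly
df_address = pd.DataFrame({"Country": ["United States", "Canada"]})
df_CountryMapping = pd.DataFrame({"NAME": ["UNITED STATES", "Canada"],
                                  "CODE": ["US", "CA"]})

# Lowercase the column labels only; the values are unchanged
df_address = df_address.rename(columns=str.lower)
df_CountryMapping = df_CountryMapping.rename(columns=str.lower)

# indicator=True adds a "_merge" column: "both" for matched rows,
# "left_only" for left rows that found no partner
check = df_address.merge(df_CountryMapping, left_on="country", right_on="name",
                         how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "country"].tolist()
print(unmatched)
```

On this data the "United States" row stays unmatched, so the values themselves still need lowercasing before the merge.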

Answer 4

Answered by Lelouch

Another option is ".str.casefold()", which applies Unicode case folding and so handles non-English characters more thoroughly than ".str.lower()". If you're only using English alphabetic characters, it should give the same result as ".str.lower()".

df_address['country_casefolded'] = df_address['Country'].str.casefold()
df_CountryMapping['name_casefolded'] = df_CountryMapping['NAME'].str.casefold()
df_merged = df_address.merge(df_CountryMapping, left_on="country_casefolded", right_on="name_casefolded", how="left")
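
To see where the two methods differ, here is a small sketch using the German "ß", which case folding expands to "ss" while lowercasing leaves it alone. The sample strings are assumptions chosen for illustration.

```python
import pandas as pd

# Two spellings of the same word (assumed sample values)
names = pd.Series(["STRASSE", "Straße"])

lowered = names.str.lower()      # "ß" is already lowercase and stays as is
folded = names.str.casefold()    # "ß" is folded to "ss"

print(lowered.tolist())
print(folded.tolist())
```

After lowercasing, the two values are still different strings; after case folding they are equal, so they would match as merge keys.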
