Original question: http://stackoverflow.com/questions/29761915/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverFlow
Case insensitive pandas dataframe.merge
Asked by EMC
I am struggling to find the easiest way to do a case-insensitive merge in pandas. Is there a way to do it directly in the merge? Do I need to use (?i) or a regex with ignorecase? In my code snippet below I am joining on country names, where a value may be "United States" in one file and "UNITED STATES" in another, and I just want to take case out of the equation. Thank you!
import os
import sys
import pandas as pd
env_path = sys.argv[1]
map_path = sys.argv[2]
# os.path.join avoids backslash escapes: "\a" in "\address.csv" is a bell character.
df_address = pd.read_csv(os.path.join(env_path, "address.csv"))
df_CountryMapping = pd.read_csv(os.path.join(map_path, "CountryMapping.csv"))
df_merged = df_address.merge(df_CountryMapping, left_on="Country", right_on="NAME", how="left")
....
Answered by Shashank Agarwal
Lowercase the values in the two columns that will be used to merge, and then merge on the lowercased columns
df_address['country_lower'] = df_address['Country'].str.lower()
df_CountryMapping['name_lower'] = df_CountryMapping['NAME'].str.lower()
df_merged = df_address.merge(df_CountryMapping, left_on="country_lower", right_on="name_lower", how="left")
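A minimal, self-contained sketch of this approach, using small made-up frames in place of the CSV files (the sample rows and the `CODE` column are assumptions for illustration only). It also drops the helper columns after the merge, which is an optional extra step:

```python
import pandas as pd

# Hypothetical sample data standing in for address.csv and CountryMapping.csv.
df_address = pd.DataFrame({"Country": ["United States", "france", "Narnia"]})
df_CountryMapping = pd.DataFrame(
    {"NAME": ["UNITED STATES", "FRANCE"], "CODE": ["US", "FR"]}
)

# Build lowercased helper keys so the original columns stay untouched.
df_address["country_lower"] = df_address["Country"].str.lower()
df_CountryMapping["name_lower"] = df_CountryMapping["NAME"].str.lower()

df_merged = df_address.merge(
    df_CountryMapping,
    left_on="country_lower",
    right_on="name_lower",
    how="left",
)

# The helper keys are no longer needed once the rows are matched.
df_merged = df_merged.drop(columns=["country_lower", "name_lower"])
print(df_merged)
```

Because the merge runs on separate helper columns, the original spelling in `Country` and `NAME` is preserved in the result, and rows without a match (here the made-up "Narnia") keep missing values in the mapping columns.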
Answered by Uri Goren
I suggest lowercasing the column names after reading the files:
df_address.columns=[c.lower() for c in df_address.columns]
df_CountryMapping.columns=[c.lower() for c in df_CountryMapping.columns]
Then lowercase the values in the key columns:
df_address['country']=df_address['country'].str.lower()
df_CountryMapping['name']=df_CountryMapping['name'].str.lower()
Only then perform the merge:
df_merged = df_address.merge(df_CountryMapping, left_on="country", right_on="name", how="left")
Answered by mway
One solution would be to convert the column names of both data frames to be all lowercase. So something like this:
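import os
# os.path.join avoids the "\a" (bell) escape hidden in "\address.csv".
df_address = pd.read_csv(os.path.join(env_path, "address.csv"))
df_CountryMapping = pd.read_csv(os.path.join(map_path, "CountryMapping.csv"))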
df_address.rename(columns=lambda x: x.lower(), inplace=True)
df_CountryMapping.rename(columns=lambda x: x.lower(), inplace=True)
df_merged = df_address.merge(df_CountryMapping, left_on="country", right_on="name", how="left")
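Note that renaming columns changes only the labels, not the cell values, so on its own it may not make the match case-insensitive. A small sketch with made-up rows (the sample data and the `CODE` column are assumptions) illustrates the difference:

```python
import pandas as pd

# Hypothetical sample data; the values differ only by case.
df_address = pd.DataFrame({"Country": ["United States"]})
df_CountryMapping = pd.DataFrame({"NAME": ["UNITED STATES"], "CODE": ["US"]})

# Lowercase the column labels only; the values keep their original case.
df_address = df_address.rename(columns=str.lower)
df_CountryMapping = df_CountryMapping.rename(columns=str.lower)

labels_only = df_address.merge(
    df_CountryMapping, left_on="country", right_on="name", how="left"
)
print(labels_only["code"].isna().all())  # the row finds no match

# Lowercasing the key values as well lets the rows match.
df_address["country"] = df_address["country"].str.lower()
df_CountryMapping["name"] = df_CountryMapping["name"].str.lower()
values_too = df_address.merge(
    df_CountryMapping, left_on="country", right_on="name", how="left"
)
print(values_too["code"].tolist())
```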
Answered by Lelouch
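Another option is ".str.casefold()", which applies more aggressive Unicode case folding than ".str.lower()" and so handles characters from other languages more thoroughly (for example, the German "ß" is folded to "ss"). If you're just using English alphabetic characters, it should give the same result as ".str.lower()".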
df_address['country_casefolded'] = df_address['Country'].str.casefold()
df_CountryMapping['name_casefolded'] = df_CountryMapping['NAME'].str.casefold()
df_merged = df_address.merge(df_CountryMapping, left_on="country_casefolded", right_on="name_casefolded", how="left")
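To see where the two methods differ, here is a small sketch comparing them on a made-up German street name (the frames and column names are hypothetical, chosen only because "ß" is a well-known case where `lower()` and `casefold()` disagree):

```python
import pandas as pd

# Hypothetical data: "ß" casefolds to "ss", while lower() leaves it unchanged.
left = pd.DataFrame({"street": ["Hauptstraße"]})
right = pd.DataFrame({"street_name": ["HAUPTSTRASSE"]})

# Compare the keys element-wise after each normalization.
lowered_match = left["street"].str.lower().eq(right["street_name"].str.lower())
folded_match = left["street"].str.casefold().eq(right["street_name"].str.casefold())

print(lowered_match.tolist())
print(folded_match.tolist())
```

With lowercasing the two keys stay different ("hauptstraße" versus "hauptstrasse"), so a merge on them would not pair the rows; with case folding both become "hauptstrasse".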

