Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow, original URL: http://stackoverflow.com/questions/33157643/

Time: 2020-09-14 00:02:58  Source: igfitidea

pandas replace (erase) different characters from strings

python string text pandas replace

Asked by As3adTintin

I have a list of high schools. I would like to erase certain characters, words, and symbols from the strings.

I currently have:

df['schoolname'] = df['schoolname'].str.replace('high', "")

However, I would like to use a list so I can quickly replace high, school, /, etc.

Any suggestions?

df['schoolname'] = df['schoolname'].str.replace(['high', 'school'], "") 

does not work

Answered by Andy Hayden

Use a regex (separate the strings with |):

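# regex=True is needed in pandas 2.0 and later, where str.replace matches literally by default
df['schoolname'] = df['schoolname'].str.replace('high|school', "", regex=True)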
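
Since the question asks for a list, the pattern can also be built from one. The following is a minimal sketch, not part of the original answer: the sample data and the name to_remove are made up for illustration, and re.escape is added so that terms containing regex metacharacters are treated literally.

```python
import re
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
df = pd.DataFrame(
    {"schoolname": ["lincoln high school", "east/west high", "central school"]}
)

# Terms to erase (illustrative). re.escape protects any regex metacharacters.
to_remove = ["high", "school", "/"]
pattern = "|".join(re.escape(term) for term in to_remove)

# Replace every match with an empty string, then trim leftover whitespace.
df["schoolname"] = (
    df["schoolname"].str.replace(pattern, "", regex=True).str.strip()
)
print(df["schoolname"].tolist())
```

The trailing .str.strip() is optional; it only removes the spaces left behind at the ends of each string once the words are erased.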

Answered by Amir Imani

You can create a dictionary and then pass it to the .replace({}, regex=True) method:

replacements = {
   'schoolname': {
      r'(high|school)': ''}
}

df.replace(replacements, regex=True, inplace=True)
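
To see what the nested dictionary does, here is a small sketch on made-up data (the city column and the sample values are assumptions for illustration). The outer key names the column to search, so other columns are left alone even when they contain the same substrings. This version returns a new DataFrame instead of using inplace=True.

```python
import pandas as pd

# Hypothetical data: "city" also contains "high" and "school" as substrings.
df = pd.DataFrame(
    {
        "schoolname": ["lincoln high school", "central school"],
        "city": ["highland", "schoolcraft"],
    }
)

# Outer key: column to search. Inner dict: regex pattern -> replacement.
replacements = {"schoolname": {r"high|school": ""}}

cleaned = df.replace(replacements, regex=True)
cleaned["schoolname"] = cleaned["schoolname"].str.strip()

print(cleaned)
```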

Answered by mrHymanlu

My problem: I wanted to find a simple way to delete characters / symbols using the replace method in pandas.

I had the following array in a data frame:

  df = array(['2012', '2016', '2011', '2013', '2015', '2017', '2001', '2007',
   '[2005], ?2004.', '2005', '2009', '2008', '2009, c2008.', '2006',
   '2019', '[2003]', '2018', '2012, c2011.', '[2012]', 'c2012.',
   '2014', '2002', 'c2005.', '[2000]', 'c2000.', '2010',
   '2008, c2007.', '2011, c2010.', '2011, ?2002.', 'c2011.', '[2017]',
   'c1996.', '[2018]', '[2019]', '[2011]', '2000', '2000, c1995.',
   '[2004]', '2005, ?2004.', 'c2004.', '[2009]', 'c2009.', '[2014]',
   '1999', '[2010]', 'c2010.', '[2006]', '2007, 2006.', '[2013]',
   'c2001.', 'C2016.', '2008, c2006.', '2011, ?2010.', '2007, c2005.',
   '2009, c2005.', 'c2002.', '[2004], c2003.', '2009, c2007.', '2003',
   '?2003.', '[2016]', '[2001]', '2010, c2001.', '[1998]', 'c1998.'],
  dtype=object)

As you can see, the years were entered using multiple formats (ugh!) with brackets and copyright symbols and lowercase c and uppercase C.

Now I wanted to remove those unwanted characters and keep only the four-digit years. Since the column holds an array of objects, you need to go through the .str accessor so that replace() is applied to each value as a string. Create a variable containing all the characters you want replaced and separate them with | (no surrounding spaces, which would otherwise become part of the pattern).

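# ? must be escaped as well: a bare ? is a regex quantifier, not a literal character
rep_chars = r'c|C|\]|\[|\?|\.'

df['Year'] = df['Year'].str.replace(rep_chars, "", regex=True)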

Make sure to use \. and not just the period, which matches any character in a regex. The same goes for \], \[ and \?.

Output:

array(['2012', '2016', '2011', '2013', '2015', '2017', '2001', '2007',
   '2005, 2004', '2005', '2009', '2008', '2009, 2008', '2006', '2019',
   '2003', '2018', '2012, 2011', '2014', '2002', '2000', '2010',
   '2008, 2007', '2011, 2010', '2011, 2002', '1996', '2000, 1995',
   '2004', '1999', '2007, 2006', '2008, 2006', '2007, 2005',
   '2009, 2005', '2004, 2003', '2009, 2007', '2010, 2001', '1998'],
  dtype=object)
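
As a variation on the approach above, here is a hedged sketch of two alternatives, using a handful of the values shown earlier in a standalone Series (the variable names are illustrative). The first collapses the alternation into a single character class; the second keeps only runs of four digits instead of listing what to remove, which may be more robust if other stray symbols turn up.

```python
import pandas as pd

# A few of the raw values from the array above, as a standalone Series.
years = pd.Series(
    ["[2005], ?2004.", "c2012.", "C2016.", "2009, c2008.", "[2003]", "2014"]
)

# Option 1: one character class instead of c|C|\]|\[|\?|\.
# Inside [...], ? and . are literal; only the brackets need escaping.
cleaned = years.str.replace(r"[cC\[\]?.]", "", regex=True)

# Option 2: keep what you want rather than delete what you don't.
# findall returns a list of four-digit runs per row; join turns it back into text.
extracted = years.str.findall(r"\d{4}").str.join(", ")

print(cleaned.tolist())
print(extracted.tolist())
```

On this sample both options give the same strings, for example '2005, 2004' for the first row.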

Happy Data Cleaning!