pandas: replace (erase) different characters in strings

Question

Asked by As3adTintin

I have a list of high schools. I would like to erase certain characters, words, and symbols from the strings.

I currently have:

df['schoolname'] = df['schoolname'].str.replace('high', "")

However, I would like to use a list so I can quickly replace high, school, /, etc.

Any suggestions?

df['schoolname'] = df['schoolname'].str.replace(['high', 'school'], "")

does not work

Answer 1

Answered by Andy Hayden

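Use a regex (separate the strings with |). In pandas 2.0 and later, pass regex=True explicitly, because str.replace treats the pattern as a literal string by default:

df['schoolname'] = df['schoolname'].str.replace('high|school', "", regex=True)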
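
Since the question asks for a list, here is a minimal sketch of building the pattern from one. The sample data and the name to_remove are made up for illustration; re.escape is used so that any term containing a regex metacharacter is matched literally.

```python
import re
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
df = pd.DataFrame(
    {"schoolname": ["central high school", "north/south high", "east school"]}
)

# Terms to erase (assumed list). Escape each one, then join with "|".
to_remove = ["high", "school", "/"]
pattern = "|".join(re.escape(term) for term in to_remove)

df["schoolname"] = (
    df["schoolname"]
    .str.replace(pattern, "", regex=True)   # erase every listed term
    .str.replace(r"\s+", " ", regex=True)   # collapse leftover runs of spaces
    .str.strip()                            # trim leading/trailing spaces
)

print(df["schoolname"].tolist())
# ['central', 'northsouth', 'east']
```

The two extra steps after the first replace are optional; they only tidy up the whitespace that erasing whole words leaves behind.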

Answer 2

Answered by Amir Imani

You can create a dictionary and pass it to the .replace({}, regex=True) method:

replacements = {
   'schoolname': {
      r'(high|school)': ''}
}

df.replace(replacements, regex=True, inplace=True)
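
As a rough illustration of how the nested dictionary scopes the replacement to one column, here is a small sketch on made-up data. The city column and its values are assumptions added only to show that other columns are left alone; this version also returns a new frame instead of using inplace=True.

```python
import pandas as pd

# Hypothetical data: "city" exists only to show it is not touched.
df = pd.DataFrame(
    {
        "schoolname": ["central high", "east school"],
        "city": ["highland", "schoolcraft"],
    }
)

# Outer key = column name, inner dict = {regex pattern: replacement}.
replacements = {"schoolname": {r"(high|school)": ""}}

cleaned = df.replace(replacements, regex=True)

print(cleaned["schoolname"].tolist())  # ['central ', 'east ']
print(cleaned["city"].tolist())        # ['highland', 'schoolcraft']
```

Note the trailing space left in schoolname; a follow-up .str.strip() on that column would remove it.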

Answer 3

Answered by mrHymanlu

My problem: I wanted to find a simple solution for deleting characters and symbols using the replace method in pandas.

I had the following array in a data frame:

  df = array(['2012', '2016', '2011', '2013', '2015', '2017', '2001', '2007',
   '[2005], ?2004.', '2005', '2009', '2008', '2009, c2008.', '2006',
   '2019', '[2003]', '2018', '2012, c2011.', '[2012]', 'c2012.',
   '2014', '2002', 'c2005.', '[2000]', 'c2000.', '2010',
   '2008, c2007.', '2011, c2010.', '2011, ?2002.', 'c2011.', '[2017]',
   'c1996.', '[2018]', '[2019]', '[2011]', '2000', '2000, c1995.',
   '[2004]', '2005, ?2004.', 'c2004.', '[2009]', 'c2009.', '[2014]',
   '1999', '[2010]', 'c2010.', '[2006]', '2007, 2006.', '[2013]',
   'c2001.', 'C2016.', '2008, c2006.', '2011, ?2010.', '2007, c2005.',
   '2009, c2005.', 'c2002.', '[2004], c2003.', '2009, c2007.', '2003',
   '?2003.', '[2016]', '[2001]', '2010, c2001.', '[1998]', 'c1998.'],
  dtype=object)

As you can see, the years were entered in multiple formats (ugh!), with brackets, copyright symbols (which appear as ? in the data above), and both lowercase c and uppercase C.

Now I wanted to remove those unwanted characters and keep only the four-digit years. Because the values live in a DataFrame column, call replace() through the .str accessor so that it operates on each string. Put all the characters you want replaced into one pattern, separated by | (with no spaces around it, since spaces would become part of the pattern).

rep_chars = r'c|C|\]|\[|\?|\.'

df['Year'] = df['Year'].str.replace(rep_chars, "", regex=True)

Make sure to use \. and not just the period, since a bare period matches any character in a regex. The same goes for \] and \[, and a literal ? must be written as \?.
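
To see the escaping in action, here is a small sketch on a handful of values taken from the array above. The Series name years is just for illustration; the second pattern is an equivalent character class, shown as an alternative way to write the same thing.

```python
import pandas as pd

# A few sample values copied from the array shown earlier.
years = pd.Series(["[2005], ?2004.", "c2012.", "C2016.", "[2003]", "2010"])

# Alternation form: every metacharacter ([, ], ?, .) is escaped.
rep_chars = r"c|C|\]|\[|\?|\."
cleaned = years.str.replace(rep_chars, "", regex=True)

# Equivalent character-class form; inside [...] the ? and . are literal.
cleaned_alt = years.str.replace(r"[cC\[\]?.]", "", regex=True)

print(cleaned.tolist())
# ['2005, 2004', '2012', '2016', '2003', '2010']
```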

Output:

array(['2012', '2016', '2011', '2013', '2015', '2017', '2001', '2007',
   '2005, 2004', '2005', '2009', '2008', '2009, 2008', '2006', '2019',
   '2003', '2018', '2012, 2011', '2014', '2002', '2000', '2010',
   '2008, 2007', '2011, 2010', '2011, 2002', '1996', '2000, 1995',
   '2004', '1999', '2007, 2006', '2008, 2006', '2007, 2005',
   '2009, 2005', '2004, 2003', '2009, 2007', '2010, 2001', '1998'],
  dtype=object)
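
A different technique from the one above, in case it helps: instead of listing the characters to erase, you can pull out the four-digit runs directly. This is only a sketch on a few sample values, and it assumes every year is exactly four digits.

```python
import pandas as pd

# A few sample values copied from the original array.
raw = pd.Series(["[2005], ?2004.", "c2012.", "2010"])

# All four-digit runs per entry, as a list of strings.
all_years = raw.str.findall(r"\d{4}")

# Only the first four-digit run per entry, as a plain Series.
first_year = raw.str.extract(r"(\d{4})", expand=False)

print(all_years.tolist())   # [['2005', '2004'], ['2012'], ['2010']]
print(first_year.tolist())  # ['2005', '2012', '2010']
```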

Happy Data Cleaning!
