Python re.sub error: "expected string or bytes-like object"

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/43727583/

Time: 2020-08-19 23:20:10  Source: igfitidea

re.sub erroring with "Expected string or bytes-like object"

python regex pandas nltk

Asked by imanexcelnoob

I have read multiple posts regarding this error, but I still can't figure it out. When I try to loop through my function:


import re
from nltk.corpus import stopwords

def fix_Plan(location):
    letters_only = re.sub("[^a-zA-Z]",  # Search for all non-letters
                          " ",          # Replace all non-letters with spaces
                          location)     # Column and row to search

    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)

col_Plan = fix_Plan(train["Plan"][0])
num_responses = train["Plan"].size
clean_Plan_responses = []

for i in range(0, num_responses):
    clean_Plan_responses.append(fix_Plan(train["Plan"][i]))

Here is the error:


Traceback (most recent call last):
  File "C:/Users/xxxxx/PycharmProjects/tronc/tronc2.py", line 48, in <module>
    clean_Plan_responses.append(fix_Plan(train["Plan"][i]))
  File "C:/Users/xxxxx/PycharmProjects/tronc/tronc2.py", line 22, in fix_Plan
    location)  # Column and row to search
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36\lib\re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object

Answered by abccd

As you stated in the comments, some of the values appear to be floats, not strings. You need to convert them to strings before passing them to re.sub. The simplest way is to change location to str(location) in the re.sub call. It doesn't hurt to do this even if the value is already a str.

letters_only = re.sub("[^a-zA-Z]",  # Search for all non-letters
                      " ",          # Replace all non-letters with spaces
                      str(location))
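
For illustration, here is a minimal sketch of why the cast helps. It assumes the problematic value is a float NaN, which is how pandas commonly represents an empty cell; that is an assumption, since the question does not show the data. The helper name letters_only_text is hypothetical.

```python
import re

def letters_only_text(value):
    """Replace every non-letter with a space, casting the input to str first."""
    return re.sub("[^a-zA-Z]", " ", str(value))

# Assumed stand-in for an empty cell in the "Plan" column.
missing = float("nan")

# Without the cast, re.sub rejects the float.
try:
    re.sub("[^a-zA-Z]", " ", missing)
    raised = False
except TypeError:
    raised = True

print(raised)                        # True
print(letters_only_text("Plan-A"))   # Plan A
print(letters_only_text(missing))    # nan
```

Note that str() turns a NaN into the literal text "nan", so it may be worth deciding separately how missing values should be handled.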

Answered by msaif

The simplest solution is to apply Python's str function to the column you are trying to loop through.

If you are using pandas, this can be implemented as:

dataframe['column_name']=dataframe['column_name'].apply(str)
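
As a rough sketch of how this plays out, assume a small made-up frame with one missing value (the data and variable names below are illustrative, not from the question). The second option, calling fillna("") before the cast, is an alternative that does not come from this answer; it keeps missing cells empty instead of turning them into the text "nan".

```python
import pandas as pd

# Hypothetical data: the middle cell is missing and stored as a float NaN.
train = pd.DataFrame({"Plan": ["Gold-Plan 2", float("nan"), "Basic"]})

# Option 1 (as in the answer): cast every value with str.
as_text = train["Plan"].apply(str)

# Option 2 (an alternative, assumed to suit the use case): blank out
# missing values first, then cast.
cleaned = train["Plan"].fillna("").astype(str)

print(as_text.tolist())   # ['Gold-Plan 2', 'nan', 'Basic']
print(cleaned.tolist())   # ['Gold-Plan 2', '', 'Basic']
```

Either result can be passed to re.sub safely, because every element is now a str.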


Answered by Bilal Chandio

I suppose it would be better to use the re.match() function. Here is an example that may help you.

import re
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer data used by word_tokenize
sentences = word_tokenize("I love to learn NLP \n 'a :(")
# Keep only the tokens that start with a letter, lowercased
sentences = [word.lower() for word in sentences if re.match('^[a-zA-Z]+', word)]
print(sentences)
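
One caveat worth sketching: re.match only anchors at the start of the string, so a token such as "b2b" passes the filter above because it begins with a letter. If the goal is to keep purely alphabetic tokens, re.fullmatch is one option. The token list below is hand-written for illustration so that the sketch runs without NLTK.

```python
import re

# Hand-written tokens (an assumption, standing in for tokenizer output).
tokens = ["I", "love", "NLP", "'a", ":(", "b2b", "3d"]

# re.match: the pattern only has to match at the beginning of the token.
starts_with_letter = [t.lower() for t in tokens if re.match("[a-zA-Z]+", t)]

# re.fullmatch: the pattern has to cover the whole token.
all_letters = [t.lower() for t in tokens if re.fullmatch("[a-zA-Z]+", t)]

print(starts_with_letter)  # ['i', 'love', 'nlp', 'b2b']
print(all_letters)         # ['i', 'love', 'nlp']
```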