pandas TypeError: expected string or bytes-like object – with Python/NLTK word_tokenize
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/46105180/
TypeError: expected string or bytes-like object – with Python/NLTK word_tokenize
Asked by LMGagne
I have a dataset with ~40 columns, and am using .apply(word_tokenize) on 5 of them like so: df['token_column'] = df.column.apply(word_tokenize).
I'm getting a TypeError for only one of the columns; we'll call it problem_column:
TypeError: expected string or bytes-like object
Here's the full error (df and column names, and PII, stripped out). I'm new to Python and am still trying to figure out which parts of the error message are relevant:
TypeError Traceback (most recent call last)
<ipython-input-51-22429aec3622> in <module>()
----> 1 df['token_column'] = df.problem_column.apply(word_tokenize)
C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\src\inference.pyx in pandas._libs.lib.map_infer (pandas\_libs\lib.c:66440)()
C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py in word_tokenize(text, language, preserve_line)
128 :type preserver_line: bool
129 """
--> 130 sentences = [text] if preserve_line else sent_tokenize(text, language)
131 return [token for sent in sentences
132 for token in _treebank_word_tokenizer.tokenize(sent)]
C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
95 """
96 tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
---> 97 return tokenizer.tokenize(text)
98
99 # Standard word tokenizer.
C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in tokenize(self, text, realign_boundaries)
1233 Given a text, returns a list of the sentences in that text.
1234 """
-> 1235 return list(self.sentences_from_text(text, realign_boundaries))
1236
1237 def debug_decisions(self, text):
C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in sentences_from_text(self, text, realign_boundaries)
1281 follows the period.
1282 """
-> 1283 return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
1284
1285 def _slices_from_text(self, text):
C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in span_tokenize(self, text, realign_boundaries)
1272 if realign_boundaries:
1273 slices = self._realign_boundaries(text, slices)
-> 1274 return [(sl.start, sl.stop) for sl in slices]
1275
1276 def sentences_from_text(self, text, realign_boundaries=True):
C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in <listcomp>(.0)
1272 if realign_boundaries:
1273 slices = self._realign_boundaries(text, slices)
-> 1274 return [(sl.start, sl.stop) for sl in slices]
1275
1276 def sentences_from_text(self, text, realign_boundaries=True):
C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _realign_boundaries(self, text, slices)
1312 """
1313 realign = 0
-> 1314 for sl1, sl2 in _pair_iter(slices):
1315 sl1 = slice(sl1.start + realign, sl1.stop)
1316 if not sl2:
C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _pair_iter(it)
310 """
311 it = iter(it)
--> 312 prev = next(it)
313 for el in it:
314 yield (prev, el)
C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text)
1285 def _slices_from_text(self, text):
1286 last_break = 0
-> 1287 for match in self._lang_vars.period_context_re().finditer(text):
1288 context = match.group() + match.group('after_tok')
1289 if self.text_contains_sentbreak(context):
TypeError: expected string or bytes-like object
The 5 columns are all character/string (as verified in SQL Server, SAS, and using .select_dtypes(include=[object])).
For good measure I used .to_string() to make sure problem_column really and truly contains nothing besides strings, but I continue to get the error. If I process the columns separately, good_column1 through good_column4 continue to work and problem_column still generates the error.
I've googled around, and aside from stripping any numbers from the set (which I can't do, because those are meaningful), I haven't found any additional fixes.
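A quick way to see why the dtype checks can pass while tokenization still fails: an object column can hold None, NaN, or numbers alongside strings. A minimal diagnostic sketch (using the df and problem_column names above; not part of the original post):

non_strings = df[~df['problem_column'].apply(lambda x: isinstance(x, str))]  # rows whose value is not actually a str
print(non_strings['problem_column'])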
Accepted answer by LMGagne
This is what got me the desired result.
from nltk.tokenize import word_tokenize

def custom_tokenize(text):
    # None values can't be tokenized, so fall back to an empty string
    if not text:
        print('The text to be tokenized is a None type. Defaulting to blank string.')
        text = ''
    return word_tokenize(text)

df['tokenized_column'] = df.column.apply(custom_tokenize)
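If the missing values are float NaN rather than None, the if not text check will not catch them (NaN is truthy in Python). A hedged variation on the same idea, with safe_tokenize as a hypothetical helper that is not part of the original answer:

from nltk.tokenize import word_tokenize

def safe_tokenize(value):
    # Assumed helper: coerce anything that isn't a string (None, NaN, numbers) to ''
    if not isinstance(value, str):
        value = ''
    return word_tokenize(value)

df['tokenized_column'] = df.column.apply(safe_tokenize)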
Answered by Ekho
The problem is that you have None (NA) types in your DF. Try this:
from nltk.tokenize import word_tokenize

# Drop the missing values first, then tokenize what remains
tokens = df['label'].dropna().apply(word_tokenize)
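As a small aside (not part of the answer above): if the result needs to stay aligned with every row of the original DataFrame, filling the missing values instead of dropping them also works:

tokens = df['label'].fillna('').apply(word_tokenize)  # NaN/None become '' and tokenize to []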
Answered by Danish Shaikh
It might be showing an error because word_tokenize() only accepts one string at a time. You can loop through the strings and then tokenize each one.
For example:
text = "This is the first sentence. This is the second one. And this is the last one."
sentences = sent_tokenize(text)
words = [word_tokenize(sent) for sent in sentences]
print(words)
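Applied to the DataFrame from the question, the same loop-per-string idea might look like the sketch below; df and problem_column are the names assumed from the question, and the values are assumed to be non-null strings:

df['token_column'] = [word_tokenize(text) for text in df['problem_column']]  # one token list per row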
Answered by KerryChu
Try
from nltk.tokenize import word_tokenize as WordTokenizer

def word_tokenizer(data, col):
    # Tokenize every value in the given column and collect the results in a list
    token = []
    for item in data[col]:
        token.append(WordTokenizer(item))
    return token

token = word_tokenizer(df, column)       # 'column' is a placeholder for the column name
df.insert(index, 'token_column', token)  # 'index' is a placeholder for the new column's position