How to avoid the "decoding to str: need a bytes-like object" error in pandas?

Question

Asked by wayne64001

Here is my code:

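import pandas as pd

# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) replaces it
data = pd.read_csv('asscsv2.csv', encoding="ISO-8859-1", on_bad_lines='skip')
data_text = data[['content']].copy()  # copy to avoid SettingWithCopyWarning
data_text['index'] = data_text.index
documents = data_text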

The resulting DataFrame looks like this:

print(documents[:2])
                                              content  index
 0  Pretty extensive background in Egyptology and ...      0
 1  Have you guys checked the back end of the Sphi...      1

I then define a preprocessing function using gensim:

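# Imports assumed; the original post did not show them
import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()

def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then reduce it to its stem
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize, then keep tokens longer than 3 characters that are not stopwords
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result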

When I apply this function:

processed_docs = documents['content'].map(preprocess)

I get this error:

TypeError: decoding to str: need a bytes-like object, float found

How can I encode my CSV file as a bytes-like object, or otherwise avoid this error?

Answer 1

Answered by Vishnudev

Your data contains NaNs (not a number).

You can either drop them first:

documents = documents.dropna(subset=['content'])

Or you can fill all NaNs with an empty string, convert the column to string type, and then map your string-based function over it:

processed_docs = documents['content'].fillna('').astype(str).map(preprocess)

This is necessary because preprocess calls functions that accept only strings.
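To compare the two options side by side, here is a minimal sketch on a made-up three-row DataFrame. The function toy_preprocess is a hypothetical stand-in for preprocess (it is not part of gensim) so the example runs without gensim or nltk installed. Like the real function, it only accepts strings.

```python
import numpy as np
import pandas as pd

def toy_preprocess(text):
    # Hypothetical stand-in for preprocess(): rejects anything that is not a str
    if not isinstance(text, str):
        raise TypeError(f"expected str, got {type(text).__name__}")
    # Keep lowercase tokens longer than 3 characters
    return [tok for tok in text.lower().split() if len(tok) > 3]

# Made-up data: the middle row has missing content
documents = pd.DataFrame(
    {"content": ["Pretty extensive background", np.nan, "Have you guys checked"]}
)

# Option 1: drop rows with missing content (the result has fewer rows)
dropped = documents.dropna(subset=["content"])
processed_dropped = dropped["content"].map(toy_preprocess)

# Option 2: keep every row; missing content becomes an empty token list
processed_filled = documents["content"].fillna("").astype(str).map(toy_preprocess)
```

Dropping keeps only rows that have text, while filling keeps the result aligned with the original index. Which one fits depends on whether later steps need one output per input row.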

Edit:

How do I know that your data contains NaNs? NumPy's nan is a float, which matches the "float found" part of the error message:

>>> import numpy as np
>>> type(np.nan)
<class 'float'>
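As a rough check of that reasoning, the sketch below reproduces the error message by decoding a float directly, then counts the missing rows in a made-up DataFrame. It assumes gensim falls back to a str(obj, encoding) style decode for non-string input, which the message suggests but which I have not verified against your gensim version. The sample data is illustrative only.

```python
import numpy as np
import pandas as pd

# Decoding a float as if it were bytes raises the same TypeError message
try:
    str(np.nan, "utf-8")
except TypeError as exc:
    message = str(exc)

# Made-up data standing in for the 'content' column
documents = pd.DataFrame({"content": ["some text", np.nan, "more text"]})

# Count the missing entries and pull out the offending rows
n_missing = int(documents["content"].isna().sum())
missing_rows = documents[documents["content"].isna()]
```

Running the same isna() check on your own documents['content'] column should show which rows came from empty cells in the CSV.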
