In Pandas, when using read_csv(), how to assign a NaN to a value that's not the dtype intended?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/40011531/

Tags: python, pandas

Asked by luisfer

Note: Please excuse my poor English; feel free to modify the question's title or the following text to make it more understandable.

I have this line in my code:

moto = pd.read_csv('reporte.csv')

It gives a DtypeWarning: Columns (2,3,4,5,6,7,8,9,10,12,13) have mixed types warning, so I changed it to

moto = pd.read_csv('reporte.csv', dtype={'TP': np.float64})

Now it raises a ValueError: could not convert string to float: 'None'.

I checked the file (around 200K lines) in Excel, and yes, I found some cells containing the value "None".

So my question is: is there a way to ignore the error, or to force Python to replace the offending values with NaN or something else?

I tried the solution here but it didn't work.

Accepted answer by wordsmith

I tried creating a CSV that reproduces this behaviour on pandas 0.18 but couldn't, so I can only recommend two methods for handling it:

First

If you know that your missing values are all marked by a string 'none', then do this:

moto = pd.read_csv("test.csv", na_values=['none'])

You can also add other markers that should be converted to NaN to the na_values list.
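For example, a minimal sketch combining na_values with the dtype from the question (the exact marker list is an assumption about the file; matching is exact and case-sensitive, so both spellings are listed):

import pandas as pd
import numpy as np

# treat these strings as missing so the column can still be read as float64
moto = pd.read_csv('reporte.csv',
                   na_values=['None', 'none'],
                   dtype={'TP': np.float64})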

Second

Try your first line again without using the dtype option.

moto = pd.read_csv('reporte.csv')

The read is successful because you are only getting a warning. Now execute moto.dtypes to show which columns are objects. For the ones you want to change, do the following:

moto.test_column = pd.to_numeric(data.test_column, errors='coerce')

The 'coerce' option will convert any problematic entries, like 'none', to NaNs.
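A small illustration of the coercion (the sample values here are made up, not taken from the original file):

import pandas as pd

s = pd.Series(['1.5', 'none', '3'])
print(pd.to_numeric(s, errors='coerce'))   # 1.5, NaN, 3.0 -> dtype float64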

To convert the entire dataframe at once, you can use convert_objects. You could also use it on a single column, but that usage is deprecated in favor of to_numeric. The option, convert_numeric, does the coercion to NaNs:

moto = moto.convert_objects(convert_numeric=True)
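Note that convert_objects has since been deprecated and removed in newer pandas versions; a rough present-day sketch of the same idea, restricted here to a hypothetical list of columns so genuine text columns are not wiped out, would be:

cols = ['TP']   # columns you want to coerce; adjust to your data
moto[cols] = moto[cols].apply(pd.to_numeric, errors='coerce')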

After any of these methods, proceed with fillna to do whatever you need to.
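For instance, a minimal fillna sketch (filling with 0 is only an illustration; choose whatever placeholder suits the data):

moto['TP'] = moto['TP'].fillna(0)   # or moto.fillna(0) to fill the whole frame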

Answer by Edward Sternin

A great answer, wordsmith! Just to add a couple of minor things:

  • there is a typo in the answer, data.test_column should probably be moto.test_column
  • convert_objects is now deprecated in favor of type-specific methods applied to columns, one at a time [why?]

A full working example, including dropping the lines that contain read errors (not column-count errors, which are covered by read_csv(..., error_bad_lines=False)), would be:

moto = pd.read_csv('reporte.csv')
moto.test_column = pd.to_numeric(moto.test_column, errors='coerce')
moto.dropna(axis='index',how='any',inplace=True)

I would also like to offer an alternative:

from pandas import read_csv
import numpy as np

# if the data is not a valid "number", return a NaN
# note that it must be a float, as NaN is a float:  print(type(np.nan))
def valid_float(y):
  try:
    return float(y)
  except ValueError:
    return np.nan

# assuming the first row of the file contains the column names 'A','B','C'...
data = read_csv('test.csv',header=0,usecols=['A','B','D'],
   converters={'A': valid_float, 'B': valid_float, 'D': valid_float} )

# delete all rows ('index') with an invalid numerical entry
data.dropna(axis='index',how='any',inplace=True)

This is fairly compact yet readable. For a true one-liner, it would be great to (1) rewrite the validation function as a lambda, and (2) drop the defective rows directly in the call to read_csv, but I could not figure out how to do either of these.
