Python pandas.to_numeric - find out which string it was unable to parse

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/40790031/

Time: 2020-08-19 23:59:56  Source: igfitidea


Tags: python, pandas, data-science, data-cleaning

Asked by clstaudt

Applying pandas.to_numeric to a dataframe column that contains strings representing numbers (and possibly other, unparsable strings) raises an error like this:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-66-07383316d7b6> in <module>()
      1 for column in shouldBeNumericColumns:
----> 2     trainData[column] = pandas.to_numeric(trainData[column])

/usr/local/lib/python3.5/site-packages/pandas/tools/util.py in to_numeric(arg, errors)
    113         try:
    114             values = lib.maybe_convert_numeric(values, set(),
--> 115                                                coerce_numeric=coerce_numeric)
    116         except:
    117             if errors == 'raise':

pandas/src/inference.pyx in pandas.lib.maybe_convert_numeric (pandas/lib.c:53558)()

pandas/src/inference.pyx in pandas.lib.maybe_convert_numeric (pandas/lib.c:53344)()

ValueError: Unable to parse string

Wouldn't it be helpful to see which value failed to parse?


Answered by jezrael

I think you can add the parameter errors='coerce' to convert bad, non-numeric values to NaN, then find those values with isnull and select the rows with boolean indexing:


print (df[pd.to_numeric(df.col, errors='coerce').isnull()])

Sample:


df = pd.DataFrame({'B':['a','7','8'],
                   'C':[7,8,9]})

print (df)
   B  C
0  a  7
1  7  8
2  8  9

print (df[pd.to_numeric(df.B, errors='coerce').isnull()])
   B  C
0  a  7
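
One caveat with the isnull check: it also flags values that were already missing before the conversion. A minimal sketch that excludes those, assuming a hypothetical helper named unparsable_values (an illustration, not part of pandas):

```python
import numpy as np
import pandas as pd


def unparsable_values(series):
    """Return the entries of `series` that pd.to_numeric cannot parse.

    Hypothetical helper for illustration; not part of pandas.
    """
    converted = pd.to_numeric(series, errors='coerce')
    # NaN after conversion but not missing before -> parsing failed
    failed = converted.isna() & series.notna()
    return series[failed]


s = pd.Series(['a', '7', np.nan, '8', '1.5x'])
bad = unparsable_values(s)
print(bad)
```

The returned Series keeps the original index, so the positions of the offending strings can be read off directly.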

Or, if you need to find all strings in a mixed column (numeric values together with string values), check whether the type of each value is str:


df = pd.DataFrame({'B':['a',7, 8],
                   'C':[7,8,9]})

print (df)
   B  C
0  a  7
1  7  8
2  8  9

print (df[df.B.apply(lambda x: isinstance(x, str))])
   B  C
0  a  7
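
Before filtering, it may help to see which Python types a mixed column actually holds. A small sketch along the same lines, reusing the sample frame above (the variable name type_counts is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'B': ['a', 7, 8],
                   'C': [7, 8, 9]})

# Count how many values of each Python type the column holds
type_counts = df['B'].map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

If only numeric type names show up here, the column holds no strings and the isinstance filter will return nothing.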

Answered by 3novak

I have thought the very same thing, and I don't know if there's a better way, but my current workaround is to search for characters which aren't numbers or periods. This usually turns up the problem. There are cases where multiple periods can cause a problem, but I've found those are rare.


import pandas as pd
import re

non_numeric = re.compile(r'[^\d.]+')

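df = pd.DataFrame({'a': [3,2,'NA']})
# na=False: non-string entries (already numeric) would otherwise give NaN,
# and a mask containing NaN cannot be used for boolean indexing
df.loc[df['a'].str.contains(non_numeric, na=False)]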
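
The character search above does not catch strings such as '1.2.3', and it flags signed values such as '-5'. A stricter sketch, assuming that "numeric" means plain decimal notation with an optional sign (the pattern named decimal is my own assumption and is narrower than what pandas.to_numeric accepts; it rejects exponent notation, for example):

```python
import pandas as pd

# Assumed definition of "numeric": optional sign, digits, at most one period
decimal = r'[+-]?(?:\d+\.?\d*|\.\d+)'

df = pd.DataFrame({'a': ['3', '2', 'NA', '1.2.3', '-5']})
# fullmatch requires the whole string to fit the pattern
is_decimal = df['a'].str.fullmatch(decimal)
suspects = df.loc[~is_decimal]
print(suspects)
```

This assumes a column that holds only strings; for a mixed column, converting with astype(str) first would be one option.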