Python: Finding non-numeric rows in a pandas dataframe?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/21771133/

Finding non-numeric rows in dataframe in pandas?

python · pandas · dataframe

Asked by

I have a large dataframe in pandas that, apart from the column used as index, is supposed to have only numeric values:

df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')

How can I find the row of the dataframe df that has a non-numeric value in it?

In this example it's the fourth row in the dataframe, which has the string 'bad' in the a column. How can this row be found programmatically?

Accepted answer by Andy Hayden

You could use np.isreal to check the type of each element (applymap applies a function to each element in the DataFrame):

In [11]: df.applymap(np.isreal)
Out[11]:
          a     b
item
a      True  True
b      True  True
c      True  True
d     False  True
e      True  True

If all values in the row are True then they are all numeric:

In [12]: df.applymap(np.isreal).all(1)
Out[12]:
item
a        True
b        True
c        True
d       False
e        True
dtype: bool

So to get the sub-DataFrame of rogues (note: the negation, ~, of the above finds the ones which have at least one rogue non-numeric value):

In [13]: df[~df.applymap(np.isreal).all(1)]
Out[13]:
        a    b
item
d     bad  0.4

You could also find the location of the first offender using argmin:

In [14]: np.argmin(df.applymap(np.isreal).all(1))
Out[14]: 'd'
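
Side note: in newer pandas versions, np.argmin on a Series returns a positional index rather than the label, so a sketch using idxmin may be a more portable way to get the label of the first offender:

# Minimal sketch with the same df as above: idxmin returns the label of the
# first False (False sorts before True), i.e. the first non-numeric row.
# Caveat: if every row happens to be numeric, idxmin simply returns the first label.
df.applymap(np.isreal).all(1).idxmin()   # 'd'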

As @CTZhu points out, it may be slightly faster to check whether it's an instance of either int or float (there is some additional overhead with np.isreal):

df.applymap(lambda x: isinstance(x, (int, float)))
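
For completeness, a minimal sketch of using that mask to pull out the rogue rows (newer pandas releases rename DataFrame.applymap to DataFrame.map, so adjust accordingly):

mask = df.applymap(lambda x: isinstance(x, (int, float)))   # True where the cell is numeric
df[~mask.all(axis=1)]                                       # rows with at least one non-numeric cell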

Answer by CT Zhu

Sorry about the confusion, this should be the correct approach. Do you want to capture only 'bad', not things like 'good', or just any non-numeric values?

In[15]:
np.where(np.any(np.isnan(df.convert_objects(convert_numeric=True)), axis=1))
Out[15]:
(array([3]),)
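
Note that convert_objects was removed in later pandas releases; a rough equivalent sketch with pd.to_numeric (assuming every non-index column is meant to be numeric) would be:

coerced = df.apply(pd.to_numeric, errors='coerce')   # non-numeric cells become NaN
np.where(coerced.isna().any(axis=1))                 # positional index of the bad rows, here (array([3]),)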

Answer by Pedro M Duarte

There are already some great answers to this question; however, here is a nice snippet that I use regularly to drop rows if they have non-numeric values in some columns:

# Eliminate invalid data from dataframe (see Example below for more context)

num_df = (df.drop(data_columns, axis=1)
         .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))

num_df = num_df[num_df[data_columns].notnull().all(axis=1)]

The way this works is that we first drop all the data_columns from the df, and then use a join to put them back in after passing them through pd.to_numeric (with option 'coerce', so that all non-numeric entries are converted to NaN). The result is saved to num_df.

On the second line we use a filter that keeps only rows where all values are not null.

Note that pd.to_numeric coerces to NaN everything that cannot be converted to a numeric value, so strings that represent numeric values will not be removed. For example, '1.25' will be recognized as the numeric value 1.25.

Disclaimer: pd.to_numeric was introduced in pandas version 0.17.0

Example:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"item": ["a", "b", "c", "d", "e"],
   ...:                    "a": [1,2,3,"bad",5],
   ...:                    "b":[0.1,0.2,0.3,0.4,0.5]})

In [3]: df
Out[3]: 
     a    b item
0    1  0.1    a
1    2  0.2    b
2    3  0.3    c
3  bad  0.4    d
4    5  0.5    e

In [4]: data_columns = ['a', 'b']

In [5]: num_df = (df
   ...:           .drop(data_columns, axis=1)
   ...:           .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))

In [6]: num_df
Out[6]: 
  item   a    b
0    a   1  0.1
1    b   2  0.2
2    c   3  0.3
3    d NaN  0.4
4    e   5  0.5

In [7]: num_df[num_df[data_columns].notnull().all(axis=1)]
Out[7]: 
  item  a    b
0    a  1  0.1
1    b  2  0.2
2    c  3  0.3
4    e  5  0.5

Answer by Borja_042

In case you are working with a column of string values, you can use the very useful Series.str.isnumeric() method, for example:

a = pd.Series(['hi','hola','2.31','288','312','1312', '0,21', '0.23'])

What I do is copy that column to a new column, apply str.replace('.', '') and str.replace(',', ''), and then select the numeric values:

a = a.str.replace('.', '', regex=False)  # strip decimal points ('.' taken literally, not as a regex)
a = a.str.replace(',', '', regex=False)  # strip commas used as decimal/thousands separators
a.str.isnumeric()

Out[15]:
0    False
1    False
2     True
3     True
4     True
5     True
6     True
7     True
dtype: bool
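
Alternatively, a sketch that lets pd.to_numeric do the parsing on the original strings (assuming the comma is meant as a decimal separator):

s = pd.Series(['hi', 'hola', '2.31', '288', '312', '1312', '0,21', '0.23'])
pd.to_numeric(s.str.replace(',', '.', regex=False), errors='coerce').notna()   # same mask as above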

Good luck all!

Answer by matiu

# Original code
df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')

Convert to numeric using 'coerce', which fills bad values with NaN:

a = pd.to_numeric(df.a, errors='coerce')

Use isna to return a boolean index:

idx = a.isna()

Apply that index to the data frame:

df[idx]

Output:

Returns the row with the bad data in it:

        a    b
item          
d     bad  0.4
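
The same boolean index also works the other way round, and it extends to several columns if each of them is coerced; a sketch:

rows_ok = df.apply(pd.to_numeric, errors='coerce').notna().all(axis=1)
df[rows_ok]   # keeps only the rows where every column parses as a number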

Answer by dyang

Just to give an idea: convert the column to string, since strings are easier to work with. However, this does not work with strings that contain digits, like bad123. The ~ takes the complement of the selection.

df['a'] = df['a'].astype(str)                           # work on string copies of the values
df[~df['a'].str.contains('0|1|2|3|4|5|6|7|8|9')]        # rows whose 'a' contains no digit at all
df['a'] = df['a'].astype(object)                        # note: only changes the dtype; the values stay strings

and using '|'.join([str(i) for i in range(10)]) to generate '0|1|...|8|9',

or using the np.isreal() function, just like the top-voted answer:

df[~df['a'].apply(lambda x: np.isreal(x))]
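
To also catch mixed strings like bad123, a sketch using an anchored regex (my assumption of what counts as a plain numeric literal) could be:

is_num = df['a'].astype(str).str.match(r'^-?\d+(?:\.\d+)?$')   # True only for plain int/float literals
df[~is_num]                                                    # rows whose 'a' value is not purely numeric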