Python: remove non-numeric rows in one column with pandas

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/33961028/

Time: 2020-08-19 14:15:22  Source: igfitidea

Remove non-numeric rows in one column with pandas

python pandas

Asked by HungUnicorn

There is a dataframe like the following, and it has one unclean column 'id', which should be a numeric column:

id, name
1,  A
2,  B
3,  C
tt, D
4,  E
5,  F
de, G

Is there a concise way to remove the rows below (tt and de are not numeric values)

tt,D
de,G

to make the dataframe clean?

id, name
1,  A
2,  B
3,  C
4,  E
5,  F

Accepted answer by Anton Protopopov

You could use the standard string method isnumeric and apply it to each value in your id column:

import pandas as pd
from io import StringIO

data = """
id,name
1,A
2,B
3,C
tt,D
4,E
5,F
de,G
"""

df = pd.read_csv(StringIO(data))

In [55]: df
Out[55]: 
   id name
0   1    A
1   2    B
2   3    C
3  tt    D
4   4    E
5   5    F
6  de    G

In [56]: df[df.id.apply(lambda x: x.isnumeric())]
Out[56]: 
  id name
0  1    A
1  2    B
2  3    C
4  4    E
5  5    F

Or, if you want to use id as the index, you could do:

In [61]: df[df.id.apply(lambda x: x.isnumeric())].set_index('id')
Out[61]: 
   name
id     
1     A
2     B
3     C
4     E
5     F
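
As a self-contained variant of the session above, the following sketch uses the vectorized str.isnumeric accessor and then converts the cleaned column to integers. The variable names (mask, clean) are illustrative, and the conversion step is an addition that assumes the remaining ids are whole numbers.

```python
import pandas as pd
from io import StringIO

data = """id,name
1,A
2,B
3,C
tt,D
4,E
5,F
de,G
"""

df = pd.read_csv(StringIO(data))

# The mixed column is read as strings, so the .str accessor is available.
mask = df["id"].str.isnumeric()

# Keep only rows whose id consists of numeric characters, then cast to int.
# (The cast is an assumption: it only works if every remaining id is a whole number.)
clean = df[mask].copy()
clean["id"] = clean["id"].astype(int)

print(clean)
```

Note that str.isnumeric rejects strings such as '-1' or '1.5', so this variant only suits non-negative whole-number ids.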

Edit: added timings

Although the pd.to_numeric approach does not use the apply method, it is almost two times slower than applying the string method isnumeric to str columns. I have also added an option using the pandas str.isnumeric accessor, which is less typing and still faster than pd.to_numeric. However, pd.to_numeric is more general because it can work with any data type (not only strings).

In [3]: df_big = pd.concat([df]*10000)

In [4]: df_big.shape
Out[4]: (70000, 2)

In [5]: %timeit df_big[df_big.id.apply(lambda x: x.isnumeric())]
15.3 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit df_big[df_big.id.str.isnumeric()]
20.3 ms ± 171 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_big[pd.to_numeric(df_big['id'], errors='coerce').notnull()]
29.9 ms ± 682 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
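
To reproduce the comparison outside IPython, a sketch along these lines can be run with the standard timeit module. Absolute numbers depend on the machine and the pandas version, so no expected timings are given. The function names are illustrative, and ignore_index=True is an addition that gives the enlarged frame a unique index.

```python
import timeit
from io import StringIO

import pandas as pd

data = "id,name\n1,A\n2,B\n3,C\ntt,D\n4,E\n5,F\nde,G\n"
df = pd.read_csv(StringIO(data))

# 70000 rows; ignore_index=True is an assumption, not part of the original session.
df_big = pd.concat([df] * 10000, ignore_index=True)


def by_apply():
    return df_big[df_big["id"].apply(lambda x: x.isnumeric())]


def by_str_accessor():
    return df_big[df_big["id"].str.isnumeric()]


def by_to_numeric():
    return df_big[pd.to_numeric(df_big["id"], errors="coerce").notnull()]


for func in (by_apply, by_str_accessor, by_to_numeric):
    # Best of 3 repeats, 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{func.__name__}: {best * 1000:.1f} ms per call")
```

All three functions select the same rows on this data, so the timings compare equivalent work.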

Answer by DeepSpace

Given that df is your dataframe,

import numpy as np
df[df['id'].apply(lambda x: isinstance(x, (int, np.int64)))]

This passes each value in the id column to the isinstance function and checks whether it is an int. The result is a boolean Series, which is then used to select only the rows where the value is True.

If you also need to account for float values, another option is:

import numpy as np
df[df['id'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]

Note that neither approach works in place, so you need to assign the result back to your original df or to a new variable:

df = df[df['id'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]
# or
new_df = df[df['id'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]
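
A type check like this looks at the Python type of each stored value, so it only helps when the column really holds numbers mixed with strings. A column parsed from CSV text, as in the question, holds strings such as '1', and none of them passes the check. The sketch below illustrates the difference. The frames mixed and from_csv and the helper is_number are hypothetical names introduced for this example.

```python
from io import StringIO

import numpy as np
import pandas as pd


def is_number(x):
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(x, bool):
        return False
    return isinstance(x, (int, float, np.integer, np.floating))


# Object column holding real Python numbers next to a string.
mixed = pd.DataFrame({"id": [1, 2, "tt", 4.5], "name": ["A", "B", "C", "D"]})
mixed_mask = mixed["id"].apply(is_number)
print(mixed[mixed_mask])

# The same check on data parsed from CSV text: every id is a str.
from_csv = pd.read_csv(StringIO("id,name\n1,A\ntt,B\n2,C\n"))
csv_mask = from_csv["id"].apply(is_number)
print(csv_mask.tolist())
```

For string data like the second frame, a string-based test or pd.to_numeric is the better fit.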

Answer by Zero

Using pd.to_numeric

In [1079]: df[pd.to_numeric(df['id'], errors='coerce').notnull()]
Out[1079]:
  id  name
0  1     A
1  2     B
2  3     C
4  4     E
5  5     F
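
To make the steps explicit: pd.to_numeric with errors='coerce' turns every value it cannot parse into NaN, and notnull() then yields the boolean mask used for filtering. A minimal sketch follows, using a hypothetical sample that also contains a negative number and a decimal, two cases that str.isnumeric rejects. The names converted, mask and clean are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"id": ["1", "-2", "3.5", "tt", "de"], "name": list("ABCDE")})

# Unparseable strings become NaN instead of raising an error.
converted = pd.to_numeric(df["id"], errors="coerce")

# True where the conversion succeeded.
mask = converted.notnull()

# Keep the valid rows and store the numeric values in place of the strings.
clean = df[mask].copy()
clean["id"] = converted[mask]

print(clean)

# For comparison: the string method accepts only the plain digit string.
print(df["id"].str.isnumeric().tolist())
```

Reusing the converted Series avoids parsing the column a second time when the numeric values are needed afterwards.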

Answer by Matphy

x.isnumeric() does not return True when x is a string representing a float, such as '1.5', and an actual float object has no isnumeric method at all.

One way to keep only the values that can be converted to float:

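def is_float(x):
    """Return True if x can be converted to float."""
    try:
        float(x)
    except (ValueError, TypeError):
        # ValueError: unparseable string; TypeError: e.g. None
        return False
    return True

# Define the helper first, then pass it directly to apply.
df[df['id'].apply(is_float)]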
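
A self-contained sketch of this approach, using hypothetical sample data that includes a decimal and a negative id. Catching TypeError in addition to ValueError is an assumption added here so that values such as None are rejected instead of raising.

```python
import pandas as pd


def is_float(x):
    """Return True if x can be converted to a float."""
    try:
        float(x)
    except (ValueError, TypeError):
        return False
    return True


df = pd.DataFrame({"id": ["1", "2.5", "tt", "-3", "de"], "name": list("ABCDE")})

# Keep only the rows whose id can be parsed as a float.
clean = df[df["id"].apply(is_float)]
print(clean)

# None raises TypeError inside float(), which the helper turns into False.
print(is_float(None))
```

Keep in mind that float() also accepts strings such as 'nan' and 'inf', so those would pass this check.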