Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors. Original question: http://stackoverflow.com/questions/44117326/

How can I remove all non-numeric characters from all the values in a particular column in pandas dataframe?

Tags: python, pandas, dataframe

Asked by ag14

I have a dataframe which looks like this:

        A         B         C
1   red78    square    big235
2   green    circle  small123
3  blue45  triangle    big657

I need to be able to remove the non-numeric characters from all the rows in column C so that my dataframe looks like:

        A         B    C
1   red78    square  235
2   green    circle  123
3  blue45  triangle  657

I tried the following, but it raises the error "expected string or buffer":

import re
dfOutput.imgID = dfOutput.imgID.apply(re.sub('[^0-9]','', dfOutput.imgID), axis = 0)

What should I do instead?

Code to create dataframe:

import pandas as pd

# DataFrame.set_value was removed in pandas 1.0, so build the frame directly.
dfObject = pd.DataFrame(
    {
        'A': ['red78', 'green', 'blue45'],
        'B': ['square', 'circle', 'triangle'],
        'C': ['big235', 'small123', 'big657'],
    },
    index=[1, 2, 3],  # same row labels as the original example
)

Answered by EdChum

Use str.extract and pass a regex pattern to extract just the numeric parts:

In[40]:
dfObject['C'] = dfObject['C'].str.extract(r'(\d+)', expand=False)
dfObject

Out[40]: 
        A         B    C
1   red78    square  235
2   green    circle  123
3  blue45  triangle  657

If needed you can cast to int:

dfObject['C'] = dfObject['C'].astype(int)
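
One caveat with the cast: str.extract returns NaN for any value that contains no digits, and astype(int) raises on NaN. The following is a minimal sketch of one way to handle that, assuming a hypothetical Series in which one value has no digits; it is an illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample: the second value contains no digits at all.
s = pd.Series(['big235', 'small', 'big657'])

# str.extract returns NaN where the pattern does not match.
digits = s.str.extract(r'(\d+)', expand=False)

# to_numeric parses the digit strings and keeps the missing value;
# the nullable 'Int64' dtype can then hold integers alongside <NA>.
numbers = pd.to_numeric(digits).astype('Int64')
print(numbers)
```

The nullable Int64 dtype (capital "I") is used here because a plain NumPy int64 column cannot represent missing values.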

Answered by Scott Boston

You can use .str.replace with a regex:

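# regex=True is needed on pandas 2.0+, where str.replace treats the pattern literally by default
dfObject['C'] = dfObject.C.str.replace(r"[a-zA-Z]", '', regex=True)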

Output:

        A         B    C
1   red78    square  235
2   green    circle  123
3  blue45  triangle  657
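
Note that the pattern [a-zA-Z] removes only ASCII letters, so punctuation and spaces survive, and that extracting (\d+) keeps only the first run of digits. The sketch below compares the three behaviours on a hypothetical value chosen to make the differences visible; the sample string is an assumption, not data from the question.

```python
import pandas as pd

# Hypothetical value mixing letters, punctuation, a space and two digit runs.
s = pd.Series(['big-235 px7'])

# Removes ASCII letters only; '-' and ' ' are left in place.
letters_removed = s.str.replace(r'[a-zA-Z]', '', regex=True)

# Removes every non-digit character, joining all digit runs together.
non_digits_removed = s.str.replace(r'\D+', '', regex=True)

# Keeps only the first run of consecutive digits.
first_run = s.str.extract(r'(\d+)', expand=False)

print(repr(letters_removed[0]))     # '-235 7'
print(repr(non_digits_removed[0]))  # '2357'
print(repr(first_run[0]))           # '235'
```

For the data in the question all three give the same result, because each value is a run of letters followed by a single run of digits.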

Answered by Wiktor Stribiżew

To remove all non-digit characters from strings in a Pandas column, you should use str.replace with the \D+ or [^0-9]+ patterns:

dfObject['C'] = dfObject['C'].str.replace(r'\D+', '', regex=True)

Or, since \D is fully Unicode-aware by default in Python 3 and thus does not match non-ASCII digits (such as ۱۲۳۴۵۶۷۸۹), you should consider

dfObject['C'] = dfObject['C'].str.replace(r'[^0-9]+', '', regex=True)

So,

import re
print(re.sub(r'\D+', '', '1۱۲۳۴۵۶۷۸۹0'))      # => 1۱۲۳۴۵۶۷۸۹0
print(re.sub(r'[^0-9]+', '', '1۱۲۳۴۵۶۷۸۹0'))  # => 10
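
Another way to get ASCII-only matching, offered here as an assumption rather than something the answer proposes, is to keep \D and pass the re.ASCII flag, which both re.sub and Series.str.replace accept. The sketch writes the non-ASCII digits as escape sequences so that it does not depend on file encoding.

```python
import re
import pandas as pd

# '\u06f1\u06f2\u06f3' are the Extended Arabic-Indic digits for 1, 2 and 3.
text = '1\u06f1\u06f2\u06f30'

# Default (Unicode-aware) matching: the non-ASCII digits count as digits and survive.
unicode_aware = re.sub(r'\D+', '', text)

# With re.ASCII, \d means [0-9] only, so \D also matches the non-ASCII digits.
ascii_only = re.sub(r'\D+', '', text, flags=re.ASCII)

# The same flag can be passed through pandas.
s = pd.Series([text])
cleaned = s.str.replace(r'\D+', '', regex=True, flags=re.ASCII)

print(ascii_only)   # 10
print(cleaned[0])   # 10
```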

Answered by jpp

You can also do this via a lambda function with str.isdigit:

import pandas as pd

df = pd.DataFrame({'Name': ['John5', 'Tom 8', 'Ron 722']})

df['Name'] = df['Name'].map(lambda x: ''.join([i for i in x if i.isdigit()]))

#   Name
# 0    5
# 1    8
# 2  722
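
Be aware that str.isdigit is also true for some non-ASCII characters, such as superscript digits. If only the characters 0-9 should be kept, a membership test is one possible alternative. This is a minimal sketch; the helper name ascii_digits_only and the third sample value are hypothetical.

```python
import pandas as pd

def ascii_digits_only(value: str) -> str:
    """Keep only the characters 0-9 (hypothetical helper)."""
    return ''.join(ch for ch in value if ch in '0123456789')

# '\u00b2' is the superscript two character.
df = pd.DataFrame({'Name': ['John5', 'Tom 8', 'x\u00b2 = 9']})

# str.isdigit treats the superscript two as a digit, so it is kept.
with_isdigit = df['Name'].map(lambda x: ''.join(ch for ch in x if ch.isdigit()))

# The membership test keeps ASCII digits only.
with_ascii = df['Name'].map(ascii_digits_only)

print(with_isdigit.tolist())
print(with_ascii.tolist())   # ['5', '8', '9']
```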

Answered by MEdwin

Two years on, to help others: I actually think you were very close to the answer. I have used your logic but made it work. Basically, you create a function that does the cleanup and then apply it to column C.

import pandas as pd
import re

df = pd.DataFrame({
    'A': ['red78', 'green', 'blue45'],
    'B': ['square', 'circle', 'triangle'],
    'C': ['big235', 'small123', 'big657']
})

def remove_chars(s):
    return re.sub('[^0-9]+', '', s) 

df['C'] = df['C'].apply(remove_chars)
df

Result below:

        A         B    C
0   red78    square  235
1   green    circle  123
2  blue45  triangle  657
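
The attempt in the question fails because re.sub is called immediately with the whole Series as its string argument, whereas apply expects a function that it can call on each element. A function like the one above raises the same kind of error if the column contains non-string values such as None or NaN. The sketch below shows one way to guard against that; the names NON_DIGITS and remove_chars_safe and the sample data are hypothetical.

```python
import re
import pandas as pd

# Compile the pattern once instead of on every call.
NON_DIGITS = re.compile(r'[^0-9]+')

def remove_chars_safe(value):
    """Strip non-digits from strings; pass other values (e.g. NaN) through."""
    if isinstance(value, str):
        return NON_DIGITS.sub('', value)
    return value

# Hypothetical column with a missing value in the middle.
s = pd.Series(['big235', None, 'big657'])
cleaned = s.apply(remove_chars_safe)
print(cleaned)
```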