Python: Remove special characters in pandas dataframe

Notice: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/38277928/

Time: 2020-08-19 20:36:08  Source: igfitidea

Tags: python, numpy, pandas

Asked by RageQuilt

This seems like an inherently simple task, but I am finding it very difficult to remove the '*' from my entire dataframe and return the numeric values in each column, including the numbers that did not have '*'. The dataframe includes hundreds more columns; in short, it looks like this:

Time            A1      A2
2.0002546296    1499    1592
2.0006712963    1252    1459
2.0902546296    1731    2223
2.0906828704    1691    1904
2.1742245370    2364    3121
2.1764699074    2096    1942
2.7654050926    *7639*  *8196*
2.7658564815    *7088*  *7542*
2.9048958333    *8736*  *8459*
2.9053125000    *7778*  *7704*
2.9807175926    *6612*  *6593*
3.0585763889    *8520*  *9122*

I have not yet written it to iterate over every column in df, but for the first column I have come up with this:

df['A1'].str.replace('*','').astype(float)

which yields

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
5        NaN
6        NaN
7        NaN
8        NaN
9        NaN
10       NaN
11       NaN
12       NaN
13       NaN
14       NaN
15       NaN
16       NaN
17       NaN
18       NaN
19    7639.0
20    7088.0
21    8736.0
22    7778.0
23    6612.0
24    8520.0

Is there a very easy way to just remove the '*' in the dataframe in pandas?

Answered by shivsn

Use replace, which applies to the whole dataframe:

df
Out[14]: 
       Time      A1      A2
0  2.000255    1499    1592
1  2.176470    2096    1942
2  2.765405  *7639*  *8196*
3  2.765856  *7088*  *7542*
4  2.904896  *8736*  *8459*
5  2.905312  *7778*  *7704*
6  2.980718  *6612*  *6593*
7  3.058576  *8520*  *9122*

df = df.replace(r'\*', '', regex=True).astype(float)

df
Out[16]: 
       Time    A1    A2
0  2.000255  1499  1592
1  2.176470  2096  1942
2  2.765405  7639  8196
3  2.765856  7088  7542
4  2.904896  8736  8459
5  2.905312  7778  7704
6  2.980718  6612  6593
7  3.058576  8520  9122
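
For reference, a minimal self-contained sketch of this approach follows. The small frame is made up for illustration (a few rows in the style of the question's data), and the names `via_str` and `cleaned` are assumptions, not part of the original answer. It also illustrates a likely reason for the NaN values in the question: when a column mixes real numbers with strings, the `.str` accessor returns NaN for the non-string entries, whereas `DataFrame.replace` with `regex=True` leaves them untouched.

```python
import pandas as pd

# Hypothetical sample in the style of the question: A1/A2 mix ints and strings.
df = pd.DataFrame({
    "Time": [2.000255, 2.176470, 2.765405, 2.765856],
    "A1": [1499, 2096, "*7639*", "*7088*"],
    "A2": [1592, 1942, "*8196*", "*7542*"],
})

# The .str accessor only handles string entries; the ints come back as NaN.
via_str = df["A1"].str.replace("*", "", regex=False)

# DataFrame.replace with regex=True rewrites the strings and leaves the
# non-string entries alone, so the whole frame can then be cast to float.
cleaned = df.replace(r"\*", "", regex=True).astype(float)

print(via_str)
print(cleaned)
```

The raw string `r"\*"` matters here: an unescaped `*` is a regex quantifier, not a literal asterisk.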

Answered by amin

Another solution uses the map and strip functions. See the related question "Pandas DataFrame: remove unwanted parts from strings in a column".

df = 
    Time     A1     A2
0   2.0     1258    *1364*
1   2.1     *1254*  2002
2   2.2     1520    3364
3   2.3     *300*   *10056*

cols = ['A1', 'A2']
for col in cols:
    df[col] = df[col].map(lambda x: str(x).lstrip('*').rstrip('*')).astype(float)

df = 
    Time     A1     A2
0   2.0     1258    1364
1   2.1     1254    2002
2   2.2     1520    3364
3   2.3     300     10056

The parsing is applied only to the desired columns.
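
A compact, runnable variant of the same idea, as a sketch: the sample frame mirrors the one in this answer, and using a single `strip('*')` in place of the chained `lstrip`/`rstrip` is a simplification made here, not part of the original answer.

```python
import pandas as pd

# Sample frame mirroring the one in this answer.
df = pd.DataFrame({
    "Time": [2.0, 2.1, 2.2, 2.3],
    "A1": [1258, "*1254*", 1520, "*300*"],
    "A2": ["*1364*", 2002, 3364, "*10056*"],
})

cols = ["A1", "A2"]  # only these columns are parsed
for col in cols:
    # str(x) makes ints and strings uniform; strip('*') trims both ends.
    df[col] = df[col].map(lambda x: str(x).strip("*")).astype(float)

print(df)
```

Because `strip` only trims the ends of the string, an asterisk in the middle of a value would survive and make the float conversion fail.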

Answered by CuriousCoder

I found this to be a simple approach: use replace to retain only the digits (and the dot and minus sign).
This removes letters, symbols, or anything else that is not listed in the character class of the to_replace argument.

So, the solution is:
df['A1'].replace(regex=True, inplace=True, to_replace=r'[^0-9.\-]', value=r'')
df['A1'] = df['A1'].astype('float64')
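
To make the idea concrete, here is a minimal sketch on made-up data; the sample values are illustrative assumptions. The result is assigned back to the column instead of relying on `inplace=True`, a choice made here because an in-place change on a selected column may not reach the parent frame under pandas' copy-on-write behaviour.

```python
import pandas as pd

# Made-up values: asterisks, a stray unit suffix, and a negative decimal.
df = pd.DataFrame({"A1": [1499, "*7639*", "12.5kg", "-3.25"]})

# [^0-9.\-] matches every character that is NOT a digit, dot, or minus sign;
# replacing those matches with '' keeps only the numeric characters.
df["A1"] = df["A1"].replace(to_replace=r"[^0-9.\-]", value="", regex=True)
df["A1"] = df["A1"].astype("float64")

print(df["A1"].tolist())
```

The pattern keeps dots and minus signs wherever they occur, so a value such as `'1-2'` would pass through the replacement unchanged and then fail the float conversion.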

Answered by ?oàn Ph??ng Th?o

I found CuriousCoder's answer brief and useful, but as originally posted it ended the replace call with ']' where ')' is needed. It should be:

df['A1'].replace(regex=True, inplace=True, to_replace=r'[^0-9.\-]', value=r'')
df['A1'] = df['A1'].astype('float64')