pandas: Ignoring non-numerical string values in a pandas DataFrame

Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/36685347/

Ignoring non-numerical string values in pandas dataframe

python pandas

Asked by devil0150

I have a DataFrame in which a column might have three kinds of values, integers (12331), integers as strings ('345') or some other string ('text').

Is there a way to drop all rows with the last kind of value (non-numeric strings such as 'text') from the dataframe, and convert the integer-like strings (such as '345') into integers? Or, at least, is there some way to ignore the rows that cause type errors when I sum the column?

This dataframe is from reading a pretty big CSV file (25 GB), so I'd like some solution that would work when reading in chunks.

Answered by Marius

Pandas has some tools for converting these kinds of columns, but they may not suit your needs exactly. pd.to_numeric converts mixed columns like yours, but with errors='coerce' it converts non-numeric strings to NaN. This means you'll get float columns, not integer ones, since NaN is a float value and NumPy integer columns cannot hold it. That usually doesn't matter too much, but it's good to be aware of.

import pandas as pd

df = pd.DataFrame({'mixed_types': [12331, '345', 'text']})

pd.to_numeric(df['mixed_types'], errors='coerce')
Out[7]: 
0    12331.0
1      345.0
2        NaN
Name: mixed_types, dtype: float64
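
If integer values are needed despite the missing entries, one option in newer pandas versions is the nullable "Int64" extension dtype. The following is a minimal sketch under that assumption, not part of the original answer; the variable names converted and as_nullable_int are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"mixed_types": [12331, "345", "text"]})

# Non-numeric strings become NaN, so the result is float64
converted = pd.to_numeric(df["mixed_types"], errors="coerce")

# Cast to the nullable integer dtype; NaN becomes <NA>
as_nullable_int = converted.astype("Int64")
print(as_nullable_int)

# Missing values are skipped by default when summing
print(as_nullable_int.sum())
```

The cast only succeeds when every non-missing float is a whole number, which holds here because the inputs were integers or integer-like strings.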

If you then want to drop all the NaN rows:

# Replace the column with the converted values
df['mixed_types'] = pd.to_numeric(df['mixed_types'], errors='coerce')

# Drop NA values, listing the converted columns explicitly
#   so NA values in other columns aren't dropped
df.dropna(subset = ['mixed_types'])
Out[11]: 
   mixed_types
0      12331.0
1        345.0

Answered by Anton Protopopov

You could use pd.to_numeric with errors='coerce' to replace your non-numeric values with NaN, and apply it to each column. Then you could use dropna or fillna, whichever you prefer.

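import pandas as pd
df = pd.read_csv('file.csv')
# Non-numeric values in any column become NaN
df = df.apply(pd.to_numeric, errors='coerce')
# Drops every row that has a NaN in any column
df = df.dropna()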
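
Applying the conversion to every column also turns legitimate text columns into NaN, so dropna() may remove more rows than intended. A possible variation, sketched here with a hypothetical label column and a hand-picked numeric_cols list, restricts the conversion to the columns that should be numeric.

```python
import pandas as pd

df = pd.DataFrame({
    "label": ["a", "b", "c"],               # text column to keep as-is
    "mixed_types": [12331, "345", "text"],  # column to clean
})

# Only these columns are coerced; the list is an assumption for this sketch
numeric_cols = ["mixed_types"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Drop rows only when one of the converted columns is NaN
cleaned = df.dropna(subset=numeric_cols)
print(cleaned)
```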

Answered by PhilChang

You can use df._get_numeric_data() directly.
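
As a hedged illustration of what this call does: _get_numeric_data() is a private pandas method (the leading underscore means it may change between versions), and it selects whole columns by dtype rather than filtering rows. In the sketch below, the extra ints column is an assumption added for contrast; the mixed column has object dtype and is dropped entirely.

```python
import pandas as pd

df = pd.DataFrame({
    "mixed_types": [12331, "345", "text"],  # object dtype
    "ints": [1, 2, 3],                      # int64 dtype
})

# Keeps only columns whose dtype is numeric
numeric_only = df._get_numeric_data()
print(numeric_only.columns.tolist())
```

So for a column that mixes numbers and strings, this approach removes the column instead of cleaning it, and a to_numeric conversion would still be needed first.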
