pandas: Ignore non-numeric string values in a pandas DataFrame

Question

Asked by devil0150

I have a DataFrame in which a column might hold three kinds of values: integers (12331), integers stored as strings ('345'), or other strings ('text').

Is there a way to drop all rows containing the last kind of string from the DataFrame, and convert the integer strings into integers? Or at least some way to ignore the rows that cause type errors when I sum the column?

This DataFrame comes from reading a pretty big CSV file (25 GB), so I'd like a solution that also works when reading in chunks.

Answer 1

Answered by Marius

pandas has some tools for converting these kinds of columns, but they may not suit your needs exactly. pd.to_numeric converts mixed columns like yours, but with errors='coerce' it turns non-numeric strings into NaN. This means you'll get float columns, not integer ones, since a default integer column cannot hold NaN values. That usually doesn't matter much, but it's good to be aware of.

import pandas as pd

df = pd.DataFrame({'mixed_types': [12331, '345', 'text']})

pd.to_numeric(df['mixed_types'], errors='coerce')
Out[7]: 
0    12331.0
1      345.0
2        NaN
Name: mixed_types, dtype: float64
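If integer values are needed despite the missing entries, one option that is not part of the original answer is pandas' nullable Int64 dtype. The sketch below assumes pandas 1.0 or later and that every value that survives coercion is a whole number; the names as_float and as_int are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'mixed_types': [12331, '345', 'text']})

# Coerce first: non-numeric strings become NaN and the result is float64.
as_float = pd.to_numeric(df['mixed_types'], errors='coerce')

# Cast to the nullable integer dtype (capital "I"); NaN becomes <NA>.
# This raises if any remaining value has a fractional part.
as_int = as_float.astype('Int64')

print(as_int)
```

The lowercase 'int64' dtype would fail here because it has no way to represent the missing value.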

If you then want to drop all the NaN rows:

# Replace the column with the converted values
df['mixed_types'] = pd.to_numeric(df['mixed_types'], errors='coerce')

# Drop NA values, listing the converted columns explicitly
#   so NA values in other columns aren't dropped
df.dropna(subset = ['mixed_types'])
Out[11]: 
   mixed_types
0      12331.0
1        345.0
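Since the question mentions a 25 GB file read in chunks, here is a rough sketch of how the same conversion might be applied chunk by chunk to sum the column. The in-memory CSV, the chunk size of 2, and the column name mixed_types are stand-ins chosen for illustration; a real run would pass the file path and a much larger chunksize.

```python
import io

import pandas as pd

# Stand-in for the large CSV file; in practice pass the file path instead.
csv_data = io.StringIO("mixed_types\n12331\n345\ntext\n7\n")

total = 0
# chunksize makes read_csv return an iterator of DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv(csv_data, chunksize=2):
    # A chunk may be parsed as integers or as strings depending on its
    # contents; to_numeric handles both and turns bad values into NaN.
    values = pd.to_numeric(chunk['mixed_types'], errors='coerce')
    # Series.sum() skips NaN by default, so bad rows are simply ignored.
    total += values.sum()

print(total)
```

Because only a running total is kept, memory use stays bounded by the chunk size rather than the file size.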

Answer 2

Answered by Anton Protopopov

You could use pd.to_numeric with errors='coerce' to replace your non-numeric values with NaN, applying it to each column. Then you could use dropna or fillna, whichever you prefer.

import pandas as pd

df = pd.read_csv('file.csv')
df = df.apply(pd.to_numeric, errors='coerce')
df = df.dropna()
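Applying pd.to_numeric to every column also turns genuine text columns into NaN, and a bare dropna() then removes those rows as well. A possible variation, sketched here with a hypothetical frame and a hypothetical numeric_cols list, restricts the conversion to the columns that are expected to be numeric.

```python
import pandas as pd

# Hypothetical data: "name" is real text, "amount" is the mixed column.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "amount": [12331, "345", "text"],
})

# Columns expected to be numeric (an assumption for this example).
numeric_cols = ["amount"]

# Convert only those columns; extra keyword arguments given to apply
# are forwarded to pd.to_numeric.
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Option 1: drop rows where a converted column is NaN.
cleaned = df.dropna(subset=numeric_cols)

# Option 2: keep every row and replace NaN with a default value.
filled = df.fillna({"amount": 0})

print(cleaned)
print(filled)
```

Whether dropping or filling is appropriate depends on what the bad rows mean in the data; filling with 0 leaves a sum unchanged but shifts a mean.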

Answer 3

Answered by PhilChang

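You can use df._get_numeric_data() directly. Note that the leading underscore marks it as a private method, and it only keeps columns whose dtype is already numeric.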

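For comparison, the public select_dtypes method does a similar column-level selection. The small sketch below uses a made-up frame to show that this kind of selection works on whole columns by dtype: a mixed column like the one in the question is stored as object dtype, so it is left out entirely rather than cleaned row by row.

```python
import pandas as pd

# Made-up frame: "ints" is a true integer column, "mixed" is object dtype.
df = pd.DataFrame({
    "ints": [1, 2, 3],
    "mixed": [12331, "345", "text"],
})

# Keep only the columns whose dtype is numeric.
numeric_only = df.select_dtypes(include="number")

print(list(numeric_only.columns))
```

So for the situation in the question, the column would still need pd.to_numeric before any dtype-based selection could pick it up.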
