python: convert numerical data in pandas dataframe to floats in the presence of strings
Original question: http://stackoverflow.com/questions/19864028/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by natsuki_2002
I've got a pandas dataframe with a column 'cap'. This column mostly consists of floats but has a few strings in it, for instance at index 1.
df =
cap
0 5.2
1 na
2 2.2
3 7.6
4 7.5
5 3.0
...
I import my data from a csv file like so:
df = DataFrame(pd.read_csv(myfile.file))
Unfortunately, when I do this, the column 'cap' is imported entirely as strings. I would like floats to be identified as floats and strings as strings. Trying to convert this using:
df['cap'] = df['cap'].astype(float)
throws up an error:
could not convert string to float: na
Is there any way to make all the numbers into floats but keep the 'na' as a string?
Accepted answer by Acorbe
Here is a possible workaround.
First, define a function that converts a value to float only when the conversion succeeds:
def to_number(s):
    try:
        s1 = float(s)
        return s1
    except ValueError:
        return s
Then apply it row by row.
Example: given
df
0
0 a
1 2
where both a and 2 are strings, we do the conversion via
converted = df.apply(lambda f: to_number(f[0]), axis=1)
converted
0 a
1 2
A direct check on the types:
type(converted.iloc[0])
str
type(converted.iloc[1])
float
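As a minimal sketch of the workaround above, the same to_number function can also be applied to a single column with Series.map instead of a row-wise DataFrame.apply. The sample values below are made up to mirror the question's 'cap' column; this is only an illustration, not the answer author's exact code.

```python
import pandas as pd


def to_number(s):
    """Return float(s) when s parses as a number, otherwise return s unchanged."""
    try:
        return float(s)
    except ValueError:
        return s


# Hypothetical sample data mirroring the question's 'cap' column (all strings).
df = pd.DataFrame({"cap": ["5.2", "na", "2.2"]}, dtype=object)

# Series.map calls to_number once per element of the column.
df["cap"] = df["cap"].map(to_number)

# Collect the Python type of each element to see what was converted.
value_types = [type(v).__name__ for v in df["cap"]]
print(value_types)
```

The column still holds a mix of Python floats and strings, so it stays an object column rather than becoming float64.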
Answer by Andy Hayden
Calculations with columns of float64 dtype (rather than object) are much more efficient, so this is usually preferred, and it also allows you to do other calculations. Because of this, it is recommended to use NaN for missing data (rather than your own placeholder, or None).
Is this really the answer you want?
In [11]: df.sum() # all strings
Out[11]:
cap 5.2na2.27.67.53.0
dtype: object
In [12]: df.apply(lambda f: to_number(f[0]), axis=1).sum() # floats and 'na' strings
TypeError: unsupported operand type(s) for +: 'float' and 'str'
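You should use convert_numeric to coerce to floats (note that convert_objects was later deprecated and is no longer available as of pandas 1.0, where pd.to_numeric with errors='coerce' does the same job for a single column):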
In [21]: df.convert_objects(convert_numeric=True)
Out[21]:
cap
0 5.2
1 NaN
2 2.2
3 7.6
4 7.5
5 3.0
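On pandas versions that no longer ship convert_objects, a comparable coercion can be sketched with pd.to_numeric. The data below is typed in by hand to match the question's example; it is an assumption for illustration, not output from the original session.

```python
import pandas as pd

# Hypothetical reconstruction of the question's column, stored as strings.
df = pd.DataFrame({"cap": ["5.2", "na", "2.2", "7.6", "7.5", "3.0"]}, dtype=object)

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
df["cap"] = pd.to_numeric(df["cap"], errors="coerce")

print(df["cap"].dtype)
total = df["cap"].sum()  # NaN values are skipped by default
print(total)
```

Because the result is a float64 column, numeric reductions such as sum work without the string concatenation seen earlier.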
Or read it in directly as a csv, by appending 'na' to the list of values to be considered NaN:
In [22]: pd.read_csv(myfile.file, na_values=['na'])
Out[22]:
cap
0 5.2
1 NaN
2 2.2
3 7.6
4 7.5
5 3.0
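To try the na_values approach without a file on disk, the CSV can be simulated with io.StringIO. The CSV text here is a hand-written stand-in for the asker's file, whose real contents and name are not shown in the question.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for the asker's file.
csv_text = "cap\n5.2\nna\n2.2\n7.6\n7.5\n3.0\n"

# Without na_values, the lowercase 'na' keeps the whole column as strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, 'na' is parsed as NaN and the column becomes float64.
clean = pd.read_csv(io.StringIO(csv_text), na_values=["na"])

print(raw["cap"].dtype)
print(clean["cap"].dtype)
```

Values passed through na_values are added to pandas' built-in list of missing-value markers, so the usual markers such as 'NaN' and 'NA' are still recognized.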
In either case, sum (and many other pandas functions) will now work:
In [23]: df.sum()
Out[23]:
cap 25.5
dtype: float64
As Jeff advises:
repeat 3 times fast: object==bad, float==good
Answer by reabow
I tried an alternative to the above:
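import numpy as np

for num, item in enumerate(data['col']):
    try:
        float(item)
    except (TypeError, ValueError):
        # not parseable as a number: mark it as missing
        data.loc[data.index[num], 'col'] = np.nan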
Answer by Victor Grau Serrat
First of all, the way you import your CSV is redundant. Instead of doing:
df = DataFrame(pd.read_csv(myfile.file))
you can directly do:
df = pd.read_csv(myfile.file)
Then convert the column to float, turning anything that is not a number into NaN (pd.to_numeric expects a one-dimensional input such as a single column, not a whole DataFrame):
df['cap'] = pd.to_numeric(df['cap'], errors='coerce')
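If several columns need the same treatment, one possible sketch is to run pd.to_numeric over each column with DataFrame.apply. The second column 'vol' and all values below are hypothetical, added only to show the multi-column case.

```python
import pandas as pd

# Hypothetical frame: 'cap' mirrors the question, 'vol' is invented for illustration.
df = pd.DataFrame(
    {"cap": ["5.2", "na", "2.2"], "vol": ["1", "2", "x"]},
    dtype=object,
)

# apply calls pd.to_numeric once per column and forwards errors="coerce" to it.
numeric = df.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric.isna().sum())
```

Every non-numeric entry becomes NaN, so this is only appropriate when none of the columns are meant to keep their text values.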