Converting string objects to int/float using pandas (Python)
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/24037507/
Asked by tejesh95
import pandas as pd

# Load the CSV file
path1 = "/home/supertramp/Desktop/100&life_180_data.csv"
mydf = pd.read_csv(path1)

# Weight assigned to each category
numcigar = {"Never": 0, "1-5 Cigarettes/day": 1, "10-20 Cigarettes/day": 4}
print(mydf['Cigarettes'])

# Look up each value in the dict; values not found there become NaN
mydf['CigarNum'] = mydf['Cigarettes'].apply(numcigar.get).astype(float)
print(mydf['CigarNum'])
mydf.to_csv('/home/supertramp/Desktop/powerRangers.csv')
The CSV file "100&life_180_data.csv" contains columns such as Age, BMI, Cigarettes, Alcohol, etc.
No int64
Age int64
BMI float64
Alcohol object
Cigarettes object
dtype: object
The Cigarettes column contains "Never", "1-5 Cigarettes/day" and "10-20 Cigarettes/day". I want to assign weights to these values (Never, 1-5 Cigarettes/day, ...).
The expected output is a new appended column, CigarNum, which consists only of the numbers 0, 1, 2. CigarNum is as expected up to row 8 and then shows NaN until the last row of the CigarNum column.
0 Never
1 Never
2 1-5 Cigarettes/day
3 Never
4 Never
5 Never
6 Never
7 Never
8 Never
9 Never
10 Never
11 Never
12 10-20 Cigarettes/day
13 1-5 Cigarettes/day
14 Never
...
167 Never
168 Never
169 10-20 Cigarettes/day
170 Never
171 Never
172 Never
173 Never
174 Never
175 Never
176 Never
177 Never
178 Never
179 Never
180 Never
181 Never
Name: Cigarettes, Length: 182, dtype: object
The output I get shouldn't give NaN after the first few rows.
0 0
1 0
2 1
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 NaN
11 NaN
12 NaN
13 NaN
14 0
...
167 NaN
168 NaN
169 NaN
170 NaN
171 NaN
172 NaN
173 NaN
174 NaN
175 NaN
176 NaN
177 NaN
178 NaN
179 NaN
180 NaN
181 NaN
Name: CigarNum, Length: 182, dtype: float64
Accepted answer by EdChum
OK, the first problem is that you have embedded spaces, which cause the function to be applied incorrectly.
Fix this using the vectorised str methods:
mydf['Cigarettes'] = mydf['Cigarettes'].str.replace(' ', '')
Now creating your new column should just work:
mydf['CigarNum'] = mydf['Cigarettes'].apply(numcigar.get).astype(float)
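A minimal sketch of the whitespace problem, using a made-up Series rather than the original CSV (the sample values and the names cigs, raw and cleaned are assumptions). It strips only leading and trailing whitespace with str.strip(), on the assumption that the stray spaces sit at the ends of the values; removing every space would also alter values such as "1-5 Cigarettes/day", so they would no longer match the dictionary keys.

```python
import pandas as pd

numcigar = {"Never": 0, "1-5 Cigarettes/day": 1, "10-20 Cigarettes/day": 4}

# Hypothetical sample: some values carry stray leading/trailing spaces
cigs = pd.Series(["Never", "Never ", " 1-5 Cigarettes/day", "10-20 Cigarettes/day  "])

# Without cleaning, the padded values miss the dictionary keys and become NaN
raw = cigs.map(numcigar)

# Strip surrounding whitespace only, then look each value up in the dictionary
cleaned = cigs.str.strip().map(numcigar).astype(float)

print(raw.isna().sum())   # 3
print(cleaned.tolist())   # [0.0, 0.0, 1.0, 4.0]
```

Printing the unique values with repr (for example [repr(v) for v in cigs.unique()]) is a quick way to check whether whitespace is really the cause.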
UPDATE
Thanks to @Jeff, as always, for pointing out superior ways to do things.
So you can call replace instead of calling apply:
mydf['CigarNum'] = mydf['Cigarettes'].replace(numcigar)
# now convert the types
mydf['CigarNum'] = mydf['CigarNum'].convert_objects(convert_numeric=True)
You can also use the factorize method.
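For reference, a small sketch of what factorize does, on a made-up Series (the name cigs and its values are assumptions). Note that the codes follow the order of first appearance, so this gives 0, 1, 2 rather than custom weights such as 0, 1, 4.

```python
import pandas as pd

# Hypothetical sample data standing in for mydf['Cigarettes']
cigs = pd.Series(["Never", "1-5 Cigarettes/day", "Never", "10-20 Cigarettes/day"])

# factorize returns integer codes plus the unique values,
# numbered in order of first appearance
codes, uniques = pd.factorize(cigs)

print(codes.tolist())   # [0, 1, 0, 2]
print(list(uniques))    # ['Never', '1-5 Cigarettes/day', '10-20 Cigarettes/day']
```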
Thinking about it, why not just set the dict values to be floats anyway, so that you avoid the type conversion?
So:
numcigar = {"Never":0.0 ,"1-5 Cigarettes/day" :1.0,"10-20 Cigarettes/day":4.0}
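As a rough illustration of that point, the sketch below uses map (swapped in here for replace) on a made-up Series; the names cigs and cigar_num are assumptions. With float values in the dictionary the result is already float64, so no separate conversion step is needed.

```python
import pandas as pd

# Dictionary values are floats from the start
numcigar = {"Never": 0.0, "1-5 Cigarettes/day": 1.0, "10-20 Cigarettes/day": 4.0}

# Hypothetical sample data standing in for mydf['Cigarettes']
cigs = pd.Series(["Never", "1-5 Cigarettes/day", "10-20 Cigarettes/day", "Never"])

# Look up each value; the result takes its dtype from the float weights
cigar_num = cigs.map(numcigar)

print(cigar_num.dtype)     # float64
print(cigar_num.tolist())  # [0.0, 1.0, 4.0, 0.0]
```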
Version 0.17.0 or newer
convert_objects is deprecated since 0.17.0; it has been replaced with to_numeric:
mydf['CigarNum'] = pd.to_numeric(mydf['CigarNum'], errors='coerce')
Here errors='coerce' will return NaN where the values cannot be converted to a numeric value; without it, an exception is raised.
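A short sketch of the difference, on a made-up Series of strings (the name s and its values are assumptions, not data from the question):

```python
import pandas as pd

# Hypothetical sample: three numeric strings and one that cannot be parsed
s = pd.Series(["0", "1", "4", "unknown"])

# errors='coerce' turns the unparseable value into NaN
nums = pd.to_numeric(s, errors="coerce")
print(nums.tolist())  # [0.0, 1.0, 4.0, nan]

# With the default (errors='raise') the same input raises a ValueError
try:
    pd.to_numeric(s)
    raised = False
except ValueError as exc:
    raised = True
    print("conversion failed:", exc)
```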
Answer by Apogentus
Try using this function for all problems of this kind:
import numpy as np
import pandas as pd

def get_series_ids(x):
    '''Return a pandas Series of integer ids corresponding to the
    objects in the input pandas Series x.
    Example:
    get_series_ids(pd.Series(['a','a','b','b','c']))
    returns Series([0,0,1,1,2], dtype=int)'''
    # Sorted unique values; each one gets its position as its id
    values = np.unique(x)
    values2nums = dict(zip(values, range(len(values))))
    # Look up the id of each element (map yields an integer dtype directly)
    return x.map(values2nums)