Python pandas: convert dtype 'object' to int
Original source: http://stackoverflow.com/questions/39173813/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Pandas: convert dtype 'object' to int
Asked by cyril
I've read an SQL query into Pandas and the values are coming in as dtype 'object', although they are strings, dates and integers. I am able to convert the date 'object' to a Pandas datetime dtype, but I'm getting an error when trying to convert the string and integers.
Here is an example:
>>> import pandas as pd
>>> df = pd.read_sql_query('select * from my_table', conn)
>>> df
     id        date purchase
1  abc1  2016-05-22        1
2  abc2  2016-05-29        0
3  abc3  2016-05-22        2
4  abc4  2016-05-22        0
>>> df.dtypes
id          object
date        object
purchase    object
dtype: object
Converting df['date'] to a datetime works:
>>> pd.to_datetime(df['date'])
1 2016-05-22
2 2016-05-29
3 2016-05-22
4 2016-05-22
Name: date, dtype: datetime64[ns]
But I get an error when trying to convert df['purchase'] to an integer:
>>> df['purchase'].astype(int)
....
pandas/lib.pyx in pandas.lib.astype_intsafe (pandas/lib.c:16667)()
pandas/src/util.pxd in util.set_value_at (pandas/lib.c:67540)()
TypeError: long() argument must be a string or a number, not 'java.lang.Long'
NOTE: I get a similar error when I try .astype('float').
And when trying to convert to a string, nothing seems to happen.
>>> df['id'].apply(str)
1 abc1
2 abc2
3 abc3
4 abc4
Name: id, dtype: object
Answered by cyril
Documenting the answer that worked for me based on the comment by @piRSquared.
I needed to convert to a string first, then an integer.
>>> df['purchase'].astype(str).astype(int)
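A minimal sketch of this two-step cast. The java.lang.Long objects from the question cannot be reproduced without the database driver, so the object column below mixes Python ints and numeric strings as a stand-in; that data is an assumption for illustration only.

```python
import pandas as pd

# Stand-in data (assumed): an object column mixing ints and numeric strings.
purchase = pd.Series([1, "0", 2, "0"], dtype=object, name="purchase")

# Casting to str first gives every element a uniform text form,
# which the second cast can then parse as an integer.
as_int = purchase.astype(str).astype(int)

print(as_int.dtype)
print(as_int.tolist())
```

This works only when every element's string form is a valid integer literal; a missing value would become the text "nan" and make the second cast fail.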
Answered by Kariru
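It's simple: use pd.factorize. Note that this encodes each distinct value as an integer label rather than parsing the values as numbers.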
pd.factorize(df.purchase)[0]
Example:
labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
labels
# array([0, 0, 1, 2, 0])
uniques
# array(['b', 'a', 'c'], dtype=object)
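To make the difference between label encoding and numeric parsing concrete, here is a small sketch on made-up data (the Series contents are an assumption, not taken from the question). pd.factorize numbers values by order of first appearance, while pd.to_numeric reads the text itself.

```python
import pandas as pd

# Made-up data: numeric strings stored in an object column.
s = pd.Series(["10", "5", "10"], dtype=object)

# factorize assigns labels in order of first appearance: "10" -> 0, "5" -> 1.
codes, uniques = pd.factorize(s)

# to_numeric parses the text, so the numeric values themselves survive.
parsed = pd.to_numeric(s)

print(codes.tolist())
print(parsed.tolist())
```

If the goal is to recover the stored numbers, as in the question, the parsing route is probably the one you want; factorize suits categorical encoding.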
Answered by HEMANTHKUMAR GADI
My training data contains three features of dtype object. Applying astype converts an object column into a numeric one, but before that you may need to perform some preprocessing steps:
train.dtypes
C12    object
C13    object
C14    object
train['C14'] = train.C14.astype(int)
train.dtypes
C12    object
C13    object
C14     int32
Answered by mandeep
Follow these steps:
1. Clean your file: open your data file in CSV format, check whether there is a "?" in place of empty values, and delete all of them.
2. Drop the rows containing missing values, e.g.:
df.dropna(subset=["normalized-losses"], axis=0, inplace=True)
3. Now use astype for the conversion:
df["normalized-losses"] = df["normalized-losses"].astype(int)
Note: If you still get errors, inspect your CSV file again: open it in Excel, check whether a "?" remains in the required column, delete it, save the file, and rerun your program.
Comment "success!" if it works. :)
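The steps above can be sketched in code. As a swapped-in alternative to editing the file by hand, this sketch asks read_csv to treat "?" as missing via its na_values parameter; the CSV content and the "make" column are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content; "?" marks a missing value.
csv_text = "make,normalized-losses\naudi,164\nbmw,?\nvolvo,158\n"

# na_values="?" makes pandas read "?" as NaN at load time.
df = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Drop rows where the column is missing, then cast to int.
df = df.dropna(subset=["normalized-losses"])
df["normalized-losses"] = df["normalized-losses"].astype(int)

print(df)
```

Handling the placeholder at load time keeps the source file untouched, which may matter if the file is regenerated or shared.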
Answered by ohlemacher
In my case, I had a df with mixed data:
df:
0 1 2 ... 242 243 244
0 2020-04-22T04:00:00Z 0 0 ... 3,094,409.5 13,220,425.7 5,449,201.1
1 2020-04-22T06:00:00Z 0 0 ... 3,716,941.5 8,452,012.9 6,541,599.9
....
The floats are actually objects, but I need them to be real floats.
To fix it, referencing @AMC's comment above:
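def coerce_to_float(val):
    # Return val as a float when it parses; otherwise leave it unchanged.
    try:
        return float(val)
    except ValueError:
        return val
# On pandas >= 2.1, DataFrame.map replaces the deprecated applymap.
df = df.applymap(coerce_to_float)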
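One caveat: Python's float() does not parse text with thousands separators such as "3,094,409.5", so values stored in that form would be left unchanged. The following is a sketch of a variant that strips commas first, assuming the commas are thousands separators and not decimal marks; the two-column frame is a cut-down, hypothetical version of the data shown above.

```python
import pandas as pd

# Hypothetical, cut-down version of the mixed frame.
df = pd.DataFrame({
    0: ["2020-04-22T04:00:00Z", "2020-04-22T06:00:00Z"],
    242: ["3,094,409.5", "3,716,941.5"],
})

def coerce_to_float(val):
    """Return val as a float when it parses, otherwise return it unchanged."""
    try:
        # Assumption: commas are thousands separators and can be dropped.
        return float(str(val).replace(",", ""))
    except ValueError:
        return val

# Apply the helper to every cell, one column at a time.
converted = df.apply(lambda col: col.map(coerce_to_float))

print(converted)
```

The timestamp column fails to parse and is returned as-is, so only the numeric columns change.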
Answered by cs95
pandas >= 1.0: convert_dtypes
The (self-)accepted answer doesn't account for the possibility of NaNs in object columns.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [True, False, np.nan]}, dtype=object)
df
a b
0 1 True
1 2 False
2 NaN NaN
df['a'].astype(str).astype(int) # raises ValueError
This chokes because the NaN is converted to the string "nan", and further attempts to coerce it to an integer fail. To avoid this issue, we can soft-convert columns to their corresponding nullable type using convert_dtypes:
df.convert_dtypes()
a b
0 1 True
1 2 False
2 <NA> <NA>
df.convert_dtypes().dtypes
a Int64
b boolean
dtype: object
If your data has junk text mixed in with your ints, you can use pd.to_numeric as an initial step:
s = pd.Series(['1', '2', '...'])
s.convert_dtypes() # converts to string, which is not what we want
0 1
1 2
2 ...
dtype: string
# coerces non-numeric junk to NaNs
pd.to_numeric(s, errors='coerce')
0 1.0
1 2.0
2 NaN
dtype: float64
# one final `convert_dtypes` call to convert to nullable int
pd.to_numeric(s, errors='coerce').convert_dtypes()
0 1
1 2
2 <NA>
dtype: Int64
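Applied to a single column of a DataFrame, the same two calls might look like the sketch below. The frame is hypothetical, modeled on the question's columns, with "n/a" standing in for junk text.

```python
import pandas as pd

# Hypothetical frame modeled on the question; "n/a" stands in for junk text.
df = pd.DataFrame(
    {"id": ["abc1", "abc2", "abc3"], "purchase": ["1", "n/a", "2"]},
    dtype=object,
)

# Coerce junk to NaN, then let convert_dtypes pick the nullable integer type.
df["purchase"] = pd.to_numeric(df["purchase"], errors="coerce").convert_dtypes()

print(df.dtypes)
```

Converting one named column at a time leaves text columns such as id alone.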
Answered by onietosi
Cannot comment so posting this as an answer, which is somewhat in between @piRSquared/@cyril's solution and @cs95's:
As noted by @cs95, if your data contains NaNs or Nones, converting to string type first will raise an error when you then try to convert to int.
However, if your data consists of (numerical) strings, using convert_dtypes will convert it to string type unless you use pd.to_numeric as suggested by @cs95 (potentially combined with df.apply()).
In the case that your data consists only of numerical strings (including NaNs or Nones but without any non-numeric "junk"), a possibly simpler alternative would be to convert first to float and then to one of the nullable-integer extension dtypes provided by pandas (already present in version 0.24) (see also this answer):
df['purchase'].astype(float).astype('Int64')
Note that there has been recent discussion on this on GitHub (currently an unresolved but closed issue, though), and that for very large 64-bit integers you may have to convert explicitly to float128 to avoid approximations during the conversion.
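A short sketch of the float-then-Int64 route, followed by a plain-Python illustration of why very large integers can lose precision on the way through float64. The sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): numeric strings plus a missing value.
purchase = pd.Series(["1", np.nan, "2"], dtype=object)

# float can represent NaN, so the missing value survives the first cast;
# 'Int64' (capital I) is the nullable integer dtype and stores it as <NA>.
as_nullable_int = purchase.astype(float).astype("Int64")

# The float detour is only exact up to 2**53: beyond that, neighbouring
# integers collapse onto the same float64 value.
big = 2**53 + 1
round_trip = int(float(big))

print(as_nullable_int)
print(big, round_trip)
```

For identifiers or counters beyond that range, parsing the strings directly as integers avoids the intermediate float entirely.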
Answered by Rishabh Jain
# list of columns
l1 = ['PM2.5', 'PM10', 'TEMP', 'BP', ' RH', 'WS', 'CO', 'O3', 'Nox', 'SO2']
for i in l1:
    for j in range(0, 8431):  # rows = 8431
        # .loc writes to the frame directly and avoids chained assignment
        df.loc[j, i] = int(df.loc[j, i])
I recommend using this only with small data. The loop converts one cell at a time, so its cost grows as O(rows × columns) and it is far slower than a vectorized conversion.
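For comparison, a vectorized sketch that converts the same kind of columns in one call. The frame below is hypothetical and uses only two of the column names, and it assumes every value parses as an integer.

```python
import pandas as pd

# Hypothetical frame with two of the columns named above, stored as object.
df = pd.DataFrame({"PM2.5": ["12", "40"], "PM10": ["55", "90"]}, dtype=object)
cols = ["PM2.5", "PM10"]

# One astype call converts every listed column without a Python-level loop.
df = df.astype({col: int for col in cols})

print(df.dtypes)
```

Unlike the cell-by-cell loop, this also changes the column dtype itself, not just the individual values.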