Python pandas: converting dtype 'object' to int

Question

Asked by cyril

I've read an SQL query into Pandas and the values are coming in as dtype 'object', although they are strings, dates and integers. I am able to convert the date 'object' to a Pandas datetime dtype, but I'm getting an error when trying to convert the string and integers.

Here is an example:

>>> import pandas as pd
>>> df = pd.read_sql_query('select * from my_table', conn)
>>> df
    id    date          purchase
 1  abc1  2016-05-22    1
 2  abc2  2016-05-29    0
 3  abc3  2016-05-22    2
 4  abc4  2016-05-22    0

>>> df.dtypes
 id          object
 date        object
 purchase    object
 dtype: object

Converting df['date'] to a datetime works:

>>> pd.to_datetime(df['date'])
 1  2016-05-22
 2  2016-05-29
 3  2016-05-22
 4  2016-05-22
 Name: date, dtype: datetime64[ns]

But I get an error when trying to convert df['purchase'] to an integer:

>>> df['purchase'].astype(int)
 ....
 pandas/lib.pyx in pandas.lib.astype_intsafe (pandas/lib.c:16667)()
 pandas/src/util.pxd in util.set_value_at (pandas/lib.c:67540)()

 TypeError: long() argument must be a string or a number, not 'java.lang.Long'

NOTE: I get a similar error when I try .astype('float').

And when trying to convert to a string, nothing seems to happen.

>>> df['id'].apply(str)
 1 abc1
 2 abc2
 3 abc3
 4 abc4
 Name: id, dtype: object

Answer 1

Answered by cyril

Documenting the answer that worked for me based on the comment by @piRSquared.

I needed to convert to a string first, then an integer.

>>> df['purchase'].astype(str).astype(int)
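
A minimal sketch of why the two-step conversion can work, under stated assumptions: FakeLong below is a hypothetical stand-in for a wrapper object (like the java.lang.Long named in the traceback) that prints as digits but that int() cannot read directly. It is not the real Java type, and the objects returned by an actual database driver may behave differently.

```python
import pandas as pd


class FakeLong:
    """Hypothetical stand-in: prints as a number but defines no __int__."""

    def __init__(self, value):
        self._value = value

    def __str__(self):
        return str(self._value)


purchase = pd.Series(
    [FakeLong(1), FakeLong(0), FakeLong(2), FakeLong(0)], dtype=object
)

# Direct conversion fails because int() does not know how to read FakeLong.
try:
    purchase.astype(int)
    direct_ok = True
except (TypeError, ValueError):
    direct_ok = False

# Going through str first yields plain digit strings, which int() can parse.
converted = purchase.astype(str).astype(int)
print(direct_ok, converted.tolist())
```

The intermediate astype(str) only helps when str() of each value is a clean integer literal; a missing value would become a non-numeric string and the second step would fail.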

Answer 2

Answered by Kariru

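It's simple if integer codes are enough (pd.factorize numbers the distinct values in order of first appearance; it does not parse them as numbers):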

pd.factorize(df.purchase)[0]

Example:

labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])

labels
# array([0, 0, 1, 2, 0])

uniques
# array(['b', 'a', 'c'], dtype=object)

Answer 3

Answered by HEMANTHKUMAR GADI

My training data contains three features of dtype object. Applying astype converts an object column to a numeric one, but before that you need to perform some preprocessing steps:

train.dtypes

C12       object
C13       object
C14       object

train['C14'] = train.C14.astype(int)

train.dtypes

C12       object
C13       object
C14       int32

Answer 4

Answered by mandeep

Follow these steps:

1. Clean your file: open your data file in CSV format, check whether there is a "?" in place of empty values, and delete all of them.

2. Drop the rows containing missing values, e.g.:

df.dropna(subset=["normalized-losses"], axis=0, inplace=True)

3. Now use astype for the conversion:

df["normalized-losses"]=df["normalized-losses"].astype(int)
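
The three steps above can be sketched end to end without editing the file by hand, assuming the placeholder for empty cells really is "?". The CSV text and the "make" column below are made up for illustration; na_values is a real pd.read_csv parameter that marks the given strings as missing while reading.

```python
import io

import pandas as pd

# Made-up sample data; "?" stands in for an empty cell.
csv_text = "make,normalized-losses\nalfa,?\naudi,164\nbmw,192\n"

# Step 1: treat "?" as a missing value while reading instead of deleting it manually.
df = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Step 2: drop the rows where the column is missing.
df = df.dropna(subset=["normalized-losses"])

# Step 3: with no missing values left, the column can be cast to int.
df["normalized-losses"] = df["normalized-losses"].astype(int)
print(df)
```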

Note: If you still find errors in your program, inspect your CSV file again: open it in Excel to check whether there is a "?" in the required column, delete it, save the file, and run your program again.

Comment "success!" if it works. :)

Answer 5

Answered by ohlemacher

In my case, I had a df with mixed data:

df:
                     0   1   2    ...                  242                  243                  244
0   2020-04-22T04:00:00Z   0   0  ...          3,094,409.5         13,220,425.7          5,449,201.1
1   2020-04-22T06:00:00Z   0   0  ...          3,716,941.5          8,452,012.9          6,541,599.9
....

The floats are actually objects, but I need them to be real floats.

To fix it, referencing @AMC's comment above:

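def coerce_to_float(val):
    # Return val as a float when possible; otherwise leave it unchanged.
    try:
        return float(val)
    except (ValueError, TypeError):
        return val

# applymap calls the function on every cell.
# In pandas >= 2.1, DataFrame.map is the preferred name for applymap.
df = df.applymap(coerce_to_float)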
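
If the values are strings that contain thousands separators, as the display above suggests, float() rejects them and they would stay unchanged. A hedged sketch of one way to handle that case, assuming the cells are strings such as "3,094,409.5"; the column names "ts" and "value" are made up for illustration:

```python
import pandas as pd

# Made-up frame; the numbers are stored as strings with comma separators.
df = pd.DataFrame(
    {
        "ts": ["2020-04-22T04:00:00Z", "2020-04-22T06:00:00Z"],
        "value": ["3,094,409.5", "3,716,941.5"],
    }
)

# Strip the separators as plain text, then let pandas parse the numbers.
cleaned = df["value"].str.replace(",", "", regex=False)
df["value"] = pd.to_numeric(cleaned)
print(df.dtypes)
```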

Answer 6

Answered by cs95

pandas >= 1.0: `convert_dtypes`

The (self) accepted answer doesn't take into consideration the possibility of NaNs in object columns.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [True, False, np.nan]}, dtype=object)
df

     a      b
0    1   True
1    2  False
2  NaN    NaN

df['a'].astype(str).astype(int) # raises ValueError

This chokes because the NaN is converted to the string "nan", and the subsequent attempt to coerce it to an integer fails. To avoid this issue, we can soft-convert columns to their corresponding nullable type using convert_dtypes:

df.convert_dtypes()                                                        

      a      b
0     1   True
1     2  False
2  <NA>   <NA>

df.convert_dtypes().dtypes                                                 

a      Int64
b    boolean
dtype: object

If your data has junk text mixed in with your ints, you can use pd.to_numeric as an initial step:

s = pd.Series(['1', '2', '...'])
s.convert_dtypes()  # converts to string, which is not what we want                             

0      1
1      2
2    ...
dtype: string 

# coerces non-numeric junk to NaNs
pd.to_numeric(s, errors='coerce')               

0    1.0
1    2.0
2    NaN
dtype: float64

# one final `convert_dtypes` call to convert to nullable int
pd.to_numeric(s, errors='coerce').convert_dtypes()                                                                         

0       1
1       2
2    <NA>
dtype: Int64
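
The same two calls can be applied to a whole DataFrame. This is a sketch rather than a general recipe: the frame and its column names are made up, and it assumes every column is meant to be numeric, since errors='coerce' silently turns anything unparseable into a missing value.

```python
import pandas as pd

# Made-up frame of object columns; "n/a" plays the role of junk text.
df = pd.DataFrame(
    {"purchase": ["1", "0", "n/a"], "qty": ["3", "4", "5"]}, dtype=object
)

# Parse each column, coercing junk to NaN, then switch to nullable dtypes.
converted = df.apply(pd.to_numeric, errors="coerce").convert_dtypes()
print(converted.dtypes)
```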

Answer 7

Answered by onietosi

I cannot comment, so I am posting this as an answer. It sits somewhere between @piRSquared/@cyril's solution and @cs95's:

As noted by @cs95, if your data contains NaNs or Nones, converting to string type will throw an error when trying to convert to int afterwards.

However, if your data consists of (numerical) strings, using convert_dtypes will convert it to string type unless you use pd.to_numeric as suggested by @cs95 (potentially combined with df.apply()).

In the case that your data consists only of numerical strings (including NaNs or Nones but without any non-numeric "junk"), a possibly simpler alternative would be to convert first to float and then to one of the nullable-integer extension dtypes provided by pandas (already present in version 0.24; see also this answer):

df['purchase'].astype(float).astype('Int64')
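
A small illustration of that float-then-Int64 route, assuming a pandas version that provides the nullable 'Int64' dtype. The sample values are made up; np.nan marks the missing entry.

```python
import numpy as np
import pandas as pd

# Made-up column of numeric strings with one missing value.
purchase = pd.Series(["1", "2", np.nan], dtype=object)

# float can represent NaN, so the first cast succeeds;
# the nullable Int64 dtype then stores the gap as <NA>.
result = purchase.astype(float).astype("Int64")
print(result)
```

The second cast is expected to fail if any value has a fractional part, so this only suits columns that are whole numbers apart from the gaps.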

Note that there has been recent discussion on this on GitHub (currently an unresolved, closed issue though), and that in the case of very large 64-bit integers you may have to convert explicitly to float128 to avoid approximations during the conversions.

Answer 8

Answered by Rishabh Jain

This was my data:

# list of columns
l1 = ['PM2.5', 'PM10', 'TEMP', 'BP', ' RH', 'WS', 'CO', 'O3', 'Nox', 'SO2']

for i in l1:
    for j in range(0, 8431):  # rows = 8431
        # .at avoids chained indexing (df[i][j] = ...), which may not write back to df
        df.at[j, i] = int(df.at[j, i])

I recommend using this only with small data. The loop converts one cell at a time, so its cost grows with rows × columns.
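
For larger data, a vectorized cast is a possible alternative to the cell-by-cell loop. This is a sketch with a made-up frame (the "site" column and the values are illustrative), and it assumes every value in the listed columns is a clean integer string with nothing missing.

```python
import pandas as pd

cols = ["PM2.5", "PM10"]

# Made-up frame; the measurement columns hold digit strings.
df = pd.DataFrame(
    {"PM2.5": ["12", "30"], "PM10": ["40", "55"], "site": ["a", "b"]},
    dtype=object,
)

# Cast only the listed columns; the dict maps column name -> target dtype.
df = df.astype({c: int for c in cols})
print(df.dtypes)
```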
