Python: change the data type of columns in Pandas

Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/15891038/

Time: 2020-08-18 21:18:41  Source: igfitidea

Change data type of columns in Pandas

python pandas dataframe types casting

Question

I want to convert a table, represented as a list of lists, into a Pandas DataFrame. As an extremely simplified example:

import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

What is the best way to convert the columns to the appropriate types, in this case columns 2 and 3 into floats? Is there a way to specify the types while converting to DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the type for each column? Ideally I would like to do this in a dynamic way because there can be hundreds of columns and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.

Accepted answer by Alex Riley

You have three main options for converting types in pandas:

  1. to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)

  2. astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).

  3. infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.

Read on for more detailed explanations and usage of each of these methods.



1. to_numeric()

The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().

This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.

Basic usage

The input to to_numeric() is a Series or a single column of a DataFrame.

>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

>>> pd.to_numeric(s) # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64

As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:

# convert Series
my_series = pd.to_numeric(my_series)

# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])

You can also use it to convert multiple columns of a DataFrame via the apply() method:

# convert all columns of DataFrame
df = df.apply(pd.to_numeric)

# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)

As long as your values can all be converted, that's probably all you need.
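
Applied to the table from the question, this might look like the sketch below. The column labels 1 and 2 are the default integer labels pandas assigns when no names are given; the variable names are only illustrative.

```python
import pandas as pd

# The list of lists from the question: every value is a string.
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

# Without explicit names the columns are labelled 0, 1 and 2.
# Convert the two numeric-looking columns and leave column 0 as text.
df[[1, 2]] = df[[1, 2]].apply(pd.to_numeric)

print(df.dtypes)
```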

Error handling

But what if some values can't be converted to a numeric type?

to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.

Here's an example using a Series of strings s which has the object dtype:

>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object

The default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string 'pandas':

>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string

Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:

>>> pd.to_numeric(s, errors='coerce')
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64
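
Coercion silently replaces bad entries, so it can be worth checking which ones were affected. One possible way (a sketch; the names converted and failed are illustrative) is to compare the missing-value masks before and after conversion:

```python
import pandas as pd

s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
converted = pd.to_numeric(s, errors='coerce')

# Entries that were present in the input but are NaN after conversion
# are the ones to_numeric could not parse.
failed = s[converted.isna() & s.notna()]

print(failed.tolist())
```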

The third option for errors is just to ignore the operation if an invalid value is encountered:

>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched

This last option is particularly useful when you want to convert your entire DataFrame, but don't know which of your columns can be converted reliably to a numeric type. In that case just write:

df.apply(pd.to_numeric, errors='ignore')

The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
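
Recent pandas releases (2.2 and later, to my knowledge) deprecate errors='ignore' in to_numeric() and recommend catching the exception explicitly instead. A minimal sketch of that approach follows; the helper name to_numeric_if_possible and the column names are made up for this example and are not part of pandas.

```python
import pandas as pd

def to_numeric_if_possible(column):
    """Return the column converted to a numeric dtype, or unchanged if that fails."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        # At least one value could not be parsed: keep the column as it was.
        return column

df = pd.DataFrame({'name': ['a', 'b', 'x'],
                   'two': ['1.2', '70', '5'],
                   'three': ['4.2', '0.03', '0']})

# apply() calls the helper once per column.
df = df.apply(to_numeric_if_possible)
print(df.dtypes)
```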

Downcasting

By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).

That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32, or int8?

to_numeric() gives you the option to downcast to either 'integer', 'signed', 'unsigned' or 'float'. Here's an example for a simple series s of integer type:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

Downcasting to 'integer' uses the smallest possible integer that can hold the values:

>>> pd.to_numeric(s, downcast='integer')
0    1
1    2
2   -7
dtype: int8

Downcasting to 'float' similarly picks a smaller than normal floating type:

>>> pd.to_numeric(s, downcast='float')
0    1.0
1    2.0
2   -7.0
dtype: float32
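
To see what downcasting saves, one can compare the size of the underlying data before and after. This sketch uses the Series.nbytes attribute and an explicitly int64 Series so that the numbers do not depend on the platform; the name small is illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, -7], dtype='int64')
small = pd.to_numeric(s, downcast='integer')

# nbytes reports the size of the values only (the index is not included).
print(s.dtype, s.nbytes)          # 8 bytes per value
print(small.dtype, small.nbytes)  # 1 byte per value
```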


2. astype()

The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try and go from one type to any other.

Basic usage

Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and astype() will try and convert it for you:

# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype('category')

Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example, if you have a NaN or inf value you'll get an error trying to convert it to an integer.

As of pandas 0.20.0, this error can be suppressed by passing errors='ignore'. Your original object will be returned untouched.
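
The NaN case can be reproduced as below. The nullable 'Int64' extension dtype (available in pandas since 0.24, as far as I know) is shown as one possible alternative that keeps the missing value; it is an addition for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

try:
    s.astype(int)
    raised = False
except ValueError as exc:
    # NaN has no integer representation, so the cast is refused.
    raised = True
    print(exc)

# The nullable integer dtype (capital "I") stores the gap as <NA> instead.
nullable = s.astype('Int64')
print(nullable)
```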

Be careful

astype() is powerful, but it will sometimes convert values "incorrectly". For example:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

These are small integers, so how about converting to an unsigned 8-bit type to save memory?

>>> s.astype(np.uint8)
0      1
1      2
2    249
dtype: uint8

The conversion worked, but the -7 was wrapped round to become 249 (i.e. 2^8 - 7)!

Trying to downcast using pd.to_numeric(s, downcast='unsigned') instead could help prevent this error.
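
A quick check of that behaviour: when a value does not fit the requested type, to_numeric() leaves the dtype alone rather than wrapping the value. The variable names here are illustrative.

```python
import pandas as pd

with_negative = pd.Series([1, 2, -7])
non_negative = pd.Series([1, 2, 7])

# -7 cannot be represented as an unsigned integer, so no downcast happens
# and the values are preserved.
kept = pd.to_numeric(with_negative, downcast='unsigned')

# All values fit into 8 unsigned bits, so the smallest unsigned type is used.
shrunk = pd.to_numeric(non_negative, downcast='unsigned')

print(kept.dtype, kept.tolist())
print(shrunk.dtype, shrunk.tolist())
```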



3. infer_objects()

Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).

For example, here's a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:

>>> df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3','2','1']}, dtype='object')
>>> df.dtypes
a    object
b    object
dtype: object

Using infer_objects(), you can change the type of column 'a' to int64:

>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object

Column 'b' has been left alone since its values were strings, not integers. If you wanted to try and force the conversion of both columns to an integer type, you could use df.astype(int) instead.
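
For the DataFrame above, the two approaches can be compared side by side as in this sketch. The forced conversion works here only because every string in column 'b' is a valid integer literal; the names soft and hard are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3', '2', '1']}, dtype='object')

# infer_objects() only recognises values that are already numbers ...
soft = df.infer_objects()

# ... whereas astype(int) also parses the strings in column 'b'.
hard = df.astype(int)

print(soft.dtypes)
print(hard.dtypes)
```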

Answer by hernamesbarbara

How about this?

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df
Out[16]: 
  one  two three
0   a  1.2   4.2
1   b   70  0.03
2   x    5     0

df.dtypes
Out[17]: 
one      object
two      object
three    object

df[['two', 'three']] = df[['two', 'three']].astype(float)

df.dtypes
Out[19]: 
one       object
two      float64
three    float64

Answer by Harry Stevens

Here is a function that takes as its arguments a DataFrame and a list of columns and coerces all data in the columns to numbers.

# df is the DataFrame, and column_list is a list of columns as strings (e.g ["col1","col2","col3"])
# dependencies: pandas

def coerce_df_columns_to_numeric(df, column_list):
    df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')

So, for your example:

import pandas as pd

def coerce_df_columns_to_numeric(df, column_list):
    df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col1','col2','col3'])

coerce_df_columns_to_numeric(df, ['col2','col3'])
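
Because of errors='coerce', any value that cannot be parsed becomes NaN rather than raising. A small check of that behaviour, reusing the function above with made-up data containing one unparseable entry ('n/a'):

```python
import pandas as pd

def coerce_df_columns_to_numeric(df, column_list):
    # Modifies df in place; unparseable values are replaced by NaN.
    df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')

df = pd.DataFrame([['a', '1.2', '4.2'], ['b', 'n/a', '0.03'], ['x', '5', '0']],
                  columns=['col1', 'col2', 'col3'])

coerce_df_columns_to_numeric(df, ['col2', 'col3'])

print(df.dtypes)
print(df['col2'].isna().sum())  # number of values that failed to convert
```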

Answer by MikeyE

How about creating two dataframes, each with different data types for their columns, and then appending them together?

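# DataFrame.append was removed in pandas 2.0; pd.concat along the columns combines the two frames
d1 = pd.DataFrame(columns=[ 'float_column' ], dtype=float)
d1 = pd.concat([d1, pd.DataFrame(columns=[ 'string_column' ], dtype=str)], axis=1)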

Results

In [8]: d1.dtypes
Out[8]: 
float_column     float64
string_column     object
dtype: object

After the dataframe is created, you can populate it with floating point variables in the 1st column, and strings (or any data type you desire) in the 2nd column.

Answer by Akash Nayak

The code below will change the data type of the listed columns.

df[['col.name1', 'col.name2', ...]] = df[['col.name1', 'col.name2', ...]].astype('data_type')

In place of 'data_type' you can give whichever type you want, such as str, float or int.

Answer by Thom Ives

When I've only needed to specify specific columns, and I want to be explicit, I've used (per DOCS LOCATION):

dataframe = dataframe.astype({'col_name_1':'int','col_name_2':'float64', etc. ...})

So, using the original question, but providing column names to it ...

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col_name_1', 'col_name_2', 'col_name_3'])
df = df.astype({'col_name_2':'float64', 'col_name_3':'float64'})

Answer by SarahD

I thought I had the same problem but actually I have a slight difference that makes the problem easier to solve. For others looking at this question it's worth checking the format of your input list. In my case the numbers are initially floats not strings as in the question:

a = [['a', 1.2, 4.2], ['b', 70, 0.03], ['x', 5, 0]]

but by processing the list too much before creating the dataframe I lose the types and everything becomes a string.

Creating the data frame via a numpy array

df = pd.DataFrame(np.array(a))

df
Out[5]: 
   0    1     2
0  a  1.2   4.2
1  b   70  0.03
2  x    5     0

df[1].dtype
Out[7]: dtype('O')

gives the same data frame as in the question, where the entries in columns 1 and 2 are considered as strings. However doing

df = pd.DataFrame(a)

df
Out[10]: 
   0     1     2
0  a   1.2  4.20
1  b  70.0  0.03
2  x   5.0  0.00

df[1].dtype
Out[11]: dtype('float64')

does actually give a data frame with the columns in the correct format
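
The likely reason is that a NumPy array holds a single dtype: when strings and numbers are mixed, np.array() falls back to a string dtype for every element, so the numeric information is lost before pandas sees it. A small sketch to check this (the name arr is illustrative):

```python
import numpy as np
import pandas as pd

a = [['a', 1.2, 4.2], ['b', 70, 0.03], ['x', 5, 0]]

arr = np.array(a)
# 'U' is NumPy's code for fixed-width unicode strings.
print(arr.dtype.kind)

# Passing the list directly lets pandas infer a dtype per column.
df = pd.DataFrame(a)
print(df.dtypes)
```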

Answer by cs95

pandas >= 1.0

Here's a chart that summarises some of the most important conversions in pandas.

[Figure: chart of common pandas type conversions; image not available]

Conversions to string are trivial (.astype(str)) and are not shown in the figure.

"Hard" versus "Soft" conversions

Note that "conversions" in this context could either refer to converting text data into their actual data type (hard conversion), or inferring more appropriate data types for data in object columns (soft conversion). To illustrate the difference, take a look at

df = pd.DataFrame({'a': ['1', '2', '3'], 'b': [4, 5, 6]}, dtype=object)
df.dtypes                                                                  

a    object
b    object
dtype: object

# Actually converts string to numeric - hard conversion
df.apply(pd.to_numeric).dtypes                                             

a    int64
b    int64
dtype: object

# Infers better data types for object data - soft conversion
df.infer_objects().dtypes                                                  

a    object  # no change
b     int64
dtype: object

# Same as infer_objects, but converts to equivalent ExtensionType
df.convert_dtypes().dtypes                                                     
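
The last call above is shown without its output. A sketch that inspects the result explicitly (assuming pandas 1.0 or later, where convert_dtypes() exists; the name converted is illustrative) could be:

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', '2', '3'], 'b': [4, 5, 6]}, dtype=object)

converted = df.convert_dtypes()

# Column 'a' holds text, so it becomes a string extension dtype; it is not
# parsed into numbers. Column 'b' becomes the nullable integer dtype Int64.
print(converted.dtypes)
```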

Answer by Sohail

Starting with pandas 1.0.0, we have pandas.DataFrame.convert_dtypes. You can even control what types to convert!

In [40]: df = pd.DataFrame(
    ...:     {
    ...:         "a": pd.Series([1, 2, 3], dtype=np.dtype("int32")),
    ...:         "b": pd.Series(["x", "y", "z"], dtype=np.dtype("O")),
    ...:         "c": pd.Series([True, False, np.nan], dtype=np.dtype("O")),
    ...:         "d": pd.Series(["h", "i", np.nan], dtype=np.dtype("O")),
    ...:         "e": pd.Series([10, np.nan, 20], dtype=np.dtype("float")),
    ...:         "f": pd.Series([np.nan, 100.5, 200], dtype=np.dtype("float")),
    ...:     }
    ...: )

In [41]: dff = df.copy()

In [42]: df 
Out[42]: 
   a  b      c    d     e      f
0  1  x   True    h  10.0    NaN
1  2  y  False    i   NaN  100.5
2  3  z    NaN  NaN  20.0  200.0

In [43]: df.dtypes
Out[43]: 
a      int32
b     object
c     object
d     object
e    float64
f    float64
dtype: object

In [44]: df = df.convert_dtypes()

In [45]: df.dtypes
Out[45]: 
a      Int32
b     string
c    boolean
d     string
e      Int64
f    float64
dtype: object

In [46]: dff = dff.convert_dtypes(convert_boolean = False)

In [47]: dff.dtypes
Out[47]: 
a      Int32
b     string
c     object
d     string
e      Int64
f    float64
dtype: object