Python Pandas: creating an empty DataFrame with specified column dtypes

Question

Asked by Ray

There is one thing that I find myself having to do quite often, and it surprises me how difficult it is to achieve in Pandas. Suppose I need to create an empty DataFrame with a specified index type and name, and specified column types and names. (I might want to fill it later, in a loop for example.) The easiest way I have found is to create an empty pandas.Series object for each column, specifying its dtype, put the Series into a dictionary that specifies their names, and pass the dictionary to the DataFrame constructor. Something like the following.

import pandas

def create_empty_dataframe():
    index = pandas.Index([], name="id", dtype=int)
    column_names = ["name", "score", "height", "weight"]
    series = [pandas.Series(dtype=str), pandas.Series(dtype=int), pandas.Series(dtype=float), pandas.Series(dtype=float)]
    columns = dict(zip(column_names, series))
    # The columns=column_names is required because the dictionary will in general put the columns in arbitrary order.
    return pandas.DataFrame(columns, index=index, columns=column_names)

First question. Is the above really the simplest way of doing this? There are so many things that are convoluted about this. What I really want to do, and what I'm pretty sure a lot of people really want to do, is something like the following.

df = pandas.DataFrame(columns=["id", "name", "score", "height", "weight"], dtypes=[int, str, int, float, float], index_column="id")

Second question. Is this sort of syntax at all possible in Pandas? If not, are the devs considering supporting something like this at all? It feels to me that it really ought to be as simple as this (the above syntax).

Answer 1

Answered by EdChum

Unfortunately, the DataFrame constructor accepts only a single dtype descriptor; however, you can cheat a little by using read_csv:

In [143]:
import pandas as pd
import io
cols=["id", "name", "score", "height", "weight"]
df = pd.read_csv(io.StringIO(""), names=cols, dtype=dict(zip(cols,[int, str, int, float, float])), index_col=['id']) 
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 4 columns):
name      0 non-null object
score     0 non-null int32
height    0 non-null float64
weight    0 non-null float64
dtypes: float64(2), int32(1), object(1)
memory usage: 0.0+ bytes

So you can see that the dtypes are as desired and that the index is set as desired:

In [145]:

df.index
Out[145]:
Int64Index([], dtype='int64', name='id')
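
Another way around the single-dtype limitation, offered as a minimal sketch rather than an official recipe: build the frame from an empty NumPy structured array, whose per-field dtypes pandas carries over to the columns. The explicit "int64" and "float64" names, and the use of object for the string column, are assumptions chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Per-column schema; the field names become column names.
schema = [
    ("id", "int64"),
    ("name", object),      # request the generic object dtype for strings
    ("score", "int64"),
    ("height", "float64"),
    ("weight", "float64"),
]

# A zero-length structured array carries one dtype per field.
empty_records = np.empty(0, dtype=schema)

# Build the frame, then move "id" from the columns into the index.
df = pd.DataFrame(empty_records).set_index("id")

print(df.dtypes)
print(df.index)
```

Keeping the schema as a single list of (name, dtype) pairs means a column can be added or retyped in one place.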

Answer 2

Answered by Roland Bischof

You can also set the dtype of a DataFrame column by replacing the column with a converted copy:

df['column_name'] = df['column_name'].astype(float)
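
The same idea extends to several columns at once, since astype also accepts a mapping from column name to dtype. A minimal sketch, assuming the column names from the question and explicit "int64" and "float64" dtypes picked for illustration:

```python
import pandas as pd

cols = ["id", "name", "score", "height", "weight"]
dtypes = {"id": "int64", "score": "int64", "height": "float64", "weight": "float64"}

# An empty frame built from column names alone starts with object columns.
df = pd.DataFrame(columns=cols)

# Convert the listed columns in one call, then promote "id" to the index.
# "name" is not in the mapping, so it is left unchanged.
df = df.astype(dtypes).set_index("id")

print(df.dtypes)
```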

Answer 3

Answered by Elliot

You can simplify things a little by using a dictionary comprehension:

import pandas

def create_empty_dataframe():
    index = pandas.Index([], name="id", dtype=int)
    # specify column name and data type
    columns = [('name', str),
               ('score', int),
               ('height', float),
               ('weight', float)]
    # create the dataframe from a dict, attaching the named index
    return pandas.DataFrame({k: pandas.Series(dtype=t) for k, t in columns}, index=index)

This isn't drastically different in effect to what you've already done, but it should be easier to make an arbitrary dataframe without having to modify multiple locations in the code.

Answer 4

Answered by Justin Malinchak

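import pandas as pd
# Build a one-row template whose values imply the column dtypes
df = pd.DataFrame([{'col00':int(0),'col01':float(0),'col02':str('xx')}])
# Note: only the column labels are passed on here, so the empty frame's columns have dtype object
df = pd.DataFrame([], None, df.columns)
print(df)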
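
Passing only df.columns to the constructor hands over the labels but not the dtypes. A minimal sketch of a variation that keeps them, assuming it is acceptable to slice the one-row template rather than rebuild the frame (the names template and empty are illustrative):

```python
import pandas as pd

# One-row template: the sample values determine each column's dtype.
template = pd.DataFrame([{"col00": 0, "col01": 0.0, "col02": "xx"}])

# Slicing zero rows keeps the columns and their dtypes; copy() detaches
# the result from the template.
empty = template.iloc[:0].copy()

print(empty.dtypes)
```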

Answer 5

Answered by James Hirschorn

Here is a generic function based on @Elliot's answer:

import pandas as pd


def create_empty_DataFrame(columns, index_col):
    index_type = next((t for name, t in columns if name == index_col))
    df = pd.DataFrame({name: pd.Series(dtype=t) for name, t in columns if name != index_col},
                      index=pd.Index([], dtype=index_type, name=index_col))
    cols = [name for name, _ in columns]
    cols.remove(index_col)
    return df[cols]

Note that return df[cols] rather than return df is necessary to preserve the order of the non-index columns. Some test code:

columns = [
    ('id', str),
    ('primary', bool),
    ('side', str),
    ('quantity', int),
    ('price', float)]

table = create_empty_DataFrame(columns, 'id')

Check the dtypes and index:

table.info()

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 4 columns):
primary     0 non-null bool
side        0 non-null object
quantity    0 non-null int64
price       0 non-null float64
dtypes: bool(1), float64(1), int64(1), object(1)
memory usage: 0.0+ bytes

table.index

Index([], dtype='object', name='id')
