Python Pandas, create empty DataFrame specifying column dtypes
Original question: http://stackoverflow.com/questions/38523965/
Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and cite the original address.
Asked by Ray
There is one thing that I find myself having to do quite often, and it surprises me how difficult it is to achieve in Pandas. Suppose I need to create an empty DataFrame with a specified index type and name, and specified column types and names. (I might want to fill it later, in a loop for example.) The easiest way I have found is to create an empty pandas.Series object for each column, specifying their dtypes, put them into a dictionary which specifies their names, and pass the dictionary into the DataFrame constructor. Something like the following.
import pandas

def create_empty_dataframe():
    index = pandas.Index([], name="id", dtype=int)
    column_names = ["name", "score", "height", "weight"]
    series = [pandas.Series(dtype=str), pandas.Series(dtype=int), pandas.Series(dtype=float), pandas.Series(dtype=float)]
    columns = dict(zip(column_names, series))
    # The columns=column_names is required because the dictionary will in general put the columns in arbitrary order.
    return pandas.DataFrame(columns, index=index, columns=column_names)
First question. Is the above really the simplest way of doing this? There are so many things that are convoluted about this. What I really want to do, and what I'm pretty sure a lot of people really want to do, is something like the following.
df = pandas.DataFrame(columns=["id", "name", "score", "height", "weight"], dtypes=[int, str, int, float, float], index_column="id")
Second question. Is this sort of syntax at all possible in Pandas? If not, are the devs considering supporting something like this at all? It feels to me that it really ought to be as simple as this (the above syntax).
Answered by EdChum
Unfortunately the DataFrame constructor accepts only a single dtype descriptor. However, you can cheat a little by using read_csv:
In [143]:
import pandas as pd
import io
cols=["id", "name", "score", "height", "weight"]
df = pd.read_csv(io.StringIO(""), names=cols, dtype=dict(zip(cols,[int, str, int, float, float])), index_col=['id'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 4 columns):
name 0 non-null object
score 0 non-null int32
height 0 non-null float64
weight 0 non-null float64
dtypes: float64(2), int32(1), object(1)
memory usage: 0.0+ bytes
So you can see that the dtypes are as desired and that the index is set as desired:
In [145]:
df.index
Out[145]:
Int64Index([], dtype='int64', name='id')
Answered by Roland Bischof
You can also set the dtype of a DataFrame column by replacing the column with a converted copy:
df['column_name'] = df['column_name'].astype(float)
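The same idea can be stretched to a whole empty frame, because astype also accepts a dict that maps column names to dtypes. The following is a minimal sketch rather than part of the original answer: the helper name empty_frame is hypothetical, and the column names are borrowed from the question.

```python
import pandas as pd

def empty_frame(dtypes, index_col=None):
    """Build an empty DataFrame from a {column name: dtype} mapping (hypothetical helper)."""
    # Start with empty untyped columns, then cast them all in one astype call.
    df = pd.DataFrame(columns=list(dtypes)).astype(dtypes)
    if index_col is not None:
        # Move the chosen column into the index; its dtype and name carry over.
        df = df.set_index(index_col)
    return df

df = empty_frame(
    {"id": int, "name": str, "score": int, "height": float, "weight": float},
    index_col="id",
)
print(df.dtypes)
print(df.index)
```

This comes close to the one-call syntax asked for in the question, at the cost of a small wrapper function.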
Answered by Elliot
You can simplify things a little by using a dictionary comprehension:
import pandas

def create_empty_dataframe():
    index = pandas.Index([], name="id", dtype=int)
    # specify column name and data type
    columns = [('name', str),
               ('score', int),
               ('height', float),
               ('weight', float)]
    # create the dataframe from a dict, attaching the typed, named index
    return pandas.DataFrame({k: pandas.Series(dtype=t) for k, t in columns}, index=index)
This isn't drastically different in effect to what you've already done, but it should be easier to make an arbitrary dataframe without having to modify multiple locations in the code.
Answered by Justin Malinchak
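import pandas as pd
# Build a one-row template so pandas can pick up the column names.
df = pd.DataFrame([{'col00': int(0), 'col01': float(0), 'col02': str('xx')}])
# Rebuild an empty frame from the column names only (data=[], index=None); the template's dtypes are not carried over.
df = pd.DataFrame([], None, df.columns)
print(df)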
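One caveat with rebuilding a frame from df.columns alone is that only the column names survive, so the dtypes of the template row are lost. A possible variation, sketched here and not taken from the answer, is to slice zero rows off the template instead. The variable names template and empty are made up for illustration.

```python
import pandas as pd

# One template row whose values carry the desired dtypes.
template = pd.DataFrame([{"col00": 0, "col01": 0.0, "col02": "xx"}])

# Selecting zero rows keeps both the column names and their dtypes.
empty = template.iloc[:0]

print(empty.dtypes)
print(len(empty))
```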
Answered by James Hirschorn
Here is a generic function based on @Elliot's answer:
import pandas as pd

def create_empty_DataFrame(columns, index_col):
    # look up the dtype declared for the index column
    index_type = next((t for name, t in columns if name == index_col))
    df = pd.DataFrame({name: pd.Series(dtype=t) for name, t in columns if name != index_col},
                      index=pd.Index([], dtype=index_type, name=index_col))
    cols = [name for name, _ in columns]
    cols.remove(index_col)
    return df[cols]
Note that return df[cols] rather than return df is necessary to preserve the order of the non-index columns. Some test code:
columns = [
    ('id', str),
    ('primary', bool),
    ('side', str),
    ('quantity', int),
    ('price', float)]
table = create_empty_DataFrame(columns, 'id')
Check the dtypes and index:
table.info()
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 4 columns):
primary 0 non-null bool
side 0 non-null object
quantity 0 non-null int64
price 0 non-null float64
dtypes: bool(1), float64(1), int64(1), object(1)
memory usage: 0.0+ bytes
table.index
Index([], dtype='object', name='id')
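The same check can also be done programmatically instead of reading the info() output. This is only a sketch: it restates the function in slightly condensed form so the block runs on its own, it assumes the index name is passed through as name=index_col, and it uses the helpers in pandas.api.types so the check does not depend on the platform's integer width.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_float_dtype, is_integer_dtype

def create_empty_DataFrame(columns, index_col):
    # Condensed restatement of the function above.
    index_type = next(t for name, t in columns if name == index_col)
    data = {name: pd.Series(dtype=t) for name, t in columns if name != index_col}
    df = pd.DataFrame(data, index=pd.Index([], dtype=index_type, name=index_col))
    # Reselect the columns to keep the declared order of the non-index columns.
    return df[[name for name, _ in columns if name != index_col]]

columns = [("id", str), ("primary", bool), ("side", str),
           ("quantity", int), ("price", float)]
table = create_empty_DataFrame(columns, "id")

# Each check returns True when the column has the requested kind of dtype.
print(is_bool_dtype(table["primary"]))
print(is_integer_dtype(table["quantity"]))
print(is_float_dtype(table["price"]))
print(table.index.name)
```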