Python: Creating a Pandas DataFrame with a numpy array containing multiple types

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/21647054/

Date: 2020-08-18 23:20:25  Source: igfitidea

Creating a Pandas DataFrame with a numpy array containing multiple types

python numpy pandas

Asked by bfcondon

I want to create a pandas DataFrame with default values of zero, with one column of integers and the others of floats. I am able to create a numpy array with the correct types (see the values variable below). However, when I pass that into the DataFrame constructor, it returns only NaN values (see df below). I have included the untyped code, which returns an array of floats (see df2).

import pandas as pd
import numpy as np

values = np.zeros((2,3), dtype='int32,float32')
index = ['x', 'y']
columns = ['a','b','c']

df = pd.DataFrame(data=values, index=index, columns=columns)
df.values.dtype

values2 = np.zeros((2,3))
df2 = pd.DataFrame(data=values2, index=index, columns=columns)
df2.values.dtype

Any suggestions on how to construct the dataframe?

Accepted answer by unutbu

Here are a few options you could choose from:

import numpy as np
import pandas as pd

index = ['x', 'y']
columns = ['a','b','c']

# Option 1: Set the column names in the structured array's dtype 
dtype = [('a','int32'), ('b','float32'), ('c','float32')]
values = np.zeros(2, dtype=dtype)
df = pd.DataFrame(values, index=index)

# Option 2: Alter the structured array's column names after it has been created
values = np.zeros(2, dtype='int32, float32, float32')
values.dtype.names = columns
df2 = pd.DataFrame(values, index=index, columns=columns)

# Option 3: Alter the DataFrame's column names after it has been created
values = np.zeros(2, dtype='int32, float32, float32')
df3 = pd.DataFrame(values, index=index)
df3.columns = columns

# Option 4: Use a dict of arrays, each of the right dtype:
df4 = pd.DataFrame(
    {'a': np.zeros(2, dtype='int32'),
     'b': np.zeros(2, dtype='float32'),
     'c': np.zeros(2, dtype='float32')}, index=index, columns=columns)

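# Option 5: Concatenate DataFrames of the simple dtypes
# (index=index is passed to each piece so the result is labeled x, y like the other options):
df5 = pd.concat([
    pd.DataFrame(np.zeros((2,), dtype='int32'), index=index, columns=['a']),
    pd.DataFrame(np.zeros((2,2), dtype='float32'), index=index, columns=['b','c'])], axis=1)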

# Option 6: Alter the dtypes after the DataFrame has been formed. (This is not very efficient)
values2 = np.zeros((2, 3))
df6 = pd.DataFrame(values2, index=index, columns=columns)
for col, dtype in zip(df6.columns, 'int32 float32 float32'.split()):
    df6[col] = df6[col].astype(dtype)

Each of the options above produces the same result:

   a  b  c
x  0  0  0
y  0  0  0

with dtypes:

a      int32
b    float32
c    float32
dtype: object
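
As a possible alternative to the loop in Option 6, DataFrame.astype also accepts a dict that maps column names to dtypes, so the conversion can be written as a single call. The sketch below is not part of the original answer; the name df7 is only illustrative.

```python
import numpy as np
import pandas as pd

index = ['x', 'y']
columns = ['a', 'b', 'c']

# Start from a plain float64 array, as in Option 6.
df7 = pd.DataFrame(np.zeros((2, 3)), index=index, columns=columns)

# Convert all columns in one call by mapping column name -> dtype.
df7 = df7.astype({'a': 'int32', 'b': 'float32', 'c': 'float32'})

print(df7.dtypes)
```

Like Option 6, this still builds a float64 DataFrame first and converts afterwards, so it is shorter to write rather than fundamentally cheaper.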


Why pd.DataFrame(values, index=index, columns=columns) produces a DataFrame with NaNs:

When no field names are given in the dtype, values is a structured array with the default column names f0, f1, f2:

In [171]: values
Out[171]: 
array([(0, 0.0, 0.0), (0, 0.0, 0.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<f4')])

If you pass the argument columns=['a', 'b', 'c'] to pd.DataFrame, then Pandas will look for columns with those names in the structured array values. When those columns are not found, Pandas places NaNs in the DataFrame to represent missing values.
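
To make the name lookup concrete, here is a minimal sketch, not taken from the original answer, that inspects the default field names and then requests exactly those names as columns. The variable names field_names and df_ok are illustrative only.

```python
import numpy as np
import pandas as pd

index = ['x', 'y']

# Without explicit field names, NumPy names the fields f0, f1, f2.
values = np.zeros(2, dtype='int32, float32, float32')
field_names = values.dtype.names
print(field_names)

# Requesting columns that match the field names keeps the data and the dtypes.
df_ok = pd.DataFrame(values, index=index, columns=list(field_names))

# Renaming afterwards gives the desired labels without losing the data.
df_ok.columns = ['a', 'b', 'c']
print(df_ok.dtypes)
```

Because the requested columns exist in the structured array, nothing is treated as missing; relabeling after construction is the same idea as Option 3.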
