Python: Create an empty DataFrame in Pandas specifying column types

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/36462257/

Date: 2020-08-19 17:55:18  Source: igfitidea

Create Empty Dataframe in Pandas specifying column types

python pandas

Asked by Vincent

I'm trying to create an empty data frame with an index and specify the column types. The way I am doing it is the following:

df = pd.DataFrame(index=['pbp'],columns=['contract',
                                         'state_and_county_code',
                                         'state',
                                         'county',
                                         'starting_membership',
                                         'starting_raw_raf',
                                         'enrollment_trend',
                                         'projected_membership',
                                         'projected_raf'],
                                dtype=['str', 'str', 'str', 'str', 'int', 'float', 'float', 'int', 'float'])

However, I get the following error:

TypeError: data type not understood

What does this mean?

Answered by ryanjdillon

You can do it like this:

import numpy
import pandas

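# Structured dtype: one (column name, type) pair per column.
dtypes = numpy.dtype([
          ('a', str),
          ('b', int),
          ('c', float),
          ('d', 'datetime64[ns]'),  # explicit unit instead of a unit-less numpy.datetime64
          ])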
data = numpy.empty(0, dtype=dtypes)
df = pandas.DataFrame(data)
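
As a quick check of the structured-array approach, here is a minimal sketch that uses a few of the column names from the question. The `as_of` column is hypothetical and only there to show a datetime field; using `object` for the text column and explicit widths for the numeric ones is a choice made for this sketch, not something the answer prescribes.

```python
import numpy as np
import pandas as pd

# One (name, type) pair per column; 'as_of' is a made-up column for illustration.
schema = np.dtype([
    ('contract', object),
    ('starting_membership', np.int64),
    ('starting_raw_raf', np.float64),
    ('as_of', 'datetime64[ns]'),
])

# A zero-length structured array carries the types but no rows.
df = pd.DataFrame(np.empty(0, dtype=schema))
print(df.dtypes)
```

The frame has no rows, and the numeric and datetime columns should report the dtypes given in the structured dtype.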

Answered by user48956

This really smells like a bug.

Here's another (simpler) solution.

import pandas as pd
import numpy as np

def df_empty(columns, dtypes, index=None):
    assert len(columns)==len(dtypes)
    df = pd.DataFrame(index=index)
    for c,d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df

df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
print(list(df.dtypes)) # int64, int64
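
One thing worth checking with this helper is what happens when an index is actually passed. The sketch below repeats the helper so that it runs on its own, then compares the two cases. The column names `n` and `x` are placeholders. My understanding is that each empty Series is aligned to the non-empty index, so the new row is filled with NaN and the integer column can no longer stay `int64`.

```python
import numpy as np
import pandas as pd

def df_empty(columns, dtypes, index=None):
    """Same helper as above, repeated so this snippet is self-contained."""
    assert len(columns) == len(dtypes)
    df = pd.DataFrame(index=index)
    for c, d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df

# No index: zero rows, requested dtypes are kept.
no_index = df_empty(['n', 'x'], [np.int64, np.float64])
print(no_index.dtypes)

# With an index: one row of missing values, so 'n' is upcast.
with_index = df_empty(['n', 'x'], [np.int64, np.float64], index=['pbp'])
print(with_index.dtypes)
```

So the helper is a good fit for a truly empty frame, but it will not give you a typed row of placeholders.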

Answered by Alberto

You can use the following:

df = pd.DataFrame({'a': pd.Series([], dtype='int'),
                   'b': pd.Series([], dtype='str'),
                   'c': pd.Series([], dtype='float')})

Then, if you display df, you get:

>>> df 
Empty DataFrame 
Columns: [a, b, c]
Index: []

And if you check its dtypes:

>>> df.dtypes
a      int32
b     object
c    float64
dtype: object
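
The same idea scales to many columns if the names and types are kept in one mapping. The following is a minimal sketch: the `schema` dict is an assumption of this example (the column names are borrowed from the question), and the explicit `'int64'` is used because a plain `'int'` may resolve to a platform-dependent width, which is presumably why `int32` appears in the output above.

```python
import pandas as pd

# Column name -> dtype. Text columns are declared as object,
# which is what 'str' ended up as in the output above.
schema = {
    'contract': object,
    'state': object,
    'starting_membership': 'int64',
    'starting_raw_raf': 'float64',
}

# One empty, typed Series per column.
df = pd.DataFrame({col: pd.Series(dtype=dt) for col, dt in schema.items()})
print(df.dtypes)
```

Keeping the schema in a dict also makes it easy to reuse the same column definitions elsewhere, for example when reading files.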

Answered by SummerEla

This is an old question, but I don't see a solid answer (although @eric_g was super close).

You just need to create a dataframe from a list of dictionaries of key:value pairs, where each key is a column name and each value is an empty (default) instance of the desired data type. Note that this produces a single row of default values rather than a zero-row frame.

So in your example dataset, it would look as follows:

# Pass the list of dicts as data; each default value fixes the column's dtype.
df = pd.DataFrame([{'contract': '',
                    'state_and_county_code': '',
                    'state': '',
                    'county': '',
                    'starting_membership': int(),
                    'starting_raw_raf': float(),
                    'enrollment_trend': float(),
                    'projected_membership': int(),
                    'projected_raf': float(),
                    'pbp': int()  # just guessing on this data type
                    }]).set_index("pbp")

Alternatively, this works in pandas 0.25 and Python 3.7:

df = pd.DataFrame({'contract': '',
                   'state_and_county_code': '',
                   'state': '',
                   'county': '',
                   'starting_membership': int(),
                   'starting_raw_raf': float(),
                   'enrollment_trend': float(),
                   'projected_membership': int(),
                   'projected_raf': float(),
                   'pbp': int()  # just guessing on this data type
                   },
                  index=[1])

Answered by ptrj

Just a remark.

You can get around the TypeError using np.dtype:

pd.DataFrame(index = ['pbp'], columns = ['a','b'], dtype = np.dtype([('str','float')]))

but you get instead:

NotImplementedError: compound dtypes are not implemented in the DataFrame constructor

Answered by JaminSore

I found this question after running into the same issue. I prefer the following solution (Python 3) for creating an empty DataFrame with no index.

import numpy as np
import pandas as pd

def make_empty_typed_df(dtype):
    # np.sctypeDict maps type names and aliases to NumPy scalar types
    # (np.typeDict was a deprecated alias, removed in recent NumPy releases).
    tdict = np.sctypeDict
    types = tuple(tdict.get(t, t) for (_, t, *__) in dtype)
    if any(t == np.void for t in types):
        raise NotImplementedError('Not Implemented for columns of type "void"')
    # Build one dummy row of default values, then slice it away so only the dtypes remain.
    return pd.DataFrame.from_records(np.array([tuple(t() for t in types)], dtype=dtype)).iloc[:0, :]

Testing this out ...

from itertools import chain

dtype = [('col%d' % i, t) for i, t in enumerate(chain(np.sctypeDict, set(np.sctypeDict.values())))]
dtype = [(c, t) for (c, t) in dtype if (np.sctypeDict.get(t, t) != np.void) and not isinstance(t, int)]

print(make_empty_typed_df(dtype))

Out:

Empty DataFrame

Columns: [col0, col6, col16, col23, col24, col25, col26, col27, col29, col30, col31, col32, col33, col34, col35, col36, col37, col38, col39, col40, col41, col42, col43, col44, col45, col46, col47, col48, col49, col50, col51, col52, col53, col54, col55, col56, col57, col58, col60, col61, col62, col63, col64, col65, col66, col67, col68, col69, col70, col71, col72, col73, col74, col75, col76, col77, col78, col79, col80, col81, col82, col83, col84, col85, col86, col87, col88, col89, col90, col91, col92, col93, col95, col96, col97, col98, col99, col100, col101, col102, col103, col104, col105, col106, col107, col108, col109, col110, col111, col112, col113, col114, col115, col117, col119, col120, col121, col122, col123, col124, ...]
Index: []

[0 rows x 146 columns]

And the datatypes ...

print(make_empty_typed_df(dtype).dtypes)

Out:

col0      timedelta64[ns]
col6               uint16
col16              uint64
col23                int8
col24     timedelta64[ns]
col25                bool
col26           complex64
col27               int64
col29             float64
col30                int8
col31             float16
col32              uint64
col33               uint8
col34              object
col35          complex128
col36               int64
col37               int16
col38               int32
col39               int32
col40             float16
col41              object
col42              uint64
col43              object
col44               int16
col45              object
col46               int64
col47               int16
col48              uint32
col49              object
col50              uint64
               ...       
col144              int32
col145               bool
col146            float64
col147     datetime64[ns]
col148             object
col149             object
col150         complex128
col151    timedelta64[ns]
col152              int32
col153              uint8
col154            float64
col156              int64
col157             uint32
col158             object
col159               int8
col160              int32
col161             uint64
col162              int16
col163             uint32
col164             object
col165     datetime64[ns]
col166            float32
col167               bool
col168            float64
col169         complex128
col170            float16
col171             object
col172             uint16
col173          complex64
col174         complex128
dtype: object

Adding an index gets tricky because most data types have no true missing value, so they end up being cast to some other type with a native missing value (e.g., ints are cast to floats or objects). However, if you have complete data of the types you've specified, you can always insert rows as needed, and your types will be respected. This can be accomplished with:

df.loc[index, :] = new_row

Again, as @Hun pointed out, this is NOT how Pandas is intended to be used.

Answered by Eric G.

You can do this by passing a dictionary into the DataFrame constructor:

df = pd.DataFrame(index=['pbp'],
                  data={'contract' : np.full(1, "", dtype=str),
                        # NaN cannot be stored as an int, so this holds an arbitrary large negative integer
                        'starting_membership' : np.full(1, np.nan, dtype=int),
                        'projected_membership' : np.full(1, np.nan, dtype=float)
                       }
                 )

This will correctly give you a dataframe that looks like:

     contract  projected_membership   starting_membership
pbp     ""             NaN           -9223372036854775808

With dtypes:

contract                 object
projected_membership    float64
starting_membership       int64

That said, there are two things to note:

1) str isn't actually a type that a DataFrame column can handle; instead it falls back to the general case, object. It'll still work properly.

2) Why don't you see NaN under starting_membership? Well, NaN is only defined for floats; there is no "None" value for integers, so it casts np.nan to an integer. If you want a different default value, you can change that in the np.full call.

Answered by Hun

pandas doesn't offer a pure integer column that can hold missing values. You can either use a float column and convert it to integer as needed, or treat it as an object column. What you are trying to implement is not the way pandas is supposed to be used. But if you REALLY REALLY want that, you can get around the TypeError message by doing this:

df1 =  pd.DataFrame(index=['pbp'], columns=['str1','str2','str2'], dtype=str)
df2 =  pd.DataFrame(index=['pbp'], columns=['int1','int2'], dtype=int)
df3 =  pd.DataFrame(index=['pbp'], columns=['flt1','flt2'], dtype=float)
df = pd.concat([df1, df2, df3], axis=1)

    str1 str2 str2 int1 int2  flt1  flt2
pbp  NaN  NaN  NaN  NaN  NaN   NaN   NaN

You can rearrange the column order as you like. But again, this is not the way pandas was supposed to be used.

 df.dtypes
str1     object
str2     object
str2     object
int1     object
int2     object
flt1    float64
flt2    float64
dtype: object

Note that int is treated as object.
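
For readers on newer pandas versions, the nullable integer extension dtype may be worth a look. It is not part of this answer, which predates it: `"Int64"` (capital I) arrived after the answer was written, and the `pd.NA` value used below needs pandas 1.0 or later. The sketch contrasts it with a plain `int64` column; the series and variable names are made up for illustration.

```python
import pandas as pd

# Plain int64: introducing a missing row via reindex upcasts to float64.
ints = pd.Series([1, 2], index=['a', 'b'], dtype='int64')
upcast = ints.reindex(['a', 'b', 'c'])
print(upcast.dtype)

# Nullable "Int64": the missing row becomes <NA> and the dtype is kept.
nullable = pd.Series([1, 2], index=['a', 'b'], dtype='Int64').reindex(['a', 'b', 'c'])
print(nullable.dtype)

# Applied to the question: one all-missing row indexed by 'pbp'.
df = pd.DataFrame({
    'starting_membership': pd.Series([pd.NA], index=['pbp'], dtype='Int64'),
    'starting_raw_raf': pd.Series([float('nan')], index=['pbp'], dtype='float64'),
})
print(df.dtypes)
```

Whether this fits depends on the rest of your pipeline, since extension dtypes do not behave identically to NumPy dtypes in every operation.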

Answered by jdehesa

I found the easiest workaround for me was to simply concatenate a list of empty series for each individual column:

import pandas as pd

columns = ['contract',
           'state_and_county_code',
           'state',
           'county',
           'starting_membership',
           'starting_raw_raf',
           'enrollment_trend',
           'projected_membership',
           'projected_raf']
dtype = ['str', 'str', 'str', 'str', 'int', 'float', 'float', 'int', 'float']
df = pd.concat([pd.Series(name=col, dtype=dt) for col, dt in zip(columns, dtype)], axis=1)
df.info()
# <class 'pandas.core.frame.DataFrame'>
# Index: 0 entries
# Data columns (total 9 columns):
# contract                 0 non-null object
# state_and_county_code    0 non-null object
# state                    0 non-null object
# county                   0 non-null object
# starting_membership      0 non-null int32
# starting_raw_raf         0 non-null float64
# enrollment_trend         0 non-null float64
# projected_membership     0 non-null int32
# projected_raf            0 non-null float64
# dtypes: float64(3), int32(2), object(4)
# memory usage: 0.0+ bytes

Answered by Korhan

My solution (without setting an index) is to initialize a dataframe with column names and specify the data types using the astype() method.

df = pd.DataFrame(columns=['contract',
                     'state_and_county_code',
                     'state',
                     'county',
                     'starting_membership',
                     'starting_raw_raf',
                     'enrollment_trend',
                     'projected_membership',
                     'projected_raf'])
df = df.astype( dtype={'contract' : str, 
                 'state_and_county_code': str,
                 'state': str,
                 'county': str,
                 'starting_membership': int,
                 'starting_raw_raf': float,
                 'enrollment_trend': float,
                 'projected_membership': int,
                 'projected_raf': float})
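
Since the column list and the dtype mapping repeat the same names, the two steps can be driven from one dict. This is a minimal sketch of that variation, not a different technique: the `schema` name and the explicit `'int64'`/`'float64'` widths are choices made for this example, and only a few of the question's columns are shown.

```python
import pandas as pd

# Single source of truth: column name -> dtype.
schema = {
    'contract': str,
    'starting_membership': 'int64',
    'starting_raw_raf': 'float64',
}

# Create the (object-typed) empty columns, then cast them in one astype() call.
df = pd.DataFrame(columns=list(schema)).astype(schema)
print(df.dtypes)
```

Because a dict preserves insertion order, the column order of the frame follows the order of the keys.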