Python: Determining Pandas Column DataType
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, citing the original URL and attributing it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/41262370/
Determining Pandas Column DataType
Asked by code base 5000
Sometimes when data is imported into a Pandas DataFrame, it all comes in as type object. This is fine for most operations, but I am trying to create a custom export function, and my question is this:
- Is there a way to force Pandas to infer the data types of the input data?
- If not, is there a way to somehow infer the data types after the data is loaded?
I know I can tell Pandas that a column is of type int, str, etc., but I don't want to do that; I was hoping Pandas could be smart enough to know all the data types when a user imports or adds a column.
EDIT - example of import
import pandas as pd

a = ['a']
col = ['somename']
df = pd.DataFrame(a, columns=col)
print(df.dtypes)
somename    object
dtype: object
Shouldn't the type be string?
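Pandas (in the classic versions this question is about) stores Python strings under its catch-all object dtype, which is why the column above reports object even though every element is a str. A minimal sketch confirming this:

import pandas as pd

df = pd.DataFrame(['a'], columns=['somename'])
print(df['somename'].dtype)          # object -- the column-level dtype
print(type(df['somename'].iloc[0]))  # <class 'str'> -- the element itself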
Answered by lmo
This is only a partial answer, but you can get frequency counts of the data types of the elements in each column over the entire DataFrame as follows:
dtypeCount = [df.iloc[:, i].apply(type).value_counts() for i in range(df.shape[1])]
This returns
dtypeCount
[<class 'numpy.int32'> 4
Name: a, dtype: int64,
<class 'int'> 2
<class 'str'> 2
Name: b, dtype: int64,
<class 'numpy.int32'> 4
Name: c, dtype: int64]
It doesn't print this nicely, but you can pull out information for any variable by location:
dtypeCount[1]
<class 'int'> 2
<class 'str'> 2
Name: b, dtype: int64
which should get you started in finding what data types are causing the issue and how many of them there are.
You can then inspect the rows that have a str object in the second variable using
df[df.iloc[:,1].map(lambda x: type(x) == str)]
a b c
1 1 n 4
3 3 g 6
data
df = pd.DataFrame({'a': range(4),
                   'b': [6, 'n', 7, 'g'],
                   'c': range(3, 7)})
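Once the mixed column is identified, one possible follow-up (a sketch, not part of the original answer; it continues from the df defined above) is pd.to_numeric with errors='coerce', which turns the unparseable strings into NaN:

# A sketch: coerce the mixed column; values that cannot be parsed become NaN.
df['b_num'] = pd.to_numeric(df['b'], errors='coerce')
print(df['b_num'])
# 0    6.0
# 1    NaN
# 2    7.0
# 3    NaN
# Name: b_num, dtype: float64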
Answered by shahar_m
You can also infer the dtypes after dropping irrelevant items by using infer_objects(). Below is a general example.
import pandas as pd

df_orig = pd.DataFrame({"A": ["a", 1, 2, 3], "B": ["b", 1.2, 1.8, 1.8]})
df = df_orig.iloc[1:].infer_objects()
print(df_orig.dtypes, df.dtypes, sep='\n\n')
Output:
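With pandas 0.21 or later (the version that introduced infer_objects()), this should print:

A    object
B    object
dtype: object

A      int64
B    float64
dtype: object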
Answered by MisterMonk
Here is a (not perfect) attempt to write a better inferrer. When you already have data in your DataFrame, the inferrer will guess the smallest possible type. Datetime is currently missing, but I think it could be a starting point. With this inferrer, I can cut the memory in use by about 70%.
import numpy as np


def infer_df(df, hard_mode=False, float_to_int=False, mf=None):
    ret = {}

    # ToDo: How much does auto conversion cost?
    # Set the multiplication factor (a safety margin) unless the caller passed one.
    if mf is None:
        mf = 1 if hard_mode else 0.5

    # Set the supported data types.
    integers = ['int8', 'int16', 'int32', 'int64']
    floats = ['float16', 'float32', 'float64']
    # ToDo: Unsigned integers

    # Generate the value borders for each data type.
    b_integers = [(np.iinfo(i).min, np.iinfo(i).max, i) for i in integers]
    b_floats = [(np.finfo(f).min, np.finfo(f).max, f) for f in floats]

    for c in df.columns:
        _type = df[c].dtype

        # If a column is stored as float but only holds whole numbers,
        # optionally convert it to int first.
        if float_to_int and np.issubdtype(_type, np.floating):
            if np.sum(np.remainder(df[c], 1)) == 0:
                df[c] = df[c].astype('int64')
                _type = df[c].dtype

        # Convert the type of the column to the smallest possible.
        if np.issubdtype(_type, np.integer) or np.issubdtype(_type, np.floating):
            borders = b_integers if np.issubdtype(_type, np.integer) else b_floats
            _min = df[c].min()
            _max = df[c].max()
            for b in borders:
                if b[0] * mf < _min and _max < b[1] * mf:
                    ret[c] = b[2]
                    break

        # Low-cardinality object columns are good candidates for 'category'.
        if _type == 'object' and len(df[c].unique()) / len(df) < 0.1:
            ret[c] = 'category'

    return ret
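A minimal usage sketch (the sample frame and column names below are made up for illustration): the function returns a plain {column: dtype} mapping, which pandas can apply in one call via astype().

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(100), 'y': ['a', 'b'] * 50})
print(infer_df(df))           # should print {'x': 'int16', 'y': 'category'}
                              # (int8 is rejected because of the 0.5 safety margin)
df = df.astype(infer_df(df))  # shrink the frame
print(df.dtypes)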