如何检查python pandas中列的dtype
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/22697773/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
how to check the dtype of a column in python pandas
提问 by James Bond
I need to use different functions to treat numeric columns and string columns. What I am doing now is really dumb:
我需要使用不同的函数来处理数字列和字符串列。我现在正在做的事情真的很愚蠢:
allc = list((agg.loc[:, (agg.dtypes==np.float64)|(agg.dtypes==np.int)]).columns)
for y in allc:
    treat_numeric(agg[y])

allc = list((agg.loc[:, (agg.dtypes!=np.float64)&(agg.dtypes!=np.int)]).columns)
for y in allc:
    treat_str(agg[y])
Is there a more elegant way to do this? E.g.
有没有更优雅的方法来做到这一点?例如
for y in agg.columns:
    if (dtype(agg[y]) == 'string'):
        treat_str(agg[y])
    elif (dtype(agg[y]) != 'string'):
        treat_numeric(agg[y])
采纳答案 by David Robinson
回答 by Mike
I know this is a bit of an old thread but with pandas 0.19.2, you can do:
我知道这是一个有点旧的帖子,但是使用 pandas 0.19.2,您可以这样做:
df.select_dtypes(include=['float64']).apply(your_function)
df.select_dtypes(exclude=['object']).apply(your_other_function)  # 'string' is not an accepted dtype name here; use 'object'
http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.select_dtypes.html
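As a side note (not part of the original answer), the same idea can route whole column groups to the handlers from the question; a minimal sketch, assuming treat_numeric and treat_str each accept a single Series (names taken from the question):
顺带一提(非原回答内容),同样的思路也可以把整组列分别交给问题中的处理函数;下面是一个最简示意,假设 treat_numeric 和 treat_str 都接受单个 Series(名称取自问题):
import numpy as np

# dispatch each column to the matching handler, column by column
for col in agg.select_dtypes(include=[np.number]).columns:
    treat_numeric(agg[col])
for col in agg.select_dtypes(include=['object']).columns:
    treat_str(agg[col])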
回答 by danthelion
In pandas 0.20.2 you can do:
在 pandas 0.20.2 中你可以这样做:
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
is_string_dtype(df['A'])
>>>> True
is_numeric_dtype(df['B'])
>>>> True
So your code becomes:
所以你的代码变成:
for y in agg.columns:
    if (is_string_dtype(agg[y])):
        treat_str(agg[y])
    elif (is_numeric_dtype(agg[y])):
        treat_numeric(agg[y])
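Note that the if/elif above silently skips columns that are neither string- nor numeric-typed (datetimes, categoricals, ...). A hedged variant, assuming a hypothetical treat_other handler for such columns:
注意,上面的 if/elif 会悄悄跳过既不是字符串也不是数值类型的列(日期时间、分类等)。下面是一个示意性的变体,假设存在一个假想的 treat_other 处理函数:
from pandas.api.types import is_numeric_dtype, is_string_dtype

for y in agg.columns:
    if is_string_dtype(agg[y]):
        treat_str(agg[y])
    elif is_numeric_dtype(agg[y]):
        treat_numeric(agg[y])
    else:
        # e.g. datetime64, timedelta64 or category columns
        treat_other(agg[y])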
回答 by Tom
If you want to get the type of a dataframe column as a character code, you can do:
如果想以字符代码的形式获取数据框列的类型,可以这样做:
df['A'].dtype.kind
An example:
一个例子:
In [8]: df = pd.DataFrame([[1,'a',1.2],[2,'b',2.3]])
In [9]: df[0].dtype.kind, df[1].dtype.kind, df[2].dtype.kind
Out[9]: ('i', 'O', 'f')
The answer for your code:
您的代码的答案:
for y in agg.columns:
    if (agg[y].dtype.kind == 'f' or agg[y].dtype.kind == 'i'):
        treat_numeric(agg[y])
    else:
        treat_str(agg[y])
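If more than two kinds need handling, a small lookup table keyed on dtype.kind keeps the dispatch readable; a sketch only, reusing the question's treat_numeric/treat_str names:
如果需要处理的 kind 不止两种,以 dtype.kind 为键的小型查找表可以让分发更易读;仅作示意,沿用问题中的 treat_numeric/treat_str 名称:
# 'i' = signed int, 'u' = unsigned int, 'f' = float, 'O' = object (often strings)
handlers = {'i': treat_numeric, 'u': treat_numeric, 'f': treat_numeric, 'O': treat_str}

for y in agg.columns:
    handler = handlers.get(agg[y].dtype.kind)
    if handler is not None:
        handler(agg[y])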
回答 by BeforeFlight
The question title is general, but the use case stated in the body of the question is specific, so any of the other answers may be used.
问题标题是一般性的,但提问者在正文中陈述的用例是具体的,因此其他答案都可以使用。
But in order to fully answer the title question, it should be clarified that all of the approaches seem to fail in some cases and require some rework. I reviewed all of them (and some additional ones) in decreasing order of reliability (in my opinion):
但是为了完全回答标题中的问题,应该说明的是,所有方法似乎都可能在某些情况下失败并需要一些返工。我按可靠性从高到低的顺序(在我看来)回顾了所有这些方法(以及一些额外的方法):
1. Comparing types directly via == (accepted answer).
1. 直接通过 == 比较类型(已采纳的答案)。
Despite the fact that this is the accepted answer and has the most upvotes, I think this method should not be used at all, because in fact this approach is discouraged in python, as mentioned several times here.
But if one still wants to use it, one should be aware of some pandas-specific dtypes like pd.CategoricalDtype, pd.PeriodDtype, or pd.IntervalDtype. Here one has to use an extra type() in order to recognize the dtype correctly:
尽管这是被采纳的答案并且获得了最多的赞成票,但我认为根本不应该使用这种方法,因为实际上这种方法在 python 中是不被鼓励的,正如这里多次提到的。
但如果仍然想使用它,就应该注意一些 pandas 特有的 dtypes,如 pd.CategoricalDtype、pd.PeriodDtype 或 pd.IntervalDtype。这里必须额外使用 type() 才能正确识别 dtype:
s = pd.Series([pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')])
s
s.dtype == pd.PeriodDtype # Not working
type(s.dtype) == pd.PeriodDtype # working
>>> 0 2002-03-01
>>> 1 2012-02-01
>>> dtype: period[D]
>>> False
>>> True
Another caveat here is that the type has to be specified precisely:
这里的另一个需要注意的地方是必须精确地指定类型:
s = pd.Series([1,2])
s
s.dtype == np.int64 # Working
s.dtype == np.int32 # Not working
>>> 0 1
>>> 1 2
>>> dtype: int64
>>> True
>>> False
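One way around having to spell out the exact integer width (not mentioned in the answers, just a possible alternative) is np.issubdtype, which checks against the generic numpy type; note it only applies to plain numpy dtypes, not to pandas extension dtypes:
避免必须写出精确整数宽度的一种办法(回答中未提及,只是一个可能的替代方案)是 np.issubdtype,它针对通用的 numpy 类型进行检查;注意它只适用于普通的 numpy dtypes,不适用于 pandas 扩展 dtypes:
import numpy as np
import pandas as pd

s = pd.Series([1, 2])
np.issubdtype(s.dtype, np.integer)   # True for int32 and int64 alike
np.issubdtype(s.dtype, np.floating)  # False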
2. isinstance() approach.
2. isinstance() 方法。
This method has not been mentioned in answers so far.
到目前为止,答案中还没有提到这种方法。
So if directly comparing types is not a good idea, let's try the built-in python function for this purpose, namely isinstance().
It fails right at the beginning, because it assumes that we have some objects, but pd.Series or pd.DataFrame may be used as just empty containers with a predefined dtype but no objects in them:
因此,如果直接比较类型不是一个好主意,那就试试为此目的准备的 python 内置函数,即 isinstance()。
它一开始就会失败,因为它假设我们手头有一些对象,但 pd.Series 或 pd.DataFrame 可能只是具有预定义 dtype 但其中没有任何对象的空容器:
s = pd.Series([], dtype=bool)
s
>>> Series([], dtype: bool)
But if one somehow overcomes this issue and wants to access each object, for example in the first row, and check its dtype like this:
但是,如果以某种方式克服了这个问题,并且想要访问每个对象(例如第一行中的对象)并像下面这样检查它的 dtype:
df = pd.DataFrame({'int': [12, 2], 'dt': [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')]},
                  index = ['A', 'B'])
for col in df.columns:
    print((df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)))
>>> (dtype('int64'), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')
It will be misleading in the case of mixed types of data in a single column:
在单列中包含混合类型数据的情况下,它会产生误导:
df2 = pd.DataFrame({'data': [12, pd.Timestamp('2013-01-02')]},
                   index = ['A', 'B'])
for col in df2.columns:
    print((df2[col].dtype, 'is_int64 = %s' % isinstance(df2.loc['A', col], np.int64)))
>>> (dtype('O'), 'is_int64 = False')
And last but not least - this method cannot directly recognize the Category dtype. As stated in the docs:
最后但并非最不重要的一点是,此方法无法直接识别 Category dtype。如文档中所述:
Returning a single item from categorical data will also return the value, not a categorical of length “1”.
从分类数据返回单个项目也将返回值,而不是长度为“1”的分类。
df['int'] = df['int'].astype('category')
for col in df.columns:
    print((df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)))
>>> (CategoricalDtype(categories=[2, 12], ordered=False), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')
So this method is also almost inapplicable.
所以这个方法也几乎不适用。
3. df.dtype.kind approach.
3. df.dtype.kind 方法。
This method does work with empty pd.Series or pd.DataFrames, but it has other problems.
这种方法对空的 pd.Series 或 pd.DataFrame 也适用,但还有其他问题。
First, it is unable to distinguish some dtypes:
首先,它无法区分某些 dtypes:
df = pd.DataFrame({'prd' :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                   'str' :['s1', 's2'],
                   'cat' :[1, -1]})
df['cat'] = df['cat'].astype('category')
for col in df:
    # kind will define all columns as 'Object'
    print (df[col].dtype, df[col].dtype.kind)
>>> period[D] O
>>> object O
>>> category O
Second, and this is actually still unclear to me, it even returns None for some dtypes.
其次,对我来说实际上仍然不清楚的是,它对某些 dtypes 甚至返回 None。
4. df.select_dtypes approach.
4. df.select_dtypes 方法。
This is almost what we want. This method is designed inside pandas, so it handles most of the corner cases mentioned earlier: empty DataFrames, and it distinguishes numpy and pandas-specific dtypes well. It works well with a single dtype like .select_dtypes('bool'). It may even be used for selecting groups of columns based on dtype:
这几乎就是我们想要的。这个方法是在 pandas 内部设计的,因此它可以处理前面提到的大多数极端情况:空的 DataFrame,并且能很好地区分 numpy 和 pandas 特有的 dtypes。它适用于像 .select_dtypes('bool') 这样的单个 dtype,甚至可以用于根据 dtype 选择列组:
test = pd.DataFrame({'bool' :[False, True], 'int64':[-1,2], 'int32':[-1,2],'float': [-2.5, 3.4],
                     'compl':np.array([1-1j, 5]),
                     'dt'   :[pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')],
                     'td'   :[pd.Timestamp('2012-03-02')- pd.Timestamp('2016-10-20'),
                              pd.Timestamp('2010-07-12')- pd.Timestamp('2000-11-10')],
                     'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                     'intrv':pd.arrays.IntervalArray([pd.Interval(0, 0.1), pd.Interval(1, 5)]),
                     'str'  :['s1', 's2'],
                     'cat'  :[1, -1],
                     'obj'  :[[1,2,3], [5435,35,-52,14]]
                    })
test['int32'] = test['int32'].astype(np.int32)
test['cat'] = test['cat'].astype('category')
Like so, as stated in the docs:
像这样,如文档中所述:
test.select_dtypes('number')
>>> int64 int32 float compl td
>>> 0 -1 -1 -2.5 (1-1j) -1693 days
>>> 1 2 2 3.4 (5+0j) 3531 days
One may think that here we see the first unexpected (it used to be a question for me) result: TimeDelta is included in the output DataFrame. But as answered elsewhere it should be so, one just has to be aware of it. Note that the bool dtype is skipped, which may also be undesired for someone, but it's because bool and number are in different "subtrees" of numpy dtypes. In the case of bool, we may use test.select_dtypes(['bool']) here.
有人可能会认为这里出现了第一个意想不到的结果(至少对我来说曾经是个疑问):TimeDelta 被包含进了输出的 DataFrame。但正如别处所回答的,它本应如此,只是必须意识到这一点。请注意,bool dtype 被跳过了,这对某些人来说可能也是不希望的,但这是因为 bool 和 number 位于 numpy dtypes 的不同"子树"中。对于 bool,我们可以在这里使用 test.select_dtypes(['bool'])。
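For completeness, a small sketch of both options against the test frame built above (list syntax as in the docs):
为完整起见,下面针对上面构造的 test 给出这两种写法的小示意(列表语法同文档):
test.select_dtypes(['bool'])            # only the bool column
test.select_dtypes(['number', 'bool'])  # the numeric "subtree" plus bool in one call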
The next restriction of this method is that for the current version of pandas (0.24.2), the code test.select_dtypes('period') will raise NotImplementedError.
此方法的下一个限制是,对于当前版本的 pandas (0.24.2),代码 test.select_dtypes('period') 将引发 NotImplementedError。
And another thing is that it's unable to distinguish strings from other objects:
另一件事是它无法将字符串与其他对象区分开来:
test.select_dtypes('object')
>>> str obj
>>> 0 s1 [1, 2, 3]
>>> 1 s2 [5435, 35, -52, 14]
But this is, first, already mentioned in the docs. And second, it is not a problem of this method, but rather of the way strings are stored in a DataFrame. But anyway, this case has to involve some post-processing.
但首先,这一点已经在文档中提到过;其次,这不是这种方法本身的问题,而是字符串在 DataFrame 中的存储方式所致。但无论如何,这种情况都需要一些后期处理。
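One possible form of such post-processing (a sketch, not from the original answer) is to keep only the object columns whose values are all python strings:
这种后期处理的一种可能形式(仅作示意,并非原回答内容)是只保留所有取值都是 python 字符串的 object 列:
obj_cols = test.select_dtypes('object')

# keep only the object columns whose every value is actually a str
str_cols = [c for c in obj_cols.columns if obj_cols[c].map(type).eq(str).all()]
str_cols
>>> ['str']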
5. pd.api.types.is_XXX_dtype approach.
5. pd.api.types.is_XXX_dtype 方法。
This one is intended to be the most robust and native way to achieve dtype recognition (the path of the module where the functions reside speaks for itself), I suppose. And it works almost perfectly, but still has at least one caveat: one still has to somehow distinguish string columns.
我认为,这是实现 dtype 识别的最健壮和最原生的方式(这些函数所在模块的路径本身就说明了这一点)。它几乎可以完美地工作,但仍然至少有一个需要注意的地方:仍然必须以某种方式区分字符串列。
Besides, and this may be subjective, this approach also has a more 'human-understandable' handling of the number dtypes group compared with .select_dtypes('number'):
此外,这可能是主观的,但与 .select_dtypes('number') 相比,这种方法对 number dtypes 组的处理也更"容易被人理解":
for col in test.columns:
    if pd.api.types.is_numeric_dtype(test[col]):
        print (test[col].dtype)
>>> bool
>>> int64
>>> int32
>>> float64
>>> complex128
No timedelta is included, and bool is included. Perfect.
没有包含 timedelta,而 bool 被包含在内。完美。
My pipeline exploits exactly this functionality at this moment, plus a bit of manual post-processing.
我的管道目前正是利用了这个功能,外加一些手工的后期处理。
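For the remaining caveat (telling real string columns apart from other object columns), a hedged sketch of what such extra processing might look like on top of pd.api.types, reusing the test frame above:
针对剩下的那个问题(把真正的字符串列与其他 object 列区分开),下面是基于 pd.api.types 的这种额外处理可能的样子的示意,复用上面的 test:
from pandas.api.types import (is_bool_dtype, is_numeric_dtype,
                              is_datetime64_any_dtype, is_object_dtype)

for col in test.columns:
    s = test[col]
    if is_bool_dtype(s):
        kind = 'bool'
    elif is_numeric_dtype(s):
        kind = 'number'
    elif is_datetime64_any_dtype(s):
        kind = 'datetime'
    elif is_object_dtype(s) and s.map(type).eq(str).all():
        kind = 'string'   # the extra check string columns still need
    else:
        kind = 'other'    # category, period, interval, timedelta, nested objects, ...
    print(col, s.dtype, kind)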
Output.
输出。
Hope I was able to argue the main point: all of the discussed approaches may be used, but only pd.DataFrame.select_dtypes() and pd.api.types.is_XXX_dtype should really be considered as the applicable ones.
希望我已经论证了主要观点:所有讨论过的方法都可以使用,但只有 pd.DataFrame.select_dtypes() 和 pd.api.types.is_XXX_dtype 才真正应该被视为适用的方法。
回答 by ePi272314
To pretty print the column data types
漂亮地打印列数据类型
To check the data types after, for example, an import from a file
例如,在从文件导入后检查数据类型
def printColumnInfo(df):
    template = "%-8s %-30s %s"
    print(template % ("Type", "Column Name", "Example Value"))
    print("-" * 53)
    for c in df.columns:
        print(template % (df[c].dtype, c, df[c].iloc[1]))
Illustrative output:
说明性输出:
Type Column Name Example Value
-----------------------------------------------------
int64 Age 49
object Attrition No
object BusinessTravel Travel_Frequently
float64 DailyRate 279.0