Original source: http://stackoverflow.com/questions/22235245/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Calculate summary statistics of columns in dataframe
Asked by Tyler Wood
I have a dataframe of the following form (for example):
shopper_num,is_martian,number_of_items,count_pineapples,birth_country,tranpsortation_method
1,FALSE,0,0,MX,
2,FALSE,1,0,MX,
3,FALSE,0,0,MX,
4,FALSE,22,0,MX,
5,FALSE,0,0,MX,
6,FALSE,0,0,MX,
7,FALSE,5,0,MX,
8,FALSE,0,0,MX,
9,FALSE,4,0,MX,
10,FALSE,2,0,MX,
11,FALSE,0,0,MX,
12,FALSE,13,0,MX,
13,FALSE,0,0,CA,
14,FALSE,0,0,US,
How can I use Pandas to calculate summary statistics for each column (column data types vary, and some columns contain no information),
and then return a dataframe of the form:
columnname, max, min, median,
is_martian, NA, NA, FALSE
And so on.
Accepted answer by EdChum
describe may give you everything you want; otherwise you can perform aggregations using groupby and pass a list of agg functions: http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once
In [43]:
df.describe()
Out[43]:
       shopper_num is_martian  number_of_items  count_pineapples
count      14.0000         14        14.000000                14
mean        7.5000          0         3.357143                 0
std         4.1833          0         6.452276                 0
min         1.0000      False         0.000000                 0
25%         4.2500          0         0.000000                 0
50%         7.5000          0         0.000000                 0
75%        10.7500          0         3.500000                 0
max        14.0000      False        22.000000                 0
[8 rows x 4 columns]
Note that some columns cannot be summarised because there is no logical way to summarise them, for instance columns containing string data.
You can transpose the result if you prefer:
In [47]:
df.describe().transpose()
Out[47]:
                  count      mean       std    min   25%  50%    75%    max
shopper_num          14       7.5    4.1833      1  4.25  7.5  10.75     14
is_martian           14         0         0  False     0    0      0  False
number_of_items      14  3.357143  6.452276      0     0    0    3.5     22
count_pineapples     14         0         0      0     0    0      0      0
[4 rows x 8 columns]
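For the exact layout requested in the question (one row per column, with max, min, and median), a minimal sketch might aggregate with a list of function names and transpose. The small frame below is a hypothetical, shortened subset of the question's data, and restricting to numeric columns with select_dtypes is an assumption made here for illustration, not part of the original answer. The last lines show the groupby variant the answer mentions.

```python
import pandas as pd

# Hypothetical subset of the question's data, shortened for illustration
df = pd.DataFrame({
    "shopper_num": [1, 2, 3, 4],
    "is_martian": [False, False, False, False],
    "number_of_items": [0, 1, 0, 22],
    "birth_country": ["MX", "MX", "CA", "US"],
})

# Restrict to numeric columns (an assumption; bool and string columns are left out)
numeric = df.select_dtypes(include="number")

# One row per column, one column per statistic
summary = numeric.agg(["max", "min", "median"]).transpose()
summary.index.name = "columnname"
print(summary)

# groupby with a list of aggregation functions, as the answer mentions
by_country = df.groupby("birth_country")["number_of_items"].agg(["min", "max", "median"])
print(by_country)
```

Columns with no sensible numeric summary (such as birth_country) are simply skipped in this sketch; whether to report them as NA instead is a design choice left to the reader.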
Answer by Ken Wallace
To clarify one point in @EdChum's answer, per the documentation, you can include the object columns by using df.describe(include='all'). It won't provide many statistics, but will provide a few pieces of info, including count, number of unique values, top value. This may be a new feature, I don't know as I am a relatively new user.
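As a rough illustration of include='all', the sketch below uses a hypothetical two-column frame (not the question's full data) with one numeric and one string column. Statistics that do not apply to a column come back as NaN.

```python
import pandas as pd

# Hypothetical two-column frame: one numeric column, one string column
df = pd.DataFrame({
    "number_of_items": [0, 1, 0, 22],
    "birth_country": ["MX", "MX", "CA", "US"],
})

# include="all" reports count/unique/top/freq for string columns
# alongside the usual numeric statistics
full = df.describe(include="all")
print(full)

# Transposed, each original column becomes one row
print(full.transpose())
```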
Answer by akilat90
There is now the pandas_profiling package, which is a more complete alternative to df.describe().
If your pandas dataframe is df, the below will return a complete analysis including some warnings about missing values, skewness, etc. It presents histograms and correlation plots as well.
import pandas_profiling
pandas_profiling.ProfileReport(df)
See the example notebook detailing the usage.

