Pandas 交叉表 - 如何为数据集中不存在的值打印行/列?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/40280552/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Pandas crosstab - How to print rows/columns for values that don't exist in the data sets?
提问by pasitrauk
I am a beginner with pandas at best and I couldn't find a solution to this problem anywhere.
我充其量是Pandas的初学者,我无法在任何地方找到解决此问题的方法。
Let's say I have two variables: variable1, variable2.
假设我有两个变量:variable1、variable2。
They can have the following predefinedvalues:
它们可以具有以下预定义值:
variable1 = ['1', '4', '9', '15', '20']
variable2 = ['2', '5', '6']
However, the current data set only has some of those values:
但是,当前数据集只有其中一些值:
df = pd.DataFrame({variable1 : ['1', '9', '20'],
variable2 : ['2', '2', '6']})
When crossing the variables:
交叉变量时:
pd.crosstab(df.variable1, df.variable2)
I get:
我得到:
variable2 2 6
variable1
1 1 0
20 0 1
9 1 0
Is there a way to put all the possible categorical values in both the columns and the rows even if the current data set does not have all of them? The goal is to have a table of the same size when running the script with an updated data set which may have the values that were not present in the previous data set.
即使当前数据集没有所有可能的分类值,有没有办法将所有可能的分类值同时放在列和行中?目标是在使用更新的数据集运行脚本时拥有相同大小的表,该数据集可能具有先前数据集中不存在的值。
回答by jezrael
Use DataFrame.reindex
:
variable1 = ['1', '4', '9', '15', '20']
variable2 = ['2', '5', '6']
df = pd.DataFrame({'variable1' : ['1', '9', '20'],
'variable2' : ['2', '2', '6']})
print (df)
variable1 variable2
0 1 2
1 9 2
2 20 6
df = pd.crosstab(df.variable1, df.variable2)
df = df.reindex(index=variable1, columns=variable2, fill_value=0)
print (df)
variable2 2 5 6
variable1
1 1 0 0
4 0 0 0
9 1 0 0
15 0 0 0
20 0 0 1
from collections import OrderedDict
valuelabels = OrderedDict([('S8', [['1', 'Medical oncology'],
['2', 'Hematology'],
['3', 'Hematology/Oncology'],
['4', 'Other']]),
('S9', [['1', 'Academic / Teaching Hospital'],
['2', 'Community-Based Solo Private Practice'],
['3', 'Community-Based Group Private Practice (record practice size )'], ['4', 'Community Non-Teaching Hospital'],
['5', 'Comprehensive Cancer Center'],
['6', 'Other (specify)']])])
#print (valuelabels)
df = pd.DataFrame({'variable1' : ['1', '2', '4'],
'variable2' : ['2', '3', '1']})
table = pd.crosstab(df.variable1, df.variable2)
print (table)
variable2 1 2 3
variable1
1 0 1 0
2 0 0 1
4 1 0 0
d1 = dict(list(zip([a[0] for a in valuelabels['S8']], [a[1] for a in valuelabels['S8']])))
print (d1)
{'4': 'Other', '1': 'Medical oncology', '2': 'Hematology', '3': 'Hematology/Oncology'}
d2 = dict(list(zip([a[0] for a in valuelabels['S9']], [a[1] for a in valuelabels['S9']])))
print (d2)
{'1': 'Academic / Teaching Hospital',
'3': 'Community-Based Group Private Practice (record practice size )',
'4': 'Community Non-Teaching Hospital',
'6': 'Other (specify)',
'2': 'Community-Based Solo Private Practice',
'5': 'Comprehensive Cancer Center'}
table = table.reindex(index=[a[0] for a in valuelabels['S8']],
columns=[a[0] for a in valuelabels['S9'], fill_value=0)
print (table)
variable2 1 2 3 4 5 6
variable1
1 0 1 0 0 0 0
2 0 0 1 0 0 0
3 0 0 0 0 0 0
4 1 0 0 0 0 0
table.index = table.index.to_series().map(d1).values
table.columns = table.columns.to_series().map(d2).values
print (table)
Academic / Teaching Hospital \
Medical oncology 0
Hematology 0
Hematology/Oncology 0
Other 1
Community-Based Solo Private Practice \
Medical oncology 1
Hematology 0
Hematology/Oncology 0
Other 0
Community-Based Group Private Practice (record practice size ) \
Medical oncology 0
Hematology 1
Hematology/Oncology 0
Other 0
Community Non-Teaching Hospital \
Medical oncology 0
Hematology 0
Hematology/Oncology 0
Other 0
Comprehensive Cancer Center Other (specify)
Medical oncology 0 0
Hematology 0 0
Hematology/Oncology 0 0
Other 0 0
回答by ayhan
You can use reindex:
您可以使用重新索引:
ct = pd.crosstab(df.variable1, df.variable2)
ct.reindex(index=variable1, columns=variable2).fillna(0).astype('int')
Out:
variable2 2 5 6
variable1
1 1 0 0
4 0 0 0
9 1 0 0
15 0 0 0
20 0 0 1
回答by Rahul Chowdary Nuthakki
def TargetPercentByNominal (
targetVar, # target variable
predictor): # nominal predictor
countTable = pandas.crosstab(index = predictor, columns = targetVar, margins = True, dropna = True)
x = countTable.drop('All', 1)
percentTable = countTable.div(x.sum(1), axis='index')*100
print("Frequency Table: \n")
print(countTable)
print( )
print("Percent Table: \n")
print(percentTable)
return