Pandas 交叉表 - 如何为数据集中不存在的值打印行/列？

Question

提问by pasitrauk

I am a beginner with pandas at best and I couldn't find a solution to this problem anywhere.

我充其量是Pandas的初学者，我无法在任何地方找到解决此问题的方法。

Let's say I have two variables: variable1, variable2.

假设我有两个变量：variable1、variable2。

They can have the following predefinedvalues:

它们可以具有以下预定义值：

variable1 = ['1', '4', '9', '15', '20']
variable2 = ['2', '5', '6']

However, the current data set only has some of those values:

但是，当前数据集只有其中一些值：

df = pd.DataFrame({variable1 : ['1', '9', '20'],
                  variable2 : ['2', '2', '6']})

When crossing the variables:

交叉变量时：

pd.crosstab(df.variable1, df.variable2)

I get:

我得到：

variable2  2  6
variable1      
1          1  0
20         0  1
9          1  0

Is there a way to put all the possible categorical values in both the columns and the rows even if the current data set does not have all of them? The goal is to have a table of the same size when running the script with an updated data set which may have the values that were not present in the previous data set.

即使当前数据集没有所有可能的分类值，有没有办法将所有可能的分类值同时放在列和行中？目标是在使用更新的数据集运行脚本时拥有相同大小的表，该数据集可能具有先前数据集中不存在的值。

Answer 1

回答by jezrael

Use DataFrame.reindex:

使用DataFrame.reindex：

variable1 = ['1', '4', '9', '15', '20']
variable2 = ['2', '5', '6']


df = pd.DataFrame({'variable1' : ['1', '9', '20'],
                  'variable2' : ['2', '2', '6']})

print (df)                  
  variable1 variable2
0         1         2
1         9         2
2        20         6

df = pd.crosstab(df.variable1, df.variable2)
df = df.reindex(index=variable1, columns=variable2, fill_value=0)
print (df)
variable2  2  5  6
variable1         
1          1  0  0
4          0  0  0
9          1  0  0
15         0  0  0
20         0  0  1

from collections import OrderedDict


valuelabels = OrderedDict([('S8', [['1', 'Medical oncology'], 
                                   ['2', 'Hematology'], 
                                   ['3', 'Hematology/Oncology'], 
                                   ['4', 'Other']]), 
                           ('S9', [['1', 'Academic / Teaching Hospital'], 
                                   ['2', 'Community-Based Solo Private Practice'], 
                                   ['3', 'Community-Based Group Private Practice (record practice size )'], ['4', 'Community Non-Teaching Hospital'], 
                                   ['5', 'Comprehensive Cancer Center'], 
                                   ['6', 'Other (specify)']])])
#print (valuelabels)


df = pd.DataFrame({'variable1' : ['1', '2', '4'],
                  'variable2' : ['2', '3', '1']})

table = pd.crosstab(df.variable1, df.variable2)      
print (table)
variable2  1  2  3
variable1         
1          0  1  0
2          0  0  1
4          1  0  0

d1 = dict(list(zip([a[0] for a in valuelabels['S8']], [a[1] for a in valuelabels['S8']])))
print (d1)
{'4': 'Other', '1': 'Medical oncology', '2': 'Hematology', '3': 'Hematology/Oncology'}

d2 = dict(list(zip([a[0] for a in valuelabels['S9']], [a[1] for a in valuelabels['S9']])))
print (d2)
{'1': 'Academic / Teaching Hospital', 
'3': 'Community-Based Group Private Practice (record practice size )', 
'4': 'Community Non-Teaching Hospital', 
'6': 'Other (specify)', 
'2': 'Community-Based Solo Private Practice', 
'5': 'Comprehensive Cancer Center'}

table = table.reindex(index=[a[0] for a in valuelabels['S8']], 
                      columns=[a[0] for a in valuelabels['S9'], fill_value=0)
print (table)
variable2  1  2  3  4  5  6
variable1                  
1          0  1  0  0  0  0
2          0  0  1  0  0  0
3          0  0  0  0  0  0
4          1  0  0  0  0  0

table.index = table.index.to_series().map(d1).values
table.columns = table.columns.to_series().map(d2).values

print (table)
                     Academic / Teaching Hospital  \
Medical oncology                                0   
Hematology                                      0   
Hematology/Oncology                             0   
Other                                           1   

                     Community-Based Solo Private Practice  \
Medical oncology                                         1   
Hematology                                               0   
Hematology/Oncology                                      0   
Other                                                    0   

                     Community-Based Group Private Practice (record practice size )  \
Medical oncology                                                     0                
Hematology                                                           1                
Hematology/Oncology                                                  0                
Other                                                                0                

                     Community Non-Teaching Hospital  \
Medical oncology                                   0   
Hematology                                         0   
Hematology/Oncology                                0   
Other                                              0   

                     Comprehensive Cancer Center  Other (specify)  
Medical oncology                               0                0  
Hematology                                     0                0  
Hematology/Oncology                            0                0  
Other                                          0                0

Answer 2

回答by ayhan

You can use reindex:

您可以使用重新索引：

ct = pd.crosstab(df.variable1, df.variable2)
ct.reindex(index=variable1, columns=variable2).fillna(0).astype('int')
Out: 
variable2  2  5  6
variable1         
1          1  0  0
4          0  0  0
9          1  0  0
15         0  0  0
20         0  0  1

Answer 3

回答by Rahul Chowdary Nuthakki

def TargetPercentByNominal (
 targetVar,       # target variable
 predictor):      # nominal predictor

countTable = pandas.crosstab(index = predictor, columns = targetVar, margins = True, dropna = True)
  x = countTable.drop('All', 1)
  percentTable = countTable.div(x.sum(1), axis='index')*100

  print("Frequency Table: \n")
  print(countTable)
  print( )
  print("Percent Table: \n")
  print(percentTable)

  return

Pandas 交叉表 - 如何为数据集中不存在的值打印行/列？

提问by pasitrauk

回答by jezrael

回答by ayhan

回答by Rahul Chowdary Nuthakki

相关推荐

最近更新

标签

Pandas 交叉表 - 如何为数据集中不存在的值打印行/列？

提问by pasitrauk

回答by jezrael

回答by ayhan

回答by Rahul Chowdary Nuthakki

相关推荐

用数据框 Python pandas 中的空白替换 0

Pandas DataFrame 的条件计算列

pandas 比较不同长度的熊猫数据帧

pandas “DataFrame”对象不可调用

相关推荐

最近更新

标签