Python: find the percentage of missing values in each column of a given dataset

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/51070985/

Time: 2020-08-19 19:42:42  Source: igfitidea

Find out the percentage of missing values in each column in the given dataset

Tags: python, python-3.x, pandas, numpy

Asked by Shaswata

import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
percent= 100*(len(df.loc[:,df.isnull().sum(axis=0)>=1 ].index) / len(df.index))
print(round(percent,2))

The input is https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0

and the output should be:

Ord_id                 0.00
Prod_id                0.00
Ship_id                0.00
Cust_id                0.00
Sales                  0.24
Discount               0.65
Order_Quantity         0.65
Profit                 0.65
Shipping_Cost          0.65
Product_Base_Margin    1.30
dtype: float64

Answer by Engineero

How about this? I think I actually found something similar on here once before, but I'm not seeing it now...

percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})

And if you want the missing percentages sorted, follow the above with:

missing_value_df.sort_values('percent_missing', inplace=True)

As mentioned in the comments, you may also be able to get by with just the first line in my code above, i.e.:

percent_missing = df.isnull().sum() * 100 / len(df)
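
As a quick check of the snippets above, here is a minimal sketch on a small made-up DataFrame. The column names and values are illustrative assumptions, not the data set from the question, and the sort is written without inplace=True so it returns a new table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the data set from the question
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': ['x', 'y', None, None],
    'c': [1, 2, 3, 4],
})

# Count of missing entries per column, scaled to a percentage of the row count
percent_missing = df.isnull().sum() * 100 / len(df)

missing_value_df = pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})

# Non-mutating alternative to sort_values(..., inplace=True)
missing_value_df = missing_value_df.sort_values('percent_missing')
print(missing_value_df)
```

With this toy input, column c has no missing values, a has one of four missing, and b has two of four missing, so the sorted table lists c, then a, then b.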

Answer by Scott Boston

Update: let's use mean with isnull:

df.isnull().mean() * 100

Output:

Ord_id                 0.000000
Prod_id                0.000000
Ship_id                0.000000
Cust_id                0.000000
Sales                  0.238124
Discount               0.654840
Order_Quantity         0.654840
Profit                 0.654840
Shipping_Cost          0.654840
Product_Base_Margin    1.297774
dtype: float64

IIUC:

df.isnull().sum() / df.shape[0] * 100.00

Output:

Ord_id                 0.000000
Prod_id                0.000000
Ship_id                0.000000
Cust_id                0.000000
Sales                  0.238124
Discount               0.654840
Order_Quantity         0.654840
Profit                 0.654840
Shipping_Cost          0.654840
Product_Base_Margin    1.297774
dtype: float64
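
The two expressions agree because isnull() returns a boolean mask, and the mean of a boolean column is the fraction of True values. A minimal sketch on made-up data (the columns a and b are assumptions for illustration) that compares both forms:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0],
                   'b': [np.nan, np.nan, 1.0, 2.0]})

mask = df.isnull()                          # True where a value is missing
via_mean = mask.mean() * 100                # fraction of True values, as a percentage
via_sum = mask.sum() / df.shape[0] * 100.00 # count of True values over the row count

print(via_mean)
print(np.allclose(via_mean, via_sum))
```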

Answer by RomanPerekhrest

To cover all missing values and round the results:

((df.isnull() | df.isna()).sum() * 100 / df.index.size).round(2)

The output:

Out[556]: 
Ord_id                 0.00
Prod_id                0.00
Ship_id                0.00
Cust_id                0.00
Sales                  0.24
Discount               0.65
Order_Quantity         0.65
Profit                 0.65
Shipping_Cost          0.65
Product_Base_Margin    1.30
dtype: float64
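
In pandas, isnull is documented as an alias of isna, so the | in the expression above combines two identical masks. A minimal sketch on made-up data (the column names and values are assumptions) that checks this and shows a shorter form giving the same result:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mixing NaN, None and NaT
df = pd.DataFrame({
    'a': [1.0, np.nan, None],
    'b': ['x', None, 'z'],
    'c': pd.to_datetime(['2020-01-01', None, '2020-01-03']),
})

# isnull() and isna() produce the same boolean mask
same_mask = df.isnull().equals(df.isna())

combined = ((df.isnull() | df.isna()).sum() * 100 / df.index.size).round(2)
simpler = (df.isna().mean() * 100).round(2)

print(same_mask)
print(simpler)
```

Note that neither call treats empty strings or placeholder values such as 'N/A' as missing; those would need to be converted to NaN first.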

Answer by Nitish Arora

The solution you're looking for is:

round(df.isnull().mean()*100,2) 

This rounds the percentage to 2 decimal places.

Another way to do this is:

round((df.isnull().sum()*100)/len(df),2)

but this is not as efficient as using mean().
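
The efficiency difference can be measured on your own data rather than assumed. The sketch below is only a rough timing harness on randomly generated data (the shape, the 10% missing rate, and the repeat count are arbitrary assumptions); it makes no claim about which form comes out faster, only that both give the same rounded result.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical random data with roughly 10% of entries set to NaN
rng = np.random.default_rng(0)
values = rng.random((10_000, 10))
values[values < 0.1] = np.nan
df = pd.DataFrame(values)

via_mean = round(df.isnull().mean() * 100, 2)
via_sum = round((df.isnull().sum() * 100) / len(df), 2)

# Time each expression over the same number of repetitions
t_mean = timeit.timeit(lambda: df.isnull().mean() * 100, number=50)
t_sum = timeit.timeit(lambda: df.isnull().sum() * 100 / len(df), number=50)
print(f"mean: {t_mean:.4f}s  sum/len: {t_sum:.4f}s")
```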

Answer by GpandaM

If there are multiple dataframes, the function below calculates the number of missing values in each column, along with the percentage:

def miss_data(df):
    x = ['column_name','missing_data', 'missing_in_percentage']
    missing_data = pd.DataFrame(columns=x)
    columns = df.columns
    for col in columns:
        icolumn_name = col
        imissing_data = df[col].isnull().sum()
        imissing_in_percentage = (df[col].isnull().sum()/df[col].shape[0])*100

        missing_data.loc[len(missing_data)] = [icolumn_name, imissing_data, imissing_in_percentage]
    print(missing_data) 
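
The same table can also be built without the per-column loop, and returned instead of printed so it can be reused across several dataframes. This is a sketch of that variant, not the answer's own code; the function name miss_data_table and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def miss_data_table(df):
    """Return per-column missing counts and percentages as a DataFrame."""
    mask = df.isnull()
    return pd.DataFrame({
        'column_name': df.columns,
        'missing_data': mask.sum().to_numpy(),
        'missing_in_percentage': (mask.mean() * 100).to_numpy(),
    })

# Hypothetical dataframes to summarize
frames = {
    'first': pd.DataFrame({'a': [1, None, 3, 4], 'b': ['x', 'y', 'z', None]}),
    'second': pd.DataFrame({'c': [None, None], 'd': [1.0, 2.0]}),
}

for name, frame in frames.items():
    print(name)
    print(miss_data_table(frame))
```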

Answer by naimur978

With the following code, you can get the corresponding percentage values for every column. In your case, just replace the name train_data with df.

Input:

In [1]:

all_data_na = (train_data.isnull().sum() / len(train_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)

Output:

Out[1]: 
                                Missing Ratio
 left_eyebrow_outer_end_x       68.435239
 left_eyebrow_outer_end_y       68.435239
 right_eyebrow_outer_end_y      68.279189
 right_eyebrow_outer_end_x      68.279189
 left_eye_outer_corner_x        67.839410
 left_eye_outer_corner_y        67.839410
 right_eye_inner_corner_x       67.825223
 right_eye_inner_corner_y       67.825223
 right_eye_outer_corner_x       67.825223
 right_eye_outer_corner_y       67.825223
 mouth_left_corner_y            67.811037
 mouth_left_corner_x            67.811037
 left_eyebrow_inner_end_x       67.796851
 left_eyebrow_inner_end_y       67.796851
 right_eyebrow_inner_end_y      67.796851
 mouth_right_corner_x           67.796851
 mouth_right_corner_y           67.796851
 right_eyebrow_inner_end_x      67.796851
 left_eye_inner_corner_x        67.782664
 left_eye_inner_corner_y        67.782664
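
To see what each step does, here is a minimal sketch on a small made-up frame standing in for train_data (the column names and values are assumptions). It uses boolean indexing in place of drop(...), which should select the same columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_data
train_data = pd.DataFrame({
    'full': [1, 2, 3, 4],
    'half': [1.0, np.nan, np.nan, 4.0],
    'mostly_empty': [np.nan, np.nan, np.nan, 4.0],
})

all_data_na = train_data.isnull().mean() * 100

# Keep only columns with missing values, largest ratio first, top 30 at most
all_data_na = all_data_na[all_data_na > 0].sort_values(ascending=False).head(30)

missing_data = pd.DataFrame({'Missing Ratio': all_data_na})
print(missing_data)
```

Here the fully populated column is dropped, and the remaining two are listed from most to least missing.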

Answer by Suhas_Pote

import numpy as np
import pandas as pd

raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', np.nan, np.nan, 'Milner', 'Cooze'], 
        'age': [22, np.nan, 23, 24, 25], 
        'sex': ['m', np.nan, 'f', 'm', 'f'], 
        'Test1_Score': [4, np.nan, 0, 0, 0],
        'Test2_Score': [25, np.nan, np.nan, 0, 0]}
results = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'Test1_Score', 'Test2_Score'])


results 

  first_name last_name   age  sex  Test1_Score  Test2_Score
0      Jason    Miller  22.0    m          4.0         25.0
1        NaN       NaN   NaN  NaN          NaN          NaN
2       Tina       NaN  23.0    f          0.0          NaN
3       Jake    Milner  24.0    m          0.0          0.0
4        Amy     Cooze  25.0    f          0.0          0.0

You can use the following function, which returns a DataFrame with these columns:

  • Zero Values
  • Missing Values
  • % of Total Values
  • Total Zero Missing Values
  • % Total Zero Missing Values
  • Data Type

Just copy and paste the following function and call it by passing your pandas DataFrame:

def missing_zero_values_table(df):
    zero_val = (df == 0.00).astype(int).sum(axis=0)
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
    mz_table = mz_table.rename(
        columns={0: 'Zero Values', 1: 'Missing Values', 2: '% of Total Values'})
    mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
    mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
    mz_table['Data Type'] = df.dtypes
    # Keep only columns with at least one missing value, sorted by missing percentage
    mz_table = mz_table[mz_table.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"
          "There are " + str(mz_table.shape[0]) +
          " columns that have missing values.")
    # mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index = False)
    return mz_table

missing_zero_values_table(results)

Output:

Your selected dataframe has 6 columns and 5 Rows.
There are 6 columns that have missing values.

             Zero Values  Missing Values  % of Total Values  Total Zero Missing Values  % Total Zero Missing Values Data Type
last_name              0               2               40.0                          2                         40.0    object
Test2_Score            2               2               40.0                          4                         80.0   float64
first_name             0               1               20.0                          1                         20.0    object
age                    0               1               20.0                          1                         20.0   float64
sex                    0               1               20.0                          1                         20.0    object
Test1_Score            3               1               20.0                          4                         80.0   float64

If you want to keep it simple, you can use the following function to get the missing values as a percentage:

def missing(dff):
    print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))


missing(results)

Test2_Score    40.0
last_name      40.0
Test1_Score    20.0
sex            20.0
age            20.0
first_name     20.0
dtype: float64

Answer by Sunit Deogam

# Why is this code not running? It shows this error:

  File "<tokenize>", line 19
    return mis_val_table_ren_columns
    ^
IndentationError: unindent does not match any outer indentation level
# check number & percentage of missing value in the columns
def missing_values_table(df):
        mis_val = df.isnull().sum() #total missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df) #percentage of missing values
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1) #make a table with the results
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'}) #rename the columns
     # sort the table by percentage of missing value
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

        #print same summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")

          # return the dataframe with missing information
    return mis_val_table_ren_columns

 # missing values statistics
    missing_values = missing_values_table(data_df)
    missing_values.head()
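
The traceback points at the return line: the function body is indented by 8 spaces, but return is indented by 4, which matches no enclosing block. The two calls after the function are also indented although they belong at module level. A sketch of the same function with one consistent indentation width follows; the small data_df here is a made-up stand-in, since the original data is not shown.

```python
import numpy as np
import pandas as pd

# Check number and percentage of missing values in the columns
def missing_values_table(df):
    mis_val = df.isnull().sum()                          # total missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)  # percentage of missing values
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Keep columns with missing values, sorted by percentage
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Same indentation level as the rest of the function body
    return mis_val_table_ren_columns

# Hypothetical stand-in for data_df
data_df = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                        'y': ['a', 'b', 'c'],
                        'z': [np.nan, np.nan, 1.0]})

# Missing values statistics, called at module level (no indentation)
missing_values = missing_values_table(data_df)
print(missing_values.head())
```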