Find out the percentage of missing values in each column in the given dataset
Source: http://stackoverflow.com/questions/51070985/
This content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by Shaswata
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
percent= 100*(len(df.loc[:,df.isnull().sum(axis=0)>=1 ].index) / len(df.index))
print(round(percent,2))
The input is https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0 and the output should be:
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.24
Discount 0.65
Order_Quantity 0.65
Profit 0.65
Shipping_Cost 0.65
Product_Base_Margin 1.30
dtype: float64
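One likely reason the attempt in the question prints a single number: `df.loc[:, mask]` selects columns, but `.index` is still the row index, so both lengths are equal and the ratio comes out as 100. A minimal sketch on a small made-up DataFrame (the values here are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset.
df = pd.DataFrame({
    "Sales": [10.0, np.nan, 30.0, 40.0],
    "Profit": [1.0, 2.0, 3.0, 4.0],
})

# Boolean mask over columns: True where the column has at least one NaN.
has_missing = df.isnull().sum(axis=0) >= 1

# Selecting columns leaves the row index unchanged, so its length is still len(df).
subset = df.loc[:, has_missing]
percent = 100 * (len(subset.index) / len(df.index))
print(round(percent, 2))

# What the question asks for instead: one percentage per column.
per_column = df.isnull().mean() * 100
print(per_column.round(2))
```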
Answered by Engineero
How about this? I think I actually found something similar on here once before, but I'm not seeing it now...
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})
And if you want the missing percentages sorted, follow the above with:
missing_value_df.sort_values('percent_missing', inplace=True)
As mentioned in the comments, you may also be able to get by with just the first line in my code above, i.e.:
percent_missing = df.isnull().sum() * 100 / len(df)
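As a sketch of how the pieces above fit together, here is the same calculation on a small made-up DataFrame (the column names and values are assumptions). It reassigns the result of `sort_values` instead of using `inplace=True`; either should work.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a mix of numeric and text columns.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# Count of missing values per column, scaled to a percentage of the rows.
percent_missing = df.isnull().sum() * 100 / len(df)

missing_value_df = pd.DataFrame({
    "column_name": df.columns,
    "percent_missing": percent_missing,
})

# Largest share of missing values first.
missing_value_df = missing_value_df.sort_values("percent_missing", ascending=False)
print(missing_value_df)
```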
Answered by Scott Boston
Update: let's use mean with isnull:
df.isnull().mean() * 100
Output:
Ord_id 0.000000
Prod_id 0.000000
Ship_id 0.000000
Cust_id 0.000000
Sales 0.238124
Discount 0.654840
Order_Quantity 0.654840
Profit 0.654840
Shipping_Cost 0.654840
Product_Base_Margin 1.297774
dtype: float64
IIUC:
df.isnull().sum() / df.shape[0] * 100.00
Output:
Ord_id 0.000000
Prod_id 0.000000
Ship_id 0.000000
Cust_id 0.000000
Sales 0.238124
Discount 0.654840
Order_Quantity 0.654840
Profit 0.654840
Shipping_Cost 0.654840
Product_Base_Margin 1.297774
dtype: float64
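Both expressions in this answer agree because `isnull()` returns booleans, and the mean of a boolean column is the fraction of `True` values. A minimal sketch on made-up data (the values are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1 of 4 Sales values and 2 of 4 Discount values are missing.
df = pd.DataFrame({
    "Sales": [100.0, np.nan, 250.0, 80.0],
    "Discount": [0.1, np.nan, np.nan, 0.0],
})

mask = df.isnull()  # True where a value is missing

via_mean = mask.mean() * 100                  # fraction of True values, as a percentage
via_sum = mask.sum() / df.shape[0] * 100.00   # count of True values divided by row count

print(via_mean)
print(via_sum)
```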
Answered by RomanPerekhrest
To cover all missing values and round the results:
((df.isnull() | df.isna()).sum() * 100 / df.index.size).round(2)
The output:
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.24
Discount 0.65
Order_Quantity 0.65
Profit 0.65
Shipping_Cost 0.65
Product_Base_Margin 1.30
dtype: float64
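In pandas, `isnull` is an alias of `isna`, so the union `df.isnull() | df.isna()` should give the same mask as either call on its own. A small check on made-up data (the names and values are assumptions), which also shows that `None` and `NaT` count as missing:

```python
import numpy as np
import pandas as pd

# Hypothetical data with three kinds of missing marker: NaN, None, and NaT.
df = pd.DataFrame({
    "num": [1.0, np.nan, 3.0, 4.0],
    "txt": ["a", None, "c", None],
    "when": pd.to_datetime(["2020-01-01", None, "2020-01-03", "2020-01-04"]),
})

combined = df.isnull() | df.isna()

# The union adds nothing: both methods produce the same boolean mask.
same_mask = combined.equals(df.isna())

result = (combined.sum() * 100 / df.index.size).round(2)
print(same_mask)
print(result)
```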
Answered by Nitish Arora
The solution you're looking for is:
round(df.isnull().mean()*100,2)
This rounds the percentage to 2 decimal places.
Another way to do this is:
round((df.isnull().sum()*100)/len(df),2)
but this is not as efficient as using mean().
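The efficiency claim can be measured rather than assumed. The sketch below times both forms with `timeit` on randomly generated data (the shape, the roughly 5% missing rate, and the column names are assumptions); timings depend on the machine and the pandas version, and the difference may turn out to be negligible.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 10,000 rows x 10 columns with roughly 5% of entries missing.
rng = np.random.default_rng(0)
values = rng.random((10_000, 10))
values[values < 0.05] = np.nan
df = pd.DataFrame(values, columns=[f"col_{i}" for i in range(10)])

# Both forms from the answer; they should give the same rounded percentages.
via_mean = round(df.isnull().mean() * 100, 2)
via_sum = round((df.isnull().sum() * 100) / len(df), 2)

# Time 20 repetitions of each expression.
t_mean = timeit.timeit(lambda: df.isnull().mean() * 100, number=20)
t_sum = timeit.timeit(lambda: df.isnull().sum() * 100 / len(df), number=20)
print(f"mean: {t_mean:.4f}s  sum/len: {t_sum:.4f}s")
```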
Answered by GpandaM
If you have multiple DataFrames, the function below calculates the number and percentage of missing values in each column:
def miss_data(df):
    x = ['column_name', 'missing_data', 'missing_in_percentage']
    missing_data = pd.DataFrame(columns=x)
    columns = df.columns
    for col in columns:
        icolumn_name = col
        # Number of missing values in this column
        imissing_data = df[col].isnull().sum()
        # The same count as a percentage of the rows
        imissing_in_percentage = (df[col].isnull().sum()/df[col].shape[0])*100
        # Append one row per column to the summary table
        missing_data.loc[len(missing_data)] = [icolumn_name, imissing_data, imissing_in_percentage]
    print(missing_data)
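For several DataFrames, one option is a variant that returns the summary instead of printing it, so the results can be collected in a dict. This is a sketch, not the answer's own code: `missing_summary`, the `frames` dict, and its contents are hypothetical names and data.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and percentages as a DataFrame."""
    mask = df.isnull()
    summary = pd.DataFrame({
        "missing_data": mask.sum(),
        "missing_in_percentage": mask.mean() * 100,
    })
    # Move the column names out of the index into a regular column.
    return summary.rename_axis("column_name").reset_index()

# Hypothetical collection of DataFrames to inspect.
frames = {
    "orders": pd.DataFrame({"qty": [1, np.nan, 3, 4],
                            "price": [9.5, 8.0, np.nan, np.nan]}),
    "customers": pd.DataFrame({"name": ["Ann", None],
                               "city": ["Oslo", "Rome"]}),
}

summaries = {name: missing_summary(frame) for name, frame in frames.items()}
for name, summary in summaries.items():
    print(name)
    print(summary)
```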
Answered by naimur978
With the following code, you can get the percentage of missing values for every column. Just replace the name train_data with df in your case.
Input:
all_data_na = (train_data.isnull().sum() / len(train_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)
Output:
Missing Ratio
left_eyebrow_outer_end_x 68.435239
left_eyebrow_outer_end_y 68.435239
right_eyebrow_outer_end_y 68.279189
right_eyebrow_outer_end_x 68.279189
left_eye_outer_corner_x 67.839410
left_eye_outer_corner_y 67.839410
right_eye_inner_corner_x 67.825223
right_eye_inner_corner_y 67.825223
right_eye_outer_corner_x 67.825223
right_eye_outer_corner_y 67.825223
mouth_left_corner_y 67.811037
mouth_left_corner_x 67.811037
left_eyebrow_inner_end_x 67.796851
left_eyebrow_inner_end_y 67.796851
right_eyebrow_inner_end_y 67.796851
mouth_right_corner_x 67.796851
mouth_right_corner_y 67.796851
right_eyebrow_inner_end_x 67.796851
left_eye_inner_corner_x 67.782664
left_eye_inner_corner_y 67.782664
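The `drop(...)` step above removes the columns that have no missing values. A boolean mask expresses the same filter a little more directly; here is a sketch on a tiny made-up `train_data` (the column names and values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_data.
train_data = pd.DataFrame({
    "left_eye_x": [1.0, np.nan, np.nan, np.nan],
    "nose_tip_x": [1.0, 2.0, 3.0, 4.0],
    "mouth_x": [np.nan, 2.0, np.nan, 4.0],
})

all_data_na = train_data.isnull().mean() * 100

# Keep only columns with at least one missing value, largest ratio first.
all_data_na = all_data_na[all_data_na > 0].sort_values(ascending=False)

missing_data = pd.DataFrame({"Missing Ratio": all_data_na})
print(missing_data)
```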
Answered by Suhas_Pote
import numpy as np
import pandas as pd
raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', np.nan, np.nan, 'Milner', 'Cooze'],
            'age': [22, np.nan, 23, 24, 25],
            'sex': ['m', np.nan, 'f', 'm', 'f'],
            'Test1_Score': [4, np.nan, 0, 0, 0],
            'Test2_Score': [25, np.nan, np.nan, 0, 0]}
results = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'sex', 'Test1_Score', 'Test2_Score'])
results
  first_name last_name   age  sex  Test1_Score  Test2_Score
0      Jason    Miller  22.0    m          4.0         25.0
1        NaN       NaN   NaN  NaN          NaN          NaN
2       Tina       NaN  23.0    f          0.0          NaN
3       Jake    Milner  24.0    m          0.0          0.0
4        Amy     Cooze  25.0    f          0.0          0.0
You can use the following function, which returns a DataFrame with these columns:
- Zero Values
- Missing Values
- % of Total Values
- Total Zero Missing Values
- % Total Zero Missing Values
- Data Type
Just copy and paste the following function and call it with your pandas DataFrame:
def missing_zero_values_table(df):
    zero_val = (df == 0.00).astype(int).sum(axis=0)
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
    mz_table = mz_table.rename(
        columns={0: 'Zero Values', 1: 'Missing Values', 2: '% of Total Values'})
    mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
    mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
    mz_table['Data Type'] = df.dtypes
    # Keep only columns with at least one missing value, sorted by percentage
    mz_table = mz_table[
        mz_table.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"
          "There are " + str(mz_table.shape[0]) +
          " columns that have missing values.")
    # mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index = False)
    return mz_table

missing_zero_values_table(results)
Output
Your selected dataframe has 6 columns and 5 Rows.
There are 6 columns that have missing values.
             Zero Values  Missing Values  % of Total Values  Total Zero Missing Values  % Total Zero Missing Values  Data Type
last_name              0               2               40.0                          2                         40.0     object
Test2_Score            2               2               40.0                          4                         80.0    float64
first_name             0               1               20.0                          1                         20.0     object
age                    0               1               20.0                          1                         20.0    float64
sex                    0               1               20.0                          1                         20.0     object
Test1_Score            3               1               20.0                          4                         80.0    float64
If you want to keep it simple, you can use the following function to get the missing values as percentages:
def missing(dff):
    print(round((dff.isnull().sum() * 100 / len(dff)), 2).sort_values(ascending=False))

missing(results)
Test2_Score 40.0
last_name 40.0
Test1_Score 20.0
sex 20.0
age 20.0
first_name 20.0
dtype: float64
Answered by Sunit Deogam
Why is this code not running? It shows this error:
  File "<tokenize>", line 19
    return mis_val_table_ren_columns
    ^
IndentationError: unindent does not match any outer indentation level
# check number & percentage of missing values in the columns
# (indented consistently with 4 spaces; a line whose indentation matches no
# enclosing block is what triggers the IndentationError above)
def missing_values_table(df):
    mis_val = df.isnull().sum()  # total missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)  # percentage of missing values
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)  # make a table with the results
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})  # rename the columns
    # keep columns with missing values and sort the table by percentage of missing values
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    # print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    # return the dataframe with missing information
    return mis_val_table_ren_columns

# missing values statistics
missing_values = missing_values_table(data_df)
missing_values.head()