pandas 如何加载excel表并清理python中的数据?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/42035652/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 02:54:32  来源:igfitidea点击:

How to load an excel sheet and clean the data in python?

pythonpandasdataframedata-science

提问by prajval shet

Load the energy data from the file Energy Indicators.xls, which is a list of indicators of energy supply and renewable electricity production from the United Nations for the year 2013, and should be put into a DataFrame with the variable name of energy.

从文件Energy Indicators.xls中加载能源数据,该文件是联合国2013年能源供应和可再生电力生产指标列表,应放入变量名称为energy的DataFrame中。

Keep in mind that this is an Excel file, and not a comma separated values file. Also, make sure to exclude the footer and header information from the datafile. The first two columns are unneccessary, so you should get rid of them, and you should change the column labels so that the columns are:

请记住,这是一个 Excel 文件,而不是逗号分隔值文件。此外,请确保从数据文件中排除页脚和页眉信息。前两列是不必要的,所以你应该去掉它们,你应该更改列标签,使列是:

['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable'] Convert Energy Supply to gigajoules (there are 1,000,000 gigajoules in a petajoule). For all countries which have missing data (e.g. data with "...") make sure this is reflected as np.NaN values.

['国家'、'能源供应'、'人均能源供应'、'可再​​生能源百分比'] 将能源供应转换为千兆焦耳(1,000,000 亿焦耳)。对于所有缺少数据(例如带有“...”的数据)的国家/地区,请确保将其反映为 np.NaN 值。

Rename the following list of countries (for use in later questions): "Republic of Korea": "South Korea", "United States of America": "United States", "United Kingdom of Great Britain and Northern Ireland": "United Kingdom", "China, Hong Kong Special Administrative Region": "Hong Kong"

重命名以下国家/地区列表(用于后面的问题):“大韩民国”:“韩国”,“美利坚合众国”:“美国”,“大不列颠及北爱尔兰联合王国”:“美国王国”、“ CN 、香港特别行政区”:“香港”

There are also several countries with numbers and/or parenthesis in their name. Be sure to remove these, e.g. 'Bolivia (Plurinational State of)' should be 'Bolivia', 'Switzerland17' should be 'Switzerland'.

还有几个国家的名称中带有数字和/或括号。请务必删除这些,例如“Bolivia (Plurinational State of)”应为“Bolivia”,“Switzerland17”应为“Switzerland”。

Next, load the GDP data from the file world_bank.csv, which is a csv containing countries' GDP from 1960 to 2015 from World Bank. Call this DataFrame GDP. Make sure to skip the header, and rename the following list of countries: "Korea, Rep.": "South Korea", "Iran, Islamic Rep.": "Iran", "Hong Kong SAR, China": "Hong Kong"

接下来,从文件world_bank.csv 中加载GDP 数据,该文件是世界银行提供的包含1960 年至2015 年各国GDP 的csv。称之为 DataFrame GDP。确保跳过标题,并重命名以下国家/地区列表: "Korea, Rep.": "South Korea", "Iran, Islam Rep.": "Iran", "Hong Kong SAR, China": "Hong Kong ”

Finally, load the Sciamgo Journal and Country Rank data for Energy Engineering and Power Technology from the file scimagojr-3.xlsx, which ranks countries based on their journal contributions in the aforementioned area. Call this DataFrame ScimEn.

最后,从文件 scimagojr-3.xlsx 中加载能源工程和电力技术的 Sciamgo 期刊和国家排名数据,该文件根据国家在上述领域的期刊贡献进行排名。调用这个 DataFrame ScimEn。

Join the three datasets: GDP, Energy, and ScimEn into a new dataset (using the intersection of country names). Use only the last 10 years (2006-2015) of GDP data and only the top 15 countries by Scimagojr 'Rank' (Rank 1 through 15).

将三个数据集:GDP、Energy 和 ScimEn 加入一个新数据集(使用国家名称的交集)。仅使用过去 10 年(2006-2015)的 GDP 数据和 Scimagojr '排名'(排名 1 到 15)的前 15 个国家。

The index of this DataFrame should be the name of the country, and the columns should be ['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per Capita', '% Renewable', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015'].

这个DataFrame的索引应该是国家名称,列应该是['Rank', 'Documents', 'Citable documents', 'Citations', 'Self- citations', 'Citations per document', 'H指数”、“能源供应”、“人均能源供应”、“可再生能源百分比”、“2006”、“2007”、“2008”、“2009”、“2010”、“2011”、“2012”、“2013” ', '2014', '2015']。

This function should return a DataFrame with 20 columns and 15 entries.

此函数应返回一个包含 20 列和 15 个条目的 DataFrame。

I have tried the following code for this question, but it is returning only 12 rows instead of 15:

我已经针对这个问题尝试了以下代码,但它只返回 12 行而不是 15 行:

import pandas as pd

from pandas import ExcelWriter

from pandas import ExcelFile

pd.set_option('display.max_columns', None)

pd.set_option('display.max_rows', None)

Energy = pd.read_excel('Energy Indicators.xls')

Energy.drop(Energy.columns[[0,1]],axis=1,inplace=True)

Energy.columns=['Country','Energy Supply','Energy Supply per capita','% Renewable']

Energy['Energy Supply']*=1000000

Energy['Country'] = Energy['Country'].str.replace(r"\(.*\)","")

Energy['Country'] = Energy['Country'].str.replace("[0-9()]+$", "")

Energy.replace('Republic of Korea','South Korea', inplace = True)

Energy.replace('United States of America','United States', inplace = True)

Energy.replace('United Kingdom of Great Britain and Northern Ireland','United Kingdom', inplace = True)

Energy.replace('China, Hong Kong Special Administrative Region','Hong Kong', inplace = True)

import pandas as pd

GDP = pd.read_csv('world_bank.csv', index_col=0, header=None)

GDP = GDP.drop(['Data Source'])

GDP = GDP.dropna()

GDP = GDP.reset_index()

GDP.columns = GDP.iloc[0]

GDP.drop(GDP.index[[0,3]], inplace=True)

GDP = GDP.rename(columns={'Country Name': 'Country'})

GDP.replace(',','-', inplace=True)

GDP = GDP.replace('Korea, Rep.','South Korea')

GDP = GDP.replace('Iran, Islamic Rep.','Iran')

GDP = GDP.replace('Hong Kong SAR, China','Hong Kong')


import pandas as pd

from pandas import ExcelWriter

from pandas import ExcelFile

pd.set_option('display.max_columns', None)

pd.set_option('display.max_rows', None)

ScimEn = pd.read_excel('scimagojr-3.xlsx')


b = pd.merge(pd.merge(Energy,GDP,on='Country'),ScimEn,on='Country')

a = pd.merge(pd.merge(Energy,GDP,on='Country'),ScimEn,on='Country')

a = a.sort(['Rank'], ascending=[True])

a = a[a["Rank"] < 16]

a=a.rename(columns = {'2006.0':'abc'})

a.columns.values[53] = "2006"

a.columns.values[54] = "2007"

a.columns.values[55] = "2008"

a.columns.values[56] = "2009"

a.columns.values[57] = "2010"

a.columns.values[58] = "2011"

a.columns.values[59] = "2012"

a.columns.values[60] = "2013"

a.columns.values[61] = "2014"

a.columns.values[62] = "2015"


a = a[['Country','Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per capita', '% Renewable', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']]


a = a.set_index('Country')


def ans():

    return a

ans()

回答by sali333

import numpy as np 
import pandas as pd 


def energy():
    energy=pd.ExcelFile('Energy Indicators.xls').parse('Energy')
    energy=energy.iloc[16:243][['Environmental Indicators: Energy','Unnamed: 3','Unnamed: 4','Unnamed: 5']].copy()
    energy.columns=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']

    energy = energy.replace('...', np.nan)
    energy['Energy Supply']=energy['Energy Supply']*1000000

    energy = energy.replace("Republic of Korea", "South Korea")
    energy = energy.replace("United States of America", "United States")
    energy = energy.replace("United Kingdom of Great Britain and Northern Ireland","United Kingdom")
    energy = energy.replace("China, Hong Kong Special Administrative Region", "Hong Kong")

    energy['Country'] = energy['Country'].str.extract('(^[a-zA-Z\s]+)', expand=False).str.strip()   

    energy=energy.reset_index()
    energy=energy[['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']]
    return energy.iloc[43]

def GDP():
    GDP=pd.read_csv('world_bank.csv')
    s=(GDP.iloc[3].values)[:4].astype(str).tolist()+(GDP.iloc[3].values)[4:].astype(int).astype(str).tolist()
    GDP=GDP.iloc[4:]
    GDP.columns=s
    GDP=GDP[['Country Name','2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']]
    GDP.columns=['Country','2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']
    GDP=GDP.replace("Korea, Rep.", "South Korea",regex=False)

    GDP=GDP.replace("Iran, Islamic Rep.","Iran")

    GDP=GDP.replace("Hong Kong SAR, China","Hong Kong",regex=False)
    return GDP

def ScimEn():
    ScimEn=pd.ExcelFile('scimagojr-3.xlsx').parse('Sheet1')

    return ScimEn

def result():   
    e= energy()
    G=GDP()
    S=ScimEn()
    tdf=pd.merge(e,G,on='Country')
    tdf=pd.merge(tdf,S,on='Country')
    res = tdf.sort_values(by=['Rank'], inplace = True)
    res = tdf.head(15)
    res=res.set_index('Country', inplace=False)
    return res

回答by i.n.n.m

Please ask your question one by one. However, I came up with a solution for some of the questions you asked above. You can load excel file using ex = pd.ExcelFile('Yourfilename.xls')and then check your sheet name in excel file using ex.sheet_name. next you can use the following to read that sheet you have in your excel file,

请一一提出您的问题。但是,对于您上面提出的一些问题,我想出了一个解决方案。您可以使用加载 excel 文件ex = pd.ExcelFile('Yourfilename.xls'),然后使用ex.sheet_name. 接下来,您可以使用以下内容阅读您的 excel 文件中的表格,

en = ex.parse('sheetname', skiprows = 2, skip_footer =True,..)

then, you can replace the special characters using the following syntax

然后,您可以使用以下语法替换特殊字符

en.replace('$%^',np.NaN, inplace =True)

now you can check your dataframe's header and see and then check how many rows you still have to skip and other details.

现在您可以检查数据帧的标题,然后查看您仍然需要跳过的行数以及其他详细信息。

to drop columns, you can use the following syntax

要删除列,您可以使用以下语法

en.drop([dol for col in ['colname1', 'colname2', ...] if col in en], axis =1, inplace =True)

This is it for now from the big question. Work on this and if this does what you want, make is as the answer for the parts to read excel file, skip rows, replace NaN, and skip columns.

这就是现在的大问题。解决这个问题,如果这符合您的要求,make 是读取 excel 文件、跳过行、替换 NaN 和跳过列的部分的答案。

回答by user8235471

def answer_one():
    import pandas as pd
    energy=pd.read_excel('Energy Indicators.xls', skiprows=2)
    energy.columns=['a','b','Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
    del energy['a']
    del energy['b']
    energy['Energy Supply']*=1000000
    energy['Country'] = energy['Country'].str.replace(r"\(.*\)","")
    energy['Country'] = energy['Country'].str.replace("[0-9()]+$", "")
    energy.replace('Republic of Korea','South Korea', inplace = True)
    energy.replace('United States of America','United States', inplace = True)
    energy.replace('United Kingdom of Great Britain and Northern Ireland','United Kingdom', inplace = True)
    energy.replace('China, Hong Kong Special Administrative Region','Hong Kong', inplace = True)
    GDP=pd.read_csv('world_bank.csv',skiprows=4)
    GDP.replace('Korea, Rep.','South Korea')
    GDP.replace('Iran, Islamic Rep.','Iran')
    GDP.replace('Hong Kong SAR, China' , 'Hong Kong')
    ScimEn=pd.read_excel('scimagojr-3.xlsx')
    GDP.columns=['Country', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015']
    for i in ['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005']:
        del GDP[i]
    ScimEn=ScimEn[ScimEn['Rank']<16]
    x=pd.merge(GDP,ScimEn,how='inner',left_on='Country',right_on='Country')
    y=pd.merge(x,energy,how='inner',left_on='Country',right_on='Country')
    y=y.set_index('Country')
    del y['Country Code']
    del y['Indicator Name']
    del y['Indicator Code']
    return y

answer_one()