Loading a web data file into a pandas DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must likewise follow CC BY-SA, link the original address, and attribute it to the original authors (not me): http://stackoverflow.com/questions/43422692/

Date: 2020-09-14 03:23:45  Source: igfitidea

Loading web data file to pandas dataframe

python pandas

Asked by Wookeun Lee

I would like to load a .csv file from the web and convert it into a pandas.DataFrame.

Here's my target page where I want to find a .csv file:

https://vincentarelbundock.github.io/Rdatasets/datasets.html

How can I load the .csv file of a corresponding item from the webpage and convert it into a pandas.DataFrame?

In addition, it would be great if I could also get the addresses of the .csv files from the web page.

This would allow me to create a function that converts an item name from the target page into the corresponding .csv file address, like:

def data(item):
    file = 'https://vincentarelbundock.github.io/Rdatasets/csv/datasets/' + str(item) + '.csv'
    return file
However, the addresses of the csv files on the webpage do not all follow the same pattern.

For example,

https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Cuckoo.csv 
https://vincentarelbundock.github.io/Rdatasets/csv/datasets/cars.csv

Quite a lot of the files are in different directories, so I need to search for an item and get the address of the corresponding csv file.

Answered by Stephen Rauch

Pandas can read the csv directly from the http link:

Example:

df = pd.read_csv(
    'https://vincentarelbundock.github.io/Rdatasets/'
    'csv/datasets/OrchardSprays.csv')
print(df)

Results:

    Unnamed: 0  decrease  rowpos  colpos treatment
0            1        57       1       1         D
1            2        95       2       1         E
..         ...       ...     ...     ...       ...
62          63         3       7       8         A
63          64        19       8       8         C

[64 rows x 5 columns]
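Incidentally, the `Unnamed: 0` column above is the R row index that Rdatasets writes as an unnamed first column; `read_csv`'s `index_col=0` parameter folds it into the DataFrame index instead. A small offline sketch (the `csv_text` string is a stand-in for the first rows of the downloaded file):

```python
import io
import pandas as pd

# Stand-in for the downloaded CSV text (first rows of OrchardSprays.csv)
csv_text = '"","decrease","rowpos","colpos","treatment"\n"1",57,1,1,"D"\n"2",95,2,1,"E"\n'

# index_col=0 uses the unnamed first column as the index
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df.columns.tolist())  # ['decrease', 'rowpos', 'colpos', 'treatment']
```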

Getting links via scraping:

To get the links themselves from the front page, we can also use pandas to do the web scraping. Something like:

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
url = base_url + 'datasets.html'

import pandas as pd
df = pd.read_html(url, attrs={'class': 'dataframe'},
                  header=0, flavor='html5lib')[0]

This will return the data in the table on the page. Unfortunately for our purposes here, it does not work, because pandas grabs the text on the page, not the links.

Monkey patching the scraper to get links:

To get the URLs, we can monkey patch the library like below. (Note that this patches a private pandas class, so it is tied to the pandas version the answer was written against.)

def _text_getter(self, obj):
    text = obj.text
    if text.strip() in ('CSV', 'DOC'):
        try:
            text = base_url + obj.find('a')['href']
        except (TypeError, KeyError):
            pass
    return text

from pandas.io.html import _BeautifulSoupHtml5LibFrameParser as bsp
bsp._text_getter = _text_getter

Test code:

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
url = base_url + 'datasets.html'

import pandas as pd
df = pd.read_html(url, attrs={'class': 'dataframe'},
                  header=0, flavor='html5lib')[0]

for row in df.head().iterrows():
    print('%-14s: %s' % (row[1].Item, row[1].csv))

Results:

AirPassengers: https://vincentarelbundock.github.io/Rdatasets/csv/datasets/AirPassengers.csv
BJsales      : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/BJsales.csv
BOD          : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/BOD.csv
CO2          : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/CO2.csv
Formaldehyde : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/Formaldehyde.csv
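Because the monkey patch above reaches into a private pandas class, it can stop working across pandas versions. As a rough, dependency-free alternative, the CSV addresses can be pulled out of the raw page HTML with the standard library's re module. The fragment below stands in for the real datasets.html; for live use, fetch the page with urllib.request first:

```python
import re

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'

# Stand-in fragment for the real page HTML
html = ('<td><a href="csv/datasets/cars.csv">CSV</a></td>'
        '<td><a href="csv/Stat2Data/Cuckoo.csv">CSV</a></td>')

# Collect every relative CSV path and prefix it with the site base URL
links = [base_url + path for path in re.findall(r'href="(csv/[^"]+\.csv)"', html)]
print(links[0])  # https://vincentarelbundock.github.io/Rdatasets/csv/datasets/cars.csv
```

A regex is of course cruder than a real HTML parser, but it avoids depending on pandas internals and works regardless of which directory a dataset lives in.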