Loading web data file to pandas dataframe
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it under the same license, but you must attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/43422692/
Asked by Wookeun Lee
I would like to load a .csv file from the web and convert it into a pandas.DataFrame.
Here's my target page, where I want to find a .csv file:
https://vincentarelbundock.github.io/Rdatasets/datasets.html
How can I load the .csv file for a given item from the webpage and convert it into a pandas.DataFrame?
In addition, it would be great if I could also get the addresses of the .csv files from the web page.
This would allow me to create a function that converts an item name from the target page into the .csv file address, like:
def data(item):
    # Build the CSV address from the item name and return it.
    file = 'https://vincentarelbundock.github.io/Rdatasets/csv/datasets/' + str(item) + '.csv'
    return file
However, the addresses of the CSV files on the webpage do not all follow the same pattern.
For example,
https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Cuckoo.csv
https://vincentarelbundock.github.io/Rdatasets/csv/datasets/cars.csv
Quite a lot of the files are in different directories, so I need to search for an item and get the address of the corresponding CSV file.
Answered by Stephen Rauch
Pandas can read the CSV directly from the HTTP link.
Example:
import pandas as pd

df = pd.read_csv(
    'https://vincentarelbundock.github.io/Rdatasets/'
    'csv/datasets/OrchardSprays.csv')
print(df)
Results:
Unnamed: 0 decrease rowpos colpos treatment
0 1 57 1 1 D
1 2 95 2 1 E
.. ... ... ... ... ...
62 63 3 7 8 A
63 64 19 8 8 C
[64 rows x 5 columns]
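A side note, not part of the original answer: the "Unnamed: 0" column is the R row index carried over into the CSV. Passing index_col=0 (a standard pandas.read_csv option) absorbs it into the DataFrame index:

df = pd.read_csv(
    'https://vincentarelbundock.github.io/Rdatasets/'
    'csv/datasets/OrchardSprays.csv',
    index_col=0)  # use the first (unnamed) column as the index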
Getting links via scraping:
To get the links themselves from the front page, we can also use pandas to do the web scraping. Something like:
import pandas as pd

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
url = base_url + 'datasets.html'
df = pd.read_html(url, attrs={'class': 'dataframe'},
                  header=0, flavor='html5lib')[0]
This will return the data from the table on the page. Unfortunately for our purposes here, it is not enough, because pandas grabs the text on the page, not the links.
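Before monkey patching, note a more conventional alternative (my addition, not part of the original answer): parse the links directly with BeautifulSoup. A minimal sketch, assuming requests and beautifulsoup4 are installed, and assuming (as the patch below also does) that each CSV link's visible text is 'CSV' and its href is relative to base_url; the item-name mapping further assumes the file-name stem matches the item name:

import requests
from bs4 import BeautifulSoup

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
html = requests.get(base_url + 'datasets.html').text
soup = BeautifulSoup(html, 'html.parser')

# Map an item name to its full CSV URL, assuming the file-name stem
# (e.g. 'cars' in 'csv/datasets/cars.csv') is the item name.
item_urls = {a['href'].rsplit('/', 1)[-1][:-4]: base_url + a['href']
             for a in soup.find_all('a')
             if a.text.strip() == 'CSV' and a.has_attr('href')}

print(item_urls.get('cars'))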
Monkey patching the scraper to get links:
To get the URLs, we can monkey patch the library like:
def _text_getter(self, obj):
    # Replacement for the parser's text extractor: for cells whose text
    # is 'CSV' or 'DOC', return the full link address instead of the text.
    text = obj.text
    if text.strip() in ('CSV', 'DOC'):
        try:
            text = base_url + obj.find('a')['href']
        except (TypeError, KeyError):
            pass
    return text

# Patch the private html5lib-based parser that read_html uses.
from pandas.io.html import _BeautifulSoupHtml5LibFrameParser as bsp
bsp._text_getter = _text_getter
Test Code:
import pandas as pd

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
url = base_url + 'datasets.html'
df = pd.read_html(url, attrs={'class': 'dataframe'},
                  header=0, flavor='html5lib')[0]

for row in df.head().iterrows():
    print('%-14s: %s' % (row[1].Item, row[1].csv))
Results:
AirPassengers: https://vincentarelbundock.github.io/Rdatasets/csv/datasets/AirPassengers.csv
BJsales : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/BJsales.csv
BOD : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/BOD.csv
CO2 : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/CO2.csv
Formaldehyde : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/Formaldehyde.csv
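Tying this back to the data(item) function from the question, a minimal sketch (my addition, assuming the patched read_html call above has produced df with its 'Item' and 'csv' columns):

def data(item):
    # Look up the item's CSV address in the scraped table and load it.
    match = df.loc[df.Item == item, 'csv']
    if match.empty:
        raise KeyError('no dataset named %r' % item)
    return pd.read_csv(match.iloc[0])

print(data('cars').head())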

