Loading web data file to pandas dataframe
Warning: this page is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA terms, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/43422692/
Asked by Wookeun Lee
I would like to load a .csv file from the web and convert it into a pandas.DataFrame.
Here's my target page, where I want to find a .csv file:
https://vincentarelbundock.github.io/Rdatasets/datasets.html
How can I load the .csv file of a corresponding item from the webpage and convert it into a pandas.DataFrame?
In addition, it would be great if I could also get the addresses of the .csv files from the web page.
This would allow me to create a function that converts an item name from the target page into the .csv file address, like:
def data(item):
    file = 'https://vincentarelbundock.github.io/Rdatasets/csv/datasets/' + str(item) + '.csv'
    return file
However, the addresses of the csv files on the webpage do not all follow the same pattern.
For example,
https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Cuckoo.csv
https://vincentarelbundock.github.io/Rdatasets/csv/datasets/cars.csv
Quite a lot of the files are in different directories, so I need to search for the items and get the address of the corresponding csv file.
Answered by Stephen Rauch
Pandas can read the csv directly from the http link:
Example:
df = pd.read_csv(
'https://vincentarelbundock.github.io/Rdatasets/'
'csv/datasets/OrchardSprays.csv')
print(df)
Results:
Unnamed: 0 decrease rowpos colpos treatment
0 1 57 1 1 D
1 2 95 2 1 E
.. ... ... ... ... ...
62 63 3 7 8 A
63 64 19 8 8 C
[64 rows x 5 columns]
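A side note on the "Unnamed: 0" column in the output above: it holds R's row names, and read_csv's index_col parameter can absorb it into the index instead. A minimal sketch using an in-memory sample (read_csv treats file-like objects and URLs the same way; the two rows below are copied from the output above, not re-fetched):

```python
import io

import pandas as pd

# Small in-memory stand-in for the web file: same header shape, first two rows.
sample = io.StringIO(
    '"",decrease,rowpos,colpos,treatment\n'
    '1,57,1,1,D\n'
    '2,95,2,1,E\n'
)

# index_col=0 turns the unnamed first column into the index
# instead of an "Unnamed: 0" data column.
df = pd.read_csv(sample, index_col=0)
print(df.shape)  # (2, 4)
```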
Getting links via scraping:
To get the links themselves from the front page, we can also use pandas to do web scraping for the data. Something like:
import pandas as pd

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
url = base_url + 'datasets.html'
df = pd.read_html(url, attrs={'class': 'dataframe'},
                  header=0, flavor='html5lib')[0]
This will return the data from the table on the page. Unfortunately for our purposes here, it does not work, because pandas grabs the text on the page, not the links.
Monkey Patching the scraper to get Links:
To get the URLs, we can monkey patch the library like:
def _text_getter(self, obj):
    # Replace the 'CSV' / 'DOC' cell text with the full URL of the link it wraps
    text = obj.text
    if text.strip() in ('CSV', 'DOC'):
        try:
            text = base_url + obj.find('a')['href']
        except (TypeError, KeyError):
            pass
    return text

# Note: _BeautifulSoupHtml5LibFrameParser is a private pandas class,
# so this patch is tied to the pandas version in use.
from pandas.io.html import _BeautifulSoupHtml5LibFrameParser as bsp
bsp._text_getter = _text_getter
Test Code:
import pandas as pd

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
url = base_url + 'datasets.html'
df = pd.read_html(url, attrs={'class': 'dataframe'},
                  header=0, flavor='html5lib')[0]
for row in df.head().iterrows():
    print('%-14s: %s' % (row[1].Item, row[1].csv))
Results:
AirPassengers: https://vincentarelbundock.github.io/Rdatasets/csv/datasets/AirPassengers.csv
BJsales : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/BJsales.csv
BOD : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/BOD.csv
CO2 : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/CO2.csv
Formaldehyde : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/Formaldehyde.csv
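Since the monkey patch above depends on a private pandas class, a version-independent alternative is to collect the .csv hrefs directly with the standard library's html.parser. This is a minimal sketch, not the answer's method; the table excerpt fed to it is a small hypothetical sample shaped like the index page, not fetched live:

```python
from html.parser import HTMLParser


class CsvLinkParser(HTMLParser):
    """Collect every href ending in .csv, keyed by the file's base name."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if href.endswith('.csv'):
                item = href.rsplit('/', 1)[-1][:-len('.csv')]
                self.links[item] = href


base_url = 'https://vincentarelbundock.github.io/Rdatasets/'

# Hypothetical excerpt in the shape of the index page (in practice you
# would download datasets.html and feed its contents here instead).
sample = """
<table class="dataframe">
  <tr><td>cars</td><td><a href="csv/datasets/cars.csv">CSV</a></td></tr>
  <tr><td>Cuckoo</td><td><a href="csv/Stat2Data/Cuckoo.csv">CSV</a></td></tr>
</table>
"""

parser = CsvLinkParser()
parser.feed(sample)
# Prefix the relative hrefs to get full addresses, whatever directory they sit in.
links = {item: base_url + href for item, href in parser.links.items()}
print(links['Cuckoo'])  # https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Cuckoo.csv
```

Because the mapping is keyed by item name, this also handles the files that live outside the csv/datasets/ directory, which the fixed-pattern data() function from the question cannot.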