pandas read_html ValueError: No tables found

Note: this page reproduces a popular StackOverflow Q&A under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/53398785/

Tags: python, html, pandas, parsing, web-scraping

Asked by Noman Bashir

I am trying to scrape the historical weather data from the "https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html" Weather Underground page. I have the following code:

import pandas as pd 

page_link = 'https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html'
df = pd.read_html(page_link)
print(df)

I get the following response:

Traceback (most recent call last):
 File "weather_station_scrapping.py", line 11, in <module>
  result = pd.read_html(page_link)
 File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 987, in read_html
  displayed_only=displayed_only)
 File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 815, in _parse raise_with_traceback(retained)
 File "/anaconda3/lib/python3.6/site-packages/pandas/compat/__init__.py", line 403, in raise_with_traceback
  raise exc.with_traceback(traceback)
ValueError: No tables found
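
A quick check (a sketch using requests, assuming the page still behaves this way) shows why: the table is built by JavaScript after the page loads, so the HTML that pandas actually downloads contains no <table> element.

import requests

page_link = ('https://www.wunderground.com/personal-weather-station/dashboard'
             '?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html')
html = requests.get(page_link).text
# expected: False, since the table is rendered client-side by JavaScript
print('<table' in html)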

Although this page clearly has a table, it is not being picked up by read_html. I have tried using Selenium so that the page can be loaded before I read it.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html")
elem = driver.find_element_by_id("history_table")

head = elem.find_element_by_tag_name('thead')
body = elem.find_element_by_tag_name('tbody')

list_rows = []

for items in body.find_element_by_tag_name('tr'):  # note: find_element (singular) returns one WebElement, not an iterable list
    list_cells = []
    for item in items.find_elements_by_tag_name('td'):
        list_cells.append(item.text)
    list_rows.append(list_cells)
driver.close()

Now, the problem is that it cannot find "tr". I would appreciate any suggestions.

Accepted answer by QHarr

You can use requests and avoid opening a browser.

You can get current conditions by using:

https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15

and strip 'jQuery1720724027235122559_1542743885014(' from the left and ')' from the right. Then handle the JSON string.

You can get the summary and history by calling the API with the following:

https://api-ak.wunderground.com/api/606f3f6977348613/history_20170201null/units:both/v:2.0/q/pws:KMAHADLE7.json?callback=jQuery1720724027235122559_1542743885015&_=1542743886276

You then need to strip 'jQuery1720724027235122559_1542743885015(' from the front and ');' from the right. You then have a JSON string you can parse.

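For example, a minimal sketch of that strip-then-parse step for the history endpoint (assuming the endpoint still responds in the same JSONP format):

import requests
import json

url = ('https://api-ak.wunderground.com/api/606f3f6977348613/history_20170201null/'
       'units:both/v:2.0/q/pws:KMAHADLE7.json'
       '?callback=jQuery1720724027235122559_1542743885015&_=1542743886276')
res = requests.get(url)

# slice off the JSONP padding: the 'jQuery...(' prefix and the ');' suffix
body = res.text
payload = body[body.find('(') + 1 : body.rfind(')')]
data = json.loads(payload)  # a plain dict you can inspect or flatten with json_normalize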

You can find these URLs by using the F12 dev tools in the browser and inspecting the Network tab for the traffic created during page load.

An example for current conditions, noting there seems to be a problem with nulls in the JSON, so I am replacing them with "placeholder":

import requests
import pandas as pd
import json
from pandas.io.json import json_normalize
from bs4 import BeautifulSoup

url = 'https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
# strip the JSONP callback padding; note that str.strip removes a *set* of
# characters from both ends, which works here only because the JSON body
# itself starts with '{' and ends with '}'
s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
s = s.replace('null', '"placeholder"')  # work around nulls in the response
data = json.loads(s)
data = json_normalize(data)  # flatten the nested JSON into columns
df = pd.DataFrame(data)
print(df)
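
As a side note, BeautifulSoup is not strictly required here: res.text is already plain text, so slicing off the callback padding directly (as in the history sketch above) works just as well.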

Answer by G. Anderson

Here's a solution using selenium for browser automation

from selenium import webdriver
import pandas as pd

# chromedriver should hold the path to your chromedriver executable
driver = webdriver.Chrome(chromedriver)
driver.implicitly_wait(30)  # wait up to 30 s for elements to appear

driver.get('https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html')
df = pd.read_html(driver.find_element_by_id("history_table").get_attribute('outerHTML'))[0]

Time    Temperature Dew Point   Humidity    Wind    Speed   Gust    Pressure  Precip. Rate. Precip. Accum.  UV  Solar
0   12:02 AM    25.5 °C 18.7 °C 75 %    East    0 kph   0 kph   29.3 hPa    0 mm    0 mm    0   0 w/m2
1   12:07 AM    25.5 °C 19 °C   76 %    East    0 kph   0 kph   29.31 hPa   0 mm    0 mm    0   0 w/m2
2   12:12 AM    25.5 °C 19 °C   76 %    East    0 kph   0 kph   29.31 hPa   0 mm    0 mm    0   0 w/m2
3   12:17 AM    25.5 °C 18.7 °C 75 %    East    0 kph   0 kph   29.3 hPa    0 mm    0 mm    0   0 w/m2
4   12:22 AM    25.5 °C 18.7 °C 75 %    East    0 kph   0 kph   29.3 hPa    0 mm    0 mm    0   0 w/m2

Editing with a breakdown of exactly what's happening, since the above one-liner is not very good self-documenting code:

After setting up the driver, we select the table by its ID value (thankfully, this site actually uses reasonable and descriptive IDs):

tab=driver.find_element_by_id("history_table")

Then, from that element, we get the HTML instead of the web driver element object:

tab_html=tab.get_attribute('outerHTML')

We use pandas to parse the HTML:

tab_dfs=pd.read_html(tab_html)

From the docs:

"read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content"

So we index into that list with the only table we have, at index zero:

df=tab_dfs[0]
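
Putting the breakdown back together as one runnable script (a sketch; it assumes chromedriver is on your PATH and uses the Selenium 3.x find_element_by_id API, which newer Selenium versions have removed):

from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.implicitly_wait(30)   # give the JavaScript time to render the table
driver.get('https://www.wunderground.com/personal-weather-station/dashboard'
           '?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html')

tab = driver.find_element_by_id("history_table")  # locate the table by its ID
tab_html = tab.get_attribute('outerHTML')         # raw HTML of the table
tab_dfs = pd.read_html(tab_html)                  # read_html returns a list of DataFrames
df = tab_dfs[0]                                   # take the only table, at index zero
driver.quit()
print(df.head())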