pandas read_html ValueError: No tables found
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not the translator): StackOverflow.
Original: http://stackoverflow.com/questions/53398785/
Asked by Noman Bashir
I am trying to scrape historical weather data from the Weather Underground page "https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html". I have the following code:
import pandas as pd
page_link = 'https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html'
df = pd.read_html(page_link)
print(df)
I get the following error:
Traceback (most recent call last):
  File "weather_station_scrapping.py", line 11, in <module>
    result = pd.read_html(page_link)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 987, in read_html
    displayed_only=displayed_only)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 815, in _parse
    raise_with_traceback(retained)
  File "/anaconda3/lib/python3.6/site-packages/pandas/compat/__init__.py", line 403, in raise_with_traceback
    raise exc.with_traceback(traceback)
ValueError: No tables found
Although this page clearly has a table, it is not picked up by read_html. I have tried using Selenium so that the page can load before I read it:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html")
elem = driver.find_element_by_id("history_table")
head = elem.find_element_by_tag_name('thead')
body = elem.find_element_by_tag_name('tbody')
list_rows = []
for items in body.find_element_by_tag_name('tr'):
    list_cells = []
    for item in items.find_elements_by_tag_name('td'):
        list_cells.append(item.text)
    list_rows.append(list_cells)
driver.close()
Now, the problem is that it cannot find "tr". I would appreciate any suggestions.
Accepted answer by QHarr
You can use requests and avoid opening a browser.
You can get the current conditions from the stationlookup endpoint (the URL used in the code below), stripping 'jQuery1720724027235122559_1542743885014(' from the left and ')' from the right. Then parse the JSON string.
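The response is JSONP, i.e. JSON wrapped in a jQuery callback. As a minimal sketch of the stripping step (the strip_jsonp helper below is not part of the original answer), you can slice between the first '(' and the last ')' rather than hard-coding the callback name; the same approach works for the history endpoint described next:

import requests
import json

# Hypothetical helper: removes a JSONP wrapper by slicing between the
# first '(' and the last ')'; anything after the closing ')' (such as a
# trailing ';') is discarded as well.
def strip_jsonp(text):
    return text[text.index('(') + 1 : text.rindex(')')]

url = 'https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15'
res = requests.get(url)
data = json.loads(strip_jsonp(res.text))  # a plain Python dict from here on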
You can get the summary and history by calling the history API, then stripping 'jQuery1720724027235122559_1542743885015(' from the front and ');' from the right. You then have a JSON string you can parse.
Sample of JSON:
You can find these URLs by using the F12 dev tools in your browser and inspecting the network tab for the traffic created during page load.
An example for current conditions, noting there seems to be a problem with nulls in the JSON, so I am replacing them with "placeholder":
import requests
import pandas as pd
import json
from pandas.io.json import json_normalize
from bs4 import BeautifulSoup

url = 'https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
# Note: str.strip treats its argument as a set of characters, not a prefix/suffix;
# it works here because the JSON itself starts with '{' and ends with '}'.
s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
s = s.replace('null', '"placeholder"')  # work around nulls in the response
data = json.loads(s)
data = json_normalize(data)
df = pd.DataFrame(data)
print(df)
Answered by G. Anderson
Here's a solution using selenium for browser automation
from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome(chromedriver)  # chromedriver: path to your ChromeDriver executable
driver.implicitly_wait(30)  # wait up to 30 s for elements to appear
driver.get('https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html')
df = pd.read_html(driver.find_element_by_id("history_table").get_attribute('outerHTML'))[0]
Time Temperature Dew Point Humidity Wind Speed Gust Pressure Precip. Rate. Precip. Accum. UV Solar
0 12:02 AM 25.5 °C 18.7 °C 75 % East 0 kph 0 kph 29.3 hPa 0 mm 0 mm 0 0 w/m2
1 12:07 AM 25.5 °C 19 °C 76 % East 0 kph 0 kph 29.31 hPa 0 mm 0 mm 0 0 w/m2
2 12:12 AM 25.5 °C 19 °C 76 % East 0 kph 0 kph 29.31 hPa 0 mm 0 mm 0 0 w/m2
3 12:17 AM 25.5 °C 18.7 °C 75 % East 0 kph 0 kph 29.3 hPa 0 mm 0 mm 0 0 w/m2
4 12:22 AM 25.5 °C 18.7 °C 75 % East 0 kph 0 kph 29.3 hPa 0 mm 0 mm 0 0 w/m2
Editing with a breakdown of exactly what's happening, since the above one-liner is not very good self-documenting code:
After setting up the driver, we select the table by its ID value (thankfully, this site actually uses reasonable and descriptive IDs):
tab=driver.find_element_by_id("history_table")
Then, from that element, we get the HTML instead of the web-driver element object:
tab_html=tab.get_attribute('outerHTML')
We use pandas to parse the HTML:
tab_dfs=pd.read_html(tab_html)
From the docs:
"read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content"
So we index into that list with the only table we have, at index zero:
df=tab_dfs[0]
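Once the table has been read, the browser session can be shut down. This cleanup step is not in the original answer, but it is standard Selenium practice:

driver.quit()  # close the browser and end the WebDriver session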