Scrape multiple pages with BeautifulSoup and Python
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/26497722/
Scrape multiple pages with BeautifulSoup and Python
Asked by Philip McQuitty
My code successfully scrapes the tr align=center tags from http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY and writes the td elements to a text file.
However, the site above has multiple pages available, and I would like to be able to scrape them as well.
For example, with the URL above, when I click the link to "page 2" the overall URL does NOT change. I looked at the page source and saw JavaScript code that advances to the next page.
How can my code be changed to scrape data from all the available listed pages?
My code that works for page 1 only:
import bs4
import requests
response = requests.get('http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY')
soup = bs4.BeautifulSoup(response.text)
soup.prettify()
acct = open("/Users/it/Desktop/accounting.txt", "w")
for tr in soup.find_all('tr', align='center'):
    stack = []
    for td in tr.findAll('td'):
        stack.append(td.text.replace('\n', '').replace('\t', '').strip())
    acct.write(", ".join(stack) + '\n')
Accepted answer by Jerome Montino
The trick here is to check the requests that come in and out of the page-change action when you click on the link to view the other pages. The way to check this is to use Chrome's inspection tool (by pressing F12) or to install the Firebug extension in Firefox. I will be using Chrome's inspection tool in this answer. See below for my settings.


Now, what we want to see is either a GET request to another page or a POST request that changes the page. While the tool is open, click on a page number. For a really brief moment, there will only be one request that appears, and it's a POST method. All the other elements will quickly follow and fill the page. See below for what we're looking for.


Click on the POST method above. It should bring up a sub-window of sorts that has tabs. Click on the Headers tab. This page lists the request headers, pretty much the identification stuff that the other side (the site, for example) needs from you to be able to connect (someone else can explain this much better than I do).
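If you want to poke at the same information from Python instead of the browser, requests keeps the headers it sent on the response object. A minimal sketch (not part of the original answer), just to show where that data lives:

import requests

response = requests.get("http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY")
# response.request is the prepared request that was actually sent; its headers
# are the same kind of identification info the DevTools Headers tab shows.
for name, value in response.request.headers.items():
    print(name + ": " + value)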
Whenever the URL has variables like page numbers, location markers, or categories, more often than not the site uses query strings. Long story short, it's similar to an SQL query (actually, it sometimes is an SQL query) that allows the site to pull the information you need. If this is the case, you can check the request headers for query string parameters. Scroll down a bit and you should find them.
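To make the query-string idea concrete, here is a minimal sketch of how those URL variables look when you let requests build the query string for you (the campId/termId/subjId values are the ones from the course URL above):

import requests

# The same variables that appear in the course URL, expressed as a params dict.
# requests encodes them into ?campId=1&termId=201501&subjId=ACCY for us.
params = {"campId": 1, "termId": 201501, "subjId": "ACCY"}
response = requests.get("http://my.gwu.edu/mod/pws/courses.cfm", params=params)
print(response.url)  # the full URL with the query string appended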


As you can see, the query string parameters match the variables in our URL. A little bit below that, you can see Form Data with pageNum: 2 beneath it. This is the key.
POST requests are more commonly known as form requests because these are the kind of requests made when you submit forms, log in to websites, and so on; basically, pretty much anything where you have to submit information. What most people don't notice is that POST requests have a URL that they follow. A good example of this is when you log in to a website and, very briefly, see your address bar morph into some sort of gibberish URL before settling on /index.html or some such.
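For illustration only (the answer below sticks to the query-string shortcut instead), this is roughly how you would replay that form request from Python by sending the observed Form Data as a POST body. Whether this particular server honours the page number sent this way is an assumption on my part:

import requests

url = "http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY"
# Replay the Form Data seen in DevTools (pageNum: 2) as a POST body.
# Assumption: the server accepts the page number as form data, the same way
# the browser submitted it when the page link was clicked.
response = requests.post(url, data={"pageNum": 2})
print(response.status_code)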
What the above paragraph basically means is that you can (but not always) append the form data to your URL and it will carry out the POST request for you on execution. To know the exact string you have to append, click on view source.


Test if it works by adding it to the URL.
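You can run the same sanity check from Python rather than the address bar. A small sketch, assuming page 2 returns a different set of table rows than page 1:

import requests
from bs4 import BeautifulSoup

base_url = "http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY"
# Append the form data as an extra query-string parameter and fetch page 2.
soup = BeautifulSoup(requests.get(base_url + "&pageNum=2").text)
# If these rows differ from the ones on the plain base_url, the trick works.
print(len(soup.find_all("tr", align="center")))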


Et voila, it works. Now, the real challenge: getting the last page automatically and scraping all of the pages. Your code is pretty much there. The only things remaining to be done are getting the number of pages, constructing a list of URLs to scrape, and iterating over them.
Modified code is below:
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

base_url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
r = rq.get(base_url)
soup = bsoup(r.text)

# Use regex to isolate only the links of the page numbers, the ones you click on.
page_count_links = soup.find_all("a", href=re.compile(r".*javascript:goToPage.*"))
try:  # Make sure there is more than one page, otherwise set to 1.
    num_pages = int(page_count_links[-1].get_text())
except IndexError:
    num_pages = 1

# Add 1 because of Python's range.
url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]

# Open the text file. Use with to save self from grief.
with open("results.txt", "wb") as acct:
    for url_ in url_list:
        print "Processing {}...".format(url_)
        r_new = rq.get(url_)
        soup_new = bsoup(r_new.text)
        for tr in soup_new.find_all('tr', align='center'):
            stack = []
            for td in tr.findAll('td'):
                stack.append(td.text.replace('\n', '').replace('\t', '').strip())
            acct.write(", ".join(stack) + '\n')
We use regular expressions to get the proper links. Then, using a list comprehension, we build a list of URL strings. Finally, we iterate over them.
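To see what that regular expression is actually matching: the page-number links carry javascript:goToPage hrefs, so the pattern keeps only those anchors and skips everything else. The sample href strings below are illustrative; the exact goToPage(2) form is an assumption about the page's markup:

import re

page_link_pattern = re.compile(r".*javascript:goToPage.*")

# A pagination link matches, an ordinary course link does not.
print(bool(page_link_pattern.match("javascript:goToPage(2)")))            # True
print(bool(page_link_pattern.match("courses.cfm?campId=1&subjId=ACCY")))  # False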
Results:
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=1...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=2...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=3...
[Finished in 6.8s]


Hope that helps.
EDIT:
Out of sheer boredom, I think I just created a scraper for the entire class directory. Also, I updated both the code above and the code below so they do not error out when only a single page is available.
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

spring_2015 = "http://my.gwu.edu/mod/pws/subjects.cfm?campId=1&termId=201501"
r = rq.get(spring_2015)
soup = bsoup(r.text)
classes_url_list = [c["href"] for c in soup.find_all("a", href=re.compile(r".*courses.cfm\?campId=1&termId=201501&subjId=.*"))]
print classes_url_list

# Open the text file. Use with to save self from grief.
with open("results.txt", "wb") as acct:
    for class_url in classes_url_list:
        base_url = "http://my.gwu.edu/mod/pws/{}".format(class_url)
        r = rq.get(base_url)
        soup = bsoup(r.text)
        # Use regex to isolate only the links of the page numbers, the ones you click on.
        page_count_links = soup.find_all("a", href=re.compile(r".*javascript:goToPage.*"))
        try:
            num_pages = int(page_count_links[-1].get_text())
        except IndexError:
            num_pages = 1
        # Add 1 because of Python's range.
        url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]
        # Scrape every page for this subject.
        for url_ in url_list:
            print "Processing {}...".format(url_)
            r_new = rq.get(url_)
            soup_new = bsoup(r_new.text)
            for tr in soup_new.find_all('tr', align='center'):
                stack = []
                for td in tr.findAll('td'):
                    stack.append(td.text.replace('\n', '').replace('\t', '').strip())
                acct.write(", ".join(stack) + '\n')

