Python Beautiful Soup: using regular expressions to find tags?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/24748445/

Time: 2020-08-19 05:07:24  Source: igfitidea

Beautiful Soup Using Regex to Find Tags?

Tags: python, regex, web-scraping

Asked by user3314418

I'd really like Beautiful Soup to match any of a list of tags, like so. I know attrs accepts a regex, but is there anything in Beautiful Soup that lets you do the same for tag names?

soup.findAll("(a|div)")

Output:

<a> ASDFS
<div> asdfasdf
<a> asdfsdf

My goal is to create a scraper that can grab tables from sites. Sometimes tags are named inconsistently, and I'd like to be able to input a list of tags to name the 'data' part of a table.

Accepted answer by hwnd

find_all() is the most favored method in the Beautiful Soup search API.

You can pass a variety of filters. For instance, pass a list to find multiple tags:

>>> soup.find_all(['a', 'div']) 

Example:

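>>> from bs4 import BeautifulSoup
>>> # Name the parser explicitly so bs4 does not warn or pick one that varies by machine
>>> soup = BeautifulSoup('<html><body><div>asdfasdf</div><p><a>foo</a></p></body></html>', 'html.parser')
>>> soup.find_all(['a', 'div'])
[<div>asdfasdf</div>, <a>foo</a>]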
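Applied to the table-scraping goal in the question, the list filter might look like the sketch below. The sample HTML and the DATA_TAGS name are made up for illustration, and it assumes the "data" cells are td or th elements:

```python
from bs4 import BeautifulSoup

# Hypothetical sample table; real pages will differ.
html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Ann</td><td>90</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Hypothetical name: the tag names that count as a table "data" cell.
DATA_TAGS = ["td", "th"]

# For each row, collect the text of every cell whose tag is in the list.
rows = [
    [cell.get_text(strip=True) for cell in row.find_all(DATA_TAGS)]
    for row in soup.find_all("tr")
]
print(rows)
```

Changing DATA_TAGS is then enough to adapt the scraper to a site that names its cells differently.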

Or you can use a regular expression to find tags whose names contain a or div:

>>> import re
>>> soup.find_all(re.compile("(a|div)"))
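Because Beautiful Soup applies a compiled pattern to tag names with search(), an unanchored pattern such as "(a|div)" also matches names that merely contain an "a", for example table. A small sketch with made-up sample HTML, comparing the unanchored pattern with an anchored one:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document for illustration.
html = (
    "<html><body><div>one</div>"
    "<table><tr><td><a>two</a></td></tr></table>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Unanchored: matches any tag name containing "a" or "div".
loose = [tag.name for tag in soup.find_all(re.compile("(a|div)"))]

# Anchored: matches only tags named exactly "a" or "div".
strict = [tag.name for tag in soup.find_all(re.compile(r"^(a|div)$"))]

print(loose)   # includes "table" because its name contains an "a"
print(strict)
```

If only exact tag names are wanted, the anchored pattern or the plain list ['a', 'div'] is probably the safer choice.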

Answer by ZJS

Yes, see the docs:

http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html

import re

soup.findAll(re.compile("^a$|(div)"))

Answer by Manu CJ

Note that you can also use regular expressions to search in attributes of tags. For example:

import re
from bs4 import BeautifulSoup

soup.find_all('a', {'href': re.compile(r'crummy\.com/')})

This example finds all <a> tags whose href attribute contains the substring 'crummy.com/'.
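For a version that can be run on its own, here is a minimal sketch with made-up sample HTML (the example.com link and the link without an href are assumptions added for contrast):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup: one matching link, one non-matching, one without href.
html = """
<a href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a>
<a href="http://example.com/">Example</a>
<a>no href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only <a> tags whose href matches the pattern somewhere in its value.
links = soup.find_all("a", {"href": re.compile(r"crummy\.com/")})
hrefs = [link["href"] for link in links]
print(hrefs)
```

Tags that lack the attribute altogether are skipped, so indexing link["href"] on the results is safe here.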

(I know this is a very old post, but hopefully someone will find this additional information useful.)
