How to find spans with a specific class containing specific text using Beautiful Soup and re?
Source: http://stackoverflow.com/questions/16248723/
Notice: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not to this page), cite the source URL, and keep the same CC BY-SA license.
Asked by user1063287
How can I find all spans with a class of 'blue' that contain text in the format:
04/18/13 7:29pm
which could therefore be:
04/18/13 7:29pm
or:
Posted on 04/18/13 7:29pm
In terms of constructing the logic to do this, this is what I have so far:
new_content = original_content.find_all('span', {'class' : 'blue'}) # using beautiful soup's find_all
pattern = re.compile('<span class=\"blue\">[data in the format 04/18/13 7:29pm]</span>') # using re
for _ in new_content:
    result = re.findall(pattern, _)
    print result
I've been referring to https://stackoverflow.com/a/7732827 and https://stackoverflow.com/a/12229134 to try to figure out a way to do this, but the above is all I have so far.
Edit:
To clarify the scenario, there are spans like:
<span class="blue">here is a lot of text that i don't need</span>
and
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
Note that I only need 04/18/13 7:29pm, not the rest of the content.
Edit 2:
I also tried:
pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
for _ in new_content:
    result = re.findall(pattern, _)
    print result
and got the error:
'TypeError: expected string or buffer'
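A note on this error: find_all returns Beautiful Soup Tag objects, while re.findall expects a string, which is most likely why the loop above raises TypeError (on Python 3 the message reads "expected string or bytes-like object"). The following is a minimal sketch, assuming bs4 is installed and using the built-in html.parser. The sample markup and the names raised, from_text, and from_markup are illustrative and not from the original post.

```python
import re
from bs4 import BeautifulSoup

html_doc = '<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>'
soup = BeautifulSoup(html_doc, "html.parser")
tag = soup.find("span", {"class": "blue"})  # a bs4 Tag, not a str

pattern = re.compile(r"\d\d/\d\d/\d\d \d\d?:\d\d[ap]m")

# Passing the Tag itself reproduces the TypeError from the question.
try:
    re.findall(pattern, tag)
    raised = False
except TypeError:
    raised = True

# Passing text works: either the element's text or its serialized markup.
from_text = pattern.findall(tag.get_text())
from_markup = pattern.findall(str(tag))

print(raised, from_text, from_markup)
```

Matching against get_text() is usually the simpler choice, since the pattern then does not need to describe the surrounding tag at all.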
Accepted answer by Corey Goldberg
import re
from bs4 import BeautifulSoup
html_doc = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
</body>
</html>
"""
# parse the html
soup = BeautifulSoup(html_doc, 'html.parser')
# find a list of all span elements
spans = soup.find_all('span', {'class' : 'blue'})
# create a list of lines corresponding to element texts
lines = [span.get_text() for span in spans]
# collect the dates from the list of lines using regex matching groups
found_dates = []
for line in lines:
    m = re.search(r'(\d{2}/\d{2}/\d{2} \d+:\d+[a|p]m)', line)
    if m:
        found_dates.append(m.group(1))
# print the dates we collected
for date in found_dates:
    print(date)
Output:
04/18/13 7:29pm
04/19/13 7:30pm
04/20/13 10:31pm
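One detail in the pattern above may be worth knowing: inside square brackets, | is a literal character rather than alternation, so [a|p]m also accepts "|m", and \d+:\d+ accepts any number of digits on either side of the colon. The sketch below compares the pattern as written with a tighter variant. The tighter pattern, the sample strings, and the helper name dates_in are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Pattern as written in the answer: "|" inside [...] is a literal character.
loose = re.compile(r'(\d{2}/\d{2}/\d{2} \d+:\d+[a|p]m)')

# Tighter variant (assumed intent): 1-2 hour digits, 2 minute digits, am or pm.
tight = re.compile(r'(\d{2}/\d{2}/\d{2} \d{1,2}:\d{2}[ap]m)')

samples = [
    'Posted on 04/20/13 10:31pm',
    '04/18/13 7:29|m',  # malformed on purpose
]

def dates_in(pattern, texts):
    """Collect every group-1 match of pattern across the given strings."""
    return [hit for text in texts for hit in pattern.findall(text)]

loose_hits = dates_in(loose, samples)
tight_hits = dates_in(tight, samples)

print(loose_hits)
print(tight_hits)
```

On the sample spans in the answer both patterns give the same three dates; the difference only shows up on malformed input.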
Answer by Nolen Royalty
This pattern seems to satisfy what you're looking for:
>>> import re
>>> pattern = re.compile(r'<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
>>> pattern.match('<span class="blue">here is a lot of text that i dont need</span>')
>>> pattern.match('<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>').groups()
('04/18/13 7:29pm',)
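Note that pattern.match only tests at the start of the string, so it fits when each span's markup is passed in on its own. If the markup is available as one plain string, findall can scan all of it in a single call. This is a rough sketch using only the re module; the markup variable is a hypothetical stand-in for the page source. It relies on the exact attribute quoting and on each span sitting on one line (. does not match newlines by default), so parsing with Beautiful Soup first is likely the more robust route.

```python
import re

# Same pattern as above, written as a raw string.
pattern = re.compile(r'<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')

# Hypothetical page source held as a plain string.
markup = """
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
"""

# findall returns the captured group for every match in the string.
dates = pattern.findall(markup)
print(dates)
```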
Answer by pradyunsg
This is a flexible regex that you can use:
"(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[a|p|A|P][m|M])"
Example:
>>> import re
>>> from bs4 import BeautifulSoup
>>> html = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">04/18/13 7:29pm</span>
<span class="blue">Posted on 15/18/2013 10:00AM</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
<span class="blue">Posted on 4/1/2013 17:09aM</span>
</body>
</html>
"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> lines = [i.get_text() for i in soup.find_all('span', {'class' : 'blue'})]
>>> ok = [m.group(1)
          for line in lines
          for m in (re.search(r'(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[a|p|A|P][m|M])', line),)
          if m]
>>> ok
['04/18/13 7:29pm', '04/19/13 7:30pm', '04/18/13 7:29pm', '15/18/2013 10:00AM', '04/20/13 10:31pm', '4/1/2013 17:09aM']
>>> for i in ok:
        print(i)
04/18/13 7:29pm
04/19/13 7:30pm
04/18/13 7:29pm
15/18/2013 10:00AM
04/20/13 10:31pm
4/1/2013 17:09aM
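The flexible pattern checks the shape of the text only, so it also returns strings that are not real timestamps, such as 15/18/2013 10:00AM and 4/1/2013 17:09aM in the output above. If that matters, one option is to pass each match through datetime.strptime. The sketch below assumes month/day/year order and a 12-hour clock, which is a guess based on the examples in the question; the names FLEXIBLE, FORMATS, and parse_timestamp are made up for illustration.

```python
import re
from datetime import datetime

# The flexible pattern from the answer above, unchanged.
FLEXIBLE = re.compile(r'(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[a|p|A|P][m|M])')

# Assumed layouts: month/day/year with a 2- or 4-digit year, 12-hour clock.
FORMATS = ('%m/%d/%y %I:%M%p', '%m/%d/%Y %I:%M%p')

def parse_timestamp(text):
    """Return a datetime for the first timestamp-like match in text, or None."""
    m = FLEXIBLE.search(text)
    if not m:
        return None
    # Collapse any run of whitespace between date and time to one space.
    candidate = ' '.join(m.group(1).split())
    for fmt in FORMATS:
        try:
            return datetime.strptime(candidate, fmt)
        except ValueError:
            continue  # try the next assumed layout
    return None

print(parse_timestamp('Posted on 04/20/13 10:31pm'))
print(parse_timestamp('Posted on 15/18/2013 10:00AM'))
```

Returning None for anything strptime rejects keeps the regex as a loose first filter and leaves the decision about what counts as a valid date to the datetime module.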

