Python web scraping - how to identify the main content on a webpage

Note: this page is a Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not this site). Original question: http://stackoverflow.com/questions/4672060/

Date: 2020-08-18 16:51:18  Source: igfitidea

Web scraping - how to identify main content on a webpage

Tags: python, web-scraping, html-parsing, webpage

Asked by kefeizhou

Given a news article webpage (from any major news source such as the Times or Bloomberg), I want to identify the main article content on that page and throw out the other miscellaneous elements such as ads, menus, sidebars, and user comments.

What's a generic way of doing this that will work on most major news sites?

What are some good tools or libraries for data mining? (preferably Python-based)

Accepted answer by Amber

There's no way to do this that's guaranteed to work, but one strategy you might use is to try to find the element with the most visible text inside of it.

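A rough sketch of that idea with requests and BeautifulSoup (the tags skipped and the "direct text only" measure are my own assumptions, not part of the original answer):

import requests
from bs4 import BeautifulSoup

def densest_element(url):
    """Return the tag whose own (direct) text is longest - a crude stand-in for 'most visible text'."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that never hold readable article text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    best, best_len = None, 0
    for tag in soup.find_all(True):
        # Count only text that belongs directly to this tag, not to its children.
        direct_text = " ".join(s.strip() for s in tag.find_all(string=True, recursive=False))
        if len(direct_text) > best_len:
            best, best_len = tag, len(direct_text)
    return best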

Answered by nedk

It might be more useful to extract the RSS feeds (<link type="application/rss+xml" href="..."/>) on that page and parse the data in the feed to get the main content.

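A minimal sketch of pulling those <link> elements out of a page with BeautifulSoup (the helper name is mine; the resulting feed URL can then be handed to a feed parser):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def find_rss_feeds(page_url):
    # Look for <link type="application/rss+xml" href="..."/> in the page markup.
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("link", type="application/rss+xml")
    return [urljoin(page_url, link.get("href", "")) for link in links]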

Answered by Spacedman

I wouldn't try to scrape it from the web page - too many things could mess it up - but instead see which web sites publish RSS feeds. For example, the Guardian's RSS feed has most of the text from their leading articles:

http://feeds.guardian.co.uk/theguardian/rss

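A small sketch of reading that feed with the feedparser package, assuming the URL above still resolves (which fields carry the article body varies from feed to feed):

import feedparser

feed = feedparser.parse("http://feeds.guardian.co.uk/theguardian/rss")
for entry in feed.entries[:5]:
    print(entry.title)
    # Many feeds put the body text (or a long excerpt) in the summary/description field.
    print(entry.get("summary", "")[:200])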

I don't know if The Times (The London Times, not NY) has one because it's behind a paywall. Good luck with that...

Answered by gte525u

There are a number of ways to do it, but none will always work. Here are the two easiest:

  • If it's a known, finite set of websites: in your scraper, convert each URL from the normal URL to the print URL for a given site (this cannot really be generalized across sites).
  • Use the arc90 readability algorithm (the reference implementation is in JavaScript) http://code.google.com/p/arc90labs-readability/. The short version of this algorithm is that it looks for divs with p tags inside them. It will not work for some websites but is generally pretty good (see the Python sketch after this list).

Answered by PhilS

Another possibility for separating "real" content from noise is to measure the HTML density of the parts of an HTML page.

You will need a bit of experimentation with the thresholds to extract the "real" content, and I guess you could improve the algorithm by applying heuristics to specify the exact bounds of the HTML segment after having identified the interesting content.

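A crude sketch of the idea: score each div by the ratio of its text length to its raw markup length and keep the dense ones (the 0.5 threshold is an arbitrary assumption you would have to tune):

from bs4 import BeautifulSoup

def dense_blocks(html, threshold=0.5):
    # Return divs whose text-to-markup ratio exceeds the threshold.
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for div in soup.find_all("div"):
        markup_len = len(str(div))
        text_len = len(div.get_text(strip=True))
        if markup_len and text_len / markup_len > threshold:
            results.append(div)
    return results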

Update: Just found out the URL above does not work right now; here is an alternative link to a cached version on archive.org.

Answered by Cerin

A while ago I wrote a simple Python script for just this task. It uses a heuristic to group text blocks together based on their depth in the DOM. The group with the most text is then assumed to be the main content. It's not perfect, but works generally well for news sites, where the article is generally the biggest grouping of text, even if broken up into multiple div/p tags.

You'd use the script like this: python webarticle2text.py <url>

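This is not the author's script, but a rough illustration of the same depth-grouping idea: bucket every text node by how deep it sits in the DOM and keep the bucket holding the most text.

from collections import defaultdict
from bs4 import BeautifulSoup

def main_text_by_depth(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    buckets = defaultdict(list)
    for string in soup.find_all(string=True):
        if not string.strip():
            continue
        depth = len(list(string.parents))  # depth of this text node in the DOM
        buckets[depth].append(string.strip())
    if not buckets:
        return ""
    # The depth level holding the most total text is assumed to be the article body.
    best_depth = max(buckets, key=lambda d: sum(len(s) for s in buckets[d]))
    return "\n".join(buckets[best_depth])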

Answered by JordanBelf

Diffbot offers a free API (10,000 URLs) to do that. I don't know if that approach is what you are looking for, but it might help someone: http://www.diffbot.com/

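A hedged sketch of calling it with requests; the endpoint, parameter names, and response shape follow Diffbot's v3 Article API as I understand it and may have changed, so treat them as assumptions and check the current documentation:

import requests

def diffbot_article(url, token):
    # Diffbot's Article API returns structured JSON (title, text, author, ...) for a news page.
    resp = requests.get(
        "https://api.diffbot.com/v3/article",
        params={"token": token, "url": url},
        timeout=30,
    )
    data = resp.json()
    return data.get("objects", [{}])[0].get("text", "")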

Answered by asmaier

For a solution in Java, have a look at https://code.google.com/p/boilerpipe/:

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.

But there is also a Python wrapper for it available here:

https://github.com/misja/python-boilerpipe

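A minimal usage sketch of that wrapper, assuming the package and a Java runtime are installed (it drives the Java library through JPype) and using a placeholder URL:

from boilerpipe.extract import Extractor

# ArticleExtractor is the extraction strategy tuned for news article pages.
extractor = Extractor(extractor="ArticleExtractor", url="https://example.com/some-news-article")
print(extractor.getText())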

Answered by Mona Jalal

Check the following script. It is really amazing:

from newspaper import Article

URL = "https://www.ksat.com/money/philippines-stops-sending-workers-to-qatar"
article = Article(URL)
article.download()            # fetch the raw HTML
print(article.html)
article.parse()               # extract title, authors, body text, and media
print(article.authors)
print(article.publish_date)
#print(article.text)          # the extracted main article text
print(article.top_image)
print(article.movies)
article.nlp()                 # keyword and summary extraction (requires nltk data)
print(article.keywords)
print(article.summary)

More documentation can be found at http://newspaper.readthedocs.io/en/latest/ and https://github.com/codelucas/newspaper. You should install it using:

pip3 install newspaper3k