Extract the first paragraph from a Wikipedia article (Python)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4460921/

Date: 2020-08-18 15:51:30  Source: igfitidea


Tags: python, wikipedia

Asked by Alon Gubkin

How can I extract the first paragraph from a Wikipedia article, using Python?


For example, for Albert Einstein, that would be:


Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪnʃtaɪn] ( listen); 14 March 1879 – 18 April 1955) was a theoretical physicist, philosopher and author who is widely regarded as one of the most influential and iconic scientists and intellectuals of all time. A German-Swiss Nobel laureate, Einstein is often regarded as the father of modern physics.[2] He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect".[3]


Accepted answer by joksnet

Some time ago I made two classes to get Wikipedia articles in plain text. I know they aren't the best solution, but you can adapt them to your needs:


    wikipedia.py
    wiki2plain.py


You can use it like this:


from wikipedia import Wikipedia
from wiki2plain import Wiki2Plain

lang = 'simple'
wiki = Wikipedia(lang)

try:
    raw = wiki.article('Uruguay')
except Exception:
    raw = None

if raw:
    wiki2plain = Wiki2Plain(raw)
    content = wiki2plain.text

Answered by Johannes Charra

Try a combination of urllib to fetch the site and BeautifulSoup or lxml to parse the data.

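A minimal sketch of that approach, using the standard library's html.parser in place of BeautifulSoup/lxml so it runs without third-party dependencies (the div id and the sample HTML are illustrative; a real page would first be fetched with urllib):

```python
from html.parser import HTMLParser

class FirstParagraphParser(HTMLParser):
    """Collect the text of the first <p> inside <div id="bodyContent">."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.in_p = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("id", "bodyContent") in attrs:
            self.in_body = True
        elif tag == "p" and self.in_body and not self.done:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            self.done = True  # stop after the first paragraph

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)

def first_paragraph(page_html):
    parser = FirstParagraphParser()
    parser.feed(page_html)
    return "".join(parser.chunks).strip()

sample = '<div id="bodyContent"><p>Albert Einstein was a physicist.</p><p>More text.</p></div>'
print(first_paragraph(sample))  # Albert Einstein was a physicist.
```

BeautifulSoup or lxml would shorten this to a one-line `find` call, as a later answer shows.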

Answered by dheerosaur

If you want library suggestions, BeautifulSoup and urllib2 come to mind. Answered on SO before: Web scraping with Python.


I have tried urllib2 to get a page from Wikipedia, but it was 403 (Forbidden). MediaWiki provides an API for Wikipedia, supporting various output formats. I haven't used python-wikitools, but it may be worth a try: http://code.google.com/p/python-wikitools/


Answered by jaydel

First, I promise I am not being snarky.


Here's a previous question that might be of use: Fetch a Wikipedia article with Python


In it, someone suggests using the Wikipedia high-level API, which leads to this question:


Is there a Wikipedia API?


Answered by ViennaMike

As others have said, one approach is to use the MediaWiki API and urllib or urllib2. The code fragments below are part of what I used to extract what is called the "lead" section, which has the article abstract and the infobox. This will check if the returned text is a redirect instead of actual content, and also let you skip the infobox if present (in my case I used different code to pull out and format the infobox).


import urllib

contentBaseURL = 'http://en.wikipedia.org/w/index.php?title='

def getContent(title):
    URL = contentBaseURL + title + '&action=raw&section=0'
    f = urllib.urlopen(URL)
    rawContent = f.read()
    f.close()
    return rawContent

def getLeadSection(title):
    rawContent = getContent(title)

    # Check if a redirect was returned.  If so, go to the redirection target
    if rawContent.find('#REDIRECT') == 0:
        # extract the redirection title from "#REDIRECT[[Target]]"
        redirectStart = rawContent.find('[[') + 2
        redirectEnd = rawContent.find(']]', redirectStart)
        redirectTitle = rawContent[redirectStart:redirectEnd]
        print 'redirectTitle is: ', redirectTitle
        rawContent = getContent(redirectTitle)

    # Skip the Infobox by counting braces until the opening {{'s are balanced
    infoboxStart = rawContent.find("{{Infobox")   # Actually starts at the double {'s before "Infobox"
    count = 0
    infoboxEnd = 0
    for i, char in enumerate(rawContent[infoboxStart:-1]):
        if char == "{": count += 1
        if char == "}":
            count -= 1
            if count == 0:
                infoboxEnd = i + infoboxStart + 1
                break

    if infoboxEnd != 0:
        rawContent = rawContent[infoboxEnd:]

    return rawContent

You'll be getting back the raw text including wiki markup, so you'll need to do some clean up. If you just want the first paragraph, not the whole first section, look for the first new line character.

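That last step can be sketched as a small helper (a hypothetical function name; it assumes the wiki markup has already been cleaned up as described):

```python
def first_paragraph(section_text):
    """Return the first paragraph of a lead section."""
    text = section_text.strip()
    # Wikitext paragraphs are normally separated by a blank line; fall back
    # to the first newline if there is no blank line at all.
    for sep in ("\n\n", "\n"):
        if sep in text:
            return text.split(sep, 1)[0]
    return text

lead = "Albert Einstein was a theoretical physicist.\n\nHe developed relativity."
print(first_paragraph(lead))  # Albert Einstein was a theoretical physicist.
```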

Answered by Jens Timmerman

What I did is this:


import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

article= "Albert Einstein"
article = urllib.quote(article)

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')] #wikipedia needs this

resource = opener.open("http://en.wikipedia.org/wiki/" + article)
data = resource.read()
resource.close()
soup = BeautifulSoup(data)
print soup.find('div', id="bodyContent").p

Answered by goldsmith

I wrote a Python library that aims to make this very easy. Check it out at Github.


To install it, run


$ pip install wikipedia

Then to get the first paragraph of an article, just use the wikipedia.summary function.


>>> import wikipedia
>>> print wikipedia.summary("Albert Einstein", sentences=2)

prints


Albert Einstein (/ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪnʃtaɪn] ( listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist who developed the general theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). While best known for his mass–energy equivalence formula E = mc2 (which has been dubbed "the world's most famous equation"), he received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect".


As far as how it works, wikipedia makes a request to the Mobile Frontend Extension of the MediaWiki API, which returns mobile-friendly versions of Wikipedia articles. To be specific, by passing the parameters prop=extracts&exsectionformat=plain, the MediaWiki servers will parse the Wikitext and return a plain text summary of the article you are requesting, up to and including the entire page text. It also accepts the parameters exchars and exsentences, which, not surprisingly, limit the number of characters and sentences returned by the API.


Answered by Superdooperhero

Try pattern.


pip install pattern

from pattern.web import Wikipedia
article = Wikipedia(language="af").search('Kaapstad', throttle=10)
print article.string

Answered by skierpage

Wikipedia runs a MediaWiki extension that provides exactly this functionality as an API module. TextExtracts implements action=query&prop=extracts with options to return the first N sentences and/or just the introduction, as HTML or plain text.


Here's the API call you want to make, try it: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Albert%20Einstein&exintro=&exsentences=2&explaintext=&redirects=&formatversion=2


  • action=query&prop=extracts to request this info
  • (ex)sentences=2, (ex)intro=, (ex)plaintext are parameters to the module (see the first link for its API doc) asking for two sentences from the intro as plain text; leave off the latter for HTML.
  • redirects= (true) so if you ask for "titles=Einstein" you'll get the Albert Einstein page info
  • formatversion=2 for a cleaner format in UTF-8.

There are various libraries that wrap invoking the MediaWiki action API, such as the one in DGund's answer, but it's not too hard to make the API calls yourself.

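As a sketch of making the call yourself with only the standard library (the parameter set mirrors the example URL above, plus format=json; the actual fetch is left commented out because it needs network access):

```python
from urllib.parse import urlencode

# Same parameters as the example URL above, plus format=json.
params = {
    "action": "query",
    "prop": "extracts",
    "titles": "Albert Einstein",
    "exintro": "",        # only the lead section
    "exsentences": 2,     # first two sentences
    "explaintext": "",    # plain text instead of HTML
    "redirects": "",
    "formatversion": 2,
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)

# Fetching and reading the extract would then look like:
#   import json, urllib.request
#   with urllib.request.urlopen(url) as r:
#       data = json.load(r)
#   print(data["query"]["pages"][0]["extract"])
```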

Page info in search results discusses getting this text extract, along with getting a description and lead image for articles.


Answered by Husky

The relatively new REST API has a summary method that is perfect for this use, and does a lot of the things mentioned in the other answers here (e.g. removing wikicode). It even includes an image and geocoordinates if applicable.


Using the lovely requests module and Python 3:


import requests
r = requests.get("https://en.wikipedia.org/api/rest_v1/page/summary/Amsterdam")
page = r.json()
print(page["extract"]) # Returns 'Amsterdam is the capital and...'