Python: How to get plain text out of Wikipedia

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4452102/

Time: 2020-08-18 15:48:42  Source: igfitidea

How to get plain text out of Wikipedia

python, mediawiki, wikipedia, wikipedia-api, mediawiki-api

Asked by Wifi

I'd like to write a script that gets the Wikipedia description section only. That is, when I say

/wiki bla bla bla

it will go to the Wikipedia page for bla bla bla, get the following, and return it to the chatroom:

"Bla Bla Bla" is the name of a song made by Gigi D'Agostino. He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"

How can I do this?

Answered by Katriel

Use the MediaWiki API, which runs on Wikipedia. You will have to do some parsing of the data yourself.

For instance:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Bla%20Bla%20Bla

means

fetch (action=query) the content (rvprop=content) of the most recent revision of Bla Bla Bla (titles=Bla%20Bla%20Bla) in JSON format (format=json).

You will probably want to search for the query and use the first result, to handle spelling errors and the like.

Answered by abc def foo bar

You can try the BeautifulSoup HTML parsing library for Python, but you'll have to write a simple parser.

Answered by hippietrail

You can fetch just the first section using the API:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvsection=0&titles=Bla%20Bla%20Bla&rvprop=content

This will give you raw wikitext; you'll have to deal with templates and markup.

Or you can fetch the whole page rendered into HTML, which has its own pros and cons as far as parsing goes:

http://en.wikipedia.org/w/api.php?action=parse&prop=text&page=Bla_Bla_Bla

I can't see an easy way to get parsed HTML of the first section in a single call, but you can do it with two calls by passing the wikitext you receive from the first URL back with text= in place of page= in the second URL.

UPDATE

Sorry, I neglected the "plain text" part of your question. Get the part of the article you want as HTML. It's much easier to strip HTML than to strip wikitext!

Answered by Finn Årup Nielsen

"...a script that gets the Wikipedia description section only..."

For your application you might want to look at the dumps, e.g.: http://dumps.wikimedia.org/enwiki/20120702/

The particular files you need are 'abstract' XML files, e.g., this small one (22.7MB):

http://dumps.wikimedia.org/enwiki/20120702/enwiki-20120702-abstract19.xml

The XML has a tag called 'abstract' which contains the first part of each article.

Otherwise wikipedia2text uses, e.g., w3m to download the page with templates expanded and formatted to text. From that you might be able to pick out the abstract via a regular expression.

Answered by Harriv

You can try WikiExtractor: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor

It's for Python 2.7 and 3.3+.

Answered by Narinder S. Ghumman

DBPedia is the perfect solution for this problem. Here: http://dbpedia.org/page/Metallica, look at the perfectly organised data using RDF. One can query for anything at http://dbpedia.org/sparql using SPARQL, the query language for RDF. There's always a way to find the pageID so as to get descriptive text, but this should do for the most part.

There will be a learning curve for RDF and SPARQL before you can write any useful code, but this is the perfect solution.

For example, a query run for Metallica returns an HTML table with the abstract in several different languages:

<table class="sparql" border="1">
  <tr>
    <th>abstract</th>
  </tr>
  <tr>
    <td><pre>"Metallica is an American heavy metal band formed..."@en</pre></td>
  </tr>
  <tr>
    <td><pre>"Metallica es una banda de thrash metal estadounidense..."@es</pre></td>
... 

SPARQL query:

PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbres: <http://dbpedia.org/resource/>

SELECT ?abstract WHERE {
 dbres:Metallica dbpedia-owl:abstract ?abstract.
}

Change "Metallica" to any resource name (resource name as in wikipedia.org/resourcename) for queries pertaining to abstract.

Answered by Hardest

You can also consume Wikipedia pages through a wrapper API like JSONpedia. It works both live (ask for the current JSON representation of a wiki page) and storage-based (query multiple pages previously ingested into Elasticsearch and MongoDB). The output JSON also includes the plain rendered page text.

Answered by ESL

I think the better option is to use the extracts prop that the MediaWiki API provides. It returns only some tags (b, i, h#, span, ul, li) and removes tables, infoboxes, references, etc.

http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Bla%20Bla%20Bla&format=xml gives you something very simple:

<api><query><pages><page pageid="4456737" ns="0" title="Bla Bla Bla"><extract xml:space="preserve">
<p>"<b>Bla Bla Bla</b>" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, <i>L'Amour Toujours</i>. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with <i>L'Amour Toujours (I'll Fly With You)</i> in its US radio version.</p> <p></p> <h2><span id="Background_and_writing">Background and writing</span></h2> <p>He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song <i>"Why Did You Do It"</i>.</p> <h2><span id="Music_video">Music video</span></h2> <p>The song also featured a popular music video in the style of La Linea. The music video shows a man with a floating head and no arms walking toward what appears to be a shark that multiplies itself and can change direction. This style was also used in "The Riddle", another song by Gigi D'Agostino, originally from British singer Nik Kershaw.</p> <h2><span id="Chart_performance">Chart performance</span></h2> <h2><span id="References">References</span></h2> <h2><span id="External_links">External links</span></h2> <ul><li>Full lyrics of this song at MetroLyrics</li> </ul>
</extract></page></pages></query></api>

You can then run it through a regular expression; in JavaScript it would be something like this (you may have to make some minor modifications):

/^.*<\s*extract[^>]*\s*>\s*((?:[^<]*|<\s*\/?\s*[^>hH][^>]*\s*>)*).*<\s*(?:h|H).*$/.exec(data)

Which gives you (only paragraphs, bold, and italic):

"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You)in its US radio version.

Answered by Anuraj

You can get wiki data in text format. If you need to access information for many titles, you can get all of their wiki data in a single call. Use the pipe character (|) to separate the titles.

http://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exlimit=max&explaintext&exintro&titles=Yahoo|Google&redirects=

This API call returns data for both Google and Yahoo.

explaintext => Return extracts as plain text instead of limited HTML.

exlimit=max (currently 20) => Otherwise only one result is returned.

exintro => Return only the content before the first section. If you want the full data, just remove this.

redirects= => Resolve redirects.

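A sketch of building that multi-title URL (the helper name is mine; urlencode escapes the pipe as %7C, which the API accepts, and the flag parameters are sent with empty values since their presence alone enables them):

```python
import urllib.parse

API_URL = "https://en.wikipedia.org/w/api.php"

def multi_extract_url(titles):
    """Plain-text intro extracts for several titles in one call."""
    return API_URL + "?" + urllib.parse.urlencode({
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exlimit": "max",
        "explaintext": "",  # flag parameter: presence alone enables it
        "exintro": "",
        "redirects": "",
        "titles": "|".join(titles),
    })

print(multi_extract_url(["Yahoo", "Google"]))
```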
Answered by Mark Amery

Here are a few different possible approaches; use whichever works for you. All my code examples below use requests for HTTP requests to the API; you can install requests with pip install requests if you have Pip. They also all use the MediaWiki API, and two use the query endpoint; follow those links if you want documentation.

1. Get a plain text representation of either the entire page or the page "extract" straight from the API with the extracts prop

Note that this approach only works on MediaWiki sites with the TextExtracts extension. This notably includes Wikipedia, but not some smaller Mediawiki sites like, say, http://www.wikia.com/

You want to hit a URL like

https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Bla_Bla_Bla&prop=extracts&exintro&explaintext

Breaking that down, we've got the following parameters in there (documented at https://www.mediawiki.org/wiki/Extension:TextExtracts#query+extracts):

  • action=query, format=json, and titles=Bla_Bla_Bla are all standard MediaWiki API parameters
  • prop=extracts makes us use the TextExtracts extension
  • exintro limits the response to content before the first section heading
  • explaintext makes the extract in the response be plain text instead of HTML

Then parse the JSON response and extract the extract:

>>> import requests
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'query',
...         'format': 'json',
...         'titles': 'Bla Bla Bla',
...         'prop': 'extracts',
...         'exintro': True,
...         'explaintext': True,
...     }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> print(page['extract'])
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

2. Get the full HTML of the page using the parse endpoint, parse it, and extract the first paragraph

MediaWiki has a parse endpoint that you can hit with a URL like https://en.wikipedia.org/w/api.php?action=parse&page=Bla_Bla_Bla to get the HTML of a page. You can then parse it with an HTML parser like lxml (install it first with pip install lxml) to extract the first paragraph.

For example:

>>> import requests
>>> from lxml import html
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'parse',
...         'page': 'Bla Bla Bla',
...         'format': 'json',
...     }
... ).json()
>>> raw_html = response['parse']['text']['*']
>>> document = html.document_fromstring(raw_html)
>>> first_p = document.xpath('//p')[0]
>>> intro_text = first_p.text_content()
>>> print(intro_text)
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

3. Parse wikitext yourself

You can use the query API to get the page's wikitext, parse it using mwparserfromhell (install it first using pip install mwparserfromhell), then reduce it down to human-readable text using strip_code. strip_code doesn't work perfectly at the time of writing (as shown clearly in the example below) but will hopefully improve.

>>> import requests
>>> import mwparserfromhell
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'query',
...         'format': 'json',
...         'titles': 'Bla Bla Bla',
...         'prop': 'revisions',
...         'rvprop': 'content',
...     }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> wikicode = page['revisions'][0]['*']
>>> parsed_wikicode = mwparserfromhell.parse(wikicode)
>>> print(parsed_wikicode.strip_code())
{{dablink|For Ke$ha's song, see Blah Blah Blah (song). For other uses, see Blah (disambiguation)}}

"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

Background and writing
He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"''.

Music video
The song also featured a popular music video in the style of La Linea. The music video shows a man with a floating head and no arms walking toward what appears to be a shark that multiplies itself and can change direction. This style was also used in "The Riddle", another song by Gigi D'Agostino, originally from British singer Nik Kershaw.

Chart performance
Chart (1999-00)PeakpositionIreland (IRMA)Search for Irish peaks23

References

External links


Category:1999 singles
Category:Gigi D'Agostino songs
Category:1999 songs
Category:ZYX Music singles
Category:Songs written by Gigi D'Agostino