Python: Has anyone parsed Wiktionary?

Question

Asked by Rory

Wiktionary is a wiki dictionary that covers many languages. It even has translations. I would be interested in parsing it and playing with the data. Has anyone done anything like this before? Is there any library I can use? (Preferably Python.)

Answer 1

Accepted answer by Amber

Wiktionary runs on MediaWiki, which has an API.

One of the subpages of the API documentation is "Client code", which lists some Python libraries.
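For a rough idea of what talking to the API directly looks like, here is a minimal sketch using only the standard library. It assumes the English edition's endpoint at https://en.wiktionary.org/w/api.php and the `action=query` / `prop=revisions` module with `formatversion=2`; the function names and the User-Agent string are made up for illustration, and a client library from that list would normally wrap all of this for you.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the English Wiktionary endpoint; other editions use their own host.
API_URL = "https://en.wiktionary.org/w/api.php"


def build_query_url(title, api_url=API_URL):
    """Build a MediaWiki API URL asking for the latest wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def extract_wikitext(response):
    """Pull the wikitext out of a decoded formatversion=2 JSON response."""
    page = response["query"]["pages"][0]
    if page.get("missing"):
        return None  # the page does not exist
    return page["revisions"][0]["slots"]["main"]["content"]


def fetch_wikitext(title):
    """Fetch the raw wikitext for `title` (performs a network request)."""
    request = urllib.request.Request(
        build_query_url(title),
        # Hypothetical User-Agent; put your own tool name and contact here.
        headers={"User-Agent": "wiktionary-example/0.1"},
    )
    with urllib.request.urlopen(request) as reply:
        return extract_wikitext(json.load(reply))
```

Calling `fetch_wikitext("free")` would return the page's raw wikitext as one string, which you still have to parse yourself.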

Answer 2

Answer by razzmataz

I once downloaded a Wiktionary dump to gather words and definitions for Slavic languages. I used ElementTree to go through the XML file that makes up the dump. I would avoid trying to scrape or crawl the site, and just download the XML dump that Wikimedia provides for Wiktionary. Go to the Wikimedia downloads, look for the English Wiktionary dumps (enwiktionary), and go to the most recent dump. You'll probably want the pages-articles.xml.bz2 file, which is just the article content, with no history or comments. Parse this with whatever XML processing library you prefer in Python. I personally prefer ElementTree. Good luck.
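A minimal sketch of that approach, streaming the compressed dump with `bz2` and `xml.etree.ElementTree.iterparse` so the whole file never has to fit in memory. The function names are made up for illustration, and the code assumes the usual export layout of `<page>` elements containing a `<title>` and a `<text>`; the XML namespace is stripped because its version differs between dumps.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to limit memory use


def iter_dump(path):
    """Open a pages-articles.xml.bz2 file and yield its pages one by one."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

Usage would be something like `for title, text in iter_dump("pages-articles.xml.bz2"): ...`, with the file name adjusted to whatever dump you downloaded.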

Answer 3

Answer by Ben Reynwar

I had a crack at parsing the German Wiktionary. I ended up writing it off as too difficult, but before I gave up I put my (not at all tidied up) code up at https://github.com/benreynwar/wiktionary-parser. Although there are conventions used by the editors, they are not enforced by anything other than peer oversight. The diversity of templates used, along with all the typos in the pages, makes the parsing quite challenging.

I think the problem is that they've used the same system as for Wikipedia, which is great for ease of use by the editors but is not appropriate for the much more structured content of Wiktionary. It's a shame, because if Wiktionary could be easily parsed it would be a hugely useful resource.

Answer 4

Answer by spencercooly

Wordnik has done a good job parsing out definitions, etc., and they have a great API.

As the others have mentioned, Wiktionary is a formatting disaster and was not built to be computer-readable.

Answer 5

Answer by benroth

I just made a word list from the German dump like this:

bzcat pages-articles.xml.bz2 | grep '<title>[^[:space:][:punct:]]*</title>' | sed 's:.*<title>\(.*\)</title>.*:\1:' > words
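Since the question asked for Python, here is a rough equivalent of that pipeline as a sketch. It assumes, like the shell version, that each `<title>` element sits on a single line, and it approximates `[:punct:]` with ASCII punctuation from `string.punctuation`; the names `TITLE_RE` and `titles_from_lines` are made up for illustration.

```python
import re
import string

# Titles made only of characters that are neither whitespace nor ASCII punctuation.
TITLE_RE = re.compile(
    r"<title>([^\s" + re.escape(string.punctuation) + r"]*)</title>"
)


def titles_from_lines(lines):
    """Yield page titles that contain no whitespace or punctuation."""
    for line in lines:
        match = TITLE_RE.search(line)
        if match:
            yield match.group(1)
```

To run it on the dump you could pass it `bz2.open("pages-articles.xml.bz2", "rt", encoding="utf-8")` and write each yielded title to a file.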

Answer 6

Answer by Andrew Krizhanovsky

You are welcome to play with the parsed Wiktionary data as a MySQL database. There are two databases (English Wiktionary and Russian Wiktionary) created by a parser written in Java: http://wikokit.googlecode.com

If you like PHP, you are welcome to play with piwidict, a PHP API to this machine-readable Wiktionary.

Answer 7

Answer by Jan Berkel

There is also JWKTL, which does a good job of parsing and extracting structured data from Wiktionary. It is written in Java and supports the English, German and Russian editions.

Answer 8

Answer by Chin

It depends on how thoroughly you need to parse it. If you just need to get all the content for a word in one language (definition, etymology, pronunciation, conjugation, etc.), then it's pretty easy. I have done this before, although in Java using jsoup.

However, if you need to parse it down to the different components of the content (e.g. just getting the definitions of a word), then it will be much more challenging. A Wiktionary entry for a word in a language has no predefined template, so a header can be anything from <h3> to <h6>, the order of the sections may be jumbled, sections can be repeated, etc.
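To make the heading problem concrete, here is a small Python sketch (not the jsoup code mentioned above) that collects the headings under one language with the standard library's `html.parser`. It assumes the language name is the full text of an `<h2>` and that heading text contains nothing else; real rendered pages also carry edit links and wrapper elements that you would need to strip first. `HeadingCollector` and `section_headings` are names invented for this example.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Record every heading as a (level, text) pair, in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently open, if any
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            self.headings.append((self._level, "".join(self._parts).strip()))
            self._level = None


def section_headings(html, language):
    """Return the (level, text) headings that sit under the <h2> for `language`."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    result, inside = [], False
    for level, text in collector.headings:
        if level == 2:
            inside = text == language  # a new language section starts here
        elif inside:
            result.append((level, text))
    return result
```

Because the same section name can show up at different levels and more than once, the sketch returns levels alongside names instead of assuming a fixed layout.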

Answer 9

Answer by yota

You may be interested in the dbnary project; it is not Python, but it is interesting. It claims to support parsing for 21 languages, and it powers wikdict.

Answer 10

Answer by Nemo

Yes, many people have parsed Wiktionary. You can usually find past experiences in the Wiktionary-l mailing list archives.

A project not mentioned by other answers is DBPedia's Wiktionary RDF extraction.

Dozens of other research projects have parsed Wiktionary: you can find some examples in a recent Wiktionary special and in other issues of the Wikimedia research newsletter.

Recently, someone also made an English Wiktionary REST API, which includes an unspecified subset of the Wiktionary data; future plans for it are not known yet.
