Python: Has anyone parsed Wiktionary?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/3364279/
Has anyone parsed Wiktionary?
Asked by Rory
Wiktionary is a wiki dictionary that covers many languages. It even has translations. I would be interested in parsing it and playing with the data. Has anyone done anything like this before? Is there any library I can use? (Preferably Python.)
Accepted answer by Amber
Wiktionary runs on MediaWiki, which has an API.
One of the subpages for the API documentation is Client code, which lists some Python libraries.
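As a rough illustration of what going through the MediaWiki API can look like without any client library, here is a minimal sketch using only the Python standard library. The endpoint URL (English Wiktionary), the choice of the `parse` action with `formatversion=2`, the User-Agent string, and the helper names `build_query`, `extract_wikitext` and `fetch_wikitext` are all assumptions made for this example, not something the answer above prescribes.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint: the English Wiktionary instance of the MediaWiki API.
API_URL = "https://en.wiktionary.org/w/api.php"


def build_query(title):
    """Build a URL asking the API for the raw wikitext of one page."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": "2",  # with version 2, wikitext comes back as a plain string
    }
    return API_URL + "?" + urlencode(params)


def extract_wikitext(response_text):
    """Pull the wikitext out of a JSON response body."""
    data = json.loads(response_text)
    if "error" in data:
        raise LookupError(data["error"].get("info", "unknown API error"))
    return data["parse"]["wikitext"]


def fetch_wikitext(title):
    """Fetch one entry over the network (hypothetical User-Agent value)."""
    request = Request(
        build_query(title),
        headers={"User-Agent": "wiktionary-sketch/0.1 (example)"},
    )
    with urlopen(request) as response:
        return extract_wikitext(response.read().decode("utf-8"))
```

Calling `fetch_wikitext("water")` should return the wiki markup of that entry, which you would still have to parse yourself. For more than a handful of pages, the dumps are likely the better route.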
Answer by razzmataz
I once downloaded a Wiktionary dump to gather words and definitions for Slavic languages. I used ElementTree to go through the XML file that makes up the dump. I would avoid trying to scrape or crawl the site, and just download the XML dump that Wikimedia provides for Wiktionary. Go to the Wikimedia downloads, look for the English Wiktionary dumps (enwiktionary), and go to the most recent dump. You'll probably want the pages-articles.xml.bz2 file, which is just the article content, with no history or comments. Parse this with whatever XML processing library you prefer in Python. I personally prefer ElementTree. Good luck.
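The dump-based approach described above can be sketched roughly as follows. This is a minimal example, assuming the dump uses the usual `<page>`, `<title>` and `<text>` elements under an XML namespace; the function names `local_name`, `iter_pages` and `iter_dump` are made up for illustration. It streams the file with `iterparse` so the whole dump never has to be held as one tree.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from an open dump stream, one page at a time."""
    title = None
    text = ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the page's children to keep memory use down


def iter_dump(path):
    """Open a pages-articles.xml.bz2 file and stream its pages."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

A loop such as `for title, text in iter_dump("pages-articles.xml.bz2"): ...` then visits every page. The text you get is raw wiki markup, so extracting definitions from it is a separate problem.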
Answer by Ben Reynwar
I had a crack at parsing the German Wiktionary. I ended up writing it off as too difficult, but I put my (not at all tidied up) code up at https://github.com/benreynwar/wiktionary-parser before I gave up. Although the editors follow conventions, these are not enforced by anything other than peer oversight. The diversity of templates used, along with all the typos in the pages, makes the parsing quite challenging.
I think the problem is that they've used the same system as for Wikipedia, which is great for ease of use by the editors but is not appropriate for the much more structured content of Wiktionary. It's a shame, because if Wiktionary could be easily parsed it would be a hugely useful resource.
Answer by spencercooly
Answer by benroth
I just made a word list from the German dump like this:
bzcat pages-articles.xml.bz2 | grep '<title>[^[:space:][:punct:]]*</title>' | sed 's:.*<title>\(.*\)</title>.*:\1:' > words
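For anyone who would rather stay in Python, the same idea (keep only page titles that contain no whitespace or punctuation) can be approximated like this. It is a sketch, not an exact port: `string.punctuation` covers ASCII punctuation only, whereas grep's `[:punct:]` depends on the locale, and the helper names `is_plain_word`, `titles_from_lines` and `write_word_list` are made up for the example.

```python
import bz2
import re
import string

TITLE_RE = re.compile(r"<title>(.*?)</title>")


def is_plain_word(title):
    """True if the title has no whitespace and no ASCII punctuation."""
    return not any(ch.isspace() or ch in string.punctuation for ch in title)


def titles_from_lines(lines):
    """Yield the titles that look like single plain words."""
    for line in lines:
        match = TITLE_RE.search(line)
        if match and is_plain_word(match.group(1)):
            yield match.group(1)


def write_word_list(dump_path, out_path):
    """Stream the compressed dump line by line and write one word per line."""
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump, \
            open(out_path, "w", encoding="utf-8") as out:
        for word in titles_from_lines(dump):
            out.write(word + "\n")
```

Like the shell version, this filters out namespaced pages such as those with a colon in the title, but it does not check which language an entry belongs to.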
Answer by Andrew Krizhanovsky
You are welcome to play with the parsed Wiktionary database in MySQL. There are two databases (English Wiktionary and Russian Wiktionary) created by a parser written in Java: http://wikokit.googlecode.com
If you like PHP, you are welcome to play with piwidict, a PHP API to this machine-readable Wiktionary.
Answer by Jan Berkel
Answer by Chin
It depends on how thoroughly you need to parse it. If you just need to get all the content for a word in one language (definition, etymology, pronunciation, conjugation, etc.), then it's pretty easy. I have done this before, although in Java using jsoup.
However, if you need to parse it down to different components of the content (e.g. getting just the definitions of a word), then it will be much more challenging. A Wiktionary entry for a word in a language has no pre-defined template, so a header can be anything from <h3> to <h6>, the order of the sections may be jumbled, they can be repeated, etc.
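To make the heading problem concrete, here is a small Python sketch (the answer itself used jsoup in Java) that lists the headings of a rendered entry with their levels, using only the standard library. The class and function names are hypothetical, and it assumes headings appear as plain `<h2>` to `<h6>` elements in the HTML you feed it.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collect (level, text) for every heading element, in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading we are currently inside, if any
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            self.headings.append((self._level, "".join(self._parts).strip()))
            self._level = None


def collect_headings(html):
    """Return the list of (level, text) headings found in an HTML string."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings
```

Running this over a few entries shows the issue described above: the same section name can turn up at different levels and in different orders, so any extraction logic has to work from the heading text and nesting rather than from fixed positions.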
Answer by yota
You may be interested in the dbnary project. It is not Python, but it is interesting: it claims to support parsing for 21 languages, and it powers wikdict.
Answer by Nemo
Yes, many people have parsed Wiktionary. You can usually find past experiences in the Wiktionary-l mailing list archives.
A project not mentioned in the other answers is DBPedia's Wiktionary RDF extraction.
Dozens of other research projects have parsed Wiktionary: you can find some examples in a recent Wiktionary special and in other issues of the Wikimedia research newsletter.
Recently someone also made an English Wiktionary REST API, which includes an unspecified subset of the Wiktionary data; future plans for it are not known yet.

