Java: How to crawl the entire Wikipedia?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/2313748/

How to crawl the entire Wikipedia?

Tags: java, web-crawler, wikipedia, websphinx

Asked by Mr CooL

I've tried the WebSphinx application.

I realize that if I put wikipedia.org as the starting URL, it will not crawl any further.

Hence, how do I actually crawl the entire Wikipedia? Can anyone give me some guidelines? Do I need to go and find those URLs myself and supply multiple starting URLs?

Does anyone have suggestions for a good website with a tutorial on using WebSphinx's API?

Answered by Andrew

If your goal is to crawl all of Wikipedia, you might want to look at the available database dumps. See http://download.wikimedia.org/.

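For reference, the dumps are large, so streaming the XML rather than loading it into memory is the practical approach. Below is a minimal sketch using StAX and Apache Commons Compress; the local file name enwiki-latest-pages-articles.xml.bz2 is an assumption, and all it does is print every page title in the dump:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

public class DumpTitleReader {
    public static void main(String[] args) throws Exception {
        // Path to a locally downloaded dump file (assumed name).
        String path = "enwiki-latest-pages-articles.xml.bz2";

        try (InputStream in = new BZip2CompressorInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);

            // Stream through the dump and print every <title> element.
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(xml.getLocalName())) {
                    System.out.println(xml.getElementText());
                }
            }
            xml.close();
        }
    }
}
```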

Answered by Dr.Optix

I'm not sure, but maybe WebSphinx's UserAgent is blocked by Wikipedia's robots.txt:

http://en.wikipedia.org/robots.txt

Answered by İsmet Alkan

I think you didn't choose the required configuration for that. Switch to advanced mode, crawl the subdomain, and remove the limits on page size and time.

However, WebSphinx probably can't crawl the whole of Wikipedia: it slows down as the data grows and eventually stops when around 200 MB of memory is in use. I recommend Nutch, Heritrix and Crawler4j.

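If you do try Crawler4j, the setup looks roughly like the sketch below. This is written from memory against the crawler4j 4.x API, so class names and the shouldVisit signature may differ in other versions; the storage folder, seed URL, politeness delay and thread count are arbitrary choices:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class WikiCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Stay inside the English Wikipedia article space.
        return url.getURL().startsWith("https://en.wikipedia.org/wiki/");
    }

    @Override
    public void visit(Page page) {
        System.out.println("Visited: " + page.getWebURL().getURL());
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/wiki-crawl"); // arbitrary local folder
        config.setPolitenessDelay(1000);                 // be gentle with the servers

        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);

        controller.addSeed("https://en.wikipedia.org/wiki/Main_Page");
        controller.start(WikiCrawler.class, 4);          // 4 crawler threads
    }
}
```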

Answered by Gabe

In addition to using the Wikipedia database dump mentioned above, you can use Wikipedia's API for executing queries, such as retrieving 100 random articles.

http://www.mediawiki.org/wiki/API:Query_-_Lists#random.2F_rn

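For example, with the plain Java 11+ HTTP client you can pull a batch of random article titles from that list=random module; the limit of 10 and the User-Agent string below are arbitrary placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RandomArticles {
    public static void main(String[] args) throws Exception {
        // list=random with rnnamespace=0 restricts results to main-namespace articles.
        String url = "https://en.wikipedia.org/w/api.php"
                + "?action=query&list=random&rnnamespace=0&rnlimit=10&format=json";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "example-wiki-client/0.1 (contact@example.org)") // identify your bot
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON listing of random page ids and titles
    }
}
```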

Answered by FrustratedWithFormsDesigner

You probably need to start with a random article, and then crawl all articles you can get to from that starting one. When that search tree has been exhausted, start with a new random article. You could seed your searches with terms you think will lead to the most articles, or start with the featured article on the front page.

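A rough sketch of that approach, keeping a frontier queue and a visited set and using jsoup for fetching and link extraction (the Special:Random start page, the 1000-page cap and the fixed one-second delay are arbitrary; crawling all of Wikipedia this way would also need persistence and restart handling):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ArticleWalk {
    public static void main(String[] args) throws Exception {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();

        String start = "https://en.wikipedia.org/wiki/Special:Random"; // or a seeded article
        frontier.add(start);
        seen.add(start);

        while (!frontier.isEmpty() && seen.size() < 1000) { // arbitrary cap for the sketch
            String url = frontier.poll();
            Document doc = Jsoup.connect(url)
                    .userAgent("example-wiki-walker/0.1")
                    .get();
            System.out.println(doc.title());

            // Queue unseen article links; skip File:, Category:, Special: pages and externals.
            for (Element link : doc.select("a[href]")) {
                String path = link.attr("href");          // e.g. /wiki/Web_crawler
                if (path.startsWith("/wiki/") && !path.contains(":")) {
                    String abs = link.attr("abs:href");
                    if (seen.add(abs)) {
                        frontier.add(abs);
                    }
                }
            }
            Thread.sleep(1000); // crude politeness delay
        }
    }
}
```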

Another question: why didn't WebSphinx crawl further? Does Wikipedia block bots that identify as 'WebSphinx'?

Answered by Yishu Fang

Have a look at dbpedia, a structured version of Wikipedia.

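DBpedia exposes a public SPARQL endpoint that can be queried over plain HTTP. A small sketch, assuming the standard endpoint at https://dbpedia.org/sparql and JSON results requested via the Accept header (the example query and its LIMIT are arbitrary):

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class DbpediaQuery {
    public static void main(String[] args) throws Exception {
        // List a handful of resources typed as dbo:ProgrammingLanguage.
        String sparql = "SELECT ?lang WHERE { ?lang a <http://dbpedia.org/ontology/ProgrammingLanguage> } LIMIT 10";
        String url = "https://dbpedia.org/sparql?query="
                + URLEncoder.encode(sparql, StandardCharsets.UTF_8);

        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/sparql-results+json") // ask for JSON results
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```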