Notice: This page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/1805231/

Time: 2020-11-03 23:05:04  Source: igfitidea

How to write a simple spider in Python?

Tags: python, web-crawler, scrapy

Asked by Zeynel

I've been trying to write this spider for weeks but without success. What is the best way for me to code this in Python:

1) Initial url: http://www.whitecase.com/Attorneys/List.aspx?LastName=A

2) From the initial URL, pick up these URLs with this XPath and regex:

hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')

[u'/cabel', u'/jacevedo', u'/jacuna', u'/aadler', u'/zahmedani', u'/tairisto', u'/zalbert', u'/salberts', u'/aaleksandrova', u'/malhadeff', u'/nalivojvodic', u'
....

3) Go to each of these urls and scrape the school info with this regex

hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')

[u'JD, ', u'University of Florida Levin College of Law, <em>magna cum laude</em> , Order of the Coif, Symposium Editor, Florida Law Review, Awards for highest grades in Comparative Constitutional History, Legal Drafting, Real Property and Sales, ', u'2007']

4) Write the scraped school info into schools.csv file
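
Steps 3 and 4 can be tried offline before any crawling is wired in. The following is a minimal sketch, assuming Python 3 and only the standard library: it applies the regex from step 3 to a shortened copy of the sample text shown above and writes one row in CSV form. The helper names `extract_school` and `write_rows`, the column names, and the `/example-attorney` path are hypothetical, chosen for illustration only.

```python
import csv
import io
import re

# Shortened copy of the cell text shown in the sample output above.
CELL = ("JD, University of Florida Levin College of Law, "
        "<em>magna cum laude</em> , Order of the Coif, 2007")

# The pattern from step 3: text after "JD, " up to the first run of digits.
PATTERN = re.compile(r"(?<=(JD,\s))(.*?)(\d+)")


def extract_school(cell_text):
    """Return (school_info, year), or None if the pattern does not match."""
    match = PATTERN.search(cell_text)
    if match is None:
        return None
    _, info, year = match.groups()
    return info.strip(" ,"), year


def write_rows(rows, handle):
    """Write (attorney, school_info, year) rows as CSV to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["attorney", "school_info", "year"])  # hypothetical column names
    writer.writerows(rows)


school = extract_school(CELL)

# StringIO stands in for open("schools.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_rows([("/example-attorney",) + school], buffer)  # hypothetical attorney path
print(buffer.getvalue())
```

Returning `None` for cells without a "JD, " entry lets the caller skip profiles that list no such degree instead of writing a broken row.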

Can you help me write this spider in Python? I've been trying to write it in Scrapy but without success. See my previous question.

Thank you.

Answered by Martin Beckett

http://www.ibm.com/developerworks/linux/library/l-spider/ (IBM article with a good description)

or

http://code.activestate.com/recipes/576551/ (Python cookbook, better code but less explanation)

Answered by Nick Bastin

Also, I suggest you read "RegEx match open tags except XHTML self-contained tags" before you try to parse HTML with a regular expression. Then think about what happens the first time someone's name forces the page to be unicode instead of latin-1.
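
To make the encoding concern concrete, here is a small sketch, assuming Python 3. The name "Muñoz" and the path `/jmuñoz` are made-up examples, not values from the site. It shows two ways a non-ASCII name can go wrong without raising an error: an ASCII-only `\w` (the Python 2 behaviour when `re.UNICODE` is not set) cuts the match short, and decoding UTF-8 bytes as latin-1 silently yields garbled text.

```python
import re

path = "/jmuñoz"  # hypothetical profile path containing a non-ASCII letter

# ASCII-only matching stops at the first non-ASCII character.
ascii_match = re.findall(r"/\w+", path, flags=re.ASCII)

# Unicode-aware matching (the Python 3 default for str patterns) keeps the whole path.
unicode_match = re.findall(r"/\w+", path)

# Decoding UTF-8 bytes as latin-1 does not fail; it produces the wrong characters.
raw = "Muñoz".encode("utf-8")
wrong = raw.decode("latin-1")
right = raw.decode("utf-8")

print(ascii_match, unicode_match)
print(repr(wrong), repr(right))
```

Neither failure raises an exception, so a scraper can write truncated or garbled names to its output for a long time before anyone notices.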

EDIT: To answer your question about a library to use in Python, I would suggest Beautiful Soup, which is a great HTML parser and supports unicode throughout (and does a really good job with malformed HTML, which you're going to find all over the place).
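
As a rough illustration of that suggestion, the sketch below uses Beautiful Soup 4 (the `bs4` package, installed with `pip install beautifulsoup4`) to collect the link from the first `altRow` cell of each table row, which is roughly what the XPath in the question selects. The sample markup, the names in it, and the helper `attorney_links` are assumptions made for illustration; the real page structure may differ.

```python
from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

# Hypothetical markup modelled on the XPath in the question, not copied from the site.
SAMPLE_HTML = """
<table>
  <tr><td class="altRow"><a href="/cabel">Example, A</a></td><td class="altRow">Office 1</td></tr>
  <tr><td class="altRow"><a href="/jacevedo">Example, B</a></td><td class="altRow">Office 2</td></tr>
  <tr><td class="altRow">No link in this row</td></tr>
</table>
"""


def attorney_links(html):
    """Return the href of the link in the first altRow cell of each table row."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for row in soup.find_all("tr"):
        cell = row.find("td", class_="altRow")  # first matching cell in the row
        anchor = cell.find("a", href=True) if cell else None
        if anchor:
            links.append(anchor["href"])
    return links


print(attorney_links(SAMPLE_HTML))
```

Rows without a link are skipped rather than raising, which matters on pages whose markup is not perfectly regular.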