How to write a simple spider in Python?

Question

Asked by Zeynel

I've been trying to write this spider for weeks but without success. What is the best way for me to code this in Python:

1) Initial url: http://www.whitecase.com/Attorneys/List.aspx?LastName=A

2) From the initial URL, pick up these URLs with this XPath selector and regex:

hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')

[u'/cabel', u'/jacevedo', u'/jacuna', u'/aadler', u'/zahmedani', u'/tairisto', u'/zalbert', u'/salberts', u'/aaleksandrova', u'/malhadeff', u'/nalivojvodic', u'
....

3) Go to each of these URLs and scrape the school info with this regex:

hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')

[u'JD, ', u'University of Florida Levin College of Law, <em>magna cum laude</em> , Order of the Coif, Symposium Editor, Florida Law Review, Awards for highest grades in Comparative Constitutional History, Legal Drafting, Real Property and Sales, ', u'2007']

4) Write the scraped school info to a schools.csv file

Can you help me write this spider in Python? I've been trying to write it in Scrapy but without success. See my previous question.

Thank you.

Answer 1

Answered by Martin Beckett

http://www.ibm.com/developerworks/linux/library/l-spider/ (IBM article with a good description)

or

http://code.activestate.com/recipes/576551/ (Python Cookbook recipe: better code but less explanation)

Answer 2

Answered by Nick Bastin

Also, I suggest you read "RegEx match open tags except XHTML self-contained tags" before you try to parse HTML with a regular expression. Then think about what happens the first time someone's name forces the page to be Unicode instead of Latin-1.
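
To make the encoding point concrete, here is a minimal sketch of what goes wrong and one way to guard against it. The helper name `decode_page`, its `declared_charset` parameter, and the sample name are made up for illustration; the sketch assumes you have the raw response bytes and, possibly, a charset taken from the HTTP headers.

```python
from typing import Optional


def decode_page(raw: bytes, declared_charset: Optional[str] = None) -> str:
    """Decode a response body, preferring the declared charset, then UTF-8."""
    for encoding in (declared_charset, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    # Latin-1 maps every byte to a character, so this last resort never raises.
    return raw.decode("latin-1")


name = "Müller"                          # made-up example name
utf8_bytes = name.encode("utf-8")        # how a UTF-8 page would send it
garbled = utf8_bytes.decode("latin-1")   # what a hard-coded Latin-1 decode yields
```

Here `garbled` no longer equals `name`, so a regex written against the Latin-1 text can silently stop matching. Note that the helper trusts the declared charset: if a page declares Latin-1 but actually sends UTF-8, the decode succeeds and the text is still wrong.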

EDIT: To answer your question about a library to use in Python, I would suggest Beautiful Soup, which is a great HTML parser and supports Unicode throughout (and does a really good job with malformed HTML, which you're going to find all over the place).
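
As a rough illustration of the parser-based approach, the sketch below walks the HTML with a real parser and applies a regex only to the extracted text. It deliberately swaps in the standard library's `html.parser` so it runs without installing anything; Beautiful Soup would be shorter and far more forgiving of malformed markup. The class and function names (`CellCollector`, `profile_links`, `school_info`, `write_schools`) and the CSV column names are hypothetical, and unlike the `[1]` in the question's XPath, this collects links from every `altRow` cell rather than only the first one.

```python
import csv
import re
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect link targets and text found inside <td> cells of one CSS class."""

    def __init__(self, cell_class):
        super().__init__(convert_charrefs=True)
        self.cell_class = cell_class
        self.depth = 0        # greater than 0 while inside a matching <td>
        self.hrefs = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            if self.depth:
                self.depth += 1   # nested cell inside a matching cell
            elif self.cell_class in (attrs.get("class") or "").split():
                self.depth = 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        # Assumes every <td> is explicitly closed; real pages may not comply.
        if tag == "td" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text_parts.append(data)


def profile_links(list_html):
    """Step 2: return the hrefs found in <td class="altRow"> cells."""
    parser = CellCollector("altRow")
    parser.feed(list_html)
    return parser.hrefs


# Adapted from the question's regex: text after "JD," up to the first number.
JD_PATTERN = re.compile(r"JD,\s*(.*?)(\d+)", re.S)


def school_info(profile_html):
    """Step 3: return (school text, year) from the main column, or None."""
    parser = CellCollector("mainColumnTDa")
    parser.feed(profile_html)
    text = " ".join("".join(parser.text_parts).split())  # collapse whitespace
    match = JD_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1).strip(" ,"), match.group(2)


def write_schools(rows, out_file):
    """Step 4: write (attorney, school, year) rows as CSV to an open file."""
    writer = csv.writer(out_file)
    writer.writerow(["attorney", "school", "year"])
    writer.writerows(rows)
```

To use it, fetch the list page, call `profile_links` on the decoded HTML, fetch each profile, call `school_info`, and pass the collected rows to `write_schools` with a file opened as `open("schools.csv", "w", newline="", encoding="utf-8")`. Fetching is left out on purpose, since it depends on the HTTP library you choose.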
