How to write a simple spider in Python?
Notice: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/1805231/
Asked by Zeynel
I've been trying to write this spider for weeks, but without success. What is the best way for me to code this in Python?
1) Initial URL: http://www.whitecase.com/Attorneys/List.aspx?LastName=A
2) From the initial URL, pick up these URLs with this regex:
hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')
[u'/cabel', u'/jacevedo', u'/jacuna', u'/aadler', u'/zahmedani', u'/tairisto', u
/zalbert', u'/salberts', u'/aaleksandrova', u'/malhadeff', u'/nalivojvodic', u'
....
3) Go to each of these URLs and scrape the school info with this regex:
hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
[u'JD, ', u'University of Florida Levin College of Law, <em>magna cum laude</em>
, Order of the Coif, Symposium Editor, Florida Law Review, Awards for highest
grades in Comparative Constitutional History, Legal Drafting, Real Property and
Sales, ', u'2007']
4) Write the scraped school info into schools.csv file
Can you help me write this spider in Python? I've been trying to write it in Scrapy but without success. See my previous question.
Thank you.
Answered by Martin Beckett
http://www.ibm.com/developerworks/linux/library/l-spider/ (IBM article with a good description)
or
http://code.activestate.com/recipes/576551/ (Python Cookbook recipe, better code but less explanation)
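For orientation, the overall shape of a one-level spider can be sketched with the standard library alone. This is a minimal sketch, not the code from either linked resource: `LinkCollector`, `download`, `crawl`, and the `fetch_page` parameter are hypothetical names, and the download step is passed in as a function so the loop can be tried offline with canned pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def download(url):
    """Fetch a page over HTTP; undecodable bytes are replaced, not fatal."""
    with urlopen(url) as response:
        # Assumption: fall back to UTF-8 when the server declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(start_url, fetch_page=download):
    """Return {url: html} for every page linked from start_url (one level deep)."""
    collector = LinkCollector()
    collector.feed(fetch_page(start_url))
    pages = {}
    for href in collector.links:
        url = urljoin(start_url, href)  # resolve relative links such as "/cabel"
        if url not in pages:  # skip links that appear more than once
            pages[url] = fetch_page(url)
    return pages
```

Calling `crawl(start_url)` with the default `download` performs real HTTP requests; passing a dictionary lookup as `fetch_page` is enough to exercise the loop without touching the network.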
Answered by Nick Bastin
Also, I suggest you read "RegEx match open tags except XHTML self-contained tags" before you try to parse HTML with a regular expression. Then think about what happens the first time someone's name forces the page to be Unicode instead of Latin-1.
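The encoding concern can be seen in a few lines. This is only an illustrative sketch; the name used here is made up and does not come from the site in the question.

```python
name = "José Núñez"  # hypothetical name containing non-ASCII characters

utf8_bytes = name.encode("utf-8")

# Decoding UTF-8 bytes as Latin-1 never raises; it silently yields garbled text.
wrong = utf8_bytes.decode("latin-1")

# Decoding with the codec the page was actually written in round-trips cleanly.
right = utf8_bytes.decode("utf-8")

print(wrong == name)  # False
print(right == name)  # True
```

Because the wrong decode does not fail, the damage tends to surface later, for example as garbled names in the CSV output.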
EDIT: To answer your question about a library to use in Python, I would suggest Beautiful Soup, which is a great HTML parser and supports Unicode throughout (and does a really good job with malformed HTML, which you're going to find all over the place).
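As a rough illustration of that suggestion, steps 2 to 4 from the question might look like the sketch below with Beautiful Soup (the third-party `beautifulsoup4` package). The HTML snippets are hypothetical stand-ins modelled on the class names and sample output in the question, not the real pages, and `profile_links`, `education_text`, and `write_schools` are names invented for this example.

```python
import csv
import io

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical markup modelled on the class names in the question.
LISTING_HTML = """
<table>
  <tr><td class="altRow"><a href="/cabel">Abel</a></td></tr>
  <tr><td class="altRow"><a href="/jacevedo">Acevedo</a></td></tr>
</table>
"""
# Shortened, hypothetical version of the profile text shown in the question.
PROFILE_HTML = """
<td class="mainColumnTDa">JD, University of Florida Levin College of Law,
<em>magna cum laude</em>, 2007</td>
"""


def profile_links(html):
    """Return the href of each link inside a td with class "altRow"."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        a["href"]
        for td in soup.find_all("td", class_="altRow")
        for a in td.find_all("a", href=True)
    ]


def education_text(html):
    """Return the visible text of the main column, with whitespace collapsed."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td", class_="mainColumnTDa")
    return " ".join(cell.get_text().split()) if cell else ""


def write_schools(rows, out_file):
    """Write (url, education) rows as CSV with a header line."""
    writer = csv.writer(out_file)
    writer.writerow(["url", "education"])
    writer.writerows(rows)


# Usage: an in-memory buffer stands in for schools.csv.
buffer = io.StringIO()
rows = [(link, education_text(PROFILE_HTML)) for link in profile_links(LISTING_HTML)]
write_schools(rows, buffer)
print(buffer.getvalue())
```

In a real run each link would be fetched separately and `buffer` would be replaced by `open("schools.csv", "w", newline="", encoding="utf-8")`; picking the school name and year out of the education text is left out here because the page structure is not known.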