Best way for a beginner to learn screen scraping with Python
Original question: http://stackoverflow.com/questions/4328271/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by Andreas
This might be one of those questions that are difficult to answer, but here goes:
I don't consider myself a programmer, but I would like to be :-) I learned R because I was sick and tired of SPSS and because a friend introduced me to the language, so I am not a complete stranger to programming logic.
Now I would like to learn Python, primarily for screen scraping and text analysis, but also for writing web apps with Pylons or Django.
So: how should I go about learning to screen scrape with Python? I started going through the Scrapy docs, but I feel too much "magic" is going on. After all, I am trying to learn, not just do.
On the other hand, there is no reason to reinvent the wheel, and if Scrapy is to screen scraping what Django is to web pages, then it might after all be worth jumping straight into Scrapy. What do you think?
Oh, BTW, about the kind of screen scraping: I want to scrape newspaper sites (i.e. fairly complex and big) for mentions of politicians etc. That means I will need to scrape daily, incrementally and recursively, and I need to log the results into a database of some sort. That leads me to a bonus question: everybody is talking about NoSQL databases. Should I learn to use e.g. MongoDB right away (I don't think I need strong consistency), or is that foolish for what I want to do?
Thank you for any thoughts, and I apologize if this is too general to be considered a programming question.
Accepted answer by ayaz
I agree that the Scrapy docs give off that impression. But I believe, as I found for myself, that if you are patient with Scrapy, go through the tutorials first, and then bury yourself in the rest of the documentation, you will not only start to understand the different parts of Scrapy better, but also appreciate why it does what it does the way it does it. It is a framework for writing spiders and screen scrapers in the real sense of the word. You will still have to learn XPath, but I find that it is best to learn it regardless. After all, you do intend to scrape websites, and an understanding of what XPath is and how it works is only going to make things easier for you.
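To make the XPath point concrete, here is a minimal sketch of what such expressions look like. It is not Scrapy code: it uses the standard library's xml.etree.ElementTree, which only supports a subset of XPath and needs well-formed markup. The page snippet, class names, and URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a newspaper front page.
PAGE = """
<html>
  <body>
    <div class="article">
      <h2><a href="/politics/budget-vote">Budget vote delayed</a></h2>
      <p class="byline">Jane Doe</p>
    </div>
    <div class="article">
      <h2><a href="/sports/cup-final">Cup final preview</a></h2>
      <p class="byline">John Roe</p>
    </div>
    <div class="ad">
      <h2><a href="/buy-now">Buy now</a></h2>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# ".//div[@class='article']" selects every <div> below the root
# whose class attribute is exactly "article" (the ad block is skipped).
articles = root.findall(".//div[@class='article']")

headlines = []
for article in articles:
    link = article.find("./h2/a")                  # the <a> inside the article's <h2>
    byline = article.find("./p[@class='byline']")  # the <p> marked as byline
    headlines.append((link.text, link.get("href"), byline.text))

for title, href, author in headlines:
    print(f"{title} ({href}) by {author}")
```

Real pages are rarely well-formed XML, which is one reason scraping tools ship their own HTML-tolerant parsers, but the path expressions themselves carry over.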
Once you have understood, for example, the concept of pipelines in Scrapy, you will be able to appreciate how easy it is to do all sorts of things with scraped items, including storing them in a database.
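As a rough illustration of the pipeline idea, the sketch below shows a plain Python class with the method names that the Scrapy item pipeline documentation describes (open_spider, process_item, close_spider). The class name, table layout, item fields, and the in-memory database path are assumptions made for this example, and the exact method signatures should be checked against the Scrapy version you install. The class is driven by hand here so it runs without Scrapy.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline that stores each scraped item in SQLite."""

    def __init__(self, db_path=":memory:"):
        # Assumption: an in-memory database keeps the sketch self-contained;
        # a real project would point this at a file.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection, create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, title TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every scraped item: write it, then pass it on unchanged.
        self.conn.execute(
            "INSERT OR REPLACE INTO articles (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()


# Driving the pipeline by hand, the way the framework would during a crawl.
pipeline = SQLitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"url": "https://example.com/a", "title": "Budget vote delayed"}, spider=None
)
rows = pipeline.conn.execute("SELECT url, title FROM articles").fetchall()
pipeline.close_spider(spider=None)
```

In an actual Scrapy project the class would be enabled through the ITEM_PIPELINES setting rather than called directly.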
BeautifulSoup is a wonderful Python library that can be used to scrape websites. But, in contrast to Scrapy, it is not a framework by any means. For smaller projects, where you don't have to invest time in writing a proper spider or deal with scraping a large amount of data, you can get by with BeautifulSoup. But for anything else, you will begin to appreciate the sort of things Scrapy provides.
Answer by cababunga
It looks like Scrapy uses XPath for DOM traversal, which is a language in itself and may feel somewhat cryptic for some time. I think BeautifulSoup will give you a faster start. With lxml you'll have to invest more time learning, but it is generally considered (not only by me) a better alternative to BeautifulSoup.
For the database, I would suggest you start with SQLite and use it until you hit a wall and need something more scalable (which may never happen, depending on how far you want to go with this), at which point you'll know what kind of storage you need. MongoDB is definitely overkill at this point, but getting comfortable with SQL is a very useful skill.
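For the daily, incremental scraping described in the question, SQLite can already handle the "have I seen this before?" bookkeeping. The following is a minimal sketch using Python's built-in sqlite3 module; the table layout, function names, and sample values are assumptions for illustration only.

```python
import sqlite3
from datetime import date


def open_store(path=":memory:"):
    """Open (or create) the database; ':memory:' is used here only for the demo."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS mentions (
               url        TEXT NOT NULL,
               politician TEXT NOT NULL,
               seen_on    TEXT NOT NULL,
               UNIQUE (url, politician)
           )"""
    )
    return conn


def record_mention(conn, url, politician, seen_on):
    """Store a mention once; return True if it was new, False if already known."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO mentions (url, politician, seen_on) VALUES (?, ?, ?)",
        (url, politician, seen_on.isoformat()),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means the UNIQUE constraint skipped it


conn = open_store()
first = record_mention(conn, "https://example.com/a", "A. Politician", date(2024, 1, 1))
# The next day's run finds the same article again; nothing is duplicated.
again = record_mention(conn, "https://example.com/a", "A. Politician", date(2024, 1, 2))
total = conn.execute("SELECT COUNT(*) FROM mentions").fetchone()[0]
```

The UNIQUE constraint does the deduplication, so the scraper itself does not need to remember what it saw on previous days.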
Here is a five-line example I gave some time ago to illustrate how BeautifulSoup can be used: "Which is the best programming language to write a web bot?"
Answer by Marvo
Regarding the database part of the question: use the right tool for the job. Figure out what you wanna do, how you wanna organize your data, what kind of access you need, etc. THEN decide if a NoSQL solution works for your project.
I think NoSQL solutions are here to stay for a variety of different applications. On various projects I've worked on over the last 20 years, we implemented the same ideas inside SQL databases without calling it NoSQL, so the use cases clearly exist. It's worth at least getting some background on what these products offer and which of them are working well to date.
Design your project well and keep the persistence layer separate, and you should be able to change your database solution with only minor heartache if you decide that's necessary.
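One possible way to keep the persistence layer separate is sketched below: the scraping code only talks to a small interface, and the backend behind it can be swapped. All class and function names here are hypothetical, and the two backends (a dict and SQLite) are just stand-ins for whatever storage you end up choosing.

```python
import sqlite3
from typing import List, Protocol


class ArticleStore(Protocol):
    """Everything the scraping code is allowed to know about storage."""

    def save(self, url: str, title: str) -> None: ...

    def titles(self) -> List[str]: ...


class MemoryStore:
    """Throwaway backend, handy for experiments and tests."""

    def __init__(self):
        self._rows = {}

    def save(self, url, title):
        self._rows[url] = title

    def titles(self):
        return sorted(self._rows.values())


class SQLiteStore:
    """Same interface, backed by SQLite."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, title TEXT)"
        )

    def save(self, url, title):
        self._conn.execute(
            "INSERT OR REPLACE INTO articles (url, title) VALUES (?, ?)", (url, title)
        )
        self._conn.commit()

    def titles(self):
        rows = self._conn.execute("SELECT title FROM articles ORDER BY title")
        return [title for (title,) in rows]


def store_scraped(store: ArticleStore, scraped) -> None:
    """Scraping-side code: it never mentions a concrete database."""
    for url, title in scraped:
        store.save(url, title)


SCRAPED = [
    ("https://example.com/b", "Cup final preview"),
    ("https://example.com/a", "Budget vote delayed"),
]
memory_store, sqlite_store = MemoryStore(), SQLiteStore()
store_scraped(memory_store, SCRAPED)
store_scraped(sqlite_store, SCRAPED)
```

Switching to another database later would then mean writing one more class with the same two methods, while store_scraped stays untouched.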
Answer by hoju
I recommend starting at a lower level while learning, since Scrapy is a high-level framework. Read a good Python book like Dive Into Python, then look at lxml for parsing HTML.
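To give a feel for what "lower level" can mean, here is a small sketch that collects links from a page by hand. Note the swap: it uses the standard library's html.parser instead of lxml so that it runs without installing anything. The class name, base URL, and HTML fragment are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href=...> it sees."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser calls this for each opening tag; attrs is a list of (name, value).
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


collector = LinkCollector("https://example.com/news/")
collector.feed(
    '<p><a href="/politics/1">One</a> '
    '<a href="story-2.html">Two</a> '
    "<a>no href</a></p>"
)
links = collector.links
```

Following the collected links and repeating the process is, in essence, the recursive crawl that a framework automates.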
Answer by Omer Khan
I really like BeautifulSoup. I'm fairly new to Python but found it fairly easy to start screen scraping. I wrote a brief tutorial on screen scraping with BeautifulSoup. I hope it helps.
Answer by Jaakko
Before diving into Scrapy, take Udacity's Introduction to Computer Science: https://www.udacity.com/course/cs101
That's a great way to familiarize yourself with Python, and you will actually learn Scrapy a lot faster once you have some basic knowledge of Python.

