vb.net: Retrieve data from a website via Visual Basic
Note: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/14859781/
Retrieve data from a website via Visual Basic
Asked by Hymanery Xu
There is this website that we purchase widgets from that provides details for each of their parts on its own webpage. Example: http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND. I have to find all of their parts that are in our database, and add Manufacturer and Manufacturer Part Number values to their fields.
I was told that there is a way for Visual Basic to access a webpage and extract information. If someone could point me in the right direction on where to start, I'm sure I can figure this out.
Thanks.
Accepted answer by MonkeyDoug
How to scrape a website using HTMLAgilityPack (VB.Net)
I agree that HtmlAgilityPack is the easiest way to accomplish this. It is less error-prone than using Regex alone. The following is how I deal with scraping.
After downloading the HtmlAgilityPack DLL, create a new application, add HtmlAgilityPack via NuGet, and reference it. If you can use Chrome, it will let you inspect the page to find out where your information is located. Right-click on a value you wish to capture and look for the table it is found in (follow the HTML up a bit).
The following example will extract all the values from that page within the "pricing" table. We need to know the XPath value for the table (this value instructs HtmlAgilityPack on what to look for) so that the document we create looks for our specific values. This can be achieved by finding whatever structure your values are in, right-clicking it, and choosing Copy XPath. From this we get...
//*[@id="pricing"]
Please note that the XPath you get from Chrome can sometimes be rather large. You can often simplify it by finding something unique about the table your values are in. In this example it is the "id" attribute, but in other situations it could just as easily be a heading, a class, or something else.
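For illustration, the long path below is the kind of expression Chrome's Copy XPath produces, while the shorter forms key off something unique about the element; the long path and the class name in the last line are invented for this sketch:

```vbnet
' Three ways to address the same table, from most to least brittle.
Dim LongPath As String = "/html/body/div[2]/div[3]/table[1]"             ' brittle: breaks when the layout shifts
Dim ByIdPath As String = "//*[@id='pricing']"                            ' robust: keys off the unique id
Dim ByClassPath As String = "//table[contains(@class, 'pricing-table')]" ' hypothetical class name
```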
This XPath looks for an element whose id equals "pricing"; that is our table. Looking further in, we see that our values are within tbody, tr, and td tags. HtmlAgilityPack doesn't work well with tbody, so leave it out. Our new XPath is...
//*[@id='pricing']/tr/td
This XPath says: find the element with the pricing id within the page, then look for text within its tr and td tags. Now we add the code...
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As HtmlAgilityPack.HtmlDocument = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
For Each table As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
Next
To extract the values, we simply reference the table variable created in our loop and its InnerText member.
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As HtmlAgilityPack.HtmlDocument = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
For Each table As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
    MsgBox(table.InnerText)
Next
Now we have message boxes that pop up the values. You can swap the message box for a list to fill, or store the values whatever way you wish. Then simply do the same for whatever other tables you want to get.
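As a sketch of the list approach (the variable names and the Trim call are my own additions), the same loop can fill a List(Of String) instead of popping message boxes. Note that SelectNodes returns Nothing when the XPath matches no node, so it is worth guarding:

```vbnet
' Collect the cell text into a list instead of showing message boxes.
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As HtmlAgilityPack.HtmlDocument = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")

Dim Prices As New List(Of String)
Dim Cells = Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
If Cells IsNot Nothing Then          ' SelectNodes returns Nothing when nothing matches
    For Each Cell As HtmlAgilityPack.HtmlNode In Cells
        Prices.Add(Cell.InnerText.Trim())
    Next
End If
```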
Please note that the Doc variable is reusable, so if you want to cycle through a different table on the same page, you do not have to reload it. That matters when you are making many requests: you don't want to slam the website, and if you are automating a large number of scrapes, put some time between requests.
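To make that concrete, here is a sketch of scraping several parts with a pause between page loads. The second part number and the "attributes" table id are invented for illustration, and the two-second delay is an arbitrary choice:

```vbnet
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim PartNumbers As String() = {"AE9912-ND", "XX1234-ND"}   ' second part number is made up

For Each Part As String In PartNumbers
    Dim Doc As HtmlAgilityPack.HtmlDocument = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=" & Part)

    ' One Load, several queries: the same Doc serves both tables.
    Dim Pricing = Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
    Dim Attributes = Doc.DocumentNode.SelectNodes("//*[@id='attributes']/tr/td")   ' hypothetical table id

    System.Threading.Thread.Sleep(2000)   ' be polite: pause between requests
Next
```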
Scraping really is that easy. That is the basic idea. Have fun!
Answered by zeroef
Html Agility Pack is going to be your friend!
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
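A small sketch of that tolerance: HtmlDocument.LoadHtml parses a raw string, and the unclosed li tags below are handled without complaint (the fragment is invented for illustration):

```vbnet
Dim Doc As New HtmlAgilityPack.HtmlDocument
Doc.LoadHtml("<ul><li>First widget<li>Second widget</ul>")   ' the li tags are never closed

' The DOM is walked much like a System.Xml document.
For Each Item As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//li")
    Console.WriteLine(Item.InnerText)
Next
```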
Looking at the source of the example page you provided, they are using HTML5 Microdata in their markup. I searched some more on CodePlex and found a microdata parser which may help too: MicroData Parser
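You can also pull microdata with plain Html Agility Pack XPath by selecting on the itemprop attribute. The property name below follows schema.org conventions; whether the Digi-Key page uses this exact name is an assumption on my part:

```vbnet
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As HtmlAgilityPack.HtmlDocument = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")

' Select the first node carrying the (assumed) "name" microdata property.
Dim NameNode = Doc.DocumentNode.SelectSingleNode("//*[@itemprop='name']")
If NameNode IsNot Nothing Then
    MsgBox(NameNode.InnerText.Trim())
End If
```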

