C#: Pulling data from a webpage, parsing it for specific pieces, and displaying it
Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/18065526/
Asked by Aloehart
I've been using this site for a long time to find answers to my questions, but I wasn't able to find the answer on this one.
I am working with a small group on a class project. We're to build a small "game trading" website that allows people to register, list a game they have and want to trade, and accept trades from others or request a trade.
We have the site functioning long ahead of schedule so we're trying to add more to the site. One thing I want to do myself is to link the games that are put in to Metacritic.
Here's what I need to do. I need to (using ASP.NET and C# in Visual Studio 2012) get the correct game page on Metacritic, pull its data, parse it for specific parts, and then display the data on our page.
Essentially, when you choose a game you want to trade for, we want a small div to display with the game's information and rating. I want to do it this way to learn more and get something out of this project that I didn't have to start with.
I was wondering if anyone could tell me where to start. I don't know how to pull data from a page. I'm still trying to figure out if I need to try and write something to automatically search for the game's title and find the page that way or if I can find some way to go straight to the game's page. And once I've gotten the data, I don't know how to pull the specific information I need from it.
One of the things that doesn't make this easy is that I'm learning C++ along with C# and ASP.NET, so I keep getting my wires crossed. If someone could point me in the right direction it would be a big help. Thanks.
Accepted answer by Hanlet Escaño
This small example uses HtmlAgilityPack with XPath selectors to get to the desired elements.
// At the top of the code-behind file (requires the HtmlAgilityPack NuGet package):
using System;
using HtmlAgilityPack;

protected void Page_Load(object sender, EventArgs e)
{
    string url = "http://www.metacritic.com/game/pc/halo-spartan-assault";
    var web = new HtmlWeb();
    HtmlDocument doc = web.Load(url);

    // Each XPath was copied from the browser's developer tools.
    // SelectNodes returns null when nothing matches, so these lines throw if the page layout changes.
    string metascore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
    string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
    string summary = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText;
}
An easy way to obtain the XPath for a given element is by using your web browser's Developer Tools (I use Chrome):
- Open the Developer Tools (F12 or Ctrl+Shift+C on Windows, or Command+Shift+C on Mac).
- Select the element in the page that you want the XPath for.
- Right click the element in the "Elements" tab.
- Click on "Copy as XPath".
You can paste it exactly like that in C# (as shown in my code), but make sure to escape the quotes.
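As a small illustration of the escaping step, here is a sketch using a shortened stand-in XPath (not one of the real Metacritic paths); the class name XPathLiterals is made up for this example. It shows three ways to write the same copied XPath as a C# string.

```csharp
using System;

static class XPathLiterals
{
    // XPath as copied from the browser: //*[@id="main"]/div[3]

    // Regular string literal: escape each double quote with a backslash.
    public const string Escaped = "//*[@id=\"main\"]/div[3]";

    // Verbatim string literal: double each quote instead.
    public const string Verbatim = @"//*[@id=""main""]/div[3]";

    // XPath also accepts single quotes, which need no escaping in C#.
    public const string SingleQuoted = "//*[@id='main']/div[3]";
}

static class Program
{
    static void Main()
    {
        Console.WriteLine(XPathLiterals.Escaped);
        // The first two literals produce the identical string.
        Console.WriteLine(XPathLiterals.Escaped == XPathLiterals.Verbatim); // True
    }
}
```

The single-quoted form is usually the least error-prone when an XPath contains many attribute tests.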
Make sure you use some error handling, because web scraping breaks whenever the site changes the HTML structure of its pages: a selector that no longer matches returns null instead of a node.
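One possible shape for that error handling is sketched below, assuming the HtmlAgilityPack NuGet package is installed. The helper name TextOrNull and the inline sample HTML are made up for illustration; the sample stands in for a downloaded page so the sketch runs without network access.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class SafeScrape
{
    // Returns the trimmed inner text of the first node matching the XPath,
    // or null when nothing matches (for example after a page redesign).
    public static string TextOrNull(HtmlDocument doc, string xpath)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}

static class Program
{
    static void Main()
    {
        // Hypothetical markup, not the real Metacritic page structure.
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id=\"main\"><span class=\"score\"> 86 </span></div>");

        string found = SafeScrape.TextOrNull(doc, "//*[@id='main']/span[1]");
        string missing = SafeScrape.TextOrNull(doc, "//*[@id='main']/div[3]/a/span[1]");

        // TryParse avoids an exception if the text is not a number.
        int score;
        if (found != null && int.TryParse(found, out score))
            Console.WriteLine("Metascore: " + score);

        Console.WriteLine(missing == null ? "User score not found" : missing);
    }
}
```

When loading a live page, the download itself can also fail (timeouts, HTTP errors), so the call that fetches the page would need its own try/catch as well.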
Edit
Per @knocte's suggestion, here is the link to the NuGet package for HtmlAgilityPack:
Answer by JeremiahDotNet
I looked and Metacritic.com doesn't have an API.
You can use an HttpWebRequest to get the contents of a website as a string.
using System;
using System.IO;
using System.Net;
using System.Text;          // Encoding
using System.Windows.Forms; // MessageBox (Windows Forms only)

string result = null;
string url = "http://www.stackoverflow.com";
WebResponse response = null;
StreamReader reader = null;

try
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "GET";
    response = request.GetResponse();
    reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
    result = reader.ReadToEnd(); // the whole page as one string
}
catch (Exception ex)
{
    // handle error
    MessageBox.Show(ex.Message);
}
finally
{
    // always release the reader and the connection
    if (reader != null)
        reader.Close();
    if (response != null)
        response.Close();
}
Then you can parse the string for the data that you want by taking advantage of Metacritic's use of meta tags. Here's the information they have available in meta tags:
- og:title
- og:type
- og:url
- og:image
- og:site_name
- og:description
The format of each tag is: <meta name="og:title" content="In a World...">
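Parsing the downloaded string for those tags could look like the following sketch, which uses only the .NET base class library. The class name OpenGraph and the sample markup are made up for illustration, and the regular expression assumes double-quoted attributes with name (or property) appearing before content; a real HTML parser is more robust if the markup varies.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

static class OpenGraph
{
    // Matches <meta name="og:xxx" content="..."> (or property="og:xxx").
    static readonly Regex MetaTag = new Regex(
        @"<meta\s+(?:name|property)\s*=\s*""og:([^""]+)""\s+content\s*=\s*""([^""]*)""",
        RegexOptions.IgnoreCase);

    // Returns the tags keyed by the part after "og:", e.g. "title".
    public static Dictionary<string, string> Parse(string html)
    {
        var tags = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in MetaTag.Matches(html))
        {
            // Keep the first value seen per key and decode entities such as &amp;.
            string key = m.Groups[1].Value;
            if (!tags.ContainsKey(key))
                tags[key] = WebUtility.HtmlDecode(m.Groups[2].Value);
        }
        return tags;
    }
}

static class Program
{
    static void Main()
    {
        // Made-up sample markup standing in for the downloaded page.
        string html =
            "<head><meta name=\"og:title\" content=\"Halo: Spartan Assault\">" +
            "<meta name=\"og:type\" content=\"game\"></head>";

        foreach (var pair in OpenGraph.Parse(html))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```

In the answer's flow, the `result` string read from the response would be passed to `OpenGraph.Parse` in place of the sample markup.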
Answer by Jason Goemaat
I recommend Dcsoup. There's a NuGet package for it, and it uses CSS selectors, so it will feel familiar if you use jQuery. I've tried others, but it is the best and easiest to use that I've found. There's not much documentation, but it's open source and a port of the Java jsoup library, which has good documentation. (Documentation for the .NET API here.) I absolutely love it.
using System;
using System.Globalization;

var timeoutInMilliseconds = 5000;
var uri = new Uri("http://www.metacritic.com/game/pc/fallout-4");
var doc = Supremes.Dcsoup.Parse(uri, timeoutInMilliseconds);

// <span itemprop="ratingValue">86</span>
var ratingSpan = doc.Select("span[itemprop=ratingValue]");
int ratingValue = int.Parse(ratingSpan.Text);

// selectors match both critic and user scores
var scoreDiv = doc.Select("div.score_summary");
var scoreAnchor = scoreDiv.Select("a.metascore_anchor");
int criticRating = int.Parse(scoreAnchor[0].Text);
// the user score may contain a decimal point, so parse it independently of the server's culture
float userRating = float.Parse(scoreAnchor[1].Text, CultureInfo.InvariantCulture);
Answer by jasniec
I'd recommend WebsiteParser. It's based on HtmlAgilityPack (mentioned by Hanlet Escaño), but it makes web scraping easier with attributes and CSS selectors:
class PersonModel
{
    [Selector("#BirdthDate")]
    [Converter(typeof(DateTimeConverter))]
    public DateTime BirdthDate { get; set; }
}

// ...

PersonModel person = WebContentParser.Parse<PersonModel>(html);