HTML parsing of Cricinfo scorecards
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/8798260/
Asked by brettdj
Aim
I am looking to scrape 20/20 cricket scorecard data from the Cricinfo website, ideally into CSV form for data analysis in Excel.
As an example, the current Australian Big Bash 2011/12 scorecards are available from:
- Game 1: http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html
- Last Game: http://www.espncricinfo.com/big-bash-league-2011/engine/match/524935.html
Background
I am proficient in using VBA (either automating IE or using XMLHTTP and then using regular expressions) to scrape data from websites, i.e. Extract values from HTML TD and Tr.
In that same question a comment was posted suggesting HTML parsing - which I hadn't come across before - so I have taken a look at questions such as RegEx match open tags except XHTML self-contained tags.
Query
While I could write a regex to parse the cricket data below, I would like advice on how I could efficiently retrieve these results with HTML parsing.
Please bear in mind that my preference is a repeatable CSV format containing:
- the date/name of the match
- Team 1 name
- the output should dump up to 11 records for Team 1 (blank records where players haven't batted, ie "Did Not Bat")
- Team 2 name
- the output should dump up to 11 records for Team 2 (blank records where players haven't batted)
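Purely as an illustration of that layout (the match, team and player names below are placeholders, not real data), one match in the CSV might look like:
2011-12-16 - Team One v Team Two
Team 1,Team One
1,Batsman A,35
2,Batsman B,7
(... records 3 to 10 ...)
11,,
Team 2,Team Two
(... up to 11 records as above, blank where a player did not bat ...)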
Nirvana for me would be a solution that I could deploy using VBA or VBScript so I could fully automate my analysis, but I presume I will have to use a separate tool for the HTML parsing.
Sample Site links and Data to be Extracted
Answered by Siddharth Rout
There are two techniques that I use in VBA. I will describe them one by one.
1) Using FireFox / Firebug Addon / Fiddler
2) Using Excel's inbuilt facility to get data from the web
Since this post will be read by many, I will even cover the obvious. Please feel free to skip whatever part you already know.
1) Using FireFox / Firebug Addon / Fiddler
FireFox: http://en.wikipedia.org/wiki/Firefox - Free download (http://www.mozilla.org/en-US/firefox/new/)
Firebug Addon: http://en.wikipedia.org/wiki/Firebug_%28software%29 - Free download (https://addons.mozilla.org/en-US/firefox/addon/firebug/)
Fiddler: http://en.wikipedia.org/wiki/Fiddler_%28software%29 - Free download (http://www.fiddler2.com/fiddler2/)
Once you have installed Firefox, install the Firebug Addon. The Firebug Addon lets you inspect the different elements in a webpage. For example if you want to know the name of a button, simply right click on it and click on "Inspect Element with Firebug" and it will give you all the details that you will need for that button.
Another example would be finding the name of a table on a website which has the data that you need scraped.
I use Fiddler only when I am using XMLHTTP. It helps me to see the exact info being passed when you click on a button. Because of the increase in the number of bots that scrape sites, most sites now, to prevent automatic scraping, capture your mouse coordinates and pass that information along, and Fiddler helps you debug the info that is being passed. I will not go into much detail about it here, as this info can be used maliciously.
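For reference, this is roughly the request you would be watching in Fiddler when going the XMLHTTP route - a minimal sketch (assuming the MSXML2 library is available; the URL is the first scorecard from the question):
Sub XmlHttpSketch()
    Dim xmlHttp As Object
    Dim html As String
    Set xmlHttp = CreateObject("MSXML2.XMLHTTP")
    '~~> Synchronous GET request for the scorecard page
    xmlHttp.Open "GET", "http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html", False
    xmlHttp.send
    '~~> The raw HTML that Fiddler would show coming back
    html = xmlHttp.responseText
    Debug.Print Len(html) & " characters received"
End Sub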
Now let's take a simple example of how to scrape the URL posted in your question:
http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html
First let's find the name of the table which has that info. Simply right click on the table and click on "Inspect Element with Firebug" and it will give you the below snapshot.
So now we know that our data is stored in a table called "inningsBat1". If we can extract the contents of that table to an Excel file, then we can definitely work with the data to do our analysis. Here is sample code which will dump that table into Sheet1.
Before we proceed, I would recommend closing all Excel instances and starting a fresh one.
Launch the VBA editor and insert a Userform. Place a command button and a WebBrowser control on it. Your Userform might look like this:
Paste this code in the Userform code area
Option Explicit
'~~> Set Reference to Microsoft HTML Object Library
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)

Private Sub CommandButton1_Click()
    Dim URL As String
    Dim oSheet As Worksheet
    Set oSheet = Sheets("Sheet1")
    URL = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html"
    PopulateDataSheets oSheet, URL
    MsgBox "Data Scraped. Please check " & oSheet.Name
End Sub

Public Sub PopulateDataSheets(wsk As Worksheet, URL As String)
    Dim tbl As HTMLTable
    Dim tr As HTMLTableRow
    Dim insertRow As Long, Row As Long, col As Long

    On Error GoTo whoa
    '~~> Load the scorecard page and wait till the browser control is ready
    WebBrowser1.navigate URL
    WaitForWBReady
    Set tbl = WebBrowser1.Document.getElementById("inningsBat1")
    With wsk
        .Cells.Clear
        insertRow = 0
        '~~> Walk the rows of the batting table, skipping blanks and the Total row
        For Row = 0 To tbl.Rows.Length - 1
            Set tr = tbl.Rows(Row)
            If Trim(tr.innerText) <> "" Then
                If tr.Cells.Length > 2 Then
                    If tr.Cells(1).innerText <> "Total" Then
                        insertRow = insertRow + 1
                        For col = 0 To tr.Cells.Length - 1
                            .Cells(insertRow, col + 1) = tr.Cells(col).innerText
                        Next
                    End If
                End If
            End If
        Next
    End With
whoa:
    Unload Me
End Sub

Private Sub Wait(ByVal nSec As Long)
    nSec = nSec + Timer
    While Timer < nSec
        DoEvents
        Sleep 100
    Wend
End Sub

Private Sub WaitForWBReady()
    Wait 1
    While WebBrowser1.ReadyState <> 4
        Wait 3
    Wend
End Sub
Now run your Userform and click on the Command button. You will notice that the data is dumped in Sheet1. See snapshot
Similarly you can scrape other info as well.
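For example, the second innings batting table on the same page has the id "inningsBat2", so the lookup can be parameterised - a short sketch that reuses the WebBrowser1 control from the Userform above:
    Dim i As Long
    Dim tbl As HTMLTable
    For i = 1 To 2
        '~~> "inningsBat1" and "inningsBat2" are the two batting scorecards
        Set tbl = WebBrowser1.Document.getElementById("inningsBat" & i)
        If Not tbl Is Nothing Then
            '~~> Walk tbl.Rows here exactly as in PopulateDataSheets above
        End If
    Next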
2) Using Excel's inbuilt facility to get data from the web
I believe you are using Excel 2007 so I will take that as an example to scrape the above mentioned link.
Navigate to Sheet2. Now navigate to Data Tab and click on the button "From Web" on the extreme right. See snapshot.
Enter the URL in the "New Web Query" window and click on "Go".
Once the page has loaded, select the relevant table that you want to import by clicking on the small arrow as shown in the snapshot. Once done, click on "Import".
Excel will then ask you where you want the data to be imported. Select the relevant cell and click on OK. And you are done! The data will be imported to the cell which you specified.
If you wish you can record a macro and automate this as well :)
Here is the macro that I recorded.
Sub Macro1()
    With ActiveSheet.QueryTables.Add(Connection:= _
        "URL;http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html" _
        , Destination:=Range("$A$1")) '<~~ destination cell assumed to be $A$1
        .Name = "524915"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlSpecifiedTables
        .WebFormatting = xlWebFormattingNone
        .WebTables = """inningsBat1"""
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False
    End With
End Sub
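Since the question covers a whole series (games 524915 to 524935), the recorded query can also be wrapped in a loop - a rough sketch, assuming every match id in that range resolves to a URL of the same form and that stacking the imported tables one under another on the active sheet is acceptable:
Sub ImportAllMatches()
    Dim matchId As Long
    Dim anchorRow As Long
    anchorRow = 1
    For matchId = 524915 To 524935
        With ActiveSheet.QueryTables.Add( _
            Connection:="URL;http://www.espncricinfo.com/big-bash-league-2011/engine/match/" & matchId & ".html", _
            Destination:=ActiveSheet.Cells(anchorRow, 1))
            .WebSelectionType = xlSpecifiedTables
            .WebTables = """inningsBat1""" '<~~ first innings batting table, as above
            .Refresh BackgroundQuery:=False
        End With
        '~~> Leave a blank row before the next match's table
        anchorRow = ActiveSheet.UsedRange.Rows.Count + 2
    Next matchId
End Sub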
Hope this helps. Let me know if you still have some queries.
Sid
Answered by brettdj
For anyone else interested in this, I ended up using the code below based on Siddharth Rout's earlier answer:
- XMLHttp was significantly quicker than automating IE
- the code generates a CSV file for each series to be downloaded (held in the X variable)
- the code dumps each match to a regular 29-row range (regardless of how many players batted) to facilitate easier analysis later on
Public Sub PopulateDataSheets_XML()
    Dim URL As String
    Dim ws As Worksheet
    Dim lngRow As Long
    Dim lngRecords As Long
    Dim lngWrite As Long
    Dim lngSpare As Long
    Dim lngInnings As Long
    Dim lngRow1 As Long
    Dim X(1 To 15, 1 To 4) As String
    Dim objFSO As Object
    Dim objTF As Object
    Dim xmlHttp As Object
    Dim htmldoc As HTMLDocument
    Dim htmlbody As htmlbody
    Dim tbl As HTMLTable
    Dim tr As HTMLTableRow
    Dim strInnings As String

    s = Timer()
    Set xmlHttp = CreateObject("MSXML2.ServerXMLHTTP")
    Set objFSO = CreateObject("scripting.filesystemobject")

    '~~> Each row of X holds: base URL, first match id, last match id, output file name
    X(1, 1) = "http://www.espncricinfo.com/indian-premier-league-2011/engine/match/"
    X(1, 2) = 501198
    X(1, 3) = 501271
    X(1, 4) = "indian-premier-league-2011"
    X(2, 1) = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/"
    X(2, 2) = 524915
    X(2, 3) = 524945
    X(2, 4) = "big-bash-league-2011"
    X(3, 1) = "http://www.espncricinfo.com/ausdomestic-2010/engine/match/"
    X(3, 2) = 461028
    X(3, 3) = 461047
    X(3, 4) = "big-bash-league-2010"

    Set htmldoc = New HTMLDocument
    Set htmlbody = htmldoc.body

    For lngRow = 1 To UBound(X, 1)
        If Len(X(lngRow, 1)) = 0 Then Exit For
        Set objTF = objFSO.createtextfile("c:\temp\" & X(lngRow, 4) & ".csv")
        For lngRecords = X(lngRow, 2) To X(lngRow, 3)
            URL = X(lngRow, 1) & lngRecords & ".html"
            xmlHttp.Open "GET", URL
            xmlHttp.send
            Do While xmlHttp.Status <> 200
                DoEvents
            Loop
            htmlbody.innerHTML = xmlHttp.responseText
            objTF.writeline X(lngRow, 1) & lngRecords & ".html"
            For lngInnings = 1 To 2
                strInnings = "Innings " & lngInnings
                objTF.writeline strInnings
                Set tbl = Nothing
                On Error Resume Next
                Set tbl = htmlbody.Document.getElementById("inningsBat" & lngInnings)
                On Error GoTo 0
                If Not tbl Is Nothing Then
                    lngWrite = 0
                    For lngRow1 = 0 To tbl.Rows.Length - 1
                        Set tr = tbl.Rows(lngRow1)
                        If Trim(tr.innerText) <> vbNewLine Then
                            If tr.Cells.Length > 2 Then
                                If tr.Cells(1).innerText <> "Extras" Then
                                    If Len(tr.Cells(1).innerText) > 0 Then
                                        objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)
                                        lngWrite = lngWrite + 1
                                    End If
                                Else
                                    objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)
                                    lngWrite = lngWrite + 1
                                    Exit For
                                End If
                            End If
                        End If
                    Next
                    '~~> Pad the innings out so every match occupies the same number of rows
                    For lngSpare = 12 To lngWrite Step -1
                        objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)
                    Next
                Else
                    '~~> No batting table found - write blank placeholder lines
                    For lngSpare = 1 To 13
                        objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)
                    Next
                End If
            Next
        Next
    Next
    'Call ConsolidateSheets
End Sub
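Because each match occupies a fixed 29-row block in the output (one URL line, then an innings header plus 13 record lines for each of the two innings), the series files can be pulled straight back into Excel for the analysis step - a minimal sketch using one of the paths written by the code above:
Sub OpenSeriesCsv()
    '~~> Open one of the generated series files
    Workbooks.Open "c:\temp\big-bash-league-2011.csv"
End Sub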
Answered by Babak Naffas
RegEx is not a complete solution for parsing HTML because HTML is not guaranteed to be regular.
You should use the HtmlAgilityPack to query the HTML. This will allow you to use CSS selectors to query the HTML, similar to how you do it with jQuery.