HTML parsing of Cricinfo scorecards
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/8798260/
Asked by brettdj
Aim
I am looking to scrape 20/20 cricket scorecard data from the Cricinfo website, ideally into CSV form for data analysis in Excel.
As an example, the current Australian Big Bash 2011/12 scorecards are available from:
- Game 1: http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html
- Last Game: http://www.espncricinfo.com/big-bash-league-2011/engine/match/524935.html
Background
I am proficient in using VBA (either automating IE or using XMLHTTP and then using regular expressions) to scrape data from websites, i.e. Extract values from HTML TD and Tr.
In that same question a comment was posted suggesting HTML parsing - which I hadn't come across before - so I have taken a look at questions such as RegEx match open tags except XHTML self-contained tags.
Query
While I could write a regex to parse the cricket data below, I would like advice on how I could efficiently retrieve these results with HTML parsing.
Please bear in mind that my preference is a repeatable CSV format containing:
- the date/name of the match
- Team 1 name
- the output should dump up to 11 records for Team 1 (blank records where players haven't batted, ie "Did Not Bat")
- Team 2 name
- the output should dump up to 11 records for Team 2 (blank records where players haven't batted)
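Purely as an illustration of that layout (the match, team and player names below are placeholders, not real data), one match in the CSV might look like:
2011-12-16 - Team One v Team Two
Team 1,Team One
1,Batsman A,35
2,Batsman B,7
(... records 3 to 10 ...)
11,,
Team 2,Team Two
(... up to 11 records as above, blank where a player did not bat ...)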
Nirvana for me would be a solution that I could deploy using VBA or VBScript so I could fully automate my analysis, but I presume I will have to use a separate tool for the HTML parsing.
Sample Site links and Data to be Extracted
Answered by Siddharth Rout
There are two techniques that I use in VBA. I will describe them one by one.
1) Using FireFox / Firebug Addon / Fiddler
2) Using Excel's inbuilt facility to get data from the web
Since this post will be read by many, I will even cover the obvious. Please feel free to skip whatever part you already know.
1) Using FireFox / Firebug Addon / Fiddler
FireFox: http://en.wikipedia.org/wiki/Firefox - Free download (http://www.mozilla.org/en-US/firefox/new/)
Firebug Addon: http://en.wikipedia.org/wiki/Firebug_%28software%29 - Free download (https://addons.mozilla.org/en-US/firefox/addon/firebug/)
Fiddler: http://en.wikipedia.org/wiki/Fiddler_%28software%29 - Free download (http://www.fiddler2.com/fiddler2/)
Once you have installed Firefox, install the Firebug Addon. The Firebug Addon lets you inspect the different elements in a webpage. For example if you want to know the name of a button, simply right click on it and click on "Inspect Element with Firebug" and it will give you all the details that you will need for that button.
Another example would be finding the name of a table on a website which has the data that you need scraped.
I use Fiddler only when I am using XMLHTTP. It helps me to see the exact info being passed when you click on a button. Because of the increase in the number of bots that scrape sites, most sites now, to prevent automatic scraping, capture your mouse coordinates and pass that information along, and Fiddler helps you debug the info that is being passed. I will not go into much detail about it here, as this info can be used maliciously.
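For reference, this is roughly the request you would be watching in Fiddler when going the XMLHTTP route - a minimal sketch (assuming the MSXML2 library is available; the URL is the first scorecard from the question):
Sub XmlHttpSketch()
    Dim xmlHttp As Object
    Dim html As String
    Set xmlHttp = CreateObject("MSXML2.XMLHTTP")
    '~~> Synchronous GET request for the scorecard page
    xmlHttp.Open "GET", "http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html", False
    xmlHttp.send
    '~~> The raw HTML that Fiddler would show coming back
    html = xmlHttp.responseText
    Debug.Print Len(html) & " characters received"
End Sub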
Now let's take a simple example of how to scrape the URL posted in your question:
http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html
First let's find the name of the table which has that info. Simply right click on the table and click on "Inspect Element with Firebug" and it will give you the below snapshot.
So now we know that our data is stored in a table called "inningsBat1". If we can extract the contents of that table to an Excel file, then we can definitely work with the data to do our analysis. Here is sample code which will dump that table into Sheet1.
Before we proceed, I would recommend closing all Excel instances and starting a fresh one.
Launch the VBA editor and insert a Userform. Place a command button and a WebBrowser control on it. Your Userform might look like this:
Paste this code in the Userform code area
Option Explicit
'~~> Set Reference to Microsoft HTML Object Library
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)

Private Sub CommandButton1_Click()
    Dim URL As String
    Dim oSheet As Worksheet
    Set oSheet = Sheets("Sheet1")
    URL = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html"
    PopulateDataSheets oSheet, URL
    MsgBox "Data Scraped. Please check " & oSheet.Name
End Sub

Public Sub PopulateDataSheets(wsk As Worksheet, URL As String)
    Dim tbl As HTMLTable
    Dim tr As HTMLTableRow
    Dim insertRow As Long, Row As Long, col As Long

    On Error GoTo whoa
    '~~> Load the scorecard page and wait till the browser control is ready
    WebBrowser1.navigate URL
    WaitForWBReady
    Set tbl = WebBrowser1.Document.getElementById("inningsBat1")
    With wsk
        .Cells.Clear
        insertRow = 0
        '~~> Walk the rows of the batting table, skipping blanks and the Total row
        For Row = 0 To tbl.Rows.Length - 1
            Set tr = tbl.Rows(Row)
            If Trim(tr.innerText) <> "" Then
                If tr.Cells.Length > 2 Then
                    If tr.Cells(1).innerText <> "Total" Then
                        insertRow = insertRow + 1
                        For col = 0 To tr.Cells.Length - 1
                            .Cells(insertRow, col + 1) = tr.Cells(col).innerText
                        Next
                    End If
                End If
            End If
        Next
    End With
whoa:
    Unload Me
End Sub

Private Sub Wait(ByVal nSec As Long)
    nSec = nSec + Timer
    While Timer < nSec
        DoEvents
        Sleep 100
    Wend
End Sub

Private Sub WaitForWBReady()
    Wait 1
    While WebBrowser1.ReadyState <> 4
        Wait 3
    Wend
End Sub
Now run your Userform and click on the Command button. You will notice that the data is dumped in Sheet1. See snapshot
Similarly you can scrape other info as well.
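For example, the second innings batting table on the same page has the id "inningsBat2", so the lookup can be parameterised - a short sketch that reuses the WebBrowser1 control from the Userform above:
    Dim i As Long
    Dim tbl As HTMLTable
    For i = 1 To 2
        '~~> "inningsBat1" and "inningsBat2" are the two batting scorecards
        Set tbl = WebBrowser1.Document.getElementById("inningsBat" & i)
        If Not tbl Is Nothing Then
            '~~> Walk tbl.Rows here exactly as in PopulateDataSheets above
        End If
    Next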
2) Using Excel's inbuilt facility to get data from the web
I believe you are using Excel 2007 so I will take that as an example to scrape the above mentioned link.
Navigate to Sheet2. Now navigate to Data Tab and click on the button "From Web" on the extreme right. See snapshot.
Enter the URL in the "New Web Query" window and click on "Go".
Once the page has loaded, select the relevant table that you want to import by clicking on the small arrow as shown in the snapshot. Once done, click on "Import".
Excel will then ask you where you want the data to be imported. Select the relevant cell and click on OK. And you are done! The data will be imported to the cell which you specified.
If you wish you can record a macro and automate this as well :)
Here is the macro that I recorded.
Sub Macro1()
    With ActiveSheet.QueryTables.Add(Connection:= _
        "URL;http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html" _
        , Destination:=Range("$A$1")) '<~~ destination cell assumed to be $A$1
        .Name = "524915"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlSpecifiedTables
        .WebFormatting = xlWebFormattingNone
        .WebTables = """inningsBat1"""
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False
    End With
End Sub
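Since the question covers a whole series (games 524915 to 524935), the recorded query can also be wrapped in a loop - a rough sketch, assuming every match id in that range resolves to a URL of the same form and that stacking the imported tables one under another on the active sheet is acceptable:
Sub ImportAllMatches()
    Dim matchId As Long
    Dim anchorRow As Long
    anchorRow = 1
    For matchId = 524915 To 524935
        With ActiveSheet.QueryTables.Add( _
            Connection:="URL;http://www.espncricinfo.com/big-bash-league-2011/engine/match/" & matchId & ".html", _
            Destination:=ActiveSheet.Cells(anchorRow, 1))
            .WebSelectionType = xlSpecifiedTables
            .WebTables = """inningsBat1""" '<~~ first innings batting table, as above
            .Refresh BackgroundQuery:=False
        End With
        '~~> Leave a blank row before the next match's table
        anchorRow = ActiveSheet.UsedRange.Rows.Count + 2
    Next matchId
End Sub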
Hope this helps. Let me know if you still have some queries.
Sid
Answered by brettdj
For anyone else interested in this, I ended up using the code below based on Siddharth Rout's earlier answer:
- XMLHttp was significantly quicker than automating IE
- the code generates a CSV file for each series to be downloaded (held in the X variable)
- the code dumps each match to a regular 29-row range (regardless of how many players batted) to facilitate easier analysis later on
Public Sub PopulateDataSheets_XML()
    Dim URL As String
    Dim ws As Worksheet
    Dim lngRow As Long
    Dim lngRecords As Long
    Dim lngWrite As Long
    Dim lngSpare As Long
    Dim lngInnings As Long
    Dim lngRow1 As Long
    Dim X(1 To 15, 1 To 4) As String
    Dim objFSO As Object
    Dim objTF As Object
    Dim xmlHttp As Object
    Dim htmldoc As HTMLDocument
    Dim htmlbody As htmlbody
    Dim tbl As HTMLTable
    Dim tr As HTMLTableRow
    Dim strInnings As String

    s = Timer()
    Set xmlHttp = CreateObject("MSXML2.ServerXMLHTTP")
    Set objFSO = CreateObject("scripting.filesystemobject")

    '~~> Each row of X holds: base URL, first match id, last match id, output file name
    X(1, 1) = "http://www.espncricinfo.com/indian-premier-league-2011/engine/match/"
    X(1, 2) = 501198
    X(1, 3) = 501271
    X(1, 4) = "indian-premier-league-2011"
    X(2, 1) = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/"
    X(2, 2) = 524915
    X(2, 3) = 524945
    X(2, 4) = "big-bash-league-2011"
    X(3, 1) = "http://www.espncricinfo.com/ausdomestic-2010/engine/match/"
    X(3, 2) = 461028
    X(3, 3) = 461047
    X(3, 4) = "big-bash-league-2010"

    Set htmldoc = New HTMLDocument
    Set htmlbody = htmldoc.body

    For lngRow = 1 To UBound(X, 1)
        If Len(X(lngRow, 1)) = 0 Then Exit For
        Set objTF = objFSO.createtextfile("c:\temp\" & X(lngRow, 4) & ".csv")
        For lngRecords = X(lngRow, 2) To X(lngRow, 3)
            URL = X(lngRow, 1) & lngRecords & ".html"
            xmlHttp.Open "GET", URL
            xmlHttp.send
            Do While xmlHttp.Status <> 200
                DoEvents
            Loop
            htmlbody.innerHTML = xmlHttp.responseText
            objTF.writeline X(lngRow, 1) & lngRecords & ".html"
            For lngInnings = 1 To 2
                strInnings = "Innings " & lngInnings
                objTF.writeline strInnings
                Set tbl = Nothing
                On Error Resume Next
                Set tbl = htmlbody.Document.getElementById("inningsBat" & lngInnings)
                On Error GoTo 0
                If Not tbl Is Nothing Then
                    lngWrite = 0
                    For lngRow1 = 0 To tbl.Rows.Length - 1
                        Set tr = tbl.Rows(lngRow1)
                        If Trim(tr.innerText) <> vbNewLine Then
                            If tr.Cells.Length > 2 Then
                                If tr.Cells(1).innerText <> "Extras" Then
                                    If Len(tr.Cells(1).innerText) > 0 Then
                                        objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)
                                        lngWrite = lngWrite + 1
                                    End If
                                Else
                                    objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)
                                    lngWrite = lngWrite + 1
                                    Exit For
                                End If
                            End If
                        End If
                    Next
                    '~~> Pad the innings out so every match occupies the same number of rows
                    For lngSpare = 12 To lngWrite Step -1
                        objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)
                    Next
                Else
                    '~~> No batting table found - write blank placeholder lines
                    For lngSpare = 1 To 13
                        objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)
                    Next
                End If
            Next
        Next
    Next
    'Call ConsolidateSheets
End Sub
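Because each match occupies a fixed 29-row block in the output (one URL line, then an innings header plus 13 record lines for each of the two innings), the series files can be pulled straight back into Excel for the analysis step - a minimal sketch using one of the paths written by the code above:
Sub OpenSeriesCsv()
    '~~> Open one of the generated series files
    Workbooks.Open "c:\temp\big-bash-league-2011.csv"
End Sub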
Answered by Babak Naffas
RegEx is not a complete solution for parsing HTML because HTML is not guaranteed to be regular.
You should use the HtmlAgilityPack to query the HTML. This will allow you to use CSS selectors to query the HTML, similar to how you do it with jQuery.