Original question: http://stackoverflow.com/questions/3605433/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Extract a Table from an HTML File with PowerShell or VBS
Asked by Aaron Wurthmann
I have a two part problem that needs fixing. I'll try my best to describe it then break down what I "think" the steps are.
I am trying to get a specific table in a webpage and email it to myself.
At the moment I am trying to use GNU\Win32 wget.exe. (I'd rather use PowerShell natively, but for some reason I couldn't, perhaps because the method I was using couldn't render the ASPX page?) Using wget, I was able to make a local HTML version of the ASPX page.
Now I have been attempting to parse the file and extract a specific table. In this particular case the table begins with <table border="0" cellpadding="2" cellspacing="2" width="300px"> and ends with </table>, and there are no nested tables.
I've thrown some regex at my problem (yes I know regex may not be the tool I need here) but to no avail.
--- Amended: here is where I am now...
$content = (new-object System.Net.WebClient).DownloadString($url)
$found = $content -cmatch '(?si)<table border="0" cellpadding="2" cellspacing="2" width="300px"[^>]*>(.*?)Total Queries</td>(.*?)</tr>(.*?)</table>'
$result = $matches[3]
$result
Answered by Keith Hill
I've done this sort of thing with PowerShell. It is pretty straightforward:
PS> $url = "http://www.windowsitpro.com/news/PaulThurrottsWinInfoNews.aspx"
PS> $content = (new-object System.Net.WebClient).DownloadString($url)
PS> $content -match '(?s)<table[^>]+border\s*=\s*"0"\s*.*?>(.*?)</table>'
True
PS> $matches[1]
<tr>
<snip>
</tr>
Just substitute width for border and 300px for 0 in your regex, e.g.:
PS> $content -match '(?s)<table[^>]+width\s*=\s*"300px"\s*.*?>(.*?)</table>'
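To try the pattern without hitting a live page, here is a minimal sketch that runs it against a small inline HTML string. The sample markup and the $tableBody variable are made up for illustration; only the pattern comes from the answer above.

```powershell
# Sample HTML standing in for the downloaded page (hypothetical content).
$content = @'
<html><body>
<table border="1" width="100%"><tr><td>Navigation</td></tr></table>
<table border="0" cellpadding="2" cellspacing="2" width="300px">
<tr><td>Total Queries</td><td>42</td></tr>
</table>
</body></html>
'@

# Same pattern as above: (?s) lets . match newlines, and the lazy (.*?)
# stops at the first closing </table> tag.
$pattern = '(?s)<table[^>]+width\s*=\s*"300px"\s*.*?>(.*?)</table>'

if ($content -match $pattern) {
    # $matches[0] holds the whole table, $matches[1] only the rows inside it.
    $tableBody = $matches[1].Trim()
} else {
    $tableBody = $null
}

$tableBody
```

Because [^>]+ cannot cross a closing angle bracket, the first table (width="100%") is skipped and only the 300px table is captured.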
In the case of matching multiple tables, you have to switch from -match, which is a boolean operator that only looks for a single match, to Select-String, which can find all matches, e.g.:
PS> $pattern = '(?s)<table[^>]+width\s*=\s*"300px"\s*.*?>(.*?)</table>'
PS> $content | Select-String -AllMatches $pattern |
Foreach {$_.Matches | Foreach {$_.Groups[1].Value}}
Essentially all matches will be in the $_.Matches collection. If you know that the table is always the third one, you can access it like so:
... | Foreach {$_.Matches[2].Groups[1].Value}
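As a rough, self-contained illustration of the multiple-match case, the sketch below collects every 300px table from an inline string. The sample HTML and the $tables variable are assumptions for the example, not part of the original page.

```powershell
# Three tables, two of which have width="300px" (hypothetical content).
$content = @'
<table width="300px"><tr><td>first</td></tr></table>
<table width="100%"><tr><td>other</td></tr></table>
<table border="0" width="300px"><tr><td>second</td></tr></table>
'@

$pattern = '(?s)<table[^>]+width\s*=\s*"300px"\s*.*?>(.*?)</table>'

# A multi-line string piped to Select-String is treated as one input item,
# so -AllMatches returns every table match inside it.
$tables = @($content | Select-String -AllMatches -Pattern $pattern |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Groups[1].Value })

$tables.Count   # number of matching tables
$tables[1]      # inner HTML of the second matching table
```

Wrapping the pipeline in @(...) keeps $tables an array even when only one table matches, so indexing stays predictable.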
Answered by Start-Automating
A while ago I wrote a function called Get-MarkupTag. This gets you away from having to use regular expressions directly (it does so under the covers). It also attempts to turn HTML into XML, at which point getting out the data is pretty simple.
To do this with Get-MarkupTag, you'd do something like
$webClient = New-Object Net.Webclient -Property @{UseDefaultCredentials=$true}
$html = $webClient.DownloadString($url)
$table = Get-MarkupTag -html $html -tag "table" |
    Where-Object { $_.Tag -like '<table border="0" cellpadding="2" cellspacing="2" width="300px">*' } |
    Select-Object -expandProperty Xml
$table.tr | # Row
    Foreach-Object {
        $_.Td # Column
    }
Hope this helps
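For readers without Get-MarkupTag, the HTML-to-XML idea can be sketched with PowerShell's built-in [xml] cast. This is not Get-MarkupTag itself, just an illustration of the same idea. It assumes the extracted table happens to be well-formed XML, which real-world HTML often is not (the cast throws in that case). The sample rows and the Name/Value property names are made up.

```powershell
# A well-formed table fragment (hypothetical data); unclosed tags or
# entities like &nbsp; would make the [xml] cast fail.
$tableHtml = '<table border="0" cellpadding="2" cellspacing="2" width="300px">' +
    '<tr><td>Total Queries</td><td>42</td></tr>' +
    '<tr><td>Failed Queries</td><td>3</td></tr>' +
    '</table>'

$xml = [xml]$tableHtml

# Walk each row and turn its cells into a simple object.
$rows = foreach ($tr in $xml.SelectNodes('//tr')) {
    $cells = @($tr.SelectNodes('td') | ForEach-Object { $_.InnerText })
    [pscustomobject]@{ Name = $cells[0]; Value = $cells[1] }
}

$rows
```

Once the rows are objects, they can be filtered with Where-Object or rendered back to HTML for an email body.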
Answered by p.campbell
I'd tackle it this way using VBScript.
1. Replace all double quotes with single quotes, just for ease of reading and writing the code, i.e.
myHTMLString = Replace(myHTMLString, """", "'")
2. Determine if the file contains your table. It sounds like the table doesn't have an id or name attribute. Too bad, but failing that, use InStr to determine where the starting position of the table is.
' VBScript does not allow Dim with an initializer, so declare first, then assign
Dim tableStartsAt
tableStartsAt = InStr(myHTMLString, "<table border='0'")
Careful with all the attributes here, as you're at the mercy of the table having its attributes moved around without you noticing! Perhaps when no matching table is found (InStr returns 0), email that status to yourself as a warning that some maintenance is needed.
3. Now that you have the start position of your table, find its end tag, i.e.
Dim tableEndsAt
tableEndsAt = InStr(tableStartsAt, myHTMLString, "</table>")
4. Get the HTML string, including the closing tag:
Dim myTable
myTable = Mid(myHTMLString, tableStartsAt, tableEndsAt - tableStartsAt + Len("</table>"))
5. Put that into an email and send it using VBScript. Ensure you have Mail.IsHTML = True. See also another question on sending email with VBScript.
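For comparison, the same find-start, find-end, cut-substring steps can be sketched in PowerShell with the .NET String methods IndexOf and Substring in place of InStr and Mid. The sample HTML and variable names are hypothetical. The quote-replacement step is not needed here because PowerShell single-quoted strings can hold double quotes directly.

```powershell
# Hypothetical page content around the table of interest.
$html = 'before<table border="0" width="300px"><tr><td>x</td></tr></table>after'

$startTag = '<table border="0"'
$endTag   = '</table>'

# IndexOf returns -1 when the text is not found (InStr returns 0 instead).
$start = $html.IndexOf($startTag, [StringComparison]::OrdinalIgnoreCase)

if ($start -lt 0) {
    # No matching table: a good point to send a warning email instead.
    $table = $null
} else {
    $end = $html.IndexOf($endTag, $start, [StringComparison]::OrdinalIgnoreCase)
    if ($end -lt 0) {
        $table = $null
    } else {
        # Add the end tag's length so the closing </table> is included.
        $table = $html.Substring($start, $end - $start + $endTag.Length)
    }
}

$table
```

Like the VBScript version, this depends on the exact attribute order in the opening tag, so it will silently stop matching if the page markup changes.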
Answered by Eric W
I thought the HuddledMasses Get-Web cmdlets had an option to read in tables as XML.