How to extract specific tables from html file using native powershell commands?
Original question: http://stackoverflow.com/questions/25940510/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not this page): Stack Overflow
Asked by Tom A.
I make use of the PAL tool (https://pal.codeplex.com/) to generate HTML reports from perfmon logs within Windows. After PAL processes .blg files from perfmon it dumps the information into an HTML document that contains tables with various data points about how the system performed. I am currently writing a script that looks at the contents of a directory for all HTML files, and does a get-content on all the HTML files.
What I would like to do is scrape the output of this get-content call for specific tables that have varying numbers of rows. Is it possible, using native powershell cmdlets, to look for specific tables, count how many rows are in each table, and dump just the desired tables and table rows?
Here is an example of the table format I'm trying to scrape:
<H3>Overall Counter Instance Statistics</H3>
<TABLE ID="table6" BORDER=1 CELLPADDING=2>
<TR><TH><B>Condition</B></TH><TH><B>\LogicalDisk(*)\Disk Transfers/sec</B></TH><TH><B>Min</B></TH><TH><B>Avg</B></TH><TH><B>Max</B></TH><TH><B>Hourly Trend</B></TH><TH><B>Std Deviation</B></TH><TH><B>10% of Outliers Removed</B></TH><TH><B>20% of Outliers Removed</B></TH><TH><B>30% of Outliers Removed</B></TH></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/C:</TD><TD>1</TD><TD>7</TD><TD>310</TD><TD>0</TD><TD>11</TD><TD>5</TD><TD>5</TD><TD>5</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/D:</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/E:</TD><TD>0</TD><TD>24</TD><TD>164</TD><TD>-1</TD><TD>11</TD><TD>22</TD><TD>21</TD><TD>20</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/HarddiskVolume5</TD><TD>0</TD><TD>0</TD><TD>2</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/L:</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/T:</TD><TD>0</TD><TD>7</TD><TD>430</TD><TD>0</TD><TD>21</TD><TD>3</TD><TD>2</TD><TD>2</TD></TR>
</TABLE>
The Table ID is constant among all the output files, but the number of table rows is not. Any help is appreciated!
Answered by Alexander Obersht
OK, this isn't thoroughly tested but works with your example table in PS 2.0 with IE11:
# Parsing HTML with IE.
$oIE = New-Object -ComObject InternetExplorer.Application
# Resolving to a full path, since a bare file name may be treated as a web address.
$oIE.Navigate((Resolve-Path "file.html").Path)
# Waiting until IE has finished loading (ReadyState 4 = complete).
while ($oIE.Busy -or $oIE.ReadyState -ne 4) { Start-Sleep -Milliseconds 100 }
$oHtmlDoc = $oIE.Document
# Getting table by ID.
$oTable = $oHtmlDoc.getElementByID("table6")
# Extracting table rows as a collection.
$oTbody = $oTable.childNodes | Where-Object { $_.tagName -eq "tbody" }
$cTrs = @($oTbody.childNodes | Where-Object { $_.tagName -eq "tr" })
# Creating a collection of table headers.
$cThs = $cTrs[0].childNodes | Where-Object { $_.tagName -eq "th" }
$cHeaders = @()
foreach ($oTh in $cThs) {
    $cHeaders += ($oTh.childNodes | Where-Object { $_.tagName -eq "b" }).innerHTML
}
# Converting rows to a collection of PS objects exportable to CSV.
$cCsv = @()
foreach ($oTr in $cTrs) {
    $cTds = @($oTr.childNodes | Where-Object { $_.tagName -eq "td" })
    # Skipping rows without <TD> cells (the header row).
    if ($cTds.Count -eq 0) { continue }
    $oRow = New-Object PSObject
    for ($i = 0; $i -lt $cHeaders.Count; $i++) {
        $oRow | Add-Member -MemberType NoteProperty -Name $cHeaders[$i] -Value $cTds[$i].innerHTML
    }
    $cCsv += $oRow
}
# Closing IE.
$oIE.Quit()
# Exporting CSV.
$cCsv | Export-Csv -Path "file.csv" -NoTypeInformation
Honestly, I didn't aim for optimal code. It's just an example of how you could work with DOM objects in PS and convert them to PS objects.
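The header/cell pairing used in the loop above does not depend on IE itself. As a minimal sketch (the function name ConvertTo-RowObject and the shortened header list are made up for illustration and are not part of the original answer), the same Add-Member pattern can be tried on plain string arrays:

```powershell
# Hypothetical helper: pairs header names with cell values, mirroring
# the Add-Member loop from the answer, but without any COM objects.
function ConvertTo-RowObject {
    param(
        [string[]]$Headers,
        [string[]]$Cells
    )
    $oRow = New-Object PSObject
    for ($i = 0; $i -lt $Headers.Count; $i++) {
        $oRow | Add-Member -MemberType NoteProperty -Name $Headers[$i] -Value $Cells[$i]
    }
    return $oRow
}

# Shortened header list; the values come from the first data row of the question's table.
$cHeaders = 'Condition', 'Min', 'Avg', 'Max'
$oRow = ConvertTo-RowObject -Headers $cHeaders -Cells 'No Thresholds', '1', '7', '310'
$oRow.Max   # 310
```

Objects built this way can be piped to Export-Csv just like the $cCsv collection in the answer.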
Answered by TheMadTechnician
I see you accepted an answer, but I thought I'd add a RegEx solution in here too. No COM objects are needed for this one, and it should be PSv2 friendly, I'm pretty sure.
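$Path = 'C:\Path\To\File.html'
[regex]$regex = "(?s)<TABLE ID=.*?</TABLE>"
# Joining the lines (instead of using -Raw, which needs PS 3.0+) yields one string to match against.
$html = (Get-Content $Path) -join "`n"
$tables = @($regex.Matches($html) | ForEach-Object { $_.Value })
ForEach($String in $tables){
    $table = $String.Split("`n")
    $CurTable = @()
    # The first line of each match holds the table ID.
    $CurTableName = ([regex]'TABLE ID="([^"]*)"').Match($table[0]).Groups[1].Value
    # Header row: turning cell boundaries into commas, then stripping the remaining tags.
    $CurTable += ($table[1] -replace "</B></TH><TH><B>",",") -replace "</?(TR|TH|B)>"
    # Data rows: everything between the header row and the closing </TABLE> line.
    $CurTable += $table[2..($table.Count-2)] | ForEach-Object { $_ -replace "</TD><TD>","," -replace "</?T(D|R)>" }
    $CurTable | ConvertFrom-Csv | Export-Csv "C:\Path\To\Output\$CurTableName.csv" -NoTypeInformation
}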
That should output a CSV file for each table found, such as table6.csv, table9.csv, etc. If you wanted to output CSVs per HTML file, you could wrap the entire thing in a ForEach loop like:
# Here $Path has to point to the folder that holds the HTML files, not to a single file.
ForEach($File in (Get-ChildItem "$Path\*.html")){
    # Insert the code above here.
}
You would need to modify the line that reads the file with Get-Content (GC) so that it uses $File.FullName, so that each file is loaded as the loop iterates.
Then just modify the Export-Csv to something like:
$CurTable | ConvertFrom-Csv | Export-Csv "C:\Path\To\Output\$($File.BaseName)\$CurTableName.csv" -NoTypeInformation
So if you had Server01.html with 3 tables in it, you would get a folder named Server01 with 3 CSV files in it, one for each table. Note that Export-Csv does not create missing folders, so the Server01 folder has to exist before the export runs.
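The question also asks how many rows each table has. As a rough sketch of how the regex approach could report that, the snippet below runs against an in-memory sample instead of a file. The function name Get-HtmlTableInfo, its output properties, and the shortened sample table are assumptions made for illustration; they are not part of the original answer.

```powershell
# Hypothetical helper: returns one object per <TABLE ID="..."> block,
# with the table ID, the number of data rows, and CSV-style lines.
function Get-HtmlTableInfo {
    param([string]$Html)

    $tableRegex = [regex]'(?si)<TABLE ID="([^"]*)".*?</TABLE>'
    foreach ($match in $tableRegex.Matches($Html)) {
        # Every <TR>...</TR> pair is one row; the first one holds the headers.
        $rows = @([regex]::Matches($match.Value, '(?si)<TR>.*?</TR>') |
            ForEach-Object { $_.Value })
        # Cell boundaries become commas, then the remaining tags are stripped.
        $csvLines = @($rows | ForEach-Object {
            ($_ -replace '</T[HD]><T[HD]>', ',') -replace '</?(TR|TH|TD|B)>'
        })
        New-Object PSObject -Property @{
            Id       = $match.Groups[1].Value
            RowCount = $rows.Count - 1   # data rows, header excluded
            CsvLines = $csvLines
        }
    }
}

# Shortened sample in the same shape as the question's table.
$sample = @'
<TABLE ID="table6" BORDER=1 CELLPADDING=2>
<TR><TH><B>Condition</B></TH><TH><B>Min</B></TH><TH><B>Max</B></TH></TR>
<TR><TD>No Thresholds</TD><TD>1</TD><TD>310</TD></TR>
<TR><TD>No Thresholds</TD><TD>0</TD><TD>0</TD></TR>
</TABLE>
'@

$info = @(Get-HtmlTableInfo -Html $sample)
$info[0].Id          # table6
$info[0].RowCount    # 2
$info[0].CsvLines | ConvertFrom-Csv
```

Like the answer's code, this simple comma join assumes that no cell contains a comma; cells with commas would need quoting before being passed to ConvertFrom-Csv.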