Parse local HTML file
Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/24977233/
Asked by Steven Penny
I can use PowerShell to parse an HTML page
PS > $foo = Invoke-WebRequest http://example.com
PS > $foo.Links.Count
1
However, if I download the page
PS > Invoke-WebRequest -OutFile example.htm http://example.com
and then try to parse the downloaded page, it gives an unexpected result
PS > $foo = Invoke-WebRequest file://$pwd/example.htm
PS > $foo.Links.Count
0
How can I parse the local downloaded page?
Accepted answer by Steven Penny
You can use the file with a web server to get around the dumb limitation of Invoke-WebRequest
PS > $foo = Invoke-WebRequest http://localhost:8080/example.htm
PS > $foo.Links.Count
1
Note that this works even with no network connection, for example:
PS > Invoke-WebRequest http://example.com
Invoke-WebRequest : The remote name could not be resolved: 'example.com'
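The answer does not say which web server was used. As one possible way to try the idea without installing anything, the sketch below starts a one-shot System.Net.HttpListener in a background job and then requests the page from it. The temp folder name, the sample page content, and the assumption that port 8080 is free are all illustrative and not part of the original answer.

```powershell
# Assumptions: port 8080 is free; the temp folder and sample page are hypothetical
$root = Join-Path ([System.IO.Path]::GetTempPath()) 'local-html-demo'
New-Item -ItemType Directory -Path $root -Force | Out-Null
Set-Content -Path (Join-Path $root 'example.htm') -Value '<html><body><a href="http://www.iana.org/domains/example">More information...</a></body></html>'

# Run the listener in a background job so this session stays free to send the request
$job = Start-Job -ArgumentList $root -ScriptBlock {
    param($root)
    $listener = New-Object System.Net.HttpListener
    $listener.Prefixes.Add('http://localhost:8080/')
    $listener.Start()
    try {
        # Answer exactly one request, then shut down
        $context = $listener.GetContext()
        # Keep only the file name so a request cannot escape $root
        $name = [System.IO.Path]::GetFileName($context.Request.Url.LocalPath)
        $path = $null
        if ($name) { $path = Join-Path $root $name }
        if ($path -and (Test-Path -LiteralPath $path -PathType Leaf)) {
            $bytes = [System.IO.File]::ReadAllBytes($path)
            $context.Response.ContentType = 'text/html'
            $context.Response.OutputStream.Write($bytes, 0, $bytes.Length)
        } else {
            $context.Response.StatusCode = 404
        }
        $context.Response.Close()
    } finally {
        $listener.Stop()
    }
}

# The job needs a moment to start, so retry until the listener answers
$foo = $null
foreach ($attempt in 1..20) {
    try {
        $foo = Invoke-WebRequest http://localhost:8080/example.htm
        break
    } catch {
        Start-Sleep -Milliseconds 500
    }
}

$foo.Links.Count

# The job ends by itself after serving one request; clean it up
Wait-Job $job -Timeout 5 | Out-Null
Remove-Job $job -Force
```

Because the listener handles a single request and then stops, it is only suitable for a quick check like this one. Any regular local web server pointed at the folder would serve the same purpose.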
Answer by PeterK
It appears that Invoke-WebRequest loads file protocol URIs just fine, but fails to parse them even in PowerShell 4.0 (where it is officially supported).
An alternative that does not require setting up a website would be to load and parse HTML directly into MSHTML.
$html = New-Object -ComObject "HTMLFile";
$source = Get-Content -Path "file.html" -Raw;
$html.IHTMLDocument2_write($source);
$html.links.length;
Note that when I tested this, a single
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
header prevented my HTML from parsing and I have no idea why -- the document had similar XHTML-style headers and MSHTML had no issues with those.
Answer by user2235686
Use file-link format
$foo = Invoke-WebRequest "file://<path-to-file>"
Correcting my mistake:
If the HTML is valid XML, you can use Select-Xml:
[xml]$html = Get-Content '<path_to_html_file>'
Select-Xml $html -XPath '//a' | foreach {$_.node}
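To make the Select-Xml approach concrete, here is a minimal sketch that pulls the href of every link out of a small page. The sample markup is hypothetical and deliberately well-formed; real-world HTML often is not valid XML, in which case the [xml] cast throws.

```powershell
# Hypothetical sample page; it must be well-formed XML for the [xml] cast to succeed
$source = @'
<html>
  <head><title>Example Domain</title></head>
  <body>
    <p><a href="http://www.iana.org/domains/example">More information...</a></p>
  </body>
</html>
'@

# Cast the text to an XmlDocument (throws if the markup is not well-formed)
[xml]$html = $source

# Select every <a> element and read its href attribute;
# @(...) keeps the result an array even when there is only one match
$hrefs = @(Select-Xml -Xml $html -XPath '//a' |
    ForEach-Object { $_.Node.GetAttribute('href') })

$hrefs.Count
$hrefs
```

If the document declares the XHTML namespace (xmlns="http://www.w3.org/1999/xhtml"), an unprefixed XPath such as //a matches nothing. In that case the -Namespace parameter of Select-Xml can map a prefix to the namespace URI for use in the query.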