Parse local HTML file

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/24977233/

Date: 2020-08-29 02:21:15  Source: igfitidea

Tags: html, powershell

Asked by Steven Penny

I can use PowerShell to parse an HTML page:

PS > $foo = Invoke-WebRequest http://example.com

PS > $foo.Links.Count
1

However, if I download the page

PS > Invoke-WebRequest -OutFile example.htm http://example.com

and then try to parse the downloaded page, it gives an unexpected result:

PS > $foo = Invoke-WebRequest file://$pwd/example.htm

PS > $foo.Links.Count
0

How can I parse the locally downloaded page?

Accepted answer by Steven Penny

You can serve the file from a web server to get around the dumb limitation of Invoke-WebRequest:

PS > $foo = Invoke-WebRequest http://localhost:8080/example.htm

PS > $foo.Links.Count
1

Note that this works even with no network connection, for example:

PS > Invoke-WebRequest http://example.com
Invoke-WebRequest : The remote name could not be resolved: 'example.com'
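
The answer does not say which web server was used. As a minimal sketch, assuming nothing else is listening on port 8080, a .NET HttpListener can serve the file straight from PowerShell. The function name Send-FileOnce and its parameters are hypothetical, introduced only for illustration; the function answers a single request with the given file, whatever URL path is requested, and then stops.

```powershell
# Hypothetical helper (not part of the original answer): serves one file for a
# single request on http://localhost:<Port>/, then shuts the listener down.
function Send-FileOnce {
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        [int]$Port = 8080
    )

    # Resolve the path up front, because .NET APIs do not follow
    # PowerShell's current location.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    $listener = New-Object System.Net.HttpListener
    $listener.Prefixes.Add("http://localhost:$Port/")
    $listener.Start()
    try {
        # Blocks until a request arrives.
        $context = $listener.GetContext()

        $bytes = [System.IO.File]::ReadAllBytes($fullPath)
        $response = $context.Response
        $response.ContentType = 'text/html'
        $response.ContentLength64 = $bytes.Length
        $response.OutputStream.Write($bytes, 0, $bytes.Length)
        $response.Close()
    }
    finally {
        $listener.Stop()
    }
}
```

Because the call blocks while it waits, run Send-FileOnce -Path .\example.htm in one PowerShell window and issue the Invoke-WebRequest http://localhost:8080/example.htm request from a second one.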

Answer by PeterK

It appears that Invoke-WebRequest loads file protocol URIs just fine, but fails to parse them, even in PowerShell 4.0 (where it is officially supported).

An alternative that does not require setting up a website would be to load the HTML directly into MSHTML and let it do the parsing:

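# Create an in-memory MSHTML document via COM (Windows only).
$html = New-Object -ComObject "HTMLFile"
# Read the whole file as one string; "file.html" is a placeholder path.
$source = Get-Content -Path "file.html" -Raw
# Feed the markup to MSHTML, which parses it as it is written.
$html.IHTMLDocument2_write($source)

# Number of links found in the parsed document.
$html.links.length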

Note that when I tested this, a single

<meta http-equiv="X-UA-Compatible" content="IE=edge" />

header prevented my HTML from parsing, and I have no idea why: the document had similar XHTML-style headers and MSHTML had no issues with those.

Answer by user2235686

Use the file URI format:

$foo = Invoke-WebRequest "file://<path-to-file>"

Correcting my mistake:

If the HTML is valid XML, you can use Select-Xml:

[xml]$html = Get-Content '<path_to_html_file>'
Select-Xml $html -XPath '//a' | foreach {$_.node}
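
One caveat with this approach: XHTML documents usually declare a default namespace, and in that case an unprefixed XPath such as '//a' matches nothing. A minimal sketch of the workaround, using the -Namespace parameter of Select-Xml, follows. The sample document and the prefix x are illustrative assumptions, not taken from the original post.

```powershell
# Illustrative XHTML document (not from the original post); note the
# default namespace declared on the root element.
[xml]$doc = @'
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a href="http://example.com/">Example</a>
    <a href="http://example.org/">Another</a>
  </body>
</html>
'@

# Map an arbitrary prefix ("x" here) to the XHTML namespace so that the
# XPath expression can address namespaced elements.
$ns = @{ x = 'http://www.w3.org/1999/xhtml' }

# Collect the href attribute of every <a> element.
$hrefs = @(Select-Xml -Xml $doc -XPath '//x:a' -Namespace $ns |
    ForEach-Object { $_.Node.GetAttribute('href') })

$hrefs.Count
$hrefs
```

If the document declares no namespace, the original '//a' expression without -Namespace is enough.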