C#: How to convert HTML to XHTML?

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/138555/

Time: 2020-08-03 15:08:58  Source: igfitidea

How to convert HTML to XHTML?

Asked by JRoppert

I need to convert HTML documents into valid XML, preferably XHTML. What's the best way to do this? Does anybody know a toolkit/library/sample/...whatever that helps me to get that task done?

To be a bit clearer: my application has to do the conversion automatically at runtime. I'm not looking for a tool that helps me move some pages to XHTML manually.

Accepted answer by prakash

Convert from HTML to XML with HTML Tidy

Downloadable Binaries

JRoppert, for your needs, I guess you might want to look at the Sources.

c:\temp>tidy -help
tidy [option...] [file...] [option...] [file...]
Utility to clean up and pretty print HTML/XHTML/XML
see http://tidy.sourceforge.net/

Options for HTML Tidy for Windows released on 14 February 2006:

File manipulation
-----------------
 -output <file>, -o  write output to the specified <file>
 <file>
 -config <file>      set configuration options from the specified <file>
 -file <file>, -f    write errors to the specified <file>
 <file>
 -modify, -m         modify the original input files

Processing directives
---------------------
 -indent, -i         indent element content
 -wrap <column>, -w  wrap text at the specified <column>. 0 is assumed if
 <column>            <column> is missing. When this option is omitted, the
                     default of the configuration option "wrap" applies.
 -upper, -u          force tags to upper case
 -clean, -c          replace FONT, NOBR and CENTER tags by CSS
 -bare, -b           strip out smart quotes and em dashes, etc.
 -numeric, -n        output numeric rather than named entities
 -errors, -e         only show errors
 -quiet, -q          suppress nonessential output
 -omit               omit optional end tags
 -xml                specify the input is well formed XML
 -asxml, -asxhtml    convert HTML to well formed XHTML
 -ashtml             force XHTML to well formed HTML
 -access <level>     do additional accessibility checks (<level> = 0, 1, 2, 3).
                     0 is assumed if <level> is missing.

Character encodings
-------------------
 -raw                output values above 127 without conversion to entities
 -ascii              use ISO-8859-1 for input, US-ASCII for output
 -latin0             use ISO-8859-15 for input, US-ASCII for output
 -latin1             use ISO-8859-1 for both input and output
 -iso2022            use ISO-2022 for both input and output
 -utf8               use UTF-8 for both input and output
 -mac                use MacRoman for input, US-ASCII for output
 -win1252            use Windows-1252 for input, US-ASCII for output
 -ibm858             use IBM-858 (CP850+Euro) for input, US-ASCII for output
 -utf16le            use UTF-16LE for both input and output
 -utf16be            use UTF-16BE for both input and output
 -utf16              use UTF-16 for both input and output
 -big5               use Big5 for both input and output
 -shiftjis           use Shift_JIS for both input and output
 -language <lang>    set the two-letter language code <lang> (for future use)

Miscellaneous
-------------
 -version, -v        show the version of Tidy
 -help, -h, -?       list the command line options
 -xml-help           list the command line options in XML format
 -help-config        list all configuration options
 -xml-config         list all configuration options in XML format
 -show-config        list the current configuration settings

Use --blah blarg for any configuration option "blah" with argument "blarg"

Input/Output default to stdin/stdout respectively
Single letter options apart from -f may be combined
as in:  tidy -f errs.txt -imu foo.html
For further info on HTML see http://www.w3.org/MarkUp
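
Since the question asks for conversion at runtime, one way to use the options listed above from C# is to launch tidy as a child process and pipe the document through its stdin and stdout. The following is only a minimal sketch under assumptions: the class name TidyRunner, the method HtmlToXhtml and the tidyPath parameter are hypothetical names, and a tidy executable is assumed to be installed and reachable. Only flags that appear in the help text are used.

```csharp
using System;
using System.Diagnostics;
using System.Text;

static class TidyRunner
{
    // Flags taken from the help text above: XHTML output, numeric entities,
    // no nonessential output, UTF-8 for both input and output.
    public const string TidyArguments = "-asxhtml -numeric -quiet -utf8";

    // Assumption: tidyPath points at an installed tidy executable (or it is on PATH).
    public static string HtmlToXhtml(string html, string tidyPath = "tidy")
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = tidyPath,
            Arguments = TidyArguments,
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            UseShellExecute = false,   // required when redirecting streams
            CreateNoWindow = true,
            StandardOutputEncoding = Encoding.UTF8
        };

        using (var process = Process.Start(startInfo))
        {
            // Begin reading before writing, so a full output pipe cannot stall tidy
            // while this code is still blocked writing the input.
            var outputTask = process.StandardOutput.ReadToEndAsync();

            // Send the document as UTF-8 bytes without a BOM, matching -utf8.
            byte[] input = new UTF8Encoding(false).GetBytes(html);
            process.StandardInput.BaseStream.Write(input, 0, input.Length);
            process.StandardInput.Close();   // signals end of input to tidy

            string xhtml = outputTask.Result;
            process.WaitForExit();
            return xhtml;
        }
    }
}
```

A call would look like `string xhtml = TidyRunner.HtmlToXhtml("<p>Hello<br>world");`. Standard error is deliberately not redirected here; if it were, it would also have to be read, otherwise the child process could block.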

Answer by Bravax

The easiest way is to set your Visual Studio IDE to identify the changes you need to make. You can do this in Visual Studio 2008 by going to: Tools, Options, Text Editor, HTML, Validation and choosing the appropriate target. Possibly XHTML 1.1 or XHTML 1.0 Transitional.

For some information on the different types, read: http://msdn.microsoft.com/en-us/library/aa479043.aspx

Then you need to work through the points highlighted on your page.

Answer by TcKs

You can use the HTML Agility Pack. It's an open-source project from CodePlex.
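
As a rough illustration of that suggestion, the sketch below loads an HTML string with the HTML Agility Pack and serializes it back out as XML. It assumes the HtmlAgilityPack NuGet package is referenced; the class name AgilityPackConverter and the method HtmlToXml are hypothetical. Note that this aims at well-formed XML and does not by itself guarantee valid XHTML.

```csharp
using System.IO;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

static class AgilityPackConverter
{
    public static string HtmlToXml(string html)
    {
        var doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;   // ask the library to serialize as XML
        doc.LoadHtml(html);             // tolerant parse of the (possibly malformed) HTML

        using (var writer = new StringWriter())
        {
            doc.Save(writer);           // write the parsed document to the string writer
            return writer.ToString();
        }
    }
}
```

The returned string can then be handed to an XML parser such as `System.Xml.Linq.XDocument.Parse`.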

Answer by hsivonen

The Validator.nu HTML Parser comes with an HTML2XML sample program that does the conversion using the HTML5 parsing algorithm and infoset coercion rules.

Answer by Cetin Sert

Use Html2Xhtml for .NET 4.0:

In-memory string-to-string conversion:

var xhtml = Html2Xhtml.RunAsFilter(stdin => stdin.Write(html)).ReadToEnd();

In-memory string-to-XDocument conversion:

var xdoc = Html2Xhtml.RunAsFilter(stdin => stdin.Write(html)).ReadToXDocument();

See http://corsis.sourceforge.net/index.php/Html2Xhtml for more information.

Answer by mite

http://corsis.sourceforge.net/index.php/Html2Xhtml

Html2Xhtml is a .NET 4.0 library for converting HTML to XHTML. It is licensed under GPLv2 or later.

I tested Html2Xhtml in the local reconstruction of a large online database of the European Union. Tidy/Tidy.NET would not even produce valid output most of the time, and Chilkat's HTML-to-XML was a bit slow and produced strange results (misplaced, missing, or unexplainable elements). In an attempt to find a free, fast and reliable conversion tool, I created this library. It converts 2 to 4 times faster than all the other libraries I tested.

Html2Xhtml, combined with the power of LINQ to XML, is an excellent tool for all large-scale data extraction and web crawling scenarios.
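
To make the LINQ to XML point concrete, here is a small sketch of what querying converted XHTML could look like once it is available as an XDocument, whichever converter produced it. The class name XhtmlLinks and the method ExtractLinks are hypothetical, and the code assumes the converter places elements in the standard XHTML namespace; if it does not, a plain "a" element name would be needed instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class XhtmlLinks
{
    // The standard XHTML namespace; assumed to be present in the converted output.
    static readonly XNamespace Xhtml = "http://www.w3.org/1999/xhtml";

    // Collects the href value of every <a> element that has a non-empty one.
    public static List<string> ExtractLinks(XDocument xdoc)
    {
        return xdoc.Descendants(Xhtml + "a")
                   .Select(a => (string)a.Attribute("href"))
                   .Where(href => !string.IsNullOrEmpty(href))
                   .ToList();
    }
}
```

For example, given `XDocument.Parse("<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><a href=\"a.html\">A</a><a>no link</a><a href=\"b.html\">B</a></body></html>")`, the method returns the two hrefs `a.html` and `b.html`.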

Answer by user1845579

You can convert HTML to XHTML with the tidy executable:

tidy -asxhtml -numeric < index.html > index.xhtml

You can check the C# implementation here.
