Notice: this page is based on a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/21523574/

Replace special characters like &ndash; and &mdash; occurring in an XML document with corresponding codes like &#150;

java xml regex templates xslt

Asked by Rishi Verma

I wish to replace special characters like &ndash; and &mdash; occurring in an XML document with the corresponding numeric codes, such as &#150;.

I have an input XML document containing several special characters:

<?xml version="1.0"?>
<!DOCTYPE BOOK SYSTEM "bookfull.dtd">
<BOOK>
  <P>The war was between1890&ndash;1900
    <AF>something&mdash;something else</AF>
  </P>
</BOOK>

There are several other entities as well, such as &rsquo; for a single quotation mark.

My XSLT code is as follows:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns="http://www.w3.org/1999/xhtml">

  <xsl:output method="html" omit-xml-declaration="yes" indent="yes" />
  <xsl:strip-space elements="*" />

  <xsl:param name="pDest"
    select="'file:///d:/LWW_Book_ePub_Transform/Epub_ZipCreation/XSLT_Transform/Output/'" />

  <xsl:template match="P">
    <html>
      <xsl:apply-templates/>
    </html>
  </xsl:template>

  <xsl:template match="AF">
    .....
    <xsl:apply-templates/>
    .....
  </xsl:template>

</xsl:stylesheet>

My Java code for parsing is as follows (I am using Saxon 9):

package com.xsltprocessor;

import java.io.File;
import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Source;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import org.w3c.dom.Document;

public class ParseUsingSAX {

    public ParseUsingSAX() {
    }

    public void parseBookContent(String xsltFile) {
        try {
            File inputXml = new File("D:\\data\\myxml.0f");
            File xslt = new File(xsltFile);

            // Compile the stylesheet, then create a transformer from it.
            TransformerFactory factory = TransformerFactory.newInstance();
            Templates template = factory.newTemplates(new StreamSource(new FileInputStream(xslt)));
            Transformer xformer = template.newTransformer();

            Source source = new StreamSource(new FileInputStream(inputXml));
            // Send the serialized result to standard output.
            StreamResult result = new StreamResult(System.out);
            xformer.transform(source, result);
            System.out.println("DONE");
        }
        catch (Exception ex) {
            ex.printStackTrace();
            System.out.println("IO exception: " + ex.getMessage());
        }
    }

}

After the transformation I get this output:

<html>
The war was between1890&ndash;1900
</html>

Expected output:

<html>
The war was between1890&#150;1900
</html>

Accepted answer by Mathias Müller

Use an xsl:character-map element, which controls output serialization.

<xsl:character-map name="dashes">
    <xsl:output-character character="&ndash;" string="&amp;#150;"/>
</xsl:character-map>

You also have to declare

<xsl:output use-character-maps="dashes"/>

as a top-level element to ensure that the character mapping is used.

As I mentioned in my comments, &ndash; is an HTML named entity, so it needs to be declared before it can be used in an XSLT stylesheet. See e.g. this discussion for more detail.

Embedded into the stylesheet you show (this outputs dummy strings "MDASH" and "NDASH" - just for illustration):

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE stylesheet [
<!ENTITY ndash  "&#x2013;" >
<!ENTITY mdash  "&#x2014;" >
]>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns="http://www.w3.org/1999/xhtml">

  <xsl:output method="html" omit-xml-declaration="yes" indent="yes" />
  <xsl:output use-character-maps="dashes"/>

  <xsl:strip-space elements="*" />

  <xsl:character-map name="dashes">
    <xsl:output-character character="&ndash;" string="NDASH"/>
    <xsl:output-character character="&mdash;" string="MDASH"/>
  </xsl:character-map>

  <xsl:param name="pDest"
    select="'file:///d:/LWW_Book_ePub_Transform/Epub_ZipCreation/XSLT_Transform/Output/'" />

  <xsl:template match="BOOK">
    <html>
      <xsl:apply-templates/>
    </html>
  </xsl:template>

  <xsl:template match="AF|P">
    <xsl:copy>
      <xsl:value-of select="."/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Note that this does not have an effect on output produced with xsl:result-document (I mention it because you did not show your entire stylesheet). For more info on character maps, please refer to a previous answer of mine and the official recommendation.
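To make the effect of a character map concrete, here is a minimal plain-Java sketch of the same idea: each mapped character in the serialized output is swapped for a fixed replacement string, and everything else passes through untouched. This is only an illustration, not how Saxon implements xsl:character-map. The class and method names are made up for this example, and the &#151; mapping for the em dash is my assumption (only &#150; appears in the question).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only: mimics what a character map
// does to serialized output, but as plain string post-processing.
final class CharacterMapSketch {

    // Replaces every mapped code point with its replacement string and
    // copies all other code points through unchanged.
    static String applyCharacterMap(String serialized, Map<Integer, String> map) {
        StringBuilder out = new StringBuilder();
        serialized.codePoints().forEach(cp -> {
            String replacement = map.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> dashes = new LinkedHashMap<>();
        dashes.put(0x2013, "&#150;"); // EN DASH, as requested in the question
        dashes.put(0x2014, "&#151;"); // EM DASH, assumed by analogy

        // prints: The war was between1890&#150;1900
        System.out.println(applyCharacterMap("The war was between1890\u20131900", dashes));
    }
}
```

Unlike this sketch, a real character map is applied by the serializer itself, so it only touches text and attribute content and never markup.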

Answer by Jon Hanna

Either the DTD mentioned at <!DOCTYPE BOOK SYSTEM "bookfull.dtd"> will include the entity references used (like &ndash;) or it is in error (or I suppose the input XML could have been in error in trying to use an entity it should be able to use).

If it does include them, then you need to set your XSLT processor to validate the document according to its DTD. (I don't know how to do this in your case, as I know the XSLT part of the problem, but not the specifics of how to use XSLT in Java).

If not, you'll have to fix it.

Get a copy of http://www.w3.org/2003/entities/2007/w3centities-f.ent (referencing that URI directly would work, but the W3C would prefer that you didn't, and a local copy also performs better).

Then create your own bookfull.dtd that includes:

<!ENTITY % w3centities-f PUBLIC "-//W3C//ENTITIES Combined Set//EN//XML"
    "w3centities-f.ent">
%w3centities-f;

Alternatively, paste the contents of that file directly into the DTD.
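On the Java side, one possible way to make the parser pick up a local DTD is a standard org.xml.sax.EntityResolver. The following is a rough sketch, not a tested Saxon setup: the class name, the method name and the inlined two-entity DTD are placeholders for this example, and a real bookfull.dtd would pull in w3centities-f.ent as shown above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: serve a local DTD whenever the document asks for bookfull.dtd.
final class LocalDtdResolverSketch {

    // Placeholder for the contents of a local bookfull.dtd (assumption:
    // only the two entities from the question are declared here).
    static final String LOCAL_DTD =
        "<!ENTITY ndash \"&#x2013;\">\n"
      + "<!ENTITY mdash \"&#x2014;\">\n";

    // Parses the XML with the local DTD and returns the root element's text.
    static String parseWithLocalDtd(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Returning null tells the parser to fall back to its default lookup.
        builder.setEntityResolver((publicId, systemId) ->
            systemId != null && systemId.endsWith("bookfull.dtd")
                ? new InputSource(new StringReader(LOCAL_DTD))
                : null);
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE BOOK SYSTEM \"bookfull.dtd\">"
                   + "<BOOK><P>1890&ndash;1900</P></BOOK>";
        String text = parseWithLocalDtd(xml);
        // prints: true (the entity reference became U+2013 during parsing)
        System.out.println(text.equals("1890\u20131900"));
    }
}
```

A Document parsed this way can then be handed to the Transformer as a javax.xml.transform.dom.DOMSource instead of a StreamSource.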

Now, when the input document is interpreted, the entity references can be resolved. For example, &ndash; in the file above is defined by:

<!ENTITY ndash            "&#x02013;" ><!--EN DASH -->

In other words: "whenever &ndash; appears, replace it with – (U+2013, EN DASH)".

This happens at the XML parsing step, before the XSLT stylesheet runs, so as far as the XSLT is concerned, the content it received contained –, not &ndash;, and it treats it as such.
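A small way to check this for yourself, sketched with the standard JAXP API: declare the entity in an internal DTD subset and ask a stylesheet for the string length of the element. If the stylesheet saw the seven characters of &ndash; it would report 15; because the parser has already expanded the reference, it reports 9. The class name, the method name and the tiny XSLT 1.0 stylesheet are made up for this illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: shows that the stylesheet only ever sees the expanded character.
final class ParseTimeExpansionSketch {

    // Illustrative stylesheet: outputs the string length of /BOOK/P as text.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/\"><xsl:value-of select=\"string-length(/BOOK/P)\"/></xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the stylesheet over the given XML and returns the text output.
    static String lengthSeenByXslt(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE BOOK [<!ENTITY ndash \"&#x2013;\">]>"
                   + "<BOOK><P>1890&ndash;1900</P></BOOK>";
        // prints: 9 (four digits, one EN DASH, four digits)
        System.out.println(lengthSeenByXslt(xml));
    }
}
```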