bash: how to extract specific elements of an XML file?
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/11599111/
How to extract specific elements of an XML file?
Asked by Hakim
I have an XML file containing text in several languages. I want to extract the text in just one language and store it in a separate file. How can I do this? Here are the first lines of my file:
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4b">
  <header creationtool="ORESAligner" creationtoolversion="1.0" datatype="plaintext" segtype="paragraph" adminlang="en-us" srclang="EN" o-tmf="ORES"/>
  <body>
    <tu tuid="55_100:6">
      <prop type="session">55</prop>
      <prop type="committee">3</prop>
      <tuv xml:lang="EN">
        <seg>RESOLUTION 55/100</seg>
      </tuv>
      <tuv xml:lang="AR">
        <seg>?????? 55/100</seg>
      </tuv>
      <tuv xml:lang="ZH">
        <seg>第55/100号决议</seg>
      </tuv>
      <tuv xml:lang="FR">
        <seg>RéSOLUTION 55/100</seg>
      </tuv>
      <tuv xml:lang="RU">
        <seg>РЕЗОЛЮЦИЯ 55/100</seg>
      </tuv>
      <tuv xml:lang="ES">
        <seg>RESOLUCIóN 55/100</seg>
      </tuv>
    </tu>
  </body>
</tmx>
Now say I want just the English text. The desired output should be:
RESOLUTION 55/100
How should I use this script? I am new to working with XML files and don't know how to use this XPath expression. As far as I know, xmlstarlet is able to modify XML files, but I don't know how.
Answered by Todd A. Jacobs
Extract English Nodes with XmlStarlet
You could use xmlstarlet to query your XML using XPath, and return just the nodes with an English-language attribute. For example:
$ xmlstarlet sel -t -v "//tuv[@xml:lang='EN']/seg/text()" /tmp/foo
RESOLUTION 55/100
Store Node Values in a File with Language Extension
If you want to store those values in some language-based file, then you could dump the values of each node found into a file with a language-based extension (e.g. "EN" for English).
# Don't overwrite LANG; use some other variable.
language='EN'
xmlstarlet sel \
    --noblanks \
    --text \
    --template \
    --match "//tuv[@xml:lang='${language}']" \
    --value-of seg \
    -n \
    /tmp/foo > "/tmp/foo.$language"
With this example, the contents of all matching nodes will be written to /tmp/foo.EN for further processing. You can certainly adjust the shell redirection to suit any additional requirements.
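If you need more than one language, the same command can be wrapped in a loop that writes one file per language code. This is only a minimal sketch, assuming xmlstarlet is installed; the function names extract_lang and split_by_lang are made up for illustration, and so are the language codes in the usage line.

```shell
# Hypothetical helper: print the <seg> text of every <tuv> in file $1
# whose xml:lang attribute equals the language code $2.
extract_lang() {
    local file=$1 language=$2
    xmlstarlet sel \
        --text \
        --template \
        --match "//tuv[@xml:lang='${language}']" \
        --value-of seg \
        -n \
        "$file"
}

# Hypothetical helper: for each language code given after the file name,
# write the matching text to "<file>.<LANG>".
split_by_lang() {
    local file=$1 language
    shift
    for language in "$@"; do
        extract_lang "$file" "$language" > "${file}.${language}"
    done
}

# Usage (paths and codes are examples only):
#   split_by_lang /tmp/foo EN FR ES
```

The language code is interpolated directly into the XPath string, so this sketch assumes the codes are plain identifiers such as EN or FR and contain no quote characters.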
Answered by perreal
If the XML file is consistently formatted, with each seg element on the line immediately after its tuv tag as in your sample, you can use a simple sed command:
sed -n '/xml:lang="EN"/ {
N
s_.*<seg>\([^<]*\)</seg>_\1_p
}
' input_file
Description:
sed -n '/xml:lang="EN"/ {           # 1) exec sed with no print flag, find a line
                                    # matching xml:lang="EN"
N                                   # 2) read the next line
s_.*<seg>\([^<]*\)</seg>_\1_p     # 3) replace everything until </seg> with
                                    # the text between <seg> and </seg> and print
}
' input_file
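To reuse the steps above for any language, the sed script can be wrapped in a small function. This is a sketch under the same assumption as the command itself, namely that each seg element sits on the line directly after its tuv tag; the name sed_extract_lang is hypothetical.

```shell
# Hypothetical helper: print the text between <seg> and </seg> for every
# <tuv> line whose xml:lang attribute equals $1, reading from file $2.
# Assumes <seg>...</seg> is on the line right after the matching <tuv> line.
sed_extract_lang() {
    local language=$1 file=$2
    # The script is double-quoted so ${language} expands; \( \) and \1
    # pass through to sed unchanged inside double quotes.
    sed -n "/xml:lang=\"${language}\"/ {
N
s_.*<seg>\([^<]*\)</seg>_\1_p
}" "$file"
}

# Usage (file name is an example only):
#   sed_extract_lang EN input_file > input_file.EN
```

The \1 in the replacement is what keeps the captured text; with an empty replacement the command would print a blank line instead.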
If you want to keep the seg tags, you can change the 3rd step:
sed -n '/xml:lang="EN"/ {
N
s_.*\(<seg>[^<]*</seg>\)_\1_p
}
' input_file
Answered by Michael Kay
The following XPath expression extracts the information you are looking for:
/tmx/body/tu/tuv[@xml:lang='EN']/seg
There are many tools that allow you to process XML files using XPath expressions. If you are working from the command line, you could look at xmlsh.
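As one illustration of feeding that expression to a command-line tool, here is a minimal sketch that uses xmllint (shipped with libxml2) rather than xmlsh. That substitution is my assumption, not something this answer recommends; the function name xpath_seg_text is hypothetical, and text() is appended so that only the text is printed instead of the seg element.

```shell
# Hypothetical helper: evaluate the XPath expression with xmllint and print
# the text of each matching <seg>. $1 is the TMX file, $2 the language code.
# Assumes an xmllint build that supports the --xpath option.
xpath_seg_text() {
    local file=$1 language=$2
    xmllint --xpath \
        "/tmx/body/tu/tuv[@xml:lang='${language}']/seg/text()" \
        "$file"
}

# Usage (file name is an example only):
#   xpath_seg_text file.tmx EN
```

How xmllint separates multiple matching nodes in its output differs between libxml2 versions, so check the result on your own data before relying on it.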
It's hard to tell the context of the requirement, but I would imagine that as it grows beyond the simple case given here, you will want to look at XSLT and/or XQuery.
Answered by tanius
You can use the command line tool xml_grep like this:
xml_grep --cond "tu/tuv[@xml:lang='EN']/seg" --text_only file.tmx
The argument to --cond is an XPath-like expression. Its syntax is similar to what xmlstarlet and similar tools expect, but not identical.

