Bash: how to extract specific elements from an XML file?

Question

Asked by Hakim

I have an XML file containing texts in several languages. I want to extract the texts in just one language and store them in a separate file. How can I do this? Here are the first lines of my file:

<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4b">
  <header creationtool="ORESAligner" creationtoolversion="1.0" datatype="plaintext" segtype="paragraph" adminlang="en-us" srclang="EN" o-tmf="ORES"/>
  <body>
    <tu tuid="55_100:6">
      <prop type="session">55</prop>
      <prop type="committee">3</prop>
      <tuv xml:lang="EN">
        <seg>RESOLUTION 55/100</seg>
      </tuv>
      <tuv xml:lang="AR">
        <seg>?????? 55/100</seg>
      </tuv>
      <tuv xml:lang="ZH">
        <seg>第55/100号决议</seg>
      </tuv>
      <tuv xml:lang="FR">
        <seg>RéSOLUTION 55/100</seg>
      </tuv>
      <tuv xml:lang="RU">
        <seg>РЕЗОЛЮЦИЯ 55/100</seg>
      </tuv>
      <tuv xml:lang="ES">
        <seg>RESOLUCIóN 55/100</seg>
      </tuv>
    </tu>
  </body>
</tmx>

Now say I want just the English texts. The desired output should be:

RESOLUTION 55/100

How should I use this script? I am new to working with XML files and don't know how to use this XPath expression. As far as I know, xmlstarlet can modify XML files, but I don't know how.

Answer 1

Answered by Todd A. Jacobs

Extract English Nodes with XmlStarlet

You could use xmlstarlet to query your XML using XPath, and return just the nodes with an English-language attribute. For example:

$ xmlstarlet sel -t -v "//tuv[@xml:lang='EN']/seg/text()" /tmp/foo
RESOLUTION 55/100

Store Node Values in a File with Language Extension

If you want to store those values in some language-based file, then you could dump the values of each node found into a file with a language-based extension (e.g. "EN" for English).

# Don't overwrite LANG; use some other variable.
language='EN'

xmlstarlet sel \
    --noblanks \
    --text \
    --template \
    --match "//tuv[@xml:lang='${language}']" \
    --value-of seg \
    -n \
    /tmp/foo > "/tmp/foo.$language"

With this example, the contents of all matching nodes will be written to /tmp/foo.EN for further processing. You can adjust the shell redirection to suit any additional requirements.
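
The same command can be wrapped in a loop to produce one file per language. The following is only a sketch under stated assumptions: xmlstarlet is on the PATH, and the function name split_by_language, the temporary directory, and the trimmed-down sample file are all made up for illustration.

```shell
# Sketch: write one text file per language code found in a TMX file.
# Assumes xmlstarlet is installed; names and paths below are illustrative.
outdir=$(mktemp -d)
input="$outdir/sample.tmx"

cat > "$input" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4b">
  <body>
    <tu tuid="55_100:6">
      <tuv xml:lang="EN">
        <seg>RESOLUTION 55/100</seg>
      </tuv>
      <tuv xml:lang="FR">
        <seg>RÉSOLUTION 55/100</seg>
      </tuv>
    </tu>
  </body>
</tmx>
EOF

split_by_language() {
    # $1: input TMX file, $2: output directory
    # First pass: list the distinct xml:lang values, one per line.
    xmlstarlet sel --text --template --match "//tuv" \
        --value-of "@xml:lang" -n "$1" | sort -u |
    while IFS= read -r language; do
        [ -n "$language" ] || continue
        # Second pass: same query as above, one output file per language.
        xmlstarlet sel --noblanks --text --template \
            --match "//tuv[@xml:lang='${language}']" \
            --value-of seg -n \
            "$1" < /dev/null > "$2/$(basename "$1").$language"
    done
}

if command -v xmlstarlet >/dev/null 2>&1; then
    split_by_language "$input" "$outdir"
    ls "$outdir"
else
    echo "xmlstarlet not found; nothing extracted" >&2
fi
```

Reading the language codes from the file itself avoids hard-coding the list, at the cost of parsing the document once per language.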

Answer 2

Answered by perreal

If the XML file is consistently formatted, with each <seg> on the line immediately after its <tuv>, you can use a simple sed command:

sed -n '/xml:lang="EN"/ {
N
s_.*<seg>\([^<]*\)</seg>_\1_p
}
' input_file

Description:

sed -n '/xml:lang="EN"/ {           # 1) run sed without auto-print; find a line
                                    # matching xml:lang="EN"
N                                   # 2) append the next line to the pattern space
s_.*<seg>\([^<]*\)</seg>_\1_p       # 3) replace everything up to </seg> with
                                    # the text between <seg> and </seg> and print
}
' input_file
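
To try the script end to end, the sketch below builds a small sample file and wraps the sed call in a function. The function name extract_seg_text, the temporary file, and the shortened sample are illustrative assumptions, not part of the answer.

```shell
# Sketch: run the sed extraction against a throwaway sample file.
sample=$(mktemp)

cat > "$sample" <<'EOF'
<tmx version="1.4b">
  <body>
    <tu tuid="55_100:6">
      <tuv xml:lang="EN">
        <seg>RESOLUTION 55/100</seg>
      </tuv>
      <tuv xml:lang="FR">
        <seg>RÉSOLUTION 55/100</seg>
      </tuv>
    </tu>
  </body>
</tmx>
EOF

extract_seg_text() {
    # $1: language code, $2: input file
    # The language code is spliced into the sed address between quoted parts.
    sed -n '/xml:lang="'"$1"'"/ {
N
s_.*<seg>\([^<]*\)</seg>_\1_p
}
' "$2"
}

en_text=$(extract_seg_text EN "$sample")
printf '%s\n' "$en_text"
```

This approach depends entirely on the line layout: if a <seg> element does not start on the line right after its <tuv> tag, or spans several lines, the substitution will not match and nothing is printed for that entry.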

If you want to keep the seg tags, you can change the third step:

sed -n '/xml:lang="EN"/ {
N
s_.*\(<seg>[^<]*</seg>\)_\1_p
}
' input_file

Answer 3

Answered by Michael Kay

The following XPath expression extracts the information you are looking for:

/tmx/body/tu/tuv[@xml:lang='EN']/seg
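
As one way to evaluate this expression from the shell, the sketch below uses xmllint from libxml2. That tool choice is an assumption made here for illustration; the answer itself does not name it. The expression is wrapped in string(), so only the first matching seg is returned, and the sample file is a shortened stand-in for the real data.

```shell
# Sketch: evaluate the XPath expression with xmllint (an assumed tool choice).
sample=$(mktemp)

cat > "$sample" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4b">
  <body>
    <tu tuid="55_100:6">
      <tuv xml:lang="EN">
        <seg>RESOLUTION 55/100</seg>
      </tuv>
      <tuv xml:lang="ES">
        <seg>RESOLUCIÓN 55/100</seg>
      </tuv>
    </tu>
  </body>
</tmx>
EOF

# string() yields the text of the first node the path selects.
xpath="string(/tmx/body/tu/tuv[@xml:lang='EN']/seg)"

english=""
if command -v xmllint >/dev/null 2>&1; then
    english=$(xmllint --xpath "$xpath" "$sample")
    printf '%s\n' "$english"
else
    echo "xmllint not found; skipping" >&2
fi
```

For files with many tu entries, a tool that iterates over every match (such as the xmlstarlet command in Answer 1) is a better fit than string().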

There are many tools that allow you to process XML files using XPath expressions. If you are working from the command line you could look at xmlsh.

It's hard to tell the context of the requirement, but I would imagine that as it grows beyond the simple case given here, you will want to look at XSLT and/or XQuery.

Answer 4

Answered by tanius

You can use the command line tool xml_grep like this:

xml_grep --cond "tu/tuv[@xml:lang='EN']/seg" --text_only file.tmx

The argument to --cond is an XPath-like expression. Its syntax is similar to what xmlstarlet and similar tools expect, but not identical.