Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me), citing the original address: http://stackoverflow.com/questions/2222150/

Time: 2020-09-06 12:57:41  Source: igfitidea

Extraction of data from a simple XML file

xml bash sed awk grep

Asked by Zacky112

I have an XML file with the following contents:

<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>

I need a way to extract what is inside the <job..></job> tags, "programming" in this case. This should be done at the Linux command prompt, using grep/sed/awk.

Answered by amarillion

Do you really have to use only those tools? They're not designed for XML processing, and although it's possible to get something that works OK most of the time, it will fail on edge cases such as encoding and line breaks.

I recommend xml_grep:

xml_grep 'job' jobs.xml --text_only

Which gives the output:

programming

On Ubuntu/Debian, xml_grep is in the xml-twig-tools package.

Answered by Vijay

 grep '<job' file_name | cut -f2 -d">"|cut -f1 -d"<"

Answered by lmxy

Using xmlstarlet:

echo '<job xmlns="http://www.sample.com/">programming</job>' | \
   xmlstarlet sel -N var="http://www.sample.com/" -t -m "//var:job" -v '.'

Answered by Sobrique

Please don't use line and regex based parsing on XML. It is a bad idea. You can have semantically identical XML with different formatting, and regex and line based parsing simply cannot cope with it.

Things like unary tags and variable line wrapping - these snippets 'say' the same thing:

<root>
  <sometag val1="fish" val2="carrot" val3="narf"></sometag>
</root>


<root>
  <sometag
      val1="fish"
      val2="carrot"
      val3="narf"></sometag>
</root>

<root
><sometag
val1="fish"
val2="carrot"
val3="narf"
></sometag></root>

<root><sometag val1="fish" val2="carrot" val3="narf"/></root>

Hopefully this makes it clear why making a regex/line-based parser is difficult. Fortunately, you don't need to. Many scripting languages have at least one parser option, sometimes more.
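
As a rough illustration of the point above, the following sketch feeds the four snippets to a line-based grep. The helper name line_match and the search pattern are illustrative assumptions, not part of any tool mentioned in this thread:

```shell
#!/bin/sh
# Four formattings of the same document, copied from the snippets above.
v1='<root>
  <sometag val1="fish" val2="carrot" val3="narf"></sometag>
</root>'
v2='<root>
  <sometag
      val1="fish"
      val2="carrot"
      val3="narf"></sometag>
</root>'
v3='<root
><sometag
val1="fish"
val2="carrot"
val3="narf"
></sometag></root>'
v4='<root><sometag val1="fish" val2="carrot" val3="narf"/></root>'

# Hypothetical helper: report whether a line-based pattern finds the
# element together with its first attribute on a single line.
line_match() {
    if printf '%s\n' "$1" | grep -q '<sometag val1="fish"'; then
        echo match
    else
        echo miss
    fi
}

for doc in "$v1" "$v2" "$v3" "$v4"; do
    line_match "$doc"
done
```

The loop prints match, miss, miss, match: the same pattern finds the element in the first and last formatting only, even though all four documents carry the same information.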

As a previous poster has alluded to, xml_grep is available. That's actually a tool based on the XML::Twig Perl library. What it does is use 'xpath expressions' to find something, and it differentiates between document structure, attributes and 'content'.

E.g.:

xml_grep 'job' jobs.xml --text_only

However in the interest of making better answers, here's a couple of examples of 'roll your own' based on your source data:

First way:

Use twig handlers that catch elements of a particular type and act on them. The advantage of doing it this way is that it parses the XML 'as you go' and lets you modify it in flight if you need to. This is particularly useful for discarding 'processed' XML when you're working with large files, using purge or flush:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

XML::Twig->new(
    twig_handlers => {
        'job' => sub { print $_ ->text }
    }
    )->parse( <> );

This uses <> to take input (piped in, or specified on the command line, as in ./myscript somefile.xml) and processes it: for each job element, it extracts and prints any associated text. (You might want print $_ -> text,"\n" to insert a linefeed.)

Because it's matching on 'job' elements, it'll also match on nested job elements:

<job>programming
    <job>anotherjob</job>
</job>

This will match twice, but it will print some of the output twice too. You can, however, match on /job instead if you prefer. Usefully, this lets you e.g. print and delete an element, or copy and paste one, modifying the XML structure.

Alternatively - parse first, and 'print' based on structure:

my $twig = XML::Twig->new( )->parse( <> );
print $twig -> root -> text;

As job is your root element, all we need to do is print its text.

But we can be a bit more discerning, and look for job or /job and print that specifically instead:

my $twig = XML::Twig->new( )->parse( <> );
print $twig -> findnodes('/job',0)->text;

You can use XML::Twig's pretty_print option to reformat your XML too:

XML::Twig->new( 'pretty_print' => 'indented_a' )->parse( <> ) -> print;

There's a variety of output format options, but for simpler XML (like yours) most will look pretty similar.

Answered by ghostdog74

Just use awk; there's no need for other external tools. The command below works even if your desired tags span multiple lines.

$ cat file
test
<job xmlns="http://www.sample.com/">programming</job>
<job xmlns="http://www.sample.com/">
programming</job>

$ awk -vRS="</job>" '{gsub(/.*<job.*>/,"");print}' file
programming

programming
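
The blank line in that output comes from the newline left inside the second record. A slightly tightened sketch of the same idea skips records without a <job tag and trims surrounding whitespace. It assumes an awk that accepts a multi-character RS (gawk and mawk do; strictly POSIX awks may not), and the function name extract_jobs is made up for illustration:

```shell
#!/bin/sh
# Hypothetical wrapper around the record-separator trick shown above.
# Assumption: awk treats a multi-character RS as a separator (gawk, mawk).
extract_jobs() {
    awk -v RS='</job>' '
        /<job/ {
            # Drop everything up to and including the opening <job ...> tag.
            sub(/.*<job[^>]*>/, "")
            # Trim leading and trailing whitespace, including newlines.
            gsub(/^[ \t\r\n]+|[ \t\r\n]+$/, "")
            print
        }' "$@"
}
```

Called as extract_jobs file on the sample file above, this should print the two programming lines without the blank line between them.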

Answered by 13ren

Assuming same line, input from stdin:

sed -ne '/<\/job>/ { s/<[^>]*>\(.*\)<\/job>/\1/; p }'

Notes: -n stops it outputting everything automatically; -e means it's a one-liner (as opposed to a script file); /<\/job>/ acts like a grep; s strips the open tag + attributes and the end tag; ; starts a new statement; p prints; {} makes the grep apply to both statements, as one.
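
A compact variant of the same idea, as a sketch: the substitution's own p flag can replace the separate grep-like address and print statement. The function name job_text is an assumption for illustration, and the sketch still assumes one complete job element per line:

```shell
#!/bin/sh
# Hypothetical helper: print the text of a <job ...>...</job> element
# that sits entirely on one line; lines without a match print nothing.
job_text() {
    # \(.*\) captures the text between the tags; \1 keeps only that text.
    sed -n 's/.*<job[^>]*>\(.*\)<\/job>.*/\1/p' "$@"
}

printf '%s\n' '<?xml version="1.0" encoding="utf-8"?>' \
    '<job xmlns="http://www.sample.com/">programming</job>' | job_text
```

For the sample document this prints programming; the XML declaration line is skipped because the substitution does not match it.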

Answered by vldbnc

Using the sed command:

Example:

$ cat file.xml
<note>
        <to>Tove</to>
                <from>Jani</from>
                <heading>Reminder</heading>
        <body>Don't forget me this weekend!</body>
</note>

$ cat file.xml | sed -ne '/<heading>/s#\s*<[^>]*>\s*##gp'
Reminder

Explanation:

cat file.xml | sed -ne '/<pattern_to_find>/s#\s*<[^>]*>\s*##gp'

-n - suppress automatic printing of all lines
-e - script

/<pattern_to_find>/ - finds lines that contain the specified pattern, e.g. <heading>

Next is the substitution s///p, which removes everything except the desired value; the / delimiter is replaced with # for better readability:

s#\s*<[^>]*>\s*##gp
\s* - also consumes whitespace, if any, before the tag (and likewise after it)
<[^>]*> - matches an <xml_tag>; used as a non-greedy alternative because <.*?> does not work in sed
g - applies the substitution to every match on the line, e.g. also the closing </xml_tag> tag
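
The pieces explained above can be wrapped in a small function, sketched here under a few assumptions: the name tag_text is made up, the tag is expected to appear without attributes, and [[:space:]] stands in for the GNU-specific \s to keep the expression portable:

```shell
#!/bin/sh
# Hypothetical helper: print the text of every line containing <TAG>,
# with all tags and the whitespace around them removed.
tag_text() {
    tag=$1
    shift
    # /<tag>/ selects the lines; s#...##gp strips the tags and prints.
    sed -n "/<$tag>/s#[[:space:]]*<[^>]*>[[:space:]]*##gp" "$@"
}
```

With the file.xml from the example, tag_text heading file.xml prints Reminder.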

Answered by miku

A bit late to the show.

xmlcutty cuts out nodes from XML:

$ cat file.xml
<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>
<job xmlns="http://www.sample.com/">designing</job>
<job xmlns="http://www.sample.com/">managing</job>
<job xmlns="http://www.sample.com/">teaching</job>

The path argument names the path to the element you want to cut out. In this case, since we are not interested in the tags at all, we rename the tag to \n, so we get a nice list:

$ xmlcutty -path /job -rename '\n' file.xml
programming
designing
managing
teaching

Note that the XML was not valid to begin with (no root element). xmlcutty can work with slightly broken XML, too.

Answered by codaddict

How about:

cat a.xml | grep '<job' | cut -d '>' -f 2 | cut -d '<' -f 1