Bash script and XML/RSS parsing

Notice: this page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/10551917/

Time: 2020-09-18 02:15:27  Source: igfitidea

Tags: xml, bash, parsing, xml-parsing, xmllint

Asked by Ivan

I'm writing a small script that parses an RSS feed using xmllint.

Right now I fetch the list of titles with the following command:

ITEMS=`echo "cat //title" | xmllint --shell rss.xml `
echo $ITEMS > tmpfile

But it returns:

<title>xxx</title> ------- <title>yyy :)</title> ------- <title>zzzzzz</title>

without newlines or spaces. I'm only interested in the text content of the title tags, and if possible I want to iterate over the titles using a for/while loop, something like:

for  val in $ITEMS 
do
       echo $val
done

How can this be done? Thanks in advance.

Answered by Philippe

At some point I had the same kind of requirement to parse XML in bash. I ended up using xmlstarlet (http://xmlstar.sourceforge.net/), which you might be able to install.

If not, something like this will remove the surrounding tags:

echo "cat  //title/text()" | xmllint --shell  rss.xml

Then you will need to clean up the output by piping it through a filter. A basic solution would be:

echo "cat  //title/text()" | xmllint --shell  rss.xml  | egrep '^\w'
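The cleanup step can be tried without xmllint installed. The sketch below is only an illustration: `sample_output` is a hypothetical stand-in for what `xmllint --shell` might print (the exact prompt and separator lines may differ between versions), and `clean_titles` is a made-up helper name. It uses a POSIX character class instead of `\w`, on the assumption that this behaves the same across grep implementations.

```shell
# Hypothetical sample of xmllint --shell output for: cat //title/text()
# (assumed layout: "/ >" prompt lines and " -------" separator lines).
sample_output='/ >  -------
xxx
 -------
yyy :)
 -------
zzzzzz
/ > '

# Keep only lines that start with a letter, digit, or underscore.
# This drops the prompt lines (start with "/") and separators (start with a space).
clean_titles() {
    grep -E '^[[:alnum:]_]'
}

titles=$(printf '%s\n' "$sample_output" | clean_titles)
printf '%s\n' "$titles"
```

Note that a filter like this would also drop any title that happens to begin with punctuation or whitespace, so it is a basic solution rather than a robust one.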

Hope this helps.

Answered by shellter

To answer your first question: the unquoted use of $ITEMS with echo is what eliminates your newline characters. Try

ITEMS=`echo "cat //title" | xmllint --shell rss.xml `
echo "$ITEMS" > tmpfile
#----^------^--- dbl-quotes only
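The effect of the quotes can be seen in isolation with a small sketch. The value assigned to `ITEMS` here is a hypothetical stand-in for the xmllint output, and the variable names `unquoted` and `quoted` are only for illustration.

```shell
# Hypothetical two-line value standing in for the xmllint output.
ITEMS='<title>xxx</title>
<title>yyy :)</title>'

# Unquoted: the shell splits the value on whitespace (including newlines)
# and echo joins the resulting words with single spaces.
unquoted=$(echo $ITEMS)

# Quoted: the value reaches echo as a single argument, newlines intact.
quoted=$(echo "$ITEMS")

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:\n%s\n' "$quoted"
```

With the unquoted form everything ends up on one line, which matches the symptom described in the question.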

In general, for loops are best reserved for items that cannot contain unexpected spaces or other special (non-alphanumeric) characters, like for i in {1..10} ; do echo $i; done

And you don't really need the variable or the temp file. Try

  echo "cat //title" | xmllint --shell rss.xml |
  while read line ; do
      echo "$line"
  done
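To make the difference between the two loop styles concrete, here is a minimal sketch that counts iterations. The `titles` value is hypothetical sample data, and a here-document is used instead of a pipe only so that the counters remain visible after the loop (a loop on the right side of a pipe usually runs in a subshell).

```shell
# Hypothetical sample data: three titles, one of which contains a space.
titles='xxx
yyy :)
zzzzzz'

# for over an unquoted expansion iterates over whitespace-separated words,
# so "yyy :)" is seen as two separate items.
for_count=0
for val in $titles; do
    for_count=$((for_count + 1))
done

# while/read iterates over lines, so each title stays in one piece.
line_count=0
while IFS= read -r line; do
    line_count=$((line_count + 1))
done <<EOF
$titles
EOF

echo "for loop items: $for_count"
echo "while/read lines: $line_count"
```

The for loop reports one item more than the while/read loop because of the space inside the second title.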

Depending on what is in your RSS feed, you may also benefit from changing the default IFS (Internal Field Separator) used by the read command. Try

while IFS= read line ....
# or (bash needs $'...' here; a plain "\n" is a literal backslash followed by n)
while IFS=$'\n' read line
# or
while IFS=$'\r\n' read line
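What these settings change can be checked with a short sketch. The `input` value and the variable names are hypothetical. It compares a plain `read` with `IFS= read -r`, and also shows that a double-quoted "\n" is two literal characters (a backslash and the letter n), not a newline; in bash an actual newline would be written as `$'\n'`.

```shell
# Hypothetical input line with leading/trailing spaces and a backslash.
input='   a\tb  '

# Default IFS, no -r: read trims surrounding whitespace and
# treats the backslash as an escape character, removing it.
read line_default <<EOF
$input
EOF

# Empty IFS plus -r: the line is stored exactly as written.
IFS= read -r line_raw <<EOF
$input
EOF

# A double-quoted "\n" is a backslash followed by n, so its length is 2.
literal="\n"
literal_len=${#literal}

printf '[%s]\n' "$line_default"
printf '[%s]\n' "$line_raw"
printf 'length of "\\n" in double quotes: %s\n' "$literal_len"
```

Whether trimming and backslash handling matter depends on the feed content, so treating `IFS= read -r` as the safe default is a judgment call rather than a requirement.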

I'm not sure what you're trying to achieve with echo "cat //title" | going into xmllint, so I'm leaving it as is. Is that an instruction to xmllint, or is it passed through to create a header for the document? (I don't have xmllint to experiment with right now.)

Also, you might want to look at reading RSS feeds with awk, but it is rather low level.

I hope this helps.