Bash script and xml/rss parsing
Notice: this page is a side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/10551917/
Asked by Ivan
I'm writing a small script that parses an RSS feed using xmllint.
Right now I fetch the list of titles with the following command:
ITEMS=`echo "cat //title" | xmllint --shell rss.xml `
echo $ITEMS > tmpfile
But it returns:
<title>xxx</title> ------- <title>yyy :)</title> ------- <title>zzzzzz</title>
without newlines or spaces. I'm only interested in the text content of the title tags, and if possible I want to iterate over the titles using a for/while loop, something like:
for val in $ITEMS
do
echo $val
done
How can this be done? Thanks in advance.
Answered by Philippe
At some point I had the same kind of requirement to parse XML in bash. I ended up using xmlstarlet (http://xmlstar.sourceforge.net/), which you might be able to install.
If not, something like this will remove the surrounding tags:
echo "cat //title/text()" | xmllint --shell rss.xml
Then you will need to clean up the output after piping it. A basic solution would be:
echo "cat //title/text()" | xmllint --shell rss.xml | egrep '^\w'
Hope this helps
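To illustrate what that cleanup step does, here is a minimal sketch that runs an equivalent filter over a hard-coded sample. The sample text is an assumption about what the xmllint shell prints for cat //title/text() (prompt lines starting with "/ >" and " -------" separators); real output may differ between libxml2 versions. grep -E with a POSIX character class stands in for egrep '^\w', which should behave the same on this input.

```shell
#!/usr/bin/env bash
# Assumed sample of xmllint --shell output; not captured from a real run.
sample_output='/ >  -------
xxx
 -------
yyy :)
 -------
zzzzzz
/ > '

# Keep only lines that start with a letter, digit, or underscore,
# which drops the prompt and separator lines.
titles=$(printf '%s\n' "$sample_output" | grep -E '^[[:alnum:]_]')

printf '%s\n' "$titles"
```

Note that a filter like this would also drop any title that begins with punctuation or whitespace, so it is only a basic cleanup.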
Answered by shellter
To answer your first question: the unquoted use of $ITEMS with echo is eliminating your newline characters. Try
ITEMS=`echo "cat //title" | xmllint --shell rss.xml `
echo "$ITEMS" > tmpfile
#----^------^--- dbl-quotes only
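The effect of the quotes can be seen without xmllint at all. The sketch below uses a hard-coded, hypothetical value for ITEMS in place of the real command output and compares the two forms of echo.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the xmllint output: three lines of text.
ITEMS='<title>xxx</title>
<title>yyy :)</title>
<title>zzzzzz</title>'

unquoted=$(echo $ITEMS)     # word splitting rejoins everything with single spaces
quoted=$(echo "$ITEMS")     # newlines survive inside double quotes

echo "unquoted: $unquoted"
echo "quoted:"
echo "$quoted"
```

Unquoted, the shell splits the value on whitespace and echo receives the pieces as separate arguments, which it prints separated by single spaces.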
In general, for loops are best left to items that won't generate unexpected spaces or other non-printable characters (non-alphanumerics), like for i in {1..10} ; do echo $i; done
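A small sketch of why this matters for the titles in the question: a title such as "yyy :)" contains a space, so an unquoted for loop sees it as two items. The ITEMS value here is hypothetical sample data, and mapfile assumes bash 4 or newer.

```shell
#!/usr/bin/env bash
# Hypothetical title list, one title per line.
ITEMS='xxx
yyy :)
zzzzzz'

# Unquoted expansion splits on every space, tab, and newline.
split_count=0
for val in $ITEMS; do
    split_count=$((split_count + 1))
done

# mapfile (bash 4+) reads one array element per line instead.
mapfile -t titles <<< "$ITEMS"

echo "for loop saw $split_count words"
echo "array holds ${#titles[@]} titles"
for title in "${titles[@]}"; do
    echo "title: $title"
done
```

Iterating over a quoted array expansion keeps each title intact, which is one way to get the for-style loop the question asks for.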
And you don't really need the variables or the tempfile. Try
echo "cat //title" | xmllint --shell rss.xml |
while read line ; do
echo "$line"
done
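As a sketch of how such a loop could also strip the tags, the example below reads a hard-coded sample (a hypothetical stand-in for the xmllint output) and removes the title tags with parameter expansion. It feeds the loop from a here-string instead of a pipe, because a piped while loop runs in a subshell and the array would be lost afterwards.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the piped xmllint output.
sample='<title>xxx</title>
 -------
<title>yyy :)</title>
 -------
<title>zzzzzz</title>'

titles=()
while IFS= read -r line; do
    # Skip anything that is not a <title> line (separators, prompts).
    [[ $line == "<title>"*"</title>" ]] || continue
    text=${line#"<title>"}      # drop the opening tag
    text=${text%"</title>"}     # drop the closing tag
    titles+=("$text")
done <<< "$sample"

printf '%s\n' "${titles[@]}"
```

This only handles a title that sits on a single line; anything more complex is better left to a real XML tool.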
Depending on what is in your rss feed, you may also benefit from changing the default IFS (Internal Field Separator) that is used by the read command. Try
while IFS= read line ....
# or (in bash, "\n" in double quotes is a literal backslash and n; $'\n' is a real newline)
while IFS=$'\n' read line
# or
while IFS=$'\r\n' read line
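To see what the IFS setting changes, here is a minimal sketch comparing a plain read with IFS= read -r on one hypothetical input line. The -r flag is an addition not mentioned above; it stops read from treating backslashes as escape characters.

```shell
#!/usr/bin/env bash
# One hypothetical input line with leading spaces and a literal backslash.
input='   padded \n title'

read line_default <<< "$input"          # trims leading blanks, eats the backslash
IFS= read -r line_raw <<< "$input"      # keeps the line exactly as written

echo "default: [$line_default]"
echo "raw:     [$line_raw]"
```

With an empty IFS and -r, each line reaches the loop body unchanged, which is usually what you want when the content is arbitrary feed text.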
I'm not sure what you're trying to achieve with echo "cat //title" | going into xmllint, so I'm leaving it as is. Is that an instruction to xmllint? Or is it passed through to create a header for the document? (I don't have xmllint to experiment with right now.)
Also, you might want to look at reading rss feeds with awk, but it is rather low level.
I hope this helps.

