bash: Using curl, grep, and sed to extract data from HTML
Original question: http://stackoverflow.com/questions/23982321/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Using curl, grep, and sed to extract data from HTML
Asked by user2397282
I am trying to learn some terminal commands, and saw this one that grabs the link to the latest Google doodle and copies it to your clipboard:
$ curl http://www.google.com/doodles#oodles/archive |
grep -A5 'latest-doodle on' | grep 'img src' |
sed s/.*'<img src="\/\/'/''/ | sed s/'" alt=".*'/''/ | pbcopy
I tried to do something similar - this command should copy the word of the day to your clipboard:
curl "http://www.merriam-webster.com/word-of-the-day/" |
grep -A5 'main_entry_word' | sed s/.*'<strong class="main_entry_word">'/''/ |
sed s/'</\strong>.*'/''/ | pbcopy
I got an error that said:
sed: 1: "s/</\strong>.*//": bad flag in substitute command: '/'
I'm not really sure what I'm doing and I've tried some tutorials on other websites but I can't figure it out. I think the main problem is that I don't understand what most of the 'sed' command does.
Can someone help me please?
Accepted answer by Bruce K
sed s/'<\/strong>.*'/''/
or
sed s@'</strong>.*'@''@
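Both forms work for the same reason. In the failing command, the unescaped / inside </strong> ends the pattern early, so sed treats the text after it as the replacement and then finds a stray / where it expects flags. A minimal sketch of the two fixes follows, run against a hypothetical sample line rather than the real page (the html variable and its contents are assumptions made for illustration):

```shell
# Hypothetical sample line standing in for the page's HTML.
html='<strong class="main_entry_word">palmy</strong> extra'

# Fix 1: escape the slash so sed does not read it as the end of the pattern.
escaped=$(printf '%s\n' "$html" | sed 's/<\/strong>.*//')

# Fix 2: choose a delimiter that does not occur in the pattern, such as @.
alternate=$(printf '%s\n' "$html" | sed 's@</strong>.*@@')

# Both variables now hold the line with </strong> and everything after it removed.
printf '%s\n' "$escaped" "$alternate"
```

Quoting the whole sed script in single quotes, as done here, also keeps the shell from interpreting any of its characters.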
Answered by Kent
If I understand your requirement correctly, you want to extract the text between <strong...class="..."> and </strong>. I would use a single grep in place of your grep|grep|sed|sed... chain. Also use curl's -s option:
kent$ curl -s "link"|grep -Po '<strong\s+class="main_entry_word">\K.*?(?=</strong>)'
output:
palmy
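Note that -P (Perl-compatible regular expressions) is a GNU grep extension, so the command above may not work with the BSD grep shipped on macOS, which the pbcopy in the question suggests. Assuming that is the case, one possible alternative is a single sed call with a capture group. This is only a sketch against a hypothetical sample line (the html variable is an assumption, and the real page markup may differ):

```shell
# Hypothetical sample standing in for the downloaded page.
html='<h2><strong class="main_entry_word">palmy</strong></h2>'

# -n suppresses sed's default output; the trailing p flag prints only lines
# where the substitution matched, reduced to the captured group \1.
word=$(printf '%s\n' "$html" |
  sed -n 's/.*<strong class="main_entry_word">\([^<]*\)<\/strong>.*/\1/p')

printf '%s\n' "$word"
```

With real input, the printf would be replaced by curl -s "link", as in the answer above.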