Using curl, grep, and sed to extract data from HTML

Source: Stack Overflow, http://stackoverflow.com/questions/23982321/. This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.

Tags: bash, curl, sed, grep

Asked by user2397282

I am trying to learn some terminal commands, and saw this one that grabs the links of the latest Google doodle and copies it to your clipboard:

$ curl http://www.google.com/doodles#oodles/archive |
grep -A5 'latest-doodle on' | grep 'img src' |
sed s/.*'<img src="\/\/'/''/ | sed s/'" alt=".*'/''/ | pbcopy

I tried to do something similar - this command should copy the word of the day to your clipboard:

curl "http://www.merriam-webster.com/word-of-the-day/" |
grep -A5 'main_entry_word' | sed s/.*'<strong class="main_entry_word">'/''/ |
sed s/'</\strong>.*'/''/ | pbcopy

I got an error that said:

sed: 1: "s/</\strong>.*//": bad flag in substitute command: '/'

I'm not really sure what I'm doing and I've tried some tutorials on other websites but I can't figure it out. I think the main problem is that I don't understand what most of the 'sed' command does.

Can someone help me please?

Accepted answer by Bruce K

sed s/'<\/strong>.*'/''/

or

sed s@'</strong>.*'@''@
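
Both forms work for the same reason. In the original command the shell strips the quotes and passes sed the script s/</\strong>.*//, so sed treats the / in </ as the end of the pattern, reads \strong>.* as the replacement, and then finds a stray / where it expects flags. The following is a minimal sketch of the two fixes on a made-up line; the `sample` variable and its HTML are assumptions for illustration, not the real page:

```shell
# Hypothetical sample line standing in for the downloaded HTML.
sample='<strong class="main_entry_word">palmy</strong> extra'

# Fix 1: escape the slash so sed does not treat it as the s/// delimiter.
escaped=$(printf '%s\n' "$sample" | sed 's/<\/strong>.*//')

# Fix 2: use a different delimiter (@) so the slash needs no escaping.
alt_delim=$(printf '%s\n' "$sample" | sed 's@</strong>.*@@')

# Both variables now hold the line with </strong> and everything after it removed.
printf '%s\n' "$escaped"
printf '%s\n' "$alt_delim"
```

Quoting the whole sed script in one pair of single quotes, as in this sketch, is a style choice that makes it easier to see exactly what sed receives.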

Answer by Kent

If I understand your requirement correctly, you want to extract the text between <strong...class="..."> and </strong>. I would use a single grep in place of your grep|grep|sed|sed... chain:

Also use the -s option of curl:

kent$  curl -s "link"|grep -Po '<strong\s+class="main_entry_word">\K.*?(?=</strong>)'

output:

palmy
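
A note on portability: -P (Perl-compatible regular expressions) and \K are GNU grep features, and the pbcopy in the question suggests macOS, where the stock grep may not support -P. As a hedged alternative, the same extraction can be sketched with sed -n alone. The `sample` line below is invented for illustration, and the real page markup may differ:

```shell
# Hypothetical sample line; the real page markup may differ.
sample='<h2><strong class="main_entry_word">palmy</strong></h2>'

# -n suppresses the default output; the trailing p prints only lines where
# the substitution matched. \([^<]*\) captures the text up to the next tag.
word=$(printf '%s\n' "$sample" |
  sed -n 's@.*<strong class="main_entry_word">\([^<]*\)</strong>.*@\1@p')

printf '%s\n' "$word"
```

To use this against the live page, the printf stage would be replaced by curl -s with the page URL, and the result could be piped to pbcopy as in the question.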