Bash: extract a specific string from a curl result
Original question: http://stackoverflow.com/questions/3030908/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Extract a specific string from a curl'd result
Asked by user170579
Given this curl command: curl --user-agent "fogent" --silent -o page.html "http://www.google.com/search?q=insansiate"
* Spelling is intentionally incorrect. I want to grab the suggestion as my result.
I want to be able to either grep into the page.html file perhaps with grep -oE or pipe it right from curl and never store a file.
The result should be: 'instantiate'
I need only the word 'instantiate' (or the phrase); whatever Google auto-corrects the query to is what I am after.
Here is the basic html that is returned:
<span class=spell style="color:#cc0000">Did you mean: </span><a href="/search?hl=en&ie=UTF-8&&sa=X&ei=VEMUTMDqGoOINraK3NwL&ved=0CB0QBSgA&q=instantiate&spell=1"class=spell><b><i>instantiate</i></b></a> <span class=std>Top 2 results shown</span>
So perhaps I could match from the start to the end of the string below, which I hope is unique enough to cover all my bases.
class=spell><b><i>instantiate</i></b></a>
I keep running into issues with greedy grep; perhaps I should run it through an HTML prettify tool first to get a line break or 50 in there. I don't know of any simple way to do so in bash, which is what I would ideally like this to be in. I really don't want to deal with firing up Perl and making sure I have the correct module.
Any suggestions? Thank you.
Answered by Paused until further notice.
As I'm sure you're aware, screen scraping is a delicate business. This command sequence is no exception since it relies on the specific structure of the page which could change at any time without notice.
grep -o 'Did you mean:\([^>]*>\)\{5\}' page.html | sed 's/.*<i>\([^<]*\)<.*/\1/'
In a pipe:
curl --user-agent "fogent" --silent "http://www.google.com/search?q=insansiate" | grep -o 'Did you mean:\([^>]*>\)\{5\}' | sed 's/.*<i>\([^<]*\)<.*/\1/'
This relies on finding five ">" characters between "Did you mean:" and the "</i>" after the word you're looking for.
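To see the two steps in isolation without touching the network, here is a minimal sketch that runs the same grep and sed expressions against the sample markup from the question. The variable names `html`, `fragment`, and `suggestion` are only illustrative, and a live results page may no longer look like this sample.

```shell
# Sample markup copied from the question; a live page may differ.
html='<span class=spell style="color:#cc0000">Did you mean: </span><a href="/search?hl=en&ie=UTF-8&&sa=X&ei=VEMUTMDqGoOINraK3NwL&ved=0CB0QBSgA&q=instantiate&spell=1"class=spell><b><i>instantiate</i></b></a> <span class=std>Top 2 results shown</span>'

# Step 1: keep "Did you mean:" plus the next five ">"-terminated chunks,
# which ends right after the "</i>" that follows the suggestion.
fragment=$(printf '%s\n' "$html" | grep -o 'Did you mean:\([^>]*>\)\{5\}')

# Step 2: capture the text between "<i>" and the next "<".
suggestion=$(printf '%s\n' "$fragment" | sed 's/.*<i>\([^<]*\)<.*/\1/')

printf '%s\n' "$suggestion"
```

Because `[^>]*` cannot cross a ">" character, each repetition consumes exactly one tag boundary, which is what keeps the match from running greedily to the end of the line.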
Have you considered other methods of getting spelling suggestions or are you specifically interested in what Google provides?
If you have ispell or aspell installed, you can do:
echo insansiate | ispell -a
and parse the result.
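As a rough sketch of the parsing step: in pipe mode (-a), ispell and aspell report a misspelled word on a line beginning with "&", followed by the word, a suggestion count, an offset, a colon, and a comma-separated list of suggestions. The helper name `first_suggestion` is hypothetical, and the sample output below (including which suggestions appear and in what order) is an assumption for illustration; real output depends on the installed dictionary.

```shell
# Print the first suggestion from ispell/aspell pipe-mode output on stdin.
# Lines that do not start with "& " (the version banner, "*" for correct
# words, "#" for words with no suggestions) are ignored.
first_suggestion() {
  sed -n 's/^& [^:]*: \([^,]*\).*/\1/p'
}

# Illustrative sample of pipe-mode output, not captured from a real run.
sample='@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.8)
& insansiate 2 0: instantiate, insatiate'

printf '%s\n' "$sample" | first_suggestion
```

With a real spell checker installed, the same helper would be used as: echo insansiate | ispell -a | first_suggestion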
Answered by mklement0
xidel is a great utility for scraping web pages; it supports retrieving pages and extracting information in various query languages (CSS selectors, XPath).
In the case at hand, the simple CSS selector a.spell will do the trick.
xidel --user-agent "fogent" "http://google.com/search?q=insansiate" -e 'a.spell'
Note how xidel does its own page retrieval, so there is no need for curl in this case.
If, however, you needed curl for more exotic retrieval options, here's how you'd combine the two tools (line break for readability):
curl --user-agent "fogent" --silent "http://google.com/search?q=insansiate" |
xidel - -e 'a.spell'
Answered by Ignacio Vazquez-Abrams
curl --> tidy -asxml --> xmlstarlet sel
Answered by trapd00r
Edit: Sorry, did not see your Perl notice.
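#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Search term: first command-line argument, or the misspelled default.
my $arg = shift // 'insansiate';
my $lwp = LWP::UserAgent->new(agent => 'Mozilla');
my $c = $lwp->get("http://www.google.com/search?q=$arg");
die $c->status_line unless $c->is_success;
# Split on ':' so the text after "Did you mean:" starts its own chunk.
my @content = split(/:/, $c->content);
for (@content) {
    # The suggestion is the text wrapped in <b><i>...</i></b>.
    if (m;<b><i>(.+?)</i></b>;) {
        print "$1\n";
        exit;
    }
}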
Running:
> perl google.pl
instantiate
> perl google.pl disconect
disconnect

