Data scraping with wget and regex (bash)

Note: this page is a translation of a popular StackOverflow question. It is provided under the CC BY-SA 4.0 license; if you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.

Original question: http://stackoverflow.com/questions/7361229/
Asked by Aadi Droid
I'm just learning bash scripting, and I was trying to scrape some data out of a site, mostly Wiktionary. This is what I'm trying on the command line right now, but it is not returning any result:
wget -qO- http://en.wiktionary.org/wiki/robust | egrep '<ol>{[a-zA-Z]*[0-9]*}*</ol>'
What I'm trying to do is get the data between the tags and simply display it. Can you please help me find out what I'm doing wrong?
Thanks
Accepted answer by Michał Šrajer
You need to send the output to stdout:
wget -q http://en.wiktionary.org/wiki/robust -O - | ...
To get all <ol> tags with grep, you can do:
wget -q http://en.wiktionary.org/wiki/robust -O - | tr '\n' ' ' | grep -o '<ol>.*</ol>'
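To see why the tr step matters, here is the same tr + grep -o pipeline run on an inline HTML snippet instead of a live fetch (the sample list content is made up, not taken from the actual Wiktionary page):

```shell
#!/bin/sh
# Stand-in for the wget output; like the real page, the <ol> spans several lines.
html='<html><body>
<ol>
<li>sturdy</li>
<li>healthy</li>
</ol>
</body></html>'

# grep matches one line at a time, so first flatten newlines into spaces
# so the whole <ol>...</ol> span sits on a single line; grep -o then
# prints only the matching span rather than the entire line.
printf '%s\n' "$html" | tr '\n' ' ' | grep -o '<ol>.*</ol>'
# prints: <ol> <li>sturdy</li> <li>healthy</li> </ol>
```

Note that `.*` is greedy, so if a page contains several lists, this matches from the first <ol> through the last </ol>.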
Answered by aioobe
At least you need to:

- activate regular expressions by adding the -e switch.
- send the output from wget to stdout instead of to disk by adding the -O - option.
Honestly, I'd say grep is the wrong tool for this task, since grep works on a per-line basis and your expression stretches over several lines.
I think sed or awk would be a better fit for this task.
With sed it would look like:
wget -O - -q http://en.wiktionary.org/wiki/robust | sed -n "/<ol>/,/<\/ol>/p"
If you want to get rid of the extra <ol> and </ol>, you could append:
... | grep -v -E "</?ol>"
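A small self-contained sketch of this sed + grep -v approach, using an inline HTML snippet in place of the wget output (the list content here is invented for illustration):

```shell
#!/bin/sh
# Stand-in for the page body; only the <ol> block is of interest.
html='<p>intro</p>
<ol>
<li>able to withstand stress</li>
</ol>
<p>outro</p>'

# sed -n with an address range prints only the lines from the one
# matching /<ol>/ through the one matching /<\/ol>/, inclusive;
# grep -v then drops the <ol> and </ol> wrapper lines themselves.
printf '%s\n' "$html" | sed -n '/<ol>/,/<\/ol>/p' | grep -v -E '</?ol>'
# prints: <li>able to withstand stress</li>
```

Unlike the tr-based approach, this keeps the original line structure, which makes stripping the wrapper tags a simple per-line filter.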
Answered by Raffael
If I understand the question correctly, then the goal is to extract the visible text content from within the <ol> sections. I would do it this way:
wget -qO- http://en.wiktionary.org/wiki/robust |
hxnormalize -x |
hxselect "ol" |
lynx -stdin -dump -nolist
[source: "Using the Linux Shell for Web Scraping"]
hxnormalize preprocesses the HTML code for hxselect, which applies the CSS selector "ol". Lynx will render the code and reduce it to what is visible in a browser.

