
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/7361229/

Date: 2020-09-18 00:44:55  Source: igfitidea

Data scraping with wget and regex

Tags: bash, grep, wget

Asked by Aadi Droid

I'm just learning bash scripting and was trying to scrape some data out of a site, mostly Wiktionary. This is what I'm trying on the command line right now, but it is not returning any results:

wget -qO- http://en.wiktionary.org/wiki/robust | egrep '<ol>{[a-zA-Z]*[0-9]*}*</ol>'

What I'm trying to do is get the data between the tags; I just want them to be displayed. Can you please help me find out what I'm doing wrong?

Thanks

Accepted answer by Michał Šrajer

You need to send the output to stdout:

wget -q -O - http://en.wiktionary.org/wiki/robust | ...

To get all <ol> tags with grep, you can do:

wget -q http://en.wiktionary.org/wiki/robust -O - | tr '\n' ' ' | grep -o '<ol>.*</ol>'
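Since grep matches line by line, the tr step joins the whole page into a single line first, so the <ol>...</ol> span becomes matchable. The same technique can be checked offline on a made-up HTML fragment (the snippet and its contents are purely illustrative):

```shell
# Illustrative HTML fragment (made up; no network access needed)
html='<html><body>
<p>intro</p>
<ol><li>first sense</li>
<li>second sense</li></ol>
<p>outro</p>
</body></html>'

# Join all lines into one, then print only the <ol>...</ol> match
printf '%s\n' "$html" | tr '\n' ' ' | grep -o '<ol>.*</ol>'
# prints: <ol><li>first sense</li> <li>second sense</li></ol>
```

Note that `.*` is greedy, so on a page with several <ol> blocks the match would stretch from the first <ol> to the last </ol>.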

Answer by aioobe

At least you need to

  • activate extended regular expressions by adding the -E switch.
  • send the output from wget to stdout instead of to disk by adding the -O - option.

Honestly, I'd say grep is the wrong tool for this task, since grep works on a per-line basis, and your expression stretches over several lines.

I think sed or awk would be a better fit for this task.

With sed it would look like:

wget -O - -q http://en.wiktionary.org/wiki/robust | sed -n "/<ol>/,/<\/ol>/p"
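The /<ol>/,/<\/ol>/ address pair tells sed to print every line from a line matching <ol> through the next line matching </ol>, so no line-joining is needed. A quick offline check on a made-up fragment (illustrative content only):

```shell
# Print only the lines between <ol> and </ol>, inclusive
printf '%s\n' '<p>before</p>' '<ol>' '<li>one</li>' '<li>two</li>' '</ol>' '<p>after</p>' |
  sed -n '/<ol>/,/<\/ol>/p'
# prints:
# <ol>
# <li>one</li>
# <li>two</li>
# </ol>
```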

If you want to get rid of the extra <ol> and </ol>, you could append:

... | grep -v -E "</?ol>"
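Here grep -v inverts the match, and -E enables the extended pattern </?ol>, which matches both the opening and the closing tag line. Checked on a small illustrative sample:

```shell
# Drop the lines that are just the <ol> / </ol> wrappers
printf '%s\n' '<ol>' '<li>one</li>' '<li>two</li>' '</ol>' |
  grep -v -E '</?ol>'
# prints:
# <li>one</li>
# <li>two</li>
```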

Answer by Raffael

If I understand the question correctly, the goal is to extract the visible text content from within the <ol> sections. I would do it this way:

wget -qO- http://en.wiktionary.org/wiki/robust | 
  hxnormalize -x | 
  hxselect "ol" | 
  lynx -stdin -dump -nolist

[source: "Using the Linux Shell for Web Scraping"]

hxnormalize preprocesses the HTML for hxselect, which applies the CSS selector "ol"; lynx then renders the result and reduces it to what would be visible in a browser.
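Note that hxnormalize and hxselect are not installed by default on most systems; they come from the W3C html-xml-utils package, and lynx is a separate text-mode browser. On a Debian-family system the usual install would be (package names assumed for Debian/Ubuntu; other distros differ):

```shell
# Package names assumed for Debian/Ubuntu; adjust for your distro
sudo apt-get install html-xml-utils lynx
```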
