Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/22537163/
Run curl command on each line of a file and fetch data from result
Asked by aelor
Suppose I have a file containing a list of links of webpages.
www.xyz.com/asdd
www.wer.com/asdas
www.asdas.com/asd
www.asd.com/asdas
I know that doing curl www.xyz.com/asdd will fetch me the HTML of that webpage. I want to fetch some data from that webpage.
So the scenario is: use curl to hit all the links in the file one by one, extract some data from each webpage, and store it somewhere else. Any ideas or suggestions?
Accepted answer by fedorqui 'SO stop harming'
As indicated in the comments, this will loop through your_file and curl each line:
while IFS= read -r line
do
curl "$line"
done < your_file
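Since the question also asks about storing the results somewhere else, here is a minimal variation of the same loop that redirects each fetch to its own file. The page_N.html naming is only an illustration, not part of the original answer:

```shell
# Sketch: fetch every URL in your_file and save each page
# to a numbered file (page_1.html, page_2.html, ...).
i=0
while IFS= read -r line
do
  i=$((i+1))
  curl -s "$line" > "page_$i.html"
done < your_file
```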
To get the <title> of a page, you can grep something like this:
grep -iPo '(?<=<title>).*(?=</title>)' file
So all together you could do:
while IFS= read -r line
do
curl -s "$line" | grep -Po '(?<=<title>).*(?=</title>)'
done < your_file
Note that curl -s is for silent mode. See an example with the Google page:
$ curl -s http://www.google.com | grep -Po '(?<=<title>).*(?=</title>)'
302 Moved
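The 302 Moved here is Google answering with a redirect rather than the page itself. A small tweak not in the original answer is curl's -L flag, which follows redirects so the loop prints the title of the final page:

```shell
# -L makes curl follow HTTP redirects before printing the body,
# so grep sees the final page's <title> instead of a 302 stub.
while IFS= read -r line
do
  curl -sL "$line" | grep -Po '(?<=<title>).*(?=</title>)'
done < your_file
```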
Answered by Orun
You can accomplish this in just one line with xargs. Let's say you have a file in the working directory with all your URLs (one per line) called sitemap:
xargs -I{} curl -s {} <sitemap | grep title
This would extract any lines with the word "title" in them. To extract the title tags you'll want to change the grep a little. The -o flag ensures that only the matched portion is printed:
xargs -I{} curl -s {} <sitemap | grep -o "<title>.*</title>"
- If you want to extract certain data, you will need to \ escape characters.
  - For HTML attributes, for example, you should match single and double quotes, and escape them like [\"\']
- Sometimes, depending on the character set, you may get some unusual curl output with special characters. If you detect this, you'll need to switch the encoding with a utility like iconv