bash: run a curl command on each line of a file and fetch data from the result

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/22537163/


Run curl command on each line of a file and fetch data from result

regex, bash, curl, awk

Asked by aelor

Suppose I have a file containing a list of links of webpages.


www.xyz.com/asdd
www.wer.com/asdas
www.asdas.com/asd
www.asd.com/asdas

I know that doing curl www.xyz.com/asdd will fetch me the HTML of that webpage. I want to fetch some data from that webpage.


So the scenario is to use curl to hit all the links in the file one by one, extract some data from each webpage, and store it somewhere else. Any ideas or suggestions?


Accepted answer by fedorqui 'SO stop harming'

As indicated in the comments, this will loop through your_file and curl each line:


while IFS= read -r line
do
   curl "$line"
done < your_file
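
If you also want to keep each page for later processing, a minimal variant of the same loop saves every response to its own file (the page_N.html names are just an illustration, not part of the original answer):

n=0
while IFS= read -r line
do
   n=$((n + 1))
   # -o writes the response body to a file instead of stdout
   curl -s "$line" -o "page_$n.html"
done < your_file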

To get the <title> of a page, you can grep something like this:


grep -iPo '(?<=<title>).*(?=</title>)' file
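
Note that the -P (Perl regex) option is not available in every grep build, for example the BSD grep shipped with macOS. As a hedged alternative under that assumption, the same extraction can be done with plain sed:

# Print whatever sits between <title> and </title> (no PCRE support needed)
sed -n 's:.*<title>\(.*\)</title>.*:\1:p' file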

So all together you could do


while IFS= read -r line
do
   curl -s "$line" | grep -Po '(?<=<title>).*(?=</title>)'
done < your_file
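
Since the question also asks to store the extracted data somewhere else, here is a minimal sketch that appends URL/title pairs to a tab-separated file (the titles.tsv name is only an illustration):

while IFS= read -r line
do
   # Grab the title, then record "URL<TAB>title" in titles.tsv
   title=$(curl -s "$line" | grep -Po '(?<=<title>).*(?=</title>)')
   printf '%s\t%s\n' "$line" "$title" >> titles.tsv
done < your_file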

Note that curl -s is for silent mode. See an example with the Google page:


$ curl -s http://www.google.com | grep -Po '(?<=<title>).*(?=</title>)'
302 Moved
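
The 302 Moved title shows up because google.com answers with a redirect rather than the page itself. If you want the title of the final page instead, adding -L (follow redirects) should do it, assuming that is the behaviour you're after:

$ curl -sL http://www.google.com | grep -Po '(?<=<title>).*(?=</title>)'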

Answer by Orun

You can accomplish this in just one line with xargs. Let's say you have a file in the working directory, called sitemap, with all your URLs (one per line):


xargs -I{} curl -s {} <sitemap | grep title


This would extract any lines with the word "title" in them. To extract the title tags you'll want to change the grep a little. The -o flag ensures that only the grepped result is printed:


xargs -I{} curl -s {} <sitemap | grep -o "<title>.*</title>"

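If the list of URLs is long, xargs can also issue the requests in parallel; a sketch with four concurrent curl processes (the 4 is an arbitrary choice, not part of the original answer):

# -P 4 runs up to four curl processes at a time
xargs -P 4 -I{} curl -s {} <sitemap | grep -o "<title>.*</title>"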

A few things to note:
  • If you want to extract certain data, you will need to \-escape special characters.
    • For HTML attributes, for example, you should match single and double quotes, and escape them like [\"\']
  • Sometimes, depending on the character set, you may get some unusual curl output with special characters. If you detect this, you'll need to switch the encoding with a utility like iconv (see the sketch after this list).
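
For the iconv point above, a minimal sketch, assuming the page is served as ISO-8859-1 and you want UTF-8 (both encodings and the example URL are assumptions):

# Re-encode the response before grepping for the title
curl -s http://www.example.com | iconv -f ISO-8859-1 -t UTF-8 | grep -o "<title>.*</title>"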