bash: run a curl command on each line of a file and fetch data from the result

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/22537163/


Run curl command on each line of a file and fetch data from result

regex, bash, curl, awk

Asked by aelor

Suppose I have a file containing a list of links of webpages.


www.xyz.com/asdd
www.wer.com/asdas
www.asdas.com/asd
www.asd.com/asdas

I know that doing curl www.xyz.com/asdd will fetch me the HTML of that webpage. I want to fetch some data from that webpage.


So the scenario is to use curl to hit all the links in the file one by one, extract some data from each webpage, and store it somewhere else. Any ideas or suggestions?


Accepted answer by fedorqui 'SO stop harming'

As indicated in the comments, this will loop through your_file and curl each line:


while IFS= read -r line
do
   curl "$line"
done < your_file
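
If you also want to keep each page for later processing, a minimal variant of the same loop saves every response to its own file (the page_N.html names are just an illustration, not part of the original answer):

n=0
while IFS= read -r line
do
   n=$((n + 1))
   # -o writes the response body to a file instead of stdout
   curl -s "$line" -o "page_$n.html"
done < your_file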

To get the <title> of a page, you can grep something like this:


grep -iPo '(?<=<title>).*(?=</title>)' file
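
Note that the -P (Perl regex) option is not available in every grep build, for example the BSD grep shipped with macOS. As a hedged alternative under that assumption, the same extraction can be done with plain sed:

# Print whatever sits between <title> and </title> (no PCRE support needed)
sed -n 's:.*<title>\(.*\)</title>.*:\1:p' file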

So all together you could do


while IFS= read -r line
do
   curl -s "$line" | grep -Po '(?<=<title>).*(?=</title>)'
done < your_file
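
Since the question also asks to store the extracted data somewhere else, here is a minimal sketch that appends URL/title pairs to a tab-separated file (the titles.tsv name is only an illustration):

while IFS= read -r line
do
   # Grab the title, then record "URL<TAB>title" in titles.tsv
   title=$(curl -s "$line" | grep -Po '(?<=<title>).*(?=</title>)')
   printf '%s\t%s\n' "$line" "$title" >> titles.tsv
done < your_file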

Note that curl -s is for silent mode. See an example with the Google page:


$ curl -s http://www.google.com | grep -Po '(?<=<title>).*(?=</title>)'
302 Moved
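
The 302 Moved title shows up because google.com answers with a redirect rather than the page itself. If you want the title of the final page instead, adding -L (follow redirects) should do it, assuming that is the behaviour you're after:

$ curl -sL http://www.google.com | grep -Po '(?<=<title>).*(?=</title>)'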

Answer by Orun

You can accomplish this in just one line with xargs. Let's say you have a file in the working directory, called sitemap, with all your URLs (one per line):


xargs -I{} curl -s {} <sitemap | grep title


This would extract any lines with the word "title" in them. To extract the title tags you'll want to change the grep a little. The -o flag ensures that only the grepped result is printed:


xargs -I{} curl -s {} <sitemap | grep -o "<title>.*</title>"

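If the list of URLs is long, xargs can also issue the requests in parallel; a sketch with four concurrent curl processes (the 4 is an arbitrary choice, not part of the original answer):

# -P 4 runs up to four curl processes at a time
xargs -P 4 -I{} curl -s {} <sitemap | grep -o "<title>.*</title>"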

A few things to note:
  • If you want to extract certain data, you will need to \-escape special characters.
    • For HTML attributes, for example, you should match single and double quotes, and escape them like [\"\']
  • Sometimes, depending on the character set, you may get some unusual curl output with special characters. If you detect this, you'll need to switch the encoding with a utility like iconv (see the sketch after this list).
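
For the iconv point above, a minimal sketch, assuming the page is served as ISO-8859-1 and you want UTF-8 (both encodings and the example URL are assumptions):

# Re-encode the response before grepping for the title
curl -s http://www.example.com | iconv -f ISO-8859-1 -t UTF-8 | grep -o "<title>.*</title>"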