Extract the source of the webpage without tags using bash

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/35777319/

Tags: bash, curl, tags, extract, wget

Asked by ?? ?ck ??wk

We can download the source of a page using wget or curl, but I want to extract the source of the page without the tags, I mean extract it as plain text.

Answered by SLePort

You can pipe the output to a simple sed command:

curl www.gnu.org | sed 's/<\/*[^>]*>//g'
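
A couple of small refinements, assuming GNU sed and grep are available: the -s flag silences curl's progress meter, and a final grep drops the blank lines left behind once the tags are removed:

curl -s www.gnu.org | sed 's/<[^>]*>//g' | grep -v '^[[:space:]]*$'

Note that this only strips the tags themselves; the contents of <script> and <style> blocks, and tags that span multiple lines, are not handled.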

Answered by Pablo Prieto

Using curl, wget and an Apache Tika server running locally, you can parse HTML into plain text directly from the command line.

First, you have to download the tika-server jar from the Apache site: https://tika.apache.org/download.html

Then, run it as a local server:

$ java -jar tika-server-1.12.jar

After that, you can start parsing text using the following URL:

http://localhost:9998/tika
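
To confirm the server is actually listening before sending documents, you can hit what is, as far as I remember, the standard tika-server /version endpoint, which simply reports the Tika version:

curl -s http://localhost:9998/version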

Now, to parse the HTML of a webpage into plain text:

 $ wget -O test.html YOUR-HTML-URL && curl -H "Accept: text/plain" -T test.html http://localhost:9998/tika

That should return the webpage text without tags.

This way you're using wget to download and save the desired webpage to "test.html", and then using curl to send a request to the Tika server in order to extract the text. Note that it's necessary to send the "Accept: text/plain" header, because Tika can return several formats, not just plain text.
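
If you don't need to keep the intermediate file, the same request can be written as a single pipeline: wget writes the page to stdout and curl uploads it from stdin (-T - tells curl to read the upload body from standard input). YOUR-HTML-URL is the same placeholder as above:

wget -qO- YOUR-HTML-URL | curl -s -H "Accept: text/plain" -T - http://localhost:9998/tika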

Answered by Leventix

Create a Ruby script that uses Nokogiri to parse the HTML:

require 'nokogiri'
require 'open-uri'

# Fetch the page and parse the HTML (on current Rubies, open-uri is used via URI.open)
html = Nokogiri::HTML(URI.open('https://stackoverflow.com/questions/6129357'))

# inner_text returns the text content of <body> with all tags stripped
text = html.at('body').inner_text
puts text
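
To run this from the shell, assuming the snippet is saved as a file (the name extract_text.rb below is just an example) and the nokogiri gem is installed:

gem install nokogiri
ruby extract_text.rb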

Source

It would probably be simple to do with JavaScript or Python if you're more comfortable with those, or you could look for an html-to-text utility. I imagine it would be very difficult to do this purely in bash.
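
For the html-to-text route, two widely packaged utilities worth trying, if they are installed on your system, are lynx and html2text:

lynx -dump -nolist https://www.gnu.org
curl -s https://www.gnu.org | html2text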

See also: bash command to convert an html page to a text file
