Extract the source of the webpage without tags using bash

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/35777319/

Tags: bash, curl, tags, extract, wget

Asked by ?? ?ck ??wk

We can download the source of a page using wget or curl, but I want to extract the source of the page without the tags, I mean extract it as plain text.

Answered by SLePort

You can pipe the output to a simple sed command:

curl www.gnu.org | sed 's/<\/*[^>]*>//g'
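
A couple of small refinements, assuming GNU sed and grep are available: the -s flag silences curl's progress meter, and a final grep drops the blank lines left behind once the tags are removed:

curl -s www.gnu.org | sed 's/<[^>]*>//g' | grep -v '^[[:space:]]*$'

Note that this only strips the tags themselves; the contents of <script> and <style> blocks, and tags that span multiple lines, are not handled.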

Answered by Pablo Prieto

Using curl, wget and an Apache Tika server running locally, you can parse HTML into plain text directly from the command line.

First, you have to download the tika-server jar from the Apache site: https://tika.apache.org/download.html

Then, run it as a local server:

$ java -jar tika-server-1.12.jar

After that, you can start parsing text using the following URL:

http://localhost:9998/tika
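
To confirm the server is actually listening before sending documents, you can hit what is, as far as I remember, the standard tika-server /version endpoint, which simply reports the Tika version:

curl -s http://localhost:9998/version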

Now, to parse the HTML of a webpage into plain text:

 $ wget -O test.html YOUR-HTML-URL && curl -H "Accept: text/plain" -T test.html http://localhost:9998/tika

That should return the webpage text without tags.

This way you're using wget to download and save the desired webpage to "test.html", and then using curl to send a request to the Tika server in order to extract the text. Note that it's necessary to send the "Accept: text/plain" header, because Tika can return several formats, not just plain text.
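
If you don't need to keep the intermediate file, the same request can be written as a single pipeline: wget writes the page to stdout and curl uploads it from stdin (-T - tells curl to read the upload body from standard input). YOUR-HTML-URL is the same placeholder as above:

wget -qO- YOUR-HTML-URL | curl -s -H "Accept: text/plain" -T - http://localhost:9998/tika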

Answered by Leventix

Create a Ruby script that uses Nokogiri to parse the HTML:

require 'nokogiri'
require 'open-uri'

# Fetch the page and parse the HTML (on current Rubies, open-uri is used via URI.open)
html = Nokogiri::HTML(URI.open('https://stackoverflow.com/questions/6129357'))

# inner_text returns the text content of <body> with all tags stripped
text = html.at('body').inner_text
puts text
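
To run this from the shell, assuming the snippet is saved as a file (the name extract_text.rb below is just an example) and the nokogiri gem is installed:

gem install nokogiri
ruby extract_text.rb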

Source

It would probably be simple to do with JavaScript or Python if you're more comfortable with those, or you could look for an html-to-text utility. I imagine it would be very difficult to do this purely in bash.
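
For the html-to-text route, two widely packaged utilities worth trying, if they are installed on your system, are lynx and html2text:

lynx -dump -nolist https://www.gnu.org
curl -s https://www.gnu.org | html2text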

See also: bash command to convert an html page to a text file
