Original question: http://stackoverflow.com/questions/35777319/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Extract the source of the webpage without tags using bash
Asked by ?? ?ck ??wk
We can download the source of a page using wget or curl.
But I want to extract the source of the page without tags, that is, extract it as plain text.
Answered by SLePort
You can pipe the output to a simple sed command:
curl www.gnu.org | sed 's/<\/*[^>]*>//g'
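As a rough sketch of how the sed approach could be wrapped for reuse, the function below strips tags, decodes three common entities, and drops blank lines. The name strip_tags is hypothetical and not part of the original answer. Because sed works line by line, a tag that spans several lines will not be removed, and the contents of script and style elements are left in the output.

```shell
#!/usr/bin/env bash
# strip_tags (hypothetical helper): read HTML on stdin, print rough plain text.
strip_tags() {
  sed -e 's/<[^>]*>//g' \
      -e 's/&lt;/</g' \
      -e 's/&gt;/>/g' \
      -e 's/&amp;/\&/g' \
      -e '/^[[:space:]]*$/d'
}
# The expressions, in order: remove anything that looks like a tag,
# decode &lt; and &gt;, decode &amp; last so it cannot create new entities,
# and delete lines that are now empty.
```

Usage would look like: curl -s www.gnu.org | strip_tags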
Answered by Pablo Prieto
Using Curl, Wget and Apache Tika Server (locally) you can parse HTML into simple text directly from the command line.
First, you have to download the tika-server jar from the Apache site: https://tika.apache.org/download.html
Then, run it as a local server:
$ java -jar tika-server-1.12.jar
After that, you can start parsing text using the following URL:
http://localhost:9998/tika
Now, to parse the HTML of a webpage into simple text:
$ wget -O test.html YOUR-HTML-URL && curl -H "Accept: text/plain" -T test.html http://localhost:9998/tika
That should return the webpage text without tags.
This way you're using wget to download and save your desired webpage to "test.html" and then you use curl to send a request to the tika server in order to extract the text. Notice that it's necessary to send the header "Accept: text/plain" because tika can return several formats, not just plain text.
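If you would rather not keep an intermediate test.html, the two steps can probably be chained in a single pipeline, since curl reads the upload body from stdin when given -T -. This is only a sketch: the function name page_text and the TIKA_URL variable are hypothetical, and the default endpoint assumes tika-server is listening locally on port 9998 as in the command above.

```shell
#!/usr/bin/env bash
# Assumed endpoint of a locally running tika-server; override via the environment.
TIKA_URL="${TIKA_URL:-http://localhost:9998/tika}"

# page_text URL (hypothetical helper): download the page and stream it to Tika.
# First curl:  -f fail on HTTP errors, -sS quiet but still report errors, -L follow redirects.
# Second curl: -T - uploads stdin, and the Accept header asks Tika for plain text.
page_text() {
  curl -fsSL "$1" | curl -fsS -H "Accept: text/plain" -T - "$TIKA_URL"
}
```

Usage would look like: page_text YOUR-HTML-URL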
Answered by Leventix
Create a Ruby script that uses Nokogiri to parse the HTML:
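require 'nokogiri'
require 'open-uri'

# URI.open comes from open-uri; Kernel#open no longer accepts URLs in Ruby 3.0+
html = Nokogiri::HTML(URI.open('https://stackoverflow.com/questions/6129357'))
# inner_text concatenates the text nodes found under <body>
text = html.at('body').inner_text
puts text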
It would probably be just as simple in JavaScript or Python if you're more comfortable with those, or you could search for an html-to-text utility. I imagine it would be very difficult to do this purely in bash.
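As one possible take on the html-to-text utility route, the sketch below uses a text-mode browser when one is installed and falls back to a crude sed filter otherwise. The function name html_to_text is hypothetical, and which tools are present on a given system is an assumption. Output formatting differs between the tools, so treat the result as approximate.

```shell
#!/usr/bin/env bash
# html_to_text (hypothetical helper): read HTML on stdin, print plain text.
html_to_text() {
  if command -v w3m >/dev/null 2>&1; then
    # w3m renders the document and dumps the formatted text
    w3m -dump -T text/html
  elif command -v lynx >/dev/null 2>&1; then
    # -nolist suppresses the numbered list of links lynx appends by default
    lynx -dump -nolist -stdin
  else
    # Fallback: strip anything that looks like a tag, one line at a time
    sed 's/<[^>]*>//g'
  fi
}
```

Usage would look like: curl -s www.gnu.org | html_to_text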