bash: bash command to convert an HTML page to a text file

Question

Asked by The Coder

I am a beginner with Linux. How can I convert an HTML page to a text file, with all images and links removed from the output? I want to use only bash commands, not dedicated HTML-to-text conversion tools. As an example, I want to convert the first page of Google search results for "computers".

Thank you

Answer 1

Accepted answer by Clayton Stanley

I used python-boilerpipe and it works very well, so far...

Answer 2

Answer by V H

The easiest way is to use a text browser's dump mode. In short, the dump is the text version of the viewable HTML.

Remote file

lynx --dump www.google.com > file.txt
links -dump www.google.com

Local file

lynx --dump ./1.html > file.txt
links -dump ./1.htm
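
The question also asks for links to be left out of the output. By default, lynx appends a numbered list of link URLs to a dump; its -nolist option suppresses that list. Below is a minimal sketch that wraps this in a function. The name page_to_text and its two arguments are illustrative and do not come from the answer above.

```sh
#!/bin/sh
# page_to_text SOURCE OUTPUT
# SOURCE is a URL or a local HTML file; OUTPUT is the text file to write.
# (Hypothetical helper name, used only for this sketch.)
page_to_text() {
    if ! command -v lynx >/dev/null 2>&1; then
        echo "lynx is not installed" >&2
        return 1
    fi
    # -dump renders the page as plain text;
    # -nolist omits the numbered list of link URLs at the end of the dump.
    lynx -dump -nolist "$1" > "$2"
}

# Example from the question. Google may refuse or alter results
# for text browsers, so treat this call as an untested assumption:
# page_to_text 'https://www.google.com/search?q=computers' computers.txt
```

The link text itself still appears in the output; only the URL list is dropped.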

Answer 3

Answer by Farid

You have html2text.py on the command line.

Usage: html2text.py [(filename|url) [encoding]]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --ignore-links        don't include any formatting for links
  --ignore-images       don't include any formatting for images
  -g, --google-doc      convert an html-exported Google Document
  -d, --dash-unordered-list
                        use a dash rather than a star for unordered list items
  -b BODY_WIDTH, --body-width=BODY_WIDTH
                        number of characters per output line, 0 for no wrap
  -i LIST_INDENT, --google-list-indent=LIST_INDENT
                        number of pixels Google indents nested lists
  -s, --hide-strikethrough
                        hide strike-through text. only relevent when -g is
                        specified as well

Answer 4

Answer by Michael Gregtheitroade

On OS X you can use the command-line tool textutil to batch convert HTML files to txt format:

textutil -convert txt *.html

Answer 5

Answer by diachedelic

You could get nodejs and globally install the module html-to-text:

npm install -g html-to-text

Then use it like this:

html-to-text < stuff.html > stuff.txt

Answer 6

Answer by Ascatgz

On Ubuntu/Debian, html2text is a good choice. http://linux.die.net/man/1/html2text

Answer 7

Answer by Fredrik Pihl

Using sed:

sed -e 's/<[^>]*>//g' foo.html
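
The one-liner above works line by line. It therefore misses tags that span several lines, and it leaves script and style contents and entities such as &amp; in the output. The sketch below is one way to extend it using only tr and sed. The function name strip_html, the list of block-level tags, and the choice of placeholder bytes are all assumptions made for illustration. It handles lowercase tag names only and does not handle HTML comments that contain ">".

```sh
#!/bin/sh
# strip_html: read HTML on stdin, write rough plain text on stdout.
# Hypothetical helper for this sketch; not a full HTML parser.
strip_html() (
    # Two placeholder bytes, assumed not to occur in the input.
    end=$(printf '\001')    # marks the end of a <script>/<style> block
    brk=$(printf '\002')    # marks a future line break

    # 1. Join all lines so that tags split across lines become matchable.
    # 2. Drop <script> and <style> blocks: the closing tag is replaced by a
    #    single byte so that [^byte]* can stop at the first closing tag.
    # 3. Turn common block-level closing tags into line-break markers.
    # 4. Remove every remaining tag, then decode a few common entities
    #    (&amp; last, so that "&amp;lt;" becomes "&lt;" and not "<").
    # 5. Squeeze spaces, restore line breaks, trim, and drop empty lines.
    tr '\r\n' '  ' |
    sed -e "s/<\/script>/$end/g" -e "s/<script[^$end]*$end//g" \
        -e "s/<\/style>/$end/g" -e "s/<style[^$end]*$end//g" \
        -e "s/<br[^>]*>/$brk/g" -e "s/<\/title>/$brk/g" \
        -e "s/<\/p>/$brk/g" -e "s/<\/div>/$brk/g" \
        -e "s/<\/li>/$brk/g" -e "s/<\/tr>/$brk/g" \
        -e "s/<\/h[1-6]>/$brk/g" \
        -e 's/<[^>]*>//g' \
        -e 's/&nbsp;/ /g' -e 's/&lt;/</g' -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' -e 's/&amp;/\&/g' \
        -e 's/  */ /g' |
    tr '\002' '\n' |
    sed -e 's/^ *//' -e 's/ *$//' -e '/^$/d'
)
```

A possible invocation is strip_html < foo.html > foo.txt. Link text is kept while the href targets and img tags disappear with the other tags.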

Answer 8

Answer by sapht

I think links is the most common tool for this. Check man links and search for "plain text" or similar. My guess is -dump; search for that too. The software comes with most distributions.

Answer 9

Answer by ewwink

Batch mode for local .htm and .html files; lynx is required.

#!/bin/sh
# h2t: convert all .htm and .html files in the current directory to text

for file in *.htm *.html
do
    [ -e "$file" ] || continue    # skip a pattern that matched no files
    new="${file%.*}"              # file name without its extension
    lynx -dump "$file" > "${new}.txt"
done

Answer 10

Answer by Vincent

A Bash script to recursively convert HTML pages to text files, applied to httpd-manual. It makes searches such as grep -Rhi 'LoadModule ssl' /usr/share/httpd/manual_dump -A 10 convenient.

#!/bin/bash
# Adapted from ewwink: recursive html to txt dump.
# Dumps the httpd manual under ./manual (up to 4 levels deep) into ./manual_dump,
# keeping the directory layout. Brace expansion requires bash rather than plain sh.
# Put this script in /usr/share/httpd for it to work (after installing the httpd-manual rpm).

for file in ./manual/*{,/*,/*/*,/*/*/*}.html
do
    [ -e "$file" ] || continue              # skip patterns that matched nothing
    rel="${file#./manual/}"                 # path relative to ./manual
    out="./manual_dump/${rel%.html}.txt"    # same relative path, .txt extension
    mkdir -p "$(dirname "$out")"
    lynx --dump "$file" > "$out"
done
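
The glob above stops at four directory levels. A sketch using find handles any depth and also runs under plain sh. The function name dump_tree and its two arguments are illustrative and do not come from the answer. File names that contain newlines are assumed not to occur.

```sh
#!/bin/sh
# dump_tree SRC_DIR DEST_DIR
# Mirror every .html file under SRC_DIR into DEST_DIR as a .txt file.
# (Hypothetical helper name, used only for this sketch.)
dump_tree() (
    src_dir=${1%/}     # strip one trailing slash, if any
    dest_dir=${2%/}
    find "$src_dir" -type f -name '*.html' |
    while IFS= read -r file
    do
        rel=${file#"$src_dir"/}               # path relative to SRC_DIR
        out="$dest_dir/${rel%.html}.txt"
        mkdir -p "$(dirname "$out")"
        # stdin is redirected so lynx cannot consume the file list from find
        lynx -dump "$file" > "$out" < /dev/null
    done
)
```

For the httpd manual this could be called as dump_tree ./manual ./manual_dump from /usr/share/httpd.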
