bash: Extract the title of an HTML file using grep

Question

Asked by Vamsi Krishna B

cat 1.html | grep "<title>" > title.txt

This grep statement is not working.

What is the best way to grab the title of a page using grep or sed?

Thanks.

Answer 1

Answered by Dave

sed -n 's/<title>\(.*\)<\/title>/\1/Ip' 1.html

This uses the combination of -n and p to print only the lines that match.
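As a minimal sketch of how -n and p interact, the following writes a small temporary sample file (its contents are made up for illustration) and runs a substitution that keeps the captured text with \1. The I flag for case-insensitive matching is a GNU sed extension, so this assumes GNU sed.

```shell
# Create a hypothetical sample file for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html>
<head>
<TITLE>My Page</TITLE>
</head>
</html>
EOF

# -n suppresses the default printing of every input line.
# The trailing p prints only lines where the substitution succeeded.
# \1 is the text captured between the tags; I ignores case (GNU sed only).
title=$(sed -n 's/<title>\(.*\)<\/title>/\1/Ip' "$sample")

echo "$title"
rm -f "$sample"
```

With GNU sed this should print only the title text, because every other line is suppressed by -n. It only works when the opening and closing tags are on the same line.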

Answer 2

Answered by ghostdog74

You can use awk. This works even when the title spans multiple lines.

$ cat file

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>

    <title>Extract Title of a html file

using grep - Stack Overflow</title>
    <link rel="stylesheet" type="text/css" href="http://sstatic.net/stackoverflow/all.css?v=9ea1a272f146">

$ awk -vRS="</title>" '/<title>/{gsub(/.*<title>|\n+/,"");print;exit}' file
Extract Title of a html file using grep - Stack Overflow

Answer 3

Answered by Alex Howansky

You can use xml_grep from the XML::Twig Perl package:

xml_grep --text_only title 1.html

Answer 4

Answered by Olivier Lasne

cat 1.html | grep -oE "<title>.*</title>" | sed 's/<title>//' | sed 's/<\/title>//'

grep with -oE extracts only the title element, then sed removes the HTML tags.
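The pipeline can be tried on a small sample file as follows. This is only a sketch: the temporary file and its contents are hypothetical, and the two sed calls are merged into one with -e, which should behave the same way.

```shell
# Create a hypothetical sample file for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html>
<head>
<title>My Page</title>
<link rel="stylesheet" href="style.css">
</head>
</html>
EOF

# -o prints only the matched part of the line; -E enables extended regex.
matched=$(grep -oE "<title>.*</title>" "$sample")

# Strip the opening and closing tags with a single sed call.
title=$(printf '%s\n' "$matched" | sed -e 's/<title>//' -e 's/<\/title>//')

echo "$matched"
echo "$title"
rm -f "$sample"
```

Because grep works line by line, this approach only finds a title whose opening and closing tags sit on the same line.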

Answer 5

Answered by chigley

 grep "<title>" /path/to/html.html

Works fine for me. Are you sure 1.html is in your current working directory? Run pwd to check.
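A quick way to run that check could look like the sketch below. The helper name file_is_readable is made up for illustration, and the file name 1.html is taken from the question.

```shell
# Hypothetical helper: succeeds if the given file exists and is readable.
file_is_readable() {
    [ -r "$1" ]
}

# Print the current working directory.
pwd

# Report whether 1.html can be read from here.
if file_is_readable 1.html; then
    echo "1.html found"
else
    echo "1.html not found in $(pwd)" >&2
fi
```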

Answer 6

Answered by Martin Schnurer

Alex Howansky's answer is good enough, although the HTML may not be well formed, in which case xml_grep could crash.

I recommend using tidy to convert the HTML to XML, then running xml_grep on the result:

tidy -asxml -utf8 html_file.html > out.xml
xml_grep 'xpath_expression' out.xml
