Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/3833088/

Date: 2020-09-17 22:43:30. Source: igfitidea

Extract Title of a html file using grep

Tags: bash, shell, grep

Asked by Vamsi Krishna B

cat 1.html | grep "<title>" > title.txt  

This grep statement is not working.

What is the best way to grab the title of a page using grep or sed?

Thanks.

Answered by Dave

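sed -n 's/<title>\(.*\)<\/title>/\1/Ip' 1.html

This uses the combination of -n and the p flag to print only the lines where the substitution matched. The \1 in the replacement keeps the text captured between the tags, and the I flag (a GNU sed extension) makes the match case-insensitive.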
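
To see the sed approach in action, here is a minimal sketch that runs it against a throwaway sample file. The sample content and the variable names are made up for illustration, and the I flag assumes GNU sed. Unlike the one-liner above, this variant adds .* on both sides of the pattern so that other markup on the same line is dropped as well.

```shell
# Create a small sample page in a temporary file (hypothetical content).
sample=$(mktemp)
printf '%s\n' '<html>' '<head><TITLE>My Page</TITLE></head>' '</html>' > "$sample"

# -n suppresses the default output; the p flag prints only lines where s/// matched.
# \1 is the text captured between the tags; I (GNU sed) ignores case.
title=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/Ip' "$sample")
echo "$title"

rm -f "$sample"
```

Redirecting the echo to title.txt would give the file the question was trying to produce.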

Answered by ghostdog74

You can use awk. This works even when the title spans multiple lines.

$ cat file

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>

    <title>Extract Title of a html file

using grep - Stack Overflow</title>
    <link rel="stylesheet" type="text/css" href="http://sstatic.net/stackoverflow/all.css?v=9ea1a272f146">

$ awk -vRS="</title>" '/<title>/{gsub(/.*<title>|\n+/,"");print;exit}' file
Extract Title of a html file using grep - Stack Overflow
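
A multi-character RS is supported by gawk and mawk, but POSIX leaves its behavior unspecified. Where that is a concern, a different technique (not the one from this answer) is to join the lines with tr and then extract with sed. The sketch below assumes a sample file modelled on the one above and joins the lines with spaces rather than deleting the newlines.

```shell
# Build a sample file whose title spans two lines (content modelled on the example above).
sample=$(mktemp)
printf '%s\n' '<head>' '    <title>Extract Title of a html file' \
    'using grep - Stack Overflow</title>' '</head>' > "$sample"

# Turn every newline into a space so the whole document is one line,
# then keep only the text captured between the title tags.
title=$(tr '\n' ' ' < "$sample" | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p')
echo "$title"

rm -f "$sample"
```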

Answered by Alex Howansky

You can use xml_grep from the XML::Twig Perl package:

xml_grep --text_only title 1.html

Answered by Olivier Lasne

cat 1.html | grep -oE "<title>.*</title>" | sed 's/<title>//' | sed 's/<\/title>//'

grep with -oE extracts only the title element, then sed removes the HTML tags. This only works when the opening and closing tags are on the same line.
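
The same pipeline can be wrapped in a small function that reads the file directly instead of going through cat and that uses a single sed call. This is only a sketch: extract_title and the sample content are hypothetical names chosen for illustration, and -o is a grep extension available in GNU and BSD grep rather than a POSIX option.

```shell
# Hypothetical helper: print the <title> text of the file given as $1.
extract_title() {
    # -o prints only the matched part; the two -e expressions strip the tags.
    grep -oE '<title>.*</title>' "$1" | sed -e 's/<title>//' -e 's/<\/title>//'
}

# Try it on a throwaway sample file.
sample=$(mktemp)
printf '%s\n' '<html><head><title>Hello World</title></head></html>' > "$sample"
title=$(extract_title "$sample")
echo "$title"

rm -f "$sample"
```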

Answered by chigley

 grep "<title>" /path/to/html.html

Works fine for me. Are you sure 1.html is in your current working directory? Run pwd to check.
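
One way to rule out the wrong-directory problem is to test that the file is readable before calling grep. The helper below is a hypothetical sketch, not part of the original answer. It also passes -i, on the assumption that the tag might be written in upper case.

```shell
# Hypothetical helper: grep for the title line only if the file is readable.
find_title_line() {
    if [ ! -r "$1" ]; then
        # Report where we looked so a wrong working directory is easy to spot.
        echo "cannot read '$1' (current directory: $(pwd))" >&2
        return 1
    fi
    grep -i '<title>' "$1"
}

# 1.html is the file name from the question.
find_title_line 1.html || echo "no title line found"
```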

Answered by Martin Schnurer

Alex Howansky's answer is good enough, although there is a chance that the HTML is not well formed, in which case xml_grep would crash.

I recommend using tidy to convert the HTML to XML, then running xml_grep on the result:

tidy -asxml -utf8 html_file.html > out.xml
xml_grep 'xpath_expression' out.xml