bash: Extract the title of an HTML file using grep
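Notice: this page reproduces a popular StackOverFlow question and its answers under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/3833088/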
Extract Title of a html file using grep
Asked by Vamsi Krishna B
cat 1.html | grep "<title>" > title.txt
This grep statement is not working.
Please tell me the best way to grab the title of a page using grep or sed.
Thanks.
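Answered by Dave
sed -n 's/<title>\(.*\)<\/title>/\1/Ip' 1.html
This uses the combination of -n and p to print only the lines where the substitution matched. \1 replaces the whole match with the captured title text, and the I flag (a GNU sed extension) makes the match case-insensitive.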
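To make the effect of -n and p concrete, here is a minimal sketch run against a made-up sample file. The function name title_sed and the file contents are assumptions for illustration only. The I flag is left out so the command also runs on non-GNU sed, and the pattern is padded with .* on both sides so anything else on the line is dropped. It only handles a title that sits on a single line.

```shell
#!/bin/sh
# Hypothetical sample file; its contents are made up for illustration.
sample=$(mktemp)
printf '%s\n' '<html>' '<head>' '<title>Sample page</title>' '</head>' '</html>' > "$sample"

# -n suppresses sed's default printing of every line; the trailing p
# prints a line only when the substitution on that line succeeded.
title_sed() {
    sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' "$1"
}

title_sed "$sample"   # prints: Sample page
rm -f "$sample"
```

Without -n every input line would be printed as well, so the title would be buried in the rest of the file.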
Answered by ghostdog74
You can use awk. This works even for multi-line titles.
$ cat file
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Extract Title of a html file
using grep - Stack Overflow</title>
<link rel="stylesheet" type="text/css" href="http://sstatic.net/stackoverflow/all.css?v=9ea1a272f146">
$ awk -vRS="</title>" '/<title>/{gsub(/.*<title>|\n+/,"");print;exit}' file
Extract Title of a html file using grep - Stack Overflow
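The awk command relies on a multi-character RS, which gawk and mawk accept but POSIX leaves unspecified. As a rough alternative for multi-line titles, the sketch below flattens the file to a single line with tr and then applies sed. The function name title_flat and the sample contents are assumptions, and the greedy pattern would run up to the last </title> if a document contained more than one.

```shell
#!/bin/sh
# Join all lines with spaces, then capture the text between the title tags.
title_flat() {
    tr '\n' ' ' < "$1" | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p'
}

# Hypothetical sample with the title split across two lines.
sample=$(mktemp)
printf '%s\n' '<head>' '<title>Extract Title of a html file' \
    'using grep - Stack Overflow</title>' '</head>' > "$sample"

title_flat "$sample"
rm -f "$sample"
```

Each newline becomes a space, so the two halves of the title are joined with a single space between them.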
Answered by Alex Howansky
Answered by Olivier Lasne
cat 1.html | grep -oE "<title>.*</title>" | sed 's/<title>//' | sed 's/<\/title>//'
grep with -oE extracts only the title element, then the two sed commands remove the opening and closing HTML tags.
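The same idea can be sketched with one grep and one sed. This variant is an assumption layered on the answer, not part of it: it adds -i so an uppercase TITLE tag matches, allows attributes inside the opening tag, and strips every tag in a single sed call. The function name title_grep is hypothetical, and like the original it only finds a title that fits on one line.

```shell
#!/bin/sh
# grep -o prints only the matching part of the line; sed -E then deletes
# anything that looks like a tag, leaving the title text.
title_grep() {
    grep -oiE '<title[^>]*>[^<]*</title>' "$1" | sed -E 's/<[^>]*>//g'
}

# Hypothetical sample with an uppercase tag and an attribute.
sample=$(mktemp)
printf '%s\n' '<head><TITLE lang="en">My Page</TITLE></head>' > "$sample"

title_grep "$sample"   # prints: My Page
rm -f "$sample"
```

Passing the file name to grep directly also avoids the extra cat process.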
Answered by chigley
grep "<title>" /path/to/html.html
Works fine for me. Are you sure 1.html is in your current working directory? Run pwd to check.
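One way to rule out a wrong directory before blaming grep is to test that the file is readable first. This is only a sketch, and check_readable is a hypothetical helper name.

```shell
#!/bin/sh
# Report whether the given file can be read; on failure, show the
# current directory so a wrong working directory is easy to spot.
check_readable() {
    if [ -r "$1" ]; then
        echo "ok: $1 is readable"
    else
        echo "cannot read $1 (current directory: $(pwd))" >&2
        return 1
    fi
}
```

For example, check_readable 1.html && grep "<title>" 1.html runs grep only when the file was found.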
Answered by Martin Schnurer
Alex Howansky's answer is good enough, although there is a chance that the HTML is not well formed, in which case xml_grep would crash.
I recommend using tidy to convert the HTML to XML, and then using xml_grep:
tidy -asxml -utf8 html_file.html > out.xml
xml_grep 'xpath_expression' out.xml

