bash: extract substrings from lines using grep, awk, sed, etc.
原文地址: http://stackoverflow.com/questions/13982633/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Extract a substring from lines using grep, awk, sed, etc.
Asked by wenzi
I have a file with many lines like:
<a href="http://www.youtube.com/user/airuike" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKPW6LXqqbQCFSqVIQod_BwsaQ%3D%3D" dir="ltr">lily weisy</a>
I want to extract www.youtube.com/user/airuike and lily weisy, and then I also want to separate airuike from www.youtube.com/user/
So I want to get 3 strings: www.youtube.com/user/airuike, airuike and lily weisy
How can I achieve this? Thanks.
Answered by kdubs
Do this:
sed -e 's/.*href="\([^"]*\)".*>\([^<]*\)<.*/link:\1 name:\2/' < data
This will give you the first part, but I'm not sure what you are doing with it after this.
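To get all three strings in one pass, the same sed idea can be extended with a nested capture group. This is only a minimal sketch, assuming every line has exactly the shape shown in the question (an `http://` URL in `href` and a closing `</a>`); `extract_fields` and `line` are hypothetical names introduced here, not part of the answer above.

```shell
# Sample line from the question, kept in a variable so the demo is self-contained.
line='<a href="http://www.youtube.com/user/airuike" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKPW6LXqqbQCFSqVIQod_BwsaQ%3D%3D" dir="ltr">lily weisy</a>'

# extract_fields is a hypothetical helper name used only for this sketch.
# \1 = URL without the scheme, \2 = last path component, \3 = link text.
# "|" is used as the s-command delimiter so the slashes need no escaping.
extract_fields() {
  sed -e 's|.*href="http://\([^"]*/\([^"/]*\)\)".*>\([^<]*\)</a>.*|link:\1 user:\2 name:\3|'
}

printf '%s\n' "$line" | extract_fields
```

Because group 2 is nested inside group 1, the user name does not have to be matched twice. Lines that do not fit the assumed shape pass through unchanged, so a real script would probably want `sed -n` with the `p` flag to drop them.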
Answered by BeniBela
Since it is HTML, and HTML should be parsed with an HTML parser rather than with grep/sed/awk, you could use the pattern-matching function of my Xidel.
xidel yourfile.html -e '<a class="yt-uix-sessionlink yt-user-name " dir="ltr">{$link := @href, $user := substring-after($link, "www.youtube.com/user/"), $name:=text()}</a>*'
Or, if you want a CSV-like result:
xidel yourfile.html -e '<a class="yt-uix-sessionlink yt-user-name " dir="ltr">{string-join((@href, substring-after(@href, "www.youtube.com/user/"), text()), ", ")}</a>*' --hide-variable-names
It is kind of sad that you also want the airuike string; otherwise it could be as simple as
xidel /yourfile.html -e '<a href="{$link}" class="yt-uix-sessionlink yt-user-name ">{$name}</a>*'
(And you were supposed to be able to use xidel '<a href="{$link:=., $user := filter($link, www.youtube.com/user/(.*)\', 1)}" class="yt-uix-sessionlink yt-user-name " dir="ltr">{$name}</a>*', but it seems I haven't thought the syntax through. Just one error check and it breaks everything.)
Answered by Ed Morton
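$ awk '{split($0,a,/(["<>]|:\/\/)/); u=a[4]; sub(/.*\//,"",a[4]); print u,a[4],a[12]}' file
www.youtube.com/user/airuike airuike lily weisy
Answered by Pau Fracés
I think something like this should work:
# Read the file line by line; IFS= and -r keep whitespace and backslashes intact.
while IFS= read -r line
do
    # The URL inside href="...": everything from "http" up to the closing quote.
    href=$(echo "$line" | grep -o 'http[^"]*')
    # The user name: whatever follows the last slash of the URL.
    user=$(echo "$href" | grep -o '[^/]*$')
    # The link text: the part just before the closing </a>, with the tag stripped.
    text=$(echo "$line" | grep -o '[^>]*</a>$' | grep -o '^[^<]*')
    echo "href: $href"
    echo "user: $user"
    echo "text: $text"
done < yourfile
Regular expressions basics: http://en.wikipedia.org/wiki/Regular_expression#POSIX_Basic_Regular_Expressions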
Update: checked and fixed.
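The question asked for the URL without its scheme (www.youtube.com/user/airuike), whereas `grep -o 'http[^"]*'` keeps the leading `http://`. One possible way to get exactly the three requested strings is to combine `grep -o` with shell parameter expansion. This is a sketch only, assuming each input line looks like the sample in the question; `extract_three` and `sample_line` are hypothetical names introduced for illustration.

```shell
# Sample line from the question, kept in a variable so the demo is self-contained.
sample_line='<a href="http://www.youtube.com/user/airuike" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKPW6LXqqbQCFSqVIQod_BwsaQ%3D%3D" dir="ltr">lily weisy</a>'

# extract_three is a hypothetical helper: it reads lines on stdin and prints
# "link, user, text" for each of them.
extract_three() {
  while IFS= read -r line; do
    href=$(printf '%s\n' "$line" | grep -o 'http[^"]*')
    link=${href#*://}   # strip the scheme, leaving www.youtube.com/user/airuike
    user=${link##*/}    # keep only what follows the last slash
    text=$(printf '%s\n' "$line" | grep -o '[^>]*</a>$' | grep -o '^[^<]*')
    printf '%s, %s, %s\n' "$link" "$user" "$text"
  done
}

printf '%s\n' "$sample_line" | extract_three
```

To run it over a whole file, redirect the file into the function instead, for example `extract_three < yourfile`. Parameter expansion avoids two extra `grep` processes per line, which may matter on large inputs.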

