bash: extracting substrings from a line using grep, awk, sed, etc.

Question

Asked by wenzi

I have a file with many lines like:

<a href="http://www.youtube.com/user/airuike" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKPW6LXqqbQCFSqVIQod_BwsaQ%3D%3D" dir="ltr">lily weisy</a>

I want to extract www.youtube.com/user/airuike and lily weisy, and then I also want to separate airuike from www.youtube.com/user/

so I want to get 3 strings: www.youtube.com/user/airuike, airuike and lily weisy

How can I achieve this? Thanks.

Answer 1

Answered by kdubs

Do this:

sed -e 's/.*href="\([^"]*\)".*>\([^<]*\)<.*/link:\1 name:\2/' < data

This will give you the first part, but I'm not sure what you are doing with it after this.
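To get all three strings the question asks for in one pass, the sed approach can be extended with a nested capture group. This is only a sketch: it assumes every line has the shape shown in the question (an href that starts with http:// and ends with the user name, and link text directly before the closing tag). The function name extract_fields and the ", " separator are my own choices, not part of the answer above.

```shell
#!/bin/sh
# Hypothetical helper: reads lines on stdin and prints
#   <url without scheme>, <user>, <link text>
# Lines that do not match the assumed shape are skipped (-n ... p).
extract_fields() {
    # \1 = host and path after http://, \2 = last path component, \3 = link text
    sed -n 's|.*href="http://\([^"]*/\([^"/]*\)\)".*>\([^<]*\)</a>.*|\1, \2, \3|p'
}

# Demonstration on the sample line from the question.
sample='<a href="http://www.youtube.com/user/airuike" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKPW6LXqqbQCFSqVIQod_BwsaQ%3D%3D" dir="ltr">lily weisy</a>'
printf '%s\n' "$sample" | extract_fields
# prints: www.youtube.com/user/airuike, airuike, lily weisy
```

For a whole file, something like extract_fields < data should work under the same assumptions.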

Answer 2

Answered by BeniBela

Since it is HTML, and HTML should be parsed with an HTML parser rather than with grep/sed/awk, you could use the pattern matching function of my Xidel.

 xidel yourfile.html -e '<a class="yt-uix-sessionlink yt-user-name " dir="ltr">{$link := @href, $user := substring-after($link, "www.youtube.com/user/"), $name:=text()}</a>*'

Or if you want a CSV like result:

 xidel yourfile.html -e '<a class="yt-uix-sessionlink yt-user-name " dir="ltr">{string-join((@href, substring-after(@href, "www.youtube.com/user/"), text()), ", ")}</a>*' --hide-variable-names

It is a pity that you also want the airuike string; otherwise it could be as simple as:

xidel /yourfile.html  -e '<a href="{$link}" class="yt-uix-sessionlink yt-user-name ">{$name}</a>*'

(You were supposed to be able to use xidel '<a href="{$link:=., $user := filter($link, www.youtube.com/user/(.*)\', 1)}" class="yt-uix-sessionlink yt-user-name " dir="ltr">{$name}</a>*', but it seems I haven't thought the syntax through. Just one error check and it breaks everything.)

Answer 3

Answered by Ed Morton

$ awk '{split($0,a,/(["<>]|:\/\/)/); u=a[4]; sub(/.*\//,"",a[4]); print u,a[4],a[12]}' file
www.youtube.com/user/airuike airuike lily weisy

Answer 4

Answered by Pau Fracés

I think something like this should work:

while IFS= read -r line
do
    # the URL: everything from "http" up to the closing double quote
    href=$(echo "$line" | grep -o 'http[^"]*')
    # the user: the last path component of the URL
    user=$(echo "$href" | grep -o '[^/]*$')
    # the link text: whatever sits between the last ">" and the final "</a>"
    text=$(echo "$line" | grep -o '[^>]*</a>$' | grep -o '^[^<]*')

    echo "href: $href"
    echo "user: $user"
    echo "text: $text"
done < yourfile

Regular expression basics: http://en.wikipedia.org/wiki/Regular_expression#POSIX_Basic_Regular_Expressions

Update: checked and fixed
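The grep-based loop in this answer starts several processes for every input line, which can get slow on large files. As an alternative sketch, the same three strings can be cut out with POSIX shell parameter expansion alone. It assumes each line looks like the sample in the question (one href attribute, link text right before the closing tag); parse_line is a hypothetical name, and the variables it sets are global because plain sh has no local.

```shell
#!/bin/sh
# Hypothetical helper: takes one line as its argument and prints the URL
# without the scheme, the user name, and the link text, one per line.
parse_line() {
    line=$1
    href=${line#*href=\"}     # drop everything up to and including href="
    href=${href%%\"*}         # drop everything from the next double quote on
    url=${href#*://}          # strip the scheme (http://)
    user=${url##*/}           # keep only the last path component
    text=${line%'</a>'*}      # drop the closing tag and anything after it
    text=${text##*'>'}        # keep what follows the last remaining '>'
    printf '%s\n' "$url" "$user" "$text"
}

# Demonstration on the sample line from the question.
parse_line '<a href="http://www.youtube.com/user/airuike" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKPW6LXqqbQCFSqVIQod_BwsaQ%3D%3D" dir="ltr">lily weisy</a>'
```

To process a file, parse_line can be called from the same kind of loop: while IFS= read -r line; do parse_line "$line"; done < yourfile. Lines that do not follow the assumed shape will produce meaningless output, so this is not a substitute for a real HTML parser.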
