Source: http://stackoverflow.com/questions/12049905/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Stripping HTML tags in PostgreSQL
Asked by samach
How would you strip HTML tags in PostgreSQL such that the data inside the tags is preserved?
I found some solutions by googling, but they were stripping the text between the tags too!
Answered by acohen
select regexp_replace(content, E'<[^>]+>', '', 'gi') from message;
Answered by Peter Krauss
Use xpath
Feed your database with the XML data type, not with "second-class" TEXT, because it is very simple to convert HTML into XHTML (see HTML-Tidy or the standard DOM loadHTML() and saveXML() methods).
IT IS FAST AND VERY SAFE!
The common information-retrieval need is not the full content but something inside the XHTML, so the power of xpath is welcome.
Example: retrieve all paragraphs with class="fn":
WITH needinfo AS (
    SELECT *, xpath('//p[@class="fn"]//text()', xhtml)::text[] AS frags
    FROM t
)
SELECT array_to_string(frags, ' ') AS my_p_fn2txt
FROM needinfo
WHERE array_length(frags, 1) > 0;
-- for full content use xpath('//text()', xhtml)
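For readers who want to try the same idea outside the database, here is a minimal Python sketch using only the standard library. It assumes the XHTML is well-formed and has no XML namespace; the sample document and the name fn_paragraph_text are made up for illustration. ElementTree supports only a subset of XPath (no //text()), so the text nodes are collected with itertext() instead, and they are concatenated as-is rather than joined with a space as in the SQL above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: well-formed XHTML without a namespace.
XHTML = (
    '<html><body>'
    '<p class="fn">Hello <b>big</b> world</p>'
    '<p>skipped</p>'
    '<p class="fn">Second</p>'
    '</body></html>'
)


def fn_paragraph_text(xhtml):
    """Return the text of every <p class="fn">, with nested tags removed."""
    root = ET.fromstring(xhtml)
    # ".//p[@class='fn']" is within ElementTree's limited XPath subset.
    return ["".join(p.itertext()) for p in root.iterfind(".//p[@class='fn']")]


print(fn_paragraph_text(XHTML))
```

Real XHTML usually declares the http://www.w3.org/1999/xhtml namespace, in which case both this sketch and the SQL query need namespace-aware paths.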
regex solutions...
I do not recommend them, because a regex is not an "information retrieval" solution... and, as @James and others commented here, the regex solution is not so safe.
I like "pure SQL"; for me it is better than using Perl (see @Daniel's solution) or another language.
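CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
    -- Inner call: replace the first tag that carries an alt attribute with its alt text.
    -- Outer call: remove every remaining tag.
    -- Backslashes are doubled because E'' strings treat a single backslash as an escape.
    SELECT regexp_replace(
        regexp_replace($1, E'(?x)<[^>]*?(\\s alt \\s* = \\s* ([\'"]) ([^>]*?) \\2) [^>]*? >', E'\\3'),
        E'(?x)(< [^>]*? >)', '', 'g')
$$ LANGUAGE SQL;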
See this and many other variations at siafoo.net, eskpee.wordpress, ... and here at Stackoverflow.
Answered by Daniel Vérité
The choice is not limited to doing it server-side with a weak parser based on inadequate regexps or doing it client-side with a robust parser. It could be implemented server-side with a robust parser, too.
Here's an example in PL/PerlU that takes advantage of CPAN's HTML modules.
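CREATE FUNCTION extract_contents_from_html(text) RETURNS text AS $$
    use HTML::TreeBuilder;
    use HTML::FormatText;

    # Parse the HTML passed as the function's first argument into a tree.
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content(shift);

    # Render the tree as plain text wrapped at 78 columns.
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 78);
    my $text = $formatter->format($tree);
    $tree->delete;  # free the parse tree explicitly
    return $text;
$$ LANGUAGE plperlu;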
Demo:
select extract_contents_from_html('<html><body color="white">Hi there!<br>How are you?</body></html>') ;
Output:
 extract_contents_from_html
----------------------------
 Hi there! How are you?
One needs to be aware of the caveats that come with untrusted languages, though.
Answered by Winfield Trail
Any solution performed in the RDBMS is going to involve either string handling or regexes: to my knowledge there is NO way to manipulate HTML in a standards-compliant, safe way in the database. To reiterate, what you are asking for is very, VERY unsafe.
A much better option is to do this in your application. This is application logic, and NOT the job or concern of your storage layer.
A great way to do this (in PHP, at least) would be HTML Purifier. Don't do this in JavaScript: the user can tamper with it very easily.
Answered by James Mitch
Don't do it in PostgreSQL.
It is not designed to do this.
Use PHP or whatever language you are using to serve webpages.
Be careful with regular expressions, though. HTML is a complex language that cannot be described with regular expressions.
Use a DOM parser to strip out tags.
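As a rough illustration of parser-based stripping in application code, here is a minimal Python sketch built on the standard library's html.parser. Note the assumptions: html.parser is an event-based parser rather than a full DOM, the names TextExtractor and strip_tags are hypothetical, and skipping the contents of script and style elements is a choice made for this sketch. It extracts text only and is not a sanitizer.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring tags and script/style bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_tags(html):
    """Return the text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(strip_tags("<p>Hi <b>there</b>!</p>"))
```

The same function could be called before the data is written to the database, which keeps the storage layer free of HTML handling.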
If you use regular expressions, it can be guaranteed that you leave nothing unsafe, but you can easily strip out more than you want, or it may leave malformed tags.
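To make the failure modes concrete, here is a small Python sketch that applies the pattern from acohen's answer, <[^>]+>, to a few inputs. Python's re module stands in for PostgreSQL's regexp_replace here, which is an assumption of this sketch (the two engines agree on a pattern this simple), and regex_strip is a hypothetical name.

```python
import re

# Same pattern as in the regexp_replace answer above.
TAG_RE = re.compile(r"<[^>]+>", re.IGNORECASE)


def regex_strip(html):
    """Remove everything that looks like a tag to the regex."""
    return TAG_RE.sub("", html)


# Well-behaved markup: the tags go, the text stays.
print(repr(regex_strip("<p>Hello <b>world</b></p>")))

# A '>' inside an attribute value ends the match early and leaves debris.
print(repr(regex_strip('<a title="x > y">link</a>')))

# Plain text containing '<' and '>' loses content that was never a tag.
print(repr(regex_strip("if a < b and c > d")))
```

The second and third inputs show the two problems named above: leftover fragments of a tag, and text removed that should have been kept.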