Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/24515/
"bad words" filter
Asked by ila
Not very technical, but... I have to implement a bad words filter in a new site we are developing. So I need a "good" bad words list to feed my db with... any hints or directions? Looking around with Google I found this one, and it's a start, but nothing more.
Yes, I know that this kind of filter is easily evaded... but the client's will is the client's will! :-)
The site will have to filter out both English and Italian words, but for Italian I can ask my colleagues to help me with a community-built list of "parolacce" :-) - an email will do.
Thanks for any help.
Accepted answer by UnkwnTech
I didn't see any language specified, but you can use this for PHP. It generates a RegEx for each inserted word, so that even intentional misspellings (e.g. @ss, i3itch) are caught.
<?php
/**
* @author [email protected]
**/
if (($_GET['act'] ?? '') === 'do')
{
    // Map each letter to a regex fragment matching the letter and its common look-alikes.
    // Multi-character substitutes (e.g. "ph" for "f") need alternation, not a character class,
    // and spaces inside a class would match a literal space, so the classes contain none.
    $replace = [
        'a' => '[aA@]',
        'b' => '(?:[bB]|[Ili]3)',
        'c' => '[cC(kK]',
        'd' => '[dD]',
        'e' => '[eE3]',
        'f' => '(?:[fF]|[pP][hH])',
        'g' => '[gG6]',
        'h' => '[hH]',
        'i' => '[iIl!1]',
        'j' => '[jJ]',
        'k' => '[cC(kK]',
        'l' => '[lL1!i]',
        'm' => '[mM]',
        'n' => '[nN]',
        'o' => '[oO0]',
        'p' => '[pP]',
        'q' => '[qQ9]',
        'r' => '[rR]',
        's' => '[sS$5]',
        't' => '[tT7]',
        'u' => '[uUvV]',
        'v' => '[vVuU]',
        'w' => '(?:[wW]|vv|VV)',
        'x' => '[xX]',
        'y' => '[yY]',
        'z' => '[zZ2]',
    ];
    $word = str_split(strtolower($_POST['word'] ?? ''));
    foreach ($word as $i => $char)
    {
        // Letters become look-alike fragments; anything else is escaped so it matches literally.
        $word[$i] = $replace[$char] ?? preg_quote($char, '/');
    }
    //$word = "/" . implode('', $word) . "/";
    // Escape for HTML because the result is printed into the page.
    echo htmlspecialchars(implode('', $word));
}
if (($_GET['act'] ?? '') === 'list')
{
    // The mysql_* functions were removed in PHP 7, so mysqli is used instead.
    $link = mysqli_connect('localhost', 'username', 'password', 'peoples');
    $sql = "SELECT word FROM filters";
    $result = mysqli_query($link, $sql);
    while ($row = mysqli_fetch_assoc($result))
    {
        echo htmlspecialchars($row['word']) . "<br />";
    }
    echo '<hr>';
}
?>
<html>
<head>
<title>RegEx Generator</title>
</head>
<body>
<form action='badword.php?act=do' method='post'>
Word: <input type='text' name='word' /><br />
<input type='submit' value='Generate' />
</form>
<a href="badword.php?act=list">List Words</a>
</body>
</html>
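To show how a generated expression might then be applied, here is a minimal sketch. The `censor` helper and the hand-written fragment for "ass" are assumptions for illustration and are not part of the script above. Lookarounds are used instead of `\b` because look-alikes such as `@` and `$` are not word characters, so `\b` would not anchor next to them.

```php
<?php
// Hypothetical helper: mask every whole-word match of a regex fragment in $text.
function censor(string $text, string $fragment): string
{
    // Require that the match is not glued to a letter or digit on either side.
    $regex = '/(?<![A-Za-z0-9])' . $fragment . '(?![A-Za-z0-9])/';
    return preg_replace_callback(
        $regex,
        // Replace each hit with asterisks of the same length.
        function (array $m): string {
            return str_repeat('*', strlen($m[0]));
        },
        $text
    );
}

// Hand-written fragment of the kind the generator aims to produce for "ass".
$fragment = '[aA@][sS$5][sS$5]';
echo censor('What an @ss, a classic a$$!', $fragment), "\n";
// prints "What an ***, a classic ***!"
```

Note that "classic" is left alone because the match would be preceded by a letter.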
Answer by AgentConundrum
Beware of clbuttic mistakes.
"Apple made the clbuttic mistake of forcing out their visionary - I mean, look at what NeXT has been up to!"
Hmm. "clbuttic".
Google "clbuttic" - thousands of hits!
There's someone who calls his car 'clbuttic'.
There are "Clbuttic Steam Engine" message boards.
Webster's dictionary - no help.
Hmm. What can this be?
HINT: People who make buttumptions about their regex scripts, will be embarbutted when they repeat this mbuttive mistake.
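The mistake in the quote is easy to reproduce. The following is only a sketch of the kind of naive substring replacement that would cause it; the function name `naiveFilter` is made up for illustration.

```php
<?php
// Naive filter: replaces the substring wherever it occurs, even inside other words.
function naiveFilter(string $text): string
{
    return str_ireplace('ass', 'butt', $text);
}

echo naiveFilter('a classic assumption'), "\n";
// prints "a clbuttic buttumption"
```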
Answer by David Fraga
Shutterstock has a GitHub repo with a list of bad words used for filtering.
You can check it out here: https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
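A list like this could be loaded into a lookup set and checked against the words of a post. The sketch below assumes a plain-text list with one entry per line; the function names and the mild sample words are placeholders, and entries that contain spaces would need a different matching strategy than this per-word lookup.

```php
<?php
// Parse a word list with one entry per line into a lowercase lookup set.
function parseWordList(string $contents): array
{
    $lines = preg_split('/\R/', $contents);
    $words = array_filter(array_map('trim', $lines), 'strlen'); // drop blank lines
    return array_fill_keys(array_map('strtolower', $words), true);
}

// Return the words of $text that appear in the lookup set.
// strtolower only folds ASCII here; accented letters would need mb_strtolower.
function findListedWords(string $text, array $set): array
{
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_filter($tokens, function ($token) use ($set) {
        return isset($set[$token]);
    }));
}

// In practice the contents would come from a downloaded list file;
// the inline sample keeps the sketch self-contained.
$set = parseWordList("darn\nheck\n\n  Blast \n");
print_r(findListedWords('Oh heck, what a blast!', $set));
```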
Answer by Tony
If anyone needs an API, Google currently provides a bad word indicator.
http://www.wdyl.com/profanity?q=naughtyword
{
response: "false"
}
Update: Google has now removed this service.
Answer by Kibbee
I would say to just remove posts as you become aware of them, and block users who are overly explicit in their postings. You can say very offensive things without using any swear words. If you block the word ass (aka donkey), then people will just type a$$ or /\55, or whatever else they need to type to get past the filter.
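One partial countermeasure to spellings like these is to normalize look-alike characters before matching. This is a sketch of that idea, not something proposed in the answer; the substitution table is a small, hypothetical sample, and a determined user can still invent spellings it does not cover.

```php
<?php
// Hypothetical substitution table; a real one would be longer and tuned to the site.
const LOOKALIKES = [
    '/\\' => 'a',
    '@'   => 'a',
    '$'   => 's',
    '5'   => 's',
    '3'   => 'e',
    '0'   => 'o',
    '1'   => 'i',
];

// Fold look-alike characters to plain letters so a simple word list can match.
function normalize(string $text): string
{
    // With an array argument, strtr tries the longest keys first,
    // so the two-character "/\" is replaced as a unit.
    return strtr(strtolower($text), LOOKALIKES);
}

echo normalize('a$$ and /\55'), "\n";
// prints "ass and ass"
```

Because digits are rewritten too, the normalized text is only suitable for detection and should not be shown back to users.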
Answer by Jon Limjap
+1 on the Clbuttic mistake. I think it is important for "bad word" filters to scan for both leading and trailing spaces (e.g., " ass ") as opposed to just the exact string, so that we won't have words like clbuttic, clbuttes, buttert, buttess, etc.
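In regex terms, the leading and trailing space check generalizes to word boundaries, which also cover the start and end of the text and adjacent punctuation. A minimal sketch, with `wholeWordFilter` and its parameters chosen for illustration:

```php
<?php
// Mask listed words only when they appear as whole words.
function wholeWordFilter(string $text, array $badWords, string $mask = '***'): string
{
    if ($badWords === []) {
        return $text; // nothing to filter
    }
    // Escape each word so it is matched literally inside the pattern.
    $alternatives = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $badWords);
    // \b matches only at a transition between word and non-word characters.
    $regex = '/\b(?:' . implode('|', $alternatives) . ')\b/i';
    return preg_replace($regex, $mask, $text);
}

echo wholeWordFilter('A classic ass passes the class.', ['ass']), "\n";
// prints "A classic *** passes the class."
```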
Answer by Ming-Tang
Wikipedia's ClueBot has a bad word filter; read its source.
Answer by Ross
You could always convince the client to run a session in which users just constantly post expletives, and build an easy way to add them to the system. It is a lot of work, but the result will probably be more representative of the community.
Answer by Richard
In researching this topic I determined that what was needed was more than just a list that does arbitrary replacements. I have built a web service that allows you to identify the level of 'cleanliness' you desire. It also makes an effort to identify false positives, i.e. where a word may be bad in one context but not in others. Take a look at http://filterlanguage.com