What’s a good Python profanity filter library?
Source: http://stackoverflow.com/questions/3531746/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow (not to this page).
Asked by Paul D. Waite
Like https://stackoverflow.com/questions/1521646/best-profanity-filter, but for Python — and I'm looking for libraries I can run and control myself locally, as opposed to web services.
(And whilst it's always great to hear your fundamental objections of principle to profanity filtering, I'm not specifically looking for them here. I know profanity filtering can't pick up every hurtful thing being said. I know swearing, in the grand scheme of things, isn't a particularly big issue. I know you need some human input to deal with issues of content. I'd just like to find a good library, and see what use I can make of it.)
Accepted answer by leoluk
I didn't find any Python profanity library, so I made one myself.
Parameters
filterlist
A list of regular expressions, each matching a forbidden word. Do not use \b yourself; it is inserted automatically depending on inside_words.
Example:
['bad', 'un\w+']
ignore_case
Default: True
Self-explanatory.
replacements
Default: "$@%-?!"
A string of characters from which the replacement strings are randomly generated.
Examples: "%&$?!" or "-".
complete
Default: True
Controls whether the entire matched word is replaced or its first and last characters are kept.
inside_words
Default: False
Controls if words are searched inside other words too. Disabling this
Module source
(examples at the end)
"""
Module that provides a class that filters profanities
"""

__author__ = "leoluk"
__version__ = '0.0.1'

import random
import re


class ProfanitiesFilter(object):
    def __init__(self, filterlist, ignore_case=True, replacements="$@%-?!",
                 complete=True, inside_words=False):
        """
        Inits the profanity filter.

        filterlist -- a list of regular expressions that
        matches words that are forbidden
        ignore_case -- ignore capitalization
        replacements -- string with characters to replace the forbidden word
        complete -- completely remove the word or keep the first and last char?
        inside_words -- search inside other words?
        """
        self.badwords = filterlist
        self.ignore_case = ignore_case
        self.replacements = replacements
        self.complete = complete
        self.inside_words = inside_words

    def _make_clean_word(self, length):
        """
        Generates a random replacement string of a given length
        using the chars in self.replacements.
        """
        return ''.join([random.choice(self.replacements) for i in
                        range(length)])

    def __replacer(self, match):
        value = match.group()
        if self.complete:
            return self._make_clean_word(len(value))
        if len(value) < 3:
            # Nothing lies between the first and last characters, and a
            # one-character match would otherwise be doubled.
            return value
        return value[0] + self._make_clean_word(len(value) - 2) + value[-1]

    def clean(self, text):
        """Cleans a string from profanity."""
        # \b restricts matches to whole words unless inside_words is set.
        regexp_insidewords = {
            True: r'(%s)',
            False: r'\b(%s)\b',
        }
        regexp = (regexp_insidewords[self.inside_words] %
                  '|'.join(self.badwords))
        r = re.compile(regexp, re.IGNORECASE if self.ignore_case else 0)
        return r.sub(self.__replacer, text)


if __name__ == '__main__':
    # Raw strings keep \w from being read as a string escape.
    f = ProfanitiesFilter([r'bad', r'un\w+'], replacements="-")
    example = "I am doing bad ungood badlike things."
    print(f.clean(example))
    # Returns "I am doing --- ------ badlike things."
    f.inside_words = True
    print(f.clean(example))
    # Returns "I am doing --- ------ ---like things."
    f.complete = False
    print(f.clean(example))
    # Returns "I am doing b-d u----d b-dlike things."
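Since the filterlist entries are treated as regular expressions, plain words that contain characters such as "." or "+" would need escaping before being passed in. Below is a minimal sketch, assuming two helpers named build_pattern and mask (both hypothetical, not part of the module above), of how literal words might be combined into one compiled pattern with optional \b word boundaries:

```python
import re


def build_pattern(words, inside_words=False, ignore_case=True):
    """Compile literal words into a single alternation pattern.

    Hypothetical helper: it is not part of the ProfanitiesFilter module.
    """
    # Escape each word so characters like '.' or '+' match literally,
    # and try longer words first so a longer entry wins over its prefix.
    escaped = [re.escape(w) for w in sorted(words, key=len, reverse=True)]
    body = "|".join(escaped)
    # \b anchors the match at word boundaries unless inside_words is set.
    template = r"(?:%s)" if inside_words else r"\b(?:%s)\b"
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(template % body, flags)


def mask(text, pattern, char="-"):
    """Replace every match with a run of `char` of the same length."""
    return pattern.sub(lambda m: char * len(m.group()), text)


whole_words = build_pattern(["bad", "a.b"])
print(mask("A bad a.b aXb badlike day", whole_words))
```

Compiling the pattern once and reusing it avoids rebuilding the regular expression on every call, which may matter when many strings are cleaned with the same word list.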
Answer by Aaron Digulla
Profanity? What the f***'s that? ;-)
It will still take a couple of years before a computer will really be able to recognize swearing and cursing and it is my sincere hope that people will have understood by then that profanity is human and not "dangerous."
Instead of a dumb filter, have a smart human moderator who can balance the tone of discussion as appropriate. A moderator who can detect abuse like:
"If you were my husband, I'd poison your tea." - "If you were my wife, I'd drink it."
(that was from Winston Churchill, btw.)
Answer by Glenn Maynard
It's possible for users to work around this, of course, but it should do a fairly thorough job of removing profanity:
import re


def remove_profanity(s):
    def repl(word):
        # Split each token into its leading word and any trailing punctuation.
        m = re.match(r"(\w+)(.*)", word)
        if not m:
            return word
        # Replace the word itself, preserving initial capitalization.
        word = "Bork" if m.group(1)[0].isupper() else "bork"
        word += m.group(2)
        return word
    return " ".join([repl(w) for w in s.split(" ")])


print(remove_profanity("You just come along with me and have a good time. The Galaxy's a fun place. You'll need to have this fish in your ear."))
Answer by Anand
Answer by nu everest
WebPurify is a Profanity Filter Library for Python
Answer by user2592414
arrBad = [
'2g1c',
'2 girls 1 cup',
'acrotomophilia',
'anal',
'anilingus',
'anus',
'arsehole',
'ass',
'asshole',
'assmunch',
'auto erotic',
'autoerotic',
'babeland',
'baby batter',
'ball gag',
'ball gravy',
'ball kicking',
'ball licking',
'ball sack',
'ball sucking',
'bangbros',
'bareback',
'barely legal',
'barenaked',
'bastardo',
'bastinado',
'bbw',
'bdsm',
'beaver cleaver',
'beaver lips',
'bestiality',
'bi curious',
'big black',
'big breasts',
'big knockers',
'big tits',
'bimbos',
'birdlock',
'bitch',
'black cock',
'blonde action',
'blonde on blonde action',
'blow j',
'blow your l',
'blue waffle',
'blumpkin',
'bollocks',
'bondage',
'boner',
'boob',
'boobs',
'booty call',
'brown showers',
'brunette action',
'bukkake',
'bulldyke',
'bullet vibe',
'bung hole',
'bunghole',
'busty',
'butt',
'buttcheeks',
'butthole',
'camel toe',
'camgirl',
'camslut',
'camwhore',
'carpet muncher',
'carpetmuncher',
'chocolate rosebuds',
'circlejerk',
'cleveland steamer',
'clit',
'clitoris',
'clover clamps',
'clusterfwor',
'cock',
'cocks',
'coprolagnia',
'coprophilia',
'cornhole',
'cum',
'cumming',
'cunnilingus',
'cunt',
'darkie',
'date rape',
'daterape',
'deep throat',
'deepthroat',
'dick',
'dildo',
'dirty pillows',
'dirty sanchez',
'dog style',
'doggie style',
'doggiestyle',
'doggy style',
'doggystyle',
'dolcett',
'domination',
'dominatrix',
'dommes',
'donkey punch',
'double dong',
'double penetration',
'dp action',
'eat my ass',
'ecchi',
'ejaculation',
'erotic',
'erotism',
'escort',
'ethical slut',
'eunuch',
'faggot',
'fecal',
'felch',
'fellatio',
'feltch',
'female squirting',
'femdom',
'figging',
'fingering',
'fisting',
'foot fetish',
'footjob',
'frotting',
'fwor',
'fworing',
'fwor buttons',
'fudge packer',
'fudgepacker',
'futanari',
'g-spot',
'gang bang',
'gay sex',
'genitals',
'giant cock',
'girl on',
'girl on top',
'girls gone wild',
'goatcx',
'goatse',
'gokkun',
'golden shower',
'goo girl',
'goodpoop',
'goregasm',
'grope',
'group sex',
'guro',
'hand job',
'handjob',
'hard core',
'hardcore',
'hentai',
'homoerotic',
'honkey',
'hooker',
'hot chick',
'how to kill',
'how to murder',
'huge fat',
'humping',
'incest',
'intercourse',
'Hyman off',
'jail bait',
'jailbait',
'jerk off',
'jigaboo',
'jiggaboo',
'jiggerboo',
'jizz',
'juggs',
'kike',
'kinbaku',
'kinkster',
'kinky',
'knobbing',
'leather restraint',
'leather straight Hymanet',
'lemon party',
'lolita',
'lovemaking',
'make me come',
'male squirting',
'masturbate',
'menage a trois',
'milf',
'missionary position',
'motherfworer',
'mound of venus',
'mr hands',
'muff diver',
'muffdiving',
'nambla',
'nawashi',
'negro',
'neonazi',
'nig nog',
'nigga',
'nigger',
'nimphomania',
'nipple',
'nipples',
'nsfw images',
'nude',
'nudity',
'nympho',
'nymphomania',
'octopussy',
'omorashi',
'one cup two girls',
'one guy one jar',
'orgasm',
'orgy',
'paedophile',
'panties',
'panty',
'pedobear',
'pedophile',
'pegging',
'penis',
'phone sex',
'piece of shit',
'piss pig',
'pissing',
'pisspig',
'playboy',
'pleasure chest',
'pole smoker',
'ponyplay',
'poof',
'poop chute',
'poopchute',
'porn',
'porno',
'pornography',
'prince albert piercing',
'pthc',
'pubes',
'pussy',
'queaf',
'raghead',
'raging boner',
'rape',
'raping',
'rapist',
'rectum',
'reverse cowgirl',
'rimjob',
'rimming',
'rosy palm',
'rosy palm and her 5 sisters',
'rusty trombone',
's&m',
'sadism',
'scat',
'schlong',
'scissoring',
'semen',
'sex',
'sexo',
'sexy',
'shaved beaver',
'shaved pussy',
'shemale',
'shibari',
'shit',
'shota',
'shrimping',
'slanteye',
'slut',
'smut',
'snatch',
'snowballing',
'sodomize',
'sodomy',
'spic',
'spooge',
'spread legs',
'strap on',
'strapon',
'strappado',
'strip club',
'style doggy',
'suck',
'sucks',
'suicide girls',
'sultry women',
'swastika',
'swinger',
'tainted love',
'taste my',
'tea bagging',
'threesome',
'throating',
'tied up',
'tight white',
'tit',
'tits',
'titties',
'titty',
'tongue in a',
'topless',
'tosser',
'towelhead',
'tranny',
'tribadism',
'tub girl',
'tubgirl',
'tushy',
'twat',
'twink',
'twinkie',
'two girls one cup',
'undressing',
'upskirt',
'urethra play',
'urophilia',
'vagina',
'venus mound',
'vibrator',
'violet blue',
'violet wand',
'vorarephilia',
'voyeur',
'vulva',
'wank',
'wet dream',
'wetback',
'white power',
'women rapping',
'wrapping men',
'wrinkled starfish',
'xx',
'xxx',
'yaoi',
'yellow showers',
'yiffy',
'zoophilia']
def profanityFilter(text):
    brokenStr1 = text.split()
    badWordMask = '!@#$%!@#$%^~!@%^~@#$%!@#$%^~!'
    for word in brokenStr1:
        if word in arrBad:
            print(word + ' <--Bad word!')
            # Note: str.replace also masks the word wherever it occurs
            # inside longer words.
            text = text.replace(word, badWordMask[:len(word)])
    return text


print(profanityFilter("this thing sucks sucks sucks fworing stuff"))
You can add to or remove from the bad-words list, arrBad, as you please.
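Because profanityFilter splits the text on whitespace and tests each token against the list, multi-word entries such as 'tied up' can never match, and a token with trailing punctuation ("sucks,") slips through. One possible variant is sketched below; it uses a short stand-in list, and the names BAD_PHRASES, compile_phrases and mask_phrases are assumptions for illustration, not part of the answer. It compiles the entries into a single regular expression instead:

```python
import re

# Stand-in for arrBad; the full list from the answer could be passed instead.
BAD_PHRASES = ["sucks", "suck", "tied up", "s&m"]


def compile_phrases(phrases):
    """Build one case-insensitive pattern matching any entry as a whole."""
    # Longest first, so 'sucks' is tried before its prefix 'suck'.
    ordered = sorted(set(phrases), key=len, reverse=True)
    parts = []
    for phrase in ordered:
        # Escape each word and allow any run of whitespace between words.
        parts.append(r"\s+".join(re.escape(w) for w in phrase.split()))
    # (?<!\w) and (?!\w) require that no word character touches the match.
    return re.compile(r"(?<!\w)(?:%s)(?!\w)" % "|".join(parts), re.IGNORECASE)


def mask_phrases(text, pattern, mask_char="*"):
    """Replace each non-space character of a match with mask_char."""
    return pattern.sub(lambda m: re.sub(r"\S", mask_char, m.group()), text)


PATTERN = compile_phrases(BAD_PHRASES)
print(mask_phrases("This sucks, I was tied  up. Suction is fine.", PATTERN))
```

Matching on the whole text rather than on split tokens also means the mask is applied only where an entry stands alone, so a longer word that merely contains an entry is left untouched.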

