什么是好的 Python 脏话过滤器库?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/3531746/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-18 11:35:59  来源:igfitidea点击:

What’s a good Python profanity filter library?

pythonnlpprofanity

提问by Paul D. Waite

Like https://stackoverflow.com/questions/1521646/best-profanity-filter, but for Python — and I'm looking for libraries I can run and control myself locally, as opposed to web services.

就像https://stackoverflow.com/questions/1521646/best-profanity-filter一样,但对于 Python - 我正在寻找可以在本地运行和控制自己的库,而不是 Web 服务。

(And whilst it's always great to hear your fundamental objections of principle to profanity filtering, I'm not specifically looking for them here. I know profanity filtering can't pick up every hurtful thing being said. I know swearing, in the grand scheme of things, isn't a particularly big issue. I know you need some human input to deal with issues of content. I'd just like to find a good library, and see what use I can make of it.)

(虽然听到您对脏话过滤原则的基本反对意见总是很棒,但我并不是在这里专门寻找它们。我知道脏话过滤不能处理所有伤害性的事情。我知道发誓,在宏伟的计划中的东西,不是一个特别大的问题。我知道你需要一些人工输入来处理内容问题。我只想找到一个好的图书馆,看看我能用它做什么。)

采纳答案by leoluk

I didn't found any Python profanity library, so I made one myself.

我没有找到任何 Python 亵渎库,所以我自己做了一个。

Parameters

参数



filterlist

filterlist

A list of regular expressions that match a forbidden word. Please do not use \b, it will be inserted depending on inside_words.

匹配禁用词的正则表达式列表。请不要使用\b,它会被插入取决于inside_words

Example: ['bad', 'un\w+']

例子: ['bad', 'un\w+']

ignore_case

ignore_case

Default: True

默认: True

Self-explanatory.

不言自明。

replacements

replacements

Default: "$@%-?!"

默认: "$@%-?!"

A string with characters from which the replacements strings will be randomly generated.

一个包含随机生成替换字符串的字符的字符串。

Examples: "%&$?!"or "-"etc.

例如:"%&$?!""-"等。

complete

complete

Default: True

默认: True

Controls if the entire string will be replaced or if the first and last chars will be kept.

控制是替换整个字符串还是保留第一个和最后一个字符。

inside_words

inside_words

Default: False

默认: False

Controls if words are searched inside other words too. Disabling this

控制是否也在其他词中搜索词。禁用这个

Module source

模块源码



(examples at the end)

(例子在最后)

"""
Module that provides a class that filters profanities

"""

__author__ = "leoluk"
__version__ = '0.0.1'

import random
import re

class ProfanitiesFilter(object):
    def __init__(self, filterlist, ignore_case=True, replacements="$@%-?!", 
                 complete=True, inside_words=False):
        """
        Inits the profanity filter.

        filterlist -- a list of regular expressions that
        matches words that are forbidden
        ignore_case -- ignore capitalization
        replacements -- string with characters to replace the forbidden word
        complete -- completely remove the word or keep the first and last char?
        inside_words -- search inside other words?

        """

        self.badwords = filterlist
        self.ignore_case = ignore_case
        self.replacements = replacements
        self.complete = complete
        self.inside_words = inside_words

    def _make_clean_word(self, length):
        """
        Generates a random replacement string of a given length
        using the chars in self.replacements.

        """
        return ''.join([random.choice(self.replacements) for i in
                  range(length)])

    def __replacer(self, match):
        value = match.group()
        if self.complete:
            return self._make_clean_word(len(value))
        else:
            return value[0]+self._make_clean_word(len(value)-2)+value[-1]

    def clean(self, text):
        """Cleans a string from profanity."""

        regexp_insidewords = {
            True: r'(%s)',
            False: r'\b(%s)\b',
            }

        regexp = (regexp_insidewords[self.inside_words] % 
                  '|'.join(self.badwords))

        r = re.compile(regexp, re.IGNORECASE if self.ignore_case else 0)

        return r.sub(self.__replacer, text)


if __name__ == '__main__':

    f = ProfanitiesFilter(['bad', 'un\w+'], replacements="-")    
    example = "I am doing bad ungood badlike things."

    print f.clean(example)
    # Returns "I am doing --- ------ badlike things."

    f.inside_words = True    
    print f.clean(example)
    # Returns "I am doing --- ------ ---like things."

    f.complete = False    
    print f.clean(example)
    # Returns "I am doing b-d u----d b-dlike things."

回答by Aaron Digulla

Profanity? What the f***'s that? ;-)

亵渎?那是什么鬼?;-)

It will still take a couple of years before a computer will really be able to recognize swearing and cursing and it is my sincere hope that people will have understood by then that profanity is human and not "dangerous."

计算机真正能够识别脏话和诅咒还需要几年的时间,我真诚地希望到那时人们会明白亵渎是人性的,而不是“危险的”。

Instead of a dumb filter, have a smart human moderator who can balance the tone of discussion as appropriate. A moderator who can detect abuse like:

而不是一个愚蠢的过滤器,有一个聪明的人类主持人,可以适当地平衡讨论的基调。可以检测滥用行为的版主,例如:

"If you were my husband, I'd poison your tea." - "If you were my wife, I'd drink it."

“如果你是我丈夫,我会在你的茶里下毒。” - “如果你是我的妻子,我会喝的。”

(that was from Winston Churchill, btw.)

(那是来自温斯顿丘吉尔,顺便说一句。)

回答by Glenn Maynard

It's possible for users to work around this, of course, but it should do a fairly thorough job of removing profanity:

当然,用户可以解决这个问题,但它应该做一个相当彻底的工作来消除亵渎:

import re
def remove_profanity(s):
    def repl(word):
        m = re.match(r"(\w+)(.*)", word)
        if not m:
            return word
        word = "Bork" if m.group(1)[0].isupper() else "bork"
        word += m.group(2)
        return word
    return " ".join([repl(w) for w in s.split(" ")])

print remove_profanity("You just come along with me and have a good time. The Galaxy's a fun place. You'll need to have this fish in your ear.")

回答by nu everest

WebPurify is a Profanity Filter Library for Python

WebPurify 是 Python 的亵渎过滤器库

回答by user2592414

arrBad = [
'2g1c',
'2 girls 1 cup',
'acrotomophilia',
'anal',
'anilingus',
'anus',
'arsehole',
'ass',
'asshole',
'assmunch',
'auto erotic',
'autoerotic',
'babeland',
'baby batter',
'ball gag',
'ball gravy',
'ball kicking',
'ball licking',
'ball sack',
'ball sucking',
'bangbros',
'bareback',
'barely legal',
'barenaked',
'bastardo',
'bastinado',
'bbw',
'bdsm',
'beaver cleaver',
'beaver lips',
'bestiality',
'bi curious',
'big black',
'big breasts',
'big knockers',
'big tits',
'bimbos',
'birdlock',
'bitch',
'black cock',
'blonde action',
'blonde on blonde action',
'blow j',
'blow your l',
'blue waffle',
'blumpkin',
'bollocks',
'bondage',
'boner',
'boob',
'boobs',
'booty call',
'brown showers',
'brunette action',
'bukkake',
'bulldyke',
'bullet vibe',
'bung hole',
'bunghole',
'busty',
'butt',
'buttcheeks',
'butthole',
'camel toe',
'camgirl',
'camslut',
'camwhore',
'carpet muncher',
'carpetmuncher',
'chocolate rosebuds',
'circlejerk',
'cleveland steamer',
'clit',
'clitoris',
'clover clamps',
'clusterfwor',
'cock',
'cocks',
'coprolagnia',
'coprophilia',
'cornhole',
'cum',
'cumming',
'cunnilingus',
'cunt',
'darkie',
'date rape',
'daterape',
'deep throat',
'deepthroat',
'dick',
'dildo',
'dirty pillows',
'dirty sanchez',
'dog style',
'doggie style',
'doggiestyle',
'doggy style',
'doggystyle',
'dolcett',
'domination',
'dominatrix',
'dommes',
'donkey punch',
'double dong',
'double penetration',
'dp action',
'eat my ass',
'ecchi',
'ejaculation',
'erotic',
'erotism',
'escort',
'ethical slut',
'eunuch',
'faggot',
'fecal',
'felch',
'fellatio',
'feltch',
'female squirting',
'femdom',
'figging',
'fingering',
'fisting',
'foot fetish',
'footjob',
'frotting',
'fwor',
'fworing',
'fwor buttons',
'fudge packer',
'fudgepacker',
'futanari',
'g-spot',
'gang bang',
'gay sex',
'genitals',
'giant cock',
'girl on',
'girl on top',
'girls gone wild',
'goatcx',
'goatse',
'gokkun',
'golden shower',
'goo girl',
'goodpoop',
'goregasm',
'grope',
'group sex',
'guro',
'hand job',
'handjob',
'hard core',
'hardcore',
'hentai',
'homoerotic',
'honkey',
'hooker',
'hot chick',
'how to kill',
'how to murder',
'huge fat',
'humping',
'incest',
'intercourse',
'Hyman off',
'jail bait',
'jailbait',
'jerk off',
'jigaboo',
'jiggaboo',
'jiggerboo',
'jizz',
'juggs',
'kike',
'kinbaku',
'kinkster',
'kinky',
'knobbing',
'leather restraint',
'leather straight Hymanet',
'lemon party',
'lolita',
'lovemaking',
'make me come',
'male squirting',
'masturbate',
'menage a trois',
'milf',
'missionary position',
'motherfworer',
'mound of venus',
'mr hands',
'muff diver',
'muffdiving',
'nambla',
'nawashi',
'negro',
'neonazi',
'nig nog',
'nigga',
'nigger',
'nimphomania',
'nipple',
'nipples',
'nsfw images',
'nude',
'nudity',
'nympho',
'nymphomania',
'octopussy',
'omorashi',
'one cup two girls',
'one guy one jar',
'orgasm',
'orgy',
'paedophile',
'panties',
'panty',
'pedobear',
'pedophile',
'pegging',
'penis',
'phone sex',
'piece of shit',
'piss pig',
'pissing',
'pisspig',
'playboy',
'pleasure chest',
'pole smoker',
'ponyplay',
'poof',
'poop chute',
'poopchute',
'porn',
'porno',
'pornography',
'prince albert piercing',
'pthc',
'pubes',
'pussy',
'queaf',
'raghead',
'raging boner',
'rape',
'raping',
'rapist',
'rectum',
'reverse cowgirl',
'rimjob',
'rimming',
'rosy palm',
'rosy palm and her 5 sisters',
'rusty trombone',
's&m',
'sadism',
'scat',
'schlong',
'scissoring',
'semen',
'sex',
'sexo',
'sexy',
'shaved beaver',
'shaved pussy',
'shemale',
'shibari',
'shit',
'shota',
'shrimping',
'slanteye',
'slut',
'smut',
'snatch',
'snowballing',
'sodomize',
'sodomy',
'spic',
'spooge',
'spread legs',
'strap on',
'strapon',
'strappado',
'strip club',
'style doggy',
'suck',
'sucks',
'suicide girls',
'sultry women',
'swastika',
'swinger',
'tainted love',
'taste my',
'tea bagging',
'threesome',
'throating',
'tied up',
'tight white',
'tit',
'tits',
'titties',
'titty',
'tongue in a',
'topless',
'tosser',
'towelhead',
'tranny',
'tribadism',
'tub girl',
'tubgirl',
'tushy',
'twat',
'twink',
'twinkie',
'two girls one cup',
'undressing',
'upskirt',
'urethra play',
'urophilia',
'vagina',
'venus mound',
'vibrator',
'violet blue',
'violet wand',
'vorarephilia',
'voyeur',
'vulva',
'wank',
'wet dream',
'wetback',
'white power',
'women rapping',
'wrapping men',
'wrinkled starfish',
'xx',
'xxx',
'yaoi',
'yellow showers',
'yiffy',
'zoophilia']

def profanityFilter(text):
brokenStr1 = text.split()
badWordMask = '!@#$%!@#$%^~!@%^~@#$%!@#$%^~!'
new = ''
for word in brokenStr1:
    if word in arrBad:
        print word + ' <--Bad word!'
        text = text.replace(word,badWordMask[:len(word)])
        #print new

return text

print profanityFilter("this thing sucks sucks sucks fworing stuff")

You can add or remove from the bad words list,arrBad, as you please.

您可以随意添加或删除坏词列表,arrBad。