Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/818203/

Time: 2020-09-08 07:17:09  Source: igfitidea

Does anyone know of a good library for mapping a person's name to his or her gender?

Tags: database, language-agnostic

Asked by Chas. Owens

I am looking for a library or database that can provide guesses about whether a person is male or female based on his or her name or nickname. Something like

john => "M",
mary => "F",
alex => "A", #ambiguous

I am looking for something that supports names other than English names (such as Japanese, Indian, etc.).

Before I get another answer along the lines of "you are going to offend people by assuming their sex/gender", let me be clear: my application does not interact with anyone. It does not send emails or contact anyone in any way. There are no users to ask. In many cases, the person in question is dead, and the only information I have is name, birth date, and date of death. The reason I want to know the sex of the individual is to make the grammar of the output nicer and to aid in possible searches that may come later.

Accepted answer by Ayman Hourieh

The gender of a name is something that cannot be inferred programmatically in the general case. You need a name database. Here is a free name database from the US Census Bureau.

EDIT: The link to the 2010 name data is dead, but there are working links and libraries in the comments.
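A minimal sketch of how such a name database could be turned into the lookup asked for in the question. It assumes each source file is plain text with one name per line, followed by a frequency column (roughly "NAME FREQ ..."); check that layout against whatever file you actually download. The sample rows below are illustrative placeholders, not real Census figures, and the function names and the 0.9 threshold are hypothetical choices.

```python
def parse_frequencies(lines):
    """Parse rows of the assumed form 'NAME FREQ ...' into {name: freq}."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            table[parts[0].lower()] = float(parts[1])
    return table


def guess_gender(name, male, female, threshold=0.9):
    """Return 'M', 'F', 'A' (ambiguous) or None for an unknown name."""
    key = name.strip().lower()
    m = male.get(key, 0.0)
    f = female.get(key, 0.0)
    total = m + f
    if total == 0:
        return None  # name is in neither list
    if m / total >= threshold:
        return "M"
    if f / total >= threshold:
        return "F"
    return "A"


# Placeholder rows standing in for the real male and female name files.
MALE_SAMPLE = ["JOHN 3.0 3.0 1", "ALEX 0.1 3.1 2"]
FEMALE_SAMPLE = ["MARY 2.5 2.5 1", "ALEX 0.05 2.55 2"]

male = parse_frequencies(MALE_SAMPLE)
female = parse_frequencies(FEMALE_SAMPLE)

for name in ("john", "mary", "alex", "zzyzx"):
    print(name, "=>", guess_gender(name, male, female))
```

Returning None for unknown names keeps "not in the database" separate from "in the database but ambiguous", which matters for non-English names that a US list will mostly lack.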

Answer by Ludwig Weinzierl

gender.c is an open source C program that does a good job. It comes with data for 44568 first names from all around the world. There is good documentation and a description of the file format (basically plain text), so it should not be too difficult to read it from your own application.

Here is what the author says:

A few words on quality of data

The dictionary of first names has been prepared with utmost care. For example, the Turkish, Indian and Korean names in this dictionary have all been independently classified by several native speakers. I also took special care to list only those names which can currently be found.

The lesson from this?

Any modifications should be done very cautiously (and they must also adhere to the sorting required by the search algorithm). For example, knowing that "Sascha" is a boy's name in Germany, the author never assumed the English "Sasha" to be a girl's name. Knowing that "Jan" is a boy's name in Germany, I never assumed it to be also an English short form of "Janet". Another case in point is the name "Esra". This is a boy's name in Germany, but a girl's name in Turkey.

The program calculates a probability for the name being male or female. It can do so with the name alone as input, or with the name and country of origin, which gives significantly better results.

You can download it from the website of the German computer magazine c't 40 000 Namen. The article is in German but don't worry, all documentation is in English. Here is the direct ftp link 0717-182.zip if you are not interested in the article. The zip file contains the source code, a Windows executable, the database and the documentation.

Answer by Shog9

"I tell ya, life ain't easy for a boy named 'Sue.'"

...So, why make it any harder? If you need to know the sex, just ask... Otherwise, don't worry about it.

Answer by Stromgren

I've built a free API that gives a probabilistic guess of the gender based on a first name. Instead of using any of the approaches mentioned above, I use a huge dataset of profiles from social networks to provide a probabilistic guess along with a certainty factor. It also supports optional filtering by country or language IDs. It's getting better by the day as more profiles are added to the dataset.

It's free to use at http://genderize.io

One thing you should consider is using a tool that takes demographics into account, as naming conventions depend heavily on them.

Example

http://api.genderize.io?name=kim
{"name":"kim","gender":"female","probability":"0.89","count":1440}

http://api.genderize.io?name=kim&country_id=dk
{"name":"kim","gender":"male","probability":"0.95","count":44,"country_id":"dk"}
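A rough sketch of how the two responses above could be consumed from Python. It only builds the request URL and interprets a response body, using the URL and JSON fields shown in the example; the function names, the 0.8 cutoff, and the mapping back to the question's "M"/"F"/"A" codes are assumptions made for illustration, not part of the API.

```python
import json
from urllib.parse import urlencode

# Base URL as it appears in the example above.
API_BASE = "http://api.genderize.io"


def build_url(name, country_id=None):
    """Build a query URL, optionally filtered by country."""
    params = {"name": name}
    if country_id:
        params["country_id"] = country_id
    return API_BASE + "?" + urlencode(params)


def parse_response(body, min_probability=0.8):
    """Map a JSON response body to 'M', 'F' or 'A' (ambiguous/unknown)."""
    data = json.loads(body)
    gender = data.get("gender")
    if gender not in ("male", "female"):
        return "A"
    # The example responses carry the probability as a string.
    if float(data.get("probability", 0)) < min_probability:
        return "A"
    return "M" if gender == "male" else "F"


# The two response bodies from the example above.
kim_global = '{"name":"kim","gender":"female","probability":"0.89","count":1440}'
kim_dk = '{"name":"kim","gender":"male","probability":"0.95","count":44,"country_id":"dk"}'

print(build_url("kim", "dk"))
print(parse_response(kim_global), parse_response(kim_dk))
```

To fetch a live response, the URL from build_url could be passed to urllib.request.urlopen and the decoded body handed to parse_response. The count field is worth checking as well, since a high probability based on very few profiles is weak evidence.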

Answer by richardtallent

Here are two oddball approaches that may not even work, and likely wouldn't work en masse without violating the terms of a license:

  1. Use the Facebook API (which I know virtually nothing about, it may not even be possible) to perform two searches: one for FB male users with that first name, and one for female. Use the two numbers to decide the probability of gender.

  2. Much looser but more scalable, use the Google API and search for the name plus the gender-specific pronouns, and compare the numbers. For instance, there are 592,000,000 results for searching for "Richard his" (not as a phrase), but only 179,000,000 for "Richard her".
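Either approach boils down to comparing two counts. A minimal sketch of that arithmetic, using the "Richard his" / "Richard her" figures quoted above; the function name is hypothetical, and how the counts are obtained (which search API, under which terms) is left open.

```python
def probability_male(his_count, her_count):
    """Share of hits that pair the name with the masculine pronoun."""
    total = his_count + her_count
    if total == 0:
        return None  # no evidence either way
    return his_count / total


# Result counts quoted in the answer for "Richard his" and "Richard her".
p = probability_male(592_000_000, 179_000_000)
print(f"{p:.2f}")  # prints 0.77
```

Raw hit counts are noisy, so a score like this is better read as a weak hint than as a calibrated probability.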

Answer by bignose

Given your stated constraints, your best option is to re-phrase whatever it is you're writing to be gender-neutral unless you know what gender they want to be called in each instance.

If writing in English, remember that singular “they” is grammatically fine as a gender-neutral third-person singular pronoun.

A good example is the title of this question. As it currently stands:

    … mapping a person's name to his or her sex?

That would be less awkward if written:

    … mapping a person's name to their sex?

Answer by Karl

It's also poor practice to assume that users must be male or female. There are a small but significant number of "intersex" people, most of whom are heartily sick of not having a box to tick.

bignose: interesting on the "singular they". I didn't realize it had such a long history.

Answer by Remy

It's not a service, but a little app with a database:
http://www.codeproject.com/KB/cpp/genderizer.aspx

And this tool is in German:
http://www.faq-o-matic.net/2011/06/01/zu-einem-vornamen-das-geschlecht-finden/

And another one in VB:
http://www.vbarchiv.net/tipps/tipp_1925-geschlecht-anhand-des-vornamens-ermitteln.html

I think that, in combination with some "Most used first names in 2011" lists, you should be able to build something decent.

Answer by jm_tagarro

The Python package SexMachine will do that for you. Given any first name, it returns whether it is male, female or unisex. It relies on the data from the gender.c program by Jorg Michael.

Answer by Dimitar Slavchev

The idea will clearly not work in most languages.

However, if you could tell the nationality beforehand you could have more luck. In most Slavic languages (e.g. Russian, Polish, Bulgarian) you could safely assume that surnames ending in -va, -cha, -ska (-a in general) are feminine, while those ending in -v, -ch, -shi are masculine.

In fact, any surname has a feminine and a masculine form depending on the ending. The same names used in other countries (e.g. the US) might use only the masculine form, though.

The same could be said for first names (-a -ya are feminine) but it is not 100% accurate.
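The surname rule above can be sketched as a simple suffix check. This is only an illustration of the heuristic as stated in this answer: it assumes names are already transliterated to Latin script, uses just the endings listed above, and the function name is made up.

```python
# Endings taken from the answer above; applied to Latin transliterations.
FEMININE_ENDINGS = ("va", "cha", "ska")
MASCULINE_ENDINGS = ("v", "ch", "shi")


def guess_from_slavic_surname(surname):
    """Return 'F', 'M' or 'A' (no rule matched) based on the surname ending."""
    s = surname.strip().lower()
    if s.endswith(FEMININE_ENDINGS):
        return "F"
    if s.endswith(MASCULINE_ENDINGS):
        return "M"
    return "A"


for surname in ("Ivanova", "Ivanov", "Kowalska", "Smith"):
    print(surname, "=>", guess_from_slavic_surname(surname))
```

A check like this only makes sense once the nationality is known, and anything it leaves as "A" would still need a name database.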

But in general you would hardly get a library that is sufficiently accurate.
