Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/48357853/

Time: 2020-09-14 05:04:40  Source: igfitidea

Create a Pandas DataFrame with a unique index

python, pandas

Asked by user3605780

Can I create a DataFrame with a unique index or unique columns, similar to a unique key in MySQL, so that it raises an error if I try to add a duplicate index?

Or is my only option to create an if-statement and check for the value in the dataframe before appending it?

EDIT:

It seems my question was a bit unclear. By unique columns I mean that a column cannot contain duplicate values.

With

df.append(new_row, verify_integrity=True)

we can check for all columns, but how can we check for only one or two columns?

Answered by unutbu

You can use df.append(..., verify_integrity=True) to maintain a unique row index:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3,4), columns=list('ABCD'))
dup_row = pd.DataFrame([[10,20,30,40]], columns=list('ABCD'), index=[1])
new_row = pd.DataFrame([[10,20,30,40]], columns=list('ABCD'), index=[9])

This successfully appends a new row (with index 9):

df.append(new_row, verify_integrity=True)
#     A   B   C   D
# 0   0   1   2   3
# 1   4   5   6   7
# 2   8   9  10  11
# 9  10  20  30  40

This raises ValueError because 1 is already in the index:

df.append(dup_row, verify_integrity=True)
# ValueError: Indexes have overlapping values: [1]
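
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the calls above only run on older versions. A minimal sketch of the same check on current pandas, assuming pd.concat is an acceptable substitute (it also accepts verify_integrity=True); the variable names appended and raised are only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))
dup_row = pd.DataFrame([[10, 20, 30, 40]], columns=list("ABCD"), index=[1])
new_row = pd.DataFrame([[10, 20, 30, 40]], columns=list("ABCD"), index=[9])

# Index 9 is not yet present, so this concatenation succeeds.
appended = pd.concat([df, new_row], verify_integrity=True)

# Index 1 already exists, so pandas raises ValueError.
try:
    pd.concat([df, dup_row], verify_integrity=True)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```

As with append, the check only looks at the index labels of the concatenated result, not at the values inside the columns.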


While the above works to ensure a unique row index, I'm not aware of a similar method for ensuring a unique column index. In theory you could transpose the DataFrame, append with verify_integrity=True, and then transpose again, but generally I would not recommend this since transposing can alter dtypes when the column dtypes are not all the same. (When the column dtypes are not all the same, the transposed DataFrame gets columns of object dtype. Conversion to and from object arrays can be bad for performance.)
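
To make the dtype concern concrete, here is a small sketch with a made-up three-column frame (the column names and values are assumptions for illustration). After a round trip through two transposes the numeric columns no longer keep their original dtypes:

```python
import pandas as pd

# Mixed dtypes: integer, float, and string columns.
df = pd.DataFrame({"A": [1, 2], "B": [1.5, 2.5], "C": ["x", "y"]})

# Transposing mixed-dtype data forces a common dtype (object),
# and transposing back does not restore the originals.
roundtrip = df.T.T

print(df.dtypes)
print(roundtrip.dtypes)
```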

If you need both unique row and column indexes, then perhaps a better alternative is to stack your DataFrame so that all the unique column index levels become row index levels. Then you can use append with verify_integrity=True on the reshaped DataFrame.
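
A rough sketch of the stacking idea, under the assumption that pd.concat stands in for append (which no longer exists in pandas 2.0 and later). The stacked Series has one (row, column) label per cell, so a duplicate in either position is caught by the same integrity check; the data and the names new_cell and dup_cell are invented for the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=list("ABC"))

# Each cell is now labelled by a (row label, column label) pair.
stacked = df.stack()

# A cell in a new column "D" of row 0: the label (0, "D") is unused.
new_cell = pd.Series([100], index=pd.MultiIndex.from_tuples([(0, "D")]))
ok = pd.concat([stacked, new_cell], verify_integrity=True)

# The label (0, "A") already exists, so this is rejected.
dup_cell = pd.Series([100], index=pd.MultiIndex.from_tuples([(0, "A")]))
try:
    pd.concat([stacked, dup_cell], verify_integrity=True)
    raised = False
except ValueError:
    raised = True
```

Calling ok.unstack() turns the result back into a wide DataFrame, with missing cells filled by NaN.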

Answered by Tai

OP's follow-up question:

With df.append(new_row, verify_integrity=True), we can check for all columns, but how can we check for only one or two columns?

To check the uniqueness of a single column, say one named value, you can try

df['value'].duplicated().any()

This checks whether any value in the column is duplicated. If so, the column is not unique.
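
For illustration, a short sketch on made-up data. It also shows Series.is_unique, which answers the same question from the other direction, and duplicated(keep=False), which marks every occurrence so the offending rows can be inspected:

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 2, 3]})

# True when at least one value repeats.
has_dupes = df["value"].duplicated().any()

# The opposite view: True only when every value is distinct.
all_unique = df["value"].is_unique

# keep=False flags all copies of a repeated value, not just the later ones.
offenders = df[df["value"].duplicated(keep=False)]
print(offenders)
```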

Given two columns, say C1 and C2, to check whether there are duplicated rows, we can still use DataFrame.duplicated.

df[["C1", "C2"]].duplicated()

It checks row-wise uniqueness. You can again use any to check whether any of the returned values is True.
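
A small sketch of the row-wise check on invented data. The subset argument of DataFrame.duplicated gives the same result without selecting the columns first, which can be handy when the full frame is needed afterwards:

```python
import pandas as pd

df = pd.DataFrame({
    "C1": [1, 1, 2],
    "C2": ["a", "a", "a"],
    "C3": [0.1, 0.2, 0.3],
})

# Row 1 repeats the (C1, C2) pair of row 0; C3 is ignored.
pair_dupes = df[["C1", "C2"]].duplicated()

# Equivalent check using the subset argument on the full frame.
same_check = df.duplicated(subset=["C1", "C2"])

any_dupe = pair_dupes.any()
```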

Given two columns, say C1 and C2, to check whether each column contains duplicated values, we can use apply.

df[["C1", "C2"]].apply(lambda x: x.duplicated().any())

This will apply the function to each column.
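
To tie this back to the original question, one possible way to enforce uniqueness on selected columns when adding rows is to run the per-column check on the combined data and raise if it fails. The helper below, concat_unique, is hypothetical (it is not part of pandas), and it assumes pd.concat rather than append:

```python
import pandas as pd

def concat_unique(df, new_rows, unique_cols):
    """Hypothetical helper: add new_rows to df, raising ValueError if any
    column listed in unique_cols would then contain duplicate values."""
    combined = pd.concat([df, new_rows], ignore_index=True)
    # One boolean per column: does that column contain a repeated value?
    dupes = combined[unique_cols].apply(lambda col: col.duplicated().any())
    if dupes.any():
        bad = list(dupes[dupes].index)
        raise ValueError(f"Duplicate values in column(s): {bad}")
    return combined

df = pd.DataFrame({"C1": [1, 2], "C2": ["a", "b"], "C3": [0, 0]})

# C3 repeats, but only C1 and C2 are required to be unique.
ok = concat_unique(
    df, pd.DataFrame({"C1": [3], "C2": ["c"], "C3": [0]}), ["C1", "C2"]
)

# C1 == 2 already exists, so this row is rejected.
try:
    concat_unique(
        df, pd.DataFrame({"C1": [2], "C2": ["z"], "C3": [0]}), ["C1", "C2"]
    )
    rejected = False
except ValueError:
    rejected = True
```

This rechecks the whole column on every call, which is simple but can be slow for large frames that grow one row at a time.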

NOTE

pd.DataFrame([[np.nan, np.nan],
              [ np.nan, np.nan]]).duplicated()

0    False
1     True
dtype: bool

np.nan will also be flagged by duplicated. If you want to ignore np.nan, you can try selecting the non-NaN part first.
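
One way to select the non-NaN part is dropna, sketched here on a made-up column whose only repeated entries are missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [1.0, np.nan, np.nan, 2.0]})

# The second NaN counts as a duplicate of the first.
with_nan = df["value"].duplicated().any()

# Dropping NaN first leaves only 1.0 and 2.0, which are distinct.
without_nan = df["value"].dropna().duplicated().any()
```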
