Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/40680203/

Date: 2020-09-14 02:27:48  Source: igfitidea

Python Pandas subset column x values based on unique values in column y

Tags: python, pandas, indexing, subset, slice

Asked by Cole Robertson

I have a dataframe ("df") equivalent to:

   Cat   Data
    x    0.112
    x    0.112
    y    0.223
    y    0.223
    z    0.112
    z    0.112

In other words I have a category column and a data column, and the data values do not vary within values of the category column, but they may repeat themselves between different categories (i.e. the values in categories 'x' and 'z' are the same -- 0.112). This means that I need to select one data point from each category, rather than just subsetting on unique values of "Data".
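
To make the problem concrete, here is a minimal sketch that rebuilds the example frame and shows why subsetting on unique values of "Data" is not enough. The variable names `unique_data` and `n_categories` are illustrative assumptions, not part of the original question.

```python
import pandas as pd

# Rebuild the example frame from the question's table.
df = pd.DataFrame({
    "Cat": ["x", "x", "y", "y", "z", "z"],
    "Data": [0.112, 0.112, 0.223, 0.223, 0.112, 0.112],
})

# Unique values of "Data" alone merge categories x and z,
# because both carry the value 0.112.
unique_data = df["Data"].unique()

# The number of distinct categories is the number of rows actually wanted.
n_categories = df["Cat"].nunique()

print(len(unique_data), n_categories)
```

The two counts differ, so the selection has to be driven by "Cat" rather than by "Data".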

The way I've done it is like this:

    aLst = []
    bLst = []
    for i in df.index:
        if df.loc[i,'Cat'] not in aLst:
            aLst += [df.loc[i,'Cat']]
            bLst += [i]

    new_series = pd.Series(df.loc[bLst,'Data'])

Then I can do whatever I want with it. But the problem is this just seems like a clunky, un-pythonic way of doing things. Any suggestions?

Answered by jezrael

I think you need drop_duplicates:

    # by column Cat
    print(df.drop_duplicates(['Cat']))
      Cat   Data
    0   x  0.112
    2   y  0.223
    4   z  0.112

Or:

    # by columns Cat and Data
    print(df.drop_duplicates(['Cat', 'Data']))
      Cat   Data
    0   x  0.112
    2   y  0.223
    4   z  0.112
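
For completeness, a self-contained sketch of how the answer's approach could replace the loop from the question and yield the same kind of Series. It assumes the frame shown in the question. The `groupby` variant is an alternative that is not part of the original answer, and the names `first_rows` and `by_cat` are illustrative.

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({
    "Cat": ["x", "x", "y", "y", "z", "z"],
    "Data": [0.112, 0.112, 0.223, 0.223, 0.112, 0.112],
})

# Keep the first row seen for each category, then select the data column.
# The original row labels (0, 2, 4) are preserved, as in the loop version.
first_rows = df.drop_duplicates(subset="Cat", keep="first")
new_series = first_rows["Data"]

# Alternative (an assumption, not from the answer): group by category and
# take the first value. The result is indexed by category label instead.
by_cat = df.groupby("Cat")["Data"].first()

print(new_series)
print(by_cat)
```

Which form is preferable probably depends on whether the original row index or the category label is more useful as the index downstream.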