Python: Change one value based on another value in pandas

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/19226488/

Time: 2020-08-19 13:11:01  Source: igfitidea

Change one value based on another value in pandas

python pandas

Asked by Parseltongue

I'm trying to reprogram my Stata code into Python for speed improvements, and I was pointed in the direction of PANDAS. I am, however, having a hard time wrapping my head around how to process the data.

Let's say I want to iterate over all values in the column head 'ID.' If that ID matches a specific number, then I want to change two corresponding values FirstName and LastName.

In Stata it looks like this:

replace FirstName = "Matt" if ID==103
replace LastName =  "Jones" if ID==103

So this sets every value in FirstName whose row has ID == 103 to Matt.

In PANDAS, I'm trying something like this

df = read_csv("test.csv")
for i in df['ID']:
    if i ==103:
          ...

Not sure where to go from here. Any ideas?

Accepted answer by ely

One option is to use Python's slicing and indexing features to logically evaluate the places where your condition holds and overwrite the data there.

Assuming you can load your data directly into pandas with pandas.read_csv, the following code might be helpful for you.

import pandas
df = pandas.read_csv("test.csv")
df.loc[df.ID == 103, 'FirstName'] = "Matt"
df.loc[df.ID == 103, 'LastName'] = "Jones"

As mentioned in the comments, you can also do the assignment to both columns in one shot:

df.loc[df.ID == 103, ['FirstName', 'LastName']] = 'Matt', 'Jones'
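
For readers who want to try this without a CSV file, here is a minimal sketch of the same .loc assignment. The in-memory DataFrame and its values are made up for illustration and stand in for test.csv; the variable name mask is also just a convenience, not part of the original answer.

```python
import pandas as pd

# Hypothetical in-memory data standing in for test.csv
df = pd.DataFrame({
    "ID": [101, 103, 104],
    "FirstName": ["a", "b", "c"],
    "LastName": ["x", "y", "z"],
})

# Build the boolean mask once so it can be reused
mask = df["ID"] == 103

# Assign both columns for the matching row in a single .loc call
df.loc[mask, ["FirstName", "LastName"]] = "Matt", "Jones"

print(df)
```

Rows where the mask is False are left untouched, which mirrors the behaviour of Stata's "replace ... if".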

Note that you'll need pandas version 0.11 or newer to make use of loc for overwrite assignment operations.

Another way to do it is to use what is called chained assignment. The behavior of this is less stable and so it is not considered the best solution (it is explicitly discouraged in the docs), but it is useful to know about:

import pandas
df = pandas.read_csv("test.csv")
df['FirstName'][df.ID == 103] = "Matt"
df['LastName'][df.ID == 103] = "Jones"

Answer by Rutger Kassies

You can use map; it can map values from a dictionary or even a custom function.

Suppose this is your df:

    ID First_Name Last_Name
0  103          a         b
1  104          c         d

Create the dicts:

fnames = {103: "Matt", 104: "Mr"}
lnames = {103: "Jones", 104: "X"}

Then map:

df['First_Name'] = df['ID'].map(fnames)
df['Last_Name'] = df['ID'].map(lnames)

The result will be:

    ID First_Name Last_Name
0  103       Matt     Jones
1  104         Mr         X

Or use a custom function:

names = {103: ("Matt", "Jones"), 104: ("Mr", "X")}
df['First_Name'] = df['ID'].map(lambda x: names[x][0])
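
One thing to watch for: mapping with a dict yields NaN for any ID that is not a key, and the lambda version raises a KeyError instead. If only some IDs should change, a possible approach is to fill the gaps with the existing column afterwards. The sketch below assumes a small made-up frame with an extra ID 105 and a dict covering only 103; neither is from the original answer.

```python
import pandas as pd

# Hypothetical data: only ID 103 should be renamed
df = pd.DataFrame({
    "ID": [103, 104, 105],
    "First_Name": ["a", "c", "e"],
})

# 104 and 105 are deliberately absent from the dict
fnames = {103: "Matt"}

# map() gives NaN for IDs missing from the dict;
# fillna() then puts the previous value back for those rows
df["First_Name"] = df["ID"].map(fnames).fillna(df["First_Name"])

print(df)
```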

Answer by Bill Bell

This question might still be visited often enough that it's worth offering an addendum to Mr Kassies' answer. The built-in dict class can be subclassed so that a default is returned for 'missing' keys. This mechanism works well for pandas. But see below.

In this way it's possible to avoid key errors.

>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> class SurnameMap(dict):
...     def __missing__(self, key):
...         return ''
...     
>>> surnamemap = SurnameMap()
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap[x])
>>> df
    ID  Surname
0  101  Mohanty
1  201         
2  301    Drake
3  401         

The same thing can be done more simply in the following way. Using the 'default' argument of a dict object's get method makes it unnecessary to subclass dict.

>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> surnamemap = {}
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap.get(x, ''))
>>> df
    ID  Surname
0  101  Mohanty
1  201         
2  301    Drake
3  401         
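
A further variation, offered only as a sketch and not part of the answer above, is to skip the per-row lambda: pass the plain dict to map and replace the resulting NaN values with the default via fillna. It uses the same IDs and surnames as the example above.

```python
import pandas as pd

df = pd.DataFrame({"ID": [101, 201, 301, 401]})
surnamemap = {101: "Mohanty", 301: "Drake"}

# IDs missing from the dict become NaN, which fillna() turns into ''
df["Surname"] = df["ID"].map(surnamemap).fillna("")

print(df)
```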

Answer by ccpizza

The original question addresses a specific narrow use case. For those who need more generic answers here are some examples:

Creating a new column using data from other columns

Given the dataframe below:

import pandas as pd
import numpy as np

df = pd.DataFrame([['dog', 'hound', 5],
                   ['cat', 'ragdoll', 1]],
                  columns=['animal', 'type', 'age'])

In [1]: df
Out[1]:
  animal     type  age
----------------------
0    dog    hound    5
1    cat  ragdoll    1

Below we are adding a new description column as a concatenation of other columns by using the + operation, which is overridden for Series. Fancy string formatting, f-strings etc. won't work here, since the operands of + are whole Series and not plain 'primitive' Python values:

df['description'] = 'A ' + df.age.astype(str) + ' years old ' \
                    + df.type + ' ' + df.animal

In [2]: df
Out[2]:
  animal     type  age                description
-------------------------------------------------
0    dog    hound    5    A 5 years old hound dog
1    cat  ragdoll    1  A 1 years old ragdoll cat
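
If you would still rather use an f-string, one possible workaround (a sketch, not part of the original answer) is to build the strings row by row in plain Python, where each value is an ordinary scalar. The loop variable names are arbitrary.

```python
import pandas as pd

df = pd.DataFrame([["dog", "hound", 5],
                   ["cat", "ragdoll", 1]],
                  columns=["animal", "type", "age"])

# zip() walks the three columns in parallel and yields scalar values,
# so ordinary f-string formatting works for each row
df["description"] = [
    f"A {age} years old {kind} {animal}"
    for animal, kind, age in zip(df["animal"], df["type"], df["age"])
]

print(df)
```

This trades the vectorised + for a Python-level loop, so it is likely to be slower on large frames.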

We get "1 years" for the cat (instead of "1 year"), which we will fix below using conditionals.

Modifying an existing column with conditionals

Here we are replacing the original animal column with values from other columns, and using np.where to set a conditional substring based on the value of age:

# append 's' to 'year' if age is greater than 1
df.animal = df.animal + ", " + df.type + ", " + \
    df.age.astype(str) + " year" + np.where(df.age > 1, 's', '')

In [3]: df
Out[3]:
                 animal     type  age
-------------------------------------
0   dog, hound, 5 years    hound    5
1  cat, ragdoll, 1 year  ragdoll    1
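
To see the np.where step in isolation, here is a small sketch with made-up ages. The column name age_text and the extra value 12 are assumptions for illustration; wrapping the result in a Series is optional and only done here to keep the index explicit.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [5, 1, 12]})

# np.where(condition, value_if_true, value_if_false) works element-wise
# and returns a NumPy array with one entry per row
suffix = pd.Series(np.where(df["age"] > 1, "s", ""), index=df.index)

df["age_text"] = df["age"].astype(str) + " year" + suffix

print(df)
```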

Modifying multiple columns with conditionals

A more flexible approach is to call .apply() on an entire dataframe rather than on a single column:

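def transform_row(r):
    r = r.copy()  # work on a copy rather than mutating the row passed in
    # read the original values first, so later assignments don't see new ones
    animal, kind, age = r['animal'], r['type'], r['age']
    r['animal'] = 'wild ' + kind
    r['type'] = animal + ' creature'
    r['age'] = "{} year{}".format(age, 's' if age > 1 else '')
    return r

# returns a new DataFrame; df itself is left unchanged
df.apply(transform_row, axis=1)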

In[4]:
Out[4]:
         animal            type      age
----------------------------------------
0    wild hound    dog creature  5 years
1  wild ragdoll    cat creature   1 year

In the code above, the transform_row(r) function takes a Series object representing a given row (selected by axis=1; the default, axis=0, passes a Series object for each column instead). This simplifies processing, since we can access the actual 'primitive' values in the row using the column names and have visibility of other cells in the given row/column.
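
The difference between the two axis values can be checked with a short sketch. It rebuilds the original example frame and returns each Series' name attribute; the variable names per_column and per_row are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([["dog", "hound", 5],
                   ["cat", "ragdoll", 1]],
                  columns=["animal", "type", "age"])

# axis=0 (the default): the function is called once per column,
# and the Series' name is the column label
per_column = df.apply(lambda s: s.name, axis=0)

# axis=1: the function is called once per row,
# and the Series' name is that row's index label
per_row = df.apply(lambda s: s.name, axis=1)

print(per_column)
print(per_row)
```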