Iterate over a column and replace values with an extracted string [Pandas]

Question

Asked by Feyzi Bagirov

I have a dataset that looks like this:

  A   B
1 aa  1234
2 ab  3456
3 bc  [1357, 2468]
4 cc  8901
...

I need to iterate over column B and replace every value in square brackets ([]) with the leftmost four digits inside those brackets, so the dataset would look like this:

  A   B
1 aa  1234
2 ab  3456
3 bc  1357
4 cc  8901
...

I have this code:

for item in df['B']:
    if len(item) > 4:
        item_v = str(item[1:5])
        df['B'][item] = item_v
        print(df['B'][item])

This prints the truncated values. However, if I check the head of the df, it still has the old values:

   > df['B'].head()

   >  A   B
    1 aa  1234
    2 ab  3456
    3 bc  [1357, 2468]
    4 cc  8901
    ...

What am I doing wrong?

Answer 1

Accepted answer by Joe T. Boka

The easiest and fastest way is to use the pandas str.get() function and create another column for the desired results.

Solution #1: This first solution works if the values in B are integers: [1234, 3456, [1357, 2468], 8901]

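# First element of each list; plain integers become NaN
df['C'] = df['B'].str.get(0).astype(float)
# Fill the NaN rows with the original integer from B
df['C'] = df['C'].fillna(df['B'])
# astype() has no inplace argument, so assign the result back
df['C'] = df['C'].astype(int)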

Output:

    A             B     C
0  aa          1234  1234
1  ab          3456  3456
2  bc  [1357, 2468]  1357
3  cc          8901  8901

Then, you can delete column B if you don't need it.
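
As an alternative sketch for the same case, assuming column B really holds Python integers and lists (the sample frame and the helper name first_if_list below are made up for illustration), a plain apply can pick the first element without going through the .str accessor:

```python
import pandas as pd

# Hypothetical sample data mirroring the question: B mixes ints and a list
df = pd.DataFrame({
    "A": ["aa", "ab", "bc", "cc"],
    "B": [1234, 3456, [1357, 2468], 8901],
})

def first_if_list(value):
    """Return the first element of a list, or the value itself otherwise."""
    return value[0] if isinstance(value, list) else value

df["C"] = df["B"].apply(first_if_list).astype(int)
print(df)
```

This is usually slower than a vectorized call on large frames, but it makes the rule for each row explicit.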

Solution #2: This solution works if the values in B are strings: ['1234', '3456', ['1357', '2468'], '8901']

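import re
# Collect every run of digits in each value (raw string avoids an invalid-escape warning)
df['digits'] = df['B'].apply(lambda x: re.findall(r'\d+', str(x)))
# Keep only the first match from each list of matches
df['digits'] = df['digits'].str.get(0)
print(df)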

Output:

   A             B    digits
0  aa          1234   1234
1  ab          3456   3456
2  bc  [1357, 2468]   1357
3  cc          8901   8901

Again, you can delete column B if you don't need it.
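
The two steps above (find all digit runs, then take the first) can probably be collapsed into one call with str.extract, which returns the first match of a capture group. This is a sketch on assumed sample data, not part of the original answer; the values are converted to text with map(str) first so that the list row becomes a string too:

```python
import pandas as pd

# Hypothetical sample data: B mixes strings and a list of strings
df = pd.DataFrame({
    "A": ["aa", "ab", "bc", "cc"],
    "B": ["1234", "3456", ["1357", "2468"], "8901"],
})

# Convert every value to text, then pull out the first run of digits
df["digits"] = df["B"].map(str).str.extract(r"(\d+)", expand=False)

# Drop the original column if it is no longer needed
df = df.drop(columns="B")
print(df)
```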

Answer 2

Answer by Craig

In your code, you are looping over the items in column B of the dataframe, but you don't have a way to index back into your original dataframe. Specifically, the line:

df['B'][item] = item_v

doesn't do what you want. It is placing a new item in column B with an index of item. If you try it with a small dataframe, you will probably see some odd values at the end of the frame. When I try this, I get:

In[36]: df
Out[36]: 
    A     B
0  aa  1234
1  ab  3456
2  bc  1357
3  cc  8901

In[37]: df['B'][item] = item_v

In[38]: df['B']
Out[38]: 
0       1234
1       3456
2       1357
3       8901
8901    8901 <-- ???
Name: B, dtype: object

To make matters worse, this line doesn't insert the value into the dataframe where you would expect. You will only see the new element when you look at df['B']. If you look at df alone, you will see the original dataframe without the extra item.

The correct way to set elements in a dataframe is to use .loc[], like this:

df.loc[item,'B'] = item_v

This still doesn't address the original problem, which is how to get the correct index. One fix for your original code is to accumulate values for each item in column B in a list and then assign it back to column B like this:

newB = []
for item in df['B']:
    if len(item) > 4:
        item_v = str(item[1:5])
    else:
        item_v = item
    newB.append(item_v)
print(newB)
df.loc[:, 'B'] = newB
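
If you would rather keep an explicit loop and write each value in place, one possible sketch is to iterate with Series.items(), which yields (index label, value) pairs, so that .loc receives a real row label instead of the cell contents. The sample frame is an assumption reconstructed from the question, with B holding strings:

```python
import pandas as pd

# Hypothetical sample data: B holds strings, one of them bracketed
df = pd.DataFrame({
    "A": ["aa", "ab", "bc", "cc"],
    "B": ["1234", "3456", "[1357, 2468]", "8901"],
})

# items() gives the row label together with the value
for idx, item in df["B"].items():
    if len(item) > 4:
        # Skip the opening bracket and keep the next four characters
        df.loc[idx, "B"] = item[1:5]

print(df)
```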

However, pandas also offers solutions that don't require iterating directly over the items in column B.

For example, you can use .where() to replace only the strings longer than 4 characters, together with the .str functions to manipulate the text elements. This one-liner will do the job:

df.loc[:,'B'] = df['B'].where((df['B'].str.len() <= 4), df['B'].str[1:5])

This statement creates a Series that contains the item from column B if it is 4 or fewer characters, or the slice [1:5] of the item in column B if it is longer than 4 characters. This series is then assigned to replace column B in df.
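
To see the pieces of that statement separately, here is a small sketch on assumed sample data (the frame and the variable names keep and trimmed are illustrative only). where() keeps the original value wherever the condition is True and takes the replacement wherever it is False:

```python
import pandas as pd

# Hypothetical sample data: B holds strings, one of them bracketed
df = pd.DataFrame({
    "A": ["aa", "ab", "bc", "cc"],
    "B": ["1234", "3456", "[1357, 2468]", "8901"],
})

keep = df["B"].str.len() <= 4      # True where the value is already short
trimmed = df["B"].str[1:5]         # characters 1..4 of every value

# Keep short values as they are, use the slice for the long ones
df["B"] = df["B"].where(keep, trimmed)
print(df)
```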
