Original question: http://stackoverflow.com/questions/42741453/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Iterating over a column and replacing a value with an extracted string [Pandas]
Asked by Feyzi Bagirov
I have a dataset that looks like this:
    A             B
1  aa          1234
2  ab          3456
3  bc  [1357, 2468]
4  cc          8901
...
I need to iterate over column B and replace every value in square brackets ([]) with the first four digits inside those brackets, so that the dataset looks like this:
    A     B
1  aa  1234
2  ab  3456
3  bc  1357
4  cc  8901
...
I have this code:
for item in df['B']:
    if len(item) > 4:
        item_v = str(item[1:5])
        df['B'][item] = item_v
        print(df['B'][item])
This prints the truncated values. However, when I check the head of the df, it still has the old values:
> df['B'].head()
    A             B
1  aa          1234
2  ab          3456
3  bc  [1357, 2468]
4  cc          8901
...
What am I doing wrong?
Accepted answer by Joe T. Boka
The easiest and fastest way is to use the pandas str.get() function and create another column for the desired results.
Solution #1: This first solution works if the values in B are integers, e.g. [1234, 3456, [1357, 2468], 8901]
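# First element of list cells; cells that are not lists become NaN
df['C'] = df['B'].str.get(0).astype(float)
# Fill the NaN rows with the original values from B
df['C'] = df['C'].fillna(df['B'])
# astype() has no inplace option, so assign the result back
df['C'] = df['C'].astype(int)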
Output:
    A             B     C
0  aa          1234  1234
1  ab          3456  3456
2  bc  [1357, 2468]  1357
3  cc          8901  8901
Then, you can delete column B if you don't need it.
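Solution #1 can be tried end to end with a small frame. The following is only a sketch: the sample data is made up to match the question, and Series.where() is used in place of fillna() to combine the extracted element with the original value.

```python
import pandas as pd

# Hypothetical sample data: column B mixes plain integers and a list of integers
df = pd.DataFrame({
    "A": ["aa", "ab", "bc", "cc"],
    "B": [1234, 3456, [1357, 2468], 8901],
})

# .str.get(0) returns the first element of list cells and NaN for scalar cells
first = df["B"].str.get(0)

# Keep the extracted element where there is one, otherwise the original value
df["C"] = first.where(first.notna(), df["B"]).astype(int)

print(df)
```

Column B is left untouched here, so the result can be checked against the original before B is dropped.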
Solution #2: This solution works if the values in B are strings, e.g. ['1234', '3456', ['1357', '2468'], '8901']
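import re
# Collect every run of digits in each value (a raw string avoids escape warnings)
df['digits'] = df['B'].apply(lambda x: re.findall(r'\d+', str(x)))
# Keep only the first run of digits
df['digits'] = df['digits'].str.get(0)
print(df)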
Output:
    A             B digits
0  aa          1234   1234
1  ab          3456   3456
2  bc  [1357, 2468]   1357
3  cc          8901   8901
Again, you can delete column B if you don't need it.
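If the bracketed value is stored as a single string such as "[1357, 2468]" (an assumption about the data, not something the question states), the same idea can be written without apply. This sketch swaps in Series.str.extract(), which returns the first regex match per row:

```python
import pandas as pd

# Hypothetical sample data: the bracketed value is stored as one plain string
df = pd.DataFrame({
    "A": ["aa", "ab", "bc", "cc"],
    "B": ["1234", "3456", "[1357, 2468]", "8901"],
})

# Extract the first run of digits from each string;
# expand=False returns a Series because the pattern has a single group
df["digits"] = df["B"].str.extract(r"(\d+)", expand=False)

print(df)
```

The extracted values are still strings; add .astype(int) if numbers are needed.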
Answer by Craig
In your code, you are looping over the items in column B of the dataframe, but you don't have a way to index back into your original dataframe. Specifically, the line
df['B'][item] = item_v
doesn't do what you want. It places a new item in column B under an index label equal to item. If you try it with a small dataframe, you will probably see some odd values at the end of the column. When I try this, I get:
In[36]: df
Out[36]:
    A     B
0  aa  1234
1  ab  3456
2  bc  1357
3  cc  8901
In[37]: df['B'][item] = item_v
In[38]: df['B']
Out[38]:
0       1234
1       3456
2       1357
3       8901
8901    8901   <-- ???
Name: B, dtype: object
To make matters worse, this line doesn't insert the value into the dataframe where you would expect. You will only see the new element when you look at df['B']. If you look at df itself, you will see the original dataframe without the extra item.
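The effect is easiest to see on a standalone Series. The sketch below uses a made-up three-element Series in place of column B and shows that assigning with a label that does not exist appends a new row instead of changing an existing one:

```python
import pandas as pd

# Hypothetical three-element Series standing in for column B
s = pd.Series(["1234", "[1357, 2468]", "8901"])

item = s.loc[1]       # the loop variable holds the VALUE "[1357, 2468]"
item_v = item[1:5]    # "1357"

# "[1357, 2468]" is not an existing index label, so this adds a new row
s[item] = item_v

print(s)
```

The row at label 1 still holds the original string; only a fourth row with the odd label has been added.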
The correct way to set elements in a dataframe is to use .loc[], like this:
df.loc[item,'B'] = item_v
This still doesn't address the original problem, which is how to get the correct index. One fix for your original code is to accumulate the new value for each item of column B in a list and then assign the list back to column B, like this:
newB = []
for item in df['B']:
    if len(item) > 4:
        item_v = str(item[1:5])
    else:
        item_v = item
    newB.append(item_v)
print(newB)
df.loc[:, 'B'] = newB
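Another way to keep the loop, shown here only as a sketch, is to iterate with Series.items(), which yields the index label together with each value so that .loc[] can write to the right row. The sample frame is made up and assumes the values in B are strings:

```python
import pandas as pd

# Hypothetical sample data with string values and a 1-based index
df = pd.DataFrame(
    {"A": ["aa", "ab", "bc", "cc"],
     "B": ["1234", "3456", "[1357, 2468]", "8901"]},
    index=[1, 2, 3, 4],
)

# items() yields (index label, value) pairs, so the label is available for .loc
for idx, item in df["B"].items():
    if len(item) > 4:
        df.loc[idx, "B"] = item[1:5]

print(df)
```

This works, but it writes one cell at a time, so it will be slow on large frames.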
However, pandas also offers solutions that don't require iterating directly over the items in column B.
For example, you can use .where() to replace only the strings longer than 4 characters, together with the .str functions to manipulate the text elements. This one-liner will do the job:
df.loc[:,'B'] = df['B'].where((df['B'].str.len() <= 4), df['B'].str[1:5])
This statement creates a Series that contains the item from column B if it is 4 or fewer characters long, or the slice [1:5] of the item if it is longer than 4 characters. This Series is then assigned to replace column B in df.
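To try the .where() approach on its own, here is a self-contained sketch. The sample frame is an assumption based on the question, with every value in B stored as a string, and the condition is split into a named mask for readability:

```python
import pandas as pd

# Hypothetical sample data: every value in B is a string
df = pd.DataFrame({
    "A": ["aa", "ab", "bc", "cc"],
    "B": ["1234", "3456", "[1357, 2468]", "8901"],
})

# True for rows that are already short enough and should be kept unchanged
short_enough = df["B"].str.len() <= 4

# where() keeps values where the mask is True and takes the slice elsewhere
df["B"] = df["B"].where(short_enough, df["B"].str[1:5])

print(df)
```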