Original question: http://stackoverflow.com/questions/52850192/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Python. Extract last letter of a string from a Pandas column
Asked by prp
I want to store the last digit of each 'UserId' in a new variable (UserId is of type string).
I came up with this, but the df is long and it takes forever. Any tips on how to optimize it or avoid the for loop?
df['LastDigit'] = np.nan
for i in range(0,len(df['UserId'])):
    df.loc[i]['LastDigit'] = df.loc[i]['UserId'].strip()[-1]
Answered by jezrael
Use str.strip with indexing by str[-1]:
df['LastDigit'] = df['UserId'].str.strip().str[-1]
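As a minimal sketch of how this line behaves, assuming a small made-up frame (the sample values, including the stray spaces and the NaN, are illustrative and not from the question): the `.str` accessor passes missing values through, so a NaN in `UserId` should stay NaN in the result rather than raise.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the stray spaces and the NaN are illustrative.
df = pd.DataFrame({'UserId': ['user17 ', 'user42', np.nan, ' user8']})

# Strip surrounding whitespace, then take the last character of each string.
# Missing values are propagated by the .str accessor instead of raising.
df['LastDigit'] = df['UserId'].str.strip().str[-1]

print(df)
```

The result is still a column of strings. If a numeric column is needed, a conversion such as `pd.to_numeric` would be a separate step.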
If performance is important and there are no missing values, use a list comprehension:
df['LastDigit'] = [x.strip()[-1] for x in df['UserId']]
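The "no missing values" condition matters because a NaN is a float and has no `strip` method, so the plain comprehension would raise on it. One possible guarded variant, sketched here on made-up sample data (the `isinstance` check is an assumption about how you might want to handle non-strings, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data containing one missing value.
df = pd.DataFrame({'UserId': ['user17 ', np.nan, 'user42']})

# Only index into real, non-empty strings; everything else becomes NaN.
df['LastDigit'] = [
    x.strip()[-1] if isinstance(x, str) and x.strip() else np.nan
    for x in df['UserId']
]

print(df)
```

The extra check costs some of the speed advantage, so it is only worth adding when missing or non-string values can actually occur.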
Your solution is really slow; it corresponds to the last (slowest) option in this list:
6) updating an empty frame (e.g. using loc one-row-at-a-time)
Performance:
import numpy as np
import pandas as pd

np.random.seed(456)
users = ['joe','jan ','ben','rick ','clare','mary','tom']
df = pd.DataFrame({
    'UserId': np.random.choice(users, size=1000),
})
In [139]: %%timeit
...: df['LastDigit'] = np.nan
...: for i in range(0,len(df['UserId'])):
...:     df.loc[i]['LastDigit'] = df.loc[i]['UserId'].strip()[-1]
...:
__main__:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
57.9 s ± 1.48 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [140]: %timeit df['LastDigit'] = df['UserId'].str.strip().str[-1]
1.38 ms ± 150 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [141]: %timeit df['LastDigit'] = [x.strip()[-1] for x in df['UserId']]
343 μs ± 8.31 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
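The two fast variants can also be compared outside IPython. The following is a rough sketch using the standard-library `timeit` module on the same sample data; the function names `vectorized` and `comprehension` are made up for this example, and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(456)
users = ['joe', 'jan ', 'ben', 'rick ', 'clare', 'mary', 'tom']
df = pd.DataFrame({'UserId': np.random.choice(users, size=1000)})


def vectorized():
    # pandas string accessor: strip, then take the last character
    return df['UserId'].str.strip().str[-1]


def comprehension():
    # plain Python loop over the column values
    return [x.strip()[-1] for x in df['UserId']]


# Sanity check: both approaches should produce the same values.
assert vectorized().tolist() == comprehension()

for func in (vectorized, comprehension):
    seconds = timeit.timeit(func, number=100) / 100
    print(f'{func.__name__}: {seconds * 1e6:.0f} µs per call')
```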
Answered by el_Rinaldo
Another option is to use apply. It is not as performant as the list comprehension, but it is very flexible depending on your goals. Here are some timings on a random dataframe with shape (44289, 31):
%timeit df['LastDigit'] = df['UserId'].apply(lambda x: str(x)[-1]) #if some variables are not strings
12.4 ms ± 215 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df['LastDigit'] = df['UserId'].str.strip().str[-1]
31.5 ms ± 688 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['LastDigit'] = [str(x).strip()[-1] for x in df['UserId']]
9.7 ms ± 119 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
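One caveat with the `str(x)` variants: a missing value is converted to the string 'nan', so its "last character" becomes 'n' instead of staying missing. A small sketch on made-up data (the `pd.notna` guard is one possible workaround, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Hypothetical mixed column: a string, a missing value, and an integer.
s = pd.Series(['user17', np.nan, 42])

# str(np.nan) is 'nan', so the missing value silently yields 'n'.
naive = s.apply(lambda x: str(x)[-1])
print(naive.tolist())  # ['7', 'n', '2']

# Guarded variant: keep missing values missing.
guarded = s.apply(lambda x: str(x)[-1] if pd.notna(x) else np.nan)
print(guarded.tolist())
```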
Hope this helps.