How to replace the white space in a string in a pandas dataframe?
Original question: http://stackoverflow.com/questions/42462530/
Note: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Asked by katus
Suppose I have a pandas dataframe like this:
      Person_1    Person_2     Person_3
0   John Smith  Jane Smith   Mark Smith
1  Harry Jones  Mary Jones  Susan Jones
Reproducible form:
import pandas as pd

df = pd.DataFrame([['John Smith', 'Jane Smith', 'Mark Smith'],
                   ['Harry Jones', 'Mary Jones', 'Susan Jones']],
                  columns=['Person_1', 'Person_2', 'Person_3'])
What is the nicest way to replace the whitespace between the first and last name in each cell with an underscore (_), to get:
      Person_1    Person_2     Person_3
0   John_Smith  Jane_Smith   Mark_Smith
1  Harry_Jones  Mary_Jones  Susan_Jones
Thank you in advance!
Answer by miradulo
I think you could also just opt for DataFrame.replace.
df.replace(' ', '_', regex=True)
Output:
      Person_1    Person_2     Person_3
0   John_Smith  Jane_Smith   Mark_Smith
1  Harry_Jones  Mary_Jones  Susan_Jones
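The regex=True argument matters here. As a minimal sketch (the variable names unchanged and replaced are only illustrative), the following compares the call with and without it on the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame([['John Smith', 'Jane Smith', 'Mark Smith'],
                   ['Harry Jones', 'Mary Jones', 'Susan Jones']],
                  columns=['Person_1', 'Person_2', 'Person_3'])

# Without regex=True, replace only swaps cells whose entire value equals ' ',
# so none of the names change.
unchanged = df.replace(' ', '_')

# With regex=True, ' ' is treated as a pattern and every match inside
# each string is substituted.
replaced = df.replace(' ', '_', regex=True)

print(unchanged.equals(df))         # True
print(replaced.loc[0, 'Person_1'])  # John_Smith
```

Both calls return a new DataFrame and leave df itself untouched.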
Some rough benchmarking suggests that, predictably, piRSquared's NumPy solution is the fastest, at least for this small sample, followed by DataFrame.replace.
%timeit df.values[:] = np.core.defchararray.replace(df.values.astype(str), ' ', '_')
10000 loops, best of 3: 78.4 μs per loop
%timeit df.replace(' ', '_', regex=True)
1000 loops, best of 3: 932 μs per loop
%timeit df.stack().str.replace(' ', '_').unstack()
100 loops, best of 3: 2.29 ms per loop
Interestingly, however, it appears that piRSquared's pandas solution scales much better with larger DataFrames than DataFrame.replace, and even outperforms the NumPy solution.
df = pd.DataFrame([['John Smith', 'Jane Smith', 'Mark Smith']*10000,
                   ['Harry Jones', 'Mary Jones', 'Susan Jones']*10000])
%timeit df.values[:] = np.core.defchararray.replace(df.values.astype(str), ' ', '_')
10 loops, best of 3: 181 ms per loop
%timeit df.replace(' ', '_', regex=True)
1 loop, best of 3: 4.14 s per loop
%timeit df.stack().str.replace(' ', '_').unstack()
10 loops, best of 3: 99.2 ms per loop
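The %timeit lines above require IPython. A rough way to repeat the comparison in a plain script is sketched below using the standard timeit module. The helper time_approaches and the reduced frame size are assumptions made for illustration, and np.char.replace is used as the public name for the same NumPy routine. Absolute numbers will depend on the machine and on the pandas and NumPy versions, so no timings are shown.

```python
import timeit

import numpy as np
import pandas as pd


def time_approaches(frame, number=3):
    """Return average seconds per call for each approach (hypothetical helper)."""
    candidates = {
        'numpy': lambda: pd.DataFrame(
            np.char.replace(frame.values.astype(str), ' ', '_'),
            frame.index, frame.columns),
        'DataFrame.replace': lambda: frame.replace(' ', '_', regex=True),
        'stack/unstack': lambda: frame.stack().str.replace(' ', '_').unstack(),
    }
    return {name: timeit.timeit(func, number=number) / number
            for name, func in candidates.items()}


# A smaller frame than the one above (2 rows x 300 columns) so the sketch
# runs quickly; increase the multiplier to study scaling.
wide = pd.DataFrame([['John Smith', 'Jane Smith', 'Mark Smith'] * 100,
                     ['Harry Jones', 'Mary Jones', 'Susan Jones'] * 100])

for name, seconds in time_approaches(wide).items():
    print(f'{name}: {seconds * 1000:.3f} ms per call')
```

Each candidate builds a new DataFrame rather than assigning in place, so the input is identical for every run.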
Answer by Serenity
Use the replace method of the DataFrame:
df.replace(r'\s+', '_', regex=True, inplace=True)
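The pattern \s+ behaves differently from a single literal space: it matches any run of whitespace, including tabs and repeated spaces, and replaces the whole run with one underscore. A small sketch on made-up data (the frame messy is hypothetical and not part of the question):

```python
import pandas as pd

# Hypothetical messy data: a double space and a tab between the names
messy = pd.DataFrame({'Person_1': ['John  Smith', 'Harry\tJones']})

# A literal space replaces each space individually and ignores the tab.
single = messy.replace(' ', '_', regex=True)

# r'\s+' collapses any run of whitespace into a single underscore.
any_ws = messy.replace(r'\s+', '_', regex=True)

print(single['Person_1'].tolist())  # ['John__Smith', 'Harry\tJones']
print(any_ws['Person_1'].tolist())  # ['John_Smith', 'Harry_Jones']
```

The raw-string prefix (r'...') keeps Python from interpreting \s as a string escape. This sketch returns new frames instead of using inplace=True, which modifies the frame and returns None.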
Answer by piRSquared
pandas
stack / unstack with str.replace
df.stack().str.replace(' ', '_').unstack()
      Person_1    Person_2     Person_3
0   John_Smith  Jane_Smith   Mark_Smith
1  Harry_Jones  Mary_Jones  Susan_Jones
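To see why this chain works, it can help to look at the intermediate steps. The sketch below splits the one-liner into three stages; the names stacked, replaced and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([['John Smith', 'Jane Smith', 'Mark Smith'],
                   ['Harry Jones', 'Mary Jones', 'Susan Jones']],
                  columns=['Person_1', 'Person_2', 'Person_3'])

# stack() turns the frame into a single Series indexed by
# (row label, column name), which exposes the .str accessor.
stacked = df.stack()

# The vectorized string method runs once over all six values.
replaced = stacked.str.replace(' ', '_')

# unstack() pivots the inner index level back into columns.
result = replaced.unstack()

print(stacked.index.nlevels)      # 2
print(result.loc[1, 'Person_3'])  # Susan_Jones
```

The .str accessor is defined on Series rather than DataFrame, which is why the frame is reshaped first.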
numpy
pd.DataFrame(
    np.core.defchararray.replace(df.values.astype(str), ' ', '_'),
    df.index, df.columns)
      Person_1    Person_2     Person_3
0   John_Smith  Jane_Smith   Mark_Smith
1  Harry_Jones  Mary_Jones  Susan_Jones
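The same routine is also reachable through the public np.char namespace, which may be preferable on newer NumPy releases where access through np.core is discouraged. A self-contained sketch of that variant follows. It is a swapped-in spelling rather than the answer's exact code, and the intermediate names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['John Smith', 'Jane Smith', 'Mark Smith'],
                   ['Harry Jones', 'Mary Jones', 'Susan Jones']],
                  columns=['Person_1', 'Person_2', 'Person_3'])

# Convert the object array to a NumPy unicode array so that the
# element-wise string routines can operate on it.
values = df.values.astype(str)

# np.char.replace applies str.replace to every element of the array.
replaced = np.char.replace(values, ' ', '_')

# Rebuild a DataFrame with the original index and column labels.
result = pd.DataFrame(replaced, index=df.index, columns=df.columns)

print(result.loc[0, 'Person_2'])  # Jane_Smith
```

One caveat: astype(str) converts every value to text, so a missing value such as NaN becomes the string 'nan'.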
Answer by Aravinda P K
I used the code below to replace white space in multiple (specific) columns.
df[['Col1', 'Col2', 'Col3']] = df[['Col1', 'Col2', 'Col3']].replace(' ', '', regex=True)
df[['Col1', 'Col2', 'Col3']] = df[['Col1', 'Col2', 'Col3']].replace('', '', regex=True)
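A runnable sketch of the column-subset idea, on a made-up frame (the data and the extra column Other are assumptions for illustration), shows that only the selected columns are modified:

```python
import pandas as pd

# Hypothetical data: three columns to clean and one to leave alone
df = pd.DataFrame({'Col1': ['a b', 'c d'],
                   'Col2': ['e f', 'g h'],
                   'Col3': ['i j', 'k l'],
                   'Other': ['keep me', 'as is']})

cols = ['Col1', 'Col2', 'Col3']

# Select the target columns, strip the spaces, and assign the result back.
# Column labels are case-sensitive, so both sides must use the same names.
df[cols] = df[cols].replace(' ', '', regex=True)

print(df.loc[0].tolist())  # ['ab', 'ef', 'ij', 'keep me']
```

Keeping the column list in one variable avoids mismatched labels between the two sides of the assignment.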