Python: get the first letter of a string from a column
Original question: http://stackoverflow.com/questions/35552874/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Get first letter of a string from column
Asked by michalk
I'm fighting with pandas and for now I'm losing. I have a source table similar to this:
import pandas as pd
a=pd.Series([123,22,32,453,45,453,56])
b=pd.Series([234,4353,355,453,345,453,56])
df=pd.concat([a, b], axis=1)
df.columns=['First', 'Second']
I would like to add a new column to this data frame containing the first digit of each value in column 'First':
a) convert the numbers in column 'First' to strings
b) extract the first character from each newly created string
c) save the results from b as a new column in the data frame
I don't know how to apply this to a pandas data frame object. I would be grateful for any help with this.
Accepted answer by EdChum
Cast the dtype of the column to str, and you can then perform vectorised slicing by calling .str:
In [29]:
df['new_col'] = df['First'].astype(str).str[0]
df
Out[29]:
First Second new_col
0 123 234 1
1 22 4353 2
2 32 355 3
3 453 453 4
4 45 345 4
5 453 453 4
6 56 56 5
If you need to, you can cast the dtype back again by calling astype(int) on the column.
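The round trip described above can be sketched as follows. This is a minimal illustration, not part of the original answer: the column name `first_digit` is made up for the example, and it assumes non-negative integers (for a negative value the first character would be '-').

```python
import pandas as pd

df = pd.DataFrame({'First': [123, 22, 32, 453, 45, 453, 56],
                   'Second': [234, 4353, 355, 453, 345, 453, 56]})

# Convert to string, then take the first character of each value
df['new_col'] = df['First'].astype(str).str[0]

# Optionally cast the extracted character back to an integer
# ('first_digit' is a hypothetical column name used for illustration)
df['first_digit'] = df['new_col'].astype(int)

print(df)
```

Keeping the string column and the integer column side by side makes it easy to check that the cast back did what you expected.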
Answered by cs95
.str.get
This is the simplest way to do it with the string methods.
# Setup
df = pd.DataFrame({'A': ['xyz', 'abc', 'foobar'], 'B': [123, 456, 789]})
df
A B
0 xyz 123
1 abc 456
2 foobar 789
df.dtypes
A object
B int64
dtype: object
For string (read: object) type columns, use
df['C'] = df['A'].str[0]
# Similar to,
df['C'] = df['A'].str.get(0)
.str handles NaNs by returning NaN as the output.
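A small sketch of that NaN behaviour, using a made-up Series (the variable names here are illustrative only):

```python
import numpy as np
import pandas as pd

# One missing value in the middle of a string column
s = pd.Series(['xyz', np.nan, 'foobar'])

first_by_slice = s.str[0]    # vectorised indexing
first_by_get = s.str.get(0)  # equivalent method call

# The missing entry stays missing instead of raising an error
print(first_by_slice)
```

Both spellings should give the same result, so the choice between them is mostly a matter of taste.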
For non-string (for example, numeric) columns, an .astype conversion is required beforehand, as shown in @EdChum's answer.
# Note that this won't work well if the data has NaNs.
# It'll return lowercase "n"
df['D'] = df['B'].astype(str).str[0]
df
A B C D
0 xyz 123 x 1
1 abc 456 a 4
2 foobar 789 f 7
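If the numeric column may contain NaNs, one possible workaround (an assumption on my part, not something from the original answer) is to mask the result with the original missing-value positions, so that a stringified "nan" never leaks through as "n":

```python
import numpy as np
import pandas as pd

# The NaN forces this column to float64
s = pd.Series([123, 456, np.nan])

# Take the first character, then keep it only where the source value was present;
# rows that were missing in `s` become NaN in the result
first = s.astype(str).str[0].where(s.notna())

print(first)
```

Series.where keeps values where the condition is True and fills the rest with NaN, which is why the mask is built from s.notna().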
List Comprehension and Indexing
There is enough evidence to suggest that a simple list comprehension will work well here and will probably be faster.
# For string columns
df['C'] = [x[0] for x in df['A']]
# For numeric columns
df['D'] = [str(x)[0] for x in df['B']]
df
A B C D
0 xyz 123 x 1
1 abc 456 a 4
2 foobar 789 f 7
If your data has NaNs, then you will need to handle this appropriately with an if/else in the list comprehension:
import numpy as np
df2 = pd.DataFrame({'A': ['xyz', np.nan, 'foobar'], 'B': [123, 456, np.nan]})
df2
A B
0 xyz 123.0
1 NaN 456.0
2 foobar NaN
# For string columns
df2['C'] = [x[0] if isinstance(x, str) else np.nan for x in df2['A']]
# For numeric columns
df2['D'] = [str(x)[0] if pd.notna(x) else np.nan for x in df2['B']]
A B C D
0 xyz 123.0 x 1
1 NaN 456.0 NaN 4
2 foobar NaN f NaN
Let's run some timeit tests on larger data.
df_ = df.copy()
df = pd.concat([df_] * 5000, ignore_index=True)
%timeit df.assign(C=df['A'].str[0])
%timeit df.assign(D=df['B'].astype(str).str[0])
%timeit df.assign(C=[x[0] for x in df['A']])
%timeit df.assign(D=[str(x)[0] for x in df['B']])
12 ms ± 253 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
27.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.77 ms ± 110 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.84 ms ± 145 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In these timings the list comprehensions are roughly 3x faster (12 ms vs 3.77 ms for the string column, 27.1 ms vs 7.84 ms for the numeric one).
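Outside IPython, a comparison along these lines could be reproduced with the standard timeit module. This is only a sketch: the names `big`, `candidates` and `timings` are invented for the example, it times the extraction alone rather than df.assign, and absolute numbers will vary with the machine and the pandas version.

```python
import timeit
import pandas as pd

base = pd.DataFrame({'A': ['xyz', 'abc', 'foobar'], 'B': [123, 456, 789]})
big = pd.concat([base] * 5000, ignore_index=True)  # 15,000 rows

# Each candidate extracts the first character of every value in a column
candidates = {
    'str accessor (A)': lambda: big['A'].str[0],
    'list comp (A)': lambda: [x[0] for x in big['A']],
    'astype + str (B)': lambda: big['B'].astype(str).str[0],
    'list comp (B)': lambda: [str(x)[0] for x in big['B']],
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats of 10 calls each, reported as seconds per call
    timings[name] = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f'{name}: {timings[name] * 1e3:.2f} ms')
```

Taking the minimum over several repeats is a common way to reduce noise from other processes, but it is worth rerunning on your own data before relying on the ratio.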