Note: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original URL: http://stackoverflow.com/questions/22158033/

Time: 2020-09-13 21:46:02  Source: igfitidea

sub string python pandas

python string pandas substring

Asked by user3376660

I have a pandas dataframe that has a string column in it. The frame is over 2 million rows long, and looping to extract the elements I need is a poor choice. My current code looks like the following:

for i in range(len(table["series_id"])):
    table["state_code"] = table["series_id"][i][2:4]
    table["area_code"] = table["series_id"][i][5:9]
    table["supersector_code"] = table["series_id"][i][11:12]

where "series_id" is the string containing the multiple information fields I want to extract. An example data element:

columns:

 [series_id, year, month, value, footnotes]

The data:

[['SMS01000000000000001' '2006' 'M01' 1966.5 '']
 ['SMS01000000000000001' '2006' 'M02' 1970.4 '']
 ['SMS01000000000000001' '2006' 'M03' 1976.6 '']

However, series_id is the column of interest that I am struggling with. I have looked at the str.FUNCTION methods for Python, and specifically for pandas.

http://pandas.pydata.org/pandas-docs/stable/basics.html#testing-for-strings-that-match-or-contain-a-pattern

has a section describing each of the string functions; specifically, get and slice are the functions I would like to use. Ideally, I could envision a solution like so:

table["state_code"] = table["series_id"].str.get(1:3)

or

table["state_code"] = table["series_id"].str.slice(1:3)

or

table["state_code"] = table["series_id"].str.slice([1:3])

When I try the functions above, I get an invalid syntax error for the ":", and alas I cannot seem to figure out the proper way to perform a vectorized operation that takes a substring of a pandas data frame column.

Thank you

Answered by Andy Hayden

I think I would use str.extract with some regex (which you can tweak for your needs):

In [11]: s = pd.Series(["SMU78000009092000001"])

In [12]: s.str.extract(r'^.{2}(?P<state_code>.{3}).{1}(?P<area_code>\d{4}).{2}(?P<supersector_code>.{2})')
Out[12]: 
  state_code area_code supersector_code
0        U78      0000               92

This reads as: start (^) with any two characters (which are ignored); the next three characters (any) are state_code; then comes one character (ignored); then four digits, which are area_code; ...
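
As a rough sketch of how this might be applied to the whole table from the question: the sample rows and column names below are copied from the question, while the variable names `pattern`, `codes`, `by_slice`, and `by_index` are only illustrative. The last two lines show a positional alternative, assuming fixed-width fields: `str.slice` takes the start and stop as ordinary arguments, and `.str[start:stop]` accepts slice syntax directly. The offsets used there are the ones from the question's loop, which are not the same as the ones in the regex.

```python
import pandas as pd

# Sample rows and column names taken from the question.
table = pd.DataFrame(
    [
        ["SMS01000000000000001", "2006", "M01", 1966.5, ""],
        ["SMS01000000000000001", "2006", "M02", 1970.4, ""],
        ["SMS01000000000000001", "2006", "M03", 1976.6, ""],
    ],
    columns=["series_id", "year", "month", "value", "footnotes"],
)

# Same pattern as in the answer, split over two raw strings for readability.
pattern = (
    r"^.{2}(?P<state_code>.{3}).{1}"
    r"(?P<area_code>\d{4}).{2}(?P<supersector_code>.{2})"
)

# One vectorized call over the column; each named group becomes a column.
codes = table["series_id"].str.extract(pattern)

# Attach the extracted columns to the original frame (index-aligned).
table = table.join(codes)

# Positional alternative for fixed-width fields, using the question's offsets.
by_slice = table["series_id"].str.slice(2, 4)
by_index = table["series_id"].str[2:4]

print(table[["series_id", "state_code", "area_code", "supersector_code"]])
```

Whether the regex or the slice is the better fit depends on the data: the regex also checks that area_code is made of digits and yields NaN for rows that do not match, whereas the slice simply cuts at fixed positions.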