pandas substring in Python

Question

Asked by user3376660

I have a pandas DataFrame with a string column in it. The frame has over 2 million rows, so looping to extract the elements I need is a poor choice. My current code looks like the following:

for i in range(len(table["series_id"])):
    table["state_code"] = table["series_id"][i][2:4]
    table["area_code"] = table["series_id"][i][5:9]
    table["supersector_code"] = table["series_id"][i][11:12]

where "series_id" is the string containing the multiple information fields I want to extract. An example of the data:

The columns:

 [series_id, year, month, value, footnotes]

The data:

[['SMS01000000000000001' '2006' 'M01' 1966.5 '']
 ['SMS01000000000000001' '2006' 'M02' 1970.4 '']
 ['SMS01000000000000001' '2006' 'M03' 1976.6 '']

However, series_id is the column of interest that I am struggling with. I have looked at the str.FUNCTION methods for Python and specifically pandas.

http://pandas.pydata.org/pandas-docs/stable/basics.html#testing-for-strings-that-match-or-contain-a-pattern

has a section describing each of the string functions; specifically, get and slice are the functions I would like to use. Ideally, I could envision a solution like so:

table["state_code"] = table["series_id"].str.get(1:3)

or

table["state_code"] = table["series_id"].str.slice(1:3)

or

table["state_code"] = table["series_id"].str.slice([1:3])

When I try any of the calls above, I get an invalid syntax error for the ":".

Alas, I cannot figure out the proper way to perform a vectorized substring operation on a pandas DataFrame column.

Thank you

Answer 1

Answered by Andy Hayden

I think I would use str.extract with some regex (which you can tweak for your needs):

In [11]: s = pd.Series(["SMU78000009092000001"])

In [12]: s.str.extract(r'^.{2}(?P<state_code>.{3}).{1}(?P<area_code>\d{4}).{2}(?P<supersector_code>.{2})')
Out[12]: 
  state_code area_code supersector_code
0        U78      0000               92

This reads as: the string starts (^) with any two characters (which are ignored), the next three characters (of any kind) are state_code, followed by any one character (ignored), then four digits that are area_code, ...
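
To apply the same pattern to a whole column rather than a one-element Series, something like the following may work. It is only a sketch: the name `table` and the three sample rows are borrowed from the question, the `codes` and `state_by_slice` variables are made up for illustration, and the character offsets are the ones used in the regex above, which may not match the real field layout. The `.str.slice(start, stop)` call at the end is an alternative that is not part of the answer itself; it takes the bounds as separate arguments instead of the `1:3` notation.

```python
import pandas as pd

# Sample rows from the question; `table` is an illustrative name.
table = pd.DataFrame({
    "series_id": ["SMS01000000000000001"] * 3,
    "year": ["2006"] * 3,
    "month": ["M01", "M02", "M03"],
    "value": [1966.5, 1970.4, 1976.6],
})

# Same regex as in the answer, written as a raw string and split for readability.
pattern = (
    r"^.{2}(?P<state_code>.{3})"      # skip 2 chars, capture 3
    r".{1}(?P<area_code>\d{4})"       # skip 1 char, capture 4 digits
    r".{2}(?P<supersector_code>.{2})" # skip 2 chars, capture 2
)

# str.extract returns one column per named group, aligned on the index.
codes = table["series_id"].str.extract(pattern)
table = table.join(codes)

# Alternative (not from the answer): fixed-position slicing, no regex.
# Characters at positions 2, 3 and 4 correspond to the state_code group above.
state_by_slice = table["series_id"].str.slice(2, 5)

print(table[["series_id", "state_code", "area_code", "supersector_code"]])
```

Both calls operate on the whole column at once, so no Python-level loop over the 2 million rows is needed.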
