pandas: splitting a string into number and text with pandas

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/32229419/

Date: 2020-09-13 23:48:43  Source: igfitidea

split string into number and text with pandas

python, string, pandas

Asked by Paul T.

The Setup

I have a pandas DataFrame with a column 'iso' containing chemical isotope symbols, such as '4He', '16O', '197Au'. I want to label many (but not all) isotopes on a plot using the annotate() function in matplotlib. The label format should have the atomic mass in superscript. I can do this with LaTeX-style formatting:

axis.annotate('$^{4}$He', xy=(x, y), xycoords='data')

I could write dozens of annotate() statements like the one above, one for each isotope I want to label, but I'd rather automate it.

The Question

How can I extract the isotope number and name from my iso column?

With those pieces extracted I can make the labels. Let's say we dump them into the variables Num and Sym. Now I can loop over my isotopes and do something like this:

for i in list_of_isotopes:
  (Num, Sym) = df[df.iso==i].iso.str.MISSING_STRING_METHOD(???)
  axis.annotate('$^{%s}$%s' %(Num, Sym), xy=(x[Num], y[Num]), xycoords='data')

Presumably, there is a pandas string method that I can drop into the above, but I'm having trouble coming up with a solution. I've been trying split() and extract() with a few different patterns, but can't get the desired effect.

Answered by Romain

This is my answer using split. The regexp used can be improved; I'm very bad at that sort of thing :-)

(\d+) stands for the integers, and ([A-Za-z]+) stands for the strings.

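import pandas as pd

df = pd.DataFrame({'iso': ['4He', '16O', '197Au']})
# Splitting on a pattern with capture groups keeps the groups in the result,
# so columns 1 and 2 hold the number and the symbol (0 and 3 are empty strings).
result = df['iso'].str.split(r'(\d+)([A-Za-z]+)', expand=True)
result = result.loc[:, [1, 2]]
result = result.rename(columns={1: 'x', 2: 'y'})
print(result)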

Produces

     x   y
0    4  He
1   16   O
2  197  Au
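
The same pattern can also be handed to Series.str.extract, which returns one column per capture group, so no slicing or renaming is needed afterwards. This is only a minimal sketch, not part of the answer above; the group names num and sym and the column name label are picked here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'iso': ['4He', '16O', '197Au']})

# Named groups become the column names of the extracted frame.
parts = df['iso'].str.extract(r'^(?P<num>\d+)(?P<sym>[A-Za-z]+)$')

# Build the LaTeX-style label; the braces keep every digit in the superscript.
df['label'] = '$^{' + parts['num'] + '}$' + parts['sym']
print(df)
```

Rows that do not match the pattern would come back as NaN in both columns, so they are easy to spot before plotting.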

Answered by taesu

I'd use simple string manipulation, without the hassle of regex.

isotopes = ['4He', '16O', '197Au']

def get_num(isotope):
    # filter() returns an iterator in Python 3, so join the digits back into a string
    return ''.join(filter(str.isdigit, isotope))

def get_sym(isotope):
    return isotope.replace(get_num(isotope), '')

def get_num_sym(isotope):
    return (get_num(isotope), get_sym(isotope))


for isotope in isotopes:
    num, sym = get_num_sym(isotope)
    print(num, sym)

Answered by albert

To extract the number and the element of an isotope symbol you can use a regular expression (regex for short) with Python's re module. The regex looks for digits and then for letters; both are captured in named groups and can be accessed by the group's name. If the regex matches, you can extract the data and .format() the desired annotation string:

#!/usr/bin/env python3
# coding: utf-8

import re

iso_num = '16O'

preg = re.compile('^(?P<num>[0-9]*)(?P<element>[A-Za-z]*)$')
m = preg.match(iso_num)

if m:
    num = m.group('num')
    element = m.group('element')

    # doubled braces emit literal { } so multi-digit masses stay in the superscript
    note = '$^{{{}}}${}'.format(num, element)

    # axis.annotate(note, xy=(x, y), xycoords='data')
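
To label several isotopes at once, the same idea can be wrapped in a small helper. The sketch below is an illustration rather than part of the answer above: the function name isotope_label is made up here, and the pattern uses + instead of * so that a missing number or symbol counts as no match.

```python
import re

# Leading digits (mass number) followed by letters (element symbol).
ISOTOPE = re.compile(r'(?P<num>[0-9]+)(?P<element>[A-Za-z]+)')

def isotope_label(iso):
    """Return a LaTeX-style label such as '$^{16}$O', or None if iso does not match."""
    m = ISOTOPE.fullmatch(iso)
    if m is None:
        return None
    # Doubled braces emit literal { and } around the mass number.
    return '$^{{{num}}}${element}'.format(**m.groupdict())

labels = [isotope_label(iso) for iso in ['4He', '16O', '197Au']]
print(labels)
```

Each label can then be passed to axis.annotate() in the loop from the question.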

Answered by Fei Yuan

Have you tried strip()? Maybe you can consider this:

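import string

for i in list_of_isotopes:
  iso = df[df.iso==i].iso
  # .str.strip() returns a Series; take the first element to get a plain string
  Num = iso.str.strip(string.ascii_letters).iloc[0]
  Sym = iso.str.strip(string.digits).iloc[0]
  # braces keep multi-digit mass numbers in the superscript
  axis.annotate('$^{%s}$%s' %(Num, Sym), xy=(x[Num], y[Num]), xycoords='data')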
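
The strip() idea can also be applied to the whole column at once instead of filtering row by row. This is a minimal sketch on a sample frame; the column names num and sym are chosen here for illustration.

```python
import string

import pandas as pd

df = pd.DataFrame({'iso': ['4He', '16O', '197Au']})

# strip() removes the given characters from both ends of each string,
# so this relies on digits coming first and letters last.
df['num'] = df['iso'].str.strip(string.ascii_letters)
df['sym'] = df['iso'].str.strip(string.digits)
print(df)
```

Because strip() only touches the ends of a string, a symbol with digits in the middle would not be separated this way; a regex is the safer choice in that case.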