Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/31400362/

Time: 2020-08-19 09:56:36  Source: igfitidea

Using ^ to match beginning of line in Python regex

python regex

Asked by chrisk

I'm trying to extract publication years from ISI-style data exported from the Thomson-Reuters Web of Science. The line for "Publication Year" looks like this (at the very beginning of a line):

PY 2015

For the script I'm writing I have defined the following regex function:

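import re

# Read the whole export into a single string.
with open('savedrecs.txt') as f:
    wosrecords = f.read()

def findyears():
    # Matches 'PY ' followed by four digits anywhere in the text.
    result = re.findall(r'PY (\d\d\d\d)', wosrecords)
    print(result)

findyears()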

This, however, gives false positive results because the pattern may appear elsewhere in the data.

So, I want to match the pattern only at the beginning of a line. Normally I would use ^ for this purpose, but r'^PY (\d\d\d\d)' fails to match my results. On the other hand, using \n seems to do what I want, but that might lead to further complications for me.

Accepted answer by sinhayash

re.findall(r'^PY (\d\d\d\d)', wosrecords, flags=re.MULTILINE)

should work, let me know if it doesn't. I don't have your data.
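
A minimal sketch of why the flag matters, assuming a small made-up string in place of the real savedrecs.txt contents (the sample text and variable names here are hypothetical). Without re.MULTILINE, ^ anchors only at the very start of the whole string; with it, ^ also anchors right after each newline:

```python
import re

# Hypothetical sample standing in for the contents of savedrecs.txt.
sample = "TI A study citing PY 1999 data\nPY 2015\nER\nPY 2017\n"

pattern = r'^PY (\d\d\d\d)'

# Without re.MULTILINE, ^ only anchors at the very start of the string.
without_flag = re.findall(pattern, sample)

# With re.MULTILINE, ^ also anchors immediately after each newline.
with_flag = re.findall(pattern, sample, flags=re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['2015', '2017']
```

The "PY 1999" in the middle of the first line is skipped in both cases, which is the false positive the question wants to avoid.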

Answered by Wiktor Stribiżew

Use re.M (re.MULTILINE) with re.findall:

import re
p = re.compile(r'^PY\s+(\d{4})', re.M)
test_str = "PY123\nPY 2015\nPY 2017"
print(re.findall(p, test_str)) 

EXPLANATION:

  • ^ - Start of a line (due to re.M)
  • PY - Literal PY
  • \s+ - One or more whitespace characters
  • (\d{4}) - Capture group holding 4 digits
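
The breakdown above can be written into the pattern itself. This is only a sketch: the re.VERBOSE layout, the name year_re, and the test string are assumptions added for illustration, not part of the original answer.

```python
import re

# Same pattern as in the answer, spelled out with re.VERBOSE so each part
# can carry a comment. re.M makes ^ match at the start of every line.
year_re = re.compile(r"""
    ^          # start of a line (because of re.M)
    PY         # the literal characters PY
    \s+        # one or more whitespace characters
    (\d{4})    # capture group holding four digits
""", re.M | re.VERBOSE)

# Hypothetical test data, not taken from a real export.
text = "PY123\nPY 2015\nCOPY 2014\nPY   2017"

years = [m.group(1) for m in year_re.finditer(text)]
print(years)  # ['2015', '2017']
```

"PY123" is rejected because \s+ requires at least one whitespace character after PY, and "COPY 2014" is rejected because PY is not at the start of its line.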

Answered by mac13k

In this particular case there is no need to use regular expressions, because the searched string is always 'PY' and is expected to be at the beginning of the line, so one can use the string method find for this job. The find function returns the index at which the substring is found in the given string or line, so if it is found at the start of the string the returned value is 0 (-1 if not found at all), i.e.:

In [12]: 'PY 2015'.find('PY')
Out[12]: 0

In [13]: ' PY 2015'.find('PY')
Out[13]: 1

It may be a good idea to strip the surrounding whitespace first, i.e.:

In [14]: '  PY 2015'.find('PY')
Out[14]: 2

In [15]: '  PY 2015'.strip().find('PY')
Out[15]: 0

Then, if only the year is of interest, it can be extracted with split, i.e.:

In [16]: '  PY 2015'.strip().split()[1]
Out[16]: '2015'
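
Put together, the steps above might look like the following sketch. The function name find_years and the sample text are hypothetical, and startswith is swapped in for the find(...) == 0 check since the two are equivalent here. Unlike the strip() variant shown above, this version deliberately does not strip, so indented lines are not treated as matches:

```python
def find_years(text):
    """Collect the year from every line that starts with 'PY '."""
    years = []
    for line in text.splitlines():
        # startswith is equivalent to checking line.find('PY ') == 0.
        if line.startswith('PY '):
            fields = line.split()
            # Guard against a bare 'PY ' line with nothing after the tag.
            if len(fields) > 1:
                years.append(fields[1])
    return years

# Hypothetical sample standing in for the real file contents.
sample = "TI Notes on PY 1999\nPY 2015\n PY 2016\nPY 2017"
print(find_years(sample))  # ['2015', '2017']
```

If indented "PY" lines should count as well, calling line.strip() before the check, as in the answer, would accept them.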