Using ^ to match the beginning of a line in a Python regex

Question

Asked by chrisk

I'm trying to extract publication years from ISI-style data exported from the Thomson-Reuters Web of Science. The line for "Publication Year" looks like this (at the very beginning of a line):

PY 2015

For the script I'm writing I have defined the following regex function:

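import re

# Read the whole export into one string.
with open('savedrecs.txt') as f:
    wosrecords = f.read()

def findyears():
    # Matches "PY " followed by four digits anywhere in the text.
    result = re.findall(r'PY (\d\d\d\d)', wosrecords)
    print(result)

findyears()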

This, however, gives false positive results because the pattern may appear elsewhere in the data.

So, I want to match the pattern only at the beginning of a line. Normally I would use ^ for this purpose, but r'^PY (\d\d\d\d)' fails to match anything in my data. On the other hand, using \n seems to do what I want, but that might lead to further complications for me.

Answer 1

Accepted answer by sinhayash

re.findall(r'^PY (\d\d\d\d)', wosrecords, flags=re.MULTILINE)

should work, let me know if it doesn't. I don't have your data.
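To see why the flag matters, here is a minimal sketch on a made-up sample string (the `sample` and `first_line` data are assumptions, not real Web of Science output). Without re.MULTILINE, ^ matches only at the very start of the whole string; with it, ^ also matches right after each newline. The last two lines illustrate one complication of the \n workaround.

```python
import re

# Hypothetical sample text; a real savedrecs.txt export will look different.
sample = (
    "TI A title that happens to mention PY 1999\n"
    "PY 2015\n"
    "ER\n"
    "TI Another title\n"
    "PY 2017\n"
    "ER\n"
)
pattern = r'^PY (\d\d\d\d)'

unanchored = re.findall(r'PY (\d\d\d\d)', sample)
no_flag = re.findall(pattern, sample)
multiline = re.findall(pattern, sample, flags=re.MULTILINE)

print(unanchored)  # ['1999', '2015', '2017']  (includes the false positive)
print(no_flag)     # []  (the string itself starts with "TI")
print(multiline)   # ['2015', '2017']

# The \n workaround misses a record on the very first line of the input.
first_line = "PY 2001\nPY 2002\n"
newline_variant = re.findall(r'\nPY (\d\d\d\d)', first_line)
multiline_first = re.findall(pattern, first_line, flags=re.MULTILINE)
print(newline_variant)  # ['2002']
print(multiline_first)  # ['2001', '2002']
```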

Answer 2

Answered by Wiktor Stribiżew

Use re.search (or re.findall, as in this example) with re.M:

import re
p = re.compile(r'^PY\s+(\d{4})', re.M)
test_str = "PY123\nPY 2015\nPY 2017"
print(re.findall(p, test_str))

EXPLANATION:

^ - Start of a line (due to re.M)
PY - Literal PY
\s+ - 1 or more whitespace characters
(\d{4}) - Capture group holding 4 digits
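The same breakdown can be kept next to the pattern itself with re.VERBOSE, which ignores whitespace in the pattern and allows inline comments. This is only a sketch of one way to write it; the name `year_pattern` is made up for illustration.

```python
import re

year_pattern = re.compile(
    r"""
    ^        # start of a line (because of re.MULTILINE)
    PY       # literal PY
    \s+      # one or more whitespace characters
    (\d{4})  # capture group holding four digits
    """,
    re.MULTILINE | re.VERBOSE,
)

test_str = "PY123\nPY 2015\nPY 2017"
years = year_pattern.findall(test_str)
print(years)  # ['2015', '2017']
```

Note that \s also matches newlines, so a bare "PY" line followed by a line starting with four digits would match too. If that could occur in the data, [ \t]+ is a stricter alternative.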

Answer 3

Answered by mac13k

In this particular case there is no need to use regular expressions, because the searched string is always 'PY' and is expected to be at the beginning of the line, so one can use the string method find for this job. The find method returns the position at which the substring is found in the given string or line, so if it is found at the start of the string the returned value is 0 (-1 if it is not found at all), i.e.:

In [12]: 'PY 2015'.find('PY')
Out[12]: 0

In [13]: ' PY 2015'.find('PY')
Out[13]: 1

Perhaps it would be a good idea to strip the whitespace first, i.e.:

In [14]: '  PY 2015'.find('PY')
Out[14]: 2

In [15]: '  PY 2015'.strip().find('PY')
Out[15]: 0

Next, if only the year is of interest, it can be extracted with split, i.e.:

In [16]: '  PY 2015'.strip().split()[1]
Out[16]: '2015'
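Put together, the regex-free approach might look like the sketch below. The function name `find_years` and the sample text are made up for illustration, and the sketch assumes that only lines with 'PY' at column 0 should count (indented lines are skipped rather than stripped). It uses startswith, which is equivalent to checking find(...) == 0.

```python
def find_years(text):
    """Return the year from every line whose first field is the 'PY' tag."""
    years = []
    for line in text.splitlines():
        fields = line.split()
        # Require 'PY' at column 0 and as a whole field, so 'PYTHON 3' is skipped.
        if line.startswith('PY') and len(fields) >= 2 and fields[0] == 'PY':
            years.append(fields[1])
    return years

# Hypothetical sample; real Web of Science exports will differ.
sample = "TI Title with PY 1999 inside\nPY 2015\nPYTHON 3\n  PY 2016\nPY 2017\n"
print(find_years(sample))  # ['2015', '2017']
```

Unlike the regex versions, this does not check that the second field is a four-digit number; add a check such as fields[1].isdigit() if that matters for the data.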
