How to extract the substring between two markers in Python?
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/4666973/
How to extract the substring between two markers?
Asked by miernik
Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.
I only know the few characters directly before (AAA) and after (ZZZ) the part I am interested in (1234).
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
How to do the same thing in Python?
Accepted answer by eumiro
Using regular expressions (see the documentation for further reference):
import re
text = 'gfgfdAAA1234ZZZuijjk'
m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)
# found: 1234
or:
import re
text = 'gfgfdAAA1234ZZZuijjk'
try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = '' # apply your error handling
# found: 1234
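The pattern above hard-codes the markers. As a minimal sketch (the helper name between and its default parameter are hypothetical, not part of the original answer), the same search can be wrapped in a function, with re.escape guarding markers that contain regex metacharacters:

```python
import re

def between(text, start, end, default=''):
    """Return the non-empty text between the first `start` and the next `end` marker."""
    # re.escape makes markers such as '(' or '.' match literally.
    pattern = re.escape(start) + '(.+?)' + re.escape(end)
    m = re.search(pattern, text)
    return m.group(1) if m else default

print(between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
```

Returning a default instead of raising keeps the calling code free of try/except; whether that is preferable depends on how a missing marker should be treated.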
Answer by Lennart Regebro
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'
Then you can use regexps with the re module as well, if you want, but that's not necessary in your case.
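One caveat with the find-based approach: str.find returns -1 when a marker is missing, so the slice can silently yield a wrong result instead of failing. A minimal sketch of a guarded version (the function name between_find is hypothetical) that returns None in that case:

```python
def between_find(s, start_marker, end_marker):
    """Slice out the text between two markers, or return None if either is missing."""
    i = s.find(start_marker)
    if i == -1:
        return None
    start = i + len(start_marker)
    end = s.find(end_marker, start)  # search only after the start marker
    if end == -1:
        return None
    return s[start:end]

print(between_find('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
```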
Answer by infrared
import re
print(re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))
Answer by andreypopp
Answer by tzot
regular expression
import re
re.search(r"(?<=AAA).*?(?=ZZZ)", your_text).group(0)
The above as-is will fail with an AttributeError if there are no "AAA" and "ZZZ" in your_text.
string methods
your_text.partition("AAA")[2].partition("ZZZ")[0]
The above will return an empty string if "AAA" doesn't exist in your_text. If "AAA" exists but "ZZZ" doesn't, it returns everything after "AAA".
PS Python Challenge?
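To make the behaviour of the partition one-liner concrete, here is a small sketch (the helper name between_partition is hypothetical). It uses the separator element returned by str.partition to tell a missing marker apart from an empty match:

```python
def between_partition(text, start, end):
    """Return the text between the markers, or None if either marker is missing."""
    _, sep1, rest = text.partition(start)
    if not sep1:           # start marker not found
        return None
    middle, sep2, _ = rest.partition(end)
    if not sep2:           # end marker not found after the start marker
        return None
    return middle

print(between_partition('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
```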
Answer by Denis Kutlubaev
Just in case somebody has to do the same thing that I did: I had to extract everything inside parentheses in a line. For example, if I have a line like 'US president (Barack Obama) met with ...' and I want to get only 'Barack Obama', this is the solution:
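import re
line = 'US president (Barack Obama) met with ...'
regex = r'.*\((.*?)\).*'  # raw string, so \( and \) reach the regex engine as written
matches = re.search(regex, line)
line = matches.group(1) + '\n'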
I.e., you need to escape the parentheses with a backslash (\). Though this is more a question about regular expressions than about Python.
Also, in some cases you may see an 'r' prefix before the regex definition. Without the r prefix, you need to escape backslashes as in C. Here is more discussion on that.
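As a small illustration of the r prefix: a raw string and a normal string with doubled backslashes describe the same pattern. The sample sentence below is adapted from this answer with a second, made-up parenthesised group, and the use of re.findall to collect every group is an addition for illustration only.

```python
import re

raw = r'\((.*?)\)'       # raw string: backslashes reach the regex engine unchanged
plain = '\\((.*?)\\)'    # normal string: each backslash must be doubled
print(raw == plain)      # True

# Made-up sample line with two parenthesised groups.
line = 'US president (Barack Obama) met with (someone) ...'
print(re.findall(raw, line))  # ['Barack Obama', 'someone']
```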
Answer by user1810100
>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
Answer by Avinash Raj
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
You could do the same with the re.sub function using the same regex.
>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'
In basic sed, capturing groups are represented by \(..\), but in Python they are represented by (..).
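Two details of the re.sub approach are worth checking before relying on it. The sketch below uses made-up input strings: when the pattern does not match, re.sub returns the input unchanged, and because .* is greedy, the pattern picks the last AAA...ZZZ pair when the markers occur more than once.

```python
import re

pattern = r'.*AAA(.*)ZZZ.*'

print(re.sub(pattern, r'\1', 'gfgfdAAA1234ZZZuijjk'))   # 1234
print(re.sub(pattern, r'\1', 'no markers here'))        # no markers here
print(re.sub(pattern, r'\1', 'xAAA1ZZZyAAA2ZZZz'))      # 2
```

If the first pair is wanted instead, a re.search with a non-greedy group, as in the other answers, may be the simpler choice.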
Answer by Saeed Zahedian Abroodi
You can find the first substring with this function in your code (by character index). Also, you can find what is after a substring.
def FindSubString(strText, strSubString, Offset=None):
    try:
        Start = strText.find(strSubString)
        if Start == -1:
            return -1 # Not Found
        else:
            if Offset is None:
                Result = strText[Start+len(strSubString):]
            elif Offset == 0:
                return Start
            else:
                AfterSubString = Start+len(strSubString)
                Result = strText[AfterSubString:AfterSubString + int(Offset)]
            return Result
    except:
        return -1

# Example:
Text = "Thanks for contributing an answer to Stack Overflow!"
subText = "to"
print("Start of first substring in a text:")
start = FindSubString(Text, subText, 0)
print(start); print("")
print("Exact substring in a text:")
print(Text[start:start+len(subText)]); print("")
print("What is after substring \"%s\"?" %(subText))
print(FindSubString(Text, subText))

# Your answer:
Text = "gfgfdAAA1234ZZZuijjk"
subText1 = "AAA"
subText2 = "ZZZ"
AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
BeforText2 = FindSubString(Text, subText2, 0)
print("\nYour answer:\n%s" %(Text[AfterText1:BeforText2]))
Answer by MaxLZ
One-liners that return another string if there is no match.
Edit: the improved version uses the next function; replace "not-found" with something else if needed:
import re
res = next( (m.group(1) for m in [re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk" ),] if m), "not-found" )
My other method to do this is less optimal and uses a regex a second time; I still haven't found a shorter way:
import re
res = ( ( re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk") or re.search("()","") ).group(1) )
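On Python 3.8 or later, an assignment expression gives another one-liner with a fallback value. This is a sketch of an alternative, not part of the original answer:

```python
import re

text = "gfgfdAAA1234ZZZuijjk"
# := binds the match object inside the conditional expression (Python 3.8+);
# the condition is evaluated first, so m is set before m.group(1) runs.
res = m.group(1) if (m := re.search("AAA(.*?)ZZZ", text)) else "not-found"
print(res)  # 1234
```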

