Python regex split without empty string
Source: Stack Overflow, http://stackoverflow.com/questions/16840851/
License: CC BY-SA 4.0. You are free to use and share this content, but you must attribute it to the original authors on Stack Overflow.
Asked by tonga
I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the two timestamps in the middle, after the second underscore '_' and before '.txt'. So I used the following Python regex split:
time_info = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two timestamps? That is, I want:
time_info=['20111007T084734', '20111008T023142']
Accepted answer by JAB
Don't use re.split(); use the groups() method of regex Match/SRE_Match objects.
>>> import re
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you would use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
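As a rough sketch of the named-group variant described above: the group names start and end are made up for illustration and are not from the original answer, and the guard against a failed match is an addition as well.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# 'start' and 'end' are illustrative group names, not from the original answer.
match = re.search(r'[LU]_(?P<start>\w+)-(?P<end>\w+)\.', f)

# re.search returns None when nothing matches, so guard before calling groupdict().
time_info = match.groupdict() if match else {}
print(time_info)  # {'start': '20111007T084734', 'end': '20111008T023142'}
```

The dict form is mostly useful when the two values are consumed by name later on; for simple unpacking, groups() is enough.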
Answer by Elliot Bonneville
I'm no Python expert but maybe you could just remove the empty strings from your list?
str_list = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
# On Python 3, filter() returns an iterator, so wrap it in list() to get a list back
time_info = list(filter(None, str_list))
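For what it's worth, the empty strings appear because the pattern matches at the very start and the very end of the file name, and re.split reports the (empty) text on the outer side of those matches. A minimal sketch of the filtering idea, written as a list comprehension so that it gives a list on both Python 2 and Python 3; the variable name parts is just for illustration:

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# The pattern matches at the start and at the end of f, so re.split
# returns an empty string before the first match and after the last one.
parts = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
print(parts)  # ['', '20111007T084734', '20111008T023142', '']

# Keep only the non-empty pieces.
time_info = [p for p in parts if p]
print(time_info)  # ['20111007T084734', '20111008T023142']
```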
Answer by Ashwini Chaudhary
If the timestamps always come after the second _, then you can use str.split and str.strip:
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.strip(".txt").split("_",2)[-1].split("-")
['20111007T084734', '20111008T023142']
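One caveat with this approach: str.strip(".txt") removes any of the characters '.', 't' and 'x' from both ends rather than the literal suffix. It works for these file names because they start and end (before the extension) with digits. A slightly more defensive sketch, assuming the extension is always ".txt" (the names suffix and stem are illustrative):

```python
strs = "000014_L_20111007T084734-20111008T023142.txt"
suffix = ".txt"  # assumed extension

# Slice the suffix off explicitly instead of stripping a set of characters.
stem = strs[:-len(suffix)] if strs.endswith(suffix) else strs

# Split on the first two underscores only, then split the last piece on '-'.
time_info = stem.split("_", 2)[-1].split("-")
print(time_info)  # ['20111007T084734', '20111008T023142']
```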
Answer by Elazar
>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
Or, somewhat more generally:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']
Answer by PipperChip
Since this came up on Google, and for completeness, try using re.findall as an alternative!
This does require a little rethinking, but it still returns a list of strings, as split does. That makes it a nice drop-in replacement in some existing code, and it gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer, and it doesn't use re.split(). It does solve the underlying issue: your list of matches suddenly has zero-length strings in it, and you don't want that.
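A minimal sketch of the re.findall idea applied to the file names from the question. The pattern is an assumption about the timestamp format (digits, a literal T, then digits), and the lookbehind and lookahead only check the surrounding delimiters without including them in the result:

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# (?<=[_-])  the timestamp must be preceded by '_' or '-'
# \d+T\d+    assumed timestamp shape: digits, a literal 'T', digits
# (?=[-.])   the timestamp must be followed by '-' or '.'
time_info = re.findall(r'(?<=[_-])\d+T\d+(?=[-.])', f)
print(time_info)  # ['20111007T084734', '20111008T023142']
```

Because findall returns only what the pattern matched, there are no empty strings to clean up afterwards.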

