Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/16840851/

Time: 2020-08-18 23:47:25  Source: igfitidea

Python regex split without empty string

python, regex

Asked by tonga

I have the following file names that exhibit this pattern:

000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...

I want to extract the two timestamps in the middle, after the second underscore '_' and before '.txt'. So I used the following Python regex split:

time_info = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)

But this gives me two extra empty strings in the returned list:

time_info=['', '20111007T084734', '20111008T023142', '']

How do I get only the two timestamps? That is, I want:

time_info=['20111007T084734', '20111008T023142']

Accepted answer by JAB

Don't use re.split(); use the groups() method of regex Match/SRE_Match objects.

>>> import re
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')

You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
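
As a minimal sketch of the named-group variant described above, the snippet below reuses the same sample file name and the groupA/groupB names from the pattern. The None check is an addition for safety, since re.search returns None when nothing matches.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Same pattern as the groups() example, but each capturing group has a name.
match = re.search(r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.', f)

# re.search returns None on no match, so guard before calling groupdict().
time_info = match.groupdict() if match else {}
print(time_info)
# {'groupA': '20111007T084734', 'groupB': '20111008T023142'}
```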

Answer by Elliot Bonneville

I'm no Python expert but maybe you could just remove the empty strings from your list?

str_list = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
# In Python 3, filter() returns an iterator, so wrap it in list()
time_info = list(filter(None, str_list))
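
For context, the empty strings appear because the pattern matches at the very start and the very end of the file name, and re.split reports the (empty) text on the outer side of those separators. A small sketch of this, with a list comprehension as one way to drop the empties, might look like:

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# The numeric prefix and the '.txt' suffix are both treated as separators,
# so re.split returns the empty text before the first and after the last one.
parts = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
print(parts)  # ['', '20111007T084734', '20111008T023142', '']

# Keep only the non-empty pieces.
time_info = [p for p in parts if p]
print(time_info)  # ['20111007T084734', '20111008T023142']
```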

Answer by Ashwini Chaudhary

If the timestamps are always after the second _ then you can use str.split and str.strip:

>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.strip(".txt").split("_",2)[-1].split("-")
['20111007T084734', '20111008T023142']
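
One caveat worth sketching: strip(".txt") removes any of the characters '.', 't' and 'x' from both ends of the string, not the literal suffix. It works for these file names, but a variant that splits once on the last dot avoids that reliance. The name stem below is only an illustrative variable, not part of the original answer.

```python
strs = "000014_L_20111007T084734-20111008T023142.txt"

# Drop the extension by splitting once on the last dot, rather than
# relying on strip(".txt"), which strips a set of characters from both ends.
stem = strs.rsplit(".", 1)[0]

# Keep everything after the second underscore, then split the two timestamps.
time_info = stem.split("_", 2)[-1].split("-")
print(time_info)  # ['20111007T084734', '20111008T023142']
```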

Answer by Elazar

>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']

Or, somewhat more generally:

>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

Answer by PipperChip

Since this came up on Google, and for completeness, try using re.findall as an alternative!

This does require a little rethinking, but it still returns a list of matches, much like split does. This makes it a nice drop-in replacement in some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.

Yes, this is a bit of a "you're asking the wrong question" answer and doesn't use re.split(). It does solve the underlying issue: your list of matches suddenly has zero-length strings in it and you don't want that.
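
A rough sketch of the re.findall approach on the file name from the question follows. The timestamp shape (8 digits, a literal 'T', then 6 digits) is an assumption read off the sample names, and the second pattern is just one possible way to use a lookbehind and a lookahead.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Match the timestamps themselves instead of the separators around them.
# Assumed format: 8 digits, 'T', 6 digits.
time_info = re.findall(r'\d{8}T\d{6}', f)
print(time_info)  # ['20111007T084734', '20111008T023142']

# Lookaround variant: require '_' or '-' just before the match and
# '-' or '.' just after it, without including them in the result.
time_info_lookaround = re.findall(r'(?<=[_-])\d+T\d+(?=[-.])', f)
print(time_info_lookaround)  # ['20111007T084734', '20111008T023142']
```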
