Python regex split without empty strings

Question

Asked by tonga

I have the following file names that exhibit this pattern:

000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...

I want to extract the two timestamp parts in the middle, after the second underscore '_' and before '.txt'. So I used the following Python regex split:

time_info = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)

But this gives me two extra empty strings in the returned list:

time_info=['', '20111007T084734', '20111008T023142', '']

How do I get only the two timestamps? That is, I want:

time_info=['20111007T084734', '20111008T023142']

Answer 1

Accepted answer by JAB

Don't use re.split(); use the groups() method of regex Match/SRE_Match objects.

>>> import re
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')

You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
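
A minimal sketch of the named-group variant, assuming the same file name as above. The group names start and end are illustrative choices, not part of the original answer, and the None check is added because re.search returns None when nothing matches:

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# "start" and "end" are hypothetical group names chosen for this example.
pattern = re.compile(r'[LU]_(?P<start>\w+)-(?P<end>\w+)\.')

match = pattern.search(f)
if match is not None:
    # groupdict() maps each group name to the text it captured.
    time_info = match.groupdict()
    print(time_info)
else:
    time_info = None
    print('file name does not match the expected pattern')
```

Unlike re.split(), this approach never produces empty strings, because only the captured groups are returned.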

Answer 2

Answer by Elliot Bonneville

I'm no Python expert but maybe you could just remove the empty strings from your list?

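str_list = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
# filter(None, ...) drops falsy items such as ''; list() is needed on Python 3, where filter() returns an iterator
time_info = list(filter(None, str_list))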

Answer 3

Answer by Ashwini Chaudhary

If the timestamps are always after the second _ then you can use str.split and str.strip:

>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.strip(".txt").split("_",2)[-1].split("-")
['20111007T084734', '20111008T023142']
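
One caveat worth sketching: str.strip(".txt") removes any leading or trailing '.', 't' and 'x' characters rather than the literal suffix. It happens to work for the file names in the question, but not for every name. The variant below is a hedged alternative that swaps in os.path.splitext to drop the extension; the helper name split_timestamps is hypothetical and not from the original answer:

```python
import os

def split_timestamps(name):
    """Return the timestamps from a name like NNNNNN_L_<start>-<end>.txt."""
    # splitext() removes only the final extension, not individual characters.
    stem, _ext = os.path.splitext(name)
    # Everything after the second underscore, split on the hyphen.
    return stem.split("_", 2)[-1].split("-")

print(split_timestamps("000014_L_20111007T084734-20111008T023142.txt"))

# Why strip() is risky in general: it strips characters, not a suffix.
print("report.txt".strip(".txt"))  # the trailing 't' of "report" is lost too
```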

Answer 4

Answer by Elazar

>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']

or, somewhat more general:

>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

Answer 5

Answer by PipperChip

Since this came up on Google, and for completeness: try using re.findall as an alternative!

This does require a little rethinking, but it still returns a list of matches, as split does. That makes it a nice drop-in replacement in some existing code, and it gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.

Yes, this is a bit of a "you're asking the wrong question" answer, and it doesn't use re.split(). It does solve the underlying issue: your list of matches suddenly has zero-length strings in it, and you don't want that.
