python 解析srt字幕

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/2616766/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-11-04 01:04:48  来源:igfitidea点击:

Parsing srt subtitles

pythonregex

提问by Vojta Rylko

I want to parse srt subtitles:

我想解析 srt 字幕:

    1
    00:00:12,815 --> 00:00:14,509
    Chlapi, jak to jde s
    těma pracovníma světlama?.

    2
    00:00:14,815 --> 00:00:16,498
    Trochu je zesilujeme.

    3
    00:00:16,934 --> 00:00:17,814
    Jo, sleduj.

Every item into structure. With this regexs:

每个项目进入结构。使用这个正则表达式:

A:

A:

RE_ITEM = re.compile(r'(?P<index>\d+).'
    r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
    r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
    r'(?P<text>.*?)', re.DOTALL)

B:

乙:

RE_ITEM = re.compile(r'(?P<index>\d+).'
    r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
    r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
    r'(?P<text>.*)', re.DOTALL)

And this code:

而这段代码:

    for i in Subtitles.RE_ITEM.finditer(text):
    result.append((i.group('index'), i.group('start'), 
             i.group('end'), i.group('text')))

With code B I have only one item in array (because of greedy .*) and with code A I have empty 'text'because of no-greedy .*?

使用代码 BI 在数组中只有一项(因为贪婪 .*),使用代码 AI 时因为没有贪婪 .* 而有空的“文本”

How to cure this?

这个怎么治?

Thanks

谢谢

采纳答案by interjay

The text is followed by an empty line, or the end of file. So you can use:

文本后跟一个空行或文件结尾。所以你可以使用:

r' .... (?P<text>.*?)(\n\n|$)'

回答by John La Rooy

Why not use pysrt?

为什么不使用pysrt

回答by Chris Down

I became quite frustrated with srt libraries available for Python (often because they were heavyweight and eschewed language-standard types in favour of custom classes), so I've spent the last year or so working on my own srt library. You can get it at https://github.com/cdown/srt.

我对 Python 可用的 srt 库感到非常沮丧(通常是因为它们是重量级的并且避开了语言标准类型而支持自定义类),所以我在过去一年左右的时间里一直在研究我自己的 srt 库。你可以在https://github.com/cdown/srt得到它。

I tried to keep it simple and light on classes (except for the core Subtitle class, which more or less just stores the SRT block data). It can read and write SRT files, and turn noncompliant SRT files into compliant ones.

我试图在类上保持简单明了(除了核心 Subtitle 类,它或多或少只存储 SRT 块数据)。它可以读写SRT文件,将不合规的SRT文件转为合规的。

Here's a usage example with your sample input:

这是您的示例输入的用法示例:

>>> import srt, pprint
>>> gen = srt.parse('''\
... 1
... 00:00:12,815 --> 00:00:14,509
... Chlapi, jak to jde s
... těma pracovníma světlama?.
... 
... 2
... 00:00:14,815 --> 00:00:16,498
... Trochu je zesilujeme.
... 
... 3
... 00:00:16,934 --> 00:00:17,814
... Jo, sleduj.
... 
... ''')
>>> pprint.pprint(list(gen))
[Subtitle(start=datetime.timedelta(0, 12, 815000), end=datetime.timedelta(0, 14, 509000), index=1, proprietary='', content='Chlapi, jak to jde s\ntěma pracovníma světlama?.'),
 Subtitle(start=datetime.timedelta(0, 14, 815000), end=datetime.timedelta(0, 16, 498000), index=2, proprietary='', content='Trochu je zesilujeme.'),
 Subtitle(start=datetime.timedelta(0, 16, 934000), end=datetime.timedelta(0, 17, 814000), index=3, proprietary='', content='Jo, sleduj.')]

回答by interjay

splits = [s.strip() for s in re.split(r'\n\s*\n', text) if s.strip()]
regex = re.compile(r'''(?P<index>\d+).*?(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> (?P<end>\d{2}:\d{2}:\d{2},\d{3})\s*.*?\s*(?P<text>.*)''', re.DOTALL)
for s in splits:
    r = regex.search(s)
    print r.groups()

回答by bcoughlan

Here's a snippet I wrote which converts SRT files into dictionaries:

这是我编写的将 SRT 文件转换为字典的片段:

import re
def srt_time_to_seconds(time):
    split_time=time.split(',')
    major, minor = (split_time[0].split(':'), split_time[1])
    return int(major[0])*1440 + int(major[1])*60 + int(major[2]) + float(minor)/1000

def srt_to_dict(srtText):
    subs=[]
    for s in re.sub('\r\n', '\n', srtText).split('\n\n'):
        st = s.split('\n')
        if len(st)>=3:
            split = st[1].split(' --> ')
            subs.append({'start': srt_time_to_seconds(split[0].strip()),
                         'end': srt_time_to_seconds(split[1].strip()),
                         'text': '<br />'.join(j for j in st[2:len(st)])
                        })
    return subs

Usage:

用法:

import srt_to_dict
with open('test.srt', "r") as f:
        srtText = f.read()
        print srt_to_dict(srtText)

回答by Teddy

Here's some code I had lying around to parse SRT files:

这是我用来解析 SRT 文件的一些代码:

from __future__ import division

import datetime

class Srt_entry(object):
    def __init__(self, lines):
        def parsetime(string):
            hours, minutes, seconds = string.split(u':')
            hours = int(hours)
            minutes = int(minutes)
            seconds = float(u'.'.join(seconds.split(u',')))
            return datetime.timedelta(0, seconds, 0, 0, minutes, hours)
        self.index = int(lines[0])
        start, arrow, end = lines[1].split()
        self.start = parsetime(start)
        if arrow != u"-->":
            raise ValueError
        self.end = parsetime(end)
        self.lines = lines[2:]
        if not self.lines[-1]:
            del self.lines[-1]
    def __unicode__(self):
        def delta_to_string(d):
            hours = (d.days * 24) \
                    + (d.seconds // (60 * 60))
            minutes = (d.seconds // 60) % 60
            seconds = d.seconds % 60 + d.microseconds / 1000000
            return u','.join((u"%02d:%02d:%06.3f"
                              % (hours, minutes, seconds)).split(u'.'))
        return (unicode(self.index) + u'\n'
                + delta_to_string(self.start)
                + ' --> '
                + delta_to_string(self.end) + u'\n'
                + u''.join(self.lines))


srt_file = open("foo.srt")
entries = []
entry = []
for line in srt_file:
    if options.decode:
        line = line.decode(options.decode)
    if line == u'\n':
        entries.append(Srt_entry(entry))
        entry = []
    else:
        entry.append(line)
srt_file.close()