Parsing SRT subtitles in Python

Question

Asked by Vojta Rylko

I want to parse srt subtitles:

    1
    00:00:12,815 --> 00:00:14,509
    Chlapi, jak to jde s
    těma pracovníma světlama?.

    2
    00:00:14,815 --> 00:00:16,498
    Trochu je zesilujeme.

    3
    00:00:16,934 --> 00:00:17,814
    Jo, sleduj.

I want each item parsed into a structure, using one of these regexes:

A:

    RE_ITEM = re.compile(r'(?P<index>\d+).'
        r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
        r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
        r'(?P<text>.*?)', re.DOTALL)

B:

    RE_ITEM = re.compile(r'(?P<index>\d+).'
        r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
        r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
        r'(?P<text>.*)', re.DOTALL)

And this code:

    result = []
    for i in Subtitles.RE_ITEM.finditer(text):
        result.append((i.group('index'), i.group('start'),
                       i.group('end'), i.group('text')))

With regex B I get only one item in the array (because of the greedy .*), and with regex A I get an empty 'text' (because of the non-greedy .*?).
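
The behaviour described above can be reproduced with a small sketch. The two-entry sample string and the names head, lazy and greedy are made up for illustration; the patterns themselves are regexes A and B from the question.

```python
import re

# Shortened, hypothetical two-entry sample (not the original file).
sample = (
    "1\n00:00:12,815 --> 00:00:14,509\nfirst line\n\n"
    "2\n00:00:14,815 --> 00:00:16,498\nsecond line\n"
)

# Common prefix of regexes A and B.
head = (r'(?P<index>\d+).'
        r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
        r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).')

lazy = re.compile(head + r'(?P<text>.*?)', re.DOTALL)    # regex A
greedy = re.compile(head + r'(?P<text>.*)', re.DOTALL)   # regex B

# A lazy group with nothing after it is satisfied by matching nothing.
lazy_texts = [m.group('text') for m in lazy.finditer(sample)]

# With DOTALL, a greedy .* runs to the end of the input,
# swallowing every following entry.
greedy_texts = [m.group('text') for m in greedy.finditer(sample)]

print(lazy_texts)         # two matches, both with empty text
print(len(greedy_texts))  # a single match containing the second entry too
```

Regex A finds both entries but captures no text, while regex B finds one entry whose text contains the rest of the file. Either way the pattern needs something after the text group that marks where an entry ends.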

How can I fix this?

Thanks

Answer 1

Accepted answer by interjay

The text is followed by an empty line, or the end of file. So you can use:

    r' .... (?P<text>.*?)(\n\n|$)'
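
Put together with the pattern from the question, that suggestion might look like the sketch below. The function name parse_srt is hypothetical, the terminator is written as a non-capturing group so it does not add an extra group to the match, and the input is assumed to use "\n" line endings.

```python
import re

# Pattern from the question, with the lazy text group now followed by a
# terminator: a blank line or the end of the input.
RE_ITEM = re.compile(
    r'(?P<index>\d+).'
    r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
    r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
    r'(?P<text>.*?)(?:\n\n|$)',
    re.DOTALL,
)


def parse_srt(text):
    """Return a list of (index, start, end, text) tuples."""
    return [
        (m.group('index'), m.group('start'), m.group('end'), m.group('text'))
        for m in RE_ITEM.finditer(text)
    ]


sample = (
    "1\n00:00:12,815 --> 00:00:14,509\n"
    "Chlapi, jak to jde s\ntěma pracovníma světlama?.\n\n"
    "2\n00:00:14,815 --> 00:00:16,498\nTrochu je zesilujeme.\n\n"
    "3\n00:00:16,934 --> 00:00:17,814\nJo, sleduj.\n"
)

items = parse_srt(sample)
for item in items:
    print(item)
```

Files with Windows line endings would need "\r\n" normalised to "\n" first, since the terminator only looks for two consecutive "\n" characters.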

Answer 2

Answered by John La Rooy

Why not use pysrt?

Answer 3

Answered by Chris Down

I became quite frustrated with srt libraries available for Python (often because they were heavyweight and eschewed language-standard types in favour of custom classes), so I've spent the last year or so working on my own srt library. You can get it at https://github.com/cdown/srt.

I tried to keep it simple and light on classes (except for the core Subtitle class, which more or less just stores the SRT block data). It can read and write SRT files, and turn noncompliant SRT files into compliant ones.

Here's a usage example with your sample input:

    >>> import srt, pprint
    >>> gen = srt.parse('''\
    ... 1
    ... 00:00:12,815 --> 00:00:14,509
    ... Chlapi, jak to jde s
    ... těma pracovníma světlama?.
    ... 
    ... 2
    ... 00:00:14,815 --> 00:00:16,498
    ... Trochu je zesilujeme.
    ... 
    ... 3
    ... 00:00:16,934 --> 00:00:17,814
    ... Jo, sleduj.
    ... 
    ... ''')
    >>> pprint.pprint(list(gen))
    [Subtitle(start=datetime.timedelta(0, 12, 815000), end=datetime.timedelta(0, 14, 509000), index=1, proprietary='', content='Chlapi, jak to jde s\ntěma pracovníma světlama?.'),
     Subtitle(start=datetime.timedelta(0, 14, 815000), end=datetime.timedelta(0, 16, 498000), index=2, proprietary='', content='Trochu je zesilujeme.'),
     Subtitle(start=datetime.timedelta(0, 16, 934000), end=datetime.timedelta(0, 17, 814000), index=3, proprietary='', content='Jo, sleduj.')]

Answer 4

Answered by interjay

    import re

    # Split the file into blocks on blank lines, dropping empty blocks
    splits = [s.strip() for s in re.split(r'\n\s*\n', text) if s.strip()]
    regex = re.compile(r'''(?P<index>\d+).*?(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> (?P<end>\d{2}:\d{2}:\d{2},\d{3})\s*.*?\s*(?P<text>.*)''', re.DOTALL)
    for s in splits:
        r = regex.search(s)
        print(r.groups())

Answer 5

Answered by bcoughlan

Here's a snippet I wrote which converts SRT files into dictionaries:

    import re

    def srt_time_to_seconds(time):
        # time looks like 'HH:MM:SS,mmm'
        split_time = time.split(',')
        major, minor = (split_time[0].split(':'), split_time[1])
        # One hour is 3600 seconds
        return int(major[0])*3600 + int(major[1])*60 + int(major[2]) + float(minor)/1000

    def srt_to_dict(srtText):
        subs = []
        # Normalise Windows line endings, then split into blocks on blank lines
        for s in re.sub('\r\n', '\n', srtText).split('\n\n'):
            st = s.strip().split('\n')
            if len(st) >= 3:
                split = st[1].split(' --> ')
                subs.append({'start': srt_time_to_seconds(split[0].strip()),
                             'end': srt_time_to_seconds(split[1].strip()),
                             'text': '<br />'.join(st[2:])
                            })
        return subs

Usage:

    # Assumes the snippet above is saved as srt_to_dict.py
    from srt_to_dict import srt_to_dict

    with open('test.srt', "r") as f:
        srtText = f.read()
    print(srt_to_dict(srtText))

Answer 6

Answered by Teddy

Here's some code I had lying around to parse SRT files:

    import datetime


    class Srt_entry(object):
        def __init__(self, lines):
            def parsetime(string):
                hours, minutes, seconds = string.split(':')
                # SRT uses a comma as the decimal separator
                seconds = float(seconds.replace(',', '.'))
                return datetime.timedelta(hours=int(hours),
                                          minutes=int(minutes),
                                          seconds=seconds)
            self.index = int(lines[0])
            start, arrow, end = lines[1].split()
            if arrow != "-->":
                raise ValueError("expected '-->' in timing line")
            self.start = parsetime(start)
            self.end = parsetime(end)
            # Text lines keep their trailing newlines
            self.lines = lines[2:]
            if self.lines and not self.lines[-1]:
                del self.lines[-1]

        def __str__(self):
            def delta_to_string(d):
                hours = d.days * 24 + d.seconds // (60 * 60)
                minutes = (d.seconds // 60) % 60
                seconds = d.seconds % 60 + d.microseconds / 1000000
                return ("%02d:%02d:%06.3f"
                        % (hours, minutes, seconds)).replace('.', ',')
            return (str(self.index) + '\n'
                    + delta_to_string(self.start)
                    + ' --> '
                    + delta_to_string(self.end) + '\n'
                    + ''.join(self.lines))


    entries = []
    entry = []
    # Adjust the encoding to match your file; utf-8-sig also strips a BOM
    with open("foo.srt", encoding="utf-8-sig") as srt_file:
        for line in srt_file:
            if line == '\n':
                # A blank line ends the current entry
                if entry:
                    entries.append(Srt_entry(entry))
                entry = []
            else:
                entry.append(line)
    # The last entry may not be followed by a blank line
    if entry:
        entries.append(Srt_entry(entry))
