Split speech audio file on words in Python

Disclaimer: this page is a copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/36458214/

Asked by user3059201
I feel like this is a fairly common problem but I haven't yet found a suitable answer. I have many audio files of human speech that I would like to break on words, which can be done heuristically by looking at pauses in the waveform, but can anyone point me to a function/library in python that does this automatically?
Answered by Anil_M
An easier way to do this is to use the pydub module. Its recently added silence utilities do all the heavy lifting, such as setting up the silence threshold and the silence length, and simplify the code significantly compared with the other methods mentioned.
Here is a demo implementation, with inspiration drawn from here.
Setup:
I had an audio file of the spoken English letters A to Z in the file "a-z.wav". A sub-directory splitAudio was created in the current working directory. Upon executing the demo code, the file was split into 26 separate files, each audio file storing one syllable.
Observations: Some of the syllables were cut off, possibly needing modification of the following parameters: min_silence_len=500 and silence_thresh=-16.
One may want to tune these to one's own requirements (see the tuning sketch after the output below).
Demo Code:
from pydub import AudioSegment
from pydub.silence import split_on_silence

sound_file = AudioSegment.from_wav("a-z.wav")
audio_chunks = split_on_silence(sound_file,
    # must be silent for at least half a second
    min_silence_len=500,
    # consider it silent if quieter than -16 dBFS
    silence_thresh=-16
)

for i, chunk in enumerate(audio_chunks):
    out_file = ".//splitAudio//chunk{0}.wav".format(i)
    print "exporting", out_file
    chunk.export(out_file, format="wav")
Output:
Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
exporting .//splitAudio//chunk0.wav
exporting .//splitAudio//chunk1.wav
exporting .//splitAudio//chunk2.wav
exporting .//splitAudio//chunk3.wav
exporting .//splitAudio//chunk4.wav
exporting .//splitAudio//chunk5.wav
exporting .//splitAudio//chunk6.wav
exporting .//splitAudio//chunk7.wav
exporting .//splitAudio//chunk8.wav
exporting .//splitAudio//chunk9.wav
exporting .//splitAudio//chunk10.wav
exporting .//splitAudio//chunk11.wav
exporting .//splitAudio//chunk12.wav
exporting .//splitAudio//chunk13.wav
exporting .//splitAudio//chunk14.wav
exporting .//splitAudio//chunk15.wav
exporting .//splitAudio//chunk16.wav
exporting .//splitAudio//chunk17.wav
exporting .//splitAudio//chunk18.wav
exporting .//splitAudio//chunk19.wav
exporting .//splitAudio//chunk20.wav
exporting .//splitAudio//chunk21.wav
exporting .//splitAudio//chunk22.wav
exporting .//splitAudio//chunk23.wav
exporting .//splitAudio//chunk24.wav
exporting .//splitAudio//chunk25.wav
exporting .//splitAudio//chunk26.wav
>>>
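Regarding the cut-off syllables noted above: one common adjustment is to set the threshold relative to the clip's own average loudness rather than at a fixed -16 dBFS, and to keep a little padding around each chunk. Below is a minimal, untested sketch of that idea; it relies on pydub's AudioSegment.dBFS property and the keep_silence argument of split_on_silence, and the exact offsets (-14 dB, 300 ms, 100 ms) are guesses to be tuned per recording.

from pydub import AudioSegment
from pydub.silence import split_on_silence

sound_file = AudioSegment.from_wav("a-z.wav")

audio_chunks = split_on_silence(
    sound_file,
    # shorter minimum pause, so quick gaps between letters still count
    min_silence_len=300,
    # threshold relative to the clip's average loudness (dBFS is negative)
    silence_thresh=sound_file.dBFS - 14,
    # keep 100 ms of silence at both ends so onsets are not clipped
    keep_silence=100
)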
Answered by Piyush Sharma
You could look at Audiolab. It provides a decent API to convert the voice samples into numpy arrays. The Audiolab module uses the libsndfile C++ library to do the heavy lifting.

You can then parse the arrays for low values to find the pauses.
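To illustrate the idea, here is a minimal sketch of threshold-based pause detection on a sample array. It reads the WAV with scipy.io.wavfile rather than Audiolab (any reader that produces a numpy array works the same way); the file name "speech.wav", the 10 ms frame size, and the 2% threshold are placeholder assumptions to tune.

import numpy as np
from scipy.io import wavfile

rate, samples = wavfile.read("speech.wav")   # samples: 1-D int16 array for mono audio
loudness = np.abs(samples.astype(np.float32))

# a frame is "silent" if its mean amplitude falls below 2% of the clip's peak
frame_len = int(0.01 * rate)                 # 10 ms frames
n_frames = len(loudness) // frame_len
frames = loudness[:n_frames * frame_len].reshape(n_frames, frame_len)
silent = frames.mean(axis=1) < 0.02 * loudness.max()

# candidate word boundaries are the transitions between silent and non-silent frames
transitions = np.flatnonzero(np.diff(silent.astype(np.int8)))
boundaries_sec = (transitions * frame_len) / float(rate)
print(boundaries_sec)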
Answered by MonsieurBeilto
Use IBM STT. With timestamps=true you will get the word breaks, along with the times at which the system detects each word to have been spoken.
There are a lot of other cool features, like word_alternatives_threshold to get other possibilities for words, and word_confidence to get the confidence with which the system predicts each word. Set word_alternatives_threshold to between 0.1 and 0.01 to get a real idea.
This needs a sign-up, after which you can use the generated username and password.
The IBM STT is already a part of the speech_recognition module mentioned, but to get the word timestamps, you will need to modify the function.
An extracted and modified form looks like:
# Imports added for self-containedness (assumed Python 3; the original answer omitted them).
import base64
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

import speech_recognition as sr

# IBM_USERNAME and IBM_PASSWORD are assumed to be defined elsewhere (from the sign-up above).
def extracted_from_sr_recognize_ibm(audio_data, username=IBM_USERNAME, password=IBM_PASSWORD,
                                    language="en-US", show_all=False, timestamps=False,
                                    word_confidence=False, word_alternatives_threshold=0.1):
    assert isinstance(username, str), "``username`` must be a string"
    assert isinstance(password, str), "``password`` must be a string"

    flac_data = audio_data.get_flac_data(
        convert_rate=None if audio_data.sample_rate >= 16000 else 16000,  # audio samples should be at least 16 kHz
        convert_width=None if audio_data.sample_width >= 2 else 2  # audio samples should be at least 16-bit
    )
    url = "https://stream-fra.watsonplatform.net/speech-to-text/api/v1/recognize?{}".format(urlencode({
        "profanity_filter": "false",
        "continuous": "true",
        "model": "{}_BroadbandModel".format(language),
        "timestamps": "{}".format(str(timestamps).lower()),
        "word_confidence": "{}".format(str(word_confidence).lower()),
        "word_alternatives_threshold": "{}".format(word_alternatives_threshold)
    }))
    request = Request(url, data=flac_data, headers={
        "Content-Type": "audio/x-flac",
        "X-Watson-Learning-Opt-Out": "true",  # prevent requests from being logged, for improved privacy
    })
    authorization_value = base64.standard_b64encode("{}:{}".format(username, password).encode("utf-8")).decode("utf-8")
    request.add_header("Authorization", "Basic {}".format(authorization_value))

    try:
        response = urlopen(request, timeout=None)
    except HTTPError as e:
        raise sr.RequestError("recognition request failed: {}".format(e.reason))
    except URLError as e:
        raise sr.RequestError("recognition connection failed: {}".format(e.reason))
    response_text = response.read().decode("utf-8")
    result = json.loads(response_text)

    # return results
    if show_all:
        return result
    if "results" not in result or len(result["results"]) < 1 or "alternatives" not in result["results"][0]:
        raise Exception("Unknown Value Exception")

    transcription = []
    for utterance in result["results"]:
        if "alternatives" not in utterance:
            raise Exception("Unknown Value Exception. No Alternatives returned")
        for hypothesis in utterance["alternatives"]:
            if "transcript" in hypothesis:
                transcription.append(hypothesis["transcript"])
    return "\n".join(transcription)
Answered by epo3
pyAudioAnalysis can segment an audio file if the words are clearly separated (this is rarely the case in natural speech). The package is relatively easy to use:
python pyAudioAnalysis/pyAudioAnalysis/audioAnalysis.py silenceRemoval -i SPEECH_AUDIO_FILE_TO_SPLIT.mp3 --smoothing 1.0 --weight 0.3
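The same silence-removal step can also be driven from Python. Below is a sketch against the current pyAudioAnalysis API; function names have changed across releases, so treat the exact names and the "speech.wav" file as assumptions to verify against your installed version:

from pyAudioAnalysis import audioBasicIO
from pyAudioAnalysis import audioSegmentation as aS

sampling_rate, signal = audioBasicIO.read_audio_file("speech.wav")
# returns a list of [start_sec, end_sec] intervals of detected speech
segments = aS.silence_removal(signal, sampling_rate, 0.020, 0.020,
                              smooth_window=1.0, weight=0.3, plot=False)
print(segments)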
More details on my blog.