Python: What does an audio frame contain?
Notice: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/3957025/
What does an audio frame contain?
Asked by Jason94
I'm doing some research on how to compare sound files (WAV). Basically I want to compare stored sound files (WAV) with sound from a microphone. In the end I would like to pre-store some voice commands of my own, and then, when I'm running my app, compare the pre-stored files with input from the microphone.
My thought was to allow some margin when comparing, because saying something twice in a row in exactly the same way would be difficult, I guess.
After some googling I see that Python has a module named wave with a Wave_read object. That object has a method named readframes(n):
Reads and returns at most n frames of audio, as a string of bytes.
What do these bytes contain? I'm thinking of looping through the wave files one frame at a time, comparing them frame by frame.
Accepted answer by Soviut
An audio frame, or sample, contains amplitude (loudness) information at one particular point in time. To produce sound, tens of thousands of frames are played in sequence every second, and the way the amplitude changes from frame to frame produces the frequencies you hear.
In the case of CD-quality audio, or uncompressed wave audio recorded at the same settings, there are 44,100 frames/samples per second. Each of those frames has 16 bits of resolution per channel, allowing for fairly precise representations of the sound levels. Also, because CD audio is stereo, there is actually twice as much information: 16 bits for the left channel and 16 bits for the right.
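As a rough check on those figures, the following sketch works out how many bytes one frame and one second of CD-quality audio occupy. The variable names are illustrative only and do not come from any library.

```python
# CD-quality parameters as described in the answer
sample_rate = 44100      # frames per second
bits_per_sample = 16     # resolution of each sample
channels = 2             # stereo: left and right

# One frame holds one sample for every channel.
bytes_per_frame = channels * bits_per_sample // 8
bytes_per_second = sample_rate * bytes_per_frame

print(bytes_per_frame)    # 4
print(bytes_per_second)   # 176400
```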
When you use the wave module in Python to get a frame, it is returned as a string of raw bytes (often displayed with hexadecimal escapes such as \x00):
- One byte for an 8-bit mono signal.
- Two bytes for 8-bit stereo.
- Two bytes for 16-bit mono.
- Four bytes for 16-bit stereo.
In order to convert and compare these values, you'll first have to use the Python wave module's functions (for example getsampwidth() and getnchannels()) to check the bit depth and number of channels. Otherwise, you'll be comparing mismatched quality settings.
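A minimal sketch of that check, assuming 8-bit or 16-bit PCM data: it reads the sample width and channel count, then unpacks the raw bytes into integers with struct. The helper name decode_frames is hypothetical, and the use of native byte order ("=") is an assumption about how the wave module hands back 16-bit samples (on typical little-endian machines this is the same as the on-disk order). The tiny file is built in memory only to keep the example self-contained.

```python
import io
import struct
import wave

def decode_frames(wav_file):
    """Return (nchannels, sampwidth, samples) for an 8- or 16-bit PCM WAV file.

    decode_frames is a hypothetical helper, not part of the wave module.
    """
    with wave.open(wav_file, "rb") as w:
        nchannels = w.getnchannels()
        sampwidth = w.getsampwidth()          # bytes per sample
        raw = w.readframes(w.getnframes())    # raw bytes, channels interleaved
    if sampwidth == 1:
        code = "B"    # 8-bit WAV samples are unsigned
    elif sampwidth == 2:
        code = "h"    # 16-bit WAV samples are signed
    else:
        raise ValueError("unsupported sample width: %d" % sampwidth)
    count = len(raw) // sampwidth
    # "=" means native byte order (assumed to match what readframes returns)
    samples = list(struct.unpack("=%d%s" % (count, code), raw))
    return nchannels, sampwidth, samples

# Build a tiny 16-bit mono file in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(44100)
    w.writeframes(struct.pack("=4h", 0, 1000, -1000, 32767))
buf.seek(0)

nchannels, sampwidth, samples = decode_frames(buf)
print(nchannels, sampwidth, samples)   # 1 2 [0, 1000, -1000, 32767]
```

Two files should only be compared sample by sample when their channel counts, sample widths, and frame rates agree.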
Answer by Marcelo Cantos
A simple byte-by-byte comparison has almost no chance of a successful match, even with some tolerance thrown in. Voice-pattern recognition is a very complex and subtle problem that is still the subject of much research.
Answer by Konrad Höffner
The first thing you should do is a Fourier transform to convert the data into its frequencies. That is rather complex, however. I wouldn't use voice recognition libraries here, as it sounds like you don't record only voices. You would then try different time shifts (in case the sounds are not exactly aligned) and use the one that gives you the best similarity, where you have to define a similarity function yourself. Oh, and you should normalize both signals (same maximum loudness).
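The normalization and time-shift steps can be sketched as follows. This is only an illustration under stated assumptions: it works on raw samples in the time domain and leaves out the Fourier transform step, it uses cosine similarity as one possible similarity function, and the names normalize, similarity, and best_shift are hypothetical.

```python
import math

def normalize(signal):
    """Scale a signal so that its largest absolute value is 1.0."""
    peak = max(abs(x) for x in signal)
    if peak == 0:
        return [0.0 for _ in signal]
    return [x / peak for x in signal]

def similarity(a, b):
    """Cosine similarity of two equal-length signals, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_shift(reference, candidate, max_shift):
    """Try every shift in [-max_shift, max_shift]; return the best (shift, score).

    A negative shift means the candidate starts later than the reference.
    """
    best = (0, -1.0)
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            a, b = reference[shift:], candidate
        else:
            a, b = reference, candidate[-shift:]
        n = min(len(a), len(b))
        if n == 0:
            continue
        score = similarity(a[:n], b[:n])
        if score > best[1]:
            best = (shift, score)
    return best

# Synthetic test data: the same tone, but quieter and delayed by 3 samples.
tone = [math.sin(2 * math.pi * k / 16) for k in range(64)]
reference = [0.0] * 10 + tone + [0.0] * 10
candidate = [0.0] * 13 + [0.5 * x for x in tone] + [0.0] * 7

shift, score = best_shift(normalize(reference), normalize(candidate), max_shift=8)
print(shift, round(score, 3))
```

Trying every shift this way costs time proportional to the signal length times the number of shifts, so it is only practical for short clips and small shift ranges.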
Answer by bobobobo
I believe the accepted description to be slightly incorrect.
A frame appears to be somewhat like stride in graphics formats. For interleaved stereo @ 16 bits/sample, the frame size is 2*sizeof(short) = 4 bytes. For non-interleaved stereo @ 16 bits/sample, the samples of the left channel are all one after another, so the frame size is just sizeof(short).
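For the interleaved case, the relationship between frames and samples can be sketched like this. It assumes 16-bit samples laid out as L R L R ..., and split_channels is a hypothetical helper written for this illustration.

```python
import struct

def split_channels(raw, nchannels, sampwidth=2):
    """De-interleave raw 16-bit frames into one list of samples per channel."""
    if sampwidth != 2:
        raise ValueError("this sketch only handles 16-bit samples")
    frame_size = nchannels * sampwidth      # the "stride" of one frame in bytes
    if len(raw) % frame_size:
        raise ValueError("data does not contain a whole number of frames")
    samples = struct.unpack("=%dh" % (len(raw) // sampwidth), raw)
    # Every nchannels-th sample belongs to the same channel.
    return [list(samples[ch::nchannels]) for ch in range(nchannels)]

# Three stereo frames, interleaved as L R L R L R.
raw = struct.pack("=6h", 1, -1, 2, -2, 3, -3)
frame_count = len(raw) // (2 * 2)
left, right = split_channels(raw, nchannels=2)

print(frame_count)   # 3
print(left)          # [1, 2, 3]
print(right)         # [-1, -2, -3]
```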

