Python: Detect duplicate MP3 files with different bitrates and/or different ID3 tags?

Notice: this page is a side-by-side Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/476227/

Time: 2020-11-03 20:12:01  Source: igfitidea

Detect duplicate MP3 files with different bitrates and/or different ID3 tags?

python file mp3 duplicates id3

Asked by tzot

How can I detect (preferably with Python) duplicate MP3 files that may be encoded at different bitrates (but are the same song) and whose ID3 tags may be incorrect?

I know I can compute an MD5 checksum of the files' content, but that won't work for different bitrates. And I don't know whether ID3 tags influence the MD5 checksum. Should I re-encode MP3 files that have a different bitrate and then compute the checksum? What do you recommend?

Answered by tzot

This is the exact same question that people at the old AudioScrobbler, and currently at MusicBrainz, have been working on for a long time. For the time being, the Python project that can aid in your quest is Picard, which will tag audio files (not only MPEG 1 Layer 3 files) with a GUID (actually, several of them). From then on, matching the tags is quite simple.

If you prefer to do it as a project of your own, libofa might be of help.

Answered by Benjamin Wohlwend

As the others said, simple checksums won't detect duplicates with different bitrates or ID3 tags. What you need is an audio fingerprint algorithm. The Python Audioprocessing Suite has such an algorithm, but I can't say anything about how reliable it is.

http://rudd-o.com/new-projects/python-audioprocessing

Answered by Francois G

For tag issues, Picard may indeed be a very good bet. If, having identified two potentially duplicate files, you want to extract bitrate information from them, have a look at mp3guessenc.

Answered by Douglas Leeder

I don't think simple checksums will ever work:

  1. ID3 tags will affect the md5
  2. Different encoders will encode the same song different ways - so the checksums will be different
  3. Different bit-rates will produce different checksums
  4. Re-encoding an mp3 to a different bit-rate will probably sound terrible and will certainly be different to the original audio compressed in one step.

I think you'll have to compare ID3 tags, song length, and filenames.
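
Point 1 in the list above can be checked with a small experiment. The following is a minimal sketch, assuming the tags follow the standard ID3v2 layout (a 10-byte header at the start of the file with a syncsafe size field) and ID3v1 layout (a 128-byte block at the end starting with "TAG"). The names `strip_id3` and `audio_md5` are made up for this illustration. It hashes only the bytes between the tags, so two copies of the same encoded stream with different tags get the same checksum. It does nothing for points 2 to 4: files produced by different encoders or at different bitrates will still hash differently.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return the file bytes without a leading ID3v2 tag and a trailing ID3v1 tag."""
    start, end = 0, len(data)

    # ID3v2 header: "ID3", version (2 bytes), flags (1 byte), syncsafe size (4 bytes).
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)  # 7 useful bits per byte
        start = 10 + size
        if data[5] & 0x10:  # footer flag (ID3v2.4) adds another 10 bytes
            start += 10

    # ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128

    return data[start:end]


def audio_md5(data: bytes) -> str:
    """MD5 of the audio payload only, ignoring ID3 tags."""
    return hashlib.md5(strip_id3(data)).hexdigest()


# Synthetic demonstration: the same fake "audio" bytes with and without tags.
audio = b"\xff\xfb\x90\x00" + bytes(range(256))
id3v2 = b"ID3\x03\x00\x00" + b"\x00\x00\x00\x14" + b"\x00" * 20  # 20-byte tag body
id3v1 = b"TAG" + b"\x00" * 125
tagged = id3v2 + audio + id3v1

print(hashlib.md5(audio).hexdigest() == hashlib.md5(tagged).hexdigest())  # whole-file hash
print(audio_md5(audio) == audio_md5(tagged))                              # payload-only hash
```

For a real file, pass the result of `open(path, "rb").read()` to `audio_md5`. Other tag formats (for example APEv2) are not handled by this sketch.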

Answered by James McMahon

Re-encoding at the same bit rate won't work. In fact, it may make things worse: transcoding (which is what re-encoding at a different bitrate is called) changes the nature of the compression, and recompressing an already compressed file is going to lead to a significantly different file.

This is a little out of my league, but I would approach the problem by looking at the wave pattern of the MP3, either by converting the MP3 to an uncompressed .wav or maybe by just running the analysis on the MP3 file itself. There should be a library out there for this. Just a word of warning: this is an expensive operation.
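
To make the "compare the wave pattern" idea concrete, here is a deliberately simplified sketch. It is not how real fingerprinting libraries work (those use spectral features); it only tracks whether loudness rises or falls from one frame to the next. It assumes both clips are already decoded to lists of PCM samples and start at the same point in the song. The function names, the frame size of 1024, and the synthetic test signals are all made up for this illustration, and the added noise is only a crude stand-in for lossy-encoding differences.

```python
import math
import random


def energy_fingerprint(samples, frame_size=1024):
    """Return one bit per pair of adjacent frames: 1 if energy rose, else 0."""
    energies = [
        sum(s * s for s in samples[i:i + frame_size])
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    return [1 if later > earlier else 0
            for earlier, later in zip(energies, energies[1:])]


def similarity(fp_a, fp_b):
    """Fraction of matching bits over the shorter fingerprint (0.0 to 1.0)."""
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        return 0.0
    return sum(a == b for a, b in zip(fp_a, fp_b)) / n


def synth(envelope, frame_size=1024, noise=0.0, seed=0):
    """Synthetic stand-in for decoded PCM: a tone whose loudness follows `envelope`."""
    rng = random.Random(seed)
    samples = []
    for level in envelope:
        for j in range(frame_size):
            tone = math.sin(2 * math.pi * 8 * j / frame_size)
            samples.append(level * tone + rng.uniform(-noise, noise))
    return samples


# Per-frame loudness of two made-up "songs".
song = [0.2, 0.8, 0.5, 0.9, 0.3, 0.7, 0.4, 1.0, 0.6]
other = [0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 1.0, 0.5, 0.6]

clean = energy_fingerprint(synth(song))
lossy = energy_fingerprint(synth(song, noise=0.01, seed=1))  # same song, slightly degraded
different = energy_fingerprint(synth(other))

print(similarity(clean, lossy))      # same song: expected to be high
print(similarity(clean, different))  # different song: expected to be low
```

On real music, a loudness-only fingerprint like this would produce many false matches, which is why dedicated fingerprinting tools are the more practical route.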

Another idea: use ReplayGain to scan the files. If they are the same song, they should be tagged with the same gain. This will only work on the exact same song from the exact same album. I know of several cases where reissues are remastered at a higher volume, thus changing the ReplayGain value.

EDIT:
You might want to check out http://www.speech.kth.se/snack/, which apparently can do spectrogram visualization. I imagine any library that can visualize a spectrogram can help you compare them.

This link from the official Python page may also be helpful.

Answered by lollercoaster

The Dejavu project is written in Python and does exactly what you are looking for.

https://github.com/worldveil/dejavu

It also supports many common formats (.wav, .mp3, etc) as well as finding the time offset of the clip in the original audio track.

Answered by PeterCo

You can use the successor to PUID and MusicBrainz, called AcoustID:

AcoustID is an open source project that aims to create a free database of audio fingerprints with mapping to the MusicBrainz metadata database and provide a web service for audio file identification using this database...

...fingerprints along with some metadata necessary to identify the songs to the AcoustID database...

You will find various client libraries and examples for the webservice at https://acoustid.org/
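
As a rough illustration of how such a lookup could be used to find duplicates, the sketch below groups files by the recording ID the service returns. It assumes the third-party pyacoustid package (which needs Chromaprint or the fpcalc tool installed) and an AcoustID API key. The helper names and the 0.8 score threshold are made up for this example and are not part of any library. The grouping logic is kept separate from the network call so it can be tried with fake results.

```python
from collections import defaultdict


def best_recording(results, min_score=0.8):
    """Pick the highest-scoring recording ID from (score, id, title, artist) tuples."""
    best = None
    for score, recording_id, _title, _artist in results:
        if score >= min_score and (best is None or score > best[0]):
            best = (score, recording_id)
    return best[1] if best else None


def group_by_recording(paths, lookup, min_score=0.8):
    """Return {recording_id: [paths]} for recordings matched by more than one file."""
    groups = defaultdict(list)
    for path in paths:
        recording_id = best_recording(lookup(path), min_score)
        if recording_id is not None:
            groups[recording_id].append(path)
    return {rid: files for rid, files in groups.items() if len(files) > 1}


def acoustid_lookup(api_key):
    """Build a lookup function backed by pyacoustid (assumed to be installed)."""
    import acoustid  # third-party: pip install pyacoustid
    # acoustid.match fingerprints the file and queries the web service.
    return lambda path: acoustid.match(api_key, path)


# Offline demonstration with hand-written results instead of real lookups.
fake_results = {
    "song_128k.mp3": [(0.95, "rec-1", "Title", "Artist")],
    "song_320k.mp3": [(0.91, "rec-1", "Title", "Artist"), (0.50, "rec-9", "X", "Y")],
    "other.mp3": [(0.97, "rec-2", "Other", "Artist")],
    "unknown.mp3": [],
}
print(group_by_recording(list(fake_results), fake_results.get))
```

With a real key, `group_by_recording(paths, acoustid_lookup("YOUR_API_KEY"))` would replace the fake lookup. Files the database does not know are simply left out of the result.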

Answered by Menda

I'm looking for something similar and I found this:
http://www.lastfm.es/user/nova77LF/journal/2007/10/12/4kaf_fingerprint_(command_line)_client

Hope it helps.

Answered by splicer

I'd use length as my primary heuristic. That's what iTunes does when it's trying to identify a CD using the Gracenote database. Measure the lengths in milliseconds rather than seconds. Remember, this is only a heuristic: you should definitely listen to any detected duplicates before deleting them.
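
The length heuristic could be sketched roughly as follows. The function name and the 500 ms tolerance are assumptions made for this example, and the durations are hard-coded so the snippet runs on its own. In practice they would come from a tag or audio library (for instance, mutagen exposes a track's length in seconds as `MP3(path).info.length`, which would need converting to milliseconds).

```python
def group_by_length(tracks, tolerance_ms=500):
    """Group (name, duration_ms) pairs whose sorted durations are close together.

    A track joins the current group when its duration is within tolerance_ms
    of the previous track in sorted order. Groups of one are dropped.
    """
    groups, current = [], []
    for name, duration in sorted(tracks, key=lambda t: t[1]):
        if current and duration - current[-1][1] > tolerance_ms:
            if len(current) > 1:
                groups.append([n for n, _ in current])
            current = []
        current.append((name, duration))
    if len(current) > 1:
        groups.append([n for n, _ in current])
    return groups


# Hypothetical durations in milliseconds.
tracks = [
    ("song_128k.mp3", 215040),
    ("song_320k.mp3", 215066),
    ("other.mp3", 187300),
    ("long.mp3", 301500),
]
print(group_by_length(tracks))
```

Because neighbours are compared pairwise, a long run of tracks with slowly increasing lengths would be chained into one group, so each group is only a list of candidates to check by ear.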
