Java HDFS File Checksum
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must license it the same way and attribute it to the original authors (not me): StackOverflow
Original link: http://stackoverflow.com/questions/14563245/
HDFS File Checksum
Asked by pradeep
I am trying to check the consistency of a file after copying it to HDFS, using the Hadoop API method DFSClient.getFileChecksum().
I am getting the following output from the code below:
Null
HDFS : null
Local : null
Can anyone point out the error? Here is the code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class fileCheckSum {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub
        Configuration conf = new Configuration();
        FileSystem hadoopFS = FileSystem.get(conf);
        // Path hdfsPath = new Path("/derby.log");
        LocalFileSystem localFS = LocalFileSystem.getLocal(conf);
        // Path localPath = new Path("file:///home/ubuntu/derby.log");
        // System.out.println("HDFS PATH : "+hdfsPath.getName());
        // System.out.println("Local PATH : "+localPath.getName());
        FileChecksum hdfsChecksum = hadoopFS.getFileChecksum(new Path("/derby.log"));
        FileChecksum localChecksum = localFS.getFileChecksum(new Path("file:///home/ubuntu/derby.log"));
        if (null != hdfsChecksum || null != localChecksum) {
            System.out.println("HDFS Checksum : " + hdfsChecksum.toString() + "\t" + hdfsChecksum.getLength());
            System.out.println("Local Checksum : " + localChecksum.toString() + "\t" + localChecksum.getLength());
            if (hdfsChecksum.toString().equals(localChecksum.toString())) {
                System.out.println("Equal");
            } else {
                System.out.println("UnEqual");
            }
        } else {
            System.out.println("Null");
            System.out.println("HDFS : " + hdfsChecksum);
            System.out.println("Local : " + localChecksum);
        }
    }
}
Answered by omid
Since you aren't setting a remote address on the conf and are essentially using the same configuration, both hadoopFS and localFS are pointing to an instance of LocalFileSystem.
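One quick way to confirm this (a sketch, not from the original answer) is to print the concrete class that FileSystem.get(conf) actually resolved to:

System.out.println(hadoopFS.getClass().getName());
// prints org.apache.hadoop.fs.LocalFileSystem when no remote address is configured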
getFileChecksum isn't implemented for LocalFileSystem and returns null. It should work for DistributedFileSystem though: if your conf points to a distributed cluster, FileSystem.get(conf) returns an instance of DistributedFileSystem, whose checksum is an MD5 of MD5s of the CRC32 checksums of chunks of bytes.per.checksum size. This value depends on the block size and on the cluster-wide bytes.per.checksum config. That's why these two params are also encoded in the return value of the distributed checksum, as the name of the algorithm: MD5-of-xxxMD5-of-yyyCRC32, where xxx is the number of CRC checksums per block and yyy is the bytes.per.checksum parameter.
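As a minimal sketch of that setup (assuming a NameNode reachable at hdfs://namenode:8020; substitute your own cluster address):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DistributedChecksumExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the cluster; on older releases the key is fs.default.name
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);
        FileChecksum checksum = fs.getFileChecksum(new Path("/derby.log"));
        // getAlgorithmName() exposes the MD5-of-xxxMD5-of-yyyCRC32 name described above
        System.out.println(checksum.getAlgorithmName() + " : " + checksum);
    }
}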
The getFileChecksum isn't designed to be comparable across filesystems. Although it's possible to simulate the distributed checksum locally, or to hand-craft map-reduce jobs that calculate equivalents of local hashes, I suggest relying on Hadoop's own integrity checks, which happen whenever a file gets written to or read from Hadoop.
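For what it's worth, simply reading the file back through the FileSystem API exercises those built-in checks: Hadoop verifies the stored CRC32 checksums transparently on read and raises a ChecksumException if a block is corrupt. A minimal sketch, reusing the fs handle from the example above:

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.IOUtils;

try (FSDataInputStream in = fs.open(new Path("/derby.log"))) {
    // Streams the file to stdout; a ChecksumException here would indicate corruption
    IOUtils.copyBytes(in, System.out, 4096, false);
}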
Answered by Ravi Shankar
Try this. Here I calculate the MD5 of both the local file and the HDFS file, then compare the two checksums for equality. Hope this helps.
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;
import java.security.MessageDigest;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public static void compareChecksumForLocalAndHdfsFile(String sourceHdfsFilePath, String sourceLocalFilepath,
        Map<String, String> hdfsConfigMap) throws Exception {
    // Constants.USERNAME is the poster's own key for the HDFS user name in hdfsConfigMap
    System.setProperty("HADOOP_USER_NAME", hdfsConfigMap.get(Constants.USERNAME));
    System.setProperty("hadoop.home.dir", "/tmp");
    Configuration hdfsConfig = new Configuration();
    hdfsConfig.set(Constants.USERNAME, hdfsConfigMap.get(Constants.USERNAME));
    hdfsConfig.set("fsURI", hdfsConfigMap.get("fsURI"));
    FileSystem hdfs = FileSystem.get(new URI(hdfsConfigMap.get("fsURI")), hdfsConfig);
    Path inputPath = new Path(hdfsConfigMap.get("fsURI") + "/" + sourceHdfsFilePath);
    InputStream is = hdfs.open(inputPath);
    String localChecksum = getMD5Checksum(new FileInputStream(sourceLocalFilepath));
    String hdfsChecksum = getMD5Checksum(is);
    if (null != hdfsChecksum && null != localChecksum) {
        System.out.println("HDFS Checksum : " + hdfsChecksum + "\t" + hdfsChecksum.length());
        System.out.println("Local Checksum : " + localChecksum + "\t" + localChecksum.length());
        if (hdfsChecksum.equals(localChecksum)) {
            System.out.println("Equal");
        } else {
            System.out.println("UnEqual");
        }
    } else {
        System.out.println("Null");
        System.out.println("HDFS : " + hdfsChecksum);
        System.out.println("Local : " + localChecksum);
    }
}

public static byte[] createChecksum(InputStream in) throws Exception {
    byte[] buffer = new byte[1024];
    MessageDigest complete = MessageDigest.getInstance("MD5");
    int numRead;
    do {
        numRead = in.read(buffer);
        if (numRead > 0) {
            complete.update(buffer, 0, numRead);
        }
    } while (numRead != -1);
    in.close();
    return complete.digest();
}

// see this How-to for a faster way to convert
// a byte array to a HEX string
public static String getMD5Checksum(InputStream in) throws Exception {
    // Takes an InputStream (rather than a file name) so the same code handles
    // both the local FileInputStream and the HDFS stream
    byte[] b = createChecksum(in);
    String result = "";
    for (int i = 0; i < b.length; i++) {
        result += Integer.toString((b[i] & 0xff) + 0x100, 16).substring(1);
    }
    return result;
}
Output:
HDFS Checksum : d99513cc4f1d9c51679a125702bd27b0 32
Local Checksum : d99513cc4f1d9c51679a125702bd27b0 32
Equal