Java: how to merge two sets of Weka Instances together
Original question: http://stackoverflow.com/questions/10771558/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) on Stack Overflow.
How to merge two sets of weka Instances together
Asked by fodon
Currently, I'm copying one instance at a time from one dataset to the other. Is there a way to do this so that string mappings remain intact? The mergeInstances works horizontally, is there an equivalent vertical merge?
This is one step of a loop I use to read datasets of the same structure from multiple arff files into one large dataset. There has got to be a simpler way.
Instances iNew = new ConverterUtils.DataSource(name).getDataSet();
for (int i = 0; i < iNew.numInstances(); i++) {
    Instance nInst = iNew.instance(i);
    inst.add(nInst);
}
Accepted answer by kaz
Why not make a new ARFF file that contains the data from both of the originals? A simple
cat 1.arff > tmp.arff
tail -n+20 2.arff >> tmp.arff
where 20 is replaced by the line number of the first data row in 2.arff (one more than the number of lines in your ARFF header, because tail -n+K starts its output at line K) would produce a new ARFF file with all of the desired instances. You could then read this new file with your existing code:
Instances iNew = new ConverterUtils.DataSource(name).getDataSet();
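If you would rather not count header lines by hand, the same concatenation can be sketched in plain Java by locating the @data line in each file. This is only an illustration and not part of Weka: ArffConcat, dataLineIndex, concat and concatFiles are hypothetical names, and the sketch assumes every file has an identical header, since the headers of the later files are discarded rather than compared.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not a Weka class): appends the data rows of several ARFF files. */
final class ArffConcat {

    /** Returns the index of the "@data" line (matched case-insensitively), or -1 if absent. */
    static int dataLineIndex(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).trim().toLowerCase(Locale.ROOT).startsWith("@data")) {
                return i;
            }
        }
        return -1;
    }

    /** Keeps the first file whole; from every later file keeps only the lines after "@data". */
    static List<String> concat(List<List<String>> files) {
        List<String> out = new ArrayList<>();
        for (int f = 0; f < files.size(); f++) {
            List<String> lines = files.get(f);
            int dataAt = dataLineIndex(lines);
            if (dataAt < 0) {
                throw new IllegalArgumentException("No @data line in input " + f);
            }
            // Assumption: all headers are identical, so later headers are simply skipped.
            int from = (f == 0) ? 0 : dataAt + 1;
            out.addAll(lines.subList(from, lines.size()));
        }
        return out;
    }

    /** Reads the input files, concatenates them and writes the result to output. */
    static void concatFiles(List<Path> inputs, Path output) throws IOException {
        List<List<String>> files = new ArrayList<>();
        for (Path p : inputs) {
            files.add(Files.readAllLines(p, StandardCharsets.UTF_8));
        }
        Files.write(output, concat(files), StandardCharsets.UTF_8);
    }
}
```

Calling ArffConcat.concatFiles with the paths of 1.arff and 2.arff as inputs and tmp.arff as output would write the merged file, which can then be loaded with DataSource as above. If the nominal value lists differ between the files, a blind concatenation like this is not safe.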
You could also invoke weka on the command line using this documentation: http://old.nabble.com/how-to-merge-two-data-file-a.arff-and-b.arff-into-one-data-list--td22890856.html
java weka.core.Instances append filename1 filename2 > output-file
However, there is no function in the documentation (http://weka.sourceforge.net/doc.dev/weka/core/Instances.html#main%28java.lang.String) which will allow you to append multiple ARFF files natively within your Java code. As of Weka 3.7.6, the code that appends two ARFF files is this:
// read two files, append them and print result to stdout
else if ((args.length == 3) && (args[0].toLowerCase().equals("append"))) {
    DataSource source1 = new DataSource(args[1]);
    DataSource source2 = new DataSource(args[2]);
    String msg = source1.getStructure().equalHeadersMsg(source2.getStructure());
    if (msg != null)
        throw new Exception("The two datasets have different headers:\n" + msg);
    Instances structure = source1.getStructure();
    System.out.println(source1.getStructure());
    while (source1.hasMoreElements(structure))
        System.out.println(source1.nextElement(structure));
    structure = source2.getStructure();
    while (source2.hasMoreElements(structure))
        System.out.println(source2.nextElement(structure));
}
Thus it looks like Weka itself simply iterates through all of the instances in a data set and prints them, the same process your code uses.
Answer by mountrix
If you want a fully automated method that also copies string and nominal attributes properly, you can use the following function:
import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public static Instances merge(Instances data1, Instances data2)
        throws Exception
{
    // Record which attributes are string or nominal
    int asize = data1.numAttributes();
    boolean[] strings_pos = new boolean[asize];
    for (int i = 0; i < asize; i++)
    {
        Attribute att = data1.attribute(i);
        strings_pos[i] = ((att.type() == Attribute.STRING) ||
                          (att.type() == Attribute.NOMINAL));
    }
    // Create a new dataset that starts as a copy of data1
    Instances dest = new Instances(data1);
    dest.setRelationName(data1.relationName() + "+" + data2.relationName());
    DataSource source = new DataSource(data2);
    Instances instances = source.getStructure();
    Instance instance = null;
    while (source.hasMoreElements(instances)) {
        instance = source.nextElement(instances);
        dest.add(instance);
        // Re-set string and nominal values by label so they map onto dest's attributes
        for (int i = 0; i < asize; i++) {
            // Missing values carry no label to copy, so they are skipped
            if (strings_pos[i] && !instance.isMissing(i)) {
                dest.instance(dest.numInstances() - 1)
                    .setValue(i, instance.stringValue(i));
            }
        }
    }
    return dest;
}
Please note that the following conditions should hold (they are not checked in the function):
- Datasets must have the same attribute structure (number of attributes, type of attributes)
- Class index has to be the same
- Nominal values have to correspond exactly
To rename the nominal attribute values of data2 on the fly so that they match those of data1, you can use:
data2.renameAttributeValue(
    data2.attribute("att_name_in_data2"),
    "att_value_in_data2",
    "att_value_in_data1");
Answer by user2402105
Another possible solution is to use addAll from java.util.AbstractCollection, since Instances implements it.
instances1.addAll(instances2);
Answer by btaranta
I've just shared an extended weka.core.Instances class with methods like innerJoin, leftJoin, fullJoin, update and union.
table1.makeIndex(table1.attribute("Continent_ID"));
table2.makeIndex(table2.attribute("Continent_ID"));
Instances result = table1.leftJoin(table2);
The Instances can have different numbers of attributes; levels of NOMINAL and STRING variables are merged together if necessary.
Sources and some examples are here on GitHub: weka.join.