Is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server?
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/20646836/
Asked by shailendra pathak
Is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server? I want to import a big JSON file into es-server.
Accepted answer by dadoonet
You should use the Bulk API. Note that you will need to add a header line before each JSON document.
$ cat requests
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
$ curl -s -XPOST localhost:9200/_bulk --data-binary @requests; echo
{"took":7,"items":[{"create":{"_index":"test","_type":"type1","_id":"1","_version":1,"ok":true}}]}
Answered by Peter
As dadoonet already mentioned, the bulk API is probably the way to go. To transform your file for the bulk protocol, you can use jq.
Assuming the file contains just the documents themselves:
$ echo '{"foo":"bar"}{"baz":"qux"}' |
jq -c '
{ index: { _index: "myindex", _type: "mytype" } },
. '
{"index":{"_index":"myindex","_type":"mytype"}}
{"foo":"bar"}
{"index":{"_index":"myindex","_type":"mytype"}}
{"baz":"qux"}
And if the file contains the documents in a top-level list, they have to be unwrapped first:
$ echo '[{"foo":"bar"},{"baz":"qux"}]' |
jq -c '
.[] |
{ index: { _index: "myindex", _type: "mytype" } },
. '
{"index":{"_index":"myindex","_type":"mytype"}}
{"foo":"bar"}
{"index":{"_index":"myindex","_type":"mytype"}}
{"baz":"qux"}
jq's -c flag makes sure that each document is on a line by itself.
If you want to pipe straight to curl, you'll want to use --data-binary @-, and not just -d, otherwise curl will strip the newlines again.
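Putting it all together, a complete pipeline might look like the sketch below (untested; myindex and mytype are placeholder names, and Elasticsearch is assumed to listen on localhost:9200):
$ echo '[{"foo":"bar"},{"baz":"qux"}]' |
jq -c '.[] | { index: { _index: "myindex", _type: "mytype" } }, .' |
curl -s -XPOST localhost:9200/_bulk --data-binary @-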
Answered by Deryck
I'm sure someone wants this so I'll make it easy to find.
FYI - This is using Node.js (essentially as a batch script) on the same server as the brand new ES instance. Ran it on 2 files with 4000 items each and it only took about 12 seconds on my shared virtual server. YMMV
var elasticsearch = require('elasticsearch'),
    fs = require('fs'),
    pubs = JSON.parse(fs.readFileSync(__dirname + '/pubs.json')), // name of my first file to parse
    forms = JSON.parse(fs.readFileSync(__dirname + '/forms.json')); // and the second set

var client = new elasticsearch.Client({ // default is fine for me, change as you see fit
    host: 'localhost:9200',
    log: 'trace'
});

for (var i = 0; i < pubs.length; i++) {
    client.create({
        index: "epubs", // name your index
        type: "pub", // describe the data that's getting created
        id: i, // increment ID every iteration - I already sorted mine but not a requirement
        body: pubs[i] // *** THIS ASSUMES YOUR DATA FILE IS FORMATTED LIKE SO: [{prop: val, prop2: val2}, {prop:...}, {prop:...}] - I converted mine from a CSV so pubs[i] is the current object {prop:..., prop2:...}
    }, function(error, response) {
        if (error) {
            console.error(error);
            return;
        }
        else {
            console.log(response); // I don't recommend this but I like having my console flooded with stuff. It looks cool. Like I'm compiling a kernel really fast.
        }
    });
}

for (var a = 0; a < forms.length; a++) { // Same stuff here, just slight changes in type and variables
    client.create({
        index: "epubs",
        type: "form",
        id: a,
        body: forms[a]
    }, function(error, response) {
        if (error) {
            console.error(error);
            return;
        }
        else {
            console.log(response);
        }
    });
}
Hope I can help more than just myself with this. Not rocket science but may save someone 10 minutes.
Cheers
Answered by max
jq is a lightweight and flexible command-line JSON processor.
Usage:
cat file.json | jq -c '.[] | {"index": {"_index": "bookmarks", "_type": "bookmark", "_id": .id}}, .' | curl -XPOST localhost:9200/_bulk --data-binary @-
We're taking the file file.json and piping its contents to jq first with the -c flag to construct compact output. Here's the nugget: We're taking advantage of the fact that jq can construct not only one but multiple objects per line of input. For each line, we're creating the control JSON Elasticsearch needs (with the ID from our original object) and creating a second line that is just our original JSON object (.).
At this point we have our JSON formatted the way Elasticsearch's bulk API expects it, so we just pipe it to curl which POSTs it to Elasticsearch!
Credit goes to Kevin Marsh
Answered by mconlin
Import no, but you can index the documents by using the ES API.
You can use the index API to load each line (using some kind of code to read the file and make the curl calls) or the bulk API to load them all (see the sketch after the script below), assuming your data file can be formatted to work with it.
A simple shell script would do the trick if you're comfortable with shell, something like this maybe (not tested):
while read line
do
curl -XPOST 'http://localhost:9200/<indexname>/<typeofdoc>/' -d "$line"
done <myfile.json
Personally, I would probably use Python, either pyes or the elasticsearch Python client.
pyes on github
elastic search python client
Stream2es is also very useful for quickly loading data into es and may have a way to simply stream a file in. (I have not tested it with a file but have used it to load the Wikipedia doc for ES perf testing.)
Answered by Jon Burgess
Stream2es is the easiest way IMO.
e.g. assuming a file "some.json" containing a list of JSON documents, one per line:
curl -O download.elasticsearch.org/stream2es/stream2es; chmod +x stream2es
cat some.json | ./stream2es stdin --target "http://localhost:9200/my_index/my_type"
Answered by miku
You can use esbulk, a fast and simple bulk indexer:
$ esbulk -index myindex file.ldj
Here's an asciicast showing it loading Project Gutenberg data into Elasticsearch in about 11s.
Disclaimer: I'm the author.
Answered by miku
You can use the Elasticsearch Gatherer Plugin.
The gatherer plugin for Elasticsearch is a framework for scalable data fetching and indexing. Content adapters are implemented in gatherer zip archives which are a special kind of plugins distributable over Elasticsearch nodes. They can receive job requests and execute them in local queues. Job states are maintained in a special index.
This plugin is under development.
Milestone 1 - deploy gatherer zips to nodes
Milestone 2 - job specification and execution
Milestone 3 - porting JDBC river to JDBC gatherer
Milestone 4 - gatherer job distribution by load/queue length/node name, cron jobs
Milestone 5 - more gatherers, more content adapters
Answered by 9digitdev
One way is to create a bash script that does a bulk insert (note that the file must already be in the bulk format, with an action line before each document):
curl -XPOST http://127.0.0.1:9200/myindexname/type/_bulk?pretty=true --data-binary @myjsonfile.json
After you run the insert, run this command to get the count:
curl http://127.0.0.1:9200/myindexname/type/_count

