Fastest way to import millions of JSON documents to MongoDB
Note: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me) at StackOverflow.
Original source: http://stackoverflow.com/questions/19638845/
Asked by rok
I have more than 10 million JSON documents of the form:
["key": "val2", "key1" : "val", "{\"key\":\"val", \"key2\":\"val2"}"]
in one file.
Importing using the Java Driver API took around 3 hours, using the following function (importing one BSON at a time):
public static void importJSONFileToDBUsingJavaDriver(String pathToFile, DB db, String collectionName) {
    // open file
    FileInputStream fstream = null;
    try {
        fstream = new FileInputStream(pathToFile);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        System.out.println("file does not exist, exiting");
        return;
    }
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    // read it line by line
    String strLine;
    DBCollection newColl = db.getCollection(collectionName);
    try {
        while ((strLine = br.readLine()) != null) {
            // convert each line to BSON
            DBObject bson = (DBObject) JSON.parse(strLine);
            // insert the BSON into the database
            try {
                newColl.insert(bson);
            } catch (MongoException e) {
                // duplicate key
                e.printStackTrace();
            }
        }
        br.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Is there a faster way? Maybe MongoDB settings influence the insertion speed? For example, adding an "_id" key, which functions as the index, so that MongoDB would not have to create an artificial key and index for each document, or disabling index creation at insertion time altogether. Thanks.
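On the "_id" idea above, a hypothetical sketch (not from the original post) of supplying your own _id inside the read loop, assuming each document carries some unique field to derive it from:

// supply _id ourselves so the driver does not generate an ObjectId per document;
// "key" here is an assumed unique field, purely illustrative
DBObject bson = (DBObject) JSON.parse(strLine);
bson.put("_id", bson.get("key"));
newColl.insert(bson);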
Accepted answer by tom
I've got a slightly faster way (I'm also inserting millions at the moment): insert collections instead of single documents with
insert(List<DBObject> list)
http://api.mongodb.org/java/current/com/mongodb/DBCollection.html#insert(java.util.List)
That said, it's not that much faster. I'm about to experiment with setting WriteConcerns other than ACKNOWLEDGED (mainly UNACKNOWLEDGED) to see if I can speed things up further. See http://docs.mongodb.org/manual/core/write-concern/ for info.
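For illustration, a minimal sketch of that batching idea with the legacy Java driver, reusing br and newColl from the question's code; the batch size of 1000 and the UNACKNOWLEDGED write concern are assumptions for the experiment, not values from the answer:

// needs java.util.ArrayList/List, com.mongodb.WriteConcern, com.mongodb.util.JSON
List<DBObject> batch = new ArrayList<DBObject>();
String strLine;
while ((strLine = br.readLine()) != null) {
    batch.add((DBObject) JSON.parse(strLine));
    if (batch.size() == 1000) {                          // assumed batch size
        newColl.insert(batch, WriteConcern.UNACKNOWLEDGED);
        batch.clear();
    }
}
if (!batch.isEmpty()) {
    newColl.insert(batch, WriteConcern.UNACKNOWLEDGED);  // flush the remainder
}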
Another way to improve performance is to create indexes after bulk inserting. However, this is rarely an option except for one-off jobs.
Apologies if this sounds slightly woolly, I'm still testing things myself. Good question.
Answered by evanchooly
You can also remove all the indexes (except for the PK index, of course) and rebuild them after the import.
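For illustration, a minimal sketch of that pattern with the legacy Java driver, reusing newColl from the question; the index key "someField" is a hypothetical example:

// drop secondary indexes before the import; the _id (PK) index always remains
newColl.dropIndexes();

// ... run the bulk import here ...

// rebuild the secondary indexes once the data is in
newColl.createIndex(new BasicDBObject("someField", 1)); // hypothetical field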
Answered by Jhanvi
You can parse the entire file at once and then save the JSON into Mongo documents, avoiding multiple loops. You need to separate the logic as follows:
1) Parse the file and retrieve the JSON objects.
2) Once the parsing is over, save the JSON objects in Mongo documents.
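A minimal sketch of that two-step split, reusing br and newColl from the question's code and assuming one JSON document per line (for 10 million documents you would still want to chunk step 2):

// step 1: parse the file and collect the JSON objects
List<DBObject> docs = new ArrayList<DBObject>();
String line;
while ((line = br.readLine()) != null) {
    docs.add((DBObject) JSON.parse(line));
}
// step 2: once parsing is over, save the documents to Mongo
newColl.insert(docs);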
Answered by Yadli
I've done an import of a multi-line JSON file with ~250M records. I just used mongoimport < data.txt and it took 10 hours. Compared to your 10M in 3 hours, I think this is considerably faster.
Also, from my experience, writing your own multi-threaded parser would speed things up drastically. The procedure is simple (a sketch follows the list):
- Open the file as BINARY (not TEXT!)
- Set markers (offsets) evenly across the file. The count of markers depends on the number of threads you want.
- Search for '\n' near the markers and calibrate the markers so they are aligned to lines.
- Parse each chunk with a thread.
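A sketch only of one way those steps could look in Java. The file name, thread count, and the empty parse hook are placeholders; the under-2GB-per-mapping limit is an added caveat, not part of the answer:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class ParallelImport {
    public static void main(String[] args) throws Exception {
        final int threads = 4;                 // placeholder thread count
        final String path = "data.json";       // placeholder file name
        long[] offsets = new long[threads + 1];
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long size = raf.length();
            offsets[threads] = size;
            for (int i = 1; i < threads; i++) {
                // set markers evenly, then calibrate them to line boundaries
                offsets[i] = alignToNewline(raf, i * (size / threads));
            }
        }
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final long start = offsets[i], end = offsets[i + 1];
            workers[i] = new Thread(() -> parseChunk(path, start, end));
            workers[i].start();
        }
        for (Thread t : workers) t.join();
    }

    // scan forward from pos to just past the next '\n' so chunks begin at line starts
    static long alignToNewline(RandomAccessFile raf, long pos) throws IOException {
        raf.seek(pos);
        int b;
        while ((b = raf.read()) != -1 && b != '\n') { /* skip */ }
        return raf.getFilePointer();
    }

    static void parseChunk(String path, long start, long end) {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            // memory-map this thread's slice (binary, not a text reader);
            // each mapped slice must stay under 2 GB
            MappedByteBuffer buf = raf.getChannel()
                    .map(FileChannel.MapMode.READ_ONLY, start, end - start);
            int lineStart = 0;
            for (int i = 0; i < buf.limit(); i++) {
                if (buf.get(i) == '\n') {
                    // bytes [lineStart, i) hold one JSON line: parse in place
                    // and hand the result to the inserter here
                    lineStart = i + 1;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}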
A reminder:
When you want performance, don't use a stream reader or any built-in line-based read methods. They are slow. Just use a binary buffer and search for '\n' to identify a line, and (most preferably) do in-place parsing in the buffer without creating a string. Otherwise the garbage collector won't be so happy with this.
Answered by Bruno D. Rodrigues
I'm sorry, but you're all picking minor performance issues instead of the core one. Separating the logic from reading the file and inserting is a small gain. Loading the file in binary mode (via MMAP) is a small gain. Using mongo's bulk inserts is a big gain, but still no dice.
The whole performance bottleneck is the DBObject bson = (DBObject) JSON.parse(line). Or in other words, the problem with the Java drivers is that they need a conversion from JSON to BSON, and this code seems to be awfully slow or badly implemented. A full JSON round trip (encode+decode) via JSON-simple, or especially via JSON-smart, is 100 times faster than the JSON.parse() command.
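As a hedged sketch of the swap being suggested (not the answerer's code): json-smart's JSONValue.parse returns a Map-backed object for JSON documents, which the legacy driver's BasicDBObject can wrap directly:

import java.util.Map;
import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;
import net.minidev.json.JSONValue;

// parse with json-smart instead of com.mongodb.util.JSON.parse
@SuppressWarnings("unchecked")
static DBObject parseLine(String line) {
    Map<String, Object> parsed = (Map<String, Object>) JSONValue.parse(line);
    return new BasicDBObject(parsed); // nested maps/lists are encoded by the driver
}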
I know Stack Overflow is telling me right above this box that I should be answering the question, which I'm not, but rest assured that I'm still looking for an answer to this problem. I can't believe all the talk about Mongo's performance and then this simple example code fails so miserably.
Answered by Sam Wolfand
You can use a bulk insertion.
You can read the documentation at the MongoDB website, and you can also check this Java example on StackOverflow.
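For illustration, a minimal sketch of a bulk insert with the legacy Java driver (2.12+), assuming newColl from the question and a docs list of already-parsed documents:

// an unordered bulk op lets the server batch the writes and continue past errors
BulkWriteOperation bulk = newColl.initializeUnorderedBulkOperation();
for (DBObject doc : docs) {
    bulk.insert(doc);
}
BulkWriteResult result = bulk.execute();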
Answered by PUG
Use bulk operations insert/upserts. After Mongo 2.6 you can do Bulk Updates/Upserts. The example below does a bulk update using the C# driver.
MongoCollection<foo> collection = database.GetCollection<foo>(collectionName);
var bulk = collection.InitializeUnorderedBulkOperation();
foreach (FooDoc fooDoc in fooDocsList)
{
    var update = new UpdateDocument { { fooDoc.ToBsonDocument() } };
    bulk.Find(Query.EQ("_id", fooDoc.Id)).Upsert().UpdateOne(update);
}
BulkWriteResult bwr = bulk.Execute();