Java: Writing a File Using Multiple Threads

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/22308158/

Time: 2020-08-13 14:51:22  Source: igfitidea

Writing a file using multiple threads

Tags: java, multithreading, postgresql, file-io, bufferedwriter

Asked by jayanth88

I am trying to write a single huge file in Java using multiple threads.

I have tried both the FileWriter and BufferedWriter classes in Java.

The content being written is actually an entire Postgres table that is read using CopyManager and then written out. Each line in the file is a single tuple from the table, and I am writing hundreds of lines at a time.

Approach to write:

The single to-be-written file is opened by multiple threads in append mode. Each thread thereafter tries writing to the file.

Following are the issues I face:

  • Once in a while, the contents of the file get overwritten, i.e. one line remains incomplete and the next line starts right where it stopped. My assumption is that the writer's buffer is filling up, which forces the writer to flush the data to the file immediately. The data written may not be a complete line, and before the writer can write the remainder, the next thread writes its content to the file.
  • While using FileWriter, once in a while I see a single blank line in the file.

Any suggestions on how to avoid this data integrity issue?

Accepted answer by jayanth88

Shared Resource == Contention

Writing to a normal file is by definition a serialized operation. You gain no performance by trying to write to it from multiple threads; I/O is a finite resource with orders of magnitude less bandwidth than even the slowest or most overloaded CPU.

Concurrent access to a shared resource can be complicated (and slow)

If you have multiple threads that are doing expensive calculations, then you have options. If you are using multiple threads only because you think they will speed something up, you will achieve the opposite. Contention for I/O always slows down access to the resource; it never speeds it up, because of the lock waits and other overhead.

You have to have a critical section that is protected and allows only a single writer at a time. Just look up the source code for any logging writer that supports concurrency and you will see that there is only a single thread that writes to the file.

If your application is primarily:

  1. CPU Bound: You can use some locking mechanism or data construct to let only one thread out of many write to the file at a time. As a naive solution this is useless from a concurrency standpoint, but if these threads are CPU bound with little I/O, it might work.

  2. I/O Bound: This is the most common case. You must use a message passing system with a queue of some sort: have all the threads post to a queue/buffer, and have a single thread pull from it and write to the file. This will be the most scalable solution and the easiest to implement.

Journaling - Async Writes

If you need to create a single super large file where the order of writes is unimportant and the program is CPU bound, you can use a journaling technique.

Have each process write to a separate file, and then concatenate the multiple files into a single large file at the end. This is a very old-school, low-tech solution that works well and has for decades.

Obviously the more storage I/O you have the better this will perform on the end concat.
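
The journaling idea above could be sketched roughly as follows, using threads in place of separate processes. This is only an illustration under assumptions: the class name PartFileConcat, the part-N.txt file names, and the generated "worker-N,row-M" lines are hypothetical stand-ins for the real table rows.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PartFileConcat {
    // Each worker writes its own part file, so no two threads ever share a writer.
    static Path writePart(Path dir, int workerId, int rows) throws IOException {
        Path part = dir.resolve("part-" + workerId + ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(part, StandardCharsets.UTF_8)) {
            for (int i = 0; i < rows; i++) {
                out.write("worker-" + workerId + ",row-" + i); // stand-in for a table row
                out.newLine();
            }
        }
        return part;
    }

    // Runs the workers in parallel, then concatenates the parts from a single thread.
    static void writeAll(Path target, int workers, int rowsPerWorker) throws Exception {
        Path dir = Files.createTempDirectory("parts");
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Path>> parts = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int id = w;
                parts.add(pool.submit(() -> writePart(dir, id, rowsPerWorker)));
            }
            try (OutputStream out = Files.newOutputStream(target)) {
                for (Future<Path> part : parts) {
                    Path file = part.get();   // waits until that worker has finished
                    Files.copy(file, out);    // appends the part's bytes to the target
                    Files.delete(file);
                }
            }
            Files.delete(dir);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path target = Files.createTempFile("table", ".txt");
        writeAll(target, 4, 1000);
        System.out.println(Files.readAllLines(target).size() + " lines in " + target);
    }
}
```

Because each part is copied whole, lines from different workers can never interleave; the cost is the extra pass over the data during the final concatenation.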

Answer by Gray

I am trying to write a single huge file in Java using multiple threads.

I would recommend that you have X threads reading from the database and a single thread writing to your output file. This is going to be much easier to implement as opposed to doing file locking and the like.

You could use a shared BlockingQueue (maybe ArrayBlockingQueue) so the database readers would add(...) to the queue and your writer would be in a take() loop on the queue. When the readers finish, they could add some special IM_DONE string constant, and as soon as the writing thread sees X of these constants (i.e. one for each reader), it would close the output file and exit.

So then you can use a single BufferedWriter without any locks and the like. Chances are that you will be blocked by the database calls rather than the local I/O. Certainly the extra thread isn't going to slow you down at all.
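
A minimal sketch of this reader/writer arrangement might look like the following. Several details are assumptions made for illustration: the class name QueueWriterDemo, the queue capacity of 1024, and the generated "reader-N,row-M" strings standing in for database rows. It uses put() rather than add(...) so that readers block when the bounded queue is full instead of throwing, and it runs the take() loop on the calling thread rather than on a dedicated writer thread.

```java
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class QueueWriterDemo {
    // Sentinel compared by identity, so a real row can never be mistaken for it.
    static final String IM_DONE = new String("IM_DONE");

    static void export(Path target, int readers, int rowsPerReader) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        Thread[] readerThreads = new Thread[readers];
        for (int r = 0; r < readers; r++) {
            final int id = r;
            readerThreads[r] = new Thread(() -> {
                try {
                    for (int i = 0; i < rowsPerReader; i++) {
                        queue.put("reader-" + id + ",row-" + i); // stand-in for a database row
                    }
                    queue.put(IM_DONE); // tells the writer this reader has finished
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            readerThreads[r].start();
        }

        // Only this thread touches the BufferedWriter, so no locking is needed.
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            int done = 0;
            while (done < readers) {
                String line = queue.take(); // blocks while the queue is empty
                if (line == IM_DONE) {
                    done++;
                    continue;
                }
                out.write(line);
                out.newLine();
            }
        }
        for (Thread t : readerThreads) {
            t.join();
        }
    }

    public static void main(String[] args) throws Exception {
        Path target = Files.createTempFile("export", ".txt");
        export(target, 4, 1000);
        System.out.println(Files.readAllLines(target).size() + " lines in " + target);
    }
}
```

Note that this sketch does not handle a reader that dies before posting IM_DONE; a real implementation would need some way to unblock the writer in that case.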

The single to-be-written file is opened by multiple threads in append mode. Each thread thereafter tries writing to the file.

If you are adamant about having your reading threads also do the writing, then you should add a synchronized block around the access to a single shared BufferedWriter; you could synchronize on the BufferedWriter object itself. Knowing when to close the writer is a bit of an issue, since each thread would have to know whether the others have exited. Each thread could increment a shared AtomicInteger when it starts running and decrement it when it is done. Then the thread that looks at the run count and sees 0 would be the one that closes the writer.
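
The synchronized variant could be sketched like this. The class name SharedWriterDemo and the generated "worker-N,row-M" lines are hypothetical. One deliberate deviation from the description above: instead of each thread incrementing the counter as it starts, the sketch initializes the AtomicInteger to the number of workers before any of them runs, so the count cannot drop to 0 while a slow-starting thread has yet to register.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicInteger;

class SharedWriterDemo {
    static void export(Path target, int workers, int rowsPerWorker) throws Exception {
        BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8);
        // Set before any worker starts, so the count cannot reach 0 early.
        AtomicInteger running = new AtomicInteger(workers);

        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                try {
                    for (int i = 0; i < rowsPerWorker; i++) {
                        String line = "worker-" + id + ",row-" + i; // stand-in for a table row
                        synchronized (out) { // one complete line per critical section
                            out.write(line);
                            out.newLine();
                        }
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                } finally {
                    // The last worker to finish closes (and thereby flushes) the writer.
                    if (running.decrementAndGet() == 0) {
                        try {
                            out.close();
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    }
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }

    public static void main(String[] args) throws Exception {
        Path target = Files.createTempFile("shared", ".txt");
        export(target, 4, 1000);
        System.out.println(Files.readAllLines(target).size() + " lines in " + target);
    }
}
```

Writing the line and its terminator inside the same synchronized block is what keeps lines from different threads from interleaving.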

Answer by jayesh

Instead of using synchronized methods, a better solution would be a thread pool with a single thread backed by a blocking queue. The messages the application wants to write are pushed to the blocking queue. The log writer thread keeps reading from the blocking queue (blocking when the queue is empty) and keeps writing the messages to the single file.
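
One way to sketch this is with Executors.newSingleThreadExecutor(), which runs submitted tasks one at a time on a single worker thread fed from an internal queue. The class name SingleThreadLogWriter, its log method, and the one-minute shutdown timeout are assumptions made for illustration, not part of the original answer.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SingleThreadLogWriter implements AutoCloseable {
    private final ExecutorService writerThread = Executors.newSingleThreadExecutor();
    private final BufferedWriter out;

    SingleThreadLogWriter(Path target) throws IOException {
        out = Files.newBufferedWriter(target, StandardCharsets.UTF_8);
    }

    // Callable from any thread; the write itself runs on the single writer thread.
    void log(String message) {
        writerThread.execute(() -> {
            try {
                out.write(message);
                out.newLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    @Override
    public void close() throws Exception {
        writerThread.shutdown(); // messages already queued are still written
        if (!writerThread.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("writer thread did not drain in time");
        }
        out.close();
    }

    public static void main(String[] args) throws Exception {
        Path target = Files.createTempFile("log", ".txt");
        try (SingleThreadLogWriter log = new SingleThreadLogWriter(target)) {
            log.log("first message");
            log.log("second message");
        }
        System.out.println(Files.readAllLines(target));
    }
}
```

The executor's default queue is unbounded, so producers that outpace the disk will grow memory use; a bounded queue would apply back-pressure instead.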