Java: Threads configuration based on no. of CPU cores
Original question: http://stackoverflow.com/questions/13834692/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
Threads configuration based on no. of CPU-cores
Asked by Santosh
Scenario: I have a sample application and three different system configurations:
- 2-core processor, 2 GB RAM, 60 GB HDD
- 4-core processor, 4 GB RAM, 80 GB HDD
- 8-core processor, 8 GB RAM, 120 GB HDD
To exploit the hardware capabilities effectively, I want to configure the number of threads at the application level. However, I want to do this only after gaining a thorough understanding of the system's capabilities.
Is there some way (a system, method, or tool) to determine a system's capacity in terms of the maximum and minimum number of threads it can service optimally, without any loss of efficiency or performance? With this, I could configure my application with only those values that do full justice to, and achieve the best performance on, the respective hardware configuration.
Edit 1: Could anyone please suggest reading on how to set a baseline for a particular hardware configuration?
Edit 2: To be more direct: I would like to know of any resource or write-up I can read to gain a general, holistic understanding of how the CPU manages threads.
Answered by assylias
The optimal number of threads to use depends on several factors, but mostly on the number of available processors and how CPU-intensive your tasks are. Java Concurrency in Practice proposes the following formula to estimate the optimal number of threads:
N_threads = N_cpu * U_cpu * (1 + W / C)
Where:
- N_threads is the optimal number of threads
- N_cpu is the number of processors, which you can obtain from Runtime.getRuntime().availableProcessors();
- U_cpu is the target CPU utilization (1 if you want to use the full available resources)
- W / C is the ratio of wait time to compute time (0 for a CPU-bound task, maybe 10 or 100 for slow I/O tasks)
So, for example, in a CPU-bound scenario you would have as many threads as CPUs (some advocate using that number + 1, but I have never seen it make a significant difference).
For a slow I/O process, for example a web crawler, W/C could be 10 if downloading a page is 10 times slower than processing it, in which case using 100 threads would be useful.
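The formula above can be turned into a small helper. This is only a sketch: the class and method names (ThreadSizing, optimalThreads) are made up for illustration, and the rounding and the lower bound of 1 are assumptions that the formula itself does not prescribe.

```java
final class ThreadSizing {
    /**
     * Estimates a thread count from N_threads = N_cpu * U_cpu * (1 + W / C).
     * The result is rounded to the nearest integer and is never less than 1
     * (both choices are assumptions of this sketch).
     */
    static int optimalThreads(int nCpu, double targetUtilization, double waitToComputeRatio) {
        if (nCpu < 1 || targetUtilization <= 0 || targetUtilization > 1 || waitToComputeRatio < 0) {
            throw new IllegalArgumentException("invalid sizing parameters");
        }
        double estimate = nCpu * targetUtilization * (1 + waitToComputeRatio);
        return Math.max(1, (int) Math.round(estimate));
    }

    public static void main(String[] args) {
        int nCpu = Runtime.getRuntime().availableProcessors();
        // W / C = 0: CPU-bound work, one thread per processor.
        System.out.println("CPU-bound: " + optimalThreads(nCpu, 1.0, 0.0));
        // W / C = 10: waiting takes ten times as long as computing.
        System.out.println("I/O-bound (W/C = 10): " + optimalThreads(nCpu, 1.0, 10.0));
    }
}
```

With 8 processors, full utilization, and W / C = 10, the helper returns 88, which is in the same range as the crawler example above.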
Note however that there is an upper bound in practice (using 10,000 threads will generally not speed things up, and you would probably get an OutOfMemoryError before you can start them all anyway with normal memory settings).
This is probably the best estimate you can get if you don't know anything about the environment in which your application runs. Profiling your application in production might enable you to fine tune the settings.
Although not strictly related, you might also be interested in Amdahl's law, which aims at measuring the maximum speed-up you can expect from parallelising a program.
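For reference, Amdahl's law gives the speed-up on n workers as 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel. A minimal sketch follows. The class name Amdahl and the value p = 0.9 in main are illustrative assumptions, not measurements.

```java
final class Amdahl {
    /** Speed-up with n workers when a fraction p of the work is parallelisable. */
    static double speedup(double p, int n) {
        if (p < 0 || p > 1 || n < 1) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and n >= 1");
        }
        return 1.0 / ((1.0 - p) + p / n);
    }

    /** Upper limit of the speed-up as the number of workers grows without bound. */
    static double limit(double p) {
        if (p < 0 || p >= 1) {
            throw new IllegalArgumentException("need 0 <= p < 1");
        }
        return 1.0 / (1.0 - p);
    }

    public static void main(String[] args) {
        double p = 0.9; // assumed parallel fraction, for illustration only
        for (int n : new int[] {2, 4, 8}) {
            System.out.println(n + " workers: " + speedup(p, n));
        }
        System.out.println("limit: " + limit(p));
    }
}
```

The point of the law is that the serial fraction caps the gain: past a certain thread count, adding more threads barely changes the result.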
Answered by jstine
My recommendation is to provide config and command-line switches for assigning the number of threads per machine. Use a heuristic based on Runtime.getRuntime().availableProcessors(), as indicated by other answers here, in cases where the user/admin hasn't explicitly configured the application differently. I strongly recommend against relying exclusively on heuristic-based thread-to-core guessing, for several reasons:
- Most modern hardware is moving toward increasingly ambiguous types of 'hardware threads': SMT models such as Intel's Hyper-Threading and AMD's Compute Modules complicate formulas (details below), and querying this info at runtime can be difficult.
- Most modern hardware has a turbo feature that scales speed based on active cores and ambient temperatures. As turbo tech improves, the range of speeds (GHz) grows. Some recent Intel and AMD chips can range from 2.6 GHz (all cores active) to 3.6 GHz (single/dual core active), which combined with SMT can mean each thread getting an effective 1.6 GHz to 2.0 GHz of throughput in the former case. There is currently no way to query this info at runtime.
- If you do not have a strong guarantee that your application will be the only process running on the target systems, then blindly consuming all CPU resources may not please the user or server admin (depending on whether the software is a user app or a server app).
There is no robust way to know what's going on within the rest of the machine at run-time, short of replacing the entire operating system with your own home-rolled multitasking kernel. Your software can try to make educated guesses by querying processes and peeking at CPU loads and such, but doing so is complicated, its usefulness is limited to specific types of applications (for which yours may qualify), and it usually benefits from or requires elevated or privileged access levels.
- Modern virus scanners nowadays work by setting a special priority flag provided by modern operating systems, i.e. they let the OS tell them when "the system is idle". The OS bases its decision on more than just CPU load: it also considers user input and multimedia flags that may have been set by movie players, etc. This is fine for mostly-idle tasks, but not useful for a CPU-intensive task such as yours.
- Distributed home computing apps (BOINC, Folding@Home, etc.) work by querying running processes and system CPU load periodically (perhaps once every second or half-second). If load is detected on processes not belonging to the app for multiple queries in a row, then the app will suspend computation. Once the load stays low for some number of queries, it resumes. Multiple queries are required because CPU load readouts are notorious for brief spikes. There are still caveats: 1. Users are still encouraged to manually reconfigure BOINC to fit their machine's specs. 2. If BOINC is run without admin privileges, then it won't be aware of processes started by other users (including some service processes), so it may unfairly compete with those for CPU resources.
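The suspend/resume behaviour described for the distributed computing apps amounts to a gate that only changes state after several consecutive readings agree. The sketch below shows that idea only; it is not how BOINC is actually implemented. The class name LoadGate, the threshold, and the sample count are assumptions. The main method polls OperatingSystemMXBean.getSystemLoadAverage(), which reports a one-minute, whole-system average (including this process itself) and returns a negative value on platforms where it is unavailable, so it is a much cruder signal than per-process load.

```java
import java.lang.management.ManagementFactory;

final class LoadGate {
    private final double threshold;
    private final int requiredSamples;
    private int highStreak;
    private int lowStreak;
    private boolean suspended;

    LoadGate(double threshold, int requiredSamples) {
        this.threshold = threshold;
        this.requiredSamples = requiredSamples;
    }

    /** Feeds one load sample; returns true while work should stay suspended. */
    boolean update(double load) {
        if (load > threshold) {
            highStreak++;
            lowStreak = 0;
            if (highStreak >= requiredSamples) {
                suspended = true;   // sustained load: back off
            }
        } else {
            lowStreak++;
            highStreak = 0;
            if (lowStreak >= requiredSamples) {
                suspended = false;  // sustained quiet: resume
            }
        }
        return suspended;
    }

    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        LoadGate gate = new LoadGate(0.75 * cores, 3); // assumed threshold and streak length
        for (int i = 0; i < 10; i++) {
            double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
            if (load >= 0) { // negative means "not available on this platform"
                System.out.println("load=" + load + " suspended=" + gate.update(load));
            }
            Thread.sleep(1000);
        }
    }
}
```

Requiring a streak in both directions is what filters out the brief spikes mentioned above: a single high or low reading resets the opposite counter but does not flip the state.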
Regarding SMT (HyperThreading, Compute Modules):
Most SMTs will report as hardware cores or threads these days, which is usually not good because few applications perform optimally when scaled across every core on an SMT system. To make matters worse, querying whether a core is shared (SMT) or dedicated often fails to yield expected results. In some cases the OS itself simply doesn't know (Windows 7 being unaware of AMD Bulldozer's shared core design, for example). If you can get a reliable SMT count, then the rule of thumb is to count each SMT as half a thread for CPU-intensive tasks, and as a full thread for mostly-idle tasks. But in reality, the weight of the SMT depends on what sort of computation it's doing and on the target architecture. Intel's and AMD's SMT implementations behave almost opposite to each other, for example: Intel's is strong at running tasks loaded with integer and branching ops in parallel, while AMD's is strong at running SIMD and memory ops in parallel.
Regarding Turbo Features:
Most CPUs these days have very effective built-in Turbo support that further lessens the value gained from scaling across all cores of the system. Worse, the turbo feature is sometimes based as much on the real temperature of the system as it is on CPU loads, so the cooling system of the tower itself affects the speed as much as the CPU specs do. On a particular AMD A10 (Bulldozer), for example, I observed it running at 3.7 GHz on two threads. It dropped to 3.5 GHz when a third thread was started, and to 3.4 GHz when a fourth was started. Since the chip has an integrated GPU as well, it dropped all the way to approx 3.0 GHz when four threads plus the GPU were working (the A10 CPU internally gives priority to the GPU in high-load scenarios), but it could still muster 3.6 GHz with two threads and the GPU active. Since my application used both CPU and GPU, this was a critical discovery. I was able to improve overall performance by limiting the process to two CPU-bound threads (the other two shared cores were still helpful: they served as GPU servicing threads, able to wake up and respond quickly to push new data to the GPU as needed).
... but at the same time, my application at 4x threads may have performed much better on a system with a higher-quality cooling device installed. It's all so very complicated.
Conclusion: There is no good answer, and because the field of CPU SMT/Turbo design keeps evolving, I doubt there will be a good answer anytime soon. Any decent heuristic you formulate today may very well not produce ideal results tomorrow. So my recommendation is: don't waste much time on it. Rough-guess something based on core counts that suits your local purposes well enough, allow it to be overridden by config/switch, and move on.
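A minimal sketch of that "heuristic default plus override" recommendation follows. The property name app.threads, the class name ThreadCountConfig, and the precedence order (command-line argument first, then system property, then core count) are all assumptions made for illustration.

```java
final class ThreadCountConfig {
    /** Hypothetical property name, chosen for this example only. */
    static final String PROPERTY = "app.threads";

    /**
     * Resolves the thread count: an explicit, valid override wins;
     * otherwise fall back to a default derived from the core count.
     */
    static int resolve(String override, int availableProcessors) {
        if (override != null) {
            try {
                int requested = Integer.parseInt(override.trim());
                if (requested >= 1) {
                    return requested;
                }
            } catch (NumberFormatException ignored) {
                // Not a number: fall through to the heuristic default.
            }
        }
        return Math.max(1, availableProcessors);
    }

    public static void main(String[] args) {
        // Assumed precedence: first argument, then -Dapp.threads=N, then the heuristic.
        String override = args.length > 0 ? args[0] : System.getProperty(PROPERTY);
        int threads = resolve(override, Runtime.getRuntime().availableProcessors());
        System.out.println("Using " + threads + " threads");
    }
}
```

Invalid or non-positive overrides silently fall back to the default here; an application might prefer to log a warning or fail fast instead.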
Answered by Gustav Grusell
You can get the number of processors available to the JVM like this:
Runtime.getRuntime().availableProcessors()
Unfortunately, calculating the optimal number of threads from the number of available processors is not trivial. It depends a lot on the characteristics of the application: for a CPU-bound application, having more threads than processors makes little sense, while for a mostly IO-bound application you might want to use more threads. You also need to take into account whether other resource-intensive processes are running on the system.
I think the best strategy would be to determine the optimal number of threads empirically for each of the hardware configurations, and then use these numbers in your application.
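One way to approach the empirical route is to time the same batch of work under several pool sizes and keep the fastest. The sketch below is deliberately crude: busyWork is a placeholder CPU-bound task rather than your real workload, there is no JIT warm-up, and each candidate is run only once, so a serious measurement would need repeated runs on a representative workload. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ThreadCountBenchmark {
    /** Placeholder CPU-bound task; substitute a representative unit of real work. */
    static long busyWork(int iterations) {
        long acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc += (long) i * i % 7;
        }
        return acc;
    }

    /** Runs `tasks` copies of the workload on a fixed pool; returns elapsed nanoseconds. */
    static long timeRun(int threads, int tasks, int iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> jobs = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                jobs.add(() -> busyWork(iterations));
            }
            long start = System.nanoTime();
            for (Future<Long> result : pool.invokeAll(jobs)) {
                result.get(); // surface any task failure
            }
            return System.nanoTime() - start;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("benchmark task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    /** Returns the candidate thread count with the shortest measured run. */
    static int fastest(int[] candidates, int tasks, int iterations) {
        int best = candidates[0];
        long bestTime = Long.MAX_VALUE;
        for (int threads : candidates) {
            long elapsed = timeRun(threads, tasks, iterations);
            if (elapsed < bestTime) {
                bestTime = elapsed;
                best = threads;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int[] candidates = {1, Math.max(1, cores / 2), cores, cores * 2};
        System.out.println("Fastest: " + fastest(candidates, 64, 2_000_000) + " threads");
    }
}
```

The number found this way can then be stored per hardware configuration and used as the application's default.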
Answered by GreyBeardedGeek
I agree with the other answers here that recommend a best-guess approach, and providing configuration for overriding the defaults.
In addition, if your application is particularly CPU-intensive, you may want to look into "pinning" your application to particular processors.
You don't say what your primary operating system is, or whether you're supporting multiple operating systems, but most have some way of doing this. Linux, for instance, has taskset.
A common approach is to avoid CPU 0 (always used by the OS), and to set your application's cpu affinity to a group of CPUs that are in the same socket.
Keeping the app's threads away from cpu 0 (and, if possible, away from other applications) often improves performance by reducing the amount of task switching.
Keeping the application on one socket can further increase performance by reducing cache invalidation as your app's threads switch among cpus.
As with everything else, this is highly dependent on the architecture of the machine that you are running on, as well as on what other applications are running.
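On Linux, pinning can be done at launch time by prefixing the command with taskset -c followed by a CPU list. The sketch below only builds such a command line from Java. The CPU list "1-3" and the file name app.jar are placeholders, and which CPUs actually share a socket has to be checked on the target machine. The class name PinnedLauncher is likewise illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PinnedLauncher {
    /** Builds a "taskset -c <cpu-list> <command...>" invocation (Linux only). */
    static List<String> pinnedCommand(String cpuList, List<String> command) {
        List<String> full = new ArrayList<>();
        full.add("taskset");
        full.add("-c");
        full.add(cpuList);
        full.addAll(command);
        return full;
    }

    public static void main(String[] args) {
        // Placeholder values: CPUs 1-3 (skipping CPU 0) and a hypothetical app.jar.
        List<String> cmd = pinnedCommand("1-3", Arrays.asList("java", "-jar", "app.jar"));
        System.out.println(String.join(" ", cmd));
        // To actually launch it: new ProcessBuilder(cmd).inheritIO().start();
    }
}
```

Keeping the CPU list as a parameter fits the earlier advice: the affinity is a per-machine setting that an admin should be able to change.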
Answered by abishkar bhattarai
Answered by goblinjuice
I use this Python script to determine the number of cores (and memory, etc.) so that I can launch my Java application with optimal parameters and ergonomics: PlatformWise on GitHub.
It works like this: write a Python script that calls getNumberOfCPUCores() from the above script to get the number of cores, and getSystemMemoryInMB() to get the RAM. You can pass that information to your program via command-line arguments. Your program can then use the appropriate number of threads based on the number of cores.
Answered by Vaibs
Creating threads at the application level is good: on a multicore processor, separate threads are executed on separate cores to enhance performance. So to utilize the cores' processing power, it is best practice to implement threading.
What I think:
- At any given time, only 1 thread of a program executes on 1 core.
- The same application with 2 threads will execute in half the time on 2 cores.
- The same application with 4 threads will execute faster still on 4 cores.
So the application you are developing should have a threading level <= the number of cores.
Thread execution time is managed by the operating system and is a highly unpredictable activity. The CPU execution time a thread is given is known as a time slice or a quantum. If we create more and more threads, the operating system spends a fraction of this time slice deciding which thread goes first, thus reducing the actual execution time each thread gets. In other words, each thread will do less work if a large number of threads are queued up.
Read this to learn how to actually utilize CPU cores. Fantastic content: csharp-codesamples.com/2009/03/threading-on-multi-core-cpus/
Answered by user3118709
Unfortunately, calculating the optimal number of threads from the number of available processors is not trivial. It depends a lot on the characteristics of the application: for a CPU-bound application, having more threads than processors makes little sense, while for a mostly IO-bound application you might want to use more threads. You also need to take into account whether other resource-intensive processes are running on the system.