Note: the question and answers below are taken from StackOverflow and are provided under the CC BY-SA 4.0 license. You are free to use and share them, but you must attribute them to the original authors (not me). Original question: http://stackoverflow.com/questions/714905/

Is it possible to create threads without system calls in Linux x86 GAS assembly?

Tags: linux, multithreading, assembly, gas

Asked by sven

Whilst learning the "assembler language" (in Linux on an x86 architecture using the GNU as assembler), one of the aha moments was the possibility of using system calls. These system calls come in very handy and are sometimes even necessary as your program runs in user-space.
However system calls are rather expensive in terms of performance as they require an interrupt (and of course a system call) which means that a context switch must be made from your current active program in user-space to the system running in kernel-space.
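
(For reference, a minimal sketch of what such a system call looks like in GNU as, assuming x86-64 Linux; each syscall instruction below is exactly the kind of user-space-to-kernel transition meant here.)

    .global _start
    .text
    _start:
        mov  $1, %rax            # syscall number 1 = write
        mov  $1, %rdi            # fd 1 = stdout
        lea  msg(%rip), %rsi     # buffer to print
        mov  $len, %rdx          # number of bytes
        syscall                  # trap into the kernel
        mov  $60, %rax           # syscall number 60 = exit
        xor  %rdi, %rdi          # exit status 0
        syscall                  # trap into the kernel again
    .data
    msg: .ascii "hello from user-space\n"
    len = . - msg

Built with plain as and ld, it does nothing but write one line and exit, yet both actions have to go through the kernel.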

The point I want to make is this: I'm currently implementing a compiler (for a university project) and one of the extra features I wanted to add is the support for multi-threaded code in order to enhance the performance of the compiled program. Because some of the multi-threaded code will be automatically generated by the compiler itself, this will almost guarantee that there will be really tiny bits of multi-threaded code in it as well. In order to gain a performance win, I must be sure that using threads will make this happen.

My fear however is that, in order to use threading, I must make system calls and the necessary interrupts. The tiny little (auto-generated) threads will therefore be highly affected by the time it takes to make these system calls, which could even lead to a performance loss...

My question is therefore twofold (with an extra bonus question underneath it):

  • Is it possible to write assembler code which can run multiple threads simultaneously on multiple cores at once, without the need of system calls?
  • Will I get a performance gain if I have really tiny threads (tiny as in the total execution time of the thread), a performance loss, or isn't it worth the effort at all?

My guess is that multithreaded assembler code is not possible without system calls. Even if this is the case, do you have a suggestion (or even better: some real code) for implementing threads as efficiently as possible?

Accepted answer by Nathan Fellman

The short answer is that you can't. When you write assembly code it runs sequentially (or with branches) on one and only one logical (i.e. hardware) thread. If you want some of the code to execute on another logical thread (whether on the same core, on a different core on the same CPU or even on a different CPU), you need to have the OS set up the other thread's instruction pointer (CS:EIP) to point to the code you want to run. This implies using system calls to get the OS to do what you want.
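
As a concrete illustration of that last sentence, here is a minimal sketch, assuming x86-64 Linux, GNU as and the raw clone(2) ABI (error handling, TLS and any synchronization are omitted), of the system call that hands the kernel a second starting point in your code and a stack for it to run on:

    .set CLONE_VM,      0x00000100
    .set CLONE_FS,      0x00000200
    .set CLONE_FILES,   0x00000400
    .set CLONE_SIGHAND, 0x00000800
    .set CLONE_THREAD,  0x00010000

    .global _start
    .text
    _start:
        mov  $56, %rax                       # syscall 56 = clone
        mov  $(CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD), %rdi
        lea  child_stack_top(%rip), %rsi     # stack for the new thread
        xor  %rdx, %rdx                      # parent_tid pointer (unused)
        xor  %r10, %r10                      # child_tid pointer  (unused)
        xor  %r8, %r8                        # tls                (unused)
        syscall
        test %rax, %rax
        jz   child                           # %rax == 0 in the new thread
                                             # (a negative %rax would be an error)
        # ... the parent's code, now running in parallel with the child ...
        mov  $60, %rax                       # exit this thread only
        xor  %rdi, %rdi
        syscall

    child:
        # ... the new thread's code, possibly on another core ...
        mov  $60, %rax
        xor  %rdi, %rdi
        syscall

    .bss
    .balign 16
    child_stack:     .skip 65536
    child_stack_top:

Whether you go through pthread_create or issue it by hand as above, this request to the kernel is the part you cannot skip if you want a second hardware thread running your code.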

User threads won't give you the threading support that you want, because they all run on the same hardware thread.

Edit: Incorporating Ira Baxter's answer about PARLANSE. If you ensure that your program has a thread running in each logical thread to begin with, then you can build your own scheduler without relying on the OS. Either way, you need a scheduler to handle hopping from one thread to another. Between calls to the scheduler, there are no special assembly instructions to handle multi-threading. The scheduler itself can't rely on any special assembly, but rather on conventions between parts of the scheduler in each thread.

Either way, whether or not you use the OS, you still have to rely on some scheduler to handle cross-thread execution.
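
To make the "conventions, not special instructions" point concrete: if, by convention, a thread's context is nothing more than its saved stack pointer, with the callee-saved registers parked on that thread's own stack, then the whole switch is an ordinary unprivileged function. A minimal sketch, assuming x86-64, the System V ABI and GNU as; the function name is made up:

    # void switch_ctx(uint64_t *save_sp, uint64_t *load_sp)   (hypothetical)
    .text
    .global switch_ctx
    switch_ctx:
        push %rbp                 # park the callee-saved registers
        push %rbx                 # on the current thread's stack
        push %r12
        push %r13
        push %r14
        push %r15
        mov  %rsp, (%rdi)         # *save_sp = current stack pointer
        mov  (%rsi), %rsp         # switch to the other thread's stack
        pop  %r15                 # unpark its registers
        pop  %r14
        pop  %r13
        pop  %r12
        pop  %rbx
        pop  %rbp
        ret                       # resume wherever it last called switch_ctx

Caller-saved registers need no attention at all, since the calling convention already treats them as dead across the call; that is exactly the kind of contract between "parts of the scheduler in each thread" described above.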

Answered by Ira Baxter

"Doctor, doctor, it hurts when I do this". Doctor: "Don't do that".

The short answer is you can do multithreaded programming without calling expensive OS task management primitives. Simply ignore the OS for thread scheduling operations. This means you have to write your own thread scheduler, and simply never pass control back to the OS. (And you have to be cleverer somehow about your thread overhead than the pretty smart OS guys.) We chose this approach precisely because Windows process/thread/fiber calls were all too expensive to support computation grains of a few hundred instructions.

Our PARLANSE programming language is a parallel programming language: see http://www.semdesigns.com/Products/Parlanse/index.html

PARLANSE runs under Windows, offers parallel "grains" as the abstract parallelism construct, and schedules such grains by a combination of a highly tuned hand-written scheduler and scheduling code generated by the PARLANSE compiler that takes into account the context of the grain to minimize scheduling overhead. For instance, the compiler ensures that the registers of a grain contain no information at the point where scheduling (e.g., "wait") might be required, and thus the scheduler code only has to save the PC and SP. In fact, quite often the scheduler code doesn't get control at all; a forked grain simply stores the forking PC and SP, switches to the compiler-preallocated stack and jumps to the grain code. Completion of the grain will restart the forker.
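
To illustrate the mechanism being described (this is not PARLANSE's actual code, just a minimal sketch of the idea in GNU as on x86-64, and it simply runs the grain to completion before restarting the forker):

    .text
    fork_grain:
        lea  resumed(%rip), %rax
        mov  %rax, forker_pc(%rip)        # where the forker will resume
        mov  %rsp, forker_sp(%rip)        # and on which stack
        lea  grain_stack_top(%rip), %rsp  # preallocated stack for the grain
        jmp  grain_body                   # run the grain's code
    resumed:
        ret                               # the forker carries on from here

    grain_body:
        # ... the grain's work ...
        mov  forker_sp(%rip), %rsp        # grain complete: restart the forker
        jmp  *forker_pc(%rip)

    .bss
    forker_pc: .skip 8
    forker_sp: .skip 8
    .balign 16
    grain_stack: .skip 4096
    grain_stack_top: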

Normally there's an interlock to synchronize grains, implemented by the compiler using native LOCK DEC instructions that implement what amounts to counting semaphores. Applications can logically fork millions of grains; the scheduler limits parent grains from generating more work if the work queues are long enough that more work won't be helpful. The scheduler implements work-stealing to allow work-starved CPUs to grab ready grains from neighboring CPU work queues. This has been implemented to handle up to 32 CPUs; but we're a bit worried that the x86 vendors may actually swamp us with more than that in the next few years!
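
The interlock idea, as a minimal sketch (again an illustration in GNU as on x86-64, not PARLANSE's code): a counter decremented with a locked instruction, with a naive spin-wait standing in for a real scheduler that would switch to another ready grain instead:

    .data
    pending: .quad 2                  # e.g. two child grains outstanding
    .text
    grain_done:                       # each grain runs this as it finishes
        lock decq pending(%rip)
        ret
    wait_for_grains:                  # the forker waits for the count to drain
    1:  cmpq $0, pending(%rip)
        je   2f
        pause                         # spin politely
        jmp  1b                       # (a real scheduler would run another
    2:  ret                           #  ready grain here instead of spinning)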

PARLANSE is a mature language; we've been using it since 1997, and have implemented a several-million-line parallel application in it.

Answered by Ira Baxter

Implement user-mode threading.

Historically, threading models are generalised as N:M, which is to say N user-mode threads running on M kernel-mode threads. Modern usage is 1:1, but it wasn't always like that and it doesn't have to be like that.

You are free to maintain in a single kernel thread an arbitrary number of user-mode threads. It's just that it's your responsibility to switch between them sufficiently often that it all looks concurrent. Your threads are of course co-operative rather than pre-emptive; you basically scatter yield() calls throughout your own code to ensure regular switching occurs.
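
Below is a minimal, self-contained sketch of exactly this, assuming x86-64 Linux, GNU as/ld and no libc (every name in it is invented for the example): two co-operative user-mode threads ping-ponging inside one kernel thread, with yield scattered through both. The only system calls are the write and exit at the edges; the "threading" itself is just stack switching, which is also why it all stays on one core.

    .global _start
    .text

    # yield: save this context's callee-saved registers and stack pointer,
    # then resume the other context. Only two contexts are kept, in a
    # tiny table of saved stack pointers.
    yield:
        push %rbp
        push %rbx
        push %r12
        push %r13
        push %r14
        push %r15
        mov  current(%rip), %rax        # index of the running context (0/1)
        lea  ctx_sp(%rip), %rcx
        mov  %rsp, (%rcx,%rax,8)        # remember where we stopped
        xor  $1, %rax                   # pick the other context
        mov  %rax, current(%rip)
        mov  (%rcx,%rax,8), %rsp        # adopt its stack
        pop  %r15
        pop  %r14
        pop  %r13
        pop  %r12
        pop  %rbx
        pop  %rbp
        ret                             # resume it where it last yielded

    # the second context: print "B" three times, yielding in between
    thread_b:
        mov  $3, %r12
    1:  lea  msg_b(%rip), %rsi
        call write2
        call yield
        dec  %r12
        jnz  1b
    2:  call yield                      # done: just keep handing control back
        jmp  2b

    write2:                             # write(1, %rsi, 2)
        mov  $1, %rax
        mov  $1, %rdi
        mov  $2, %rdx
        syscall
        ret

    _start:
        # seed context 1 so that yield's final 'ret' lands in thread_b:
        # [top-8] holds the entry address, below it six slots for the pops
        lea  stack_b_top(%rip), %rax
        lea  thread_b(%rip), %rcx
        mov  %rcx, -8(%rax)
        lea  -56(%rax), %rcx
        mov  %rcx, ctx_sp+8(%rip)       # saved SP of context 1
        # context 0 is this one: print "A" three times, yielding in between
        mov  $3, %r12
    1:  lea  msg_a(%rip), %rsi
        call write2
        call yield
        dec  %r12
        jnz  1b
        mov  $60, %rax                  # all done: exit the process
        xor  %rdi, %rdi
        syscall

    .data
    current: .quad 0                    # index of the running context
    msg_a:   .ascii "A\n"
    msg_b:   .ascii "B\n"
    .bss
    .balign 16
    ctx_sp:  .skip 16                   # saved stack pointers, one per context
    stack_b: .skip 8192                 # context 1's private stack
    stack_b_top:

Assembled with as and linked with ld, it should print alternating A and B lines; the seeded stack is what makes yield's final ret land in thread_b the first time context 1 is picked.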

Answered by Adam Rosenfield

If you want to gain performance, you'll have to leverage kernel threads. Only the kernel can help you get code running simultaneously on more than one CPU core. Unless your program is I/O bound (or performing other blocking operations), performing user-mode cooperative multithreading (also known as fibers) is not going to gain you any performance. You'll just be performing extra context switches, but the one CPU that your real thread is running will still be running at 100% either way.

System calls have gotten faster. Modern CPUs have support for the sysenter instruction, which is significantly faster than the old int instruction. See also this article for how Linux does system calls in the fastest way possible.

Make sure that the automatically-generated multithreading has the threads run for long enough that you gain performance. Don't try to parallelize short pieces of code, you'll just waste time spawning and joining threads. Also be wary of memory effects (although these are harder to measure and predict) -- if multiple threads are accessing independent data sets, they will run much faster than if they were accessing the same data repeatedly due to the cache coherency problem.

Answered by Nick

Quite a bit late now, but I was interested in this kind of topic myself. In fact, there's nothing all that special about threads that specifically requires the kernel to intervene EXCEPT for parallelization/performance.

Obligatory BLUF (bottom line up front):

Q1: No. At least initial system calls are necessary to create multiple kernel threads across the various CPU cores/hyper-threads.

Q2: It depends. If you create/destroy threads that perform tiny operations then you're wasting resources (the thread creation process would greatly exceed the time used by the thread before it exits). If you create N threads (where N is roughly the number of cores/hyper-threads on the system) and re-task them, then the answer COULD be yes depending on your implementation.

Q3: You COULD optimize operation if you KNEW ahead of time a precise method of ordering operations. Specifically, you could create what amounts to a ROP-chain (or a forward call chain, but this may actually end up being more complex to implement). This ROP-chain (as executed by a thread) would continuously execute 'ret' instructions (to its own stack) where that stack is continuously prepended (or appended in the case where it rolls over to the beginning). In such a (weird!) model the scheduler keeps a pointer to each thread's 'ROP-chain end' and writes new values to it whereby the code circles through memory executing function code that ultimately results in a ret instruction. Again, this is a weird model, but is intriguing nonetheless.
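
For what it's worth, a toy sketch of that "chain of rets" idea, assuming x86-64 and GNU as (the steps deliberately avoid touching the stack so they cannot disturb unconsumed chain entries, and a real scheduler would keep appending to the chain rather than fixing it at assembly time):

    .text
    .global _start

    run_chain:                        # start walking the prebuilt chain
        mov  %rsp, saved_rsp(%rip)
        lea  chain(%rip), %rsp        # %rsp now points at the first entry
        ret                           # "returns" into step_a

    step_a:                           # each step ends in ret, which pops
        incq counter(%rip)            # the address of the next step
        ret

    step_b:
        incq counter(%rip)
        ret

    chain_done:                       # last entry: restore the real stack
        mov  saved_rsp(%rip), %rsp
        ret                           # back to run_chain's caller

    _start:
        call run_chain
        mov  counter(%rip), %rdi      # exit status = number of steps run (2)
        mov  $60, %rax
        syscall

    .data
    chain:                            # the "chain": one address per step; a
        .quad step_a                  # scheduler could keep writing new
        .quad step_b                  # entries here while the chain runs
        .quad chain_done
    .bss
    saved_rsp: .skip 8
    counter:   .skip 8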

Onto my 2-cents worth of content.

I recently created what effectively operates as threads in pure assembly by managing various stack regions (created via mmap) and maintaining a dedicated area to store the control/individualization information for the "threads". It is possible, although I didn't design it this way, to create a single large block of memory via mmap that I subdivide into each thread's 'private' area. Thus only a single syscall would be required (although guard pages between the areas would be smart; these would require additional syscalls).
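
A minimal sketch of the mmap part, assuming x86-64 Linux (constants written out by hand, error checking omitted; the guard pages mentioned above would indeed cost an extra mprotect per stack):

    .set STACK_SIZE, 0x100000             # 1 MiB per "thread" (an assumption)
    .text
    .global map_stack
    map_stack:
        mov  $9, %rax                     # syscall 9 = mmap
        xor  %rdi, %rdi                   # let the kernel choose the address
        mov  $STACK_SIZE, %rsi
        mov  $0x3, %rdx                   # PROT_READ | PROT_WRITE
        mov  $0x22, %r10                  # MAP_PRIVATE | MAP_ANONYMOUS
        mov  $-1, %r8                     # not backed by a file
        xor  %r9, %r9                     # offset 0
        syscall                           # (error check on %rax omitted)
        add  $STACK_SIZE, %rax            # hand back the usable stack TOP
        ret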

This implementation uses only the base kernel thread created when the process spawns and there is only a single usermode thread throughout the entire execution of the program. The program updates its own state and schedules itself via an internal control structure. I/O and such are handled via blocking options when possible (to reduce complexity), but this isn't strictly required. Of course I made use of mutexes and semaphores.

To implement this system (entirely in userspace and also via non-root access if desired) the following were required:

A notion of what threads boil down to:

  • A stack for stack operations (kinda self-explaining and obvious)
  • A set of instructions to execute (also obvious)
  • A small block of memory to hold individual register contents
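
In GAS, that per-"thread" control record can be little more than a handful of named offsets; the layout below is an arbitrary assumption, just to show the shape:

    .set TCB_SP,    0                  # saved stack pointer
    .set TCB_ENTRY, 8                  # code the "thread" executes
    .set TCB_REGS,  16                 # small block for register contents
    .set TCB_SIZE,  16 + 8*8           # room for eight 8-byte registers

    # e.g. saving the stack pointer of the thread whose record is in %rdi:
    #     mov %rsp, TCB_SP(%rdi)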

What a scheduler boils down to: A manager for a series of threads (note that processes never actually execute, just their thread(s) do) in a scheduler-specified ordered list (usually priority).

A thread context switcher: A MACRO injected into various parts of code (I usually put these at the end of heavy-duty functions) that equates roughly to 'thread yield', which saves the thread's state and loads another thread's state.
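
Such a macro can be as small as a GNU as .macro wrapping a call into the scheduler; schedule_next below is hypothetical, standing for a save-state, pick-next, load-state routine like the switch sketched earlier on this page:

    .macro YIELD
        call schedule_next           # hypothetical scheduler entry point
    .endm

    # then sprinkle it at the end of heavy-duty functions:
    #     heavy_function:
    #         ...
    #         YIELD
    #         ret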

So it is indeed possible (entirely in assembly, and without system calls other than the initial mmap and mprotect) to create usermode thread-like constructs in a non-root process.

I only added this answer because you specifically mention x86 assembly, and this answer was derived entirely from a self-contained program written in x86 assembly that achieves the goals (minus multi-core capabilities) of minimizing system calls and minimizing system-side thread overhead.

Answered by Bastien Léonard

First you should learn how to use threads in C (pthreads, POSIX threads). On GNU/Linux you will probably want to use POSIX threads or GLib threads. Then you can simply call the C from assembly code.

Here are some pointers:

  • Posix threads: link text
  • A tutorial where you will learn how to call C functions from assembly: link text
  • Butenhof's book on POSIX threads link text

Answered by Zifre

System calls are not that slow now, with syscall or sysenter instead of int. Still, there will only be an overhead when you create or destroy the threads. Once they are running, there are no system calls. User mode threads will not really help you, since they only run on one core.
