What does 'killed' mean when processing a huge CSV with Python suddenly stops?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA terms and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/19189522/


What does 'killed' mean when processing a huge CSV with Python suddenly stops?

Tags: python, csv, etl, kill

Asked by user1893354

I have a Python script that imports a large CSV file, counts the number of occurrences of each word in the file, and then exports the counts to another CSV file.


But what is happening is that once the counting part is finished and the exporting begins, it says Killed in the terminal.


I don't think this is a memory problem (if it were, I assume I would get a memory error and not Killed).


Could it be that the process is taking too long? If so, is there a way to extend the time-out period so I can avoid this?


Here is the code:


import csv
import sys

csv.field_size_limit(sys.maxsize)

counter = {}
# Count how often each pair of co-occurring words appears.
with open("/home/alex/Documents/version2/cooccur_list.csv", 'rb') as file_name:
    reader = csv.reader(file_name)
    for row in reader:
        if len(row) > 1:
            pair = row[0] + ' ' + row[1]
            if pair in counter:
                counter[pair] += 1
            else:
                counter[pair] = 1
print 'finished counting'
# Export the counts to a second CSV file.
writer = csv.writer(open('/home/alex/Documents/version2/dict.csv', 'wb'))
for key, value in counter.items():
    writer.writerow([key, value])

And the Killed happens after finished counting has printed, and the full message is:


killed (program exited with code: 137)

Accepted answer by Blckknght

Exit code 137 (128+9) indicates that your program exited due to receiving signal 9, which is SIGKILL. This also explains the killed message. The question is, why did you receive that signal?

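As a quick sanity check of that arithmetic, you can decode the exit status yourself (a minimal sketch, not from the original answer; it relies on the common shell convention that exit codes above 128 mean "terminated by signal code - 128"):

import signal

exit_code = 137
if exit_code > 128:
    signum = exit_code - 128             # 137 - 128 = 9
    print(signum == signal.SIGKILL)      # True: the process got SIGKILL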

The most likely reason is that your process crossed some limit in the amount of system resources that you are allowed to use. Depending on your OS and configuration, this could mean you had too many open files, used too much filesystem space, or something else. The most likely culprit is that your program was using too much memory. Rather than risking things breaking when memory allocations started failing, the system sent a kill signal to the process that was using too much memory.

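If you want to see which limits your process is actually running under, the standard-library resource module exposes them on POSIX systems (a sketch; RLIMIT_AS is the address-space cap and RLIMIT_NOFILE the open-file cap):

import resource

# A value of resource.RLIM_INFINITY (-1) means "no limit set".
print('address space:', resource.getrlimit(resource.RLIMIT_AS))
print('open files:   ', resource.getrlimit(resource.RLIMIT_NOFILE))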

As I commented earlier, one reason you might hit a memory limit after printing finished counting is that your call to counter.items() in your final loop allocates a list that contains all the keys and values from your dictionary. If your dictionary had a lot of data, this might be a very big list. A possible solution would be to use counter.iteritems(), which is a lazy iterator. Rather than returning all the items in a list, it lets you iterate over them with much less memory usage.


So, I'd suggest trying this, as your final loop:

所以,我建议尝试这个,作为你的最后一个循环:

for key, value in counter.iteritems():
    writer.writerow([key, value])

Note that in Python 3, items returns a "dictionary view" object which does not have the same overhead as Python 2's version. It replaces iteritems, so if you later upgrade Python versions, you'll end up changing the loop back to the way it was.

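If the script needs to run on both interpreters for a while, one option is to pick the lazy variant at runtime (a sketch; the toy counter dict stands in for the real word-pair counts):

import sys

counter = {'a b': 3, 'c d': 1}   # stands in for the real counts

# items() is a cheap view on Python 3 but builds a full list on Python 2,
# where the lazy equivalent is iteritems().
pairs = counter.items() if sys.version_info[0] >= 3 else counter.iteritems()
for key, value in pairs:
    print(key, value)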

Answer by Wingware

I doubt anything is killing the process just because it takes a long time. Killed generically means something from the outside terminated the process, but probably not Ctrl-C in this case, since that would cause Python to exit on a KeyboardInterrupt exception. Also, in Python you would get a MemoryError exception if that were the problem. What might be happening is that you're hitting a bug in Python or in standard-library code that causes the process to crash.

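The distinction is visible from the outside: a failure inside Python surfaces as an exception with a traceback, while an external SIGKILL prints nothing but Killed. A sketch of the in-Python failure mode (hedged: on Linux with memory overcommit enabled, the kernel's OOM killer may SIGKILL the process before Python ever sees a MemoryError):

try:
    data = bytearray(10 ** 12)   # ask for ~1 TB in a single allocation
except MemoryError:
    print('the allocation was refused inside Python: MemoryError')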

Answer by ROY

There are two storage areas involved: the stack and the heap. The stack is where the current state of a method call is kept (i.e. local variables and references), and the heap is where objects are stored. (See: recursion and memory.)


I guess there are too many keys in the counter dict, which consumes too much memory in the heap region, so the process runs out of memory and gets killed.


To avoid this, don't build a giant object such as counter; a disk-backed alternative is sketched after the examples below.


1. StackOverflow

A program that creates too many local variables.


Python 2.7.9 (default, Mar  1 2015, 12:57:24) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open('stack_overflow.py','w')
>>> f.write('def foo():\n')
>>> for x in xrange(10000000):
...   f.write('\tx%d = %d\n' % (x, x))
... 
>>> f.write('foo()')
>>> f.close()
>>> execfile('stack_overflow.py')
Killed

2. OutOfMemory

A program that creates a giant dict that includes too many keys.


>>> f = open('out_of_memory.py','w')
>>> f.write('def foo():\n')
>>> f.write('\tcounter = {}\n')
>>> for x in xrange(10000000):
...   f.write('\tcounter[%d] = %d\n' % (x, x))
... 
>>> f.write('foo()\n')
>>> f.close()
>>> execfile('out_of_memory.py')
Killed


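If the goal is to avoid ever holding the giant dict in RAM, one practical option is a disk-backed mapping from the standard library (a minimal sketch using shelve; the file name and sample pairs are placeholders):

import shelve

# Counts live in a dbm file on disk instead of an in-memory dict.
counts = shelve.open('counts.db')
for pair in ['a b', 'a b', 'c d']:       # stands in for the CSV rows
    counts[pair] = counts.get(pair, 0) + 1
counts.close()

This trades speed for memory: every update goes through the dbm layer, but the working set stays small no matter how many keys there are.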

Answer by ivanleoncz

Most likely, you ran out of memory, so the kernel killed your process.


Have you heard about the OOM Killer?


Here's a log from a script that I developed for processing a huge set of data from CSV files:


Mar 12 18:20:38 server.com kernel: [63802.396693] Out of memory: Kill process 12216 (python3) score 915 or sacrifice child
Mar 12 18:20:38 server.com kernel: [63802.402542] Killed process 12216 (python3) total-vm:9695784kB, anon-rss:7623168kB, file-rss:4kB, shmem-rss:0kB
Mar 12 18:20:38 server.com kernel: [63803.002121] oom_reaper: reaped process 12216 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

It was taken from /var/log/syslog.


Basically:


PID 12216 was elected as a victim (due to its use of 9+ GB of total-vm), so the oom_killer reaped it.


Here's an article about OOM behavior.

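If you suspect the same thing happened to you, a quick check is to scan the system log for those kernel messages (a sketch; the path and keywords assume a Debian/Ubuntu-style syslog, and reading it may require root):

# Print any OOM-killer activity recorded in the system log.
with open('/var/log/syslog') as log:
    for line in log:
        if 'Out of memory' in line or 'oom_reaper' in line:
            print(line.rstrip())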

Answer by Timothy C. Quinn

I just had the same happen to me when I tried to run a Python script from a shared folder in VirtualBox within the new Ubuntu 20.04 LTS. Python bailed with Killed while loading my own personal library. When I moved the folder to a local directory, the issue went away. It appears that the Killed stop happened during the initial imports of my library, as I got messages about missing libraries once I moved the folder over.


The issue went away after I restarted my computer.


Therefore, people may want to try moving the program to a local directory if it's on a share of some kind, or it could be a transient problem that just requires a reboot of the OS.
