pandas data frame - select rows and clear memory?
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA terms and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/19674212/
Asked by a b
I have a large pandas dataframe (size = 3 GB):
import pandas as pd
x = pd.read_table('big_table.txt', sep='\t', header=0, index_col=0)
Because I'm working under memory constraints, I subset the dataframe:
rows = calculate_rows() # a function that calculates what rows I need
cols = calculate_cols() # a function that calculates what cols I need
x = x.iloc[rows, cols]
The functions that calculate the rows and columns are not important, but they are DEFINITELY a smaller subset of the original rows and columns. However, when I do this operation, memory usage increases by a lot! The original goal was to shrink the memory footprint to less than 3GB, but instead, memory usage goes well over 6GB.
I'm guessing this is because Python creates a local copy of the dataframe in memory, but doesn't clean it up. There may also be other things that are happening... So my question is how do I subset a large dataframe and clean up the space? I can't find a function that selects rows/cols in place.
I have read a lot of Stack Overflow, but can't find much on this topic. It could be I'm not using the right keywords, so if you have suggestions, that could also help. Thanks!
Answered by Jeff
You are much better off doing something like this:
Specify usecols to sub-select which columns you want in the first place when calling read_csv (see the read_csv documentation).
Then read the file in chunks using chunksize; whenever the rows that you want appear in a chunk, set them aside, and finally concatenate the results.
Pseudo-code-ish:
import pandas as pd

reader = pd.read_csv('big_table.txt', sep='\t', header=0,
                     index_col=0, usecols=the_columns_i_want_to_use,
                     chunksize=10000)

df = pd.concat([chunk.iloc[rows_that_I_want] for chunk in reader])
This will have a constant memory usage (the size of a chunk), plus the selected rows' usage times two, which happens while you concat the rows; after the concat, usage goes back down to just the selected rows.
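To make that concrete, here is a minimal sketch of the same pattern; the column names ('id', 'score') and the row condition are assumptions made up for illustration, not part of the original question:

import pandas as pd

# usecols drops the unwanted columns at parse time, and chunksize keeps only
# 10000 rows in memory at once (the hypothetical 'id'/'score' columns stand in
# for whatever calculate_cols() would have returned).
reader = pd.read_csv('big_table.txt', sep='\t', header=0,
                     index_col='id', usecols=['id', 'score'],
                     chunksize=10000)

# Keep only the rows matching the condition from each chunk, then stitch the
# kept pieces together once at the end.
df = pd.concat([chunk[chunk['score'] > 0.5] for chunk in reader])

Filtering by a condition inside each chunk avoids having to know absolute row positions up front, which is awkward once the file is being read in pieces.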
Answered by tinproject
I've had a similar problem; I solved it by filtering the data before loading. When you read the file with read_table you load the whole thing into a DataFrame, and possibly also hold the whole file in memory, or duplicate some of it because different dtypes are used, so that is where the 6 GB goes.
You could make a generator to load the contents of the file line by line. I assume the data is row-based, with one record being one row and one line in big_table.txt, so:
def big_table_generator(filename):
    with open(filename, 'rt') as f:
        for line in f:
            if is_needed_row(line):   # check if you want this row
                # cut_columns() returns a list with only the selected columns
                record = cut_columns(line)
                yield record
import pandas

gen = big_table_generator('big_table.txt')
df = pandas.DataFrame.from_records(list(gen))
Note the list(gen): pandas 0.12 and earlier don't accept generators, so you have to convert it to a list, which means all the data produced by the generator is put in memory (0.13 will do the same thing internally). You also need roughly twice the memory of the data you keep: one copy to load the data and one to put it into the pandas NDFrame structure.
You could also make the generator read from a compressed file; with the Python 3.3 gzip library, only the needed chunks are decompressed.
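As a rough sketch of that idea (the .gz filename is hypothetical, and is_needed_row/cut_columns are the same placeholder helpers as in the generator above):

import gzip

def big_table_gz_generator(filename):
    # gzip.open in text mode ('rt') decompresses the stream lazily as it is
    # iterated, so the whole file never sits uncompressed in memory.
    with gzip.open(filename, 'rt') as f:
        for line in f:
            if is_needed_row(line):
                yield cut_columns(line)

gz_gen = big_table_gz_generator('big_table.txt.gz')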

