Reduce the memory used loading a huge pandas dataframe from MongoDB

Disclaimer: this page is based on a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not this site). Original question: http://stackoverflow.com/questions/47747979/

Tags: python, mongodb, pandas, memory

Asked by Tony Wang

I have a large dataset containing 40 million records with a total size of ~21.0G, stored in MongoDB. It took me a few hours to load it into a pandas dataframe, and the total memory usage increased to about 28.7G (before loading, it was ~600Mb).

import pandas as pd

# mongocollection is a pymongo collection object
cursor = mongocollection.find()
data = pd.DataFrame()
count = 0
dataset = []
for i in cursor:
    dataset.append(i)
    del i
    count += 1
    if count % 100000 == 0:
        print(count)
        temp = pd.DataFrame(dataset, columns=dataset[0].keys())
        dataset = []
        data = data.append(temp)

temp = pd.DataFrame(dataset, columns=dataset[0].keys())
dataset = []
data = data.append(temp)

Concerned about the time cost of loading such a dataset, I saved it to local disk with pd.to_csv('localdisk.csv'). The size of the csv file is 7.1Gb.

So the question is: why is the csv file so small while the memory used by the dataframe (or other data?) is about 4 times larger, and is there a better solution to reduce the memory used by the dataframe? I have another dataset with more than 100 million items of the same kind, and I wonder whether I would be able to load it into memory with this approach.

PS: I think the reason it takes so much time to load the data into memory is these three commands:

temp = pd.DataFrame(dataset, columns=dataset[0].keys())
dataset = []
data = data.append(temp)

There are 60,000 items in dataset, and it took ~5-10 minutes to load them into data (a pandas DataFrame).

> data.memory_usage(index=True).sum() 
6451973127 bytes # About 6G, close to the size of csv file.
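
One thing worth checking here: for object (string) columns, memory_usage without deep=True only counts the 8-byte pointers, not the Python string objects they point to, which is probably a large part of the gap between the ~6G reported above and the ~28G process size. A quick check, assuming data is the dataframe built above:

# deep=True also counts the actual string payloads held by object columns
print(data.memory_usage(index=True, deep=True).sum())
print(data.dtypes)  # object columns are usually the biggest consumers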

UPDATE:

Code for generating metrics

This SO answer says concat is more efficient than append. I didn't test that yet.

import time
import psutil
import pandas as pd

# cursor, count, dataset and data are initialised as in the first snippet
last_time = time.time()
for i in cursor:
    dataset.append(i)
    del i
    count += 1
    if count % 100000 == 0:
        temp = pd.DataFrame(dataset, columns=dataset[0].keys())
        dataset = []
        data = pd.concat([data, temp])
        current_time = time.time()
        cost_time = current_time - last_time
        last_time = current_time
        memory_usage = psutil.virtual_memory().used / (1024**3)
        print("count is {}, cost time is {}, memory usage is {}".format(count, cost_time, memory_usage))

Metrics of loading data into the dataframe

count is 100000, cost time is 12.346338033676147, memory usage is 0.7630538940429688
count is 200000, cost time is 8.272525310516357, memory usage is 0.806121826171875
count is 300000, cost time is 10.19885516166687, memory usage is 0.9408340454101562
count is 400000, cost time is 6.370742082595825, memory usage is 0.9675140380859375
count is 500000, cost time is 7.93895959854126, memory usage is 0.9923629760742188
count is 600000, cost time is 12.54422402381897, memory usage is 1.1193618774414062
count is 700000, cost time is 9.631025552749634, memory usage is 1.1592445373535156
count is 800000, cost time is 7.459330081939697, memory usage is 1.1680374145507812
count is 900000, cost time is 9.528786659240723, memory usage is 1.2807159423828125
count is 1000000, cost time is 7.681959867477417, memory usage is 1.2977333068847656
count is 1100000, cost time is 7.3086090087890625, memory usage is 1.3396949768066406
count is 1200000, cost time is 11.282068252563477, memory usage is 1.4544296264648438
count is 1300000, cost time is 9.21155858039856, memory usage is 1.4788284301757812
count is 1400000, cost time is 10.056787014007568, memory usage is 1.5263175964355469
count is 1500000, cost time is 12.212023973464966, memory usage is 1.6380157470703125
count is 1600000, cost time is 14.238991260528564, memory usage is 1.69512939453125
count is 1700000, cost time is 8.800130128860474, memory usage is 1.7134437561035156
count is 1800000, cost time is 11.374922275543213, memory usage is 1.8270645141601562
count is 1900000, cost time is 8.9767906665802, memory usage is 1.8472061157226562
count is 2000000, cost time is 8.989881992340088, memory usage is 1.8804588317871094
count is 2100000, cost time is 11.93136477470398, memory usage is 2.000476837158203
count is 2200000, cost time is 11.224282264709473, memory usage is 2.016876220703125
count is 2300000, cost time is 13.535449266433716, memory usage is 2.0394668579101562
count is 2400000, cost time is 12.848443269729614, memory usage is 2.1280059814453125
count is 2500000, cost time is 12.208937883377075, memory usage is 2.138408660888672
count is 2600000, cost time is 16.975553512573242, memory usage is 2.2880821228027344
count is 2700000, cost time is 19.275086879730225, memory usage is 2.287738800048828
count is 2800000, cost time is 11.386988639831543, memory usage is 2.3098106384277344
count is 2900000, cost time is 13.70014500617981, memory usage is 2.3990440368652344
count is 3000000, cost time is 10.45867395401001, memory usage is 2.420604705810547
count is 3100000, cost time is 10.75408387184143, memory usage is 2.4437637329101562
count is 3200000, cost time is 15.346243619918823, memory usage is 2.5608978271484375
count is 3300000, cost time is 12.275937795639038, memory usage is 2.5855789184570312
count is 3400000, cost time is 11.398426532745361, memory usage is 2.6102142333984375
count is 3500000, cost time is 17.990268230438232, memory usage is 2.7031402587890625
count is 3600000, cost time is 11.90847396850586, memory usage is 2.724163055419922
count is 3700000, cost time is 14.961709260940552, memory usage is 2.8711891174316406
count is 3800000, cost time is 13.13991904258728, memory usage is 2.8688430786132812
count is 3900000, cost time is 12.900552749633789, memory usage is 2.8935928344726562
count is 4000000, cost time is 15.278205633163452, memory usage is 3.01715087890625
count is 4100000, cost time is 12.421746492385864, memory usage is 3.044261932373047
count is 4200000, cost time is 12.715410232543945, memory usage is 3.1170883178710938
count is 4300000, cost time is 15.297654867172241, memory usage is 3.195178985595703
count is 4400000, cost time is 11.920997858047485, memory usage is 3.2213592529296875
count is 4500000, cost time is 12.397282123565674, memory usage is 3.2494659423828125
count is 4600000, cost time is 13.162795305252075, memory usage is 3.3564605712890625
count is 4700000, cost time is 14.042455434799194, memory usage is 3.413494110107422
count is 4800000, cost time is 10.402931451797485, memory usage is 3.3945388793945312
count is 4900000, cost time is 13.326395034790039, memory usage is 3.4888954162597656
count is 5000000, cost time is 11.762998580932617, memory usage is 3.5169677734375
count is 5100000, cost time is 13.566682577133179, memory usage is 3.610504150390625
count is 5200000, cost time is 11.697095155715942, memory usage is 3.637969970703125
count is 5300000, cost time is 11.785945415496826, memory usage is 3.702167510986328
count is 5400000, cost time is 20.747815132141113, memory usage is 3.7620506286621094
count is 5500000, cost time is 12.001267910003662, memory usage is 3.788776397705078
count is 5600000, cost time is 12.201840877532959, memory usage is 3.8513031005859375
count is 5700000, cost time is 16.82955837249756, memory usage is 3.9653396606445312
count is 5800000, cost time is 12.35794973373413, memory usage is 3.9715538024902344
count is 5900000, cost time is 12.41870403289795, memory usage is 3.999217987060547
count is 6000000, cost time is 14.590713024139404, memory usage is 4.0941619873046875
count is 6100000, cost time is 13.40040898323059, memory usage is 4.119499206542969
count is 6200000, cost time is 15.54603385925293, memory usage is 4.2159881591796875
count is 6300000, cost time is 12.232314348220825, memory usage is 4.2417449951171875
count is 6400000, cost time is 12.939337491989136, memory usage is 4.268760681152344
count is 6500000, cost time is 15.472190856933594, memory usage is 4.371849060058594
count is 6600000, cost time is 13.525130987167358, memory usage is 4.392463684082031
count is 6700000, cost time is 13.798184633255005, memory usage is 4.467185974121094
count is 6800000, cost time is 16.133020877838135, memory usage is 4.513973236083984
count is 6900000, cost time is 20.654539108276367, memory usage is 4.537406921386719
count is 7000000, cost time is 15.181331872940063, memory usage is 4.617683410644531
count is 7100000, cost time is 16.90074348449707, memory usage is 4.6607208251953125
count is 7200000, cost time is 15.26277780532837, memory usage is 4.6886749267578125
count is 7300000, cost time is 13.590909719467163, memory usage is 4.7701873779296875
count is 7400000, cost time is 17.623094081878662, memory usage is 4.812957763671875
count is 7500000, cost time is 14.904731035232544, memory usage is 4.8453521728515625
count is 7600000, cost time is 16.52383327484131, memory usage is 4.992897033691406
count is 7700000, cost time is 14.730050325393677, memory usage is 4.961498260498047
count is 7800000, cost time is 14.83224892616272, memory usage is 4.986965179443359
count is 7900000, cost time is 16.819100856781006, memory usage is 5.141094207763672
count is 8000000, cost time is 16.299737691879272, memory usage is 5.108722686767578
count is 8100000, cost time is 15.587513208389282, memory usage is 5.14031982421875
count is 8200000, cost time is 19.151288747787476, memory usage is 5.296863555908203
count is 8300000, cost time is 15.674288511276245, memory usage is 5.3394622802734375
count is 8400000, cost time is 16.563526153564453, memory usage is 5.292533874511719
count is 8500000, cost time is 20.42433261871338, memory usage is 5.447917938232422
count is 8600000, cost time is 15.694331884384155, memory usage is 5.412452697753906
count is 8700000, cost time is 20.2867329120636, memory usage is 5.571533203125
count is 8800000, cost time is 18.203043222427368, memory usage is 5.532035827636719
count is 8900000, cost time is 16.625596523284912, memory usage is 5.628833770751953
count is 9000000, cost time is 23.0804705619812, memory usage is 5.652252197265625
count is 9100000, cost time is 17.696472883224487, memory usage is 5.745880126953125
count is 9200000, cost time is 15.72276496887207, memory usage is 5.705802917480469

UPDATE2

Code for normalizing the data (small integers and categoricals)

last_time = time.time()
# downcast numeric fields and store low-cardinality fields as category
# (field names are anonymised placeholders)
dtypes = {"somecount1": "int32", "somecount2": "int32", "somecount3": "int32",
          "somecount4": "int32", "somecount5": "int32",
          "somecate1": "category", "somecate2": "category", "somecate3": "category"}
for i in cursor:
    # drop the fields that are not needed at all
    del i['something1']
    del i['sometime']
    del i['something2']
    del i['something3']
    del i['someint']
    dataset.append(i)
    del i
    count += 1
    if count % 100000 == 0:
        temp = pd.DataFrame(dataset, columns=dataset[0].keys())
        temp.fillna(0, inplace=True)
        temp = temp.astype(dtypes, errors="ignore")
        dataset = []
        data = pd.concat([data, temp])
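
Since several fields are deleted right after they are fetched, they could also be excluded on the MongoDB side with a projection so they never leave the database (a sketch; the field names are the same placeholders as above):

# a value of 0 in the projection means "do not return this field"
cursor = mongocollection.find(
    {},
    projection={"something1": 0, "sometime": 0, "something2": 0,
                "something3": 0, "someint": 0},
)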

Metrics for the optimization:

The total memory usage dropped to almost half of the figure above, but the concatenating/appending time didn't change much.

data length is 37800000,count is 37800000, cost time is 132.23220038414001, memory usage is 11.789329528808594
data length is 37900000,count is 37900000, cost time is 65.34806060791016, memory usage is 11.7882080078125
data length is 38000000,count is 38000000, cost time is 122.15527963638306, memory usage is 11.804153442382812
data length is 38100000,count is 38100000, cost time is 47.79928374290466, memory usage is 11.828723907470703
data length is 38200000,count is 38200000, cost time is 49.70282459259033, memory usage is 11.837543487548828
data length is 38300000,count is 38300000, cost time is 155.42868423461914, memory usage is 11.895767211914062
data length is 38400000,count is 38400000, cost time is 105.94551157951355, memory usage is 11.947330474853516
data length is 38500000,count is 38500000, cost time is 136.1993544101715, memory usage is 12.013351440429688
data length is 38600000,count is 38600000, cost time is 114.5268976688385, memory usage is 12.013912200927734
data length is 38700000,count is 38700000, cost time is 53.31018781661987, memory usage is 12.017452239990234
data length is 38800000,count is 38800000, cost time is 65.94741868972778, memory usage is 12.058589935302734
data length is 38900000,count is 38900000, cost time is 42.62899565696716, memory usage is 12.067787170410156
data length is 39000000,count is 39000000, cost time is 57.95372486114502, memory usage is 11.979434967041016
data length is 39100000,count is 39100000, cost time is 62.12286162376404, memory usage is 12.026973724365234
data length is 39200000,count is 39200000, cost time is 80.76535606384277, memory usage is 12.111717224121094

Answer by Amadan

What is in a CSV and what is in a dataframe are two very different things. For example, 9.9 and 9.99999999999999 in a CSV will take the same amount of space in a dataframe.

That said, data in a dataframe takes up much less space than data in a list. Constructing a list is expensive in memory; and appending to a dataframe requires pandas to make a new (bigger) dataframe, copy everything, then leave the original dataframe to be garbage-collected.

You would probably do much better if you preallocated a dataframe of 60000 rows (or however many rows you have in total); e.g.:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.empty((60000,), dtype=[
    ('x', np.uint8),
    ('y', np.float64)
]))

and then for each row insert the data for that row, without relying on the dataset list:

data.values[count,:] = rowdata_at_count

This is not type-safe, but it is pretty fast (since no allocations are occurring), so make sure rowdata_at_count is a list whose elements correspond with the column types.

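A sketch of how this could look when driven from the Mongo cursor (the column names 'x' and 'y' are only illustrative). One caveat: with mixed column dtypes, data.values can be a freshly consolidated copy rather than a view, so assignments to it may not stick; assigning through .iloc always writes into the frame, at some speed cost:

import numpy as np
import pandas as pd

n_rows = 60000  # or the total document count, if known in advance
data = pd.DataFrame(np.empty((n_rows,), dtype=[('x', np.uint8), ('y', np.float64)]))

for count, doc in enumerate(cursor):
    # the list order must match the column order of the preallocated frame
    data.iloc[count] = [doc['x'], doc['y']]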

EDIT

concat is [more] efficient than append

Yes, appending 100 lines is, I believe, like 100 concats of one line (as each append must reallocate and copy the table, just like concat). Preallocating avoids both append and concat: the table size does not change, so no reallocation or copying needs to be done.

Answer by Tony Wang

This problem was solved by hdf5 and pytables very efficiently.

1. Define a description:

from tables import *
class Description(IsDescription):
    something1 = StringCol(30)
    somecount1 = Int32Col()
    somecount2 = Int32Col()
    something2 = StringCol(10)
    somecount3 = Int32Col()
    something3 = StringCol(20)
    somecount4 = Int32Col()
    somecount5 = Int32Col()
    something4 = StringCol(29)
    sometime = Time64Col()

2. Generate an hdf5 file with pytables:

h5file = open_file("filename.h5", mode='w', title = "title_of_filename")
group = h5file.create_group("/", 'groupname', 'somethingelse')
table = h5file.create_table(group, 'readout', Description, "Readout example")
particle = table.row

3. Loop over the cursor and insert the data into the table:

for i in cursor:
    try:
        particle['something1'] = i['something1']
        # ... assign the remaining columns in the same way ...
        particle['sometime'] = i['sometime']
        particle.append()
        count += 1
        if count % 100000 == 0:
            current_time = time.time()
            cost_time = current_time - last_time
            last_time = current_time
            memory_usage = psutil.virtual_memory().used / (1024**3)
            print("count is {}, cost time is {}, memory usage is {}".format(count, cost_time, memory_usage))
    except Exception as e:
        print(e)
        print(i)
        break
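
PyTables buffers appended rows, so it is worth flushing the table and closing the file once the loop has finished (a small addition to the snippet above):

table.flush()   # write any buffered rows out to disk
h5file.close()  # close the file cleanly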

4. By now, all the data from MongoDB is stored in an hdf5 file on local disk. The final h5 size is 4.6G.

count is 100000, cost time is 61.384639501571655, memory usage is 0.6333351135253906
count is 200000, cost time is 1.8020610809326172, memory usage is 0.6135673522949219
count is 300000, cost time is 2.348151206970215, memory usage is 0.6422805786132812
count is 400000, cost time is 1.768083095550537, memory usage is 0.6340789794921875
count is 500000, cost time is 1.7722208499908447, memory usage is 0.6187820434570312
count is 600000, cost time is 2.418192148208618, memory usage is 0.6522865295410156
count is 700000, cost time is 1.8863332271575928, memory usage is 0.6428298950195312
count is 800000, cost time is 1.8162147998809814, memory usage is 0.6209907531738281
count is 900000, cost time is 2.2260451316833496, memory usage is 0.6533966064453125
count is 1000000, cost time is 1.644845962524414, memory usage is 0.6412544250488281
count is 1100000, cost time is 1.7015583515167236, memory usage is 0.6193504333496094
count is 1200000, cost time is 2.2118935585021973, memory usage is 0.6539993286132812
count is 1300000, cost time is 1.732633352279663, memory usage is 0.6396903991699219
count is 1400000, cost time is 1.7652947902679443, memory usage is 0.6167755126953125
count is 1500000, cost time is 2.49992299079895, memory usage is 0.6546707153320312
count is 1600000, cost time is 1.9869158267974854, memory usage is 0.6390419006347656
count is 1700000, cost time is 1.8238599300384521, memory usage is 0.6159439086914062
count is 1800000, cost time is 2.2168307304382324, memory usage is 0.6554222106933594
count is 1900000, cost time is 1.7153246402740479, memory usage is 0.6401138305664062

5. Metrics of loading data to h5.

The last thing you need to do is load the hdf5 file into a pandas dataframe with pd.DataFrame.from_records(table.read()). The total memory usage is 10G and the time cost is less than half an hour. A single laptop (MBP with 16G memory, running in a docker container) can solve this much more efficiently.

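A sketch of that last step, reusing the file, group and table names created above (note that StringCol columns come back as bytes and may need decoding):

import pandas as pd
from tables import open_file

h5file = open_file("filename.h5", mode="r")
table = h5file.root.groupname.readout           # the group/table created in step 2
data = pd.DataFrame.from_records(table.read())  # table.read() returns a NumPy structured array
h5file.close()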

Answer by Ando Jurai

If you saved your file to csv, you can later use read_csv with the parameter memory_map=True.
You might also select a subset of columns from the start instead of dropping them later, and read them only when you actually need them for a few parameters.
You could convert repeated textual/categorical data to dummies or integers instead.
If you can get a table where all columns share the same data type, you might also want to use NumPy instead.
Together with sparse matrices, this can significantly reduce the in-memory size and speed up loading and processing.

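A sketch of those read_csv options (the file name, column names and dtypes below are only illustrative):

import pandas as pd

data = pd.read_csv(
    "localdisk.csv",
    memory_map=True,                              # memory-map the file instead of reading through a buffer
    usecols=["colA", "colB", "colC"],             # only load the columns you actually need
    dtype={"colA": "int32", "colB": "category"},  # downcast / categorise columns up front
)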

For merging operations, the pandas docs describe merge as "high performance" compared to concat and the like (append predates these).
You might also want to use inplace=True parameters where available to avoid the cost of copies.
