Pandas DataFrame.merge MemoryError

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow CC BY-SA and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31765123/

Pandas DataFrame.merge MemoryError

Tags: python, pandas, dataframe, anaconda

Asked by Thomas Matthew

Goal

My goal is to merge two DataFrames on their common column (gene names) so I can take the product of each gene's scores across each gene row. I'd then perform a groupby on patients and cells and sum all the scores for each. The final data frame should look like this:

    patient  cell 
    Pat_1    22RV1    12
             DU145    15
             LN18      9
    Pat_2    22RV1    12
             DU145    15
             LN18      9
    Pat_3    22RV1    12
             DU145    15
             LN18      9

That last part should work fine, but I have not been able to perform the first merge on gene names due to a MemoryError. Below are snippets of each DataFrame.

Data

cell_s =

    Description          Name                      level_2  0
0  LOC100009676  100009676_at  LN18_CENTRAL_NERVOUS_SYSTEM  1
1  LOC100009676  100009676_at               22RV1_PROSTATE  2
2  LOC100009676  100009676_at               DU145_PROSTATE  3
3          AKT3      10000_at  LN18_CENTRAL_NERVOUS_SYSTEM  4
4          AKT3      10000_at               22RV1_PROSTATE  5
5          AKT3      10000_at               DU145_PROSTATE  6
6          MED6      10001_at  LN18_CENTRAL_NERVOUS_SYSTEM  7
7          MED6      10001_at               22RV1_PROSTATE  8
8          MED6      10001_at               DU145_PROSTATE  9

cell_s is about 10,000,000 rows

patient_s =

             id level_1  0
0          MED6   Pat_1  1
1          MED6   Pat_2  1
2          MED6   Pat_3  1
3  LOC100009676   Pat_1  2
4  LOC100009676   Pat_2  2
5  LOC100009676   Pat_3  2
6          ABCD   Pat_1  3
7          ABCD   Pat_2  3
8          ABCD   Pat_3  3
    ....

patient_s is about 1,200,000 rows

Code

def get_score(cell, patient):
    cell_s = cell.set_index(['Description', 'Name']).stack().reset_index()
    cell_s.columns = ['Description', 'Name', 'cell', 's1']

    patient_s = patient.set_index('id').stack().reset_index()
    patient_s.columns = ['id', 'patient', 's2']

    # fails here:
    merged = cell_s.merge(patient_s, left_on='Description', right_on='id')
    merged['score'] = merged.s1 * merged.s2

    scores = merged.groupby(['patient','cell'])['score'].sum()
    return scores

I was getting a MemoryError when initially reading these files with read_csv, but specifying the dtypes resolved that issue. Confirming that my Python is 64-bit did not fix the problem either. I haven't hit the limits of pandas, have I?
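
For reference, a minimal sketch of what passing explicit dtypes to read_csv can look like. The path, column names, and dtypes below are assumptions based on the snippets above, not the actual files:

import numpy as np
import pandas as pd

# Hypothetical example: declare the numeric score columns as 32-bit integers
# so pandas does not default them to 64-bit (column names are assumed from
# the cell_s snippet above).
cell = pd.read_csv('cells.csv', dtype={
    'LN18_CENTRAL_NERVOUS_SYSTEM': np.int32,
    '22RV1_PROSTATE': np.int32,
    'DU145_PROSTATE': np.int32,
})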

Python 3.4.3 |Anaconda 2.3.0 (64-bit)| Pandas 0.16.2

Accepted answer by Parfait

Consider two workarounds:

CSV BY CHUNKS

Apparently, read_csv can suffer performance issues on very large files, so they should be loaded in iterated chunks.

import pandas as pd

# Read each large CSV in chunks, then concatenate the chunks into one DataFrame.
cellsfilepath = r'C:\Path\To\Cells\CSVFile.csv'
tp = pd.read_csv(cellsfilepath, sep=',', iterator=True, chunksize=1000)
cell_s = pd.concat(tp, ignore_index=True)

patientsfilepath = r'C:\Path\To\Patients\CSVFile.csv'
tp = pd.read_csv(patientsfilepath, sep=',', iterator=True, chunksize=1000)
patient_s = pd.concat(tp, ignore_index=True)
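
One further memory-saving idea worth trying (a hedged aside, not from the accepted answer): the key columns repeat heavily, so converting them to pandas categoricals after loading can shrink the frames before any merge. Column names are assumed from the question's snippets:

# Categoricals store each distinct string once plus small integer codes,
# which can reduce the in-memory size of these highly repetitive columns.
for col in ['Description', 'Name']:
    cell_s[col] = cell_s[col].astype('category')
patient_s['id'] = patient_s['id'].astype('category')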

CSV VIA SQL

As a database guy, I always recommend handling large data loads and merges/joins with a SQL relational engine, which scales well for such processes. I have written many a comment on dataframe-merge Q&As to this effect, even in R. You can use any SQL database, including file-server dbs (Access, SQLite) or client-server dbs (MySQL, MSSQL, or others), wherever your dataframes derive from. Python maintains a built-in library for SQLite (otherwise use ODBC), and dataframes can be pushed into the database as tables using pandas' to_sql:

import sqlite3
import pandas as pd

dbfile = r'C:\Path\To\SQlitedb.sqlite'
cxn = sqlite3.connect(dbfile)

# Push both dataframes into SQLite as tables.
cell_s.to_sql(name='cell_s', con=cxn, if_exists='replace')
patient_s.to_sql(name='patient_s', con=cxn, if_exists='replace')

# Join on the gene name inside the database, then read the result back.
strSQL = 'SELECT * FROM cell_s c INNER JOIN patient_s p ON c.Description = p.id;'
# MIGHT HAVE TO ADJUST ABOVE FOR CELL AND PATIENT PARAMS IN DEFINED FUNCTION

merged = pd.read_sql(strSQL, cxn)
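
From here, the rest of the question's get_score logic applies directly to the joined result; a minimal sketch, assuming cell_s and patient_s were the stacked frames carrying the s1, s2, cell, and patient columns:

# Same scoring and aggregation as in the question's get_score(),
# now starting from the SQL join result.
merged['score'] = merged.s1 * merged.s2
scores = merged.groupby(['patient', 'cell'])['score'].sum()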

Answer by Skorpeo

You may have to do it in pieces, or look into blaze. http://blaze.pydata.org
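
A minimal sketch of the "in pieces" idea: merge the large stacked frame against the smaller one slice by slice, aggregate each piece, and sum the partial results at the end. Names follow the question's get_score; the chunk size is an arbitrary assumption:

import pandas as pd

def get_score_chunked(cell_s, patient_s, chunksize=100000):
    # Merge cell_s against patient_s in slices to keep peak memory low.
    partials = []
    for start in range(0, len(cell_s), chunksize):
        piece = cell_s.iloc[start:start + chunksize]
        merged = piece.merge(patient_s, left_on='Description', right_on='id')
        merged['score'] = merged.s1 * merged.s2
        partials.append(merged.groupby(['patient', 'cell'])['score'].sum())
    # Combine the per-chunk partial sums into the final totals.
    return pd.concat(partials).groupby(level=['patient', 'cell']).sum()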
