Pandas DataFrame.merge MemoryError

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow CC BY-SA and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31765123/

Pandas DataFrame.merge MemoryError

Tags: python, pandas, dataframe, anaconda

Asked by Thomas Matthew

Goal

My goal is to merge two DataFrames on their common column (gene names) so I can take the product of each gene's scores across each gene row. I'd then perform a groupby on patients and cells and sum all the scores for each. The final data frame should look like this:

    patient  cell 
    Pat_1    22RV1    12
             DU145    15
             LN18      9
    Pat_2    22RV1    12
             DU145    15
             LN18      9
    Pat_3    22RV1    12
             DU145    15
             LN18      9

That last part should work fine, but I have not been able to perform the first merge on gene names due to a MemoryError. Below are snippets of each DataFrame.

Data

cell_s =

    Description          Name                      level_2  0
0  LOC100009676  100009676_at  LN18_CENTRAL_NERVOUS_SYSTEM  1
1  LOC100009676  100009676_at               22RV1_PROSTATE  2
2  LOC100009676  100009676_at               DU145_PROSTATE  3
3          AKT3      10000_at  LN18_CENTRAL_NERVOUS_SYSTEM  4
4          AKT3      10000_at               22RV1_PROSTATE  5
5          AKT3      10000_at               DU145_PROSTATE  6
6          MED6      10001_at  LN18_CENTRAL_NERVOUS_SYSTEM  7
7          MED6      10001_at               22RV1_PROSTATE  8
8          MED6      10001_at               DU145_PROSTATE  9

cell_s is about 10,000,000 rows

patient_s =

             id level_1  0
0          MED6   Pat_1  1
1          MED6   Pat_2  1
2          MED6   Pat_3  1
3  LOC100009676   Pat_1  2
4  LOC100009676   Pat_2  2
5  LOC100009676   Pat_3  2
6          ABCD   Pat_1  3
7          ABCD   Pat_2  3
8          ABCD   Pat_3  3
    ....

patient_s is about 1,200,000 rows

Code

def get_score(cell, patient):
    cell_s = cell.set_index(['Description', 'Name']).stack().reset_index()
    cell_s.columns = ['Description', 'Name', 'cell', 's1']

    patient_s = patient.set_index('id').stack().reset_index()
    patient_s.columns = ['id', 'patient', 's2']

    # fails here:
    merged = cell_s.merge(patient_s, left_on='Description', right_on='id')
    merged['score'] = merged.s1 * merged.s2

    scores = merged.groupby(['patient','cell'])['score'].sum()
    return scores

I was getting a MemoryError when initially reading these files with read_csv, but specifying the dtypes resolved that issue. Confirming that my Python is 64-bit did not fix the problem either. I haven't hit the limits of pandas, have I?
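
For reference, a minimal sketch of what passing explicit dtypes to read_csv can look like. The path, column names, and dtypes below are assumptions based on the snippets above, not the actual files:

import numpy as np
import pandas as pd

# Hypothetical example: declare the numeric score columns as 32-bit integers
# so pandas does not default them to 64-bit (column names are assumed from
# the cell_s snippet above).
cell = pd.read_csv('cells.csv', dtype={
    'LN18_CENTRAL_NERVOUS_SYSTEM': np.int32,
    '22RV1_PROSTATE': np.int32,
    'DU145_PROSTATE': np.int32,
})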

Python 3.4.3 |Anaconda 2.3.0 (64-bit)| Pandas 0.16.2

Accepted answer by Parfait

Consider two workarounds:

CSV BY CHUNKS

Apparently, read_csv can suffer performance issues on very large files, so they should be loaded in iterated chunks.

import pandas as pd

# Read each large CSV in chunks, then concatenate the chunks into one DataFrame.
cellsfilepath = r'C:\Path\To\Cells\CSVFile.csv'
tp = pd.read_csv(cellsfilepath, sep=',', iterator=True, chunksize=1000)
cell_s = pd.concat(tp, ignore_index=True)

patientsfilepath = r'C:\Path\To\Patients\CSVFile.csv'
tp = pd.read_csv(patientsfilepath, sep=',', iterator=True, chunksize=1000)
patient_s = pd.concat(tp, ignore_index=True)
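
One further memory-saving idea worth trying (a hedged aside, not from the accepted answer): the key columns repeat heavily, so converting them to pandas categoricals after loading can shrink the frames before any merge. Column names are assumed from the question's snippets:

# Categoricals store each distinct string once plus small integer codes,
# which can reduce the in-memory size of these highly repetitive columns.
for col in ['Description', 'Name']:
    cell_s[col] = cell_s[col].astype('category')
patient_s['id'] = patient_s['id'].astype('category')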

CSV VIA SQL

As a database guy, I always recommend handling large data loads and merges/joins with a SQL relational engine, which scales well for such processes. I have written many a comment on dataframe-merge Q&As to this effect, even in R. You can use any SQL database, including file-server dbs (Access, SQLite) or client-server dbs (MySQL, MSSQL, or others), wherever your dataframes derive from. Python maintains a built-in library for SQLite (otherwise use ODBC), and dataframes can be pushed into the database as tables using pandas' to_sql:

import sqlite3
import pandas as pd

dbfile = r'C:\Path\To\SQlitedb.sqlite'
cxn = sqlite3.connect(dbfile)

# Push both dataframes into SQLite as tables.
cell_s.to_sql(name='cell_s', con=cxn, if_exists='replace')
patient_s.to_sql(name='patient_s', con=cxn, if_exists='replace')

# Join on the gene name inside the database, then read the result back.
strSQL = 'SELECT * FROM cell_s c INNER JOIN patient_s p ON c.Description = p.id;'
# MIGHT HAVE TO ADJUST ABOVE FOR CELL AND PATIENT PARAMS IN DEFINED FUNCTION

merged = pd.read_sql(strSQL, cxn)
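
From here, the rest of the question's get_score logic applies directly to the joined result; a minimal sketch, assuming cell_s and patient_s were the stacked frames carrying the s1, s2, cell, and patient columns:

# Same scoring and aggregation as in the question's get_score(),
# now starting from the SQL join result.
merged['score'] = merged.s1 * merged.s2
scores = merged.groupby(['patient', 'cell'])['score'].sum()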

Answer by Skorpeo

You may have to do it in pieces, or look into blaze. http://blaze.pydata.org
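
A minimal sketch of the "in pieces" idea: merge the large stacked frame against the smaller one slice by slice, aggregate each piece, and sum the partial results at the end. Names follow the question's get_score; the chunk size is an arbitrary assumption:

import pandas as pd

def get_score_chunked(cell_s, patient_s, chunksize=100000):
    # Merge cell_s against patient_s in slices to keep peak memory low.
    partials = []
    for start in range(0, len(cell_s), chunksize):
        piece = cell_s.iloc[start:start + chunksize]
        merged = piece.merge(patient_s, left_on='Description', right_on='id')
        merged['score'] = merged.s1 * merged.s2
        partials.append(merged.groupby(['patient', 'cell'])['score'].sum())
    # Combine the per-chunk partial sums into the final totals.
    return pd.concat(partials).groupby(level=['patient', 'cell']).sum()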
