pandas 在多个条件下合并数据帧 - 不是特别针对相等的值

Question

提问by Tim Vivian-Griffiths

Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I have having problems with and what I have tried already.

首先，对不起，如果这有点冗长，但我想完整描述我遇到的问题以及我已经尝试过的内容。

I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators, however, I need to make use of LESS THAN and MORE THAN.

我正在尝试在多个条件下将两个数据帧对象连接（合并）在一起。如果要满足的条件都是“等于”运算符，我知道如何执行此操作，但是，我需要使用 LESS THAN 和 MORE THAN。

The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:

数据框代表遗传信息：一个是基因组中的突变列表（称为 SNP），另一个提供关于基因在人类基因组上的位置的信息。对这些执行 df.head() 返回以下内容：

SNP DataFrame (snp_df):

SNP 数据帧 (snp_df)：

   chromosome        SNP      BP
0           1  rs3094315  752566
1           1  rs3131972  752721
2           1  rs2073814  753474
3           1  rs3115859  754503
4           1  rs3131956  758144

This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.

这显示了 SNP 参考 ID 及其位置。“BP”代表“Base-Pair”位置。

Gene DataFrame (gene_df):

基因数据框（gene_df）：

   chromosome  chr_start  chr_stop        feature_id
0           1      10954     11507  GeneID:100506145
1           1      12190     13639  GeneID:100652771
2           1      14362     29370     GeneID:653635
3           1      30366     30503  GeneID:100302278
4           1      34611     36081     GeneID:645520

This dataframe shows the locations of all the genes of interest.

该数据框显示了所有感兴趣基因的位置。

What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.

我想找出基因组中基因区域内的所有 SNP，并丢弃这些区域之外的 SNP。

If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:

如果我想根据多个（等于）条件将两个数据帧合并在一起，我会执行以下操作：

merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])

However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.

但是，在这种情况下 - 我需要找到染色体值与 Gene 数据帧中的值匹配的 SNP，并且 BP 值介于 'chr_start' 和 'chr_stop' 之间。具有挑战性的是这些数据框非常大。在当前数据集中，snp_df 有 6795021 行，gene_df 有 34362 行。

I have tried to tackle this by either looking at chromosomes or genes seperately. There are 22 different chromosome values (ints 1-22) as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasqlmodule, while the other approach is to loop through the separate genes.

我试图通过分别查看染色体或基因来解决这个问题。由于未使用性染色体，因此有 22 个不同的染色体值（整数 1-22）。这两种方法都需要很长时间。一种使用pandasql模块，而另一种方法是循环遍历单独的基因。

SQL method

SQL方法

import pandas as pd
import pandasql as psql

pysqldf = lambda q: psql.sqldf(q, globals())

q           = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""

all_dfs = []

for chromosome in snp_df['chromosome'].unique():
    this_snp    = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes  = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps  = pysqldf(q)
    all_dfs.append(genic_snps)

all_genic_snps  = pd.concat(all_dfs)

Gene iteration method

基因迭代法

all_dfs = []
for line in gene_df.iterrows():
    info    = line[1] # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
            (snp_df['BP'] >= info['chr_start']) & (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)


all_genic_snps = pd.concat(all_dfs)

Can anyone give any suggestions of a more effective way of doing this?

任何人都可以就更有效的方法提出任何建议吗？

Answer 1

回答by Tim Vivian-Griffiths

I've just thought of a way to solve this - by combining my two methods:

我刚刚想到了解决这个问题的方法 - 通过结合我的两种方法：

First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This also doesn't have to make use of any SQL queries either. I've also included a section to immediately identify any redundant genes that don't have any SNPs that fall within their range. This makes use of a double for-loop which I normally try to avoid - but in this case it works quite well.

首先，关注单个染色体，然后遍历这些较小数据框中的基因。这也不必使用任何 SQL 查询。我还包括一个部分，用于立即识别没有任何 SNP 在其范围内的任何冗余基因。这利用了我通常会尽量避免的双 for 循环 - 但在这种情况下它工作得很好。

all_dfs = []

for chromosome in snp_df['chromosome'].unique():
    this_chr_snp    = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes      = gene_df.loc[gene_df['chromosome'] == chromosome]

    # Getting rid of redundant genes
    min_bp      = this_chr_snp['BP'].min()
    max_bp      = this_chr_snp['BP'].max()
    this_genes  = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
            ~(this_genes['chr_stop'] <= min_bp)]

    for line in this_genes.iterrows():
        info     = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp    = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)

all_genic_snps  = pd.concat(all_dfs)

While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.

虽然这不会运行得非常快 - 它确实运行，所以我实际上可以得到一些答案。我仍然想知道是否有人有任何提示可以使其更有效地运行。

Answer 2

回答by khammel

You can use the following to accomplish what you're looking for:

您可以使用以下内容来完成您正在寻找的内容：

merged_df=snp_df.merge(gene_df,on=['chromosome'],how='inner')
merged_df=merged_df[(merged_df.BP>=merged_df.chr_start) & (merged_df.BP<=merged_df.chr_stop)][['SNP','feature_id']]

Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:

注意：您的示例数据框不符合您的加入条件。这是使用修改后的数据帧的示例：

snp_df
Out[193]: 
   chromosome        SNP      BP
0           1  rs3094315  752566
1           1  rs3131972   30400
2           1  rs2073814  753474
3           1  rs3115859  754503
4           1  rs3131956  758144

gene_df
Out[194]: 
   chromosome  chr_start  chr_stop        feature_id
0           1      10954     11507  GeneID:100506145
1           1      12190     13639  GeneID:100652771
2           1      14362     29370     GeneID:653635
3           1      30366     30503  GeneID:100302278
4           1      34611     36081     GeneID:645520

merged_df
Out[195]: 
         SNP        feature_id
8  rs3131972  GeneID:100302278

pandas 在多个条件下合并数据帧 - 不是特别针对相等的值

提问by Tim Vivian-Griffiths

SNP DataFrame (snp_df):

SNP 数据帧 (snp_df)：

Gene DataFrame (gene_df):

基因数据框（gene_df）：

SQL method

SQL方法

Gene iteration method

基因迭代法

回答by Tim Vivian-Griffiths

回答by khammel

相关推荐

最近更新

标签

pandas 在多个条件下合并数据帧 - 不是特别针对相等的值

提问by Tim Vivian-Griffiths

SNP DataFrame (snp_df):

SNP 数据帧 (snp_df)：

Gene DataFrame (gene_df):

基因数据框（gene_df）：

SQL method

SQL方法

Gene iteration method

基因迭代法

回答by Tim Vivian-Griffiths

回答by khammel

相关推荐

pandas 散景悬停工具提示未显示所有数据 - Ipython notebook

Python：Pandas - 按组删除第一行

在 Pandas 中将分钟格式的时间列转换为 HH:MM:SS 格式的时间

pandas Python，制作数据帧时出现内存错误

相关推荐

最近更新

标签