pandas 在多个条件下合并数据帧 - 不是特别针对相等的值
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/31410356/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Merging DataFrames on multiple conditions - not specifically on equal values
提问by Tim Vivian-Griffiths
Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I have having problems with and what I have tried already.
首先,对不起,如果这有点冗长,但我想完整描述我遇到的问题以及我已经尝试过的内容。
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators, however, I need to make use of LESS THAN and MORE THAN.
我正在尝试在多个条件下将两个数据帧对象连接(合并)在一起。如果要满足的条件都是“等于”运算符,我知道如何执行此操作,但是,我需要使用 LESS THAN 和 MORE THAN。
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
数据框代表遗传信息:一个是基因组中的突变列表(称为 SNP),另一个提供关于基因在人类基因组上的位置的信息。对这些执行 df.head() 返回以下内容:
SNP DataFrame (snp_df):
SNP 数据帧 (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
这显示了 SNP 参考 ID 及其位置。“BP”代表“Base-Pair”位置。
Gene DataFrame (gene_df):
基因数据框(gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
该数据框显示了所有感兴趣基因的位置。
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
我想找出基因组中基因区域内的所有 SNP,并丢弃这些区域之外的 SNP。
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
如果我想根据多个(等于)条件将两个数据帧合并在一起,我会执行以下操作:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
但是,在这种情况下 - 我需要找到染色体值与 Gene 数据帧中的值匹配的 SNP,并且 BP 值介于 'chr_start' 和 'chr_stop' 之间。具有挑战性的是这些数据框非常大。在当前数据集中,snp_df 有 6795021 行,gene_df 有 34362 行。
I have tried to tackle this by either looking at chromosomes or genes seperately. There are 22 different chromosome values (ints 1-22) as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasqlmodule, while the other approach is to loop through the separate genes.
我试图通过分别查看染色体或基因来解决这个问题。由于未使用性染色体,因此有 22 个不同的染色体值(整数 1-22)。这两种方法都需要很长时间。一种使用pandasql模块,而另一种方法是循环遍历单独的基因。
SQL method
SQL方法
import pandas as pd
import pandasql as psql
pysqldf = lambda q: psql.sqldf(q, globals())
q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
genic_snps = pysqldf(q)
all_dfs.append(genic_snps)
all_genic_snps = pd.concat(all_dfs)
Gene iteration method
基因迭代法
all_dfs = []
for line in gene_df.iterrows():
info = line[1] # Getting the Series object
this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
(snp_df['BP'] >= info['chr_start']) & (snp_df['BP'] <= info['chr_stop'])]
if this_snp.shape[0] != 0:
this_snp = this_snp[['SNP']]
this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
任何人都可以就更有效的方法提出任何建议吗?
回答by Tim Vivian-Griffiths
I've just thought of a way to solve this - by combining my two methods:
我刚刚想到了解决这个问题的方法 - 通过结合我的两种方法:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This also doesn't have to make use of any SQL queries either. I've also included a section to immediately identify any redundant genes that don't have any SNPs that fall within their range. This makes use of a double for-loop which I normally try to avoid - but in this case it works quite well.
首先,关注单个染色体,然后遍历这些较小数据框中的基因。这也不必使用任何 SQL 查询。我还包括一个部分,用于立即识别没有任何 SNP 在其范围内的任何冗余基因。这利用了我通常会尽量避免的双 for 循环 - 但在这种情况下它工作得很好。
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
# Getting rid of redundant genes
min_bp = this_chr_snp['BP'].min()
max_bp = this_chr_snp['BP'].max()
this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
~(this_genes['chr_stop'] <= min_bp)]
for line in this_genes.iterrows():
info = line[1]
this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
(this_chr_snp['BP'] <= info['chr_stop'])]
if this_snp.shape[0] != 0:
this_snp = this_snp[['SNP']]
this_snp.insert(1, 'feature_id', info['feature_id'])
all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
虽然这不会运行得非常快 - 它确实运行,所以我实际上可以得到一些答案。我仍然想知道是否有人有任何提示可以使其更有效地运行。
回答by khammel
You can use the following to accomplish what you're looking for:
您可以使用以下内容来完成您正在寻找的内容:
merged_df=snp_df.merge(gene_df,on=['chromosome'],how='inner')
merged_df=merged_df[(merged_df.BP>=merged_df.chr_start) & (merged_df.BP<=merged_df.chr_stop)][['SNP','feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
注意:您的示例数据框不符合您的加入条件。这是使用修改后的数据帧的示例:
snp_df
Out[193]:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 30400
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
gene_df
Out[194]:
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
merged_df
Out[195]:
SNP feature_id
8 rs3131972 GeneID:100302278

