pandas: selecting only specific columns to form a DataFrame in Python

Question

Asked by BioProgram

Using python and pandas as pd, I am trying to OUTPUT a file that has a subset of columns based on specific headers.

Here is how I read the input file:

gene_input = pd.read_table(args.gene, sep="\t", index_col=0)

The structure of gene_input:

       Sample1  Sample2  Sample3  Sample4  Sample5  Sample6  Sample7  Sample8
Gene1        2       23      213      213       13      132      213     4312
Gene2        3       12    21312      123      123       23     4321      432
Gene3        5      213    21312       15      516     3421     4312     4132
Gene4        2      123      123        7      610       23     3214     4312
Gene5        1      213      213        1      152       23     1423     3421

Using a different loop, I generated TWO dictionaries. The first one has the keys Sample1 and Sample7, and the second has the keys Sample4 and Sample8.

I would like to have the following output. Note that I want the samples from each dictionary to be consecutive, i.e. all of Dictionary 1 first, then all of Dictionary 2:

       Sample1  Sample7  Sample4  Sample8
Gene1        2      213      213     4312
Gene2        3     4321      123      432
Gene3        5     4312       15     4132
Gene4        2     3214        7     4312
Gene5        1     1423        1     3421

I have tried the following but none worked:

key_num=list(dictionary1.keys())
num = genes_input[gene_input.columns.isin(key_num)]

The idea was to extract the first set of columns and then somehow combine it with the second set, but that failed. It kept giving me an attribute error, even after I updated pandas. I also tried the following:

reader = csv.reader( open(gene_input, 'rU'), delimiter='\t')
header_row = reader.next() # Gets the header

for key, value in numerator.items():
    output.write(key + "\t")
    if key in header_row:
        for row in reader:
            idx=header_row.index(key)
            output.write(idx +"\t")

as well as some other commands, loops, and lines. Sometimes only the first key ends up in the output, and other times I get an error, depending on which method I tried (for brevity, I am not listing them all here).

Anyway, if anyone has any input on how I can generate the output file of interest, I'd be grateful.

Again, here is what I want as a final output:

       Sample1  Sample7  Sample4  Sample8
Gene1        2      213      213     4312
Gene2        3     4321      123      432
Gene3        5     4312       15     4132
Gene4        2     3214        7     4312
Gene5        1     1423        1     3421

Answer 1

Answered by Dthal

For a specific set of columns in a specific order, use:
df = gene_input[['Sample1', 'Sample2', 'Sample4', 'Sample7']]
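
As a rough illustration of the list-based selection above, here is a minimal sketch that rebuilds the question's table inline (instead of reading args.gene) and picks the four columns the asker listed as the desired output. The variable name wanted is only for illustration.

```python
import pandas as pd

# Rebuild the question's table inline (values copied from the question).
gene_input = pd.DataFrame(
    {
        "Sample1": [2, 3, 5, 2, 1],
        "Sample2": [23, 12, 213, 123, 213],
        "Sample3": [213, 21312, 21312, 123, 213],
        "Sample4": [213, 123, 15, 7, 1],
        "Sample5": [13, 123, 516, 610, 152],
        "Sample6": [132, 23, 3421, 23, 23],
        "Sample7": [213, 4321, 4312, 3214, 1423],
        "Sample8": [4312, 432, 4132, 4312, 3421],
    },
    index=["Gene1", "Gene2", "Gene3", "Gene4", "Gene5"],
)

# Indexing with a list of names selects those columns,
# and the result keeps the order of the list.
wanted = ["Sample1", "Sample7", "Sample4", "Sample8"]
df = gene_input[wanted]
print(df)
```

The column order in df follows the list, not the order in the original frame, which is what makes this form convenient when the order matters.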

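If you need to build that list (['Sample1', ...]) automatically, and the names are as given, you should be able to build the two lists of keys, combine them, and then sort. Wrapping the keys in list() keeps this working on Python 3, where dict.keys() no longer returns a list:
column_names = sorted(list(dictionary1.keys()) + list(dictionary2.keys()))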
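
A small sketch of the two ways to combine the keys may help here. Sorting gives the columns in name order, while plain concatenation keeps all of Dictionary 1 before Dictionary 2, which seems closer to what the question asks for. The dictionary values below are hypothetical placeholders (only the keys matter), and the grouped order relies on dictionaries preserving insertion order, which holds on Python 3.7 and later.

```python
import pandas as pd

# Hypothetical stand-ins for the asker's dictionaries; only the keys matter.
dictionary1 = {"Sample1": "a", "Sample7": "b"}
dictionary2 = {"Sample4": "c", "Sample8": "d"}

# Small frame with the question's column names (first two genes only).
gene_input = pd.DataFrame(
    [[2, 23, 213, 213, 13, 132, 213, 4312],
     [3, 12, 21312, 123, 123, 23, 4321, 432]],
    index=["Gene1", "Gene2"],
    columns=["Sample%d" % i for i in range(1, 9)],
)

# list(d) yields the keys of d.
sorted_names = sorted(list(dictionary1) + list(dictionary2))   # name order
grouped_names = list(dictionary1) + list(dictionary2)          # dict 1, then dict 2

print(gene_input[sorted_names].columns.tolist())
print(gene_input[grouped_names].columns.tolist())
```

Note that sorting is plain string sorting, so a name like Sample10 would sort before Sample2; with the eight names in the question this does not come up.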

The names that you have should sort correctly. For output, you should be able to use:
df.to_csv(<output file name>, sep='\t')
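
To show what the tab-separated output looks like, here is a minimal sketch that writes a two-gene subset to an in-memory buffer rather than a real file (the buffer stands in for the output file name, which the answer leaves open) and reads it back.

```python
import io

import pandas as pd

# Two genes from the question, already restricted to the wanted columns.
df = pd.DataFrame(
    {
        "Sample1": [2, 3],
        "Sample7": [213, 4321],
        "Sample4": [213, 123],
        "Sample8": [4312, 432],
    },
    index=["Gene1", "Gene2"],
)

# Write tab-separated text; a file path could be passed instead of the buffer.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t")
text = buffer.getvalue()
print(text)

# Read it back the same way the question reads its input.
round_trip = pd.read_csv(io.StringIO(text), sep="\t", index_col=0)
```

The gene names are written as the first column because to_csv includes the index by default, so reading with index_col=0 restores them as row labels.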

EDIT: added part about output
