Pandas: for loop through columns

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/31857707/

Time: 2020-09-13 23:44:47  Source: igfitidea

python pandas

Asked by DJF

My data looks like:

SNP Name ss715583617 ss715592335 ss715591044 ss715598181
4    PI081762           T           A           A           T
5   PI101404A           T           A           A           T
6   PI101404B           T           A           A           T
7    PI135624           T           A           A           T
8    PI326581           T           A           A           T
9   PI326582A           T           A           A           T
10  PI326582B           T           A           A           T
11   PI339732           T           A           A           T
12  PI339735A           T           A           A           T
13  PI339735B           T           A           A           T
14  PI342618A           T           A           A           T

In reality I have a dataset of 50,000 columns and 479 rows. My objective is to go through each column of characters and convert the data to integers depending on which character is the most abundant.

Right now I have the data input, and I have more or less written the function I would like to use to analyze each column separately. However, I can't quite understand how to use a for loop or the apply function across all of the columns in the dataset. I would prefer not to hardcode the columns because I will have 40,000~50,000 columns to analyze.

My code so far is:

import pandas as pd

df = pd.read_csv("/home/dfreese/Desktop/testSNPtext", delimiter='\t')

df.head() # check that the file format fits

# ncol df
df2 = df.iloc[4:-1] # Select the rows you want to analyze in a subset df
print(df2)

My function:

def countAlleles(N): 
    # N is just suppose to be the column, ideally once I've optimized the function
    # I need to analyze every column 

    # Will hold the counts of each letter in the column
    letterCount = [] 
    # This is a parallel array to know the order
    letterOrder = {'T','A','G','C','H','U'}  


    # Boolean to use which one is the maximum 
    TFlag = None 
    AFlag = None 
    GFlag = None
    CFlag = None
    HFlag = None
    UFlag = None 

# Loop through the column to determine which one is the maximum 
for i in range(len(N)): # How do I get index information of the column?
    if(N[i] == 'T'): # If the element in the column is T
        letterCount[0] = letterCount[0] + 1
    elif(N[i] == 'A'):
        letterCount[1] = letterCount [1] + 1
    elif (N[i] == 'G'):
        letterCount[2] = letterCount [2] + 1
    elif (N[i] == 'C'):
        lettercount[3] = letterCount[3] + 1
    elif(N[i] == 'H'):
        letterCount[4] = letterCount[4] + 1
    else:
        letterCount[5] = letterCount[5] + 1

max = letterCount[0] # This will hold the value of maximum
mIndex = 0 # This holds the index position with the max value

# Determine which one is max
for i in range(len(letterCount)):
    if (letterCount[i] > max):
        max = letterCount[i]
        mIndex = i

So I designed the function to take a column as input, hoping to be able to iterate through all the columns of the dataframe. My main question is:

1) How would I pass each column as a parameter, so that the for loop runs through the elements of each column?

My major source of confusion is how indexes are used in pandas. I'm familiar with 2-dimensional arrays in C++ and Java, and that is where most of my knowledge stems from.

I'm attempting to use the apply function:

df2 = df2.apply(countAlleles('ss715583617'), axis=2)

but it doesn't seem that my application is correct.

Answered by chris-sc

Updated answer: The dataframe is now analyzed and its values are replaced with ints according to the number of occurrences of each allele per column. The problem of what happens when one allele has the same number of occurrences as another remains: the assignment will not be unique.

import pandas as pd                                                              
import numpy as np                                                               

df = pd.DataFrame.from_dict({"ss1": ["T", "T", "T", "G"],                        
                             "ss2": ["G", "G", "T", "A"],                        
                             "ss3": ["C", "H", "C", "H"]})                       

letterOrder = np.array(['T', 'A', 'G', 'C', 'H', 'U'])                           

for col in df:                                                                   
    alleles = list()                                                             
    for num, allele in enumerate(letterOrder):                                   
        alleles.append(df[col].str.count(allele).sum())                          

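    # alleles sorted from most to least frequent
    repl = letterOrder[np.argsort(alleles)][::-1]
    # replace each char by its 1-based rank; assign the result back instead of
    # calling replace(..., inplace=True) on df[col], which may act on a copy
    df[col] = df[col].replace({char: num + 1 for num, char in enumerate(repl)})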

print(df)

This will change the initial dataframe

  ss1 ss2 ss3
0   T   G   C
1   T   G   H
2   T   T   C
3   G   A   H

to the new dataframe with ints sorted according to the number of occurrences:

   ss1  ss2  ss3
0    1    1    2
1    1    1    1
2    1    3    2
3    2    2    1
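
Since the question asked about apply, the same per-column ranking can also be sketched with DataFrame.apply. This is only an illustration under assumptions: the helper name rank_alleles is hypothetical and not part of the original answer, and it relies on value_counts() returning counts sorted from most to least frequent. Ties are still resolved in whatever order value_counts() happens to return them.

```python
import pandas as pd

df = pd.DataFrame({"ss1": ["T", "T", "T", "G"],
                   "ss2": ["G", "G", "T", "A"],
                   "ss3": ["C", "H", "C", "H"]})


def rank_alleles(column):
    """Hypothetical helper: map each letter to its 1-based frequency rank."""
    # value_counts() sorts by frequency, most common first
    counts = column.value_counts()
    ranks = {letter: position + 1 for position, letter in enumerate(counts.index)}
    return column.map(ranks)


# With axis=0, apply passes each column to the function as a Series
ranked = df.apply(rank_alleles, axis=0)
print(ranked)
```

Note that apply receives the function object itself (rank_alleles, not rank_alleles(...)), and that a DataFrame only has axis 0 and axis 1.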


For reference, the old answer, which gives the most frequent allele and its count for each column:

import pandas as pd                                                              
import numpy as np                                                               
from collections import OrderedDict                                              

df = pd.DataFrame.from_dict({"ss1": ["T", "T", "T", "G"],                                     
                             "ss2": ["G", "G", "T", "A"],                           
                             "ss3": ["C", "H", "C", "H"]})                          

letterOrder = ['T', 'A', 'G', 'C', 'H', 'U']                                        

full_results = OrderedDict()                                                        
for col in df:                                                                      
    alleles = list()                                                                
    for num, allele in enumerate(letterOrder):                                      
        alleles.append(df[col].str.count(allele).sum())                             

    full_results[col] = [letterOrder[np.argmax(alleles)], np.max(alleles)]       

print(full_results)

This will give:

OrderedDict([('ss1', ['T', 3]), ('ss2', ['G', 2]), ('ss3', ['C', 2])])

The key in the dict is the name of your column, and the value is a list with [allele, number_of_occurrences].

I used OrderedDict to keep the order of your columns and their names, but if you don't need the order you can use a dict, or if you don't need the column names (and the implicit ID is enough), use a list.

But be careful: if two (or more) characters in one column have the same count, this will only return one of them. You would need to add an additional test for this.
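
One possible form of that additional test, as a rough sketch only: collect every letter whose count equals the column maximum, so a tie shows up as a list with more than one entry. The helper name most_common_alleles is an assumption made for illustration and does not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({"ss1": ["T", "T", "T", "G"],
                   "ss2": ["G", "G", "T", "A"],
                   "ss3": ["C", "H", "C", "H"]})


def most_common_alleles(column):
    """Hypothetical helper: return all letters that share the highest count."""
    counts = column.value_counts()
    # keep every letter whose count equals the maximum, not just the first one
    return counts[counts == counts.max()].index.tolist()


tied = {col: most_common_alleles(df[col]) for col in df}

for col, letters in tied.items():
    if len(letters) > 1:
        print(col, "has a tie between", sorted(letters))
```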

Answered by dermen

To iterate over columns in e.g. a for loop, use list(df). In any case, you can easily do what you are attempting using collections.Counter.
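
As a small illustration of iterating over columns (the three-column frame here is made up for the example, not the asker's real data), list(df) gives the column labels, and DataFrame.items() yields each label together with its column:

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({"Name": ["PI081762", "PI101404A"],
                   "ss715583617": ["T", "G"],
                   "ss715592335": ["A", "A"]})

# list(df) returns the column labels
column_names = list(df)

# Loop over the labels and look each column up by name
for col in column_names:
    print(col, df[col].tolist())

# DataFrame.items() yields (label, Series) pairs directly
for label, series in df.items():
    print(label, len(series))
```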

Assume a dataframe df:

df
#         Name ss715583617 ss715592335 ss715591044 ss715598181
#0    PI081762           T           A           A           T
#1   PI101404A           T           A           A           T
#2   PI101404B           T           A           A           T
#3    PI135624           T           A           A           T
#4    PI326581           T           A           F           D
#5   PI326582A           G           A           F           T
#6   PI326582B           G           A           A           T
#7    PI339732           D           H           A           T
#8   PI339735A           D           A           A           T
#9   PI339735B           A           A           A           T
#10  PI342618A           D           A           A           T

From what I gather from the comments section and your original post, you want to replace each character in each column according to its frequency of occurrence. This is one approach:

Make the Counters

from collections import Counter

cols       = [ col for col in list(df) if col not in ['Name'] ] # all the column you want to operate on
col_counters = { col: Counter( df[col] ) for col in cols } 
#{'ss715583617': Counter({'T': 5, 'D': 3, 'G': 2, 'A': 1}),
# 'ss715591044': Counter({'A': 9, 'F': 2}),
# 'ss715592335': Counter({'A': 10, 'H': 1}),
# 'ss715598181': Counter({'T': 10, 'D': 1})}

Sort the items in each Counter

sort_func = lambda items: sorted(items, key=lambda x:x[1], reverse=True ) # sort a nested list according to second element in each sublist
sort_result = { col: sort_func(counter.items()) for col,counter in col_counters.items() } # .items() replaces the Python 2 .iteritems()
#{'ss715583617': [('T', 5), ('D', 3), ('G', 2), ('A', 1)],
# 'ss715591044': [('A', 9), ('F', 2)],
# 'ss715592335': [('A', 10), ('H', 1)],
# 'ss715598181': [('T', 10), ('D', 1)]}

Replace letters in dataframe according to sort result

Here we will use enumerate to get the position of each sort result:

mapper = { col: {letter:i+1 for i,(letter,count) in enumerate(sort_result[col]) } for col in sort_result } # unpack each (letter, count) pair so the keys are letters
#{'ss715583617': {'A': 4, 'D': 2, 'G': 3, 'T': 1},
# 'ss715591044': {'A': 1, 'F': 2},
# 'ss715592335': {'A': 1, 'H': 2},
# 'ss715598181': {'D': 2, 'T': 1}}

df.replace( to_replace=mapper, inplace=True)
#         Name  ss715583617  ss715592335  ss715591044  ss715598181
#0    PI081762            1            1            1            1
#1   PI101404A            1            1            1            1
#2   PI101404B            1            1            1            1
#3    PI135624            1            1            1            1
#4    PI326581            1            1            2            2
#5   PI326582A            3            1            2            1
#6   PI326582B            3            1            1            1
#7    PI339732            2            2            1            1
#8   PI339735A            2            1            1            1
#9   PI339735B            4            1            1            1
#10  PI342618A            2            1            1            1

This should be enough to get you on your way. I am not sure how you want to handle ties, for instance if a column has the same number of T and G.
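
If a fixed rule for ties is acceptable, one option is to break them alphabetically. The sketch below is an assumption about what might be wanted, not something the answer prescribes; the toy data, the column names ss1 and ss2, and the names ordered and encoded are made up for illustration.

```python
from collections import Counter

import pandas as pd

# Toy data, assumed for illustration: ss1 has a tie between G and T
df = pd.DataFrame({"Name": ["PI081762", "PI101404A", "PI101404B", "PI135624"],
                   "ss1": ["T", "G", "G", "T"],
                   "ss2": ["A", "A", "A", "H"]})

cols = [col for col in df if col != "Name"]

mapper = {}
for col in cols:
    counts = Counter(df[col])
    # Sort by descending count, then alphabetically, so ties are deterministic
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    mapper[col] = {letter: rank + 1 for rank, (letter, _) in enumerate(ordered)}

# Build a new frame rather than modifying df in place
encoded = df.copy()
for col in cols:
    encoded[col] = df[col].map(mapper[col])

print(encoded)
```

With this rule G and T in ss1 both occur twice, and G gets rank 1 only because it sorts first alphabetically.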