Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/16755731/

Time: 2020-09-13 20:51:12  Source: igfitidea

Python Pandas find non zero indices

python, numpy, pandas

Asked by Tahnoon Pasha

Hi I'm trying to use pandas to tidy up a DataFrame. It is imported from a spreadsheet and has some empty rows and columns.

I thought I could use

df.apply(numpy.nonzero(),axis=1) and df.apply(numpy.nonzero(),axis=0) to get the indices of the non-zero columns so that I could remove their inverses from the DataFrame. That gives me a list of tuples that I'm not sure how to access.

numpy.nonzero(df) produces an array of all the non-zero values, but I'm not sure how to feed that value into an all() function.

My question is: what would be the best and quickest way of removing from a DataFrame those rows and columns that are all empty (or all have a value such as N/A)?

Thanks

EDIT: example of the source spreadsheet added

<bound method DataFrame.head of             0         1         2  3         4         5  6         7         8  9   \
0                                                                                     
1   some title                                                                        
2         date     38477                                                              
3                                                                                     
4                                                                                     
5                   cat1                   cat2                   cat3                
6                      a         b            c         d            e         f      
7                                                                                     
8            Z  167.9404  151.1389      346.197  434.3589     336.7873  80.52901      
9            X   220.683   56.0029     73.73679  428.8939     483.7445  251.1877      
10           C  433.0189  390.1931     251.6636  418.6703     12.21859   113.093      
11                                                                                    
12           V  226.0135  418.1141     310.2038  153.9018     425.7491  73.08073      
13           W   295.146  173.2747     2.187459  401.6453     51.47293   175.387      
14           S  306.9325  157.2772     464.1394   216.248     478.3903   173.948      
15           A  19.86611  73.11554      320.078  199.7598     467.8272  234.0331      
16                                                                                    
17           F   225.511  20.97305     425.8834  190.1625     123.9103  116.3803      
18           R  130.4728  96.08118     428.2007  22.46184     26.34678  359.5625      
19           E  239.1516  439.7733     197.7023  121.6911     195.0169  264.5553      
20           W  227.1557  471.8341     165.3779  151.7552     314.7827  367.0868      

This is the def I'm using at the moment, but it feels very clunky.

def nulls(x):
    '''The NULLS section: clear all empty rows and columns from the DataFrame.'''
    # Empty rows (.ix was removed in pandas 1.0; .loc is the label-based replacement)
    nr = [i for i in x.index if all(str(k) == '' for k in x.loc[i])]
    # Non-empty rows
    r = [i for i in x.index if i not in nr]
    # Empty columns, checked by position
    nc = [j for j in range(x.shape[1]) if all(str(k) == '' for k in x.iloc[:, j])]
    # Non-empty columns
    c = [j for j in range(x.shape[1]) if j not in nc]
    # Subset the non-empties: rows by label, columns by position
    x = x.loc[r].iloc[:, c]
    return x

Answered by Jeff Tratner

dropna(how='all') is what you are looking for (generally), but you need to load your DataFrame in such a way that empty cells are treated as NaN instead of as empty strings. That said, you have a few options here.
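A minimal sketch of that idea, assuming the empty cells arrived as empty strings. The small frame is made up for illustration and is not the asker's data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the imported spreadsheet:
# row 1 and column "b" contain only empty strings.
df = pd.DataFrame({
    "a": ["Z", "", "X"],
    "b": ["", "", ""],
    "c": [167.9, "", 220.7],
})

# Turn empty strings into NaN so that dropna can see them.
cleaned = df.replace("", np.nan)

# Drop rows, then columns, in which every value is NaN.
cleaned = cleaned.dropna(how="all", axis=0).dropna(how="all", axis=1)
print(cleaned)
```

Only row 1 and column "b" are removed here; a row or column with at least one real value is kept.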

If you are sure that everything you want to drop is one of '', None, np.NaN, or 0, and that you don't want to keep 0, then you can just fill the NaN values, convert to boolean, and check whether the sum is 0. You can tweak this depending on what you want to drop.

indexer = df.fillna(False).astype(bool)
drop_columns = indexer.sum(0) == 0
keep_rows = indexer.sum(1) != 0

new_df = df.drop(df.columns[drop_columns], axis=1)[keep_rows]

However, if you need to check for whitespace, or want to keep the literal zero, then you should use applymap with a function (mostly based on this StackOverflow answer on dropping None/empty/whitespace columns) and then do the same thing as above.

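import pandas as pd

def is_blank(x):
    # True for None, NaN, and empty or whitespace-only strings
    return x is None or pd.isnull(x) or not str(x).strip()

# Invert so True marks cells with content, as in the indexer above
indexer = ~df.applymap(is_blank)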
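As a rough end-to-end sketch of the whitespace-aware variant, the example below runs a blank check over a made-up frame and then drops the all-blank rows and columns. The data and the names `blank`, `keep_rows`, and `keep_columns` are illustrative assumptions. It uses `apply` with `Series.map` for the element-wise step, which should behave the same on pandas versions where `applymap` is deprecated:

```python
import numpy as np
import pandas as pd

def is_blank(x):
    # True for None, NaN, and empty or whitespace-only strings; 0 is not blank.
    return x is None or pd.isnull(x) or not str(x).strip()

# Hypothetical data: row 1 and column "b" are blank, and "c" holds a real zero.
df = pd.DataFrame({
    "a": ["Z", "  ", "X"],
    "b": ["", None, np.nan],
    "c": [0, " ", 3.5],
})

# Element-wise check, column by column.
blank = df.apply(lambda col: col.map(is_blank))

# Keep a row or column unless every cell in it is blank.
keep_rows = ~blank.all(axis=1)
keep_columns = ~blank.all(axis=0)
new_df = df.loc[keep_rows, keep_columns]
print(new_df)
```

Because the check goes through `str(x).strip()`, the literal 0 in column "c" counts as content and survives, unlike in the `astype(bool)` approach.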

Personally, though, I suggest you add '' to na_values when you load your dataset.
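A small sketch of loading with na_values, assuming the data can be read with read_csv; the CSV text below is invented for the example. Recent pandas versions already treat both '' and 'N/A' as missing by default, so listing them mainly documents intent; add whatever other placeholder your file uses:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the exported spreadsheet.
raw = io.StringIO(
    "label,x,y,z\n"
    "Z,167.9,,151.1\n"
    ",,,\n"
    "X,N/A,,56.0\n"
)

# Cells matching any of these strings are loaded as NaN.
df = pd.read_csv(raw, na_values=["", "N/A"])

# With NaN in place, dropna can remove the all-empty row and column.
df = df.dropna(how="all", axis=0).dropna(how="all", axis=1)
print(df)
```

The empty row and the empty "y" column are dropped, while the single missing value in "x" stays as NaN.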



Brief explanation of fillna() and astype()

fillna() lets you "fill" NA values with some other value. Here, we fill with False (because bool(float('nan')) evaluates to True), but you can fill with any value or with a variety of different methods. astype converts the array from one type to another. So astype(bool) converts the entire array to either True or False (which are equivalent to 1 and 0 respectively), and then you can just sum to find the number of True values in a row or column.
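To make the two steps concrete, here is a small sketch on a made-up frame of object columns; the data is an assumption for illustration only:

```python
import numpy as np
import pandas as pd

# NaN is truthy, which is why it has to be filled before the boolean cast.
nan_is_truthy = bool(float("nan"))

# Hypothetical frame: row 1 and column "b" hold only '', NaN, or 0.
df = pd.DataFrame({
    "a": ["Z", "", "X"],
    "b": [np.nan, "", 0],
    "c": [167.9, "", 220.7],
})

# Fill NaN with False, then cast every cell to its truth value.
indexer = df.fillna(False).astype(bool)

# True counts as 1, so the sums count the cells with content.
per_column = indexer.sum(0)
per_row = indexer.sum(1)
print(per_column)
print(per_row)
```

A sum of 0 marks a row or column with no content, which is exactly what the drop_columns and keep_rows masks test for.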

Answered by kdamica

Have you tried DataFrame.dropna()? This won't deal with the zeroes but gets rid of NaN columns and rows.

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html
