Python Pandas find non zero indices
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/16755731/
Asked by Tahnoon Pasha
Hi, I'm trying to use pandas to tidy up a DataFrame. It is imported from a spreadsheet and has some empty rows and columns.
I thought I could use
df.apply(numpy.nonzero(), axis=1) and df.apply(numpy.nonzero(), axis=0) to get the indices of the non-zero columns so that I could remove their inverses from the DataFrame. That gives me a list of tuples I'm not clear how to get at.
numpy.nonzero(df) produces an array of all the non-zero values, but I'm not sure how to feed that value into an all() function.
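For reference, a minimal sketch of what numpy.nonzero actually returns here, on a small made-up frame (the values stand in for the spreadsheet):

import numpy as np
import pandas as pd

# A tiny frame standing in for the spreadsheet; the values are made up.
df = pd.DataFrame([[1.0, 0.0], [0.0, 0.0], [2.0, 3.0]])

# numpy.nonzero gives a tuple of index arrays (row indices, column indices),
# not a mask, which is why the result is awkward to feed into all().
rows, cols = np.nonzero(df.values)
print(rows)  # [0 2 2]
print(cols)  # [0 0 1]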
My question is: what would be the best and quickest way of removing those rows and columns from a DataFrame that are all empty (or all have a value such as N/A)?
Thanks
EDIT: example of the source spreadsheet added
<bound method DataFrame.head of 0 1 2 3 4 5 6 7 8 9 \
0
1 some title
2 date 38477
3
4
5 cat1 cat2 cat3
6 a b c d e f
7
8 Z 167.9404 151.1389 346.197 434.3589 336.7873 80.52901
9 X 220.683 56.0029 73.73679 428.8939 483.7445 251.1877
10 C 433.0189 390.1931 251.6636 418.6703 12.21859 113.093
11
12 V 226.0135 418.1141 310.2038 153.9018 425.7491 73.08073
13 W 295.146 173.2747 2.187459 401.6453 51.47293 175.387
14 S 306.9325 157.2772 464.1394 216.248 478.3903 173.948
15 A 19.86611 73.11554 320.078 199.7598 467.8272 234.0331
16
17 F 225.511 20.97305 425.8834 190.1625 123.9103 116.3803
18 R 130.4728 96.08118 428.2007 22.46184 26.34678 359.5625
19 E 239.1516 439.7733 197.7023 121.6911 195.0169 264.5553
20 W 227.1557 471.8341 165.3779 151.7552 314.7827 367.0868
This is the def I'm using at the moment, but it feels very clunky:
def nulls(x):
    '''The nulls section to clear all nulls from the
    DataFrame'''
    # Empty rows (every cell stringifies to the empty string)
    nr = [i for i in x.index if all(str(k) in '' for k in x.ix[i])]
    # Non-empty rows
    r = [i for i in x.index if i not in nr]
    # Empty columns
    nc = [j for j in range(x.shape[1]) if all(str(k) in '' for k in x[j])]
    # Non-empty columns
    c = [j for j in range(x.shape[1]) if j not in nc]
    # Subset the non-empties
    x = x.ix[r, c]
    x = x.reindex()
    return x
Answered by Jeff Tratner
dropna(how='all') is what you are looking for (generally), but you need to load your dataframe in such a way that empty cells are treated as NaN instead of empty string. That said, you have a few options here.
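For example, a minimal sketch of how this looks once the empty cells really are NaN (the frame below is made up for illustration):

import numpy as np
import pandas as pd

# A small frame with an all-NaN row and an all-NaN column.
df = pd.DataFrame([[1.0, np.nan, 2.0],
                   [np.nan, np.nan, np.nan],
                   [3.0, np.nan, 4.0]])

# Drop rows that are entirely NaN, then columns that are entirely NaN.
cleaned = df.dropna(how='all').dropna(axis=1, how='all')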
If you are sure that everything you want to drop is the literal empty string (''), None, np.NaN, or 0, and that you don't want to keep 0, then you can just fill the NaN, convert to boolean, and check whether the sum is 0. You can tweak this depending on how you want to drop.
indexer = df.fillna(False).astype(bool)  # True where the cell holds a truthy value
drop_columns = indexer.sum(0) == 0       # columns with no truthy values at all
keep_rows = indexer.sum(1) != 0          # rows with at least one truthy value
new_df = df.drop(df.columns[drop_columns], axis=1)[keep_rows]
However, if you need to check for whitespace, or want to exclude the literal zero, then you should use applymap with a function (mostly based on this StackOverflow answer on dropping None/empty/whitespace columns) and then do the same thing as above.
def is_blank(x):
    # True when the cell is None, NaN, or only whitespace.
    return x is None or pd.isnull(x) or not str(x).strip()

# Invert so that, as in the snippet above, True marks cells with real data.
indexer = ~df.applymap(is_blank)
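From here, a sketch of the "same thing as above" step, reusing the drop logic from the earlier snippet with this new indexer:

drop_columns = indexer.sum(0) == 0
keep_rows = indexer.sum(1) != 0
new_df = df.drop(df.columns[drop_columns], axis=1)[keep_rows]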
Personally though, I suggest you add '' to na_values when you load your dataset.
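A minimal sketch of that suggestion; the file name and the choice of read_csv are only assumptions about how the spreadsheet is loaded:

import pandas as pd

# Placeholder file name; '' is treated as NaN at load time,
# so dropna(how='all') can then remove the empty rows and columns.
df = pd.read_csv('data.csv', header=None, na_values=[''])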
Brief explanation of fillna() and astype()
fillna() lets you "fill" NA values with some other value. Here, we fill with False (because bool(float('nan')) evaluates to True), but you can fill with any value or with a variety of different methods. astype converts the array from one type to another. So putting astype(bool) means that it converts the entire array to either True or False (which are equivalent to 1 and 0 respectively), and then you can just sum to find the number of True values in a row or column.
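A small worked example of that chain, on a made-up two-column frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 0.0], 'b': [np.nan, np.nan, np.nan]})

indexer = df.fillna(False).astype(bool)
print(indexer.sum(0))  # a: 1, b: 0  -> column 'b' has no truthy values
print(indexer.sum(1))  # 1, 0, 0    -> only the first row has a truthy value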
Answered by kdamica
Have you tried DataFrame.dropna()? This won't deal with the zeroes but gets rid of NaN columns and rows.
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html

