Python 熊猫:多列的 to_numeric

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/36814100/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 18:23:35  来源:igfitidea点击:

pandas: to_numeric for multiple columns

pythonpandas

提问by Michael Perdue

I'm working with the following df:

我正在使用以下df

c.sort_values('2005', ascending=False).head(3)
      GeoName ComponentName     IndustryId IndustryClassification Description                                2004 2005  2006  2007  2008  2009 2010 2011 2012 2013 2014
37926 Alabama Real GDP by state 9          213                    Support activities for mining              99   98    117   117   115   87   96   95   103  102  (NA)
37951 Alabama Real GDP by state 34         42                     Wholesale trade                            9898 10613 10952 11034 11075 9722 9765 9703 9600 9884 10199
37932 Alabama Real GDP by state 15         327                    Nonmetallic mineral products manufacturing 980  968   940   1084  861   724  714  701  589  641  (NA)

I want to force numeric on all of the years:

我想在所有年份都强制使用数字:

c['2014'] = pd.to_numeric(c['2014'], errors='coerce')

is there an easy way to do this or do I have to type them all out?

有没有一种简单的方法可以做到这一点,还是我必须将它们全部输入?

回答by MaxU

UPDATE:you don't need to convert your values afterwards, you can do it on-the-flywhen reading your CSV:

更新:之后您不需要转换您的值,您可以在读取 CSV 时即时进行:

In [165]: df=pd.read_csv(url, index_col=0, na_values=['(NA)']).fillna(0)

In [166]: df.dtypes
Out[166]:
GeoName                    object
ComponentName              object
IndustryId                  int64
IndustryClassification     object
Description                object
2004                        int64
2005                        int64
2006                        int64
2007                        int64
2008                        int64
2009                        int64
2010                        int64
2011                        int64
2012                        int64
2013                        int64
2014                      float64
dtype: object

If you need to convert multiple columns to numeric dtypes - use the following technique:

如果您需要将多列转换为数字 dtypes - 使用以下技术:

Sample source DF:

示例源 DF:

In [271]: df
Out[271]:
     id    a  b  c  d  e    f
0  id_3  AAA  6  3  5  8    1
1  id_9    3  7  5  7  3  BBB
2  id_7    4  2  3  5  4    2
3  id_0    7  3  5  7  9    4
4  id_0    2  4  6  4  0    2

In [272]: df.dtypes
Out[272]:
id    object
a     object
b      int64
c      int64
d      int64
e      int64
f     object
dtype: object

Converting selected columns to numeric dtypes:

将选定的列转换为数字 dtypes:

In [273]: cols = df.columns.drop('id')

In [274]: df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

In [275]: df
Out[275]:
     id    a  b  c  d  e    f
0  id_3  NaN  6  3  5  8  1.0
1  id_9  3.0  7  5  7  3  NaN
2  id_7  4.0  2  3  5  4  2.0
3  id_0  7.0  3  5  7  9  4.0
4  id_0  2.0  4  6  4  0  2.0

In [276]: df.dtypes
Out[276]:
id     object
a     float64
b       int64
c       int64
d       int64
e       int64
f     float64
dtype: object

PS if you want to select allstring(object) columns use the following simple trick:

PS,如果您想选择所有string( object) 列,请使用以下简单技巧:

cols = df.columns[df.dtypes.eq('object')]

回答by muon

another way is using apply, one liner:

另一种方法是使用apply,一个班轮:

cols = ['col1', 'col2', 'col3']
data[cols] = data[cols].apply(pd.to_numeric, errors='coerce', axis=1)

回答by jezrael

You can use:

您可以使用:

print df.columns[5:]
Index([u'2004', u'2005', u'2006', u'2007', u'2008', u'2009', u'2010', u'2011',
       u'2012', u'2013', u'2014'],
      dtype='object')

for col in  df.columns[5:]:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print df
       GeoName      ComponentName  IndustryId  IndustryClassification  \
37926  Alabama  Real GDP by state           9                     213   
37951  Alabama  Real GDP by state          34                      42   
37932  Alabama  Real GDP by state          15                     327   

                                      Description  2004   2005   2006   2007  \
37926               Support activities for mining    99     98    117    117   
37951                            Wholesale  trade  9898  10613  10952  11034   
37932  Nonmetallic mineral products manufacturing   980    968    940   1084   

        2008  2009  2010  2011  2012  2013     2014  
37926    115    87    96    95   103   102      NaN  
37951  11075  9722  9765  9703  9600  9884  10199.0  
37932    861   724   714   701   589   641      NaN  

Another solution with filter:

另一个解决方案filter

print df.filter(like='20')
       2004   2005   2006   2007   2008  2009  2010  2011  2012  2013   2014
37926    99     98    117    117    115    87    96    95   103   102   (NA)
37951  9898  10613  10952  11034  11075  9722  9765  9703  9600  9884  10199
37932   980    968    940   1084    861   724   714   701   589   641   (NA)

for col in  df.filter(like='20').columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
print df
       GeoName      ComponentName  IndustryId  IndustryClassification  \
37926  Alabama  Real GDP by state           9                     213   
37951  Alabama  Real GDP by state          34                      42   
37932  Alabama  Real GDP by state          15                     327   

                                      Description  2004   2005   2006   2007  \
37926               Support activities for mining    99     98    117    117   
37951                            Wholesale  trade  9898  10613  10952  11034   
37932  Nonmetallic mineral products manufacturing   980    968    940   1084   

        2008  2009  2010  2011  2012  2013     2014  
37926    115    87    96    95   103   102      NaN  
37951  11075  9722  9765  9703  9600  9884  10199.0  
37932    861   724   714   701   589   641      NaN  

回答by Ali Siddiq

df[cols] = pd.to_numeric(df[cols].stack(), errors='coerce').unstack()

回答by Jeremy

If you are looking for a range of columns, you can try this:

如果您正在寻找一系列列,您可以试试这个:

df.iloc[7:] = df.iloc[7:].astype(float)

The examples above will convert type to be float, for all the columns begin with the 7th to the end. You of course can use different type or different range.

上面的示例将类型转换为浮点数,因为所有列都从第 7 行开始到末尾。您当然可以使用不同的类型或不同的范围。

I think this is useful when you have a big range of columns to convert and a lot of rows. It doesn't make you go over each row by yourself - I believe numpy do it more efficiently.

我认为当您要转换的列范围很大且行很多时,这很有用。它不会让你自己遍历每一行 - 我相信 numpy 会更有效地完成它。

This is useful only if you know that all the required columns contain numbers only - it will not change "bad values" (like string) to be NaN for you.

仅当您知道所有必需的列仅包含数字时,这才有用 - 它不会将“坏值”(如字符串)更改为 NaN。