Note: this page reproduces a popular StackOverflow question and its answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/39812493/

Date: 2020-09-14 02:07:13  Source: igfitidea

Dealing with missing data in Pandas read_csv

python pandas nan na missing-data

Asked by abalter

I have not found a satisfying solution to the problem of missing data when importing CSV data into a pandas DataFrame.

I have datasets where I don't know in advance what the columns or data types are. I would like pandas to do a better job inferring how to read in the data.

I haven't found any combination of na_values=... that really helps.

Consider the following csv files:

no_holes.csv

letter,number
a,1
b,2
c,3
d,4

with_holes.csv

letter,number
a,1
,2
b, 
,4

empty_column.csv

letters,numbers
,1
,2
,3
,4

with_NA.csv

letter,number
a,1
b,NA
NA,3
d,4
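
To reproduce the results that follow, the four files listed above can be written to disk with a short script. This is only a convenience sketch and not part of the original question: the names `CSV_FILES` and `write_csv_files`, and the `target_dir` parameter, are hypothetical.

```python
from pathlib import Path

# File contents copied from the listings above. Note the single trailing
# space after the comma on the "b, " line of with_holes.csv.
CSV_FILES = {
    "no_holes.csv": "letter,number\na,1\nb,2\nc,3\nd,4\n",
    "with_holes.csv": "letter,number\na,1\n,2\nb, \n,4\n",
    "empty_column.csv": "letters,numbers\n,1\n,2\n,3\n,4\n",
    "with_NA.csv": "letter,number\na,1\nb,NA\nNA,3\nd,4\n",
}


def write_csv_files(target_dir="."):
    """Write the four sample files into target_dir and return their paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, content in CSV_FILES.items():
        path = target / name
        path.write_text(content, encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    for path in write_csv_files():
        print(f"wrote {path}")
```

Running it in the same directory as demo_holes.py should make the `pd.read_csv("...")` calls in that script find their inputs.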

Here is what happens when I read them into a DataFrame (code below):

**no holes**
  letter  number
0      a       1
1      b       2
2      c       3
3      d       4
letter    object
number     int64
dtype: object

**with holes**
  letter number
0      a      1
1    NaN      2
2      b       
3    NaN      4
letter    object
number    object
dtype: object

**empty_column**
   letters  numbers
0      NaN        1
1      NaN        2
2      NaN        3
3      NaN        4
letters    float64
numbers      int64
dtype: object

**with NA**
  letter  number
0      a     1.0
1      b     NaN
2    NaN     3.0
3      d     4.0
letter     object
number    float64
dtype: object

Is there a way to tell pandas to assume empty values are of object type? I've tried na_values=[""].

demo_holes.py

import pandas as pd

with_holes = pd.read_csv("with_holes.csv")
no_holes = pd.read_csv("no_holes.csv")
empty_column = pd.read_csv("empty_column.csv")
with_NA = pd.read_csv("with_NA.csv")

print("\n**no holes**")
print(no_holes.head())
print(no_holes.dtypes)
print("\n**with holes**")
print(with_holes.head())
print(with_holes.dtypes)
print("\n**empty_column**")
print(empty_column.head())
print(empty_column.dtypes)
print("\n**with NA**")
print(with_NA.head())
print(with_NA.dtypes)

Answered by piRSquared

You want to use the parameter skipinitialspace=True.

Setup

no_holes = """letter,number
a,1
b,2
c,3
d,4"""

with_holes = """letter,number
a,1
,2
b, 
,4"""

empty_column = """letters,numbers
,1
,2
,3
,4"""

with_NA = """letter,number
a,1
b,NA
NA,3
d,4"""

from io import StringIO  # Python 3 location; Python 2 used `from StringIO import StringIO`
import pandas as pd

d1 = pd.read_csv(StringIO(no_holes), skipinitialspace=True)
d2 = pd.read_csv(StringIO(with_holes), skipinitialspace=True)
d3 = pd.read_csv(StringIO(empty_column), skipinitialspace=True)
d4 = pd.read_csv(StringIO(with_NA), skipinitialspace=True)

pd.concat([d1, d2, d3, d4], axis=1,
          keys=['no_holes', 'with_holes',
                'empty_column', 'with_NA'])
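
As a rough sketch of what `skipinitialspace=True` changes, the following reads the `with_holes` data with and without the option and compares the `number` column. It assumes Python 3 (`io.StringIO`); the variable names `raw` and `stripped` are illustrative only.

```python
from io import StringIO

import pandas as pd

# Same content as with_holes above; "b, " has a single space after the comma.
with_holes = "letter,number\na,1\n,2\nb, \n,4"

# Default parsing: the lone space is kept as a string value,
# so the whole column is read as text rather than numbers.
raw = pd.read_csv(StringIO(with_holes))

# skipinitialspace=True drops whitespace after the delimiter, so the field
# becomes empty and is read as NaN; the column can then be parsed as float.
stripped = pd.read_csv(StringIO(with_holes), skipinitialspace=True)

print(raw["number"].tolist())
print(stripped["number"].tolist())
print(stripped.dtypes)
```

The same option is harmless for the other three inputs, which contain no stray spaces.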


If you want those NaNs to be '', then use fillna:

d1 = pd.read_csv(StringIO(no_holes), skipinitialspace=True).fillna('')
d2 = pd.read_csv(StringIO(with_holes), skipinitialspace=True).fillna('')
d3 = pd.read_csv(StringIO(empty_column), skipinitialspace=True).fillna('')
d4 = pd.read_csv(StringIO(with_NA), skipinitialspace=True).fillna('')

pd.concat([d1, d2, d3, d4], axis=1,
          keys=['no_holes', 'with_holes',
                'empty_column', 'with_NA'])
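
One consequence of `fillna('')` worth checking is that a numeric column holding '' is no longer purely numeric. The sketch below shows this on the `with_holes` data and contrasts it with a second route that is not from the answer: reading every field as text with `keep_default_na=False` and `dtype=object`, which is an assumption about what the asker may want. The names `filled` and `as_text` are illustrative.

```python
from io import StringIO

import pandas as pd

with_holes = "letter,number\na,1\n,2\nb, \n,4"

# Route 1 (from the answer): parse normally, then replace NaN with ''.
# The number column is read as float because of the NaN, and ends up
# holding a mix of floats and '' after the fill.
filled = pd.read_csv(StringIO(with_holes), skipinitialspace=True).fillna("")

# Route 2 (assumption, not from the answer): turn off NA detection and
# keep every field as a plain string, so empty fields arrive as ''.
as_text = pd.read_csv(
    StringIO(with_holes),
    skipinitialspace=True,
    keep_default_na=False,
    dtype=object,
)

print(filled["number"].tolist())
print(as_text["number"].tolist())
```

With the second route, numbers stay as strings and need an explicit conversion later, and a literal `NA` in the file (as in with_NA.csv) is kept as the string "NA" instead of being treated as missing.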
