Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/27228964/

Time: 2020-09-13 22:44:05  Source: igfitidea

Read CSV data with missing values into Python using pandas

python csv pandas missing-data

Asked by July

I have a CSV file that looks like this:

"row ID","label","val"
"Row0","5",6
"Row1","",6
"Row2","",6
"Row3","5",7
"Row4","5",8
"Row5",,9
"Row6","nan",
"Row7","nan",
"Row8","nan",0
"Row9","nan",3
"Row10","nan",

All quoted entries are strings. Non-quoted entries are numerical. Empty fields are missing values (NaN), but quoted empty fields should still be treated as empty strings. I tried to read the file with pandas read_csv, but I cannot get it to work the way I want: it still treats both ,"", and ,, as NaN, which is wrong for the first one.

d = pd.read_csv(csv_filename, sep=',', keep_default_na=False, na_values=[''], quoting = csv.QUOTE_NONNUMERIC)

Can anybody help? Is it possible at all?

Answered by AnandViswanathan89

You can try numpy.genfromtxt and specify the missing_values parameter:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html

Answered by July

I found a way to get it more or less working. I just don't know why I need to specify dtype=type(None) for it to work. Comments on this piece of code are very welcome!

import re
import pandas as pd
import numpy as np

# clear quoting characters
def filterTheField(s):
    m = re.match(r'^"?(.*)?"$', s.strip())
    if m:
        return m.group(1)
    else:
        return np.nan

file = 'test.csv'

y = np.genfromtxt(
    file,
    delimiter=',',
    filling_values=np.nan,
    names=True,
    dtype=type(None),
    converters={'row_ID': filterTheField, 'label': filterTheField, 'val': float},
)

d = pd.DataFrame(y)

print(d)
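
For comparison, here is a minimal sketch of the same idea (deciding per field whether it was quoted) using only the standard library. The names parse_fields, read_rows and sample are hypothetical and not part of the original answer. It assumes, as the question states, that every unquoted non-empty field is a number, and that quoted fields contain no line breaks.

```python
import math


def parse_fields(line):
    """Split one CSV line, remembering whether each field was quoted."""
    fields = []
    i, n = 0, len(line)
    while True:
        if i < n and line[i] == '"':
            # Quoted field: always a string, even when it is empty.
            i += 1
            chars = []
            while i < n:
                if line[i] == '"':
                    if i + 1 < n and line[i + 1] == '"':
                        chars.append('"')  # "" inside quotes is an escaped quote
                        i += 2
                        continue
                    i += 1  # closing quote
                    break
                chars.append(line[i])
                i += 1
            fields.append(''.join(chars))
            while i < n and line[i] != ',':
                i += 1  # ignore anything between the closing quote and the comma
        else:
            # Unquoted field: a number, or a missing value when it is empty.
            j = line.find(',', i)
            if j == -1:
                j = n
            raw = line[i:j].strip()
            fields.append(float(raw) if raw else math.nan)
            i = j
        if i >= n:
            break
        i += 1  # step over the comma
    return fields


def read_rows(text):
    """Return one dict per data row, keyed by the header names."""
    lines = text.splitlines()
    header = parse_fields(lines[0])
    return [dict(zip(header, parse_fields(line))) for line in lines[1:] if line]


# A few rows taken from the question's file.
sample = "\n".join([
    '"row ID","label","val"',
    '"Row0","5",6',
    '"Row1","",6',
    '"Row5",,9',
    '"Row6","nan",',
])

rows = read_rows(sample)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) if a DataFrame is needed. Note that the quoted "nan" stays a string here, while only truly empty unquoted fields become NaN.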

Answered by Moritz

Maybe something like:

import pandas as pd
import csv
import numpy as np
d = pd.read_csv('test.txt', sep=',', keep_default_na=False, na_values=[''], quoting = csv.QUOTE_NONNUMERIC)
mask = d['label'] == 'nan'
d.loc[mask, 'label'] = np.nan
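
If the goal is to keep "" as an empty string while turning only unquoted empty fields into NaN, one possible alternative is to have read_csv leave the quote characters in place and convert the fields afterwards. This is a rough sketch, not the approach from the answer above: the names sample and convert are hypothetical, and it assumes that no quoted field contains a comma, since csv.QUOTE_NONE disables quote handling entirely.

```python
import csv
import io

import numpy as np
import pandas as pd

# A few rows taken from the question's file.
sample = "\n".join([
    '"row ID","label","val"',
    '"Row0","5",6',
    '"Row1","",6',
    '"Row5",,9',
    '"Row6","nan",',
])

# Read everything as raw text: quotes are kept and nothing is turned into NaN yet.
raw = pd.read_csv(
    io.StringIO(sample),
    dtype=str,
    keep_default_na=False,
    quoting=csv.QUOTE_NONE,
)


def convert(field):
    """Apply the question's rules to one raw field."""
    if len(field) >= 2 and field.startswith('"') and field.endswith('"'):
        return field[1:-1]  # quoted: a string, possibly empty
    if field == '':
        return np.nan       # unquoted and empty: missing value
    return float(field)     # unquoted and non-empty: assumed numeric


d = raw.apply(lambda col: col.map(convert))
d.columns = [name.strip('"') for name in raw.columns]

print(d)
```

Doing the conversion after parsing keeps the quoted/unquoted distinction, which is lost once the parser has stripped the quotes.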