Pandas: How to replace ? with NaN - handling non-standard missing values
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors: StackOverflow
Original: http://stackoverflow.com/questions/29247712/
Asked by swati saoji
I am new to pandas and am trying to load a CSV file into a DataFrame. Missing values in my data are represented as ?, and I am trying to replace them with the standard missing value, NaN.
Kindly help me with this. I have tried reading through the pandas docs, but I am not able to follow them.
import pandas as pd

def readData(filename):
    DataLabels = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
                  "occupation", "relationship", "race", "sex", "capital-gain",
                  "capital-loss", "hours-per-week", "native-country", "class"]
    # ==== trying to replace ? with NaN using na_values
    rawfile = pd.read_csv(filename, header=None, names=DataLabels, na_values=["?"])
    age = rawfile["age"]
    print(age)
    print(rawfile[25:40])
    # ======== trying to replace ?
    rawfile.replace("?", "NaN")
    print(rawfile[25:40])
Accepted answer by EdChum
You can replace it for just that column using replace:
df['workclass'] = df['workclass'].replace('?', np.nan)
or for the whole df:
df = df.replace('?', np.nan)
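One detail worth spelling out: replace returns a new object and leaves the original untouched unless the result is assigned. A minimal sketch, using a made-up two-row frame (the values are illustrative, not the asker's data):

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame; only the column names come from the question.
df = pd.DataFrame({"age": [54, 38], "workclass": ["?", "Private"]})

# Calling replace without keeping the result does not modify df.
df.replace("?", np.nan)
unchanged = df["workclass"].tolist()

# Assigning the result keeps the cleaned copy.
cleaned = df.replace("?", np.nan)
print(cleaned)
```

This only matches cells that are exactly "?"; a cell such as " ?" with a leading space is left alone.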
UPDATE
OK, I figured out your problem: by default, if you don't pass a separator, read_csv uses a comma ',' as the separator.
Your data, and in particular this example of a problematic line:
54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K
in fact has a comma and a space as the separator, so when you passed na_values=['?'] it didn't match: every value has a leading space character in front of it, which is easy to miss.
If you change your line to this:
rawfile = pd.read_csv(filename, header=None, names=DataLabels, sep=r',\s', engine='python', na_values=["?"])
then you should find that it all works:
27 54 NaN 180211 Some-college 10
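To see the effect of the separator without the original file, here is a small sketch that feeds two invented rows through read_csv via io.StringIO. The column names come from the question; the second row and the variable names are assumptions. It also tries skipinitialspace=True, a read_csv option that drops the space after each delimiter, as an alternative to the regex separator.

```python
import io

import pandas as pd

# Two invented rows in the same "comma plus space" layout as the question's data.
data = "54, ?, 180211, Some-college\n38, Private, 215646, HS-grad\n"
cols = ["age", "workclass", "fnlwgt", "education"]

# Default separator: the cell is " ?" (with a leading space), so na_values misses it.
default_df = pd.read_csv(io.StringIO(data), header=None, names=cols, na_values=["?"])

# Regex separator from the answer; regex separators need the Python engine.
regex_df = pd.read_csv(io.StringIO(data), header=None, names=cols,
                       sep=r",\s", engine="python", na_values=["?"])

# Alternative: keep the plain comma separator and skip the space after it.
skip_df = pd.read_csv(io.StringIO(data), header=None, names=cols,
                      skipinitialspace=True, na_values=["?"])

print(repr(default_df["workclass"].iloc[0]))
print(regex_df)
print(skip_df)
```

The skipinitialspace route stays on the default C parser, which may matter for large files; the regex route is the one the answer describes.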
Answer by Liam Foley
Use numpy.nan
Numpy - Replace a number with NaN
import numpy as np
df.applymap(lambda x: np.nan if x == '?' else x)
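Note that applymap has been deprecated since pandas 2.1 in favour of DataFrame.map. The sketch below picks whichever method the installed version offers and assigns the result, since neither call changes the frame in place. The sample frame and the helper name to_nan are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame; only the column names come from the question.
df = pd.DataFrame({"age": [54, 38], "workclass": ["?", "Private"]})

def to_nan(x):
    # Exact comparison: " ?" with a leading space would not match.
    return np.nan if x == "?" else x

# DataFrame.map exists from pandas 2.1 onwards; fall back to applymap otherwise.
elementwise = df.map if hasattr(df, "map") else df.applymap
cleaned = elementwise(to_nan)
print(cleaned)
```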
Answer by swati saoji
Okay, I got it working with:
# ======== trying to replace ?
newraw = rawfile.replace('[?]', np.nan, regex=True)
print(newraw[25:40])
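The pattern '[?]' is applied as a regex search, so as far as I can tell it matches any cell that contains a question mark anywhere, which is why it catches " ?" but could also catch legitimate text. A stricter sketch anchors the pattern so that only cells consisting of a question mark plus optional whitespace are replaced. The sample values and the name strict are assumptions.

```python
import numpy as np
import pandas as pd

# Invented cells: a padded placeholder, a normal value, and text that merely contains "?".
df = pd.DataFrame({"workclass": [" ?", "Private", "why?"]})

# ^\s*\?\s*$ matches a lone question mark with optional surrounding whitespace.
strict = df.replace(r"^\s*\?\s*$", np.nan, regex=True)
print(strict)
```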
Answer by Nishanth
Sometimes there is whitespace around the ? in files generated by systems like Informatica or HANA.
First you need to strip the whitespace in the DataFrame:
temp_df_trimmed = temp_df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
Then apply a function to replace the data:
temp_df_trimmed['RC'] = temp_df_trimmed['RC'].map(lambda x: np.nan if x=="?" else x)
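The two steps can also be applied to every column at once. The sketch below is one way to do that, under the assumption that every "?" in the frame is a missing-value marker; the sample data and the helper strip_if_str are invented, and only the column name RC comes from the answer. It checks each value with isinstance instead of checking the column dtype, so non-string cells pass through unchanged.

```python
import numpy as np
import pandas as pd

# Invented sample with padded placeholders in two text columns.
temp_df = pd.DataFrame({
    "age": [54, 38, 27],
    "RC": [" ? ", "Private", "?"],
    "education": [" Some-college", "HS-grad ", " ?"],
})

def strip_if_str(value):
    # Leave numbers and missing values as they are.
    return value.strip() if isinstance(value, str) else value

# Step 1: trim whitespace in every cell that holds a string.
trimmed = temp_df.apply(lambda col: col.map(strip_if_str))

# Step 2: replace the bare placeholder across the whole frame.
cleaned = trimmed.replace("?", np.nan)
print(cleaned)
```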