pandas: cleaning big data using Python
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/13867294/
cleaning big data using python
Asked by Kathirmani Sukumar
I have to clean an input data file in Python. Due to typos, a data field may contain a string instead of a number. I would like to identify all fields which hold a string, fill them with NaN using pandas, and also log the index of those fields.
One of the crudest ways is to loop through every field and check whether it is a number or not (sketched below, after the sample data), but this consumes a lot of time if the data is big.
My csv file contains data similar to the following table:
Country Count Sales
USA 1 65000
UK 3 4000
IND 8 g
SPA 3 9000
NTH 5 80000
.... Assume that I have 60,000 such rows in the data.
Ideally I would like to identify that the IND row has an invalid value in the SALES column. Any suggestions on how to do this efficiently?
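For reference, the brute-force loop mentioned above might look like the following sketch (illustrative only; it assumes the data is already loaded into a DataFrame named df):
import numpy as np

bad_index = []
for i, value in enumerate(df['Sales']):
    if not str(value).isdigit():                      # crude numeric test; catches typos such as 'g'
        bad_index.append(i)                           # log the offending row position
        df.iloc[i, df.columns.get_loc('Sales')] = np.nan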
Answered by Andy Hayden
There is a na_values argument to read_csv:
na_values: list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values
import pandas as pd
df = pd.read_csv('city.csv', sep='\s+', na_values=['g'])
In [2]: df
Out[2]:
Country Count Sales
0 USA 1 65000
1 UK 3 4000
2 IND 8 NaN
3 SPA 3 9000
4 NTH 5 80000
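Because na_values also accepts a dict, the substitution can be limited to a single column. A minimal sketch, assuming the same city.csv as above:
import pandas as pd

# per-column NA values: only the Sales column treats 'g' as NaN
df = pd.read_csv('city.csv', sep='\s+', na_values={'Sales': ['g']})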
Using pandas.isnull, you can select only those rows with NaN in the 'Sales' column, or the 'Country' series:
In [3]: df[pd.isnull(df['Sales'])]
Out[3]:
Country Count Sales
2 IND 8 NaN
In [4]: df[pd.isnull(df['Sales'])]['Country']
Out[4]:
2 IND
Name: Country
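If you also want to log the index of those rows, as the question asks, you can take it straight from the boolean mask; a small sketch based on the df above:
bad_index = df.index[pd.isnull(df['Sales'])].tolist()
print(bad_index)    # -> [2]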
If it's already in the DataFrame you could use apply to convert those strings which are numbers into integers (using str.isdigit):
df = pd.DataFrame({'Count': {0: 1, 1: 3, 2: 8, 3: 3, 4: 5}, 'Country': {0: 'USA', 1: 'UK', 2: 'IND', 3: 'SPA', 4: 'NTH'}, 'Sales': {0: '65000', 1: '4000', 2: 'g', 3: '9000', 4: '80000'}})
In [12]: df
Out[12]:
Country Count Sales
0 USA 1 65000
1 UK 3 4000
2 IND 8 g
3 SPA 3 9000
4 NTH 5 80000
In [13]: df['Sales'] = df['Sales'].apply(lambda x: int(x) if str.isdigit(x) else np.nan)
In [14]: df
Out[14]:
Country Count Sales
0 USA 1 65000
1 UK 3 4000
2 IND 8 NaN
3 SPA 3 9000
4 NTH 5 80000
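A variant of the same idea, if your pandas version provides pd.to_numeric (newer releases do): it also copes with floats and negative numbers, which str.isdigit would reject.
import pandas as pd

# anything that cannot be parsed as a number is coerced to NaN
df['Sales'] = pd.to_numeric(df['Sales'], errors='coerce')
bad_index = df.index[df['Sales'].isnull()].tolist()    # row labels that held invalid values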
Answered by unutbu
import os
import numpy as np
import pandas as PD

filename = os.path.expanduser('~/tmp/data.csv')
# structured dtype: object (Country), int32 (Count), float64 (Sales);
# the unparseable 'g' in the float column becomes NaN
df = PD.DataFrame(
    np.genfromtxt(
        filename, delimiter='\t', names=True, dtype='|O4,<i4,<f8'))
print(df)
yields
Country Count Sales
0 USA 1 65000
1 UK 3 4000
2 IND 8 NaN
3 SPA 3 9000
4 NTH 5 80000
and to find the country with NaN sales, you could compute
print(df['Country'][np.isnan(df['Sales'])])
which yields the pandas.Series:
2 IND
Name: Country
Answered by tacaswell
Try to convert the 'sales' string to an int. If it is well formed, the conversion succeeds; if it is not, it raises a ValueError, which we catch and replace with a placeholder.
bad_lines = []
with open(fname, 'rb') as f:
    header = f.readline()
    for j, l in enumerate(f):
        country, count, sales = l.split()
        try:
            sales_count = int(sales)
        except ValueError:
            sales_count = 'NaN'
            bad_lines.append(j)
        # shove in to your data structure
        print country, count, sales_count
You might need to edit the line that splits each row (as your example copied out with spaces, not tabs). Replace the print line with whatever you want to do with the data. You probably need to replace 'NaN' with the pandas NaN (np.nan) as well.
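A sketch of that last step, collecting the cleaned rows into a DataFrame and using np.nan as the placeholder (illustrative; fname and the three-column layout are assumed from the snippet above):
import numpy as np
import pandas as pd

rows = []
bad_lines = []
with open(fname, 'rb') as f:
    f.readline()                              # skip the header
    for j, l in enumerate(f):
        country, count, sales = l.split()
        try:
            sales_count = int(sales)
        except ValueError:
            sales_count = np.nan              # pandas-friendly NaN instead of the string 'NaN'
            bad_lines.append(j)
        rows.append((country, int(count), sales_count))

df = pd.DataFrame(rows, columns=['Country', 'Count', 'Sales'])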
Answered by Artem Kisel
filename = open('file.csv')
filename.readline()
for line in filename:
    currentline = line.split(',')
    try:
        int(currentline[2][:-1])
    except:
        print currentline[0], currentline[2][:-1]
IND g
IND
Answered by eyquem
I propose to use a regex:
import re

ss = '''Country Count Sales
USA , 3 , 65000
UK , 3 , 4000
IND , 8 , g
SPA , ju , 9000
NTH , 5 , 80000
XSZ , rob , k3'''

with open('fofo.txt','w') as f:
    f.write(ss)
print ss
print

delimiter = ','
regx = re.compile('(.+?(?:{0}))'
                  '(( *\d+?)| *.+?)'
                  '( *(?:{0}))'
                  '(( *\d+?)| *.+?)'
                  '( *\r?\n?)$'.format(delimiter))

def READ(filepath, regx=regx):
    with open(filepath,'rb+') as f:
        yield f.readline()
        for line in f:
            if None in regx.match(line).group(3,6):
                g2,g3,g5,g6 = regx.match(line).group(2,3,5,6)
                tr = ('%%%ds' % len(g2) % 'NaN' if g3 is None else g3,
                      '%%%ds' % len(g5) % 'NaN' if g6 is None else g6)
                modified_line = regx.sub(('\g<1>%s\g<4>%s\g<7>' % tr), line)
                print ('------------------------------------------------\n'
                       '%r with aberration\n'
                       '%r modified line'
                       % (line, modified_line))
                yield modified_line
            else:
                yield line

with open('modified.txt','wb') as g:
    g.writelines(x for x in READ('fofo.txt'))
result
Country Count Sales
USA , 3 , 65000
UK , 3 , 4000
IND , 8 , g
SPA , ju , 9000
NTH , 5 , 80000
XSZ , rob , k3
------------------------------------------------
'IND , 8 , g\r\n' with aberration
'IND , 8 , NaN\r\n' modified line
------------------------------------------------
'SPA , ju , 9000\r\n' with aberration
'SPA , NaN , 9000\r\n' modified line
------------------------------------------------
'XSZ , rob , k3' with aberration
'XSZ , NaN , NaN' modified line

