pandas 使用python清理大数据

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/13867294/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 20:32:10  来源:igfitidea点击:

cleaning big data using python

pythonpandas

提问by Kathirmani Sukumar

I have to clean a input data file in python. Due to typo error, the datafield may have strings instead of numbers. I would like to identify all fields which are a string and fill these with NaN using pandas. Also, I would like to log the index of those fields.

我必须在 python 中清理输入数据文件。由于拼写错误,数据字段可能包含字符串而不是数字。我想识别所有作为字符串的字段,并使用 Pandas 用 NaN 填充这些字段。另外,我想记录这些字段的索引。

One of the crudest way is to loop through each and every field and checking whether it is a number or not, but this consumes lot of time if the data is big.

最粗略的方法之一是遍历每个字段并检查它是否为数字,但是如果数据很大,这会消耗大量时间。

My csv file contains data similar to the following table:

我的 csv 文件包含类似于下表的数据:

Country  Count  Sales
USA         1   65000
UK          3    4000
IND         8       g
SPA         3    9000
NTH         5   80000

.... Assume that i have 60,000 such rows in the data.

.... 假设我有 60,000 行这样的数据。

Ideally I would like to identify that row IND has an invalid value under SALES column. Any suggestions on how to do this efficiently?

理想情况下,我想确定 IND 行在 SALES 列下具有无效值。关于如何有效地做到这一点的任何建议?

回答by Andy Hayden

There is a na_valuesargument to read_csv:

有一个na_values论点read_csv

na_values: list-like or dict, default None
       Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values

na_values: list-like or dict, defaultNone
       要识别为 NA/NaN 的附加字符串。如果 dict 通过,特定的每列 NA 值

df = pd.read_csv('city.csv', sep='\s+', na_values=['g'])

In [2]: df
Out[2]:
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000

Using pandas.isnull, you can select only those rows with NaN in the 'Sales'column, or the 'Country'series:

使用pandas.isnull,您可以仅选择'Sales'列或'Country'系列中带有 NaN 的行:

In [3]: df[pd.isnull(df['Sales'])]
Out[3]: 
  Country  Count  Sales
2     IND      8    NaN

In [4]: df[pd.isnull(df['Sales'])]['Country']
Out[4]: 
2    IND
Name: Country

If it's already in the DataFrame you could use applyto convert those strings which are numbers into integers (using str.isdigit):

如果它已经在 DataFrame 中,您可以使用apply将这些数字字符串转换为整数(使用str.isdigit):

df = pd.DataFrame({'Count': {0: 1, 1: 3, 2: 8, 3: 3, 4: 5}, 'Country': {0: 'USA', 1: 'UK', 2: 'IND', 3: 'SPA', 4: 'NTH'}, 'Sales': {0: '65000', 1: '4000', 2: 'g', 3: '9000', 4: '80000'}})

In [12]: df
Out[12]: 
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8      g
3     SPA      3   9000
4     NTH      5  80000

In [13]: df['Sales'] = df['Sales'].apply(lambda x: int(x) 
                                                  if str.isdigit(x)
                                                  else np.nan)

In [14]: df
Out[14]: 
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000

回答by unutbu

import os
import numpy as np
import pandas as PD

filename = os.path.expanduser('~/tmp/data.csv')
df = PD.DataFrame(
        np.genfromtxt(
            filename, delimiter = '\t', names = True, dtype = '|O4,<i4,<f8'))
print(df)

yields

产量

  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000

and to find the country with NaNsales, you could compute

并找到有NaN销售额的国家,你可以计算

print(y['Country'][np.isnan(y['Sales'])])

which yields the pandas.Series:

这产生了pandas.Series

2    IND
Name: Country

回答by tacaswell

Try to convert the 'sales' string to an int, if it is well formed then it goes on, if it is not it will raise a ValueErrorwhich we catch and replace with the place holder.

尝试将 'sales' 字符串转换为 an int,如果它格式正确,则继续,如果不是,它将引发ValueError我们捕获并替换为占位符的 a 。

bad_lines = []

with open(fname,'rb') as f:
    header = f.readline()
    for j,l in enumerate(f):
        country,count,sales = l.split()
        try:
            sales_count = int(sales)
        except ValueError:
            sales_count = 'NaN'
            bad_lines.append(j)
        # shove in to your data structure
        print country,count,sales_count

you might need to edit the line that splits the line (as your example copied out as spaces, not tabs). Replace the print line, with what ever you want to do with the data. You probably need to relpace 'NaN' with the pandas NaN as well.

您可能需要编辑分割该行的行(如您的示例复制为空格,而不是制表符)。用您想要对数据执行的操作替换打印行。您可能还需要将 'NaN' 与 Pandas NaN 替换为 'NaN'。

回答by Artem Kisel

filename = open('file.csv')
filename.readline()

for line in filename:
    currentline = line.split(',')
    try:
        int(currentline[2][:-1])
    except:
        print currentline[0], currentline[2][:-1]

IND g

IND

回答by eyquem

I propose to use a regex:

我建议使用正则表达式:

import re

ss = '''Country  Count  Sales
USA   ,      3  , 65000
UK    ,      3  ,  4000
IND   ,      8  ,     g
SPA   ,     ju  ,  9000
NTH   ,      5  , 80000
XSZ   ,    rob  ,    k3'''

with open('fofo.txt','w') as f:
    f.write(ss)

print ss
print

delimiter = ','

regx = re.compile('(.+?(?:{0}))'
                  '(( *\d+?)| *.+?)'
                  '( *(?:{0}))'
                  '(( *\d+?)| *.+?)'
                  '( *\r?\n?)$'.format(delimiter))

def READ(filepath, regx = regx):
    with open(filepath,'rb+') as f:
        yield f.readline()
        for line in f:
            if None in regx.match(line).group(3,6):
                g2,g3,g5,g6 = regx.match(line).group(2,3,5,6)
                tr = ('%%%ds' % len(g2) % 'NaN' if g3 is None else g3,
                      '%%%ds' % len(g5) % 'NaN' if g6 is None else g6)
                modified_line = regx.sub(('\g<1>%s\g<4>%s\g<7>' % tr),line)
                print ('------------------------------------------------\n'
                       '%r with aberration\n'
                       '%r modified line'
                       % (line,modified_line))
                yield modified_line
            else:
                yield line

with open('modified.txt','wb') as g:
    g.writelines(x for x in READ('fofo.txt'))

result

结果

Country  Count  Sales
USA   ,      3  , 65000
UK    ,      3  ,  4000
IND   ,      8  ,     g
SPA   ,     ju  ,  9000
NTH   ,      5  , 80000
XSZ   ,    rob  ,    k3

------------------------------------------------
'IND   ,      8  ,     g\r\n' with aberration
'IND   ,      8  ,   NaN\r\n' modified line
------------------------------------------------
'SPA   ,     ju  ,  9000\r\n' with aberration
'SPA   ,    NaN  ,  9000\r\n' modified line
------------------------------------------------
'XSZ   ,    rob  ,    k3' with aberration
'XSZ   ,    NaN  ,   NaN' modified line