Skip rows with missing values in read_csv
Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/38818609/
Asked by eleanora
I have a very large CSV file which I need to read in. To make this fast and keep RAM usage low, I am using read_csv and setting the dtype of some columns to np.uint32. The problem is that some rows have missing values, and pandas uses a float to represent those.
- Is it possible to simply skip rows with missing values? I know I could do this after reading in the whole file, but this means I couldn't set the dtype until then and so would use too much RAM.
- Is it possible to convert missing values to some other value I choose during the reading of the data?
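To see the problem concretely, here is a minimal sketch (the inline CSV and the exact error message are illustrative; wording varies between pandas versions):

import io
import numpy as np
import pandas as pd

csv_data = "a,b\n1,2\n3,\n5,6\n"   # second data row is missing a value in column b

# Without a dtype, pandas silently promotes column b to float64 to hold NaN:
df = pd.read_csv(io.StringIO(csv_data))
print(df.dtypes)   # a: int64, b: float64

# Forcing an unsigned integer dtype fails on the missing value:
try:
    pd.read_csv(io.StringIO(csv_data), dtype={"a": np.uint32, "b": np.uint32})
except ValueError as err:
    print(err)     # pandas cannot store NA in an integer column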
Accepted answer by Kartik
It would be dainty if you could fill NaN with, say, 0 during the read itself. Perhaps a feature request on Pandas's GitHub is in order...
Using a converter function
However, for the time being, you can define your own function to do that and pass it to the converters argument in read_csv:
def conv(val):
    # converter functions receive the raw string from the file; a missing
    # field arrives as an empty string, not as np.nan (and np.nan == np.nan
    # is False anyway, so that comparison would never match)
    if val == '':
        return 0  # or whatever else you want to represent your NaN with
    return val

df = pd.read_csv(file, converters={colWithNaN: conv}, dtype=...)
Note that converters takes a dict, so you need to specify it for each column that has NaN to be dealt with. It can get a little tiresome if a lot of columns are affected. You can specify either column names or numbers as keys.
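For example (the column names and positions here are hypothetical), the same converter can be attached to several columns:

# keys may be column names...
df = pd.read_csv(file, converters={"colA": conv, "colB": conv})
# ...or zero-based column numbers
df = pd.read_csv(file, converters={0: conv, 3: conv})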
Also note that this might slow down your read_csv performance, depending on how the converters function is handled. Further, if you just have one column that needs NaNs handled during the read, you can skip a proper function definition and use a lambda function instead:
df = pd.read_csv(file, converters={colWithNaN: lambda x: 0 if x == '' else x}, dtype=...)
Reading in chunks
You could also read the file in small chunks that you stitch together to get your final output. You can do a bunch of things this way. Here is an illustrative example:
# (DataFrame.append was removed in pandas 2.0, so collect the filtered
# chunks in a list and concatenate once at the end)
chunks = []
for chunk in pd.read_csv(file, chunksize=1000):
    chunk.dropna(axis=0, inplace=True)  # dropping all rows with any NaN value
    chunk[colToConvert] = chunk[colToConvert].astype(np.uint32)
    chunks.append(chunk)
result = pd.concat(chunks, ignore_index=True)
del chunks
Note that this method never holds the entire unfiltered file in memory: only one raw chunk exists at a time, and the filtered chunks are briefly duplicated while pd.concat assembles the final result, which is a fair bargain. This method may also work out to be faster than using a converter function.
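As a quick sanity check (same hypothetical names as above), you can confirm that the stitched-together frame kept the compact dtype and contains no missing values:

print(result[colToConvert].dtype)   # uint32
print(result.isna().any().any())    # False: all rows with NaN were dropped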
Answered by John Zwinck
There is no feature in Pandas that does that. You can implement it in regular Python like this:
import csv
import pandas as pd

def filter_records(records):
    """Given an iterable of dicts, converts values to int.
    Discards any record which has an empty field."""
    for record in records:
        for k, v in record.items():  # iteritems() in Python 2
            if v == '':
                break
            record[k] = int(v)
        else:  # this executes whenever the inner loop did not break
            yield record

with open('t.csv') as infile:
    records = csv.DictReader(infile)
    df = pd.DataFrame.from_records(filter_records(records))
Pandas uses the csv module internally anyway. If the performance of the above turns out to be a problem, you could probably speed it up with Cython (which Pandas also uses).
Answered by Merlin
If you show some data, SO people could help more.
pd.read_csv('FILE', keep_default_na=False)
For starters try these:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
na_values : str or list-like or dict, default None
    Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'.
keep_default_na : bool, default True
    If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they're appended to.
na_filter : boolean, default True
    Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
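To make the excerpt concrete, here is a small sketch (the inline CSV is invented for illustration) showing how these options change what is treated as NaN:

import io
import pandas as pd

csv_data = "a,b\n1,N/A\n3,7\n"

# Default: 'N/A' is recognized as NaN, so column b is promoted to float64.
print(pd.read_csv(io.StringIO(csv_data)).dtypes)

# keep_default_na=False (with no na_values given): 'N/A' stays a plain
# string, so no NaN is produced and column b comes back as object.
print(pd.read_csv(io.StringIO(csv_data), keep_default_na=False).dtypes)

# na_filter=False skips NA detection entirely; this can speed up reading
# a large file that is known to contain no missing values.
print(pd.read_csv(io.StringIO(csv_data), na_filter=False).dtypes)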