Skip rows with missing values in read_csv
Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/38818609/
Asked by eleanora
I have a very large CSV file which I need to read in. To make this fast and keep RAM usage low, I am using read_csv and setting the dtype of some columns to np.uint32. The problem is that some rows have missing values, and pandas uses a float to represent those.
- Is it possible to simply skip rows with missing values? I know I could do this after reading in the whole file, but this means I couldn't set the dtype until then and so would use too much RAM.
- Is it possible to convert missing values to some other value I choose during the reading of the data?
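To see the problem concretely, here is a minimal sketch (the inline CSV and the exact error message are illustrative; wording varies between pandas versions):

import io
import numpy as np
import pandas as pd

csv_data = "a,b\n1,2\n3,\n5,6\n"   # second data row is missing a value in column b

# Without a dtype, pandas silently promotes column b to float64 to hold NaN:
df = pd.read_csv(io.StringIO(csv_data))
print(df.dtypes)   # a: int64, b: float64

# Forcing an unsigned integer dtype fails on the missing value:
try:
    pd.read_csv(io.StringIO(csv_data), dtype={"a": np.uint32, "b": np.uint32})
except ValueError as err:
    print(err)     # pandas cannot store NA in an integer column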
Accepted answer by Kartik
It would be dainty if you could fill NaN with, say, 0 during the read itself. Perhaps a feature request on Pandas's GitHub is in order...
Using a converter function
However, for the time being, you can define your own function to do that and pass it to the converters argument in read_csv:
def conv(val):
    # converter functions receive the raw string from the file; a missing
    # field arrives as an empty string, not as np.nan (and np.nan == np.nan
    # is False anyway, so that comparison would never match)
    if val == '':
        return 0  # or whatever else you want to represent your NaN with
    return val

df = pd.read_csv(file, converters={colWithNaN: conv}, dtype=...)
Note that converters takes a dict, so you need to specify it for each column that has NaN to be dealt with. It can get a little tiresome if a lot of columns are affected. You can specify either column names or numbers as keys.
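For example (the column names and positions here are hypothetical), the same converter can be attached to several columns:

# keys may be column names...
df = pd.read_csv(file, converters={"colA": conv, "colB": conv})
# ...or zero-based column numbers
df = pd.read_csv(file, converters={0: conv, 3: conv})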
Also note that this might slow down your read_csv performance, depending on how the converters function is handled. Further, if you just have one column that needs NaNs handled during the read, you can skip a proper function definition and use a lambda function instead:
df = pd.read_csv(file, converters={colWithNaN: lambda x: 0 if x == '' else x}, dtype=...)
Reading in chunks
You could also read the file in small chunks that you stitch together to get your final output. You can do a bunch of things this way. Here is an illustrative example:
# (DataFrame.append was removed in pandas 2.0, so collect the filtered
# chunks in a list and concatenate once at the end)
chunks = []
for chunk in pd.read_csv(file, chunksize=1000):
    chunk.dropna(axis=0, inplace=True)  # dropping all rows with any NaN value
    chunk[colToConvert] = chunk[colToConvert].astype(np.uint32)
    chunks.append(chunk)
result = pd.concat(chunks, ignore_index=True)
del chunks
Note that this method never holds the entire unfiltered file in memory: only one raw chunk exists at a time, and the filtered chunks are briefly duplicated while pd.concat assembles the final result, which is a fair bargain. This method may also work out to be faster than using a converter function.
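As a quick sanity check (same hypothetical names as above), you can confirm that the stitched-together frame kept the compact dtype and contains no missing values:

print(result[colToConvert].dtype)   # uint32
print(result.isna().any().any())    # False: all rows with NaN were dropped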
Answered by John Zwinck
There is no feature in Pandas that does that. You can implement it in regular Python like this:
import csv
import pandas as pd

def filter_records(records):
    """Given an iterable of dicts, converts values to int.
    Discards any record which has an empty field."""
    for record in records:
        for k, v in record.items():  # iteritems() in Python 2
            if v == '':
                break
            record[k] = int(v)
        else:  # this executes whenever the inner loop did not break
            yield record

with open('t.csv') as infile:
    records = csv.DictReader(infile)
    df = pd.DataFrame.from_records(filter_records(records))
Pandas uses the csv module internally anyway. If the performance of the above turns out to be a problem, you could probably speed it up with Cython (which Pandas also uses).
Answered by Merlin
If you show some data, SO people could help more.
pd.read_csv('FILE', keep_default_na=False)
For starters try these:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
na_values : str or list-like or dict, default None
    Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'.
keep_default_na : bool, default True
    If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they're appended to.
na_filter : boolean, default True
    Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
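To make the excerpt concrete, here is a small sketch (the inline CSV is invented for illustration) showing how these options change what is treated as NaN:

import io
import pandas as pd

csv_data = "a,b\n1,N/A\n3,7\n"

# Default: 'N/A' is recognized as NaN, so column b is promoted to float64.
print(pd.read_csv(io.StringIO(csv_data)).dtypes)

# keep_default_na=False (with no na_values given): 'N/A' stays a plain
# string, so no NaN is produced and column b comes back as object.
print(pd.read_csv(io.StringIO(csv_data), keep_default_na=False).dtypes)

# na_filter=False skips NA detection entirely; this can speed up reading
# a large file that is known to contain no missing values.
print(pd.read_csv(io.StringIO(csv_data), na_filter=False).dtypes)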