Python Pandas read_csv low_memory and dtype options

Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/24251219/

Pandas read_csv low_memory and dtype options

Tags: python, parsing, numpy, pandas, dataframe

Asked by Josh

When calling

df = pd.read_csv('somefile.csv')

I get:

/Users/josh/anaconda/envs/py27/lib/python2.7/site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype option on import or set low_memory=False.

Why is the dtype option related to low_memory, and why would making it False help with this problem?

Accepted answer by firelynx

The deprecated low_memory option

The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently [source].

The reason you get this low_memory warning is that guessing dtypes for each column is very memory demanding. Pandas tries to determine what dtype to set by analyzing the data in each column.

Dtype Guessing (very bad)

Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read unless you risk having to change the dtype of that column when you read the last value.

Consider the example of one file which has a column called user_id. It contains 10 million rows where the user_id is always a number. Since pandas cannot know it is only numbers, it will probably keep it as the original strings until it has read the whole file.
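
A small illustration of that behaviour, using made-up in-memory data rather than a real file: a column that mixes numbers with a non-numeric value ends up with the generic object dtype, i.e. the values stay strings.

import io
import pandas as pd

# Hypothetical data: user_id mixes numbers with one non-numeric value,
# so pandas keeps the whole column as object (strings).
csvdata = "user_id,username\n1,Alice\n2,Bob\nnot_a_number,Caesar\n"
df = pd.read_csv(io.StringIO(csvdata))
print(df["user_id"].dtype)  # object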

Specifying dtypes (should always be done)

adding

dtype={'user_id': int}

to the pd.read_csv() call will let pandas know, when it starts reading the file, that this column contains only integers.
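
For instance, a minimal sketch of the full call, using a small in-memory CSV instead of a real file:

import io
import pandas as pd

csvdata = "user_id,username\n1,Alice\n3,Bob\n"
# With the dtype given up front, pandas does not have to guess it.
df = pd.read_csv(io.StringIO(csvdata), dtype={"user_id": int})
print(df["user_id"].dtype)  # int64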

Also worth noting is that if the last line in the file had "foobar" written in the user_id column, loading would crash if the above dtype were specified.

Example of broken data that breaks when dtypes are defined

import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO


csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""
sio = StringIO(csvdata)
pd.read_csv(sio, dtype={"user_id": int, "username": "string"})

ValueError: invalid literal for long() with base 10: 'foobar'

dtypes are typically a numpy thing; read more about them here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

What dtypes exist?

We have access to numpy dtypes: float, int, bool, timedelta64[ns] and datetime64[ns]. Note that the numpy date/time dtypes are not time zone aware.

Pandas extends this set of dtypes with its own:

'datetime64[ns, <tz>]', which is a time zone aware timestamp.

'category', which is essentially an enum (strings represented by integer keys to save space).

'period[<freq>]', not to be confused with a timedelta; these objects are actually anchored to specific time periods.

'Sparse', 'Sparse[int]', 'Sparse[float]' are for sparse data, or 'data that has a lot of holes in it'. Instead of saving the NaN or None in the dataframe it omits the objects, saving space.

'Interval' is a topic of its own but its main use is for indexing. See more here

'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64' are all pandas specific integers that are nullable, unlike the numpy variant.

'string' is a specific dtype for working with string data and gives access to the .str attribute on the series.

'boolean' is like the numpy 'bool' but it also supports missing data.
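
A rough sketch showing a few of these extension dtypes side by side (the column names and values are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "user_id": pd.array([1, 2, None], dtype="Int64"),         # nullable integer
    "username": pd.array(["Alice", "Bob", None], dtype="string"),
    "tier": pd.Series(["free", "pro", "free"], dtype="category"),
    "active": pd.array([True, False, None], dtype="boolean"),  # bool with missing data
})
print(df.dtypes)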

Read the complete reference here:

Pandas dtype reference

Gotchas, caveats, notes

Setting dtype=object will silence the above warning, but will not make it more memory efficient, only process efficient if anything.

Setting dtype=unicode will not do anything, since to numpy, a unicode is represented as object.
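
For example (a small sketch with made-up in-memory data), forcing dtype=object loads without a warning, but every value is kept as a plain Python string:

import io
import pandas as pd

csvdata = "user_id,username\n1,Alice\nfoobar,Caesar\n"
# No DtypeWarning and no crash, but user_id stays as strings.
df = pd.read_csv(io.StringIO(csvdata), dtype=object)
print(df.dtypes)  # both columns are object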

Usage of converters

@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the read_csv process is a single process.

CSV files can be processed line by line and thus can be processed by multiple converters in parallel more efficiently by simply cutting the file into segments and running multiple processes, something that pandas does not support. But this is a different story.
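
A rough sketch of that idea outside of pandas itself, with made-up data, a stand-in for an expensive converter, and the file already cut into segments of whole lines:

import io
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

HEADER = "user_id,username\n"

def slow_convert(val):
    # Stand-in for an expensive per-value converter.
    return int(val) if str(val).isdigit() else 0

def parse_segment(segment):
    # Re-attach the header so each segment parses on its own.
    return pd.read_csv(io.StringIO(HEADER + segment),
                       converters={"user_id": slow_convert})

if __name__ == "__main__":
    segments = ["1,Alice\n2,Bob\n", "foobar,Caesar\n4,Dora\n"]
    with ProcessPoolExecutor() as pool:
        frames = list(pool.map(parse_segment, segments))
    df = pd.concat(frames, ignore_index=True)
    print(df)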

Answered by hd1

Try:

dashboard_df = pd.read_csv(p_file, sep=',', error_bad_lines=False, index_col=False, dtype='unicode')

According to the pandas documentation:

dtype : Type name or dict of column -> type

As for low_memory, it's True by default and isn't yet documented. I don't think it's relevant though. The error message is generic, so you shouldn't need to mess with low_memory anyway. Hope this helps, and let me know if you have further problems.

Answered by Neal

df = pd.read_csv('somefile.csv', low_memory=False)

This should solve the issue. I got exactly the same error when reading 1.8M rows from a CSV.

Answered by sparrow

As mentioned earlier by firelynx, if a dtype is explicitly specified and there is mixed data that is not compatible with that dtype, then loading will crash. I used a converter like this as a workaround to change the values with an incompatible data type so that the data could still be loaded.

import numpy as np
import pandas as pd

def conv(val):
    # Treat empty values as 0, and fall back to 0 for anything
    # that cannot be parsed as a number, instead of crashing.
    if not val:
        return 0
    try:
        return np.float64(val)
    except (ValueError, TypeError):
        return np.float64(0)

df = pd.read_csv(csv_file, converters={'COL_A': conv, 'COL_B': conv})

Answered by Dr Nigel

I had a similar issue with a ~400MB file. Setting low_memory=False did the trick for me. Do the simple things first: I would check that your dataframe isn't bigger than your system memory, reboot, and clear the RAM before proceeding. If you're still running into errors, it's worth making sure your .csv file is OK; take a quick look in Excel and make sure there's no obvious corruption. Broken original data can wreak havoc...

Answered by Rajat Saxena

It worked for me with low_memory=False while importing a DataFrame. That was the only change that worked for me:

df = pd.read_csv('export4_16.csv',low_memory=False)

Answered by Wim Folkerts

I was facing a similar issue when processing a huge CSV file (6 million rows). I had three issues:
1. the file contained strange characters (fixed using encoding)
2. the datatype was not specified (fixed using the dtype property)
3. using the above, I still faced an issue related to the file_format, which could not be determined from the filename (fixed using try..except)

import pandas as pd
from pathlib import Path

df = pd.read_csv(csv_file, sep=';', encoding='ISO-8859-1',
                 names=['permission','owner_name','group_name','size','ctime','mtime','atime','filename','full_filename'],
                 dtype={'permission':str,'owner_name':str,'group_name':str,'size':str,'ctime':object,'mtime':object,'atime':object,'filename':str,'full_filename':str,'first_date':object,'last_date':object})

try:
    # Derive the file format from the filename extension, if there is one.
    df['file_format'] = [Path(f).suffix[1:] for f in df.filename.tolist()]
except Exception:
    df['file_format'] = ''