pandas get_dummies python memory error
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me) on StackOverflow.
Original URL: http://stackoverflow.com/questions/31321892/
get_dummies python memory error
Asked by Duesentrieb
I'm having a problem with a data set that has 400,000 rows and 300 variables. I have to get dummy variables for a categorical variable with 3,000+ different items. In the end I want a data set with 3,300 variables or features so that I can train a RandomForest model.
Here is what I've tried to do:
df = pd.concat([df, pd.get_dummies(df['itemID'], prefix='itemID_')], axis=1)
When I do that I always get a memory error. Is there a limit to the number of variables I can have?
If I do that with only the first 1,000 rows (which have 374 different categories) it works just fine.
Does anyone have a solution for my problem? The computer I'm using has 8 GB of memory.
Answered by JohnE
Update: Starting with version 0.19.0, get_dummies returns an 8-bit integer rather than a 64-bit float, which will fix this problem in many cases and make the astype solution below unnecessary. See: get_dummies -- pandas 0.19.0
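For example (my own quick check, not part of the original answer; on pandas 0.19.x through 1.x the dummy columns come back as uint8, while newer versions may use bool instead):

import pandas as pd
pd.get_dummies(pd.Series(['a', 'b', 'c'])).dtypes
# each dummy column reports uint8 here (bool on the newest pandas releases)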
But in other cases, the sparse option described below may still be helpful.
Original Answer: Here are a couple of possibilities to try. Both will reduce the memory footprint of the dataframe substantially, but you could still run into memory issues later. It's hard to predict; you'll just have to try.
(Note that I am simplifying the output of info() below.)
import numpy as np
import pandas as pd
df = pd.DataFrame({'itemID': np.random.randint(1, 4, 100)})
pd.concat([df, pd.get_dummies(df['itemID'], prefix='itemID_')], axis=1).info()
itemID 100 non-null int32
itemID__1 100 non-null float64
itemID__2 100 non-null float64
itemID__3 100 non-null float64
memory usage: 3.5 KB
Here's our baseline. Each dummy column takes up 800 bytes because the sample data has 100 rows and get_dummies appears to default to float64 (8 bytes). This seems like an unnecessarily inefficient way to store dummies, as you could use as little as a bit to do it, but there may be some reason for that which I'm not aware of.
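As a rough back-of-the-envelope check (my own addition, using the sizes from the question rather than anything in the original answer), the float64 default is exactly what pushes the full dummy matrix past 8 GB, while one byte per value brings it back into range:

rows, dummy_cols = 400000, 3000              # approximate sizes from the question
print(rows * dummy_cols * 8 / 1e9, "GB")     # float64: ~9.6 GB, more than the 8 GB of RAM
print(rows * dummy_cols * 1 / 1e9, "GB")     # int8:    ~1.2 GB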
So, first attempt: just change to a one-byte integer (this doesn't seem to be an option for get_dummies, so it has to be done as a conversion with astype(np.int8)).
pd.concat([df, pd.get_dummies(df['itemID'], prefix='itemID_').astype(np.int8)],
          axis=1).info()
itemID 100 non-null int32
itemID__1 100 non-null int8
itemID__2 100 non-null int8
itemID__3 100 non-null int8
memory usage: 1.5 KB
Each dummy column now takes up 1/8 of the memory it did before.
Alternatively, you can use the sparse option of get_dummies.
pd.concat([df, pd.get_dummies(df['itemID'], prefix='itemID_', sparse=True)],
          axis=1).info()
itemID 100 non-null int32
itemID__1 100 non-null float64
itemID__2 100 non-null float64
itemID__3 100 non-null float64
memory usage: 2.0 KB
Fairly comparable savings. The info() output somewhat hides the way the savings are occurring, but you can look at the memory usage value to see the total savings.
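If you want to see exactly where the savings come from rather than relying on info()'s summary line, one option in reasonably recent pandas versions (my addition, reusing the df from the example above) is DataFrame.memory_usage():

sparse_df = pd.concat([df, pd.get_dummies(df['itemID'], prefix='itemID_', sparse=True)],
                      axis=1)
print(sparse_df.memory_usage(deep=True))        # bytes per column
print(sparse_df.memory_usage(deep=True).sum())  # total, handy for comparing the approaches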
Which of these will work better in practice will depend on your data, so you'll just need to give them each a try (or you could even combine them).
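And if you do want to combine them, here is a minimal sketch (again my addition), assuming a pandas version new enough that get_dummies accepts a dtype argument (added in 0.23.0):

combined = pd.concat([df, pd.get_dummies(df['itemID'], prefix='itemID_',
                                         sparse=True, dtype=np.int8)],
                     axis=1)
combined.info()   # sparse one-byte dummy columns instead of dense float64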

