如何使用 Pandas 将巨大的 CSV 转换为 SQLite?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/34672581/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
How to transform a huge CSV into SQLite using Pandas?
提问 by Roman
I have a huge table (about 60 GB) in the form of an archived CSV file. I want to transform it into an SQLite file.
我有一个存档 CSV 文件形式的大表(大约 60 GB)。我想将其转换为 SQLite 文件。
What I do at the moment is the following:
我现在做的事情如下:
import pandas
import sqlite3
cnx = sqlite3.connect('db.sqlite')
df = pandas.read_csv('db.gz', compression='gzip')
df.to_sql('table_name', cnx)
It works fine for smaller files, but with huge files I run into memory problems. The problem is that pandas reads the whole table into memory (RAM) and then saves it into the SQLite file.
它适用于较小的文件,但对于大文件我会遇到内存问题。问题在于 pandas 会把整个表读入内存(RAM),然后再保存到 SQLite 文件中。
Is there an elegant solution to this problem?
这个问题有优雅的解决方案吗?
采纳答案 by AChampion
This is going to be problematic with pandas due to its size. Any reason you can't use the csv module and just iterate through the file?
由于文件这么大,用 pandas 处理会有问题。有什么原因不能直接用 csv 模块逐行遍历文件吗?
Basic idea (untested):
基本思想(未经测试):
import gzip
import csv
import sqlite3

# Stream the gzipped CSV straight into SQLite without ever loading it all into memory.
# Note: the target table must already exist, and the number of ? placeholders must
# match its column count.
with gzip.open('db.gz', 'rt') as f, sqlite3.connect('db.sqlite') as cnx:
    reader = csv.reader(f)  # 'rt' opens the stream in text mode, as csv expects
    c = cnx.cursor()
    c.executemany('insert into table_name values (?,?,...)', reader)
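The snippet above also assumes the target table already exists and that the CSV has no header row. A minimal sketch of that extra setup, assuming hypothetical column names col_a and col_b (the real schema must match your file):
上面的代码片段还假设目标表已经存在,且 CSV 没有表头。下面是补齐这部分准备工作的一个最小示意,其中 col_a、col_b 是假设的列名(实际结构必须与你的文件一致):

import gzip
import csv
import sqlite3

with gzip.open('db.gz', 'rt', newline='') as f, sqlite3.connect('db.sqlite') as cnx:
    reader = csv.reader(f)
    next(reader)  # skip the header row, if the file has one
    # col_a / col_b are placeholder column names; replace them with your real schema
    cnx.execute('create table if not exists table_name (col_a text, col_b text)')
    cnx.executemany('insert into table_name values (?, ?)', reader)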
回答 by cwcobb
I haven't done any work with CSVs of that size, but it sounds like the kind of thing Odo might solve quickly.
我还没有处理过这么大的 CSV,但这听起来像是 Odo 可能很快就能解决的那类问题。
I did a cursory check of the docs, and it appears they've written something addressing the topic of parsing larger-than-memory CSVs into SQL databases, which specifically calls out SQLite3 as a destination.
我粗略查看了文档,他们似乎写过一篇专门讨论将大于内存的 CSV 解析进 SQL 数据库的内容,其中特别提到了以 SQLite3 作为目标。
Here's the example they publish for parsing a 33 GB text file.
这是他们发布的用于解析 33 GB 文本文件的示例。
In [1]: dshape = discover(resource('all.csv'))
In [2]: %time t = odo('all.no.header.csv', 'sqlite:///db.db::nyc',
   ...:              dshape=dshape)
CPU times: user 3.09 s, sys: 819 ms, total: 3.91 s
Wall time: 57min 31s
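For context, the quoted session assumes odo, discover, and resource have already been imported; a self-contained sketch of the same call (keeping the file and table names from the example, everything else assumed) would look roughly like this:
作为补充说明,上面引用的会话省略了 import;下面是同一调用的一个自包含示意(文件名和表名沿用示例,其余均为假设):

from odo import odo, discover, resource

# Infer the column types from the CSV, then stream the file into a SQLite
# table named 'nyc' inside db.db (names taken from the quoted example).
dshape = discover(resource('all.csv'))
odo('all.no.header.csv', 'sqlite:///db.db::nyc', dshape=dshape)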
回答 by Yul
[Update at 06-15-2017]
[06-15-2017 更新]
It seems that csv2sqlite.py may be the way to go with SQLite. Definitely, chunk-by-chunk loading is too slow for big files (> 1 GB). When I tested a 6.5 GB nyc311calls.csv with csv2sqlite.py, it took only ~24 minutes to create a SQLite database file with data type guessing. 24 minutes is similar to the time spent by MySQL with "LOAD DATA INFILE". This is not bad, even though you may need to change the data type for some of the columns. In my opinion, using csv2sqlite.py is currently the most time-efficient way to create a SQLite database file from a CSV file.
似乎 csv2sqlite.py 可能是 SQLite 的首选。逐块(chunk-by-chunk)加载对于大文件(> 1 GB)来说确实太慢了。当我用 csv2sqlite.py 测试 6.5 GB 的 nyc311calls.csv 时,创建一个 SQLite 数据库文件(并自动推断数据类型)只用了大约 24 分钟,与 MySQL 使用"LOAD DATA INFILE"所花的时间相当。即使你可能需要修改某些列的数据类型,这也不算差。在我看来,使用 csv2sqlite.py 是目前从 csv 文件创建 SQLite 数据库文件最省时的方法。
1) Download csv2sqlite.py from here and put it in the directory containing the csv file.
1) 从这里下载 csv2sqlite.py,并将其放在包含 csv 文件的目录中。
2) Using the Windows Prompt, go to the directory containing csv2sqlite.py and the csv file (e.g. nyc311calls.csv) you want to import.
2) 使用 Windows 命令提示符(Windows Prompt),转到包含 csv2sqlite.py 和要导入的 csv 文件(例如 nyc311calls.csv)的目录。
3) Run python csv2sqlite.py nyc311calls.csv database_name.db and wait. Note: the Python path should be included in your Windows Environment Variables.
3) 运行 python csv2sqlite.py nyc311calls.csv database_name.db 并等待。注意:Python 的路径需要包含在 Windows 环境变量(Environment Variables)中。
This is a slightly old question, but it seems nobody gave a clear answer. I hope my answer will help you. For SQLite, I recommend you look at this site, which gives you the idea of what you should do: a chunk-by-chunk load. I tested several approaches, and so far this is the most reliable way in my opinion.
这是一个有些老的问题,但似乎没有人给出明确的答案。希望我的回答能帮到你。对于 SQLite,我建议你看看这个站点,它会告诉你应该怎么做:逐块(chunk-by-chunk)加载。我测试了几种方法,到目前为止,在我看来这是最可靠的方式。
The basic procedure is as follows: 1) Import a small portion of the big table into pandas. 2) Process it and load it into SQLite. 3) Repeat until the whole file has been processed (a sketch of this loop is shown below).
基本过程如下:1) 将大表的一小部分导入 pandas;2) 处理后加载到 SQLite;3) 重复这个过程,直到整个文件处理完毕(下面给出这个循环的示意)。
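A minimal sketch of that loop, reusing the gzipped file and table name from the question; the chunk size of 100,000 rows is an arbitrary assumption you should tune to your memory budget:
下面是这个循环的一个最小示意,沿用问题中的 gzip 文件和表名;每块 100,000 行只是一个假设值,应根据你的内存情况调整:

import pandas
import sqlite3

cnx = sqlite3.connect('db.sqlite')
# read_csv with chunksize yields DataFrames of at most 100,000 rows each,
# so only one chunk is held in memory at a time
for chunk in pandas.read_csv('db.gz', compression='gzip', chunksize=100000):
    # any per-chunk cleaning or type conversion goes here
    chunk.to_sql('table_name', cnx, if_exists='append', index=False)
cnx.close()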
I uploaded a more detailed procedure of what I did here (Jupyter file), if you are interested. You can find the NYC311call data here.
如果你有兴趣,我把我所做步骤的更详细过程(Jupyter 文件)上传到了这里。你可以在这里找到 NYC311call 数据。
A few comments from me.
我的一些评论。
1) The Odo package does not fully work if your data includes empty strings. I hope they can improve these issues. In other words, if your data is very clean and well-organized, the Odo package might be an option.
1) 如果你的数据包含空字符串,Odo 包并不能完全正常工作,希望他们能改进这些问题。也就是说,如果你的数据非常干净且组织良好,那么 Odo 包可能是一个选择。
2) The above approach is really time-consuming. In particular, a ~6 GB table takes more than 24 hours, because pandas is slow.
2) 上述方法非常耗时。特别是,大约 6 GB 的表需要超过 24 小时,因为 pandas 很慢。
3) If you are not tied to SQLite, I would say MySQL with "LOAD DATA INFILE" is a good option for you. You can find out how to use it by searching the internet. As far as I tested, it is a much more reliable and efficient way. You can convert to SQLite later if you really need to use SQLite. Especially if the data has many empty strings and datetime columns that need to be converted to a datetime type, I would definitely go with MySQL.
3) 如果你不是非用 SQLite 不可,我会说带有"LOAD DATA INFILE"的 MySQL 对你来说是一个不错的选择。你可以通过网络搜索找到它的用法。就我测试的情况而言,这种方式要可靠高效得多。如果之后确实需要用 SQLite,再转换成 SQLite 即可。特别是当数据中有很多空字符串,以及需要转换为日期时间类型的日期时间列时,我肯定会选择 MySQL。