How to import a mysqldump into Pandas

Disclaimer: This Q&A is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/27584405/

python, mysql, pandas, mysqldump, pandas-datareader

Asked by Keith

I am interested in whether there is a simple way to import a mysqldump into Pandas.

I have a few small (~110MB) tables and I would like to have them as DataFrames.

I would like to avoid having to put the data back into a database, since that would require installing and connecting to such a database. I have the .sql files and want to import the contained tables into Pandas. Does any module exist to do this?

If versioning matters, the .sql files all list "MySQL dump 10.13 Distrib 5.6.13, for Win32 (x86)" as the system the dump was produced on.

Background in hindsight

I was working locally on a computer with no database connection. The normal flow for my work was to be given a .tsv, .csv or json file from a third party and to do some analysis, which would be given back. A new third party gave all their data in .sql format, and this broke my workflow, since I would need a lot of overhead to get it into a format my programs could take as input. We ended up asking them to send the data in a different format, but for business/reputation reasons we wanted to look for a workaround first.

Edit: Below is a sample mysqldump file with two tables.

/*
MySQL - 5.6.28 : Database - ztest
*********************************************************************
*/


/*!40101 SET NAMES utf8 */;

/*!40101 SET SQL_MODE=''*/;

/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/`ztest` /*!40100 DEFAULT CHARACTER SET latin1 */;

USE `ztest`;

/*Table structure for table `food_in` */

DROP TABLE IF EXISTS `food_in`;

CREATE TABLE `food_in` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `Cat` varchar(255) DEFAULT NULL,
  `Item` varchar(255) DEFAULT NULL,
  `price` decimal(10,4) DEFAULT NULL,
  `quantity` decimal(10,0) DEFAULT NULL,
  KEY `ID` (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=latin1;

/*Data for the table `food_in` */

insert  into `food_in`(`ID`,`Cat`,`Item`,`price`,`quantity`) values
(2,'Liq','Beer','2.5000','300'),
(7,'Liq','Water','3.5000','230'),
(9,'Liq','Soda','3.5000','399');

/*Table structure for table `food_min` */

DROP TABLE IF EXISTS `food_min`;

CREATE TABLE `food_min` (
  `Item` varchar(255) DEFAULT NULL,
  `quantity` decimal(10,0) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

/*Data for the table `food_min` */

insert  into `food_min`(`Item`,`quantity`) values
('Pizza','300'),
('Hotdogs','200'),
('Beer','300'),
('Water','230'),
('Soda','399'),
('Soup','100');

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;

Answered by firelynx

No

Pandas has no native way of reading a mysqldump without it passing through a database.

There is a possible workaround, but it is in my opinion a very bad idea.

Workaround (Not recommended for production use)

Of course you could parse the data from the mysqldump file using a preprocessor.

MySQLdump files often contain a lot of extra data we are not interested in when loading a pandas dataframe, so we need to preprocess the file, removing the noise and even reformatting lines so that they conform to a CSV-like format.

Using StringIO we can read a file and process the data before it is fed to the pandas.read_csv function.

from io import StringIO
import re

def read_dump(dump_filename, target_table):
    sio = StringIO()

    fast_forward = True
    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            # Skip ahead until the INSERT statement for the target table
            if line.lower().startswith('insert') and target_table in line:
                fast_forward = False
            if fast_forward:
                continue
            # Grab the first parenthesised group on the line; on the INSERT
            # line itself this is the column list, which becomes the header
            data = re.findall(r'\([^)]*\)', line)
            try:
                newline = data[0]
                newline = newline.strip(' ()')
                newline = newline.replace('`', '')
                sio.write(newline)
                sio.write("\n")
            except IndexError:
                pass
            # The statement's closing semicolon marks the end of the table data
            if line.endswith(';'):
                break
    sio.seek(0)
    return sio

Now that we have a function that reads and formats the data to look like a CSV file, we can read it with pandas.read_csv()

import pandas as pd

food_min_filedata = read_dump('mysqldumpexample', 'food_min')
food_in_filedata = read_dump('mysqldumpexample', 'food_in')

df_food_min = pd.read_csv(food_min_filedata)
df_food_in = pd.read_csv(food_in_filedata)

Results in:

        Item quantity
0    'Pizza'    '300'
1  'Hotdogs'    '200'
2     'Beer'    '300'
3    'Water'    '230'
4     'Soda'    '399'
5     'Soup'    '100'

and

   ID    Cat     Item     price quantity
0   2  'Liq'   'Beer'  '2.5000'    '300'
1   7  'Liq'  'Water'  '3.5000'    '230'
2   9  'Liq'   'Soda'  '3.5000'    '399'

Note on Stream processing

This approach is called stream processing and is incredibly streamlined, taking almost no memory at all. In general it is a good idea to use this approach to read csv files more efficiently into pandas (a small sketch of the general idea follows).

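As a general illustration of the same idea (my addition, not part of the original answer), pandas can stream a plain CSV file in chunks via the chunksize parameter of read_csv, so the whole file never sits in memory at once. The file name and the column used in the filter are placeholders:

import pandas as pd

# Read a large CSV in chunks of 100,000 rows; each chunk is a DataFrame.
# 'big_file.csv' and the 'quantity' column are hypothetical names.
chunks = pd.read_csv('big_file.csv', chunksize=100000)
df = pd.concat(chunk[chunk['quantity'] > 0] for chunk in chunks)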

It is the parsing of a mysqldump file that I advise against.

Answered by Andy Hayden

One way is to export the mysqldump to sqlite (e.g. by running this shell script) and then read the sqlite file/database.

See the SQL section of the docs:

pd.read_sql_table(table_name, sqlite_file)
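
As a minimal sketch of the reading step (assuming the dump was already converted to an SQLite file named dump.sqlite by the shell script above; the name is a placeholder), the standard library's sqlite3 connection also works with pandas:

import sqlite3
import pandas as pd

# 'dump.sqlite' is a hypothetical output name from the conversion step.
con = sqlite3.connect('dump.sqlite')
df_food_in = pd.read_sql_query('SELECT * FROM food_in', con)
con.close()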


Another option is just to run read_sql on the MySQL database directly; a minimal sketch follows.

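A sketch of that direct route, assuming a reachable MySQL server and the pymysql driver (the connection details are placeholders):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials: adjust user, password, host and database name.
engine = create_engine('mysql+pymysql://user:password@localhost/ztest')
df = pd.read_sql('SELECT * FROM food_in', engine)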

Answered by Tony S.

I found myself in a similar situation to yours, and the answer from @firelynx was really helpful!

But since I had only limited knowledge of the tables included in the file, I extended the script by adding header generation (pandas picks it up automatically), as well as searching for all the tables within the dump file. As a result, I ended up with the following script, which indeed works extremely fast. I switched to io.StringIO, and saved the resulting tables as table_name.csv files.

P.S. I also support the advice against relying on this approach, and provide the code just for illustration purposes :)

So, first things first, we can augment the read_dump function like this

from io import StringIO
import re, shutil

def read_dump(dump_filename, target_table):
    sio = StringIO()

    read_mode = 0 # 0 - skip, 1 - header, 2 - data
    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('insert') and target_table in line:
                read_mode = 2
            if line.lower().startswith('create table') and target_table in line:
                read_mode = 1
                continue

            if read_mode==0:
                continue

            # Filling up the headers
            elif read_mode==1:
                if line.lower().startswith('primary'):
                    # add more conditions here for different cases
                    #(e.g. when simply a key is defined, or no key is defined)
                    read_mode=0
                    sio.seek(sio.tell()-1) # delete last comma
                    sio.write('\n')
                    continue
                colheader = re.findall(r'`([\w_]+)`', line)
                for col in colheader:
                    sio.write(col.strip())
                    sio.write(',')

            # Filling up the data - same as @firelynx's code
            elif read_mode == 2:
                data = re.findall(r'\([^)]*\)', line)
                try:
                    newline = data[0]
                    newline = newline.strip(' ()')
                    newline = newline.replace('`', '')
                    sio.write(newline)
                    sio.write("\n")
                except IndexError:
                    pass
                if line.endswith(';'):
                    break
    sio.seek(0)
    with open(target_table + '.csv', 'w') as fd:
        shutil.copyfileobj(sio, fd, -1)
    return # or simply return sio itself

To find the list of tables we can use the following function:

def find_tables(dump_filename):
    table_list=[]

    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('create table'):
                table_name = re.findall(r'create table `([\w_]+)`', line.lower())
                table_list.extend(table_name)

    return table_list

Then just combine the two, for example in a .py script that you'll run like

python this_script.py mysqldump_name.sql [table_name]

import sys
import os.path

def main():
    try:
        if len(sys.argv) >= 2 and os.path.isfile(sys.argv[1]):
            if len(sys.argv) == 2:
                print('Table name not provided, looking for all tables...')
                table_list = find_tables(sys.argv[1])
                if len(table_list) > 0:
                    print('Found tables: ', str(table_list))
                    for table in table_list:
                        read_dump(sys.argv[1], table)
            elif len(sys.argv) == 3:
                read_dump(sys.argv[1], sys.argv[2])
    except KeyboardInterrupt:
        sys.exit(0)

if __name__ == '__main__':
    main()

回答by Peter Ebelsberger

I would like to share my solution to this problem and ask for feedback:

import pandas as pd
import re
import os.path
import csv
import logging
import sys


def convert_dump_to_intermediate_csv(dump_filename, csv_header, csv_out_put_file, delete_csv_file_after_read=True):
    """
    :param dump_filename: path to a MySQL export dump (mysqldump... syntax)
    :param csv_header: the very first line that should appear in the csv file, given as a comma-separated string
    :param csv_out_put_file: the name of the csv file
    :param delete_csv_file_after_read: if you set this to False, no new records will be written as the file exists.
    :return: returns a pandas dataframe for further analysis.
    """
    # Matches everything between 'INSERT INTO ... VALUES' and the closing ';'
    pre_compiled_all_values_per_line = re.compile(r'(?:INSERT\sINTO\s\S[a-z\S]+\sVALUES\s+)(?P<values>.*)(?=;)')
    # Matches one (...) tuple of values within the VALUES list
    value_compile = re.compile(r'\(.*?\)')
    with open(dump_filename, 'r') as f:
        for line in f:
            result = pre_compiled_all_values_per_line.finditer(line)
            for element in result:
                values_only = element.group('values')
                all_identified = value_compile.finditer(values_only)
                for single_values in all_identified:
                    string_to_split = single_values.group(0)[1:-1]  # strip the parentheses
                    string_array = string_to_split.split(",")

                    if not os.path.exists(csv_out_put_file):
                        # First write: create the file and emit the header row
                        with open(csv_out_put_file, 'w', newline='') as file:
                            writer = csv.writer(file)
                            writer.writerow(csv_header.split(","))
                            writer.writerow(string_array)
                    else:
                        with open(csv_out_put_file, 'a', newline='') as file:
                            writer = csv.writer(file)
                            writer.writerow(string_array)
    df = pd.read_csv(csv_out_put_file)
    if delete_csv_file_after_read:
        os.remove(csv_out_put_file)
    return df


if __name__ == "__main__":
    log_name = 'test.log'
    LOGGER = logging.getLogger(log_name)
    LOGGER.setLevel(logging.DEBUG)
    LOGGER.addHandler(logging.NullHandler())
    FORMATTER = logging.Formatter(
        fmt='%(asctime)s %(levelname)-8s %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S')
    SCREEN_HANDLER = logging.StreamHandler(stream=sys.stdout)
    SCREEN_HANDLER.setFormatter(FORMATTER)
    LOGGER.addHandler(SCREEN_HANDLER)

    dump_filename = 'test_sql.sql'
    header_of_csv_file = "A,B,C,D,E,F,G,H,I"  # I did not identify the columns in the table definition...
    csv_output_file = 'test.csv'
    pandas_df = convert_dump_to_intermediate_csv(dump_filename, header_of_csv_file, csv_output_file, delete_csv_file_after_read=False)
    LOGGER.debug(pandas_df)

Of course, the logger part can be removed.... EDIT: I did not see that this topic was that old. I am sorry for this.
