pandas: How to set the primary key when writing a pandas DataFrame to a SQLite database table using df.to_sql

Question

Asked by yoshiserry

I have created a sqlite database using pandas df.to_sql; however, accessing it seems considerably slower than just reading in the 500 MB CSV file.

I need to:

1. set the primary key for each table using the df.to_sql method
2. tell the sqlite database what datatype each of the columns in my dataframe is. Can I pass a list like [integer,integer,text,text]?

code.... (format code button not working)

if ext == ".csv":
    df = pd.read_csv("/Users/data/" + filename)
    columns = df.columns
    columns = [i.replace(' ', '_') for i in columns]

    df.columns = columns
    df.to_sql(name, con, flavor='sqlite', schema=None, if_exists='replace', index=True, index_label=None, chunksize=None, dtype=None)

Answer 1

Answered by Chris Guarino

Unfortunately there is no way right now to set a primary key in the pandas df.to_sql() method. Additionally, just to make things more of a pain there is no way to set a primary key on a column in sqlite after a table has been created.

However, a workaround at the moment is to create the table in sqlite with the pandas df.to_sql() method. Then you can create a duplicate table with your primary key set, copy your data over, and drop the old table to clean up.

It would be something along the lines of this.

import pandas as pd
import sqlite3

# connect to the database
conn = sqlite3.connect('database')
c = conn.cursor()

df = pd.read_csv("/Users/data/" + filename)
# replace spaces in the column names with underscores
df.columns = [i.replace(' ', '_') for i in df.columns]

# write the pandas dataframe to a sqlite table
# (flavor= was deprecated and later removed from pandas, so it is left out here)
df.to_sql(name, conn, if_exists='replace', index=True)

# my_table, new_table, col_1 and col_2 are placeholders for your own names
# (the bare word "table" is reserved in SQL and cannot be used unquoted)
c.executescript('''
    PRAGMA foreign_keys=off;

    BEGIN TRANSACTION;
    ALTER TABLE my_table RENAME TO old_table;

    /*create a new table with the same column names and types while
    defining a primary key for the desired column*/
    CREATE TABLE new_table (col_1 TEXT PRIMARY KEY NOT NULL,
                            col_2 TEXT);

    INSERT INTO new_table SELECT * FROM old_table;

    DROP TABLE old_table;
    COMMIT TRANSACTION;

    PRAGMA foreign_keys=on;''')

# close out the connection
c.close()
conn.close()

I have done this in the past when I faced this issue; I just wrapped the whole thing in a function to make it more convenient.

In my limited experience with sqlite, I have found that not being able to add a primary key after a table has been created, not being able to perform upserts (update-or-insert), and the lack of UPDATE JOIN have caused a lot of frustration and some unconventional workarounds.

Lastly, the pandas df.to_sql() method has a dtype keyword argument that takes a dictionary mapping column names to types, e.g. dtype={'col_1': 'TEXT'}. With a plain sqlite3 connection the types are given as strings.
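As a rough illustration of the dtype argument, here is a minimal sketch that writes a small frame to an in-memory database and reads back the schema sqlite stored. The table name demo and the sample data are made up for the example; it assumes a plain sqlite3 connection, where pandas expects the types as strings.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data; column names match the dtype keys below
df = pd.DataFrame({"col_1": ["a", "b"], "col_2": [1, 2]})

conn = sqlite3.connect(":memory:")

# Pass one SQL type per column as a dict of column name -> type string
df.to_sql("demo", conn, index=False, dtype={"col_1": "TEXT", "col_2": "INTEGER"})

# Read back the CREATE TABLE statement sqlite recorded for the table
schema = conn.execute(
    "select sql from sqlite_master where name = 'demo' and type = 'table'"
).fetchone()[0]
print(schema)
```

So the answer to "can I pass a list" is no: the argument is a dictionary keyed by column name, not a positional list of types.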

Answer 2

Answered by RobinL

Building on Chris Guarino's answer, here are some functions that provide a more general solution. See the example at the bottom for how to use them.

import re

def get_create_table_string(tablename, connection):
    # Fetch the CREATE TABLE statement that sqlite stored for this table
    sql = """
    select sql from sqlite_master where name = ? and type = 'table'
    """
    result = connection.execute(sql, (tablename,))

    create_table_string = result.fetchone()[0]
    return create_table_string

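def add_pk_to_create_table_string(create_table_string, colname):
    # Append PRIMARY KEY to the column definition for colname.
    # The replacement must be a raw string so that \1 refers to the first group.
    regex = "(\n.+{}[^,]+)(,)".format(colname)
    return re.sub(regex, r"\1 PRIMARY KEY,", create_table_string, count=1)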

def add_pk_to_sqlite_table(tablename, index_column, connection):
    cts = get_create_table_string(tablename, connection)
    cts = add_pk_to_create_table_string(cts, index_column)
    template = """
    BEGIN TRANSACTION;
        ALTER TABLE {tablename} RENAME TO {tablename}_old_;

        {cts};

        INSERT INTO {tablename} SELECT * FROM {tablename}_old_;

        DROP TABLE {tablename}_old_;

    COMMIT TRANSACTION;
    """

    create_and_drop_sql = template.format(tablename = tablename, cts = cts)
    connection.executescript(create_and_drop_sql)

# Example:

# import pandas as pd 
# import sqlite3

# df = pd.DataFrame({"a": [1,2,3], "b": [2,3,4]})
# con = sqlite3.connect("deleteme.db")
# df.to_sql("df", con, if_exists="replace")

# add_pk_to_sqlite_table("df", "index", con)
# r = con.execute("select sql from sqlite_master where name = 'df' and type = 'table'")
# print(r.fetchone()[0])

There is a gist of this code here

Answer 3

Answered by Shawn

In SQLite, with a normal rowid table, unless the primary key is a single INTEGER column (see "ROWIDs and the INTEGER PRIMARY KEY" in the documentation), it's equivalent to a UNIQUE index (because the real PK of a normal table is the rowid).

Notes from the documentation for rowid tables:

The PRIMARY KEY of a rowid table (if there is one) is usually not the true primary key for the table, in the sense that it is not the unique key used by the underlying B-tree storage engine. The exception to this rule is when the rowid table declares an INTEGER PRIMARY KEY. In the exception, the INTEGER PRIMARY KEY becomes an alias for the rowid.
The true primary key for a rowid table (the value that is used as the key to look up rows in the underlying B-tree storage engine) is the rowid.
The PRIMARY KEY constraint for a rowid table (as long as it is not the true primary key or INTEGER PRIMARY KEY) is really the same thing as a UNIQUE constraint. Because it is not a true primary key, columns of the PRIMARY KEY are allowed to be NULL, in violation of all SQL standards.
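The behaviour described in these notes can be checked with a short sketch using only the standard sqlite3 module. The table names t_text and t_int are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_text (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("CREATE TABLE t_int (k INTEGER PRIMARY KEY, v TEXT)")

# A non-INTEGER PRIMARY KEY on a rowid table behaves like a UNIQUE
# constraint, so NULL keys are accepted (even more than once)
conn.execute("INSERT INTO t_text (k, v) VALUES (NULL, 'a')")
conn.execute("INSERT INTO t_text (k, v) VALUES (NULL, 'b')")
null_rows = conn.execute(
    "SELECT count(*) FROM t_text WHERE k IS NULL"
).fetchone()[0]

# An INTEGER PRIMARY KEY is an alias for the rowid: inserting NULL
# makes sqlite generate the key value
conn.execute("INSERT INTO t_int (k, v) VALUES (NULL, 'a')")
k, rowid = conn.execute("SELECT k, rowid FROM t_int").fetchone()

print(null_rows, k, rowid)
```

Adding NOT NULL to the key column, as the CREATE TABLE in the first answer does, closes the NULL loophole for non-integer keys.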

So you can easily fake a primary key after creating the table with:

CREATE UNIQUE INDEX mytable_fake_pk ON mytable(pk_column)

Besides the NULL thing, you won't get the benefits of an INTEGER PRIMARY KEY if your column is supposed to hold integers (taking up less space, and auto-generating values on insert if left out), but it'll otherwise work for most purposes.
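A minimal sketch of the fake primary key from Python, again with only the standard sqlite3 module. The table layout and sample rows are assumptions for the example; mytable, pk_column and mytable_fake_pk follow the statement above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for a table that df.to_sql created without a primary key
conn.execute("CREATE TABLE mytable (pk_column TEXT, value INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [("a", 1), ("b", 2)])

# Add the unique index after the fact
conn.execute("CREATE UNIQUE INDEX mytable_fake_pk ON mytable(pk_column)")

# Duplicate keys are now rejected, as they would be with a real primary key
try:
    conn.execute("INSERT INTO mytable VALUES ('a', 3)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Lookups on the key column can use the index instead of scanning the table
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM mytable WHERE pk_column = 'a'"
).fetchall()
uses_index = any("mytable_fake_pk" in row[-1] for row in plan)

print(duplicate_rejected, uses_index)
```

The index has to be created on data that is already unique in that column; otherwise the CREATE UNIQUE INDEX statement itself fails.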
