Notice: this content comes from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/49215096/
Moving data from sqlalchemy to a pandas DataFrame
Asked by David Collins
I am trying to load the results of an SQLAlchemy query into a pandas DataFrame.
When I do:
df = pd.DataFrame(LPRRank.query.all())
I get
>>> df
0 <M. Misty || 1 || 18>
1 <P. Patch || 2 || 18>
...
...
But what I want is for each column in the database to be a column in the DataFrame:
0 M. Misty 1 18
1 P. Patch 2 18
...
...
And when I try:
dff = pd.read_sql_query(LPRRank.query.all(), db.session())
I get an AttributeError:
AttributeError: 'SignallingSession' object has no attribute 'cursor'
and
dff = pd.read_sql_query(LPRRank.query.all(), db.session)
also gives an error:
AttributeError: 'scoped_session' object has no attribute 'cursor'
What I'm using to generate the list of objects is:
app = Flask(__name__)
db = SQLAlchemy(app)

class LPRRank(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    candid = db.Column(db.String(40), index=True, unique=False)
    rank = db.Column(db.Integer, index=True, unique=False)
    user_id = db.Column(db.Integer, db.ForeignKey('lprvote.id'))

    def __repr__(self):
        return '<{} || {} || {}>'.format(self.candid,
                                         self.rank, self.user_id)
This question, "How to convert SQL Query result to PANDAS Data Structure?", is error free but gives each row as an object, which is not what I want. I can access the individual columns in the returned object, but it seems like there should be a better way to do it.
The documentation at pandas.pydata.org is great if you already understand what is going on and just need to review syntax. The documentation from April 20, 2016 (the 1319 page pdf) identifies a pandas connection as still experimental on p.872.
Now, "SQLALCHEMY/PANDAS - SQLAlchemy reading column as CLOB for Pandas to_sql" is about specifying the SQL type. Mine is SQLAlchemy, which is the default.
And "sqlalchemy pandas to_sql OperationalError", "Writing to MySQL database with pandas using SQLAlchemy, to_sql", and "SQLAlchemy/pandas to_sql for SQLServer -- CREATE TABLE in master db" are about writing to the SQL database, which produces an operational error, a database error, and a 'create table' error, none of which are my problems.
This one, "SQLAlchemy Pandas read_sql from jsonb", wants a jsonb attribute converted to columns: not my cup o' tea.
This previous question, "SQLAlchemy ORM conversion to pandas DataFrame", addresses my issue, but its solution, using query.session.bind, does not work for me. I'm opening/closing sessions with db.session.add() and db.session.commit(), but when I use db.session.bind as specified in the second answer there, I get an AttributeError:
AttributeError: 'list' object has no attribute '_execute_on_connection'
Answered by Parfait
Simply add an __init__ method to your model and call the class object before building the DataFrame. Specifically, the code below creates an iterable of tuples that pandas.DataFrame() binds to columns.
class LPRRank(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    candid = db.Column(db.String(40), index=True, unique=False)
    rank = db.Column(db.Integer, index=True, unique=False)
    user_id = db.Column(db.Integer, db.ForeignKey('lprvote.id'))

    def __init__(self, candid=None, rank=None, user_id=None):
        self.data = (candid, rank, user_id)

    def __repr__(self):
        # __repr__ must return a string, so format the tuple of column values
        return repr((self.candid, self.rank, self.user_id))

# Query the model objects, then pull the column values out of each one
data = db.session.query(LPRRank).all()
df = pd.DataFrame([(d.candid, d.rank, d.user_id) for d in data],
                  columns=['candid', 'rank', 'user_id'])
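The list comprehension above names the three attributes twice. As a minimal sketch of the same idea with the column names listed only once, the following uses a stand-in class so that it runs without a database. Row and rows_from_objects are hypothetical names introduced here for illustration; they are not part of the answer, Flask-SQLAlchemy, or pandas.

```python
from typing import Any, Iterable, List, Sequence, Tuple


class Row:
    """Hypothetical stand-in for an ORM object such as LPRRank."""

    def __init__(self, candid: str, rank: int, user_id: int) -> None:
        self.candid = candid
        self.rank = rank
        self.user_id = user_id


def rows_from_objects(objects: Iterable[Any],
                      columns: Sequence[str]) -> List[Tuple[Any, ...]]:
    """Read the named attributes off each object, one tuple per object."""
    return [tuple(getattr(obj, name) for name in columns) for obj in objects]


columns = ("candid", "rank", "user_id")
objects = [Row("M. Misty", 1, 18), Row("P. Patch", 2, 18)]
rows = rows_from_objects(objects, columns)
```

The resulting rows and columns can then be handed to pandas, for example pd.DataFrame(rows, columns=list(columns)), which gives one DataFrame column per attribute.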
Alternatively, use the SQLAlchemy ORM with your defined model class, LPRRank, to run read_sql:
df = pd.read_sql(sql=db.session.query(LPRRank)
                       .with_entities(LPRRank.candid,
                                      LPRRank.rank,
                                      LPRRank.user_id).statement,
                 con=db.session.bind)
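The key point in the call above is that read_sql receives the query's .statement (a SELECT construct) and a database connection or engine, not the list returned by .all(). A self-contained sketch of that pattern follows, assuming plain SQLAlchemy 1.4 or later with an in-memory SQLite database instead of the Flask-SQLAlchemy setup from the question. The Rank model and its table name are hypothetical stand-ins for LPRRank.

```python
import pandas as pd
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Rank(Base):
    """Hypothetical stand-in for the LPRRank model."""
    __tablename__ = "rank"
    id = Column(Integer, primary_key=True)
    candid = Column(String(40))
    rank = Column(Integer)
    user_id = Column(Integer)


# In-memory SQLite database, used only for this illustration
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([
        Rank(candid="M. Misty", rank=1, user_id=18),
        Rank(candid="P. Patch", rank=2, user_id=18),
    ])
    session.commit()

    # Select only the wanted columns; nothing is executed yet
    query = session.query(Rank.candid, Rank.rank, Rank.user_id).order_by(Rank.rank)

    # Hand pandas the SELECT statement and the engine so it runs the query itself
    df = pd.read_sql(query.statement, session.get_bind())
```

Passing query.all() instead would hand pandas a Python list, which is consistent with the "'list' object has no attribute '_execute_on_connection'" error reported in the question.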
Answered by bioinfornatics
Parfait's answer is good but could have two problems:
- Efficiency: each object creation implies duplicating the data into a DataFrame, so a list of DataFrames could take time to create
- It does not model a DataFrame as a collection of rows
Thus the example below provides a parent class that corresponds to a DataFrame representation and a child class that corresponds to a row of a given DataFrame.
The code below provides two ways to get a DataFrame. The DataFrame object is created only on demand, so as not to waste CPU and memory.
If the DataFrame is needed at creation time, you only have to add a constructor (def __init__(self, rows: List[MyDataFrameRow] = None)...), create a new attribute, and assign it the result of self.data_frame.
from typing import Tuple

from pandas import DataFrame, read_sql
from sqlalchemy import Column, Integer, String, ForeignKey
# declarative_base lives in sqlalchemy.orm from SQLAlchemy 1.4 on;
# older versions import it from sqlalchemy.ext.declarative
from sqlalchemy.orm import declarative_base, relationship, Session

Base = declarative_base()


class MyDataFrame(Base):
    __tablename__ = 'my_data_frame'
    id = Column(Integer, primary_key=True)
    rows = relationship('MyDataFrameRow', cascade='all,delete')

    @property
    def data_frame(self) -> DataFrame:
        # Built on demand from the rows loaded through the relationship
        columns = MyDataFrameRow.data_frame_columns()
        return DataFrame([[getattr(row, column) for column in columns] for row in self.rows],
                         columns=columns)

    @staticmethod
    def to_data_frame(identifier: int, session: Session) -> DataFrame:
        # Let pandas run the SELECT itself, without creating ORM objects
        query = session.query(MyDataFrameRow).join(MyDataFrame).filter(MyDataFrame.id == identifier)
        return read_sql(query.statement, session.get_bind())


class MyDataFrameRow(Base):
    __tablename__ = 'my_data_row'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    age = Column(Integer)
    number_of_children = Column(Integer)
    height = Column(Integer)
    parent_id = Column(Integer, ForeignKey('my_data_frame.id'))

    @staticmethod
    def data_frame_columns() -> Tuple[str, ...]:
        # Data columns only: skip the primary key and any foreign keys
        return tuple(column.name for column in MyDataFrameRow.__table__.columns
                     if len(column.foreign_keys) == 0 and column.primary_key is False)

...
session = Session(...)
df1 = MyDataFrame.to_data_frame(1, session)
my_table_obj = session.query(MyDataFrame).filter(MyDataFrame.id == 1).one()
df2 = my_table_obj.data_frame