Select columns using pandas dataframe.query()
Note: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/44610766/
Asked by javadba
The documentation on dataframe.query() is very terse: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html. I was also unable to find examples of projections by web search.
So I tried simply providing the column names: that gave a syntax error. Likewise for typing select and then the column names. So how do I do this?
Accepted answer by Max Power
After playing around with this for a while and reading through the source code for DataFrame.query, I can't figure out a way to do it.
If it's not impossible, apparently it's at least strongly discouraged. When this question came up on github, prolific Pandas dev/maintainer jreback suggested using df.eval() for selecting columns and df.query() for filtering on rows.
UPDATE:
javadba points out that the return value of eval is not a dataframe. For example, to flesh out jreback's example a bit more...
df.eval('A')
returns a Pandas Series, but
df.eval(['A', 'B'])
does not return a DataFrame; it returns a list (of Pandas Series).
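As a minimal sketch of the return type described above: the frame below is a made-up stand-in (random data, column names borrowed from the answer), and it only checks the single-expression case, since the behavior of a list argument may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame; only the column names come from the answer.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 3)), columns=['A', 'B', 'C'])

single = df.eval('A')        # a single column expression
combined = df.eval('A + B')  # arithmetic across two columns

# Both results are Series, not DataFrames.
print(type(single).__name__)
print(type(combined).__name__)
```

So eval is handy for computing one column-like result, but it is not a general projection operator.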
So it seems that ultimately the best way to keep the flexibility to filter on both rows and columns is to use iloc/loc, e.g.
df.loc[0:4, ['A', 'C']]
Output:
          A         C
0 -0.497163 -0.046484
1  1.331614  0.741711
2  1.046903 -2.511548
3  0.314644 -0.526187
4 -0.061883 -0.615978
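To make the loc/iloc suggestion concrete, here is a small sketch on a made-up frame with a default integer index (the data is random and will not match the output shown above). It shows label-based slicing, the position-based equivalent, and a boolean row mask combined with a column list.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame with a default integer index.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((8, 4)), columns=['A', 'B', 'C', 'D'])

# Label-based: rows labelled 0 through 4 (the end label is included), columns A and C.
by_label = df.loc[0:4, ['A', 'C']]

# Position-based: first five rows (end position excluded), columns at positions 0 and 2.
by_position = df.iloc[0:5, [0, 2]]

# Row filter and column selection in a single call.
filtered = df.loc[df['A'] > df['B'], ['A', 'C']]
```

Note that .loc slices include the end label while .iloc slices exclude the end position, which is why 0:4 and 0:5 select the same rows here.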
Answer by Scott Boston
DataFrame.query is more like the where clause in a SQL statement than the select part.
import pandas as pd
import numpy as np
np.random.seed(123)
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
To select a column or columns you can use the following:
df['A'] or df.loc[:,'A']
or
df[['A','B']] or df.loc[:,['A','B']]
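One detail worth spelling out, sketched here with the same setup as above: a single label gives back a Series, while a list of labels gives back a DataFrame, even when the list has only one entry. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

one_col = df['A']                 # single label -> Series
one_col_frame = df[['A']]         # list with one label -> DataFrame
two_cols = df.loc[:, ['A', 'B']]  # same result as df[['A', 'B']]

print(type(one_col).__name__, type(one_col_frame).__name__, type(two_cols).__name__)
```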
To use the .query method you do something like
df.query('A > B')
which would return all the rows where the value in column A is greater than the value in column B.
                   A         B         C         D
2000-01-03  1.265936 -0.866740 -0.678886 -0.094709
2000-01-04  1.491390 -0.638902 -0.443982 -0.434351
2000-01-05  2.205930  2.186786  1.004054  0.386186
2000-01-08 -0.140069 -0.861755 -0.255619 -2.798589
Which is more readable, in my opinion, than boolean index selection with
df[df['A'] > df['B']]
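The two spellings should select the same rows. A quick sketch that checks this, reusing the setup from this answer (the variable names via_query and via_mask are only illustrative):

```python
import numpy as np
import pandas as pd

np.random.seed(123)
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

via_query = df.query('A > B')     # expression given as a string
via_mask = df[df['A'] > df['B']]  # explicit boolean mask

# Same rows, same columns, same values.
print(via_query.equals(via_mask))
```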
Answer by javadba
pandasql
https://pypi.python.org/pypi/pandasql/0.1.0
Here is an example from the following blog: http://blog.yhat.com/posts/pandasql-sql-for-pandas-dataframes.html. The inputs are two DataFrames, meat and births, and this approach gives the projections, filtering, aggregation and sorting expected from SQL.
@maxpower did mention that this package is buggy, so let's see. At least the code from the blog, shown below, works fine.
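from pandasql import sqldf, load_meat, load_births

# Run a SQL query against the DataFrames visible in the global namespace.
pysqldf = lambda q: sqldf(q, globals())

q = """
SELECT
    m.date
    , m.beef
    , b.births
FROM
    meat m
LEFT JOIN
    births b
    ON m.date = b.date
WHERE
    m.date > '1974-12-31';
"""

# Sample datasets bundled with pandasql.
meat = load_meat()
births = load_births()

df = pysqldf(q)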
The output is a pandas DataFrame, as desired.
It is working great for my particular use case (evaluating US crimes):
odf = pysqldf("select %s from df where sweapons > 10 order by sweapons desc limit 10" %scols)
p('odf\n', odf)
odf
: SMURDER SRAPE SROBBERY SAGASSLT SOTHASLT SVANDLSM SWEAPONS
0 0 0 0 1 1 10 54
1 0 0 0 0 1 0 52
2 0 0 0 0 1 0 46
3 0 0 0 0 1 0 43
4 0 0 0 0 1 0 33
5 1 0 2 16 28 4 32
6 0 0 0 7 17 4 30
7 0 0 0 0 1 0 29
8 0 0 0 7 16 3 29
9 0 0 0 1 0 5 28
Update: I have done a bunch of stuff with pandasql now: calculated fields, limits, aliases, cascaded dataframes. It is just so productive.
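For readers who want to try the same kind of SQL projection without installing pandasql, here is a rough sketch that swaps in a different technique: the standard-library sqlite3 module together with pandas' own to_sql and read_sql_query. It is not the pandasql API. The two small frames are made-up stand-ins for the blog's meat and births datasets, and the numbers in them are invented.

```python
import sqlite3

import pandas as pd

# Made-up stand-ins for the blog's datasets; the values are invented.
meat = pd.DataFrame({
    'date': ['1974-12-01', '1975-01-01', '1975-02-01'],
    'beef': [10.0, 20.0, 30.0],
})
births = pd.DataFrame({
    'date': ['1975-01-01', '1975-02-01'],
    'births': [100, 200],
})

q = """
SELECT m.date, m.beef, b.births
FROM meat m
LEFT JOIN births b ON m.date = b.date
WHERE m.date > '1974-12-31'
ORDER BY m.date;
"""

conn = sqlite3.connect(':memory:')
try:
    # Copy each DataFrame into an in-memory SQLite table, then query it.
    meat.to_sql('meat', conn, index=False)
    births.to_sql('births', conn, index=False)
    result = pd.read_sql_query(q, conn)
finally:
    conn.close()

print(result)
```

The result is again a pandas DataFrame holding only the projected columns, at the cost of copying the data into SQLite first.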
Answer by gonkan
How about
df_new = df.query('col1==1 & col2=="x" ')[['col1', 'col3']]
This filters for rows where col1 equals 1 and col2 equals "x", and returns only col1 and col3.
But you need to filter on rows for this to work.
For selecting columns only, it is better to use .loc or .iloc.
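A small sketch of this pattern on a made-up frame. The helper select_where is hypothetical and not part of pandas; it just wraps the "query, then index by a column list" idiom from this answer. The last line shows the columns-only case handled with .loc.

```python
import pandas as pd

# Made-up data using the column names from the answer.
df = pd.DataFrame({
    'col1': [1, 1, 2, 1],
    'col2': ['x', 'y', 'x', 'x'],
    'col3': [10, 20, 30, 40],
})

def select_where(frame, expr, columns):
    """Hypothetical helper: filter rows with query(), then keep only `columns`."""
    return frame.query(expr)[columns]

# Rows where col1 is 1 and col2 is "x", keeping only col1 and col3.
df_new = select_where(df, 'col1 == 1 & col2 == "x"', ['col1', 'col3'])

# Columns only, with no row filter.
cols_only = df.loc[:, ['col1', 'col3']]

print(df_new)
```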