Select columns using pandas dataframe.query()

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/44610766/

Time: 2020-09-14 03:49:06. Source: igfitidea

python sql pandas dataframe

Asked by javadba

The documentation on dataframe.query() is very terse: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html. I was also unable to find examples of projections by web search.

So I tried simply providing the column names: that gave a syntax error. Likewise for typing select and then the column names. So how do I do this?

Accepted answer by Max Power

After playing around with this for a while and reading through the source code for DataFrame.query, I can't figure out a way to do it.

If it's not impossible, apparently it's at least strongly discouraged. When this question came up on GitHub, prolific pandas dev/maintainer jreback suggested using df.eval() for selecting columns and df.query() for filtering on rows.



UPDATE:

javadba points out that the return value of eval is not a DataFrame. For example, to flesh out jreback's example a bit more...

df.eval('A')

returns a Pandas Series, but

df.eval(['A', 'B'])

does not return a DataFrame; it returns a list (of pandas Series).
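A quick way to check the return types for yourself, as a minimal sketch. The toy DataFrame and its values are assumptions made for illustration, and only the single-expression form of df.eval is exercised here.

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

single = df.eval("A")        # evaluating one column name gives a Series
projected = df[["A", "B"]]   # a list of labels inside [] gives a DataFrame

print(type(single).__name__)
print(type(projected).__name__)
```

If a DataFrame is what you need downstream, plain column selection is the simpler route.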

So it seems that ultimately the best way to keep the flexibility to filter on both rows and columns is to use iloc/loc, e.g.

df.loc[0:4, ['A', 'C']]

output

          A         C
0 -0.497163 -0.046484
1  1.331614  0.741711
2  1.046903 -2.511548
3  0.314644 -0.526187
4 -0.061883 -0.615978
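To make the row-and-column flexibility concrete, here is a small sketch. The random data, the seed, and the variable names are assumptions for illustration, so the values will not match the output above.

```python
import numpy as np
import pandas as pd

# Toy data, assumed for illustration; default integer index 0..7
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((8, 4)), columns=["A", "B", "C", "D"])

# Label-based row slice plus a column list (the end label 4 is included)
by_label = df.loc[0:4, ["A", "C"]]

# Boolean row mask plus a column list: filter and project in one step
by_mask = df.loc[df["A"] > df["B"], ["A", "C"]]

# Position-based equivalent of the first selection (end position 5 is excluded)
by_position = df.iloc[0:5, [0, 2]]
```

Note that loc slices include the end label while iloc slices exclude the end position, which is why the two calls use 4 and 5 respectively.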

Answer by Scott Boston

DataFrame.query is more like the WHERE clause in a SQL statement than the SELECT part.

import pandas as pd
import numpy as np
np.random.seed(123)
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

To select a column or columns you can use the following:

df['A'] or df.loc[:,'A']

or

df[['A','B']] or df.loc[:,['A','B']]

To use the .query method you do something like

df.query('A > B'), which returns all the rows where the value in column A is greater than the value in column B.

                   A         B         C         D
2000-01-03  1.265936 -0.866740 -0.678886 -0.094709
2000-01-04  1.491390 -0.638902 -0.443982 -0.434351
2000-01-05  2.205930  2.186786  1.004054  0.386186
2000-01-08 -0.140069 -0.861755 -0.255619 -2.798589

Which is more readable, in my opinion, than boolean index selection with

df[df['A'] > df['B']]
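The two spellings select the same rows, which can be checked directly. This sketch rebuilds the DataFrame from the setup code in this answer; the `threshold` variable is an assumption added to show that query can also reference local variables with `@`.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
dates = pd.date_range("1/1/2000", periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=["A", "B", "C", "D"])

via_query = df.query("A > B")       # row filter written as a string expression
via_mask = df[df["A"] > df["B"]]    # the same filter as a boolean mask
print(via_query.equals(via_mask))

# Hypothetical local variable, referenced inside the query string with @
threshold = 0
positive_a = df.query("A > @threshold")
```

In both cases every column comes back; query only decides which rows are kept.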

Answer by javadba

pandasql

https://pypi.python.org/pypi/pandasql/0.1.0

Here is an example from the following blog: http://blog.yhat.com/posts/pandasql-sql-for-pandas-dataframes.html. The inputs are two DataFrames, meat and births, and this approach gives the projections, filtering, aggregation and sorting expected from SQL.

@maxpower did mention this package is buggy, so let's see. At least the code from the blog, shown below, works fine.

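from pandasql import sqldf, load_meat, load_births

# Helper that runs a SQL string against the DataFrames in the global namespace
pysqldf = lambda q: sqldf(q, globals())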

q  = """
SELECT
  m.date
  , m.beef
  , b.births
FROM
  meat m
LEFT JOIN
  births b
    ON m.date = b.date
WHERE
    m.date > '1974-12-31';
"""

meat = load_meat()
births = load_births()

df = pysqldf(q)

The output is a pandas DataFrame, as desired.
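For comparison, the same join, filter and projection can be written in plain pandas without pandasql. This is only a sketch of an alternative: the tiny `meat` and `births` frames below are made-up stand-ins for the real datasets, and their values are hypothetical.

```python
import pandas as pd

# Tiny stand-ins for the meat and births datasets (all values are made up)
meat = pd.DataFrame({
    "date": pd.to_datetime(["1974-12-01", "1975-01-01", "1975-02-01"]),
    "beef": [100.0, 110.0, 120.0],
    "pork": [50.0, 55.0, 60.0],
})
births = pd.DataFrame({
    "date": pd.to_datetime(["1975-01-01", "1975-02-01"]),
    "births": [1000, 1100],
})

# LEFT JOIN births ON m.date = b.date
joined = meat.merge(births, on="date", how="left")

# WHERE m.date > '1974-12-31' combined with SELECT date, beef, births
result = joined.loc[
    joined["date"] > pd.Timestamp("1974-12-31"),
    ["date", "beef", "births"],
]
```

The SQL version reads more naturally if you already think in SQL; the pandas version avoids the extra dependency.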

It is working great for my particular use case (evaluating US crimes):

odf = pysqldf("select %s from df where sweapons > 10 order by sweapons desc limit 10" %scols)
p('odf\n', odf)

 odf
:    SMURDER  SRAPE  SROBBERY  SAGASSLT  SOTHASLT  SVANDLSM  SWEAPONS
0        0      0         0         1         1        10        54
1        0      0         0         0         1         0        52
2        0      0         0         0         1         0        46
3        0      0         0         0         1         0        43
4        0      0         0         0         1         0        33
5        1      0         2        16        28         4        32
6        0      0         0         7        17         4        30
7        0      0         0         0         1         0        29
8        0      0         0         7        16         3        29
9        0      0         0         1         0         5        28

Update: I have done a bunch of stuff with pandasql now: calculated fields, limits, aliases, cascaded dataframes. It is just so productive.

Answer by gonkan

How about

df_new = df.query('col1==1 & col2=="x" ')[['col1', 'col3']]

This filters rows where col1 equals 1 and col2 equals "x", and returns only the columns col1 and col3.

But the query string has to contain a row filter; without one this approach doesn't work.

For selecting columns only, it is better to use .loc or .iloc.
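A runnable version of this idea, as a sketch. The column names follow the snippet in this answer, while the data values are assumptions chosen so that exactly one row matches.

```python
import pandas as pd

# Toy data, assumed for illustration
df = pd.DataFrame({
    "col1": [1, 1, 2],
    "col2": ["x", "y", "x"],
    "col3": [10, 20, 30],
})

# Filter rows with query(), then select columns with a list of labels
df_new = df.query('col1 == 1 & col2 == "x"')[["col1", "col3"]]

# Columns only, no row filter: use .loc with a full row slice
cols_only = df.loc[:, ["col1", "col3"]]
```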