Notice: This page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/40045545/

Time: 2020-09-14 02:13:03  Source: igfitidea

Pandas: query string where column name contains special characters

Tags: python, pandas, dataframe

Asked by Joe

I am working with a data frame that has a structure something like the following:

In[75]: df.head(2)
Out[75]: 
  statusdata             participant_id association  latency response  \
0   complete  CLIENT-TEST-1476362617727       seeya      715  dislike   
1   complete  CLIENT-TEST-1476362617727      welome      800     like   

   stimuli elementdata statusmetadata demo$gender  demo$question2  \
0  Sample B    semi_imp       complete        male              23   
1  Sample C    semi_imp       complete      female              23   

I want to be able to run a query string against the column demo$gender.

I.e.,

df.query("demo$gender=='male'")

But this fails because of the $ sign. If I replace the $ sign with another delimiter (like -), the problem persists. Can I fix up my query string to avoid this problem? I would prefer not to rename the columns, as they correspond closely to other parts of my application.

I really want to stick with a query string as it is supplied by another component of our tech stack and creating a parser would be a heavy lift for what seems like a simple problem.

Thanks in advance.

Answered by Joe

For those interested, here is a simple procedure I used to accomplish the task:

# `query` is assumed to hold the query string, e.g. "demo$gender=='male'"

# Identify column names that are not valid Python identifiers
invalid_column_names = [x for x in df.columns if not x.isidentifier()]

# Substitute a placeholder for each invalid name in the query and keep track
# NOTE: This method fails if the frame has columns called REPL_0 etc.
replacements = {}
for i, cn in enumerate(invalid_column_names):
    r = 'REPL_' + str(i)
    query = query.replace(cn, r)
    replacements[cn] = r

# Inverse mapping, used to restore the original names afterwards
inv_replacements = {r: cn for cn, r in replacements.items()}

df = df.rename(columns=replacements)      # Rename the columns
df = df.query(query)                      # Carry out the query
df = df.rename(columns=inv_replacements)  # Restore the original names

This amounts to identifying the invalid column names, rewriting the query to use placeholder names, and renaming the columns to match. We then run the query and translate the column names back.
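The same steps can be wrapped in a function so the caller's frame is never left with placeholder names. This is only a sketch: the helper name query_with_placeholders and the sample frame are made up for illustration and are not part of pandas. It assumes all column names are strings, and it replaces longer names first so that a name which is a substring of another does not clobber it. Because it uses plain string replacement, it will also rewrite a column name that happens to appear inside a string literal in the query.

```python
import pandas as pd


def query_with_placeholders(df: pd.DataFrame, query: str) -> pd.DataFrame:
    """Run df.query() on a frame whose column names may not be valid identifiers.

    Hypothetical helper for illustration; not a pandas API.
    """
    # Column names that Python would not accept as identifiers
    invalid = [c for c in df.columns if not c.isidentifier()]
    # Longest first, so "a$b$c" is substituted before "a$b"
    invalid.sort(key=len, reverse=True)

    replacements = {}
    for i, name in enumerate(invalid):
        placeholder = f"REPL_{i}"
        if placeholder in df.columns:
            raise ValueError(f"column {placeholder!r} already exists")
        query = query.replace(name, placeholder)
        replacements[name] = placeholder

    inverse = {placeholder: name for name, placeholder in replacements.items()}

    # rename() returns a new frame, so the input frame is left untouched
    result = df.rename(columns=replacements).query(query)
    return result.rename(columns=inverse)


# Made-up sample data loosely modelled on the frame in the question
df = pd.DataFrame({
    "demo$gender": ["male", "female"],
    "demo$question2": [23, 23],
    "latency": [715, 800],
})

males = query_with_placeholders(df, "demo$gender == 'male'")
print(males)
```

The result keeps the original column names, and df itself is unchanged.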

Credit to @chrisb for their answer, which pointed me in the right direction.

Answered by chrisb

The current implementation of query requires the string to be a valid Python expression, so column names must be valid Python identifiers. Your two options are renaming the column or using a plain boolean filter, like this:

df[df['demo$gender'] == 'male']
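
For reference, here is a self-contained version of the boolean filter on a small made-up frame. The second half is an addition that was not part of the original answer: later pandas releases let a query string quote a column name in backticks, and the sketch assumes pandas 1.0 or newer, where that quoting should cover characters such as $. On older versions, only the boolean filter applies.

```python
import pandas as pd

# Made-up sample data; only the column name comes from the question
df = pd.DataFrame({
    "demo$gender": ["male", "female"],
    "latency": [715, 800],
})

# Plain boolean filter: works regardless of what the column is called
mask = df["demo$gender"] == "male"
filtered = df[mask]

# Backtick quoting inside the query string (assumes pandas >= 1.0)
quoted = df.query("`demo$gender` == 'male'")

print(filtered)
print(quoted)
```

If the query string comes from another component, the backtick form would only require that component to quote the column names, not to change them.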