Original question: http://stackoverflow.com/questions/34442214/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pivoting a Pandas Dataframe containing strings - 'No numeric types to aggregate' error
Asked by jmhead
There are a good number of questions about this error, but after looking around I still can't find, or wrap my mind around, a solution. I'm trying to pivot a data frame containing strings so that some row data become columns, but it isn't working out so far.
Shape of my df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 515932 entries, 0 to 515931
Data columns (total 5 columns):
id 515932 non-null object
cc_contact_id 515932 non-null object
Network_Name 515932 non-null object
question 515932 non-null object
response_answer 515932 non-null object
dtypes: object(5)
memory usage: 23.6+ MB
Sample format
    id      contact_id  question           response_answer
16  137519  2206        State              Ca
17  137520  2206        State              Ca
18  137521  2206        State              Ca
19  137522  2206        State              Ca
20  137523  2208        City               Lancaster
21  137524  2208        City               Lancaster
22  137525  2208        City               Lancaster
23  137526  2208        City               Lancaster
24  137527  2208        Trip_End Location  Home
25  137528  2208        Trip_End Location  Home
26  137529  2208        Trip_End Location  Home
27  137530  2208        Trip_End Location  Home
What I would like to pivot to
id contact_id State City Trip_End Location
16 137519 2206 Ca None None None
20 137523 2208 None Lancaster None None
24 137527 2208 None None None Home
etc. etc.
The question values should become the columns, with each response_answer in its corresponding column, and the ids retained.
What I have tried
unified_df = pd.DataFrame(unified_data, columns=target_table_headers, dtype=object)
pivot_table = unified_df.pivot_table('response_answer',['id','cc_contact_id'],'question')
# OR
pivot_table = unified_df.pivot_table('response_answer','question')
DataError: No numeric types to aggregate
How do I pivot a data frame with string values?
Answered by cwharland
The default aggfunc in pivot_table is np.mean, which doesn't know what to do with strings, and you haven't properly indicated what the index should be. Try something like:
pivot_table = unified_df.pivot_table(index=['id', 'contact_id'],
                                     columns='question',
                                     values='response_answer',
                                     aggfunc=lambda x: ' '.join(x))
This explicitly sets one row per id, contact_id pair and pivots the set of response_answer values on question. The aggfunc just ensures that if the raw data has multiple answers to the same question, they are concatenated together with spaces. The syntax of pivot_table might vary depending on your pandas version.
Here's a quick example:
In [24]: import pandas as pd
In [25]: import random
In [26]: df = pd.DataFrame({'id':[100*random.randint(10, 50) for _ in range(100)], 'question': [str(random.randint(0,3)) for _ in range(100)], 'response': [str(random.randint(100,120)) for _ in range(100)]})
In [27]: df.head()
Out[27]:
id question response
0 3100 1 116
1 4500 2 113
2 5000 1 120
3 3900 2 103
4 4300 0 117
In [28]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 3 columns):
id 100 non-null int64
question 100 non-null object
response 100 non-null object
dtypes: int64(1), object(2)
memory usage: 3.1+ KB
In [29]: df.pivot_table(index='id', columns='question', values='response', aggfunc=lambda x: ' '.join(x)).head()
Out[29]:
question 0 1 2 3
id
1000 110 120 NaN 100 NaN
1100 NaN 106 108 104 NaN
1200 104 113 119 NaN 101
1300 102 NaN 116 108 120
1400 NaN NaN 116 NaN
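Because the example above uses random data, its output changes on every run. As a complement, here is a minimal deterministic sketch of the same technique on a few rows shaped like the question's sample. The values (including the second City answer, "Palmdale") are made up for illustration, and the column names follow the answer's code rather than the cc_contact_id shown in the question's df.info() output.

```python
import pandas as pd

# Illustrative rows only; "Palmdale" is an invented duplicate answer
# so that the aggfunc has something to concatenate.
df = pd.DataFrame({
    "id": ["137519", "137523", "137523", "137527"],
    "contact_id": ["2206", "2208", "2208", "2208"],
    "question": ["State", "City", "City", "Trip_End Location"],
    "response_answer": ["Ca", "Lancaster", "Palmdale", "Home"],
})

# One row per (id, contact_id) pair, one column per distinct question.
# Multiple answers to the same question are joined with spaces.
wide = df.pivot_table(index=["id", "contact_id"],
                      columns="question",
                      values="response_answer",
                      aggfunc=lambda x: " ".join(x))

print(wide)
```

Cells with no answer for a given question come back as NaN, which matches the None placeholders in the desired output only loosely; replacing them is a separate step.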
Answered by johnInHome
There are several ways.
1
df1 = df.groupby(["id","contact_id","Network_Name","question"])['response_answer'].aggregate(lambda x: x).unstack().reset_index()
df1.columns=df1.columns.tolist()
print (df1)
2
df1 = df.set_index(["id","contact_id","Network_Name","question"])['response_answer'].unstack().reset_index()
df1.columns=df1.columns.tolist()
print (df1)
3
df1 = df.groupby(["id","contact_id","Network_Name","question"])['response_answer'].aggregate('first').unstack().reset_index()
df1.columns=df1.columns.tolist()
print (df1)
4
df1 = df.pivot_table(index=["id","contact_id","Network_Name"], columns='question', values=['response_answer'], aggfunc='first')
df1.columns = df1.columns.droplevel()
df1 = df1.reset_index()
df1.columns=df1.columns.tolist()
print (df1)
All four give the same result:
    id  contact_id  Network_Name  City       State  Trip_End_Location
0   16  137519      2206          None       Ca     None
1   17  137520      2206          None       Ca     None
2   18  137521      2206          None       Ca     None
3   19  137522      2206          None       Ca     None
4   20  137523      2208          Lancaster  None   None
5   21  137524      2208          Lancaster  None   None
6   22  137525      2208          Lancaster  None   None
7   23  137526      2208          Lancaster  None   None
8   24  137527      2208          None       None   Home
9   25  137528      2208          None       None   Home
10  26  137529      2208          None       None   Home
11  27  137530      2208          None       None   Home
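To check that these approaches agree, here is a small sketch that runs three of them (set_index/unstack, groupby with 'first', and pivot_table with 'first') on a tiny frame and compares the results. The data, including the Network_Name values "net_a" and "net_b", is invented for illustration, and values is passed as a plain string so no droplevel call is needed.

```python
import pandas as pd

# Invented sample rows; each (id, contact_id, Network_Name, question)
# combination appears exactly once.
df = pd.DataFrame({
    "id": ["137519", "137523", "137527"],
    "contact_id": ["2206", "2208", "2208"],
    "Network_Name": ["net_a", "net_b", "net_b"],
    "question": ["State", "City", "Trip_End_Location"],
    "response_answer": ["Ca", "Lancaster", "Home"],
})
keys = ["id", "contact_id", "Network_Name"]

# Approach 2: move the keys into the index and unstack the question level.
via_unstack = (df.set_index(keys + ["question"])["response_answer"]
                 .unstack().reset_index())

# Approach 3: group by the keys, keep the first answer, then unstack.
via_groupby = (df.groupby(keys + ["question"])["response_answer"]
                 .first().unstack().reset_index())

# Approach 4: pivot_table with aggfunc='first'.
via_pivot = df.pivot_table(index=keys, columns="question",
                           values="response_answer",
                           aggfunc="first").reset_index()

# Flatten the column labels, as in the answer.
for out in (via_unstack, via_groupby, via_pivot):
    out.columns = out.columns.tolist()

print(via_pivot)
```

The approaches differ once the key combination is not unique: unstack on a duplicated index raises a ValueError, while the 'first' variants silently keep the first answer, so which one fits depends on whether duplicates should be treated as an error.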