python:pandas - 如何将熊猫数据帧的前两行组合到数据帧标题?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/46357100/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
python:pandas - How to combine first two rows of pandas dataframe to dataframe header?
提问by kit
I'm trying to read one excel file which looks like below:
我正在尝试读取一个如下所示的 excel 文件:
I also have one script which will convert this xlsx file into the csv files with its sheet names (If three sheets available then it will create three different csv files).
我还有一个脚本,可以将这个 xlsx 文件转换为带有工作表名称的 csv 文件(如果有三张工作表,那么它将创建三个不同的 csv 文件)。
It's csv file looks like below:
它的 csv 文件如下所示:
Unnamed: 0,Gender A,Unnamed: 2,Gender B,Unnamed: 4,Gender C,Gender D
date,Male,Female,Male,Female,Male,Female
2017-01-01 00:00:00,2,3,3,2,3,3
2017-01-02 00:00:00,5,7,7,42,3,5
2017-01-03 00:00:00,4,6,6,12,2,7
2017-01-04 00:00:00,6,7,3,6,4,8
2017-01-05 00:00:00,6,8,8,3,5,3
2017-01-06 00:00:00,54,3,3,6,3,5
2017-01-07 00:00:00,3,4,6,3,6,5
2017-01-08 00:00:00,3,6,6,3,6,4
2017-01-09 00:00:00,2,2,8,7,5,2
2017-01-10 00:00:00,4,3,2,4,5,5
2017-01-11 00:00:00,12,10,10,3,1,6
2017-01-12 00:00:00,9,7,7,3,4,1
So, my first question is which is better choice to process these files - xlsx or csv?
所以,我的第一个问题是处理这些文件的更好选择是 xlsx 还是 csv?
Next, I just want to read first two rows as a column header. So that I can understand In which Gender how many males and females are available.
接下来,我只想读取前两行作为列标题。这样我就可以了解在哪个性别中有多少男性和女性可用。
Expected Output:
预期输出:
0 date Gender A_Male Gender A_Female Gender B_Male Gender B_Female Gender C_Male Gender D_Female
1 2017-01-01 00:00:00 2 3 3 2 3 3
2 2017-01-02 00:00:00 5 7 7 42 3 5
3 2017-01-03 00:00:00 4 6 6 12 2 7
4 2017-01-04 00:00:00 6 7 3 6 4 8
5 2017-01-05 00:00:00 6 8 8 3 5 3
6 2017-01-06 00:00:00 54 3 3 6 3 5
7 2017-01-07 00:00:00 3 4 6 3 6 5
8 2017-01-08 00:00:00 3 6 6 3 6 4
9 2017-01-09 00:00:00 2 2 8 7 5 2
10 2017-01-10 00:00:00 4 3 2 4 5 5
11 2017-01-11 00:00:00 12 10 10 3 1 6
12 2017-01-12 00:00:00 9 7 7 3 4 1
回答by Scott Boston
Let's try:
咱们试试吧:
df = pd.read_excel('Untitled 2.xlsx', header=[0,1])
df.columns = df.columns.map('_'.join)
df.rename_axis('Date').reset_index()
Output:
输出:
Date Gender A_Male Gender A_Female Gender B_Male Gender B_Female \
0 2017-01-01 2 3 3 2
1 2017-01-02 5 7 7 42
2 2017-01-03 4 6 6 12
3 2017-01-04 6 7 3 6
4 2017-01-05 6 8 8 3
5 2017-01-06 54 3 3 6
6 2017-01-07 3 4 6 3
7 2017-01-08 3 6 6 3
8 2017-01-09 2 2 8 7
9 2017-01-10 4 3 2 4
10 2017-01-11 12 10 10 3
11 2017-01-12 9 7 7 3
Gender C_Male Gender D_Female
0 3 3
1 3 5
2 2 7
3 4 8
4 5 3
5 3 5
6 6 5
7 6 4
8 5 2
9 5 5
10 1 6
11 4 1
回答by hobs
I love @ScottBoston's approach. Here are some cosmetic alternatives. If you want your column headers to look pretty, especially when the second row contains the units for the quantity, you can do something like this:
我喜欢@ScottBoston 的方法。这里有一些化妆品替代品。如果您希望列标题看起来漂亮,尤其是当第二行包含数量单位时,您可以执行以下操作:
df = pd.read_excel('Untitled 2.xlsx', header=[0,1], index_col=0)
df.columns = df.columns.map(lambda h: '{}\n({})'.format(h[0], h[1]))
df.rename_axis('Date')
And if you want to make sure your column names don't contain spaces (so you can access them as properties of the DataFrame):
如果您想确保您的列名不包含空格(因此您可以将它们作为 DataFrame 的属性访问):
df = pd.read_excel('Untitled 2.xlsx', header=[0,1], index_col=0)
df.columns = df.columns.map(lambda h: ' '.join(h).replace(' ', '_'))
df.rename_axis('Date')
Which gives:
这使:
Gender_A__Male Gender_A__Female ... Gender_C__Male Gender_D__Female
Date ...
2017-01-01 00:00:00 2 3 ... 3 3
2017-01-02 00:00:00 5 7 ... 3 5
2017-01-03 00:00:00 4 6 ... 2 7
2017-01-04 00:00:00 6 7 ... 4 8
2017-01-05 00:00:00 6 8 ... 5 3
2017-01-06 00:00:00 54 3 ... 3 5
2017-01-07 00:00:00 3 4 ... 6 5
2017-01-08 00:00:00 3 6 ... 6 4
2017-01-09 00:00:00 2 2 ... 5 2
2017-01-10 00:00:00 4 3 ... 5 5
2017-01-11 00:00:00 12 10 ... 1 6
2017-01-12 00:00:00 9 7 ... 4 1