Original question: http://stackoverflow.com/questions/18034361/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to split a dataframe column into multiple columns
Asked by Derric Lewis
After much prodding I am starting to migrate my R scripts to Python. Most of my work in R involved data frames, and I am using the DataFrame object from the pandas package. In my script I need to read in a csv file and import the data into a DataFrame object. Next I need to convert the hex values in a column labelled DATA into bitwise data, and then create 16 new columns, one for each bit.
My example input data in file test.txt looks as follows,
PREFIX,TEST,ZONE,ROW,COL,DATA
6_6,READ,0, 0, 0,BFED
6_6,READ,0, 1, 0,BB7D
6_6,READ,0, 2, 0,FFF7
6_6,READ,0, 3, 0,E7FF
6_6,READ,0, 4, 0,FBF8
6_6,READ,0, 5, 0,DE75
6_6,READ,0, 6, 0,DFFE
My Python script test.py is as follows,
import pandas as pd
fname = 'test.txt'
df = pd.read_csv(fname, comment="#")
# keep only the READ rows; copy so that adding columns does not touch a view of df
dfs = df[df.TEST == 'READ'].copy()
# function to convert the hexstring into a 16-character binary string
def hex2bin(hstr):
    # zfill keeps the leading zeros that bin() drops
    return bin(int(hstr,16))[2:].zfill(16)
# convert the hexstring in column DATA to binarystring BINDATA
dfs['BINDATA'] = dfs['DATA'].apply(hex2bin)
# get rid of the column DATA
del dfs['DATA']
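A note on the conversion step: Python's `bin()` omits leading zeros, so a hex value below 8000 produces fewer than 16 characters unless it is padded. The sample data happens to avoid this because every value has its top bit set. A minimal sketch of a fixed-width conversion, assuming 16-bit words (the helper name `hex_to_bits` and the `width` parameter are hypothetical, not part of the original script):

```python
def hex_to_bits(hstr, width=16):
    """Convert a hex string to a binary string padded to `width` characters."""
    # format(..., '016b') pads with leading zeros, unlike bin(...)[2:]
    return format(int(hstr, 16), "0{}b".format(width))

print(hex_to_bits("BFED"))  # value taken from the sample data
print(hex_to_bits("00F7"))  # made-up value with leading zero bits
```

With padding in place, character position i always refers to the same bit in every row.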
When I run this script, and inspect the object dfs, I get the following,
  PREFIX  TEST  ZONE  ROW  COL           BINDATA
0    6_6  READ     0    0    0  1011111111101101
1    6_6  READ     0    1    0  1011101101111101
2    6_6  READ     0    2    0  1111111111110111
3    6_6  READ     0    3    0  1110011111111111
4    6_6  READ     0    4    0  1111101111111000
5    6_6  READ     0    5    0  1101111001110101
6    6_6  READ     0    6    0  1101111111111110
So now I am not sure how to split the column named BINDATA into 16 new columns (could be named B0, B1, B2, ..., B15). Any help will be appreciated.
Thanks & Regards,
Derric.
Answered by joris
I don't know if it can be done simpler (without the for loop), but this does the trick:
for i in range(16):
    dfs['B'+str(i)] = dfs['BINDATA'].str[i]
The str attribute of the Series gives access to some vectorized string methods which act upon each element (see docs: http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods). In this case we just index the string to access the different characters.
This gives me:
In [20]: dfs
Out[20]:
            BINDATA B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15
0  1011111111101101  1  0  1  1  1  1  1  1  1  1   1   0   1   1   0   1
1  1011101101111101  1  0  1  1  1  0  1  1  0  1   1   1   1   1   0   1
2  1111111111110111  1  1  1  1  1  1  1  1  1  1   1   1   0   1   1   1
3  1110011111111111  1  1  1  0  0  1  1  1  1  1   1   1   1   1   1   1
4  1111101111111000  1  1  1  1  1  0  1  1  1  1   1   1   1   0   0   0
5  1101111001110101  1  1  0  1  1  1  1  0  0  1   1   1   0   1   0   1
6  1101111111111110  1  1  0  1  1  1  1  1  1  1   1   1   1   1   1   0
If you want them as ints instead of strings, you can add .astype(int) in the for loop.
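The loop with the integer conversion might look like the sketch below. It is self-contained for illustration: the two-row frame `demo` is a made-up stand-in for dfs, and it assumes every BINDATA string is exactly 16 characters long.

```python
import pandas as pd

# Hypothetical two-row frame standing in for dfs
demo = pd.DataFrame({"BINDATA": ["1011111111101101", "1101111111111110"]})

for i in range(16):
    # .str[i] picks the i-th character of each string;
    # astype(int) turns the characters '0'/'1' into the integers 0/1
    demo["B" + str(i)] = demo["BINDATA"].str[i].astype(int)

print(demo)
```

Integer columns make later arithmetic such as summing set bits per row straightforward, whereas string columns would concatenate instead.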
EDIT: Another way to do it (a one-liner, but you have to change the column names in a second step):
In [34]: splitted = dfs['BINDATA'].apply(lambda x: pd.Series(list(x)))
In [35]: splitted.columns = ['B'+str(x) for x in splitted.columns]
In [36]: dfs.join(splitted)
Out[36]:
            BINDATA B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15
0  1011111111101101  1  0  1  1  1  1  1  1  1  1   1   0   1   1   0   1
1  1011101101111101  1  0  1  1  1  0  1  1  0  1   1   1   1   1   0   1
2  1111111111110111  1  1  1  1  1  1  1  1  1  1   1   1   0   1   1   1
3  1110011111111111  1  1  1  0  0  1  1  1  1  1   1   1   1   1   1   1
4  1111101111111000  1  1  1  1  1  0  1  1  1  1   1   1   1   0   0   0
5  1101111001110101  1  1  0  1  1  1  1  0  0  1   1   1   0   1   0   1
6  1101111111111110  1  1  0  1  1  1  1  1  1  1   1   1   1   1   1   0
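For comparison, the intermediate binary string can be skipped entirely by working on integers. The following is a sketch of a different technique from the ones in this answer (NumPy bit shifts with broadcasting), assuming 16-bit values and the same B0 = most significant bit ordering as the tables above; the frame `demo` and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame holding a few of the hex values from the question
demo = pd.DataFrame({"DATA": ["BFED", "BB7D", "FFF7"]})

# Parse each hex string into an integer
values = np.array([int(h, 16) for h in demo["DATA"]])

# Shift amounts 15..0, so column B0 is the most significant bit
shifts = np.arange(15, -1, -1)

# Broadcasting: each row's value is shifted by every amount, then masked to one bit
bits = (values[:, None] >> shifts) & 1

bit_cols = pd.DataFrame(bits, columns=["B" + str(i) for i in range(16)], index=demo.index)
result = demo.join(bit_cols)
print(result)
```

This produces integer columns directly and does not depend on the strings being zero-padded.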
Answered by Phillip Cloud
Here's how you can do this without a loop (but not really, since there's a lot of implicit looping in this code):
import numpy as np
import pandas as pd
# read the above frame from the clipboard
df = pd.read_clipboard(converters={'BINDATA': str})
df = df.replace('None', np.nan).dropna(axis=0, how='all')
# here are the lines that matter
bindata = df.BINDATA.apply(list).apply(pd.Series)
bindata.columns = bindata.columns.map('B{0}'.format)
# convert_objects was removed in pandas 1.0; astype(int) converts the bit columns instead
res = pd.concat([df, bindata.astype(int)], axis=1)

