pandas: how to split a DataFrame column into multiple columns

Question

Asked by Derric Lewis

After much prodding I am starting to migrate my R scripts to Python. Most of my work in R involved data frames, and I am using the DataFrame object from the pandas package. In my script I need to read in a csv file and import the data into a DataFrame object. Next I need to convert the hex values in a column labelled DATA into bitwise data, and then create 16 new columns, one for each bit.

My example input data in file test.txt looks as follows,

PREFIX,TEST,ZONE,ROW,COL,DATA
6_6,READ,0, 0, 0,BFED
6_6,READ,0, 1, 0,BB7D
6_6,READ,0, 2, 0,FFF7
6_6,READ,0, 3, 0,E7FF
6_6,READ,0, 4, 0,FBF8
6_6,READ,0, 5, 0,DE75
6_6,READ,0, 6, 0,DFFE

My Python script test.py is as follows,

import glob
import pandas as pd
import numpy as np

fname = 'test.txt'
df = pd.read_csv(fname, comment="#")
# take a copy so the column assignments below do not act on a slice of df
dfs = df[df.TEST == 'READ'].copy()

# function to convert the hex string into a binary string
def hex2bin(hstr):
    return bin(int(hstr, 16))[2:]

# convert the hex string in column DATA to the binary string BINDATA
dfs['BINDATA'] = dfs['DATA'].apply(hex2bin)

# get rid of the column DATA
del dfs['DATA']

When I run this script, and inspect the object dfs, I get the following,

  PREFIX  TEST  ZONE  ROW  COL           BINDATA
0    6_6  READ     0    0    0  1011111111101101
1    6_6  READ     0    1    0  1011101101111101
2    6_6  READ     0    2    0  1111111111110111
3    6_6  READ     0    3    0  1110011111111111
4    6_6  READ     0    4    0  1111101111111000
5    6_6  READ     0    5    0  1101111001110101
6    6_6  READ     0    6    0  1101111111111110

So now I am not sure how to split the column named BINDATA into 16 new columns (they could be named B0, B1, B2, ..., B15). Any help will be appreciated.

Thanks & Regards,

Derric.

Answer 1

Answered by joris

I don't know if it can be done more simply (without the for loop), but this does the trick:

for i in range(16):
    dfs['B'+str(i)] = dfs['BINDATA'].str[i]

The str attribute of the Series gives access to vectorized string methods that act on each element (see the docs: http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods). In this case we just index the string to access the individual characters.
This gives me:

In [20]: dfs
Out[20]:
            BINDATA B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15
0  1011111111101101  1  0  1  1  1  1  1  1  1  1   1   0   1   1   0   1
1  1011101101111101  1  0  1  1  1  0  1  1  0  1   1   1   1   1   0   1
2  1111111111110111  1  1  1  1  1  1  1  1  1  1   1   1   0   1   1   1
3  1110011111111111  1  1  1  0  0  1  1  1  1  1   1   1   1   1   1   1
4  1111101111111000  1  1  1  1  1  0  1  1  1  1   1   1   1   0   0   0
5  1101111001110101  1  1  0  1  1  1  1  0  0  1   1   1   0   1   0   1
6  1101111111111110  1  1  0  1  1  1  1  1  1  1   1   1   1   1   1   0

If you want them as ints instead of strings, you can add .astype(int) in the for loop.
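As a minimal sketch of the loop approach with the integer conversion applied: the block below builds a small frame by hand instead of reading test.txt, and it pads the binary string to 16 characters with format(..., "016b"). The padding is an assumption added here, not part of the original answer; bin() drops leading zeros, so a value below 0x8000 would otherwise yield fewer than 16 characters and the bits would land in the wrong columns. The value 0BED is a made-up example to show that case.

```python
import pandas as pd

# Sample values; "0BED" is a hypothetical short value, not from the original data
df = pd.DataFrame({"DATA": ["BFED", "0BED", "FFF7"]})

# format(..., "016b") always yields 16 characters, padded with leading zeros
df["BINDATA"] = df["DATA"].apply(lambda h: format(int(h, 16), "016b"))

# .str[i] picks the i-th character of every string; astype(int) makes it numeric
for i in range(16):
    df["B" + str(i)] = df["BINDATA"].str[i].astype(int)

print(df[["DATA", "BINDATA", "B0", "B15"]])
```

With this layout B0 holds the most significant bit, matching the column order in the output above.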

EDIT: Another way to do it (a one-liner, but you have to change the column names in a second step):

In [34]: splitted = dfs['BINDATA'].apply(lambda x: pd.Series(list(x)))

In [35]: splitted.columns = ['B'+str(x) for x in splitted.columns]

In [36]: dfs.join(splitted)
Out[36]:
            BINDATA B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15
0  1011111111101101  1  0  1  1  1  1  1  1  1  1   1   0   1   1   0   1
1  1011101101111101  1  0  1  1  1  0  1  1  0  1   1   1   1   1   0   1
2  1111111111110111  1  1  1  1  1  1  1  1  1  1   1   1   0   1   1   1
3  1110011111111111  1  1  1  0  0  1  1  1  1  1   1   1   1   1   1   1
4  1111101111111000  1  1  1  1  1  0  1  1  1  1   1   1   1   0   0   0
5  1101111001110101  1  1  0  1  1  1  1  0  0  1   1   1   0   1   0   1
6  1101111111111110  1  1  0  1  1  1  1  1  1  1   1   1   1   1   1   0
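One detail worth spelling out: join returns a new DataFrame and leaves dfs unchanged, so the result has to be assigned if it is needed later. A self-contained sketch of the same steps, using two rows typed in by hand and a variable name (result) chosen here for illustration:

```python
import pandas as pd

dfs = pd.DataFrame({"BINDATA": ["1011111111101101", "1011101101111101"]})

# One Series per row: each character becomes its own column (labelled 0..15)
splitted = dfs["BINDATA"].apply(lambda x: pd.Series(list(x)))
splitted.columns = ["B" + str(c) for c in splitted.columns]

# join aligns on the index and returns a new DataFrame; dfs itself is unchanged
result = dfs.join(splitted)
print(result.shape)
print(list(dfs.columns))
```

The bit columns produced this way still hold one-character strings; splitted.astype(int) before the join would make them numeric.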

Answer 2

Answered by Phillip Cloud

Here's how you can do this without a loop (but not really, since there's a lot of implicit looping in this code):

import numpy as np
import pandas as pd

# read the above frame from the clipboard
df = pd.read_clipboard(converters={'BINDATA': str})
df = df.fillna(np.nan).replace('None', np.nan).dropna(axis=0, how='all')

# here are the lines that matter
bindata = df.BINDATA.apply(list).apply(pd.Series)
bindata.columns = bindata.columns.map('B{0}'.format)
# convert_objects(convert_numeric=True) is not available in current pandas,
# so the bit columns are cast to int before concatenating
res = pd.concat([df, bindata.astype(int)], axis=1)
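If the implicit per-row looping is a concern, one possible alternative is to skip the binary string entirely and extract the bits with NumPy shifts. This is a different technique from either answer, sketched here under the assumption that every DATA value fits in 16 bits; the names values, shifts and bit_cols are made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"DATA": ["BFED", "BB7D", "FFF7"]})

# Parse each hex string into an unsigned 16-bit integer
values = df["DATA"].apply(lambda h: int(h, 16)).to_numpy(dtype=np.uint16)

# Shift amounts 15..0 so that column B0 holds the most significant bit
shifts = np.arange(15, -1, -1, dtype=np.uint16)

# Broadcasting (n, 1) against (16,) gives an (n, 16) array of 0/1 values
bits = (values[:, None] >> shifts) & 1

bit_cols = pd.DataFrame(
    bits, columns=["B" + str(i) for i in range(16)], index=df.index
)
res = pd.concat([df, bit_cols], axis=1)
print(res)
```

The hex parsing still loops in Python, but the split into 16 columns is a single array operation, and leading zeros are handled without any string padding.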
