pandas 读取文本文件数据到pandas DataFrame

Question

提问by Arnoldas Bankauskas

I have specific file format from CNC (work center) data. saved like .txt . I want read this table to pandas dataframe but i never seen this format before.

我有来自 CNC（工作中心）数据的特定文件格式。保存为 .txt 。我想将此表读取到 Pandas 数据帧，但我以前从未见过这种格式。

_MASCHINENNUMMER    : >0-251-11-0950/51<     SACHBEARB.: >BSTWIN32<
_PRODUKTSCHLUESSEL  : >BST 500<           DATUM     : >05-20-2016<
---------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X   !DELTA-Y   !DURCHMESSER! KOMMENTAR
----------+----------+----------+----------+-----------+-------------------
[NoValidForUse]
A21       !      1!62!     0.000!     0.000!      0.000!
[V11]
A12       !     -1!62!     0.000!  -160.000!      0.000!
A12       !      2!62!     0.000!  -128.000!      3.000!  70.0
A12       !     -3!62!     0.000!   -96.000!      0.000!
A12       !      4!62!     0.000!   -64.000!      0.000!
---------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X   !DELTA-Y   !DURCHMESSER! KOMMENTAR
----------+----------+----------+----------+-----------+-------------------
[V11]
O11       !     -9!62!     0.000!   -96.000!      0.000!
O11       !     10!62!     0.000!  -128.000!      5.000!  70.0

Questions: 1. Is it possible to read this and convert as pandas Dataframe? 2. Hou to do this ?

问题： 1. 是否可以读取此内容并将其转换为 Pandas Dataframe？2. 什么时候做这个？

why pandas dataFrame? I want this data use for some analysis by this characteristics of item. For analysis i always use pandas. Maybe for this i need do different ways ?

为什么是Pandas数据框？我希望将此数据用于通过项目的此特征进行一些分析。对于分析，我总是使用Pandas。也许为此我需要做不同的事情？

Expected outpu:

预期输出：

two pandas DataFrames first:

首先是两个Pandas数据帧：

---------------------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X   !DELTA-Y   !DURCHMESSER! KOMMENTAR ! TYPE
----------+----------+----------+----------+-----------+-------------------------------
A21       !      1!62!     0.000!     0.000!      0.000!           !NoValidForUse
A12       !     -1!62!     0.000!  -160.000!      0.000!           !V11
A12       !      2!62!     0.000!  -128.000!      3.000!  70.0     !V11
A12       !     -3!62!     0.000!   -96.000!      0.000!           !V11
A12       !      4!62!     0.000!   -64.000!      0.000!           !V11

And second:

第二：

---------------------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X   !DELTA-Y   !DURCHMESSER! KOMMENTAR ! TYPE
----------+----------+----------+----------+-----------+-------------------------------
O11       !     -9!62!     0.000!   -96.000!      0.000!           !V11
O11       !     10!62!     0.000!  -128.000!      5.000!  70.0     !V11

Headers of Dataframe1 and dataframe2 can be different:

Dataframe1 和 dataframe2 的标头可以不同：

_MASCHINENNUMMER    : >0-251-11-0950/51<     SACHBEARB.: >BSTWIN32<
_PRODUKTSCHLUESSEL  : >BST 500<           DATUM     : >05-20-2016<
---------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X   !DELTA-Y   !DURCHMESSER! KOMMENTAR
----------+----------+----------+----------+-----------+-------------------
[NoValidForUse]
A21       !      1!62!     0.000!     0.000!      0.000!
[V11]
A12       !     -1!62!     0.000!  -160.000!      0.000!
A12       !      2!62!     0.000!  -128.000!      3.000!  70.0
A12       !     -3!62!     0.000!   -96.000!      0.000!
 ---------------------------------------------------------------------------
*BOHRKOPF !          !X-POS     !Y-POS     !           ! 
----------+----------+----------+----------+-----------+-------------------
[V11]
O11       !          !     0.000!   -96.000!           !
O11       !          !     0.000!  -128.000!           !

on file can be different number of dataframes between 5 and 10 but structure of file sesame separator "!" headers row starts whit "*"

文件可以是 5 到 10 之间不同数量的数据帧，但文件芝麻分隔符的结构“！” 标题行以“*”开头

Answer 1

回答by jezrael

Yes, it is possible, but really data dependent:

是的，这是可能的，但实际上取决于数据：

first read_csvwith omit first 3rows and omit first whitespaces
omit trailing whitespaces in columns by strip
create column TYPEby extractvalues between []and forward fill next rows
create helper column for distinguish each DataFrameby startswithand cumsum
last remove by containsrows where first column starts with [, --or *

首先read_csv省略第一3行并省略第一个空格
省略列中的尾随空格 strip
TYPE按extract值创建列[]并向前填充下一行
创建辅助列的每个区分DataFrame由startswith和cumsum
最后删除contains第一列以[,--或*

df = pd.read_csv(file, sep="!", skiprows=3, skipinitialspace=True)
df.columns = df.columns.str.strip()
df['TYPE'] = df['*BOHRKOPF'].str.extract('\[(.*)\]', expand=False).ffill()
df['G'] = df['*BOHRKOPF'].str.startswith('*').cumsum()
df = df[~df['*BOHRKOPF'].str.contains('^\[|^--|^\*')]
print (df)
     *BOHRKOPF SPINDEL  WK DELTA-X   DELTA-Y DURCHMESSER KOMMENTAR  \
2   A21              1  62   0.000     0.000       0.000       NaN   
4   A12             -1  62   0.000  -160.000       0.000       NaN   
5   A12              2  62   0.000  -128.000       3.000      70.0   
6   A12             -3  62   0.000   -96.000       0.000       NaN   
7   A12              4  62   0.000   -64.000       0.000       NaN   
12  O11             -9  62   0.000   -96.000       0.000       NaN   
13  O11             10  62   0.000  -128.000       5.000      70.0   

             TYPE  G  
2   NoValidForUse  0  
4             V11  0  
5             V11  0  
6             V11  0  
7             V11  0  
12            V11  1  
13            V11  1

and then filter by Gcolumn:

然后按G列过滤：

df1 = df[df['G'] == 0].drop('G', axis=1)
print (df1)
    *BOHRKOPF SPINDEL  WK DELTA-X   DELTA-Y DURCHMESSER KOMMENTAR  \
2  A21              1  62   0.000     0.000       0.000       NaN   
4  A12             -1  62   0.000  -160.000       0.000       NaN   
5  A12              2  62   0.000  -128.000       3.000      70.0   
6  A12             -3  62   0.000   -96.000       0.000       NaN   
7  A12              4  62   0.000   -64.000       0.000       NaN   

            TYPE  
2  NoValidForUse  
4            V11  
5            V11  
6            V11  
7            V11  

df2 = df[df['G'] == 1].drop('G', axis=1)
print (df2)
     *BOHRKOPF SPINDEL  WK DELTA-X   DELTA-Y DURCHMESSER KOMMENTAR TYPE
12  O11             -9  62   0.000   -96.000       0.000       NaN  V11
13  O11             10  62   0.000  -128.000       5.000      70.0  V11

If in file is multiple DataFrames is possible use list comprehensionfor list of DataFrames:

如果文件中有多个 DataFrames 可以list comprehension用于list of DataFrames：

dfs = [v.drop('G', axis=1) for k, v in df.groupby('G')]
print (dfs[0])
    *BOHRKOPF SPINDEL  WK DELTA-X   DELTA-Y DURCHMESSER KOMMENTAR  \
2  A21              1  62   0.000     0.000       0.000       NaN   
4  A12             -1  62   0.000  -160.000       0.000       NaN   
5  A12              2  62   0.000  -128.000       3.000      70.0   
6  A12             -3  62   0.000   -96.000       0.000       NaN   
7  A12              4  62   0.000   -64.000       0.000       NaN   

            TYPE  
2  NoValidForUse  
4            V11  
5            V11  
6            V11  
7            V11  

print (dfs[1])
     *BOHRKOPF SPINDEL  WK DELTA-X   DELTA-Y DURCHMESSER KOMMENTAR TYPE
12  O11             -9  62   0.000   -96.000       0.000       NaN  V11
13  O11             10  62   0.000  -128.000       5.000      70.0  V11

EDIT:

编辑：

temp=u"""_MASCHINENNUMMER    : >0-251-11-0950/51<     SACHBEARB.: >BSTWIN32<
_PRODUKTSCHLUESSEL  : >BST 500<           DATUM     : >05-20-2016<
---------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X   !DELTA-Y   !DURCHMESSER! KOMMENTAR
----------+----------+----------+----------+-----------+-------------------
[NoValidForUse]
A21       !      1!62!     0.000!     0.000!      0.000!
[V11]
A12       !     -1!62!     0.000!  -160.000!      0.000!
A12       !      2!62!     0.000!  -128.000!      3.000!  70.0
A12       !     -3!62!     0.000!   -96.000!      0.000!
A12       !      4!62!     0.000!   -64.000!      0.000!
---------------------------------------------------------------------------
*BOHRKOPF !          !X-POS     !Y-POS     !           ! 
----------+----------+----------+----------+-----------+-------------------
[V11]
O11       !          !     0.000!   -96.000!           !
O11       !          !     0.000!  -128.000!           !  """

Add parameter headerfor default columns names:

header为默认列名称添加参数：

#after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp), sep="!", skiprows=3, skipinitialspace=True, header=None)
df['TYPE'] = df[0].str.extract('\[(.*)\]', expand=False).ffill()
df['G'] = df[0].str.startswith('*').cumsum()
#dont remove rows start with *
df = df[~df[0].str.contains('^\[|^--')]

print (df)
             0        1           2           3           4            5  \
0   *BOHRKOPF   SPINDEL          WK  DELTA-X     DELTA-Y     DURCHMESSER   
3   A21               1          62       0.000       0.000        0.000   
5   A12              -1          62       0.000    -160.000        0.000   
6   A12               2          62       0.000    -128.000        3.000   
7   A12              -3          62       0.000     -96.000        0.000   
8   A12               4          62       0.000     -64.000        0.000   
10  *BOHRKOPF       NaN  X-POS       Y-POS              NaN          NaN   
13  O11             NaN       0.000     -96.000         NaN          NaN   
14  O11             NaN       0.000    -128.000         NaN          NaN   

            6           TYPE  G  
0   KOMMENTAR            NaN  1  
3         NaN  NoValidForUse  1  
5         NaN            V11  1  
6        70.0            V11  1  
7         NaN            V11  1  
8         NaN            V11  1  
10        NaN            V11  2  
13        NaN            V11  2  
14        NaN            V11  2

For each loop remove column G, rename all columns without last 2 by first row, remove first row by ilocand last if necessary remove all columns fill NaNs only by dropna:

对于每个循环 remove column G，按第一行重命名所有没有最后 2 的列，删除第一行iloc和最后，如有必要，删除所有列 fill NaNs only by dropna：

dfs = [v.drop('G', axis=1).rename(columns=v.iloc[0, :-2]).iloc[1:].dropna(axis=1, how='all') for k, v in df.groupby('G')]
print (dfs[0])
   *BOHRKOPF  SPINDEL  WK DELTA-X    DELTA-Y    DURCHMESSER KOMMENTAR  \
3  A21              1  62      0.000      0.000       0.000       NaN   
5  A12             -1  62      0.000   -160.000       0.000       NaN   
6  A12              2  62      0.000   -128.000       3.000      70.0   
7  A12             -3  62      0.000    -96.000       0.000       NaN   
8  A12              4  62      0.000    -64.000       0.000       NaN   

            TYPE  
3  NoValidForUse  
5            V11  
6            V11  
7            V11  
8            V11 

print (dfs[1])
    *BOHRKOPF  X-POS      Y-POS      TYPE
13  O11             0.000    -96.000  V11
14  O11             0.000   -128.000  V11

pandas 读取文本文件数据到pandas DataFrame

提问by Arnoldas Bankauskas

回答by jezrael

相关推荐

最近更新

标签

pandas 读取文本文件数据到pandas DataFrame

提问by Arnoldas Bankauskas

回答by jezrael

相关推荐

高效地将 Pandas 数据帧写入 Google BigQuery

pandas 使用循环创建多个数据帧

pandas 如何从数据框中保存 CSV，以在带有数字的列中保留零？

pandas 计算熊猫数据帧行中的非空单元格并将计数添加为一列

相关推荐

最近更新

标签