pandas: Read text file data into a pandas DataFrame
Disclaimer: this page is a Chinese–English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/49082287/
Read text file data to pandas DataFrame
Asked by Arnoldas Bankauskas
I have data from a CNC work center in a specific file format, saved as .txt. I want to read this table into a pandas DataFrame, but I have never seen this format before.
_MASCHINENNUMMER : >0-251-11-0950/51< SACHBEARB.: >BSTWIN32<
_PRODUKTSCHLUESSEL : >BST 500< DATUM : >05-20-2016<
---------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X !DELTA-Y !DURCHMESSER! KOMMENTAR
----------+----------+----------+----------+-----------+-------------------
[NoValidForUse]
A21 ! 1!62! 0.000! 0.000! 0.000!
[V11]
A12 ! -1!62! 0.000! -160.000! 0.000!
A12 ! 2!62! 0.000! -128.000! 3.000! 70.0
A12 ! -3!62! 0.000! -96.000! 0.000!
A12 ! 4!62! 0.000! -64.000! 0.000!
---------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X !DELTA-Y !DURCHMESSER! KOMMENTAR
----------+----------+----------+----------+-----------+-------------------
[V11]
O11 ! -9!62! 0.000! -96.000! 0.000!
O11 ! 10!62! 0.000! -128.000! 5.000! 70.0
Questions: 1. Is it possible to read this and convert it into a pandas DataFrame? 2. How can I do this?
- Why a pandas DataFrame? I want to use this data for some analysis based on these item characteristics, and I always use pandas for analysis. Maybe I need a different approach for this?
Expected output:
Two pandas DataFrames. The first:
---------------------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X !DELTA-Y !DURCHMESSER! KOMMENTAR ! TYPE
----------+----------+----------+----------+-----------+-------------------------------
A21 ! 1!62! 0.000! 0.000! 0.000! !NoValidForUse
A12 ! -1!62! 0.000! -160.000! 0.000! !V11
A12 ! 2!62! 0.000! -128.000! 3.000! 70.0 !V11
A12 ! -3!62! 0.000! -96.000! 0.000! !V11
A12 ! 4!62! 0.000! -64.000! 0.000! !V11
And second:
---------------------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X !DELTA-Y !DURCHMESSER! KOMMENTAR ! TYPE
----------+----------+----------+----------+-----------+-------------------------------
O11 ! -9!62! 0.000! -96.000! 0.000! !V11
O11 ! 10!62! 0.000! -128.000! 5.000! 70.0 !V11
The headers of DataFrame 1 and DataFrame 2 can be different:
_MASCHINENNUMMER : >0-251-11-0950/51< SACHBEARB.: >BSTWIN32<
_PRODUKTSCHLUESSEL : >BST 500< DATUM : >05-20-2016<
---------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X !DELTA-Y !DURCHMESSER! KOMMENTAR
----------+----------+----------+----------+-----------+-------------------
[NoValidForUse]
A21 ! 1!62! 0.000! 0.000! 0.000!
[V11]
A12 ! -1!62! 0.000! -160.000! 0.000!
A12 ! 2!62! 0.000! -128.000! 3.000! 70.0
A12 ! -3!62! 0.000! -96.000! 0.000!
---------------------------------------------------------------------------
*BOHRKOPF ! !X-POS !Y-POS ! !
----------+----------+----------+----------+-----------+-------------------
[V11]
O11 ! ! 0.000! -96.000! !
O11 ! ! 0.000! -128.000! !
- A file can contain a different number of tables (between 5 and 10), but the structure is always the same: the separator is "!" and each header row starts with "*" (a quick structural check is sketched below).
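Just to illustrate that rule, a minimal sketch (assuming the file is saved under a hypothetical name such as cnc.txt) counts the header rows to see how many tables the file contains:

# count the table header rows, which start with "*" per the description above
# (cnc.txt is a placeholder path, not from the original question)
with open('cnc.txt') as f:
    tables = sum(1 for line in f if line.startswith('*'))
print(tables)  # number of tables in this file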
Answered by jezrael
Yes, it is possible, but it really depends on the data:
- First read_csv, skipping the first 3 rows and ignoring leading whitespace.
- Strip trailing whitespace from the column names with strip.
- Create the column TYPE by extracting the values between [] and forward filling them to the following rows.
- Create a helper column to distinguish each DataFrame, using startswith and cumsum.
- Last, remove with contains the rows where the first column starts with [, -- or *.
import pandas as pd

# 'file' is the path to the CNC .txt file
df = pd.read_csv(file, sep="!", skiprows=3, skipinitialspace=True)
df.columns = df.columns.str.strip()
df['TYPE'] = df['*BOHRKOPF'].str.extract('\[(.*)\]', expand=False).ffill()
df['G'] = df['*BOHRKOPF'].str.startswith('*').cumsum()
df = df[~df['*BOHRKOPF'].str.contains('^\[|^--|^\*')]
print (df)
*BOHRKOPF SPINDEL WK DELTA-X DELTA-Y DURCHMESSER KOMMENTAR \
2 A21 1 62 0.000 0.000 0.000 NaN
4 A12 -1 62 0.000 -160.000 0.000 NaN
5 A12 2 62 0.000 -128.000 3.000 70.0
6 A12 -3 62 0.000 -96.000 0.000 NaN
7 A12 4 62 0.000 -64.000 0.000 NaN
12 O11 -9 62 0.000 -96.000 0.000 NaN
13 O11 10 62 0.000 -128.000 5.000 70.0
TYPE G
2 NoValidForUse 0
4 V11 0
5 V11 0
6 V11 0
7 V11 0
12 V11 1
13 V11 1
and then filter by the G column:
df1 = df[df['G'] == 0].drop('G', axis=1)
print (df1)
*BOHRKOPF SPINDEL WK DELTA-X DELTA-Y DURCHMESSER KOMMENTAR \
2 A21 1 62 0.000 0.000 0.000 NaN
4 A12 -1 62 0.000 -160.000 0.000 NaN
5 A12 2 62 0.000 -128.000 3.000 70.0
6 A12 -3 62 0.000 -96.000 0.000 NaN
7 A12 4 62 0.000 -64.000 0.000 NaN
TYPE
2 NoValidForUse
4 V11
5 V11
6 V11
7 V11
df2 = df[df['G'] == 1].drop('G', axis=1)
print (df2)
*BOHRKOPF SPINDEL WK DELTA-X DELTA-Y DURCHMESSER KOMMENTAR TYPE
12 O11 -9 62 0.000 -96.000 0.000 NaN V11
13 O11 10 62 0.000 -128.000 5.000 70.0 V11
If the file contains multiple DataFrames, it is possible to use a list comprehension to get a list of DataFrames:
dfs = [v.drop('G', axis=1) for k, v in df.groupby('G')]
print (dfs[0])
*BOHRKOPF SPINDEL WK DELTA-X DELTA-Y DURCHMESSER KOMMENTAR \
2 A21 1 62 0.000 0.000 0.000 NaN
4 A12 -1 62 0.000 -160.000 0.000 NaN
5 A12 2 62 0.000 -128.000 3.000 70.0
6 A12 -3 62 0.000 -96.000 0.000 NaN
7 A12 4 62 0.000 -64.000 0.000 NaN
TYPE
2 NoValidForUse
4 V11
5 V11
6 V11
7 V11
print (dfs[1])
*BOHRKOPF SPINDEL WK DELTA-X DELTA-Y DURCHMESSER KOMMENTAR TYPE
12 O11 -9 62 0.000 -96.000 0.000 NaN V11
13 O11 10 62 0.000 -128.000 5.000 70.0 V11
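If keyed access is preferred over positional indexing, the same groupby can also feed a dict comprehension (a small variation on the list comprehension above; the key k is the value of the helper column G):

dfs_by_group = {k: v.drop('G', axis=1) for k, v in df.groupby('G')}
print(dfs_by_group[0])  # identical to dfs[0]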
EDIT:
temp=u"""_MASCHINENNUMMER : >0-251-11-0950/51< SACHBEARB.: >BSTWIN32<
_PRODUKTSCHLUESSEL : >BST 500< DATUM : >05-20-2016<
---------------------------------------------------------------------------
*BOHRKOPF !SPINDEL!WK!DELTA-X !DELTA-Y !DURCHMESSER! KOMMENTAR
----------+----------+----------+----------+-----------+-------------------
[NoValidForUse]
A21 ! 1!62! 0.000! 0.000! 0.000!
[V11]
A12 ! -1!62! 0.000! -160.000! 0.000!
A12 ! 2!62! 0.000! -128.000! 3.000! 70.0
A12 ! -3!62! 0.000! -96.000! 0.000!
A12 ! 4!62! 0.000! -64.000! 0.000!
---------------------------------------------------------------------------
*BOHRKOPF ! !X-POS !Y-POS ! !
----------+----------+----------+----------+-----------+-------------------
[V11]
O11 ! ! 0.000! -96.000! !
O11 ! ! 0.000! -128.000! ! """
Add the parameter header=None so that default column names are used:
# after testing, replace StringIO(temp) with the real file name, e.g. 'filename.csv'
from io import StringIO
df = pd.read_csv(StringIO(temp), sep="!", skiprows=3, skipinitialspace=True, header=None)
df['TYPE'] = df[0].str.extract('\[(.*)\]', expand=False).ffill()
df['G'] = df[0].str.startswith('*').cumsum()
#dont remove rows start with *
df = df[~df[0].str.contains('^\[|^--')]
print (df)
0 1 2 3 4 5 \
0 *BOHRKOPF SPINDEL WK DELTA-X DELTA-Y DURCHMESSER
3 A21 1 62 0.000 0.000 0.000
5 A12 -1 62 0.000 -160.000 0.000
6 A12 2 62 0.000 -128.000 3.000
7 A12 -3 62 0.000 -96.000 0.000
8 A12 4 62 0.000 -64.000 0.000
10 *BOHRKOPF NaN X-POS Y-POS NaN NaN
13 O11 NaN 0.000 -96.000 NaN NaN
14 O11 NaN 0.000 -128.000 NaN NaN
6 TYPE G
0 KOMMENTAR NaN 1
3 NaN NoValidForUse 1
5 NaN V11 1
6 70.0 V11 1
7 NaN V11 1
8 NaN V11 1
10 NaN V11 2
13 NaN V11 2
14 NaN V11 2
For each group remove the column G, rename all columns except the last 2 using the first row, drop that first row with iloc, and last, if necessary, remove all columns filled only with NaNs by dropna:
dfs = [v.drop('G', axis=1).rename(columns=v.iloc[0, :-2]).iloc[1:].dropna(axis=1, how='all') for k, v in df.groupby('G')]
print (dfs[0])
*BOHRKOPF SPINDEL WK DELTA-X DELTA-Y DURCHMESSER KOMMENTAR \
3 A21 1 62 0.000 0.000 0.000 NaN
5 A12 -1 62 0.000 -160.000 0.000 NaN
6 A12 2 62 0.000 -128.000 3.000 70.0
7 A12 -3 62 0.000 -96.000 0.000 NaN
8 A12 4 62 0.000 -64.000 0.000 NaN
TYPE
3 NoValidForUse
5 V11
6 V11
7 V11
8 V11
print (dfs[1])
*BOHRKOPF X-POS Y-POS TYPE
13 O11 0.000 -96.000 V11
14 O11 0.000 -128.000 V11
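Because the header=None variant reads the header text row as data, all columns in these frames stay as strings. A minimal follow-up sketch (column names are taken from the outputs above) converts the numeric columns before analysis:

# convert the numeric columns; non-numeric cells (if any) become NaN
numeric_cols = ['SPINDEL', 'WK', 'DELTA-X', 'DELTA-Y', 'DURCHMESSER', 'KOMMENTAR', 'X-POS', 'Y-POS']
for d in dfs:
    for col in numeric_cols:
        if col in d.columns:
            d[col] = pd.to_numeric(d[col], errors='coerce')
print(dfs[0].dtypes)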